#!/usr/bin/env python
# coding: utf-8

# # Quantum Circuit Builder v1
#
# In this notebook we rely on IBM *qiskit* [1], OpenAI *gym* [2] and the library *stable-baselines* [3] to set up a quantum game and have artificial reinforcement learning agents play and learn it.
#
# In a previous notebook we ran a very simple game, *qcircuit-v0*. We now try out a more challenging version, *qcircuit-v1*, and we compare the performance of different agents playing it.

# ## Setup
#
# As before, it is necessary to set up the packages required for this simulation as explained in [Setup.ipynb](Setup.ipynb).
#
# Next, we import some basic libraries.

# In[1]:

import numpy as np
import gym
from IPython.display import display

# ## Importing the game
#
# The game we will run is provided in **gym-qcircuit** [4], and it is implemented in compliance with the standard OpenAI gym interface.
#
# The game is a simple *quantum circuit building* game: given a fixed number of qubits and a desired final state for these qubits, the objective is to design a quantum circuit that takes the given qubits to the desired final state.

# In[2]:

import qcircuit

# The module **qcircuit** offers two versions of the game:
# - *qcircuit-v0*: it presents the player with a single qubit, and it requires the player to design a simple circuit putting this qubit in an equal superposition.
# - *qcircuit-v1*: a slightly more challenging scenario where the player is presented with two qubits and is asked to design a circuit setting them in the state $\frac{1}{\sqrt{2}}\left|00\right\rangle +\frac{1}{\sqrt{2}}\left|11\right\rangle$.
#
# Details on the implementation of these games are available at https://github.com/FMZennaro/gym-qcircuit/blob/master/qcircuit/envs/qcircuit_env.py.

# ## qcircuit-v1

# We start by loading this more challenging scenario and running agents on it.

# In[3]:

env = gym.make('qcircuit-v1')

# The game *qcircuit-v1* is *completely observed*, and both its *state space* and *action space* are described below.
#
# Remember that two qubits are described by $\alpha\left|00\right\rangle +\beta\left|01\right\rangle +\gamma\left|10\right\rangle +\delta\left|11\right\rangle$, where $\alpha, \beta, \gamma, \delta$ are complex numbers and $\left|00\right\rangle, \left|01\right\rangle, \left|10\right\rangle, \left|11\right\rangle$ are the computational basis states. The state space is then described by eight real numbers between -1 and 1 representing the real and imaginary parts of $\alpha, \beta, \gamma, \delta$.
#
# An agent plays the game by interacting with a quantum circuit, adding and removing standard gates. In this version of the game there are seven actions available: add an *X gate* on the first or on the second qubit, add a *Hadamard gate* on the first or on the second qubit, add a *CNOT gate* controlled by the first or by the second qubit, or remove the last inserted gate.
#
# Again, details on the implementation of the state space and the action space are available at https://github.com/FMZennaro/gym-qcircuit/blob/master/qcircuit/envs/qcircuit_env.py.

# ### Random agent

# First, we simply run a random agent. This allows us to test out the game and see its evolution.
#
# A random agent selects a possible action from the action space at random and executes it. Given the number of actions, and the relatively low probability of removing a gate ($\frac{1}{7}$) compared to adding a new one ($\frac{6}{7}$), it is likely that the random agent will run for a very long time, building a long and complex circuit before stumbling upon the correct solution.
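# Before launching the random agent, it can be useful to sanity-check the environment against the description above. The cell below is a minimal sketch that assumes only the standard gym attributes (`observation_space`, `action_space`, `sample()`); the exact space definitions live in the qcircuit_env.py file linked above.

# In[ ]:

# Inspect the spaces exposed by the environment (standard gym attributes).
print(env.observation_space)      # expected: a Box of 8 reals in [-1, 1] (real/imaginary parts of alpha, beta, gamma, delta)
print(env.action_space)           # expected: Discrete(7), one entry per action listed above
print(env.action_space.sample())  # a random action index, exactly what the random agent uses below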
# In[4]:

env.reset()
display(env.render())

done = False
while not done:
    obs, _, done, info = env.step(env.action_space.sample())
    display(info['circuit_img'])

env.close()

# ### PPO2 Agent
#
# We now run a *PPO2* agent, a more sophisticated agent picked from the *stable-baselines* library.
#
# First we import the agent.

# In[5]:

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2

# Then we train it.

# In[6]:

env = DummyVecEnv([lambda: env])
modelPPO2 = PPO2(MlpPolicy, env, verbose=1)
modelPPO2.learn(total_timesteps=10000)

# Last, we test it by letting it play the game; we run ten steps of the game (notice, though, that the agent could reach the solution before the tenth step, which would cause the game to restart).

# In[7]:

obs = env.reset()
display(env.render())

for _ in range(10):
    action, _states = modelPPO2.predict(obs)
    obs, _, done, info = env.step(action)
    display(info[0]['circuit_img'])

env.close()

# As expected, the agent easily learned the optimal circuit.

# ### A2C Agent
#
# For comparison, we now run an *A2C* agent, another agent from the *stable-baselines* library.
#
# First we import the agent.

# In[8]:

from stable_baselines import A2C

# We train it.

# In[9]:

modelA2C = A2C(MlpPolicy, env, verbose=1)
modelA2C.learn(total_timesteps=10000)

# And we test it by letting it play ten steps of the game (as before, the agent may reach a solution before the tenth step).

# In[10]:

obs = env.reset()
display(env.render())

for _ in range(10):
    action, _states = modelA2C.predict(obs)
    obs, _, done, info = env.step(action)
    display(info[0]['circuit_img'])

env.close()

# ## Comparison of the agents
#
# Finally, we compare the agents quantitatively by contrasting their average reward computed over 1000 episodes of the game. We rely on the *evaluation* module, which provides simple and standard routines to evaluate the agents.

# In[11]:

import evaluation

n_episodes = 1000

PPO2_perf, _ = evaluation.evaluate_model(modelPPO2, env, num_steps=n_episodes)
A2C_perf, _ = evaluation.evaluate_model(modelA2C, env, num_steps=n_episodes)

env = gym.make('qcircuit-v1')
rand_perf, _ = evaluation.evaluate_random(env, num_steps=n_episodes)

# In[12]:

print('Mean performance of random agent (out of {0} episodes): {1}'.format(n_episodes, rand_perf))
print('Mean performance of PPO2 agent (out of {0} episodes): {1}'.format(n_episodes, PPO2_perf))
print('Mean performance of A2C agent (out of {0} episodes): {1}'.format(n_episodes, A2C_perf))

# The reinforcement learning agents (PPO2, A2C) learned to play the game to different degrees. In contrast, the random agent performed very poorly, showing that even with this limited state and action space a random policy rarely finds the right solution.

# ## References
#
# [1] IBM qiskit, https://qiskit.org/
#
# [2] OpenAI gym, http://gym.openai.com/docs/
#
# [3] stable-baselines, https://github.com/hill-a/stable-baselines
#
# [4] gym-qcircuit, https://github.com/FMZennaro/gym-qcircuit