In this notebook we rely on IBM Qiskit [1], OpenAI Gym [2] and the stable-baselines library [3] to set up a quantum game and have reinforcement learning agents play and learn it.
We set up a very simple game, qcircuit-v0, and we compare the performance of different agents playing it.
First of all, let us set up the packages necessary for this simulation, as explained in Setup.ipynb.
Next, let us import some basic libraries.
import numpy as np
import gym
from IPython.display import display
The game we will run is provided by gym-qcircuit [4], and it is implemented in compliance with the standard OpenAI Gym interface.
The game is a simple quantum circuit building game: given a fixed number of qubits and a desired final state for these qubits, the objective is to design a quantum circuit that takes the given qubits to the desired final state.
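To make the task concrete, here is a small Qiskit sketch of the underlying idea; the target state $\left|+\right\rangle$ used below is just an illustrative choice, not necessarily the target used by the game.
# Illustration of the circuit-building task with Qiskit (hypothetical target state)
import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

target = Statevector.from_label('+')            # desired final state (|0> + |1>)/sqrt(2)
qc = QuantumCircuit(1)
qc.h(0)                                         # candidate circuit: a single Hadamard gate
reached = Statevector.from_instruction(qc)
print(np.allclose(reached.data, target.data))   # True: this circuit reaches the target state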
import qcircuit
The module qcircuit offers two versions of the game, qcircuit-v0 and qcircuit-v1.
Details on the implementation of these games are available at https://github.com/FMZennaro/gym-qcircuit/blob/master/qcircuit/envs/qcircuit_env.py.
We start by loading the first scenario and running agents on it.
env = gym.make('qcircuit-v0')
The game qcircuit-v0 is fully observable, and both its state space and its action space are small.
Remember that a single qubit is described by $\alpha\left|0\right\rangle +\beta\left|1\right\rangle$, where $\alpha, \beta$ are complex numbers and $\left|0\right\rangle, \left|1\right\rangle$ are the computational basis states. The state space is thus described by four real numbers between -1 and 1, representing the real and imaginary parts of $\alpha$ and $\beta$.
An agent plays the game by interacting with a quantum circuit, adding and removing standard gates. In this version of the game only three actions are available: add an X gate, add a Hadamard gate, or remove the last inserted gate.
Again, details on the implementation of the state space and the action space are available at https://github.com/FMZennaro/gym-qcircuit/blob/master/qcircuit/envs/qcircuit_env.py.
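As a quick sanity check, we can also inspect these spaces directly on the loaded environment; the expected values in the comments below follow from the description above, so refer to the linked implementation for the authoritative definitions.
# Inspect the observation and action spaces of qcircuit-v0
print(env.observation_space)   # expected: a Box of 4 reals in [-1, 1]
print(env.action_space)        # expected: Discrete(3) -- add X, add H, remove last gate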
First, we simply run a random agent. This allows us to test out the game and see its evolution.
A random agent selects an action from the action space at random and executes it. Given the small number of available actions (including the possibility of undoing moves by removing a gate) and the simple objective, the random agent should be able to land on the right circuit within a limited number of steps.
env.reset()
display(env.render())

done = False
while not done:
    # Sample a random action and apply it to the circuit
    obs, _, done, info = env.step(env.action_space.sample())
    display(info['circuit_img'])
env.close()
We now run a PPO2 agent, a more sophisticated agent taken from the stable-baselines library.
First we import the agent.
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2
Then we train it.
# Wrap the environment in a (dummy) vectorized environment, as required by stable-baselines
env = DummyVecEnv([lambda: env])
modelPPO2 = PPO2(MlpPolicy, env, verbose=1)
modelPPO2.learn(total_timesteps=10000)
[PPO2 training log over 78 updates (9,984 timesteps): approxkl and clipfrac stay near zero, policy_entropy decreases from about 1.10 to 0.008, and value_loss drops from about 4,000 to 2,890 as the policy converges.]
<stable_baselines.ppo2.ppo2.PPO2 at 0x7fda7c594550>
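Optionally, the trained agent can be saved to disk and reloaded later through the standard stable-baselines save/load API; the file name below is just an example.
# Persist the trained PPO2 agent and reload it (optional)
modelPPO2.save('ppo2_qcircuit_v0')
modelPPO2 = PPO2.load('ppo2_qcircuit_v0', env=env)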
Last, we test it by letting it play the game.
obs = env.reset()
display(env.render())

for _ in range(1):
    # Let the trained agent choose an action and apply it to the circuit
    action, _states = modelPPO2.predict(obs)
    obs, _, done, info = env.step(action)
    display(info[0]['circuit_img'])
env.close()
As expected, the agent easily learned the optimal circuit.
For comparison, we now run an A2C agent, another agent from the stable-baselines library.
First we import the agent.
from stable_baselines import A2C
We train it.
modelA2C = A2C(MlpPolicy, env, verbose=1)
modelA2C.learn(total_timesteps=10000)
[A2C training log over 2,000 updates (10,000 timesteps): policy_entropy decreases from 1.1 to about 0.056 and value_loss drops from about 9.9e+03 to 870 as the policy converges.]
<stable_baselines.a2c.a2c.A2C at 0x7fd9682dd450>
And we test it by letting it play the game.
obs = env.reset()
display(env.render())

for _ in range(1):
    # Let the trained A2C agent choose an action and apply it to the circuit
    action, _states = modelA2C.predict(obs)
    obs, _, done, info = env.step(action)
    display(info[0]['circuit_img'])
env.close()
Finally, we compare the agents quantitatively by contrasting their average reward over 1000 episodes of the game. We rely on the evaluation module, which provides simple, standard routines to evaluate the agents.
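As a rough reference (not the actual implementation), a routine like evaluate_model could be sketched as follows, assuming a vectorized environment that auto-resets at the end of each episode and returning the mean per-episode reward.
def evaluate_model_sketch(model, env, num_steps=1000):
    # Run the model for num_steps environment steps and average the reward per episode
    obs = env.reset()
    episode_rewards, current_reward = [], 0.0
    for _ in range(num_steps):
        action, _ = model.predict(obs)
        obs, reward, done, _ = env.step(action)
        current_reward += reward[0]      # DummyVecEnv returns arrays of length 1
        if done[0]:                      # the vectorized environment auto-resets
            episode_rewards.append(current_reward)
            current_reward = 0.0
    return np.mean(episode_rewards), episode_rewards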
import evaluation
n_episodes = 1000
PPO2_perf, _ = evaluation.evaluate_model(modelPPO2, env, num_steps=n_episodes)
A2C_perf, _ = evaluation.evaluate_model(modelA2C, env, num_steps=n_episodes)
# Re-create a plain (non-vectorized) environment for the random baseline
env = gym.make('qcircuit-v0')
rand_perf, _ = evaluation.evaluate_random(env, num_steps=n_episodes)
print('Mean performance of random agent (out of {0} episodes): {1}'.format(n_episodes,rand_perf))
print('Mean performance of PPO2 agent (out of {0} episodes): {1}'.format(n_episodes,PPO2_perf))
print('Mean performance of A2C agent (out of {0} episodes): {1}'.format(n_episodes,A2C_perf))
Mean performance of random agent (out of 1000 episodes): 97.674
Mean performance of PPO2 agent (out of 1000 episodes): 99.9
Mean performance of A2C agent (out of 1000 episodes): 99.893
As expected, the reinforcement learning agents (PPO2, A2C) learned to play the game optimally. The random agent is still able to play and reach a solution, given the small state and action spaces available; its average reward, however, is clearly lower: on average, the random agent needs about two and a half more actions (or guesses) per episode than PPO2/A2C to reach the solution.
[1] IBM qiskit, https://qiskit.org/
[2] OpenAI gym, http://gym.openai.com/docs/
[3] stable-baselines, https://github.com/hill-a/stable-baselines
[4] gym-qcircuit, https://github.com/FMZennaro/gym-qcircuit