Gertjan Verhoeven
March 2021
Having learned the basics of PettingZoo, and how to use hashing with defaultdict
to build up a Q-table, we are ready to implement Q-learning on the tic-tac-toe environment using PettingZoo. Before we get into the coding part, first some general remarks about Q-learning in a two-player alternating turn game.
Here we will train a multi-agent game using single agent Q-learning. This means that from the perspective of each player, the other player is part of the environment. It follows that the learned strategy of the first player is tuned to the strategy used by the second player. The second player's strategy can be thought of forming a part of the environment the first player learns to perform optimally in.
Therefore, if we use Q-learning against a player that uses a random strategy, The Q-learner optimizes for play in an evironment that contains a player that uses a random strategy.
Here, we let both players learn using Q-learning with constant learning parameter $\alpha = 0.6$, and have them use the same decreasing $\epsilon$ exploration rate. So both players start off playing randomly ($\epsilon = 1$), and after 200.000 steps end up exploiting the learned policy for 97.5% of the time.
We expect that this produces Q-tables for both players that perform well against random players and human players.
Many implementations of Q-learning for Tic-Tac-Toe play out a full game and memorize the sequence of states for both players. After a single game has ended, the "Q-learning" part is done in one go, starting from the end state en propagating back to the beginning.
This is possible for Tic-Tac-Toe because no state is ever visited twice during a single game. However, this is not true for all games, so I think it is better to implement Q-learning in a more general way, where we learn each time the player "steps" through the game.
In single agent RL using Gym, we start in a state, make a move, end up in a new state, and collect our reward. This allowed us to use variables like current_state
and next_state
within a single RL iteration for the Q-learning.
In multi-agent (two player alternating turn, to be more precise), an agent observes a state, chooses an action, THEN hands over the turn, where the other agent observes the state, and chooses an action, then control returns to the first agent, which sees the result of its previous action, and finds itself in a new state.
Thus, we need a way to store the previous state and previous action for each player to use Q-learning. We store these values in two simple dictionaries:
prev_state = {'player_1': -1,
'player_2': -1}
prev_action = {'player_1': -1,
'player_2': -1}
I went for two independent Q-tables, and player 1 always starts first (This is how the environment works). So we will end up with two Q-tables, one for the player going first, and one for the player going second.
from pettingzoo.classic import tictactoe_v3
import random
import numpy as np
from collections import defaultdict
import dill
It is advisable to have the actual Q-learning and helper functions in a separate marl_qlearning.py
python script, which we import and run code from in Jupyter notebook, i.e.:
from marl_qlearning import *
We build on the code from the previous notebook.
We already have encode_state()
to convert the observation into a unique state label.
import hashlib
def encode_state(observation):
# encode observation as bytes
obs_bytes = str(observation).encode('utf-8')
# create md5 hash
m = hashlib.md5(obs_bytes)
# return hash as hex digest
state = m.hexdigest()
return(state)
We need a function that contains the epsilon-greedy policy functionality.
Code a function epsilon_greedy_policy()
that takes as arguments:
nA
, number of actionsQ
, the list of Q-table dictionaries for both players, indexed by agent and by stateagent
, the agent currently actingaction_mask
, containing the available actions in the current statestate
, the hash of the current stateeps
, the value of the exploration parameter epsilon(Hint: I used Q[agent][state][action_mask == 1]
to select the Q-values of all available actions for an agent in a state)
def epsilon_greedy_policy(nA, Q, agent, action_mask, \
state, eps):
return action
Code a function update_Q_value() that takes as arguments:
Q
, the list of Q-table dictionaries for both players, indexed by agent and by stateagent
, the agent currently actingprevious_state
, the hash of the previous stateprevious_action
, the previous actionreward
, the reward received since the previous actionalpha
, the learning parametergamma
, the discounting parametercurrent_state
, the hash value of the current statedef update_Q_value(Q, agent, previous_state, previous_action,\
reward, alpha, gamma, current_state = None):
return Q
Finally, the main program where all the parts come together. For this function, we built on the Pettingzoo code from the previous notebook.
Code a function marl_q_learning()
that takes as arguments
def marl_q_learning(multi_env, num_episodes, alpha, gamma=1.0, \
eps_start=1.0, eps_decay=.99999, \
eps_min=0.025):
multi_env.reset()
Q = {}
for agent in multi_env.agents:
nA = multi_env.action_spaces[agent].n
Q[agent] = defaultdict(lambda: np.zeros(nA))
epsilon = eps_start
i_episode = 1
prev_state = {'player_1': -1,
'player_2': -1}
prev_action = {'player_1': -1,
'player_2': -1}
# keeps iterating over active agents until num episode break
while i_episode <= num_episodes:
for agent in multi_env.agent_iter():
# get observation (state) for current agent:
observation, reward, done, info = multi_env.last()
# perform q-learning with update_Q_value()
# your code here
# store current state
prev_state[agent] = state
if not done:
# choose action using epsilon_greedy_policy()
# your code here
multi_env.step(action)
# store chosen action
prev_action[agent] = action
else:
# agent is done
multi_env.step(None)
# reset env and memory
multi_env.reset()
prev_state = {'player_1': -1,
'player_2': -1}
prev_action = {'player_1': -1,
'player_2': -1}
# bump episode
i_episode += 1
# decrease epsilon
epsilon = max(epsilon*eps_decay, eps_min)
return Q
We run the Q-learning algorithm for 500.000 steps and use dill
to save the Q dictionary to disk.
from marl_qlearning import *
env = tictactoe_v3.env()
random.seed(123)
fullrun = 0
if fullrun:
Q, N = marl_q_learning(env, 500_000, alpha = 0.6, gamma = 0.95, \
decay = True, render = False)
with open('cached/Q.pkl', 'wb') as f:
dill.dump(Q, f)
else:
with open('cached/Q.pkl', 'rb') as f:
Q = dill.load(f)
The original plan was to use the learned policy greedily for both players. However, this will result in deterministic moves on both sides, if for each state a single action has the highest Q-value. So this is not very informative about how good our AI players have become.
Lets try out what it learned by playing a few thousand games fully exploiting the Q-table against a random playing opponent. From https://planspace.org/20191103-q_learning_tic_tac_toe_briefly/, we expect at least 93% wins. We do not expect 100% wins, because the random player can "by accident" play the right moves to obtain a draw, which is the best Player 2 can do. We do not expect Player 1 to lose.
To do so I took the marl_q_learning()
function, removed the Q-learning part and the epsilon decay part, and added a conditional step that sets epsilon to 0 if Player 1 chooses its action, and sets epsilon to 1 if Player 2 chooses its action. I named this function test_policy()
. I managed to get an outcome with 93% wins for Player 1, and 7% draws. Curious if this is actually the best one can do for this game...
The code in this notebook is copyrighted by Gertjan Verhoeven and licensed under the new BSD (3-clause) license:
https://opensource.org/licenses/BSD-3-Clause
The text and figures in this notebook (if any) are copyrighted by Gertjan Verhoeven and licensed under the CC BY-NC 4.0 license: