Multi-agent Reinforcement Learning using PettingZoo: Tic-tac-toe example part II

Gertjan Verhoeven

March 2021

Having learned the basics of PettingZoo, and how to use hashing with defaultdict to build up a Q-table, we are ready to implement Q-learning on the tic-tac-toe environment using PettingZoo. Before we get into the coding part, first some general remarks about Q-learning in a two-player alternating turn game.

Training multi-agent games using single agent techniques

Here we will train a multi-agent game using single agent Q-learning. This means that from the perspective of each player, the other player is part of the environment. It follows that the learned strategy of the first player is tuned to the strategy used by the second player. The second player's strategy can be thought of forming a part of the environment the first player learns to perform optimally in.

Therefore, if we use Q-learning against a player that uses a random strategy, The Q-learner optimizes for play in an evironment that contains a player that uses a random strategy.

Here, we let both players learn using Q-learning with constant learning parameter $\alpha = 0.6$, and have them use the same decreasing $\epsilon$ exploration rate. So both players start off playing randomly ($\epsilon = 1$), and after 200.000 steps end up exploiting the learned policy for 97.5% of the time.

We expect that this produces Q-tables for both players that perform well against random players and human players.

Q-learning with two players: remembering previous states and actions

Many implementations of Q-learning for Tic-Tac-Toe play out a full game and memorize the sequence of states for both players. After a single game has ended, the "Q-learning" part is done in one go, starting from the end state en propagating back to the beginning.

This is possible for Tic-Tac-Toe because no state is ever visited twice during a single game. However, this is not true for all games, so I think it is better to implement Q-learning in a more general way, where we learn each time the player "steps" through the game.

In single agent RL using Gym, we start in a state, make a move, end up in a new state, and collect our reward. This allowed us to use variables like current_state and next_state within a single RL iteration for the Q-learning.

In multi-agent (two player alternating turn, to be more precise), an agent observes a state, chooses an action, THEN hands over the turn, where the other agent observes the state, and chooses an action, then control returns to the first agent, which sees the result of its previous action, and finds itself in a new state.

Thus, we need a way to store the previous state and previous action for each player to use Q-learning. We store these values in two simple dictionaries:

In [7]:
prev_state = {'player_1': -1,
              'player_2': -1}
prev_action = {'player_1': -1,
              'player_2': -1}

I went for two independent Q-tables, and player 1 always starts first (This is how the environment works). So we will end up with two Q-tables, one for the player going first, and one for the player going second.

Coding Q-learning in PettingZoo step by step

In [8]:
from pettingzoo.classic import tictactoe_v3
import random
import numpy as np
from collections import defaultdict
import dill

It is advisable to have the actual Q-learning and helper functions in a separate python script, which we import and run code from in Jupyter notebook, i.e.:

In [ ]:
from marl_qlearning import *

We build on the code from the previous notebook. We already have encode_state() to convert the observation into a unique state label.

In [2]:
import hashlib

def encode_state(observation):
    # encode observation as bytes           
    obs_bytes = str(observation).encode('utf-8')
    # create md5 hash
    m = hashlib.md5(obs_bytes)
    # return hash as hex digest
    state = m.hexdigest()

Exercise 1: Create epsilon_greedy_policy()

We need a function that contains the epsilon-greedy policy functionality.

Code a function epsilon_greedy_policy() that takes as arguments:

  • nA, number of actions
  • Q, the list of Q-table dictionaries for both players, indexed by agent and by state
  • agent, the agent currently acting
  • action_mask, containing the available actions in the current state
  • state , the hash of the current state
  • eps, the value of the exploration parameter epsilon

(Hint: I used Q[agent][state][action_mask == 1] to select the Q-values of all available actions for an agent in a state)

In [3]:
def epsilon_greedy_policy(nA, Q, agent, action_mask, \
                          state, eps):
    return action

Exercise 2: Create update_Q_value()

Code a function update_Q_value() that takes as arguments:

  • Q, the list of Q-table dictionaries for both players, indexed by agent and by state
  • agent, the agent currently acting
  • previous_state, the hash of the previous state
  • previous_action, the previous action
  • reward, the reward received since the previous action
  • alpha, the learning parameter
  • gamma, the discounting parameter
  • current_state, the hash value of the current state
In [3]:
def update_Q_value(Q, agent, previous_state, previous_action,\
                   reward, alpha, gamma, current_state = None): 
    return Q

Exercise 3: create marl_q_learning()

Finally, the main program where all the parts come together. For this function, we built on the Pettingzoo code from the previous notebook.

Code a function marl_q_learning() that takes as arguments

  • multi_env, a pettingzoo multi-agent environment
  • num_episodes, the number of episodes to run
  • alpha, the learning parameter
  • gamma, the discounting parameter with default 1
  • eps_start, the starting value for epsilon
  • eps_decay, the decay factor for epsilon
  • eps_min, the minimal value for epsilon
In [5]:
def marl_q_learning(multi_env, num_episodes, alpha, gamma=1.0, \
                    eps_start=1.0, eps_decay=.99999, \
    Q = {}
    for agent in multi_env.agents:    
        nA = multi_env.action_spaces[agent].n
        Q[agent] = defaultdict(lambda: np.zeros(nA))
    epsilon = eps_start

    i_episode = 1    

    prev_state = {'player_1': -1,
                  'player_2': -1}
    prev_action = {'player_1': -1,
                  'player_2': -1}

    # keeps iterating over active agents until num episode break
    while i_episode <= num_episodes:
        for agent in multi_env.agent_iter():        
            # get observation (state) for current agent:
            observation, reward, done, info = multi_env.last()
            # perform q-learning with update_Q_value()
            # your code here
            # store current state
            prev_state[agent] = state
            if not done: 
                # choose action using epsilon_greedy_policy()
                # your code here               
                # store chosen action
                prev_action[agent] = action 
                # agent is done
        # reset env and memory
        prev_state = {'player_1': -1,
                      'player_2': -1}
        prev_action = {'player_1': -1,
                       'player_2': -1}
        # bump episode
        i_episode += 1
        # decrease epsilon
        epsilon = max(epsilon*eps_decay, eps_min)
    return Q

Learning the optimal policy for Tic-Tac-Toe

We run the Q-learning algorithm for 500.000 steps and use dill to save the Q dictionary to disk.

In [9]:
from marl_qlearning import *

env = tictactoe_v3.env()


fullrun = 0

if fullrun:

    Q, N = marl_q_learning(env, 500_000, alpha = 0.6, gamma = 0.95, \
                           decay = True, render = False)
    with open('cached/Q.pkl', 'wb') as f:
        dill.dump(Q, f)    

    with open('cached/Q.pkl', 'rb') as f:
        Q = dill.load(f)   

Test the learned policy against a Random Player

The original plan was to use the learned policy greedily for both players. However, this will result in deterministic moves on both sides, if for each state a single action has the highest Q-value. So this is not very informative about how good our AI players have become.

Lets try out what it learned by playing a few thousand games fully exploiting the Q-table against a random playing opponent. From, we expect at least 93% wins. We do not expect 100% wins, because the random player can "by accident" play the right moves to obtain a draw, which is the best Player 2 can do. We do not expect Player 1 to lose.

To do so I took the marl_q_learning() function, removed the Q-learning part and the epsilon decay part, and added a conditional step that sets epsilon to 0 if Player 1 chooses its action, and sets epsilon to 1 if Player 2 chooses its action. I named this function test_policy(). I managed to get an outcome with 93% wins for Player 1, and 7% draws. Curious if this is actually the best one can do for this game...


The code in this notebook is copyrighted by Gertjan Verhoeven and licensed under the new BSD (3-clause) license:

The text and figures in this notebook (if any) are copyrighted by Gertjan Verhoeven and licensed under the CC BY-NC 4.0 license:

In [ ]: