Gertjan Verhoeven
March 2021
In this notebook, we will take our first steps into the exciting world of multi-agent reinforcement learning. The simplest approach to multi-agent learning is to have the agents learn independently from each other, without having knowledge of each other. From the perspective of the learning agent, the other agents are simply part of the environment. (Literature: Littman 1994, Busoniu 2010).
The notebook uses PettingZoo, a Python library for conducting research in multi-agent reinforcement learning. It's akin to a multi-agent version of OpenAI's Gym library.
PettingZoo has a large collection of environments (games) available, including Tic-Tac-Toe. We already experimented with Tic-Tac-Toe as part of the introduction chapter of Sutton and Barto. Tic-Tac-Toe therefore seems a natural starting point for multi-agent learning.
Before we can start with PettingZoo, we first need to learn about the concept behind the package, which is to model each game as an Agent Environment Cycle (AEC) game.
From the paper by Justin Terry et al:
The base component of an AEC game is a changeable list of agents. After the first agent in the list acts, the environment can “act” (allowing agents’ observations to be updated), or the next designated agent can act (skipping environment turns are how truly simultaneous games are depicted). This process continues indefinitely.
As for reward, after every agent takes a turn a “partial” reward is emitted to every other agent. The reward associated with a single action performed by an agent is the total of all the partial rewards following that action and before the agent’s next turn (until this point, the reward is not fully defined).
Different aspects of a game will be responsible for different portions of reward. As shown in [this paper], thinking about rewards in this atomized manner instead of lumping the reward process all together can be very helpful.
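The reward bookkeeping described in the quote can be made concrete with a toy sketch. The function and its inputs are hypothetical and not part of PettingZoo; they only illustrate how partial rewards emitted after each turn add up until the receiving agent acts again.

```python
from collections import defaultdict


def run_cycle(turns, partial_rewards):
    """Accumulate partial rewards per agent, AEC style (illustrative only).

    turns: agent names in the order in which they act.
    partial_rewards: one dict per turn, mapping agent -> partial reward
        emitted after that turn.
    Returns a list of (agent, reward collected at the start of its turn).
    """
    pending = defaultdict(float)
    collected = []
    for actor, emitted in zip(turns, partial_rewards):
        # The acting agent first collects everything emitted since its last turn.
        collected.append((actor, pending[actor]))
        pending[actor] = 0.0
        # Its action then emits partial rewards to the other agents.
        for agent, reward in emitted.items():
            pending[agent] += reward
    return collected


# Three hypothetical agents; "a" receives partial rewards from "b" and "c".
turns = ["a", "b", "c", "a"]
partial_rewards = [
    {"b": 0, "c": 0},  # emitted after a's turn
    {"a": 1, "c": 0},  # emitted after b's turn
    {"a": 2, "b": 0},  # emitted after c's turn
    {},
]
collected = run_cycle(turns, partial_rewards)
print(collected[-1])
```

When "a" acts for the second time, the reward it collects is the sum of the partial rewards emitted by "b" and "c" in between.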
Environments can be interacted with using a similar interface to Gym:
{python}
env.reset()
for agent in env.agent_iter():
    observation, reward, done, info = env.last()
    action = policy(observation, agent)
    env.step(action)
The commonly used methods are:
agent_iter(max_iter=2**63)
returns an iterator that yields the current agent of the environment. It terminates when all agents in the environment are done or when max_iter steps have been executed.
last(observe=True)
returns observation, reward, done, and info for the agent currently able to act. The returned reward is the cumulative reward that the agent has received since it last acted. If observe is set to False, the observation will not be computed, and None will be returned in its place. Note that a single agent being done does not imply the environment is done.
Code example:
{python}
observation, reward, done, info = env.last()
reset()
resets the environment and sets it up for use when called the first time. Only after calling this function do objects like agents become available.
step(action)
takes and executes the action of the agent in the environment, and automatically switches control to the next agent.
While developing code, I found several lower-level attributes useful.
agent_selection
holds the currently selected agent.
agents
lists all available agents.
The complete API including lower level functionality is at https://www.pettingzoo.ml/api
It is best to create a clean Python 3 virtual environment to run this notebook in.
# create venv
python3 -m venv marl-env
# activate venv
source marl-env/bin/activate
# upgrade really old pip version on my system
pip install --upgrade pip
# install packages
pip install pettingzoo[classic]
pip install spyder-notebook
pip install dill
This installs Spyder, the IDE I currently use for Python development, with the Notebook plugin to work both with Python scripts and Jupyter notebooks.
The TicTacToe environment AEC diagram is depicted below:
We start with loading the required libraries:
import random
import numpy as np
from collections import defaultdict
import dill
from pettingzoo.classic import tictactoe_v3
We create an instance of a TicTacToe environment, call reset() to initialize the game and list the available agents (players):
env = tictactoe_v3.env()
env.reset()
env.agents
It helps to understand exactly how the PettingZoo "mechanics" work.
Below we use the agent_selection attribute to show exactly when the active agent switches between the players:
env.reset()
env.agent_selection
# player 1 takes square 0
env.step(0)
# the environment "acts" by updating the observation and
# switches to the next player
env.agent_selection
# now player 2 can act
env.step(1)
env.agent_selection
So, directly after an agent takes an action using env.step(), the game moves on to the environment, which "acts" by updating the board position, and after that the other player can act.
The TicTacToe PettingZoo environment uses so-called "action masks" to filter out actions that are invalid or not available given the current state of the environment. The action mask is part of the observation output from last().
observation, reward, done, info = env.last()
observation['action_mask']
This mask tells us that for the current agent, actions 0 and 1 are not available.
Our policy() function needs this information for action selection.
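As a minimal sketch of how a policy can respect the action mask, the helpers below pick either a random or a greedy action among the legal ones. The function names and the example Q-values are hypothetical; the mask is assumed to be a length-9 sequence of 0/1 flags like the one shown above.

```python
import random


def legal_actions(action_mask):
    """Indices of the actions whose mask entry is non-zero."""
    return [a for a, ok in enumerate(action_mask) if ok]


def random_legal_action(action_mask):
    """Uniformly random legal action, or None if nothing is legal."""
    legal = legal_actions(action_mask)
    return random.choice(legal) if legal else None


def greedy_legal_action(q_values, action_mask):
    """Legal action with the highest value; illegal actions are ignored."""
    legal = legal_actions(action_mask)
    return max(legal, key=lambda a: q_values[a]) if legal else None


# Squares 0 and 1 are taken, as in the mask discussed above.
mask = [0, 0, 1, 1, 1, 1, 1, 1, 1]
# Hypothetical action values: the two best entries belong to illegal actions.
q_values = [9.0, 8.0, 0.1, 0.5, 0.2, 0.0, 0.0, 0.0, 0.0]

random_action = random_legal_action(mask)
greedy_action = greedy_legal_action(q_values, mask)
print(random_action, greedy_action)
```

Filtering before taking the maximum matters: a plain argmax over all nine values would select square 0, which is already occupied.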
If we choose an illegal action the environment throws an error message and terminates the current game:
env.reset()
# player 1
env.step(0)
# player 2 attempts same move
env.step(0)
If done is True, we can let the agent play action None. This allows the agents to keep on stepping until all rewards are received by all agents.
For example this game where Player 1 plays the winning move:
env.reset()
env.step(0)
env.step(3)
env.step(1)
env.step(4)
env.step(2)
env.render()
Now the active player is Player 2, who receives its (negative) reward for losing the game. Note that it cannot play any legal moves anymore because the game has ended, but we need to call step() with action None to move back to Player 1:
observation, reward, done, info = env.last()
print(done)
print(reward)
env.step(None)
Player 2 is removed from the list of available agents:
env.agents
Now player 1 can collect its reward for winning the game!
observation, reward, done, info = env.last()
print(done)
print(reward)
env.step(None)
# no active agents anymore, need to call env.reset() to start a new game
env.agents
When two players who play completely randomly play Tic-Tac-Toe, the first player wins 58.49% of the time, the second player wins 28.81% of the time, and the game is a draw 12.70% of the time.
Code up a function that has both players play a random policy for 10,000 games. Store the outcomes of the games (W/D/L) for both players. Check your work by comparing with the percentages above.
# use this as starting point
def policy(observation, agent):
    action = random.choice(np.flatnonzero(observation['action_mask']))
    return action

env.reset()
for agent in env.agent_iter():
    observation, reward, done, info = env.last()
    action = policy(observation, agent) if not done else None
    env.step(action)
env.render() # this visualizes a single game
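The percentages quoted for random play can also be cross-checked without simulation. The sketch below is independent of PettingZoo and is not a solution to the exercise: it walks the full game tree and weights every legal move equally, using exact fractions. The board encoding (0 empty, 1 and 2 for the players) is an assumption made for this example.

```python
from fractions import Fraction
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]
ZERO, ONE = Fraction(0), Fraction(1)


def winner(board):
    """Return 1 or 2 if that player has three in a row, else 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


@lru_cache(maxsize=None)
def outcome_probs(board, player):
    """(P(player 1 wins), P(player 2 wins), P(draw)) under uniform random play."""
    w = winner(board)
    if w == 1:
        return (ONE, ZERO, ZERO)
    if w == 2:
        return (ZERO, ONE, ZERO)
    empty = [i for i, v in enumerate(board) if v == 0]
    if not empty:
        return (ZERO, ZERO, ONE)
    totals = [ZERO, ZERO, ZERO]
    for i in empty:
        child = board[:i] + (player,) + board[i + 1:]
        probs = outcome_probs(child, 3 - player)
        for k in range(3):
            # Each legal move is chosen with probability 1 / len(empty).
            totals[k] += probs[k] / len(empty)
    return tuple(totals)


p1, p2, draw = outcome_probs((0,) * 9, 1)
print(f"P1 {float(p1):.2%}  P2 {float(p2):.2%}  draw {float(draw):.2%}")
```

A simulation over 10,000 games should land close to these values, but not exactly on them, because of sampling noise.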
In the TicTacToe environment, observations of agents consist of a complete description of the board position. An observation of the board is a 3D array and looks like this:
env.reset()
observation, reward, done, info = env.last()
observation['observation']
Compare this to the properly rendered board position:
env.render()
For Q-learning, we need to store Q-values for each unique board position. A convenient way to create unique identifiers for all board positions is to use a hash function.
We encountered this concept at the beginning of the course:
Hash functions are used to transform a large amount of data (such as a complete board position aka game state) into a single number.
A good hash function maps different board positions to different numbers in practice: collisions (two positions with the same hash) are possible in principle, but so unlikely that we can ignore them here. This allows us to use the hash as a label for each board position, and as a key under which to store information about that position.
Example code (first convert observation to string, then hash):
state = hash(str(observation['observation']))
state
Update:
I discovered that in Python 3, the built-in hash() function is, by design, not reproducible between Python sessions for strings (string hashing is randomized per process)! This makes it unsuitable for our purpose, since we want to learn an optimal policy for each state, and save that policy (the Q-table) to disk for later use.
This later use will consist of things like testing the policy's performance, or as an AI player to play against ourselves.
To have reproducible hashing we can use hashlib, a Python library containing various hashing algorithms. I chose the MD5 algorithm:
import hashlib
def encode_state(observation):
    # encode observation as bytes
    obs_bytes = str(observation).encode('utf-8')
    # create md5 hash
    m = hashlib.md5(obs_bytes)
    # return hash as hex digest
    state = m.hexdigest()
    return state
encode_state(observation['observation'])
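Because the MD5 digest is the same in every Python session, a Q-table keyed by it can be written to disk and reloaded later. The following is a minimal sketch under a few assumptions: the Q-table layout (a list of nine action values per state), the file name, and the use of the standard pickle module are illustrative choices. The notebook itself imports dill, which can serialize the lambda default directly; pickle cannot, so the table is converted to a plain dict first.

```python
import hashlib
import os
import pickle
import tempfile
from collections import defaultdict


def encode_state(observation):
    """Stable hex digest of the string form of an observation."""
    return hashlib.md5(str(observation).encode("utf-8")).hexdigest()


def new_action_values():
    # Hypothetical layout: one value per board square.
    return [0.0] * 9


Q = defaultdict(new_action_values)
state = encode_state([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
Q[state][4] = 0.5

path = os.path.join(tempfile.mkdtemp(), "q_table.pkl")
with open(path, "wb") as f:
    # Store the contents as a plain dict so no default factory is pickled.
    pickle.dump(dict(Q), f)

with open(path, "rb") as f:
    # Re-wrap the loaded dict so unseen states get default values again.
    restored = defaultdict(new_action_values, pickle.load(f))

print(restored[state])
```

With the built-in hash(), the keys written in one session would generally not match the keys computed for the same board positions in the next one.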
To make self-play (a single agent that plays against itself) easy to implement, in PettingZoo the observation contains information about which player is making the observation. This information is encoded in the observation by flipping the board position player index order (aka "inverting the channels").
Take for example the observation of the board state after Player 1 made a first move:
env = tictactoe_v3.env()
env.reset()
env.step(4)
env.observe('player_1')['observation']
Now compare this with how Player 2 observes the same board position:
env.observe('player_2')['observation']
In practice, this player-dependent encoding only matters when both players observe the same board position, which only occurs at the end of a game, when the players collect their rewards.
env = tictactoe_v3.env()
env.reset()
env.step(0)
env.step(6)
env.step(1)
env.step(5)
env.step(2)
env.observe("player_1")['observation']
env.step(None)
env.observe("player_2")['observation']
To avoid double counting of end-game board positions, I used this trick:
state = encode_state(env.render(mode = 'ansi'))
state
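Why hashing the rendered board avoids double counting can be illustrated with a small stand-alone sketch. It does not call PettingZoo: the two-plane layout (own marks first, opponent's marks second) and the text rendering are assumptions that mimic the behaviour described above, and all function names are hypothetical.

```python
import hashlib


def encode_state(observation):
    """Stable hex digest of the string form of an observation."""
    return hashlib.md5(str(observation).encode("utf-8")).hexdigest()


def perspective_view(board, player):
    """Two planes from one player's point of view: own marks, then opponent's.

    board: list of 9 ints (0 empty, 1 for player_1, 2 for player_2).
    """
    opponent = 3 - player
    own = [int(v == player) for v in board]
    opp = [int(v == opponent) for v in board]
    return [own, opp]


def canonical_view(board):
    """Player-independent text rendering of the board."""
    symbols = {0: "-", 1: "X", 2: "O"}
    return "".join(symbols[v] for v in board)


# Final position of the game above: player 1 took 0, 1, 2; player 2 took 6, 5.
board = [1, 1, 1, 0, 0, 2, 2, 0, 0]

key_p1 = encode_state(perspective_view(board, 1))
key_p2 = encode_state(perspective_view(board, 2))
key_shared = encode_state(canonical_view(board))

print(key_p1 == key_p2)
```

The two perspective views of the same position hash to different keys, so a table keyed on them would count the position twice; the rendered board gives one key regardless of who is observing.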
For this exercise, adapt your code from Exercise 1 to add a defaultdict dictionary that contains the value 0 for each board position (identified using encode_state()) the agents encounter.
Run your code for 20,000 games with the agents playing a random policy to find out how many distinct states Tic-Tac-Toe contains. The dictionary should max out at 5478 different states.
You can use the code provided below.
from collections import defaultdict
env.reset()
nA = 9  # number of actions: one per board square
Q = defaultdict(lambda: np.zeros(nA))
# reminder about how default dict works
Q['32433'] = 0
Q['-5323'] = 0
Q['2397887'] = 0
Q
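The target of 5478 states can be cross-checked by exhaustive enumeration rather than by sampling. The sketch below is independent of PettingZoo and of the exercise code: it explores every position reachable by legal play, stopping at wins and full boards. The tuple board encoding (0 empty, 1 and 2 for the players) is an assumption made for this example.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def has_winner(board):
    """True if either player has three in a row."""
    return any(board[a] != 0 and board[a] == board[b] == board[c]
               for a, b, c in LINES)


def reachable_states():
    """All board positions reachable from the empty board by legal play."""
    start = (0,) * 9
    seen = {start}
    stack = [(start, 1)]  # (board, player to move)
    while stack:
        board, player = stack.pop()
        if has_winner(board) or 0 not in board:
            continue  # terminal position: no further moves
        for i in range(9):
            if board[i] == 0:
                child = board[:i] + (player,) + board[i + 1:]
                if child not in seen:
                    seen.add(child)
                    stack.append((child, 3 - player))
    return seen


states = reachable_states()
print(len(states))
```

The count includes the empty board. With random play, rarely visited positions take a while to show up, which is why the dictionary only approaches this maximum after many games.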
The code in this notebook is copyrighted by Gertjan Verhoeven and licensed under the new BSD (3-clause) license:
https://opensource.org/licenses/BSD-3-Clause
The text and figures in this notebook (if any) are copyrighted by Gertjan Verhoeven and licensed under the CC BY-NC 4.0 license: