Written by Jan Leike and Tom Everitt.

The code is written in Python 3, but should be compatible with Python 2.

The code can be used to calculate the value estimates of the decision theories discussed in our paper Sequential Extensions of Causal and Evidential Decision Theory. Generally, the implemented value functions take as input a distribution defining the environmental probabilities, an action/policy, lists of possible percepts and hidden states, and a utility function. The implementation aims to follow the formulas in the paper as closely as possible. See the code documentation for details.

The results of the calculations are displayed below the code boxes. If you wish to re-run some code (with or without changes), please first run all earlier code boxes in order, as functions in later code boxes depend on earlier code boxes having been executed at least once.

To run a code box, press Shift+Enter.

In a *decision problem*,
we take one action $a \in \mathcal{A}$, receive a percept $e \in \mathcal{E}$
(typically called *outcome* in the decision theory literature)
and get a payoff according to the utility function
$u: \mathcal{E} \to [0, 1]$.
We assume that the set of actions $\mathcal{A}$ and the set of percepts $\mathcal{E}$ are finite.
Additionally,
the environment contains a *hidden state* $s \in \mathcal{S}$.
The hidden state holds information that is inaccessible to the agent
at the time of the decision,
but may influence the decision and the percept.
Formally, the environment is given by
a probability distribution $P$ over the hidden state, the action and the percept
that factors according to a *causal graph* (Pearl, 2009).

A causal graph over the random variables $x_1,\dots,x_n$ is a directed acyclic graph with nodes $x_1,\dots,x_n$ and probability distributions $P(x_i\mid pa_i)$ for each node $x_i$ where $pa_i$ is the set of parents of $x_i$ in the graph. It is natural to identify the causal graph with the factored distribution $P(x_1,\dots, x_n) = \prod_{i=1}^n P(x_i\mid pa_i)$. Given such a causal graph, the $\texttt{do}$-operator is defined as $$ P(x_1, \ldots, x_j, \ldots, x_n \mid \texttt{do}(x_j := b)) = \frac{\prod_{i=1}^n P(x_i \mid pa_i)}{P(x_j \mid pa_j)} $$ if $x_j = b$ on the left hand side and $0$ otherwise. The result is a new probability distribution that can be marginalized and conditioned in the standard way. Intuitively, intervening on node $x_j$ means ignoring all incoming arrows to $x_j$, as the effects they represent are no longer relevant when we intervene; the division removes the $P(x_j \mid pa_j)$ factor from $P(x_1,\dots,x_n)$. Note that the $\texttt{do}$-operator is only defined for distributions for which a causal graph has been specified. See (Pearl, 2009, Ch. 3.4) for details.
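As a concrete illustration (not part of the paper's code, with made-up numbers), here is how the $\texttt{do}$-operator acts on the two-node causal graph $s \to a$: conditioning on the action updates the belief about $s$, while intervening does not.

```
# Illustrative sketch (not from the paper): the do-operator on the
# two-node causal graph s -> a, with made-up numbers.
from fractions import Fraction

P_s = {'E': Fraction(1, 2), 'F': Fraction(1, 2)}  # P(s)
P_a_s = {('1', 'E'): Fraction(1, 100), ('2', 'E'): Fraction(99, 100),
         ('1', 'F'): Fraction(99, 100), ('2', 'F'): Fraction(1, 100)}  # P(a | s)

def P(s, a):
    # Observational joint: P(s, a) = P(s) P(a | s)
    return P_s[s] * P_a_s[(a, s)]

def P_do(s, a, b):
    # P(s, a | do(a := b)): divide out the factor P(a | s),
    # leaving only P(s), and put all probability mass on a = b
    return P_s[s] if a == b else Fraction(0)

# Conditioning and intervening on the action give different posteriors over s:
print(P('F', '1') / sum(P(s, '1') for s in P_s))                  # P(F | a=1) = 99/100
print(P_do('F', '1', '1') / sum(P_do(s, '1', '1') for s in P_s))  # P(F | do(a:=1)) = 1/2
```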

In the first two (one-shot) examples, the environment is given as the following causal graph over the hidden state $s$, action $a$, and percept $e$.

According to this causal graph, the probability distribution $P$ factors causally into $P(s, a, e) = P(s) P(a \mid s) P(e \mid s, a)$.

The environment has a prior belief over the agent's actions,
but that prior does not have to assign high probability to the
actions that the agent actually ends up taking.
We think of this prior as *partial information* about the agent,
or, in multi-agent systems, as beliefs held by other agents.
Since we assume the agent's policy to be deterministic and inside the environment,
the environment could actually know all the agent's actions in advance.
However, for self-consistency reasons, the agent needs to be
uncertain about what the environment knows about it:
if the agent knew which action it was going to take,
it could decide to take a different action instead, which leads to a paradox.

In [1]:

```
# Imports and helper functions
from fractions import Fraction  # exact arithmetic with fractions

def distribution(d):
    """
    Return a probability distribution
    corresponding to the conditional probability mu(a | b).
    Input: a dictionary specifying the distribution
    Output: a function that queries the dictionary and returns 0 if nothing is found
    """
    def mu(a, b=''):
        if b == '':
            return d[a] if a in d else 0
        else:
            return d[(a, b)] if (a, b) in d else 0
    return mu
```

One-shot evidential value function: $$ V^{\mathrm{evi},a}_{\mu,1} := \sum_{e \in \mathcal{E}} \mu(e \mid a) u(e) = \sum_{e \in \mathcal{E}} \sum_{s \in \mathcal{S}} \mu(e \mid s, a) \mu(s \mid a) u(e) $$ One-shot causal value function: $$ V^{\mathrm{cau},a}_{\mu,1} := \sum_{e \in \mathcal{E}} \mu(e \mid \texttt{do}(a)) u(e) = \sum_{e \in \mathcal{E}} \sum_{s \in \mathcal{S}} \mu(e \mid s, a) \mu(s) u(e) $$

In [2]:

```
def evidential_value(mu, a, E, S, u):
    """
    Evidential value of action a in environment mu
    with possible percepts E, hidden states S, and utility function u.
    """
    mu_a = sum([mu(a, s)*mu(s) for s in S])  # mu(a)
    return sum([u[e]*mu(e, s + a)*mu(s)*mu(a, s)/mu_a for e in E for s in S])

def causal_value(mu, a, E, S, u):
    """
    Causal value of action a in environment mu
    with possible percepts E, hidden states S, and utility function u.
    """
    return sum([u[e]*mu(e, s + a)*mu(s) for e in E for s in S])
```

In Newcomb's Problem there are two boxes: an opaque box that is either empty or contains one million dollars and a transparent box that contains one thousand dollars. The agent can choose between taking only the opaque box ("one-boxing") and taking both boxes ("two-boxing"). The content of the opaque box is determined by a very reliable predictor: If it predicts the agent will one-box, the box contains the million, and if it predicts the agent will two-box, the box is empty.

In Newcomb's problem EDT prescribes one-boxing because one-boxing is evidence that the box contains a million dollars. In contrast, CDT prescribes two-boxing because two-boxing dominates one-boxing: in either case we are a thousand dollars richer, and our decision cannot causally affect the prediction. Newcomb's problem has been raised as a critique of CDT, but many philosophers insist that two-boxing is in fact the rational choice, even if it makes you end up poor.

In [3]:

```
# Newcomb's Problem
S = {'E', 'F'}  # E = empty, F = full
A = {'1', '2'}  # 1 = one-box, 2 = two-box
E = {'0', 'T', 'M', 'A'}  # 0 = no money, T = $1000, M = $1000000, A = $1001000 (all of the money)
u = {'0': 0, 'T': 1000, 'M': 1000000, 'A': 1001000}  # Utilities
eps = Fraction('0.01')  # Prediction inaccuracy
mu = distribution({
    'E': Fraction('0.5'),  # Prob. box empty
    'F': Fraction('0.5'),  # Prob. box full
    ('1', 'F'): 1 - eps,   # Prob. agent one-boxes given box full
    ('1', 'E'): eps,       # Prob. agent one-boxes given box empty
    ('2', 'F'): eps,       # Prob. agent two-boxes given box full
    ('2', 'E'): 1 - eps,   # Prob. agent two-boxes given box empty
    ('0', 'E1'): 1,        # Prob. receiving $0 given empty and one-boxing
    ('T', 'E2'): 1,        # Prob. receiving $T given empty and two-boxing
    ('M', 'F1'): 1,        # Prob. receiving $M given full and one-boxing
    ('A', 'F2'): 1,        # Prob. receiving $M+$T given full and two-boxing
})  # Unspecified conditionals default to 0
print('Evidential value of one-boxing: ${:}'.format(evidential_value(mu, '1', E, S, u)))
print('Evidential value of two-boxing: ${:}'.format(evidential_value(mu, '2', E, S, u)))
print('Causal value of one-boxing: ${:}'.format(causal_value(mu, '1', E, S, u)))
print('Causal value of two-boxing: ${:}'.format(causal_value(mu, '2', E, S, u)))
assert evidential_value(mu, '1', E, S, u) > evidential_value(mu, '2', E, S, u)  # EDT prefers 1
assert causal_value(mu, '1', E, S, u) < causal_value(mu, '2', E, S, u)  # CDT prefers 2
```

This problem takes place in a world in which there is a certain parasite that causes its host to be attracted to cats, in addition to causing uncomfortable side effects. The agent is handed an adorable little kitten and is faced with the decision of whether or not to pet it. Petting the kitten feels nice and therefore yields more utility than not petting it. However, people suffering from the parasite are more likely to pet the kitten.

Petting the kitten constitutes evidence of the presence of the parasite, and thus EDT recommends against it. CDT correctly observes that there is no causal connection between petting the kitten and having the parasite, and is therefore in favor of petting.

In [4]:

```
# Toxoplasmosis Problem
S = {'T', 'H'}  # T = toxoplasmosis, H = healthy
A = {'P', 'N'}  # P = petting, N = not petting
E = {'0', '1', '8', '9'}  # 0 = sick, not petting; 1 = sick, petting;
                          # 8 = healthy, not petting; 9 = healthy, petting
u = {'0': -10, '1': -9, '8': 0, '9': 1}  # Utilities
mu = distribution({
    'T': Fraction('0.5'),
    'H': Fraction('0.5'),
    ('P', 'T'): Fraction('0.8'),
    ('P', 'H'): Fraction('0.2'),
    ('N', 'T'): Fraction('0.2'),
    ('N', 'H'): Fraction('0.8'),
    ('0', 'TN'): 1,
    ('1', 'TP'): 1,
    ('8', 'HN'): 1,
    ('9', 'HP'): 1,
})  # Unspecified conditionals default to 0
print('Evidential value of petting: {:}'.format(evidential_value(mu, 'P', E, S, u)))
print('Evidential value of not petting: {:}'.format(evidential_value(mu, 'N', E, S, u)))
print('Causal value of petting: {:}'.format(causal_value(mu, 'P', E, S, u)))
print('Causal value of not petting: {:}'.format(causal_value(mu, 'N', E, S, u)))
assert evidential_value(mu, 'P', E, S, u) < evidential_value(mu, 'N', E, S, u)  # EDT prefers N
assert causal_value(mu, 'P', E, S, u) > causal_value(mu, 'N', E, S, u)  # CDT prefers P
```

For the remainder of this notebook,
we consider an environment $\mu$ that
the agent interacts with sequentially:
at time step $t$ the agent chooses an *action* $a_t \in \mathcal{A}$ and
receives a *percept* $e_t \in \mathcal{E}$
which yields a *utility* of $u(e_t) \in \mathbb{R}$;
the cycle then repeats for $t + 1$.
A *history* is an element of $(\mathcal{A} \times \mathcal{E})^*$.
We use $æ \in \mathcal{A} \times \mathcal{E}$ to denote one interaction cycle,
and $æ_{<t}$ to denote a history of length $t - 1$.
A *policy* is a function that maps a history $æ_{<t}$ to
the next action $a_t$.
We only consider deterministic policies.
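In the code boxes below, a history is encoded as a string of action and percept symbols, and a deterministic policy as a Python dict mapping history strings to next actions. A minimal sketch of this convention (the concrete symbols are illustrative):

```
# A deterministic policy as a dict from history strings to next actions.
# '' is the empty history; '1E' is the history (a_1 = '1', e_1 = 'E').
pi = {'': '1', '1E': '2', '1F': '1'}
print(pi[''])    # first action a_1 = '1'
print(pi['1E'])  # second action a_2 after percept 'E' = '2'
```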

This section assumes that the environment is fully known
except for some stochasticity and an unknown hidden state.
In other words, it is a *planning problem*.
The hidden state influences both percepts and actions,
and all actions and percepts (potentially) influence the entire future.
The environment $\mu$ is thus formally specified by
a probability distribution over hidden states and histories
that for any $t \in \mathbb{N}$ factors as
$$
\mu(s, æ_{<t})
= \mu(s) \prod_{i=1}^{t-1} \mu(a_i \mid s, æ_{<i}) \mu(e_i \mid s, æ_{<i}a_i).
$$
While this factorization is possible for any distribution, we additionally demand
that this factorization is *causal* according to the following causal graph.
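As a quick sanity check (not from the paper), any such factorization must define a normalized distribution over hidden states and histories. With uniform binary factors, purely for illustration, $\mu(s, æ_{<3})$ sums to $1$:

```
from fractions import Fraction

# Toy sanity check (not from the paper): with uniform binary factors,
# the factorization mu(s, ae_{<t}) sums to 1 over s and all histories.
S2, A2, E2 = ['s0', 's1'], ['a0', 'a1'], ['e0', 'e1']
half = Fraction(1, 2)

def mu_joint(s, history):
    """mu(s, ae_{<t}) for a history given as a list of (a, e) pairs."""
    p = half  # mu(s)
    for a, e in history:
        p *= half  # mu(a_i | s, ae_{<i})
        p *= half  # mu(e_i | s, ae_{<i} a_i)
    return p

total = sum(mu_joint(s, [(a1, e1), (a2, e2)])
            for s in S2 for a1 in A2 for e1 in E2 for a2 in A2 for e2 in E2)
print(total)  # 1
```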

When deciding between actions from $\mathcal{A}$,
we want to pick an action that maximizes expected utility.
Given a history $æ_{<t}$, a policy $\pi$, and an environment $\mu$,
the *value $V^\pi_{\mu,m}$ of the policy $\pi$ in the environment $\mu$*
is the future expected utility when following the policy $\pi$ with lifetime $m$.
If we have a value function,
we get the utility-maximizing action/policy by picking the policy that
has the highest value.
Therefore it is enough to define
the value function of sequential evidential decision theory
and sequential causal decision theory.

The agent-environment interaction generates a growing action-percept history $æ_{<t}$. The pivotal distinction between EDT and CDT is how they update their prediction of a next percept $e_t$ provided they take action $a_t$ after history $æ_{<t}$.

The *action-evidential value of a policy $\pi$
with lifetime $2$ in environment $\mu$* is
$$
V^{\mathrm{aev},\pi}_{\mu,2}
= \sum_{e_1} \left( \sum_{s \in \mathcal{S}} \mu(s \mid a_1) \mu(e_1 \mid s, a_1) \right)
\left( u(e_1) + \sum_{e_2} u(e_2) \sum_{s \in \mathcal{S}} \mu(s \mid æ_1 a_2) \mu(e_2 \mid s, æ_1 a_2) \right)
$$
where $a_1 := \pi(\epsilon)$ and $a_2 := \pi(æ_1)$.

The *policy-evidential value of a policy $\pi$
with lifetime $2$ in environment $\mu$* is
$$
V^{\mathrm{pev},\pi}_{\mu,2}
= \sum_{e_1} \frac{\sum_{s \in \mathcal{S}} \mu(s a_1 e_1 \pi(a_1 e_1))}{\sum_{s \in \mathcal{S}} \sum_{e \in \mathcal{E}} \mu(s a_1 e \pi(a_1 e))}
\left( u(e_1) + \sum_{e_2} u(e_2) \sum_{s \in \mathcal{S}} \mu(s \mid æ_1 a_2) \mu(e_2 \mid s, æ_1 a_2) \right)
$$
where $a_1 := \pi(\epsilon)$ and $a_2 := \pi(æ_1)$.

The *causal value of a policy $\pi$
with lifetime $2$ in environment $\mu$* is
$$
V^{\mathrm{cau},\pi}_{\mu,2}
= \sum_{e_1} \left( \sum_{s \in \mathcal{S}} \mu(s) \mu(e_1 \mid s, a_1) \right)
\left( u(e_1) + \sum_{e_2} u(e_2) \sum_{s \in \mathcal{S}} \mu(s \mid æ_1) \mu(e_2 \mid s, æ_1 a_2) \right)
$$
where $a_1 := \pi(\epsilon)$ and $a_2 := \pi(æ_1)$.

In [5]:

```
def saedt_value(mu, pi, E, S, u):
    """
    Action-evidential value of policy pi in environment mu with horizon 2
    with possible percepts E, hidden states S, and utility function u.
    """
    a1 = pi['']
    mu_a = sum([mu(a1, s)*mu(s) for s in S])  # mu(a1)
    mu_aea = lambda e1: sum([mu(s)*mu(a1, s)*mu(e1, s + a1)*mu(pi[a1 + e1], s + a1 + e1)
                             for s in S])  # mu(a1 e1 a2)
    return sum([mu(e1, s + a1)*mu(a1, s)*mu(s)/mu_a  # outer sum over e1 and s
                *(u[e1] + sum([  # inner sum over e2 and s
                    mu(e2, s + a1 + e1 + pi[a1 + e1])
                    *mu(s)*mu(a1, s)*mu(e1, s + a1)*mu(pi[a1 + e1], s + a1 + e1)
                    /mu_aea(e1)*u[e2]
                    for e2 in E for s in S
                ]))
                for e1 in E for s in S if mu(e1, s + a1) > 0])

def spedt_value(mu, pi, E, S, u):
    """
    Policy-evidential value of policy pi in environment mu with horizon 2
    with possible percepts E, hidden states S, and utility function u.
    """
    a1 = pi['']
    # mu(e_1 | pi) = \sum_s mu(s | pi)*mu(e_1 | s, pi)
    mu_e1_pi = lambda e1: sum([mu(s)*mu(a1, s)*mu(e1, s + a1)*mu(pi[a1 + e1], s + a1 + e1)
                               for s in S if mu(e1, s + a1) > 0]) / \
                          sum([mu(s)*mu(a1, s)*mu(e, s + a1)*mu(pi[a1 + e], s + a1 + e)
                               for s in S for e in E if mu(e, s + a1) > 0])
    mu_aea = lambda e1: sum([mu(s)*mu(a1, s)*mu(e1, s + a1)*mu(pi[a1 + e1], s + a1 + e1)
                             for s in S])  # mu(a1 e1 a2)
    return sum([mu_e1_pi(e1)*(u[e1] + sum([mu(s)*mu(a1, s)*mu(e1, s + a1)
                                           *mu(pi[a1 + e1], s + a1 + e1)
                                           *mu(e2, s + a1 + e1 + pi[a1 + e1])
                                           /mu_aea(e1)*u[e2]
                                           for e2 in E for s in S if mu(e1, s + a1) > 0
                                           ])) for e1 in E])

def scdt_value(mu, pi, E, S, u):
    """
    Causal value of policy pi in environment mu
    with possible percepts E, hidden states S, and utility function u.
    """
    a1 = pi['']
    mu_ae = lambda e1: sum([mu(a1, s)*mu(e1, s + a1)*mu(s) for s in S])  # mu(a1 e1)
    return sum([mu(e1, s + a1)*mu(s)*(u[e1] + sum([
                    mu(e2, s + a1 + e1 + pi[a1 + e1])*mu(s)*mu(a1, s)*mu(e1, s + a1)
                    /mu_ae(e1)*u[e2]
                    for e2 in E for s in S
                ])) for e1 in E for s in S if mu(e1, s + a1) > 0])

def print_policy_values(policies):
    """Print SAEDT, SPEDT, and SCDT values for a list of (policy, name) pairs."""
    for value, vname in [(saedt_value, 'Action-evidential'),
                         (spedt_value, 'Policy-evidential'),
                         (scdt_value, 'Causal')]:
        print('======================================')
        print('{} value'.format(vname))
        values = {pname: value(mu, pi, E, S, u) for pi, pname in policies}
        m = max(values.values())
        for pi, pname in policies:
            print('{:<20}: {:}{:}'.format(pname, values[pname],
                                          ' BEST' if values[pname] == m else ''))
```

In this variation of Newcomb's problem the agent may look into the opaque box before deciding which box to take. An SCDT agent is indifferent towards looking because she will take both boxes anyway. However, an SAEDT or SPEDT agent will avoid looking into the box, because once the content is revealed he two-boxes.

In [6]:

```
# Newcomb with Looking
S = {'E', 'F'}  # E = empty, F = full
A = {'1', '2'}  # first action: 1 = looking, 2 = not looking;
                # second action: 1 = one-box, 2 = two-box
E = {'E', 'F', '0', 'T', 'M', 'A'}  # E = empty, F = full, 0 = no money,
                                    # T = $1000, M = $1000000, A = $1001000 (all of the money)
u = {'E': 0, 'F': 0, '0': 0, 'T': 1000, 'M': 1000000, 'A': 1001000}
eps = Fraction('0.01')  # prediction inaccuracy
mu = distribution({
    # Hidden state
    'E': Fraction('0.5'),
    'F': Fraction('0.5'),
    # First action: looking vs. not looking
    ('1', 'F'): Fraction('0.5'),
    ('1', 'E'): Fraction('0.5'),
    ('2', 'F'): Fraction('0.5'),
    ('2', 'E'): Fraction('0.5'),
    # First percept: box content if looking else 0
    ('E', 'E1'): 1,
    ('0', 'E2'): 1,
    ('F', 'F1'): 1,
    ('0', 'F2'): 1,
    # Second action: one-boxing vs. two-boxing
    ('1', 'E1E'): eps,
    ('2', 'E1E'): 1 - eps,
    ('1', 'E20'): eps,
    ('2', 'E20'): 1 - eps,
    ('1', 'F1F'): 1 - eps,
    ('2', 'F1F'): eps,
    ('1', 'F20'): 1 - eps,
    ('2', 'F20'): eps,
    # Second percept: box contents
    ('0', 'E1E1'): 1,
    ('0', 'E201'): 1,
    ('T', 'E1E2'): 1,
    ('T', 'E202'): 1,
    ('M', 'F1F1'): 1,
    ('M', 'F201'): 1,
    ('A', 'F1F2'): 1,
    ('A', 'F202'): 1,
})
# There are six possible policies:
policies = [
    ({'': '1', '1E': '1', '1F': '1'}, 'Curious one-boxer'),
    ({'': '1', '1E': '2', '1F': '2'}, 'Curious two-boxer'),
    ({'': '1', '1E': '2', '1F': '1'}, 'Fatalistic'),
    ({'': '1', '1E': '1', '1F': '2'}, 'Paradox-lover'),
    ({'': '2', '20': '1'}, 'Incurious one-boxer'),
    ({'': '2', '20': '2'}, 'Incurious two-boxer')
]
print_policy_values(policies)
```

While the curious one-boxer is an optimal strategy for SPEDT,
it is not *time consistent*: once the box content is revealed,
both evidential decision theories want to two-box.

In [7]:

```
def after_seeing(box_content):
    """Generate the conditional distribution for the second step,
    after the box content has been revealed"""
    def distr(a, b=''):
        if b == '':
            return mu(a)
        else:
            return mu(a, b[0] + '1' + box_content + b[1:])
    return distr

print('Evidential value of one-boxing after seeing the box empty: ${:}'.format(
    evidential_value(after_seeing('E'), '1', E, S, u)))
print('Evidential value of two-boxing after seeing the box empty: ${:}'.format(
    evidential_value(after_seeing('E'), '2', E, S, u)))
print('Evidential value of one-boxing after seeing the box full: ${:}'.format(
    evidential_value(after_seeing('F'), '1', E, S, u)))
print('Evidential value of two-boxing after seeing the box full: ${:}'.format(
    evidential_value(after_seeing('F'), '2', E, S, u)))
```

In this variation of Newcomb's problem the agent first has the option to pay \$300,000 to sign a contract that binds the agent to pay \$2000 in case of two-boxing. An SAEDT or SPEDT agent knows that he will one-box anyway and hence has no need for the contract. An SCDT agent knows that she favors two-boxing, but signs the contract only if this occurs before the prediction is made (so it has a chance of causally affecting the prediction). With the contract in place, one-boxing is the dominant action, and thus the SCDT agent is predicted to one-box.

In [8]:

```
# Newcomb with Precommitment
S = {'E', 'F'}  # E = empty, F = full
A = {'1', '2'}  # first action: 1 = signing, 2 = not signing;
                # second action: 1 = one-box, 2 = two-box
E = {'C', '0', 'T', 'M', 'A', 'R'}
# C = -$300000 for the contract, T = $1000, M = $1000000, A = $1001000 (all of the money),
# R = $999000 rich while violating the contract
u = {'C': -300000, '0': 0, 'T': 1000, 'M': 1000000, 'A': 1001000, 'R': 999000}
eps = Fraction('0.01')  # prediction inaccuracy
mu = distribution({
    # Hidden state
    'E': Fraction('0.5'),
    'F': Fraction('0.5'),
    # First action: signing vs. not signing
    ('1', 'F'): Fraction('0.5'),
    ('1', 'E'): Fraction('0.5'),
    ('2', 'F'): Fraction('0.5'),
    ('2', 'E'): Fraction('0.5'),
    # First percept: C if signing the contract else 0
    ('C', 'E1'): 1,
    ('0', 'E2'): 1,
    ('C', 'F1'): 1,
    ('0', 'F2'): 1,
    # Second action: one-boxing vs. two-boxing
    ('1', 'E1C'): eps,
    ('2', 'E1C'): 1 - eps,
    ('1', 'E20'): eps,
    ('2', 'E20'): 1 - eps,
    ('1', 'F1C'): 1 - eps,
    ('2', 'F1C'): eps,
    ('1', 'F20'): 1 - eps,
    ('2', 'F20'): eps,
    # Second percept: box contents minus contract cost
    ('M', 'E1C1'): 1,
    ('0', 'E201'): 1,
    ('R', 'E1C2'): 1,
    ('T', 'E202'): 1,
    ('M', 'F1C1'): 1,
    ('M', 'F201'): 1,
    ('R', 'F1C2'): 1,
    ('A', 'F202'): 1,
})
# There are four possible policies:
policies = [
    ({'': '1', '1C': '1'}, 'Signing one-boxer'),
    ({'': '1', '1C': '2'}, 'Signing two-boxer'),
    ({'': '2', '20': '1'}, 'Refusing one-boxer'),
    ({'': '2', '20': '2'}, 'Refusing two-boxer')
]
print_policy_values(policies)

# Compute values after signing the contract
def after_signing(a, b=''):
    """Conditional distribution for the second step,
    after the contract has been signed"""
    if b == '':
        return mu(a)
    else:
        return mu(a, b[0] + '1C' + b[1:])

print('======================================')
for value, name in [(evidential_value, 'Evidential'), (causal_value, 'Causal')]:
    print('{:} value of one-boxing after signing: ${:}'.format(
        name, value(after_signing, '1', E, S, u)))
    print('{:} value of two-boxing after signing: ${:}'.format(
        name, value(after_signing, '2', E, S, u)))
```

In our sequential variation of the toxoplasmosis problem the agent has some probability of encountering a kitten. Additionally, the agent has the option of seeing a doctor (for a fee) and getting tested for the parasite, which can then be safely removed. In the very beginning, an SPEDT agent updates his belief on the fact that if he encountered a kitten, he would not pet it, which lowers the probability that he has the parasite and thus makes seeing the doctor unattractive. An SAEDT agent only updates his belief about the parasite when he actually encounters a kitten, and thus prefers seeing the doctor.

First the environment gives the agent the parasite (T) or not (H). The agent then chooses to see the doctor (Y) or not (N). At the next step an agent that does not see the doctor may encounter a kitten (K) or get sick (S) (if infected). After encountering a kitten, the agent can choose either to pet it (Y) or not (N). Agents that see the doctor pay the doctor cost (C). After S or C, empty actions Y or N may be taken. The game tree gives the probabilities.

In [9]:

```
# Sequential Toxoplasmosis
S = {'T', 'H'}  # T = toxoplasmosis, H = healthy
A = {'Y', 'N'}  # Y = see doctor/pet kitten, N = not see doctor/not pet kitten
E = {'C', 'K', 'S', 's', 'P', '0'}  # C = doctor cost, K = kitten, S = sick,
                                    # s = sick but pet kitten, P = pet kitten, 0 = empty percept
u = {'C': -4, 'K': 0, 'S': -10, 's': -9, 'P': 1, '0': 0}  # utilities
mu = distribution({
    # Hidden state
    'T': Fraction('0.5'),
    'H': Fraction('0.5'),
    # First action: See the doctor?
    ('Y', 'T'): Fraction('0.5'),
    ('N', 'T'): Fraction('0.5'),
    ('Y', 'H'): Fraction('0.5'),
    ('N', 'H'): Fraction('0.5'),
    # First percept: doctor cost or seeing a kitten or getting sick
    ('C', 'TY'): 1,
    ('C', 'HY'): 1,
    ('S', 'TN'): Fraction('0.8'),
    ('K', 'TN'): Fraction('0.2'),
    ('K', 'HN'): 1,
    # Second action: Pet the kitten?
    ('Y', 'TNK'): Fraction('0.8'),
    ('N', 'TNK'): Fraction('0.2'),
    ('Y', 'HNK'): Fraction('0.2'),
    ('N', 'HNK'): Fraction('0.8'),
    # Second action: these are empty actions,
    # needed to get the history to length 2
    ('Y', 'TYC'): Fraction('0.5'),
    ('N', 'TYC'): Fraction('0.5'),
    ('Y', 'HYC'): Fraction('0.5'),
    ('N', 'HYC'): Fraction('0.5'),
    ('Y', 'TNS'): Fraction('0.5'),
    ('N', 'TNS'): Fraction('0.5'),
    # Second percept: payoff from petting
    ('s', 'TNKY'): 1,
    ('S', 'TNKN'): 1,
    ('P', 'HNKY'): 1,
    ('0', 'HNKN'): 1,
})
# There are three relevant policies:
policies = [
    ({'': 'Y', 'YC': 'N'}, "see doctor "),
    ({'': 'N', 'NK': 'Y', 'NS': 'N'}, "don't see doctor, pet "),
    ({'': 'N', 'NK': 'N', 'NS': 'N'}, "don't see doctor, don't pet"),
]
print_policy_values(policies)
```

- Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009.
- Tom Everitt, Jan Leike, and Marcus Hutter. Sequential Extensions of Causal and Evidential Decision Theory. Algorithmic Decision Theory, 2015.