#!/usr/bin/env python
# coding: utf-8

# # Sequential Extensions of Causal and Evidential Decision Theory
#
# Written by [Jan Leike](http://jan.leike.name/) and [Tom Everitt](http://www.tomeveritt.se/).
#
# ## About this notebook
#
# The code is written in Python 3, but should be compatible with Python 2.
#
# The code can be used to calculate the value estimates of the decision theories discussed in our paper
# [Sequential Extensions of Causal and Evidential Decision Theory](http://users.cecs.anu.edu.au/~leike/publications/Sequential Extensions of Causal and Evidential Decision Theory - Everitt, Leike, Hutter 2015.pdf).
# Generally, the implemented value functions take as input a distribution defining the environmental probabilities,
# an action/policy, lists of possible percepts and hidden states, and a utility function.
# The implementation aims to follow the formulas in the paper as closely as possible.
# See the code documentation for details.
#
# The results of the calculations are displayed below the code boxes.
# If you wish to re-run some code (with or without changes), please first run all earlier code boxes in order,
# as functions in later code boxes depend on earlier code boxes having been executed at least once.
#
# To run a code box, press Shift+Enter.
#
# ## One-Shot Decision Making
#
# In a _decision problem_,
# we take one action $a \in \mathcal{A}$, receive a percept $e \in \mathcal{E}$
# (typically called _outcome_ in the decision theory literature)
# and get a payoff according to the utility function
# $u: \mathcal{E} \to [0, 1]$.
# We assume that the set of actions $\mathcal{A}$ and the set of percepts $\mathcal{E}$ are finite.
# Additionally,
# the environment contains a _hidden state_ $s \in \mathcal{S}$.
# The hidden state holds information that is inaccessible to the agent
# at the time of the decision,
# but may influence the decision and the percept.
# Formally, the environment is given by
# a probability distribution $P$ over the hidden state, the action and the percept
# that factors according to a _causal graph_ (Pearl, 2009).
#
# ### Causal Graphs
#
# A causal graph over the random variables $x_1,\dots,x_n$ is
# a directed acyclic graph with nodes $x_1,\dots,x_n$ and
# probability distributions $P(x_i \mid pa_i)$ for each node $x_i$,
# where $pa_i$ is the set of parents of $x_i$ in the graph.
# It is natural to identify the causal graph with
# the factored distribution $P(x_1,\dots, x_n) = \prod_{i=1}^n P(x_i \mid pa_i)$.
# Given such a causal graph,
# the $\texttt{do}$-operator is defined as
# $$
# P(x_1, \ldots, x_j, \ldots, x_n \mid \texttt{do}(x_j := b))
# = \frac{\prod_{i=1}^n P(x_i \mid pa_i)}{P(x_j \mid pa_j)}
# $$
# if $x_j = b$ on the left-hand side, and $0$ otherwise.
# The result is a new probability distribution
# that can be marginalized and conditioned in the standard way.
# Intuitively, intervening on node $x_j$ means
# ignoring all incoming arrows to $x_j$, as the effects they represent
# are no longer relevant when we intervene;
# the division removes the $P(x_j \mid pa_j)$ factor from $P(x_1,\dots,x_n)$.
# Note that the $\texttt{do}$-operator is only defined for distributions
# for which a causal graph has been specified.
# See (Pearl, 2009, Ch. 3.4) for details.
#
# In the first two (one-shot) examples, the environment is given as
# the following causal graph over the hidden state $s$, action $a$, and percept $e$.
#
# ![im1](one-shot.png)
#
# According to this causal graph,
# the probability distribution $P$ factors causally into
# $P(s, a, e) = P(s) P(a \mid s) P(e \mid s, a)$.
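#
# As a worked example of the $\texttt{do}$-operator (spelled out here for illustration;
# it follows directly from the definition above), intervening on the action node $a$
# removes the factor $P(a \mid s)$:
# $$
# P(s, e \mid \texttt{do}(a))
# = \frac{P(s) P(a \mid s) P(e \mid s, a)}{P(a \mid s)}
# = P(s) P(e \mid s, a),
# $$
# and marginalizing over the hidden state yields
# $P(e \mid \texttt{do}(a)) = \sum_{s \in \mathcal{S}} P(e \mid s, a) P(s)$,
# which is exactly the expression that appears in the causal value function below.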
#
# The environment has a prior belief over the agent's actions,
# but that prior does not have to assign high probability to the
# actions that the agent actually ends up taking.
# We think of this prior as _partial information_ about the agent,
# or, in multi-agent systems, as beliefs held by other agents.
# Since we assume the agent's policy to be deterministic and inside the environment,
# the environment could actually know all the agent's actions in advance.
# However, for self-consistency reasons, the agent needs to be
# uncertain about what the environment knows about it:
# if the agent knew which action she was going to take,
# she could decide to take a different action instead, which leads to a paradox.

# In[1]:

# Imports and Helper functions

from fractions import Fraction  # exact arithmetic with fractions


def distribution(d):
    """
    Return a probability distribution corresponding to the (conditional) probability mu(a | b).
    Input: a dictionary specifying the distribution
    Output: a function that queries the dictionary and returns 0 if nothing is found
    """
    def mu(a, b=''):
        if b == '':
            return d[a] if a in d else 0
        else:
            return d[(a, b)] if (a, b) in d else 0
    return mu


# ### Value Functions
#
# One-shot evidential value function:
# $$
# V^{\mathrm{evi},a}_{\mu,1}
# := \sum_{e \in \mathcal{E}} \mu(e \mid a) u(e)
# = \sum_{e \in \mathcal{E}} \sum_{s \in \mathcal{S}} \mu(e \mid s, a) \mu(s \mid a) u(e)
# $$
# One-shot causal value function:
# $$
# V^{\mathrm{cau},a}_{\mu,1}
# := \sum_{e \in \mathcal{E}} \mu(e \mid \texttt{do}(a)) u(e)
# = \sum_{e \in \mathcal{E}} \sum_{s \in \mathcal{S}} \mu(e \mid s, a) \mu(s) u(e)
# $$

# In[2]:

def evidential_value(mu, a, E, S, u):
    """
    Evidential value of action a in environment mu with possible percepts E,
    hidden states S, and utility function u.
    """
    mu_a = sum([mu(a, s)*mu(s) for s in S])  # mu(a)
    return sum([u[e]*mu(e, s + a)*mu(s)*mu(a, s)/mu_a for e in E for s in S])


def causal_value(mu, a, E, S, u):
    """
    Causal value of action a in environment mu with possible percepts E,
    hidden states S, and utility function u.
    """
    return sum([u[e]*mu(e, s + a)*mu(s) for e in E for s in S])
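# As a quick sanity check of the two value functions just defined (this toy example and
# the names `S_toy`, `E_toy`, `u_toy`, `mu_toy`, and the action `'a'` are illustrative only,
# not part of the examples from the paper): in an environment where the hidden state does
# not influence the action, i.e. $\mu(a \mid s) = \mu(a)$, conditioning on the action and
# intervening on it coincide, so the evidential and causal values should be equal.
# This cell assumes the two code boxes above have been run.

# In[ ]:

S_toy = {'X', 'Y'}        # two hidden states
E_toy = {'g', 'b'}        # g = good percept, b = bad percept
u_toy = {'g': 1, 'b': 0}  # Utilities
mu_toy = distribution({
    'X': Fraction('0.5'), 'Y': Fraction('0.5'),                # uniform hidden state
    ('a', 'X'): Fraction('0.5'), ('a', 'Y'): Fraction('0.5'),  # action independent of the hidden state
    ('g', 'Xa'): 1, ('b', 'Ya'): 1,                            # percept determined by state and action
})  # Unspecified conditionals default to 0

assert evidential_value(mu_toy, 'a', E_toy, S_toy, u_toy) == causal_value(mu_toy, 'a', E_toy, S_toy, u_toy)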
# ### Newcomb's Problem
#
# In [Newcomb's Problem](http://en.wikipedia.org/wiki/Newcomb%27s_paradox) there are two boxes:
# an opaque box that is either empty or contains one million dollars,
# and a transparent box that contains one thousand dollars.
# The agent can choose between taking only the opaque box ("one-boxing")
# and taking both boxes ("two-boxing").
# The content of the opaque box is determined by a very reliable predictor:
# if it predicts the agent will one-box, the box contains the million, and
# if it predicts the agent will two-box, the box is empty.
#
# In Newcomb's problem,
# EDT prescribes one-boxing because one-boxing is evidence that
# the box contains a million dollars.
# In contrast, CDT prescribes two-boxing because two-boxing dominates one-boxing:
# in either case we are a thousand dollars richer,
# and our decision cannot causally affect the prediction.
# Newcomb's problem has been raised as a critique of CDT,
# but many philosophers insist that two-boxing is in fact
# the rational choice, even if it makes you end up poor.

# In[3]:

# Newcomb's Problem
S = {'E', 'F'}            # E = empty, F = full
A = {'1', '2'}            # 1 = one-box, 2 = two-box
E = {'0', 'T', 'M', 'A'}  # 0 = no money, T = $1000, M = $1000000, A = $1001000 (all of the money)
u = {'0': 0, 'T': 1000, 'M': 1000000, 'A': 1001000}  # Utilities
eps = Fraction('0.01')    # Prediction inaccuracy

mu = distribution({
    'E': Fraction('0.5'),  # Prob. box empty
    'F': Fraction('0.5'),  # Prob. box full
    ('1', 'F'): 1 - eps,   # Prob. agent one-boxes given box full
    ('1', 'E'): eps,       # Prob. agent one-boxes given box empty
    ('2', 'F'): eps,       # Prob. agent two-boxes given box full
    ('2', 'E'): 1 - eps,   # Prob. agent two-boxes given box empty
    ('0', 'E1'): 1,        # Prob. receiving $0 given empty and one-boxing
    ('T', 'E2'): 1,        # Prob. receiving $T given empty and two-boxing
    ('M', 'F1'): 1,        # Prob. receiving $M given full and one-boxing
    ('A', 'F2'): 1,        # Prob. receiving $M+$T given full and two-boxing
})  # Unspecified conditionals default to 0

print('Evidential value of one-boxing: ${:}'.format(evidential_value(mu, '1', E, S, u)))
print('Evidential value of two-boxing: ${:}'.format(evidential_value(mu, '2', E, S, u)))
print('Causal value of one-boxing: ${:}'.format(causal_value(mu, '1', E, S, u)))
print('Causal value of two-boxing: ${:}'.format(causal_value(mu, '2', E, S, u)))

assert evidential_value(mu, '1', E, S, u) > evidential_value(mu, '2', E, S, u)  # EDT prefers 1
assert causal_value(mu, '1', E, S, u) < causal_value(mu, '2', E, S, u)          # CDT prefers 2


# ### Toxoplasmosis Problem
#
# This problem takes place in a world in which there is a certain parasite that
# causes its hosts to be attracted to cats,
# in addition to other uncomfortable side effects.
# The agent is handed an adorable little kitten and
# is faced with the decision of whether or not to pet it.
# Petting the kitten feels nice and
# therefore yields more utility than not petting it.
# However, people suffering from the parasite are more likely to pet the kitten.
#
# Petting the kitten constitutes evidence
# of the presence of the parasite, and thus EDT recommends against it.
# CDT correctly observes that there is no causal connection between
# petting the kitten and having the parasite,
# and is therefore in favor of petting.
# In[4]:

# Toxoplasmosis Problem
S = {'T', 'H'}  # T = toxoplasmosis, H = healthy
A = {'P', 'N'}  # P = petting, N = not petting
E = {'0', '1', '8', '9'}  # 0 = sick, not petting; 1 = sick, petting; 8 = healthy, not petting; 9 = healthy, petting
u = {'0': -10, '1': -9, '8': 0, '9': 1}  # Utilities

mu = distribution({
    'T': Fraction('0.5'),
    'H': Fraction('0.5'),
    ('P', 'T'): Fraction('0.8'),
    ('P', 'H'): Fraction('0.2'),
    ('N', 'T'): Fraction('0.2'),
    ('N', 'H'): Fraction('0.8'),
    ('0', 'TN'): 1,
    ('1', 'TP'): 1,
    ('8', 'HN'): 1,
    ('9', 'HP'): 1,
})  # Unspecified conditionals default to 0

print('Evidential value of petting: {:}'.format(evidential_value(mu, 'P', E, S, u)))
print('Evidential value of not petting: {:}'.format(evidential_value(mu, 'N', E, S, u)))
print('Causal value of petting: {:}'.format(causal_value(mu, 'P', E, S, u)))
print('Causal value of not petting: {:}'.format(causal_value(mu, 'N', E, S, u)))

assert evidential_value(mu, 'P', E, S, u) < evidential_value(mu, 'N', E, S, u)  # EDT prefers N
assert causal_value(mu, 'P', E, S, u) > causal_value(mu, 'N', E, S, u)          # CDT prefers P


# ## Sequential Decisions
#
# For the remainder of this notebook,
# we consider an environment $\mu$ that
# the agent interacts with sequentially:
# at time step $t$ the agent chooses an _action_ $a_t \in \mathcal{A}$ and
# receives a _percept_ $e_t \in \mathcal{E}$,
# which yields a _utility_ of $u(e_t) \in \mathbb{R}$;
# the cycle then repeats for $t + 1$.
# A _history_ is an element of $(\mathcal{A} \times \mathcal{E})^*$.
# We use $æ \in \mathcal{A} \times \mathcal{E}$ to denote one interaction cycle,
# and $æ_{