%matplotlib inline
# The following code is necessary for animated matplotlib videos
# Based on http://jakevdp.github.io/blog/2013/05/12/embedding-matplotlib-animations/
# and on http://nbviewer.ipython.org/gist/edrex/9044756
# Important: The ffmpeg library needs to be installed to make this work http://www.ffmpeg.org/
from matplotlib import animation, pyplot as plt
VIDEO_TAG = """<video controls>
<source src="data:video/x-m4v;base64,{0}" type="video/mp4">
Your browser does not support the video tag.
</video>"""
# a bit ugly to use a global variable here
fps = 10
def anim_to_html(anim):
    if not hasattr(anim, '_encoded_video'):
        anim.save("tmp.mp4", fps=fps, extra_args=['-vcodec', 'libx264'])
        video = open("tmp.mp4", "rb").read()
        plt.close()
        anim._encoded_video = video.encode("base64")
    return VIDEO_TAG.format(anim._encoded_video)
animation.Animation._repr_html_ = anim_to_html
This IPython notebook provides a basic tutorial on the HypTrails approach. It utilizes the Python implementations provided at https://github.com/psinger/HypTrails and https://github.com/psinger/PathTools. HypTrails is a Bayesian approach for comparing hypotheses about human trails on the Web. Fundamentally, HypTrails is based on a first-order Markov chain model. Hypotheses are expressed as beliefs in the parameters of the model. HypTrails then incorporates these hypotheses as elicited Dirichlet priors into a Bayesian inference process. The relative plausibility of the hypotheses is finally determined by their relative marginal likelihoods and Bayes factors.
Details about the paper initially presenting HypTrails as well as further information can be found at: http://philippsinger.info/hyptrails.
First, this tutorial provides a short introduction to Bayesian inference, which is necessary for understanding the main concepts of HypTrails. Then, a step-by-step example is elaborated, which also takes excursions to related topics such as Markov chain models and the Dirichlet distribution. Afterwards, a larger synthetic example is elaborated. At the end of this tutorial, some elementary definitions are provided.
Before we focus on how we can utilize Bayesian inference for comparing hypotheses about human trails with HypTrails, let us go through the basics of Bayesian inference. If you are already familiar with the basics of Bayesian statistics, feel free to completely skip this section (specifically, the coin flip example).
In general, statistical inference is the process of determining properties of a distribution by studying data. In our Markov chain case, this means that we, e.g., want to determine the parameters (transition probabilities) given the data. Bayesian inference can be seen as the Bayesian counterpart to frequentist inference. In frequentist inference, parameters are usually seen as unknown constants and point estimates are inferred. In contrast, Bayesian inference treats the model parameters as random variables and aims to deduce probabilistic statements about the distribution of the parameters, or about competing models or hypotheses. Bayesian inference utilizes the famous Bayes rule:
$$ P(A|B) = \frac{P(B | A) P(A)}{P(B)} $$For model-based inference, we can replace $A$ with the parameters $\theta$ and $B$ with the data $D$ of interest. Furthermore, we can introduce $I$, which incorporates an additional assumption (knowledge) into the inference, such as which model to use.
$$ \overbrace{P(\theta| D, I)}^{\text{posterior}} = \frac{\overbrace{P(D | \theta, I)}^{\text{likelihood}}\overbrace{P(\theta|I)}^{\text{prior}}}{\underbrace{P(D|I)}_{\text{marginal likelihood}}} $$The prior distribution $P(\theta|I)$ specifies our assumption about the parameters $\theta$ before taking the data into account. The likelihood $P(D | \theta, I)$ represents the probability of the data given specified parameters $\theta$. The marginal likelihood (or evidence) $P(D|I)$ is the distribution of the data $D$ given our additional assumption $I$. It is the normalization constant of the Bayes rule and plays an important role for model comparison, which HypTrails heavily utilizes and which we will talk about later in this tutorial. Finally, the posterior $P(\theta| D, I)$ is the distribution of the parameters after taking the observed data $D$ and our additional (prior) assumption $I$ into account. We can also say that the posterior is proportional to the likelihood times the prior.
$$ posterior \propto likelihood \times prior $$For a better understanding of Bayesian inference, let us consider the classic coin flip example.
We are observing whether a coin flip results in "heads" or "tails". We are not sure whether the coin at hand is fair or whether it might be biased due to some physical irregularity. Thus, we want to conduct statistical inference on the parameter $p$, which describes the probability of flipping "heads", by utilizing the Bayesian framework. The probability of flipping "tails" is simply $1-p$. Further, we consider a set of observations $D$ gathered by flipping the coin several times. Thus, by applying Bayesian inference, we want to determine:
$$ P(p|D) \propto P(D|p) \times P(p) $$As the underlying model, we can use the binomial distribution, which is a discrete probability distribution for the number of successes in a sequence of $n$ independent binary experiments (e.g., coin flips resulting in "heads" or "tails"). The binomial distribution is conditioned on the parameters $n$ (the number of trials) and $p$ (the probability of flipping "heads"). The probability mass function (PMF) of the binomial distribution---which determines the probability of observing $k$ "heads" with parameters $p$ and $n$---is defined as:
$$ f(k|p,n) = \binom nk p^k (1-p)^{n-k} $$We can rewrite the above Bayes rule as:
$$ P(p|k,n) \propto P(k|p,n) \times P(p) $$We know how to calculate the likelihood based on the PMF of the binomial distribution. But how can we now use the prior? As mentioned, the prior is our belief in the parameter(s) before observing the data. So basically, we can express our assumption about the parameter $p$ of our model in the form of another probability distribution. In case of a binomial likelihood, we can use the Beta distribution as a conjugate prior. This means that the posterior distribution will be of the same family (i.e., a Beta distribution) as the prior distribution.
The beta distribution is a continuous probability distribution over the interval $[0,1]$. The PDF of the Beta distribution is defined as:
$$ f(x|\alpha,\beta) = \frac{1}{B(\alpha,\beta)} x^{\alpha-1}(1-x)^{\beta-1} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1} $$$B$ is the beta function and is a normalization constant to ensure that the probability integrates to 1. $\Gamma$ is the gamma function.
$\alpha$ and $\beta$ are positive shape parameters controlling the shape of the distribution. If they are $\geq 1$, we can think of them as pseudo counts; i.e., counts before observing the data. For our example, this means that $\alpha$ specifies the pseudo counts of observing a "heads" flip, while $\beta$ refers to the pseudo counts of observing a "tails" flip. The Wikipedia page of the Beta distribution goes into detail about the behavior of the shape parameters.
For a better understanding, let us visualize some examples next for the case of $\alpha \geq 1$ and $\beta \geq 1$.
from scipy.stats import beta
import matplotlib.pyplot as plt
import numpy as np
# helper function for plotting
def plot_beta(a, b, ax, print_interval=True):
    ax.set_xlabel("p")
    ax.set_ylabel("probability density")
    x = np.linspace(0.00, 1, 100)
    label = "$\\alpha= " + str(a) + ", \\beta=" + str(b) + "$"
    dist = beta(a, b)
    # plot density
    ax.plot(x, dist.pdf(x),
            lw=2, alpha=0.6, label=label)
    # determine the central interval containing 95% of the distribution
    if print_interval:
        print "Interval containing 95% of the distribution: ", dist.interval(0.95)
Let us consider a symmetric Beta distribution. This means that we want to express our prior belief that the coin is fair. We can express this prior with the shape parameters $\alpha$ and $\beta$ which we can interpret as pseudo counts. As we have a fair belief, we set both shape parameters to the same value. Let us start with $\alpha=\beta=10$.
fig, ax = plt.subplots(1,1)
plot_beta(10,10,ax)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels)
Interval containing 95% of the distribution: (0.2886432479169988, 0.7113567520830012)
What we can see is the PDF of the Beta distribution with our given shape parameters. We can see that the parameter $p=0.5$ receives the highest density. However, as we keep our pseudo counts relatively low, we also allow other parameter configurations to receive a certain density. If we determine the interval containing 95% of the distribution, we receive $0.29$ as the lower and $0.71$ as the upper boundary.
fig, ax = plt.subplots(1,1)
plot_beta(100,100,ax)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels)
Interval containing 95% of the distribution: (0.43095093094181675, 0.56904906905818331)
If we increase our symmetric shape parameters to a value of $100$, we can see a much higher density for parameters around the fair probability of $0.5$. The interval containing 95% of the distribution now lies between $0.43$ and $0.57$.
fig, ax = plt.subplots(1,1)
plot_beta(1000,1000,ax)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels)
Interval containing 95% of the distribution: (0.47809471962068506, 0.52190528037931494)
By further increasing the pseudo counts, we can further increase our "fair" coin belief.
Next, let us also briefly consider non-symmetric shape parameters (but let us still follow the pseudo count interpretation and thus only use parameters of at least one).
fig, ax = plt.subplots(1,1)
plot_beta(10,1,ax)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels)
Interval containing 95% of the distribution: (0.69150289218123928, 0.99747142145553824)
Setting $\alpha=10$ and $\beta=1$ increases the belief in an unfair coin biased towards flipping "heads". Finally, let us shortly consider what happens when we set $\alpha=\beta=1$.
fig, ax = plt.subplots(1,1)
plot_beta(1,1,ax)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels)
Interval containing 95% of the distribution: (0.025000000000000022, 0.97499999999999998)
We can see that we now have a flat prior, meaning that all parameter configurations are equally likely. I suggest playing around with the shape parameters to see what happens with other configurations; also try setting the parameters to values lower than one.
Now that we have specified the likelihood distribution as well as the prior distribution, we can rewrite the posterior distribution as:
$$ P(p|k,n,\alpha,\beta) = \text{Beta}(p|\alpha+k, \beta+(n-k)) $$As mentioned earlier, the posterior is from the same distribution family (Beta) as the prior (Beta), if the prior distribution is the conjugate prior to the likelihood distribution (binomial). For details about the derivation of this closed form of the posterior, please refer to this article.
Thus, based on our Bayesian inference, the posterior distribution of the parameter $p$ indicating the probability of flipping "heads", is expressed as a Beta distribution. The parameters of this Beta distribution are constituted by both our a-priori assumptions---i.e., the hyperparameters $\alpha$ and $\beta$---as well as observed data---i.e., the number of times we flipped "heads" $k$ and "tails" $n-k$. The resulting posterior can then be used as a new prior as we gather more data.
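As a quick numeric sanity check of this update rule (the flip counts below are made up for illustration), suppose we start from the flat prior $\alpha=\beta=1$ and observe $k=7$ "heads" in $n=10$ flips:

```python
# Made-up illustration: flat prior alpha = beta = 1, then we observe
# k = 7 "heads" in n = 10 flips
alpha_prior, beta_prior = 1, 1
k, n = 7, 10

# Conjugate update: the posterior is Beta(alpha + k, beta + (n - k))
alpha_post = alpha_prior + k        # 1 + 7 = 8
beta_post = beta_prior + (n - k)    # 1 + 3 = 4

# The mean of a Beta(a, b) distribution is a / (a + b)
posterior_mean = alpha_post / float(alpha_post + beta_post)
print(posterior_mean)               # 8 / 12, i.e., about 0.667
```

With the flat prior the pull towards the prior mean of $0.5$ is weak, so the posterior mean stays close to the empirical frequency $7/10$.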
For demonstrating how the posterior changes with an increasing number of observations, let us assume that we know the parameter of flipping "heads" in advance---i.e., we manually set $p$ of the binomial distribution to $0.4$. For simplicity, let us assume that our initial Beta prior has the hyperparameters $\alpha=\beta=1$, meaning that we have no real a-priori tendency and work with a flat prior. Next, we incrementally sample single random variates from the binomial distribution with $p=0.4$ (Bernoulli trials) and update our posterior with the corresponding samples.
from scipy.stats import beta, binom
# initializing plot
def init_plot():
    fig = plt.figure()
    ax = plt.axes(xlim=(0, 1), ylim=(0, 13))
    line, = ax.plot([], [], lw=2)
    ttl = ax.text(0.6, 0.8, '', transform=ax.transAxes, va='center', size=20)
    ax.set_xlabel("p")
    ax.set_ylabel("probability density")
    return fig, ax, ttl, line
# random variates
samples = binom.rvs(1,0.4, size=200)
# starting parameters and x values
a = 1
b = 1
x = np.linspace(0.00,1, 100)
# init function
def init():
    ttl.set_text("$\\alpha= " + str(a) + ", \\beta=" + str(b) + "$")
    y = beta.pdf(x, a, b)
    line.set_data(x, y)
    return line,
# animating the stuff
def animate(i):
    global a, b
    # somehow the init frame is not drawn, so a small hack here
    if i != 0:
        a += samples[i-1]
        b += 1 - samples[i-1]
    ttl.set_text("$\\alpha= " + str(a) + ", \\beta=" + str(b) + "$")
    y = beta.pdf(x, a, b)
    line.set_ydata(y)
    return line,
# let's animate
# you can increase/decrease the FPS at the beginning of this notebook
fig, ax, ttl, line = init_plot()
animation.FuncAnimation(fig, animate, init_func=init,
frames=200, interval=1, blit=True)
In the above animation, we can see that with an increasing number of random samples from the binomial distribution, starting with an initial uniform (flat) belief (prior), our belief increasingly concentrates around the true parameter (i.e., $p=0.4$) of the underlying distribution that we manually specified beforehand. Bayesian inference not only allows us to make statements about the most probable values for $p$, but also about the credibility of these statements (highest density regions).
What happens, though, if we have stronger beliefs a-priori? Let us specify a Beta prior with $\alpha=50$ and $\beta=10$. Hence, we have a fairly strong belief in a coin biased towards "heads", while our binomial distribution is slightly biased towards "tails" with $p=0.4$.
fig, ax, ttl, line = init_plot()
a = 50
b = 10
animation.FuncAnimation(fig, animate, init_func=init,
frames=200, interval=1, blit=True)
We can see that, with our 200 samples, our highest posterior belief is around $p=0.5$. This indicates the influence of the prior. Actually, the posterior can be seen as a convex combination of the prior belief (pseudo counts) and the data ("heads" and "tails" counts). If our prior belief is strong, we need more data to "overrule" the prior.
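This convex combination can be made explicit (the numbers below are illustrative): the posterior mean of a Beta distribution is a weighted average of the prior mean and the empirical mean, weighted by the total pseudo counts and the sample size, respectively.

```python
# Illustrative numbers: strong prior Beta(50, 10), then 200 samples
# of which 80 are "heads" (empirical mean 0.4)
a, b = 50, 10     # prior pseudo counts
k, n = 80, 200    # observed "heads" and total number of flips

prior_mean = a / float(a + b)   # about 0.83
data_mean = k / float(n)        # 0.4

# Posterior mean as a convex combination of prior mean and data mean
w_prior = (a + b) / float(a + b + n)
posterior_mean = w_prior * prior_mean + (1 - w_prior) * data_mean

# The same value follows directly from the update Beta(a + k, b + (n - k))
direct = (a + k) / float(a + b + n)
print(posterior_mean, direct)   # both 0.5
```

With these numbers the posterior mean lands exactly at $0.5$: the strong prior drags the estimate away from the empirical frequency of $0.4$.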
However, if we "steer" our prior in the right direction, we can concentrate our credibility around the "true" parameter much faster.
fig, ax, ttl, line = init_plot()
a = 4
b = 6
animation.FuncAnimation(fig, animate, init_func=init,
frames=50, interval=1, blit=True)
You can play around with the parameters here. I also suggest increasing the prior pseudo counts to large numbers, which makes it necessary to sample many variates before differences regarding your prior knowledge become visible (this may take a while though).
Equipped with a basic understanding of Bayesian inference, this tutorial now demonstrates the main methodological concepts and ideas of HypTrails. We do so by going through a simple example step by step.
Let us assume that we are the coach of a small football (soccer) team that---for simplicity---consists of three players (1, 2 and 3). We are interested in better understanding how our three players pass the ball between each other. To that end, we start to track the passes between our players, which produces passing trails with each state of a trail corresponding to the player who received the pass. For a better understanding, let us consider one sample trail depicted next.
from IPython.display import Image
Image(url="http://i.imgur.com/0K2oEh3.png")
In this example, player 1 first passes the ball to player 2 who then passes the ball to player 3. Finally, player 3 scores a goal and the trail is finished. Thus, this trail would look like:
Player 1 -> Player 2 -> Player 3
Of course, trails can be much longer. For example, player 3 might pass the ball back to player 2.
Having these kinds of trails, we are now interested in better understanding the mechanisms producing them in order to make informed decisions regarding our team and our potential coaching focus. To that end, we come up with certain hypotheses about how our players prefer to pass the ball between each other, and we want to compare their relative plausibility.
Before we proceed, we need to introduce the underlying model that HypTrails is based on and which we utilize for modeling the data: the Markov chain (MC) model (see our PlosOne paper for details). The MC model is a stochastic model capturing transition probabilities---which are the parameters of the model---between states. In general, a Markov chain model is memoryless, which means that the next state in a trail only depends on the current one and not on a series of preceding states. We can write the Markov property as:
$$ P(X_{t+1}=s_j|X_1=s_{i_1}, ..., X_{t-1}=s_{i_{t-1}}, X_t=s_{i_t})= \\ P(X_{t+1}= s_j|X_t=s_{i_t}) = p_{i,j}. $$The following figure visualizes the various parameters of the model for our simple example.
Image(url="http://i.imgur.com/FR4fqqQ.png")
The players represent the various states and the arrows the parameters (transition probabilities) of the MC model. For example, $p_{2,1}$ represents the probability that player 2 will pass the ball to player 1 if player 2 currently is in possession of the ball. The parameters $p_{i,j}$ are usually expressed in a transition matrix with rows $i$ and columns $j$ representing the various states of the Markov chain. As this matrix is stochastic, it holds for all $i$ that:
$$ \sum_j p_{i,j} = 1 $$Each row of the transition matrix follows a categorical probability distribution. Hence, the likelihood for a Markov chain can be determined by a product of likelihoods for the individual categorical distributions---$n_{i,j}$ represents the number of times this transition occurs in the data:
$$ P(D | \theta) = \prod_i \prod_j p_{i,j}^{n_{i,j}} $$An example of calculating the log-likelihood of a Markov chain with given transition probabilities (parameters) is given at the end of this tutorial.
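A minimal sketch of this likelihood computation in log-space (the transition counts and the stochastic parameter matrix below are illustrative):

```python
import numpy as np

# Illustrative transition counts n_ij and a stochastic parameter matrix p_ij
# (each row of probs sums to one)
counts = np.array([[0., 15., 3.],
                   [11., 0., 5.],
                   [7., 3., 0.]])
probs = np.array([[0.0, 0.8, 0.2],
                  [0.7, 0.0, 0.3],
                  [0.7, 0.3, 0.0]])

# log P(D | theta) = sum_ij n_ij * log(p_ij); cells with zero counts
# contribute nothing and are skipped to avoid log(0)
mask = counts > 0
log_likelihood = np.sum(counts[mask] * np.log(probs[mask]))
print(log_likelihood)   # about -24.2
```

Working in log-space avoids numerical underflow from multiplying many small probabilities.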
Fundamentally, we are interested in comparing hypotheses about trails. This means that we come up with certain hypotheses that make assumptions about how our three players pass the ball between each other. For example, we might hypothesize that our players pass the ball between each other based on their friendship status; i.e., good friends pass the ball to each other more frequently. Or, we might hypothesize that players always pass the ball to the best player. But how can we intuitively express these and similar generic hypotheses in a coherent way?
To that end, with HypTrails, hypotheses are expressed as beliefs in Markov chain parameters (transitions). This means that, according to a given hypothesis, we increase our belief in certain transitions if the hypothesis supports them. Note that we overall have three states; i.e., $S = \{\text{Player 1}, \text{Player 2}, \text{Player 3}\}$. This means that we have $|S| \cdot |S| = 9$ parameters. Next, some examples are presented to give a better understanding of the process of expressing hypotheses with HypTrails. For these examples, we first gather some background information about the players and their relations.
from prettytable import PrettyTable
x = PrettyTable(["Player", "Market value", "Friendship to Player 1", "Friendship to Player 2", "Friendship to Player 3"])
x.add_row(["Player 1", "1000€", "x", "70", "30"])
x.add_row(["Player 2", "500€", "70", "x", "50"])
x.add_row(["Player 3", "800€", "30", "50", "x"])
print x
+----------+--------------+------------------------+------------------------+------------------------+
|  Player  | Market value | Friendship to Player 1 | Friendship to Player 2 | Friendship to Player 3 |
+----------+--------------+------------------------+------------------------+------------------------+
| Player 1 |    1000€     |           x            |           70           |           30           |
| Player 2 |    500€      |           70           |           x            |           50           |
| Player 3 |    800€      |           30           |           50           |           x            |
+----------+--------------+------------------------+------------------------+------------------------+
The market value should give us an indication of the strength of a player. The friendship data is completely arbitrary here and might be initialized based on our personal feeling as the coach, on a scale from 0 (worst enemies) to 100 (bff). We can then utilize this background knowledge to express our hypotheses as beliefs in Markov transitions. Note that the data utilized for expressing these hypotheses should not stem from the trail data studied, because then we would already incorporate data into our hypotheses. Next, we want to express some exemplary hypotheses. Each hypothesis can be captured by a hypothesis matrix $Q$ with elements $q_{i,j}$ denoting the strength of our belief. The only limitations for the values are that they need to be non-negative and that higher values correspond to more belief; normalization is conducted at a later point.
For the following examples, we will always assign zero belief to the elements of the diagonal. We do this because we know that players pass the ball to other players all the time. If we also somehow collected data where a player keeps the ball instead of passing it to another player, self-loops might happen and we might want to incorporate this into the hypotheses at interest.
Let us start with the friendship hypothesis, which believes that our players prefer to pass the ball to players they are friends with. This means that we want to increase the values of $Q$ based on the friendship relations between the players. Thus, we can simply re-use the values provided in the table above.
# First we need to build a vocabulary that assigns indices to state names
# This is necessary so that the HypTrails framework knows which rows and columns
# correspond to which player.
vocab = {"Player 1": 0, "Player 2": 1, "Player 3": 2}
# The HypTrails framework works with sparse scipy matrices
# Thus, we already start to specify them here
from scipy.sparse import lil_matrix, csr_matrix
# Building is better with a lil_matrix
q_friendship = lil_matrix((3,3))
q_friendship[vocab["Player 1"], vocab["Player 2"]] = 70.
q_friendship[vocab["Player 2"], vocab["Player 1"]] = 70.
q_friendship[vocab["Player 1"], vocab["Player 3"]] = 30.
q_friendship[vocab["Player 3"], vocab["Player 1"]] = 30.
q_friendship[vocab["Player 2"], vocab["Player 3"]] = 50.
q_friendship[vocab["Player 3"], vocab["Player 2"]] = 50.
# We need csr_matrices later
q_friendship = q_friendship.tocsr()
print q_friendship.todense()
[[  0.  70.  30.]
 [ 70.   0.  50.]
 [ 30.  50.   0.]]
Next, let us consider the market value hypothesis. With this hypothesis, we believe that players prefer to pass the ball to players with high market values. In teams like Barcelona or Real Madrid this would mean that players would prefer to pass the ball to Messi or C. Ronaldo, followed by the other players of the team ranked by their market value. Note that for this hypothesis, we do not differentiate between the various start players (giving the pass). Regardless of whether player 1, 2 or 3 starts the pass, we always believe that the pass will end at the players ranked by their market value. Again, for our small example, we can directly use the values provided in the table above for our hypothesis matrix $Q$.
from scipy.sparse import lil_matrix, csr_matrix
q_market = lil_matrix((3,3))
q_market[:,0] = 1000.
q_market[:,1] = 500.
q_market[:,2] = 800.
# No self-loops
q_market.setdiag(0.)
# We need csr_matrices later
q_market = q_market.tocsr()
print q_market.todense()
[[    0.   500.   800.]
 [ 1000.     0.   800.]
 [ 1000.   500.     0.]]
Finally, let us consider a uniform hypothesis which believes that all pass opportunities are equally likely. This can be seen as a baseline hypothesis.
from scipy.sparse import lil_matrix, csr_matrix
q_uniform = lil_matrix((3,3))
q_uniform[:] = 1.
# No self-loops (also for the uniform)
q_uniform.setdiag(0.)
# We need csr_matrices later
q_uniform = q_uniform.tocsr()
print q_uniform.todense()
[[ 0.  1.  1.]
 [ 1.  0.  1.]
 [ 1.  1.  0.]]
Now that we have our expressed hypotheses, we need to collect some data. As we do not really coach a three-player soccer team (too bad), we cannot work with real data here. So let us just create some arbitrary passing trails next.
trails = [
["Player 1", "Player 2", "Player 3", "Player 2", "Player 1"],
["Player 2", "Player 1", "Player 3", "Player 1", "Player 2"],
["Player 1", "Player 2", "Player 1", "Player 2", "Player 1", "Player 2", "Player 1"],
["Player 1", "Player 3"],
["Player 3", "Player 1", "Player 2", "Player 3", "Player 1", "Player 2", "Player 1"],
["Player 1", "Player 3", "Player 2", "Player 3", "Player 1", "Player 2", "Player 3", "Player 1", "Player 2"],
["Player 1", "Player 2", "Player 1", "Player 2", "Player 1", "Player 2"],
["Player 3", "Player 1"],
["Player 2", "Player 3", "Player 1", "Player 2", "Player 1", "Player 2", "Player 1", "Player 2"],
["Player 3", "Player 2", "Player 1"]
]
HypTrails directly works with the data provided above. For completeness, let us determine the maximum likelihood estimate (see our PlosOne paper for details) of the parameters, which can simply be done by calculating:
$$ p_{ij} = \frac{n_{ij}}{\sum_j n_{ij}} $$The MLE can be seen as the parameter configuration that maximizes the likelihood of the given data.
# PathTools includes the Markov chain functionality; see https://github.com/psinger/PathTools
from pathtools.markovchain import MarkovChain
# We initialize a first-order Markov chain model
# For details about the parameters please refer to the github repo and
# the synthetic example presented later in this tutorial.
markov = MarkovChain(k=1, use_prior=False, reset = False, modus="mle")
markov.prepare_data(trails)
markov.fit(trails)
t = markov.transition_dict_norm_
x = PrettyTable(["", "Player 1", "Player 2", "Player 3"])
x.add_row(["Player 1", t[("Player 1",)]["Player 1"], t[("Player 1",)]["Player 2"], t[("Player 1",)]["Player 3"]])
x.add_row(["Player 2", t[("Player 2",)]["Player 1"], t[("Player 2",)]["Player 2"], t[("Player 2",)]["Player 3"]])
x.add_row(["Player 3", t[("Player 3",)]["Player 1"], t[("Player 3",)]["Player 2"], t[("Player 3",)]["Player 3"]])
print x
+----------+----------+----------------+----------------+
|          | Player 1 |    Player 2    |    Player 3    |
+----------+----------+----------------+----------------+
| Player 1 |   0.0    | 0.833333333333 | 0.166666666667 |
| Player 2 |  0.6875  |      0.0       |     0.3125     |
| Player 3 |   0.7    |      0.3       |      0.0       |
+----------+----------+----------------+----------------+
In this table, we can see the transition probabilities (pass probabilities) between the three states (players) as determined by the maximum likelihood estimation. Due to the stochastic nature, each row sums to one. We can see that player 1 passes most of the time to player 2, player 2 passes most of the time to player 1 (though slightly more balanced), and player 3 also prefers to pass to player 1. We can also take a look at the raw transition counts.
t = markov.transition_dict_
x = PrettyTable(["", "Player 1", "Player 2", "Player 3"])
x.add_row(["Player 1", t[("Player 1",)]["Player 1"], t[("Player 1",)]["Player 2"], t[("Player 1",)]["Player 3"]])
x.add_row(["Player 2", t[("Player 2",)]["Player 1"], t[("Player 2",)]["Player 2"], t[("Player 2",)]["Player 3"]])
x.add_row(["Player 3", t[("Player 3",)]["Player 1"], t[("Player 3",)]["Player 2"], t[("Player 3",)]["Player 3"]])
print x
+----------+----------+----------+----------+
|          | Player 1 | Player 2 | Player 3 |
+----------+----------+----------+----------+
| Player 1 |   0.0    |   15.0   |   3.0    |
| Player 2 |   11.0   |   0.0    |   5.0    |
| Player 3 |   7.0    |   3.0    |   0.0    |
+----------+----------+----------+----------+
For players 1 and 2 we have a bit more data compared to passes starting from player 3.
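The MLE table above can be reproduced directly from these raw counts by normalizing each row, i.e., applying $p_{ij} = n_{ij} / \sum_j n_{ij}$ with plain numpy:

```python
import numpy as np

# Raw transition counts from the table above (rows: passer, columns: receiver)
counts = np.array([[0., 15., 3.],
                   [11., 0., 5.],
                   [7., 3., 0.]])

# MLE: divide each row by its row sum
mle = counts / counts.sum(axis=1, keepdims=True)
print(mle)
# Each row sums to one, and e.g. mle[0, 1] = 15/18, matching the 0.833... above
```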
Up to this point, we have expressed our hypotheses as beliefs in parameters and we have collected data. But, we do not know yet how we can make relative judgements about the plausibility of our hypotheses given the data. To that end, HypTrails resorts to Bayesian inference.
So why have we spent so much time going through the basics of Bayesian inference in the beginning of this tutorial? Because we utilize it for comparing hypotheses about human trails. The main idea that we follow with HypTrails is that we incorporate hypotheses as priors into the inference process. The reason behind this is that the prior is exactly what our hypotheses should represent: beliefs about parameters before observing the data.
In the coin flip example at the beginning of this tutorial, we have utilized the Beta prior as a conjugate prior to the binomial distribution. In this case, we work with a Markov chain model where each row of the transition matrix follows a categorical distribution (see beginning of this tutorial). The conjugate prior for the categorical distribution is the Dirichlet distribution which we will shortly describe next.
The Dirichlet distribution is a continuous multivariate probability distribution and is the multivariate generalization of the Beta distribution described earlier. Similar to the Beta distribution, the parameters of the Dirichlet distribution can be interpreted as pseudo counts when we consider values $\geq 1$, which we do throughout this tutorial. In detail, the Dirichlet distribution is parametrized by a vector $\alpha=[\alpha_1, \alpha_2, ..., \alpha_n]$. The length of this vector is determined by the number of categories considered (or the number of states in our Markov chain use case). Thus, the hyperparameters (pseudo counts) can represent our prior belief about the parameters of the Markov chain model; the more we believe in them, the higher we want to set them. This means that the Dirichlet distribution represents probabilities of probabilities---or, to be more precise, probabilities over probability configurations. For each row of our transition matrix, we thus have a Dirichlet prior defined by the following probability density function:
$$ Dir(\alpha_{i}) = \frac{\Gamma(\sum_j \alpha_{ij})}{\prod_j \Gamma(\alpha_{ij})}\prod_j p_{ij}^{\alpha_{ij} - 1} $$$\Gamma$ is the gamma function, $\alpha_{ij} > 0$ for each $j$, and the $p_{ij}$ lie on the probability simplex defined by $\sum_j p_{ij} = 1$. The probability density outside of the simplex is $0$.
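To make this density concrete, we can evaluate it at a few points on the simplex. The sketch below uses `scipy.stats.dirichlet` (available in recent scipy versions); the $\alpha$ values are illustrative:

```python
from scipy.stats import dirichlet

# With the flat prior alpha = [1, 1, 1], the density is constant on the simplex
alpha_flat = [1, 1, 1]
print(dirichlet.pdf([0.2, 0.3, 0.5], alpha_flat))   # 1/B(1,1,1) = Gamma(3) = 2

# With symmetric alpha = [10, 10, 10], mass concentrates around [1/3, 1/3, 1/3]
alpha_peaked = [10, 10, 10]
center = [1 / 3.0, 1 / 3.0, 1 / 3.0]
off_center = [0.7, 0.2, 0.1]
print(dirichlet.pdf(center, alpha_peaked) > dirichlet.pdf(off_center, alpha_peaked))
```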
As for the Beta distribution, let us visualize some exemplary settings for the parameters $\alpha$ of a Dirichlet distribution. For simplicity, let us consider only three categories, leading to the hyperparameters $\alpha=[\alpha_1, \alpha_2, \alpha_3]$ which steer our belief about the underlying parameters $p=[p_1, p_2, p_3]$. We start with a symmetric distribution where all elements of $\alpha$ have the same value: we set all values of $\alpha$ to $1$ and increase them by $1$ in each iteration. We visualize the resulting Dirichlet densities on the simplex.
# Many thanks to http://blog.bogatron.net/blog/2014/02/02/visualizing-dirichlet-distributions/
# for providing the code for plotting Dirichlet distributions
import matplotlib.tri as tri
corners = np.array([[0, 0], [1, 0], [0.5, 0.75**0.5]])
triangle = tri.Triangulation(corners[:, 0], corners[:, 1])
midpoints = [(corners[(i + 1) % 3] + corners[(i + 2) % 3]) / 2.0
             for i in range(3)]

def xy2bc(xy, tol=1.e-3):
    '''Converts 2D Cartesian coordinates to barycentric.'''
    s = [(corners[i] - midpoints[i]).dot(xy - midpoints[i]) / 0.75
         for i in range(3)]
    return np.clip(s, tol, 1.0 - tol)
class Dirichlet(object):
    def __init__(self, alpha):
        from math import gamma
        from operator import mul
        self._alpha = np.array(alpha)
        self._coef = gamma(np.sum(self._alpha)) / \
                     reduce(mul, [gamma(a) for a in self._alpha])

    def pdf(self, x):
        '''Returns pdf value for `x`.'''
        from operator import mul
        return self._coef * reduce(mul, [xx ** (aa - 1)
                                         for (xx, aa) in zip(x, self._alpha)])
def draw_pdf_contours(dist, nlevels=200, subdiv=8, **kwargs):
    refiner = tri.UniformTriRefiner(triangle)
    trimesh = refiner.refine_triangulation(subdiv=subdiv)
    pvals = [dist.pdf(xy2bc(xy)) for xy in zip(trimesh.x, trimesh.y)]
    plt.tricontourf(trimesh, pvals, nlevels, **kwargs)
    plt.axis('equal')
    plt.xlim(0, 1)
    plt.ylim(0, 0.75**0.5)
    plt.axis('off')
# initializing plot
nlevels=200
subdiv=8
refiner = tri.UniformTriRefiner(triangle)
trimesh = refiner.refine_triangulation(subdiv=subdiv)
# initializing plot
def init_plot():
    fig = plt.figure()
    plt.axis('equal')
    plt.xlim(0, 1)
    plt.ylim(0, 0.75**0.5)
    plt.axis('off')
    return fig

# init function
def init():
    dist = Dirichlet(alpha)
    pvals = [dist.pdf(xy2bc(xy)) for xy in zip(trimesh.x, trimesh.y)]
    plt.tricontourf(trimesh, pvals, nlevels)

# animating the stuff
def animate(i):
    para = f(alpha, i)
    plt.title("$\\alpha=[" + ",".join([str(x) for x in para]) + "]$")
    dist = Dirichlet(para)
    pvals = [dist.pdf(xy2bc(xy)) for xy in zip(trimesh.x, trimesh.y)]
    plt.tricontourf(trimesh, pvals, nlevels)
# let's animate
# you can increase/decrease the FPS at the beginning of this notebook
fig = init_plot()
alpha = [1, 1, 1]
f = lambda a,i: [x+i for x in a]
# change our ugly global fps variable
fps = 5
animation.FuncAnimation(fig, animate, init_func=init,
frames=30, interval=1, blit=True)
In this simplex, various parameter configurations are depicted. Note that these parameter configurations represent the $p$ values of our three-state categorical distribution, not the hyperparameters $\alpha$ set for the Dirichlet distribution. This is similar to the Beta distribution, which expresses the probability distribution over configurations of the single parameter $p$ (e.g., flipping "heads"). So to note again: we have probabilities of probabilities.
Thus, the bottom left corner of the simplex corresponds to the parameter configuration $p=[1.0,0.0,0.0]$, while the bottom right corner represents the configuration $p=[0.0,1.0,0.0]$ and the top corner represents $p=[0.0,0.0,1.0]$. The most central point corresponds to the $p=[1/3,1/3,1/3]$ configuration.
With our symmetric pseudo count configuration, we start with $\alpha_1=\alpha_2=\alpha_3=1$ and thus with a flat prior that is uniform on the simplex, meaning that all parameter configurations for $p$ are equally likely. However, with an increasing concentration parameter (the sum of all values of $\alpha$), we strengthen our symmetric belief and decrease our belief in other parameter configurations. Thus, we have more and more belief in the $p=[1/3,1/3,1/3]$ parameter configuration.
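This concentration effect can be quantified with a small sketch using `scipy.stats.dirichlet` (the $\alpha$ values are illustrative): for symmetric $\alpha$, the mean stays at $[1/3, 1/3, 1/3]$ while the variance of each component shrinks as the concentration grows.

```python
from scipy.stats import dirichlet

# Symmetric alphas with growing concentration parameter
for c in [1, 10, 100]:
    alpha = [c, c, c]
    print(c, dirichlet.mean(alpha), dirichlet.var(alpha))

# The per-component variance shrinks as the concentration grows
v_low = dirichlet.var([1, 1, 1])[0]
v_high = dirichlet.var([100, 100, 100])[0]
print(v_low > v_high)   # True: the belief concentrates
```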
Let us consider another example, where we believe that $p_1$ and $p_3$ are equally important, while $p_2$ is unimportant. Thus, we increase the pseudo counts for $\alpha_1$ and $\alpha_3$ with each iteration and leave $\alpha_2=1$.
fig = init_plot()
alpha = [1, 1, 1]
f = lambda a,i: [a[0]+i, a[0], a[0]+i]
animation.FuncAnimation(fig, animate, init_func=init,
frames=30, interval=1, blit=True)