Notebook

pyroで打率の推定モデルの実行¶

1 pyroで打率の推定モデルの実行
2 データ
3 Model
4 DATA SUMMARIZE UTILS
5 モデル評価用関数
6 実行
7 ToDo
8 Link
9 その他

pyroはpytorchをベースとした確率的プログラミングのためのフレームワークです。 pytorchの性能や早い時期にNUTSが実装されていることや書き方がtensorflow probabilityに比べ簡潔なところもあって注目されています。 Uberで開発され使われているようですがDropout as variational bayesなどで有名なGhahramani先生がチーフサイエンティストらしいので開発に関わっているのかもしれません。

サンプルとして公開されている野球選手のシーズン中の打数、打率のデータから打率を推定するコードではstanでの実施レポート　に沿ってデータの取得、モデルの定義と実行、事後予測分布(posterior predictive distribution)の表示と評価がされており、　以下ではそれを追って実行します。

stanのレポートの方ではさらに

estimate event probabilities
evaluate posterior predictive densities to evaluate model predictions on held-out data
rank items by chance of success
perform multiple comparisons in several settings
replicate new data for posterior p-values
perform graphical posterior predictive checks

といった内容があります。

In [2]:

from __future__ import absolute_import, division, print_function

import argparse
import logging
import math
import os

import numpy as np
import pandas as pd
import torch

import pyro
import pyro.poutine as poutine
from pyro.distributions import Beta, Binomial, HalfCauchy, Normal, Pareto, Uniform
from pyro.distributions.util import logsumexp
from pyro.infer import EmpiricalMarginal
from pyro.infer.abstract_infer import TracePredictive
from pyro.infer.mcmc import MCMC, NUTS
pyro.__version__

Out[2]:

'0.3.0'

In [3]:

import matplotlib.pyplot as plt
plt.style.use("ggplot")
import seaborn as sns

In [4]:

logging.basicConfig(format='%(message)s', level=logging.INFO)
# Enable validation checks

データ¶

データのダウンロードと表示

In [6]:

pyro.enable_validation(True)
DATA_URL = "https://d2fefpcigoriu7.cloudfront.net/datasets/EfronMorrisBB.txt"
baseball_dataset = pd.read_csv(DATA_URL, "\t")

18人の打者のデータ

In [8]:

baseball_dataset.head()

Out[8]:

	FirstName	LastName	At-Bats	Hits	BattingAverage	RemainingAt-Bats	RemainingAverage	SeasonAt-Bats	SeasonHits	SeasonAverage
0	Roberto	Clemente	45	18	0.400	367	0.3460	412	145	0.352
1	Frank	Robinson	45	17	0.378	426	0.2981	471	144	0.306
2	Frank	Howard	45	16	0.356	521	0.2764	566	160	0.283
3	Jay	Johnstone	45	15	0.333	275	0.2218	320	76	0.238
4	Ken	Berry	45	14	0.311	418	0.2727	463	128	0.276

In [35]:

baseball_dataset["BattingAverage"].plot.bar()
plt.show()
baseball_dataset["RemainingAverage"].plot.bar()
plt.show()
baseball_dataset["SeasonAverage"].plot.bar()
plt.show()

名字と名前をくっつけてtrain data(打率とヒットの数), test data(全シーズンでの打率とヒット数)を分離する。

In [7]:

def train_test_split(pd_dataframe):
    """
    Training data - 45 initial at-bats and hits for each player.
    Validation data - Full season at-bats and hits for each player.
    """
    train_data = torch.tensor(pd_dataframe[["At-Bats", "Hits"]].values, dtype=torch.float)
    test_data = torch.tensor(pd_dataframe[["SeasonAt-Bats", "SeasonHits"]].values, dtype=torch.float)
    first_name = pd_dataframe["FirstName"].values
    last_name = pd_dataframe["LastName"].values
    player_names = [" ".join([first, last]) for first, last in zip(first_name, last_name)]
    return train_data, test_data, player_names

In [8]:

train, _, player_names = train_test_split(baseball_dataset)
at_bats, hits = train[:, 0], train[:, 1]
print(train.shape)

torch.Size([18, 2])

In [49]:

baseball_dataset.columns

Out[49]:

Index(['FirstName', 'LastName', 'At-Bats', 'Hits', 'BattingAverage',
       'RemainingAt-Bats', 'RemainingAverage', 'SeasonAt-Bats', 'SeasonHits',
       'SeasonAverage'],
      dtype='object')

In [51]:

sns.pairplot(baseball_dataset[['At-Bats', 'Hits', 'BattingAverage',
       'RemainingAt-Bats', 'RemainingAverage', 'SeasonAt-Bats', 'SeasonHits',
       'SeasonAverage']])
plt.show()

Model¶

打率(at_bats)からのヒットの数の推定に4つのモデル

Fully pooled model
Not pooled model
partially pooled model
partially pooled model(with logit)

を試している。階層モデルになってるがpyro.sampleの出力を別の確率変数にパラメータとして代入するような書き方になっている。

Fully pooled model(complete pooling)¶

それぞれの打者の打率を独立のベルヌーイ分布に従うとするモデル、pyroでは統計モデルは関数として定義する。

$\phi \sim Uniform(0,1)$

$y_n \sim Binominal(K_n,\phi)$

$K_n$は選手nの打率(モデルの引数) ハイパーパラメータ$\phi＄は[0,1]間の一様分布、new_tensor(0),new_tensor(1)はモデルで使う定数

観測データは出力されるsample関数にobs引数として追加することで関係を示している。

In [10]:

def fully_pooled(at_bats, hits):
    r"""
    Number of hits in $K$ at bats for each player has a Binomial
    distribution with a common probability of success, $\phi$.
    
    :param (torch.Tensor) at_bats: Number of at bats for each player.
    :return: Number of hits predicted by the model.
    """
    phi_prior = Uniform(at_bats.new_tensor(0), at_bats.new_tensor(1))
    phi = pyro.sample("phi", phi_prior)
    return pyro.sample("obs", Binomial(at_bats, phi), obs=hits)

Not pooled model¶

$\theta_n \sim Uniform(0,1)$

$y_n \sim Binominal(K_n,\theta_n)$

それぞれの選手のハイパーパラメータ$\theta_n$が独立になっている。

pyro.plateという関数で内部で使うパラメータを定義している(ように見える)

In [11]:

def not_pooled(at_bats, hits):
    r"""
    Number of hits in $K$ at bats for each player has a Binomial
    distribution with independent probability of success, $\phi_i$.
    
    :param (torch.Tensor) at_bats: Number of at bats for each player.
    :param (torch.Tensor) hits: Number of hits for the given at bats.
    :return: Number of hits predicted by the model.
    """
    num_players = at_bats.shape[0]
    with pyro.plate("num_players", num_players):
        phi_prior = Uniform(at_bats.new_tensor(0), at_bats.new_tensor(1))
        phi = pyro.sample("phi", phi_prior)
        return pyro.sample("obs", Binomial(at_bats, phi), obs=hits)
    

partially pooled model¶

$\theta_n \sim Beta(\alpha,\beta)$

$\alpha=\kappa\phi,\beta=\kappa(1-\phi)$

$\phi \sim Uniform(0,1)$

$\kappa \sim Pareto(1,1.5) \propto \kappa^{2.5}$

$y_n \sim Binominal(K_n,\theta_n)$

$\theta$を[0,1]内に収めるためにbeta分布を用いている階層モデル。さらにサンプリングで$\alpha,\beta$を範囲内に収めるために$\kappa,\phi$を定義している。pyroではPareto分布も使える！

In [12]:

def partially_pooled(at_bats, hits):
    r"""
    Number of hits has a Binomial distribution with independent
    probability of success, $\phi_i$. Each $\phi_i$ follows a Beta
    distribution with concentration parameters $c_1$ and $c_2$, where
    $c_1 = m * kappa$, $c_2 = (1 - m) * kappa$, $m ~ Uniform(0, 1)$,
    and $kappa ~ Pareto(1, 1.5)$.
    :param (torch.Tensor) at_bats: Number of at bats for each player.
    :param (torch.Tensor) hits: Number of hits for the given at bats.
    :return: Number of hits predicted by the model.
    """
    num_players = at_bats.shape[0]
    m = pyro.sample("m", Uniform(at_bats.new_tensor(0), at_bats.new_tensor(1)))
    kappa = pyro.sample("kappa", Pareto(at_bats.new_tensor(1), at_bats.new_tensor(1.5)))
    with pyro.plate("num_players", num_players):
        phi_prior = Beta(m * kappa, (1 - m) * kappa)
        phi = pyro.sample("phi", phi_prior)
        return pyro.sample("obs", Binomial(at_bats, phi), obs=hits)

$\alpha_n = \rm logit(\theta_n) =\frac{\theta_n}{1-\theta_n}$

サンプルするパラメータの出力にlogitをかませたversion。$\alpha \in (-\infty,\infty)＄でありこれによって$\theta$が0,1近辺の時に浮動小数点演算の桁あふれ、不安定性を防ぐ。 Binominalにはオプションとしてlogits引数がある。

In [14]:

def partially_pooled_with_logit(at_bats, hits):
    r"""
    Number of hits has a Binomial distribution with a logit link function.
    The logits $\alpha$ for each player is normally distributed with the
    mean and scale parameters sharing a common prior.
    
    :param (torch.Tensor) at_bats: Number of at bats for each player.
    :param (torch.Tensor) hits: Number of hits for the given at bats.
    :return: Number of hits predicted by the model.
    """
    
    num_players = at_bats.shape[0]
    loc = pyro.sample("loc", Normal(at_bats.new_tensor(-1), at_bats.new_tensor(1)))
    scale = pyro.sample("scale", HalfCauchy(scale=at_bats.new_tensor(1)))
    with pyro.plate("num_players", num_players):
        alpha = pyro.sample("alpha", Normal(loc, scale))
        return pyro.sample("obs", Binomial(at_bats, logits=alpha), obs=hits)
    

stanのレポートではさらに事後分布を均一化させるためにNon-Centered Parameterizationという変数変換を行なっている(参考 )。

DATA SUMMARIZE UTILS¶

平均、分散、四分位点の計算、traceからの要約統計やRhatの表示のための関数

In [23]:

def get_site_stats(array, player_names):
    """
    Return the summarized statistics for a given array corresponding
    to the values sampled for a latent or response site.
    """
    # TODO: only use pandas (or any lightweight tabular package) to display final result
    if len(array.shape) == 1:
        df = pd.DataFrame(array).transpose()
    else:
        df = pd.DataFrame(array, columns=player_names).transpose()
    return df.apply(pd.Series.describe, axis=1)[["mean", "std", "25%", "50%", "75%"]]


def summary(trace_posterior, sites, player_names, transforms={}, diagnostics=True):
    """
    Return summarized statistics for each of the ``sites`` in the
    traces corresponding to the approximate posterior.
    """
    marginal = trace_posterior.marginal(sites)
    site_stats = {}
    for site_name in sites:
        marginal_site = marginal.support(flatten=True)[site_name]
        if site_name in transforms:
            marginal_site = transforms[site_name](marginal_site)

        site_stats[site_name] = get_site_stats(marginal_site.numpy(), player_names)
        if diagnostics and trace_posterior.num_chains > 1:
            diag = marginal.diagnostics()[site_name]
            site_stats[site_name] = site_stats[site_name].assign(n_eff=diag["n_eff"].numpy(),
                                                                 r_hat=diag["r_hat"].numpy())
    return site_stats

モデル評価用関数¶

posterior predictive densities評価のための関数

Inference Utiliyとして周辺分布の取得などが用意されているのでそれを利用している。

http://docs.pyro.ai/en/0.2.1-release/inference_algos.html#module-pyro.infer.abstract_infer

In [21]:

def sample_posterior_predictive(posterior_predictive, baseball_dataset):
    """
    Generate samples from posterior predictive distribution.
    """
    train, test, player_names = train_test_split(baseball_dataset)
    at_bats = train[:, 0]
    at_bats_season = test[:, 0]
    logging.Formatter("%(message)s")
    logging.info("\nPosterior Predictive:")
    logging.info("Hit Rate - Initial 45 At Bats")
    logging.info("-----------------------------")
    # set hits=None to convert it from observation node to sample node
    train_predict = posterior_predictive.run(at_bats, None)
    train_summary = summary(train_predict, sites=["obs"],
                            player_names=player_names, diagnostics=False)["obs"]
    train_summary = train_summary.assign(ActualHits=baseball_dataset[["Hits"]].values)
    logging.info(train_summary)
    logging.info("\nHit Rate - Season Predictions")
    logging.info("-----------------------------")
    test_predict = posterior_predictive.run(at_bats_season, None)
    test_summary = summary(test_predict, sites=["obs"],
                           player_names=player_names, diagnostics=False)["obs"]
    test_summary = test_summary.assign(ActualHits=baseball_dataset[["SeasonHits"]].values)
    logging.info(test_summary)


def evaluate_log_predictive_density(posterior_predictive, baseball_dataset):
    """
    Evaluate the log probability density of observing the unseen data (season hits)
    given a model and empirical distribution over the parameters.
    """
    _, test, player_names = train_test_split(baseball_dataset)
    at_bats_season, hits_season = test[:, 0], test[:, 1]
    test_eval = posterior_predictive.run(at_bats_season, hits_season)
    trace_log_pdf = []
    for tr in test_eval.exec_traces:
        trace_log_pdf.append(tr.log_prob_sum())
    # Use LogSumExp trick to evaluate $log(1/num_samples \sum_i p(new_data | \theta^{i})) $,
    # where $\theta^{i}$ are parameter samples from the model's posterior.
    posterior_pred_density = logsumexp(torch.stack(trace_log_pdf), dim=-1) - math.log(len(trace_log_pdf))
    logging.info("\nLog posterior predictive density")
    logging.info("--------------------------------")
    logging.info("{:.4f}\n".format(posterior_pred_density))

実行¶

オブジェクトとしてNUTS kernelを作りそこにmodelを代入する。

MCMCクラス、run関数でサンプリングを実行する。その後上で定義した関数(sample_posterior_predictive, evaluate_log_predictive_density)を用いて事後分布の値の範囲、Rhat, Log posterior predictive densityを表示する。

In [17]:

num_samples=200
num_chains=3
warmup_steps=100

num_predictive_samples = num_samples * num_chains

In [18]:

rng_seed=52
pyro.set_rng_seed(rng_seed)

Full Pooling Model¶

In [24]:

nuts_kernel = NUTS(fully_pooled)
posterior_fully_pooled = MCMC(nuts_kernel,
                              num_samples=num_samples,
                              warmup_steps=warmup_steps,
                              num_chains=num_chains).run(at_bats, hits)
logging.info("\nModel: Fully Pooled")
logging.info("===================")
logging.info("\nphi:")
logging.info(summary(posterior_fully_pooled, sites=["phi"], player_names=player_names)["phi"])
posterior_predictive = TracePredictive(fully_pooled,
                                       posterior_fully_pooled,
                                       num_samples=num_predictive_samples)
sample_posterior_predictive(posterior_predictive, baseball_dataset)
evaluate_log_predictive_density(posterior_predictive, baseball_dataset)

Widget Javascript not detected.  It may not be installed or enabled properly.

Widget Javascript not detected.  It may not be installed or enabled properly.

Widget Javascript not detected.  It may not be installed or enabled properly.

Model: Fully Pooled
===================

phi:
       mean       std       25%       50%       75%  n_eff     r_hat
0  0.267115  0.015634  0.254908  0.267284  0.278384    NaN  1.012708

Posterior Predictive:
Hit Rate - Initial 45 At Bats
-----------------------------
                  mean  std   25%   50%   75%  ActualHits
Roberto Clemente  18.0  0.0  18.0  18.0  18.0          18
Frank Robinson    17.0  0.0  17.0  17.0  17.0          17
Frank Howard      16.0  0.0  16.0  16.0  16.0          16
Jay Johnstone     15.0  0.0  15.0  15.0  15.0          15
Ken Berry         14.0  0.0  14.0  14.0  14.0          14
Jim Spencer       14.0  0.0  14.0  14.0  14.0          14
Don Kessinger     13.0  0.0  13.0  13.0  13.0          13
Luis Alvarado     12.0  0.0  12.0  12.0  12.0          12
Ron Santo         11.0  0.0  11.0  11.0  11.0          11
Ron Swaboda       11.0  0.0  11.0  11.0  11.0          11
Rico Petrocelli   10.0  0.0  10.0  10.0  10.0          10
Ellie Rodriguez   10.0  0.0  10.0  10.0  10.0          10
George Scott      10.0  0.0  10.0  10.0  10.0          10
Del Unser         10.0  0.0  10.0  10.0  10.0          10
Billy Williams    10.0  0.0  10.0  10.0  10.0          10
Bert Campaneris    9.0  0.0   9.0   9.0   9.0           9
Thurman Munson     8.0  0.0   8.0   8.0   8.0           8
Max Alvis          7.0  0.0   7.0   7.0   7.0           7

Hit Rate - Season Predictions
-----------------------------
                  mean  std   25%   50%   75%  ActualHits
Roberto Clemente  18.0  0.0  18.0  18.0  18.0         145
Frank Robinson    17.0  0.0  17.0  17.0  17.0         144
Frank Howard      16.0  0.0  16.0  16.0  16.0         160
Jay Johnstone     15.0  0.0  15.0  15.0  15.0          76
Ken Berry         14.0  0.0  14.0  14.0  14.0         128
Jim Spencer       14.0  0.0  14.0  14.0  14.0         140
Don Kessinger     13.0  0.0  13.0  13.0  13.0         168
Luis Alvarado     12.0  0.0  12.0  12.0  12.0          41
Ron Santo         11.0  0.0  11.0  11.0  11.0         148
Ron Swaboda       11.0  0.0  11.0  11.0  11.0          57
Rico Petrocelli   10.0  0.0  10.0  10.0  10.0         152
Ellie Rodriguez   10.0  0.0  10.0  10.0  10.0          52
George Scott      10.0  0.0  10.0  10.0  10.0         142
Del Unser         10.0  0.0  10.0  10.0  10.0          83
Billy Williams    10.0  0.0  10.0  10.0  10.0         205
Bert Campaneris    9.0  0.0   9.0   9.0   9.0         168
Thurman Munson     8.0  0.0   8.0   8.0   8.0         137
Max Alvis          7.0  0.0   7.0   7.0   7.0          21

Log posterior predictive density
--------------------------------
-77.0128

No Pooling Model¶

In [25]:

nuts_kernel = NUTS(not_pooled)
posterior_not_pooled = MCMC(nuts_kernel,
                            num_samples=num_samples,
                            warmup_steps=warmup_steps,
                            num_chains=num_chains).run(at_bats, hits)

logging.info("\nModel: Not Pooled")
logging.info("=================")
logging.info("\nphi:")
logging.info(summary(posterior_not_pooled, sites=["phi"], player_names=player_names)["phi"])
posterior_predictive = TracePredictive(not_pooled,
                                       posterior_not_pooled,
                                       num_samples=num_predictive_samples)
sample_posterior_predictive(posterior_predictive, baseball_dataset)
evaluate_log_predictive_density(posterior_predictive, baseball_dataset)

Widget Javascript not detected.  It may not be installed or enabled properly.

Widget Javascript not detected.  It may not be installed or enabled properly.

Widget Javascript not detected.  It may not be installed or enabled properly.

Model: Not Pooled
=================

phi:
                      mean       std       25%       50%       75%  n_eff  \
Roberto Clemente  0.399943  0.075035  0.346843  0.396658  0.452164    NaN   
Frank Robinson    0.380921  0.070193  0.334455  0.375778  0.428543    NaN   
Frank Howard      0.362561  0.071634  0.312462  0.358348  0.413378    NaN   
Jay Johnstone     0.336355  0.071878  0.289077  0.329880  0.382827    NaN   
Ken Berry         0.314747  0.065654  0.270413  0.310458  0.356329    NaN   
Jim Spencer       0.315158  0.062964  0.268278  0.311988  0.358840    NaN   
Don Kessinger     0.301659  0.068475  0.252892  0.298465  0.349038    NaN   
Luis Alvarado     0.283721  0.062200  0.239647  0.282782  0.324852    NaN   
Ron Santo         0.258056  0.059806  0.215130  0.255764  0.297286    NaN   
Ron Swaboda       0.256684  0.067631  0.208116  0.255430  0.299480    NaN   
Rico Petrocelli   0.229894  0.059704  0.186203  0.224607  0.270622    NaN   
Ellie Rodriguez   0.236017  0.057246  0.191330  0.234975  0.274804    NaN   
George Scott      0.234043  0.059666  0.190660  0.231042  0.274553    NaN   
Del Unser         0.236534  0.061990  0.193895  0.231431  0.276140    NaN   
Billy Williams    0.236507  0.057134  0.193631  0.233240  0.275560    NaN   
Bert Campaneris   0.212033  0.061096  0.168675  0.206219  0.252786    NaN   
Thurman Munson    0.191130  0.057044  0.146603  0.187605  0.228246    NaN   
Max Alvis         0.171947  0.054420  0.131149  0.166210  0.210815    NaN   

                     r_hat  
Roberto Clemente  0.996561  
Frank Robinson    0.996544  
Frank Howard      0.997555  
Jay Johnstone     0.997095  
Ken Berry         1.001117  
Jim Spencer       0.997749  
Don Kessinger     0.998231  
Luis Alvarado     1.000688  
Ron Santo         0.999810  
Ron Swaboda       0.998106  
Rico Petrocelli   0.996533  
Ellie Rodriguez   0.999227  
George Scott      0.998112  
Del Unser         0.998835  
Billy Williams    0.995157  
Bert Campaneris   1.002228  
Thurman Munson    0.999830  
Max Alvis         0.996702  

Posterior Predictive:
Hit Rate - Initial 45 At Bats
-----------------------------
                  mean  std   25%   50%   75%  ActualHits
Roberto Clemente  18.0  0.0  18.0  18.0  18.0          18
Frank Robinson    17.0  0.0  17.0  17.0  17.0          17
Frank Howard      16.0  0.0  16.0  16.0  16.0          16
Jay Johnstone     15.0  0.0  15.0  15.0  15.0          15
Ken Berry         14.0  0.0  14.0  14.0  14.0          14
Jim Spencer       14.0  0.0  14.0  14.0  14.0          14
Don Kessinger     13.0  0.0  13.0  13.0  13.0          13
Luis Alvarado     12.0  0.0  12.0  12.0  12.0          12
Ron Santo         11.0  0.0  11.0  11.0  11.0          11
Ron Swaboda       11.0  0.0  11.0  11.0  11.0          11
Rico Petrocelli   10.0  0.0  10.0  10.0  10.0          10
Ellie Rodriguez   10.0  0.0  10.0  10.0  10.0          10
George Scott      10.0  0.0  10.0  10.0  10.0          10
Del Unser         10.0  0.0  10.0  10.0  10.0          10
Billy Williams    10.0  0.0  10.0  10.0  10.0          10
Bert Campaneris    9.0  0.0   9.0   9.0   9.0           9
Thurman Munson     8.0  0.0   8.0   8.0   8.0           8
Max Alvis          7.0  0.0   7.0   7.0   7.0           7

Hit Rate - Season Predictions
-----------------------------
                  mean  std   25%   50%   75%  ActualHits
Roberto Clemente  18.0  0.0  18.0  18.0  18.0         145
Frank Robinson    17.0  0.0  17.0  17.0  17.0         144
Frank Howard      16.0  0.0  16.0  16.0  16.0         160
Jay Johnstone     15.0  0.0  15.0  15.0  15.0          76
Ken Berry         14.0  0.0  14.0  14.0  14.0         128
Jim Spencer       14.0  0.0  14.0  14.0  14.0         140
Don Kessinger     13.0  0.0  13.0  13.0  13.0         168
Luis Alvarado     12.0  0.0  12.0  12.0  12.0          41
Ron Santo         11.0  0.0  11.0  11.0  11.0         148
Ron Swaboda       11.0  0.0  11.0  11.0  11.0          57
Rico Petrocelli   10.0  0.0  10.0  10.0  10.0         152
Ellie Rodriguez   10.0  0.0  10.0  10.0  10.0          52
George Scott      10.0  0.0  10.0  10.0  10.0         142
Del Unser         10.0  0.0  10.0  10.0  10.0          83
Billy Williams    10.0  0.0  10.0  10.0  10.0         205
Bert Campaneris    9.0  0.0   9.0   9.0   9.0         168
Thurman Munson     8.0  0.0   8.0   8.0   8.0         137
Max Alvis          7.0  0.0   7.0   7.0   7.0          21

Log posterior predictive density
--------------------------------
-99.6108

Rhatが1以下の$\phi$選手が多く収束していない。posterior predictive densityの値も小さい。

Partially Pooled Model¶

In [27]:

    nuts_kernel = NUTS(partially_pooled)
    posterior_partially_pooled = MCMC(nuts_kernel,
                                      num_samples=num_samples,
                                      warmup_steps=warmup_steps,
                                      num_chains=num_chains).run(at_bats, hits)
    logging.info("\nModel: Partially Pooled")
    logging.info("=======================")
    logging.info("\nphi:")
    logging.info(summary(posterior_partially_pooled, sites=["phi"],
                         player_names=player_names)["phi"])
    posterior_predictive = TracePredictive(partially_pooled,
                                           posterior_partially_pooled,
                                           num_samples=num_predictive_samples)
    sample_posterior_predictive(posterior_predictive, baseball_dataset)
    evaluate_log_predictive_density(posterior_predictive, baseball_dataset)

Widget Javascript not detected.  It may not be installed or enabled properly.

Widget Javascript not detected.  It may not be installed or enabled properly.

Widget Javascript not detected.  It may not be installed or enabled properly.

Widget Javascript not detected.  It may not be installed or enabled properly.
Widget Javascript not detected.  It may not be installed or enabled properly.

Model: Partially Pooled
=======================

phi:
                      mean       std       25%       50%       75%  n_eff  \
Roberto Clemente  0.330732  0.056697  0.287433  0.329061  0.367337    NaN   
Frank Robinson    0.318623  0.053476  0.281985  0.313461  0.349818    NaN   
Frank Howard      0.312894  0.053505  0.274333  0.307558  0.347047    NaN   
Jay Johnstone     0.295636  0.050027  0.261132  0.291267  0.324851    NaN   
Ken Berry         0.288313  0.046968  0.256531  0.284000  0.318049    NaN   
Jim Spencer       0.287504  0.045141  0.256558  0.284777  0.318844    NaN   
Don Kessinger     0.278573  0.048426  0.246480  0.277425  0.309384    NaN   
Luis Alvarado     0.266445  0.045962  0.236697  0.264992  0.295940    NaN   
Ron Santo         0.259523  0.047678  0.228490  0.259872  0.289555    NaN   
Ron Swaboda       0.259049  0.046129  0.228279  0.256043  0.290896    NaN   
Rico Petrocelli   0.246646  0.045131  0.216728  0.246783  0.272855    NaN   
Ellie Rodriguez   0.244179  0.048296  0.213908  0.244600  0.277833    NaN   
George Scott      0.249798  0.044556  0.220636  0.249137  0.280757    NaN   
Del Unser         0.245764  0.041945  0.217120  0.245693  0.273705    NaN   
Billy Williams    0.248844  0.043270  0.220262  0.249451  0.277614    NaN   
Bert Campaneris   0.234063  0.045807  0.203822  0.234369  0.263453    NaN   
Thurman Munson    0.226682  0.046412  0.198276  0.225546  0.260829    NaN   
Max Alvis         0.215490  0.043255  0.185727  0.214943  0.244984    NaN   

                     r_hat  
Roberto Clemente  1.008639  
Frank Robinson    1.004020  
Frank Howard      1.005832  
Jay Johnstone     1.014933  
Ken Berry         1.005918  
Jim Spencer       1.021570  
Don Kessinger     1.002926  
Luis Alvarado     1.003754  
Ron Santo         0.997998  
Ron Swaboda       1.003618  
Rico Petrocelli   1.004025  
Ellie Rodriguez   1.004003  
George Scott      0.999369  
Del Unser         1.002674  
Billy Williams    1.000619  
Bert Campaneris   1.006408  
Thurman Munson    1.010476  
Max Alvis         1.008355  

Posterior Predictive:
Hit Rate - Initial 45 At Bats
-----------------------------
                  mean  std   25%   50%   75%  ActualHits
Roberto Clemente  18.0  0.0  18.0  18.0  18.0          18
Frank Robinson    17.0  0.0  17.0  17.0  17.0          17
Frank Howard      16.0  0.0  16.0  16.0  16.0          16
Jay Johnstone     15.0  0.0  15.0  15.0  15.0          15
Ken Berry         14.0  0.0  14.0  14.0  14.0          14
Jim Spencer       14.0  0.0  14.0  14.0  14.0          14
Don Kessinger     13.0  0.0  13.0  13.0  13.0          13
Luis Alvarado     12.0  0.0  12.0  12.0  12.0          12
Ron Santo         11.0  0.0  11.0  11.0  11.0          11
Ron Swaboda       11.0  0.0  11.0  11.0  11.0          11
Rico Petrocelli   10.0  0.0  10.0  10.0  10.0          10
Ellie Rodriguez   10.0  0.0  10.0  10.0  10.0          10
George Scott      10.0  0.0  10.0  10.0  10.0          10
Del Unser         10.0  0.0  10.0  10.0  10.0          10
Billy Williams    10.0  0.0  10.0  10.0  10.0          10
Bert Campaneris    9.0  0.0   9.0   9.0   9.0           9
Thurman Munson     8.0  0.0   8.0   8.0   8.0           8
Max Alvis          7.0  0.0   7.0   7.0   7.0           7

Hit Rate - Season Predictions
-----------------------------
                  mean  std   25%   50%   75%  ActualHits
Roberto Clemente  18.0  0.0  18.0  18.0  18.0         145
Frank Robinson    17.0  0.0  17.0  17.0  17.0         144
Frank Howard      16.0  0.0  16.0  16.0  16.0         160
Jay Johnstone     15.0  0.0  15.0  15.0  15.0          76
Ken Berry         14.0  0.0  14.0  14.0  14.0         128
Jim Spencer       14.0  0.0  14.0  14.0  14.0         140
Don Kessinger     13.0  0.0  13.0  13.0  13.0         168
Luis Alvarado     12.0  0.0  12.0  12.0  12.0          41
Ron Santo         11.0  0.0  11.0  11.0  11.0         148
Ron Swaboda       11.0  0.0  11.0  11.0  11.0          57
Rico Petrocelli   10.0  0.0  10.0  10.0  10.0         152
Ellie Rodriguez   10.0  0.0  10.0  10.0  10.0          52
George Scott      10.0  0.0  10.0  10.0  10.0         142
Del Unser         10.0  0.0  10.0  10.0  10.0          83
Billy Williams    10.0  0.0  10.0  10.0  10.0         205
Bert Campaneris    9.0  0.0   9.0   9.0   9.0         168
Thurman Munson     8.0  0.0   8.0   8.0   8.0         137
Max Alvis          7.0  0.0   7.0   7.0   7.0          21

Log posterior predictive density
--------------------------------
-46.9007

Partially Pooled with Logit Model¶

In [29]:

nuts_kernel = NUTS(partially_pooled_with_logit)
posterior_partially_pooled_with_logit = MCMC(nuts_kernel,
                                             num_samples=num_samples,
                                             warmup_steps=warmup_steps,
                                             num_chains=num_chains).run(at_bats, hits)
logging.info("\nModel: Partially Pooled with Logit")
logging.info("==================================")
logging.info("\nSigmoid(alpha):")
logging.info(summary(posterior_partially_pooled_with_logit,
                     sites=["alpha"],
                     player_names=player_names,
                     transforms={"alpha": lambda x: 1. / (1 + (-x).exp())})["alpha"])
posterior_predictive = TracePredictive(partially_pooled_with_logit,
                                       posterior_partially_pooled_with_logit,
                                       num_samples=num_predictive_samples)
sample_posterior_predictive(posterior_predictive, baseball_dataset)
evaluate_log_predictive_density(posterior_predictive, baseball_dataset)

Widget Javascript not detected.  It may not be installed or enabled properly.

Widget Javascript not detected.  It may not be installed or enabled properly.

Widget Javascript not detected.  It may not be installed or enabled properly.

Model: Partially Pooled with Logit
==================================

Sigmoid(alpha):
                      mean       std       25%       50%       75%  n_eff  \
Roberto Clemente  0.296156  0.040719  0.266291  0.290531  0.320242    NaN   
Frank Robinson    0.290568  0.039773  0.264530  0.284151  0.312349    NaN   
Frank Howard      0.287571  0.039618  0.261698  0.280873  0.308512    NaN   
Jay Johnstone     0.281311  0.036731  0.254630  0.278680  0.301954    NaN   
Ken Berry         0.275172  0.034034  0.251361  0.273174  0.296399    NaN   
Jim Spencer       0.275183  0.038243  0.248527  0.271953  0.294773    NaN   
Don Kessinger     0.269896  0.034686  0.246947  0.265581  0.290342    NaN   
Luis Alvarado     0.265240  0.035537  0.244533  0.264933  0.287217    NaN   
Ron Santo         0.259697  0.034639  0.239275  0.262939  0.279452    NaN   
Ron Swaboda       0.260685  0.035905  0.235296  0.259989  0.281871    NaN   
Rico Petrocelli   0.253204  0.034641  0.232819  0.255023  0.273528    NaN   
Ellie Rodriguez   0.254260  0.031659  0.235744  0.254733  0.273609    NaN   
George Scott      0.253762  0.032302  0.234679  0.255867  0.275410    NaN   
Del Unser         0.255131  0.031011  0.237398  0.257930  0.274724    NaN   
Billy Williams    0.253935  0.032343  0.235494  0.254280  0.274392    NaN   
Bert Campaneris   0.249635  0.037554  0.228033  0.251631  0.272752    NaN   
Thurman Munson    0.243726  0.035686  0.222682  0.246175  0.265775    NaN   
Max Alvis         0.240035  0.036551  0.218575  0.244242  0.265604    NaN   

                     r_hat  
Roberto Clemente  1.013094  
Frank Robinson    1.026909  
Frank Howard      1.026840  
Jay Johnstone     1.012468  
Ken Berry         1.004120  
Jim Spencer       1.003538  
Don Kessinger     0.998158  
Luis Alvarado     0.995409  
Ron Santo         0.998881  
Ron Swaboda       1.007532  
Rico Petrocelli   0.998102  
Ellie Rodriguez   0.999133  
George Scott      0.998382  
Del Unser         1.008240  
Billy Williams    1.000821  
Bert Campaneris   0.995667  
Thurman Munson    1.002400  
Max Alvis         1.012194  

Posterior Predictive:
Hit Rate - Initial 45 At Bats
-----------------------------
                  mean  std   25%   50%   75%  ActualHits
Roberto Clemente  18.0  0.0  18.0  18.0  18.0          18
Frank Robinson    17.0  0.0  17.0  17.0  17.0          17
Frank Howard      16.0  0.0  16.0  16.0  16.0          16
Jay Johnstone     15.0  0.0  15.0  15.0  15.0          15
Ken Berry         14.0  0.0  14.0  14.0  14.0          14
Jim Spencer       14.0  0.0  14.0  14.0  14.0          14
Don Kessinger     13.0  0.0  13.0  13.0  13.0          13
Luis Alvarado     12.0  0.0  12.0  12.0  12.0          12
Ron Santo         11.0  0.0  11.0  11.0  11.0          11
Ron Swaboda       11.0  0.0  11.0  11.0  11.0          11
Rico Petrocelli   10.0  0.0  10.0  10.0  10.0          10
Ellie Rodriguez   10.0  0.0  10.0  10.0  10.0          10
George Scott      10.0  0.0  10.0  10.0  10.0          10
Del Unser         10.0  0.0  10.0  10.0  10.0          10
Billy Williams    10.0  0.0  10.0  10.0  10.0          10
Bert Campaneris    9.0  0.0   9.0   9.0   9.0           9
Thurman Munson     8.0  0.0   8.0   8.0   8.0           8
Max Alvis          7.0  0.0   7.0   7.0   7.0           7

Hit Rate - Season Predictions
-----------------------------
                  mean  std   25%   50%   75%  ActualHits
Roberto Clemente  18.0  0.0  18.0  18.0  18.0         145
Frank Robinson    17.0  0.0  17.0  17.0  17.0         144
Frank Howard      16.0  0.0  16.0  16.0  16.0         160
Jay Johnstone     15.0  0.0  15.0  15.0  15.0          76
Ken Berry         14.0  0.0  14.0  14.0  14.0         128
Jim Spencer       14.0  0.0  14.0  14.0  14.0         140
Don Kessinger     13.0  0.0  13.0  13.0  13.0         168
Luis Alvarado     12.0  0.0  12.0  12.0  12.0          41
Ron Santo         11.0  0.0  11.0  11.0  11.0         148
Ron Swaboda       11.0  0.0  11.0  11.0  11.0          57
Rico Petrocelli   10.0  0.0  10.0  10.0  10.0         152
Ellie Rodriguez   10.0  0.0  10.0  10.0  10.0          52
George Scott      10.0  0.0  10.0  10.0  10.0         142
Del Unser         10.0  0.0  10.0  10.0  10.0          83
Billy Williams    10.0  0.0  10.0  10.0  10.0         205
Bert Campaneris    9.0  0.0   9.0   9.0   9.0         168
Thurman Munson     8.0  0.0   8.0   8.0   8.0         137
Max Alvis          7.0  0.0   7.0   7.0   7.0          21

Log posterior predictive density
--------------------------------
-52.6788

ToDo¶

レポートの可視化（stan、pymcのforestplotのようなもの）
ハイパーパラメータの分布のプロット
Centered Parameterization, Non-Centered Parameterizationの比較(plot)
stanの例の続き
- estimate event probabilities
- evaluate posterior predictive densities to evaluate model predictions on held-out data
- rank items by chance of success
- perform multiple comparisons in several settings
- replicate new data for posterior p-values
- perform graphical posterior predictive checks

Link¶

pymcの階層ベイズモデルの例

なぜかスポーツのデータが多い。

その他¶

poutineとは統計モデルをデータを結びつけるための関数でwith構文にしたりデコレーターとしたりといろいろな使い方ができる。少し古いversionではこれを直接使ってMCMC用のkernelを作っていた。 http://docs.pyro.ai/en/0.2.1-release/poutine.html

In [ ]:

def conditioned_model(model, at_bats, hits):
    return poutine.condition(model, data={"obs": hits})(at_bats)