Performance implications of frozen distributions in SciPy

This notebook is a quick demonstration of the surprising performance implications of two different interfaces for sampling from probability distributions in SciPy.

In [ ]:
import scipy.stats
import numpy as np


One option for sampling from a distribution — in this case, a Poisson distribution with a λ of 5 — is to create a distribution object, like this:

In [ ]:
distribution = scipy.stats.poisson(5)
distribution.random_state = 0x00c0ffee
distribution.rvs(size=12)


Another option is to use the scipy.stats.poisson object directly, specifying the distribution parameters and the random state in each call, like this:

In [ ]:
rs = np.random.RandomState(1234)

scipy.stats.poisson.rvs(5, size=12, random_state=rs)


A RandomState is a stateful pseudorandom number generator, and we can see that each call advances its internal state, so repeated calls yield different draws:

In [ ]:
rs = np.random.RandomState(1234)

print(scipy.stats.poisson.rvs(5, size=12, random_state=rs))
print(scipy.stats.poisson.rvs(5, size=12, random_state=rs))
print(scipy.stats.poisson.rvs(5, size=12, random_state=rs))
print(scipy.stats.poisson.rvs(5, size=12, random_state=rs))
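We can make the statefulness concrete with a small check (the seed 1234 here is arbitrary): two generators seeded identically produce the same draws, while a second call to the same generator produces different ones.

```python
import numpy as np
import scipy.stats

# Two generators seeded identically produce the same stream of draws...
a = np.random.RandomState(1234)
b = np.random.RandomState(1234)
first_a = scipy.stats.poisson.rvs(5, size=12, random_state=a)
first_b = scipy.stats.poisson.rvs(5, size=12, random_state=b)
print(np.array_equal(first_a, first_b))

# ...but a second draw from the same generator differs, because the
# first call advanced its internal state.
second_a = scipy.stats.poisson.rvs(5, size=12, random_state=a)
print(np.array_equal(first_a, second_a))
```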


These two techniques produce identical results when driven by identically seeded generators. To see whether they also perform identically, let's collect some timings.
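Before timing, we can sanity-check the equivalence: given identically seeded generators, both interfaces should return the same array (a quick check; the seed 42 is arbitrary).

```python
import numpy as np
import scipy.stats

# Frozen interface, seeded directly on the distribution object.
frozen = scipy.stats.poisson(5)
frozen.random_state = 42
from_frozen = frozen.rvs(size=12)

# Direct interface, driven by an identically seeded generator.
rs = np.random.RandomState(42)
from_direct = scipy.stats.poisson.rvs(5, size=12, random_state=rs)

print(np.array_equal(from_frozen, from_direct))
```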

In [ ]:
def experiment_one(agents, steps):
    def mkpoisson(l, seed):
        # freeze a Poisson distribution and give it its own generator
        p = scipy.stats.poisson(l)
        p.random_state = seed
        return p

    seeds = np.random.randint(1 << 32, size=agents)
    streams = [mkpoisson(12, seed) for seed in seeds]
    for p in streams:
        p.rvs(size=steps)

def experiment_two(agents, steps):
    seeds = np.random.randint(1 << 32, size=agents)
    states = [np.random.RandomState(seed) for seed in seeds]
    for rs in states:
        scipy.stats.poisson.rvs(12, size=steps, random_state=rs)

In [ ]:
%%time

experiment_one(10000,1000)

In [ ]:
%%time

experiment_two(10000,1000)


If you're running these experiments in an environment similar to mine, you'll see radically different timings: experiment_two is going to be substantially faster than experiment_one. (I've observed a 2x difference on Binder and a 4-5x difference locally.)

To see why, let's profile both functions:

In [ ]:
from cProfile import run
import pstats
from pstats import SortKey

In [ ]:
run("experiment_one(10000,1000)", sort=SortKey.TIME)

In [ ]:
run("experiment_two(10000,1000)", sort=SortKey.TIME)
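Whichever interface you choose, a further way to reduce per-call overhead, if your model allows it, is to batch draws: a single rvs call with a two-dimensional size replaces thousands of small calls. (This is a sketch under the assumption that every agent shares the same λ; note that it produces a different stream structure than seeding each agent separately.)

```python
import numpy as np
import scipy.stats

def experiment_batched(agents, steps, lam=12, seed=1234):
    # Pay the per-call overhead once: one vectorized draw of
    # agents x steps samples instead of `agents` separate rvs calls.
    rs = np.random.RandomState(seed)
    return scipy.stats.poisson.rvs(lam, size=(agents, steps), random_state=rs)

samples = experiment_batched(100, 50)
print(samples.shape)
```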