This notebook is a quick demonstration of the surprising performance implications of two different interfaces for sampling from probability distributions in SciPy.
import scipy.stats
import numpy as np
One option for sampling from a distribution — in this case, a Poisson distribution with a λ of 5 — is to create a distribution object, like this:
distribution = scipy.stats.poisson(5)
distribution.random_state = 0x00c0ffee
distribution.rvs(size=12)
Another option is to use the poisson
object and specify the distribution parameters and the random state in each call, like this:
rs = np.random.RandomState(1234)
scipy.stats.poisson.rvs(5, size=12, random_state=rs)
RandomState
is really a pseudorandom number generator, and we can see that each call modifies its state:
rs = np.random.RandomState(1234)
print(scipy.stats.poisson.rvs(5, size=12, random_state=rs))
print(scipy.stats.poisson.rvs(5, size=12, random_state=rs))
print(scipy.stats.poisson.rvs(5, size=12, random_state=rs))
print(scipy.stats.poisson.rvs(5, size=12, random_state=rs))
These two techniques produce identical results. To see whether or not they behave identically, let's collect some timings.
def mkpoisson(l,seed):
p = scipy.stats.poisson(l)
p.random_state = seed
return p
def experiment_one(agents, steps):
def mkpoisson(l,seed):
p = scipy.stats.poisson(l)
p.random_state = seed
return p
seeds = np.random.randint(1<<32, size=agents)
streams = [mkpoisson(12, seed) for seed in seeds]
for p in streams:
p.rvs(size=steps)
def experiment_two(agents, steps):
seeds = np.random.randint(1<<32, size=agents)
states = [np.random.RandomState(seed) for seed in seeds]
for rs in states:
scipy.stats.poisson.rvs(12, size=steps, random_state=rs)
%%time
experiment_one(10000,1000)
%%time
experiment_two(10000,1000)
If you're running these experiments in a similar environment to my computer, you'll see radically different timings — experiment_two
is going to be substantially faster than experiment_one
. (I've observed a 2x difference on Binder and a 4-5x difference locally.)
To see why, let's profile both functions:
from cProfile import run
import pstats
from pstats import SortKey
run("experiment_one(10000,1000)", sort=SortKey.TIME)
run("experiment_two(10000,1000)", sort=SortKey.TIME)