# Statistical tools

In [1]:
import addutils.toc ; addutils.toc.js(ipy_notebook=True)

Out[1]:

This tutorial presents some of the statistical and computational tools offered by pandas.

In [2]:
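import datetime
import scipy.io
import numpy as np
import pandas as pd
import bokeh.plotting as bk
from IPython.display import display, HTML
# css_notebook and side_by_side2 are assumed to come from the addutils package used above
from addutils import css_notebook, side_by_side2
css_notebook()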

Out[2]:

## 1 Percent change

Given a pandas.Series, the pct_change method returns a new pandas.Series containing the fractional change over a given number of periods (one by default). Multiplying the result by 100, as done below, expresses it as a percentage.

In [3]:
s1 = pd.Series(range(10, 18) + np.random.randn(8) / 10)

pct_ch_1d = s1.pct_change() * 100
pct_ch_3d = s1.pct_change(periods=3) * 100

HTML(side_by_side2(s1, pct_ch_1d, pct_ch_3d))

Out[3]:
          s1  pct_ch_1d  pct_ch_3d
0   9.986220        NaN        NaN
1  11.011012  10.262067        NaN
2  12.019591   9.159731        NaN
3  12.960166   7.825347  29.780503
4  14.031684   8.267777  27.433187
5  14.895877   6.158871  23.929980
6  16.096982   8.063341  24.203519
7  16.835873   4.590243  19.984695
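
As a quick check of what the periods argument means, the following minimal sketch compares pct_change with the equivalent expression built from shift. The series values are hypothetical and chosen so that every step is a 10% increase.

```python
import numpy as np
import pandas as pd

# Hypothetical series, fixed so the result is reproducible
s = pd.Series([10.0, 11.0, 12.1, 13.31])

# pct_change(periods=n) compares each value with the one n rows earlier
manual_1 = s / s.shift(1) - 1
manual_2 = s / s.shift(2) - 1

one_step = s.pct_change()            # first row is NaN
two_step = s.pct_change(periods=2)   # first two rows are NaN

print(pd.DataFrame({'one_step': one_step, 'manual_1': manual_1,
                    'two_step': two_step, 'manual_2': manual_2}))
```

The first periods rows are always NaN because there is no earlier value to compare with.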

## 2 Covariance

Given two pandas.Series, the cov method computes the covariance between them, excluding missing values.

In [4]:
s1 = pd.util.testing.makeTimeSeries(7)
s2 = s1 + np.random.randn(len(s1)) / 10
HTML(side_by_side2(s1, s2))

Out[4]:
                  s1        s2
2000-01-03 -1.348669 -1.570559
2000-01-04 -0.550629 -0.546077
2000-01-05 -0.668182 -0.506982
2000-01-06 -0.359217 -0.346936
2000-01-07  0.010123  0.083605
2000-01-10 -1.494454 -1.195086
2000-01-11 -1.115500 -1.064149

In [5]:
s1.cov(s2)

Out[5]:
0.29886732146355777
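
To make "excluding missing values" concrete, here is a small sketch with two hypothetical series that each contain one NaN. It assumes the default behaviour of Series.cov (sample covariance with an n - 1 denominator) and reproduces the result by hand on the rows where both values are present.

```python
import numpy as np
import pandas as pd

# Hypothetical data: b equals 2 * a wherever both are defined
a = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])
b = pd.Series([2.0, 4.0, np.nan, 8.0, 10.0])

# Series.cov uses only rows where both values are present (rows 0, 1 and 3)
cov_ab = a.cov(b)

# Same computation by hand: sample covariance with an n - 1 denominator
valid = a.notna() & b.notna()
x, y = a[valid], b[valid]
manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

print(cov_ab, manual)
```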

It is also possible to compute the pairwise covariance of the columns of a pandas.DataFrame with the pandas.DataFrame.cov method. Here we use the pandas.util.testing module to generate random data easily (it is a testing helper that was deprecated in later pandas releases):

In [6]:
d1 = pd.util.testing.makeTimeDataFrame()
print(d1.head())  # first rows of the random data
print(d1.cov())   # pairwise covariance of the columns

                   A         B         C         D
2000-01-03  1.873764  0.006606  1.031176 -1.266077
2000-01-04 -0.269824  1.588418  0.675261 -0.437837
2000-01-05  0.223423  0.312011 -0.882502 -0.710833
2000-01-06 -1.163556  1.019891 -0.310532 -1.039061
2000-01-07  0.344085  0.833134 -2.606216 -3.023771

          A         B         C         D
A  1.222005 -0.284295 -0.085028 -0.272069
B -0.284295  1.024260 -0.125176 -0.434902
C -0.085028 -0.125176  1.353324  0.492702
D -0.272069 -0.434902  0.492702  1.353826
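
Because pandas.util.testing is a testing helper rather than a stable public API, a similar frame can be built with public functions only. The sketch below assumes 30 business days and four standard-normal columns named A to D (the seed and sizes are arbitrary choices, not taken from the original notebook) and checks two properties of the covariance matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                  # fixed seed, an arbitrary choice
idx = pd.bdate_range('2000-01-03', periods=30)  # business days only
frame = pd.DataFrame(rng.standard_normal((30, 4)),
                     index=idx, columns=list('ABCD'))

cov = frame.cov()

# The matrix is symmetric and its diagonal holds each column's sample variance
print(cov.round(3))
print(np.allclose(cov, cov.T), np.allclose(np.diag(cov), frame.var()))
```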


## 3 Correlation

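pandas.Series.corr computes the correlation between two pandas.Series. The method parameter selects the coefficient:

• Pearson (method='pearson', the default)
• Kendall (method='kendall', Kendall Tau rank correlation)
• Spearman (method='spearman', Spearman rank correlation)
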
In [7]:
s1.corr(s2, method='pearson')

Out[7]:
0.9596913637468012
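
Covariance and Pearson correlation are closely related: the correlation is the covariance divided by the product of the two standard deviations. A minimal sketch of that relation, using hypothetical random data built the same way as s1 and s2 (a series plus a small amount of noise):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)        # arbitrary seed
a = pd.Series(rng.standard_normal(50))
b = a + rng.standard_normal(50) / 10  # noisy copy of a

# Pearson correlation is the covariance rescaled by both standard deviations
manual = a.cov(b) / (a.std() * b.std())

print(a.corr(b), manual)
```

Both printed values should agree up to floating-point rounding, and they are close to 1 because b differs from a only by small noise.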

As with covariance, pandas.DataFrame.corr returns the pairwise correlation of the columns of a pandas.DataFrame:

In [8]:
d1.corr()

Out[8]:
          A         B         C         D
A  1.000000 -0.254113 -0.066119 -0.211525
B -0.254113  1.000000 -0.106320 -0.369322
C -0.066119 -0.106320  1.000000  0.364000
D -0.211525 -0.369322  0.364000  1.000000
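
The choice of method matters when the relation between two columns is monotonic but not linear. The sketch below uses hypothetical data (y is the cube of x) and compares Pearson with Spearman through DataFrame.corr. Kendall is left out here only to keep the example free of extra dependencies, since pandas may delegate it to SciPy.

```python
import numpy as np
import pandas as pd

x = np.arange(1.0, 11.0)
pair = pd.DataFrame({'x': x, 'y': x ** 3})  # y is monotonic but nonlinear in x

pearson = pair.corr(method='pearson').loc['x', 'y']
spearman = pair.corr(method='spearman').loc['x', 'y']

# Spearman is Pearson applied to ranks, so a perfectly monotonic relation gives 1
by_ranks = pair['x'].rank().corr(pair['y'].rank())

print(pearson, spearman, by_ranks)
```

Pearson stays below 1 because the points do not lie on a straight line, while the rank-based coefficient only looks at the ordering.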

## 4 Rolling moments and Binary rolling moments

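pandas also provides many functions for computing rolling moments. The listing below shows the module-level rolling_* functions of older pandas releases; later releases expose these operations as methods of the object returned by pandas.Series.rolling, which is the form used in the examples that follow.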

In [9]:
[n for n in dir(pd) if n.startswith('rolling')]

Out[9]:
['rolling_apply',
'rolling_corr',
'rolling_count',
'rolling_cov',
'rolling_kurt',
'rolling_max',
'rolling_mean',
'rolling_median',
'rolling_min',
'rolling_quantile',
'rolling_skew',
'rolling_std',
'rolling_sum',
'rolling_var',
'rolling_window']

Let's see some examples:

In [10]:
s3 = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
s3 = s3.cumsum()
s3_max = s3.rolling(60).max()
s3_mean = s3.rolling(60).mean()
s3_min = s3.rolling(60).min()
data = {'cumsum':s3, 'max':s3_max, 'mean':s3_mean, 'min':s3_min}
df = pd.DataFrame(data)
df.tail()

Out[10]:
               cumsum        max       mean       min
2002-09-22  22.552347  22.552347  12.281759  3.117080
2002-09-23  22.350431  22.552347  12.577096  3.117080
2002-09-24  21.767979  22.552347  12.887944  3.675339
2002-09-25  21.627637  22.552347  13.187149  4.936078
2002-09-26  20.773243  22.552347  13.449934  4.936078

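
df.tail() shows only the last rows. At the start of the series the rolling columns are NaN, because by default a window of 60 needs 60 observations before it produces a value. A minimal sketch on a tiny hypothetical series, using the min_periods argument of rolling to relax that requirement:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

full = s.rolling(3).mean()                    # NaN until 3 observations are available
partial = s.rolling(3, min_periods=1).mean()  # uses whatever is available so far

print(pd.DataFrame({'value': s, 'full': full, 'partial': partial}))
```
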
In [11]:
bk.output_notebook()

In [12]:
fig = bk.figure(x_axis_type="datetime",
                tools="pan,box_zoom,reset", title='Rolling Moments',
                plot_width=750, plot_height=400)
fig.line(df.index, df['max'], color='mediumorchid', legend='Max')
fig.line(df.index, df['min'], color='mediumpurple', legend='Min')
fig.line(df.index, df['mean'], color='navy', legend='Mean')
bk.show(fig)


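pandas.Series.cumsum returns a new pandas.Series containing the cumulative sum of the given values; s3 above was built this way as a random walk. A binary rolling moment, such as the rolling correlation in the next cell, is computed over two series within the same moving window.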

In [13]:
s4 = s3 + np.random.randn(len(s3))
rollc = s3.rolling(window=10).corr(s4)  # correlate s3 with its noisy copy s4
data2 = {'cumsum':s3, 'similar':s4, 'rolling correlation':rollc}
df2 = pd.DataFrame(data2)
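
The following sketch illustrates how a rolling correlation behaves, on a shorter hypothetical random walk (200 points, arbitrary seed). It also shows the degenerate case: correlating a series with itself gives 1 in every window, so the argument passed to corr has to be the other series.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # arbitrary seed
idx = pd.date_range('2000-01-01', periods=200)
walk = pd.Series(rng.standard_normal(200), index=idx).cumsum()
noisy = walk + rng.standard_normal(200)

roll_corr = walk.rolling(window=10).corr(noisy)  # one coefficient per 10-day window
self_corr = walk.rolling(window=10).corr(walk)   # degenerate case: always 1

print(roll_corr.dropna().describe())
print(self_corr.dropna().round(6).unique())
```

The first nine values of roll_corr are NaN because the window is not yet full.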

In [14]:
fig = bk.figure(x_axis_type="datetime", title='Rolling Correlation',
                plot_width=750, plot_height=400)
fig.line(df2.index, df2['similar'], color='mediumorchid', legend='Similar')
fig.line(df2.index, df2['rolling correlation'], color='navy', legend='Rolling Corr.')
fig.legend.location = "bottom_right"
bk.show(fig)


## 5 A practical example: Return indexes and cumulative returns

In [15]:
AAPL = pd.read_csv('example_data/p03_AAPL.txt', index_col='Date', parse_dates=True)
price = AAPL['Adj Close']  # adjusted closing price, as shown in the output below
display(price.tail())

Date
2012-09-17    699.78
2012-09-18    701.91
2012-09-19    702.10
2012-09-20    698.70
2012-09-21    700.09
Name: Adj Close, dtype: float64

pandas.Series.tail returns the last n rows of a pandas.Series (n defaults to 5).

In [16]:
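# Total return between two dates (not displayed: it is not the last expression of the cell)
price['2011-10-03'] / price['2011-3-01'] - 1
returns = price.pct_change()         # daily fractional returns
ret_index = (1 + returns).cumprod()  # cumulative return index
ret_index.iloc[0] = 1                # pct_change leaves NaN in the first row
monthly_returns = ret_index.resample('BM').last().pct_change()  # 'BM' = business month end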
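
To see what the return index represents, here is a minimal sketch on four hypothetical prices (not taken from the AAPL file). It follows the same steps as the cell above and checks that the return index is simply the price series rescaled to start at 1.

```python
import pandas as pd

# Hypothetical closing prices, not taken from the AAPL file
prices = pd.Series([100.0, 110.0, 99.0, 118.8],
                   index=pd.date_range('2012-01-02', periods=4))

returns = prices.pct_change()
ret_index = (1 + returns).cumprod()
ret_index.iloc[0] = 1  # pct_change leaves NaN in the first row

rescaled = prices / prices.iloc[0]

# The two columns should match up to floating-point rounding
print(pd.DataFrame({'ret_index': ret_index, 'rescaled': rescaled}))
```

The cumulative return over the whole period is then the last value of ret_index minus 1.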

In [17]:
fig = bk.figure(x_axis_type='datetime', title='Monthly Returns',
                plot_width=750, plot_height=400)
fig.line(monthly_returns.index, monthly_returns)
bk.show(fig)