This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
player = 'Roger Federer'
filename = "data/{name}.csv".format(
    name=player.replace(' ', '-'))
df = pd.read_csv(filename)
The loaded data is a DataFrame, a 2D tabular data structure where each row is an observation and each column is a variable. We can take a first look at this dataset by displaying it in the IPython notebook.
df
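To make the row/column structure concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The column names mirror ones used in this recipe, but the rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: three matches, made up for illustration only.
toy = pd.DataFrame({
    'winner': ['Roger Federer', 'Rafael Nadal', 'Roger Federer'],
    'surface': ['Grass', 'Clay', 'Hard'],
    'year': [2009, 2008, 2010],
})

n_rows, n_cols = toy.shape        # (number of observations, number of variables)
column_names = list(toy.columns)  # one name per variable
first_rows = toy.head(2)          # the first two observations
```

In a notebook, evaluating toy or first_rows as the last expression of a cell displays the table, just as df does above.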
We add a boolean column indicating whether the player won each match. The tail method displays the last rows of the column.
df['win'] = df['winner'] == player
df['win'].tail()
df['win'] is a Series object: it is very similar to a NumPy array, except that each value has an index (here, the match index). This object has a few standard statistical functions. For example, let's look at the proportion of matches won.
print("{player} has won {vic:.0f}% of his ATP matches.".format(
    player=player, vic=100*df['win'].mean()))
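The mean of a boolean Series gives a proportion because True counts as 1 and False as 0. The following sketch shows this on a made-up Series; the names winners and win are illustrative only.

```python
import pandas as pd

# Hypothetical match winners, made up for illustration only.
winners = pd.Series(['A', 'B', 'A', 'A'])

# Element-wise comparison yields a boolean Series with the same index.
win = winners == 'A'

n_wins = int(win.sum())          # True is counted as 1
proportion = float(win.mean())   # fraction of True values
```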
The start date field contains the start date of the tournament as a string. We can convert it to a date type using the pd.to_datetime function.
date = pd.to_datetime(df['start date'])
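As a small sketch of what the conversion buys us, the example below parses a few date strings and then uses date-aware operations on the result. The strings and their ISO-like format are assumptions made for illustration; the actual format in the dataset may differ.

```python
import pandas as pd

# Hypothetical date strings (format assumed for illustration).
raw = pd.Series(['2012-06-25', '2012-08-27', '2013-01-14'])

parsed = pd.to_datetime(raw)      # dtype becomes datetime64

# Once parsed, the .dt accessor and date arithmetic are available.
years = parsed.dt.year.tolist()
span_days = (parsed.max() - parsed.min()).days
```

None of these operations would work on the raw strings, which is why the conversion comes first.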
# Double faults as a fraction of the total points, computed per match.
df['dblfaults'] = (df['player1 double faults'] /
                   df['player1 total points total'])
We can use the head and tail methods to take a look at the beginning and the end of the column, and describe to get summary statistics. In particular, note that some rows have NaN values (i.e. the number of double faults is not available for all matches).
df['dblfaults'].tail()
df['dblfaults'].describe()
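The sketch below illustrates, on made-up numbers, how a missing value propagates through the division and how describe handles it. The Series names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical counts; np.nan marks a match with no recorded statistic.
double_faults = pd.Series([2, np.nan, 1, 3])
total_points = pd.Series([100, 120, 50, 150])

ratio = double_faults / total_points   # NaN propagates through the division

n_missing = int(ratio.isna().sum())    # number of matches without a value
summary = ratio.describe()             # count, mean, std, ... skip NaN values
```

Here summary['count'] reports the number of non-missing values, so comparing it with len(ratio) is a quick way to see how many matches lack the statistic.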
A useful pandas feature is groupby. This function allows us to group together rows that have the same value in a particular column. Then, we can aggregate this group-by object to compute statistics in each group. For instance, here is how we can get the proportion of wins as a function of the tournament's surface.
df.groupby('surface')['win'].mean()
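To see the split-then-aggregate pattern in isolation, here is a minimal sketch on a made-up table with the same two column names. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical matches, made up for illustration only.
toy = pd.DataFrame({
    'surface': ['Clay', 'Grass', 'Clay', 'Hard', 'Grass', 'Clay'],
    'win':     [True,   True,    False,  True,   True,    False],
})

# Split rows by surface, select the 'win' column, then average within each group.
win_rate = toy.groupby('surface')['win'].mean()

# The group sizes help judge how reliable each proportion is.
match_count = toy.groupby('surface').size()
```

Both results are Series indexed by the distinct surface values, so win_rate['Clay'] gives the proportion of wins on clay.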
To compute yearly averages, we also use groupby.
gb = df.groupby('year')
gb is a GroupBy instance. It is similar to a DataFrame, but there are multiple rows per group (all matches played in each year). We can aggregate those rows using the mean operation. We use matplotlib's plot_date function because the x-axis contains dates.
plt.figure(figsize=(8, 4))
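# Individual matches: one faint marker per match.
plt.plot_date(date.astype(datetime), df['dblfaults'], alpha=.25, lw=0);
# Yearly average, placed at the latest parsed start date of each year
# (df['start date'] still holds strings, so we group the parsed dates).
plt.plot_date(date.groupby(df['year']).max(),
              gb['dblfaults'].mean(), '-', lw=3);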
plt.xlabel('Year');
plt.ylabel('Proportion of double faults per match.');
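The yearly aggregation behind the thick line can be checked without plotting. The sketch below uses a made-up table; the date column and the yearly frame are hypothetical names introduced here, and the values are not from the dataset.

```python
import pandas as pd

# Hypothetical matches; None marks a match with no double-fault statistic.
toy = pd.DataFrame({
    'year': [2011, 2011, 2012, 2012],
    'start date': ['2011-01-17', '2011-06-20', '2012-01-16', '2012-08-27'],
    'dblfaults': [0.01, 0.03, 0.02, None],
})
toy['date'] = pd.to_datetime(toy['start date'])  # parse before aggregating

gb = toy.groupby('year')

# One row per year: the x position (latest date) and the y value (mean).
yearly = pd.DataFrame({
    'last_date': gb['date'].max(),
    'mean_dblfaults': gb['dblfaults'].mean(),    # NaN values are skipped
})
```

Each row of yearly corresponds to one point of the averaged curve, which makes it easy to inspect the numbers before passing them to a plotting function.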
You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).
IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).