This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.

7.1. Explore a dataset with Pandas and matplotlib

We import NumPy, Pandas and matplotlib.

In [ ]:

from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

The dataset is a CSV file, i.e. a text file with comma-separated values. Pandas lets us load this file with a single function.

In [ ]:

player = 'Roger Federer'
filename = "data/{name}.csv".format(
              name=player.replace(' ', '-'))
df = pd.read_csv(filename)
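
As a minimal sketch of what this cell does, the snippet below builds the same filename and parses a tiny in-memory CSV instead of the real file. The three columns and both rows are made up for illustration (the actual dataset has many more columns), and `sample` and `Opponent A` are hypothetical names.

```python
import io
import pandas as pd

# Made-up two-row excerpt; the real file has many more columns.
csv_text = """year,surface,winner
2003,Grass,Roger Federer
2004,Clay,Opponent A
"""

player = 'Roger Federer'
# Same filename convention as the recipe: spaces become hyphens.
filename = "data/{name}.csv".format(name=player.replace(' ', '-'))
print(filename)  # data/Roger-Federer.csv

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (2, 3)
```

The first line of the file is used as the header, so each column can then be accessed by name.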

The loaded data is a DataFrame, a 2D tabular data structure where each row is an observation and each column is a variable. We can have a first look at this dataset by just displaying it in the IPython notebook.

In [ ]:

df

There are many columns. Each row corresponds to a match played by Roger Federer. Let's add a boolean variable indicating whether he has won the match or not. The tail method displays the last rows of the column.

In [ ]:

df['win'] = df['winner'] == player
df['win'].tail()

df['win'] is a Series object: it is very similar to a NumPy array, except that each value has an index (here, the match index). This object has a few standard statistical functions. For example, let's look at the proportion of matches won.

In [ ]:

print("{player} has won {vic:.0f}% of his ATP matches.".format(
      player=player, vic=100*df['win'].mean()))
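
The reason the mean gives a proportion can be illustrated on a toy column. This is only a sketch with made-up results (`toy` and `Opponent A` are hypothetical): comparing a column to a string yields a boolean Series, and `True`/`False` count as 1/0 when averaged.

```python
import pandas as pd

# Made-up results, not real match data.
toy = pd.DataFrame({'winner': ['Roger Federer', 'Opponent A',
                               'Roger Federer', 'Roger Federer']})

# Element-wise comparison gives a boolean Series.
toy['win'] = toy['winner'] == 'Roger Federer'

# True counts as 1 and False as 0, so the mean is the fraction of wins.
win_rate = toy['win'].mean()
print(win_rate)  # 0.75
```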

Now, we are going to look at the evolution of some variables across time. The start date field contains the start date of the tournament as a string. We can convert the type to a date type using the pd.to_datetime function.

In [ ]:

date = pd.to_datetime(df['start date'])
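
A small sketch of the conversion, on made-up strings. The ISO format used here is an assumption (the real column's format may differ), and `errors='coerce'` is an optional argument that the recipe itself does not use.

```python
import pandas as pd

# Made-up date strings; the last one is deliberately invalid.
raw = pd.Series(['2003-06-23', '2004-01-19', 'not a date'])

# errors='coerce' turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors='coerce')

print(parsed.iloc[0].year)    # 2003
print(parsed.isna().sum())    # 1
```

Once the values are datetimes rather than strings, they can be compared, sorted chronologically, and used directly on a plot axis.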

We now compute the proportion of double faults in each match, dividing by the total number of points played because there are logically more double faults in longer matches. This number is an indicator of the player's state of mind, his level of self-confidence, his willingness to take risks while serving, and other parameters.

In [ ]:

df['dblfaults'] = (df['player1 double faults'] / 
                   df['player1 total points total'])

We can use the head and tail methods to take a look at the beginning and the end of the column, and describe to get summary statistics. In particular, let's note that some rows have NaN values (i.e. the number of double faults is not available for all matches).

In [ ]:

df['dblfaults'].tail()

In [ ]:

df['dblfaults'].describe()
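
To see how missing values propagate, here is a minimal sketch on made-up numbers (the shortened column names and the `toy` frame are hypothetical). A NaN in either operand of the division yields NaN in the result, and `describe` ignores those rows.

```python
import numpy as np
import pandas as pd

# Made-up counts; the NaN row stands in for a match without statistics.
toy = pd.DataFrame({'double faults': [2, 0, np.nan, 4],
                    'total points': [100, 80, np.nan, 160]})
toy['dblfaults'] = toy['double faults'] / toy['total points']

stats = toy['dblfaults'].describe()

# 'count' only includes the valid (non-NaN) values: 3 out of 4 rows.
print(stats['count'])
# Number of rows where the ratio is not available.
print(toy['dblfaults'].isna().sum())
```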

A very powerful feature in Pandas is groupby. This function allows us to group together rows that have the same value in a particular column. Then, we can aggregate this group-by object to compute statistics in each group. For instance, here is how we can get the proportion of wins as a function of the tournament's surface.

In [ ]:

df.groupby('surface')['win'].mean()
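
The following sketch, on made-up rows, shows what this one-liner computes by comparing it with an explicit loop over the surfaces. The `toy` frame and the `manual` dictionary are illustrative names only.

```python
import pandas as pd

# Made-up matches, not real results.
toy = pd.DataFrame({
    'surface': ['Clay', 'Grass', 'Clay', 'Hard', 'Grass'],
    'win': [True, True, False, True, True],
})

# Split the rows by surface, then average 'win' within each group.
by_surface = toy.groupby('surface')['win'].mean()
print(by_surface)

# The same computation written by hand, one boolean mask per surface.
manual = {s: toy.loc[toy['surface'] == s, 'win'].mean()
          for s in toy['surface'].unique()}
```

The result is a Series indexed by the group key, so `by_surface['Clay']` returns the win proportion on clay in this toy data.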

Now, we are going to display the proportion of double faults as a function of the tournament date, as well as the yearly average. To do this, we also use groupby.

In [ ]:

gb = df.groupby('year')

gb is a GroupBy instance. It is similar to a DataFrame, but there are multiple rows per group (all matches played in each year). We can aggregate those rows using the mean operation. We use matplotlib's plot_date function because the x-axis contains dates.

In [ ]:

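plt.figure(figsize=(8, 4))
# Individual matches: pass the parsed dates directly
# (date.astype(datetime) raises a TypeError in recent pandas).
# Note: plot_date is deprecated since Matplotlib 3.9; plt.plot also accepts dates.
plt.plot_date(date, df['dblfaults'], alpha=.25, lw=0);
# Yearly average, placed at the latest parsed start date of each year.
plt.plot_date(date.groupby(df['year']).max(),
              gb['dblfaults'].mean(), '-', lw=3);
plt.xlabel('Year');
plt.ylabel('Proportion of double faults per match');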
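
To make the yearly curve less opaque, here is a minimal sketch, on made-up data, of how one point per year can be built: the latest parsed date of each year for the x-coordinate and the yearly mean for the y-coordinate. The `toy` frame, its values, and the names `x` and `y` are assumptions for illustration.

```python
import pandas as pd

# Made-up matches: two per year.
toy = pd.DataFrame({
    'year': [2003, 2003, 2004, 2004],
    'start date': ['2003-01-13', '2003-06-23', '2004-01-19', '2004-08-30'],
    'dblfaults': [0.02, 0.04, 0.01, 0.03],
})
date = pd.to_datetime(toy['start date'])
gb = toy.groupby('year')

# x: latest tournament date in each year (grouping the parsed dates).
x = date.groupby(toy['year']).max()
# y: mean proportion of double faults in each year.
y = gb['dblfaults'].mean()

print(pd.DataFrame({'date': x, 'mean dblfaults': y}))
```

Both results are indexed by year, so they line up point by point when passed to a plotting function.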


IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).