Introduction to Pandas¶

Pandas is a library providing high-performance, easy-to-use data structures and data analysis tools. The core of pandas is its dataframe which is essentially a table of data. Pandas provides easy and powerful ways to import data from a variety of sources and export it to just as many. It is also explicitly designed to handle missing data elegantly which is a very common problem in data from the real world.

The offical pandas documentation is very comprehensive and you will be answer a lot of questions in there, however, it can sometimes be hard to find the right page. Don't be afraid to use Google to find help.

Pandas has a standard convention for importing it which you will see used in a lot of documentation so we will follow that in this course:

In [1]:

import pandas as pd
from pandas import Series, DataFrame

Series¶

The simplest of pandas' data structures is the Series. It is a one-dimensional list-like structure. Let's create one from a list:

In [2]:

Series([14, 7, 3, -7, 8])

Out[2]:

0    14
1     7
2     3
3    -7
4     8
dtype: int64

There are three main components to this output. The first column (0, 2, etc.) is the index, by default this is numbers each row starting from zero. The second column is our data, stored i the same order we entered it in our list. Finally at the bottom there is the dtype which stands for 'data type' which is telling us that all our data is being stored as a 64-bit integer. Usually you can ignore the dtype until you start doing more advanced things.

In the first example above we allowed pandas to automatically create an index for our Series (this is the 0, 1, 2, etc. in the left column) but often you will want to specify one yourself

In [3]:

s = Series([14, 7, 3, -7, 8], index=['a', 'b', 'c', 'd', 'e'])
print(s)

a    14
b     7
c     3
d    -7
e     8
dtype: int64

We can use this index to retrieve individual rows

In [4]:

s['a']

Out[4]:

to replace values in the series

In [5]:

s['c'] = -1

or to get a set of rows

In [6]:

s[['a', 'c', 'd']]

Out[6]:

a    14
c    -1
d    -7
dtype: int64

Exercises¶

Create a Pandas Series with 10 or so elements where the indices are years and the values are numbers.
Experiment with retrieving elements from the Series.
Try making another Series with duplicate values in the index, what happens when you access those elements?
How does a Pandas Series differ from a Python list or dict?

Series operations¶

A Series is list-like in the sense that it is an ordered set of values. It is also dict-like since its entries can be accessed via key lookup. One very important way in which is differs is how it allows operations to be done over the whole Series in one go, a technique often referred to as 'broadcasting'.

A simple example is wanting to double the value of every entry in a set of data. In standard Python, you might have a list like

In [7]:

my_list = [3, 6, 8, 4, 10]

If you wanted to double every entry you might try simply multiplying the list by 2:

In [8]:

my_list * 2

Out[8]:

[3, 6, 8, 4, 10, 3, 6, 8, 4, 10]

but as you can see, that simply duplicated the elements. Instead you would have to use a for loop or a list comprehension:

In [9]:

[i * 2 for i in my_list]

Out[9]:

[6, 12, 16, 8, 20]

With a pandas Series, however, you can perform bulk mathematical operations to the whole series in one go:

In [10]:

my_series = Series(my_list)
print(my_series)

0     3
1     6
2     8
3     4
4    10
dtype: int64

In [11]:

my_series * 2

Out[11]:

0     6
1    12
2    16
3     8
4    20
dtype: int64

As well as bulk modifications, you can perform bulk selections by putting more complex statements in the square brackets:

In [12]:

s[s < 0]  # All negative entries

Out[12]:

c   -1
d   -7
dtype: int64

In [13]:

s[(s * 2) > 4]  # All entries which, when doubled are greater than 4

Out[13]:

a    14
b     7
e     8
dtype: int64

These operations work because the Series index selection can be passed a series of True and False values which it then uses to filter the result:

In [14]:

(s * 2) > 4

Out[14]:

a     True
b     True
c    False
d    False
e     True
dtype: bool

Here you can see that the rows a, b and e are True while the others are False. Passing this to s[...] will only show rows that are True.

Multi-Series operations¶

It is also possible to perform operations between two Series objects:

In [15]:

s2 = Series([23,5,34,7,5])
s3 = Series([7, 6, 5,4,3])
s2 - s3

Out[15]:

0    16
1    -1
2    29
3     3
4     2
dtype: int64

Exercises¶

Create two Series objects of equal length with no specified index and containing any values you like. Perform some mathematical operations on them and experiment to make sure it works how you think.
What happens then you perform an operation on two series which have different lengths? How does this change when you give the series some indices?
Using the Series from the first exercise with the years for the index, Select all entries with even-numbered years. Also, select all those with odd-numbered years.

DataFrame¶

While you can think of the Series as a one-dimensional list of data, pandas' DataFrame is a two (or possibly more) dimensional table of data. You can think of each column in the table as being a Series.

In [16]:

data = {'city': ['Paris', 'Paris', 'Paris', 'Paris',
                 'London', 'London', 'London', 'London',
                 'Rome', 'Rome', 'Rome', 'Rome'],
        'year': [2001, 2008, 2009, 2010,
                 2001, 2006, 2011, 2015,
                 2001, 2006, 2009, 2012],
        'pop': [2.148, 2.211, 2.234, 2.244,
                7.322, 7.657, 8.174, 8.615,
                2.547, 2.627, 2.734, 2.627]}
df = DataFrame(data)

This has created a DataFrame from the dictionary data. The keys will become the column headers and the values will be the values in each column. As with the Series, an index will be created automatically.

In [17]:

df

Out[17]:

	city	year	pop
0	Paris	2001	2.148
1	Paris	2008	2.211
2	Paris	2009	2.234
3	Paris	2010	2.244
4	London	2001	7.322
5	London	2006	7.657
6	London	2011	8.174
7	London	2015	8.615
8	Rome	2001	2.547
9	Rome	2006	2.627
10	Rome	2009	2.734
11	Rome	2012	2.627

Or, if you just want a peek at the data, you can just grab the first few rows with:

In [18]:

df.head(3)

Out[18]:

	city	year	pop
0	Paris	2001	2.148
1	Paris	2008	2.211
2	Paris	2009	2.234

Since we passed in a dictionary to the DataFrame constructor, the order of the columns will not necessarilly match the order in which you defined them. To enforce a certain order, you can pass a columns argument to the constructor giving a list of the columns in the order you want them:

In [19]:

DataFrame(data, columns=['year', 'city', 'pop'])

Out[19]:

	year	city	pop
0	2001	Paris	2.148
1	2008	Paris	2.211
2	2009	Paris	2.234
3	2010	Paris	2.244
4	2001	London	7.322
5	2006	London	7.657
6	2011	London	8.174
7	2015	London	8.615
8	2001	Rome	2.547
9	2006	Rome	2.627
10	2009	Rome	2.734
11	2012	Rome	2.627

When we accessed elements from a Series object, it would select an element by row. However, by default DataFrames index primarily by column. You can access any column directly:

In [20]:

df['city']

Out[20]:

0      Paris
1      Paris
2      Paris
3      Paris
4     London
5     London
6     London
7     London
8       Rome
9       Rome
10      Rome
11      Rome
Name: city, dtype: object

Accessing a column like this returns a Series which will act in the same way as those we were using earlier.

Note that there is one additional part to this output, Name: city. Pandas has remembered that this Series was created from the 'city' column in the DataFrame.

In [21]:

type(df['city'])

Out[21]:

pandas.core.series.Series

In [22]:

df['city'] == 'Paris'

Out[22]:

0      True
1      True
2      True
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
Name: city, dtype: bool

This has created a new Series which has True set where the city is Paris and False elsewhere.

We can use filtered Series like this to filter the DataFrame as a whole. df['city'] == 'Paris' has returned a Series containing booleans. Passing it back into df as an indexing operation will use it to filter based on the 'city' column.

In [23]:

df[df['city'] == 'Paris']

Out[23]:

	city	year	pop
0	Paris	2001	2.148
1	Paris	2008	2.211
2	Paris	2009	2.234
3	Paris	2010	2.244

You can then carry on and grab another column after that filter:

In [24]:

df[df['city'] == 'Paris'].year

Out[24]:

0    2001
1    2008
2    2009
3    2010
Name: year, dtype: int64

If you want to select a row from a DataFrame then you can use the .loc attribute which allows you to pass index values like:

In [25]:

df.loc[2]

Out[25]:

city    Paris
year     2009
pop     2.234
Name: 2, dtype: object

In [26]:

df.loc[2]['city']

Out[26]:

'Paris'

Adding new columns¶

New columns can be added to a DataFrame simply by assigning them by index (as you would for a Python dict) and can be deleted with the del keyword in the same way:

In [27]:

df['continental'] = df['city'] != 'London'
df

Out[27]:

	city	year	pop	continental
0	Paris	2001	2.148	True
1	Paris	2008	2.211	True
2	Paris	2009	2.234	True
3	Paris	2010	2.244	True
4	London	2001	7.322	False
5	London	2006	7.657	False
6	London	2011	8.174	False
7	London	2015	8.615	False
8	Rome	2001	2.547	True
9	Rome	2006	2.627	True
10	Rome	2009	2.734	True
11	Rome	2012	2.627	True

In [28]:

del df['continental']

Exercises¶

Create the DataFrame containing the census data for the three cities.
Select the data for the year 2001. Which city had the smallest population that year?
Find all the cities which had a population smaller than 2.6 million.

Reading from file¶

One of the msot common situations is that you have some data file containing the data you want to read. Perhaps this is data you've produced yourself or maybe it's from a collegue. In an ideal world the file will be perfectly formatted and will be trivial to import into pandas but since this is so often not the case, it provides a number of features to make your ife easier.

Full information on reading and writing is available in the pandas manual on IO tools but first it's worth noting the common formats that pandas can work with:

Comma separated tables (or tab-separated or space-separated etc.)
Excel spreadsheets
HDF5 files
SQL databases

For this course we will focus on plain-text CSV files as they are perhaps the most common format. Imagine we have a CSV file like (you can download this file from city_pop.csv):

In [29]:

!cat city_pop.csv  # Uses the IPython 'magic' !cat to print the file

This is an example CSV file
The text at the top here is not part of the data but instead is here
to describe the file. You'll see this quite often in real-world data.
A -1 signifies a missing value.

year;London;Paris;Rome
2001;7.322;2.148;2.547
2006;7.652;;2.627
2008;-1;2.211;
2009;-1;2.234;2.734
2011;8.174;;
2012;-1;2.244;2.627
2015;8.615;;

We can use the pandas function read_csv() to read the file and convert it to a DataFrame. Full documentation for this function can be found in the manual or, as with any Python object, directly in the notebook by putting a ? after the name:

In [30]:

pd.read_csv?

In [31]:

pd.read_csv('city_pop.csv')

Out[31]:

	This is an example CSV file
0	The text at the top here is not part of the da...
1	to describe the file. You'll see this quite of...
2	A -1 signifies a missing value.
3	year;London;Paris;Rome
4	2001;7.322;2.148;2.547
5	2006;7.652;;2.627
6	2008;-1;2.211;
7	2009;-1;2.234;2.734
8	2011;8.174;;
9	2012;-1;2.244;2.627
10	2015;8.615;;

We can see that by default it's done a fairly bad job of parsing the file (this is mostly because I;ve construsted the city_pop.csv file to be as obtuse as possible). It's making a lot of assumptions about the structure of the file but in general it's taking quite a naïve approach.

The first this we notice is that it's treating the text at the top of the file as though it's data. Checking the documentation we see that the simplest way to solve this is to use the skiprows argument to the function to which we give an integer giving the number of rows to skip:

In [32]:

pd.read_csv(
    'city_pop.csv',
    skiprows=5,
)

Out[32]:

	year;London;Paris;Rome
0	2001;7.322;2.148;2.547
1	2006;7.652;;2.627
2	2008;-1;2.211;
3	2009;-1;2.234;2.734
4	2011;8.174;;
5	2012;-1;2.244;2.627
6	2015;8.615;;

The next most obvious problem is that it is not separating the columns at all. This is controlled by the sep argument which is set to ',' by default (hence comma separated values). We can simply set it to the appropriate semi-colon:

In [33]:

pd.read_csv(
    'city_pop.csv',
    skiprows=5,
    sep=';'
)

Out[33]:

	year	London	Paris	Rome
0	2001	7.322	2.148	2.547
1	2006	7.652	NaN	2.627
2	2008	-1.000	2.211	NaN
3	2009	-1.000	2.234	2.734
4	2011	8.174	NaN	NaN
5	2012	-1.000	2.244	2.627
6	2015	8.615	NaN	NaN

Reading the descriptive header of our data file we see that a value of -1 signifies a missing reading so we should mark those too. This can be done after the fact but it is simplest to do it at import-time using the na_values argument:

In [34]:

pd.read_csv(
    'city_pop.csv',
    skiprows=5,
    sep=';',
    na_values='-1'
)

Out[34]:

	year	London	Paris	Rome
0	2001	7.322	2.148	2.547
1	2006	7.652	NaN	2.627
2	2008	NaN	2.211	NaN
3	2009	NaN	2.234	2.734
4	2011	8.174	NaN	NaN
5	2012	NaN	2.244	2.627
6	2015	8.615	NaN	NaN

The last this we want to do is use the year column as the index for the DataFrame. This can be done by passing the name of the column to the index_col argument:

In [35]:

df3 = pd.read_csv(
    'city_pop.csv',
    skiprows=5,
    sep=';',
    na_values='-1',
    index_col='year'
)
df3

Out[35]:

	London	Paris	Rome
year
2001	7.322	2.148	2.547
2006	7.652	NaN	2.627
2008	NaN	2.211	NaN
2009	NaN	2.234	2.734
2011	8.174	NaN	NaN
2012	NaN	2.244	2.627
2015	8.615	NaN	NaN

Exercises¶

Alongside city_pop.csv there is another file called cetml1659on.dat (also available from here). This contains some historical weather data for a location in the UK. Import that file as a Pandas DataFrame using read_csv(), making sure that you cover all the NaN values.
How many years had a negative average temperature in January?
What was the average temperature in June over the years in the data set? Tip: look in the documentation for which method to call.

We will come back to this data set in a later stage.

That covers the basics of reading in data with pandas. For more tutorials on further pandas topics you can have look at their website. Also, the official Pandas cheat sheet is very useful.

Next we will do something useful with our data and plot it. Continue to the next section.