Data analysis in Python

This course is aimed at the intermediate Python developer who wants to learn how to do useful data analysis tasks in Python. It will initially focus on the Python package pandas but will also cover matplotlib, NumPy and SciPy to some degree.

Data analysis is a huge topic and we couldn't possibly cover it all in one short course. The purpose of this workshop is to introduce some of the most useful tools and to demonstrate some of the most common problems that come up.

In previous courses, you've used the python command line program to execute scripts and ipython to run interactively. This course will use another tool called Jupyter (previously known as IPython Notebook) to run your Python code. It operates like a standard IPython interactive session with the addition of allowing you to intersperse your code with blocks of text to explain what you're doing and embed output such as graphs directly into the page.

To get started, launch Jupyter Notebook as shown by your instructor. Towards the top-right of that page there should be three buttons, the middle of which is labelled New. Clicking that gives a drop-down and you should select the Python 3 option.

This will open a new notebook file. Give it a name by clicking on the 'Untitled' at the top of the screen.

You will likely want to start a new notebook for each section of the course, so name them appropriately to make them easier to find later.

Getting started

Once the notebook is launched, you will see a wide white box with a grey box inside it with a blue In [ ]: to the left. The grey box is an input cell, similar to that which you find in the IPython command line program. You type any Python code you want to run inside that box:

In [1]:
# Python code can be written in 'Code' cells
print('Output appears below when the cell is run')
print('To run a cell, press Ctrl-Enter with the cursor inside or use the run button in the toolbar at the top')
Output appears below when the cell is run
To run a cell, press Ctrl-Enter with the cursor inside or use the run button in the toolbar at the top

In your notebook, type the following in the first cell and then run it. The value of the last line is displayed below the cell, so you should see the same output:

In [2]:
a = 5
b = 7
a + b
12

The cells in a notebook are linked together, so a variable defined in one cell is available in all the cells from that point on. In the second cell you can therefore use the variables a and b:

In [3]:
c = a - b
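
As a minimal sketch, the two cells above behave as if they had been typed one after another into a single Python session. The name total is ours, added only so the first result can be printed alongside the second. What matters is the order in which the cells are run, not their position on the page.

```python
# First cell: define two variables and add them
a = 5
b = 7
total = a + b   # the notebook would display this value below the cell

# Second cell: run afterwards, a and b are still defined
c = a - b

print(total, c)
```

If you restart the notebook and run the second cell on its own, Python raises a NameError because a and b have not been defined yet.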

Some Python libraries have special integration with Jupyter notebooks and so can display their output directly into the page. For example pandas will format tables of data nicely and matplotlib will embed graphs directly:

In [4]:
from pandas import DataFrame
# Build a small table from a list of rows
DataFrame([[1, 2, 3], [5, 6, 6]])
   0  1  2
0  1  2  3
1  5  6  6

In [5]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0.0, 2.0, 0.01)
s = np.sin(2*np.pi*t)
plt.plot(t, s)


If you want to write some text as documentation (like these words here) then you should label the cell as being a Markdown cell. Do that by selecting the cell and going to the dropdown at the top of the page labelled Code and changing it to Markdown.

It is becoming common for people to use Jupyter notebooks as a sort of lab notebook where they document their processes, interspersed with code. This style of working, where you give prose and code equal weight, is sometimes called literate programming.


Take the following code and break it down, chunk by chunk, interspersing it with documentation explaining what each part does using Markdown blocks:

prices = {'apple': 0.40, 'banana': 0.50}
my_purchase = {
    'apple': 1,
    'banana': 6
}
grocery_bill = 0
for fruit in my_purchase:
    grocery_bill += prices[fruit] * my_purchase[fruit]
print('I owe the grocer ${:.2f}'.format(grocery_bill))

You don't need to put only one line of code per cell; it often makes sense to group related lines together.
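
To show what such a breakdown might look like, here is a rough sketch using a different, made-up example (the timings data and every name in it are hypothetical). The comments stand in for the Markdown cells you would write, and each group of lines beneath a comment would go in its own code cell.

```python
# Markdown cell: "First, record how long each run took, in seconds."
timings = {'run_1': 12.5, 'run_2': 11.0, 'run_3': 13.0}

# Markdown cell: "Next, work out the average run time."
# These two lines form a single step, so they share one code cell.
total = sum(timings.values())
average = total / len(timings)

# Markdown cell: "Finally, report the result to one decimal place."
message = 'Average run time: {:.1f} s'.format(average)
print(message)
```

This is only one way of dividing the code up. The aim is that each cell does one thing that you can describe in a sentence.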

Throughout this course, use the Jupyter notebook to solve the problems. Follow along with the examples, typing them into your own notebooks and see how they work.

Now to move on to the next page.