It is recommended to view this notebook in nbviewer for the best viewing experience.
You can also execute the code in this notebook on Binder - no local installation required.
In the time it took you to read this sentence, terabytes of data have been collectively generated across the world — more data than any of us could ever hope to process, much less make sense of, on the machines we're using to read this notebook.
In response to this massive influx of data, the field of Data Science has come to the forefront in the past decade. Cobbled together by people from a diverse array of fields — statistics, physics, computer science, design, and many more — the field of Data Science represents our collective desire to understand and harness the abundance of data around us to build a better world.
In this notebook, I'm going to go over a basic Python data analysis pipeline from start to finish to show you what a typical data science workflow looks like.
In addition to providing code examples, I also hope to imbue in you a sense of good practices so you can be a more effective — and more collaborative — data scientist.
I will be following along with the data analysis checklist from The Elements of Data Analytic Style, which I strongly recommend reading as a free and quick guidebook to performing outstanding data analysis.
This notebook is intended to be a public resource. As such, if you see any glaring inaccuracies or if a critical topic is missing, please feel free to point it out or (preferably) submit a pull request to improve the notebook.
Please see the repository README file for the licenses and usage terms for the instructional material and code in this notebook. In general, I have licensed this material so that it is as widely usable and shareable as possible.
If you don't have Python on your computer, you can use the Anaconda Python distribution to install most of the Python packages you need. Anaconda provides a simple double-click installer for your convenience.
This notebook uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are NumPy, pandas, scikit-learn, matplotlib, and Seaborn.
To make sure you have all of the packages you need, install them with conda:
conda install numpy pandas scikit-learn matplotlib seaborn
conda install -c conda-forge watermark
conda may ask you to update some of them if you don't have the most recent version. Allow it to do so.
Note: I will not be providing support for people trying to run this notebook outside of the Anaconda Python distribution.
For the purposes of this exercise, let's pretend we're working for a startup that just got funded to create a smartphone app that automatically identifies species of flowers from pictures taken on the smartphone. We're working with a moderately-sized team of data scientists and will be building part of the data analysis pipeline for this app.
We've been tasked by our company's Head of Data Science to create a demo machine learning model that takes four measurements from the flowers (sepal length, sepal width, petal length, and petal width) and identifies the species based on those measurements alone.
We've been given a data set from our field researchers to develop the demo, which only includes measurements for three types of Iris flowers: Iris setosa, Iris versicolor, and Iris virginica.
The four measurements we're using currently come from hand-measurements by the field researchers, but they will be automatically measured by an image processing model in the future.
Note: The data set we're working with is the famous Iris data set — included with this notebook — which I have modified slightly for demonstration purposes.
The first step to any data analysis project is to define the question or problem we're looking to solve, and to define a measure (or set of measures) for our success at solving that task. The data analysis checklist has us answer a handful of questions to accomplish that, so let's work through those questions.
Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?
We're trying to classify the species (i.e., class) of the flower based on four measurements that we're provided: sepal length, sepal width, petal length, and petal width.
Did you define the metric for success before beginning?
Let's do that now. Since we're performing classification, we can use accuracy (the fraction of correctly classified flowers) to quantify how well our model is performing. Our company's Head of Data Science has told us that we should achieve at least 90% accuracy.
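To make the metric concrete, here is a minimal sketch of how accuracy could be computed in plain Python. The function name accuracy and the example labels are made up for illustration and are not part of our data set.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('Both label lists must have the same length')
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels, purely for illustration
true_labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa']
predicted_labels = ['Iris-setosa', 'Iris-virginica', 'Iris-virginica', 'Iris-setosa']

example_accuracy = accuracy(true_labels, predicted_labels)
print(example_accuracy)  # 3 of the 4 predictions are correct
```

With these made-up labels, the model would fall short of our 90% target.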
Did you understand the context for the question and the scientific or business application?
We're building part of a data analysis pipeline for a smartphone app that will be able to classify the species of flowers from pictures taken on the smartphone. In the future, this pipeline will be connected to another pipeline that automatically measures from pictures the traits we're using to perform this classification.
Did you record the experimental design?
Our company's Head of Data Science has told us that the field researchers are hand-measuring 50 randomly-sampled flowers of each species using a standardized methodology. The field researchers take pictures of each flower they sample from pre-defined angles so the measurements and species can be confirmed by the other field researchers at a later point. At the end of each day, the data is compiled and stored on a private company GitHub repository.
Did you consider whether the question could be answered with the available data?
The data set we currently have is only for three types of Iris flowers. The model built off of this data set will only work for those Iris flowers, so we will need more data to create a general flower classifier.
Notice that we've spent a fair amount of time working on the problem without writing a line of code or even looking at the data.
Thinking about and documenting the problem we're working on is an important step to performing effective data analysis that often goes overlooked. Don't skip it.
The next step is to look at the data we're working with. Even curated data sets from the government can have errors in them, and it's vital that we spot these errors before investing too much time in our analysis.
Generally, we're looking to answer the following questions:
Let's start by reading the data into a pandas DataFrame.
import pandas as pd
iris_data = pd.read_csv('iris-data.csv')
iris_data.head()
|   | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
We're in luck! The data seems to be in a usable format.
The first row in the data file defines the column headers, and the headers are descriptive enough for us to understand what each column represents. The headers even give us the units that the measurements were recorded in, just in case we needed to know at a later point in the project.
Each row following the first row represents an entry for a flower: four measurements and one class, which tells us the species of the flower.
One of the first things we should look for is missing data. Thankfully, the field researchers already told us that they put a 'NA' into the spreadsheet when they were missing a measurement.
We can tell pandas to automatically identify missing values if it knows our missing value marker.
iris_data = pd.read_csv('iris-data.csv', na_values=['NA'])
Voilà! Now pandas knows to treat 'NA' entries as missing values.
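To see what this option does, here is a small sketch that reads a made-up CSV snippet from memory instead of from iris-data.csv. The column names match our data set, but the values and the 'missing' marker are hypothetical. pandas already recognizes 'NA' by default, so the sketch uses a different marker to make the effect of na_values visible.

```python
import io

import pandas as pd

# Made-up rows; 'missing' is a hypothetical marker for illustration
csv_text = """sepal_length_cm,petal_width_cm,class
5.1,0.2,Iris-setosa
4.9,missing,Iris-setosa
4.7,0.2,Iris-setosa
"""

example = pd.read_csv(io.StringIO(csv_text), na_values=['missing'])

# Count the missing entries in each column
missing_counts = example.isnull().sum()
print(missing_counts)
```

Counting missing entries per column like this is a quick way to find out where the gaps are before deciding what to do about them.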
Next, it's always a good idea to look at the distribution of our data — especially the outliers.
Let's start by printing out some summary statistics about the data set.
iris_data.describe()
|   | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 145.000000 |
| mean | 5.644627 | 3.054667 | 3.758667 | 1.236552 |
| std | 1.312781 | 0.433123 | 1.764420 | 0.755058 |
| min | 0.055000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.400000 |
| 50% | 5.700000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
We can see several useful values from this table. For example, we see that five petal_width_cm entries are missing.
If you ask me, though, tables like this are rarely useful unless we know that our data should fall in a particular range. It's usually better to visualize the data in some way. Visualization makes outliers and errors immediately stand out, whereas they might go unnoticed in a large table of numbers.
Since we know we're going to be plotting in this section, let's set up the notebook so we can plot inside of it.
# This line tells the notebook to show plots inside of the notebook
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb
Next, let's create a scatterplot matrix. Scatterplot matrices plot the distribution of each column along the diagonal, and a scatterplot for each pair of columns everywhere else. They make for an efficient tool to look for errors in our data.
We can even have the plotting package color each entry by its class to look for trends within the classes.
# We have to temporarily drop the rows with 'NA' values
# because the Seaborn plotting function does not know
# what to do with them
sb.pairplot(iris_data.dropna(), hue='class');
From the scatterplot matrix, we can already see some issues with the data set:
There are five classes when there should only be three, meaning there were some coding errors.
There are some clear outliers in the measurements that may be erroneous: one sepal_width_cm entry for Iris-setosa falls well outside its normal range, and several sepal_length_cm entries for Iris-versicolor are near-zero for some reason.
We had to drop those rows with missing values.
In all of these cases, we need to figure out what to do with the erroneous data. Which takes us to the next step...
Now that we've identified several errors in the data set, we need to fix them before we proceed with the analysis.
Let's walk through the issues one-by-one.
There are five classes when there should only be three, meaning there were some coding errors.
After talking with the field researchers, it sounds like one of them forgot to add Iris- before their Iris-versicolor entries. The other extraneous class, Iris-setossa, was simply a typo that they forgot to fix.
Let's use the DataFrame to fix these errors.
iris_data.loc[iris_data['class'] == 'versicolor', 'class'] = 'Iris-versicolor'
iris_data.loc[iris_data['class'] == 'Iris-setossa', 'class'] = 'Iris-setosa'
iris_data['class'].unique()
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
Much better! Now we only have three class types. Imagine how embarrassing it would've been to create a model that used the wrong classes.
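If there are many mislabeled classes, one alternative is to collect the corrections in a dictionary and apply them in one step with the Series replace method. This is only a sketch on a made-up DataFrame; the name class_fixes is an assumption for illustration.

```python
import pandas as pd

# Made-up class labels containing the two coding errors described above
example = pd.DataFrame({'class': ['Iris-setosa', 'Iris-setossa',
                                  'versicolor', 'Iris-versicolor',
                                  'Iris-virginica']})

# Map each incorrect label to its corrected form
class_fixes = {'versicolor': 'Iris-versicolor',
               'Iris-setossa': 'Iris-setosa'}

example['class'] = example['class'].replace(class_fixes)
fixed_classes = sorted(example['class'].unique())
print(fixed_classes)
```

Keeping the corrections in one dictionary also doubles as a record of every label we changed.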
There are some clear outliers in the measurements that may be erroneous: one sepal_width_cm entry for Iris-setosa falls well outside its normal range, and several sepal_length_cm entries for Iris-versicolor are near-zero for some reason.
Fixing outliers can be tricky business. It's rarely clear whether the outlier was caused by measurement error, recording the data in improper units, or if the outlier is a real anomaly. For that reason, we should be judicious when working with outliers: if we decide to exclude any data, we need to make sure to document what data we excluded and provide solid reasoning for excluding that data. (For example, "This data didn't fit my hypothesis" will not stand up to peer review.)
In the case of the one anomalous entry for Iris-setosa, let's say our field researchers know that it's impossible for Iris-setosa to have a sepal width below 2.5 cm. Clearly this entry was made in error, and we're better off just scrapping the entry than spending hours finding out what happened.
# This line drops any 'Iris-setosa' rows with a sepal width less than 2.5 cm
iris_data = iris_data.loc[(iris_data['class'] != 'Iris-setosa') | (iris_data['sepal_width_cm'] >= 2.5)]
iris_data.loc[iris_data['class'] == 'Iris-setosa', 'sepal_width_cm'].hist();
Excellent! Now all of our Iris-setosa rows have a sepal width of at least 2.5 cm.
The next data issue to address is the several near-zero sepal lengths for the Iris-versicolor rows. Let's take a look at those rows.
iris_data.loc[(iris_data['class'] == 'Iris-versicolor') &
(iris_data['sepal_length_cm'] < 1.0)]
|   | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | class |
|---|---|---|---|---|---|
| 77 | 0.067 | 3.0 | 5.0 | 1.7 | Iris-versicolor |
| 78 | 0.060 | 2.9 | 4.5 | 1.5 | Iris-versicolor |
| 79 | 0.057 | 2.6 | 3.5 | 1.0 | Iris-versicolor |
| 80 | 0.055 | 2.4 | 3.8 | 1.1 | Iris-versicolor |
| 81 | 0.055 | 2.4 | 3.7 | 1.0 | Iris-versicolor |
How about that? All of these near-zero sepal_length_cm entries seem to be off by two orders of magnitude, as if they had been recorded in meters instead of centimeters.
After some brief correspondence with the field researchers, we find that one of them forgot to convert those measurements to centimeters. Let's do that for them.
iris_data.loc[(iris_data['class'] == 'Iris-versicolor') &
(iris_data['sepal_length_cm'] < 1.0),
'sepal_length_cm'] *= 100.0
iris_data.loc[iris_data['class'] == 'Iris-versicolor', 'sepal_length_cm'].hist();
Phew! Good thing we fixed those outliers. They could've really thrown our analysis off.
We had to drop those rows with missing values.
Let's take a look at the rows with missing values:
iris_data.loc[(iris_data['sepal_length_cm'].isnull()) |
(iris_data['sepal_width_cm'].isnull()) |
(iris_data['petal_length_cm'].isnull()) |
(iris_data['petal_width_cm'].isnull())]
|   | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | class |
|---|---|---|---|---|---|
| 7 | 5.0 | 3.4 | 1.5 | NaN | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | NaN | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | NaN | Iris-setosa |
| 10 | 5.4 | 3.7 | 1.5 | NaN | Iris-setosa |
| 11 | 4.8 | 3.4 | 1.6 | NaN | Iris-setosa |
It's not ideal that we had to drop those rows, especially considering they're all Iris-setosa entries. Since it seems like the missing data is systematic (all of the missing values are in the same column for the same Iris type), this error could potentially bias our analysis.
One way to deal with missing data is mean imputation: If we know that the values for a measurement fall in a certain range, we can fill in empty values with the average of that measurement.
Let's see if we can do that here.
iris_data.loc[iris_data['class'] == 'Iris-setosa', 'petal_width_cm'].hist();
Most of the petal widths for Iris-setosa fall within the 0.2-0.3 range, so let's fill in these entries with the average measured petal width.
average_petal_width = iris_data.loc[iris_data['class'] == 'Iris-setosa', 'petal_width_cm'].mean()
iris_data.loc[(iris_data['class'] == 'Iris-setosa') &
(iris_data['petal_width_cm'].isnull()),
'petal_width_cm'] = average_petal_width
iris_data.loc[(iris_data['class'] == 'Iris-setosa') &
(iris_data['petal_width_cm'] == average_petal_width)]
|   | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | class |
|---|---|---|---|---|---|
| 7 | 5.0 | 3.4 | 1.5 | 0.25 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.25 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.25 | Iris-setosa |
| 10 | 5.4 | 3.7 | 1.5 | 0.25 | Iris-setosa |
| 11 | 4.8 | 3.4 | 1.6 | 0.25 | Iris-setosa |
iris_data.loc[(iris_data['sepal_length_cm'].isnull()) |
(iris_data['sepal_width_cm'].isnull()) |
(iris_data['petal_length_cm'].isnull()) |
(iris_data['petal_width_cm'].isnull())]
|   | sepal_length_cm | sepal_width_cm | petal_length_cm | petal_width_cm | class |
|---|---|---|---|---|---|
Great! Now we've recovered those rows and no longer have missing data in our data set.
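If several classes had missing values, the same idea could be applied to every class at once by grouping on the class column. This is a sketch on made-up values, not the method used above; groupby, transform, and fillna are standard pandas calls, while the example data is hypothetical.

```python
import pandas as pd

# Made-up measurements with one missing petal width per class
example = pd.DataFrame({
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-setosa',
              'Iris-virginica', 'Iris-virginica', 'Iris-virginica'],
    'petal_width_cm': [0.2, 0.4, None, 2.0, None, 2.4],
})

# Replace each missing value with the mean of its own class
example['petal_width_cm'] = (
    example.groupby('class')['petal_width_cm']
           .transform(lambda widths: widths.fillna(widths.mean()))
)
print(example)
```

Imputing within each class avoids filling an Iris-setosa gap with an average that is pulled upward by the other, larger species.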
Note: If you don't feel comfortable imputing your data, you can drop all rows with missing data with the dropna() call:
iris_data.dropna(inplace=True)
After all this hard work, we don't want to repeat this process every time we work with the data set. Let's save the tidied data file as a separate file and work directly with that data file from now on.
iris_data.to_csv('iris-data-clean.csv', index=False)
iris_data_clean = pd.read_csv('iris-data-clean.csv')
Let's take another look at the scatterplot matrix now that we've tidied the data.
sb.pairplot(iris_data_clean, hue='class');
Of course, I purposely inserted numerous errors into this data set to demonstrate some of the many possible scenarios you may face while tidying your data.
The general takeaways here should be:
Make sure your data is encoded properly
Make sure your data falls within the expected range, and use domain knowledge whenever possible to define that expected range
Deal with missing data in one way or another: replace it if you can or drop it
Never tidy your data manually because that is not easily reproducible
Use code as a record of how you tidied your data
Plot everything you can about the data at this stage of the analysis so you can visually confirm everything looks correct
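One way to keep the tidying reproducible is to collect the steps in a single function, so the cleaned data can always be regenerated from the raw file. The sketch below repeats some of the steps from this section on a tiny made-up DataFrame; the function name tidy_iris is an assumption for illustration.

```python
import pandas as pd


def tidy_iris(raw_data):
    """Return a tidied copy of the raw data, leaving the input untouched."""
    data = raw_data.copy()

    # Fix the class names that were entered incorrectly
    data['class'] = data['class'].replace({'versicolor': 'Iris-versicolor',
                                           'Iris-setossa': 'Iris-setosa'})

    # Drop 'Iris-setosa' rows with a sepal width below 2.5 cm
    data = data.loc[(data['class'] != 'Iris-setosa') |
                    (data['sepal_width_cm'] >= 2.5)].copy()

    # Convert 'Iris-versicolor' sepal lengths recorded in meters to centimeters
    in_meters = ((data['class'] == 'Iris-versicolor') &
                 (data['sepal_length_cm'] < 1.0))
    data.loc[in_meters, 'sepal_length_cm'] *= 100.0

    return data


# Made-up rows that contain each kind of error
raw = pd.DataFrame({
    'sepal_length_cm': [5.1, 4.5, 0.067, 6.3],
    'sepal_width_cm': [3.5, 2.3, 3.0, 3.3],
    'class': ['Iris-setossa', 'Iris-setosa', 'versicolor', 'Iris-virginica'],
})

tidy = tidy_iris(raw)
print(tidy)
```

Because the function works on a copy, the raw data stays available for comparison if we ever need to revisit a tidying decision.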
At SciPy 2015, I was exposed to a great idea: We should test our data. Just as we use unit tests to verify our expectations of code, we can set up similar tests to verify our expectations about a data set.
We can quickly test our data using assert statements: We assert that something must be true, and if it is, then nothing happens and the notebook continues running. However, if our assertion is wrong, then the notebook stops running and brings it to our attention. For example,
assert 1 == 2
will raise an AssertionError and stop execution of the notebook because the assertion failed.
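An assert statement can also carry a message that is shown when the check fails, which makes the failure easier to diagnose. Here is a small sketch with made-up values; the try/except is only there so the example runs to completion instead of stopping.

```python
# Made-up measurements, purely for illustration
sepal_widths = [3.5, 3.0, 2.3]

try:
    # The text after the comma becomes the error message
    assert min(sepal_widths) >= 2.5, 'Found a sepal width below 2.5 cm'
    message = None
except AssertionError as error:
    message = str(error)

print(message)
```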
Let's test a few things that we know about our data set now.
# We know that we should only have three classes
assert len(iris_data_clean['class'].unique()) == 3
# We know that sepal lengths for 'Iris-versicolor' should never be below 2.5 cm
assert iris_data_clean.loc[iris_data_clean['class'] == 'Iris-versicolor', 'sepal_length_cm'].min() >= 2.5
# We know that our data set should have no missing measurements
assert len(iris_data_clean.loc[(iris_data_clean['sepal_length_cm'].isnull()) |
(iris_data_clean['sepal_width_cm'].isnull()) |
(iris_data_clean['petal_length_cm'].isnull()) |
(iris_data_clean['petal_width_cm'].isnull())]) == 0
And so on. If any of these expectations are violated, then our analysis immediately stops and we have to return to the tidying stage.
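These checks can also be gathered into one function that we call whenever the data is reloaded. This is a sketch under the same expectations as above; the function name check_iris_data and the example rows are hypothetical.

```python
import pandas as pd


def check_iris_data(data):
    """Raise an AssertionError if the data violates our expectations."""
    # We should only have three classes
    assert data['class'].nunique() == 3, 'Expected exactly three classes'

    # Sepal lengths for 'Iris-versicolor' should never be below 2.5 cm
    versicolor = data.loc[data['class'] == 'Iris-versicolor', 'sepal_length_cm']
    assert versicolor.min() >= 2.5, 'Iris-versicolor sepal length below 2.5 cm'

    # There should be no missing measurements
    assert not data.isnull().values.any(), 'Found missing values'


# Made-up rows that satisfy all three expectations
example = pd.DataFrame({
    'sepal_length_cm': [5.1, 6.7, 6.3],
    'class': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})

check_iris_data(example)  # Passes silently
```

With the real data, the call would be check_iris_data(iris_data_clean).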
Now after spending entirely too much time tidying our data, we can start analyzing it!
Exploratory analysis is the step where we start delving deeper into the data set beyond the outliers and errors. We'll be looking to answer questions such as:
How is my data distributed?
Are there any correlations in my data?
Are there any confounding factors that explain these correlations?
This is the stage where we plot all the data in as many ways as possible. Create many charts, but don't bother making them pretty — these charts are for internal use.
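For the correlation question, a correlation matrix is a quick numerical complement to the plots. The sketch below uses made-up values; with the real data, the same call could be applied to iris_data_clean.

```python
import pandas as pd

# Made-up measurements, purely for illustration
example = pd.DataFrame({
    'petal_length_cm': [1.4, 1.5, 4.5, 4.7, 5.9, 6.1],
    'petal_width_cm': [0.2, 0.3, 1.5, 1.4, 2.1, 2.3],
    'class': ['Iris-setosa', 'Iris-setosa',
              'Iris-versicolor', 'Iris-versicolor',
              'Iris-virginica', 'Iris-virginica'],
})

# Pearson correlation between every pair of numeric columns
correlations = example.select_dtypes(include='number').corr()
print(correlations)
```

Selecting the numeric columns first keeps the class labels out of the calculation.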
Let's return to that scatterplot matrix that we used earlier.
sb.pairplot(iris_data_clean);
Our data is normally distributed for the most part, which is great news if we plan on using any modeling methods that assume the data is normally distributed.
There's something strange going on with the petal measurements. Maybe it's something to do with the different Iris types. Let's color code the data by the class again to see if that clears things up.
sb.pairplot(iris_data_clean, hue='class');