In this second lesson, we'll apply the core elements of programming languages to a common scientific task: reading a data table, performing analysis on its contents, and saving the results. As we go through these stages of analysis, we'll note how each task relates to one of the elements from the last lesson (marking them as E1, for example, for element 1, "a thing").
For this lesson, we'll use two important scientific Python packages.
numpy is the main package providing numerical analysis functions, and
pandas is designed to make it easy to work with tabular data.
# Import pandas and set up inline plotting
import pandas as pd
%matplotlib inline
Unless you're generating your own data by simulation (as in our previous logistic growth function), most scientific analyses begin with loading an external data set.
For this lesson, we'll use data from the North American Breeding Bird Survey. As part of this survey, volunteers have driven cars along fixed routes once a year for the past forty years, stopping periodically along the way and counting all of the birds that they see when they do. The particular data tables that we'll work with today summarize the number of birds of many different species that were counted along routes in the state of California. The large table contains forty years of data for all sighted species, while the small table is a subset of the large table.
You can download and play with this data yourself at:
Pardieck, K.L., D.J. Ziolkowski Jr., M.-A.R. Hudson. 2015. North American Breeding Bird Survey Dataset 1966 - 2014, version 2014.0. U.S. Geological Survey, Patuxent Wildlife Research Center http://www.pwrc.usgs.gov/BBS/RawData/.
Tip: It's often a good idea to take a large data set and extract a small portion of it to use while building and testing your analysis code. Small data sets can be analyzed faster and allow you to see, visually, what the "right answer" should be when you write code to perform analysis. Determining whether your function gives the right answer on a small data set is the core idea behind unit testing, which we'll discuss later.
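To make the tip concrete, here is a minimal sketch of checking code against a hand-computed answer on a tiny table. The species names and counts below are invented stand-ins for the real bird data, and the function name is our own choice:

```python
import io
import pandas as pd

# A tiny made-up table in the same shape as the bird data
# (species as rows, years as columns); names and counts are invented
csv_text = """species,2009,2010,2011
American Crow,5,3,0
House Finch,0,2,4
Mourning Dove,1,0,2
"""
bird_tiny = pd.read_csv(io.StringIO(csv_text), index_col="species")

def species_present(counts):
    """Number of species with a nonzero count in one year's column."""
    return (counts > 0).sum()

# The table is small enough to check the answer by eye: in 2010,
# two species (American Crow and House Finch) have nonzero counts
assert species_present(bird_tiny["2010"]) == 2
```

If the assert passes silently, the function agrees with the answer we worked out by hand; that one check is a unit test in miniature.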
# You can use the exclamation point symbol (the "bang") to run a shell command
# Let's use cat to see the contents of the small data table
# Read the small table using pandas
# The DataFrame function (E2) in pandas creates a thing (E1) called a DataFrame
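In the lesson this cell would read the file from disk (something like `pd.read_csv("bird_sm.csv")`). To keep this sketch self-contained, we read the same kind of table from an in-memory string instead; the species names and counts are invented:

```python
import io
import pandas as pd

# Stand-in for: bird_sm = pd.read_csv("bird_sm.csv")
# Species names and counts are invented for illustration
csv_text = """species,2009,2010,2011
American Crow,5,3,0
House Finch,0,2,4
Mourning Dove,1,0,2
"""
bird_sm = pd.read_csv(io.StringIO(csv_text), index_col="species")
print(bird_sm)
```

Either way, the result is a DataFrame with species as the row labels and years as the column labels.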
# Now let's look at the contents of our data frame "thing" (E1)
# Like other "things" in Python, a data frame is an object
# The object contains methods that operate on it (E2)
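A couple of those methods in action, using an invented stand-in table (species as rows, years as columns) so the sketch runs on its own:

```python
import io
import pandas as pd

# Invented stand-in for the small bird table
csv_text = """species,2009,2010,2011
American Crow,5,3,0
House Finch,0,2,4
Mourning Dove,1,0,2
"""
bird_sm = pd.read_csv(io.StringIO(csv_text), index_col="species")

print(bird_sm.head(2))   # head() is a method: the first few rows
print(bird_sm.mean())    # mean() is a method: the per-column (per-year) mean
```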
A data frame can be thought of as a kind of "thing" (E1), like we have above, that we can move around and perform operations on. However, it also shares some characteristics with a collection of things (E3), because we can use indexing and slicing to pull out subsets of the data table.
There are two main ways to select rows and columns from our table: using the labels for the rows and columns, or using numeric indexes for the row and column positions. Below we'll focus on label names; check out the pandas help for the iloc method to learn about using numeric indexes.
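A quick side-by-side of the two styles, on an invented stand-in table; both expressions pick out the same cell:

```python
import io
import pandas as pd

# Invented stand-in for the small bird table
csv_text = """species,2009,2010,2011
American Crow,5,3,0
House Finch,0,2,4
Mourning Dove,1,0,2
"""
bird_sm = pd.read_csv(io.StringIO(csv_text), index_col="species")

by_label = bird_sm.loc["American Crow", "2010"]   # by row and column names
by_position = bird_sm.iloc[0, 1]                  # row 0, column 1: same cell
print(by_label, by_position)
```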
# Look at the table again and think about it as a collection (E3)
# Use the loc method to pull out rows and columns by name
# Like a matrix, the row goes first, then the column
# You can use ranges of names, similar to what we saw before for lists
# You can also use lists of names
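Both selection styles sketched together, again on an invented stand-in table. Note one difference from list slicing: with label ranges, both endpoints are included:

```python
import io
import pandas as pd

# Invented stand-in for the small bird table
csv_text = """species,2009,2010,2011
American Crow,5,3,0
House Finch,0,2,4
Mourning Dove,1,0,2
"""
bird_sm = pd.read_csv(io.StringIO(csv_text), index_col="species")

# Label ranges: unlike list slices, both endpoints are included
subtable = bird_sm.loc["American Crow":"House Finch", "2009":"2010"]

# Lists of names pick out arbitrary rows and columns
picked = bird_sm.loc[["American Crow", "Mourning Dove"], ["2009", "2011"]]
print(subtable)
print(picked)
```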
Once we have our data table read in, we generally want to perform some sort of analysis on it. Let's say we want to calculate the mean number of individuals sighted per species in each year. However, we only want the average over the species that were actually sighted in the state that year, ignoring species with counts of zero (this is a fairly common analysis in ecology).
Conceptually, one way to approach this problem is to imagine looping through (E4) all of the years, that is, the columns of the data frame, one by one. For each year, we want to count the number of species present, sum their counts, and divide the sum of the counts by the number of species seen. We should record this information in some other sort of collection (E3); we'll use another data frame.
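The steps just described can be sketched as a loop over the columns. The data here are the same invented stand-in table used above, and a plain dictionary stands in for the result collection for the moment:

```python
import io
import pandas as pd

# Invented stand-in for the small bird table
csv_text = """species,2009,2010,2011
American Crow,5,3,0
House Finch,0,2,4
Mourning Dove,1,0,2
"""
bird_sm = pd.read_csv(io.StringIO(csv_text), index_col="species")

mean_counts = {}
for year in bird_sm.columns:          # loop (E4) over the years
    counts = bird_sm[year]
    present = counts[counts > 0]      # ignore species not seen that year
    mean_counts[year] = present.sum() / len(present)
print(mean_counts)
```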
# First, let's set up a new data frame to hold the result of our calculation
# We'll get the column names from the bird table, then use DataFrame to make a new df
# Next, let's figure out how we would do our analysis for one year, say 2010
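Working it out for a single year first keeps the logic easy to check by eye. Again the table is an invented stand-in:

```python
import io
import pandas as pd

# Invented stand-in for the small bird table
csv_text = """species,2009,2010,2011
American Crow,5,3,0
House Finch,0,2,4
Mourning Dove,1,0,2
"""
bird_sm = pd.read_csv(io.StringIO(csv_text), index_col="species")

counts_2010 = bird_sm["2010"]
present = counts_2010[counts_2010 > 0]      # species actually seen in 2010
mean_2010 = present.sum() / len(present)
print(mean_2010)
```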
# Our final calculation code could look like this
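Putting the pieces together, one plausible version of the full calculation builds an empty result data frame from the bird table's columns and fills it in year by year (the table itself is still the invented stand-in):

```python
import io
import pandas as pd

# Invented stand-in for the small bird table
csv_text = """species,2009,2010,2011
American Crow,5,3,0
House Finch,0,2,4
Mourning Dove,1,0,2
"""
bird_sm = pd.read_csv(io.StringIO(csv_text), index_col="species")

# New data frame with the same year columns and a single row for our result
result = pd.DataFrame(index=["mean_count"], columns=bird_sm.columns)

for year in bird_sm.columns:
    counts = bird_sm[year]
    present = counts[counts > 0]     # only species actually sighted that year
    result.loc["mean_count", year] = present.sum() / len(present)
print(result)
```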
Write a for loop (E4) that loops over all years, calculating the mean count of birds per species present each year, and stores the result in a new, empty data frame.
Write a function that takes a data frame (like bird_sm) and returns the result data frame. Test it with bird_sm to make sure that it works.
Write an if-else statement (E5) that checks for the problem that you just uncovered and takes some reasonable action when it occurs.
Now that we've managed to generate some useful results, we want to save them somewhere on our computer for later use. There are two broad types of outputs that we might want to save, tables and plots, and we'll use the built-in methods for data frames to do both.
Getting a plot to look just right can take a very long time. Here we'll just use the pandas default styles. For more help, have a look at the extra lesson on plotting.
# First we make sure that we've saved our results table
# Data frames have a method to save themselves as a csv file - easy!
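A minimal sketch of that save, using a small made-up result table; the filename is an arbitrary choice for this example. Reading the file back confirms the round trip:

```python
import pandas as pd

# A small result table like the one computed above (values invented)
result = pd.DataFrame(
    {"2009": [3.0], "2010": [2.5], "2011": [3.0]}, index=["mean_count"]
)

# to_csv is the data frame's own method for saving itself
result.to_csv("bird_results_sm.csv")   # filename is our own choice

# Read it back to check the round trip
round_trip = pd.read_csv("bird_results_sm.csv", index_col=0)
print(round_trip)
```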
# Data frames also have a method to plot their contents
# There's one trick though - by default they put the rows on the x axis and columns on the y
# We want the reverse, so we need to transpose our data frame before plotting it
# With a few extra steps, we can save the plot
# This code looks strange, since we haven't talked about the details of matplotlib
# At this stage, it's best to just use it as a recipe
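One plausible version of that recipe, transposing first and then saving; the result values and filenames are invented, and the Agg backend line is only there so the sketch runs without a display:

```python
import os

import matplotlib
matplotlib.use("Agg")   # non-interactive backend so this runs without a display
import pandas as pd

# Small made-up result table (years as columns)
result = pd.DataFrame(
    {"2009": [3.0], "2010": [2.5], "2011": [3.0]}, index=["mean_count"]
)

# Transpose so the years become the rows (the x axis), then plot
ax = result.T.plot(legend=False)
ax.set_xlabel("year")
ax.set_ylabel("mean count per species present")

# Grab the figure the plot lives in and save it; filename is our own choice
ax.get_figure().savefig("bird_results.png")
saved = os.path.exists("bird_results.png")
```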
Instead of bird_sm.csv, make your cell use bird_lg.csv and see what the saved results look like. If necessary, modify your code and variable names so that all you have to do is change two letters (sm to lg) in one place in the code to make this change.