Why automate your workflow, and how to approach the process

Questions for students to consider:

  1. In the data exploration section you made some plots from your data. What if you want to look at other relationships?
  2. Are there computational processes you do often? How do you implement these?
  3. Do you have a clear workflow you could replicate from data to conclusions?
  4. Could you plug this new data set into your old workflow?

Level of Python / Jupyter Automation

  1. Good - Document all analysis steps in enough detail that they can be reproduced successfully.
  2. Better - Script your analysis
  3. Best - Script your analysis and write tests to validate each step.

Key takeaways

  • Code is read much more often than it is written
  • You are NEVER finished with an analysis (drafts, reviewer comments, new data, etc.). Make your own future life easy!
  • Repeating yourself creates opportunities for mistakes/divergence

Learning Objectives of Automation Module (total time: 3 hrs, including a 15 min break)

Lesson 1

  • Employ best practices for naming variables, including: don’t use existing function names, avoid periods in names, and don’t start a variable name with a number.
  • Defensive programming: catch errors instead of just fixing them.

Lesson 2

  • Define "Don't Repeat Yourself" (DRY) and provide examples of how you would implement DRY in your code
  • Identify code that can be modularized following DRY and implement a modular workflow using functions.

Lesson 3

  • Know how to construct a function: variables, function name, syntax, documentation, return values
  • Demonstrate use of a function within the notebook / code.
  • Construct and compose function documentation that clearly defines inputs, output variables and behaviour.

Basic Overview of the suggested workflow using Socrative (Optional)

  • Use Socrative quizzes to collect answers from student activities (students can run their code in their notebooks, and post to Socrative). This allows the instructor to see what solutions students came up with and identify where misconceptions and confusion arise. Using Socrative quizzes also creates a record of student work that can be analyzed after class to see how students are learning and where they are having trouble.
  • Prepared Socrative quizzes designed for the automation module can be shared by URL with each teacher so they do not have to be remade.

Setup

Lesson 1


Let's begin by creating a new Jupyter notebook.

Question:

  • According to the organization we set up, where should we put this notebook?

Review of good variable practices

Learning Objective: Employ best practices for naming variables, including: don’t use existing function names, avoid periods in names, and don’t start a variable name with a number.

Types of variables:

  • strings, integers, etc.

Keep in mind that code is read many more times than it is written!
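
To make these rules concrete, here is a small sketch of good and bad names (all invented for illustration):

In [ ]:
# Good: descriptive, lowercase with underscores, starts with a letter
mean_lifeexp = 46.3

# Bad: shadows the built-in function list()
# list = [1952, 1957, 1962]

# Bad: a name cannot start with a number (SyntaxError)
# 1952_lifeexp = 46.3

# Bad: a period means attribute access in Python, not part of a name
# life.exp = 46.3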

Other useful conventions with variables to follow

  1. Set up variables at the beginning of your notebook, right after importing libraries.
  2. Use variables instead of hard-coded file names, exact values, or strings. If you need to change a value, you don't have to search through all your code to make sure you changed it everywhere; you simply change the value of the variable at the top. This also makes your code more reproducible in the end. (See the sketch below.)
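
Here is a minimal sketch of this convention (the variable names and the data/ folder are illustrative; the file name matches the one used later in this lesson):

In [ ]:
# Configuration collected at the top, right after the imports
DATA_DIR = 'data'
CLEANED_DATA_FILE = DATA_DIR + '/gapminder_cleaned.csv'
VERBOSE = True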

Let's Get Started

To get started we will import the Python modules that we will use in the session. These modules are developed by other programmers and made available as open-source packages for Python. We would normally have to install each of these ourselves, but they are included as part of the Anaconda Python Distribution.

The %matplotlib inline statement is part of the Jupyter and IPython magic that enables plots generated by the matplotlib package to be displayed as output in the Jupyter Notebook instead of opening in a separate window.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

We will continue where the data exploration module left off by importing the cleaned gapminder dataset and assigning it to a new variable named df to denote that we have imported a pandas dataframe.

As validation that we have imported the data, we will also look at the top five rows of data using the head method of pandas.

Defensive programming

In [2]:
cleaned_data_location = 'data/gapminder_cleaned.csv'
df = pd.read_csv(cleaned_data_location)
df.head()
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-2-bdd8684bfb1c> in <module>
      1 cleaned_data_location = 'data/gapminder_cleaned.csv'
----> 2 df = pd.read_csv(cleaned_data_location)
      3 df.head()

/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    608     kwds.update(kwds_defaults)
    609 
--> 610     return _read(filepath_or_buffer, kwds)
    611 
    612 

/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    460 
    461     # Create the parser.
--> 462     parser = TextFileReader(filepath_or_buffer, **kwds)
    463 
    464     if chunksize or iterator:

/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    817             self.options["has_index_names"] = kwds["has_index_names"]
    818 
--> 819         self._engine = self._make_engine(self.engine)
    820 
    821     def close(self):

/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1048             )
   1049         # error: Too many arguments for "ParserBase"
-> 1050         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1051 
   1052     def _failover_to_python(self):

/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1865 
   1866         # open handles
-> 1867         self._open_handles(src, kwds)
   1868         assert self.handles is not None
   1869         for key in ("storage_options", "encoding", "memory_map", "compression"):

/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in _open_handles(self, src, kwds)
   1360         Let the readers open IOHanldes after they are done with their potential raises.
   1361         """
-> 1362         self.handles = get_handle(
   1363             src,
   1364             "r",

/opt/anaconda3/lib/python3.8/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    640                 errors = "replace"
    641             # Encoding
--> 642             handle = open(
    643                 handle,
    644                 ioargs.mode,

FileNotFoundError: [Errno 2] No such file or directory: 'data/gapminder_cleaned.csv'

Whoops! That doesn't look great, and we know that the file does exist. We just downloaded it!

Let's do some defensive programming to prevent things from breaking.

Your most common collaborator is YOU, in the future. Including error handling and messages helps your colleagues and students, but most importantly, YOU.

Catching errors using try-except

try and except statements are a convenient way to catch and deal with errors.

In [3]:
cleaned_data_location = 'data/gapminder_cleaned.csv'

try:
    df = pd.read_csv(cleaned_data_location)

except FileNotFoundError:
    print("Couldn't find data file, check path? You tried", cleaned_data_location)
Couldn't find data file, check path? You tried data/gapminder_cleaned.csv

Exercise: what do you need to fix to actually open that data file?

In [4]:
cleaned_data_location = '../data/gapminder_cleaned.csv'

try:
    df = pd.read_csv(cleaned_data_location)

except FileNotFoundError:
    print("Couldn't find data file, check path? You tried", cleaned_data_location)

We can set a flag at the top of our script to dynamically set how much information we want to see, e.g. VERBOSE.

When the variable is set to True, we print out extra information.

In [5]:
VERBOSE = True

try:
    df = pd.read_csv(cleaned_data_location)
    if VERBOSE:
        print(df.head())

except FileNotFoundError:
    print("Couldn't find data file, check path? You tried", cleaned_data_location)
   year       pop  lifeexp   gdppercap      country continent
0  1952   8425333   28.801  779.445314  afghanistan      asia
1  1957   9240934   30.332  820.853030  afghanistan      asia
2  1962  10267083   31.997  853.100710  afghanistan      asia
3  1967  11537966   34.020  836.197138  afghanistan      asia
4  1972  13079460   36.088  739.981106  afghanistan      asia

Using assert statements to be explicit about assumptions

An assert statement will fail if its condition isn't true; nothing will happen if it is true.

In [6]:
years = df['year'].unique()
years.sort()
assert years[-1] == 2007  # Check that the most recent year is as expected
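
An assert can also carry a message that explains the failure; a small sketch:

In [ ]:
# If the condition is False, the AssertionError will include the message
max_year = years[-1]
assert max_year == 2007, f"Expected the most recent year to be 2007, got {max_year}"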

Lesson 2


Learning Objectives

  • Define "Don't Repeat Yourself" (DRY) and provide examples of how you would implement DRY in your code
  • Identify code that can be modularized following DRY and implement a modular workflow using functions.

As you write software there comes a time when you encounter a situation where you want to repeat an analysis step you have already performed. Our natural tendency is to copy the code that we wrote and paste it into the new location for reuse. Sounds easy, right? Copy, paste, move on... not so fast.

What happens if there is a problem with the code, or you decide to tweak it, just a little, to change a format or enhance it?

You will have to change the code in every place you have copied it. How do you know if you got all of the copies? What happens if one of the copies is not changed?

These problems illustrate the principle of "Don't Repeat Yourself" (DRY). We are going to look at how to refactor our code and pull pieces out by making them functions. Then we will call the function every time we want to use that code.
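
As a tiny preview of the refactoring idea (the function below is a toy example, not part of the gapminder analysis):

In [ ]:
# Repetitive version: the same formula copy-pasted for each value
# fahrenheit_a = 20 * 9 / 5 + 32
# fahrenheit_b = 35 * 9 / 5 + 32

# DRY version: write the formula once, call it everywhere
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

fahrenheit_a = celsius_to_fahrenheit(20)
fahrenheit_b = celsius_to_fahrenheit(35)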

What if we want to ask questions about variables over time?

Let's start with mean life expectancy in Asia.

In [7]:
# Decide on a year, and calculate the statistic of interest
# Work through it point by point; don't rush!

mask_asia = df['continent'] == 'asia'
df_asia = df[mask_asia]

mask_1952 = df_asia['year'] == 1952
df_1952 = df_asia[mask_1952]

value = np.mean(df_1952['lifeexp'])

# create an empty list
result = []

# append a row to list with a tuple containing your result
result.append(('asia', '1952', value))
    
# Turn the summary into a dataframe so that we can visualize easily
result = pd.DataFrame(result, columns=['continent', 'year', 'lifeexp'])
In [8]:
result
Out[8]:
continent year lifeexp
0 asia 1952 46.314394
In [9]:
# Use a for loop to Loop through years and calculate the statistic of interest

mask_asia = df['continent'] == 'asia'
df_asia = df[mask_asia]

years = df_asia['year'].unique()
summary = []

for year in years:
    mask_year = df_asia['year'] == year
    df_year = df_asia[mask_year]
    value = np.mean(df_year['lifeexp'])
    summary.append(('asia', year, value))
    
# Turn the summary into a dataframe so that we can visualize easily
summary = pd.DataFrame(summary, columns=['continent', 'year', 'lifeexp'])
In [10]:
summary
Out[10]:
continent year lifeexp
0 asia 1952 46.314394
1 asia 1957 49.318544
2 asia 1962 51.563223
3 asia 1967 54.663640
4 asia 1972 57.319269
5 asia 1977 59.610556
6 asia 1982 62.617939
7 asia 1987 64.851182
8 asia 1992 66.537212
9 asia 1997 68.020515
10 asia 2002 69.233879
11 asia 2007 70.728485

But now your PI wants that information for all the years for a different continent!

Activity: How could we use variables to make it easier to re-run this for different continents?

In [11]:
# Define which continent / category we will use
category = 'lifeexp'
continent = 'asia'

# Create a mask that selects the continent of choice
mask_continent = df['continent'] == continent
df_continent = df[mask_continent]

# Loop through years and calculate the statistic of interest
years = df_continent['year'].unique()
summary = []

for year in years:
    mask_year = df_continent['year'] == year
    df_year = df_continent[mask_year]
    value = np.mean(df_year[category])
    summary.append((continent, year, value))
    
# Turn the summary into a dataframe so that we can visualize easily
summary = pd.DataFrame(summary, columns=['continent', 'year', category])
In [12]:
summary.plot.line('year', category, label="life expectancy")
Out[12]:
<AxesSubplot:xlabel='year'>

Lesson 3

Building functions


Learning Objectives

  • Know how to construct a function: variables, function name, syntax, documentation, return values
  • Demonstrate use of a function within the notebook / code.
  • Construct and compose function documentation that clearly defines inputs, output variables and behaviour.
In [13]:
def calculate_mean_over_time(data, category, continent, verbose=False):
    
    # Create a mask that selects the continent of choice
    mask_continent = data['continent'] == continent
    data_continent = data[mask_continent]

    # Loop through years and calculate the statistic of interest
    years = data_continent['year'].unique()
    summary = []
    for year in years:
        if verbose:
            print(year)
        mask_year = data_continent['year'] == year
        data_year = data_continent[mask_year]
        value = np.mean(data_year[category])
        summary.append((continent, year, value))

    # Turn the summary into a dataframe so that we can visualize easily
    summary = pd.DataFrame(summary, columns=['continent', 'year', category])
    return summary
In [14]:
VERBOSE = False
calculate_mean_over_time(df, "lifeexp", "asia", VERBOSE)
Out[14]:
continent year lifeexp
0 asia 1952 46.314394
1 asia 1957 49.318544
2 asia 1962 51.563223
3 asia 1967 54.663640
4 asia 1972 57.319269
5 asia 1977 59.610556
6 asia 1982 62.617939
7 asia 1987 64.851182
8 asia 1992 66.537212
9 asia 1997 68.020515
10 asia 2002 69.233879
11 asia 2007 70.728485

Activity:

How would you make a function to calculate the median through time?
(Hint: focus on DRY.)
In [15]:
def calculate_statistic_over_time(data, category, continent, func):
    
    # Create a mask that selects the continent of choice
    mask_continent = data['continent'] == continent
    data_continent = data[mask_continent]

    # Loop through years and calculate the statistic of interest
    years = data_continent['year'].unique()
    summary = []
    for year in years:
        mask_year = data_continent['year'] == year
        data_year = data_continent[mask_year]
        value = func(data_year[category])
        summary.append((continent, year, value))

    # Turn the summary into a dataframe so that we can visualize easily
    summary = pd.DataFrame(summary, columns=['continent', 'year', category])
    return summary
In [16]:
calculate_statistic_over_time(df, "lifeexp", "asia", np.median)
Out[16]:
continent year lifeexp
0 asia 1952 44.869
1 asia 1957 48.284
2 asia 1962 49.325
3 asia 1967 53.655
4 asia 1972 56.950
5 asia 1977 60.765
6 asia 1982 63.739
7 asia 1987 66.295
8 asia 1992 68.690
9 asia 1997 70.265
10 asia 2002 71.028
11 asia 2007 72.396

Including docstrings with any functions is good programming practice, and helps out your collaborators (i.e., you!)

More on docstrings: https://www.python.org/dev/peps/pep-0257/#what-is-a-docstring

In [17]:
def calculate_statistic_over_time(data, category, continent, func):
    """calculate values of a statistic through time

    Args:
        data: a data frame
        category: one of the column headers of the data frame (e.g. 'lifeexp')
        continent: possible value of the continent column of that data frame (e.g. 'asia')
        func: the function to apply to data values (e.g. np.mean)
        
    Returns:
        a summary table of value per year.

    """
    
    # Create a mask that selects the continent of choice
    mask_continent = data['continent'] == continent
    data_continent = data[mask_continent]

    # Loop through years and calculate the statistic of interest
    years = data_continent['year'].unique()
    summary = []
    for year in years:
        mask_year = data_continent['year'] == year
        data_year = data_continent[mask_year]
        value = func(data_year[category])
        summary.append((continent, year, value))

    # Turn the summary into a dataframe so that we can visualize easily
    summary = pd.DataFrame(summary, columns=['continent', 'year', category])
    return summary

Defensive programming activity

How would you check to make sure input values are reasonable? Use assertions or try-except statements, and add options with the VERBOSE flag.

In [18]:
def calculate_statistic_over_time(data, category, continent, func, verbose=False):
    """calculate values of a statistic through time

    Args:
        data: a pandas data frame
        category: one of the column headers of the data frame (e.g. 'lifeexp')
        continent: possible value of the continent column of that data frame (e.g. 'asia')
        func: the function to apply to data values (e.g. np.mean)
        verbose: if True, print extra information (default False)

    Returns:
        a summary table of value per year.

    """
    
    # Check values
    assert category in data.columns.values, f"{category} is not a column in the data"
    assert 'continent' in data.columns.values, "data has no 'continent' column"
    assert continent in data['continent'].unique(), f"{continent} is not a continent in the data"
    
    # Create a mask that selects the continent of choice
    mask_continent = data['continent'] == continent
    data_continent = data[mask_continent]
    
    
    # Loop through years and calculate the statistic of interest
    years = data_continent['year'].unique()
    if verbose:
        print("years include", years)
    summary = []
    for year in years:
        mask_year = data_continent['year'] == year
        data_year = data_continent[mask_year]
        value = func(data_year[category])
        summary.append((continent, year, value))

    # Turn the summary into a dataframe so that we can visualize easily
    summary = pd.DataFrame(summary, columns=['continent', 'year', category])
    return summary
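
With these checks in place, a call with a bad value now fails fast at the assert instead of quietly returning an empty summary. A quick sketch (assuming 'antarctica' does not appear in the continent column of the cleaned data):

In [ ]:
# The assert raises AssertionError before any computation happens
try:
    calculate_statistic_over_time(df, 'lifeexp', 'antarctica', np.median)
except AssertionError as err:
    print("Bad input caught:", err)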
In [19]:
calculate_statistic_over_time(df, "lifeexp", "asia", np.median)
Out[19]:
continent year lifeexp
0 asia 1952 44.869
1 asia 1957 48.284
2 asia 1962 49.325
3 asia 1967 53.655
4 asia 1972 56.950
5 asia 1977 60.765
6 asia 1982 63.739
7 asia 1987 66.295
8 asia 1992 68.690
9 asia 1997 70.265
10 asia 2002 71.028
11 asia 2007 72.396
In [20]:
help(calculate_statistic_over_time)
Help on function calculate_statistic_over_time in module __main__:

calculate_statistic_over_time(data, category, continent, func, verbose=False)
    calculate values of a statistic through time
    
    Args:
        data: a pandas data frame
        category: one of the column headers of the data frame (e.g. 'lifeexp')
        continent: possible value of the continent column of that data frame (e.g. 'asia')
        func: the function to apply to data values (e.g. np.mean)
        verbose: if True, print extra information (default False)
        
    Returns:
        a summary table of value per year.

In [21]:
# Use this function to plot life expectancy through time
continents = df['continent'].unique()
VERBOSE = False
fig, ax = plt.subplots()

for continent in continents:
    output = calculate_statistic_over_time(df, "lifeexp", continent, np.median, verbose=VERBOSE)
    output.plot.line('year', "lifeexp", ax=ax, label=continent)

Demo: Importing your own functions as a module (main.ipynb)
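
A minimal sketch of the idea, assuming the final calculate_statistic_over_time has been saved to a file named utils.py next to the notebook (the file name is illustrative):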

In [ ]:
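# utils.py is a hypothetical module containing the function defined above
from utils import calculate_statistic_over_time

summary = calculate_statistic_over_time(df, 'lifeexp', 'asia', np.median)
summary.head()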