Environmental Data Analytics | John Fay and Luana Lima | Developed by Kateri Salk
Spring 2023

2: Reproducibility & Coding Basics¶

Objectives¶

Discuss the benefits and approach for reproducible data analysis
Perform simple operations using Python coding syntax
Call and create functions in Python

Reproducible Data Analysis¶

Fundamentals of reproducibility¶

Reproducibility: when someone else (e.g., your future self) can obtain the same outcomes from the same dataset and analysis

Raw data are always separate from processed data
Link data transformations with a reproducible pipeline
Raw datasets NEVER changed
Cleaning/transformations done through coding, not by editing within Excel
Edits documented by well-commented code
Majority of time spent in the data processing phase (clean, wrangle)
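
The principles above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed course workflow: the folder names (`raw`, `processed`), the file names, and the data values are all hypothetical, and a temporary directory stands in for a real project folder.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical project layout: a temporary folder stands in for a real project.
project = Path(tempfile.mkdtemp())
raw_dir = project / "raw"
processed_dir = project / "processed"
raw_dir.mkdir()
processed_dir.mkdir()

# Create a small made-up raw dataset (in practice this file already exists).
raw_file = raw_dir / "lake_temps_raw.csv"
raw_file.write_text("site,temp_c\nA,12.5\nB,\nC,14.0\n")
raw_before = raw_file.read_text()

# Cleaning step, documented in code: drop rows with a missing temperature.
with raw_file.open(newline="") as f:
    rows = [row for row in csv.DictReader(f) if row["temp_c"] != ""]

# Write the cleaned data to a separate file; the raw file is only ever read.
processed_file = processed_dir / "lake_temps_processed.csv"
with processed_file.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["site", "temp_c"])
    writer.writeheader()
    writer.writerows(rows)

# Confirm that the raw file is byte-for-byte what it was before cleaning.
raw_unchanged = raw_file.read_text() == raw_before
```

Because the cleaning happens in code, rerunning the script on the same raw file reproduces the same processed file, and the comments record what was changed and why.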

Rules and Conventions¶

Data stored in nonproprietary formats (e.g., .csv, .md, .txt)
File names in ASCII text
No spaces!
Consistent file naming conventions
Store data, code, and output in separate folders
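
Two of these conventions (ASCII-only names, no spaces) are easy to check in code. The helper below is a hypothetical sketch, not part of any library, and the file names are made up for illustration.

```python
def is_tidy_filename(name):
    """Return True if the name is ASCII-only and contains no spaces."""
    return name.isascii() and " " not in name

is_tidy_filename("lake_temps_2023.csv")   # True
is_tidy_filename("lake temps 2023.csv")   # False: contains spaces
is_tidy_filename("données.csv")           # False: non-ASCII character
```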

Version Control¶

This semester, we will incorporate the fundamentals of version control, the process by which all changes to code, text, and files are tracked. Version control lets us maintain data and information in support of collaborative projects, and it also ensures that your analyses are preserved.

Before coming to class, you were asked to create a GitHub.com account. GitHub is the web hosting platform for maintaining our Git repositories. Our version control system for the purposes of this course is Git.

JupyterLab Basics¶

When you open your JupyterLab container, you will see the JupyterLab interface. Documentation on the interface is provided here:
https://jupyterlab.readthedocs.io/en/stable/user/interface.html

Jupyter Notebooks¶

A Jupyter Notebook is analogous to an R Markdown document. It too can include text chunks and code chunks that can be viewed together. A few ways in which notebooks differ from Rmd files are:

Notebooks are not knitted. Instead they can be run and then exported into various formats. Notebooks are more WYSIWYG (what you see is what you get) than Rmd files.
All outputs are provided in the document itself; there is no separate console to which you can direct output. (Though you can certainly save files to your filesystem.)
Notebooks are organized into "cells" and cells are defined as either code cells or markdown cells.
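
Behind the interface, a notebook file (.ipynb) is stored as JSON text that holds a list of cells, each labeled with its type. The sketch below builds a made-up, minimal notebook in that layout and reads the cell types back; the cell contents are invented for illustration.

```python
import json

# A made-up, minimal notebook in the nbformat 4 layout.
notebook_text = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Basic math"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["1 + 1"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

# Parse the JSON text and list the type of each cell in order.
notebook = json.loads(notebook_text)
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
cell_types  # ['markdown', 'code']
```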

After a small learning curve, notebooks and the JupyterLab interface should become fairly intuitive. So, rather than write all this down here, we'll do some hands-on work that will be recorded for your benefit...

Python Coding basics¶

Python as a calculator¶

Below is a code cell. You can run code cells in a few ways:

Click the ► button in the menu.
Hit - on your keyboard (or - on a Mac)

→ Note that you can't run single lines in Jupyter notebooks; you have to run the entire code cell.

Basic math¶

In [ ]:

1 + 1

In [ ]:

1 - 1

In [ ]:

2 * 2

In [ ]:

1 / 2

In [ ]:

1 / 200 * 30

In [ ]:

5 + 2 * 3

In [ ]:

(5 + 2) * 3
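
The last few cells illustrate operator precedence. As a short sketch of what Python does with them (the variable names are only for illustration):

```python
# Multiplication and division are evaluated before addition.
without_parens = 5 + 2 * 3      # 2 * 3 first, then + 5 -> 11
with_parens = (5 + 2) * 3       # parentheses first -> 21

# Operators of equal precedence are evaluated left to right.
left_to_right = 1 / 200 * 30    # same as (1 / 200) * 30

# The / operator always returns a float, even for whole-number results.
half = 1 / 2                    # 0.5
whole = 4 / 2                   # 2.0
```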

Common terms¶

In [ ]:

import math #we need to import the `math` package for this
math.sqrt(25)

In [ ]:

math.sin(3)

In [ ]:

math.pi
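
One detail worth knowing about the cells above: the trigonometric functions in `math` take angles in radians, so `math.sin(3)` is the sine of 3 radians, not 3 degrees. A small sketch, with variable names chosen only for illustration:

```python
import math

root = math.sqrt(25)                          # 5.0 (always a float)
sine_quarter_turn = math.sin(math.pi / 2)     # 1.0

# Convert degrees to radians before calling math.sin().
sine_90_degrees = math.sin(math.radians(90))
```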

Summary statistics¶

In [ ]:

import statistics as stats #we need to import the statistics package for this
stats.mean([5, 4, 6, 4, 6])

In [ ]:

stats.median([5, 4, 6, 4, 6])
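
Storing the values in a named list avoids retyping them for each statistic. The sketch below repeats the two calls above and adds `stats.stdev()`, the sample standard deviation from the same module; the variable names are only for illustration.

```python
import statistics as stats

values = [5, 4, 6, 4, 6]
average = stats.mean(values)     # 5
middle = stats.median(values)    # 5 (sorted: 4, 4, 5, 6, 6)
spread = stats.stdev(values)     # 1.0 (sample standard deviation)
```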

Conditional statements¶

In [ ]:

4 > 5

In [ ]:

4 < 5

In [ ]:

4 != 5

In [ ]:

4 == 5
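
Each comparison above evaluates to a Boolean value, `True` or `False`, which can be stored like any other value and used to control what code runs. A minimal sketch with made-up variable names:

```python
is_smaller = 4 < 5      # True
is_equal = 4 == 5       # False

# A stored comparison can drive an if statement.
if is_smaller:
    message = "4 is less than 5"
else:
    message = "4 is not less than 5"

type(is_smaller)        # <class 'bool'>
```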

Objects¶

Python does not use R's <- assignment operator; it just uses =.

In [ ]:

x = 3*4

Now, call up the object x.

In [ ]:

Unlike RStudio, JupyterLab does not have a built-in variable explorer. (There are extensions for this, but we won't go into those here...) However, we can run the %whos magic command to reveal all named objects in our current session (including packages).

In [ ]:

%whos
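
Note that `%whos` is an IPython "magic" command, so it works in Jupyter but not in a plain Python script. A rough plain-Python stand-in might look like the sketch below; `list_objects` is a hypothetical helper, not a built-in, and it is demonstrated on a made-up namespace.

```python
def list_objects(namespace):
    """Map each public name in a namespace dict to the type of its value."""
    return {
        name: type(value).__name__
        for name, value in namespace.items()
        if not name.startswith("_")
    }

# Demonstrated on a small made-up namespace; in a live session you could
# pass globals() instead.
example_namespace = {"x": 3 * 4, "site": "Lake A", "_hidden": None}
list_objects(example_namespace)   # {'x': 'int', 'site': 'str'}
```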

Naming¶

Python objects can be named with a combination of letters, numbers, and underscores (_), BUT NO PERIODS (.), and a name cannot begin with a number. The best object names are informative. Resist the temptation to call your object something convenient, like "a", "b", and so on. Calling your object something specific means that you can call up that object later and have an idea of what it contains, with less need for specific context.

Informative names are the first illustration of a common data management recommendation: take the time to use best management practices at the outset, and it will save you time in the long term.
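
The naming rules can be checked with the string method `isidentifier()`, which reports whether a string is a syntactically valid Python name. The candidate names below are made up for illustration.

```python
candidates = ["long_name_for_illustration", "long.name", "2nd_sample", "sample_2"]

# str.isidentifier() reports whether a string is a syntactically valid name.
validity = {name: name.isidentifier() for name in candidates}

validity["long_name_for_illustration"]   # True
validity["long.name"]                    # False: periods are not allowed
validity["2nd_sample"]                   # False: cannot begin with a number
validity["sample_2"]                     # True
```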

Run the first code cell below. Then, type in "long" and press tab. What happens?

In [ ]:

long_name_for_illustration = 11

In [ ]:

What happens if there is a typo in your code?
Type each of the following in the code cell below:
Long_name_for_illustration
longnameforillustration

In [ ]:
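
For reference, Python names are case sensitive, and a misspelled name is treated as an object that was never defined. The sketch below catches the resulting `NameError` so the code keeps running; in a notebook cell you would normally just see the error message printed beneath the cell.

```python
long_name_for_illustration = 11

# Names are case sensitive, so a capitalized variant is an undefined name.
try:
    Long_name_for_illustration
except NameError as error:
    typo_message = str(error)

# The message names the object that Python could not find.
typo_message
```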

Comments¶

Within your Python code, it is often useful to include notes about your workflow. So that these aren't interpreted by the software as code, precede the notes with a # sign. Your editor will display the comment in a different color to indicate that it will not be run. Comments can be placed on their own lines or at the end of a line of code.

In [ ]:

# I am demonstrating a comment here.

In [ ]:

1 + 1 # This is a simple math problem

Functions¶

Functions are the major tool in Python. Functions can do virtually unlimited things within the Python universe, but each function requires specific inputs that are provided under specific syntax. We will start with a simple function that is built into Python, len(), which returns the length of an object.

In [ ]:

len("ABCDEF")

To mimic the code in the R counterpart of this document, we actually need two functions in Python. The range() function works like R's seq() function, but it returns a "range" object, not a vector. To convert this to something like an R vector, we coerce the range object into a list with the list() function...

In [ ]:

list(range(10))

In [ ]:

ten_sequence = list(range(10))
ten_sequence

In [ ]:

list(range(1,10,2))

In [ ]:

?range
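
The `?range` syntax is specific to IPython and Jupyter; in plain Python, `help(range)` displays the same documentation. As a short summary of the calls above: `range(start, stop, step)` starts at `start` (0 by default), increases by `step` (1 by default), and stops before reaching `stop`. The variable names below are only for illustration.

```python
first_ten = list(range(10))         # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
odds = list(range(1, 10, 2))        # [1, 3, 5, 7, 9]

# The stop value is excluded, so counting 1 through 10 needs a stop of 11.
one_to_ten = list(range(1, 11))
len(first_ten)                      # 10
```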

Defining your own function¶

The basic form of a function is functionname(), and the packages we will use in this class will use these basic forms. However, there may be situations when you will want to create your own function. Below is a description of how to write functions through the metaphor of creating a recipe (credit: @IsabellaGhement on Twitter).

Writing a function is like writing a recipe. Your function will need a recipe name (functionname). Your recipe ingredients will go inside the parentheses. The recipe steps and end product make up the body of the function (in R, the body goes inside curly brackets).

→ Note that Python does not use curly braces "{ }" to indicate which code goes into the function. Instead it uses indentation: all indented code will be part of the function's code...

def functionname():
    statement_1
    statement_2
    return result

♦ A single ingredient recipe:

In [ ]:

# Write the recipe
def recipe1(x):
    mix = x*2
    return mix

In [ ]:

# Bake the recipe
simplemeal = recipe1(5)

In [ ]:

# Serve the recipe
simplemeal

♦ Two single ingredient recipes, baked at the same time:

In [ ]:

def recipe2(x):
    mix1 = x*2
    mix2 = x/2
    # A statement can continue onto the next line as long as it is inside ( ) or [ ].
    return [mix1,
            mix2]

In [ ]:

doublesimplemeal = recipe2(6)

In [ ]:

doublesimplemeal
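
Because `recipe2` returns a list of two values, the result can either be kept as one list or unpacked into two separate names. The sketch below redefines `recipe2` so that it runs on its own; the names `doubled` and `halved` are only for illustration.

```python
def recipe2(x):
    mix1 = x * 2
    mix2 = x / 2
    return [mix1, mix2]

doublesimplemeal = recipe2(6)       # [12, 3.0]

# Unpack the two returned values into separate names.
doubled, halved = recipe2(6)        # doubled is 12, halved is 3.0
```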

♦ Two double ingredient recipes, baked at the same time:

In [ ]:

def recipe3(x, f):
    mix1 = x*f
    mix2 = x/f
    return [mix1, mix2]

In [ ]:

doublecomplexmeal = recipe3(x = 5, f = 2)
doublecomplexmeal

In [ ]:

#Show the first item in the returned list
doublecomplexmeal[0]
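
The call above names its ingredients (`x = 5, f = 2`). Arguments can also be passed by position, in which case their order matters. The sketch below redefines `recipe3` so that it runs on its own; the variable names are only for illustration.

```python
def recipe3(x, f):
    mix1 = x * f
    mix2 = x / f
    return [mix1, mix2]

by_position = recipe3(5, 2)          # [10, 2.5]
by_keyword = recipe3(f=2, x=5)       # [10, 2.5]; order does not matter
swapped = recipe3(2, 5)              # [10, 0.4]; x and f traded places
```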

♦Make a recipe based on the ingredients you have

In [ ]:

def recipe4(x):
    if x < 3:
        return x*2
    else:
        return x/2

def recipe5(x):
    if x < 3:
        return x*2
    elif x > 3:
        return x/2
    else:
        return x

In [ ]:

meal = recipe4(4); meal

In [ ]:

meal2 = recipe4(2); meal2

In [ ]:

meal3 = recipe5(3); meal3
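
The two recipes differ only when x is exactly 3: `recipe4` treats 3 like any value that is not less than 3, while `recipe5` has a separate `else` branch for it. The sketch below redefines both functions so that it runs on its own and compares them on the same inputs.

```python
def recipe4(x):
    if x < 3:
        return x * 2
    else:
        return x / 2

def recipe5(x):
    if x < 3:
        return x * 2
    elif x > 3:
        return x / 2
    else:
        return x

# Apply each recipe to the same three inputs.
results4 = [recipe4(x) for x in (2, 3, 4)]   # [4, 1.5, 2.0]
results5 = [recipe5(x) for x in (2, 3, 4)]   # [4, 3, 2.0]
```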