Created by Ethan C. Campbell for NCAT/MATE/GO-BGC Marine Technology Summer Program
Tuesday, August 22, 2023
Computer code allows us to work with data, create visualizations, and create repeatable scientific workflows. It is an integral part of the modern scientific method!
Every programming language has a specific syntax. In English as well as programming languages, syntax describes valid combinations of symbols and words:
Semantics refer to whether a phrase has meaning. It's up to us to write computer code that has scientific meaning and is useful. The computer will allow us to write code that is syntactically valid but semantically – or scientifically – incorrect!
(Image source: stackoverflow.blog)
No programming language is perfect. As the inventor of C++ once said, “There are only two kinds of programming languages: the ones people complain about and the ones nobody uses.”
However, there are many reasons that we use Python instead of other programming languages, like MATLAB, Java, or C:
*Question: How many of you have heard of Python before this course? Who has written code in Python (or a different language) before?*
This web page is called a notebook. It lets us write and run computer code, and the results get displayed and saved alongside the code. If you download this notebook in the File menu, the file extension will be .ipynb.
Sometimes it makes more sense to create a script instead of a notebook. Scripts are code files that run from top to bottom, and they don't save the output.
*Question: When we run Python code in this notebook, where is the code actually being run?*
First, we always have to load packages into the notebook using the import command! Packages give us additional functions that allow us to get more stuff done.
To run a coding cell, you can click the "play" button or type Shift-Enter (PC) or Shift-Return (Mac) on your keyboard. *Try this with the cell below:*
import numpy as np # NumPy is an array and math library
import matplotlib.pyplot as plt # Matplotlib is a visualization (plotting) library
import pandas as pd # Pandas lets us work with spreadsheet (.csv) data
from datetime import datetime, timedelta # Datetime helps us work with dates and times
When we write import numpy as np, we are saying: "import the package NumPy and we will access it using the abbreviation np from here onwards." You could technically write any abbreviation, but np is standard for NumPy.
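As a quick illustration of the abbreviation in action, the sketch below calls two NumPy functions through np. The functions np.sqrt() and np.round() are not used elsewhere in this notebook; they are chosen here only as examples.

```python
import numpy as np  # np is now our shorthand for NumPy

# Every NumPy function is accessed by writing "np." in front of its name
root = np.sqrt(16)              # square root of 16
rounded = np.round(3.14159, 2)  # round to 2 decimal places

print(root)
print(rounded)
```

If the import line were written with a different abbreviation, the same abbreviation would have to be used in front of every function name.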
Often we'd like to add notes to our code. You can do this using comments, notated above using a # (hash) symbol. Everything after the # is ignored and not treated like code.
We can use Python as a calculator. Run the cell below:
3 + 9
Note that parentheses can be used to change the order of operations:
1 + 2 * 3 + 4
(1 + 2) * (3 + 4)
If Python doesn't recognize the code, it will give an error.
*What helpful information does the following error message include?*
3 + hello
Try doing some math yourself below. *Question: Can you figure out how to multiply and divide numbers?*
Usually, Python needs to be told when to "print" something to the screen. For this, we use the print() function:
print('Hello world!')
*Try writing code to print a different message:*
Note how comments are used in two ways below, both to describe a section of code and to annotate a specific line:
# This is a section comment
print('This is not a comment')
print('This is also not a comment') # This is a line comment
In Python, we use variables to store information. Variables can be numbers (integers or floats), combinations of characters (called strings), booleans (which are either True or False), or other things that are generally called "objects".
To save a variable, we use the equal sign (=). You can name your variable anything descriptive, as long as it's one word! Note that an underscore (_) can be used to join words in a variable name.
a = -5 # This variable is an "integer" because it is a whole number (a number without a decimal point)
almost_ten = 9.9 # This variable is a "float" because it is a floating point number (a number with a decimal point)
scientific = 2e3 # This variable is also a float, and is written in scientific notation: 2.0 x 10^3 = 2000
mate = 'FLOATS' # This variable is a string
mate_2 = "FLOATS" # You can also specify strings using double quotation marks
boolean = True # This variable is a boolean
print(a)
print(almost_ten)
print(scientific)
print(mate)
print(mate_2)
print(boolean)
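If you are ever unsure what kind of variable you have, Python's built-in type() function will tell you. This is a small sketch; type() is not introduced elsewhere in this notebook, and the variables simply repeat a few of those created above.

```python
a = -5
almost_ten = 9.9
mate = 'FLOATS'
boolean = True

print(type(a))           # <class 'int'>
print(type(almost_ten))  # <class 'float'>
print(type(mate))        # <class 'str'>
print(type(boolean))     # <class 'bool'>
```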
You can do math at the same time that you create a variable!
result = 2023 - 1913
print(result)
*Try the following:*
# Write your code here:
You can also change a variable using this compact notation:
a += b is the same as a = a + b
a -= b is the same as a = a - b
a *= b is the same as a = a * b
result += 50
print(result)
Note that Python treats booleans (True and False) like the integers 1 and 0, respectively. *This means you can do math with booleans. What will the code produce below, and why?*
print((False * 5) + (True * 3))
*What happens when you add two strings together? Try it below.*
# Write your code here:
To store multiple numbers, we use lists or NumPy arrays. Lists and arrays are types of variables, and NumPy is one of the packages that we imported at the top. Here's how we create a list or array:
my_list = [1,2,3,4,5]
my_array = np.array([1,2,3,4,5,6,7,8,9])
print(my_list)
print(my_array)
You can add elements to the end of a list by appending. The syntax is:
list_name.append(NEW_ELEMENT)
# Append to the list that you created earlier:
my_list.append(6)
my_list.append(7)
print(my_list)
You can convert a list to an array by putting it inside np.array():
print(np.array(my_list))
A list can store a combination of numbers and strings, while an array can store only one variable type (so just numbers OR just strings):
combo_list = ['element #1', 2, 'element #3', 4]
print(combo_list)
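To see what "only one variable type" means in practice, here is a minimal sketch of what NumPy does when it is handed a mixed list: it converts every element to a single common type (here, text). The name combo_array and the use of the .dtype property are additions for illustration.

```python
import numpy as np

combo_list = ['element #1', 2, 'element #3', 4]
combo_array = np.array(combo_list)

print(combo_array)        # every element is now a string, including '2' and '4'
print(combo_array.dtype)  # the single data type shared by all elements (a text type)
```

Because the numbers have become strings, math like combo_array + 5 would no longer work on this array.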
Naturally, we can do math with arrays. This is very useful!
*Before running the cells below, what do you expect will be the result of each line of code?*
my_array + 5
my_array * 2
my_array + my_array
*What happens when you add two lists together? Try it!*
# Write your code here:
If we want to retrieve certain elements from a list or array, we need to count the position of the elements, which we call an index (plural: indices). Python starts counting at 0, so the first element has index 0. For example:
List: ['A', 'B', 'C', 'D', 'E', 'F', 'G']
Indices: A = 0, B = 1, C = 2, D = 3, E = 4, F = 5, G = 6
To extract the element, we can index or slice into the list or array using a bracket [ ] after the variable name:
variable_name[INDEX]
variable_name[START (optional) : END (optional)]
(note: END is exclusive, so it is the index after the final element that you want)
*Run each cell below and think about why the results make sense:*
year = [2,0,2,3]
print(year)
# Examples of indexing:
print(year[0])
print(year[3])
print(year[-1]) # This is pretty cool! Negative indexing counts backwards from the end
# Examples of slicing:
print(year[0:4])
print(year[:2])
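The same rules can be sketched with the list of letters shown earlier. Reading each slice as "start at START, stop just before END" may help. The variable name letters is chosen here for illustration.

```python
letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G']

print(letters[2])    # index 2 is the third element: C
print(letters[1:4])  # indices 1, 2, 3 (index 4 is excluded): ['B', 'C', 'D']
print(letters[:3])   # no START means "from the beginning": ['A', 'B', 'C']
print(letters[4:])   # no END means "through the end": ['E', 'F', 'G']
```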
*Can you find two different ways to extract the last two elements ([2,3]) of the variable year?*
*Try using one of them to save [2,3] into a new variable.*
# Write your code here:
Similarly, you can use indexing or slicing to assign new values in a list or array:
array_to_modify = np.array([10,20,30,40,50])
array_to_modify[0] = 0
array_to_modify[1:4] = np.array([21,31,41])
array_to_modify[4] *= 2
*What will array_to_modify be after these modifications? Test your prediction by printing the variable below:*
# Write your code here:
*What happens when you index or slice into a string? Try it!*
# Write your code here:
NumPy arrays can also be two-dimensional (or higher dimensions). Whoa!
This allows us to represent data on multiple axes using nested brackets: [ [ ], [ ], [ ], etc. ]. Below, I've created a 2-D NumPy array of average monthly temperatures (in °F). Each row is a different city, and each column is a month, from January (column index 0) to December (column index 11). I've found the data for Pasadena, CA (top row - index 0) and Seattle, WA (bottom row - index 1) on climate-data.org.
temp = np.array([[53.6,53.9,57.3,60.5,64.8,70.1,75.7,76.4,74.1,67.3,59.8,52.9], # (Pasadena)
[40.0,40.6,44.2,48.4,54.9,60.2,66.2,66.7,60.5,52.0,44.5,39.6]]) # (Seattle)
print(temp)
Just like len() gives the length of a 1-D array, the command .shape (a property, not a function!) gives the dimensions of a 2-D (or 3-D, 4-D, etc.) array:
temp.shape # returns: (number of rows, number of columns)
Axis 0 runs down the rows (the first index) and axis 1 runs across the columns (the second index).
We still index and slice into 2-D arrays using brackets, with the index for each dimension separated by a comma (,):
array_name[ROW_INDEX, COLUMN_INDEX]
So we'd get the temperature in Pasadena (row index 0) in June (month #6, so column index 5) by writing:
print(temp[0,5])
*Use indexing to retrieve the December average temperature in Seattle. Print your result:*
# Write your code below
Slicing works the same way. Instead of a single row or column index, use the range of indices:
array_name[ROW_START:ROW_END, COLUMN_START:COLUMN_END]
To get all the elements along a certain axis, just use a single colon (:).
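Before trying this on the temperature data, it may help to see 2-D indexing and slicing on a tiny array where the answers are easy to check by eye. The array grid below is hypothetical and exists only for this sketch.

```python
import numpy as np

# A small array with 2 rows and 3 columns
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.shape)    # (2, 3)
print(grid[1, 0])    # row 1, column 0 -> 4
print(grid[0, 1:3])  # row 0, columns 1 and 2 -> [2 3]
print(grid[:, 2])    # all rows, column 2 -> [3 6]
print(grid[1, :])    # row 1, all columns -> [4 5 6]
```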
*Try using slicing to get the temperatures for the first half of the year for Pasadena:*
# Write your code below
*Next, try using slicing to obtain the average temperatures for both cities in August:*
# Write your code below
*Finally, use slicing and mathematical operations to calculate the average temperatures for both cities from December to February (three months). You got this!*
# Write your code below
You already know two functions: print() and np.array(). Functions usually take at least one input "argument" inside the parentheses, with multiple arguments separated by commas. Then the function "returns" or "outputs" something back.
Let's learn a few other functions...
The function len(INPUT) returns the length of a list, array, or string. *Do the following outputs make sense based on the input arguments?*
year = np.array([2,0,2,3])
array_digits = len(year)
print(array_digits)
year = '2023'
str_digits = len(year)
print(str_digits)
The NumPy function np.arange(START, END, INTERVAL) creates an array of numbers from START up to (but not including) END with a certain INTERVAL between each number.
*Can you guess what the result of the code below will be?*
np.arange(0,100,5)
Note that np.arange(END) is a shorter way of writing np.arange(0,END,1):
print(np.arange(10))
print(np.arange(0,10,1))
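One detail that is easy to miss is that END itself is never included, just like with slicing. The sketch below shows this, and also that the INTERVAL does not have to be a whole number. The variable names are chosen for illustration.

```python
import numpy as np

stops_before_end = np.arange(0, 10, 5)  # 10 is not included
reaches_ten = np.arange(0, 11, 5)       # raising END past 10 includes it
quarter_steps = np.arange(0, 1, 0.25)   # the interval can be a decimal

print(stops_before_end)  # [0 5]
print(reaches_ten)       # [ 0  5 10]
print(quarter_steps)
```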
Additionally, the NumPy package has many useful functions for mathematical operations:
np.mean(INPUT) calculates the average value of elements in an INPUT list or array
np.sum(INPUT) calculates the sum of elements in an INPUT list or array
np.max(INPUT) and np.min(INPUT) find the maximum or minimum values in INPUT
np.ones(N) creates a new array of length N filled with the number 1 (stored as the float 1.0 by default)
np.zeros(N) creates a new array of length N filled with the number 0 (stored as the float 0.0 by default)
For example:
# Do some math on arrays:
test = np.array([1,2,3])
print(np.mean(test))
print(np.sum(test))
print(np.max(test))
# Create new arrays:
print(np.ones(5))
print(np.zeros(5))
Many functions can be called (applied) to a variable in two different ways. For example:
np.mean(test) # Option 1
test.mean() # Option 2 (same result!)
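The same pattern appears to hold for several of the other functions listed above. As a brief sketch, each pair below gives the same answer whether the function is called from np or attached to the array with a dot:

```python
import numpy as np

test = np.array([1, 2, 3])

print(np.sum(test), test.sum())  # both give 6
print(np.max(test), test.max())  # both give 3
print(np.min(test), test.min())  # both give 1
```

Note that the second form only works on NumPy arrays, not on plain Python lists, which do not have these functions attached.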
To learn more about a function, you can always consult its online documentation! A package's documentation website usually has a page for each function describing its arguments, outputs, and examples of how to use it.
*Google "numpy mean" to find the documentation page for that function. How is the webpage structured, and what information does it tell us about the arguments needed to apply np.mean() to 2-D arrays?*
Now that you've discovered named arguments... *use np.mean() to calculate and print the average annual (yearly) temperature in Seattle using the variable temp from earlier:*
# Write your code here:
Often, we will want to compare two numbers or variables. We do this using the following logical operations:
==: equal
!=: not equal
>: greater than
>=: greater than or equal to
<: less than
<=: less than or equal to
and (or &): are both booleans true?
or (or |): is either boolean true?
not (or ~): reverse the boolean (True -> False, False -> True)
in: is a member
not in: is not a member
Each logical operation evaluates to (returns) a boolean: True or False. Consider the following examples:
3 == 3
3 == 3.0 # integers can be compared to floating-point numbers
not 3 == 3
3 == 5
3 != 5
3 > 5
5 <= 5
(11 == 12) or (12 == 12)
(11 == 12) and (12 == 12)
Applying a logical comparison to a NumPy array gives a boolean array!
x = np.array([1,2,3,4,5,6])
print(x < 4)
print(x <= 4)
# Note: "not" can't be applied to an entire boolean array. Instead, we have to use "~":
print(~np.array([True,False,True]))
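The same idea applies when combining two comparisons on an array: the words and / or are meant for single booleans, while the symbols & and | work element by element on boolean arrays. A minimal sketch follows; the variable names are chosen for illustration. Each comparison is wrapped in parentheses because & and | are evaluated before comparisons like > and <.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

between = (x > 2) & (x < 5)  # True only where BOTH comparisons are true (at 3 and 4)
outside = (x < 2) | (x > 5)  # True where EITHER comparison is true (at 1 and 6)

print(between)
print(outside)
```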
Note that membership tests work on lists, arrays, and strings:
print(3 in x) # this is asking: "is 3 in x?"
print(7 in x)
print(3 not in x) # this is asking: "is 3 not in x?"
print('hello' in 'hello world')
print('o w' in 'hello world')
print('World' in 'hello world') # note that string membership is case-sensitive
Heads up: this next skill is super powerful. We saw above that applying a logical comparison to an array of numbers gives us a boolean array.
We can use boolean arrays as "masks" to select certain elements of an array. This is called boolean indexing.
# Let's revisit the Seattle temperatures from earlier:
seattle_temps = np.array([40.0,40.6,44.2,48.4,54.9,60.2,66.2,66.7,60.5,52.0,44.5,39.6])
# Applying a logical comparison creates a boolean array, or "mask":
print(seattle_temps > 60)
# Now let's use the mask to retrieve only the elements where the mask is True:
seattle_temps[seattle_temps > 60]
# Note: this only works when the mask is the same length as the array!
# The boolean indexing gives the same result as specifying the actual array indices:
seattle_temps[[5,6,7,8]]
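A mask made from one array can also be applied to a different array, as long as the two have the same length. The sketch below assumes a second array of month abbreviations (months, which is not part of the original data) and uses the temperature mask to look up which months are warmer than 60°F.

```python
import numpy as np

seattle_temps = np.array([40.0,40.6,44.2,48.4,54.9,60.2,66.2,66.7,60.5,52.0,44.5,39.6])
months = np.array(['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])

warm_mask = seattle_temps > 60   # one True/False value per month
warm_months = months[warm_mask]  # keep only the month names where the mask is True

print(warm_months)
```

The four months returned correspond to indices 5 through 8, matching the index list used above.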
*How many months of the year is Seattle 40°F or colder? Try using boolean indexing and a function that you've learned to calculate and print the answer:*
# Write your code here:
if statements and for loops
We can use logical operations to create conditional actions using the if-else statement.
If the first condition evaluates to True, then the lines inside the first block are executed.
If the first condition evaluates to False, then Python tests the second condition, in the elif statement.
If all of the if and elif statements are False, then Python will finally run the else statement.
if <CONDITION #1>:
    <ACTION>
    <ACTION>
    etc.
elif <CONDITION #2>: (optional)
    <ACTION>
elif <CONDITION #3>: (optional)
    <ACTION>
else: (optional)
    <ACTION>
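IMPORTANT: note the colons (:) and how the lines below each condition are indented using a Tab or spaces on your keyboard. Two spaces are enough, but every line in a block must be indented by the same amount (four spaces is a common convention).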
*Try changing the value of rain_chance below and running the code. Do you understand the control flow?*
*What range of values for rain_chance will trigger the else statement?*
rain_chance = 5 # i.e., a 5% chance of rain
if rain_chance >= 50:
    print('Ugh... I better bring an umbrella.')
elif rain_chance == 0:
    print('I definitely will not need an umbrella.')
elif rain_chance <= 20:
    print('I should be okay without an umbrella.')
else:
    print('I am not sure what to do.')
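Because Python stops at the first condition that is True, the order of the conditions matters. The sketch below uses a hypothetical water temperature (water_temp, in °C) and made-up category labels to show this. The value 12.5 is also less than 20, but only the first matching branch runs.

```python
water_temp = 12.5  # hypothetical measurement in degrees Celsius

if water_temp < 0:
    label = 'frozen'
elif water_temp < 10:
    label = 'cold'
elif water_temp < 20:  # first condition that is True, so this block runs
    label = 'mild'
else:
    label = 'warm'

print(label)  # mild
```

If the elif conditions were listed in the opposite order, a value like 5 would be labeled 'mild' instead of 'cold', since 5 < 20 would be tested first.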
Sometimes, we might want to perform an action again and again. Coding makes this possible using loops!
A for loop has the following syntax:
for <ELEMENT> in <ITERABLE>:
    <ACTION>
    <ACTION>
    etc.
Here, <ITERABLE> can be a list, array, string, or other collection of elements. You can give <ELEMENT> any variable name. Each time the loop repeats, that variable holds the next element of <ITERABLE>.
*Run and consider the following examples:*
countdown = [4,3,2,1]
for item in countdown:
    print(item)
for character in 'floats':
    print(character)
for even_number in np.arange(0,7,2):
    print(even_number)
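Loops become more useful when combined with an if statement and a variable that is updated on each pass. As a minimal sketch, the code below counts how many values in a hypothetical list of depths (in meters) are deeper than 100 m. Note the two levels of indentation: one for the loop and one for the if statement inside it.

```python
depths = [5, 20, 60, 150, 300]  # hypothetical depths in meters

deep_count = 0  # start the count at zero before the loop begins
for depth in depths:
    if depth > 100:      # this check runs once for every element
        deep_count += 1  # add 1 to the running count

print(deep_count)  # 2
```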
*You already learned how to calculate the sum of an array of numbers using np.sum().*
*Now, try to calculate the average Seattle annual temperature by writing a for loop below. There are at least two different ways to do this!*
seattle_temps = np.array([40.0,40.6,44.2,48.4,54.9,60.2,66.2,66.7,60.5,52.0,44.5,39.6])
# Write your code below:
# Finally, print the average temperature by adding it to the print() statement:
print('The average temperature is:',)
In the real world, you'll frequently encounter missing data in an array.
Missing data is represented by the float np.nan. NaN stands for "Not a Number". (Older code sometimes uses the alias np.NaN, which was removed in NumPy 2.0.)
pH_measurements = np.array([7.84, 7.91, 8.05, np.nan, 7.96, 8.03])
print(pH_measurements)
We can test for missing values using the function np.isnan(), which returns a boolean (or a boolean array when applied to an array):
np.isnan(5)
np.isnan(np.nan)
np.isnan(pH_measurements)
Do you remember boolean indexing? We can use it to extract only the valid data from an array:
pH_measurements[~np.isnan(pH_measurements)]
It's good to be aware that missing data can cause functions like np.mean() to return NaN instead of a useful result:
np.mean(pH_measurements)
Many functions have a "NaN-safe" version that ignores missing values and still calculates the result, such as np.nanmean():
np.nanmean(pH_measurements)
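A short sketch of two related points: NaN is never equal to anything (not even itself), which is why np.isnan() is needed instead of ==, and np.nanmean() has siblings such as np.nanmax() and np.nanmin(). The variable n_missing is introduced here for illustration.

```python
import numpy as np

pH_measurements = np.array([7.84, 7.91, 8.05, np.nan, 7.96, 8.03])

print(np.nan == np.nan)  # False: comparing with == cannot find missing values

n_missing = np.isnan(pH_measurements).sum()  # each True counts as 1
print(n_missing)                             # 1

print(np.nanmax(pH_measurements))  # 8.05
print(np.nanmin(pH_measurements))  # 7.84
```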
Wow! It's time to start creating visualizations of data, called plots.
Earlier, we imported the package Matplotlib using:
import matplotlib.pyplot as plt
Creating a line plot is simple. Use the Matplotlib function plt.plot(). The basic form of the function is:
plt.plot(X, Y, <FORMAT_ARGUMENTS>...)
where X and Y are 1-D arrays of data, and the <FORMAT_ARGUMENTS> can be found on Matplotlib's documentation webpage.
x = np.array([0,1,2,3,4])
y = np.array([0,4,2,6,4])
plt.plot(x,y)
Some formatting arguments include:
c or color: line color (options: 'k' or 'black' for black, 'red' for red, etc.; see this page for color options)
lw or linewidth: line width (a number; the default is 1.5)
ls or linestyle: line style (options: '-', '--', '-.', ':')
marker: optional marker style (options: '.', 'o', 'v', '^', '<', '>', 's', '*', etc.)
*Try plotting x versus y again, except this time use a "goldenrod"-colored dashed line of width 2.5 with star-shaped markers:*
# Write your code here:
Some other options include changing the figure size by starting with a call to:
plt.figure(figsize=(WIDTH,HEIGHT))
Adding x-axis and y-axis labels and a title at the top:
plt.xlabel(STRING)
plt.ylabel(STRING)
plt.title(STRING)
Or adding grid lines using:
plt.grid()
Or adding multiple lines, labeling each one using the label argument in plt.plot(), and adding a key (legend) using:
plt.legend()
Check out these additional formatting options below:
plt.figure(figsize=(6,3))
plt.plot(x,y,label='Original data')
plt.plot(x,2*y,label='2 * y') # y-values are multiplied by 2 here
plt.legend()
plt.grid()
plt.xlabel('x-values')
plt.ylabel('y-values')
plt.title('This is a title');
We can also create a scatter plot with just the points (no line). The function is similar to plt.plot():
plt.scatter(X, Y, s=SIZE, c=COLOR, marker=MARKER_STYLE, etc.)
plt.figure(figsize=(6,3))
plt.scatter(x,y,s=100,c='dodgerblue',marker='^');
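Line plots and scatter plots can share the same axes: each call to plt.plot() or plt.scatter() adds to the current figure. The sketch below draws a dotted curve with a few square "measurement" points on top. The data are hypothetical, and np.sin() (the sine function) is used only to generate a smooth curve.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: a smooth curve plus a few "measurements" along it
x_curve = np.arange(0, 10, 0.1)
y_curve = np.sin(x_curve)
x_points = np.arange(0, 10, 2)
y_points = np.sin(x_points)

plt.figure(figsize=(6,3))
plt.plot(x_curve, y_curve, c='gray', lw=1, ls=':', label='Curve')
plt.scatter(x_points, y_points, s=60, c='crimson', marker='s', label='Measurements')
plt.legend()
plt.grid()
plt.xlabel('x-values')
plt.ylabel('sin(x)')
plt.title('Line and scatter on the same axes');
```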
*Let's bring it all together! Below, try plotting the monthly temperatures in Pasadena, CA and Seattle, WA. Use line plots with circle-shaped markers (or add scatter points separately). Include a legend and label the plot appropriately.*
temp = np.array([[53.6,53.9,57.3,60.5,64.8,70.1,75.7,76.4,74.1,67.3,59.8,52.9], # (Pasadena)
[40.0,40.6,44.2,48.4,54.9,60.2,66.2,66.7,60.5,52.0,44.5,39.6]]) # (Seattle)
# Write your code below:
Up until now, we've been using data that we've typed directly into Python. However, most real-world data is stored in files that we'd like to open using Python.
The most common type of data file is a spreadsheet, which has rows and columns. Generally, the columns will have column labels.
Spreadsheets are often stored in comma-separated value (CSV) format, with the file extension being .csv. Data files in this format can be opened using Microsoft Excel or Google Sheets, as well as Python.
In Python, we use the pandas package to work with spreadsheet data. We imported the package earlier using:
import pandas as pd
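Just like NumPy has arrays, Pandas has two types of objects: Series and DataFrame. This is what they look like: a Series is a single labeled column of data, while a DataFrame is a table of rows and labeled columns.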
For now, we'll just be applying simple operations to read spreadsheet data using pandas. But if you would like to learn more, check out these lesson slides.
First, let's download two .csv data files from Google Drive here: https://drive.google.com/drive/folders/1Am6XdlB-APQ3ccOvLeGK8DFPQ2OnPeJD?usp=share_link. Each file is a CTD cast that was collected from the R/V Rachel Carson off of Carkeek Park near Seattle. *Save these two files to your computer.*
Next, we can upload the files to this Google Colab notebook. *Click the sidebar folder icon on the left, then use the page-with-arrow icon at the top to select the files and upload them.* NOTE: uploaded files will be deleted from Google Colab when you refresh this notebook!
We will specify each filepath using string variables:
filepath_0 = '/content/2023051001001_Carkeek.csv'
filepath_1 = '/content/2023051101001_Carkeek.csv'
Now, we can load the files using pandas:
pd.read_csv(FILEPATH, ARGUMENTS...)
This function is very customizable using the many optional ARGUMENTS, which allow it to handle almost any file. You can find documentation about the arguments at this link.
*Let's first take a look at the data file using a simple text editor. Notice the long header. What argument can we use to exclude the header from being loaded?*
Below, we'll load each data file using pd.read_csv() and store each file into a new variable.
We can look at the data using display() (which is a fancy version of print() for DataFrames):
data_0 = pd.read_csv(filepath_0,comment='#')
data_1 = pd.read_csv(filepath_1,comment='#')
display(data_0)
The data in a pandas DataFrame is similar to a NumPy 2-D array, except we use column labels to refer to columns and index values to refer to rows.
To retrieve a specific column, we use bracket notation: data_frame[COLUMN_LABEL].
# For example:
data_0['density00']
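To see how the comment='#' argument and column selection behave without needing the uploaded files, here is a small self-contained sketch. The file contents, column labels (depth, temperature), and values are all made up for illustration and do not reflect the structure of the real CTD files. io.StringIO lets pandas read a string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical file contents: header lines start with '#',
# followed by a row of column labels and then the data
example_text = """# Station: example
# Instrument: example CTD
depth,temperature
1.0,11.2
2.0,10.9
3.0,10.5
"""

# Lines beginning with '#' are skipped because of comment='#'
example_data = pd.read_csv(io.StringIO(example_text), comment='#')

print(example_data.columns)                # the column labels
print(example_data['temperature'])         # a single column (a pandas Series)
print(example_data['temperature'].values)  # the same column as a NumPy array
```

Converting a column to a NumPy array with .values is optional; a column can usually be passed to functions like plt.plot() or np.mean() directly.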
*With these tools, can you make a line plot of temperature vs. depth that includes both CTD casts? (Alternatively, you could try plotting salinity, oxygen, or fluorescence vs. depth.)*
You may need the following line of code to flip the y-axis so the surface is at the top: plt.gca().invert_yaxis().
# Write your code here: