Title: Quick Start With Python
Author: John Fay for ENV872
Date: Spring 2023
Instructor: John Fay
Like most programming languages, Python supports basic data types for integers (int), real numbers (float), character strings (str) and logical True/False values (bool).
The type of a variable is automatically set when a value is assigned to it, using the = operator. It can be queried with the built-in type() function.
i = 3
type(i)
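As a small sketch of the four basic types named above, the following assigns one value of each kind and queries its type. The variable names are made up for this illustration only.

```python
# One value of each basic type (names are illustrative only)
count = 3        # int
ratio = 2.5      # float
word = 'tree'    # str
flag = True      # bool

print(type(count))
print(type(ratio))
print(type(word))
print(type(flag))
```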
Python supports the usual arithmetic operators: +, -, *, /, ** (exponent) and comparison operators: == (equal), != (non-equal), <, >, <=, >=.
Both int and float values can be mixed within an expression; the result is a float.
r = i + 1.5
print (r, 'is of', type(r))
In the code above, we introduced the print function, which prints the output of multiple Python expressions on the same line, separated by spaces. Note that quoted character strings (here, 'is of') are printed as is.
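The arithmetic and comparison operators listed earlier can be tried out in the same way. The values below are arbitrary and chosen only for illustration.

```python
m = 7
n = 2

print(m / n)     # division of two ints returns a float: 3.5
print(m ** n)    # exponent: 49
print(m != n)    # comparisons return a bool: True
print(m + 0.5)   # int mixed with float gives a float: 7.5
```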
Let’s define a new string variable.
s = 'three'
type(s)
In Python, the same operator can perform different functions based on the data types of the operands. See what happens if you “add” two character strings.
s + 'four'
b = False
type(b)
int(b)
q = ''
bool(q)
q = 0
bool(q)
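The conversions above can be summarized in a short sketch: converting a bool to int gives 0 or 1, while converting to bool gives False for an empty string or a zero and True for the other values shown here.

```python
print(int(True), int(False))     # 1 0
print(bool(''), bool('text'))    # False True
print(bool(0), bool(-3))         # False True
```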
Python offers different types of objects to represent collections of values, the most common being a list. It is created by listing multiple values or variables, separated by commas and enclosed by square brackets.
lst = [r, s, 'another string']
lst
You can retrieve individual elements of a list by their index; note that in Python, the first element has an index of 0.
lst[1]
Negative indices are also possible: -1 is the last item in the list, -2 the second-to-last item, etc.
lst[-1]
The syntax list[i:j] selects a sub-list starting with the element at index i and ending with the element at index j - 1.
lst[0:2]
Omitting the index before or after the “:” indicates the start or end of the list, respectively. For example, the previous example could have been written lst[:2].
A potentially useful trick to remember the list subsetting rules in Python is to picture the indices as “dividers” between list elements.
 0       1           2                    3
 | 4.5   | 'three'   | 'another string'   |
-3      -2          -1
Positive indices are written at the top and negative indices at the bottom. list[i] returns the element to the right of i, whereas list[i:j] returns elements between i and j.
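The "dividers" picture can be checked with a few slices. This is a minimal sketch on a stand-alone list (named items here only for the example) with the same three values as in the diagram.

```python
items = [4.5, 'three', 'another string']

print(items[1])       # element to the right of divider 1: 'three'
print(items[1:3])     # between dividers 1 and 3: ['three', 'another string']
print(items[-3:-1])   # between dividers -3 and -1: [4.5, 'three']
print(items[:1])      # from the start up to divider 1: [4.5]
```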
►Question: Given any Python list, how can you retrieve its last two elements?
Answer:
Lists can be nested within other lists: in this case, multiple sets of brackets might be necessary to access individual elements.
nested_list = [1, 2, 3, [11, 12, 13]]
nested_list[3][1]
The Python language includes multiple functions that work with lists. Here are a few examples. Note that code lines starting with # are comments, which serve to document the code but are ignored by the Python interpreter.
# Returns the length of a list
len(lst)
# Returns the position of an element in a list
lst.index(4.5)
# Appends an element to the end of a list
lst.append(100)
lst
# Reverse the order of a list's elements
lst.reverse()
lst
The last three examples feature a special type of function called a method. In object-oriented programming, methods belong to a specific object; in Python, they are called with the object.method() syntax. In general, methods and functions operate in a similar manner; for example, len() could have been a list method.
Note that the append and reverse methods modify the lst object in place and return no value. A common mistake, especially for those used to programming in R, would be to write lst = lst.append(100), which overwrites lst with None, Python's null value!
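The pitfall described above can be seen directly in a short sketch. It uses a separate list (named values only for this example) so that lst is left untouched.

```python
values = [1, 2, 3]

# append changes the list in place and returns nothing useful
returned = values.append(4)

print(values)     # [1, 2, 3, 4]
print(returned)   # None
```

Assigning the result of append back to the list variable would therefore replace the list with None.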
►Question: What is the output of len(lst[2])? What does it mean? (Like the + operator, this is another case of a function that behaves differently depending on the type of data it’s applied to.)
Answer:
Lists are useful when you need to access elements by their position in a sequence. In contrast, dictionaries make it easy to find values based on unique identifiers called keys.
A dictionary is defined as a list of key:value pairs enclosed by curly brackets. Individual values are accessed using square brackets, as for lists, except that keys are used as the indices.
animals = {'Snowy':'dog', 'Garfield':'cat', 'Bugs':'rabbit'}
animals['Bugs']
To add an element to the dictionary, we “select” a new key and assign it a value.
animals['Lassie'] = 'dog'
animals
Note that the keys of a dictionary must be unique. Assigning a value to an existing key overwrites its previously associated value. As you can also see from the example above, Python returns dictionary elements in the order in which they were inserted (this is guaranteed since Python 3.7; in earlier versions the order was arbitrary).
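A minimal sketch of the key-uniqueness rule, using a separate dictionary (pets, a name chosen only for this example): assigning to an existing key replaces its value, while assigning to a new key adds a pair.

```python
pets = {'Snowy': 'dog', 'Garfield': 'cat'}

pets['Garfield'] = 'tiger'   # existing key: the old value 'cat' is overwritten
pets['Bugs'] = 'rabbit'      # new key: a new pair is added

print(pets['Garfield'])   # tiger
print(len(pets))          # 3
print('Bugs' in pets)     # True
```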
►Question: Based on what we have learned so far, how could you represent a contact list in Python, i.e. a list of individuals with their names, phone numbers, email addresses, etc.?
Answer:
A for loop takes a list and executes a block of code once for each element of the list.
for i in range(1, 5):
    j = i * 2
    print(j)
The range(i, j) function produces the sequence of integers from i to j - 1; just like in the case of list slices, the upper bound is excluded.
Note the pattern of the block above: the for statement is followed by a colon, each line in the following block is indented at the same level, and there is no delimiter or statement indicating the end of the block. Unlike many other programming languages, where code indentation only serves to enhance readability, Python defines code blocks by changes in indentation.
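To make the role of indentation concrete, here is a small sketch (the variable names are illustrative). The indented line belongs to the loop and runs once per value; the unindented print runs once, after the loop has finished.

```python
print(list(range(1, 5)))   # [1, 2, 3, 4]: the upper bound 5 is excluded

total = 0
for k in range(1, 5):
    total = total + k      # indented: part of the loop body
print(total)               # not indented: runs after the loop, prints 10
```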
A for loop can be used to iterate over the elements of any list. In the following example, we create a contact list (as a list of dictionaries), then perform a loop over all contacts. Within the loop, we use a conditional statement (if) to check if the name is ‘Ann’. If so, we print the phone number; if not (else block), we print the name.
contacts = [{'name': 'Ann', 'phone': '555-111-2222'},
            {'name': 'Bob', 'phone': '555-333-4444'}]
for c in contacts:
    if c['name'] == 'Ann':
        print(c['phone'])
    else:
        print(c['name'])
To check whether an integer i is even, test i % 2 == 0, where % is the modulo (or division remainder) operator.
We already saw examples of a few built-in functions, such as type() or len(). You can define your own Python functions as a block of code starting with a def statement.
def add_2(num):
    result = num + 2
    return result

add_2(10)
The def keyword is followed by the function name, its arguments enclosed in parentheses (separated by commas if there is more than one), and a colon. The return statement passes the specified result as the output of the function. A simple return line with no output value just exits the function.
After it is defined, the function is invoked using its name and specifying the arguments in parentheses, in the same order as in its definition.
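When a function takes several arguments, the order in which they are passed matters. The sketch below uses a hypothetical two-argument function, subtract, which is not part of the lesson's own code.

```python
def subtract(a, b):
    """Return a minus b (hypothetical helper for illustration)."""
    return a - b

print(subtract(10, 3))   # a=10, b=3, prints 7
print(subtract(3, 10))   # a=3, b=10, prints -7
```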
►Exercise: Create a function that takes a list as an argument and returns its first and last elements as a new list.
#Create the function
#Run the function
So far we have only covered elements of the base Python language. However, most of Python’s useful tools for scientific programming can be found in packages that extend its base functionalities.
Because Python lists are meant to contain elements of any data type, they are not well suited to serve as numeric vectors. In particular, the + and * operations do not perform numerical calculations when applied to lists; rather, they respectively concatenate and repeat list elements.
add_list = [1, 2] + [3, 4]
mult_list = [5, 6] * 2
print (add_list, mult_list)
The NumPy package and its array type provide a solution to define vectors, matrices and higher-dimension arrays.
import numpy as np
vect = np.array([5, 20, 12])
vect
The first line of this code, import numpy as np, gives Python access to functions from the numpy package, using the package.function syntax. To save time typing package names, Python programmers often define short aliases for them, such as np here. This allows us to write np.array instead of numpy.array on the following line.
The definition of the array itself looks much like a Python list, and array subsetting follows the same conventions as list subsetting. The main difference is for multidimensional arrays, where the indices in each dimension can be separated by commas within one set of brackets. As an example, we create a 2 x 3 matrix and select the first two columns.
mat = np.array([[1, 2, 3], [4, 5, 6]])
mat[:, 0:2]
The initial “:” (with no indices) is interpreted as “select all rows”.
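A few more cases of comma-separated indexing, as a sketch on a stand-alone 2 x 3 array (named grid here only for the example):

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

print(grid.shape)    # (2, 3): two rows, three columns
print(grid[1, 2])    # single element in row 1, column 2: 6
print(grid[0, :])    # all columns of row 0
print(grid[:, 1:])   # all rows, columns from index 1 to the end
```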
Arithmetic operators and basic mathematical functions (e.g. exp, sqrt) are applied element-wise to NumPy arrays.
vect + np.array([1, 2, 3])
vect * 2
mat * vect
In the last example, vect was multiplied element-wise with each row of mat. To multiply a matrix and a vector (or two matrices, or two vectors in a dot product), use the dot method.
mat.dot(vect) # Alternate syntax is np.dot(mat, vect)
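To contrast element-wise operations with the matrix product, here is a short sketch using small made-up arrays (v and m are names chosen only for this example), so the results are easy to check by hand.

```python
import numpy as np

v = np.array([1.0, 4.0, 9.0])
m = np.array([[1, 2, 3], [4, 5, 6]])

print(np.sqrt(v))     # element-wise square root: 1, 2, 3
print(m * v)          # each row of m multiplied element-wise by v
print(m.dot(v))       # matrix-vector product: one value per row of m
print(np.dot(m, v))   # same product, function syntax
```

The element-wise product keeps the 2 x 3 shape of m, whereas the dot product sums over each row and returns a vector of length 2.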
If you have used the statistical programming language R, you are familiar with data frames, two-dimensional data structures where each column can hold a different type of data, as in a spreadsheet.
The data analysis library pandas provides a data frame object type for Python, along with functions to subset, filter, reshape and aggregate data stored in data frames.
After importing pandas, we call its read_csv function to load the Portal surveys data from the file surveys.csv.
import pandas as pd
surveys = pd.read_csv("../Data/Raw/surveys.csv")
surveys.head()
By default, the head method of a data frame shows its first five rows. To select a subset of rows and columns from the data frame, we can use the loc indexer, specifying a range of row labels and a list of column names. Note that unlike the usual way we specify number ranges in Python, the end of the range (row 3) is included here.
surveys.loc[1:3, ['plot_id', 'species_id']]
We can also select a whole column by writing its name in square brackets. Here, we select the weight column and call the describe method to get summary statistics for that column.
surveys['weight'].describe()
The loc indexer can also filter rows, if we specify a logical condition in place of the row labels. For example, here is how we could get the subset of surveys where the species is “DM”, and save it in a new data frame. Note that when we don’t specify any column names after the comma, all columns are kept.
surveys_dm = surveys.loc[surveys['species_id'] == 'DM', ]
Another useful feature of pandas is the groupby method, which defines groups of rows based on their values for a given variable. After grouping a data frame, we can use statistical methods (like mean) to get summary statistics by group.
surveys_group = surveys_dm.groupby('sex')
# Select several columns with a list (double brackets)
surveys_group[['hindfoot_length', 'weight']].mean()
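The selection, filtering and grouping steps can be tried without the surveys file. The sketch below uses a tiny toy data frame whose values are invented for illustration and are not taken from surveys.csv; only the column names mirror those used above.

```python
import pandas as pd

# Toy data invented for illustration; NOT taken from surveys.csv
toy = pd.DataFrame({
    'species_id': ['DM', 'DM', 'PF', 'DM'],
    'sex': ['M', 'F', 'M', 'F'],
    'weight': [40.0, 44.0, 8.0, 46.0],
})

# Label-based row range: rows 1, 2 and 3 are returned (end label included)
subset = toy.loc[1:3, ['species_id', 'weight']]

# Keep only the DM rows, then compute the mean weight for each sex
toy_dm = toy.loc[toy['species_id'] == 'DM']
mean_weight = toy_dm.groupby('sex')[['weight']].mean()

print(subset)
print(mean_weight)
```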
► Exercise: Knowing that the count method (e.g. surveys.count()) returns the number of non-missing values in each column of a data frame, find which month had the most observations recorded in surveys.
Hint: you'll want to group your data by month, and then display values by count.
To complete this lesson, we will draw plots of our data using the matplotlib package and more specifically its pyplot subpackage. The pandas package works particularly well with pyplot, since it defines plotting methods that work specifically for data frames.
In the following, we import pyplot, then call the plot method to create a scatterplot of weight against hindfoot_length from the surveys_dm data. The plt.show() function displays the active plot (inline in a notebook when the %matplotlib inline directive is used, or in a new window otherwise).
import matplotlib.pyplot as plt
%matplotlib inline
surveys_dm.plot('hindfoot_length', 'weight', kind = 'scatter')
plt.show()
Besides scatter, the plot method supports other kinds of plots such as bar and line graphs. To create the histogram of one variable from the data frame, you may use a different method, hist.
plt.close() # close the current plot to start a new one
surveys_dm.hist('weight')
plt.show()
The material in this lesson is partly based on Data Carpentry: Python for Ecologists and the Data Carpentry for Biologists course. These are good resources for a more detailed overview of data analysis and scientific computing in Python.