CASE - Observation data - analysis
© 2021, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')  # on Matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
EXERCISE 1

Read in the `survey_data_completed.csv` file and save the resulting DataFrame as the variable `survey_data_processed` (if you did not complete the previous notebook, a version of the csv file is available in the `data` folder). Interpret the 'eventDate' column as `datetime` objects and make sure the 'occurrenceID' column is used as the index of the resulting DataFrame (both can be done at once when reading the csv file using parameters of the `read_csv` function).

Hints:
- All read functions in Pandas start with `pd.read_...`
- What do `.head()` and `.info()` tell you?

# %load _solutions/case2_observations_analysis1.py
# %load _solutions/case2_observations_analysis2.py
# %load _solutions/case2_observations_analysis3.py
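The combination of both `read_csv` parameters can be sketched on a small made-up csv (the column names mirror the exercise, the data itself is invented):

```python
from io import StringIO

import pandas as pd

# Made-up miniature csv with the same column names as the survey data
csv_text = """occurrenceID,eventDate,species
1,2005-07-01,merriami
2,2005-07-02,ordii
"""

# parse_dates turns 'eventDate' into datetime objects and index_col
# promotes 'occurrenceID' to the index, both in a single read_csv call
df = pd.read_csv(StringIO(csv_text),
                 parse_dates=["eventDate"],
                 index_col="occurrenceID")
df.info()
```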
See `pandas_08_missing_values.ipynb` for an overview of the functionality to work with missing values.
EXERCISE 2

How many records in the data set have no information about the `species`? Use the `isna()` method to find out.

Hints:
- Do not use `survey_data_processed['species'] == np.nan`, but use the available method `isna()` to check whether a value is `NaN`.

# %load _solutions/case2_observations_analysis4.py
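The reason for the hint: `NaN` is not equal to anything, including itself, so an `== np.nan` comparison silently matches nothing. A sketch on an invented Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(["albigula", np.nan, "merriami", np.nan])

# NaN never compares equal, so == np.nan finds nothing:
print((s == np.nan).sum())   # 0
# isna() is the reliable check for missing values:
print(s.isna().sum())        # 2
```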
EXERCISE 3

How many duplicate records are present in the dataset? Use the `duplicated()` method to check if a row is a duplicate.
# %load _solutions/case2_observations_analysis5.py
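On a small invented DataFrame, `duplicated()` behaves like this:

```python
import pandas as pd

df = pd.DataFrame({"species": ["ordii", "ordii", "merriami"],
                   "sex": ["M", "M", "F"]})

# duplicated() marks every row that repeats an earlier row
print(df.duplicated())        # False, True, False
print(df.duplicated().sum())  # 1 duplicate record
```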
EXERCISE 4

Select all duplicate data in the `observations` data and assign the result to a new variable `duplicate_observations`. The `duplicated()` method provides a `keep` argument to define which duplicates (if any) to mark. Sort the `duplicate_observations` data on both the columns `eventDate` and `verbatimLocality` and show the first 9 records.

Hints:
- Check the documentation of the `duplicated` method to find out which value the argument `keep` requires to select all duplicate data.
- `sort_values()` can work with a single column name as well as a list of names.

# %load _solutions/case2_observations_analysis6.py
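A sketch of both hints on invented data (`keep=False` is the value that marks all members of each duplicated group):

```python
import pandas as pd

df = pd.DataFrame({
    "eventDate": ["2005-07-02", "2005-07-01", "2005-07-01", "2005-07-02"],
    "verbatimLocality": [2, 1, 1, 2],
})

# keep=False marks *all* members of each duplicated group, not just the repeats
duplicate_rows = df[df.duplicated(keep=False)]

# sort_values accepts a list of column names to sort on several columns at once
sorted_dups = duplicate_rows.sort_values(["eventDate", "verbatimLocality"])
print(sorted_dups)
```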
EXERCISE 5

Exclude the duplicate values from the `observations` data set and save the result as `survey_data_unique`. Use the `drop_duplicates()` method from Pandas.

Hints:
- `keep='first'` is the default option for `drop_duplicates`.
- `len` returns the number of records of a DataFrame; compare it before and after dropping the duplicates.

# %load _solutions/case2_observations_analysis7.py
# %load _solutions/case2_observations_analysis8.py
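A minimal illustration of `drop_duplicates` and the length check, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"species": ["ordii", "ordii", "merriami"]})

# keep='first' (the default) retains the first occurrence of each duplicate group
unique_df = df.drop_duplicates()
print(len(df), "->", len(unique_df))  # 3 -> 2
```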
EXERCISE 6

Use the `dropna()` method to find out:
- For how many observations (rows) is the `species` data available?

Hints:
- `dropna` by default removes all rows for which any of the columns contains a `NaN` value.
- To limit the check to specific columns, use the `subset` argument.

# %load _solutions/case2_observations_analysis9.py
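The difference between the default behaviour and the `subset` argument, sketched on invented data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"species": ["ordii", np.nan, "merriami"],
                   "wgt": [42.0, 30.0, np.nan]})

# Without arguments, dropna removes every row containing any NaN:
print(len(df.dropna()))                    # 1
# With subset, only the listed columns are checked for NaN:
print(len(df.dropna(subset=["species"])))  # 2
```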
EXERCISE 7

Filter the `survey_data_unique` data and select only those records that do not have a `species` while having information on the `sex`. Store the result as the variable `not_identified`.

Hints:
- To combine conditions element-wise, use the `&` operator.
- Pandas provides both an `isna()` and a `notna()` method to check the existence of `NaN` values.

# %load _solutions/case2_observations_analysis10.py
not_identified.head()
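The pattern of combining `isna()` and `notna()` with `&`, shown on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"species": [np.nan, "ordii", np.nan],
                   "sex": ["F", "M", np.nan]})

# Brackets around each condition; & combines them element-wise
mask = (df["species"].isna()) & (df["sex"].notna())
print(df[mask])  # only the first row qualifies
```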
NOTE!
The `DataFrame` we will use in the further analyses contains species information:
survey_data = survey_data_unique.dropna(subset=['species']).copy()
survey_data['name'] = survey_data['genus'] + ' ' + survey_data['species']
INFO
For biodiversity studies, absence values (knowing that something is not present) are useful as well to normalize the observations, but this is out of scope for these exercises.
survey_data['taxa'].value_counts()
#survey_data.groupby('taxa').size()
EXERCISE 8

Select the observations for which the `taxa` is equal to 'Rabbit', 'Bird' or 'Reptile'. Assign the result to a variable `non_rodent_species`. Use the `isin` method for the selection.

Hints:
- Instead of combining three separate conditions, use the `isin` operator with a list of names.

# %load _solutions/case2_observations_analysis11.py
len(non_rodent_species)
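`isin` on a toy column, as an alternative to chaining three `==` comparisons with `|`:

```python
import pandas as pd

df = pd.DataFrame({"taxa": ["Rodent", "Bird", "Rabbit", "Reptile", "Rodent"]})

# isin checks membership in a list of values, element-wise
selection = df[df["taxa"].isin(["Rabbit", "Bird", "Reptile"])]
print(len(selection))  # 3
```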
EXERCISE 9

Select the observations for which the `name` starts with the character 'r' (make sure it does not matter whether a capital character is used). Call the resulting variable `r_species`.

Hints:
- Remember the `.str.` construction that provides all kinds of string functionality? You can chain multiple of these after each other.
- Converting to lowercase first (`.lower()`) makes the check case-insensitive.

# %load _solutions/case2_observations_analysis12.py
len(r_species)
r_species["name"].value_counts()
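The chained `.str` pattern from the hints, on a few invented names:

```python
import pandas as pd

names = pd.Series(["Reithrodontomys megalotis", "Dipodomys ordii", "rodent sp."])

# .str methods can be chained: lowercase first so capitals do not matter
mask = names.str.lower().str.startswith("r")
print(names[mask])
```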
EXERCISE 10

Select the observations that are not Birds. Call the resulting variable `non_bird_species`.

Hints:
- Logical operators like `==`, `!=`, `>`, ... can still be used.

# %load _solutions/case2_observations_analysis13.py
len(non_bird_species)
EXERCISE 11

Select the Bird (`taxa` is Bird) observations from 1985-01 till 1989-12 using the `eventDate` column. Call the resulting variable `birds_85_89`.

Hints:
- Combine comparison operators such as `<=` with `&`, and don't forget to put brackets around each comparison that you combine.

# %load _solutions/case2_observations_analysis14.py
Alternative solution:
# %load _solutions/case2_observations_analysis15.py
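The bracketed-comparison pattern on a toy DataFrame (string dates are interpreted as timestamps in the comparison):

```python
import pandas as pd

df = pd.DataFrame({
    "taxa": ["Bird", "Bird", "Rodent"],
    "eventDate": pd.to_datetime(["1984-05-01", "1987-03-15", "1988-01-01"]),
})

# Each comparison in brackets, combined element-wise with &
birds_85_89 = df[(df["taxa"] == "Bird")
                 & (df["eventDate"] >= "1985-01-01")
                 & (df["eventDate"] < "1990-01-01")]
print(len(birds_85_89))  # 1
```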
EXERCISE 12

Calculate the median weight (the `wgt` column) for each of the species (the `name` column), using only the records for which weight information is available, and sort the result. Note: you can do this all in a single line statement, but you don't have to do it as such!

Hints:
- You will need (in this order) `dropna`, `groupby`, `median` and `sort_values`.

# %load _solutions/case2_observations_analysis16.py
# %load _solutions/case2_observations_analysis17.py
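The hinted chain, in order, on made-up weights:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["A", "A", "B", "B"],
                   "wgt": [10.0, 20.0, np.nan, 5.0]})

# dropna -> groupby -> median -> sort_values, all in one chain
median_wgt = (df.dropna(subset=["wgt"])
                .groupby("name")["wgt"]
                .median()
                .sort_values())
print(median_wgt)
```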
EXERCISE 13

Which 8 species (use the `name` column to identify the different species) have been observed most often over the entire data set?
# %load _solutions/case2_observations_analysis18.py
# %load _solutions/case2_observations_analysis19.py
EXERCISE 14

How many different species (`name`) have been observed in each of the `verbatimLocality` plots? Use the `nunique` method and assign the output to a new variable `n_species_per_plot`. Define a `Figure` (`fig`) and `Axes` (`ax`) to prepare a plot and make a horizontal bar chart using the Pandas `plot` function linked to the just created Matplotlib `ax`. Each bar represents the number of species per plot/`verbatimLocality`. Change the y-label to 'Plot number'.

Hints:
- Use `groupby` for this exercise; `nunique` is the aggregation function for each of the groups.
- `fig, ax = plt.subplots()` prepares a Matplotlib Figure and Axes.

# %load _solutions/case2_observations_analysis20.py
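The counting step (before any plotting) can be sketched on invented data:

```python
import pandas as pd

# Invented observations: plot 1 contains two species, plot 2 only one
df = pd.DataFrame({"verbatimLocality": [1, 1, 1, 2],
                   "name": ["A", "A", "B", "A"]})

# nunique counts the number of *distinct* species within each plot
n_species_per_plot = df.groupby("verbatimLocality")["name"].nunique()
print(n_species_per_plot)
```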
EXERCISE 15

In how many different plots (`verbatimLocality`) has each of the species (`name`) been observed? Assign the output to a new variable `n_plots_per_species` and sort the counts from low to high. Use the Pandas `plot` function to show the number of plots each of the species was found in (using the `n_plots_per_species` variable).

# %load _solutions/case2_observations_analysis21.py
EXERCISE 16

Starting from the `survey_data`, calculate the amount of males and females present in each of the plots (`verbatimLocality`). The result should return the counts for each of the combinations of `sex` and `verbatimLocality`. Assign to a new variable `n_plot_sex` and ensure the counts are in a column named "count". Next, use `pivot` to convert the `n_plot_sex` DataFrame to a new DataFrame with the `verbatimLocality` as index and `male`/`female` as column names. Assign to a new variable `pivoted`.

Hints:
- `groupby` can also be used with multiple columns at the same time.
- If a `groupby` operation gives a Series as result, you can give that Series a name with the `.rename(..)` method.
- `reset_index()` is a useful function to convert multiple indices into columns again.

# %load _solutions/case2_observations_analysis22.py
# %load _solutions/case2_observations_analysis23.py
pivoted.head()
To check, we can use the variable pivoted
to plot the result:
pivoted.plot(kind='bar', figsize=(12, 6), rot=0)
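The full long-to-wide pattern of this exercise, sketched on invented data:

```python
import pandas as pd

df = pd.DataFrame({"verbatimLocality": [1, 1, 2, 2, 2],
                   "sex": ["M", "F", "M", "M", "F"]})

# groupby on two columns, name the resulting Series, turn the index
# levels back into regular columns:
n_plot_sex = (df.groupby(["verbatimLocality", "sex"])
                .size()
                .rename("count")
                .reset_index())

# pivot reshapes the long table to one row per plot, one column per sex
pivoted = n_plot_sex.pivot(index="verbatimLocality", columns="sex",
                           values="count")
print(pivoted)
```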
EXERCISE 17

Recreate the previous plot with the `catplot` function from the Seaborn library, starting from `n_plot_sex`.

Hints:
- Check the `kind` argument of the `catplot` function to figure out how to specify that you want a barplot with given x and y values.
- To link a column to different colors, use the `hue` argument.
- Using `height` and `aspect`, the figure size can be optimized.

# %load _solutions/case2_observations_analysis24.py
EXERCISE 18

Recreate the previous plot with the `catplot` function from the Seaborn library, directly starting from `survey_data`.

Hints:
- Check the `kind` argument of the `catplot` function to find out how to use counts to define the bars instead of a `y` value.
- To link a column to different colors, use the `hue` argument.

# %load _solutions/case2_observations_analysis25.py
EXERCISE 19

Make a table of the number of observations of each of the species (`name`) in each of the plots (`verbatimLocality`). Each of the species `name`s is a row index and each of the `verbatimLocality` plots is a column name.

Hints:
- Use the `index`, `columns`, `values` and `aggfunc` parameters of the `pivot_table` function. You can use the `datasetName` column to count the number of observations for each name/locality combination (when counting rows, the exact column doesn't matter).

# %load _solutions/case2_observations_analysis26.py
# %load _solutions/case2_observations_analysis27.py
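A `pivot_table` sketch on invented data; combinations that never occur end up as `NaN`:

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "A", "B"],
                   "verbatimLocality": [1, 2, 1],
                   "datasetName": ["x", "x", "x"]})

# aggfunc='count' counts the non-NaN values per name/plot combination;
# any always-filled column works as `values` when counting rows
table = pd.pivot_table(df, index="name", columns="verbatimLocality",
                       values="datasetName", aggfunc="count")
print(table)
```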
EXERCISE 20

Make a plot visualizing the evolution of the number of observations for each of the individual years (i.e. annual counts) using the `resample` method.

Hints:
- You want to `resample` the data using the `eventDate` column to create annual counts. If the index is not a datetime-index, you can use the `on=` keyword to specify which datetime column to use.
- `resample` needs an aggregation function that defines how to combine the values within a single 'group' (in this case the data within a year). Here, we want to know the `size` of each group, i.e. the number of records within each year.

# %load _solutions/case2_observations_analysis28.py
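`resample` with the `on=` keyword on a made-up datetime column ('YS' buckets the records per year):

```python
import pandas as pd

df = pd.DataFrame({
    "eventDate": pd.to_datetime(["1985-03-01", "1985-09-10", "1986-01-05"]),
})

# on= points resample at the datetime column; size() counts the
# records within each yearly bin
annual_counts = df.resample("YS", on="eventDate").size()
print(annual_counts)
```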
To evaluate the intensity or number of occurrences during different time spans, a heatmap is an interesting representation.
EXERCISE 21

Create a table, called `heatmap_prep`, based on the `survey_data` DataFrame with the individual years in the row index, the months of the year (1 -> 12) in the columns, and as values of the table the counts for each of these year/month combinations. Use the Seaborn `heatmap` function to visualize the `heatmap_prep` variable.

Hints:
- The `.dt` accessor can be used to get the `year`, `month`, ... from a `datetime` column.
- Use `pivot_table` and provide the years to `index` and the months to `columns`. Do not forget to `count` the number of records for each combination (`aggfunc`).
- Seaborn has a `heatmap` function which requires a short-form DataFrame, comparable to giving each element in a table a color value.

# %load _solutions/case2_observations_analysis29.py
Remark that we started from a tidy data format (also called long format) and converted it to a short format, with the years in the row index, the months in the columns, and the counts for each of these year/month combinations as values.
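The `.dt` extraction plus `pivot_table` conversion, sketched on invented dates:

```python
import pandas as pd

df = pd.DataFrame({
    "eventDate": pd.to_datetime(["1985-03-01", "1985-03-20", "1986-04-05"]),
    "occurrenceID": [1, 2, 3],
})

# .dt extracts calendar components from a datetime column
df["year"] = df["eventDate"].dt.year
df["month"] = df["eventDate"].dt.month

# Long -> short format: years as rows, months as columns, counts as values
heatmap_prep = pd.pivot_table(df, index="year", columns="month",
                              values="occurrenceID", aggfunc="count")
print(heatmap_prep)
```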
In this section, all plots can be made with the embedded Pandas `plot` function, unless specifically asked otherwise.
EXERCISE 22

Plot, using the Pandas `plot` function, the number of records for Dipodomys merriami for each month of the year (January (1) -> December (12)), aggregated over all years.

Hints:
- Extract the month of the year from the `eventDate` column and use it in a `groupby`.
- `resample` is not useful here, as we do not want to change the time-interval, but look at the month of the year (over all years).

# %load _solutions/case2_observations_analysis30.py
# %load _solutions/case2_observations_analysis31.py
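Grouping on the month of the year (rather than resampling), on invented dates:

```python
import pandas as pd

df = pd.DataFrame({
    "eventDate": pd.to_datetime(["1985-03-01", "1986-03-20", "1986-04-05"]),
})

# Group on the month number, pooling the same month across all years
per_month = df.groupby(df["eventDate"].dt.month).size()
print(per_month)  # month 3 -> 2 records, month 4 -> 1 record
```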
EXERCISE 23

Plot, for the species 'Dipodomys merriami', 'Dipodomys ordii', 'Reithrodontomys megalotis' and 'Chaetodipus baileyi', the monthly number of records as a function of time during the monitoring period. Plot each of the individual species in a separate subplot and provide them all with the same y-axis scale.

Hints:
- `isin` is useful to select from within a list of elements.
- `groupby` AND `resample` need to be combined. We do want to change the time-interval to represent data as a function of time (`resample`) and we want to do this for each name/species (`groupby`). The order matters!
- `unstack` is a Pandas function a bit similar to `pivot`. Check the unstack documentation as it might be helpful for this exercise.

# %load _solutions/case2_observations_analysis32.py
# %load _solutions/case2_observations_analysis33.py
# %load _solutions/case2_observations_analysis34.py
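The groupby-then-resample-then-unstack pattern on a made-up example (setting a datetime index first keeps the resample call simple):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "B"],
    "eventDate": pd.to_datetime(["1985-03-02", "1985-03-20", "1985-04-05"]),
})

# Group per species first, then resample each group to monthly counts;
# unstack moves the species level of the index into columns
month_evolution = (df.set_index("eventDate")
                     .groupby("name")
                     .resample("MS")
                     .size()
                     .unstack("name"))
print(month_evolution)
```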
EXERCISE 24

Recreate the same plot as in the previous exercise using the Seaborn `relplot` function with the `month_evolution` variable.

Hints:
- We want to plot the `counts` as a function of `eventDate`, so link these columns to y and x respectively.
- To create a subplot per species, use the `row`/`col` parameter.
- Using `height` and `aspect`, the figure size can be optimized.

Uncomment the next cell (it calculates `month_evolution`, the intermediate result of the previous exercise):
# %load _solutions/case2_observations_analysis35.py
Plotting with seaborn:
# %load _solutions/case2_observations_analysis36.py
EXERCISE 25

Plot the annual amount of occurrences for each of the 'taxa' as a function of time using Seaborn. Plot each taxa in a separate subplot and do not share the y-axis among the facets.

Hints:
- Combine `resample` and `groupby`!
- Pass `sharey=False` to the `facet_kws` argument as a dictionary.

# %load _solutions/case2_observations_analysis37.py
# %load _solutions/case2_observations_analysis38.py
EXERCISE 26

The observations were taken by volunteers. You wonder on which day of the week the most observations were done. Calculate for each day of the week (`dayofweek`) the number of observations and make a bar plot.

Hints:
- Did you know the Python standard library provides a module `calendar` which contains the names of week days, month names, ...?

# %load _solutions/case2_observations_analysis39.py
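A sketch of `dayofweek` combined with the standard-library `calendar` module, on invented dates:

```python
import calendar

import pandas as pd

df = pd.DataFrame({
    "eventDate": pd.to_datetime(["2005-07-04", "2005-07-11", "2005-07-05"]),
})

# dayofweek: Monday=0 ... Sunday=6
counts = df.groupby(df["eventDate"].dt.dayofweek).size()

# calendar.day_name maps those numbers to readable weekday labels
counts.index = [calendar.day_name[i] for i in counts.index]
print(counts)
```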
Nice work!