CASE - Observation data - analysis
© 2021, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')  # on Matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
EXERCISE 1

Read in the `survey_data_completed.csv` file and save the resulting DataFrame as the variable `survey_data_processed` (if you did not complete the previous notebook, a version of the csv file is available in the `data` folder). Interpret the 'eventDate' column as `datetime` objects and make sure the 'occurrenceID' column is used as the index of the resulting DataFrame (both can be done at once when reading the csv file using parameters of the `read_csv` function).

Hints:
- All read functions in Pandas start with `pd.read_...`
- What do `.head()` and `.info()` tell you?

# %load _solutions/case2_observations_analysis1.py
# %load _solutions/case2_observations_analysis2.py
# %load _solutions/case2_observations_analysis3.py
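The combination of both `read_csv` parameters can be sketched on a small made-up csv (the column names mirror the exercise, the data itself is invented):

```python
from io import StringIO

import pandas as pd

# Made-up miniature csv with the same column names as the survey data
csv_text = """occurrenceID,eventDate,species
1,2005-07-01,merriami
2,2005-07-02,ordii
"""

# parse_dates turns 'eventDate' into datetime objects and index_col
# promotes 'occurrenceID' to the index, both in a single read_csv call
df = pd.read_csv(StringIO(csv_text),
                 parse_dates=["eventDate"],
                 index_col="occurrenceID")
df.info()
```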
See `pandas_08_missing_values.ipynb` for an overview of the functionality to work with missing values.
EXERCISE 2

How many records in the data set have no information about the `species`? Use the `isna()` method to find out.

Hints:
- Do not use `survey_data_processed['species'] == np.nan`, but use the available method `isna()` to check whether a value is `NaN`.

# %load _solutions/case2_observations_analysis4.py
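The reason for the hint: `NaN` is not equal to anything, including itself, so an `== np.nan` comparison silently matches nothing. A sketch on an invented Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(["albigula", np.nan, "merriami", np.nan])

# NaN never compares equal, so == np.nan finds nothing:
print((s == np.nan).sum())   # 0
# isna() is the reliable check for missing values:
print(s.isna().sum())        # 2
```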
EXERCISE 3

How many duplicate records are present in the dataset? Use the `duplicated()` method to check if a row is a duplicate.
# %load _solutions/case2_observations_analysis5.py
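On a small invented DataFrame, `duplicated()` behaves like this:

```python
import pandas as pd

df = pd.DataFrame({"species": ["ordii", "ordii", "merriami"],
                   "sex": ["M", "M", "F"]})

# duplicated() marks every row that repeats an earlier row
print(df.duplicated())        # False, True, False
print(df.duplicated().sum())  # 1 duplicate record
```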
EXERCISE 4

Select all duplicate data in the `observations` data and assign the result to a new variable `duplicate_observations`. The `duplicated()` method provides a `keep` argument to define which duplicates (if any) to mark. Sort the `duplicate_observations` data on both the columns `eventDate` and `verbatimLocality` and show the first 9 records.

Hints:
- Check the documentation of the `duplicated` method to find out which value the argument `keep` requires to select all duplicate data.
- `sort_values()` can work with a single column name as well as a list of names.

# %load _solutions/case2_observations_analysis6.py
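A sketch of both hints on invented data (`keep=False` is the value that marks all members of each duplicated group):

```python
import pandas as pd

df = pd.DataFrame({
    "eventDate": ["2005-07-02", "2005-07-01", "2005-07-01", "2005-07-02"],
    "verbatimLocality": [2, 1, 1, 2],
})

# keep=False marks *all* members of each duplicated group, not just the repeats
duplicate_rows = df[df.duplicated(keep=False)]

# sort_values accepts a list of column names to sort on several columns at once
sorted_dups = duplicate_rows.sort_values(["eventDate", "verbatimLocality"])
print(sorted_dups)
```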
EXERCISE 5

Exclude the duplicate values from the `observations` data set and save the result as `survey_data_unique`. Use the `drop_duplicates()` method from Pandas.

Hints:
- `keep='first'` is the default option for `drop_duplicates`.
- `len` returns the number of records of a DataFrame; compare it before and after dropping the duplicates.

# %load _solutions/case2_observations_analysis7.py
# %load _solutions/case2_observations_analysis8.py
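A minimal illustration of `drop_duplicates` and the length check, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"species": ["ordii", "ordii", "merriami"]})

# keep='first' (the default) retains the first occurrence of each duplicate group
unique_df = df.drop_duplicates()
print(len(df), "->", len(unique_df))  # 3 -> 2
```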
EXERCISE 6

Use the `dropna()` method to find out:
- For how many observations (rows) is the `species` data available?

Hints:
- `dropna` by default removes all rows for which any of the columns contains a `NaN` value.
- To limit the check to specific columns, use the `subset` argument.

# %load _solutions/case2_observations_analysis9.py
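The difference between the default behaviour and the `subset` argument, sketched on invented data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"species": ["ordii", np.nan, "merriami"],
                   "wgt": [42.0, 30.0, np.nan]})

# Without arguments, dropna removes every row containing any NaN:
print(len(df.dropna()))                    # 1
# With subset, only the listed columns are checked for NaN:
print(len(df.dropna(subset=["species"])))  # 2
```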
EXERCISE 7

Filter the `survey_data_unique` data and select only those records that do not have a `species` while having information on the `sex`. Store the result as the variable `not_identified`.

Hints:
- To combine conditions element-wise, use the `&` operator.
- Pandas provides both an `isna()` and a `notna()` method to check the existence of `NaN` values.

# %load _solutions/case2_observations_analysis10.py
not_identified.head()
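The pattern of combining `isna()` and `notna()` with `&`, shown on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"species": [np.nan, "ordii", np.nan],
                   "sex": ["F", "M", np.nan]})

# Brackets around each condition; & combines them element-wise
mask = (df["species"].isna()) & (df["sex"].notna())
print(df[mask])  # only the first row qualifies
```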
NOTE!
The `DataFrame` we will use in the further analyses contains species information:
survey_data = survey_data_unique.dropna(subset=['species']).copy()
survey_data['name'] = survey_data['genus'] + ' ' + survey_data['species']
INFO
For biodiversity studies, absence values (knowing that something is not present) are useful as well to normalize the observations, but this is out of scope for these exercises.
survey_data['taxa'].value_counts()
#survey_data.groupby('taxa').size()
EXERCISE 8

Select the observations for which the `taxa` is equal to 'Rabbit', 'Bird' or 'Reptile'. Assign the result to a variable `non_rodent_species`. Use the `isin` method for the selection.

Hints:
- Instead of combining three separate conditions, use the `isin` operator with a list of names.

# %load _solutions/case2_observations_analysis11.py
len(non_rodent_species)
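`isin` on a toy column, as an alternative to chaining three `==` comparisons with `|`:

```python
import pandas as pd

df = pd.DataFrame({"taxa": ["Rodent", "Bird", "Rabbit", "Reptile", "Rodent"]})

# isin checks membership in a list of values, element-wise
selection = df[df["taxa"].isin(["Rabbit", "Bird", "Reptile"])]
print(len(selection))  # 3
```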
EXERCISE 9

Select the observations for which the `name` starts with the character 'r' (make sure it does not matter whether a capital character is used). Call the resulting variable `r_species`.

Hints:
- Remember the `.str.` construction that provides all kinds of string functionality? You can chain multiple of these after each other.
- Converting to lowercase first (`.lower()`) makes the check case-insensitive.

# %load _solutions/case2_observations_analysis12.py
len(r_species)
r_species["name"].value_counts()
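The chained `.str` pattern from the hints, on a few invented names:

```python
import pandas as pd

names = pd.Series(["Reithrodontomys megalotis", "Dipodomys ordii", "rodent sp."])

# .str methods can be chained: lowercase first so capitals do not matter
mask = names.str.lower().str.startswith("r")
print(names[mask])
```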
EXERCISE 10

Select the observations that are not Birds. Call the resulting variable `non_bird_species`.

Hints:
- Logical operators like `==`, `!=`, `>`, ... can still be used.

# %load _solutions/case2_observations_analysis13.py
len(non_bird_species)
EXERCISE 11

Select the Bird (`taxa` is Bird) observations from 1985-01 till 1989-12 using the `eventDate` column. Call the resulting variable `birds_85_89`.

Hints:
- Combine comparison operators such as `<=` with `&`, and don't forget to put brackets around each comparison that you combine.

# %load _solutions/case2_observations_analysis14.py
Alternative solution:
# %load _solutions/case2_observations_analysis15.py
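The bracketed-comparison pattern on a toy DataFrame (string dates are interpreted as timestamps in the comparison):

```python
import pandas as pd

df = pd.DataFrame({
    "taxa": ["Bird", "Bird", "Rodent"],
    "eventDate": pd.to_datetime(["1984-05-01", "1987-03-15", "1988-01-01"]),
})

# Each comparison in brackets, combined element-wise with &
birds_85_89 = df[(df["taxa"] == "Bird")
                 & (df["eventDate"] >= "1985-01-01")
                 & (df["eventDate"] < "1990-01-01")]
print(len(birds_85_89))  # 1
```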
EXERCISE 12

Calculate the median weight (the `wgt` column) for each of the species (the `name` column), using only the records for which weight information is available, and sort the result. Note: you can do this all in a single line statement, but you don't have to do it as such!

Hints:
- You will need (in this order) `dropna`, `groupby`, `median` and `sort_values`.

# %load _solutions/case2_observations_analysis16.py
# %load _solutions/case2_observations_analysis17.py
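The hinted chain, in order, on made-up weights:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["A", "A", "B", "B"],
                   "wgt": [10.0, 20.0, np.nan, 5.0]})

# dropna -> groupby -> median -> sort_values, all in one chain
median_wgt = (df.dropna(subset=["wgt"])
                .groupby("name")["wgt"]
                .median()
                .sort_values())
print(median_wgt)
```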
EXERCISE 13

Which 8 species (use the `name` column to identify the different species) have been observed most often over the entire data set?
# %load _solutions/case2_observations_analysis18.py
# %load _solutions/case2_observations_analysis19.py
EXERCISE 14

How many different species (`name`) have been observed in each of the `verbatimLocality` plots? Use the `nunique` method and assign the output to a new variable `n_species_per_plot`. Define a `Figure` (`fig`) and `Axes` (`ax`) to prepare a plot and make a horizontal bar chart using the Pandas `plot` function linked to the just created Matplotlib `ax`. Each bar represents the number of species per plot/`verbatimLocality`. Change the y-label to 'Plot number'.

Hints:
- Use `groupby` for this exercise; `nunique` is the aggregation function for each of the groups.
- `fig, ax = plt.subplots()` prepares a Matplotlib Figure and Axes.

# %load _solutions/case2_observations_analysis20.py
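The counting step (before any plotting) can be sketched on invented data:

```python
import pandas as pd

# Invented observations: plot 1 contains two species, plot 2 only one
df = pd.DataFrame({"verbatimLocality": [1, 1, 1, 2],
                   "name": ["A", "A", "B", "A"]})

# nunique counts the number of *distinct* species within each plot
n_species_per_plot = df.groupby("verbatimLocality")["name"].nunique()
print(n_species_per_plot)
```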
EXERCISE 15

In how many different plots (`verbatimLocality`) has each of the species (`name`) been observed? Assign the output to a new variable `n_plots_per_species` and sort the counts from low to high. Use the Pandas `plot` function to show the number of plots each of the species was found in (using the `n_plots_per_species` variable).

# %load _solutions/case2_observations_analysis21.py
EXERCISE 16

Starting from the `survey_data`, calculate the amount of males and females present in each of the plots (`verbatimLocality`). The result should return the counts for each of the combinations of `sex` and `verbatimLocality`. Assign to a new variable `n_plot_sex` and ensure the counts are in a column named "count". Next, use `pivot` to convert the `n_plot_sex` DataFrame to a new DataFrame with the `verbatimLocality` as index and `male`/`female` as column names. Assign to a new variable `pivoted`.

Hints:
- `groupby` can also be used with multiple columns at the same time.
- If a `groupby` operation gives a Series as result, you can give that Series a name with the `.rename(..)` method.
- `reset_index()` is a useful function to convert multiple indices into columns again.

# %load _solutions/case2_observations_analysis22.py
# %load _solutions/case2_observations_analysis23.py
pivoted.head()
To check, we can use the variable pivoted
to plot the result:
pivoted.plot(kind='bar', figsize=(12, 6), rot=0)
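The full long-to-wide pattern of this exercise, sketched on invented data:

```python
import pandas as pd

df = pd.DataFrame({"verbatimLocality": [1, 1, 2, 2, 2],
                   "sex": ["M", "F", "M", "M", "F"]})

# groupby on two columns, name the resulting Series, turn the index
# levels back into regular columns:
n_plot_sex = (df.groupby(["verbatimLocality", "sex"])
                .size()
                .rename("count")
                .reset_index())

# pivot reshapes the long table to one row per plot, one column per sex
pivoted = n_plot_sex.pivot(index="verbatimLocality", columns="sex",
                           values="count")
print(pivoted)
```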
EXERCISE 17

Recreate the previous plot with the `catplot` function from the Seaborn library, starting from `n_plot_sex`.

Hints:
- Check the `kind` argument of the `catplot` function to figure out how to specify that you want a barplot with given x and y values.
- To link a column to different colors, use the `hue` argument.
- Using `height` and `aspect`, the figure size can be optimized.

# %load _solutions/case2_observations_analysis24.py
EXERCISE 18

Recreate the previous plot with the `catplot` function from the Seaborn library, directly starting from `survey_data`.

Hints:
- Check the `kind` argument of the `catplot` function to find out how to use counts to define the bars instead of a `y` value.
- To link a column to different colors, use the `hue` argument.

# %load _solutions/case2_observations_analysis25.py
EXERCISE 19

Make a table of the number of observations of each of the species (`name`) in each of the plots (`verbatimLocality`). Each of the species `name`s is a row index and each of the `verbatimLocality` plots is a column name.

Hints:
- Use the `index`, `columns`, `values` and `aggfunc` parameters of the `pivot_table` function. You can use the `datasetName` column to count the number of observations for each name/locality combination (when counting rows, the exact column doesn't matter).

# %load _solutions/case2_observations_analysis26.py
# %load _solutions/case2_observations_analysis27.py
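A `pivot_table` sketch on invented data; combinations that never occur end up as `NaN`:

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "A", "B"],
                   "verbatimLocality": [1, 2, 1],
                   "datasetName": ["x", "x", "x"]})

# aggfunc='count' counts the non-NaN values per name/plot combination;
# any always-filled column works as `values` when counting rows
table = pd.pivot_table(df, index="name", columns="verbatimLocality",
                       values="datasetName", aggfunc="count")
print(table)
```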
EXERCISE 20

Make a plot visualizing the evolution of the number of observations for each of the individual years (i.e. annual counts) using the `resample` method.

Hints:
- You want to `resample` the data using the `eventDate` column to create annual counts. If the index is not a datetime-index, you can use the `on=` keyword to specify which datetime column to use.
- `resample` needs an aggregation function that defines how to combine the values within a single 'group' (in this case the data within a year). Here, we want to know the `size` of each group, i.e. the number of records within each year.

# %load _solutions/case2_observations_analysis28.py
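`resample` with the `on=` keyword on a made-up datetime column ('YS' buckets the records per year):

```python
import pandas as pd

df = pd.DataFrame({
    "eventDate": pd.to_datetime(["1985-03-01", "1985-09-10", "1986-01-05"]),
})

# on= points resample at the datetime column; size() counts the
# records within each yearly bin
annual_counts = df.resample("YS", on="eventDate").size()
print(annual_counts)
```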
To evaluate the intensity or number of occurrences during different time spans, a heatmap is an interesting representation.
EXERCISE 21

Create a table, called `heatmap_prep`, based on the `survey_data` DataFrame with the individual years in the row index, the months of the year (1 -> 12) in the columns, and as values of the table the counts for each of these year/month combinations. Use the Seaborn `heatmap` function to visualize the `heatmap_prep` variable.

Hints:
- The `.dt` accessor can be used to get the `year`, `month`, ... from a `datetime` column.
- Use `pivot_table` and provide the years to `index` and the months to `columns`. Do not forget to `count` the number of records for each combination (`aggfunc`).
- Seaborn has a `heatmap` function which requires a short-form DataFrame, comparable to giving each element in a table a color value.

# %load _solutions/case2_observations_analysis29.py
Remark that we started from a tidy data format (also called long format) and converted it to a short format, with the years in the row index, the months in the columns, and the counts for each of these year/month combinations as values.
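The `.dt` extraction plus `pivot_table` conversion, sketched on invented dates:

```python
import pandas as pd

df = pd.DataFrame({
    "eventDate": pd.to_datetime(["1985-03-01", "1985-03-20", "1986-04-05"]),
    "occurrenceID": [1, 2, 3],
})

# .dt extracts calendar components from a datetime column
df["year"] = df["eventDate"].dt.year
df["month"] = df["eventDate"].dt.month

# Long -> short format: years as rows, months as columns, counts as values
heatmap_prep = pd.pivot_table(df, index="year", columns="month",
                              values="occurrenceID", aggfunc="count")
print(heatmap_prep)
```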
In this section, all plots can be made with the embedded Pandas `plot` function, unless specifically asked otherwise.
EXERCISE 22

Plot, using the Pandas `plot` function, the number of records for Dipodomys merriami for each month of the year (January (1) -> December (12)), aggregated over all years.

Hints:
- Extract the month of the year from the `eventDate` column and use it in a `groupby`.
- `resample` is not useful here, as we do not want to change the time-interval, but look at the month of the year (over all years).

# %load _solutions/case2_observations_analysis30.py
# %load _solutions/case2_observations_analysis31.py
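Grouping on the month of the year (rather than resampling), on invented dates:

```python
import pandas as pd

df = pd.DataFrame({
    "eventDate": pd.to_datetime(["1985-03-01", "1986-03-20", "1986-04-05"]),
})

# Group on the month number, pooling the same month across all years
per_month = df.groupby(df["eventDate"].dt.month).size()
print(per_month)  # month 3 -> 2 records, month 4 -> 1 record
```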
EXERCISE 23

Plot, for the species 'Dipodomys merriami', 'Dipodomys ordii', 'Reithrodontomys megalotis' and 'Chaetodipus baileyi', the monthly number of records as a function of time during the monitoring period. Plot each of the individual species in a separate subplot and provide them all with the same y-axis scale.

Hints:
- `isin` is useful to select from within a list of elements.
- `groupby` AND `resample` need to be combined. We do want to change the time-interval to represent data as a function of time (`resample`) and we want to do this for each name/species (`groupby`). The order matters!
- `unstack` is a Pandas function a bit similar to `pivot`. Check the unstack documentation as it might be helpful for this exercise.

# %load _solutions/case2_observations_analysis32.py
# %load _solutions/case2_observations_analysis33.py
# %load _solutions/case2_observations_analysis34.py
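The groupby-then-resample-then-unstack pattern on a made-up example (setting a datetime index first keeps the resample call simple):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "B"],
    "eventDate": pd.to_datetime(["1985-03-02", "1985-03-20", "1985-04-05"]),
})

# Group per species first, then resample each group to monthly counts;
# unstack moves the species level of the index into columns
month_evolution = (df.set_index("eventDate")
                     .groupby("name")
                     .resample("MS")
                     .size()
                     .unstack("name"))
print(month_evolution)
```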
EXERCISE 24

Recreate the same plot as in the previous exercise using the Seaborn `relplot` function with the `month_evolution` variable.

Hints:
- We want to plot the `counts` as a function of `eventDate`, so link these columns to y and x respectively.
- To create a subplot per species, use the `row`/`col` parameter.
- Using `height` and `aspect`, the figure size can be optimized.

Uncomment the next cell (it calculates `month_evolution`, the intermediate result of the previous exercise):
# %load _solutions/case2_observations_analysis35.py
Plotting with seaborn:
# %load _solutions/case2_observations_analysis36.py
EXERCISE 25

Plot the annual amount of occurrences for each of the 'taxa' as a function of time using Seaborn. Plot each taxa in a separate subplot and do not share the y-axis among the facets.

Hints:
- Combine `resample` and `groupby`!
- Pass `sharey=False` to the `facet_kws` argument as a dictionary.

# %load _solutions/case2_observations_analysis37.py
# %load _solutions/case2_observations_analysis38.py
EXERCISE 26

The observations were taken by volunteers. You wonder on which day of the week the most observations were done. Calculate for each day of the week (`dayofweek`) the number of observations and make a bar plot.

Hints:
- Did you know the Python standard library provides a module `calendar` which contains the names of week days, month names, ...?

# %load _solutions/case2_observations_analysis39.py
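A sketch of `dayofweek` combined with the standard-library `calendar` module, on invented dates:

```python
import calendar

import pandas as pd

df = pd.DataFrame({
    "eventDate": pd.to_datetime(["2005-07-04", "2005-07-11", "2005-07-05"]),
})

# dayofweek: Monday=0 ... Sunday=6
counts = df.groupby(df["eventDate"].dt.dayofweek).size()

# calendar.day_name maps those numbers to readable weekday labels
counts.index = [calendar.day_name[i] for i in counts.index]
print(counts)
```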
Nice work!