EDIPI Training Event 1&2, November 2021

Introductory exercises: data analysis with xarray and scipy

Imports¶

The first cell in a notebook should always contain all necessary imports.

In [1]:

import numpy as np
import xarray as xr
import scipy.stats as stats
from matplotlib import pyplot as plt

Load the dataset and take a look at the contents¶

The dataset contains a single variable called 'tmax' with time coordinate 'time'.

The 'tmax' variable is the ERA5 daily-maximum near-surface air temperature (in degrees Celsius) for the Pacific Northwest area of the USA, covering the period 1950-2021.

In [2]:

ds = xr.open_dataset('itmax_era5_index.nc')
ds

Out[2]:

<xarray.Dataset>
Dimensions:  (time: 26298)
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1950-01-02 ... 2021-12-31
Data variables:
    tmax     (time) float32 -9.767 -13.07 -15.17 -12.89 ... nan nan nan nan
Attributes: (12/21)
    title:                      spatial statistic of "ERA5 reanalysis, https:...
    description:                tmax era5 index
    scripturl01:                https://climexp.knmi.nl/get_index.cgi?email=$...
    minimal_valid_fraction:      30.00
    file:                       ERA5/era5_tmax_daily_na_extended.nc
    cdi:                        Climate Data Interface version 1.9.10 (https:...
    ...                         ...
    cdo:                        Climate Data Operators version 1.9.10 (https:...
    ave_region:                 lon= -123.125 -118.875, lat=   44.875   52.125
    comment:                    
    scripturl02:                https://climexp.knmi.nl/dat2nc.cgi?id=$id&sta...
    history:                     2021-11-03 18:34:46 bin/dat2nc data/itmax_er...
    Conventions:                CF-1.0


Extract the tmax data array and plot it¶

Clearly, the summer of 2021 was an extreme event; you can read about it here

In [3]:

tx = ds['tmax'] # tx now contains the tmax data

In [4]:

plt.figure(figsize=(15,5))
tx.plot();

The properties of time¶

The 'time' coordinate has dtype datetime64, which gives access to a lot of powerful time-series functionality in xarray; see here

In particular, the .dt accessor gives access to a lot of information: “year”, “month”, “day”, “hour”, “minute”, “second”, “dayofyear”, “week”, “dayofweek”, “weekday”, “quarter” and “season”.

In [5]:

dates = ds['time'].sel(time = ['1950-01-01', '1966-07-14', '2001-09-11']) # can select specific dates as strings
dates

Out[5]:

<xarray.DataArray 'time' (time: 3)>
array(['1950-01-01T00:00:00.000000000', '1966-07-14T00:00:00.000000000',
       '2001-09-11T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1966-07-14 2001-09-11
Attributes:
    standard_name:  time
    long_name:      time
    axis:           T

In [6]:

dates.dt.dayofyear # the day of year at those dates

Out[6]:

<xarray.DataArray 'dayofyear' (time: 3)>
array([  1, 195, 254])
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1966-07-14 2001-09-11

In [7]:

dates.dt.season # the season at those dates

Out[7]:

<xarray.DataArray 'season' (time: 3)>
array(['DJF', 'JJA', 'SON'], dtype='<U3')
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1966-07-14 2001-09-11

Exercises¶

Compute the mean, median and 99th percentile temperature¶

Use the xarray .mean and .quantile functions

In [8]:

print('mean: %4.1f' % tx.mean(dim='time'))
print('median: %4.1f' % tx.quantile(0.5, dim='time'))
print('99th percentile: %4.1f' % tx.quantile(0.99, dim='time') )

mean: 11.1
median: 10.2
99th percentile: 29.9
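
The series ends with missing values (the nan entries in the dataset summary above), so it helps to know how these reductions treat them. A minimal sketch on a made-up five-element array (`toy` is hypothetical, not part of the dataset), relying on xarray's default of skipping NaNs in floating-point data:

```python
import numpy as np
import xarray as xr

# toy series with one missing value (hypothetical data, not the ERA5 index)
toy = xr.DataArray([1.0, 2.0, 3.0, 4.0, np.nan], dims='time')

toy_mean = float(toy.mean(dim='time'))                        # NaN is skipped by default
toy_median = float(toy.quantile(0.5, dim='time'))             # NaN is skipped by default
toy_mean_strict = float(toy.mean(dim='time', skipna=False))   # NaN propagates to the result

print(toy_mean, toy_median, toy_mean_strict)
```

With `skipna=False` a single missing day would turn the whole statistic into NaN, which is rarely what you want for a long daily record.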

Compute the mean summer (May-Sep) temperature for the years 2011-2020¶

Use .sel to select the years 2011-2020

Use .where to filter out non-summer days

Use .coarsen to take the mean over the summer of each year

In [9]:

# select all data in the time range specified by slice
tx_2011_2020 = tx.sel(time=slice('2011-01-01', '2020-12-31'))

# filter data -- replace all values on dates outside May-Sep with NaN
tx_2011_2020_summer = tx_2011_2020.where(tx_2011_2020['time'].dt.month.isin([5, 6, 7, 8, 9]))

# coarsen data into 365-day chunks and average over each chunk.
# NaNs do not contribute to the mean, so the result is the May-Sep mean for each year.
# (Leap years shift the chunk boundaries by a few days, but each chunk still contains exactly one full May-Sep season.)
tx_2011_2020_summer_mean = tx_2011_2020_summer.coarsen(time=365, boundary='trim').mean()

In [10]:

plt.figure(figsize=(15,5))
tx_2011_2020_summer.plot()
tx_2011_2020_summer_mean.plot.line(marker='o');
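
The 365-day chunks only line up approximately with calendar years. As an alternative sketch (not the approach asked for in the exercise), the same filter can be combined with a groupby over calendar years. The data here are synthetic: each day's value is simply its month number, so the expected May-Sep mean can be worked out by hand as 1070/153.

```python
import numpy as np
import pandas as pd
import xarray as xr

# synthetic daily series for 2011-2020: the value equals the month number (hypothetical data)
time = pd.date_range('2011-01-01', '2020-12-31', freq='D')
toy = xr.DataArray(time.month.values.astype(float), coords={'time': time}, dims='time')

# keep May-Sep only (other days become NaN), then average within each calendar year
is_summer = toy['time'].dt.month.isin([5, 6, 7, 8, 9])
summer_mean = toy.where(is_summer).groupby('time.year').mean()

print(summer_mean.values)
```

The result is indexed by 'year' rather than by 'time', which is convenient for tables but means it cannot be overlaid directly on a plot with a datetime axis.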

Compute a smoothed seasonal cycle¶

Use .groupby to group data by day of year and average over all years -- this gives a "raw" seasonal cycle

Apply .rolling to the raw seasonal cycle to compute a 31-day running-mean smoothed seasonal cycle; you will need to use .pad with mode='wrap' to extend the raw seasonal cycle periodically, so the running mean window doesn't run out of data at the beginning and end

In [70]:

# daily seasonal cycle
tx_sc = tx.groupby(tx['time'].dt.dayofyear).mean()

# smooth seasonal cycle with a moving average over a window of specified width
window = 31
pad = window // 2
# pad periodically on both sides, apply a centred running mean, then drop the incomplete windows at the edges
tx_sc_smooth = tx_sc.pad(dayofyear=(pad, pad), mode='wrap').rolling(dayofyear=window, center=True).mean().dropna(dim='dayofyear')

# remove seasonal cycle from data
tx_nosc = tx.groupby('time.dayofyear') - tx_sc_smooth
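
To see what the periodic padding does, here is a minimal sketch on a synthetic seasonal cycle (a cosine on days 1..366; the names `raw`, `padded` and `smooth` are illustrative only). It checks that the smoothed cycle keeps all 366 days and that the first smoothed value really averages over the end of the year as well as the beginning.

```python
import numpy as np
import xarray as xr

# synthetic "raw" seasonal cycle on days 1..366 (hypothetical data)
doy = np.arange(1, 367)
raw = xr.DataArray(np.cos(2 * np.pi * doy / 366), coords={'dayofyear': doy}, dims='dayofyear')

window = 31
pad = window // 2

# wrap-pad 15 days on each side, smooth, then drop the 15 incomplete windows at each edge
padded = raw.pad(dayofyear=(pad, pad), mode='wrap')
smooth = padded.rolling(dayofyear=window, center=True).mean().dropna(dim='dayofyear')

# the value for day 1 averages days 352..366 together with days 1..16
expected_first = np.concatenate([raw.values[-pad:], raw.values[:pad + 1]]).mean()

print(padded.sizes['dayofyear'], smooth.sizes['dayofyear'])
```

Without the padding, the rolling mean would return NaN for the first and last 15 days of the year.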

In [12]:

plt.figure(figsize=(15,5))
tx_sc.plot()
tx_sc_smooth.plot()
plt.legend(['raw','smooth'])
plt.title('Smoothed seasonal cycle (degC)');
plt.ylabel('temperature (degC)');

In [13]:

plt.figure(figsize=(15,5))
tx_nosc.plot()
plt.title('Deseasonalised data (anomalies from annual cycle, degC)');
plt.ylabel('temperature (degC)');

Extract annual block maxima (i.e. the maximum temperature for each year in the dataset)¶

Use .groupby to group data by year and extract the maximum value for each year

In [14]:

tx_bm = tx.groupby(tx['time'].dt.year).max()

In [15]:

plt.figure(figsize=(15,5))
tx_bm.plot.line(marker='o');
plt.ylabel('temperature (degC)');
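
A quick way to convince yourself that the groupby does what you expect is to try it on data with a known answer. A minimal sketch on three years of synthetic data with one planted peak per year (all values are made up):

```python
import numpy as np
import pandas as pd
import xarray as xr

# three years of synthetic daily data, all zeros except one planted peak per year
time = pd.date_range('2001-01-01', '2003-12-31', freq='D')
values = np.zeros(time.size)
peaks = {'2001-07-01': 31.0, '2002-08-15': 35.0, '2003-06-20': 33.0}
for day, peak in peaks.items():
    values[time == pd.Timestamp(day)] = peak

toy = xr.DataArray(values, coords={'time': time}, dims='time')

# annual block maxima: one value per calendar year
bm = toy.groupby('time.year').max()

print(bm.dims, bm['year'].values, bm.values)
```

Note that the result no longer has a 'time' dimension; it is indexed by the integer coordinate 'year'.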

Compute the linear time trend in the annual maxima¶

A linear time trend is just a linear regression of the data onto time.

Compute the trend in the annual block maxima calculated above using stats.linregress

In [16]:

# note that the time axis of tx_bm is no longer called 'time', it's called 'year'
r = stats.linregress(tx_bm['year'], tx_bm)
trend = r.intercept + r.slope * tx_bm['year']
tx_bm_detr = tx_bm - trend + tx_bm.mean()
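
The detrending recipe above subtracts the fitted line and adds the sample mean back. A minimal sketch on synthetic data (a made-up trend of 0.03 degC per year plus seeded noise) showing two properties of the result: the detrended series keeps the original mean, and it has no remaining linear trend.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# synthetic annual maxima: linear trend plus noise (hypothetical values)
years = np.arange(1950, 2022)
toy = 25.0 + 0.03 * (years - 1950) + rng.normal(0.0, 1.0, years.size)

fit = stats.linregress(years, toy)
trend_line = fit.intercept + fit.slope * years
detrended = toy - trend_line + toy.mean()

# slope of the detrended series should vanish up to rounding error
residual_slope = stats.linregress(years, detrended).slope

print(fit.slope, residual_slope)
```

The mean is preserved because the residuals of a least-squares fit with an intercept always sum to zero.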

In [17]:

plt.figure(figsize=(15,5))
tx_bm.plot.line(marker='o')
tx_bm_detr.plot.line(marker='o')
trend.plot.line()
plt.legend(['full','detrended','trend line'])
plt.title('Removing linear trend in annual maxima');
plt.ylabel('temperature (degC)');

Estimating the probability density function (PDF)¶

The PDF is interpreted as "in a given year, the temperature will have a value contained in the bin $(\ell, \ell+d\ell)$ with probability $p \, d\ell$", where $p$ is the PDF.

The PDF for a given dataset can be estimated in two ways:

parametrically: assume the data follows a given distribution function, and fit the parameters
non-parametrically: compute a histogram

Exercise:

Apply both these approaches to the annual block maxima:

Use stats.genextreme.fit to fit a generalized extreme value (GEV) distribution to the block maxima, leaving out the extreme data point for 2021
Generate a "frozen" random variate object by calling stats.genextreme with the fitted parameters: rv = stats.genextreme(*params)
Compute the PDF using rv.pdf
Compare to histogram, computed using the xarray .plot.hist function

In [91]:

# fit a GEV distribution to the annual maxima for 1950-2020 (2021 is left out)
# note: the 'year' coordinate holds integers, so slice with integers
shape, location, scale = stats.genextreme.fit( tx_bm.sel(year=slice(1950, 2020)) )
print(shape, location, scale)

# produce "frozen" random variate for the fitted distribution
rv = stats.genextreme(shape, location, scale)

# set up a dummy tx axis called tx_p which spans from the 0.0001 quantile to the 0.9999 quantile of the fitted distribution
tx_min = rv.ppf(0.0001)
tx_max = rv.ppf(0.9999)
tx_p = np.linspace(tx_min, tx_max, 100)

# compute the PDF
p = rv.pdf(tx_p)

0.44344893039383315 30.16768201460574 1.8451298419501154

In [92]:

tx_bm.plot.hist(bins=10, density=True)
plt.plot(tx_p, p)
plt.ylabel('$p$')
plt.xlabel(r'$\ell$',fontsize=14);
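
Two sanity checks on the fitted PDF, as a sketch that reuses the parameter values printed by the fit above. First, summing $p \, d\ell$ over the range should give a total probability close to 1. Second, scipy's shape parameter has the opposite sign to the convention used in many textbooks, and for a positive scipy shape the fitted distribution has a finite upper bound at location + scale/shape.

```python
import numpy as np
import scipy.stats as stats

# parameters copied from the printed fit output above (scipy's c, loc, scale)
shape, location, scale = 0.44344893039383315, 30.16768201460574, 1.8451298419501154
rv = stats.genextreme(shape, location, scale)

# p(l) dl summed over the range should be close to 1 (trapezoidal rule)
l = np.linspace(rv.ppf(0.0001), rv.ppf(0.9999), 2000)
p = rv.pdf(l)
area = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(l))

# for scipy shape c > 0 the distribution is bounded above at loc + scale / c
lower, upper = rv.support()

print(area, upper)
```

The upper bound is a property of the fitted model, not of the climate: temperatures above it have probability zero under this fit, which is worth keeping in mind when looking at extreme years.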

Estimating the cumulative distribution function (CDF)¶

The CDF is just the integral of the PDF. It can be interpreted as "in a given year, the temperature will stay below level $\ell$ with probability $P(\ell)$", where $P$ is the CDF.

Alternatively, one can consider the function $1-P$ (sometimes called the survival function); the interpretation is then "in a given year, the temperature will exceed $\ell$ with probability $1-P(\ell)$".

Again, the CDF can be estimated parametrically (via a fit, as above), or non-parametrically directly from the data.

The simplest non-parametric estimator for the CDF is to rank the data points in ascending order and assume that the CDF increases by 1/N for each data point, where N is the total number of years in the sample. More accurate assumptions are described here.

Exercise:

Use the .sortby function to rank the annual block maxima in ascending order
Create a P array which increases from just above 0 to just below 1 in N even steps, as described in the link above
Plot the two against one another; this is the non-parametric CDF
Compute and plot the parametric CDF using rv.cdf with the "frozen" rv computed in the previous exercise

In [93]:

# Non-parametric CDF estimator
N = len(tx_bm) # number of data points
tx_np = tx_bm.sortby(tx_bm, ascending=True) 
a = 0.5 # this is the Hazen estimator
P_np = (np.linspace(1, N, N) - a)/ (N + 1 - 2*a)

# Parametric estimator
P_p = rv.cdf(tx_p)
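
The plotting-position formula used above covers a whole family of estimators through the parameter a. A minimal sketch (`plotting_positions` is a hypothetical helper name, not a library function): a = 0.5 gives the Hazen positions used above, while a = 0 gives the Weibull positions i/(N+1).

```python
import numpy as np

def plotting_positions(n, a=0.5):
    """Return P_i = (i - a) / (n + 1 - 2a) for ranks i = 1..n."""
    ranks = np.arange(1, n + 1)
    return (ranks - a) / (n + 1 - 2 * a)

P_hazen = plotting_positions(5, a=0.5)     # (i - 0.5) / n
P_weibull = plotting_positions(5, a=0.0)   # i / (n + 1)

print(P_hazen)
print(P_weibull)
```

Both choices keep every P strictly between 0 and 1. The naive estimate i/N would assign P = 1 to the largest data point, and hence an infinite return period.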

In [94]:

fig,axes = plt.subplots(1,2,figsize=(15,5))

ax=axes[0]
ax.plot(tx_np, P_np, 'k.')
ax.plot(tx_p, P_p)
ax.set_xlabel(r'$\ell$',fontsize=14)
ax.set_ylabel(r'$P(\ell)$',fontsize=14)
ax.set_title('CDF')

ax=axes[1]
ax.plot(tx_np, 1 - P_np, 'k.')
ax.plot(tx_p, 1 - P_p)
ax.set_xlabel(r'$\ell$',fontsize=14)
ax.set_ylabel(r'$1-P(\ell)$',fontsize=14)
ax.set_title('1 - CDF');

Return level plot¶

The return period $r$ is defined as the number of years you expect to wait in order to see the temperature exceeding a certain level $\ell$.

You can also think of the return period as the number of years over which you expect exactly one exceedance of $\ell$, assuming all years are independent and each has exceedance probability $1-P(\ell)$:

$r \ (1-P(\ell)) = 1$

so

$ r = \frac{1}{1-P(\ell)}$

A return level plot is just a modified form of the 1-CDF plot, with the temperatures plotted on the y axis (instead of the x axis) and with $r = 1/(1-P)$ on the x axis.
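
One way to see where the formula for $r$ comes from, assuming independent years that each exceed $\ell$ with the same probability $q = 1-P(\ell)$ (the symbols $q$ and $T$ are introduced here for illustration only): the waiting time $T$ until the first exceedance follows a geometric distribution, so

$\Pr(T=k) = (1-q)^{k-1} \, q, \qquad \mathbb{E}[T] = \sum_{k=1}^{\infty} k \, (1-q)^{k-1} \, q = \frac{1}{q} = \frac{1}{1-P(\ell)} = r$

Note that an exceedance within $r$ years is likely but not certain: for $r = 100$, the probability of at least one exceedance in 100 years is $1-(1-1/100)^{100} \approx 0.63$.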

Exercise:

Plot the parametric and non-parametric return level estimators. Use a logarithmic x axis.

In [95]:

# non-parametric return periods
r_np = 1 / (1 - P_np)

# parametric return periods
r_p = 1 / (1 - P_p)

In [96]:

plt.figure(figsize=(7.5,5))
plt.semilogx(r_np, tx_np, 'k.')
plt.semilogx(r_p, tx_p)
plt.ylabel(r'$\ell$',fontsize=14)
plt.xlabel('return period (years)');
plt.title('Return level plot');
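
The parametric curve can also be read the other way round: given a return period, what is the corresponding return level? A minimal sketch, reusing the parameter values printed by the fit earlier in the notebook, and using the inverse survival function `isf` of the frozen distribution (equivalent to `ppf(1 - 1/r)`):

```python
import numpy as np
import scipy.stats as stats

# parameters copied from the printed fit output earlier in the notebook (scipy's c, loc, scale)
rv = stats.genextreme(0.44344893039383315, 30.16768201460574, 1.8451298419501154)

r = np.array([2.0, 10.0, 100.0, 1000.0])   # return periods in years
levels = rv.isf(1 / r)                     # level exceeded with probability 1/r in a given year

for period, level in zip(r, levels):
    print(f'{period:6.0f} years: {level:5.2f} degC')
```

For very long return periods `isf(1/r)` is numerically preferable to `ppf(1 - 1/r)`, because it avoids subtracting a tiny number from 1.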

Bootstrap method to determine uncertainty in fit¶

Exercise: To get an estimate of the uncertainty in the estimates, do the following:

Generate 1000 synthetic data samples by resampling the block maxima with replacement (use the np.random.choice function)
Fit a GEV to each of the samples
For each fit, compute the return level for a given set of return periods
Compute the upper and lower percentiles for the return levels at each return period

In [24]:

def bootstrap_ci(original_sample):

    Nsamples = 1000 # no. of samples in bootstrap
    Nrp = 50        # no. of return periods 

    r = np.linspace(10, 1000, Nrp) # the return periods at which to compute confidence interval
    P = 1 - 1/r                     # the corresponding probabilities

    data = np.zeros((Nrp, Nsamples)) # array to hold the return levels from each bootstrap sample
    for i in range(Nsamples):
        sample = np.random.choice(original_sample, size=len(original_sample), replace=True) # resample with replacement
        rv = stats.genextreme( *stats.genextreme.fit(sample) ) # fit GEV to new sample 
        data[:,i] = rv.ppf(P) # get quantiles for given probs
    ci = np.percentile(data, (2.5, 97.5), axis=1) # 95% confidence interval
    return r, ci
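
Because the resampling is random, the confidence bounds change slightly from run to run. A minimal sketch of a repeatable variant, assuming a seeded `np.random.default_rng` generator instead of the global `np.random` state, and run on a synthetic sample drawn from a GEV with made-up parameters (the number of resamples is kept small so the sketch runs quickly):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)   # fixed seed so the resampling is repeatable

# synthetic "annual maxima" drawn from a known GEV (hypothetical parameters)
sample = stats.genextreme.rvs(0.2, loc=30.0, scale=2.0, size=70, random_state=rng)

r = np.array([10.0, 20.0, 50.0, 100.0, 200.0])   # return periods (years)
P = 1 - 1 / r                                    # corresponding non-exceedance probabilities

n_boot = 30   # kept small here; the exercise uses 1000
levels = np.empty((r.size, n_boot))
for i in range(n_boot):
    resample = rng.choice(sample, size=sample.size, replace=True)      # resample with replacement
    levels[:, i] = stats.genextreme(*stats.genextreme.fit(resample)).ppf(P)

ci = np.percentile(levels, (2.5, 97.5), axis=1)   # shape (2, number of return periods)
print(ci.shape)
```

With only 30 resamples the 2.5th and 97.5th percentiles are very noisy; the point of the sketch is the seeding, not the bounds themselves.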

In [25]:

# perform bootstrap 
r_boot, ci = bootstrap_ci(tx_bm[:-1])

In [26]:

plt.figure(figsize=(7.5,5))
plt.semilogx(r_np, tx_np, 'k.')
plt.semilogx(r_p, tx_p)
plt.semilogx(r_boot, ci[0], r_boot, ci[1])
plt.ylabel(r'$\ell$',fontsize=14)
plt.xlabel('return period (years)');
plt.title('Return level plot with 95% uncertainty bounds');