EDIPI Training Event 1&2, November 2021

Introductory exercises: data analysis with xarray and scipy


The first cell in a notebook should always contain all the necessary imports

In [1]:
import numpy as np
import xarray as xr
import scipy.stats as stats
from matplotlib import pyplot as plt

Load the dataset and take a look at the contents

The dataset contains a single variable called 'tmax' with time coordinate 'time'.

The 'tmax' variable is ERA5 daily-maximum surface temperature from the Pacific Northwest area of the USA for the period 1950-2021.

In [2]:
ds = xr.open_dataset('itmax_era5_index.nc')
ds
<xarray.Dataset>
Dimensions:  (time: 26298)
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1950-01-02 ... 2021-12-31
Data variables:
    tmax     (time) float32 -9.767 -13.07 -15.17 -12.89 ... nan nan nan nan
Attributes: (12/21)
    title:                      spatial statistic of "ERA5 reanalysis, https:...
    description:                tmax era5 index
    scripturl01:                https://climexp.knmi.nl/get_index.cgi?email=$...
    minimal_valid_fraction:      30.00
    file:                       ERA5/era5_tmax_daily_na_extended.nc
    cdi:                        Climate Data Interface version 1.9.10 (https:...
    ...                         ...
    cdo:                        Climate Data Operators version 1.9.10 (https:...
    ave_region:                 lon= -123.125 -118.875, lat=   44.875   52.125
    scripturl02:                https://climexp.knmi.nl/dat2nc.cgi?id=$id&sta...
    history:                     2021-11-03 18:34:46 bin/dat2nc data/itmax_er...
    Conventions:                CF-1.0

Extract the tmax data array and plot it

Clearly, the summer of 2021 was an extreme event; you can read about it here

In [3]:
tx = ds['tmax'] # tx now contains the tmax data
In [4]:
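# the plot output is not preserved here; presumably something like the following,
# in which the extreme summer of 2021 stands out at the end of the record
tx.plot()
plt.ylabel('temperature (degC)');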

The properties of time

The 'time' coordinate is a datetime64 object, which provides lots of powerful functionality in xarray; see here.

In particular, the .dt accessor gives access to a lot of information: "year", "month", "day", "hour", "minute", "second", "dayofyear", "week", "dayofweek", "weekday", "quarter" and "season".

In [5]:
dates = ds['time'].sel(time = ['1950-01-01', '1966-07-14', '2001-09-11']) # can select specific dates as strings
dates
<xarray.DataArray 'time' (time: 3)>
array(['1950-01-01T00:00:00.000000000', '1966-07-14T00:00:00.000000000',
       '2001-09-11T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1966-07-14 2001-09-11
Attributes:
    standard_name:  time
    long_name:      time
    axis:           T
In [6]:
dates.dt.dayofyear # the day of year at those dates
<xarray.DataArray 'dayofyear' (time: 3)>
array([  1, 195, 254])
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1966-07-14 2001-09-11
In [7]:
dates.dt.season # the season at those dates
<xarray.DataArray 'season' (time: 3)>
array(['DJF', 'JJA', 'SON'], dtype='<U3')
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1966-07-14 2001-09-11


Compute the mean, median and 99th percentile temperature

Use the xarray .mean and .quantile functions

In [8]:
print('mean: %4.1f' % tx.mean(dim='time'))
print('median: %4.1f' % tx.quantile(0.5, dim='time'))
print('99th percentile: %4.1f' % tx.quantile(0.99, dim='time') )
mean: 11.1
median: 10.2
99th percentile: 29.9

Compute the mean summer (May-Sep) temperature for the years 2011-2020

Use .sel to select the years 2011-2020

Use .where to filter out non-summer days

Use .coarsen to take means over the summer of each year

In [9]:
# select all data in the time range specified by slice
tx_2011_2020 = tx.sel(time = slice('2011-01-01','2020-12-31')) 

# filter data -- replace all data with dates outside May-Sep with nan
tx_2011_2020_summer = tx_2011_2020.where(tx_2011_2020['time'].dt.month.isin([5,6,7,8,9]))

# coarsen data into 365-day chunks and average over each chunk.
# NaNs do not contribute to the mean, so the result is the May-Sep mean for each year
# (leap days make the 365-day chunks drift slightly, which is acceptable over a decade)
tx_2011_2020_summer_mean = tx_2011_2020_summer.coarsen(time=365, boundary='trim').mean()
In [10]:
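# the original output is not preserved here; presumably the result was inspected or plotted, e.g.:
tx_2011_2020_summer_mean.plot(marker='o')
plt.ylabel('temperature (degC)');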

Compute a smoothed seasonal cycle

Use .groupby to group data by day of year and average over all years -- this gives a "raw" seasonal cycle

Apply .rolling to the raw seasonal cycle to compute a 31-day running-mean smoothed seasonal cycle; you will need to use .pad with mode='wrap' to extend the raw seasonal cycle periodically, so the running mean window doesn't run out of data at the beginning and end

In [70]:
# daily seasonal cycle
tx_sc = tx.groupby(tx['time'].dt.dayofyear).mean()

# smooth seasonal cycle with moving average over window of specified width
window = 31
pad = window // 2
tx_sc_smooth = (tx_sc.pad(dayofyear=(pad, pad), mode='wrap')
                     .rolling(dayofyear=window, center=True).mean()
                     .dropna(dim='dayofyear'))

# remove seasonal cycle from data
tx_nosc = tx.groupby('time.dayofyear') - tx_sc_smooth
In [12]:
tx_sc_smooth.plot() # plot call presumably dropped when the notebook was exported
plt.title('Smoothed seasonal cycle (degC)');
plt.ylabel('temperature (degC)');
In [13]:
tx_nosc.plot() # plot call presumably dropped when the notebook was exported
plt.title('Deseasonalised data (anomalies from annual cycle, degC)');
plt.ylabel('temperature (degC)');

Extract annual block maxima (i.e. the maximum temperature for each year in the dataset)

Use .groupby to group data by year and extract the maximum value for each year

In [14]:
tx_bm = tx.groupby(tx['time'].dt.year).max()
In [15]:
tx_bm.plot() # plot call presumably dropped when the notebook was exported
plt.ylabel('temperature (degC)');

Compute the linear time trend in the annual maxima

A linear time trend is just a linear regression of the data onto time.

Compute the trend in the annual block maxima calculated above using stats.linregress

In [16]:
# note that the time axis of tx_bm is no longer called 'time', it's called 'year'
r = stats.linregress(tx_bm['year'], tx_bm)
trend = r.intercept + r.slope * tx_bm['year']
tx_bm_detr = tx_bm - trend + tx_bm.mean()
In [17]:
tx_bm.plot() # plot calls presumably dropped when the notebook was exported
tx_bm_detr.plot()
plt.plot(tx_bm['year'], trend)
plt.legend(['full','detrended','trend line'])
plt.title('Removing linear trend in annual maxima');
plt.ylabel('temperature (degC)');

Estimating the probability density function (PDF)

The PDF is interpreted as "in a given year, the temperature will have a value $\ell$ contained in a bin ($\ell$, $\ell+d\ell$) with probability $p \, d\ell$", where $p$ is the PDF.

The PDF for a given dataset can be estimated in two ways:

  1. parametrically: assume the data follows a given distribution function, and fit the parameters (see the sketch after this list)
  2. non-parametrically: compute a histogram
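
As a minimal illustration of the parametric approach (using a normal distribution and synthetic data rather than the exercise dataset), the scipy pattern is always the same: fit the parameters, freeze the distribution, evaluate:

sample = np.random.normal(loc=10.0, scale=2.0, size=500) # synthetic data, for illustration only
params = stats.norm.fit(sample)  # fit -> (loc, scale)
rv_norm = stats.norm(*params)    # "frozen" distribution with the fitted parameters
x = np.linspace(sample.min(), sample.max(), 100)
p_norm = rv_norm.pdf(x)          # PDF evaluated on a grid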


Apply both these approaches to the annual block maxima:

  1. Use stats.genextreme.fit to fit a generalized extreme value (GEV) distribution to the block maxima, leaving out the extreme data point for 2021
  2. Generate a "frozen" random variate object by calling stats.genextreme with the fitted parameters: rv = stats.genextreme(*params)
  3. Compute the PDF using rv.pdf
  4. Compare to histogram, computed using the xarray .plot.hist function
In [91]:
# fit a GEV distribution
shape, location, scale = stats.genextreme.fit( tx_bm.sel(year=slice(1950, 2020)) ) # 'year' coordinate is integer, so slice with ints
print(shape, location, scale)

# produce "frozen" random variate for the fitted distribution
rv = stats.genextreme(shape, location, scale)

# set up a dummy tx axis called tx_p which spans from the 0.0001 quantile to the 0.9999 quantile of the fitted distribution
tx_min = rv.ppf(0.0001)
tx_max = rv.ppf(0.9999)
tx_p = np.linspace(tx_min, tx_max, 100)

# compute the PDF
p = rv.pdf(tx_p)
0.44344893039383315 30.16768201460574 1.8451298419501154
In [92]:
tx_bm.plot.hist(bins=10, density=True)
plt.plot(tx_p, p)

Estimating the cumulative distribution function (CDF)

The CDF is just the integral of the PDF. It can be interpreted as "in a given year, the temperature will stay below level $\ell$ with probability $P(\ell)$", where $P$ is the CDF.

Alternatively, one can consider the function $1-P$ (sometimes called the survival function); the interpretation is then "in a given year, the temperature will exceed $\ell$ with probability $1-P(\ell)$".
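
Incidentally, scipy's frozen distributions expose the survival function directly: rv.sf(x) is equivalent to 1 - rv.cdf(x), and is typically more accurate for very small exceedance probabilities. For example:

p35 = 1 - rv.cdf(35.0) # probability that the annual maximum exceeds 35 degC under the fit
p35_sf = rv.sf(35.0)   # same quantity via the survival function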

Again, the CDF can be estimated parametrically (via a fit, as above), or non-parametrically directly from the data.

The simplest non-parametric estimator for the CDF is to rank the data points in ascending order and assume that the CDF increases by 1/N for each data point, where N is the total number of years in the sample. More accurate assumptions are described here.


  1. Use the .sortby function to rank the annual block maxima in ascending order
  2. Create a P array which increases from just above 0 to just below 1 in N even steps, as described in the link above
  3. Plot the two against one another; this is the non-parametric CDF
  4. Compute and plot the parametric CDF using rv.cdf with the "frozen" rv computed in the previous exercise
In [93]:
# Non-parametric CDF estimator
N = len(tx_bm) # number of data points
tx_np = tx_bm.sortby(tx_bm, ascending=True) 
a = 0.5 # this is the Hazen estimator
P_np = (np.arange(1, N + 1) - a) / (N + 1 - 2*a)

# Parametric estimator
P_p = rv.cdf(tx_p)
In [94]:
fig, axes = plt.subplots(1, 2, figsize=(15,5))

axes[0].plot(tx_np, P_np, 'k.')
axes[0].plot(tx_p, P_p)
axes[0].set_title('CDF');

axes[1].plot(tx_np, 1 - P_np, 'k.')
axes[1].plot(tx_p, 1 - P_p)
axes[1].set_title('1 - CDF');

Return level plot

The return period $r$ is defined as the number of years you expect to wait in order to see the temperature exceed a certain level $\ell$.

You can also think of the return period as the number of years you need to wait in order for the expected number of exceedances of $\ell$ to equal 1, assuming all years are independent:

$r \, (1-P(\ell)) = 1$

so that

$r = \frac{1}{1-P(\ell)}$

A return level plot is just a modified form of the 1-CDF plot, with the temperatures plotted on the y axis (instead of the x axis) and with $r = 1/(1-P)$ on the x axis.
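
Equivalently, a parametric return level can be read off for any return period by inverting the fitted CDF with rv.ppf; this is the relation used in the bootstrap exercise below:

r100 = 100.0                   # return period in years
level_100 = rv.ppf(1 - 1/r100) # level exceeded on average once every 100 years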


Plot the parametric and non-parametric return level estimators. Use a logarithmic x axis.

In [95]:
# non-parametric return periods
r_np = 1 / (1 - P_np)

# parametric return periods
r_p = 1 / (1 - P_p)
In [96]:
plt.semilogx(r_np, tx_np, 'k.')
plt.semilogx(r_p, tx_p)
plt.xlabel('return period (years)');
plt.title('Return level plot');

Bootstrap method to determine uncertainty in fit

Exercise: To estimate the uncertainty in the fitted return levels, do the following:

  1. Generate 1000 synthetic data samples by resampling the block maxima with replacement (use the np.random.choice function)
  2. Fit a GEV to each of the samples
  3. For each fit, compute the return level for a given set of return periods
  4. Compute the upper and lower percentiles for the return levels at each return period
In [24]:
def bootstrap_ci(original_sample):
    # resample-with-replacement bootstrap for GEV return-level confidence intervals

    Nsamples = 1000 # no. of samples in bootstrap
    Nrp = 50        # no. of return periods 

    r = np.linspace(10, 1000, Nrp) # the return periods at which to compute confidence interval
    P = 1 - 1/r                     # the corresponding probabilities

    data = np.zeros((Nrp, Nsamples)) # array to hold bootstrap samples
    for i in range(Nsamples):
        sample = np.random.choice(original_sample, size=len(original_sample), replace=True) # resample with replacement
        rv = stats.genextreme( *stats.genextreme.fit(sample) ) # fit GEV to new sample 
        data[:,i] = rv.ppf(P) # get quantiles for given probs
    ci = np.percentile(data, (2.5, 97.5), axis=1) # 95% confidence interval
    return r, ci
In [25]:
# perform bootstrap (excluding the extreme 2021 value, as in the fit above)
r_boot, ci = bootstrap_ci(tx_bm[:-1])
In [26]:
plt.semilogx(r_np, tx_np, 'k.')
plt.semilogx(r_p, tx_p)
plt.semilogx(r_boot, ci[0], r_boot, ci[1])
plt.xlabel('return period (years)');
plt.title('Return level plot with 95% uncertainty bounds');