Notebook

Learning Goals¶

Calculate the power spectrum for data that started on a non-uniform grid
Contrast the Boxcar and Hanning wind approaches. What are the advantages/disadvantages of these two windows? How do they change the spectra?
Apply a Hanning window of different window lengths. How does that affect the statistical significance and temporal precision?
Appling a Hanning window with Welch's Method (Windows of Overlapping Segment Analysis, WOSA). How does WOSA change the results?

Underlying science and data¶

This notebook has code that computes the power spectrum of the temperature record from the Dome-C ice core (Antarctica) taken as (123E,75S) (Jouzel et al. 2007, https://science.sciencemag.org/content/317/5839/793). The data goes back 800,000 years before present and are unevenly spaced in time because the further down you go, the more compressed ice gets (older ice samples a longer segment of time). You can obtain the data online here from the NOAA Paleoclimatology Program and World Data Center for Paleoclimatology: ftp://ftp.ncdc.noaa.gov/pub/data/paleo/icecore/antarctica/epica_domec/edc3deuttemp2007.txt More information on the data is available at: https://www.ncdc.noaa.gov/paleo-search/study/6080

Power spectral analysis of this dataset will illuminate the periodic behavior of ice ages due to Milankovitch cycles. Milankovitch cycles are periodic variations in climate due to the Earth's gradual orbital variations in eccentricity, tilt, and precession. If Milankovitch cyles are a new phenomena for you, check out this website after class: https://climate.nasa.gov/news/2948/milankovitch-orbital-cycles-and-their-role-in-earths-climate/

How to complete this notebook¶

Power spectrum analysis via fast Fourier transform (FFT) requires evenly spaced data. Regrid the data to an evenly spaced dataset. Plot the before and after data to make sure things worked and make sense.
What is the autocorrelation and e-folding time of your data? Calculate the power spectra using the Numpy method (which uses a boxcar window by default. Graph the power spectrum, a red noise fit to the data, and the 99% confidence interval. What statistically significant spectral peaks do you find and how much power do they have? What do they represent?

Insert answer here.

Compare the Boxcar and Hanning window: Calculate and plot the power spectra using the scipy periodogram method which allows quick comparison between different window types. Compare the power spectrum results using the two different windows. What are the stiatistically significant peaks? What are the differences that occur from using the two different windows? Why?

Insert answer here.

Compare Hanning windows of different lengths and use multiple chunks of the dataset to increase spectral confidence. Calculate and plot the power spectra using the scipy method but with different window lengths. Try a couple different window lengths to see how the results change. How does window length change the statistically signficant peaks that you obtain? How does decreasing the window length affect the temporal precision of the spectral peaks and the statistical significance? Do you find the tradeoff between (1) high spectral/temporal resolution but low quality statistics AND (2) high quality statistics but low spectral/temporal resolution?

Insert answer here.

Apply WOSA: Apply Welch's method with a Hanning window with different window lengths. Use the same window lengths that you tried in 4. How does using overlaps with Welch's method change your results?

Insert answer here.

Import modules

In [1]:

import numpy as np   #for calculations
import matplotlib.pyplot as plt  #for plotting
import pandas as pd  #for reading in data
import datetime    #for manipulation by dtae
import collections  #for additional dictionary functionality
from matplotlib.gridspec import GridSpec #for subplots
import scipy.signal as signal #for signal processing
import scipy.stats as stats

A couple helpful functions

In [2]:

def create_normalized_redfit(data_length,Te):
    freq = np.arange(0,(data_length/2)+1,1)/float(data_length) # to Nyquist
    red_fit = (2 * Te)/(1 + ((2*np.pi*freq)**2)*(Te**2)) # After Hartmann 6.64, 6.91
    return red_fit/np.sum(red_fit)

def create_f_bounds(alpha,dof,red_fit_n):
    f_ratio = stats.f.ppf(alpha,dof,200) # Note: 200 = large degree of freedom for red noise
    return f_ratio*red_fit_n

Read in the ice core data
Data are from ftp://ftp.ncdc.noaa.gov/pub/data/paleo/icecore/antarctica/epica_domec/edc3deuttemp2007.txt

In [3]:

filename = 'edc3deuttemp2007_nohead.txt'
data_all=pd.read_csv(filename,sep='\s+',names = ['bag', 'ztop','age','Deuterium','TempAnom'])
data_all.head()

Out[3]:

	bag	ztop	age	Deuterium	TempAnom
0	13	6.60	38.37379	-390.9	0.88
1	14	7.15	46.81203	-385.1	1.84
2	15	7.70	55.05624	-377.8	3.04
3	16	8.25	64.41511	-394.1	0.35
4	17	8.80	73.15077	-398.7	-0.42

We're using the temperature data (which is derived from different paleoproxies - how that's done is a different class! Go talk to the geologists in the building next door...)
Read in the temperature data from the ice core:

In [4]:

data_raw=data_all['TempAnom']
time_raw=data_all['age']
var_str = "Temp"
data_units = "C"

Plot raw temperature data

In [5]:

fig = plt.figure(figsize=(10,4))
plt.plot(time_raw.values,data_raw, lw=.75)
plt.ylabel(data_units,fontsize=9)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.xlabel('Age (years)',fontsize=14)
plt.ylabel('Temperature Anomaly (K)',fontsize=14)
plt.title("Dome-C Ice Core Antarctica",fontsize=14)
plt.grid()
#plt.savefig('XX.pdf', dpi=300, format='png')

PROBLEM FOR FFT ANALYSIS: Raw data are not evenly spaced in time.
Need to fix that as FFT requires data on evenly spaced grid

In [6]:

## Print statements and calculate 1st derivative to show raw data are not evenly spaced
#print(time.values)
dt=np.diff(time_raw.values)
#print(dt)
#print(round(np.mean(dt),0))
print(min(time_raw.values),0)
print(max(time_raw.values),0)

### Define the grid
time=np.linspace(min(time_raw.values),max(time_raw.values),800)
#time=np.linspace(min(time_raw.values),max(time_raw.values),max(time_raw.values)-min(time_raw.values)+1)
#print(time)
dt=np.diff(time)
print(dt[0])
dt_regrid=dt[0]

## Interpolate raw data to an evenly spaced grid
data=np.interp(time,time_raw,data_raw)
print('regridded spacing in years',dt[1])

38.37379 0
801662.0 0
1003.2836373091365
regridded spacing in years 1003.2836373091366

Plot the evenly spaced data:

In [7]:

fig = plt.figure(figsize=(10,4))
plt.plot(time,data, lw=.75)
plt.ylabel(data_units,fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title("Vostok Ice Core Antarctica - regridded on evenly spaced grid",fontsize=14)
plt.grid()
plt.ylabel('Temperature anomaly (K)')
plt.xlabel('Age (years)')
#plt.savefig('XX.pdf', dpi=300, format='png')
print(len(data))
print(len(time))

800
800

In [8]:

## calculate the anomaly note: need to do this to get equivalent results with numpy and scipy method (not sure why?)
data=data-data.mean()

Calculate the power spectrum of red noise with lag1_r to use for significance testing

In [9]:

 alpha = 0.99 ## set statistical significance level

### step 1: calculate lag-1 autocorrelation (lag1_r, rho) and the associated p value (lag1_p)
lag1_r,lag1_p = stats.pearsonr(data[0:len(data)-1],data[1:len(data)])
### step 2: Calculate e-folding time for a red-noise process with this lag-1 autocorrelation
Te = -1./np.log(lag1_r) # After Hartman 6.62 with delta t = 1
print(var_str,'lag-1 autocorrelation =',round(lag1_r,2),'and Te =',round(Te,0))

## calculate the power spectrum of red noise with lag1_r to use for significance testing
red_fit_n = create_normalized_redfit(len(data),Te)
dof_entirewindow=2 ### note dof=2 because using whole record for FFT with no chunking
f_bounds = create_f_bounds(alpha,dof_entirewindow,red_fit_n)  ## using f-test for variance, see function

Temp lag-1 autocorrelation = 0.96 and Te = 25.0

Calculate the power spectrum. First with the numpy method:

In [10]:

## Method #1: Numpy Method - Calculate Power Spectrum
freq = np.fft.rfftfreq(len(data),d=1.)  ### 
#print(freq)
A = np.fft.rfft(data) ### r = real
ps = (np.abs(A)**2)/2.
ps_n = (np.abs(A)**2)/np.sum(np.abs(A)**2)  ## normalizing - calculate percent of the total  

Plot the results -- replicate a figure from a paper in Science that has 6000+ citations (Petit et al. 1999)

In [11]:

fig = plt.figure(figsize=(15,6))
method_str = "Boxcar Window on Full Dataset"
plt.plot(freq,ps_n, color='C0', lw=1.3, label='Boxcar')
plt.plot(freq,red_fit_n, color='firebrick', lw=1, alpha=0.9, label='Red Noise Fit') # freq plots 0...0.5
plt.plot(freq,f_bounds, color='firebrick', lw=1, ls='--', alpha=0.9, label=str(int(alpha*100))+'% CI')
plt.ylim(0.0,0.3)
plt.xlim(0.0,0.06)
plt.xlabel("Frequency (approximately millennia$^{-1}$)",fontsize=14)
plt.ylabel("Normalized Power",fontsize=14)
plt.title("Normalized Power Spectrum of "+var_str+"\n"+method_str,fontsize=14)
plt.legend(loc='upper right',fontsize=14)
plt.grid()
fig.tight_layout()

Find the significant spectral peaks:

In [12]:

for i in range(len(ps_n)):
    if (ps_n[i] - f_bounds[i]) > 5e-03:
        print('##### FOUND IT - spectral peak exceeds red noise ####')
        print('exceeds by...',ps_n[i]-f_bounds[i])
        print('at frequency....',freq[i])
        print('which in years is approximately...',round((dt_regrid/freq[i]))) ## use the regridded regular spacing    
        print()

##### FOUND IT - spectral peak exceeds red noise ####
exceeds by... 0.09410631001641392
at frequency.... 0.01
which in years is approximately... 100328.0

##### FOUND IT - spectral peak exceeds red noise ####
exceeds by... 0.10688859416323282
at frequency.... 0.025
which in years is approximately... 40131.0

##### FOUND IT - spectral peak exceeds red noise ####
exceeds by... 0.018915070428135672
at frequency.... 0.0425
which in years is approximately... 23607.0

##### FOUND IT - spectral peak exceeds red noise ####
exceeds by... 0.0064147302797330896
at frequency.... 0.043750000000000004
which in years is approximately... 22932.0

Scipy Method: Compare Boxcar (sqaure) Window with a Hanning (tapered) Window.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.periodogram.html#scipy.signal.periodogram Which would you prefer??? Boxcar or Hanning??

In [13]:

## Use Boxcar Window
f,Pxx = signal.periodogram(data,fs=1.,window='boxcar',nfft=None,return_onesided=True,scaling='spectrum')
Pxx_n = Pxx/np.sum(Pxx)

## Use Hanning Window
f_h,Pxx_h = signal.periodogram(data,fs=1.,window='hanning',nfft=None,return_onesided=True,scaling='spectrum')
Pxx_h_n = Pxx_h/np.sum(Pxx_h)

fig = plt.figure(figsize=(15,6))
plt.plot(f,Pxx_n, color='C0', lw=1.3, label='Boxcar')
plt.plot(f_h,Pxx_h_n, color='C1', lw=1.3, label='Hanning')
plt.plot(freq,red_fit_n, color='firebrick', lw=1, alpha=0.9, label='Red Noise Fit')
plt.plot(freq,f_bounds, color='firebrick', lw=1, ls='--', alpha=0.9, label=str(int(alpha*100))+'% CI')
plt.ylim(0.0,0.3)
plt.xlim(0.0,0.06)
plt.xlabel("Frequency (approximately millennia$^{-1}$)",fontsize=14)
plt.ylabel("Normalized Power",fontsize=11)
plt.title("Normalized Power Spectrum of "+var_str,fontsize=14)
plt.legend(loc='upper right',fontsize=14)
plt.grid()

Use the Hanning Window AND change the length of the data window.

Tinker, try different window lengths, you should see the peaks broaden as window length shortens. Taking a small window decreases the sample, erodes confidence in the location of the peaks. But the smaller the window the more DOF and the higher the statistical significance.
Take home: window length is a tradeoff between accuracy and significance. Also beware of what is happening at the lowest frequencies (longest period waves) when you reduce the window_length... remember that your dataset is only 800,000 years long :)

In [14]:

window_length=int(len(data)/4)    #CHANGE WINDOW LENGTH HERE; divide by bigger/smaller numbers
print(f'window length: {window_length}')

f_h_wl,Pxx_h_wl = signal.periodogram(data,fs=1.,window='hanning',nfft=window_length,return_onesided=True,scaling='spectrum')
Pxx_h_wl_n = Pxx_h_wl/np.sum(Pxx_h_wl)

### Need to update the frequencies to plot, the DOF, the red noise fit and 99% confidence interval
### Note: Check that I am calculating the significance level appropriately for the window.
### Vineel/Elizabeth to double check!!
#EM: looks good to me.

freq_wl = np.arange(0.,(window_length/2)+1.)/window_length
#print(freq_wl)  #EM: added these three lines to make sure that the above line is doing what I think it is- and it is
#freq_test=np.fft.rfftfreq(window_length,d=1.)  #EM: these can be removed later
#print(freq_test)

red_fit_n_wl = create_normalized_redfit(window_length,Te)  ## generate red_fit_n using window_length
print(np.size(red_fit_n_wl))
dof_wl=1.2*len(data)/(window_length/2)  ### Barnes Eq. 26
print('DOF with window =',dof_wl)
print('alpha, statistical significance level',alpha)
f_bounds_wl = create_f_bounds(alpha,dof_wl,red_fit_n_wl)  ## f-test for variance; 

### plot results with data window of length window_length and compare to using the entire window
fig = plt.figure(figsize=(14,8))
gs=GridSpec(2,1)
plt.subplot(gs[0,0])
plt.plot(f_h_wl,Pxx_h_wl_n, color='C0', lw=1.3, label='Hanning,wl='+str(window_length))
plt.plot(freq_wl,red_fit_n_wl, color='firebrick', lw=1, alpha=0.9, label='Red Noise Fit')
plt.plot(freq_wl,f_bounds_wl, color='firebrick', lw=1, ls='--', alpha=0.9, label=str(int(alpha*100))+'% CI')
plt.ylim(0.0,0.5)
plt.xlim(0.0,0.06)
plt.xlabel("Frequency (approximately millennia$^{-1}$)",fontsize=14)
plt.ylabel("Normalized Power",fontsize=14)
plt.title("Normalized Power Spectrum: Hanning window using window length="+str(window_length)+",DOF="+str(dof_wl),fontsize=14)
plt.legend(loc='upper right',fontsize=14)
plt.grid()
fig.tight_layout()

plt.subplot(gs[1,0])
plt.plot(f_h,Pxx_h_n, color='C0', lw=1.3, label='Hanning')
plt.plot(freq,red_fit_n, color='firebrick', lw=1, alpha=0.9, label='Red Noise Fit')
plt.plot(freq,f_bounds, color='firebrick', lw=1, ls='--', alpha=0.9, label=str(int(alpha*100))+'% CI')
plt.ylim(0.0,0.5)
plt.xlim(0.0,0.06)
plt.xlabel("Frequency (approximately millennia$^{-1}$)",fontsize=14)
plt.ylabel("Normalized Power",fontsize=14)
plt.title("Normalized Power Spectrum: Hanning window with full data as window length="+str(len(data))+",DOF=2",fontsize=14)
plt.legend(loc='upper right',fontsize=14)
plt.grid()
fig.tight_layout()

### 

window length: 200
101
DOF with window = 8.0
alpha, statistical significance level 0.99

Take-home message for 100,000ish year peak. If you reduce your window length too much -- your data chunks are too short to see the low frequency oscillations!

Use the Hanning Window AND change the length of the data window AND use WOSA (Welch’s Overlapping Segment Analysis)
How does adding WOSA change the answer??

In [15]:

window_length=int(len(data)/4)    ##Here's where you change the window length if tinkering
T2 = window_length/2                            
freq_w = np.arange(0.,T2+1.)/window_length
#print(len(freq))
P_welch = signal.welch(data,window='hanning',nperseg=window_length);   #Here's where Welch's method is changed
#print(np.size(P_welch))
P_welch_n = P_welch[1]/np.sum(P_welch[1])  #normalize

## Construct the expected rednoise spectrum - functions
## Need to update the 99% confidence interval - we have more DOF
red_fit_n_welch = create_normalized_redfit(window_length,Te)  ## generate red_fit_n using window_length
print(np.size(red_fit_n_welch))
dof_welch=1.2*len(data)/(window_length/2)  ### Barnes Eq. 26
print('DOF =',dof_welch)
f_bounds_welch = create_f_bounds(alpha,dof_welch,red_fit_n_welch)  ## f-test for variance; 

## plot
fig = plt.figure(figsize=(14,4))
plt.xlabel('Frequency')
plt.ylabel('Normalized Power')
plt.ylim(0.0,0.5)
plt.xlim(0.0,0.06)
plt.grid()
plt.plot(freq_w,P_welch_n, color='C0', lw=1.3, label='Hanning,wl='+str(window_length));
plt.plot(freq_w,red_fit_n_welch, color='firebrick', lw=1, alpha=0.9, label='Red Noise Fit');
plt.plot(freq_w,f_bounds_welch, color='firebrick', lw=1, ls='--', alpha=0.9, label=str(int(alpha*100))+'% CI');
plt.title("Normalized Power Spectrum: WOSA, Hanning window using window length="+str(window_length)+",DOF="+str(round(dof_welch,0)),fontsize=14);

101
DOF = 8.0

In [ ]: