You will go beyond summary statistics by learning about autocorrelation and partial autocorrelation plots. You will also learn how to automatically detect seasonality, trend and noise in your time series data. This is a summary of the lecture "Visualizing Time-Series data in Python", from DataCamp.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('fivethirtyeight')
In the field of time series analysis, autocorrelation refers to the correlation of a time series with a lagged version of itself. For example, an autocorrelation of order 3 returns the correlation between a time series and its own values lagged by 3 time points.
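As a minimal sketch of the "order 3" example above, the snippet below computes a lag-3 autocorrelation with pandas. The series is synthetic (a noisy sine wave with period 12, an assumption made purely for illustration), and `Series.autocorr` computes a Pearson correlation on the overlapping pairs, which can differ slightly from the estimator statsmodels uses.

```python
import numpy as np
import pandas as pd

# Synthetic series (illustrative assumption): sine wave with period 12 plus small noise
rng = np.random.default_rng(0)
t = np.arange(240)
series = pd.Series(np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.1, size=t.size))

# Built-in: correlation between the series and itself lagged by 3 time points
lag3_builtin = series.autocorr(lag=3)

# Manual equivalent: Pearson correlation between x[t] and x[t-3]
lag3_manual = series.corr(series.shift(3))

# Lag 12 spans one full period of the sine wave, so the correlation is strong
lag12 = series.autocorr(lag=12)

print(lag3_builtin, lag3_manual, lag12)
```

A lag of 3 is a quarter of the period, so the lag-3 value should sit near 0, while the lag-12 value should be close to 1.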
It is common to use the autocorrelation function (ACF) plot, also known as a correlogram, to visualize the autocorrelation of a time series. The plot_acf() function in the statsmodels library can be used to compute and plot the autocorrelation of a time series.
co2_levels = pd.read_csv('./dataset/ch2_co2_levels.csv')
co2_levels.set_index('datestamp', inplace=True)
# Backfill missing values (fillna(method='bfill') is deprecated in recent pandas)
co2_levels = co2_levels.bfill()
from statsmodels.graphics import tsaplots
# Display the autocorrelation plot of your time series
fig = tsaplots.plot_acf(co2_levels['co2'], lags=24);
If the autocorrelation at a given lag is close to 0, observations separated by that lag are not linearly correlated with one another. Conversely, autocorrelation values close to 1 or -1 indicate a strong positive or negative correlation, respectively, between observations separated by that lag.
To help you assess how trustworthy these autocorrelation values are, the plot_acf() function also draws confidence intervals (represented as blue shaded regions). If an autocorrelation value falls outside the confidence interval region, you can treat the observed autocorrelation as statistically significant.
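To make the idea of a significance band concrete, here is a rough sketch that compares sample autocorrelations against the approximate 95% bound of ±1.96/√N that holds for white noise. This is a simplification and not necessarily the interval that plot_acf() shades (which may widen with the lag); the helper `sample_acf` and both series are hypothetical and exist only for this illustration.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag, using the full-sample mean."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[k:] * centered[:len(x) - k]) / denom
        for k in range(max_lag + 1)
    ])

rng = np.random.default_rng(42)
noise = rng.normal(size=500)   # no real autocorrelation
walk = np.cumsum(noise)        # random walk: strongly autocorrelated

# Approximate 95% band for the autocorrelations of white noise
bound = 1.96 / np.sqrt(len(noise))

acf_noise = sample_acf(noise, 24)
acf_walk = sample_acf(walk, 24)

# Count lags 1..24 whose autocorrelation falls outside the band
significant_noise = int(np.sum(np.abs(acf_noise[1:]) > bound))
significant_walk = int(np.sum(np.abs(acf_walk[1:]) > bound))

print(bound, significant_noise, significant_walk)
```

For the white-noise series only a handful of lags, if any, should cross the band by chance, whereas the random walk should cross it at most or all lags.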
Like autocorrelation, the partial autocorrelation function (PACF) measures the correlation between a time series and lagged versions of itself. However, it goes further by removing the effect of the intermediate lags. For example, the partial autocorrelation of order 3 is the correlation between our time series ($t_1, t_2, t_3, \dots$) and its own values lagged by 3 time points ($t_4, t_5, t_6, \dots$), but only after removing all effects attributable to lags 1 and 2.
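One way to see what "removing the effects of lags 1 and 2" means is to estimate the partial autocorrelation by regression: regress each value on its previous k values and keep the coefficient on the k-th lag. The sketch below does this with NumPy on a simulated AR(1) series. The function name `pacf_by_regression` and the simulated data are assumptions for illustration, and plot_pacf() may use a different estimator by default, so its values can differ slightly.

```python
import numpy as np

def pacf_by_regression(x, lag):
    """Partial autocorrelation at `lag`: the coefficient on x[t-lag] when
    x[t] is regressed on x[t-1], ..., x[t-lag] plus an intercept."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    target = x[lag:]
    # Column j holds x[t-j] for j = 1..lag, aligned with target
    lagged = np.column_stack([x[lag - j:n - j] for j in range(1, lag + 1)])
    design = np.column_stack([np.ones(n - lag), lagged])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs[-1]

# Simulated AR(1) process: x[t] = phi * x[t-1] + noise
rng = np.random.default_rng(1)
phi = 0.8
eps = rng.normal(size=2000)
ar1 = np.zeros(2000)
for t in range(1, 2000):
    ar1[t] = phi * ar1[t - 1] + eps[t]

pacf1 = pacf_by_regression(ar1, 1)  # direct dependence on the previous value
pacf3 = pacf_by_regression(ar1, 3)  # what remains at lag 3 after lags 1 and 2

print(pacf1, pacf3)
```

In an AR(1) process each value depends directly only on its predecessor, so the lag-1 estimate should land near `phi` and the lag-3 estimate near 0, even though the plain autocorrelation at lag 3 is clearly positive.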
The plot_pacf() function in the statsmodels library can be used to compute and plot the partial autocorrelation of a time series.
# Display the partial autocorrelation plot of your time series
fig = tsaplots.plot_pacf(co2_levels['co2'], lags=24);
If the partial autocorrelation at a given lag is close to 0, observations and their lagged counterparts are not correlated once the intermediate lags are accounted for. Conversely, partial autocorrelation values close to 1 or -1 indicate a strong positive or negative correlation, respectively, between the time series and its lagged observations.
The plot_pacf() function also draws confidence intervals, which are represented as blue shaded regions. If a partial autocorrelation value falls outside this confidence interval region, you can treat it as statistically significant.
You can rely on a method known as time-series decomposition to automatically extract and quantify the structure of time-series data. The statsmodels library provides the seasonal_decompose() function to perform time series decomposition out of the box.
decomposition = sm.tsa.seasonal_decompose(time_series)
You can extract a specific component, for example seasonality, by accessing the seasonal attribute of the decomposition object.
co2_levels.index = pd.to_datetime(co2_levels.index)
import statsmodels.api as sm
# Perform time series decomposition
decomposition = sm.tsa.seasonal_decompose(co2_levels)
# Print the seasonality component
print(decomposition.seasonal)
datestamp
1958-03-29    1.028042
1958-04-05    1.235242
1958-04-12    1.412344
1958-04-19    1.701186
1958-04-26    1.950694
                ...
2001-12-01   -0.525044
2001-12-08   -0.392799
2001-12-15   -0.134838
2001-12-22    0.116056
2001-12-29    0.285354
Name: seasonal, Length: 2284, dtype: float64
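To give a feel for what an additive decomposition computes, the following is a simplified sketch of the classical approach: estimate the trend with a centered moving average, average the detrended values per calendar month to get the seasonal component, and treat what is left as the residual. It runs on synthetic monthly data (an assumption, not the CO2 dataset), and the details of the statsmodels implementation may differ.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (illustrative): linear trend + yearly cycle + noise
period = 12
idx = pd.date_range('2000-01-01', periods=10 * period, freq='MS')
t = np.arange(len(idx))
rng = np.random.default_rng(0)
observed = pd.Series(
    0.5 * t + 10 * np.sin(2 * np.pi * t / period) + rng.normal(size=len(t)),
    index=idx,
)

# 1. Trend: centered moving average spanning one full period.
#    For an even period the two end points get half weight.
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
half = period // 2
trend_values = np.full(len(observed), np.nan)
trend_values[half:-half] = np.convolve(observed.to_numpy(), weights, mode='valid')
trend = pd.Series(trend_values, index=idx)

# 2. Seasonal: mean detrended value per calendar month, centered around zero
detrended = observed - trend
month_means = detrended.groupby(detrended.index.month).mean()
month_means -= month_means.mean()
seasonal = pd.Series(month_means.loc[idx.month].to_numpy(), index=idx)

# 3. Residual: whatever the trend and seasonal components do not explain
resid = observed - trend - seasonal

print(trend.isna().sum(), seasonal.max(), resid.std())
```

By construction the three components add back up to the observed series wherever the trend is defined, and the moving average leaves the first and last half-period of the trend undefined.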
It is also possible to extract other inferred quantities from your time-series decomposition object. The following code shows you how to extract the observed, trend and noise (or residual, resid) components.
observed = decomposition.observed
trend = decomposition.trend
residuals = decomposition.resid
You can then use the extracted components and plot them individually.
# Extract the trend component
trend = decomposition.trend
# Plot the values of the trend
ax = trend.plot(figsize=(12, 6), fontsize=10);
# Specify axis labels
ax.set_xlabel('Date', fontsize=10);
ax.set_title('Trend component of the CO2 time-series', fontsize=10);
You will now review the contents of Chapter 1 by working with a new dataset that contains the monthly number of passengers who took a commercial flight between January 1949 and December 1960.
airline = pd.read_csv('./dataset/ch3_airline_passengers.csv', parse_dates=['Month'], index_col='Month')
airline.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 144 entries, 1949-01-01 to 1960-12-01
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   AirPassengers  144 non-null    int64
dtypes: int64(1)
memory usage: 2.2 KB
# Plot the time series in your dataframe
ax = airline.plot(color='blue', fontsize=12);
# Add a red vertical line at the date 1955-12-01
ax.axvline('1955-12-01', color='red', linestyle='--');
# Specify the labels in your plot
ax.set_xlabel('Date', fontsize=12);
ax.set_title('Number of Monthly Airline Passengers', fontsize=12);
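In Chapter 2 you learned how to count missing values, compute summary statistics, draw boxplots, and aggregate a time series by month. The following code applies each of these to the airline dataset.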
# Print out the number of missing values
print(airline.isnull().sum())
# Print out summary statistics of the airline DataFrame
print(airline.describe())
AirPassengers    0
dtype: int64
       AirPassengers
count     144.000000
mean      280.298611
std       119.966317
min       104.000000
25%       180.000000
50%       265.500000
75%       360.500000
max       622.000000
# Display boxplot of airline values
ax = airline.boxplot();
# Specify the title of your plot
ax.set_title('Boxplot of Monthly Airline\nPassengers Count', fontsize=20);
# Get the month of each date from the index of airline
index_month = airline.index.month
# Compute the mean number of passengers for each month of the year
mean_airline_by_month = airline.groupby(index_month).mean()
# Plot the mean number of passengers for each month of the year
mean_airline_by_month.plot();
plt.legend(fontsize=20);
In this exercise, you will apply time series decomposition to the airline dataset, and visualize the trend and seasonal components.
# Perform time series decomposition
decomposition = sm.tsa.seasonal_decompose(airline)
# Extract the trend and seasonal components
trend = decomposition.trend
seasonal = decomposition.seasonal
airline_decomposed = pd.concat([trend, seasonal], axis=1)
# Print the first 5 rows of airline_decomposed
print(airline_decomposed.head(5))
# Plot the values of the airline_decomposed DataFrame
ax = airline_decomposed.plot(figsize=(12, 6), fontsize=15);
# Specify axis labels
ax.set_xlabel('Date', fontsize=15);
plt.legend(fontsize=15);
plt.savefig('../images/trend_seasonal.png')
            trend   seasonal
Month
1949-01-01    NaN -24.748737
1949-02-01    NaN -36.188131
1949-03-01    NaN  -2.241162
1949-04-01    NaN  -8.036616
1949-05-01    NaN  -4.506313
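The NaN values in the trend column are expected: a trend estimated with a centered moving average is undefined near the edges, where a full window is not available. The short sketch below reproduces this edge effect with pandas on placeholder values over the same monthly date range (the values are an assumption, not the airline data).

```python
import numpy as np
import pandas as pd

# Placeholder values on the same monthly date range as the airline data
idx = pd.date_range('1949-01-01', periods=144, freq='MS')
values = pd.Series(np.arange(144, dtype=float), index=idx)

# Centered 12-month average, then a 2-term average to re-center the even-length window
ma12 = values.rolling(window=12, center=True).mean()
trend_like = ma12.rolling(window=2).mean().shift(-1)

n_missing = int(trend_like.isna().sum())
first_valid = trend_like.first_valid_index()

print(n_missing, first_valid)
```

With a 12-month window, the first six and last six months have no trend estimate, so the first defined value falls in July 1949. Calling `dropna()` on the decomposed DataFrame before printing is one way to skip these rows.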