This notebook explains how to create time series features with tsfresh. It uses the Beijing Multi-Site Air-Quality Data set, downloaded from the UCI Machine Learning Repository.
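If tsfresh is not already installed, it can be installed from PyPI (assuming a Jupyter environment):

%pip install tsfresh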
import pandas as pd
import tsfresh
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile
The zip file is downloaded from the UCI Machine Learning Repository with urllib and unzipped with zipfile. The archive contains one csv file per reporting station; read each csv and concatenate them into a single pandas dataframe.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00501/PRSA2017_Data_20130301-20170228.zip"
r = urlopen(url)
zf = ZipFile(BytesIO(r.read()))
dfs = []
for file in zf.infolist():
    if file.filename.endswith('.csv'):
        dfs.append(pd.read_csv(zf.open(file)))
# DataFrame.append was removed in pandas 2.0; concatenate the list instead
df = pd.concat(dfs)
df['timestamp'] = pd.to_datetime(df[["year", "month", "day", "hour"]])
df.drop(columns=['No'], inplace=True)
df.sort_values(by=['timestamp', 'station']).head(10)
| | year | month | day | hour | PM2.5 | PM10 | SO2 | NO2 | CO | O3 | TEMP | PRES | DEWP | RAIN | wd | WSPM | station | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 3 | 1 | 0 | 4.0 | 4.0 | 4.0 | 7.0 | 300.0 | 77.0 | -0.7 | 1023.0 | -18.8 | 0.0 | NNW | 4.4 | Aotizhongxin | 2013-03-01 |
0 | 2013 | 3 | 1 | 0 | 3.0 | 6.0 | 13.0 | 7.0 | 300.0 | 85.0 | -2.3 | 1020.8 | -19.7 | 0.0 | E | 0.5 | Changping | 2013-03-01 |
0 | 2013 | 3 | 1 | 0 | 4.0 | 4.0 | 3.0 | NaN | 200.0 | 82.0 | -2.3 | 1020.8 | -19.7 | 0.0 | E | 0.5 | Dingling | 2013-03-01 |
0 | 2013 | 3 | 1 | 0 | 9.0 | 9.0 | 3.0 | 17.0 | 300.0 | 89.0 | -0.5 | 1024.5 | -21.4 | 0.0 | NNW | 5.7 | Dongsi | 2013-03-01 |
0 | 2013 | 3 | 1 | 0 | 4.0 | 4.0 | 14.0 | 20.0 | 300.0 | 69.0 | -0.7 | 1023.0 | -18.8 | 0.0 | NNW | 4.4 | Guanyuan | 2013-03-01 |
0 | 2013 | 3 | 1 | 0 | 6.0 | 18.0 | 5.0 | NaN | 800.0 | 88.0 | 0.1 | 1021.1 | -18.6 | 0.0 | NW | 4.4 | Gucheng | 2013-03-01 |
0 | 2013 | 3 | 1 | 0 | 7.0 | 7.0 | 3.0 | 2.0 | 100.0 | 91.0 | -2.3 | 1020.3 | -20.7 | 0.0 | WNW | 3.1 | Huairou | 2013-03-01 |
0 | 2013 | 3 | 1 | 0 | 5.0 | 14.0 | 4.0 | 12.0 | 200.0 | 85.0 | -0.5 | 1024.5 | -21.4 | 0.0 | NNW | 5.7 | Nongzhanguan | 2013-03-01 |
0 | 2013 | 3 | 1 | 0 | 3.0 | 6.0 | 3.0 | 8.0 | 300.0 | 44.0 | -0.9 | 1025.8 | -20.5 | 0.0 | NW | 9.3 | Shunyi | 2013-03-01 |
0 | 2013 | 3 | 1 | 0 | 6.0 | 6.0 | 4.0 | 8.0 | 300.0 | 81.0 | -0.5 | 1024.5 | -21.4 | 0.0 | NNW | 5.7 | Tiantan | 2013-03-01 |
tsfresh doesn't handle missing values well, so check for missing values.
df.isnull().sum()
year             0
month            0
day              0
hour             0
PM2.5         8739
PM10          6449
SO2           9021
NO2          12116
CO           20701
O3           13277
TEMP           398
PRES           393
DEWP           403
RAIN           390
wd            1822
WSPM           318
station          0
timestamp        0
dtype: int64
As this is an hourly time series, replace missing values with the previous value (forward fill).
df.ffill(inplace=True)  # fillna(method='ffill') is deprecated in recent pandas
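Forward fill only propagates values that already exist, so it is worth re-checking the counts afterwards. A quick sanity check; note that on this concatenated frame ffill pulls from the previous row in file order, which can cross station boundaries when a station's series starts with a gap:

# Any NaN that ffill could not fill would show up here
# (only a gap in the very first rows of the frame can survive).
remaining = df.isnull().sum()
print(remaining[remaining > 0])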
tsfresh computes a large number of features by default. For the purposes of this tutorial, limit the data to one month for three stations.
df2014 = df[df.timestamp.between("2014-03-01", "2014-04-01")]
df2014_limited = df2014[df2014.station.isin(['Dongsi', 'Wanliu', 'Shunyi'])]
Remove categorical features, as tsfresh doesn't process them. They can be added back in before modeling if necessary.
ts_df = df2014_limited.drop(columns=['year', 'month', 'day', 'hour', 'wd'])
df_features = tsfresh.extract_features(ts_df, column_id='station', column_sort='timestamp')
df_features.columns
Feature Extraction: 100%|██████████| 17/17 [00:05<00:00, 3.36it/s]
Index(['CO__variance_larger_than_standard_deviation', 'CO__has_duplicate_max', 'CO__has_duplicate_min', 'CO__has_duplicate', 'CO__sum_values', 'CO__abs_energy', 'CO__mean_abs_change', 'CO__mean_change', 'CO__mean_second_derivative_central', 'CO__median', ... 'WSPM__permutation_entropy__dimension_5__tau_1', 'WSPM__permutation_entropy__dimension_6__tau_1', 'WSPM__permutation_entropy__dimension_7__tau_1', 'WSPM__query_similarity_count__query_None__threshold_0.0', 'WSPM__matrix_profile__feature_"min"__threshold_0.98', 'WSPM__matrix_profile__feature_"max"__threshold_0.98', 'WSPM__matrix_profile__feature_"mean"__threshold_0.98', 'WSPM__matrix_profile__feature_"median"__threshold_0.98', 'WSPM__matrix_profile__feature_"25"__threshold_0.98', 'WSPM__matrix_profile__feature_"75"__threshold_0.98'], dtype='object', length=8657)
tsfresh allows control over which features are created. It supports several settings objects to determine this list:

- tsfresh.feature_extraction.ComprehensiveFCParameters (the default) includes all features with common parameters,
- tsfresh.feature_extraction.MinimalFCParameters includes a small number of easily calculated features,
- tsfresh.feature_extraction.EfficientFCParameters drops features with a high computational cost from the comprehensive list.
df_features = tsfresh.extract_features(ts_df, column_id='station', column_sort='timestamp',
default_fc_parameters=tsfresh.feature_extraction.MinimalFCParameters())
df_features.columns
Feature Extraction: 100%|██████████| 17/17 [00:00<00:00, 805.99it/s]
Index(['PM2.5__sum_values', 'PM2.5__median', 'PM2.5__mean', 'PM2.5__length', 'PM2.5__standard_deviation', 'PM2.5__variance', 'PM2.5__root_mean_square', 'PM2.5__maximum', 'PM2.5__minimum', 'PM10__sum_values', 'PM10__median', 'PM10__mean', 'PM10__length', 'PM10__standard_deviation', 'PM10__variance', 'PM10__root_mean_square', 'PM10__maximum', 'PM10__minimum', 'DEWP__sum_values', 'DEWP__median', 'DEWP__mean', 'DEWP__length', 'DEWP__standard_deviation', 'DEWP__variance', 'DEWP__root_mean_square', 'DEWP__maximum', 'DEWP__minimum', 'RAIN__sum_values', 'RAIN__median', 'RAIN__mean', 'RAIN__length', 'RAIN__standard_deviation', 'RAIN__variance', 'RAIN__root_mean_square', 'RAIN__maximum', 'RAIN__minimum', 'SO2__sum_values', 'SO2__median', 'SO2__mean', 'SO2__length', 'SO2__standard_deviation', 'SO2__variance', 'SO2__root_mean_square', 'SO2__maximum', 'SO2__minimum', 'NO2__sum_values', 'NO2__median', 'NO2__mean', 'NO2__length', 'NO2__standard_deviation', 'NO2__variance', 'NO2__root_mean_square', 'NO2__maximum', 'NO2__minimum', 'CO__sum_values', 'CO__median', 'CO__mean', 'CO__length', 'CO__standard_deviation', 'CO__variance', 'CO__root_mean_square', 'CO__maximum', 'CO__minimum', 'O3__sum_values', 'O3__median', 'O3__mean', 'O3__length', 'O3__standard_deviation', 'O3__variance', 'O3__root_mean_square', 'O3__maximum', 'O3__minimum', 'WSPM__sum_values', 'WSPM__median', 'WSPM__mean', 'WSPM__length', 'WSPM__standard_deviation', 'WSPM__variance', 'WSPM__root_mean_square', 'WSPM__maximum', 'WSPM__minimum', 'TEMP__sum_values', 'TEMP__median', 'TEMP__mean', 'TEMP__length', 'TEMP__standard_deviation', 'TEMP__variance', 'TEMP__root_mean_square', 'TEMP__maximum', 'TEMP__minimum', 'PRES__sum_values', 'PRES__median', 'PRES__mean', 'PRES__length', 'PRES__standard_deviation', 'PRES__variance', 'PRES__root_mean_square', 'PRES__maximum', 'PRES__minimum'], dtype='object')
A custom dictionary mapping each feature calculator's name to its parameter settings can also be passed to control exactly which features are created.
fc_settings = {'variance_larger_than_standard_deviation': None,
'has_duplicate_max': None,
'has_duplicate_min': None,
'has_duplicate': None,
'sum_values': None,
'abs_energy': None,
'mean_abs_change': None,
'mean_change': None,
'mean_second_derivative_central': None,
'median': None,
'mean': None,
'length': None,
'standard_deviation': None,
'variation_coefficient': None,
'variance': None,
'skewness': None,
'kurtosis': None,
'root_mean_square': None,
'absolute_sum_of_changes': None,
'longest_strike_below_mean': None,
'longest_strike_above_mean': None,
'count_above_mean': None,
'count_below_mean': None,
'last_location_of_maximum': None,
'first_location_of_maximum': None,
'last_location_of_minimum': None,
'first_location_of_minimum': None,
'percentage_of_reoccurring_values_to_all_values': None,
'percentage_of_reoccurring_datapoints_to_all_datapoints': None,
'sum_of_reoccurring_values': None,
'sum_of_reoccurring_data_points': None,
'ratio_value_number_to_time_series_length': None,
'maximum': None,
'minimum': None,
'benford_correlation': None,
'time_reversal_asymmetry_statistic': [{'lag': 1}, {'lag': 2}, {'lag': 3}],
'c3': [{'lag': 1}, {'lag': 2}, {'lag': 3}],
'cid_ce': [{'normalize': True}, {'normalize': False}],
'symmetry_looking': [{'r': 0.0},
{'r': 0.1},
{'r': 0.2},
{'r': 0.30000000000000004},
{'r': 0.4},
{'r': 0.5}],
'large_standard_deviation': [{'r': 0.5},
{'r': 0.75},
{'r': 0.9500000000000001}],
'quantile': [{'q': 0.1},
{'q': 0.2},
{'q': 0.3},
{'q': 0.4},
{'q': 0.6},
{'q': 0.7},
{'q': 0.8},
{'q': 0.9}],
'autocorrelation': [{'lag': 0},
{'lag': 1},
{'lag': 2},
{'lag': 3},
{'lag': 4},
{'lag': 5},
{'lag': 6},
{'lag': 7},
{'lag': 8},
{'lag': 9}],
'agg_autocorrelation': [{'f_agg': 'mean', 'maxlag': 40},
{'f_agg': 'median', 'maxlag': 40},
{'f_agg': 'var', 'maxlag': 40}],
'partial_autocorrelation': [{'lag': 0},
{'lag': 1},
{'lag': 2},
{'lag': 3},
{'lag': 4},
{'lag': 5},
{'lag': 6},
{'lag': 7},
{'lag': 8},
{'lag': 9}],
'number_cwt_peaks': [{'n': 1}, {'n': 5}],
'number_peaks': [{'n': 1}, {'n': 3}, {'n': 5}, {'n': 10}, {'n': 50}],
'binned_entropy': [{'max_bins': 10}],
'index_mass_quantile': [{'q': 0.1},
{'q': 0.2},
{'q': 0.3},
{'q': 0.4},
{'q': 0.6},
{'q': 0.7},
{'q': 0.8},
{'q': 0.9}],
'spkt_welch_density': [{'coeff': 2}, {'coeff': 5}, {'coeff': 8}],
'ar_coefficient': [{'coeff': 0, 'k': 10},
{'coeff': 1, 'k': 10},
{'coeff': 2, 'k': 10},
{'coeff': 3, 'k': 10},
{'coeff': 4, 'k': 10},
{'coeff': 5, 'k': 10},
{'coeff': 6, 'k': 10},
{'coeff': 7, 'k': 10},
{'coeff': 8, 'k': 10},
{'coeff': 9, 'k': 10},
{'coeff': 10, 'k': 10}],
'value_count': [{'value': 0}, {'value': 1}, {'value': -1}],
'range_count': [{'min': -1, 'max': 1}],
'linear_trend': [{'attr': 'pvalue'},
{'attr': 'rvalue'},
{'attr': 'intercept'},
{'attr': 'slope'},
{'attr': 'stderr'}],
'augmented_dickey_fuller': [{'attr': 'teststat'},
{'attr': 'pvalue'},
{'attr': 'usedlag'}],
'number_crossing_m': [{'m': 0}, {'m': -1}, {'m': 1}],
'energy_ratio_by_chunks': [{'num_segments': 10, 'segment_focus': 0},
{'num_segments': 10, 'segment_focus': 1},
{'num_segments': 10, 'segment_focus': 2},
{'num_segments': 10, 'segment_focus': 3},
{'num_segments': 10, 'segment_focus': 4},
{'num_segments': 10, 'segment_focus': 5},
{'num_segments': 10, 'segment_focus': 6},
{'num_segments': 10, 'segment_focus': 7},
{'num_segments': 10, 'segment_focus': 8},
{'num_segments': 10, 'segment_focus': 9}],
'ratio_beyond_r_sigma': [{'r': 0.5},
{'r': 1},
{'r': 1.5},
{'r': 2},
{'r': 2.5},
{'r': 3},
{'r': 5},
{'r': 6},
{'r': 7},
{'r': 10}],
'linear_trend_timewise': [{'attr': 'pvalue'},
{'attr': 'rvalue'},
{'attr': 'intercept'},
{'attr': 'slope'},
{'attr': 'stderr'}],
'count_above': [{'t': 0}],
'count_below': [{'t': 0}],
'permutation_entropy': [{'tau': 1, 'dimension': 3},
{'tau': 1, 'dimension': 4},
{'tau': 1, 'dimension': 5},
{'tau': 1, 'dimension': 6},
{'tau': 1, 'dimension': 7}],
'query_similarity_count': [{'query': None, 'threshold': 0.0}]}
df_features = tsfresh.extract_features(ts_df, column_id='station', column_sort='timestamp',
default_fc_parameters=fc_settings)
df_features.columns
Feature Extraction: 100%|██████████| 17/17 [00:01<00:00, 15.05it/s]
Index(['TEMP__variance_larger_than_standard_deviation', 'TEMP__has_duplicate_max', 'TEMP__has_duplicate_min', 'TEMP__has_duplicate', 'TEMP__sum_values', 'TEMP__abs_energy', 'TEMP__mean_abs_change', 'TEMP__mean_change', 'TEMP__mean_second_derivative_central', 'TEMP__median', ... 'WSPM__ratio_beyond_r_sigma__r_7', 'WSPM__ratio_beyond_r_sigma__r_10', 'WSPM__count_above__t_0', 'WSPM__count_below__t_0', 'WSPM__permutation_entropy__dimension_3__tau_1', 'WSPM__permutation_entropy__dimension_4__tau_1', 'WSPM__permutation_entropy__dimension_5__tau_1', 'WSPM__permutation_entropy__dimension_6__tau_1', 'WSPM__permutation_entropy__dimension_7__tau_1', 'WSPM__query_similarity_count__query_None__threshold_0.0'], dtype='object', length=1716)
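Such a dictionary rarely needs to be written out by hand: the built-in settings classes subclass dict, so a custom configuration can be derived by pruning one of them. A sketch; the names dropped below are only examples, not a recommendation:

from tsfresh.feature_extraction import EfficientFCParameters

# Start from the efficient set and remove unwanted calculators;
# the result is a plain {feature_name: parameter_list} dict.
fc_settings = dict(EfficientFCParameters())
for name in ['matrix_profile', 'fft_coefficient']:
    fc_settings.pop(name, None)  # pop is a no-op if the name is absent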
The above method rolls all of the time series data up into a single record per column_id (station, in this case). For time series modeling, this summarization often needs to happen at each timestamp, using only the data prior to that timestamp. roll_time_series creates a dataframe that allows tsfresh to calculate the features at each timestamp correctly; the maximum window of data is controlled by the max_timeshift parameter.
df_rolled = tsfresh.utilities.dataframe_functions.roll_time_series(df2014,
column_id='station',
column_sort='timestamp',
min_timeshift=24,
max_timeshift=24)
df_rolled.drop(columns=['year', 'month', 'day', 'hour', 'wd', 'station'], inplace=True)
Rolling: 100%|██████████| 20/20 [00:06<00:00, 3.29it/s]
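Each row of the rolled dataframe belongs to one window, identified by an id column of (station, timestamp) tuples in which the timestamp marks the end of the window; this is why the resulting feature index is later renamed from level_0/level_1. The ids can be inspected before extraction:

# Each window id is a (station, end-timestamp) tuple
df_rolled['id'].head()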
Now that the rolled dataframe has been created, extract_features can be run just as before.
df_features = tsfresh.extract_features(df_rolled, column_id='id', column_sort='timestamp',
default_fc_parameters=tsfresh.feature_extraction.MinimalFCParameters())
df_features.columns
Feature Extraction: 100%|██████████| 20/20 [00:16<00:00, 1.19it/s]
Index(['PM2.5__sum_values', 'PM2.5__median', 'PM2.5__mean', 'PM2.5__length', 'PM2.5__standard_deviation', 'PM2.5__variance', 'PM2.5__root_mean_square', 'PM2.5__maximum', 'PM2.5__minimum', 'PM10__sum_values', 'PM10__median', 'PM10__mean', 'PM10__length', 'PM10__standard_deviation', 'PM10__variance', 'PM10__root_mean_square', 'PM10__maximum', 'PM10__minimum', 'SO2__sum_values', 'SO2__median', 'SO2__mean', 'SO2__length', 'SO2__standard_deviation', 'SO2__variance', 'SO2__root_mean_square', 'SO2__maximum', 'SO2__minimum', 'NO2__sum_values', 'NO2__median', 'NO2__mean', 'NO2__length', 'NO2__standard_deviation', 'NO2__variance', 'NO2__root_mean_square', 'NO2__maximum', 'NO2__minimum', 'CO__sum_values', 'CO__median', 'CO__mean', 'CO__length', 'CO__standard_deviation', 'CO__variance', 'CO__root_mean_square', 'CO__maximum', 'CO__minimum', 'O3__sum_values', 'O3__median', 'O3__mean', 'O3__length', 'O3__standard_deviation', 'O3__variance', 'O3__root_mean_square', 'O3__maximum', 'O3__minimum', 'TEMP__sum_values', 'TEMP__median', 'TEMP__mean', 'TEMP__length', 'TEMP__standard_deviation', 'TEMP__variance', 'TEMP__root_mean_square', 'TEMP__maximum', 'TEMP__minimum', 'PRES__sum_values', 'PRES__median', 'PRES__mean', 'PRES__length', 'PRES__standard_deviation', 'PRES__variance', 'PRES__root_mean_square', 'PRES__maximum', 'PRES__minimum', 'DEWP__sum_values', 'DEWP__median', 'DEWP__mean', 'DEWP__length', 'DEWP__standard_deviation', 'DEWP__variance', 'DEWP__root_mean_square', 'DEWP__maximum', 'DEWP__minimum', 'RAIN__sum_values', 'RAIN__median', 'RAIN__mean', 'RAIN__length', 'RAIN__standard_deviation', 'RAIN__variance', 'RAIN__root_mean_square', 'RAIN__maximum', 'RAIN__minimum', 'WSPM__sum_values', 'WSPM__median', 'WSPM__mean', 'WSPM__length', 'WSPM__standard_deviation', 'WSPM__variance', 'WSPM__root_mean_square', 'WSPM__maximum', 'WSPM__minimum'], dtype='object')
Now each timestamp has the data summarized over the preceding 24 hours. With min_timeshift=24 and max_timeshift=24, each window holds the current hour plus the 24 prior hours, i.e. 25 observations, which matches the __length features below.
df_features = df_features.reset_index().rename(columns={'level_0': 'station', 'level_1': 'timestamp'})
df_features.head()
| | station | timestamp | PM2.5__sum_values | PM2.5__median | PM2.5__mean | PM2.5__length | PM2.5__standard_deviation | PM2.5__variance | PM2.5__root_mean_square | PM2.5__maximum | ... | RAIN__minimum | WSPM__sum_values | WSPM__median | WSPM__mean | WSPM__length | WSPM__standard_deviation | WSPM__variance | WSPM__root_mean_square | WSPM__maximum | WSPM__minimum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aotizhongxin | 2014-03-02 00:00:00 | 2053.0 | 67.0 | 82.12 | 25.0 | 67.658153 | 4577.6256 | 106.401692 | 210.0 | ... | 0.0 | 51.1 | 1.8 | 2.044 | 25.0 | 0.964606 | 0.930464 | 2.260177 | 4.3 | 0.1 |
1 | Aotizhongxin | 2014-03-02 01:00:00 | 1976.0 | 67.0 | 79.04 | 25.0 | 64.108957 | 4109.9584 | 101.770723 | 210.0 | ... | 0.0 | 52.3 | 2.0 | 2.092 | 25.0 | 0.966197 | 0.933536 | 2.304344 | 4.3 | 0.1 |
2 | Aotizhongxin | 2014-03-02 02:00:00 | 1902.0 | 67.0 | 76.08 | 25.0 | 59.539513 | 3544.9536 | 96.608074 | 177.0 | ... | 0.0 | 52.6 | 2.1 | 2.104 | 25.0 | 0.964357 | 0.929984 | 2.314476 | 4.3 | 0.1 |
3 | Aotizhongxin | 2014-03-02 03:00:00 | 1852.0 | 67.0 | 74.08 | 25.0 | 56.897044 | 3237.2736 | 93.408351 | 176.0 | ... | 0.0 | 53.5 | 2.2 | 2.140 | 25.0 | 0.950368 | 0.903200 | 2.341538 | 4.3 | 0.1 |
4 | Aotizhongxin | 2014-03-02 04:00:00 | 1790.0 | 67.0 | 71.60 | 25.0 | 53.659668 | 2879.3600 | 89.475807 | 175.0 | ... | 0.0 | 54.3 | 2.2 | 2.172 | 25.0 | 0.934888 | 0.874016 | 2.364656 | 4.3 | 0.1 |
5 rows × 101 columns
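If the categorical columns dropped earlier are needed for modeling, the extracted features can be joined back to the original observations. A minimal sketch, re-attaching the wind-direction column wd by matching each window's station and end timestamp:

# Re-attach the categorical wind direction to each feature row
df_model = df_features.merge(
    df2014[['station', 'timestamp', 'wd']],
    on=['station', 'timestamp'],
    how='left',
)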