This notebook explains how to generate a train-test split with scikit-learn
so that machine learning models can be validated on out-of-sample data.
It uses hourly weather data from several weather stations (origin)
serving flights from New York airports in 2013.
This tutorial uses:
import statsmodels.api as sm
import pandas as pd
from sklearn.model_selection import train_test_split
The data is from Rdatasets, imported using the Python package statsmodels.
df = sm.datasets.get_rdataset('weather', 'nycflights13').data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26115 entries, 0 to 26114
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   origin      26115 non-null  object
 1   year        26115 non-null  int64
 2   month       26115 non-null  int64
 3   day         26115 non-null  int64
 4   hour        26115 non-null  int64
 5   temp        26114 non-null  float64
 6   dewp        26114 non-null  float64
 7   humid       26114 non-null  float64
 8   wind_dir    25655 non-null  float64
 9   wind_speed  26111 non-null  float64
 10  wind_gust   5337 non-null   float64
 11  precip      26115 non-null  float64
 12  pressure    23386 non-null  float64
 13  visib       26115 non-null  float64
 14  time_hour   26115 non-null  object
dtypes: float64(9), int64(4), object(2)
memory usage: 3.0+ MB
df.origin.unique()
array(['EWR', 'JFK', 'LGA'], dtype=object)
time_hour contains the date and hour of the observation as a string. Convert it to a datetime column named observation_time. The year, month, day and hour columns duplicate that information, so they can be dropped from the dataframe together with the original time_hour string.
df['observation_time'] = pd.to_datetime(df.time_hour)
df.drop(columns=['year', 'month', 'day', 'hour', 'time_hour'], inplace=True)
df.head()
 | origin | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib | observation_time |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | EWR | 39.02 | 26.06 | 59.37 | 270.0 | 10.35702 | NaN | 0.0 | 1012.0 | 10.0 | 2013-01-01 01:00:00 |
1 | EWR | 39.02 | 26.96 | 61.63 | 250.0 | 8.05546 | NaN | 0.0 | 1012.3 | 10.0 | 2013-01-01 02:00:00 |
2 | EWR | 39.02 | 28.04 | 64.43 | 240.0 | 11.50780 | NaN | 0.0 | 1012.5 | 10.0 | 2013-01-01 03:00:00 |
3 | EWR | 39.92 | 28.04 | 62.21 | 250.0 | 12.65858 | NaN | 0.0 | 1012.2 | 10.0 | 2013-01-01 04:00:00 |
4 | EWR | 39.02 | 28.04 | 64.43 | 260.0 | 12.65858 | NaN | 0.0 | 1011.9 | 10.0 | 2013-01-01 05:00:00 |
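Before dropping columns as redundant, it can be worth confirming that they really carry the same information. The following is a minimal sketch, not part of the original notebook: it uses a made-up three-row frame named sample (an assumption for illustration) and relies on pd.to_datetime being able to assemble timestamps from year, month, day and hour columns.

```python
import pandas as pd

# Hypothetical three-row sample laid out like the weather data.
sample = pd.DataFrame({
    "year": [2013, 2013, 2013],
    "month": [1, 1, 1],
    "day": [1, 1, 1],
    "hour": [1, 2, 3],
    "time_hour": [
        "2013-01-01 01:00:00",
        "2013-01-01 02:00:00",
        "2013-01-01 03:00:00",
    ],
})

# Build timestamps from the separate component columns ...
from_parts = pd.to_datetime(sample[["year", "month", "day", "hour"]])
# ... and from the string column, then compare them row by row.
from_string = pd.to_datetime(sample["time_hour"])

is_redundant = bool((from_parts == from_string).all())
print("Component columns match time_hour:", is_redundant)
```

If the comparison fails on real data, a time zone or parsing difference is a likely cause and the columns should be inspected before they are removed.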
train_df, test_df = train_test_split(df, test_size=.2)
train_df
 | origin | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib | observation_time |
---|---|---|---|---|---|---|---|---|---|---|---|
9030 | JFK | 53.06 | 33.98 | 48.16 | 340.0 | 9.20624 | NaN | 0.0 | 1021.6 | 10.0 | 2013-01-14 17:00:00 |
22499 | LGA | 75.92 | 64.94 | 68.78 | 180.0 | 10.35702 | 18.41248 | 0.0 | 1014.3 | 10.0 | 2013-08-01 09:00:00 |
6287 | EWR | 78.08 | 55.94 | 46.49 | 160.0 | 8.05546 | NaN | 0.0 | 1017.0 | 10.0 | 2013-09-20 13:00:00 |
15793 | JFK | 48.02 | 33.98 | 58.07 | 310.0 | 13.80936 | NaN | 0.0 | 1007.6 | 10.0 | 2013-10-23 22:00:00 |
11971 | JFK | 64.94 | 39.92 | 39.79 | 300.0 | 10.35702 | 21.86482 | 0.0 | 1018.3 | 10.0 | 2013-05-17 10:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3598 | EWR | 73.94 | 64.94 | 73.49 | 220.0 | 6.90468 | NaN | 0.0 | 1019.0 | 10.0 | 2013-05-31 04:00:00 |
4973 | EWR | 80.96 | 62.96 | 54.35 | 170.0 | 11.50780 | 19.56326 | 0.0 | 1016.6 | 10.0 | 2013-07-27 13:00:00 |
6147 | EWR | 64.94 | 46.04 | 50.32 | 290.0 | 16.11092 | 21.86482 | 0.0 | 1017.3 | 10.0 | 2013-09-14 17:00:00 |
15586 | JFK | 53.06 | 51.08 | 92.96 | 40.0 | 3.45234 | NaN | 0.0 | 1023.8 | 10.0 | 2013-10-15 07:00:00 |
9050 | JFK | 39.02 | 24.98 | 56.77 | 350.0 | 9.20624 | NaN | 0.0 | 1025.3 | 10.0 | 2013-01-15 13:00:00 |
20892 rows × 11 columns
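Because train_test_split shuffles rows at random, rerunning the split gives different rows each time. As a minimal sketch (not part of the original notebook), the snippet below fixes the shuffle with random_state and passes stratify so that each station keeps the same share in both subsets. The frame demo_df and its values are a made-up stand-in for the weather data, and the seed 42 is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the weather dataframe: 100 rows per station.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "origin": np.repeat(["EWR", "JFK", "LGA"], 100),
    "temp": rng.normal(55, 15, size=300),
})

# random_state makes the shuffle repeatable;
# stratify keeps each origin's proportion equal in train and test.
demo_train, demo_test = train_test_split(
    demo_df,
    test_size=0.2,
    random_state=42,
    stratify=demo_df["origin"],
)

print(len(demo_train), len(demo_test))
print(demo_test["origin"].value_counts().to_dict())
```

Whether stratifying is appropriate depends on the model being validated; for the weather data it would only guarantee what the random split already achieves approximately.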
print("Train:", train_df.origin.unique())
print("Test:", test_df.origin.unique())
print("Train:", train_df.observation_time.min(), train_df.observation_time.max())
print("Test:", test_df.observation_time.min(), test_df.observation_time.max())
Train: ['JFK' 'LGA' 'EWR']
Test: ['LGA' 'JFK' 'EWR']
Train: 2013-01-01 01:00:00 2013-12-30 18:00:00
Test: 2013-01-01 02:00:00 2013-12-30 18:00:00
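The output shows that both subsets contain all three stations and span almost the whole year, which is what a random row-wise split should produce. The same checks can be gathered into one table. This is a sketch only: summarize_split is a hypothetical helper, not a scikit-learn or pandas function, and the toy frame is invented for illustration.

```python
import pandas as pd

def summarize_split(train, test, group_col, time_col):
    """Return one row per subset: row count, share, groups present, time range."""
    total = len(train) + len(test)
    rows = []
    for name, part in (("train", train), ("test", test)):
        rows.append({
            "subset": name,
            "rows": len(part),
            "share": len(part) / total,
            "groups": sorted(part[group_col].unique()),
            "start": part[time_col].min(),
            "end": part[time_col].max(),
        })
    return pd.DataFrame(rows).set_index("subset")

# Hypothetical five-row frame with the same column names as the weather data.
toy = pd.DataFrame({
    "origin": ["EWR", "JFK", "LGA", "EWR", "JFK"],
    "observation_time": pd.to_datetime([
        "2013-01-01 01:00:00",
        "2013-03-01 02:00:00",
        "2013-06-01 03:00:00",
        "2013-09-01 04:00:00",
        "2013-12-01 05:00:00",
    ]),
})
toy_train, toy_test = toy.iloc[:4], toy.iloc[4:]

summary = summarize_split(toy_train, toy_test, "origin", "observation_time")
print(summary)
```

With the notebook's own frames, the call would be summarize_split(train_df, test_df, "origin", "observation_time").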