Scikit-Learn Train-Test Split

This notebook shows how to create a train-test split with scikit-learn so that machine learning models can be validated on out-of-sample data.

The data are hourly weather observations recorded in 2013 at the weather stations of the New York airports that flights depart from (the origin column).

Packages

This tutorial uses statsmodels, pandas and scikit-learn:

In [1]:

import statsmodels.api as sm
import pandas as pd
from sklearn.model_selection import train_test_split

Reading the data

The data comes from the Rdatasets collection and is loaded with the Python package statsmodels.

In [2]:

df = sm.datasets.get_rdataset('weather', 'nycflights13').data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26115 entries, 0 to 26114
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   origin      26115 non-null  object 
 1   year        26115 non-null  int64  
 2   month       26115 non-null  int64  
 3   day         26115 non-null  int64  
 4   hour        26115 non-null  int64  
 5   temp        26114 non-null  float64
 6   dewp        26114 non-null  float64
 7   humid       26114 non-null  float64
 8   wind_dir    25655 non-null  float64
 9   wind_speed  26111 non-null  float64
 10  wind_gust   5337 non-null   float64
 11  precip      26115 non-null  float64
 12  pressure    23386 non-null  float64
 13  visib       26115 non-null  float64
 14  time_hour   26115 non-null  object 
dtypes: float64(9), int64(4), object(2)
memory usage: 3.0+ MB

In [3]:

df.origin.unique()

Out[3]:

array(['EWR', 'JFK', 'LGA'], dtype=object)

Fix dates

time_hour holds the timestamp of each observation as a string. Convert it to a datetime column named observation_time. The year, month, day and hour columns duplicate that information, so they can be dropped together with the original time_hour column.

In [4]:

# Parse the timestamp strings into a datetime column
df['observation_time'] = pd.to_datetime(df.time_hour)
# Drop the columns that are now redundant
df.drop(columns=['year', 'month', 'day', 'hour', 'time_hour'], inplace=True)
df.head()

Out[4]:

	origin	temp	dewp	humid	wind_dir	wind_speed	wind_gust	pressure	visib	observation_time
0	EWR	39.02	26.06	59.37	270.0	10.35702	NaN	1012.0	10.0	2013-01-01 01:00:00
1	EWR	39.02	26.96	61.63	250.0	8.05546	NaN	1012.3	10.0	2013-01-01 02:00:00
2	EWR	39.02	28.04	64.43	240.0	11.50780	NaN	1012.5	10.0	2013-01-01 03:00:00
3	EWR	39.92	28.04	62.21	250.0	12.65858	NaN	1012.2	10.0	2013-01-01 04:00:00
4	EWR	39.02	28.04	64.43	260.0	12.65858	NaN	1011.9	10.0	2013-01-01 05:00:00

Train-test splitting

In [5]:

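# Hold out 20% of the rows as the test set. Rows are shuffled before splitting,
# and without random_state the selection changes on every run.
train_df, test_df = train_test_split(df, test_size=.2)
train_df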

Out[5]:

	origin	temp	dewp	humid	wind_dir	wind_speed	wind_gust	precip	pressure	visib	observation_time
9030	JFK	53.06	33.98	48.16	340.0	9.20624	NaN	0.0	1021.6	10.0	2013-01-14 17:00:00
22499	LGA	75.92	64.94	68.78	180.0	10.35702	18.41248	0.0	1014.3	10.0	2013-08-01 09:00:00
6287	EWR	78.08	55.94	46.49	160.0	8.05546	NaN	0.0	1017.0	10.0	2013-09-20 13:00:00
15793	JFK	48.02	33.98	58.07	310.0	13.80936	NaN	0.0	1007.6	10.0	2013-10-23 22:00:00
11971	JFK	64.94	39.92	39.79	300.0	10.35702	21.86482	0.0	1018.3	10.0	2013-05-17 10:00:00
...	...	...	...	...	...	...	...	...	...	...	...
3598	EWR	73.94	64.94	73.49	220.0	6.90468	NaN	0.0	1019.0	10.0	2013-05-31 04:00:00
4973	EWR	80.96	62.96	54.35	170.0	11.50780	19.56326	0.0	1016.6	10.0	2013-07-27 13:00:00
6147	EWR	64.94	46.04	50.32	290.0	16.11092	21.86482	0.0	1017.3	10.0	2013-09-14 17:00:00
15586	JFK	53.06	51.08	92.96	40.0	3.45234	NaN	0.0	1023.8	10.0	2013-10-15 07:00:00
9050	JFK	39.02	24.98	56.77	350.0	9.20624	NaN	0.0	1025.3	10.0	2013-01-15 13:00:00

20892 rows × 11 columns
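
Because the call above passes no random_state, the rows that land in each subset change on every run. The sketch below uses a small made-up dataframe (toy_df and its values are illustrative assumptions, not the weather data) to show how random_state can make the split repeatable and how stratify can keep the share of each origin the same in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the weather dataframe (hypothetical values).
toy_df = pd.DataFrame({
    "origin": ["EWR", "JFK", "LGA"] * 10,
    "temp": [30.0 + i for i in range(30)],
})

# random_state fixes the shuffle, so the same rows are selected on every run.
# stratify keeps the proportion of each origin equal in train and test.
train_a, test_a = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["origin"]
)
train_b, test_b = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["origin"]
)

print(len(train_a), len(test_a))              # sizes of the two subsets
print(train_a.index.equals(train_b.index))    # True when the split is repeatable
print(test_a["origin"].value_counts().to_dict())
```

The same two arguments could be passed in the call on the weather dataframe, for example stratify=df["origin"]. Whether stratifying by airport is worthwhile depends on the model being validated.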

In [6]:

print("Train:", train_df.origin.unique())
print("Test:", test_df.origin.unique())
print("Train:", train_df.observation_time.min(), train_df.observation_time.max())
print("Test:", test_df.observation_time.min(), test_df.observation_time.max())

Train: ['JFK' 'LGA' 'EWR']
Test: ['LGA' 'JFK' 'EWR']
Train: 2013-01-01 01:00:00 2013-12-30 18:00:00
Test: 2013-01-01 02:00:00 2013-12-30 18:00:00
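
The output above shows that a random split places observations from the whole year, and from every airport, in both subsets. If a model is meant to predict later observations from earlier ones, one possible alternative (an assumption about the use case, not something this notebook does) is a chronological split: sort by time and pass shuffle=False so the test set holds only the most recent rows. The sketch below uses a made-up dataframe; toy_df, train_toy and test_toy are hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten hourly timestamps in scrambled order (hypothetical values).
hours = pd.to_timedelta([3, 0, 7, 1, 9, 4, 2, 8, 5, 6], unit="h")
toy_df = pd.DataFrame({
    "observation_time": pd.Timestamp("2013-01-01 01:00") + hours,
    "temp": [39.0, 39.5, 41.0, 38.2, 40.1, 37.9, 38.8, 42.3, 40.7, 39.9],
})

# Sort chronologically, then split without shuffling:
# the first 80% of rows become the training set, the last 20% the test set.
ordered = toy_df.sort_values("observation_time")
train_toy, test_toy = train_test_split(ordered, test_size=0.2, shuffle=False)

# Every training observation precedes every test observation.
print(train_toy["observation_time"].max() < test_toy["observation_time"].min())
```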
