Click here to Interact with this code on nbViewer
import pandas as pd
Fetch the CSV file directly from GitHub
# Load the weather dataset straight from the GitHub repository and preview it.
DATA_URL = (
    "https://raw.githubusercontent.com/ujwalnk/MachineLearning101"
    "/main/data/01%20Weather%20Data.csv"
)
dataframe = pd.read_csv(DATA_URL)
dataframe.head()
Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No |
1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No |
2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No |
3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No |
4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No |
5 rows × 23 columns
# Discard every row that contains at least one missing value, then
# confirm: no NaNs remain and every column has the same row count.
dataframe = dataframe.dropna(axis=0)
dataframe.isnull().sum(), dataframe.count()
(Date 0 Location 0 MinTemp 0 MaxTemp 0 Rainfall 0 Evaporation 0 Sunshine 0 WindGustDir 0 WindGustSpeed 0 WindDir9am 0 WindDir3pm 0 WindSpeed9am 0 WindSpeed3pm 0 Humidity9am 0 Humidity3pm 0 Pressure9am 0 Pressure3pm 0 Cloud9am 0 Cloud3pm 0 Temp9am 0 Temp3pm 0 RainToday 0 RainTomorrow 0 dtype: int64, Date 56420 Location 56420 MinTemp 56420 MaxTemp 56420 Rainfall 56420 Evaporation 56420 Sunshine 56420 WindGustDir 56420 WindGustSpeed 56420 WindDir9am 56420 WindDir3pm 56420 WindSpeed9am 56420 WindSpeed3pm 56420 Humidity9am 56420 Humidity3pm 56420 Pressure9am 56420 Pressure3pm 56420 Cloud9am 56420 Cloud3pm 56420 Temp9am 56420 Temp3pm 56420 RainToday 56420 RainTomorrow 56420 dtype: int64)
Drop Unnecessary Columns
dataframe = dataframe.drop("Date", axis=1)
Remove duplicate rows, sort by the target, and check the class distribution
# Deduplicate the rows, order them by the target label, and inspect
# the class balance of RainTomorrow.
dataframe = dataframe.drop_duplicates().sort_values("RainTomorrow", ascending=True)
dataframe["RainTomorrow"].value_counts()
No 43993 Yes 12427 Name: RainTomorrow, dtype: int64
from sklearn.model_selection import train_test_split

# Import label encoder
from sklearn import preprocessing

# Label-encode the target column: "No" -> 0, "Yes" -> 1 (alphabetical order).
label_encoder = preprocessing.LabelEncoder()
dataframe["RainTomorrow"] = label_encoder.fit_transform(dataframe["RainTomorrow"])

# Separate the target from the features, then one-hot encode the remaining
# categorical columns (Location, wind directions, RainToday, ...).
# NOTE: the previous chained assignment `X = dataframe = pd.get_dummies(...)`
# silently rebound `dataframe` to the feature matrix — keep `dataframe` intact.
y = dataframe["RainTomorrow"]
X = pd.get_dummies(dataframe.drop("RainTomorrow", axis=1))

# Hold out 20% of the rows for testing. `stratify=y` preserves the imbalanced
# (~78/22) class ratio in both splits; `random_state` makes the split
# reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
Since every row is labelled, we hold out a separate test split for evaluation
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((45136, 92), (11284, 92), (45136,), (11284,))
The shapes match, so start training the model
# Train a random forest on the training split. Importing the class under the
# lowercase alias `clf` was misleading — by convention `clf` names a fitted
# estimator instance, not the estimator class — so use the real class name.
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
RandomForestClassifier()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestClassifier()
rf_model.score(X_test, y_test)
0.8645870258773485