#!/usr/bin/env python
# coding: utf-8

# # Forecasting the Weather Using Facebook's Prophet Algorithm

# ## Project Overview
#
# In this project, we'll predict the weather using the Facebook Prophet algorithm and local weather data for New York City. Prophet uses an additive model that sums seasonal effects and trends to make a prediction. The advantage of Prophet is that it automatically identifies seasonality in the data, and weather data has strong seasonal effects. So without any feature engineering, we can get good baseline accuracy. It can also scale easily to multiple time series (think data from adjacent weather stations).
#
# By the end, we'll have a model that predicts the weather and can be extended to improve accuracy.
#
# **Project Steps**
# * Load in and clean data
# * Define targets and predictors
# * Train model
# * Evaluate the model across the entire dataset using cross-validation
# * Make future predictions

# In[11]:

import pandas as pd
import prophet
import numpy as np
import plotly

# ## Load and Clean the Data

# In[12]:

weather = pd.read_csv("weather.csv", index_col="DATE")

# In[13]:

weather

# We can see that some of these columns have missing data, perhaps because a sensor didn't always work properly or the data wasn't recorded correctly. We have to fix that, because most machine learning models can't work with missing values. What we'll do here is get rid of the columns that have a lot of missing values.
# To do that, we're going to find the percentage of null values in each column and drop the columns where more than 5% of the values are null.

# In[14]:

null_pct = weather.isnull().sum() / weather.shape[0]

# In[15]:

null_pct

# In[16]:

valid_columns = weather.columns[null_pct < .05]

# In[59]:

print("Columns to keep")
valid_columns

# The next thing we're going to do is subset our data to only the valid columns.

# In[18]:

weather = weather[valid_columns].copy()

# In[60]:

weather.columns = weather.columns.str.lower() # Lowercase all column names so they are easier to type

# As we are dealing with time series data, we're going to convert the index to a datetime type to make it easier to work with later on.

# In[20]:

weather.index = pd.to_datetime(weather.index)

# ## Split the Data

# We're going to split the dataframe in two: one with data for JFK Airport and one with data for LaGuardia Airport. We then combine them side by side, so each row has a unique date and holds the data from both stations. After the merge, columns ending in `_x` come from JFK and columns ending in `_y` come from LaGuardia.

# In[21]:

weather["station"].unique()

# In[22]:

lga = weather[weather["station"] == "USW00014732"].copy()

# In[23]:

lga

# In[24]:

weather = weather[weather["station"] == "USW00094789"].copy()

# In[25]:

weather

# In[26]:

weather = weather.merge(lga, left_index=True, right_index=True)

# In[27]:

weather

# ## Set up for the algorithm

# Now we're ready to set up the data for Prophet. We start with the `target`, which is the thing we're trying to predict, i.e. the future temperature at JFK Airport.

# In[61]:

weather['y'] = weather.shift(-1)["tmax_x"] # Take the maximum daily temperature at JFK Airport and shift it back one day, so each row holds the next day's value

# In[29]:

weather[["tmax_x", "y"]]

# The dataframe above means we can use today's max temperature to help us predict tomorrow's max temperature. So `y` represents tomorrow's max temperature.
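# As a minimal sketch of how `shift(-1)` builds this target, the toy example below uses made-up temperatures (the `toy` dataframe and its values are hypothetical, not taken from `weather.csv`). Each row receives the next day's `tmax_x`, and the last row has no "tomorrow", so its `y` is `NaN`.

```python
import pandas as pd

# Hypothetical daily maximum temperatures for four consecutive days
toy = pd.DataFrame(
    {"tmax_x": [50.0, 55.0, 53.0, 60.0]},
    index=pd.date_range("2022-01-01", periods=4, freq="D"),
)

# shift(-1) moves every value up one row, so each date is paired
# with the following day's maximum temperature
toy["y"] = toy["tmax_x"].shift(-1)

print(toy)
```

# The trailing `NaN` in `y` is one reason the missing values have to be handled before the model is fitted.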
# To fill the null values, we're going to use the `ffill()` method (forward fill), which replaces each null value with the last non-null value above it in the same column. Filling a missing value with the previous day's reading is not exact, but it keeps the data usable for forecasting and should have little effect on our predictions.

# In[30]:

weather = weather.ffill()

# In[31]:

weather

# In[62]:

weather["ds"] = weather.index # Prophet expects the dates in a column named ds

# In[33]:

weather

# In[34]:

weather.columns

# In[63]:

predictors = weather.columns[~weather.columns.isin(["y", "name_x", "station_x", "name_y", "station_y", "ds"])] # the '~' operator negates the mask, so the listed columns are excluded

# In[36]:

predictors

# In[37]:

train = weather[:"2021-12-31"]
test = weather["2022-01-01":] # Label slicing is inclusive, so start the test set one day after the training set ends

# ## Run Prophet for 2022

# In[38]:

from prophet import Prophet

# In[39]:

def fit_prophet(train):
    m = Prophet()
    # Register every predictor column as an extra regressor
    for p in predictors:
        m.add_regressor(p)
    m.fit(train)
    return m

m = fit_prophet(train)

# In[40]:

predictions = m.predict(test)

# In[41]:

predictions

# So for each day we have a predicted temperature (`yhat`). What Prophet does is time series forecasting: it picks out different components of the time series and tries to model how the series will change. We can look more deeply into what Prophet is doing by plotting those components.

# In[42]:

weather["2018-12-31":].plot("ds", "y")

# We can see there is a strong pattern in our daily temperatures. The peaks are the summers, when the temperature is high, and the troughs are the winters, when it is low. This pattern repeats over time, so we can refer to it as a seasonal pattern. Prophet separates the temperatures into different factors: how much does the time of year impact the temperature, how much does the day of the week impact it, and is there a general trend in the temperature separate from these seasonal effects?
# It then tries to predict each one separately and adds them together to get the final prediction.

# In[43]:

from prophet.plot import plot_plotly, plot_components_plotly, plot_cross_validation_metric

# In[44]:

plot_components_plotly(m, predictions)

# Now let's look at how important our columns are to Prophet's predictions.

# In[45]:

from prophet.utilities import regressor_coefficients

# In[46]:

regressor_coefficients(m)

# The dataframe above shows the impact of each regressor on our predictions. For example, the first row shows the impact of precipitation at JFK on tomorrow's temperature. We can see that, generally, if it rains today, Prophet expects tomorrow's temperature to be lower (indicated by a negative coefficient).

# Now let's look at the mean squared error (MSE), which is a very common way to measure how well a model performed.

# In[47]:

predictions.index = test.index
predictions["actual"] = test["y"]

# In[48]:

def mse(predictions, actual_label="actual", pred_label="yhat"):
    # Squared difference between actual and predicted values, averaged over all rows
    se = ((predictions[actual_label] - predictions[pred_label])**2)
    print(se.mean())

mse(predictions)

# ## Cross validation

# If we evaluate our model across the whole dataset using cross validation, we can get a more realistic error estimate. Cross validation splits our data into multiple pieces. It uses the first piece to predict the second piece, then puts the first and second pieces together and uses them to predict the third piece, and so on. This makes sure we are not training our model on the same data we test it on, while still giving us an error estimate across a good portion of the data. Here the first five years form the initial training window, and a new 180-day forecast is made every 180 days.

# In[49]:

from prophet.diagnostics import cross_validation

# In[52]:

m = fit_prophet(weather)
cv = cross_validation(m, initial=f"{365*5} days", period="180 days", horizon="180 days", parallel="processes")

# In[51]:

mse(cv, actual_label="y")

# Our error across the whole dataset is a little bit lower than our error for 2022 alone.
# Maybe there was something unusual happening in 2022.

# In[54]:

cv[["y", "yhat"]][-365:].plot() # Select both columns first, then keep the last 365 rows

# We can see that our predicted values (`yhat`) correlate well with the actual `y` values and follow the overall trend, but there are a lot of spikes in the actual temperature that our model doesn't capture. Those spikes could have a number of causes. There could be a storm system coming south from Canada into New York. There could be a hurricane. Our model just doesn't have enough information to follow all these spikes.

# ## Make Future predictions

# Now let's use our model to predict one day ahead.

# In[55]:

m = fit_prophet(weather)
m.predict(weather.iloc[-1:])

# How about predicting multiple days ahead? Predicting multiple days ahead is likely to be less accurate than predicting one day ahead. We also don't know the future values of our predictors, so this model is fitted without the extra regressors and relies only on the dates and the history of `y`. Let's predict the next 365 days.

# In[56]:

m = Prophet()
m.fit(weather)
future = m.make_future_dataframe(periods=365)

# In[57]:

future

# In[58]:

forecast = m.predict(future)
plot_plotly(m, forecast)
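# To make the shape of `future` concrete, the sketch below rebuilds a comparable dataframe with pandas alone. It is an illustration under stated assumptions, not Prophet's implementation: the `history` frame and its dates are hypothetical, the data is assumed to be daily, and the result keeps the historical dates and appends the requested number of new days, as `make_future_dataframe` does by default.

```python
import pandas as pd

# Hypothetical history: seven consecutive daily dates in a "ds" column
history = pd.DataFrame({"ds": pd.date_range("2022-12-25", periods=7, freq="D")})

periods = 3  # number of future days to append (the notebook uses 365)

# Future dates start the day after the last observed date
last_date = history["ds"].max()
future_dates = pd.date_range(
    start=last_date + pd.Timedelta(days=1), periods=periods, freq="D"
)

# Historical dates followed by the new dates, in a single "ds" column
future_sketch = pd.concat(
    [history[["ds"]], pd.DataFrame({"ds": future_dates})],
    ignore_index=True,
)

print(future_sketch.tail())
```

# The frame contains only dates, which is why a model that needs regressor values for each row cannot be used with it directly.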