In this project we will work with historical data from the S&P500 index. This index measures the stock performance of 500 large companies listed on US stock exchanges. It is one of the most commonly followed equity indices, and many consider it one of the best representations of the US stock market.
When an index like the S&P500 or the Dow Jones Industrial Average (DJIA, or Dow 30) goes up or down, you can say that the underlying market or sector it represents is also going up or down. For example, if the DJIA goes down one day, you can say that American stocks overall went down.
We will use S&P500 historical data to make predictions about future prices. Predicting whether an index will go up or down will help us forecast how the stock market as a whole will perform. Since stocks tend to correlate with how well the economy as a whole is performing, this can also help us make economic forecasts.
However, it is not recommended to make trades with the model developed in this project. Trading stocks has risks, and nothing we do here constitutes stock trading advice.
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
stocks = pd.read_csv('sphist.csv')
stocks.head()
| | Date | Open | High | Low | Close | Volume | Adj Close |
---|---|---|---|---|---|---|---|
0 | 2015-12-07 | 2090.419922 | 2090.419922 | 2066.780029 | 2077.070068 | 4.043820e+09 | 2077.070068 |
1 | 2015-12-04 | 2051.239990 | 2093.840088 | 2051.239990 | 2091.689941 | 4.214910e+09 | 2091.689941 |
2 | 2015-12-03 | 2080.709961 | 2085.000000 | 2042.349976 | 2049.620117 | 4.306490e+09 | 2049.620117 |
3 | 2015-12-02 | 2101.709961 | 2104.270020 | 2077.110107 | 2079.510010 | 3.950640e+09 | 2079.510010 |
4 | 2015-12-01 | 2082.929932 | 2103.370117 | 2082.929932 | 2102.629883 | 3.712120e+09 | 2102.629883 |
stocks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Date       16590 non-null  object
 1   Open       16590 non-null  float64
 2   High       16590 non-null  float64
 3   Low        16590 non-null  float64
 4   Close      16590 non-null  float64
 5   Volume     16590 non-null  float64
 6   Adj Close  16590 non-null  float64
dtypes: float64(6), object(1)
memory usage: 907.4+ KB
The dataset has 16,590 rows. We also found that the Date column wasn't recognized as a datetime, so we need to convert it. So far we don't have any NaN values.
stocks['Date'] = pd.to_datetime(stocks['Date'])
stocks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Date       16590 non-null  datetime64[ns]
 1   Open       16590 non-null  float64
 2   High       16590 non-null  float64
 3   Low        16590 non-null  float64
 4   Close      16590 non-null  float64
 5   Volume     16590 non-null  float64
 6   Adj Close  16590 non-null  float64
dtypes: datetime64[ns](1), float64(6)
memory usage: 907.4 KB
print(min(stocks['Date']))
print(max(stocks['Date']))
1950-01-03 00:00:00
2015-12-07 00:00:00
The data goes from January 3rd, 1950 to December 7th, 2015. Let's sort it in ascending order by date.
stocks = stocks.sort_values(by='Date', ascending=True).reset_index(drop=True)
stocks.head()
| | Date | Open | High | Low | Close | Volume | Adj Close |
---|---|---|---|---|---|---|---|
0 | 1950-01-03 | 16.66 | 16.66 | 16.66 | 16.66 | 1260000.0 | 16.66 |
1 | 1950-01-04 | 16.85 | 16.85 | 16.85 | 16.85 | 1890000.0 | 16.85 |
2 | 1950-01-05 | 16.93 | 16.93 | 16.93 | 16.93 | 2550000.0 | 16.93 |
3 | 1950-01-06 | 16.98 | 16.98 | 16.98 | 16.98 | 2010000.0 | 16.98 |
4 | 1950-01-09 | 17.08 | 17.08 | 17.08 | 17.08 | 2520000.0 | 17.08 |
Stock market datasets need to be handled differently than datasets from other sectors when it comes to making predictions. In a normal Machine Learning (ML) exercise we treat each row as independent; stock market data, however, is sequential, and each observation comes a day after the previous one. Thus, the observations are not independent at all.
The time series nature of the data means we can generate indicators to make our model more accurate. For example, we can create a new column containing the average price of the last 10 trading days for each row. This incorporates information from multiple prior rows into one and should make predictions more accurate.
When we do this, we have to be careful not to use the current row in the values we average. We want to teach the model how to predict the current price from historical prices. If we include the current price in the prices we average, it will be equivalent to handing the answers to the model, and will make it impossible to use in the "real world", where we don't know the price upfront.
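As a quick illustration with made-up numbers (a hypothetical five-price series, not the S&P500 data), compare a rolling mean that includes the current row with one shifted back a day:

prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])
leaky = prices.rolling(window=3).mean()           # row i averages rows i-2..i, including today's price
safe = prices.rolling(window=3).mean().shift(1)   # row i averages rows i-3..i-1, yesterday and earlier
print(pd.DataFrame({'price': prices, 'leaky': leaky, 'safe': safe}))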
Let's calculate the 5-day, 30-day, and 1-year rolling averages and standard deviations of the Close and Volume columns, plus the ratio of the 5-day average to the 1-year average.
# Rolling means and standard deviations of the closing price, shifted one day
# forward so each row only uses information from before that day
stocks['Avg_5day'] = stocks['Close'].rolling(window=5).mean().shift(1)
stocks['Avg_30day'] = stocks['Close'].rolling(window=30).mean().shift(1)
stocks['Avg_1yr'] = stocks['Close'].rolling(window=365).mean().shift(1)
stocks['Std_5day'] = stocks['Close'].rolling(window=5).std().shift(1)
stocks['Std_30day'] = stocks['Close'].rolling(window=30).std().shift(1)
stocks['Std_1yr'] = stocks['Close'].rolling(window=365).std().shift(1)
# Ratio of the short-term average to the long-term average
stocks['ratio_dy'] = stocks['Avg_5day'] / stocks['Avg_1yr']
# The same rolling statistics, computed on trading volume
stocks['AvgVol_5day'] = stocks['Volume'].rolling(window=5).mean().shift(1)
stocks['AvgVol_30day'] = stocks['Volume'].rolling(window=30).mean().shift(1)
stocks['AvgVol_1yr'] = stocks['Volume'].rolling(window=365).mean().shift(1)
stocks['StdVol_5day'] = stocks['Volume'].rolling(window=5).std().shift(1)
stocks['StdVol_30day'] = stocks['Volume'].rolling(window=30).std().shift(1)
stocks['StdVol_1yr'] = stocks['Volume'].rolling(window=365).std().shift(1)
stocks.head(10)
| | Date | Open | High | Low | Close | Volume | Adj Close | Avg_5day | Avg_30day | Avg_1yr | Std_5day | Std_30day | Std_1yr | ratio_dy | AvgVol_5day | AvgVol_30day | AvgVol_1yr | StdVol_5day | StdVol_30day | StdVol_1yr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1950-01-03 | 16.660000 | 16.660000 | 16.660000 | 16.660000 | 1260000.0 | 16.660000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1950-01-04 | 16.850000 | 16.850000 | 16.850000 | 16.850000 | 1890000.0 | 16.850000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1950-01-05 | 16.930000 | 16.930000 | 16.930000 | 16.930000 | 2550000.0 | 16.930000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1950-01-06 | 16.980000 | 16.980000 | 16.980000 | 16.980000 | 2010000.0 | 16.980000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1950-01-09 | 17.080000 | 17.080000 | 17.080000 | 17.080000 | 2520000.0 | 17.080000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 1950-01-10 | 17.030001 | 17.030001 | 17.030001 | 17.030001 | 2160000.0 | 17.030001 | 16.900 | NaN | NaN | 0.157956 | NaN | NaN | NaN | 2046000.0 | NaN | NaN | 529650.828377 | NaN | NaN |
6 | 1950-01-11 | 17.090000 | 17.090000 | 17.090000 | 17.090000 | 2630000.0 | 17.090000 | 16.974 | NaN | NaN | 0.089051 | NaN | NaN | NaN | 2226000.0 | NaN | NaN | 298043.620968 | NaN | NaN |
7 | 1950-01-12 | 16.760000 | 16.760000 | 16.760000 | 16.760000 | 2970000.0 | 16.760000 | 17.022 | NaN | NaN | 0.067602 | NaN | NaN | NaN | 2374000.0 | NaN | NaN | 272084.545684 | NaN | NaN |
8 | 1950-01-13 | 16.670000 | 16.670000 | 16.670000 | 16.670000 | 3330000.0 | 16.670000 | 16.988 | NaN | NaN | 0.134796 | NaN | NaN | NaN | 2458000.0 | NaN | NaN | 382452.611444 | NaN | NaN |
9 | 1950-01-16 | 16.719999 | 16.719999 | 16.719999 | 16.719999 | 1460000.0 | 16.719999 | 16.926 | NaN | NaN | 0.196545 | NaN | NaN | NaN | 2722000.0 | NaN | NaN | 446172.612337 | NaN | NaN |
To explain what I did: df.rolling() groups 5 consecutive rows into a window, .mean() calculates the mean of each window, and .shift(1) moves every result down one index, so each row's average covers the 5 days before it rather than including it.
So the first iteration in the Close column would be
(16.660 + 16.850 + 16.930 + 16.980 + 17.080) / 5 = 16.900
(index[0] + index[1] + index[2] + index[3] + index[4]) / 5 = Avg_5day, saved in index[5]
In the second iteration the first index would move from 0 to 1
(16.850 + 16.930 + 16.980 + 17.080 + 17.030) / 5 = 16.974
(index[1] + index[2] + index[3] + index[4] + index[5]) / 5 = Avg_5day, saved in index[6], and so on
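We can confirm this against the dataframe (assuming the cells above have been run):

first_avg = (16.660 + 16.850 + 16.930 + 16.980 + 17.080) / 5
print(first_avg)                  # 16.9
print(stocks.loc[5, 'Avg_5day'])  # also 16.9, stored at index 5 as described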
For the Avg_1yr and ratio_dy columns, the first 365 rows are empty because there isn't a full year of history before the data begins in 1950. Once we drop those rows, the data will start in 1951.
stocks = stocks[stocks['Date'] > dt.datetime(year=1951, month=1, day=1)]
stocks = stocks.dropna(axis=0)
stocks.head()
| | Date | Open | High | Low | Close | Volume | Adj Close | Avg_5day | Avg_30day | Avg_1yr | Std_5day | Std_30day | Std_1yr | ratio_dy | AvgVol_5day | AvgVol_30day | AvgVol_1yr | StdVol_5day | StdVol_30day | StdVol_1yr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
365 | 1951-06-19 | 22.020000 | 22.020000 | 22.020000 | 22.020000 | 1100000.0 | 22.020000 | 21.800 | 21.703333 | 19.447726 | 0.256223 | 0.473595 | 1.790253 | 1.120954 | 1196000.0 | 1.707667e+06 | 1.989479e+06 | 142232.204511 | 1.566790e+06 | 772310.649463 |
366 | 1951-06-20 | 21.910000 | 21.910000 | 21.910000 | 21.910000 | 1120000.0 | 21.910000 | 21.900 | 21.683000 | 19.462411 | 0.213659 | 0.444648 | 1.789307 | 1.125246 | 1176000.0 | 1.691667e+06 | 1.989041e+06 | 148425.065269 | 1.570585e+06 | 772771.102512 |
367 | 1951-06-21 | 21.780001 | 21.780001 | 21.780001 | 21.780001 | 1100000.0 | 21.780001 | 21.972 | 21.659667 | 19.476274 | 0.092574 | 0.411452 | 1.788613 | 1.128142 | 1188000.0 | 1.675667e+06 | 1.986932e+06 | 138816.425541 | 1.573993e+06 | 774092.100761 |
368 | 1951-06-22 | 21.549999 | 21.549999 | 21.549999 | 21.549999 | 1340000.0 | 21.549999 | 21.960 | 21.631000 | 19.489562 | 0.115108 | 0.368514 | 1.787659 | 1.126757 | 1148000.0 | 1.647000e+06 | 1.982959e+06 | 126767.503722 | 1.576465e+06 | 774914.749625 |
369 | 1951-06-25 | 21.290001 | 21.290001 | 21.290001 | 21.290001 | 2440000.0 | 21.290001 | 21.862 | 21.599000 | 19.502082 | 0.204132 | 0.329130 | 1.786038 | 1.121008 | 1142000.0 | 1.636333e+06 | 1.981123e+06 | 113666.177907 | 1.577456e+06 | 775643.723180 |
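One thing to keep in mind: the 365-row window counts trading days, not calendar days, so the first complete one-year average doesn't appear until mid-June 1951, which is why the dataframe now starts at 1951-06-19 rather than in January. A quick check:

print(stocks['Date'].min())  # 1951-06-19, the first row with a full year of history behind it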
To train the linear regression model, we need to split the dataframe into train and test sets. Everything before 2013-01-01 will be train, and any row with a date greater than or equal to 2013-01-01 will be test.
Also, we need to select columns (features) that don't contain knowledge of the "future", so in the end the columns we will use are the ones we created.
train = stocks[stocks['Date'] < dt.datetime(year = 2013, month = 1, day = 1)]
test = stocks[stocks['Date'] >= dt.datetime(year = 2013, month = 1, day = 1)]
cols = ['Avg_5day', 'Avg_30day', 'Avg_1yr', 'Std_5day', 'Std_30day', 'Std_1yr',
'AvgVol_5day', 'AvgVol_30day', 'AvgVol_1yr', 'StdVol_5day', 'StdVol_30day', 'StdVol_1yr']
target = 'Close'
lr = LinearRegression()
lr.fit(train[cols], train[target])
train_pred = lr.predict(train[cols])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)
test_pred = lr.predict(test[cols])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)
print('TRAIN ERRORS')
print('Train mse:', train_mse)
print('Train rmse:', train_rmse)
print('---'*15)
print('TEST ERRORS')
print('Test mse:', test_mse)
print('Test rmse:', test_rmse)
TRAIN ERRORS
Train mse: 107.86383615103183
Train rmse: 10.385751592977362
---------------------------------------------
TEST ERRORS
Test mse: 496.2931629544718
Test rmse: 22.277638181694034
plt.scatter(train[target], train_pred, color='yellow')
plt.scatter(test[target], test_pred, color='red')
plt.xlabel('Close Real')
plt.ylabel('Close Predicted')
plt.title('Real Vs Predicted values')
plt.legend(['Train', 'Test'])
Looks like the model is overfit. My theory is that we are using data from 1951 to 2012 to predict 2013 onward only. Let's train on data from 2011 to 2012 instead to predict 2013 onward, and see if the scores are better.
train = stocks[(stocks['Date'] >= dt.datetime(year = 2011, month = 1, day = 1)) & (stocks['Date'] < dt.datetime(year = 2013, month = 1, day = 1))]
test = stocks[stocks['Date'] >= dt.datetime(year = 2013, month = 1, day = 1)]
cols = ['Avg_5day', 'Avg_30day', 'Avg_1yr', 'Std_5day', 'Std_30day', 'Std_1yr',
'AvgVol_5day', 'AvgVol_30day', 'AvgVol_1yr', 'StdVol_5day', 'StdVol_30day', 'StdVol_1yr']
target = 'Close'
lr = LinearRegression()
lr.fit(train[cols], train[target])
train_pred = lr.predict(train[cols])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)
test_pred = lr.predict(test[cols])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)
print('TRAIN ERRORS')
print('Train mse:', train_mse)
print('Train rmse:', train_rmse)
print('---'*15)
print('TEST ERRORS')
print('Test mse:', test_mse)
print('Test rmse:', test_rmse)
TRAIN ERRORS
Train mse: 406.3002943336934
Train rmse: 20.156891980999784
---------------------------------------------
TEST ERRORS
Test mse: 4239.565533480189
Test rmse: 65.11194616566294
plt.scatter(train[target], train_pred, color='yellow')
plt.scatter(test[target], test_pred, color='red')
plt.xlabel('Close Real')
plt.ylabel('Close Predicted')
plt.title('Real Vs Predicted values')
plt.legend(['Train', 'Test'])
Nope, the model got even more overfit than before. That said, I think predicting the stock market is one of the hardest problems there is: it's not only about economic factors, since things like politics, wars, and natural disasters can also move the price of a stock. Interpreting the errors: the $496.29 figure is the MSE, which is in squared dollars, so the RMSE is the better measure of typical error. For the first model, the test RMSE of $22.28 means our predictions are over/under the actual close by about $22 on average, roughly 1% of the mean test-period value ($2,131.82). I think that if we had data about the demand for the stock we could have had a better result.
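To put those errors in perspective, here is a small sketch using the figures printed above (the mean test-period close of $2,131.82 is the one quoted in the discussion):

mean_close = 2131.82  # mean close of the test period, quoted above
for name, rmse in [('1951-2012 training set', 22.28), ('2011-2012 training set', 65.11)]:
    print(f'{name}: test RMSE is {rmse / mean_close:.2%} of the mean close')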