Predicting The Stock Market

In this project we will work with historical data from the S&P500 index. This index, measure the stock performance of 500 large companies listed on stock exchanges in the US. It is one of the most commonly followed equity indices, and many consider it to be one of the best representation of the US Stock market.

When an index fund like S&P500 or the Down Jones Industrial Average (DJIA or Dow30) goes up or down you can say that the underlying market or sector it represents is also going up or down. For example if the DJIA goes down one day, you can say that American stocks overall went down.

We will use S&P500 historical data to make predictions about future prices. Predicting wheter an index will go up or down will help us forecast how the stock market as a whole will perform. Since stocks tend to correlate with how well the economy as a whole is performing it can also help us to make economic forecasts.

However it is recomended that we shouldn't make trades with the model developed in this project. Trading stocks has risks, and nothing we do here constitutes a strock trading advice

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
In [2]:
stocks = pd.read_csv('sphist.csv')
stocks.head()
Out[2]:
Date Open High Low Close Volume Adj Close
0 2015-12-07 2090.419922 2090.419922 2066.780029 2077.070068 4.043820e+09 2077.070068
1 2015-12-04 2051.239990 2093.840088 2051.239990 2091.689941 4.214910e+09 2091.689941
2 2015-12-03 2080.709961 2085.000000 2042.349976 2049.620117 4.306490e+09 2049.620117
3 2015-12-02 2101.709961 2104.270020 2077.110107 2079.510010 3.950640e+09 2079.510010
4 2015-12-01 2082.929932 2103.370117 2082.929932 2102.629883 3.712120e+09 2102.629883
In [3]:
stocks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       16590 non-null  object 
 1   Open       16590 non-null  float64
 2   High       16590 non-null  float64
 3   Low        16590 non-null  float64
 4   Close      16590 non-null  float64
 5   Volume     16590 non-null  float64
 6   Adj Close  16590 non-null  float64
dtypes: float64(6), object(1)
memory usage: 907.4+ KB

The data has 16590 rows of data. We also found that the Date column wasn't recognized as a datetime so we need to transform it. So far we don't have any NaN data

In [4]:
stocks['Date'] = pd.to_datetime(stocks['Date'])
stocks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       16590 non-null  datetime64[ns]
 1   Open       16590 non-null  float64       
 2   High       16590 non-null  float64       
 3   Low        16590 non-null  float64       
 4   Close      16590 non-null  float64       
 5   Volume     16590 non-null  float64       
 6   Adj Close  16590 non-null  float64       
dtypes: datetime64[ns](1), float64(6)
memory usage: 907.4 KB
In [5]:
print(min(stocks['Date']))
print(max(stocks['Date']))
1950-01-03 00:00:00
2015-12-07 00:00:00

The data goes from January 3rd, 1950 to December 7th, 2015. Let's order them in ascending order

In [6]:
stocks = stocks.sort_values(by='Date', ascending=True).reset_index()
stocks = stocks.drop('index', axis=1)
stocks.head()
Out[6]:
Date Open High Low Close Volume Adj Close
0 1950-01-03 16.66 16.66 16.66 16.66 1260000.0 16.66
1 1950-01-04 16.85 16.85 16.85 16.85 1890000.0 16.85
2 1950-01-05 16.93 16.93 16.93 16.93 2550000.0 16.93
3 1950-01-06 16.98 16.98 16.98 16.98 2010000.0 16.98
4 1950-01-09 17.08 17.08 17.08 17.08 2520000.0 17.08

Datasets from the stock market need to be handled differently than datasets from other sector when it comes to make predictions. In a normal Machine Learning (ML) exercise, we treat each row as independent, however, stock market is sequential, and each observation comes a day after the previous observation. Thus, the observations are not at all independent.

The time series nature of the data means that can generate indicators to make our model more accurate. For example, we can create a new column that contains the average price of the last 10 trades for each row. This will incorporate information from multiple pior rows into one, and will make predictions much more accurate.

When we do this, we have to be careful not to use the current row in the values we average. We want to teach the model how to predict the current price from historical prices. If we include the current price in the prices we average, it will be equivalent to handing the answers to the model, and will make it impossible to use in the "real world", where we don't know the price upfront.

Let's calculate the 5 day and 1 year average in the close and volume columns

Calculating average columns

In [7]:
stocks['Avg_5day'] = stocks['Close'].rolling(window=5).mean().shift(1)
stocks['Avg_30day'] = stocks['Close'].rolling(window=30).mean().shift(1)
stocks['Avg_1yr'] = stocks['Close'].rolling(window=365).mean().shift(1)
stocks['Std_5day'] = stocks['Close'].rolling(window=5).std().shift(1)
stocks['Std_30day'] = stocks['Close'].rolling(window=30).std().shift(1)
stocks['Std_1yr'] = stocks['Close'].rolling(window=365).std().shift(1)
stocks['ratio_dy'] = stocks['Avg_5day'] / stocks['Avg_1yr']
stocks['AvgVol_5day'] = stocks['Volume'].rolling(window=5).mean().shift(1)
stocks['AvgVol_30day'] = stocks['Volume'].rolling(window=30).mean().shift(1)
stocks['AvgVol_1yr'] = stocks['Volume'].rolling(window=365).mean().shift(1)
stocks['StdVol_5day'] = stocks['Volume'].rolling(window=5).std().shift(1)
stocks['StdVol_30day'] = stocks['Volume'].rolling(window=30).std().shift(1)
stocks['StdVol_1yr'] = stocks['Volume'].rolling(window=365).std().shift(1)

stocks.head(10)
Out[7]:
Date Open High Low Close Volume Adj Close Avg_5day Avg_30day Avg_1yr Std_5day Std_30day Std_1yr ratio_dy AvgVol_5day AvgVol_30day AvgVol_1yr StdVol_5day StdVol_30day StdVol_1yr
0 1950-01-03 16.660000 16.660000 16.660000 16.660000 1260000.0 16.660000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1950-01-04 16.850000 16.850000 16.850000 16.850000 1890000.0 16.850000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1950-01-05 16.930000 16.930000 16.930000 16.930000 2550000.0 16.930000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1950-01-06 16.980000 16.980000 16.980000 16.980000 2010000.0 16.980000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1950-01-09 17.080000 17.080000 17.080000 17.080000 2520000.0 17.080000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 1950-01-10 17.030001 17.030001 17.030001 17.030001 2160000.0 17.030001 16.900 NaN NaN 0.157956 NaN NaN NaN 2046000.0 NaN NaN 529650.828377 NaN NaN
6 1950-01-11 17.090000 17.090000 17.090000 17.090000 2630000.0 17.090000 16.974 NaN NaN 0.089051 NaN NaN NaN 2226000.0 NaN NaN 298043.620968 NaN NaN
7 1950-01-12 16.760000 16.760000 16.760000 16.760000 2970000.0 16.760000 17.022 NaN NaN 0.067602 NaN NaN NaN 2374000.0 NaN NaN 272084.545684 NaN NaN
8 1950-01-13 16.670000 16.670000 16.670000 16.670000 3330000.0 16.670000 16.988 NaN NaN 0.134796 NaN NaN NaN 2458000.0 NaN NaN 382452.611444 NaN NaN
9 1950-01-16 16.719999 16.719999 16.719999 16.719999 1460000.0 16.719999 16.926 NaN NaN 0.196545 NaN NaN NaN 2722000.0 NaN NaN 446172.612337 NaN NaN

To explain what i did. I used the method df.rolling() to select 5 rows in the dataframe then the df.mean() calculates the mean of each group of 5 rows, and finally the df.shift() method shift or moves to the next index and repeats the processs.

So the first iteration in the close column would we

(16.660 + 16.850 + 16.930 + 16.980 + 17.080) / 5 = 16.900

(index[0] + index[1] + index[2] + index[3] + index[4]) / 5 = Avg_5day, saved in index[5]

In the second iteration the first index would move from 0 to 1

(16.850 + 16.930 + 16.980 + 17.080 + 17.030) / 5 = 16.974

(index[1] + index[2] + index[3] + index[4] + index[5]) / 5 = Avg_5day, saved in index[6], and so on

For the ratio_dy column the first 365 rows would be empty because there isn't an average before 1950, so when we drop the data, the first year would be 1951

Removing 1950 and NaN data

In [8]:
stocks = stocks[stocks['Date'] > dt.datetime(year=1951, month=1, day=1)]
stocks.dropna(axis=0, inplace=True)
stocks.head()
Out[8]:
Date Open High Low Close Volume Adj Close Avg_5day Avg_30day Avg_1yr Std_5day Std_30day Std_1yr ratio_dy AvgVol_5day AvgVol_30day AvgVol_1yr StdVol_5day StdVol_30day StdVol_1yr
365 1951-06-19 22.020000 22.020000 22.020000 22.020000 1100000.0 22.020000 21.800 21.703333 19.447726 0.256223 0.473595 1.790253 1.120954 1196000.0 1.707667e+06 1.989479e+06 142232.204511 1.566790e+06 772310.649463
366 1951-06-20 21.910000 21.910000 21.910000 21.910000 1120000.0 21.910000 21.900 21.683000 19.462411 0.213659 0.444648 1.789307 1.125246 1176000.0 1.691667e+06 1.989041e+06 148425.065269 1.570585e+06 772771.102512
367 1951-06-21 21.780001 21.780001 21.780001 21.780001 1100000.0 21.780001 21.972 21.659667 19.476274 0.092574 0.411452 1.788613 1.128142 1188000.0 1.675667e+06 1.986932e+06 138816.425541 1.573993e+06 774092.100761
368 1951-06-22 21.549999 21.549999 21.549999 21.549999 1340000.0 21.549999 21.960 21.631000 19.489562 0.115108 0.368514 1.787659 1.126757 1148000.0 1.647000e+06 1.982959e+06 126767.503722 1.576465e+06 774914.749625
369 1951-06-25 21.290001 21.290001 21.290001 21.290001 2440000.0 21.290001 21.862 21.599000 19.502082 0.204132 0.329130 1.786038 1.121008 1142000.0 1.636333e+06 1.981123e+06 113666.177907 1.577456e+06 775643.723180

Linear Regression

To train the linear regression model, we need to separate the dataframe into train and test. Everything before 2013-01-01 will be test, and any rows with a date greater than or equal to 2013-01-01 will be test

Also, we need to select the columns (or features) that doesn't contain knowledge of the "future", so in the end, the columns that we will use are the ones we created

In [20]:
train = stocks[stocks['Date'] < dt.datetime(year = 2013, month = 1, day = 1)]
test = stocks[stocks['Date'] >= dt.datetime(year = 2013, month = 1, day = 1)]

cols = ['Avg_5day', 'Avg_30day', 'Avg_1yr', 'Std_5day', 'Std_30day', 'Std_1yr',
        'AvgVol_5day', 'AvgVol_30day', 'AvgVol_1yr', 'StdVol_5day', 'StdVol_30day', 'StdVol_1yr']
target = 'Close'

lr = LinearRegression()

lr.fit(train[cols], train[target])

train_pred = lr.predict(train[cols])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)


test_pred = lr.predict(test[cols])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)


print('TRAIN ERRORS')
print('Train mse:', train_mse)
print('Train rmse:', train_rmse)
print('---'*15)
print('TEST ERRORS')
print('Test mse:', test_mse)
print('Test rmse:', test_rmse)
TRAIN ERRORS
Train mse: 107.86383615103183
Train rmse: 10.385751592977362
---------------------------------------------
TEST ERRORS
Test mse: 496.2931629544718
Test rmse: 22.277638181694034
In [37]:
plt.scatter(train[target], train_pred, color='yellow')
plt.scatter(test[target], test_pred, color='red')
plt.xlabel('Close Real')
plt.ylabel('Close Predicted')
plt.title('Real Vs Predicted values')
plt.legend(['Train', 'Test'])
Out[37]:
<matplotlib.legend.Legend at 0x24ab7c3d088>

Looks like the model is overfitted, my theory is that we are using data from 1951 to 2012 to predict data from 2013 only. Let's select data from 2011 to 2012 to predict data from 2013 and se if the scores are better

In [31]:
train = stocks[(stocks['Date'] >= dt.datetime(year = 2011, month = 1, day = 1)) & (stocks['Date'] < dt.datetime(year = 2013, month = 1, day = 1))]
test = stocks[stocks['Date'] >= dt.datetime(year = 2013, month = 1, day = 1)]

cols = ['Avg_5day', 'Avg_30day', 'Avg_1yr', 'Std_5day', 'Std_30day', 'Std_1yr',
        'AvgVol_5day', 'AvgVol_30day', 'AvgVol_1yr', 'StdVol_5day', 'StdVol_30day', 'StdVol_1yr']
target = 'Close'

lr = LinearRegression()

lr.fit(train[cols], train[target])

train_pred = lr.predict(train[cols])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)


test_pred = lr.predict(test[cols])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)


print('TRAIN ERRORS')
print('Train mse:', train_mse)
print('Train rmse:', train_rmse)
print('---'*15)
print('TEST ERRORS')
print('Test mse:', test_mse)
print('Test rmse:', test_rmse)
TRAIN ERRORS
Train mse: 406.3002943336934
Train rmse: 20.156891980999784
---------------------------------------------
TEST ERRORS
Test mse: 4239.565533480189
Test rmse: 65.11194616566294
In [38]:
plt.scatter(train[target], train_pred, color='yellow')
plt.scatter(test[target], test_pred, color='red')
plt.xlabel('Close Real')
plt.ylabel('Close Predicted')
plt.title('Real Vs Predicted values')
plt.legend(['Train', 'Test'])
Out[38]:
<matplotlib.legend.Legend at 0x24ab7c99308>

Nope, the model even got more overfitted than before. However i think that predicting the stock market it's one of the hardest things ever, it's not only about the economy factors different things like politics, wars, natural dissasters can affect the price of the stock. If i'm analyzing the mse correctly, we are over/under the mean value of the stock ($2131.82) by $496.29, that's 23.28%. I think that if we had data about the demand of the stock we could have had a better result

In [ ]: