Guided Project: Predicting the stock market

In this project, we'll be working with data from the S&P500 Index.

We will be using historical data on the price of the S&P500 Index to make predictions about future prices. Predicting whether an index will go up or down helps us forecast how the stock market as a whole will perform, and since stocks tend to track the broader economy, it can also help us make economic forecasts.

We will be working with a CSV file containing index prices. Each row is a daily record of the price of the S&P500 Index from 1950 to 2015. The dataset is stored in sphist.csv.

The columns of the dataset are:

  • Date -- The date of the record.
  • Open -- The opening price of the day (when trading starts).
  • High -- The highest trade price during the day.
  • Low -- The lowest trade price during the day.
  • Close -- The closing price for the day (when trading is finished).
  • Volume -- The number of shares traded.
  • Adj Close -- The daily closing price, adjusted retroactively to include any corporate actions.

We'll be using this dataset to develop a predictive model. We'll train the model with data from 1950-2012, and try to make predictions from 2013-2015.

In [23]:
import pandas as pd
from datetime import datetime
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
In [24]:
#Let's start by reading in the dataset and converting the `Date` column to datetime format:

sp = pd.read_csv('sphist.csv')
sp['Date'] = pd.to_datetime(sp['Date'])
In [25]:
sp.head()
Out[25]:
Date Open High Low Close Volume Adj Close
0 2015-12-07 2090.419922 2090.419922 2066.780029 2077.070068 4.043820e+09 2077.070068
1 2015-12-04 2051.239990 2093.840088 2051.239990 2091.689941 4.214910e+09 2091.689941
2 2015-12-03 2080.709961 2085.000000 2042.349976 2049.620117 4.306490e+09 2049.620117
3 2015-12-02 2101.709961 2104.270020 2077.110107 2079.510010 3.950640e+09 2079.510010
4 2015-12-01 2082.929932 2103.370117 2082.929932 2102.629883 3.712120e+09 2102.629883
In [26]:
sp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       16590 non-null  datetime64[ns]
 1   Open       16590 non-null  float64       
 2   High       16590 non-null  float64       
 3   Low        16590 non-null  float64       
 4   Close      16590 non-null  float64       
 5   Volume     16590 non-null  float64       
 6   Adj Close  16590 non-null  float64       
dtypes: datetime64[ns](1), float64(6)
memory usage: 907.4 KB
In [27]:
#Now that `Date` is a datetime column, we can compare it against a datetime object
sp['after'] = sp['Date'] > datetime(year=2015, month=4, day=1)
sp.head()
Out[27]:
Date Open High Low Close Volume Adj Close after
0 2015-12-07 2090.419922 2090.419922 2066.780029 2077.070068 4.043820e+09 2077.070068 True
1 2015-12-04 2051.239990 2093.840088 2051.239990 2091.689941 4.214910e+09 2091.689941 True
2 2015-12-03 2080.709961 2085.000000 2042.349976 2049.620117 4.306490e+09 2049.620117 True
3 2015-12-02 2101.709961 2104.270020 2077.110107 2079.510010 3.950640e+09 2079.510010 True
4 2015-12-01 2082.929932 2103.370117 2082.929932 2102.629883 3.712120e+09 2102.629883 True
In [28]:
#Let's sort the dataframe by the `Date` column in ascending order

sp = sp.sort_values(by='Date', ascending=True)
sp.head()
Out[28]:
Date Open High Low Close Volume Adj Close after
16589 1950-01-03 16.66 16.66 16.66 16.66 1260000.0 16.66 False
16588 1950-01-04 16.85 16.85 16.85 16.85 1890000.0 16.85 False
16587 1950-01-05 16.93 16.93 16.93 16.93 2550000.0 16.93 False
16586 1950-01-06 16.98 16.98 16.98 16.98 2010000.0 16.98 False
16585 1950-01-09 17.08 17.08 17.08 17.08 2520000.0 17.08 False

Generating indicators

Datasets taken from the stock market need to be handled differently from datasets in other domains when it comes time to make predictions. In a typical machine learning exercise, we treat each row as independent. Stock market data is sequential: each observation comes a day after the previous one, so the observations are not independent and we can't treat them as such.

This means we have to be extra careful not to inject "future" knowledge into past rows when we train and predict. Injecting future knowledge makes a model look good during training and testing, but makes it fail in the real world. This is how many algorithmic traders lose money.
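
As a quick illustration of the pattern used below, a rolling mean followed by shift(1) builds each feature value only from rows strictly before the current one. A toy sketch (not part of the project dataset):

import pandas as pd

# Toy closing prices; each value is only "known" on its own day.
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

leaky = close.rolling(3).mean()           # window includes today's close -> future knowledge leaks into the feature
safe = close.rolling(3).mean().shift(1)   # shifted: each value uses only the previous 3 days

print(leaky.tolist())   # [nan, nan, 11.0, 12.0, 13.0]
print(safe.tolist())    # [nan, nan, nan, 11.0, 12.0]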

In [29]:
#Rolling mean of Close over the previous 5, 30, and 365 trading days
#(shift(1) ensures each value is built only from days before the current row)
sp['day_5'] = sp['Close'].rolling(5).mean().shift(1)
sp['day_30'] = sp['Close'].rolling(30).mean().shift(1)
sp['day_365'] = sp['Close'].rolling(365).mean().shift(1)

#Rolling standard deviation of Close over the previous 5 and 365 trading days
sp['std_5'] = sp['Close'].rolling(5).std().shift(1)
sp['std_365'] = sp['Close'].rolling(365).std().shift(1)

#Rolling mean volume over the previous 5 and 365 trading days
sp['day_5_volume'] = sp['Volume'].rolling(5).mean().shift(1)
sp['day_365_volume'] = sp['Volume'].rolling(365).mean().shift(1)

#Standard deviation of the 5-day average volume over the previous 5 rows
#(computed on day_5_volume, which is already shifted)
sp['5_volume_std'] = sp['day_5_volume'].rolling(5).std().shift(1)
In [30]:
sp.tail(10)
Out[30]:
Date Open High Low Close Volume Adj Close after day_5 day_30 day_365 std_5 std_365 day_5_volume day_365_volume 5_volume_std
9 2015-11-23 2089.409912 2095.610107 2081.389893 2086.590088 3.587980e+09 2086.590088 True 2071.523974 2061.892989 2033.605890 18.246940 64.911334 3.930538e+09 3.523622e+09 6.821252e+07
8 2015-11-24 2084.419922 2094.120117 2070.290039 2089.139893 3.884930e+09 2089.139893 True 2078.204004 2064.197327 2034.018028 15.807754 64.768328 3.899886e+09 3.526334e+09 6.979154e+07
7 2015-11-25 2089.300049 2093.000000 2086.300049 2088.870117 2.852940e+09 2088.870117 True 2085.943994 2067.045658 2034.432712 3.491188 64.634873 3.791402e+09 3.528961e+09 7.278537e+07
6 2015-11-27 2088.820068 2093.290039 2084.129883 2090.110107 1.466840e+09 2090.110107 True 2087.002002 2070.199996 2034.835123 3.395982 64.514871 3.576712e+09 3.528637e+09 1.077890e+08
5 2015-11-30 2090.949951 2093.810059 2080.409912 2080.409912 4.245030e+09 2080.409912 True 2088.776025 2072.408333 2035.199864 1.309055 64.449800 3.144458e+09 3.524258e+09 1.652146e+08
4 2015-12-01 2082.929932 2103.370117 2082.929932 2102.629883 3.712120e+09 2102.629883 True 2087.024023 2073.984998 2035.531178 3.916109 64.370261 3.207544e+09 3.527800e+09 3.241733e+08
3 2015-12-02 2101.709961 2104.270020 2077.110107 2079.510010 3.950640e+09 2079.510010 True 2090.231982 2076.283993 2035.914082 7.956808 64.352527 3.232372e+09 3.526090e+09 3.390314e+08
2 2015-12-03 2080.709961 2085.000000 2042.349976 2049.620117 4.306490e+09 2049.620117 True 2088.306006 2077.908659 2036.234356 9.333599 64.277554 3.245514e+09 3.529468e+09 2.803620e+08
1 2015-12-04 2051.239990 2093.840088 2051.239990 2091.689941 4.214910e+09 2091.689941 True 2080.456006 2078.931331 2036.507343 19.599946 64.121622 3.536224e+09 3.532802e+09 1.696382e+08
0 2015-12-07 2090.419922 2090.419922 2066.780029 2077.070068 4.043820e+09 2077.070068 True 2080.771973 2080.237329 2036.869425 19.806136 64.058862 4.085838e+09 3.535838e+09 1.520693e+08

Splitting up the data

Since we are computing indicators that use historical data, some rows don't have enough history to generate them. The longest indicator uses a 365-row window, and because the rolling windows count trading days rather than calendar days, the earliest rows in the dataset (which starts on 1950-01-03) can't have all the indicators computed. We'll remove the rows dated before 1951-01-03 and then drop any remaining rows with missing values before we split the data.
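
As a quick sanity check (a minimal sketch using the day_365 column created above), we can look up the first date on which the longest indicator is actually available:

# First date with a non-null 365-day rolling mean; because rolling(365)
# counts rows (trading days) rather than calendar days, this lands in mid-1951.
print(sp[sp['day_365'].notnull()]['Date'].min())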

In [31]:
sp = sp[sp['Date'] > datetime(year=1951, month=1, day=2)]
In [32]:
sp = sp.dropna(axis=0)
In [33]:
sp.head()
Out[33]:
Date Open High Low Close Volume Adj Close after day_5 day_30 day_365 std_5 std_365 day_5_volume day_365_volume 5_volume_std
16224 1951-06-19 22.020000 22.020000 22.020000 22.020000 1100000.0 22.020000 False 21.800 21.703333 19.447726 0.256223 1.790253 1196000.0 1.989479e+06 28982.753492
16223 1951-06-20 21.910000 21.910000 21.910000 21.910000 1120000.0 21.910000 False 21.900 21.683000 19.462411 0.213659 1.789307 1176000.0 1.989041e+06 29339.393313
16222 1951-06-21 21.780001 21.780001 21.780001 21.780001 1100000.0 21.780001 False 21.972 21.659667 19.476274 0.092574 1.788613 1188000.0 1.986932e+06 29610.808837
16221 1951-06-22 21.549999 21.549999 21.549999 21.549999 1340000.0 21.549999 False 21.960 21.631000 19.489562 0.115108 1.787659 1148000.0 1.982959e+06 27334.959301
16220 1951-06-25 21.290001 21.290001 21.290001 21.290001 2440000.0 21.290001 False 21.862 21.599000 19.502082 0.204132 1.786038 1142000.0 1.981123e+06 29879.759035

Let's now generate two new dataframes to train and test our algorithm. train will contain the rows with a date earlier than 2013-01-01, and test will contain the rows with a date on or after 2013-01-01.

In [34]:
train = sp[sp['Date'] < datetime(year=2013, month=1, day=1)]
test = sp[sp['Date'] >= datetime(year=2013, month=1, day=1)]

Making predictions

We are now ready to train the algorithm, make predictions, and calculate the mean squared error (MSE). Our target column is Close.

In [35]:
features = ['day_5', 'day_30', 'day_365', 'std_5', 'std_365',
            'day_5_volume', 'day_365_volume', '5_volume_std']

lr = LinearRegression()
lr.fit(train[features], train['Close'])
predictions = lr.predict(test[features])
mse = mean_squared_error(test['Close'], predictions)
In [36]:
mse
Out[36]:
494.66054052414876
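
To put this number in context, the square root of the MSE is in the same units as Close (roughly 22 index points here), and a naive baseline that predicts each day's close as the previous day's close gives a point of comparison. A rough sketch using the test set from above:

rmse = mse ** 0.5   # error expressed in index points

# Naive baseline: predict today's close as the previous trading day's close.
naive = test['Close'].shift(1).dropna()
naive_mse = mean_squared_error(test['Close'].iloc[1:], naive)

print(rmse, naive_mse)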

Let's now make a prediction just one day ahead, training on every row except the very last one.

In [37]:
train_1 = sp.iloc[:-1]
test_1 = sp.iloc[-1:]
In [47]:
lr = LinearRegression()
lr.fit(train_1[features], train_1['Close'])
predictions_1 = lr.predict(test_1[features])
mse_1 = mean_squared_error(test_1['Close'], predictions_1)
In [48]:
mse_1
Out[48]:
9.629910205317884
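
Because test_1 contains a single row, mse_1 is just the squared error of one prediction. A more robust one-day-ahead check (a rough sketch reusing the features list defined above) is to walk forward over the last trading days, retraining on everything before each day:

errors = []
for i in range(len(sp) - 100, len(sp)):   # walk over the last 100 trading days
    model = LinearRegression()
    model.fit(sp.iloc[:i][features], sp.iloc[:i]['Close'])
    pred = model.predict(sp.iloc[i:i + 1][features])
    errors.append((sp.iloc[i]['Close'] - pred[0]) ** 2)

walk_forward_mse = np.mean(errors)
print(walk_forward_mse)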