In this project, we will work with data from the S&P500 Index: we will train a linear regression model on historical data, then evaluate its predictions on a test set of later dates from the same dataset to understand how the model performs. The S&P500 is a stock market index.
Some companies are publicly traded, which means that anyone can buy and sell their shares on the open market. A share entitles the owner to some control over the direction of the company and to a percentage (or share) of the earnings of the company. Buying or selling shares is commonly known as trading a stock.
The price of a share is based on supply and demand for a given stock. For example, Apple stock has a price of 120 dollars per share as of December 2015 -- http://www.nasdaq.com/symbol/aapl. A stock that is in less demand, like Ford Motor Company, has a lower price -- http://finance.yahoo.com/q?s=F. Stock price is also influenced by other factors, including the number of shares a company has issued.
Stocks are traded daily, and the price can rise or fall from the beginning of a trading day to the end based on demand. Stocks that are in more demand, such as Apple, are traded more often than stocks of smaller companies.
Indexes aggregate the prices of multiple stocks together and allow you to see how the market as a whole performs. For example, the Dow Jones Industrial Average aggregates the stock prices of 30 large American companies. The S&P500 Index aggregates the stock prices of 500 large companies. When an index goes up or down, you can say that the primary market or sector it represents is doing the same. For example, if the Dow Jones Industrial Average goes down one day, you can say that American stocks overall went down (i.e., most American stocks went down in price).
There are thousands of traders who make money by buying and selling Exchange Traded Funds (ETFs). ETFs allow you to buy and sell indexes like stocks. This means that you could "buy" an S&P500 Index ETF when the price is low and sell when it's high to make a profit. Creating a predictive model could allow traders to make money on the stock market.
Note: You shouldn't make trades with any models developed in this project. Trading stocks has risks and nothing in this lesson constitutes stock trading advice.
# Import required libraries here
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
sp = pd.read_csv("sphist.csv")
sp.head()
Date | Open | High | Low | Close | Volume | Adj Close | |
---|---|---|---|---|---|---|---|
0 | 2015-12-07 | 2090.419922 | 2090.419922 | 2066.780029 | 2077.070068 | 4.043820e+09 | 2077.070068 |
1 | 2015-12-04 | 2051.239990 | 2093.840088 | 2051.239990 | 2091.689941 | 4.214910e+09 | 2091.689941 |
2 | 2015-12-03 | 2080.709961 | 2085.000000 | 2042.349976 | 2049.620117 | 4.306490e+09 | 2049.620117 |
3 | 2015-12-02 | 2101.709961 | 2104.270020 | 2077.110107 | 2079.510010 | 3.950640e+09 | 2079.510010 |
4 | 2015-12-01 | 2082.929932 | 2103.370117 | 2082.929932 | 2102.629883 | 3.712120e+09 | 2102.629883 |
sp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Date       16590 non-null  object
 1   Open       16590 non-null  float64
 2   High       16590 non-null  float64
 3   Low        16590 non-null  float64
 4   Close      16590 non-null  float64
 5   Volume     16590 non-null  float64
 6   Adj Close  16590 non-null  float64
dtypes: float64(6), object(1)
memory usage: 907.4+ KB
sp["Date"] = pd.to_datetime(sp["Date"])
sp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Date       16590 non-null  datetime64[ns]
 1   Open       16590 non-null  float64
 2   High       16590 non-null  float64
 3   Low        16590 non-null  float64
 4   Close      16590 non-null  float64
 5   Volume     16590 non-null  float64
 6   Adj Close  16590 non-null  float64
dtypes: datetime64[ns](1), float64(6)
memory usage: 907.4 KB
sp = sp.sort_values("Date", ascending=True)
sp.head()
Date | Open | High | Low | Close | Volume | Adj Close | |
---|---|---|---|---|---|---|---|
16589 | 1950-01-03 | 16.66 | 16.66 | 16.66 | 16.66 | 1260000.0 | 16.66 |
16588 | 1950-01-04 | 16.85 | 16.85 | 16.85 | 16.85 | 1890000.0 | 16.85 |
16587 | 1950-01-05 | 16.93 | 16.93 | 16.93 | 16.93 | 2550000.0 | 16.93 |
16586 | 1950-01-06 | 16.98 | 16.98 | 16.98 | 16.98 | 2010000.0 | 16.98 |
16585 | 1950-01-09 | 17.08 | 17.08 | 17.08 | 17.08 | 2520000.0 | 17.08 |
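Sorting oldest-to-newest matters because pandas rolling windows are purely positional: they run over rows in whatever order they appear. A minimal toy sketch (with made-up close values) of how the sort changes what a rolling mean sees:

```python
import pandas as pd

# Toy closes listed newest-first, as in the raw CSV
s = pd.Series([17.08, 16.98, 16.93, 16.85, 16.66])

# On unsorted data, each window mixes in chronologically *future* values
unsorted_means = s.rolling(window=2).mean()

# After sorting oldest-first, each window covers the most recent past rows
sorted_means = s.sort_values().reset_index(drop=True).rolling(window=2).mean()

print(unsorted_means.tolist())  # [nan, 17.03, 16.955, 16.89, 16.755]
print(sorted_means.tolist())    # [nan, 16.755, 16.89, 16.955, 17.03]
```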
Let's compute five primary rolling indicators (three means and two standard deviations), plus two ratio indicators derived from them:
# Calculating the five rolling indicators and assigning them to new columns in the dataframe.
# Shifting each series by 1 so that a given day's indicator uses only past closes and no future information leaks into predictions.
# Means
sp["avg_5_days"] = sp["Close"].rolling(window=5).mean().shift(1)
sp["avg_30_days"] = sp["Close"].rolling(window=30).mean().shift(1)
sp["avg_365_days"] = sp["Close"].rolling(window=365).mean().shift(1)
# Standard Deviations
sp["std_5_days"] = sp["Close"].rolling(window=5).std().shift(1)
sp["std_365_days"] = sp["Close"].rolling(window=365).std().shift(1)
# Calculating derived indicators as ratios of the primary indicators
sp["ratio_avg_5_to_365"] = sp["avg_5_days"]/sp["avg_365_days"]
sp["ratio_std_5_to_365"] = sp["std_5_days"]/sp["std_365_days"]
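The `.shift(1)` is the key leakage guard: it moves each rolling statistic down one row, so day *t*'s indicator is computed only from closes up to day *t-1*. A small sketch with made-up values makes the difference visible:

```python
import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# Without the shift, row 2's 3-day mean includes row 2's own close (leakage)
leaky = close.rolling(window=3).mean()

# With .shift(1), row 3's value is the mean of rows 0-2 -- strictly past data
safe = close.rolling(window=3).mean().shift(1)

print(leaky.tolist())  # [nan, nan, 11.0, 12.0, 13.0]
print(safe.tolist())   # [nan, nan, nan, 11.0, 12.0]
```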
sp.head(10)
Date | Open | High | Low | Close | Volume | Adj Close | avg_5_days | avg_30_days | avg_365_days | std_5_days | std_365_days | ratio_avg_5_to_365 | ratio_std_5_to_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16589 | 1950-01-03 | 16.660000 | 16.660000 | 16.660000 | 16.660000 | 1260000.0 | 16.660000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16588 | 1950-01-04 | 16.850000 | 16.850000 | 16.850000 | 16.850000 | 1890000.0 | 16.850000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16587 | 1950-01-05 | 16.930000 | 16.930000 | 16.930000 | 16.930000 | 2550000.0 | 16.930000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16586 | 1950-01-06 | 16.980000 | 16.980000 | 16.980000 | 16.980000 | 2010000.0 | 16.980000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16585 | 1950-01-09 | 17.080000 | 17.080000 | 17.080000 | 17.080000 | 2520000.0 | 17.080000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16584 | 1950-01-10 | 17.030001 | 17.030001 | 17.030001 | 17.030001 | 2160000.0 | 17.030001 | 16.900 | NaN | NaN | 0.157956 | NaN | NaN | NaN |
16583 | 1950-01-11 | 17.090000 | 17.090000 | 17.090000 | 17.090000 | 2630000.0 | 17.090000 | 16.974 | NaN | NaN | 0.089051 | NaN | NaN | NaN |
16582 | 1950-01-12 | 16.760000 | 16.760000 | 16.760000 | 16.760000 | 2970000.0 | 16.760000 | 17.022 | NaN | NaN | 0.067602 | NaN | NaN | NaN |
16581 | 1950-01-13 | 16.670000 | 16.670000 | 16.670000 | 16.670000 | 3330000.0 | 16.670000 | 16.988 | NaN | NaN | 0.134796 | NaN | NaN | NaN |
16580 | 1950-01-16 | 16.719999 | 16.719999 | 16.719999 | 16.719999 | 1460000.0 | 16.719999 | 16.926 | NaN | NaN | 0.196545 | NaN | NaN | NaN |
sp.tail(10)
Date | Open | High | Low | Close | Volume | Adj Close | avg_5_days | avg_30_days | avg_365_days | std_5_days | std_365_days | ratio_avg_5_to_365 | ratio_std_5_to_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | 2015-11-23 | 2089.409912 | 2095.610107 | 2081.389893 | 2086.590088 | 3.587980e+09 | 2086.590088 | 2071.523974 | 2061.892989 | 2033.605890 | 18.246940 | 64.911334 | 1.018646 | 0.281106 |
8 | 2015-11-24 | 2084.419922 | 2094.120117 | 2070.290039 | 2089.139893 | 3.884930e+09 | 2089.139893 | 2078.204004 | 2064.197327 | 2034.018028 | 15.807754 | 64.768328 | 1.021723 | 0.244066 |
7 | 2015-11-25 | 2089.300049 | 2093.000000 | 2086.300049 | 2088.870117 | 2.852940e+09 | 2088.870117 | 2085.943994 | 2067.045658 | 2034.432712 | 3.491188 | 64.634873 | 1.025320 | 0.054014 |
6 | 2015-11-27 | 2088.820068 | 2093.290039 | 2084.129883 | 2090.110107 | 1.466840e+09 | 2090.110107 | 2087.002002 | 2070.199996 | 2034.835123 | 3.395982 | 64.514871 | 1.025637 | 0.052639 |
5 | 2015-11-30 | 2090.949951 | 2093.810059 | 2080.409912 | 2080.409912 | 4.245030e+09 | 2080.409912 | 2088.776025 | 2072.408333 | 2035.199864 | 1.309055 | 64.449800 | 1.026325 | 0.020311 |
4 | 2015-12-01 | 2082.929932 | 2103.370117 | 2082.929932 | 2102.629883 | 3.712120e+09 | 2102.629883 | 2087.024023 | 2073.984998 | 2035.531178 | 3.916109 | 64.370261 | 1.025297 | 0.060837 |
3 | 2015-12-02 | 2101.709961 | 2104.270020 | 2077.110107 | 2079.510010 | 3.950640e+09 | 2079.510010 | 2090.231982 | 2076.283993 | 2035.914082 | 7.956808 | 64.352527 | 1.026680 | 0.123644 |
2 | 2015-12-03 | 2080.709961 | 2085.000000 | 2042.349976 | 2049.620117 | 4.306490e+09 | 2049.620117 | 2088.306006 | 2077.908659 | 2036.234356 | 9.333599 | 64.277554 | 1.025573 | 0.145208 |
1 | 2015-12-04 | 2051.239990 | 2093.840088 | 2051.239990 | 2091.689941 | 4.214910e+09 | 2091.689941 | 2080.456006 | 2078.931331 | 2036.507343 | 19.599946 | 64.121622 | 1.021580 | 0.305668 |
0 | 2015-12-07 | 2090.419922 | 2090.419922 | 2066.780029 | 2077.070068 | 4.043820e+09 | 2077.070068 | 2080.771973 | 2080.237329 | 2036.869425 | 19.806136 | 64.058862 | 1.021554 | 0.309187 |
Since we are computing indicators that use historical data, there are some rows where there isn't enough historical data to generate them. Some of the indicators use 365 days of historical data and the dataset starts on 1950-01-03. Thus, any rows that fall before 1951-01-03 don't have enough historical data to compute all the indicators. We will need to remove these rows before we split the data.
sp_filtered = sp[sp["Date"] > datetime(year=1951, month=1, day=2)]
sp_filtered.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16340 entries, 16339 to 0
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Date                16340 non-null  datetime64[ns]
 1   Open                16340 non-null  float64
 2   High                16340 non-null  float64
 3   Low                 16340 non-null  float64
 4   Close               16340 non-null  float64
 5   Volume              16340 non-null  float64
 6   Adj Close           16340 non-null  float64
 7   avg_5_days          16340 non-null  float64
 8   avg_30_days         16340 non-null  float64
 9   avg_365_days        16225 non-null  float64
 10  std_5_days          16340 non-null  float64
 11  std_365_days        16225 non-null  float64
 12  ratio_avg_5_to_365  16225 non-null  float64
 13  ratio_std_5_to_365  16225 non-null  float64
dtypes: datetime64[ns](1), float64(13)
memory usage: 1.9 MB
sp_filtered.head()
Date | Open | High | Low | Close | Volume | Adj Close | avg_5_days | avg_30_days | avg_365_days | std_5_days | std_365_days | ratio_avg_5_to_365 | ratio_std_5_to_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16339 | 1951-01-03 | 20.690001 | 20.690001 | 20.690001 | 20.690001 | 3370000.0 | 20.690001 | 20.360000 | 19.815000 | NaN | 0.304385 | NaN | NaN | NaN |
16338 | 1951-01-04 | 20.870001 | 20.870001 | 20.870001 | 20.870001 | 3390000.0 | 20.870001 | 20.514000 | 19.842667 | NaN | 0.204524 | NaN | NaN | NaN |
16337 | 1951-01-05 | 20.870001 | 20.870001 | 20.870001 | 20.870001 | 3390000.0 | 20.870001 | 20.628000 | 19.874000 | NaN | 0.214057 | NaN | NaN | NaN |
16336 | 1951-01-08 | 21.000000 | 21.000000 | 21.000000 | 21.000000 | 2780000.0 | 21.000000 | 20.726001 | 19.907000 | NaN | 0.181879 | NaN | NaN | NaN |
16335 | 1951-01-09 | 21.120001 | 21.120001 | 21.120001 | 21.120001 | 3800000.0 | 21.120001 | 20.840001 | 19.935000 | NaN | 0.117047 | NaN | NaN | NaN |
# Dropping any rows with NaN values
sp_filtered = sp_filtered.dropna(axis=0)
sp_filtered.head()
Date | Open | High | Low | Close | Volume | Adj Close | avg_5_days | avg_30_days | avg_365_days | std_5_days | std_365_days | ratio_avg_5_to_365 | ratio_std_5_to_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16224 | 1951-06-19 | 22.020000 | 22.020000 | 22.020000 | 22.020000 | 1100000.0 | 22.020000 | 21.800 | 21.703333 | 19.447726 | 0.256223 | 1.790253 | 1.120954 | 0.143121 |
16223 | 1951-06-20 | 21.910000 | 21.910000 | 21.910000 | 21.910000 | 1120000.0 | 21.910000 | 21.900 | 21.683000 | 19.462411 | 0.213659 | 1.789307 | 1.125246 | 0.119409 |
16222 | 1951-06-21 | 21.780001 | 21.780001 | 21.780001 | 21.780001 | 1100000.0 | 21.780001 | 21.972 | 21.659667 | 19.476274 | 0.092574 | 1.788613 | 1.128142 | 0.051758 |
16221 | 1951-06-22 | 21.549999 | 21.549999 | 21.549999 | 21.549999 | 1340000.0 | 21.549999 | 21.960 | 21.631000 | 19.489562 | 0.115108 | 1.787659 | 1.126757 | 0.064390 |
16220 | 1951-06-25 | 21.290001 | 21.290001 | 21.290001 | 21.290001 | 2440000.0 | 21.290001 | 21.862 | 21.599000 | 19.502082 | 0.204132 | 1.786038 | 1.121008 | 0.114293 |
sp_filtered.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16225 entries, 16224 to 0
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Date                16225 non-null  datetime64[ns]
 1   Open                16225 non-null  float64
 2   High                16225 non-null  float64
 3   Low                 16225 non-null  float64
 4   Close               16225 non-null  float64
 5   Volume              16225 non-null  float64
 6   Adj Close           16225 non-null  float64
 7   avg_5_days          16225 non-null  float64
 8   avg_30_days         16225 non-null  float64
 9   avg_365_days        16225 non-null  float64
 10  std_5_days          16225 non-null  float64
 11  std_365_days        16225 non-null  float64
 12  ratio_avg_5_to_365  16225 non-null  float64
 13  ratio_std_5_to_365  16225 non-null  float64
dtypes: datetime64[ns](1), float64(13)
memory usage: 1.9 MB
train = sp_filtered[sp_filtered["Date"] < datetime(year=2013,month=1,day=1)]
test = sp_filtered[sp_filtered["Date"] >= datetime(year=2013,month=1,day=1)]
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15486 entries, 16224 to 739
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Date                15486 non-null  datetime64[ns]
 1   Open                15486 non-null  float64
 2   High                15486 non-null  float64
 3   Low                 15486 non-null  float64
 4   Close               15486 non-null  float64
 5   Volume              15486 non-null  float64
 6   Adj Close           15486 non-null  float64
 7   avg_5_days          15486 non-null  float64
 8   avg_30_days         15486 non-null  float64
 9   avg_365_days        15486 non-null  float64
 10  std_5_days          15486 non-null  float64
 11  std_365_days        15486 non-null  float64
 12  ratio_avg_5_to_365  15486 non-null  float64
 13  ratio_std_5_to_365  15486 non-null  float64
dtypes: datetime64[ns](1), float64(13)
memory usage: 1.8 MB
test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 739 entries, 738 to 0
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Date                739 non-null    datetime64[ns]
 1   Open                739 non-null    float64
 2   High                739 non-null    float64
 3   Low                 739 non-null    float64
 4   Close               739 non-null    float64
 5   Volume              739 non-null    float64
 6   Adj Close           739 non-null    float64
 7   avg_5_days          739 non-null    float64
 8   avg_30_days         739 non-null    float64
 9   avg_365_days        739 non-null    float64
 10  std_5_days          739 non-null    float64
 11  std_365_days        739 non-null    float64
 12  ratio_avg_5_to_365  739 non-null    float64
 13  ratio_std_5_to_365  739 non-null    float64
dtypes: datetime64[ns](1), float64(13)
memory usage: 86.6 KB
For the error metric of our linear regression model, let's choose MAE (Mean Absolute Error).
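MAE is simply the average of the absolute differences between actual and predicted values. A toy check (with made-up numbers) of what `mean_absolute_error` computes:

```python
import numpy as np

# MAE = mean(|actual - predicted|); values below are made up
actual = np.array([100.0, 102.0, 101.0])
predicted = np.array([99.0, 104.0, 101.5])

mae = np.mean(np.abs(actual - predicted))
print(mae)  # (1.0 + 2.0 + 0.5) / 3 ≈ 1.1667
```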
# Initialize an instance of linear regression
lr = LinearRegression()
# Choose the feature columns; the target is the Close column
features = sp_filtered.columns.drop(["Close", "High", "Low", "Open", "Volume", "Date", "Adj Close"])
# Fitting the model on the training set
lr.fit(train[features], train["Close"])
# Making predictions on the test set
predictions = lr.predict(test[features])
test = test.reset_index()
test['Predicted_Close'] = predictions
test["Absolute_Error"] = abs(test["Close"]- test["Predicted_Close"])
test["Error %"] = (test["Absolute_Error"]/test["Close"])*100
# Finding the mean absolute error (y_true first, then y_pred)
error = mean_absolute_error(test["Close"], predictions)
test[["Close","Predicted_Close","Absolute_Error","Error %"]].head(20)
Close | Predicted_Close | Absolute_Error | Error % | |
---|---|---|---|---|
0 | 1462.420044 | 1419.354407 | 43.065637 | 2.944820 |
1 | 1459.369995 | 1425.505780 | 33.864215 | 2.320468 |
2 | 1466.469971 | 1433.379736 | 33.090235 | 2.256455 |
3 | 1461.890015 | 1443.360695 | 18.529320 | 1.267491 |
4 | 1457.150024 | 1457.055697 | 0.094327 | 0.006473 |
5 | 1461.020020 | 1464.869450 | 3.849430 | 0.263475 |
6 | 1472.119995 | 1464.460455 | 7.659540 | 0.520307 |
7 | 1472.050049 | 1466.795298 | 5.254751 | 0.356968 |
8 | 1470.680054 | 1467.743572 | 2.936482 | 0.199668 |
9 | 1472.339966 | 1469.485739 | 2.854227 | 0.193857 |
10 | 1472.630005 | 1472.825472 | 0.195467 | 0.013273 |
11 | 1480.939941 | 1475.584981 | 5.354960 | 0.361592 |
12 | 1485.979980 | 1476.966922 | 9.013058 | 0.606540 |
13 | 1492.560059 | 1479.515931 | 13.044128 | 0.873943 |
14 | 1494.810059 | 1483.794688 | 11.015371 | 0.736908 |
15 | 1494.819946 | 1488.391453 | 6.428493 | 0.430051 |
16 | 1502.959961 | 1493.240665 | 9.719296 | 0.646677 |
17 | 1500.180054 | 1497.780202 | 2.399852 | 0.159971 |
18 | 1507.839966 | 1500.830227 | 7.009739 | 0.464886 |
19 | 1501.959961 | 1503.773871 | 1.813910 | 0.120770 |
print("Mean Absolute Error for our Model is : ", error)
print("Mean Absolute Error % : ",np.mean(test["Error %"]))
Mean Absolute Error for our Model is :  16.145140609744132
Mean Absolute Error % :  0.8615617558910121
We can see that, on average, there is about a $16 difference between the predicted and actual closing values, and the mean percentage error is under 1%. We would need to test the model on more recent data to better understand its predictive performance.
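One possible follow-up: `mean_squared_error` was imported at the top but never used. RMSE (the square root of MSE) could complement MAE because squaring penalizes large misses more heavily. A hedged sketch with made-up closing values, computed directly in NumPy (equivalent to `np.sqrt(mean_squared_error(actual, predicted))`):

```python
import numpy as np

# Made-up closing values to illustrate RMSE vs. MAE
actual = np.array([2077.07, 2091.69, 2049.62])
predicted = np.array([2080.77, 2080.46, 2088.31])

mae = np.mean(np.abs(actual - predicted))
rmse = np.sqrt(np.mean((actual - predicted) ** 2))

# RMSE >= MAE always; a large gap between them signals a few big misses
print(mae, rmse)
```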