In this project, we will work with data from the S&P500 Index: we will train a linear regression model on historical data, then evaluate its predictions on a test set of later dates from the same dataset to understand how the model performs. The S&P500 is a stock market index.
Some companies are publicly traded, which means that anyone can buy and sell their shares on the open market. A share entitles the owner to some control over the direction of the company and to a percentage (or share) of the earnings of the company. Buying or selling shares is commonly known as trading a stock.
The price of a share is based on supply and demand for a given stock. For example, Apple stock has a price of 120 dollars per share as of December 2015 -- http://www.nasdaq.com/symbol/aapl. A stock that is in less demand, like Ford Motor Company, has a lower price -- http://finance.yahoo.com/q?s=F. Stock price is also influenced by other factors, including the number of shares a company has issued.
Stocks are traded daily, and the price can rise or fall from the beginning of a trading day to the end based on demand. Stocks that are in more demand, such as Apple, are traded more often than stocks of smaller companies.
Indexes aggregate the prices of multiple stocks together and allow you to see how the market as a whole performs. For example, the Dow Jones Industrial Average aggregates the stock prices of 30 large American companies. The S&P500 Index aggregates the stock prices of 500 large companies. When an index goes up or down, you can say that the primary market or sector it represents is doing the same. For example, if the Dow Jones Industrial Average goes down one day, you can say that American stocks overall went down (i.e., most American stocks went down in price).
There are thousands of traders who make money by buying and selling Exchange Traded Funds (ETFs). ETFs allow you to buy and sell indexes like stocks. This means that you could "buy" an S&P500 Index ETF when the price is low and sell when it's high to make a profit. Creating a predictive model could allow traders to make money on the stock market.
Note: You shouldn't make trades with any models developed in this project. Trading stocks has risks and nothing in this lesson constitutes stock trading advice.
# Import required libraries here
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
sp = pd.read_csv("sphist.csv")
sp.head()
Date | Open | High | Low | Close | Volume | Adj Close | |
---|---|---|---|---|---|---|---|
0 | 2015-12-07 | 2090.419922 | 2090.419922 | 2066.780029 | 2077.070068 | 4.043820e+09 | 2077.070068 |
1 | 2015-12-04 | 2051.239990 | 2093.840088 | 2051.239990 | 2091.689941 | 4.214910e+09 | 2091.689941 |
2 | 2015-12-03 | 2080.709961 | 2085.000000 | 2042.349976 | 2049.620117 | 4.306490e+09 | 2049.620117 |
3 | 2015-12-02 | 2101.709961 | 2104.270020 | 2077.110107 | 2079.510010 | 3.950640e+09 | 2079.510010 |
4 | 2015-12-01 | 2082.929932 | 2103.370117 | 2082.929932 | 2102.629883 | 3.712120e+09 | 2102.629883 |
sp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Date       16590 non-null  object
 1   Open       16590 non-null  float64
 2   High       16590 non-null  float64
 3   Low        16590 non-null  float64
 4   Close      16590 non-null  float64
 5   Volume     16590 non-null  float64
 6   Adj Close  16590 non-null  float64
dtypes: float64(6), object(1)
memory usage: 907.4+ KB
sp["Date"] = pd.to_datetime(sp["Date"])
sp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Date       16590 non-null  datetime64[ns]
 1   Open       16590 non-null  float64
 2   High       16590 non-null  float64
 3   Low        16590 non-null  float64
 4   Close      16590 non-null  float64
 5   Volume     16590 non-null  float64
 6   Adj Close  16590 non-null  float64
dtypes: datetime64[ns](1), float64(6)
memory usage: 907.4 KB
sp = sp.sort_values("Date", ascending=True)
sp.head()
Date | Open | High | Low | Close | Volume | Adj Close | |
---|---|---|---|---|---|---|---|
16589 | 1950-01-03 | 16.66 | 16.66 | 16.66 | 16.66 | 1260000.0 | 16.66 |
16588 | 1950-01-04 | 16.85 | 16.85 | 16.85 | 16.85 | 1890000.0 | 16.85 |
16587 | 1950-01-05 | 16.93 | 16.93 | 16.93 | 16.93 | 2550000.0 | 16.93 |
16586 | 1950-01-06 | 16.98 | 16.98 | 16.98 | 16.98 | 2010000.0 | 16.98 |
16585 | 1950-01-09 | 17.08 | 17.08 | 17.08 | 17.08 | 2520000.0 | 17.08 |
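Sorting oldest-to-newest matters because pandas rolling windows are purely positional: they run over rows in whatever order they appear. A minimal toy sketch (with made-up close values) of how the sort changes what a rolling mean sees:

```python
import pandas as pd

# Toy closes listed newest-first, as in the raw CSV
s = pd.Series([17.08, 16.98, 16.93, 16.85, 16.66])

# On unsorted data, each window mixes in chronologically *future* values
unsorted_means = s.rolling(window=2).mean()

# After sorting oldest-first, each window covers the most recent past rows
sorted_means = s.sort_values().reset_index(drop=True).rolling(window=2).mean()

print(unsorted_means.tolist())  # [nan, 17.03, 16.955, 16.89, 16.755]
print(sorted_means.tolist())    # [nan, 16.755, 16.89, 16.955, 17.03]
```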
Let's compute five primary rolling indicators (three means and two standard deviations), plus two ratio indicators derived from them:
# Calculating the five rolling indicators and assigning them to new columns in the dataframe.
# Shifting each series by 1 so that a given day's indicator uses only past closes and no future information leaks into predictions.
# Means
sp["avg_5_days"] = sp["Close"].rolling(window=5).mean().shift(1)
sp["avg_30_days"] = sp["Close"].rolling(window=30).mean().shift(1)
sp["avg_365_days"] = sp["Close"].rolling(window=365).mean().shift(1)
# Standard Deviations
sp["std_5_days"] = sp["Close"].rolling(window=5).std().shift(1)
sp["std_365_days"] = sp["Close"].rolling(window=365).std().shift(1)
# Calculating derived indicators as ratios of the primary indicators
sp["ratio_avg_5_to_365"] = sp["avg_5_days"]/sp["avg_365_days"]
sp["ratio_std_5_to_365"] = sp["std_5_days"]/sp["std_365_days"]
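The `.shift(1)` is the key leakage guard: it moves each rolling statistic down one row, so day *t*'s indicator is computed only from closes up to day *t-1*. A small sketch with made-up values makes the difference visible:

```python
import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# Without the shift, row 2's 3-day mean includes row 2's own close (leakage)
leaky = close.rolling(window=3).mean()

# With .shift(1), row 3's value is the mean of rows 0-2 -- strictly past data
safe = close.rolling(window=3).mean().shift(1)

print(leaky.tolist())  # [nan, nan, 11.0, 12.0, 13.0]
print(safe.tolist())   # [nan, nan, nan, 11.0, 12.0]
```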
sp.head(10)
Date | Open | High | Low | Close | Volume | Adj Close | avg_5_days | avg_30_days | avg_365_days | std_5_days | std_365_days | ratio_avg_5_to_365 | ratio_std_5_to_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16589 | 1950-01-03 | 16.660000 | 16.660000 | 16.660000 | 16.660000 | 1260000.0 | 16.660000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16588 | 1950-01-04 | 16.850000 | 16.850000 | 16.850000 | 16.850000 | 1890000.0 | 16.850000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16587 | 1950-01-05 | 16.930000 | 16.930000 | 16.930000 | 16.930000 | 2550000.0 | 16.930000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16586 | 1950-01-06 | 16.980000 | 16.980000 | 16.980000 | 16.980000 | 2010000.0 | 16.980000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16585 | 1950-01-09 | 17.080000 | 17.080000 | 17.080000 | 17.080000 | 2520000.0 | 17.080000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16584 | 1950-01-10 | 17.030001 | 17.030001 | 17.030001 | 17.030001 | 2160000.0 | 17.030001 | 16.900 | NaN | NaN | 0.157956 | NaN | NaN | NaN |
16583 | 1950-01-11 | 17.090000 | 17.090000 | 17.090000 | 17.090000 | 2630000.0 | 17.090000 | 16.974 | NaN | NaN | 0.089051 | NaN | NaN | NaN |
16582 | 1950-01-12 | 16.760000 | 16.760000 | 16.760000 | 16.760000 | 2970000.0 | 16.760000 | 17.022 | NaN | NaN | 0.067602 | NaN | NaN | NaN |
16581 | 1950-01-13 | 16.670000 | 16.670000 | 16.670000 | 16.670000 | 3330000.0 | 16.670000 | 16.988 | NaN | NaN | 0.134796 | NaN | NaN | NaN |
16580 | 1950-01-16 | 16.719999 | 16.719999 | 16.719999 | 16.719999 | 1460000.0 | 16.719999 | 16.926 | NaN | NaN | 0.196545 | NaN | NaN | NaN |
sp.tail(10)
Date | Open | High | Low | Close | Volume | Adj Close | avg_5_days | avg_30_days | avg_365_days | std_5_days | std_365_days | ratio_avg_5_to_365 | ratio_std_5_to_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | 2015-11-23 | 2089.409912 | 2095.610107 | 2081.389893 | 2086.590088 | 3.587980e+09 | 2086.590088 | 2071.523974 | 2061.892989 | 2033.605890 | 18.246940 | 64.911334 | 1.018646 | 0.281106 |
8 | 2015-11-24 | 2084.419922 | 2094.120117 | 2070.290039 | 2089.139893 | 3.884930e+09 | 2089.139893 | 2078.204004 | 2064.197327 | 2034.018028 | 15.807754 | 64.768328 | 1.021723 | 0.244066 |
7 | 2015-11-25 | 2089.300049 | 2093.000000 | 2086.300049 | 2088.870117 | 2.852940e+09 | 2088.870117 | 2085.943994 | 2067.045658 | 2034.432712 | 3.491188 | 64.634873 | 1.025320 | 0.054014 |
6 | 2015-11-27 | 2088.820068 | 2093.290039 | 2084.129883 | 2090.110107 | 1.466840e+09 | 2090.110107 | 2087.002002 | 2070.199996 | 2034.835123 | 3.395982 | 64.514871 | 1.025637 | 0.052639 |
5 | 2015-11-30 | 2090.949951 | 2093.810059 | 2080.409912 | 2080.409912 | 4.245030e+09 | 2080.409912 | 2088.776025 | 2072.408333 | 2035.199864 | 1.309055 | 64.449800 | 1.026325 | 0.020311 |
4 | 2015-12-01 | 2082.929932 | 2103.370117 | 2082.929932 | 2102.629883 | 3.712120e+09 | 2102.629883 | 2087.024023 | 2073.984998 | 2035.531178 | 3.916109 | 64.370261 | 1.025297 | 0.060837 |
3 | 2015-12-02 | 2101.709961 | 2104.270020 | 2077.110107 | 2079.510010 | 3.950640e+09 | 2079.510010 | 2090.231982 | 2076.283993 | 2035.914082 | 7.956808 | 64.352527 | 1.026680 | 0.123644 |
2 | 2015-12-03 | 2080.709961 | 2085.000000 | 2042.349976 | 2049.620117 | 4.306490e+09 | 2049.620117 | 2088.306006 | 2077.908659 | 2036.234356 | 9.333599 | 64.277554 | 1.025573 | 0.145208 |
1 | 2015-12-04 | 2051.239990 | 2093.840088 | 2051.239990 | 2091.689941 | 4.214910e+09 | 2091.689941 | 2080.456006 | 2078.931331 | 2036.507343 | 19.599946 | 64.121622 | 1.021580 | 0.305668 |
0 | 2015-12-07 | 2090.419922 | 2090.419922 | 2066.780029 | 2077.070068 | 4.043820e+09 | 2077.070068 | 2080.771973 | 2080.237329 | 2036.869425 | 19.806136 | 64.058862 | 1.021554 | 0.309187 |
Since we are computing indicators that use historical data, there are some rows where there isn't enough historical data to generate them. Some of the indicators use 365 days of historical data and the dataset starts on 1950-01-03. Thus, any rows that fall before 1951-01-03 don't have enough historical data to compute all the indicators. We will need to remove these rows before we split the data.
sp_filtered = sp[sp["Date"] > datetime(year=1951, month=1, day=2)]
sp_filtered.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16340 entries, 16339 to 0
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Date                16340 non-null  datetime64[ns]
 1   Open                16340 non-null  float64
 2   High                16340 non-null  float64
 3   Low                 16340 non-null  float64
 4   Close               16340 non-null  float64
 5   Volume              16340 non-null  float64
 6   Adj Close           16340 non-null  float64
 7   avg_5_days          16340 non-null  float64
 8   avg_30_days         16340 non-null  float64
 9   avg_365_days        16225 non-null  float64
 10  std_5_days          16340 non-null  float64
 11  std_365_days        16225 non-null  float64
 12  ratio_avg_5_to_365  16225 non-null  float64
 13  ratio_std_5_to_365  16225 non-null  float64
dtypes: datetime64[ns](1), float64(13)
memory usage: 1.9 MB
sp_filtered.head()
Date | Open | High | Low | Close | Volume | Adj Close | avg_5_days | avg_30_days | avg_365_days | std_5_days | std_365_days | ratio_avg_5_to_365 | ratio_std_5_to_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16339 | 1951-01-03 | 20.690001 | 20.690001 | 20.690001 | 20.690001 | 3370000.0 | 20.690001 | 20.360000 | 19.815000 | NaN | 0.304385 | NaN | NaN | NaN |
16338 | 1951-01-04 | 20.870001 | 20.870001 | 20.870001 | 20.870001 | 3390000.0 | 20.870001 | 20.514000 | 19.842667 | NaN | 0.204524 | NaN | NaN | NaN |
16337 | 1951-01-05 | 20.870001 | 20.870001 | 20.870001 | 20.870001 | 3390000.0 | 20.870001 | 20.628000 | 19.874000 | NaN | 0.214057 | NaN | NaN | NaN |
16336 | 1951-01-08 | 21.000000 | 21.000000 | 21.000000 | 21.000000 | 2780000.0 | 21.000000 | 20.726001 | 19.907000 | NaN | 0.181879 | NaN | NaN | NaN |
16335 | 1951-01-09 | 21.120001 | 21.120001 | 21.120001 | 21.120001 | 3800000.0 | 21.120001 | 20.840001 | 19.935000 | NaN | 0.117047 | NaN | NaN | NaN |
# Dropping any rows with NaN values
sp_filtered = sp_filtered.dropna(axis=0)
sp_filtered.head()
Date | Open | High | Low | Close | Volume | Adj Close | avg_5_days | avg_30_days | avg_365_days | std_5_days | std_365_days | ratio_avg_5_to_365 | ratio_std_5_to_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16224 | 1951-06-19 | 22.020000 | 22.020000 | 22.020000 | 22.020000 | 1100000.0 | 22.020000 | 21.800 | 21.703333 | 19.447726 | 0.256223 | 1.790253 | 1.120954 | 0.143121 |
16223 | 1951-06-20 | 21.910000 | 21.910000 | 21.910000 | 21.910000 | 1120000.0 | 21.910000 | 21.900 | 21.683000 | 19.462411 | 0.213659 | 1.789307 | 1.125246 | 0.119409 |
16222 | 1951-06-21 | 21.780001 | 21.780001 | 21.780001 | 21.780001 | 1100000.0 | 21.780001 | 21.972 | 21.659667 | 19.476274 | 0.092574 | 1.788613 | 1.128142 | 0.051758 |
16221 | 1951-06-22 | 21.549999 | 21.549999 | 21.549999 | 21.549999 | 1340000.0 | 21.549999 | 21.960 | 21.631000 | 19.489562 | 0.115108 | 1.787659 | 1.126757 | 0.064390 |
16220 | 1951-06-25 | 21.290001 | 21.290001 | 21.290001 | 21.290001 | 2440000.0 | 21.290001 | 21.862 | 21.599000 | 19.502082 | 0.204132 | 1.786038 | 1.121008 | 0.114293 |
sp_filtered.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16225 entries, 16224 to 0
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Date                16225 non-null  datetime64[ns]
 1   Open                16225 non-null  float64
 2   High                16225 non-null  float64
 3   Low                 16225 non-null  float64
 4   Close               16225 non-null  float64
 5   Volume              16225 non-null  float64
 6   Adj Close           16225 non-null  float64
 7   avg_5_days          16225 non-null  float64
 8   avg_30_days         16225 non-null  float64
 9   avg_365_days        16225 non-null  float64
 10  std_5_days          16225 non-null  float64
 11  std_365_days        16225 non-null  float64
 12  ratio_avg_5_to_365  16225 non-null  float64
 13  ratio_std_5_to_365  16225 non-null  float64
dtypes: datetime64[ns](1), float64(13)
memory usage: 1.9 MB
train = sp_filtered[sp_filtered["Date"] < datetime(year=2013,month=1,day=1)]
test = sp_filtered[sp_filtered["Date"] >= datetime(year=2013,month=1,day=1)]
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15486 entries, 16224 to 739
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Date                15486 non-null  datetime64[ns]
 1   Open                15486 non-null  float64
 2   High                15486 non-null  float64
 3   Low                 15486 non-null  float64
 4   Close               15486 non-null  float64
 5   Volume              15486 non-null  float64
 6   Adj Close           15486 non-null  float64
 7   avg_5_days          15486 non-null  float64
 8   avg_30_days         15486 non-null  float64
 9   avg_365_days        15486 non-null  float64
 10  std_5_days          15486 non-null  float64
 11  std_365_days        15486 non-null  float64
 12  ratio_avg_5_to_365  15486 non-null  float64
 13  ratio_std_5_to_365  15486 non-null  float64
dtypes: datetime64[ns](1), float64(13)
memory usage: 1.8 MB
test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 739 entries, 738 to 0
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Date                739 non-null    datetime64[ns]
 1   Open                739 non-null    float64
 2   High                739 non-null    float64
 3   Low                 739 non-null    float64
 4   Close               739 non-null    float64
 5   Volume              739 non-null    float64
 6   Adj Close           739 non-null    float64
 7   avg_5_days          739 non-null    float64
 8   avg_30_days         739 non-null    float64
 9   avg_365_days        739 non-null    float64
 10  std_5_days          739 non-null    float64
 11  std_365_days        739 non-null    float64
 12  ratio_avg_5_to_365  739 non-null    float64
 13  ratio_std_5_to_365  739 non-null    float64
dtypes: datetime64[ns](1), float64(13)
memory usage: 86.6 KB
For the error metric of our linear regression model, let's choose MAE (Mean Absolute Error).
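MAE is simply the average of the absolute differences between actual and predicted values. A toy check (with made-up numbers) of what `mean_absolute_error` computes:

```python
import numpy as np

# MAE = mean(|actual - predicted|); values below are made up
actual = np.array([100.0, 102.0, 101.0])
predicted = np.array([99.0, 104.0, 101.5])

mae = np.mean(np.abs(actual - predicted))
print(mae)  # (1.0 + 2.0 + 0.5) / 3 ≈ 1.1667
```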
# Initialize an instance of linear regression
lr = LinearRegression()
# Choose the feature columns; the target is the Close column
features = sp_filtered.columns.drop(["Close", "High", "Low", "Open", "Volume", "Date", "Adj Close"])
# Fitting the model on the training set
lr.fit(train[features], train["Close"])
# Making predictions on the test set
predictions = lr.predict(test[features])
test = test.reset_index()
test['Predicted_Close'] = predictions
test["Absolute_Error"] = abs(test["Close"]- test["Predicted_Close"])
test["Error %"] = (test["Absolute_Error"]/test["Close"])*100
# Finding the mean absolute error (y_true first, then y_pred)
error = mean_absolute_error(test["Close"], predictions)
test[["Close","Predicted_Close","Absolute_Error","Error %"]].head(20)
Close | Predicted_Close | Absolute_Error | Error % | |
---|---|---|---|---|
0 | 1462.420044 | 1419.354407 | 43.065637 | 2.944820 |
1 | 1459.369995 | 1425.505780 | 33.864215 | 2.320468 |
2 | 1466.469971 | 1433.379736 | 33.090235 | 2.256455 |
3 | 1461.890015 | 1443.360695 | 18.529320 | 1.267491 |
4 | 1457.150024 | 1457.055697 | 0.094327 | 0.006473 |
5 | 1461.020020 | 1464.869450 | 3.849430 | 0.263475 |
6 | 1472.119995 | 1464.460455 | 7.659540 | 0.520307 |
7 | 1472.050049 | 1466.795298 | 5.254751 | 0.356968 |
8 | 1470.680054 | 1467.743572 | 2.936482 | 0.199668 |
9 | 1472.339966 | 1469.485739 | 2.854227 | 0.193857 |
10 | 1472.630005 | 1472.825472 | 0.195467 | 0.013273 |
11 | 1480.939941 | 1475.584981 | 5.354960 | 0.361592 |
12 | 1485.979980 | 1476.966922 | 9.013058 | 0.606540 |
13 | 1492.560059 | 1479.515931 | 13.044128 | 0.873943 |
14 | 1494.810059 | 1483.794688 | 11.015371 | 0.736908 |
15 | 1494.819946 | 1488.391453 | 6.428493 | 0.430051 |
16 | 1502.959961 | 1493.240665 | 9.719296 | 0.646677 |
17 | 1500.180054 | 1497.780202 | 2.399852 | 0.159971 |
18 | 1507.839966 | 1500.830227 | 7.009739 | 0.464886 |
19 | 1501.959961 | 1503.773871 | 1.813910 | 0.120770 |
print("Mean Absolute Error for our Model is : ", error)
print("Mean Absolute Error % : ",np.mean(test["Error %"]))
Mean Absolute Error for our Model is :  16.145140609744132
Mean Absolute Error % :  0.8615617558910121
We can see that, on average, there is about a $16 difference between the predicted and actual closing values, and the mean percentage error is under 1%. We would need to test the model on more recent data to better understand its predictive performance.
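One possible follow-up: `mean_squared_error` was imported at the top but never used. RMSE (the square root of MSE) could complement MAE because squaring penalizes large misses more heavily. A hedged sketch with made-up closing values, computed directly in NumPy (equivalent to `np.sqrt(mean_squared_error(actual, predicted))`):

```python
import numpy as np

# Made-up closing values to illustrate RMSE vs. MAE
actual = np.array([2077.07, 2091.69, 2049.62])
predicted = np.array([2080.77, 2080.46, 2088.31])

mae = np.mean(np.abs(actual - predicted))
rmse = np.sqrt(np.mean((actual - predicted) ** 2))

# RMSE >= MAE always; a large gap between them signals a few big misses
print(mae, rmse)
```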