In this project, you'll work with data from the S&P500 Index. The S&P500 is a stock market index. Before we get into what an index is, we'll need to start with the basics of the stock market.
Some companies are publicly traded, which means that anyone can buy and sell their shares on the open market. A share entitles the owner to some control over the direction of the company and to a percentage (or share) of the company's earnings. Buying and selling shares is commonly known as trading a stock. The price of a share is based on supply and demand for a given stock.
Indexes aggregate the prices of multiple stocks together, and allow you to see how the market as a whole performs.
You'll be using historical data on the price of the S&P500 Index to make predictions about future prices. Predicting whether an index goes up or down helps forecast how the stock market as a whole performs. Since stock prices tend to correlate with how well the economy as a whole performs, it can also help with economic forecasting.
In this project, our dataset contains index prices. Each row in the file is a daily record of the price of the S&P500 Index from 1950 to 2015. The dataset is stored in sphist.csv.
Column | Description |
---|---|
Date | The date of the record. |
Open | The opening price of the day (when trading starts) |
High | The highest trade price during the day |
Low | The lowest trade price during the day |
Close | The closing price for the day (when trading is finished) |
Volume | The number of shares traded |
Adj Close | The daily closing price, adjusted retroactively to include any corporate actions. |
You'll be using this dataset to develop a predictive model. You'll train the model with data from 1950-2012 and try to make predictions from 2013-2015.
import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# Read our data
df = pd.read_csv("sphist.csv")
# Convert the Date column into a Pandas date type
df["Date"] = pd.to_datetime(df["Date"])
# Sort chronologically (the raw file is ordered newest-first)
df.sort_values(by="Date", ascending=True, inplace=True)
df.head(10)
| Date | Open | High | Low | Close | Volume | Adj Close |
---|---|---|---|---|---|---|---|
16589 | 1950-01-03 | 16.660000 | 16.660000 | 16.660000 | 16.660000 | 1260000.0 | 16.660000 |
16588 | 1950-01-04 | 16.850000 | 16.850000 | 16.850000 | 16.850000 | 1890000.0 | 16.850000 |
16587 | 1950-01-05 | 16.930000 | 16.930000 | 16.930000 | 16.930000 | 2550000.0 | 16.930000 |
16586 | 1950-01-06 | 16.980000 | 16.980000 | 16.980000 | 16.980000 | 2010000.0 | 16.980000 |
16585 | 1950-01-09 | 17.080000 | 17.080000 | 17.080000 | 17.080000 | 2520000.0 | 17.080000 |
16584 | 1950-01-10 | 17.030001 | 17.030001 | 17.030001 | 17.030001 | 2160000.0 | 17.030001 |
16583 | 1950-01-11 | 17.090000 | 17.090000 | 17.090000 | 17.090000 | 2630000.0 | 17.090000 |
16582 | 1950-01-12 | 16.760000 | 16.760000 | 16.760000 | 16.760000 | 2970000.0 | 16.760000 |
16581 | 1950-01-13 | 16.670000 | 16.670000 | 16.670000 | 16.670000 | 3330000.0 | 16.670000 |
16580 | 1950-01-16 | 16.719999 | 16.719999 | 16.719999 | 16.719999 | 1460000.0 | 16.719999 |
Stock market data is sequential: each observation comes a day after the previous one, so the observations are not independent and we can't treat them as such. The time-series nature of the data means we can generate indicators to make our model more accurate. Our goal is to teach the model to predict the current price from historical prices. Let's compute 3 indicators for each target column: a 5-day average, a 30-day average, and the ratio between the two:
indicators = [5, 30]

def add_indicator(df, indicators, targets):
    for target in targets:
        for index, row in df.iterrows():
            # Number of trading days available before the current row
            size = len(df[df["Date"] < row["Date"]])
            for indicator in indicators:
                column = "{}_Day_{}".format(target, indicator)
                if size < indicator:
                    # Not enough history: flag the row with 0 so we can drop it later
                    df.loc[index, column] = 0
                else:
                    # The index is descending after sorting by Date, so this
                    # slice selects the previous `indicator` rows only,
                    # excluding the current day
                    df.loc[index, column] = np.mean(
                        df.loc[index + indicator:index + 1, target])
        # Ratio of the short-window average to the long-window average
        column1 = "Ratio_{}_{}_{}".format(target, indicators[0], indicators[1])
        df[column1] = df.iloc[:, -2] / df.iloc[:, -1]

add_indicator(df, indicators, ["Close", "Volume"])
df.head(20)
df.head(20)
| Date | Open | High | Low | Close | Volume | Adj Close | Close_Day_5 | Close_Day_30 | Ratio_Close_5_30 | Volume_Day_5 | Volume_Day_30 | Ratio_Volume_5_30 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16589 | 1950-01-03 | 16.660000 | 16.660000 | 16.660000 | 16.660000 | 1260000.0 | 16.660000 | 0.000000 | 0.0 | NaN | 0.000000e+00 | 0.0 | NaN |
16588 | 1950-01-04 | 16.850000 | 16.850000 | 16.850000 | 16.850000 | 1890000.0 | 16.850000 | 0.000000 | 0.0 | NaN | 0.000000e+00 | 0.0 | NaN |
16587 | 1950-01-05 | 16.930000 | 16.930000 | 16.930000 | 16.930000 | 2550000.0 | 16.930000 | 0.000000 | 0.0 | NaN | 0.000000e+00 | 0.0 | NaN |
16586 | 1950-01-06 | 16.980000 | 16.980000 | 16.980000 | 16.980000 | 2010000.0 | 16.980000 | 0.000000 | 0.0 | NaN | 0.000000e+00 | 0.0 | NaN |
16585 | 1950-01-09 | 17.080000 | 17.080000 | 17.080000 | 17.080000 | 2520000.0 | 17.080000 | 0.000000 | 0.0 | NaN | 0.000000e+00 | 0.0 | NaN |
16584 | 1950-01-10 | 17.030001 | 17.030001 | 17.030001 | 17.030001 | 2160000.0 | 17.030001 | 16.945714 | 0.0 | inf | 2.145714e+06 | 0.0 | inf |
16583 | 1950-01-11 | 17.090000 | 17.090000 | 17.090000 | 17.090000 | 2630000.0 | 17.090000 | 16.960000 | 0.0 | inf | 2.390000e+06 | 0.0 | inf |
16582 | 1950-01-12 | 16.760000 | 16.760000 | 16.760000 | 16.760000 | 2970000.0 | 16.760000 | 16.934286 | 0.0 | inf | 2.595714e+06 | 0.0 | inf |
16581 | 1950-01-13 | 16.670000 | 16.670000 | 16.670000 | 16.670000 | 3330000.0 | 16.670000 | 16.904286 | 0.0 | inf | 2.440000e+06 | 0.0 | inf |
16580 | 1950-01-16 | 16.719999 | 16.719999 | 16.719999 | 16.719999 | 1460000.0 | 16.719999 | 16.887143 | 0.0 | inf | 2.408571e+06 | 0.0 | inf |
16579 | 1950-01-17 | 16.860001 | 16.860001 | 16.860001 | 16.860001 | 1790000.0 | 16.860001 | 16.854286 | 0.0 | inf | 2.272857e+06 | 0.0 | inf |
16578 | 1950-01-18 | 16.850000 | 16.850000 | 16.850000 | 16.850000 | 1570000.0 | 16.850000 | 16.831429 | 0.0 | inf | 2.131429e+06 | 0.0 | inf |
16577 | 1950-01-19 | 16.870001 | 16.870001 | 16.870001 | 16.870001 | 1170000.0 | 16.870001 | 16.804286 | 0.0 | inf | 1.961429e+06 | 0.0 | inf |
16576 | 1950-01-20 | 16.900000 | 16.900000 | 16.900000 | 16.900000 | 1440000.0 | 16.900000 | 16.827143 | 0.0 | inf | 1.728571e+06 | 0.0 | inf |
16575 | 1950-01-23 | 16.920000 | 16.920000 | 16.920000 | 16.920000 | 1340000.0 | 16.920000 | 16.854286 | 0.0 | inf | 1.431429e+06 | 0.0 | inf |
16574 | 1950-01-24 | 16.860001 | 16.860001 | 16.860001 | 16.860001 | 1250000.0 | 16.860001 | 16.857143 | 0.0 | inf | 1.465714e+06 | 0.0 | inf |
16573 | 1950-01-25 | 16.740000 | 16.740000 | 16.740000 | 16.740000 | 1700000.0 | 16.740000 | 16.838572 | 0.0 | inf | 1.374286e+06 | 0.0 | inf |
16572 | 1950-01-26 | 16.730000 | 16.730000 | 16.730000 | 16.730000 | 1150000.0 | 16.730000 | 16.834286 | 0.0 | inf | 1.328571e+06 | 0.0 | inf |
16571 | 1950-01-27 | 16.820000 | 16.820000 | 16.820000 | 16.820000 | 1250000.0 | 16.820000 | 16.855714 | 0.0 | inf | 1.395714e+06 | 0.0 | inf |
16570 | 1950-01-30 | 17.020000 | 17.020000 | 17.020000 | 17.020000 | 1640000.0 | 17.020000 | 16.877143 | 0.0 | inf | 1.431429e+06 | 0.0 | inf |
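The row-by-row loop above is easy to follow but slow on roughly 16,000 rows. As a sketch (the `Close` values below are made up, and the column names mirror the ones used in this project), the same kind of backward-looking average can be computed with pandas' vectorized `rolling`, shifted by one row so that only past days enter the average:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the date-sorted S&P 500 data (hypothetical values)
df = pd.DataFrame({"Close": np.arange(1.0, 11.0)})

for window in [5, 30]:
    # Mean of the previous `window` rows; shift(1) excludes the current day,
    # and fillna(0) mirrors the 0-flag used for rows without enough history
    df["Close_Day_{}".format(window)] = (
        df["Close"].rolling(window).mean().shift(1).fillna(0))

print(df.head(7))
```

On the full dataset this replaces the nested `iterrows` loop with a handful of vectorized operations, which is typically orders of magnitude faster.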
Since we're computing indicators that use historical data, some rows don't have enough history to generate them. Let's clean the data based on the selected columns and window length.
def clean(df, day, columns):
    # Drop rows flagged with 0 (not enough history for the indicator)
    for column in columns:
        col = "{}_Day_{}".format(column, day)
        df.drop(df[df[col] == 0].index, axis=0, inplace=True)
    # Drop any remaining rows with missing values
    df.dropna(axis=0, inplace=True)

clean_df = df.copy()
clean(clean_df, 30, ["Close", "Volume"])
print(clean_df.isnull().sum())
print(clean_df.shape)
Date                 0
Open                 0
High                 0
Low                  0
Close                0
Volume               0
Adj Close            0
Close_Day_5          0
Close_Day_30         0
Ratio_Close_5_30     0
Volume_Day_5         0
Volume_Day_30        0
Ratio_Volume_5_30    0
dtype: int64
(16560, 13)
Next, we split the data: everything before 2013 becomes the training set, and everything from 2013 onward becomes the test set:
# Generate the train and test dataset
train = clean_df[clean_df["Date"] < datetime(year=2013, month=1, day=1)]
test = clean_df[clean_df["Date"] > datetime(year=2013, month=1, day=1)]
print(train.shape)
print(test.shape)
(15821, 13)
(739, 13)
We'll train a linear regression model on the training set, predict the closing price on the test set, and evaluate with Mean Absolute Error (MAE).
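MAE is just the average absolute difference between predicted and actual prices, so it is in the same units as the index itself. As a quick sanity check on made-up numbers (the values below are hypothetical, not from the dataset), `mean_absolute_error` matches the hand computation:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

actual = np.array([16.66, 16.85, 16.93])     # hypothetical closing prices
predicted = np.array([16.50, 17.00, 16.90])  # hypothetical predictions

mae = mean_absolute_error(actual, predicted)
by_hand = np.mean(np.abs(actual - predicted))
print(mae)  # average of the absolute errors 0.16, 0.15, 0.03
```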
def model(target, features, train, test):
    lr = LinearRegression()
    lr.fit(train[features], train[target])
    predictions = lr.predict(test[features])
    mae = mean_absolute_error(test[target], predictions)
    print("MAE = {}".format(mae))

target = "Close"

features = ["Close_Day_5", "Close_Day_30"]
model(target, features, train, test)

features = ["Close_Day_5", "Close_Day_30", "Ratio_Close_5_30"]
model(target, features, train, test)
MAE = 11.350741886520453
MAE = 11.351021603941877
From the results above, the ratio has no significant effect on reducing the error. Let's confirm by checking the correlation coefficients with Close.
clean_df.corr()["Close"]
Open                 0.999901
High                 0.999954
Low                  0.999956
Close                1.000000
Volume               0.774267
Adj Close            1.000000
Close_Day_5          0.999892
Close_Day_30         0.999297
Ratio_Close_5_30     0.005923
Volume_Day_5         0.783925
Volume_Day_30        0.788661
Ratio_Volume_5_30   -0.004822
Name: Close, dtype: float64
The coefficients (Ratio_Close_5_30 = 0.005923 and Ratio_Volume_5_30 = -0.004822) confirm this: both ratios are nearly uncorrelated with Close. Let's add two more indicators instead:
features = ["Close_Day_5", "Close_Day_30", "Volume_Day_5", "Volume_Day_30"]
model(target, features, train, test)
MAE = 11.3073026169537
This is a small improvement, from MAE = 11.350741886520453 to MAE = 11.3073026169537. We can also make a more significant structural improvement to the algorithm.
For this improvement, we train a model using data from 1951-01-03 to 2013-01-02 and make a prediction for 2013-01-03; then we train another model using data from 1951-01-03 to 2013-01-03 and make a prediction for 2013-01-04, and so on. This more closely simulates what you'd do if you were trading with the algorithm.
# Modified model function: for each row, train on all data strictly
# before that row's date and predict that single day
def model(df, row, target, features):
    lr = LinearRegression()
    train = df[df["Date"] < row["Date"]]
    test = df[df["Date"] == row["Date"]]
    if train.empty:
        # The very first row has no history to train on
        return 0
    lr.fit(train[features], train[target])
    prediction = lr.predict(test[features])
    return mean_absolute_error(test[target], prediction)

# Get the MAE of the new model, retraining for every day
# (restricting this to the test period would follow the same scheme
# with far less compute)
maes = clean_df.apply(lambda row: model(clean_df, row, target, features), axis=1)
mae = np.mean(maes)
print("MAE = {}".format(mae))
MAE = 3.7842744203948424
We see a large reduction in error; we can conclude that making predictions only one day ahead considerably improves the model's accuracy.
To further improve the accuracy of the predictions, we can: