Predicting the Stock Market¶

Load Packages¶

In [1]:

from IPython.display import display, HTML
display(HTML(data="""
<style>
    div#notebook-container    { width: 85%; }
    div#menubar-container     { width: 65%; }
    div#maintoolbar-container { width: 99%; }
</style>
"""))
from datetime import datetime
from itertools import permutations 
from collections import Counter
import sympy

import numpy as np
from numpy import arange
from numpy.random import randint, seed, random
import pandas as pd
import pandas_profiling

import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.style as style

import seaborn as sns

from scipy.stats import percentileofscore, chisquare, chi2_contingency
from scipy import stats
from scipy.spatial import distance

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.cluster import KMeans

The Dataset¶

In this project, you'll be working with data from the S&P500 Index. The S&P500 is a stock market index. Before we get into what an index is, we'll need to get into the basics of the stock market.

Some companies are publicly traded, which means that anyone can buy and sell their shares on the open market. A share entitles the owner to some control over the direction of the company, and to some percentage (or share) of the earnings of the company. When you buy or sell shares, it's common to say that you're trading a stock.

The price of a share is based mainly on supply and demand for a given stock. For example, Apple stock has a price of 120 dollars per share as of December 2015 -- http://www.nasdaq.com/symbol/aapl. A stock that is in less demand, like Ford Motor Company, has a lower price -- http://finance.yahoo.com/q?s=F. Stock price is also influenced by other factors, including the number of shares a company has issued.

Stocks are traded daily, and the price can rise or fall from the beginning of a trading day to the end based on demand. Stocks that are in more in demand, such as Apple, are traded more often than stocks of smaller companies.

Indexes aggregate the prices of multiple stocks together, and allow you to see how the market as a whole is performing. For example, the Dow Jones Industrial Average aggregates the stock prices of 30 large American companies together. The S&P500 Index aggregates the stock prices of 500 large companies. When an index fund goes up or down, you can say that the underlying market or sector it represents is also going up or down. For example, if the Dow Jones Industrial Average price goes down one day, you can say that American stocks overall went down (ie, most American stocks went down in price).

You'll be using historical data on the price of the S&P500 Index to make predictions about future prices. Predicting whether an index will go up or down will help us forecast how the stock market as a whole will perform. Since stocks tend to correlate with how well the economy as a whole is performing, it can also help us make economic forecasts.

There are also thousands of traders who make money by buying and selling Exchange Traded Funds. ETFs allow you to buy and sell indexes like stocks. This means that you could "buy" the S&P500 Index ETF when the price is low, and sell when it's high to make a profit. Creating a predictive model could allow traders to make money on the stock market.

In this mission, you'll be working with a csv file containing index prices. Each row in the file contains a daily record of the price of the S&P500 Index from 1950 to 2015. The dataset is stored in sphist.csv.

The columns of the dataset are:

Date -- The date of the record.
Open -- The opening price of the day (when trading starts).
High -- The highest trade price during the day.
Low -- The lowest trade price during the day.
Close -- The closing price for the day (when trading is finished).
Volume -- The number of shares traded.
Adj Close -- The daily closing price, adjusted retroactively to include any corporate actions. Read more here.

You'll be using this dataset to develop a predictive model. You'll train the model with data from 1950-2012, and try to make predictions from 2013-2015.

Reading in the Data¶

Instructions¶

Here are the steps you'll need to take, at a high level:

Read the data into a Pandas DataFrame. You can use the read_csv Pandas function for this.
Convert the Date column to a Pandas date type. This will allow you to do date comparisons with the column.
- You can perform this conversion with the to_datetime function in Pandas.
- Once you convert the column, you can perform comparisons with df["Date"] > datetime(year=2015, month=4, day=1). This will generate a Boolean series that tells you if each item in the Date column is after 2015-04-01. You'll have to import the datetime module from the datetime library first with from datetime import datetime.
Sort the dataframe on the Date column. It's currently in descending order, but we'll want it to be in ascending order for some of the next steps. You can use the DataFrame.sort_values() method on data frames for this.

Make sure to run the predict.py script using python predict.py as you work through the steps.

In [2]:

stocks = pd.read_csv('sphist.csv')

In [3]:

stocks.head()

Out[3]:

	Date	Open	High	Low	Close	Volume	Adj Close
0	2015-12-07	2090.419922	2090.419922	2066.780029	2077.070068	4.043820e+09	2077.070068
1	2015-12-04	2051.239990	2093.840088	2051.239990	2091.689941	4.214910e+09	2091.689941
2	2015-12-03	2080.709961	2085.000000	2042.349976	2049.620117	4.306490e+09	2049.620117
3	2015-12-02	2101.709961	2104.270020	2077.110107	2079.510010	3.950640e+09	2079.510010
4	2015-12-01	2082.929932	2103.370117	2082.929932	2102.629883	3.712120e+09	2102.629883

In [4]:

plt.figure(figsize=(40,15))
plt.scatter(stocks.Date, stocks.Volume, c='b')
plt.title('Stock Volume', fontsize=60)
plt.xlabel('Date', fontsize=40)
plt.ylabel('Volume', fontsize=40)

Out[4]:

Text(0, 0.5, 'Volume')

In [5]:

stocks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
Date         16590 non-null object
Open         16590 non-null float64
High         16590 non-null float64
Low          16590 non-null float64
Close        16590 non-null float64
Volume       16590 non-null float64
Adj Close    16590 non-null float64
dtypes: float64(6), object(1)
memory usage: 907.4+ KB

In [6]:

stocks['Date'] = pd.to_datetime(stocks.Date)

In [7]:

stocks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
Date         16590 non-null datetime64[ns]
Open         16590 non-null float64
High         16590 non-null float64
Low          16590 non-null float64
Close        16590 non-null float64
Volume       16590 non-null float64
Adj Close    16590 non-null float64
dtypes: datetime64[ns](1), float64(6)
memory usage: 907.4 KB

In [8]:

stocks.sort_values('Date', inplace=True)

In [9]:

stocks.head()

Out[9]:

	Date	Open	High	Low	Close	Volume	Adj Close
16589	1950-01-03	16.66	16.66	16.66	16.66	1260000.0	16.66
16588	1950-01-04	16.85	16.85	16.85	16.85	1890000.0	16.85
16587	1950-01-05	16.93	16.93	16.93	16.93	2550000.0	16.93
16586	1950-01-06	16.98	16.98	16.98	16.98	2010000.0	16.98
16585	1950-01-09	17.08	17.08	17.08	17.08	2520000.0	17.08

Generating Indicators¶

Datasets taken from the stock market need to be handled differently than datasets from other sectors when it comes time to make predictions. In a normal machine learning exercise, we treat each row as independent. Stock market data is sequential, and each observation comes a day after the previous observation. Thus, the observations are not all independent, and you can't treat them as such.

This means you have to be extra careful to not inject "future" knowledge" into past rows when you do training and prediction. Injecting future knowledge will make our model look good when you're training and testing it, but will make it fail in the real world. This is how many algorithmic traders lose money.

The time series nature of the data means that can generate indicators to make our model more accurate. For instance, you can create a new column that contains the average price of the last 10 trades for each row. This will incorporate information from multiple prior rows into one, and will make predictions much more accurate.

When you do this, you have to be careful not to use the current row in the values you average. You want to teach the model how to predict the current price from historical prices. If you include the current price in the prices you average, it will be equivalent to handing the answers to the model upfront, and will make it impossible to use in the "real world", where you don't know the price upfront.

Here are some indicators that are interesting to generate for each row:

The average price from the past 5 days.
The average price for the past 30 days.
The average price for the past 365 days.
The ratio between the average price for the past 5 days, and the average price for the past 365 days.
The standard deviation of the price over the past 5 days.
The standard deviation of the price over the past 365 days.
The ratio between the standard deviation for the past 5 days, and the standard deviation for the past 365 days.

"Days" means "trading days" -- so if you're computing the average of the past 5 days, it should be the 5 most recent dates before the current one. Assume that "price" means the Close column. Always be careful not to include the current price in these indicators! You're predicting the next day price, so our indicators are designed to predict the current price from the previous prices.

Some of these indicators require a year of historical data to compute. Our first day of data falls on 1950-01-03, so the first day you can start computing indicators on is 1951-01-03.

To compute indicators, you'll need to loop through each day from 1951-01-03 to 2015-12-07 (the last day you have prices for). For instance, if we were computing the average price from the past 5 days, we'd start at 1951-01-03, get the prices for each day from 1950-12-26 to 1951-01-02, and find the average. The reason why we start on the 26th, and take more than 5 calendar days into account is because the stock market is shutdown on certain holidays. Since we're looking at the past 5 trading days, we need to look at more than 5 calendar days to find them. Here's a diagram showing how we average 5 days to get the average closing price for 1951-01-03:

You'd then move to 1951-01-04, and find the average price from 1950-12-30 to 1951-01-03. Here's a diagram showing how we might compute the average here:

We'd keep repeating this process to compute all of the averages. Note how when we compute the average of the past 5 days for 1951-01-04, we don't include 1951-01-04 in that average. It's critical not to do this, or our model won't work in the "real world".

Here's a table of how the first 10 dates would look if we computed the 5 day average closing price. Close is the closing price for that day, and day_5 is the average of the past 5 trading closing prices at each row:

Instructions¶

Pick 3 indicators to compute, and generate a different column for each one.

There are a few different ways to do this:

You can use a for loop along with the iterrows method to loop over the rows in the DataFrame and compute the indicators. This is the recommended way, as it's a bit simpler to understand what's happening. Since you'll be looping over all of the rows, for any date that comes before there is enough historical data to compute an indicator, just fill in 0.
Pandas has some time series tools that can help, including the rolling function, which will do most of the hard computation for you. Set the window equal to the number of trading days in the past you want to use to compute the indicators. This will add in NaN values for any row where there aren't enough historical trading days to do the computation. Note: There is a giant caveat here, which is that the rolling mean will use the current day's price. You'll need to reindex the resulting series to shift all the values "forward" one day. For example, the rolling mean calculated for 1950-01-03 will need to be assigned to 1950-01-04, and so on. You can use the shift method on Dataframes to do this.

In [10]:

def avg_ndays_stock_price(df, n):
    col_name = 'Avg Price ' + str(n) + ' Day'
    for index, row in df.iterrows():
        start_index = index + (n - 1)
        if start_index not in df.index:
            df.loc[index, col_name] = 0
        else:
            # assign mean nday to column
            df.loc[index, col_name] = round(df.loc[start_index:index, 'Close'].mean(),2)
    return df

In [11]:

def std_ndays_stock_price(df, n):
    col_name = 'Std Price ' + str(n) + ' Day'
    for index, row in df.iterrows():
        start_index = index + (n - 1)
        if start_index not in df.index:
            df.loc[index, col_name] = 0
        else:
            # assign standard deviation nday to column
            df.loc[index, col_name] = round(df.loc[start_index:index, 'Close'].std(), 4)
    return df

In [12]:

def avg_ndays_stock_volume(df, n):
    col_name = 'Avg Volume ' + str(n) + ' Day'
    for index, row in df.iterrows():
        start_index = index + (n - 1)
        # checks if row exist
        if start_index not in df.index:
            # adds 0 to rows with no historical data
            df.loc[index, col_name] = 0 
        else:
            # assign mean nday to column
            df.loc[index, col_name] = round(df.loc[start_index:index, 'Volume'].mean(),2)
    return df

In [13]:

def std_ndays_stock_volume(df, n):
    col_name = 'Std Volume ' + str(n) + ' Day'
    for index, row in df.iterrows():
        start_index = index + (n - 1)
        # checks if row exist
        if start_index not in df.index:
            # adds 0 to rows with no historical data
            df.loc[index, col_name] = 0
        else:
            # assign standard deviation nday to column
            df.loc[index, col_name] = round(df.loc[start_index:index, 'Volume'].std(), 4)
    return df

In [14]:

print(stocks.shape)
df = avg_ndays_stock_price(stocks, 5)
df = std_ndays_stock_price(stocks, 5)
df = avg_ndays_stock_price(stocks, 30)
df = std_ndays_stock_price(stocks, 30)
df = avg_ndays_stock_price(stocks, 365)
df = std_ndays_stock_price(stocks, 365)

df = avg_ndays_stock_volume(stocks, 5)
df = std_ndays_stock_volume(stocks, 5)
df = avg_ndays_stock_volume(stocks, 30)
df = std_ndays_stock_volume(stocks, 30)
df = avg_ndays_stock_volume(stocks, 365)
df = std_ndays_stock_volume(stocks, 365)
print(df.shape)

(16590, 7)
(16590, 19)

In [15]:

df.head(365)

Out[15]:

	Date	Open	High	Low	Close	Volume	Adj Close	Avg Price 5 Day	Std Price 5 Day	Avg Price 30 Day	Std Price 30 Day	Avg Price 365 Day	Std Price 365 Day	Avg Volume 5 Day	Std Volume 5 Day	Avg Volume 30 Day	Std Volume 30 Day	Avg Volume 365 Day	Std Volume 365 Day
16589	1950-01-03	16.660000	16.660000	16.660000	16.660000	1260000.0	16.660000	0.00	0.0000	0.00	0.0000	0.00	0.0000	0.0	0.0000	0.00	0.000000e+00	0.00	0.0000
16588	1950-01-04	16.850000	16.850000	16.850000	16.850000	1890000.0	16.850000	0.00	0.0000	0.00	0.0000	0.00	0.0000	0.0	0.0000	0.00	0.000000e+00	0.00	0.0000
16587	1950-01-05	16.930000	16.930000	16.930000	16.930000	2550000.0	16.930000	0.00	0.0000	0.00	0.0000	0.00	0.0000	0.0	0.0000	0.00	0.000000e+00	0.00	0.0000
16586	1950-01-06	16.980000	16.980000	16.980000	16.980000	2010000.0	16.980000	0.00	0.0000	0.00	0.0000	0.00	0.0000	0.0	0.0000	0.00	0.000000e+00	0.00	0.0000
16585	1950-01-09	17.080000	17.080000	17.080000	17.080000	2520000.0	17.080000	16.90	0.1580	0.00	0.0000	0.00	0.0000	2046000.0	529650.8284	0.00	0.000000e+00	0.00	0.0000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
16229	1951-06-12	21.520000	21.520000	21.520000	21.520000	1200000.0	21.520000	21.53	0.0536	21.81	0.5802	0.00	0.0000	1192000.0	122147.4519	1807333.33	1.554093e+06	0.00	0.0000
16228	1951-06-13	21.549999	21.549999	21.549999	21.549999	1060000.0	21.549999	21.55	0.0451	21.78	0.5658	0.00	0.0000	1164000.0	135203.5502	1784000.00	1.560072e+06	0.00	0.0000
16227	1951-06-14	21.840000	21.840000	21.840000	21.840000	1300000.0	21.840000	21.60	0.1402	21.75	0.5433	0.00	0.0000	1156000.0	122800.6515	1764000.00	1.562378e+06	0.00	0.0000
16226	1951-06-15	22.040001	22.040001	22.040001	22.040001	1370000.0	22.040001	21.71	0.2222	21.73	0.5087	0.00	0.0000	1230000.0	116619.0379	1741000.00	1.562949e+06	0.00	0.0000
16225	1951-06-18	22.049999	22.049999	22.049999	22.049999	1050000.0	22.049999	21.80	0.2562	21.70	0.4736	19.45	1.7903	1196000.0	142232.2045	1707666.67	1.566790e+06	1989479.45	772310.6495

365 rows × 19 columns

Splitting up the Data¶

Since you're computing indicators that use historical data, there are some rows where there isn't enough historical data to generate them. Some of the indicators use 365 days of historical data, and the dataset starts on 1950-01-03. Thus, any rows that fall before 1951-01-03 don't have enough historical data to compute all the indicators. You'll need to remove these rows before you split the data.

If you have a Dataframe df, you can select any rows with the Date column greater than 1951-01-02 using df[df["Date"] > datetime(year=1951, month=1, day=2)]

Instructions¶

Remove any rows from the DataFrame that fall before 1951-01-03.
Use the dropna method to remove any rows with NaN values. Pass in the axis=0 argument to drop rows.
Generate two new dataframes to use in making our algorithm. train should contain any rows in the data with a date less than 2013-01-01. test should contain any rows with a date greater than or equal to 2013-01-01.

In [16]:

print(df.shape)

(16590, 19)

In [17]:

df.isnull().sum()

Out[17]:

Date                  0
Open                  0
High                  0
Low                   0
Close                 0
Volume                0
Adj Close             0
Avg Price 5 Day       0
Std Price 5 Day       0
Avg Price 30 Day      0
Std Price 30 Day      0
Avg Price 365 Day     0
Std Price 365 Day     0
Avg Volume 5 Day      0
Std Volume 5 Day      0
Avg Volume 30 Day     0
Std Volume 30 Day     0
Avg Volume 365 Day    0
Std Volume 365 Day    0
dtype: int64

In [18]:

df = df[df["Date"] > datetime(year=1951, month=1, day=2)]
df.dropna(axis=0, inplace=True)
train = df[df["Date"] < datetime(year=2013, month=1, day=1)]
test = df[df["Date"] >= datetime(year=2013, month=1, day=1)]

C:\Users\Leo\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [19]:

print(df.shape)

(16340, 19)

In [20]:

df.isnull().sum()

Out[20]:

Date                  0
Open                  0
High                  0
Low                   0
Close                 0
Volume                0
Adj Close             0
Avg Price 5 Day       0
Std Price 5 Day       0
Avg Price 30 Day      0
Std Price 30 Day      0
Avg Price 365 Day     0
Std Price 365 Day     0
Avg Volume 5 Day      0
Std Volume 5 Day      0
Avg Volume 30 Day     0
Std Volume 30 Day     0
Avg Volume 365 Day    0
Std Volume 365 Day    0
dtype: int64

In [21]:

train

Out[21]:

	Date	Open	High	Low	Close	Volume	Adj Close	Avg Price 5 Day	Std Price 5 Day	Avg Price 30 Day	Std Price 30 Day	Avg Price 365 Day	Std Price 365 Day	Avg Volume 5 Day	Std Volume 5 Day	Avg Volume 30 Day	Std Volume 30 Day	Avg Volume 365 Day	Std Volume 365 Day
16339	1951-01-03	20.690001	20.690001	20.690001	20.690001	3.370000e+06	20.690001	20.51	0.2045	19.84	0.4343	0.00	0.0000	3.268000e+06	2.690167e+05	2.694667e+06	6.950948e+05	0.000000e+00	0.000000e+00
16338	1951-01-04	20.870001	20.870001	20.870001	20.870001	3.390000e+06	20.870001	20.63	0.2141	19.87	0.4730	0.00	0.0000	3.358000e+06	1.976613e+05	2.732667e+06	7.010826e+05	0.000000e+00	0.000000e+00
16337	1951-01-05	20.870001	20.870001	20.870001	20.870001	3.390000e+06	20.870001	20.73	0.1819	19.91	0.5068	0.00	0.0000	3.324000e+06	1.663731e+05	2.778667e+06	6.972939e+05	0.000000e+00	0.000000e+00
16336	1951-01-08	21.000000	21.000000	21.000000	21.000000	2.780000e+06	21.000000	20.84	0.1170	19.93	0.5431	0.00	0.0000	3.192000e+06	2.766225e+05	2.780333e+06	6.972334e+05	0.000000e+00	0.000000e+00
16335	1951-01-09	21.120001	21.120001	21.120001	21.120001	3.800000e+06	21.120001	20.91	0.1611	19.96	0.5810	0.00	0.0000	3.346000e+06	3.643213e+05	2.819667e+06	7.207634e+05	0.000000e+00	0.000000e+00
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
743	2012-12-24	1430.150024	1430.150024	1424.660034	1426.660034	1.248960e+09	1426.660034	1436.62	8.5897	1407.49	24.5204	1326.41	89.9835	3.667348e+09	1.477523e+09	3.381918e+09	7.656156e+08	3.878488e+09	7.749962e+08
742	2012-12-26	1426.660034	1429.420044	1416.430054	1419.829956	2.285030e+09	1419.829956	1431.23	9.0587	1408.81	24.0560	1326.72	90.1114	3.263906e+09	1.535105e+09	3.372501e+09	7.776211e+08	3.872807e+09	7.790566e+08
741	2012-12-27	1419.829956	1422.800049	1401.800049	1418.099976	2.830180e+09	1418.099976	1427.69	10.2086	1410.27	23.2154	1327.00	90.2365	3.055982e+09	1.502584e+09	3.351655e+09	7.836766e+08	3.868936e+09	7.807206e+08
740	2012-12-28	1418.099976	1418.099976	1401.579956	1402.430054	2.426680e+09	1402.430054	1419.43	10.7019	1411.83	20.8585	1327.26	90.3156	2.804002e+09	1.475799e+09	3.295561e+09	7.877764e+08	3.864302e+09	7.842493e+08
739	2012-12-31	1402.430054	1426.739990	1398.109985	1426.189941	3.204330e+09	1426.189941	1418.64	9.8208	1414.26	17.8347	1327.53	90.4639	2.399036e+09	7.366851e+08	3.271409e+09	7.787457e+08	3.861288e+09	7.846668e+08

15601 rows × 19 columns

In [22]:

test

Out[22]:

	Date	Open	High	Low	Close	Volume	Adj Close	Avg Price 5 Day	Std Price 5 Day	Avg Price 30 Day	Std Price 30 Day	Avg Price 365 Day	Std Price 365 Day	Avg Volume 5 Day	Std Volume 5 Day	Avg Volume 30 Day	Std Volume 30 Day	Avg Volume 365 Day	Std Volume 365 Day
738	2013-01-02	1426.189941	1462.430054	1426.189941	1462.420044	4.202600e+09	1462.420044	1425.79	22.2613	1417.68	16.8526	1327.91	90.7390	2.989764e+09	7.674944e+08	3.276632e+09	7.846227e+08	3.862480e+09	7.848543e+08
737	2013-01-03	1462.420044	1465.469971	1455.530029	1459.369995	3.829730e+09	1459.369995	1433.70	26.2743	1420.09	17.4708	1328.22	90.9959	3.298704e+09	7.224721e+08	3.291797e+09	7.909560e+08	3.859719e+09	7.831860e+08
736	2013-01-04	1459.369995	1467.939941	1458.989990	1466.469971	3.424290e+09	1466.469971	1443.38	27.9452	1422.71	18.3398	1328.56	91.2790	3.417526e+09	6.733364e+08	3.299034e+09	7.911482e+08	3.859449e+09	7.833194e+08
735	2013-01-07	1466.469971	1466.469971	1456.619995	1461.890015	3.304970e+09	1461.890015	1455.27	16.4533	1425.08	18.6783	1328.90	91.5444	3.593184e+09	4.153739e+08	3.320297e+09	7.820986e+08	3.858814e+09	7.836758e+08
734	2013-01-08	1461.890015	1461.890015	1451.640015	1457.150024	3.601600e+09	1457.150024	1461.46	3.5054	1426.68	19.3121	1329.24	91.7901	3.672638e+09	3.561743e+08	3.390185e+09	7.040724e+08	3.857703e+09	7.837524e+08
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4	2015-12-01	2082.929932	2103.370117	2082.929932	2102.629883	3.712120e+09	2102.629883	2090.23	7.9568	2076.28	23.9705	2035.91	64.3525	3.232372e+09	1.111591e+09	3.856341e+09	5.899408e+08	3.526090e+09	6.925928e+08
3	2015-12-02	2101.709961	2104.270020	2077.110107	2079.510010	3.950640e+09	2079.510010	2088.31	9.3336	2077.91	22.3781	2036.23	64.2776	3.245514e+09	1.121578e+09	3.876979e+09	5.817195e+08	3.529468e+09	6.916451e+08
2	2015-12-03	2080.709961	2085.000000	2042.349976	2049.620117	4.306490e+09	2049.620117	2080.46	19.5999	2078.93	20.1838	2036.51	64.1216	3.536224e+09	1.181180e+09	3.899603e+09	5.848831e+08	3.532802e+09	6.924515e+08
1	2015-12-04	2051.239990	2093.840088	2051.239990	2091.689941	4.214910e+09	2091.689941	2080.77	19.8061	2080.24	19.6764	2036.87	64.0589	4.085838e+09	2.493154e+08	3.892405e+09	5.794230e+08	3.535838e+09	6.930075e+08
0	2015-12-07	2090.419922	2090.419922	2066.780029	2077.070068	4.043820e+09	2077.070068	2080.10	19.8776	2080.30	19.6624	2037.20	63.9566	4.045596e+09	2.328943e+08	3.890250e+09	5.787116e+08	3.539304e+09	6.923741e+08

739 rows × 19 columns

Making Predictions¶

Now, you can define an error metric, train a model using the train data, and make predictions on the test data.

It's recommended to use Mean Absolute Error, also called MAE, as an error metric, because it will show you how "close" you were to the price in intuitive terms. Mean Squared Error, or MSE, is an alternative that is more commonly used, but makes it harder to intuitively tell how far off you are from the true price because it squares the error.

Instructions¶

Pick an error metric.
Initialize an instance of the LinearRegression class.
Train a linear regression model, using the train Dataframe. Leave out all of the original columns (Close, High, Low, Open, Volume, Adj Close, Date) when training your model. These all contain knowledge of the future that you don't want to feed the model. Use the Close column as the target.
Make predictions for the Close column of the test data, using the same columns for training as you did with train.
Compute the error between the predictions and the Close column of test.

In [23]:

df.columns

Out[23]:

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close',
       'Avg Price 5 Day', 'Std Price 5 Day', 'Avg Price 30 Day',
       'Std Price 30 Day', 'Avg Price 365 Day', 'Std Price 365 Day',
       'Avg Volume 5 Day', 'Std Volume 5 Day', 'Avg Volume 30 Day',
       'Std Volume 30 Day', 'Avg Volume 365 Day', 'Std Volume 365 Day'],
      dtype='object')

In [24]:

features = ['Avg Price 5 Day', 'Std Price 5 Day', 'Avg Price 30 Day', \
       'Std Price 30 Day', 'Avg Price 365 Day', 'Std Price 365 Day']
y_train = train['Close']
y_test = test['Close']

# instantiate model
lr = LinearRegression()
# fit model
lr.fit(train[features], y_train)
# predict model
predictions = lr.predict(test[features])
# test model
mae = mean_absolute_error(test['Close'], predictions)
mae

Out[24]:

11.93595588485392

In [25]:

lr.coef_

Out[25]:

array([ 1.07953948e+00, -2.06551454e-01, -7.91936105e-02,  9.14256546e-02,
       -6.47158132e-04,  5.24102042e-03])

In [26]:

lr.intercept_

Out[26]:

0.05487448493005331

In [27]:

plt.figure(figsize=(40,15))
plt.scatter(test['Date'], test['Close'], c='b')
plt.scatter(test['Date'], predictions, c='r')
plt.title('Stock Predictions', fontsize=60)
plt.xlabel('Date', fontsize=40)
plt.ylabel('Stock Price', fontsize=40)

C:\Users\Leo\Anaconda3\lib\site-packages\pandas\plotting\_matplotlib\converter.py:103: FutureWarning: Using an implicitly registered datetime converter for a matplotlib plotting method. The converter was registered by pandas on import. Future versions of pandas will require you to explicitly register matplotlib converters.

To register the converters:
	>>> from pandas.plotting import register_matplotlib_converters
	>>> register_matplotlib_converters()
  warnings.warn(msg, FutureWarning)

Out[27]:

Text(0, 0.5, 'Stock Price')

Improving Error¶

Congratulations! You can now predict the S&P500 (with some error). You can improve the error of this model significantly, though. Think about some indicators that might be helpful to compute.

Here are some ideas that might be helpful:

The average volume over the past five days.
The average volume over the past year.
The ratio between the average volume for the past five days, and the average volume for the past year.
The standard deviation of the average volume over the past five days.
The standard deviation of the average volume over the past year.
The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.
The year component of the date.
The ratio between the lowest price in the past year and the current price.
The ratio between the highest price in the past year and the current price.
The month component of the date.
The day of week.
The day component of the date.
The number of holidays in the prior month.

Instructions¶

Add 2 additional indicators to your dataframe, and see if the error is reduced. You'll need to insert these indicators at the same point where you insert the others, before you clean out rows with NaN values and split the dataframe into train and `test.

In [28]:

features = ['Avg Price 5 Day', 'Std Price 5 Day', 'Avg Price 30 Day', \
       'Std Price 30 Day', 'Avg Price 365 Day', 'Std Price 365 Day', \
       'Avg Volume 5 Day', 'Std Volume 5 Day', 'Avg Volume 30 Day', \
       'Std Volume 30 Day', 'Avg Volume 365 Day', 'Std Volume 365 Day']
y_train = train['Close']
y_test = test['Close']

# instantiate model
lr = LinearRegression()
# fit model
lr.fit(train[features], y_train)
# predict model
predictions = lr.predict(test[features])
# test model
mae = mean_absolute_error(test['Close'], predictions)
mae

Out[28]:

11.960202192277311

In [29]:

lr.coef_

Out[29]:

array([ 1.07760237e+00, -2.04476309e-01, -7.84381668e-02,  9.75785299e-02,
       -1.97650884e-05,  7.91658560e-03, -5.39331940e-10,  2.16775615e-09,
        6.96329036e-10, -2.33016836e-09,  9.26969237e-10, -3.77309004e-09])

In [30]:

lr.intercept_

Out[30]:

0.0876194354404447

In [31]:

plt.figure(figsize=(40,15))
plt.scatter(test['Date'], test['Close'], c='b')
plt.scatter(test['Date'], predictions, c='r')
plt.title('Stock Predictions', fontsize=60)
plt.xlabel('Date', fontsize=40)
plt.ylabel('Stock Price', fontsize=40)

Out[31]:

Text(0, 0.5, 'Stock Price')

In [32]:

features = ['Avg Price 5 Day', 'Std Price 5 Day', 'Avg Price 30 Day', \
       'Std Price 30 Day', 'Avg Price 365 Day', 'Std Price 365 Day', \
       'Avg Volume 5 Day', 'Std Volume 5 Day', 'Avg Volume 30 Day', \
       'Std Volume 30 Day', 'Avg Volume 365 Day', 'Std Volume 365 Day']
y_train = train['Close']
y_test = test['Close']

results = {}
feature_list = []

for feature in features:
    feature_list.append(feature)   
    # instantiate model
    lr = LinearRegression()
    # fit model
    lr.fit(train[feature_list], y_train)
    # predict model
    predictions = lr.predict(test[feature_list])
    # test model
    mae = mean_absolute_error(test['Close'], predictions)
    results[tuple(feature_list)] = mae

In [33]:

{k: v for k, v in sorted(results.items(), key=lambda item: item[1])}

Out[33]:

{('Avg Price 5 Day',
  'Std Price 5 Day',
  'Avg Price 30 Day'): 11.805347782941194,
 ('Avg Price 5 Day',
  'Std Price 5 Day',
  'Avg Price 30 Day',
  'Std Price 30 Day',
  'Avg Price 365 Day',
  'Std Price 365 Day'): 11.93595588485392,
 ('Avg Price 5 Day',
  'Std Price 5 Day',
  'Avg Price 30 Day',
  'Std Price 30 Day',
  'Avg Price 365 Day',
  'Std Price 365 Day',
  'Avg Volume 5 Day'): 11.937468284690883,
 ('Avg Price 5 Day',
  'Std Price 5 Day',
  'Avg Price 30 Day',
  'Std Price 30 Day',
  'Avg Price 365 Day'): 11.938430456423463,
 ('Avg Price 5 Day',
  'Std Price 5 Day',
  'Avg Price 30 Day',
  'Std Price 30 Day'): 11.938433123358328,
 ('Avg Price 5 Day', 'Std Price 5 Day'): 11.957965690278314,
 ('Avg Price 5 Day',
  'Std Price 5 Day',
  'Avg Price 30 Day',
  'Std Price 30 Day',
  'Avg Price 365 Day',
  'Std Price 365 Day',
  'Avg Volume 5 Day',
  'Std Volume 5 Day',
  'Avg Volume 30 Day',
  'Std Volume 30 Day',
  'Avg Volume 365 Day',
  'Std Volume 365 Day'): 11.960202192277311,
 ('Avg Price 5 Day',
  'Std Price 5 Day',
  'Avg Price 30 Day',
  'Std Price 30 Day',
  'Avg Price 365 Day',
  'Std Price 365 Day',
  'Avg Volume 5 Day',
  'Std Volume 5 Day',
  'Avg Volume 30 Day'): 11.963932665089686,
 ('Avg Price 5 Day',
  'Std Price 5 Day',
  'Avg Price 30 Day',
  'Std Price 30 Day',
  'Avg Price 365 Day',
  'Std Price 365 Day',
  'Avg Volume 5 Day',
  'Std Volume 5 Day'): 11.970832672173376,
 ('Avg Price 5 Day',
  'Std Price 5 Day',
  'Avg Price 30 Day',
  'Std Price 30 Day',
  'Avg Price 365 Day',
  'Std Price 365 Day',
  'Avg Volume 5 Day',
  'Std Volume 5 Day',
  'Avg Volume 30 Day',
  'Std Volume 30 Day',
  'Avg Volume 365 Day'): 11.985133357957164,
 ('Avg Price 5 Day',
  'Std Price 5 Day',
  'Avg Price 30 Day',
  'Std Price 30 Day',
  'Avg Price 365 Day',
  'Std Price 365 Day',
  'Avg Volume 5 Day',
  'Std Volume 5 Day',
  'Avg Volume 30 Day',
  'Std Volume 30 Day'): 11.985789542369043,
 ('Avg Price 5 Day',): 12.05338738870169}

In [34]:

features = ['Avg Price 5 Day', 'Std Price 5 Day', 'Avg Price 30 Day']
y_train = train['Close']
y_test = test['Close']

# instantiate model
lr = LinearRegression()
# fit model
lr.fit(train[features], y_train)
# predict model
predictions = lr.predict(test[features])
# test model
mae = mean_absolute_error(test['Close'], predictions)
mae

Out[34]:

11.805347782941194

In [35]:

lr.coef_

Out[35]:

array([ 1.0796239, -0.1003418, -0.0786633])

In [36]:

lr.intercept_

Out[36]:

0.07180909809756031

In [37]:

plt.style.use('fivethirtyeight')
# fandango_2015.Fandango_Stars.plot.kde(label = '2015', legend = True, figsize=(20,10))
# fandango_2016.fandango.plot.kde(label = '2016', legend = True)
plt.figure(figsize=(40,15))
plt.scatter(test['Date'], test['Close'], c='b', label = 'Close Price')
plt.scatter(test['Date'], predictions, c='r', label = 'Prediction Price')
plt.title('Stock Predictions', fontsize=60)
plt.xlabel('Date', fontsize=40)
plt.ylabel('Stock Price', fontsize=40)
plt.legend(fontsize=20, markerscale=2)

Out[37]:

<matplotlib.legend.Legend at 0x1a24ab24358>

Next Steps¶

There's a lot of improvement still to be made on the indicator side, and we urge you to think of better indicators that you could use for prediction. We can also make significant structural improvements to the algorithm, and pull in data from other sources.

Accuracy would improve greatly by making predictions only one day ahead. For example, train a model using data from 1951-01-03 to 2013-01-02, make predictions for 2013-01-03, and then train another model using data from 1951-01-03 to 2013-01-03, make predictions for 2013-01-04, and so on. This more closely simulates what you'd do if you were trading using the algorithm.
You can also improve the algorithm used significantly. Try other techniques, like a random forest, and see if they perform better.
You can also incorporate outside data, such as the weather in New York City (where most trading happens) the day before, and the amount of Twitter activity around certain stocks.
You can also make the system real-time by writing an automated script to download the latest data when the market closes, and make predictions for the next day.
Finally, you can make the system "higher-resolution". You're currently making daily predictions, but you could make hourly, minute-by-minute, or second by second predictions. This will require obtaining more data, though. You could also make predictions for individual stocks instead of the S&P500.