Google stock analysis and predictions¶

Affinito Alessandro¶

Abstract¶

We study some tech stock price through data visualization and some financial technique, focusing on those which are intended to give a sort of reliable prevision to permit brokers have a basis on which they could decide when it is the best moment to sell or buy stocks. We first analyze a year of data about the biggest companies as Amazon, Google, Apple and Microsoft but right after that we focus on Google stocks.

Next we leave the financial tools for supervised learning analysis. These machine learning processes learn a function from an input type to an output type using data comprising examples. Furthermore we'll talk specifically of regression supervised learning, meaning that we're interested in inferring a real valued function whose values corresponds to the mean of a dependant variable (stock prices).

We first applied linear regression on the last 6 years of Google Trends about the word 'google' specifically searched in the financial news domain, versus the last 6 years Google stock prices. From now on we change our feature domain with a multivariate input, i.e. we use other stock prices (AAPL, MSFT, TWTR, AMZN) to study the accuracy of others algorithms such as a multivariate linear regression, a SVR and a Random Forest.

*** keywords :*** Finance, Stock Price Analysis, MACD, Machine Learning, Linear Regression, SVR, Random Forest, Data Visualization, Python, R

Index¶

Data collection
Financial analysis 3. Moving average 3. MACD 3. Daily return analysis 3. Risk analysis
Correlation
Can we see the future ? 2. Google Trends Simple Linear Regression 2. Multivariate LR 2. Support Vector Regression 2. Random Forest
Conclusions

In [2]:

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

# For Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

# For reading stock data from yahoo or google
from pandas.io.data import DataReader

# For time stamps
from datetime import datetime

# suppressing warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:

# interactive plots

import plotly.plotly as py
import cufflinks as cf
import plotly.tools as tls
tls.set_credentials_file(username='affinito', api_key='')
#from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
init_notebook_mode()

Data Retrieving¶

In [4]:

# The tech stocks we'll use for this analysis
tech_list = ['AAPL','GOOG','MSFT','AMZN']

# Set up End and Start times for data grab
end = datetime.now()
start = datetime(end.year - 1, end.month, end.day)


#For loop for grabing yahoo finance data and setting as a dataframe

for stock in tech_list:   
    # Set DataFrame as the Stock Ticker
    globals()[stock] = DataReader(stock,'google',start,end)

In [182]:

#AAPL.to_csv("files/apple.csv")
GOOG.ix[:10,:-8]

Out[182]:

	Open	High	Low	Close	Volume
Date
2015-02-04	529.24	532.67	521.27	522.76	1659125
2015-02-05	523.79	528.50	522.09	527.58	1844687
2015-02-06	527.64	537.20	526.41	531.00	1758650
2015-02-09	528.00	532.00	526.02	527.83	1264276
2015-02-10	529.30	537.70	526.92	536.94	1745076
2015-02-11	535.30	538.45	533.38	535.97	1373970
2015-02-12	537.25	544.82	534.67	542.93	1615824
2015-02-13	543.35	549.91	543.13	549.01	1895126
2015-02-17	546.83	550.00	541.09	542.84	1612439
2015-02-18	541.40	545.49	537.51	539.70	1449089

In [5]:

cf.set_config_file(offline=True, world_readable=True, theme='ggplot')

GOOG['Close'].iplot(kind='scatter', title="Google stocks closing prices", yTitle='Price',
                    theme='white', bestfit=True)

In [6]:

GOOG['Volume'].iplot(title='Google stocks volume', theme='pearl', yTitle='Volume', bestfit=False )

Moving Average¶

Let's see the moving average of the apple stock price:

A widely used indicator in technical analysis that helps smooth out price action by filtering out the “noise” from random price fluctuations. A moving average (MA) is a trend-following indicator because it is based on past prices. The two basic and commonly used MAs are the simple moving average (SMA), which is the simple average of a security over a defined number of time periods, and the exponential moving average (EMA), which gives bigger weight to more recent prices.

In [7]:

mov_avg = [10, 20, 30]
for m in mov_avg:
    GOOG[ str(m)+" MA" ] = pd.rolling_mean( GOOG['Close'], m)

In [79]:

GOOG.tail()

Out[79]:

	Open	High	Low	Close	Volume	10 MA	20 MA	30 MA
Date
2016-01-28	722.22	733.69	712.35	730.96	2658016	709.691	721.9175	732.139000
2016-01-29	731.53	744.99	726.80	742.95	3394935	712.514	720.5150	732.124000
2016-02-01	750.46	757.86	743.27	752.00	4801816	718.269	720.1710	731.921000
2016-02-02	784.50	789.87	764.65	764.65	6332431	724.555	721.3115	732.428333
2016-02-03	770.22	774.50	720.50	726.95	6171019	727.405	720.5300	732.016333

Moving averages versus the real closing value zoomed on last 100 market days.

In [8]:

GOOG[['Close','10 MA','20 MA','30 MA']][-100:].iplot(theme='white')

While the exponential moving average gives more weight to latest data, i.e. react faster to recent price changes thanks to exponentially decreasing weights associated to oldest data.

So let's have a look at the differences in a 26 day (we'll see later why) window:

In [9]:

window = 26

GOOG[str(window)+' MA'] = pd.rolling_mean(GOOG['Close'], window)

ema_center = (window-1)/2
GOOG[str(window)+' EMA']= pd.ewma(GOOG['Close'], com=ema_center)

GOOG[['Close',str(window)+' MA', str(window)+' EMA']].iplot(theme='white')

MACD¶

Moving average convergence divergence (MACD) is a trend-following momentum indicator that shows the relationship between two moving averages of prices. The MACD is calculated by subtracting the 26-day exponential moving average (EMA) from the 12-day EMA. A nine-day EMA of the MACD, called the "signal line", is then plotted on top of the MACD, functioning as a trigger for buy and sell signals.

In [10]:

GOOG['12 EMA'] = pd.ewma(GOOG['Close'], com=(12-1)/2)
GOOG['MACD']   = GOOG['12 EMA'] - GOOG['26 EMA']

#GOOG[['Close','MACD','26 MA']].iplot(theme='white')

GOOG[['Close','26 MA','MACD']].iplot( subplots=True, shape=(3,1), shared_xaxes=True,
                                     title='Comparison between prevision power of MA and MACD', size=10 )

Daily return analysis¶

Now it's time to study the risk of the stock, that similar to ROI, is used to evaluate the efficiency of an investment or to compare the efficiency of a number of different investments. To calculate ROI, the benefit of an investment is divided by the cost of the investment, and the result is expressed as a percentage.

In [11]:

GOOG['Daily return'] = GOOG['Close'].pct_change()
GOOG['Daily return'].iplot(theme='white', yTitle='daily benefit')

What's happened on the 17 of July ?

Google gains billions in value as YouTube drives ad growth : Google Inc's (GOOGL.O) shares closed up 16.3 percent at $699.62 on Friday, adding about $65 billion to its market value, as strong growth in YouTube viewership eased investor concerns about Facebook Inc's (FB.O) push into video.

It’s official: Google books biggest day in history, adding $66.9B : Google Inc.’s stock closed at a record $699.62 on Friday, delivering $66.9 billion to investors in one day — a record for Wall Street. Shares of Google GOOGL, -3.60% GOOG, -3.45% skyrocketed 16.3% on Friday, the company’s biggest one-day percentage gain since April 2008. The increase raised Google’s market capitalization by $66.9 billion to $478 billion, according to FactSet.

Great, now let's get an overall look at the average daily return using a histogram. We'll see a histogram with the percentage of having a given daily benefit.

In [12]:

GOOG['Daily return'].iplot(theme='white', kind='hist', histfunc='count', histnorm='percent')

Now let's compare all the closing prices of the different stocks

In [13]:

closing_df = DataFrame(data=[GOOG['Close'], AAPL['Close'], MSFT['Close'], AMZN['Close']])
closing_df = closing_df.transpose()
closing_df.columns = ['Google','Apple','Microsoft','Amazon']

# making the avg daily return dataframe
daily_returns_df = closing_df.pct_change()
daily_returns_df.head()

Out[13]:

	Google	Apple	Microsoft	Amazon
Date
2015-04-06	NaN	NaN	NaN	NaN
2015-04-07	0.000484	-0.010522	-0.000241	-0.006975
2015-04-08	0.008547	-0.003254	-0.002649	0.018135
2015-04-09	-0.001532	0.007643	0.001449	0.006139
2015-04-10	-0.001424	0.004267	0.005786	-0.002320

Risk analysis¶

Risk analysis refers to the uncertainty of forecasted future cash flows streams, variance of portfolio/stock returns, statistical analysis to determine the probability of a project's success or failure, and possible future economic states.

One of the ways to quantify risk, is using the information we've gathered on daily percentage returns is by comparing the expected return with the standard deviation of the daily returns.

In [14]:

risk_df = DataFrame( [daily_returns_df.dropna().mean(), daily_returns_df.dropna().std()] ) 
risk_df = risk_df.transpose()
risk_df.columns = ['mean','std']
#risk_df.index = ['mean','std']
risk_df

Out[14]:

	mean	std
Google	0.001492	0.019200
Apple	-0.000492	0.017192
Microsoft	0.001338	0.017787
Amazon	0.002019	0.022792

In [15]:

daily_returns_df.iplot(theme='white', kind='box', yTitle='Risk')

Now, refering to the Google daily return histogram above let's get the amount of risk for the stock:

In [16]:

daily_returns_df['Google'].quantile(0.05)

Out[16]:

-0.022848104408675803

The 0.05 empirical quantile of daily returns is at about -0.023. That means that with 95% confidence, our worst daily loss will not exceed 2.3%.

### Correlation
Now let's check if there's any linear correlation between these stocks. We can do this through Seaborn pairplot which lets us see it in terms of Pearson product-moment correlation coefficient.

Remind Expectation doesn't have multiplicativity property in general, unless the variables are uncorrelated, this discrepancy is measured by covariance, that's why the graph isn't simmetric.

We'll discover two interesting high correlations.

In [23]:

sns.pairplot(daily_returns_df.dropna(), size=2.6)

Out[23]:

<seaborn.axisgrid.PairGrid at 0x18b221beba8>

We haven't found any paper or piece of news justifying the high correlation between Google vs Amazon stocks and Apple vs Microsoft stocks. Just coincidence ? Or maybe Apple and Google could be seen as some of the mayor clustering nodes of the higly connected (scale-free)[1] stock market network and so they're probably highly correlated with other stocks?

Statistical analysis of financial networks

In [22]:

sns.jointplot('Google','Amazon', daily_returns_df, kind='scatter', color='seagreen')

Out[22]:

<seaborn.axisgrid.JointGrid at 0x18b221beda0>

In [18]:

sns.jointplot('Apple','Microsoft', daily_returns_df, kind='scatter', color='seagreen')

Out[18]:

<seaborn.axisgrid.JointGrid at 0x18b20810518>

Correlation heatmap¶

In [19]:

daily_returns_df = daily_returns_df.dropna()
# rowvar=0 -> observations are in the columns
daily_corr = np.corrcoef(daily_returns_df[['Apple','Microsoft']].values,  rowvar=1)

corr_min = daily_corr.min()
corr_max = daily_corr.max()

Next in a heatmap trace object, we overwrite the default minimum and maximum color scale values (with 'zmin' and ' zmax' respectively) to round up the color bar's range.

In [21]:

from plotly.graph_objs import *


corr_trace = Heatmap(
    z=daily_corr,  # correlation as color contours 
    x=daily_returns_df.index,      # sites on both
    y=daily_returns_df.index,      #  axes
    zauto=False,  # (!) overwrite Plotly's default color levels
    zmin=corr_min,     # (!) set value of min color level
    zmax=corr_max,       # (!) set value of max color level
    colorscale='YIOrRd', # light yellow-orange-red colormap
    reversescale=True    # inverse colormap order
)

heatmap_title = "Apple-Microsoft yearly correlation heatmap from "+GOOG.index[0].strftime("%d/%m/%y")

# Make layout object
layout = Layout(
    title=heatmap_title,  # set plot title
    autosize=False,       # turn off autosize 
    height=500,           # plot's height in pixels 
    width=800,            # plot's width in pixels
    margin=Margin(l=130), # adjust left margin
)

corr_data = Data([corr_trace])
fig = Figure( data=corr_data, layout=layout)
py.iplot(fig, filename='apple-microsoft-correlation-heatmap')

# remember you can zoom on the psychedelic figure!

Out[21]:

Supervised Learning¶

## Google Trends in Simple Linear Regression

Stock price forecasting is a popular and important topic in financial and academic studies. Time series analysis is the  most common and fundamental method used to perform this task. This paper aims to combine the conventional time series analysis technique with information from the Google trend website and  the Yahoo finance website to predict weekly changes in stock price. Important news/events related to a selected stock over a five year span are recorded and the weekly Google trend index values on this stock are used to provide a measure of the magnitude of these events. The result of this experiment shows significant correlation between the changes in weekly stock prices and the values of important news/events computed from the Google trend website. The algorithm proposed in this paper can potentially outperform the conventional time series analysis in stock price forecasting. 
[Stock Price Forecasting Using Information from Yahoo Finance and Google Trend - Selene Yue Xu]

As stated in the above quotation, the relationship between weekly stock prices changes and Google Trend news is proved to be not an effect of random causes, i.e. it's statistically significant. So let's study if this correlation holds for this case.

google_trends contains the number of times the word "google" was searched in Google Financial News per week.

Before plotting the linear regression some criterion should be respected, such as the linearity between the two dataset can't be null. So we want to plot the percentual change of google trends over the percentual change of the google stock prices. What we expect is a very little correlation, thus no studies found any significant correlation for the word google itself.

In [530]:

end = datetime.now()
start = datetime(end.year - 6, end.month, end.day)
GOOG = DataReader('GOOG','google', start, end)
GOOG['Daily return'] = GOOG['Close'].pct_change()
print ( "Number of values : "+str(GOOG.size))
GOOG.head()

Number of values : 9030

Out[530]:

	Open	High	Low	Close	Volume	Daily return
Date
2010-02-10	266.77	268.63	263.58	266.96	NaN	NaN
2010-02-11	265.86	269.97	264.49	267.93	NaN	0.003634
2010-02-12	266.22	268.31	264.98	266.29	NaN	-0.006121
2010-02-16	268.30	271.79	266.88	270.38	NaN	0.015359
2010-02-17	270.73	271.43	268.54	268.83	NaN	-0.005733

In [488]:

google_trends = pd.read_csv("files/trendGoogle_financeNews_2010-2016.csv")
# 319 values, but we need to divide for 5 days weeks, so let's keep 315
google_trends = google_trends[5:-3]                                    # to have equal starts
google_trends.columns= ['Date','Value']
google_trends.index = pd.Series( range(google_trends.index.size ) )    # google_trends.index.size = 310
google_trends[['Value']] = google_trends[['Value']].astype(int)
google_trends.head()

Out[488]:

	Date	Value
0	2010-02-07 - 2010-02-13	32
1	2010-02-14 - 2010-02-20	32
2	2010-02-21 - 2010-02-27	31
3	2010-02-28 - 2010-03-06	28
4	2010-03-07 - 2010-03-13	31

Dividing the values by 5-days weeks and averaging

In [465]:

num_of_groups = len( GOOG['Daily return'])/5
drw_mean = np.array([0.0]*int(num_of_groups))
i = 0
for v in np.split( GOOG['Daily return'].values, num_of_groups):
    drw_mean[i] = v.mean()
    i+=1
    
print(" Num. of weeks available: "+str(num_of_groups))

 Num. of weeks available: 301.0

In [468]:

print( drw_mean[:10] )

[        nan -0.00247831  0.00519872  0.01124714 -0.00372487 -0.00286979
  0.0034982  -0.0002987  -0.00506958 -0.00185979]

Now we can create the dataframe containing the two percentage series to plot on a correlation figure:

In [40]:

# let's generate the new dataframe
#trend_price_df = DataFrame( [google_trends['Value'][:300].pct_change(), drw_mean] )
#trend_price_df = trend_price_df.transpose()
#trend_price_df.columns = ['trend pct','stock pct']

trend_price_df = pd.read_csv("files/trend_price.csv", index_col=0)    # to avoid to get different data in future executions 
#trend_price_df.to_csv("files/trend_price.csv")

trend_price_df = trend_price_df.dropna()         # who needs null values ?
trend_price_df.head()

Out[40]:

	trend pct	stock pct
1	0.000000	-0.002478
2	-0.031250	0.005199
3	-0.096774	0.011247
4	0.107143	-0.003725
5	-0.096774	-0.002870

In [41]:

# ..
sns.jointplot('trend pct','stock pct', trend_price_df, kind='scatter', color='seagreen', size=7)

Out[41]:

<seaborn.axisgrid.JointGrid at 0x20759a32cc0>

So as we expected, the PPMCC is low (0.13), but it could still be interesting, although not significant, to try a prevision through a linear regression.

Linear Regression¶

To perform supervised learning we have to chose how to represent our function. In this case we could approximate y, which corresponds to the vector of output values, as a linear function of x (called input or feature) :

$ h_\theta (x_i) = \hat y_i = \theta_0 + \theta_i x_{i} + \epsilon $

where θ_i are the parameters or weights and ε is the error called residual. Since this is the case of a univariate linear regression, which is why we have just one x_i.

Now for running the prediction we want to minimize the distance between the hypotheses h(x) and y. We can do this throgh a cost function named *least-squares cost function* J(θ) :

$ J(\theta) = \frac{1}{2} \sum_{i=1}^{m}( h(x_i) - y_i)^2 $

To minimize this cost function an algorithm called *gradient descent* is used. Where iteratively the function is derived respect to a secifical parameter, with the goal of making a step in the deepest decrease of J.

Numpy has a built in Least Square Method in its linear algebra library. We'll use this first for our Univariate regression and then move on to scikit learn for out Multi variate regression.

So we'll look at our line as y = mx+ b but we need to transform it in matrix form in order to use numpy.linalg.lstsq: y= Ap, where A=[x 1] and p=[m b]

In [42]:

import sklearn
from sklearn.linear_model import LinearRegression

#trend_price_df = trend_price_df.dropna()
X = np.array([ [value, 1] for value in trend_price_df['trend pct']])

Y = trend_price_df['stock pct']
m, b = np.linalg.lstsq(X, Y)[0]

Now that we have found out our linear coefficients, we can plot the line:

In [516]:

x = trend_price_df['trend pct']
plt.plot(x, Y,'o')

plt.plot(x, m*x + b, 'r', label='Best Fit Line')

Out[516]:

[<matplotlib.lines.Line2D at 0x1880b3c9940>]

RMSE : Now it's time to find the error of the least square method, for each element, it checks the difference between the line and the true value, squares it, and returns the sum of all these residuals.

$ RMSE = \sqrt{\frac{\sum_{i=1}^n (\hat y_i - y_i)^2}{n}} $

To train my model we used 70% of the input feature and 30% for the testing.

In [67]:

from sklearn import linear_model

# Get the resulting array
result = np.linalg.lstsq(X,Y)

# Get the total error and the residuals vector
lin_sq_residual_sum = result[1]

train_size = int(np.ceil( 0.7*len(X) ))
X_train = X[:train_size]
X_test  = X[-train_size:]
Y_train = Y[:train_size]
Y_test  = Y[-train_size:]

lin_reg = linear_model.LinearRegression().fit(X_train, Y_train)
uni_lin_residuals = lin_reg.predict(X_test) - Y_test

# Get the root mean square error
lin_rmse = np.sqrt( lin_sq_residual_sum /len(X) )[0]

# Print
print ("The root mean squared error was:\t {:1.5f}\
        \n\t std( residuals ):\t\t {:1.5f}".format( lin_rmse, np.std( uni_lin_residuals )))

The root mean squared error was:	 0.00727        
	 std( residuals ):		 0.00668

In [114]:

np.savetxt("files/uni_lin_residuals.csv", uni_lin_residuals, delimiter=',', fmt='%1.6f')

Since the root mean square error (RMSE) corresponds approximately to the standard deviation we can now say that the price of a stock won't vary more than 2 times the RMSE 95% of the time ( 68–95–99.7 rule ).

Multivariate linear regression¶

Let's now use more than one feature to fit our linear regression model, i.e. we'll fit it with Apple, Microsoft, Amazon and Twitter stock information.

In [29]:

tech_list = ['AAPL','MSFT','AMZN','TWTR','GOOG']
# nasdaq ? 
end = datetime.now()
start = datetime(end.year - 4, end.month, end.day)
stocks = []
for t in tech_list:   
    stocks.append( DataReader(t, 'google', start, end) )

#        TWTR!        
# start  2013-11-07   

In [567]:

stocks[4].size

Out[567]:

Let's fit a linear model with our features :

In [31]:

import sklearn
from sklearn.linear_model import LinearRegression

X_stocks = DataFrame( [stocks[0]['Close'].pct_change(), stocks[1]['Close'].pct_change(), 
                       stocks[2]['Close'].pct_change(), stocks[3]['Close'].pct_change() ] )
X_stocks = X_stocks.transpose()
X_stocks.columns = tech_list[:-1]
Y_target = stocks[4]['Close'].ix['2013-11-07': ].pct_change()

X_stocks = X_stocks.dropna()[-500:]
Y_target = Y_target.dropna()[-500:]

lreg = LinearRegression()
lreg.fit(X_stocks, Y_target )    # fits a linear model

print (' The estimated intercept coefficient is %.5f ' %lreg.intercept_)
print (' The number of coefficients used was %d ' % len(lreg.coef_) )

 The estimated intercept coefficient is 0.00032 
 The number of coefficients used was 4

Let's build a common dataframe :

In [88]:

X_stocks = pd.read_csv("files/X_stocks.csv", index_col=[0])
Y_target = pd.read_csv("files/Y_target.csv", index_col=[0], names=['Date','Target Value'] )
X_stocks.head()

Out[88]:

	AAPL	MSFT	AMZN	TWTR
Date
2014-02-28	-0.002653	0.011886	0.005470	-0.015420
2014-03-03	0.002793	-0.013835	-0.006407	-0.021854
2014-03-04	0.006632	0.016675	0.011451	0.010613
2014-03-05	0.002108	-0.007810	0.023276	0.001842
2014-03-06	-0.003024	0.001050	-0.000564	0.008275

And let's see the linear coefficient associated to each feature :

In [89]:

coef_df = DataFrame( X_stocks.columns )
coef_df.columns = ['Feature']
coef_df["Coefficient Estimate"] = pd.Series( lreg.coef_ )
coef_df

Out[89]:

	Feature	Coefficient Estimate
0	AAPL	-0.056726
1	MSFT	0.095380
2	AMZN	0.015966
3	TWTR	0.018711

Where we can see the highest correlation with Microsoft stocks.

Now let's split our data in order to perform cross validation, then we check again our accuracy through the RMSE.

In [90]:

X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(X_stocks, Y_target)

# Print shapes of the training and testing data sets
print (X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

(375, 4) (125, 4) (375, 1) (125, 1)

In [99]:

lreg.fit(X_train,Y_train)
pred_train = lreg.predict(X_train)
pred_test  = lreg.predict(X_test)

print ("MSE on training set:\t %.6f"  % np.mean((Y_train - pred_train) ** 2) )    
print ("MSE on testing set:\t %.6f"  %np.mean((Y_test - pred_test) ** 2) )

MSE on training set:	 0.000228
MSE on testing set:	 0.000398

In [110]:

mlin_residuals  =  (Y_test - pred_test).values
mlin_RMSE = np.sqrt( np.mean( mlin_residuals** 2))

print(" Multivariate linear regression \n\tRMSE : {:1.5f}\
        \n\t std : {:1.5f}".format(mlin_RMSE, np.std(mlin_residuals)) )

 Multivariate linear regression 
	RMSE : 0.01994        
	 std : 0.01992

In [115]:

np.savetxt("files/mlin_residuals.csv", mlin_residuals, delimiter=',', fmt='%1.6f')

The difference between the observed value of the dependent variable (y) and the predicted value (h(x)) is called the **residual** (e). Each data point has one residual, so that *residual = ObservedValue − PredictedValue*.
You can think of these residuals in the same way as the θ value we discussed earlier, in this case however, there were multiple data points considered, such that we have the more general equation

$ h_\theta (x) = \sum_{i=0}^{m}{\theta_i x_i}$

A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

In [595]:

# Scatter plot the training data
train = plt.scatter( pred_train, (pred_train-Y_train), c='b', alpha=0.5)

# Scatter plot the testing data
test = plt.scatter( pred_test, (pred_test-Y_test), c='r', alpha=0.5, )

# Plot a horizontal axis line at 0
plt.hlines( y=0, xmin=-0.015, xmax=0.015)

#Labels
plt.legend((train,test),('Training','Test'), loc='lower left')
plt.title('Residual Plots')

Out[595]:

<matplotlib.text.Text at 0x1880c921710>

SVR¶

In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of training examples, each marked for belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

The advantages of support vector machines are:

Effective in high dimensional spaces.
Still effective in cases where number of dimensions is greater than the number of samples.
Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

If the number of features is much greater than the number of samples, the method is likely to give poor performances.
SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation.

This technique was developed in three major steps. 2. First, assuming that the two classes of training examples can be separated by a hyperplane, the optimal hyperplane is the one that separates the training examples with the widest margin [1]. 2. Second, it were incorporated kernel function into the maximum margin models. Kernels allow SVM to implicitly construct the optimal hyperplane in the feature space, and the resulting nonlinear model is important for modeling real data. 2. in case the training examples are not linearly separable, Cortes and Vapnik [2] showed that the some margin can be applied, allowing some examples to violate the margin condition (soft margin).

So, Given training vectors x_i in ℝ^p, i=1,..., n, and a vector y in ℝⁿ ε-SVR solves the following primal problem:

As the constraints in the primal form are not convenient to handle, people have conventionally resorted to the dual problem, which is again a quadratic program, but with much simpler constraints: box constraints plus a single linear equality constraint.

The training examples can be divided into three categories according to the value of α*_i.

If α*_i= 0, it means the corresponding training example does not affect the decision boundary, and in fact it lies beyond the margin, that is, y_i( < w, x_i > +b ) > 1.
If α*_i ∈(0, n⁻¹), then the training example lies on the margin.
If α*_i = n⁻¹ it means the training example violates the margin.

In the latter two cases where α*_i > 0, the i-th training example is called a support vector.

e is the vector of all ones, C > 0 is the upper bound, Q is an n by n positive semidefinite matrix, Q_ij=K(x_i, x_j) = φ(x_i)^T φ(x_j) is the kernel.

REF:¶

Let's now define our training sets end testing sets.

Here you can see too the python settings for the model, but we'll do the job in R, which has a very nice ad-hoc package (e1071)

In [116]:

from sklearn.svm import SVR
from sklearn.cross_validation import train_test_split
'''
C       Penalty parameter of the error term.
epsilon Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the 
        training loss function with points predicted within a distance epsilon from the actual value.
kernel  Specifies the kernel type to be used in the algorithm. RBF = radial basis function.
gamma   Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. If gamma is ‘auto’ then 1/n_features will be used

'''

svr_model = SVR(kernel='rbf', gamma='auto', C=1.0, epsilon=0.1, cache_size=500 )

#  Split the data into Trainging and Testing sets
X_train, X_test, Y_train, Y_test = train_test_split( X_stocks, Y_target, train_size=0.7,  random_state=42 )

print ("Subset shapes: X: {}, {}, y:{}, {}".format(X_train.shape, X_test.shape,Y_train.shape,Y_test.shape ))


#  Fit the SVM model according to the given training data.

xxTrain = np.array( X_train.as_matrix() )
yyTrain = np.array( Y_train.as_matrix() )

Subset shapes: X: (350, 4), (150, 4), y:(350, 1), (150, 1)

In [ ]:

SVR in R¶

The model used by svm function is an epsilon-SVR. Here, the data points lie in between the two borders of the margin which is maximized under suitable conditions to avoid outlier inclusion. I.e. we have a ε-insensitive loss function defined as $ |y - f(x)|_{\epsilon} := max \left\{ 0, |y - f(x)| - \epsilon \right\} $. Which is our minimization problem (see above):

$ min \left\{ \frac{1}{2}|w|^2 + \frac{C}{m}\sum_{i=1}^{m}{ |y_i - f(x_i)|_{\epsilon} } \right\} $

Ref¶

In [5]:

# http://rpy.sourceforge.net/rpy2/doc-2.4/html/interactive.html#module-rpy2.ipython.rmagic
%load_ext rpy2.ipython 
%load_ext rmagic 

In [6]:

from rpy2.robjects.packages import importr
utils = importr('utils')
utils.install_packages('e1071')

Out[6]:

rpy2.rinterface.NULL

In [117]:

%%R -i xxTrain,yyTrain,Y_target -o svrPredictionRMSE,SVR_error
library(e1071)

rmse <- function(error)
{
  sqrt(mean(error^2))
}

model      <- svm( yyTrain ~ xxTrain )
predictedY <- predict(model)

SVR_error <- Y_target - predictedY
svrPredictionRMSE <- rmse( SVR_error)

In [118]:

print( " RMSE for the multivariate SVR : {:4.4f}".format( svrPredictionRMSE[0])) 

 RMSE for the multivariate SVR : 0.0173

Could it be improved ?

In our previous example, we performed an epsilon-regression, we did not set any value for epsilon ( ϵ ), but it took a default value of 0.1. There is also a cost parameter which we can change to avoid overfitting (by default is 1) [svm doc].

In [36]:

%%R -i xxTrain,yyTrain
# perform a grid search
tuneResult <- tune(svm, yyTrain ~ xxTrain, ranges = list(epsilon = seq(0,1,0.05), cost = 2^(0:9)))

# Draw the tuning graph
plot(tuneResult)

So, since lower the error the better is our model, now let's zoom in the darker region of our heatmap.

In [120]:

%%R -i xxTrain,yyTrain -o tuneResult
# perform a grid search
tuneResult <- tune(svm, yyTrain ~ xxTrain, ranges = list(epsilon = seq(0,1,0.05), cost = (1:10)))

# Draw the tuning graph
plot(tuneResult)

In [39]:

print(tuneResult)


Parameter tuning of 'svm':



- sampling method: 10-fold cross validation 



- best parameters:

 epsilon cost

       1    1



- best performance: 0.0002284055

In [121]:

%%R -i xxTrain,yyTrain,Y_target -o tunedModelRMSE,SVR_error

tunedModel  <- tuneResult$best.model
tunedModelY <- predict(tunedModel, xxTrain) 
 
SVR_error <- Y_target - tunedModelY  
 
# the tune method  randomly shuffles the data
tunedModelRMSE <- rmse(SVR_error)

In [65]:

print( " RMSE \n\tfor the tuned SVR \t= {:4.4f} \
              \n\tfor the standard SVR \t= {:4.4f} ".format( tunedModelRMSE[0], svrPredictionRMSE[0])) 

improvement = tunedModelRMSE[0]/svrPredictionRMSE[0]

print( " We have improved our model of a {:.2f}%".format(improvement) )

 RMSE 
	for the tuned SVR 	= 0.0171               
	for the standard SVR 	= 0.0173 
 We have improved our model of a 0.99%

Well it seems standard values (epsilon=0.1 and cost=1) were good enough for our model. But It is a good practice to prove it and to try an improvement, which we did, even if a little bit.

In [122]:

np.savetxt("files/SVR_error.csv", SVR_error, delimiter=',', fmt='%1.6f')

Random Forest¶

Random Forest is an ensemble learning technique. It is a hybrid of the Bagging algorithm and the random subspace method, and uses decision trees as the base classifier.

Each tree is constructed from a bootstrap sample from the original dataset. To diversify the classifiers, at each branch in the tree, the decision of which feature to split on is restricted to a random subset of size n, from the full feature set. The random subset is chosen anew for each branching point. *n* is suggested to be *log(N + 1)*, where *N* is the size of the whole feature set.

Decision trees are built in a top-down fashion, with a recursive partitioning for selecting the root of the tree (and his subtrees), splitting the sets of examples in disjoint sets and adding respective nodes and branches to the tree. Typical attribute selection criteria use a function t omeasure the impurity of a node, that is the degree t owhich the node contains only examples of a single class. Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree [1].

With CART decision trees the impurity measure used for classification is the Gini index, defined as below, where S_i is the subset of the training examples S relative to the class c_i. While the MSE is used for the regression.

$ Gini(S) = 1 - \sum_{i=1}^{c} {\left( \frac{|S_i|}{|S|} \right) }^2 $

After some trials I have found I could reduce significantly the number of grown trees from 500 (the default value [4]) to 60 without loosing much accuracy, that is about 8%, but 2.7*10^-5 in absolute value [3], but for the final comparison I used the 300 trees RF in order to gain the minimum RMSE.

Ref:¶

In [7]:

%R install.packages('randomForest')

Let's see a chart of variable importance as measured by the Random Forest

In [126]:

%%R -o rf_300,rf_mse,rf_residuals
library(randomForest)
X_stocks= read.csv("files/X_stocks.csv", header=TRUE)
Y_target = read.csv("files/Y_target.csv", header=FALSE)

# seeding the random sampling generator to have reproducible results
set.seed(42)

rf_300 = randomForest( matrix(Y_target[,2], nrow=500, byrow=TRUE), x=X_stocks[,c(2,3,4,5)], ntree=300, mtry=1, replace=TRUE)
rf_mse = rf_300$mse

rf_residuals = Y_target - predict(rf_300)

varImpPlot(rf_300)

In [27]:

rf_RMSE= np.sqrt(np.mean( rf_mse ))
print(rf_RMSE)

0.0173996738908

It's interesting too, to note how the the importance of the variables in the decision trees were proportional to their correlation to the target variable.

In [138]:

rf_residuals = rf_residuals['V2']
np.savetxt("files/rf_residuals.csv", rf_residuals, delimiter=',', fmt='%1.6f')

Conclusions¶

Surprisingly the best result (0.007) is reached through the simple linear regression on the google trend analysis. As said this was focused on searches about finance, on a 300 weeks period (70% of which was used to train the algorithm). Trying to predict percentual price changes per week instead of per day could have improved the accuracy. But since there was no linear correlation this result should be ignored.

The second lowest RMSE through the SVR with a value of 0.01715 and the residuals distribution centered on zero.

The random forest directly follows with almost the same values for RMS (0.0174) and residual distribution, which is just a little more sqeezed.

The multivariate linear regression still performs similar, returning an RMS of about 0.02 but with a left skewed residual distribution.

In [144]:

RMSE_list = [ lin_rmse, mlin_RMSE, tunedModelRMSE[0], rf_RMSE]

print( "RMSE summary\n\t univ. linear regression on google trend:\t {:1.3f} \
                     \n\t multivariate linear regression:\t\t {:1.5f} \
                     \n\t SVR : \t\t\t\t\t\t {:1.5f} \
                     \n\t Random forest (300 trees):\t\t\t {:1.5f}".format( *RMSE_list))

RMSE summary
	 univ. linear regression on google trend:	 0.007                      
	 multivariate linear regression:		 0.01994                      
	 SVR : 						 0.01715                      
	 Random forest (300 trees):			 0.01740

In [45]:

residuals_names = [ 'uni_lin_residuals', 'mlin_residuals', 'SVR_error', 'rf_residuals']
residuals = []
for r in residuals_names:
    path = "files/"+r+".csv"
    r_csv = pd.read_csv(path, header=None,)
    residuals.append(r_csv)

fig, ax = plt.subplots(  figsize=(15, 10))
ax.boxplot(residuals, patch_artist=True, notch=False, labels=residuals_names )
ax.set_title("RESIDUALS BOXPLOT CONFRONTATION ")
ax.set_ylim(-0.06, 0.05)
plt.show()

In [ ]: