Quentin Batista
The University of Tokyo
Relative valuation is a popular technique which relies on the market value of similar assets to price a given asset. For example, if you want to price a laundry business, you might want to calculate the price to earnings ratio of other similar businesses and multiply this ratio by the earnings of the laundry business to obtain a valuation. Sheridan and Martin describe this methodology as a 4-step process:
Step 1: Identify similar or comparable investments and recent market prices for each.
Step 2: Calculate a valuation metric for use in valuing the asset.
Step 3: Calculate an initial estimate of value.
Step 4: Refine or tailor your initial valuation estimate to the specific characteristics of the investment.
Some of the most commonly used multiples are:
Price to earnings (P/E)
Market to book value of equity
Enterprise Value (EV) to EBITDA
Enterprise Value (EV) to revenue
Enterprise Value (EV) to cash flow
This list is not exhaustive, and it is possible for you to create your own multiple. This is particularly popular in the technology sector where analysts have come up with multiples such as Enterprise Value to unique visitors or website hits. In doing so, you must ensure that the components of the multiple are consistent with each other. For example, you might consider using the price to sales ratio as a valuation multiple. However, an implicit assumption behind this multiple is that comparable companies have identical capital structures, which is very rarely the case in practice. When this assumption is violated, the multiple becomes inconsistent because it is impacted by the amount of debt that a company has relative to its equity.
Finally, a key step in applying this methodology is to determine which multiple is appropriate for the asset you are trying to value. For example, valuing young firms and startups using the P/E ratio is likely to be inappropriate if those firms have negative or highly volatile earnings. Instead, using the EV to sales ratio would likely give a better estimate. Additionally, it is important to realize that these multiples have different characteristics. While EV to EBITDA ignores the firm's CapEx, depreciation, and capital structure, while the P/E ratio takes those into account. Using these multiples concurrently allows you to see the big picture and understand what is driving the valuation of an asset.
In this notebook, we systematize the methodology introduced above for companies in the S&P 500 using two different Machine Learning approaches. First, we replicate Gael Varoquaux's analysis on scikit-learn which extracts a graphical structure from the correlation of intraday variations and then applies Affinity Propagation to group together stocks that behave similarly. Second, we analyze the companies' latest 10K using the Doc2Vec implementation in gensim to quantify similarity. In the first case, we use the average multiple in each cluster to value each company. In the second case, we pick the top 3 comparable companies and transform the similarity measures into weights to compute the average comparable multiple. We then apply this multiple to each company's financials to obtain a valuation.
The hope of using this approach is to capture relationships between companies that human analysts would not be able to perceive. Reading through the financial statements of all the companies in the S&P 500 and forming an opinion about how these companies relate to each other would be an extremely time-consuming process. As a result, most analysts on the Street cover a specific industry group, and therefore, they might not realize that a company they are researching is actually very similar to a company in classified in a different industry group. By using Machine Learning, we can alleviate this issue, and hopefully, gain some valuable insights about the stock market structure.
We scrape the list of companies in the S&P 500, their ticker and CIK code from Wikipedia. We then use the CIK code to scrape the latest 10K from EDGAR. There are 505 companies in the Wikipedia list because some companies trade under multiple symbols (for example, Discovery Communications Inc.). A few companies' financial statements are not available on EDGAR for various reasons -- we ignore those companies. We clean the data by removing "Table of Contents" markers when they exist, page numbers, line breaks, punctuations and numbers from the statements. We scrape company fundamentals and their historical prices from Yahoo! Finance. Note that we drop the companies with negative multiples to simplify valuation. While Yahoo! Finance might not be a particularly reliable source for getting data about company fundamentals, it is the best among free data sources.
!pip install numpy
!pip install pandas
!pip install beautifulsoup4
!pip install gensim
!pip install nltk
!pip install sklearn
!pip install bokeh
!pip install pandas_datareader
!pip install datetime
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
from collections import namedtuple
from gensim.models import doc2vec
from gensim.similarities import docsim
from gensim import corpora
import itertools
from string import punctuation, digits
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.manifold import TSNE
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, LabelSet, Legend, HoverTool
from bokeh.palettes import all_palettes
from pandas_datareader.data import DataReader
from datetime import date
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
def table_extractor(soup):
"""Extract the tables from a soup object"""
for table in soup.find_all("table"):
table.extract()
return soup
sp_500_wiki_link = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
soup_wiki = BeautifulSoup(urlopen(sp_500_wiki_link), 'html.parser')
table_wiki = soup_wiki.find("table", {"class": "wikitable sortable"})
# Fail now if the right table hasn't been found
header = table_wiki.findAll('th')
if header[0].string != "Ticker symbol" or header[1].string != "Security":
raise Exception("Can't parse wikipedia's table!")
# Retreive the values in the table
records = []
rows = table_wiki.findAll('tr')
for row in rows:
fields = row.findAll('td')
if fields:
# Get info and SEC company link
symbol = fields[0].string
wiki_link = "https://en.wikipedia.org/wiki" + fields[1].a.get('href')
CIK = fields[7].string
sec_company_link = "https://www.sec.gov/cgi-bin/browse-edgar?" + \
"action=getcompany&CIK=" + CIK + \
"&type=10-K&dateb=&owner=include&count=40"
name = fields[1].a.string
sector = fields[3].string
# Get link for the page with latest 10-K related filings
soup_comp = BeautifulSoup(urlopen(sec_company_link), 'html.parser')
table_comp = soup_comp.find("table", {"class": "tableFile2"})
try:
# Get the latest filing page
filings_link = "https://www.sec.gov" + \
table_comp.a.get('href')
# Get the link for the latest 10K
soup_filings = BeautifulSoup(urlopen(filings_link),
'html.parser')
table_filings = soup_filings.find("table", {"class": "tableFile"})
_10K_link = "https://www.sec.gov" + table_filings.a.get('href')
# Extracting the text from the latest 10K
try:
soup_latest_10K = BeautifulSoup(urlopen(_10K_link).read(),
'html.parser')
soup_latest_10K = table_extractor(soup_latest_10K)
_latest_10K_txt = soup_latest_10K.get_text()
except:
# If the latest 10K is not available, return N/A
_latest_10K_txt = np.nan
except:
# If the filings are not available, return N/A
_10K_link = np.nan
_latest_10K_txt = np.nan
# Append results
records.append([symbol, wiki_link, name, sector, sec_company_link,
CIK, _latest_10K_txt])
headers = ['Symbol', 'Wikipedia Link', 'Name', 'Sector', 'SEC Filings Link',
'CIK', 'Latest 10K']
data = pd.DataFrame(records, columns=headers)
# Correct ambiguous tickers for later purpose
ambiguous_tickers = ['BRK.B', 'BF.B']
corrected_tickers = ['BRK-B', 'BF-B']
for i, ticker in enumerate(ambiguous_tickers):
data['Symbol'] = data['Symbol'].replace(ticker, corrected_tickers[i])
def unit_converter(data):
billion = 1_000_000_000
million = 1_000_000
if data[-1] == 'B':
return float(data[:-1])*billion
elif data[-1] == 'M':
return float(data[:-1])*million
else:
return float(data)
items = ['Enterprise Value', 'Enterprise Value/Revenue', 'Diluted EPS',
'Enterprise Value/EBITDA', 'Revenue', 'EBITDA', 'Trailing P/E']
for i, ticker in enumerate(data['Symbol']):
key_stats_link = 'https://finance.yahoo.com/quote/' + ticker + \
'/key-statistics?p=' + ticker
key_stats_soup = BeautifulSoup(urlopen(key_stats_link).read(),
'html.parser').findAll('td')
for j, row in enumerate(key_stats_soup):
for item in items:
try:
if item == row.span.string:
# Dangerous assumption
data.loc[i, item] = \
unit_converter(key_stats_soup[j+1].string)
except:
next
Disclaimer: This scraper makes an important assumption regarding the structure of the html file. Specifically, it assumes that the data for each item of interest is the string of text located in the 'td' tag following the one where the name of the item was found.
from pandas_datareader.data import DataReader
from datetime import date
start = date(2013, 1, 2)
end = date.today()
data_source = 'yahoo'
# Sometimes fail -- retry if it does
historical_prices_panel = DataReader(data['Symbol'], data_source, start, end,
retry_count=5)
# Current price is last close
last_price = \
historical_prices_panel['Close'].tail(1).T.iloc[:, 0].rename('Current Price')
data = data.join(last_price, on='Symbol')
close_price = historical_prices_panel['Close']
open_price = historical_prices_panel['Open']
# Uncomment to save the data
# data.to_csv('10K_data.csv')
# historical_prices_panel['Close'].to_csv('close_price_data.csv')
# historical_prices_panel['Open'].to_csv('open_price_data.csv')
# Uncomment to read the data
# data = pd.read_csv('10K_data.csv')
# close_price = pd.read_csv('close_price_data.csv')
# open_price = pd.read_csv('open_price_data.csv')
# Remove companies without filings
no_filings_data = data[data['Latest 10K'].isnull()]
data = data[~data['Latest 10K'].isnull()]
# Remove duplicates (keep first)
data = data.drop_duplicates(keep='first')
# Drop Google duplicate
data = data[data['Symbol'] != 'GOOG']
# Drop NA rows (about 60 companies)
data = data.dropna()
# Drop companies with negative EPS or EBITDA (about 30 companies)
data = data[(data[['EBITDA', 'Diluted EPS']] > 0).all(1)]
# Reset index
data = data.reset_index(drop=True)
from collections import namedtuple
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation, digits
def _10K_string_cleaner(_10K):
# Lowercase the text
_10K = _10K.lower()
stopchar = punctuation + digits + '’“”'
for ch in stopchar:
# Replace stopchar by whitespace
_10K = _10K.replace(ch, ' ')
# Tokenize
_10K = word_tokenize(_10K)
# Remove stopwords
_10K = [word for word in _10K if word not in stopwords.words('english')]
return _10K
corpus = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(data['Latest 10K']):
corpus.append(analyzedDocument(_10K_string_cleaner(text), [i]))
Doc2Vec is an implementation of the Paragraph Vector unsupervised learning algorithm proposed by Quoc Le and Tomas Mikolov for learning fixed-length feature representations from variable-length pieces of texts. It aims to overcome some of the weaknesses of the more traditional bag-of-words (BOW) methodology.
from gensim.models import doc2vec
from gensim.similarities import docsim
model_NLP = doc2vec.Doc2Vec(corpus, size=500, window=300, min_count=1,
workers=4)
import itertools
similarity_matrix_NLP = np.empty(0)
length_docvecs = len(model_NLP.docvecs)
for item in itertools.product(range(length_docvecs), range(length_docvecs)):
similarity_matrix_NLP = np.append(similarity_matrix_NLP,
model_NLP.docvecs.similarity(*item))
similarity_matrix_NLP = similarity_matrix_NLP.reshape((length_docvecs,
length_docvecs))
We use three different methodologies for visualizing the structure of the stock market. First, we create a simple scatter plot of the companies' EV/EBITDA against their P/E ratio. Second, we use the correlation matrix of the S&P 500 stocks between the start of 2013 and the last price. Third, we use the similarity matrix outputted from the Doc2Vec model. In the last two cases, t-SNE is used for dimensionality reduction.
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, LabelSet, Legend, HoverTool
from bokeh.palettes import all_palettes
output_notebook()
category_items = data['Sector'].unique()
palette = all_palettes['Viridis'][len(category_items)]
colormap = dict(zip(category_items, palette))
data['Color'] = data['Sector'].map(colormap)
TOOLS = "crosshair,pan,wheel_zoom,reset,tap,save,box_select"
source = ColumnDataSource(dict(x=data['Trailing P/E'],
y=data['Enterprise Value/EBITDA'],
color=data['Color'],
label=data['Name'],
ticker=data['Symbol'],
sector=data['Sector']))
hover = HoverTool(tooltips=[
("index", "$index"),
("name", "@label"),
("sector", "@sector"),
("ticker", "@ticker"),
("(x,y)", "($x, $y)"),
])
p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)
labels = LabelSet(x='x', y='y', text='label', source=source,
text_font_size='8pt')
# Remove "legend='sector'," to remove legend
p.scatter(x='x', y='y', color='color', legend='sector', source=source)
p.title.text = "Current Valuation Structure"
p.title.text_font_size = "25px"
# Uncomment the following line to add labels
# p.add_layout(labels)
p.xaxis.axis_label = 'Trailing P/E'
p.yaxis.axis_label = 'EV/EBITDA'
show(p)
There is a clear positive relationship between a company's EV/EBITDA and P/E ratio. While most companies are clustered in the bottom left quadrant, there are a few outliers. We choose to keep those outliers for valuation because they still provide valuable information about companies as we will see later.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0, perplexity=10.0)
Y = tsne.fit_transform(similarity_matrix_NLP)
plotting_df = pd.concat([data[['Name', 'Symbol', 'Sector', 'Color']],
pd.DataFrame(Y, columns=['x', 'y'])], axis=1)
source = ColumnDataSource(dict(x=plotting_df['x'],
y=plotting_df['y'],
color=plotting_df['Color'],
label=plotting_df['Name'],
ticker=plotting_df['Symbol'],
sector=plotting_df['Sector']))
hover = HoverTool(tooltips=[
("index", "$index"),
("name", "@label"),
("sector", "@sector"),
("ticker", "@ticker"),
("(x,y)", "($x, $y)"),
])
p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)
labels = LabelSet(x='x', y='y', text='label', source=source,
text_font_size='8pt')
p.scatter(x='x', y='y', color='color', legend='sector', source=source)
p.title.text = "2D Company Similarity Visualization (NLP)"
p.title.text_font_size = "25px"
# Uncomment the following line to add labels
# p.add_layout(labels)
show(p)
While companies are mostly clustered by their industry groups, this analysis seems to capture relationships between companies which go beyond those. The fact that companies such as Activision Blizzard and Electronic Arts are close to each other suggests that the model is doing something right. However, note that it is possible that the model relies on trivial factors, for example, attributing a high similarity score to companies whose financial statements have the same auditor (say, because the writing style is similar).
from sklearn import cluster, covariance
# This code was adapted from Gael Varoquaux's work (see references)
# Calculate intraday variation
variation_df = close_price - open_price
variation_df = variation_df.T.reindex(data['Symbol'])
# Get name, sector and color from the data dataframe
variation_df = \
data[['Symbol', 'Name', 'Sector', 'Color']].join(variation_df, on='Symbol')
# Drop rows with NAs
variation_df = variation_df.dropna(axis=0)
# Data for the model
var_data = variation_df.drop(['Symbol', 'Name', 'Sector', 'Color'], axis=1).T
# Learn a graphical structure from the correlations
edge_model = covariance.GraphLassoCV()
# Standardize the time series: using correlations rather than covariance is
# more efficient for structure recovery
var_data /= var_data.std(axis=0)
edge_model.fit(var_data)
# Cluster using affinity propagation
_, labels = cluster.affinity_propagation(edge_model.covariance_)
n_labels = labels.max()
variation_df['Cluster'] = np.nan
for i in range(n_labels + 1):
variation_df.loc[labels == i, 'Cluster'] = i + 1
print('Cluster %i: %s' % ((i + 1),
', '.join(variation_df['Name'][labels == i])))
# Find a low-dimension embedding for visualization: find the best position of
# the nodes (the stocks) on a 2D plane
embedding = tsne.fit_transform(var_data.T)
# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.06)
# Plot the edges
start_idx, end_idx = np.where(non_zero)
segments = [[embedding.T[:, start], embedding.T[:, stop]]
for start, stop in zip(start_idx, end_idx)]
values = np.abs(partial_correlations[non_zero])
/Users/QBatista/anaconda/lib/python3.6/site-packages/numpy/linalg/linalg.py:1757: RuntimeWarning: invalid value encountered in slogdet sign, logdet = _umath_linalg.slogdet(a, signature=signature) /Users/QBatista/anaconda/lib/python3.6/site-packages/sklearn/covariance/graph_lasso_.py:230: RuntimeWarning: invalid value encountered in multiply * coefs) /Users/QBatista/anaconda/lib/python3.6/site-packages/sklearn/covariance/graph_lasso_.py:232: RuntimeWarning: invalid value encountered in multiply * coefs) /Users/QBatista/anaconda/lib/python3.6/site-packages/sklearn/covariance/graph_lasso_.py:252: ConvergenceWarning: graph_lasso: did not converge after 100 iteration: dual gap: -1.510e-04 ConvergenceWarning)
Cluster 1: Activision Blizzard, Electronic Arts Cluster 2: Alphabet Inc Class A, Amazon.com Inc, eBay, Expedia Inc., Facebook, Inc., Microsoft Corp., Netflix Inc., Priceline.com Inc, TripAdvisor Cluster 3: American Tower Corp A, Crown Castle International Corp., SBA Communications Cluster 4: Aetna Inc, Anthem Inc., Centene Corporation, CIGNA Corp., Humana Inc., United Health Group Inc. Cluster 5: Block H&R Cluster 6: BorgWarner, Delphi Automotive, Goodyear Tire & Rubber, LKQ Corporation Cluster 7: Cadence Design Systems, Synopsys Inc. Cluster 8: AmerisourceBergen Corp, Cardinal Health, CVS Health, Express Scripts, Henry Schein, McKesson Corp., Patterson Companies Cluster 9: Carnival Corp., Royal Caribbean Cruises Ltd Cluster 10: AbbVie Inc., Alexion Pharmaceuticals, Amgen Inc, Biogen Inc., Bristol-Myers Squibb, Celgene Corp., Gilead Sciences, Illumina Inc, Mylan N.V., Regeneron, Vertex Pharmaceuticals Inc Cluster 11: Chevron Corp., Exxon Mobil Corp., Occidental Petroleum, Schlumberger Ltd., TechnipFMC Cluster 12: Chipotle Mexican Grill Cluster 13: The Clorox Company, Colgate-Palmolive, Estee Lauder Cos., Kimberly-Clark, Procter & Gamble Cluster 14: Charter Communications, Comcast Corp. Cluster 15: Cimarex Energy, Concho Resources, Devon Energy, Newfield Exploration Co, Pioneer Natural Resources, Range Resources Corp. Cluster 16: Constellation Brands, Molson Coors Brewing Company Cluster 17: Costco Wholesale Corp., Wal-Mart Stores Cluster 18: Alaska Air Group Inc, American Airlines Group, Delta Air Lines, Southwest Airlines, United Continental Holdings Cluster 19: Discovery Communications, Discovery Communications, Viacom Inc. Cluster 20: Dish Network Cluster 21: Dollar General, Dollar Tree, Kroger Co., Target Corp. Cluster 22: Albemarle Corp, Eastman Chemical, FMC Corporation, LyondellBasell, The Mosaic Company, Nucor Corp. Cluster 23: Equifax Inc. Cluster 24: Extra Space Storage, Public Storage Cluster 25: Fastenal Co, Grainger (W.W.) Inc. Cluster 26: Accenture plc, Automatic Data Processing, Cognizant Technology Solutions, Fidelity National Information Services, Fiserv Inc, Gartner Inc, Intuit, Paychex Inc., Verisk Analytics Cluster 27: Flowserve Corporation, Fluor Corp., Freeport-McMoRan Inc., Jacobs Engineering Group, Quanta Services Inc. Cluster 28: Carmax Inc, Ford Motor, General Motors Cluster 29: Brown-Forman Corp., Campbell Soup, Conagra Brands, General Mills, The Hershey Company, Hormel Foods Corp., JM Smucker, Kellogg Co., McCormick & Co., Mondelez International, Sysco Corp., Tyson Foods Cluster 30: Hasbro Inc. Cluster 31: Home Depot, Lowe's Cos., Newell Brands, Tractor Supply Company Cluster 32: 3M Company, Amphenol Corp, Boeing Company, Cintas Corporation, Danaher Corp., General Electric, Genuine Parts, Harris Corporation, Honeywell Int'l Inc., Illinois Tool Works, Monsanto Co., Rockwell Collins, Roper Technologies, Snap-On Inc., Textron Inc., United Technologies Cluster 33: CBOE Holdings, CME Group Inc., Intercontinental Exchange, Nasdaq, Inc. Cluster 34: Avery Dennison Corp, Ball Corp, International Paper, Packaging Corporation of America, Weyerhaeuser Corp. Cluster 35: Interpublic Group, Omnicom Group Cluster 36: Affiliated Managers Group Inc, Archer-Daniels-Midland Co, BlackRock, CBRE Group, Corning Inc., Franklin Resources, Garmin Ltd., Invesco Ltd., S&P Global, Inc., T. Rowe Price Group, TE Connectivity Ltd., Western Union Co Cluster 37: Applied Materials Inc, KLA-Tencor Corp., Lam Research, Micron Technology, Nvidia Corporation Cluster 38: Lennar Corp., Pulte Homes Inc. Cluster 39: General Dynamics, L-3 Communications Holdings, Lockheed Martin Corp., Northrop Grumman Corp., Raytheon Co. Cluster 40: Host Hotels & Resorts, Marriott Int'l., Wyndham Worldwide Cluster 41: Aon plc, Arthur J. Gallagher & Co., Marsh & McLennan, Willis Towers Watson Cluster 42: Acuity Brands Inc, Fortune Brands Home & Security, Leggett & Platt, Masco Corp., Mohawk Industries, Whirlpool Corp. Cluster 43: Alliance Data Systems, Cerner, Global Payments Inc, Mastercard Inc., Total System Services, Visa Inc. Cluster 44: Mattel Inc. Cluster 45: MGM Resorts International, Wynn Resorts Cluster 46: Michael Kors Holdings Cluster 47: Gap Inc., Kohl's, Macy's Inc., Nordstrom Cluster 48: CSX Corp., Kansas City Southern, Norfolk Southern Corp., Union Pacific Cluster 49: Advance Auto Parts, AutoZone Inc, O'Reilly Automotive Cluster 50: Kinder Morgan, ONEOK, Williams Cos. Cluster 51: AMETEK Inc, Caterpillar Inc., Cummins Inc., Deere & Co., Dover Corp., Eaton Corporation, Emerson Electric Company, FLIR Systems, Harley-Davidson, Ingersoll-Rand PLC, PACCAR Inc., Parker-Hannifin, Pentair Ltd., Rockwell Automation Inc., Stanley Black & Decker, United Rentals, Inc., Xylem Inc. Cluster 52: Altria Group Inc, Coca-Cola Company, Dr Pepper Snapple Group, Monster Beverage, PepsiCo Inc., Philip Morris International Cluster 53: Allergan, Plc, Johnson & Johnson, Lilly (Eli) & Co., Merck & Co., Pfizer Inc. Cluster 54: Air Products & Chemicals Inc, Ecolab, Intl Flavors & Fragrances, PPG Industries, Praxair Inc., Sealed Air, Sherwin-Williams Cluster 55: AFLAC Inc, Ameriprise Financial, Assurant Inc, Berkshire Hathaway, Hartford Financial Svc.Gp., International Business Machines, Leucadia National Corp., Lincoln National, Loews Corp., MetLife Inc., Principal Financial Group, Prudential Financial, Robert Half International, Torchmark Corp., Unum Group Cluster 56: DaVita, Laboratory Corp. of America Holding, Quest Diagnostics Cluster 57: Adobe Systems Inc, Akamai Technologies Inc, ANSYS, CA, Inc., Citrix Systems, F5 Networks, Oracle Corp., Red Hat Inc., Verisign Inc. Cluster 58: Duke Realty Corp, Federal Realty Investment Trust, General Growth Properties Inc., Iron Mountain Incorporated, Macerich, Prologis, Realty Income Corporation, Regency Centers Corporation, Simon Property Group Inc Cluster 59: HP Inc., NetApp, Seagate Technology, Western Digital Cluster 60: Signet Jewelers Cluster 61: Darden Restaurants, McDonald's Corp., Starbucks Corp., Yum! Brands Inc Cluster 62: Abbott Laboratories, Align Technology, Bard (C.R.) Inc., Baxter International Inc., Becton Dickinson, Boston Scientific, The Cooper Companies, Edwards Lifesciences, Hologic, IDEXX Laboratories, Intuitive Surgical Inc., Medtronic plc, ResMed, Stryker Corp., Zimmer Biomet Holdings Cluster 63: Analog Devices, Inc., Apple Inc., Broadcom, Cisco Systems, Intel Corp., Juniper Networks, Microchip Technology, Motorola Solutions Inc., QUALCOMM Inc., Skyworks Solutions, Texas Instruments, Xilinx Inc Cluster 64: Foot Locker Inc, L Brands Inc., Ross Stores, TJX Companies Inc., Ulta Salon Cosmetics & Fragrance Inc Cluster 65: TransDigm Group Cluster 66: Allstate Corp, Chubb Limited, Cincinnati Financial, Progressive Corp., The Travelers Companies Inc., XL Capital Cluster 67: CBS Corp., Time Warner Inc., Twenty-First Century Fox Class A, Twenty-First Century Fox Class B, The Walt Disney Company Cluster 68: Apartment Investment & Management, AvalonBay Communities, Inc., Equity Residential, Mid-America Apartments, UDR Inc Cluster 69: C. H. Robinson Worldwide, Expeditors International, FedEx Corporation, J. B. Hunt Transport Services, United Parcel Service Cluster 70: HCA Holdings, Universal Health Services, Inc. Cluster 71: Hanesbrands Inc, Nike, PVH Corp., Under Armour, V.F. Corp. Cluster 72: Andeavor, Marathon Petroleum, Phillips 66, Valero Energy Cluster 73: Digital Realty, Equinix, HCP Inc., Ventas Inc, Welltower Inc. Cluster 74: AT&T Inc, CenturyLink Inc, Verizon Communications Cluster 75: Alexandria Real Estate Equities Inc, Boston Properties, SL Green Realty, Vornado Realty Trust Cluster 76: Martin Marietta Materials, Vulcan Materials Cluster 77: Eversource Energy, Tapestry, Inc., Walgreens Boots Alliance Cluster 78: Republic Services Inc, Waste Management Inc. Cluster 79: Agilent Technologies Inc, PerkinElmer, Thermo Fisher Scientific, Varian Medical Systems, Waters Corporation Cluster 80: Alliant Energy Corp, Ameren Corp, American Electric Power, American Water Works Company Inc, CenterPoint Energy, CMS Energy, Consolidated Edison, Dominion Energy, DTE Energy, Duke Energy, Edison Int'l, Exelon Corp., NextEra Energy, NiSource Inc., PG&E Corp., Pinnacle West Capital, PPL Corp., Public Serv. Enterprise Inc., SCANA Corp, Sempra Energy, Southern Co., Wec Energy Group Inc, Xcel Energy Inc
d1 = variation_df[['Name', 'Symbol', 'Sector', 'Color']].reset_index(drop=True)
d2 = pd.DataFrame(np.stack(embedding, axis=1).T, columns=['x', 'y'])
plotting_df = pd.concat([d1, d2], axis=1)
source = ColumnDataSource(dict(x=plotting_df['x'],
y=plotting_df['y'],
color=plotting_df['Color'],
label=plotting_df['Name'],
ticker=plotting_df['Symbol'],
sector=plotting_df['Sector']))
hover = HoverTool(tooltips=[
("index", "$index"),
("name", "@label"),
("sector", "@sector"),
("ticker", "@ticker"),
("(x,y)", "($x, $y)"),
])
p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)
labels = LabelSet(x='x', y='y', text='label', source=source,
text_font_size='8pt')
p.scatter(x='x', y='y', color='color', legend='sector', source=source)
p.title.text = "2D Company Similarity Visualization (Correlation)"
p.title.text_font_size = "25px"
# Comment this line to remove segments
p.segment(*np.reshape(np.array(segments).flatten(), (len(segments), 4)).T)
# Uncomment the following line to add labels
# p.add_layout(labels)
show(p)
Note: Segments represent partial correlation coefficients whose absolute value exceed 6%.
Here, companies are much more clearly clustered by industry groups. For example, one can see that airline companies (American Airlines Group, Delta Air Lines, etc) are clustered together and connected to each other.
# Subtract identity matrix from similarity matrix to avoid selecting the
# original company
sim_mat = similarity_matrix_NLP - np.eye(len(similarity_matrix_NLP))
def valuation_calculator(index, data, multiple, similarity_matrix):
sorted_similarity_array = -np.sort((-similarity_matrix[0]))
sorted_similarity_indices = (-similarity_matrix[index]).argsort()
top_3_comps = sorted_similarity_array[:3]
baseline_comp = sorted_similarity_array[4]
normalized_weights = ((top_3_comps-baseline_comp) /
sum(top_3_comps-baseline_comp))
top_3_pe = data[multiple][sorted_similarity_indices[:3]]
weigthed_pe = np.dot(normalized_weights, top_3_pe)
if multiple == 'Trailing P/E':
valuation = weigthed_pe * data['Diluted EPS'][index]
elif multiple == 'Enterprise Value/EBITDA':
valuation = weigthed_pe * data['EBITDA'][index]
elif multiple == 'Enterprise Value/Revenue':
valuation = weigthed_pe * data['Revenue'][index]
return valuation
valuation_df_nlp = data[['Name', 'Symbol', 'Sector', 'Color', 'Current Price',
'Enterprise Value']].copy()
for i, company in enumerate(data['Name']):
valuation_df_nlp.loc[i, 'Valuation (P/E)'] = \
valuation_calculator(i, data, 'Trailing P/E', sim_mat)
valuation_df_nlp.loc[i, 'Valuation (EV/EBITDA)'] = \
valuation_calculator(i, data, 'Enterprise Value/EBITDA', sim_mat)
valuation_df_nlp.loc[i, 'Valuation (EV/Revenue)'] = \
valuation_calculator(i, data, 'Enterprise Value/Revenue', sim_mat)
valuation_df_nlp['% Over/Undervaluation (EV/Revenue)'] = \
(valuation_df_nlp['Valuation (EV/Revenue)'] /
valuation_df_nlp['Enterprise Value']-1)*100
valuation_df_nlp['% Over/Undervaluation (EV/EBITDA)'] = \
(valuation_df_nlp['Valuation (EV/EBITDA)'] /
valuation_df_nlp['Enterprise Value']-1)*100
valuation_df_nlp['% Over/Undervaluation (P/E)'] = \
(valuation_df_nlp['Valuation (P/E)'] / valuation_df_nlp['Current Price']-1)*100
valuation_df_nlp['% Over/Undervaluation (EV/Revenue)'].describe()
count 430.000000 mean 101.705297 std 451.641116 min -89.167688 25% -23.991184 50% 17.955084 75% 91.800569 max 6113.063281 Name: % Over/Undervaluation (EV/Revenue), dtype: float64
valuation_df_nlp['% Over/Undervaluation (EV/EBITDA)'].describe()
count 430.000000 mean 31.095668 std 261.918370 min -85.110505 25% -19.046748 50% 5.264212 75% 41.416836 max 5282.424119 Name: % Over/Undervaluation (EV/EBITDA), dtype: float64
valuation_df_nlp['% Over/Undervaluation (P/E)'].describe()
count 430.000000 mean 70.585367 std 268.240955 min -97.265648 25% -28.156419 50% 6.538062 75% 58.264523 max 2774.539533 Name: % Over/Undervaluation (P/E), dtype: float64
# Add the price on the last day of 2016 to the dataframe
last_day_2016 = close_price.loc['2016-12-30', :].rename('2016-12-30 Price')
valuation_df_nlp = valuation_df_nlp.join(last_day_2016, on='Symbol')
# Calculates the percentage of correct predictions year to date
valuation_df_nlp['Actual Change'] = \
valuation_df_nlp['Current Price'] > valuation_df_nlp['2016-12-30 Price']
valuation_df_nlp['Return'] = \
valuation_df_nlp['Current Price'] / valuation_df_nlp['2016-12-30 Price'] - 1
valuation_df_nlp['Prediction'] = \
valuation_df_nlp['Valuation (P/E)'] > valuation_df_nlp['2016-12-30 Price']
avg_return = \
np.mean(valuation_df_nlp['Prediction'].replace(0, -1) *
valuation_df_nlp['Return'])
outcome = (valuation_df_nlp['Prediction'] == valuation_df_nlp['Actual Change'])
correct_pred = sum(outcome) / len(outcome)
print('Percentage of correct predictions YTD (NLP): ' + str(correct_pred))
print('Average Return: ' + str(avg_return))
Percentage of correct predictions YTD (NLP): 0.651162790698 Average Return: 0.1147698701648122
Disclaimer: While the system appears to be promising, there are many reasons other than luck for which this should not be taken at face value. First, we use the last price of 2016 as a reference, but the financial statements are published later even though they reflect the state of the company at the end of the year. Additionally, many more tests would actually need to performed to evaluate the quality of these signals which is beyond the aim of this notebook. As such, you should not consider using these results for taking investment decisions.
source = \
ColumnDataSource(dict(x=valuation_df_nlp['% Over/Undervaluation (P/E)'],
y=valuation_df_nlp['% Over/Undervaluation (EV/EBITDA)'],
color=valuation_df_nlp['Color'],
label=valuation_df_nlp['Name'],
ticker=valuation_df_nlp['Symbol'],
sector=valuation_df_nlp['Sector']))
hover = HoverTool(tooltips=[
("index", "$index"),
("name", "@label"),
("sector", "@sector"),
("ticker", "@ticker"),
("(x,y)", "($x, $y)"),
])
p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)
labels = LabelSet(x='x', y='y', text='label', source=source,
text_font_size='8pt')
# Remove "legend='sector'," to remove legend
p.scatter(x='x', y='y', color='color', legend='sector', source=source)
p.title.text = "Over/Undervaluation (NLP) Based On..."
p.title.text_font_size = "25px"
# Uncomment the following line to add labels
# p.add_layout(labels)
p.xaxis.axis_label = 'Trailing P/E'
p.yaxis.axis_label = 'EV/EBITDA'
show(p)
A positive value means that the company should be trading at a higher value (as of the end of 2016). Overall, there appears to be a positive relationship between the valuation based on those two multiples. One caveat, however, is the limited amount of companies. Some companies in the S&P500 have no good comparable companies, and therefore, the percentage of over/undervaluation suggested by the model should not be taken at face value. For example, companies such as Netflix and Amazon appear highly overvalued based on this model. However, this is because idiosyncratic factors have a large influence on these companies which are not captured by this methodology. In fact, on an EV/Revenue basis, the model actually suggests that Amazon is undervalued. Conducting this analysis with a much larger dataset would partially alleviate this issue. An interesting observation is that Nvidia, which has performed very well recently due to developments in the GPU space is still not as "overvalued" as Netflix on the basis of those two multiples.
cols_to_be_used = ['Name', 'Symbol', 'Sector', 'Color', 'Current Price',
'Enterprise Value', 'Enterprise Value/Revenue',
'Enterprise Value/EBITDA', 'Trailing P/E']
df1 = variation_df[['Name', 'Cluster']].set_index('Name')
val_df_corr = data[cols_to_be_used].join(df1, on='Name')
val_df_corr = val_df_corr.dropna()
for i in range(n_labels + 2):
val_df_corr.loc[val_df_corr['Cluster'] == i, 'Cluster P/E'] = \
np.mean(val_df_corr[val_df_corr['Cluster'] == i]['Trailing P/E'])
val_df_corr.loc[val_df_corr['Cluster'] == i, 'Cluster EV/EBITDA'] = \
np.mean(val_df_corr[val_df_corr['Cluster'] == i]
['Enterprise Value/EBITDA'])
val_df_corr.loc[val_df_corr['Cluster'] == i, 'Cluster EV/Revenue'] = \
np.mean(val_df_corr[val_df_corr['Cluster'] == i]
['Enterprise Value/Revenue'])
val_df_corr['% Over/Undervaluation (EV/Revenue)'] = \
(val_df_corr['Enterprise Value/Revenue'] /
val_df_corr['Cluster EV/Revenue']-1)*100
val_df_corr['% Over/Undervaluation (EV/EBITDA)'] = \
(val_df_corr['Enterprise Value/EBITDA'] /
val_df_corr['Cluster EV/EBITDA'] - 1) * 100
val_df_corr['% Over/Undervaluation (P/E)'] = \
(val_df_corr['Trailing P/E'] / val_df_corr['Cluster P/E'] - 1) * 100
val_df_corr['% Over/Undervaluation (EV/Revenue)'].round(3).describe()
count 411.000000 mean 0.000002 std 41.113803 min -96.331000 25% -25.817000 50% -2.205000 75% 17.535500 max 186.307000 Name: % Over/Undervaluation (EV/Revenue), dtype: float64
val_df_corr['% Over/Undervaluation (EV/EBITDA)'].round(3).describe()
count 411.000000 mean -0.000032 std 45.999853 min -98.236000 25% -18.438500 50% -4.308000 75% 11.977500 max 610.876000 Name: % Over/Undervaluation (EV/EBITDA), dtype: float64
val_df_corr['% Over/Undervaluation (P/E)'].round(3).describe()
count 411.000000 mean -0.000015 std 92.845076 min -93.641000 25% -39.437500 50% -11.260000 75% 11.894500 max 1026.426000 Name: % Over/Undervaluation (P/E), dtype: float64
source = \
ColumnDataSource(dict(x=val_df_corr['% Over/Undervaluation (P/E)'],
y=val_df_corr['% Over/Undervaluation (EV/EBITDA)'],
color=val_df_corr['Color'],
label=val_df_corr['Name'],
ticker=val_df_corr['Symbol'],
sector=val_df_corr['Sector']))
hover = HoverTool(tooltips=[
("index", "$index"),
("name", "@label"),
("sector", "@sector"),
("ticker", "@ticker"),
("(x,y)", "($x, $y)"),
])
p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)
labels = LabelSet(x='x', y='y', text='label', source=source,
text_font_size='8pt')
# Remove "legend='sector'," to remove legend
p.scatter(x='x', y='y', color='color', legend='sector', source=source)
p.title.text = "Over/Undervaluation (Correlation) Based On..."
p.title.text_font_size = "25px"
# Uncomment the following line to add labels
# p.add_layout(labels)
p.xaxis.axis_label = 'Trailing P/E'
p.yaxis.axis_label = 'EV/EBITDA'
show(p)
This methodology gives valuations which are much less extreme than the NLP approach. One potential explanation for this is that clusters are closer to the traditional industry groups, and therefore, to how the market values these companies.
These results were obtained as of:
date.today()
datetime.date(2017, 11, 28)
Titman, Sheridan and Martin, John D. Valuation: the Art and Science of Corporate Investment Decisions. Prentice Hall, 2015. Print.
Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. http://arxiv.org/pdf/1405.4053v2.pdf
Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
Gael Varoquaux. Visualizing the Stock Market Structure. http://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html#sphx-glr-auto-examples-applications-plot-stock-market-py