Introduction to Relative Valuation using Market Comparables

Quentin Batista
The University of Tokyo

Relative valuation is a popular technique which relies on the market value of similar assets to price a given asset. For example, if you want to price a laundry business, you might want to calculate the price to earnings ratio of other similar businesses and multiply this ratio by the earnings of the laundry business to obtain a valuation. Sheridan and Martin describe this methodology as a 4-step process:

Step 1: Identify similar or comparable investments and recent market prices for each.

Step 2: Calculate a valuation metric for use in valuing the asset.

Step 3: Calculate an initial estimate of value.

Step 4: Refine or tailor your initial valuation estimate to the specific characteristics of the investment.

Some of the most commonly used multiples are:

  • Price to earnings (P/E)

  • Market to book value of equity

  • Enterprise Value (EV) to EBITDA

  • Enterprise Value (EV) to revenue

  • Enterprise Value (EV) to cash flow

This list is not exhaustive, and it is possible for you to create your own multiple. This is particularly popular in the technology sector where analysts have come up with multiples such as Enterprise Value to unique visitors or website hits. In doing so, you must ensure that the components of the multiple are consistent with each other. For example, you might consider using the price to sales ratio as a valuation multiple. However, an implicit assumption behind this multiple is that comparable companies have identical capital structures, which is very rarely the case in practice. When this assumption is violated, the multiple becomes inconsistent because it is impacted by the amount of debt that a company has relative to its equity.

Finally, a key step in applying this methodology is to determine which multiple is appropriate for the asset you are trying to value. For example, valuing young firms and startups using the P/E ratio is likely to be inappropriate if those firms have negative or highly volatile earnings. Instead, using the EV to sales ratio would likely give a better estimate. Additionally, it is important to realize that these multiples have different characteristics. While EV to EBITDA ignores the firm's CapEx, depreciation, and capital structure, while the P/E ratio takes those into account. Using these multiples concurrently allows you to see the big picture and understand what is driving the valuation of an asset.

Systematic Relative Valuation Using Machine Learning


In this notebook, we systematize the methodology introduced above for companies in the S&P 500 using two different Machine Learning approaches. First, we replicate Gael Varoquaux's analysis on scikit-learn which extracts a graphical structure from the correlation of intraday variations and then applies Affinity Propagation to group together stocks that behave similarly. Second, we analyze the companies' latest 10K using the Doc2Vec implementation in gensim to quantify similarity. In the first case, we use the average multiple in each cluster to value each company. In the second case, we pick the top 3 comparable companies and transform the similarity measures into weights to compute the average comparable multiple. We then apply this multiple to each company's financials to obtain a valuation.

The hope of using this approach is to capture relationships between companies that human analysts would not be able to perceive. Reading through the financial statements of all the companies in the S&P 500 and forming an opinion about how these companies relate to each other would be an extremely time-consuming process. As a result, most analysts on the Street cover a specific industry group, and therefore, they might not realize that a company they are researching is actually very similar to a company in classified in a different industry group. By using Machine Learning, we can alleviate this issue, and hopefully, gain some valuable insights about the stock market structure.


We scrape the list of companies in the S&P 500, their ticker and CIK code from Wikipedia. We then use the CIK code to scrape the latest 10K from EDGAR. There are 505 companies in the Wikipedia list because some companies trade under multiple symbols (for example, Discovery Communications Inc.). A few companies' financial statements are not available on EDGAR for various reasons -- we ignore those companies. We clean the data by removing "Table of Contents" markers when they exist, page numbers, line breaks, punctuations and numbers from the statements. We scrape company fundamentals and their historical prices from Yahoo! Finance. Note that we drop the companies with negative multiples to simplify valuation. While Yahoo! Finance might not be a particularly reliable source for getting data about company fundamentals, it is the best among free data sources.

Required Packages

In [ ]:
!pip install numpy
!pip install pandas
!pip install beautifulsoup4
!pip install gensim
!pip install nltk
!pip install sklearn
!pip install bokeh
!pip install pandas_datareader
!pip install datetime
In [2]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
from collections import namedtuple
from gensim.models import doc2vec
from gensim.similarities import docsim
from gensim import corpora
import itertools
from string import punctuation, digits
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.manifold import TSNE
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, LabelSet, Legend, HoverTool
from bokeh.palettes import all_palettes
from import DataReader
from datetime import date

Data Scraping

Scraping 10Ks from EDGAR

In [3]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

def table_extractor(soup):
    """Extract the tables from a soup object"""
    for table in soup.find_all("table"):
    return soup

sp_500_wiki_link = ''
soup_wiki = BeautifulSoup(urlopen(sp_500_wiki_link), 'html.parser')
table_wiki = soup_wiki.find("table", {"class": "wikitable sortable"})

# Fail now if the right table hasn't been found
header = table_wiki.findAll('th')
if header[0].string != "Ticker symbol" or header[1].string != "Security":
    raise Exception("Can't parse wikipedia's table!")

# Retreive the values in the table
records = []
rows = table_wiki.findAll('tr')
for row in rows:
    fields = row.findAll('td')
    if fields:
        # Get info and SEC company link
        symbol = fields[0].string
        wiki_link = "" + fields[1].a.get('href')
        CIK = fields[7].string
        sec_company_link = "" + \
                           "action=getcompany&CIK=" + CIK + \
        name = fields[1].a.string
        sector = fields[3].string

        # Get link for the page with latest 10-K related filings
        soup_comp = BeautifulSoup(urlopen(sec_company_link), 'html.parser')
        table_comp = soup_comp.find("table", {"class": "tableFile2"})
            # Get the latest filing page
            filings_link = "" + \

            # Get the link for the latest 10K
            soup_filings = BeautifulSoup(urlopen(filings_link),
            table_filings = soup_filings.find("table", {"class": "tableFile"})
            _10K_link = "" + table_filings.a.get('href')

            # Extracting the text from the latest 10K
                soup_latest_10K = BeautifulSoup(urlopen(_10K_link).read(),
                soup_latest_10K = table_extractor(soup_latest_10K)
                _latest_10K_txt = soup_latest_10K.get_text()

                # If the latest 10K is not available, return N/A
                _latest_10K_txt = np.nan

            # If the filings are not available, return N/A
            _10K_link = np.nan
            _latest_10K_txt = np.nan

        # Append results
        records.append([symbol, wiki_link, name, sector, sec_company_link,
                        CIK, _latest_10K_txt])

headers = ['Symbol', 'Wikipedia Link', 'Name', 'Sector', 'SEC Filings Link',
           'CIK', 'Latest 10K']
data = pd.DataFrame(records, columns=headers)

# Correct ambiguous tickers for later purpose
ambiguous_tickers = ['BRK.B', 'BF.B']
corrected_tickers = ['BRK-B', 'BF-B']

for i, ticker in enumerate(ambiguous_tickers):
    data['Symbol'] = data['Symbol'].replace(ticker, corrected_tickers[i])

Scraping Fundamentals from Yahoo! Finance

In [5]:
def unit_converter(data):
    billion = 1_000_000_000
    million = 1_000_000
    if data[-1] == 'B':
        return float(data[:-1])*billion
    elif data[-1] == 'M':
        return float(data[:-1])*million
        return float(data)

items = ['Enterprise Value', 'Enterprise Value/Revenue', 'Diluted EPS',
         'Enterprise Value/EBITDA', 'Revenue', 'EBITDA', 'Trailing P/E']

for i, ticker in enumerate(data['Symbol']):
    key_stats_link = '' + ticker + \
                     '/key-statistics?p=' + ticker
    key_stats_soup = BeautifulSoup(urlopen(key_stats_link).read(),
    for j, row in enumerate(key_stats_soup):
        for item in items:
                if item == row.span.string:
                    # Dangerous assumption
                    data.loc[i, item] = \

Disclaimer: This scraper makes an important assumption regarding the structure of the html file. Specifically, it assumes that the data for each item of interest is the string of text located in the 'td' tag following the one where the name of the item was found.

Scraping Historical Prices from Yahoo! Finance

In [6]:
from import DataReader
from datetime import date

start = date(2013, 1, 2)
end =

data_source = 'yahoo'

# Sometimes fail -- retry if it does
historical_prices_panel = DataReader(data['Symbol'], data_source, start, end,

# Current price is last close
last_price = \
historical_prices_panel['Close'].tail(1).T.iloc[:, 0].rename('Current Price')
data = data.join(last_price, on='Symbol')

close_price = historical_prices_panel['Close']
open_price = historical_prices_panel['Open']
In [7]:
# Uncomment to save the data
# data.to_csv('10K_data.csv')
# historical_prices_panel['Close'].to_csv('close_price_data.csv')
# historical_prices_panel['Open'].to_csv('open_price_data.csv')

Data Cleaning and Preprocessing

In [8]:
# Uncomment to read the data
# data = pd.read_csv('10K_data.csv')
# close_price = pd.read_csv('close_price_data.csv')
# open_price = pd.read_csv('open_price_data.csv')
In [9]:
# Remove companies without filings
no_filings_data = data[data['Latest 10K'].isnull()]
data = data[~data['Latest 10K'].isnull()]

# Remove duplicates (keep first)
data = data.drop_duplicates(keep='first')

# Drop Google duplicate
data = data[data['Symbol'] != 'GOOG']

# Drop NA rows (about 60 companies)
data = data.dropna()

# Drop companies with negative EPS or EBITDA (about 30 companies)
data = data[(data[['EBITDA', 'Diluted EPS']] > 0).all(1)]

# Reset index
data = data.reset_index(drop=True)
In [10]:
from collections import namedtuple
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation, digits

def _10K_string_cleaner(_10K):
    # Lowercase the text
    _10K = _10K.lower()
    stopchar = punctuation + digits + '’“”'
    for ch in stopchar:
        # Replace stopchar by whitespace
        _10K = _10K.replace(ch, ' ')
    # Tokenize
    _10K = word_tokenize(_10K)
    # Remove stopwords
    _10K = [word for word in _10K if word not in stopwords.words('english')]
    return _10K

corpus = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(data['Latest 10K']):
    corpus.append(analyzedDocument(_10K_string_cleaner(text), [i]))

Doc2Vec Model

Doc2Vec is an implementation of the Paragraph Vector unsupervised learning algorithm proposed by Quoc Le and Tomas Mikolov for learning fixed-length feature representations from variable-length pieces of texts. It aims to overcome some of the weaknesses of the more traditional bag-of-words (BOW) methodology.

In [11]:
from gensim.models import doc2vec
from gensim.similarities import docsim

model_NLP = doc2vec.Doc2Vec(corpus, size=500, window=300, min_count=1,
In [12]:
import itertools

similarity_matrix_NLP = np.empty(0)
length_docvecs = len(model_NLP.docvecs)
for item in itertools.product(range(length_docvecs), range(length_docvecs)):
    similarity_matrix_NLP = np.append(similarity_matrix_NLP,

similarity_matrix_NLP = similarity_matrix_NLP.reshape((length_docvecs,


Stock Market Structure

We use three different methodologies for visualizing the structure of the stock market. First, we create a simple scatter plot of the companies' EV/EBITDA against their P/E ratio. Second, we use the correlation matrix of the S&P 500 stocks between the start of 2013 and the last price. Third, we use the similarity matrix outputted from the Doc2Vec model. In the last two cases, t-SNE is used for dimensionality reduction.

In [13]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, LabelSet, Legend, HoverTool
from bokeh.palettes import all_palettes


category_items = data['Sector'].unique()
palette = all_palettes['Viridis'][len(category_items)]
colormap = dict(zip(category_items, palette))
data['Color'] = data['Sector'].map(colormap)

TOOLS = "crosshair,pan,wheel_zoom,reset,tap,save,box_select"

source = ColumnDataSource(dict(x=data['Trailing P/E'],
                               y=data['Enterprise Value/EBITDA'],

hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("name", "@label"),
    ("sector", "@sector"),
    ("ticker", "@ticker"),
    ("(x,y)", "($x, $y)"),

p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)

labels = LabelSet(x='x', y='y', text='label', source=source,

# Remove "legend='sector'," to remove legend
p.scatter(x='x', y='y', color='color', legend='sector', source=source)

p.title.text = "Current Valuation Structure"
p.title.text_font_size = "25px"

# Uncomment the following line to add labels
# p.add_layout(labels)

p.xaxis.axis_label = 'Trailing P/E'
p.yaxis.axis_label = 'EV/EBITDA'

Loading BokehJS ...

There is a clear positive relationship between a company's EV/EBITDA and P/E ratio. While most companies are clustered in the bottom left quadrant, there are a few outliers. We choose to keep those outliers for valuation because they still provide valuable information about companies as we will see later.

In [14]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=0, perplexity=10.0)
Y = tsne.fit_transform(similarity_matrix_NLP)

plotting_df = pd.concat([data[['Name', 'Symbol', 'Sector', 'Color']],
                         pd.DataFrame(Y, columns=['x', 'y'])], axis=1)

source = ColumnDataSource(dict(x=plotting_df['x'],

hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("name", "@label"),
    ("sector", "@sector"),
    ("ticker", "@ticker"),
    ("(x,y)", "($x, $y)"),

p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)

labels = LabelSet(x='x', y='y', text='label', source=source,

p.scatter(x='x', y='y', color='color', legend='sector', source=source)

p.title.text = "2D Company Similarity Visualization (NLP)"
p.title.text_font_size = "25px"

# Uncomment the following line to add labels
# p.add_layout(labels)


While companies are mostly clustered by their industry groups, this analysis seems to capture relationships between companies which go beyond those. The fact that companies such as Activision Blizzard and Electronic Arts are close to each other suggests that the model is doing something right. However, note that it is possible that the model relies on trivial factors, for example, attributing a high similarity score to companies whose financial statements have the same auditor (say, because the writing style is similar).

In [15]:
from sklearn import cluster, covariance

# This code was adapted from Gael Varoquaux's work (see references)

# Calculate intraday variation
variation_df = close_price - open_price
variation_df = variation_df.T.reindex(data['Symbol'])

# Get name, sector and color from the data dataframe
variation_df = \
data[['Symbol', 'Name', 'Sector', 'Color']].join(variation_df, on='Symbol')

# Drop rows with NAs
variation_df = variation_df.dropna(axis=0)

# Data for the model
var_data = variation_df.drop(['Symbol', 'Name', 'Sector', 'Color'], axis=1).T

# Learn a graphical structure from the correlations
edge_model = covariance.GraphLassoCV()

# Standardize the time series: using correlations rather than covariance is
# more efficient for structure recovery
var_data /= var_data.std(axis=0)

# Cluster using affinity propagation
_, labels = cluster.affinity_propagation(edge_model.covariance_)
n_labels = labels.max()

variation_df['Cluster'] = np.nan

for i in range(n_labels + 1):
    variation_df.loc[labels == i, 'Cluster'] = i + 1
    print('Cluster %i: %s' % ((i + 1),
          ', '.join(variation_df['Name'][labels == i])))

# Find a low-dimension embedding for visualization: find the best position of
# the nodes (the stocks) on a 2D plane
embedding = tsne.fit_transform(var_data.T)

# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.06)

# Plot the edges
start_idx, end_idx = np.where(non_zero)
segments = [[embedding.T[:, start], embedding.T[:, stop]]
            for start, stop in zip(start_idx, end_idx)]
values = np.abs(partial_correlations[non_zero])
/Users/QBatista/anaconda/lib/python3.6/site-packages/numpy/linalg/ RuntimeWarning: invalid value encountered in slogdet
  sign, logdet = _umath_linalg.slogdet(a, signature=signature)
/Users/QBatista/anaconda/lib/python3.6/site-packages/sklearn/covariance/ RuntimeWarning: invalid value encountered in multiply
  * coefs)
/Users/QBatista/anaconda/lib/python3.6/site-packages/sklearn/covariance/ RuntimeWarning: invalid value encountered in multiply
  * coefs)
/Users/QBatista/anaconda/lib/python3.6/site-packages/sklearn/covariance/ ConvergenceWarning: graph_lasso: did not converge after 100 iteration: dual gap: -1.510e-04
Cluster 1: Activision Blizzard, Electronic Arts
Cluster 2: Alphabet Inc Class A, Inc, eBay, Expedia Inc., Facebook, Inc., Microsoft Corp., Netflix Inc., Inc, TripAdvisor
Cluster 3: American Tower Corp A, Crown Castle International Corp., SBA Communications
Cluster 4: Aetna Inc, Anthem Inc., Centene Corporation, CIGNA Corp., Humana Inc., United Health Group Inc.
Cluster 5: Block H&R
Cluster 6: BorgWarner, Delphi Automotive, Goodyear Tire & Rubber, LKQ Corporation
Cluster 7: Cadence Design Systems, Synopsys Inc.
Cluster 8: AmerisourceBergen Corp, Cardinal Health, CVS Health, Express Scripts, Henry Schein, McKesson Corp., Patterson Companies
Cluster 9: Carnival Corp., Royal Caribbean Cruises Ltd
Cluster 10: AbbVie Inc., Alexion Pharmaceuticals, Amgen Inc, Biogen Inc., Bristol-Myers Squibb, Celgene Corp., Gilead Sciences, Illumina Inc, Mylan N.V., Regeneron, Vertex Pharmaceuticals Inc
Cluster 11: Chevron Corp., Exxon Mobil Corp., Occidental Petroleum, Schlumberger Ltd., TechnipFMC
Cluster 12: Chipotle Mexican Grill
Cluster 13: The Clorox Company, Colgate-Palmolive, Estee Lauder Cos., Kimberly-Clark, Procter & Gamble
Cluster 14: Charter Communications, Comcast Corp.
Cluster 15: Cimarex Energy, Concho Resources, Devon Energy, Newfield Exploration Co, Pioneer Natural Resources, Range Resources Corp.
Cluster 16: Constellation Brands, Molson Coors Brewing Company
Cluster 17: Costco Wholesale Corp., Wal-Mart Stores
Cluster 18: Alaska Air Group Inc, American Airlines Group, Delta Air Lines, Southwest Airlines, United Continental Holdings
Cluster 19: Discovery Communications, Discovery Communications, Viacom Inc.
Cluster 20: Dish Network
Cluster 21: Dollar General, Dollar Tree, Kroger Co., Target Corp.
Cluster 22: Albemarle Corp, Eastman Chemical, FMC Corporation, LyondellBasell, The Mosaic Company, Nucor Corp.
Cluster 23: Equifax Inc.
Cluster 24: Extra Space Storage, Public Storage
Cluster 25: Fastenal Co, Grainger (W.W.) Inc.
Cluster 26: Accenture plc, Automatic Data Processing, Cognizant Technology Solutions, Fidelity National Information Services, Fiserv Inc, Gartner Inc, Intuit, Paychex Inc., Verisk Analytics
Cluster 27: Flowserve Corporation, Fluor Corp., Freeport-McMoRan Inc., Jacobs Engineering Group, Quanta Services Inc.
Cluster 28: Carmax Inc, Ford Motor, General Motors
Cluster 29: Brown-Forman Corp., Campbell Soup, Conagra Brands, General Mills, The Hershey Company, Hormel Foods Corp., JM Smucker, Kellogg Co., McCormick & Co., Mondelez International, Sysco Corp., Tyson Foods
Cluster 30: Hasbro Inc.
Cluster 31: Home Depot, Lowe's Cos., Newell Brands, Tractor Supply Company
Cluster 32: 3M Company, Amphenol Corp, Boeing Company, Cintas Corporation, Danaher Corp., General Electric, Genuine Parts, Harris Corporation, Honeywell Int'l Inc., Illinois Tool Works, Monsanto Co., Rockwell Collins, Roper Technologies, Snap-On Inc., Textron Inc., United Technologies
Cluster 33: CBOE Holdings, CME Group Inc., Intercontinental Exchange, Nasdaq, Inc.
Cluster 34: Avery Dennison Corp, Ball Corp, International Paper, Packaging Corporation of America, Weyerhaeuser Corp.
Cluster 35: Interpublic Group, Omnicom Group
Cluster 36: Affiliated Managers Group Inc, Archer-Daniels-Midland Co, BlackRock, CBRE Group, Corning Inc., Franklin Resources, Garmin Ltd., Invesco Ltd., S&P Global, Inc., T. Rowe Price Group, TE Connectivity Ltd., Western Union Co
Cluster 37: Applied Materials Inc, KLA-Tencor Corp., Lam Research, Micron Technology, Nvidia Corporation
Cluster 38: Lennar Corp., Pulte Homes Inc.
Cluster 39: General Dynamics, L-3 Communications Holdings, Lockheed Martin Corp., Northrop Grumman Corp., Raytheon Co.
Cluster 40: Host Hotels & Resorts, Marriott Int'l., Wyndham Worldwide
Cluster 41: Aon plc, Arthur J. Gallagher & Co., Marsh & McLennan, Willis Towers Watson
Cluster 42: Acuity Brands Inc, Fortune Brands Home & Security, Leggett & Platt, Masco Corp., Mohawk Industries, Whirlpool Corp.
Cluster 43: Alliance Data Systems, Cerner, Global Payments Inc, Mastercard Inc., Total System Services, Visa Inc.
Cluster 44: Mattel Inc.
Cluster 45: MGM Resorts International, Wynn Resorts
Cluster 46: Michael Kors Holdings
Cluster 47: Gap Inc., Kohl's, Macy's Inc., Nordstrom
Cluster 48: CSX Corp., Kansas City Southern, Norfolk Southern Corp., Union Pacific
Cluster 49: Advance Auto Parts, AutoZone Inc, O'Reilly Automotive
Cluster 50: Kinder Morgan, ONEOK, Williams Cos.
Cluster 51: AMETEK Inc, Caterpillar Inc., Cummins Inc., Deere & Co., Dover Corp., Eaton Corporation, Emerson Electric Company, FLIR Systems, Harley-Davidson, Ingersoll-Rand PLC, PACCAR Inc., Parker-Hannifin, Pentair Ltd., Rockwell Automation Inc., Stanley Black & Decker, United Rentals, Inc., Xylem Inc.
Cluster 52: Altria Group Inc, Coca-Cola Company, Dr Pepper Snapple Group, Monster Beverage, PepsiCo Inc., Philip Morris International
Cluster 53: Allergan, Plc, Johnson & Johnson, Lilly (Eli) & Co., Merck & Co., Pfizer Inc.
Cluster 54: Air Products & Chemicals Inc, Ecolab, Intl Flavors & Fragrances, PPG Industries, Praxair Inc., Sealed Air, Sherwin-Williams
Cluster 55: AFLAC Inc, Ameriprise Financial, Assurant Inc, Berkshire Hathaway, Hartford Financial Svc.Gp., International Business Machines, Leucadia National Corp., Lincoln National, Loews Corp., MetLife Inc., Principal Financial Group, Prudential Financial, Robert Half International, Torchmark Corp., Unum Group
Cluster 56: DaVita, Laboratory Corp. of America Holding, Quest Diagnostics
Cluster 57: Adobe Systems Inc, Akamai Technologies Inc, ANSYS, CA, Inc., Citrix Systems, F5 Networks, Oracle Corp., Red Hat Inc., Verisign Inc.
Cluster 58: Duke Realty Corp, Federal Realty Investment Trust, General Growth Properties Inc., Iron Mountain Incorporated, Macerich, Prologis, Realty Income Corporation, Regency Centers Corporation, Simon Property Group Inc
Cluster 59: HP Inc., NetApp, Seagate Technology, Western Digital
Cluster 60: Signet Jewelers
Cluster 61: Darden Restaurants, McDonald's Corp., Starbucks Corp., Yum! Brands Inc
Cluster 62: Abbott Laboratories, Align Technology, Bard (C.R.) Inc., Baxter International Inc., Becton Dickinson, Boston Scientific, The Cooper Companies, Edwards Lifesciences, Hologic, IDEXX Laboratories, Intuitive Surgical Inc., Medtronic plc, ResMed, Stryker Corp., Zimmer Biomet Holdings
Cluster 63: Analog Devices, Inc., Apple Inc., Broadcom, Cisco Systems, Intel Corp., Juniper Networks, Microchip Technology, Motorola Solutions Inc., QUALCOMM Inc., Skyworks Solutions, Texas Instruments, Xilinx Inc
Cluster 64: Foot Locker Inc, L Brands Inc., Ross Stores, TJX Companies Inc., Ulta Salon Cosmetics & Fragrance Inc
Cluster 65: TransDigm Group
Cluster 66: Allstate Corp, Chubb Limited, Cincinnati Financial, Progressive Corp., The Travelers Companies Inc., XL Capital
Cluster 67: CBS Corp., Time Warner Inc., Twenty-First Century Fox Class A, Twenty-First Century Fox Class B, The Walt Disney Company
Cluster 68: Apartment Investment & Management, AvalonBay Communities, Inc., Equity Residential, Mid-America Apartments, UDR Inc
Cluster 69: C. H. Robinson Worldwide, Expeditors International, FedEx Corporation, J. B. Hunt Transport Services, United Parcel Service
Cluster 70: HCA Holdings, Universal Health Services, Inc.
Cluster 71: Hanesbrands Inc, Nike, PVH Corp., Under Armour, V.F. Corp.
Cluster 72: Andeavor, Marathon Petroleum, Phillips 66, Valero Energy
Cluster 73: Digital Realty, Equinix, HCP Inc., Ventas Inc, Welltower Inc.
Cluster 74: AT&T Inc, CenturyLink Inc, Verizon Communications
Cluster 75: Alexandria Real Estate Equities Inc, Boston Properties, SL Green Realty, Vornado Realty Trust
Cluster 76: Martin Marietta Materials, Vulcan Materials
Cluster 77: Eversource Energy, Tapestry, Inc., Walgreens Boots Alliance
Cluster 78: Republic Services Inc, Waste Management Inc.
Cluster 79: Agilent Technologies Inc, PerkinElmer, Thermo Fisher Scientific, Varian Medical Systems, Waters Corporation
Cluster 80: Alliant Energy Corp, Ameren Corp, American Electric Power, American Water Works Company Inc, CenterPoint Energy, CMS Energy, Consolidated Edison, Dominion Energy, DTE Energy, Duke Energy, Edison Int'l, Exelon Corp., NextEra Energy, NiSource Inc., PG&E Corp., Pinnacle West Capital, PPL Corp., Public Serv. Enterprise Inc., SCANA Corp, Sempra Energy, Southern Co., Wec Energy Group Inc, Xcel Energy Inc
In [16]:
d1 = variation_df[['Name', 'Symbol', 'Sector', 'Color']].reset_index(drop=True)
d2 = pd.DataFrame(np.stack(embedding, axis=1).T, columns=['x', 'y'])
plotting_df = pd.concat([d1, d2], axis=1)

source = ColumnDataSource(dict(x=plotting_df['x'],

hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("name", "@label"),
    ("sector", "@sector"),
    ("ticker", "@ticker"),
    ("(x,y)", "($x, $y)"),

p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)

labels = LabelSet(x='x', y='y', text='label', source=source,

p.scatter(x='x', y='y', color='color', legend='sector', source=source)

p.title.text = "2D Company Similarity Visualization (Correlation)"
p.title.text_font_size = "25px"

# Comment this line to remove segments
p.segment(*np.reshape(np.array(segments).flatten(), (len(segments), 4)).T)

# Uncomment the following line to add labels
# p.add_layout(labels)


Note: Segments represent partial correlation coefficients whose absolute value exceed 6%.

Here, companies are much more clearly clustered by industry groups. For example, one can see that airline companies (American Airlines Group, Delta Air Lines, etc) are clustered together and connected to each other.


NLP Approach

In [18]:
# Subtract identity matrix from similarity matrix to avoid selecting the
# original company
sim_mat = similarity_matrix_NLP - np.eye(len(similarity_matrix_NLP))

def valuation_calculator(index, data, multiple, similarity_matrix):
    sorted_similarity_array = -np.sort((-similarity_matrix[0]))
    sorted_similarity_indices = (-similarity_matrix[index]).argsort()
    top_3_comps = sorted_similarity_array[:3]
    baseline_comp = sorted_similarity_array[4]
    normalized_weights = ((top_3_comps-baseline_comp) /
    top_3_pe = data[multiple][sorted_similarity_indices[:3]]
    weigthed_pe =, top_3_pe)
    if multiple == 'Trailing P/E':
        valuation = weigthed_pe * data['Diluted EPS'][index]
    elif multiple == 'Enterprise Value/EBITDA':
        valuation = weigthed_pe * data['EBITDA'][index]
    elif multiple == 'Enterprise Value/Revenue':
        valuation = weigthed_pe * data['Revenue'][index]
    return valuation

valuation_df_nlp = data[['Name', 'Symbol', 'Sector', 'Color', 'Current Price',
                        'Enterprise Value']].copy()

for i, company in enumerate(data['Name']):
    valuation_df_nlp.loc[i, 'Valuation (P/E)'] = \
    valuation_calculator(i, data, 'Trailing P/E', sim_mat)

    valuation_df_nlp.loc[i, 'Valuation (EV/EBITDA)'] = \
    valuation_calculator(i, data, 'Enterprise Value/EBITDA', sim_mat)

    valuation_df_nlp.loc[i, 'Valuation (EV/Revenue)'] = \
    valuation_calculator(i, data, 'Enterprise Value/Revenue', sim_mat)

valuation_df_nlp['% Over/Undervaluation (EV/Revenue)'] = \
(valuation_df_nlp['Valuation (EV/Revenue)'] /
    valuation_df_nlp['Enterprise Value']-1)*100

valuation_df_nlp['% Over/Undervaluation (EV/EBITDA)'] = \
(valuation_df_nlp['Valuation (EV/EBITDA)'] /
    valuation_df_nlp['Enterprise Value']-1)*100

valuation_df_nlp['% Over/Undervaluation (P/E)'] = \
(valuation_df_nlp['Valuation (P/E)'] / valuation_df_nlp['Current Price']-1)*100
In [19]:
valuation_df_nlp['% Over/Undervaluation (EV/Revenue)'].describe()
count     430.000000
mean      101.705297
std       451.641116
min       -89.167688
25%       -23.991184
50%        17.955084
75%        91.800569
max      6113.063281
Name: % Over/Undervaluation (EV/Revenue), dtype: float64
In [20]:
valuation_df_nlp['% Over/Undervaluation (EV/EBITDA)'].describe()
count     430.000000
mean       31.095668
std       261.918370
min       -85.110505
25%       -19.046748
50%         5.264212
75%        41.416836
max      5282.424119
Name: % Over/Undervaluation (EV/EBITDA), dtype: float64
In [21]:
valuation_df_nlp['% Over/Undervaluation (P/E)'].describe()
count     430.000000
mean       70.585367
std       268.240955
min       -97.265648
25%       -28.156419
50%         6.538062
75%        58.264523
max      2774.539533
Name: % Over/Undervaluation (P/E), dtype: float64
In [22]:
# Add the price on the last day of 2016 to the dataframe
last_day_2016 = close_price.loc['2016-12-30', :].rename('2016-12-30 Price')
valuation_df_nlp = valuation_df_nlp.join(last_day_2016, on='Symbol')
In [24]:
# Calculates the percentage of correct predictions year to date
valuation_df_nlp['Actual Change'] = \
valuation_df_nlp['Current Price'] > valuation_df_nlp['2016-12-30 Price']

valuation_df_nlp['Return'] = \
valuation_df_nlp['Current Price'] / valuation_df_nlp['2016-12-30 Price'] - 1

valuation_df_nlp['Prediction'] = \
valuation_df_nlp['Valuation (P/E)'] > valuation_df_nlp['2016-12-30 Price']

avg_return = \
np.mean(valuation_df_nlp['Prediction'].replace(0, -1) *

outcome = (valuation_df_nlp['Prediction'] == valuation_df_nlp['Actual Change'])
correct_pred = sum(outcome) / len(outcome)
print('Percentage of correct predictions YTD (NLP): ' + str(correct_pred))
print('Average Return: ' + str(avg_return))
Percentage of correct predictions YTD (NLP): 0.651162790698
Average Return: 0.1147698701648122

Disclaimer: While the system appears to be promising, there are many reasons other than luck for which this should not be taken at face value. First, we use the last price of 2016 as a reference, but the financial statements are published later even though they reflect the state of the company at the end of the year. Additionally, many more tests would actually need to performed to evaluate the quality of these signals which is beyond the aim of this notebook. As such, you should not consider using these results for taking investment decisions.

In [25]:
source = \
ColumnDataSource(dict(x=valuation_df_nlp['% Over/Undervaluation (P/E)'],
                      y=valuation_df_nlp['% Over/Undervaluation (EV/EBITDA)'],

hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("name", "@label"),
    ("sector", "@sector"),
    ("ticker", "@ticker"),
    ("(x,y)", "($x, $y)"),

p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)

labels = LabelSet(x='x', y='y', text='label', source=source,

# Remove "legend='sector'," to remove legend
p.scatter(x='x', y='y', color='color', legend='sector', source=source)

p.title.text = "Over/Undervaluation (NLP) Based On..."
p.title.text_font_size = "25px"

# Uncomment the following line to add labels
# p.add_layout(labels)

p.xaxis.axis_label = 'Trailing P/E'
p.yaxis.axis_label = 'EV/EBITDA'


A positive value means that the company should be trading at a higher value (as of the end of 2016). Overall, there appears to be a positive relationship between the valuation based on those two multiples. One caveat, however, is the limited amount of companies. Some companies in the S&P500 have no good comparable companies, and therefore, the percentage of over/undervaluation suggested by the model should not be taken at face value. For example, companies such as Netflix and Amazon appear highly overvalued based on this model. However, this is because idiosyncratic factors have a large influence on these companies which are not captured by this methodology. In fact, on an EV/Revenue basis, the model actually suggests that Amazon is undervalued. Conducting this analysis with a much larger dataset would partially alleviate this issue. An interesting observation is that Nvidia, which has performed very well recently due to developments in the GPU space is still not as "overvalued" as Netflix on the basis of those two multiples.

Correlation Approach

In [26]:
cols_to_be_used = ['Name', 'Symbol', 'Sector', 'Color', 'Current Price',
                   'Enterprise Value', 'Enterprise Value/Revenue',
                   'Enterprise Value/EBITDA', 'Trailing P/E']

df1 = variation_df[['Name', 'Cluster']].set_index('Name')
val_df_corr = data[cols_to_be_used].join(df1, on='Name')
val_df_corr = val_df_corr.dropna()

for i in range(n_labels + 2):
    val_df_corr.loc[val_df_corr['Cluster'] == i, 'Cluster P/E'] = \
    np.mean(val_df_corr[val_df_corr['Cluster'] == i]['Trailing P/E'])

    val_df_corr.loc[val_df_corr['Cluster'] == i, 'Cluster EV/EBITDA'] = \
    np.mean(val_df_corr[val_df_corr['Cluster'] == i]
            ['Enterprise Value/EBITDA'])

    val_df_corr.loc[val_df_corr['Cluster'] == i, 'Cluster EV/Revenue'] = \
    np.mean(val_df_corr[val_df_corr['Cluster'] == i]
            ['Enterprise Value/Revenue'])

val_df_corr['% Over/Undervaluation (EV/Revenue)'] = \
(val_df_corr['Enterprise Value/Revenue'] /
    val_df_corr['Cluster EV/Revenue']-1)*100

val_df_corr['% Over/Undervaluation (EV/EBITDA)'] = \
(val_df_corr['Enterprise Value/EBITDA'] /
    val_df_corr['Cluster EV/EBITDA'] - 1) * 100

val_df_corr['% Over/Undervaluation (P/E)'] = \
(val_df_corr['Trailing P/E'] / val_df_corr['Cluster P/E'] - 1) * 100
In [27]:
val_df_corr['% Over/Undervaluation (EV/Revenue)'].round(3).describe()
count    411.000000
mean       0.000002
std       41.113803
min      -96.331000
25%      -25.817000
50%       -2.205000
75%       17.535500
max      186.307000
Name: % Over/Undervaluation (EV/Revenue), dtype: float64
In [28]:
val_df_corr['% Over/Undervaluation (EV/EBITDA)'].round(3).describe()
count    411.000000
mean      -0.000032
std       45.999853
min      -98.236000
25%      -18.438500
50%       -4.308000
75%       11.977500
max      610.876000
Name: % Over/Undervaluation (EV/EBITDA), dtype: float64
In [29]:
val_df_corr['% Over/Undervaluation (P/E)'].round(3).describe()
count     411.000000
mean       -0.000015
std        92.845076
min       -93.641000
25%       -39.437500
50%       -11.260000
75%        11.894500
max      1026.426000
Name: % Over/Undervaluation (P/E), dtype: float64
In [30]:
source = \
ColumnDataSource(dict(x=val_df_corr['% Over/Undervaluation (P/E)'],
                      y=val_df_corr['% Over/Undervaluation (EV/EBITDA)'],

hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("name", "@label"),
    ("sector", "@sector"),
    ("ticker", "@ticker"),
    ("(x,y)", "($x, $y)"),

p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)

labels = LabelSet(x='x', y='y', text='label', source=source,

# Remove "legend='sector'," to remove legend
p.scatter(x='x', y='y', color='color', legend='sector', source=source)

p.title.text = "Over/Undervaluation (Correlation) Based On..."
p.title.text_font_size = "25px"

# Uncomment the following line to add labels
# p.add_layout(labels)

p.xaxis.axis_label = 'Trailing P/E'
p.yaxis.axis_label = 'EV/EBITDA'


This methodology gives valuations which are much less extreme than the NLP approach. One potential explanation for this is that clusters are closer to the traditional industry groups, and therefore, to how the market values these companies.

These results were obtained as of:

In [31]:
Out[31]:, 11, 28)


Titman, Sheridan and Martin, John D. Valuation: the Art and Science of Corporate Investment Decisions. Prentice Hall, 2015. Print.

Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents.

Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE.

Gael Varoquaux. Visualizing the Stock Market Structure.