import pandas as pd
import numpy as np
from os import listdir
from IPython.display import Image
The “data scientist” (2008) - “It’s a high-ranking professional with the training and curiosity to make discoveries in the world of big data.”
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959).
The study and construction of algorithms that can learn from and make predictions on data (through building a model from sample inputs).
Example application: approving loans automatically using machine learning models.
1950 — The “Turing Test” (Alan Turing) - a test to determine whether a computer has real intelligence. To pass the test, a computer must be able to fool a human into believing it is also human.
1952 — First computer learning program (Arthur Samuel) - Checkers game. The IBM computer improved at the game the more it played, studying which moves made up winning strategies and incorporating those moves into its program.
1957 — First neural network for computers - the perceptron (Frank Rosenblatt), designed to simulate the thought processes of the human brain.
1967 — The “nearest neighbor” algorithm was written - basic pattern recognition. It could be used to map a route for a traveling salesman, starting at a random city but ensuring all cities are visited during a short tour.
1997 — IBM’s Deep Blue beats the world champion at chess.
2006 — Geoffrey Hinton coins the term “deep learning” to explain new algorithms that let computers “see” and distinguish objects and text in images and videos.
Overfitting - when a statistical model describes random error or noise instead of the underlying relationship.
Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
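A minimal sketch of the idea (the quadratic ground truth, noise level, and polynomial degrees below are arbitrary choices, not from the original): the high-degree fit chases the training noise, so its training error drops while its error on held-out points usually rises.

import numpy as np

rng = np.random.RandomState(0)
# 20 noisy training samples of a simple quadratic relationship
x = np.linspace(-3, 3, 20)
y = x ** 2 + rng.normal(scale=2.0, size=x.shape)

# held-out points from the same (noise-free) relationship
x_test = np.linspace(-2.9, 2.9, 100)
y_test = x_test ** 2

for degree in (2, 9):
    coefs = np.polyfit(x, y, degree)  # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(degree, round(train_mse, 2), round(test_mse, 2))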
A linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables X.
OLS - The Ordinary Least Squares method minimizes the sum of squared errors.
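A minimal sketch of OLS on made-up data (the true coefficients and noise are illustrative): the closed-form least-squares solution and scikit-learn's LinearRegression agree.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))           # one explanatory variable
y = 3.0 * X[:, 0] + 5.0 + rng.normal(size=50)  # true slope 3, intercept 5, plus noise

# closed-form OLS: add an intercept column and solve the least-squares problem
X1 = np.c_[np.ones(len(X)), X]
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("closed form:", beta)  # ~[5, 3]

# the same fit with scikit-learn
lr = LinearRegression().fit(X, y)
print("sklearn:", lr.intercept_, lr.coef_)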
Linear approach for classification.
Logistic regression estimates the probability that the (binary 0/1) dependent variable equals 1.
The logistic (inverse-logit) transformation squashes the linear regression result into a probability between 0 and 1.
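A minimal sketch on made-up one-feature data (everything below is illustrative): the model's linear score, passed through the logistic (sigmoid) function, reproduces predict_proba.

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # inverse logit: maps a real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)  # 1 when the feature is large

clf = LogisticRegression().fit(X, y)
z = clf.intercept_ + clf.coef_[0] * X[:5, 0]  # linear score for 5 samples
print(sigmoid(z))                             # matches the fitted probabilities
print(clf.predict_proba(X[:5])[:, 1])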
Idea - a series of binary questions about the features; each answer splits the data until a prediction can be made.
What kind of decision boundary makes more sense in my problem?
How complex is the relationship between my variables and target?
Are there interactions between my features?
How many features and samples do I have?
Try both models and use cross-validation - it will help you find out which one is more likely to have the better generalization error.
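For example, a minimal sketch of such a comparison (the dataset and the 5-fold setting are arbitrary choices):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=5000)),
    ("decision tree", DecisionTreeClassifier(random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation accuracy
    print(name, scores.mean().round(3))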
Average the predictions of many decision trees, each trained on a bootstrap sample of the data and considering only a random subset of the features at each split.
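A minimal sketch of those two sources of randomness as scikit-learn parameters (dataset and parameter values are arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,     # number of trees to average
    bootstrap=True,       # each tree sees a bootstrap sample of the rows
    max_features="sqrt",  # each split considers a random subset of the features
    random_state=0,
)
print(cross_val_score(rf, X, y, cv=5).mean().round(3))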
Kernel trick - implicitly map the inputs into a high-dimensional feature space, where a linear separator may exist even when the original data is not linearly separable.
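A minimal sketch on concentric-circles toy data, which is not linearly separable in the original 2-D space (dataset and parameters are illustrative):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    # the RBF kernel separates the circles; the linear kernel cannot
    print(kernel, clf.score(X, y).round(3))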
%matplotlib inline
import numpy as np
import pylab as pl
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn import linear_model
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn import tree, ensemble
import pandas as pd
from matplotlib import pyplot as plt
def plot_results_with_hyperplane(clf, clf_name, df, plt_nmbr):
    x_min, x_max = df.x.min() - .5, df.x.max() + .5
    y_min, y_max = df.y.min() - .5, df.y.max() + .5

    # step between grid points, i.e. [0, 0.02, 0.04, ...]
    step = .02

    # to plot the boundary, we're going to create a grid of every possible point,
    # then label each point as a wolf or cow using our classifier
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # this gets our predictions back into a matrix
    Z = Z.reshape(xx.shape)

    # create a subplot (we're going to have more than 1 plot on a given image)
    pl.subplot(2, 3, plt_nmbr)
    # plot the decision boundaries
    pl.pcolormesh(xx, yy, Z, cmap=pl.cm.Paired)

    # plot the wolves and cows
    for animal in df.animal.unique():
        pl.scatter(df[df.animal == animal].x,
                   df[df.animal == animal].y,
                   marker=animal, s=70,
                   label="cows" if animal == "x" else "wolves",
                   color='black')
    pl.title(clf_name, fontsize=20)
data = open("cows_and_wolves.txt").read()
data = [row.split('\t') for row in data.strip().split('\n')]

animals = []
for y, row in enumerate(data):
    for x, item in enumerate(row):
        # x's are cows, o's are wolves
        if item in ['o', 'x']:
            animals.append([x, y, item])

df = pd.DataFrame(animals, columns=["x", "y", "animal"])
df['animal_type'] = df.animal.apply(lambda x: 0 if x == "x" else 1)

# train using the x and y position coordinates
train_cols = ["x", "y"]
clfs = {
"SVM": svm.SVC(),
"Logistic Regression" : linear_model.LogisticRegression(),
"Decision Tree": tree.DecisionTreeClassifier(),
"Random Forest": ensemble.RandomForestClassifier(random_state=0),
"K-Nearest Neighbors Classifier": KNeighborsClassifier(n_neighbors=3),
"Gaussian Naive Bayes": GaussianNB(),
}
plt.figure(figsize=(20, 12))
plt_nmbr = 1
for clf_name, clf in clfs.items():
    clf.fit(df[train_cols], df.animal_type)
    plot_results_with_hyperplane(clf, clf_name, df, plt_nmbr)
    plt_nmbr += 1
pl.show()
Good informative features --> Good model
Be aware of data leakage (Cancer example, Clickstream example) - a sketch follows this list
Learning from your model's errors
Using sampling methods
Choosing the right metrics for your task
Learn pandas tips and tricks
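On data leakage, a minimal sketch of one common variant (the dataset is an arbitrary choice): preprocessing fitted on the full dataset lets information from the test folds leak into training, while fitting it inside a Pipeline keeps each test fold unseen.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# LEAKY: the scaler sees the test folds before cross-validation splits the data
X_scaled = StandardScaler().fit_transform(X)
print(cross_val_score(SVC(), X_scaled, y, cv=5).mean())

# SAFE: the scaler is re-fitted on the training fold only, inside each CV split
pipe = make_pipeline(StandardScaler(), SVC())
print(cross_val_score(pipe, X, y, cv=5).mean())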