import pandas as pd
import numpy as np
from os import listdir
from IPython.display import Image
The “data scientist” (2008) - “It’s a high-ranking professional with the training and curiosity to make discoveries in the world of big data.”
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959).
The study and construction of algorithms that can learn from and make predictions on data (through building a model from sample inputs).
Example application: approving loans automatically using machine learning models.
1950 — The “Turing Test” (Alan Turing) - a test to determine whether a computer has real intelligence. To pass the test, a computer must be able to fool a human into believing it is also human.
1952 — First computer learning program (Arthur Samuel) - Checkers game. The IBM computer improved at the game the more it played, studying which moves made up winning strategies and incorporating those moves into its program.
1957 — First neural network for computers - the perceptron (Frank Rosenblatt), designed to simulate the thought processes of the human brain.
1967 — The “nearest neighbor” algorithm was written - basic pattern recognition. It could be used to map a route for a traveling salesman, starting at a random city but ensuring all cities are visited during a short tour.
1997 — IBM’s Deep Blue beats the world champion at chess.
2006 — Geoffrey Hinton coins the term “deep learning” to explain new algorithms that let computers “see” and distinguish objects and text in images and videos.
Overfitting - when a statistical model describes random error or noise instead of the underlying relationship.
Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
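A minimal sketch of the idea (the quadratic ground truth, noise level, and polynomial degrees below are arbitrary choices, not from the original): the high-degree fit chases the training noise, so its training error drops while its error on held-out points usually rises.

import numpy as np

rng = np.random.RandomState(0)
# 20 noisy training samples of a simple quadratic relationship
x = np.linspace(-3, 3, 20)
y = x ** 2 + rng.normal(scale=2.0, size=x.shape)

# held-out points from the same (noise-free) relationship
x_test = np.linspace(-2.9, 2.9, 100)
y_test = x_test ** 2

for degree in (2, 9):
    coefs = np.polyfit(x, y, degree)  # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(degree, round(train_mse, 2), round(test_mse, 2))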
A linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables X.
OLS - The Ordinary Least Squares method minimizes the sum of squared errors.
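A minimal sketch of OLS on made-up data (the true coefficients and noise are illustrative): the closed-form least-squares solution and scikit-learn's LinearRegression agree.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))           # one explanatory variable
y = 3.0 * X[:, 0] + 5.0 + rng.normal(size=50)  # true slope 3, intercept 5, plus noise

# closed-form OLS: add an intercept column and solve the least-squares problem
X1 = np.c_[np.ones(len(X)), X]
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("closed form:", beta)  # ~[5, 3]

# the same fit with scikit-learn
lr = LinearRegression().fit(X, y)
print("sklearn:", lr.intercept_, lr.coef_)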
Linear approach for classification.
Logistic regression estimates the probability that the (binary 0/1) dependent variable equals 1.
The logistic (inverse-logit) transformation squashes the linear regression result into a probability between 0 and 1.
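A minimal sketch on made-up one-feature data (everything below is illustrative): the model's linear score, passed through the logistic (sigmoid) function, reproduces predict_proba.

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # inverse logit: maps a real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)  # 1 when the feature is large

clf = LogisticRegression().fit(X, y)
z = clf.intercept_ + clf.coef_[0] * X[:5, 0]  # linear score for 5 samples
print(sigmoid(z))                             # matches the fitted probabilities
print(clf.predict_proba(X[:5])[:, 1])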
Idea - a series of binary questions about the features; each answer splits the data until a prediction can be made.
What kind of decision boundary makes more sense in my problem?
How complex is the relationship between my variables and target?
Are there interactions between my features?
How many features and samples do I have?
Try both models and use cross-validation - it will help you find out which one is more likely to have the better generalization error.
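For example, a minimal sketch of such a comparison (the dataset and the 5-fold setting are arbitrary choices):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=5000)),
    ("decision tree", DecisionTreeClassifier(random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation accuracy
    print(name, scores.mean().round(3))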
Average the predictions of many decision trees, each trained on a bootstrap sample of the data and considering only a random subset of the features at each split.
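A minimal sketch of those two sources of randomness as scikit-learn parameters (dataset and parameter values are arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,     # number of trees to average
    bootstrap=True,       # each tree sees a bootstrap sample of the rows
    max_features="sqrt",  # each split considers a random subset of the features
    random_state=0,
)
print(cross_val_score(rf, X, y, cv=5).mean().round(3))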
Kernel trick - implicitly map the inputs into a high-dimensional feature space, where a linear separator may exist even when the original data is not linearly separable.
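A minimal sketch on concentric-circles toy data, which is not linearly separable in the original 2-D space (dataset and parameters are illustrative):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    # the RBF kernel separates the circles; the linear kernel cannot
    print(kernel, clf.score(X, y).round(3))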
%matplotlib inline
import numpy as np
import pylab as pl
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn import linear_model
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn import tree, ensemble
import pandas as pd
from matplotlib import pyplot as plt
def plot_results_with_hyperplane(clf, clf_name, df, plt_nmbr):
    x_min, x_max = df.x.min() - .5, df.x.max() + .5
    y_min, y_max = df.y.min() - .5, df.y.max() + .5

    # step between grid points, i.e. [0, 0.02, 0.04, ...]
    step = .02

    # to plot the boundary, we're going to create a grid of every possible point,
    # then label each point as a wolf or cow using our classifier
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # this gets our predictions back into a matrix
    Z = Z.reshape(xx.shape)

    # create a subplot (we're going to have more than 1 plot on a given image)
    pl.subplot(2, 3, plt_nmbr)
    # plot the decision boundaries
    pl.pcolormesh(xx, yy, Z, cmap=pl.cm.Paired)

    # plot the wolves and cows
    for animal in df.animal.unique():
        pl.scatter(df[df.animal == animal].x,
                   df[df.animal == animal].y,
                   marker=animal, s=70,
                   label="cows" if animal == "x" else "wolves",
                   color='black')
    pl.title(clf_name, fontsize=20)
data = open("cows_and_wolves.txt").read()
data = [row.split('\t') for row in data.strip().split('\n')]

animals = []
for y, row in enumerate(data):
    for x, item in enumerate(row):
        # x's are cows, o's are wolves
        if item in ['o', 'x']:
            animals.append([x, y, item])

df = pd.DataFrame(animals, columns=["x", "y", "animal"])
df['animal_type'] = df.animal.apply(lambda x: 0 if x == "x" else 1)

# train using the x and y position coordinates
train_cols = ["x", "y"]
clfs = {
"SVM": svm.SVC(),
"Logistic Regression" : linear_model.LogisticRegression(),
"Decision Tree": tree.DecisionTreeClassifier(),
"Random Forest": ensemble.RandomForestClassifier(random_state=0),
"K-Nearest Neighbors Classifier": KNeighborsClassifier(n_neighbors=3),
"Gaussian Naive Bayes": GaussianNB(),
}
plt.figure(figsize=(20, 12))
plt_nmbr = 1
for clf_name, clf in clfs.items():
    clf.fit(df[train_cols], df.animal_type)
    plot_results_with_hyperplane(clf, clf_name, df, plt_nmbr)
    plt_nmbr += 1
pl.show()
Good informative features --> Good model
Be aware of data leakage (Cancer example, Clickstream example) - a sketch follows this list
Learning from your model's errors
Using sampling methods
Choosing the right metrics for your task
Learn pandas tips and tricks
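On data leakage, a minimal sketch of one common variant (the dataset is an arbitrary choice): preprocessing fitted on the full dataset lets information from the test folds leak into training, while fitting it inside a Pipeline keeps each test fold unseen.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# LEAKY: the scaler sees the test folds before cross-validation splits the data
X_scaled = StandardScaler().fit_transform(X)
print(cross_val_score(SVC(), X_scaled, y, cv=5).mean())

# SAFE: the scaler is re-fitted on the training fold only, inside each CV split
pipe = make_pipeline(StandardScaler(), SVC())
print(cross_val_score(pipe, X, y, cv=5).mean())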