Ryan J. O'Neil
Data Science DC
March 20, 2017
ryanjoneil@gmail.com
https://ryanjoneil.github.io
@ryanjoneil
We observe noisy data from an unknown function. We want to infer that function.
# Generate some random data.
import numpy as np
import random
# Sort the data so they're easier to plot later.
x = [random.uniform(-10, 10) for _ in range(500)]
x.sort()
y = []
for xi in x:
    eps = random.normalvariate(0, 25)
    yi = 3*xi**2 - 2*xi + 10 + eps
    y.append(yi)
x = np.array(x)
y = np.array(y)
...assuming you use scikit-learn
like every other sane Python programmer.
We looked at a chart of our data and decided to describe it with a quadratic model: $$y = \beta_1 x^2 + \beta_2 x + \beta_3$$
from sklearn.linear_model import LinearRegression
# Note: X is our feature matrix.
# We intentionally add a "1" for the offset, instead of letting
# sklearn do that for us. This will make sense soon.
X = np.array([[xi**2, xi, 1] for xi in x])
lin = LinearRegression(fit_intercept=False)
lin.fit(X, y)
print(lin.coef_)
[ 2.98259014 -2.25266697 10.66100195]
How'd we do?
from bokeh.charts import Line
y_hat = lin.predict(X)
show(Line({'x': x, 'y': y_hat}, x='x', y='y', width=750, height=400))
...with chalk and a slab of slate.
Construct a function to calculate the sum of squared residuals...
$$ \begin{align} \text{min}\ f(\beta) & = \frac{1}{2} ||y - X \beta||^2 \\ & \\ & = \frac{1}{2} (y - X \beta)'(y - X \beta) \\ & \\ & = \frac{1}{2} y'y - y'X\beta + \frac{1}{2} \beta'X'X\beta \\ & \\ \end{align} $$
...and take its derivative to find a closed-form solution.
$$\nabla f(\beta) = X'X\beta - X'y = 0$$
$$\beta = (X'X)^{-1}X'y$$
from numpy.linalg import inv
# beta = (X'X)^-1 * X' * y
Xt = X.transpose()
XtX_inv = inv(np.matmul(Xt, X))
beta = np.matmul(np.matmul(XtX_inv, Xt), y)
print(beta)
[ 2.98259014 -2.25266697 10.66100195]
How'd grandma and grandpa do?
y_hat = [beta[0]*xi**2 + beta[1]*xi + beta[2] for xi in x]
show(Line({'x': x, 'y': y_hat}, x='x', y='y', width=750, height=400))
...'cause he used to work at NASA and code in Forth.
So we need to convert from
$$\frac{1}{2} \beta'X'X\beta - y'X\beta + \frac{1}{2} y'y$$
to another form
$$\frac{1}{2} \beta'P\beta + q'\beta$$
which is simply
$$P = X'X, \quad q = -X'y$$
import cvxopt as cvx
P = cvx.matrix(np.matmul(Xt, X))
q = cvx.matrix(-1 * np.matmul(y.transpose(), X))
solution = cvx.solvers.qp(P, q)
beta = solution['x'] # unrelated to our x
print(beta)
[ 2.98e+00]
[-2.25e+00]
[ 1.07e+01]
How'd Crazy Uncle Eddie do?
y_hat = [beta[0]*xi**2 + beta[1]*xi + beta[2] for xi in x]
show(Line({'x': x, 'y': y_hat}, x='x', y='y', width=750, height=400))
Well, besides the obvious.
While all three techniques produced the same result, Crazy Uncle Eddie's is interesting because it is more general than the others.
Crazy Eddie can solve any convex quadratic optimization problem, of which least squares is just one instance.
If we change the structure of the problem slightly, scikit-learn can no longer help us, but Crazy Uncle Eddie still can.
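For example, suppose we want the fitted coefficients to be nonnegative. LinearRegression has no knob for that, but for a quadratic optimizer it's one extra argument. A minimal sketch (mine, not from the talk), reusing the P and q built above:

# Nonnegative least squares: G*beta <= h encodes -beta <= 0, i.e. beta >= 0.
G = cvx.matrix(-np.eye(3))
h = cvx.matrix(np.zeros(3))
solution = cvx.solvers.qp(P, q, G, h)
print(solution['x'])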
We have a big pot of money to allocate among different investments. Lucky us!
Some investment returns are correlated. They go up and down together.
Other returns are anticorrelated. They tend to move in opposite directions.
How do we allocate our money to maximize our expected return, subject to our tolerance for risk?
We'll use 100 months of exchange rate data from the Fed, circa 2014.
Since this talk isn't about data wrangling, I've already cleaned it up into the important pieces.
$x$ tells me how much of my total budget to put in each investment.
$$ \begin{align} \text{max} \ \ \ & \ \mu'x - \alpha\, x' \Sigma x \\ \text{s.t.} \ \ \ & \ e'x = 1 \\ & \ x \ge 0 \end{align} $$
But wait! This is a maximization problem! The last model used $\text{min}$.
$$ \begin{align} \text{min} \ \ \ & \ \alpha\, x' \Sigma x - \mu'x \\ \text{s.t.} \ \ \ & \ e'x = 1 \\ & \ x \ge 0 \end{align} $$
So we negate the objective and minimize. Now the only differences between this model and the least squares model are the constraints we've added.
This one forces the model to allocate all of my budget to investments:
$$e'x = 1$$
This one prevents the model from making negative investments:
$$x \ge 0$$
# Read in the returns and covariance data.
import pandas as pd
exp_returns = pd.read_csv('portfolio-optimization/currency-returns.csv')
exp_returns.head()
|   | Unnamed: 0 | mean | variance |
|---|---|---|---|
| 0 | RXI$US_N.M.AL | 0.152151 | 10.996718 |
| 1 | RXI$US_N.M.EU | 0.002683 | 5.928188 |
| 2 | RXI$US_N.M.NZ | 0.220217 | 9.690793 |
| 3 | RXI$US_N.M.UK | -0.159779 | 5.098969 |
| 4 | RXI_N.M.BZ | -0.128507 | 12.743632 |
returns_cov = pd.read_csv('portfolio-optimization/currency-covariance.csv', header=None)
returns_cov.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10.996718 | 5.181782 | 8.654762 | 4.457704 | 9.789818 | 5.421746 | 0.194428 | 5.166875 | 0.028496 | 4.379632 | ... | 6.494732 | 7.078498 | 7.083824 | 8.574515 | 3.287411 | 0.343673 | 4.163674 | 2.450772 | 2.381722 | 0.542450 |
1 | 5.181782 | 5.928188 | 4.487890 | 3.659548 | 4.686754 | 2.467804 | 0.313081 | 5.867674 | 0.038117 | 2.545200 | ... | 2.894461 | 5.350733 | 5.515164 | 4.314480 | 2.265781 | 0.418269 | 5.173168 | 1.559857 | 1.351946 | 1.776855 |
2 | 8.654762 | 4.487890 | 9.690793 | 4.330531 | 8.130412 | 4.268307 | 0.129303 | 4.466514 | 0.017975 | 3.919170 | ... | 5.536301 | 5.602586 | 6.054567 | 6.733828 | 2.811946 | 0.451742 | 3.903260 | 2.135899 | 2.569281 | 0.379954 |
3 | 4.457704 | 3.659548 | 4.330531 | 5.098969 | 4.301933 | 2.573100 | 0.137068 | 3.632088 | 0.012420 | 1.821456 | ... | 2.775715 | 4.081803 | 4.233658 | 3.749874 | 1.780620 | 0.326828 | 3.335175 | 1.356927 | 1.014109 | 1.584577 |
4 | 9.789818 | 4.686754 | 8.130412 | 4.301933 | 12.743632 | 5.433992 | 0.279201 | 4.689891 | -0.034405 | 5.567318 | ... | 7.177333 | 7.288341 | 6.774689 | 8.761299 | 2.999360 | 0.843294 | 4.269524 | 2.163251 | 2.439204 | 0.496304 |
5 rows × 23 columns
# A model that returns the optimal expected return for a given risk aversion.
def portfolio(alpha):
    P = cvx.matrix(alpha * returns_cov.as_matrix())
    q = cvx.matrix(-exp_returns['mean'].as_matrix())
    # G and h encode x >= 0 as -x <= 0: G is a negative identity matrix.
    G = cvx.matrix(0.0, (len(q), len(q)))
    G[::len(q)+1] = -1.0
    h = cvx.matrix(0.0, (len(q), 1))
    # A and b encode the budget constraint e'x = 1.
    A = cvx.matrix(1.0, (1, len(q)))
    b = cvx.matrix(1.0)
    solution = cvx.solvers.qp(P, q, G, h, A, b)
    # Expected return of the optimal portfolio.
    return exp_returns['mean'].dot(solution['x'])[0]
risk_aversion = [ra/2.0 for ra in range(41)]
returns = [portfolio(alpha) for alpha in risk_aversion]
[cvxopt solver progress logs for each of the 41 risk-aversion values elided; every solve ends with "Optimal solution found."]
show(Line(
{'risk aversion': risk_aversion, 'expected return': returns},
x='risk aversion',
y='expected return',
width=750,
height=400
))
We want to draw a line that separates two sets with as little misclassification as possible.
If $f(x) \le 5$, the point is of type $-1$, otherwise it is type $+1$.
$$f(x) = (x_1 + \epsilon_1) - (x_2 + \epsilon_2)$$ $$\epsilon_i \sim N\left(0, 1.25^2\right)\ \forall\ i \in \{1, 2\}$$
x1 = [random.uniform(0, 10) for _ in range(150)]
x2 = [random.uniform(0, 10) for _ in range(150)]
y = []
for xi_1, xi_2 in zip(x1, x2):
    eps_1 = random.normalvariate(0, 1.25)
    eps_2 = random.normalvariate(0, 1.25)
    if (xi_1 + eps_1) - (xi_2 + eps_2) <= 5:
        y.append(-1)
    else:
        y.append(+1)
X = np.array(list(zip(x1, x2)))
y = np.array(y)
show(Scatter(
{'x1': x1, 'x2': x2, 'y': y},
x='x1', y='x2', color='y', marker='y', palette=['lightblue', 'orange'],
width=500, height=500
))
scikit-learn
from sklearn import svm
clf = svm.SVC(kernel='linear')
clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
The SVM coefficients give us a separating hyperplane with minimum classification error.
$$b + w'x = 0$$
print('b =', clf.intercept_[0])
print('w =', clf.coef_[0])
b = -3.36050577816
w = [ 0.67262893 -0.60681053]
We classify a point $x = (x_1, x_2)'$ based on which side of that hyperplane it is on.
$$b + w'x \ge 0\ \rightarrow\ \text{Type +1}$$
$$b + w'x < 0\ \rightarrow\ \text{Type -1}$$
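As a sanity check (a sketch of mine, not from the talk), we can apply this rule by hand and compare it to clf.predict:

# Classify by the sign of b + w'x for every observation.
w = clf.coef_[0]
manual = np.where(clf.intercept_[0] + X.dot(w) >= 0, 1, -1)
print('agreement with clf.predict:', (manual == clf.predict(X)).mean())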
from sklearn import metrics
y_hat = clf.predict(X)
print('confusion matrix:')
metrics.confusion_matrix(y_true=y, y_pred=y_hat)
confusion matrix:
array([[117,   2],
       [ 15,  16]])
# We need a function to plot classifications and the separating hyperplane.
def plot_svm_results(correct, b_val, w1_val, w2_val):
    # Classifications
    plot = Scatter(
        {'x1': x1, 'x2': x2, 'y': y, 'correct': correct},
        x='x1', y='x2',
        color='correct', palette=['lightgrey', 'darkred'], marker='correct',
        width=500, height=500
    )
    # Separating hyperplane: x2 = -(b + (w1*x1)) / w2
    hyperplane_x1 = np.linspace(0, 10)
    hyperplane_x2 = [
        -(b_val + (w1_val * x1_i)) / w2_val for x1_i in hyperplane_x1
    ]
    plot.line(hyperplane_x1, hyperplane_x2, color='darkblue')
    show(plot)
How'd scikit-learn do?
correct = ['Correct' if yi == yh_i else 'Incorrect' for yi, yh_i in zip(y, y_hat)]
plot_svm_results(correct, clf.intercept_[0], clf.coef_[0][0], clf.coef_[0][1])
PuLP
is a modeling tool that lets us solve anything of this form: $$ \begin{align} \text{min} \ \ \ & \ c'x \\ \text{s.t.} \ \ \ & \ A x = b \\ & \ x \ge 0 \end{align} $$
Where $c$ and $b$ are input vectors, $A$ is an input matrix, and $x$ is a vector of real-valued decision variables.
Linear Optimization is well solved. You can assume that a linear optimizer is robust and fast for any model in that form.
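As a warm-up (my sketch, not part of the talk), here's about the smallest PuLP model one can write:

import pulp
# min u + 2v  subject to  u + v >= 1, u >= 0, v >= 0.
u = pulp.LpVariable('u', lowBound=0)
v = pulp.LpVariable('v', lowBound=0)
warmup = pulp.LpProblem(sense=pulp.LpMinimize)
warmup += u + v >= 1
warmup.setObjective(u + 2*v)
assert warmup.solve() == pulp.LpStatusOptimal
print(u.value(), v.value())  # expect u = 1, v = 0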
PuLP can also solve discrete models, which are not so computationally easy.
How do we get our SVM problem into a linear optimizer?
The decision variables $b$, $w_1$, and $w_2$, give us a separating hyperplane $b + w'x = 0$.
For each misclassified observation $i$, $\xi_i$ is its classification error. We find the best hyperplane by minimizing total error.
$$ \begin{align} \text{min} \ \ \ & \ \sum_i \xi_i \ & \ \\ \text{s.t.} \ \ \ & \ b + w_1 x_{i1} + w_2 x_{i2} \le -1 + \xi_i\ & \ \forall\ y_i = -1 \\ & \ b + w_1 x_{i1} + w_2 x_{i2} \ge +1 - \xi_i\ & \ \forall\ y_i = +1 \\ & \ \xi \ge 0 \end{align} $$
We can simplify the model by multiplying each constraint by $y_i$.
$$ \begin{align} \text{min} \ \ \ & \ \sum_i \xi_i \\ \text{s.t.} \ \ \ & \ y_i (b + w_1 x_{i1} + w_2 x_{i2}) \ge 1 - \xi_i\ \forall\ i \\ & \ \xi \ge 0 \end{align} $$
But wait! Constraints like $a'x \ge b$ aren't in the form you gave us!
That's OK. They get converted by the optimizer.
$$a'x - e = b, \quad e \ge 0$$
So do constraints of the form $a'x \le b$:
$$a'x + s = b, \quad s \ge 0$$
As do any variables that are unrestricted in sign:
$$x_i = x_i^+ - x_i^-, \quad x_i^+, x_i^- \ge 0$$
As long as your objective function is linear and your constraints are linear and of the form $=$, $\le$, or $\ge$, you can solve it with a linear optimizer.
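As a concrete example (mine): the constraint $x_1 + 2 x_2 \ge 3$ becomes the equality $x_1 + 2 x_2 - e = 3$ with a surplus variable $e \ge 0$. Any solution of one yields a solution of the other.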
Let's get hacking. First make variables for the separating hyperplane.
Note these are unrestricted in sign.
import pulp
b = pulp.LpVariable('b')
w1 = pulp.LpVariable('w1')
w2 = pulp.LpVariable('w2')
We also need variables for the classification errors.
These are nonnegative since we only want to penalize errors, not reward the model for correct classifications.
errors = []
for i in range(len(y)):
    e = pulp.LpVariable('e_%d' % i, lowBound=0)  # lowBound=0 means e >= 0
    errors.append(e)
Build a model that records classification error based on the separating hyperplane in the appropriate variables.
m = pulp.LpProblem(sense=pulp.LpMinimize)
for x_i1, x_i2, y_i, e_i in zip(x1, x2, y, errors):
    m += y_i * (b + w1 * x_i1 + w2 * x_i2) >= (1 - e_i)
So far the model represents the feasible region of all possible separating hyperplanes.
We want the best one.
m.setObjective(sum(errors))
assert m.solve() == pulp.LpStatusOptimal
print('b =', b.value())
print('w =', [w1.value(), w2.value()])
b = -3.5456539
w = [0.71991737, -0.64676861]
We don't have a predict function, so let's build one.
def classify(xi_1, xi_2):
    if b.value() + (w1.value() * xi_1) + (w2.value() * xi_2) >= 0:
        return +1
    return -1
y_hat = [classify(xi_1, xi_2) for xi_1, xi_2 in zip(x1, x2)]
print('confusion matrix:')
metrics.confusion_matrix(y_true=y, y_pred=y_hat)
confusion matrix:
array([[117,   2],
       [ 15,  16]])
How'd the linear optimizer do?
correct = ['Correct' if yi == yh_i else 'Incorrect' for yi, yh_i in zip(y, y_hat)]
plot_svm_results(correct, b.value(), w1.value(), w2.value())
In Linear Algebra 101 we learned to solve problems that look like this.
$$Ax = b$$
$$ \begin{bmatrix} \cdot & \cdot \\ \cdot & \cdot \end{bmatrix} \begin{bmatrix} \cdot \\ \cdot \end{bmatrix} = \begin{bmatrix} \cdot \\ \cdot \end{bmatrix} $$
Many statistical inference problems are overdetermined and unsolvable directly.
$$ \begin{bmatrix} \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \end{bmatrix} \begin{bmatrix} \cdot \\ \cdot \end{bmatrix} = \begin{bmatrix} \cdot \\ \cdot \\ \cdot \\ \cdot \\ \cdot \\ \cdot \\ \cdot \\ \cdot \\ \cdot \\ \cdot \end{bmatrix} $$
Decision models tend to involve a large number of potential actions in relation to the constraint set.
$$ \begin{bmatrix} \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \end{bmatrix} \begin{bmatrix} \cdot \\ \cdot \\ \cdot \\ \cdot \\ \cdot \\ \cdot \\ \cdot \\ \cdot \\ \cdot \\ \cdot \end{bmatrix} = \begin{bmatrix} \cdot \\ \cdot \end{bmatrix} $$
In both cases, we start with a problem we can't solve easily and reformulate it as something we can.
For statistical inference, we solve a secondary problem that minimizes error against the observed data.
For decision models, we choose a set of actions to take (i.e. a basis) and solve the remaining system.
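For instance (a sketch of mine, not from the talk), numpy handles the overdetermined case by minimizing error rather than demanding equality:

# Four equations, two unknowns: no exact solution exists, so ask for
# the x that minimizes ||Ax - b|| instead.
A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
b = np.array([1.1, 1.9, 3.2, 3.8])
x_star = np.linalg.lstsq(A, b)[0]
print(x_star)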
We have a set of unclassified data and we want to cluster it around logical centers.
We generate seven points around each center $(x_{10}, x_{20}) \in \{(0.3, 0.3), (0.3, 0.7), (0.7, 0.7)\}$.
$$x_{1i} = x_{10} + \epsilon_1$$ $$x_{2i} = x_{20} + \epsilon_2$$ $$\epsilon_1, \epsilon_2 \sim N\left(0, 0.2^2\right)$$
centers = [(0.3, 0.3), (0.3, 0.7), (0.7, 0.7)]
x1 = []
x2 = []
for x10, x20 in centers:
    for _ in range(7):
        eps_x1 = random.normalvariate(0, 0.2)
        eps_x2 = random.normalvariate(0, 0.2)
        # Clip the points to the unit square.
        x1i = min(max(0.0, x10+eps_x1), 1.0)
        x2i = min(max(0.0, x20+eps_x2), 1.0)
        x1.append(x1i)
        x2.append(x2i)
X = np.array(list(zip(x1, x2)))
show(Scatter({'x1': x1, 'x2': x2}, x='x1', y='x2', width=500, height=500))
scikit-learn
Assume we know a priori how many clusters there are.
from sklearn.cluster import KMeans
clust = KMeans(n_clusters=3)
clust.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)
The KMeans model has two outputs: the cluster centers and the label assigned to each point.
clust.cluster_centers_
array([[ 0.80998334,  0.53503504],
       [ 0.32591213,  0.69901855],
       [ 0.25386159,  0.29119595]])
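The second output is the cluster assigned to each training point, stored in labels_ (predict on the same data should agree):

print(clust.labels_[:10])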
We need a function to easily plot clustering results.
def plot_clusters(clusters, cluster_centers):
    plot = Scatter(
        {'x1': x1, 'x2': x2, 'clusters': clusters},
        x='x1', y='x2',
        color='clusters', marker='clusters',
        width=500, height=500
    )
    for x10, x20 in cluster_centers:
        plot.circle(x10, x20, alpha=0.5, color='grey', size=15)
    show(plot)
How'd scikit-learn do?
plot_clusters(clust.predict(X), clust.cluster_centers_)
Let's extend the linear optimization model to support binary variables.
$$ \begin{align} \text{min} \ \ \ & \ c'x + d'y \\ \text{s.t.} \ \ \ & \ A x + D y = b \\ & \ x \ge 0 \\ & \ y \in \{0, 1\} \end{align} $$
$y$ is a vector of variables that can only take the values $0$ or $1$. These represent discrete decisions and let us encode logical relationships among them.
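For instance (a sketch of mine, not from the talk), "we can only ship from a warehouse if we build it" turns into a single linear constraint on two binary variables:

import pulp
build = pulp.LpVariable('build', cat=pulp.LpBinary)
ship = pulp.LpVariable('ship', cat=pulp.LpBinary)
logic = pulp.LpProblem(sense=pulp.LpMinimize)
logic += ship <= build  # ship = 1 forces build = 1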
Discrete optimization is very powerful, but not well solved the way linear optimization is. Model solvability is highly dependent on formulation.
What's a clustering model look like with binary variables?
We need to make two major types of decisions:
We minimize total taxicab distance from each point to its assigned cluster center.
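Taxicab ($L_1$) distance between a point and a center is $$|x_{1i} - x_{10}| + |x_{2i} - x_{20}|$$ Unlike Euclidean distance, the absolute values can be expressed with linear constraints, which keeps the model solvable by a mixed integer linear optimizer.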
Primary decision variables = locations of cluster centers
w = []
for j in range(len(centers)):
    x10 = pulp.LpVariable('x_10%d' % j, lowBound=0, upBound=1)
    x20 = pulp.LpVariable('x_20%d' % j, lowBound=0, upBound=1)
    w.append([x10, x20])
Secondary decision variables = do we associate point $i$ with cluster $j$?
v = []
for i in range(len(x1)):
    vi = []
    for j in range(len(w)):
        vij = pulp.LpVariable('v_%d%d' % (i, j), cat=pulp.LpBinary)
        vi.append(vij)
    v.append(vi)
Each point is associated with exactly one cluster.
m = pulp.LpProblem(sense=pulp.LpMinimize)
for vi in v:
    m += sum(vi) == 1
Make a list of taxicab distance variables to guide the optimizer.
distances = []
for i, (vi, x1i, x2i) in enumerate(zip(v, x1, x2)):
    for j, (wj, vij) in enumerate(zip(w, vi)):
        x10, x20 = wj
        dx1 = pulp.LpVariable('dx_1%d%d' % (i, j), lowBound=0, upBound=1)
        dx2 = pulp.LpVariable('dx_2%d%d' % (i, j), lowBound=0, upBound=1)
        # If point i is assigned to cluster j (v_ij = 1), these constraints
        # force dx1 >= |x1i - x10| and dx2 >= |x2i - x20|. If not, the
        # (1 - v_ij) term relaxes them and the distances can drop to zero.
        m += dx1 >= (x1i - x10) - (1 - vij)
        m += dx1 >= (x10 - x1i) - (1 - vij)
        m += dx2 >= (x2i - x20) - (1 - vij)
        m += dx2 >= (x20 - x2i) - (1 - vij)
        distances.extend([dx1, dx2])
WARNING WARNING WARNING
Symmetry elimination is beyond the scope of this talk.
We don't technically need it to solve this model, but it sure can help.
for j, (x01_1, x02_1) in enumerate(w):
    for x01_2, x02_2 in w[j+1:]:
        # Order the centers by coordinate sum so permutations of the same
        # clustering don't count as distinct solutions.
        m += x01_1 + x02_1 <= x01_2 + x02_2
Total distance from points to their cluster centers is our objective function.
m.setObjective(sum(distances))
assert m.solve() == pulp.LpStatusOptimal
cluster_centers = []
for wj in w:
    cluster_centers.append([wj[0].value(), wj[1].value()])
for c in cluster_centers:
    print(c)
[0.35947481, 0.24456388]
[0.23018792, 0.75395849]
[0.79554363, 0.60669282]
There's no predict method, so we yank that information out of the $v_{ij}$ variables.
clusters = []
for vi in v:
    for c, vij in zip([0, 1, 2], vi):
        if vij.value() > 0.5:
            clusters.append(c)
            break
How'd mixed integer optimization do?
plot_clusters(clusters, cluster_centers)
Hmmm. That model took a while to solve. What gives?
Let's look at something more realistic.
Peuber is the Uber of pedicabs.
We have 7 pedicabs in our fleet and 11 customers requesting service. It's a prime day for Peuber!
pedicabs = range(7)
customers = range(11)
Generate locations for each of these.
rng = lambda: random.uniform(0, 100)
pedicab_locations = [(rng(), rng()) for _ in pedicabs]
pickup_locations = [(rng(), rng()) for _ in customers]
dropoff_locations = [(rng(), rng()) for _ in customers]
This function will help us visualize the routing problem and its solutions.
from bokeh.models import Range1d
def build_pedicabs_and_customers_plot():
    all_points = pedicab_locations + pickup_locations + dropoff_locations
    point_types = ['Pedicab'] * len(pedicabs) + \
                  ['Pickup'] * len(customers) + \
                  ['Dropoff'] * len(customers)
    plot = Scatter(
        {
            'x': [l[0] for l in all_points],
            'y': [l[1] for l in all_points],
            'type': point_types
        },
        x='x', y='y',
        color='type', marker='type',
        width=600, height=600
    )
    plot.legend.orientation = "horizontal"
    plot.x_range = plot.y_range = Range1d(-15, 115)
    return plot
So what's our problem look like right now?
plot = build_pedicabs_and_customers_plot()
# Connect pickups to dropoffs
for p, d in zip(pickup_locations, dropoff_locations):
    plot.line(
        [p[0], d[0]], [p[1], d[1]],
        line_alpha=0.5,
        line_color='grey',
        line_dash='dashed'
    )
show(plot)
Column generation is a popular technique for solving problems like this. We generate columns (i.e. binary variables) representing pre-defined routes. Here we enumerate every route we care about up front instead of pricing new ones out dynamically.
Our model minimizes overall cost while covering each customer and assigning no more than one route to each pedicab.
$$ \begin{align} \text{min} \ \ \ & \ \sum_r c_r x_r \ & \ \\ \text{s.t.} \ \ \ & \ \sum_{r \in \Omega_i} x_r = 1 & \ \forall\ i \\ & \ \sum_{r \in \Omega_j} x_r \le 1 & \ \forall\ j \\ & x_r \in \{0, 1\} \end{align} $$
Here $\Omega_i$ is the set of routes serving customer $i$, and $\Omega_j$ is the set of routes assigned to pedicab $j$. We need to track routes, their costs, and which actors they are associated with.
from collections import defaultdict
import math
l2 = lambda pt1, pt2: math.sqrt(sum((a - b)**2 for a, b in zip(pt1, pt2)))
routes = []
route_costs = {}
routes_by_pedicab = defaultdict(list)
routes_by_customer = defaultdict(list)
def add_route(pedicab, customers, locations):
    stops = [pedicab_locations[pedicab]] + list(locations)
    route_num = len(routes)
    routes.append(stops)
    # A route's cost is the total distance between consecutive stops.
    route_costs[route_num] = sum(l2(s1, s2) for s1, s2 in zip(stops, stops[1:]))
    routes_by_pedicab[pedicab].append(route_num)
    for cust in customers:
        routes_by_customer[cust].append(route_num)
Generate $\Omega$! Let's assume we're not interested in assigning more than 2 customers to any pedicab.
from itertools import permutations
pickups_dropoffs = list(zip(pickup_locations, dropoff_locations))
for cab in pedicabs:
    # Single-customer routes
    for cust in customers:
        add_route(cab, [cust], pickups_dropoffs[cust])
    # Two-customer routes: each customer must be picked up before being
    # dropped off; both orderings of the pair cover the other interleavings.
    for c1, c2 in permutations(customers, 2):
        p1, d1 = pickups_dropoffs[c1]
        p2, d2 = pickups_dropoffs[c2]
        add_route(cab, [c1, c2], [p1, p2, d1, d2])
        add_route(cab, [c1, c2], [p1, p2, d2, d1])
        add_route(cab, [c1, c2], [p1, d1, p2, d2])
len(routes)
2387
Binary variables indicate which routes to use.
x = []
for r in range(len(routes)):
    x.append(pulp.LpVariable('x%d' % r, cat=pulp.LpBinary))
The entire constraint set to the problem:
prm = pulp.LpProblem('Pedicab Routing Model', sense=pulp.LpMinimize)
# No more than one route assigned to each cab
# No more than one route assigned to each cab
for cab_routes in routes_by_pedicab.values():
    prm += sum(x[r] for r in cab_routes) <= 1
# Each customer is picked up by exactly one pedicab
for cust_routes in routes_by_customer.values():
    prm += sum(x[r] for r in cust_routes) == 1
Objective = minimize total distance.
z = pulp.LpVariable('z', lowBound=0)
prm += z == sum(route_costs[r] * x[r] for r in range(len(routes)))
prm.setObjective(z)
assert prm.solve() == pulp.LpStatusOptimal
best_total_cost = z.value()
print('total cost:', best_total_cost)
total cost: 692.04026
We need a function to plot solutions.
def plot_routes():
    plot = build_pedicabs_and_customers_plot()
    # Show only the routes the optimizer selected.
    for r, stops in enumerate(routes):
        if x[r].value() < 0.5:
            continue
        plot.line(
            [s[0] for s in stops],
            [s[1] for s in stops],
            line_alpha=0.5,
            line_color='grey',
            line_dash='dashed'
        )
    show(plot)
plot_routes()
Some of our pedicabs complain that they have to travel much farther than others. Let's fix that by minimizing the maximum trip length.
z2 = pulp.LpVariable('z2', lowBound=0)
for r in range(len(routes)):
    prm += z2 >= route_costs[r] * x[r]
prm.setObjective(z2)
assert prm.solve() == pulp.LpStatusOptimal
print('total cost:', z.value())
print('max route length:', z2.value())
total cost: 916.95713
max route length: 155.61076
plot_routes()
That can hurt our global solution a lot. What if we are willing to accept a 10% cut to the global solution for more balance among individual routes?
# Only willing to accept 10% degradation in global optimality
prm += z <= best_total_cost * 1.1
assert prm.solve() == pulp.LpStatusOptimal
print('total cost:', z.value())
print('max route length:', z2.value())
total cost: 710.47738
max route length: 167.38201
plot_routes()