Earlier, we used regression to find an intercept and slope (for one feature) to predict a label. The equation was:

$$f(x) = \beta_0 + \beta_1 x$$Then, we discussed multiple regression: finding an intercept and several slopes (for several corresponding features) to predict a label. The equation was:

$$f(x_1, x_2, \dots, x_d) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_d x_d$$A related problem is using one feature to fit a **polynomial** instead of a linear function. The equation will look like:

The goal is to find coefficients $\beta$ to match our observations $f(x)$ given the features $x, x^2, \ldots, x^d$.

Let's create these observations:

In [1]:

```
import numpy as np
np.random.seed(42)
```

In [2]:

```
def make_observations(x, sigma=0.1):
noise = sigma * np.random.randn(len(x))
y = 1 - np.exp(-x**2) + noise
return y
n = 15
x = np.linspace(-2, 2, num=n)
y = make_observations(x)
```

Let's visualize the data:

In [3]:

```
import pandas as pd
df = pd.DataFrame({"x": x, "y": y})
df.plot(x="x", y="y", style="o", legend=False);
```

**Our goal: predict the y values accurately given the input x.**

To fit this polynomial, we will need polynomial features:

In [4]:

```
def polynomial_features(x, max_degree=3):
poly_df = pd.DataFrame()
for i in range(max_degree + 1):
poly_df[i] = x ** i
return poly_df
```

In [5]:

```
d = 3
X_train = polynomial_features(x, max_degree=d)
y_train = y
print(x.min())
X_train.head()
```

-2.0

Out[5]:

0 | 1 | 2 | 3 | |
---|---|---|---|---|

0 | 1.0 | -2.000000 | 4.000000 | -8.000000 |

1 | 1.0 | -1.714286 | 2.938776 | -5.037901 |

2 | 1.0 | -1.428571 | 2.040816 | -2.915452 |

3 | 1.0 | -1.142857 | 1.306122 | -1.492711 |

4 | 1.0 | -0.857143 | 0.734694 | -0.629738 |

In [6]:

```
X_train.index = X_train[1]
ax = X_train.plot(style="o-", title="features for different polynomial degrees");
ax.set_xlabel("x");
```

Remember: $x, x^2, x^3$ are the features. We are trying to find $\beta_0, \ldots, \beta_3$ to make

$$f(x) = \beta_0 + \beta_1x + \beta_2 x^2 + \beta_3x^3$$*for every $x$.*

This is the same formulation as multiple linear regression. Recall that for multiple linear regression,

$$f(x_1, x_2, \dots, x_d) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_d x_d$$Past feature names, these formulations are identical.

In [7]:

```
from sklearn.linear_model import LinearRegression
# note: we don't need to use an "intercept" term because this is the same thing as using a
# polynomial of degree zero, which we have included in our list of features!
model = LinearRegression(fit_intercept=False)
model.fit(X_train, y_train)
```

Out[7]:

LinearRegression(fit_intercept=False)

What did this model do? It found coefficients so that

$$y \approx \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$$The data looks pretty quadratic; how large is $\beta_2$ relative $\beta_3$?

In [8]:

```
model.coef_
```

Out[8]:

array([ 0.27914314, -0.03092637, 0.20295325, -0.00578903])

The best way of seeing that is by showing the predicted value over the train set:

In [9]:

```
import matplotlib.pyplot as plt
def fit_plot(x_train, y_train, degree=3, ax=None):
X_train = polynomial_features(x_train, max_degree=degree)
x_test = np.linspace(x_train.min(), x_train.max(), num=1000)
X_test = polynomial_features(x_test, max_degree=degree)
model = LinearRegression(fit_intercept=False)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
df_test = pd.DataFrame({"x": x_test, "y_pred": y_pred})
ax = df.plot(x="x", y="y", style="o", ax=ax)
ax = df_test.plot(x="x", y="y_pred", ax=ax)
ax.legend();
```

In [10]:

```
fit_plot(x, y, degree=3)
```

In [11]:

```
import matplotlib.pyplot as plt
fig, axs = plt.subplots(nrows=3, ncols=4, figsize=(16,10))
for d, ax in enumerate(axs.flat[:]):
fit_plot(x, y, degree=d, ax=ax)
ax.set_title("degree = " + str(d))
fig.tight_layout()
```

We will use k-fold cross-validation to select the best degree to use

In [12]:

```
from sklearn.model_selection import cross_val_score
y_train = y
for d in range(10):
X_train = polynomial_features(x, max_degree=d)
model = LinearRegression(fit_intercept=False)
model_scores = cross_val_score(
model,
X_train, y_train,
cv=5, scoring='neg_mean_squared_error'
)
print("degree = " + str(d) + ". Mean residual sum of squares (RSS) =",
-model_scores.mean().round(2))
```

We can get a little more complicated and actually visualize those results...

In [13]:

```
from sklearn.metrics import mean_squared_error
y_train = y
scores = []
for d in range(10):
X_train = polynomial_features(x, max_degree=d)
model = LinearRegression(fit_intercept=False)
model_scores = cross_val_score(
model,
X_train, y_train,
cv=5, scoring='neg_mean_squared_error'
)
cross_val_rss = -model_scores.mean()
y_train_pred = model.fit(X_train, y_train).predict(X_train)
train_rss = mean_squared_error(y_train_pred, y_train)
scores.append({
"cross_val_rss": -model_scores.mean(),
"train_rss": train_rss,
"degree": d,
})
```

In [14]:

```
show = pd.DataFrame(scores)
ax = show.plot(x="degree", y=["cross_val_rss", "train_rss"], style="o-")
ax.set_ylim(0, 1)
```

Out[14]:

(0.0, 1.0)

Let's test the goal of machine learning: **to predict well on unseen data.** We only trained on input in a certain range and validated off that input too.

But how well does it perform on a completely new value? Let's test on unseen data, or polynomial inputs in a *slightly* broader range.

In [15]:

```
x_train = np.linspace(-2, 2, num=100)
y_train = make_observations(x_train)
x_test = np.linspace(-4, 4, num=100)
y_test = make_observations(x_test)
```

`d = 2`

because cross-validation *on the train data* says `d = 2`

is best.

In [16]:

```
d = 2 # because the best value found with cross validation.
X_test = polynomial_features(x_test, max_degree=d)
X_train = polynomial_features(x_train, max_degree=d)
```

Fit a polynomial with degree 2 to the train data (because degree=2 is the best value found above with cross validation).

Test this polynomial on the provided test data.

How well does it predict for $x$ values in [-4, -3] and [3, 4]?

- The prediction is relatively close to the testing data in [-4, -3] and [3, 4].
- It does not predict well in [-4, -3] and [3, 4]. The prediction increases while the testing data stays flat.
- It does not predict well in [-4, -3] and [3, 4]. The prediction decreases while the testing data stays increases.

In [ ]:

```
# Your code here
```

In [ ]:

```
# Define y_pred above
show = pd.DataFrame({"y_pred": y_pred, "x_test": x_test, "y_test": y_test})
ax = show.plot(x="x_test", y="y_test", style="o")
show.plot(x="x_test", y="y_pred", style="-", ax=ax)
```

*Gold star question:* How does this plot change as the degree is varied?

*Gold star question:* How do the answers below change when the degree `d`

is changed above? How does the test score change relative to the train score *on this more general test dataset*?

In [ ]:

```
from sklearn.metrics import mean_squared_error
```

In [ ]:

```
# Your code here
```

What is the residual sum of squares (RSS) for the test data? Use the setup above; the degree, train data and test data will remain the same.

In [ ]:

```
from sklearn.metrics import mean_squared_error
# Your code here
```

In [ ]:

```
```