#!/usr/bin/env python
# coding: utf-8

# In[1]:


get_ipython().run_line_magic('matplotlib', 'inline')


# In[2]:


import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.linear_model import Lasso, LassoCV, Ridge
from sklearn.model_selection import cross_val_predict, train_test_split

from yellowbrick.datasets import load_concrete
from yellowbrick.regressor import AlphaSelection, PredictionError, ResidualsPlot

mpl.rcParams['figure.figsize'] = (9, 6)


# # Yellowbrick - Regression Examples
#
# The Yellowbrick library is a diagnostic visualization platform for machine learning that allows data scientists to steer the model selection process. It extends the scikit-learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the scikit-learn pipeline process, providing visual diagnostics throughout the transformation of high-dimensional data.
#
# Estimator score visualizers *wrap* scikit-learn estimators and expose the Estimator API such that they have `fit()`, `predict()`, and `score()` methods that call the appropriate estimator methods under the hood. Score visualizers can wrap an estimator and be passed in as the final step in a `Pipeline` or `VisualPipeline`.
#
# In machine learning, regression models attempt to predict a target in a continuous space. Yellowbrick implements the following regressor score visualizers, which display the instances in model space to better understand how the model is making predictions:
# - `AlphaSelection`: visual tuning of regularization hyperparameters
# - `PredictionError`: plot the expected vs. the actual values in model space
# - `ResidualsPlot`: plot the difference between the expected and actual values

# #### Load Data
#
# Yellowbrick provides several datasets wrangled from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/). For the following examples, we'll use the `concrete` dataset, since it is well-suited for regression tasks.
#
# The `concrete` dataset contains 1030 instances and 9 attributes. Eight of the attributes are explanatory variables, including the age of the concrete and the materials used to create it, while the target variable `strength` is a measure of the concrete's compressive strength (MPa).

# In[3]:


# Use Yellowbrick to load the concrete dataset
X, y = load_concrete()

# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# ### Residuals Plot
#
# A residual is the difference between the observed value of the target variable (y) and the predicted value (ŷ), i.e. the error of the prediction. The `ResidualsPlot` Visualizer plots the residuals on the vertical axis against the predicted values on the horizontal axis, allowing you to detect regions within the target that may be susceptible to more or less error.
#
# If the points are randomly dispersed around the horizontal axis, a linear regression model is usually well-suited for the data; otherwise, a non-linear model is more appropriate. The following example shows a fairly random, uniform distribution of the residuals against the target in two dimensions, which suggests that our linear model is performing well.
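# Before turning to the Visualizer, it may help to see what a residual is numerically. The cell below is a minimal sketch (not part of the Yellowbrick API): it fits a plain `Ridge` model and computes the test-set residuals by hand as y - ŷ; a mean near zero is consistent with an unbiased fit.

# In[ ]:


# A minimal sketch (not Yellowbrick): compute residuals manually as observed - predicted
sanity_model = Ridge().fit(X_train, y_train)
residuals = y_test - sanity_model.predict(X_test)
print("mean residual: {:0.3f}".format(np.mean(residuals)))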
# In[4]:


# Instantiate the linear model and visualizer
model = Ridge()
visualizer = ResidualsPlot(model)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.poof()             # Draw/show/poof the data


# Yellowbrick's `ResidualsPlot` Visualizer also displays a histogram of the error values along the right-hand side. In the example above, the error is normally distributed around zero, which also generally indicates a well-fitted model. If the histogram is not desired, it can be turned off with the `hist=False` flag.

# In[5]:


# Instantiate the linear model and visualizer
model = Ridge()
visualizer = ResidualsPlot(model, hist=False)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.poof()             # Draw/show/poof the data


# ### Prediction Error Plot
#
# Yellowbrick's `PredictionError` Visualizer plots the actual targets from the dataset against the predicted values generated by the model. This allows us to see how much variance is in the model. Data scientists can diagnose regression models using this plot by comparing against the 45-degree line, where the prediction exactly matches the actual value.

# In[6]:


# Instantiate the linear model and visualizer
model = Lasso()
visualizer = PredictionError(model)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer.poof()             # Draw/show/poof the data


# ### Alpha Selection Visualizer
#
# The `AlphaSelection` Visualizer demonstrates how different values of alpha influence model selection during the regularization of linear models. Since regularization is designed to penalize model complexity, the higher the alpha, the less complex the model, decreasing the error due to variance (overfitting). However, alphas that are too high increase the error due to bias (underfitting). It is therefore important to choose an optimal alpha such that the error is minimized in both directions.
#
# To do this, you would typically use one of the "RegressionCV" models in scikit-learn. For example, instead of using the `Ridge` (L2) regularizer, use `RidgeCV` and pass a list of alphas, which will be selected based on the cross-validation score of each alpha. This visualizer wraps a "RegressionCV" model and visualizes the alpha/error curve. If the visualization shows a jagged or random plot, the model may not be sensitive to that type of regularization and another may be required (e.g. L1 or Lasso regularization).

# In[7]:


# Create a list of alphas to cross-validate against
alphas = np.logspace(-10, 1, 400)

# Instantiate the linear model and visualizer
model = LassoCV(alphas=alphas)
visualizer = AlphaSelection(model)

visualizer.fit(X, y)   # Fit the data to the visualizer
g = visualizer.poof()  # Draw/show/poof the data
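# Because `AlphaSelection` fits the wrapped estimator, the alpha chosen by `LassoCV` can be read back afterwards. The cell below is a minimal sketch (the exact values depend on the cross-validation folds): it reports the selected alpha, refits a plain `Lasso` at that value, and uses `cross_val_predict` (imported above) to estimate out-of-sample performance.

# In[ ]:


from sklearn.metrics import r2_score

# Read back the alpha selected by cross-validation (a sketch; the value depends on the CV folds)
print("selected alpha: {:0.4f}".format(model.alpha_))

# Refit a plain Lasso at the selected alpha and estimate generalization with cross_val_predict
predicted = cross_val_predict(Lasso(alpha=model.alpha_), X, y, cv=5)
print("cross-validated R^2: {:0.3f}".format(r2_score(y, predicted)))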