from trustyai.utils import TestModels
from trustyai.model import feature, output, Model, Dataset
from trustyai.model import simple_prediction
from trustyai.explainers import SHAPExplainer
import pandas as pd
import random
%pip install -r ../requirements-examples.txt --quiet
Note: you may need to restart the kernel to use updated packages.
Now let's go over how to use a Python model with TrustyAI. First, let's grab a dataset: we'll use the California Housing dataset from sklearn, where the task is to predict the median house value of various California housing districts given a number of different attributes of each district.
After downloading the dataset, we split it into train and test sets.
from sklearn import datasets
from sklearn.model_selection import train_test_split
X, y = datasets.fetch_california_housing(data_home="data", return_X_y=True, as_frame=True)
y = pd.DataFrame(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8)
print(f"X Train: {X_train.shape}, X Test: {X_test.shape}, Y Train: {y_train.shape}, Y Test: {y_test.shape}")
X Train: (16512, 8), X Test: (4128, 8), Y Train: (16512, 1), Y Test: (4128, 1)
Now let's grab our model, just a simple xgboost regressor. We'll then plot its test predictions against the true test labels to see how well it does.
import xgboost
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('../styles/material_rh.mplstyle')
# uncomment to train from scratch
# xgb_model = xgboost.XGBRegressor(objective='reg:squarederror', tree_method='gpu_hist')
# xgb_model.fit(X_train, y_train)
# print('Test MSE', xgb_model.score(X_test, y_test))
# xgb_model.save_model("models/california_xgboost")
# load and score model
xgb_model = xgboost.XGBRegressor(objective='reg:squarederror')
xgb_model.load_model("models/california_xgboost")
print('Test MSE', xgb_model.score(X_test, y_test))
# grab predictions and find largest error
predictions = xgb_model.predict(X_test)
worst = np.argmax(np.abs(predictions - y_test['MedHouseVal'].values))
# plot predictions
plt.scatter(y_test, predictions)
plt.scatter(y_test['MedHouseVal'].iloc[worst], predictions[worst], color='r')
plt.plot([0,5], [0,5], color='k')
plt.xlabel("True Value")
plt.ylabel("Predicted Value")
plt.title("XGBoost Predictions, California Housing")
plt.show()
Test MSE 0.9280984757003108
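That's pretty decent! (Note that score returns the coefficient of determination, R², rather than the MSE, so the value printed above is an R² of about 0.93 despite its label.) Let's grab a point to explain; we'll choose the really erroneous point marked in red in the plot above.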
point_to_explain = X_test.iloc[worst]
point_to_explain
MedInc 3.729200 HouseAge 6.000000 AveRooms 4.583333 AveBedrms 1.083333 Population 69.000000 AveOccup 2.875000 Latitude 37.800000 Longitude -121.290000 Name: 16556, dtype: float64
We'll need to convert it into a Prediction object in order to pass it to the SHAP Explainer. Notice that we index our dataframes by .iloc[worst:worst+1]. This is because we need to pass single-row dataframes to preserve the type information of each column. If we used .iloc[worst], we'd retrieve a pd.Series object, which removes the type information from individual values.
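To make the difference concrete, here is a small sketch on a toy dataframe (the column names and values are made up for illustration and are not part of the housing data). It shows how selecting a single row by position collapses mixed column types into one dtype, while a one-row slice keeps each column's own dtype:

```python
import pandas as pd

# Toy frame with one integer column and one float column (illustrative only).
toy = pd.DataFrame({"rooms": [3, 4], "income": [2.5, 3.1]})

# Positional row selection returns a Series, which holds a single dtype,
# so the integer value is upcast to float.
as_series = toy.iloc[0]

# A one-row slice returns a DataFrame, so each column keeps its own dtype.
as_frame = toy.iloc[0:1]

print(type(as_series).__name__, as_series.dtype)
print(type(as_frame).__name__, dict(as_frame.dtypes))
```

The same reasoning applies to X_test.iloc[worst:worst+1] in this notebook: the slice keeps the per-column schema intact.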
Now we can wrap our model into a TrustyAI PredictionProvider. We do this via an ArrowModel, which greatly speeds up the data transfer between Python and the TrustyAI Java library. To create an ArrowModel, we need to pass it a function that accepts a Pandas DataFrame as input and outputs a Pandas DataFrame or Numpy array. All sklearn models satisfy this with their predict or predict_proba functions, so this is really easy to do.
We then call the get_as_prediction_provider function on the ArrowModel, to which we pass an example datapoint to use as a template for our data conversions. Make sure this template point has the same schema (i.e., feature names and types) as all the other points you plan on passing to the model!
from trustyai.model import Model
trustyai_model = Model(xgb_model.predict, dataframe_input=True, output_names=['MedHouseVal'])
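The function handed to the wrapper above takes a DataFrame and returns an array of predictions. As a sketch of that contract only, here is a hand-written function with the same shape. The names toy_predict and WEIGHTS are hypothetical stand-ins, and nothing in this block calls TrustyAI:

```python
import numpy as np
import pandas as pd

# Hypothetical weights for a toy linear model; not part of the notebook.
WEIGHTS = np.array([0.5, -0.25])


def toy_predict(inputs: pd.DataFrame) -> np.ndarray:
    """Accept a DataFrame of features and return one prediction per row."""
    return inputs.to_numpy(dtype=float) @ WEIGHTS


# Two example rows with two features each.
batch = pd.DataFrame({"a": [1.0, 2.0], "b": [4.0, 0.0]})
preds = toy_predict(batch)
print(preds)
```

Assuming the wrapper only relies on this DataFrame-in, array-out behaviour, a function like this could presumably be passed in place of xgb_model.predict.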
With our model successfully wrapped, we can create our SHAP explainer. To do this we need to specify a background dataset, a small $(\le100)$ set of representative examples of the model's input. We'll use the first 100 training points as our background dataset.
from trustyai.explainers import SHAPExplainer
explainer = SHAPExplainer(background=X_train[:100])
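Because train_test_split shuffles by default, the first 100 training rows are already a random subset. If you would rather make that choice explicit and reproducible, one alternative (not what this notebook does) is to draw the background with a seeded sample. The sketch below uses a synthetic dataframe, toy_train, as a stand-in for X_train:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_train (illustrative only).
rng = np.random.default_rng(0)
toy_train = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])

# Option used in the notebook: take the first 100 rows.
first_100 = toy_train[:100]

# Alternative: an explicit, seeded random sample of 100 rows.
sampled_100 = toy_train.sample(n=100, random_state=0)

# Compare per-feature means of the two candidate backgrounds.
print(first_100.mean())
print(sampled_100.mean())
```

Either dataframe has the same columns as the training data, so it could be passed as the background argument in the same way.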
We can now produce our explanation (this will print some Java log messages; don't worry about these):
explanations = explainer.explain(inputs=X_test.iloc[worst:worst+1],
                                 outputs=y_test.iloc[worst:worst+1],
                                 model=trustyai_model)
[main] INFO org.apache.arrow.memory.BaseAllocator - Debug mode disabled. [main] INFO org.apache.arrow.memory.DefaultAllocationManagerOption - allocation manager type not specified, using netty as the default type [main] INFO org.apache.arrow.memory.CheckAllocator - Using DefaultAllocationManager at memory/DefaultAllocationManagerFactory.class
Now let's visualize the explanation, first as a dataframe:
explanations.as_html()['MedHouseVal']
| | Feature | Value | Mean Background Value | SHAP Value | Confidence |
|---|---|---|---|---|---|
| 0 | Background | nan | nan | 2.156897 | nan |
| 1 | MedInc | 3.729200 | 4.085421 | 0.264785 | 0.498505 |
| 2 | HouseAge | 6.000000 | 28.610000 | 0.445050 | 0.498505 |
| 3 | AveRooms | 4.583333 | 5.542214 | 0.304172 | 0.498505 |
| 4 | AveBedrms | 1.083333 | 1.115171 | 0.433965 | 0.498505 |
| 5 | Population | 69.000000 | 1583.910000 | 0.522398 | 0.498505 |
| 6 | AveOccup | 2.875000 | 2.805248 | 0.427528 | 0.498505 |
| 7 | Latitude | 37.800000 | 35.448800 | -0.497719 | 0.498505 |
| 8 | Longitude | -121.290000 | -119.392500 | 0.692924 | 1.318919 |
Feature values in red/green indicate a lower/higher value than the average background value of that feature. SHAP values in red/green indicate a negative/positive contribution to the prediction.
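A useful sanity check on any SHAP explanation is additivity: the background value plus the sum of the per-feature SHAP values should reconstruct the output value being explained. The sketch below does that arithmetic with the numbers copied from the table above; the variable names are chosen here for illustration.

```python
# SHAP Value column of the "Background" row in the table above.
background_value = 2.156897

# Per-feature SHAP values copied from the table above.
shap_values = {
    "MedInc": 0.264785,
    "HouseAge": 0.445050,
    "AveRooms": 0.304172,
    "AveBedrms": 0.433965,
    "Population": 0.522398,
    "AveOccup": 0.427528,
    "Latitude": -0.497719,
    "Longitude": 0.692924,
}

# Additivity: background value + sum of contributions = explained output.
reconstructed = background_value + sum(shap_values.values())
print(f"{reconstructed:.6f}")  # 4.750000
```

Here the contributions add up to 4.75, so every feature except Latitude pushes the output for this district above the background value of about 2.16.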
Now let's visualize the explanation as a candlestick plot:
explanations.plot()