In this notebook, we will recreate some key results from Ling et al. IMMI (2017), which studied the application of random-forest-based uncertainties to materials design. We will show that the errors produced from the Random Forest implemented in lolo (the code used by Ling et al.) are well-calibrated and that the uncertainties can be used with Sequential Learning to quickly find optimal materials within a search space.
Note: This notebook will require you to install lolopy and establish an account with Citrination to get an an API key (see Quickstart), and set it as an environment variable named CITRINE_KEY. Also, the uncertainity calculations do not currently function on Windows.
Last used with matminer version 0.4.5.
%matplotlib inline
from matplotlib import pyplot as plt
from matminer.data_retrieval.retrieve_Citrine import CitrineDataRetrieval
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers import composition as cf
from lolopy.learners import RandomForestRegressor
from sklearn.model_selection import KFold
from pymatgen import Composition
from scipy.stats import norm
import pandas as pd
import numpy as np
import os
Set the random seed
np.random.seed(8)
The Ling Paper used 4 different datasets to test the uncertainty estimates
cdr = CitrineDataRetrieval()
data = cdr.get_dataframe(criteria={'data_set_id': 150888}, print_properties_options=False)
100%|██████████| 165/165 [00:03<00:00, 46.07it/s]
Convert the composition and class variable from strings
data = StrToComposition(target_col_id='composition').featurize_dataframe(data, "chemicalFormula")
Failed to display Jupyter Widget of type HBox
.
If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean that the widgets JavaScript is still loading. If this message persists, it likely means that the widgets JavaScript library is either not installed or not enabled. See the Jupyter Widgets Documentation for setup instructions.
If you're reading this message in another frontend (for example, a static rendering on GitHub or NBViewer), it may mean that your frontend doesn't currently support widgets.
data['ZT'] = pd.to_numeric(data['ZT'], errors='coerce')
data.reset_index(drop=True, inplace=True)
Every dataset except the steel fatigue dataset uses the composition-based features of Ward et al..
f = MultipleFeaturizer([cf.Stoichiometry(), cf.ElementProperty.from_preset("magpie"),
cf.ValenceOrbital(props=['avg']), cf.IonProperty(fast=True)])
X = np.array(f.featurize_many(data['composition']))
Failed to display Jupyter Widget of type HBox
.
If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean that the widgets JavaScript is still loading. If this message persists, it likely means that the widgets JavaScript library is either not installed or not enabled. See the Jupyter Widgets Documentation for setup instructions.
If you're reading this message in another frontend (for example, a static rendering on GitHub or NBViewer), it may mean that your frontend doesn't currently support widgets.
As described in the Ling paper, ideally-calibrated uncertainty estimaes should have a particular relationship with the errors of a machine learning model. Specifically, the distribution of $r(x)/\sigma(x)$ where $r(x)$ is the residual of the prediction and $\sigma(x)$ is the uncertainty of the prediction for x should have a Gaussian distribution with zero mean and unit standard deviation.
model = RandomForestRegressor()
Get the errors from 8-fold cross-validation
y = data['ZT'].values
y_resid = []
y_uncer = []
for train_id, test_id in KFold(8, shuffle=True).split(X):
model.fit(X[train_id], y[train_id])
yf_pred, yf_std = model.predict(X[test_id], return_std=True)
y_resid.extend(yf_pred - y[test_id])
y_uncer.extend(yf_std)
Plot the normalized residuals ($r(x)/\sigma(x)$) against the normal distribution
fig, axs = plt.subplots(1, 2, sharey=True)
x = np.linspace(-8, 8, 50)
# Plot the RF uncertainty
resid = np.divide(y_resid, y_uncer)
axs[0].hist(resid, x, density=True)
axs[0].set_title('With Lolo Uncertainty Estimates')
# Plot assuming constant errors
resid = np.divide(y_resid, np.sqrt(np.power(y_resid, 2).mean()))
axs[1].hist(resid, x, density=True)
axs[1].set_title('Assuming Constant Error')
for ax in axs:
ax.plot(x, norm.pdf(x), 'k--', lw=0.75)
ax.set_xlabel('Normalized Residual')
axs[0].set_ylabel('Probability Density')
fig.set_size_inches(6.5, 2)
fig.tight_layout()
Here, we compare the error distribution using the Lolo uncertainty estimates (left) and the assumption that all entries have the same error (right). The normalized residuals for the uncertainty estimates have a distribution closer to the unit normal distribution, which means - as expected - that it better captures which predictions will have a higher error.
One important use of model uncertainties is to employ them to guide which experiments to pick to find optimal materials with minimal experiments/computations. As described in the Ling paper (and other nice articles), it is not always best to pick the experiment that the model predicts to have the best properties if you can perform more than one experiment sequentially. Rather, it can be better to pick entries with large uncertainities that, when tested and added to the training set, can improve the models predictions for the next experiments.
Here, we demonstrate one approach for picking experiments: Maximum Liklihood of Improvement (MLI). In contrast to picking the material with the best predicted properties (an approach we refer to Maximum Expected Improvment (MEI)), the MLI approach pickes the material with with the highest liklihood of being better than the best material in the training set - a measure that uses both the predicted value and the uncertainty. The MLI method is equivalent to the Expected Improvement metric common in Bayesian Optimization, and balances a tradeoff between picking entries with favorable predictions (exploitation) and those with high uncertainties (exploration).
We'll start with a small set of entries from the training set
in_train = np.zeros(len(data), dtype=np.bool)
in_train[np.random.choice(len(data), 10, replace=False)] = True
print('Picked {} training entries'.format(in_train.sum()))
Picked 10 training entries
assert not np.isclose(max(y), max(y[in_train]))
Just to give a visual of how the selection process works
Make the predictions
model.fit(X[in_train], y[in_train])
y_pred, y_std = model.predict(X[~in_train], return_std=True)
For MEI, we pick the highest predicted value. For MLI, we pick the material that has the highest probability of being better than any material in the training set. As we assume the predictions to be normally distributed, the probability of materials can be computed from the Z-score $Z = (y - y^*)/\sigma$ where $y^*$ is the maximum of the $y$ of the training set. Formally, the probability can be computed from the Z-score using the cumulative distribution function of the normal distribution. For our purposes, we can use the Z-score becuase the probability is a monotonic function of the Z-score (stated simply: the material with the highest probability will have the highest Z-score).
mei_selection = np.argmax(y_pred)
mli_selection = np.argmax(np.divide(y_pred - np.max(y[in_train]), y_std))
print('Predicted ZT of material #{} selected based on MEI: {:.2f} +/- {:.2f}'.format(mei_selection, y_pred[mei_selection], y_std[mei_selection]))
print('Predicted ZT of material #{} selected based on MLI: {:.2f} +/- {:.2f}'.format(mli_selection, y_pred[mli_selection], y_std[mli_selection]))
Predicted ZT of material #13 selected based on MEI: 0.14 +/- 0.08 Predicted ZT of material #1 selected based on MLI: 0.12 +/- 0.12
For this particular iteration, the MEI and MLI strategies pick the same material. Depending on the random seed of this notebook and that used by lolo, you may see that the material picked by MLI has a lower predicted $ZT$ but a higher variance. According to the logic behind MLI, picking that entry will (1) yield a higher liklihood of finding a well-performing material and (2) lead to an improved model.
Starting with the same 32 materials in the training set, we will iteratively pick materials, add them to the training set, and retrain the model using 3 different strategies for picking entries: MEI, MLI, and randomly.
n_steps = 32
all_inds = set(range(len(y)))
Just pick an entry at random, no need to train a model
random_train = [set(np.where(in_train)[0].tolist())]
for i in range(n_steps):
# Get the current train set and search space
train_inds = set(random_train[-1]) # Last iteration
search_inds = sorted(all_inds.difference(train_inds))
# Pick an entry at random
train_inds.add(np.random.choice(search_inds))
# Add it to the list of training sets
random_train.append(train_inds)
Pick the entry with the largest predicted value
mei_train = [set(np.where(in_train)[0].tolist())]
for i in range(n_steps):
# Get the current train set and search space
train_inds = sorted(set(mei_train[-1])) # Last iteration
search_inds = sorted(all_inds.difference(train_inds))
# Pick entry with the largest maximum value
model.fit(X[train_inds], y[train_inds])
y_pred = model.predict(X[search_inds])
train_inds.append(search_inds[np.argmax(y_pred)])
# Add it to the list of training sets
mei_train.append(set(train_inds))
Pick the entry with the largest probability of improvement
mli_train = [set(np.where(in_train)[0].tolist())]
for i in range(n_steps):
# Get the current train set and search space
train_inds = sorted(set(mei_train[-1])) # Last iteration
search_inds = sorted(all_inds.difference(train_inds))
# Pick entry with the largest maximum value
model.fit(X[train_inds], y[train_inds])
y_pred, y_std = model.predict(X[search_inds], return_std=True)
train_inds.append(search_inds[np.argmax(np.divide(y_pred - np.max(y[train_inds]), y_std))])
# Add it to the list of training sets
mli_train.append(set(train_inds))
Plot the results
fig, ax = plt.subplots()
for train_inds, label in zip([random_train, mei_train, mli_train], ['Random', 'MEI', 'MLI']):
ax.plot(np.arange(len(train_inds)), [max(y[list(t)]) for t in train_inds], label=label)
ax.set_xlabel('Number of New Experiments')
ax.set_ylabel('Best $ZT$ Found')
fig.set_size_inches(3.5, 2)
ax.legend()
fig.tight_layout()
For this particular case, we find that the MLI strategy finds the best material more quickly than the Random or MEI approaches. In Ling 2017, they evaluate the performance of these strategies over many iterations and find that, on average, MLI finds the optimal materials as fast or better than any other approach.