This notebook recreates a 2016 paper by Ward et al. on predicting the formation enthalpy of materials based on their composition. We will use the Materials Data Facility to retrieve a training set from the the OQMD, compute features based on the composition of each entry, and then train a random forest model.
This example was last updated on 06/07/2021 for Matminer v.0.7.0
%matplotlib inline
from matminer.data_retrieval import retrieve_MDF
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers import composition as cf
from matminer.featurizers.conversions import StrToComposition
from matplotlib import pyplot as plt
from matplotlib.colors import LogNorm
import numpy as np
import pandas as pd
import pickle as pkl
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV, ShuffleSplit, KFold
Settings to change
quick_demo = True # Whether to run an faster version of this demo.
# The full OQMD model takes about a hour to test and ~8GB of RAM
We first create a Forge
instance, which simplifies performing search queries against the MDF.
The first step is to create a tool for reading from the MDF's search index.
mdf = retrieve_MDF.MDFDataRetrieval(anonymous=True)
Then, we assemble a query that gets only the converged static calculations from the OQMD.
query_string = 'mdf.source_name:oqmd AND (oqmd.configuration:static OR '\
'oqmd.configuration:standard) AND dft.converged:True'
if quick_demo:
query_string += " AND mdf.scroll_id:<10000"
data = mdf.get_data(query_string, unwind_arrays=False)
This tool creates a DataFrame object with the metadata for each entry in the OQMD
data.head(2)
crystal_structure.cross_reference.icsd | crystal_structure.number_of_atoms | crystal_structure.space_group_number | crystal_structure.volume | dft.converged | dft.cutoff_energy | dft.exchange_correlation_functional | files | material.composition | material.elements | ... | oqmd.delta_e.units | oqmd.delta_e.value | oqmd.magnetic_moment.units | oqmd.magnetic_moment.value | oqmd.stability.units | oqmd.stability.value | oqmd.total_energy.units | oqmd.total_energy.value | oqmd.volume_pa.units | oqmd.volume_pa.value | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 90433.0 | 12 | 62 | 185.154 | True | 520.0 | PBE | [{'data_type': 'ASCII text, with very long lin... | Nb1Pt1Si1 | [Nb, Pt, Si] | ... | eV/atom | -0.805020 | bohr/atom | -0.000119 | eV/atom | -0.105391 | eV/atom | -7.996541 | angstrom^3/atom | 15.4295 |
1 | 639016.0 | 3 | 139 | 59.293 | True | 520.0 | PBE | [{'data_type': 'ASCII text, with very long lin... | Hf2Zn1 | [Hf, Zn] | ... | eV/atom | -0.173969 | bohr/atom | -0.000561 | eV/atom | -0.042780 | eV/atom | -7.232890 | angstrom^3/atom | 19.7643 |
2 rows × 29 columns
We only need two columns: delta_e
and material.composition
data = data[['oqmd.delta_e.value', 'material.composition']]
Renaming the columns to make the rest of the code more succinct
data = data.rename(columns={'oqmd.delta_e.value': 'delta_e', 'material.composition':'composition'})
Our next step is to get only the lowest-energy entry for each composition.
data = StrToComposition(target_col_id='composition_obj').featurize_dataframe(data, 'composition')
StrToComposition: 0%| | 0/4849 [00:00<?, ?it/s]
Create shortcuts for our input and output columns
Remove compounds w/o a delta_e
measurement.
for k in ['delta_e']:
data[k] = pd.to_numeric(data[k])
original_count = len(data)
data = data[~ data['delta_e'].isnull()]
print('Removed %d/%d entries'%(original_count - len(data), original_count))
Removed 409/4849 entries
Get only the groundstate and each composition
%%time
original_count = len(data)
data['composition'] = data['composition_obj'].apply(lambda x: x.reduced_formula)
data.sort_values('delta_e', ascending=True, inplace=True)
data.drop_duplicates('composition', keep='first', inplace=True)
print('Removed %d/%d entries'%(original_count - len(data), original_count))
Removed 24/4440 entries CPU times: user 356 ms, sys: 6.76 ms, total: 363 ms Wall time: 378 ms
Remove outliers
original_count = len(data)
data = data[np.logical_and(data['delta_e'] >= -20, data['delta_e'] <= 5)]
print('Removed %d/%d entries'%(original_count - len(data), original_count))
Removed 1/4416 entries
In this part of the notebook, we build a ML model using scikit-learn and evaluate its performance using cross-validation.
The first step in building a ML model is to convert the raw materials data (here: the composition) into the required input for an ML model: a finite list of quantitative attributes. In this example, we use the "general-purpose" attributes of Ward et al 2016.
feature_calculators = MultipleFeaturizer([cf.Stoichiometry(), cf.ElementProperty.from_preset("magpie"),
cf.ValenceOrbital(props=['avg']), cf.IonProperty(fast=True)])
Get the feature names
feature_labels = feature_calculators.feature_labels()
Compute the features
%%time
data = feature_calculators.featurize_dataframe(data, col_id='composition_obj');
MultipleFeaturizer: 0%| | 0/4415 [00:00<?, ?it/s]
CPU times: user 641 ms, sys: 155 ms, total: 796 ms Wall time: 24.3 s
print('Generated %d features'%len(feature_labels))
print('Training set size:', 'x'.join([str(x) for x in data[feature_labels].shape]))
Generated 145 features Training set size: 4415x145
Remove entries with NaN
or infinite
features
original_count = len(data)
data = data[~ data[feature_labels].isnull().any(axis=1)]
print('Removed %d/%d entries'%(original_count - len(data), original_count))
Removed 0/4415 entries
For brevity, we will only consider one ML algorithm in this example: random forest. The "random forest" algorithm works by training many different decision tree models, where each is trained on a different subset of the dataset . Here, we tune one of the major parameters of the algorithm: the number features considered at each split in each decision tree
model = GridSearchCV(RandomForestRegressor(n_estimators=20 if quick_demo else 150, n_jobs=-1),
param_grid=dict(max_features=range(8,15)),
scoring='neg_mean_squared_error',cv=ShuffleSplit(n_splits=1, test_size=0.1))
model.fit(data[feature_labels], data['delta_e'])
GridSearchCV(cv=ShuffleSplit(n_splits=1, random_state=None, test_size=0.1, train_size=None), estimator=RandomForestRegressor(n_estimators=20, n_jobs=-1), param_grid={'max_features': range(8, 15)}, scoring='neg_mean_squared_error')
Plot the tuning results. This shows the CV score as a function of the parameter we tuned "max features"
model.best_score_
-0.07393843983358246
fig, ax = plt.subplots()
# Plot the score as a function of alpha
ax.scatter(model.cv_results_['param_max_features'].data,
np.sqrt(-1 * model.cv_results_['mean_test_score']))
ax.scatter([model.best_params_['max_features']], np.sqrt([-1*model.best_score_]), marker='o', color='r', s=40)
ax.set_xlabel('Max. Features')
ax.set_ylabel('RMSE (eV/atom)')
Text(0, 0.5, 'RMSE (eV/atom)')
Save our best model
model = model.best_estimator_
Quantify the performance of this model using 10-fold cross-validation
cv_prediction = cross_val_predict(model, data[feature_labels], data['delta_e'], cv=KFold(10, shuffle=True))
Compute aggregate statistics
for scorer in ['r2_score', 'mean_absolute_error', 'mean_squared_error']:
score = getattr(metrics,scorer)(data['delta_e'], cv_prediction)
print(scorer, score)
r2_score 0.8229782353297522 mean_absolute_error 0.19136257807725038 mean_squared_error 0.08679513226277784
model
RandomForestRegressor(max_features=14, n_estimators=20, n_jobs=-1)
Plot the individual predictions
fig, ax = plt.subplots()
ax.hist2d(pd.to_numeric(data['delta_e']), cv_prediction, norm=LogNorm(), bins=64, cmap='Blues', alpha=0.9)
ax.set_xlim(ax.get_ylim())
ax.set_ylim(ax.get_xlim())
mae = metrics.mean_absolute_error(data['delta_e'], cv_prediction)
r2 = metrics.r2_score(data['delta_e'], cv_prediction)
ax.text(0.5, 0.1, 'MAE: {:.2f} eV/atom\n$R^2$: {:.2f}'.format(mae, r2),
transform=ax.transAxes,
bbox={'facecolor': 'w', 'edgecolor': 'k'})
ax.plot(ax.get_xlim(), ax.get_xlim(), 'k--')
ax.set_xlabel('DFT $\Delta H_f$ (eV/atom)')
ax.set_ylabel('ML $\Delta H_f$ (eV/atom)')
fig.set_size_inches(3, 3)
fig.tight_layout()
fig.savefig('oqmd_cv.png', dpi=320)