This notebook demonstrates some of the basic features of matminer.
This notebook was last updated 06/07/21 for version 0.7.0 of matminer and version 0.0.2 of figrecipes (which you can download by cloning the figrecipes source repo).
Note that in order to get the in-line plotting to work, you might need to start Jupyter notebook with a higher data rate, e.g., jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10. We recommend you do this before starting.
In this notebook, we will:

- load and examine an example data set using pandas
- add composition- and structure-based descriptors ("features") to the data with matminer
- fit and cross-validate machine learning models (a linear regression and a random forest) that predict bulk modulus
- visualize the results with figrecipes
Matminer comes pre-loaded with several example data sets you can use. Below, we'll load a data set of computed elastic properties of materials, sourced from the paper: "Charting the complete elastic properties of inorganic crystalline compounds", M. de Jong et al., Sci. Data 2 (2015) 150009.
from matminer.datasets.convenience_loaders import load_elastic_tensor
df = load_elastic_tensor() # loads dataset in a pandas DataFrame object
The data set comes as a pandas DataFrame, which is a kind of "spreadsheet" object in Python. DataFrames have several useful methods you can use to explore and clean the data, some of which we'll explore below.
df.head()
material_id | formula | nsites | space_group | volume | structure | elastic_anisotropy | G_Reuss | G_VRH | G_Voigt | K_Reuss | K_VRH | K_Voigt | poisson_ratio | compliance_tensor | elastic_tensor | elastic_tensor_original | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | mp-10003 | Nb4CoSi | 12 | 124 | 194.419802 | [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... | 0.030688 | 96.844535 | 97.141604 | 97.438674 | 194.267623 | 194.268884 | 194.270146 | 0.285701 | [[0.004385293093993, -0.0016070693558990002, -... | [[311.33514638650246, 144.45092552856926, 126.... | [[311.33514638650246, 144.45092552856926, 126.... |
1 | mp-10010 | Al(CoSi)2 | 5 | 164 | 61.987320 | [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... | 0.266910 | 93.939650 | 96.252006 | 98.564362 | 173.647763 | 175.449907 | 177.252050 | 0.268105 | [[0.0037715428949660003, -0.000844229828709, -... | [[306.93357350984974, 88.02634955100905, 105.6... | [[306.93357350984974, 88.02634955100905, 105.6... |
2 | mp-10015 | SiOs | 2 | 221 | 25.952539 | [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] | 0.756489 | 120.962289 | 130.112955 | 139.263621 | 295.077545 | 295.077545 | 295.077545 | 0.307780 | [[0.0019959391925840004, -0.000433146670736000... | [[569.5291276937579, 157.8517489654999, 157.85... | [[569.5291276937579, 157.8517489654999, 157.85... |
3 | mp-10021 | Ga | 4 | 63 | 76.721433 | [[0. 1.09045794 0.84078375] Ga, [0. ... | 2.376805 | 12.205989 | 15.101901 | 17.997812 | 49.025963 | 49.130670 | 49.235377 | 0.360593 | [[0.021647143908635, -0.005207263618160001, -0... | [[69.28798774976904, 34.7875015216915, 37.3877... | [[70.13259066665267, 40.60474945058445, 37.387... |
4 | mp-10025 | SiRu2 | 12 | 62 | 160.300999 | [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... | 0.196930 | 100.110773 | 101.947798 | 103.784823 | 255.055257 | 256.768081 | 258.480904 | 0.324682 | [[0.00410214297725, -0.001272204332729, -0.001... | [[349.3767766177825, 186.67131003104407, 176.4... | [[407.4791016459293, 176.4759188081947, 213.83... |
The data set above has many columns - we won't need all of this data for our modeling. We'll mainly be trying to predict `K_VRH` and `G_VRH` (the Voigt-Reuss-Hill averages of the bulk and shear moduli, respectively) and the `elastic_anisotropy`. We can drop most of the other output data. We also don't need various metadata such as the `cif` string of the structure, since the crystal structure is already embedded in the `structure` column.
df.columns
Index(['material_id', 'formula', 'nsites', 'space_group', 'volume', 'structure', 'elastic_anisotropy', 'G_Reuss', 'G_VRH', 'G_Voigt', 'K_Reuss', 'K_VRH', 'K_Voigt', 'poisson_ratio', 'compliance_tensor', 'elastic_tensor', 'elastic_tensor_original'], dtype='object')
unwanted_columns = ["volume", "nsites", "compliance_tensor", "elastic_tensor",
"elastic_tensor_original", "K_Voigt", "G_Voigt", "K_Reuss", "G_Reuss"]
df = df.drop(unwanted_columns, axis=1)
df.head()
material_id | formula | space_group | structure | elastic_anisotropy | G_VRH | K_VRH | poisson_ratio | |
---|---|---|---|---|---|---|---|---|
0 | mp-10003 | Nb4CoSi | 124 | [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... | 0.030688 | 97.141604 | 194.268884 | 0.285701 |
1 | mp-10010 | Al(CoSi)2 | 164 | [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... | 0.266910 | 96.252006 | 175.449907 | 0.268105 |
2 | mp-10015 | SiOs | 221 | [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] | 0.756489 | 130.112955 | 295.077545 | 0.307780 |
3 | mp-10021 | Ga | 63 | [[0. 1.09045794 0.84078375] Ga, [0. ... | 2.376805 | 15.101901 | 49.130670 | 0.360593 |
4 | mp-10025 | SiRu2 | 62 | [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... | 0.196930 | 101.947798 | 256.768081 | 0.324682 |
A pandas DataFrame includes a function called `describe()` that helps determine statistics for the various numerical/categorical columns in the data.
df.describe()
space_group | elastic_anisotropy | G_VRH | K_VRH | poisson_ratio | |
---|---|---|---|---|---|
count | 1181.000000 | 1181.000000 | 1181.000000 | 1181.000000 | 1181.000000 |
mean | 163.403895 | 2.145013 | 67.543145 | 136.259661 | 0.287401 |
std | 65.040733 | 19.140097 | 44.579408 | 72.886978 | 0.062177 |
min | 4.000000 | 0.000005 | 2.722175 | 6.476135 | 0.042582 |
25% | 124.000000 | 0.145030 | 34.117959 | 76.435350 | 0.249159 |
50% | 193.000000 | 0.355287 | 59.735163 | 130.382766 | 0.290198 |
75% | 221.000000 | 0.923117 | 91.332142 | 189.574194 | 0.328808 |
max | 229.000000 | 397.297866 | 522.921225 | 435.661487 | 0.467523 |
Sometimes, the `describe()` function will reveal outliers that indicate mistakes in the data, such as negative (and hence unphysical) minimum bulk/shear moduli, or maximum moduli that are implausibly high. In this case, the data looks OK at first glance, meaning there are no clear problems with the ranges of the various properties, so we won't filter out any data.
Note that the `describe()` function only describes numerical columns by default.
We are seeking to find relationships between the inputs (composition and crystal structure of a material) and outputs (elastic properties such as `K_VRH`, `G_VRH`, and `elastic_anisotropy`). To find such relations, we need to "featurize" the input data, i.e., turn it into numbers that meaningfully represent the underlying physical quantities. For example, one "feature" or "descriptor" of the composition of a material such as Nb4CoSi would be the standard deviation of the Pauling electronegativity of the elements in the compound (weighted by stoichiometry). Compositions with high values of this quantity would be more ionic, and compositions with lower values would tend towards covalent or metallic bonding. A descriptor of the crystal structure might be the average coordination number of sites; higher coordination numbers indicate more bonds and therefore might indicate stiffer materials. With matminer, we can start generating hundreds of possible descriptors using the available descriptor library. Data mining techniques can help narrow down which descriptors are the most relevant to the target problem, using the available output data as a guide.
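As a concrete illustration, here is a minimal sketch of that first descriptor computed by hand with pymatgen (matminer's featurizers automate this and compute many such statistics at once):
from pymatgen.core import Composition
import numpy as np

comp = Composition("Nb4CoSi")
fracs = np.array([comp.get_atomic_fraction(el) for el in comp.elements])  # stoichiometric weights
chis = np.array([el.X for el in comp.elements])  # Pauling electronegativities
mean_chi = (fracs * chis).sum()  # stoichiometry-weighted mean
std_chi = ((fracs * (chis - mean_chi) ** 2).sum()) ** 0.5  # weighted standard deviation
print(round(float(std_chi), 3))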
A major class of featurizers available in matminer uses the chemical composition to featurize the input data. Let's add some composition-based features to our DataFrame.
The first step is to have a column representing the chemical composition as a pymatgen Composition object. One way to do this is to use the conversion featurizers in matminer to turn a string composition (our `formula` column from before) into a pymatgen Composition.
from matminer.featurizers.conversions import StrToComposition
df = StrToComposition().featurize_dataframe(df, "formula")
df.head()
material_id | formula | space_group | structure | elastic_anisotropy | G_VRH | K_VRH | poisson_ratio | composition | |
---|---|---|---|---|---|---|---|---|---|
0 | mp-10003 | Nb4CoSi | 124 | [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... | 0.030688 | 97.141604 | 194.268884 | 0.285701 | (Nb, Co, Si) |
1 | mp-10010 | Al(CoSi)2 | 164 | [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... | 0.266910 | 96.252006 | 175.449907 | 0.268105 | (Al, Co, Si) |
2 | mp-10015 | SiOs | 221 | [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] | 0.756489 | 130.112955 | 295.077545 | 0.307780 | (Si, Os) |
3 | mp-10021 | Ga | 63 | [[0. 1.09045794 0.84078375] Ga, [0. ... | 2.376805 | 15.101901 | 49.130670 | 0.360593 | (Ga) |
4 | mp-10025 | SiRu2 | 62 | [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... | 0.196930 | 101.947798 | 256.768081 | 0.324682 | (Si, Ru) |
As you can see, we now have a new composition column above. Its rendering in the DataFrame is terse, but the column embeds both the elements and the amounts of each element in the composition (not shown).
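If you are curious what one of these Composition objects holds, you can inspect a row directly (a quick sketch using the first entry):
comp = df["composition"].iloc[0]
print(comp)  # pretty-printed formula, e.g. Nb4 Co1 Si1
print(comp.get_el_amt_dict())  # mapping of element symbol to amount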
Next, we'll use one of the featurizers in matminer to add a suite of descriptors to the DataFrame.
from matminer.featurizers.composition import ElementProperty
ep_feat = ElementProperty.from_preset(preset_name="magpie")
df = ep_feat.featurize_dataframe(df, col_id="composition") # input the "composition" column to the featurizer
df.head()
material_id | formula | space_group | structure | elastic_anisotropy | G_VRH | K_VRH | poisson_ratio | composition | MagpieData minimum Number | ... | MagpieData range GSmagmom | MagpieData mean GSmagmom | MagpieData avg_dev GSmagmom | MagpieData mode GSmagmom | MagpieData minimum SpaceGroupNumber | MagpieData maximum SpaceGroupNumber | MagpieData range SpaceGroupNumber | MagpieData mean SpaceGroupNumber | MagpieData avg_dev SpaceGroupNumber | MagpieData mode SpaceGroupNumber | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | mp-10003 | Nb4CoSi | 124 | [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... | 0.030688 | 97.141604 | 194.268884 | 0.285701 | (Nb, Co, Si) | 14.0 | ... | 1.548471 | 0.258079 | 0.430131 | 0.0 | 194.0 | 229.0 | 35.0 | 222.833333 | 9.611111 | 229.0 |
1 | mp-10010 | Al(CoSi)2 | 164 | [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... | 0.266910 | 96.252006 | 175.449907 | 0.268105 | (Al, Co, Si) | 13.0 | ... | 1.548471 | 0.619388 | 0.743266 | 0.0 | 194.0 | 227.0 | 33.0 | 213.400000 | 15.520000 | 194.0 |
2 | mp-10015 | SiOs | 221 | [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] | 0.756489 | 130.112955 | 295.077545 | 0.307780 | (Si, Os) | 14.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.0 | 194.0 | 227.0 | 33.0 | 210.500000 | 16.500000 | 194.0 |
3 | mp-10021 | Ga | 63 | [[0. 1.09045794 0.84078375] Ga, [0. ... | 2.376805 | 15.101901 | 49.130670 | 0.360593 | (Ga) | 31.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.0 | 64.0 | 64.0 | 0.0 | 64.000000 | 0.000000 | 64.0 |
4 | mp-10025 | SiRu2 | 62 | [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... | 0.196930 | 101.947798 | 256.768081 | 0.324682 | (Si, Ru) | 14.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.0 | 194.0 | 227.0 | 33.0 | 205.000000 | 14.666667 | 194.0 |
5 rows × 141 columns
As you can see, we now have many more columns in the DataFrame (141 in total - not all are shown above). So we added many features to our data!
As an aside, note that each featurizer also has a `citations()` function that tells you where to find out more about the featurizer.
ep_feat.citations()
['@article{ward_agrawal_choudary_wolverton_2016, title={A general-purpose machine learning framework for predicting properties of inorganic materials}, volume={2}, DOI={10.1038/npjcompumats.2017.28}, number={1}, journal={npj Computational Materials}, author={Ward, Logan and Agrawal, Ankit and Choudhary, Alok and Wolverton, Christopher}, year={2016}}']
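Each featurizer likewise has an `implementors()` function naming its authors (a small aside; the exact output depends on your matminer version and is omitted here):
print(ep_feat.implementors())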
There are many more composition-based featurizers apart from `ElementProperty` available in the `matminer.featurizers.composition` module. Some of them, such as `ElectronegativityDiff` or the `OxidationStates` featurizer used below, require knowing the oxidation states of the various elements in the Composition. This information is not there currently, but one can use the `conversions` package to guess oxidation states and then apply such a featurizer to the resulting column.
from matminer.featurizers.conversions import CompositionToOxidComposition
from matminer.featurizers.composition import OxidationStates
df = CompositionToOxidComposition().featurize_dataframe(df, "composition")
os_feat = OxidationStates()
df = os_feat.featurize_dataframe(df, "composition_oxid")
df.head()
material_id | formula | space_group | structure | elastic_anisotropy | G_VRH | K_VRH | poisson_ratio | composition | MagpieData minimum Number | ... | MagpieData maximum SpaceGroupNumber | MagpieData range SpaceGroupNumber | MagpieData mean SpaceGroupNumber | MagpieData avg_dev SpaceGroupNumber | MagpieData mode SpaceGroupNumber | composition_oxid | minimum oxidation state | maximum oxidation state | range oxidation state | std_dev oxidation state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | mp-10003 | Nb4CoSi | 124 | [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... | 0.030688 | 97.141604 | 194.268884 | 0.285701 | (Nb, Co, Si) | 14.0 | ... | 229.0 | 35.0 | 222.833333 | 9.611111 | 229.0 | (Nb0+, Co0+, Si0+) | 0 | 0 | 0 | 0.000000 |
1 | mp-10010 | Al(CoSi)2 | 164 | [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... | 0.266910 | 96.252006 | 175.449907 | 0.268105 | (Al, Co, Si) | 13.0 | ... | 227.0 | 33.0 | 213.400000 | 15.520000 | 194.0 | (Al3+, Co2+, Co3+, Si4-) | -4 | 3 | 7 | 3.872983 |
2 | mp-10015 | SiOs | 221 | [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] | 0.756489 | 130.112955 | 295.077545 | 0.307780 | (Si, Os) | 14.0 | ... | 227.0 | 33.0 | 210.500000 | 16.500000 | 194.0 | (Si4-, Os4+) | -4 | 4 | 8 | 5.656854 |
3 | mp-10021 | Ga | 63 | [[0. 1.09045794 0.84078375] Ga, [0. ... | 2.376805 | 15.101901 | 49.130670 | 0.360593 | (Ga) | 31.0 | ... | 64.0 | 0.0 | 64.000000 | 0.000000 | 64.0 | (Ga0+) | 0 | 0 | 0 | 0.000000 |
4 | mp-10025 | SiRu2 | 62 | [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... | 0.196930 | 101.947798 | 256.768081 | 0.324682 | (Si, Ru) | 14.0 | ... | 227.0 | 33.0 | 205.000000 | 14.666667 | 194.0 | (Si4-, Ru2+) | -4 | 2 | 6 | 4.242641 |
5 rows × 146 columns
As you can see, the end of our DataFrame now has some oxidation-state-based features!
Not all featurizers operate on compositions. Matminer can also analyze and featurize crystal structures. Let's start by adding some simple density features.
from matminer.featurizers.structure import DensityFeatures
df_feat = DensityFeatures()
df = df_feat.featurize_dataframe(df, "structure") # input the structure column to the featurizer
df.head()
material_id | formula | space_group | structure | elastic_anisotropy | G_VRH | K_VRH | poisson_ratio | composition | MagpieData minimum Number | ... | MagpieData avg_dev SpaceGroupNumber | MagpieData mode SpaceGroupNumber | composition_oxid | minimum oxidation state | maximum oxidation state | range oxidation state | std_dev oxidation state | density | vpa | packing fraction | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | mp-10003 | Nb4CoSi | 124 | [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... | 0.030688 | 97.141604 | 194.268884 | 0.285701 | (Nb, Co, Si) | 14.0 | ... | 9.611111 | 229.0 | (Nb0+, Co0+, Si0+) | 0 | 0 | 0 | 0.000000 | 7.834556 | 16.201654 | 0.688834 |
1 | mp-10010 | Al(CoSi)2 | 164 | [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... | 0.266910 | 96.252006 | 175.449907 | 0.268105 | (Al, Co, Si) | 13.0 | ... | 15.520000 | 194.0 | (Al3+, Co2+, Co3+, Si4-) | -4 | 3 | 7 | 3.872983 | 5.384968 | 12.397466 | 0.644386 |
2 | mp-10015 | SiOs | 221 | [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] | 0.756489 | 130.112955 | 295.077545 | 0.307780 | (Si, Os) | 14.0 | ... | 16.500000 | 194.0 | (Si4-, Os4+) | -4 | 4 | 8 | 5.656854 | 13.968635 | 12.976265 | 0.569426 |
3 | mp-10021 | Ga | 63 | [[0. 1.09045794 0.84078375] Ga, [0. ... | 2.376805 | 15.101901 | 49.130670 | 0.360593 | (Ga) | 31.0 | ... | 0.000000 | 64.0 | (Ga0+) | 0 | 0 | 0 | 0.000000 | 6.036267 | 19.180359 | 0.479802 |
4 | mp-10025 | SiRu2 | 62 | [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... | 0.196930 | 101.947798 | 256.768081 | 0.324682 | (Si, Ru) | 14.0 | ... | 14.666667 | 194.0 | (Si4-, Ru2+) | -4 | 2 | 6 | 4.242641 | 9.539514 | 13.358418 | 0.598395 |
5 rows × 149 columns
df_feat.feature_labels()
['density', 'vpa', 'packing fraction']
Again, we see more features in the last few columns: `density`, `vpa` (volume per atom), and `packing fraction`.
There are many more structure-based featurizers that perform far more complex and detailed analyses of the crystal structure; those are outside the scope of this example, though one simple additional example is sketched below.
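As an example (a sketch, not needed for the rest of this notebook), the `GlobalSymmetryFeatures` featurizer extracts symmetry information from a structure:
from matminer.featurizers.structure import GlobalSymmetryFeatures
gsf = GlobalSymmetryFeatures()
print(gsf.feature_labels())  # e.g. space group number, crystal system, centrosymmetry
# df = gsf.featurize_dataframe(df, "structure")  # uncomment to actually add these features
Let's move on to doing some machine learning predictions.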
We now have enough data to do some machine learning! We'll need to first determine what columns we consider input (independent variables) and what column we consider output (dependent variable).
For now, we'll use `K_VRH` (bulk modulus) as the output. You could retry this example with `G_VRH`, `elastic_anisotropy`, or `poisson_ratio` as the target output.
For the inputs, we'll use all the features we generated. That is, everything except the output data and non-numerical columns like `composition` and `structure`.
y = df['K_VRH'].values
excluded = ["G_VRH", "K_VRH", "elastic_anisotropy", "formula", "material_id",
"poisson_ratio", "structure", "composition", "composition_oxid"]
X = df.drop(excluded, axis=1)
print("There are {} possible descriptors:\n\n{}".format(X.shape[1], X.columns.values))
There are 140 possible descriptors: ['space_group' 'MagpieData minimum Number' 'MagpieData maximum Number' 'MagpieData range Number' 'MagpieData mean Number' 'MagpieData avg_dev Number' 'MagpieData mode Number' 'MagpieData minimum MendeleevNumber' 'MagpieData maximum MendeleevNumber' 'MagpieData range MendeleevNumber' 'MagpieData mean MendeleevNumber' 'MagpieData avg_dev MendeleevNumber' 'MagpieData mode MendeleevNumber' 'MagpieData minimum AtomicWeight' 'MagpieData maximum AtomicWeight' 'MagpieData range AtomicWeight' 'MagpieData mean AtomicWeight' 'MagpieData avg_dev AtomicWeight' 'MagpieData mode AtomicWeight' 'MagpieData minimum MeltingT' 'MagpieData maximum MeltingT' 'MagpieData range MeltingT' 'MagpieData mean MeltingT' 'MagpieData avg_dev MeltingT' 'MagpieData mode MeltingT' 'MagpieData minimum Column' 'MagpieData maximum Column' 'MagpieData range Column' 'MagpieData mean Column' 'MagpieData avg_dev Column' 'MagpieData mode Column' 'MagpieData minimum Row' 'MagpieData maximum Row' 'MagpieData range Row' 'MagpieData mean Row' 'MagpieData avg_dev Row' 'MagpieData mode Row' 'MagpieData minimum CovalentRadius' 'MagpieData maximum CovalentRadius' 'MagpieData range CovalentRadius' 'MagpieData mean CovalentRadius' 'MagpieData avg_dev CovalentRadius' 'MagpieData mode CovalentRadius' 'MagpieData minimum Electronegativity' 'MagpieData maximum Electronegativity' 'MagpieData range Electronegativity' 'MagpieData mean Electronegativity' 'MagpieData avg_dev Electronegativity' 'MagpieData mode Electronegativity' 'MagpieData minimum NsValence' 'MagpieData maximum NsValence' 'MagpieData range NsValence' 'MagpieData mean NsValence' 'MagpieData avg_dev NsValence' 'MagpieData mode NsValence' 'MagpieData minimum NpValence' 'MagpieData maximum NpValence' 'MagpieData range NpValence' 'MagpieData mean NpValence' 'MagpieData avg_dev NpValence' 'MagpieData mode NpValence' 'MagpieData minimum NdValence' 'MagpieData maximum NdValence' 'MagpieData range NdValence' 'MagpieData mean NdValence' 'MagpieData avg_dev NdValence' 'MagpieData mode NdValence' 'MagpieData minimum NfValence' 'MagpieData maximum NfValence' 'MagpieData range NfValence' 'MagpieData mean NfValence' 'MagpieData avg_dev NfValence' 'MagpieData mode NfValence' 'MagpieData minimum NValence' 'MagpieData maximum NValence' 'MagpieData range NValence' 'MagpieData mean NValence' 'MagpieData avg_dev NValence' 'MagpieData mode NValence' 'MagpieData minimum NsUnfilled' 'MagpieData maximum NsUnfilled' 'MagpieData range NsUnfilled' 'MagpieData mean NsUnfilled' 'MagpieData avg_dev NsUnfilled' 'MagpieData mode NsUnfilled' 'MagpieData minimum NpUnfilled' 'MagpieData maximum NpUnfilled' 'MagpieData range NpUnfilled' 'MagpieData mean NpUnfilled' 'MagpieData avg_dev NpUnfilled' 'MagpieData mode NpUnfilled' 'MagpieData minimum NdUnfilled' 'MagpieData maximum NdUnfilled' 'MagpieData range NdUnfilled' 'MagpieData mean NdUnfilled' 'MagpieData avg_dev NdUnfilled' 'MagpieData mode NdUnfilled' 'MagpieData minimum NfUnfilled' 'MagpieData maximum NfUnfilled' 'MagpieData range NfUnfilled' 'MagpieData mean NfUnfilled' 'MagpieData avg_dev NfUnfilled' 'MagpieData mode NfUnfilled' 'MagpieData minimum NUnfilled' 'MagpieData maximum NUnfilled' 'MagpieData range NUnfilled' 'MagpieData mean NUnfilled' 'MagpieData avg_dev NUnfilled' 'MagpieData mode NUnfilled' 'MagpieData minimum GSvolume_pa' 'MagpieData maximum GSvolume_pa' 'MagpieData range GSvolume_pa' 'MagpieData mean GSvolume_pa' 'MagpieData avg_dev GSvolume_pa' 'MagpieData mode GSvolume_pa' 'MagpieData minimum GSbandgap' 
'MagpieData maximum GSbandgap' 'MagpieData range GSbandgap' 'MagpieData mean GSbandgap' 'MagpieData avg_dev GSbandgap' 'MagpieData mode GSbandgap' 'MagpieData minimum GSmagmom' 'MagpieData maximum GSmagmom' 'MagpieData range GSmagmom' 'MagpieData mean GSmagmom' 'MagpieData avg_dev GSmagmom' 'MagpieData mode GSmagmom' 'MagpieData minimum SpaceGroupNumber' 'MagpieData maximum SpaceGroupNumber' 'MagpieData range SpaceGroupNumber' 'MagpieData mean SpaceGroupNumber' 'MagpieData avg_dev SpaceGroupNumber' 'MagpieData mode SpaceGroupNumber' 'minimum oxidation state' 'maximum oxidation state' 'range oxidation state' 'std_dev oxidation state' 'density' 'vpa' 'packing fraction']
The scikit-learn library makes it easy to fit and cross-validate different types of regression models. Let's start with one of the simplest - a linear regression.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
lr = LinearRegression()
lr.fit(X, y)
# get fit statistics
print('training R2 = ' + str(round(lr.score(X, y), 3)))
print('training RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y, y_pred=lr.predict(X))))
training R2 = 0.927
training RMSE = 19.625
This looks reasonable since linear regression is a simple (high bias) model. But to really validate that we are not over-fitting, we need to check the cross-validation score rather than the fitting score.
from sklearn.model_selection import KFold, cross_val_score
# Use 10-fold cross validation (90% training, 10% test)
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(lr, X, y, scoring='neg_mean_squared_error', cv=crossvalidation, n_jobs=1)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
r2_scores = cross_val_score(lr, X, y, scoring='r2', cv=crossvalidation, n_jobs=1)
print('Cross-validation results:')
print('Folds: %i, mean R2: %.3f' % (len(scores), np.mean(np.abs(r2_scores))))
print('Folds: %i, mean RMSE: %.3f' % (len(scores), np.mean(np.abs(rmse_scores))))
Cross-validation results:
Folds: 10, mean R2: 0.902
Folds: 10, mean RMSE: 22.467
A root-mean-squared error of ~22 GPa is not bad! Let's see what this looks like on a plot.
Note that in order to get the code below to work, you might need to start Jupyter notebook with a higher data rate, e.g., jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10.
from figrecipes import PlotlyFig
from sklearn.model_selection import cross_val_predict
pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
y_title='Predicted bulk modulus (GPa)',
title='Linear regression',
mode='notebook',
filename="lr_regression.html")
pf.xy(xy_pairs=[(y, cross_val_predict(lr, X, y, cv=crossvalidation)), ([0, 400], [0, 400])],
labels=df['formula'],
modes=['markers', 'lines'],
lines=[{}, {'color': 'black', 'dash': 'dash'}],
showlegends=False
)
Not too bad! However, there are definitely some outliers (you can hover over the points with your mouse to see what they are). We will see below (with a random forest model) how the accuracy changes when we try the model on completely new test data.
Let's see if a more complex machine learning model does better. We can try a random forest model, which is a good "starting" machine learning model, although one can usually do better. Let's repeat the steps above, this time with a random forest.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=50, random_state=1)
rf.fit(X, y)
print('training R2 = ' + str(round(rf.score(X, y), 3)))
print('training RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y, y_pred=rf.predict(X))))
training R2 = 0.989
training RMSE = 7.687
At least on the training data, we have very low RMSE and very high R2 - this is good! But let's see if these numbers hold up on cross-validation.
# compute cross validation scores for random forest model
r2_scores = cross_val_score(rf, X, y, scoring='r2', cv=crossvalidation, n_jobs=-1)
scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=crossvalidation, n_jobs=-1)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
print('Cross-validation results:')
print('Folds: %i, mean R2: %.3f' % (len(scores), np.mean(np.abs(r2_scores))))
print('Folds: %i, mean RMSE: %.3f' % (len(scores), np.mean(np.abs(rmse_scores))))
Cross-validation results:
Folds: 10, mean R2: 0.924
Folds: 10, mean RMSE: 19.277
It looks like, upon cross-validation, we do slightly better than linear regression, but not by much. Let's plot this one as well.
from figrecipes import PlotlyFig
pf_rf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
y_title='Random forest bulk modulus (GPa)',
title='Random forest regression',
mode='notebook',
filename="rf_regression.html")
pf_rf.xy([(y, cross_val_predict(rf, X, y, cv=crossvalidation)), ([0, 400], [0, 400])],
labels=df['formula'], modes=['markers', 'lines'],
lines=[{}, {'color': 'black', 'dash': 'dash'}], showlegends=False)
Visually, this looks a little better, but not by much. The random forest did very well in training, but is only slightly better than linear regression when we plot the cross-validation error as above.
You could (optionally) visualize the training error by replacing cross_val_predict(rf, X, y, cv=crossvalidation) in the above cell with rf.predict(X). That would look a lot better, but would not be an accurate representation of your prediction error.
Cross-validation estimates generalization error by resampling, but let's also hold out an explicit train/test split and see how a random forest trained on 80% of the data performs on the remaining, unseen 20%.
from sklearn.model_selection import train_test_split
X['formula'] = df['formula']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
train_formula = X_train['formula']
X_train = X_train.drop('formula', axis=1)
test_formula = X_test['formula']
X_test = X_test.drop('formula', axis=1)
rf_reg = RandomForestRegressor(n_estimators=50, random_state=1)
rf_reg.fit(X_train, y_train)
# get fit statistics
print('training R2 = ' + str(round(rf_reg.score(X_train, y_train), 3)))
print('training RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y_train, y_pred=rf_reg.predict(X_train))))
print('test R2 = ' + str(round(rf_reg.score(X_test, y_test), 3)))
print('test RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y_test, y_pred=rf_reg.predict(X_test))))
training R2 = 0.987
training RMSE = 8.259
test R2 = 0.942
test RMSE = 16.928
As expected, the error on the held-out test set is higher than on the training set. Let's plot the distribution of residuals for the training and test sets.
from figrecipes import PlotlyFig
pf_rf = PlotlyFig(x_title='Bulk modulus prediction residual (GPa)',
y_title='Probability',
title='Random forest regression residuals',
mode="notebook",
filename="rf_regression_residuals.html")
hist_plot = pf_rf.histogram(data=[y_train-rf_reg.predict(X_train),
y_test-rf_reg.predict(X_test)],
histnorm='probability', colors=['blue', 'red'],
return_plot=True
)
hist_plot["data"][0]['name'] = 'train'
hist_plot["data"][1]['name'] = 'test'
pf_rf.create_plot(hist_plot)
Finally, let's see which features the random forest model considers most important.
importances = rf.feature_importances_
# Note: the 'formula' column was appended to the *end* of X after rf was fit,
# so the first len(importances) column labels still line up with importances.
included = X.columns.values
indices = np.argsort(importances)[::-1]
pf = PlotlyFig(y_title='Importance',
title='Features by importance',
mode='notebook',
fontsize=20,
ticksize=15)
pf.bar(x=included[indices][0:10], y=importances[indices][0:10])
Features relating to the melting point and to the volume per atom / density are the most important in the random forest model.
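If you prefer a text summary over a plot, a minimal sketch like the following prints the top ten features (zip stops at the shorter importances array, so the trailing formula column is ignored):
# Print the ten most important features and their scores (fractions summing to 1).
for name, score in sorted(zip(included, importances), key=lambda t: -t[1])[:10]:
    print(f"{name}: {score:.3f}")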
This concludes the tutorial! You are now familiar with some of the basic features of matminer.