Advanced MPDS API usage: unusual materials phases from machine learning

Complexity level: green karate belt
Requirements: familiarity with machine learning and parallel programming

Here we search the MPDS for the "unusual" materials phases, i.e. those with extreme values of more than one physical property. Extreme in this context means close to either of the prediction bounds, minimum or maximum. We consider 8 properties generated by machine learning; in the MPDS each of them has clear bounds.

For instance, a crystal with a very low Debye temperature, a very low enthalpy of formation, a very high linear thermal expansion coefficient, etc. would match. Materials with such unusual combinations of properties certainly deserve attention, so let's list them.

Important! Before you proceed: notebooks running on third-party servers are not secure. Using this notebook assumes you authenticate at the MPDS server with your own API key. Please run this notebook only if you have an open-access account (i.e. the access section of your MPDS account reads: Programmatic data access: only open data).

Please do not run this notebook on third-party servers if you have elevated API access to the MPDS, since there's a nonzero probability of key leakage!

Be sure to always invalidate (revoke) your API key in your MPDS account after using the notebooks.

Now let's proceed with the authentication part. First, apply for an MPDS account if you have none. Then copy your API key, run the next cell, paste the key into the prompt that appears, and hit Enter.

In [ ]:

import os, getpass
os.environ['MPDS_KEY'] = getpass.getpass()

OK, this notebook may now talk to the MPDS server programmatically on your behalf.

In [ ]:

!pip install mpds_client

In [ ]:

from __future__ import division
import time
import random
import threading

from mpds_client import MPDSDataRetrieval, MPDSDataTypes

ml_data = {
    'isothermal bulk modulus': {'bounds': [5, 265], 'units': 'GPa'},
    'enthalpy of formation': {'bounds': [-325, 0], 'units': 'kJ g-at.-1'},
    'heat capacity at constant pressure': {'bounds': [11, 28], 'units': 'J K-1 g-at.-1'},
    'Seebeck coefficient': {'bounds': [-150, 225], 'units': 'muV K-1'},
    'values of electronic band gap': {'bounds': [0.5, 10], 'units': 'eV'}, # NB both direct & indirect
    'temperature for congruent melting': {'bounds': [300, 2700], 'units': 'K'},
    'Debye temperature': {'bounds': [175, 1100], 'units': 'K'},
    'linear thermal expansion coefficient': {'bounds': [1.0E-06, 9.5E-05], 'units': 'K-1'}
}

bound_tolerance_factor = 15

What's the bound_tolerance_factor? For each machine-learning property we divide the entire range of values (e.g. from 300 to 2700) into this number of equal segments. Then we take the first and the last segment. Entries whose property values fall into these two segments are considered extreme and kept.

Note that if the key isn't valid, the API returns an HTTP 403 error.
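As a rough illustration of this arithmetic, here is a minimal sketch using the melting-temperature bounds from ml_data and a factor of 15. The helper names extreme_cutoffs and is_extreme are hypothetical and are not part of mpds_client.

```python
def extreme_cutoffs(lower, upper, factor):
    """Return the two cutoffs that delimit the first and the last segment."""
    margin = (upper - lower) / factor  # width of one segment
    return lower + margin, upper - margin

def is_extreme(value, lower, upper, factor):
    """True if the value falls into the first or the last segment."""
    low_cutoff, high_cutoff = extreme_cutoffs(lower, upper, factor)
    return value < low_cutoff or value > high_cutoff

# temperature for congruent melting: bounds [300, 2700] K, factor 15
low_cutoff, high_cutoff = extreme_cutoffs(300, 2700, 15)
print(low_cutoff, high_cutoff)  # 460.0 2540.0
```

With these numbers a predicted melting temperature below 460 K or above 2540 K would count as extreme, and a larger factor makes both segments narrower.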

In [ ]:

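extremes, extremes_intersects = {}, {}
extremes_lock = threading.Lock() # guards the two shared dicts above

def mpds_download_worker(prop, min_bound, max_bound):
    '''
    A parallelizable worker: downloads all the machine-learning entries
    for the given property and keeps those beyond either bound
    '''
    print("---Starting with %s" % prop)

    client = MPDSDataRetrieval(dtype=MPDSDataTypes.MACHINE_LEARNING)

    min_entries, max_entries = [], []

    # each item is [entry, phase_id, chemical_formula, property value]
    for item in client.get_data({"props": prop}, fields={'P':[
        'sample.material.entry',
        'sample.material.phase_id',
        'sample.material.chemical_formula',
        'sample.measurement[0].property.scalar'
    ]}):
        if item[3] < min_bound:
            min_entries.append(item)

        elif item[3] > max_bound:
            max_entries.append(item)

    # the check-then-write below is not atomic, so serialize it across threads
    with extremes_lock:
        for item in min_entries + max_entries:

            keep_info = [prop, item[0]] + item[2:]

            # a phase already seen (for this or another property) goes to the intersections
            if item[1] in extremes:
                extremes_intersects.setdefault(item[1], []).append(keep_info)

            else:
                extremes[item[1]] = keep_info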
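To see how the two dictionaries fill up, the bookkeeping can be replayed on toy data without contacting the server. This is only a sketch: the rows, entry labels, phase ids and values below are made up, and the names toy_rows, seen and repeated are hypothetical stand-ins for the downloaded data, extremes and extremes_intersects.

```python
# Toy rows in the same field order the worker requests:
# [entry, phase_id, chemical_formula, property value]; all values are invented
toy_rows = {
    'Debye temperature': [['E1', 101, 'AB', 180.0], ['E2', 102, 'CD', 1090.0]],
    'enthalpy of formation': [['E3', 101, 'AB', -320.0]],
}

seen, repeated = {}, {}

for prop, rows in toy_rows.items():
    for entry, phase_id, formula, value in rows:
        card = [prop, entry, formula, value]
        if phase_id in seen:
            # second and later hits for a phase are collected here
            repeated.setdefault(phase_id, []).append(card)
        else:
            # the first hit for a phase is stored here
            seen[phase_id] = card

# add the first hit back, so each repeated phase lists all of its hits
for phase_id in repeated:
    repeated[phase_id].append(seen[phase_id])

print(sorted(repeated))  # [101]
```

Only the phase 101 appears more than once, so it is the only one reported, with both of its cards.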

Below is the most time-consuming step: we need to scan all the machine-learning data. Fetching all the entries for a single property takes about 10 minutes, so fetching all 8 properties sequentially would take well over an hour. By parallelizing the data extraction over the 8 properties we could ideally achieve an 8x speedup. However, that would also increase the load on the MPDS server 8x, which we should avoid in principle. Let's be polite! Doubling the load is safe, though, so we can run two threads four times to fetch all the data. The total running time is then roughly halved.
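The idea of capping the concurrency at two can be tried out offline first. The sketch below is not the approach used in this notebook: it swaps in the standard-library ThreadPoolExecutor with max_workers=2 and a hypothetical fake_download function that merely sleeps instead of calling the MPDS.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(prop):
    """Hypothetical stand-in for a slow download: sleeps, then returns its input."""
    time.sleep(0.1)
    return prop

props = ['property %d' % i for i in range(8)]  # placeholder names

started = time.time()

# at most two downloads are in flight at any moment
with ThreadPoolExecutor(max_workers=2) as pool:
    finished = list(pool.map(fake_download, props))

elapsed = time.time() - started
print("%d tasks done in %.2f s" % (len(finished), elapsed))
```

Unlike waiting on fixed pairs, a pool starts the next task as soon as either worker becomes free, while the server still never sees more than two simultaneous requests from us.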

In [ ]:

start_time = time.time()
ml_props = list(ml_data.keys())

for even, odd in zip(ml_props[0::2], ml_props[1::2]):

    print("---Preparing a pair of %s & %s" % (even, odd))

    threads = [] # only the threads of the current pair

    for key in [even, odd]:

        # move the bounds inwards by one segment to match entries near the margin;
        # ml_data is left intact, so re-running this cell does not narrow the bounds further
        lower, upper = ml_data[key]['bounds']
        margin = (upper - lower) / bound_tolerance_factor

        # run in parallel
        thread = threading.Thread(target=mpds_download_worker, args=[key, lower + margin, upper - margin])
        thread.start()
        threads.append(thread)

    # wait for both downloads before starting the next pair
    for thread in threads:
        thread.join()

for phase_id in extremes_intersects:
    extremes_intersects[phase_id].append(extremes[phase_id])

for phase_id in sorted(extremes_intersects.keys()):

    print("*" * 30 + " Distinct phase https://mpds.io/#phase_id/%s " % phase_id + "*" * 30)

    for card in extremes_intersects[phase_id]:
        print("%s (%s) %s = %s %s" % (
            card[2], card[1], card[0], card[3], ml_data[card[0]]['units']
        ))

print("Done in %1.2f s" % (time.time() - start_time))

Were you able to follow everything? Please try to answer:

How is the value of bound_tolerance_factor connected with the total number of results?
How could one obtain the particular crystalline structures for these results?
How could one in principle verify these results?

PS don't forget to invalidate (revoke) your API key.