Meta functionalities of the EpiGraphDB platform

In this notebook we show the following aspects of the EpiGraphDB platform, and how to use the API to get the information:

  1. Metadata: meta nodes and meta edges, and the overall schema.
  2. Search for a specific node under the meta node.
  3. Cypher: how to query the database directly using Neo4j Cypher

For detailed documentation on the API endpoints please visit:

In [1]:
from pprint import pformat

import networkx as nx
import pandas as pd
import requests
In [2]:
# default parameters
API_URL = "https://api.epigraphdb.org"
In [3]:
# Parameters
API_URL = "https://api.epigraphdb.org"
In [4]:
print(API_URL)
requests.get(f"{API_URL}/ping").json()
https://api.epigraphdb.org
Out[4]:
True

Metadata

Here we query for the metadata information using the endpoint GET /meta/schema, which will be used for downstream processing.

In [5]:
endpoint = "/meta/schema"
params = {"graphviz": False, "plot": False}
r = requests.get(f"{API_URL}{endpoint}", params=params)
r.raise_for_status()
metadata = r.json()

# Preview of metadata information
keys = metadata.keys()
print(pformat(keys), "\n")
for key in list(keys):
    print(f"# {key}:")
    print(pformat(metadata[key])[:1000], "\n")
dict_keys(['nodes', 'edges', 'connections']) 

# nodes:
{'Disease': {'count': 38960,
             'properties': {'_id': {'indexed': True,
                                    'type': 'STRING',
                                    'unique': False},
                            '_name': {'indexed': True,
                                      'type': 'STRING',
                                      'unique': False},
                            '_source': {'indexed': False,
                                        'type': 'LIST',
                                        'unique': False},
                            'definition': {'indexed': False,
                                           'type': 'STRING',
                                           'unique': False},
                            'doid': {'indexed': True,
                                     'type': 'LIST',
                                     'unique': False},
                            'efo': {'indexed': False,
                                    'type': 'LIST',
                     

# edges:
{'BIORXIV_OBJ': {'count': 32651,
                 'properties': {'_source': {'array': True, 'type': 'LIST'}}},
 'BIORXIV_PREDICATE': {'count': 32648,
                       'properties': {'_source': {'array': True,
                                                  'type': 'LIST'},
                                      'count': {'array': False,
                                                'type': 'INTEGER'},
                                      'predicate': {'array': False,
                                                    'type': 'STRING'}}},
 'BIORXIV_SUB': {'count': 32657,
                 'properties': {'_source': {'array': True, 'type': 'LIST'}}},
 'BIORXIV_TO_LIT': {'count': 35211,
                    'properties': {'_source': {'array': True, 'type': 'LIST'}}},
 'CPIC': {'count': 375,
          'properties': {'_source': {'array': True, 'type': 'LIST'},
                         'cpic_level': {'array': False, 'type': 'STRING'},
                         'guideline': {'array': F 

# connections:
[{'count': 2461,
  'from_node': 'Drug',
  'rel': 'OPENTARGETS_DRUG_TO_DISEASE',
  'to_node': 'Disease'},
 {'count': 5763,
  'from_node': 'Gene',
  'rel': 'GENE_TO_DISEASE',
  'to_node': 'Disease'},
 {'count': 8247,
  'from_node': 'Disease',
  'rel': 'MONDO_MAP_UMLS',
  'to_node': 'LiteratureTerm'},
 {'count': 2819,
  'from_node': 'Disease',
  'rel': 'MONDO_MAP_EFO',
  'to_node': 'Efo'},
 {'count': 2463,
  'from_node': 'Pathway',
  'rel': 'PATHWAY_CHILD_OF',
  'to_node': 'Pathway'},
 {'count': 121873,
  'from_node': 'Protein',
  'rel': 'PROTEIN_IN_PATHWAY',
  'to_node': 'Pathway'},
 {'count': 1969,
  'from_node': 'LiteratureTerm',
  'rel': 'MEDRXIV_PREDICATE',
  'to_node': 'LiteratureTerm'},
 {'count': 16435,
  'from_node': 'LiteratureTerm',
  'rel': 'TERM_TO_GENE',
  'to_node': 'Gene'},
 {'count': 32651,
  'from_node': 'LiteratureTriple',
  'rel': 'BIORXIV_OBJ',
  'to_node': 'LiteratureTerm'},
 {'count': 32657,
  'from_node': 'LiteratureTriple',
  'rel': 'BIORXIV_SUB',
  'to_node': 'Li 

Meta nodes

We can extract the specific meta node information as a pandas dataframe from the metadata.

In [6]:
meta_node_df = pd.DataFrame.from_dict(metadata["nodes"], orient="index")

(
    meta_node_df.sort_index().assign(
        count=lambda df: df["count"].apply(lambda x: f"{x:,}")
    )
)
Out[6]:
count properties
Disease 38,960 {'_name': {'type': 'STRING', 'indexed': True, ...
Drug 2,697 {'molecule_type': {'type': 'STRING', 'indexed'...
Efo 25,390 {'_name': {'type': 'STRING', 'indexed': True, ...
Gene 57,737 {'druggability_tier': {'type': 'STRING', 'inde...
Gwas 34,494 {'note': {'type': 'STRING', 'indexed': False, ...
Literature 3,995,672 {'issn': {'type': 'STRING', 'indexed': False, ...
LiteratureTerm 108,905 {'_name': {'type': 'STRING', 'indexed': True, ...
LiteratureTriple 5,609,945 {'subject_id': {'type': 'STRING', 'indexed': T...
Pathway 2,441 {'_name': {'type': 'STRING', 'indexed': True, ...
Protein 20,280 {'name': {'type': 'STRING', 'indexed': True, '...
Tissue 54 {'name': {'type': 'STRING', 'indexed': True, '...
Variant 99,005 {'ref': {'type': 'STRING', 'indexed': False, '...

Meta relationships and connections

We can also extract the meta relationship (edge) information, and the connections.

In [7]:
meta_rel_df = pd.DataFrame.from_dict(metadata["edges"], orient="index").merge(
    pd.DataFrame.from_dict(
        {_["rel"]: _ for _ in metadata["connections"]}, orient="index"
    )[["from_node", "to_node"]],
    left_index=True,
    right_index=True,
)

(
    meta_rel_df.sort_values(by=["from_node", "to_node"]).assign(
        count=lambda df: df["count"].apply(lambda x: f"{x:,}")
    )
)
Out[7]:
count properties from_node to_node
MONDO_MAP_EFO 2,819 {'_source': {'array': False, 'type': 'STRING'}} Disease Efo
MONDO_MAP_UMLS 8,247 {'_source': {'array': False, 'type': 'STRING'}} Disease LiteratureTerm
OPENTARGETS_DRUG_TO_DISEASE 2,461 {'_source': {'array': True, 'type': 'LIST'}} Drug Disease
CPIC 375 {'pharmgkb_level_of_evidence': {'array': False... Drug Gene
OPENTARGETS_DRUG_TO_TARGET 6,534 {'phase': {'array': False, 'type': 'STRING'}, ... Drug Gene
EFO_CHILD_OF 43,132 {'_source': {'array': True, 'type': 'LIST'}} Efo Efo
GENE_TO_DISEASE 5,763 {'last_updated': {'array': False, 'type': 'STR... Gene Disease
XQTL_MULTI_SNP_MR 3,015,233 {'p': {'array': False, 'type': 'FLOAT'}, 'se':... Gene Gwas
XQTL_SINGLE_SNP_MR_GENE_GWAS 8,449,779 {'p': {'array': False, 'type': 'FLOAT'}, 'se':... Gene Gwas
GENE_TO_PROTEIN 19,142 {'_source': {'array': True, 'type': 'LIST'}} Gene Protein
EXPRESSED_IN 2,918,240 {'tpm': {'array': False, 'type': 'FLOAT'}, '_s... Gene Tissue
GWAS_NLP_EFO 12,302 {'score': {'array': False, 'type': 'FLOAT'}, '... Gwas Efo
GWAS_EFO_EBI 281 {'_source': {'array': True, 'type': 'LIST'}} Gwas Efo
PRS 118,124 {'p': {'array': False, 'type': 'FLOAT'}, 'r2':... Gwas Gwas
MR_EVE_MR 25,804,945 {'b': {'array': False, 'type': 'FLOAT'}, 'se':... Gwas Gwas
GEN_COR 840,960 {'h2_intercept_SE': {'array': False, 'type': '... Gwas Gwas
OBS_COR 17,932 {'_source': {'array': True, 'type': 'LIST'}, '... Gwas Gwas
GWAS_NLP 89,239,773 {'score': {'array': False, 'type': 'FLOAT'}, '... Gwas Gwas
GWAS_TO_LITERATURE 28,111,669 {'_source': {'array': True, 'type': 'LIST'}} Gwas Literature
METAMAP_LITE 5,556 {'_source': {'array': True, 'type': 'LIST'}, '... Gwas LiteratureTerm
GWAS_TO_LITERATURE_TRIPLE 17,531,153 {'pval': {'array': False, 'type': 'FLOAT'}, 'g... Gwas LiteratureTriple
OPENGWAS_TOPHITS 160,283 {'_source': {'array': True, 'type': 'LIST'}, '... Gwas Variant
GWAS_TO_VARIANT 26,436 {'se': {'array': False, 'type': 'FLOAT'}, 'nca... Gwas Variant
TERM_TO_GENE 16,435 {'_source': {'array': False, 'type': 'STRING'}} LiteratureTerm Gene
SEMMEDDB_PREDICATE 5,584,547 {'count': {'array': False, 'type': 'INTEGER'},... LiteratureTerm LiteratureTerm
BIORXIV_PREDICATE 32,648 {'count': {'array': False, 'type': 'INTEGER'},... LiteratureTerm LiteratureTerm
MEDRXIV_PREDICATE 1,969 {'count': {'array': False, 'type': 'INTEGER'},... LiteratureTerm LiteratureTerm
BIORXIV_TO_LIT 35,211 {'_source': {'array': True, 'type': 'LIST'}} LiteratureTriple Literature
SEMMEDDB_TO_LIT 10,589,785 {'_source': {'array': True, 'type': 'LIST'}} LiteratureTriple Literature
SEMMEDDB_SUB 5,584,547 {'_source': {'array': True, 'type': 'LIST'}} LiteratureTriple LiteratureTerm
BIORXIV_OBJ 32,651 {'_source': {'array': True, 'type': 'LIST'}} LiteratureTriple LiteratureTerm
BIORXIV_SUB 32,657 {'_source': {'array': True, 'type': 'LIST'}} LiteratureTriple LiteratureTerm
SEMMEDDB_OBJ 5,584,547 {'_source': {'array': True, 'type': 'LIST'}} LiteratureTriple LiteratureTerm
PATHWAY_CHILD_OF 2,463 {'_source': {'array': True, 'type': 'LIST'}} Pathway Pathway
PROTEIN_IN_PATHWAY 121,873 {'_source': {'array': True, 'type': 'LIST'}} Protein Pathway
STRING_INTERACT_WITH 827,184 {'score': {'array': False, 'type': 'FLOAT'}, '... Protein Protein
VARIANT_TO_GENE 108,561 {'amino_acids': {'array': False, 'type': 'STRI... Variant Gene
XQTL_SINGLE_SNP_MR_SNP_GENE 41,564 {'_source': {'array': True, 'type': 'LIST'}} Variant Gene

Schema plot

We can generate a network diagram of the graph db schema using networkx.

In [8]:
graph = nx.from_pandas_edgelist(meta_rel_df, source="from_node", target="to_node")
In [9]:
nx.draw(
    G=graph,
    pos=nx.kamada_kawai_layout(graph),
    with_labels=True,
    node_color="white",
)

A detailed version of the shema plot can be obtained from the API:

schema_plot

Search for specific node

Users can use the explorer on the Web UI to search for a specific node by:

  • fuzzy matching by "name" field.
  • exact matching by "ID" field if you know the its ID (e.g. the ID to a GWAS from IEU GWAS Database).

Here we show how these are done at the API level using Gwas nodes as an example.

First we need to know what the "ID" and "name" fields are for the meta nodes using GET /meta/nodes/id-name-schema:

In [10]:
r = requests.get(f"{API_URL}/meta/nodes/id-name-schema")
r.raise_for_status()

meta_node_fields = r.json()
meta_node_fields
Out[10]:
{'Disease': {'id': 'id', 'name': 'label'},
 'Drug': {'id': 'label', 'name': 'label'},
 'Efo': {'id': 'id', 'name': 'value'},
 'Gene': {'id': 'ensembl_id', 'name': 'name'},
 'Gwas': {'id': 'id', 'name': 'trait'},
 'Literature': {'id': 'id', 'name': 'id'},
 'LiteratureTerm': {'id': 'id', 'name': 'name'},
 'LiteratureTriple': {'id': 'id', 'name': 'name'},
 'Pathway': {'id': 'id', 'name': 'name'},
 'Protein': {'id': 'uniprot_id', 'name': 'uniprot_id'},
 'Tissue': {'id': 'id', 'name': 'name'},
 'Variant': {'id': 'name', 'name': 'name'}}

Fuzzy matching

Here we search for nodes can contain "body mass index" in their traits.

In [11]:
name = "body mass index"

r = requests.get(f"{API_URL}/meta/nodes/Gwas/search", params={"name": name})
r.raise_for_status()

print(pformat(r.json())[:3000])
{'metadata': {'empty_results': False,
              'query': 'MATCH (node: Gwas) WHERE node.trait =~ "(?i).*body '
                       'mass index.*" RETURN node LIMIT 10;',
              'total_seconds': 0.021328},
 'results': [{'node': {'_id': 'ieu-a-1089',
                       '_name': 'Body mass index',
                       '_source': ['OpenGWAS-2020-10-13'],
                       'author': 'Wood',
                       'build': 'HG19/GRCh37',
                       'category': 'Risk factor',
                       'id': 'ieu-a-1089',
                       'mr': '0',
                       'note': 'Dominance model? If so then not necessarily of '
                               'value for MR; Results from interim Biobank '
                               'release enriched for smokers; could lead to '
                               'bias through collider issues in MR',
                       'nsnp': '8654252',
                       'pmid': '26961502.0',
                       'population': 'European',
                       'sample_size': '120286.0',
                       'sex': 'Males and Females',
                       'subcategory': 'Anthropometric',
                       'trait': 'Body mass index',
                       'year': '2016.0'}},
             {'node': {'_id': 'ieu-a-974',
                       '_name': 'Body mass index',
                       '_source': ['OpenGWAS-2020-10-13'],
                       'author': 'Locke AE',
                       'build': 'HG19/GRCh37',
                       'category': 'Risk factor',
                       'id': 'ieu-a-974',
                       'mr': '1',
                       'nsnp': '2494613',
                       'pmid': '25673413.0',
                       'population': 'European',
                       'sample_size': '171977.0',
                       'sd': '4.77',
                       'sex': 'Females',
                       'subcategory': 'Anthropometric',
                       'trait': 'Body mass index',
                       'year': '2015.0'}},
             {'node': {'_id': 'ieu-a-95',
                       '_name': 'Body mass index',
                       '_source': ['OpenGWAS-2020-10-13'],
                       'author': 'Randall JC',
                       'build': 'HG19/GRCh37',
                       'category': 'Risk factor',
                       'id': 'ieu-a-95',
                       'mr': '1',
                       'nsnp': '2736876',
                       'pmid': '23754948.0',
                       'population': 'European',
                       'sample_size': '73137.0',
                       'sd': '4.77',
                       'sex': 'Females',
                       'subcategory': 'Anthropometric',
                       'trait': 'Body mass index',
                       'year': '2013.0'}},
             {'node': {'_id': 'ebi-a-GCST004904',
                       '_name': 'Body mass index',
                       '_source': ['OpenGWAS-2020-

Exact matching

Similarly, we can exact match a specific node by its ID.

In [12]:
id = "ieu-a-2"

r = requests.get(f"{API_URL}/meta/nodes/Gwas/search", params={"id": id})
r.raise_for_status()

print(pformat(r.json())[:3000])
{'metadata': {'empty_results': False,
              'query': 'MATCH (node: Gwas {id: "ieu-a-2"}) RETURN node LIMIT '
                       '10;',
              'total_seconds': 0.012118},
 'results': [{'node': {'_id': 'ieu-a-2',
                       '_name': 'Body mass index',
                       '_source': ['OpenGWAS-2020-10-13'],
                       'author': 'Locke AE',
                       'build': 'HG19/GRCh37',
                       'category': 'Risk factor',
                       'id': 'ieu-a-2',
                       'mr': '1',
                       'nsnp': '2555511',
                       'pmid': '25673413.0',
                       'population': 'Mixed',
                       'sample_size': '339224.0',
                       'sd': '4.77',
                       'sex': 'Males and Females',
                       'subcategory': 'Anthropometric',
                       'trait': 'Body mass index',
                       'year': '2015.0'}}]}

Cypher (advanced)

Advanced users that are familiar with Neo4j Cypher can query the database using Cypher directly.

In [13]:
query = """
    MATCH (exposure:Gwas)-[mr:MR_EVE_MR]->(outcome:Gwas) 
    WHERE exposure.trait = "Body mass index"
    RETURN exposure, outcome, mr LIMIT 2
"""

r = requests.post(f"{API_URL}/cypher", json={"query": query})
r.raise_for_status()

print(pformat(r.json())[:3000])
{'metadata': {'empty_results': False,
              'query': 'MATCH (exposure:Gwas)-[mr:MR_EVE_MR]->(outcome:Gwas) '
                       'WHERE exposure.trait = "Body mass index" RETURN '
                       'exposure, outcome, mr LIMIT 2',
              'total_seconds': 0.011951},
 'results': [{'exposure': {'_id': 'ieu-a-974',
                           '_name': 'Body mass index',
                           '_source': ['OpenGWAS-2020-10-13'],
                           'author': 'Locke AE',
                           'build': 'HG19/GRCh37',
                           'category': 'Risk factor',
                           'id': 'ieu-a-974',
                           'mr': '1',
                           'nsnp': '2494613',
                           'pmid': '25673413.0',
                           'population': 'European',
                           'sample_size': '171977.0',
                           'sd': '4.77',
                           'sex': 'Females',
                           'subcategory': 'Anthropometric',
                           'trait': 'Body mass index',
                           'year': '2015.0'},
              'mr': {'_source': ['MR-EvE-2021-03-10'],
                     'b': 0.0289678834636601,
                     'ci_low': -0.165265753333202,
                     'ci_upp': 0.22320152026052198,
                     'method': 'Simple mean',
                     'moescore': 0.77,
                     'nsnp': 37,
                     'pval': 0.771725286643359,
                     'se': 0.0990987942841133,
                     'selection': 'HF'},
              'outcome': {'_id': 'prot-a-1729',
                          '_name': 'Galectin-7',
                          '_source': ['OpenGWAS-2020-10-13'],
                          'author': 'Sun BB',
                          'build': 'HG19/GRCh37',
                          'category': 'Immune system',
                          'id': 'prot-a-1729',
                          'mr': '1',
                          'nsnp': '10534735',
                          'pmid': '29875488.0',
                          'population': 'European',
                          'sample_size': '3301.0',
                          'sex': 'Males and Females',
                          'subcategory': 'Protein',
                          'trait': 'Galectin-7',
                          'year': '2018.0'}},
             {'exposure': {'_id': 'ieu-a-974',
                           '_name': 'Body mass index',
                           '_source': ['OpenGWAS-2020-10-13'],
                           'author': 'Locke AE',
                           'build': 'HG19/GRCh37',
                           'category': 'Risk factor',
                           'id': 'ieu-a-974',
                           'mr': '1',
                           'nsnp': '2494613',
                           'pmid': '25673413.0',
                           'population': 'European',
                           'sample_size': '171977.0',
    

Again for the detailed documentation on the API endpoints please visit: