Getting started with EpiGraphDB in Python

This notebook is provided as a brief introductory guide to working with the EpiGraphDB platform through Python. Here we will demonstrate a few basic operations that can be carried out using the platform, but for more advanced methods please refer to the API endpoint documentation.

A Python wrapper for EpiGraphDB's API is currently in the works, but for now we will query the API directly using the requests library. Knowledge of this package is advantageous but not essential.

In [1]:

import requests

First, we will ping the API to check our connection:

In [2]:

# Store our API URL as a string for future use
API_URL = "https://api.epigraphdb.org"

# Here we use the .get() method to send a GET request to the /ping endpoint of the API
endpoint = '/ping'
response_object = requests.get(API_URL + endpoint)  

# Check that the ping was successful
response_object.raise_for_status() 
print("If this line gets printed, ping was successful.")

If this line gets printed, ping was successful.

1. Using EpiGraphDB to obtain biological mappings

In this first section, we will take an arbitrary list of genes and query the EpiGraphDB API to find the proteins that they map to. We will be using the POST HTTP method, which requires its parameters to be passed in JSON format, a conversion that is easy to do using the json library. To find the correct names of the parameters that we are about to set, we can navigate to the EpiGraphDB API documentation and find the endpoint of interest. From there we simply read off the parameters that we want to pass, and can take a look at the example request as a reference point if needed.
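As a small illustration of the JSON conversion described above, the sketch below serialises a parameter dictionary with json.dumps and reads it back with json.loads. It does not contact the API. As an aside, requests.post also accepts a json= keyword that performs this conversion itself, which may be a convenient alternative to calling json.dumps by hand.

```python
import json

# Example payload for the gene-to-protein endpoint; the parameter name
# is the one read off the API documentation in this notebook
params = {"gene_name_list": ["TP53", "BRCA1", "TNF"]}

# json.dumps turns the Python dictionary into a JSON-formatted string
json_params = json.dumps(params)
print(json_params)

# json.loads reverses the conversion, a quick way to sanity-check a payload
round_trip = json.loads(json_params)
print(round_trip == params)
```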

In [3]:

# 1.1 Mapping genes to proteins

# Set parameters and convert to JSON format
import json
params = {
  "gene_name_list": [
    "TP53",
    "BRCA1", 
    "TNF"
  ]
}
json_params = json.dumps(params)

# Define which endpoint of the API we would like to connect with
endpoint = '/mappings/gene-to-protein'

# Send the POST request
response_object = requests.post(API_URL + endpoint, data=json_params)

# Check for successful request
response_object.raise_for_status()

# Store results in a pandas dataframe
import pandas as pd
results = response_object.json()['results']
gene_protein_df = pd.json_normalize(results)

gene_protein_df.head()

Out[3]:

	gene.name	gene.ensembl_id	protein.uniprot_id
0	TP53	ENSG00000141510	P04637
1	BRCA1	ENSG00000012048	P38398
2	TNF	ENSG00000232810	P01375

In the above cell, we queried EpiGraphDB for the proteins that have been mapped to the genes TP53, BRCA1, and TNF. Our query went through successfully and we received an associated protein for each. The columns in our output dataframe take the general form entity.property and this will remain consistent throughout this notebook.
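To make the entity.property naming more concrete, here is a minimal standard-library sketch of the kind of flattening that pd.json_normalize performs on each result. The nested shape of the record is an assumption inferred from the column names, and the flatten helper is purely illustrative, not part of pandas or EpiGraphDB.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten a nested dictionary into one level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested entities such as 'gene' or 'protein'
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Assumed shape of one element of the 'results' list (values from Out[3])
record = {
    "gene": {"name": "TP53", "ensembl_id": "ENSG00000141510"},
    "protein": {"uniprot_id": "P04637"},
}

flat_record = flatten(record)
print(flat_record)
```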

Specific descriptions for the properties of each entity can be found in EpiGraphDB's data dictionary. Simply click on the relevant entity in the table of contents on the right-hand side (or scroll down to the relevant section), then locate the property of interest.

In [4]:

# 1.2 Proteins to pathways

# As above, this is another POST request, so we need our data in JSON format
json_params = json.dumps({
  "uniprot_id_list": list(gene_protein_df['protein.uniprot_id'].values)
})

# Send the request
endpoint = '/protein/in-pathway'
response_object = requests.post(API_URL + endpoint, data=json_params)

# Check for successful request
response_object.raise_for_status()

# Store results
results = response_object.json()['results']
protein_pathway_df = pd.json_normalize(results)

protein_pathway_df.head()

Out[4]:

	uniprot_id	pathway_count	pathway_reactome_id
0	P04637	5	[R-HSA-6785807, R-HSA-390471, R-HSA-5689896, R...
1	P38398	6	[R-HSA-6796648, R-HSA-1221632, R-HSA-8953750, ...
2	P01375	3	[R-HSA-6785807, R-HSA-6783783, R-HSA-5357905]

Above, we took the proteins that had been mapped to our genes of interest and queried the platform for their associated pathway data. The API found multiple pathways for each protein and has returned the respective Reactome IDs to us as lists.
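Because the pathway IDs arrive as one list per protein, it can be handy to unroll them into one (protein, pathway) pair per entry. The sketch below does this in plain Python for the P01375 row of Out[4], the only row whose list is shown in full. The variable names are illustrative, and the row is assumed to have the same keys as the dataframe columns.

```python
# One row of the results, copied from Out[4]
rows = [
    {
        "uniprot_id": "P01375",
        "pathway_count": 3,
        "pathway_reactome_id": ["R-HSA-6785807", "R-HSA-6783783", "R-HSA-5357905"],
    }
]

# Unroll each list so that every protein-pathway pair becomes its own tuple
pairs = [
    (row["uniprot_id"], reactome_id)
    for row in rows
    for reactome_id in row["pathway_reactome_id"]
]

for uniprot_id, reactome_id in pairs:
    print(uniprot_id, reactome_id)
```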

It is worth noting here that so far we have only been accessing the 'results' key in the nested dictionary returned by the .json() method of our response object. The other available key is 'metadata' (see the output below), which provides us with information about the request itself, including the specific Cypher query that the platform ran to get these results. If you would like to know more about the use of Cypher in these requests, there is a section dedicated to this at the end of this notebook.

In [5]:

from pprint import pprint
metadata = response_object.json()['metadata']

pprint(metadata)

{'empty_results': False,
 'query': 'MATCH p=(protein:Protein)-[r:PROTEIN_IN_PATHWAY]-(pathway:Pathway) '
          "WHERE protein.uniprot_id IN ['P04637', 'P38398', 'P01375'] RETURN "
          'protein.uniprot_id AS uniprot_id, count(p) AS pathway_count, '
          'collect(pathway.reactome_id) AS pathway_reactome_id',
 'total_seconds': 0.005797}
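Since every response carries both keys, a small helper can split them in one step and flag empty result sets. This is only a sketch: unpack_response is a hypothetical name, and the sample payload is hand-built from the outputs above (with the results abbreviated to a single row) rather than fetched from the API.

```python
def unpack_response(payload):
    """Split a decoded API response into (results, metadata, is_empty)."""
    results = payload.get("results", [])
    metadata = payload.get("metadata", {})
    # Fall back to inspecting the results list if the flag is absent
    is_empty = metadata.get("empty_results", len(results) == 0)
    return results, metadata, is_empty

# Hand-built example payload; values copied from Out[4] and the metadata above
payload = {
    "metadata": {
        "empty_results": False,
        "query": (
            "MATCH p=(protein:Protein)-[r:PROTEIN_IN_PATHWAY]-(pathway:Pathway) "
            "WHERE protein.uniprot_id IN ['P04637', 'P38398', 'P01375'] RETURN "
            "protein.uniprot_id AS uniprot_id, count(p) AS pathway_count, "
            "collect(pathway.reactome_id) AS pathway_reactome_id"
        ),
        "total_seconds": 0.005797,
    },
    "results": [
        {
            "uniprot_id": "P01375",
            "pathway_count": 3,
            "pathway_reactome_id": ["R-HSA-6785807", "R-HSA-6783783", "R-HSA-5357905"],
        }
    ],
}

results, metadata, is_empty = unpack_response(payload)
print(is_empty, metadata["total_seconds"])
```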

2. Epidemiological relationship analysis

In the cell below, we will query EpiGraphDB to get metadata relating to GWAS studies of a target trait: body mass index. Following that, queries will be performed to get pre-computed Mendelian Randomisation (MR) results involving the same trait.

Here we will be using a different HTTP method than before: the GET method, which is in fact easier to use in Python because the parameters can be passed directly as a dictionary. To learn more about the differences between GET and POST, please see this guide.
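With GET, the parameters travel in the URL itself as a query string rather than in the request body. requests builds that string for us from the dictionary, but the standard-library sketch below shows roughly what the resulting URL looks like. It is an approximation for illustration and does not send a request.

```python
from urllib.parse import urlencode

API_URL = "https://api.epigraphdb.org"
endpoint = "/meta/nodes/Gwas/search"
params = {"name": "Body mass index"}

# urlencode converts the dictionary into key=value pairs, escaping spaces
query_string = urlencode(params)

# The query string is appended to the endpoint after a '?'
full_url = f"{API_URL}{endpoint}?{query_string}"
print(full_url)
```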

In [6]:

# 2.1 Getting GWAS studies from EpiGraphDB

# Create a dictionary for the parameters to be passed
params = {
    'name':'Body mass index'
}

# Send the request
endpoint = '/meta/nodes/Gwas/search'
response_object = requests.get(API_URL + endpoint, params=params)
response_object.raise_for_status()

# Store the results of the query and display
result = response_object.json()['results']
gwas_df = pd.json_normalize(result)

gwas_df.head()

Out[6]:

	node.note	node.access	node.year	node.mr	node.author	node.consortium	node.sex	node.priority	node.pmid	node.population	node.unit	node.sample_size	node.nsnp	node.trait	node.id	node.subcategory	node.category	node.sd
0	NA	public	2018	1	Hoffmann TJ	NA	NA	0	30108127	European	NA	315347	27854527	Body mass index	ebi-a-GCST006368	NA	NA	NaN
1	NaN	public	2015	1	Locke AE	NA	Males and Females	1	25673413	Mixed	NA	339224	2555511	Body mass index	ieu-a-2	Anthropometric	Risk factor	4.77
2	NaN	public	2015	1	Locke AE	NA	Males	2	25673413	European	NA	152893	2477659	Body mass index	ieu-a-785	Anthropometric	Risk factor	4.77
3	NaN	public	2015	1	Locke AE	NA	Males and Females	3	25673413	European	NA	322154	2554668	Body mass index	ieu-a-835	Anthropometric	Risk factor	4.77
4	NA	public	2017	1	Akiyama M	NA	NA	0	28892062	East Asian	NA	158284	5952516	Body mass index	ebi-a-GCST004904	NA	NA	NaN

In [7]:

# 2.2 Getting MR results for a trait

# Set parameters
params = {'exposure_trait': 'Body mass index',
          'pval_threshold': 1e-10}

# Send request
endpoint = '/mr'
response_object = requests.get(API_URL + endpoint, params=params)
response_object.raise_for_status()

# Store and display results
result = response_object.json()['results']
BMI_MR_df = pd.json_normalize(result) 

BMI_MR_df.head()

Out[7]:

	exposure.id	exposure.trait	outcome.id	outcome.trait	mr.b	mr.se	mr.method	mr.selection	mr.moescore
0	ieu-a-2	Body mass index	ukb-a-74	Non-cancer illness code self-reported: diabetes	0.034559	0.002418	FE IVW	DF	0.93
1	ieu-a-2	Body mass index	ukb-a-388	Hip circumference	0.724105	0.026588	Simple median	Tophits	0.95
2	ieu-a-2	Body mass index	ukb-a-382	Waist circumference	0.656440	0.024496	Simple median	Tophits	0.94
3	ieu-a-2	Body mass index	ukb-a-35	Comparative height size at age 10	0.136684	0.007909	FE IVW	Tophits	0.94
4	ieu-a-2	Body mass index	ukb-a-34	Comparative body size at age 10	0.365580	0.023556	Simple median	HF	0.87

The dataframe above displays the results of our query. We requested all traits for which an MR analysis using body mass index as the exposure variable returned a causal estimate with a p-value lower than 1e-10. Information regarding the specific MR parameters, as well as the exposure and outcome variables, has been displayed in the table for all traits that matched our search conditions.

In the parameters we set in 2.2, another viable parameter name is 'outcome_trait' which takes the same type of values as 'exposure_trait'. Either one or both of these parameters can be passed during an MR query, which allows users to refine which results are returned to them depending on their own analytical preferences.
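One way to handle the "either one or both" choice is to build the parameter dictionary conditionally. The helper below is a hypothetical sketch: build_mr_params is not part of EpiGraphDB, and its refusal to accept neither trait reflects our reading of the text above rather than documented API behaviour. The outcome trait used in the example is taken from Out[7].

```python
def build_mr_params(pval_threshold, exposure_trait=None, outcome_trait=None):
    """Build the parameter dictionary for an /mr query (illustrative helper)."""
    if exposure_trait is None and outcome_trait is None:
        # Assumption: a query with neither trait is not meaningful here
        raise ValueError("Provide exposure_trait, outcome_trait, or both")
    params = {"pval_threshold": pval_threshold}
    if exposure_trait is not None:
        params["exposure_trait"] = exposure_trait
    if outcome_trait is not None:
        params["outcome_trait"] = outcome_trait
    return params

# Exposure only, as in 2.2
exposure_only = build_mr_params(1e-10, exposure_trait="Body mass index")

# Exposure and outcome together, to narrow the results to a single pair
both = build_mr_params(
    1e-10,
    exposure_trait="Body mass index",
    outcome_trait="Waist circumference",
)
print(exposure_only)
print(both)
```

Either dictionary could then be passed as params= to requests.get, exactly as in cell 2.2.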

3. Looking for literature evidence

Accessing information in the literature is a ubiquitous task in research, be it for novel hypothesis generation or as part of evidence triangulation. EpiGraphDB facilitates fast processing of this information by allowing access to a host of literature-mined relationships that have been structured into semantic triples. These take the general form (subject, predicate, object) and have been generated by applying contemporary natural language processing techniques to a large body of published biomedical research papers. In the following section we will query the API for the literature relationship between a given gene and an outcome trait.

In [8]:

# Establish parameters
params = {
    'gene_name': "IL23R",
    'object_name': "Inflammatory bowel disease"
}

# Send the request
endpoint = "/literature/gene"
response_object = requests.get(API_URL + endpoint, params=params)
response_object.raise_for_status()

# Store the results of the query and display
result = response_object.json()['results']
lit_df = pd.json_normalize(result) 

lit_df

Out[8]:

	pubmed_id	gene.name	st.predicate	st.object_name
0	[17484863, 21155887]	IL23R	NEG_ASSOCIATED_WITH	Inflammatory Bowel Diseases
1	[27852544]	IL23R	AFFECTS	Inflammatory Bowel Diseases
2	[17484863, 19575361, 19496308, 18383521, 18341...	IL23R	ASSOCIATED_WITH	Inflammatory Bowel Diseases
3	[23131344]	IL23R	PREDISPOSES	Inflammatory Bowel Diseases

The dataframe above shows the results of our query: four unique predicates were found between the gene IL23R and the trait Inflammatory bowel disease, and they are displayed in the st.predicate column. The leftmost column contains the PubMed IDs of the papers from which each triple was derived. These IDs allow us to access the respective papers by navigating to https://pubmed.ncbi.nlm.nih.gov/*insert_pubmed_id_here*. In this particular case it seems that ASSOCIATED_WITH is the most common predicate linking our gene to the trait, but we can't see exactly how many papers there are because of how pandas displays lists. Let's add a paper count to the dataframe.

In [9]:

counts = [len(papers_list) for papers_list in lit_df['pubmed_id']]
lit_df['publication_count'] = counts

lit_df

Out[9]:

	pubmed_id	gene.name	st.predicate	st.object_name	publication_count
0	[17484863, 21155887]	IL23R	NEG_ASSOCIATED_WITH	Inflammatory Bowel Diseases	2
1	[27852544]	IL23R	AFFECTS	Inflammatory Bowel Diseases	1
2	[17484863, 19575361, 19496308, 18383521, 18341...	IL23R	ASSOCIATED_WITH	Inflammatory Bowel Diseases	21
3	[23131344]	IL23R	PREDISPOSES	Inflammatory Bowel Diseases	1
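The PubMed IDs in the table can be turned into clickable links using the URL pattern mentioned earlier. The sketch below does this for the two IDs listed against NEG_ASSOCIATED_WITH. The helper name pubmed_url is illustrative, and because it uses string formatting it works whether the API returns the IDs as integers or as strings.

```python
PUBMED_BASE_URL = "https://pubmed.ncbi.nlm.nih.gov/"

def pubmed_url(pubmed_id):
    """Return the PubMed page URL for a given PubMed ID."""
    return f"{PUBMED_BASE_URL}{pubmed_id}"

# IDs copied from the NEG_ASSOCIATED_WITH row of the table above
neg_associated_ids = [17484863, 21155887]
links = [pubmed_url(pubmed_id) for pubmed_id in neg_associated_ids]

for link in links:
    print(link)
```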

4. EpiGraphDB node search

EpiGraphDB stores data as nodes (entities) and edges (relationships) of a wide range of types. The /meta endpoints of the API offer us information about the structure of the graph itself; for example, the available classes of nodes can be listed through the /meta/nodes/list endpoint. Let’s do that now:

In [10]:

# 4.1 Getting a list of available meta-nodes

# Send the request
endpoint = "/meta/nodes/list"
response_object = requests.get(API_URL + endpoint)
response_object.raise_for_status()

# Store the results of the query and display
result = response_object.json()

result

Out[10]:

['Gwas',
 'Disease',
 'Drug',
 'Efo',
 'Event',
 'Gene',
 'Tissue',
 'Literature',
 'Pathway',
 'Protein',
 'SemmedTerm',
 'Variant']

The list above corresponds to EpiGraphDB's meta nodes, whose documentation can be found here along with their available properties.

In the following, we will demonstrate how we can search by name for a node of interest, using the endpoint /meta/nodes/{meta_node}/search, where viable values for {meta_node} are those terms listed above.

In [11]:

# 4.2 Searching for specific entities by name

# Set params 
params = {
    'name': 'breast cancer'
}

# Make request
meta_node = 'Gwas'
endpoint = f"/meta/nodes/{meta_node}/search"
response_object = requests.get(API_URL + endpoint, params=params)
response_object.raise_for_status()

# Convert to pandas
results = pd.json_normalize(response_object.json()['results'])
target_node_id = results['node.id'][3]  # Store one ID for use in the next cell

results[['node.trait', 'node.id', 'node.sample_size', 'node.year', 'node.author']]

Out[11]:

	node.trait	node.id	node.sample_size	node.year	node.author
0	Breast cancer	ebi-a-GCST007236	89677	2015	Michailidou K
1	Breast cancer	ebi-a-GCST004988	139274	2017	Michailidou K
2	Breast cancer (Combined Oncoarray; iCOGS; GWAS...	ieu-a-1126	228951	2017	Michailidou K
3	Breast cancer (GWAS)	ieu-a-1131	32498	2017	Michailidou K
4	Breast cancer (GWAS)	ieu-a-1168	33832	2015	Michailidou K
5	Breast cancer (Oncoarray)	ieu-a-1129	106776	2017	Michailidou K
6	Breast cancer (Survival)	ieu-a-1165	37954	2015	Guo Q
7	Breast cancer (iCOGS)	ieu-a-1162	89677	2015	Michailidou K
8	Breast cancer (iCOGS)	ieu-a-1130	89677	2017	Michailidou K
9	Breast cancer anti-estrogen resistance protein 3	prot-a-234	3301	2018	Sun BB

Above we used the name parameter of the endpoint to search for any GWAS nodes that fuzzily matched our specified string. Several did, and some of their basic node properties are displayed above. Fuzzy matching is useful because you don't need to know the exact name of the entity or its ID in order to look it up.

On the other hand, once you have identified your entity of interest, it is often sensible to move forward using the node's ID for the sake of unambiguity. Fortunately we can also search for traits using their ID, as demonstrated below.

In [12]:

# 4.3 Searching for a node by ID

# Set params
params = {
    'id': target_node_id  # From previous cell
}

# Make request
meta_node = 'Gwas'
endpoint = f"/meta/nodes/{meta_node}/search"
response_object = requests.get(API_URL + endpoint, params=params)
response_object.raise_for_status()

# Convert to pandas
results = pd.json_normalize(response_object.json()['results'])

results

Out[12]:

	node.ncase	node.access	node.year	node.mr	node.author	node.consortium	node.sex	node.priority	node.pmid	node.population	node.unit	node.sample_size	node.nsnp	node.ncontrol	node.trait	node.id	node.subcategory	node.category
0	14910	public	2017	1	Michailidou K	NA	Females	1	29059683	European	NA	32498	10680257	17588	Breast cancer (GWAS)	ieu-a-1131	Cancer	Disease
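The endpoint template used in 4.2 and 4.3 can be wrapped in a small helper that rejects unknown meta node names before any request is sent. This is a sketch only: search_endpoint is a hypothetical name, the list is copied from Out[10] and may change as the platform evolves, and the case-sensitive check is our assumption rather than documented API behaviour.

```python
# Meta node names copied from Out[10]; in practice this list could be
# fetched from /meta/nodes/list instead of being hard-coded
META_NODES = [
    "Gwas", "Disease", "Drug", "Efo", "Event", "Gene",
    "Tissue", "Literature", "Pathway", "Protein", "SemmedTerm", "Variant",
]

def search_endpoint(meta_node):
    """Return the search endpoint for a meta node, validating the name first."""
    if meta_node not in META_NODES:
        raise ValueError(
            f"Unknown meta node {meta_node!r}; expected one of {META_NODES}"
        )
    return f"/meta/nodes/{meta_node}/search"

print(search_endpoint("Gwas"))
print(search_endpoint("Gene"))
```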

Advanced examples: Cypher

Until now, to get information from the platform we have simply been creating a dictionary or JSON object containing our parameters and then sending it to the correct endpoint of the API using the requests library. This is fine practice and the API has been designed specifically to allow this method of use, as we have (non-exhaustively) demonstrated above. It works because the API automatically converts the HTTP requests that it receives into a Cypher query, which it then passes to the Neo4j database on which EpiGraphDB is built. The database passes back the result of the query, which is then returned to us in Python as a response object. Each response object contains metadata that includes the exact Cypher query that was called on the database, as shown in the cell below.

In [13]:

# 4.1 Cypher

params = {
  "gene_name_list": [
    "TP53"
  ]
}
json_params = json.dumps(params)
endpoint = '/mappings/gene-to-protein'
response_object = requests.post(API_URL + endpoint, data=json_params)
response_object.raise_for_status()

# Extract and print the Cypher query
cypher_query = response_object.json()['metadata']['query']
print(cypher_query)

MATCH (gene:Gene)-[gp:GENE_TO_PROTEIN]-(protein:Protein) WHERE gene.name IN ['TP53'] RETURN gene {.ensembl_id, .name}, protein {.uniprot_id}

The text printed above is the exact Cypher query that the API ran behind the scenes for this request; it is the same query as in section 1.1, restricted here to TP53. The basic structure of these queries is as follows:

    MATCH subgraph
    WHERE condition
    RETURN data

Note that the subgraph should take this general form: (node)-[relationship]-(node), but for both nodes and relationships we write them as my_variable_name:Meta_node so that we can access their properties through the variable name we assigned them (my_variable_name), and use those properties to define our conditions and what data we want returned. Information on the available properties for each class of entity can be found in EpiGraphDB's documentation, specifically here for nodes and here for relationships.
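As a rough sketch of that three-part structure, the helper below assembles a query string from its MATCH, WHERE and RETURN parts and reproduces the gene-to-protein query printed earlier. build_cypher is a hypothetical convenience function, not part of EpiGraphDB or Neo4j, and it does no escaping, so it should only be used with strings you have written yourself.

```python
def build_cypher(match, where, returns):
    """Assemble a basic Cypher query from its three clauses."""
    return f"MATCH {match} WHERE {where} RETURN {returns}"

query = build_cypher(
    # Subgraph: (node)-[relationship]-(node), each written as variable:Type
    match="(gene:Gene)-[gp:GENE_TO_PROTEIN]-(protein:Protein)",
    # Condition expressed through a property of the 'gene' variable
    where="gene.name IN ['TP53']",
    # Data to return, again accessed through the variable names
    returns="gene {.ensembl_id, .name}, protein {.uniprot_id}",
)
print(query)
```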

Now let's write and send our own basic query to get traits with high genetic correlation to body mass index:

In [14]:

# 4.2 Writing custom Cypher queries

# Define the target subgraph
cypher_query = 'MATCH (trait1:Gwas)-[corr:BN_GEN_COR]-(trait2:Gwas)'

# Add conditions to the query
cypher_query += ' WHERE trait1.trait = "Body mass index (BMI)"'
cypher_query += ' AND corr.rg > 0.9'

# Add which data we want returned
cypher_query += ' RETURN trait1, trait2, corr {.rg, .p}'

# Put our query into the correct format for a POST request
params = json.dumps({
    'query': cypher_query
})

# Define the target endpoint and send the request
endpoint = '/cypher'
response_object = requests.post(API_URL + endpoint, data=params)
response_object.raise_for_status()

# Display the returned data
results = response_object.json()['results']
results_df = pd.json_normalize(results)

results_df.head()

In our Cypher query, we grabbed a subgraph from the database that comprised nodes representing biomedical traits, with edges between them representing their genetic correlation. The subgraph was then filtered to select any node-edge-node triples where the first node had the .trait property of "Body mass index (BMI)", and where the edge between the nodes had a .rg (genetic correlation score) value greater than 0.9. We then asked Neo4j to return both trait nodes, along with the score and p-value of the correlation between them, for all triples not filtered out by our conditions. Finally, we converted the returned dictionary to a dataframe for ease of viewing.

For more detailed information on Cypher queries, please refer to the official documentation.