As with the SPARQL example, this notebook shows how to use the Neo4j data endpoint to run your own analyses of Knetminer data, by exploiting its representation as a property graph and querying it with the Cypher query language.
This example is a follow-up to the SPARQL example, where we found that the brassinosteroid pathway is the most relevant one for yellow rust.
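Now, let's drill down into the details of this pathway, by following the chain that goes from the pathway to its participating proteins, to the proteins related to them, to their encoding genes, and finally to the publications that mention those genes.
Cypher is particularly good at expressing long chains like the above: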
cypher = """
MATCH path =
(bp:BioProc{ prefName:"Brassinosteroid Mediated Signaling Pathway" }) // The pathway of interest
<- [part:participates_in] - (rprotein:Protein) // the participating proteins
- [xr:h_s_s|xref|ortho*0..1] - (protein:Protein) // + related proteins
<- [enc:enc] - (gene:Gene) // and their encoding genes
- [occ:occ_in] -> (pub:Publication) // and the pubs naming those genes according to text mining
WHERE toFloat ( occ.TFIDF ) > 20 // and filter by significance
RETURN DISTINCT gene, toFloat ( occ.TFIDF ) AS score, pub
ORDER BY score DESC, pub.identifier
LIMIT 50
"""
Another good thing about Cypher and property graphs is that they natively support relations with attributes. In this example, the gene/publication occurrence has a score attached, the TFIDF, which tells how significant the association computed by the text miner is. As you can see, we use it to filter the results.
Note that, for the time being, all the Knetminer (i.e., Ondex) attributes are stored as strings. That's a limitation of the data converters we use, which we will improve in the future. For the moment, you can use functions like toFloat(), as shown here.
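If you prefer to return the raw attribute and convert it on the client side, the same idea can be sketched in plain Python. This is only an illustration: the helper name `to_float` and the sample values are assumptions made for the example, not part of the Knetminer API.

```python
def to_float(value, default=None):
    """Convert a string-typed attribute to a float, or return `default`
    when the value is missing or is not a valid number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


# Hypothetical raw attribute values, as they might come back as strings
raw_scores = ["50.095543", "38.629639", None, "n/a"]

scores = [to_float(s) for s in raw_scores]

# Same kind of filter as the WHERE clause in the query, applied in Python
significant = [s for s in scores if s is not None and s > 20]
```

Doing the conversion in the query, as above, is usually preferable, since it lets the database filter and sort before sending data over the network.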
Let's run the query against the Neo4j endpoint:
from neo4j import GraphDatabase
# See http://knetminer.org/data for a list of available Neo4j endpoints
driver = GraphDatabase.driver( "bolt://knetminer-neo4j.cyverseuk.org:7687" )
session = driver.session()
result = session.run( cypher )
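The object returned by session.run() is a stream that can be consumed only once. A minimal sketch of a way around this is to materialise the records into a list, which can then be iterated as many times as needed. The helper name `fetch_rows` is an assumption made for this example.

```python
def fetch_rows(session, query, **params):
    """Run `query` on `session` and return all the records as a list.

    Keyword arguments are passed to session.run() as query parameters.
    """
    result = session.run(query, **params)
    # Iterating the result consumes the stream; the list keeps the records
    return list(result)


# Possible usage with the session and query defined above:
#   rows = fetch_rows(session, cypher)
#   for record in rows: ...   # can be repeated without re-running the query
```

This trades memory for convenience, which is fine for a result capped by LIMIT 50, but might not be for very large result sets.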
As before, we can load the resulting data into a pandas DataFrame and use it for further computations. Again, let's use it for some nice tabular visualisation:
import pandas as pd
dframe = []
for record in result:
  gene = record [ "gene" ]
  pub = record [ "pub" ]
  score = record [ "score" ]
  dframe.append ( [ pub[ "identifier" ], pub[ "AbstractHeader" ],
    gene[ "prefName" ], gene [ "identifier" ], score ] )
headers = [ "PMID", "Title", "Gene Symbol", "Gene Acc", "Score" ]
dframe = pd.DataFrame ( dframe, columns = headers )
display ( dframe.head ( 10 ) )
| | PMID | Title | Gene Symbol | Gene Acc | Score |
---|---|---|---|---|---|
0 | 28710392 | Comparative study of Arabidopsis PBS1 and a wh... | PBS1 | TRAESCS2D02G284200 | 50.095543 |
1 | 28710392 | Comparative study of Arabidopsis PBS1 and a wh... | PBS1 | TRAESCS2B02G302800 | 50.095543 |
2 | 28710392 | Comparative study of Arabidopsis PBS1 and a wh... | PBS1 | TRAESCS2A02G285700 | 50.095543 |
3 | 16222089 | Cloning, characterization and expression of wh... | EDR1 | TRAESCS1A02G131600 | 38.629639 |
4 | 16222089 | Cloning, characterization and expression of wh... | EDR1 | TRAESCS1D02G136400 | 38.629639 |
5 | 16222089 | Cloning, characterization and expression of wh... | EDR1 | TRAESCS4B02G276200 | 38.629639 |
6 | 16222089 | Cloning, characterization and expression of wh... | EDR1 | TRAESCS4A02G029800 | 38.629639 |
7 | 16222089 | Cloning, characterization and expression of wh... | EDR1 | TRAESCS1B02G154400 | 38.629639 |
8 | 16222089 | Cloning, characterization and expression of wh... | EDR1 | TRAESCS4D02G274400 | 38.629639 |
9 | 24073880 | Mycosphaerella graminicola LysM effector-media... | CERK1 | TRAESCS7D02G265400 | 36.804436 |
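As an example of a further computation, the sketch below counts how many distinct gene accessions each publication mentions. It uses only the ten rows displayed above, copied by hand, so the counts describe this sample and not the full query result. The variable names are illustrative.

```python
from collections import defaultdict

# (PMID, gene symbol, gene accession), copied from the ten rows shown above
rows = [
    ("28710392", "PBS1", "TRAESCS2D02G284200"),
    ("28710392", "PBS1", "TRAESCS2B02G302800"),
    ("28710392", "PBS1", "TRAESCS2A02G285700"),
    ("16222089", "EDR1", "TRAESCS1A02G131600"),
    ("16222089", "EDR1", "TRAESCS1D02G136400"),
    ("16222089", "EDR1", "TRAESCS4B02G276200"),
    ("16222089", "EDR1", "TRAESCS4A02G029800"),
    ("16222089", "EDR1", "TRAESCS1B02G154400"),
    ("16222089", "EDR1", "TRAESCS4D02G274400"),
    ("24073880", "CERK1", "TRAESCS7D02G265400"),
]

# Group the distinct accessions by publication
genes_per_pub = defaultdict(set)
for pmid, symbol, accession in rows:
    genes_per_pub[pmid].add(accession)

# Rank the publications by the number of distinct accessions they mention
ranked = sorted(
    ((pmid, len(accessions)) for pmid, accessions in genes_per_pub.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

With the full DataFrame, the same grouping could be done on the "PMID" and "Gene Acc" columns.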
Let's do something more: let's load the same data into a networkx graph. This can be used for a number of graph, topology and network analyses. Here we simply pass the graph to matplotlib and draw the gene/publication associations, to get an idea of which publications mention the most genes and how the genes are grouped by publication (which hints at the ways they're related):
%matplotlib inline
import warnings
import networkx as nx
import matplotlib.pyplot as plt
import textwrap as tw
warnings.filterwarnings('ignore')
figsz = plt.gcf().get_size_inches()
plt.gcf().set_size_inches ( figsz * 3 )
g = nx.MultiDiGraph()
result = session.run( cypher ) # It's a consumable stream, you need to regenerate it
for record in result:
  gene = record [ "gene" ] # this depends on the JSON being returned by Neo4j
  gid = gene [ "prefName" ] #+ ":" + gene [ "identifier" ]
  pub = record [ "pub" ]
  pid = pub [ "AbstractHeader" ]
  pid = tw.shorten ( pid, width = 40, fix_sentence_endings = True, placeholder = '...' ) \
    + " " + pub [ "prefName" ]
  #pid = pub [ "prefName" ]
  # Simple graph that shows the gene/publication relation
  g.add_node ( gid )
  g.add_node ( pid )
  g.add_edge ( gid, pid )
nx.draw (g, pos = nx.circular_layout ( g ), \
with_labels = True, node_color = 'yellow', node_size = 800, \
edge_color = 'blue', font_size = 14 )
plt.show()
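Beyond drawing, the graph can be queried directly. The sketch below ranks publications by in-degree, using a small graph rebuilt from the ten table rows shown earlier. For brevity it labels publication nodes with their PMID rather than the shortened title, which is an assumption of this example. Because gene nodes are keyed by gene symbol, the several accessions sharing a symbol become parallel edges in a MultiDiGraph, so the in-degree counts gene/publication rows. Converting to a DiGraph collapses them and counts distinct symbols instead.

```python
import networkx as nx

# (gene symbol, PMID) pairs, one per table row shown earlier
edges = (
    [("PBS1", "28710392")] * 3
    + [("EDR1", "16222089")] * 6
    + [("CERK1", "24073880")]
)

multi = nx.MultiDiGraph()
multi.add_edges_from(edges)   # parallel edges are kept

simple = nx.DiGraph(multi)    # parallel edges are collapsed into one

pubs = {pmid for _, pmid in edges}

# Publications ranked by the number of incoming gene edges
pub_ranking = sorted(
    ((pmid, multi.in_degree(pmid)) for pmid in pubs),
    key=lambda item: item[1],
    reverse=True,
)
```

Which of the two counts is more useful depends on whether you care about individual gene accessions or about distinct gene symbols.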