Using Knetminer Neo4j with Jupyter, a Demo

As with the SPARQL example, this notebook shows how to use the Knetminer Neo4j data endpoint to run your own analyses of Knetminer data, exploiting their representation as a property graph and querying them with the Cypher query language.

This example follows up on the SPARQL example, where we found that the brassinosteroid pathway is the most relevant one for yellow rust.

Now, let's drill down into the details of this pathway, by:

first finding the proteins participating in the pathway,
then expanding those to related proteins (e.g., by sequence similarity or homology),
and finally fetching the publications that mention the genes encoding these pathway proteins. This last step is based on text mining data.

Cypher is particularly good at expressing long chains like the one above:

In [1]:

cypher = """
MATCH path =
  (bp:BioProc{ prefName:"Brassinosteroid Mediated Signaling Pathway" }) // The pathway of interest
  <- [part:participates_in] - (rprotein:Protein) // the participating proteins
  -  [xr:h_s_s|xref|ortho*0..1] - (protein:Protein) // + related proteins
  <-  [enc:enc] - (gene:Gene) // and their encoding genes
  - [occ:occ_in] -> (pub:Publication) // and the pubs naming those genes according to text mining
WHERE toFloat ( occ.TFIDF ) > 20 // and filter by significance
RETURN DISTINCT gene, toFloat ( occ.TFIDF ) AS score, pub
ORDER BY score DESC, pub.identifier
LIMIT 50
"""

Another good thing about Cypher and property graphs is that they natively manage relations with attributes. In this example, the gene/publication occurrence has a score attached, the TFIDF, which tells how significant the association computed by the text miner is. As you can see, we use it to filter the results.

Note that, for the time being, all the Knetminer (i.e., Ondex) attributes are stored as strings. This is a limitation of the data converters we use, which we will improve in the future. For the moment, you can use functions like toFloat(), as shown here.
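The same string-to-number conversion can also be done on the Python side, once a record has been fetched. The following is only a minimal sketch: the helper name `to_float_or_none` and the sample attribute dictionary are hypothetical and are not part of Knetminer or of the Neo4j driver. Converting inside the query, as above, remains preferable when the value is used for filtering, since the filter then runs on the server.

```python
def to_float_or_none(value):
    """Convert a string attribute to float; return None if it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

# Hypothetical relation attributes, all stored as strings as described above
occ_attrs = {"TFIDF": "50.095543", "evidence": "text mining"}

score = to_float_or_none(occ_attrs["TFIDF"])            # numeric string
not_numeric = to_float_or_none(occ_attrs["evidence"])   # non-numeric string
missing = to_float_or_none(occ_attrs.get("absent"))     # attribute not present
print(score, not_numeric, missing)
```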

Let's run the query against the Neo4j endpoint:

In [2]:

from neo4j import GraphDatabase
# See http://knetminer.org/data for a list of available Neo4j endpoints
driver = GraphDatabase.driver( "bolt://knetminer-neo4j.cyverseuk.org:7687" )
session = driver.session()
result = session.run( cypher )

As before, we can load the resulting data into a pandas DataFrame and use it for further computations. Again, let's use it for some nice tabular visualisation:

In [3]:

import pandas as pd
rows = []
for record in result:
  gene = record [ "gene" ]
  pub = record [ "pub" ]
  score = record [ "score" ]
  # One table row per gene/publication pair
  rows.append ( [ pub[ "identifier" ], pub[ "AbstractHeader" ],
                  gene[ "prefName" ], gene [ "identifier" ], score ] )
headers = [ "PMID", "Title", "Gene Symbol", "Gene Acc", "Score" ]
dframe = pd.DataFrame ( rows, columns = headers )
display ( dframe.head ( 10 ) )

	PMID	Title	Gene Symbol	Gene Acc	Score
0	28710392	Comparative study of Arabidopsis PBS1 and a wh...	PBS1	TRAESCS2D02G284200	50.095543
1	28710392	Comparative study of Arabidopsis PBS1 and a wh...	PBS1	TRAESCS2B02G302800	50.095543
2	28710392	Comparative study of Arabidopsis PBS1 and a wh...	PBS1	TRAESCS2A02G285700	50.095543
3	16222089	Cloning, characterization and expression of wh...	EDR1	TRAESCS1A02G131600	38.629639
4	16222089	Cloning, characterization and expression of wh...	EDR1	TRAESCS1D02G136400	38.629639
5	16222089	Cloning, characterization and expression of wh...	EDR1	TRAESCS4B02G276200	38.629639
6	16222089	Cloning, characterization and expression of wh...	EDR1	TRAESCS4A02G029800	38.629639
7	16222089	Cloning, characterization and expression of wh...	EDR1	TRAESCS1B02G154400	38.629639
8	16222089	Cloning, characterization and expression of wh...	EDR1	TRAESCS4D02G274400	38.629639
9	24073880	Mycosphaerella graminicola LysM effector-media...	CERK1	TRAESCS7D02G265400	36.804436
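As an illustration of the "further computations" mentioned above, the sketch below counts how many distinct gene accessions each publication mentions and keeps the best score per publication. It is self-contained, so it rebuilds a small frame from a few rows copied from the output above (titles omitted) rather than reusing `dframe`; the names `sample` and `summary` are introduced here for illustration only.

```python
import pandas as pd

# A few rows copied from the output above (titles omitted for brevity)
rows = [
    ("28710392", "PBS1", "TRAESCS2D02G284200", 50.095543),
    ("28710392", "PBS1", "TRAESCS2B02G302800", 50.095543),
    ("16222089", "EDR1", "TRAESCS1A02G131600", 38.629639),
    ("16222089", "EDR1", "TRAESCS1D02G136400", 38.629639),
    ("16222089", "EDR1", "TRAESCS4B02G276200", 38.629639),
    ("24073880", "CERK1", "TRAESCS7D02G265400", 36.804436),
]
sample = pd.DataFrame(rows, columns=["PMID", "Gene Symbol", "Gene Acc", "Score"])

# Distinct gene accessions and top score per publication, best score first
summary = (
    sample.groupby("PMID")
    .agg(genes=("Gene Acc", "nunique"), max_score=("Score", "max"))
    .sort_values("max_score", ascending=False)
)
print(summary)
```

In the notebook itself, the same `groupby` call could be applied directly to `dframe`, which has the same column names.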

Let's do something more: let's load the same data into a networkx graph. This can be used for a number of graph, topology and network analyses. Here we simply pass the graph to matplotlib and draw the gene/publication associations, to get an idea of which publications mention more genes, and how the genes are grouped by publication (which hints at the ways they're related):

In [4]:

%matplotlib inline
import warnings
import networkx as nx
import matplotlib.pyplot as plt
import textwrap as tw

warnings.filterwarnings('ignore')

figsz = plt.gcf().get_size_inches()
plt.gcf().set_size_inches ( figsz * 3 ) 

g = nx.MultiDiGraph()
result = session.run( cypher ) # It's a consumable stream, you need to regenerate it
for record in result:
  gene = record [ "gene" ] # this depends on the JSON being returned by Neo4j
  gid = gene [ "prefName" ] #+ ":" + gene [ "identifier" ]

  pub = record [ "pub" ]
  pid = pub [ "AbstractHeader" ]
  pid = tw.shorten ( pid, width = 40, fix_sentence_endings = True, placeholder = '...' ) \
                     + " " + pub [ "prefName" ]
  #pid = pub [ "prefName" ]

  # Simple graph that shows the gene/publication relation
  g.add_node ( gid )
  g.add_node ( pid )
  g.add_edge ( gid, pid )

nx.draw (g, pos = nx.circular_layout ( g ), \
         with_labels = True, node_color = 'yellow', node_size = 800, \
         edge_color = 'blue', font_size = 14 )
plt.show()
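Beyond drawing, the graph can be queried directly, for example to rank publications by the number of genes they mention. The sketch below is an assumption-laden illustration rather than part of the original analysis: it uses gene accessions instead of gene symbols as node identifiers, a plain undirected `nx.Graph` instead of the `MultiDiGraph` above (so repeated pairs do not create parallel edges), a `PMID:` prefix and a `kind` node attribute that are introduced here, and a handful of pairs copied from the table shown earlier.

```python
import networkx as nx

# (gene accession, publication) pairs copied from the table shown earlier;
# the "PMID:" prefix is added here only to make the node labels readable
pairs = [
    ("TRAESCS2D02G284200", "PMID:28710392"),
    ("TRAESCS2B02G302800", "PMID:28710392"),
    ("TRAESCS2A02G285700", "PMID:28710392"),
    ("TRAESCS1A02G131600", "PMID:16222089"),
    ("TRAESCS1D02G136400", "PMID:16222089"),
    ("TRAESCS7D02G265400", "PMID:24073880"),
]

sample_graph = nx.Graph()
for gene_acc, pmid in pairs:
    # Tag each node with its type, so the two kinds can be told apart later
    sample_graph.add_node(gene_acc, kind="gene")
    sample_graph.add_node(pmid, kind="publication")
    sample_graph.add_edge(gene_acc, pmid)

# Degree of a publication node = number of genes linked to it
pub_degrees = {
    node: sample_graph.degree(node)
    for node, data in sample_graph.nodes(data=True)
    if data["kind"] == "publication"
}
ranked = sorted(pub_degrees.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```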