Using Knetminer Neo4j with Jupyter, a Demo

As with the SPARQL example, this notebook shows how to use the Neo4j data endpoint to run your own analyses of Knetminer data, exploiting their representation as a property graph and querying them with the Cypher query language.

This example is a follow-up to the SPARQL example, where we found that the brassinosteroid pathway is the most relevant one for yellow rust.

Now, let's drill down into the details of this pathway, by:

  • first finding the proteins participating in the pathway,
  • then expanding those to related proteins (e.g., by sequence similarity or homology),
  • and finally fetching the publications that mention the genes encoding these pathway proteins. This last step is based on text-mining data.

Cypher is particularly good at expressing long chains like the one above:

In [1]:
cypher = """
MATCH path =
  (bp:BioProc{ prefName:"Brassinosteroid Mediated Signaling Pathway" }) // The pathway of interest
  <- [part:participates_in] - (rprotein:Protein) // the participating proteins
  -  [xr:h_s_s|xref|ortho*0..1] - (protein:Protein) // + related proteins
  <-  [enc:enc] - (gene:Gene) // and their encoding genes
  - [occ:occ_in] -> (pub:Publication) // and the pubs naming those genes according to text mining
WHERE toFloat ( occ.TFIDF ) > 20 // and filter by significance
RETURN DISTINCT gene, toFloat ( occ.TFIDF ) AS score, pub
ORDER BY score DESC, pub.identifier
LIMIT 50
"""

Another good thing about Cypher and property graphs is that they natively manage relations with attributes. In this example, the gene/publication occurrence has a score attached, the TFIDF, which tells how significant the association computed by the text miner is. As you can see, we use it to filter the results.

Note that, for the time being, all the Knetminer (i.e., Ondex) attributes are stored as strings. This is a limitation of the data converters we use, which we will improve in the future. For the moment, you can use functions like toFloat(), as shown here.
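Because the attributes arrive as strings, the same conversion is sometimes needed on the Python side too, for instance when a property is read from a returned node or relation rather than converted inside the query. Here is a minimal sketch of that, where to_float_or_none is a hypothetical helper name (it is not part of Knetminer or of the Neo4j driver) and the property map is made up to mirror the string-typed TFIDF attribute:

```python
def to_float_or_none(value):
    """Convert a string-typed attribute to float, or return None if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # TypeError covers a missing attribute (None), ValueError a non-numeric string
        return None

# Hypothetical property map, shaped like the string-typed attributes described above
occ_props = {"TFIDF": "50.095543"}

score = to_float_or_none(occ_props.get("TFIDF"))
# Same threshold as the WHERE clause in the query
is_significant = score is not None and score > 20
```

Returning None rather than raising keeps a single malformed value from interrupting a loop over many records; whether that is the right trade-off depends on how clean you expect the data to be.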

Let's run the query against the Neo4j endpoint:

In [2]:
from neo4j import GraphDatabase
# See http://knetminer.org/data for a list of available Neo4j endpoints
driver = GraphDatabase.driver( "bolt://knetminer-neo4j.cyverseuk.org:7687" )
session = driver.session()
result = session.run( cypher )
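The result returned by session.run() is a stream that can be consumed only once. If you want to reuse the records in several cells, one option is to copy them into a list before the session is closed. The following is a sketch under that assumption; fetch_records is a hypothetical helper name, and the driver argument is assumed to be the object created by GraphDatabase.driver():

```python
def fetch_records(driver, query):
    """Run a Cypher query and return all of its records as a list.

    `driver` is assumed to be a Neo4j driver, as returned by GraphDatabase.driver().
    """
    # The session is released when the with-block exits
    with driver.session() as session:
        # Materialise the stream while the session is still open
        return list(session.run(query))
```

With the driver and query defined above, this would be called as records = fetch_records ( driver, cypher ), and the resulting list can then be iterated as many times as needed.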

As before, we can load the resulting data into a pandas DataFrame and use it for further computations. Again, let's use it for some nice tabular visualisation:

In [3]:
import pandas as pd
dframe = []
for record in result:
  gene = record [ "gene" ]
  pub = record [ "pub" ]
  score = record [ "score" ]
  dframe.append ( [ pub[ "identifier" ], pub[ "AbstractHeader" ], \
                     gene[ "prefName" ], gene [ "identifier" ], score ] )
headers = [ "PMID", "Title", "Gene Symbol", "Gene Acc", "Score" ]
dframe = pd.DataFrame ( dframe, columns = headers )
display ( dframe.head ( 10 ) )
   PMID      Title                                              Gene Symbol  Gene Acc            Score
0  28710392  Comparative study of Arabidopsis PBS1 and a wh...  PBS1         TRAESCS2D02G284200  50.095543
1  28710392  Comparative study of Arabidopsis PBS1 and a wh...  PBS1         TRAESCS2B02G302800  50.095543
2  28710392  Comparative study of Arabidopsis PBS1 and a wh...  PBS1         TRAESCS2A02G285700  50.095543
3  16222089  Cloning, characterization and expression of wh...  EDR1         TRAESCS1A02G131600  38.629639
4  16222089  Cloning, characterization and expression of wh...  EDR1         TRAESCS1D02G136400  38.629639
5  16222089  Cloning, characterization and expression of wh...  EDR1         TRAESCS4B02G276200  38.629639
6  16222089  Cloning, characterization and expression of wh...  EDR1         TRAESCS4A02G029800  38.629639
7  16222089  Cloning, characterization and expression of wh...  EDR1         TRAESCS1B02G154400  38.629639
8  16222089  Cloning, characterization and expression of wh...  EDR1         TRAESCS4D02G274400  38.629639
9  24073880  Mycosphaerella graminicola LysM effector-media...  CERK1        TRAESCS7D02G265400  36.804436
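As an example of a further computation on this table, we can count how many distinct gene accessions each publication mentions. The sketch below uses only the standard library and the (PMID, Gene Acc) pairs from the ten rows displayed above; genes_per_publication is a hypothetical helper name introduced for illustration:

```python
from collections import defaultdict

def genes_per_publication(rows):
    """Count distinct gene accessions per PMID.

    `rows` is an iterable of (pmid, gene_accession) pairs.
    Returns (pmid, count) tuples, highest count first.
    """
    genes_by_pmid = defaultdict(set)
    for pmid, gene_acc in rows:
        genes_by_pmid[pmid].add(gene_acc)
    counts = {pmid: len(genes) for pmid, genes in genes_by_pmid.items()}
    # Most-mentioned publications first, ties broken by PMID
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# The (PMID, Gene Acc) pairs from the ten rows displayed above
rows = [
    ("28710392", "TRAESCS2D02G284200"),
    ("28710392", "TRAESCS2B02G302800"),
    ("28710392", "TRAESCS2A02G285700"),
    ("16222089", "TRAESCS1A02G131600"),
    ("16222089", "TRAESCS1D02G136400"),
    ("16222089", "TRAESCS4B02G276200"),
    ("16222089", "TRAESCS4A02G029800"),
    ("16222089", "TRAESCS1B02G154400"),
    ("16222089", "TRAESCS4D02G274400"),
    ("24073880", "TRAESCS7D02G265400"),
]

ranking = genes_per_publication(rows)
# → [("16222089", 6), ("28710392", 3), ("24073880", 1)]
```

On the DataFrame itself, dframe.groupby ( "PMID" ) [ "Gene Acc" ].nunique() should give the same counts.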

Let's do something more: let's load the same data into a networkx graph. This can be used for a number of graph, topology or network analyses. Here we simply pass the graph to matplotlib and draw the gene/publication associations, to get an idea of which publications mention the most genes, and of how the genes are grouped by publication (which hints at the ways they're related):

In [4]:
%matplotlib inline
import warnings
import networkx as nx
import matplotlib.pyplot as plt
import textwrap as tw

warnings.filterwarnings('ignore')

figsz = plt.gcf().get_size_inches()
plt.gcf().set_size_inches ( figsz * 3 ) 

g = nx.MultiDiGraph()
result = session.run( cypher ) # It's a consumable stream, you need to regenerate it
for record in result:
  gene = record [ "gene" ] # this depends on the JSON being returned by Neo4j
  gid = gene [ "prefName" ] #+ ":" + gene [ "identifier" ]

  pub = record [ "pub" ]
  pid = pub [ "AbstractHeader" ]
  pid = tw.shorten ( pid, width = 40, fix_sentence_endings = True, placeholder = '...' ) \
                     + " " + pub [ "prefName" ]
  #pid = pub [ "prefName" ]

  # Simple graph that shows the gene/publication relation
  g.add_node ( gid )
  g.add_node ( pid )
  g.add_edge ( gid, pid )

nx.draw (g, pos = nx.circular_layout ( g ), \
         with_labels = True, node_color = 'yellow', node_size = 800, \
         edge_color = 'blue', font_size = 14 )
plt.show()
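Beyond drawing, one simple analysis in this direction is to split the gene/publication graph into connected groups, that is, sets of genes linked to each other through shared publications. The sketch below does this with a plain breadth-first search over an edge list, as a standard-library stand-in for what a graph library offers. The function name connected_groups and the toy edges are hypothetical and are not taken from the Knetminer data:

```python
from collections import defaultdict, deque

def connected_groups(edges):
    """Return the connected components of an undirected graph given as (node, node) pairs."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search from each node not yet assigned to a group
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        groups.append(group)
    return groups

# Hypothetical toy edges (gene, publication), for illustration only
edges = [("geneA", "pub1"), ("geneB", "pub1"), ("geneB", "pub2"), ("geneC", "pub3")]
groups = connected_groups(edges)
# geneA and geneB end up in the same group because they share pub1
```

With the networkx graph built above, nx.connected_components ( g.to_undirected() ) should be the library equivalent, since connected components are defined for undirected graphs.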