Using Knetminer SPARQL with Jupyter, a Demo

With Knetminer data available through a SPARQL endpoint, we can run our own data analyses, in addition to using the intuitive functionality of the Knetminer web application.

Here is a simple example. Let's investigate the question: which biological processes are involved in yellow rust?

One way to find an answer is to start from the genes mentioned in publications about yellow rust, then follow the proteins encoded by those genes, and see which GO biological processes those proteins are annotated with.

If we look at the knowledge graph that Knetminer has for wheat, we soon discover that we also have to consider proteins related to the gene-encoded proteins, since proteins with similar sequences or other public cross-references are often relevant.

In SPARQL and using our BioKNO Ontology, this translates to the following query:

In [1]:

query = """
PREFIX bk: <http://knetminer.org/data/rdf/terms/biokno/>
PREFIX bkr: <http://knetminer.org/data/rdf/resources/>
PREFIX bka: <http://knetminer.org/data/rdf/terms/biokno/attributes/>
PREFIX bkg: <http://knetminer.org/data/rdf/resources/graphs/>

SELECT DISTINCT ?bioProcName (COUNT (?geneName) AS ?genes)
FROM bkg:poaceae # The endpoint has many datasets, we're looking into the Poaceae (grasses) only
WHERE
{
  # The publications of interest
  ?pub a bk:Publication;
    bka:AbstractHeader ?title.

  FILTER ( CONTAINS ( ?title, "yellow rust" ) )

  # The genes mentioned by the publications
  ?gene bk:occ_in ?pub;
    a bk:Gene;
    bk:prefName ?geneName.
  
  # They encode proteins and there might be related proteins. Let's consider relation chains of
  # 0 (the encoded protein is directly involved) to 1 hop, as set by {0,1} below.
  # Predicates are repeated with the ^ prefix, to consider both directions
  # h_s_s := "has similar sequence"
  ?gene bk:enc ?protein.
  ?protein (bk:h_s_s|bk:xref|bk:ortho|^bk:h_s_s|^bk:xref|^bk:ortho){0,1} ?rprotein.

  ?rprotein bk:participates_in ?bioProc.
  
  ?bioProc bk:prefName ?bioProcName.
}
GROUP BY ?bioProcName
ORDER BY DESC ( ?genes )
LIMIT 100
"""
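
To make the property path in the query more concrete, here is a toy sketch in plain Python of what `{0,1}` over a set of predicates and their inverses selects: the starting protein itself (zero hops) plus anything one hop away, in either direction. The protein identifiers and edges below are invented for illustration and are not taken from the Knetminer data; the function name `related_proteins` is likewise hypothetical.

```python
# Toy edges as (subject, predicate, object); all identifiers are made up
EDGES = [
    ("P1", "h_s_s", "P2"),
    ("P3", "xref", "P1"),
    ("P2", "ortho", "P4"),
    ("P1", "other_rel", "X1"),  # predicate not in the path, so it is ignored
]

# The predicates listed in the property path
RELATED = {"h_s_s", "xref", "ortho"}

def related_proteins(protein, edges=EDGES, predicates=RELATED):
    """Return the protein plus everything one hop away, in either direction."""
    found = {protein}  # zero hops: the encoded protein itself
    for subj, pred, obj in edges:
        if pred not in predicates:
            continue
        if subj == protein:
            found.add(obj)   # forward direction, like bk:h_s_s
        elif obj == protein:
            found.add(subj)  # inverse direction, like ^bk:h_s_s
    return found

print(sorted(related_proteins("P1")))
```

In this toy graph, P4 is two hops away from P1, so it is left out, which mirrors how the upper bound in the path limits how far the query wanders from the encoded protein.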

Now we can run the query against our endpoint, using the SPARQLWrapper library. Let's also convert the result into a simple list of rows and render it as a table.

In [2]:

from SPARQLWrapper import SPARQLWrapper2

# Go with the query
sparql = SPARQLWrapper2 ( "http://knetminer-data.cyverseuk.org/lodestar/sparql" )
sparql.setQuery ( query )

# Clean it up
# bindings is a list of dicts, one per result row, like { <select-variable>: <value> }
result = sparql.query().bindings
# Every value is a Value object (read its .value), unless you already converted it in SPARQL
result = [ [ r['bioProcName'].value, int ( r['genes'].value ) ] for r in result ]

# Render nicely
import pandas as pd
dframe = pd.DataFrame ( result, columns = [ "GO Bio Proc", "# Genes" ] )
display ( dframe )

	GO Bio Proc	# Genes
0	Brassinosteroid Mediated Signaling Pathway	60
1	Protein Phosphorylation	53
2	Nodulation	36
3	Innate Immune Response	33
4	Detection Of Brassinosteroid Stimulus	26
5	Regulation Of Seedling Development	24
6	Brassinosteroid Homeostasis	24
7	Positive Regulation Of Flower Development	24
8	Pollen Exine Formation	24
9	Microtubule Bundle Formation	24
10	Anther Wall Tapetum Cell Differentiation	24
11	Response To UV-B	24
12	Leaf Development	24
13	Negative Regulation Of Cell Death	24
14	Skotomorphogenesis	24
15	Defense Response	19
16	Plant-type Hypersensitive Response	18
17	Protein Autophosphorylation	18
18	Microsporogenesis	6
19	Megasporogenesis	6
20	Anther Wall Tapetum Development	6
21	Regulation Of Growth	5
22	Positive Regulation Of Innate Immune Response	5
23	Regulation Of Defense Response To Fungus	5
24	Cell Differentiation	5
25	Somatic Embryogenesis	5
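
If the shape of the bindings is unclear, the conversion step can be tried offline. The sketch below assumes only what the cell above relies on: each row is a dict keyed by variable name, and each value exposes a `.value` attribute holding a string. The stand-in objects are built with `SimpleNamespace` rather than real SPARQLWrapper classes, and the helper name `bindings_to_rows` is hypothetical; the two sample rows reuse values from the table above.

```python
from types import SimpleNamespace

def bindings_to_rows(bindings):
    """Turn SPARQLWrapper2-style bindings into [process name, gene count] rows."""
    return [
        [b["bioProcName"].value, int(b["genes"].value)]  # counts arrive as strings
        for b in bindings
    ]

# Stand-ins that mimic the shape of sparql.query().bindings
fake_bindings = [
    {
        "bioProcName": SimpleNamespace(value="Innate Immune Response"),
        "genes": SimpleNamespace(value="33"),
    },
    {
        "bioProcName": SimpleNamespace(value="Defense Response"),
        "genes": SimpleNamespace(value="19"),
    },
]

rows = bindings_to_rows(fake_bindings)
print(rows)
```

The resulting list of rows has the same layout as `result` above, so it can be passed to `pd.DataFrame` in the same way.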

Cool! Now let's plot these counts as a chart:

In [3]:

%matplotlib inline

import matplotlib.pyplot as plt

# Let's fix a few visualisation things
#plt.rcParams [ "font.size" ] = "20"

# Including figure size
figsz = plt.gcf().get_size_inches()
plt.gcf().set_size_inches ( figsz * 3 ) 

ylabels = [ row [0] for row in result ]
values = [ row [ 1 ] for row in result ]

# Reverse the order, so that the largest count is at the top (there are probably better ways to do it in matplotlib...)
ypos = [ len( ylabels ) - 1 - i for i, _ in enumerate ( ylabels ) ]

# A horizontal bar chart
plt.barh ( ypos, values, color = 'blue' )
#plt.ylabel ( "GO Bio Proc" )
plt.xlabel( "#Genes" )
plt.title ( "GO Biological Processes about Yellow Rust" )
plt.yticks( ypos, ylabels )

plt.show()
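
Some of the process names are long enough to crowd the left margin of the chart. One possible refinement, sketched here with the standard-library `textwrap` module, is to break long labels over several lines before handing them to `plt.yticks`. The helper name `wrap_labels` and the width of 25 characters are arbitrary choices for illustration.

```python
import textwrap

def wrap_labels(labels, width=25):
    """Break long labels onto several lines so they fit beside the bars."""
    return [textwrap.fill(label, width=width) for label in labels]

wrapped = wrap_labels([
    "Positive Regulation Of Innate Immune Response",
    "Leaf Development",
])
print(wrapped)
```

In the cell above, this would be used as `plt.yticks( ypos, wrap_labels ( ylabels ) )`; short labels pass through unchanged.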