With Knetminer available as SPARQL, we can do our own data analyses, in addition to using the intuitive functionality of the Knetminer web application.
Here it is a simple exmple. Let's do some investigation on the question: which biological processes are involved in yellow rust?
One way to figure out an answer is to start from genes mentioned in publications, then follow the proteins encoded by such genes, and see the GO biological processes those proteins are annotated with.
If we look ad the knowledge graph that Knetminer has for wheat, we soon discover that actually we also have to consider proteins related to gene-encoded proteins, since proteins with similar sequences or other public cross references are often relevant.
In SPARQL and using our BioKNO Ontology, this translates to the following query:
query = """
PREFIX bk: <http://knetminer.org/data/rdf/terms/biokno/>
PREFIX bkr: <http://knetminer.org/data/rdf/resources/>
PREFIX bka: <http://knetminer.org/data/rdf/terms/biokno/attributes/>
PREFIX bkg: <http://knetminer.org/data/rdf/resources/graphs/>
SELECT DISTINCT ?bioProcName (COUNT (?geneName) AS ?genes)
FROM bkg:poaceae # The endpoint has many datasets, we're looking into the graminaceae only
WHERE
{
# The publications of interest
?pub a bk:Publication;
bka:AbstractHeader ?title;
FILTER ( CONTAINS ( ?title, "yellow rust" ) )
# The genes mentioned by the publications
?gene bk:occ_in ?pub;
a bk:Gene;
bk:prefName ?geneName.
# They encode proteins and there might be related proteins. Let's consider relation chains of
# 0 (the encoded protein is directly involved) to 2.
# predicates are repeated with the ^ prefix, to consider both directions
# h_s_s := "has similar sequence"
?gene bk:enc ?protein.
?protein (bk:h_s_s|bk:xref|bk:ortho|^bk:h_s_s|^bk:xref|^bk:ortho){0,1} ?rprotein.
?rprotein bk:participates_in ?bioProc.
?bioProc bk:prefName ?bioProcName.
}
GROUP BY ?bioProcName
ORDER BY DESC ( ?genes )
LIMIT 100
"""
Now we can invoke the query against our endpoint, using the SPARQLWrapper libray. Let's also convert the result to a convenient matrix and render it as a nice table.
from SPARQLWrapper import SPARQLWrapper2
# Go with the query
sparql = SPARQLWrapper2 ( "http://knetminer-data.cyverseuk.org/lodestar/sparql" )
sparql.setQuery ( query )
# Clean it up
# it's a list of tuples, each tuple is like [ $select-variable: <value> ]
result = sparql.query().bindings
# Every value is a resource structure (unless you already mapped it in SPARQL)
result = [ [ r['bioProcName'].value, int ( r['genes'].value ) ] for r in result ]
# Render nicely
import pandas as pd
dframe = pd.DataFrame ( result, columns = [ "GO Bio Proc", "# Genes" ] )
display ( dframe )
GO Bio Proc | # Genes | |
---|---|---|
0 | Brassinosteroid Mediated Signaling Pathway | 60 |
1 | Protein Phosphorylation | 53 |
2 | Nodulation | 36 |
3 | Innate Immune Response | 33 |
4 | Detection Of Brassinosteroid Stimulus | 26 |
5 | Regulation Of Seedling Development | 24 |
6 | Brassinosteroid Homeostasis | 24 |
7 | Positive Regulation Of Flower Development | 24 |
8 | Pollen Exine Formation | 24 |
9 | Microtubule Bundle Formation | 24 |
10 | Anther Wall Tapetum Cell Differentiation | 24 |
11 | Response To UV-B | 24 |
12 | Leaf Development | 24 |
13 | Negative Regulation Of Cell Death | 24 |
14 | Skotomorphogenesis | 24 |
15 | Defense Response | 19 |
16 | Plant-type Hypersensitive Response | 18 |
17 | Protein Autophosphorylation | 18 |
18 | Microsporogenesis | 6 |
19 | Megasporogenesis | 6 |
20 | Anther Wall Tapetum Development | 6 |
21 | Regulation Of Growth | 5 |
22 | Positive Regulation Of Innate Immune Response | 5 |
23 | Regulation Of Defense Response To Fungus | 5 |
24 | Cell Differentiation | 5 |
25 | Somatic Embryogenesis | 5 |
Cool! Let's see them better in a chart:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib
import textwrap
# Let's fix a few visualisation things
#plt.rcParams [ "font.size" ] = "20"
# Including figure size
figsz = plt.gcf().get_size_inches()
plt.gcf().set_size_inches ( figsz * 3 )
ylabels = [ row [0] for row in result ]
values = [ row [ 1 ] for row in result ]
# reverse order, probably there are better ways to do it in mathplot...
ypos = [ len( ylabels ) - 1 - i for i, _ in enumerate ( ylabels ) ]
# An horizontal bar chart
plt.barh ( ypos, values, color = 'blue' )
#plt.ylabel ( "GO Bio Proc" )
plt.xlabel( "#Genes" )
plt.title ( "GO Biological Processes about Yellow Rust" )
plt.yticks( ypos, ylabels )
plt.show()