Literals refer to simple values (numerical values, strings, booleans, dates, etc.).
Resources refer to complex objects identified by an IRI (Internationalized Resource Identifier, i.e. a URI that allows international characters). Note that URLs are IRIs pointing to web-accessible documents or data. IRIs can be shortened with a PREFIX. As an example, <http://my/super/vocab/my_term> can be shortened to ns:my_term if ns is defined as a prefix for http://my/super/vocab/.
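To make the prefix mechanism concrete, here is a minimal sketch in plain Python. The `expand` helper and the `prefixes` dictionary are hypothetical names introduced only for illustration; RDF libraries normally handle prefix expansion for you.

```python
# Hypothetical prefix table: prefix name -> namespace IRI
prefixes = {"ns": "http://my/super/vocab/"}

def expand(prefixed_name: str, prefixes: dict) -> str:
    """Expand a prefixed name such as 'ns:my_term' into a full IRI."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix}")
    # The full IRI is simply the namespace followed by the local name
    return prefixes[prefix] + local_name

full_iri = expand("ns:my_term", prefixes)
print(full_iri)
```

The expansion is plain string concatenation, which is why the namespace IRI usually ends with `/` or `#`.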
An RDF statement is a triple: {subject, predicate, object}.
Go through https://www.w3.org/TR/rdf11-primer/ for more details on RDF.
Prefixes are declared with @prefix: given @prefix ns: <http://my_voc#> . the IRI <http://my_voc#term> can be written as ns:term.
A . at the end terminates each triple: <subject> <predicate> <object> .
A ; lets several predicate-object pairs share the same subject:
s p1 o1 ;
  p2 o2 .
A , lets several objects share the same subject and predicate:
s p o1, o2, o3 .
Turtle syntax:
<http://HG37> rdf:type <http://human_genome> .
<http://sample1> <http://is_aligned_with> <http://HG37> .
<http://sample1> rdfs:comment "Sample 1 from Study X [...]"^^xsd:string .
or
<http://HG37> rdf:type <http://human_genome> .
<http://sample1> <http://is_aligned_with> <http://HG37> ;
rdfs:comment "Sample 1 from Study X [...]"^^xsd:string .
family:has_mother, family:has_father, family:has_sister
@prefix family: <http://family.org/> .
<http://John> family:has_mother <http://Mary> .
<http://Mickael> family:has_father <http://Mark> .
<http://Mark> family:has_sister <http://Mary> .
SPARQL is the standard language for querying multiple data sources expressed in RDF. The principle is to define a graph pattern that is matched against an RDF graph.
Triple Patterns (TPs) are like RDF triples, except that the subject, the predicate and the object may each be a variable. Variables are prefixed with a ?.
Triple pattern
?x <is_a_variant_of> <RAC1> .
RDF graph
<SNP:123> <is_a_variant_of> <NEMO> .
<SNP:rs527330002> <is_a_variant_of> <RAC1> .
<SNP:rs527330002> <refers_to_organism> <http://www.uniprot.org/taxonomy/9606> .
<SNP:rs61753123> <is_a_variant_of> <RAC1> .
Bindings of variables ?x
?x = <SNP:rs527330002>
?x = <SNP:rs61753123>
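The matching shown above can be sketched in a few lines of plain Python. This is only an illustration of the principle, not how a SPARQL engine is implemented; the `match` function and the tuple-based graph representation are assumptions made for this sketch.

```python
# RDF graph from the example above, as (subject, predicate, object) tuples
graph = [
    ("<SNP:123>", "<is_a_variant_of>", "<NEMO>"),
    ("<SNP:rs527330002>", "<is_a_variant_of>", "<RAC1>"),
    ("<SNP:rs527330002>", "<refers_to_organism>", "<http://www.uniprot.org/taxonomy/9606>"),
    ("<SNP:rs61753123>", "<is_a_variant_of>", "<RAC1>"),
]

def match(pattern, graph):
    """Return one {variable: value} binding per triple matching the pattern."""
    solutions = []
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable matches any value, but must stay consistent
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                # A constant term must be identical to the triple's term
                break
        else:
            solutions.append(binding)
    return solutions

bindings = match(("?x", "<is_a_variant_of>", "<RAC1>"), graph)
for b in bindings:
    print(f"?x = {b['?x']}")
```

Replacing the object `<RAC1>` with a second variable such as `?y` would return one binding per `<is_a_variant_of>` triple, with both variables bound.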
We will now use the RDFLib package to parse RDF data and run some very basic SPARQL queries.
from rdflib import Graph
# RDF graph, in turtle syntax, stored in a string
my_rdf_data = """
@prefix ns: <http://my_voc/> .
@prefix snp: <http://my_snps/> .
snp:123 ns:is_a_variant_of "NEMO" .
snp:rs527330002 ns:is_a_variant_of "RAC1" .
snp:rs527330002 ns:refers_to_organism <http://www.uniprot.org/taxonomy/9606> .
snp:rs61753123 ns:is_a_variant_of "RAC1" .
"""
# Initialization of the in-memory RDF graph, RDFlib Graph object
kg = Graph()
# Parsing of the RDF data
kg.parse(data=my_rdf_data, format='turtle')
# Printing the size of the graph and serializing it again.
print(f'the knowledge graph contains {len(kg)} triples\n')
print(kg.serialize(format="turtle"))
We now execute a simple query to search for all "variants" of RAC1.
q = """
"""
res = kg.query(q)
for row in res:
    print(f"{row['x']} is a variant of RAC1")
# SOLUTION:
!echo "ClNFTEVDVCA/eCBXSEVSRSB7CiAgICA/eCBuczppc19hX3ZhcmlhbnRfb2YgUkFDMSAuCn0K" | base64 --decode
Generalize this query to show all is_a_variant_of relations. You can use two variables, ?x and ?y.
q = """
"""
res = kg.query(q)
for row in res:
    print(f"{row['x']} is a variant of {row['y']}")
# SOLUTION:
!echo "ClNFTEVDVCA/eCA/eSBXSEVSRSB7CiAgICA/eCBuczppc19hX3ZhcmlhbnRfb2YgP3kgLgp9Cg==" | base64 --decode
Search for the name of the gene that has a variant referring to the http://www.uniprot.org/taxonomy/9606 organism.
q = """
"""
res = kg.query(q)
for row in res:
    print(row['y'])
# SOLUTION:
!echo "ClNFTEVDVCA/eSBXSEVSRSB7CiAgICA/eCBuczpyZWZlcnNfdG9fb3JnYW5pc20gPGh0dHA6Ly93d3cudW5pcHJvdC5vcmcvdGF4b25vbXkvOTYwNj4gLgogICAgP3ggbnM6aXNfYV92YXJpYW50X29mID95IC4KfQo=" | base64 --decode