#!/usr/bin/env python
# coding: utf-8

# # Practical introduction to RDF and SPARQL

# ## Reminder: IRIs and Literals
#
# **Literals** refer to simple values (numerical values, strings, booleans, dates, etc.).
#
# **Resources** are complex objects identified by an **IRI** (Internationalized Resource Identifier, i.e. a URI allowing international characters). Note that URLs are IRIs pointing to web-accessible documents/data. IRIs can be shortened with a **PREFIX**. As an example, `<http://my/super/vocab/my_term>` can be shortened as `ns:my_term` if `ns` is defined as a prefix for `http://my/super/vocab/`.
#
#
# ## Reminder: RDF, triples
# 1. an RDF **statement** represents a **relationship** between two resources: a **subject** and an **object**
# 1. relationships are directional and are called **predicates** (or RDF properties)
# 1. (logical) statements are called **triples**: {`subject`, `predicate`, `object`}
# 1. a set of triples forms a **directed labelled graph**: subject nodes are IRIs, edges are predicates (IRIs only), object nodes are IRIs or Literals.
#
# Go through https://www.w3.org/TR/rdf11-primer/ for more details on RDF.
#
# ## Reminder: Turtle syntax
# - header to define prefixes
#     - example: with `@prefix ns: <http://my_voc#> .`, `<http://my_voc#term>` can be written as `ns:term`
# - generally one line per triple with a `.` at the end: `s p o .`
# - possible shortcut to share the same subject: `;`
# ```
# s p1 o1 ;
#   p2 o2 .
# ```
# - possible shortcut to share the same subject-predicate: `,`
# ```
# s p o1, o2, o3 .
# ```
#
# ## Example
# turtle syntax:
# ```ruby
# rdf:type .
# .
# rdfs:comment "Sample 1 from Study X [...]"^^xsd:string .
# ```
#
# or
#
# ```turtle
# rdf:type .
# ;
# rdfs:comment "Sample 1 from Study X [...]"^^xsd:string .
# ```

# ## Question 1
#
# 1. Consider the following RDF properties: `family:has_mother`, `family:has_father`, `family:has_sister`.
# 2. Using only these predicates, represent the following family with RDF triples:
#     - *The mother of John is Mary*,
#     - *Mickael is the son of Mark*,
#     - *John and Mickael are cousins (because Mark and Mary are siblings)*.
# 3. Go to https://www.ldf.fi/service/rdf-grapher
# 4. Generate a graphical representation of the RDF graph.

# ## Answer
#
# ```turtle
# @prefix family: .
#
# family:has_mother .
# family:has_father .
# family:has_sister .
# ```
#
# ![:scale 50%](fig/family.png)

# ---
# # SPARQL hands-on
# SPARQL is the standard language to query multiple data sources expressed in RDF. The principle is to define a **graph pattern** that is matched against an RDF graph.

# ## Definition
# **Triple Patterns** (TPs) are like RDF triples except that each of the *subject*, *predicate* and *object* may be a **variable**. Variables are prefixed with a `?`.
#
# ## Example
# Triple pattern
# ```ruby
# ?x .
# ```
#
# RDF graph
# ```ruby
# .
# .
# .
# .
# ```
#
# Bindings of variable `?x`
# ```ruby
# ?x =
# ?x =
# ```
#

# In[ ]:


# ## Definition
# **Basic Graph Patterns** (BGPs) consist of a set of triple patterns to be matched against an RDF graph.

# ## Example
# Basic graph pattern
# ```ruby
# ?x .
# ?x ?z
# ```
# ![:scale 60%](fig/bgp.png)
#
#
# # 4 Types of SPARQL queries
# - **SELECT**: returns the variable values (i.e. the bindings) for each graph pattern match;
# - **CONSTRUCT**: returns an RDF graph constructed by substituting variables in a set of triple patterns;
# - **ASK**: returns a boolean (true/false) indicating whether a query pattern matches or not;
# - **DESCRIBE**: returns an RDF graph that describes the resources found (resources neighborhood).
#
#
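# To make the definitions above more concrete, here is a minimal pure-Python sketch of how a
# basic graph pattern can be matched against a set of triples, and how SELECT and ASK differ
# in what they return. It is a toy illustration, not how a real SPARQL engine works: triples
# are plain string tuples, the `ex:` names are hypothetical identifiers for the family of
# Question 1, and the helper names (`match_triple`, `match_bgp`, `select`, `ask`) are
# assumptions made for this example.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Hypothetical toy graph: the family of Question 1, with made-up `ex:` identifiers.
GRAPH: List[Triple] = [
    ("ex:John", "family:has_mother", "ex:Mary"),
    ("ex:Mickael", "family:has_father", "ex:Mark"),
    ("ex:Mark", "family:has_sister", "ex:Mary"),
]


def match_triple(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            # A constant must be identical to the term in the triple.
            return None
    return extended


def match_bgp(patterns: Sequence[Triple], graph: Sequence[Triple]) -> List[Binding]:
    """Return every binding that satisfies all triple patterns at once."""
    solutions: List[Binding] = [{}]
    for pattern in patterns:
        next_solutions: List[Binding] = []
        for binding in solutions:
            for triple in graph:
                extended = match_triple(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


def select(variables: Sequence[str], patterns: Sequence[Triple],
           graph: Sequence[Triple]) -> List[Binding]:
    """SELECT-like: keep only the requested variables of each match."""
    return [{v: b[v] for v in variables} for b in match_bgp(patterns, graph)]


def ask(patterns: Sequence[Triple], graph: Sequence[Triple]) -> bool:
    """ASK-like: True if the pattern matches at least once."""
    return len(match_bgp(patterns, graph)) > 0


# "Whose father has a sister, and who is she?"
bgp = [("?c", "family:has_father", "?f"), ("?f", "family:has_sister", "?a")]
print(select(["?c", "?a"], bgp, GRAPH))  # [{'?c': 'ex:Mickael', '?a': 'ex:Mary'}]
print(ask([("ex:John", "family:has_father", "?f")], GRAPH))  # False
```

# The shared variable `?f` is what joins the two triple patterns: a candidate triple is rejected
# as soon as it would give `?f` a second, different value.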
# Additional features: optional BGPs, union, filters, aggregate functions, negation, service, *etc.*
#
# # Anatomy of a SPARQL query
#
# ![:scale 95%](fig/anat.png)

# ## Question 2
# We will now use the RDFLib package to parse RDF data and run some very basic SPARQL queries.

# In[ ]:


from rdflib import Graph

# RDF graph, in turtle syntax, stored in a string
my_rdf_data = """
@prefix ns: .
@prefix snp: .

snp:123 ns:is_a_variant_of "NEMO" .
snp:rs527330002 ns:is_a_variant_of "RAC1" .
snp:rs527330002 ns:refers_to_organism <http://www.uniprot.org/taxonomy/9606> .
snp:rs61753123 ns:is_a_variant_of "RAC1" .
"""

# Initialization of the in-memory RDF graph, RDFLib Graph object
kg = Graph()

# Parsing of the RDF data
kg.parse(data=my_rdf_data, format='turtle')

# Printing the size of the graph and serializing it again.
print(f'the knowledge graph contains {len(kg)} triples\n')
print(kg.serialize(format="turtle"))


# We now execute a simple query to search for all "variants" of `RAC1`.

# In[ ]:


q = """
"""

res = kg.query(q)
for row in res:
    print(f"{row['x']} is a variant of RAC1")


# In[ ]:


# SOLUTION:
get_ipython().system('echo "ClNFTEVDVCA/eCBXSEVSRSB7CiAgICA/eCBuczppc19hX3ZhcmlhbnRfb2YgUkFDMSAuCn0K" | base64 --decode')


# ## Question 3
# Generalize this query to show all *is a variant of* relations. You can use two variables `?x` and `?y`.

# In[ ]:


q = """
"""

res = kg.query(q)
for row in res:
    print(f"{row['x']} is a variant of {row['y']}")


# In[ ]:


# SOLUTION:
get_ipython().system('echo "ClNFTEVDVCA/eCA/eSBXSEVSRSB7CiAgICA/eCBuczppc19hX3ZhcmlhbnRfb2YgP3kgLgp9Cg==" | base64 --decode')


# ## Question 4
# Search for the name of the gene that has a variant referring to the `http://www.uniprot.org/taxonomy/9606` organism.

# In[ ]:


q = """
"""

res = kg.query(q)
for row in res:
    print(row['y'])


# In[ ]:


# SOLUTION:
get_ipython().system('echo "ClNFTEVDVCA/eSBXSEVSRSB7CiAgICA/eCBuczpyZWZlcnNfdG9fb3JnYW5pc20gPGh0dHA6Ly93d3cudW5pcHJvdC5vcmcvdGF4b25vbXkvOTYwNj4gLgogICAgP3ggbnM6aXNfYV92YXJpYW50X29mID95IC4KfQo=" | base64 --decode')


# In[ ]:
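
# The solution cells rely on IPython and on a shell that provides `base64`. If either is
# unavailable, the same encoded string can probably be decoded with Python's standard library
# instead. The sketch below does this for the Question 4 solution; the variable names
# `encoded_q4` and `decoded_q4` are assumptions made for this example.

```python
import base64

# Encoded solution string, copied verbatim from the Question 4 solution cell.
encoded_q4 = "ClNFTEVDVCA/eSBXSEVSRSB7CiAgICA/eCBuczpyZWZlcnNfdG9fb3JnYW5pc20gPGh0dHA6Ly93d3cudW5pcHJvdC5vcmcvdGF4b25vbXkvOTYwNj4gLgogICAgP3ggbnM6aXNfYV92YXJpYW50X29mID95IC4KfQo="

# b64decode returns bytes; the query text is decoded as UTF-8.
decoded_q4 = base64.b64decode(encoded_q4).decode("utf-8")
print(decoded_q4)
```

# The decoded text can then be pasted into the empty `q = """ """` string of the matching cell.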