Introduction to SPARQL, RDF, and LOD

While many databases, services, or museums might expose their data via a web API, there can be limitations. Matthew Lincoln has an excellent tutorial at The Programming Historian that walks us through some of these differences, but the key one is in the way the data is represented. When data is described using the Resource Description Framework (RDF), the resource (the 'thing') is described via a series of relationships, rather than as rows in a table or keys having values.

Information is in the relationships. It's a network. It's a graph. Thus, every 'thing' in this graph can have its own uniform resource identifier (URI) that lives as a location on the internet. Information can then be created by making statements that use these URIs, similarly to how English grammar creates meaning: subject verb object. Or, in RDF-speak, 'subject predicate object', also known as a triple. In this way, data in different places can be linked together by referencing the elements they have in common. This is Linked Open Data (LOD). The access point for interrogating LOD is called an 'endpoint'.

Finally, SPARQL is an acronym for SPARQL Protocol and RDF Query Language (yes, it's one of those kinds of acronyms).

In this notebook, we're not using Python or R directly. Instead, we've set up a 'kernel' (think of that as the 'engine' for the notebook) that already includes everything necessary to set up and run SPARQL queries. (For reference, the kernel code is here). Both R and Python can interact with and query endpoints, and manipulate linked open data, but for the sake of learning a bit of what one can do with SPARQL, this notebook keeps all of that ancillary code tucked away. The followup notebook to this one shows you how to use R to do some basic manipulations of the query results.

Simple RDF example

Here, we are following Matthew Lincoln's tutorial.

Let's look at his example, which concerns the painting, 'The Nightwatch'.

<The Nightwatch> <was created by> <Rembrandt van Rijn> .

This statement has three elements:

  • the subject: <The Nightwatch>
  • the predicate: <was created by>
  • the object: <Rembrandt van Rijn>

Lincoln combines these, and other such statements, into a (pseudo-)RDF database like so:

<The Nightwatch> <was created by> <Rembrandt van Rijn> .
<The Nightwatch> <was created in> <1642> .
<The Nightwatch> <has medium> <oil on canvas> .
<Rembrandt van Rijn> <was born in> <1606> .
<Rembrandt van Rijn> <has nationality> <Dutch> .
<Johannes Vermeer> <has nationality> <Dutch> .
<Woman with a Balance> <was created by> <Johannes Vermeer> .
<Woman with a Balance> <has medium> <oil on canvas> .

Such RDF databases describe nodes and links, so we can visualize them as a graph like so:

A network visualization of the pseudo-RDF shown above. Arrows indicate the ‘direction’ of the predicate. For example, that ‘Woman with a Balance was created by Vermeer’, and not the other way around.
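To make the 'subject predicate object' idea concrete, the pseudo-RDF above can be held as plain Python tuples and searched by pattern. This is only a sketch of the idea, not how a real triple store works; the helper name `match` is hypothetical, and `None` stands in for what SPARQL would write as a variable such as ?s.

```python
# Lincoln's pseudo-RDF statements as (subject, predicate, object) tuples.
TRIPLES = [
    ("The Nightwatch", "was created by", "Rembrandt van Rijn"),
    ("The Nightwatch", "was created in", "1642"),
    ("The Nightwatch", "has medium", "oil on canvas"),
    ("Rembrandt van Rijn", "was born in", "1606"),
    ("Rembrandt van Rijn", "has nationality", "Dutch"),
    ("Johannes Vermeer", "has nationality", "Dutch"),
    ("Woman with a Balance", "was created by", "Johannes Vermeer"),
    ("Woman with a Balance", "has medium", "oil on canvas"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every non-None part of the pattern."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Which things have the medium 'oil on canvas'?" -> leave the subject open.
paintings_in_oil = [s for s, _, _ in match(TRIPLES, p="has medium", o="oil on canvas")]
print(paintings_in_oil)
```

Leaving the predicate and object open instead, as in `match(TRIPLES, s="The Nightwatch")`, returns every statement about one thing, which is the same exploratory move the first SPARQL query in this notebook makes.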

But there is a difference between the pseudo-RDF that Lincoln shows us, and what actual RDF might look like:

<> <>  <>

The human-readable version requires more statements:

<> <> "The Nightwatch" .

<> <> "was created by" .

<> <> "Rembrandt van Rijn" .

This is just a quick introduction; please do examine Lincoln's tutorial for more details. But now, let's explore how this notebook can be used to write some queries.

In [2]:
# Jupyter notebooks have various built-in commands called 'magics' that are accessed with the '%' character; these depend on the kernel.
# Let's see what the SPARQL kernel has
%lsmagics
Available magics:
%auth %display %endpoint %format %graph %lang %log %lsmagics %outfile %prefix %qparam %show

%auth (basic|digest|none) <username> <passwd> : send HTTP authentication
%display raw | table [withtypes] | diagram [svg|png] [withliterals] : set display format
%endpoint <url> : set SPARQL endpoint. **REQUIRED**
%format JSON | N3 | XML | any | default : set requested result format
%graph <uri> : set default graph for the queries
%lang <lang> [...] | default | all : language(s) preferred for labels
%log critical | error | warning | info | debug : set logging level
%lsmagics : list all magics
%outfile <filename> | NONE : save raw output to a file (use "%d" in name to add cell number, "NONE" to cancel saving)
%prefix <name> [<uri>] : set (or delete) a persistent URI prefix for all queries
%qparam <name> [<value>] : add (or delete) a persistent custom parameter to the endpoint query
%show <n> | all : maximum number of shown results
In [10]:
# when using this notebook, the first thing we have to do - or rather, the first time we run _any_ query,
# is to tell it what endpoint we're going to use. Let's use the British Museum's:

Endpoint set to:

Lincoln suggests that when we first encounter a new RDF graph, we explore the network of relationships from an example object to understand what is going on in the database and to see what is available for querying. Since we're querying the British Museum, let's take the Rosetta Stone as our example.

In the query below, p and o stand for 'predicate' and 'object'. Thus, we're building up a query that asks, 'show me every statement structured <The Rosetta Stone> <predicate> <object>'. When the results load, you can right-click on each result (which is a URI, remember) to see what we've discovered. This could give you the necessary information to construct more complicated queries.

N.B. The British Museum SPARQL endpoint and its underlying infrastructure do not appear to be well supported. Results are sometimes flaky, and the endpoint is sometimes unreachable.

In [11]:
SELECT ?p ?o
WHERE {
  <> ?p ?o .
}
ptype otype
Total: 234, Shown: 20
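The kernel hides the plumbing, but it can help to see roughly what any SPARQL client does with a query like this one. Under the SPARQL 1.1 Protocol, a query can be sent as an HTTP GET request with the query text in a `query` parameter. The sketch below only builds that request URL and sends nothing; the endpoint and subject URIs are example.org placeholders (assumptions, not the British Museum's real addresses), and the function names are hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Placeholder addresses: substitute a real endpoint and a real subject URI.
ENDPOINT = "https://example.org/sparql"
SUBJECT = "https://example.org/id/object/123"


def describe_query(subject_uri):
    """Build the 'show me every predicate and object for this subject' query."""
    return f"SELECT ?p ?o\nWHERE {{\n  <{subject_uri}> ?p ?o .\n}}"


def request_url(endpoint, query):
    """Percent-encode the query text into the 'query' parameter of a GET URL."""
    return endpoint + "?" + urlencode({"query": query})


query = describe_query(SUBJECT)
url = request_url(ENDPOINT, query)
print(url)
```

A library such as `requests` or `urllib.request` could then fetch that URL, usually with an `Accept` header naming the result format wanted.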

In this next query, we look for objects in the collection that have the label 'fibula'.

In [91]:
%display table
PREFIX bmo: <>
PREFIX skos: <>

SELECT ?object
WHERE {
  # Search for all values of ?object that have a given "object type"
  ?object bmo:PX_object_type ?object_type .

  # That object type should have the label "fibula"
  ?object_type skos:prefLabel "fibula" .
}

Wikidata is another endpoint we can query. Below we have a query by Sebastian Heath that extracts some of the genealogical data on Roman emperors contained in that database. The prefixed name wd:Q842606 is shorthand for a full Wikidata URI, which describes the concept 'Roman Emperor'. wdt:P39 is a predicate meaning 'position held'.
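Prefixed names such as wd:Q842606 are only abbreviations: the part before the colon stands for a namespace URI, and the part after it is appended to that namespace. A minimal sketch of the expansion, assuming the two namespaces that the Wikidata Query Service predefines for `wd:` and `wdt:` (the function name `expand` is hypothetical):

```python
# Namespaces predefined by the Wikidata Query Service for these two prefixes.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn 'prefix:local' into a full URI by looking up the prefix."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + local


print(expand("wd:Q842606"))
print(expand("wdt:P39"))
```

On other endpoints the same expansion happens through the PREFIX lines at the top of a query, which is why those lines have to be present before a prefixed name can be used.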

In [75]:
%display table

SELECT ?emperorLabel ?emperor_dob ?childLabel
       ?motherLabel ?maternalGrandfatherLabel ?maternalGrandmotherLabel
       ?emperor ?child ?mother ?maternalGrandfather ?maternalGrandmother WHERE {
  ?emperor wdt:P39 wd:Q842606 . #p39: position held. Q842606: Roman Emperor
  ?emperor wdt:P569 ?emperor_dob . # p569: date of birth
  ?child wdt:P22 ?emperor . #p22: father
  ?child wdt:P25 ?mother .  #p25: mother
  OPTIONAL { ?mother wdt:P22 ?maternalGrandfather }
  OPTIONAL { ?mother wdt:P25 ?maternalGrandmother }
 # automatic label expander
 SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} ORDER BY ?emperor_dob
Endpoint set to:
Display: table
emperorLabel emperor_dob childLabel motherLabel maternalGrandfatherLabel maternalGrandmotherLabel emperor child mother maternalGrandfather maternalGrandmother
Augustus -062-01-01T00:00:00Z Julia the Elder Scribonia Lucius Scribonius Libo
Tiberius -041-11-14T00:00:00Z Tiberius Julius Caesar Julia the Elder Augustus Scribonia
Tiberius -041-11-14T00:00:00Z Drusus Julius Caesar Vipsania Agrippina Marcus Vipsanius Agrippa Pomponia Caecilia Attica
Claudius -009-07-30T00:00:00Z Claudia Octavia Messalina Marcus Valerius Messalla Barbatus Domitia Lepida the Younger
Claudius -009-07-30T00:00:00Z Britannicus Messalina Marcus Valerius Messalla Barbatus Domitia Lepida the Younger
Claudius -009-07-30T00:00:00Z Claudius Drusus Plautia Urgulanilla Marcus Plautius Silvanus
Claudius -009-07-30T00:00:00Z Claudia Antonia Aelia Paetina Sextus Aelius Catus
Vespasian 0009-11-15T00:00:00Z Titus Domitilla the Elder Flavius Liberalis
Vespasian 0009-11-15T00:00:00Z Domitian Domitilla the Elder Flavius Liberalis
Vespasian 0009-11-15T00:00:00Z Domitilla the Younger Domitilla the Elder Flavius Liberalis
Caligula 0012-08-29T00:00:00Z Julia Drusilla Milonia Caesonia Vistilia
Vitellius 0015-09-22T00:00:00Z Vitellius Germanicus Galeria Fundana
Nero 0037-12-13T00:00:00Z Claudia Augusta Poppaea Sabina Titus Ollius
Titus 0039-12-28T00:00:00Z Julia Flavia Marcia Furnilla Quintus Marcius Barea Sura
Titus 0039-12-28T00:00:00Z Julia Flavia Arrecina Tertulla
Antoninus Pius 0086-09-17T00:00:00Z Faustina the Younger Faustina the Elder Marcus Annius Verus Rupilia
Marcus Aurelius Antoninus 0121-04-25T00:00:00Z Commodus Faustina the Younger Antoninus Pius Faustina the Elder
Marcus Aurelius Antoninus 0121-04-25T00:00:00Z Lucilla Faustina the Younger Antoninus Pius Faustina the Elder
Marcus Aurelius Antoninus 0121-04-25T00:00:00Z Annia Cornificia Faustina Minor Faustina the Younger Antoninus Pius Faustina the Elder
Marcus Aurelius Antoninus 0121-04-25T00:00:00Z Marcus Annius Verus Caesar Faustina the Younger Antoninus Pius Faustina the Elder
Total: 82, Shown: 20

Let's visualize these relationships. We're running the same query, but we use CONSTRUCT to create the nodes and edges that represent these familial relationships. We want to show 'emperor x is the father of person y' and 'person a is the mother of person y'. That gives us the structure. To get the content, the WHERE part of the query first retrieves those individuals who were emperor, and then retrieves the data on their children.

Once you've run the query, use ctrl+f to find someone familiar, like Augustus (Q1405). In the resulting graph, an edge labeled 'P22', e.g. Q1405 -> P22 -> Q2259, can be read as 'Q1405 is the father of Q2259', or rather, 'Augustus is the father of Julia the Elder'.

Roman genealogy... it's complicated!
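What CONSTRUCT does can be sketched in a few lines: for every row of results from the graph pattern, the template is filled in and the resulting triples are collected into a set, so repeated rows do not produce repeated edges. The sketch below is an illustration of that idea rather than the endpoint's actual implementation. The identifiers Q1405, Q2259 and Q233444 are taken from the diagram this query produced; the names `TEMPLATE` and `construct` are hypothetical.

```python
# Rows as the graph pattern might return them. The second row is a deliberate
# duplicate, as can happen when OPTIONAL patterns match more than once.
bindings = [
    {"emperor": "Q1405", "child": "Q2259", "mother": "Q233444"},
    {"emperor": "Q1405", "child": "Q2259", "mother": "Q233444"},
]

# The CONSTRUCT template: variable names in subject and object position.
TEMPLATE = [
    ("emperor", "wdt:P22", "child"),  # drawn as: emperor is the father of child
    ("mother", "wdt:P25", "child"),   # drawn as: mother is the mother of child
]


def construct(template, rows):
    """Fill the template once per row and collect the triples into a set."""
    graph = set()
    for row in rows:
        for s_var, predicate, o_var in template:
            graph.add((row[s_var], predicate, row[o_var]))
    return graph


edges = construct(TEMPLATE, bindings)
for s, p, o in sorted(edges):
    print(f"{s} -> {p} -> {o}")
```

Note that the template points the P22 and P25 edges from parent to child, which is the reverse of how Wikidata itself states them (the child is the subject); that reversal is what lets the diagram be read as 'is the father of'.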

In [122]:
%display diagram

CONSTRUCT {
  ?emperor wdt:P22 ?child . # edge drawn as: emperor is the father of child
  ?mother wdt:P25 ?child .  # edge drawn as: mother is the mother of child
} WHERE {
  ?emperor wdt:P39 wd:Q842606 . # P39: position held. Q842606: Roman Emperor
  ?child wdt:P22 ?emperor . # P22: father
  ?child wdt:P25 ?mother .  # P25: mother
  OPTIONAL { ?mother wdt:P22 ?maternalGrandfather }
  OPTIONAL { ?mother wdt:P25 ?maternalGrandmother }
}
Endpoint set to:
Display: svg
[Network diagram of the query results: nodes are Wikidata identifiers (for example Q1405, Q2259 and Q233444), joined by directed edges labelled wdt:P22 or wdt:P25, such as Q1405 -> wdt:P22 -> Q2259 and Q233444 -> wdt:P25 -> Q2259.]


Another excellent SPARQL endpoint is the Nomisma portal for numismatic materials.

In [131]:
Endpoint set to:

Now, if you visit that endpoint in a web browser, you'll find a query builder with the following information already preloaded:

PREFIX rdf: <>
PREFIX bio: <>
PREFIX crm: <>
PREFIX dcmitype:    <>
PREFIX dcterms: <>
PREFIX foaf:    <>
PREFIX geo: <>
PREFIX nm:  <>
PREFIX nmo: <>
PREFIX org: <>
PREFIX osgeo:   <>
PREFIX rdac:    <>
PREFIX skos:    <>
PREFIX spatial: <>
PREFIX void:    <>
PREFIX xsd: <>

SELECT * WHERE {
  ?s ?p ?o
} LIMIT 100

All those prefixes abbreviate the ontologies being used to describe the materials. The ?s ?p ?o are the subject, predicate, and object that we're going to search for. Let's run some of the example queries that Nomisma can handle. Since Roman emperors are often depicted on coins, let's see which emperors are present in Nomisma.

In [132]:
%display table 
PREFIX rdf:	<>
PREFIX bio:	<>
PREFIX crm:	<>
PREFIX dcmitype:	<>
PREFIX dcterms:	<>
PREFIX foaf:	<>
PREFIX geo:	<>
PREFIX nm:	<>
PREFIX nmo:	<>
PREFIX org:	<>
PREFIX osgeo:	<>
PREFIX rdac:	<>
PREFIX skos:	<>
PREFIX spatial: <>
PREFIX void:	<>
PREFIX xsd:	<>

SELECT ?uri ?label WHERE {
?uri a foaf:Person ;
  skos:prefLabel ?label ;         
  org:hasMembership ?membership .
?membership org:role nm:roman_emperor .
FILTER(langMatches(lang(?label), "EN"))
}
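The FILTER in this query keeps only labels whose language tag matches "EN". In SPARQL, langMatches compares case-insensitively and also accepts regional subtags, so "en" and "en-GB" both pass, while the special range "*" accepts any label that has a language tag at all. A rough Python equivalent of that rule follows; the function name `lang_matches` is hypothetical and the labels are made up for illustration, not Nomisma data.

```python
def lang_matches(tag, lang_range):
    """Basic language-range matching, mirroring SPARQL's langMatches()."""
    tag = tag.lower()
    lang_range = lang_range.lower()
    if lang_range == "*":
        return tag != ""  # "*" accepts any label that carries a language tag
    return tag == lang_range or tag.startswith(lang_range + "-")


# Made-up (label, language tag) pairs; an empty tag means "no language given".
labels = [("Augustus", "en"), ("Auguste", "fr"), ("Augustus", "en-GB"), ("Augustus", "")]
english = [(text, tag) for text, tag in labels if lang_matches(tag, "EN")]
print(english)
```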

We can also do spatial queries; this one looks for mints within 50 km of Athens.

It also specifies the format in which we want the results returned, and writes these results to a JSON file for further manipulation.

In [137]:
%format json  
%display table 
%outfile mints.json 
PREFIX rdf:	<>
PREFIX dcterms:	<>
PREFIX geo:	<>
PREFIX nm:	<>
PREFIX nmo:	<>
PREFIX skos:	<>
PREFIX spatial: <>
PREFIX xsd:	<>

SELECT ?loc ?lat ?long ?mint ?label WHERE {
   ?loc spatial:nearby (37.974722 23.7225 50 'km') ;
        geo:lat ?lat ;
        geo:long ?long .
   ?mint geo:location ?loc ;
         skos:prefLabel ?label ;
         a nmo:Mint .
  FILTER langMatches (lang(?label), 'en')
}
Return format: JSON
Display: table
Output file: mints.json
loc lat long mint label
41.9 12.5 Rome
38.733333 35.483333 Caesarea in Cappadocia
40.766667 29.916667 Nicomedia
45.483168 16.371388 Siscia
34.751899 36.724237 Emisa
44.716471 21.166605 Viminacium
45.338611 36.468056 Panticapaeum
45.189 36.825 Phangoria
40.777626 24.703702 Thasos
39.641667 22.416667 Thessalian League
37.316667 13.583333 Agrigentum
37.063156 14.258219 Gela
37.285478 14.998115 Leontini
38.192251 15.556634 Messana
37.083333 15.283333 Syracuse
32.3242756 53.1738281 Parthia
33.137222 44.517222 Seleuceia ad Tigrim
36.2 36.15 Antioch
36.916667 34.9 Tarsus
43.296854 5.382499 Massalia
Total: 1798, Shown: 20
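For the 'further manipulation' step, the saved file can be read back with a few lines of Python. The sketch below assumes that mints.json follows the W3C SPARQL 1.1 Query Results JSON Format (a `head.vars` list of column names plus a `results.bindings` list of rows); that is an assumption about what the kernel writes, so check the file before relying on it. The embedded sample is illustrative only: the label and coordinates echo the first row shown above, while the two URIs are example.org placeholders.

```python
import json

# Illustrative sample in the SPARQL JSON results layout (URIs are placeholders).
SAMPLE = """
{
  "head": {"vars": ["loc", "lat", "long", "mint", "label"]},
  "results": {"bindings": [
    {
      "loc":   {"type": "uri", "value": "https://example.org/loc/1"},
      "lat":   {"type": "literal", "value": "41.9"},
      "long":  {"type": "literal", "value": "12.5"},
      "mint":  {"type": "uri", "value": "https://example.org/mint/1"},
      "label": {"type": "literal", "xml:lang": "en", "value": "Rome"}
    }
  ]}
}
"""


def rows_from_sparql_json(text):
    """Flatten SPARQL JSON results into one plain dict per result row."""
    data = json.loads(text)
    columns = data["head"]["vars"]
    return [
        # Unbound (e.g. OPTIONAL) variables are simply absent from a binding.
        {name: binding[name]["value"] for name in columns if name in binding}
        for binding in data["results"]["bindings"]
    ]


rows = rows_from_sparql_json(SAMPLE)
print(rows[0]["label"], float(rows[0]["lat"]), float(rows[0]["long"]))
```

To use the real output instead of the sample, read the file with `open("mints.json", encoding="utf-8").read()` and pass that text to `rows_from_sparql_json`.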