Wikidata is an amazing project that aims to turn the unstructured text of Wikipedia into a database of facts and figures that allows you to go beyond just presenting a page about something to using data about it.
I've been wanting to try it out, along with SPARQL, the language used to query it, so I decided to create a map of every stadium that has hosted a game at the FIFA World Cup finals - a topical query as the 2018 World Cup in Russia has just started.
UPDATE: I've updated this now that the 2022 World Cup is taking place.
I used query.wikidata.org to come up with a query that got me the data I was looking for. Having never used SPARQL before it took a bit of tweaking to get the query I needed - I found the interface helpful for finding the right entities and the included examples for how to structure it.
Here's the query I came up with. I'll go through what each part does below.
wc_sparql = """
SELECT ?FIFA_World_CupLabel ?location ?locationLabel ?coord ?countryLabel WHERE {
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
?FIFA_World_Cup wdt:P3450 wd:Q19317.
?FIFA_World_Cup wdt:P276 ?location.
?location wdt:P625 ?coord.
?location wdt:P17 ?country
}
ORDER BY ?FIFA_World_CupLabel
"""
The first part sets up the fields we want to return - the name of the World Cup, the location ID (a stadium), the name of the stadium, its co-ordinates (latitude and longitude) and the name of the country.
SELECT ?FIFA_World_CupLabel ?location ?locationLabel ?coord ?countryLabel WHERE {
This next part allows you to fetch labels for each of the items, which is more helpful than the URI that gets returned.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
Then we start off by binding a variable called "FIFA_World_Cup" to every item whose "sports season of league or competition" property (wdt:P3450) has the value "FIFA World Cup" (wd:Q19317):
?FIFA_World_Cup wdt:P3450 wd:Q19317.
Then we look for the locations (wdt:P276) attached to each of these competitions:
?FIFA_World_Cup wdt:P276 ?location.
And for each location we want the co-ordinates (wdt:P625) and country (wdt:P17).
?location wdt:P625 ?coord.
?location wdt:P17 ?country
I then used a Python library called SPARQLWrapper to send the query to the Wikidata SPARQL endpoint and get JSON data back.
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery(wc_sparql)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
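Under the hood this is just an HTTP request. As a rough sketch (not necessarily the exact request SPARQLWrapper sends), the URL for a GET request can be put together with the standard library. The `build_request_url` helper and the shortened `example_query` are names made up for illustration; `format=json` is the parameter the Wikidata Query Service accepts for asking for JSON results.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ENDPOINT = "https://query.wikidata.org/sparql"

# A shortened stand-in for wc_sparql so this sketch runs on its own.
example_query = "SELECT ?item WHERE { ?item wdt:P3450 wd:Q19317. } LIMIT 5"


def build_request_url(query, endpoint=ENDPOINT):
    """Return a GET URL that sends `query` and asks for JSON results."""
    # urlencode percent-encodes the query text so it is safe inside a URL.
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


url = build_request_url(example_query)
print(url)
```

Nothing is sent over the network here; the sketch only shows how the query text ends up encoded in the request.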
Here's an example of what one of the results looks like - a stadium used in the very first World Cup in Uruguay.
results['results']['bindings'][0]
{'location': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q498245'}, 'coord': {'datatype': 'http://www.opengis.net/ont/geosparql#wktLiteral', 'type': 'literal', 'value': 'Point(-56.152778 -34.894444)'}, 'FIFA_World_CupLabel': {'xml:lang': 'en', 'type': 'literal', 'value': '1930 FIFA World Cup'}, 'locationLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Estadio Centenario'}, 'countryLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Uruguay'}}
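Each binding wraps every value in a small dictionary with its type and language. If only the values matter, a binding can be flattened first. This is a minimal sketch using the result above as sample data; `flatten_binding` is a hypothetical helper, not part of SPARQLWrapper.

```python
def flatten_binding(binding):
    """Reduce a SPARQL JSON binding to a plain {variable: value} dictionary."""
    return {name: cell["value"] for name, cell in binding.items()}


# The first result from the query, copied from the output above.
sample_binding = {
    "location": {"type": "uri", "value": "http://www.wikidata.org/entity/Q498245"},
    "coord": {
        "datatype": "http://www.opengis.net/ont/geosparql#wktLiteral",
        "type": "literal",
        "value": "Point(-56.152778 -34.894444)",
    },
    "FIFA_World_CupLabel": {"xml:lang": "en", "type": "literal", "value": "1930 FIFA World Cup"},
    "locationLabel": {"xml:lang": "en", "type": "literal", "value": "Estadio Centenario"},
    "countryLabel": {"xml:lang": "en", "type": "literal", "value": "Uruguay"},
}

flat = flatten_binding(sample_binding)
print(flat["locationLabel"], "-", flat["countryLabel"])
```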
I then want to turn the results into nicely formatted data for plotting on a map. I'm looking for data that contains one record for each stadium, even if it has hosted games at more than one World Cup (e.g. Mexico in 1970 and 1986).
The co-ordinates for each location come in WKT format, so I use a library called Shapely to extract the latitude and longitude.
# for converting coordinates
import shapely.wkt
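To show what is being extracted, here is a rough standard-library sketch that parses a WKT point with a regular expression. It is not how Shapely works internally, `parse_wkt_point` is a made-up name, and it only handles simple two-dimensional `Point(x y)` strings like the ones in these results. The thing to notice is that WKT puts longitude first.

```python
import re

# Matches e.g. "Point(-56.152778 -34.894444)": two numbers separated by whitespace.
_POINT_RE = re.compile(r"^\s*POINT\s*\(\s*([^\s()]+)\s+([^\s()]+)\s*\)\s*$", re.IGNORECASE)


def parse_wkt_point(wkt):
    """Return (longitude, latitude) from a 2D WKT point string."""
    match = _POINT_RE.match(wkt)
    if match is None:
        raise ValueError("not a simple WKT point: {!r}".format(wkt))
    lon, lat = (float(value) for value in match.groups())
    return lon, lat


lon, lat = parse_wkt_point("Point(-56.152778 -34.894444)")
print(lon, lat)
```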
Then I go through each of the results and add them to a Python dictionary. If the stadium is already in the dictionary I just append the extra World Cup year to its existing record, rather than adding a new one.
stadia = {}
for result in results["results"]["bindings"]:
    stadium_id = result["location"]["value"]
    worldcup = result["FIFA_World_CupLabel"]["value"].replace(" FIFA World Cup", "")
    if stadium_id in stadia:
        stadia[stadium_id]["worldcups"].append(worldcup)
    else:
        stadia[stadium_id] = {
            "lat_lng": shapely.wkt.loads(result["coord"]["value"]).coords[0],
            "worldcups": [worldcup],
            "stadium": result["locationLabel"]["value"],
            "country": result["countryLabel"]["value"],
        }
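A quick way to check that the de-duplication worked is to list the stadia with more than one World Cup attached. This is only a sketch: `multi_host_stadia` is a hypothetical helper, and the sample below is a hand-written stand-in for the real `stadia` dictionary, with placeholder keys and only the fields the helper reads.

```python
def multi_host_stadia(stadia):
    """Return {stadium name: sorted years} for stadia used at more than one World Cup."""
    return {
        record["stadium"]: sorted(record["worldcups"])
        for record in stadia.values()
        if len(record["worldcups"]) > 1
    }


# Hand-written sample in the same shape as the processed data (keys are placeholders).
sample_stadia = {
    "example:centenario": {"stadium": "Estadio Centenario", "worldcups": ["1930"]},
    "example:azteca": {"stadium": "Estadio Azteca", "worldcups": ["1986", "1970"]},
}

print(multi_host_stadia(sample_stadia))
```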
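Here's what an entry in the processed data looks like. I've used the Wikidata URI as an identifier for each stadium. Despite the key name, "lat_lng" actually holds (longitude, latitude), because WKT points put longitude first.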
stadia['http://www.wikidata.org/entity/Q498245']
{'lat_lng': (-56.152778, -34.894444), 'worldcups': ['1930'], 'stadium': 'Estadio Centenario', 'country': 'Uruguay'}
import folium
from folium.plugins import MarkerCluster
import html
First I initialise the map and zoom out so you can see the whole world.
m = folium.Map(
    location=[20, 0],
    zoom_start=2,
    tiles='Stamen Toner',
    attr='''<a id="home-link" target="_top" href="../">Map tiles</a> by
        <a target="_top" href="http://stamen.com">Stamen Design</a>,
        under <a target="_top" href="http://creativecommons.org/licenses/by/3.0">CC BY 3.0</a>.
        Data by <a target="_top" href="http://openstreetmap.org">OpenStreetMap</a>,
        under <a target="_top" href="http://creativecommons.org/licenses/by-sa/3.0">CC BY SA</a>.
        | Locations powered by <a href="https://query.wikidata.org/">Wikidata</a>.'''
)
Then we go through the stadia and add each one to a cluster based on its country. I've also added a little popup which tells you the stadium's name and which World Cups it hosted games at. I also set a football icon for the pins.
clusters = {}
for stadium_id in stadia:
    s = stadia[stadium_id]
    # one marker cluster per country, created the first time the country appears
    if s["country"] not in clusters:
        clusters[s["country"]] = MarkerCluster().add_to(m)
    folium.Marker(
        # folium wants [latitude, longitude]; the stored tuple is longitude-first
        [s["lat_lng"][1], s["lat_lng"][0]],
        popup='{}, {} - <i>{}</i>'.format(
            html.escape(s["stadium"]),
            html.escape(s["country"]),
            html.escape(", ".join(s["worldcups"]))
        ),
        icon=folium.Icon(icon='soccer-ball-o', prefix='fa')
    ).add_to(clusters[s["country"]])
Finally we show the resulting map, which can be zoomed and panned to look at particular countries.
m
m.save("world-cup-stadia-map.html")
As an extra I wanted to convert the data into GeoJSON format so it's easy to use elsewhere.
from geojson import Feature, Point, FeatureCollection
import geojson
wc_geojson = FeatureCollection(
    [Feature(geometry=Point(stadia[s]["lat_lng"]),
             properties=stadia[s]) for s in stadia]
)
with open('world_cup_stadia.geojson', 'w') as a:
    geojson.dump(wc_geojson, a, indent=4)
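For anyone curious what ends up in the file, the same structure can be sketched with nothing but the standard library. This is an illustration rather than what the geojson package does internally: `stadium_to_feature` is a made-up helper, and unlike the code above it moves the co-ordinates out of the properties and uses the Wikidata URI as the feature's id. GeoJSON positions are longitude first, which matches the order the tuple is already stored in.

```python
import json


def stadium_to_feature(stadium_id, record):
    """Build a GeoJSON Feature dictionary for one stadium record."""
    lon, lat = record["lat_lng"]  # stored longitude-first, as GeoJSON expects
    properties = {key: value for key, value in record.items() if key != "lat_lng"}
    return {
        "type": "Feature",
        "id": stadium_id,
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


# The processed entry shown earlier, used here as sample data.
sample_stadia = {
    "http://www.wikidata.org/entity/Q498245": {
        "lat_lng": (-56.152778, -34.894444),
        "worldcups": ["1930"],
        "stadium": "Estadio Centenario",
        "country": "Uruguay",
    }
}

collection = {
    "type": "FeatureCollection",
    "features": [stadium_to_feature(sid, rec) for sid, rec in sample_stadia.items()],
}
geojson_text = json.dumps(collection, indent=4)
print(geojson_text)
```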
This was just a quick exercise to try and get data out of Wikidata and then use it. There are a few things that could be done to take it further: