version 1.12
Update: today we got a new Wikidata property, P8402, for Open Data Portal --> we can start connecting objects with their data portals. See the map of what is connected today and see the tweet.
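A minimal sketch of how the new property could be explored: the query below lists a few Wikidata items that carry a P8402 statement. The endpoint is the public Wikidata Query Service; the function names, the LIMIT and the user-agent string are assumptions made for illustration. SPARQLWrapper is imported lazily so the query text can be inspected without network access.

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_p8402_query(limit=10):
    """Return a SPARQL query listing items that have a P8402 statement."""
    return (
        "SELECT ?item ?value WHERE { ?item wdt:P8402 ?value } "
        "LIMIT %d" % int(limit)
    )


def fetch_p8402_items(limit=10):
    """Run the query against Wikidata (needs network access and SPARQLWrapper)."""
    from SPARQLWrapper import SPARQLWrapper, JSON

    # The agent string is a placeholder, not the one used in this notebook.
    sparql = SPARQLWrapper(WIKIDATA_ENDPOINT, agent="example-notebook/0.1")
    sparql.setQuery(build_p8402_query(limit))
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return [(b["item"]["value"], b["value"]["value"]) for b in bindings]


if __name__ == "__main__":
    # Only prints the query; call fetch_p8402_items() to hit the endpoint.
    print(build_p8402_query(5))
```

Calling `fetch_p8402_items(5)` should return (item URI, P8402 value) pairs, provided the endpoint is reachable.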
Problem 1: Question to the EDP helpdesk, 2020-06-24: Strings not Things. Metadata is not handled with care in the European Data Portal, e.g. keywords are not Linked Data, just text strings with a language tag.
European Data Portal Helpdesk / Improvement and suggestions DESK-7510
Things not Strings
Issue Type: Improvement and suggestions
Assignee: EDP Helpdesk
Created: 24/Jun/20 6:10 PM
Priority: Medium
Reporter: EDP Helpdesk
I think you should use Things, not strings, when describing the data,
e.g. in the JSON for europeandataportal resource-28 you have keywords as strings with a language tag, i.e. it's asking for problems
keyword:
@language: "sv",@value: "Telefonnummer"
@language: "sv",@value: "Kommuner"
@language: "sv", @value: "E-postadresser"
....
Much better to use Linked Data and things
Regards
Magnus Sälgö
++46705937578
Stockholm, Sweden
salgo1960@gmail.com
Answer:
Dear Magnus, Thank you for contacting European Data Portal Helpdesk.
We have gotten the following comments from the responsible team: "Thanks for your comments. We store the metadata as it comes from the data providers, so it is not on us to change that. Besides that, DCAT-AP defines keywords as literals."
Please let me know if you need further assistance from our services.
Best regards, Pernille Schnoor Clausen EDP Helpdesk
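To illustrate the point raised in the ticket, here is a small sketch comparing keywords as language-tagged strings with keywords as things (URIs). Only the Swedish string "Telefonnummer" comes from the ticket; the English translation, the lower-case variant and the example.org concept URI are hypothetical and merely stand in for identifiers from a real vocabulary.

```python
# Keywords as strings: the same concept shows up once per language or spelling.
keywords_as_strings = [
    {"@language": "sv", "@value": "Telefonnummer"},
    {"@language": "en", "@value": "Phone numbers"},   # hypothetical translation
    {"@language": "sv", "@value": "telefonnummer"},   # hypothetical case variant
]

# Keywords as things: every variant points at one identifier.
# The URI is a made-up placeholder, not an entry in a real vocabulary.
PHONE_NUMBER = "http://example.org/concept/phone-number"
keywords_as_things = [
    {"@id": PHONE_NUMBER},
    {"@id": PHONE_NUMBER},
    {"@id": PHONE_NUMBER},
]

distinct_strings = {(k["@language"], k["@value"]) for k in keywords_as_strings}
distinct_things = {k["@id"] for k in keywords_as_things}

print(len(distinct_strings))  # 3 different strings
print(len(distinct_things))   # 1 thing
```

With strings, grouping datasets by keyword needs per-language and per-spelling clean-up; with identifiers, a plain equality test is enough.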
In Sweden I see a rather naive approach to open data and the governance of open data. Since 2012 we have tried to get museums digital. They started speaking about linked data in 2012, and today, in 2020, they send text strings to the museum aggregator Europeana --> they can't even identify that an artist at museum A is the same as an artist at museum B (see blogpost) --> en:Wikipedia refused to link to them as the quality is so bad.
Things I see or miss
Comment: See how dataverse.harvard.edu uses DOIs --> 10.7910 is the portal and DVN/PEJ5QU is the dataset --> the unique persistent identifier is doi:10.7910/DVN/PEJ5QU
support versioning
The attempt to do Linked Data that started in 2012 has failed because museums are not skilled enough at delivering quality metadata and Europeana is not doing the work of cleaning the metadata (see tweet) --> the quality was so bad that in 2019 en:Wikipedia decided not to link to Europeana because it shows the wrong artist, see T243764 "en:Wikipedia has problem with the quality of Europeana..."
Wikidata is a fast-growing open "knowledge base", so why not use that? Google has its own, much bigger knowledge graph, but why not use Wikidata or build a "European data portal knowledge graph"?
"Fostering FAIR Data Practices In Europe", see the fairaware toolkit: "It helps to assess your current level of awareness on making your datasets findable, accessible, interoperable and reusable (FAIR) before uploading them in a data repository"
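The DOI structure mentioned in the comment above (a prefix for the portal, a suffix for the dataset) can be sketched as a small parser. The function name `split_doi` is hypothetical; the identifier is the one quoted in the list.

```python
def split_doi(identifier):
    """Split 'doi:<prefix>/<suffix>' into (prefix, suffix).

    The prefix identifies the registrant (here: the portal), the suffix
    identifies the object (here: the dataset).
    """
    if identifier.lower().startswith("doi:"):
        identifier = identifier[4:]
    # Only the first slash separates prefix and suffix; the suffix may
    # itself contain slashes.
    prefix, sep, suffix = identifier.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError("not a DOI: %r" % identifier)
    return prefix, suffix


prefix, suffix = split_doi("doi:10.7910/DVN/PEJ5QU")
print(prefix)  # 10.7910
print(suffix)  # DVN/PEJ5QU
```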
# Number of categories / themes and their number of datasets
# --> result: 25114 datasets, compared with the 1 093 704 datasets reported at
# https://www.europeandataportal.eu/data/datasets?locale=en
import sys
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url = "https://data.europa.eu/euodp/sparqlep"
query = """#European Data portal
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?theme (count(?s) AS ?count)
WHERE {?s a dcat:Dataset . ?s dcat:theme ?theme}
GROUP BY ?theme"""

def get_results(endpoint_url, query):
    # Identify the client to the endpoint
    user_agent = "salgo60/%s.%s" % (sys.version_info[0], sys.version_info[1])
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

results = get_results(endpoint_url, query)
for result in results["results"]["bindings"]:
    # Strip the authority prefix so only the theme code (e.g. ECON) is printed
    theme = result["theme"]["value"].replace("http://publications.europa.eu/resource/authority/data-theme/", "")
    value = result["count"]["value"]
    print(theme, value)
ECON 2560
GOVE 2666
JUST 353
INTR 429
TRAN 747
HEAL 2700
EDUC 2954
ENVI 2903
AGRI 941
SOCI 3636
TECH 2655
ENER 1009
REGI 1561
# Datasets missing theme
query_missingTheme = """#European Data portal
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?theme (count(?s) AS ?count)
WHERE {?s a dcat:Dataset . minus {?s dcat:theme ?theme}}
GROUP BY ?theme"""
results = get_results(endpoint_url, query_missingTheme)
for result in results["results"]["bindings"]:
    print("missing theme: ", result["count"]["value"])
missing theme: 206
# Number of datasets: we get 15371, which feels low
# The same query at https://www.europeandataportal.eu/sparql-manager/en/
# gives 1 071 960
query_NumberOf = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT (count(*) AS ?count) WHERE { ?s a dcat:Dataset } LIMIT 1000"""
results = get_results(endpoint_url, query_NumberOf)
for result in results["results"]["bindings"]:
    print("Number sets: ", result["count"]["value"])
Number sets: 15371
# PREFIX dcat: <http://www.w3.org/ns/dcat#>
query2 = """#Get datasets with 'MP' anywhere in the title (case-insensitive, so e.g. 'employment' also matches)
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?DatasetURI ?title WHERE {
?DatasetURI a dcat:Dataset .
?DatasetURI dc:title ?title
FILTER (lang(?title)='en')
FILTER(regex(?title, "MP", "i"))
} limit 10"""
results = get_results(endpoint_url, query2)
for result in results["results"]["bindings"]:
    #print(result)
    ds = result["DatasetURI"]["value"]
    title = result["title"]["value"]
    print(title, "\n\t", ds)
Gender employment gap by NUTS 2 regions
    http://data.europa.eu/88u/dataset/YHvFUmXqU4LhS4fExi9CnA
Gender employment gap by degree of urbanisation
    http://data.europa.eu/88u/dataset/sOx8fxyrYRoKlWkdPNA
Implementation report and country fiches on the Environment Liability Directive (ELD)
    http://data.europa.eu/88u/dataset/implementation-report-and-country-fiches-on-the-environment-liability-directive-eld
Employment in the EU environmental economy by environmental protection and resource management activities
    http://data.europa.eu/88u/dataset/GcXikOlwIaw0BJG3nTHeog
Implementation report under the Landfill Directive
    http://data.europa.eu/88u/dataset/implementation-report-under-the-landfill-directive
INSPIRE Implementation report and country fiches in relation to the infrastructure for geospatial data
    http://data.europa.eu/88u/dataset/inspire-implementation-report-and-country-fiches-in-relation-to-the-infrastructure-for-geospatial-data
Atmospheric Particles-DMPS Particle Concentration (2018)
    http://data.europa.eu/88u/dataset/536fd44e-b05f-46ff-8ba8-791f855e8fb2
Assumptions for net migration by age, sex and type of projection
    http://data.europa.eu/88u/dataset/GzhfU2UD8w0IfX0vaR47A
Disaggregated final energy consumption in households - quantities
    http://data.europa.eu/88u/dataset/UVygjkxEv6PYwQWBgmgwyg
Non- innovative enterprises by barrier against innovation activities, level of importance of the barrier, NACE Rev. 2 activity and size class
    http://data.europa.eu/88u/dataset/ULZ2ng11rhPkqMfyPk9Q
import json
import pandas as pd

def get_sparql_dataframe(endpoint_url, query):
    """
    Helper function to convert SPARQL results into a Pandas data frame.
    """
    user_agent = "salgo60/%s.%s" % (sys.version_info[0], sys.version_info[1])
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    result = sparql.query()
    processed_results = json.load(result.response)
    cols = processed_results['head']['vars']
    out = []
    for row in processed_results['results']['bindings']:
        item = []
        for c in cols:
            # Unbound (OPTIONAL) variables are missing from the row and become None
            item.append(row.get(c, {}).get('value'))
        out.append(item)
    return pd.DataFrame(out, columns=cols)
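The conversion step in the helper above can be tried offline on a hand-made SPARQL JSON result. This is a sketch under stated assumptions: `bindings_to_rows` is a hypothetical name, the example.org URIs are made up, and plain lists are used instead of pandas so that the snippet runs without extra packages.

```python
def bindings_to_rows(sparql_json):
    """Turn a SPARQL JSON result into (column names, list of rows)."""
    cols = sparql_json["head"]["vars"]
    rows = []
    for binding in sparql_json["results"]["bindings"]:
        # A variable that was OPTIONAL and unmatched is simply absent
        # from the binding, so it ends up as None in the row.
        rows.append([binding.get(c, {}).get("value") for c in cols])
    return cols, rows


# Hand-made result: the first dataset has no license, the second has one.
sample = {
    "head": {"vars": ["DatasetURI", "license"]},
    "results": {"bindings": [
        {"DatasetURI": {"type": "uri", "value": "http://example.org/dataset/1"}},
        {"DatasetURI": {"type": "uri", "value": "http://example.org/dataset/2"},
         "license": {"type": "uri", "value": "http://example.org/license/a"}},
    ]},
}

cols, rows = bindings_to_rows(sample)
print(cols)
print(rows)
```

The `None` entries are what later show up as missing values in the "non-null" counts of a data frame built from such rows.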
# Retrieves metadata for every dataset that has a temporal coverage period and
# at least one keyword (English titles only); the other fields are optional.
querybase = """#Get dataset metadata for datasets with a coverage period and keywords
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX odp: <https://data.europa.eu/euodp/ontologies/ec-odp#>
SELECT distinct ?DatasetURI ?title ?period ?keyword ?conformsTo
    ?LandingPage ?relatedResource ?accessRights ?created ?modified ?license
    ?Taxanomy ?description WHERE {
  ?DatasetURI a dcat:Dataset .
  OPTIONAL {?DatasetURI dcat:distribution ?o} .
  OPTIONAL {?DatasetURI dcat:landingPage ?LandingPage }
  OPTIONAL {?DatasetURI dct:relation ?relatedResource }
  OPTIONAL {?DatasetURI dct:conformsTo ?conformsTo }
  OPTIONAL {?DatasetURI dct:accessRights ?accessRights }
  OPTIONAL {?DatasetURI dct:created ?created }
  OPTIONAL {?DatasetURI dct:modified ?modified }
  OPTIONAL {?DatasetURI dct:license ?license }
  OPTIONAL {?DatasetURI dcat:themeTaxonomy ?Taxanomy }
  OPTIONAL {?DatasetURI dct:description ?description}
  ?DatasetURI dc:title ?title
  {
    ?DatasetURI dc:temporal ?period .
    ?DatasetURI dcat:keyword ?keyword.
    # ?period odp:periodStart ?period_start
  }
  FILTER (lang(?title)='en')
}"""
query3 = querybase + " limit 10"
results = get_results(endpoint_url, query3)

def value_of(result, name):
    # OPTIONAL variables are absent from a binding when they did not match;
    # fall back to an empty string instead of raising KeyError.
    return result.get(name, {}).get("value", "")

for result in results["results"]["bindings"]:
    #print(result)
    ds = value_of(result, "DatasetURI")
    title = value_of(result, "title")
    keyword = value_of(result, "keyword")
    print(title, "\n\t", ds, "\n\t", keyword,
          "\n\tLanding: ", value_of(result, "LandingPage"),
          "\n\tRelated resource:", value_of(result, "relatedResource"),
          "\n\tConformsTo:", value_of(result, "conformsTo"),
          "\n\tAccess rights:", value_of(result, "accessRights"),
          "\n\tCreated:", value_of(result, "created"),
          "\n\tModified:", value_of(result, "modified"),
          "\n\tLicense:", value_of(result, "license"),
          "\n\tTaxanomy:", value_of(result, "Taxanomy"),
          )
Members of the European Parliament (MEPs)
    http://data.europa.eu/88u/dataset/members-of-the-european-parliament
    European Parliament
    Landing: http://data.europa.eu/88u/document/176fb3a4-917b-4468-a14e-dab75a745a97
    Related resource: https://data.europa.eu/euodp/en/data/dataset/eu-whoiswho-the-official-directory-of-the-european-union/resource/3f3433d4-0604-4682-a46a-c3d7c756358f
    ConformsTo:
    Access rights:
    Created:
    Modified: 2018-12-21 10:06:40.664007
    License:
    Taxanomy:
Members of the European Parliament (MEPs)
    http://data.europa.eu/88u/dataset/members-of-the-european-parliament
    parliament
    Landing: http://data.europa.eu/88u/document/176fb3a4-917b-4468-a14e-dab75a745a97
    Related resource: https://data.europa.eu/euodp/en/data/dataset/eu-whoiswho-the-official-directory-of-the-european-union/resource/3f3433d4-0604-4682-a46a-c3d7c756358f
    ConformsTo:
    Access rights:
    Created:
    Modified: 2018-12-21 10:06:40.664007
    License:
    Taxanomy:
Members of the European Parliament (MEPs)
    http://data.europa.eu/88u/dataset/members-of-the-european-parliament
    MEPs
    Landing: http://data.europa.eu/88u/document/176fb3a4-917b-4468-a14e-dab75a745a97
    Related resource: https://data.europa.eu/euodp/en/data/dataset/eu-whoiswho-the-official-directory-of-the-european-union/resource/3f3433d4-0604-4682-a46a-c3d7c756358f
    ConformsTo:
    Access rights:
    Created:
    Modified: 2018-12-21 10:06:40.664007
    License:
    Taxanomy:
Extra-EU27 (from 2020) trade of food, drinks and tobacco (SITC 0+1), by Member State
    http://data.europa.eu/88u/dataset/2YUEG5uyiMXBxEAfiV4Hg
    international trade
    Landing: http://data.europa.eu/88u/document/3be4bdcb-ad37-4d85-8242-333cb8f64cd7
    Related resource:
    ConformsTo:
    Access rights:
    Created:
    Modified: 2020-06-15
    License:
    Taxanomy:
Extra-EU27 (from 2020) trade of food, drinks and tobacco (SITC 0+1), by Member State
    http://data.europa.eu/88u/dataset/2YUEG5uyiMXBxEAfiV4Hg
    trade statistics
    Landing: http://data.europa.eu/88u/document/3be4bdcb-ad37-4d85-8242-333cb8f64cd7
    Related resource:
    ConformsTo:
    Access rights:
    Created:
    Modified: 2020-06-15
    License:
    Taxanomy:
Extra-EU27 (from 2020) trade of food, drinks and tobacco (SITC 0+1), by Member State
    http://data.europa.eu/88u/dataset/2YUEG5uyiMXBxEAfiV4Hg
    international trade
    Landing: http://data.europa.eu/88u/document/3be4bdcb-ad37-4d85-8242-333cb8f64cd7
    Related resource:
    ConformsTo:
    Access rights:
    Created:
    Modified: 2020-06-15
    License:
    Taxanomy:
Extra-EU27 (from 2020) trade of food, drinks and tobacco (SITC 0+1), by Member State
    http://data.europa.eu/88u/dataset/2YUEG5uyiMXBxEAfiV4Hg
    trade statistics
    Landing: http://data.europa.eu/88u/document/3be4bdcb-ad37-4d85-8242-333cb8f64cd7
    Related resource:
    ConformsTo:
    Access rights:
    Created:
    Modified: 2020-06-15
    License:
    Taxanomy:
Extra-EU27 (from 2020) trade of food, drinks and tobacco (SITC 0+1), by Member State
    http://data.europa.eu/88u/dataset/2YUEG5uyiMXBxEAfiV4Hg
    international trade
    Landing: http://data.europa.eu/88u/document/3be4bdcb-ad37-4d85-8242-333cb8f64cd7
    Related resource:
    ConformsTo:
    Access rights:
    Created:
    Modified: 2020-06-15
    License:
    Taxanomy:
Extra-EU27 (from 2020) trade of food, drinks and tobacco (SITC 0+1), by Member State
    http://data.europa.eu/88u/dataset/2YUEG5uyiMXBxEAfiV4Hg
    trade statistics
    Landing: http://data.europa.eu/88u/document/3be4bdcb-ad37-4d85-8242-333cb8f64cd7
    Related resource:
    ConformsTo:
    Access rights:
    Created:
    Modified: 2020-06-15
    License:
    Taxanomy:
Extra-EU27 (from 2020) trade, by product group
    http://data.europa.eu/88u/dataset/80M6m5XM51yMXoG29uZcZA
    international trade
    Landing: http://data.europa.eu/88u/document/fa15e69e-5f16-4018-a4eb-5e21e8cb7994
    Related resource:
    ConformsTo:
    Access rights:
    Created:
    Modified: 2020-06-15
    License:
    Taxanomy:
# Load everything into a pandas DataFrame
results = get_sparql_dataframe(endpoint_url, querybase)
results.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12244 entries, 0 to 12243
Data columns (total 13 columns):
DatasetURI         12244 non-null object
title              12244 non-null object
period             12244 non-null object
keyword            12244 non-null object
conformsTo         7 non-null object
LandingPage        12221 non-null object
relatedResource    181 non-null object
accessRights       0 non-null object
created            0 non-null object
modified           11106 non-null object
license            0 non-null object
Taxanomy           0 non-null object
description        12244 non-null object
dtypes: object(13)
memory usage: 1.2+ MB
See above: we get different results compared to the SPARQL manager.
In our result we get
Compare www.europeandataportal.eu/data/datasets, which says 1 093 704 datasets.
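One way to make the discrepancy easier to track is to run the same count query against each endpoint and line the numbers up. The sketch below assumes a runner with the same signature as `get_results` from earlier in this notebook; `count_datasets`, `compare_counts` and the second endpoint URL are hypothetical, and the demonstration uses a fake runner that returns the two counts quoted above instead of calling any service.

```python
COUNT_QUERY = """PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT (count(*) AS ?count) WHERE { ?s a dcat:Dataset }"""


def count_datasets(endpoint_url, run_query):
    """Return the number of dcat:Dataset resources one endpoint reports.

    run_query(endpoint_url, query) must return SPARQL JSON results.
    """
    results = run_query(endpoint_url, COUNT_QUERY)
    return int(results["results"]["bindings"][0]["count"]["value"])


def compare_counts(endpoints, run_query):
    """Map each endpoint label to its dataset count."""
    return {label: count_datasets(url, run_query)
            for label, url in endpoints.items()}


def fake_run_query(endpoint_url, query):
    # Canned answers: the counts quoted in this notebook, no network involved.
    canned = {
        "https://data.europa.eu/euodp/sparqlep": "15371",
        "https://example.org/sparql-manager": "1071960",  # placeholder URL
    }
    return {"results": {"bindings": [{"count": {"value": canned[endpoint_url]}}]}}


counts = compare_counts(
    {
        "this notebook": "https://data.europa.eu/euodp/sparqlep",
        "SPARQL manager": "https://example.org/sparql-manager",
    },
    fake_run_query,
)
print(counts)
```

Passing `get_results` instead of `fake_run_query`, together with the real endpoint URLs, would run the comparison live.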
#Check metadata keywords
results.keyword.value_counts()
international trade               336
agriculture                       334
COVID-19                          229
coronavirus                       229
accountability                    207
                                 ...
automated mobility                  1
economic and financial affairs      1
excise-duties-tax                   1
Facial Dysostosis                   1
term                                1
Name: keyword, Length: 1817, dtype: int64
%matplotlib inline
import matplotlib.pyplot as plt
plot = results.keyword.value_counts().plot.bar(y='counts', figsize=(25, 5))
plt.show()
It feels like we lack a standard for keywords: most are used just once.
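The impression from the chart can be turned into a number: the share of distinct keywords that occur exactly once. This is a minimal sketch using only the standard library; `singleton_share` is a hypothetical helper and the sample list is made up from keywords that appear in the output above.

```python
from collections import Counter


def singleton_share(keywords):
    """Return the fraction of distinct keywords that are used exactly once."""
    counts = Counter(keywords)
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Made-up sample: two keywords used twice, two used once.
sample_keywords = [
    "international trade", "international trade",
    "agriculture", "agriculture",
    "COVID-19",
    "term",
]
print(singleton_share(sample_keywords))  # 0.5
```

With the data frame from this notebook, `singleton_share(results.keyword)` should give the corresponding share for the real keywords.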
# Pie of top 30
# crazy that the keyword COVID-19 is one of the most used keywords...
# it feels like we lack data management
plot = results.keyword.value_counts()[0:30].plot.pie(y='counts', figsize=(25, 5))
plt.show()
# 31-60 (slice starts at index 30 so that the 31st keyword is not skipped)
plot = results.keyword.value_counts()[30:60].plot.bar(y='counts', figsize=(25, 5))
plt.show()
# 61-100
plot = results.keyword.value_counts()[60:100].plot.bar(y='counts', figsize=(25, 5))
plt.show()
# 101-130
plot = results.keyword.value_counts()[100:130].plot.bar(y='counts', figsize=(25, 5))
plt.show()
# 131-
plot = results.keyword.value_counts()[130:].plot.bar(y='counts', figsize=(25, 5))
plt.show()
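The charts above are labelled by rank (31-60, 61-100 and so on), while Python slices are 0-based and exclude their end index, which makes it easy to drop a keyword between two charts. A small helper, sketched here under the hypothetical name `rank_slice`, converts 1-based inclusive ranks into a slice.

```python
def rank_slice(first_rank, last_rank=None):
    """Convert 1-based inclusive ranks into a Python slice.

    rank_slice(31, 60) selects the 31st through the 60th item;
    rank_slice(131) selects everything from the 131st item on.
    """
    if first_rank < 1:
        raise ValueError("ranks start at 1")
    return slice(first_rank - 1, last_rank)


items = list(range(1, 201))  # stand-in for 200 ranked keywords
print(items[rank_slice(1, 30)][-1])   # 30
print(items[rank_slice(31, 60)][0])   # 31
print(len(items[rank_slice(131)]))    # 70
```

Consecutive ranges built this way neither overlap nor leave a gap; the same slice objects can be used to index the value counts of the data frame by position.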