
European data portal - Metadata quality¶

version 1.12

Update: today we got a new Wikidata property, P8402, for Open Data Portal --> we can start connecting objects with their data portals. See the map of what is connected today and see the tweet.

This Jupyter Notebook https://tinyurl.com/EDPQua
SPARQL manager
DCAT Spec Data Catalog Vocabulary (DCAT) - Version 2
My input (in Swedish) to DCAT-AP-SE: we need better version management and DOIs for datasets. Now that datasets are copied everywhere, it is chaos to work out which dataset we use, which version of it, and whether we can trust it and use it as a source for our data. Compare the ORCID usage video.
- The same challenge applies to the data itself: every piece of data needs something like Linked data to explain what, e.g., the label Stockholm stands for
  - a city = Q94385
  - a Municipality = Q506250
  - a county = Q104231
  - today I see keywords in DCAT-AP as literals, i.e. strings with a language tag --> we get this mess
    - zh 斯德哥尔摩
    - is that the same as ar ستوكهولم
    - is that the same as sv Stockholm
- Versioning, traceability, public identifiers and linked data make Open Data scale. Today I feel no one is in charge, pointing in the right direction and taking responsibility for the quality. We see the same pattern with museums and how they send objects to the Europeana portal: see the status of that project, which I guess started in 2012, and now in 2020 we have a mess. They have 55 million objects and chaos, and in the example below they start adding fake metadata
  - Carl Larsson, who is that? Sadly Europeana doesn't know --> #Metadatadebt
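
A minimal sketch of the difference, using the labels and Wikidata IDs from the list above. The `ENTITY` constant and the data layout are illustrative assumptions, not part of DCAT-AP or of any portal API, and choosing the city rather than the municipality or county is exactly the decision a data provider would have to make.

```python
# Three keyword literals as DCAT-AP stores them: (language tag, string).
keywords = [("zh", "斯德哥尔摩"), ("ar", "ستوكهولم"), ("sv", "Stockholm")]

# Compared as strings they are three unrelated values.
distinct_strings = {value for _lang, value in keywords}

# Compared as things, each literal is just a label of one entity IRI.
# Assumption: the city is meant, using the ID given in the list above.
ENTITY = "http://www.wikidata.org/entity/Q94385"
things = {ENTITY: dict(keywords)}

print(len(distinct_strings))  # 3
print(len(things))            # 1
```

With an IRI as the keyword, the language-specific strings become display labels and no longer need to be matched against each other.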

Problem 1: Question to the EDP helpdesk 2020-06-24: Things not Strings. Metadata is not handled with care in the European Data Portal, e.g. keywords are not Linked data, just text strings with a language tag.

European Data Portal Helpdesk / Improvement and suggestions DESK-7510

 Things not Strings
 Issue Type:	Improvement and suggestions
 Assignee:	EDP Helpdesk
 Created:	24/Jun/20 6:10 PM
 Priority:	Medium
 Reporter:	EDP Helpdesk
 I think you should use Things not strings when describing the data

e.g. json europeandataportal resource-28 you have keywords with a language i.e. its asking for problems

keyword:
@language: "sv",@value: "Telefonnummer"
@language: "sv",@value: "Kommuner"
@language: "sv", @value: "E-postadresser"
....

Much better use Linked data and things

"Telefonnummer" same as https://www.wikidata.org/wiki/Q214995
"Kommuner" same as https://www.wikidata.org/wiki/Q127448
"E-postadresser" same as https://www.wikidata.org/wiki/Q9158

Regards
Magnus Sälgö
++46705937578
Stockholm, Sweden
salgo1960@gmail.com

Answer Dear Magnus, Thank you for contacting European Data Portal Helpdesk.

We have gotten the following comments from the responsible team: "Thanks for your comments. We store the metadata as it comes from the data providers, so it is not on us to change that. Besides that, DCAT-AP defines keywords as literals."

Please let me know if you need further assistance from our services.

Best regards, Pernille Schnoor Clausen EDP Helpdesk

Fostering an Open Data Ecosystem to deliver good data and metadata about datasets¶

In Sweden I see a rather naive approach to open data and the governance of open data. Since 2012 we have tried to get museums digital. They started speaking about linked data in 2012, and today, in 2020, they send text strings to the museum aggregator Europeana --> they can't even tell that an artist at museum A is the same as an artist at museum B (see blogpost) --> en:Wikipedia refused to link to them because the quality is so bad.

Things I see or miss

- Good metadata about the datasets and someone who takes responsibility. If this were business-critical data, no one would accept "we store the metadata as is .... so it is not on us to change that."
- You are given responsibilities and you take responsibility; we don't need gatekeepers adding no value
- We miss discussion platforms. If we in Sweden want data, we need to contact 290 municipal units by phone, which will not scale. People are spending time on LinkedIn liking each other instead of using real systems for public backlogs where we can ask questions and track issues
- Good patterns, like describing the quality of delivered data as ShEx
- An urgency for delivering good quality
- Traceability of errors in datasets, where you get a helpdesk ticket that you can follow until the fix is in production. Compare the use of the free tool Phabricator (originally from Facebook) at Wikidata
- Platforms for coordinating data between different sources, i.e. if source X delivers COVID-19 data with fields xxx, then source Y commits to that in helpdesk ticket yyyy. Listen to me asking a question about this in December 2018 at SWIB-18

The Google approach¶

see Google AI blog Building Google Dataset Search and Fostering an Open Data Ecosystem
- video Introducing the Knowledge Graph
- Connecting Replicas of Datasets
  - It is very common for a dataset, in particular a popular one, to be present in more than one repository..... a way to specify the connection explicitly, through schema.org/sameAs, ..... having the same Digital Object Identifier...
- Reconciling to the Google Knowledge Graph
  - Therefore, we try to reconcile information mentioned in the metadata fields with the items in the Knowledge Graph
  - This type of reconciliation opens up lots of possibilities to improve the search experience for users.
Making it easier to discover datasets
- Guidelines for datasets

Comment: See how dataverse.harvard.edu uses DOIs --> 10.7910 is the portal and DVN/PEJ5QU is the dataset --> the unique persistent identifier is doi:10.7910/DVN/PEJ5QU

DOI 10.7910 DVN/PEJ5QU
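
As a small illustration of that structure, the DOI can be split at its first slash into the prefix and the suffix. `split_doi` is a hypothetical helper written for this notebook, with only very light validation.

```python
def split_doi(doi):
    """Split a DOI into (prefix, suffix) at the first slash."""
    name = doi.strip()
    if name.lower().startswith("doi:"):
        name = name[4:]
    prefix, sep, suffix = name.partition("/")
    # Light sanity check only: DOI prefixes start with "10.".
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


prefix, suffix = split_doi("doi:10.7910/DVN/PEJ5QU")
print(prefix)  # 10.7910
print(suffix)  # DVN/PEJ5QU
# The identifier can be resolved through the doi.org resolver.
print("https://doi.org/" + prefix + "/" + suffix)
```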

support versioning

Versioning

The Europeana approach¶

The effort started in 2012, trying to do Linked data, and has failed because museums are not skilled enough at delivering quality metadata and Europeana is not doing the work of cleaning the metadata (see tweet) --> the quality was so bad that en:Wikipedia in 2019 decided not to link to Europeana because it shows the wrong artist; see T243764 "en:Wikipedia has problem with the quality of Europeana..."

Wikidata - a free open knowledge base that is linked¶

Wikidata is a fast-growing open "knowledge base", so why not use that? Google has its own, much bigger knowledge graph, but why not use Wikidata or build a "European data portal knowledge graph"?

video An introduction to Wikidata

FAIR aware¶

“Fostering FAIR Data Practices In Europe”: see the fairaware toolkit. "It helps to assess your current level of awareness on making your datasets findable, accessible, interoperable and reusable (FAIR) before uploading them in a data repository"


Check the data quality at European Data Portal¶

In [64]:

# Number of categories / themes and their number of datasets
# --> the counts sum to 25114 (dataset, theme) pairs; compare the
# 1 093 704 datasets reported at
# https://www.europeandataportal.eu/data/datasets?locale=en 
import sys
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url ="https://data.europa.eu/euodp/sparqlep"

query = """#European Data portal 
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/terms/>

SELECT ?theme (count(?s) AS ?count) 
WHERE {?s a dcat:Dataset . ?s dcat:theme ?theme} 
GROUP BY ?theme"""


def get_results(endpoint_url, query):
    user_agent = "salgo60/%s.%s" % (sys.version_info[0], sys.version_info[1])
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()


results = get_results(endpoint_url, query)

for result in results["results"]["bindings"]:
    theme = result["theme"]["value"].replace("http://publications.europa.eu/resource/authority/data-theme/","")
    value = result["count"]["value"] 
    print(theme, value)

ECON 2560
GOVE 2666
JUST 353
INTR 429
TRAN 747
HEAL 2700
EDUC 2954
ENVI 2903
AGRI 941
SOCI 3636
TECH 2655
ENER 1009
REGI 1561

In [74]:

# Datasets missing a theme
# ?theme is never bound after MINUS, so there is nothing to select or group by
query_missingTheme = """#European Data portal 
PREFIX dcat: <http://www.w3.org/ns/dcat#>

SELECT (count(?s) AS ?count) 
WHERE { ?s a dcat:Dataset . MINUS { ?s dcat:theme ?theme } }""" 

results = get_results(endpoint_url, query_missingTheme)

for result in results["results"]["bindings"]:
    print("missing theme: ", result["count"]["value"])

missing theme:  206

In [83]:

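# Number of datasets: we get 15371, which feels low.
# The same query at https://www.europeandataportal.eu/sparql-manager/en/ 
# gives 1 071 960.
# Note: the endpoint used here has /euodp/ in its path, which suggests it may
# be a different catalogue than the one behind the SPARQL manager.
# LIMIT is dropped because an aggregate without GROUP BY returns a single row.
query_NumberOf = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT (count(*) AS ?count) WHERE { ?s a dcat:Dataset }"""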

results = get_results(endpoint_url, query_NumberOf)

for result in results["results"]["bindings"]:
    print("Number sets: ", result["count"]["value"])

Number sets:  15371
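
One reading of the numbers printed so far: the theme query counts (dataset, theme) pairs, so a dataset with several themes is counted once per theme. The arithmetic below uses only the figures printed by the cells above and assumes that all three queries saw the same state of the endpoint.

```python
# Figures printed by the cells above.
theme_counts = {
    "ECON": 2560, "GOVE": 2666, "JUST": 353, "INTR": 429, "TRAN": 747,
    "HEAL": 2700, "EDUC": 2954, "ENVI": 2903, "AGRI": 941, "SOCI": 3636,
    "TECH": 2655, "ENER": 1009, "REGI": 1561,
}
missing_theme = 206
total_datasets = 15371

assignments = sum(theme_counts.values())     # (dataset, theme) pairs
with_theme = total_datasets - missing_theme  # datasets with at least one theme

print(assignments)                         # 25114
print(with_theme)                          # 15165
print(round(assignments / with_theme, 2))  # 1.66
```

Under that assumption the 25114 is not a number of datasets: it is roughly 1.66 theme assignments for each of the 15165 datasets that have a theme.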

In [84]:

# The regex is a case-insensitive substring match, so "MP" also matches
# titles containing e.g. "employment" or "implementation"
query2 = """#Get datasets with MP in the name 
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/terms/>

SELECT ?DatasetURI ?title WHERE { 
?DatasetURI a dcat:Dataset .
?DatasetURI dc:title ?title
FILTER (lang(?title)='en')
FILTER(regex(?title, "MP", "i"))
} limit 10"""

results = get_results(endpoint_url, query2)

for result in results["results"]["bindings"]:
    #print(result)
    ds = result["DatasetURI"]["value"]
    title = result["title"]["value"] 
    print( title, "\n\t",ds)

Gender employment gap by NUTS 2 regions 
	 http://data.europa.eu/88u/dataset/YHvFUmXqU4LhS4fExi9CnA
Gender employment gap by degree of urbanisation 
	 http://data.europa.eu/88u/dataset/sOx8fxyrYRoKlWkdPNA
Implementation report and country fiches on the Environment Liability Directive (ELD) 
	 http://data.europa.eu/88u/dataset/implementation-report-and-country-fiches-on-the-environment-liability-directive-eld
Employment in the EU environmental economy by environmental protection and resource management activities 
	 http://data.europa.eu/88u/dataset/GcXikOlwIaw0BJG3nTHeog
Implementation report under the Landfill Directive 
	 http://data.europa.eu/88u/dataset/implementation-report-under-the-landfill-directive
INSPIRE Implementation report and country fiches in relation to the infrastructure for geospatial data 
	 http://data.europa.eu/88u/dataset/inspire-implementation-report-and-country-fiches-in-relation-to-the-infrastructure-for-geospatial-data
Atmospheric Particles-DMPS Particle Concentration (2018) 
	 http://data.europa.eu/88u/dataset/536fd44e-b05f-46ff-8ba8-791f855e8fb2
Assumptions for net migration by age, sex and type of projection 
	 http://data.europa.eu/88u/dataset/GzhfU2UD8w0IfX0vaR47A
Disaggregated final energy consumption in households - quantities 
	 http://data.europa.eu/88u/dataset/UVygjkxEv6PYwQWBgmgwyg
Non- innovative enterprises by barrier against innovation activities, level of importance of the barrier, NACE Rev. 2 activity and size class 
	 http://data.europa.eu/88u/dataset/ULZ2ng11rhPkqMfyPk9Q

In [85]:

import json
import pandas as pd
def get_sparql_dataframe(endpoint_url, query):
    """
    Helper function to convert SPARQL results into a Pandas data frame.
    """
    user_agent = "salgo60/%s.%s" % (sys.version_info[0], sys.version_info[1])
 
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    result = sparql.query()

    processed_results = json.load(result.response)
    cols = processed_results['head']['vars']

    out = []
    for row in processed_results['results']['bindings']:
        item = []
        for c in cols:
            item.append(row.get(c, {}).get('value'))
        out.append(item)

    return pd.DataFrame(out, columns=cols)

In [86]:

# Retrieves datasets that have an English title, a temporal coverage period
# and at least one keyword, plus a number of optional metadata fields.
# Each dataset yields one row per keyword (and per other multi-valued field).
# dcat:themeTaxonomy is defined for dcat:Catalog in DCAT, so it is expected
# to be empty when asked for on a dataset.
querybase = """#Get datasets with coverage period and keyword, plus optional metadata
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/terms/> 
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX odp: <https://data.europa.eu/euodp/ontologies/ec-odp#>

SELECT distinct ?DatasetURI ?title ?period ?keyword ?conformsTo 
?LandingPage ?relatedResource ?accessRights ?created ?modified ?license
?Taxanomy ?description WHERE { 
?DatasetURI a dcat:Dataset .
OPTIONAL {?DatasetURI dcat:distribution ?o} .
OPTIONAL {?DatasetURI dcat:landingPage ?LandingPage }
OPTIONAL {?DatasetURI dct:relation ?relatedResource }
OPTIONAL {?DatasetURI dct:conformsTo  ?conformsTo }
OPTIONAL {?DatasetURI dct:accessRights ?accessRights }
OPTIONAL {?DatasetURI dct:created  ?created }
OPTIONAL {?DatasetURI dct:modified  ?modified }
OPTIONAL {?DatasetURI dct:license  ?license }
OPTIONAL {?DatasetURI dcat:themeTaxonomy  ?Taxanomy }
OPTIONAL {?DatasetURI dct:description ?description}  
?DatasetURI dc:title ?title
 {
   ?DatasetURI dc:temporal ?period .
    ?DatasetURI dcat:keyword ?keyword.
#   ?period odp:periodStart ?period_start
   }
FILTER (lang(?title)='en')
}"""  

query3 = querybase + "  limit 10"

results = get_results(endpoint_url, query3)

# Variables to print; the OPTIONAL ones may be unbound in a given row.
fields = ["DatasetURI", "title", "keyword", "LandingPage", "relatedResource",
          "conformsTo", "accessRights", "created", "modified", "license",
          "Taxanomy"]

for result in results["results"]["bindings"]:
    # Unbound variables become "" instead of raising KeyError
    # (LandingPage is OPTIONAL too and is missing for some datasets).
    row = {name: result.get(name, {}).get("value", "") for name in fields}
    print(row["title"], "\n\t", row["DatasetURI"], "\n\t", row["keyword"],
          "\n\tLanding: ", row["LandingPage"],
          "\n\tRelated resource:", row["relatedResource"],
          "\n\tConformsTo:", row["conformsTo"],
          "\n\tAccess rights:", row["accessRights"],
          "\n\tCreated:", row["created"],
          "\n\tModified:", row["modified"],
          "\n\tLicense:", row["license"],
          "\n\tTaxanomy:", row["Taxanomy"])

Members of the European Parliament (MEPs) 
	 http://data.europa.eu/88u/dataset/members-of-the-european-parliament 
	 European Parliament 
	Landing:  http://data.europa.eu/88u/document/176fb3a4-917b-4468-a14e-dab75a745a97 
	Related resource: https://data.europa.eu/euodp/en/data/dataset/eu-whoiswho-the-official-directory-of-the-european-union/resource/3f3433d4-0604-4682-a46a-c3d7c756358f 
	ConformsTo:  
	Access rights:  
	Created:  
	Modified: 2018-12-21 10:06:40.664007 
	License:  
	Taxanomy: 
Members of the European Parliament (MEPs) 
	 http://data.europa.eu/88u/dataset/members-of-the-european-parliament 
	 parliament 
	Landing:  http://data.europa.eu/88u/document/176fb3a4-917b-4468-a14e-dab75a745a97 
	Related resource: https://data.europa.eu/euodp/en/data/dataset/eu-whoiswho-the-official-directory-of-the-european-union/resource/3f3433d4-0604-4682-a46a-c3d7c756358f 
	ConformsTo:  
	Access rights:  
	Created:  
	Modified: 2018-12-21 10:06:40.664007 
	License:  
	Taxanomy: 
Members of the European Parliament (MEPs) 
	 http://data.europa.eu/88u/dataset/members-of-the-european-parliament 
	 MEPs 
	Landing:  http://data.europa.eu/88u/document/176fb3a4-917b-4468-a14e-dab75a745a97 
	Related resource: https://data.europa.eu/euodp/en/data/dataset/eu-whoiswho-the-official-directory-of-the-european-union/resource/3f3433d4-0604-4682-a46a-c3d7c756358f 
	ConformsTo:  
	Access rights:  
	Created:  
	Modified: 2018-12-21 10:06:40.664007 
	License:  
	Taxanomy: 
Extra-EU27 (from 2020) trade of food, drinks and tobacco (SITC 0+1), by Member State 
	 http://data.europa.eu/88u/dataset/2YUEG5uyiMXBxEAfiV4Hg 
	 international trade 
	Landing:  http://data.europa.eu/88u/document/3be4bdcb-ad37-4d85-8242-333cb8f64cd7 
	Related resource:  
	ConformsTo:  
	Access rights:  
	Created:  
	Modified: 2020-06-15 
	License:  
	Taxanomy: 
Extra-EU27 (from 2020) trade of food, drinks and tobacco (SITC 0+1), by Member State 
	 http://data.europa.eu/88u/dataset/2YUEG5uyiMXBxEAfiV4Hg 
	 trade statistics 
	Landing:  http://data.europa.eu/88u/document/3be4bdcb-ad37-4d85-8242-333cb8f64cd7 
	Related resource:  
	ConformsTo:  
	Access rights:  
	Created:  
	Modified: 2020-06-15 
	License:  
	Taxanomy: 
Extra-EU27 (from 2020) trade of food, drinks and tobacco (SITC 0+1), by Member State 
	 http://data.europa.eu/88u/dataset/2YUEG5uyiMXBxEAfiV4Hg 
	 international trade 
	Landing:  http://data.europa.eu/88u/document/3be4bdcb-ad37-4d85-8242-333cb8f64cd7 
	Related resource:  
	ConformsTo:  
	Access rights:  
	Created:  
	Modified: 2020-06-15 
	License:  
	Taxanomy: 
Extra-EU27 (from 2020) trade of food, drinks and tobacco (SITC 0+1), by Member State 
	 http://data.europa.eu/88u/dataset/2YUEG5uyiMXBxEAfiV4Hg 
	 trade statistics 
	Landing:  http://data.europa.eu/88u/document/3be4bdcb-ad37-4d85-8242-333cb8f64cd7 
	Related resource:  
	ConformsTo:  
	Access rights:  
	Created:  
	Modified: 2020-06-15 
	License:  
	Taxanomy: 
Extra-EU27 (from 2020) trade of food, drinks and tobacco (SITC 0+1), by Member State 
	 http://data.europa.eu/88u/dataset/2YUEG5uyiMXBxEAfiV4Hg 
	 international trade 
	Landing:  http://data.europa.eu/88u/document/3be4bdcb-ad37-4d85-8242-333cb8f64cd7 
	Related resource:  
	ConformsTo:  
	Access rights:  
	Created:  
	Modified: 2020-06-15 
	License:  
	Taxanomy: 
Extra-EU27 (from 2020) trade of food, drinks and tobacco (SITC 0+1), by Member State 
	 http://data.europa.eu/88u/dataset/2YUEG5uyiMXBxEAfiV4Hg 
	 trade statistics 
	Landing:  http://data.europa.eu/88u/document/3be4bdcb-ad37-4d85-8242-333cb8f64cd7 
	Related resource:  
	ConformsTo:  
	Access rights:  
	Created:  
	Modified: 2020-06-15 
	License:  
	Taxanomy: 
Extra-EU27 (from 2020) trade, by product group 
	 http://data.europa.eu/88u/dataset/80M6m5XM51yMXoG29uZcZA 
	 international trade 
	Landing:  http://data.europa.eu/88u/document/fa15e69e-5f16-4018-a4eb-5e21e8cb7994 
	Related resource:  
	ConformsTo:  
	Access rights:  
	Created:  
	Modified: 2020-06-15 
	License:  
	Taxanomy:

In [87]:

# Load the full result into a pandas DataFrame
results = get_sparql_dataframe(endpoint_url, querybase)

In [88]:

results.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12244 entries, 0 to 12243
Data columns (total 13 columns):
DatasetURI         12244 non-null object
title              12244 non-null object
period             12244 non-null object
keyword            12244 non-null object
conformsTo         7 non-null object
LandingPage        12221 non-null object
relatedResource    181 non-null object
accessRights       0 non-null object
created            0 non-null object
modified           11106 non-null object
license            0 non-null object
Taxanomy           0 non-null object
description        12244 non-null object
dtypes: object(13)
memory usage: 1.2+ MB

We have 12244 rows (one per dataset and keyword combination) if the query is ok¶

See above: we get different results compared to the SPARQL manager.

In our result we get

- no accessRights
- no created
- no license
- no Taxanomy
- some keywords in English

Compare www.europeandataportal.eu/data/datasets, which says 1 093 704 datasets.
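
Because the query returns one row per dataset and keyword, the row count overstates the number of datasets. A minimal sketch of counting distinct datasets from SPARQL JSON bindings follows; `count_distinct` is a hypothetical helper and the sample bindings are made up to mirror the shape that SPARQLWrapper returns.

```python
def count_distinct(bindings, variable="DatasetURI"):
    """Count the distinct values of one variable in SPARQL JSON bindings."""
    return len({row[variable]["value"] for row in bindings if variable in row})


# Hypothetical bindings: one dataset with two keywords, one with a single keyword.
sample = [
    {"DatasetURI": {"type": "uri", "value": "http://example.org/dataset/a"},
     "keyword": {"type": "literal", "value": "parliament"}},
    {"DatasetURI": {"type": "uri", "value": "http://example.org/dataset/a"},
     "keyword": {"type": "literal", "value": "MEPs"}},
    {"DatasetURI": {"type": "uri", "value": "http://example.org/dataset/b"},
     "keyword": {"type": "literal", "value": "international trade"}},
]

print(len(sample), count_distinct(sample))  # 3 2
```

On the DataFrame loaded above, `results.DatasetURI.nunique()` should give the corresponding number of distinct datasets.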

In [96]:

#Check metadata keywords
results.keyword.value_counts()

Out[96]:

international trade               336
agriculture                       334
COVID-19                          229
coronavirus                       229
accountability                    207
                                 ... 
automated mobility                  1
economic and financial affairs      1
excise-duties-tax                   1
Facial Dysostosis                   1
term                                1
Name: keyword, Length: 1817, dtype: int64

In [90]:

%matplotlib inline  
import matplotlib.pyplot as plt   
plot = results.keyword.value_counts().plot.bar(y='counts', figsize=(25, 5)) 
plt.show()

It feels like we lack a standard for keywords; most are used just once
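
To put a number on that impression, the share of keywords used exactly once can be computed directly. The sketch below uses only the standard library and a made-up sample; `singleton_share` is a hypothetical helper, not something the portal provides.

```python
from collections import Counter


def singleton_share(keywords):
    """Fraction of distinct keywords that occur exactly once."""
    counts = Counter(keywords)
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Made-up sample: four distinct keywords, three of them used once.
sample = ["COVID-19", "COVID-19", "coronavirus", "agriculture", "term"]
print(singleton_share(sample))  # 0.75
```

On the DataFrame above, `(results.keyword.value_counts() == 1).mean()` should give the same kind of figure for the real data.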

In [97]:

# Pie of the top 30 keywords
# Crazy that COVID-19 is one of the most used keywords.....
# it feels like we lack data management
plot = results.keyword.value_counts().iloc[:30].plot.pie(y='counts', figsize=(25, 5)) 
plt.show()

In [92]:

# 31-60 (positions 30..59, so that no keyword is skipped)
plot = results.keyword.value_counts().iloc[30:60].plot.bar(y='counts', figsize=(25, 5)) 
plt.show()

In [93]:

# 61-100 (positions 60..99)
plot = results.keyword.value_counts().iloc[60:100].plot.bar(y='counts', figsize=(25, 5)) 
plt.show()

In [94]:

# 101-130 (positions 100..129)
plot = results.keyword.value_counts().iloc[100:130].plot.bar(y='counts', figsize=(25, 5)) 
plt.show()

In [95]:

# 131- (position 130 onwards)
plot = results.keyword.value_counts().iloc[130:].plot.bar(y='counts', figsize=(25, 5)) 
plt.show()
