#!/usr/bin/env python
# coding: utf-8

# # European data portal - Metadata quality
# version 1.12
#
# **Update:** today we got a new Wikidata [property P8402](https://www.wikidata.org/wiki/Property_talk:P8402) for Open Data Portal --> we can start connecting objects with their data portals, see this [map](https://w.wiki/WQg) of what is connected today and this [tweet](https://twitter.com/salgo60/status/1280024337240723461?s=20)
#
# * This [Jupyter Notebook](https://github.com/salgo60/open-data-examples/blob/master/European%20data%20portal%20-%20quality%20of%20Metadata.ipynb) https://tinyurl.com/EDPQua
# * [SPARQL manager](https://www.europeandataportal.eu/sparql-manager/sv/)
# * [DCAT Spec](https://www.w3.org/TR/vocab-dcat-2/) Data Catalog Vocabulary (DCAT) - Version 2
# * in Swedish: [input to DCAT-AP-SE](https://github.com/DIGGSweden/DCAT-AP-SE/issues/72#issuecomment-653731898) saying that we need better version management and should use DOIs for datasets. With datasets copied everywhere it becomes chaos to understand which dataset we use, which version of it, and whether we can trust it and use it as a source for our own data. Compare the [ORCID usage video](https://www.youtube.com/watch?v=a1Rijk_TMHA)
# * the same challenge applies to the data itself: every piece of data needs something like [Linked data]() to explain what e.g. the label Stockholm stands for
#   * a city = [Q94385](https://www.wikidata.org/wiki/Q94385)
#   * a municipality = [Q506250](https://www.wikidata.org/wiki/Q506250)
#   * a county = [Q104231](https://www.wikidata.org/wiki/Q104231)
# * today keywords in DCAT-AP are literals, i.e. strings with a language tag --> we get this mess
#   * **zh** [斯德哥尔摩](https://www.wikidata.org/wiki/Q1754?uselang=zh)
#   * is that the same as **ar** [ستوكهولم](https://www.wikidata.org/wiki/Q1754?uselang=ar)
#   * is that the same as **sv** [Stockholm](https://www.wikidata.org/wiki/Q1754?uselang=sv)
# * versioning, traceability, public identifiers and linked data make Open Data scale. Today it feels like no one is in charge, pointing in the right direction and taking responsibility for quality... we see the same pattern with museums and how they send objects to the Europeana portal: that project started, I guess, in 2012 and now in 2020 we have a mess... they have [55 million objects](https://classic.europeana.eu/portal/en/about.html), chaos, and in the example below they have started adding fake metadata
#   * [Carl Larsson, who is that - sadly Europeana doesn't know --> #Metadatadebt](https://minancestry.blogspot.com/2020/03/carl-larsson-who-is-that-sadly.html)
#
# **Problem 1: Question to the EDP helpdesk 2020-06-24** Strings, not Things. Metadata is not handled with care in the European data portal, e.g. keywords are not Linked data but just text strings with a language tag...
#
# European Data Portal Helpdesk / Improvement and suggestions **DESK-7510**
# ```
# Things not Strings
# Issue Type: Improvement and suggestions
# Assignee: EDP Helpdesk
# Created: 24/Jun/20 6:10 PM
# Priority: Medium
# Reporter: EDP Helpdesk
#
# I think you should use Things not strings when describing the data
# ```
#
# e.g. in [json europeandataportal resource-28](https://www.europeandataportal.eu/data/api/datasets/https-catalog-skl-se-store-1-resource-38.jsonld?useNormalizedId=true&locale=en)
# you have keywords with only a language tag, i.e. it is asking for problems
#
# ```
# keyword:
# @language: "sv", @value: "Telefonnummer"
# @language: "sv", @value: "Kommuner"
# @language: "sv", @value: "E-postadresser"
# ....
# ```
# Much better to use Linked data and things:
# * "Telefonnummer" same as https://www.wikidata.org/wiki/Q214995
# * "Kommuner" same as https://www.wikidata.org/wiki/Q127448
# * "E-postadresser" same as https://www.wikidata.org/wiki/Q9158
#
# ```
# Regards
# Magnus Sälgö
# ++46705937578
# Stockholm, Sweden
# salgo1960@gmail.com
# ```
# ----
# **Answer**
#
# Dear Magnus,
# Thank you for contacting European Data Portal Helpdesk.
#
# We have gotten the following comments from the responsible team:
# "Thanks for your comments. We store the metadata as it comes from the data providers, so it is not on us to change that. Besides that, DCAT-AP defines keywords as literals."
#
# Please let me know if you need further assistance from our services.
#
# Best regards, [Pernille Schnoor Clausen](https://www.linkedin.com/in/pernille-schnoor-clausen-38515a2)
# EDP Helpdesk
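# In[ ]:


# Illustration (not part of the helpdesk exchange above): a minimal "Things not
# Strings" sketch that reconciles the keyword literals above against Wikidata,
# assuming the public wbsearchentities API and the requests library. The top
# hit is only a candidate match and would of course need human review.
import requests


def suggest_wikidata_item(keyword, language="sv"):
    """Return the top Wikidata candidate (QID, description) for a keyword string."""
    r = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbsearchentities", "search": keyword,
                "language": language, "format": "json"},
        headers={"User-Agent": "open-data-examples keyword reconciliation sketch"})
    r.raise_for_status()
    hits = r.json().get("search", [])
    # Only the best candidate is returned; a real pipeline would keep all hits
    return (hits[0]["id"], hits[0].get("description", "")) if hits else (None, "")


for kw in ["Telefonnummer", "Kommuner", "E-postadresser"]:
    print(kw, "->", suggest_wikidata_item(kw))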
# ### Fostering an Open Data Ecosystem to deliver good data and metadata about datasets
#
# In Sweden I see a rather naive approach to open data and to the governance of open data. We have [since 2012 tried to get the museums digital](http://www.digisam.se/linked-open-data/). They started speaking about linked data in 2012, and today in 2020 they send text strings to the museum aggregator Europeana --> they cannot even tell that an artist at museum A is the same artist at museum B, see this [blogpost](https://minancestry.blogspot.com/2020/03/carl-larsson-who-is-that-sadly.html) --> en:Wikipedia refused to link to them because the [quality is so bad](https://phabricator.wikimedia.org/T243764).
#
# Things I see or miss:
# 1. Good metadata about the datasets and someone who takes responsibility... if this were business-critical data no one would accept "we store the metadata as is ... so it is not on us to change that."
# 1. You **get responsibilities** and you **take responsibilities** - we don't need gatekeepers adding no value
# 1. We miss discussion platforms: if we in Sweden want data we need to contact 290 municipalities by phone... **this will not scale**. People are spending time on LinkedIn liking each other instead of using real systems with [public backlogs](https://www.youtube.com/watch?v=502ILHjX9EE) where we can ask questions and track issues
# 1. Good patterns like describing the quality of delivered data with [ShEx](http://shex.io/)
# 1. A sense of urgency about delivering good quality
# 1. Traceability of errors in a dataset, where you get a helpdesk ticket that you can follow until it is in production - compare the free Facebook-developed tool [Phabricator at Wikidata](https://phabricator.wikimedia.org/tag/wikidata/)
# 1. Platforms for coordinating data between different sources, i.e. if source X delivers COVID-19 data with fields xxx then source Y commits to that in helpdesk ticket yyyy - listen to me [asking about this in December 2018 at SWIB18](https://youtu.be/K0l4fv5uUvg?t=1579)
#
# #### The Google approach
# * see the Google AI blog [Building Google Dataset Search and Fostering an Open Data Ecosystem](https://ai.googleblog.com/2018/09/building-google-dataset-search-and.html)
# * video [Introducing the Knowledge Graph](https://www.youtube.com/watch?v=mmQl6VGvX-c)
# * **Connecting Replicas of Datasets**
#   * _It is very common for a dataset, in particular a popular one, to be present in more than one repository..... a way to specify the connection explicitly, through schema.org/sameAs, ..... having the same [Digital Object Identifier...](https://www.doi.org/)_
# * **Reconciling to the Google Knowledge Graph**
#   * _Therefore, we try to reconcile information mentioned in the metadata fields with the items in the Knowledge Graph_
#   * _This type of reconciliation opens up lots of possibilities to improve the search experience for users._
# * [Making it easier to discover datasets](https://www.blog.google/products/search/making-it-easier-discover-datasets/)
# * [Guidelines for datasets](https://developers.google.com/search/docs/data-types/dataset)
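# In[ ]:


# Illustration (my sketch, not from the Google post): what "Connecting Replicas
# of Datasets" could look like in practice - two copies of the same dataset,
# each carrying schema.org/sameAs and the shared DOI as identifier. The portal
# URLs and names are hypothetical; the DOI is the Harvard Dataverse example
# discussed below.
replica_a = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example dataset (copy on portal A)",          # hypothetical
    "identifier": "https://doi.org/10.7910/DVN/PEJ5QU",    # shared persistent identifier
    "sameAs": "https://portal-b.example.org/dataset/xyz",  # hypothetical replica
}
replica_b = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example dataset (copy on portal B)",          # hypothetical
    "identifier": "https://doi.org/10.7910/DVN/PEJ5QU",    # same DOI --> same dataset
    "sameAs": "https://portal-a.example.org/dataset/xyz",  # hypothetical replica
}

# The replicas can be connected purely through the shared DOI
print(replica_a["identifier"] == replica_b["identifier"])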
# Comment: See how [dataverse.harvard.edu](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/PEJ5QU) uses DOIs --> 10.7910 is the portal and DVN/PEJ5QU is the dataset --> the unique persistent identifier is [doi:10.7910/DVN/PEJ5QU](https://doi.org/10.7910/DVN/PEJ5QU)
#
# ![DOI 10.7910 DVN/PEJ5QU](https://user-images.githubusercontent.com/14206509/86506605-21235680-bdd1-11ea-8622-a06afe7db8b5.png)
#
# and they support versioning:
#
# ![Versioning](https://user-images.githubusercontent.com/14206509/86506735-96435b80-bdd2-11ea-9544-dd47aa530c7f.png)
#
# ![Versioning](https://user-images.githubusercontent.com/14206509/86506761-d1de2580-bdd2-11ea-9916-9cecadcc37c7.png)
#
# #### The Europeana approach
# Europeana started [trying to do Linked data in 2012](https://web.archive.org/web/20121118181057/http://www.niso.org/apps/group_public/download.php/9407/IP_Isaac-etal_Europeana_isqv24no2-3.pdf) and has failed, because the museums are not skilled enough to deliver quality metadata and Europeana is not doing the work of cleaning the metadata, see this [tweet](https://twitter.com/salgo60/status/1281471086035640320?s=20) --> the quality was so bad that en:Wikipedia decided in 2019 not to link to Europeana because it shows the wrong artist, see [T243764](https://phabricator.wikimedia.org/T243764) "en:Wikipedia has problem with the quality of Europeana..."
#
# #### Wikidata - a free, open knowledge base that is linked
# Wikidata is a fast-growing open "knowledge base" - why not use that? Google has its own, much bigger [Knowledge Graph](https://en.wikipedia.org/wiki/Knowledge_Graph), but why not use Wikidata or build a "European data portal knowledge graph"?
# * video [An introduction to Wikidata](https://www.youtube.com/watch?v=m_9_23jXPoE)
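# In[ ]:


# Illustration (my sketch, not part of the original analysis): ask Wikidata which
# items already declare an Open Data portal via the new property P8402 mentioned
# at the top of this notebook, using the public Wikidata Query Service endpoint.
import sys
from SPARQLWrapper import SPARQLWrapper, JSON

wikidata_endpoint = "https://query.wikidata.org/sparql"
query_portals = """# Items connected to their Open Data portal (P8402)
SELECT ?item ?itemLabel ?portal WHERE {
  ?item wdt:P8402 ?portal .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,sv". }
}
LIMIT 20"""

sparql = SPARQLWrapper(wikidata_endpoint,
                       agent="salgo60/%s.%s" % (sys.version_info[0], sys.version_info[1]))
sparql.setQuery(query_portals)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["itemLabel"]["value"], "->", row["portal"]["value"])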
# # FAIR aware
# "Fostering FAIR Data Practices In Europe"
#
# See the [FAIR-Aware toolkit](https://fairaware.dans.knaw.nl/): "It helps to assess your current level of awareness on making your datasets findable, accessible, interoperable and reusable (FAIR) before uploading them in a data repository"

# In[ ]:


# ## Check the data quality at European Data Portal

# In[64]:


# Number of categories/themes and their number of datasets
# --> result: 25114 datasets, compare with the 1 093 704 datasets reported by
# https://www.europeandataportal.eu/data/datasets?locale=en
import sys
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url = "https://data.europa.eu/euodp/sparqlep"

# Note: the namespace URIs in the PREFIX declarations were lost in the notebook
# export; the standard DCAT and Dublin Core terms namespaces are assumed here.
query = """#European Data portal
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?theme (count(?s) AS ?count)
WHERE {?s a dcat:Dataset .
       ?s dcat:theme ?theme} GROUP BY ?theme"""


def get_results(endpoint_url, query):
    user_agent = "salgo60/%s.%s" % (sys.version_info[0], sys.version_info[1])
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()


results = get_results(endpoint_url, query)

for result in results["results"]["bindings"]:
    theme = result["theme"]["value"].replace("http://publications.europa.eu/resource/authority/data-theme/", "")
    value = result["count"]["value"]
    print(theme, value)


# In[74]:


# Datasets missing a theme
query_missingTheme = """#European Data portal
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?theme (count(?s) AS ?count)
WHERE {?s a dcat:Dataset .
       MINUS {?s dcat:theme ?theme}} GROUP BY ?theme"""

results = get_results(endpoint_url, query_missingTheme)

for result in results["results"]["bindings"]:
    print("missing theme: ", result["count"]["value"])


# In[83]:


# Number of datasets: we get 15371, which feels low
# the same query at https://www.europeandataportal.eu/sparql-manager/en/
# gives 1 071 960
query_NumberOf = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT (count(*) AS ?count)
WHERE { ?s a dcat:Dataset }
LIMIT 1000"""

results = get_results(endpoint_url, query_NumberOf)

for result in results["results"]["bindings"]:
    print("Number sets: ", result["count"]["value"])


# In[84]:


query2 = """#Get datasets with MP in the name
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/terms/>
SELECT ?DatasetURI ?title WHERE {
   ?DatasetURI a dcat:Dataset .
   ?DatasetURI dc:title ?title
   FILTER (lang(?title)='en')
   FILTER(regex(?title, "MP", "i"))
} limit 10"""

results = get_results(endpoint_url, query2)

for result in results["results"]["bindings"]:
    #print(result)
    ds = result["DatasetURI"]["value"]
    title = result["title"]["value"]
    print(title, "\n\t", ds)


# In[85]:


import json
import pandas as pd


def get_sparql_dataframe(endpoint_url, query):
    """ Helper function to convert SPARQL results into a Pandas data frame. """
    user_agent = "salgo60/%s.%s" % (sys.version_info[0], sys.version_info[1])
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    result = sparql.query()

    processed_results = json.load(result.response)
    cols = processed_results['head']['vars']

    out = []
    for row in processed_results['results']['bindings']:
        item = []
        for c in cols:
            # Missing optional bindings become None instead of raising KeyError
            item.append(row.get(c, {}).get('value'))
        out.append(item)

    return pd.DataFrame(out, columns=cols)


# In[ ]:
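# In[ ]:


# Quick check of the helper (my addition, not in the original notebook): run the
# small theme-count query from above through get_sparql_dataframe to see the
# shape of the resulting frame before running the much larger metadata query below.
df_themes = get_sparql_dataframe(endpoint_url, query)
print(df_themes.shape)
df_themes.head()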
# In[86]:


# Retrieve dataset metadata (keyword, landing page, related resource, conformsTo,
# access rights, created/modified dates, license, theme taxonomy, description)
# and, where available, the temporal coverage period of the datasets.
# Note: the namespace URIs were lost in the notebook export; standard DCAT and
# Dublin Core terms namespaces are assumed. The odp: prefix is only needed for
# the commented-out periodStart pattern, so it is left commented out as well.
querybase = """#Get dataset metadata and coverage period
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX dct: <http://purl.org/dc/terms/>
#PREFIX odp:
SELECT distinct ?DatasetURI ?title ?period ?keyword ?conformsTo ?LandingPage ?relatedResource ?accessRights ?created ?modified ?license ?Taxanomy ?description
WHERE {
   ?DatasetURI a dcat:Dataset .
   OPTIONAL {?DatasetURI dcat:distribution ?o} .
   OPTIONAL {?DatasetURI dcat:landingPage ?LandingPage }
   OPTIONAL {?DatasetURI dct:relation ?relatedResource }
   OPTIONAL {?DatasetURI dct:conformsTo ?conformsTo }
   OPTIONAL {?DatasetURI dct:accessRights ?accessRights }
   OPTIONAL {?DatasetURI dct:created ?created }
   OPTIONAL {?DatasetURI dct:modified ?modified }
   OPTIONAL {?DatasetURI dct:license ?license }
   OPTIONAL {?DatasetURI dcat:themeTaxonomy ?Taxanomy }
   OPTIONAL {?DatasetURI dct:description ?description}
   ?DatasetURI dc:title ?title
   {  ?DatasetURI dc:temporal ?period .
      ?DatasetURI dcat:keyword ?keyword.
#      ?period odp:periodStart ?period_start
   }
   FILTER (lang(?title)='en')
}"""

query3 = querybase + " limit 10"
results = get_results(endpoint_url, query3)

for result in results["results"]["bindings"]:
    #print(result)
    ds = result["DatasetURI"]["value"]
    title = result["title"]["value"]
    keyword = result["keyword"]["value"]
    LandingPage = result["LandingPage"]["value"]
    try:
        created = result["created"]["value"]
    except:
        created = ""
    try:
        relatedResource = result["relatedResource"]["value"]
    except:
        relatedResource = ""
    try:
        conformsTo = result["conformsTo"]["value"]
    except:
        conformsTo = ""
    try:
        accessRights = result["accessRights"]["value"]
    except:
        accessRights = ""
    try:
        modified = result["modified"]["value"]
    except:
        modified = ""
    try:
        license = result["license"]["value"]
    except:
        license = ""
    try:
        taxanomy = result["Taxanomy"]["value"]
    except:
        taxanomy = ""
    try:
        description = result["description"]["value"]
    except:
        description = ""
    #dsFormat = result["format"]["value"]

    print(title, "\n\t", ds,
          "\n\t", keyword,
          "\n\tLanding: ", LandingPage,
          "\n\tRelated resource:", relatedResource,
          "\n\tConformsTo:", conformsTo,
          "\n\tAccess rights:", accessRights,
          "\n\tCreated:", created,
          "\n\tModified:", modified,
          "\n\tLicense:", license,
          "\n\tTaxanomy:", taxanomy,
          )


# In[87]:


# Fetch everything into a pandas DataFrame
results = get_sparql_dataframe(endpoint_url, querybase)


# In[88]:


results.info()


# ### We have 12247 datasets, if the query is correct
# See above: we get different results compared to the [SPARQL manager](https://www.europeandataportal.eu/sparql-manager/sv/)
#
# In our result we get:
# * no accessRights
# * no created
# * no license
# * no Taxanomy
# * some keywords in English
#
# Compare [www.europeandataportal.eu/data/datasets](https://www.europeandataportal.eu/data/datasets?locale=en): it says 1 093 704 datasets

# In[96]:


# Check the metadata keywords
results.keyword.value_counts()


# In[90]:


get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt

plot = results.keyword.value_counts().plot.bar(y='counts', figsize=(25, 5))
plt.show()


# It feels like we lack a standard for keywords - most are used only once

# In[97]:


# Pie chart of the top 30 keywords
# It is remarkable that the keyword COVID-19 is one of the most used keywords...
# it feels like we lack data management
plot = results.keyword.value_counts()[0:30].plot.pie(y='counts', figsize=(25, 5))
plt.show()


# In[92]:


# Keywords 31-60
plot = results.keyword.value_counts()[31:60].plot.bar(y='counts', figsize=(25, 5))
plt.show()


# In[93]:


# Keywords 61-100
plot = results.keyword.value_counts()[61:100].plot.bar(y='counts', figsize=(25, 5))
plt.show()


# In[94]:


# Keywords 101-130
plot = results.keyword.value_counts()[101:130].plot.bar(y='counts', figsize=(25, 5))
plt.show()


# In[95]:


# Keywords 131-
plot = results.keyword.value_counts()[131:].plot.bar(y='counts', figsize=(25, 5))
plt.show()


# In[ ]:
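# In[ ]:


# Follow-up sketch (my addition, not in the original analysis): lower-case and
# strip the keywords before counting, to gauge how much of the long tail is just
# inconsistent spelling/casing rather than genuinely distinct keywords.
normalized = results.keyword.dropna().str.strip().str.lower()
print("raw unique keywords:       ", results.keyword.nunique())
print("normalized unique keywords:", normalized.nunique())
print("keywords used only once:   ", (normalized.value_counts() == 1).sum())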