Datasets

A Dataset is a specialization of a Resource that provides users with operations to handle files, record their provenance and describe them with metadata.

In [1]:

from kgforge.core import KnowledgeGraphForge

A configuration file is needed to create a KnowledgeGraphForge session. One can be generated with the notebook 00-Initialization.ipynb.

Note: DemoStore doesn't implement file operations yet. Use the BlueBrainNexus store instead when creating a config file.

In [58]:

forge = KnowledgeGraphForge("../../configurations/forge.yml")

Imports

In [3]:

from kgforge.core import Resource

In [4]:

from kgforge.specializations.resources import Dataset

In [5]:

import pandas as pd

Creation with files added as parts

In [6]:

! ls -p ../../data | egrep -v /$

associations.tsv
my_data.xwz
my_data_derived.txt
persons-with-id.csv
persons.csv
tfidfvectorizer_model_schemaorg_linking

In [7]:

persons = Dataset(forge, name="Interesting Persons")

In [8]:

persons.add_files("../../data/persons.csv")

In [9]:

forge.register(persons)

<action> _register_one
<succeeded> True

In [10]:

forge.as_json(persons)

Out[10]:

{'id': 'https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/980a7cd9-36ef-4fc9-95b8-cbf622b49fd8',
 'type': 'Dataset',
 'hasPart': {'distribution': {'type': 'DataDownload',
   'atLocation': {'type': 'Location',
    'store': {'id': 'https://bluebrain.github.io/nexus/vocabulary/diskStorageDefault',
     'type': 'DiskStorage',
     '_rev': 1}},
   'contentSize': {'unitCode': 'bytes', 'value': 52},
   'contentUrl': 'https://bbp.epfl.ch/nexus/v1/files/dke/kgforge/2737f7f0-950a-471d-ae60-80b79c7451bd',
   'digest': {'algorithm': 'SHA-256',
    'value': '1dacd765946963fda4949753659089c5f532714b418d30788bedddfec47a389f'},
   'encodingFormat': 'text/csv',
   'name': 'persons.csv'}},
 'name': 'Interesting Persons'}
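The distribution metadata above records the size of the file in bytes (contentSize) and its SHA-256 digest. As a minimal sketch, independent of kgforge, a local file could be checked against those two values with the standard library alone. The helper name describe_file and the sample content are assumptions made for the illustration; they are not part of the library or of the notebook's data.

```python
import hashlib
import os
import tempfile


def describe_file(path):
    """Return the size in bytes and the SHA-256 hex digest of a file."""
    sha256 = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(8192), b""):
            sha256.update(chunk)
            size += len(chunk)
    return {"contentSize": size, "digest": sha256.hexdigest()}


# Demonstration on a throwaway file (not one of the notebook's data files).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo = describe_file(tmp.name)
os.remove(tmp.name)
print(demo)
```

Running the same helper on ../../data/persons.csv and comparing the result with the 'contentSize' and 'digest' values shown above would be one way to confirm that the registered file matches the local one.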

In [11]:

associations = Dataset(forge, name="Associations data")

In [12]:

associations.add_files("../../data/associations.tsv")

In [13]:

associations.add_derivation(persons)

In [14]:

forge.register(associations)

<action> _register_one
<succeeded> True

In [15]:

forge.as_json(associations)

Out[15]:

{'id': 'https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/80bd2bcb-b84f-4418-9d2e-42712a59fbfb',
 'type': 'Dataset',
 'derivation': {'type': 'Derivation',
  'entity': {'id': 'https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/980a7cd9-36ef-4fc9-95b8-cbf622b49fd8?rev=1',
   'type': 'Dataset',
   'name': 'Interesting Persons'}},
 'hasPart': {'distribution': {'type': 'DataDownload',
   'atLocation': {'type': 'Location',
    'store': {'id': 'https://bluebrain.github.io/nexus/vocabulary/diskStorageDefault',
     'type': 'DiskStorage',
     '_rev': 1}},
   'contentSize': {'unitCode': 'bytes', 'value': 477},
   'contentUrl': 'https://bbp.epfl.ch/nexus/v1/files/dke/kgforge/d6b03d5b-4007-48d7-9432-5bf302c62999',
   'digest': {'algorithm': 'SHA-256',
    'value': '789aa07948683fe036ac29811814a826b703b562f7d168eb70dee1fabde26859'},
   'encodingFormat': 'text/tab-separated-values',
   'name': 'associations.tsv'}},
 'name': 'Associations data'}
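In the output above, the derivation points to the source dataset with an identifier that ends in ?rev=1, which suggests that the derivation refers to a specific revision of that dataset. The sketch below separates the identifier from the revision using only the standard library. The function name split_revision is hypothetical, and reading rev as a revision number is an assumption based on the output shown here.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def split_revision(entity_id):
    """Split an identifier such as '<id>?rev=1' into (id, revision)."""
    parts = urlsplit(entity_id)
    revisions = parse_qs(parts.query).get("rev")
    revision = int(revisions[0]) if revisions else None
    # Rebuild the identifier without its query string.
    base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return base, revision


derived_from = (
    "https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/"
    "980a7cd9-36ef-4fc9-95b8-cbf622b49fd8?rev=1"
)
base_id, rev = split_revision(derived_from)
print(base_id, rev)
```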

In [ ]:

# By default, files are downloaded to the current directory (path=".").
# The "source" argument selects the JSON path from which the file URLs are collected: source="parts" collects them from the files added as parts.
# The "path" argument sets a different download directory.
# The "overwrite" argument (bool) decides whether existing files with the same name are overwritten (True)
# or kept, in which case the new files are saved with a timestamp suffix (False).
# The "cross_bucket" argument selects whether data is downloaded from the configured bucket (cross_bucket=False, the default)
# or from a different bucket (cross_bucket=True). The configured store must support crossing buckets for this to work.
associations.download(source="parts")
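Collecting download URLs from a JSON path can be pictured as walking the dictionary returned by forge.as_json. The following is a minimal sketch of that idea, not the library's implementation: the function collect_values is hypothetical, and the dotted path "hasPart.distribution.contentUrl" is an assumption read off the JSON output shown earlier in this notebook.

```python
def collect_values(data, path):
    """Collect every value found by following a dotted path through dicts and lists."""
    current = [data]
    for key in path.split("."):
        following = []
        for item in current:
            # A list at this level means several branches to follow.
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    following.append(candidate[key])
        current = following
    # Flatten a final level of lists so the result is a flat list of values.
    result = []
    for value in current:
        result.extend(value if isinstance(value, list) else [value])
    return result


# Abbreviated copy of the JSON printed for the "Associations data" dataset.
associations_json = {
    "type": "Dataset",
    "hasPart": {
        "distribution": {
            "contentUrl": "https://bbp.epfl.ch/nexus/v1/files/dke/kgforge/d6b03d5b-4007-48d7-9432-5bf302c62999",
            "name": "associations.tsv",
        }
    },
}
urls = collect_values(associations_json, "hasPart.distribution.contentUrl")
print(urls)
```

The same helper also handles the case where hasPart is a list of resources, since each list element is followed in turn.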

In [19]:

# A specific path can be provided.
associations.download(path="./downloaded/", source="parts")

In [ ]:

# A specific content type can be downloaded.
associations.download(path="./downloaded/", source="parts", content_type="text/tab-separated-values")
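Restricting a download to one content type amounts to filtering the distributions before fetching them. The sketch below illustrates this on plain dictionaries. It assumes that the content type is compared with the encodingFormat value of each distribution, which is a guess based on the metadata printed in this notebook; select_by_content_type is a hypothetical helper, not a kgforge function.

```python
def select_by_content_type(distributions, content_type=None):
    """Keep the distributions whose encodingFormat equals content_type (all if None)."""
    if content_type is None:
        return list(distributions)
    return [d for d in distributions if d.get("encodingFormat") == content_type]


# Names and formats taken from the outputs shown in this notebook.
distributions = [
    {"name": "associations.tsv", "encodingFormat": "text/tab-separated-values"},
    {"name": "persons.csv", "encodingFormat": "text/csv"},
]
selected = select_by_content_type(distributions, "text/tab-separated-values")
print([d["name"] for d in selected])
```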

In [20]:

! ls -l ./downloaded

total 8
-rw-r--r--  1 mfsy  staff  477 Apr 12 17:13 associations.tsv

In [18]:

# ! rm -R ./downloaded/

Creation with files added as distribution

In [59]:

persons = Dataset(forge, name="Interesting Persons")

In [60]:

persons.add_distribution("../../data/associations.tsv")

In [61]:

forge.register(persons)

<action> _register_one
<succeeded> True

In [62]:

forge.as_json(persons)

Out[62]:

{'id': 'https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/3579d0f7-dbf4-40be-90e5-cd641704dfb3',
 'type': 'Dataset',
 'distribution': {'type': 'DataDownload',
  'atLocation': {'type': 'Location',
   'store': {'id': 'https://bluebrain.github.io/nexus/vocabulary/diskStorageDefault',
    'type': 'DiskStorage',
    '_rev': 1}},
  'contentSize': {'unitCode': 'bytes', 'value': 477},
  'contentUrl': 'https://bbp.epfl.ch/nexus/v1/files/dke/kgforge/dbb6814c-ef9c-4320-b8a0-bf8190dd510a',
  'digest': {'algorithm': 'SHA-256',
   'value': '789aa07948683fe036ac29811814a826b703b562f7d168eb70dee1fabde26859'},
  'encodingFormat': 'text/tab-separated-values',
  'name': 'associations.tsv'},
 'name': 'Interesting Persons'}

In [ ]:

# When files are added as distributions, they can be downloaded directly, without specifying which JSON path to use to collect the downloadable URLs.
# The content type and path arguments can still be provided.
persons.download()

Creation with resources added as parts

In [21]:

distribution_1 = forge.attach("../../data/associations.tsv")

In [22]:

distribution_2 = forge.attach("../../data/persons.csv")

In [23]:

jane = Resource(type="Person", name="Jane Doe", distribution=distribution_1)

In [24]:

john = Resource(type="Person", name="John Smith", distribution=distribution_2)

In [25]:

persons = [jane, john]

In [26]:

forge.register(persons)

<count> 2
<action> _register_many
<succeeded> True

In [27]:

dataset = Dataset(forge, name="Interesting people")

In [28]:

dataset.add_parts(persons)

In [29]:

forge.register(dataset)

<action> _register_one
<succeeded> True

In [30]:

forge.as_json(dataset)

Out[30]:

{'id': 'https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/46d34055-c662-4a7d-90f4-2c866f89cf57',
 'type': 'Dataset',
 'hasPart': [{'id': 'https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/0a2041a9-12f7-49aa-b302-48bb09450832?rev=1',
   'type': 'Person',
   'distribution': {'contentUrl': 'https://bbp.epfl.ch/nexus/v1/files/dke/kgforge/901d4b2e-2b67-4504-aca7-3ab93966dbad'},
   'name': 'Jane Doe'},
  {'id': 'https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/87652108-38ca-4fe2-8e41-9ba5ef22f32b?rev=1',
   'type': 'Person',
   'distribution': {'contentUrl': 'https://bbp.epfl.ch/nexus/v1/files/dke/kgforge/86eb143e-89cb-462d-af0e-48afb7172f2d'},
   'name': 'John Smith'}],
 'name': 'Interesting people'}

In [34]:

dataset.download(path="./downloaded/", source="parts")

In [35]:

! ls -l ./downloaded

total 32
-rw-r--r--  1 mfsy  staff  477 Apr 12 17:14 associations.tsv
-rw-r--r--  1 mfsy  staff  477 Apr 12 17:14 associations.tsv.20220412171438
-rw-r--r--  1 mfsy  staff   52 Apr 12 17:14 persons.csv
-rw-r--r--  1 mfsy  staff   52 Apr 12 17:14 persons.csv.20220412171438
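The listing above shows the effect of downloading into a directory that already holds files with the same names: the existing files are kept and the new copies receive a timestamp suffix. A minimal sketch of that naming rule follows. The helper resolve_target is hypothetical, and the suffix format (year, month, day, hour, minute, second) is inferred from the file names in the listing rather than taken from the library.

```python
import os
import tempfile
from datetime import datetime


def resolve_target(directory, filename, overwrite=False, now=None):
    """Return the path a download would be written to."""
    target = os.path.join(directory, filename)
    if overwrite or not os.path.exists(target):
        return target
    # Keep the existing file and suffix the new one with a timestamp.
    stamp = (now or datetime.now()).strftime("%Y%m%d%H%M%S")
    return f"{target}.{stamp}"


with tempfile.TemporaryDirectory() as folder:
    first = resolve_target(folder, "persons.csv")
    open(first, "w").close()  # simulate a first download
    second = resolve_target(
        folder, "persons.csv", now=datetime(2022, 4, 12, 17, 14, 38)
    )
    replaced = resolve_target(folder, "persons.csv", overwrite=True)
    names = [os.path.basename(p) for p in (first, second, replaced)]
print(names)
```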

In [31]:

# ! rm -R ./downloaded/

Creation from resources converted to Dataset objects

In [63]:

dataset = Dataset.from_resource(forge, [jane, john], store_metadata=True)
print(*dataset, sep="\n")

{
    id: https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/0a2041a9-12f7-49aa-b302-48bb09450832
    type: Person
    distribution:
    {
        type: DataDownload
        atLocation:
        {
            type: Location
            store:
            {
                id: https://bluebrain.github.io/nexus/vocabulary/diskStorageDefault
                type: DiskStorage
                _rev: 1
            }
        }
        contentSize:
        {
            unitCode: bytes
            value: 477
        }
        contentUrl: https://bbp.epfl.ch/nexus/v1/files/dke/kgforge/901d4b2e-2b67-4504-aca7-3ab93966dbad
        digest:
        {
            algorithm: SHA-256
            value: 789aa07948683fe036ac29811814a826b703b562f7d168eb70dee1fabde26859
        }
        encodingFormat: text/tab-separated-values
        name: associations.tsv
    }
    name: Jane Doe
}
{
    id: https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/87652108-38ca-4fe2-8e41-9ba5ef22f32b
    type: Person
    distribution:
    {
        type: DataDownload
        atLocation:
        {
            type: Location
            store:
            {
                id: https://bluebrain.github.io/nexus/vocabulary/diskStorageDefault
                type: DiskStorage
                _rev: 1
            }
        }
        contentSize:
        {
            unitCode: bytes
            value: 52
        }
        contentUrl: https://bbp.epfl.ch/nexus/v1/files/dke/kgforge/86eb143e-89cb-462d-af0e-48afb7172f2d
        digest:
        {
            algorithm: SHA-256
            value: 1dacd765946963fda4949753659089c5f532714b418d30788bedddfec47a389f
        }
        encodingFormat: text/csv
        name: persons.csv
    }
    name: John Smith
}

Creation from a dataframe

See the notebook 07 DataFrame IO.ipynb for details on creating instances of Resource from a pandas DataFrame.

Basics

In [37]:

dataframe = pd.read_csv("../../data/persons.csv")

In [38]:

dataframe

Out[38]:

	type	name
0	Person	Marie Curie
1	Person	Albert Einstein

In [39]:

persons = forge.from_dataframe(dataframe)

In [40]:

forge.register(persons)

<count> 2
<action> _register_many
<succeeded> True

In [41]:

dataset = Dataset(forge, name="Interesting people")

In [42]:

dataset.add_parts(persons)

In [43]:

forge.register(dataset)

<action> _register_one
<succeeded> True

In [44]:

forge.as_json(dataset)

Out[44]:

{'id': 'https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/5e1118bc-70b6-4b1d-b8ba-060a6f684230',
 'type': 'Dataset',
 'hasPart': [{'id': 'https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/58635f85-c6cc-4a7f-bf16-1558a5713080?rev=1',
   'type': 'Person',
   'name': 'Marie Curie'},
  {'id': 'https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/a9c2740b-2ef9-4557-841c-6ab425d35906?rev=1',
   'type': 'Person',
   'name': 'Albert Einstein'}],
 'name': 'Interesting people'}

Advanced

In [45]:

dataframe = pd.read_csv("../../data/associations.tsv", sep="\t")

In [46]:

dataframe

Out[46]:

	id	name	type	agent__type	agent__name	agent__gender__id	agent__gender__type	agent__gender__label	distribution
0	(missing)	Curie Association	Association	Person	Marie Curie	http://purl.obolibrary.org/obo/PATO_0000383	LabeledOntologyEntity	female	../../data/scientists-database/marie_curie.txt
1	(missing)	Einstein Association	Association	Person	Albert Einstein	http://purl.obolibrary.org/obo/PATO_0000384	LabeledOntologyEntity	male	../../data/scientists-database/albert_einstein...

In [47]:

dataframe["distribution"] = dataframe["distribution"].map(lambda x: forge.attach(x))
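When the resulting resources are printed before registration, their distribution shows up as a LazyAction wrapping Store.upload, which suggests that attaching a file only records the upload to perform later. The sketch below illustrates that deferred-execution pattern in plain Python. DeferredCall and fake_upload are hypothetical stand-ins written for this illustration; they are not kgforge classes or functions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class DeferredCall:
    """Hypothetical stand-in for a lazily executed operation."""

    operation: Callable[..., Any]
    args: List[Any] = field(default_factory=list)

    def execute(self):
        # Nothing runs until execute() is called explicitly.
        return self.operation(*self.args)


uploaded = []


def fake_upload(path):
    """Pretend to upload a file and return a minimal description of it."""
    uploaded.append(path)
    return {"name": path.rsplit("/", 1)[-1]}


pending = DeferredCall(fake_upload, ["../../data/scientists-database/marie_curie.txt"])
before = list(uploaded)     # still empty: the call has only been recorded
result = pending.execute()  # the simulated upload happens here
print(before, result)
```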

In [48]:

associations = forge.from_dataframe(dataframe, na="(missing)", nesting="__")

In [49]:

print(*associations, sep="\n")

{
    type: Association
    agent:
    {
        type: Person
        gender:
        {
            id: http://purl.obolibrary.org/obo/PATO_0000383
            type: LabeledOntologyEntity
            label: female
        }
        name: Marie Curie
    }
    distribution: LazyAction(operation=Store.upload, args=['../../data/scientists-database/marie_curie.txt', None])
    name: Curie Association
}
{
    type: Association
    agent:
    {
        type: Person
        gender:
        {
            id: http://purl.obolibrary.org/obo/PATO_0000384
            type: LabeledOntologyEntity
            label: male
        }
        name: Albert Einstein
    }
    distribution: LazyAction(operation=Store.upload, args=['../../data/scientists-database/albert_einstein.txt', None])
    name: Einstein Association
}
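The printed resources show how the flat columns of the dataframe became nested properties: columns such as agent__gender__label were folded under agent and gender, and the id column holding "(missing)" was left out. A minimal sketch of that folding on a single row is given below. The function nest_record is hypothetical and only mirrors the behaviour visible in the output; it is not the implementation used by forge.from_dataframe.

```python
def nest_record(flat, nesting="__", na="(missing)"):
    """Turn one flat row into a nested dict, dropping values equal to na."""
    nested = {}
    for column, value in flat.items():
        if value == na:
            continue  # treat the marker as a missing value
        *parents, leaf = column.split(nesting)
        target = nested
        for parent in parents:
            # Create intermediate dictionaries on demand.
            target = target.setdefault(parent, {})
        target[leaf] = value
    return nested


# First row of associations.tsv, without the distribution column.
row = {
    "id": "(missing)",
    "name": "Curie Association",
    "type": "Association",
    "agent__type": "Person",
    "agent__name": "Marie Curie",
    "agent__gender__id": "http://purl.obolibrary.org/obo/PATO_0000383",
    "agent__gender__type": "LabeledOntologyEntity",
    "agent__gender__label": "female",
}
record = nest_record(row)
print(record)
```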

In [50]:

forge.register(associations)

<count> 2
<action> _register_many
<succeeded> True

In [51]:

dataset = Dataset(forge, name="Interesting associations")

In [52]:

dataset.add_parts(associations)

In [53]:

forge.register(dataset)

<action> _register_one
<succeeded> True

In [54]:

forge.as_json(dataset)

Out[54]:

{'id': 'https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/ae57773c-9e71-4ec0-85b9-6e5f52a04349',
 'type': 'Dataset',
 'hasPart': [{'id': 'https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/8fc91342-42f7-4946-8a5d-a21be3448684?rev=1',
   'type': 'Association',
   'distribution': {'contentUrl': 'https://bbp.epfl.ch/nexus/v1/files/dke/kgforge/5e4e7cb5-707a-4d4a-9682-36e5b993fe40'},
   'name': 'Curie Association'},
  {'id': 'https://bbp.epfl.ch/nexus/v1/resources/dke/kgforge/_/d78d842a-89f9-4273-bc70-9666c1f72781?rev=1',
   'type': 'Association',
   'distribution': {'contentUrl': 'https://bbp.epfl.ch/nexus/v1/files/dke/kgforge/1dc1c1a2-f13c-47d6-9533-92c97bf5a5d6'},
   'name': 'Einstein Association'}],
 'name': 'Interesting associations'}