In this notebook you will load the citation dataset into Neo4j.
You will start by importing the py2neo library, which you will use to import the data into Neo4j. py2neo is a client library and toolkit for working with Neo4j from within Python applications. It is well suited to data science workflows and integrates well with other Python data science tools.
from py2neo import Graph
# Change the line of code below to use the credentials you tested in the previous notebook.
# graph = Graph("<Bolt URL>", auth=("neo4j", "<Password>"))
graph = Graph("bolt://localhost:7687", auth=("neo4j", "letmein"))
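Rather than hard-coding credentials in the notebook, you may prefer to read them from the environment. The sketch below is one common pattern and is not part of the course material; the variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` are assumptions.

```python
import os

def neo4j_credentials():
    # Hypothetical helper: the environment variable names are an
    # assumption, not part of the course material. Falls back to the
    # local defaults used in this notebook.
    uri = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
    user = os.environ.get("NEO4J_USER", "neo4j")
    password = os.environ.get("NEO4J_PASSWORD", "letmein")
    return uri, (user, password)

uri, auth = neo4j_credentials()
# With py2neo available you would then connect with:
# graph = Graph(uri, auth=auth)
```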
First, create some constraints to make sure you don't import duplicate data:
display(graph.run("CREATE CONSTRAINT ON (a:Article) ASSERT a.index IS UNIQUE").stats())
display(graph.run("CREATE CONSTRAINT ON (a:Author) ASSERT a.name IS UNIQUE").stats())
display(graph.run("CREATE CONSTRAINT ON (v:Venue) ASSERT v.name IS UNIQUE").stats())
constraints_added: 1 constraints_removed: 0 contained_updates: True indexes_added: 0 indexes_removed: 0 labels_added: 0 labels_removed: 0 nodes_created: 0 nodes_deleted: 0 properties_set: 0 relationships_created: 0 relationships_deleted: 0
constraints_added: 1 constraints_removed: 0 contained_updates: True indexes_added: 0 indexes_removed: 0 labels_added: 0 labels_removed: 0 nodes_created: 0 nodes_deleted: 0 properties_set: 0 relationships_created: 0 relationships_deleted: 0
constraints_added: 1 constraints_removed: 0 contained_updates: True indexes_added: 0 indexes_removed: 0 labels_added: 0 labels_removed: 0 nodes_created: 0 nodes_deleted: 0 properties_set: 0 relationships_created: 0 relationships_deleted: 0
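Note that the `ON … ASSERT` form used above is the Neo4j 3.x/4.x syntax; it is deprecated from Neo4j 4.4 and removed in Neo4j 5 in favour of `FOR … REQUIRE`. If you are running a newer server, the equivalent statements would look like this (a sketch assuming the same labels and properties; running them requires the live connection from above):

```python
# The FOR ... REQUIRE constraint syntax used by Neo4j 4.4+.
newer_constraints = [
    "CREATE CONSTRAINT FOR (a:Article) REQUIRE a.index IS UNIQUE",
    "CREATE CONSTRAINT FOR (a:Author) REQUIRE a.name IS UNIQUE",
    "CREATE CONSTRAINT FOR (v:Venue) REQUIRE v.name IS UNIQUE",
]
# Against a live Neo4j 4.4+ database you would run:
# for statement in newer_constraints:
#     graph.run(statement)
```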
Next, load the data into the database. You will create nodes for Articles, Venues, and Authors.
query = """
CALL apoc.periodic.iterate(
  'UNWIND ["dblp-ref-0.json", "dblp-ref-1.json", "dblp-ref-2.json", "dblp-ref-3.json"] AS file
   CALL apoc.load.json("https://github.com/neo4j-contrib/training-v3/raw/master/modules/gds-data-science/supplemental/data/" + file)
   YIELD value
   RETURN value',
  'MERGE (a:Article {index: value.id})
   SET a += apoc.map.clean(value, ["id", "authors", "references", "venue"], [0])
   WITH a, value.authors AS authors, value.references AS citations, value.venue AS venue
   MERGE (v:Venue {name: venue})
   MERGE (a)-[:VENUE]->(v)
   FOREACH (author IN authors |
     MERGE (b:Author {name: author})
     MERGE (a)-[:AUTHOR]->(b))
   FOREACH (citation IN citations |
     MERGE (cited:Article {index: citation})
     MERGE (a)-[:CITED]->(cited))',
  {batchSize: 1000, iterateList: true});
"""
graph.run(query).to_data_frame()
| | batch | batches | committedOperations | errorMessages | failedBatches | failedOperations | failedParams | operations | retries | timeTaken | total | wasTerminated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'failed': 0, 'committed': 52, 'total': 52, 'e... | 52 | 51956 | {} | 0 | 0 | {} | {'failed': 0, 'committed': 51956, 'total': 519... | 0 | 57 | 51956 | False |
Finally, delete any Article nodes that were created only as citation targets and therefore have no title property:
query = """
MATCH (a:Article) WHERE NOT exists(a.title)
DETACH DELETE a
"""
graph.run(query).stats()
constraints_added: 0 constraints_removed: 0 contains_updates: False indexes_added: 0 indexes_removed: 0 labels_added: 0 labels_removed: 0 nodes_created: 0 nodes_deleted: 0 properties_set: 0 relationships_created: 0 relationships_deleted: 0
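As a quick sanity check before moving on, you can count the nodes per label. This query is a sketch and not part of the original notebook; running it requires the live connection from above.

```python
# Count nodes by label to confirm the import looks sensible.
LABEL_COUNTS_QUERY = """
MATCH (n)
RETURN labels(n)[0] AS label, count(*) AS count
ORDER BY count DESC
"""
# With a running database:
# graph.run(LABEL_COUNTS_QUERY).to_data_frame()
```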
In the next notebook you will explore the data that you have imported.