#!/usr/bin/env python
# coding: utf-8
# In[1]:
### Loading Credentials from local file;
### this cell is meant to be deleted before publishing
import yaml
with open("../creds.yml", 'r') as ymlfile:
cfg = yaml.safe_load(ymlfile)
uri = cfg["sonar_creds"]["uri"]
user = cfg["sonar_creds"]["user"]
password = cfg["sonar_creds"]["pass"]
# SoNAR (IDH) - HNA Curriculum
#
# Notebook 3: SoNAR (IDH)
# This curriculum was created for the SoNAR (IDH) project. SoNAR (IDH) is at its core a graph-based approach to structure and link large amounts of historical data (more on the SoNAR (IDH) project and database can be found in this notebook). Therefore, the whole curriculum focuses on graph theory and network analysis.
#
# This notebook provides an introduction to the SoNAR (IDH) database and its underlying Neo4j graph-database technology as well as the Cypher query language which is part of the Neo4j ecosystem.
# # Project summary
# [SoNAR (IDH)](https://sonar.fh-potsdam.de/) is short for **Interfaces to Data for Historical Social Network Analysis and Research**. The main objective of the project is the examination and evaluation of approaches to build and operate an advanced research technology environment supporting HNA.
#
# SoNAR (IDH) is a research project carried out in collaboration between the following institutions:
#
# * [Deutsches Forschungszentrum für Künstliche Intelligenz](https://www.dfki.de/)
# * [Fachhochschule Potsdam](http://uclab.fh-potsdam.de/)
# * [Humboldt-Universität zu Berlin](https://www.ibi.hu-berlin.de/)
# * [Staatsbibliothek zu Berlin](https://staatsbibliothek-berlin.de/en/)
# * [Heinrich-Heine-Universität Düsseldorf](https://www.uniklinik-duesseldorf.de/en/department-of-the-history-philosophy-and-ethics-of-medicine)
#
#
# One of the main elements of the SoNAR (IDH) project is a [Neo4j](https://neo4j.com/) graph database. This database contains the merged data of multiple archives and libraries.
# See [Chapter 2](#Data-Description) for more details about the structure and the contents of the SoNAR (IDH) database.
# # Data description
#
# The SoNAR (IDH) database consists of nodes and edges. Each node and edge has additional properties that provide rich meta information.
#
# This data description section provides details about the data sources and overall characteristics of the data. The section is based on the state of the SoNAR (IDH) database as of February 2021. A diagram of the database schema can be found [here](https://camo.githubusercontent.com/9262db5eb53360acb5ccc2249ff97b4b7d82ee9199bdcb8563980f16b9d7cc95/68747470733a2f2f7472656c6c6f2d6174746163686d656e74732e73332e616d617a6f6e6177732e636f6d2f3564323530353865393136326235363762383630313439662f3565336331336262363037323836353631636335366635372f62646664383838363964376633656465616663366232633130326361666663342f556d6c4d6f64656c2e737667).
#
#
#
# Hint: Nodes, edges and the respective properties were retrieved from different data sources. Edges of the type `SocialRelation`, however, are implicit edges and were derived based on `Resource` nodes.
# ## Summary stats
# The SoNAR (IDH) database has the following aggregated characteristics:
#
# **Nodes Summary**
#
# * 9 categories of Nodes
# * 34.511.952 Nodes
#
# |Node Type | Node Count |
# |---------------|------------|
# | CorpName | 1.487.711 |
# | GeoName | 308.197 |
# | MeetName | 814.044 |
# | PerName | 5.087.660 |
# | TopicTerm | 212.135 |
# | UniTitle | 385.300 |
# | ChronTerm | 537.054 |
# | IsilTerm | 611 |
# | Resource | 25.679.240 |
#
# **Edges Summary**
#
# * 10 categories of Edges
# * 98.530.160 Edges
#
#
# | Edge Type | Edge Count |
# |---------------------|------------|
# | RelationToPerName | 14.630.465 |
# | RelationToCorpName | 5.099.190 |
# | RelationToMeetName | 263.180 |
# | RelationToUniTitle | 53.998 |
# | RelationToTopicTerm | 4.951.617 |
# | RelationToGeoName | 5.140.556 |
# | RelationToChronTerm | 5.446.841 |
# | RelationToIsil | 55.556.913 |
# | RelationToResource | 7.387.400 |
# | SocialRelation | 40.301.595 |
# ## Data sources
# SoNAR (IDH) combines data from four different data sources. The table below provides a compact overview:
#
#
# |Data Source | Number of Nodes | Number of Edges (incl. *RelationToIsilTerm*) |
# |--------|----------| -------- |
# |[GND (*Integrated Authority File*)](https://www.dnb.de/EN/Professionell/Standardisierung/GND/gnd_node.html)| 8.295.047 | 32.776.628 |
# |[DNB (*German National Library*)](https://www.dnb.de/EN/Home/home_node.html)|19.384.733| 5.655.859 |
# |[ZDB (*Zeitschriftendatenbank*)](https://www.zeitschriftendatenbank.de/startseite/)|1.908.334| 43.419.339 |
# |[KPE (*Kalliope Union Catalog*)](https://kalliope-verbund.info/en/index.html)|4.386.173| 16.678.334 |
# |[SBB (*Katalog der Staatsbibliothek zu Berlin*)](https://stabikat.de)|to be added|to be added|
#
# # Data access
# We will need some specific libraries to work with the SoNAR (IDH) database. Let's start with installing the `neo4j` library.
#
# If you are using the curriculum on Binder or running it as a Docker container locally, the package is already installed. If you want to interact with the SoNAR (IDH) database independently, install the package with the following line in a new notebook cell:
#
# ```python
# !pip install neo4j
# ```
# In[2]:
from neo4j import GraphDatabase
driver = GraphDatabase.driver(uri, auth=(user, password))
# With the code above we create a [Neo4j driver object](https://neo4j.com/docs/api/python-driver/current/api.html#driver). This driver stores the connection details for the database. We can use this driver now to send requests to the database.
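#
# As a quick sanity check, we can send a trivial query through the driver (sessions are explained in the next chapter). This is a minimal sketch; explicitly closing the driver is optional but releases the connection once all work is done:
# In[ ]:
# minimal connectivity check: open a session and run a trivial query
with driver.session() as session:
    print(session.run("RETURN 1 AS ok").single()["ok"])
# when finished with all queries, the driver can be closed explicitly:
# driver.close()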
# # Data exploration
# Data exploration is usually the very first thing to do when working with new data. So let's start diving into the SoNAR (IDH) database by exploring it.
#
# Whenever we want to retrieve data from the Neo4j database of SoNAR (IDH), we can use a query language called the "**Cypher Query Language**". Cypher provides a comparatively easy-to-comprehend syntax for requesting data from the database. Furthermore, Cypher provides an extensive set of tools for applying graph algorithms, data science methods and data wrangling procedures.
#
# Throughout this curriculum we will use this Cypher Query Language whenever we directly retrieve data from SoNAR (IDH).
# A more in-depth introduction to Cypher can be found [here](https://neo4j.com/docs/getting-started/current/cypher-intro/). More external resources are listed in the [Cypher summary chapter](#Summary-Cypher-Query-Language).
# ## Nodes
# ### Node labels
# We start off by requesting the database to return all [node labels](https://neo4j.com/docs/getting-started/current/graphdb-concepts/#graphdb-labels). Node labels are categories nodes can belong to. You can think of them as entity groups. The SoNAR (IDH) database distinguishes between persons, corporations and more. Let's ask the database to return all the labels available.
# In[3]:
with driver.session() as session:
    result = session.run("CALL db.labels()").data()
result
# **Code Breakdown:**
#
# >The `with` statement is used to make the database call as resource-efficient and concise as possible. There are more advantages to the `with` statement, but explaining them would exceed the scope of this curriculum. An in-depth explanation of the `with` statement can be found [here](https://www.python.org/dev/peps/pep-0343/).
# >
# >When we request data from the database, we need to establish a connection (`session`). The `driver` object we created earlier stores the connection details. When we use the method `driver.session()`, we establish a new connection. This connection is assigned to the `session` object for the duration of the `with` statement.
# >
# >The most relevant part of the code for retrieving the data is `"CALL db.labels()"`. This part is the actual Cypher query. The `CALL` clause is used to call the `db.labels()` procedure. More details about Neo4j procedures can be found below.
# >
# >The result of this code chunk is a list that contains a key-value pair (`dictionary`) per label in the database.
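#
# To make the role of the `with` statement concrete, here is the same database call written without it (a sketch - the `with` form above is preferred because the session is closed even if the query raises an error):
# In[ ]:
# explicit equivalent of the `with` block above
session = driver.session()
try:
    result = session.run("CALL db.labels()").data()
finally:
    session.close()  # always close the connection, even on errors
result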
#
#
#
#
# Hint: Some pieces of code used in this curriculum might seem a little confusing for beginners. Most of the code chunks in this curriculum are written to work as "recipes" - even if you do not understand the code in every detail, you can easily adjust it to your specific use case by making small changes.
# Don't feel discouraged when you feel lost - just try to follow along with the explanations.
#
# Some useful built-in procedures for exploring and describing the database are listed in the table below. You can get a full list of built-in procedures with the query `CALL dbms.procedures()` (see the example after the table).
#
#
# |Procedure | Description |
# |---------|----------|
# |`db.labels()`| List all labels in the database.|
# |`db.propertyKeys()`|List all property keys in the database.|
# |`db.relationshipTypes()`|List all relationship types in the database. |
# |`db.schema()`| Show the schema of the data. |
# |`db.stats.retrieve()`|Retrieve statistical data about the current database. Valid sections are 'GRAPH COUNTS', 'TOKENS', 'QUERIES', 'META'.|
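# For example, the following cell lists the first few entries returned by `CALL dbms.procedures()`; each entry is a dictionary with (among other fields) the procedure's name, signature and description:
# In[ ]:
with driver.session() as session:
    procedures = session.run("CALL dbms.procedures()").data()
# show only the first five entries to keep the output short
procedures[:5]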
# ### 📝 Exercise
#
# Now, try one of the other methods listed in the table above by following the same procedure we used with the `db.labels()` call.
# In[ ]:
# ### Selecting nodes
# You can select nodes by using the `MATCH` statement. Cypher uses [ASCII-art](https://en.wikipedia.org/wiki/ASCII_art) style syntax to define nodes, relationships and the direction of relationships in queries.
#
# Nodes are referred to by using parentheses `()`. Inside the parentheses, you can define a node variable. This variable can be used to refer to a specific set of nodes throughout the rest of the query.
#
# The example below matches any kind of node and assigns it the variable name `n`. We use the `LIMIT` statement to tell the database we only want the first 5 results. The number of results can drastically increase the response time of the database, so the `LIMIT` statement often comes in handy when you want to test a query or expect a very large number of results.
#
# The `RETURN` statement defines what the database returns after your query was evaluated. You can be very specific in this statement in case you only want to retrieve certain aspects of the query results.
# In[4]:
# define query
query = """
MATCH (n)
RETURN n
LIMIT 5
"""
# send query to database
with driver.session() as session:
    result = session.run(query).data()
# print result
result
# The output above is produced by calling the `.data()` method of the [Neo4j Python Driver](https://neo4j.com/docs/api/python-driver/current/api.html). This method returns the result of our query as a list of dictionaries. This result type is quite versatile since we can further manipulate the output to our liking by applying filters or transforming the result to different formats (e.g. Pandas data frame).
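#
# For example, the node dictionaries can be flattened into a Pandas data frame (a small sketch; the key `"n"` corresponds to the variable name we chose in the `RETURN` statement):
# In[ ]:
import pandas as pd
# each record looks like {"n": {<node properties>}}, so we extract
# the inner dictionaries and let Pandas build the columns
df = pd.DataFrame([record["n"] for record in result])
df.head()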
#
#
#
# Hint: We are using triple quotes `""" ... """` for the query to tell Python we are writing a character string over multiple lines. We are doing this so the query looks tidy and well-structured. You could also write the full query in one line - but this results in bad readability and makes debugging more difficult.
#
# ### Filtering nodes
# In the next step, we want to apply filters inside the query, so we have control over the nodes we retrieve from the database.
#
# The query below only returns one node of the type `PerName` without specifying which exact node we want to retrieve.
# In[5]:
# define query
query = """
MATCH (n:PerName)
RETURN n
LIMIT 1"""
# send query to database
with driver.session() as session:
    result = session.run(query).data()
# print result
result
# **Filtering nodes by properties**
# Now, let's try to find a specific person: Max Weber, the sociologist and political economist.
# We can define a filter based on properties of a node. The query below only returns nodes that have "Weber, Max" as their `Name` property. The names in SoNAR are based on the GND entries and follow the order `last name, first name`. You can check out GND entries on https://portal.dnb.de/.
#
# We suspect that the name "Weber, Max" is not unique within the large SoNAR (IDH) database. So we want to check how many Max Webers we can find. For that, we return the count of matching nodes (`RETURN count(n)`) instead of the nodes themselves.
# In[6]:
query = """
MATCH (n:PerName {Name: 'Weber, Max'})
RETURN count(n)
"""
with driver.session() as session:
    result = session.run(query).data()
result
# In fact, we detected 34 hits in the database. So we need to apply more filters to find the correct Max Weber.
#
# Let's start by checking what properties are available for nodes of type `PerName`:
# In[7]:
query = """
MATCH (n:PerName)
WITH LABELS(n) AS labels , KEYS(n) AS keys
UNWIND labels AS label
UNWIND keys AS key
RETURN DISTINCT label, COLLECT(DISTINCT key) AS props
ORDER BY label
"""
with driver.session() as session:
    result = session.run(query).data()
result
# Before we take a look at the output, let's briefly talk about the query:
#
# In this query, we use two Cypher list functions (`LABELS()` and `KEYS()`). These functions return a list of the element they are applied on (`KEYS(n)` returns all property names of the nodes captured in `n` as a list). The `UNWIND` clause
# is used to expand the created lists back to individual rows. Finally, we match the distinct labels (we only include `PerName` nodes in this query) with a list of distinct properties that belong to `PerName` nodes.
#
# Here you can find the documentation for the applied functions and clauses:
#
# * [`LABELS()` & `KEYS()`](https://neo4j.com/docs/cypher-manual/current/functions/list/#functions-keys)
# * [`UNWIND`](https://neo4j.com/docs/cypher-manual/current/clauses/unwind/)
# * [`DISTINCT`](https://neo4j.com/docs/cypher-manual/current/clauses/return/#return-unique-results)
# * [`COLLECT()`](https://neo4j.com/docs/cypher-manual/current/functions/aggregating/#functions-collect)
#
# Now, let's take a look at the result:
#
# We can see that there are several `date` properties for `PerName` nodes. The year of birth is stored in the property called `DateApproxBegin`.
# So let's apply a date filter. Let's assume we only know that Max Weber was born in the year 1864, and we want to filter based on this information.
# In[8]:
query = """
MATCH (n:PerName)
WHERE n.Name = "Weber, Max" AND n.DateApproxBegin = "1864"
RETURN n
"""
with driver.session() as session:
    result = session.run(query).data()
result
# In the query above, we used a `WHERE` clause to apply a filter. You can define multiple conditions inside a filter, e.g. by concatenating multiple logical conditions with `AND`, `OR` or `XOR`. See this [documentation page](https://neo4j.com/docs/cypher-manual/current/clauses/where/#boolean-operations) for more details.
#
#
# As a last example, let's assume we only know that the last name of Max Weber is spelled "韦伯" in Chinese, so we need to use this information as a filter.
#
# In the query result above, you see a node property called `VariantName`. This property stores alternative variants of the name we are looking for. So let's see how we can query the database by searching within this property using the `CONTAINS` operator (click [here](https://neo4j.com/docs/cypher-manual/current/clauses/where/#match-string-contains) for more details):
# In[9]:
query = """
MATCH (n:PerName)
WHERE n.VariantName CONTAINS "韦伯"
RETURN n
"""
with driver.session() as session:
    result = session.run(query).data()
result
#
#
# Hint: The most reliable way to select specific nodes of the SoNAR (IDH) database is by using the `Id` property. The `Id` property is a combination of the ISIL (International Standard Identifier for Libraries and Related Organisations) and the GND-ID.
# The `Id` of Max Weber is `(DE-588)118629743`: `DE-588` is the ISIL code of the GND (Gemeinsame Normdatei) and `118629743` is the GND-ID of Max Weber.
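#
# Following this hint, the query below selects Max Weber directly via the `Id` property - no further disambiguation needed:
# In[ ]:
query = """
MATCH (n:PerName {Id: "(DE-588)118629743"})
RETURN n.Name, n.DateApproxBegin
"""
with driver.session() as session:
    result = session.run(query).data()
result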
#
# ## Relationships
# ### Relationship types
# Similar to node labels, we can retrieve the categories of the relations inside the database. Every relation must have exactly one relationship type. This type defines the kind or category the relation belongs to.
# In[10]:
with driver.session() as session:
    result = session.run("CALL db.relationshipTypes()").data()
result
# ### Selecting relationships
# In the section about nodes, we saw that we need to use parentheses `()` to select nodes. When selecting relationships, on the other hand, we need to use brackets `[]` instead.
#
# Additionally, we cannot simply query for plain relationships; we need to define a pattern in which the relationship appears in the database.
#
# The simplest relationship pattern we can define is: the relationship needs to exist between any two nodes. In the Cypher query language, this is expressed as:
# >`()-[r]-()`
#
# In[11]:
query = """
MATCH ()-[r]-()
RETURN r
LIMIT 5
"""
with driver.session() as session:
    result = session.run(query).data()
result
# ### Filtering relationships
# You can filter relationships in a similar fashion to nodes.
# Let's retrieve relationships of the type `SocialRelation`.
# In[12]:
query = """
MATCH ()-[r:SocialRelation]-()
RETURN r
LIMIT 5
"""
with driver.session() as session:
    result = session.run(query).data()
result
# This result is correct, but the output is not very informative. Let's do some deeper exploration of the relationships.
# **Filtering Relationships by Properties**
# Just like nodes, relationships can have properties that provide meta information about the relation. Let's check the properties of the five relationships we retrieved above:
# In[13]:
query = """
MATCH p = ()-[r:SocialRelation]-()
UNWIND relationships(p) as rel
RETURN properties(rel) as properties
LIMIT 5
"""
with driver.session() as session:
    result = session.run(query).data()
result
# As we can see, relationships of the type `SocialRelation` have three different properties:
#
# * `TypeAddInfo` - either **directed** or **undirected**
# * `SourceType` - can take the values: **associatedRelation**, **areCoAuthors**, **areCoEditors**, **affiliatedRelation**, **correspondedRelation**, **knows**
# * `Source` - **id** of the source
#
#
#
# Hint: As mentioned earlier, `SocialRelation` edges are derived from `Resource` nodes. The `Source` property of a `SocialRelation` is the `Id` of the corresponding `Resource`.
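#
# As a quick check on these property values, we can count how often each `SourceType` occurs among the `SocialRelation` relationships (note: this aggregates over the whole database, so the query may take a moment):
# In[ ]:
query = """
MATCH ()-[r:SocialRelation]-()
RETURN r.SourceType AS source_type, count(*) AS count
ORDER BY count DESC
"""
with driver.session() as session:
    result = session.run(query).data()
result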
#
# Let's use the properties to filter for people that are connected to each other because they had a **correspondence** with each other.
# In[14]:
# in the RETURN clause we define specifically what elements
# we want to retrieve, this way the output is easier to read
query = """
MATCH (n1:PerName)-[r:SocialRelation]-(n2:PerName)
WHERE r.SourceType = "correspondedRelation"
RETURN n1.Name, n2.Name, r.SourceType, r.TypeAddInfo
LIMIT 5
"""
with driver.session() as session:
    result = session.run(query).data()
result
# We can see that all of these relationships have a `TypeAddInfo` of **directed**. Relationships can be directed or undirected. In the SoNAR (IDH) database, all correspondences are directed and therefore hold the information whether someone was contacted or contacted someone else.
#
# Let's see who received letters from Max Weber. The query below extends the basic `()-[]-()` structure for representing a node-relationship search pattern with a `>`. This arrow defines that we are searching only for directed relationships. So the new pattern scaffolding is `()-[]->()`.
# In[15]:
query = """
MATCH (n1:PerName)-[r:SocialRelation]->(n2:PerName)
WHERE n1.Name = "Weber, Max" AND n1.DateApproxBegin = "1864"
AND r.SourceType = "correspondedRelation"
RETURN n1.Name, n2.Name, r.SourceType, r.TypeAddInfo
"""
with driver.session() as session:
    result = session.run(query).data()
result
# So far, we only focused on retrieving textual outputs from our queries. But of course we can visualize networks too. The code block below gives a quick example of how we can visualize the query output as a network.
#
# In the code below, we are going to use a custom-written function (`to_nx_graph()`). This function is stored in another Python file, and hence we can load it as if it were a library of its own. You can find a more in-depth explanation of the steps below in the chapter [Complex Queries & Data Preparation](#Complex-Queries-&-Data-Preparation).
#
# The query below is an extension of the query we just used. We check out the network of people Max Weber corresponded with, but we also take a look into the second degree of the same relationships. So we also check the correspondences of the people Max Weber corresponded with.
# In[3]:
# the line below loads in the custom function "to_nx_graph()". See chapter 6 for more details.
from helper_functions.helper_fun import to_nx_graph
driver = GraphDatabase.driver(uri, auth=(user, password))
query = """
MATCH (n1:PerName)-[r:SocialRelation]->(n2:PerName)-[r2:SocialRelation]->(n3:PerName)
WHERE n1.Id = "(DE-588)118629743" AND r.SourceType = "correspondedRelation" AND r2.SourceType = "correspondedRelation"
RETURN *
"""
G = to_nx_graph(neo4j_driver=driver,
                query=query)
# For the visualizations we are going to use a custom draw function. Please check out Chapter 3 in Notebook 2 for more details.
# In[11]:
from matplotlib.colors import rgb2hex
from matplotlib.patches import Circle
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
# defining general variables
## we start off by setting the position of nodes and edges again
pos = nx.kamada_kawai_layout(G)
## set the color map to be used
color_map = plt.cm.plasma
## extract the node label attribute from graph object
#node_labels = nx.get_node_attributes(G, "label")
# setup node_colors
node_color_attribute = "type"
groups = set(nx.get_node_attributes(G, node_color_attribute).values())
group_ids = np.array(range(len(groups)))
if len(group_ids) > 1:
    group_ids_norm = (group_ids - np.min(group_ids))/np.ptp(group_ids)
else:
    group_ids_norm = group_ids
mapping = dict(zip(groups, group_ids_norm))
node_colors = [mapping[G.nodes()[n][node_color_attribute]] for n in G.nodes()]
# defining the graph options & styling
## dictionary for node options:
node_options = {
    "pos": pos,
    "node_size": 150,
    "alpha": 0.5,
    "node_color": node_colors, # here we set the node_colors object as an option
    "cmap": color_map # this cmap defines the color scale we want to use
}
## dictionary for edge options:
edge_options = {
    "pos": pos,
    "width": 1.5,
    "alpha": 0.2,
}
## set plot size and plot margins
plt.figure(figsize=[20, 20])
plt.margins(x=0.1, y = 0.1)
# draw the graph
## draw the nodes
nx.draw_networkx_nodes(G, **node_options)
## draw the edges
nx.draw_networkx_edges(G, **edge_options)
# create custom legend according to color_map
geom_list = [Circle([], color = rgb2hex(color_map(float(mapping[term])))) for term in groups]
plt.legend(geom_list, groups)
# show the plot
plt.show()
# ### 📝 Exercise
#
# 1. Write a query that retrieves all `RelationToGeoName` edges from Max Weber as well as the corresponding `GeoName` nodes.
#
# 2. Visualize the resulting graph (see Notebook 2 for an explanation on how to visualize graphs).
# In[ ]:
# ## Summary Cypher query language
# In this section about data exploration, we took a quick look at the very basics of the **Cypher Query Language**. Whenever you want to retrieve data directly from the SoNAR (IDH) database, you need to write a Cypher query.
#
# A full introduction to this query language would exceed the scope of this curriculum, but the list below provides an overview of good resources for digging deeper into Cypher:
#
# * [A quick introduction to Cypher basics by Neo4J](https://neo4j.com/docs/getting-started/current/cypher-intro/)
# * [The official Neo4j Cypher manual](https://neo4j.com/docs/cypher-manual/current/)
# * [The Cypher Query Language Developer Guide](https://neo4j.com/developer/cypher/)
# * [Free Online Courses by the Neo4j GraphAcademy](https://neo4j.com/graphacademy/#_take_a_free_course)
#
#
# The upcoming sections of this curriculum also rely heavily on Cypher, but there won't be a detailed explanation of every clause and command used. You can see these cells as code recipes. Check out the aforementioned resources for documentation of the applied Cypher clauses.
# # Descriptive analysis
# ## General database summaries
# We can also aggregate values and do more complex calculations with Cypher. Let's create a summary of how many **Nodes**, **Relationships**, **Node Labels** and **Relationship Types** are inside the database.
# In[19]:
driver = GraphDatabase.driver(uri, auth=(user, password))
query = """
MATCH (n)
RETURN 'Number of Nodes: ' + count(n) as output
UNION
MATCH ()-[]->()
RETURN 'Number of Relationships: ' + count(*) as output
UNION
CALL db.labels() YIELD label
RETURN 'Number of Labels: ' + count(*) AS output
UNION
CALL db.relationshipTypes() YIELD relationshipType
RETURN 'Number of Relationship Types: ' + count(*) AS output
"""
with driver.session() as session:
    result = session.run(query).data()
result
# ### Summarize node labels
# In the next code cell, we calculate the count of each node category in the database.
# In[20]:
driver = GraphDatabase.driver(uri, auth=(user, password))
query = """
MATCH (n)
RETURN DISTINCT COUNT(LABELS(n)) AS count, LABELS(n) AS label
ORDER BY count
"""
with driver.session() as session:
    result = session.run(query).data()
result
# ### Summarize relationship types
# We can do the same count calculation for relationship types too. However, the query below uses a slightly different logic to retrieve the count per relationship type than the query we applied to the nodes above.
#
# The query below calls the procedure `db.relationshipTypes()` to retrieve a list of all relationship types in the database. Afterwards, we use a procedure called `apoc.cypher.run()`. This procedure can be used to execute a Cypher query per row. We use this procedure to run the `count` function for each type retrieved from `db.relationshipTypes()`.
#
# This way of writing the query is a lot faster than the approach we used above in the section [Summarize node labels](#Summarize-node-labels).
# In[21]:
query = """
CALL db.relationshipTypes() YIELD relationshipType as type
CALL apoc.cypher.run('MATCH ()-[:`'+type+'`]->() RETURN count(*) as count',{}) YIELD value
RETURN type, value.count AS count
ORDER BY count
"""
with driver.session() as session:
    result = session.run(query).data()
result
# We can also easily create a plot using the result we just generated. The code block below uses Pandas to convert the result we got in the code block above into a data frame. Furthermore, we use the Pandas method `plot.bar` to create a bar plot. More details on the method `plot.bar` can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html).
#
# In[22]:
import pandas as pd
pd.DataFrame(result).plot.bar(x="type", y="count")
# ## Graph analysis & algorithms
# ### Degree centrality
# Centrality algorithms can be used to uncover the roles and importance of nodes in a network. There are many ways to measure the centrality of a node. The example below uses the **Degree centrality** as one of the simplest centrality measures. (Needham & Hodler, 2019)
#
# Degree centrality simply counts the number of incoming and outgoing relationships of a node. Degree centrality was introduced by Freeman in his paper "Centrality in social networks conceptual clarification" (1978).
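#
# To make the measure concrete, here is a tiny toy example using plain `networkx` (not SoNAR data) - `degree_centrality()` divides each node's degree by the maximum possible degree `n - 1`:
# In[ ]:
import networkx as nx
# a small toy graph: A is connected to everyone, D only to A
toy = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])
# with 4 nodes, each degree is normalized by 4 - 1 = 3
nx.degree_centrality(toy)
# -> {'A': 1.0, 'B': 0.667, 'C': 0.667, 'D': 0.333} (rounded)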
#
#
# The example below calculates the number of `SocialRelation` relationships for `PerName` nodes and returns the top 10 people with the most social relationships in the SoNAR (IDH) database.
#
# More information about Cypher based centrality procedures can be found [here](https://neo4j.com/docs/graph-data-science/current/algorithms/centrality/).
# In[24]:
# In the query below we use the built-in degree centrality procedure of Neo4j.
# We define a "node projection" and a "relationship projection", to narrow down the degree centrality calculation
# to a specific subset of nodes and edges.
# More details can be found by following the link mentioned in the text above.
query = """
CALL gds.alpha.degree.stream({
nodeProjection: {type: "PerName"},
relationshipProjection: {type: "SocialRelation"}
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).Name AS Name, score
ORDER BY score DESC
LIMIT 10
"""
with driver.session() as session:
    result = session.run(query).data()
result
# ### 📝 Exercise
#
# Task:
#
# 1. Calculate the top ten degree centrality of all `PerName` nodes with respect to `RelationToPerName` relationships.
# In[ ]:
# ### Shortest path
# As shown in Notebook 2, we can use a pathfinding algorithm to find the shortest path between two nodes. The shortest path algorithm can take weighted relationships into account and is widely applied in navigation systems.
#
# Furthermore, the detection of shortest paths can provide insights into how close people are to each other, how similar they might be, or whether they share something in common. (Needham & Hodler, 2019)
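#
# As a toy illustration with plain `networkx` (not SoNAR data): the direct edge between `A` and `C` is more expensive than the detour via `B`, so the weighted shortest path takes the detour:
# In[ ]:
import networkx as nx
toy = nx.Graph()
toy.add_weighted_edges_from([("A", "B", 1.0), ("B", "C", 1.0), ("A", "C", 3.0)])
# total cost via B is 2.0, while the direct edge costs 3.0
nx.shortest_path(toy, "A", "C", weight="weight")
# -> ['A', 'B', 'C']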
#
# The example below shows the calculation of the shortest path between John Hume (`(DE-588)119444666`) and Marie Curie (`(DE-588)118523023`). Also, we define a `nodeProjection` and a `relationshipProjection`. These projections are arguments you can use inside the shortest path procedure to define specific properties and characteristics of the nodes and relationships you want to consider for the shortest path calculation.
#
#
# More information on the Cypher shortest path finding algorithm and the projections can be found [here](https://neo4j.com/docs/graph-data-science/1.4/alpha-algorithms/shortest-path/).
# In[25]:
query = """
MATCH (start:PerName {Id: "(DE-588)119444666"}),
(end:PerName {Id: "(DE-588)118523023"})
CALL gds.alpha.shortestPath.stream({
startNode: start,
endNode: end,
nodeProjection: {type: "PerName"},
relationshipProjection: {
all: {
type: "SocialRelation",
orientation: "NATURAL",
TypeAddInfo: "directed",
SourceType: "correspondedRelation"
}
}})
YIELD nodeId, cost
RETURN gds.util.asNode(nodeId).Name AS Name
"""
with driver.session() as session:
    result = session.run(query).data()
result
# ### 📝 Exercise
#
# Task:
#
# 1. Calculate the shortest path between John Hume and Marie Curie again, but use a different relationship type this time.
# In[ ]:
# # Complex queries & data preparation
# In this last chapter of Notebook 3, we want to take a look at more complex queries and data processing procedures. The queries in this chapter use concepts and functionalities of the Cypher query language that we have not used so far. As mentioned earlier, there won't be an in-depth explanation of how the queries work, but there will be links to the documentation of the most important parts.
# ## Analyze works and resources by genre in a time range
# For the query below, we want to retrieve all resources (`Resource`) and related works (`UniTitle`). Furthermore, we apply a temporal filter so that we only retrieve resources and works created within a given time span.
# In[13]:
from neo4j import GraphDatabase
import networkx as nx
driver = GraphDatabase.driver(uri, auth=(user, password))
from_year = "1900"
to_year = "1925"
# this is a RegEx pattern that matches a 4-digit year (e.g. "1800")
date_pattern = "([0-9]{4})"
# query scaffolding with placeholders
query = """
MATCH (n:UniTitle)-[r]-(m:Resource)
WHERE m.DateApproxBegin =~ "{date_pattern}"
AND toInteger(m.DateApproxBegin) >= toInteger({from_year})
AND toInteger(m.DateApproxBegin) <= toInteger({to_year})
RETURN *
"""
# replace placeholders in query scaffolding
query = query.format(from_year=from_year,
                     to_year=to_year,
                     date_pattern=date_pattern)
# The query above uses the following elements to construct the database request:
#
# - **Regular Expressions** are used to select only correct year formats. Click [here](https://neo4j.com/docs/cypher-manual/current/clauses/where/#matching-using-regular-expressions) for more details on matching with Cypher using regular expressions.
#
# - **Scalar Functions** (`toInteger()`) to convert string values to integer values. Click [here](https://neo4j.com/docs/cypher-manual/current/functions/scalar/#functions-tointeger) for more details.
# - The **Python string `format()` method** is used to replace placeholders inside a character string in Python. Click [here](https://www.w3schools.com/python/ref_string_format.asp) for more details. A parameter-based alternative is sketched below.
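#
# The Neo4j driver also supports Cypher query parameters (`$name`), which avoid manual string substitution for plain values - a sketch (note that parameters can replace values, but not labels or relationship types):
# In[ ]:
param_query = """
MATCH (n:PerName {Name: $name})
RETURN count(n)
"""
with driver.session() as session:
    result = session.run(param_query, name="Weber, Max").data()
result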
# In the next step, we use a custom function to process the query. We import a function called `to_nx_graph()`. This function helps us keep the code slim and clean. The function itself does things we have already done several times before:
#
# On the one hand, it sends the query to the SoNAR (IDH) database and ingests the database reply.
# On the other hand, the function generates a `networkx` graph object from the returned data. This process is similar to the one used in the chapter "Case Study: Nobel Laureates" in Notebook 2.
#
# Click [here](../notebooks/helper_functions/helper_fun.py) to see the source code of this helper function.
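#
# The actual implementation lives in `helper_functions/helper_fun.py` (linked above). Conceptually, it does something like the following sketch - names and details here are illustrative, not the real source code:
# In[ ]:
import networkx as nx

def to_nx_graph_sketch(neo4j_driver, query):
    """Illustrative sketch: run a Cypher query and build a networkx graph."""
    G = nx.Graph()
    with neo4j_driver.session() as session:
        # .graph() returns the result as Neo4j node/relationship objects
        graph = session.run(query).graph()
    for node in graph.nodes:
        # use the first node label as "type" and copy all node properties
        G.add_node(node.id, type=list(node.labels)[0], **dict(node))
    for rel in graph.relationships:
        G.add_edge(rel.start_node.id, rel.end_node.id, type=rel.type)
    return G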
# In[14]:
from helper_functions.helper_fun import to_nx_graph
G = to_nx_graph(neo4j_driver=driver,
                query=query)
# This graph object can easily be converted to a data frame and analyzed as tabular data.
# In[15]:
import pandas as pd
graph_df = pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')
graph_df["type"].value_counts()
# In the next step, we prepare the visualization of the graph. This way we get a general overview of the graph structure.
# In[16]:
from matplotlib.colors import rgb2hex
from matplotlib.patches import Circle
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
# defining general variables
## we start off by setting the position of nodes and edges again
pos = nx.kamada_kawai_layout(G)
## set the color map to be used
color_map = plt.cm.plasma
# setup node_colors
node_color_attribute = "type"
groups = set(nx.get_node_attributes(G, node_color_attribute).values())
group_ids = np.array(range(len(groups)))
if len(group_ids) > 1:
    group_ids_norm = (group_ids - np.min(group_ids))/np.ptp(group_ids)
else:
    group_ids_norm = group_ids
mapping = dict(zip(groups, group_ids_norm))
node_colors = [mapping[G.nodes()[n][node_color_attribute]] for n in G.nodes()]
# defining the graph options & styling
## dictionary for node options:
node_options = {
    "pos": pos,
    "node_size": 150,
    "alpha": 0.5,
    "node_color": node_colors, # here we set the node_colors object as an option
    "cmap": color_map # this cmap defines the color scale we want to use
}
## dictionary for edge options:
edge_options = {
    "pos": pos,
    "width": 1.5,
    "alpha": 0.2,
}
## set plot size and plot margins
plt.figure(figsize=[20, 20])
plt.margins(x=0.1, y = 0.1)
# draw the graph
## draw the nodes
nx.draw_networkx_nodes(G, **node_options)
## draw the edges
nx.draw_networkx_edges(G, **edge_options)
# create custom legend according to color_map
geom_list = [Circle([], color = rgb2hex(color_map(float(mapping[term])))) for term in groups]
plt.legend(geom_list, groups)
# show the plot
plt.show()
# ## Analyze persons by TopicTerm and time range
# The following example is about analyzing persons based on a specific topic term and whether the person was alive during a given time period.
#
# The query below filters for people who were sociologists and were alive between January 1st, 1900 and January 1st, 1925. Furthermore, the query retrieves all connected resources of the persons that meet the filter criteria.
# In[17]:
from neo4j import GraphDatabase
import networkx as nx
driver = GraphDatabase.driver(uri, auth=(user, password))
from_date = "1900-01-01"
to_date = "1925-01-01"
topic_term = "Soziolog"
date_pattern = "([0-9]{2}[.][0-9]{2}[.][0-9]{4})"
# query scaffolding with placeholders
query = """
MATCH (n:PerName)-[r1]-(t:TopicTerm),
(n:PerName)-[r2]-(m:Resource)
WHERE n.DateStrictBegin =~ '{date_pattern}' AND n.DateStrictEnd =~ '{date_pattern}' AND
t.Name CONTAINS "{topic_term}"
WITH apoc.date.parse(n.DateStrictBegin, "ms", "dd.MM.yyyy") AS parsed_birth,
apoc.date.parse(n.DateStrictEnd, "ms", "dd.MM.yyyy") AS parsed_death,
n, m, t, r1, r2
WHERE apoc.coll.max([date(datetime({{epochmillis: parsed_birth}})), date("{from_date}")]) <= apoc.coll.min([date(datetime({{epochmillis: parsed_death}})), date("{to_date}")])
RETURN *
LIMIT 2000
"""
# replace placeholders in query scaffolding
query = query.format(from_date=from_date,
                     to_date=to_date,
                     date_pattern=date_pattern,
                     topic_term=topic_term)
# The query above uses the following new elements to construct the database request:
#
# - APOC procedures are used to parse actual date and time variables. APOC procedures are predefined Cypher functions that make processing data easier. A full user guide for the built-in APOC procedures can be found [here](https://neo4j.com/labs/apoc/4.1/). An introduction to working with dates using the Cypher language can be found [here](https://neo4j.com/developer/cypher/dates-datetimes-durations/).
# - The Cypher `WITH` clause is used to chain together new variables with the rest of the query. More details on the `WITH` clause can be found [here](https://neo4j.com/docs/cypher-manual/current/clauses/with/)
# In the next step, we call the custom function `to_nx_graph()` again and convert the graph to a data frame.
# In[18]:
from helper_functions.helper_fun import to_nx_graph
G = to_nx_graph(neo4j_driver=driver,
                query=query)
# In[37]:
import pandas as pd
graph_df = pd.DataFrame.from_dict(dict(G.nodes(data=True)), orient='index')
graph_df["type"].value_counts()
# **Aggregate by time period**
#
# Let's use the pandas data frame to aggregate the retrieved data and plot the distribution of the resources over time:
# In[38]:
# change this value to the year range you want to use as an aggregation period (e.g. "10y" for ten years)
time_range = "10y"
# cleaning up the dataframe
# only keep observations with a valid year as "DateApproxBegin"
agg_df = graph_df[graph_df.DateApproxBegin.str.fullmatch(
    "([0-9]{4})", na=False)]
# only keep Resources and drop all other node types
agg_df = agg_df[agg_df.type == "Resource"]
# add a new column called "clean_date" to the dataframe containing a correctly formatted date
agg_df.insert(0, "clean_date", pd.to_datetime(
    agg_df.DateApproxBegin, format="%Y"), False)
# aggregate the data
# aggregate by the given time range and calculate the number of observations in the time period per node type
agg_df = agg_df.groupby(["type", pd.Grouper(key="clean_date", freq=time_range)])[
    "type"].agg("count")
# reset the grouping index so we have a "normal" dataframe again
agg_df = agg_df.reset_index(name="count")
# plot the result
# replace the full "clean_date" values with only the string of the respective ending year of the time period
agg_df["clean_date"] = agg_df["clean_date"].dt.strftime("%Y")
# plot a bar chart
agg_df.plot.bar(x="clean_date", y="count")
# **Visualize the network**
# Of course, we also can visualize the full network again:
# In[20]:
from matplotlib.colors import rgb2hex
from matplotlib.patches import Circle
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
# defining general variables
## we start off by setting the position of nodes and edges again
pos = nx.kamada_kawai_layout(G)
## set the color map to be used
color_map = plt.cm.plasma
# setup node_colors
node_color_attribute = "type"
groups = set(nx.get_node_attributes(G, node_color_attribute).values())
group_ids = np.array(range(len(groups)))
if len(group_ids) > 1:
    group_ids_norm = (group_ids - np.min(group_ids))/np.ptp(group_ids)
else:
    group_ids_norm = group_ids
mapping = dict(zip(groups, group_ids_norm))
node_colors = [mapping[G.nodes()[n][node_color_attribute]] for n in G.nodes()]
# defining the graph options & styling
## dictionary for node options:
node_options = {
    "pos": pos,
    "node_size": 150,
    "alpha": 0.5,
    "node_color": node_colors, # here we set the node_colors object as an option
    "cmap": color_map # this cmap defines the color scale we want to use
}
## dictionary for edge options:
edge_options = {
    "pos": pos,
    "width": 1.5,
    "alpha": 0.2,
}
## set plot size and plot margins
plt.figure(figsize=[20, 20])
plt.margins(x=0.1, y = 0.1)
# draw the graph
## draw the nodes
nx.draw_networkx_nodes(G, **node_options)
## draw the edges
nx.draw_networkx_edges(G, **edge_options)
# create custom legend according to color_map
geom_list = [Circle([], color = rgb2hex(color_map(float(mapping[term])))) for term in groups]
plt.legend(geom_list, groups)
# show the plot
plt.show()
# ### 📝 Exercise
#
# Task:
#
# 1. Create another bar plot of the distribution of resources over time - this time, change the time range from 10 years to a smaller number.
# This notebook introduced you to the SoNAR (IDH) database, its data structure, and the Cypher query language used to retrieve and analyze the SoNAR data. In the next notebook, we are going to use an exploratory approach to analyze the historical network of physiologists.
# # Solutions for the exercises
# This section provides the solutions for the exercises in this notebook.
# ## 4.1.2 📝 Exercise
#
# 1. Now, try one of the other methods listed in the table above by following the same procedure we used with the `db.labels()` call.
# In[ ]:
with driver.session() as session:
    result = session.run("CALL db.propertyKeys()").data()
result
# ## 4.2.4 📝 Exercise
#
# 1. Write a query that retrieves all RelationToGeoName edges from Max Weber as well as the corresponding GeoName nodes.
# In[4]:
from helper_functions.helper_fun import to_nx_graph
from neo4j import GraphDatabase
import networkx as nx
driver = GraphDatabase.driver(uri, auth=(user, password))
query = """
MATCH (n1:PerName)-[r:RelationToGeoName]-(n2:GeoName)
WHERE n1.Id = "(DE-588)118629743"
RETURN *
"""
G = to_nx_graph(neo4j_driver=driver,
                query=query)
# 2. Visualize the resulting graph (see Notebook 2 for an explanation on how to visualize graphs).
# In[ ]:
from matplotlib.colors import rgb2hex
from matplotlib.patches import Circle
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
# defining general variables
## we start off by setting the position of nodes and edges again
pos = nx.kamada_kawai_layout(G)
## set the color map to be used
color_map = plt.cm.plasma
# setup node_colors
node_color_attribute = "type"
groups = set(nx.get_node_attributes(G, node_color_attribute).values())
group_ids = np.array(range(len(groups)))
if len(group_ids) > 1:
    group_ids_norm = (group_ids - np.min(group_ids))/np.ptp(group_ids)
else:
    group_ids_norm = group_ids
mapping = dict(zip(groups, group_ids_norm))
node_colors = [mapping[G.nodes()[n][node_color_attribute]] for n in G.nodes()]
# defining the graph options & styling
## dictionary for node options:
node_options = {
    "pos": pos,
    "node_size": 150,
    "alpha": 0.5,
    "node_color": node_colors, # here we set the node_colors object as an option
    "cmap": color_map # this cmap defines the color scale we want to use
}
## dictionary for edge options:
edge_options = {
    "pos": pos,
    "width": 1.5,
    "alpha": 0.2,
}
## set plot size and plot margins
plt.figure(figsize=[5,5])
plt.margins(x=0.1, y = 0.1)
# draw the graph
## draw the nodes
nx.draw_networkx_nodes(G, **node_options)
## draw the edges
nx.draw_networkx_edges(G, **edge_options)
# create custom legend according to color_map
geom_list = [Circle([], color = rgb2hex(color_map(float(mapping[term])))) for term in groups]
plt.legend(geom_list, groups)
# show the plot
plt.show()
# ## 5.2.2 📝 Exercise
#
# 1. Calculate the top ten degree centrality of all PerName nodes with respect to RelationToPerName relationships.
# In[ ]:
query = """
CALL gds.alpha.degree.stream({
nodeProjection: {type: "PerName"},
relationshipProjection: {type: "RelationToPerName"}
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).Name AS Name, score
ORDER BY score DESC
LIMIT 10
"""
with driver.session() as session:
    result = session.run(query).data()
result
# ## 5.2.4 📝 Exercise
#
# 1. Calculate the shortest path between John Hume and Marie Curie again, but use a different relationship type this time.
# In[ ]:
query = """
MATCH (start:PerName {Id: "(DE-588)119444666"}),
(end:PerName {Id: "(DE-588)118523023"})
CALL gds.alpha.shortestPath.stream({
startNode: start,
endNode: end,
nodeProjection: {type: "PerName"},
relationshipProjection: {
all: {
type: "RelationToGeoName",
orientation: "NATURAL",
TypeAddInfo: "directed"
}
}})
YIELD nodeId, cost
RETURN gds.util.asNode(nodeId).Name AS Name
"""
with driver.session() as session:
    result = session.run(query).data()
result
# ## 6.2.1 📝 Exercise
#
# 1. Create another bar plot of the distribution of resources over time - this time, change the time range from 10 years to a smaller number.
# In[ ]:
# change this value to the year range you want to use as an aggregation period (e.g. "10y" for ten years)
time_range = "5y"
# cleaning up the dataframe
# only keep observations with a valid year as "DateApproxBegin"
agg_df = graph_df[graph_df.DateApproxBegin.str.fullmatch(
    "([0-9]{4})", na=False)]
# only keep Resources and drop all other node types
agg_df = agg_df[agg_df.type == "Resource"]
# add a new column called "clean_date" to the dataframe containing a correctly formatted date
agg_df.insert(0, "clean_date", pd.to_datetime(
    agg_df.DateApproxBegin, format="%Y"), False)
# aggregate the data
# aggregate by the given time range and calculate the number of observations in the time period per node type
agg_df = agg_df.groupby(["type", pd.Grouper(key="clean_date", freq=time_range)])[
    "type"].agg("count")
# reset the grouping index so we have a "normal" dataframe again
agg_df = agg_df.reset_index(name="count")
# plot the result
# replace the full "clean_date" values with only the string of the respective ending year of the time period
agg_df["clean_date"] = agg_df["clean_date"].dt.strftime("%Y")
# plot a bar chart
agg_df.plot.bar(x="clean_date", y="count")
# # Bibliography
# Freeman, L. C. (1978). Centrality in social networks conceptual clarification. Social Networks, 1(3), 215–239. https://doi.org/10.1016/0378-8733(78)90021-7
#
# Needham, M. & Hodler, A. (2019). Graph algorithms : practical examples in Apache Spark and Neo4j. Beijing: O'Reilly.