Node equivalence with Node2Vec

Node2Vec is an algorithm that learns vector representations (embeddings) of nodes from biased random walks over a graph. Depending on its return parameter p and in-out parameter q, it can be tuned to capture structural equivalence or community structure ("segments") of the graph.

In this notebook, the Leviticus network is explored with Node2Vec in a progression of increasing complexity:

  1. The network as a directed but unweighted graph
  2. The network as a multiple directed graph, where repeated ties appear as parallel edges (see the sketch after this list)
  3. The network as a multiple directed graph, weighted by average agency values
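
The difference between steps 1 and 2 hinges on how NetworkX treats repeated ties. A minimal sketch (illustrative only, not part of the notebook's data pipeline): a DiGraph collapses repeated Source→Target pairs into a single edge, while a MultiDiGraph keeps each occurrence as a parallel edge.

import networkx as nx

Gd = nx.DiGraph()
Gm = nx.MultiDiGraph()
for _ in range(3):
    Gd.add_edge('YHWH', 'Moses')   # repeated pair collapses into one edge
    Gm.add_edge('YHWH', 'Moses')   # repeated pair is kept as a parallel edge

print(Gd.number_of_edges(), Gm.number_of_edges())   # 1 3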


In [1]:
import collections

import networkx as nx
from node2vec import Node2Vec
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

Import data

In [38]:
data = pd.read_excel('Lev17-26.edges.Static.xlsx')
data.head()
Out[38]:
   Source  Source_label  Source_agency  Target  Target_label  Target_agency     Label  Weight      Type  Clause
0      49  Aaron's_sons              5      43          YHWH              0     swing      25  Directed  440323
1      43          YHWH              5      39         Moses             -1     speak      36  Directed  440335
2      53    Israelites              5      43          YHWH             -1  approach      36  Directed  440341
3      43          YHWH              5      39         Moses             -1     speak      36  Directed  440342
4      53    Israelites              5      43          YHWH             -1  approach      36  Directed  440347

1. Simple, directed graph

In [39]:
G = nx.DiGraph()

# One edge per interaction; duplicate Source→Target pairs collapse into a single edge
for n, row in data.iterrows():
    G.add_edge(row.Source_label, row.Target_label)
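
A quick sanity check (not in the original notebook): the node count should match the 59 nodes that Node2Vec reports below when computing the transition probabilities.

print(G.number_of_nodes(), G.number_of_edges())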
In [40]:
plt.figure(figsize = (15,15))

nx.draw(
    G,
    pos=nx.spring_layout(G),
    with_labels=True,
    edge_color='grey'
    )

plt.show()

We define a few global hyperparameters that are reused in all examples:

In [41]:
dim = 16  # dimensionality of the node embeddings
wl = 4    # walk length: number of nodes visited in each random walk
nw = 150  # number of walks generated per node
p = 1     # return parameter: low values make walks more likely to backtrack
q = 1     # in-out parameter: q > 1 keeps walks local (BFS-like), q < 1 favours exploration (DFS-like)
In [42]:
node2vec = Node2Vec(G, dimensions=dim, walk_length=wl, num_walks=nw, p=p, q=q)
Computing transition probabilities: 100%|████████████████████████████████████████████| 59/59 [00:00<00:00, 4017.34it/s]
Generating walks (CPU: 1): 100%|████████████████████████████████████████████████████| 150/150 [00:00<00:00, 288.96it/s]

Definitions:

  • window (size) = the maximum distance between the current and the predicted node within a walk (the walks are treated as sentences by the underlying word2vec model)
  • min_count = ignore all nodes with a total frequency lower than this
In [43]:
model = node2vec.fit(window=6, min_count=1)
In [44]:
model.wv.save_word2vec_format('node2vec/simple_directed')
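
Since node2vec.fit returns a gensim Word2Vec model, the embeddings can also be inspected directly through the standard KeyedVectors interface, without the file round-trip. A small sketch (assuming 'YHWH' is present as a node label, as it is in the data above):

# Embedding vector of a single node (keys are the stringified node labels)
vec = model.wv['YHWH']

# Nodes with the most similar embeddings (cosine similarity)
print(model.wv.most_similar('YHWH', topn=5))

# All embeddings in the order of G.nodes(), as a float array
X_direct = np.array([model.wv[str(n)] for n in G.nodes()])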

Clustering embeddings

In [45]:
def getEmbeddings(file, nodes):
    """Extract node representations from a saved node2vec model (word2vec text format)"""
    with open(file) as f:
        mylist = [line.rstrip('\n') for line in f]

    embeddings_dict = {}
    for l in mylist[1:]:  # skip the word2vec header line (vocabulary size and dimensions)
        vector = l.split()
        embeddings_dict[vector[0]] = [float(v) for v in vector[1:]]  # parse the vectors as floats

    # Order the vectors according to the node list of the graph
    embeddings = np.array([embeddings_dict[str(n)] for n in nodes])
    return embeddings

How many clusters?

The more clusters we use, the smaller the distances between each centroid and its members become. This is measured with the WCSS ("within-cluster sum of squares"); the "elbow" in the WCSS curve suggests a reasonable number of clusters.

In [46]:
X = getEmbeddings('node2vec/simple_directed', G.nodes())

def elbow(X,max_clusters=10):

    wcss = []  # for storing the inertia (WCSS) for each number of clusters

    for i in range(1, max_clusters):
        kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter=300, n_init=10, random_state=0)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)

    plt.plot(range(1, max_clusters), wcss)
    plt.title('Scree plot of WCSS for n clusters (elbow method)')
    plt.xlabel('n of clusters')
    plt.ylabel('WCSS')
    plt.show()
    
elbow(X)
In [47]:
kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter=300, n_init=10, random_state=0).fit(X)
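
To see which nodes end up in which cluster, the K-means labels can be joined back to the node list. A small sketch (the clusters dataframe is a helper for inspection, not part of the original notebook):

clusters = pd.DataFrame({'node': list(G.nodes()), 'cluster': kmeans.labels_})
clusters.sort_values('cluster').head(10)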
In [48]:
def draw(graph, X, kmeans, labels=True, size=(15,15)):

    plt.figure(figsize = size)
    
    nx.draw_networkx(
        graph,
        pos=nx.spring_layout(graph),
        with_labels=labels,
        node_size=[d*8 for n, d in graph.degree()],  # scale node size by degree
        node_color=kmeans.labels_,
        edge_color='grey'
        )
    
    plt.axis('off')
    plt.show()
    
draw(G, X, kmeans, labels=False)
In [49]:
def mds(graph, X, kmeans, n_components=2, label=True, size=(7,7)):
    """Project the embeddings with multidimensional scaling and color the points by cluster"""
    embedding = MDS(n_components=n_components)
    X_transformed = embedding.fit_transform(X)
    X = pd.DataFrame(X_transformed, index=graph.nodes)
    
    plt.figure(figsize=size)
    plt.scatter(X.iloc[:,0], X.iloc[:,1], c=kmeans.labels_)

    if label:
        for n, row in X.iterrows():
            plt.text(row.iloc[0], row.iloc[1], n, size=12)

    plt.show()
    
mds(G, X, kmeans, size=(15,15))

2. Multiple directed graph

In [50]:
G = nx.MultiDiGraph()

# One edge per interaction; repeated Source→Target pairs are kept as parallel edges
for n, row in data.iterrows():
    G.add_edge(row.Source_label, row.Target_label)
In [51]:
node2vec = Node2Vec(G, dimensions=dim, walk_length=wl, num_walks=nw, p=p, q=q)
Computing transition probabilities: 100%|████████████████████████████████████████████| 59/59 [00:00<00:00, 4836.49it/s]
Generating walks (CPU: 1): 100%|████████████████████████████████████████████████████| 150/150 [00:00<00:00, 301.64it/s]
In [52]:
model = node2vec.fit(window=10, min_count=1)
In [53]:
model.wv.save_word2vec_format('node2vec/multiple_directed')
In [54]:
X = getEmbeddings('node2vec/multiple_directed', G.nodes())
    
elbow(X)
In [55]:
kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter=300, n_init=10, random_state=0).fit(X)
In [56]:
draw(G, X, kmeans, labels=False)
In [57]:
mds(G, X, kmeans)

3. Multiple directed and valued ties

In [72]:
G = nx.MultiDiGraph()

# Store the Weight column as an edge attribute so Node2Vec can use it to bias the random walks
for n, row in data.iterrows():
    G.add_edge(row.Source_label, row.Target_label, value=row.Weight)
    
nodes = list(G.nodes())
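
A quick check (not in the original notebook) that the edge attribute was stored as expected; going by the data shown above, the first edge should carry the value 25:

list(G.edges(data=True))[:3]
# e.g. [("Aaron's_sons", 'YHWH', {'value': 25}), ...]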
In [73]:
node2vec = Node2Vec(G, dimensions=dim, walk_length=wl, num_walks=nw, p=p, q=q, weight_key='value')
Computing transition probabilities: 100%|████████████████████████████████████████████| 59/59 [00:00<00:00, 7673.30it/s]
Generating walks (CPU: 1): 100%|████████████████████████████████████████████████████| 150/150 [00:00<00:00, 342.57it/s]
In [74]:
model = node2vec.fit(window=6, min_count=1)
In [75]:
model.wv.save_word2vec_format('node2vec/multiple_valued_directed')
In [76]:
X = getEmbeddings('node2vec/multiple_valued_directed', G.nodes())
    
elbow(X)
In [77]:
kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter=300, n_init=10, random_state=0).fit(X)
In [78]:
draw(G, X, kmeans, labels=False, size=(7,7))
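
For symmetry with the previous sections, the weighted embeddings could also be projected with the same MDS helper (this step is not part of the original run):

mds(G, X, kmeans)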