import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import graph
Let's start by loading the datasets and by saving them into three separate dataframes
# Answers to questions
df_a2q = pd.read_table("data/sx-stackoverflow-a2q.txt",
delimiter=" ",
names=["src", "tgt", "timestamp"],
nrows=50000)
# Comments to answers
df_c2a = pd.read_table("data/sx-stackoverflow-c2a.txt",
delimiter=" ",
names=["src", "tgt", "timestamp"],
nrows=50000)
# Comments to questions
df_c2q = pd.read_table("data/sx-stackoverflow-c2q.txt",
delimiter=" ",
names=["src", "tgt", "timestamp"],
nrows=50000)
Let's also convert the UNIX times in the timestamp
column into datetime
objects
df_a2q["timestamp"] = pd.to_datetime(df_a2q["timestamp"], unit='s')
df_c2a["timestamp"] = pd.to_datetime(df_c2a["timestamp"], unit='s')
df_c2q["timestamp"] = pd.to_datetime(df_c2q["timestamp"], unit='s')
Now we can save all three as separate graphs (ignoring rows with users commenting their own comments/answers)
G_a2q = graph.graph_from_df(df_a2q)
G_c2a = graph.graph_from_df(df_c2a)
G_c2q = graph.graph_from_df(df_c2q)
in order to merge them into a single graph $G$, which will contain all nodes and edges in the three graphs, assigning to each edge a timestamp
attribute (with the older date among the three graphs) and a weight
attribute (i.e. $3$ if an edge is only in one graph, $2$ if it's in two, and $1$ if it's in all three).
G = graph.merge_graphs(G_c2a, G_c2q)
G = graph.merge_graphs(G, G_a2q)
To check if our own implementation works, we can test it against networkx
's, and see if the two merged graphs have the same nodes and each of those has the same neighbors:
Gx_a2q = graph.graph_from_df(df_a2q, "networkx")
Gx_c2a = graph.graph_from_df(df_c2a, "networkx")
Gx_c2q = graph.graph_from_df(df_c2q, "networkx")
Gx = nx.compose(Gx_c2a, Gx_c2q)
Gx = nx.compose(Gx, Gx_a2q)
graph.test_graph(G, Gx)
OK! Same nodes and neighbors
By merging the three graphs, we obtain
G.summary()
Directed graph with 17099 nodes and 104662 edges
where each edge has by default a direction (unlike $G_x$ which is undirected), which will be that of the older interaction between the two users (i.e. the first occurrence in the dataframe, whose entries are in chronological order from oldest to newest).
For instance, let's take a look at a random subgraph of $G$ (some nodes won't appear to be connected to others but it's just because their neighbors won't be plotted: by design every node in the graph has at least a neighbor):
plt = graph.plot_subgraph(G)
plt.show()
from func import func_selector
import visual
Let's take a look at the overall features of one of the graphs we built and the distribution of the degrees of each of its nodes.
table, density = func_selector(G_a2q, 1)
plt.figure(figsize = (10,6))
visual.overall_features(G_a2q, table, density)
plt.show()
Graph features | |
---|---|
Directed | True |
Users | 7864 |
Answers/comments | 47196 |
Avg links per user | 6.00 |
Dense (|E| ~ |V|^2) | False |
The vast majority of nodes have a degree close to zero, but there also many outliers which have a much higher degree than the rest, as we can appreciate in the box plot below.
This suggests that most interactions are created by a small percentage of very active users.
plt.figure(figsize = (10,6))
plt.boxplot(density, vert=False)
plt.xlabel("Degree")
plt.show()
We are now interested in computing several metrics to assess the "importance" – the centrality – of a node within the graph in a well defined time interval.
Due to the prohibitively long execution times (especially of the betweenness centrality), we could only test the following functionalities and compare their results for short windows of time.
The betweenness centrality measures how "central" a node is by finding in how many paths between all possible node pairs it appears in; in a sense, it represents how many "communications" between nodes it affects.
Given a node $v$ in a directed and weighted graph $(V,E)$, its betweenness centrality (normalized between 0 and 1) is defined as
$$ BC(v) = \frac{1}{(N-1)(N-2)}\sum_{s\neq v\neq t\ \in\ V} \frac{\sigma_{st}(v)}{\sigma_{st}} $$where $\sigma_{st}$ is the number of equivalent (in weight) shortest paths between a node $s$ and a node $t$, $\sigma_{st}(v)$ is the number of those which also contain $v$ in the middle, and $N$ is the total number of nodes in the graph.
Remark: in the case of an undirected graph it has to be multiplied by 2
Sources
For node $238$ in the time window 2008-08-01 / 2008-08-10
we obtained:
func_selector(G, 2)
0.5616119645843778
which is quite a high value. This means that this particular user was present in many of the interactions among other users, for instance commenting/answering posts which were also commented/answered by many other users.
PageRank is an algorithm notably used by Google to rank web pages in search results and measure their importance.
Within our context, it can be used to assign to each node a certain asymptotic probability of being reached in a large number of steps; this means that the higher this probability is, the more the node is connected to the rest and in that sense "central" in the graph.
We can compute it by interpreting our graph as a Markov Chain and the corresponding adjacency matrix as a stochastic matrix filled with the transitional probabilities between all pairs of nodes. If we call $\mathcal{M}$ such matrix, we can define $$\hat{\mathcal{M}} = d\mathcal{M} + \frac{1-d}{N}\mathbb{1}$$ where $d$ is a scalar called damping factor which is the probability that a random surfer keeps visiting the nodes, and $\mathbb{1}$ is a matrix filled with ones with the same size of $\mathcal{M}$. The stationary distribution of all nodes will be the right eigenvector corresponding to the principal eigenvalue of $\hat{\mathcal{M}}$, with components the limit probability – the rank – of each node.
Source
For node $238$ in the time window 2008-08-01 / 2008-08-31
we obtained:
func_selector(G, 2)
0.0007821602043691682
which is a very low figure. Given the graph is directed, this suggests that even though this user commented/answered many posts by other users – as we can hypothesize from its betweenness centrality – very few users in comparison commented his own posts, i.e. it's a node with many outgoing connections but not as many incoming, hence not very reachable.
Closeness centrality measures how close a given node is to the rest of nodes in the graph.
For a node $v$, it is a defined as
$$CC(v) = \frac{N-1}{\sum_{u\in V} d(u,v)}$$where $d(u,v)$ is the distance in terms of number of edges of the shortest path between $v$ and another node $u$, and $N-1$ is again needed to normalize the measure.
Sources
For node $238$ in the time window 2008-08-01 / 2008-08-31
we obtained:
func_selector(G, 2)
0.37043633125556547
This value reinforces what we saw earlier with the betweenness centrality: this node is "central" in the sense that it communicates well with others (its distance to the other nodes is "small").
The degree centrality is the simplest notion of centrality and it's defined as the (normalized) number of neighboring nodes of a given node $v$:
$$DC(v)=\frac{\text{deg}(v)}{N-1}$$For node $238$ in the time window 2008-08-01 / 2008-08-31
we obtained:
func_selector(G, 2)
0.01826923076923077
which is a low value. Given the degree centrality only measures the ratio of neighboring nodes over all possible neighbors ($N-1$), the number we got simply suggests that node $238$ it's not in a dense region of the graph (which is by itself sparse as we saw earlier).
Let's see 10 random neighbours of node $372$
plt = graph.plot_neighbors(G, "372", max_neighbors=10)
plt.show()
Now let's plot the evolution of its degree centrality over the course of 5 months:
plt = visual.metric_evolution(G, "372", ["dc"])
plt.show()
Its degree centrality has decreased over time: this means that as the number of nodes in the network increased its neighbors didn't increase with the same rate, hence the user is less "central" because it's part of a bigger community with whom he hasn't interacted much.
By calling the function selector with the number 3 in input, it will call the functionality for the shortest ordered route.
It takes in input:
and it returns the shortest walk that goes from user $p_j$ to $p_n$, and that visits in order the nodes in $p$.
Let's see what we get by picking $p_1 = 404$, $p_n = 267$, and $p = [122,\ 32]$, in the time interval 2008-08-01 / 2008-09-30
shortest_ord_route = func_selector(G, 3)
print(f"\nThe shortest path between the two nodes that visits in order the nodes indicated is\n{short_ord_route}")
The shortest path between the two nodes that visits in order the nodes indicated is ['404', '396', '122', '5', '32', '267']
Now let's plot a subgraph containing these nodes and the edges between them, highlighting those in the shortest ordered path
route_edges = [(short_ord_route[idx], short_ord_route[idx + 1])
for idx in range(len(short_ord_route) - 1)]
graph.plot_subgraph(G,
nodes=shortest_ord_route,
highlight_edges=route_edges,
highlight_color="blue",
figsize=(8,8))
plt.show()
Calling the function selector with input 4 will start the disconnecting graphs functionality which takes in input:
The function returns the minimum number of links (considering their weights) required to disconnect the two users.
Let's consider the time intervals 2008-08-01 / 2008-08-31
and 2008-09-01 / 2008-09-30
.
out, removed, paths = func_selector(G, 4)
tot_w = sum([G[u][v]["weight"] for u, v in removed])
print(f"and the total weight of these links is {tot_w}")
A user unique to G_1 is 3261 A user unique to G_2 is 10577 The minimum number of links required to disconnect the two graphs is 4 and the total weight of these links is 12
Finally, let's view these links
graph.plot_subgraph(G, nodes=list(set([node for path in paths for node in path ])),
highlight_edges=removed,
highlight_color='red',
figsize=(10,10))
plt.show()
<module 'matplotlib.pyplot' from '/anaconda3/lib/python3.7/site-packages/matplotlib/pyplot.py'>
A number n of kids are in a camp. Between some k pairs of them (a kid can be part of more than one pairs) there are often fights. At night there are two dormitories where the kids can sleep. We want, if possible, to assign each kid in one of the two dormitories in such a way that each pair of kids that fights often is assigned to a different dormitory. (There are no space problems and the two dormitories can have different number of kids.)
n = [1,3,4,56,7,32,2]
n.sort()
n
[1, 2, 3, 4, 7, 32, 56]
def tuple_order(k):
out = []
for pair in k:
if(pair[0] > pair[1]):
pair = list(pair)
temp = pair[0]
pair[0] = pair[1]
pair[1] = temp
pair = tuple(pair)
out.append(pair)
return out
# n -> list of kids
# k -> list of pairs of kids fighting
def dorm_algo(n, k):
# kids in 1st and 2nd dorm
dorm0 = []
dorm1 = []
# Creating a list for kids who don't get in argues
safe = []
# First of all I order the list of kids...
n.sort()
# ...and the elements of the pairs using the aux function
k = tuple_order(k)
# Then I start checking
for pair in k:
if(pair[0] not in safe): # if kid0 not in safe list...
safe.append(pair[0]) # ...I add him...
dorm0.append(pair[0]) # ...and put him in dorm0
if(pair[1] not in safe): # if kid1 not in safe list...
safe.append(pair[1]) # ...I add him...
dorm1.append(pair[1]) # ...and put him in dorm1
# Now I check if kids from the same pairs are in the same dorm
# If so, I block the algorithm
for pair in k:
if(((pair[0] in dorm0) and (pair[1] in dorm0)) or (pair[0] in dorm1) and (pair[1] in dorm1)):
print("Nothing can be done!")
return
# Now is time to add the other kids in the dorms
for kid in n:
if((kid not in dorm0) and (kid not in dorm1)):
dorm0.append(kid)
# Print the final dorms with the kids
print("Fighting solved! Here the dorms:")
print("First dormitory:", dorm0)
print("Second dormitory:",dorm1)
Let's use as list of kids this:
n = [1,2,3,4,5,6]
k = [(3,2), (4,2)]
dorm_algo(n, k)
Fighting solved! Here the dorms: First dormitory: [2, 1, 5, 6] Second dormitory: [3, 4]
k = [(1,3),(4,1),(3,4)]
dorm_algo(n, k)
Nothing can be done!
k = [(5,4), (6,5), (2,6)]
dorm_algo(n, k)
Nothing can be done!
k = [(5,4), (6,5)]
dorm_algo(n, k)
Nothing can be done!
k = [(1,3),(4,1),(3,2),(2,5),(1,2)]
dorm_algo(n,k)
Nothing can be done!
k = [(1,3),(4,1),(3,2),(2,5),(2,6)]
dorm_algo(n,k)
Fighting solved! Here the dorms: First dormitory: [1, 2] Second dormitory: [3, 4, 5, 6]