Note: In this workbook, we try to replicate the results from the classic paper "Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth", Goldenberg, Libai and Muller (2001). This is a self-didactic attempt.
] add Graphs Distributions DataFrames GLM ProgressMeter
Updating registry at `C:\Users\Thibaut\.julia\registries\General` Updating git-repo `https://github.com/JuliaRegistries/General.git` Resolving package versions... Installed StatsFuns ─────────────── v1.0.1 Installed DualNumbers ───────────── v0.6.8 Installed PDMats ────────────────── v0.11.16 Installed HypergeometricFunctions ─ v0.3.11 Installed QuadGK ────────────────── v2.6.0 Installed StatsModels ───────────── v0.6.33 Installed FillArrays ────────────── v0.13.5 Installed GLM ───────────────────── v1.8.1 Installed ShiftedArrays ─────────── v2.0.0 Installed Distributions ─────────── v0.25.78 Updating `C:\Users\Thibaut\.julia\environments\v1.8\Project.toml` [31c24e10] + Distributions v0.25.78 [38e38edf] + GLM v1.8.1 Updating `C:\Users\Thibaut\.julia\environments\v1.8\Manifest.toml` [49dc2e85] + Calculus v0.5.1 [b429d917] + DensityInterface v0.4.0 [31c24e10] + Distributions v0.25.78 [fa6b7ba4] + DualNumbers v0.6.8 [1a297f60] + FillArrays v0.13.5 [38e38edf] + GLM v1.8.1 [34004b35] + HypergeometricFunctions v0.3.11 [90014a1f] + PDMats v0.11.16 [1fd47b50] + QuadGK v2.6.0 [79098fc4] + Rmath v0.7.0 [1277b4bf] + ShiftedArrays v2.0.0 [4c63d2b9] + StatsFuns v1.0.1 [3eaba693] + StatsModels v0.6.33 [f50d1b31] + Rmath_jll v0.3.0+0 [4607b0f0] + SuiteSparse Precompiling project... ✓ ShiftedArrays ✓ PDMats ✓ Rmath_jll ✓ DensityInterface ✓ FillArrays ✓ QuadGK ✓ DualNumbers ✓ Rmath ✓ HypergeometricFunctions ✓ StatsFuns ✓ StatsModels ✓ Distributions ✓ GLM 13 dependencies successfully precompiled in 15 seconds. 224 already precompiled. 1 skipped during auto due to previous errors.
using Graphs
using Distributions, DataFrames, GLM, ProgressMeter
using Dates
using Random: shuffle, seed!
seed!(20130810);
In Talk of the Network, the authors explore the pattern of personal communication between an individual's core friends group (strong ties) and a wider set of acquaintances (weak ties). This remarkable study is one of the first ones in marketing that explored the influence of social networks on the diffusion of marketing messages. The key questions investigated in this paper are:
In this workbook, we focus on replicating the efforts of the authors to answer the first question: do strong ties or weak ties influence the speed of information dissemination in a network?
This study employs a large number of synthetic networks as substrates to study the diffusion of information diffusion. To quote the authors logic to create and initialize the networks:
"Each individual belongs to a single personal network. Each network consists of individuals who are connected by strong ties. In each period, individuals also conduct a finte number of weak tie interactions outside their personal networks... We divide the entire market equally into personal networks, in which each individual can belong to one network. In addition, in each period, every individual conducts random meetings with individuals external to his personal network."
Given this specification, we utilize the built-in complete graph generator from Graphs.jl to build several mini-regular networks and then allow individuals in each of these mini-networks to mingle. Our final data structure is hence a vector of several complete networks that are built based on the number of strong ties for each individual. Note that each individual in the network has a fixed number of strong ties ($s$) and weak ties ($w$).
function initialize_network(n_nodes::Int, n_strong_ties::Int)
G = [complete_graph(n_strong_ties) for g in 1:floor(Int, n_nodes/n_strong_ties)]
return G
end
initialize_network (generic function with 1 method)
The probability of activation of a node, i.e., an uninformed individual turning to informed can happen in three ways: through a strong tie with probability $\beta_s$, through a weak tie with probability $\beta_w$ or through external marketing efforts with probability $\alpha$. In line with conventional wisdom, the authors assume $\alpha < \beta_w < \beta_s$.
At timestep $t$, if an individual is connected to $m$ strong ties and $j$ weak ties, the probability of the individual being informed in this time step is:
$$ p(t) = 1 - (1- \alpha)(1 - \beta_w)^j(1 - \beta_s)^m $$The outcome variable of interest is the number of time steps elapsed till 95% of the network engages.
Following our earlier discussion on the construction of substrate networks, each node in the network belongs to a complete sub-network. In addition, at each time step each node interacts with a fixed number of weak ties chosen at random from sub-networks other than its own.
Step 1: At $t = 0$, the status of all nodes is set to false
Step 2: For each node, the probability $p(t)$ of being informed is calculated using the above equation. A random draw $U$ is made from a standard uniform distribution and compared with $p(t)$. If $U < p(t)$ the status of the node is changed to true
Step 3: In each successive time step, Step 2 is repeated till 95% of the total network (of size 3000) engages
We now look at several helper functions that execute the above logic
The node status is stored as a vector of BitVector
's. At the beginning of each simulation run, we call the following function to set the status of all the nodes to false
.
function reset_node_status(G::Vector{Graphs.SimpleGraphs.SimpleGraph{Int}})
node_status = [falses(nv(g)) for g in G]
return node_status
end
reset_node_status (generic function with 1 method)
At each time step, we execute two tasks. First, we allow the nodes to mingle randomly with their strong ties and with weak ties from other sub-networks. At this point, we count the number of active strong and weak ties for each node. Then, we use this information to update the status of all the nodes in the network.
The first function counts the number of active strong ties within the node's sub-network. The second function executes the "random meetings" with weak ties as discussed in the paper. For each node we generate a random sample (without replacement) of size $w$ from sub-networks other than its own. We then count the number of active ties in its own sub-network and among the random sample taken from the rest of the network.
function count_active_str_ties(G::Vector{Graphs.SimpleGraphs.SimpleGraph{Int}},
node_network_id::Int,
node::Int,
node_status::Vector{BitVector})
n_active_str_ties = sum([node_status[node_network_id][nbr] for nbr in neighbors(G[node_network_id], node)])
return n_active_str_ties
end
count_active_str_ties (generic function with 1 method)
function random_meetings(G::Vector{Graphs.SimpleGraphs.SimpleGraph{Int}},
node_network_id::Int,
node::Int,
node_status::Vector{BitVector},
n_weak_ties::Int)
# Choose a random sample of size `n_weak_ties` from the other sub-networks and query
# their status. We first sample the network id, and use this to sample a random node
# in the sub-network defined by this id.
all_network_ids = 1:length(G)
other_network_ids = all_network_ids[all_network_ids .!= node_network_id]
possible_weak_ties = []
nsamples = 1
while nsamples < n_weak_ties
rand_network_id = sample(other_network_ids)
rand_nbr = sample(vertices(G[rand_network_id]))
if !((rand_network_id, rand_nbr) in possible_weak_ties)
push!(possible_weak_ties, (rand_network_id, rand_nbr))
nsamples += 1
end
end
n_active_wk_ties = sum([node_status[network_id][weak_tie] for (network_id, weak_tie) in possible_weak_ties])
return n_active_wk_ties
end
random_meetings (generic function with 1 method)
Finally, the function below conducts the updation of the status of all the nodes at each time step by calculating the probability of activation.
function update_status!(G::Vector{Graphs.SimpleGraphs.SimpleGraph{Int}},
node_status::Vector{BitVector},
n_weak_ties::Int,
alpha::Float64, beta_w::Float64, beta_s::Float64)
# assuming that the nodes update in random order
for node_network_id in shuffle(1:length(G))
for node in shuffle(vertices(G[node_network_id]))
n_active_str_ties = count_active_str_ties(G, node_network_id, node, node_status)
n_active_wk_ties = random_meetings(G, node_network_id, node, node_status, n_weak_ties)
activation_prob = 1 - (1 - alpha) * (1 - beta_w)^n_active_wk_ties * (1 - beta_s)^n_active_str_ties
if rand(Uniform()) < activation_prob
node_status[node_network_id][node] = true
end
end
end
return nothing
end
update_status! (generic function with 1 method)
The function execute_simulation
puts together the scaffolding to set up the parameter space $(s, w, \alpha, \beta_w, \beta_s)$ and execute diffusion along the network. From what I can gather from the paper, one simulation was carried out at each point on the parameter space. No further details regarding the execution are mentioned except that since each parameter has 7 levels, a total of $7^5 = 16,807$ simulations were executed in a factorial design. In this workbook, we work on a smaller parameter space using 3 levels for each parameter.
Also, I am assuming that the network is drawn at random for each run of the simulation.
One more interesting thing to note: The authors mention that their simulations were written in C, it would be interesting to compare the execution times with Julia. This is a non-standard problem that tests both the robustness of Julia types and its execution speed (maybe this will prompt someone to make a pull request!).
println("Number of strong ties per node (s): ", floor.(Int, range(5, stop=29, length=3)))
println("Number of weak ties per node(w): ", floor.(Int, range(5, stop=29, length=3)))
println("Effect of advertising (α): ", collect(range(0.0005, stop=0.01, length=3)))
println("Effect of weak ties (β_w): ", collect(range(0.005, stop=0.015, length=3)))
println("Effect of strong ties (β_s): ", collect(range(0.01, stop=0.07, length=3)))
Number of strong ties per node (s): [5, 17, 29] Number of weak ties per node(w): [5, 17, 29] Effect of advertising (α): [0.0005, 0.00525, 0.01] Effect of weak ties (β_w): [0.005, 0.01, 0.015] Effect of strong ties (β_s): [0.01, 0.04, 0.07]
parameter_space = [(s, w, alpha, beta_w, beta_s) for s in floor.(Int, range(5, stop=29, length=3)),
w in floor.(Int, range(5, stop=29, length=3)),
alpha in range(0.0005, stop=0.01, length=3),
beta_w in range(0.005, stop=0.015, length=3),
beta_s in range(0.01, stop=0.07, length=3)]
size(parameter_space), length(parameter_space)
((3, 3, 3, 3, 3), 243)
function execute_simulation(parameter_space, n_nodes::Int)
# n_nodes dictates how big the network will be
# We cannot pre-allocate the output since we do not know for how many time steps the simulation will
# run at each setting
output = DataFrame(s = Int[], w = Int[], alpha = Float64[],
beta_w = Float64[], beta_s = Float64[],
t = Int[], num_engaged = Int[])
println("Beginning simulation at : ", Dates.format(now(), "HH:MM"))
println("You might want to grab a cup of coffee while Julia brews the simulation...")
@showprogress 1 "Crunching numbers while you munch..." for (s, w, alpha, beta_w, beta_s) in parameter_space[1:end]
G = initialize_network(n_nodes, s)
node_status = reset_node_status(G)
num_engaged = sum(sum(node_status))
# Continue updates at each setting till 95% of the network engages
t = 1
while num_engaged < floor(Int, 0.95 * n_nodes)
update_status!(G, node_status, w, alpha, beta_w, beta_s)
num_engaged = sum(sum(node_status))
push!(output, [s, w, alpha, beta_w, beta_s, t, num_engaged])
t += 1
end
end
return output
end
execute_simulation (generic function with 1 method)
results = execute_simulation(parameter_space, 3000)
Beginning simulation at : 16:05 You might want to grab a cup of coffee while Julia brews the simulation...
Crunching numbers while you munch... 100%|███████████████| Time: 0:02:44
Row | s | w | alpha | beta_w | beta_s | t | num_engaged |
---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Int64 | Int64 | |
1 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 1 | 1 |
2 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 2 | 6 |
3 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 3 | 9 |
4 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 4 | 11 |
5 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 5 | 11 |
6 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 6 | 11 |
7 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 7 | 14 |
8 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 8 | 16 |
9 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 9 | 19 |
10 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 10 | 21 |
11 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 11 | 23 |
12 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 12 | 29 |
13 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 13 | 31 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
5643 | 5 | 29 | 0.01 | 0.015 | 0.07 | 10 | 2781 |
5644 | 5 | 29 | 0.01 | 0.015 | 0.07 | 11 | 2890 |
5645 | 17 | 29 | 0.01 | 0.015 | 0.07 | 1 | 63 |
5646 | 17 | 29 | 0.01 | 0.015 | 0.07 | 2 | 340 |
5647 | 17 | 29 | 0.01 | 0.015 | 0.07 | 3 | 1002 |
5648 | 17 | 29 | 0.01 | 0.015 | 0.07 | 4 | 1949 |
5649 | 17 | 29 | 0.01 | 0.015 | 0.07 | 5 | 2613 |
5650 | 17 | 29 | 0.01 | 0.015 | 0.07 | 6 | 2902 |
5651 | 29 | 29 | 0.01 | 0.015 | 0.07 | 1 | 238 |
5652 | 29 | 29 | 0.01 | 0.015 | 0.07 | 2 | 1146 |
5653 | 29 | 29 | 0.01 | 0.015 | 0.07 | 3 | 2373 |
5654 | 29 | 29 | 0.01 | 0.015 | 0.07 | 4 | 2921 |
To answer the research questions, the authors resort to simple linear regression.
Since our focus in this workbook is on highlighting the strengths of the JuliaGraphs ecosystem, we keep the regression modeling at the most basic level.
As discussed earlier, the outcome is the time taken for 95% of the network to engage with the message. The features used to predict this outcome are $s$, $w$, $\alpha$, $\beta_w$ and $\beta_S$.
first(results, 10)
Row | s | w | alpha | beta_w | beta_s | t | num_engaged |
---|---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Int64 | Int64 | |
1 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 1 | 1 |
2 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 2 | 6 |
3 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 3 | 9 |
4 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 4 | 11 |
5 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 5 | 11 |
6 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 6 | 11 |
7 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 7 | 14 |
8 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 8 | 16 |
9 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 9 | 19 |
10 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 10 | 21 |
To build the data required for the linear modeling, we group the data by each parameter setting and calculate the time the network takes to reach 95% activation.
all_engaged = combine(groupby(results, [:s, :w, :alpha, :beta_w, :beta_s]), df -> DataFrame(T95 = maximum(df[!,:t])));
first(all_engaged, 10)
Row | s | w | alpha | beta_w | beta_s | T95 |
---|---|---|---|---|---|---|
Int64 | Int64 | Float64 | Float64 | Float64 | Int64 | |
1 | 5 | 5 | 0.0005 | 0.005 | 0.01 | 154 |
2 | 17 | 5 | 0.0005 | 0.005 | 0.01 | 64 |
3 | 29 | 5 | 0.0005 | 0.005 | 0.01 | 47 |
4 | 5 | 17 | 0.0005 | 0.005 | 0.01 | 78 |
5 | 17 | 17 | 0.0005 | 0.005 | 0.01 | 43 |
6 | 29 | 17 | 0.0005 | 0.005 | 0.01 | 32 |
7 | 5 | 29 | 0.0005 | 0.005 | 0.01 | 47 |
8 | 17 | 29 | 0.0005 | 0.005 | 0.01 | 33 |
9 | 29 | 29 | 0.0005 | 0.005 | 0.01 | 25 |
10 | 5 | 5 | 0.00525 | 0.005 | 0.01 | 93 |
We then run a simple linear model on the data
ols = lm(@formula(T95 ~ s + w + alpha + beta_s + beta_w), all_engaged)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}} T95 ~ 1 + s + w + alpha + beta_s + beta_w Coefficients: ──────────────────────────────────────────────────────────────────────────────────── Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95% ──────────────────────────────────────────────────────────────────────────────────── (Intercept) 84.3588 3.04531 27.70 <1e-75 78.3594 90.3581 s -1.01132 0.0742558 -13.62 <1e-30 -1.1576 -0.865031 w -0.824588 0.0742558 -11.10 <1e-22 -0.970874 -0.678303 alpha -1374.92 187.594 -7.33 <1e-11 -1744.48 -1005.35 beta_s -292.798 29.7023 -9.86 <1e-18 -351.313 -234.284 beta_w -1095.06 178.214 -6.14 <1e-08 -1446.15 -743.976 ────────────────────────────────────────────────────────────────────────────────────
r2(ols)
0.6773101916389891
This is a rather strong finding. The speed of information diffusion is impacted equally strongly by both strong ties and weak ties. As the authors note, the surprising aspect of this strudy is that the effect of weak ties is rather strong despite the inferiority of the weak ties parameter in the model assumptions.