Notebook

Note: In this workbook, we try to replicate the results from the classic paper "Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth", Goldenberg, Libai and Muller (2001). This is a self-didactic attempt.

In [1]:

] add Graphs Distributions DataFrames GLM ProgressMeter

    Updating registry at `C:\Users\Thibaut\.julia\registries\General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Resolving package versions...
   Installed StatsFuns ─────────────── v1.0.1
   Installed DualNumbers ───────────── v0.6.8
   Installed PDMats ────────────────── v0.11.16
   Installed HypergeometricFunctions ─ v0.3.11
   Installed QuadGK ────────────────── v2.6.0
   Installed StatsModels ───────────── v0.6.33
   Installed FillArrays ────────────── v0.13.5
   Installed GLM ───────────────────── v1.8.1
   Installed ShiftedArrays ─────────── v2.0.0
   Installed Distributions ─────────── v0.25.78
    Updating `C:\Users\Thibaut\.julia\environments\v1.8\Project.toml`
  [31c24e10] + Distributions v0.25.78
  [38e38edf] + GLM v1.8.1
    Updating `C:\Users\Thibaut\.julia\environments\v1.8\Manifest.toml`
  [49dc2e85] + Calculus v0.5.1
  [b429d917] + DensityInterface v0.4.0
  [31c24e10] + Distributions v0.25.78
  [fa6b7ba4] + DualNumbers v0.6.8
  [1a297f60] + FillArrays v0.13.5
  [38e38edf] + GLM v1.8.1
  [34004b35] + HypergeometricFunctions v0.3.11
  [90014a1f] + PDMats v0.11.16
  [1fd47b50] + QuadGK v2.6.0
  [79098fc4] + Rmath v0.7.0
  [1277b4bf] + ShiftedArrays v2.0.0
  [4c63d2b9] + StatsFuns v1.0.1
  [3eaba693] + StatsModels v0.6.33
  [f50d1b31] + Rmath_jll v0.3.0+0
  [4607b0f0] + SuiteSparse
Precompiling project...
  ✓ ShiftedArrays
  ✓ PDMats
  ✓ Rmath_jll
  ✓ DensityInterface
  ✓ FillArrays
  ✓ QuadGK
  ✓ DualNumbers
  ✓ Rmath
  ✓ HypergeometricFunctions
  ✓ StatsFuns
  ✓ StatsModels
  ✓ Distributions
  ✓ GLM
  13 dependencies successfully precompiled in 15 seconds. 224 already precompiled. 1 skipped during auto due to previous errors.

In [2]:

using Graphs

using Distributions, DataFrames, GLM, ProgressMeter
using Dates
using Random: shuffle, seed!

In [3]:

seed!(20130810);

1. Introduction¶

In Talk of the Network, the authors explore the pattern of personal communication between an individual's core friends group (strong ties) and a wider set of acquaintances (weak ties). This remarkable study is one of the first ones in marketing that explored the influence of social networks on the diffusion of marketing messages. The key questions investigated in this paper are:

What matters more - strong ties or weak ties?
What effect does the size of an average individuals network have?
How does advertising interact with the diffusion through weak ties and that through strong ties

In this workbook, we focus on replicating the efforts of the authors to answer the first question: do strong ties or weak ties influence the speed of information dissemination in a network?

2. Initializing the network¶

This study employs a large number of synthetic networks as substrates to study the diffusion of information diffusion. To quote the authors logic to create and initialize the networks:

"Each individual belongs to a single personal network. Each network consists of individuals who are connected by strong ties. In each period, individuals also conduct a finte number of weak tie interactions outside their personal networks... We divide the entire market equally into personal networks, in which each individual can belong to one network. In addition, in each period, every individual conducts random meetings with individuals external to his personal network."

Given this specification, we utilize the built-in complete graph generator from Graphs.jl to build several mini-regular networks and then allow individuals in each of these mini-networks to mingle. Our final data structure is hence a vector of several complete networks that are built based on the number of strong ties for each individual. Note that each individual in the network has a fixed number of strong ties ($s$) and weak ties ($w$).

In [4]:

function initialize_network(n_nodes::Int, n_strong_ties::Int)
    G = [complete_graph(n_strong_ties) for g in 1:floor(Int, n_nodes/n_strong_ties)]
    return G
end

Out[4]:

initialize_network (generic function with 1 method)

3. Model¶

3.1 Assumptions¶

The probability of activation of a node, i.e., an uninformed individual turning to informed can happen in three ways: through a strong tie with probability $\beta_s$, through a weak tie with probability $\beta_w$ or through external marketing efforts with probability $\alpha$. In line with conventional wisdom, the authors assume $\alpha < \beta_w < \beta_s$.

At timestep $t$, if an individual is connected to $m$ strong ties and $j$ weak ties, the probability of the individual being informed in this time step is:

$$ p(t) = 1 - (1- \alpha)(1 - \beta_w)^j(1 - \beta_s)^m $$

The outcome variable of interest is the number of time steps elapsed till 95% of the network engages.

3.2 Execution¶

Following our earlier discussion on the construction of substrate networks, each node in the network belongs to a complete sub-network. In addition, at each time step each node interacts with a fixed number of weak ties chosen at random from sub-networks other than its own.

Step 1: At $t = 0$, the status of all nodes is set to false

Step 2: For each node, the probability $p(t)$ of being informed is calculated using the above equation. A random draw $U$ is made from a standard uniform distribution and compared with $p(t)$. If $U < p(t)$ the status of the node is changed to true

Step 3: In each successive time step, Step 2 is repeated till 95% of the total network (of size 3000) engages

We now look at several helper functions that execute the above logic

3.2.1 Reset node status¶

The node status is stored as a vector of BitVector's. At the beginning of each simulation run, we call the following function to set the status of all the nodes to false.

In [5]:

function reset_node_status(G::Vector{Graphs.SimpleGraphs.SimpleGraph{Int}})
    node_status = [falses(nv(g)) for g in G]
    return node_status
end

Out[5]:

reset_node_status (generic function with 1 method)

3.2.2 Updating status of the nodes¶

At each time step, we execute two tasks. First, we allow the nodes to mingle randomly with their strong ties and with weak ties from other sub-networks. At this point, we count the number of active strong and weak ties for each node. Then, we use this information to update the status of all the nodes in the network.

The first function counts the number of active strong ties within the node's sub-network. The second function executes the "random meetings" with weak ties as discussed in the paper. For each node we generate a random sample (without replacement) of size $w$ from sub-networks other than its own. We then count the number of active ties in its own sub-network and among the random sample taken from the rest of the network.

In [6]:

function count_active_str_ties(G::Vector{Graphs.SimpleGraphs.SimpleGraph{Int}},
                               node_network_id::Int,
                               node::Int,
                               node_status::Vector{BitVector})
    n_active_str_ties = sum([node_status[node_network_id][nbr] for nbr in neighbors(G[node_network_id], node)])
    return n_active_str_ties
end

Out[6]:

count_active_str_ties (generic function with 1 method)

In [7]:

function random_meetings(G::Vector{Graphs.SimpleGraphs.SimpleGraph{Int}},
                         node_network_id::Int,
                         node::Int,
                         node_status::Vector{BitVector},
                         n_weak_ties::Int)
    # Choose a random sample of size `n_weak_ties` from the other sub-networks and query
    # their status. We first sample the network id, and use this to sample a random node
    # in the sub-network defined by this id.

    all_network_ids = 1:length(G)

    other_network_ids = all_network_ids[all_network_ids .!= node_network_id]
    possible_weak_ties = []
    nsamples = 1

    while nsamples < n_weak_ties
        rand_network_id = sample(other_network_ids)
        rand_nbr = sample(vertices(G[rand_network_id]))
        if !((rand_network_id, rand_nbr) in possible_weak_ties)
            push!(possible_weak_ties, (rand_network_id, rand_nbr))
            nsamples += 1
        end
    end

    n_active_wk_ties = sum([node_status[network_id][weak_tie] for (network_id, weak_tie) in possible_weak_ties])
    return n_active_wk_ties
end

Out[7]:

random_meetings (generic function with 1 method)

Finally, the function below conducts the updation of the status of all the nodes at each time step by calculating the probability of activation.

In [8]:

function update_status!(G::Vector{Graphs.SimpleGraphs.SimpleGraph{Int}},
                        node_status::Vector{BitVector},
                        n_weak_ties::Int,
                        alpha::Float64, beta_w::Float64, beta_s::Float64)
    # assuming that the nodes update in random order

    for node_network_id in shuffle(1:length(G))
        for node in shuffle(vertices(G[node_network_id]))
            n_active_str_ties = count_active_str_ties(G, node_network_id, node, node_status)
            n_active_wk_ties = random_meetings(G, node_network_id, node, node_status, n_weak_ties)

            activation_prob = 1 - (1 - alpha) * (1 - beta_w)^n_active_wk_ties * (1 - beta_s)^n_active_str_ties

            if rand(Uniform()) < activation_prob
                node_status[node_network_id][node] = true
            end
        end
    end

    return nothing
end

Out[8]:

update_status! (generic function with 1 method)

3.2.4 Simulation on the parameter space¶

The function execute_simulation puts together the scaffolding to set up the parameter space $(s, w, \alpha, \beta_w, \beta_s)$ and execute diffusion along the network. From what I can gather from the paper, one simulation was carried out at each point on the parameter space. No further details regarding the execution are mentioned except that since each parameter has 7 levels, a total of $7^5 = 16,807$ simulations were executed in a factorial design. In this workbook, we work on a smaller parameter space using 3 levels for each parameter.

Also, I am assuming that the network is drawn at random for each run of the simulation.

One more interesting thing to note: The authors mention that their simulations were written in C, it would be interesting to compare the execution times with Julia. This is a non-standard problem that tests both the robustness of Julia types and its execution speed (maybe this will prompt someone to make a pull request!).

In [9]:

println("Number of strong ties per node (s): ", floor.(Int, range(5, stop=29, length=3)))
println("Number of weak ties per node(w): ", floor.(Int, range(5, stop=29, length=3)))
println("Effect of advertising (α): ", collect(range(0.0005, stop=0.01, length=3)))
println("Effect of weak ties (β_w): ", collect(range(0.005, stop=0.015, length=3)))
println("Effect of strong ties (β_s): ", collect(range(0.01, stop=0.07, length=3)))

Number of strong ties per node (s): [5, 17, 29]
Number of weak ties per node(w): [5, 17, 29]
Effect of advertising (α): [0.0005, 0.00525, 0.01]
Effect of weak ties (β_w): [0.005, 0.01, 0.015]
Effect of strong ties (β_s): [0.01, 0.04, 0.07]

In [10]:

parameter_space = [(s, w, alpha, beta_w, beta_s) for s in floor.(Int, range(5, stop=29, length=3)), 
                                                     w in floor.(Int, range(5, stop=29, length=3)),
                                                     alpha in range(0.0005, stop=0.01, length=3),
                                                     beta_w in range(0.005, stop=0.015, length=3),
                                                     beta_s in range(0.01, stop=0.07, length=3)]

size(parameter_space), length(parameter_space)

Out[10]:

((3, 3, 3, 3, 3), 243)

In [11]:

function execute_simulation(parameter_space, n_nodes::Int)
    # n_nodes dictates how big the network will be
    # We cannot pre-allocate the output since we do not know for how many time steps the simulation will
    # run at each setting

    output = DataFrame(s = Int[], w = Int[], alpha = Float64[],
                       beta_w = Float64[], beta_s = Float64[],
                       t = Int[], num_engaged = Int[])

    println("Beginning simulation at : ", Dates.format(now(), "HH:MM"))
    println("You might want to grab a cup of coffee while Julia brews the simulation...")

    @showprogress 1 "Crunching numbers while you munch..." for (s, w, alpha, beta_w, beta_s) in parameter_space[1:end]
        G = initialize_network(n_nodes, s)
        node_status = reset_node_status(G)
        num_engaged = sum(sum(node_status))

        # Continue updates at each setting till 95% of the network engages
        t = 1
        while num_engaged < floor(Int, 0.95 * n_nodes)
            update_status!(G, node_status, w, alpha, beta_w, beta_s)
            num_engaged = sum(sum(node_status))
            push!(output, [s, w, alpha, beta_w, beta_s, t, num_engaged])
            t += 1
        end
    end

    return output
end

Out[11]:

execute_simulation (generic function with 1 method)

In [12]:

results = execute_simulation(parameter_space, 3000)

Beginning simulation at : 16:05
You might want to grab a cup of coffee while Julia brews the simulation...

Crunching numbers while you munch... 100%|███████████████| Time: 0:02:44

Out[12]:

5654×7 DataFrame

5629 rows omitted

Row	s	w	alpha	beta_w	beta_s	t	num_engaged
	Int64	Int64	Float64	Float64	Float64	Int64	Int64
1	5	5	0.0005	0.005	0.01	1	1
2	5	5	0.0005	0.005	0.01	2	6
3	5	5	0.0005	0.005	0.01	3	9
4	5	5	0.0005	0.005	0.01	4	11
5	5	5	0.0005	0.005	0.01	5	11
6	5	5	0.0005	0.005	0.01	6	11
7	5	5	0.0005	0.005	0.01	7	14
8	5	5	0.0005	0.005	0.01	8	16
9	5	5	0.0005	0.005	0.01	9	19
10	5	5	0.0005	0.005	0.01	10	21
11	5	5	0.0005	0.005	0.01	11	23
12	5	5	0.0005	0.005	0.01	12	29
13	5	5	0.0005	0.005	0.01	13	31
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
5643	5	29	0.01	0.015	0.07	10	2781
5644	5	29	0.01	0.015	0.07	11	2890
5645	17	29	0.01	0.015	0.07	1	63
5646	17	29	0.01	0.015	0.07	2	340
5647	17	29	0.01	0.015	0.07	3	1002
5648	17	29	0.01	0.015	0.07	4	1949
5649	17	29	0.01	0.015	0.07	5	2613
5650	17	29	0.01	0.015	0.07	6	2902
5651	29	29	0.01	0.015	0.07	1	238
5652	29	29	0.01	0.015	0.07	2	1146
5653	29	29	0.01	0.015	0.07	3	2373
5654	29	29	0.01	0.015	0.07	4	2921

4. Discussion¶

To answer the research questions, the authors resort to simple linear regression.

Since our focus in this workbook is on highlighting the strengths of the JuliaGraphs ecosystem, we keep the regression modeling at the most basic level.

As discussed earlier, the outcome is the time taken for 95% of the network to engage with the message. The features used to predict this outcome are $s$, $w$, $\alpha$, $\beta_w$ and $\beta_S$.

In [13]:

first(results, 10)

Out[13]:

10×7 DataFrame

Row	s	w	alpha	beta_w	beta_s	t	num_engaged
	Int64	Int64	Float64	Float64	Float64	Int64	Int64
1	5	5	0.0005	0.005	0.01	1	1
2	5	5	0.0005	0.005	0.01	2	6
3	5	5	0.0005	0.005	0.01	3	9
4	5	5	0.0005	0.005	0.01	4	11
5	5	5	0.0005	0.005	0.01	5	11
6	5	5	0.0005	0.005	0.01	6	11
7	5	5	0.0005	0.005	0.01	7	14
8	5	5	0.0005	0.005	0.01	8	16
9	5	5	0.0005	0.005	0.01	9	19
10	5	5	0.0005	0.005	0.01	10	21

To build the data required for the linear modeling, we group the data by each parameter setting and calculate the time the network takes to reach 95% activation.

In [14]:

all_engaged = combine(groupby(results, [:s, :w, :alpha, :beta_w, :beta_s]), df -> DataFrame(T95 = maximum(df[!,:t])));
first(all_engaged, 10)

Out[14]:

10×6 DataFrame

Row	s	w	alpha	beta_w	beta_s	T95
	Int64	Int64	Float64	Float64	Float64	Int64
1	5	5	0.0005	0.005	0.01	154
2	17	5	0.0005	0.005	0.01	64
3	29	5	0.0005	0.005	0.01	47
4	5	17	0.0005	0.005	0.01	78
5	17	17	0.0005	0.005	0.01	43
6	29	17	0.0005	0.005	0.01	32
7	5	29	0.0005	0.005	0.01	47
8	17	29	0.0005	0.005	0.01	33
9	29	29	0.0005	0.005	0.01	25
10	5	5	0.00525	0.005	0.01	93

We then run a simple linear model on the data

In [15]:

ols = lm(@formula(T95 ~ s + w + alpha + beta_s + beta_w), all_engaged)

Out[15]:

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

T95 ~ 1 + s + w + alpha + beta_s + beta_w

Coefficients:
────────────────────────────────────────────────────────────────────────────────────
                    Coef.   Std. Error       t  Pr(>|t|)     Lower 95%     Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept)     84.3588      3.04531     27.70    <1e-75     78.3594       90.3581
s               -1.01132     0.0742558  -13.62    <1e-30     -1.1576       -0.865031
w               -0.824588    0.0742558  -11.10    <1e-22     -0.970874     -0.678303
alpha        -1374.92      187.594       -7.33    <1e-11  -1744.48      -1005.35
beta_s        -292.798      29.7023      -9.86    <1e-18   -351.313      -234.284
beta_w       -1095.06      178.214       -6.14    <1e-08  -1446.15       -743.976
────────────────────────────────────────────────────────────────────────────────────

In [16]:

r2(ols)

Out[16]:

0.6773101916389891

This is a rather strong finding. The speed of information diffusion is impacted equally strongly by both strong ties and weak ties. As the authors note, the surprising aspect of this strudy is that the effect of weak ties is rather strong despite the inferiority of the weak ties parameter in the model assumptions.