
Overview¶

Networks (a.k.a. graphs) are widely used mathematical objects for representing and analysing social systems. This week is about getting familiar with networks, and we'll focus on four main aspects:

Basic mathematical description of networks
The NetworkX library
Building the network of Computational Social Scientists
Basic analysis of the network of Computational Social Scientists

Part 1: Basic mathematical description of networks¶

This week, let's start with some lecturing. You will watch some videos made by Sune for his course Social Graphs and Interactions, where he covers networks in detail.

Video Lecture. Start by watching the "History of Networks".

In [188]:

from IPython.display import YouTubeVideo
YouTubeVideo("qjM9yMarl70",width=800, height=450)

Out[188]:

Video Lecture. Then check out a few comments on "Network Notation".

In [189]:

YouTubeVideo("MMziC5xktHs",width=800, height=450)

Out[189]:

Reading. We'll be reading the textbook Network Science (NS) by Albert-László Barabási. You can read the whole thing for free here.

Read chapter 1.

Read chapter 2.

Exercise 1 Answer in a Jupyter notebook.

List three different real networks and state the nodes and links for each of them.

Tell us about the network you are personally most interested in. Address the following questions:

What are its nodes and links?

How large is it?

Can it be mapped out?

Why do you care about it?

In your view, what would be the area where network science could have the biggest impact in the next decade? Explain your answer, and base it on the text in the book.

One person per pair: go to DTU Learn and fill in the survey "Networks".

Part 2: Exercises using the `NetworkX` library¶

We will analyse networks in Python using the NetworkX library. The cool thing about NetworkX is that it includes a lot of algorithms and metrics for analysing networks, so you don't have to code things from scratch. Get started by running the magic `pip install networkx` command. Then, get familiar with the library through the following exercises:
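To give a first taste of what "not coding things from scratch" means, here is a minimal sketch on a made-up toy graph. The node names are purely illustrative and have nothing to do with the course dataset.

```python
import networkx as nx

# Build a small undirected toy graph (node names are hypothetical).
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])

# A few quantities that NetworkX computes for us out of the box.
n_nodes = G.number_of_nodes()
n_links = G.number_of_edges()
degrees = dict(G.degree())            # number of neighbours per node
path = nx.shortest_path(G, "a", "d")  # one shortest path from a to d
clustering = nx.average_clustering(G)

print(n_nodes, n_links, degrees, path, clustering)
```

The official tutorial covers all of these calls in much more detail.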

Exercises. We will start by solving some exercises from the book.

Go to the NetworkX project's tutorial page. The goal of this exercise is to create your own notebook that contains the entire tutorial. You're free to add your own (e.g. shorter) comments in place of the ones in the official tutorial, and to change the code to make it your own wherever it makes sense.

Go to Section 2.12: Homework, then

Write the solution for exercise 2.1 (the 'Königsberg Problem') from NS in your notebook.

Solve exercise 2.3 ('Graph representation') from NS using NetworkX in your notebook. (You don't have to solve the last sub-question about cycles of length 4 ... but I'll be impressed if you do it).

Solve exercise 2.5 ('Bipartite Networks') from NS using NetworkX in your notebook.
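If you are unsure where to start with the representation and bipartite exercises, the sketch below shows the kind of NetworkX calls that may be useful. Both graphs are invented toy examples, not the ones from the book, so you still have to build the book's graphs yourself.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy undirected graph (hypothetical, not the one from NS).
G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4)])

# Three equivalent representations of the same graph.
nodes = sorted(G.nodes())
adjacency = nx.to_numpy_array(G, nodelist=nodes, dtype=int)  # adjacency matrix
edge_list = sorted(tuple(sorted(e)) for e in G.edges())      # link list
neighbours = {n: sorted(G.neighbors(n)) for n in nodes}      # adjacency list

# Toy bipartite graph: letters on one side, numbers on the other.
B = nx.Graph()
B.add_nodes_from(["A", "B", "C"], bipartite=0)
B.add_nodes_from([10, 20], bipartite=1)
B.add_edges_from([("A", 10), ("B", 10), ("B", 20), ("C", 20)])

# Projection onto the letters: two letters are linked if they share a number.
letters = bipartite.projected_graph(B, ["A", "B", "C"])
print(adjacency)
print(sorted(letters.edges()))
```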

Part 3: Building the network of Computational Social Scientists¶

Ok, enough with theory :) It is time to go back to our cool dataset that was super painful to download! And guess what? We will build the network of Computational Social Scientists. Then, we will use some Network Science to study some of its properties.

Exercise 1: Filter Computational Social Science Papers. Our set of articles contains many papers that we do not want to retain. A first task is that of selecting CSS papers. We will adopt a heuristic method. What we know is that Computational Social Science is about using quantitative methods to tackle Social Science questions. So, we will keep papers that cover at least one field within the Social Sciences, and that include some quantitative methods or some experts in a quantitative discipline. Follow me:

Write a function that given a list of authorIds outputs the list of their top fields. Note: You should use the column "field" in your author dataset.

Apply this function to each record of the authorIds column in your paper dataset. Store the result in a new column of your paper dataset, called "author_field".

Create a set of social_science_fields that includes the following disciplines: Political Science, Sociology, Economics

Create a set of quantitative_fields that includes the following disciplines: Mathematics, Physics, Computer Science.

Select the subset of rows in your paper dataset that respect all of the following conditions:

the paper fields include at least one of the social_science_fields

the paper fields include at least one of the quantitative_fields OR the paper is written by at least one author whose top field is among the quantitative_fields.

the paper does not include the field "Biology"

the paper has fewer than 10 Computational Social Science authors

the paper is published after 2008

the paper has a DOI

Store the results in a new dataframe called ccs_papers. Save the dataframe to file. Note: we will not use the entire paper dataset again.

How many papers are you left with? How many unique authors are involved in these papers?

Print the titles of the top 10 papers in your set (by citation count). Do you think that these papers are from Computational Social Science?

Why do you think I wanted you to use the selection criteria above?
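The filtering steps above could be sketched roughly as follows with pandas. This is only an illustration on tiny made-up dataframes: apart from "field", "authorIds" and "author_field", the column names ("authorId", "fields", "year", "doi", "title") are assumptions about how you stored your data. The author-count condition is also read here simply as "fewer than 10 authors", which is one possible interpretation.

```python
import pandas as pd

# Tiny invented stand-ins for the real author and paper datasets.
authors = pd.DataFrame({
    "authorId": ["a1", "a2", "a3"],
    "field": ["Physics", "Sociology", "Biology"],
})
papers = pd.DataFrame({
    "title": ["P1", "P2", "P3"],
    "authorIds": [["a1", "a2"], ["a2"], ["a2", "a3"]],
    "fields": [["Sociology"], ["Sociology", "Economics"], ["Sociology", "Biology"]],
    "year": [2015, 2012, 2019],
    "doi": ["10.0/x", "10.0/y", None],
})

top_field = authors.set_index("authorId")["field"].to_dict()


def author_fields(author_ids):
    """Return the top field of each author, skipping unknown ids."""
    return [top_field[a] for a in author_ids if a in top_field]


papers["author_field"] = papers["authorIds"].apply(author_fields)

social_science_fields = {"Political Science", "Sociology", "Economics"}
quantitative_fields = {"Mathematics", "Physics", "Computer Science"}


def is_css(row):
    """Check all selection conditions for a single paper."""
    fields = set(row["fields"])
    quantitative = (fields & quantitative_fields
                    or set(row["author_field"]) & quantitative_fields)
    return bool(
        fields & social_science_fields
        and quantitative
        and "Biology" not in fields
        and len(row["authorIds"]) < 10   # assumption: count all authors
        and row["year"] > 2008
        and pd.notna(row["doi"])
    )


ccs_papers = papers[papers.apply(is_css, axis=1)].copy()
n_unique_authors = ccs_papers["authorIds"].explode().nunique()
print(len(ccs_papers), n_unique_authors)
```

In the toy data only the first paper survives: the second has no quantitative field or author, and the third includes Biology and has no DOI.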

Exercise 2: Build the network of Computational Social Scientists. We are ready!! After crunching all this data, we are getting closer to our object of study. Let's build the network of Computational Social Scientists! In this network, nodes correspond to authors of papers, and a link connects node A and node B if A and B ever worked together. The weight on the link corresponds to the number of times A worked with B.

Consider your dataframe of ccs_papers. Create a weighted edgelist, where each element of the list is a tuple with three entries. The first two entries are the authorIds of two authors that have collaborated on at least one paper, the last entry is the number of papers they have worked on together. Note that we want each pair of authors to appear only once in the list.

Create an undirected Graph using networkx. Then, use the networkx function add_weighted_edges_from to create a weighted, undirected graph starting from the edgelist you created in step 1.

Add the following attributes to each node as a node attribute:

the author's name. For this, go back to your author dataset. Remember how you have stored both the name and the aliases of an author. Now, what happens is that (don't ask me why!) the first name of authors is often truncated to the first letter (for example my own name in the dataset is "L. Alessandretti"). However, the full author's name can often be found among the aliases (in my case it would be "Laura Alessandretti"). Make sure that, if the full name is available among the aliases, you use that one. Note: Here you need to use some heuristics. Just make reasonable choices!

the author's top field.

the author's median citation_count (considering only the ccs papers)

the author's total number of ccs_papers

the year in which the author published their first ccs paper

Save the Network as a json file, and give yourself a pat on the back :)
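One possible way to go through these steps is sketched below on invented data. The author ids, the records and the best_name helper are hypothetical, and the name heuristic is just one reasonable choice among many. The node-link JSON format is also only one option for saving the network.

```python
import json
import tempfile
from collections import Counter
from itertools import combinations
from pathlib import Path

import networkx as nx

# Made-up author lists, one per paper (a stand-in for ccs_papers["authorIds"]).
paper_authors = [["a1", "a2", "a3"], ["a1", "a2"], ["a3", "a4"]]

# Count shared papers per pair. Sorting the ids means (a, b) and (b, a)
# map to the same key, so each pair appears only once.
pair_counts = Counter()
for author_ids in paper_authors:
    pair_counts.update(combinations(sorted(set(author_ids)), 2))

weighted_edgelist = [(u, v, w) for (u, v), w in pair_counts.items()]

G = nx.Graph()
G.add_weighted_edges_from(weighted_edgelist)


def best_name(name, aliases):
    """Hypothetical heuristic: prefer an alias with the same surname and
    a spelled-out first name; otherwise keep the stored name."""
    surname = name.split()[-1]
    full = [
        a for a in aliases
        if a.split() and a.split()[-1] == surname
        and len(a.split()[0].rstrip(".")) > 1
    ]
    return max(full, key=len) if full else name


# Invented records: authorId -> (name, aliases).
records = {
    "a1": ("L. Alessandretti", ["Laura Alessandretti", "L. Alessandretti"]),
    "a2": ("B. Example", []),
    "a3": ("C. Example", ["C Example"]),
    "a4": ("D. Example", ["Dana Example"]),
}
names = {a: best_name(n, al) for a, (n, al) in records.items()}
nx.set_node_attributes(G, names, name="name")

# Save in node-link JSON format, then load it back as a sanity check.
path = Path(tempfile.mkdtemp()) / "css_network.json"
path.write_text(json.dumps(nx.node_link_data(G)))
G_loaded = nx.node_link_graph(json.loads(path.read_text()))
```

The other attributes (top field, median citation count, number of papers, first year) can be attached in the same way, with one dictionary keyed by authorId per attribute.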

Part 4: Preliminary analysis of the network of Computational Social Scientists¶

We begin with a preliminary analysis of the network.

Exercise 3: Basic analysis of the network of Computational Social Scientists

Why do you think I want you guys to use an undirected graph? Could we have used a directed graph instead?

What is the total number of nodes in the network? What is the total number of links? What is the density of the network (that is the total number of links over the maximum number of links)?

What are the average, median, mode, minimum and maximum value of the degree? What are the average, median, mode, minimum and maximum value of the node strength? How do you interpret the results?

List the top 5 authors by degree. What is their total number of citations?

Look them up online. What do they work on?

Plot the distribution of degrees, using appropriate binning. What do you observe?

Plot a scatter plot of the degree versus the "median number of citations" per ccs paper for all authors. Use logarithmic axes where appropriate. Compute the Spearman correlation between the two.

Bin your degrees using the bins in point 6. and compute the 25th, 50th, and 75th percentile in each bin. Add the result to your figure as a line plot with errorbars (the median value is the line plot, and the 25th and 75th percentiles are the errorbars).

Why do you think I wanted you guys to use the Spearman correlation (instead of the usual Pearson correlation)?

Comment on your results. Do you observe any relation? If yes, what could be the underlying reason, and how could you further explore possible reasons? If not, why do you think that is the case?
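The numerical parts of this exercise could be sketched as below, leaving the plotting to you. The graph is a toy example, the node attribute name "median_citations" is an assumption, and the choice of five logarithmic bin edges is arbitrary.

```python
import networkx as nx
import numpy as np
from scipy.stats import spearmanr

# Toy weighted graph with an invented "median_citations" node attribute.
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 3), ("a", "c", 1), ("a", "d", 2), ("b", "c", 1)])
nx.set_node_attributes(G, {"a": 40, "b": 12, "c": 9, "d": 1}, name="median_citations")

n, m = G.number_of_nodes(), G.number_of_edges()
density = m / (n * (n - 1) / 2)              # links over maximum possible links

degree = dict(G.degree())                    # number of collaborators
strength = dict(G.degree(weight="weight"))   # sum of the link weights

deg = np.array(list(degree.values()))
summary = {
    "mean": deg.mean(), "median": np.median(deg),
    "mode": np.bincount(deg).argmax(), "min": deg.min(), "max": deg.max(),
}

# Logarithmically spaced bins tend to suit heavy-tailed degree distributions.
bins = np.logspace(0, np.log10(deg.max() + 1), 5)
hist, edges = np.histogram(deg, bins=bins, density=True)

# Rank-based correlation between degree and median citations.
cites = np.array([G.nodes[v]["median_citations"] for v in degree])
rho, p_value = spearmanr(deg, cites)

# 25th, 50th and 75th percentile of citations within each non-empty degree bin.
bin_index = np.digitize(deg, bins)
quartiles = {
    int(b): np.percentile(cites[bin_index == b], [25, 50, 75])
    for b in np.unique(bin_index)
}
print(density, summary, rho)
```

The same `summary` computation applies to the strength values, and `quartiles` holds the numbers needed for the median line and its error bars.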

Your feedback¶

I hope you enjoyed today's class. It would be awesome if you could spend a few minutes to share your feedback.
Go to DTU Learn and fill in the survey "Week 4 - Feedback".
