#!/usr/bin/env python
# coding: utf-8

# # Overview
#
# Networks (a.k.a. graphs) are widely used mathematical objects for representing and analysing social systems.
# This week is about getting familiar with networks, and we'll focus on four main aspects:
#
# * Basic mathematical description of networks.
# * The `NetworkX` library.
# * Building the network of Computational Social Scientists.
# * Basic analysis of the network of Computational Social Scientists.

# # Part 1: Basic mathematical description of networks
#
# This week, let's start with some lecturing. You will watch some videos made by Sune for his course _Social Graphs and Interactions_, where he covers networks in detail.
#
# > **_Video Lecture_**. Start by watching the ["History of Networks"](https://youtu.be/qjM9yMarl70).
#

# In[188]:


from IPython.display import YouTubeVideo
YouTubeVideo("qjM9yMarl70", width=800, height=450)


# > **_Video Lecture_**. Then check out a few comments on ["Network Notation"](https://youtu.be/MMziC5xktHs).

# In[189]:


YouTubeVideo("MMziC5xktHs", width=800, height=450)


# > __Reading__. We'll be reading the textbook _Network Science_ (NS) by Albert-László Barabási. You can read the whole
# > thing for free [**here**](http://barabasi.com/networksciencebook/).
# >
# > * Read chapter 1.
# > * Read chapter 2.
# >
# > __Exercise 1__ Answer in a Jupyter notebook.
# > * List three different real networks and state the nodes and links for each of them.
# > * Tell us about the network you are personally most interested in. Address the following questions:
# >   * What are its nodes and links?
# >   * How large is it?
# >   * Can it be mapped out?
# >   * Why do you care about it?
# > * In your view, what would be the area where network science could have the biggest impact in the next decade? Explain your answer - and base it on the text in the book.
# >
# > One person per pair: go to [DTU Learn](https://learn.inside.dtu.dk/d2l/home/145262) and fill in the survey "_Networks_".

# # Part 2: Exercises using the `NetworkX` library
#
# We will analyse networks in Python using the [NetworkX](https://networkx.org/) library. The cool thing about networkx is that it includes a lot of algorithms and metrics for analysing networks, so you don't have to code things from scratch. Get started by running the magic ``pip install networkx`` command. Then, get familiar with the library through the following exercises:
#
# > __Exercises__ We will start by solving some exercises from the book.
#
# > * Go to the NetworkX project's [tutorial page](https://networkx.org/documentation/stable/tutorial.html). The goal of this exercise is to create your own notebook that contains the entire tutorial. You're free to add your own (e.g. shorter) comments in place of the ones in the official tutorial - and change the code to make it your own wherever it makes sense.
# > * Go to Section 2.12: [Homework](http://networksciencebook.com/chapter/2#homework2), then
# >   * Write the solution for exercise 2.1 (the 'Königsberg Problem') from NS in your notebook.
# >   * Solve exercise 2.3 ('Graph representation') from NS using NetworkX in your notebook. (You don't have to solve the last sub-question about cycles of length 4 ... but I'll be impressed if you do it).
# >   * Solve exercise 2.5 ('Bipartite Networks') from NS using NetworkX in your notebook.

# # Part 3: Building the network of Computational Social Scientists
# Ok, enough with theory :) It is time to go back to our cool dataset that was super painful to download! And guess what? We will build the network of Computational Social Scientists. Then, we will use some Network Science to study some of its properties.

# > **Exercise 1: Filter Computational Social Science Papers**. Our set of articles contains many papers that we do not want to retain. The first task is to select the CSS papers. We will adopt a heuristic method. What we know is that Computational Social Science is about using quantitative methods to tackle Social Science questions. So, we will keep papers that cover at least one field within the Social Sciences and that involve some quantitative methods or some experts in a quantitative discipline. Follow me:
# > 1. Write a function that, given a list of authorIds, outputs the list of their top fields. __Note:__ You should use the column "*field*" in your *author* dataset.
# > 2. Apply this function to each record of the authorIds column in your *paper* dataset. Store the result in a new column of your *paper* dataset, called "*author_field*".
# > 3. Create a set of *social_science_fields* that includes the following disciplines: _Political Science_, _Sociology_, _Economics_.
# > 4. Create a set of *quantitative_fields* that includes the following disciplines: _Mathematics_, _Physics_, _Computer Science_.
# > 5. Select the subset of rows in your paper dataset that satisfy all of the following conditions:
# >    * the paper *fields* include at least one of the *social_science_fields*
# >    * the paper *fields* include at least one of the *quantitative_fields* OR the paper is written by at least one author whose top field is among the *quantitative_fields*
# >    * the paper does not include the field "_Biology_"
# >    * the paper has fewer than 10 Computational Social Science authors
# >    * the paper is published after 2008
# >    * the paper has a DOI
# >
# >    Store the results in a new dataframe called *ccs_papers*. Save the dataframe to file. __Note:__ we will not use the entire _paper_ dataset again.
# > 6. How many papers are you left with? How many unique authors are involved in these papers?
# > 7. Print the titles of the top 10 papers in your set (by citation count). Do you think that these papers are from Computational Social Science?
# > 8. Why do you think I wanted you to use the selection criteria above?
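
# The selection in steps 1 to 6 can be sketched on a handful of toy records. This is only a minimal illustration, not the intended solution: the record keys (`authorIds`, `fields`, `year`, `doi`), the toy ids and the helper names `top_fields` and `is_css_paper` are assumptions, and the author-count criterion is read here as "fewer than 10 authors".

```python
# Hypothetical column names and toy records; adapt them to your own datasets.
SOCIAL_SCIENCE_FIELDS = {"Political Science", "Sociology", "Economics"}
QUANTITATIVE_FIELDS = {"Mathematics", "Physics", "Computer Science"}


def top_fields(author_ids, author_field):
    """Return the top field of each author that appears in the lookup."""
    return [author_field[a] for a in author_ids if a in author_field]


def is_css_paper(paper):
    """Apply the selection criteria of step 5 to one paper record."""
    fields = set(paper["fields"])
    author_fields = set(paper["author_field"])
    return (
        bool(fields & SOCIAL_SCIENCE_FIELDS)
        # quantitative paper field OR quantitative top field of an author
        and bool((fields | author_fields) & QUANTITATIVE_FIELDS)
        and "Biology" not in fields
        and len(paper["authorIds"]) < 10  # assumption: "fewer than 10 authors"
        and paper["year"] > 2008
        and bool(paper["doi"])  # assumption: a missing DOI is None or ""
    )


# Toy stand-ins for the author and paper datasets (made-up values).
author_field = {"a1": "Sociology", "a2": "Physics", "a3": "Biology"}
papers = [
    {"title": "P1", "authorIds": ["a1", "a2"], "fields": ["Sociology"],
     "year": 2015, "doi": "toy-doi-1"},
    {"title": "P2", "authorIds": ["a1"], "fields": ["Sociology"],
     "year": 2015, "doi": "toy-doi-2"},
    {"title": "P3", "authorIds": ["a2", "a3"],
     "fields": ["Economics", "Biology", "Physics"],
     "year": 2012, "doi": "toy-doi-3"},
    {"title": "P4", "authorIds": ["a2"], "fields": ["Economics", "Mathematics"],
     "year": 2005, "doi": "toy-doi-4"},
]

# Step 2: add the author_field entry to every paper record.
for paper in papers:
    paper["author_field"] = top_fields(paper["authorIds"], author_field)

# Steps 5 and 6: keep the matching papers and count the unique authors.
ccs_papers = [p for p in papers if is_css_paper(p)]
unique_authors = {a for p in ccs_papers for a in p["authorIds"]}
print(len(ccs_papers), len(unique_authors))
```

# Writing the criteria as a single predicate on one record keeps each condition visible on its own line, which makes it easy to switch one off and see how many papers it removes. With a dataframe, a row-wise `apply` with a similar predicate is one possible route.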
# > **Exercise 2: Build the network of Computational Social Scientists**. We are ready!! After crunching all this data, we are getting closer to our object of study. Let's build the network of Computational Social Scientists! In this network, nodes correspond to authors of papers, and a link between node _A_ and node _B_ exists if _A_ and _B_ have ever worked together. The weight on the link corresponds to the number of times _A_ worked with _B_.
# >
# > 1. Consider your dataframe of *ccs_papers*. Create a weighted _edgelist_, where each element of the list is a tuple with three entries. The first two entries are the _authorIds_ of two authors that have collaborated on at least one paper; the last entry is the number of papers they have worked on together. Note that we want each pair of authors to appear only once in the list.
# > 2. Create an undirected [``Graph``](https://networkx.org/documentation/stable/reference/classes/graph.html) using networkx. Then, use the networkx function [``add_weighted_edges_from``](https://networkx.org/documentation/stable/reference/classes/generated/networkx.Graph.add_weighted_edges_from.html#networkx.Graph.add_weighted_edges_from) to create a weighted, undirected graph starting from the edgelist you created in step 1.
# > 3. Add the following attributes to each node as a [node attribute](https://networkx.org/documentation/stable/reference/generated/networkx.classes.function.set_node_attributes.html):
# >    * the author's _name_. For this, go back to your _author_ dataset. Remember how you have stored both the _name_ and the _aliases_ of an author. Now, what happens is that (don't ask me why!) the first _name_ of authors is often truncated to the first letter (for example my own _name_ in the dataset is "_L. Alessandretti_"). However, the full author's name can often be found among the aliases (in my case it would be "_Laura Alessandretti_"). Make sure that, if the full name is available among the aliases, you use that one. __Note:__ Here you need to use some heuristics. Just make reasonable choices!
# >    * the author's _top field_
# >    * the author's median _citation_count_ (considering only the ccs papers)
# >    * the author's total number of _ccs_papers_
# >    * the year in which the author published their first _ccs paper_
# >
# > Save the network as a json file, and give yourself a pat on the back :)

# # Part 4: Preliminary analysis of the network of Computational Social Scientists
# We begin with a preliminary analysis of the network.
#
# > **Exercise 3: Basic analysis of the network of Computational Social Scientists**
# > 1. Why do you think I want you guys to use an _undirected_ graph? Could we have used a directed graph instead?
# > 2. What is the total number of nodes in the network? What is the total number of links? What is the density of the network (that is, the total number of links over the maximum possible number of links)?
# > 3. What are the average, median, mode, minimum and maximum values of the degree? What are the average, median, mode, minimum and maximum values of the node strength? How do you interpret the results?
# > 4. List the top 5 authors by degree. What is their total number of citations?
# > 5. Look them up online. What do they work on?
# > 6. Plot the distribution of degrees, using appropriate binning. What do you observe?
# > 7. Make a scatter plot of the degree versus the "*median number of citations*" per ccs paper for all authors. Use logarithmic axes where appropriate. Compute the [Spearman correlation](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) between the two.
# > 8. Bin your degrees using the bins from point 6, and compute the 25th, 50th, and 75th percentiles of the median number of citations in each bin. Add the result to your figure as a line plot with errorbars (the median value is the line plot, and the 25th and 75th percentiles are the errorbars).
# > 9. Why do you think I wanted you guys to use the Spearman correlation (instead of the usual Pearson correlation)?
# > 10. Comment on your results. Do you observe any relation? If yes, what could be the underlying reason, and how could you further explore possible reasons? If not, why do you think that is the case?

# # Your feedback
# I hope you enjoyed today's class. It would be awesome if you could spend a few minutes to share your feedback.
# **Go to [DTU Learn](https://learn.inside.dtu.dk/d2l/home/145262) and fill in the survey "_Week 4 - Feedback_".**

# In[ ]:
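
# Looking back at Exercises 2 and 3, the weighted edgelist and the first summary numbers can be sketched with the standard library alone. This is a rough sketch under assumptions: `paper_authors` is a toy stand-in for the authorIds column of *ccs_papers*, and all ids are made up.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, median

# Toy stand-in for the authorIds column of ccs_papers (hypothetical ids).
paper_authors = [["a1", "a2", "a3"], ["a2", "a1"], ["a3", "a4"], ["a5"]]

# Sorting each author list makes (a1, a2) and (a2, a1) the same key,
# so every pair of co-authors appears only once in the edgelist.
pair_counts = Counter(
    pair
    for authors in paper_authors
    for pair in combinations(sorted(set(authors)), 2)
)
edgelist = [(u, v, w) for (u, v), w in pair_counts.items()]

# Degree = number of distinct co-authors; strength = sum of link weights.
degree, strength = Counter(), Counter()
for u, v, w in edgelist:
    for node in (u, v):
        degree[node] += 1
        strength[node] += w

# Authors with only single-author papers (a5 here) produce no link,
# so they do not appear among the nodes counted below.
n_nodes, n_links = len(degree), len(edgelist)
density = n_links / (n_nodes * (n_nodes - 1) / 2)

print(sorted(edgelist))
print(n_nodes, n_links, round(density, 3))
print(mean(degree.values()), median(degree.values()), max(strength.values()))
```

# The tuples in `edgelist` have the three-entry shape that `add_weighted_edges_from` expects, so they could be passed to a networkx `Graph` directly; `nx.density` on that graph is a useful cross-check of the hand-computed value.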