Please read the assignment overview page carefully before proceeding. It contains information about formatting (including accepted file formats), group sizes, and many other aspects of handing in the assignment.
If you fail to follow these simple instructions, it will negatively impact your grade!
Due date and time: The assignment is due on Feb 28th at 23:55. Hand in your Jupyter notebook file (with extension .ipynb) via DTU Learn (Assignment, Assignment 1).
Remember to include in the first cell of your notebook:
Gather the list of researchers who joined the most important scientific conference in Computational Social Science (IC2S2) in 2019.
You can find the programmes of the 2019 edition at the links below:
Oral presentations: https://2019.ic2s2.org/oral-presentations/
Poster presentations: https://2019.ic2s2.org/posters/
Consider the list of author ids you found in Week 2, Part 3, first exercise. For each author, use the Academic Graph API to find:
- their aliases
- their name
- their papers, where for each paper we want to retain:
- title
- abstract
- the year of publication
- the externalIds (papers carry universal identifiers, such as the DOI, that we can use across platforms)
- the s2FieldsOfStudy (the fields of study)
- the citationCount (the number of times the paper has been cited)
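A minimal sketch of one way to query these fields, using only the standard library. The endpoint and field names follow the public Semantic Scholar Academic Graph documentation; check the docs for rate limits and pagination (for prolific authors you may need the paginated `/author/{id}/papers` endpoint instead), and the helper names below are my own.

```python
import json
import time
import urllib.request

API = "https://api.semanticscholar.org/graph/v1/author/{}"
FIELDS = ",".join([
    "name", "aliases",
    "papers.title", "papers.abstract", "papers.year",
    "papers.externalIds", "papers.s2FieldsOfStudy", "papers.citationCount",
])

def author_url(author_id: str) -> str:
    """Build the request URL asking for all the fields listed above."""
    return API.format(author_id) + "?fields=" + FIELDS

def fetch_author(author_id: str) -> dict:
    """Download one author's record as a dict; sleep to respect rate limits."""
    with urllib.request.urlopen(author_url(author_id)) as resp:
        record = json.load(resp)
    time.sleep(1)  # the unauthenticated API is rate limited
    return record
```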
Create three dataframes to store the data you have collected.
- Author dataset: in the author dataset, one row is one unique author, and each row contains the following information:
- authorId: (str) the id of the author
- name: (str) the name of the author
- aliases: (list) the aliases of the author
- citationCount: (int) the total number of citations received by an author
- field: (str) the s2FieldsOfStudy category that occurs most often across an author's papers (you should first obtain the category for each s2FieldsOfStudy entry)
- Paper dataset: in the paper dataset, one row is one unique paper, and each row contains the following information:
- paperId: (str) the id of the paper
- title: (str) the title of the paper
- year: (int) the year of publication
- externalId.DOI: (str) the DOI of the paper
- citationCount: (int) the number of citations
- fields: (list) the fields included in the paper (you should first obtain the category for each s2FieldsOfStudy)
- authorIds: (list) this is a list of author Ids, including all the authors of this paper that are in our author dataset
- Paper abstract dataset: in the paper abstract dataset, one row is one unique paper, and each row contains the following information:
- paperId: (str) the id of the paper
- abstract: (str) the abstract of the paper
(Note: we keep the abstract separate to keep the size of files more manageable)
(Note: If you did not manage to get all the years or all the authors' collaborators, you can still follow the exercise. Just remember to clarify your starting point.)
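One possible sketch of how the author dataset could be assembled with pandas, assuming the records follow the Academic Graph API schema (authorId, name, aliases, papers); the helper names and the "sum over papers" reading of total citations are my own choices:

```python
from collections import Counter

import pandas as pd

def top_category(papers):
    """Most frequent s2FieldsOfStudy category across a list of paper dicts."""
    cats = [f["category"]
            for p in papers
            for f in (p.get("s2FieldsOfStudy") or [])]
    return Counter(cats).most_common(1)[0][0] if cats else None

def build_author_df(records):
    """records: author dicts with authorId, name, aliases and a papers list."""
    rows = [{
        "authorId": r["authorId"],
        "name": r["name"],
        "aliases": r.get("aliases") or [],
        # one reading of "total citations": sum over the author's papers
        "citationCount": sum(p.get("citationCount") or 0 for p in r["papers"]),
        "field": top_category(r["papers"]),
    } for r in records]
    return pd.DataFrame(rows)
```

The paper and abstract dataframes follow the same pattern, with one row per unique paperId.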
As we have discussed in the lecture, one impact of heavy tails is that sample averages can be poor estimators of the underlying mean of the distribution. To understand this point better, recall the Law of Large Numbers. Consider a sample of IID variables $ X_1, \ldots, X_n $ drawn from the same distribution $ F $ with finite first moment $ \mathbb E |X_i| < \infty $ and mean $ \mathbb E X_i = \int x \, F(dx) = \mu $.
According to the law, the mean of the sample $ \bar X_n := \frac{1}{n} \sum_{i=1}^n X_i $ satisfies $$ \bar X_n \to \mu \text{ as } n \to \infty $$
This tells us that, given a large enough sample, the sample mean converges to the population mean.
The condition that $ \mathbb E | X_i | $ is finite holds in most cases but can fail if the distribution $ F $ is very heavy tailed. Further, even when $ \mathbb E | X_i | $ is finite, the variance of a heavy tailed distribution can be so large that the sample mean will converge very slowly to the population mean. We will look into this in the following exercise.
- Sample N=10,000 data points from a Gaussian Distribution with parameters $\mu = 0 $ and $\sigma = 4$, using the `np.random.standard_normal()` function. Store your data in a numpy array $\mathbf{X}$.
- Create a figure.
- Plot the distribution of the data in $\mathbf{X}$.
- Compute the cumulative average of $\mathbf{X}$ (you achieve this by computing $average(\{\mathbf{X}[0],\ldots, \mathbf{X}[i-1]\})$ for each sample size $i \in \{1, \ldots, N\}$). Store the result in an array.
- In a similar way, compute the cumulative standard error of $\mathbf{X}$. Note: the standard error of a sample is defined as $ \sigma_{M} = \frac{\sigma}{\sqrt{n}} $, where $\sigma$ is the sample standard deviation and $n$ is the sample size. Store the result in an array.
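Both running quantities can be computed in a vectorized way; a sketch (the seed is arbitrary, and the running variance uses the usual identity $s_n^2 = (\sum x^2 - n\bar x^2)/(n-1)$):

```python
import numpy as np

rng = np.random.default_rng(0)           # seed only for reproducibility
X = 4 * rng.standard_normal(10_000)      # mu = 0, sigma = 4

n = np.arange(1, X.size + 1)
cum_avg = np.cumsum(X) / n               # average of X[0..i-1] for each i

# running sample variance via the sum-of-squares identity, then SE = s / sqrt(n)
cum_sq = np.cumsum(X ** 2)
cum_var = (cum_sq - n * cum_avg ** 2) / np.maximum(n - 1, 1)
cum_se = np.sqrt(cum_var) / np.sqrt(n)
```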
- Compute the values of the distribution mean and median using the formulas you can find on the Wikipedia page of the Gaussian Distribution
- Create a figure.
- Plot the cumulative average computed in point 3. as a line plot (where the x-axis represents the size of the sample considered, and the y-axis is the average).
- Add errorbars to each point in the graph with width equal to the standard error of the mean (the one you computed in point 4).
- Add a horizontal line corresponding to the distribution mean (the one you found in point 5).
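A possible skeleton for this figure, assuming matplotlib (the `Agg` backend line is only needed when running outside a notebook, and the error bars are thinned so the plot stays readable):

```python
import matplotlib
matplotlib.use("Agg")                    # headless backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
X = 4 * rng.standard_normal(10_000)
n = np.arange(1, X.size + 1)
cum_avg = np.cumsum(X) / n

# draw an error bar every 250 points instead of at all 10,000
idx = np.arange(250, X.size + 1, 250)
se = np.array([X[:i].std(ddof=1) / np.sqrt(i) for i in idx])

fig, ax = plt.subplots()
ax.errorbar(idx, cum_avg[idx - 1], yerr=se, fmt=".-", capsize=2,
            label="cumulative average")
ax.axhline(0.0, color="red", linestyle="--", label="distribution mean")
ax.set_xlabel("sample size")
ax.set_ylabel("cumulative average")
ax.legend()
fig.savefig("cumulative_average.png")
```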
- Compute the cumulative median of $\mathbf{X}$ (you achieve this by computing $median(\{\mathbf{X}[0],\ldots, \mathbf{X}[i-1]\})$ for each sample size $i \in \{1, \ldots, N\}$). Store the result in an array.
- Create a figure.
- Plot the cumulative median computed in point 7. as a line plot (where the x-axis represents the size of the sample considered, and the y-axis is the median).
- Add a horizontal line corresponding to the distribution median (the one you found in point 5).
- Optional: Add errorbars to your median line graph, with width equal to the standard error of the median. You can compute the standard error of the median via bootstrapping.
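For the optional part, the bootstrap standard error of the median can be sketched as follows (the function name is my own):

```python
import numpy as np

def bootstrap_se_median(x, n_boot=1000, rng=None):
    """SE of the median by bootstrapping: resample with replacement
    n_boot times, take each resample's median, and return the
    standard deviation of those medians."""
    rng = rng if rng is not None else np.random.default_rng()
    medians = np.array([
        np.median(rng.choice(x, size=x.size, replace=True))
        for _ in range(n_boot)
    ])
    return medians.std(ddof=1)
```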
- Now sample N = 10,000 data points from a Pareto Distribution with parameters $x_m=1$ and $\alpha=0.5$ using the `np.random.pareto()` function, and store it in a numpy array. (Optional: write your own function to sample from a Pareto distribution using the Inverse Transform Sampling method.)
- Repeat points 2 to 8 for the Pareto Distribution sample computed in point 9.
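One pitfall worth flagging: `np.random.pareto` draws from the shifted (Lomax) form, so the sample must be shifted and scaled to obtain a classical Pareto with minimum $x_m$. The inverse-transform version follows from inverting the CDF $F(x) = 1 - (x_m/x)^\alpha$; a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x_m, alpha, N = 1.0, 0.5, 10_000

# np.random.pareto samples the Lomax distribution; add 1 and scale by x_m
X = x_m * (1 + rng.pareto(alpha, N))

# inverse transform sampling: solve u = 1 - (x_m / x)**alpha for x
U = rng.uniform(size=N)
X_it = x_m * (1 - U) ** (-1.0 / alpha)
```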
- Now sample N = 10,000 data points from a Lognormal Distribution with parameters $\mu=0$ and $\sigma=4$ using the `np.random.lognormal()` function (equivalently, exponentiate a Gaussian sample with the same parameters), and store it in a numpy array.
- Repeat points 2 to 8 for the Lognormal Distribution sample computed in point 11.
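Since a lognormal variable is the exponential of a Gaussian one, either route below should give the same distribution (seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 0.0, 4.0, 10_000

# direct sampling
X = rng.lognormal(mean=mu, sigma=sigma, size=N)

# equivalent construction from a standard normal sample
X_alt = np.exp(mu + sigma * rng.standard_normal(N))
```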
- Now, consider the array collecting the citations of papers from 2009 that you created in Week 3, Exercise 2, point 1. First, compute the mean and median number of citations for this population. Then, extract a random sample of N=10,000 papers.
- Repeat points 2,3,4,6,7 and 8 above for the paper citation sample prepared in point 13.
Answer the following questions:
(Hint: I suggest you plot the graphs above multiple times for different random samples, to get a better understanding of what is going on)
- Compare the evolution of the cumulative average for the Gaussian, Pareto and LogNormal distribution. What do you observe? Would you expect these results? Why?
- Compare the cumulative median vs the cumulative average for the three distributions. What do you observe? Can you draw any conclusions regarding which statistic (the mean or the median) is more useful in the different cases?
- Consider the plots you made using the citation count data in point 14. What do you observe? What are the implications?
- What do you think are the main take-home messages of this exercise?