#!/usr/bin/env python
# coding: utf-8

# # Overview
# 
# Here is the plan for today.
# 
# __Part 1:__ In the first part of the class we will learn a few things about __Data Visualization__, an important aspect of Computational Social Science. Why is it so important to make nice plots if we can use stats and modelling? I hope I will convince you that it is _very_ important to make meaningful visualizations.
# 
# __Part 2:__ In the second part of the class we will learn about __Visualizing Distributions__. Plotting histograms is something you are probably very familiar with at this point, but there are a few things that you should know about how to visualize distributions when the data you have is highly heterogeneous.
# 
# __Part 3:__ In the final part of the class, we will learn about __Heavy tailed distributions__. Heavy tailed distributions are ubiquitous in Computational Social Science, so it is really important to understand what they are and how to study them.
# 
# 
# ## Part 1: Intro to visualization

# Start by watching this short introduction video to Data Visualization.
# 
# > * _Video Lecture_: [Intro to Data Visualization](https://www.youtube.com/watch?v=oLSdlg3PUO0)

# In[1]:

from IPython.display import YouTubeVideo
YouTubeVideo("oLSdlg3PUO0", width=800, height=450)

# Ok, but is data visualization really so necessary? Let's see if I can convince you of that with this little visualization exercise.
# 
# > __Exercise 1: Visualization vs stats__
# > 
# > Start by downloading these four datasets: [Data 1](https://raw.githubusercontent.com/suneman/socialdataanalysis2020/master/files/data1.tsv), [Data 2](https://raw.githubusercontent.com/suneman/socialdataanalysis2020/master/files/data2.tsv), [Data 3](https://raw.githubusercontent.com/suneman/socialdataanalysis2020/master/files/data3.tsv), and [Data 4](https://raw.githubusercontent.com/suneman/socialdataanalysis2020/master/files/data4.tsv). The format is `.tsv`, which stands for _tab separated values_.
# > Each file has two columns (separated using the tab character). The first column is $x$-values, and the second column is $y$-values.
# > 
# > * Using the `numpy` function `mean`, calculate the mean of both $x$-values and $y$-values for each dataset.
# > * Use python string formatting to print precisely two decimal places of these results to the output cell. Check out [this _stackoverflow_ page](http://stackoverflow.com/questions/8885663/how-to-format-a-floating-number-to-fixed-width-in-python) for help with the string formatting.
# > * Now calculate the variance for all of the various sets of $x$- and $y$-values (to three decimal places).
# > * Use [`scipy.stats.pearsonr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) to calculate the [Pearson correlation](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) between $x$- and $y$-values for all four data sets (also to three decimal places).
# > * The next step is to use _linear regression_ to fit a straight line $f(x) = a x + b$ through each dataset and report $a$ and $b$ (to two decimal places). An easy way to fit a straight line in Python is using `scipy`'s `linregress`. It works like this:
# > ```
# > from scipy import stats
# > slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
# > ```
# > * Finally, it's time to plot the four datasets using `matplotlib.pyplot`. Use a two-by-two [`subplot`](http://matplotlib.org/examples/pylab_examples/subplot_demo.html) to put all of the plots nicely in a grid, and use the same $x$ and $y$ range for all four plots. Also include the linear fit in all four plots. (To get a sense of what I think the plot should look like, you can take a look at my version [here](https://raw.githubusercontent.com/suneman/socialdataanalysis2017/master/files/anscombe.png).) A minimal sketch of these plotting mechanics is shown right after the exercise.
# > * Explain - in your own words - what you think my point with this exercise is.
# 
# 
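# Below is a minimal sketch of the mechanics the exercise asks for (loading one `.tsv` file, computing the summary statistics, fitting a line with `linregress`, and arranging the panels in a two-by-two grid). It is only a starting point, not the full solution: the file paths and the styling are placeholders for you to adapt.

# In[ ]:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Placeholder paths -- point these at wherever you saved the four .tsv files
paths = ["data1.tsv", "data2.tsv", "data3.tsv", "data4.tsv"]

fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True, sharey=True)

for ax, path in zip(axes.flatten(), paths):
    # Each file has two tab-separated columns: x-values and y-values
    x, y = np.loadtxt(path, delimiter="\t", unpack=True)

    # Summary statistics, printed with fixed numbers of decimals
    r, _ = stats.pearsonr(x, y)
    print(f"{path}: mean_x={np.mean(x):.2f}, mean_y={np.mean(y):.2f}, "
          f"var_x={np.var(x):.3f}, var_y={np.var(y):.3f}, pearson_r={r:.3f}")

    # Linear fit f(x) = a*x + b
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

    # Scatter the data and overlay the fitted line
    ax.scatter(x, y)
    xs = np.linspace(x.min(), x.max(), 100)
    ax.plot(xs, slope * xs + intercept)
    ax.set_title(path)

plt.show()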
# Get more insight into the ideas behind this exercise by reading [here](https://en.wikipedia.org/wiki/Anscombe%27s_quartet).
# 
# And the [video below](https://www.youtube.com/watch?v=DbJyPELmhJc) generalizes in the coolest way imaginable. It's a treat, but don't watch it until **after** you've done the exercises.
# 

# In[179]:

from IPython.display import YouTubeVideo
YouTubeVideo("DbJyPELmhJc", width=800, height=450)

# ## Prelude to Part 2: Some tips to make nicer figures.

# Before we even start visualizing some cool data, I just want to give a few tips for making nice plots in matplotlib. Unless you feel like you are already a pro-visualizer, these should be pretty useful for making your plots look much nicer.
# Paying attention to details can make an incredible difference when we present our work to others.
# 
# **Note**: there are many Python libraries to make visualizations. I am a huge fan of matplotlib, which is one of the most widely used ones, so this is what we will use for this class.

# > *Video Lecture*: [How to improve your plots](https://www.youtube.com/watch?v=sdszHGaP_ag)

# In[3]:

from IPython.display import YouTubeVideo
YouTubeVideo("sdszHGaP_ag", width=800, height=450)

# ## Part 2: Plotting distributions

# As you have probably discovered in the exercise above, using summary statistics (mean, median, standard deviations) to capture the properties of your dataset can sometimes be misleading.
# It is _very good practice_, whenever you have a dataset at hand, to start by plotting the _distribution_ of the data.
# There is a lot we can learn about our data just by looking at the probability distribution of data points.
# 
# Very often, when it comes to real-world datasets, the data span several orders of magnitude.
# In these cases, plotting probability distributions as histograms in the usual way does not work very well.
# But there are a couple of tricks you can apply in these instances to visualize the distribution.
# 
# In the video-lecture below, I guide you through plotting histograms for very heterogeneous datasets.
# The example I go through is based on two datasets: (i) financial data describing the prices and returns of a stock; and (ii) a dataset describing the number of comments posted by a set of Reddit users.
# In fact, the datasets I use don't really matter. You can use the same techniques to plot any data.
# 
# > *Video Lecture*: [Plotting histograms and distributions](https://www.youtube.com/watch?v=UpwEsguMtY4)

# In[186]:

YouTubeVideo("UpwEsguMtY4", width=800, height=450)
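# Before you dive into the exercise below, here is a small sketch of the two binning strategies discussed in the video, applied to synthetic data that spans several orders of magnitude (a log-normal sample is used here purely as a stand-in for real data, and the parameter values are arbitrary). The ingredients are exactly the ones you will need below: bin edges from `np.linspace` or `np.logspace`, and `density=True` in `numpy.histogram`.

# In[ ]:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for a heterogeneous dataset (spans several orders of magnitude)
data = np.random.lognormal(mean=0, sigma=2, size=10_000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Linear binning: equally spaced bin edges
lin_bins = np.linspace(data.min(), data.max(), 30)
density, edges = np.histogram(data, bins=lin_bins, density=True)
centers = (edges[:-1] + edges[1:]) / 2
ax1.plot(centers, density, marker="o")
ax1.set_title("Linear bins")

# Logarithmic binning: bin edges equally spaced in log-space
log_bins = np.logspace(np.log10(data.min()), np.log10(data.max()), 30)
density, edges = np.histogram(data, bins=log_bins, density=True)
centers = np.sqrt(edges[:-1] * edges[1:])  # geometric mean of the edges
ax2.plot(centers, density, marker="o")
ax2.set_xscale("log")
ax2.set_yscale("log")
ax2.set_title("Logarithmic bins (log-log axes)")

plt.show()

# On log-log axes, the logarithmically binned version usually gives a much cleaner picture of the tail of a heterogeneous dataset.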
# > __Exercise 2: Paper citations__. Now, we will put things into practice by plotting the distribution of citations per paper. Consider the _Paper dataset_ you prepared in Week 2, Exercise 3.
# 
# > 1. Select all the papers published in 2009, and store their "citationCount" in an array. __Note:__ You should have an array of a few hundred data points (if you don't, come and talk to me).
# > 2. Make a histogram of the number of citations per paper, using the function [``numpy.histogram``](https://numpy.org/doc/stable/reference/generated/numpy.histogram.html). Here are some important points on histograms (they should already be quite clear if you have watched the video above):
# >    * __Binning__: By default numpy makes 10 equally spaced bins, but you always have to customize the binning. The number and size of bins you choose for your histograms can completely change the visualization. If you use too few bins, the histogram doesn't portray the data well. If you have too many, you get a broken comb look. Unfortunately, there is no "best" number of bins, because different bin sizes can reveal different features of the data. Play a bit with the binning to find a suitable number of bins. Define a vector $\nu$ containing the desired bin edges and then pass it to ``numpy.histogram`` by specifying _bins=$\nu$_ as an argument of the function. You always have at least two options:
# >        * _Linear binning_: Use linear binning when the data is not heavy tailed, by using ``np.linspace`` to define the bins.
# >        * _Logarithmic binning_: Use logarithmic binning, by using ``np.logspace`` to define your bins.
# >    * __Normalization__: To plot [probability densities](https://en.wikipedia.org/wiki/Probability_density_function), you can set the argument _density=True_ of the ``numpy.histogram`` function.
# > 3. Have you used _Logarithmic_ or _Linear_ binning in this case? Justify your choice.
# > 4. Why do you think I wanted you to use only papers published in 2009?
# > 5. Compute the mean and the median value of the number of citations per paper and plot them as vertical lines on top of your histogram. What do you observe? Which value do you think is more meaningful?
# 
# 
# ## Part 3: Heavy tailed distributions

# When it comes to real-world data, it is very common to observe distributions that are so-called "heavy tailed". In this section, we will explore this concept in a bit more detail.
# We will start by watching a video-lecture by me and reading a great paper.

# > *Reading:* [Power laws, Pareto distributions and Zipf’s law](https://www.cs.cornell.edu/courses/cs6241/2019sp/readings/Newman-2005-distributions.pdf). Read the introduction and skim through the rest of the article.

# > *Video Lecture*: [Heavy tailed distributions](https://www.youtube.com/watch?v=S2OZBTKx8_E)
# 

# In[187]:

YouTubeVideo("S2OZBTKx8_E", width=800, height=450)

# As we have discussed in the lecture, one impact of heavy tails is that sample averages can be poor estimators of the underlying mean of the distribution.
# To understand this point better, recall [the Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers). Consider a sample of IID variables $ X_1, \ldots, X_n $ from the same distribution $ F $ with finite expected value $ \mathbb{E}[X_i] = \int x F(dx) = \mu $.
# 
# According to the law, the mean of the sample $ \bar X_n := \frac{1}{n} \sum_{i=1}^n X_i $ satisfies
# 
# $$
# \bar X_n \to \mu \text{ as } n \to \infty
# $$
# 
# This basically tells us that if we have a large enough sample, the sample mean will converge to the population mean.
# 
# The condition that $ \mathbb E | X_i | $ is finite holds in most cases but can fail if the distribution $ F $ is very heavy tailed. Further, even when $ \mathbb E | X_i | $ is finite, the variance of a heavy tailed distribution can be so large that the sample mean will converge very slowly to the population mean. We will look into this in the following exercise.
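# As a quick illustration of the statement above, the minimal sketch below draws a Gaussian sample (with the same $\mu = 0$, $\sigma = 4$ used in the exercise that follows) and tracks the running (cumulative) average, which settles down around the true mean. Try swapping in a very heavy tailed sample later and see how differently the curve behaves.

# In[ ]:

import numpy as np
import matplotlib.pyplot as plt

# IID Gaussian sample with mean 0 and standard deviation 4
N = 10_000
X = 4 * np.random.standard_normal(N)

# Cumulative average: mean of the first n points, for n = 1, ..., N
cumulative_average = np.cumsum(X) / np.arange(1, N + 1)

plt.plot(cumulative_average, label="cumulative average")
plt.axhline(0, color="black", linestyle="--", label=r"true mean $\mu = 0$")
plt.xlabel("sample size $n$")
plt.ylabel(r"$\bar{X}_n$")
plt.legend()
plt.show()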
# 
# 
# >__Exercise 3: Law of large numbers__.
# >
# > 1. Sample __N=10,000__ data points from a [Gaussian Distribution](https://en.wikipedia.org/wiki/Normal_distribution) with parameters $\mu = 0$ and $\sigma = 4$, using the [`np.random.standard_normal()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.standard_normal.html) function. Store your data in a numpy array $\mathbf{X}$.
# > 2. Create a figure.
# >    - Plot the distribution of the data in $\mathbf{X}$.
# > 3. Compute the cumulative average of $\mathbf{X}$ (you achieve this by computing $average(\{\mathbf{X}[0],..., \mathbf{X}[i-1]\})$ for each index $i \in \{1, ..., N\}$). Store the result in an array.
# > 4. In a similar way, compute the cumulative standard error of $\mathbf{X}$. __Note__: the standard error of a sample is defined as $ \sigma_{M} = \frac{\sigma}{\sqrt{n}} $, where $\sigma$ is the sample standard deviation and $n$ is the sample size. Store the result in an array.
# > 5. Compute the values of the distribution mean and median using the formulas you can find on the [Wikipedia page of the Gaussian Distribution](https://en.wikipedia.org/wiki/Normal_distribution).
# > 6. Create a figure.
# >    - Plot the cumulative average computed in point 3. as a line plot (where the x-axis represents the size of the sample considered, and the y-axis is the average).
# >    - Add errorbars to each point in the graph with width equal to the standard error of the mean (the one you computed in point 4).
# >    - Add a horizontal line corresponding to the distribution mean (the one you found in point 5).
# > 7. Compute the cumulative median of $\mathbf{X}$ (you achieve this by computing $median(\{\mathbf{X}[0],..., \mathbf{X}[i-1]\})$ for each index $i \in \{1, ..., N\}$). Store the result in an array.
# > 8. Create a figure.
# >    - Plot the cumulative median computed in point 7. as a line plot (where the x-axis represents the size of the sample considered, and the y-axis is the median).
# >    - Add a horizontal line corresponding to the distribution median (the one you found in point 5).
# >    - _Optional:_ Add errorbars to your median line graph, with width equal to the standard error of the median. You can compute the standard error of the median [via bootstrapping](https://online.stat.psu.edu/stat500/book/export/html/619).
# > 9. Now sample __N = 10,000__ data points from a [Pareto Distribution](https://en.wikipedia.org/wiki/Pareto_distribution) with parameters $x_m=1$ and $\alpha=0.5$ using the [`np.random.pareto()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.pareto.html) function, and store them in a numpy array. (_Optional:_ Write yourself the function to sample from a Pareto distribution using the [_Inverse Transform Sampling method_](https://en.wikipedia.org/wiki/Inverse_transform_sampling); a minimal sketch of this approach appears right after the exercise.)
# > 10. Repeat points 2 to 8 for the Pareto Distribution sample computed in point 9.
# > 11. Now sample __N = 10,000__ data points from a [Lognormal Distribution](https://en.wikipedia.org/wiki/Log-normal_distribution) with parameters $\mu=0$ and $\sigma=4$ using the [`np.random.lognormal()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.lognormal.html) function, and store them in a numpy array.
# > 12. Repeat points 2 to 8 for the Lognormal Distribution sample computed in point 11.
# > 13. Now, consider the array collecting the citations of papers from 2009 you created in Exercise 2, point 1. First, compute the mean and median number of citations for this population. Then, extract a random sample of __N=10,000__ papers.
# > 14. Repeat points 2, 3, 4, 6, 7 and 8 above for the paper citation sample prepared in point 13.
# > 15. Answer the following questions (__Hint__: I suggest you plot the graphs above multiple times for different random samples, to get a better understanding of what is going on):
# >    - Compare the evolution of the cumulative average for the Gaussian, Pareto and Lognormal distributions. What do you observe? Would you expect these results? Why?
# >    - Compare the cumulative median vs the cumulative average for the three distributions. What do you observe? Can you draw any conclusions regarding which statistic (the mean or the median) is more useful in the different cases?
# >    - Consider the plots you made using the citation count data in point 14. What do you observe? What are the implications?
# >    - What do you think is the main take-home message of this exercise?
# >
# 
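# For the optional part of point 9, here is one possible sketch of inverse transform sampling for a Pareto variable; the helper name is arbitrary, and the comparison with numpy is only a rough sanity check. Note also that, according to the numpy documentation, `np.random.pareto(a)` draws from a Pareto II (Lomax) distribution, so the classical Pareto with scale $x_m$ is obtained as `(np.random.pareto(a, size) + 1) * x_m`.

# In[ ]:

import numpy as np

def sample_pareto(x_m, alpha, size):
    """Inverse transform sampling for a Pareto(x_m, alpha) distribution.

    The CDF is F(x) = 1 - (x_m / x)**alpha for x >= x_m, so solving
    F(x) = u gives x = x_m * (1 - u)**(-1 / alpha), with u ~ Uniform(0, 1).
    """
    u = np.random.uniform(size=size)
    return x_m * (1 - u) ** (-1 / alpha)

x_m, alpha, N = 1, 0.5, 10_000
manual = sample_pareto(x_m, alpha, N)
builtin = (np.random.pareto(alpha, N) + 1) * x_m  # classical Pareto via the Lomax sampler

# Rough sanity check: the median of a Pareto(x_m, alpha) is x_m * 2**(1/alpha)
print(np.median(manual), np.median(builtin), x_m * 2 ** (1 / alpha))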
# As we have discussed in the lecture, another property of heavy tailed data is that outliers are very frequent. We will explore this further in the following exercise.
# 
# >__Exercise 4: The big jump principle__.
# >
# > 1. Produce a sample of __N=10,000__ data points extracted from a Gaussian distribution with parameters $\mu = 0$ and $\sigma = 4$ (reuse the code from the previous exercise). Compute (i) the maximum and (ii) the sum of the values in the sample.
# > 2. Repeat point 1. for __S = 1000__ samples and store the sums and maxima in two arrays.
# > 3. Create a scatter plot, showing the sums against the maxima.
# > 4. Repeat points 1, 2, and 3 for (i) a Pareto distribution with parameters $x_m=1$ and $\alpha=0.5$; (ii) a log-normal distribution with parameters $\mu=0$ and $\sigma=4$; and (iii) data samples of size __N=10,000__ extracted from the array collecting the number of citations of papers (from Exercise 2, point 1). __Hint:__ Remember to use a logarithmic scale when the data span many orders of magnitude.
# > 5. Answer the following questions.
# >    - Compare the scatterplots obtained for the Gaussian, Pareto and Lognormal distributions. What do you observe? Would you expect that? Why?
# >    - Focus on the scatter plot obtained for the citation data. What do you observe? What are the implications?

# # Your Feedback

# I hope you enjoyed today's class. It would be awesome if you could spend a few minutes to share your feedback.

# **Go to [DTU Learn](https://learn.inside.dtu.dk/d2l/home/145262) and fill out the survey "_Week 3 - Feedback_".**