#!/usr/bin/env python # coding: utf-8 # # Lab 2: Pandas, Data Management, Transformation and Simple Visualisation # ## Learning goals # # - Import data sets with Pandas # - How to transform variables and create new variables in Pandas. # - Understand the log transformation in statistics. # - Understand what normalizeing a variable is and why we do it. # - Creating summary statistics in Pandas. # - Creating simple visualisations with built-in pandas functions. # # ## Literature # # PDS [Ch. 3](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html): # [Pandas Objects](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html), # [Indexing and selection](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html) # [Operating on data](https://jakevdp.github.io/PythonDataScienceHandbook/03.03-operations-in-pandas.html), # [Missing data](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html) # # What is Pandas? # # Numpy arrays provide a powerful and flexible way of handling and manipulating data. But it is not particularily user-friendly, at least compared to the built-in systems in R. In R, data management is built around an object called a **data frame** which gives you the ability to store data within R in the intuitive way you are used to from a spreadsheet: with column and row variables and where you can easily select a variable by name. # # R also provides simple tools for importing formatted data (did you have some frustrations trying to import data directly into a Numpy array in the assignment on lab 1?) # # Pandas recreates the *data frame* object for Python, and some of the related tools for importing, cleaning, manipulating and summarizing data--all built on top of the powerful Numpy array. # # Once you have Pandas up and running, you have turned Python into a full-featured data management and analysis platform. # We start by importing the packages we wish to use # In[1]: import pandas as pd import numpy as np import matplotlib.pyplot as plt # I again also want to set some plot settings to make them look better, don't worry too much about these settings, everything works fine without them. # In[2]: from cycler import cycler plt.rcParams['xtick.labelsize'] = 12 plt.rcParams['ytick.labelsize'] = 12 plt.rcParams["axes.labelsize"]= 12 plt.rcParams["figure.facecolor"] = "#f2f2f2" #plt.rcParams['figure.savefig.dpi'] = 100 plt.rcParams['savefig.edgecolor'] = "#f2f2f2" plt.rcParams['savefig.facecolor'] ="#f2f2f2" plt.rcParams["figure.figsize"] = [16,10] plt.rcParams['savefig.bbox'] = "tight" plt.rcParams['font.size'] = 14 greens = ['#66c2a4','#41ae76','#238b45','#006d2c','#00441b'] multi =['#66c2a4','#1f78b4','#a6cee3','#b2df8a','#33a02c','#fb9a99','#e31a1c','#fdbf6f'] plt.rcParams["axes.prop_cycle"] = cycler(color=multi) # To demonstrate the use of Pandas we will make use of a real-life, big, messy dataset: [The Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/?lang=en) or PWT for short. # # A users guide can be found [here](https://www.rug.nl/ggdc/docs/pwt_80_user_guide.pdf) # # A definition of the variables can be found [here](https://jmaurit.github.io/anv_statistikk/data/pwtDefs.xlsx) # # PWT contains macroeconomic data on 183 countries from 1950 to 2019. # # This is a so called **panel data** - that follows many individuals (countries) over several time periods (years). The name *Pandas* comes from *PANel DAta Sets*, so handling such data is literally what Pandas was made for. # # # ## Loading in data with Pandas # Let's load in our data, which I have saved a copy of on my website. For a csv file it is super easy: # In[3]: pwt = pd.read_csv("http://jmaurit.github.io/anv_statistikk/data/pwt100.csv", sep=";", decimal=",") # In[4]: pwt # csv stands for *comma seperated file*, that is, the data columns are typically separated by comma. # # Except in this file the data is seperated by a semicolon (you can easily check by viewing your data file in a text editor like textEdit on a mac). So I needed to specify that the seperator is a ; by writing **sep=";"** # Another common issue working with Norwegian or European data is that a "," is used for the decimal place (3,4 instead of 3.4 as in the US/UK). Therefor we specify **decimal=","**. # # # # We are not limited to csv files. Pandas can easily read in data in many other formats including directly from excel files. See the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) for more information. # ## Working with a Pandas object # # Let's take a look at our data. We can see a snippet of our data by simply writing # In[16]: pwt # This shows the first and last 5 rows of our data. After you have imported a dataset, you should ALWAYS inspect it to see that everything was imported correctly and it looks like you expect it to look. This will save you a lot of time. # # In a Python data frame each columns/variable consists of a **Series**, that acts a lot like a Numpy array. # Notice a few things: # # - Pandas has automatically recognized the first row in the data as a header that contains the column/variable names. # - Pandas has also inserted row numbers that index the data set. # - When there is missing data, it is indicated by **NaN** # - Not all of the columns are shown in the notebook. I would like to see all of columns, so I change the default settings: # # In[17]: pd.set_option('max_columns', None) # In[18]: pwt # We could also see the column names by writing # In[30]: pwt.columns # ### Selecting and Indexing a Data Frame # Let's say we want to select the column for real GDP (expenditure side - don't worry if you don't understand what that means), which has the column name gdpe in PWT: # In[19]: pwt["rgdpe"] # It is exactly the same to do the more Pythonic: # In[20]: pwt.rgdpe # This is one reason it is nice to have column names without spaces or special characters # ### Creating a new variable # # We can also easily create a new variable. For example, let us create a real GDP per person variable: # In[21]: pwt["rgdpe_per_pers"] = pwt["rgdpe"]/pwt["pop"] # In[22]: pwt # Now we have created a new column/variable called "rgdpe_per_pers" which is read GDP divided by the population ("pop"). # # Was Norway poor until we found oil? # # Now we are ready to answer our first non-trivial question using data. Oil was discovered on the Norwegian North Sea on Christmas Eve of 1969. The story goes that the transformation of Norway into Western Europe's largest petroleum producer catapulted the country from being a relatively poor country compared to Scandinavian neighbors to being the richest of its neighbors. Is this true? # ## Selecting and slicing data # Since we are interested in only a few countries, lets select that data we want and place it in a new object. We want data from Norway as well as Sweden and Denmark for comparison. # # There are several ways of selecting and "slicing" data in Pandas. # # If we wanted to select by a numerical index - say the first 3 rows and first 2 columns we would use the **.iloc** attribute: # In[23]: pwt.iloc[:3, :2] # Or we can use the **.loc** to select by the index and column names # In[24]: pwt.loc[:, "country":"year"] # Here the **:** in the row position indicates that I want all rows, while in the column position I select all the columns between and including "country" and "year" # If I wanted two specific columns I could write # In[25]: pwt.loc[:, ["country", "year"]] # What we want are all the rows for Norway, Sweden and Denmark, and we would like the columns "country", "year", and "rgdpe". If we wanted to just get Norwegian data we could write # In[29]: NorwayData = pwt.loc[pwt.country=="Norway",["country", "year", "rgdpe_per_pers"]] NorwayData # Notice what is being done in the above code. I am making a comparison of whether the column pwt.country is equal to (==) "Norway". This will return a True or False and we are selecting all the True rows - that is where pwt.country=="Norway". While in the column place, I am saying I want all the columns # I could also make use of the built-in plotting funcitonality in Pandas (building on matplotlib) # In[30]: NorwayData.plot(x="year", y="rgdpe_per_pers") # Here we see that the plot function appears as an attribute of the Pandas data frame object (NorwayData), and we can specify the columns we want to represent the x-axis ("year") and y-axis ("rgdpe_per_pers") # We could also display the data as a barchart # In[31]: NorwayData.plot(x="year", y="rgdpe_per_pers", kind="bar") # Looking at the individual real GDP series, we see a steady growth through the 1960s and 1970s, but no distinct change after oil was found in 1969. There was a slow-down in the late 1980's into the early 1990's. This was a period with low oil prices and a financial crisis in Norway and other scandinavian countries. Real GDP per person (expenditure) seems to increase markedly in the late 1990's and through 2008, after which there is some stagnation. # You can read more about the built-in visualisation functions in the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) # But I want to get data for Norway, Denmark and Sweden, then I will use the isin() function: # In[32]: scanCountries = ["Norway", "Sweden", "Denmark"] scanData = pwt.loc[pwt.country.isin(scanCountries),["country", "year", "rgdpe_per_pers"]] scanData # Now we have the data we want # ## Simple plotting with Pandas # In[33]: fig, ax = plt.subplots() scanData.loc[scanData.country=="Norway"].plot(x="year", y="rgdpe_per_pers", color="red", ax=ax, label="Norway") scanData.loc[scanData.country=="Denmark"].plot(x="year", y="rgdpe_per_pers", color="blue", ax=ax, label="Sweden") scanData.loc[scanData.country=="Sweden"].plot(x="year", y="rgdpe_per_pers", color="Green", ax=ax, label="Denmark") plt.show() # A common transformation for long economic time series data is to log the series. We will spend more time on explaining log-transformations, but for now, it suffices to say that transforming a long series that shows growth can tell us something about the rate-of-growth of the series over time. Let's try doing a log-transformation with the np.log() function # In[35]: scanData["log_rgdpe_per_pers"] = np.log(scanData.rgdpe_per_pers) # In[37]: fig, ax = plt.subplots() scanData.loc[scanData.country=="Norway"].plot(x="year", y="log_rgdpe_per_pers", color="red", ax=ax, label="Norway") scanData.loc[scanData.country=="Denmark"].plot(x="year", y="log_rgdpe_per_pers", color="blue", ax=ax, label="Sweden") scanData.loc[scanData.country=="Sweden"].plot(x="year", y="log_rgdpe_per_pers", color="Green", ax=ax, label="Denmark") plt.show() # With the log transformation we can see trends in rate-of-change/growth more easily. We might seem to notice that while the slope of the lines for Denmark and Sweden follow each closely from about the mid-90s to mid-2000s, the growth rate in Norway has been considerably stronger in this period. # # # # Assignment # ### 1. In the lab we looked at real GDP expenditure. Now do an analysis with output-side real GDP. # # Roughly, expenditure-side GDP measures the consumption of an economy, while the output-side GDP measures the productive capacity of an economy. In most situations these two sides should be roughly similar, but they can and do diverge. # # * Create a new variable called "rgdpo_per_pers" which divides output-side real gdp ("rdgpo") by population ("pop"). # # * Extract a dataframe with Norwegian, Swedish and Danish data and create a chart of the rgdpo_per_pers comparing the three countries. # # * Also do a log transformation and plot. # # * How does this measure of GDP differ from the one above? Can you explain why? # ### 2. Creating fake data. # # You might think that statisticians would frown upon faking data (and indeed, in certain contexts they do), but it can be a useful tool in understanding models - also in this course. In this exercise you are going to create your own fake data set. # # * We have a relationship we wish to model which we think has the following form: # # $$y_i = \beta_1*x_{1i} + \beta_2*x_{2i} + e_i$$ # # Perhaps $y_i$ is the test score on a standardized exam which is a linear function of how much a student, i, studies, $x_{1i}$, and how high their IQ is, $x_{2i}$, plus some random component $e_i$. # # * Generate NP arrays for $x_{1i}$ and $x_{2i}$. Generate data for N=1000 students, assuming that both $x_{1i}$ and $x_{2i}$ come from uniform distribution (see lab 1) between 0 and 1. # # * Generate the series $e_i$ which is from a standard normal distribution - mean 0, standard deviation 1. # # * Generate y_i series by letting $\beta_1 = 0.5$ and $\beta_2 = -0.2$ and computing based on the formula above. # # * Create a Pandas data frame with the 4 NP arrays you have created (See [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)). # # * Use the Pandas function: [pandas.DataFrame.corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) to generate correlation coefficients between your variables. How do these estimates compare to the $\beta_1$ and $\beta_2$ values you chose # ### 3. Open-ended assignment: # # Find an interesting data set or series, and load it in as as a Pandas data frame or series. Do some initial transformations or cleaning if necessary. Plot the series or multiple series. Try to explain some of the patterns you see. # # You can and should see this as the first step of starting your course project. # # References # # - Feenstra, Robert C., Robert Inklaar and Marcel P. Timmer (2015), "The Next Generation of the Penn World Table" American Economic Review, 105(10), 3150-3182, available for download at www.ggdc.net/pwt