DeLong: Teaching Economics

Last edited: 2019-10-08

Introduction: Python

Due ???? via upload to ???

J. Bradford DeLong

 

You should have gotten to this point via this link: http://datahub.berkeley.edu/user-redirect/interact?account=braddelong&repo=LS2019&branch=master&path=Introduction-Python-%26-Economics-delong.ipynb

 

This introductory notebook will familiarize you with (some of) the programming tools that will be useful to you. There are many other very good resources for a (gentle) introduction to computer programming. I especially recommend Berkeley's Data 8 website: http://data8.org/fa19/.

 

Table of Contents

  1. Computing Environment: The Jupyter Notebook
    1.1. Markdown Text Cells
    1.2. Python Code Cells
    1.3. The Python Kernel
    1.4. Questions and Exercises
    1.5. Writing Jupyter Notebooks
    1.6. Errors
    1.7. Libraries Check
  2. Introduction to Python Programming Concepts
    2.1. Basics: The Kernel
    2.2. Basics: Numbers
    1. Python and OOP
    2. Pandas
    3. Visualization
    4. Building Classes

 

1. Our Computing Environment: The Jupyter Notebook

This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results.

 

1.1. Markdown Text Cells

In a notebook, each box containing text or code is called a cell.

Text cells (like this one) can be edited by double-clicking on them to make them the active cell. The formatting is then stripped, leaving an unformatted text string, and a blue or green bar appears on the left. You can then edit the text. The text in these cells is written in a simple format called Markdown. You almost surely want to learn Markdown.

After you edit a text cell, click the "run cell" button at the top that looks like ▶| to invoke the Markdown processor on the changed cell, and display the formatted version. (Try not to delete any of the instructions about what you should do.)

Notice any dollar signs in the unformatted text stream? Those tell the formatting processor that the symbols between the dollar signs make up a mathematical expression written in another not-so-simple format called LaTeX. For example:

(3.8) $ y^{*mal} = \phi y^{sub} \left( 1 + \frac{ \gamma h}{\beta}\right) $

(3.14) $ L_t^{*mal} = \left[ \left( \frac{H_t}{y^{sub}} \right) \left( \frac{s}{\delta} \right)^\theta \left( \frac{1}{\phi} \right) \left[ \frac{1}{(1+\gamma h/\delta)^\theta} \frac{1}{(1+\gamma h/\beta)} \right] \right]^\gamma $

You almost surely want to learn (some of) LaTeX as well. (Jupyter notebooks use only a small subset of LaTeX, which is a very powerful and complex programming language. It is, in fact, Turing-complete, which means that if you can write a program to compute something in any computer language on any (classical) computer, you can program it in LaTeX as well. For our purposes, full LaTeX is overkill: Markdown and the equation-processing parts of LaTeX do perfectly well.)

 

Understanding Check 1.1: This paragraph is in its own text cell. Try editing it so that this sentence is the last sentence in the paragraph, and then click the 'run cell' (▶|) button in the toolbar above. In short, you should move the previous sentence to a position after this sentence, and then click the 'run cell' (▶|) button.

 

Note: You almost surely also want, sometime, to learn something about one of the genius founders of computer science, mid-twentieth century British mathematician Alan Turing. Here are three very good resources:

  1. Charles Petzold (2008): The Annotated Turing: A Guided Tour Through Alan Turing's Historic Paper on Computability and the Turing Machine http://books.google.com/?isbn=9780470229057...
  2. Andrew Hodges (2012) Alan Turing: The Enigma https://books.google.com/?isbn=9781448137817...
  3. (2014): The Imitation Game https://en.wikipedia.org/wiki/The_Imitation_Game

 

1.2. Python Code Cells

Other cells contain code in the Python 3 language. You can switch a cell type from 'code' to 'Markdown' or back by selecting the appropriate option in the toolbar above. (Still other cells might contain 'raw' expressions, but we will not use those.)

To run the code in a code cell, first click on the cell to make it active. The cell should then be highlighted with a little green or blue bar to the left. Next, either press the 'run cell' button (▶|) in the toolbar above, or hold down the shift key and press return or enter.

Running a code cell will execute all of the code it contains, if this notebook is connected to a Python interpreter and so able to call the Python kernel.

Try running this cell:

In [1]:
print("Hello, World!")
Hello, World!

And this one:

In [2]:
print("\N{WAVING HAND SIGN}, \N{EARTH GLOBE ASIA-AUSTRALIA}!")
👋, 🌏!

The fundamental building block of Python code is an expression: something that starts at the left end of a line, and ends with a 'return' character which is not inside any unclosed left parenthesis ('('), left brace ('{'), or left bracket ('[').

Code cells can contain multiple expressions.

When you run a code cell, the successive expressions contained in the lines of code are executed in the order in which they appear. Every print expression prints a line.

Run the next cell and notice the order of the output:

In [3]:
print("First this line,")
print("then the whole 🌍")
print("and then this one.")
First this line,
then the whole 🌍
and then this one.

Understanding Check 1.2: Change the cell above so that it prints out:

First print out this line,
and then this one,
and then, finally, the whole 🌏.

 

1.3. The Python Kernel

Look at the upper right corner of this window tab. You should see an open circle, and immediately to the left the words 'Python 3'. (If it says something else, like 'R', for example, click on the word or phrase, select 'Python 3' in the popup that appears, and click 'select'.)

If the circle is open and empty, the Python kernel is idle and ready to execute code. If the circle is filled in, wait a little while. If it does not become clear, either reselect 'Python 3' or click the 'Restart Kernel' item in the 'Kernel' menu at the top of this window.

Do not be scared should you see a "Kernel Restarting" message! Your data and work will still be saved. Once you see "Kernel Ready" in a light blue box on the top right of the notebook, you'll be ready to work again.

After a kernel restart, however, you will need to rerun any cells in the notebook above the cell you are currently working on if they import programming modules, load data, or carry out calculations with variables.

Next to every code cell, you'll see some text that says "In [...]". Before you run the cell, you'll see "In [ ]". When the cell is running, you'll see In [*]. If you see an asterisk (*) next to a cell that doesn't go away, it's likely that the code inside the cell is taking too long to run, and it might be a good time to interrupt the kernel. When a cell is finished running, you'll see a number inside the brackets, like so: In [1]. The number corresponds to the order in which you run the cells; so, the first cell you run will show a 1 when it's finished running, the second will show a 2, and so on.

Sometimes your kernel seems stuck, your notebook is very slow and unresponsive, or your kernel loses its connection. If this happens, try:

  1. clicking "Kernel > Interrupt".
  2. clicking "Kernel > Restart".
  3. If that doesn't help, restart your server. Save your work by clicking "File > Save and Checkpoint". Then click "Kernel > Shutdown Kernel". Then click "Kernel > Restart Kernel". Then, navigate back to the notebook you were working on. If you do this, you will have to run your code cells from the start of your notebook up until where you paused your work.

 

1.4. Questions and Exercises

There will be some questions for you in these notebooks.

For free response questions, write your answers in the provided markdown cell that starts with ANSWER:. Do not change the heading, and write your entire answer in that one cell.

For questions that are to be answered numerically, there is a code cell that starts with:

# ANSWER

and has a line in which there is a variable (like "X") currently set to underscores so:

X = ___

Replace those underscores with your final answer. It is okay to make other computations in that cell and others, so long as you set the variable to your answer.

 

1.5. Writing Jupyter Notebooks

You will use Jupyter notebooks for your own projects or documents. In order for you to make your own notebook, you'll need to create your own cells for text and code.

To add a cell, click the + button in the menu bar. It'll start out as a text cell. You can change it to a code cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing "Code".

 

Understanding Check 1.3: Add a code cell below this one. Write code in it that prints out:

A whole new cell! ♪🌏♪

(That musical note symbol is like the Earth symbol. Its long-form name is \N{EIGHTH NOTE}.)

Run your cell to verify that it works.

 

1.6. Errors

You will make errors. And the computer will tell you when you do. Making programming errors is not a problem. Not taking steps to correct the errors you make will be.

Python is a language, and like natural human languages, it has rules. It differs from natural language in two important ways:

  1. The rules are simple. You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.
  2. The rules are rigid. If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes, automatically correcting them, and resolving ambiguities in a sensible way.
    • A computer running Python code is not smart enough to do that.
    • It will, instead, refuse to carry out any calculations, and send you an error message instead.
    • Sometimes (often?) the error message will be several lines long. The line to look at first is the last line of the error message: it names the kind of error and describes what went wrong. The lines above it show where in the code the error occurred. Resolve that error first, and others may disappear.

Whenever you write code, you will make mistakes. When you run a code cell that has errors, Python will produce error messages to tell you what you did wrong. You will know it is an error because it is in a pink box to call your attention to it.

Errors are okay. Even—especially—experienced programmers make many errors. Perhaps the best programmers are those who make—and then correct—the most.

When you make an error, you just have to find the source of the problem, fix it, and move on.

We have made an error in the next cell. Run it and see what happens:

In [1]:
print("This line is missing something."
  File "<ipython-input-1-0fbe4427aee1>", line 1
    print("This line is missing something."
                                           ^
SyntaxError: unexpected EOF while parsing

You should see something like the message above.

The computer tells you that this is a SyntaxError: it is missing something that the computer requires in order to interpret the expression. 'EOF' means "end of file". 'unexpected EOF' means that the computer found itself confronted with the end of the cell before everything needed to make a valid expression had been presented to it. Where it needed to find a ')', it found instead the end of the file that the notebook submitted to the Python kernel when you issued the 'run cell' command.

There is a vast ocean of terminology in programming languages, but you do not need to know all of it in order to program effectively. When you see a cryptic message like this, you can often fix it (fix 'the bug') without having to figure out exactly what the message means.

(If it is not immediately obvious to you, feel free to ask a friend or somebody else in the class. There is a saying in the practice of debugging programs: "given enough eyeballs, all bugs are shallow". It means that somewhere there is an eyeball attached to a brain which already knows how to immediately solve that bug; you only have to find that eyeball and get it to look at the faulty code.)

Note: in the toolbar, there is the option to click "Cell > Run All", which will run all the code cells in this notebook in order until it hits an error. When it hits an error, it stops the 'run all' process.

 

Understanding Check 1.4: Try to fix the code above so that you can run the cell and see the intended message instead of an error.

 

In [ ]:
print("This line is missing something.")

1.7. Libraries Check

Now that you know something about our computing environment, it is time to move into understanding Python proper. First, however, run the code cell below to ensure all the libraries needed for this notebook are installed:

In [2]:
!pip install numpy
!pip install pandas
!pip install matplotlib
Requirement already satisfied: numpy in /Users/delong/anaconda3/lib/python3.6/site-packages
You are using pip version 9.0.2, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Requirement already satisfied: pandas in /Users/delong/anaconda3/lib/python3.6/site-packages
Requirement already satisfied: python-dateutil>=2 in /Users/delong/anaconda3/lib/python3.6/site-packages (from pandas)
Requirement already satisfied: pytz>=2011k in /Users/delong/anaconda3/lib/python3.6/site-packages (from pandas)
Requirement already satisfied: numpy>=1.7.0 in /Users/delong/anaconda3/lib/python3.6/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /Users/delong/anaconda3/lib/python3.6/site-packages (from python-dateutil>=2->pandas)
You are using pip version 9.0.2, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Requirement already satisfied: matplotlib in /Users/delong/anaconda3/lib/python3.6/site-packages
Requirement already satisfied: numpy>=1.7.1 in /Users/delong/anaconda3/lib/python3.6/site-packages (from matplotlib)
Requirement already satisfied: six>=1.10 in /Users/delong/anaconda3/lib/python3.6/site-packages (from matplotlib)
Requirement already satisfied: python-dateutil in /Users/delong/anaconda3/lib/python3.6/site-packages (from matplotlib)
Requirement already satisfied: pytz in /Users/delong/anaconda3/lib/python3.6/site-packages (from matplotlib)
Requirement already satisfied: cycler>=0.10 in /Users/delong/anaconda3/lib/python3.6/site-packages (from matplotlib)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=1.5.6 in /Users/delong/anaconda3/lib/python3.6/site-packages (from matplotlib)
You are using pip version 9.0.2, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

2. Introduction to Python Programming Concepts

2.1. Basics

The departure point for all programming is the concept of the expression. An expression is a combination of variables, operators, and other Python elements that the language interprets and acts upon. Expressions act as a set of instructions to be fed through the interpreter, with the goal of generating specific outcomes. Here are some examples of very basic expressions:

In [7]:
# Examples of basic expressions:

print(2 + 2)           # addition

print('me' + ' and I') # string concatenation 

print("me" + str(2))   # you can print a number with a string if 
                       # you cast the number and so change it into
                       # type string with 'str()'...

print(12 ** 2)         # exponentiation
4
me and I
me2
144

If instead we had:

In [3]:
# Examples of basic expressions:

(2 + 2)           # addition

('me' + ' and I') # string concatenation 

("me" + str(2))   # you can print a number with a string if 
                       # you cast the number and so change it into
                       # type string with 'str()'...

(12 ** 2)         # exponentiation
Out[3]:
144

You will notice that the last expression and only the last in a cell gets printed out below the cell if it has a value. If you want the computer to print more things to your screen, you need to explicitly tell it to print() whatever is inside the parentheses.

 

2.1.1. Numbers and Arithmetic

An expression can be as simple as a number object:

In [1]:
3.25000
Out[1]:
3.25

or it can be an arithmetic calculation:

In [2]:
24 * 24 - 15/3 + 15**3
Out[2]:
3946.0

A great many basic arithmetic operations are built into Python.
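As a brief illustration of the built-in arithmetic, here is a sketch using the arbitrary numbers 7 and 2. The names on the left of each '=' are labels chosen for this example only; how such names work is the subject of the next section.

```python
# Python's built-in arithmetic operators, applied to two small integers.
total = 7 + 2             # addition
difference = 7 - 2        # subtraction
product = 7 * 2           # multiplication
quotient = 7 / 2          # division always returns a float
floor_quotient = 7 // 2   # floor division drops the fractional part
remainder = 7 % 2         # remainder left over after division
power = 7 ** 2            # exponentiation

print(total, difference, product, quotient, floor_quotient, remainder, power)
```

Running the cell should print `9 5 14 3.5 3 1 49`. Note that `/` gives a float even when the division comes out even, which is why the earlier calculation reported 3946.0 rather than 3946.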

 

2.1.2. Variables and Assignment Expressions

A variable is an object in the computer's memory (in the example below, an integer object and a float object) with a name. We can store a quantity in that object, and then refer to it by its name and use it later. In Python we do this via expressions that are called assignment statements: an equals sign, with the name of the variable we are assigning a value to on the left-hand side of the equals sign, and what we want the value to be on the right-hand side.

In the example below, a and b are variables:

In [8]:
a = 4
b = 10/5

Notice that when you create a variable object, unlike what you previously saw, it does not print anything out. Our previous expressions did calculations and then, if the expression was the last in the cell, reported the calculated value. The '=' sign in an assignment expression redirects that value to the object named before the '=' sign. An assignment expression thus has no value to report.

Once you have assigned a value to a variable, that value is then bound to the variable name. In order to refer to and use that variable and its value, simply type the variable object name that we stored the value under:

In [8]:
c=8/2

c
Out[8]:
4.0

Variables are stored within the notebook's environment: once you have stored a variable value in one cell, that value carries over and can be used in all subsequently executed cells—until the kernel restarts, after which you have to rerun all the previously executed code cells to restore the computer's memory environment to its previous working state:

In [9]:
# Notice that 'a' retains its value from the previous 
# code cell above—as long as the kernel was not restarted:

print(a)
a + b
4
Out[9]:
6.0

Understanding Check 2.1: See if you can write a series of expressions that creates two new variables called x and y and assigns them values of 10.5 and 7.2. Then assign their product to the variable combo and print it:

In [10]:
# Fill in the missing lines to complete the expressions.

# x = 10.5
# y = 7.2
# combo = x * y
# print(combo)

#...
75.60000000000001

 

2.1.3. List Objects

Computers have turned out to be useful for vastly more tasks than simply calculations. Data manipulation plays an especially key role. For data manipulation, the most important concept is the type of object called a list (plus its more advanced counterpart, a numpy array).

A list object is an ordered collection of other objects, of sub-objects. Lists allow us to store and access groups of variables and other objects for easy access and analysis.

Check out this documentation for an in-depth look at the capabilities of lists. (Yes, a list can contain itself as a sub-object; no, there is no way to create a list of all lists that do not contain themselves.)

To initialize a list, use brackets. Putting objects separated by commas in between the brackets adds them to the list:

In [9]:
# lists...

lst = []              # an empty list
print(lst)


lst = [1, 3, 6,       # assigning a new list to the name lst
    'lists', 'are', 
    'fun', 
    4]                # note how one expression stretches across
                      # four lines
print(lst)
[]
[1, 3, 6, 'lists', 'are', 'fun', 4]

To access a value in a list object, count from left to right, starting at zero. Then write the name of the list, followed by the number of the subobject you wish to access in brackets:

In [10]:
# Elements are selected thus:

example = lst[2] # to select the '6' in the object 'lst'

print(example)
6

There are some subtleties in how Python treats lists. When you assign a list to a variable object, the variable is then a pointer to the list object. What does this mean? It means this:

In [12]:
a = [1,2,3]     # assign an original list object to variable a
b = a           # assign a to variable b; b now points to list a 
b[0] = 4        # now we assign a new value to the first subobject 
                # of b: we assign the value '4' to it
    
# What now is the value of a[0]? Is it '1'—our original assignment?
# Or is it '4'—did the reassignment of b[0] also carry over to a[0]?
# In Python, a[0] is now equal to '4'

print(a[0])
4
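If you want a second list that can be changed without affecting the first, you need to make a copy rather than a second name. The sketch below is one way to do that, using the built-in `list()` function; the variable names are chosen only for this illustration.

```python
original = [1, 2, 3]
alias = original              # a second name for the same list object
separate = list(original)     # a new list object with the same elements

alias[0] = 4                  # changes the one list that both names share
separate[1] = 99              # changes only the separate copy

print(original)               # the change made through alias shows up here
print(separate)               # the copy keeps its own contents
print(alias is original)      # 'is' asks: are these the same object?
print(separate is original)
```

This should print `[4, 2, 3]`, `[1, 99, 3]`, `True`, and `False`. The copy made here is shallow: if the list itself contained other lists, those inner lists would still be shared.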

In Python we can use the '+' sign not just to add numbers, but to add an element to a list—these are two very different senses of the word "add":

In [21]:
# adding an element to a list

example_list = [1]
example_list = example_list + [2]
print("example_list is:", example_list)

example_number = 1
example_number = example_number + 2
print("example_number is:", example_number)
example_list is: [1, 2]
example_number is: 3

I think that this use of '+' for two different things, addition and concatenation, is a design flaw in Python. But that ship has long ago sailed. You will need to check, when reading and writing Python, whether each '+' is "add" in the sense of "add two numbers", or "add" in the sense of "add an extra object to a collection".

 

**2.1.3.1. Slicing Lists**

As you can see from above, lists do not have to be made up of elements of the same kind. Indices do not have to be taken one at a time, either. Instead, we can take a slice of indices and return the elements at those indices as a separate list.

In [14]:
# This line will store the 1st (inclusive),
# i.e., not including the 0th element!,
# through 4th (exclusive) elements of lst 
# as a new list called lst_2:

lst_2 = lst[1:4]

lst_2
Out[14]:
[3, 6, 'lists']

Why does Python use zero-based indexing? Why is the first element of the list 'lst' the 0th, 'lst[0] = 1', rather than the 1st, 'lst[1] = 3', element? This may confuse you, and it may be easier to remember how Python works if you think of Python lists like days of the week. Suppose we have the list:

days_of_the_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

What is two days from now, Monday? Wednesday. What is one day from now? Tuesday. Today is Monday—and Monday is not one day from now. Python works the same way.
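The days-of-the-week picture can be tried out directly. This is a small sketch in which the day names are written as strings and the variable names are invented for the illustration; the index is read as "how many days from Monday".

```python
days_of_the_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

today = days_of_the_week[0]               # zero days from Monday
one_day_from_now = days_of_the_week[1]    # one day from Monday
two_days_from_now = days_of_the_week[2]   # two days from Monday

print(today)
print(one_day_from_now)
print(two_days_from_now)
```

This should print Monday, Tuesday, and Wednesday, in that order.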

 

Understanding Check 2.2: Slicing Lists: Build a list of length 10 containing whatever elements you'd like. Then, slice it into a new list of length five using index slicing. Finally, assign the last element in your sliced list to the given variable and print it.

In [15]:
# Fill in the ellipses to complete the question.

# lst=...
# lst2=lst[...:...]
# a=lst2[...]

print(a)
6

Lists can also be operated on with a few built-in analysis functions. These include min and max, among others. Lists can also be concatenated together. Find some examples below.

In [16]:
# MOAR List Examples:

a_list = [1, 6, 4, 8, 13, 2]  # a list containing six integers


b_list = [4, 5, 2, 14, 9, 11] # another list containing six integers.

print('Max of a_list:', max(a_list))
print('Min of b_list:', min(b_list))

c_list = a_list + b_list      # concatenate a_list and b_list

print('Concatenated:', c_list)
Max of a_list: 13
Min of b_list: 2
Concatenated: [1, 6, 4, 8, 13, 2, 4, 5, 2, 14, 9, 11]
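A few of the other built-in functions that work on lists are sketched below. The choice of `len`, `sum`, and `sorted` is ours, as an illustration of "among others"; the list is the same `a_list` as above, redefined so that the cell runs on its own.

```python
a_list = [1, 6, 4, 8, 13, 2]

length = len(a_list)          # how many elements the list holds
total = sum(a_list)           # the sum of the elements (numbers only)
in_order = sorted(a_list)     # a new, sorted list; a_list is unchanged

print('Length of a_list:', length)
print('Sum of a_list:', total)
print('Sorted copy:', in_order)
print('Original:', a_list)
```

The length should be 6 and the sum 34; the sorted copy should be `[1, 2, 4, 6, 8, 13]`, while the original list keeps its order.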

 

**2.1.3.2. Numpy Arrays**

Python list objects are very flexible and powerful, but they are slow. And even though our computer hardware is immensely powerful, it is not quite powerful enough, sometimes, for the uses we wish to make of it in the environments we want to work in. Therefore there is an add-on library to Python, numpy, for "numerical Python", to work faster. And there is an object type in numpy, the array. A numpy array is a kind of list, in which we know that all of the subobjects (elements) of the array will be numbers, and so the computer can operate on them much more quickly.

We tell the computer that we want to use the numpy library with an import statement:

In [16]:
import numpy as np

It is conventional to use the abbreviation 'np' to call on numpy. Whenever, in any Python program, you see an 'np.', you can safely assume that somewhere earlier in the computer's workflow there was a:

import numpy as np

Now let's take a look at some things we can do with numpy array objects:

In [17]:
# Initialize an array of integers 0 through 9.
example_array = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# This can also be accomplished using np.arange
example_array_2 = np.arange(10)
print('Undoubled Array:')
print(example_array_2)

# Double the values in example_array and print the new array.
double_array = example_array*2
print('Doubled Array:')
print(double_array)
Undoubled Array:
[0 1 2 3 4 5 6 7 8 9]
Doubled Array:
[ 0  2  4  6  8 10 12 14 16 18]

This behavior differs from that of a list. See below what happens if you multiply a list.

In [14]:
example_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
example_list * 2
Out[14]:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Notice that instead of multiplying each of the elements by two, multiplying a list and a number returns that many copies of that list. Other mathematical operations also have... interesting... behaviors with lists. Beware, but also explore.
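The same contrast shows up with '+'. The following sketch, with names made up for the illustration, adds a list to itself and an array to itself; it assumes numpy is installed, as checked in section 1.7.

```python
import numpy as np

numbers_list = [1, 2, 3]
numbers_array = np.array([1, 2, 3])

list_plus_list = numbers_list + numbers_list       # concatenation
array_plus_array = numbers_array + numbers_array   # element-by-element sum

print(list_plus_list)
print(array_plus_array)
```

The list result should be `[1, 2, 3, 1, 2, 3]`, while the array result should hold the values 2, 4, and 6.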

 

2.2. How to Program

2.2.1. The Write-Check-Rewrite Loop

The computer does not think and work like you do. You can very easily get wrong ideas into your head about what the computer will do in response to the lines of code you type and run, and then fixing your code will become next to impossible unless you work in small, discrete steps. Therefore: write a line of code; run it; check it to make sure it did what you thought it would do; fix it; rerun it; and check it again.

We will sometimes use the "ok" grader to help you see whether you have coded correctly.

Go ahead and attempt Understanding Check 2.2. Running the cell directly after it will test whether you have assigned seconds_in_a_decade correctly. If you haven't, this test will tell you the correct answer. Resist the urge to just copy it, and instead try to adjust your expression. (Sometimes the tests will give hints about what went wrong...)

 

Understanding Check 2.2: Assign the name seconds_in_a_decade to the number of seconds between midnight January 1, 2010 and midnight January 1, 2020. Note that there are two leap years in this span of a decade. A non-leap year has 365 days and a leap year has 366 days.

In [ ]:
# Change the next line 
# so that it computes the number of seconds in a decade 
# and assigns that number the name, seconds_in_a_decade.

seconds_in_a_decade = ...

# seconds_in_a_decade = 60*60*24*(365*10+1+1)

# We've put this line in this cell 
# so that it will print the value you've given to seconds_in_a_decade when you run it.  
# You don't need to change this.

seconds_in_a_decade
In [ ]:
ok.grade("q22");

 

2.2.2. Comments

You may have noticed lines like these in the code cells so far:

# Change the next line 
# so that it computes the number of seconds in a decade 
# and assigns that number the name, seconds_in_a_decade.

These are comments. Comments do not make anything happen in Python: Python ignores anything on a line after a #.

Comments are there to communicate something about the code to you, the human reader. Comments are essential if anybody else is to understand your code, and in this case "anybody else" includes you a season, a month, a week, and possibly a day from now.
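One detail worth seeing in action: a # starts a comment only when it is outside quotation marks. The sketch below uses made-up names to show a whole-line comment, a trailing comment, and a # that is part of a string.

```python
# This whole line is a comment; Python skips it.
hours_per_day = 24    # a comment can also follow code on the same line

label = "# inside quotes, this is ordinary text"   # the string keeps its '#'

print(hours_per_day)
print(label)
```

The first print shows 24, and the second shows the string with its leading # intact.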

 

2.2.3. Programming Dos and Don'ts...

  1. Do restart your kernel and run cells up to your current working point every fifteen minutes or so. Yes, it takes a little time. But if you don't, sooner or later the machine's namespace will get confused, and then you will get confused about the state of the machine's namespace, and by assuming things about it that are false you will lose hours and hours...

  2. Do reload the page when restarting the kernel does not seem to do the job...

  3. Do edit code cells by copying them below your current version and then working on the copy: when you break everything in the current cell (as you will), you can then go back to the old cell and start fresh...

  4. Do exercise "agile" development practices: if there is a line of code that you have not tested, test it. The best way to test is to ask the machine to echo back to you the thing you have just created in its namespace to make sure that it is what you want it to be. Only after you are certain that your namespace contains what you think it does should you write the next line of code. And then you should immediately test it...

  5. Do take screenshots of your error messages...

  6. Do google your error messages: Ms. Google is your best friend here...

  7. Do not confuse assignment ("=") and test for equality ("=="). In general, if there is an "if" anywhere nearby, you should be testing for equality. If there is not, you should be assigning a variable in your namespace to a value. Do curse the mathematicians 500 years ago who did not realize that in the twenty-first century it would be very convenient if we had different and not confusable symbols for equals-as-assignment and equals-as-test...

  8. Do expect things to go wrong: it's not the end of the world, or even a threat to your soul. Back up to a stage where things were working as expected, and then try to figure out why things diverged from your expectations. This happens to everybody.
     
    Here, for example, we have Gandalf the Grey, Python νB, confronting unexpected behavior from a Python pandas.DataFrame. Yes, he is going to have to upgrade his entire system and reboot. But in the end he will be fine:

[Image: Tools Data Science and Python MRE key]

Note: On the elective affinity between computer programming and sorcery: https://www.bradford-delong.com/2017/08/live-from-cyberspace-the-elective-affinity-between-fantasy-and-computer-programming-paul-dourish-the-original-hac.html
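Item 7 in the list above, on "=" versus "==", is easier to remember after seeing both side by side. This is a minimal sketch with invented names.

```python
x = 5                  # assignment: bind the name x to the value 5

is_five = (x == 5)     # equality test: does x equal 5?
is_six = (x == 6)      # equality test: does x equal 6?

# Next to an 'if' you want the test, '=='.
# Writing 'if x = 5:' instead is a SyntaxError in Python.
if x == 5:
    message = "x equals 5"
else:
    message = "x does not equal 5"

print(is_five, is_six, message)
```

This should print `True False x equals 5`.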

 

2.3. Higher-Order Computer Language Structures

2.3.1. Background

At the bottom-most level, a computer is a collection of electronic circuits in which there is either power, understood as a '1', or no power, understood as a '0'. These circuits are grouped into sets of 8, a 'byte', which is then understood as an 8-digit binary number, something between 0 and 255 (decimal) inclusive. In modern computers these bytes are gathered in groups of 8 into 'words'. The computer then takes these words and adds, negates, and moves them within its systems in accordance with a pattern set by the interaction of the computer's hardware design and the word currently in the computer's instruction register.
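Python can show the arithmetic behind bits, bytes, and words. The sketch below is only an illustration with names of our choosing: eight binary digits give 2**8 = 256 possible values, 0 through 255, and eight bytes of eight bits make a 64-bit word.

```python
largest_byte = int("11111111", 2)    # read eight 1s as a binary number
values_per_byte = 2 ** 8             # how many different values 8 bits can hold
bits_per_word = 8 * 8                # eight bytes of eight bits each

five_in_binary = format(5, "08b")    # the number 5 written as 8 binary digits

print(largest_byte)
print(values_per_byte)
print(bits_per_word)
print(five_in_binary)
```

This should print 255, 256, 64, and `00000101`.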

But working directly with a computer—programming "on the bare metal"—is impossible for humans. At the bare-metal level, a computer can do things like:

  • move a number into a particular memory location or a register (MOV),
  • add a number to a register (ADD),
  • negate a number (NOT),
  • compare two registers and set the flag if they are equal (CMP),
  • change the next memory location to be read into the instruction register if the flag is set (JIF),
  • and a few other things.

So we build programming languages that translate our ideas about what the computer should do into a pattern of machine-language instructions that the computer can load into its instruction register and work with. And it is essential that such programming languages allow us to work not with raw 8-bit or 64-bit binary numbers but instead with objects. Early programming languages set up only four kinds of objects: you could ask the interpreter or compiler that took your program and translated it into machine language to set things up so that some of the numbers in the computer memory could represent integers, others could represent numbers with a fractional part, still others could represent strings of symbols like letters, and still others could represent sets of operations—combinations of the basic instruction operations.

Note: You probably want to learn a little—but only a little—about how what you write in your code cells is translated into dancing electrons. I recommend reading Charles Petzold (1999): Code: The Hidden Language of Computer Hardware and Software http://books.google.com/?isbn=9780735605053, starting at the beginning and continuing until you get bored...

But it very soon became clear (in fact, at the start of computer science it was obvious) that having only these four kinds of objects was inadequate. So slightly later computer languages introduced other types of objects. We have already seen lists and arrays. But the most important of the early additional objects was the loop: a way of conveniently specifying that a similar calculation was to be done over and over again. Then came the function: an easy-to-type and easy-to-remember way of setting up a calculation that has a few inputs and one output, and that you are going to want to do over and over again for different input values. Ultimately, computer science developed the idea that a modern computer language should be object-oriented: it should allow you (in fact, attempt to force you) to define and use your own kinds of objects, with whatever features and properties you find convenient.

Python is a very modern computer language.

 

2.3.2. Loop Objects

But, first, let us back up to loops. Loops are super useful. If each line of code we wrote were only executed once, programming would be very unproductive. But by repeated execution of sets of statements on slightly different sets of data, we can begin to unlock the potential of computers.

Python has two major types of loops: for loops and while loops. A for loop does a thing over and over again for different values of a variable, whatever it is, that is named immediately after the keyword for. A while loop does a thing over and over again while something is true, and stops when that something becomes false.

 

**2.3.2.1. While Loops**

The while loop repeatedly performs operations until a conditional—something that is either True or False—is no longer satisfied. You will hear such a conditional thing called a boolean expression.
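A short sketch of what a boolean expression looks like in practice, and of how a while loop re-checks it before every pass. The variable names and values here are chosen only for illustration.

```python
# A boolean expression evaluates to either True or False.
i = 3
n = 14
print(i < n + 1)      # a comparison of the kind used as a loop condition
print(i == n)         # an equality test
print(type(i < n))    # comparisons produce values of type bool

# A while loop checks its condition before every pass and stops once it is False.
countdown = 3
while countdown > 0:
    print(countdown)
    countdown = countdown - 1
```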

Thirteenth-century Italian mathematician Leonardo Bonacci of Pisa—now called Fibonacci—wrote the Book of Calculation: the first European-language (Latin) book showing what Hindu-Arabic numerals could do—how much easier than was possible using Roman numerals it would be to do money-changing, interest-owed, measure-conversion, and other calculations. In his book, he gave an example of how you could calculate the growth of a hypothetical population of rabbits, which followed what we now call the very useful Fibonacci sequence, in which the first two numbers are 1 and 1, and each subsequent number is the sum of the preceding two numbers.

The code cell below uses a while loop to calculate the nth Fibonacci number:

In [17]:
# using a while loop to calculate the nth Fibonacci number
#
# compute the nth Fibonacci number:

n = 14                # set which sequence number to compute
i = 1
previous_number = 0
current_number = 1

while i < n+1:
    print(i, current_number)
    next_number = previous_number + current_number
    previous_number = current_number
    current_number = next_number
    i = i + 1
    
1 1
2 1
3 2
4 3
5 5
6 8
7 13
8 21
9 34
10 55
11 89
12 144
13 233
14 377

That—the fourteenth Fibonacci number—was as far as Leonardo of Pisa got in his calculations in his Book of Calculation.

The program does the calculation he does, but much more quickly. It starts with the zeroth Fibonacci number (0) and the first number (1) set as the values of the variables previous_number and current_number, respectively, with the index of the sequence i set equal to 1, and with the index of the desired Fibonacci number set to n.

Then the computer enters the while loop: while the index i is less than n+1 (that is, no greater than n), the program prints the current value of the index i and the current_number, sets the next_number to the sum of the current_number and the previous_number, sets the previous_number equal to the current_number, sets the current_number equal to the next_number, adds one to the index i, and then goes back to the top of the loop to check whether i is still less than n+1. If it is not, it exits the loop, and the program cell's calculations come to an end.

If you had told Fibonacci back in the thirteenth century that in less than 800 years every student could easily learn how to construct a golem that in a fraction of a second could calculate not the 14th of his numbers (377) but the 1000th:

4346655768693745643568852767504062580256466051
73717804024817290895365554179490518904038798400
79255169295922593080322634775209689623239873322
47116164299644090653318793829896964992851600370
4476137795166849228875

what would his reaction have been?
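For readers who want to check the 1000th number themselves, here is a sketch that reuses the while loop from the cell above with n set to 1000 and the per-step printing removed. Python integers grow as large as needed, so no special library is required.

```python
# Same loop structure as the cell above, stopping once current_number is the nth value.
n = 1000
i = 1
previous_number = 0
current_number = 1

while i < n:
    next_number = previous_number + current_number
    previous_number = current_number
    current_number = next_number
    i = i + 1

print(current_number)              # the 1000th Fibonacci number
print(len(str(current_number)))    # how many decimal digits it has
```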

 

**2.3.2.2. For Loops**

For loop objects are essential in traversing a list and performing an analogous set of calculations at each element. For example, the great German mathematician and philosopher Gottfried Wilhelm Leibniz (1646-1716) discovered a wonderful formula for 𝜋 as an infinite sum of simple fractions like this:

$ \pi = 4 - \frac{4}{3} + \frac{4}{5} - \frac{4}{7} + \frac{4}{9} - \frac{4}{11} + ... $

We can use for loop objects to, first, calculate the terms of the series; and then to sum them:

In [30]:
# using Leibnitz's series for π to calculate an approximate value

n=1000                        # for how many terms do we wish to calculate the series?

#
# set up the terms of Leibnitz's series

series_terms = []             # start with an empty list to hold the series terms

for i in range(1,n+1):        # for each of the first n terms of the series
    series_terms = series_terms + [4*(-1)**(i+1)*1/(2*i-1)] # calculate the series term

#
# use the terms to calculate an approximation to π

π_approximation = 0          # start with zero

for term in series_terms:     # for each of the series terms in our list 
    π_approximation = π_approximation + term    # add the series term to our
                                                  # approximation to π
    
print(π_approximation)       # print our approximation to π
3.140592653839794

In both of the for loops in the above cell, the most important line is the "for ... in ..." line. This sets the structure. It tells the computer to step through every element of the object named after the "in", perform the indicated operations on it, and then move on to the next. Once Python has stepped through every element, the computer exits the loop; after the second loop, it prints "π_approximation".

Note that the names "i" and "term" are arbitrary: they are ordinary variables that the loop reassigns on each pass (and that keep their last value once the loop ends), so they could be named anything. For example:

In [31]:
π_approximation = 0          # start with zero

for rudolph_the_red_nosed_reindeer in series_terms:
    π_approximation = π_approximation + rudolph_the_red_nosed_reindeer 
    
print(π_approximation)       # print our approximation to π
3.140592653839794

works exactly the same.
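The build-a-list-then-add-it-up pattern is common enough that Python's built-in sum function can do the adding. A possible condensed sketch, assuming the same 1000 terms as above and using the plain-ASCII name pi_approximation as a stand-in for π_approximation:

```python
n = 1000    # number of terms of the series to use

# generate each term on the fly and let sum() add them up in order
pi_approximation = sum(4 * (-1)**(i + 1) / (2*i - 1) for i in range(1, n + 1))

print(pi_approximation)
```

Whether this reads better than the two explicit loops is a matter of taste; the explicit version shows each step, while this one is shorter.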

 

Understanding Check 2.3: In the following cell, partial steps to manipulate an array are included. You must fill in the blanks to accomplish the following:

  1. Iterate over the entire array, checking if each element is a multiple of 5
  2. If an element is not a multiple of 5, add 1 to it repeatedly until it is
  3. Iterate back over the list and print each element.

Hint: To check if an integer x is a multiple of y, use the modulus operator %. Typing x % y will return the remainder when x is divided by y. Therefore, (x % y != 0) will return True when y does not divide x, and False when it does.
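A quick illustration of the hint, using small numbers chosen only for this example (it does not solve the exercise):

```python
x = 12
y = 5

print(x % y)          # remainder when 12 is divided by 5
print(x % y != 0)     # True: 5 does not divide 12
print(50 % y != 0)    # False: 5 divides 50 exactly
```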

In [24]:
import numpy as np

# Make use of iterators, range, length, while 
# loops, and indices to complete this question.

question_3 = np.array([12, 31, 50, 0, 22, 28, 19, 105, 44, 12, 77])

# for i in ...:
#    while ...:
#       question_3[i]+=1      

# print(question_3)
[ 15  35  50   0  25  30  20 105  45  15  80]

2.3.3. Function Objects

The loop objects in the previous section are messy. They contain several expressions, and extend over several lines. It would be nice to figure a way to gather all these pieces together, and make it more transparent just why the object exists and what it is good for.

A programming language, after all, is much more than a means for instructing a computer to perform tasks. It is a way for us to organize our thoughts about the computer and what it is doing: programs must be written for people to read as well as for computers to execute. And the most powerful way to enable people to read is for the programming language to provide tools by which compound elements can be built, combined, and then named and used as a single unit: as an easily understood and referenced object.

Function objects are thus useful when you want to repeat a series of steps in a calculation for multiple different sets of input values, but don't want to type out the steps over and over again. A function object provides you with an easy-to-type and easy-to-remember way of setting up such a calculation, and then referring to it and adding it to your program cells whenever you wish.

Many functions are built into Python already; for example, you've already made use of len() to retrieve the number of elements in a list.

You can also write your own functions. At this point you already have the skills to begin to do so. It is good to get into the habit of writing functions: the major benefit of using functions is that it makes your code much easier for humans to read, and the human who will have the most trouble reading your code is yourself three months from now, when you are trying to study for the final exam.

Functions generally take a set of parameters (also called inputs), which define the objects they will use when they are run. For example, the len() function takes a list or array as its parameter, and returns the length of that list.

All of the loops we wrote above can be better presented when encapsulated in functions:

In [35]:
# a function to calculate the nth Fibonacci number

def Fibonacci(n):
    """
    compute the nth Fibonacci number
    function input: n, the index of the 
    Fibonacci number to be computed
    function output: F_n, the nth Fibonacci
    number
    """
    
    i = 1
    previous_number = 0
    current_number = 1

    while i < n+1:
        next_number = previous_number + current_number
        previous_number = current_number
        current_number = next_number
        i = i + 1
    return previous_number
    

Once you have defined this function, you can, whenever you want, calculate the 14th Fibonacci number and store it in a variable (called "result", say) by simply invoking:

In [38]:
result = Fibonacci(14)

result
Out[38]:
377
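Once the calculation has a name, reusing it for many inputs takes only a short loop. The sketch below repeats the function body under the hypothetical name fibonacci_number (so that the example runs on its own) and collects the first ten values:

```python
def fibonacci_number(n):
    """return the nth Fibonacci number (same logic as the cell above)"""
    i = 1
    previous_number = 0
    current_number = 1
    while i < n + 1:
        next_number = previous_number + current_number
        previous_number = current_number
        current_number = next_number
        i = i + 1
    return previous_number

# call the function once for each index from 1 to 10
first_ten = []
for k in range(1, 11):
    first_ten = first_ten + [fibonacci_number(k)]

print(first_ten)
```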

Or, if Python did not already have a perfectly good value of π built in, a function to calculate π using the Leibniz approximation might be useful:

In [42]:
# a function to calculate the n-term Leibnitz approximation to π 

def Leibnitz_π(n):
    """
    compute the n-term Leibnitz approximation to π
    function input: n, the number of terms to be 
    calculated in the approximation
    function output: π_{Ln}
    """

    series_terms = []

    for i in range(1,n+1):
        series_terms = series_terms + [4*(-1)**(i+1)*1/(2*i-1)] 

    π_approximation = 0 

    for term in series_terms: 
        π_approximation = π_approximation + term
    
    return π_approximation

result = Leibnitz_π(10000)

result
Out[42]:
3.1414926535900345

But Python does, in the "math" library:

In [43]:
import math
math.pi
Out[43]:
3.141592653589793

Or if we needed a function to test whether a number is prime:

In [52]:
# prime number test function

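def is_multiple(m, n):
    """
    is m a multiple of n?
    """
    return m % n == 0


def is_it_prime(n):
    """
    tests the number n for primality
    """
    if n < 2:                    # 0, 1, and negative numbers are not prime
        return False
    for i in range(2, n):
        if i > n/2:              # no divisor of n (other than n) exceeds n/2
            break
        if is_multiple(n, i):    # found a divisor, so n is not prime
            return False
    return True                  # no divisor found, so n is prime

is_it_prime(9)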
Out[52]:
False

Note: The function above uses if statements. Read more about the if statement here: https://www.tutorialspoint.com/python/python_if_else.htm.
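As a usage sketch, a primality test can be run inside a for loop to collect every prime below some bound. The helper below is a compact stand-in written for this example (trial division by every candidate up to the square root), not the exact function from the cell above, and the name is_prime is hypothetical.

```python
def is_prime(n):
    """trial division: try divisors i for as long as i*i <= n"""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:      # i divides n, so n is not prime
            return False
        i = i + 1
    return True

primes_below_30 = []
for candidate in range(30):
    if is_prime(candidate):
        primes_below_30 = primes_below_30 + [candidate]

print(primes_below_30)
```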

 

Remember: The principal reason to use functions—and other objects—is to aid in your understanding, not the computer's. The computer does not care: for it, it is all patterns of ones and zeroes. Thus wherever you can make your program easier to read by explicitly using a function, do so.

For example, there are lots of times when economists want to calculate the marginal utility of spending on something—say, the marginal utility of spending on consumption for a consumer whose attitude toward risk is captured by a constant relative risk aversion parameter γ. The calculation of marginal utility as a function of the current level of consumption c and the CRRA parameter γ is straightforward:

marg_util = c**(-γ)

and when $ \gamma = 1 $ (the log-utility case) this reduces to:

marg_util = 1/c

So you wind up writing code like:

if γ == 1:
    marg_util = 1/c
else:
    marg_util = c**(-γ)

So why not put it into a function called marginal_utility, so that you will have one line instead of 4, and to remind yourself of what is going on each time you read through your code?

In [54]:
# marginal utility function

def marginal_utility(c, γ):
    """
    the marginal utility of increasing spending on consumption
    for a consumer with a constant-relative risk aversion
    parameter γ and a current level of consumption spending c
    """
    if γ == 1:
        return 1 / c
    else:
        return c**(-γ)

c = 5
γ = 2

result = marginal_utility(c, γ)
result
Out[54]:
0.04
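With the calculation wrapped in a function, tabulating it for several consumption levels takes only a short loop. The sketch below repeats the function so that it runs on its own, using the ASCII name gamma in place of γ; the consumption levels are arbitrary illustrative values.

```python
def marginal_utility(c, gamma):
    """CRRA marginal utility; gamma is the risk-aversion parameter (γ in the text)"""
    if gamma == 1:
        return 1 / c
    else:
        return c**(-gamma)

# marginal utility falls as consumption rises
for c in [1, 2, 5, 10]:
    print(c, marginal_utility(c, 2))
```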

2.4. Object-Oriented Programming (OOP)

Python is a modern kind of computer language called "object oriented". That means this: in Python, everything is an object that someone has defined and that you can redefine, and you are free to define your own objects as you wish.

 

2.4.1. Objects

OK. What does that last paragraph mean, and why does it matter?

In Python, an object is a collection of data and instructions held in computer memory that consists of:

  1. a name—what you have called it
  2. contents—a value or a set of values
  3. methods—things you can do to or extract from it that the computer knows how to do by virtue of what you defined the object to be.

You might think that a Python variable is simply a named box in which you can store a number. And, indeed the variable has a name and does have its value as its contents. But Python knows that a variable is also an object that understands a great many ways you would like to use it. These predefined ways to use it are called "methods".

 

2.4.2. Methods

Here are some of the methods that come with a variable object:

In [55]:
y = 3

print(y)               # value
print(y + y)           # add
print(y * y)           # multiply
print(y - y)           # subtract
print(y/y)             # divide
print(y**y)            # exponentiate
print(y.__abs__())     # absolute value
print(y.__bool__())    # is it "true" (i.e., not zero)?
print(y.__lt__(5))     # is it less than 5?
print(y.real)          # real part
print(y.imag)          # imaginary part
print(y.conjugate())   # complex conjugate
3
6
9
0
1.0
27
3
True
True
3
0
3
In [58]:
x = 3 + 2j

print(x)               # value
print(x.real)          # real part
print(x.imag)          # imaginary part
print(x.conjugate())   # complex conjugate
print(x + x)           # add
print(x * x)           # multiply
print(x - x)           # subtract
print(x/x)             # divide
print(x**x)            # exponentiate
print(x.__abs__())     # absolute value
print(x.__bool__())    # is x "true" (i.e., not zero)?
print(y.__lt__(5))     # is y less than 5? (complex numbers like x have no ordering)
(3+2j)
3.0
2.0
(3-2j)
(6+4j)
(5+12j)
0j
(1+0j)
(-5.409738793917678-13.410442370412747j)
3.605551275463989
True
True

You can call the

dir(y)

command to see all the methods that Python has associated with the variable y. And suppose that you want more methods attached to one of your objects? Then you can extend the class of object it is, and define your own.
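The list that dir returns can be long. One possible way to skim it, sketched here with a plain for loop, is to set aside the names that begin with an underscore (the "special" methods that sit behind operators such as +) and look at the rest:

```python
y = 3

# dir(y) returns a list of attribute names as strings
public_names = []
for name in dir(y):
    if not name.startswith('_'):
        public_names = public_names + [name]

print(public_names)
print('__add__' in dir(y))    # the special method behind y + y
```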

 

Python has tools for you to figure out what methods an object has. Consider a list object. Does it know how to use the "append" method to add another object to it? Let's see:

In [66]:
university_list = ['UC San Diego', 'UC Riverside', 'UC Irvine', 'UC Los Angeles',
                   'UC Santa Barbara', 'UC Merced', 'UC Santa Cruz', 
                   'UC San Francisco', 'UC Davis']
print("university_list is", len(university_list), "items long")
print("does university list understand the .append method?", callable(university_list.append))
university_list is 9 items long
does university list understand the .append method? True

So let's add the missing UC campus—the one that Buffy Summers graduated from:

In [68]:
university_list.append('UC Sunnydale')

university_list
Out[68]:
['UC San Diego',
 'UC Riverside',
 'UC Irvine',
 'UC Los Angeles',
 'UC Santa Barbara',
 'UC Merced',
 'UC Santa Cruz',
 'UC San Francisco',
 'UC Davis',
 'UC Sunnydale']

"university list" is now a longer list than it was:

In [69]:
print("university_list is", len(university_list), "items long")
university_list is 10 items long

The element at index 9 of "university list" (the tenth item, since Python counts from zero) is the character string "UC Sunnydale". And that is an object that has its own methods:

In [70]:
university_list[9].upper()
Out[70]:
'UC SUNNYDALE'
In [71]:
university_list[9].lower()
Out[71]:
'uc sunnydale'

2.4.3. Building Your Own Classes

Imagine now you want to write a program with consumers, who can:

  1. hold and spend cash
  2. consume goods
  3. work and earn cash

A natural solution in Python would be to create consumers as objects with

  1. data, such as cash on hand
  2. methods, such as buy or work that affect this data

Python makes it easy to do this, by providing you with class definitions that allow you to build objects according to your own specifications.

A class definition is a blueprint for a particular class of objects (e.g., lists, strings or complex numbers). It describes

  • what kind of data the class stores, and
  • what methods it has for acting on these data.

An object or instance is a realization of the class, created from the blueprint. Each instance has its own unique data. Methods set out in the class definition act on this (and other) data. In Python, the data and methods of an object are collectively referred to as attributes. Attributes are accessed via “dotted attribute notation”:

  • object_name.data
  • object_name.method_name()

In [ ]:
x = [1, 5, 4]
x.sort()
x.__class__

x is an object or instance, created from the definition for Python lists, but with its own particular data. x.sort() and x.__class__ are two attributes of x. dir(x) can be used to view all the attributes of x.

Example: A Consumer Class

We’ll build a Consumer class with

  1. a wealth attribute that stores the consumer’s wealth (data)
  2. an earn method, where earn(y) increments the consumer’s wealth by y
  3. a spend method, where spend(x) either decreases wealth by x or prints an error message if insufficient funds exist
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

class Consumer:

    def __init__(self, w):
        "Initialize consumer with w dollars of wealth"
        self.wealth = w

    def earn(self, y):
        "The consumer earns y dollars"
        self.wealth += y

    def spend(self, x):
        "The consumer spends x dollars if feasible"
        new_wealth = self.wealth - x
        if new_wealth < 0:
            print("Insufficient funds")
        else:
            self.wealth = new_wealth

If you look at the Consumer class definition again you’ll see the word self throughout the code. The rules with self are that

  1. Any instance data should be prepended with self, e.g., the earn method references self.wealth rather than just wealth
  2. Any method defined within the class should have self as its first argument, e.g., def earn(self, y) rather than just def earn(y)
  3. Any method referenced within the class should be called as self.method_name
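A short usage sketch may help show what the class does once instances are created. The class is repeated here so the example runs on its own; the instance names c1 and c2 and the dollar amounts are illustrative.

```python
class Consumer:

    def __init__(self, w):
        "Initialize consumer with w dollars of wealth"
        self.wealth = w

    def earn(self, y):
        "The consumer earns y dollars"
        self.wealth += y

    def spend(self, x):
        "The consumer spends x dollars if feasible"
        new_wealth = self.wealth - x
        if new_wealth < 0:
            print("Insufficient funds")
        else:
            self.wealth = new_wealth

c1 = Consumer(10)    # create an instance with wealth 10
c1.spend(5)
print(c1.wealth)     # wealth after spending 5

c1.earn(15)
c1.spend(100)        # more than the consumer has, so wealth is unchanged
print(c1.wealth)

c2 = Consumer(8)     # a second instance keeps its own separate wealth
print(c2.wealth)
```

Each instance carries its own data: changing c1.wealth leaves c2.wealth untouched.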

Example: The Solow Growth Model

For our next example, let’s write a simple class to implement the Solow growth model. The Solow growth model is a neoclassical growth model where the amount of capital stock per capita $k_{t}$ evolves according to the rule

$$ k_{t+1} = \frac{s z k_t^{\alpha} + (1 - \delta) k_t}{1 + n} \tag{1} $$

Here

  • $s$ is an exogenously given savings rate
  • $z$ is a productivity parameter
  • $\alpha$ is capital’s share of income
  • $n$ is the population growth rate
  • $\delta$ is the depreciation rate

The steady state of the model is the $k$ that solves (1) when $k_{t+1} = k_t = k$.

Some points of interest in the code are:

  1. An instance maintains a record of its current capital stock in the variable self.k.
  2. The h method implements the right-hand side of (1).
  3. The update method uses h to update capital as per (1). Notice how inside update the reference to the local method h is self.h.
  4. The methods steady_state and generate_sequence are fairly self-explanatory.
In [ ]:
class Solow:
    r"""
    Implements the Solow growth model with the update rule

        k_{t+1} = [(s z k^α_t) + (1 - δ)k_t] /(1 + n)

    """
    def __init__(self, n=0.05,  # population growth rate
                       s=0.25,  # savings rate
                       δ=0.1,   # depreciation rate
                       α=0.3,   # capital's share of income
                       z=2.0,   # productivity
                       k=1.0):  # current capital stock

        self.n, self.s, self.δ, self.α, self.z = n, s, δ, α, z
        self.k = k

    def h(self):
        "Evaluate the h function"
        # Unpack parameters (get rid of self to simplify notation)
        n, s, δ, α, z = self.n, self.s, self.δ, self.α, self.z
        # Apply the update rule
        return (s * z * self.k**α + (1 - δ) * self.k) / (1 + n)

    def update(self):
        "Update the current state (i.e., the capital stock)."
        self.k =  self.h()

    def steady_state(self):
        "Compute the steady state value of capital."
        # Unpack parameters (get rid of self to simplify notation)
        n, s, δ, α, z = self.n, self.s, self.δ, self.α, self.z
        # Compute and return steady state
        return ((s * z) / (n + δ))**(1 / (1 - α))

    def generate_sequence(self, t):
        "Generate and return a time series of length t"
        path = []
        for i in range(t):
            path.append(self.k)
            self.update()
        return path
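
One way to sanity-check the steady-state formula is to confirm that it is a fixed point of the update rule. Setting $k_{t+1} = k_t = k$ in (1) gives $k (n + \delta) = s z k^{\alpha}$, which is where the closed form comes from. The sketch below uses plain functions and ASCII names (delta, alpha) rather than the class, with the parameter values copied from the class defaults.

```python
# parameter values copied from the class defaults above
n, s, delta, alpha, z = 0.05, 0.25, 0.1, 0.3, 2.0

def h(k):
    "right-hand side of the update rule (1)"
    return (s * z * k**alpha + (1 - delta) * k) / (1 + n)

# closed-form steady state, as in the steady_state method
k_star = ((s * z) / (n + delta))**(1 / (1 - alpha))

print(k_star)                     # steady-state capital per capita
print(abs(h(k_star) - k_star))    # zero up to rounding error

# iterating the rule from k = 1 moves capital toward the steady state
k = 1.0
for t in range(200):
    k = h(k)
print(k)
```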

Here’s a little program that uses the class to compute time series from two different initial conditions. The common steady state is also plotted for comparison.

In [ ]:
s1 = Solow()
s2 = Solow(k=8.0)

T = 60
fig, ax = plt.subplots(figsize=(9, 6))

# Plot the common steady state value of capital
ax.plot([s1.steady_state()]*T, 'k-', label='steady state')

# Plot time series for each economy
for s in s1, s2:
    lb = f'capital series from initial state {s.k}'
    ax.plot(s.generate_sequence(T), 'o-', lw=2, alpha=0.6, label=lb)

ax.legend()
plt.show()

Congratulations, you have finished your first assignment for Econ 101B! Run the cell below to submit all of your work. Make sure to check on OK to make sure that it has uploaded.

 

2.5. Pandas Dataframes

We will be managing our data using "dataframe" objects from the Pandas library, one of the most widely used Python libraries in data science. We need to import "pandas", and it is conventional to abbreviate it "pd":

In [75]:
import numpy as np
import pandas as pd

 

2.5.1 Creating Dataframes

The rows and columns of a pandas dataframe are essentially a collection of lists stacked on top of or next to each other. For example, to store the top 10 movies and their ratings, I could create 10 lists. Each list would contain a rating, a movie title, and its release year, and each list would be one row of the dataframe's data table:

In [93]:
top_10_movies_df = pd.DataFrame(data=np.array(
            [[9.2, 'The Shawshank Redemption', 1994],
             [9.2, 'The Godfather', 1972],
             [9.0, 'The Godfather: Part II', 1974],
             [8.9, 'Pulp Fiction', 1994],
             [8.9, "Schindler's List", 1993],
             [8.9, 'The Lord of the Rings: The Return of the King', 2003],
             [8.9, '12 Angry Men', 1957],
             [8.9, 'The Dark Knight', 2008],
             [8.9, 'Il buono, il brutto, il cattivo', 1966],
             [8.8, 'The Lord of the Rings: The Fellowship of the Ring',2001]]), 
             columns=["Rating", "Movie", "Date"])

top_10_movies_df
Out[93]:
Rating Movie Date
0 9.2 The Shawshank Redemption 1994
1 9.2 The Godfather 1972
2 9.0 The Godfather: Part II 1974
3 8.9 Pulp Fiction 1994
4 8.9 Schindler's List 1993
5 8.9 The Lord of the Rings: The Return of the King 2003
6 8.9 12 Angry Men 1957
7 8.9 The Dark Knight 2008
8 8.9 Il buono, il brutto, il cattivo 1966
9 8.8 The Lord of the Rings: The Fellowship of the Ring 2001

Alternatively, we could use an object of class "dictionary" to store our data. You can think of a list as a way of associating each value with a key, which is its place in the list. Thus:

In [77]:
Godfather_II_list = ['The Godfather: Part II', 1974, 9.0]

print("0:", Godfather_II_list[0])
print("1:", Godfather_II_list[1])
print("2:", Godfather_II_list[2])
0: The Godfather: Part II
1: 1974
2: 9.0

A dictionary is like a list, but instead of its keys being the numbers 0, 1, 2, 3..., the dictionary's keys can be anything you define:

In [89]:
Godfather_II_dict = {'title': 'The Godfather: Part II', 'date': 1974, 'rating': 9.0}
                     
print("title:", Godfather_II_dict['title']) 
print("date:", Godfather_II_dict['date'])
print("rating:", Godfather_II_dict['rating'])
title: The Godfather: Part II
date: 1974
rating: 9.0

In our top 10 movies example, we could create a dictionary that contains three keys: "Rating" as the key to a list of the ratings, "Movie" as a key to the movie titles, and "Date" as a key to the dates:

In [92]:
top_10_movies_dict = {"Rating" : [9.2, 9.2, 9.0, 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8], 
                      "Movie" : ['The Shawshank Redemption (1994)',
                                'The Godfather',
                                'The Godfather: Part II',
                                'Pulp Fiction',
                                "Schindler's List",
                                'The Lord of the Rings: The Return of the King',
                                '12 Angry Men',
                                'The Dark Knight',
                                'Il buono, il brutto, il cattivo',
                                'The Lord of the Rings: The Fellowship of the Ring'],
                      "Date" : [1994, 1972, 1974, 1994, 1993, 2003, 1957, 2008, 1966, 2001]
                     }

Now, we can use this dictionary to create a table with columns Rating, Movie, and Date:

In [94]:
top_10_movies_df2 = pd.DataFrame(data=top_10_movies_dict, columns=["Rating", "Movie", "Date"])
top_10_movies_df2
Out[94]:
Rating Movie Date
0 9.2 The Shawshank Redemption (1994) 1994
1 9.2 The Godfather 1972
2 9.0 The Godfather: Part II 1974
3 8.9 Pulp Fiction 1994
4 8.9 Schindler's List 1993
5 8.9 The Lord of the Rings: The Return of the King 2003
6 8.9 12 Angry Men 1957
7 8.9 The Dark Knight 2008
8 8.9 Il buono, il brutto, il cattivo 1966
9 8.8 The Lord of the Rings: The Fellowship of the Ring 2001

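Both ways produced essentially the same data table, the same dataframe. The list method created the table by using the lists to make up the rows of the table. The dictionary method took the dictionary keys and used them to make up the columns of the table. One difference is hidden in the display: np.array stores every entry of a mixed array as text, so the ratings and dates in the first table are strings, while the dictionary version keeps them as numbers.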
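The two tables look alike when displayed, but it is worth checking what kind of values they hold. A minimal sketch with two rows taken from the table above; the names rows, from_array, and from_dict are illustrative:

```python
import numpy as np
import pandas as pd

# two rows only, to keep the example short
rows = [[9.2, 'The Godfather', 1972],
        [9.0, 'The Godfather: Part II', 1974]]

from_array = pd.DataFrame(data=np.array(rows), columns=["Rating", "Movie", "Date"])
from_dict = pd.DataFrame(data={"Rating": [9.2, 9.0],
                               "Movie": ['The Godfather', 'The Godfather: Part II'],
                               "Date": [1972, 1974]})

print(type(from_array["Rating"][0]))    # text: np.array turned the mixed rows into strings
print(from_dict.dtypes)                 # Rating and Date stay numeric here
```

If numbers are needed for calculation, building the dataframe from a dictionary of columns (or from a plain list of rows without np.array) avoids the conversion to text.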

 

2.5.2. Reading in Dataframes

Luckily for you, most data tables in this course will be premade by somebody else, and precleaned. Perhaps the most common file type that is used for economic data is a comma-separated-values (.csv) file. If properly cleaned, such files are easy to read in as pandas dataframes. We will use the "pd.read_csv" function, which takes as its first argument the path or URL of the csv file you want to turn into a dataframe:

In [95]:
import pandas as pd

# Run this cell to read in the table
nipa_quarterly_accounts_df = pd.read_csv("https://delong.typepad.com/files/quarterly_accounts.csv")

What has this command done? Let's use the "head" method attached to objects in the dataframe class to see:

In [96]:
nipa_quarterly_accounts_df.head()
Out[96]:
Year Quarter Real GDI Real GDP Nominal GDP
0 1947 Q1 1912.5 1934.5 243.1
1 1947 Q2 1910.9 1932.3 246.3
2 1947 Q3 1914.0 1930.3 250.1
3 1947 Q4 1932.0 1960.7 260.3
4 1948 Q1 1984.4 1989.5 266.2
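To see what read_csv does without depending on a download, the sketch below feeds it a three-line excerpt typed in as a string (the real GDP figures are the first three rows shown above). Wrapping the string in io.StringIO makes it behave like a file; the names csv_text and small_df are illustrative.

```python
import io
import pandas as pd

# a three-row excerpt in csv layout: the first line holds the column names
csv_text = """Year,Quarter,Real GDP
1947,Q1,1934.5
1947,Q2,1932.3
1947,Q3,1930.3
"""

small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.shape)      # (number of rows, number of columns)
print(small_df.columns)    # column names come from the first line
print(small_df.head(2))    # head(k) shows the first k rows
```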

2.5.3. Indexing Dataframes

Oftentimes, tables will contain a lot of extraneous data that muddles our data tables, making it more difficult to quickly and accurately obtain the data we need. To correct for this, we can select out columns or rows that we need by indexing our dataframes.

The easiest way to index into a table is with square bracket notation. Suppose you wanted to obtain all of the Real GDP data from the table. Using a single pair of square brackets, you could index the table for "Real GDP":

In [98]:
# Run this cell and see what it outputs
nipa_quarterly_accounts_df["Real GDP"]
Out[98]:
0       1934.5
1       1932.3
2       1930.3
3       1960.7
4       1989.5
5       2021.9
6       2033.2
7       2035.3
8       2007.5
9       2000.8
10      2022.8
11      2004.7
12      2084.6
13      2147.6
14      2230.4
15      2273.4
16      2304.5
17      2344.5
18      2392.8
19      2398.1
20      2423.5
21      2428.5
22      2446.1
23      2526.4
24      2573.4
25      2593.5
26      2578.9
27      2539.8
28      2528.0
29      2530.7
        ...   
251    14541.9
252    14604.8
253    14745.9
254    14845.5
255    14939.0
256    14881.3
257    14989.6
258    15021.1
259    15190.3
260    15291.0
261    15362.4
262    15380.8
263    15384.3
264    15491.9
265    15521.6
266    15641.3
267    15793.9
268    15757.6
269    15935.8
270    16139.5
271    16220.2
272    16350.0
273    16460.9
274    16527.6
275    16547.6
276    16571.6
277    16663.5
278    16778.1
279    16851.4
280    16903.2
Name: Real GDP, Length: 281, dtype: float64

Notice how the above cell returns a column (a pandas Series) of all the real GDP values in their original order. Now, if you wanted to get the first real GDP value from this column, you could index it with another pair of square brackets:

In [99]:
nipa_quarterly_accounts_df["Real GDP"][0]
Out[99]:
1934.5

Pandas columns have many of the same properties as numpy arrays. Keep in mind that pandas dataframes, as well as many other data structures, are zero-indexed, meaning indexes start at 0 and end at the number of elements minus one.
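A small sketch of zero-based positions, using the first five real GDP values from the table above. The .iloc accessor selects by position, which makes the counting explicit; the name real_gdp is illustrative.

```python
import pandas as pd

# the first five real GDP values from the table above
real_gdp = pd.Series([1934.5, 1932.3, 1930.3, 1960.7, 1989.5], name="Real GDP")

print(real_gdp.iloc[0])                    # first element: position 0
print(real_gdp.iloc[len(real_gdp) - 1])    # last element: position (number of elements - 1)
print(real_gdp.max())                      # columns support array-style calculations
```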

If you wanted to create a new data table with selected columns from the original table, you can index with double brackets:

In [100]:
## Note: .head() returns the first five rows of the table
nipa_quarterly_accounts_df[["Year", "Quarter", "Real GDP", "Real GDI"]].head()
Out[100]:
Year Quarter Real GDP Real GDI
0 1947 Q1 1934.5 1912.5
1 1947 Q2 1932.3 1910.9
2 1947 Q3 1930.3 1914.0
3 1947 Q4 1960.7 1932.0
4 1948 Q1 1989.5 1984.4

You can get rid of columns you don't need using .drop():

In [104]:
nipa_quarterly_accounts_df.drop("Nominal GDP", axis=1).head()
Out[104]:
Year Quarter Real GDI Real GDP
0 1947 Q1 1912.5 1934.5
1 1947 Q2 1910.9 1932.3
2 1947 Q3 1914.0 1930.3
3 1947 Q4 1932.0 1960.7
4 1948 Q1 1984.4 1989.5

Finally, you can use square bracket notation to select rows by their positions. For example, if I wanted the rows labeled 20 through 30 (the slice stops just before position 31):

In [106]:
nipa_quarterly_accounts_df[20:31]
Out[106]:
Year Quarter Real GDI Real GDP Nominal GDP
20 1952 Q1 2398.3 2423.5 360.2
21 1952 Q2 2412.6 2428.5 361.4
22 1952 Q3 2435.0 2446.1 368.1
23 1952 Q4 2509.5 2526.4 381.2
24 1953 Q1 2554.3 2573.4 388.5
25 1953 Q2 2572.2 2593.5 392.3
26 1953 Q3 2555.7 2578.9 391.7
27 1953 Q4 2504.1 2539.8 386.5
28 1954 Q1 2510.1 2528.0 385.9
29 1954 Q2 2514.5 2530.7 386.7
30 1954 Q3 2537.1 2559.4 391.6
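The slice above stops one short of its end point, just like slices of Python lists. As a sketch on a made-up 40-row table (toy_df and its values are assumptions for illustration), here are three ways to ask for the same rows. Note that .loc selects by label and includes both end points, while plain brackets and .iloc select by position and exclude the stop value.

```python
import pandas as pd

# Hypothetical table with 40 rows and the default index 0..39
toy_df = pd.DataFrame({"Real GDP": [float(i) for i in range(40)]})

by_brackets = toy_df[20:31]    # positions 20..30; the stop value 31 is excluded
by_iloc = toy_df.iloc[20:31]   # same positional rule, stated explicitly
by_loc = toy_df.loc[20:30]     # label-based; both 20 and 30 are included

print(len(by_brackets), len(by_iloc), len(by_loc))
```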

2.5.4. Filtering Data

Indexing rows by their indices is only useful when you know exactly which rows you need, and it only gives you a contiguous range of entries. Working with data often involves huge datasets, making it inefficient and sometimes impossible to know which indices to look at. On top of that, most data analysis looks for patterns or specific conditions in the data, which simple index-based selection cannot find.

Thankfully, you can also use square bracket notation to filter data based on a condition. Suppose we only wanted real GDP and nominal GDP data from the year 2000 onward:

In [107]:
nipa_quarterly_accounts_df[nipa_quarterly_accounts_df["Year"] >= 2000][["Real GDP", "Nominal GDP"]]
Out[107]:
Real GDP Nominal GDP
212 12359.1 10031.0
213 12592.5 10278.3
214 12607.7 10357.4
215 12679.3 10472.3
216 12643.3 10508.1
217 12710.3 10638.4
218 12670.1 10639.5
219 12705.3 10701.3
220 12822.3 10834.4
221 12893.0 10934.8
222 12955.8 11037.1
223 12964.0 11103.8
224 13031.2 11230.1
225 13152.1 11370.7
226 13372.4 11625.1
227 13528.7 11816.8
228 13606.5 11988.4
229 13706.2 12181.4
230 13830.8 12367.7
231 13950.4 12562.2
232 14099.1 12813.7
233 14172.7 12974.1
234 14291.8 13205.4
235 14373.4 13381.6
236 14546.1 13648.9
237 14589.6 13799.8
238 14602.6 13908.5
239 14716.9 14066.4
240 14726.0 14233.2
241 14838.7 14422.3
... ... ...
251 14541.9 14566.5
252 14604.8 14681.1
253 14745.9 14888.6
254 14845.5 15057.7
255 14939.0 15230.2
256 14881.3 15238.4
257 14989.6 15460.9
258 15021.1 15587.1
259 15190.3 15785.3
260 15291.0 15973.9
261 15362.4 16121.9
262 15380.8 16227.9
263 15384.3 16297.3
264 15491.9 16475.4
265 15521.6 16541.4
266 15641.3 16749.3
267 15793.9 16999.9
268 15757.6 17031.3
269 15935.8 17320.9
270 16139.5 17622.3
271 16220.2 17735.9
272 16350.0 17874.7
273 16460.9 18093.2
274 16527.6 18227.7
275 16547.6 18287.2
276 16571.6 18325.2
277 16663.5 18538.0
278 16778.1 18729.1
279 16851.4 18905.5
280 16903.2 19057.7

69 rows × 2 columns

The nipa_quarterly_accounts_df table is being indexed by the condition nipa_quarterly_accounts_df["Year"] >= 2000, which evaluates to a column of True/False values, one per row. Indexing with it returns a table containing only the rows whose "Year" is greater than or equal to $2000$. We then index this table with the double bracket notation from the previous section to get only the real GDP and nominal GDP columns.
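To see what the condition itself produces, here is a minimal sketch on a made-up four-row table. The names toy_df, is_recent, recent, and same_with_loc, as well as the round-number values, are assumptions for illustration and are not NIPA data.

```python
import pandas as pd

# Hypothetical table with made-up round numbers
toy_df = pd.DataFrame({
    "Year": [1999, 1999, 2000, 2001],
    "Real GDP": [100.0, 110.0, 120.0, 130.0],
    "Nominal GDP": [90.0, 95.0, 105.0, 115.0],
})

# The condition is a Series of True/False values, one per row
is_recent = toy_df["Year"] >= 2000
print(is_recent.tolist())

# Indexing with it keeps only the rows marked True
recent = toy_df[is_recent][["Real GDP", "Nominal GDP"]]

# .loc does the row filter and the column selection in one step
same_with_loc = toy_df.loc[is_recent, ["Real GDP", "Nominal GDP"]]
print(recent)
```

Notice that the surviving rows keep their original index labels (2 and 3 here), which is why the filtered output above starts at 212 rather than 0.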

Suppose now we wanted a table with data from the first quarter of each year in which real GDP was less than 5000 or nominal GDP was greater than 15,000.

In [108]:
nipa_quarterly_accounts_df[(nipa_quarterly_accounts_df["Quarter"] ==
    "Q1") & ((nipa_quarterly_accounts_df["Real GDP"] < 5000) | 
    (nipa_quarterly_accounts_df["Nominal GDP"] > 15000))]
Out[108]:
Year Quarter Real GDI Real GDP Nominal GDP
0 1947 Q1 1912.5 1934.5 243.1
4 1948 Q1 1984.4 1989.5 266.2
8 1949 Q1 2001.5 2007.5 275.4
12 1950 Q1 2060.1 2084.6 281.2
16 1951 Q1 2281.0 2304.5 336.4
20 1952 Q1 2398.3 2423.5 360.2
24 1953 Q1 2554.3 2573.4 388.5
28 1954 Q1 2510.1 2528.0 385.9
32 1955 Q1 2661.6 2683.8 413.8
36 1956 Q1 2775.4 2770.0 440.5
40 1957 Q1 2862.0 2854.5 470.6
44 1958 Q1 2779.9 2772.7 468.4
48 1959 Q1 2976.5 2976.6 511.1
52 1960 Q1 3121.9 3123.2 543.3
56 1961 Q1 3109.9 3102.3 545.9
60 1962 Q1 3328.6 3336.8 595.2
64 1963 Q1 3469.1 3456.1 622.7
68 1964 Q1 3658.6 3672.7 671.1
72 1965 Q1 3885.5 3873.5 719.2
76 1966 Q1 4167.8 4201.9 797.3
80 1967 Q1 4286.5 4324.9 846.0
84 1968 Q1 4465.6 4490.6 911.1
88 1969 Q1 4665.4 4691.6 995.4
92 1970 Q1 4690.4 4707.1 1053.5
96 1971 Q1 4778.0 4834.3 1137.8
256 2011 Q1 14924.4 14881.3 15238.4
260 2012 Q1 15500.4 15291.0 15973.9
264 2013 Q1 15642.7 15491.9 16475.4
268 2014 Q1 15912.8 15757.6 17031.3
272 2015 Q1 16599.6 16350.0 17874.7
276 2016 Q1 16776.1 16571.6 18325.2
280 2017 Q1 16992.1 16903.2 19057.7

Many different conditions can be included to filter, and you can use & and | operators to connect them together. Make sure to include parentheses for each condition!
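The parentheses matter because Python evaluates & and | before comparisons such as == and <, so an unparenthesized condition is grouped in a way you did not intend. One way to sidestep the problem, sketched here on a made-up table (toy_df and the mask names are assumptions for illustration), is to give each condition its own name first.

```python
import pandas as pd

# Hypothetical table with made-up round numbers
toy_df = pd.DataFrame({
    "Quarter": ["Q1", "Q2", "Q1", "Q1"],
    "Real GDP": [2000.0, 2100.0, 9000.0, 16000.0],
    "Nominal GDP": [300.0, 320.0, 8000.0, 18000.0],
})

# Each condition is a True/False Series with its own name
is_q1 = toy_df["Quarter"] == "Q1"
low_real = toy_df["Real GDP"] < 5000
high_nominal = toy_df["Nominal GDP"] > 15000

# & means "and", | means "or"; the parentheses group the "or" part
selected = toy_df[is_q1 & (low_real | high_nominal)]
print(selected)
```

Python's own and / or keywords do not work element by element on pandas columns, which is why & and | are used instead.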

Another way to reorganize data to make it more convenient is to sort the data by the values in a specific column. For example, if we wanted to find the highest real GDP since 1947, we could sort the table for real GDP:

In [109]:
nipa_quarterly_accounts_df.sort_values("Real GDP")
Out[109]:
Year Quarter Real GDI Real GDP Nominal GDP
2 1947 Q3 1914.0 1930.3 250.1
1 1947 Q2 1910.9 1932.3 246.3
0 1947 Q1 1912.5 1934.5 243.1
3 1947 Q4 1932.0 1960.7 260.3
4 1948 Q1 1984.4 1989.5 266.2
9 1949 Q2 1995.9 2000.8 271.7
11 1949 Q4 1979.6 2004.7 271.0
8 1949 Q1 2001.5 2007.5 275.4
5 1948 Q2 2030.2 2021.9 272.9
10 1949 Q3 2007.9 2022.8 273.3
6 1948 Q3 2031.5 2033.2 279.5
7 1948 Q4 2041.6 2035.3 280.7
12 1950 Q1 2060.1 2084.6 281.2
13 1950 Q2 2144.4 2147.6 290.7
14 1950 Q3 2225.9 2230.4 308.5
15 1950 Q4 2268.9 2273.4 320.3
16 1951 Q1 2281.0 2304.5 336.4
17 1951 Q2 2321.3 2344.5 344.5
18 1951 Q3 2362.0 2392.8 351.8
19 1951 Q4 2382.7 2398.1 356.6
20 1952 Q1 2398.3 2423.5 360.2
21 1952 Q2 2412.6 2428.5 361.4
22 1952 Q3 2435.0 2446.1 368.1
23 1952 Q4 2509.5 2526.4 381.2
28 1954 Q1 2510.1 2528.0 385.9
29 1954 Q2 2514.5 2530.7 386.7
27 1953 Q4 2504.1 2539.8 386.5
30 1954 Q3 2537.1 2559.4 391.6
24 1953 Q1 2554.3 2573.4 388.5
26 1953 Q3 2555.7 2578.9 391.7
... ... ... ... ... ...
244 2008 Q1 14842.2 14889.5 14668.4
246 2008 Q3 14767.0 14891.6 14843.0
242 2007 Q3 14822.4 14938.5 14569.7
255 2010 Q4 14904.9 14939.0 15230.2
245 2008 Q2 14832.4 14963.4 14813.0
257 2011 Q2 14996.1 14989.6 15460.9
243 2007 Q4 14816.6 14991.8 14685.3
258 2011 Q3 15093.1 15021.1 15587.1
259 2011 Q4 15217.0 15190.3 15785.3
260 2012 Q1 15500.4 15291.0 15973.9
261 2012 Q2 15522.8 15362.4 16121.9
262 2012 Q3 15517.1 15380.8 16227.9
263 2012 Q4 15650.6 15384.3 16297.3
264 2013 Q1 15642.7 15491.9 16475.4
265 2013 Q2 15719.8 15521.6 16541.4
266 2013 Q3 15752.0 15641.3 16749.3
268 2014 Q1 15912.8 15757.6 17031.3
267 2013 Q4 15851.3 15793.9 16999.9
269 2014 Q2 16136.1 15935.8 17320.9
270 2014 Q3 16327.9 16139.5 17622.3
271 2014 Q4 16520.8 16220.2 17735.9
272 2015 Q1 16599.6 16350.0 17874.7
273 2015 Q2 16700.6 16460.9 18093.2
274 2015 Q3 16726.7 16527.6 18227.7
275 2015 Q4 16789.8 16547.6 18287.2
276 2016 Q1 16776.1 16571.6 18325.2
277 2016 Q2 16783.0 16663.5 18538.0
278 2016 Q3 16953.0 16778.1 18729.1
279 2016 Q4 16882.1 16851.4 18905.5
280 2017 Q1 16992.1 16903.2 19057.7

281 rows × 5 columns

But wait! The table looks like it's sorted in increasing order. This is because sort_values defaults to ordering the column in ascending order. To sort in descending order instead, add the optional parameter ascending=False:

In [110]:
nipa_quarterly_accounts_df.sort_values("Real GDP", ascending=False)
Out[110]:
Year Quarter Real GDI Real GDP Nominal GDP
280 2017 Q1 16992.1 16903.2 19057.7
279 2016 Q4 16882.1 16851.4 18905.5
278 2016 Q3 16953.0 16778.1 18729.1
277 2016 Q2 16783.0 16663.5 18538.0
276 2016 Q1 16776.1 16571.6 18325.2
275 2015 Q4 16789.8 16547.6 18287.2
274 2015 Q3 16726.7 16527.6 18227.7
273 2015 Q2 16700.6 16460.9 18093.2
272 2015 Q1 16599.6 16350.0 17874.7
271 2014 Q4 16520.8 16220.2 17735.9
270 2014 Q3 16327.9 16139.5 17622.3
269 2014 Q2 16136.1 15935.8 17320.9
267 2013 Q4 15851.3 15793.9 16999.9
268 2014 Q1 15912.8 15757.6 17031.3
266 2013 Q3 15752.0 15641.3 16749.3
265 2013 Q2 15719.8 15521.6 16541.4
264 2013 Q1 15642.7 15491.9 16475.4
263 2012 Q4 15650.6 15384.3 16297.3
262 2012 Q3 15517.1 15380.8 16227.9
261 2012 Q2 15522.8 15362.4 16121.9
260 2012 Q1 15500.4 15291.0 15973.9
259 2011 Q4 15217.0 15190.3 15785.3
258 2011 Q3 15093.1 15021.1 15587.1
243 2007 Q4 14816.6 14991.8 14685.3
257 2011 Q2 14996.1 14989.6 15460.9
245 2008 Q2 14832.4 14963.4 14813.0
255 2010 Q4 14904.9 14939.0 15230.2
242 2007 Q3 14822.4 14938.5 14569.7
246 2008 Q3 14767.0 14891.6 14843.0
244 2008 Q1 14842.2 14889.5 14668.4
... ... ... ... ... ...
26 1953 Q3 2555.7 2578.9 391.7
24 1953 Q1 2554.3 2573.4 388.5
30 1954 Q3 2537.1 2559.4 391.6
27 1953 Q4 2504.1 2539.8 386.5
29 1954 Q2 2514.5 2530.7 386.7
28 1954 Q1 2510.1 2528.0 385.9
23 1952 Q4 2509.5 2526.4 381.2
22 1952 Q3 2435.0 2446.1 368.1
21 1952 Q2 2412.6 2428.5 361.4
20 1952 Q1 2398.3 2423.5 360.2
19 1951 Q4 2382.7 2398.1 356.6
18 1951 Q3 2362.0 2392.8 351.8
17 1951 Q2 2321.3 2344.5 344.5
16 1951 Q1 2281.0 2304.5 336.4
15 1950 Q4 2268.9 2273.4 320.3
14 1950 Q3 2225.9 2230.4 308.5
13 1950 Q2 2144.4 2147.6 290.7
12 1950 Q1 2060.1 2084.6 281.2
7 1948 Q4 2041.6 2035.3 280.7
6 1948 Q3 2031.5 2033.2 279.5
10 1949 Q3 2007.9 2022.8 273.3
5 1948 Q2 2030.2 2021.9 272.9
8 1949 Q1 2001.5 2007.5 275.4
11 1949 Q4 1979.6 2004.7 271.0
9 1949 Q2 1995.9 2000.8 271.7
4 1948 Q1 1984.4 1989.5 266.2
3 1947 Q4 1932.0 1960.7 260.3
0 1947 Q1 1912.5 1934.5 243.1
1 1947 Q2 1910.9 1932.3 246.3
2 1947 Q3 1914.0 1930.3 250.1

281 rows × 5 columns

Now we can clearly see that the highest real GDP was attained in the first quarter of 2017, the last quarter in this dataset, with a value of 16903.2.
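Sorting the whole table works, but if all you want is the row with the largest value, pandas can find it directly. A minimal sketch, using a three-row table whose values are copied from the last three rows above (toy_df, peak_label, peak_row, and top_two are names invented for this example):

```python
import pandas as pd

# Hypothetical toy table built from the last three rows shown above
toy_df = pd.DataFrame({
    "Year": [2016, 2016, 2017],
    "Quarter": ["Q3", "Q4", "Q1"],
    "Real GDP": [16778.1, 16851.4, 16903.2],
})

# idxmax gives the index label of the largest value in the column
peak_label = toy_df["Real GDP"].idxmax()
peak_row = toy_df.loc[peak_label]
print(peak_row)

# nlargest returns the top n rows, already in descending order
top_two = toy_df.nlargest(2, "Real GDP")
print(top_two)
```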

2.5.5. Some Useful Functions for Numeric Data

Here are a few useful functions when dealing with numeric data columns. To find the minimum value in a column, call min() on a column of the table.

In [ ]:
nipa_quarterly_accounts_df["Real GDP"].min()

To find the maximum value, call max().

In [ ]:
nipa_quarterly_accounts_df["Nominal GDP"].max()

And to find the average value of a column, use mean().

In [ ]:
nipa_quarterly_accounts_df["Real GDI"].mean()
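Since the three cells above have not been run, here is a sketch of what these functions return on a small column you can check by hand. The table toy_df holds the first four Real GDI values shown earlier; the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy table: the first four Real GDI values from the table above
toy_df = pd.DataFrame({"Real GDI": [1912.5, 1910.9, 1914.0, 1932.0]})

real_gdi = toy_df["Real GDI"]

smallest = real_gdi.min()
largest = real_gdi.max()
average = real_gdi.mean()   # sum of the values divided by how many there are

# agg computes several summaries at once and returns them as a labeled Series
summary = real_gdi.agg(["min", "max", "mean"])
print(summary)
```

Calling .describe() on a column is another quick way to get the count, mean, minimum, maximum, and quartiles in one step.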

 

3.0 Introduction to Data Visualization

Now that you can read in data and manipulate it, you are ready to learn how to visualize it. To begin, run the cell below to import the plotting package we will be using.

In [111]:
%matplotlib inline
import matplotlib.pyplot as plt

We will be using US unemployment data from FRED to show what we can do with data. The statement below reads the CSV file into a pandas DataFrame.

In [112]:
import pandas as pd

unemployment_data = pd.read_csv("https://delong.typepad.com/detailed_unemployment.csv")
unemployment_data.head()
Out[112]:
date total_unemployed more_than_15_weeks not_in_labor_searched_for_work multi_jobs leavers losers housing_price_index
0 11/1/10 16.9 8696 2531 6708 5.7 63.0 186.07
1 12/1/10 16.6 8549 2609 6899 6.4 61.2 183.27
2 1/1/11 16.2 8393 2800 6816 6.5 60.1 181.35
3 2/1/11 16.0 8175 2730 6741 6.4 60.2 179.66
4 3/1/11 15.9 8166 2434 6735 6.4 60.3 178.84

One of the advantages of pandas is its built-in plotting methods. We can simply call .plot() on a dataframe to plot columns against one another; all we have to do is specify which column goes on which axis. Note that read_csv leaves the date column as plain text unless you ask it to parse dates. The plot below comes out in chronological order because the rows of the file are already sorted by date.

Note: total_unemployed is expressed in percent, not as a fraction. Divide it by 100 to get the unemployment rate as a fraction.
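Both points can be made explicit in code. The sketch below is an illustration on a three-row table whose values are copied from the first rows shown above. The names toy_unemployment and unemployment_rate are assumptions, and so is the date format, which I am reading as month/day/two-digit year. It converts the date strings into real dates with pd.to_datetime and adds a column holding the rate as a fraction.

```python
import pandas as pd

# Hypothetical toy table copied from the first three rows above
toy_unemployment = pd.DataFrame({
    "date": ["11/1/10", "12/1/10", "1/1/11"],
    "total_unemployed": [16.9, 16.6, 16.2],
})

# Assumed format: month/day/two-digit year, so "11/1/10" is 1 November 2010
toy_unemployment["date"] = pd.to_datetime(
    toy_unemployment["date"], format="%m/%d/%y"
)

# Percent -> fraction: 16.9 percent becomes roughly 0.169
toy_unemployment["unemployment_rate"] = toy_unemployment["total_unemployed"] / 100

print(toy_unemployment)
```

Once the column holds real dates, pandas can order and label the horizontal axis by time rather than treating each date as a piece of text.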

In [113]:
unemployment_data.plot(x='date', y='total_unemployed')
Out[113]:
<matplotlib.axes._subplots.AxesSubplot at 0x1079bbeb8>

The base package for most plotting in Python is matplotlib. Below we will look at how to plot with it. First we will extract the columns that we are interested in, then plot them in a scatter plot. Note that plt is the conventional alias for matplotlib.pyplot.

In [114]:
total_unemployed = unemployment_data['total_unemployed']
not_labor = unemployment_data['not_in_labor_searched_for_work']

# Plot the data by passing the x values and then the y values
plt.scatter(total_unemployed, not_labor)

# we can then go on to customize the plot with labels
plt.xlabel("Percent Unemployed")
plt.ylabel("Total Not In Labor, Searched for Work")
Out[114]:
<matplotlib.text.Text at 0x107f3c0f0>

Though matplotlib is sometimes considered an "ugly" plotting tool, it is powerful. It is highly customizable and is the foundation for most Python plotting libraries. Check out the documentation to get a sense of all of the things you can do with it, which extend far beyond scatter and line plots. An arguably more attractive package is seaborn, which we will go over in future notebooks.
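To give a taste of that customizability, here is a sketch using matplotlib's figure-and-axes interface, which hands you an object (ax) for each plot so you can adjust it piece by piece. The lists hold the five values from the .head() output earlier; the variable names, figure size, color, and title are choices made for this example, not something the notebook prescribes.

```python
import matplotlib.pyplot as plt

# The first five rows from the table above, typed in by hand
percent_unemployed = [16.9, 16.6, 16.2, 16.0, 15.9]
not_in_labor = [2531, 2609, 2800, 2730, 2434]

# plt.subplots returns the whole figure and one set of axes to draw on
fig, ax = plt.subplots(figsize=(6, 4))

ax.scatter(percent_unemployed, not_in_labor, color="tab:blue", marker="o")

# Labels, a title, and a light grid are all set on the axes object
ax.set_xlabel("Percent Unemployed")
ax.set_ylabel("Total Not In Labor, Searched for Work")
ax.set_title("Unemployment and discouraged searchers")
ax.grid(True)
```

In a notebook with %matplotlib inline, the figure is displayed when the cell finishes running.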

 

Understanding Check 3.1: Try plotting the total percent of people unemployed vs those unemployed for more than 15 weeks.

In [115]:
total_unemployed = unemployment_data['total_unemployed']
unemp_15_weeks = unemployment_data['more_than_15_weeks']

plt.scatter(total_unemployed, unemp_15_weeks)
plt.xlabel('total percent of people unemployed')
plt.ylabel('people unemployed for more than 15 weeks')

# note: plt.show() is the equivalent of print, but for graphs
plt.show()

Some materials in this notebook were taken from Data 8, CS 61A, and DS Modules lessons.

The Fine Chancery Hand of the Twenty-First Century

David Guarino, UCB '07, on LinkedIn
