Lecture 6

In [ ]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

Review and continuation of Table operations

Last class we discussed the following table methods which return new Tables as output:

  1. tb.select(label): constructs a new table with just the specified columns
  2. tb.drop(label): constructs a new table in which the specified columns are omitted
  3. tb.sort(label): constructs a new table with rows sorted by the specified column
  4. tb.where(label, condition): constructs a new table with just the rows that match the condition

There are a number of properties we can extract from a Table including:

  • num_rows: returns the number of rows in a Table
  • num_columns: returns the number of columns in a Table

There are also a number of additional methods for Tables including:

  • relabel('column_name', 'new_name'): returns a table where the column name 'column_name' is now called 'new_name'
  • take(row_numbers): returns a Table with the selected row numbers
In [ ]:
# Load the ice cream data. Each row represents one ice cream cone.
cones = Table.read_table('cones.csv')
cones
In [ ]:
# select only the chocolate cones using the `where` method as we did last class
In [ ]:
# print the number of rows and columns
In [ ]:
# relabel a column
In [ ]:
# extract a row

Columns of Tables are Arrays

We can extract columns from a Table as either:

  • A new Table with fewer columns using tb.select()
  • An ndarray using tb.column()
In [ ]:
# select() returns a table
In [ ]:
 
In [ ]:
# column() returns an array
In [ ]:
 

Lists

Lists are one of the most widely used data structures in Python. They look like ndarrays, but they can hold heterogeneous types of data.

  • We construct lists using square brackets [], where the elements in the list are separated by commas.
  • We can access the third item in a list called my_list using my_list[2], because indexing starts at 0.
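
A short sketch of both points, with made-up values. It also shows what happens when the same mixed values are placed in an ndarray: NumPy converts them to a single common type (strings here), which is the contrast with lists.

```python
import numpy as np

# A list can mix types: an int, a string, a float, and a boolean
my_list = [1, 'two', 3.0, True]

# Indexing starts at 0, so index 2 is the third item
third_item = my_list[2]
print(third_item)

# An ndarray holds one type, so mixed values are coerced (to strings here)
as_array = np.array([1, 'two', 3.0])
print(as_array)
```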
In [ ]:

Constructing Tables

We have created tables by loading data from comma separated value files (.csv files). We can also create Tables from scratch by using:

  • Table(): constructs an empty Table
  • tb.with_columns("Name", array): returns a new Table with the given column(s) added
  • tb.with_row(list): returns a new Table with one row added; the list holds one value per column

Let's try creating a table that says how many blocks away different streets are from our classroom (now that we are back in person!).

In [ ]:
# create an array of street names
In [ ]:
# create a Table with street names and the distance from our classroom
In [ ]:
 
In [ ]:
# add another row to the Table
In [ ]:
# add another column to the Table saying whether a street is one-way or two-way
In [ ]:
 

Example: Census data

The US government conducts a census every 10 years. We can examine the census data to see interesting patterns in the population of people in the United States.

In [ ]:
# As of August 2021, this census file is online here: 
data = 'http://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/nc-est2019-agesex-res.csv'

# A local copy can be accessed here in case census.gov moves the file:
# data = path_data + 'nc-est2019-agesex-res.csv'

full = Table.read_table(data)
full
In [ ]:
# get a reduced set of columns that we want to analyze further
In [ ]:
# rename the columns to make them easier to work with
In [ ]:
# let's examine the data a little more
In [ ]:
# let's remove the totals (value of 999 in the AGE column)
In [ ]:
# let's split the data into male, female and everyone
In [ ]:
 
In [ ]:
# let's see which ages have the most people
In [ ]:
 
In [ ]:
# let's create a Table with age, males, and females
In [ ]:
 
In [ ]:
# let's add a percent female column to our Table
In [ ]:
 

Line Graphs

A useful way to visualize how one numerical variable changes as a function of another, such as time or age, is a line plot. We can do this using the tb.plot('x_col_name', 'y_col_name') method.

In [ ]:
# plot Percent Female as a function of Age
In [ ]:
 
In [ ]:
# plot Males and Females
In [ ]:
# see which ages have had the biggest changes between 2014 and 2019
In [ ]:
 
In [ ]:
# Let's look at the percent change between 2014 and 2019 for each age
In [ ]:
# plot percent change - any ideas why there are larger increases around age 72 and in the late 90s?
In [ ]: