Instructor: Amy Lee
Developers: Michaela Palmer, Maya Shen, Cynthia Leu, Chris Cheung
FPF 2017
Welcome to lab! Please read this lab in its entirety, as the analysis will make a lot more sense with the background context provided. This lab is intended to be a hands-on introduction to data science as applied to Chinatown demographics and the analysis of primary texts.
We will be reading and analyzing representations of Chinatown in the form of data and maps. In addition, we will learn how data tools can be used to read and analyze large volumes of text.
You are currently working in a Jupyter Notebook. A Notebook allows text and code to be combined into one document. Each rectangular section of a notebook is called a "cell." There are two types of cells in this notebook: text cells and code cells.
Jupyter allows you to run simulations and regressions in real time. To do this, select a code cell and click the "run cell" button at the top that looks like ▶| to confirm any changes. Alternatively, you can hold down the shift key and then press return or enter.
In the following simulations, anytime you see In [ ] you should click the "run cell" button to see output. If you get an error message after running a cell, go back to the beginning of the lab and make sure that every previous code cell has been run.
Cells (like this one) can be edited by double-clicking on them. This cell is a text cell, written in a simple format called Markdown that adds formatting and section headings. You don't need to worry about Markdown today, but it's a fun and easy tool to learn.
After you edit a cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions.) You can also press SHIFT-ENTER to run any cell or to progress from one cell to the next.
Other cells contain code in the Python programming language. Running a code cell will execute all of the code it contains.
Try running this cell:
print("Hello, World!")
We will now quickly go through some very basic functionality of Python, which we'll be using throughout the rest of this notebook.
Quantitative information arises everywhere in data science. In addition to representing commands to print out lines, expressions can represent numbers and methods of combining numbers.
The expression 3.2500 evaluates to the number 3.25. (Run the cell and see.)
3.2500
We don't always need to say print, because Jupyter automatically displays the value of the last expression in a code cell. If you want to print more than one line, though, do use print.
print(3)
4
5
Many basic arithmetic operations are built into Python, like * (multiplication), + (addition), - (subtraction), and / (division). There are many others, which you can find information about here. Use parentheses to specify the order of operations; they act according to PEMDAS, just as you may have learned in school. Use parentheses for a happy new year!
2 + (6 * 5 - (6 * 3)) ** 2 * (( 2 ** 3 ) / 4 * 7)
We sometimes want to work with the result of some computation more than once. To be able to do that without repeating code everywhere we want to use it, we can store it in a variable with an assignment statement, which has the variable name on the left, an equals sign, and the expression to be evaluated and stored on the right. In the cell below, (3 * 11 + 5) / 2 - 9 evaluates to 10 and gets stored in the variable result.
result = (3 * 11 + 5) / 2 - 9
result
One important form of an expression is the call expression, which first names a function and then describes its arguments. The function returns some value, based on its arguments. Some important mathematical functions are:
Function | Description
---|---
abs | Returns the absolute value of its argument
max | Returns the maximum of all its arguments
min | Returns the minimum of all its arguments
round | Rounds its argument to the nearest integer
Here are two call expressions that both evaluate to 3:
abs(2 - 5)
max(round(2.8), min(pow(2, 10), -1 * pow(2, 10)))
These function calls first evaluate the expressions in the arguments (inside the parentheses), then evaluate the function on the results. abs(2 - 5) first evaluates to abs(-3), then returns 3.
A statement is a whole line of code. Some statements are just expressions, like the examples above; they can be broken down into subexpressions, each of which is evaluated individually before the statement is evaluated as a whole.
The most common way to combine or manipulate values in Python is by calling functions. Python comes with many built-in functions that perform common operations.
For example, the abs function takes a single number as its argument and returns the absolute value of that number. The absolute value of a number is its distance from 0 on the number line, so abs(5) is 5 and abs(-5) is also 5.
abs(5)
abs(-5)
Functions can be called as above, with the argument in parentheses at the end, or by using "dot notation", where the function is called on the value itself, as in the cell immediately below.
from datascience import make_array
nums = make_array(1, 2, 3) # makes a list of items, in this case, numbers
nums.mean() # finds the average of the array
%%capture
!python -m spacy download en
!pip install --no-cache-dir wordcloud
!pip3 install --no-cache-dir -U folium
!pip install --no-cache-dir textblob
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import *
%matplotlib inline
import folium
import pandas as pd
from IPython.display import HTML, display, IFrame
import spacy
from wordcloud import WordCloud
from textblob import TextBlob
import geojson
This map both reflected the pervasive bias against the Chinese in California and, in turn, further fostered the hysteria. It was published as part of an official report of a Special Committee established by the San Francisco Board of Supervisors "on the Condition of the Chinese Quarter." The Report resulted from a dramatic increase in hostility to the Chinese, particularly because many Chinese laborers had been driven out of other Western states by vigilantes and sought safety in San Francisco (Shah 2001, 37).
The substance and tone of the Report is best illustrated by a few excerpts: "The general aspect of the streets and habitations was filthy in the extreme, . . . a slumbering pest, likely at any time to generate and spread disease, . . . a constant source of danger . . . , the filthiest spot inhabited by men, women and children on the American continent." (Report 4-5). "The Chinese brought here with them and have successfully maintained and perpetuated the grossest habits of bestiality practiced by the human race." (Ibid. 38).
The map highlights the Committee's points, particularly the pervasiveness of gambling, prostitution and opium use. It shows the occupancy of the street floor of every building in Chinatown, color coded to show: General Chinese Occupancy, Chinese Gambling Houses, Chinese Prostitution, Chinese Opium Resorts, Chinese Joss Houses, and White Prostitution. The Report concludes with a recommendation that the Chinese be driven out of the City by stern enforcement of the law: "compulsory obedience to our laws [is] necessarily obnoxious and revolting to the Chinese, and the more rigidly this enforcement is insisted upon and carried out the less endurable will existence be to them here, the less attractive will life be to them in California. Fewer will come and fewer will remain. . . . Scatter them by such a policy as this to other States . . . ." (Ibid. 67-68)
In this section, we will examine some of the factors that influence population growth and how they are changing the landscape of Chinatowns across the U.S.
Now it's time to work with tables and explore some real data. A Table is like the array we made above with make_array, except that it organizes many arrays into named columns, with one value per column in each row.
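For example, here's a minimal sketch of building a small Table by hand (the neighborhoods and numbers below are made up purely for illustration):
from datascience import Table, make_array

# Each column is an array; each row is one observation.
# These values are invented just to show the structure of a Table.
example = Table().with_columns(
    "Neighborhood", make_array("Chinatown", "North Beach", "Mission"),
    "Population", make_array(14000, 12000, 45000)
)
example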
We're going to first look at the most recent demographic data from 2010-2015:
historical_data = Table.read_table('data/2010-2015.csv') # read in data from file
historical_data['FIPS'] = ['0' + str(x) for x in historical_data['FIPS']] # fix FIPS columns
historical_data.show(10) # show first ten rows
We can get some quick summary statistics by calling the .stats() function on our Table variable:
historical_data.stats()
So which census tract has the highest Asian population?
First we can find the highest population by using the max function:
max(historical_data['Asian'])
Let's plug that value into a query that uses the where and are.equal_to functions:
historical_data.where('Asian', are.equal_to(max(historical_data['Asian'])))
This FIPS code 06075035300 is tract 353. Does this make sense to you?
It might be better to look at which census tract has Asian as the highest proportion of the population:
historical_data['Asian_percentage'] = historical_data['Asian'] / historical_data['Population']
historical_data.show(5)
Now we can use the same method to get the max and subset our table:
max(historical_data['Asian_percentage'])
historical_data.where('Asian_percentage', are.equal_to(max(historical_data['Asian_percentage'])))
FIPS code 06075011800 is census tract 118. Does this make sense?
For your reference, here's a table of useful Table functions:
Name | Example | Purpose
---|---|---
Table | Table() | Create an empty table, usually to extend with data
Table.read_table | Table.read_table("my_data.csv") | Create a table from a data file
with_columns | tbl = Table().with_columns("N", np.arange(5), "2*N", np.arange(0, 10, 2)) | Create a copy of a table with more columns
column | tbl.column("N") | Create an array containing the elements of a column
sort | tbl.sort("N") | Create a copy of a table sorted by the values in a column
where | tbl.where("N", are.above(2)) | Create a copy of a table with only the rows that match some predicate
num_rows | tbl.num_rows | Compute the number of rows in a table
num_columns | tbl.num_columns | Compute the number of columns in a table
select | tbl.select("N") | Create a copy of a table with only some of the columns
drop | tbl.drop("2*N") | Create a copy of a table without some of the columns
take | tbl.take(np.arange(0, 6, 2)) | Create a copy of the table with only the rows whose indices are in the given array
join | tbl1.join("shared_column_name", tbl2) | Join together two tables with a common column name
are.equal_to() | tbl.where("SEX", are.equal_to(0)) | Find values equal to the one indicated
are.not_equal_to() | tbl.where("SEX", are.not_equal_to(0)) | Find values not equal to the one indicated
are.above() | tbl.where("AGE", are.above(30)) | Find values greater than the one indicated
are.below() | tbl.where("AGE", are.below(40)) | Find values less than the one indicated
are.between() | tbl.where("AGE", are.between(18, 60)) | Find values between the two indicated
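As a rough sketch of chaining a few of these together, here's one way to pull out the five 2015 tracts with the highest Asian share, using the historical_data table and the Asian_percentage column created above:
# Sketch: the five 2015 tracts with the highest Asian share of the population
(historical_data
    .where('Year', are.equal_to(2015))
    .select('FIPS', 'Population', 'Asian', 'Asian_percentage')
    .sort('Asian_percentage', descending=True)
    .take(np.arange(5)))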
If we were interested in the relationship between two variables in our dataset, we'd want to look at correlation.
The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables. ~Wikipedia
r = 1: the scatter diagram is a perfect straight line sloping upwards
r = -1: the scatter diagram is a perfect straight line sloping downwards.
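As a quick sketch, r for a single pair of columns can be computed directly with NumPy (using the historical_data table loaded above):
# Pearson correlation between total population and Asian population
r = np.corrcoef(historical_data.column('Population'),
                historical_data.column('Asian'))[0, 1]
r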
Let's calculate the correlation coefficient between each pair of continuous variables in our dataset. We can use the .to_df().corr() method:
historical_data.to_df().corr()
We often visualize correlations with a scatter plot:
historical_data.scatter('Population', 'Asian')
historical_data.scatter('One_race', 'Asian')
historical_data.scatter('Two_or_more_races', 'Asian')
To look at a one-to-one relationship over time we might prefer a simple line graph. We can first group the data by Year, then take the mean of the Population, and plot that against Year:
historical_data.to_df().groupby('Year')['Population'].mean()
historical_data.to_df().groupby('Year')['Population'].mean().plot()
historical_data.to_df().groupby('Year')['Asian_percentage'].mean().plot()
Let's look at only the year 2015:
historical_2015 = historical_data.where('Year', are.equal_to(2015))
historical_2015.show(5)
We can make a choropleth map with a little helper function; don't worry about the code below!
def choro_column(tab, column):
    # Census-tract boundaries for San Francisco
    sf_2010 = geojson.load(open("data/2010-sf.geojson"))
    # Six evenly spaced color bins spanning the column's range
    threshold_scale = np.linspace(min(tab[column]), max(tab[column]), 6, dtype=float).tolist()
    # Base map centered on San Francisco
    mapa = folium.Map(location=(37.7793784, -122.4063879), zoom_start=11)
    # Shade each tract by the value of the chosen column, matching on FIPS code
    mapa.choropleth(geo_data=sf_2010,
                    data=tab.to_df(),
                    columns=['FIPS', column],
                    fill_color='YlOrRd',
                    key_on='feature.properties.GEOID10',
                    threshold_scale=threshold_scale)
    mapa.save("map.html")
    return mapa
Here's a choropleth of the total population:
choro_column(historical_2015, 'Population')
IFrame('map.html', width=700, height=400)
Let's look at only Asian:
choro_column(historical_2015, 'Asian')
Try making one more choropleth below with only Asian_percentage:
Now let's take a look at the historical data showing how the Asian population has changed over time, as compared to the black population.
First, let's load in all of our decennial San Francisco Chinatown census data, acquired from an online source called Social Explorer. Let's examine this dataset to get a sense of what's in it.
historical = Table.read_table('data/process.csv')
historical.show(5)
historical['Other'] = historical['Total Population'] - historical['White'] - historical['Black']
historical.show(5)
You can use the mean function to find the average total population in Chinatown. Do you notice any significant changes between 1940 and 2010?
historical.to_df().groupby('Year')['Total Population'].mean()
Let's plot the results on a graph.
historical.to_df().groupby('Year')['Total Population'].mean().plot()
historical.to_df().groupby('Year')['White'].mean()
We can plot the average population of different racial groups.
historical.to_df().groupby('Year')['White'].mean().plot()
historical.to_df().groupby('Year')['Black'].mean()
historical.to_df().groupby('Year')['Black'].mean().plot()
historical.to_df().groupby('Year')['Other'].mean()
historical.to_df().groupby('Year')['Other'].mean().plot()
One of the goals of this module is to compare different Chinatowns from across the US. We will now compare the SF Chinatown data to the census data from Manhattan's Chinatown. Let's load the Manhattan data.
manhattan = Table.read_table('data/manhattan_cleaned.csv')
manhattan.show(10)
manhattan.to_df().corr()
manhattan.scatter('Chinese Population', 'White Population', color=['b','r'])
manhattan_2010 = manhattan.where('Year', are.equal_to(2010))
manhattan_2010.show()
def choro_column(tab, column):
    tab = tab.to_df()
    tab['Census Tract'] = tab['Census Tract'].astype(str).str.strip('0').str.strip('.')
    nyc_2010 = geojson.load(open("data/nyc-census-2010.geojson"))
    tracts = folium.features.GeoJson(nyc_2010)
    threshold_scale = np.linspace(min(tab[column]), max(tab[column]), 6, dtype=float).tolist()
    mapa = folium.Map(location=(40.7128, -74.00609), zoom_start=11)
    mapa.choropleth(geo_data=nyc_2010,
                    data=tab,
                    columns=['Census Tract', column],
                    fill_color='YlOrRd',
                    key_on='feature.properties.CTLabel',
                    threshold_scale=threshold_scale)
    mapa.save("map.html")
    return mapa
choro_column(manhattan_2010, 'Chinese Population')
IFrame('map.html', width=700, height=400)
manhattan_2010['Asian_percentage'] = manhattan_2010['Asian/Other Population'] / manhattan_2010['Total Population']
manhattan_2010.show(5)
choro_column(manhattan_2010, 'Asian_percentage')
IFrame('map.html', width=700, height=400)
In this class, we have been learning how to 'close-read' primary texts. Close-reading generally involves picking select passages and reading for the latent meanings embedded in word choice, syntax, the use of metaphors and symbols, etc. Here, we are introducing another way of analyzing primary texts using computational methods. Computational text analysis generally involves 'counting' words. Let's see how this works by analyzing some of the poems written by Chinese immigrants on Angel Island.
Run the following cell to import the poems from a .txt file.
with open('data/islandpoetry1_22.txt', "r") as f:
    raw = f.read()
print(raw)
We're interested in which words appear most often in our set of poems. It's pretty hard to read or see much in this form. We'll come back to the question of which words are most common, with actual numbers, a bit later; for now, run the following cell to generate two interesting visualizations of the most common words (minus those such as "the", "a", etc.).
wordcloud = WordCloud().generate(raw)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
# lower max_font_size
wordcloud = WordCloud(max_font_size=40).generate(raw)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Oops, it seems we've forgotten just how many poems we have in our set. Luckily we have a quick way of finding out! Each "\n" in the display of the poem text indicates a line break. It turns out that each poem is separated by an empty line, i.e., two consecutive line breaks ("\n\n").
num_poems = len(raw.split("\n\n"))
num_poems
We can also use this idea to calculate the number of characters in each poem.
num_char_per_poem = [len(p) for p in raw.split("\n\n")]
print(num_char_per_poem)
This is interesting but seems like just a long list of numbers. What about the average number of characters per poem?
np.mean(num_char_per_poem)
Let's look at it in histogram form to get a better idea of our data.
Table().with_columns("Character Count", np.asarray(num_char_per_poem)).hist()
Each bar of this histogram covers a range of character counts (its position on the x-axis); the bar's area tells us what proportion of the poems fall in that range.
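To check one of the bars directly, you could compute the proportion of poems in a given character-count range with NumPy (the 500-1000 range below is just an example):
# Proportion of poems whose character count falls in an example range
counts = np.asarray(num_char_per_poem)
np.count_nonzero((counts >= 500) & (counts < 1000)) / len(counts)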
We can use "\n" to look at enjambment as well: a line is enjambed when it ends without punctuation, so the phrase carries over into the next line. Let's calculate the proportion of enjambed lines out of the total number of lines in each poem.
from string import punctuation

poems = raw.split("\n\n")
all_poems_enjambment = []
for p in poems:
    lines = p.split("\n")
    enjambment = 0
    for l in lines:
        try:
            # A line ending in punctuation is end-stopped; otherwise count it as enjambed
            if l[-1] in punctuation:
                pass
            else:
                enjambment += 1
        except:
            pass
    enj = enjambment / len(lines)
    all_poems_enjambment.append(enj)
print(all_poems_enjambment)
Once again, what about the average?
np.mean(all_poems_enjambment)
Let's now return to the question of which words appear most frequently in these 49 poems. First we replace all the "\n"s with spaces, then use spaCy, an open-source library for Natural Language Processing (NLP), to parse the text.
nlp = spacy.load('en', parser=False)
parsed_text = nlp(raw.replace("\n", " "))
We can separate all the words/symbols and put them in a table.
toks_tab = Table()
toks_tab.append_column(label="Word", values=[word.text for word in parsed_text])
toks_tab.show()
toks_tab.append_column(label="POS", values=[word.pos_ for word in parsed_text])
toks_tab.show()
Now let's create a new table with even more columns using the "tablefy" function below.
def tablefy(parsed_text):
    toks_tab = Table()
    toks_tab.append_column(label="Word", values=[word.text for word in parsed_text])
    toks_tab.append_column(label="POS", values=[word.pos_ for word in parsed_text])
    toks_tab.append_column(label="Lemma", values=[word.lemma_ for word in parsed_text])
    toks_tab.append_column(label="Stop Word", values=[word.is_stop for word in parsed_text])
    toks_tab.append_column(label="Punctuation", values=[word.is_punct for word in parsed_text])
    toks_tab.append_column(label="Space", values=[word.is_space for word in parsed_text])
    toks_tab.append_column(label="Number", values=[word.like_num for word in parsed_text])
    toks_tab.append_column(label="OOV", values=[word.is_oov for word in parsed_text])
    toks_tab.append_column(label="Dependency", values=[word.dep_ for word in parsed_text])
    return toks_tab
tablefy(parsed_text).show()
Next, let's look at the frequency of words. However, we want to get rid of words such as "the" and "and" (stop words), punctuation, and spaces. We can do this by selecting rows that are not stop words, punctuation, or spaces, then grouping by word and sorting by count!
word_counts = tablefy(parsed_text).where("Stop Word", are.equal_to(False)).where(
    "Punctuation", are.equal_to(False)).where(
    "Space", are.equal_to(False)).group("Word").sort("count", descending=True)
word_counts
In this table, we have both the words "sad" and "sadness" - it seems strange to separate them. It turns out that these words are part of the same "lexeme", or a unit of meaning. For example, "run", "runs", "ran", and "running" are all part of the same lexeme with the lemma 'run'. Lemmas are another column in our table from above! Nice!
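As a quick illustration (using the nlp model loaded above), you can ask spaCy for the lemma of each token in a made-up snippet; all four of these should map to the same lemma:
# lemma_ gives each token's dictionary form, e.g. 'run' for "runs", "ran", "running"
[w.lemma_ for w in nlp("run runs ran running")]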
lemma_counts = tablefy(parsed_text).where("Stop Word", are.equal_to(False)).where(
    "Punctuation", are.equal_to(False)).where(
    "Space", are.equal_to(False)).group("Lemma").sort("count", descending=True)
lemma_counts
Now let's look at how many words there are of each part of speech.
pos_counts = tablefy(parsed_text).where("Stop Word", are.equal_to(False)).where(
    "Punctuation", are.equal_to(False)).where(
    "Space", are.equal_to(False)).group("POS").sort("count", descending=True)
pos_counts
We can also look at the proportions of each POS out of all the words!
for i in np.arange(pos_counts.num_rows):
    pos = pos_counts.column("POS").item(i)
    count = pos_counts.column("count").item(i)
    total = np.sum(pos_counts.column("count"))
    proportion = str(count / total)
    print(pos + " proportion: " + proportion)
If we're interested in words' relations with each other, we can look at words that are next to each other. The function below returns the word following the first instance of the word you search for in the specified source.
def nextword(word, source):
    for i, w in enumerate(source):
        if w == word:
            return source[i+1]
Mess around a bit with this function! Change the "word" argument.
split_txt = raw.split()
# Change the target or "home" to other words!
nextword("home", split_txt)
We are specifically interested in the word "I" and the words that poets use in succession. Let's make an array of all the words that come after it in these poems. For easier viewing, the phrases have been printed out. What do you notice?
one_after_i = make_array()
for i, w in enumerate(split_txt):
    if w == "I":
        one_after_i = np.append(one_after_i, split_txt[i+1])
for i in one_after_i:
    print("I " + i)
Above we have only shown the next word; what about the next two words? Does this give you any new insight?
two_after_i = make_array()
for i, w in enumerate(split_txt):
    if w == "I":
        two_after_i = np.append(two_after_i, split_txt[i+1] + " " + split_txt[i+2])
for i in two_after_i:
    print("I " + i)
Try doing some exploring of your own! If you're feeling stuck, feel free to copy and edit code from above.
# Write your own code here!
We can do some analysis of the overall sentiments, or emotions conveyed, in each of the poems using the code below. Here, we analyze the overall sentiment of each poem individually. Once you run the next cell, you'll see the sentiment values for each poem. A value below 0 denotes a negative sentiment, and a value above 0 is positive.
sentiments = make_array()
for p in poems:
    poem = TextBlob(p)
    sentiments = np.append(sentiments, poem.sentiment.polarity)
sentiments
Now, what does this mean? It appears that the number of poems with negative sentiment is about the same as the number of poems with positive or neutral (0) sentiment. We can look at the proportion of negative poems in the next cell:
neg_proportion = np.count_nonzero(sentiments < 0)/len(sentiments)
neg_proportion
Okay, so just under half of the poems have negative sentiment. So, on average the poems have slightly positive sentiment, right?
We can also perform sentiment analysis across the text of all of the poems at once and see what happens:
poems_all = TextBlob(raw.replace('\n', ' '))
poems_all.sentiment.polarity
This way of analyzing the text tells us that the language in all of the poems has slightly negative sentiment.
One more analysis we can perform is computing the average sentiment of the poems, given the list of each individual poem's sentiments that we computed earlier:
np.mean(sentiments)
This method also tells us that our poems have slightly negative sentiment, on average.
Here, let's look at one of the poems with its sentiment value:
poem_3 = poems[3].replace('\n', ' ')
print(poem_3)
print(TextBlob(poem_3).sentiment.polarity)
Let's look at one more poem:
poem_47 = poems[47].replace('\n', ' ')
print(poem_47)
print(TextBlob(poem_47).sentiment.polarity)
*Please fill out our feedback form!*