This notebook will familiarize you with some of the basic strategies for data analysis that can be useful not only in this course, but possibly for the rest of your time at Cal. We will cover an overview of our computing environment, and then will explore the data on closure and VOT that you submit.
If you want a more in-depth introduction to Python, click here to explore that notebook. You should be able to get through this entire notebook without that tutorial; it is there only if you want to dive deeper into what is going on in the code.
1 - [Computing Environment](#computing environment)
2 - [Exploring the Data](#exploring data)
3 - Relationships between Closures
4 - [Comparing to Others](#to class)
This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results.
In a notebook, each rectangle containing text or code is called a cell.
Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called Markdown to add formatting and section headings. You don't need to learn Markdown, but you might want to.
After you edit a text cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions of the lab.)
Understanding Check 1: This paragraph is in its own text cell. Try editing it so that this sentence is the last sentence in the paragraph, and then click the "run cell" ▶| button. This sentence, for example, should be deleted. So should this one.
Other cells contain code in the Python 3 language. Python is a programming language: a vocabulary and set of grammatical rules for instructing a computer to perform specific tasks. Like natural human languages it has rules, but it differs from natural language in important ways.
There's a lot of terminology in programming languages, but you don't need to know it all in order to program effectively. From time to time, you'll see a cryptic message, but you can often get by without deciphering it, by utilizing appropriate resources (sometimes it's as simple as a Google search).
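For instance, misspelling or forgetting to define a name produces one of those cryptic messages. The cell below is a hypothetical example (the name `hello` is made up) that catches the error just to show what the message looks like:

```python
# Using a name that was never defined raises a NameError.
try:
    print(hello)  # 'hello' has no value yet
except NameError as err:
    message = str(err)

print(message)  # name 'hello' is not defined
```

In practice you would just read the last line of the error output, fix the typo, and re-run the cell.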
Running a code cell will execute all of the code it contains.
To run the code in a code cell, first click on that cell to activate it. It'll be highlighted with a little green or blue rectangle. Next, either press ▶| or hold down the `shift` key and press `return` or `enter`.
Try running this cell:
print("Hello, World!")
The fundamental building block of Python code is an expression. Cells can contain multiple lines with multiple expressions. When you run a cell, the lines of code are executed in the order in which they appear. Every `print` expression prints a line. Run the next cell and notice the order of the output.
print("First this line is printed,")
print("and then this one.")
You can use Jupyter notebooks for your own projects or documents. When you make your own notebook, you'll need to create your own cells for text and code.
To add a cell, click the + button in the menu bar. It'll start out as a code cell. You can change it to a text cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing "Markdown".
- `File > Save and Checkpoint` to save the notebook.
- `Kernel > Interrupt`, then try running the cell again.
- `Kernel > Restart`, then run through all of the cells with `Cell > Run All`.

Run the cell below so that we can get started on our module! These are our import statements (and a few other things). Because of the size of the Python community, if there is a function that you want to use, there is a good chance that someone has already written one and been kind enough to share their work in the form of packages. We can start using those packages by writing `import` and then the package name.
# imports -- just run this cell
import scipy
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import mode
from ipywidgets import interact
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib import colors
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings('ignore')
sns.set_style('darkgrid')
%matplotlib inline
We will start by familiarizing ourselves with the data.
To visualize the data, we need to load the file first. In the cell below, we assign `file_name` to the name of our dataset, which is a compilation of the results from the homework you completed last week. Note that we have `data/` in front of the file name, which means that our file `fall17.csv` is in the `data` directory (folder).
file_name = 'data/fall17.csv'
data = pd.read_csv(file_name)
data.head()
We are going to add several columns to our dataframe, one for each of the following:

- the average closure and VOT across all stops (`clo`/`vot`)
- the average voiced closure and voiced VOT (`vclo`/`vvot`)
- the average voiceless closure and voiceless VOT (`vlclo`/`vlvot`)

First we will add the column for the average of all of the closures for each row. To do that, we'll first pull out just the columns that we want to take the average of.
subset = data[['pclo', 'tclo', 'kclo', 'bclo', 'dclo', 'gclo']]
subset.head()
Then we will take the average across those rows.
clo_avg = subset.mean(axis=1)
clo_avg
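If `axis=1` is confusing, here is a tiny toy DataFrame (made-up numbers, not the class data) showing that `axis=1` averages across the columns of each row, while `axis=0` averages down each column:

```python
import pandas as pd

# Two rows, two columns of made-up numbers
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(toy.mean(axis=1).tolist())  # [2.0, 3.0] -- one average per row
print(toy.mean(axis=0).tolist())  # [1.5, 3.5] -- one average per column
```

Since we want one average closure per person (per row), we use `axis=1`.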
And finally, we will append those values to our dataframe as a column called `clo`.
data['clo'] = clo_avg
data.head()
We then repeat this process for all of the other columns that we want to create.
data['vot'] = data[['pvot', 'tvot', 'kvot', 'bvot', 'dvot', 'gvot']].mean(axis=1)
data['vclo'] = data[['bclo', 'dclo', 'gclo']].mean(axis=1)
data['vvot'] = data[['bvot', 'dvot', 'gvot']].mean(axis=1)
data['vlclo'] = data[['pclo', 'tclo', 'kclo']].mean(axis=1)
data['vlvot'] = data[['pvot', 'tvot', 'kvot']].mean(axis=1)
data.head()
Below we compute some basic statistics for the column `clo`.
closure_mode = mode(data['clo'])[0][0]
print('Mode: ', closure_mode)
data['clo'].describe()
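As an aside, the return format of scipy's `mode` has changed between versions; if the call above misbehaves in your environment, the same value can be computed with Python's standard library. A minimal sketch on made-up numbers:

```python
from collections import Counter

values = [110, 95, 110, 87, 110, 95]   # made-up closure durations (ms)
# most_common(1) returns a list with the (value, count) pair of the mode
most_common_value, count = Counter(values).most_common(1)[0]
print(most_common_value, count)  # 110 3
```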
We can calculate all of the above statistics (except mode) for the entire table with one line.
data.describe()
Now that we have our data in order, let's get a picture of the data with some plots.
Let's start by visualizing the distribution of `vot` with a histogram.
sns.distplot(data['vot'], kde_kws={"label": "vot"})
Next, we'll compare the distributions of the voiced and voiceless voice-onset times.
sns.distplot(data['vvot'], kde_kws={"label": "voiced vot"})
sns.distplot(data['vlvot'], kde_kws={"label": "voiceless vot"})
plt.xlabel('ms')
The distributions of the three voiceless stops are below.
sns.distplot(data['pvot'], kde_kws={"label": "pvot"})
sns.distplot(data['tvot'], kde_kws={"label": "tvot"})
sns.distplot(data['kvot'], kde_kws={"label": "kvot"})
plt.xlabel('ms')
plt.ylabel('proportion per ms')
The distributions of the three voiced stops are below.
sns.distplot(data['bvot'], kde_kws={"label": "bvot"})
sns.distplot(data['dvot'], kde_kws={"label": "dvot"})
sns.distplot(data['gvot'], kde_kws={"label": "gvot"})
plt.xlabel('ms')
plt.ylabel('proportion per ms')
Below, we see the native languages represented in the data.
sns.countplot(y="language", data=data)
Below, we have the distribution of height.
sns.distplot(data['height'])
plt.xlabel('height (cm)')
Now we will shift away from single-column visualizations and start to compare values between columns, looking specifically at the different closures in our dataframe. Run the cell below, which will automate some of the plotting for us.
def plot_with_equality_line(xs, ys, best_fit=False):
    # scatter plot of xs vs. ys with a dashed y = x reference line
    fig, ax = plt.subplots()
    sns.regplot(xs, ys, fit_reg=best_fit, ax=ax)
    lims = [np.min([ax.get_xlim(), ax.get_ylim()]), np.max([ax.get_xlim(), ax.get_ylim()])]
    ax.plot(lims, lims, '--', alpha=0.75, zorder=0, c='black')
    ax.set_xlim(lims)
    ax.set_ylim(lims)
    print('Points above line: ' + str(sum(xs < ys)))
    print('Points below line: ' + str(sum(xs > ys)))
    print('Points on line: ' + str(sum(xs == ys)))
We'll start by making scatter plots. They take the values (from identified columns) of individual rows and plot them as dots on our coordinate plane. So in the plot below, each point will represent a person's `tclo` and `pclo`. We are going to plot a dashed line that marks where the x-values are equal to the y-values, which helps us see which value is bigger for an individual. If a point is above the line, that person's y-value is larger than their x-value. If a point is below, their x-value is greater than their y-value.
plot_with_equality_line(data['tclo'], data['pclo'])
plt.xlabel('tclo (ms)')
plt.ylabel('pclo (ms)')
plot_with_equality_line(data['kclo'], data['pclo'])
plt.xlabel('kclo (ms)')
plt.ylabel('pclo (ms)')
plot_with_equality_line(data['kclo'], data['tclo'])
plt.xlabel('kclo (ms)')
plt.ylabel('tclo (ms)')
plot_with_equality_line(data['dclo'], data['bclo'])
plt.xlabel('dclo (ms)')
plt.ylabel('bclo (ms)')
plot_with_equality_line(data['gclo'], data['bclo'])
plt.xlabel('gclo (ms)')
plt.ylabel('bclo (ms)')
plot_with_equality_line(data['gclo'], data['dclo'])
plt.ylabel('dclo (ms)')
plt.xlabel('gclo (ms)')
Those scatter plots are informative, but sometimes it's difficult to draw conclusions from them, especially in our case where we have so much raw data. To make easier comparisons among the ranges of values that our closures take, we can use boxplots.
sns.boxplot(data=data[['pclo', 'tclo', 'kclo']], width=.3, palette="Set3")
plt.ylabel('duration (ms)')
plt.xlabel('Voiceless Closures')
With the above plot, it can be difficult to compare the box-and-whisker plots because the outliers require us to zoom out. Below, we will zoom in to the boxes.
sns.boxplot(data=data[['pclo', 'tclo', 'kclo']], width=.3, palette="Set3")
plt.ylabel('duration (ms)')
plt.xlabel('Voiceless Closures')
plt.ylim(0, 212)
We then recreate those graphs, but using our voiced closures.
sns.boxplot(data=data[['bclo', 'dclo', 'gclo']], width=.3, palette="Set2")
plt.ylabel('duration (ms)')
plt.xlabel('Voiced Closures')
sns.boxplot(data=data[['bclo', 'dclo', 'gclo']], width=.3, palette="Set2")
plt.ylabel('duration (ms)')
plt.xlabel('Voiced Closures')
plt.ylim(0, 212)
Do our box-whisker plots corroborate the scatter plot data? Are we able to come to the same conclusions that we were before?
Now let's explore relationships between closure and different characteristics of the people who produced those measurements, looking at language and height. We'll draw some plots to see whether there are relationships between them.
Before we look at the actual relationship, it is important to recognize any potential limitations of our observations. If you look back up to the bar plot of native languages, you will see that the majority of the class speaks English as their native language.
Question: if we try to come to conclusions about people who speak Tagalog or Farsi as their first language, would those conclusions be reliable? Why or why not?
sns.violinplot(x="vot", y="language", data=data)
plt.xlabel('vot (ms)')
Compare the distributions. Can you make any meaningful observations?
Now we'll look at how height influences closure, but first we are going to trim out one of the outliers.
trimmed = data[data['clo'] < 250]
sns.lmplot('height', 'clo', data=trimmed, fit_reg=False)
plt.xlabel('height (cm)')
plt.ylabel('clo (ms)')
In the scatter plot above, each dot represents the average closure and height of an individual.
Change "fit_reg" to "True" in the code above to see the regression line.
What does this graph tell us about the relationship between height and closure? Regression lines describe a general trend of the data, and are sometimes referred to as the 'line of best fit'.
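To make "line of best fit" concrete, here is a small sketch on made-up numbers (hypothetical heights and closures, not our class data) using `np.polyfit`, which finds the slope and intercept that minimize the squared vertical distances to the points:

```python
import numpy as np

heights = np.array([150, 160, 170, 180, 190])   # hypothetical heights (cm)
clos = np.array([80, 78, 75, 71, 70])           # hypothetical closures (ms)
# degree-1 polynomial fit = ordinary least-squares line
slope, intercept = np.polyfit(heights, clos, 1)
print(round(slope, 2), round(intercept, 1))  # -0.27 120.7
```

The negative slope in this toy example would mean closure tends to shrink as height grows; whether our real data shows any such trend is exactly what the regression line in the plot answers.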
Let's see if there's a different kind of relationship between height and the voiced/voiceless closures.
sns.regplot('height', 'vclo', data=trimmed, fit_reg=True)
sns.regplot('height', 'vlclo', data=trimmed, fit_reg=True)
plt.xlabel('height (cm)')
plt.ylabel('clo (ms)')
So far, we've been presenting two kinds of information in one plot (e.g. language vs. closure). Would presenting more than two at once help our analysis? Let's try it.
Below, the color of each dot will depend on the native language of the person it represents.
sns.lmplot('height', 'clo',data=trimmed, fit_reg=False, hue="language")
plt.xlabel('height (cm)')
plt.ylabel('clo (ms)')
What conclusions can you make from the graph above, if any? Is it easy to analyze this plot? Why?
The lesson here is that sometimes less is more.
It's often useful to compare current data with past data. Below, we'll explore class data collected from Fall 2015.
old_file_name = 'data/fall15.csv'
fa15 = pd.read_csv(old_file_name)
fa15.head()
The data from the previous semester does not have all of the same features (columns) that this semester's data has. So in order to make easy comparisons, we will just select out the columns that are in both dataframes.
current_subset = data[fa15.columns]
current_subset.head()
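Indexing one DataFrame with another's `columns` works for any pair of frames that share column names; a toy sketch with made-up frames:

```python
import pandas as pd

old = pd.DataFrame({'a': [1], 'b': [2]})
new = pd.DataFrame({'a': [3], 'b': [4], 'c': [5]})  # has an extra column
shared = new[old.columns]                           # keep only old's columns
print(shared.columns.tolist())  # ['a', 'b']
```

Note this assumes every column of the old frame exists in the new one, which is the case for our two semesters of data.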
Let's look at the difference between the major statistics of the previous data and this semester's.
difference = fa15.describe() - current_subset.describe()
difference
It's a little hard to tell how large those differences are, so let's look at the difference relative to this semester's data.
relative_difference = difference / current_subset.describe()
relative_difference
Now, let's add some color to help spot the largest relative changes. Run the next two cells.
scale = pd.DataFrame({'scale': np.arange(-3,5,1)*.2}).set_index(relative_difference.index)
def background_gradient(s, df, m=None, M=None, cmap='RdBu_r', low=0, high=0):
    # code modified from: https://stackoverflow.com/questions/38931566/pandas-style-background-gradient-both-rows-and-colums
    if m is None:
        m = df.min().min()
    if M is None:
        M = df.max().max()
    rng = M - m
    norm = colors.Normalize(m - (rng * low), M + (rng * high))
    normed = norm(s.values)
    c = [colors.rgb2hex(x) for x in ListedColormap(sns.color_palette(cmap, 8))(normed)]
    return ['background-color: %s' % color for color in c]
relative_difference.merge(scale, left_index=True, right_index=True).style.apply(background_gradient,
df=relative_difference, m=-1, M=1)
Now that we can see where the largest relative differences between this semester's and the prior semester's data are, let's take a look at them with further visualization. We'll start with `vot` because that column has quite a few rows with dark colors.
sns.distplot(data['vot'], kde_kws={"label": "Fall 2017 vot"})
sns.distplot(fa15['vot'], kde_kws={"label": "Fall 2015 vot"})
plt.xlabel('ms')
Why is this? The graph below should offer some insight.
sns.distplot(data['vlvot'], kde_kws={"label": "Fall 2017 vlvot"}) # notice: this plots the voiceless vot (vlvot)
sns.distplot(fa15['vot'], kde_kws={"label": "Fall 2015 vot"})
plt.xlabel('ms')
There are some large differences for `kvot`, so let's take a look at those distributions.
sns.distplot(fa15['kvot'], kde_kws={"label": "Fall 2015 kvot"})
sns.distplot(data['kvot'], kde_kws={"label": "Fall 2017 kvot"})
plt.xlabel('kvot (ms)')
Those differences mainly come from the presence of outliers: a particularly large value for Fall 2015 and a particularly small value for Fall 2017. Feel free to copy and paste some of the code from above and explore more of the relationships between the older data and this semester's data. Remember that to insert a cell below, you can either press `esc` + `b` or you can click `Insert > Insert Cell Below` on the toolbar.