Putting things into groups - no matter how (in)accurate - is a lot of fun.
You can use k-means to group data that only has one dimension - NBA player salaries, for example. We're going to cluster some data we used the other day about congressional speeches.
First, let's import the speeches.
# You should have this downloaded & extracted already, so I've commented it out
# !curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz
# !tar -zxvf convote_v1.1.tar.gz
import re
import glob
import pandas as pd

paths = glob.glob('convote_v1.1/data_stage_one/development_set/*')

speeches = []
for path in paths:
    speech = {}
    # The filename is the last 26 characters of the path,
    # and it encodes the bill, speaker, party and vote
    filename = path[-26:]
    speech['filename'] = filename
    speech['bill no'] = filename[:3]
    speech['speaker no'] = filename[4:10]
    speech['bill vote'] = filename[-5]
    speech['party'] = filename[-7]
    with open(path, 'r') as f:
        speech['contents'] = f.read()
    # Strip punctuation, collapse repeated spaces, then split into words
    cleaned_contents = re.sub(r"[^ \w]", '', speech['contents'])
    cleaned_contents = re.sub(r" +", ' ', cleaned_contents)
    cleaned_contents = cleaned_contents.strip()
    tokens = cleaned_contents.split(' ')
    speech['tokenized contents'] = tokens
    speech['word count'] = len(tokens)
    speeches.append(speech)

speeches_df = pd.DataFrame(speeches)
speeches_df[:5]
It'd be good to get an overview of the data first. Make a histogram of speech word counts.
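Something like this ought to do it (a quick sketch, assuming matplotlib - the 50 bins is an arbitrary choice):
import matplotlib.pyplot as plt

# One bar per word-count range
speeches_df['word count'].hist(bins=50)
plt.show()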
Seems a little unbalanced, maybe? Let's try clustering with 4 clusters.
Initialize a k-means object called km and fit it to the data. Remember to import!
# If you get a n_samples=1 should be >= n_clusters=4 error,
# you'll want to make sure you're using *two sets of square brackets*
# around the column name
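Here's a minimal sketch using scikit-learn:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=4)
# Double square brackets give a one-column dataframe,
# which is the 2-D shape KMeans expects
km.fit(speeches_df[['word count']])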
Look at the first ten km.labels_, and write a comment explaining what that is.
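One way to peek at them:
# km.labels_ is the cluster number (0 through 3) assigned to each speech,
# in the same row order as the data we fit on
km.labels_[:10]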
Add this new information into your dataframe. Call it k-means label.
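That's a one-liner:
speeches_df['k-means label'] = km.labels_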
Make a histogram for the word count of each k-means label.
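pandas can split a histogram out by a column using by= - one way to do it:
speeches_df.hist(column='word count', by='k-means label')
plt.show()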
One of them might seem kind of crazy. Use groupby and describe to get a better explanation of how your data was clustered.
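A sketch:
# Summary stats (count, mean, min, max, etc.) for each cluster
speeches_df.groupby('k-means label')['word count'].describe()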
Explain how they ended up grouped that way, and what you think about the groups.
K-means can handle any number of dimensions (more or less), but we're just going to step up to two for now. The fun thing about two dimensions is that latitude and longitude fit right in!
!curl -O http://www.boutell.com/zipcodes/zipcode.zip
!unzip zipcode.zip
Read this file into a pandas dataframe called zipcodes, and look at the first five elements.
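Assuming the zip extracts to a file called zipcode.csv, something like:
zipcodes = pd.read_csv('zipcode.csv')
zipcodes.head()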
Map it longitude by latitude using plt.scatter. Pass s=1 to make the dots real tiny, and edgecolors='none' to prevent the map from being all black.
Make sure you pass longitude first! Or, try latitude/longitude, see how it looks, then try it again with longitude/latitude.
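A sketch:
# Longitude goes on the x axis, latitude on the y axis
plt.scatter(zipcodes['longitude'], zipcodes['latitude'], s=1, edgecolors='none')
plt.show()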
Look familiar? Normally you'd get sent to prison for scatterplotting geographic data, but I'm not going to tell anyone.
Unfortunately, we need to clean the data up a little before we run k-means on it. Let's examine everything with NaN (Not a Number) for latitude.
# pd.isnull checks to see if latitude is None or NaN
zipcodes[pd.isnull(zipcodes["latitude"])]
We can't have those terrible empty rows! Let's make a new data frame that doesn't have those elements.
# The ~ means 'not', so 'the zipcodes that are not null for latitude'
cleaned_zipcodes = zipcodes[~pd.isnull(zipcodes["latitude"])]
You can scatterplot it again if you'd like to make sure the data still looks okay.
Now, run k-means with 10 clusters across longitude and latitude. Make sure you use cleaned_zipcodes.
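A sketch, reusing scikit-learn's KMeans:
km = KMeans(n_clusters=10)
# Two dimensions this time: longitude and latitude
km.fit(cleaned_zipcodes[['longitude', 'latitude']])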
Plot it again, this time coloring according to the assigned labels. Make sure to pass edgecolors='none' again.
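Something like:
# c= colors each dot according to its assigned cluster label
plt.scatter(cleaned_zipcodes['longitude'], cleaned_zipcodes['latitude'],
            c=km.labels_, s=1, edgecolors='none')
plt.show()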
Fun, right? Play around with the number of clusters to see what other results you can get. Why does clustering by zip codes seem to show you population centers?