Last year I wrote a blog post examining trends in Seattle bicycling and how they relate to weather, daylight, day of the week, and other factors.

Here I want to revisit the same data from a different perspective: rather than making assumptions in order to build models that might describe the data, I'll instead wipe the slate clean and ask what information we can extract from the data themselves, without reliance on any model assumptions.
In other words, where the previous post examined the data using a *supervised machine learning* approach for *data modeling*, this post will examine the data using an *unsupervised learning approach* for *data exploration*.

Along the way, we'll see some examples of importing, transforming, visualizing, and analyzing data in the Python language, using mostly Pandas, Matplotlib, and Scikit-learn. We will also see some real-world examples of the use of unsupervised machine learning algorithms, such as Principal Component Analysis and Gaussian Mixture Models, in exploring and extracting meaning from data.

To spoil the punchline (and perhaps whet your appetite) what we will find is that from analysis of bicycle counts alone, we can make some definite statements about the aggregate work habits of Seattleites who commute by bicycle.

The data we will use here are the hourly bicycle counts on Seattle's Fremont Bridge.
These data come from an automated bicycle counter, installed in late 2012, which has inductive sensors under the sidewalks on either side of the bridge.
The daily or hourly bicycle counts can be downloaded from http://data.seattle.gov/; here is the direct link to the hourly dataset.
To download the data directly, you can uncomment the following `curl`

command:

In [1]:

```
# !curl -o FremontBridge.csv https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD
```

Once this is downloaded, we can use the pandas library to load the data:

In [2]:

```
import pandas as pd
data = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
data.head()
```

Out[2]:

Fremont Bridge West Sidewalk | Fremont Bridge East Sidewalk | |
---|---|---|

Date | ||

2012-10-03 00:00:00 | 4 | 9 |

2012-10-03 01:00:00 | 4 | 6 |

2012-10-03 02:00:00 | 1 | 1 |

2012-10-03 03:00:00 | 2 | 3 |

2012-10-03 04:00:00 | 6 | 1 |

In [3]:

```
data.columns = ['West', 'East']
data.fillna(0, inplace=True)
data['Total'] = data.eval('East + West')
```

In [4]:

```
# first some standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # plot styling
import numpy as np
data.resample('W', how='sum').plot()
plt.ylabel('weekly trips');
```

From here, we could do a variety of other visualizations based on our intuition about what might affect bicycle counts.
For example, we could look at the effect of the days of the week, the effect of the weather, and other factors that I explored previously.
But we could also proceed by letting the dataset speak for itself, and use *unsupervised machine learning* techniques (that is, machine learning without reference to data labels) to learn what the data have to tell us.

We will consider each day in the dataset as its own separate entity (or *sample*, in usual machine learning parlance).
For each day, we have 48 observations: two observations (east and west sidewalk sensors) for each of the 24 hour-long periods.
By examining the days in light of these observations and doing some careful analysis, we should be able to extract meaningful quantitative statements from the data themselves, without the need to lean on any other assumptions.

The first step in this approach is to transform our data; essentially we will want a two-dimensional matrix, where each row of the matrix corresponds to a day, and each column of the matrix corresponds to one of the 48 observations.
We can arrange the data this way using the `pivot_table()`

function in Pandas.
We want the "East" and "West" column values, indexed by date, and separated by hour of the day.
Any missing values we will fill with zero:

In [5]:

```
pivoted = data.pivot_table(['East', 'West'],
index=data.index.date,
columns=data.index.hour,
fill_value=0)
pivoted.head()
```

Out[5]:

East | ... | West | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | |

2012-10-03 | 9 | 6 | 1 | 3 | 1 | 10 | 50 | 95 | 146 | 104 | ... | 77 | 72 | 133 | 192 | 122 | 59 | 29 | 25 | 24 | 5 |

2012-10-04 | 11 | 0 | 6 | 3 | 1 | 11 | 51 | 89 | 134 | 94 | ... | 63 | 73 | 114 | 154 | 137 | 57 | 27 | 31 | 25 | 11 |

2012-10-05 | 7 | 4 | 3 | 2 | 2 | 7 | 37 | 101 | 119 | 81 | ... | 63 | 80 | 120 | 144 | 107 | 42 | 27 | 11 | 10 | 16 |

2012-10-06 | 7 | 5 | 2 | 2 | 1 | 2 | 15 | 16 | 47 | 55 | ... | 89 | 115 | 107 | 107 | 41 | 40 | 25 | 18 | 14 | 15 |

2012-10-07 | 5 | 5 | 1 | 2 | 2 | 3 | 8 | 12 | 26 | 36 | ... | 126 | 122 | 132 | 118 | 68 | 26 | 19 | 12 | 9 | 5 |

5 rows × 48 columns

Next we extract the raw values and put them in a matrix:

In [6]:

```
X = pivoted.values
X.shape
```

Out[6]:

(1001, 48)

Our data consists of just over 1000 days, each with the aforementioned 48 measurements.

We can think of this data now as representing 1001 distinct objects which live in a *48-dimensional* space: the value of each dimension is the number of bicycle trips measured on a particular side of the bridge at a particular hour.
Visualizing 48-dimensional data is quite difficult, so instead we will use a standard *dimensionality reduction* technique to project this to a more manageable size.

The technique we'll use is Principal Component Analysis (PCA), a fast linear projection which rotates the data such that the projection preserves the maximum variance. We can ask for components preserving 90% of the variance as follows:

In [7]:

```
from sklearn.decomposition import PCA
Xpca = PCA(0.9).fit_transform(X)
Xpca.shape
```

Out[7]:

(1001, 2)

In [8]:

```
total_trips = X.sum(1)
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=total_trips,
cmap='cubehelix')
plt.colorbar(label='total trips');
```

We see that the days lie in two quite distinct groups, and that the total number of trips increases along the length of each projected cluster. Further, the two groups begin to be less distinguishable when the number of trips during the day is very small.

I find this extremely interesting: from the raw data, we can determine that there are basically *two primary types of days* for Seattle bicyclists.
Let's model these clusters and try to figure out what these types-of-day are.

When you have groups of data you'd like to automatically separate, but no previously-determined labels for the groups, the type of algorithm you are looking at is a *clustering* algorithm.
There are a number of clustering algorithms out there, but for nicely-defined oval-shaped blobs like we see above, Gaussian Mixture Models are a very good choice.
We can compute the Gaussian Mixture Model of the data using, again, scikit-learn, and quickly plot the predicted labels for the points:

In [9]:

```
from sklearn.mixture import GMM
gmm = GMM(2, covariance_type='full', random_state=0)
gmm.fit(Xpca)
cluster_label = gmm.predict(Xpca)
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=cluster_label);
```

In [10]:

```
pivoted['Cluster'] = cluster_label
data = data.join(pivoted['Cluster'], on=data.index.date)
data.head()
```

Out[10]:

West | East | Total | Cluster | |
---|---|---|---|---|

Date | ||||

2012-10-03 00:00:00 | 4 | 9 | 13 | 0 |

2012-10-03 01:00:00 | 4 | 6 | 10 | 0 |

2012-10-03 02:00:00 | 1 | 1 | 2 | 0 |

2012-10-03 03:00:00 | 2 | 3 | 5 | 0 |

2012-10-03 04:00:00 | 6 | 1 | 7 | 0 |

Now we can find the average trend by cluster and time using a GroupBy within this updated dataset

In [11]:

```
by_hour = data.groupby(['Cluster', data.index.time]).mean()
by_hour.head()
```

Out[11]:

West | East | Total | ||
---|---|---|---|---|

Cluster | ||||

0 | 00:00:00 | 5.312139 | 6.213873 | 11.526012 |

01:00:00 | 2.713873 | 2.969653 | 5.683526 | |

02:00:00 | 2.294798 | 1.732659 | 4.027457 | |

03:00:00 | 1.570809 | 1.426301 | 2.997110 | |

04:00:00 | 4.179191 | 2.650289 | 6.829480 |

Finally, we can plot the average hourly trend among the days within each cluster:

In [12]:

```
fig, ax = plt.subplots(1, 2, figsize=(14, 5))
hourly_ticks = 4 * 60 * 60 * np.arange(6)
for i in range(2):
by_hour.ix[i].plot(ax=ax[i], xticks=hourly_ticks)
ax[i].set_title('Cluster {0}'.format(i))
ax[i].set_ylabel('average hourly trips')
```

These plots give us some insight into the interpretation of the two clusters: the first cluster shows a sharp bimodal traffic pattern, while the second shows a wide unimodal pattern.

In the bimodal cluster, we see a peak around 8:00am which is dominated by cyclists on the west sidewalk, and another peak around 5:00pm which is dominated by cyclists on the east sidewalk. This is very clearly a commute pattern, with the majority of cyclists riding toward downtown Seattle in the morning, and away from downtown Seattle in the evening.

In the unimodal cluster, we see fairly steady traffic in each direction beginning early in the morning and going until late at night, with a peak around 2:00 in the afternoon. This is very clearly a recreational pattern of use, with people out riding through the entire day.

I find this is fascinating: from simple unsupervised dimensionality reduction and clustering, we've discovered two distinct classes of days in the data, and found that these classes have very intuitive explanations.

Let's go one step deeper and figure out what we can learn about people (well, bicycle commuters) in Seattle from just this hourly commute data. As a rough approximation, you might guess that these two classes of data might be largely reflective of workdays in the first cluster, and non-work days in the second. We can check this intuition by re-plotting our projected data, except labeling them by day of the week:

In [13]:

```
dayofweek = pd.to_datetime(pivoted.index).dayofweek
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=dayofweek,
cmap=plt.cm.get_cmap('jet', 7))
cb = plt.colorbar(ticks=range(7))
cb.set_ticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.clim(-0.5, 6.5);
```

We see that the weekday/weekend intuition holds, but only to a point: in particular, it is clear that **there are a handful of weekdays which follow the typical weekend pattern!**
Further, it's interesting to note that Fridays tend to be pulled closer to weekend days in this plot, though as a whole they still fall solidly in the work-day cluster.

Let's take a closer look at the "special" weekdays that fall in the "wrong" cluster. We start by constructing a dataset listing the cluster id and the day of the week for each of the dates in our dataset:

In [14]:

```
results = pd.DataFrame({'cluster': cluster_label,
'is_weekend': (dayofweek > 4),
'weekday': pivoted.index.map(lambda x: x.strftime('%a'))},
index=pivoted.index)
results.head()
```

Out[14]:

cluster | is_weekend | weekday | |
---|---|---|---|

2012-10-03 | 0 | False | Wed |

2012-10-04 | 0 | False | Thu |

2012-10-05 | 0 | False | Fri |

2012-10-06 | 1 | True | Sat |

2012-10-07 | 1 | True | Sun |

First, let's see how many weekend days fall in the first, commute-oriented cluster

In [15]:

```
weekend_workdays = results.query('cluster == 0 and is_weekend')
len(weekend_workdays)
```

Out[15]:

0

zero! Apparently, there is not a single weekend during the year where Seattle cyclists as a whole decide to go to work.

Similarly, we can see how many weekdays fall in the second, recreation-oriented cluster:

In [16]:

```
midweek_holidays = results.query('cluster == 1 and not is_weekend')
len(midweek_holidays)
```

Out[16]:

23

In [17]:

```
from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays('2012', '2016', return_name=True)
holidays.head()
```

Out[17]:

2012-01-02 New Years Day 2012-01-16 Dr. Martin Luther King Jr. 2012-02-20 Presidents Day 2012-05-28 MemorialDay 2012-07-04 July 4th dtype: object

Just for completeness, we will add to the list the day before and day after each of these holidays:

In [18]:

```
holidays_all = pd.concat([holidays,
"Day Before " + holidays.shift(-1, 'D'),
"Day After " + holidays.shift(1, 'D')])
holidays_all = holidays_all.sort_index()
holidays_all.head()
```

Out[18]:

2012-01-01 Day Before New Years Day 2012-01-02 New Years Day 2012-01-03 Day After New Years Day 2012-01-15 Day Before Dr. Martin Luther King Jr. 2012-01-16 Dr. Martin Luther King Jr. dtype: object

*observed* holidays, which is why New Years Day 2012 falls on January 2nd.
With this ready to go, we can compute the complete list of non-weekend days on which Seattle bicycle commuters as a whole chose to stay home from work:

In [19]:

```
holidays_all.name = 'name' # required for join
joined = midweek_holidays.join(holidays_all)
set(joined['name'])
```

Out[19]:

{'Christmas', 'Day After Christmas', 'Day After Thanksgiving', 'Day Before Christmas', 'July 4th', 'Labor Day', 'MemorialDay', 'New Years Day', 'Thanksgiving'}

In [20]:

```
set(holidays) - set(joined.name)
```

Out[20]:

{'Columbus Day', 'Dr. Martin Luther King Jr.', 'Presidents Day', 'Veterans Day'}

A colleague of mine, Ariel Rokem, saw the first version of this post and noticed something interesting. For the most part, Fridays tend to lie on the upper side of the weekday cluster, closer in this parameter space to the typical weekend pattern. This pattern holds nearly universally for Fridays, all except for three strange outliers which lie far on the other side of the cluster.

We can see these more clearly if we highlight the Friday points in the plot:

In [21]:

```
fridays = (dayofweek == 4)
plt.scatter(Xpca[:, 0], Xpca[:, 1], c='gray', alpha=0.2)
plt.scatter(Xpca[fridays, 0], Xpca[fridays, 1], c='yellow');
```