We've imported the schedules from NextBus (see schedules.py); let's quickly do a sanity check and graph them. The data has been aggregated into schedules_times.pickle, which is generated by aggregate_schedules.py.
%matplotlib inline
import pickle
from datetime import datetime
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from pytz import timezone
with open('schedules_times.pickle', 'rb') as schedule_times_file:
    weekday_stop_times, sat_stop_times, sun_stop_times = pickle.load(schedule_times_file)
Let's define a function to make plotting these easier.
def plot_schedule_dist(times, schedule_str="Weekday"):
    vals, bin_edges = np.histogram(times, 100)
    fig, ax = plt.subplots()
    ax.plot([datetime.fromtimestamp(x // 1000, tz=timezone('UTC')) for x in bin_edges][:-1], vals)
    plt.title("Distribution of %s Stop Times" % schedule_str)
    plt.xlabel("Time")
    plt.ylabel("Number of Stops")
    fig.autofmt_xdate()
plot_schedule_dist(weekday_stop_times, "Weekday")
plot_schedule_dist(sat_stop_times, 'Saturday')
plot_schedule_dist(sun_stop_times, "Sunday")
Clearly, the weekday distribution is bimodal, with one mode for each rush hour, while the weekend distributions are unimodal. Moreover, while the Saturday and Sunday distributions have very similar shapes, Sundays have fewer stops than Saturdays, presumably because Sundays are less busy.
What are the factors that affect the punctuality of TTC vehicles?
The punctuality of vehicles can be determined by finding the times at which vehicles actually stop at stops and comparing them to the scheduled times. This involved a fair bit of logic, which can be found in punctuality.py. The results are stored in the MongoDB database.
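The details live in punctuality.py, but the core idea can be sketched as matching each observed arrival to the closest scheduled stop time. The function below is a hypothetical simplification of my own, not the actual code:

```python
import bisect

def nearest_punctuality(arrival_ms, scheduled_ms):
    """How late an observed arrival is, in ms, relative to the closest
    scheduled stop time (a simplified stand-in for punctuality.py)."""
    scheduled_ms = sorted(scheduled_ms)
    i = bisect.bisect_left(scheduled_ms, arrival_ms)
    # compare against the neighbours on either side of the insertion point
    candidates = scheduled_ms[max(i - 1, 0):i + 1]
    closest = min(candidates, key=lambda s: abs(arrival_ms - s))
    return arrival_ms - closest

# a vehicle due at a scheduled time (ms since epoch) arriving 90 s late
print(nearest_punctuality(1_000_090_000, [1_000_000_000, 1_000_600_000]))
```

A negative result would mean the vehicle arrived early; the real logic also has to avoid matching an arrival to the wrong trip, which this sketch ignores.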
from pymongo import MongoClient, ASCENDING, DESCENDING
client = MongoClient()
db = client.datasummative
punctuality_collection = db.punctuality
We'll begin by looking at the distribution of punctuality values.
punctualities = []
for punc_row in punctuality_collection.find(fields={'punctuality': 1}):
    punctualities.append(punc_row['punctuality'] / (1000 * 60))
punctualities_series_orig = pd.Series(punctualities)
n, bins, patches = plt.hist(punctualities_series_orig, bins=300)
t = plt.title("Punctuality of TTC Vehicles")
t = plt.xlabel("Minutes Late")
t = plt.ylabel("Number of Instances")
This appears clearly right-skewed. We can bring back some statistical functions from last time and apply them here.
def pearsons_index(series):
    return 3 * (series.mean() - series.median()) / series.std()

def data_range(series):
    return series.quantile(1) - series.quantile(0)

def interquartile_range(series):
    return series.quantile(.75) - series.quantile(.25)

def outlier_range(series):
    iqr = interquartile_range(series)
    return (series.quantile(.25) - 1.5 * iqr, series.quantile(.75) + 1.5 * iqr)
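Before applying these to the real data, here is a quick sanity check of the skewness measure on made-up numbers (using the standard-library statistics module so the snippet stands alone): a sample with a long right tail should give a positive index, since the tail pulls the mean above the median.

```python
import statistics

# synthetic right-skewed sample: many small values plus a long right tail
sample = [1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 13, 21]

# Pearson's second skewness coefficient, as in pearsons_index above
index = 3 * (statistics.mean(sample) - statistics.median(sample)) / statistics.stdev(sample)
print(index > 0)  # the mean (~5.4) sits well above the median (3)
```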
print(punctualities_series_orig.describe())
print("Pearson's index: ", pearsons_index(punctualities_series_orig))
print("Interquartile range: ", interquartile_range(punctualities_series_orig))
print("Data range: ", data_range(punctualities_series_orig))
count    222510.000000
mean         10.598743
std          55.564036
min           0.000000
25%           1.350000
50%           3.483333
75%           7.916667
max        1386.633333
dtype: float64
Pearson's index:  0.384173486614
Interquartile range:  6.56666666667
Data range:  1386.63333333
Once again, this data is clearly dirty and requires some cleaning up. The maximum value, 1387 minutes, corresponds to just over 23 hours of lateness, which is obviously wrong. We'll remove the outliers and try again.
This time, we do not use 1.5 * IQR to remove the outliers. Instead, I'm making a judgement call and cutting the data off at a maximum of 40 minutes, since most of the data drops off after about 35. Using 1.5 * IQR would have capped the data at 18 minutes and removed 10% of the series; at 40 minutes, only approximately 1% of the dataset is removed.
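To double-check claims like "only about 1% is removed", the dropped fraction for any cutoff can be computed directly. A minimal sketch on made-up numbers (the helper name fraction_removed is mine, not from the analysis code):

```python
def fraction_removed(values, cutoff):
    """Fraction of observations that a 'keep values <= cutoff' filter drops."""
    return sum(1 for v in values if v > cutoff) / len(values)

# made-up lateness values in minutes, not the real punctuality series
fake_lateness = [1, 2, 3, 5, 8, 12, 20, 35, 45, 90]
print(fraction_removed(fake_lateness, 40))  # 2 of 10 values exceed 40
```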
outlier_r = outlier_range(punctualities_series_orig)
punctualities_series = pd.Series(x for x in punctualities
                                 if x >= outlier_r[0] and x <= 40)
print(punctualities_series.describe())
print("Pearson's index: ", pearsons_index(punctualities_series))
print("Interquartile range: ", interquartile_range(punctualities_series))
print("Data range: ", data_range(punctualities_series))
count    219122.000000
mean          5.580403
std           6.187868
min           0.000000
25%           1.350000
50%           3.416667
75%           7.466667
max          39.983333
dtype: float64
Pearson's index:  1.04902199392
Interquartile range:  6.11666666667
Data range:  39.9833333333
n, bins, patches = plt.hist(punctualities_series, bins=20)
t = plt.title("Punctuality of TTC Vehicles")
t = plt.xlabel("Minutes Late")
t = plt.ylabel("Number of Instances")
Here, we can see that the data is right-skewed, which makes sense: most vehicles arrive close to on time, while fewer and fewer arrive later and later. The Pearson's index is greater than one, which confirms that the data is right-skewed.
We can plot this on a box-and-whiskers plot, for the fun of it.
plt.boxplot(punctualities_series, True, '+', vert=False, whis=np.inf)
plt.xlim(0, 50)
t = plt.title("Punctuality of TTC Vehicles")
t = plt.xlabel("Minutes Late")
t = plt.yticks((1,), ('TTC Vehicles',), rotation=90)
How is punctuality across routes?
# rt_and_punc = {}
# for punc_row in punctuality_collection.find(fields={'punctuality': 1,
# 'rt_tag': 1}):
# rt_tag = punc_row['rt_tag']
# if rt_tag in rt_and_punc:
# rt_and_punc[rt_tag].append(punc_row['punctuality'] / (1000 * 60))
# else:
# rt_and_punc[rt_tag] = [punc_row['punctuality'] / (1000 * 60)]
# ## convert to panda series, then find median
# rt_and_punc_medians = {}
# for rt in rt_and_punc:
# s = pd.Series(rt_and_punc[rt])
# rt_and_punc_medians[rt] = s.median()
with open('median_punctuality.pkl', 'rb') as median_punc_file:
    rt_and_punc_medians = pickle.load(median_punc_file)
N = len(rt_and_punc_medians)
ind = np.arange(N)
d = list(zip(*rt_and_punc_medians.items()))
p = plt.bar(ind, d[1])
t = plt.xlabel('Routes')
t = plt.ylabel('Median Late Time (minutes)')
t = plt.xticks(ind + 0.4, d[0], rotation=90)
t = plt.title('Median Late Times of Routes')
Well that's pretty much useless. Let's try again as a histogram.
p = plt.hist(d[1], bins=30)
t = plt.xlabel('Median Late Time (minutes)')
t = plt.title('Median Late Times of Routes')
Again, this is right-skewed, although taking the medians also forms two clusters. The mode is around 3 minutes.
Are the median velocities of routes correlated with the median punctuality of routes?
import math
# velocity_collection = db.velocity
# routes_velocity = {}
# for velocity in velocity_collection.find(fields={'route_tag': 1,
# 'vx': 1,
# 'vy': 1}):
# rt_tag = velocity['route_tag']
# vel = math.hypot(velocity['vx'], velocity['vy'])
# if rt_tag in routes_velocity:
# routes_velocity[rt_tag].append(vel)
# else:
# routes_velocity[rt_tag] = [vel]
# routes_velocity_median = {}
# for rt in routes_velocity:
# s = pd.Series(routes_velocity[rt])
# routes_velocity_median[rt] = s.median()
import pickle
# with open('median_velocities.pkl', 'wb') as median_vels_file:
# pickle.dump(routes_velocity_median, median_vels_file)
# with open('median_punctuality.pkl', 'wb') as median_punc_file:
# pickle.dump(rt_and_punc_medians, median_punc_file)
with open('median_velocities.pkl', 'rb') as median_vels_file:
    routes_velocity_median = pickle.load(median_vels_file)
routes = []
puncs = []
velocities = []
for rt in rt_and_punc_medians:
    if rt in routes_velocity_median:
        routes.append(rt)
        puncs.append(rt_and_punc_medians[rt])
        velocities.append(routes_velocity_median[rt])
routes_puncs_df = pd.DataFrame({'route': routes, 'punc': puncs, 'vel': velocities}, index=routes)
del routes
del velocities
del puncs
p = plt.scatter(routes_puncs_df['vel'], routes_puncs_df['punc'])
t = plt.title('Median Punctuality vs. Median Velocity of Routes')
t = plt.xlabel('Median Velocity (m/s)')
t = plt.ylabel('Median Punctuality (minutes)')
This appears to indicate little correlation. We can verify this by running a least-squares regression.
from pandas.stats.api import ols
res = ols(y=routes_puncs_df['punc'], x=routes_puncs_df['vel'])
print(res)
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         172
Number of Degrees of Freedom:   2

R-squared:         0.0274
Adj R-squared:     0.0217

Rmse:              6.8197

F-stat (1, 170):     4.7845, p-value:     0.0301

Degrees of Freedom: model 1, resid 170

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.8392     0.3836       2.19     0.0301     0.0872     1.5911
     intercept     3.9364     1.9720       2.00     0.0475     0.0713     7.8014
---------------------------------End of Summary---------------------------------
0.027 is a remarkably bad R-squared value: the fit explains under 3% of the variance, so there is effectively no correlation between median velocity and median punctuality.
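As an aside, pandas.stats.api.ols has since been removed from pandas; the same slope, intercept, and R-squared can be recovered with plain NumPy. A sketch on stand-in data (not the routes DataFrame):

```python
import numpy as np

# stand-in data: y depends on x roughly linearly, plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([4.1, 5.9, 8.2, 9.8, 12.1, 13.9])

slope, intercept = np.polyfit(x, y, 1)      # least-squares line y = slope*x + intercept
r_squared = np.corrcoef(x, y)[0, 1] ** 2    # R-squared = squared Pearson correlation
print(slope, intercept, r_squared)
```

statsmodels would be another option, and unlike polyfit it also reports standard errors and p-values like the summary above.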
p = plt.scatter(routes_puncs_df['vel'], routes_puncs_df['punc'])
t = plt.title('Median Punctuality vs. Median Velocity of Routes')
t = plt.xlabel('Median Velocity (m/s)')
t = plt.ylabel('Median Punctuality (minutes)')
m, b = 0.8392, 3.9364
p = plt.plot(np.arange(1, 11), np.poly1d((m, b))(np.arange(1, 11)), '-r')
Does a route having more trips make it more punctual on median?
with open('trips_routes.pkl', 'rb') as trips_routes_file:
    routes_trips = pickle.load(trips_routes_file)
rts = []
num_trips_s = []
for rt, num_trips in routes_trips:
    rts.append(rt)
    num_trips_s.append(num_trips)
routes_trips_df = pd.DataFrame({'route': rts, 'num_trips': num_trips_s}, index=rts)
del rts
del num_trips_s
del routes_trips
puncs_trips_df = routes_puncs_df.merge(routes_trips_df, right_index=True, left_index=True)
res = ols(y=puncs_trips_df['punc'], x=puncs_trips_df['num_trips'])
print(res)
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         172
Number of Degrees of Freedom:   2

R-squared:         0.1862
Adj R-squared:     0.1814

Rmse:              6.2380

F-stat (1, 170):    38.9004, p-value:     0.0000

Degrees of Freedom: model 1, resid 170

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.0050     0.0008      -6.24     0.0000    -0.0065    -0.0034
     intercept    11.4722     0.7205      15.92     0.0000    10.0601    12.8844
---------------------------------End of Summary---------------------------------
def plot_with_reg(x, y):
    plt.plot(x, y, 'b.')
    res = ols(y=y, x=x)
    m = res.beta['x']
    b = res.beta['intercept']
    x_fit = np.linspace(min(x), max(x))
    y_fit = np.poly1d((m, b))
    plt.plot(x_fit, y_fit(x_fit), '-k')
plot_with_reg(puncs_trips_df['num_trips'], puncs_trips_df['punc'])
t = plt.title('Number of Trips vs. Punctuality')
t = plt.xlabel('Number of Trips (per six weeks)')
t = plt.ylabel('Punctuality (minutes)')
t = plt.xlim(0,5000)
t = plt.ylim(0, 40)
The fit, although weak in the strict sense (R-squared = 0.1862), does indicate that routes with more trips tend to have lower median lateness. This makes sense: with more frequent vehicles, even if one is missed or late, it is very unlikely that many vehicles will be late at once.
Are there correlations between ridership and punctuality?
ridership_collection = db.ridership
rts = []
costs = []
riders = []
for r in ridership_collection.find():
    rts.append(r['route'])
    costs.append(r['cost'])
    riders.append(r['ridership'])
ridership_df = pd.DataFrame({'route': rts, 'cost': costs, 'ridership': riders},
                            index=rts)
ridership_punc_df = routes_puncs_df.merge(ridership_df, right_index=True, left_index=True)
del rts
del costs
del riders
print(ridership_punc_df.shape)
(136, 6)
print(ols(x=ridership_punc_df['ridership'], y=ridership_punc_df['punc']))
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         136
Number of Degrees of Freedom:   2

R-squared:         0.1538
Adj R-squared:     0.1475

Rmse:              5.0319

F-stat (1, 134):    24.3621, p-value:     0.0000

Degrees of Freedom: model 1, resid 134

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.0002     0.0000      -4.94     0.0000    -0.0002    -0.0001
     intercept     9.0460     0.5849      15.47     0.0000     7.8996    10.1924
---------------------------------End of Summary---------------------------------
plot_with_reg(ridership_punc_df['ridership'], ridership_punc_df['punc'])
t = plt.ylim(0, 30)
t = plt.title("Punctuality vs. Ridership")
t = plt.xlabel("Weekday Ridership")
t = plt.ylabel("Punctuality (minutes)")
This indicates that there is a weak negative correlation between the number of riders and the median lateness of routes. This is expected, since routes with more riders tend to come more frequently and are thus late less often than routes with fewer riders.
Does having more riders slow down vehicles?
print(ols(x=ridership_punc_df['ridership'], y=ridership_punc_df['vel']))
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         136
Number of Degrees of Freedom:   2

R-squared:         0.0104
Adj R-squared:     0.0030

Rmse:              0.9908

F-stat (1, 134):     1.4055, p-value:     0.2379

Degrees of Freedom: model 1, resid 134

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.0000     0.0000      -1.19     0.2379    -0.0000     0.0000
     intercept     4.8046     0.1152      41.72     0.0000     4.5789     5.0303
---------------------------------End of Summary---------------------------------
plot_with_reg(x=ridership_punc_df['ridership'], y=ridership_punc_df['vel'])
t = plt.ylabel("Median Velocity (m/s)")
t = plt.xlabel("Weekday Ridership")
t = plt.title("Velocity vs. Ridership")
There is almost no correlation between the ridership and the median velocity of routes: routes that carry more people are not expected to be much faster or slower than routes that carry fewer.
Is there any correlation between punctuality and temperature or pressure?
# weather_collection = db.weather
# ts = []
# ps = []
# temps = []
# pressures = []
# for punctuality_row in punctuality_collection.find():
# t = punctuality_row['datetime_real']
# p = punctuality_row['punctuality'] / (1000* 60)
# if p > 40:
# continue
# weather_row = weather_collection.find_one({'datetime': {'$lte': t}},
# sort=[('datetime', DESCENDING)])
# ts.append(t)
# ps.append(p)
# temps.append(weather_row['temp'])
# pressures.append(weather_row['pressure'])
# di = {'time': ts, 'punc': ps, 'temp': temps, 'pressure': pressures}
# with open('weather_punc.pkl', 'wb') as weather_punc_file:
# pickle.dump(di, weather_punc_file)
# del ts
# del ps
# del temps
# del pressures
# del di
with open('weather_punc.pkl', 'rb') as weather_punc_file:
    di = pickle.load(weather_punc_file)
weather_punc_df = pd.DataFrame(di, index=di['time'])
del di
print(ols(x=weather_punc_df['temp'], y=weather_punc_df['punc']))
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         491464
Number of Degrees of Freedom:   2

R-squared:         0.0003
Adj R-squared:     0.0003

Rmse:              5.5336

F-stat (1, 491462):   154.5662, p-value:     0.0000

Degrees of Freedom: model 1, resid 491462

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.0241     0.0019      12.43     0.0000     0.0203     0.0279
     intercept     4.5434     0.0241     188.32     0.0000     4.4961     4.5907
---------------------------------End of Summary---------------------------------
plot_with_reg(weather_punc_df['temp'], weather_punc_df['punc'])
t = plt.title('Punctuality and Temperature')
t = plt.xlabel('Temperature (degrees Celsius)')
t = plt.ylabel('Punctuality (minutes)')
print(ols(x=weather_punc_df['pressure'], y=weather_punc_df['punc']))
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         491464
Number of Degrees of Freedom:   2

R-squared:         0.0008
Adj R-squared:     0.0008

Rmse:              5.5324

F-stat (1, 491462):   375.3727, p-value:     0.0000

Degrees of Freedom: model 1, resid 491462

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.2921     0.0151     -19.37     0.0000    -0.3217    -0.2626
     intercept    34.0656     1.5092      22.57     0.0000    31.1077    37.0235
---------------------------------End of Summary---------------------------------
plot_with_reg(weather_punc_df['pressure'], weather_punc_df['punc'])
t = plt.title('Punctuality and Pressure')
t = plt.xlabel('Pressure (kPa)')
t = plt.ylabel('Punctuality (minutes)')
From this, we can see that neither temperature nor pressure has any real effect on the punctuality of the vehicles: the fits are statistically significant only because of the huge sample size, and the R-squared values (0.0003 and 0.0008) are negligible.