We've imported the schedules from NextBus (see schedules.py); let's quickly do a sanity check and graph them. The data has been aggregated into schedules_times.pickle, which is generated by aggregate_schedules.py.
%matplotlib inline
import pickle
from datetime import datetime
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from pytz import timezone
with open('schedules_times.pickle', 'rb') as schedule_times_file:
    weekday_stop_times, sat_stop_times, sun_stop_times = pickle.load(schedule_times_file)
Let's define a function to make plotting these easier.
def plot_schedule_dist(times, schedule_str="Weekday"):
    vals, bin_edges = np.histogram(times, 100)
    fig, ax = plt.subplots()
    ax.plot([datetime.fromtimestamp(x // 1000, tz=timezone('UTC')) for x in bin_edges][:-1], vals)
    plt.title("Distribution of %s Stop Times" % schedule_str)
    plt.xlabel("Time")
    plt.ylabel("Number of Stops")
    fig.autofmt_xdate()
plot_schedule_dist(weekday_stop_times, "Weekday")
plot_schedule_dist(sat_stop_times, 'Saturday')
plot_schedule_dist(sun_stop_times, "Sunday")
Clearly, the weekday distribution is bimodal, with one mode for each rush hour, while the weekend distributions are unimodal. Moreover, while the Saturday and Sunday distributions have very similar shapes, Sundays have fewer stops than Saturdays, presumably because Sundays are less busy.
What are the factors that affect the punctuality of TTC vehicles?
The punctuality of vehicles can be determined by finding the times at which vehicles actually stop at stops and comparing them to the scheduled times. This involved a fair bit of logic, which can be found in punctuality.py. The results are stored in the MongoDB database.
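The details live in punctuality.py, but the core idea can be sketched as matching each observed arrival to the closest scheduled stop time. The function below is a hypothetical simplification of my own, not the actual code:

```python
import bisect

def nearest_punctuality(arrival_ms, scheduled_ms):
    """How late an observed arrival is, in ms, relative to the closest
    scheduled stop time (a simplified stand-in for punctuality.py)."""
    scheduled_ms = sorted(scheduled_ms)
    i = bisect.bisect_left(scheduled_ms, arrival_ms)
    # compare against the neighbours on either side of the insertion point
    candidates = scheduled_ms[max(i - 1, 0):i + 1]
    closest = min(candidates, key=lambda s: abs(arrival_ms - s))
    return arrival_ms - closest

# a vehicle due at a scheduled time (ms since epoch) arriving 90 s late
print(nearest_punctuality(1_000_090_000, [1_000_000_000, 1_000_600_000]))
```

A negative result would mean the vehicle arrived early; the real logic also has to avoid matching an arrival to the wrong trip, which this sketch ignores.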
from pymongo import MongoClient, ASCENDING, DESCENDING
client = MongoClient()
db = client.datasummative
punctuality_collection = db.punctuality
We'll begin by looking at the distribution of punctuality values.
punctualities = []
for punc_row in punctuality_collection.find(fields={'punctuality': 1}):
    punctualities.append(punc_row['punctuality'] / (1000 * 60))
punctualities_series_orig = pd.Series(punctualities)
n, bins, patches = plt.hist(punctualities_series_orig, bins=300)
t = plt.title("Punctuality of TTC Vehicles")
t = plt.xlabel("Minutes Late")
t = plt.ylabel("Number of Instances")
This appears clearly right-skewed. We can bring back some statistical functions from last time and apply them here.
def pearsons_index(series):
    return 3 * (series.mean() - series.median()) / series.std()

def data_range(series):
    return series.quantile(1) - series.quantile(0)

def interquartile_range(series):
    return series.quantile(.75) - series.quantile(.25)

def outlier_range(series):
    iqr = interquartile_range(series)
    return (series.quantile(.25) - 1.5 * iqr, series.quantile(.75) + 1.5 * iqr)
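Before applying these to the real data, here is a quick sanity check of the skewness measure on made-up numbers (using the standard-library statistics module so the snippet stands alone): a sample with a long right tail should give a positive index, since the tail pulls the mean above the median.

```python
import statistics

# synthetic right-skewed sample: many small values plus a long right tail
sample = [1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 13, 21]

# Pearson's second skewness coefficient, as in pearsons_index above
index = 3 * (statistics.mean(sample) - statistics.median(sample)) / statistics.stdev(sample)
print(index > 0)  # the mean (~5.4) sits well above the median (3)
```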
print(punctualities_series_orig.describe())
print("Pearson's index: ", pearsons_index(punctualities_series_orig))
print("Interquartile range: ", interquartile_range(punctualities_series_orig))
print("Data range: ", data_range(punctualities_series_orig))
count    222510.000000
mean         10.598743
std          55.564036
min           0.000000
25%           1.350000
50%           3.483333
75%           7.916667
max        1386.633333
dtype: float64
Pearson's index:  0.384173486614
Interquartile range:  6.56666666667
Data range:  1386.63333333
Once again, this data is clearly dirty and requires some cleaning up. The maximum value, 1387 minutes, corresponds to just over 23 hours of lateness, which is obviously wrong. We'll remove the outliers and try again.
This time, we do not use 1.5 * IQR to remove the outliers. Instead, I'm making a judgement call and cutting the data off at a maximum of 40 minutes, since most of the data drops off after about 35. Using 1.5 * IQR would have capped the data at 18 minutes and removed 10% of the series; at 40 minutes, only approximately 1% of the dataset is removed.
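To double-check claims like "only about 1% is removed", the dropped fraction for any cutoff can be computed directly. A minimal sketch on made-up numbers (the helper name fraction_removed is mine, not from the analysis code):

```python
def fraction_removed(values, cutoff):
    """Fraction of observations that a 'keep values <= cutoff' filter drops."""
    return sum(1 for v in values if v > cutoff) / len(values)

# made-up lateness values in minutes, not the real punctuality series
fake_lateness = [1, 2, 3, 5, 8, 12, 20, 35, 45, 90]
print(fraction_removed(fake_lateness, 40))  # 2 of 10 values exceed 40
```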
outlier_r = outlier_range(punctualities_series_orig)
punctualities_series = pd.Series(x for x in punctualities
                                 if x >= outlier_r[0] and x <= 40)
print(punctualities_series.describe())
print("Pearson's index: ", pearsons_index(punctualities_series))
print("Interquartile range: ", interquartile_range(punctualities_series))
print("Data range: ", data_range(punctualities_series))
count    219122.000000
mean          5.580403
std           6.187868
min           0.000000
25%           1.350000
50%           3.416667
75%           7.466667
max          39.983333
dtype: float64
Pearson's index:  1.04902199392
Interquartile range:  6.11666666667
Data range:  39.9833333333
n, bins, patches = plt.hist(punctualities_series, bins=20)
t = plt.title("Punctuality of TTC Vehicles")
t = plt.xlabel("Minutes Late")
t = plt.ylabel("Number of Instances")
Here, we can see that the data is right-skewed, which makes sense: most vehicles arrive close to on time, while fewer and fewer arrive later and later. The Pearson's index is greater than one, which confirms that the data is right-skewed.
We can plot this on a box-and-whiskers plot, for the fun of it.
plt.boxplot(punctualities_series, True, '+', vert=False, whis=np.inf)
plt.xlim(0, 50)
t = plt.title("Punctuality of TTC Vehicles")
t = plt.xlabel("Minutes Late")
t = plt.yticks((1,), ('TTC Vehicles',), rotation=90)
How is punctuality across routes?
# rt_and_punc = {}
# for punc_row in punctuality_collection.find(fields={'punctuality': 1,
# 'rt_tag': 1}):
# rt_tag = punc_row['rt_tag']
# if rt_tag in rt_and_punc:
# rt_and_punc[rt_tag].append(punc_row['punctuality'] / (1000 * 60))
# else:
# rt_and_punc[rt_tag] = [punc_row['punctuality'] / (1000 * 60)]
# ## convert to panda series, then find median
# rt_and_punc_medians = {}
# for rt in rt_and_punc:
# s = pd.Series(rt_and_punc[rt])
# rt_and_punc_medians[rt] = s.median()
with open('median_punctuality.pkl', 'rb') as median_punc_file:
    rt_and_punc_medians = pickle.load(median_punc_file)
N = len(rt_and_punc_medians)
ind = np.arange(N)
d = list(zip(*rt_and_punc_medians.items()))
p = plt.bar(ind, d[1])
t = plt.xlabel('Routes')
t = plt.ylabel('Median Late Time (minutes)')
t = plt.xticks(ind + 0.4, d[0], rotation=90)
t = plt.title('Median Late Times of Routes')
Well that's pretty much useless. Let's try again as a histogram.
p = plt.hist(d[1], bins=30)
t = plt.xlabel('Median Late Time (minutes)')
t = plt.title('Median Late Times of Routes')
Again, this is right-skewed, although taking the medians also forms two clusters. The mode is around 3 minutes.
Are the median velocities of routes correlated with the median punctuality of routes?
import math
# velocity_collection = db.velocity
# routes_velocity = {}
# for velocity in velocity_collection.find(fields={'route_tag': 1,
# 'vx': 1,
# 'vy': 1}):
# rt_tag = velocity['route_tag']
# vel = math.hypot(velocity['vx'], velocity['vy'])
# if rt_tag in routes_velocity:
# routes_velocity[rt_tag].append(vel)
# else:
# routes_velocity[rt_tag] = [vel]
# routes_velocity_median = {}
# for rt in routes_velocity:
# s = pd.Series(routes_velocity[rt])
# routes_velocity_median[rt] = s.median()
import pickle
# with open('median_velocities.pkl', 'wb') as median_vels_file:
# pickle.dump(routes_velocity_median, median_vels_file)
# with open('median_punctuality.pkl', 'wb') as median_punc_file:
# pickle.dump(rt_and_punc_medians, median_punc_file)
with open('median_velocities.pkl', 'rb') as median_vels_file:
    routes_velocity_median = pickle.load(median_vels_file)
routes = []
puncs = []
velocities = []
for rt in rt_and_punc_medians:
    if rt in routes_velocity_median:
        routes.append(rt)
        puncs.append(rt_and_punc_medians[rt])
        velocities.append(routes_velocity_median[rt])
routes_puncs_df = pd.DataFrame({'route': routes, 'punc': puncs, 'vel': velocities}, index=routes)
del routes
del velocities
del puncs
p = plt.scatter(routes_puncs_df['vel'], routes_puncs_df['punc'])
t = plt.title('Median Punctuality vs. Median Velocity of Routes')
t = plt.xlabel('Median Velocity (m/s)')
t = plt.ylabel('Median Punctuality (minutes)')
This appears to indicate little correlation. We can verify this by running a least-squares regression.
from pandas.stats.api import ols
res = ols(y=routes_puncs_df['punc'], x=routes_puncs_df['vel'])
print(res)
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         172
Number of Degrees of Freedom:   2

R-squared:         0.0274
Adj R-squared:     0.0217

Rmse:              6.8197

F-stat (1, 170):     4.7845, p-value:     0.0301

Degrees of Freedom: model 1, resid 170

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.8392     0.3836       2.19     0.0301     0.0872     1.5911
     intercept     3.9364     1.9720       2.00     0.0475     0.0713     7.8014
---------------------------------End of Summary---------------------------------
0.027 is a remarkably bad R-squared value: the fit explains under 3% of the variance, so there is effectively no correlation between median velocity and median punctuality.
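As an aside, pandas.stats.api.ols has since been removed from pandas; the same slope, intercept, and R-squared can be recovered with plain NumPy. A sketch on stand-in data (not the routes DataFrame):

```python
import numpy as np

# stand-in data: y depends on x roughly linearly, plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([4.1, 5.9, 8.2, 9.8, 12.1, 13.9])

slope, intercept = np.polyfit(x, y, 1)      # least-squares line y = slope*x + intercept
r_squared = np.corrcoef(x, y)[0, 1] ** 2    # R-squared = squared Pearson correlation
print(slope, intercept, r_squared)
```

statsmodels would be another option, and unlike polyfit it also reports standard errors and p-values like the summary above.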
p = plt.scatter(routes_puncs_df['vel'], routes_puncs_df['punc'])
t = plt.title('Median Punctuality vs. Median Velocity of Routes')
t = plt.xlabel('Median Velocity (m/s)')
t = plt.ylabel('Median Punctuality (minutes)')
m, b = 0.8392, 3.9364
p = plt.plot(np.arange(1, 11), np.poly1d((m, b))(np.arange(1, 11)), '-r')
Does a route having more trips make it more punctual on median?
with open('trips_routes.pkl', 'rb') as trips_routes_file:
    routes_trips = pickle.load(trips_routes_file)
rts = []
num_trips_s = []
for rt, num_trips in routes_trips:
    rts.append(rt)
    num_trips_s.append(num_trips)
routes_trips_df = pd.DataFrame({'route': rts, 'num_trips': num_trips_s}, index=rts)
del rts
del num_trips_s
del routes_trips
puncs_trips_df = routes_puncs_df.merge(routes_trips_df, right_index=True, left_index=True)
res = ols(y=puncs_trips_df['punc'], x=puncs_trips_df['num_trips'])
print(res)
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         172
Number of Degrees of Freedom:   2

R-squared:         0.1862
Adj R-squared:     0.1814

Rmse:              6.2380

F-stat (1, 170):    38.9004, p-value:     0.0000

Degrees of Freedom: model 1, resid 170

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.0050     0.0008      -6.24     0.0000    -0.0065    -0.0034
     intercept    11.4722     0.7205      15.92     0.0000    10.0601    12.8844
---------------------------------End of Summary---------------------------------
def plot_with_reg(x, y):
    plt.plot(x, y, 'b.')
    res = ols(y=y, x=x)
    m = res.beta['x']
    b = res.beta['intercept']
    x_fit = np.linspace(min(x), max(x))
    y_fit = np.poly1d((m, b))
    plt.plot(x_fit, y_fit(x_fit), '-k')
plot_with_reg(puncs_trips_df['num_trips'], puncs_trips_df['punc'])
t = plt.title('Number of Trips vs. Punctuality')
t = plt.xlabel('Number of Trips (per six weeks)')
t = plt.ylabel('Punctuality (minutes)')
t = plt.xlim(0,5000)
t = plt.ylim(0, 40)
The fit, although weak in the strict sense (R-squared = 0.1862), does indicate that routes with more trips tend to have lower median lateness. This makes sense: with more frequent vehicles, even if one is missed or late, it is very unlikely that many vehicles will be late at once.
Are there correlations between ridership and punctuality?
ridership_collection = db.ridership
rts = []
costs = []
riders = []
for r in ridership_collection.find():
    rts.append(r['route'])
    costs.append(r['cost'])
    riders.append(r['ridership'])
ridership_df = pd.DataFrame({'route': rts, 'cost': costs, 'ridership': riders},
                            index=rts)
ridership_punc_df = routes_puncs_df.merge(ridership_df, right_index=True, left_index=True)
del rts
del costs
del riders
print(ridership_punc_df.shape)
(136, 6)
print(ols(x=ridership_punc_df['ridership'], y=ridership_punc_df['punc']))
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         136
Number of Degrees of Freedom:   2

R-squared:         0.1538
Adj R-squared:     0.1475

Rmse:              5.0319

F-stat (1, 134):    24.3621, p-value:     0.0000

Degrees of Freedom: model 1, resid 134

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.0002     0.0000      -4.94     0.0000    -0.0002    -0.0001
     intercept     9.0460     0.5849      15.47     0.0000     7.8996    10.1924
---------------------------------End of Summary---------------------------------
plot_with_reg(ridership_punc_df['ridership'], ridership_punc_df['punc'])
t = plt.ylim(0, 30)
t = plt.title("Punctuality vs. Ridership")
t = plt.xlabel("Weekday Ridership")
t = plt.ylabel("Punctuality (minutes)")
This indicates that there is a weak negative correlation between the number of riders and the median lateness of routes. This is expected, since routes with more riders tend to come more frequently and are thus late less often than routes with fewer riders.
Does having more riders slow down vehicles?
print(ols(x=ridership_punc_df['ridership'], y=ridership_punc_df['vel']))
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         136
Number of Degrees of Freedom:   2

R-squared:         0.0104
Adj R-squared:     0.0030

Rmse:              0.9908

F-stat (1, 134):     1.4055, p-value:     0.2379

Degrees of Freedom: model 1, resid 134

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.0000     0.0000      -1.19     0.2379    -0.0000     0.0000
     intercept     4.8046     0.1152      41.72     0.0000     4.5789     5.0303
---------------------------------End of Summary---------------------------------
plot_with_reg(x=ridership_punc_df['ridership'], y=ridership_punc_df['vel'])
t = plt.ylabel("Median Velocity (m/s)")
t = plt.xlabel("Weekday Ridership")
t = plt.title("Velocity vs. Ridership")
There is almost no correlation between the ridership and the median velocity of routes: routes that carry more people are not expected to be much faster or slower than routes that carry fewer.
Is there any correlation between punctuality and temperature or pressure?
# weather_collection = db.weather
# ts = []
# ps = []
# temps = []
# pressures = []
# for punctuality_row in punctuality_collection.find():
# t = punctuality_row['datetime_real']
# p = punctuality_row['punctuality'] / (1000* 60)
# if p > 40:
# continue
# weather_row = weather_collection.find_one({'datetime': {'$lte': t}},
# sort=[('datetime', DESCENDING)])
# ts.append(t)
# ps.append(p)
# temps.append(weather_row['temp'])
# pressures.append(weather_row['pressure'])
# di = {'time': ts, 'punc': ps, 'temp': temps, 'pressure': pressures}
# with open('weather_punc.pkl', 'wb') as weather_punc_file:
# pickle.dump(di, weather_punc_file)
# del ts
# del ps
# del temps
# del pressures
# del di
with open('weather_punc.pkl', 'rb') as weather_punc_file:
    di = pickle.load(weather_punc_file)
weather_punc_df = pd.DataFrame(di, index=di['time'])
del di
print(ols(x=weather_punc_df['temp'], y=weather_punc_df['punc']))
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         491464
Number of Degrees of Freedom:   2

R-squared:         0.0003
Adj R-squared:     0.0003

Rmse:              5.5336

F-stat (1, 491462):   154.5662, p-value:     0.0000

Degrees of Freedom: model 1, resid 491462

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.0241     0.0019      12.43     0.0000     0.0203     0.0279
     intercept     4.5434     0.0241     188.32     0.0000     4.4961     4.5907
---------------------------------End of Summary---------------------------------
plot_with_reg(weather_punc_df['temp'], weather_punc_df['punc'])
t = plt.title('Punctuality and Temperature')
t = plt.xlabel('Temperature (degrees Celsius)')
t = plt.ylabel('Punctuality (minutes)')
print(ols(x=weather_punc_df['pressure'], y=weather_punc_df['punc']))
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         491464
Number of Degrees of Freedom:   2

R-squared:         0.0008
Adj R-squared:     0.0008

Rmse:              5.5324

F-stat (1, 491462):   375.3727, p-value:     0.0000

Degrees of Freedom: model 1, resid 491462

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.2921     0.0151     -19.37     0.0000    -0.3217    -0.2626
     intercept    34.0656     1.5092      22.57     0.0000    31.1077    37.0235
---------------------------------End of Summary---------------------------------
plot_with_reg(weather_punc_df['pressure'], weather_punc_df['punc'])
t = plt.title('Punctuality and Pressure')
t = plt.xlabel('Pressure (kPa)')
t = plt.ylabel('Punctuality (minutes)')
From this, we can see that neither temperature nor pressure has any real effect on the punctuality of the vehicles: the fits are statistically significant only because of the huge sample size, and the R-squared values (0.0003 and 0.0008) are negligible.