Uber provides anonymized and aggregated travel time data through the Uber Movement platform for many cities across the world. For India, current and historic data is available for 5 cities: Bangalore, Hyderabad, New Delhi, Mumbai and Kolkata. It also provides the ward boundaries as a JSON file.
OpenStreetMap (OSM) is a free, editable map of the whole world that is being built by volunteers largely from scratch and released with an open-content license. OSM data includes a global navigable street network dataset. Several services exist that provide routing and network analysis on top of this data.
In this project, we use the open travel time dataset from Uber and leverage open-source routing services for OpenStreetMap to build a fairly accurate model for travel time within each of the metro cities in India. We show that by using the rich ecosystem of Python geospatial libraries, we can consume, process, and visualize large amounts of geospatial data and incorporate it into a machine learning model with little effort.
Open datasets
Python libraries
Services
import pandas as pd
import geopandas as gpd
import numpy as np
import requests
import shapely
import matplotlib.pyplot as plt
import datetime
import os
import math
import random
import folium
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from matplotlib import pyplot
%matplotlib inline
data_folder = os.path.join('data', 'uber')
The Uber Movement Travel Times data comes as a CSV file for each quarter. Here we are using the Travel Times By Date By Hour Buckets (All Days) dataset. This data set includes the arithmetic mean, geometric mean, and standard deviations for aggregated travel times between every ward in the city, for every day of the quarter and aggregated into time categories. This is a large dataset with over 7M rows.
We import the data as a Pandas DataFrame and call convert_dtypes() to select the best datatypes for each column.
travel_times_file = 'bangalore-wards-2020-1-All-DatesByHourBucketsAggregate.csv'
travel_times_filepath = os.path.join(data_folder, travel_times_file)
travel_times= pd.read_csv(travel_times_filepath)
travel_times = travel_times.convert_dtypes()
travel_times
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7640967 entries, 0 to 7640966
Data columns (total 17 columns):
 #   Column                                    Dtype
---  ------                                    -----
 0   sourceid                                  Int64
 1   dstid                                     Int64
 2   month                                     Int64
 3   day                                       Int64
 4   start_hour                                Int64
 5   end_hour                                  Int64
 6   mean_travel_time                          Float64
 7   standard_deviation_travel_time            Float64
 8   geometric_mean_travel_time                Float64
 9   geometric_standard_deviation_travel_time  Float64
 10  time_period                               int64
 11  travel_time                               Float64
 12  src_lon                                   float64
 13  src_lat                                   float64
 14  dst_lon                                   float64
 15  dst_lat                                   float64
 16  distance                                  float64
dtypes: Float64(5), Int64(6), float64(5), int64(1)
memory usage: 1.1 GB
The travel times dataset contains details of travel between zones. For Indian cities, the zones are wards as defined by the local municipal corporation. This data comes as a GeoJSON file that contains the polygon representation of each ward. We use geopandas to read the file as a GeoDataFrame.
wards_file = 'bangalore_wards.json'
wards_filepath = os.path.join(data_folder, wards_file)
wards = gpd.read_file(wards_filepath)
wards
| | WARD_NO | WARD_NAME | MOVEMENT_ID | DISPLAY_NAME | geometry |
---|---|---|---|---|---|
0 | 2 | Chowdeswari Ward | 1 | Unnamed Road, Bengaluru | MULTIPOLYGON (((77.59229 13.09720, 77.59094 13... |
1 | 3 | Atturu | 2 | 9th Cross Bhel Layout, Adityanagar, Vidyaranya... | MULTIPOLYGON (((77.56862 13.12705, 77.57064 13... |
2 | 4 | Yelahanka Satellite Town | 3 | 15th A Cross Road, Yelahanka Satellite Town, Y... | MULTIPOLYGON (((77.59094 13.09842, 77.59229 13... |
3 | 51 | Vijnanapura | 4 | SP Naidu Layout 4th Cross Street, SP Naidu Lay... | MULTIPOLYGON (((77.67683 13.01147, 77.67695 13... |
4 | 53 | Basavanapura | 5 | Medahalli Kadugodi Road, Bharathi Nagar, Krish... | MULTIPOLYGON (((77.72899 13.02061, 77.72994 13... |
... | ... | ... | ... | ... | ... |
193 | 172 | Madivala | 194 | 0 1st B Cross Road, Cashier Layout, 1st Stage,... | MULTIPOLYGON (((77.61399 12.92347, 77.61419 12... |
194 | 26 | Ramamurthy Nagar | 195 | Kalkere-Agara Main Road, Horamavu Agara, Kalke... | MULTIPOLYGON (((77.68336 13.05192, 77.68384 13... |
195 | 25 | Horamavu | 196 | 0 Horamavu Agara Main Road, 1st Block, Mallapp... | MULTIPOLYGON (((77.64931 13.07853, 77.64993 13... |
196 | 86 | Marathahalli | 197 | 0 3rd Cross Road, Manjunatha Layout, Marathaha... | MULTIPOLYGON (((77.68549 12.94121, 77.68539 12... |
197 | 198 | Hemmigepura | 198 | BGS Road, Kodipalya, Bengaluru | MULTIPOLYGON (((77.49854 12.92574, 77.49854 12... |
198 rows × 5 columns
fig, ax = plt.subplots(figsize=(10,10))
wards['geometry'].plot(color='grey',ax=ax)
<AxesSubplot:>
np.random.seed(0)
travel_times = pd.concat([travel_times]*5, ignore_index=True)
travel_times['random'] = np.random.uniform(0, 1, len(travel_times))
travel_times['travel_time'] = np.exp(travel_times['random']*np.log(travel_times['geometric_standard_deviation_travel_time']) + np.log(travel_times['geometric_mean_travel_time']))
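Written out, the expression above draws each travel_time from the interval between the geometric mean and the geometric mean multiplied by the geometric standard deviation. The symbols below are shorthand introduced here, not names from the dataset: $g$ is geometric_mean_travel_time, $s$ is geometric_standard_deviation_travel_time, and $u$ is the value in the random column.

```latex
t = \exp\bigl(u \ln s + \ln g\bigr) = g \, s^{u},
\qquad u \sim \mathcal{U}(0, 1),
\qquad g \le t \le g \, s \quad (s \ge 1)
```

For example, a row with $g = 270.59$, $s = 1.7$ and $u = 0.548814$ gives $t = 270.59 \times 1.7^{0.548814} \approx 362.06$.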
The source data contains the travel times grouped by blocks of time (peak, off-peak, etc.), defined by the start_hour and end_hour columns. To make this easy to model, we add a time_period column and assign an integer category value.
travel_times
| | sourceid | dstid | month | day | start_hour | end_hour | mean_travel_time | standard_deviation_travel_time | geometric_mean_travel_time | geometric_standard_deviation_travel_time | time_period | travel_time | src_lon | src_lat | dst_lon | dst_lat | distance | random |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 102 | 97 | 3 | 13 | 10 | 16 | 322.8 | 425.14 | 270.59 | 1.7 | 3 | 362.063898 | 77.563817 | 12.982784 | 77.566287 | 12.970359 | 1778.9 | 0.548814 |
1 | 102 | 97 | 1 | 17 | 19 | 0 | 306.71 | 200.99 | 256.58 | 1.84 | 5 | 396.842057 | 77.563817 | 12.982784 | 77.566287 | 12.970359 | 1778.9 | 0.715189 |
2 | 102 | 97 | 2 | 7 | 19 | 0 | 282.94 | 206.01 | 233.29 | 2.0 | 5 | 354.279465 | 77.563817 | 12.982784 | 77.566287 | 12.970359 | 1778.9 | 0.602763 |
3 | 102 | 97 | 1 | 19 | 7 | 10 | 294.18 | 183.97 | 258.09 | 1.61 | 2 | 334.554688 | 77.563817 | 12.982784 | 77.566287 | 12.970359 | 1778.9 | 0.544883 |
4 | 102 | 97 | 2 | 9 | 7 | 10 | 263.55 | 149.79 | 232.89 | 1.67 | 2 | 289.404809 | 77.563817 | 12.982784 | 77.566287 | 12.970359 | 1778.9 | 0.423655 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
38204830 | 162 | 83 | 3 | 2 | 16 | 19 | 3120.83 | 459.43 | 3089.7 | 1.15 | 4 | 3091.908238 | 77.587545 | 12.924545 | 77.599077 | 13.005976 | 11869.0 | 0.005112 |
38204831 | 8 | 127 | 2 | 2 | 10 | 16 | 2943.88 | 338.43 | 2924.38 | 1.12 | 3 | 3128.379559 | 77.672583 | 12.994474 | 77.548812 | 12.963703 | 17072.9 | 0.595019 |
38204832 | 34 | 95 | 2 | 1 | 19 | 0 | 2624.09 | 863.77 | 2493.34 | 1.37 | 5 | 3387.562808 | 77.506307 | 13.038050 | 77.632594 | 12.973725 | 19533.4 | 0.973561 |
38204833 | 56 | 120 | 1 | 19 | 10 | 16 | 2366.86 | 349.15 | 2340.33 | 1.16 | 3 | 2635.211006 | 77.603741 | 13.035174 | 77.547044 | 12.972093 | 13119.4 | 0.799564 |
38204834 | 128 | 25 | 1 | 24 | 19 | 0 | 2135.83 | 604.69 | 2033.07 | 1.39 | 5 | 2273.702852 | 77.551941 | 12.960545 | 77.553857 | 13.026029 | 10127.4 | 0.339695 |
38204835 rows × 18 columns
categories_to_hour = {
    1: [0, 6],
    2: [7, 9],
    3: [10, 15],
    4: [16, 18],
    5: [19, 23]
}

def get_time_period(hour):
    for category, (start_hour, end_hour) in categories_to_hour.items():
        if hour >= start_hour and hour <= end_hour:
            return category

travel_times['time_period'] = travel_times['start_hour'].apply(get_time_period)
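Applying a Python function to every row can be slow on a table with millions of rows. As a possible alternative, the sketch below bins the hours in one vectorized call with pd.cut. The toy start_hours series and the bin_edges list are assumptions made for illustration; the edges are chosen so that the right-inclusive bins reproduce the ranges in categories_to_hour.

```python
import pandas as pd

# Toy stand-in for travel_times['start_hour'] (illustrative values only).
start_hours = pd.Series([0, 7, 10, 16, 19])

# Right-inclusive bins matching categories_to_hour:
# (-1, 6] -> 1, (6, 9] -> 2, (9, 15] -> 3, (15, 18] -> 4, (18, 23] -> 5
bin_edges = [-1, 6, 9, 15, 18, 23]
period_labels = [1, 2, 3, 4, 5]

time_period = pd.cut(start_hours, bins=bin_edges, labels=period_labels).astype(int)

# The upper edge of each range should fall in the same category.
upper_edges = pd.cut(pd.Series([6, 9, 15, 18, 23]), bins=bin_edges,
                     labels=period_labels).astype(int)
```

Whether this is worth the change depends on the size of the data; the result should be identical to the loop-based function for hours between 0 and 23.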
Travel time has a strong correlation with the day of the week, so we compute a new column dow from the day and month columns.
year = 2020
def get_dow(row):
    # weekday() returns 0 for Monday through 6 for Sunday
    return datetime.date(year, int(row['month']), int(row['day'])).weekday()
travel_times['dow'] = travel_times.apply(get_dow, axis=1)
travel_times
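The day of the week can also be computed without a row-wise apply. The following is a minimal sketch, assuming a small toy frame in place of travel_times: pd.to_datetime accepts a DataFrame with year, month and day columns, and dt.dayofweek uses the same Monday=0 convention as datetime.date.weekday(). The month and day values are taken from the first rows of the table shown earlier.

```python
import pandas as pd

year = 2020

# Toy stand-in for the month and day columns of travel_times.
toy = pd.DataFrame({'month': [3, 1, 2, 1], 'day': [13, 17, 7, 19]})

# Assemble full dates from the year, month and day columns in one call.
dates = pd.to_datetime(toy[['month', 'day']].assign(year=year))

# Monday=0 ... Sunday=6, the same convention as date.weekday().
toy['dow'] = dates.dt.dayofweek
```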
For modeling purposes, we use the centroid of each ward to represent the ward. We use the GeoPandas centroid attribute to get the point geometry representing the centroid.
Our source data comes in the EPSG:4326 WGS84 geographic coordinate system, which is not suitable for geoprocessing operations. To get an accurate centroid, we must re-project the data to a planar projection. We use a UTM projection suitable for the region of the data, WGS 84 UTM Zone 43N, which is defined by the code EPSG:32643. Once computed, we transform the centroids back to EPSG:4326 and add them to our GeoDataFrame.
centroid_utm = wards.geometry.to_crs('EPSG:32643').centroid
wards['centroid'] = centroid_utm.to_crs('EPSG:4326')
fig, ax = plt.subplots(figsize=(10,10))
wards['geometry'].plot(color='grey',ax=ax)
wards['centroid'].plot(color='red',ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x22370fe8108>
We have the travel times for each pair of source and destination wards. The travel time is strongly correlated with the distance between the wards, so we need to compute the actual distance along the road network between each pair of ward centroids.
ward_no = wards['MOVEMENT_ID']
index = pd.MultiIndex.from_product([ward_no, ward_no], names = ['sourceid', 'dstid'])
distancematrix = pd.DataFrame(index = index).reset_index()
distancematrix = distancematrix.query('sourceid != dstid')
def get_coordinates(row):
    source_ward = wards[wards['MOVEMENT_ID'] == row['sourceid']].iloc[0]
    dst_ward = wards[wards['MOVEMENT_ID'] == row['dstid']].iloc[0]
    src_lon, src_lat = source_ward['centroid'].x, source_ward['centroid'].y
    dst_lon, dst_lat = dst_ward['centroid'].x, dst_ward['centroid'].y
    return src_lon, src_lat, dst_lon, dst_lat
distancematrix[['src_lon', 'src_lat', 'dst_lon', 'dst_lat']] = distancematrix.apply(get_coordinates, axis=1, result_type='expand')
distancematrix
| | sourceid | dstid | src_lon | src_lat | dst_lon | dst_lat |
---|---|---|---|---|---|---|
1 | 1 | 2 | 77.580422 | 13.121709 | 77.560037 | 13.102805 |
2 | 1 | 3 | 77.580422 | 13.121709 | 77.583926 | 13.090987 |
3 | 1 | 4 | 77.580422 | 13.121709 | 77.669565 | 13.006063 |
4 | 1 | 5 | 77.580422 | 13.121709 | 77.715456 | 13.016847 |
5 | 1 | 6 | 77.580422 | 13.121709 | 77.705502 | 13.022373 |
... | ... | ... | ... | ... | ... | ... |
39198 | 198 | 193 | 77.505015 | 12.891903 | 77.594507 | 12.910882 |
39199 | 198 | 194 | 77.505015 | 12.891903 | 77.614418 | 12.920018 |
39200 | 198 | 195 | 77.505015 | 12.891903 | 77.676539 | 13.033613 |
39201 | 198 | 196 | 77.505015 | 12.891903 | 77.653272 | 13.044560 |
39202 | 198 | 197 | 77.505015 | 12.891903 | 77.691495 | 12.950743 |
39006 rows × 6 columns
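The row count matches the expected number of ordered pairs: 198 × 197 = 39,006. The row-wise lookup filters the wards table twice for every pair; one possible alternative is to attach the coordinates with two merges. The sketch below assumes a plain centroids table with lon and lat columns (hypothetical names) holding the first three centroids from the output above.

```python
import pandas as pd

# Hypothetical flat table of ward centroids (values copied from the output above).
centroids = pd.DataFrame({
    'MOVEMENT_ID': [1, 2, 3],
    'lon': [77.580422, 77.560037, 77.583926],
    'lat': [13.121709, 13.102805, 13.090987],
})

# All ordered pairs of distinct wards.
ids = centroids['MOVEMENT_ID']
pairs = pd.MultiIndex.from_product(
    [ids, ids], names=['sourceid', 'dstid']).to_frame(index=False)
pairs = pairs[pairs['sourceid'] != pairs['dstid']]

# Attach source and destination coordinates with two merges.
src = centroids.rename(
    columns={'MOVEMENT_ID': 'sourceid', 'lon': 'src_lon', 'lat': 'src_lat'})
dst = centroids.rename(
    columns={'MOVEMENT_ID': 'dstid', 'lon': 'dst_lon', 'lat': 'dst_lat'})
pairs = pairs.merge(src, on='sourceid').merge(dst, on='dstid')
```

With n wards this yields n × (n − 1) rows, so three wards give six pairs.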
We need to get the driving distance for approximately 40,000 coordinate pairs. To do this efficiently, we ran the Open Source Routing Machine (OSRM) service locally using the Docker images provided by the project. OSRM holds the network graph in memory, so routing is extremely fast. We write the following function and apply it to each row to get the driving distance in meters.
def get_distance(row):
    coordinates = '{},{};{},{}'.format(
        row['src_lon'], row['src_lat'], row['dst_lon'], row['dst_lat'])
    url = 'http://127.0.0.1:5000/route/v1/driving/'
    response = requests.get(url + coordinates)
    if response.status_code == 200:
        data = response.json()
        distance = data['routes'][0]['distance']
        return distance
    # The request failed or no route was found
    return None
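To make the request format easier to check without a running server, the URL construction and the response parsing can be separated. The sketch below is illustrative only: build_route_url and extract_distance are hypothetical helper names, the base URL assumes the same local OSRM instance, and sample_payload is a hand-written stand-in that mimics the shape of an OSRM route response. The overview=false option asks OSRM to skip the route geometry, which is not needed when only the distance is used.

```python
def build_route_url(src_lon, src_lat, dst_lon, dst_lat,
                    base_url='http://127.0.0.1:5000'):
    """Build an OSRM route URL; coordinates are given as lon,lat pairs."""
    coordinates = '{},{};{},{}'.format(src_lon, src_lat, dst_lon, dst_lat)
    return '{}/route/v1/driving/{}?overview=false'.format(base_url, coordinates)


def extract_distance(payload):
    """Return the distance in meters of the first route, or None if there is none."""
    if payload.get('code') != 'Ok' or not payload.get('routes'):
        return None
    return payload['routes'][0]['distance']


# Hand-written payload with the same shape as an OSRM route response.
sample_payload = {'code': 'Ok', 'routes': [{'distance': 1778.9}]}

url = build_route_url(77.563817, 12.982784, 77.566287, 12.970359)
distance = extract_distance(sample_payload)
no_route = extract_distance({'code': 'NoRoute', 'routes': []})
```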
The resulting distance data is saved locally and used in the subsequent analysis.
osrm_data_folder = os.path.join('data', 'osrm')
distancematrix_file = 'distancematrix.csv'
distancematrix_filepath = os.path.join(osrm_data_folder, distancematrix_file)
distancematrix = pd.read_csv(distancematrix_filepath)
travel_times = pd.merge(travel_times, distancematrix, on=['sourceid', 'dstid'])
travel_times
| | sourceid | dstid | month | day | start_hour | end_hour | mean_travel_time | standard_deviation_travel_time | geometric_mean_travel_time | geometric_standard_deviation_travel_time | time_period | dow | src_lon | src_lat | dst_lon | dst_lat | distance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 102 | 97 | 3 | 13 | 10 | 16 | 322.80 | 425.14 | 270.59 | 1.70 | 3 | 4 | 77.563817 | 12.982784 | 77.566287 | 12.970359 | 1778.9 |
1 | 102 | 97 | 1 | 17 | 19 | 0 | 306.71 | 200.99 | 256.58 | 1.84 | 5 | 4 | 77.563817 | 12.982784 | 77.566287 | 12.970359 | 1778.9 |
2 | 102 | 97 | 2 | 7 | 19 | 0 | 282.94 | 206.01 | 233.29 | 2.00 | 5 | 4 | 77.563817 | 12.982784 | 77.566287 | 12.970359 | 1778.9 |
3 | 102 | 97 | 1 | 19 | 7 | 10 | 294.18 | 183.97 | 258.09 | 1.61 | 2 | 6 | 77.563817 | 12.982784 | 77.566287 | 12.970359 | 1778.9 |
4 | 102 | 97 | 2 | 9 | 7 | 10 | 263.55 | 149.79 | 232.89 | 1.67 | 2 | 6 | 77.563817 | 12.982784 | 77.566287 | 12.970359 | 1778.9 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7640962 | 162 | 83 | 3 | 2 | 16 | 19 | 3120.83 | 459.43 | 3089.70 | 1.15 | 4 | 0 | 77.587545 | 12.924545 | 77.599077 | 13.005976 | 11869.0 |
7640963 | 8 | 127 | 2 | 2 | 10 | 16 | 2943.88 | 338.43 | 2924.38 | 1.12 | 3 | 6 | 77.672583 | 12.994474 | 77.548812 | 12.963703 | 17072.9 |
7640964 | 34 | 95 | 2 | 1 | 19 | 0 | 2624.09 | 863.77 | 2493.34 | 1.37 | 5 | 5 | 77.506307 | 13.038050 | 77.632594 | 12.973725 | 19533.4 |
7640965 | 56 | 120 | 1 | 19 | 10 | 16 | 2366.86 | 349.15 | 2340.33 | 1.16 | 3 | 6 | 77.603741 | 13.035174 | 77.547044 | 12.972093 | 13119.4 |
7640966 | 128 | 25 | 1 | 24 | 19 | 0 | 2135.83 | 604.69 | 2033.07 | 1.39 | 5 | 4 | 77.551941 | 12.960545 | 77.553857 | 13.026029 | 10127.4 |
7640967 rows × 17 columns
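A quick sanity check on the merged distance column is to compare it with the straight-line distance between the same centroids, which should never be longer than the road distance. The sketch below uses the haversine formula with a mean Earth radius of 6,371 km. The haversine_m helper is an illustrative addition, not part of the original pipeline, and the coordinates and road distance are those of the first row in the table above.

```python
import numpy as np


def haversine_m(lon1, lat1, lon2, lat2, radius_m=6371000.0):
    """Great-circle distance in meters between two lon/lat points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_m * np.arcsin(np.sqrt(a))


# First row of the merged table: ward 102 to ward 97.
straight_line = haversine_m(77.563817, 12.982784, 77.566287, 12.970359)
road_distance = 1778.9  # meters, from the distance column

detour_ratio = road_distance / straight_line
```

The same function works on whole columns, since every operation is a NumPy ufunc.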
We use the scikit-learn library to build and train a linear regressor.
The independent variables considered are sourceid, dstid, day, time_period, dow, src_lon, src_lat, dst_lon, dst_lat and distance. The dependent variable is the travel time: the travel_time column, which is derived from geometric_mean_travel_time.
Of the independent variables, we one-hot encode the categorical variables time_period and dow.
We sample the travel times to get a subset that will be used for training.
num_samples = 500000
samples = travel_times.sample(n=num_samples, random_state=1)
sel_input=['sourceid', 'dstid', 'day', 'time_period', 'dow', 'src_lon', 'src_lat', 'dst_lon', 'dst_lat', 'distance']
cat_ip=['time_period','dow']
scale_ip= list(set(sel_input)-set(cat_ip))
x = samples[sel_input].values
y = samples['travel_time']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)
The variables time_period and dow are categorical, so we one-hot encode them: their integer codes carry no numeric meaning for a linear model. The remaining variables are standardized with StandardScaler.
category_Trans= ColumnTransformer([('encoder',OneHotEncoder(categories='auto', sparse=False),[sel_input.index(i) for i in cat_ip]),
('scaler',StandardScaler(),[sel_input.index(i) for i in scale_ip])],remainder='passthrough')
regressor = Pipeline(steps=[('ct',category_Trans),('model',LinearRegression())])
regressor.fit(x_train,y_train)
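To see what the pipeline does on a small scale, the following self-contained sketch fits the same kind of pipeline to synthetic data. Everything in it is made up for illustration: the two features (a period code and a distance), the per-period offsets, and the noise-free target. It is not the model from this analysis. Because the target is an exact linear function of the one-hot columns and the distance, the fit should be close to perfect.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
period = rng.integers(1, 6, n)            # categorical codes 1..5
distance = rng.uniform(1000, 20000, n)    # numeric feature

# Synthetic target: a per-period offset plus a distance effect (no noise).
offsets = np.array([0.0, 100.0, 400.0, 200.0, 500.0, 150.0])
y = offsets[period] + 0.1 * distance

X = np.column_stack([period, distance])

# Column 0 is one-hot encoded, column 1 is standardized.
toy_transformer = ColumnTransformer([
    ('encoder', OneHotEncoder(handle_unknown='ignore'), [0]),
    ('scaler', StandardScaler(), [1]),
])
toy_model = Pipeline(steps=[('ct', toy_transformer), ('model', LinearRegression())])
toy_model.fit(X, y)

toy_r2 = toy_model.score(X, y)
```

If the period code were passed in as a plain number instead, the model would have to assume that the offset grows linearly with the code, which is not true for this toy target.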
print('Training R^2: ', regressor.score(x_train,y_train))
Training R^2: 0.7442950836977186
print('Test R^2: ', regressor.score(x_test,y_test))
Test R^2: 0.7441506633075343
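For a scikit-learn regressor, score() returns the coefficient of determination R², not a classification-style accuracy. An error expressed in minutes is often easier to interpret alongside it. The sketch below computes both on a handful of made-up values (the arrays are illustrative, not results from this model), treating the travel times as seconds.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up travel times in seconds, for illustration only.
y_true = np.array([1200.0, 1800.0, 2400.0, 3000.0])
y_pred = np.array([1260.0, 1740.0, 2520.0, 2880.0])

r2 = r2_score(y_true, y_pred)                            # same quantity as regressor.score()
mae_minutes = mean_absolute_error(y_true, y_pred) / 60   # mean absolute error in minutes
```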
While the model performs well with the test partition, that dataset is not representative of real world data. We want to see how the model performs to routing requests that are not between centroids of wards. To achieve this, we create a dataset with random source and destination coordinates and check the model prediction against travel times predicted by commercial data providers such as Google Maps.
We generate random coordinate pairs within the bounds of the city. But to ensure that the points fall within the actual city geometry, we do a spatial join to select the points that intersect the wards. After the join, we select a subset of 100 points.
n_points = 200
x_min, y_min, x_max, y_max = wards.total_bounds
np.random.seed(0)
src_x = np.random.uniform(x_min, x_max, n_points)
src_y = np.random.uniform(y_min, y_max, n_points)
dst_x = np.random.uniform(x_min, x_max, n_points)
dst_y = np.random.uniform(y_min, y_max, n_points)
src_gdf = gpd.GeoDataFrame(geometry=gpd.points_from_xy(src_x, src_y), crs='EPSG:4326')
dst_gdf = gpd.GeoDataFrame(geometry=gpd.points_from_xy(dst_x, dst_y), crs='EPSG:4326')
src_gdf = gpd.sjoin(src_gdf, wards, how='inner', op='intersects')
dst_gdf = gpd.sjoin(dst_gdf, wards, how='inner', op='intersects')
src_selected = src_gdf[:100].reset_index()
dst_selected = dst_gdf[:100].reset_index()
fig, ax = plt.subplots(figsize=(10,10))
wards['geometry'].plot(color='grey',ax=ax)
src_selected.geometry.plot(color='green',ax=ax)
dst_selected.geometry.plot(color='red',ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x200333bb488>
start_date = datetime.date(2020, 1, 1)
end_date = datetime.date(2020, 3, 31)
days = (end_date - start_date).days
random.seed(0)
random_dates = [start_date + datetime.timedelta(days=random.randrange(days))
for _ in range(100)]
months = [x.month for x in random_dates]
days = [x.day for x in random_dates]
dows = [x.weekday() for x in random_dates]
random_time_periods = [random.choice(range(1, 6)) for _ in range(100)]
data = pd.DataFrame({
'sourceid': src_selected['MOVEMENT_ID'],
'dstid': dst_selected['MOVEMENT_ID'],
'month': months,
'day': days,
'time_period': random_time_periods,
'dow': dows,
'src_lon': src_selected.geometry.x,
'src_lat': src_selected.geometry.y,
'dst_lon': dst_selected.geometry.x,
'dst_lat': dst_selected.geometry.y,
})
As before, we use OSRM to get the driving distance for each pair of points. The resulting data is saved locally and used in the subsequent analysis.
osrm_data_folder = os.path.join('data', 'osrm')
model_test_file = 'model_test.csv'
model_test_filepath = os.path.join(osrm_data_folder, model_test_file)
model_test = pd.read_csv(model_test_filepath)
model_test
| | sourceid | dstid | month | day | time_period | dow | src_lon | src_lat | dst_lon | dst_lat | distance |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 91 | 184 | 2 | 19 | 2 | 2 | 77.637890 | 12.930561 | 77.590090 | 12.888096 | 9686.2 |
1 | 91 | 184 | 2 | 23 | 1 | 6 | 77.644817 | 12.944129 | 77.586162 | 12.885430 | 14531.8 |
2 | 91 | 165 | 1 | 6 | 1 | 0 | 77.643652 | 12.941535 | 77.761146 | 12.935575 | 16056.7 |
3 | 10 | 165 | 2 | 3 | 5 | 0 | 77.655367 | 12.950985 | 77.713831 | 12.937901 | 10362.1 |
4 | 10 | 165 | 3 | 6 | 4 | 4 | 77.658390 | 12.949876 | 77.716294 | 12.927034 | 8051.9 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 185 | 49 | 3 | 1 | 1 | 6 | 77.556457 | 12.861758 | 77.572991 | 13.007320 | 20097.0 |
96 | 159 | 49 | 1 | 9 | 5 | 3 | 77.588550 | 12.909991 | 77.570795 | 12.997582 | 12067.3 |
97 | 76 | 178 | 1 | 12 | 4 | 6 | 77.648405 | 13.006604 | 77.661723 | 12.881504 | 20443.1 |
98 | 193 | 178 | 3 | 27 | 2 | 4 | 77.597410 | 12.915178 | 77.650824 | 12.889310 | 8032.5 |
99 | 11 | 115 | 1 | 17 | 1 | 4 | 77.557797 | 13.049415 | 77.526487 | 12.968088 | 12629.0 |
100 rows × 11 columns
To validate our model against real-world data, we collected reference travel times from Google Maps. Google Maps allows one to set a specific departure time in the past and get a range of travel times. We used our randomly generated source and destination pairs along with random departure times to collect the reference data.
reference_data_folder = os.path.join('data', 'googlemaps')
reference_file = 'googlemaps_traveltimes.csv'
reference_filepath = os.path.join(reference_data_folder, reference_file)
reference_data = pd.read_csv(reference_filepath)
reference_data
| | sourceid | dstid | month | day | time_period | dow | src_lon | src_lat | dst_lon | dst_lat | distance | goog_distance | goog_min | goog_max |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 91 | 184 | 2 | 19 | 2 | 2 | 77.637890 | 12.930561 | 77.590090 | 12.888096 | 9686.2 | 10700 | 22 | 40 |
1 | 91 | 184 | 2 | 23 | 1 | 6 | 77.644817 | 12.944129 | 77.586162 | 12.885430 | 14531.8 | 15700 | 30 | 40 |
2 | 91 | 165 | 1 | 6 | 1 | 0 | 77.643652 | 12.941535 | 77.761146 | 12.935575 | 16056.7 | 16200 | 30 | 40 |
3 | 10 | 165 | 2 | 3 | 5 | 0 | 77.655367 | 12.950985 | 77.713831 | 12.937901 | 10362.1 | 11000 | 20 | 35 |
4 | 10 | 165 | 3 | 6 | 4 | 4 | 77.658390 | 12.949876 | 77.716294 | 12.927034 | 8051.9 | 10300 | 24 | 40 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 185 | 49 | 3 | 1 | 1 | 6 | 77.556457 | 12.861758 | 77.572991 | 13.007320 | 20097.0 | 20300 | 40 | 50 |
96 | 159 | 49 | 1 | 9 | 5 | 3 | 77.588550 | 12.909991 | 77.570795 | 12.997582 | 12067.3 | 11900 | 22 | 40 |
97 | 76 | 178 | 1 | 12 | 4 | 6 | 77.648405 | 13.006604 | 77.661723 | 12.881504 | 20443.1 | 24000 | 45 | 80 |
98 | 193 | 178 | 3 | 27 | 2 | 4 | 77.597410 | 12.915178 | 77.650824 | 12.889310 | 8032.5 | 8000 | 18 | 35 |
99 | 11 | 115 | 1 | 17 | 1 | 4 | 77.557797 | 13.049415 | 77.526487 | 12.968088 | 12629.0 | 13400 | 24 | 28 |
100 rows × 14 columns
def predict_time(row):
    # Same feature order as used for training; the prediction is converted to minutes
    features = row[['sourceid', 'dstid', 'day', 'time_period', 'dow', 'src_lon', 'src_lat', 'dst_lon', 'dst_lat', 'distance']]
    return round(regressor.predict([features])[0]/60)
reference_data['predicted'] = reference_data.apply(predict_time, axis=1)
results = reference_data[['sourceid', 'dstid', 'distance', 'goog_distance', 'goog_min', 'goog_max', 'predicted']].copy()
results['within_range'] = np.where(
(results['predicted'] <= results['goog_max'])
& (results['predicted'] >= results['goog_min']), 'Y', 'N')
results.head(40)
| | sourceid | dstid | distance | goog_distance | goog_min | goog_max | predicted | within_range |
---|---|---|---|---|---|---|---|---|
0 | 91 | 184 | 9686.2 | 10700 | 22 | 40 | 29 | Y |
1 | 91 | 184 | 14531.8 | 15700 | 30 | 40 | 26 | N |
2 | 91 | 165 | 16056.7 | 16200 | 30 | 40 | 34 | Y |
3 | 10 | 165 | 10362.1 | 11000 | 20 | 35 | 29 | Y |
4 | 10 | 165 | 8051.9 | 10300 | 24 | 40 | 31 | Y |
5 | 10 | 165 | 10610.8 | 10700 | 18 | 35 | 30 | Y |
6 | 10 | 165 | 11898.9 | 11500 | 22 | 40 | 37 | Y |
7 | 177 | 165 | 20806.9 | 21200 | 40 | 60 | 52 | Y |
8 | 15 | 20 | 17683.4 | 16800 | 30 | 65 | 40 | Y |
9 | 15 | 20 | 20441.6 | 18800 | 35 | 60 | 45 | Y |
10 | 15 | 82 | 13335.8 | 16000 | 24 | 40 | 25 | Y |
11 | 169 | 148 | 25992.0 | 26400 | 55 | 110 | 70 | Y |
12 | 169 | 148 | 23628.8 | 23200 | 40 | 65 | 59 | Y |
13 | 169 | 180 | 29579.8 | 28500 | 55 | 100 | 74 | Y |
14 | 169 | 180 | 31056.3 | 29200 | 65 | 120 | 81 | Y |
15 | 169 | 180 | 31774.2 | 30000 | 60 | 110 | 82 | Y |
16 | 168 | 76 | 15065.9 | 19500 | 35 | 55 | 42 | Y |
17 | 1 | 154 | 22362.3 | 23800 | 40 | 50 | 41 | Y |
18 | 176 | 198 | 14221.3 | 14100 | 28 | 50 | 38 | Y |
19 | 196 | 198 | 26543.1 | 33500 | 50 | 85 | 63 | Y |
20 | 196 | 198 | 32441.9 | 34800 | 60 | 75 | 65 | Y |
21 | 196 | 198 | 30500.7 | 30700 | 50 | 90 | 73 | Y |
22 | 5 | 198 | 34863.5 | 34000 | 55 | 80 | 78 | Y |
23 | 194 | 144 | 9954.9 | 10800 | 26 | 50 | 31 | Y |
24 | 198 | 35 | 21076.7 | 21300 | 35 | 50 | 51 | N |
25 | 198 | 188 | 9760.8 | 10200 | 18 | 35 | 24 | Y |
26 | 198 | 188 | 14291.3 | 14300 | 24 | 28 | 26 | Y |
27 | 198 | 55 | 20778.0 | 21500 | 40 | 75 | 49 | Y |
28 | 198 | 13 | 33244.1 | 32800 | 55 | 70 | 67 | Y |
29 | 32 | 13 | 19364.5 | 19600 | 30 | 60 | 43 | Y |
30 | 32 | 13 | 21956.7 | 21500 | 40 | 75 | 48 | Y |
31 | 180 | 13 | 27069.0 | 27500 | 45 | 80 | 63 | Y |
32 | 66 | 13 | 15000.5 | 17400 | 28 | 50 | 37 | Y |
33 | 81 | 13 | 22482.0 | 23600 | 40 | 65 | 58 | Y |
34 | 78 | 192 | 22069.3 | 21800 | 40 | 50 | 41 | Y |
35 | 78 | 192 | 20063.5 | 19400 | 40 | 50 | 41 | Y |
36 | 78 | 166 | 17271.0 | 16500 | 28 | 45 | 45 | Y |
37 | 195 | 166 | 16745.3 | 15100 | 26 | 40 | 43 | N |
38 | 2 | 166 | 30049.5 | 31200 | 45 | 60 | 59 | Y |
39 | 179 | 166 | 16929.2 | 16900 | 30 | 55 | 44 | Y |
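Of the 40 rows shown, 37 fall within the Google Maps range. The share of predictions within range can be computed directly rather than read off the table. The sketch below uses the first five rows of the table above; the sample frame and the share_within_range name are introduced here for illustration. Series.between is inclusive at both ends, matching the <= and >= comparisons used for within_range.

```python
import pandas as pd

# First five rows of the results table above.
sample = pd.DataFrame({
    'goog_min': [22, 30, 30, 20, 24],
    'goog_max': [40, 40, 40, 35, 40],
    'predicted': [29, 26, 34, 29, 31],
})

# True where the prediction lies inside the Google Maps range (bounds included).
in_range = sample['predicted'].between(sample['goog_min'], sample['goog_max'])
share_within_range = in_range.mean()
```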
To demonstrate the use of our technique in a real-world application, we show how it can be used in a real-time routing application.
from dotenv import load_dotenv
load_dotenv()
ORS_API_KEY = os.getenv('ORS_API_KEY')
def get_driving_route(source_coordinates, dest_coordinates):
    # Inputs are (lat, lon) tuples; the API expects 'lon,lat' strings
    parameters = {
        'api_key': ORS_API_KEY,
        'start' : '{},{}'.format(source_coordinates[1], source_coordinates[0]),
        'end' : '{},{}'.format(dest_coordinates[1], dest_coordinates[0])
    }
    response = requests.get(
        'https://api.openrouteservice.org/v2/directions/driving-car', params=parameters)
    if response.status_code == 200:
        data = response.json()
        return data
    else:
        # Fail loudly instead of returning a sentinel that the caller would try to index
        raise RuntimeError('Request failed with status {}'.format(response.status_code))
def get_ward(coordinates):
    # coordinates is a (lat, lon) tuple
    df = pd.DataFrame({'x': [coordinates[1]], 'y': [coordinates[0]]})
    src_gdf = gpd.GeoDataFrame(geometry=gpd.points_from_xy(df.x, df.y), crs='EPSG:4326')
    src_gdf = gpd.sjoin(src_gdf, wards, how='inner', op='intersects')
    # Take the first match by position rather than by index label
    return int(src_gdf['MOVEMENT_ID'].iloc[0])
def get_route(source, destination, departure_time):
    # source and destination are (lat, lon) tuples
    sourceid = get_ward(source)
    dstid = get_ward(destination)
    day = departure_time.day
    time_period = get_time_period(departure_time.hour)
    dow = departure_time.weekday()
    driving_data = get_driving_route(source, destination)
    summary = driving_data['features'][0]['properties']['summary']
    distance = summary['distance']
    # Same feature order as used for training
    features = [sourceid, dstid, day, time_period, dow, source[1], source[0], destination[1], destination[0], distance]
    travel_time = round(regressor.predict([features])[0]/60)
    ors_travel_time = round(summary['duration']/60)
    route = driving_data['features'][0]['geometry']['coordinates']
    # The route comes back as [lon, lat] pairs; folium expects [lat, lon]
    route = [[coord[1], coord[0]] for coord in route]
    m = folium.Map(location=[(source[0] + destination[0])/2,(source[1] + destination[1])/2], zoom_start=13)
    tooltip = 'Model predicted time = {} mins, Default travel time = {} mins'.format(
        travel_time, ors_travel_time)
    folium.PolyLine(
        route,
        weight=8,
        color='blue',
        opacity=0.6,
        tooltip=tooltip
    ).add_to(m)
    folium.Marker(
        location=(source[0],source[1]),
        icon=folium.Icon(icon='play',color='green')
    ).add_to(m)
    folium.Marker(
        location=(destination[0],destination[1]),
        icon=folium.Icon(icon='stop',color='red')
    ).add_to(m)
    return m
We pick a set of coordinates within the city and show how to get turn-by-turn directions using OpenRouteService API and predict the travel-time using our model.
source = 12.946538, 77.579975
destination = 12.994029, 77.661008
departure_time = datetime.datetime.now()
get_route(source, destination, departure_time)
We can check how the model performs by comparing with the travel time predicted by Google Maps.
import webbrowser
url='https://www.google.com/maps/dir/{},{}/{},{}'.format(source[0],source[1],destination[0],destination[1])
webbrowser.open(url)
True