Travel time prediction in Indian Metro cities using Uber Movement data and OpenStreetMap¶

Uber provides anonymized and aggregated travel time data through the Uber Movement platform for many cities across the world. For India, current and historic data is available for 5 cities: Bangalore, Hyderabad, New Delhi, Mumbai and Kolkata. The platform also provides the ward boundaries as a JSON file.

OpenStreetMap (OSM) is a free, editable map of the whole world that is being built by volunteers largely from scratch and released with an open-content license. OSM data includes a global navigable street network dataset. Several services exist that provide routing and network analysis on top of this data.

In this project, we use the open travel time dataset from Uber and leverage open-source routing services for OpenStreetMap to build a fairly accurate model for travel time within each of the metro cities in India. We show that the rich ecosystem of Python geospatial libraries makes it straightforward to consume, process, and visualize large amounts of geospatial data and to incorporate it into a machine learning model.

Open datasets

Uber Movement - Travel times and ward boundaries
OpenStreetMap

Python libraries

geopandas
shapely
matplotlib
folium
scikit-learn

Services

Open Source Routing Machine (OSRM)
OpenRouteService (ORS) API

In [16]:

import pandas as pd
import geopandas as gpd
import numpy as np
import requests
import shapely
import matplotlib.pyplot as plt
import datetime
import os
import math
import random
import folium
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor 
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from matplotlib import pyplot

%matplotlib inline

Reading Datasets¶

Travel Times¶

In [3]:

data_folder = os.path.join('data', 'uber')

The Uber Movement Travel Times data comes as a CSV file for each quarter. Here we are using the Travel Times By Date By Hour Buckets (All Days) dataset. This dataset includes the arithmetic mean, geometric mean, and standard deviations of the aggregated travel times between every pair of wards in the city, for every day of the quarter, grouped into time-of-day buckets. This is a large dataset with over 7M rows.

We import the data as a Pandas DataFrame and call convert_dtypes() to select the best datatypes for each column.

In [4]:

travel_times_file = 'bangalore-wards-2020-1-All-DatesByHourBucketsAggregate.csv'
travel_times_filepath = os.path.join(data_folder, travel_times_file)
travel_times= pd.read_csv(travel_times_filepath)
travel_times = travel_times.convert_dtypes()
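
As a small illustration of what convert_dtypes() does, the sketch below applies it to a made-up three-row frame. The values are toy data and only the column names echo the Uber file. A column of integers with one missing value is first read as float64, and convert_dtypes() then turns it into the nullable Int64 type.

```python
import pandas as pd

# Toy frame, not the Uber data: the missing value forces 'sourceid' to float64.
toy = pd.DataFrame({'sourceid': [102, None, 8],
                    'mean_travel_time': [322.8, 306.71, 282.94]})
before = toy.dtypes.astype(str).to_dict()

# convert_dtypes() picks nullable extension dtypes that can hold pd.NA.
toy = toy.convert_dtypes()
after = toy.dtypes.astype(str).to_dict()
print(before)
print(after)
```

The missing value is kept as a missing value after the conversion, which is why the integer columns in the output below are reported as Int64 rather than int64.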

In [39]:

travel_times.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7640967 entries, 0 to 7640966
Data columns (total 17 columns):
 #   Column                                    Dtype  
---  ------                                    -----  
 0   sourceid                                  Int64  
 1   dstid                                     Int64  
 2   month                                     Int64  
 3   day                                       Int64  
 4   start_hour                                Int64  
 5   end_hour                                  Int64  
 6   mean_travel_time                          Float64
 7   standard_deviation_travel_time            Float64
 8   geometric_mean_travel_time                Float64
 9   geometric_standard_deviation_travel_time  Float64
 10  time_period                               int64  
 11  travel_time                               Float64
 12  src_lon                                   float64
 13  src_lat                                   float64
 14  dst_lon                                   float64
 15  dst_lat                                   float64
 16  distance                                  float64
dtypes: Float64(5), Int64(6), float64(5), int64(1)
memory usage: 1.1 GB

Ward Boundaries¶

The travel times dataset contains details of travel between zones. For Indian cities, the zones are wards as defined by the local municipal corporation. This data comes as a GeoJSON file that contains the polygon representation of each ward. We use geopandas to read the file as a GeoDataFrame.

In [6]:

wards_file = 'bangalore_wards.json'
wards_filepath = os.path.join(data_folder, wards_file)
wards = gpd.read_file(wards_filepath)

In [7]:

wards

Out[7]:

	WARD_NO	WARD_NAME	MOVEMENT_ID	DISPLAY_NAME	geometry
0	2	Chowdeswari Ward	1	Unnamed Road, Bengaluru	MULTIPOLYGON (((77.59229 13.09720, 77.59094 13...
1	3	Atturu	2	9th Cross Bhel Layout, Adityanagar, Vidyaranya...	MULTIPOLYGON (((77.56862 13.12705, 77.57064 13...
2	4	Yelahanka Satellite Town	3	15th A Cross Road, Yelahanka Satellite Town, Y...	MULTIPOLYGON (((77.59094 13.09842, 77.59229 13...
3	51	Vijnanapura	4	SP Naidu Layout 4th Cross Street, SP Naidu Lay...	MULTIPOLYGON (((77.67683 13.01147, 77.67695 13...
4	53	Basavanapura	5	Medahalli Kadugodi Road, Bharathi Nagar, Krish...	MULTIPOLYGON (((77.72899 13.02061, 77.72994 13...
...	...	...	...	...	...
193	172	Madivala	194	0 1st B Cross Road, Cashier Layout, 1st Stage,...	MULTIPOLYGON (((77.61399 12.92347, 77.61419 12...
194	26	Ramamurthy Nagar	195	Kalkere-Agara Main Road, Horamavu Agara, Kalke...	MULTIPOLYGON (((77.68336 13.05192, 77.68384 13...
195	25	Horamavu	196	0 Horamavu Agara Main Road, 1st Block, Mallapp...	MULTIPOLYGON (((77.64931 13.07853, 77.64993 13...
196	86	Marathahalli	197	0 3rd Cross Road, Manjunatha Layout, Marathaha...	MULTIPOLYGON (((77.68549 12.94121, 77.68539 12...
197	198	Hemmigepura	198	BGS Road, Kodipalya, Bengaluru	MULTIPOLYGON (((77.49854 12.92574, 77.49854 12...

198 rows × 5 columns

In [11]:

fig, ax = plt.subplots(figsize=(10,10))
wards['geometry'].plot(color='grey',ax=ax)

Out[11]:

<AxesSubplot:>

Data Pre-Processing¶

Travel Times¶

In [49]:

np.random.seed(0)
travel_times = pd.concat([travel_times]*5, ignore_index=True)

In [50]:

travel_times['random'] = np.random.uniform(0, 1, len(travel_times))

In [51]:

travel_times['travel_time'] = np.exp(travel_times['random']*np.log(travel_times['geometric_standard_deviation_travel_time']) + np.log(travel_times['geometric_mean_travel_time']))
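
The expression above may be easier to read as geometric_mean * geometric_sd ** u, with u drawn uniformly from [0, 1]. The sketch below restates it as a standalone function; simulate_travel_time is a hypothetical helper name used only for illustration, and the inputs are taken from the first row of the table shown further down.

```python
import numpy as np

def simulate_travel_time(geo_mean, geo_sd, u):
    """Return exp(u * ln(geo_sd) + ln(geo_mean)), i.e. geo_mean * geo_sd ** u."""
    return np.exp(u * np.log(geo_sd) + np.log(geo_mean))

# u = 0 gives back the geometric mean; u = 1 gives geo_mean * geo_sd.
low = simulate_travel_time(270.59, 1.7, 0.0)
high = simulate_travel_time(270.59, 1.7, 1.0)
# Random draw of the first row (0.548814).
first_row = simulate_travel_time(270.59, 1.7, 0.548814)
print(low, high, first_row)
```

Because u is never negative, the simulated value does not fall below the geometric mean whenever the geometric standard deviation is at least 1.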

The source data contains the travel times grouped by blocks of time (peak, off-peak, etc.), defined by the start_hour and end_hour columns. To make this easier to model, we add a time_period column and assign it an integer category value.

In [52]:

travel_times

Out[52]:

	sourceid	dstid	month	day	start_hour	end_hour	mean_travel_time	standard_deviation_travel_time	geometric_mean_travel_time	geometric_standard_deviation_travel_time	time_period	travel_time	src_lon	src_lat	dst_lon	dst_lat	distance	random
0	102	97	3	13	10	16	322.8	425.14	270.59	1.7	3	362.063898	77.563817	12.982784	77.566287	12.970359	1778.9	0.548814
1	102	97	1	17	19	0	306.71	200.99	256.58	1.84	5	396.842057	77.563817	12.982784	77.566287	12.970359	1778.9	0.715189
2	102	97	2	7	19	0	282.94	206.01	233.29	2.0	5	354.279465	77.563817	12.982784	77.566287	12.970359	1778.9	0.602763
3	102	97	1	19	7	10	294.18	183.97	258.09	1.61	2	334.554688	77.563817	12.982784	77.566287	12.970359	1778.9	0.544883
4	102	97	2	9	7	10	263.55	149.79	232.89	1.67	2	289.404809	77.563817	12.982784	77.566287	12.970359	1778.9	0.423655
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
38204830	162	83	3	2	16	19	3120.83	459.43	3089.7	1.15	4	3091.908238	77.587545	12.924545	77.599077	13.005976	11869.0	0.005112
38204831	8	127	2	2	10	16	2943.88	338.43	2924.38	1.12	3	3128.379559	77.672583	12.994474	77.548812	12.963703	17072.9	0.595019
38204832	34	95	2	1	19	0	2624.09	863.77	2493.34	1.37	5	3387.562808	77.506307	13.038050	77.632594	12.973725	19533.4	0.973561
38204833	56	120	1	19	10	16	2366.86	349.15	2340.33	1.16	3	2635.211006	77.603741	13.035174	77.547044	12.972093	13119.4	0.799564
38204834	128	25	1	24	19	0	2135.83	604.69	2033.07	1.39	5	2273.702852	77.551941	12.960545	77.553857	13.026029	10127.4	0.339695

38204835 rows × 18 columns

In [53]:

categories_to_hour = {
    1: [0, 6],
    2: [7, 9],
    3: [10, 15],
    4: [16, 18],
    5: [19, 23]
}

def get_time_period(hour):
    for category, (start_hour, end_hour) in categories_to_hour.items():
        if hour >= start_hour and hour <= end_hour:
            return category

In [54]:

travel_times['time_period'] = travel_times['start_hour'].apply(get_time_period)
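
On a frame of this size, a row-wise apply can be slow. As a possible alternative, and only a sketch on toy data, the same hour-to-category mapping can be expressed with pd.cut, using bin edges chosen to reproduce the ranges in categories_to_hour.

```python
import pandas as pd

# One start hour from each bucket in the source data.
start_hours = pd.Series([0, 7, 10, 16, 19])

# Right-inclusive bins: (-1, 6], (6, 9], (9, 15], (15, 18], (18, 23]
periods = pd.cut(start_hours, bins=[-1, 6, 9, 15, 18, 23],
                 labels=[1, 2, 3, 4, 5]).astype(int)
print(periods.tolist())
```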

Travel time has a strong correlation with the day of the week, so we compute a new column dow from the day and month columns. The weekday() method returns 0 for Monday through 6 for Sunday.

In [55]:

year = 2020

def get_dow(row):
    return datetime.date(year, int(row['month']), int(row['day'])).weekday()

In [ ]:

travel_times['dow'] = travel_times.apply(get_dow, axis=1)
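
The row-wise apply above calls get_dow once per row. A vectorized variant is sketched below on three toy rows; it assumes the year is fixed at 2020 as in the cell above, and relies on pd.to_datetime accepting a frame with year, month and day columns.

```python
import pandas as pd

# Three (month, day) pairs that also appear in the travel times table.
toy = pd.DataFrame({'month': [3, 1, 2], 'day': [13, 19, 2]})

# Assemble full dates from the year/month/day columns, then take the weekday
# (Monday = 0 ... Sunday = 6, the same convention as date.weekday()).
dates = pd.to_datetime(toy[['month', 'day']].assign(year=2020))
toy['dow'] = dates.dt.dayofweek
print(toy)
```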

In [ ]:

travel_times

Ward Boundaries¶

For modeling purposes, we represent each ward by its centroid. We use the GeoPandas centroid attribute to get the point geometry for each ward.

Our source data comes in EPSG:4326 (WGS 84 geographic coordinates), which is not suitable for geoprocessing operations. To compute the centroids accurately, we first re-project the data to a planar projection. We use a UTM projection suitable for the region, WGS 84 / UTM Zone 43N, which is defined by the code EPSG:32643. Once computed, we transform the centroids back to EPSG:4326 and add them to our GeoDataFrame.

In [30]:

centroid_utm = wards.geometry.to_crs('EPSG:32643').centroid
wards['centroid'] = centroid_utm.to_crs('EPSG:4326')

In [31]:

fig, ax = plt.subplots(figsize=(10,10))
wards['geometry'].plot(color='grey',ax=ax)
wards['centroid'].plot(color='red',ax=ax)

Out[31]:

<matplotlib.axes._subplots.AxesSubplot at 0x22370fe8108>

Distance Computation¶

We have the travel times for each pair of source and destination wards. The travel time is strongly correlated with the distance between the wards, so we need to compute the actual distance along the road network between each pair of ward centroids.

In [32]:

ward_no = wards['MOVEMENT_ID']
index = pd.MultiIndex.from_product([ward_no, ward_no], names = ['sourceid', 'dstid'])

distancematrix = pd.DataFrame(index = index).reset_index()
distancematrix = distancematrix.query('sourceid != dstid')

In [33]:

def get_coordinates(row):
    source_ward = wards[wards['MOVEMENT_ID'] == row['sourceid']].iloc[0]
    dst_ward = wards[wards['MOVEMENT_ID'] == row['dstid']].iloc[0]

    src_lon, src_lat = source_ward['centroid'].x, source_ward['centroid'].y
    dst_lon, dst_lat = dst_ward['centroid'].x, dst_ward['centroid'].y
    return src_lon, src_lat, dst_lon, dst_lat

In [34]:

distancematrix[['src_lon', 'src_lat', 'dst_lon', 'dst_lat']] = distancematrix.apply(get_coordinates, axis=1, result_type='expand')

In [35]:

distancematrix

Out[35]:

	sourceid	dstid	src_lon	src_lat	dst_lon	dst_lat
1	1	2	77.580422	13.121709	77.560037	13.102805
2	1	3	77.580422	13.121709	77.583926	13.090987
3	1	4	77.580422	13.121709	77.669565	13.006063
4	1	5	77.580422	13.121709	77.715456	13.016847
5	1	6	77.580422	13.121709	77.705502	13.022373
...	...	...	...	...	...	...
39198	198	193	77.505015	12.891903	77.594507	12.910882
39199	198	194	77.505015	12.891903	77.614418	12.920018
39200	198	195	77.505015	12.891903	77.676539	13.033613
39201	198	196	77.505015	12.891903	77.653272	13.044560
39202	198	197	77.505015	12.891903	77.691495	12.950743

39006 rows × 6 columns
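
The get_coordinates function filters the wards table twice for every row. One possible alternative, sketched here with two centroids copied from the output above, is to index the centroid coordinates by MOVEMENT_ID once and join them onto the pairs. The centroids and pairs frames are illustrative stand-ins, not objects from this notebook.

```python
import pandas as pd

# Centroid coordinates of wards 1 and 2, keyed by MOVEMENT_ID.
centroids = pd.DataFrame({'MOVEMENT_ID': [1, 2],
                          'lon': [77.580422, 77.560037],
                          'lat': [13.121709, 13.102805]}).set_index('MOVEMENT_ID')
pairs = pd.DataFrame({'sourceid': [1, 2], 'dstid': [2, 1]})

# Join once for the source columns and once for the destination columns.
pairs = pairs.join(centroids.add_prefix('src_'), on='sourceid')
pairs = pairs.join(centroids.add_prefix('dst_'), on='dstid')
print(pairs)
```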

We need to get the driving distance for approximately 39,000 source-destination pairs. To do this efficiently, we ran the Open Source Routing Machine (OSRM) service locally using the Docker images provided by the project. OSRM holds the network graph in memory, so routing is extremely fast. We write and apply the following function to get the driving distance in meters.

In [36]:

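def get_distance(row):
    # OSRM expects coordinates in lon,lat order, with points separated by ';'
    coordinates = '{},{};{},{}'.format(
        row['src_lon'], row['src_lat'], row['dst_lon'], row['dst_lat'])
    url = 'http://127.0.0.1:5000/route/v1/driving/'
    response = requests.get(url + coordinates)
    # Default to NaN so that a failed request does not raise UnboundLocalError
    distance = np.nan
    if response.status_code == 200:
        data = response.json()
        # Distance of the first returned route, in meters
        distance = data['routes'][0]['distance']

    return distance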
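
The request and the response handling can be tried without a running server. The sketch below separates them into two hypothetical helpers, build_route_url and parse_distance, and feeds the parser a hand-written dictionary that mimics the shape of an OSRM route response. The port and profile are the ones used above; the sample response is not real output.

```python
def build_route_url(src_lon, src_lat, dst_lon, dst_lat,
                    base='http://127.0.0.1:5000/route/v1/driving/'):
    """Build an OSRM route URL; coordinates go in lon,lat order."""
    coordinates = '{},{};{},{}'.format(src_lon, src_lat, dst_lon, dst_lat)
    # overview=false omits the route geometry, which is not needed for distances.
    return base + coordinates + '?overview=false'

def parse_distance(data):
    """Return the first route's distance in meters, or None if no route."""
    if data.get('code') != 'Ok' or not data.get('routes'):
        return None
    return data['routes'][0]['distance']

# Hand-written stand-in for an OSRM response.
sample = {'code': 'Ok', 'routes': [{'distance': 1778.9, 'duration': 250.0}]}
url = build_route_url(77.563817, 12.982784, 77.566287, 12.970359)
print(url)
print(parse_distance(sample))
```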

The resulting distance data is saved locally and used in the subsequent analysis.

In [33]:

osrm_data_folder = os.path.join('data', 'osrm')
distancematrix_file = 'distancematrix.csv'
distancematrix_filepath = os.path.join(osrm_data_folder, distancematrix_file)
distancematrix = pd.read_csv(distancematrix_filepath)

In [34]:

travel_times = pd.merge(travel_times, distancematrix, on=['sourceid', 'dstid']) 
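
When joining a large table to a lookup table, it can help to check that the lookup holds each ward pair only once and that no row is left without a distance. The sketch below shows one way to do this on toy frames; note that it uses how='left' and validate='many_to_one', which differ from the default inner merge used above.

```python
import pandas as pd

times = pd.DataFrame({'sourceid': [102, 102, 8], 'dstid': [97, 97, 127],
                      'mean_travel_time': [322.8, 306.71, 2943.88]})
distances = pd.DataFrame({'sourceid': [102, 8], 'dstid': [97, 127],
                          'distance': [1778.9, 17072.9]})

# validate='many_to_one' raises MergeError if a ward pair repeats in distances.
merged = pd.merge(times, distances, on=['sourceid', 'dstid'],
                  how='left', validate='many_to_one')

# With a left merge, unmatched pairs show up as missing distances.
missing = int(merged['distance'].isna().sum())
print(len(merged), missing)
```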

In [39]:

travel_times

Out[39]:

	sourceid	dstid	month	day	start_hour	end_hour	mean_travel_time	standard_deviation_travel_time	geometric_mean_travel_time	geometric_standard_deviation_travel_time	time_period	dow	src_lon	src_lat	dst_lon	dst_lat	distance
0	102	97	3	13	10	16	322.80	425.14	270.59	1.70	3	4	77.563817	12.982784	77.566287	12.970359	1778.9
1	102	97	1	17	19	0	306.71	200.99	256.58	1.84	5	4	77.563817	12.982784	77.566287	12.970359	1778.9
2	102	97	2	7	19	0	282.94	206.01	233.29	2.00	5	4	77.563817	12.982784	77.566287	12.970359	1778.9
3	102	97	1	19	7	10	294.18	183.97	258.09	1.61	2	6	77.563817	12.982784	77.566287	12.970359	1778.9
4	102	97	2	9	7	10	263.55	149.79	232.89	1.67	2	6	77.563817	12.982784	77.566287	12.970359	1778.9
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7640962	162	83	3	2	16	19	3120.83	459.43	3089.70	1.15	4	0	77.587545	12.924545	77.599077	13.005976	11869.0
7640963	8	127	2	2	10	16	2943.88	338.43	2924.38	1.12	3	6	77.672583	12.994474	77.548812	12.963703	17072.9
7640964	34	95	2	1	19	0	2624.09	863.77	2493.34	1.37	5	5	77.506307	13.038050	77.632594	12.973725	19533.4
7640965	56	120	1	19	10	16	2366.86	349.15	2340.33	1.16	3	6	77.603741	13.035174	77.547044	12.972093	13119.4
7640966	128	25	1	24	19	0	2135.83	604.69	2033.07	1.39	5	4	77.551941	12.960545	77.553857	13.026029	10127.4

7640967 rows × 17 columns

Data Modeling¶

We use the scikit-learn library to build and train a linear regressor.

The independent variables considered are sourceid, dstid, day, time_period, dow, src_lon, src_lat, dst_lon, dst_lat and distance. The dependent variable is travel_time, the value computed earlier from geometric_mean_travel_time and geometric_standard_deviation_travel_time. Of the independent variables, the categorical variables time_period and dow are one-hot encoded. We sample the travel times to get a subset that will be used for training.

In [35]:

num_samples = 500000
samples = travel_times.sample(n=num_samples, random_state=1)

In [36]:

sel_input=['sourceid', 'dstid', 'day', 'time_period', 'dow', 'src_lon', 'src_lat', 'dst_lon', 'dst_lat', 'distance']
cat_ip=['time_period','dow']
scale_ip= list(set(sel_input)-set(cat_ip))

In [37]:

x = samples[sel_input].values
y = samples['travel_time']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)


The time_period and dow columns are categorical. Their integer codes carry no numeric meaning for a linear model, so we one-hot encode them. The remaining columns are standardized with StandardScaler.

In [ ]:

category_Trans= ColumnTransformer([('encoder',OneHotEncoder(categories='auto', sparse=False),[sel_input.index(i) for i in cat_ip]),
                                   ('scaler',StandardScaler(),[sel_input.index(i) for i in scale_ip])],remainder='passthrough')

In [ ]:

regressor = Pipeline(steps=[('ct',category_Trans),('model',LinearRegression())])

In [38]:

regressor.fit(x_train,y_train)


In [50]:

print('Training Accuracy: ', regressor.score(x_train,y_train))

Training Accuracy:  0.7442950836977186

In [51]:

print('Prediction Accuracy: ', regressor.score(x_test,y_test))

Prediction Accuracy:  0.7441506633075343
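
For a scikit-learn regressor, score() returns the coefficient of determination R², so the two values printed above are R² scores rather than classification accuracies. The sketch below computes R² by hand on toy numbers to show how it behaves; r_squared is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Perfect predictions give 1; always predicting the mean gives 0.
perfect = r_squared([300, 600, 900], [300, 600, 900])
mean_only = r_squared([300, 600, 900], [600, 600, 600])
print(perfect, mean_only)
```

Read this way, a score of about 0.74 means the model explains roughly three quarters of the variance in travel_time on both partitions.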

Checking Model Performance¶

While the model performs well on the test partition, that dataset is not representative of real-world data. We want to see how the model performs on routing requests that are not between ward centroids. To achieve this, we create a dataset with random source and destination coordinates and check the model predictions against travel times predicted by commercial data providers such as Google Maps.

Random Points within a Polygon¶

We generate random coordinate pairs within the bounds of the city. But to ensure that the points fall within the actual city geometry, we do a spatial join to select the points that intersect the wards. After the join, we select a subset of 100 points.

In [31]:

n_points = 200

x_min, y_min, x_max, y_max = wards.total_bounds

np.random.seed(0)
src_x = np.random.uniform(x_min, x_max, n_points)
src_y = np.random.uniform(y_min, y_max, n_points)
dst_x = np.random.uniform(x_min, x_max, n_points)
dst_y = np.random.uniform(y_min, y_max, n_points)

src_gdf = gpd.GeoDataFrame(geometry=gpd.points_from_xy(src_x, src_y), crs='EPSG:4326')
dst_gdf = gpd.GeoDataFrame(geometry=gpd.points_from_xy(dst_x, dst_y), crs='EPSG:4326')

In [32]:

src_gdf = gpd.sjoin(src_gdf, wards, how='inner', op='intersects')
dst_gdf = gpd.sjoin(dst_gdf, wards, how='inner', op='intersects')

src_selected = src_gdf[:100].reset_index()
dst_selected = dst_gdf[:100].reset_index()

In [33]:

fig, ax = plt.subplots(figsize=(10,10))
wards['geometry'].plot(color='grey',ax=ax)
src_selected.geometry.plot(color='green',ax=ax)
dst_selected.geometry.plot(color='red',ax=ax)

Out[33]:

<matplotlib.axes._subplots.AxesSubplot at 0x200333bb488>

Random Days and Times¶

In [34]:

start_date = datetime.date(2020, 1, 1)
end_date = datetime.date(2020, 3, 31)
days = (end_date - start_date).days

random.seed(0) 
random_dates = [start_date + datetime.timedelta(days=random.randrange(days))
                for _ in range(100)]
months = [x.month for x in random_dates] 
days = [x.day for x in random_dates]
dows = [x.weekday() for x in random_dates]
random_time_periods = [random.choice(range(1, 6)) for _ in range(100)]

In [35]:

data = pd.DataFrame({
    'sourceid': src_selected['MOVEMENT_ID'],
    'dstid': dst_selected['MOVEMENT_ID'],
    'month': months,
    'day': days,
    'time_period': random_time_periods,
    'dow': dows,
    'src_lon': src_selected.geometry.x,
    'src_lat': src_selected.geometry.y,
    'dst_lon': dst_selected.geometry.x,
    'dst_lat': dst_selected.geometry.y,
})

Random Test Dataset¶

The driving distance for each random source and destination pair is added in the same way as before. The resulting data is saved locally and used in the subsequent analysis.

In [36]:

osrm_data_folder = os.path.join('data', 'osrm')
model_test_file = 'model_test.csv'
model_test_filepath = os.path.join(osrm_data_folder, model_test_file)
model_test = pd.read_csv(model_test_filepath)

In [37]:

model_test

Out[37]:

	sourceid	dstid	month	day	time_period	dow	src_lon	src_lat	dst_lon	dst_lat	distance
0	91	184	2	19	2	2	77.637890	12.930561	77.590090	12.888096	9686.2
1	91	184	2	23	1	6	77.644817	12.944129	77.586162	12.885430	14531.8
2	91	165	1	6	1	0	77.643652	12.941535	77.761146	12.935575	16056.7
3	10	165	2	3	5	0	77.655367	12.950985	77.713831	12.937901	10362.1
4	10	165	3	6	4	4	77.658390	12.949876	77.716294	12.927034	8051.9
...	...	...	...	...	...	...	...	...	...	...	...
95	185	49	3	1	1	6	77.556457	12.861758	77.572991	13.007320	20097.0
96	159	49	1	9	5	3	77.588550	12.909991	77.570795	12.997582	12067.3
97	76	178	1	12	4	6	77.648405	13.006604	77.661723	12.881504	20443.1
98	193	178	3	27	2	4	77.597410	12.915178	77.650824	12.889310	8032.5
99	11	115	1	17	1	4	77.557797	13.049415	77.526487	12.968088	12629.0

100 rows × 11 columns

Predicted vs. Reference Travel Times¶

To validate our model against real-world data, we collected reference travel times from Google Maps. Google Maps allows one to set a specific departure time in the past and get a range of travel times. We used our randomly generated source and destination pairs along with random departure times to collect the reference data.

In [38]:

reference_data_folder = os.path.join('data', 'googlemaps')
reference_file = 'googlemaps_traveltimes.csv'
reference_filepath = os.path.join(reference_data_folder, reference_file)
reference_data = pd.read_csv(reference_filepath)

In [39]:

reference_data

Out[39]:

	sourceid	dstid	month	day	time_period	dow	src_lon	src_lat	dst_lon	dst_lat	distance	goog_distance	goog_min	goog_max
0	91	184	2	19	2	2	77.637890	12.930561	77.590090	12.888096	9686.2	10700	22	40
1	91	184	2	23	1	6	77.644817	12.944129	77.586162	12.885430	14531.8	15700	30	40
2	91	165	1	6	1	0	77.643652	12.941535	77.761146	12.935575	16056.7	16200	30	40
3	10	165	2	3	5	0	77.655367	12.950985	77.713831	12.937901	10362.1	11000	20	35
4	10	165	3	6	4	4	77.658390	12.949876	77.716294	12.927034	8051.9	10300	24	40
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
95	185	49	3	1	1	6	77.556457	12.861758	77.572991	13.007320	20097.0	20300	40	50
96	159	49	1	9	5	3	77.588550	12.909991	77.570795	12.997582	12067.3	11900	22	40
97	76	178	1	12	4	6	77.648405	13.006604	77.661723	12.881504	20443.1	24000	45	80
98	193	178	3	27	2	4	77.597410	12.915178	77.650824	12.889310	8032.5	8000	18	35
99	11	115	1	17	1	4	77.557797	13.049415	77.526487	12.968088	12629.0	13400	24	28

100 rows × 14 columns

In [40]:

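def predict_time(row):
    # Feature order must match sel_input; avoid shadowing the built-in input()
    features = row[['sourceid', 'dstid', 'day', 'time_period', 'dow', 'src_lon', 'src_lat', 'dst_lon', 'dst_lat', 'distance']]
    # The model predicts seconds; convert to whole minutes
    return round(regressor.predict([features])[0]/60)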

In [41]:

reference_data['predicted'] = reference_data.apply(predict_time, axis=1)
results = reference_data[['sourceid', 'dstid', 'distance', 'goog_distance', 'goog_min', 'goog_max', 'predicted']].copy()
results['within_range'] = np.where(
    (results['predicted'] <= results['goog_max'])
        & (results['predicted'] >= results['goog_min']), 'Y', 'N')
results.head(40)

Out[41]:

	sourceid	dstid	distance	goog_distance	goog_min	goog_max	predicted	within_range
0	91	184	9686.2	10700	22	40	29	Y
1	91	184	14531.8	15700	30	40	26	N
2	91	165	16056.7	16200	30	40	34	Y
3	10	165	10362.1	11000	20	35	29	Y
4	10	165	8051.9	10300	24	40	31	Y
5	10	165	10610.8	10700	18	35	30	Y
6	10	165	11898.9	11500	22	40	37	Y
7	177	165	20806.9	21200	40	60	52	Y
8	15	20	17683.4	16800	30	65	40	Y
9	15	20	20441.6	18800	35	60	45	Y
10	15	82	13335.8	16000	24	40	25	Y
11	169	148	25992.0	26400	55	110	70	Y
12	169	148	23628.8	23200	40	65	59	Y
13	169	180	29579.8	28500	55	100	74	Y
14	169	180	31056.3	29200	65	120	81	Y
15	169	180	31774.2	30000	60	110	82	Y
16	168	76	15065.9	19500	35	55	42	Y
17	1	154	22362.3	23800	40	50	41	Y
18	176	198	14221.3	14100	28	50	38	Y
19	196	198	26543.1	33500	50	85	63	Y
20	196	198	32441.9	34800	60	75	65	Y
21	196	198	30500.7	30700	50	90	73	Y
22	5	198	34863.5	34000	55	80	78	Y
23	194	144	9954.9	10800	26	50	31	Y
24	198	35	21076.7	21300	35	50	51	N
25	198	188	9760.8	10200	18	35	24	Y
26	198	188	14291.3	14300	24	28	26	Y
27	198	55	20778.0	21500	40	75	49	Y
28	198	13	33244.1	32800	55	70	67	Y
29	32	13	19364.5	19600	30	60	43	Y
30	32	13	21956.7	21500	40	75	48	Y
31	180	13	27069.0	27500	45	80	63	Y
32	66	13	15000.5	17400	28	50	37	Y
33	81	13	22482.0	23600	40	65	58	Y
34	78	192	22069.3	21800	40	50	41	Y
35	78	192	20063.5	19400	40	50	41	Y
36	78	166	17271.0	16500	28	45	45	Y
37	195	166	16745.3	15100	26	40	43	N
38	2	166	30049.5	31200	45	60	59	Y
39	179	166	16929.2	16900	30	55	44	Y
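
A single summary number makes the comparison easier to read. The sketch below computes the share of predictions that fall inside the Google Maps range, using only the first five rows of the table above as toy input; applying the same two lines to the full results frame would give the overall share.

```python
import pandas as pd

# First five rows of the results table above.
sample = pd.DataFrame({'goog_min': [22, 30, 30, 20, 24],
                       'goog_max': [40, 40, 40, 35, 40],
                       'predicted': [29, 26, 34, 29, 31]})

# between() is inclusive on both ends, matching the <= and >= test above.
within = sample['predicted'].between(sample['goog_min'], sample['goog_max'])
share = within.mean()
print(within.tolist(), share)
```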

Real-time Routing and Prediction¶

To demonstrate the use of our technique in a real-world application, we show how it can be used in a real-time routing application.

In [59]:

from dotenv import load_dotenv
load_dotenv()

ORS_API_KEY = os.getenv('ORS_API_KEY')

In [61]:

def get_driving_route(source_coordinates, dest_coordinates):
    # Inputs are (lat, lon) tuples; the API expects 'lon,lat' strings
    parameters = {
        'api_key': ORS_API_KEY,
        'start': '{},{}'.format(source_coordinates[1], source_coordinates[0]),
        'end': '{},{}'.format(dest_coordinates[1], dest_coordinates[0])
    }

    response = requests.get(
        'https://api.openrouteservice.org/v2/directions/driving-car', params=parameters)

    if response.status_code == 200:
        return response.json()
    # Fail loudly: callers index into the result, so a sentinel value would only fail later
    raise RuntimeError(
        'OpenRouteService request failed with status {}'.format(response.status_code))

In [62]:

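def get_ward(coordinates):
    # coordinates is a (lat, lon) tuple; points_from_xy expects x=lon, y=lat
    df = pd.DataFrame({'x': [coordinates[1]], 'y': [coordinates[0]]})
    src_gdf = gpd.GeoDataFrame(geometry=gpd.points_from_xy(df.x, df.y), crs='EPSG:4326')
    src_gdf = gpd.sjoin(src_gdf, wards, how='inner', op='intersects')
    # Positional access; raises IndexError if the point lies outside every ward
    return int(src_gdf['MOVEMENT_ID'].iloc[0])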

In [80]:

def get_route(source, destination, departure_time):
    sourceid = get_ward(source)
    dstid = get_ward(destination)
    day = departure_time.day
    time_period = get_time_period(departure_time.hour)
    dow = departure_time.weekday()
    driving_data = get_driving_route(source, destination)
    summary = driving_data['features'][0]['properties']['summary']
    distance = summary['distance']
    # Feature order must match sel_input; avoid shadowing the built-in input()
    features = [sourceid, dstid, day, time_period, dow, source[1], source[0], destination[1], destination[0], distance]
    travel_time = round(regressor.predict([features])[0]/60)
    ors_travel_time = round(summary['duration']/60)
    route= driving_data['features'][0]['geometry']['coordinates']
    
    def swap(coord):
        coord[0],coord[1]=coord[1],coord[0]
        return coord

    route=list(map(swap, route))
    m = folium.Map(location=[(source[0] + destination[0])/2,(source[1] + destination[1])/2], zoom_start=13)
    
    tooltip = 'Model predicted time = {} mins, \
        Default travel time = {} mins'.format(travel_time, ors_travel_time)
    folium.PolyLine(
        route,
        weight=8,
        color='blue',
        opacity=0.6,
        tooltip=tooltip
    ).add_to(m)

    folium.Marker(
        location=(source[0],source[1]),
        icon=folium.Icon(icon='play',color='green')
    ).add_to(m)

    folium.Marker(
        location=(destination[0],destination[1]),
        icon=folium.Icon(icon='stop',color='red')
    ).add_to(m)

    return m
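
The swap helper inside get_route reorders each coordinate pair in place, which also changes the list held in driving_data. A non-mutating variant is sketched below; to_lat_lon is a hypothetical name, and the two points are the source and destination used in the demo that follows. It assumes each coordinate has exactly two elements (longitude, latitude), as in the GeoJSON geometry read above.

```python
def to_lat_lon(route):
    """Return a new list of [lat, lon] pairs from GeoJSON-style [lon, lat] pairs."""
    return [[lat, lon] for lon, lat in route]

# GeoJSON order is [lon, lat]; folium expects [lat, lon].
ors_route = [[77.579975, 12.946538], [77.661008, 12.994029]]
folium_route = to_lat_lon(ors_route)
print(folium_route)
```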

Live Demo¶

We pick a set of coordinates within the city and show how to get turn-by-turn directions using OpenRouteService API and predict the travel-time using our model.

In [86]:

source = 12.946538, 77.579975
destination = 12.994029, 77.661008
departure_time = datetime.datetime.now()

In [87]:

get_route(source, destination, departure_time)

Out[87]:

We can check how the model performs by comparing with the travel time predicted by Google Maps.

In [88]:

import webbrowser

url='https://www.google.com/maps/dir/{},{}/{},{}'.format(source[0],source[1],destination[0],destination[1])
webbrowser.open(url)

Out[88]:

True
