Online prediction with BigQuery ML¶

ML.Predict in BigQuery ML is primarily meant for batch predictions. What if you want to build a web application to provide online predictions? Here, I show the basic Python code to do online prediction. You can wrap this code in AppEngine or other web framework/toolkit to provide scalable, fast, online predictions.

In [ ]:

!pip install google-cloud   # Reset Session after installing

In [3]:

PROJECT = 'cloud-training-demos'  # change as needed

Create a model¶

Let's start by creating a simple prediction model to predict arrival delays of aircraft. I'll use this to illustrate the process.

First, if necessary, create the BigQuery dataset that will store the output of the model.

In [ ]:

%bash
bq mk -d flights

Then, do a "CREATE MODEL". This will take about 5 minutes.

In [3]:

%bq query

CREATE OR REPLACE MODEL flights.arrdelay
OPTIONS
  (model_type='linear_reg', input_label_cols=['arr_delay']) AS
SELECT
  arr_delay,
  carrier,
  origin,
  dest,
  dep_delay,
  taxi_out,
  distance
FROM
  `cloud-training-demos.flights.tzcorr`
WHERE
  arr_delay IS NOT NULL

Out[3]:

Done

Batch prediction with model¶

Once you have a trained model, batch prediction can be done within BigQuery itself.

For example, to find the predicted arrival delays for a flight from DFW to LAX for a range of departure delays

In [9]:

%bq query
SELECT * FROM ml.PREDICT(MODEL flights.arrdelay, (
SELECT 
  'AA' as carrier,
  'DFW' as origin,
  'LAX' as dest,
  dep_delay,
  18 as taxi_out,
  1235 as distance
FROM
  UNNEST(GENERATE_ARRAY(-3, 10)) as dep_delay
))

Out[9]:

predicted_arr_delay	carrier	origin	dest	dep_delay	taxi_out	distance
-8.31085723761	AA	DFW	LAX	-3	18	1235
-7.32016373789	AA	DFW	LAX	-2	18	1235
-6.32947023816	AA	DFW	LAX	-1	18	1235
-5.33877673844	AA	DFW	LAX	0	18	1235
-4.34808323871	AA	DFW	LAX	1	18	1235
-3.35738973899	AA	DFW	LAX	2	18	1235
-2.36669623926	AA	DFW	LAX	3	18	1235
-1.37600273954	AA	DFW	LAX	4	18	1235
-0.385309239811	AA	DFW	LAX	5	18	1235
0.605384259914	AA	DFW	LAX	6	18	1235
1.59607775964	AA	DFW	LAX	7	18	1235
2.58677125936	AA	DFW	LAX	8	18	1235
3.57746475909	AA	DFW	LAX	9	18	1235
4.56815825881	AA	DFW	LAX	10	18	1235

(rows: 14, time: 1.2s, 16KB processed, job: job_WVW0Xk2F-GX7NphFVwFZXwenWOIM)

Online prediction in Python¶

The batch prediction technique above can not be used for online prediction though. Typical BigQuery queries have a latency of 1-2 seconds and that is too high for a web application.

For online prediction, it is better to grab the weights and do the computation yourself.

In [22]:

%bq query
SELECT
  processed_input AS input,
  model.weight AS input_weight
FROM
  ml.WEIGHTS(MODEL flights.arrdelay) AS model

Out[22]:

input	input_weight
carrier
origin
dest
dep_delay	36.5569545237
taxi_out	8.15557957221
distance	-1.88324519311
__INTERCEPT__	1.09017737502

(rows: 7, time: 1.3s, 16KB processed, job: job_5YCmyG8UY5qfA8aNa9hy_RQBgzI1)

In [16]:

%bq query
SELECT
  processed_input AS input,
  model.weight AS input_weight,
  category.category AS category_name,
  category.weight AS category_weight
FROM
  ml.WEIGHTS(MODEL flights.arrdelay) AS model,
  UNNEST(category_weights) AS category

Out[16]:

input	category_name	category_weight
carrier	B6	2.41360895323
carrier	EV	2.27889474393
carrier	AA	-0.0548843604154
carrier	US	1.82939081237
carrier	DL	-1.28516116856
carrier	F9	5.76556618145
carrier	WN	2.18336049823
carrier	HA	6.17381375535
carrier	VX	5.05422774942
carrier	AS	2.022529139
carrier	UA	-1.9492217572
carrier	OO	1.11110679975
carrier	NK	7.09579009395
carrier	MQ	1.19620456586
origin	WRG	8.99721250183
origin	MSY	4.96015143256
origin	AMA	6.20849983281
origin	SAF	4.19530797642
origin	GUC	3.60929742614
origin	MKG	2.23621423776
origin	CHO	3.63391263959
origin	ESC	1.01325898826
origin	MCI	4.16187598831
origin	GSP	3.06866659933
origin	GJT	3.6366332815

(rows: 658, time: 1.1s, 16KB processed, job: job_h3LTu2B6REiDJWXHslEMZ9MyIjH_)

Here's how to do that in Python.

p.s. I'm assuming that you are in an environment where you are already authenticated with Google Cloud. If not, see this article on how to access BigQuery using private keys

In [4]:

def query_to_dataframe(query):
  import pandas as pd
  import pkgutil
  privatekey = None # pkgutil.get_data(KEYDIR, 'privatekey.json')
  return pd.read_gbq(query,
                     project_id=PROJECT,
                     dialect='standard',
                     private_key=privatekey)

You need to pull 3 pieces of information:

The weights for each of your numeric columns
The scaling for each of your numeric columns
The vocabulary and weights for each of your categorical columns

I pull them using three separate BigQuery queries below

In [27]:

numeric_query = """
SELECT
  processed_input AS input,
  model.weight AS input_weight
FROM
  ml.WEIGHTS(MODEL flights.arrdelay) AS model
"""
numeric_weights = query_to_dataframe(numeric_query).dropna()
numeric_weights

Requesting query... ok.
Job ID: 4f89a94d-11a1-4cfa-bbb2-f046ffe63df1
Query running...
Query done.
Cache hit.

Retrieving results...
Got 7 rows.

Total time taken 1.02 s.
Finished at 2018-08-12 04:20:46.

Out[27]:

	input	input_weight
3	dep_delay	36.556955
4	taxi_out	8.155580
5	distance	-1.883245
6	__INTERCEPT__	1.090177

In [84]:

scaling_query = """
SELECT
  input, min, max, mean, stddev
FROM
  ml.FEATURE_INFO(MODEL flights.arrdelay) AS model
"""
scaling_df = query_to_dataframe(scaling_query).dropna()
scaling_df

Requesting query... ok.
Job ID: a3d9e2fb-f940-48ba-a5aa-2e7645d82db1
Query running...
Query done.
Processed: 0.0 B Billed: 0.0 B
Standard price: $0.00 USD

Retrieving results...
Got 6 rows.

Total time taken 2.63 s.
Finished at 2018-08-12 05:00:15.

Out[84]:

	input	min	max	mean	stddev
3	dep_delay	-82.0	1988.0	9.169095	36.900368
4	taxi_out	1.0	225.0	16.099878	8.901454
5	distance	31.0	4983.0	825.795621	608.756947

In [25]:

categorical_query = """
SELECT
  processed_input AS input,
  model.weight AS input_weight,
  category.category AS category_name,
  category.weight AS category_weight
FROM
  ml.WEIGHTS(MODEL flights.arrdelay) AS model,
  UNNEST(category_weights) AS category
"""
categorical_weights = query_to_dataframe(categorical_query)
categorical_weights.head()

Requesting query... ok.
Job ID: b662c5fd-32af-4624-88db-d7aeed97300f
Query running...
Query done.
Cache hit.

Retrieving results...
Got 658 rows.

Total time taken 1.03 s.
Finished at 2018-08-12 04:17:36.

Out[25]:

	input	input_weight	category_name	category_weight
0	carrier	NaN	B6	2.413609
1	carrier	NaN	EV	2.278895
2	carrier	NaN	AA	-0.054884
3	carrier	NaN	US	1.829391
4	carrier	NaN	DL	-1.285161

With the three pieces of information in-place, you can simply do the math for linear regression:

In [87]:

def compute_prediction(rowdict, numeric_weights, scaling_df, categorical_weights):
  input_values = rowdict
  # numeric inputs
  pred = 0
  for column_name in numeric_weights['input'].unique():
    wt = numeric_weights[ numeric_weights['input'] == column_name ]['input_weight'].values[0]
    if column_name != '__INTERCEPT__':
      #minv = scaling_df[ scaling_df['input'] == column_name ]['min'].values[0]
      #maxv = scaling_df[ scaling_df['input'] == column_name ]['max'].values[0]
      #scaled_value = (input_values[column_name] - minv)/(maxv - minv)
      meanv = scaling_df[ scaling_df['input'] == column_name ]['mean'].values[0]
      stddev = scaling_df[ scaling_df['input'] == column_name ]['stddev'].values[0]
      scaled_value = (input_values[column_name] - meanv)/stddev
    else:
      scaled_value = 1.0
    contrib = wt * scaled_value
    print('col={} wt={} scaled_value={} contrib={}'.format(column_name, wt, scaled_value, contrib))
    pred = pred + contrib
  # categorical inputs
  for column_name in categorical_weights['input'].unique():
    category_weights = categorical_weights[ categorical_weights['input'] == column_name ]
    wt = category_weights[ category_weights['category_name'] == input_values[column_name] ]['category_weight'].values[0]
    print('col={} wt={} value={} contrib={}'.format(column_name, wt, input_values[column_name], wt))
    pred = pred + wt
  return pred

Here is an example of the prediction code in action:

In [88]:

rowdict = {
  'carrier' : 'AA',
  'origin': 'DFW',
  'dest': 'LAX',
  'dep_delay': -3,
  'taxi_out': 18,
  'distance': 1235
}
print(compute_prediction(rowdict, numeric_weights,  scaling_df, categorical_weights))

col=dep_delay wt=36.5569545237 scaled_value=-0.329782492822 contrib=-12.0558435928
col=taxi_out wt=8.15557957221 scaled_value=0.213461991601 contrib=1.74090625815
col=distance wt=-1.88324519311 scaled_value=0.672196648431 contrib=-1.26591110699
col=__INTERCEPT__ wt=1.09017737502 scaled_value=1.0 contrib=1.09017737502
col=carrier wt=-0.0548843604154 value=AA contrib=-0.0548843604154
col=origin wt=0.966535564037 value=DFW contrib=0.966535564037
col=dest wt=1.26816262538 value=LAX contrib=1.26816262538
-8.310857237611458

As you can see, this matches the batch prediction value, telling us that we got the computation correct.

Now, all that we have to do is to wrap up the code into a web application to get online prediction.

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License