ML.Predict in BigQuery ML is primarily meant for batch predictions. What if you want to build a web application to provide online predictions? Here, I show the basic Python code to do online prediction. You can wrap this code in AppEngine or other web framework/toolkit to provide scalable, fast, online predictions.
!pip install google-cloud # Reset Session after installing
PROJECT = 'cloud-training-demos' # change as needed
Let's start by creating a simple prediction model to predict arrival delays of aircraft. I'll use this to illustrate the process.
First, if necessary, create the BigQuery dataset that will store the output of the model.
%bash
bq mk -d flights
Then, do a "CREATE MODEL". This will take about 5 minutes.
%bq query
CREATE OR REPLACE MODEL flights.arrdelay
OPTIONS
(model_type='linear_reg', input_label_cols=['arr_delay']) AS
SELECT
arr_delay,
carrier,
origin,
dest,
dep_delay,
taxi_out,
distance
FROM
`cloud-training-demos.flights.tzcorr`
WHERE
arr_delay IS NOT NULL
Once you have a trained model, batch prediction can be done within BigQuery itself.
For example, to find the predicted arrival delays for a flight from DFW to LAX for a range of departure delays
%bq query
SELECT * FROM ml.PREDICT(MODEL flights.arrdelay, (
SELECT
'AA' as carrier,
'DFW' as origin,
'LAX' as dest,
dep_delay,
18 as taxi_out,
1235 as distance
FROM
UNNEST(GENERATE_ARRAY(-3, 10)) as dep_delay
))
predicted_arr_delay | carrier | origin | dest | dep_delay | taxi_out | distance |
---|---|---|---|---|---|---|
-8.31085723761 | AA | DFW | LAX | -3 | 18 | 1235 |
-7.32016373789 | AA | DFW | LAX | -2 | 18 | 1235 |
-6.32947023816 | AA | DFW | LAX | -1 | 18 | 1235 |
-5.33877673844 | AA | DFW | LAX | 0 | 18 | 1235 |
-4.34808323871 | AA | DFW | LAX | 1 | 18 | 1235 |
-3.35738973899 | AA | DFW | LAX | 2 | 18 | 1235 |
-2.36669623926 | AA | DFW | LAX | 3 | 18 | 1235 |
-1.37600273954 | AA | DFW | LAX | 4 | 18 | 1235 |
-0.385309239811 | AA | DFW | LAX | 5 | 18 | 1235 |
0.605384259914 | AA | DFW | LAX | 6 | 18 | 1235 |
1.59607775964 | AA | DFW | LAX | 7 | 18 | 1235 |
2.58677125936 | AA | DFW | LAX | 8 | 18 | 1235 |
3.57746475909 | AA | DFW | LAX | 9 | 18 | 1235 |
4.56815825881 | AA | DFW | LAX | 10 | 18 | 1235 |
The batch prediction technique above can not be used for online prediction though. Typical BigQuery queries have a latency of 1-2 seconds and that is too high for a web application.
For online prediction, it is better to grab the weights and do the computation yourself.
%bq query
SELECT
processed_input AS input,
model.weight AS input_weight
FROM
ml.WEIGHTS(MODEL flights.arrdelay) AS model
input | input_weight |
---|---|
carrier | |
origin | |
dest | |
dep_delay | 36.5569545237 |
taxi_out | 8.15557957221 |
distance | -1.88324519311 |
__INTERCEPT__ | 1.09017737502 |
%bq query
SELECT
processed_input AS input,
model.weight AS input_weight,
category.category AS category_name,
category.weight AS category_weight
FROM
ml.WEIGHTS(MODEL flights.arrdelay) AS model,
UNNEST(category_weights) AS category
input | input_weight | category_name | category_weight |
---|---|---|---|
carrier | B6 | 2.41360895323 | |
carrier | EV | 2.27889474393 | |
carrier | AA | -0.0548843604154 | |
carrier | US | 1.82939081237 | |
carrier | DL | -1.28516116856 | |
carrier | F9 | 5.76556618145 | |
carrier | WN | 2.18336049823 | |
carrier | HA | 6.17381375535 | |
carrier | VX | 5.05422774942 | |
carrier | AS | 2.022529139 | |
carrier | UA | -1.9492217572 | |
carrier | OO | 1.11110679975 | |
carrier | NK | 7.09579009395 | |
carrier | MQ | 1.19620456586 | |
origin | WRG | 8.99721250183 | |
origin | MSY | 4.96015143256 | |
origin | AMA | 6.20849983281 | |
origin | SAF | 4.19530797642 | |
origin | GUC | 3.60929742614 | |
origin | MKG | 2.23621423776 | |
origin | CHO | 3.63391263959 | |
origin | ESC | 1.01325898826 | |
origin | MCI | 4.16187598831 | |
origin | GSP | 3.06866659933 | |
origin | GJT | 3.6366332815 |
Here's how to do that in Python.
p.s. I'm assuming that you are in an environment where you are already authenticated with Google Cloud. If not, see this article on how to access BigQuery using private keys
def query_to_dataframe(query):
import pandas as pd
import pkgutil
privatekey = None # pkgutil.get_data(KEYDIR, 'privatekey.json')
return pd.read_gbq(query,
project_id=PROJECT,
dialect='standard',
private_key=privatekey)
You need to pull 3 pieces of information:
I pull them using three separate BigQuery queries below
numeric_query = """
SELECT
processed_input AS input,
model.weight AS input_weight
FROM
ml.WEIGHTS(MODEL flights.arrdelay) AS model
"""
numeric_weights = query_to_dataframe(numeric_query).dropna()
numeric_weights
Requesting query... ok. Job ID: 4f89a94d-11a1-4cfa-bbb2-f046ffe63df1 Query running... Query done. Cache hit. Retrieving results... Got 7 rows. Total time taken 1.02 s. Finished at 2018-08-12 04:20:46.
input | input_weight | |
---|---|---|
3 | dep_delay | 36.556955 |
4 | taxi_out | 8.155580 |
5 | distance | -1.883245 |
6 | __INTERCEPT__ | 1.090177 |
scaling_query = """
SELECT
input, min, max, mean, stddev
FROM
ml.FEATURE_INFO(MODEL flights.arrdelay) AS model
"""
scaling_df = query_to_dataframe(scaling_query).dropna()
scaling_df
Requesting query... ok. Job ID: a3d9e2fb-f940-48ba-a5aa-2e7645d82db1 Query running... Query done. Processed: 0.0 B Billed: 0.0 B Standard price: $0.00 USD Retrieving results... Got 6 rows. Total time taken 2.63 s. Finished at 2018-08-12 05:00:15.
input | min | max | mean | stddev | |
---|---|---|---|---|---|
3 | dep_delay | -82.0 | 1988.0 | 9.169095 | 36.900368 |
4 | taxi_out | 1.0 | 225.0 | 16.099878 | 8.901454 |
5 | distance | 31.0 | 4983.0 | 825.795621 | 608.756947 |
categorical_query = """
SELECT
processed_input AS input,
model.weight AS input_weight,
category.category AS category_name,
category.weight AS category_weight
FROM
ml.WEIGHTS(MODEL flights.arrdelay) AS model,
UNNEST(category_weights) AS category
"""
categorical_weights = query_to_dataframe(categorical_query)
categorical_weights.head()
Requesting query... ok. Job ID: b662c5fd-32af-4624-88db-d7aeed97300f Query running... Query done. Cache hit. Retrieving results... Got 658 rows. Total time taken 1.03 s. Finished at 2018-08-12 04:17:36.
input | input_weight | category_name | category_weight | |
---|---|---|---|---|
0 | carrier | NaN | B6 | 2.413609 |
1 | carrier | NaN | EV | 2.278895 |
2 | carrier | NaN | AA | -0.054884 |
3 | carrier | NaN | US | 1.829391 |
4 | carrier | NaN | DL | -1.285161 |
With the three pieces of information in-place, you can simply do the math for linear regression:
def compute_prediction(rowdict, numeric_weights, scaling_df, categorical_weights):
input_values = rowdict
# numeric inputs
pred = 0
for column_name in numeric_weights['input'].unique():
wt = numeric_weights[ numeric_weights['input'] == column_name ]['input_weight'].values[0]
if column_name != '__INTERCEPT__':
#minv = scaling_df[ scaling_df['input'] == column_name ]['min'].values[0]
#maxv = scaling_df[ scaling_df['input'] == column_name ]['max'].values[0]
#scaled_value = (input_values[column_name] - minv)/(maxv - minv)
meanv = scaling_df[ scaling_df['input'] == column_name ]['mean'].values[0]
stddev = scaling_df[ scaling_df['input'] == column_name ]['stddev'].values[0]
scaled_value = (input_values[column_name] - meanv)/stddev
else:
scaled_value = 1.0
contrib = wt * scaled_value
print('col={} wt={} scaled_value={} contrib={}'.format(column_name, wt, scaled_value, contrib))
pred = pred + contrib
# categorical inputs
for column_name in categorical_weights['input'].unique():
category_weights = categorical_weights[ categorical_weights['input'] == column_name ]
wt = category_weights[ category_weights['category_name'] == input_values[column_name] ]['category_weight'].values[0]
print('col={} wt={} value={} contrib={}'.format(column_name, wt, input_values[column_name], wt))
pred = pred + wt
return pred
Here is an example of the prediction code in action:
rowdict = {
'carrier' : 'AA',
'origin': 'DFW',
'dest': 'LAX',
'dep_delay': -3,
'taxi_out': 18,
'distance': 1235
}
print(compute_prediction(rowdict, numeric_weights, scaling_df, categorical_weights))
col=dep_delay wt=36.5569545237 scaled_value=-0.329782492822 contrib=-12.0558435928 col=taxi_out wt=8.15557957221 scaled_value=0.213461991601 contrib=1.74090625815 col=distance wt=-1.88324519311 scaled_value=0.672196648431 contrib=-1.26591110699 col=__INTERCEPT__ wt=1.09017737502 scaled_value=1.0 contrib=1.09017737502 col=carrier wt=-0.0548843604154 value=AA contrib=-0.0548843604154 col=origin wt=0.966535564037 value=DFW contrib=0.966535564037 col=dest wt=1.26816262538 value=LAX contrib=1.26816262538 -8.310857237611458
As you can see, this matches the batch prediction value, telling us that we got the computation correct.
Now, all that we have to do is to wrap up the code into a web application to get online prediction.
Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License