The following demonstrates how to predict potential flight delays using a publicly available airlines dataset. This example uses a small sample of more than two decades' worth of flight data, so that the download and import take no more than a minute or two.
The data originally comes from RITA, where it is described in detail. To use the entire 26 years' worth of flight information to predict delays and cancellations more accurately, please download one of the following and change the path to the data in the notebook:
Predicting potential delays and logistical issues has obvious benefits for a business: it lets the user make contingency plans and corrections to avoid undesirable outcomes. Recommendation engines can forewarn flyers of possible delays and rank flight options accordingly; other businesses might pay more for a flight to ensure that certain shipments arrive on time; and airline carriers can use the information to improve their flight plans. The goal is to have the machine take in the factors that may affect a flight and return the probability of that flight being delayed.
Connection to an H2O cloud is established through the h2o.init function from the h2o module. To connect to a pre-existing H2O cluster, pass its location through the ip and port arguments (for example, ip = myIP and port = myPort).
import h2o
import os
import tabulate
import operator
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
h2o.init()
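As a minimal sketch of attaching to a cluster that is already running, the block below passes an address to h2o.init through its ip and port arguments. The address values and the connect helper are placeholders chosen for illustration; substitute the location of your own cluster.

```python
# Hypothetical location of an already-running H2O cluster; replace with your own.
myIP = "127.0.0.1"
myPort = 54321  # H2O's default port

# Keyword arguments that h2o.init accepts for locating a cluster.
init_kwargs = {"ip": myIP, "port": myPort}


def connect(kwargs=init_kwargs):
    """Attach to the cluster described by kwargs (illustrative helper)."""
    import h2o  # imported here so the sketch can be read without h2o installed
    h2o.init(**kwargs)
```

Calling connect() is equivalent to h2o.init(ip = myIP, port = myPort).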
We will use the h2o.import_file function to do a parallel read of the data into the H2O distributed key-value store. The demo intends features Year, Month, DayOfWeek, and FlightNum to be parsed as enum (categorical) rather than numeric columns; this can be requested through the col_types argument of h2o.import_file, which the call below leaves at its default. Once the data is in H2O, get a quick overview of the airlines dataset by using describe.
airlines_hex = h2o.import_file(path = os.path.realpath("../data/allyears2k.csv"),
                               destination_frame = "airlines.hex")
airlines_hex.describe()
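To make the categorical parsing described above explicit, one option is to pass a col_types dictionary to h2o.import_file. The sketch below assumes the four columns named in the text; the helper names are invented for illustration, and the import itself needs a running cluster.

```python
import os

# Columns the text says should be treated as categorical rather than numeric.
CATEGORICAL_COLUMNS = ["Year", "Month", "DayOfWeek", "FlightNum"]


def categorical_col_types(columns):
    """Map each column name to H2O's "enum" (categorical) type."""
    return {name: "enum" for name in columns}


def import_airlines(path="../data/allyears2k.csv"):
    """Import the sample file, forcing the listed columns to enum (illustrative helper)."""
    import h2o  # requires a cluster started with h2o.init()
    return h2o.import_file(path=os.path.realpath(path),
                           destination_frame="airlines.hex",
                           col_types=categorical_col_types(CATEGORICAL_COLUMNS))
```

Columns not listed in the dictionary keep the types that H2O infers during parsing.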
Run a logistic regression model using H2OGeneralizedLinearEstimator and selecting “binomial” for the family parameter. Add some regularization by setting alpha to 0.5 and lambda (lambda_ in the Python API) to 1e-05.
# Set predictor and response variables
myY = "IsDepDelayed"
myX = ["Dest", "Origin", "DayofMonth", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance"]
# GLM - Predict Delays
glm_model = H2OGeneralizedLinearEstimator(
    family = "binomial", standardize = True, solver = "IRLSM",
    link = "logit", alpha = 0.5, lambda_ = 1e-05,
    model_id = "glm_model_from_python")
glm_model.train(x = myX,
                y = myY,
                training_frame = airlines_hex)
print("AUC of the training set : " + str(glm_model.auc()))
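As a rough illustration of what alpha and lambda control, the sketch below computes the elastic net penalty in the form given in H2O's GLM documentation: lambda * (alpha * L1 + (1 - alpha) / 2 * L2). The function name and the coefficient values are made up for illustration, and this is not H2O's internal code.

```python
def elastic_net_penalty(coefficients, alpha, lambda_):
    """Elastic net penalty: lambda_ * (alpha * L1 + (1 - alpha) / 2 * L2)."""
    l1 = sum(abs(b) for b in coefficients)   # lasso term, encourages sparsity
    l2 = sum(b * b for b in coefficients)    # ridge term, shrinks large weights
    return lambda_ * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


# Hypothetical coefficients (intercept excluded, since it is not penalized).
betas = [0.5, -1.0, 0.0, 2.0]

mixed = elastic_net_penalty(betas, alpha=0.5, lambda_=1e-05)
lasso_only = elastic_net_penalty(betas, alpha=1.0, lambda_=1e-05)
ridge_only = elastic_net_penalty(betas, alpha=0.0, lambda_=1e-05)
print(mixed, lasso_only, ridge_only)
```

With alpha at 0.5 the penalty is an even blend of the lasso and ridge terms, and a lambda as small as 1e-05 keeps the overall regularization light.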
# Variable importances from each algorithm
# Calculate magnitude of normalized GLM coefficients
glm_varimp = {k: abs(v) for k, v in glm_model.coef_norm().items()}
# Sort in descending order by magnitude
glm_sorted = sorted(glm_varimp.items(), key = operator.itemgetter(1), reverse = True)
table = tabulate.tabulate(glm_sorted, headers = ["Predictor", "Normalized Coefficient"], tablefmt = "orgtbl")
print("Variable Importances:\n\n" + table)
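The coefficients ranked above feed a linear predictor, which the logit link of the binomial family turns into a probability of delay. The sketch below shows that mapping in plain Python. The coefficient names and values are hypothetical, and categorical predictors are assumed to enter as 0/1 indicator features.

```python
import math


def delay_probability(intercept, coefficients, features):
    """Inverse logit of intercept + sum(coefficient * feature value)."""
    eta = intercept + sum(coef * features.get(name, 0.0)
                          for name, coef in coefficients.items())
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical fitted values, for illustration only.
intercept = -0.2
coefficients = {"Distance": 0.1, "Origin.ORD": 0.4}

# One flight: a standardized distance of 1.5, departing from the indicated origin.
flight = {"Distance": 1.5, "Origin.ORD": 1.0}

p = delay_probability(intercept, coefficients, flight)
print("Predicted probability of delay:", p)
```

A linear predictor of zero corresponds to a probability of 0.5, and larger positive coefficients push the prediction toward a delay.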
Build a binary classification model using H2ODeepLearningEstimator and selecting “bernoulli” for the distribution parameter. Run 100 passes over the data by setting the epochs parameter to 100.
# Deep Learning - Predict Delays
deeplearning_model = H2ODeepLearningEstimator(
    distribution = "bernoulli", model_id = "deeplearning_model_from_python",
    epochs = 100, hidden = [200, 200],
    seed = 6765686131094811000, variable_importances = True)
deeplearning_model.train(x = myX,
                         y = myY,
                         training_frame = airlines_hex)
print("AUC of the training set : " + str(deeplearning_model.auc()))
# Variable importances computed by the deep learning model
deeplearning_model.varimp()
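Both models above report auc() on the training frame. As a reminder of what that number measures, the sketch below computes AUC as the fraction of (delayed, on-time) pairs in which the delayed flight receives the higher score. This is the textbook pairwise definition rather than H2O's implementation, and the function name and toy data are invented for illustration. A training-set AUC also tends to be optimistic compared with one measured on held-out data.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = delayed, 0 = on time; scores are predicted delay probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print("AUC on toy data:", toy_auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of delayed flights above on-time ones.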
Shut down the cluster now that we are done using it.
h2o.shutdown(prompt=False)