Implementing GLM with Grid Search in R using H2O.ai as a Backend
2017 Jan 10

For R users, the language’s poor memory management is particularly frustrating: it severely limits R’s usefulness as a serious tool for building models that can be deployed into production and power intelligent platforms/systems.
In this guide, we’ll briefly demonstrate how to use H2O.ai as a processing backend (which abstracts away all these memory and processing issues) for creating a GLM model.
The key advantage here is that H2O handles all memory management. Regardless of your data source, it manages the in-memory data pipeline itself, preventing the memory exhaustion and general system slowdowns that large datasets cause in plain R.
This example is entirely based on H2O’s documentation and aims solely to showcase the tool’s functionality.
For this example, I’ll be running everything on my laptop, but the same code could also run against an Amazon instance by pointing h2o.init() at the cluster during initialization:
# Production cluster (not applicable here)
# localH2O <- h2o.init(ip = '10.112.81.210', port = 54321, nthreads = -1)  # Machine 1
# localH2O <- h2o.init(ip = '10.112.80.74', port = 54321, nthreads = -1)   # Machine 2
# Yes, here we're using a cluster with two nodes for processing! ;)
First, let’s remove any old H2O.ai installation from the machine in question:
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
Next, we’ll download and install the packages that H2O directly or indirectly depends on. (Several of these, such as methods, stats, and utils, ship with base R and will already be present, so the checks below simply skip them.)
# Next, we download packages that H2O depends on.
if (! ("methods" %in% rownames(installed.packages()))) { install.packages("methods") }
if (! ("statmod" %in% rownames(installed.packages()))) { install.packages("statmod") }
if (! ("stats" %in% rownames(installed.packages()))) { install.packages("stats") }
if (! ("graphics" %in% rownames(installed.packages()))) { install.packages("graphics") }
if (! ("RCurl" %in% rownames(installed.packages()))) { install.packages("RCurl") }
if (! ("jsonlite" %in% rownames(installed.packages()))) { install.packages("jsonlite") }
if (! ("tools" %in% rownames(installed.packages()))) { install.packages("tools") }
if (! ("utils" %in% rownames(installed.packages()))) { install.packages("utils") }
Then, we’ll install the H2O library and instantiate it in R Studio.
# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turing/8/R")))
# Load library
library(h2o)
With the installation complete and the library loaded, let’s proceed with some configuration.
Within RStudio itself, you can choose the number of cores the cluster (in this case, your laptop/desktop) will use. Keep in mind that the more cores H2O is allowed to use, the fewer resources remain for other tasks on the machine. The default is 2 cores, but in my case, I’ll use all available cores.
# Start the instance with all cores.
# nthreads = -1 tells H2O to use all available cores. Use this carefully:
# the default is 2 cores.
h2o.init(nthreads = -1)
Now, let’s view our cluster information:
# Cluster Info
h2o.clusterInfo()
# R is connected to the H2O cluster:
# H2O cluster uptime: 3 seconds 267 milliseconds
# H2O cluster version: 3.10.0.8
# H2O cluster version age: 2 months and 26 days
# H2O cluster name: H2O_started_from_R_flavio.clesio_udy929
# H2O cluster total nodes: 1
# H2O cluster total memory: 1.78 GB
# H2O cluster total cores: 4
# H2O cluster allowed cores: 4
# H2O cluster healthy: TRUE
# H2O Connection ip: localhost
# H2O Connection port: 54321
# H2O Connection proxy: NA
# R Version: R version 3.3.2 (2016-10-31)
As we can see, all 4 of the machine’s cores are allowed for processing.
We can also see that H2O exposes a web GUI (Flow). To access it, simply navigate to http://localhost:54321/flow/index.html in your browser.
For this example, we’ll use the Airlines dataset, which contains real US flight data, including all causes of delays, from 1987 to 2008. The full 12GB version can be found here.
Moving forward, let’s load the data directly from a URL and import it into an R object.
# GLM Demo Deep Dive
# Path of the data file. Can be a URL or a local path
airlinesURL = "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
# We'll create the .hex object (the extension H2O uses for its data frames)
# using the h2o.importFile function, passing the path and the destination frame name.
# By default, H2O parses the file automatically.
airlines.hex = h2o.importFile(path = airlinesURL, destination_frame = "airlines.hex")
In this case, the airlines.hex object will be the dataframe on which H2O will apply the algorithms.
This .hex format is H2O’s own and can be fed to any of the platform’s numerous algorithms, as it’s already optimized for handling sparse data and/or text-type columns.
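As a quick sanity check, note that the data itself stays in the H2O cluster; R only holds a reference to it. A small sketch (not part of the original walkthrough) to confirm this:
# The airlines.hex object is just a pointer to the frame stored in the cluster
class(airlines.hex)  # "H2OFrame"
dim(airlines.hex)    # Rows and columns, computed cluster-side
h2o.ls()             # Lists the keys (frames/models) currently stored in the cluster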
To view the descriptive statistics of this file, simply use the summary() function as you would in R.
# Let's see the summary
summary(airlines.hex)
For our experiment, we’ll split the data into training and testing sets with a 70%/30% proportion.
A point worth mentioning here is that because H2O is a platform designed for Big Data, it uses probabilistic splitting, so the resulting partitions only approximate the requested ratios. This is necessary (computationally speaking) because exact selection/stratification can be costly at scale.
# Construct test and train sets using sampling
# A small note is that H2O uses probabilistic splitting, which means that resulting splits
# can deviate from the exact number. This is necessary when we're talking about a platform that
# deals with big data. If you need exact sampling, the best way is to do this in your RDBMS
airlines.split = h2o.splitFrame(data = airlines.hex, ratios = 0.70, seed = -1)
After creating the splitFrame object, we’ll allocate the partitions to each dataset. The object in the first position will always be our training set, and the second position will be our test set.
# Get the train dataframe (1st split object)
airlines.train = airlines.split[[1]]
# Get the test dataframe (2nd split object)
airlines.test = airlines.split[[2]]
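Since the split is probabilistic, the realized proportions will be close to, but not exactly, 70%/30%. A quick check (an illustrative sketch, not in the original post):
# Verify the realized split proportions; they should be close to 0.70/0.30
nrow(airlines.train) / nrow(airlines.hex)
nrow(airlines.test) / nrow(airlines.hex)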
Let’s summarize each of these frames below to verify the distribution of cancelled flights:
# Display the distribution of the Cancelled column, similar to R's table()
h2o.table(airlines.train$Cancelled)
# Cancelled Count
# 1 0 29921
# 2 1 751
h2o.table(airlines.test$Cancelled)
# Cancelled Count
# 1 0 12971
# 2 1 335
With our samples separated, we’ll now choose the variables that will enter our model.
First, we’ll create two objects to pass as parameters to our algorithm.
Since we want to predict whether departing flights are delayed, the Y object (dependent variable) will be IsDepDelayed (whether the departure was delayed), and the X object (independent variables) will contain the predictor fields listed below.
# Set dependent variable (Is departure delayed)
Y = "IsDepDelayed"
# Set independent variables
X = c("Origin", "Dest", "DayofMonth", "Year", "UniqueCarrier", "DayOfWeek", "Month", "DepTime", "ArrTime", "Distance")
Now, let’s proceed with creating the model using GLM:
# Fit the GLM model using the training data
airlines.glm <- h2o.glm(training_frame = airlines.train,
                        x = X,
                        y = Y,
                        family = "binomial",
                        alpha = 0.5,
                        max_iterations = 300,
                        beta_epsilon = 0,
                        lambda = 1e-05,
                        lambda_search = FALSE,
                        early_stopping = FALSE,
                        nfolds = 0,
                        seed = NULL,
                        intercept = TRUE,
                        gradient_epsilon = -1,
                        remove_collinear_columns = FALSE,
                        max_runtime_secs = 10000,
                        missing_values_handling = "Skip")
The meanings of the model parameters are:
x: a vector containing the names of the independent variables;
y: the name of the dependent (response) variable;
training_frame: an H2O frame containing the model variables;
family: specifies the model’s distribution, which can be gaussian, binomial, poisson, gamma, or tweedie. An excellent explanation of how these parameters can be chosen is available at this link;
alpha: a number in [0, 1] specifying the elastic-net mixing parameter, i.e., the degree of mixing between the Lasso and Ridge regularizers: alpha = 1 results in pure LASSO penalization, and alpha = 0 results in pure Ridge penalization;
max_iterations: a non-negative integer specifying the maximum number of model iterations;
beta_epsilon: a non-negative number specifying the maximum magnitude of the difference between coefficient estimates across successive iterations. In other words, it is the tolerance used to decide when the GLM has converged;
lambda: a non-negative parameter controlling the strength of the Elastic-Net shrinkage; it multiplies P(α, β) in the objective function. When lambda = 0, no penalization is applied and the model reduces to an ordinary (unpenalized) GLM;
lambda_search: a logical value indicating whether a search criterion will be applied in the lambda space defined by minimum and maximum parameters;
early_stopping: a value indicating whether to stop early during lambda_search if the likelihood factor stops changing with more iterations;
nfolds: Number of partitions for Cross-Validation;
seed: Specifies the random number generator (RNG) seed for Cross-Validation (ensures experiment reproducibility);
gradient_epsilon: convergence criterion. Convergence is declared when the L-infinity norm of the gradient falls below this threshold. If lambda_search = FALSE and lambda = 0, the default is 1e-6; otherwise it is 1e-4. If lambda_search = TRUE, the corresponding values are 1e-8 and 1e-6, respectively.
remove_collinear_columns: when enabled (and only applicable when no regularization is used), the model drops collinear columns (their coefficients are set to 0);
max_runtime_secs: Maximum allowed training time in seconds for the model. Use 0 to disable; and
missing_values_handling: Specifies how to handle missing values. Options include “MeanImputation” or “Skip”. MeanImputation replaces missing values with the mean for numeric attributes and the most frequent value for categorical attributes. This is applied during model training.
Note that here, the sky is the limit in terms of adjustments and parameterization. The ideal approach is to fully understand the mechanics of each parameter and use the best possible combination.
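For instance, instead of fixing lambda manually as above, we could let H2O search along the regularization path. The snippet below is a hypothetical variation (not used in the rest of this walkthrough) that enables lambda_search:
# Hypothetical variation: let H2O search over lambda instead of fixing it.
# nlambdas controls how many lambda values are tried along the path.
airlines.glm.search <- h2o.glm(training_frame = airlines.train,
                               x = X,
                               y = Y,
                               family = "binomial",
                               alpha = 0.5,
                               lambda_search = TRUE,
                               nlambdas = 30)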
With our model fitted, let’s look at some of its basic statistics.
# View model information: training statistics, performance, important variables
summary(airlines.glm)
# Model Details:
# ==============
#
# H2OBinomialModel: glm
# Model Key: GLM_model_R_1484053333586_1
# GLM Model: summary
# family link regularization number_of_predictors_total number_of_active_predictors number_of_iterations training_frame
# 1 binomial logit Elastic Net (alpha = 0.5, lambda = 1.0E-5 ) 283 272 5 RTMP_sid_a6c9_1
#
# H2OBinomialMetrics: glm
# ** Reported on training data. **
#
# MSE: 0.2098326
# RMSE: 0.4580749
# LogLoss: 0.607572
# Mean Per-Class Error: 0.3720209
# AUC: 0.7316312
# Gini: 0.4632623
# R^2: 0.1602123
# Null Deviance: 41328.6
# Residual Deviance: 36240.45
# AIC: 36786.45
#
# Confusion Matrix for F1-optimal threshold:
# NO YES Error Rate
# NO 5418 9146 0.627987 =9146/14564
# YES 1771 13489 0.116055 =1771/15260
# Totals 7189 22635 0.366047 =10917/29824
#
# Maximum Metrics: Maximum metrics at their respective thresholds
# metric threshold value idx
# 1 max f1 0.363651 0.711915 294
# 2 max f2 0.085680 0.840380 389
# 3 max f0point5 0.539735 0.683924 196
# 4 max accuracy 0.521518 0.673887 207
# 5 max precision 0.987571 1.000000 0
# 6 max recall 0.040200 1.000000 398
# 7 max specificity 0.987571 1.000000 0
# 8 max absolute_mcc 0.521518 0.348709 207
# 9 max min_per_class_accuracy 0.513103 0.672412 212
# 10 max mean_per_class_accuracy 0.521518 0.674326 207
#
# Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
#
#
# Scoring History:
# timestamp duration iteration negative_log_likelihood objective
# 1 2017-01-10 11:11:07 0.000 sec 0 21224.29620 0.69198
# 2 2017-01-10 11:11:07 0.066 sec 1 18857.11178 0.61705
# ... [truncated]
This model already shows a respectable AUC of 73.16%, which is not bad for a model with minimal tuning.
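If we only need a single metric rather than the full summary, the h2o package exposes accessor functions. A minimal sketch using the model above:
# Retrieve the training AUC directly from the model object
h2o.auc(airlines.glm)
# [1] 0.7316312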
Let’s analyze the importance of each variable in the model:
# Get the variable importance of the models
h2o.varimp(airlines.glm)
# Standardized Coefficient Magnitudes: standardized coefficient magnitudes
# names coefficients sign
# 1 Origin.TLH 3.233673 NEG
# 2 Origin.CRP 2.998012 NEG
# 3 Origin.LIH 2.859198 NEG
# 4 Dest.LYH 2.766090 POS
# 5 Origin.KOA 2.461819 NEG
# # ---
# names coefficients sign
# 278 Dest.JAN 0.000000 POS
# 279 Dest.LIT 0.000000 POS
# 280 Dest.SJU 0.000000 POS
# 281 Origin.LAN 0.000000 POS
# 282 Origin.SBN 0.000000 POS
# 283 Origin.SDF 0.000000 POS
Some attribute values are crucial determinants of departure flight delays, especially the origin airports such as TLH (Tallahassee International Airport), CRP (Corpus Christi International Airport), LIH (Lihue Airport), and KOA (Kona International Airport).
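For a visual view of these magnitudes, the h2o package also provides a plotting helper for GLM standardized coefficients (availability may depend on your H2O version). A minimal sketch:
# Plot the top standardized coefficient magnitudes of the GLM model
h2o.std_coef_plot(airlines.glm, num_of_features = 10)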
Now, let’s use the predict function to see how the model classified the predictions.
# Predict using GLM model
pred = h2o.predict(object = airlines.glm, newdata = airlines.test)
After that, let’s examine the model’s prediction results.
# Look at the summary of predictions: predicted class plus the class probabilities (NO/YES)
summary(pred)
# predict NO YES
# YES:9798 Min. :0.01186 Min. :0.02857
# NO :3126 1st Qu.:0.33715 1st Qu.:0.37018
# NA : 382 Median :0.48541 Median :0.51363
# Mean :0.48780 Mean :0.51220
# 3rd Qu.:0.62886 3rd Qu.:0.66189
# Max. :0.97143 Max. :0.98814
# NA's :382 NA's :382
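Beyond inspecting the raw predictions, we can score the model on the test set and pull metrics from the resulting performance object; the predictions can also be brought into a regular R data.frame. A short sketch assuming the objects created above:
# Evaluate the model on the held-out test set
perf <- h2o.performance(airlines.glm, newdata = airlines.test)
h2o.auc(perf)              # Test-set AUC
h2o.confusionMatrix(perf)  # Confusion matrix at the F1-optimal threshold

# Pull the predictions into R as a regular data.frame, if needed
pred.df <- as.data.frame(pred)
head(pred.df)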
Another way to use the GLM model in H2O is by performing parameter selection via Grid Search.
Grid Search is essentially a method for tuning Machine Learning model parameters: it tries numerous parameter combinations and selects the one that performs best according to a specific error function.
What we’ll do now is use GLM to find the best model based on a specific set of parameter combinations.
First, we’ll construct a list of alpha values (recalling that this parameter controls the mix of Lasso and Ridge in the Elastic-Net). In this case, we’ll use a list ranging from 0.0 to 1.0 in increments of 0.05.
# Construct a hyper-parameter space
alpha_opts = c(0,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1)
Now, let’s create a list of these parameters to pass to the Grid function later.
# List of hyperparameters
hyper_params_opt = list(alpha = alpha_opts)
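Note that the hyperparameter list is not limited to alpha: other GLM parameters can be searched in the same grid, with every combination of the listed values producing one model. A hypothetical extension (not used in this walkthrough):
# Hypothetical extension: search over lambda as well as alpha.
# This would fit 21 x 3 = 63 models instead of 21.
# hyper_params_opt = list(alpha = alpha_opts,
#                         lambda = c(1e-5, 1e-4, 1e-3))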
In the Grid Search function, we’ll pass some parameters for model fitting, as shown below.
# Grid object with hyperparameters
glm_grid <- h2o.grid("glm",
                     grid_id = "glm_grid_1",
                     x = X,
                     y = Y,
                     training_frame = airlines.train,
                     hyper_params = hyper_params_opt,
                     family = "binomial")
This step might take a considerable amount of time depending on your data volume and the number of parameters chosen in the search list.
After the processing is complete, we’ll sort the list of models according to their AUC.
# Sort the grid models by AUC (higher is better). A small note: since we're dealing
# with classification in a probabilistic fashion, we use AUC as the model selection metric.
# If the nature of the problem is cost-sensitive (e.g., a delayed departure is much more
# expensive for the airport service than a delayed arrival), precision and recall
# may be better choices. With decreasing = FALSE the grid is sorted by increasing AUC,
# so the best model appears last.
glm_sorted_grid <- h2o.getGrid(grid_id = "glm_grid_1", sort_by = "auc", decreasing = FALSE)
To compare the models, we can print the grid, ordered by AUC.
# Print the models
print(glm_sorted_grid)
# H2O Grid Details
# ==============
#
# Grid ID: glm_grid_1
# Used hyper parameters:
# - alpha
# Number of models: 21
# Number of failed models: 0
#
# Hyper-Parameter Search Summary: ordered by increasing auc
# alpha model_ids auc
# 1 [D@4800a43e glm_grid_1_model_1 0.7076911403181928
# 2 [D@66030470 glm_grid_1_model_2 0.7122987232329416
# 3 [D@6a4a43d3 glm_grid_1_model_3 0.7145455620514375
# 4 [D@17604a1a glm_grid_1_model_4 0.715989429818657
# 5 [D@21e1e99f glm_grid_1_model_5 0.7169797604977775
# # ---
# alpha model_ids auc
# 16 [D@78833412 glm_grid_1_model_16 0.720595118360825
# 17 [D@44d770f2 glm_grid_1_model_17 0.7207086912177467
# 18 [D@31669527 glm_grid_1_model_18 0.7208228330257134
# 19 [D@5b376f34 glm_grid_1_model_19 0.7209144533220885
# 20 [D@6acad45e glm_grid_1_model_20 0.7209885192412766
# 21 [D@237ad7de glm_grid_1_model_0 0.7240682725570593
With these parameters, the best model is glm_grid_1_model_0, which achieved an AUC of approximately 72.40%. (Note: This model is slightly worse than the default model because the parameter set for the Grid search differs from the initial model’s parameters).
To retrieve the best model, simply execute the following command:
# Grab the model_id of the best model (highest AUC). Since the sorted grid is in
# increasing AUC order, the best model is the last one in the list.
best_glm_model_id <- glm_sorted_grid@model_ids[[length(glm_sorted_grid@model_ids)]]
# The best model
best_glm <- h2o.getModel(best_glm_model_id)
Let’s examine the characteristics of this model:
# Summary
summary(best_glm)
# Model Details:
# ==============
#
# H2OBinomialModel: glm
# Model Key: glm_grid_1_model_0
# GLM Model: summary
# family link regularization number_of_predictors_total number_of_active_predictors number_of_iterations training_frame
# 1 binomial logit Ridge ( lambda = 7.29E-5 ) 283 282 3 RTMP_sid_a6c9_1
#
# H2OBinomialMetrics: glm
# ** Reported on training data. **
#
# MSE: 0.2121424
# RMSE: 0.4605891
# LogLoss: 0.612699
# Mean Per-Class Error: 0.3833898
# AUC: 0.7240683
# Gini: 0.4481365
# R^2: 0.1494395
# Null Deviance: 42448.59
# Residual Deviance: 37585.41
# AIC: 38151.41
#
# Confusion Matrix for F1-optimal threshold:
# NO YES Error Rate
# NO 4993 9601 0.657873 =9601/14594
# YES 1751 14327 0.108907 =1751/16078
# Totals 6744 23928 0.370110 =11352/30672
#
# Maximum Metrics: Maximum metrics at their respective thresholds
# metric threshold value idx
# 1 max f1 0.373247 0.716243 296
# 2 max f2 0.105583 0.846435 391
# 3 max f0point5 0.551991 0.685249 194
# 4 max accuracy 0.513313 0.665949 218
# 5 max precision 0.980714 1.000000 0
# 6 max recall 0.048978 1.000000 399
# 7 max specificity 0.980714 1.000000 0
# 8 max absolute_mcc 0.548278 0.332916 196
# 9 max min_per_class_accuracy 0.524282 0.664324 211
# 10 max mean_per_class_accuracy 0.548278 0.666166 196
#
# Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
#
#
# Scoring History:
# timestamp duration iteration negative_log_likelihood objective
# 1 2017-01-10 11:11:07 0.000 sec 0 21224.29620 0.69198
# 2 2017-01-10 11:11:07 0.066 sec 1 18857.11178 0.61705
# ... [truncated]
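As a final check (not part of the original output), we could also score the tuned model on the held-out test set, just as we did with the first model:
# Evaluate the best grid model on the test set
best_perf <- h2o.performance(best_glm, newdata = airlines.test)
h2o.auc(best_perf)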
With this, we conclude this post/tutorial on how to use R with H2O.ai.
Over the coming weeks, we’ll publish more tutorials and dig deeper into the power of H2O.
Best regards!