Implementing GLM with Grid Search in R using H2O.ai as Backend
2017 Jan 10
For those who use R, few things are more irritating than the tool's notoriously poor memory management, which greatly limits R's use as a serious tool for building models that can go into production and power intelligent platforms/systems.
Here, in a few lines, we’ll show how to use H2O.ai as a processing backend (which abstracts away all these memory and processing issues) for creating a model using GLM.
The trick here is that H2O handles all memory management, and regardless of your data source, it performs the entire buffer management pipeline so that there are no memory overflows or generalized system slowdowns.
This example is entirely based on the H2O documentation and aims only to show how this tool works.
In this case, I’ll use it in a notebook, but it could also be used on an Amazon machine, for instance, using the command below when initializing the cluster:
# Production cluster (not applicable here)
# Machine 1
#localH2O <- h2o.init(ip = '10.112.81.210', port = 54321, nthreads = -1)
# Machine 2
#localH2O <- h2o.init(ip = '10.112.80.74', port = 54321, nthreads = -1)
# Yes, here we use a cluster with two nodes for processing! ;)
First, let’s remove any old H2O.ai installation from the machine in question:
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) {
detach("package:h2o", unload=TRUE)
}
if ("h2o" %in% rownames(installed.packages())) {
remove.packages("h2o")
}
Next, we'll download and install all packages on which H2O depends, directly or indirectly. (Note that methods, stats, graphics, tools, and utils are base R packages that ship with R, so those checks will normally be no-ops.)
# Next, we download packages that H2O depends on.
if (! ("methods" %in% rownames(installed.packages()))) {
install.packages("methods")
}
if (! ("statmod" %in% rownames(installed.packages()))) {
install.packages("statmod")
}
if (! ("stats" %in% rownames(installed.packages()))) {
install.packages("stats")
}
if (! ("graphics" %in% rownames(installed.packages()))) {
install.packages("graphics")
}
if (! ("RCurl" %in% rownames(installed.packages()))) {
install.packages("RCurl")
}
if (! ("jsonlite" %in% rownames(installed.packages()))) {
install.packages("jsonlite")
}
if (! ("tools" %in% rownames(installed.packages()))) {
install.packages("tools")
}
if (! ("utils" %in% rownames(installed.packages()))) {
install.packages("utils")
}
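The repetitive checks above can be condensed into a single loop. This is just a sketch using base R only, with the same behavior as the individual blocks above:

```r
# Install each dependency only if it is not already present.
# Note: "methods", "stats", "graphics", "tools" and "utils" ship with base R,
# so in practice only "statmod", "RCurl" and "jsonlite" may trigger a download.
deps <- c("methods", "statmod", "stats", "graphics",
          "RCurl", "jsonlite", "tools", "utils")
for (pkg in deps) {
  if (!(pkg %in% rownames(installed.packages()))) {
    install.packages(pkg)
  }
}
```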
Next, we'll install the H2O library and load it in RStudio.
# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turing/8/R")))
# Load library
library(h2o)
With the installation done and the library loaded, let’s move on to some configurations.
In RStudio itself, you can choose how many cores the cluster (in this case, your notebook/desktop) will use. Remember that the more cores used, the more processing H2O will consume and the fewer resources will be left for other tasks. The default is 2 cores, but in my case I'll use all of them.
# Start instance with all cores.
# The -1 is the parameter to use with all cores. Use this carefully.
# The default parameter is 2 cores.
h2o.init(nthreads = -1)
Now let’s see our cluster information:
# Cluster Info
h2o.clusterInfo()
# R is connected to the H2O cluster:
# H2O cluster uptime: 3 seconds 267 milliseconds
# H2O cluster version: 3.10.0.8
# H2O cluster version age: 2 months and 26 days
# H2O cluster name: H2O_started_from_R_flavio.clesio_udy929
# H2O cluster total nodes: 1
# H2O cluster total memory: 1.78 GB
# H2O cluster total cores: 4
# H2O cluster allowed cores: 4
# H2O cluster healthy: TRUE
# H2O Connection ip: localhost
# H2O Connection port: 54321
# H2O Connection proxy: NA
# R Version: R version 3.3.2 (2016-10-31)
As we can see, out of 4 total processors, I am using all of them (allowed cores) for processing.
Another thing we can see here is that H2O also exposes a web GUI. To access it, simply point your browser to http://localhost:54321/flow/index.html.
For this example, we will use the Airlines dataset, which contains real details of flights in the USA and all causes of delay from 1987 to 2008. The complete 12 GB version can be found here.
Moving on, let’s now load the data directly from a URL and import it into an R object.
# GLM Demo Deep Dive
# Path of the file. Can be a URL or a local path
airlinesURL = "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
# We'll create the .hex object (the extension of data files in H2O)
# using importFile, setting the path and the destination frame.
# By default, H2O parses the file automatically.
airlines.hex = h2o.importFile(path = airlinesURL, destination_frame = "airlines.hex")
In this case, the airlines.hex object will be the dataframe on which H2O will apply the algorithms.
This .hex format is exclusive to H2O and can be used for numerous algorithms within the platform, as it is already optimized to handle sparse objects and/or text columns.
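If you need to move data between a native R data.frame and an H2OFrame, the conversion helpers are as.h2o() and as.data.frame(). A small sketch, assuming the cluster is running and airlines.hex has been loaded as above (the df object is just an illustration):

```r
# Send a small R data.frame to the H2O cluster...
df <- data.frame(x = 1:5, y = c("a", "b", "a", "b", "a"))
df.hex <- as.h2o(df, destination_frame = "df.hex")

# ...and bring an H2OFrame (or a slice of one) back into R memory.
# Be careful: only do this with frames small enough to fit in R's memory.
first_rows <- as.data.frame(head(airlines.hex, 10))
```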
To view the descriptive statistics of this file, simply use the same summary() function as in R.
# Let's see the summary
summary(airlines.hex)
For our experiment, we will split the data into training and testing sets with a 70%/30% ratio.
One point worth mentioning at this stage: because H2O is a platform designed for Big Data, it uses probabilistic sampling methods. This is necessary (in computational terms) because an exact selection/stratification operation can often be costly.
# Construct test and train sets using sampling
# A small note: H2O uses probabilistic splitting, which means the resulting splits
# can deviate from the exact ratio. This is necessary for a platform that
# deals with big data. If you need exact sampling, the best way is to do it in your RDBMS
airlines.split = h2o.splitFrame(data = airlines.hex, ratios = 0.70, seed = -1)
After creating the splitFrame object, we will allocate the partitions to each dataset. The object in the first position will always be our training data, and the second position will be our testing data.
# Get the train dataframe(1st split object)
airlines.train = airlines.split[[1]]
# Get the test dataframe(2nd split object)
airlines.test = airlines.split[[2]]
Let’s summarize each of these frames below to check the distribution of canceled flights:
# Display a table-like summary of the Cancelled column
h2o.table(airlines.train$Cancelled)
# Cancelled Count
# 1 0 29921
# 2 1 751
h2o.table(airlines.test$Cancelled)
# Cancelled Count
# 1 0 12971
# 2 1 335
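From the counts above, we can compute the cancellation rate in each split to confirm that the probabilistic split preserved the class balance. This is plain base-R arithmetic on the printed counts:

```r
# Cancellation rates computed from the h2o.table() outputs above
train_rate <- 751 / (29921 + 751)   # ~2.45% cancelled in the training set
test_rate  <- 335 / (12971 + 335)   # ~2.52% cancelled in the test set
round(c(train = train_rate, test = test_rate), 4)
```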
With our samples separated, let’s now choose the variables that will enter our model.
First, let’s create two objects to pass as parameters to our algorithm.
Since we want to predict whether flights are delayed, the object Y (dependent variable) will be the variable IsDepDelayed (whether the departure is delayed), and the object X (independent variables) will be a selection of the other fields in the dataset.
# Set dependent variable (Is departure delayed)
Y = "IsDepDelayed"
# Set independent variables
X = c("Origin", "Dest", "DayofMonth", "Year", "UniqueCarrier", "DayOfWeek", "Month", "DepTime", "ArrTime", "Distance")
Now let’s proceed with creating the model using GLM:
# Define the data for the model and display the results
airlines.glm <- h2o.glm(training_frame = airlines.train,
                        x = X,
                        y = Y,
                        family = "binomial",
                        alpha = 0.5,
                        max_iterations = 300,
                        beta_epsilon = 0,
                        lambda = 1e-05,
                        lambda_search = FALSE,
                        early_stopping = FALSE,
                        nfolds = 0,
                        seed = NULL,
                        intercept = TRUE,
                        gradient_epsilon = -1,
                        remove_collinear_columns = FALSE,
                        max_runtime_secs = 10000,
                        missing_values_handling = c("Skip"))
The model parameters mean the following:
x: vector containing the names of the independent variables;
y: index containing the dependent variable;
training_frame: An H2O frame containing the model variables;
family: Specification of the model distribution, which can be gaussian, binomial, poisson, gamma, or tweedie. An excellent explanation of how to choose among them can be found at this link;
alpha: A number in [0, 1] specifying the mixture of the elastic-net regularization parameter. It indicates the degree of mixture between the Lasso and Ridge regularizers. Setting alpha = 1 means penalization via LASSO, alpha = 0 means penalization via ridge;
max_iterations: A non-negative integer specifying the maximum number of model iterations;
beta_epsilon: A non-negative number specifying the maximum magnitude of the difference between coefficient estimates across successive iterations below which the model is considered converged. In other words, this parameter defines the convergence tolerance of the GLM coefficients;
lambda: A non-negative parameter controlling the strength of the Elastic-Net shrinkage, multiplying P(α, β) in the objective function. When lambda = 0, no penalization is applied and the model reduces to an ordinary GLM fit;
lambda_search: A logical value indicating whether there will be a search criterion in the space of lambda values defined by a minimum and maximum parameter;
early_stopping: A value indicating whether there will be an early stop during lambda_search if the likelihood factor stops changing as more iterations occur;
nfolds: Number of partitions in Cross-Validation;
seed: Specifies the random number generator (RNG) seed for Cross-Validation (ensures experiment reproducibility);
intercept: The constant term of the model, which broadly speaking means the degree of endogenous factors in the model;
gradient_epsilon: Convergence criterion. The model converges if the L-infinity norm of the gradient falls below this threshold. If lambda_search = FALSE and lambda = 0, the default value of gradient_epsilon is 0.000001; otherwise, the default value is 0.0001. If lambda_search = TRUE, the conditional values above are 1E-8 and 1E-6, respectively.
remove_collinear_columns: When no regularization is applied, collinear columns can be automatically removed (their coefficients will be 0);
max_runtime_secs: Maximum allowed time in seconds for model training. Use 0 to disable; and
missing_values_handling: How to handle missing values. Can be “MeanImputation” or “Skip”. MeanImputation replaces missing values with the mean for numeric attributes and the most frequent category for categorical ones. It is applied during model training.
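For reference, alpha and lambda enter the GLM fit through the elastic-net penalty P(α, β) mentioned above, which H2O minimizes together with the negative log-likelihood (this is the standard elastic-net formulation):

```latex
\min_{\beta_0, \beta} \; \frac{1}{N}\,\mathrm{NegLogLik}(\beta_0, \beta)
  + \lambda \, P(\alpha, \beta),
\qquad
P(\alpha, \beta) = \alpha \,\lVert \beta \rVert_1
  + \frac{1 - \alpha}{2}\,\lVert \beta \rVert_2^2
```

Setting alpha = 1 keeps only the L1 (Lasso) term, alpha = 0 keeps only the L2 (Ridge) term, and intermediate values mix the two.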
Note that the sky is the limit here in terms of adjustments and/or parameterizations. Ideally, you should understand exactly how each parameter works and use the best possible combination.
With our model fitted, let’s see some of its basic statistics.
# View model information: training statistics, performance, important variables
summary(airlines.glm)
# Model Details:
# ==============
#
# H2OBinomialModel: glm
# Model Key: GLM_model_R_1484053333586_1
# GLM Model: summary
# family link regularization number_of_predictors_total number_of_active_predictors number_of_iterations training_frame
# 1 binomial logit Elastic Net (alpha = 0.5, lambda = 1.0E-5 ) 283 272 5 RTMP_sid_a6c9_1
#
# H2OBinomialMetrics: glm
# ** Reported on training data. **
#
# MSE: 0.2098326
# RMSE: 0.4580749
# LogLoss: 0.607572
# Mean Per-Class Error: 0.3720209
# AUC: 0.7316312
# Gini: 0.4632623
# R^2: 0.1602123
# Null Deviance: 41328.6
# Residual Deviance: 36240.45
# AIC: 36786.45
#
# Confusion Matrix for F1-optimal threshold:
# NO YES Error Rate
# NO 5418 9146 0.627987 =9146/14564
# YES 1771 13489 0.116055 =1771/15260
# Totals 7189 22635 0.366047 =10917/29824
#
# Maximum Metrics: Maximum metrics at their respective thresholds
# metric threshold value idx
# 1 max f1 0.363651 0.711915 294
# 2 max f2 0.085680 0.840380 389
# 3 max f0point5 0.539735 0.683924 196
# 4 max accuracy 0.521518 0.673887 207
# 5 max precision 0.987571 1.000000 0
# 6 max recall 0.040200 1.000000 398
# 7 max specificity 0.987571 1.000000 0
# 8 max absolute_mcc 0.521518 0.348709 207
# 9 max min_per_class_accuracy 0.513103 0.672412 212
# 10 max mean_per_class_accuracy 0.521518 0.674326 207
Here, our model already shows an AUC of 73.16% (reported on the training data). Not bad for a model with few adjustments.
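Since those metrics are reported on the training data, it's worth checking the AUC on the held-out test set as well. A short sketch, assuming the cluster and the objects created above are still alive:

```r
# Score the fitted GLM against the test frame and extract the test-set AUC
perf.test <- h2o.performance(airlines.glm, newdata = airlines.test)
h2o.auc(perf.test)
```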
Let’s analyze the importance of each variable in the model:
# Get the variable importance of the models
h2o.varimp(airlines.glm)
# Standardized Coefficient Magnitudes: standardized coefficient magnitudes
# names coefficients sign
# 1 Origin.TLH 3.233673 NEG
# 2 Origin.CRP 2.998012 NEG
# 3 Origin.LIH 2.859198 NEG
# 4 Dest.LYH 2.766090 POS
# 5 Origin.KOA 2.461819 NEG
#
# ---
# names coefficients sign
# 278 Dest.JAN 0.000000 POS
# 279 Dest.LIT 0.000000 POS
# 280 Dest.SJU 0.000000 POS
# 281 Origin.LAN 0.000000 POS
# 282 Origin.SBN 0.000000 POS
# 283 Origin.SDF 0.000000 POS
Some attribute values are determinants in departure flight delays, especially if the origin airports are TLH (Tallahassee International Airport), CRP (Corpus Christi International Airport), LIH (Lihue Airport), and KOA (Kona International Airport).
Now let’s use the predict function to see how the model performed its classifications.
# Predict using GLM model
pred = h2o.predict(object = airlines.glm, newdata = airlines.test)
After that, let’s look at the result of our model.
# Look at summary of predictions: probability of TRUE class (p1)
summary(pred)
# predict NO YES
# YES:9798 Min. :0.01186 Min. :0.02857
# NO :3126 1st Qu.:0.33715 1st Qu.:0.37018
# NA : 382 Median :0.48541 Median :0.51363
# Mean :0.48780 Mean :0.51220
# 3rd Qu.:0.62886 3rd Qu.:0.66189
# Max. :0.97143 Max. :0.98814
# NA's :382 NA's :382
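We can also cross-tabulate the predicted labels against the actual IsDepDelayed values to build a test-set confusion table. A sketch: both frames here are small enough to bring into R memory, which would not be advisable with the full 12 GB dataset.

```r
# Bring predictions and actual labels into R and cross-tabulate them.
# useNA = "ifany" keeps the 382 rows skipped due to missing values visible.
predicted <- as.data.frame(pred$predict)$predict
actual    <- as.data.frame(airlines.test$IsDepDelayed)$IsDepDelayed
table(actual, predicted, useNA = "ifany")
```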
Another way to use the GLM model in H2O is to perform parameter selection via Grid Search.
Grid Search is simply a method for optimizing the parameters of a Machine Learning model: many parameter combinations are tried, and the combination that achieves the lowest error according to a specific error function is chosen.
What we’ll do now is use GLM to obtain the best model according to a specific parameter combination.
First, let’s build a list of alpha parameters (recalling, this parameter creates a combination of Lasso and Ridge via Elastic-Net). In this case, we’ll provide a list ranging from 0.0 to 1.0 with an increment of 0.05 for each parameter.
# Construct a hyper-parameter space
alpha_opts = seq(0, 1, by = 0.05)
Now let’s create a list with these parameters to pass to the Grid function later.
# List of hyperparameters
hyper_params_opt = list(alpha = alpha_opts)
In the Grid Search function, we’ll pass some parameters for fitting the models, as we can see below.
# Grid object with hyperparameters
glm_grid <- h2o.grid("glm",
                     grid_id = "glm_grid_1",
                     x = X,
                     y = Y,
                     training_frame = airlines.train,
                     hyper_params = hyper_params_opt,
                     family = "binomial")
This step can take a considerable amount of time depending on your data volume and the number of parameters chosen in the search list.
After the processing is complete, let’s sort the list of models according to AUC.
# Sort the grid by performance (increasing AUC). A little note: as we're dealing with
# classification in a probabilistic fashion, we'll use AUC as the model selection metric.
# If the nature of the problem is cost-sensitive (e.g. a delayed departure is much more
# expensive for the airport service than a delayed arrival), precision and recall can be better choices
glm_sorted_grid <- h2o.getGrid(grid_id = "glm_grid_1", sort_by = "auc", decreasing = FALSE)
To evaluate the models, we can display them in order of AUC.
#Print the models
print(glm_sorted_grid)
# H2O Grid Details
# ================
#
# Grid ID: glm_grid_1
# Used hyper parameters:
# - alpha
# Number of models: 21
# Number of failed models: 0
#
# Hyper-Parameter Search Summary: ordered by increasing auc
# alpha model_ids auc
# 1 [D@4800a43e glm_grid_1_model_1 0.7076911403181928
# 2 [D@66030470 glm_grid_1_model_2 0.7122987232329416
# 3 [D@6a4a43d3 glm_grid_1_model_3 0.7145455620514375
# 4 [D@17604a1a glm_grid_1_model_4 0.715989429818657
# 5 [D@21e1e99f glm_grid_1_model_5 0.7169797604977775
#
# ---
# alpha model_ids auc
# 16 [D@78833412 glm_grid_1_model_16 0.720595118360825
# 17 [D@44d770f2 glm_grid_1_model_17 0.7207086912177467
# 18 [D@31669527 glm_grid_1_model_18 0.7208228330257134
# 19 [D@5b376f34 glm_grid_1_model_19 0.7209144533220885
# 20 [D@6acad45e glm_grid_1_model_20 0.7209885192412766
# 21 [D@237ad7de glm_grid_1_model_0 0.7240682725570593
With these parameters, the best model is glm_grid_1_model_0, which achieved an AUC of around 72.40%. (Note: this model is slightly worse than the default model because the parameter set searched by the Grid differs from the first model's.)
To get the best model, simply execute the command below:
# Grab the model_id with the best AUC. Since the grid was sorted by increasing AUC,
# the best model is the last one in the sorted list
best_glm_model_id <- glm_sorted_grid@model_ids[[length(glm_sorted_grid@model_ids)]]
# The best model
best_glm <- h2o.getModel(best_glm_model_id)
Let’s look at the characteristics of this model:
# Summary
summary(best_glm)
# Model Details:
# ==============
#
# H2OBinomialModel: glm
# Model Key: glm_grid_1_model_0
# GLM Model: summary
# family link regularization number_of_predictors_total number_of_active_predictors number_of_iterations training_frame
# 1 binomial logit Ridge ( lambda = 7.29E-5 ) 283 282 3 RTMP_sid_a6c9_1
#
# H2OBinomialMetrics: glm
# ** Reported on training data. **
#
# MSE: 0.2121424
# RMSE: 0.4605891
# LogLoss: 0.612699
# Mean Per-Class Error: 0.3833898
# AUC: 0.7240683
# Gini: 0.4481365
# R^2: 0.1494395
# Null Deviance: 42448.59
# Residual Deviance: 37585.41
# AIC: 38151.41
#
# Confusion Matrix for F1-optimal threshold:
# NO YES Error Rate
# NO 4993 9601 0.657873 =9601/14594
# YES 1751 14327 0.108907 =1751/16078
# Totals 6744 23928 0.370110 =11352/30672
#
# Maximum Metrics: Maximum metrics at their respective thresholds
# metric threshold value idx
# 1 max f1 0.373247 0.716243 296
# 2 max f2 0.105583 0.846435 391
# 3 max f0point5 0.551991 0.685249 194
# 4 max accuracy 0.513313 0.665949 218
# 5 max precision 0.980714 1.000000 0
# 6 max recall 0.048978 1.000000 399
# 7 max specificity 0.980714 1.000000 0
# 8 max absolute_mcc 0.548278 0.332916 196
# 9 max min_per_class_accuracy 0.524282 0.664324 211
# 10 max mean_per_class_accuracy 0.548278 0.666166 196
#
# Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
#
#
# Scoring History:
# timestamp duration iteration negative_log_likelihood objective
# 1 2017-01-10 11:11:07 0.000 sec 0 21224.29620 0.69198
# 2 2017-01-10 11:11:07 0.066 sec 1 18857.11178 0.61705
# ... [truncated]
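Finally, the model selected by Grid Search can be scored against the test set in the same way, so the two GLMs can be compared on data neither of them has seen (a sketch, under the same assumptions as before):

```r
# Test-set AUC of the best model found by Grid Search
best.perf <- h2o.performance(best_glm, newdata = airlines.test)
h2o.auc(best.perf)

# When you're done, you can shut the local cluster down:
# h2o.shutdown(prompt = FALSE)
```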
With this, we conclude this post/tutorial on how to use R with H2O.ai.
Over the next few weeks, we will bring more tutorials and dig a bit deeper into the power of H2O.
Best regards!