Implementing GLM with Grid Search in R using H2O.ai as Backend
2017 Jan 10

For those who use R, there is nothing more frustrating than the language's poor memory management, which severely limits R as a serious tool for building models that can go into production and power intelligent platforms/systems.
Here, in a few lines, we will show how to use H2O.ai as a processing backend (which abstracts all these memory and processing issues) for creating a model using GLM.
The trick here is that H2O handles all memory management, and regardless of your data source, it manages the entire buffer memory pipeline so that there’s no memory overflow or system slowdown.
This example is entirely based on the H2O documentation and aims only to show how this tool works.
In this case, I will use it in a notebook, but it could also be used, for example, on an Amazon machine using the command below at cluster initialization:
[sourcecode language="r"]
# Production cluster (not applicable here)
#localH2O <- h2o.init(ip = '10.112.81.210', port = 54321, nthreads = -1) # Machine 1
#localH2O <- h2o.init(ip = '10.112.80.74', port = 54321, nthreads = -1)  # Machine 2
# Yes, here we use a two-node cluster for processing! ;)
[/sourcecode]
First, let’s remove any old H2O.ai installation from the machine in question:
[sourcecode language="r"]
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload = TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
[/sourcecode]

Next, we will download and install all packages that H2O depends on, directly or indirectly.

[sourcecode language="r"]
# Next, we download the packages that H2O depends on.
if (! ("methods" %in% rownames(installed.packages()))) { install.packages("methods") }
if (! ("statmod" %in% rownames(installed.packages()))) { install.packages("statmod") }
if (! ("stats" %in% rownames(installed.packages()))) { install.packages("stats") }
if (! ("graphics" %in% rownames(installed.packages()))) { install.packages("graphics") }
if (! ("RCurl" %in% rownames(installed.packages()))) { install.packages("RCurl") }
if (! ("jsonlite" %in% rownames(installed.packages()))) { install.packages("jsonlite") }
if (! ("tools" %in% rownames(installed.packages()))) { install.packages("tools") }
if (! ("utils" %in% rownames(installed.packages()))) { install.packages("utils") }
[/sourcecode]

Next, we will install the H2O library and load it in RStudio.

[sourcecode language="r"]
# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type = "source",
                 repos = (c("http://h2o-release.s3.amazonaws.com/h2o/rel-turing/8/R")))
[/sourcecode]

[sourcecode language="r"]
# Load the library
library(h2o)
[/sourcecode]
With the installation complete and the library loaded, let's move on to some configuration.
In RStudio itself, you can choose the number of processors your cluster (in this case, your notebook/desktop) will use. Remember that the more cores are used, the more processing H2O will consume, leaving fewer resources for other tasks. The default is to use 2 cores, but in my case, I will use all processors.
[sourcecode language="r"]
# Start the instance with all cores.
# nthreads = -1 means "use all cores". Use this carefully.
# The default is 2 cores.
h2o.init(nthreads = -1)
[/sourcecode]

Now let's look at our cluster information:

[sourcecode language="r"]
# Cluster info
h2o.clusterInfo()

# R is connected to the H2O cluster:
#     H2O cluster uptime:         3 seconds 267 milliseconds
#     H2O cluster version:        3.10.0.8
#     H2O cluster version age:    2 months and 26 days
#     H2O cluster name:           H2O_started_from_R_flavio.clesio_udy929
#     H2O cluster total nodes:    1
#     H2O cluster total memory:   1.78 GB
#     H2O cluster total cores:    4
#     H2O cluster allowed cores:  4
#     H2O cluster healthy:        TRUE
#     H2O Connection ip:          localhost
#     H2O Connection port:        54321
#     H2O Connection proxy:       NA
#     R Version:                  R version 3.3.2 (2016-10-31)
[/sourcecode]
As we can see, out of 4 total processors, I am using all of them (allowed cores) for processing.
Another fact we can see here is that H2O is also instantiated to use the Web GUI. To access it, simply go to the browser with the address http://localhost:54321/flow/index.html.
For this example, we will use the Airlines dataset, which contains extensive real flight information in the USA and all causes of delays from 1987 to 2008. The complete 12GB version can be found here.
Moving on, let’s now load the data directly from a URL and import it into an R object.
[sourcecode language="r"]
# GLM Demo Deep Dive
# Path of the file. Can be a URL or a local path.
airlinesURL = "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
[/sourcecode]

[sourcecode language="r"]
# We'll create the .hex object (the extension of data files in H2O)
# using importFile, setting the path and the destination frame.
# By default, H2O parses the file automatically.
airlines.hex = h2o.importFile(path = airlinesURL, destination_frame = "airlines.hex")
[/sourcecode]
In this case, the airlines.hex object will be the dataframe on which H2O will apply the algorithms.
This .hex format is exclusive to H2O and can be used for numerous algorithms within the platform, as it is already optimized to handle sparse objects and/or text columns.
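A practical consequence is that the frame lives in the H2O cluster, not in R's memory; the R object is just a reference to it. As a minimal sketch (using only the airlines.hex object created above), we can inspect the frame without pulling the full data into R:

[sourcecode language="r"]
# Minimal sanity checks on the imported frame. The data stays in the
# H2O cluster; only small results are transferred back to the R session.
class(airlines.hex)    # "H2OFrame" - a reference to the cluster-side data
dim(airlines.hex)      # number of rows and columns, computed server-side
colnames(airlines.hex) # column names, without materializing the data in R
head(airlines.hex, 5)  # only a 5-row preview is brought into R
[/sourcecode]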
To view the descriptive statistics of this file, simply use the same summary() function from R.
[sourcecode language="r"]
# Let's see the summary
summary(airlines.hex)
[/sourcecode]
For our experiment, we will split the training and testing data in a 70%/30% proportion.
One necessary point to mention at this stage is that, because H2O is a platform designed for Big Data, its sampling methods are probabilistic rather than exact. This is computationally necessary because, at scale, an exact selection/stratification operation can be costly.
[sourcecode language="r"]
# Construct test and train sets using sampling.
# A small note: H2O uses probabilistic splitting, which means the resulting splits
# can deviate from the exact ratios. This is necessary for a platform that
# deals with big data. If you need exact sampling, the best way is to do it in your RDBMS.
airlines.split = h2o.splitFrame(data = airlines.hex, ratios = 0.70, seed = -1)
[/sourcecode]
After creating the splitFrame object, we will allocate the partitions to each dataset, with the object in the first position always being our training base, and the second position our testing base.
[sourcecode language="r"]
# Get the train dataframe (1st split object)
airlines.train = airlines.split[[1]]

# Get the test dataframe (2nd split object)
airlines.test = airlines.split[[2]]
[/sourcecode]
Let’s summarize each of these frames below to check the distribution of canceled flights:
[sourcecode language="r"]
# Display a table-like summary of the Cancelled column
h2o.table(airlines.train$Cancelled)
#   Cancelled Count
# 1         0 29921
# 2         1   751

h2o.table(airlines.test$Cancelled)
#   Cancelled Count
# 1         0 12971
# 2         1   335
[/sourcecode]
With our samples separated, let’s now choose the variables that will go into our model.
First, let’s create two objects to pass as parameters to our algorithm.
Since we want to predict whether flights are delayed, the object Y (dependent variable) will be the variable IsDepDelayed (If the departure flight is delayed) and the object X (independent variables) will be all other fields in the dataset.
[sourcecode language="r"]
# Set the dependent variable (is the departure delayed?)
Y = "IsDepDelayed"
[/sourcecode]

[sourcecode language="r"]
# Set the independent variables
X = c("Origin", "Dest", "DayofMonth", "Year", "UniqueCarrier",
      "DayOfWeek", "Month", "DepTime", "ArrTime", "Distance")
[/sourcecode]

Now let's create the model using GLM:

[sourcecode language="r"]
# Define the data for the model and display the results
airlines.glm <- h2o.glm(training_frame = airlines.train
                        ,x = X
                        ,y = Y
                        ,family = "binomial"
                        ,alpha = 0.5
                        ,max_iterations = 300
                        ,beta_epsilon = 0
                        ,lambda = 1e-05
                        ,lambda_search = FALSE
                        ,early_stopping = FALSE
                        ,nfolds = 0
                        ,seed = NULL
                        ,intercept = TRUE
                        ,gradient_epsilon = -1
                        ,remove_collinear_columns = FALSE
                        ,max_runtime_secs = 10000
                        ,missing_values_handling = c("Skip"))
[/sourcecode]
The meanings of the model parameters are:
x: vector containing the names of the independent variables;
y: the name (or index) of the dependent variable;
training_frame: An H2O frame containing the model variables;
family: Specification of the model distribution, which can be gaussian, binomial, poisson, gamma, or tweedie. An excellent explanation of how these distributions can be chosen is in this link;
alpha: A number in [0, 1] specifying the elastic-net mixing parameter, i.e. the degree of mixing between the Lasso and Ridge regularizers: alpha = 1 gives a pure LASSO penalty, alpha = 0 a pure Ridge penalty;
max_iterations: A non-negative integer specifying the maximum number of model iterations;
beta_epsilon: A non-negative number specifying the magnitude of the maximum allowed difference between coefficient estimates across successive iterations. In other words, this parameter defines the convergence tolerance of the GLM model;
lambda: A non-negative parameter controlling the strength of the Elastic-Net shrinkage, which multiplies P(α, β) in the objective function. When lambda = 0, no penalty is applied and the fit is the ordinary, unregularized model;
lambda_search: A logical value indicating whether there will be a search criterion in the lambda value space defined by a minimum and maximum parameter;
early_stopping: A value indicating whether there will be an early stop during lambda_search if the likelihood ratio stops changing as more iterations occur;
nfolds: Number of partitions in Cross-Validation;
seed: Specifies the random number generator (RNG) seed for Cross-Validation (ensures experiment reproducibility);
intercept: A logical value indicating whether to include the constant (intercept) term in the model;
gradient_epsilon: Convergence criterion. The model converges if the L-infinity norm of the gradient falls below this threshold. If lambda_search = FALSE and lambda = 0, the default value of gradient_epsilon is 1e-6; otherwise, the default value is 1e-4. If lambda_search = TRUE, the conditional values above are 1e-8 and 1e-6, respectively.
remove_collinear_columns: If no regularization factor is applied, the model ignores collinear columns (the coefficient will be 0);
max_runtime_secs: Maximum allowed time in seconds for model training. Use 0 to disable; and
missing_values_handling: How missing values are handled. Can be “MeanImputation” or “Skip”. MeanImputation replaces missing values with the mean for numeric attributes and the mode for categorical ones. It is applied during model training.
Note that here the sky is the limit in terms of adjustments and/or parameterizations. The ideal is to understand the mechanics of each parameter thoroughly and use the best possible combination, as in the sketch below.
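For instance, instead of fixing lambda by hand as above, we could let H2O search the regularization path itself and validate via cross-validation. The sketch below is only illustrative; the parameter values are assumptions of mine, not part of the original tutorial:

[sourcecode language="r"]
# Illustrative alternative parameterization (assumed values, not the tutorial's model):
# let H2O search the lambda path itself and estimate performance with 5-fold CV.
airlines.glm.cv <- h2o.glm(training_frame = airlines.train
                           ,x = X
                           ,y = Y
                           ,family = "binomial"
                           ,alpha = 0.5
                           ,lambda_search = TRUE  # search over a grid of lambda values
                           ,nfolds = 5            # 5-fold cross-validation
                           ,seed = 1234           # fixed seed for reproducibility
                           ,missing_values_handling = "MeanImputation")
[/sourcecode]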
With our model fitted, let’s look at some basic statistics of this model.
[sourcecode language="r"]
# View model information: training statistics, performance, important variables
summary(airlines.glm)

# Model Details:
# ==============
#
# H2OBinomialModel: glm
# Model Key: GLM_model_R_1484053333586_1
# GLM Model: summary
#     family  link                              regularization number_of_predictors_total number_of_active_predictors number_of_iterations  training_frame
# 1 binomial logit Elastic Net (alpha = 0.5, lambda = 1.0E-5 )                        283                         272                    5 RTMP_sid_a6c9_1
#
# H2OBinomialMetrics: glm
# ** Reported on training data. **
#
# MSE: 0.2098326
# RMSE: 0.4580749
# LogLoss: 0.607572
# Mean Per-Class Error: 0.3720209
# AUC: 0.7316312
# Gini: 0.4632623
# R^2: 0.1602123
# Null Deviance: 41328.6
# Residual Deviance: 36240.45
# AIC: 36786.45
#
# Confusion Matrix for F1-optimal threshold:
#          NO   YES    Error            Rate
# NO     5418  9146 0.627987  =9146/14564
# YES    1771 13489 0.116055  =1771/15260
# Totals 7189 22635 0.366047 =10917/29824
#
# Maximum Metrics: Maximum metrics at their respective thresholds
#                          metric threshold    value idx
# 1                        max f1  0.363651 0.711915 294
# 2                        max f2  0.085680 0.840380 389
# 3                  max f0point5  0.539735 0.683924 196
# 4                  max accuracy  0.521518 0.673887 207
# 5                 max precision  0.987571 1.000000   0
# 6                    max recall  0.040200 1.000000 398
# 7               max specificity  0.987571 1.000000   0
# 8              max absolute_mcc  0.521518 0.348709 207
# 9   max min_per_class_accuracy  0.513103 0.672412 212
# 10 max mean_per_class_accuracy  0.521518 0.674326 207
[/sourcecode]
Here, this model already shows an AUC of 0.7316 (73.16%). Not bad for a model with few adjustments.
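Keep in mind that this AUC is reported on the training data. As a quick, hedged check (reusing the airlines.test frame from the split above), we can also score the model on the held-out set:

[sourcecode language="r"]
# Score the fitted model on the test frame and extract the held-out AUC
perf <- h2o.performance(model = airlines.glm, newdata = airlines.test)
h2o.auc(perf) # AUC on data the model has never seen
[/sourcecode]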
Let’s analyze the importance of each variable in the model:
[sourcecode language="r"]
# Get the variable importances of the model
h2o.varimp(airlines.glm)

# Standardized Coefficient Magnitudes: standardized coefficient magnitudes
#        names coefficients sign
# 1 Origin.TLH     3.233673  NEG
# 2 Origin.CRP     2.998012  NEG
# 3 Origin.LIH     2.859198  NEG
# 4   Dest.LYH     2.766090  POS
# 5 Origin.KOA     2.461819  NEG
#
# ---
#          names coefficients sign
# 278   Dest.JAN     0.000000  POS
# 279   Dest.LIT     0.000000  POS
# 280   Dest.SJU     0.000000  POS
# 281 Origin.LAN     0.000000  POS
# 282 Origin.SBN     0.000000  POS
# 283 Origin.SDF     0.000000  POS
[/sourcecode]
Some attribute values are decisive in departure flight delays, especially if the origin airports are TLH (Tallahassee International Airport), CRP (Corpus Christi International Airport), LIH (Lihue Airport), and KOA (Kona International Airport).
Now let’s use the predict function to see how the model made its classifications.
[sourcecode language="r"]
# Predict using the GLM model
pred = h2o.predict(object = airlines.glm, newdata = airlines.test)
[/sourcecode]

After that, let's look at the result of our model.

[sourcecode language="r"]
# Look at the summary of predictions: probability of TRUE class (p1)
summary(pred)

#  predict        NO                YES
#  YES:9798   Min.   :0.01186   Min.   :0.02857
#  NO :3126   1st Qu.:0.33715   1st Qu.:0.37018
#  NA : 382   Median :0.48541   Median :0.51363
#             Mean   :0.48780   Mean   :0.51220
#             3rd Qu.:0.62886   3rd Qu.:0.66189
#             Max.   :0.97143   Max.   :0.98814
#             NA's   :382       NA's   :382
[/sourcecode]
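If we want to compare these predictions with the actual labels in plain R, a minimal sketch could look like the following (assuming, as the output above suggests, that IsDepDelayed holds YES/NO labels matching the predict column):

[sourcecode language="r"]
# Bring predictions and actual labels into the R session and compute a simple hit rate.
# Note: as.data.frame() materializes the data in R, so do this only for small frames.
pred_local   <- as.data.frame(pred)
actual_local <- as.data.frame(airlines.test$IsDepDelayed)
hits <- pred_local$predict == actual_local$IsDepDelayed
mean(hits, na.rm = TRUE) # fraction of correctly classified flights (NAs skipped)
[/sourcecode]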
Another way to use the GLM model in H2O is to perform parameter selection via Grid Search.
Grid Search is nothing more than a method for optimizing the hyperparameters of a Machine Learning model: a model is trained and evaluated for every combination of candidate parameter values, and the combination yielding the lowest error according to a specific error function (or the best value of a chosen metric) wins.
What we will do now is use GLM to obtain the best model based on a specific combination of parameters.
First, we will construct a list of alpha parameters (recapping, this parameter creates a combination of Lasso and Ridge via Elastic-Net). In this case, we will pass a list ranging from 0.0 to 1 with an increment of 0.05 for each parameter.
[sourcecode language="r"]
# Construct a hyperparameter space
alpha_opts = c(0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50,
               0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1)
[/sourcecode]

Now let's create a list with these parameters to pass to the Grid function later.

[sourcecode language="r"]
# List of hyperparameters
hyper_params_opt = list(alpha = alpha_opts)
[/sourcecode]

In the Grid Search function, we will pass some parameters for model fitting, as shown below.

[sourcecode language="r"]
# Grid object with hyperparameters
glm_grid <- h2o.grid("glm"
                     ,grid_id = "glm_grid_1"
                     ,x = X
                     ,y = Y
                     ,training_frame = airlines.train
                     ,hyper_params = hyper_params_opt
                     ,family = "binomial")
[/sourcecode]
This step can take a considerable amount of time depending on your data volume and the number of parameters chosen in the search list.
After the processing is complete, we will sort the list of models according to AUC.
[sourcecode language="r"]
# Sort the grid models by AUC. Little note: since we're dealing with classification
# in a probabilistic fashion, we'll use AUC as the model selection metric.
# If the nature of the problem is cost-sensitive (e.g. a delayed departure is much more
# expensive for the airport service than a delayed arrival), precision and recall
# can be better choices.
# With decreasing = FALSE the grid is sorted by increasing AUC, so the best model comes last.
glm_sorted_grid <- h2o.getGrid(grid_id = "glm_grid_1", sort_by = "auc", decreasing = FALSE)
[/sourcecode]
To evaluate each of the models, we can display the order of models according to AUC.
[sourcecode language="r"]
# Print the models
print(glm_sorted_grid)

# H2O Grid Details
# ================
#
# Grid ID: glm_grid_1
# Used hyper parameters:
#   - alpha
# Number of models: 21
# Number of failed models: 0
#
# Hyper-Parameter Search Summary: ordered by increasing auc
#          alpha           model_ids                auc
# 1  [D@4800a43e  glm_grid_1_model_1 0.7076911403181928
# 2  [D@66030470  glm_grid_1_model_2 0.7122987232329416
# 3  [D@6a4a43d3  glm_grid_1_model_3 0.7145455620514375
# 4  [D@17604a1a  glm_grid_1_model_4  0.715989429818657
# 5  [D@21e1e99f  glm_grid_1_model_5 0.7169797604977775
#
# ---
#          alpha           model_ids                auc
# 16 [D@78833412 glm_grid_1_model_16  0.720595118360825
# 17 [D@44d770f2 glm_grid_1_model_17 0.7207086912177467
# 18 [D@31669527 glm_grid_1_model_18 0.7208228330257134
# 19 [D@5b376f34 glm_grid_1_model_19 0.7209144533220885
# 20 [D@6acad45e glm_grid_1_model_20 0.7209885192412766
# 21 [D@237ad7de  glm_grid_1_model_0 0.7240682725570593
[/sourcecode]
With these parameters, the best model is glm_grid_1_model_0, which achieved an AUC of about 72.4%. (Note: this model is slightly worse than the default model, since the Grid's parameter set differs from the first model's.)
To get the best model, simply execute the command below:
[sourcecode language="r"]
# Grab the model_id with the best AUC.
# Since the grid above is sorted by increasing AUC, the best model is the last one.
best_glm_model_id <- glm_sorted_grid@model_ids[[length(glm_sorted_grid@model_ids)]]
[/sourcecode]

[sourcecode language="r"]
# The best model
best_glm <- h2o.getModel(best_glm_model_id)
[/sourcecode]

Let's examine the characteristics of this model:

[sourcecode language="r"]
# Summary
summary(best_glm)

# Model Details:
# ==============
#
# H2OBinomialModel: glm
# Model Key: glm_grid_1_model_0
# GLM Model: summary
#     family  link              regularization number_of_predictors_total number_of_active_predictors number_of_iterations  training_frame
# 1 binomial logit Ridge ( lambda = 7.29E-5 )                        283                         282                    3 RTMP_sid_a6c9_1
#
# H2OBinomialMetrics: glm
# ** Reported on training data. **
#
# MSE: 0.2121424
# RMSE: 0.4605891
# LogLoss: 0.612699
# Mean Per-Class Error: 0.3833898
# AUC: 0.7240683
# Gini: 0.4481365
# R^2: 0.1494395
# Null Deviance: 42448.59
# Residual Deviance: 37585.41
# AIC: 38151.41
#
# Confusion Matrix for F1-optimal threshold:
#          NO   YES    Error            Rate
# NO     4993  9601 0.657873  =9601/14594
# YES    1751 14327 0.108907  =1751/16078
# Totals 6744 23928 0.370110 =11352/30672
#
# Maximum Metrics: Maximum metrics at their respective thresholds
#                          metric threshold    value idx
# 1                        max f1  0.373247 0.716243 296
# 2                        max f2  0.105583 0.846435 391
# 3                  max f0point5  0.551991 0.685249 194
# 4                  max accuracy  0.513313 0.665949 218
# 5                 max precision  0.980714 1.000000   0
# 6                    max recall  0.048978 1.000000 399
# 7               max specificity  0.980714 1.000000   0
# 8              max absolute_mcc  0.548278 0.332916 196
# 9   max min_per_class_accuracy  0.524282 0.664324 211
# 10 max mean_per_class_accuracy  0.548278 0.666166 196
#
# Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>) or h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)
#
# Scoring History:
#             timestamp   duration iteration negative_log_likelihood objective
# 1 2017-01-10 11:11:07  0.000 sec         0             21224.29620   0.69198
# 2 2017-01-10 11:11:07  0.066 sec         1             18857.11178   0.61705
# … [truncated]
[/sourcecode]
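Since the opening of the post mentioned production, a natural last step is persisting the winning model. Here is a hedged sketch (the path below is an illustrative assumption, not from the original tutorial):

[sourcecode language="r"]
# Persist the best model to disk so it can be reloaded later,
# e.g. in the production cluster mentioned at the beginning of the post.
model_path <- h2o.saveModel(object = best_glm, path = "/tmp/h2o_models", force = TRUE)
print(model_path)

# Reload the model (for example, after restarting the cluster)
best_glm_reloaded <- h2o.loadModel(model_path)
[/sourcecode]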
With this, we conclude this post/tutorial on how to use R with H2O.ai.
Over the coming weeks, we will bring more tutorials and explore the power of H2O a bit further.
Best regards!