IMPORTANT: The file content has been truncated. Status: Showing lines 1-158 of 158 total lines. Action: To read more of the file, you can use the ‘start_line’ and ‘end_line’ parameters in a subsequent ‘read_file’ call. For example, to read the next section of the file, use start_line: 159.

— FILE CONTENT (truncated) —

layout: post permalink: /glm-grid-search-r-h2oai-backend

title: “Implementing GLM with Grid Search in R using H2O.ai as Backend” image: Cover excerpt: “Implementing GLM with Grid Search in R using H2O.ai as Backend” lang: en-us —

For R users, there’s nothing more annoying than dealing with the tool’s terrible memory management, which greatly limits R’s use as a serious tool for building models that can go into production and enable the creation of intelligent platforms/systems.

In a few lines here, we will show how to use H2O.ai as a processing backend (which abstracts away all these memory and processing problems) for creating a GLM model.

The trick here is that H2O handles all memory management, and regardless of your data source, it manages the entire memory buffer pipeline so that there are no memory overflows; or even generalized system slowdowns.

This example is based entirely on H2O documentation and aims solely to show how this tool works.

In this case, I will use it on a notebook, but it could also be used, for example, on an Amazon machine by running the command below when initializing the cluster:

[sourcecode language=”r”] #Production Cluster (Not applicable) #localH2O <- h2o.init(ip = ‘10.112.81.210’, port =54321, nthreads=-1) # Machine 1 #localH2O <- h2o.init(ip = ‘10.112.80.74’, port =54321, nthreads=-1) # Machine 2 - Yes, here we use a two-node cluster for processing! ;) [/sourcecode]

Firstly, let’s remove any old H2O.ai installation from the machine in question:

[sourcecode language=”r”] # The following two commands remove any previously installed H2O packages for R. if (“package:h2o” %in% search()) { detach(“package:h2o”, unload=TRUE) } if (“h2o” %in% rownames(installed.packages())) { remove.packages(“h2o”) } [/sourcecode] Next, we will download and install all packages that H2O has a direct or indirect dependency on. [sourcecode language=”r”] # Next, we download packages that H2O depends on. if (! (“methods” %in% rownames(installed.packages()))) { install.packages(“methods”) } if (! (“statmod” %in% rownames(installed.packages()))) { install.packages(“statmod”) } if (! (“stats” %in% rownames(installed.packages()))) { install.packages(“stats”) } if (! (“graphics” %in% rownames(installed.packages()))) { install.packages(“graphics”) } if (! (“RCurl” %in% rownames(installed.packages()))) { install.packages(“RCurl”) } if (! (“jsonlite” %in% rownames(installed.packages()))) { install.packages(“jsonlite”) } if (! (“tools” %in% rownames(installed.packages()))) { install.packages(“tools”) } if (! (“utils” %in% rownames(installed.packages()))) { install.packages(“utils”) } [/sourcecode] Then, we will install the H2O library and instantiate it in R Studio. [sourcecode language=”r”] # Now we download, install and initialize the H2O package for R. install.packages(“h2o”, type=”source”, repos=(c(“http://h2o-release.s3.amazonaws.com/h2o/rel-turing/8/R”))) [/sourcecode] [sourcecode language=”r”] # Load library library(h2o) [/sourcecode]

Installation complete and library loaded, let’s move on to some configurations.

Within R Studio itself, you can choose the number of processors the cluster (in this case, your notebook/desktop) will use. Remember that the more cores used, the more processing H2O will consume, and fewer resources will be available for other tasks. The default is to use 2 cores, but in my case, I will use all processors.

[sourcecode language=”r”] # Start instance with all cores. # The -1 is the parameter to use with all cores. Use this carefully. # The default parameter is 2 cores. h2o.init(nthreads = -1) [/sourcecode] Now let’s view our cluster information: [sourcecode language=”r”] # Cluster Info h2o.clusterInfo() # R is connected to the H2O cluster: # H2O cluster uptime: 3 seconds 267 milliseconds # H2O cluster version: 3.10.0.8 # H2O cluster version age: 2 months and 26 days # H2O cluster name: H2O_started_from_R_flavio.clesio_udy929 # H2O cluster total nodes: 1 # H2O cluster total memory: 1.78 GB # H2O cluster total cores: 4 # H2O cluster allowed cores: 4 # H2O cluster healthy: TRUE # H2O Connection ip: localhost # H2O Connection port: 54321 # H2O Connection proxy: NA # R Version: R version 3.3.2 (2016-10-31) [/sourcecode]

As we can see, out of the 4 total processors, I am using all of them (allowed cores) for processing.

Another fact we can see here is that H2O is also instantiated to use the Web GUI. To access it, simply go to the browser at http://localhost:54321/flow/index.html.

For this example, we will use the Airlines dataset, which contains various real flight information in the USA and all causes of delay from 1987 to 2008. The complete 12GB version can be found here.

Moving on, let’s now load the data directly from a URL and import it into an R object.

[sourcecode language=”r”] # GLM Demo Deep Dive # Path of normalized archive. Can be a URL or a local path airlinesURL = “https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv” [/sourcecode] [sourcecode language=”r”] # We’ll create the object .hex (extension of data files in H2O) # and using the importFile property, we’ll set the path and the destination frame. # As default behavior H2O parses the file automatically. airlines.hex = h2o.importFile(path = airlinesURL, destination_frame = “airlines.hex”) [/sourcecode]

In this case, the airlines.hex object will be the dataframe on which H2O will apply the algorithms.

This .hex format is exclusive to H2O and can be used for numerous algorithms within the platform, as it is already optimized to handle sparse objects and/or text-type columns.

To view the descriptive statistics of this file, simply use the same summary() function from R.

[sourcecode language=”r”] # Let’s see the summary summary(airlines.hex) [/sourcecode]

For our experiment, we will split the data into training and testing sets in a 70%/30% proportion.

One necessary thing to mention at this point is that because H2O is a platform designed for Big Data, it uses probabilistic sampling methods. This is necessary (in computational terms) because, in many cases, the operation of selection/stratification can be costly.

[sourcecode language=”r”] # Construct test and train sets using sampling # A small note is that H2O uses probabilistic splitting, which means that resulting splits # can deviate from the exact number. This is necessary when we’re talking about a platform that # deals with big data. If you need exact sampling, the best way is to do this in your RDBMS airlines.split = h2o.splitFrame(data = airlines.hex,ratios = 0.70, seed = -1) [/sourcecode]

After creating the splitFrame object, let’s allocate the partitions to each dataset, where the object in the first position will always be our training base, and the second position will be our testing base.

[sourcecode language=”r”] # Get the train dataframe(1st split object) airlines.train = airlines.split[[1]] # Get the test dataframe(2nd split object) airlines.test = airlines.split[[2]] [/sourcecode]

Let’s summarize each of these frames below to check the distribution of cancelled flights:

[sourcecode language=”r”] # Display a summary using table-like in some summarized way h2o.table(airlines.train$Cancelled) # Cancelled Count # 1 0 29921 # 2 1 751 h2o.table(airlines.test$Cancelled) # Cancelled Count # 1 0 12971 # 2 1 335 [/sourcecode]

With our samples separated, let’s now choose the variables that will enter our model.

First, we will create two objects to pass as parameters to our algorithm.

Since we want to predict whether departure flights are delayed, the Y object (dependent variable) will be the IsDepDelayed variable, and the X object (independent variables) will be all other fields in the dataset.

[sourcecode language=”r”] # Set dependent variable (Is departure delayed) Y = “IsDepDelayed” [/sourcecode] [sourcecode language=”r”] # Set independent variables X = c(“Origin”, “Dest”, “DayofMonth”, “Year”, “UniqueCarrier”, “DayOfWeek”, “Month”, “DepTime”, “ArrTime”, “Distance”) [/sourcecode] Now let’s create the model using GLM: [sourcecode language=”r”] # Define the data for the model and display the results airlines.glm <- h2o.glm(training_frame=airlines.train ,x=X ,y=Y ,family = “binomial” ,alpha = 0.5 ,max_iterations = 300 ,beta_epsilon = 0 ,lambda = 1e-05 ,lambda_search = FALSE ,early_stopping = FALSE ,nfolds = 0 ,seed = NULL ,intercept = TRUE ,gradient_epsilon = -1 ,remove_collinear_columns = FALSE ,max_runtime_secs = 10000 ,missing_values_handling = c(“Skip”)) [/sourcecode]

The meaning of the model parameters are:

x: vector containing the names of the independent variables;

y: index containing the dependent variable;

training_frame: An H2O frame containing the model variables;

family: Specification of the model distribution that can be gaussian, binomial, poisson, gamma, and tweedie. An excellent explanation of how these parameters can be chosen is here in this link;

alpha: A number in [0, 1] specifying the mixture of the elastic-net regularization parameter. It gives the degree of mixing between the LASSO and Ridge regularizers. making alpha = 1 is LASSO penalization, alpha = 0 is ridge penalization;

max_iterations: A non-negative integer specifying the maximum number of interactions of the model;

beta_epsilon: A non-negative integer specifying the maximum difference magnitude between coefficient estimates through successive iterations. In other words: This parameter defines the convergence speed of the GLM model;

lambda: A non-negative parameter for shrinking the variable value through Elastic-Net, which multiplies P(α, β) in the objective function. When lambda = 0, no penalty is applied and the model is already adjusted;

lambda_search: A logical value indicating whether there will be a search criterion in the lambda value space defined by a minimum and maximum parameter;

early_stopping: A value indicating whether there will be an early stop during lambda_search if the likelihood factor stops changing as interactions occur;

nfolds: Number of partitions in Cross-Validation;

seed: Specifies the random number generator (RNG) seed for Cross-Validation (ensures experiment reproducibility);

intercept: Constant term of the model, which broadly means the degree of endogenous factors in the model;

gradient_epsilon: Convergence criterion. Converges if the I-Infinity norm gradient is below a certain limit. If lambda_search = FALSE and lambda = 0, the default gradient_epsilon is 0.000001, otherwise it’s 0.0001. If lambda_search = TRUE, the conditional values above are 1E-8 and 1E-6 respectively.

remove_collinear_columns: If no regularization factor is applied, the model ignores collinear columns (the coefficient will be 0);

max_runtime_secs: Maximum allowed time in seconds for model training. Use 0 to disable; and

missing_values_handling: How missing values are handled. Can be “MeanImputation” or “Skip”. MeanImputation replaces missing values with the mean for numeric attributes and the most frequent value for categorical ones. It’s applied during model training.

Note that here the sky is the limit in terms of adjustments and/or parameterizations. The ideal is to have a perfect understanding of the mechanics of each parameter and use the best possible combination.

With our model tuned, let’s look at some basic statistics for this model.

[sourcecode language=”r”] # View model information: training statistics, performance, important variables summary(airlines.glm) # Model Details: # ============== # # H2OBinomialModel: glm # Model Key: GLM_model_R_1484053333586_1 # GLM Model: summary # family link regularization number_of_predictors_total number_of_active_predictors number_of_iterations training_frame # 1 binomial logit Elastic Net (alpha = 0.5, lambda = 1.0E-5 ) 283 272 5 RTMP_sid_a6c9_1 # # H2OBinomialMetrics: glm # ** Reported on training data. ** # # MSE: 0.2098326 # RMSE: 0.4580749 # LogLoss: 0.607572 # Mean Per-Class Error: 0.3720209 # AUC: 0.7316312 # Gini: 0.4632623 # R^2: 0.1602123 # Null Deviance: 41328.6 # Residual Deviance: 36240.45 # AIC: 36786.45 # # Confusion Matrix for F1-optimal threshold: # NO YES Error Rate # NO 5418 9146 0.627987 =9146/14564 # YES 1771 13489 0.116055 =1771/15260 # Totals 7189 22635 0.366047 =10917/29824 # # Maximum Metrics: Maximum metrics at their respective thresholds # metric threshold value idx # 1 max f1 0.363651 0.711915 294 # 2 max f2 0.085680 0.840380 389 # 3 max f0point5 0.539735 0.683924 196 # 4 max accuracy 0.521518 0.673887 207 # 5 max precision 0.987571 1.000000 0 # 6 max recall 0.040200 1.000000 398 # 7 max specificity 0.987571 1.000000 0 # 8 max absolute_mcc 0.521518 0.348709 207 # 9 max min_per_class_accuracy 0.513103 0.672412 212 # 10 max mean_per_class_accuracy 0.521518 0.674326 207 [/sourcecode]

Here, this model already shows a 73.16% AUC result. Not bad for a model with few adjustments.

Let’s analyze the importance of each variable in the model:

[sourcecode language=”r”] # Get the variable importance of the models h2o.varimp(airlines.glm) # Standardized Coefficient Magnitudes: standardized coefficient magnitudes # names coefficients sign # 1 Origin.TLH 3.233673 NEG # 2 Origin.CRP 2.998012 NEG # 3 Origin.LIH 2.859198 NEG # 4 Dest.LYH 2.766090 POS # 5 Origin.KOA 2.461819 NEG # # — # names coefficients sign # 278 Dest.JAN 0.000000 POS # 279 Dest.LIT 0.000000 POS # 280 Dest.SJU 0.000000 POS # 281 Origin.LAN 0.000000 POS # 282 Origin.SBN 0.000000 POS # 283 Origin.SDF 0.000000 POS [/sourcecode]

Some origin airport attributes are determinants in departure flight delays, especially TLH (Tallahassee International Airport), CRP (Corpus Christi International Airport), LIH (Lihue Airport), and KOA (Kona International Airport).

Now let’s use the predict function to see how the model classified the predictions.

[sourcecode language=”r”] # Predict using GLM model pred = h2o.predict(object = airlines.glm, newdata = airlines.test) [/sourcecode] After that, let’s see our model’s result. [sourcecode language=”r”] # Look at summary of predictions: probability of TRUE class (p1) summary(pred) # predict NO YES # YES:9798 Min. :0.01186 Min. :0.02857 # NO :3126 1st Qu.:0.33715 1st Qu.:0.37018 # NA : 382 Median :0.48541 Median :0.51363 # Mean :0.48780 Mean :0.51220 # 3rd Qu.:0.62886 3rd Qu.:0.66189 # Max. :0.97143 Max. :0.98814 # NA’s :382 NA’s :382 [/sourcecode]

Another way to use the GLM model in H2O is to perform parameter selection via Grid Search.

Grid Search is nothing more than a method for optimizing machine learning model parameters, often involving a combination of numerous parameters, where the combination yielding the lowest error according to a specific error function is chosen.

What we will do now is use GLM to obtain the best model according to a specific parameter combination.

First, let’s construct a list of alpha parameters (recapping, this parameter creates a combination of Lasso and Ridge via Elastic-Net). In this case, we will provide a list ranging from 0.0 to 1 with increments of 0.05 for each parameter.

[sourcecode language=”r”] # Construct a hyper-parameter space alpha_opts = c(0,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1) [/sourcecode] Now let’s create a list with these parameters to pass to the Grid function later. [sourcecode language=”r”] # List of hyperparameters hyper_params_opt = list(alpha = alpha_opts) [/sourcecode] In the Grid Search function, we will pass some parameters for model tuning, as shown below. [sourcecode language=”r”] # Grid object with hyperparameters glm_grid <- h2o.grid(“glm” ,grid_id = “glm_grid_1” ,x=X ,y=Y ,training_frame=airlines.train ,hyper_params = hyper_params_opt ,family = “binomial”) [/sourcecode]

This step can take a considerable amount of time depending on your data volume and the number of parameters chosen in the search list.

After the processing is complete, let’s sort the list of models according to their AUC.

[sourcecode language=”r”] # Sort grids by best performance (lower AUC). Little note: As we’re dealing with classification # in some probabilistic fashion, we’ll use AUC as model selection metric. # If the nature of the problem is cost-sensitive (e.g., a delayed departure plane is much more expensive for # the airport service than a delayed arrival), precision and recall can be the best choice. glm_sorted_grid <- h2o.getGrid(grid_id = “glm_grid_1”, sort_by = “auc”, decreasing = FALSE) [/sourcecode]

To evaluate each of the models, we can display the order of the models according to their AUC.

[sourcecode language=”r”] #Print the models print(glm_sorted_grid) # H2O Grid Details # ================ # # Grid ID: glm_grid_1 # Used hyper parameters: # - alpha # Number of models: 21 # Number of failed models: 0 # # Hyper-Parameter Search Summary: ordered by increasing auc # alpha model_ids auc # 1 [D@4800a43e glm_grid_1_model_1 0.7076911403181928 # 2 [D@66030470 glm_grid_1_model_2 0.7122987232329416 # 3 [D@6a4a43d3 glm_grid_1_model_3 0.7145455620514375 # 4 [D@17604a1a glm_grid_1_model_4 0.715989429818657 # 5 [D@21e1e99f glm_grid_1_model_5 0.7169797604977775 # # — # alpha model_ids auc # 16 [D@78833412 glm_grid_1_model_16 0.720595118360825 # 17 [D@44d770f2 glm_grid_1_model_17 0.7207086912177467 # 18 [D@31669527 glm_grid_1_model_18 0.7208228330257134 # 19 [D@5b376f34 glm_grid_1_model_19 0.7209144533220885 # 20 [D@6acad45e glm_grid_1_model_20 0.7209885192412766 # 21 [D@237ad7de glm_grid_1_model_0 0.7240682725570593 [/sourcecode]

With these parameters, the best model is glm_grid_1_model_0, which achieved around 72.40% AUC. (Note: This model is slightly worse than the default model, given that the parameter set for the Grid is different from the first model).

To retrieve the best model, simply execute the command below:

[sourcecode language=”r”] # Grab the model_id based on AUC best_glm_model_id <- glm_grid@model_ids[[1]] [/sourcecode] [sourcecode language=”r”] # The best model best_glm <- h2o.getModel(best_glm_model_id) [/sourcecode] Let’s examine the characteristics of this model: [sourcecode language=”r”] # Summary summary(best_glm) # Model Details: # ============== # # H2OBinomialModel: glm # Model Key: glm_grid_1_model_0 # GLM Model: summary # family link regularization number_of_predictors_total number_of_active_predictors number_of_iterations training_frame # 1 binomial logit Ridge ( lambda = 7.29E-5 ) 283 282 3 RTMP_sid_a6c9_1 # # H2OBinomialMetrics: glm # ** Reported on training data. ** # # MSE: 0.2121424 # RMSE: 0.4605891 # LogLoss: 0.612699 # Mean Per-Class Error: 0.3833898 # AUC: 0.7240683 # Gini: 0.4481365 # R^2: 0.1494395 # Null Deviance: 42448.59 # Residual Deviance: 37585.41 # AIC: 38151.41 # # Confusion Matrix for F1-optimal threshold: # NO YES Error Rate # NO 4993 9601 0.657873 =9601/14594 # YES 1751 14327 0.108907 =1751/16078 # Totals 6744 23928 0.370110 =11352/30672 # # Maximum Metrics: Maximum metrics at their respective thresholds # metric threshold value idx # 1 max f1 0.373247 0.716243 296 # 2 max f2 0.105583 0.846435 391 # 3 max f0point5 0.551991 0.685249 194 # 4 max accuracy 0.513313 0.665949 218 # 5 max precision 0.980714 1.000000 0 # 6 max recall 0.048978 1.000000 399 # 7 max specificity 0.980714 1.000000 0 # 8 max absolute_mcc 0.548278 0.332916 196 # 9 max min_per_class_accuracy 0.524282 0.664324 211 # 10 max mean_per_class_accuracy 0.548278 0.666166 196 # # Gains/Lift Table: Extract with h2o.gainsLift(&lt;model&gt;, &lt;data&gt;) or h2o.gainsLift(&lt;model&gt;, valid=&lt;T/F&gt;, xval=&lt;T/F&gt;) # # # # Scoring History: # timestamp duration iteration negative_log_likelihood objective # 1 2017-01-10 11:11:07 0.000 sec 0 21224.29620 0.69198 # 2 2017-01-10 11:11:07 0.066 sec 1 18857.11178 0.61705 # … [truncated] [/sourcecode]

With this, we finish this post/tutorial on how to use R with H2O.ai.

Over the next few weeks, we will bring some tutorials and break down the power of this H2O.

Best regards!