
AutoML using H2O

Automated machine learning (AutoML) is the process of automating the end-to-end application of machine learning to real-world problems. AutoML automates most of the steps in an ML pipeline with minimal human effort and without compromising performance.

Automated machine learning broadly includes the following steps:

- Data preprocessing
- Feature engineering
- Model selection and training
- Hyperparameter tuning
- Ensembling and stacking of models
- Model evaluation

H2O AutoML contains cutting-edge, distributed implementations of many machine learning algorithms. These algorithms are available in Java, Python, Spark, Scala, and R. H2O also provides a web GUI that uses JSON to work with these algorithms. Models trained with H2O AutoML can be easily deployed on a Spark server, AWS, etc.

The main advantage of H2O AutoML is that it automates steps like basic data processing, model training and tuning, and ensembling and stacking of various models to produce the best-performing models, so that developers can focus on other steps like data collection, feature engineering, and model deployment.
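
As a quick illustration, a typical H2O AutoML run in Python boils down to only a few lines. The following is a minimal sketch (the file name and target column are placeholders, not part of this article's dataset); the full walkthrough appears in the implementation section below.

import h2o
from h2o.automl import H2OAutoML

# start (or connect to) a local H2O cluster
h2o.init()

# hypothetical dataset and target column, for illustration only
data = h2o.import_file("my_dataset.csv")
y = "target"
x = [c for c in data.columns if c != y]

# let AutoML handle model training, tuning and stacking
aml = H2OAutoML(max_runtime_secs=300, seed=1)
aml.train(x=x, y=y, training_frame=data)
print(aml.leaderboard)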



Functionalities of H2O AutoML

Architecture:

H2O AutoML is built on the H2O architecture, which can be divided into layers: the top layer consists of the different client APIs, and the bottom layer is the H2O JVM.

H2O Software Stack

H2O provides REST API clients for Python, R, Excel, Tableau, and Flow Web UI using socket connections.
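
For example, the Python client can attach to an H2O instance that is already running by talking to its REST API. A minimal sketch (the URL assumes the default local port 54321):

import h2o

# attach to an H2O instance that is already running (default port 54321)
h2o.connect(url="http://localhost:54321")

# show details of the cluster the client is now connected to
h2o.cluster().show_status()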

The bottom layer contains different components that will run on the H2O JVM process. 

An H2O cluster consists of one or more nodes. Each node is a single JVM process. Each JVM process is split into three layers: language, algorithms, and core infrastructure. 
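
Because each node is a JVM process, the resources given to that JVM can be controlled when the cluster is started. A hedged sketch using parameters of h2o.init (the values are illustrative, not recommendations):

import h2o

# start a single-node cluster while capping the JVM threads and memory
h2o.init(nthreads=2, max_mem_size="4G")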

Implementation:

Code: 




# import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Code: 




# load the California housing training data
df = pd.read_csv('sample_data/california_housing_train.csv')

Code: 




# print the first 5 rows of the dataframe
df.head()

longitude    latitude    housing_median_age    total_rooms    total_bedrooms    population    households    median_income    median_house_value
0    -114.31    34.19    15.0    5612.0    1283.0    1015.0    472.0    1.4936    66900.0
1    -114.47    34.40    19.0    7650.0    1901.0    1129.0    463.0    1.8200    80100.0
2    -114.56    33.69    17.0    720.0    174.0    333.0    117.0    1.6509    85700.0
3    -114.57    33.64    14.0    1501.0    337.0    515.0    226.0    3.1917    73400.0
4    -114.57    33.57    20.0    1454.0    326.0    624.0    262.0    1.9250    65500.0

Code: 




# calculate total null values in every column
df.isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64
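
The dataset has no missing values, but if it did, a simple pandas-level imputation could be applied before handing the frame to H2O. A minimal sketch (a no-op for this dataset):

# fill numeric missing values with the column mean
df = df.fillna(df.mean(numeric_only=True))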

Code: 




# install H2O (run once, if it is not already installed)
# !pip install h2o

# import H2O
import h2o
# we will use the default parameters of the init method here
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.7" 2020-04-14; OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-2ubuntu218.04); OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-2ubuntu218.04, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpebz1_45i
  JVM stdout: /tmp/tmpebz1_45i/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpebz1_45i/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O_cluster_uptime:    03 secs
H2O_cluster_timezone:    Etc/UTC
H2O_data_parsing_timezone:    UTC
H2O_cluster_version:    3.30.0.6
H2O_cluster_version_age:    13 days
H2O_cluster_name:    H2O_from_python_unknownUser_h4lj71
H2O_cluster_total_nodes:    1
H2O_cluster_free_memory:    3.180 Gb
H2O_cluster_total_cores:    2
H2O_cluster_allowed_cores:    2
H2O_cluster_status:    accepting new members, healthy
H2O_connection_url:    http://127.0.0.1:54321
H2O_connection_proxy:    {"http": null, "https": null}
H2O_internal_security:    False
H2O_API_Extensions:    Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version:    3.6.9 final




# convert the pandas DataFrame into an H2OFrame
train_df = h2o.H2OFrame(df)
# describe the training H2OFrame
train_df.describe()

Parse progress: |████████████████████████████████████████████████████████| 100%
Rows:17000
Cols:9


longitude    latitude    housing_median_age    total_rooms    total_bedrooms    population    households    median_income    median_house_value
type    real    real    int    int    int    int    int    real    int
mins    -124.35    32.54    1.0    2.0    1.0    3.0    1.0    0.4999    14999.0
mean    -119.5621082352941    35.62522470588239    28.589352941176436    2643.6644117647143    539.4108235294095    1429.573941176477    501.2219411764718    3.8835781000000016    207300.9123529415
maxs    -114.31    41.95    52.0    37937.0    6445.0    35682.0    6082.0    15.0001    500001.0
sigma    2.0051664084260357    2.137339794657087    12.586936981660406    2179.9470714527765    421.4994515798648    1147.852959159527    384.52084085590155    1.9081565183791034    115983.76438720895
zeros    0    0    0    0    0    0    0    0    0
missing    0    0    0    0    0    0    0    0    0
0    -114.31    34.19    15.0    5612.0    1283.0    1015.0    472.0    1.4936    66900.0
1    -114.47    34.4    19.0    7650.0    1901.0    1129.0    463.0    1.82    80100.0
2    -114.56    33.69    17.0    720.0    174.0    333.0    117.0    1.6509    85700.0
3    -114.57    33.64    14.0    1501.0    337.0    515.0    226.0    3.1917    73400.0
4    -114.57    33.57    20.0    1454.0    326.0    624.0    262.0    1.925    65500.0
5    -114.58    33.63    29.0    1387.0    236.0    671.0    239.0    3.3438    74000.0
6    -114.58    33.61    25.0    2907.0    680.0    1841.0    633.0    2.6768    82400.0
7    -114.59    34.83    41.0    812.0    168.0    375.0    158.0    1.7083    48500.0
8    -114.59    33.61    34.0    4789.0    1175.0    3134.0    1056.0    2.1782    58400.0
9    -114.6    34.83    46.0    1497.0    309.0    787.0    271.0    2.1908    48100.0

Code: 




# load the test data and convert it into an H2OFrame
test = pd.read_csv('sample_data/california_housing_test.csv')
test = h2o.H2OFrame(test)

# select the feature and label columns
x = test.columns
y = 'median_house_value'
# remove the label column from the list of features
x.remove(y)

Parse progress: |████████████████████████████████████████████████████████| 100%

Code: 




# import AutoML from H2O
from h2o.automl import H2OAutoML

# configure the H2OAutoML run
aml = H2OAutoML(max_runtime_secs = 600,
                # exclude_algos =['DeepLearning'],
                seed = 1,
                # stopping_metric ='logloss',
                # sort_metric ='logloss',
                balance_classes = False,
                project_name ='Project 1'
)
# train the model and record the training time with the %time notebook magic
%time aml.train(x = x, y = y, training_frame = train_df)

AutoML progress: |███████████████████████████████████████████████████████| 100%
CPU times: user 40 s, sys: 1.24 s, total: 41.2 s
Wall time: 9min 39s




# view the H2O AutoML leaderboard
lb = aml.leaderboard
# print all rows instead of the default 10 rows
lb.head(rows = lb.nrows)

model_id    mean_residual_deviance    rmse    mse    mae    rmsle
StackedEnsemble_AllModels_AutoML_20200714_173719    2.04045e+09    45171.3    2.04045e+09    29642.1    0.221447
StackedEnsemble_BestOfFamily_AutoML_20200714_173719    2.06576e+09    45450.6    2.06576e+09    29949.4    0.223522
GBM_3_AutoML_20200714_173719    2.15623e+09    46435.2    2.15623e+09    30763.8    0.227577
GBM_4_AutoML_20200714_173719    2.15913e+09    46466.4    2.15913e+09    30786.7    0.228627
XGBoost_grid__1_AutoML_20200714_173719_model_5    2.16562e+09    46536.2    2.16562e+09    31075.9    0.233288
GBM_2_AutoML_20200714_173719    2.17639e+09    46651.8    2.17639e+09    31014.8    0.229731
GBM_grid__1_AutoML_20200714_173719_model_2    2.2457e+09    47388.8    2.2457e+09    31717.9    0.236673
GBM_grid__1_AutoML_20200714_173719_model_4    2.24615e+09    47393.6    2.24615e+09    31533.6    0.235206
GBM_grid__1_AutoML_20200714_173719_model_5    2.30368e+09    47996.7    2.30368e+09    31888    0.234582
GBM_grid__1_AutoML_20200714_173719_model_3    2.31412e+09    48105.3    2.31412e+09    32428.7    0.241596
GBM_1_AutoML_20200714_173719    2.38155e+09    48801.2    2.38155e+09    32817.8    0.241261
GBM_5_AutoML_20200714_173719    2.38712e+09    48858.1    2.38712e+09    32730.3    0.238373
XGBoost_grid__1_AutoML_20200714_173719_model_2    2.41444e+09    49137    2.41444e+09    33359.3    nan
XGBoost_grid__1_AutoML_20200714_173719_model_1    2.43811e+09    49377.2    2.43811e+09    33392.7    nan
XGBoost_grid__1_AutoML_20200714_173719_model_6    2.44549e+09    49451.8    2.44549e+09    33620.7    nan
XGBoost_grid__1_AutoML_20200714_173719_model_7    2.46672e+09    49666.1    2.46672e+09    33264.5    nan
XGBoost_3_AutoML_20200714_173719    2.47346e+09    49733.9    2.47346e+09    33829    nan
XGBoost_grid__1_AutoML_20200714_173719_model_3    2.53867e+09    50385.2    2.53867e+09    33713.1    0.252152
XGBoost_grid__1_AutoML_20200714_173719_model_4    2.61998e+09    51185.8    2.61998e+09    34084.3    nan
GBM_grid__1_AutoML_20200714_173719_model_1    2.63332e+09    51315.9    2.63332e+09    35218.1    nan
XGBoost_1_AutoML_20200714_173719    2.64565e+09    51435.9    2.64565e+09    34900.5    nan
XGBoost_2_AutoML_20200714_173719    2.67031e+09    51675    2.67031e+09    35556.1    nan
DRF_1_AutoML_20200714_173719    2.90447e+09    53893.1    2.90447e+09    36925.5    0.263639
XRT_1_AutoML_20200714_173719    2.92071e+09    54043.6    2.92071e+09    37116.6    0.264397
XGBoost_grid__1_AutoML_20200714_173719_model_8    4.32541e+09    65767.9    4.32541e+09    43502.3    0.287448
DeepLearning_1_AutoML_20200714_173719    5.06767e+09    71187.6    5.06767e+09    49467.4    nan
DeepLearning_grid__2_AutoML_20200714_173719_model_1    6.01537e+09    77558.8    6.01537e+09    56478.1    0.386805
DeepLearning_grid__3_AutoML_20200714_173719_model_1    7.85515e+09    88629.3    7.85515e+09    64133.5    0.448841
GBM_grid__1_AutoML_20200714_173719_model_6    8.44986e+09    91923.1    8.44986e+09    71726.4    0.483173
DeepLearning_grid__1_AutoML_20200714_173719_model_2    8.72689e+09    93417.8    8.72689e+09    65346.1    nan
DeepLearning_grid__1_AutoML_20200714_173719_model_1    8.9643e+09    94680    8.9643e+09    68862.6    nan
GLM_1_AutoML_20200714_173719    1.34525e+10    115985    1.34525e+10    91648.3    0.592579
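
The leaderboard is itself an H2OFrame, so it can be pulled into pandas for easier inspection or plotting. A small sketch:

# convert the leaderboard into a pandas DataFrame and look at the top entries
lb_df = lb.as_data_frame()
print(lb_df[['model_id', 'rmse', 'mae']].head())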

Code: 




# get the top model of the leaderboard
se = aml.leader

# get the metalearner model of the top model
metalearner = h2o.get_model(se.metalearner()['name'])

# list the base learner models and their contribution to the metalearner
metalearner.varimp()

[('XGBoost_grid__1_AutoML_20200714_173719_model_5',
  36607.81502851827,
  1.0,
  0.3400955145231931),
 ('GBM_4_AutoML_20200714_173719',
  33538.168782584005,
  0.9161477885652846,
  0.311577753531396),
 ('GBM_3_AutoML_20200714_173719',
  27022.573640463357,
  0.7381640674105295,
  0.25104628830851705),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_3',
  7512.2319349954105,
  0.2052084214570911,
  0.06979046367994166),
 ('GBM_2_AutoML_20200714_173719',
  1221.399944930078,
  0.03336445903637191,
  0.011347102862762904),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_4',
  897.9511180098376,
  0.024528945999926915,
  0.008342184510556763),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_2',
  839.6650323257486,
  0.022936769967604773,
  0.007800692583632669),
 ('GBM_grid__1_AutoML_20200714_173719_model_2', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_4', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_5', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_3', 0.0, 0.0, 0.0),
 ('GBM_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('GBM_5_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_6', 0.0, 0.0, 0.0),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_7', 0.0, 0.0, 0.0),
 ('XGBoost_3_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('XGBoost_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('XGBoost_2_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('DRF_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('XRT_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_8', 0.0, 0.0, 0.0),
 ('DeepLearning_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('DeepLearning_grid__2_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('DeepLearning_grid__3_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_6', 0.0, 0.0, 0.0),
 ('DeepLearning_grid__1_AutoML_20200714_173719_model_2', 0.0, 0.0, 0.0),
 ('DeepLearning_grid__1_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('GLM_1_AutoML_20200714_173719', 0.0, 0.0, 0.0)]




# model performance on test dataset
model = h2o.get_model('XGBoost_grid__1_AutoML_20200714_173719_model_5')
model.model_performance(test)

ModelMetricsRegression: xgboost
** Reported on test data. **

MSE: 2194912948.887177
RMSE: 46849.89806698812
MAE: 31039.50846508789
RMSLE: 0.24452804591616809
Mean Residual Deviance: 2194912948.887177
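
Beyond these aggregate metrics, the selected model can also generate predictions on the test frame directly. A minimal sketch:

# predict median_house_value for the test frame with the chosen base learner
preds = model.predict(test)
preds.head()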

Code: 




# plot the graph for variable importance
model.varimp_plot(num_of_features = 9)


Code: 




# save the base learner model
model_path = h2o.save_model(model = model, path ='sample_data/', force = True)
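
The saved model can later be reloaded into a running H2O cluster and reused for scoring. A short sketch, assuming the model_path returned above:

# reload the saved model and reuse it on the test frame
loaded_model = h2o.load_model(model_path)
loaded_model.model_performance(test)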


