
AutoML using H2O

Automated machine learning (AutoML) is the process of automating the end-to-end application of machine learning to real-world problems. AutoML automates most of the steps in an ML pipeline with minimal human effort and without compromising performance.

Automated machine learning broadly includes the following steps:

- Data preprocessing
- Feature engineering
- Model selection and training
- Hyperparameter tuning
- Ensembling and stacking of models
- Model evaluation

H2O AutoML contains cutting-edge, distributed implementations of many machine learning algorithms. These algorithms are available in Java, Python, Spark, Scala, and R. H2O also provides a web GUI that uses JSON to work with these algorithms. Models trained with H2O AutoML can be easily deployed on a Spark server, AWS, etc.

The main advantage of H2O AutoML is that it automates steps like basic data processing, model training and tuning, and ensembling and stacking of various models to produce the best-performing models, so that developers can focus on other steps like data collection, feature engineering, and model deployment.
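
As a quick illustration, a typical H2O AutoML run in Python boils down to only a few lines. The following is a minimal sketch (the file name and target column are placeholders, not part of this article's dataset); the full walkthrough appears in the implementation section below.

import h2o
from h2o.automl import H2OAutoML

# start (or connect to) a local H2O cluster
h2o.init()

# hypothetical dataset and target column, for illustration only
data = h2o.import_file("my_dataset.csv")
y = "target"
x = [c for c in data.columns if c != y]

# let AutoML handle model training, tuning and stacking
aml = H2OAutoML(max_runtime_secs=300, seed=1)
aml.train(x=x, y=y, training_frame=data)
print(aml.leaderboard)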



Functionalities of H2O AutoML

Architecture:

H2O AutoML is built on the H2O architecture, which can be divided into layers: the top layer consists of the different client APIs, and the bottom layer is the H2O JVM.

H2O Software Stack

H2O provides REST API clients for Python, R, Excel, Tableau, and Flow Web UI using socket connections.
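
For example, the Python client can attach to an H2O instance that is already running by talking to its REST API. A minimal sketch (the URL assumes the default local port 54321):

import h2o

# attach to an H2O instance that is already running (default port 54321)
h2o.connect(url="http://localhost:54321")

# show details of the cluster the client is now connected to
h2o.cluster().show_status()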

The bottom layer contains different components that will run on the H2O JVM process. 

An H2O cluster consists of one or more nodes. Each node is a single JVM process. Each JVM process is split into three layers: language, algorithms, and core infrastructure. 
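
Because each node is a JVM process, the resources given to that JVM can be controlled when the cluster is started. A hedged sketch using parameters of h2o.init (the values are illustrative, not recommendations):

import h2o

# start a single-node cluster while capping the JVM threads and memory
h2o.init(nthreads=2, max_mem_size="4G")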

Implementation:

Code: 




# import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Code: 




# load the California housing training data
df = pd.read_csv('sample_data/california_housing_train.csv')

Code: 




# print the first 5 rows of the dataframe
df.head()

longitude    latitude    housing_median_age    total_rooms    total_bedrooms    population    households    median_income    median_house_value
0    -114.31    34.19    15.0    5612.0    1283.0    1015.0    472.0    1.4936    66900.0
1    -114.47    34.40    19.0    7650.0    1901.0    1129.0    463.0    1.8200    80100.0
2    -114.56    33.69    17.0    720.0    174.0    333.0    117.0    1.6509    85700.0
3    -114.57    33.64    14.0    1501.0    337.0    515.0    226.0    3.1917    73400.0
4    -114.57    33.57    20.0    1454.0    326.0    624.0    262.0    1.9250    65500.0

Code: 




# calculate total null values in every column
df.isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64
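
The dataset has no missing values, but if it did, a simple pandas-level imputation could be applied before handing the frame to H2O. A minimal sketch (a no-op for this dataset):

# fill numeric missing values with the column mean
df = df.fillna(df.mean(numeric_only=True))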

Code: 




# install H2O (run once, if it is not already installed)
# !pip install h2o

# import H2O
import h2o
# we will use the default parameters of the init method here
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.7" 2020-04-14; OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-2ubuntu218.04); OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-2ubuntu218.04, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpebz1_45i
  JVM stdout: /tmp/tmpebz1_45i/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpebz1_45i/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O_cluster_uptime:    03 secs
H2O_cluster_timezone:    Etc/UTC
H2O_data_parsing_timezone:    UTC
H2O_cluster_version:    3.30.0.6
H2O_cluster_version_age:    13 days
H2O_cluster_name:    H2O_from_python_unknownUser_h4lj71
H2O_cluster_total_nodes:    1
H2O_cluster_free_memory:    3.180 Gb
H2O_cluster_total_cores:    2
H2O_cluster_allowed_cores:    2
H2O_cluster_status:    accepting new members, healthy
H2O_connection_url:    http://127.0.0.1:54321
H2O_connection_proxy:    {"http": null, "https": null}
H2O_internal_security:    False
H2O_API_Extensions:    Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version:    3.6.9 final




# convert the pandas DataFrame into an H2OFrame
train_df = h2o.H2OFrame(df)
# describe the training H2OFrame
train_df.describe()

Parse progress: |████████████████████████████████████████████████████████| 100%
Rows:17000
Cols:9


longitude    latitude    housing_median_age    total_rooms    total_bedrooms    population    households    median_income    median_house_value
type    real    real    int    int    int    int    int    real    int
mins    -124.35    32.54    1.0    2.0    1.0    3.0    1.0    0.4999    14999.0
mean    -119.5621082352941    35.62522470588239    28.589352941176436    2643.6644117647143    539.4108235294095    1429.573941176477    501.2219411764718    3.8835781000000016    207300.9123529415
maxs    -114.31    41.95    52.0    37937.0    6445.0    35682.0    6082.0    15.0001    500001.0
sigma    2.0051664084260357    2.137339794657087    12.586936981660406    2179.9470714527765    421.4994515798648    1147.852959159527    384.52084085590155    1.9081565183791034    115983.76438720895
zeros    0    0    0    0    0    0    0    0    0
missing    0    0    0    0    0    0    0    0    0
0    -114.31    34.19    15.0    5612.0    1283.0    1015.0    472.0    1.4936    66900.0
1    -114.47    34.4    19.0    7650.0    1901.0    1129.0    463.0    1.82    80100.0
2    -114.56    33.69    17.0    720.0    174.0    333.0    117.0    1.6509    85700.0
3    -114.57    33.64    14.0    1501.0    337.0    515.0    226.0    3.1917    73400.0
4    -114.57    33.57    20.0    1454.0    326.0    624.0    262.0    1.925    65500.0
5    -114.58    33.63    29.0    1387.0    236.0    671.0    239.0    3.3438    74000.0
6    -114.58    33.61    25.0    2907.0    680.0    1841.0    633.0    2.6768    82400.0
7    -114.59    34.83    41.0    812.0    168.0    375.0    158.0    1.7083    48500.0
8    -114.59    33.61    34.0    4789.0    1175.0    3134.0    1056.0    2.1782    58400.0
9    -114.6    34.83    46.0    1497.0    309.0    787.0    271.0    2.1908    48100.0

Code: 




# load the test data and convert it into an H2OFrame
test = pd.read_csv('sample_data/california_housing_test.csv')
test = h2o.H2OFrame(test)

# select the feature and label columns
x = test.columns
y = 'median_house_value'
# remove the label column from the list of features
x.remove(y)

Parse progress: |████████████████████████████████████████████████████████| 100%

Code: 




# import AutoML from H2O
from h2o.automl import H2OAutoML

# configure the H2OAutoML run
aml = H2OAutoML(max_runtime_secs = 600,
                # exclude_algos =['DeepLearning'],
                seed = 1,
                # stopping_metric ='logloss',
                # sort_metric ='logloss',
                balance_classes = False,
                project_name ='Project 1'
)
# train the model and record the training time with the %time notebook magic
%time aml.train(x = x, y = y, training_frame = train_df)

AutoML progress: |███████████████████████████████████████████████████████| 100%
CPU times: user 40 s, sys: 1.24 s, total: 41.2 s
Wall time: 9min 39s




# view the H2O AutoML leaderboard
lb = aml.leaderboard
# print all rows instead of the default 10 rows
lb.head(rows = lb.nrows)

model_id    mean_residual_deviance    rmse    mse    mae    rmsle
StackedEnsemble_AllModels_AutoML_20200714_173719    2.04045e+09    45171.3    2.04045e+09    29642.1    0.221447
StackedEnsemble_BestOfFamily_AutoML_20200714_173719    2.06576e+09    45450.6    2.06576e+09    29949.4    0.223522
GBM_3_AutoML_20200714_173719    2.15623e+09    46435.2    2.15623e+09    30763.8    0.227577
GBM_4_AutoML_20200714_173719    2.15913e+09    46466.4    2.15913e+09    30786.7    0.228627
XGBoost_grid__1_AutoML_20200714_173719_model_5    2.16562e+09    46536.2    2.16562e+09    31075.9    0.233288
GBM_2_AutoML_20200714_173719    2.17639e+09    46651.8    2.17639e+09    31014.8    0.229731
GBM_grid__1_AutoML_20200714_173719_model_2    2.2457e+09    47388.8    2.2457e+09    31717.9    0.236673
GBM_grid__1_AutoML_20200714_173719_model_4    2.24615e+09    47393.6    2.24615e+09    31533.6    0.235206
GBM_grid__1_AutoML_20200714_173719_model_5    2.30368e+09    47996.7    2.30368e+09    31888    0.234582
GBM_grid__1_AutoML_20200714_173719_model_3    2.31412e+09    48105.3    2.31412e+09    32428.7    0.241596
GBM_1_AutoML_20200714_173719    2.38155e+09    48801.2    2.38155e+09    32817.8    0.241261
GBM_5_AutoML_20200714_173719    2.38712e+09    48858.1    2.38712e+09    32730.3    0.238373
XGBoost_grid__1_AutoML_20200714_173719_model_2    2.41444e+09    49137    2.41444e+09    33359.3    nan
XGBoost_grid__1_AutoML_20200714_173719_model_1    2.43811e+09    49377.2    2.43811e+09    33392.7    nan
XGBoost_grid__1_AutoML_20200714_173719_model_6    2.44549e+09    49451.8    2.44549e+09    33620.7    nan
XGBoost_grid__1_AutoML_20200714_173719_model_7    2.46672e+09    49666.1    2.46672e+09    33264.5    nan
XGBoost_3_AutoML_20200714_173719    2.47346e+09    49733.9    2.47346e+09    33829    nan
XGBoost_grid__1_AutoML_20200714_173719_model_3    2.53867e+09    50385.2    2.53867e+09    33713.1    0.252152
XGBoost_grid__1_AutoML_20200714_173719_model_4    2.61998e+09    51185.8    2.61998e+09    34084.3    nan
GBM_grid__1_AutoML_20200714_173719_model_1    2.63332e+09    51315.9    2.63332e+09    35218.1    nan
XGBoost_1_AutoML_20200714_173719    2.64565e+09    51435.9    2.64565e+09    34900.5    nan
XGBoost_2_AutoML_20200714_173719    2.67031e+09    51675    2.67031e+09    35556.1    nan
DRF_1_AutoML_20200714_173719    2.90447e+09    53893.1    2.90447e+09    36925.5    0.263639
XRT_1_AutoML_20200714_173719    2.92071e+09    54043.6    2.92071e+09    37116.6    0.264397
XGBoost_grid__1_AutoML_20200714_173719_model_8    4.32541e+09    65767.9    4.32541e+09    43502.3    0.287448
DeepLearning_1_AutoML_20200714_173719    5.06767e+09    71187.6    5.06767e+09    49467.4    nan
DeepLearning_grid__2_AutoML_20200714_173719_model_1    6.01537e+09    77558.8    6.01537e+09    56478.1    0.386805
DeepLearning_grid__3_AutoML_20200714_173719_model_1    7.85515e+09    88629.3    7.85515e+09    64133.5    0.448841
GBM_grid__1_AutoML_20200714_173719_model_6    8.44986e+09    91923.1    8.44986e+09    71726.4    0.483173
DeepLearning_grid__1_AutoML_20200714_173719_model_2    8.72689e+09    93417.8    8.72689e+09    65346.1    nan
DeepLearning_grid__1_AutoML_20200714_173719_model_1    8.9643e+09    94680    8.9643e+09    68862.6    nan
GLM_1_AutoML_20200714_173719    1.34525e+10    115985    1.34525e+10    91648.3    0.592579
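
The leaderboard is itself an H2OFrame, so it can be pulled into pandas for easier inspection or plotting. A small sketch:

# convert the leaderboard into a pandas DataFrame and look at the top entries
lb_df = lb.as_data_frame()
print(lb_df[['model_id', 'rmse', 'mae']].head())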

Code: 




# get the top model of the leaderboard
se = aml.leader

# get the metalearner model of the top model
metalearner = h2o.get_model(se.metalearner()['name'])

# list the base learner models and their contribution to the metalearner
metalearner.varimp()

[('XGBoost_grid__1_AutoML_20200714_173719_model_5',
  36607.81502851827,
  1.0,
  0.3400955145231931),
 ('GBM_4_AutoML_20200714_173719',
  33538.168782584005,
  0.9161477885652846,
  0.311577753531396),
 ('GBM_3_AutoML_20200714_173719',
  27022.573640463357,
  0.7381640674105295,
  0.25104628830851705),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_3',
  7512.2319349954105,
  0.2052084214570911,
  0.06979046367994166),
 ('GBM_2_AutoML_20200714_173719',
  1221.399944930078,
  0.03336445903637191,
  0.011347102862762904),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_4',
  897.9511180098376,
  0.024528945999926915,
  0.008342184510556763),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_2',
  839.6650323257486,
  0.022936769967604773,
  0.007800692583632669),
 ('GBM_grid__1_AutoML_20200714_173719_model_2', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_4', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_5', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_3', 0.0, 0.0, 0.0),
 ('GBM_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('GBM_5_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_6', 0.0, 0.0, 0.0),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_7', 0.0, 0.0, 0.0),
 ('XGBoost_3_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('XGBoost_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('XGBoost_2_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('DRF_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('XRT_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('XGBoost_grid__1_AutoML_20200714_173719_model_8', 0.0, 0.0, 0.0),
 ('DeepLearning_1_AutoML_20200714_173719', 0.0, 0.0, 0.0),
 ('DeepLearning_grid__2_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('DeepLearning_grid__3_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('GBM_grid__1_AutoML_20200714_173719_model_6', 0.0, 0.0, 0.0),
 ('DeepLearning_grid__1_AutoML_20200714_173719_model_2', 0.0, 0.0, 0.0),
 ('DeepLearning_grid__1_AutoML_20200714_173719_model_1', 0.0, 0.0, 0.0),
 ('GLM_1_AutoML_20200714_173719', 0.0, 0.0, 0.0)]




# model performance on test dataset
model = h2o.get_model('XGBoost_grid__1_AutoML_20200714_173719_model_5')
model.model_performance(test)

ModelMetricsRegression: xgboost
** Reported on test data. **

MSE: 2194912948.887177
RMSE: 46849.89806698812
MAE: 31039.50846508789
RMSLE: 0.24452804591616809
Mean Residual Deviance: 2194912948.887177
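
Beyond these aggregate metrics, the selected model can also generate predictions on the test frame directly. A minimal sketch:

# predict median_house_value for the test frame with the chosen base learner
preds = model.predict(test)
preds.head()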

Code: 




# plot the graph for variable importance
model.varimp_plot(num_of_features = 9)


Code: 




# save the base learner model
model_path = h2o.save_model(model = model, path ='sample_data/', force = True)
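
The saved model can later be reloaded into a running H2O cluster and reused for scoring. A short sketch, assuming the model_path returned above:

# reload the saved model and reuse it on the test frame
loaded_model = h2o.load_model(model_path)
loaded_model.model_performance(test)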


