Implementing an ML Model with AutoML

Implement a Machine Learning Model Using AutoML:

Install the h2o module to use AutoML.

!pip install h2o
Collecting h2o
  Downloading https://files.pythonhosted.org/packages/f5/4a/e24acf8729af20384a1788e97b39b016be4bbf46a0bb475038f1fee97260/h2o-3.30.0.7.tar.gz (128.8MB)
     |████████████████████████████████| 128.8MB 84kB/s 
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from h2o) (2.23.0)
Requirement already satisfied: tabulate in /usr/local/lib/python3.6/dist-packages (from h2o) (0.8.7)
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from h2o) (0.16.0)
Collecting colorama>=0.3.8
  Downloading https://files.pythonhosted.org/packages/c9/dc/45cdef1b4d119eb96316b3117e6d5708a08029992b2fee2c143c7a0a5cc5/colorama-0.4.3-py2.py3-none-any.whl
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->h2o) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->h2o) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->h2o) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->h2o) (2020.6.20)
Building wheels for collected packages: h2o
  Building wheel for h2o (setup.py) ... done
  Created wheel for h2o: filename=h2o-3.30.0.7-py2.py3-none-any.whl size=128865965 sha256=73528d7a6beb2b647c8ea501e4fec0ade3a5f9fda31be352aab1679483d59b99
  Stored in directory: /root/.cache/pip/wheels/a6/c2/6d/9612d426d2c947be23a8cd2d0156a9107927de630b8821ecea
Successfully built h2o
Installing collected packages: colorama, h2o
Successfully installed colorama-0.4.3 h2o-3.30.0.7

Import the h2o Python module and H2OAutoML class and initialize a local H2O cluster.

import h2o
from h2o.automl import H2OAutoML
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.7" 2020-04-14; OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-2ubuntu218.04); OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-2ubuntu218.04, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpaw9v2fg3
  JVM stdout: /tmp/tmpaw9v2fg3/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpaw9v2fg3/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O_cluster_uptime: 02 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.30.0.7
H2O_cluster_version_age: 6 hours and 5 minutes
H2O_cluster_name: H2O_from_python_unknownUser_qrzuv8
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.180 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2
H2O_cluster_status: accepting new members, healthy
H2O_connection_url: http://127.0.0.1:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 3.6.9 final

# Preview the first rows of the frame (run this after the data is loaded in the "Load Data" step below)
data.head()
sku national_inv lead_time in_transit_qty forecast_3_month forecast_6_month forecast_9_month sales_1_month sales_3_month sales_6_month sales_9_month min_bank potential_issue pieces_past_due perf_6_month_avg perf_12_month_avg local_bo_qty deck_risk oe_constraint ppap_risk stop_auto_buy rev_stop went_on_backorder
1.11312e+06 0 8 1 6 6 6 0 4 9 12 0 No 1 0.9 0.89 0 No No No Yes No Yes
1.11327e+06 0 8 0 2 3 4 1 2 3 3 0 No 0 0.96 0.97 0 No No No Yes No Yes
1.11387e+06 20 2 0 45 99 153 16 42 80 111 10 No 0 0.81 0.88 0 No No No Yes No Yes
1.11422e+06 0 8 0 9 14 21 5 17 36 43 0 No 0 0.96 0.98 0 No No No Yes No Yes
1.11482e+06 0 12 0 31 31 31 7 15 33 47 2 No 3 0.98 0.98 0 No No No Yes No Yes
1.11545e+06 55 8 0 216 360 492 30 108 275 340 51 No 0 0 0 0 No No Yes Yes No Yes
1.11562e+06 -34 8 0 120 240 240 83 122 144 165 33 No 0 1 0.97 34 No No No Yes No Yes
1.11645e+06 4 9 0 43 67 115 5 22 40 58 4 No 0 0.69 0.68 0 No No No Yes No Yes
1.11683e+06 2 8 0 4 6 9 1 5 6 9 2 No 0 1 0.95 0 No No No Yes No Yes
1.11687e+06 -7 8 0 56 96 112 13 30 56 76 0 No 0 0.97 0.92 7 No No No Yes No Yes

Load Data:

For this example we will load product_backorders.csv for binary classification. The goal is to predict whether a product will be put on backorder status, given product metrics such as current inventory, transit time, demand forecasts, and prior sales. The file can be loaded either from GitHub or from a local copy.
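
For the local case, one possible sketch picks a local copy when it exists and falls back to the GitHub URL otherwise. The helper name resolve_data_source and the local filename are assumptions made for illustration; h2o.import_file accepts either kind of path.

```python
import os

# URL used in this post; the local filename below is a hypothetical example.
GITHUB_URL = (
    "https://github.com/h2oai/h2o-tutorials/raw/master/"
    "h2o-world-2017/automl/data/product_backorders.csv"
)


def resolve_data_source(local_path, fallback_url=GITHUB_URL):
    """Return local_path if that file exists, otherwise the fallback URL."""
    return local_path if os.path.isfile(local_path) else fallback_url


source = resolve_data_source("product_backorders.csv")
# With a running cluster, the chosen source would then be loaded with:
#   data = h2o.import_file(source)
```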

# Load the Data:
data_file_path="https://github.com/h2oai/h2o-tutorials/raw/master/h2o-world-2017/automl/data/product_backorders.csv"
data = h2o.import_file(data_file_path)
Parse progress: |█████████████████████████████████████████████████████████| 100%
y = "went_on_backorder"
X = data.columns
X.remove(y)
X.remove("sku")
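
The same selection can also be written without modifying the list in place, which may be easier to reuse. This is only a sketch: select_predictors is a hypothetical helper, and the shortened column list stands in for data.columns.

```python
def select_predictors(columns, response, ignored=()):
    """Return every column except the response and any ignored columns."""
    excluded = {response, *ignored}
    return [name for name in columns if name not in excluded]


# A shortened stand-in for data.columns, for illustration only.
columns = ["sku", "national_inv", "lead_time", "went_on_backorder"]
y = "went_on_backorder"
X = select_predictors(columns, response=y, ignored=["sku"])
print(X)  # ['national_inv', 'lead_time']
```

Dropping sku is reasonable here because it identifies a product rather than describing it.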

Run the AutoML:

Run AutoML, stopping after 10 models. The max_models argument specifies the number of individual (or “base”) models, and does not include the two ensemble models that are trained at the end.

# Run AutoML:
auto_ml = H2OAutoML(max_models = 10, seed = 1)
auto_ml.train(x = X, y = y, training_frame = data)
AutoML progress: |████████████████████████████████████████████████████████| 100%

Leader Board:

Next we view the AutoML leaderboard. Since we did not specify a leaderboard_frame in the H2OAutoML.train() method for scoring and ranking the models, the leaderboard ranks them by cross-validation metrics. Put simply, it is a summary of the models ranked from best to worst, here by AUC. The leader model is stored at auto_ml.leader and the leaderboard at auto_ml.leaderboard.

leader_board=auto_ml.leaderboard
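
If ranking on held-out data is preferred over cross-validation metrics, a leaderboard_frame can be passed to train(). The sketch below assumes an 80/20 split; the helper name train_with_holdout and the ratio are illustrative choices, not part of the original workflow.

```python
def train_with_holdout(automl, frame, x, y, train_ratio=0.8, seed=1):
    """Train AutoML on one split and rank models on the held-out split.

    `automl` is expected to be an H2OAutoML instance and `frame` an H2OFrame;
    only their `train` and `split_frame` methods and the `leaderboard`
    attribute are used here.
    """
    # split_frame returns one frame per ratio plus one for the remainder.
    train, holdout = frame.split_frame(ratios=[train_ratio], seed=seed)
    automl.train(x=x, y=y, training_frame=train, leaderboard_frame=holdout)
    return automl.leaderboard


# Intended usage with the objects from this post (requires a running cluster):
#   auto_ml = H2OAutoML(max_models=10, seed=1)
#   leader_board = train_with_holdout(auto_ml, data, X, y)
```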

Now we will view a snapshot of the top models.

leader_board.head()
model_id                                             auc       logloss   aucpr     mean_per_class_error  rmse      mse
StackedEnsemble_AllModels_AutoML_20200721_233627     0.950875  0.18191   0.749727  0.149404              0.227568  0.0517873
StackedEnsemble_BestOfFamily_AutoML_20200721_233627  0.950305  0.183105  0.746107  0.151635              0.228331  0.0521349
GBM_4_AutoML_20200721_233627                         0.948839  0.173579  0.73916   0.157246              0.22659   0.051343
GBM_3_AutoML_20200721_233627                         0.94683   0.177091  0.7331    0.147716              0.22862   0.0522671
XGBoost_3_AutoML_20200721_233627                     0.945957  0.176662  0.736604  0.150975              0.228394  0.0521638
GBM_2_AutoML_20200721_233627                         0.945111  0.179764  0.727168  0.166382              0.230232  0.0530067
GBM_5_AutoML_20200721_233627                         0.944997  0.17789   0.731015  0.14231               0.229819  0.0528166
XGBoost_1_AutoML_20200721_233627                     0.944094  0.181315  0.726938  0.170148              0.229817  0.0528157
XGBoost_2_AutoML_20200721_233627                     0.943922  0.180467  0.72038   0.153593              0.229968  0.0528851
GBM_1_AutoML_20200721_233627                         0.942459  0.183815  0.720288  0.15893               0.232004  0.0538257

If we need to view the entire leaderboard:

leader_board.head(rows=leader_board.nrows)
model_id                                             auc       logloss   aucpr     mean_per_class_error  rmse      mse
StackedEnsemble_AllModels_AutoML_20200721_233627     0.950875  0.18191   0.749727  0.149404              0.227568  0.0517873
StackedEnsemble_BestOfFamily_AutoML_20200721_233627  0.950305  0.183105  0.746107  0.151635              0.228331  0.0521349
GBM_4_AutoML_20200721_233627                         0.948839  0.173579  0.73916   0.157246              0.22659   0.051343
GBM_3_AutoML_20200721_233627                         0.94683   0.177091  0.7331    0.147716              0.22862   0.0522671
XGBoost_3_AutoML_20200721_233627                     0.945957  0.176662  0.736604  0.150975              0.228394  0.0521638
GBM_2_AutoML_20200721_233627                         0.945111  0.179764  0.727168  0.166382              0.230232  0.0530067
GBM_5_AutoML_20200721_233627                         0.944997  0.17789   0.731015  0.14231               0.229819  0.0528166
XGBoost_1_AutoML_20200721_233627                     0.944094  0.181315  0.726938  0.170148              0.229817  0.0528157
XGBoost_2_AutoML_20200721_233627                     0.943922  0.180467  0.72038   0.153593              0.229968  0.0528851
GBM_1_AutoML_20200721_233627                         0.942459  0.183815  0.720288  0.15893               0.232004  0.0538257
DRF_1_AutoML_20200721_233627                         0.935803  0.222161  0.692536  0.171452              0.254289  0.064663
GLM_1_AutoML_20200721_233627                         0.741995  0.338675  0.266396  0.29912               0.314387  0.0988395
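
The full leaderboard is consistent with the max_models = 10 setting: 10 base models plus the 2 stacked ensembles give 12 rows. A small sketch that checks this from the model IDs listed above follows. Grouping by the text before the first underscore is an assumption made here for illustration, based only on the IDs shown in this run.

```python
from collections import Counter

# Model IDs copied from the full leaderboard above.
model_ids = [
    "StackedEnsemble_AllModels_AutoML_20200721_233627",
    "StackedEnsemble_BestOfFamily_AutoML_20200721_233627",
    "GBM_4_AutoML_20200721_233627",
    "GBM_3_AutoML_20200721_233627",
    "XGBoost_3_AutoML_20200721_233627",
    "GBM_2_AutoML_20200721_233627",
    "GBM_5_AutoML_20200721_233627",
    "XGBoost_1_AutoML_20200721_233627",
    "XGBoost_2_AutoML_20200721_233627",
    "GBM_1_AutoML_20200721_233627",
    "DRF_1_AutoML_20200721_233627",
    "GLM_1_AutoML_20200721_233627",
]

# Group by the algorithm prefix that precedes the first underscore.
families = Counter(model_id.split("_")[0] for model_id in model_ids)
base_models = sum(n for name, n in families.items() if name != "StackedEnsemble")

print(dict(families))
print(base_models)  # 10
```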

Save the Leader Model:

h2o.save_model(auto_ml.leader, path = "./automl_classify_model_bin")
'/content/automl_classify_model_bin/StackedEnsemble_AllModels_AutoML_20200721_233627'

Download the model for future use:

auto_ml.leader.download_mojo(path = "./")
'/content/StackedEnsemble_AllModels_AutoML_20200721_233627.zip'

We can later use the h2o module to load the saved model and make predictions. For more detail on the module, and for adapting it to your own requirements, refer to the documentation at h2o.ai.
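
A minimal sketch of that loading step, assuming the binary model saved with h2o.save_model above and a cluster already started with h2o.init(). The helper name load_and_predict is hypothetical.

```python
def load_and_predict(model_path, frame):
    """Load a model saved with h2o.save_model and score an H2OFrame."""
    import h2o  # imported lazily so the helper can be defined without a cluster

    model = h2o.load_model(model_path)
    return model.predict(frame)


# Intended usage, with the path returned by h2o.save_model earlier in this post
# (requires h2o.init() to have been called in the session):
#   saved = "/content/automl_classify_model_bin/StackedEnsemble_AllModels_AutoML_20200721_233627"
#   predictions = load_and_predict(saved, data)
#   predictions.head()
```

Binary models saved this way generally need to be loaded with the same H2O version that produced them; the MOJO downloaded above is the more portable format.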
