IHM Example Using HistGBT
This notebook shows an example of using HistGBT (scikit-learn's HistGradientBoostingClassifier) to model in-hospital mortality (IHM) on the MIMIC-III dataset.
The data is presumed to have already been extracted for the cohort and is described via a YAML configuration, as below:
# USER DEFINED
tgt_col: y_true
idx_cols: stay
time_order_col:
  - Hours
  - seqnum
feat_cols: null
train:
  tgt_file: '{DATA_DIR}/IHM_V0_COHORT_OUT_EXP-SPLIT0-train.csv'
  feat_file: '{DATA_DIR}/IHM_V0_FEAT_EXP-SPLIT0-train.csv'
val:
  tgt_file: '{DATA_DIR}/IHM_V0_COHORT_OUT_EXP-SPLIT0-val.csv'
  feat_file: '{DATA_DIR}/IHM_V0_FEAT_EXP-SPLIT0-val.csv'
test:
  tgt_file: '{DATA_DIR}/IHM_V0_COHORT_OUT_EXP-SPLIT0-test.csv'
  feat_file: '{DATA_DIR}/IHM_V0_FEAT_EXP-SPLIT0-test.csv'
# DATA DEFINITIONS
## Definitions of categorical data in the dataset
category_map:
  Capillary refill rate: ['0.0', '1.0']
  Glascow coma scale eye opening: ['To Pain', '3 To speech', '1 No Response', '4 Spontaneously',
                                   'To Speech', 'Spontaneously', '2 To pain', 'None']
  Glascow coma scale motor response: ['1 No Response', '3 Abnorm flexion', 'Abnormal extension', 'No response',
                                      '4 Flex-withdraws', 'Localizes Pain', 'Flex-withdraws', 'Obeys Commands',
                                      'Abnormal Flexion', '6 Obeys Commands', '5 Localizes Pain', '2 Abnorm extensn']
  Glascow coma scale total: ['11', '10', '13', '12', '15', '14', '3', '5', '4', '7', '6', '9', '8']
  Glascow coma scale verbal response: ['1 No Response', 'No Response', 'Confused', 'Inappropriate Words', 'Oriented',
                                       'No Response-ETT', '5 Oriented', 'Incomprehensible sounds', '1.0 ET/Trach',
                                       '4 Confused', '2 Incomp sounds', '3 Inapprop words']
numerical: ['Heart Rate', 'Fraction inspired oxygen', 'Weight', 'Respiratory rate',
            'pH', 'Diastolic blood pressure', 'Glucose', 'Systolic blood pressure',
            'Height', 'Oxygen saturation', 'Temperature', 'Mean blood pressure']
## Definitions of normal values in the dataset
normal_values:
  Capillary refill rate: 0.0
  Diastolic blood pressure: 59.0
  Fraction inspired oxygen: 0.21
  Glucose: 128.0
  Heart Rate: 86
  Height: 170.0
  Mean blood pressure: 77.0
  Oxygen saturation: 98.0
  Respiratory rate: 19
  Systolic blood pressure: 118.0
  Temperature: 36.6
  Weight: 81.0
  pH: 7.4
  Glascow coma scale eye opening: '4 Spontaneously'
  Glascow coma scale motor response: '6 Obeys Commands'
  Glascow coma scale total: '15'
  Glascow coma scale verbal response: '5 Oriented'
Preamble
The following code cell imports the required libraries and sets up the notebook.
# Jupyter notebook specific imports
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# Imports that inject into the namespace (tqdm.pandas() enables df.progress_apply)
from tqdm.auto import tqdm
tqdm.pandas()
# General imports
import os
import json
import pickle
from pathlib import Path
import pandas as pd
import numpy as np
from getpass import getpass
import argparse
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import NotFittedError
from lightsaber import constants as C
import lightsaber.data_utils.utils as du
from lightsaber.data_utils.pt_dataset import (filter_preprocessor)
from lightsaber.data_utils import sk_dataloader as skd
from lightsaber.trainers import sk_trainer as skr
from sklearn.ensemble import HistGradientBoostingClassifier
import logging
log = logging.getLogger()
data_dir = Path(os.environ.get('LS_DATA_PATH', './data'))
assert data_dir.is_dir()
conf_path = os.environ.get('LS_CONF_PATH', os.path.abspath('./ihm_expt_config.yml'))
expt_conf = du.yaml.load(open(conf_path).read().format(DATA_DIR=data_dir),
Loader=du._Loader)
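As a quick sanity check, we can print a few of the substituted settings from the loaded configuration (an optional snippet; the keys match the YAML shown above):
print(expt_conf['tgt_col'], expt_conf['idx_cols'], expt_conf['time_order_col'])
print(expt_conf['train']['feat_file'])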
IHM Model Training
In general, we need to follow these steps to train a HistGBT model for IHM:
- Data Ingestion: The first step involves setting up the pre-processors to train an IHM model. In this example, we will use a StandardScaler from scikit-learn, applied via filters defined within lightsaber.
- We next read the train, validation, and test datasets. In some cases, users may also want to define a calibration dataset.
- Model Definition: We next define a base model for classification. In this example, we will use a standard scikit-learn HistGradientBoostingClassifier (HistGBT) model.
- Model Training: Once the model is defined, we can use lightsaber to train it via the pre-packaged SKModel and the corresponding trainer code. This step will also generate the relevant metrics for this problem.
We will show how to train a model with a single hyper-parameter setting as well as how to run a grid search over a pre-specified hyper-parameter space.
Data Ingestion
We first start by reading the extracted cohort data, using a StandardScaler to demonstrate the proper usage of a pre-processor:
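# 'flatten' controls how each patient's time series is collapsed into a single
# row for the scikit-learn model (assumption: 'sum' aggregates each feature over time)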
flatten = 'sum'
preprocessor = StandardScaler()
train_filter = [filter_preprocessor(cols=expt_conf['numerical'],
preprocessor=preprocessor,
refit=True),
]
train_dataloader = skd.SKDataLoader(tgt_file=expt_conf['train']['tgt_file'],
feat_file=expt_conf['train']['feat_file'],
idx_col=expt_conf['idx_cols'],
tgt_col=expt_conf['tgt_col'],
feat_columns=expt_conf['feat_cols'],
time_order_col=expt_conf['time_order_col'],
category_map=expt_conf['category_map'],
filter=train_filter,
fill_value=expt_conf['normal_values'],
flatten=flatten,
)
print(train_dataloader.shape, len(train_dataloader))
# For other datasets use fitted preprocessors
fitted_filter = [filter_preprocessor(cols=expt_conf['numerical'],
preprocessor=preprocessor, refit=False),
]
val_dataloader = skd.SKDataLoader(tgt_file=expt_conf['val']['tgt_file'],
feat_file=expt_conf['val']['feat_file'],
idx_col=expt_conf['idx_cols'],
tgt_col=expt_conf['tgt_col'],
feat_columns=expt_conf['feat_cols'],
time_order_col=expt_conf['time_order_col'],
category_map=expt_conf['category_map'],
filter=fitted_filter,
fill_value=expt_conf['normal_values'],
flatten=flatten,
)
test_dataloader = skd.SKDataLoader(tgt_file=expt_conf['test']['tgt_file'],
feat_file=expt_conf['test']['feat_file'],
idx_col=expt_conf['idx_cols'],
tgt_col=expt_conf['tgt_col'],
feat_columns=expt_conf['feat_cols'],
time_order_col=expt_conf['time_order_col'],
category_map=expt_conf['category_map'],
filter=fitted_filter,
fill_value=expt_conf['normal_values'],
flatten=flatten,
)
print(val_dataloader.shape, len(val_dataloader))
print(test_dataloader.shape, len(test_dataloader))
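The refit flag mirrors scikit-learn's usual fit/transform convention: the scaler is fit once on the training split and then only applied to the validation and test splits. A minimal standalone sketch of that pattern (illustrative only; lightsaber's filters handle this internally):
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_val = np.array([[1.5, 15.0]])

scaler = StandardScaler().fit(X_train)   # fit on train only (refit=True)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)   # reuse the fitted statistics (refit=False)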
Training a Single Model
Model definition
We can define a base classification model using the standard scikit-learn workflow, as below:
model_name = 'HistGBT'
hparams = argparse.Namespace(learning_rate=0.01,
max_iter=100,
l2_regularization=0.01
)
base_model = HistGradientBoostingClassifier(learning_rate=hparams.learning_rate,
l2_regularization=hparams.l2_regularization,
max_iter=hparams.max_iter)
wrapped_model = skr.SKModel(base_model, hparams, name=model_name)
Model training with built-in model tracking and evaluation
mlflow_conf = dict(experiment_name='classifier_ihm')
artifacts = dict(preprocessor=preprocessor)
experiment_tags = dict(model=model_name,
tune=False)
(run_id, metrics,
val_y, val_yhat, val_pred_proba,
test_y, test_yhat, test_pred_proba) = skr.run_training_with_mlflow(mlflow_conf,
wrapped_model,
train_dataloader=train_dataloader,
val_dataloader=val_dataloader,
test_dataloader=test_dataloader,
artifacts=artifacts,
**experiment_tags)
print(f"MLFlow Experiment: {mlflow_conf['experiment_name']} \t | Run ID: {run_id}")
print(metrics)
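Because the trainer also returns the raw validation and test arrays, any score can be recomputed directly. A small sketch (assuming val_pred_proba holds class probabilities; we take the positive-class column when it is two-dimensional):
from sklearn.metrics import roc_auc_score

val_proba = np.asarray(val_pred_proba)
if val_proba.ndim > 1:  # (n_samples, n_classes) -> keep positive-class column
    val_proba = val_proba[:, 1]
print('Validation AUC-ROC:', roc_auc_score(val_y, val_proba))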
Hyper-parameter Search
lightsaber also naturally supports hyper-parameter search to find the best model w.r.t. a pre-defined metric, using a trace similar to the above.
To conduct a grid search we follow two steps:
- we define a grid h_search over the model parameter space
- we pass an experiment tag tune set to True, along with the grid h_search, to the trainer code
model_name = 'HistGBT'
hparams = argparse.Namespace(learning_rate=0.01,
max_iter=100,
l2_regularization=0.01
)
h_search = dict(
learning_rate=[0.01, 0.1, 0.02],
max_iter=[50, 100]
)
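# The grid above spans 3 learning rates x 2 values of max_iter = 6 candidate configurations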
base_model = HistGradientBoostingClassifier(**vars(hparams))
wrapped_model = skr.SKModel(base_model, hparams, name=model_name)
mlflow_conf = dict(experiment_name='classifier_ihm')
artifacts = dict(preprocessor=preprocessor)
experiment_tags = dict(model=model_name,
tune=True)
(run_id, metrics,
val_y, val_yhat, val_pred_proba,
test_y, test_yhat, test_pred_proba) = skr.run_training_with_mlflow(mlflow_conf,
wrapped_model,
train_dataloader=train_dataloader,
val_dataloader=val_dataloader,
test_dataloader=test_dataloader,
artifacts=artifacts,
h_search=h_search,
**experiment_tags)
print(f"MLFlow Experiment: {mlflow_conf['experiment_name']} \t | Run ID: {run_id}")
print(metrics)
IHM Model Registration
This block shows how to register a model for subsequent steps. Given a run_id, this block can be run independently of the other steps.
Internally, the following steps happen:
- the saved model (along with its hyper-parameters and weights) is retrieved using run_id
- the model is initialized using the weights
- the model is logged to MLflow under the registered model name
print(f"Registering model for run: {run_id}")
# Reading from yaml to log other artifacts
data_dir = Path(os.environ.get('LS_DATA_PATH', './data'))
assert data_dir.is_dir()
conf_path = os.environ.get('LS_CONF_PATH', os.path.abspath('./ihm_expt_config.yml'))
expt_conf = du.yaml.load(open(conf_path).read().format(DATA_DIR=data_dir),
Loader=du._Loader)
mlflow_conf = dict(experiment_name='classifier_ihm')
registered_model_name = 'classifier_ihm_HistGBT_v0'
print("model ready to be registered")
# Register model
skr.register_model_with_mlflow(run_id, mlflow_conf,
registered_model_name=registered_model_name,
test_feat_file=expt_conf['test']['feat_file'],
test_tgt_file=expt_conf['test']['tgt_file'],
config=os.path.abspath('./ihm_expt_config.yml')
)
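Once registered, the model can also be retrieved outside lightsaber through MLflow's model registry. A sketch assuming a standard MLflow setup and version 1 of the registered model (both assumptions; adapt to your registry):
import mlflow.pyfunc

# Hypothetical: load version 1 of the registered model via the models:/ URI scheme
registry_model = mlflow.pyfunc.load_model(f"models:/{registered_model_name}/1")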
IHM Model Inference
lightsaber also natively supports conducting inference on new patients using a registered model. The key steps involve:
- loading the registered model from MLflow
- ingesting the new test data using SKDataLoader in inference mode (setting tgt_file to None)
- using the SKModel.predict_patient method to generate a prediction for the patient of interest
print(f"Inference using model for run: {run_id}")
# Reading from yaml to log other artifacts
data_dir = Path(os.environ.get('LS_DATA_PATH', './data'))
assert data_dir.is_dir()
conf_path = os.environ.get('LS_CONF_PATH', os.path.abspath('./ihm_expt_config.yml'))
expt_conf = du.yaml.load(open(conf_path).read().format(DATA_DIR=data_dir),
Loader=du._Loader)
mlflow_conf = dict(experiment_name='classifier_ihm')
registered_model_name = 'classifier_ihm_HistGBT_v0'
wrapped_model = skr.load_model_from_mlflow(run_id, mlflow_conf)
print("model loaded and ready for inference")
inference_dataloader = skd.SKDataLoader(tgt_file=None,
feat_file=expt_conf['test']['feat_file'],
idx_col=expt_conf['idx_cols'],
tgt_col=expt_conf['tgt_col'],
feat_columns=expt_conf['feat_cols'],
time_order_col=expt_conf['time_order_col'],
category_map=expt_conf['category_map'],
filter=fitted_filter,
fill_value=expt_conf['normal_values'],
flatten=flatten,
)
print(inference_dataloader.shape, len(inference_dataloader))
patient_id = inference_dataloader.sample_idx.index[0]
print(f"Inference for patient: {patient_id}")
# patient_id = '10011_episode1_timeseries.csv'
wrapped_model.predict_patient(patient_id, inference_dataloader)
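predict_patient scores one patient at a time; to score every stay in the dataloader, we can iterate over the same index used above (a small sketch reusing only attributes already shown):
predictions = {}
for pid in inference_dataloader.sample_idx.index:
    predictions[pid] = wrapped_model.predict_patient(pid, inference_dataloader)
print(f"Generated predictions for {len(predictions)} patients")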