IHM Example Using LSTM
This notebook shows an example of using an LSTM to model in-hospital mortality (IHM) from the MIMIC-III dataset.
The data is presumed to have already been extracted from the cohort and defined via a YAML configuration as shown below:
# USER DEFINED
tgt_col: y_true
idx_cols: stay
time_order_col:
  - Hours
  - seqnum
feat_cols: null
train:
  tgt_file: '{DATA_DIR}/IHM_V0_COHORT_OUT_EXP-SPLIT0-train.csv'
  feat_file: '{DATA_DIR}/IHM_V0_FEAT_EXP-SPLIT0-train.csv'
val:
  tgt_file: '{DATA_DIR}/IHM_V0_COHORT_OUT_EXP-SPLIT0-val.csv'
  feat_file: '{DATA_DIR}/IHM_V0_FEAT_EXP-SPLIT0-val.csv'
test:
  tgt_file: '{DATA_DIR}/IHM_V0_COHORT_OUT_EXP-SPLIT0-test.csv'
  feat_file: '{DATA_DIR}/IHM_V0_FEAT_EXP-SPLIT0-test.csv'
# DATA DEFINITIONS
## Definitions of categorical data in the dataset
category_map:
  Capillary refill rate: ['0.0', '1.0']
  Glascow coma scale eye opening: ['To Pain', '3 To speech', '1 No Response', '4 Spontaneously',
                                   'To Speech', 'Spontaneously', '2 To pain', 'None']
  Glascow coma scale motor response: ['1 No Response', '3 Abnorm flexion', 'Abnormal extension', 'No response',
                                      '4 Flex-withdraws', 'Localizes Pain', 'Flex-withdraws', 'Obeys Commands',
                                      'Abnormal Flexion', '6 Obeys Commands', '5 Localizes Pain', '2 Abnorm extensn']
  Glascow coma scale total: ['11', '10', '13', '12', '15', '14', '3', '5', '4', '7', '6', '9', '8']
  Glascow coma scale verbal response: ['1 No Response', 'No Response', 'Confused', 'Inappropriate Words', 'Oriented',
                                       'No Response-ETT', '5 Oriented', 'Incomprehensible sounds', '1.0 ET/Trach',
                                       '4 Confused', '2 Incomp sounds', '3 Inapprop words']
numerical: ['Heart Rate', 'Fraction inspired oxygen', 'Weight', 'Respiratory rate',
            'pH', 'Diastolic blood pressure', 'Glucose', 'Systolic blood pressure',
            'Height', 'Oxygen saturation', 'Temperature', 'Mean blood pressure']
## Definitions of normal values in the dataset
normal_values:
  Capillary refill rate: 0.0
  Diastolic blood pressure: 59.0
  Fraction inspired oxygen: 0.21
  Glucose: 128.0
  Heart Rate: 86
  Height: 170.0
  Mean blood pressure: 77.0
  Oxygen saturation: 98.0
  Respiratory rate: 19
  Systolic blood pressure: 118.0
  Temperature: 36.6
  Weight: 81.0
  pH: 7.4
  Glascow coma scale eye opening: '4 Spontaneously'
  Glascow coma scale motor response: '6 Obeys Commands'
  Glascow coma scale total: '15'
  Glascow coma scale verbal response: '5 Oriented'
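The {DATA_DIR} placeholders above are filled in via str.format when the configuration is read; the Preamble below does this with lightsaber's du.yaml loader. A minimal standalone sketch of the same substitution, assuming plain PyYAML and the default config path used in this notebook:
import os
import yaml
# Read the raw config text and substitute the {DATA_DIR} placeholder before parsing
conf_text = open('./ihm_expt_config.yml').read()
conf = yaml.safe_load(conf_text.format(DATA_DIR=os.environ.get('LS_DATA_PATH', './data')))
print(conf['train']['tgt_file'])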
Preamble
The following code cell imports the required libraries and sets up the notebook.
# Jupyter notebook specific imports
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# Imports injecting into namespace
from tqdm.auto import tqdm
tqdm.pandas()
# General imports
import os
import json
import pickle
from pathlib import Path
import pandas as pd
import numpy as np
from getpass import getpass
import argparse
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import NotFittedError
import torch as T
from torch import nn
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from lightsaber import constants as C
import lightsaber.data_utils.utils as du
from lightsaber.data_utils import pt_dataset as ptd
from lightsaber.trainers import pt_trainer as ptr
from lightsaber.model_lib.pt_sota_models import rnn
import logging
logging.basicConfig(level=logging.INFO)
log = logging.getLogger()
data_dir = Path(os.environ.get('LS_DATA_PATH', './data'))
assert data_dir.is_dir()
conf_path = os.environ.get('LS_CONF_PATH', os.path.abspath('./ihm_expt_config.yml'))
expt_conf = du.yaml.load(open(conf_path).read().format(DATA_DIR=data_dir),
Loader=du._Loader)
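Optionally, you can verify that every file referenced in the configuration exists before building the datasets; a short sketch using the config loaded above and the Path class already imported:
# Sanity check (sketch): all target/feature files referenced in the config should exist
for split in ('train', 'val', 'test'):
    for key in ('tgt_file', 'feat_file'):
        _f = Path(expt_conf[split][key])
        assert _f.is_file(), f"Missing {key} for split '{split}': {_f}"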
IHM Model Training
In general, users need to follow these steps to train an RNN for the IHM model:
- Data Ingestion: The first step involves setting up the pre-processors to train an IHM model. In this example, we will use StandardScaler from scikit-learn, applied via filters defined within lightsaber.
- We next read the train, test, and validation datasets. In some cases, users may also want to define a calibration dataset.
- Model Definition: We next need to define a base model for classification. In this example, we will use a pre-packaged LSTM model from lightsaber.
- Model Training: Once the models are defined, we can use lightsaber to train the model via the pre-packaged PyModel and the corresponding trainer code. This step will also generate the relevant metrics for this problem.
Data Ingestion
We first start by reading the extracted cohort data, using a StandardScaler to demonstrate the proper usage of a pre-processor.
preprocessor = StandardScaler()
train_filter = [ptd.filter_preprocessor(cols=expt_conf['numerical'],
preprocessor=preprocessor,
refit=True),
ptd.filter_fillna(fill_value=expt_conf['normal_values'],
time_order_col=expt_conf['time_order_col'])
]
transform = ptd.transform_drop_cols(cols_to_drop=expt_conf['time_order_col'])
train_dataset = ptd.BaseDataset(tgt_file=expt_conf['train']['tgt_file'],
feat_file=expt_conf['train']['feat_file'],
idx_col=expt_conf['idx_cols'],
tgt_col=expt_conf['tgt_col'],
feat_columns=expt_conf['feat_cols'],
time_order_col=expt_conf['time_order_col'],
category_map=expt_conf['category_map'],
transform=transform,
filter=train_filter,
)
# print(train_dataset.data.head())
print(train_dataset.shape, len(train_dataset))
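As an optional check, you can verify that the refit=True filter fitted the scaler in place while the training dataset was built; a sketch (whether filters fit eagerly may depend on the lightsaber version):
from sklearn.utils.validation import check_is_fitted
try:
    check_is_fitted(preprocessor)
    print("Scaler fitted; first feature means:", preprocessor.mean_[:5])
except NotFittedError:
    print("Scaler not fitted yet; check the train_filter configuration")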
# For other datasets use fitted preprocessors
fitted_filter = [ptd.filter_preprocessor(cols=expt_conf['numerical'],
preprocessor=preprocessor, refit=False),
ptd.filter_fillna(fill_value=expt_conf['normal_values'],
time_order_col=expt_conf['time_order_col'])
]
val_dataset = ptd.BaseDataset(tgt_file=expt_conf['val']['tgt_file'],
feat_file=expt_conf['val']['feat_file'],
idx_col=expt_conf['idx_cols'],
tgt_col=expt_conf['tgt_col'],
feat_columns=expt_conf['feat_cols'],
time_order_col=expt_conf['time_order_col'],
category_map=expt_conf['category_map'],
transform=transform,
filter=fitted_filter,
)
test_dataset = ptd.BaseDataset(tgt_file=expt_conf['test']['tgt_file'],
feat_file=expt_conf['test']['feat_file'],
idx_col=expt_conf['idx_cols'],
tgt_col=expt_conf['tgt_col'],
feat_columns=expt_conf['feat_cols'],
time_order_col=expt_conf['time_order_col'],
category_map=expt_conf['category_map'],
transform=transform,
filter=fitted_filter,
)
print(val_dataset.shape, len(val_dataset))
print(test_dataset.shape, len(test_dataset))
# Handling class imbalance with inverse-frequency weights
input_dim, target_dim = train_dataset.shape
output_dim = 2
weight_labels = train_dataset.target.iloc[:, 0].value_counts()
weight_labels = weight_labels.max() / (weight_labels + 1e-7)
weight_labels.sort_index(inplace=True)
weights = T.FloatTensor(weight_labels.values).to(train_dataset.device)
print(weights)
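To make the weighting concrete, here is a small worked example of the same inverse-frequency formula on hypothetical class counts (900 negatives vs. 100 positives), which yields weights of roughly [1.0, 9.0]:
# Hypothetical counts for illustration only
example_counts = pd.Series({0: 900, 1: 100})
example_weights = example_counts.max() / (example_counts + 1e-7)
print(example_weights.sort_index().values)  # -> approximately [1. 9.]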
Single Run
# For most models you need to change only this part
hparams = argparse.Namespace(lr=0.01,
batch_size=32,
hidden_dim=32,
rnn_class='LSTM',
n_layers=2,
dropout=0.1,
recurrent_dropout=0.1,
bidirectional=False,
)
hparams.rnn_class = C.PYTORCH_CLASS_DICT[hparams.rnn_class]
base_model = rnn.RNNClassifier(input_dim, output_dim,
hidden_dim=hparams.hidden_dim,
rnn_class=hparams.rnn_class,
n_layers=hparams.n_layers,
dropout=hparams.dropout,
recurrent_dropout=hparams.recurrent_dropout,
bidirectional=hparams.bidirectional
)
criterion = nn.CrossEntropyLoss(weight=weights)
# optimizer = T.optim.Adam(base_model.parameters(),
#                          lr=hparams.lr,
#                          weight_decay=1e-5)  # standard value
# scheduler = T.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min')
# Creating the wrapped model
wrapped_model = ptr.PyModel(hparams, base_model,
train_dataset=train_dataset,
val_dataset=val_dataset, # None
test_dataset=None, #test_dataset, # test_dataset
#optimizer=optimizer,
loss_func=criterion,
#scheduler=scheduler,
collate_fn=ptd.collate_fn
)
# Training
overfit_batches, fast_dev_run, terminate_on_nan, auto_lr_find, limit_batch = 0, False, False, False, 1.0
default_root_dir = os.path.join('./out/', 'classifier_ihm')
checkpoint_callback = ModelCheckpoint(dirpath=default_root_dir)
callbacks = [checkpoint_callback]
train_args = argparse.Namespace(gpus=1,
max_epochs=50,
callbacks=callbacks,
default_root_dir=default_root_dir,
terminate_on_nan=terminate_on_nan,
auto_lr_find=auto_lr_find,
overfit_batches=overfit_batches,
fast_dev_run=fast_dev_run,  # set to True when debugging
limit_train_batches=limit_batch,
limit_val_batches=limit_batch,
limit_predict_batches=limit_batch,
)
mlflow_conf = dict(experiment_name='classifier_ihm')
artifacts = dict(preprocessor=preprocessor,
weight_labels=weight_labels,
)
experiment_tags = dict(model='RNNClassifier',
input_dim=input_dim,
output_dim=output_dim
)
(run_id, metrics,
y_val, y_val_hat, y_val_proba,
y_test, y_test_hat, y_test_proba) = ptr.run_training_with_mlflow(mlflow_conf,
train_args,
wrapped_model,
artifacts=artifacts,
**experiment_tags)
print(f"MLFlow Experiment: {mlflow_conf['experiment_name']} \t | Run ID: {run_id}")
print(metrics)
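The returned arrays can also be used to compute additional validation metrics directly, for example AUC-ROC with scikit-learn; a sketch, assuming y_val_proba holds per-class probabilities (the positive-class column is selected when the array is 2-dimensional):
from sklearn.metrics import roc_auc_score
val_proba = np.asarray(y_val_proba)
if val_proba.ndim > 1:  # (n_samples, n_classes) -> keep the positive-class column
    val_proba = val_proba[:, 1]
print("Validation AUC-ROC:", roc_auc_score(np.asarray(y_val).ravel(), val_proba))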
IHM Model Registration
This block shows how to register a model for subsequent steps. Given a run_id, this block can be run independently of the other steps.
Internally, the following steps happen:
- a saved model (along with its hyper-parameters and weights) is retrieved using the run_id
- the model is initialized using those weights
- the model is logged to mlflow under the registered model name
print(f"Registering model for run: {run_id}")
# Reading things from mlflow
# Model coders can create functions to repeat this - part of model init
import torch
from lightsaber.trainers import helper
data_dir = Path(os.environ.get('LS_DATA_PATH', './data'))
assert data_dir.is_dir()
conf_path = os.environ.get('LS_CONF_PATH', os.path.abspath('./ihm_expt_config.yml'))
expt_conf = du.yaml.load(open(conf_path).read().format(DATA_DIR=data_dir),
Loader=du._Loader)
mlflow_conf = dict(experiment_name='classifier_ihm')
registered_model_name = 'classifier_ihm_rnn_v0'
## Loading model attributes from mlflow
mlflow_setup = helper.setup_mlflow(**mlflow_conf)
run_data = helper.fetch_mlflow_run(run_id,
mlflow_uri=mlflow_setup['mlflow_uri'],
artifacts_prefix=['artifact/weight_labels'],
parse_params=True
)
hparams = run_data['params']
hparams = argparse.Namespace(**hparams)
# The logged rnn_class parameter is a string such as "<class 'torch.nn.LSTM'>"; recover the actual class from it
hparams.rnn_class = helper.import_model_class(hparams.rnn_class.split("'")[1::2][0])
weight_labels = pickle.load(open(helper.get_artifact_path(run_data['artifact_paths'][0],
artifact_uri=run_data['info'].artifact_uri), 'rb'))
weights = T.FloatTensor(weight_labels.values)
## Setting model weights
base_model = rnn.RNNClassifier(input_dim=input_dim,
output_dim=output_dim,
hidden_dim=hparams.hidden_dim,
rnn_class=hparams.rnn_class,
n_layers=hparams.n_layers,
dropout=hparams.dropout,
recurrent_dropout=hparams.recurrent_dropout,
bidirectional=hparams.bidirectional
)
criterion = nn.CrossEntropyLoss(weight=weights)
wrapped_model = ptr.PyModel(hparams, base_model,
train_dataset=None,
val_dataset=None, # None
test_dataset=None, # test_dataset
#optimizer=optimizer,
loss_func=criterion,
#scheduler=scheduler,
collate_fn=ptd.collate_fn
)
# Re-create the base model using the input/output dimensions logged as experiment tags
base_model = rnn.RNNClassifier(input_dim=int(run_data['tags']['input_dim']),
output_dim=int(run_data['tags']['output_dim']),
hidden_dim=hparams.hidden_dim,
rnn_class=hparams.rnn_class,
n_layers=hparams.n_layers,
dropout=hparams.dropout,
recurrent_dropout=hparams.recurrent_dropout,
bidirectional=hparams.bidirectional
)
criterion = nn.CrossEntropyLoss(weight=weights)
# Creating the wrapped model
wrapped_model = ptr.PyModel(hparams, base_model,
train_dataset=None,
val_dataset=None, # None
test_dataset=None, # test_dataset
cal_dataset=None,
loss_func=criterion,
collate_fn=ptd.collate_fn
)
print('model ready for logging')
# Register model
ptr.register_model_with_mlflow(run_id, mlflow_conf, wrapped_model,
registered_model_name=registered_model_name,
test_feat_file=expt_conf['test']['feat_file'],
test_tgt_file=expt_conf['test']['tgt_file'],
config=os.path.abspath('./ihm_expt_config.yml'),
model_path='model_checkpoint'
)
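Once registration completes, the registered versions can be checked against the MLflow model registry, for example with MlflowClient (a sketch using the standard mlflow tracking API):
from mlflow.tracking import MlflowClient
client = MlflowClient()
# List every registered version of this model together with its stage and source run
for mv in client.search_model_versions(f"name='{registered_model_name}'"):
    print(f"version={mv.version} stage={mv.current_stage} run_id={mv.run_id}")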
IHM Model Inference
Lightsaber also natively supports conducting inference on new patients using the registered model. The key steps involve:
- loading the registered model from mlflow
- ingesting the new test data using BaseDataset in inference mode (setting tgt_file to None)
- using the PyModel.predict_patient method to generate inference for the patient of interest
Note that, for the first step, users may need to perform additional setup as shown below.
print(f"Inference using model for run: {run_id}")
# Reading things from mlflow
# Model coders can create functions to repeat this - part of model init
import torch
from lightsaber.trainers import helper
data_dir = Path(os.environ.get('LS_DATA_PATH', './data'))
assert data_dir.is_dir()
conf_path = os.environ.get('LS_CONF_PATH', os.path.abspath('./ihm_expt_config.yml'))
expt_conf = du.yaml.load(open(conf_path).read().format(DATA_DIR=data_dir),
Loader=du._Loader)
mlflow_conf = dict(experiment_name='classifier_ihm')
registered_model_name = 'classifier_ihm_rnn_v0'
## Loading model attributes from mlflow
mlflow_setup = helper.setup_mlflow(**mlflow_conf)
run_data = helper.fetch_mlflow_run(run_id,
mlflow_uri=mlflow_setup['mlflow_uri'],
artifacts_prefix=['artifact/weight_labels'],
parse_params=True
)
hparams = run_data['params']
hparams = argparse.Namespace(**hparams)
# The logged rnn_class parameter is a string such as "<class 'torch.nn.LSTM'>"; recover the actual class from it
hparams.rnn_class = helper.import_model_class(hparams.rnn_class.split("'")[1::2][0])
weight_labels = pickle.load(open(helper.get_artifact_path(run_data['artifact_paths'][0],
artifact_uri=run_data['info'].artifact_uri), 'rb'))
weights = T.FloatTensor(weight_labels.values)
## Setting model weights
base_model = rnn.RNNClassifier(input_dim=input_dim,
output_dim=output_dim,
hidden_dim=hparams.hidden_dim,
rnn_class=hparams.rnn_class,
n_layers=hparams.n_layers,
dropout=hparams.dropout,
recurrent_dropout=hparams.recurrent_dropout,
bidirectional=hparams.bidirectional
)
criterion = nn.CrossEntropyLoss(weight=weights)
wrapped_model = ptr.PyModel(hparams, base_model,
train_dataset=None,
val_dataset=None, # None
test_dataset=None, # test_dataset
#optimizer=optimizer,
loss_func=criterion,
#scheduler=scheduler,
collate_fn=ptd.collate_fn
)
# Loading saved model from mlflow
wrapped_model = ptr.load_model_from_mlflow(run_id, mlflow_conf, wrapped_model)
inference_dataloader = ptd.BaseDataset(tgt_file=None,
feat_file=expt_conf['test']['feat_file'],
idx_col=expt_conf['idx_cols'],
tgt_col=expt_conf['tgt_col'],
feat_columns=expt_conf['feat_cols'],
time_order_col=expt_conf['time_order_col'],
category_map=expt_conf['category_map'],
transform=transform,
filter=fitted_filter,
)
print(inference_dataloader.shape, len(inference_dataloader))
patient_id = inference_dataloader.sample_idx.index[0]
print(f"Inference for patient: {patient_id}")
# patient_id = '10011_episode1_timeseries.csv'
wrapped_model.predict_patient(patient_id, inference_dataloader)
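The same call can be looped over several patients; a minimal sketch that collects and prints the raw outputs (the exact return type of predict_patient depends on the lightsaber version):
# Collect predictions for the first few patients in the inference dataset
predictions = {}
for pid in inference_dataloader.sample_idx.index[:5]:
    predictions[pid] = wrapped_model.predict_patient(pid, inference_dataloader)
for pid, pred in predictions.items():
    print(pid, pred)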