IHM Example Using LSTM
This notebook shows an example of using an LSTM to model in-hospital mortality (IHM) from the MIMIC-III dataset.
The data is presumed to have already been extracted from the cohort and defined via a YAML configuration as shown below:
# USER DEFINED
tgt_col: y_true
idx_cols: stay
time_order_col:
  - Hours
  - seqnum
feat_cols: null
train:
  tgt_file: '{DATA_DIR}/IHM_V0_COHORT_OUT_EXP-SPLIT0-train.csv'
  feat_file: '{DATA_DIR}/IHM_V0_FEAT_EXP-SPLIT0-train.csv'
val:
  tgt_file: '{DATA_DIR}/IHM_V0_COHORT_OUT_EXP-SPLIT0-val.csv'
  feat_file: '{DATA_DIR}/IHM_V0_FEAT_EXP-SPLIT0-val.csv'
test:
  tgt_file: '{DATA_DIR}/IHM_V0_COHORT_OUT_EXP-SPLIT0-test.csv'
  feat_file: '{DATA_DIR}/IHM_V0_FEAT_EXP-SPLIT0-test.csv'
# DATA DEFINITIONS
## Definitions of categorical data in the dataset
category_map:
  Capillary refill rate: ['0.0', '1.0']
  Glascow coma scale eye opening: ['To Pain', '3 To speech', '1 No Response', '4 Spontaneously',
                                   'To Speech', 'Spontaneously', '2 To pain', 'None']
  Glascow coma scale motor response: ['1 No Response', '3 Abnorm flexion', 'Abnormal extension', 'No response',
                                      '4 Flex-withdraws', 'Localizes Pain', 'Flex-withdraws', 'Obeys Commands',
                                      'Abnormal Flexion', '6 Obeys Commands', '5 Localizes Pain', '2 Abnorm extensn']
  Glascow coma scale total: ['11', '10', '13', '12', '15', '14', '3', '5', '4', '7', '6', '9', '8']
  Glascow coma scale verbal response: ['1 No Response', 'No Response', 'Confused', 'Inappropriate Words', 'Oriented',
                                       'No Response-ETT', '5 Oriented', 'Incomprehensible sounds', '1.0 ET/Trach',
                                       '4 Confused', '2 Incomp sounds', '3 Inapprop words']
numerical: ['Heart Rate', 'Fraction inspired oxygen', 'Weight', 'Respiratory rate',
            'pH', 'Diastolic blood pressure', 'Glucose', 'Systolic blood pressure',
            'Height', 'Oxygen saturation', 'Temperature', 'Mean blood pressure']
## Definitions of normal values in the dataset
normal_values:
  Capillary refill rate: 0.0
  Diastolic blood pressure: 59.0
  Fraction inspired oxygen: 0.21
  Glucose: 128.0
  Heart Rate: 86
  Height: 170.0
  Mean blood pressure: 77.0
  Oxygen saturation: 98.0
  Respiratory rate: 19
  Systolic blood pressure: 118.0
  Temperature: 36.6
  Weight: 81.0
  pH: 7.4
  Glascow coma scale eye opening: '4 Spontaneously'
  Glascow coma scale motor response: '6 Obeys Commands'
  Glascow coma scale total: '15'
  Glascow coma scale verbal response: '5 Oriented'
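The {DATA_DIR} placeholders above are filled in via str.format when the configuration is read; the Preamble below does this with lightsaber's du.yaml loader. A minimal standalone sketch of the same substitution, assuming plain PyYAML and the default config path used in this notebook:
import os
import yaml
# Read the raw config text and substitute the {DATA_DIR} placeholder before parsing
conf_text = open('./ihm_expt_config.yml').read()
conf = yaml.safe_load(conf_text.format(DATA_DIR=os.environ.get('LS_DATA_PATH', './data')))
print(conf['train']['tgt_file'])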
Preamble
The following code cell imports the required libraries and sets up the notebook.
# Jupyter notebook specific imports
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# Imports injecting into namespace
from tqdm.auto import tqdm
tqdm.pandas()
# General imports
import os
import json
import pickle
from pathlib import Path
import pandas as pd
import numpy as np
from getpass import getpass
import argparse
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import NotFittedError
import torch as T
from torch import nn
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from lightsaber import constants as C
import lightsaber.data_utils.utils as du
from lightsaber.data_utils import pt_dataset as ptd
from lightsaber.trainers import pt_trainer as ptr
from lightsaber.model_lib.pt_sota_models import rnn
import logging
logging.basicConfig(level=logging.INFO)
log = logging.getLogger()
data_dir = Path(os.environ.get('LS_DATA_PATH', './data'))
assert data_dir.is_dir()
conf_path = os.environ.get('LS_CONF_PATH', os.path.abspath('./ihm_expt_config.yml'))
expt_conf = du.yaml.load(open(conf_path).read().format(DATA_DIR=data_dir),
Loader=du._Loader)
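Optionally, you can verify that every file referenced in the configuration exists before building the datasets; a short sketch using the config loaded above and the Path class already imported:
# Sanity check (sketch): all target/feature files referenced in the config should exist
for split in ('train', 'val', 'test'):
    for key in ('tgt_file', 'feat_file'):
        _f = Path(expt_conf[split][key])
        assert _f.is_file(), f"Missing {key} for split '{split}': {_f}"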
IHM Model Training
In general, users need to follow these steps to train an RNN for the IHM model:
- Data Ingestion: The first step involves setting up the pre-processors to train an IHM model. In this example, we will use StandardScaler from scikit-learn, applied via filters defined within lightsaber.
- We next read the train, test, and validation datasets. In some cases, users may also want to define a calibration dataset.
- Model Definition: We next need to define a base model for classification. In this example, we will use a pre-packaged LSTM model from lightsaber.
- Model Training: Once the models are defined, we can use lightsaber to train the model via the pre-packaged PyModel and the corresponding trainer code. This step will also generate the relevant metrics for this problem.
Data Ingestion
We first start by reading the extracted cohort data, using a StandardScaler to demonstrate the proper usage of a pre-processor.
preprocessor = StandardScaler()
train_filter = [ptd.filter_preprocessor(cols=expt_conf['numerical'],
preprocessor=preprocessor,
refit=True),
ptd.filter_fillna(fill_value=expt_conf['normal_values'],
time_order_col=expt_conf['time_order_col'])
]
transform = ptd.transform_drop_cols(cols_to_drop=expt_conf['time_order_col'])
train_dataset = ptd.BaseDataset(tgt_file=expt_conf['train']['tgt_file'],
feat_file=expt_conf['train']['feat_file'],
idx_col=expt_conf['idx_cols'],
tgt_col=expt_conf['tgt_col'],
feat_columns=expt_conf['feat_cols'],
time_order_col=expt_conf['time_order_col'],
category_map=expt_conf['category_map'],
transform=transform,
filter=train_filter,
)
# print(train_dataset.data.head())
print(train_dataset.shape, len(train_dataset))
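As an optional check, you can verify that the refit=True filter fitted the scaler in place while the training dataset was built; a sketch (whether filters fit eagerly may depend on the lightsaber version):
from sklearn.utils.validation import check_is_fitted
try:
    check_is_fitted(preprocessor)
    print("Scaler fitted; first feature means:", preprocessor.mean_[:5])
except NotFittedError:
    print("Scaler not fitted yet; check the train_filter configuration")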
# For other datasets use fitted preprocessors
fitted_filter = [ptd.filter_preprocessor(cols=expt_conf['numerical'],
preprocessor=preprocessor, refit=False),
ptd.filter_fillna(fill_value=expt_conf['normal_values'],
time_order_col=expt_conf['time_order_col'])
]
val_dataset = ptd.BaseDataset(tgt_file=expt_conf['val']['tgt_file'],
feat_file=expt_conf['val']['feat_file'],
idx_col=expt_conf['idx_cols'],
tgt_col=expt_conf['tgt_col'],
feat_columns=expt_conf['feat_cols'],
time_order_col=expt_conf['time_order_col'],
category_map=expt_conf['category_map'],
transform=transform,
filter=fitted_filter,
)
test_dataset = ptd.BaseDataset(tgt_file=expt_conf['test']['tgt_file'],
feat_file=expt_conf['test']['feat_file'],
idx_col=expt_conf['idx_cols'],
tgt_col=expt_conf['tgt_col'],
feat_columns=expt_conf['feat_cols'],
time_order_col=expt_conf['time_order_col'],
category_map=expt_conf['category_map'],
transform=transform,
filter=fitted_filter,
)
print(val_dataset.shape, len(val_dataset))
print(test_dataset.shape, len(test_dataset))
# Handling class imbalance with inverse-frequency weights
input_dim, target_dim = train_dataset.shape
output_dim = 2
weight_labels = train_dataset.target.iloc[:, 0].value_counts()
weight_labels = weight_labels.max() / (weight_labels + 1e-7)
weight_labels.sort_index(inplace=True)
weights = T.FloatTensor(weight_labels.values).to(train_dataset.device)
print(weights)
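To make the weighting concrete, here is a small worked example of the same inverse-frequency formula on hypothetical class counts (900 negatives vs. 100 positives), which yields weights of roughly [1.0, 9.0]:
# Hypothetical counts for illustration only
example_counts = pd.Series({0: 900, 1: 100})
example_weights = example_counts.max() / (example_counts + 1e-7)
print(example_weights.sort_index().values)  # -> approximately [1. 9.]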
Single Run
# For most models you need to change only this part
hparams = argparse.Namespace(lr=0.01,
batch_size=32,
hidden_dim=32,
rnn_class='LSTM',
n_layers=2,
dropout=0.1,
recurrent_dropout=0.1,
bidirectional=False,
)
hparams.rnn_class = C.PYTORCH_CLASS_DICT[hparams.rnn_class]
base_model = rnn.RNNClassifier(input_dim, output_dim,
hidden_dim=hparams.hidden_dim,
rnn_class=hparams.rnn_class,
n_layers=hparams.n_layers,
dropout=hparams.dropout,
recurrent_dropout=hparams.recurrent_dropout,
bidirectional=hparams.bidirectional
)
criterion = nn.CrossEntropyLoss(weight=weights)
# optimizer = T.optim.Adam(base_model.parameters(),
#                          lr=hparams.lr,
#                          weight_decay=1e-5)  # standard value
# scheduler = T.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min')
# Creating the wrapped model
wrapped_model = ptr.PyModel(hparams, base_model,
train_dataset=train_dataset,
val_dataset=val_dataset, # None
test_dataset=None, #test_dataset, # test_dataset
#optimizer=optimizer,
loss_func=criterion,
#scheduler=scheduler,
collate_fn=ptd.collate_fn
)
# Training
overfit_batches, fast_dev_run, terminate_on_nan, auto_lr_find, limit_batch = 0, False, False, False, 1.0
default_root_dir = os.path.join('./out/', 'classifier_ihm')
checkpoint_callback = ModelCheckpoint(dirpath=default_root_dir)
callbacks = [checkpoint_callback]
train_args = argparse.Namespace(gpus=1,
max_epochs=50,
callbacks=callbacks,
default_root_dir=default_root_dir,
terminate_on_nan=terminate_on_nan,
auto_lr_find=auto_lr_find,
overfit_batches=overfit_batches,
fast_dev_run=fast_dev_run,  # set to True when debugging
limit_train_batches=limit_batch,
limit_val_batches=limit_batch,
limit_predict_batches=limit_batch,
)
mlflow_conf = dict(experiment_name='classifier_ihm')
artifacts = dict(preprocessor=preprocessor,
weight_labels=weight_labels,
)
experiment_tags = dict(model='RNNClassifier',
input_dim=input_dim,
output_dim=output_dim
)
(run_id, metrics,
y_val, y_val_hat, y_val_proba,
y_test, y_test_hat, y_test_proba) = ptr.run_training_with_mlflow(mlflow_conf,
train_args,
wrapped_model,
artifacts=artifacts,
**experiment_tags)
print(f"MLFlow Experiment: {mlflow_conf['experiment_name']} \t | Run ID: {run_id}")
print(metrics)
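The returned arrays can also be used to compute additional validation metrics directly, for example AUC-ROC with scikit-learn; a sketch, assuming y_val_proba holds per-class probabilities (the positive-class column is selected when the array is 2-dimensional):
from sklearn.metrics import roc_auc_score
val_proba = np.asarray(y_val_proba)
if val_proba.ndim > 1:  # (n_samples, n_classes) -> keep the positive-class column
    val_proba = val_proba[:, 1]
print("Validation AUC-ROC:", roc_auc_score(np.asarray(y_val).ravel(), val_proba))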
IHM Model Registration
This block shows how to register a model for subsequent steps. Given a run_id, this block can be run independently of the other steps.
Internally, the following steps happen:
- a saved model (along with its hyper-parameters and weights) is retrieved using the run_id
- the model is initialized using those weights
- the model is logged to mlflow under the registered model name
print(f"Registering model for run: {run_id}")
# Reading things from mlflow
# Model coders can create functions to repeat this - part of model init
import torch
from lightsaber.trainers import helper
data_dir = Path(os.environ.get('LS_DATA_PATH', './data'))
assert data_dir.is_dir()
conf_path = os.environ.get('LS_CONF_PATH', os.path.abspath('./ihm_expt_config.yml'))
expt_conf = du.yaml.load(open(conf_path).read().format(DATA_DIR=data_dir),
Loader=du._Loader)
mlflow_conf = dict(experiment_name='classifier_ihm')
registered_model_name = 'classifier_ihm_rnn_v0'
## Loading model attributes from mlflow
mlflow_setup = helper.setup_mlflow(**mlflow_conf)
run_data = helper.fetch_mlflow_run(run_id,
mlflow_uri=mlflow_setup['mlflow_uri'],
artifacts_prefix=['artifact/weight_labels'],
parse_params=True
)
hparams = run_data['params']
hparams = argparse.Namespace(**hparams)
# The logged rnn_class parameter is a string such as "<class 'torch.nn.LSTM'>"; recover the actual class from it
hparams.rnn_class = helper.import_model_class(hparams.rnn_class.split("'")[1::2][0])
weight_labels = pickle.load(open(helper.get_artifact_path(run_data['artifact_paths'][0],
artifact_uri=run_data['info'].artifact_uri), 'rb'))
weights = T.FloatTensor(weight_labels.values)
## Setting model weights
base_model = rnn.RNNClassifier(input_dim=input_dim,
output_dim=output_dim,
hidden_dim=hparams.hidden_dim,
rnn_class=hparams.rnn_class,
n_layers=hparams.n_layers,
dropout=hparams.dropout,
recurrent_dropout=hparams.recurrent_dropout,
bidirectional=hparams.bidirectional
)
criterion = nn.CrossEntropyLoss(weight=weights)
wrapped_model = ptr.PyModel(hparams, base_model,
train_dataset=None,
val_dataset=None, # None
test_dataset=None, # test_dataset
#optimizer=optimizer,
loss_func=criterion,
#scheduler=scheduler,
collate_fn=ptd.collate_fn
)
# Re-create the base model using the input/output dimensions logged as experiment tags
base_model = rnn.RNNClassifier(input_dim=int(run_data['tags']['input_dim']),
output_dim=int(run_data['tags']['output_dim']),
hidden_dim=hparams.hidden_dim,
rnn_class=hparams.rnn_class,
n_layers=hparams.n_layers,
dropout=hparams.dropout,
recurrent_dropout=hparams.recurrent_dropout,
bidirectional=hparams.bidirectional
)
criterion = nn.CrossEntropyLoss(weight=weights)
# Creating the wrapped model
wrapped_model = ptr.PyModel(hparams, base_model,
train_dataset=None,
val_dataset=None, # None
test_dataset=None, # test_dataset
cal_dataset=None,
loss_func=criterion,
collate_fn=ptd.collate_fn
)
print('model ready for logging')
# Register model
ptr.register_model_with_mlflow(run_id, mlflow_conf, wrapped_model,
registered_model_name=registered_model_name,
test_feat_file=expt_conf['test']['feat_file'],
test_tgt_file=expt_conf['test']['tgt_file'],
config=os.path.abspath('./ihm_expt_config.yml'),
model_path='model_checkpoint'
)
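Once registration completes, the registered versions can be checked against the MLflow model registry, for example with MlflowClient (a sketch using the standard mlflow tracking API):
from mlflow.tracking import MlflowClient
client = MlflowClient()
# List every registered version of this model together with its stage and source run
for mv in client.search_model_versions(f"name='{registered_model_name}'"):
    print(f"version={mv.version} stage={mv.current_stage} run_id={mv.run_id}")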
IHM Model Inference
Lightsaber also natively supports conducting inference on new patients using the registered model. The key steps involve:
- loading the registered model from mlflow
- ingesting the new test data using BaseDataset in inference mode (setting tgt_file to None)
- using the PyModel.predict_patient method to generate inference for the patient of interest
Note that, for the first step, users may need to perform additional setup as shown below.
print(f"Inference using model for run: {run_id}")
# Reading things from mlflow
# Model coders can create functions to repeat this - part of model init
import torch
from lightsaber.trainers import helper
data_dir = Path(os.environ.get('LS_DATA_PATH', './data'))
assert data_dir.is_dir()
conf_path = os.environ.get('LS_CONF_PATH', os.path.abspath('./ihm_expt_config.yml'))
expt_conf = du.yaml.load(open(conf_path).read().format(DATA_DIR=data_dir),
Loader=du._Loader)
mlflow_conf = dict(experiment_name='classifier_ihm')
registered_model_name = 'classifier_ihm_rnn_v0'
## Loading model attributes from mlflow
mlflow_setup = helper.setup_mlflow(**mlflow_conf)
run_data = helper.fetch_mlflow_run(run_id,
mlflow_uri=mlflow_setup['mlflow_uri'],
artifacts_prefix=['artifact/weight_labels'],
parse_params=True
)
hparams = run_data['params']
hparams = argparse.Namespace(**hparams)
# The logged rnn_class parameter is a string such as "<class 'torch.nn.LSTM'>"; recover the actual class from it
hparams.rnn_class = helper.import_model_class(hparams.rnn_class.split("'")[1::2][0])
weight_labels = pickle.load(open(helper.get_artifact_path(run_data['artifact_paths'][0],
artifact_uri=run_data['info'].artifact_uri), 'rb'))
weights = T.FloatTensor(weight_labels.values)
## Setting model weights
base_model = rnn.RNNClassifier(input_dim=input_dim,
output_dim=output_dim,
hidden_dim=hparams.hidden_dim,
rnn_class=hparams.rnn_class,
n_layers=hparams.n_layers,
dropout=hparams.dropout,
recurrent_dropout=hparams.recurrent_dropout,
bidirectional=hparams.bidirectional
)
criterion = nn.CrossEntropyLoss(weight=weights)
wrapped_model = ptr.PyModel(hparams, base_model,
train_dataset=None,
val_dataset=None, # None
test_dataset=None, # test_dataset
#optimizer=optimizer,
loss_func=criterion,
#scheduler=scheduler,
collate_fn=ptd.collate_fn
)
# Loading saved model from mlflow
wrapped_model = ptr.load_model_from_mlflow(run_id, mlflow_conf, wrapped_model)
inference_dataloader = ptd.BaseDataset(tgt_file=None,
feat_file=expt_conf['test']['feat_file'],
idx_col=expt_conf['idx_cols'],
tgt_col=expt_conf['tgt_col'],
feat_columns=expt_conf['feat_cols'],
time_order_col=expt_conf['time_order_col'],
category_map=expt_conf['category_map'],
transform=transform,
filter=fitted_filter,
)
print(inference_dataloader.shape, len(inference_dataloader))
patient_id = inference_dataloader.sample_idx.index[0]
print(f"Inference for patient: {patient_id}")
# patient_id = '10011_episode1_timeseries.csv'
wrapped_model.predict_patient(patient_id, inference_dataloader)
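The same call can be looped over several patients; a minimal sketch that collects and prints the raw outputs (the exact return type of predict_patient depends on the lightsaber version):
# Collect predictions for the first few patients in the inference dataset
predictions = {}
for pid in inference_dataloader.sample_idx.index[:5]:
    predictions[pid] = wrapped_model.predict_patient(pid, inference_dataloader)
for pid, pred in predictions.items():
    print(pid, pred)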