geno4sd.ml_tools.rubricoe.feature_ranking module

Summary

Functions:

`rank_features`	Returns the feature indices of X0 ranked according to clf, from highest to lowest importance.
`score_curve`	Computes a score for a sequence of subsets of features.
`youden_index`

Reference

youden_index(y_true, y_pred)
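
The reference gives no description for youden_index. Assuming it computes the standard Youden's J statistic (sensitivity + specificity - 1) for binary labels, a minimal sketch could look like the one below. The name youden_index_sketch and the 0/1 label encoding are assumptions made for illustration, not the module's confirmed behavior.

```python
def youden_index_sketch(y_true, y_pred):
    """Youden's J = sensitivity + specificity - 1, for binary 0/1 labels.

    Hypothetical stand-in: the real youden_index may treat labels differently.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against a class being absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity - 1.0
```

Under this definition the score ranges from -1 (every prediction wrong) to 1 (every prediction right), and a classifier that always predicts the same class scores 0.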

rank_features(clf, X0, y0, test_size=0.2, number_of_folds=5, verbose=0, return_ranking_coefs=False, n_jobs=3, details_files=False, details_files_path=None, seed=None)

Returns the feature indices of X0 ranked according to clf, from highest to lowest importance. The final ranking is obtained by fitting clf on multiple subsamplings of the data, one per fold (number_of_folds in total).

Parameters

clf – Estimator used to produce scores. Expected to have an attribute coef_ and fit and predict methods.
X0 (array-like) – Dataset whose rows are observations and whose columns are features/variables.
y0 (array-like) – Targets of X0.
test_size (float, default 0.2) – Proportion of samples (rows of X0) to be used for testing.
number_of_folds (int, default 5) – Number of splits for cross-validation purposes.
n_jobs (int, default 3) – Number of workers used by the multiprocessing pool.
verbose (int, default 0) – Level of stdout verbosity.
n_features (int, optional) – Number of features of the dataset to use. If not specified, all features are used.
details_files (bool, default False) – Flag to enable or disable writing detailed population splits information to files.
details_files_path (str, optional) – Path where output files will be written to. Not optional if details_files is True.
seed (int, optional) – Seed value used for reproducibility. If not specified, no reproducibility is enforced.
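
The subsample-and-rank procedure described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: MeanDifferenceClassifier and rank_features_sketch are hypothetical names, and taking a feature's importance to be the absolute value of its coefficient, summed over the subsamplings, is an assumption about how the ranking is aggregated.

```python
import random


class MeanDifferenceClassifier:
    """Toy linear estimator exposing coef_, fit and predict (hypothetical)."""

    def fit(self, X, y):
        pos = [row for row, label in zip(X, y) if label == 1]
        neg = [row for row, label in zip(X, y) if label == 0]
        mean_pos = [sum(col) / len(pos) for col in zip(*pos)]
        mean_neg = [sum(col) / len(neg) for col in zip(*neg)]
        # One weight per feature: the difference between the class means.
        self.coef_ = [p - n for p, n in zip(mean_pos, mean_neg)]
        # Threshold: projection of the midpoint between the two class means.
        self.threshold_ = sum(
            w * (p + n) / 2 for w, p, n in zip(self.coef_, mean_pos, mean_neg)
        )
        return self

    def predict(self, X):
        return [
            1 if sum(w * v for w, v in zip(self.coef_, row)) > self.threshold_ else 0
            for row in X
        ]


def rank_features_sketch(clf, X0, y0, test_size=0.2, number_of_folds=5, seed=None):
    """Rank feature indices by summed |coef_| over random subsamplings."""
    rng = random.Random(seed)
    n_samples, n_features = len(X0), len(X0[0])
    n_test = max(1, round(test_size * n_samples))
    totals = [0.0] * n_features
    for _ in range(number_of_folds):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        train = indices[n_test:]  # the first n_test shuffled rows are held out
        clf.fit([X0[i] for i in train], [y0[i] for i in train])
        # Assumption: importance is the absolute value of each coefficient.
        for j, weight in enumerate(clf.coef_):
            totals[j] += abs(weight)
    return sorted(range(n_features), key=lambda j: totals[j], reverse=True)


# Synthetic data: only the feature at index 1 carries signal.
data_rng = random.Random(0)
y = [0] * 10 + [1] * 10
X = [
    [data_rng.uniform(-1, 1), 5.0 * label + data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)]
    for label in y
]
ranking = rank_features_sketch(MeanDifferenceClassifier(), X, y, seed=42)
```

With this data the informative feature (index 1) comes first in ranking, and passing the same seed reproduces the same ranking.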

score_curve(clf, X0, y0, step_size=10, test_size=0.2, number_of_folds=5, n_jobs=1, score_function=youden_index, verbose=0, n_features=None, adaptive_features=False, tolerance_steps=10, window_size=10, details_files=False, details_files_path=None, seed=None)

Computes a score for a sequence of subsets of features.

Computes a sequence of pairs (x_i, y_i), where y_i is the score of clf on a test set after fitting on a training subset of X0 restricted to its first x_i ranked features. As many train/test subsets are drawn as number_of_folds, and these subsets are fixed once selected. The x_i form a range that runs up to the total number of features in X0 (or up to n_features, if specified), increasing by step_size.
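
A rough sketch of this curve construction follows. It rests on several assumptions that the reference does not confirm: the ranking is passed in precomputed, the first x_i equals step_size, and y_i is the mean score over the folds. CentroidClassifier, accuracy and score_curve_sketch are hypothetical names used only for illustration.

```python
import random


class CentroidClassifier:
    """Toy nearest-centroid estimator (hypothetical stand-in for clf)."""

    def fit(self, X, y):
        self.centroids_ = {}
        for label in set(y):
            rows = [row for row, t in zip(X, y) if t == label]
            self.centroids_[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def sq_dist(row, centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))

        return [
            min(self.centroids_, key=lambda k: sq_dist(row, self.centroids_[k]))
            for row in X
        ]


def accuracy(y_true, y_pred):
    """Simple score_function taking (y_true, y_pred) and returning a float."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def score_curve_sketch(clf, X0, y0, ranking, step_size=10, test_size=0.2,
                       number_of_folds=5, score_function=accuracy, seed=None):
    rng = random.Random(seed)
    n_samples = len(X0)
    n_test = max(1, round(test_size * n_samples))

    # Draw the splits once so that every x_i is scored on the same subsets.
    splits = []
    for _ in range(number_of_folds):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        splits.append((indices[n_test:], indices[:n_test]))

    curve = []
    for x_i in range(step_size, len(ranking) + 1, step_size):
        selected = ranking[:x_i]  # the x_i top-ranked feature indices
        fold_scores = []
        for train, test in splits:
            X_train = [[X0[i][j] for j in selected] for i in train]
            X_test = [[X0[i][j] for j in selected] for i in test]
            clf.fit(X_train, [y0[i] for i in train])
            y_pred = clf.predict(X_test)
            fold_scores.append(score_function([y0[i] for i in test], y_pred))
        # Assumption: y_i is the mean score across the folds.
        curve.append((x_i, sum(fold_scores) / len(fold_scores)))
    return curve


# Synthetic data with four features; index 1 is the informative one.
data_rng = random.Random(0)
y = [0] * 10 + [1] * 10
X = [[data_rng.uniform(-1, 1), 5.0 * label + data_rng.uniform(-1, 1),
      data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for label in y]
curve = score_curve_sketch(CentroidClassifier(), X, y, ranking=[1, 0, 2, 3],
                           step_size=2, seed=7)
```

Here curve holds one (x_i, y_i) pair for x_i = 2 and one for x_i = 4. Any callable with the (y_true, y_pred) signature can be passed as score_function.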

Parameters

clf – Estimator used to produce scores. Expected to have an attribute coef_ and fit and predict methods.
X0 (array-like) – Dataset whose rows are observations and whose columns are features/variables.
y0 (array-like) – Targets of X0.
n_features (int, optional) – Upper bound on the number of ranked features used to generate the score curve. If not specified, all features are used.
step_size (int, default 10) – Number of features to be added at each step of the curve.
test_size (float, default 0.2) – Proportion of samples (rows of X0) to be used for testing.
number_of_folds (int, default 5) – Number of splits for cross-validation purposes.
n_jobs (int, default 1) – Number of workers used by the multiprocessing pool.
score_function (Callable, default youden_index) – Callable that computes a score. Assumed to receive parameters (y_true, y_pred) and return a float.
verbose (int, default 0) – Level of stdout verbosity.
adaptive_features (bool, default False) – Flag to enable or disable early stopping.
tolerance_steps (int, default 10) – Number of steps to wait while observing a downward trend before stopping early.
window_size (int, default 10) – Size of the rolling-average window used in early stopping.
details_files (bool, default False) – Flag to enable or disable writing detailed population splits information to files.
details_files_path (str, optional) – Path where output files will be written to. Not optional if details_files is True.
seed (int, optional) – Seed value used for reproducibility. If not specified, no reproducibility is enforced.
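
The reference does not spell out how adaptive_features, tolerance_steps and window_size interact. One plausible reading, shown here purely as an assumption, is that the curve stops once the rolling mean of the scores has declined for tolerance_steps consecutive steps. The helper name should_stop_early is hypothetical.

```python
def should_stop_early(scores, tolerance_steps=10, window_size=10):
    """Return True once the rolling mean has declined for tolerance_steps steps.

    Assumed rule for illustration only; the library's criterion may differ.
    """
    def rolling_mean(end):
        # Mean of up to window_size scores ending just before position `end`.
        window = scores[max(0, end - window_size):end]
        return sum(window) / len(window)

    # Not enough history yet to observe tolerance_steps consecutive changes.
    if len(scores) <= tolerance_steps:
        return False

    # Rolling means at the last tolerance_steps + 1 positions of the curve.
    first_end = len(scores) - tolerance_steps
    means = [rolling_mean(end) for end in range(first_end, len(scores) + 1)]

    # Stop only if every step in that stretch lowered the rolling mean.
    return all(later < earlier for earlier, later in zip(means, means[1:]))
```

A caller would append each new score to scores and check should_stop_early(scores, tolerance_steps, window_size) after every step. A larger window_size smooths out noisy scores, while a larger tolerance_steps delays the stop.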