geno4sd.ml_tools.rubricoe.feature_ranking module

Summary

Functions:

rank_features

Returns the feature indices of X0 ranked according to clf, from most to least important.

score_curve

Computes a score for a sequence of subsets of features.

youden_index

Reference

youden_index(y_true, y_pred)
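
No description is given for youden_index. Its name and its (y_true, y_pred) signature suggest Youden's J statistic, which is conventionally defined as sensitivity + specificity - 1. The sketch below is a minimal illustration of that definition for binary labels. The name youden_j and the positive argument are hypothetical, and this is not necessarily how the module computes its score.

```python
def youden_j(y_true, y_pred, positive=1):
    """Youden's J = sensitivity + specificity - 1 for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("y_true must contain both classes")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1.0


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(youden_j(y_true, y_pred))  # 0.5 (sensitivity 0.75, specificity 0.75)
```

The statistic ranges from -1 (every prediction wrong) to 1 (every prediction right), with 0 meaning no better than chance.
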
rank_features(clf, X0, y0, test_size=0.2, number_of_folds=5, verbose=0, return_ranking_coefs=False, n_jobs=3, details_files=False, details_files_path=None, seed=None)

Returns the feature indices of X0 ranked according to clf, from most to least important. The final ranking is obtained by fitting clf on number_of_folds subsamplings of the data.

Parameters
  • clf – Estimator used to produce scores. Expected to expose a coef_ attribute and fit and predict methods.

  • X0 (array-like) – Dataset whose rows are observations and whose columns are features/variables.

  • y0 (array-like) – Target values for the rows of X0.

  • test_size (float, default 0.2) – Proportion of samples (rows of X0) to be used for testing.

  • number_of_folds (int, default 5) – Number of splits for cross-validation purposes.

  • n_jobs (int, default 3) – Number of workers used by the multiprocessing pool.

  • verbose (int, default 0) – Level of stdout verbosity.

  • n_features (int, optional) – Number of features of the dataset to use. If not specified, all features are used.

  • details_files (bool, default False) – Flag to enable or disable writing detailed population splits information to files.

  • details_files_path (str, optional) – Path where output files are written. Required if details_files is True.

  • seed (int, optional) – Seed value used for reproducibility. If not specified, results may differ between runs.
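
The idea of ranking features from coef_ over several subsamplings can be sketched in plain Python. Everything below is an assumption made for illustration: the toy estimator CentroidDifference, the function name rank_features_sketch, and in particular the aggregation rule (summing the absolute value of coef_ across fits), which is not documented here.

```python
import random


class CentroidDifference:
    """Toy linear estimator (hypothetical): coef_ is the difference of class means."""

    def fit(self, X, y):
        pos = [row for row, label in zip(X, y) if label == 1]
        neg = [row for row, label in zip(X, y) if label == 0]
        mean_pos = [sum(col) / len(pos) for col in zip(*pos)]
        mean_neg = [sum(col) / len(neg) for col in zip(*neg)]
        self.coef_ = [p - n for p, n in zip(mean_pos, mean_neg)]
        # Decision threshold: projection of the midpoint between the two means.
        self.threshold_ = sum(
            c * (p + n) / 2 for c, p, n in zip(self.coef_, mean_pos, mean_neg)
        )
        return self

    def predict(self, X):
        return [
            1 if sum(c * v for c, v in zip(self.coef_, row)) > self.threshold_ else 0
            for row in X
        ]


def rank_features_sketch(clf, X, y, test_size=0.2, number_of_folds=5, seed=None):
    """Rank feature indices by |coef_| summed over several random training subsets."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    n_train = n_samples - int(round(test_size * n_samples))
    totals = [0.0] * n_features
    for _ in range(number_of_folds):
        train_idx = rng.sample(range(n_samples), n_train)
        clf.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for j, coef in enumerate(clf.coef_):
            totals[j] += abs(coef)
    # Most important feature first.
    return sorted(range(n_features), key=lambda j: totals[j], reverse=True)


# Synthetic data: 10 samples per class, three columns of differing usefulness.
data_rng = random.Random(0)
X, y = [], []
for label in (0, 1):
    for _ in range(10):
        X.append([
            data_rng.uniform(-0.5, 0.5),              # column 0: pure noise
            10 * label + data_rng.uniform(-1, 1),     # column 1: strong signal
            3 * label + data_rng.uniform(-0.5, 0.5),  # column 2: weak signal
        ])
        y.append(label)

ranking = rank_features_sketch(CentroidDifference(), X, y, seed=0)
print(ranking)  # [1, 2, 0]
```

Passing a seed fixes the subsamplings, so repeated calls return the same ranking.
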

score_curve(clf, X0, y0, step_size=10, test_size=0.2, number_of_folds=5, n_jobs=1, score_function=youden_index, verbose=0, n_features=None, adaptive_features=False, tolerance_steps=10, window_size=10, details_files=False, details_files_path=None, seed=None)

Computes a score for a sequence of subsets of features.

Computes a sequence of pairs (x_i, y_i), where y_i is the score of clf on a test set after fitting on a training subset restricted to the first x_i ranked features of X0. number_of_folds train/test subsets are drawn, and they stay fixed once selected. The x_i increase by step_size up to the total number of features in X0, or up to n_features if it is given.
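
The construction of the (x_i, y_i) pairs can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: NearestCentroid, accuracy, take_columns and score_curve_sketch are hypothetical names, the curve is assumed to start at x = step_size, the ranking is passed in rather than computed, and the per-fold scores are assumed to be averaged.

```python
import random


class NearestCentroid:
    """Toy classifier (hypothetical stand-in for clf): predicts the closest class mean."""

    def fit(self, X, y):
        self.centroids_ = {}
        for label in sorted(set(y)):
            rows = [row for row, target in zip(X, y) if target == label]
            self.centroids_[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def sq_dist(row, centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))

        return [
            min(self.centroids_, key=lambda k: sq_dist(row, self.centroids_[k]))
            for row in X
        ]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def take_columns(X, rows, cols):
    """Select the given rows of X, keeping only the given columns."""
    return [[X[i][j] for j in cols] for i in rows]


def score_curve_sketch(clf, X, y, ranking, step_size=1, test_size=0.2,
                       number_of_folds=5, score_function=accuracy, seed=None):
    rng = random.Random(seed)
    n_samples = len(X)
    n_test = max(1, int(round(test_size * n_samples)))
    # Draw the splits once so that every x_i is scored on the same subsets.
    splits = []
    for _ in range(number_of_folds):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        splits.append((idx[n_test:], idx[:n_test]))
    curve = []
    for x_i in range(step_size, len(ranking) + 1, step_size):
        cols = ranking[:x_i]  # the x_i best-ranked features
        scores = []
        for train, test in splits:
            clf.fit(take_columns(X, train, cols), [y[i] for i in train])
            y_pred = clf.predict(take_columns(X, test, cols))
            scores.append(score_function([y[i] for i in test], y_pred))
        curve.append((x_i, sum(scores) / len(scores)))
    return curve


# Synthetic data: column 1 separates the classes, column 2 helps a little,
# column 0 is noise. The ranking is supplied by hand for this example.
data_rng = random.Random(0)
X, y = [], []
for label in (0, 1):
    for _ in range(10):
        X.append([
            data_rng.uniform(-0.5, 0.5),
            10 * label + data_rng.uniform(-1, 1),
            3 * label + data_rng.uniform(-0.5, 0.5),
        ])
        y.append(label)

curve = score_curve_sketch(NearestCentroid(), X, y, ranking=[1, 2, 0], seed=0)
print([x for x, _ in curve])  # [1, 2, 3]
```

Reusing the same splits for every x_i means that differences along the curve come from the feature subset rather than from a different draw of test samples.
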

Parameters
  • clf – Estimator used to produce scores. Expected to expose a coef_ attribute and fit and predict methods.

  • X0 (array-like) – Dataset whose rows are observations and whose columns are features/variables.

  • y0 (array-like) – Target values for the rows of X0.

  • n_features (int, optional) – Upper bound on the number of ranked features used to generate the score curve. If not specified, all features are used.

  • step_size (int, default 10) – Number of features to be added at each step of the curve.

  • test_size (float, default 0.2) – Proportion of samples (rows of X0) to be used for testing.

  • number_of_folds (int, default 5) – Number of splits for cross-validation purposes.

  • n_jobs (int, default 1) – Number of workers used by the multiprocessing pool.

  • score_function (Callable, default youden_index) – Callable that computes a score. Assumed to receive parameters (y_true, y_pred) and return a float.

  • verbose (int, default 0) – Level of stdout verbosity.

  • adaptive_features (bool, default False) – Flag to enable or disable early stopping.

  • tolerance_steps (int, default 10) – Number of steps to wait while observing a downward trend before stopping early.

  • window_size (int, default 10) – Size of the rolling-average window used in early stopping.

  • details_files (bool, default False) – Flag to enable or disable writing detailed population splits information to files.

  • details_files_path (str, optional) – Path where output files are written. Required if details_files is True.

  • seed (int, optional) – Seed value used for reproducibility. If not specified, results may differ between runs.
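
The interaction between adaptive_features, window_size and tolerance_steps is only described briefly above. One plausible reading is sketched below. The stopping rule shown here is an assumption: stop once the rolling mean of the scores has failed to improve on its best value for tolerance_steps consecutive steps. The function name early_stop_index is hypothetical.

```python
def early_stop_index(scores, window_size=10, tolerance_steps=10):
    """Return the index at which to stop, or None if early stopping never triggers.

    Assumed rule: stop once the rolling mean over the last `window_size`
    scores has not improved on its best value for `tolerance_steps`
    consecutive steps.
    """
    best = float("-inf")
    steps_without_gain = 0
    for i in range(len(scores)):
        window = scores[max(0, i - window_size + 1): i + 1]
        rolling_mean = sum(window) / len(window)
        if rolling_mean > best:
            best = rolling_mean
            steps_without_gain = 0
        else:
            steps_without_gain += 1
            if steps_without_gain >= tolerance_steps:
                return i
    return None


scores = [0.5, 0.6, 0.7, 0.8, 0.7, 0.6, 0.5, 0.4]
print(early_stop_index(scores, window_size=2, tolerance_steps=3))  # 6
```

A larger window_size smooths out noisy scores before the trend is judged, while a larger tolerance_steps lets the curve dip for longer before the search is cut short.
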