geno4sd.ml_tools.rubricoe.rubricoe module

Summary

Functions:

`compute`	Function to run the full RubricOE analysis
`compute_curves`	Splits data into working and validation sets, then computes as many score curves as specified in the iterations parameter using only the working data (the validation set is held out).
`compute_feature_counts`	Computes the proportion of iterations where a feature was selected as a top feature according to its corresponding curve.
`compute_top_features`	Selects the top features according to a threshold.
`plot_scores`	Basic function to plot scores of a repeated experiment with confidence bars and mean value.

Reference

compute_curves(X, y, iterations, curve_steps, validation_size, clf_ranking, clf_scoring, n_features=None, ranking_test_size=0.2, ranking_number_of_folds=5, return_ranking_coefs=False, scoring_test_size=0.2, scoring_number_of_folds=5, score_function=<function youden_index>, scoring_n_jobs=1, ranking_n_jobs=1, scoring_adaptive_features=False, scoring_tolerance_steps=10, scoring_window_size=10, verbose=0, details_files=True, details_files_parent_path='./', seed=None)

Splits the data into working and validation sets, then computes as many score curves as specified by the iterations parameter, using only the working data (the validation set is held out for the entire procedure). Each iteration has its own set of random splits for ranking and its own set of random splits for scoring. Returns a list with the ranked features and a list with the scores for each iteration, along with the indices of the samples in the working and validation sets. If return_ranking_coefs is True, it also returns the average rankings for each feature.

X: 2D array with observations as rows and variables as columns.
y: 1D array with an integer label (0 or 1) for each row of X.
iterations: Number of repetitions of the inner RubricOE loop.
validation_size: Proportion of observations that will be held out during the entire procedure for validation.
clf_ranking: Estimator that will be used for ranking features in the inner loop. An example is lreb.LinRidgeRegSVD(). Generally, any object with a .fit(data, labels) method and a coef_ attribute will work.
clf_scoring: Estimator that will be used for training and testing computations in the inner loop on subsets of features. An example is sklearn’s SVC(). Generally, any object with .fit(train_data, labels) and .predict(test_data) methods will work.
n_features: If specified, upper bound on the number of ranked features used to generate score curves. Note: all features are still used for feature ranking; only the score curves are affected.
ranking_test_size: Proportion of observations in the working data that will be randomly discarded in each fold when training the ranking procedure.
ranking_number_of_folds: Number of repetitions of training the ranking procedure.
return_ranking_coefs: Flag specifying whether to also return the actual rankings of the features.
scoring_test_size: Proportion of observations in the working data that will be used in each fold for testing the performance of clf_scoring.
scoring_number_of_folds: Number of repetitions of the scoring procedure.
score_function: Function that will be used to generate the values in the score curve. By default it uses the Youden Index, but more generally any function of the form score(true, pred) will work.
scoring_n_jobs: Number of jobs that will be spawned to compute curve points in parallel.
verbose: Verbosity level. 0 disables all messages. Values greater than 0 print messages.
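
The score_function parameter accepts any callable of the form score(true, pred). As a minimal sketch of that interface, the following pure-Python function computes Youden's J statistic (sensitivity + specificity - 1) for binary labels. The name `youden_j_sketch` is hypothetical and this is not the library's own `youden_index` implementation, whose details are not shown here.

```python
def youden_j_sketch(y_true, y_pred):
    """Youden's J = sensitivity + specificity - 1 for binary labels (0 or 1).

    Hypothetical helper for illustration; not the library's youden_index.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against classes that are absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity - 1.0
```

A function with this shape could presumably be passed as score_function, since it takes the true labels first and the predicted labels second.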

ranked_features_list, curve_list, idx_working, idx_validation: A tuple with, in order:
* a list with lists of ranked feature indices from higher to lower importance (one per iteration),
* a list with lists of scores obtained by the classifier when increasing the number of features selected (one per iteration),
* an array with the indices of the observations corresponding to the working set,
* an array with the indices of the observations corresponding to the validation set.

If the flag return_ranking_coefs is set to True, the return tuple is

ranked_features_list, ranking_coefs, curve_list, idx_working, idx_validation: With, in order:
* a list with lists of ranked feature indices from higher to lower importance (one per iteration),
* a list with lists of rankings of the features from higher to lower importance (one per iteration),
* a list with lists of scores obtained by the classifier when increasing the number of features selected (one per iteration),
* an array with the indices of the observations corresponding to the working set,
* an array with the indices of the observations corresponding to the validation set.
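
The splitting scheme described for compute_curves (one held-out validation set, then fresh random splits of the working data in every iteration) can be pictured with the sketch below. It is a standard-library illustration only: the function name `sketch_splits`, the rounding rule, and the use of the `random` module are assumptions, and the library may draw its splits differently.

```python
import random


def sketch_splits(n_samples, validation_size, iterations, n_folds, test_size, seed=None):
    """Illustrate a working/validation split with repeated random inner splits.

    Hypothetical helper; it only mimics the documented behaviour.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    # Hold out the validation observations once, for the whole procedure.
    n_validation = int(round(n_samples * validation_size))
    idx_validation = sorted(indices[:n_validation])
    idx_working = sorted(indices[n_validation:])

    # Every iteration draws its own random splits of the working data.
    n_test = int(round(len(idx_working) * test_size))
    splits = []
    for _ in range(iterations):
        folds = []
        for _ in range(n_folds):
            held_out = set(rng.sample(idx_working, n_test))
            train = [i for i in idx_working if i not in held_out]
            folds.append((train, sorted(held_out)))
        splits.append(folds)
    return idx_working, idx_validation, splits
```

In this sketch the validation indices never appear in any inner split, which mirrors the statement that validation data is held out during the entire procedure.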

compute_feature_counts(ranked_features_list, curve_list, step_size)

Computes the proportion of iterations where a feature was selected as a top feature according to its corresponding curve. Expects parameters similar to the output of compute_curves.
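
One way to picture this computation is sketched below. It rests on two assumptions that the text above does not confirm: that point k of a curve scores the first (k + 1) * step_size ranked features, and that the "top features" of an iteration are those ranked before the curve's first maximum. The name `sketch_feature_counts` is hypothetical, and the library's actual selection rule may differ.

```python
from collections import Counter


def sketch_feature_counts(ranked_features_list, curve_list, step_size):
    """Proportion of iterations in which each feature falls among the top features.

    Assumption: curve point k was scored on the first (k + 1) * step_size
    ranked features, and the top features are those up to the curve maximum.
    """
    counts = Counter()
    for ranked, curve in zip(ranked_features_list, curve_list):
        # Index of the first maximum of this iteration's score curve.
        best_step = max(range(len(curve)), key=curve.__getitem__)
        n_top = (best_step + 1) * step_size
        counts.update(ranked[:n_top])

    n_iterations = len(curve_list)
    return {feature: count / n_iterations for feature, count in counts.items()}
```

Under these assumptions, a feature with proportion 1.0 was among the top features in every iteration.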

compute_top_features(feature_counts, threshold=1.0)

Selects the top features according to threshold. A threshold value of 1 indicates that the feature was selected as a top feature in all iterations. Expects feature_counts to be similar to the output of compute_feature_counts.
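
The thresholding step can be sketched as follows. The sketch assumes that feature_counts is a mapping from feature index to the proportion of iterations in which it was selected, and that a feature is kept when its proportion is at least the threshold. Both points, and the name `sketch_top_features`, are assumptions made for illustration.

```python
def sketch_top_features(feature_counts, threshold=1.0):
    """Keep the features whose selection proportion reaches the threshold.

    Assumption: feature_counts maps feature index -> proportion in [0, 1].
    """
    return sorted(
        feature
        for feature, proportion in feature_counts.items()
        if proportion >= threshold
    )
```

With the default threshold of 1.0, only features selected in every iteration survive; lowering the threshold admits features that were selected in most, but not all, iterations.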

plot_scores(eval_scores, extra_string='')

Basic function to plot scores of a repeated experiment with confidence bars and mean value.
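
The quantities behind such a plot can be sketched without any plotting library. The example below assumes that eval_scores is a flat sequence of scores from repeated runs and that the bars are a normal-approximation 95% interval around the mean. The reference above does not say how the confidence bars are computed, so treat both the interval and the name `sketch_mean_and_interval` as assumptions.

```python
import statistics


def sketch_mean_and_interval(eval_scores, z=1.96):
    """Return (mean, lower, upper) for a sequence of repeated scores.

    Assumption: a normal-approximation interval, mean +/- z * standard error.
    """
    n = len(eval_scores)
    mean = statistics.fmean(eval_scores)
    if n < 2:
        # A single run gives no spread estimate.
        return mean, mean, mean
    standard_error = statistics.stdev(eval_scores) / n ** 0.5
    return mean, mean - z * standard_error, mean + z * standard_error
```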

compute(df, iterations, curve_steps, validation_size, n_features, ranking_test_size, scoring_test_size, ranking_number_of_folds, scoring_number_of_folds, C, scoring_n_jobs, threshold, output_filename, details=True, details_files_parent_path='./', verbose=0)

Function to run the full RubricOE analysis

df: Dataframe of the input matrix, with a column ‘phenotype’ containing the labels.
iterations: int: Number of iterations of the main RubricOE loop.
curve_steps: int: Scoring curve resolution.
validation_size: float: Proportion of data to use for validation.
n_features: int: Number of features to use for the score curve. Set to “all” for all features.
ranking_test_size: float: Proportion of non-validation data to use per split in the ranking step.
scoring_test_size: float: Proportion of non-validation data to use per split in the scoring step.
ranking_number_of_folds: int: Number of splits used in the ranking step.
scoring_number_of_folds: int: Number of splits used in the scoring step.
C: float: Regularization coefficient in Ridge Regression with Error Bars.
scoring_n_jobs: int: Number of parallel processes to run in the scoring step.
threshold: float: Proportion of iterations in which a SNP should be present to be interpreted as a top SNP.
output_filename: str: Base name of output files.
details: bool, default=True: Whether to output files with detailed intermediate results.
details_files_parent_path: str, optional: Path where output files will be written to. Not optional if details is True.
verbose: int, default=0: Set to 0 to omit progress notifications.

Returns the subset of the input dataframe restricted to the top features.
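
A call to compute might be organised as in the sketch below. The keyword names come from the signature above and the import path follows the module name at the top of this page. Every value is an illustrative placeholder rather than a recommended setting, and `rubricoe_params` and `run_rubricoe` are hypothetical names introduced here.

```python
# Illustrative placeholder values only; tune them for your own data.
rubricoe_params = {
    "iterations": 10,
    "curve_steps": 5,
    "validation_size": 0.2,
    "n_features": "all",
    "ranking_test_size": 0.2,
    "scoring_test_size": 0.2,
    "ranking_number_of_folds": 5,
    "scoring_number_of_folds": 5,
    "C": 1.0,
    "scoring_n_jobs": 1,
    "threshold": 1.0,
    "output_filename": "rubricoe_results",
    "details": True,
    "details_files_parent_path": "./",
    "verbose": 0,
}


def run_rubricoe(df, **overrides):
    """Run the full analysis on a dataframe that has a 'phenotype' column."""
    # Imported lazily so this sketch can be loaded without geno4sd installed.
    from geno4sd.ml_tools.rubricoe import rubricoe

    params = {**rubricoe_params, **overrides}
    return rubricoe.compute(df, **params)
```

Keeping the settings in one dictionary makes it easy to override a single value per run, for example `run_rubricoe(df, threshold=0.9)`.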