geno4sd.ml_tools.rubricoe.rubricoe module
Summary
Functions:
- compute: Function to run the full RubricOE analysis.
- compute_curves: Splits data into working and validation sets, then computes as many score curves as specified in the iterations parameter using only the working data.
- compute_feature_counts: Computes the proportion of iterations where a feature was selected as a top feature according to its corresponding curve.
- compute_top_features: Filters out top features according to threshold.
- plot_scores: Basic function to plot scores of a repeated experiment with confidence bars and mean value.
Reference
- compute_curves(X, y, iterations, curve_steps, validation_size, clf_ranking, clf_scoring, n_features=None, ranking_test_size=0.2, ranking_number_of_folds=5, return_ranking_coefs=False, scoring_test_size=0.2, scoring_number_of_folds=5, score_function=<function youden_index>, scoring_n_jobs=1, ranking_n_jobs=1, scoring_adaptive_features=False, scoring_tolerance_steps=10, scoring_window_size=10, verbose=0, details_files=True, details_files_parent_path='./', seed=None)
Splits data into working and validation sets, then computes as many score curves as specified in the iterations parameter using only the working data; the validation set is held out throughout. Each iteration has its own set of random splits for ranking and random splits for scoring. Returns a list with ranked features and scores for each iteration, along with the indices of the samples in the working and validation sets. If return_ranking_coefs is True, it also returns the average rankings for each feature.
- X:
2D array with rows observations and columns variables.
- y:
1D array with integer labels (0 or 1) for each row of X.
- iterations:
Number of repetitions for the inner RubricOE loop.
- validation_size:
Proportion of observations that will be held out during the entire procedure for validation.
- clf_ranking:
Estimator that will be used for ranking features in the inner loop. An example is lreb.LinRidgeRegSVD(). Generally, any object with a .fit(data, labels) method that exposes a coef_ attribute after fitting will work.
- clf_scoring:
Estimator that will be used for training and testing in the inner loop on subsets of features. An example is sklearn’s SVC(). Generally, any object with .fit(train_data, labels) and .predict(test_data) methods will work.
- n_features:
If specified, upper bound of ranked features used to generate score curves. Note: All features are still used for feature ranking, only score curves are affected.
- ranking_test_size:
Proportion of observations in working data that will be randomly discarded in each fold when training the ranking procedure.
- ranking_number_of_folds:
Number of repetitions of training the ranking procedure.
- return_ranking_coefs:
Flag specifying whether the actual rankings of the features should also be returned.
- scoring_test_size:
Proportion of observations in working data that will be used in each fold for testing performance of clf_scoring.
- scoring_number_of_folds:
Number of repetitions of training and testing clf_scoring in the scoring procedure.
- score_function:
Function that will be used to generate values in the score curve. By default it uses the Youden index, but more generally any function of the form score(true, pred) will work.
- scoring_n_jobs:
Number of jobs that will be spawned to compute curve points in parallel.
- verbose:
Verbosity level. 0 disables all messages. Greater than 0 prints messages.
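The score_function parameter accepts any callable of the form score(true, pred). As an illustration only, the sketch below implements Youden's J statistic (sensitivity + specificity - 1) in plain Python. The name youden_index_sketch is hypothetical and this is not the library's own youden_index; it assumes binary 0/1 labels and that both classes appear in true.

```python
def youden_index_sketch(true, pred):
    """Youden's J = sensitivity + specificity - 1 for binary 0/1 labels.

    Hypothetical stand-in for a score(true, pred) callable; assumes both
    classes are present in `true`, otherwise a division by zero occurs.
    """
    pairs = list(zip(true, pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0
```

A function with this shape could presumably be passed as score_function; the value ranges from -1 (every prediction wrong) to 1 (every prediction correct).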
Returns ranked_features_list, curve_list, idx_working, idx_validation: a tuple with, in order:
* a list with lists of ranked feature indices from higher to lower importance (one per iteration),
* a list with lists of scores obtained by the classifier when increasing the number of features selected (one per iteration),
* an array with the indices of the observations corresponding to the working set,
* an array with the indices of the observations corresponding to the validation set.
If the flag return_ranking_coefs is set to True, the return tuple is
ranked_features_list, ranking_coefs, curve_list, idx_working, idx_validation, with, in order:
* a list with lists of ranked feature indices from higher to lower importance (one per iteration),
* a list with lists of rankings of the features from higher to lower importance (one per iteration),
* a list with lists of scores obtained by the classifier when increasing the number of features selected (one per iteration),
* an array with the indices of the observations corresponding to the working set,
* an array with the indices of the observations corresponding to the validation set.
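Because the length of the returned tuple depends on return_ranking_coefs, calling code has to unpack it differently in each case. The helper below is a small sketch of one way to normalize both shapes into a dictionary. The name unpack_curves_result is hypothetical and not part of the module; the tuple orders are taken from the description above.

```python
def unpack_curves_result(result, return_ranking_coefs=False):
    """Normalize the 4- or 5-element tuple described above into a dict.

    Hypothetical helper, not part of geno4sd. `result` is assumed to be the
    value returned by compute_curves with the same return_ranking_coefs flag.
    """
    if return_ranking_coefs:
        ranked, coefs, curves, idx_working, idx_validation = result
    else:
        ranked, curves, idx_working, idx_validation = result
        coefs = None  # rankings were not requested
    return {
        "ranked_features_list": ranked,
        "ranking_coefs": coefs,
        "curve_list": curves,
        "idx_working": idx_working,
        "idx_validation": idx_validation,
    }
```

Keeping the flag and the unpacking in one place avoids silently assigning curve_list to the wrong variable when the flag is toggled.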
- compute_feature_counts(ranked_features_list, curve_list, step_size)
Computes the proportion of iterations where a feature was selected as a top feature according to its corresponding curve. Expects parameters similar to the output of compute_curves.
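The documentation does not spell out how a curve determines which features count as "top". The sketch below shows one plausible reading, stated here as an assumption: point i of a curve is taken to be the score obtained with the first (i + 1) * step_size ranked features, and the top features of an iteration are those up to the first maximum of its curve. The function name feature_counts_sketch and the dictionary return type are hypothetical; the library's actual rule and output format may differ.

```python
from collections import Counter


def feature_counts_sketch(ranked_features_list, curve_list, step_size):
    """Proportion of iterations in which each feature is among the top features.

    Assumption (not confirmed by the documentation): curve[i] is the score
    using the first (i + 1) * step_size ranked features, and the cut-off is
    the first point where the curve reaches its maximum.
    """
    counts = Counter()
    for ranked, curve in zip(ranked_features_list, curve_list):
        # Index of the first maximum of this iteration's score curve.
        best_step = max(range(len(curve)), key=curve.__getitem__)
        n_top = (best_step + 1) * step_size
        counts.update(ranked[:n_top])

    n_iterations = len(curve_list)
    return {feature: c / n_iterations for feature, c in counts.items()}
```

For example, with two iterations and step_size=2, a feature ranked in the first two positions of both iterations gets a proportion of 1.0 regardless of where each curve peaks.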
- compute_top_features(feature_counts, threshold=1.0)
Filters out top features according to threshold. A threshold value of 1 indicates that the feature was selected as a top feature in all iterations. Expects feature_counts to be similar to the output of compute_feature_counts.
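The thresholding step can be sketched in a few lines. This assumes, purely for illustration, that feature_counts is a mapping from feature index to the proportion of iterations in which it was selected, and that a feature is kept when its proportion is at least the threshold. The name top_features_sketch is hypothetical and the library's data structures may differ.

```python
def top_features_sketch(feature_counts, threshold=1.0):
    """Keep features whose selection proportion is at least `threshold`.

    Hypothetical illustration: `feature_counts` is assumed to be a dict
    mapping feature index -> proportion of iterations (between 0 and 1).
    """
    return sorted(
        feature
        for feature, proportion in feature_counts.items()
        if proportion >= threshold
    )
```

With the default threshold of 1.0, only features selected in every iteration survive; lowering it to 0.5 keeps features selected in at least half of the iterations.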
- plot_scores(eval_scores, extra_string='')
Basic function to plot scores of a repeated experiment with confidence bars and mean value.
- compute(df, iterations, curve_steps, validation_size, n_features, ranking_test_size, scoring_test_size, ranking_number_of_folds, scoring_number_of_folds, C, scoring_n_jobs, threshold, output_filename, details=True, details_files_parent_path='./', verbose=0)
Function to run the full RubricOE analysis.
- df:
DataFrame containing the input matrix, with a ‘phenotype’ column holding the labels.
- iterations: int
Number of iterations of main RubricOE loop
- curve_steps: int
Scoring curve resolution
- validation_size: float
Proportion of data to use for validation
- n_features: int
Number of features to use for score curve. Set to “all” for all features.
- ranking_test_size: float
Proportion of non-validation data to use per split on ranking step.
- scoring_test_size: float
Proportion of non-validation data to use per split on scoring step.
- ranking_number_of_folds: int
Number of splits used in ranking step.
- scoring_number_of_folds: int
Number of splits used in scoring step
- C: float
Regularization coefficient in Ridge Regression with Error Bars
- scoring_n_jobs: int
Number of parallel processes to run in scoring step
- threshold: float
Proportion of iterations where a SNP should be present to be interpreted as a top SNP.
- output_filename: str
Base name of output files.
- details: bool, default=True
Whether to output files with detailed intermediate results.
- details_files_parent_path: str, optional
Path where output files will be written to. Not optional if details is True.
- verbose: int, default=0
Set to 0 to omit progress notifications.
Returns the subset of the input dataframe restricted to the top features.