geno4sd.ml_tools.ReVeaL module
Summary
Functions:
| compute | Compute the mutational load and shingles, storing them in a folder as CSV files. |
| compute_mutational_load | Compute the mutational load. |
| compute_shingle | Compute the shingles, generating the train and test samples. |
| permute_labels | Permute the sample labels. |
| train_test_split | Split data into training and test sets. |
Reference
- compute_mutational_load(sample_data, samples, regions_size, chromosome, window_size=-1, out_file_name=None)
- Compute the mutational load.
- sample_data:
- pandas DataFrame containing the samples with the start and stop positions of their alterations
- samples:
- list of all samples 
- regions_size:
- pandas DataFrame containing the start and stop of each region of interest
- chromosome: int
- the chromosome of interest
- window_size: int, optional, default=-1
- the window size; by default (-1) the entire region is used. If the region size is not a multiple of the window size, the last window covers the remaining portion of the region.
- out_file_name: str, optional
- if provided, the mutational load table is stored as a CSV file under this name.
 - Returns
- a pandas DataFrame containing the mutational load; each row is a sample and each column is a window
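The window_size behaviour described above can be illustrated with a small standalone sketch. It is not the geno4sd implementation: the helper name `split_into_windows` is hypothetical, and it assumes half-open integer coordinates and that -1 means "one window spanning the whole region".

```python
def split_into_windows(start, stop, window_size=-1):
    """Return (window_start, window_stop) pairs covering [start, stop).

    Hypothetical helper, not part of geno4sd. Assumes half-open
    coordinates and that window_size=-1 means "use the whole region".
    """
    if window_size == -1:
        return [(start, stop)]
    if window_size <= 0:
        raise ValueError("window_size must be positive or -1")
    windows = []
    for window_start in range(start, stop, window_size):
        # The last window is clipped to the end of the region.
        windows.append((window_start, min(window_start + window_size, stop)))
    return windows


print(split_into_windows(0, 10, 4))  # [(0, 4), (4, 8), (8, 10)]
print(split_into_windows(0, 10))     # [(0, 10)]
```

With a region of length 10 and a window size of 4, the third window holds only the remaining 2 positions.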
- compute_shingle(train_samples, test_samples, mutation_load, moment_type=1, out_file_name=None)
- Compute the shingles, generating the train and test samples.
- train_samples:
- a pandas DataFrame containing the train ids used to create the shingles
- test_samples:
- a pandas DataFrame containing the test ids used to create the shingles
- mutation_load:
- mutational load from which the shingles are computed
- moment_type: int, optional, default=1, value=[1,4]
- the moment to compute; the moments of a function are quantitative measures related to the shape of its distribution.
- out_file_name: str, optional, default=None
- if provided, the file name under which the shingles are stored as CSV
 - Returns
- a pandas DataFrame containing the shingles
 - References
- Parida L, Haferlach C, Rhrissorrakrai K, Utro F, Levovitz C, Kern W, et al. (2019) Dark-matter matters: Discriminating subtle blood cancers using the darkest DNA. PLoS Comput Biol 15(8): e1007332. https://doi.org/10.1371/journal.pcbi.1007332
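To make the role of moment_type more concrete, the sketch below computes a per-window moment across a group of samples. It rests on assumptions that this page does not confirm: that the values 1 to 4 map to mean, variance, skewness and kurtosis, and that a shingle is the per-window moment of the mutational load over the selected samples. The names `moment` and `shingle` and the dictionary layout are hypothetical, chosen only for illustration.

```python
def moment(values, moment_type=1):
    """Return the requested moment of a list of numbers.

    Assumed mapping (not confirmed by the documentation):
    1 = mean, 2 = variance, 3 = skewness, 4 = kurtosis.
    """
    if moment_type not in (1, 2, 3, 4):
        raise ValueError("moment_type must be 1, 2, 3 or 4")
    n = len(values)
    mean = sum(values) / n
    if moment_type == 1:
        return mean
    # Population (biased) second central moment.
    m2 = sum((v - mean) ** 2 for v in values) / n
    if moment_type == 2:
        return m2
    if m2 == 0:
        return 0.0  # constant data: standardized moments are set to 0 here
    mk = sum((v - mean) ** moment_type for v in values) / n
    return mk / m2 ** (moment_type / 2)


def shingle(mutation_load, sample_ids, moment_type=1):
    """Per-window moment across the selected samples.

    mutation_load maps a sample id to its list of per-window loads
    (a hypothetical layout standing in for the DataFrame).
    """
    rows = [mutation_load[s] for s in sample_ids]
    return [moment(list(column), moment_type) for column in zip(*rows)]


load = {"s1": [0, 2], "s2": [2, 2], "s3": [4, 2]}
print(shingle(load, ["s1", "s2", "s3"], moment_type=1))  # [2.0, 2.0]
```

Higher moments separate windows that share a mean: with moment_type=2 the first window above has a non-zero variance, while the second, whose values are all equal, has a variance of 0.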
- permute_labels(label_info, seed=None)
- Permute the sample labels.
- label_info:
- pandas DataFrame containing the original sample labels per class
- seed: int, optional, default=None
- Seed for the random ordering of the labels. Pass an int for reproducible output across multiple function calls.
 - Returns
- pandas DataFrame with the permuted sample labels
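The effect of seed can be shown with a minimal standalone sketch. It does not call geno4sd: `permute_labels_sketch` is a hypothetical stand-in that works on a plain dictionary (sample id to label) rather than a DataFrame, and it assumes the permutation keeps the sample ids and only reassigns the labels among them.

```python
import random


def permute_labels_sketch(labels, seed=None):
    """Return a copy of `labels` (sample id -> label) with labels shuffled.

    Hypothetical stand-in for illustration; sample ids are kept,
    only the assignment of labels to ids changes.
    """
    rng = random.Random(seed)  # a local generator leaves global state untouched
    sample_ids = list(labels)
    shuffled = [labels[s] for s in sample_ids]
    rng.shuffle(shuffled)
    return dict(zip(sample_ids, shuffled))


labels = {"s1": "case", "s2": "case", "s3": "control", "s4": "control"}
first = permute_labels_sketch(labels, seed=0)
second = permute_labels_sketch(labels, seed=0)
print(first == second)  # True: same seed, same permutation
```

The class sizes are preserved by construction, since the labels are only reordered.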
- train_test_split(label_info, test_train_sizes, num_fold, sample_size, permuted=False, seed=None)
- Split data into training and test sets.
- label_info:
- pandas DataFrame containing the original sample labels per class
- test_train_sizes:
- pandas DataFrame containing the number of train and test samples for each label
- num_fold: int
- number of folds for the cross-validation
- sample_size: int
- number of samples used to generate a shingle.
- permuted: Boolean, optional, default=False
- If True, the labels will be permuted.
- seed: int, optional, default=None
- Seed for the random ordering of the labels. Pass an int for reproducible output across multiple function calls.
 - Returns
- train_samples (a pandas DataFrame containing the training ids)
- test_samples (a pandas DataFrame containing the testing ids)
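As a rough illustration of a per-label, per-fold split, the sketch below draws disjoint train and test ids for each label in each fold. This is one plausible reading of the parameters, not the library's algorithm: `split_by_label`, the dictionary inputs and the sampling scheme are all assumptions, and sample_size (the number of samples combined into one shingle) is deliberately left out.

```python
import random


def split_by_label(labels, sizes, num_fold, seed=None):
    """Draw disjoint train/test ids per label for each fold.

    Hypothetical helper, not part of geno4sd.
    labels: sample id -> label
    sizes:  label -> (n_train, n_test)
    Returns a list with one (train_ids, test_ids) pair per fold.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample_id, label in labels.items():
        by_label.setdefault(label, []).append(sample_id)

    folds = []
    for _ in range(num_fold):
        train_ids, test_ids = [], []
        for label, ids in by_label.items():
            n_train, n_test = sizes[label]
            if n_train + n_test > len(ids):
                raise ValueError(f"not enough samples for label {label!r}")
            drawn = rng.sample(ids, n_train + n_test)  # without replacement
            train_ids.extend(drawn[:n_train])
            test_ids.extend(drawn[n_train:])
        folds.append((train_ids, test_ids))
    return folds


labels = {f"s{i}": ("case" if i < 5 else "control") for i in range(10)}
sizes = {"case": (3, 2), "control": (3, 2)}
folds = split_by_label(labels, sizes, num_fold=2, seed=1)
train_ids, test_ids = folds[0]
print(len(train_ids), len(test_ids))  # 6 4
```

Sampling without replacement within a fold keeps the train and test ids of that fold disjoint; different folds are drawn independently and may overlap.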
 
 
- compute(sample_info, label_info, test_train_sizes, regions, chromosomes, num_fold, sample_size, out_folder, window_size=-1, n_jobs=22, moment_type=1, permuted=False, seed=None)
- Compute the mutational load and shingles, storing them in a folder as CSV files.
- sample_info:
- pandas DataFrame containing the samples with the start and stop positions of their alterations
- label_info:
- pandas DataFrame containing the original sample labels per class
- test_train_sizes:
- pandas DataFrame containing, for each label, the number of train and test samples to generate per fold
- regions:
- pandas DataFrame containing the start and stop of each region of interest
- chromosomes:
- list of the chromosomes for which to compute the shingles
- num_fold: int
- number of folds for the cross-validation
- sample_size: int
- number of elements to sample to create a shingle
- out_folder: str
- name of the folder where the output will be stored; if it does not exist, it is created automatically.
- window_size: int, optional, default=-1
- the window size; by default (-1) the entire region is used. If the region size is not a multiple of the window size, the last window covers the remaining portion of the region.
- n_jobs: int, optional, default=22
- number of jobs for multiprocessing 
- moment_type: int, optional, default=1, value=[1,4]
- the moment to compute; the moments of a function are quantitative measures related to the shape of its distribution.
- permuted: Boolean, optional, default=False
- If True, the labels will be permuted.
- seed: int, optional, default=None
- Seed for the random ordering of the labels. Pass an int for reproducible output across multiple function calls.
 - Return type
- a NumPy array with the shingles of all chromosomes (note: the chromosome order is not guaranteed)
 - References
- Parida L, Haferlach C, Rhrissorrakrai K, Utro F, Levovitz C, Kern W, et al. (2019) Dark-matter matters: Discriminating subtle blood cancers using the darkest DNA. PLoS Comput Biol 15(8): e1007332. https://doi.org/10.1371/journal.pcbi.1007332
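Because compute processes chromosomes in parallel and does not guarantee their order in the result, a caller that needs a stable order may want to key results by chromosome. The sketch below shows that general pattern only. It does not reproduce geno4sd internals: `process_chromosome` is a placeholder for the per-chromosome work, `run_all` is hypothetical, and a thread pool stands in for whatever multiprocessing mechanism the library uses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_chromosome(chromosome):
    """Placeholder for the per-chromosome work; returns a dummy result."""
    return [chromosome * 10]


def run_all(chromosomes, n_jobs=4):
    """Run the placeholder on every chromosome and return ordered results."""
    results = {}
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = {pool.submit(process_chromosome, c): c for c in chromosomes}
        for future in as_completed(futures):  # completion order, not input order
            results[futures[future]] = future.result()
    # Re-impose a deterministic order before combining the results.
    return [results[c] for c in sorted(results)]


print(run_all([3, 1, 2]))  # [[10], [20], [30]]
```

Keying each result by its chromosome makes the final order independent of which job finishes first.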