geno4sd.ml_tools.ReVeaL module

Summary

Functions:

compute

Computes the mutational load and the shingles, storing them in a folder as CSV files.

compute_mutational_load

Computes the mutational load files.

compute_shingle

Computes the shingles, generating the train and test samples.

permute_labels

Permute the sample labels.

train_test_split

Split data into training and test set.

Reference

compute_mutational_load(sample_data, samples, regions_size, chromosome, window_size=-1, out_file_name=None)

Computes the mutational load files.

sample_data:

pandas DataFrame containing the samples with the start and stop positions of their alterations

samples:

list of all samples

regions_size:

pandas DataFrame containing the start and stop of the region of interest

chromosome: int

the chromosome of interest

window_size: int, optional, default=-1

the window size; by default the entire region is used. Note that if the region size is not a multiple of the window size, the last window covers the remaining portion of the region.

out_file_name: str, optional

If provided, the mutational load table is stored as a CSV file.

Returns a DataFrame containing the mutational load; each row is a sample and each column is a window.
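The windowing rule described for window_size can be sketched in plain Python. This is only an illustration, not the geno4sd implementation: the helper names window_bounds and count_per_window are hypothetical, the intervals are assumed to be half-open, and the load is assumed here to be a simple count of alterations whose start position falls inside each window.

```python
def window_bounds(region_start, region_stop, window_size=-1):
    """Split [region_start, region_stop) into consecutive windows.

    window_size=-1 mirrors the documented default: a single window
    spanning the entire region.
    """
    if window_size == -1:
        return [(region_start, region_stop)]
    if window_size <= 0:
        raise ValueError("window_size must be positive or -1")
    bounds = []
    start = region_start
    while start < region_stop:
        # The last window is truncated to the remaining portion of the region.
        stop = min(start + window_size, region_stop)
        bounds.append((start, stop))
        start = stop
    return bounds


def count_per_window(alteration_starts, bounds):
    """Count alterations whose start position falls inside each window
    (an assumed definition of the load, for illustration only)."""
    return [
        sum(1 for pos in alteration_starts if lo <= pos < hi)
        for lo, hi in bounds
    ]


bounds = window_bounds(0, 250, window_size=100)
print(bounds)                                        # [(0, 100), (100, 200), (200, 250)]
print(count_per_window([5, 120, 130, 240], bounds))  # [1, 2, 1]
```

With a region of size 250 and a window of 100, the third window covers only the remaining 50 positions, which is the behaviour the note on window_size describes.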

compute_shingle(train_samples, test_samples, mutation_load, moment_type=1, out_file_name=None)

Computes the shingles, generating the train and test samples.

train_samples:

a pandas DataFrame containing the train ids used to create the shingles

test_samples:

a pandas DataFrame containing the test ids used to create the shingles

mutation_load:

mutational load from which the shingles are computed

moment_type: int, optional, default=1, value=[1,4]

The moments of a function are quantitative measures related to the shape of its distribution.

out_file_name: str, optional, default=None

If provided, the shingles are stored as a CSV file with this filename.

Returns a pandas DataFrame containing the shingles.
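As a rough illustration of what a moment-based summary looks like, the sketch below collapses the mutational load of a group of samples into one value per window. It rests on assumptions that this page does not confirm: that moment_type values 1 to 4 correspond to mean, variance, skewness and kurtosis, and that a shingle is a per-window moment over a group of samples. The functions moment and shingle are hypothetical helpers, not part of geno4sd.

```python
import math


def moment(values, moment_type=1):
    """Return one moment of `values` (the 1-4 mapping is an assumption)."""
    n = len(values)
    mean = sum(values) / n
    if moment_type == 1:
        return mean  # first moment: mean
    variance = sum((v - mean) ** 2 for v in values) / n
    if moment_type == 2:
        return variance  # second central moment: population variance
    if moment_type in (3, 4):
        if variance == 0:
            # Standardized moments are undefined for constant data;
            # returning 0.0 is a convention of this sketch only.
            return 0.0
        std = math.sqrt(variance)
        # Third / fourth standardized moment: skewness / kurtosis.
        return sum(((v - mean) / std) ** moment_type for v in values) / n
    raise ValueError("moment_type must be 1, 2, 3 or 4")


def shingle(load_rows, moment_type=1):
    """Collapse a group of samples (rows) into one value per window (column)."""
    return [moment(list(column), moment_type) for column in zip(*load_rows)]


# Three samples, two windows each.
group = [[0, 2], [2, 4], [4, 6]]
print(shingle(group, moment_type=1))  # [2.0, 4.0]
```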

References

Parida L, Haferlach C, Rhrissorrakrai K, Utro F, Levovitz C, Kern W, et al. (2019) Dark-matter matters: Discriminating subtle blood cancers using the darkest DNA. PLoS Comput Biol 15(8): e1007332. https://doi.org/10.1371/journal.pcbi.1007332

permute_labels(label_info, seed=None)

Permute the sample labels.

label_info:

dataframe containing the original sample labels per class

seed: int, optional, default=None

Controls the randomness of the label ordering. Pass an int for reproducible output across multiple function calls.

Returns a DataFrame with the permuted sample labels.
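The effect of seed can be illustrated with a small stdlib sketch. It is not the geno4sd code: the function permute and the dictionary layout (sample id mapped to label) are assumptions made for the example, whereas the real function works on a DataFrame.

```python
import random


def permute(labels_by_sample, seed=None):
    """Shuffle the labels across samples, keeping the sample ids fixed."""
    rng = random.Random(seed)  # a dedicated generator; seed=None is non-reproducible
    samples = list(labels_by_sample)
    labels = [labels_by_sample[s] for s in samples]
    rng.shuffle(labels)
    return dict(zip(samples, labels))


labels = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
first = permute(labels, seed=0)
second = permute(labels, seed=0)
print(first == second)  # True: the same seed gives the same permutation
```

The permutation keeps the number of samples per class unchanged; only the assignment of labels to samples is shuffled.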

train_test_split(label_info, test_train_sizes, num_fold, sample_size, permuted=False, seed=None)

Split data into training and test set.

label_info:

pandas DataFrame containing the original sample labels per class

test_train_sizes:

dataframe containing the size of train and test samples for each label

num_fold: int

number of folds for the cross-validation

sample_size: int

number of samples used to generate a shingle.

permuted: Boolean, optional, default=False

If True, labels will be permuted.

seed: int, optional, default=None

Controls the randomness of the label ordering. Pass an int for reproducible output across multiple function calls.

Returns

  • train_samples (a pandas DataFrame containing the training ids)

  • test_samples (a pandas DataFrame containing the testing ids)
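A per-label, per-fold split along these lines might look like the sketch below. It is a simplified illustration under stated assumptions: split_folds is a hypothetical helper, the inputs are plain dictionaries rather than DataFrames, sizes maps each label to an (n_train, n_test) pair, and the role of sample_size is left out entirely.

```python
import random


def split_folds(ids_by_label, sizes, num_fold, seed=None):
    """Draw disjoint train/test ids for every label, once per fold.

    ids_by_label: {label: [sample ids]}
    sizes:        {label: (n_train, n_test)}  (assumed layout)
    """
    rng = random.Random(seed)
    folds = []
    for _ in range(num_fold):
        train, test = {}, {}
        for label, ids in ids_by_label.items():
            n_train, n_test = sizes[label]
            # Sampling without replacement keeps train and test disjoint.
            picked = rng.sample(ids, n_train + n_test)
            train[label] = picked[:n_train]
            test[label] = picked[n_train:]
        folds.append((train, test))
    return folds


ids_by_label = {"A": ["a1", "a2", "a3", "a4"], "B": ["b1", "b2", "b3"]}
sizes = {"A": (3, 1), "B": (2, 1)}
folds = split_folds(ids_by_label, sizes, num_fold=2, seed=42)
for train, test in folds:
    print(train, test)
```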

compute(sample_info, label_info, test_train_sizes, regions, chromosomes, num_fold, sample_size, out_folder, window_size=-1, n_jobs=22, moment_type=1, permuted=False, seed=None)

Computes the mutational load and the shingles, storing them in a folder as CSV files.

sample_info:

pandas DataFrame containing the samples with the start and stop positions of their alterations

label_info:

pandas DataFrame containing the original sample labels per class

test_train_sizes:

pandas DataFrame containing, for each label, the number of train and test samples to generate per fold

regions:

pandas DataFrame containing the start and stop of the region of interest

chromosomes:

list containing the desired chromosomes for which to compute the shingles

num_fold: int

number of folds for the cross-validation

sample_size: int

number of elements to sample to create the shingle

out_folder: str

name of the folder where the output will be stored; if it does not exist, it is created automatically.

window_size: int, optional, default=-1

the window size; by default the entire region is used. Note that if the region size is not a multiple of the window size, the last window covers the remaining portion of the region.

n_jobs: int, optional, default=22

number of jobs for multiprocessing

moment_type: int, optional, default=1, value=[1,4]

The moments of a function are quantitative measures related to the shape of its distribution.

permuted: Boolean, optional, default=False

If True, labels will be permuted.

seed: int, optional, default=None

Controls the randomness of the label ordering. Pass an int for reproducible output across multiple function calls.

Return type

a NumPy array with the shingles of all chromosomes (note that the chromosome order is not guaranteed)
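Results that come back from parallel jobs in arbitrary order are commonly handled by keeping each result tagged with its key. The sketch below shows that general pattern only; it says nothing about how geno4sd schedules its jobs. The function fake_shingle is a hypothetical stand-in for the per-chromosome work, and a thread pool is used in place of multiprocessing to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_shingle(chromosome):
    """Hypothetical stand-in for the per-chromosome computation."""
    return [chromosome * 10, chromosome * 10 + 1]


def run_all(chromosomes, n_jobs=4):
    """Run one job per chromosome and key each result by its chromosome."""
    results = {}
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = {pool.submit(fake_shingle, c): c for c in chromosomes}
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


by_chromosome = run_all([3, 1, 2])
# Restore a deterministic order explicitly instead of relying on completion order.
ordered = [by_chromosome[c] for c in sorted(by_chromosome)]
print(ordered)  # [[10, 11], [20, 21], [30, 31]]
```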

References

Parida L, Haferlach C, Rhrissorrakrai K, Utro F, Levovitz C, Kern W, et al. (2019) Dark-matter matters: Discriminating subtle blood cancers using the darkest DNA. PLoS Comput Biol 15(8): e1007332. https://doi.org/10.1371/journal.pcbi.1007332