vaep package#

VAEP: Variational autoencoder for proteomics

vaep.savefig(fig, name, folder: Path = '.', pdf=True, dpi=300, tight_layout=True)#

Save matplotlib Figure (having method savefig) as pdf and png.

Submodules#

vaep.data_handling module#

Functionality to handle protein and peptide datasets.

vaep.data_handling.compute_stats_missing(X: DataFrame, col_no_missing: str = 'no_missing', col_no_identified: str = 'no_identified', col_prop_samples: str = 'prop_samples') DataFrame[source]#

Compute missing-value statistics for a dataset of repeated samples, indicating for each observation whether a variable is observed or missing (x in {0, 1}).

vaep.data_handling.coverage(X: DataFrame, coverage_col: float, coverage_row: float)[source]#

Select proteins (columns) by their coverage across samples. Of these selected proteins, keep the rows (samples) that reach the required overall coverage.
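A minimal sketch of such a two-step filter in plain pandas (the helper `coverage_sketch` and the toy data are illustrative, not the package code):

```python
import numpy as np
import pandas as pd

# Toy wide-format data: rows = samples, columns = proteins, NaN = missing.
df = pd.DataFrame(
    {
        "prot_A": [1.0, 2.0, np.nan, 4.0],
        "prot_B": [1.0, np.nan, np.nan, np.nan],
        "prot_C": [1.0, 2.0, 3.0, 4.0],
    },
    index=[f"sample_{i}" for i in range(4)],
)

def coverage_sketch(X: pd.DataFrame, coverage_col: float, coverage_row: float) -> pd.DataFrame:
    """Keep columns observed in at least coverage_col of rows, then rows
    observed in at least coverage_row of the remaining columns."""
    X = X.loc[:, X.notna().mean(axis=0) >= coverage_col]
    return X.loc[X.notna().mean(axis=1) >= coverage_row]

filtered = coverage_sketch(df, coverage_col=0.5, coverage_row=1.0)
# prot_B is dropped (25% coverage); sample_2 is dropped (missing prot_A)
```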

vaep.data_handling.get_sorted_not_missing(X: DataFrame) DataFrame[source]#

Return a Dataframe with missing values. Order columns by degree of completness over columns from variables least to most shared among observations.

vaep.imputation module#

Reduce the number of missing values of DDA mass spectrometry data.

Imputation can be done by column.

vaep.imputation.compute_moments_shift(observed: Series, imputed: Series, names: Tuple[str, str] = ('observed', 'imputed')) Dict[str, float][source]#

Summary of the overall shift in mean and std. dev. of predictions for an imputation method.

vaep.imputation.imputation_KNN(data, alone=True, threshold=0.5)[source]#
Parameters:
  • data (pandas.DataFrame) –

  • alone (bool) – Currently not used.

  • threshold (float) – Threshold of missing data by column in interval (0, 1)

vaep.imputation.imputation_mixed_norm_KNN(data)[source]#
vaep.imputation.imputation_normal_distribution(log_intensities: Series, mean_shift=1.8, std_shrinkage=0.3, copy=True)[source]#

Impute missing log-transformed intensity values of a single feature. A single value is sampled and used to impute all missing entries of that feature.

Parameters:
  • log_intensities (pd.Series) – Series of normally distributed values of a single feature (for all samples/runs). Here usually log-transformed intensities.

  • mean_shift (int or float) – Shift the mean of the log_intensities downwards by this factor of their standard deviation.

  • std_shrinkage (float) – Value greater than zero by which to shrink (or inflate) the standard deviation of the log_intensities.
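With these parameters, the sampling scheme can be sketched as follows (a stand-in in plain numpy/pandas, not the package implementation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
# Toy feature: normally distributed log-intensities with missing values.
log_intensities = pd.Series(rng.normal(loc=25.0, scale=2.0, size=200))
log_intensities.iloc[:20] = np.nan

mean_shift, std_shrinkage = 1.8, 0.3
mu, sigma = log_intensities.mean(), log_intensities.std()
# Draw a single replacement from a normal shifted 1.8 SDs below the observed
# mean, with a shrunken spread, and use it for every missing entry.
replacement = rng.normal(loc=mu - mean_shift * sigma, scale=sigma * std_shrinkage)
imputed = log_intensities.fillna(replacement)
```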

vaep.imputation.impute_missing(protein_values, mean=None, std=None)[source]#

Imputation is based on the mean and standard deviation from the protein_values. If mean and standard deviation (std) are given, missing values are imputed and protein_values are returned imputed. If no mean and std are given, the mean and std are computed from the non-missing protein_values.

Parameters:
  • protein_values (Iterable) –

  • mean (float) –

  • std (float) –

Returns:

protein_values

Return type:

pandas.Series

vaep.imputation.impute_shifted_normal(df_wide: DataFrame, mean_shift: float = 1.8, std_shrinkage: float = 0.3, completeness: float = 0.6, axis=1, seed=123) Series[source]#

Get replacements for missing values.

Parameters:
  • df_wide (pd.DataFrame) – DataFrame in wide format containing missing values

  • mean_shift (float, optional) – shift mean of feature by factor of standard deviations, by default 1.8

  • std_shrinkage (float, optional) – shrink the standard deviation by this factor, by default 0.3

  • axis (int, optional) – axis along which to impute, by default 1 (i.e. mean and std per row)

Returns:

Series of imputed values in long format.

Return type:

pd.Series

vaep.imputation.stats_by_level(series: Series, index_level: int = 0, min_count: int = 5) Series[source]#

Count, mean and std. dev. by index level.
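Assuming the statistics come from a plain groupby over the index level, a sketch (the min_count filtering is omitted here):

```python
import pandas as pd

# Series with a two-level index; level 0 is the grouping level.
idx = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2)], names=["group", "rep"]
)
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)
# Count, mean and std. dev. per first index level.
stats = series.groupby(level=0).agg(["count", "mean", "std"])
```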

vaep.logging module#

Custom logging setup for notebooks.

vaep.logging.setup_logger(logger, level=20, fname_base=None)#

Set up logging in the project: takes a logger and configures it.

Parameters:
  • logger (logging.Logger) – logger instance to configure

  • level (int, optional) – logging level, by default logging.INFO

  • fname_base (str, optional) – filename for logging, by default None

Returns:

Configured logger instance for logging

Return type:

logging.Logger

Examples

>>> import logging
>>> logger = logging.getLogger('vaep')
>>> _ = setup_logger_w_file(logger) # no logging to file
>>> logger.handlers = [] # reset logger
>>> _ = setup_logger_w_file(logger, fname_base='log') # also log to file
vaep.logging.setup_logger_w_file(logger, level=20, fname_base=None)[source]#

Set up logging in the project: takes a logger and configures it.

Parameters:
  • logger (logging.Logger) – logger instance to configure

  • level (int, optional) – logging level, by default logging.INFO

  • fname_base (str, optional) – filename for logging, by default None

Returns:

Configured logger instance for logging

Return type:

logging.Logger

Examples

>>> import logging
>>> logger = logging.getLogger('vaep')
>>> _ = setup_logger_w_file(logger) # no logging to file
>>> logger.handlers = [] # reset logger
>>> _ = setup_logger_w_file(logger, fname_base='log') # also log to file
vaep.logging.setup_nb_logger(level: int = 20, format_str: str = '%(name)s - %(levelname)-8s %(message)s') None[source]#

vaep.model module#

vaep.model.build_df_from_pred_batches(pred, scaler=None, index=None, columns=None)[source]#
vaep.model.get_latent_space(model_method_call: callable, dl: DataLoader, dl_index: Index, latent_tuple_pos: int = 0) DataFrame[source]#

Create a DataFrame of the latent space based on the model method call to be used (here: the model encoder or a latent space helper method)

Parameters:
  • model_method_call (callable) – A method call on a pytorch.Module to create encodings for a batch of data.

  • dl (torch.utils.data.DataLoader) – The DataLoader to use, producing predictions in a non-random fashion.

  • dl_index (pd.Index) – pandas Index

  • latent_tuple_pos (int, optional) – if model_method_call returns a tuple per batch, the tuple position of the latent tensor to select, by default 0

Returns:

DataFrame of latent space indexed with dl_index.

Return type:

pd.DataFrame
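A torch-free sketch of the collection pattern, with a hypothetical `fake_encode` standing in for model_method_call and plain numpy arrays standing in for DataLoader batches:

```python
import numpy as np
import pandas as pd

# Stand-in for model_method_call: returns a (latent, reconstruction) tuple
# per batch, so the latent code sits at latent_tuple_pos = 0.
def fake_encode(batch: np.ndarray):
    return batch[:, :2], batch  # pretend the first two columns are the code

# Two deterministic "batches" instead of a torch DataLoader.
batches = [np.arange(8.0).reshape(2, 4), np.arange(8.0, 16.0).reshape(2, 4)]
dl_index = pd.Index([f"sample_{i}" for i in range(4)], name="Sample ID")

latent_tuple_pos = 0
latents = [fake_encode(batch)[latent_tuple_pos] for batch in batches]
df_latent = pd.DataFrame(np.vstack(latents), index=dl_index)
```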

vaep.nb module#

class vaep.nb.Config[source]#

Bases: object

Config class with a setter enforcing that config entries cannot be overwritten.

Can contain entries that are themselves configs, e.g. keys or paths.

dump(fname=None)[source]#
classmethod from_dict(d: dict)[source]#
items()[source]#
keys()[source]#
overwrite_entry(entry, value)[source]#

Explicitly overwrite a given value.

update_from_dict(params: dict)[source]#
values()[source]#
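A hypothetical re-implementation of the overwrite-protection idea (`ConfigSketch` is illustrative, not the package class):

```python
class ConfigSketch:
    """Sketch of a config whose entries cannot be silently overwritten."""

    def __setattr__(self, name, value):
        if name in self.__dict__:
            raise AttributeError(
                f"Entry '{name}' already set; use overwrite_entry.")
        super().__setattr__(name, value)

    def overwrite_entry(self, entry, value):
        """Explicitly overwrite a given value."""
        object.__setattr__(self, entry, value)


cfg = ConfigSketch()
cfg.folder = "data"
try:
    cfg.folder = "other"       # rejected: entry already exists
    raised = False
except AttributeError:
    raised = True
cfg.overwrite_entry("folder", "runs")  # explicit overwrite is allowed
```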
vaep.nb.add_default_paths(cfg: Config, folder_data='', out_root=None)[source]#

Add default paths to config.

vaep.nb.args_from_dict(args: dict) Config[source]#
vaep.nb.get_params(args: keys, globals, remove=True) dict[source]#
vaep.nb.remove_keys_from_globals(keys: keys, globals: dict)[source]#

vaep.sampling module#

vaep.sampling.feature_frequency(df_wide: DataFrame, measure_name: str = 'freq') Series[source]#

Generate frequency table based on singly indexed (both axes) DataFrame.

Parameters:
  • df_wide (pd.DataFrame) – Singly indexed DataFrame with singly indexed columns (no MultiIndex)

  • measure_name (str, optional) – Name of the returned series, by default ‘freq’

Returns:

Frequency on non-missing entries per feature (column).

Return type:

pd.Series
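Assuming the frequency is simply the count of non-missing entries per column, a short pandas sketch reproduces it:

```python
import numpy as np
import pandas as pd

# Singly indexed wide-format data: rows = samples, columns = features.
df_wide = pd.DataFrame(
    {"pep_A": [1.0, np.nan, 3.0], "pep_B": [np.nan, np.nan, 2.0]},
    index=["s1", "s2", "s3"],
)
# Frequency of non-missing entries per feature (column).
freq = df_wide.notna().sum().rename("freq")
```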

vaep.sampling.frequency_by_index(df_long: DataFrame, sample_index_to_drop: Union[str, int]) Series[source]#

Generate frequency table based on an index level of a 2D multiindex.

Parameters:
  • df_long (pd.DataFrame) – One-column, 2D multiindexed DataFrame

  • sample_index_to_drop (Union[str, int]) – index name or position not to use

Returns:

frequency of index categories in table (not missing)

Return type:

pd.Series

vaep.sampling.get_thresholds(df_long: DataFrame, frac_non_train: float, random_state: int) Series[source]#

Get thresholds for MNAR/MCAR sampling. Thresholds are sampled from a normal distribution with its mean at the quantile of the simulated missing data.

Parameters:
  • df_long (pd.DataFrame) – Long-format data in pd.DataFrame. Index name is feature name. 2 dimensional MultiIndex.

  • frac_non_train (float) – Fraction of a single unit (sample) to sample.

  • random_state (int) – Random state to use for sampling procedure.

Returns:

Thresholds for MNAR/MCAR sampling.

Return type:

pd.Series

vaep.sampling.sample_data(series: Series, sample_index_to_drop: Union[str, int], frac=0.95, weights: Optional[Series] = None, random_state=42) Tuple[Series, Series][source]#

Sample from a doubly indexed series with sample index and feature index.

Parameters:
  • series (pd.Series) – Long-format data in pd.Series. Index name is feature name. 2 dimensional MultiIndex.

  • sample_index_to_drop (Union[str, int]) – Sample index (as str or integer Index position). Unit to group by (i.e. Samples)

  • frac (float, optional) – Fraction of a single unit (sample) to sample, by default 0.95

  • weights (pd.Series, optional) – Weights to pass on for sampling on a single group, by default None

  • random_state (int, optional) – Random state to use for sampling procedure, by default 42

Returns:

First series contains the entries sampled, whereas the second series contains the entries not sampled from the originally passed series.

Return type:

Tuple[pd.Series, pd.Series]
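A sketch of per-sample sampling on a doubly indexed series, assuming a groupby over the sample level (illustrative, not the package code):

```python
import numpy as np
import pandas as pd

# Long-format series: 3 samples x 10 features, 2D MultiIndex.
idx = pd.MultiIndex.from_product(
    [[f"sample_{i}" for i in range(3)], [f"pep_{j}" for j in range(10)]],
    names=["Sample ID", "peptide"],
)
series = pd.Series(np.arange(30.0), index=idx)

frac, sample_index_to_drop = 0.8, "Sample ID"
# Sample frac of each sample's features; the remainder is the held-out part.
sampled = (
    series.groupby(level=sample_index_to_drop, group_keys=False)
    .apply(lambda s: s.sample(frac=frac, random_state=42))
)
not_sampled = series.loc[series.index.difference(sampled.index)]
```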

vaep.sampling.sample_mnar_mcar(df_long: DataFrame, frac_non_train: float, frac_mnar: float, random_state: int = 42) Tuple[DataSplits, Series, Series, Series][source]#

Sampling of data for MNAR/MCAR simulation. The function samples from the df_long DataFrame and returns the training, validation and test splits in the DataSplits object.

Select features as described in: Lazar, Cosmin, Laurent Gatto, Myriam Ferro, Christophe Bruley, and Thomas Burger. 2016. “Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.” Journal of Proteome Research 15 (4): 1116–25.

  • select MNAR based on threshold matrix on quantile

  • specify MNAR and MCAR proportions in validation and test set

  • take as many MNAR entries as specified by frac_mnar

  • sample MCAR from the remaining data

  • distribute MNAR and MCAR in validation and test set

Parameters:
  • df_long (pd.DataFrame) – intensities in long format with unique index.

  • frac_non_train (float) – proportion of the data in df_long to be used for evaluation, in total across validation and test splits

  • frac_mnar (float) – Fraction of the simulated missing data to be missing not at random (MNAR)

  • random_state (int, optional) – random seed for reproducibility, by default 42

Returns:

datasplits, thresholds, fake_na_mcar, fake_na_mnar

Containing training, validation and test splits, as well as the thresholds, mcar and mnar simulated missing intensities.

Return type:

Tuple[DataSplits, pd.Series, pd.Series, pd.Series]
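A simplified sketch of the MNAR/MCAR split on a single series (the thresholds, proportions and sampling calls below are illustrative assumptions; the package operates on a long-format DataFrame and returns a DataSplits object):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
intensities = pd.Series(rng.normal(25.0, 2.0, size=1000))

frac_non_train, frac_mnar = 0.1, 0.5
n_sim = int(len(intensities) * frac_non_train)   # 100 simulated missing
n_mnar = int(n_sim * frac_mnar)                  # 50 MNAR, rest MCAR

# MNAR: take low-intensity values falling below per-entry thresholds drawn
# around the frac_non_train quantile (Lazar et al. 2016 style).
thresholds = pd.Series(
    rng.normal(intensities.quantile(frac_non_train), 1.0, size=len(intensities)),
    index=intensities.index,
)
mnar_candidates = intensities[intensities < thresholds]
fake_na_mnar = mnar_candidates.sample(
    n=min(n_mnar, len(mnar_candidates)), random_state=42)

# MCAR: sample uniformly from the remaining observed values.
remaining = intensities.drop(fake_na_mnar.index)
fake_na_mcar = remaining.sample(n=n_sim - len(fake_na_mnar), random_state=42)

# Training data is everything not simulated missing.
train = intensities.drop(fake_na_mnar.index).drop(fake_na_mcar.index)
```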

vaep.transform module#

class vaep.transform.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)#

Bases: MinMaxScaler

inverse_transform(X, **kwargs)#

Undo the scaling of X according to feature_range.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input data that will be transformed. It cannot be sparse.

Returns:

Xt – Transformed data. If X is a pandas DataFrame, the result is a DataFrame with the initial row Index and column Index objects.

Return type:

ndarray of shape (n_samples, n_features), or pd.DataFrame if X is a DataFrame

transform(X, **kwargs)#

Scale features of X according to feature_range.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input data that will be transformed.

Returns:

Xt – Transformed data. If X is a pandas DataFrame, the result is a DataFrame with the initial row Index and column Index objects.

Return type:

ndarray of shape (n_samples, n_features), or pd.DataFrame if X is a DataFrame

class vaep.transform.VaepPipeline(df_train: DataFrame, encode: Pipeline, decode: Optional[List[str]] = None)[source]#

Bases: object

Custom Pipeline combining a pandas.DataFrame and a sklearn.pipeline.Pipeline.

inverse_transform(X, index=None)[source]#
transform(X)[source]#
vaep.transform.inverse_transform(self, X, **kwargs)[source]#

Undo the scaling of X according to feature_range.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input data that will be transformed. It cannot be sparse.

Returns:

Xt – Transformed data. If X is a pandas DataFrame, the result is a DataFrame with the initial row Index and column Index objects.

Return type:

ndarray of shape (n_samples, n_features), or pd.DataFrame if X is a DataFrame

vaep.transform.make_pandas_compatible(cls)[source]#

Patch transform and inverse_transform.
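A hypothetical version of such a patch: wrap transform and inverse_transform so that a DataFrame in yields a DataFrame out (all class and function names below are illustrative, not the package code):

```python
import numpy as np
import pandas as pd

def make_pandas_compatible_sketch(cls):
    """Return a subclass whose transform / inverse_transform preserve
    a DataFrame's index and columns."""
    def _wrap(method_name):
        original = getattr(cls, method_name)
        def wrapped(self, X, **kwargs):
            result = original(self, X, **kwargs)
            if isinstance(X, pd.DataFrame):
                return pd.DataFrame(result, index=X.index, columns=X.columns)
            return result
        return wrapped
    return type(cls.__name__, (cls,), {
        "transform": _wrap("transform"),
        "inverse_transform": _wrap("inverse_transform"),
    })

class ToyScaler:
    """Minimal numpy min-max scaler standing in for an sklearn transformer."""
    def fit(self, X):
        X = np.asarray(X)
        self.min_, self.max_ = X.min(axis=0), X.max(axis=0)
        return self
    def transform(self, X, **kwargs):
        return (np.asarray(X) - self.min_) / (self.max_ - self.min_)
    def inverse_transform(self, X, **kwargs):
        return np.asarray(X) * (self.max_ - self.min_) + self.min_

PandasToyScaler = make_pandas_compatible_sketch(ToyScaler)
df = pd.DataFrame({"a": [0.0, 1.0, 2.0], "b": [10.0, 20.0, 30.0]},
                  index=["s1", "s2", "s3"])
scaled = PandasToyScaler().fit(df).transform(df)
```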

vaep.transform.transform(self, X, **kwargs)[source]#

vaep.utils module#

vaep.utils.append_to_filepath(filepath: Union[Path, str], to_append: str, sep: str = '_', new_suffix: Optional[str] = None) Path[source]#

Append the filename in filepath with to_append, using a separator.

Example: data.csv to data_processed.csv
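A sketch of an equivalent helper built on pathlib (`append_to_filepath_sketch` is illustrative, not the package function):

```python
from pathlib import Path
from typing import Optional, Union

def append_to_filepath_sketch(filepath: Union[Path, str], to_append: str,
                              sep: str = "_",
                              new_suffix: Optional[str] = None) -> Path:
    """Turn e.g. data.csv into data_processed.csv, optionally swapping
    the suffix."""
    filepath = Path(filepath)
    suffix = new_suffix if new_suffix is not None else filepath.suffix
    if suffix and not suffix.startswith("."):
        suffix = "." + suffix  # accept 'pkl' as well as '.pkl'
    return filepath.with_name(f"{filepath.stem}{sep}{to_append}{suffix}")

result = append_to_filepath_sketch("data.csv", "processed")
```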

vaep.utils.create_random_df(N: int, M: int, scaling_factor: float = 30.0, prop_na: float = 0.0, start_idx: int = 0, name_index='Sample ID', name_columns='peptide')[source]#
vaep.utils.create_random_missing_data(N, M, mean: float = 25.0, std_dev: float = 2.0, prop_missing: float = 0.15)[source]#
vaep.utils.create_random_missing_data_long(N: int, M: int, prop_missing=0.1)[source]#

Build example data in long format.