vaep package#
VAEP - Variational autoencoder for proteomics
- vaep.savefig(fig, name, folder: Path = '.', pdf=True, dpi=300, tight_layout=True)#
Save matplotlib Figure (having method savefig) as pdf and png.
Subpackages#
- vaep.analyzers package
Analysis
- Submodules
- vaep.analyzers.analyzers module
AnalyzePeptides
AnalyzePeptides.df
AnalyzePeptides.stats
AnalyzePeptides.calculate_PCs()
AnalyzePeptides.describe_peptides()
AnalyzePeptides.df_long
AnalyzePeptides.df_wide
AnalyzePeptides.fname_stub
AnalyzePeptides.from_csv()
AnalyzePeptides.from_pickle()
AnalyzePeptides.get_PCA()
AnalyzePeptides.get_consecutive_dates()
AnalyzePeptides.get_dectection_limit()
AnalyzePeptides.get_prop_not_na()
AnalyzePeptides.log_transform()
AnalyzePeptides.plot_pca()
AnalyzePeptides.to_long_format()
AnalyzePeptides.to_wide_format()
LatentAnalysis
add_date_colorbar()
cast_object_to_category()
corr_lower_triangle()
get_consecutive_data_indices()
plot_corr_histogram()
plot_date_map()
plot_scatter()
scatter_plot_w_dates()
seaborn_scatter()
- vaep.analyzers.compare_predictions module
- vaep.analyzers.diff_analysis module
- vaep.cmd_interface package
- vaep.databases package
- vaep.io package
PathsList
add_indices()
dump_json()
dump_to_csv()
extend_name()
from_pickle()
get_fname_from_keys()
load_json()
parse_dict()
resolve_path()
search_files()
search_subfolders()
to_pickle()
- Submodules
- vaep.io.dataloaders module
- vaep.io.datasets module
- vaep.io.datasplits module
- vaep.io.format module
- vaep.io.load module
- vaep.io.types module
- vaep.models package
Metrics
RecorderDump
calc_net_weight_count()
calculte_metrics()
collect_metrics()
compare_indices()
get_df_from_nested_dict()
plot_loss()
plot_training_losses()
split_prediction_by_mask()
- Submodules
- vaep.models.ae module
- vaep.models.analysis module
- vaep.models.collab module
- vaep.models.collect_dumps module
- vaep.models.vae module
- vaep.pandas package
flatten_dict_of_dicts()
get_columns_accessor()
get_columns_accessor_from_iterable()
get_columns_namedtuple()
get_counts_per_bin()
get_last_index_matching_proportion()
get_lower_whiskers()
get_unique_non_unique_columns()
highlight_min()
index_to_dict()
interpolate()
key_map()
length()
parse_query_expression()
prop_unique_index()
replace_with()
select_max_by()
unique_cols()
- Submodules
- vaep.pandas.calc_errors module
- vaep.pandas.missing_data module
- vaep.plotting package
- vaep.sklearn package
- vaep.stats package
Submodules#
vaep.data_handling module#
Functionality to handle protein and peptide datasets.
- vaep.data_handling.compute_stats_missing(X: DataFrame, col_no_missing: str = 'no_missing', col_no_identified: str = 'no_identified', col_prop_samples: str = 'prop_samples') DataFrame [source]#
Compute statistics on missing values for a dataset of repeated samples, indicating whether each variable is observed or missing (x in {0, 1}).
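The statistics can be sketched with plain pandas. This is a simplified stand-in, not the vaep implementation; in particular, computing the statistics per feature (column) is an assumption based on the default column names:

```python
import numpy as np
import pandas as pd

def compute_stats_missing(X: pd.DataFrame,
                          col_no_missing: str = 'no_missing',
                          col_no_identified: str = 'no_identified',
                          col_prop_samples: str = 'prop_samples') -> pd.DataFrame:
    """Per-feature missing-value statistics over all samples (rows) - a sketch."""
    observed = X.notna()
    stats = pd.DataFrame({
        col_no_missing: (~observed).sum(),    # samples missing the feature
        col_no_identified: observed.sum(),    # samples where the feature was identified
    })
    stats[col_prop_samples] = stats[col_no_identified] / len(X)
    return stats

X = pd.DataFrame({'pep1': [1.0, np.nan], 'pep2': [2.0, 3.0]})
stats = compute_stats_missing(X)
```

For the toy data above, `pep1` is identified in one of two samples (`prop_samples` 0.5) and `pep2` in both.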
vaep.imputation module#
Reduce the number of missing values in DDA mass spectrometry data.
Imputation can be done by column.
- vaep.imputation.compute_moments_shift(observed: Series, imputed: Series, names: Tuple[str, str] = ('observed', 'imputed')) Dict[str, float] [source]#
Summarize the overall shift of mean and std. dev. of predictions for an imputation method.
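A minimal sketch of such a summary, assuming the shift is expressed relative to the observed distribution (the dictionary keys are illustrative, not the actual ones returned by vaep):

```python
import pandas as pd

def compute_moments_shift(observed: pd.Series, imputed: pd.Series,
                          names=('observed', 'imputed')) -> dict:
    """Summarise how far imputed values are shifted relative to observed ones."""
    key_obs, key_imp = names
    return {
        f'mean {key_obs}': observed.mean(),
        f'mean {key_imp}': imputed.mean(),
        # shift of the mean, expressed in standard deviations of the observed data
        'mean shift (in std of observed)': (observed.mean() - imputed.mean())
                                           / observed.std(),
        # ratio of the standard deviations (shrinkage factor)
        'std shrinkage': imputed.std() / observed.std(),
    }

summary = compute_moments_shift(pd.Series([1.0, 2.0, 3.0]),
                                pd.Series([0.0, 1.0, 2.0]))
```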
- vaep.imputation.imputation_KNN(data, alone=True, threshold=0.5)[source]#
- Parameters:
data (pandas.DataFrame) –
alone (bool) – Currently not used.
threshold (float) – Threshold of missing data by column in interval (0, 1)
- vaep.imputation.imputation_normal_distribution(log_intensities: Series, mean_shift=1.8, std_shrinkage=0.3, copy=True)[source]#
Impute missing log-transformed intensity values of a single feature. Samples one value for imputation for all samples.
- Parameters:
log_intensities (pd.Series) – Series of normally distributed values of a single feature (for all samples/runs). Here usually log-transformed intensities.
mean_shift (integer, float) – Shift the mean of the log_intensities downwards by this factor of their standard deviation.
std_shrinkage (float) – Value greater than zero by which to shrink (or inflate) the standard deviation of the log_intensities.
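The shifted-normal idea can be sketched as follows. This is a simplified stand-in for the vaep function: the `seed` parameter is an addition of this sketch (the actual signature has no seed argument):

```python
import numpy as np
import pandas as pd

def imputation_normal_distribution(log_intensities: pd.Series,
                                   mean_shift: float = 1.8,
                                   std_shrinkage: float = 0.3,
                                   copy: bool = True,
                                   seed: int = 123) -> pd.Series:
    """Fill NaNs with draws from a down-shifted, shrunken normal distribution."""
    if copy:
        log_intensities = log_intensities.copy()
    mean, std = log_intensities.mean(), log_intensities.std()
    rng = np.random.default_rng(seed)  # seed parameter is an assumption of this sketch
    missing = log_intensities.isna()
    # draw one value per missing entry from N(mean - shift*std, (shrinkage*std)^2)
    log_intensities[missing] = rng.normal(loc=mean - mean_shift * std,
                                          scale=std * std_shrinkage,
                                          size=missing.sum())
    return log_intensities

intensities = pd.Series([25.0, 26.0, 27.0, np.nan], name='peptide_A')
imputed = imputation_normal_distribution(intensities)
```

The default shift of 1.8 standard deviations places imputed values in the lower tail, mimicking intensities below the detection limit.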
- vaep.imputation.impute_missing(protein_values, mean=None, std=None)[source]#
Imputation is based on the mean and standard deviation from the protein_values. If mean and standard deviation (std) are given, missing values are imputed and protein_values are returned imputed. If no mean and std are given, the mean and std are computed from the non-missing protein_values.
- Parameters:
- Returns:
protein_values
- Return type:
- vaep.imputation.impute_shifted_normal(df_wide: DataFrame, mean_shift: float = 1.8, std_shrinkage: float = 0.3, completeness: float = 0.6, axis=1, seed=123) Series [source]#
Get replacements for missing values.
- Parameters:
df_wide (pd.DataFrame) – DataFrame in wide format containing missing values
mean_shift (float, optional) – shift mean of feature by factor of standard deviations, by default 1.8
std_shrinkage (float, optional) – shrink standard deviation by this factor, by default 0.3
axis (int, optional) – axis along which to impute, by default 1 (i.e. mean and std per row)
- Returns:
Series of imputed values in long format.
- Return type:
pd.Series
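A sketch of the axis=1 case (mean and std per row, i.e. per sample). This is a simplified illustration, not the vaep implementation; the interpretation of `completeness` as a per-feature observation threshold is an assumption:

```python
import numpy as np
import pandas as pd

def impute_shifted_normal(df_wide: pd.DataFrame,
                          mean_shift: float = 1.8,
                          std_shrinkage: float = 0.3,
                          completeness: float = 0.6,
                          seed: int = 123) -> pd.Series:
    """Sample replacements for NaNs from a per-sample shifted normal (axis=1 case)."""
    rng = np.random.default_rng(seed)
    # assumption: keep only features observed in at least `completeness` of the samples
    keep = df_wide.notna().mean() >= completeness
    df = df_wide.loc[:, keep]
    loc = df.mean(axis=1) - mean_shift * df.std(axis=1)
    scale = df.std(axis=1) * std_shrinkage
    parts = []
    for sample, row in df.iterrows():
        missing = row.index[row.isna()]
        draws = rng.normal(loc[sample], scale[sample], size=len(missing))
        parts.append(pd.Series(
            draws, index=pd.MultiIndex.from_product([[sample], missing])))
    return pd.concat(parts)  # long format: (sample, feature) -> imputed value

df = pd.DataFrame({'a': [20.0, 21.0, np.nan],
                   'b': [22.0, 23.0, 24.0],
                   'c': [21.0, 22.0, 23.0]})
result = impute_shifted_normal(df)
```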
vaep.logging module#
Custom logging setup for notebooks.
- vaep.logging.setup_logger(logger, level=20, fname_base=None)#
Set up logging in the project. Takes a logger and configures it.
- Parameters:
logger (logging.Logger) – logger instance to configure
level (int, optional) – logging level, by default logging.INFO
fname_base (str, optional) – filename for logging, by default None
- Returns:
Configured logger instance for logging
- Return type:
logging.Logger
Examples
>>> import logging
>>> logger = logging.getLogger('vaep')
>>> _ = setup_logger_w_file(logger)  # no logging to file
>>> logger.handlers = []  # reset logger
>>> _ = setup_logger_w_file()
- vaep.logging.setup_logger_w_file(logger, level=20, fname_base=None)[source]#
Set up logging in the project. Takes a logger and configures it.
- Parameters:
logger (logging.Logger) – logger instance to configure
level (int, optional) – logging level, by default logging.INFO
fname_base (str, optional) – filename for logging, by default None
- Returns:
Configured logger instance for logging
- Return type:
logging.Logger
Examples
>>> import logging
>>> logger = logging.getLogger('vaep')
>>> _ = setup_logger_w_file(logger)  # no logging to file
>>> logger.handlers = []  # reset logger
>>> _ = setup_logger_w_file()
vaep.model module#
- vaep.model.get_latent_space(model_method_call: callable, dl: DataLoader, dl_index: Index, latent_tuple_pos: int = 0) DataFrame [source]#
Create a DataFrame of the latent space based on the model method call to be used (here: the model encoder or a latent space helper method).
- Parameters:
model_method_call (callable) – A method call on a pytorch.Module to create encodings for a batch of data.
dl (torch.utils.data.DataLoader) – The DataLoader to use, producing predictions in a non-random fashion.
dl_index (pd.Index) – pandas Index
latent_tuple_pos (int, optional) – if model_method_call returns a tuple for a batch, the tuple position of the tensor to select, by default 0
- Returns:
DataFrame of latent space indexed with dl_index.
- Return type:
pd.DataFrame
vaep.nb module#
- class vaep.nb.Config[source]#
Bases:
object
Config class with a setter enforcing that config entries cannot be overwritten.
Can contain entries that are themselves configs, e.g. keys or paths.
vaep.sampling module#
- vaep.sampling.feature_frequency(df_wide: DataFrame, measure_name: str = 'freq') Series [source]#
Generate frequency table based on singly indexed (both axes) DataFrame.
- Parameters:
df_wide (pd.DataFrame) – Singly indexed DataFrame with singly indexed columns (no MultiIndex)
measure_name (str, optional) – Name of the returned series, by default ‘freq’
- Returns:
Frequency on non-missing entries per feature (column).
- Return type:
pd.Series
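The frequency table amounts to counting non-missing entries per column; a minimal sketch (not the vaep implementation):

```python
import numpy as np
import pandas as pd

def feature_frequency(df_wide: pd.DataFrame, measure_name: str = 'freq') -> pd.Series:
    """Count non-missing entries per feature (column) - a sketch."""
    freq = df_wide.notna().sum()
    freq.name = measure_name
    return freq

df = pd.DataFrame({'pep1': [1.0, np.nan, 3.0],
                   'pep2': [np.nan, np.nan, 2.0]})
freq = feature_frequency(df)
```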
- vaep.sampling.frequency_by_index(df_long: DataFrame, sample_index_to_drop: Union[str, int]) Series [source]#
Generate frequency table based on an index level of a two-level MultiIndex.
- vaep.sampling.get_thresholds(df_long: DataFrame, frac_non_train: float, random_state: int) Series [source]#
Get thresholds for MNAR/MCAR sampling. Thresholds are sampled from a normal distribution with a mean at the quantile of the simulated missing data.
- Parameters:
- Returns:
Thresholds for MNAR/MCAR sampling.
- Return type:
pd.Series
- vaep.sampling.sample_data(series: Series, sample_index_to_drop: Union[str, int], frac=0.95, weights: Optional[Series] = None, random_state=42) Tuple[Series, Series] [source]#
Sample from a doubly indexed series with sample index and feature index.
- Parameters:
series (pd.Series) – Long-format data in pd.Series. Index name is feature name. 2 dimensional MultiIndex.
sample_index_to_drop (Union[str, int]) – Sample index (as str or integer Index position). Unit to group by (i.e. Samples)
frac (float, optional) – Fraction of each unit (sample) to sample, by default 0.95
weights (pd.Series, optional) – Weights to pass on for sampling on a single group, by default None
random_state (int, optional) – Random state to use for sampling procedure, by default 42
- Returns:
First series contains the entries sampled, whereas the second series contains the entries not sampled from the originally passed series.
- Return type:
Tuple[pd.Series, pd.Series]
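The per-sample split can be sketched with a pandas groupby (a simplified stand-in for the vaep function):

```python
import numpy as np
import pandas as pd

def sample_data(series: pd.Series, sample_index_to_drop=0, frac=0.95,
                weights=None, random_state=42):
    """Split a doubly indexed series into sampled and not-sampled entries,
    drawing `frac` of the entries within each sample group - a sketch."""
    level = (series.index.names[sample_index_to_drop]
             if isinstance(sample_index_to_drop, int) else sample_index_to_drop)
    sampled = (series.groupby(level=level, group_keys=False)
                     .sample(frac=frac, weights=weights, random_state=random_state))
    not_sampled = series.loc[series.index.difference(sampled.index)]
    return sampled, not_sampled

idx = pd.MultiIndex.from_product([['s1', 's2'], ['p1', 'p2', 'p3', 'p4']],
                                 names=['Sample ID', 'peptide'])
series = pd.Series(np.arange(8, dtype=float), index=idx)
train, rest = sample_data(series, sample_index_to_drop='Sample ID', frac=0.75)
```

With frac=0.75 and four peptides per sample, three entries per sample end up in the first series and one in the second.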
- vaep.sampling.sample_mnar_mcar(df_long: DataFrame, frac_non_train: float, frac_mnar: float, random_state: int = 42) Tuple[DataSplits, Series, Series, Series] [source]#
Sampling of data for MNAR/MCAR simulation. The function samples from the df_long DataFrame and returns the training, validation and test splits in the DataSplits object.
Features are selected as described in: Lazar, Cosmin, Laurent Gatto, Myriam Ferro, Christophe Bruley, and Thomas Burger. 2016. “Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.” Journal of Proteome Research 15 (4): 1116–25.
- select MNAR values based on a threshold matrix at the given quantile
- specify MNAR and MCAR proportions in the validation and test set
- use as many MNAR values as specified by frac_mnar
- sample MCAR values from the remaining data
- distribute MNAR and MCAR values across the validation and test set
- Parameters:
df_long (pd.DataFrame) – intensities in long format with unique index.
frac_non_train (float) – proportion of data in df_long to be used in total for evaluation in the validation and test splits
frac_mnar (float) – Frac of simulated data to be missing not at random (MNAR)
random_state (int, optional) – random seed for reproducibility, by default 42
- Returns:
datasplits, thresholds, fake_na_mcar, fake_na_mnar
Containing training, validation and test splits, as well as the thresholds and the MCAR and MNAR simulated missing intensities.
- Return type:
Tuple[DataSplits, pd.Series, pd.Series, pd.Series]
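The selection logic listed above can be sketched on a plain Series; this is a strong simplification (no DataSplits assembly, the threshold spread is illustrative, and all names are those of this sketch, not of vaep):

```python
import numpy as np
import pandas as pd

def sample_mnar_mcar(data: pd.Series, frac_non_train: float,
                     frac_mnar: float, random_state: int = 42):
    """Select simulated-missing entries: MNAR below sampled thresholds, MCAR at random."""
    rng = np.random.default_rng(random_state)
    n_non_train = int(len(data) * frac_non_train)
    n_mnar = int(n_non_train * frac_mnar)
    # thresholds drawn from a normal centred on the data quantile
    # (the spread of 0.1 is illustrative)
    quantile = data.quantile(frac_non_train)
    thresholds = pd.Series(rng.normal(quantile, 0.1, size=len(data)),
                           index=data.index)
    # MNAR: intensities below their threshold, capped at the requested number
    below = data[data < thresholds]
    fake_na_mnar = below.sample(n=min(n_mnar, len(below)),
                                random_state=random_state)
    # MCAR: drawn uniformly from the remaining entries
    remaining = data.drop(fake_na_mnar.index)
    fake_na_mcar = remaining.sample(n=n_non_train - len(fake_na_mnar),
                                    random_state=random_state)
    train = remaining.drop(fake_na_mcar.index)
    return train, thresholds, fake_na_mcar, fake_na_mnar

data = pd.Series(np.linspace(20.0, 30.0, 100))
train, thr, mcar, mnar = sample_mnar_mcar(data, frac_non_train=0.2, frac_mnar=0.5)
```

The MNAR entries end up in the low-intensity tail (below the sampled thresholds), while the MCAR entries are spread over the whole intensity range.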
vaep.transform module#
- class vaep.transform.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)#
Bases:
MinMaxScaler
- inverse_transform(X, **kwargs)#
Undo the scaling of X according to feature_range.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data that will be transformed. It cannot be sparse.
- Returns:
Xt – Transformed data.
- Return type:
ndarray of shape (n_samples, n_features)
- Returns:
Y – If X is a pandas DataFrame, Y will be a DataFrame with the initial row Index and column Index objects.
- Return type:
array-like
- transform(X, **kwargs)#
Scale features of X according to feature_range.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data that will be transformed.
- Returns:
Xt – Transformed data.
- Return type:
ndarray of shape (n_samples, n_features)
- Returns:
Y – If X is a pandas DataFrame, Y will be a DataFrame with the initial row Index and column Index objects.
- Return type:
array-like
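The index-preserving behaviour described above can be sketched without subclassing scikit-learn, using plain numpy/pandas (a simplified stand-in, not the vaep implementation):

```python
import numpy as np
import pandas as pd

class MinMaxScaler:
    """Sketch: min-max scaling that returns a DataFrame with the original
    row and column Index objects when given a DataFrame."""

    def __init__(self, feature_range=(0, 1)):
        self.feature_range = feature_range

    def fit(self, X):
        values = np.asarray(X, dtype=float)
        self.data_min_ = np.nanmin(values, axis=0)
        self.data_max_ = np.nanmax(values, axis=0)
        return self

    def transform(self, X):
        lo, hi = self.feature_range
        scaled = (np.asarray(X, dtype=float) - self.data_min_) / (
            self.data_max_ - self.data_min_) * (hi - lo) + lo
        if isinstance(X, pd.DataFrame):  # keep row/column Index objects
            return pd.DataFrame(scaled, index=X.index, columns=X.columns)
        return scaled

    def inverse_transform(self, X):
        lo, hi = self.feature_range
        raw = (np.asarray(X, dtype=float) - lo) / (hi - lo) * (
            self.data_max_ - self.data_min_) + self.data_min_
        if isinstance(X, pd.DataFrame):
            return pd.DataFrame(raw, index=X.index, columns=X.columns)
        return raw

df = pd.DataFrame({'a': [0.0, 10.0], 'b': [5.0, 15.0]}, index=['s1', 's2'])
scaler = MinMaxScaler().fit(df)
scaled = scaler.transform(df)
```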
- class vaep.transform.VaepPipeline(df_train: DataFrame, encode: Pipeline, decode: Optional[List[str]] = None)[source]#
Bases:
object
Custom Pipeline combining a pandas.DataFrame and a sklearn.pipeline.Pipeline.
- vaep.transform.inverse_transform(self, X, **kwargs)[source]#
Undo the scaling of X according to feature_range.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data that will be transformed. It cannot be sparse.
- Returns:
Xt – Transformed data.
- Return type:
ndarray of shape (n_samples, n_features)
- Returns:
Y – If X is a pandas DataFrame, Y will be a DataFrame with the initial row Index and column Index objects.
- Return type:
array-like
vaep.utils module#
- vaep.utils.append_to_filepath(filepath: Union[Path, str], to_append: str, sep: str = '_', new_suffix: Optional[str] = None) Path [source]#
Append filepath with the specified to_append using a separator.
Example: data.csv to data_processed.csv
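A pathlib sketch of this behaviour (a simplified stand-in; the handling of new_suffix with or without a leading dot is an assumption):

```python
from pathlib import Path
from typing import Optional, Union

def append_to_filepath(filepath: Union[Path, str], to_append: str,
                       sep: str = '_', new_suffix: Optional[str] = None) -> Path:
    """Insert `to_append` before the suffix, e.g. data.csv -> data_processed.csv."""
    filepath = Path(filepath)
    # assumption: accept new_suffix with or without a leading dot
    suffix = filepath.suffix if new_suffix is None else f'.{new_suffix.lstrip(".")}'
    return filepath.with_name(f'{filepath.stem}{sep}{to_append}{suffix}')

print(append_to_filepath('data.csv', 'processed'))  # data_processed.csv
```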
- vaep.utils.create_random_df(N: int, M: int, scaling_factor: float = 30.0, prop_na: float = 0.0, start_idx: int = 0, name_index='Sample ID', name_columns='peptide')[source]#