pimmslearn package#
pimmslearn: a package for imputation using self-supervised deep learning models:
- Collaborative Filtering
- Denoising Autoencoder
- Variational Autoencoder
The package offers imputation transformers in the style of scikit-learn.
The PyPI package is called pimms-learn (with a hyphen).
- pimmslearn.savefig(fig, name, folder: Path = '.', pdf=True, dpi=300, tight_layout=True)#
Save a matplotlib Figure (any object having a savefig method) as PDF and PNG.
Subpackages#
- pimmslearn.analyzers package
Analysis
- Submodules
- pimmslearn.analyzers.analyzers module
AnalyzePeptides, AnalyzePeptides.df, AnalyzePeptides.stats, AnalyzePeptides.calculate_PCs(), AnalyzePeptides.describe_peptides(), AnalyzePeptides.df_long, AnalyzePeptides.df_wide, AnalyzePeptides.fname_stub, AnalyzePeptides.from_csv(), AnalyzePeptides.from_pickle(), AnalyzePeptides.get_PCA(), AnalyzePeptides.get_consecutive_dates(), AnalyzePeptides.get_dectection_limit(), AnalyzePeptides.get_prop_not_na(), AnalyzePeptides.log_transform(), AnalyzePeptides.plot_pca(), AnalyzePeptides.to_long_format(), AnalyzePeptides.to_wide_format()
LatentAnalysis, add_date_colorbar(), cast_object_to_category(), corr_lower_triangle(), get_consecutive_data_indices(), plot_corr_histogram(), plot_date_map(), plot_scatter(), scatter_plot_w_dates(), seaborn_scatter()
- pimmslearn.analyzers.compare_predictions module
- pimmslearn.analyzers.diff_analysis module
- pimmslearn.cmd_interface package
- pimmslearn.databases package
- pimmslearn.io package
PathsList, add_indices(), dump_json(), dump_to_csv(), extend_name(), from_pickle(), get_fname_from_keys(), load_json(), parse_dict(), resolve_path(), search_files(), search_subfolders(), to_pickle()
- Submodules
- pimmslearn.io.dataloaders module
- pimmslearn.io.datasets module
- pimmslearn.io.datasplits module
- pimmslearn.io.format module
- pimmslearn.io.load module
- pimmslearn.io.types module
- pimmslearn.models package
Metrics, RecorderDump, calc_net_weight_count(), calculte_metrics(), collect_metrics(), compare_indices(), get_df_from_nested_dict(), plot_loss(), plot_training_losses(), split_prediction_by_mask()
- Submodules
- pimmslearn.models.ae module
- pimmslearn.models.analysis module
- pimmslearn.models.collab module
- pimmslearn.models.collect_dumps module
- pimmslearn.models.vae module
- pimmslearn.pandas package
calc_errors_per_feat(), flatten_dict_of_dicts(), get_absolute_error(), get_columns_accessor(), get_columns_accessor_from_iterable(), get_columns_namedtuple(), get_counts_per_bin(), get_last_index_matching_proportion(), get_lower_whiskers(), get_unique_non_unique_columns(), highlight_min(), index_to_dict(), interpolate(), key_map(), length(), parse_query_expression(), prop_unique_index(), replace_with(), select_max_by(), unique_cols()
- Submodules
- pimmslearn.pandas.calc_errors module
- pimmslearn.pandas.missing_data module
- pimmslearn.plotting package
- pimmslearn.sklearn package
- pimmslearn.stats package
Submodules#
pimmslearn.data_handling module#
Functionality to handle protein and peptide datasets.
- pimmslearn.data_handling.compute_stats_missing(X: DataFrame, col_no_missing: str = 'no_missing', col_no_identified: str = 'no_identified', col_prop_samples: str = 'prop_samples') DataFrame[source]#
Compute statistics on missing values for a dataset of repeated samples, where each entry indicates whether a variable was observed or missing for an observation (x in {0, 1}).
pimmslearn.filter module#
pimmslearn.imputation module#
Reduce the number of missing values in DDA mass spectrometry data.
Imputation can be done by column.
- pimmslearn.imputation.compute_moments_shift(observed: Series, imputed: Series, names: Tuple[str, str] = ('observed', 'imputed')) Dict[str, float][source]#
Summary of the overall shift of the mean and standard deviation of predictions for an imputation method.
- pimmslearn.imputation.impute_shifted_normal(df_wide: DataFrame, mean_shift: float = 1.8, std_shrinkage: float = 0.3, completeness: float = 0.6, axis=1, seed=123) Series[source]#
Get replacements for missing values.
- Parameters:
df_wide (pd.DataFrame) – DataFrame in wide format, contains missing values
mean_shift (float, optional) – shift mean of feature by factor of standard deviations, by default 1.8
std_shrinkage (float, optional) – shrinks standard deviation by factor, by default 0.3
axis (int, optional) – axis along which to impute, by default 1 (i.e. mean and std per row)
- Returns:
Series of imputed values in long format.
- Return type:
pd.Series
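The parameter descriptions above can be illustrated with a minimal sketch. It is not the package's implementation: the helper name `sketch_impute_shifted_normal` is hypothetical, the direction of the shift (downwards, towards low intensities) is an assumption, and the `completeness` parameter is not modelled because its meaning is not documented here.

```python
import numpy as np
import pandas as pd


def sketch_impute_shifted_normal(df_wide, mean_shift=1.8, std_shrinkage=0.3, seed=123):
    """Hypothetical helper: draw replacements per row from a shifted, shrunken normal."""
    rng = np.random.default_rng(seed)
    # Per-row (axis=1) statistics of the observed values; NaNs are skipped.
    mean = df_wide.mean(axis=1)
    std = df_wide.std(axis=1)
    # Assumption: the mean is shifted downwards by mean_shift standard deviations.
    mean_shifted = mean - mean_shift * std
    std_shrunk = std * std_shrinkage
    # One draw per cell, using the statistics of the row the cell belongs to.
    draws = rng.normal(
        loc=mean_shifted.to_numpy()[:, None],
        scale=std_shrunk.to_numpy()[:, None],
        size=df_wide.shape,
    )
    draws = pd.DataFrame(draws, index=df_wide.index, columns=df_wide.columns)
    # Keep draws only where the input is missing and return them in long format.
    return draws.where(df_wide.isna()).stack().dropna()


df = pd.DataFrame(
    [[25.0, 26.0, np.nan, 27.0], [30.0, np.nan, 31.0, 29.0], [20.0, 21.0, 22.0, np.nan]],
    index=pd.Index(["s1", "s2", "s3"], name="Sample ID"),
    columns=pd.Index(["p1", "p2", "p3", "p4"], name="peptide"),
)
imputed = sketch_impute_shifted_normal(df)
print(imputed)
```

The result holds one value per missing cell, indexed by sample and feature, which matches the long-format Series described under Returns.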
pimmslearn.logging module#
Custom logging setup for notebooks.
- pimmslearn.logging.setup_logger(logger, level=20, fname_base=None)#
Set up logging in the project. Takes a logger and configures it.
- Parameters:
logger (logging.Logger) – logger instance to configure
level (int, optional) – logging level, by default logging.INFO
fname_base (str, optional) – filename for logging, by default None
- Returns:
Configured logger instance for logging
- Return type:
logging.Logger
Examples
>>> import logging
>>> logger = logging.getLogger('pimmslearn')
>>> _ = setup_logger_w_file(logger)  # no logging to file
>>> logger.handlers = []  # reset logger
>>> _ = setup_logger_w_file(logger)
- pimmslearn.logging.setup_logger_w_file(logger, level=20, fname_base=None)[source]#
Set up logging in the project. Takes a logger and configures it.
- Parameters:
logger (logging.Logger) – logger instance to configure
level (int, optional) – logging level, by default logging.INFO
fname_base (str, optional) – filename for logging, by default None
- Returns:
Configured logger instance for logging
- Return type:
logging.Logger
Examples
>>> import logging
>>> logger = logging.getLogger('pimmslearn')
>>> _ = setup_logger_w_file(logger)  # no logging to file
>>> logger.handlers = []  # reset logger
>>> _ = setup_logger_w_file(logger)
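For readers who want to see what such a setup typically involves, here is a standard-library-only sketch. The function `setup_logger_sketch`, the message format and the `.log` file extension are assumptions made for illustration; the package's own handlers and file naming may differ.

```python
import logging
import sys


def setup_logger_sketch(logger, level=logging.INFO, fname_base=None):
    """Hypothetical stand-in: attach a stream handler and, optionally, a file handler."""
    logger.setLevel(level)
    # Drop existing handlers so re-running a notebook cell does not duplicate messages.
    logger.handlers.clear()
    formatter = logging.Formatter("%(name)s - %(levelname)s - %(message)s")
    stream_handler = logging.StreamHandler(sys.stdout)
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)
    if fname_base is not None:
        # Assumption: the log file name is derived from fname_base.
        file_handler = logging.FileHandler(f"{fname_base}.log")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger


logger = setup_logger_sketch(logging.getLogger("pimmslearn_sketch"))
logger.info("logger configured")
```

With `fname_base=None` only the stream handler is attached, which corresponds to the "no logging to file" case in the example above.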
pimmslearn.model module#
- pimmslearn.model.get_latent_space(model_method_call: callable, dl: DataLoader, dl_index: Index, latent_tuple_pos: int = 0) DataFrame[source]#
Create a DataFrame of the latent space based on the model method call to be used (here: the model encoder or a latent space helper method)
- Parameters:
model_method_call (callable) – A method of a torch.nn.Module that creates encodings for a batch of data.
dl (torch.utils.data.DataLoader) – The DataLoader to use, producing predictions in a non-random fashion.
dl_index (pd.Index) – pandas Index used to index the returned DataFrame
latent_tuple_pos (int, optional) – if model_method_call returns a tuple from a batch, the tensor at the tuple position to select, by default 0
- Returns:
DataFrame of latent space indexed with dl_index.
- Return type:
pd.DataFrame
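The collection logic can be sketched without PyTorch by treating the data loader as any iterable of batches and the encoder as any callable. The function `latent_space_sketch`, the column names and the NumPy stand-ins below are illustrative assumptions, not the package's code.

```python
import numpy as np
import pandas as pd


def latent_space_sketch(encode, batches, index, latent_tuple_pos=0):
    """Hypothetical helper: encode batches in order and collect them in a DataFrame."""
    latent = []
    for batch in batches:
        out = encode(batch)
        # Some encoders return several outputs per batch; pick the requested one.
        if isinstance(out, tuple):
            out = out[latent_tuple_pos]
        latent.append(np.asarray(out))
    latent = np.concatenate(latent)
    # Column names are an assumption made for this sketch.
    columns = [f"latent dimension {i + 1}" for i in range(latent.shape[1])]
    return pd.DataFrame(latent, index=index, columns=columns)


index = pd.Index(["s1", "s2", "s3", "s4"], name="Sample ID")
data = np.arange(12, dtype=float).reshape(4, 3)
batches = [data[:2], data[2:]]  # stands in for a non-shuffling DataLoader


def encode(batch):
    # Toy encoder returning a tuple; the first element plays the role of the latent code.
    return batch[:, :2], batch[:, 2:]


latent = latent_space_sketch(encode, batches, index)
print(latent)
```

The batches must arrive in the same order as the index, which is why the loader is required to be non-random.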
pimmslearn.nb module#
- class pimmslearn.nb.Config[source]#
Bases: object
Config class with a setter enforcing that config entries cannot be overwritten.
Can contain configs which are themselves configs: keys, paths,
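A write-once setter of this kind can be sketched in a few lines. The class name `ConfigSketch` and the choice of `AttributeError` are assumptions for illustration; the actual class may behave differently in detail.

```python
class ConfigSketch:
    """Hypothetical config object whose entries can be set once but not overwritten."""

    def __setattr__(self, name, value):
        # Refuse to replace an entry that already exists.
        if hasattr(self, name):
            raise AttributeError(f"'{name}' is already set to {getattr(self, name)!r}")
        super().__setattr__(name, value)


cfg = ConfigSketch()
cfg.folder = "runs/example"
cfg.paths = ConfigSketch()  # a config entry can itself be a config
cfg.paths.data = "data/intensities.csv"

try:
    cfg.folder = "runs/other"
    overwritten = True
except AttributeError:
    overwritten = False
print(overwritten)
```

Guarding against overwrites is useful in notebooks, where re-running a cell can otherwise silently change a setting.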
pimmslearn.normalization module#
- pimmslearn.normalization.normalize_by_median(df_wide: DataFrame, axis: int = 1) DataFrame[source]#
Normalize by median. The data is then leveled using the global median of the medians.
- Parameters:
df_wide (pd.DataFrame) – DataFrame with samples as rows and features as columns
axis (int, optional) – Axis to normalize, by default 1 (i.e. by row/sample)
- Returns:
Normalized DataFrame
- Return type:
pd.DataFrame
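One way to read this description is sketched below. It assumes an additive normalization, which suits log-transformed intensities: the per-sample median is subtracted and the global median of those medians is added back. The function name `normalize_by_median_sketch` is hypothetical, and a multiplicative variant would be equally consistent with the text above.

```python
import pandas as pd


def normalize_by_median_sketch(df_wide, axis=1):
    """Hypothetical helper: centre rows (axis=1) or columns (axis=0) on a common median."""
    medians = df_wide.median(axis=axis)
    global_median = medians.median()
    # Row medians are aligned on the index (axis=0), column medians on the columns (axis=1).
    return df_wide.sub(medians, axis=1 - axis) + global_median


df = pd.DataFrame(
    [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]],
    index=pd.Index(["s1", "s2"], name="Sample ID"),
    columns=pd.Index(["p1", "p2", "p3"], name="peptide"),
)
normalized = normalize_by_median_sketch(df)
print(normalized)
```

After normalization both samples share the same median, here 7.0, the median of the two original sample medians 2.0 and 12.0.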
- pimmslearn.normalization.normalize_sceptre(quant: DataFrame, iter_thresh: float = 1.1, iter_max: int = 10, check_convex=True) DataFrame[source]#
Normalize by sample and channel as in SCeptre paper. Code adapted to work with current pandas versions.
- Parameters:
- Returns:
Normalized DataFrame with same index and columns as input
- Return type:
pd.DataFrame
pimmslearn.sampling module#
- pimmslearn.sampling.check_split_integrity(splits: DataSplits) DataSplits[source]#
Check whether any IDs occur only in the validation or test data, which can happen in rare cases. Returns the corrected splits.
- pimmslearn.sampling.feature_frequency(df_wide: DataFrame, measure_name: str = 'freq') Series[source]#
Generate frequency table based on singly indexed (both axes) DataFrame.
- Parameters:
df_wide (pd.DataFrame) – Singly indexed DataFrame with singly indexed columns (no MultiIndex)
measure_name (str, optional) – Name of the returned series, by default 'freq'
- Returns:
Frequency of non-missing entries per feature (column).
- Return type:
pd.Series
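Counting non-missing entries per column is a one-liner in pandas. The sketch below uses a hypothetical function name, `feature_frequency_sketch`, and only illustrates the idea described above.

```python
import numpy as np
import pandas as pd


def feature_frequency_sketch(df_wide, measure_name="freq"):
    """Hypothetical helper: number of non-missing entries per feature (column)."""
    return df_wide.notna().sum().rename(measure_name)


df = pd.DataFrame(
    {"p1": [1.0, np.nan, 3.0], "p2": [np.nan, np.nan, 2.0], "p3": [1.0, 2.0, 3.0]},
    index=pd.Index(["s1", "s2", "s3"], name="Sample ID"),
)
freq = feature_frequency_sketch(df)
print(freq)
```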
- pimmslearn.sampling.frequency_by_index(df_long: DataFrame, sample_index_to_drop: str | int) Series[source]#
Generate frequency table based on an index level of a 2D multiindex.
- pimmslearn.sampling.get_thresholds(df_long: DataFrame, frac_non_train: float, random_state: int) Series[source]#
Get thresholds for MNAR/MCAR sampling. Thresholds are sampled from a normal distribution whose mean is the quantile of the simulated missing data.
- Parameters:
- Returns:
Thresholds for MNAR/MCAR sampling.
- Return type:
pd.Series
- pimmslearn.sampling.sample_data(series: Series, sample_index_to_drop: str | int, frac=0.95, weights: Series | None = None, random_state=42) Tuple[Series, Series][source]#
Sample from a doubly indexed series with a sample index and a feature index.
- Parameters:
series (pd.Series) – Long-format data in pd.Series. Index name is feature name. 2 dimensional MultiIndex.
sample_index_to_drop (Union[str, int]) – Sample index (as str or integer Index position). Unit to group by (i.e. Samples)
frac (float, optional) – Fraction of each single unit (sample) to sample, by default 0.95
weights (pd.Series, optional) – Weights to pass on for sampling on a single group, by default None
random_state (int, optional) – Random state to use for sampling procedure, by default 42
- Returns:
First series contains the entries sampled, whereas the second series contains the entries not sampled from the originally passed series.
- Return type:
Tuple[pd.Series, pd.Series]
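The per-sample sampling described above can be approximated with a pandas group-wise sample. This is a sketch under assumptions: `sample_data_sketch` is a hypothetical name, and the `weights` argument is left out for brevity.

```python
import numpy as np
import pandas as pd


def sample_data_sketch(series, sample_level, frac=0.95, random_state=42):
    """Hypothetical helper: sample a fraction of the entries within each sample."""
    sampled = series.groupby(level=sample_level).sample(frac=frac, random_state=random_state)
    # Everything that was not drawn forms the second series.
    not_sampled = series[~series.index.isin(sampled.index)]
    return sampled, not_sampled


index = pd.MultiIndex.from_product(
    [["s1", "s2", "s3"], ["p1", "p2", "p3", "p4"]], names=["Sample ID", "peptide"]
)
series = pd.Series(np.arange(12, dtype=float), index=index, name="intensity")
sampled, not_sampled = sample_data_sketch(series, "Sample ID", frac=0.75)
print(len(sampled), len(not_sampled))
```

Grouping by the sample level ensures that every sample contributes the same fraction of its entries, rather than sampling over the pooled data.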
- pimmslearn.sampling.sample_mnar_mcar(df_long: DataFrame, frac_non_train: float, frac_mnar: float, random_state: int = 42) Tuple[DataSplits, Series, Series, Series][source]#
Sampling of data for MNAR/MCAR simulation. The function samples from the df_long DataFrame and returns the training, validation and test splits in the DataSplits object.
Select features as described in
> Lazar, Cosmin, Laurent Gatto, Myriam Ferro, Christophe Bruley, and Thomas Burger. 2016. “Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.” Journal of Proteome Research 15 (4): 1116–25.
- select MNAR based on threshold matrix on quantile
- specify MNAR and MCAR proportions in validation and test set
- use needed MNAR as specified by frac_mnar
- sample MCAR from the remaining data
- distribute MNAR and MCAR in validation and test set
- Parameters:
df_long (pd.DataFrame) – intensities in long format with unique index.
frac_non_train (float) – proportion of data in df_long to be used for evaluation in total in validation and test split
frac_mnar (float) – Fraction of simulated data to be missing not at random (MNAR)
random_state (int, optional) – random seed for reproducibility, by default 42
- Returns:
datasplits, thresholds, fake_na_mcar, fake_na_mnar
Containing training, validation and test splits, as well as the thresholds, mcar and mnar simulated missing intensities.
- Return type:
Tuple[DataSplits, pd.Series, pd.Series, pd.Series]
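The procedure can be made concrete with a heavily simplified sketch. It is an illustration only: `sketch_sample_mnar_mcar` is a hypothetical name, a plain tuple stands in for the DataSplits object, the spread of the thresholds (0.3 standard deviations) and the even validation/test division are assumptions, and a Series is used where the real function takes a DataFrame.

```python
import numpy as np
import pandas as pd


def sketch_sample_mnar_mcar(values, frac_non_train, frac_mnar, random_state=42):
    """Hypothetical, simplified split of a long-format Series into train/val/test."""
    rng = np.random.default_rng(random_state)
    n_non_train = int(round(len(values) * frac_non_train))
    n_mnar = int(round(n_non_train * frac_mnar))
    # Thresholds: one noisy cutoff per value, centred on a low quantile (assumed spread).
    center = values.quantile(frac_non_train)
    thresholds = pd.Series(
        rng.normal(center, 0.3 * values.std(), size=len(values)), index=values.index
    )
    # MNAR candidates lie below their threshold; take as many as requested, if available.
    candidates = values[values < thresholds]
    n_mnar = min(n_mnar, len(candidates))
    fake_na_mnar = candidates.sample(n=n_mnar, random_state=random_state)
    # MCAR values are drawn uniformly from the remaining data.
    remaining = values[~values.index.isin(fake_na_mnar.index)]
    fake_na_mcar = remaining.sample(n=n_non_train - n_mnar, random_state=random_state)
    train = remaining[~remaining.index.isin(fake_na_mcar.index)]
    # Distribute MNAR and MCAR values over validation and test data (assumed 50/50).
    val = pd.concat(
        [
            fake_na_mnar.sample(frac=0.5, random_state=random_state),
            fake_na_mcar.sample(frac=0.5, random_state=random_state),
        ]
    )
    simulated = pd.concat([fake_na_mnar, fake_na_mcar])
    test = simulated[~simulated.index.isin(val.index)]
    return (train, val, test), thresholds, fake_na_mcar, fake_na_mnar


index = pd.MultiIndex.from_product(
    [[f"s{i}" for i in range(10)], [f"p{j}" for j in range(20)]],
    names=["Sample ID", "peptide"],
)
values = pd.Series(
    np.random.default_rng(0).normal(25, 2, size=len(index)), index=index, name="intensity"
)
splits, thresholds, fake_na_mcar, fake_na_mnar = sketch_sample_mnar_mcar(
    values, frac_non_train=0.1, frac_mnar=0.25
)
train, val, test = splits
print(len(train), len(val), len(test))
```

Because MNAR values are taken from below the thresholds, they concentrate at low intensities, while the MCAR values are spread over the whole intensity range.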
pimmslearn.transform module#
- class pimmslearn.transform.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)#
Bases: MinMaxScaler
- inverse_transform(X, **kwargs)#
Undo the scaling of X according to feature_range.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data that will be transformed. It cannot be sparse.
- Returns:
Xt – Transformed data.
- Return type:
ndarray of shape (n_samples, n_features)
- Returns:
Y – If X is a pandas DataFrame, Y will be a DataFrame with the initial index and column Index objects.
- Return type:
array-like
- transform(X, **kwargs)#
Scale features of X according to feature_range.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data that will be transformed.
- Returns:
Xt – Transformed data.
- Return type:
ndarray of shape (n_samples, n_features)
- Returns:
Y – If X is a pandas DataFrame, Y will be a DataFrame with the initial index and column Index objects.
- Return type:
array-like
- class pimmslearn.transform.VaepPipeline(df_train: DataFrame, encode: Pipeline, decode: List[str] | None = None)[source]#
Bases: object
Custom Pipeline combining a pandas.DataFrame and a sklearn.pipeline.Pipeline.
- pimmslearn.transform.inverse_transform(self, X, **kwargs)[source]#
Undo the scaling of X according to feature_range.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data that will be transformed. It cannot be sparse.
- Returns:
Xt – Transformed data.
- Return type:
ndarray of shape (n_samples, n_features)
- Returns:
Y – If X is a pandas DataFrame, Y will be a DataFrame with the initial index and column Index objects.
- Return type:
array-like
pimmslearn.utils module#
- pimmslearn.utils.append_to_filepath(filepath: Path | str, to_append: str, sep: str = '_', new_suffix: str | None = None) Path[source]#
Append to_append to the file name of filepath, joined by a separator.
Example: data.csv to data_processed.csv
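The renaming in the example can be reproduced with pathlib. The sketch below uses a hypothetical function name, `append_to_filepath_sketch`, and assumes that `new_suffix` may be given with or without a leading dot.

```python
from pathlib import Path


def append_to_filepath_sketch(filepath, to_append, sep="_", new_suffix=None):
    """Hypothetical helper: insert `to_append` between the file stem and its suffix."""
    filepath = Path(filepath)
    if new_suffix is None:
        suffix = filepath.suffix
    else:
        # Assumption: accept both 'json' and '.json'.
        suffix = "." + new_suffix.lstrip(".")
    return filepath.parent / f"{filepath.stem}{sep}{to_append}{suffix}"


result = append_to_filepath_sketch("data.csv", "processed")
result_json = append_to_filepath_sketch("out/data.csv", "stats", new_suffix="json")
print(result, result_json)
```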
- pimmslearn.utils.create_random_df(N: int, M: int, scaling_factor: float = 30.0, prop_na: float = 0.0, start_idx: int = 0, name_index='Sample ID', name_columns='peptide')[source]#