pimmslearn.sklearn package#

Scikit-learn related functions for the ALD part of the project.

Might be moved to a separate package in the future.

pimmslearn.sklearn.get_PCA(df, n_components=2, imputer=<class 'sklearn.impute._base.SimpleImputer'>)[source]#

Submodules#

pimmslearn.sklearn.ae_transformer module#

Scikit-learn style interface for Denoising and Variational Autoencoder model.

class pimmslearn.sklearn.ae_transformer.AETransformer(hidden_layers: list[int], latent_dim: int = 15, out_folder: str = '.', model='VAE', batch_size: int = 64)[source]#

Bases: TransformerMixin, BaseEstimator

Autoencoder transformer (Denoising or Variational).

Autoencoder transformer which can be used to impute missing values in the dataset it is fitted to. The data is standard normalized internally for fitting the model, but imputations are provided on the original scale.

The data uses the wide data format with samples as rows and features as columns.
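The normalization round trip described above can be illustrated with plain pandas. This is only a sketch of the idea (per-feature standardization and its inverse) on made-up data; how the transformer implements the scaling internally is not shown in this reference.

```python
import numpy as np
import pandas as pd

# Toy wide-format data: samples as rows, features as columns (values are made up).
df = pd.DataFrame(
    {"feat_A": [20.0, 22.0, np.nan], "feat_B": [15.0, np.nan, 17.0]},
    index=pd.Index(["s1", "s2", "s3"], name="Sample_ID"),
)

# Per-feature statistics; pandas skips NaN by default.
mean = df.mean()
std = df.std()

scaled = (df - mean) / std      # scale a model would be fitted on
restored = scaled * std + mean  # back on the original scale
```

Missing values stay missing through both steps, so only the observed entries are rescaled.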

Parameters:
  • hidden_layers (list[int]) – Architecture for encoder. Decoder is mirrored.

  • latent_dim (int, optional) – Dimension of the latent (hidden) space, by default 15

  • out_folder (str, optional) – Output folder for model, by default ‘.’

  • model (str, optional) – Model type (“VAE”, “DAE”), by default ‘VAE’

  • batch_size (int, optional) – Batch size for training, by default 64
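A possible way to call the transformer, based only on the signatures listed in this reference. The helper name `impute_wide`, the toy data and the chosen hyperparameters are illustrative assumptions; the import is placed inside the function so the sketch can be loaded without pimmslearn installed.

```python
import numpy as np
import pandas as pd


def impute_wide(df: pd.DataFrame):
    """Fit a VAE on df and return the transformer output (requires pimmslearn)."""
    # Lazy import: only needed when the function is actually called.
    from pimmslearn.sklearn.ae_transformer import AETransformer

    model = AETransformer(hidden_layers=[64], latent_dim=5, model="VAE", batch_size=16)
    model.fit(df, epochs_max=10, cuda=False)
    return model.transform(df)


# Toy wide-format data with roughly 10% of the values set to missing.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(25, 2, size=(20, 8)),
    columns=[f"feat_{i}" for i in range(8)],
)
toy = toy.mask(rng.random(toy.shape) < 0.1)
```

Calling `impute_wide(toy)` would then train a small model on the toy data; the values above are far too small for a meaningful fit and only show the call pattern.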

fit(X: DataFrame, y: DataFrame | None = None, epochs_max: int = 100, cuda: bool = True, patience: int | None = None)[source]#

Fit the model to the data.

Parameters:
  • X (pd.DataFrame) – training data of dimension N_samples x M_features

  • y (pd.DataFrame, optional) – validation data points which are missing in X of dimension N_samples x M_features, by default None

  • epochs_max (int, optional) – Maximal number of epochs to train, by default 100

  • cuda (bool, optional) – If the model should be trained with an accelerator, by default True

  • patience (Optional[int], optional) – If set, early stopping is enabled with the specified patience, by default None
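The interplay of `epochs_max` and `patience` can be sketched with a generic early-stopping loop. The stopping criterion used here (stop after `patience` consecutive epochs without a new best validation loss) is a common convention and an assumption; the exact rule used by the library is not stated in this reference. The function name and the list of losses are hypothetical.

```python
def epochs_run(val_losses, patience=None):
    """Return how many epochs run before training stops.

    val_losses holds one validation loss per epoch and stands in for real
    training; its length plays the role of epochs_max.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # Without patience, training always runs for all epochs.
        if patience is not None and epochs_without_improvement >= patience:
            return epoch
    return len(val_losses)


losses = [1.0, 0.8, 0.9, 0.85, 0.7]
stopped_early = epochs_run(losses, patience=2)  # stops after epoch 4
ran_to_end = epochs_run(losses, patience=None)  # runs all 5 epochs
```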

Returns:

Return itself fitted to the training data.

Return type:

AETransformer
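Since `y` holds validation data points that are missing in `X`, one way to build such a pair is to hide a random subset of the observed values. The following is a sketch with a hypothetical helper `split_validation` and made-up data, not a utility provided by the package.

```python
import numpy as np
import pandas as pd


def split_validation(df: pd.DataFrame, frac: float = 0.1, seed: int = 0):
    """Hide a random fraction of observed cells of df as validation data."""
    rng = np.random.default_rng(seed)
    observed = df.notna().to_numpy()
    holdout = observed & (rng.random(df.shape) < frac)
    X = df.mask(holdout)   # held-out cells become NaN in the training data
    y = df.where(holdout)  # only held-out cells keep their value
    return X, y


df = pd.DataFrame(np.arange(40, dtype=float).reshape(10, 4), columns=list("ABCD"))
df.iloc[0, 0] = np.nan  # one value that is truly missing
X, y = split_validation(df, frac=0.2)
```

Both frames keep the N_samples x M_features layout, and no cell is observed in both `X` and `y`.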

set_fit_request(*, cuda: bool | None | str = '$UNCHANGED$', epochs_max: bool | None | str = '$UNCHANGED$', patience: bool | None | str = '$UNCHANGED$') AETransformer#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • cuda (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cuda parameter in fit.

  • epochs_max (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for epochs_max parameter in fit.

  • patience (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for patience parameter in fit.

Returns:

self – The updated object.

Return type:

object

transform(X)[source]#

Impute the data using the trained model.

Parameters:

X (pd.DataFrame) – The data to be imputed, shape (N_samples, M_features).

Returns:

X_transformed – Return the imputed DataFrame using the model.

Return type:

array, shape (N_samples, M_features)

pimmslearn.sklearn.cf_transformer module#

Scikit-learn style interface for Collaborative Filtering model.

class pimmslearn.sklearn.cf_transformer.CollaborativeFilteringTransformer(target_column: str, sample_column: str, item_column: str, n_factors: int = 15, out_folder: str = '.', batch_size: int = 4096)[source]#

Bases: TransformerMixin, BaseEstimator

Collaborative Filtering transformer.

Collaborative filtering operates on long data specifying two identifiers (sample and feature) with a quantitative value to predict. Therefore we need to specify three columns. The sample and feature identifiers are embedded into a space which is then used to predict the quantitative value.
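The embedding idea can be sketched with NumPy. The example assumes the simplest variant, in which the predicted value is the dot product of a sample embedding and an item embedding; the model used by the package may combine the embeddings differently (for example with bias terms or further layers). All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_items, n_factors = 4, 6, 3

# One embedding vector per sample and per item (random here, learned in practice).
sample_emb = rng.normal(size=(n_samples, n_factors))
item_emb = rng.normal(size=(n_items, n_factors))


def predict(sample_idx: int, item_idx: int) -> float:
    """Predicted value as the dot product of the two embedding vectors."""
    return float(sample_emb[sample_idx] @ item_emb[item_idx])


# All sample-item combinations at once: an (n_samples, n_items) matrix.
all_predictions = sample_emb @ item_emb.T
```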

The data is expected as a Series with a MultiIndex of the sample and feature identifiers, and the quantitative value as its values.
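Data in wide format can be brought into this long format with pandas. A minimal sketch with made-up data, using the example column names from the parameter descriptions (Sample_ID, intensity) and a hypothetical item column name `peptide`:

```python
import numpy as np
import pandas as pd

# Wide format: samples as rows, features (here peptides) as columns.
wide = pd.DataFrame(
    {"pep_1": [20.0, 22.0], "pep_2": [np.nan, 17.0]},
    index=pd.Index(["s1", "s2"], name="Sample_ID"),
)

# Long format: one row per observed (sample, feature) pair.
long_data = (
    wide.reset_index()
    .melt(id_vars="Sample_ID", var_name="peptide", value_name="intensity")
    .dropna(subset=["intensity"])
    .set_index(["Sample_ID", "peptide"])["intensity"]
)
```

The result is a Series named `intensity` with a two-level MultiIndex; missing values are simply absent from it.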

Parameters:
  • target_column (str) – Target column name to predict, e.g. intensity

  • item_column (str) – Column name for features (items) to embed, e.g. peptides

  • sample_column (str) – Sample column name, e.g. Sample_ID

  • n_factors (int, optional) – Number of dimensions of the item and sample embeddings, by default 15

  • out_folder (str, optional) – Output folder for model, by default ‘.’

  • batch_size (int, optional) – Batch size for training of data in long format, by default 4096

fit(X: Series, y: Series | None = None, epochs_max=20, cuda: bool = True, patience: int = 1)[source]#

Fit the collaborative filtering model to the data provided in long-format.

Parameters:
  • X (Series, shape (n_values, )) – The training data as a Series with the values of target_column as its values and target_column as its name. The Series has a MultiIndex defined by the item_column and sample_column.

  • y (Series, optional) – The validation data as a Series with the values of target_column as its values and target_column as its name. The Series has a MultiIndex defined by the item_column and sample_column and is of shape (n_values, ), by default None

  • epochs_max (int, optional) – Maximal number of epochs to train, by default 20

  • cuda (bool, optional) – If the model should be trained with an accelerator, by default True

  • patience (int, optional) – Patience used for early stopping, by default 1

Returns:

Return itself fitted to the training data.

Return type:

CollaborativeFilteringTransformer
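A sketch of how training and validation Series could be prepared and passed to `fit`, based on the signatures in this reference. The helpers `split_long` and `fit_cf`, the toy data and the hyperparameters are illustrative assumptions; the package import sits inside the function so the sketch loads without pimmslearn installed.

```python
import numpy as np
import pandas as pd


def split_long(series: pd.Series, frac: float = 0.2, seed: int = 0):
    """Split a long-format Series into disjoint training and validation parts."""
    y = series.sample(frac=frac, random_state=seed)
    X = series[~series.index.isin(y.index)]
    return X, y


def fit_cf(X: pd.Series, y: pd.Series):
    """Fit the collaborative filtering transformer (requires pimmslearn)."""
    from pimmslearn.sklearn.cf_transformer import CollaborativeFilteringTransformer

    model = CollaborativeFilteringTransformer(
        target_column=X.name,
        sample_column=X.index.names[0],
        item_column=X.index.names[1],
        n_factors=5,
    )
    return model.fit(X, y, epochs_max=5, cuda=False, patience=1)


# Toy long-format data: 4 samples x 5 peptides, all observed.
index = pd.MultiIndex.from_product(
    [[f"s{i}" for i in range(4)], [f"pep_{j}" for j in range(5)]],
    names=["Sample_ID", "peptide"],
)
series = pd.Series(np.linspace(20.0, 30.0, len(index)), index=index, name="intensity")
X, y = split_long(series)
```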

plot_loss(y, figsize=(8, 4), save: bool = False)[source]#

Plot the training and validation loss of the model.

set_fit_request(*, cuda: bool | None | str = '$UNCHANGED$', epochs_max: bool | None | str = '$UNCHANGED$', patience: bool | None | str = '$UNCHANGED$') CollaborativeFilteringTransformer#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • cuda (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cuda parameter in fit.

  • epochs_max (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for epochs_max parameter in fit.

  • patience (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for patience parameter in fit.

Returns:

self – The updated object.

Return type:

object

transform(X)[source]#

Predict the missing features in the long data based on the index of sample_column and item_column.

Parameters:

X (Series, shape (n_samples, )) – The data as a Series named target_column with a MultiIndex defined by item_column and sample_column.

Returns:

X_transformed – The complete data with imputed values in long format

Return type:

pd.Series (n_samples, n_features)
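Which sample and item combinations are missing in long data can be worked out from the index alone. The pandas sketch below shows that idea on made-up data; it does not reproduce how `transform` determines the combinations internally.

```python
import pandas as pd

# Observed values in long format; ("s1", "pep_2") is not present.
observed = pd.Series(
    [20.0, 22.0, 17.0],
    index=pd.MultiIndex.from_tuples(
        [("s1", "pep_1"), ("s2", "pep_1"), ("s2", "pep_2")],
        names=["Sample_ID", "peptide"],
    ),
    name="intensity",
)

# All sample-item combinations, built from the index levels.
full_index = pd.MultiIndex.from_product(
    observed.index.levels, names=observed.index.names
)

# Combinations without an observed value are the ones to predict.
missing_index = full_index.difference(observed.index)

# The same information in wide format: missing pairs show up as NaN.
wide = observed.unstack("peptide")
```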