pimmslearn.sklearn package#
Scikit-learn related functions for the ALD part of the project.
Might be moved to a separate package in the future.
- pimmslearn.sklearn.get_PCA(df, n_components=2, imputer=<class 'sklearn.impute._base.SimpleImputer'>)[source]#
Submodules#
pimmslearn.sklearn.ae_transformer module#
Scikit-learn style interface for Denoising and Variational Autoencoder model.
- class pimmslearn.sklearn.ae_transformer.AETransformer(hidden_layers: list[int], latent_dim: int = 15, out_folder: str = '.', model='VAE', batch_size: int = 64)[source]#
Bases: TransformerMixin, BaseEstimator
Autoencoder transformer (Denoising or Variational).
Autoencoder transformer that imputes missing values in the dataset it is fitted to. The data is standard-normalized internally for fitting the model, and imputations are returned on the original scale.
The data uses the wide data format with samples as rows and features as columns.
- Parameters:
hidden_layers (list[int]) – Architecture for encoder. Decoder is mirrored.
latent_dim (int, optional) – Dimension of the latent (hidden) space, by default 15
out_folder (str, optional) – Output folder for model, by default ‘.’
model (str, optional) – Model type (“VAE”, “DAE”), by default ‘VAE’
batch_size (int, optional) – Batch size for training, by default 64
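The scaling behaviour described above (fit on standard-normalized data, return imputations on the original scale) can be sketched with pandas alone. This is not the AETransformer implementation: the "model" below is a hypothetical stand-in that predicts 0 in normalized space (the feature mean), and the sample and feature names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy wide-format data: samples as rows, features as columns (names are made up).
df = pd.DataFrame(
    {"feat_A": [1.0, 2.0, np.nan, 4.0], "feat_B": [10.0, np.nan, 30.0, 50.0]},
    index=["s1", "s2", "s3", "s4"],
)

# Per-feature statistics computed on observed values only (NaNs are skipped).
mean = df.mean()
std = df.std()

# Standard-normalize: this is the scale a model would be trained on.
scaled = (df - mean) / std

# Stand-in for a trained model (assumption): predict 0 for every cell in scaled space.
pred_scaled = pd.DataFrame(0.0, index=df.index, columns=df.columns)

# Map predictions back to the original scale and fill only the missing cells.
pred = pred_scaled * std + mean
imputed = df.fillna(pred)
```

Observed values are left untouched; only the cells that were missing in `df` receive a prediction, expressed in the original units.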
- fit(X: DataFrame, y: DataFrame | None = None, epochs_max: int = 100, cuda: bool = True, patience: int | None = None)[source]#
Fit the model to the data.
- Parameters:
X (pd.DataFrame) – training data of dimension N_samples x M_features
y (pd.DataFrame, optional) – validation data points which are missing in X, of dimension N_samples x M_features, by default None
epochs_max (int, optional) – Maximal number of epochs to train, by default 100
cuda (bool, optional) – If the model should be trained with an accelerator, by default True
patience (Optional[int], optional) – If set, early stopping is used with the specified patience, by default None
- Returns:
Return itself fitted to the training data.
- Return type:
AETransformer
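One way to prepare X and y for fit is to hold out some observed values: they are set to missing in X and kept in y. The following is a minimal sketch with pandas, assuming random masking; the masking fraction, the seed and all names are illustrative choices and not part of pimmslearn.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy wide-format data (N_samples x M_features) with one value already missing.
data = pd.DataFrame(
    rng.normal(size=(6, 4)),
    index=[f"sample_{i}" for i in range(6)],
    columns=[f"feat_{j}" for j in range(4)],
)
data.iloc[0, 1] = np.nan

# Randomly pick about 20% of the cells as validation candidates; cells that
# are already missing stay missing in both X and y.
val_mask = pd.DataFrame(
    rng.random(data.shape) < 0.2, index=data.index, columns=data.columns
) & data.notna()

X = data.mask(val_mask)   # training data: held-out cells set to NaN
y = data.where(val_mask)  # validation data: only the held-out cells are kept
```

X and y have the same shape and never overlap, which matches the description of y as data points that are missing in X.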
- set_fit_request(*, cuda: bool | None | str = '$UNCHANGED$', epochs_max: bool | None | str = '$UNCHANGED$', patience: bool | None | str = '$UNCHANGED$') AETransformer#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
cuda (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cuda parameter in fit.
epochs_max (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for epochs_max parameter in fit.
patience (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for patience parameter in fit.
- Returns:
self – The updated object.
- Return type:
AETransformer
pimmslearn.sklearn.cf_transformer module#
Scikit-learn style interface for Collaborative Filtering model.
- class pimmslearn.sklearn.cf_transformer.CollaborativeFilteringTransformer(target_column: str, sample_column: str, item_column: str, n_factors: int = 15, out_folder: str = '.', batch_size: int = 4096)[source]#
Bases: TransformerMixin, BaseEstimator
Collaborative Filtering transformer.
Collaborative filtering operates on long data specifying two identifiers (sample and feature) with a quantitative value to predict. Therefore we need to specify three columns. The sample and feature identifiers are embedded into a space which is then used to predict the quantitative value.
The data is expected as a Series with a MultiIndex of the sample and feature identifiers, and the quantitative value as its values.
- Parameters:
target_column (str) – Target column name to predict, e.g. intensity
item_column (str) – Column name for features (items) to embed, e.g. peptides
sample_column (str) – Sample column name, e.g. Sample_ID
n_factors (int, optional) – Number of dimensions of the item and sample embeddings, by default 15
out_folder (str, optional) – Output folder for model, by default ‘.’
batch_size (int, optional) – Batch size for training of data in long format, by default 4096
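The long format described above can be derived from a wide table with pandas. This is a minimal sketch: the names Sample_ID, peptide and intensity follow the examples given for sample_column, item_column and target_column, while the toy values and the order of the index levels are assumptions.

```python
import numpy as np
import pandas as pd

# Toy wide-format data: samples as rows, features as columns (values are made up).
wide = pd.DataFrame(
    {"pep_1": [20.1, np.nan], "pep_2": [18.3, 19.0]},
    index=pd.Index(["s1", "s2"], name="Sample_ID"),
)
wide.columns.name = "peptide"

# Stack into one row per (sample, feature) pair and keep only observed values.
# The explicit dropna() makes the result independent of the pandas stack default.
long = wide.stack().dropna().rename("intensity")
```

The resulting Series is named after the target column and indexed by a two-level MultiIndex of sample and feature identifiers, with one entry per observed value.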
- fit(X: Series, y: Series | None = None, epochs_max=20, cuda: bool = True, patience: int = 1)[source]#
Fit the collaborative filtering model to the data provided in long-format.
- Parameters:
X (Series, shape (n_values, )) – The training data as a Series with the target values as its values and target_column as its name. The Series has a MultiIndex defined by the item_column and sample_column.
y (Series, optional) – The validation data as a Series with the target values as its values and target_column as its name. The Series has a MultiIndex defined by the item_column and sample_column and is of shape (n_values, ), by default None
epochs_max (int, optional) – Maximal number of epochs to train, by default 20
cuda (bool, optional) – If the model should be trained with an accelerator, by default True
patience (int, optional) – Patience used for early stopping, by default 1
- Returns:
Return itself fitted to the training data.
- Return type:
CollaborativeFilteringTransformer
- plot_loss(y, figsize=(8, 4), save: bool = False)[source]#
Plot the training and validation loss of the model.
- set_fit_request(*, cuda: bool | None | str = '$UNCHANGED$', epochs_max: bool | None | str = '$UNCHANGED$', patience: bool | None | str = '$UNCHANGED$') CollaborativeFilteringTransformer#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
cuda (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cuda parameter in fit.
epochs_max (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for epochs_max parameter in fit.
patience (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for patience parameter in fit.
- Returns:
self – The updated object.
- Return type:
CollaborativeFilteringTransformer
- transform(X)[source]#
Predict the missing features in the long data based on the index of sample_column and item_column.
- Parameters:
X (Series, shape (n_samples, )) – The training data with columns target_column, item_column and sample_column.
- Returns:
X_transformed – The complete data with imputed values in long format
- Return type:
pd.Series (n_samples, n_features)
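A completed long-format Series can be turned back into the wide layout (samples as rows, features as columns) with pandas. The Series below is a hypothetical stand-in for the output of transform; its values and the names Sample_ID, peptide and intensity are illustrative assumptions.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("s1", "pep_1"), ("s1", "pep_2"), ("s2", "pep_1"), ("s2", "pep_2")],
    names=["Sample_ID", "peptide"],
)
# Hypothetical completed long-format Series, as a transform step might return it.
completed = pd.Series([20.1, 18.3, 19.5, 19.0], index=index, name="intensity")

# Pivot the feature level into columns: samples as rows, features as columns.
wide = completed.unstack("peptide")
```

Because every (sample, feature) pair is present in the completed Series, the wide table contains no missing cells.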