pimmslearn.pandas package

pimmslearn.pandas.calc_errors_per_feat(pred: DataFrame, freq_feat: Series, target_col='observed') → DataFrame

Calculate absolute errors and sort by freq of features.

pimmslearn.pandas.flatten_dict_of_dicts(d: dict, parent_key: str = '') → dict

Build tuples for nested dictionaries for use as pandas.MultiIndex.

Parameters:
  • d (dict) – Nested dictionary for which all keys are flattened to tuples.

  • parent_key (str, optional) – Outer key (used for recursion), by default ‘’

Returns:

Flattened dictionary with tuple keys: {(outer_key, …, inner_key) : value}

Return type:

dict
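
The flattening described above can be illustrated with a minimal, standard-library sketch. It is not the library's implementation: the name `flatten_nested` and the tuple-valued `parent_key` are assumptions made for this example (the documented function takes a string `parent_key`).

```python
def flatten_nested(d: dict, parent_key: tuple = ()) -> dict:
    """Flatten nested dicts into one dict keyed by tuples of the key path.

    Hypothetical helper for illustration only, not part of pimmslearn.
    """
    flat = {}
    for key, value in d.items():
        path = parent_key + (key,)
        if isinstance(value, dict):
            # Recurse, carrying the path collected so far.
            flat.update(flatten_nested(value, path))
        else:
            flat[path] = value
    return flat


nested = {"model": {"lr": 0.01, "layers": {"n": 2}}, "seed": 42}
flat = flatten_nested(nested)
# Keys are tuples such as ("model", "layers", "n").
```

When all tuples have the same length, the keys can be passed to pandas.MultiIndex.from_tuples.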

pimmslearn.pandas.get_absolute_error(pred: DataFrame, y_true: str = 'observed') → DataFrame
pimmslearn.pandas.get_columns_accessor(df: DataFrame, all_lower_case=False) → OmegaConf
pimmslearn.pandas.get_columns_accessor_from_iterable(cols: Iterable[str], all_lower_case=False) → OmegaConf
pimmslearn.pandas.get_columns_namedtuple(df: DataFrame) → namedtuple

Create namedtuple instance of column names. Spaces in column names are replaced with underscores in the look-up.

Parameters:

df (pd.DataFrame) – A pandas DataFrame

Returns:

NamedTuple instance with columns as attributes.

Return type:

namedtuple
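
As a rough illustration of the look-up described above, the following sketch builds a namedtuple from a plain list of column names. The helper name `columns_namedtuple` and the type name `Columns` are assumptions for this example; the documented function takes a DataFrame instead.

```python
from collections import namedtuple


def columns_namedtuple(columns):
    """Map column names to attributes, replacing spaces with underscores.

    Hypothetical helper for illustration only, not part of pimmslearn.
    """
    fields = [col.replace(" ", "_") for col in columns]
    Columns = namedtuple("Columns", fields)
    # Each attribute holds the original, unmodified column name.
    return Columns(*columns)


cols = columns_namedtuple(["Sample ID", "intensity"])
# cols.Sample_ID gives back the original name "Sample ID".
```

This only works if the resulting field names are valid Python identifiers, which is a limitation of namedtuple itself.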

pimmslearn.pandas.get_counts_per_bin(df: DataFrame, bins: range, columns: List[str] | None = None) → DataFrame

Return counts per bin for selected columns in DataFrame.

pimmslearn.pandas.get_last_index_matching_proportion(df_counts: DataFrame, prop: float = 0.25, prop_col: str = 'proportion') → object

df_counts needs to be sorted by “prop_col” (descending).

Parameters:
  • df_counts (pd.DataFrame) – DataFrame of counts, sorted in descending order along the proportion column (as stated in the summary). Index should be unique.

  • prop (float, optional) – cutoff, inclusive, by default 0.25

  • prop_col (str, optional) – column name for proportion, by default ‘proportion’

Returns:

Index value for cutoff

Return type:

object
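
A minimal sketch of such a cutoff look-up, assuming that "matching" means a proportion greater than or equal to `prop` and that the rows are sorted in descending order. The function name `last_index_at_or_above` is hypothetical and the logic is an assumption, not the library's code.

```python
import pandas as pd


def last_index_at_or_above(df_counts, prop=0.25, prop_col="proportion"):
    """Return the index label of the last row with proportion >= prop.

    Hypothetical helper; assumes descending order along prop_col.
    Raises IndexError if no row reaches the cutoff.
    """
    mask = df_counts[prop_col] >= prop
    return df_counts.loc[mask].index[-1]


df_counts = pd.DataFrame(
    {"proportion": [0.9, 0.5, 0.25, 0.1]},
    index=["a", "b", "c", "d"],
)
cutoff = last_index_at_or_above(df_counts)  # cutoff is inclusive
```

With the default cutoff of 0.25, the row labelled "c" is the last one that still reaches the threshold.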

pimmslearn.pandas.get_lower_whiskers(df: DataFrame, factor: float = 1.5) → Series
pimmslearn.pandas.get_unique_non_unique_columns(df: DataFrame) → SimpleNamespace

Return a namespace with one column Index each for the unique and the non-unique columns.

Parameters:

df (pd.DataFrame)

Returns:

SimpleNamespace with the attributes unique and non_unique, each an index of column names.

Return type:

types.SimpleNamespace

pimmslearn.pandas.highlight_min(s: Series) → list

Highlight the minimum of a Series in yellow, for use with pandas.DataFrame.style.

Parameters:

s (pd.Series) – Pandas Series

Returns:

List of strings containing the background color for the values specified. To be used as pandas.DataFrame.style.apply(highlight_min)

Return type:

list
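
The kind of function that pandas.DataFrame.style.apply expects can be sketched as follows. The name `highlight_min_sketch` and the exact CSS string are assumptions for illustration; the library's own return values may differ.

```python
import pandas as pd


def highlight_min_sketch(s):
    """Return one CSS string per entry, marking the minimum in yellow.

    Hypothetical stand-in for illustration, not the library function.
    """
    is_min = s == s.min()
    return ["background-color: yellow" if flag else "" for flag in is_min]


s = pd.Series([3.0, 1.0, 2.0])
styles = highlight_min_sketch(s)
# One entry per value; only the minimum gets a non-empty style.
```

Such a function is applied column-wise by default, for example via df.style.apply(highlight_min_sketch), which needs the optional jinja2 dependency of pandas.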

pimmslearn.pandas.index_to_dict(index: Index) → dict
pimmslearn.pandas.interpolate(wide_df: DataFrame, name='interpolated') → DataFrame

Interpolate NA values using the values before and after. Uses n=3 replicates: for the first row, the replicates are the two following rows; for the last row, they are the two preceding rows.

Parameters:
  • wide_df (pd.DataFrame) – rows are samples, columns are measurements

  • name (str, optional) – name for measurement in columns, by default ‘interpolated’

Returns:

pd.DataFrame in long-format

Return type:

pd.DataFrame

pimmslearn.pandas.key_map(d: dict) → dict

Build a schema of dicts

Parameters:

d (dict) – dictionary of dictionaries

Returns:

Key map of dictionaries

Return type:

dict

pimmslearn.pandas.length(x)

Length function that returns 0 if the object (probably np.nan) has no length. Otherwise, returns the length of a list, pandas.Series, numpy.array, dict, etc.
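
One way to get this behaviour is to fall back to 0 when `len` raises a `TypeError`. The sketch below uses the hypothetical name `safe_length` and is an assumption about the approach, not the library's code.

```python
def safe_length(x):
    """Return len(x), or 0 for objects without a length (e.g. a float NaN).

    Hypothetical helper for illustration only, not part of pimmslearn.
    """
    try:
        return len(x)
    except TypeError:
        # Floats such as NaN do not define __len__.
        return 0


n_missing = safe_length(float("nan"))
n_list = safe_length([1, 2, 3])
n_dict = safe_length({"a": 1})
```

This is useful when a column mixes list-like entries with missing values and is processed element-wise.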

pimmslearn.pandas.parse_query_expression(s: str, printable: str = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ') → str

Convert a query expression for pd.DataFrame.query into a string usable as a file name by removing all characters not listed in printable.
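
The character filtering can be sketched with the standard library alone. The name `query_to_filename` is hypothetical; the allowed characters mirror the default of `printable` shown in the signature above.

```python
import string

# Digits, lower- and upper-case ASCII letters, and the space character.
PRINTABLE = string.digits + string.ascii_letters + " "


def query_to_filename(s, printable=PRINTABLE):
    """Drop every character of s that is not listed in printable.

    Hypothetical helper for illustration only, not part of pimmslearn.
    """
    return "".join(ch for ch in s if ch in printable)


name = query_to_filename("age>=30 & sex=='F'")
# Operators and quotes are removed; spaces and alphanumerics are kept.
```

Note that spaces survive the filtering, so the result may contain consecutive spaces where an operator was removed.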

pimmslearn.pandas.prop_unique_index(df: DataFrame) → DataFrame
pimmslearn.pandas.replace_with(string_key: str, replace: str = '()/', replace_with: str = '') → str
pimmslearn.pandas.select_max_by(df: DataFrame, grouping_columns: list, selection_column: str) → DataFrame
pimmslearn.pandas.unique_cols(s: Series) → bool

Check whether all entries in a pandas.Series are equal.

Ref: https://stackoverflow.com/a/54405767/968487

Parameters:

s (pandas.Series) – Series to check uniqueness

Returns:

True if all values are equal, False otherwise.

Return type:

bool
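
A common way to perform this check is to compare every entry against the first one. The sketch below uses the hypothetical name `all_equal` and is an assumption about the approach, not a copy of the library's code.

```python
import pandas as pd


def all_equal(s):
    """Return True if every entry of a non-empty Series equals the first.

    Hypothetical helper for illustration only; an empty Series raises
    IndexError here.
    """
    values = s.to_numpy()
    return bool((values[0] == values).all())


same = all_equal(pd.Series([1, 1, 1]))
different = all_equal(pd.Series([1, 2, 1]))
```

Because NaN does not compare equal to itself, a Series containing NaN is reported as not all equal by this sketch.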

Submodules

pimmslearn.pandas.calc_errors module

pimmslearn.pandas.calc_errors.calc_errors_per_bin(pred: DataFrame, target_col='observed') → DataFrame

Calculate absolute errors. Bin by integer value of simulated NA and provide count.

pimmslearn.pandas.calc_errors.calc_errors_per_feat(pred: DataFrame, freq_feat: Series, target_col='observed') → DataFrame

Calculate absolute errors and sort by freq of features.

pimmslearn.pandas.calc_errors.get_absolute_error(pred: DataFrame, y_true: str = 'observed') → DataFrame

pimmslearn.pandas.missing_data module

Functionality related to analyzing missing values in a pandas DataFrame.

pimmslearn.pandas.missing_data.decompose_NAs(data: DataFrame, level: int | str, label: int = 'summary') → DataFrame

Decompose missing values by a level into real and indirectly imputed missing values. A real missing value is missing for all samples in a group. Indirectly imputed missing values are those, in MS-based proteomics data, that would be filled with the mean (or median) of the observed values in the group if the mean (or median) were used for imputation.

Parameters:
  • data (pd.DataFrame) – DataFrame with samples in columns and features in rows.

  • level (Union[int, str]) – Index level to group by. Examples: Protein groups, peptides or precursors in MS data.

  • label (int, optional) – Column name of single column dataframe returned, by default ‘summary’

Returns:

One column DataFrame with summary information about missing values.

Return type:

pd.DataFrame

pimmslearn.pandas.missing_data.get_record(data: DataFrame, columns_sample=False) → dict

Get summary record of data.

pimmslearn.pandas.missing_data.list_files(folder='.') → list[str]
pimmslearn.pandas.missing_data.percent_missing(df: DataFrame)

Total percentage of missing values in a DataFrame.

Parameters:

df (pd.DataFrame) – DataFrame with data.

Returns:

Proportion of missing values in the DataFrame.

Return type:

float
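
The share of missing cells can be computed in one line with pandas. The sketch below uses the hypothetical name `share_missing` and assumes, following the Returns entry, that the value is a proportion between 0 and 1 rather than a number between 0 and 100.

```python
import pandas as pd


def share_missing(df):
    """Return the fraction of cells in df that are missing.

    Hypothetical helper for illustration only, not part of pimmslearn.
    """
    # isna() gives a boolean frame; its mean over all cells is the share.
    return float(df.isna().to_numpy().mean())


nan = float("nan")
df = pd.DataFrame({"a": [1.0, nan], "b": [nan, nan]})
share = share_missing(df)  # three of four cells are missing
```

Multiplying the result by 100 gives a percentage if that is the preferred unit.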

pimmslearn.pandas.missing_data.percent_non_missing(df: DataFrame) → float