vaep.pandas package

vaep.pandas.flatten_dict_of_dicts(d: dict, parent_key: str = '') → dict

Build tuples for nested dictionaries for use as pandas.MultiIndex.

Parameters:
  • d (dict) – Nested dictionary for which all keys are flattened to tuples.

  • parent_key (str, optional) – Outer key (used for recursion), by default ‘’

Returns:

Flattened dictionary with tuple keys: {(outer_key, …, inner_key) : value}

Return type:

dict
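
The flattening described above can be sketched in plain Python. The helper name `flatten_sketch` and the sample dictionary are hypothetical; this illustrates the documented behaviour and is not taken from the package source.

```python
def flatten_sketch(d: dict, parent_key: tuple = ()) -> dict:
    """Flatten nested dicts into {(outer_key, ..., inner_key): value}."""
    flat = {}
    for key, value in d.items():
        new_key = parent_key + (key,)
        if isinstance(value, dict):
            # Recurse, carrying along the keys collected so far.
            flat.update(flatten_sketch(value, new_key))
        else:
            flat[new_key] = value
    return flat


# Hypothetical nested metrics dictionary.
nested = {"model_a": {"train": {"loss": 0.1}, "valid": {"loss": 0.2}}}
flat = flatten_sketch(nested)
print(flat)
```

When all tuple keys have the same length, as here, they can be passed to pandas.MultiIndex.from_tuples.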

vaep.pandas.get_columns_accessor(df: DataFrame, all_lower_case=False) → OmegaConf
vaep.pandas.get_columns_accessor_from_iterable(cols: Iterable[str], all_lower_case=False) → OmegaConf
vaep.pandas.get_columns_namedtuple(df: DataFrame) → namedtuple

Create namedtuple instance of column names. Spaces in column names are replaced with underscores in the look-up.

Parameters:

df (pd.DataFrame) – A pandas DataFrame

Returns:

NamedTuple instance with columns as attributes.

Return type:

namedtuple
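
A minimal sketch of the look-up idea, assuming the column names become valid Python identifiers once spaces are replaced. The helper `columns_namedtuple_sketch` and the column names are hypothetical; a plain list stands in for `df.columns`.

```python
from collections import namedtuple


def columns_namedtuple_sketch(columns):
    # Field names must be valid identifiers, so spaces become underscores.
    fields = [str(col).replace(" ", "_") for col in columns]
    Columns = namedtuple("Columns", fields)
    # The attribute values keep the original column names.
    return Columns(*columns)


cols = columns_namedtuple_sketch(["Sample ID", "intensity"])
print(cols.Sample_ID)
```

The attribute `cols.Sample_ID` returns the original name with its space, so it can be used directly to select the column.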

vaep.pandas.get_counts_per_bin(df: DataFrame, bins: range, columns: Optional[List[str]] = None)

Return counts per bin for selected columns in DataFrame.
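
One way to obtain counts per bin is sketched below with pandas.cut. This is an assumption about the approach, not the package's implementation: the helper `counts_per_bin_sketch`, the right-closed intervals (the pandas.cut default) and the sample data are all illustrative.

```python
import pandas as pd


def counts_per_bin_sketch(df, bins, columns=None):
    # Fall back to all columns when none are selected.
    columns = list(df.columns) if columns is None else columns
    edges = list(bins)
    counts = {
        # Assign each value to an interval, then count values per interval.
        col: pd.cut(df[col], bins=edges).value_counts().sort_index()
        for col in columns
    }
    return pd.DataFrame(counts)


df = pd.DataFrame({"a": [1.5, 2.5, 2.7], "b": [1.2, 1.8, 2.1]})
counts = counts_per_bin_sketch(df, bins=range(1, 4))
print(counts)
```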

vaep.pandas.get_last_index_matching_proportion(df_counts: DataFrame, prop: float = 0.25, prop_col: str = 'proportion') → object

df_counts needs to be sorted by “prop_col” (descending).

Parameters:
  • df_counts (pd.DataFrame) – counts sorted in descending order along the proportion column. Index should be unique.

  • prop (float, optional) – cutoff, inclusive, by default 0.25

  • prop_col (str, optional) – column name for proportion, by default ‘proportion’

Returns:

Index value for cutoff

Return type:

object
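
The cutoff look-up can be sketched as follows, assuming the counts are sorted in descending order by the proportion column and the cutoff is inclusive. The helper `last_index_matching_sketch` and the feature names are hypothetical.

```python
import pandas as pd


def last_index_matching_sketch(df_counts, prop=0.25, prop_col="proportion"):
    assert df_counts.index.is_unique
    # Inclusive cutoff: keep rows whose proportion is at least `prop`.
    mask = df_counts[prop_col] >= prop
    # With descending order, the last kept row marks the cutoff.
    return df_counts.loc[mask].index[-1]


df_counts = pd.DataFrame(
    {"proportion": [0.9, 0.5, 0.25, 0.1]},
    index=["feat_a", "feat_b", "feat_c", "feat_d"],
)
idx = last_index_matching_sketch(df_counts)
print(idx)
```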

vaep.pandas.get_lower_whiskers(df: DataFrame, factor: float = 1.5) → Series
vaep.pandas.get_unique_non_unique_columns(df: DataFrame) → SimpleNamespace

Return a namespace holding two column indices: one for the unique columns and one for the non-unique columns.

Parameters:

df (pd.DataFrame) –

Returns:

SimpleNamespace with unique and non_unique column names indices.

Return type:

types.SimpleNamespace
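
A minimal sketch of such a split. It assumes that a "unique" column is one whose entries are all equal (in line with unique_cols on this page); that reading, the helper `split_columns_sketch` and the sample data are assumptions.

```python
from types import SimpleNamespace

import pandas as pd


def split_columns_sketch(df):
    # Assumption: a column is "unique" when it holds one distinct value.
    is_constant = (df.nunique(dropna=False) == 1).to_numpy()
    return SimpleNamespace(
        unique=df.columns[is_constant],
        non_unique=df.columns[~is_constant],
    )


df = pd.DataFrame({"run": ["r1", "r1"], "intensity": [1.0, 2.0]})
cols = split_columns_sketch(df)
print(cols.unique, cols.non_unique)
```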

vaep.pandas.highlight_min(s: Series) → list

Highlight the minimum of a Series in yellow, for use with pandas.DataFrame.style

Parameters:

s (pd.Series) – Pandas Series

Returns:

list of strings containing the background color for the values specified. To be used as pandas.DataFrame.style.apply(highlight_min)

Return type:

list
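
The returned list could look like the output of the sketch below. The helper `highlight_min_sketch` and the exact CSS string are assumptions chosen to match the description.

```python
import pandas as pd


def highlight_min_sketch(s: pd.Series) -> list:
    # Mark every entry equal to the minimum of the Series.
    is_min = s == s.min()
    return ["background-color: yellow" if flag else "" for flag in is_min]


styles = highlight_min_sketch(pd.Series([3.0, 1.0, 2.0]))
print(styles)
```

Passing such a function to pandas.DataFrame.style.apply applies it column by column by default.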

vaep.pandas.index_to_dict(index: Index) → dict
vaep.pandas.interpolate(wide_df: DataFrame, name='interpolated') → DataFrame

Interpolate NA values using the values before and after. Uses n=3 replicates. The first row's replicates are the two following rows; the last row's replicates are the two preceding rows.

Parameters:
  • wide_df (pd.DataFrame) – rows are samples, columns are measurements

  • name (str, optional) – name for measurement in columns, by default ‘interpolated’

Returns:

pd.DataFrame in long-format

Return type:

pd.DataFrame

vaep.pandas.key_map(d: dict) → dict

Build a schema of dicts

Parameters:

d (dict) – dictionary of dictionaries

Returns:

Key map of dictionaries

Return type:

dict

vaep.pandas.length(x)

Length function which returns 0 if the object (probably np.nan) has no length. Otherwise returns the length of a list, pandas.Series, numpy.array, dict, etc.
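
The behaviour can be sketched with a guarded call to `len`. The helper name `length_sketch` is hypothetical, and catching `TypeError` is one plausible way to detect objects without a length.

```python
def length_sketch(x) -> int:
    try:
        return len(x)
    except TypeError:
        # Objects such as float("nan") define no length.
        return 0


lengths = [
    length_sketch([1, 2, 3]),
    length_sketch({"a": 1}),
    length_sketch(float("nan")),
]
print(lengths)
```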

vaep.pandas.parse_query_expression(s: str, printable: str = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ') → str

Parse a query expression for pd.DataFrame.query to a file name. Removes all characters not listed in printable.
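
A minimal sketch of the filtering step, using the same default character set as the signature above. The helper `parse_query_sketch` and the example query are hypothetical.

```python
import string

# Digits, ASCII letters and the space character, as in the default above.
PRINTABLE = string.digits + string.ascii_letters + " "


def parse_query_sketch(s: str, printable: str = PRINTABLE) -> str:
    # Keep only the characters listed in `printable`.
    return "".join(ch for ch in s if ch in printable)


fname = parse_query_sketch("charge > 2")
print(repr(fname))
```

Removed operators leave their surrounding spaces behind, so the result can contain consecutive spaces.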

vaep.pandas.prop_unique_index(df: DataFrame) → DataFrame
vaep.pandas.replace_with(string_key: str, replace: str = '()/', replace_with: str = '') → str
vaep.pandas.select_max_by(df: DataFrame, grouping_columns: list, selection_column: str) → DataFrame
vaep.pandas.unique_cols(s: Series) → bool

Check whether all entries in a pandas.Series are equal.

Ref: https://stackoverflow.com/a/54405767/968487

Parameters:

s (pandas.Series) – Series to check uniqueness

Returns:

True if all values are equal, otherwise False.

Return type:

bool
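
One common way to run this check is to compare every entry with the first one, sketched below. The helper `all_equal_sketch` is hypothetical and may differ from the package's code.

```python
import pandas as pd


def all_equal_sketch(s: pd.Series) -> bool:
    values = s.to_numpy()
    # Compare every entry with the first one.
    return bool((values[0] == values).all())


same = all_equal_sketch(pd.Series([1, 1, 1]))
different = all_equal_sketch(pd.Series([1, 2, 1]))
print(same, different)
```

In this sketch an empty Series raises an IndexError, and a Series containing NaN is reported as not all equal, because NaN does not compare equal to itself.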

Submodules

vaep.pandas.calc_errors module

vaep.pandas.calc_errors.calc_errors_per_bin(pred: DataFrame, target_col='observed') → DataFrame

Calculate absolute errors. Bin by integer value of simulated NA and provide count.

vaep.pandas.calc_errors.calc_errors_per_feat(pred: DataFrame, freq_feat: Series, target_col='observed') → DataFrame

Calculate absolute errors and sort by the frequency of features.

vaep.pandas.calc_errors.get_absolute_error(pred: DataFrame, y_true: str = 'observed') → DataFrame
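
The absolute errors that the functions in this module work with can be sketched as below. The sketch assumes that every column other than the target column holds the predictions of one model. That layout, the helper `absolute_error_sketch` and the column names `model_a` and `model_b` are assumptions.

```python
import pandas as pd


def absolute_error_sketch(pred: pd.DataFrame, y_true: str = "observed") -> pd.DataFrame:
    # Assumption: all columns except `y_true` are model predictions.
    predictions = pred.drop(columns=y_true)
    # Subtract the observed values row by row and take the magnitude.
    return predictions.sub(pred[y_true], axis=0).abs()


pred = pd.DataFrame(
    {
        "observed": [10.0, 12.0],
        "model_a": [11.0, 12.5],
        "model_b": [9.0, 15.0],
    }
)
errors = absolute_error_sketch(pred)
print(errors)
```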

vaep.pandas.missing_data module

vaep.pandas.missing_data.get_record(data: DataFrame, columns_sample=False) → dict

Get summary record of data.

vaep.pandas.missing_data.list_files(folder='.') → list[str]
vaep.pandas.missing_data.percent_missing(df: DataFrame) → float
vaep.pandas.missing_data.percent_non_missing(df: DataFrame) → float
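
The share of missing values in a DataFrame can be computed as in the sketch below. This page does not state whether the two functions above return a fraction or a value scaled to 100; the sketch returns a fraction, and the helper `fraction_missing_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd


def fraction_missing_sketch(df: pd.DataFrame) -> float:
    # Number of NA cells divided by the total number of cells.
    return float(df.isna().sum().sum() / df.size)


df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})
missing = fraction_missing_sketch(df)
non_missing = 1 - missing
print(missing, non_missing)
```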