Imputation

The survey_enhance.impute module contains classes for imputing missing values in a dataset. The classes train random forest models and use them to predict the missing values, and also provide some functionality for adjusting the distribution of the predicted values.

class survey_enhance.impute.Imputation

Bases: object

An Imputation represents a learned function f(input_variables) -> output_variables.

X_category_mappings: List[Dict[str, int]] = None

The mapping from category names to integers for each input variable.
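
As an illustration of what such a mapping might look like, the sketch below builds one dictionary per categorical column and applies it to rows of plain Python data. The helper names build_category_mappings and encode_rows are hypothetical and are not part of survey_enhance; the library's own encoding may order or store categories differently.

```python
from typing import Dict, List


def build_category_mappings(rows: List[dict], columns: List[str]) -> List[Dict[str, int]]:
    """Return one {category name: integer code} dict per column (hypothetical helper)."""
    mappings = []
    for column in columns:
        # Sort the distinct names so the codes are reproducible across runs.
        categories = sorted({row[column] for row in rows})
        mappings.append({name: code for code, name in enumerate(categories)})
    return mappings


def encode_rows(rows: List[dict], columns: List[str], mappings: List[Dict[str, int]]) -> List[dict]:
    """Replace category names with integer codes, leaving other columns untouched."""
    encoded = []
    for row in rows:
        new_row = dict(row)
        for column, mapping in zip(columns, mappings):
            new_row[column] = mapping[row[column]]
        encoded.append(new_row)
    return encoded


rows = [
    {"region": "north", "tenure": "rent", "age": 34},
    {"region": "south", "tenure": "own", "age": 51},
    {"region": "north", "tenure": "own", "age": 27},
]
columns = ["region", "tenure"]
mappings = build_category_mappings(rows, columns)
encoded = encode_rows(rows, columns, mappings)
```

Storing the mappings alongside the trained model means the same codes can be reused when encoding a new dataset at prediction time.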

X_columns: List[str]

The names of the input variables.

Y_columns: List[str]

The names of the output variables.

encode_categories(X: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

static load(path: str) → survey_enhance.impute.Imputation

Load the imputation model from disk.

Parameters

path (str) – The path to load the model from.

Returns

The imputation model.

Return type

Imputation

models: List[survey_enhance.impute.ManyToOneImputation]

Each column of the output variables is predicted by a separate model, stored in this list.

predict(X: pandas.core.frame.DataFrame, mean_quantile: float = 0.5, verbose: bool = False) → pandas.core.frame.DataFrame

Predict the output variables for the input dataset.

Parameters
  • X (pd.DataFrame) – The dataset to predict on.

  • mean_quantile (float) – The mean quantile under the Beta distribution used for the imputation. Defaults to 0.5.

Returns

The predicted dataset.

Return type

pd.DataFrame

random_generator: numpy.random._generator.Generator = None

The random generator used to sample from the distribution of the imputation.

save(path: str)

Save the imputation model to disk.

Parameters

path (str) – The path to save the model to.

solve_for_mean_quantiles(targets: list, input_data: pandas.core.frame.DataFrame, weights: pandas.core.series.Series)

train(X: pandas.core.frame.DataFrame, Y: pandas.core.frame.DataFrame, num_trees: int = 100)

Train a random forest model to predict the output variables from the input variables.

Parameters
  • X (pd.DataFrame) – The dataset containing the input variables.

  • Y (pd.DataFrame) – The dataset containing the output variables.

  • num_trees (int, optional) – The number of trees in each random forest. Defaults to 100.

class survey_enhance.impute.ManyToOneImputation

Bases: object

A ManyToOneImputation predicts a single output variable from the input variables. An Imputation consists of a set of ManyToOneImputation models, one for each output variable.

model: sklearn.ensemble._forest.RandomForestRegressor

The random forest model.

predict(X: pandas.core.frame.DataFrame, mean_quantile: float = 0.5, random_generator: Optional[numpy.random._generator.Generator] = None) → pandas.core.frame.DataFrame

Predict the output variable for the input dataset.

Parameters
  • X (pd.DataFrame) – The dataset to predict on.

  • mean_quantile (float) – The mean quantile under the Beta distribution.

  • random_generator (np.random.Generator) – The random generator.

Returns

The predicted distribution of values for each input row.

Return type

pd.Series
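
The role of mean_quantile can be sketched without a random forest. The example below assumes, purely as an illustration, that each input row has a list of candidate predictions (for instance one per tree), that a quantile is drawn per row from Beta(a, 1) with a = mean_quantile / (1 - mean_quantile), and that the returned value is the candidate at that quantile. Beta(a, 1) has mean a / (a + 1), which equals mean_quantile for this choice of a. The function name sample_predictions and this parameterisation are assumptions, not the library's documented behaviour.

```python
import random
from typing import List, Optional


def sample_predictions(
    candidates: List[List[float]],
    mean_quantile: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Pick one value per row at a Beta-distributed quantile (hypothetical helper)."""
    if not 0 < mean_quantile < 1:
        raise ValueError("mean_quantile must be strictly between 0 and 1")
    rng = rng or random.Random()
    # Beta(a, 1) has mean a / (a + 1), which equals mean_quantile for this a.
    a = mean_quantile / (1 - mean_quantile)
    results = []
    for row in candidates:
        ordered = sorted(row)
        quantile = rng.betavariate(a, 1)
        # Map the quantile in [0, 1] to an index into the sorted candidates.
        index = min(int(quantile * len(ordered)), len(ordered) - 1)
        results.append(ordered[index])
    return results


# Toy candidate predictions for 1000 rows.
candidates = [[10.0, 12.0, 15.0, 20.0], [1.0, 2.0, 3.0, 4.0]] * 500
low = sample_predictions(candidates, mean_quantile=0.2, rng=random.Random(0))
high = sample_predictions(candidates, mean_quantile=0.8, rng=random.Random(0))
```

Under these assumptions, a higher mean_quantile shifts the sampled values toward the upper end of each row's candidates, so it acts as a single knob for moving the aggregate of the imputed values up or down.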

solve_for_mean_quantile(target: float, input_df: pandas.core.frame.DataFrame, weights: numpy.ndarray, max_iterations: int = 10, verbose: bool = False)

Solve for the mean quantile that produces the target value.

Parameters
  • target (float) – The target value.

  • input_df (pd.DataFrame) – The input dataset.

  • weights (np.ndarray) – The sample weights.

  • max_iterations (int, optional) – The maximum number of iterations. Defaults to 10.

  • verbose (bool, optional) – Whether to print the loss at each iteration. Defaults to False.

Returns

The mean quantile.

Return type

float
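
One way to carry out such a search is bisection. The sketch below assumes that the target is a weighted total of the predictions and that this total does not decrease as the mean quantile increases; neither assumption is confirmed by the documentation above. The names solve_by_bisection, weighted_total and toy_total are hypothetical.

```python
from typing import Callable


def solve_by_bisection(
    target: float,
    weighted_total: Callable[[float], float],
    max_iterations: int = 10,
    verbose: bool = False,
) -> float:
    """Search (0, 1) for the mean quantile whose weighted total is closest to target."""
    lower, upper = 0.0, 1.0
    best_quantile, best_loss = 0.5, float("inf")
    for iteration in range(max_iterations):
        midpoint = (lower + upper) / 2
        total = weighted_total(midpoint)
        loss = (total - target) ** 2
        if verbose:
            print(f"iteration {iteration}: mean_quantile={midpoint:.4f}, loss={loss:.4f}")
        if loss < best_loss:
            best_quantile, best_loss = midpoint, loss
        # Assumes weighted_total is non-decreasing in the mean quantile.
        if total < target:
            lower = midpoint
        else:
            upper = midpoint
    return best_quantile


# Toy stand-in for "predict with this mean quantile, then take the weighted sum".
def toy_total(mean_quantile: float) -> float:
    return 1000.0 * mean_quantile


solution = solve_by_bisection(target=300.0, weighted_total=toy_total, max_iterations=10)
```

Each iteration halves the search interval, so ten iterations narrow the mean quantile to within roughly 0.001 in this sketch.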

train(X: pandas.core.frame.DataFrame, y: pandas.core.series.Series, sample_weight: Optional[pandas.core.series.Series] = None, num_trees: int = 100)

Train a random forest model to predict the output variable from the input variables.

Parameters
  • X (pd.DataFrame) – The dataset containing the input variables.

  • y (pd.Series) – The dataset containing the output variable.

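  • sample_weight (pd.Series, optional) – The sample weights. Defaults to None.

  • num_trees (int, optional) – The number of trees in the random forest. Defaults to 100.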