Imputation

The survey_enhance.impute module provides classes for imputing missing values in a dataset: random forest models are trained on the available data and then used to predict the missing values. It also includes functionality for adjusting the distribution of the predicted values.

class survey_enhance.impute.Imputation

Bases: object

An Imputation represents a learned function f(input_variables) -> output_variables.

X_category_mappings: List[Dict[str, int]] = None

The mapping from category names to integers for each input variable.

X_columns: List[str]

The names of the input variables.

Y_columns: List[str]

The names of the output variables.

encode_categories(X: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
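
The behaviour of encode_categories is not described on this page. As a rough sketch, assuming it replaces each category name with the integer stored in the matching entry of X_category_mappings, the idea can be shown with plain Python dictionaries. The helper names build_mapping and encode_column are hypothetical and are not part of survey_enhance.

```python
from typing import Dict, List


def build_mapping(values: List[str]) -> Dict[str, int]:
    """Assign each distinct category an integer, in order of first appearance."""
    mapping: Dict[str, int] = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def encode_column(values: List[str], mapping: Dict[str, int]) -> List[int]:
    """Replace each category name with its integer code."""
    return [mapping[value] for value in values]


# Hypothetical categorical input column.
regions = ["north", "south", "north", "west"]
region_mapping = build_mapping(regions)
encoded = encode_column(regions, region_mapping)
```

Keeping the mapping alongside the model lets the same codes be reused at prediction time, which is presumably why the mappings are stored as an attribute.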

static load(path: str) → survey_enhance.impute.Imputation

Load the imputation model from disk.

Parameters: path (str) – The path to load the model from.
Returns: The imputation model.
Return type: Imputation

models: List[survey_enhance.impute.ManyToOneImputation]

Each column of the output variables is predicted by a separate model, stored in this list.
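
The one-model-per-output-column layout can be illustrated with a toy example. This is only a sketch: ColumnMeanModel is a hypothetical stand-in that ignores the input variables and predicts the training mean, whereas the real models are random forests.

```python
from statistics import fmean
from typing import List


class ColumnMeanModel:
    """Hypothetical stand-in for a per-column model; predicts the training mean."""

    def train(self, targets: List[float]) -> "ColumnMeanModel":
        self.mean = fmean(targets)
        return self

    def predict(self, num_rows: int) -> List[float]:
        return [self.mean] * num_rows


# Hypothetical output variables, one list of values per column.
Y = {"income": [10.0, 20.0, 30.0], "rent": [1.0, 2.0, 6.0]}
y_columns = list(Y)

# One model per output column, kept in the same order as the column names.
models = [ColumnMeanModel().train(Y[name]) for name in y_columns]

# Predicting combines each model's output back into one table-like dict.
predicted = {name: model.predict(2) for name, model in zip(y_columns, models)}
```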

predict(X: pandas.core.frame.DataFrame, mean_quantile: float = 0.5, verbose: bool = False) → pandas.core.frame.DataFrame

Predict the output variables for the input dataset.

Parameters

X (pd.DataFrame) – The dataset to predict on.
mean_quantile (float) – The mean quantile under the Beta distribution used for the imputation. Defaults to 0.5.

Returns

The predicted dataset.

Return type

pd.DataFrame

random_generator: numpy.random._generator.Generator = None

The random generator used to sample from the distribution of the imputation.

save(path: str)

Save the imputation model to disk.

Parameters: path (str) – The path to save the model to.

solve_for_mean_quantiles(targets: list, input_data: pandas.core.frame.DataFrame, weights: pandas.core.series.Series)

train(X: pandas.core.frame.DataFrame, Y: pandas.core.frame.DataFrame, num_trees: int = 100)

Train random forest models, one per output variable, to predict the output variables from the input variables.

Parameters

X (pd.DataFrame) – The dataset containing the input variables.
Y (pd.DataFrame) – The dataset containing the output variables.
num_trees (int, optional) – The number of trees in each random forest. Defaults to 100.

class survey_enhance.impute.ManyToOneImputation

Bases: object

An Imputation consists of a set of ManyToOneImputation models, one for each output variable.

model: sklearn.ensemble._forest.RandomForestRegressor

The random forest model.

predict(X: pandas.core.frame.DataFrame, mean_quantile: float = 0.5, random_generator: Optional[numpy.random._generator.Generator] = None) → pandas.core.frame.DataFrame

Predict the output variable for the input dataset.

Parameters

X (pd.DataFrame) – The dataset to predict on.
mean_quantile (float) – The mean quantile under the Beta distribution.
random_generator (np.random.Generator) – The random generator.

Returns

The predicted distribution of values for each input row.

Return type

pd.Series
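
One way to read the mean_quantile parameter, offered as a sketch under stated assumptions rather than as the library's implementation: for each input row, a quantile is drawn from a Beta distribution whose mean equals mean_quantile, and the prediction is the value at that quantile among the individual trees' predictions. The sketch assumes a Beta(a, 1) distribution with a = q / (1 - q), whose mean is a / (a + 1) = q. The function names and the per-tree predictions are made up for illustration.

```python
import random
from typing import List


def sample_quantile(mean_quantile: float, rng: random.Random) -> float:
    """Draw a quantile in [0, 1] from Beta(a, 1), whose mean is mean_quantile.

    Assumes 0 < mean_quantile < 1.
    """
    a = mean_quantile / (1 - mean_quantile)
    return rng.betavariate(a, 1)


def value_at_quantile(tree_predictions: List[float], quantile: float) -> float:
    """Pick the prediction at the given quantile (nearest rank, no interpolation)."""
    ordered = sorted(tree_predictions)
    index = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[index]


rng = random.Random(0)
tree_predictions = [10.0, 12.0, 15.0, 20.0, 30.0]  # hypothetical per-tree outputs for one row

# Averaged over many draws, the sampled quantile is close to the requested mean.
draws = [sample_quantile(0.8, rng) for _ in range(20_000)]
average_quantile = sum(draws) / len(draws)

# A single imputed value for the row, drawn from the spread of tree predictions.
prediction = value_at_quantile(tree_predictions, sample_quantile(0.8, rng))
```

Under this reading, raising mean_quantile shifts imputed values toward the upper end of each row's predicted distribution while still sampling across it.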

solve_for_mean_quantile(target: float, input_df: pandas.core.frame.DataFrame, weights: numpy.ndarray, max_iterations: int = 10, verbose: bool = False)

Solve for the mean quantile that produces the target value.

Parameters

target (float) – The target value.
input_df (pd.DataFrame) – The input dataset.
weights (np.ndarray) – The sample weights.
max_iterations (int, optional) – The maximum number of iterations. Defaults to 10.
verbose (bool, optional) – Whether to print the loss at each iteration. Defaults to False.

Returns

The mean quantile.

Return type

float
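
How the solver searches is not described on this page. A minimal sketch, assuming the weighted aggregate of the predictions increases with the mean quantile, is a bisection search capped at max_iterations. Bisection is a stand-in technique here, and aggregate_at is a hypothetical placeholder for "predict with this mean quantile, then take the weighted total".

```python
from typing import Callable


def solve_by_bisection(
    target: float,
    aggregate_at: Callable[[float], float],
    max_iterations: int = 10,
) -> float:
    """Bisect on the mean quantile until the aggregate is near the target."""
    low, high = 0.0, 1.0
    for _ in range(max_iterations):
        mid = (low + high) / 2
        if aggregate_at(mid) < target:
            low = mid   # aggregate too small: search higher quantiles
        else:
            high = mid  # aggregate at or above target: search lower quantiles
    return (low + high) / 2


def toy_aggregate(mean_quantile: float) -> float:
    """Toy monotone aggregate: rises linearly from 100 to 300."""
    return 100.0 + 200.0 * mean_quantile


solved = solve_by_bisection(250.0, toy_aggregate, max_iterations=10)
```

Each iteration halves the search interval, so ten iterations narrow the mean quantile to within about 0.001 in this toy setting.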

train(X: pandas.core.frame.DataFrame, y: pandas.core.series.Series, sample_weight: Optional[pandas.core.series.Series] = None, num_trees: int = 100)

Train a random forest model to predict the output variable from the input variables.

Parameters

X (pd.DataFrame) – The dataset containing the input variables.
y (pd.Series) – The dataset containing the output variable.
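sample_weight (pd.Series, optional) – The sample weights. Defaults to None.
num_trees (int, optional) – The number of trees in the random forest. Defaults to 100.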