Imputation
The survey_enhance.impute module contains classes that impute missing values in a dataset by training random forest models and using them to predict the missing values. It also provides functionality for adjusting the distribution of the predicted values.
- class survey_enhance.impute.Imputation
Bases: object
An Imputation represents a learned function f(input_variables) -> output_variables.
- X_category_mappings: List[Dict[str, int]] = None
The mapping from category names to integers for each input variable.
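As a rough illustration of what such a mapping might look like, the sketch below builds one name-to-integer dictionary per input column. The helper `build_category_mappings` and the choice of sorted-order integer codes are assumptions made for illustration, not the library's implementation.

```python
from typing import Dict, List


def build_category_mappings(
    rows: List[Dict[str, object]], columns: List[str]
) -> List[Dict[str, int]]:
    """Hypothetical helper: one {category name: integer} dict per column.

    Columns that contain no string values get an empty mapping.
    """
    mappings = []
    for column in columns:
        # Collect the distinct string categories and give them stable codes.
        categories = sorted(
            {row[column] for row in rows if isinstance(row[column], str)}
        )
        mappings.append({name: code for code, name in enumerate(categories)})
    return mappings


rows = [
    {"region": "north", "age": 34},
    {"region": "south", "age": 51},
    {"region": "north", "age": 27},
]
mappings = build_category_mappings(rows, ["region", "age"])
print(mappings)  # [{'north': 0, 'south': 1}, {}]
```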
- X_columns: List[str]
The names of the input variables.
- Y_columns: List[str]
The names of the output variables.
- static load(path: str) -> survey_enhance.impute.Imputation
Load the imputation model from disk.
- Parameters
path (str) – The path to load the model from.
- Returns
The imputation model.
- Return type
survey_enhance.impute.Imputation
- models: List[survey_enhance.impute.ManyToOneImputation]
Each column of the output variables is predicted by a separate model, stored in this list.
- predict(X: pandas.core.frame.DataFrame, mean_quantile: float = 0.5, verbose: bool = False) -> pandas.core.frame.DataFrame
Predict the output variables for the input dataset.
- Parameters
X (pd.DataFrame) – The dataset to predict on.
mean_quantile (float) – The mean quantile under the Beta distribution used for the imputation. Defaults to 0.5.
- Returns
The predicted dataset.
- Return type
pd.DataFrame
- random_generator: numpy.random._generator.Generator = None
The random generator used to sample from the distribution of the imputation.
- save(path: str)
Save the imputation model to disk.
- Parameters
path (str) – The path to save the model to.
- solve_for_mean_quantiles(targets: list, input_data: pandas.core.frame.DataFrame, weights: pandas.core.series.Series)
- train(X: pandas.core.frame.DataFrame, Y: pandas.core.frame.DataFrame, num_trees: int = 100)
Train a random forest model to predict the output variables from the input variables.
- Parameters
X (pd.DataFrame) – The dataset containing the input variables.
Y (pd.DataFrame) – The dataset containing the output variables.
num_trees (int) – The number of trees in the random forest. Defaults to 100.
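The relationship between Y_columns and models can be pictured with a small stand-in. The sketch below is not the library's code: `MeanModel` and `ToyImputation` are hypothetical names, and `MeanModel` simply predicts the training mean in place of a random forest, so the example shows only the one-model-per-output-column structure.

```python
from statistics import mean
from typing import Dict, List


class MeanModel:
    """Hypothetical stand-in for a per-column model; predicts the training mean."""

    def train(self, X: List[List[float]], y: List[float]) -> None:
        self.value = mean(y)

    def predict(self, X: List[List[float]]) -> List[float]:
        return [self.value for _ in X]


class ToyImputation:
    """Holds one model per output column, in the same order as Y_columns."""

    def train(self, X: List[List[float]], Y: Dict[str, List[float]]) -> None:
        self.Y_columns = list(Y)
        self.models = []
        for column in self.Y_columns:
            # Each output column gets its own independently trained model.
            model = MeanModel()
            model.train(X, Y[column])
            self.models.append(model)

    def predict(self, X: List[List[float]]) -> Dict[str, List[float]]:
        # Reassemble the per-column predictions into one table-like dict.
        return {
            column: model.predict(X)
            for column, model in zip(self.Y_columns, self.models)
        }


X_train = [[1.0], [2.0], [3.0]]
Y_train = {"income": [10.0, 20.0, 30.0], "rent": [1.0, 2.0, 3.0]}

imputation = ToyImputation()
imputation.train(X_train, Y_train)
predictions = imputation.predict([[4.0], [5.0]])
print(predictions)  # {'income': [20.0, 20.0], 'rent': [2.0, 2.0]}
```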
- class survey_enhance.impute.ManyToOneImputation
Bases: object
An Imputation consists of a set of ManyToOneImputation models, one for each output variable.
- model: sklearn.ensemble._forest.RandomForestRegressor
The random forest model.
- predict(X: pandas.core.frame.DataFrame, mean_quantile: float = 0.5, random_generator: Optional[numpy.random._generator.Generator] = None) -> pandas.core.frame.DataFrame
Predict the output variable for the input dataset.
- Parameters
X (pd.DataFrame) – The dataset to predict on.
mean_quantile (float) – The mean quantile under the Beta distribution.
random_generator (np.random.Generator) – The random generator.
- Returns
The predicted distribution of values for each input row.
- Return type
pd.Series
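One way to read mean_quantile is as the mean of a Beta distribution from which a quantile is drawn for each row; that quantile then picks a value from the spread of the individual tree predictions. The sketch below works under that assumption, using only the standard library. The parameterisation Beta(a, 1) with a = q / (1 - q), which has mean q, and the helper names `quantile` and `sample_prediction` are illustrative choices, not confirmed details of the library.

```python
import random
from typing import List


def quantile(sorted_values: List[float], q: float) -> float:
    """Linearly interpolated quantile of an ascending list, for 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (
        sorted_values[upper] - sorted_values[lower]
    )


def sample_prediction(
    tree_predictions: List[float], mean_quantile: float, rng: random.Random
) -> float:
    """Draw one value from the spread of per-tree predictions for a row.

    Assumption: the quantile is drawn from Beta(a, 1) with
    a = mean_quantile / (1 - mean_quantile), whose mean is mean_quantile.
    Requires 0 < mean_quantile < 1.
    """
    a = mean_quantile / (1 - mean_quantile)
    q = rng.betavariate(a, 1.0)
    return quantile(sorted(tree_predictions), q)


rng = random.Random(0)
tree_predictions = [12.0, 15.0, 9.0, 20.0, 14.0]  # one prediction per tree
low = [sample_prediction(tree_predictions, 0.2, rng) for _ in range(2000)]
high = [sample_prediction(tree_predictions, 0.8, rng) for _ in range(2000)]
print(sum(low) / len(low), sum(high) / len(high))
```

Under this reading, raising mean_quantile shifts the sampled values toward the upper end of the tree predictions while keeping them inside the range the forest produced.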
- solve_for_mean_quantile(target: float, input_df: pandas.core.frame.DataFrame, weights: numpy.ndarray, max_iterations: int = 10, verbose: bool = False)
Solve for the mean quantile that produces the target value.
- Parameters
target (float) – The target value.
input_df (pd.DataFrame) – The input dataset.
weights (np.ndarray) – The sample weights.
max_iterations (int, optional) – The maximum number of iterations. Defaults to 10.
verbose (bool, optional) – Whether to print the loss at each iteration. Defaults to False.
- Returns
The mean quantile.
- Return type
float
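The search can be pictured as a bisection over the mean quantile: a higher quantile yields larger predictions, so the weighted total moves monotonically toward the target. The sketch below assumes that approach and uses a deterministic stand-in predictor. The names `solve_for_quantile` and `toy_predict`, and the bisection strategy itself, are assumptions rather than details confirmed by this reference.

```python
from typing import Callable, List, Sequence


def solve_for_quantile(
    target: float,
    predict: Callable[[float], List[float]],
    weights: Sequence[float],
    max_iterations: int = 10,
) -> float:
    """Bisect on the quantile until the weighted total approaches the target.

    Assumes the weighted total is non-decreasing in the quantile.
    """
    low, high = 0.0, 1.0
    mid = 0.5
    for _ in range(max_iterations):
        mid = (low + high) / 2
        total = sum(w * p for w, p in zip(weights, predict(mid)))
        if total < target:
            low = mid  # predictions too small: search higher quantiles
        else:
            high = mid  # predictions too large: search lower quantiles
    return mid


# Hypothetical per-row prediction ranges (lowest, highest) for three rows.
ranges = [(0.0, 10.0), (5.0, 25.0), (2.0, 6.0)]


def toy_predict(q: float) -> List[float]:
    """Stand-in predictor: interpolate within each row's range."""
    return [lo + q * (hi - lo) for lo, hi in ranges]


weights = [100.0, 50.0, 200.0]
# The weighted total is 650 + 2800 * q, so a target of 1490 implies q = 0.3.
solved = solve_for_quantile(1490.0, toy_predict, weights, max_iterations=20)
print(round(solved, 3))  # 0.3
```

Each iteration halves the search interval, so a small max_iterations already gives a fairly precise quantile.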
- train(X: pandas.core.frame.DataFrame, y: pandas.core.series.Series, sample_weight: Optional[pandas.core.series.Series] = None, num_trees: int = 100)
Train a random forest model to predict the output variable from the input variables.
- Parameters
X (pd.DataFrame) – The dataset containing the input variables.
y (pd.Series) – The dataset containing the output variable.
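sample_weight (pd.Series, optional) – The sample weights. Defaults to None.
num_trees (int) – The number of trees in the random forest. Defaults to 100.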