
Autoimputation

This notebook demonstrates the functionality of the autoimpute module, which provides an automated approach to selecting and applying optimal imputation methods for missing data. Rather than manually testing different approaches, autoimpute evaluates multiple methods (tuning their hyperparameters to the specific dataset), identifies which performs best for your specific data, and applies it to generate high-quality imputations.

autoimpute function

def autoimpute(
    donor_data: pd.DataFrame,
    receiver_data: pd.DataFrame,
    predictors: List[str],
    imputed_variables: List[str],
    weight_col: Optional[str] = None,
    models: Optional[List[Type]] = None,
    imputation_quantiles: Optional[List[float]] = None,
    hyperparameters: Optional[Dict[str, Dict[str, Any]]] = None,
    tune_hyperparameters: Optional[bool] = False,
    preprocessing: Optional[Dict[str, str]] = None,
    impute_all: Optional[bool] = False,
    metric_priority: Optional[str] = "auto",
    random_state: Optional[int] = RANDOM_STATE,
    train_size: Optional[float] = TRAIN_SIZE,
    k_folds: Optional[int] = 5,
    force_retrain: Optional[bool] = False,
    log_level: Optional[str] = "WARNING",
) -> AutoImputeResult
| Parameter | Type | Default | Description |
|---|---|---|---|
| donor_data | pd.DataFrame | - | DataFrame with predictor and target variables for training |
| receiver_data | pd.DataFrame | - | DataFrame where imputed values will be generated |
| predictors | List[str] | - | Column names of predictor variables |
| imputed_variables | List[str] | - | Column names of variables to impute |
| weight_col | str | None | Column name for sampling weights |
| models | List[Type] | [QRF, OLS, QuantReg, Matching, MDN] | List of imputer classes to compare |
| imputation_quantiles | List[float] | [0.05 to 0.95 in steps of 0.05] | Quantiles at which to predict |
| hyperparameters | Dict | None | Model-specific hyperparameters, e.g. {"QRF": {"n_estimators": 200}} |
| tune_hyperparameters | bool | False | Enable automatic hyperparameter tuning |
| preprocessing | Dict[str, str] | None | Variable transformations: {"var": "normalize"/"log"/"asinh"} |
| impute_all | bool | False | Return imputations for all models, not just the best |
| metric_priority | str | "auto" | Model selection strategy: "auto", "numerical", "categorical", "combined" |
| random_state | int | 42 | Random seed for reproducibility |
| train_size | float | 0.8 | Proportion of donor data used for training in cross-validation |
| k_folds | int | 5 | Number of cross-validation folds |
| force_retrain | bool | False | Force MDN model retraining (bypass cache) |
| log_level | str | "WARNING" | Logging verbosity level |
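
For example, combining a preprocessing transformation with an explicit selection strategy might look like the sketch below. This is a minimal illustration using only parameters from the table above; the DataFrames and column names are placeholders, not part of the API.

from microimpute.comparisons.autoimpute import autoimpute
from microimpute.models import OLS, QRF

# Hypothetical donor/receiver DataFrames with an "income" column to impute
result = autoimpute(
    donor_data=donor_df,
    receiver_data=receiver_df,
    predictors=["age", "sex"],
    imputed_variables=["income"],
    models=[QRF, OLS],
    preprocessing={"income": "log"},  # log-transform the target before fitting
    metric_priority="numerical",      # rank models on numerical (quantile loss) metrics
    k_folds=3,
)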

AutoImputeResult

The function returns an AutoImputeResult object with the following attributes:

| Attribute | Type | Description |
|---|---|---|
| imputations | Dict | Maps model names to quantile → DataFrame of imputed values |
| receiver_data | pd.DataFrame | Receiver data with imputed values integrated |
| fitted_models | Dict | Maps model names to fitted ImputerResults objects (if impute_all=True, also includes all other fitted models) |
| cv_results | Dict | Cross-validation metrics per model (quantile_loss, log_loss) |

Access the best model’s imputations using AutoImputeResult.imputations["best_method"].
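
As a minimal access sketch (mirroring what this notebook does later), assuming result is the returned AutoImputeResult:

# Imputed values from the winning model (median imputations by default)
best_imputations = result.imputations["best_method"]

# Receiver dataset with the imputed columns filled in
completed_data = result.receiver_data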

import warnings
warnings.filterwarnings("ignore")

import logging
logging.getLogger("pytorch_lightning").setLevel(logging.ERROR)
logging.getLogger("pytorch_tabular").setLevel(logging.ERROR)
logging.getLogger("joblib").setLevel(logging.ERROR)

import pandas as pd
import numpy as np
import plotly.graph_objects as go
from sklearn.datasets import load_diabetes

pd.set_option("display.width", 600)
pd.set_option("display.max_columns", 10)
pd.set_option("display.expand_frame_repr", False)

from microimpute.comparisons.autoimpute import autoimpute
from microimpute.models import OLS, QuantReg, QRF, Matching
from microimpute.visualizations.comparison_plots import method_comparison_results

Data preparation

This demonstration uses the diabetes dataset from scikit-learn. In real-world imputation scenarios, you would typically have a “donor” dataset with complete information for both predictor and target variables, and a “receiver” dataset that lacks some target variables that need to be imputed.

# Load the diabetes dataset
diabetes = load_diabetes()
diabetes_data = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Display the first few rows to understand the data structure
diabetes_data.head()
Loading...

For this demonstration, the diabetes dataset is split into donor and receiver portions. Part of the data is treated as the donor dataset with complete information, and another part as the receiver dataset with some variables that need imputation. autoimpute handles imputation of numerical, categorical, and boolean variables, so it places few constraints on the choice of datasets and variables.

# Split the data into donor and receiver portions
np.random.seed(42)  # fix the random split so the notebook is reproducible
donor_indices = np.random.choice(
    len(diabetes_data), size=int(0.7 * len(diabetes_data)), replace=False
)
receiver_indices = np.array(
    [i for i in range(len(diabetes_data)) if i not in donor_indices]
)

donor_data = diabetes_data.iloc[donor_indices].reset_index(drop=True)
receiver_data = diabetes_data.iloc[receiver_indices].reset_index(drop=True)

# Define which variables we'll use as predictors and which we want to impute
predictors = ["age", "sex", "bmi", "bp"]
imputed_variables = ["s1", "s4"]

# For demonstration purposes, we'll remove the variables we want to impute from the receiver dataset
receiver_data_without_targets = receiver_data.drop(columns=imputed_variables)

print(f"Donor data shape: {donor_data.shape}")
print(f"Receiver data shape: {receiver_data_without_targets.shape}")
print(f"Predictors: {predictors}")
print(f"Variables to impute: {imputed_variables}")
Donor data shape: (309, 10)
Receiver data shape: (133, 8)
Predictors: ['age', 'sex', 'bmi', 'bp']
Variables to impute: ['s1', 's4']

Running autoimpute

Use the autoimpute function to automatically evaluate different imputation methods, select the best one, and generate imputations. The function handles all the complexity of model evaluation, selection, and application in a single call.

warnings.filterwarnings("ignore")

# Run the autoimpute process
results = autoimpute(
    donor_data=donor_data,
    receiver_data=receiver_data_without_targets,
    predictors=predictors,
    imputed_variables=imputed_variables,
    models=[OLS, QuantReg, QRF, Matching], # MDN model excluded for efficiency
    tune_hyperparameters=False,
    k_folds=3,
)

print(
    f"Shape of receiver data before imputation: {receiver_data_without_targets.shape} \nShape of receiver data after imputation: {results.receiver_data.shape}"
)
Shape of receiver data before imputation: (133, 10) 
Shape of receiver data after imputation: (133, 10)

Understanding the results

The autoimpute function returns an AutoImputeResult object that provides comprehensive information about the imputation process:

# Examine the comparative performance of different imputation methods
print("Cross-validation results for different imputation methods:")
for model, metric_dict in results.cv_results.items():
    print(f"\nModel: {model}")
    print(f"quantile loss results: {metric_dict.get('quantile_loss').get("mean_test"):.4f}")
Cross-validation results for different imputation methods:

Model: QRF
quantile loss results: 0.0155

Model: OLS
quantile loss results: 0.0124

Model: QuantReg
quantile loss results: 0.0125

Model: Matching
quantile loss results: 0.0231

The output above summarizes how each imputation method performs: for every method, the value shown is the average quantile loss across all evaluated quantiles (the 'mean_test' entry). Lower values indicate better performance, and autoimpute automatically selects the method with the lowest average loss.
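
The same selection can be reproduced by hand from cv_results. A short sketch, assuming the {model: {'quantile_loss': {'mean_test': ...}}} layout used in the loop above:

# Rank models by mean cross-validated quantile loss (lower is better)
ranking = sorted(
    results.cv_results.items(),
    key=lambda item: item[1]["quantile_loss"]["mean_test"],
)
for model_name, metrics in ranking:
    print(f"{model_name}: {metrics['quantile_loss']['mean_test']:.4f}")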

# Identify which method was selected as the best performer
print(f"Best performing method: {results.fitted_models['best_method'].__class__.__name__}")
Best performing method: OLSResults

Visualizing method comparison

Visualizing how different methods perform across quantiles provides insight into which methods are most appropriate for different parts of the distribution.

# Plot the cross-validation quantile loss for each method
comparison_viz = method_comparison_results(
    data=results.cv_results,
    metric="quantile_loss",
    data_format="wide",
)
fig = comparison_viz.plot(
    title="Autoimpute Method Comparison",
    show_mean=True,
)
fig.show()

The plot above illustrates how each imputation method performs across different quantiles of the distribution. Methods with consistently lower lines generally perform better overall.

Examining the imputed values

Now let us assess the actual imputed values generated by the best-performing method.

# Examine imputed values (these were imputed for q=0.5 by default)
median_imputations = results.imputations[
    "best_method"
]  # Extract the best imputations with the "best_method" key
print("Median imputed values:")
median_imputations.head()
Median imputed values:
# Look at the full receiver dataset with imputed values integrated
print("Receiver dataset with imputed values:")
results.receiver_data.head()
Receiver dataset with imputed values:

Evaluating imputation quality

In this demonstration, since the receiver dataset was artificially created by removing variables from the original data, we can evaluate the quality of the imputations by comparing them to the actual values.

# Visualize comparison between actual and imputed values
for var in imputed_variables:
    fig = go.Figure()

    # Plot actual values
    fig.add_trace(
        go.Scatter(
            x=receiver_data.index,
            y=receiver_data[var],
            mode="markers",
            name="Actual values",
            marker=dict(color="blue", size=8),
        )
    )

    # Plot imputed values
    fig.add_trace(
        go.Scatter(
            x=results.receiver_data.index,
            y=results.receiver_data[var],
            mode="markers",
            name="Imputed values",
            marker=dict(color="red", size=8),
        )
    )

    # Customize the plot appearance
    fig.update_layout(
        title=f"Comparison of actual vs imputed values for {var}",
        xaxis_title="Sample Index",
        yaxis_title=f"{var} Value",
        legend_title="Type",
        hovermode="closest",
    )

    fig.show()

The plots above show how well the imputed values (red) match the actual values (blue) that were removed from the receiver dataset. This visual comparison helps assess the quality of the imputations generated by the best-performing method.
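
Because the true values are known in this artificial setup, the visual check can be complemented by a simple numeric summary. A minimal sketch, assuming (as in this notebook) that receiver_data and results.receiver_data share the same row order:

# Quantify imputation error against the held-back true values
for var in imputed_variables:
    actual = receiver_data[var].to_numpy()
    imputed = results.receiver_data[var].to_numpy()
    rmse = np.sqrt(np.mean((actual - imputed) ** 2))
    corr = np.corrcoef(actual, imputed)[0, 1]
    print(f"{var}: RMSE = {rmse:.4f}, correlation = {corr:.3f}")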

Advanced usage

Custom models and hyperparameters

The autoimpute function allows for customization of both the models to evaluate and their hyperparameters. This flexibility enables adaptation to specific dataset characteristics and imputation requirements. The models that support hyperparameter specification and tuning are Matching and QRF.

from microimpute.models import QRF, OLS, Matching

# Specify a custom subset of models to evaluate
custom_models = [QRF, OLS, Matching]

# Specify custom hyperparameters for some models
custom_hyperparameters = {
    "QRF": {"n_estimators": 200, "max_depth": 10},
    "Matching": {"constrained": True},
}

# Then simply run autoimpute with custom models and hyperparameters
advanced_results = autoimpute(
    donor_data=donor_data,
    receiver_data=receiver_data_without_targets,
    predictors=predictors,
    imputed_variables=imputed_variables,
    models=custom_models,
    hyperparameters=custom_hyperparameters,
    k_folds=3,
)

advanced_results.imputations["best_method"]

Comparison of imputed values across models

To compare not only cross-validated performance (quantile loss) but also the final imputed values themselves, autoimpute supports setting the parameter impute_all to True, so that imputation is performed with every evaluated model rather than only the best-performing one. When set to True, the returned AutoImputeResult contains an imputations dictionary and a fitted_models dictionary covering all models in addition to the "best_method" entry.

warnings.filterwarnings("ignore")

# Run the autoimpute process
results = autoimpute(
    donor_data=donor_data,
    receiver_data=receiver_data_without_targets,
    predictors=predictors,
    imputed_variables=imputed_variables,
    models=[OLS, QuantReg, QRF, Matching], # MDN model excluded for efficiency
    tune_hyperparameters=False,
    impute_all=True,
    k_folds=3,
)

print(f"Imputation results available for models: {results.imputations.keys()}")
print(
    f"The best performing model is: {results.fitted_models['best_method'].__class__.__name__}"
)
Imputation results available for models: dict_keys(['best_method', 'QRF', 'QuantReg', 'Matching'])
The best performing model is: OLSResults
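
With impute_all=True, the imputations dictionary holds one entry per model, so imputed values can be summarized side by side. A sketch, assuming each entry exposes a DataFrame of median imputations like the "best_method" entry used earlier:

# Compare mean imputed values across all fitted models
for model_name, imputed_df in results.imputations.items():
    means = imputed_df[imputed_variables].mean()
    summary = ", ".join(f"{var}: {means[var]:.4f}" for var in imputed_variables)
    print(f"{model_name} -> {summary}")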