Benchmarking methods#
This document provides a comprehensive guide to benchmarking different imputation methods using MicroImpute. The examples below illustrate the workflow for comparing various imputation approaches and evaluating their performance.
# On the Diabetes Dataset
from typing import List, Type
import pandas as pd
from microimpute.comparisons import *
from microimpute.config import RANDOM_STATE
from microimpute.models import *
from microimpute.visualizations.plotting import method_comparison_results
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
# 1. Prepare data
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
X_train, X_test = train_test_split(
    df, test_size=0.2, random_state=RANDOM_STATE
)
predictors = ["age", "sex", "bmi", "bp"]
imputed_variables = ["s1", "s4"]
Y_test: pd.DataFrame = X_test[imputed_variables]
# 2. Run imputation methods
model_classes: List[Type[Imputer]] = [QRF, OLS, QuantReg, Matching]
method_imputations = get_imputations(
    model_classes, X_train, X_test, predictors, imputed_variables
)
# 3. Compare imputation methods
loss_comparison_df = compare_quantile_loss(
    Y_test, method_imputations, imputed_variables
)
# 4. Plot results
comparison_viz = method_comparison_results(
    data=loss_comparison_df,
    metric_name="Test Quantile Loss",
    data_format="long",
)
fig = comparison_viz.plot(
    title="Method Comparison on Diabetes Dataset",
    show_mean=True,
)
fig.show()
# On the SCF Dataset
from typing import List, Type
import pandas as pd
from microimpute.comparisons import *
from microimpute.config import RANDOM_STATE
from microimpute.models import *
import warnings
warnings.filterwarnings("ignore")
# 1. Prepare data
X_train, X_test, PREDICTORS, IMPUTED_VARIABLES, dummy_info = prepare_scf_data(
    full_data=False, years=2019
)
if dummy_info:
    # Update the predictor and imputed-variable lists with the dummy columns
    # created during preprocessing
    for orig_col, dummy_cols in dummy_info["column_mapping"].items():
        if orig_col in PREDICTORS:
            PREDICTORS.remove(orig_col)
            PREDICTORS.extend(dummy_cols)
        elif orig_col in IMPUTED_VARIABLES:
            IMPUTED_VARIABLES.remove(orig_col)
            IMPUTED_VARIABLES.extend(dummy_cols)
# Shrink down the data by sampling
X_train = X_train.sample(frac=0.01, random_state=RANDOM_STATE)
X_test = X_test.sample(frac=0.01, random_state=RANDOM_STATE)
Y_test: pd.DataFrame = X_test[IMPUTED_VARIABLES]
# 2. Run imputation methods
model_classes: List[Type[Imputer]] = [QRF, OLS, QuantReg, Matching]
method_imputations = get_imputations(
    model_classes, X_train, X_test, PREDICTORS, IMPUTED_VARIABLES
)
# 3. Compare imputation methods
loss_comparison_df = compare_quantile_loss(
    Y_test, method_imputations, IMPUTED_VARIABLES
)
# 4. Plot results
comparison_viz = method_comparison_results(
    data=loss_comparison_df,
    metric_name="Test Quantile Loss",
    data_format="long",
)
fig = comparison_viz.plot(
    title="Method Comparison on SCF Dataset",
    show_mean=True,
)
fig.show()
Found 4 numeric columns with unique values < 10, treating as categorical: ['hhsex', 'married', 'kids', 'race']. Converting to dummy variables.
Data preparation#
The data preparation phase establishes the foundation for meaningful benchmarking comparisons. The prepare_scf_data() function specifically handles Survey of Consumer Finances data, though the framework accommodates any properly formatted dataset. This function downloads data from user-specified survey years, selecting relevant predictor and target variables that capture the essential relationships for imputation.
The function applies normalization to the features, ensuring that variables with different scales do not unduly influence the imputation models. This preprocessing step is crucial, particularly for methods like nearest neighbor matching that rely on distance calculations. Finally, the function splits the data into training and testing sets, preserving the statistical properties of both sets while creating an appropriate evaluation framework.
While the package provides this specialized function for SCF data, researchers can easily substitute their own data preparation pipeline as long as it produces properly formatted training and testing datasets that conform to the expected structure. For this, the preprocess_data() function provides basic normalization and defaults to train-test splitting on any dataset. If you would like to normalize the dataset without splitting it (for example, when performing cross-validation), set the full_data parameter to True.
# Normalize only, keeping the full dataset together
processed_data = preprocess_data(dataset, full_data=True)
# Normalize and split into training and test sets (default behavior)
X_train, X_test = preprocess_data(dataset)
Imputation generation#
The imputation generation process serves as the core operational phase of the benchmarking framework. The get_imputations() function orchestrates this process, handling all aspects of model training and prediction generation. It systematically trains each specified model on identical training data, ensuring a fair comparison across different imputation approaches.
After training, the function generates predictions at user-specified quantiles, allowing for evaluation across different parts of the conditional distribution. The quantile-based approach provides insights not just into central tendency (as with mean-based methods) but into the entire shape of the imputed distributions. This comprehensive prediction generation creates a rich dataset for subsequent evaluation.
The function organizes all results into a consistent, structured format designed for straightforward comparison. The returned nested dictionary architecture provides intuitive access to predictions from different models at different quantiles:
{
"ModelName1": {
0.1: DataFrame of predictions at 10th percentile,
0.5: DataFrame of predictions at 50th percentile,
0.9: DataFrame of predictions at 90th percentile,
},
"ModelName2": {
0.1: DataFrame of predictions at 10th percentile,
...
},
...
}
This well-designed data structure simplifies downstream analysis and visualization, allowing researchers to focus on interpreting results rather than managing data formats.
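For example, assuming the model names appear as string keys matching the class names used above (as the sketch suggests), the predictions for a single model and quantile can be pulled out directly:
# Access predictions for one model at specific quantiles
# ("QRF" as a string key is an assumption based on the structure sketched above)
qrf_median = method_imputations["QRF"][0.5]  # 50th-percentile imputations
qrf_upper = method_imputations["QRF"][0.9]   # 90th-percentile imputations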
At this stage, a model object can only handle the imputation of one variable at a time, meaning that to impute multiple variables from a dataset, a new model object must be created for each of them.
Quantile loss calculation#
The evaluation phase employs sophisticated quantile loss metrics to assess imputation quality. This approach provides a more nuanced evaluation than traditional metrics like mean squared error, particularly for capturing performance across different parts of the distribution.
At the foundation of this evaluation lies the quantile_loss() function, which implements the standard quantile loss formulation:
\[
L_q(y, f) = \max\left(q\,(y - f),\; (q - 1)\,(y - f)\right)
\]
where \(q\) is the quantile to be evaluated, \(y\) represents the true value and \(f\) is the imputed value.
This mathematical formulation creates an asymmetric loss function that penalizes under-prediction more heavily for higher quantiles and over-prediction more heavily for lower quantiles. This asymmetry aligns perfectly with the interpretation of quantiles—a 90th percentile prediction should rarely be below the true value, while a 10th percentile prediction should rarely exceed it.
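As a standalone illustration of this formula and its asymmetry, a minimal NumPy sketch is shown below; it is not the package's own quantile_loss implementation, whose exact signature may differ:
import numpy as np

def pinball_loss(y_true, y_pred, q):
    # Standard quantile (pinball) loss: under-prediction is weighted by q,
    # over-prediction by (1 - q).
    error = y_true - y_pred
    return np.maximum(q * error, (q - 1) * error)

# At the 90th percentile, under-predicting by 2 costs 1.8 ...
print(pinball_loss(10.0, 8.0, q=0.9))
# ... while over-predicting by 2 costs only 0.2.
print(pinball_loss(10.0, 12.0, q=0.9))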
Building on this foundation, the compute_quantile_loss() function calculates losses between true and imputed values, providing granular insight into model performance at the individual prediction level. This detailed evaluation helps identify specific patterns or regions where certain models might excel or struggle.
The integration of these components culminates in the compare_quantile_loss() function, which systematically evaluates multiple methods across different quantiles. The function produces a structured DataFrame with columns that describe the method being evaluated, the specific percentile being assessed, and the corresponding average quantile loss value.
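Because the result is a standard long-format pandas DataFrame, it can be inspected and aggregated with the usual pandas tools; the exact column labels can be checked directly on the object:
# Inspect the structure of the loss comparison results
print(loss_comparison_df.columns)
print(loss_comparison_df.head())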
Visualization#
The method_comparison_results.plot() function generates bar charts that group benchmarking results by both model and quantile, making it easy to identify patterns and trends in performance across methods and across different parts of the distribution.
The function uses color coding to distinguish between imputation models, making it easy to track the performance of a single method. Along the horizontal axis, the chart displays the evaluated quantiles (such as the 10th, 25th, and 50th percentiles), allowing assessment across the entire distribution of interest. The vertical axis represents the average quantile loss, with lower values indicating better performance, giving an immediate visual indication of which models are performing well. The dashed lines represent the average loss across quantiles.
Extending the benchmarking framework#
The MicroImpute benchmarking framework was designed with extensibility as a core principle, allowing researchers to easily integrate and evaluate new imputation approaches. To incorporate your own custom imputation model into this evaluation framework, you can follow a straightforward process.
First, implement your custom model by extending the Imputer abstract base class, following the design patterns and interface requirements documented in the implement-new-model.md file. This structured approach ensures your model will interact correctly with the rest of the benchmarking system. Once your model implementation is complete, simply include your model class in the model_classes list alongside the built-in models you wish to compare against. Finally, execute the benchmarking process as described previously, and your custom model will be evaluated using the same rigorous methodology applied to the built-in models.
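Concretely, only the list of model classes changes; the sketch below assumes a hypothetical MyImputer class that subclasses Imputer and implements the interface described in implement-new-model.md:
from microimpute.models import Imputer, QRF, OLS, QuantReg, Matching

class MyImputer(Imputer):
    # Hypothetical custom model: implement the abstract methods required by
    # the Imputer base class (see implement-new-model.md for the interface).
    ...

model_classes = [QRF, OLS, QuantReg, Matching, MyImputer]
method_imputations = get_imputations(
    model_classes, X_train, X_test, predictors, imputed_variables
)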
This seamless integration is possible because all models that implement the Imputer interface share a common API, allowing the benchmarking framework to interact with them in a consistent manner regardless of their internal implementation details. This architectural decision makes the framework inherently extensible while maintaining a clean separation between the benchmarking logic and the specific imputation methods being evaluated.
Best practices#
Effective benchmarking requires careful attention to methodology and interpretation. To maximize the value of your imputation benchmarking efforts, consider following these research-based best practices that ensure comprehensive and reliable evaluation.
Robust evaluation requires testing models across multiple diverse datasets rather than relying on a single test case. This approach helps identify which models perform consistently well across different data scenarios and which may be sensitive to particular data characteristics. By examining performance across varied contexts, you can make more confident generalizations about a method’s effectiveness.
A comprehensive evaluation should assess performance across different quantiles rather than focusing solely on central measures like the median. Many applications care about the tails of distributions, and models that perform well at the median might struggle with extreme quantiles. Evaluating across the full spectrum of quantiles provides a more complete picture of each method’s strengths and limitations.
While statistical performance is critical, practical considerations should not be overlooked. Different imputation methods can vary dramatically in their computational requirements, including training time, memory usage, and prediction speed. In many applications, a slightly less accurate method that runs orders of magnitude faster may be preferable. Consider these trade-offs explicitly in your evaluation framework.
For particularly important decisions, enhance the reliability of your performance estimates through cross-validation techniques. Cross-validation provides a more stable estimate of model performance by averaging results across multiple train-test splits, reducing the impact of any particular data division. This approach is especially valuable when working with smaller datasets where a single train-test split might not be representative.
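The package's built-in functions can be combined with a manual cross-validation loop; the sketch below uses scikit-learn's KFold on the diabetes example from the start of this document (variable names follow that example):
from sklearn.model_selection import KFold
import pandas as pd

kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
fold_results = []
for train_idx, test_idx in kf.split(df):
    X_tr, X_te = df.iloc[train_idx], df.iloc[test_idx]
    imputations = get_imputations(
        model_classes, X_tr, X_te, predictors, imputed_variables
    )
    fold_results.append(
        compare_quantile_loss(X_te[imputed_variables], imputations, imputed_variables)
    )

# Concatenating the per-fold results allows averaging losses across folds
cv_losses = pd.concat(fold_results)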
The package also supports detailed assessment of model behavior through train-test performance comparisons via the plot_train_test_performance() function. This visualization tool helps identify potential overfitting or underfitting issues by contrasting a model's performance on training data with its performance on held-out test data. Significant disparities between training and testing performance can reveal important limitations in a model's generalization capabilities.
For specialized applications with particular interest in certain parts of the distribution, the framework accommodates custom quantile sets for targeted evaluation. Rather than using the default (random) quantiles, researchers can specify exactly which quantiles to evaluate, allowing focused assessment of performance in regions of particular interest. This flexibility enables tailored evaluations that align precisely with application-specific requirements and priorities.
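How the quantile set is supplied depends on the installed version of the package; the sketch below assumes that get_imputations accepts a quantiles keyword argument, which should be verified against the current function signature before relying on it:
# Target the tails and the median (the quantile list here is illustrative)
custom_quantiles = [0.05, 0.25, 0.5, 0.75, 0.95]
method_imputations = get_imputations(
    model_classes, X_train, X_test, predictors, imputed_variables,
    quantiles=custom_quantiles,  # assumed keyword argument; check your version
)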