Cross-validation and model imputation comparison

This page documents the cross-validation utilities for evaluating imputation model performance. Cross-validation provides robust estimates of how well a model will generalize to unseen data by training and testing on multiple data splits. Functions such as get_imputations build on these utilities to standardize evaluation across all models through a consistent experimental setup.

Microimpute’s cross-validation automatically selects the appropriate metric based on variable type. Numerical variables are evaluated using quantile loss, which measures prediction accuracy across the conditional distribution. Categorical variables are evaluated using log loss (cross-entropy), which penalizes confident but incorrect predictions. See the Metrics page for more details.
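
Both metrics follow their standard definitions. The sketch below, written with plain NumPy, only illustrates the formulas and is not the library's internal implementation:

import numpy as np

def quantile_loss(y_true, y_pred, q):
    # Pinball loss: under-predictions are weighted by q, over-predictions by (1 - q)
    error = y_true - y_pred
    return np.mean(np.maximum(q * error, (q - 1) * error))

def log_loss(y_true_onehot, p_pred, eps=1e-15):
    # Cross-entropy: a confident prediction of the wrong class is penalized heavily
    p_pred = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(np.sum(y_true_onehot * np.log(p_pred), axis=1))

# Median prediction (q = 0.5) on a small numerical example
print(quantile_loss(np.array([1.0, 2.0, 3.0]), np.array([1.5, 1.5, 2.5]), q=0.5))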

Cross-validation

The cross_validate_model function trains and evaluates a model on multiple train/test splits, automatically selecting the appropriate metric for each variable type: quantile loss for numerical variables and log loss for categorical variables.

cross_validate_model

def cross_validate_model(
    model_class: Type,
    data: pd.DataFrame,
    predictors: List[str],
    imputed_variables: List[str],
    weight_col: Optional[str] = None,
    quantiles: Optional[List[float]] = QUANTILES,
    n_splits: Optional[int] = 5,
    random_state: Optional[int] = RANDOM_STATE,
    model_hyperparams: Optional[dict] = None,
    tune_hyperparameters: Optional[bool] = False,
) -> Union[Dict[str, Any], Tuple[Dict[str, Any], Dict]]
| Parameter | Type | Default used | Description |
| --- | --- | --- | --- |
| model_class | Type | - | Imputer class to evaluate (e.g., QRF, OLS, MDN) |
| data | pd.DataFrame | - | Complete dataset for cross-validation |
| predictors | List[str] | - | Column names of predictor variables |
| imputed_variables | List[str] | - | Column names of variables to impute |
| weight_col | str | None | Column name for sampling weights |
| quantiles | List[float] | 0.05 to 0.95 in steps of 0.05 | Quantiles to evaluate |
| n_splits | int | 5 | Number of cross-validation folds |
| random_state | int | 42 | Random seed for reproducibility |
| model_hyperparams | dict | None | Hyperparameters to pass to the model |
| tune_hyperparameters | bool | False | Enable hyperparameter tuning |
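
As an illustration, sampling weights and model-specific hyperparameters can be supplied together. In the sketch below the dataset, column names, and hyperparameter key are hypothetical, and valid hyperparameters depend on the model being evaluated:

from microimpute.evaluations import cross_validate_model
from microimpute.models import QRF

weighted_results = cross_validate_model(
    model_class=QRF,
    data=survey_df,  # hypothetical dataset containing a "weight" column
    predictors=["age", "sex", "bmi"],
    imputed_variables=["income"],
    weight_col="weight",  # column holding sampling weights
    model_hyperparams={"n_estimators": 100},  # illustrative key; depends on the model
)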

Returns a dictionary containing separate results for each metric type:

{
    "quantile_loss": {
        "results": pd.DataFrame,  # rows: ["train", "test"], cols: quantiles
        "mean_train": float,
        "mean_test": float,
        "variables": List[str]   # numerical variables evaluated
    },
    "log_loss": {
        "results": pd.DataFrame,  # rows: ["train", "test"], cols: quantiles
        "mean_train": float,
        "mean_test": float,
        "variables": List[str]   # categorical variables evaluated
    }
}

If tune_hyperparameters=True, returns a tuple of (results_dict, best_hyperparameters).

Example usage

from microimpute.evaluations import cross_validate_model
from microimpute.models import QRF

# Run 5-fold cross-validation
results = cross_validate_model(
    model_class=QRF,
    data=diabetes_df,
    predictors=["age", "sex", "bmi", "bp"],
    imputed_variables=["s1", "s4"],
    n_splits=5
)

# Check performance for numerical variables
print(f"Mean test quantile loss: {results['quantile_loss']['mean_test']:.4f}")

# View detailed results by quantile
print(results["quantile_loss"]["results"])
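
If hyperparameter tuning is enabled, the same call returns a (results, best_hyperparameters) tuple. A minimal sketch of unpacking it with the same data:

# Re-run cross-validation with hyperparameter tuning enabled
tuned_results, best_hyperparams = cross_validate_model(
    model_class=QRF,
    data=diabetes_df,
    predictors=["age", "sex", "bmi", "bp"],
    imputed_variables=["s1", "s4"],
    n_splits=5,
    tune_hyperparameters=True,
)
print(best_hyperparams)  # dictionary of the best hyperparameter values found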

Interpreting results

The results DataFrame shows loss values for each quantile, with rows for train and test splits. Lower values indicate better performance. Comparing train and test loss helps identify overfitting: a large gap suggests the model may not generalize well.
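
For example, the train/test gap can be read straight from the returned dictionary; the relative threshold below is only an illustrative rule of thumb, not a library constant:

quantile_results = results["quantile_loss"]
gap = quantile_results["mean_test"] - quantile_results["mean_train"]
print(f"Train loss: {quantile_results['mean_train']:.4f}")
print(f"Test loss:  {quantile_results['mean_test']:.4f}")
# Flag a suspiciously large gap between test and train loss (illustrative threshold)
if gap > 0.1 * quantile_results["mean_train"]:
    print("Test loss is much higher than train loss; the model may be overfitting.")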

For model selection, focus on the test loss (mean_test). When comparing multiple models, the autoimpute() function automates this comparison and selects the best-performing model using a rank-based approach that handles mixed variable types.

Imputation generation for model comparison

The get_imputations function generates imputations using cross-validation for multiple model classes in a single call, organizing results in a consistent format for downstream comparison and evaluation.

get_imputations

def get_imputations(
    model_classes: List[Type],
    X_train: pd.DataFrame,
    X_test: pd.DataFrame,
    predictors: List[str],
    imputed_variables: List[str],
    quantiles: Optional[List[float]] = QUANTILES,
) -> Dict[str, Dict[float, pd.DataFrame]]
| Parameter | Type | Default used | Description |
| --- | --- | --- | --- |
| model_classes | List[Type] | - | List of model classes to use (e.g., QRF, OLS, QuantReg, Matching) |
| X_train | pd.DataFrame | - | Training data containing predictors and variables to impute |
| X_test | pd.DataFrame | - | Test data on which to generate imputations |
| predictors | List[str] | - | Column names of predictor variables |
| imputed_variables | List[str] | - | Column names of variables to impute |
| quantiles | List[float] | 0.05 to 0.95 in steps of 0.05 | List of quantiles to predict |

Returns a nested dictionary mapping method names to dictionaries of quantile-indexed DataFrames:

{
    "QRF": {
        0.1: pd.DataFrame,  # predictions at 10th percentile
        0.5: pd.DataFrame,  # predictions at 50th percentile
        0.9: pd.DataFrame,  # predictions at 90th percentile
    },
    "OLS": {
        0.1: pd.DataFrame,
        ...
    },
}

Example usage

from microimpute.comparisons import get_imputations
from microimpute.models import QRF, OLS, QuantReg, Matching

# Generate imputations from multiple models
method_imputations = get_imputations(
    model_classes=[QRF, OLS, QuantReg, Matching],
    X_train=train_data,
    X_test=test_data,
    predictors=["age", "sex", "bmi"],
    imputed_variables=["income", "wealth"],
)

# Access predictions for a specific model and quantile
qrf_median = method_imputations["QRF"][0.5]
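
Because every method returns predictions in the same quantile-indexed format, downstream comparison can be done with plain pandas. A small sketch, assuming each quantile-level DataFrame contains one column per imputed variable:

import pandas as pd

# Collect each method's median (q = 0.5) prediction for "income" side by side
median_comparison = pd.DataFrame({
    method: preds[0.5]["income"]
    for method, preds in method_imputations.items()
})
print(median_comparison.describe())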