Metrics and evaluation

This page documents the evaluation metrics and predictor analysis tools available for assessing imputation quality. These utilities help understand model performance, compare methods, and analyze the contribution of individual predictors.

Loss metrics

Microimpute employs evaluation metrics tailored to the type of variable being imputed. The framework automatically selects the appropriate metric based on whether the imputed variable is numerical or categorical, ensuring meaningful performance assessment across different data types.

Quantile loss

Quantile loss assesses imputation quality for numerical variables. This approach provides a more nuanced evaluation than traditional metrics like mean squared error, particularly for capturing performance across different parts of the distribution.

The quantile loss implements the standard pinball loss formulation:

$$L_q(y, f) = \max\left(q(y - f),\ (q - 1)(y - f)\right)$$

where $q$ is the quantile being evaluated, $y$ represents the true value, and $f$ is the imputed value. This asymmetric loss function penalizes under-prediction more heavily for higher quantiles and over-prediction more heavily for lower quantiles. The asymmetry aligns with the interpretation of quantiles: a 90th percentile prediction should rarely fall below the true value, while a 10th percentile prediction should rarely exceed it.
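
As a quick worked example of this asymmetry, the standalone NumPy sketch below evaluates the pinball loss at $q = 0.9$ for an under-prediction and an over-prediction of equal size (it implements the formula directly rather than calling the library):

```python
import numpy as np

def pinball(q: float, y: np.ndarray, f: np.ndarray) -> np.ndarray:
    """Element-wise pinball (quantile) loss, as defined above."""
    diff = y - f
    return np.maximum(q * diff, (q - 1) * diff)

y = np.array([100.0, 100.0])
f = np.array([90.0, 110.0])  # under-predict by 10, then over-predict by 10

# At q = 0.9, the under-prediction is penalized nine times as heavily:
print(pinball(0.9, y, f))  # [9. 1.]
```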

```python
def quantile_loss(q: float, y: np.ndarray, f: np.ndarray) -> np.ndarray
```

| Parameter | Type | Description |
| --- | --- | --- |
| q | float | Quantile to evaluate (e.g., 0.5 for median) |
| y | np.ndarray | True values |
| f | np.ndarray | Predicted values |

Returns an array of element-wise quantile losses.

Log loss

Log loss (cross-entropy) evaluates probabilistic predictions of categorical outcomes. It measures the performance of a classification model where the prediction output is a probability value between 0 and 1.

The log loss metric is calculated as:

$$\text{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij} \log(p_{ij})$$

where $N$ is the number of samples, $M$ is the number of classes, $y_{ij}$ is 1 if sample $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability of sample $i$ belonging to class $j$.

A perfect classifier achieves a log loss of 0, while worse predictions yield increasingly higher values. The metric heavily penalizes confident misclassifications: predicting a class with high probability when incorrect results in a large loss value.
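
For intuition, this self-contained snippet computes the per-sample cross-entropy term for a binary outcome whose true class is 1, showing how confidence amplifies the penalty for an error:

```python
import numpy as np

def per_sample_loss(p_true_class: float) -> float:
    """Cross-entropy contribution of one sample: -log of the probability assigned to the true class."""
    return -np.log(p_true_class)

print(per_sample_loss(0.99))  # ~0.01: confident and correct
print(per_sample_loss(0.60))  # ~0.51: cautious
print(per_sample_loss(0.01))  # ~4.61: confident and wrong
```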

```python
def log_loss(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    normalize: bool = True,
    labels: Optional[np.ndarray] = None,
) -> float
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| y_true | np.ndarray | - | True class labels |
| y_pred | np.ndarray | - | Predicted probabilities or class labels |
| normalize | bool | True | If True, return the mean loss; if False, return the sum |
| labels | np.ndarray | None | List of possible label values |

Returns the log loss value (float).

When predictions are class labels rather than probabilities, the function converts them to high-confidence probabilities (0.99/0.01) with a warning. For more accurate evaluation, use probability predictions when available.

compute_loss

A unified function that selects the appropriate loss metric based on the specified type, providing a consistent interface for both numerical and categorical evaluation.

```python
def compute_loss(
    test_y: np.ndarray,
    imputations: np.ndarray,
    metric: Literal["quantile_loss", "log_loss"],
    q: float = 0.5,
    labels: Optional[np.ndarray] = None,
) -> Tuple[np.ndarray, float]
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| test_y | np.ndarray | - | True values |
| imputations | np.ndarray | - | Predicted/imputed values |
| metric | str | - | "quantile_loss" or "log_loss" |
| q | float | 0.5 | Quantile (for quantile_loss only) |
| labels | np.ndarray | None | Class labels (for log_loss only) |

Returns a tuple of (element_wise_losses, mean_loss).
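
A minimal usage sketch, assuming compute_loss is importable from microimpute.comparisons.metrics as in the example at the end of this page (adjust the import path if your installation differs):

```python
import numpy as np
from microimpute.comparisons.metrics import compute_loss  # assumed import path

true_income = np.array([30_000.0, 52_000.0, 75_000.0])
imputed_income = np.array([28_500.0, 55_000.0, 71_000.0])

# Median (q = 0.5) quantile loss for a numerical variable
losses, mean_loss = compute_loss(
    test_y=true_income,
    imputations=imputed_income,
    metric="quantile_loss",
    q=0.5,
)
print(losses, mean_loss)
```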

compare_metrics

Compares metrics across multiple imputation methods, automatically detecting variable types and applying the appropriate metric. For models that handle both numerical and categorical variables, the evaluation produces separate results for each metric type.

```python
def compare_metrics(
    test_y: pd.DataFrame,
    method_imputations: Dict[str, Dict[float, pd.DataFrame]],
    imputed_variables: List[str],
) -> pd.DataFrame
```

| Parameter | Type | Description |
| --- | --- | --- |
| test_y | pd.DataFrame | DataFrame containing true values |
| method_imputations | Dict | Nested dict: method → quantile → DataFrame |
| imputed_variables | List[str] | Variables to evaluate |

Returns a DataFrame with columns Method, Imputed Variable, Percentile, Loss, and Metric.
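
The nested method → quantile → DataFrame structure can be built as in the sketch below. The synthetic DataFrames and the quantile keys (0.25/0.5/0.75) are illustrative stand-ins for real model output, and the import path is assumed:

```python
import numpy as np
import pandas as pd
from microimpute.comparisons.metrics import compare_metrics  # assumed import path

rng = np.random.default_rng(0)
test_y = pd.DataFrame({"income": rng.normal(50_000, 10_000, size=100)})

def fake_imputation(offset: float) -> pd.DataFrame:
    # Stand-in for one method's imputed values at one quantile.
    return pd.DataFrame({"income": test_y["income"] + offset + rng.normal(0, 1_000, size=100)})

method_imputations = {
    "QRF": {0.25: fake_imputation(-5_000), 0.5: fake_imputation(0), 0.75: fake_imputation(5_000)},
    "OLS": {0.25: fake_imputation(-6_000), 0.5: fake_imputation(500), 0.75: fake_imputation(6_000)},
}

results = compare_metrics(
    test_y=test_y,
    method_imputations=method_imputations,
    imputed_variables=["income"],
)
```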

Distribution comparison

Beyond point-wise loss metrics, evaluating how well imputed values preserve distributional characteristics provides insight into whether the imputation maintains the statistical properties of the original data.

Wasserstein distance

For continuous numerical variables, the Wasserstein distance (Earth Mover’s Distance) quantifies the difference between distributions:

$$W_p(P, Q) = \left(\inf_{\gamma \in \Pi(P, Q)} \int_{X \times Y} d(x, y)^p \, d\gamma(x, y)\right)^{1/p}$$

where $\Pi(P, Q)$ denotes the set of all joint distributions whose marginals are $P$ and $Q$ respectively. The Wasserstein distance measures the minimum “work” required to transform one distribution into another, where work is the amount of distribution mass moved times the distance moved. Lower values indicate better preservation of the original distribution’s shape.

When sample weights are provided, the weighted Wasserstein distance accounts for varying observation importance, which is essential when comparing survey data with different sampling designs. We use scipy’s wasserstein_distance implementation, which supports sample weights via the u_weights and v_weights parameters.
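
The weighted computation can be reproduced with scipy directly, as in this standalone sketch with synthetic data:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
donor_income = rng.lognormal(mean=10.8, sigma=0.50, size=1_000)
imputed_income = rng.lognormal(mean=10.9, sigma=0.45, size=800)

donor_weights = rng.uniform(0.5, 2.0, size=1_000)    # survey weights
imputed_weights = rng.uniform(0.5, 2.0, size=800)

dist = wasserstein_distance(
    donor_income,
    imputed_income,
    u_weights=donor_weights,
    v_weights=imputed_weights,
)
print(dist)  # lower values indicate a closer match between distributions
```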

Kullback-Leibler divergence

For discrete distributions (categorical and boolean variables), KL divergence quantifies how one probability distribution diverges from a reference:

$$D_{KL}(P||Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right)$$

where $P$ is the reference distribution (original data), $Q$ is the approximation (imputed data), and $\mathcal{X}$ is the set of all possible categorical values. KL divergence measures how much information is lost when using the imputed distribution to approximate the true distribution. Lower values indicate better preservation of the original categorical distribution.

When sample weights are provided, the probability distributions are computed as weighted proportions rather than simple counts, ensuring proper comparison of weighted survey data.
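
The following standalone sketch illustrates the weighted version of the formula, computing weighted category proportions and using scipy.stats.entropy for the KL sum (it assumes every category appears in both samples, so no zero-probability handling is needed):

```python
import numpy as np
from scipy.stats import entropy

def weighted_proportions(values, weights, categories):
    """Weighted share of each category, in the order given by `categories`."""
    values = np.asarray(values)
    weights = np.asarray(weights, dtype=float)
    totals = np.array([weights[values == c].sum() for c in categories])
    return totals / totals.sum()

donor_vals = np.array(["own", "rent", "rent", "own", "other"])
donor_w = np.array([1.2, 0.8, 1.0, 1.5, 0.9])
recv_vals = np.array(["own", "own", "rent", "other", "other"])
recv_w = np.array([1.0, 1.1, 0.7, 1.3, 0.9])

cats = np.union1d(donor_vals, recv_vals)
p = weighted_proportions(donor_vals, donor_w, cats)  # reference (donor)
q = weighted_proportions(recv_vals, recv_w, cats)    # approximation (imputed)

# scipy's entropy(p, q) returns sum(p * log(p / q)), i.e. D_KL(P || Q)
print(entropy(p, q))
```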

kl_divergence

Computes the Kullback-Leibler divergence between two categorical distributions, with optional sample weights.

```python
def kl_divergence(
    donor_values: np.ndarray,
    receiver_values: np.ndarray,
    donor_weights: Optional[np.ndarray] = None,
    receiver_weights: Optional[np.ndarray] = None,
) -> float
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| donor_values | np.ndarray | - | Categorical values from donor data (reference distribution) |
| receiver_values | np.ndarray | - | Categorical values from receiver data (approximation) |
| donor_weights | np.ndarray | None | Optional sample weights for donor values |
| receiver_weights | np.ndarray | None | Optional sample weights for receiver values |

Returns the KL divergence value (float >= 0), where 0 indicates identical distributions.

compare_distributions

Compares distributions between donor and receiver data, automatically selecting the appropriate metric based on variable type and supporting sample weights for survey data.

```python
def compare_distributions(
    donor_data: pd.DataFrame,
    receiver_data: pd.DataFrame,
    imputed_variables: List[str],
    donor_weights: Optional[Union[pd.Series, np.ndarray]] = None,
    receiver_weights: Optional[Union[pd.Series, np.ndarray]] = None,
) -> pd.DataFrame
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| donor_data | pd.DataFrame | - | Original donor data |
| receiver_data | pd.DataFrame | - | Receiver data with imputations |
| imputed_variables | List[str] | - | Variables to compare |
| donor_weights | pd.Series or np.ndarray | None | Sample weights for donor data (must match donor_data length) |
| receiver_weights | pd.Series or np.ndarray | None | Sample weights for receiver data (must match receiver_data length) |

Returns a DataFrame with columns Variable, Metric, and Distance. The function automatically selects Wasserstein distance for numerical variables and KL divergence for categorical variables.

Note that data must not contain null or infinite values. If your data contains such values, filter them before calling this function.
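
A minimal pre-filtering sketch, assuming donor and receiver DataFrames and an imputed_variables list as in the example at the end of this page (the helper function here is purely illustrative):

```python
import numpy as np
import pandas as pd

def drop_bad_rows(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Drop rows with nulls in `cols`, and infinities in any numeric column among `cols`."""
    clean = df.dropna(subset=cols)
    numeric = clean[cols].select_dtypes(include="number")
    if not numeric.empty:
        clean = clean[np.isfinite(numeric).all(axis=1)]
    return clean

donor_clean = drop_bad_rows(donor, imputed_variables)
receiver_clean = drop_bad_rows(receiver_with_imputations, imputed_variables)
```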

Predictor analysis

Understanding which predictors contribute most to imputation quality helps with feature selection and model interpretation. These tools analyze predictor-target relationships and evaluate sensitivity to predictor selection.

Mutual information

Mutual information measures the reduction in uncertainty about one variable given knowledge of another. Unlike correlation coefficients that capture only linear relationships, mutual information detects any statistical dependency, making it valuable for mixed data types.

For discrete random variables $X$ and $Y$:

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log\left(\frac{p(x, y)}{p(x)p(y)}\right)$$

For continuous variables, the summations are replaced by integrals. The normalized mutual information (NMI) used in the implementation is:

$$\text{NMI}(X; Y) = \frac{I(X; Y)}{\sqrt{H(X) \cdot H(Y)}}$$

where $H(X)$ and $H(Y)$ are the entropies of $X$ and $Y$ respectively. Normalized values range from 0 (no relationship) to 1 (perfect dependency), allowing direct comparison of predictor importance across different variable types.
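
For two discrete variables, an equivalent normalized score can be computed with scikit-learn, as in this illustrative sketch (it is not microimpute's internal implementation, but average_method="geometric" matches the normalization above):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
education = rng.integers(0, 4, size=500)  # a discrete predictor
# A target that depends on the predictor, plus some noise
income_band = (education + rng.integers(0, 2, size=500)).clip(0, 4)

nmi = normalized_mutual_info_score(education, income_band, average_method="geometric")
print(nmi)  # 0 = no relationship, 1 = perfect dependency
```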

compute_predictor_correlations

```python
def compute_predictor_correlations(
    data: pd.DataFrame,
    predictors: List[str],
    imputed_variables: List[str],
) -> Dict[str, pd.DataFrame]
```

| Parameter | Type | Description |
| --- | --- | --- |
| data | pd.DataFrame | Dataset containing predictors and target variables |
| predictors | List[str] | Column names of predictor variables |
| imputed_variables | List[str] | Column names of target variables |

Returns a dictionary containing a predictor_target_mi DataFrame with mutual information scores.

Leave-one-out analysis

Leave-one-out predictor analysis evaluates model performance when each predictor is excluded. By comparing loss with and without each predictor, you can assess its contribution to imputation quality. Predictors whose removal causes large increases in loss are most important, while those with minimal impact might be candidates for removal to simplify the model.
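
The sketch below illustrates the technique generically with a scikit-learn quantile regressor and median pinball loss; it is a stand-in for the idea, not microimpute's implementation:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1_000
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)  # third column is irrelevant
names = ["age", "hours", "noise"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def median_pinball_loss(cols: list) -> float:
    """Median (q = 0.5) pinball loss of a quantile model fit on the given predictor columns."""
    model = GradientBoostingRegressor(loss="quantile", alpha=0.5, random_state=0)
    model.fit(X_tr[:, cols], y_tr)
    resid = y_te - model.predict(X_te[:, cols])
    return float(np.mean(np.maximum(0.5 * resid, -0.5 * resid)))

baseline = median_pinball_loss([0, 1, 2])
for i, name in enumerate(names):
    without = median_pinball_loss([j for j in range(3) if j != i])
    print(f"{name}: loss increase when dropped = {without - baseline:.3f}")
```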

leave_one_out_analysis

```python
def leave_one_out_analysis(
    data: pd.DataFrame,
    predictors: List[str],
    imputed_variables: List[str],
    model_class: Type,
    quantiles: Optional[List[float]] = QUANTILES,
) -> Dict[str, Any]
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| data | pd.DataFrame | - | Complete dataset |
| predictors | List[str] | - | Column names of predictor variables |
| imputed_variables | List[str] | - | Column names of variables to impute |
| model_class | Type | - | Imputer class to evaluate |
| quantiles | List[float] | 0.05 to 0.95 in steps of 0.05 | Quantiles to evaluate |

Returns a dictionary containing loss increase and relative impact for each predictor.

Progressive predictor inclusion

Progressive inclusion analysis adds predictors one at a time in order of their mutual information with the target. This greedy forward selection reveals the optimal inclusion order, marginal contribution of each predictor, and the minimal set of predictors achieving near-optimal performance. Diminishing returns in loss reduction indicate when additional predictors provide negligible improvement.

progressive_predictor_inclusion

```python
def progressive_predictor_inclusion(
    data: pd.DataFrame,
    predictors: List[str],
    imputed_variables: List[str],
    model_class: Type,
    quantiles: Optional[List[float]] = QUANTILES,
) -> Dict[str, Any]
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| data | pd.DataFrame | - | Complete dataset |
| predictors | List[str] | - | Column names of predictor variables |
| imputed_variables | List[str] | - | Column names of variables to impute |
| model_class | Type | - | Imputer class to evaluate |
| quantiles | List[float] | 0.05 to 0.95 in steps of 0.05 | Quantiles to evaluate |

Returns a dictionary containing inclusion_order (list of predictors in optimal order) and predictor_impacts (list of dicts with predictor name and loss reduction).
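
A usage sketch that reads the documented keys of the result, using the data, predictors, and imputed_variables names from the example below (the exact structure of each predictor_impacts entry may differ from what is shown here):

```python
from microimpute.evaluations import progressive_predictor_inclusion
from microimpute.models import QRF

results = progressive_predictor_inclusion(data, predictors, imputed_variables, QRF)

# Predictors in the order they were added by the greedy forward selection
for rank, name in enumerate(results["inclusion_order"], start=1):
    print(rank, name)

# Marginal contribution of each added predictor (predictor name and loss reduction)
for impact in results["predictor_impacts"]:
    print(impact)
```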

Example usage

```python
from microimpute.comparisons.metrics import compare_metrics, compare_distributions
from microimpute.evaluations import (
    compute_predictor_correlations,
    leave_one_out_analysis,
    progressive_predictor_inclusion,
)
from microimpute.models import QRF

# Compare methods
metrics_df = compare_metrics(
    test_y=test_data[imputed_variables],
    method_imputations={
        "QRF": qrf_imputations,
        "OLS": ols_imputations,
    },
    imputed_variables=imputed_variables
)

# Evaluate distributional match with survey weights
dist_df_weighted = compare_distributions(
    donor_data=donor,
    receiver_data=receiver_with_imputations,
    imputed_variables=imputed_variables,
    donor_weights=donor["sample_weight"],
    receiver_weights=receiver["sample_weight"],
)

# Analyze predictor importance
mi_scores = compute_predictor_correlations(data, predictors, imputed_variables)
loo_results = leave_one_out_analysis(data, predictors, imputed_variables, QRF)
inclusion_results = progressive_predictor_inclusion(data, predictors, imputed_variables, QRF)
```