
Autoimpute

This documentation describes how the autoimpute function automates the entire method comparison, selection, and imputation pipeline in a single call.

The pipeline begins with input validation, which checks that all required columns exist and that quantiles are properly specified. It then preprocesses the donor and receiver datasets to prepare them for model training and evaluation. The function supports imputing numerical, categorical, and boolean variables, internally selecting the method that corresponds to each variable type.

At its core, autoimpute runs cross-validation on the donor data to evaluate multiple imputation methods. Each model is assessed on how accurately it predicts known values, using quantile loss for numerical imputation and log loss for categorical imputation. The method with the lowest average loss across target variables, with the two metrics combined through a weighted-rank approach, is automatically selected as the best approach for the specific dataset and imputation task.

The chosen model is then trained on the complete donor dataset and used to impute the missing values in the receiver data. Finally, the pipeline merges the imputed values back into the original receiver dataset, producing a complete dataset ready for downstream analysis.
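To make the evaluation and selection step more concrete, here is a minimal sketch of the two metrics and of one possible weighted-rank combination. It is not the package's implementation: the function names (`quantile_loss`, `log_loss`, `select_method`), the method labels (`method_a`, `method_b`, `method_c`), the loss values, and the choice of weights are all hypothetical, and the exact ranking rule autoimpute uses may differ.

```python
import numpy as np


def quantile_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile q in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-predictions are weighted by q, over-predictions by (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


def log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-likelihood of the true class.

    y_true holds integer class indices; proba has one row per observation
    and one column per class.
    """
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    idx = np.asarray(y_true, dtype=int)
    return float(-np.mean(np.log(proba[np.arange(len(idx)), idx])))


def select_method(losses, weights):
    """Pick the method with the lowest weighted mean rank (assumed rule).

    losses:  {metric: {method: mean cross-validated loss}}
    weights: {metric: weight}, e.g. the number of variables scored by it
    Ties within a metric are broken by method name, for simplicity.
    """
    methods = sorted(next(iter(losses.values())))
    total = sum(weights[metric] for metric in losses)
    score = {m: 0.0 for m in methods}
    for metric, by_method in losses.items():
        ordered = sorted(methods, key=lambda m: by_method[m])
        for rank, m in enumerate(ordered, start=1):  # rank 1 = lowest loss
            score[m] += weights[metric] * rank / total
    return min(score, key=score.get), score


# Hypothetical mean cross-validated losses for three placeholder methods.
losses = {
    "quantile_loss": {"method_a": 0.30, "method_b": 0.25, "method_c": 0.40},
    "log_loss": {"method_a": 0.50, "method_b": 0.70, "method_c": 0.60},
}
weights = {"quantile_loss": 3, "log_loss": 1}  # e.g. 3 numeric, 1 categorical

best, scores = select_method(losses, weights)
print(best)  # method_b
```

Ranking within each metric before combining avoids adding quantile loss and log loss directly, since the two are on different scales.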