AutoImputation#
This notebook demonstrates the functionality of the autoimpute
module, which provides an automated approach to selecting and applying optimal imputation methods for missing data. Rather than manually testing different approaches, AutoImpute evaluates multiple methods (tuning their hyperparameters to the specific data set), identifies which performs best for your specific data, and applies it to generate high-quality imputations.
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from rpy2.robjects import pandas2ri
from sklearn.datasets import load_diabetes
import warnings
# Set pandas display options to limit table width
pd.set_option("display.width", 600)
pd.set_option("display.max_columns", 10)
pd.set_option("display.expand_frame_repr", False)
from microimpute.comparisons.autoimpute import autoimpute
from microimpute.visualizations.plotting import method_comparison_results
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "LD_LIBRARY_PATH" redefined by R and overriding existing variable. Current: "/opt/hostedtoolcache/Python/3.11.12/x64/lib", R: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "PWD" redefined by R and overriding existing variable. Current: "/home/runner/work/microimpute/microimpute/docs", R: "/home/runner/work/microimpute/microimpute/docs/autoimpute"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "LD_LIBRARY_PATH" redefined by R and overriding existing variable. Current: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib", R: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_LIBS_SITE" redefined by R and overriding existing variable. Current: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library", R: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_PAPERSIZE_USER" redefined by R and overriding existing variable. Current: "a4", R: "letter"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_SESSION_TMPDIR" redefined by R and overriding existing variable. Current: "/tmp/Rtmp7Dv5hE", R: "/tmp/RtmpaDywKG"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pydantic/_internal/_generate_schema.py:623: UserWarning: <built-in function array> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
warn(
Data preparation#
This demonstration uses the diabetes dataset from scikit-learn. In real-world imputation scenarios, you would typically have a “donor” dataset with complete information for both predictor and target variables, and a “receiver” dataset that lacks some target variables that need to be imputed.
# Load the diabetes dataset
diabetes = load_diabetes()
diabetes_data = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
# Display the first few rows to understand the data structure
diabetes_data.head()
age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.025930 |
3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |
For this demonstration, the diabetes dataset is split into donor and receiver portions. Part of the data is treated as the donor dataset with complete information, and another part as the receiver dataset with some variables that need imputation. Autoimpute handles imputation of numerical, categorical and boolean variables, lifting constraints on the choice of data sets and variables.
# Split the data into donor and receiver portions
donor_indices = np.random.choice(
len(diabetes_data), size=int(0.7 * len(diabetes_data)), replace=False
)
receiver_indices = np.array(
[i for i in range(len(diabetes_data)) if i not in donor_indices]
)
donor_data = diabetes_data.iloc[donor_indices].reset_index(drop=True)
receiver_data = diabetes_data.iloc[receiver_indices].reset_index(drop=True)
# Define which variables we'll use as predictors and which we want to impute
predictors = ["age", "sex", "bmi", "bp"]
imputed_variables = ["s1", "s4"]
# For demonstration purposes, we'll remove the variables we want to impute from the receiver dataset
receiver_data_without_targets = receiver_data.drop(columns=imputed_variables)
print(f"Donor data shape: {donor_data.shape}")
print(f"Receiver data shape: {receiver_data_without_targets.shape}")
print(f"Predictors: {predictors}")
print(f"Variables to impute: {imputed_variables}")
Donor data shape: (309, 10)
Receiver data shape: (133, 8)
Predictors: ['age', 'sex', 'bmi', 'bp']
Variables to impute: ['s1', 's4']
Running AutoImpute#
Use the autoimpute
function to automatically evaluate different imputation methods, select the best one, and generate imputations. The function handles all the complexity of model evaluation, selection, and application in a single call.
warnings.filterwarnings("ignore")
# Run the autoimpute process
imputations, imputed_data, fitted_model, method_results_df = autoimpute(
donor_data=donor_data,
receiver_data=receiver_data_without_targets,
predictors=predictors,
imputed_variables=imputed_variables,
tune_hyperparameters=False, # enable automated hyperparameter tuning if desired
k_folds=3, # Number of cross-validation folds
)
print(
f"Shape of receiver data before imputation: {receiver_data_without_targets.shape} \nShape of receiver data after imputation: {imputed_data.shape}"
)
Found 1 numeric columns with unique values < 10, treating as categorical: ['sex']. Converting to dummy variables.
Found 1 numeric columns with unique values < 10, treating as categorical: ['sex']. Converting to dummy variables.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pydantic/_internal/_generate_schema.py:623: UserWarning: <built-in function array> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "LD_LIBRARY_PATH" redefined by R and overriding existing variable. Current: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib", R: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_LIBS_SITE" redefined by R and overriding existing variable. Current: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library", R: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_SESSION_TMPDIR" redefined by R and overriding existing variable. Current: "/tmp/RtmpaDywKG", R: "/tmp/Rtmp1pWYUn"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pydantic/_internal/_generate_schema.py:623: UserWarning: <built-in function array> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pydantic/_internal/_generate_schema.py:623: UserWarning: <built-in function array> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "LD_LIBRARY_PATH" redefined by R and overriding existing variable. Current: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib", R: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_LIBS_SITE" redefined by R and overriding existing variable. Current: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library", R: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_SESSION_TMPDIR" redefined by R and overriding existing variable. Current: "/tmp/Rtmp1pWYUn", R: "/tmp/RtmprS75Ox"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "LD_LIBRARY_PATH" redefined by R and overriding existing variable. Current: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib", R: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_LIBS_SITE" redefined by R and overriding existing variable. Current: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library", R: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_SESSION_TMPDIR" redefined by R and overriding existing variable. Current: "/tmp/RtmpaDywKG", R: "/tmp/Rtmptajs7B"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "LD_LIBRARY_PATH" redefined by R and overriding existing variable. Current: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib", R: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_LIBS_SITE" redefined by R and overriding existing variable. Current: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library", R: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_SESSION_TMPDIR" redefined by R and overriding existing variable. Current: "/tmp/RtmpaDywKG", R: "/tmp/Rtmp26y4kk"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "LD_LIBRARY_PATH" redefined by R and overriding existing variable. Current: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib", R: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_LIBS_SITE" redefined by R and overriding existing variable. Current: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library", R: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_SESSION_TMPDIR" redefined by R and overriding existing variable. Current: "/tmp/Rtmp26y4kk", R: "/tmp/RtmpYKoN7k"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "LD_LIBRARY_PATH" redefined by R and overriding existing variable. Current: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib", R: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_LIBS_SITE" redefined by R and overriding existing variable. Current: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library", R: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_SESSION_TMPDIR" redefined by R and overriding existing variable. Current: "/tmp/Rtmptajs7B", R: "/tmp/RtmpwOWPQx"
warnings.warn(
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 4.0s finished
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/pydantic/_internal/_generate_schema.py:623: UserWarning: <built-in function array> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "LD_LIBRARY_PATH" redefined by R and overriding existing variable. Current: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib", R: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_LIBS_SITE" redefined by R and overriding existing variable. Current: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library", R: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_SESSION_TMPDIR" redefined by R and overriding existing variable. Current: "/tmp/RtmpaDywKG", R: "/tmp/RtmpWx9xYq"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "LD_LIBRARY_PATH" redefined by R and overriding existing variable. Current: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib", R: "/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/temurin-17-jdk-amd64/lib/server:/opt/hostedtoolcache/Python/3.11.12/x64/lib"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_LIBS_SITE" redefined by R and overriding existing variable. Current: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library", R: "/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library/:/usr/local/lib/R/site-library:/usr/lib/R/site-library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library:/usr/lib/R/library"
warnings.warn(
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/rpy2/rinterface/__init__.py:1211: UserWarning: Environment variable "R_SESSION_TMPDIR" redefined by R and overriding existing variable. Current: "/tmp/RtmpWx9xYq", R: "/tmp/Rtmpr1MfEK"
warnings.warn(
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Batch computation too fast (0.06923031806945801s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 0.1s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/statsmodels/regression/quantile_regression.py:191: IterationLimitWarning: Maximum number of iterations (1000) reached.
warnings.warn("Maximum number of iterations (" + str(max_iter) +
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/statsmodels/regression/quantile_regression.py:191: IterationLimitWarning: Maximum number of iterations (1000) reached.
warnings.warn("Maximum number of iterations (" + str(max_iter) +
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/statsmodels/regression/quantile_regression.py:191: IterationLimitWarning: Maximum number of iterations (1000) reached.
warnings.warn("Maximum number of iterations (" + str(max_iter) +
/opt/hostedtoolcache/Python/3.11.12/x64/lib/python3.11/site-packages/statsmodels/regression/quantile_regression.py:191: IterationLimitWarning: Maximum number of iterations (1000) reached.
warnings.warn("Maximum number of iterations (" + str(max_iter) +
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 1.1s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 4.9s finished
Shape of receiver data before imputation: (133, 8)
Shape of receiver data after imputation: (133, 10)
Understanding the results#
The autoimpute
function returns four key objects that provide comprehensive information about the imputation process:
imputations
: A dictionary where keys are quantiles used for imputation and values are DataFrames containing the imputed values at each quantileimputed_data
: The receiver dataset with imputed values integrated into itfitted_model
: The best-performing imputation model, already fitted on the donor datamethod_results_df
: A DataFrame with detailed performance metrics for all evaluated imputation methods
# Examine the comparative performance of different imputation methods
print("Cross-validation results for different imputation methods:")
method_results_df
Cross-validation results for different imputation methods:
0.05 | 0.1 | 0.15 | 0.2 | 0.25 | ... | 0.8 | 0.85 | 0.9 | 0.95 | mean_loss | |
---|---|---|---|---|---|---|---|---|---|---|---|
QRF | 0.005040 | 0.007287 | 0.011052 | 0.014258 | 0.015701 | ... | 0.014879 | 0.013036 | 0.010515 | 0.007531 | 0.015249 |
OLS | 0.003632 | 0.006300 | 0.008623 | 0.010609 | 0.012290 | ... | 0.012981 | 0.011129 | 0.008686 | 0.005421 | 0.012412 |
QuantReg | 0.003543 | 0.006289 | 0.008795 | 0.010683 | 0.012311 | ... | 0.013302 | 0.011453 | 0.008875 | 0.005486 | 0.012510 |
Matching | 0.021658 | 0.021650 | 0.021642 | 0.021635 | 0.021627 | ... | 0.021540 | 0.021533 | 0.021525 | 0.021517 | 0.021588 |
4 rows × 20 columns
The table above provides a comprehensive view of how each imputation method performs across different quantiles. The ‘mean_loss’ column shows the average quantile loss across all quantiles for each method. Lower values indicate better performance, and AutoImpute automatically selects the method with the lowest average loss.
# Identify which method was selected as the best performer
best_method = method_results_df["mean_loss"].idxmin()
print(f"Best performing method: {best_method}")
print(f"Average loss: {method_results_df.loc[best_method, 'mean_loss']:.4f}")
Best performing method: OLS
Average loss: 0.0124
Visualizing method comparison#
Visualize how different methods perform across quantiles provides insight into which methods are most appropriate for different parts of the distribution.
# Extract the quantiles used in the evaluation
quantiles = [q for q in method_results_df.columns if isinstance(q, float)]
comparison_viz = method_comparison_results(
data=method_results_df,
metric_name="Test Quantile Loss",
data_format="wide",
)
fig = comparison_viz.plot(
title="Autoimpute Method Comparison",
show_mean=True,
)
fig.show()
The plot above illustrates how each imputation method performs across different quantiles of the distribution. Methods with consistently lower lines generally perform better overall.
comparison_viz.summary()
Method | Mean Test Quantile Loss | Best Quantile | Best Test Quantile Loss | Worst Quantile | Worst Test Quantile Loss | |
---|---|---|---|---|---|---|
1 | OLS | 0.012412 | 0.05 | 0.003632 | 0.55 | 0.016730 |
2 | QuantReg | 0.012510 | 0.05 | 0.003543 | 0.55 | 0.016709 |
0 | QRF | 0.015249 | 0.05 | 0.005040 | 0.50 | 0.020775 |
3 | Matching | 0.021588 | 0.95 | 0.021517 | 0.05 | 0.021658 |
By calling summary
on the object returned by method_comparison_results
function, you can get a summary of the imputation results, including the mean and standard deviation of the quantile loss for each method. This summary can help you understand the performance of different imputation methods in a more concise manner.
Examining the imputed values#
Now let us assess the actual imputed values generated by the best-performing method.
# Examine imputed values at the median (0.5 quantile)
median_imputations = imputations[0.5]
print("Median imputed values:")
median_imputations.head()
Median imputed values:
s1 | s4 | |
---|---|---|
0 | -0.009400 | -0.031872 |
1 | 0.002125 | 0.022185 |
2 | 0.000382 | 0.012833 |
3 | 0.003263 | -0.018164 |
4 | -0.019299 | -0.038061 |
# Look at the full receiver dataset with imputed values integrated
print("Receiver dataset with imputed values:")
imputed_data.head()
Receiver dataset with imputed values:
age | sex | bmi | bp | s2 | s3 | s5 | s6 | s1 | s4 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.019163 | 0.074412 | -0.068332 | -0.092204 | -0.009400 | -0.031872 |
1 | 0.027178 | 0.050680 | 0.017506 | -0.033213 | 0.045972 | -0.065491 | -0.096435 | -0.059067 | 0.002125 | 0.022185 |
2 | 0.005383 | 0.050680 | -0.001895 | 0.008101 | -0.015719 | -0.002903 | 0.038394 | -0.013504 | 0.000382 | 0.012833 |
3 | 0.045341 | -0.044642 | -0.025607 | -0.012556 | -0.000061 | 0.081775 | -0.031988 | -0.075636 | 0.003263 | -0.018164 |
4 | -0.049105 | -0.044642 | -0.056863 | -0.043542 | -0.043276 | 0.000779 | -0.011897 | 0.015491 | -0.019299 | -0.038061 |
Evaluating imputation quality#
In this demonstration, since the receiver dataset was artificially created by removing variables from the original data, there exists the unique opportunity to evaluate the quality of our imputations by comparing them to the actual values.
# Visualize comparison between actual and imputed values
for var in imputed_variables:
fig = go.Figure()
# Plot actual values
fig.add_trace(
go.Scatter(
x=receiver_data.index,
y=receiver_data[var],
mode="markers",
name="Actual Values",
marker=dict(color="blue", size=8),
)
)
# Plot imputed values
fig.add_trace(
go.Scatter(
x=imputed_data.index,
y=imputed_data[var],
mode="markers",
name="Imputed Values",
marker=dict(color="red", size=8),
)
)
# Customize the plot appearance
fig.update_layout(
title=f"Comparison of Actual vs Imputed Values for {var}",
xaxis_title="Sample Index",
yaxis_title=f"{var} Value",
legend_title="Type",
hovermode="closest",
)
fig.show()
The plots above show how well the imputed values (red) match the actual values (blue) that were removed from the receiver dataset. This visual comparison helps assess the quality of the imputations generated by the best-performing method.
Advanced usage: custom models and hyperparameters#
The AutoImpute function allows for customization of both the models to evaluate and their hyperparameters. This flexibility enables adaptation to specific dataset characteristics and imputation requirements. The models that support hyperparameter specification and tuning are Matching and QRF.
from microimpute.models import *
# Specify a custom subset of models to evaluate
custom_models = [QRF, OLS, Matching]
# Specify custom hyperparameters for some models
custom_hyperparameters = {
"QRF": {"n_estimators": 200, "max_depth": 10},
"Matching": {"constrained": True},
}
# Then simply run autoimpute with custom models and hyperparameters
(
advanced_imputations,
advanced_imputed_data,
advanced_fitted_model,
advanced_method_results,
) = autoimpute(
donor_data=donor_data,
receiver_data=receiver_data_without_targets,
predictors=predictors,
imputed_variables=imputed_variables,
models=custom_models,
hyperparameters=custom_hyperparameters,
k_folds=3,
)
advanced_imputations[0.5]
Found 1 numeric columns with unique values < 10, treating as categorical: ['sex']. Converting to dummy variables.
Found 1 numeric columns with unique values < 10, treating as categorical: ['sex']. Converting to dummy variables.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 2.2s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Batch computation too fast (0.059703826904296875s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 0.1s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 3 out of 3 | elapsed: 4.4s finished
s1 | s4 | |
---|---|---|
0 | -0.009400 | -0.031872 |
1 | 0.002125 | 0.022185 |
2 | 0.000382 | 0.012833 |
3 | 0.003263 | -0.018164 |
4 | -0.019299 | -0.038061 |
... | ... | ... |
128 | -0.002581 | -0.000772 |
129 | -0.001217 | 0.017405 |
130 | -0.012284 | -0.028842 |
131 | 0.007371 | 0.008079 |
132 | 0.005252 | -0.011087 |
133 rows × 2 columns