# Quantile Regression imputation

This notebook demonstrates how to use `microimpute`'s QuantReg imputer to impute values using quantile regression. Quantile regression is a technique that extends linear regression to estimate the conditional quantiles of a response variable, providing a more complete view of the relationship between variables.
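
To make "estimating a conditional quantile" concrete, the sketch below shows the quantile (pinball) loss that quantile regression minimizes. It is a dependency-free illustration, not `microimpute`'s implementation; the helper names `pinball_loss` and `best_constant` are hypothetical. It searches the observed values for the constant prediction with the lowest loss, which for a constant model is a sample quantile.

```python
def pinball_loss(y_true, y_pred, q):
    """Average quantile (pinball) loss of predictions y_pred at quantile q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        residual = y - p
        # Under-predictions are weighted by q, over-predictions by (1 - q).
        total += max(q * residual, (q - 1) * residual)
    return total / len(y_true)


def best_constant(y, q):
    """Return the observed value that minimizes the pinball loss at quantile q."""
    return min(y, key=lambda c: pinball_loss(y, [c] * len(y), q))


y = [1.0, 2.0, 3.0, 4.0, 100.0]
median_fit = best_constant(y, 0.5)  # minimizer at q=0.5 is the sample median
upper_fit = best_constant(y, 0.9)   # a high quantile is pulled to the upper tail
print(median_fit, upper_fit)
```

Quantile regression replaces the constant with a linear function of the predictors and minimizes the same loss, once per quantile.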

The QuantReg model supports iterative imputation of several variables with a single object and workflow. Pass every variable you want to impute in the `imputed_variables` list, and the model handles them all without a separate fit and predict call for each.

## Setup and data preparation

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.datasets import load_diabetes
import warnings

# Set pandas display options to limit table width
pd.set_option("display.width", 600)
pd.set_option("display.max_columns", 10)
pd.set_option("display.expand_frame_repr", False)

# Import MicroImpute tools
from microimpute.comparisons.data import preprocess_data
from microimpute.evaluations import cross_validate_model
from microimpute.models import QuantReg
from microimpute.config import QUANTILES
from microimpute.visualizations.plotting import model_performance_results


In [2]:
# Load the diabetes dataset
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Display the first few rows of the dataset
df.head()

,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641


In [3]:
# Define variables for the model
predictors = ["age", "sex", "bmi", "bp"]
imputed_variables = [
    "s1",
    "s4",
]  # We'll impute 's1' (total serum cholesterol) and 's4' (total cholesterol/HDL ratio)

# Create a subset with only needed columns
diabetes_df = df[predictors + imputed_variables]

# Display summary statistics
diabetes_df.describe()

,age,sex,bmi,bp,s1,s4
count,442.0,442.0,442.0,442.0,442.0,442.0
mean,-2.511817e-19,1.23079e-17,-2.245564e-16,-4.79757e-17,-1.3814990000000001e-17,-9.04254e-18
std,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905
min,-0.1072256,-0.04464164,-0.0902753,-0.1123988,-0.1267807,-0.0763945
25%,-0.03729927,-0.04464164,-0.03422907,-0.03665608,-0.03424784,-0.03949338
50%,0.00538306,-0.04464164,-0.007283766,-0.005670422,-0.004320866,-0.002592262
75%,0.03807591,0.05068012,0.03124802,0.03564379,0.02835801,0.03430886
max,0.1107267,0.05068012,0.1705552,0.1320436,0.1539137,0.1852344


In [4]:
# Split data into training and testing sets
X_train, X_test = preprocess_data(diabetes_df)

# Let's see how many records we have in each set
print(f"Training set size: {X_train.shape[0]} records")
print(f"Testing set size: {X_test.shape[0]} records")

Training set size: 353 records
Testing set size: 89 records


## Simulating missing data

For this example, we'll simulate missing data in our test set by removing the values we want to impute.

In [5]:
# Create a copy of the test set with missing values
X_test_missing = X_test.copy()

# Store the actual values for later comparison
actual_values = X_test_missing[imputed_variables].copy()

# Remove the values to be imputed
X_test_missing[imputed_variables] = np.nan

X_test_missing.head()

,age,sex,bmi,bp,s1,s4
287,0.045341,-0.044642,-0.006206,-0.015999,,
211,0.092564,-0.044642,0.036907,0.021872,,
72,0.063504,0.05068,-0.00405,-0.012556,,
321,0.096197,-0.044642,0.051996,0.079265,,
73,0.012648,0.05068,-0.020218,-0.002228,,


## Training and using the QuantReg imputer

Now we'll train the QuantReg imputer and use it to impute the missing values in our test set. For quantile regression, we need to explicitly specify which quantiles to model during fitting.

In [6]:
# Define quantiles we want to model
# We'll use the default quantiles from the config module
print(f"Modeling these quantiles: {QUANTILES}")

Modeling these quantiles: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]


In [7]:
warnings.filterwarnings("ignore")

# Initialize the QuantReg imputer
quantreg_imputer = QuantReg()

# Fit the model with our training data
# This trains a separate regression model for each quantile
fitted_quantreg_imputer = quantreg_imputer.fit(
    X_train, predictors, imputed_variables, quantiles=QUANTILES
)
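
The idea of fitting one regression per quantile can be sketched without any library. The example below is an illustration under stated assumptions, not how `QuantReg` is implemented: it handles a single predictor, and the helper `fit_quantile_line` is hypothetical. It relies on the fact that, with one predictor and an intercept, some optimal quantile-regression line passes through two data points, so an exhaustive search over pairs is exact for tiny samples.

```python
from itertools import combinations


def pinball_loss(residuals, q):
    """Average quantile loss for residuals (actual minus predicted)."""
    return sum(max(q * r, (q - 1) * r) for r in residuals) / len(residuals)


def fit_quantile_line(x, y, q):
    """Fit y ~ intercept + slope * x at quantile q by exhaustive search."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # two points with the same x do not define a fit
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        residuals = [yk - (intercept + slope * xk) for xk, yk in zip(x, y)]
        loss = pinball_loss(residuals, q)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 30.0]  # y = 1 + 2x with one large outlier

# One independent fit per quantile, stored in a dict keyed by quantile.
fits = {q: fit_quantile_line(x, y, q) for q in (0.1, 0.5, 0.9)}
for q, (intercept, slope) in fits.items():
    print(f"q={q}: intercept={intercept:.2f}, slope={slope:.2f}")
```

The median fit ignores the outlier and recovers the underlying line, while the 0.9 fit tilts toward it. This is why each quantile needs its own set of coefficients.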

In [8]:
# Impute values in the test set
# This uses the trained quantile regression models to predict missing values
imputed_values = fitted_quantreg_imputer.predict(X_test_missing, QUANTILES)

# Display the first few imputed values at the median (0.5 quantile)
imputed_values[0.5].head()

,s1,s4
287,0.005433,-0.016971
211,0.029121,0.002011
72,0.008247,0.012233
321,0.041591,0.00642
73,-0.005648,0.001657


## Evaluating the imputation results

Now let's compare the imputed values with the actual values to evaluate the performance of our imputer. To see how well QuantReg captures variability across quantiles, we find, for each data point, the quantile prediction closest to the true value and plot it against that value.

In [9]:
# Define your quantiles
quantiles = list(imputed_values.keys())

# Stack the per-quantile predictions into a 2D array of shape
# (n_values, n_quantiles), where n_values pools both imputed variables
pred_matrix = np.stack(
    [imputed_values[q].values.flatten() for q in quantiles], axis=1
)

# Actual values flattened in the same row-major order
actual = actual_values.values.flatten()

# Compute absolute error matrix: shape (n_values, n_quantiles)
abs_error = np.abs(pred_matrix - actual[:, None])

# Find index of closest prediction for each sample
closest_indices = abs_error.argmin(axis=1)

# Select the closest predictions
closest_predictions = np.array(
    [pred_matrix[i, idx] for i, idx in enumerate(closest_indices)]
)

# Wrap as DataFrame for plotting
closest_df = pd.DataFrame(
    {
        "Actual": actual,
        "ClosestPrediction": closest_predictions,
    }
)

# Axis range for the diagonal, taken from the values that are plotted
min_val = min(
    closest_df["Actual"].min(), closest_df["ClosestPrediction"].min()
)
max_val = max(
    closest_df["Actual"].max(), closest_df["ClosestPrediction"].max()
)

# Create the scatter plot
fig = px.scatter(
    closest_df,
    x="Actual",
    y="ClosestPrediction",
    opacity=0.7,
    title="Comparison of actual vs. imputed values using QuantReg",
)

# Add the diagonal line (perfect prediction line)
fig.add_trace(
    go.Scatter(
        x=[min_val, max_val],
        y=[min_val, max_val],
        mode="lines",
        line=dict(color="red", dash="dash"),
        name="Perfect Prediction",
    )
)

# Update layout
fig.update_layout(
    xaxis_title="Actual values",
    yaxis_title="Imputed values",
    width=750,
    height=600,
    template="plotly_white",
    margin=dict(l=50, r=50, t=80, b=50),  # Adjust margins
)

fig.show()

This scatter plot compares the actual observed values with the closest QuantReg prediction across all modeled quantiles, with both imputed variables pooled together. Each point represents one imputed value, with the x-axis showing the true value and the y-axis showing the quantile prediction nearest to it. The red dashed line represents the ideal 1:1 relationship, where predictions perfectly match actual values. Because the closest quantile is chosen with knowledge of the true value, the plot shows how well the range of modeled quantiles spans the data rather than the accuracy of any single point prediction. Most points cluster around the line, suggesting that the QuantReg model effectively captures the underlying structure of the data. However, it tends to overpredict slightly in the lower tail of the distribution and underpredict slightly in the upper tail, as expected when true values fall outside the range spanned by the lowest and highest modeled quantiles. Overall, QuantReg can be a powerful method for imputing missing values when the relationship between the predictors and the target variable is approximately linear and homoscedastic.

## Examining quantile predictions

Quantile regression provides predictions at different quantiles, which helps us understand the entire conditional distribution of the missing values.

In [10]:
# Compare predictions at different quantiles for the first 5 records
quantiles_to_show = QUANTILES
comparison_df = pd.DataFrame(index=range(5))

# Add actual values
comparison_df["Actual"] = actual_values.iloc[:5, 0].values

# Add quantile predictions
for q in quantiles_to_show:
    # round() avoids floating-point truncation when building the column label
    comparison_df[f"Q{round(q * 100)}"] = imputed_values[q].iloc[:5, 0].values

comparison_df

,Actual,Q5,Q10,Q15,Q20,...,Q75,Q80,Q85,Q90,Q95
0,0.125019,-0.054864,-0.045185,-0.038537,-0.03128,...,0.035409,0.048313,0.05597,0.069883,0.080143
1,-0.02496,-0.03621,-0.026032,-0.019383,-0.009854,...,0.061573,0.079184,0.084216,0.104716,0.122164
2,0.103003,-0.064832,-0.056056,-0.04505,-0.035142,...,0.036322,0.053139,0.063106,0.075872,0.091014
3,0.054845,-0.02497,-0.013884,-0.007525,0.00103,...,0.070107,0.087384,0.092593,0.113554,0.140107
4,0.038334,-0.072012,-0.064371,-0.054504,-0.046595,...,0.019022,0.030479,0.043667,0.052178,0.071291
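
Because each quantile is fitted by its own regression, nothing forces the predictions for a record to increase with the quantile level, so a quick monotonicity check can be useful. The sketch below is a hypothetical helper (`crossing_pairs` is not part of `microimpute`) applied to the values displayed for record 0 in the table above; the columns elided by the display are left out.

```python
def crossing_pairs(quantiles, predictions):
    """Return adjacent quantile pairs whose predictions are out of order."""
    ordered = sorted(zip(quantiles, predictions))
    return [
        (q_lo, q_hi)
        for (q_lo, p_lo), (q_hi, p_hi) in zip(ordered, ordered[1:])
        if p_hi < p_lo
    ]


# Quantile levels and predictions shown for record 0 in the comparison table.
shown_quantiles = [0.05, 0.10, 0.15, 0.20, 0.75, 0.80, 0.85, 0.90, 0.95]
record_0 = [
    -0.054864, -0.045185, -0.038537, -0.03128,
    0.035409, 0.048313, 0.05597, 0.069883, 0.080143,
]

crossings = crossing_pairs(shown_quantiles, record_0)
print(crossings if crossings else "no crossings among the displayed quantiles")
```

An empty result means the displayed predictions behave like a valid set of quantiles for that record.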


## Visualizing prediction intervals

By visualizing the prediction intervals of the model's imputations we can better understand the uncertainty in our imputed values.

In [11]:
# Create a prediction interval plot for the first 10 records
# Number of records to plot
n_records = 10

# Prepare data for plotting
records = list(range(n_records))
actuals = actual_values.iloc[:n_records, 0].values
medians = imputed_values[0.5].iloc[:n_records, 0].values
q30 = imputed_values[0.3].iloc[:n_records, 0].values
q70 = imputed_values[0.7].iloc[:n_records, 0].values
q10 = imputed_values[0.1].iloc[:n_records, 0].values
q90 = imputed_values[0.9].iloc[:n_records, 0].values

# Create the base figure
fig = go.Figure()

# Add 80% prediction interval (Q10-Q90)
for i in range(n_records):
    fig.add_trace(
        go.Scatter(
            x=[i, i],
            y=[q10[i], q90[i]],
            mode="lines",
            line=dict(width=10, color="rgba(173, 216, 230, 0.3)"),
            hoverinfo="none",
            showlegend=False,
        )
    )

# Add 40% prediction interval (Q30-Q70)
for i in range(n_records):
    fig.add_trace(
        go.Scatter(
            x=[i, i],
            y=[q30[i], q70[i]],
            mode="lines",
            line=dict(width=10, color="rgba(70, 130, 180, 0.5)"),
            hoverinfo="none",
            showlegend=False,
        )
    )

# Add actual values
fig.add_trace(
    go.Scatter(
        x=records,
        y=actuals,
        mode="markers",
        marker=dict(color="black", size=8),
        name="Actual",
    )
)

# Add median predictions
fig.add_trace(
    go.Scatter(
        x=records,
        y=medians,
        mode="markers",
        marker=dict(color="red", size=8),
        name="Median (Q50)",
    )
)

# Add a legend entry for the 80% prediction interval
fig.add_trace(
    go.Scatter(
        x=[-1, -1],  # Dummy points for legend
        y=[0, 0],  # Dummy points for legend
        mode="lines",
        line=dict(color="rgba(173, 216, 230, 0.3)", width=10),
        name="80% PI (Q10-Q90)",
    )
)

# Add a legend entry for the 40% prediction interval
fig.add_trace(
    go.Scatter(
        x=[-1, -1],  # Dummy points for legend
        y=[0, 0],  # Dummy points for legend
        mode="lines",
        line=dict(color="rgba(70, 130, 180, 0.5)", width=10),
        name="40% PI (Q30-Q70)",
    )
)

# Update layout with smaller width to fit in the book layout
fig.update_layout(
    title="QuantReg imputation prediction intervals",
    xaxis=dict(
        title="Data record index",
        showgrid=True,
        gridwidth=1,
        gridcolor="rgba(211, 211, 211, 0.7)",
    ),
    yaxis=dict(
        title="Total serum cholesterol (s1)",
        showgrid=True,
        gridwidth=1,
        gridcolor="rgba(211, 211, 211, 0.7)",
    ),
    width=750,
    height=600,
    template="plotly_white",
    margin=dict(l=50, r=50, t=80, b=50),  # Adjust margins
    legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99),
)

fig.show()

This plot illustrates the prediction intervals generated by the QuantReg model for imputing total serum cholesterol (`s1`) across the first ten test records. For each observation, the red dot represents the median prediction (Q50), while the black dot indicates the true observed value. Vertical bars depict the model’s 40% prediction interval (Q30–Q70) in dark blue and the 80% prediction interval (Q10–Q90) in light blue. The intervals convey the model’s estimate of uncertainty, with wider intervals indicating less certainty about the imputed value. In many cases the actual value falls within the 80% interval, which is consistent with reasonable calibration, although ten records are too few to confirm it. The intervals tend to be roughly symmetric around the median and relatively wide, which reflects the linear nature of quantile regression: it is less responsive to local heteroskedasticity or skewness. Compared to Quantile Regression Forests, which can produce more adaptive and asymmetric intervals, the intervals here are more uniform in shape and spread. Overall, the plot shows that QuantReg captures uncertainty around its median predictions, though the fit may be somewhat conservative or limited in highly nonlinear settings.
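
Interval calibration can also be checked numerically by counting how often the actual value falls inside an interval. The sketch below uses a hypothetical helper, `empirical_coverage`, on the five records shown in the quantile comparison table earlier (their `Actual`, `Q10`, and `Q90` columns). Five records are far too few to judge calibration, so this only illustrates the computation.

```python
def empirical_coverage(actual, lower, upper):
    """Share of actual values that fall inside [lower, upper]."""
    inside = [lo <= a <= hi for a, lo, hi in zip(actual, lower, upper)]
    return sum(inside) / len(inside)


# Values copied from the comparison table for the first five test records.
actual = [0.125019, -0.02496, 0.103003, 0.054845, 0.038334]
q10 = [-0.045185, -0.026032, -0.056056, -0.013884, -0.064371]
q90 = [0.069883, 0.104716, 0.075872, 0.113554, 0.052178]

coverage = empirical_coverage(actual, q10, q90)
print(f"Empirical coverage of the Q10-Q90 interval: {coverage:.0%}")
```

For a well-calibrated model, the coverage of the Q10–Q90 interval over a large test set should be close to 80%.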

## Assessing the method's performance

To check whether our model is overfitting and to ensure robust results, we can perform cross-validation and visualize the results.

In [12]:
warnings.filterwarnings("ignore")

predictors = ["age", "sex", "bmi", "bp"]
imputed_variables = ["s1", "s4"]

# Run cross-validation on the same data set
quantreg_results = cross_validate_model(
    QuantReg, diabetes_df, predictors, imputed_variables
)

quantreg_results

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    2.4s remaining:    3.6s
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:    2.4s remaining:    1.6s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    2.6s finished


,0.05,0.10,0.15,0.20,0.25,...,0.75,0.80,0.85,0.90,0.95
train,0.003559,0.006266,0.008625,0.010612,0.012261,...,0.014217,0.012836,0.010951,0.008441,0.005035
test,0.003702,0.006428,0.00892,0.010948,0.012559,...,0.014497,0.013223,0.011248,0.008737,0.005352
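
The table reports an average quantile loss per quantile for the train and test folds. The sketch below shows one way such numbers could be produced. It is an assumption-laden illustration, not the internals of `cross_validate_model`: it uses five folds (an inference from the five parallel tasks in the log), synthetic data, and a constant sample-quantile predictor in place of QuantReg. All helper names are hypothetical.

```python
import random


def pinball_loss(y_true, y_pred, q):
    """Average quantile (pinball) loss at quantile q."""
    losses = [max(q * (y - p), (q - 1) * (y - p)) for y, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


def empirical_quantile(values, q):
    """Order-statistic quantile without interpolation."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


def cross_validate_constant_model(y, quantiles, n_splits=5, seed=0):
    """Average train/test pinball loss per quantile over n_splits folds."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_splits] for k in range(n_splits)]
    results = {"train": {}, "test": {}}
    for q in quantiles:
        train_losses, test_losses = [], []
        for k, test_idx in enumerate(folds):
            held_out = set(test_idx)
            y_train = [y[i] for i in indices if i not in held_out]
            y_test = [y[i] for i in test_idx]
            # "Fit" on the training fold only, then score both folds.
            pred = empirical_quantile(y_train, q)
            train_losses.append(pinball_loss(y_train, [pred] * len(y_train), q))
            test_losses.append(pinball_loss(y_test, [pred] * len(y_test), q))
        results["train"][q] = sum(train_losses) / n_splits
        results["test"][q] = sum(test_losses) / n_splits
    return results


rng = random.Random(42)
y = [rng.gauss(0.0, 1.0) for _ in range(200)]  # synthetic target values
results = cross_validate_constant_model(y, [0.1, 0.5, 0.9])
for split, losses in results.items():
    print(split, {q: round(loss, 4) for q, loss in losses.items()})
```

Test losses that sit only slightly above the train losses, as in the table above, indicate little overfitting.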


In [13]:
# Plot the results
perf_results_viz = model_performance_results(
    results=quantreg_results,
    model_name="QuantReg",
    method_name="Cross-validation quantile loss average",
)
fig = perf_results_viz.plot(
    title="QuantReg cross-validation performance",
)
fig.show()