Ordinary Least Squares (OLS) imputation - MicroImpute documentation

This notebook demonstrates how to use Microimpute’s OLS imputer to impute values using linear regression. OLS imputation is a parametric approach that assumes a linear relationship between the predictor variables and the variable being imputed.

Variable type support¶

The OLS model handles both numerical and categorical variables. When imputing numerical targets, it uses standard linear regression. For categorical targets (strings, booleans, or numerically encoded categorical variables), it switches to logistic regression classification internally. You don’t need to specify variable types; the model detects them and adapts automatically.

The OLS model supports iterative imputation of several variables with a single object and workflow. Pass a list of imputed_variables containing every variable you want to impute, and the model handles them all without a separate fit and predict call for each.
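The type detection described above can be pictured with a small sketch. This is not MicroImpute's implementation: the helper `choose_model_kind` and its `max_numeric_categories` threshold are hypothetical, and the heuristic for numerically encoded categories is only one plausible rule.

```python
import pandas as pd


def choose_model_kind(series: pd.Series, max_numeric_categories: int = 10) -> str:
    """Return 'logistic' for categorical-looking targets, otherwise 'linear'.

    Hypothetical heuristic for illustration only; the real detection rules
    inside MicroImpute may differ.
    """
    if pd.api.types.is_bool_dtype(series):
        return "logistic"
    if not pd.api.types.is_numeric_dtype(series):
        return "logistic"  # strings, object, or category dtypes
    # Numeric column with only a few integer-valued levels:
    # treat it as a numerically encoded categorical variable.
    values = series.dropna()
    is_integer_valued = bool((values % 1 == 0).all())
    if is_integer_valued and values.nunique() <= max_numeric_categories:
        return "logistic"
    return "linear"


toy = pd.DataFrame(
    {
        "income": [10.5, 20.1, 33.3, 41.0],
        "employed": [True, False, True, True],
        "region": ["north", "south", "south", "east"],
        "household_type": [1, 2, 1, 3],
    }
)

# One decision per imputed variable, mirroring the single-workflow idea
plan = {col: choose_model_kind(toy[col]) for col in toy.columns}
print(plan)
```

In this toy frame only `income` would be routed to linear regression; the other three columns look categorical under the assumed rule.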

OLS class¶

class OLS(
    log_level: Optional[str] = "WARNING"
)

Parameter	Type	Default used	Description
log_level	str	“WARNING”	Logging verbosity level

fit() method¶

def fit(
    X_train: pd.DataFrame,
    predictors: List[str],
    imputed_variables: List[str],
    weight_col: Optional[str] = None,
) -> OLSResults

Parameter	Type	Default used	Description
X_train	pd.DataFrame	-	Training data with predictors and target variables
predictors	List[str]	-	Column names to use as predictors
imputed_variables	List[str]	-	Column names to impute
weight_col	str	None	Column name for sampling weights

It returns an OLSResults object for prediction. Internally, it uses linear regression for numerical targets and logistic regression for categorical/boolean targets.

OLSResults.predict() method¶

def predict(
    X_test: pd.DataFrame,
    quantiles: Optional[List[float]] = None
) -> Dict[float, pd.DataFrame]

Parameter	Type	Default	Description
X_test	pd.DataFrame	-	Data to impute (with predictors)
quantiles	List[float]	None	Quantiles at which to return predictions

It returns a dictionary mapping quantiles to DataFrames of imputed values. For numerical variables, quantile predictions are based on the normal distribution assumption. For categorical variables, predictions are sampled from the predicted probability distribution.
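To make the normal distribution assumption concrete, here is a minimal sketch of how quantile predictions can be derived from a fitted linear model: the conditional mean is shifted by the residual standard deviation times the standard normal quantile. The function `normal_quantile_predictions` and the synthetic data are illustrative assumptions; MicroImpute's internal computation may differ in detail.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

# Fit OLS by least squares on the design matrix [1, x]
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual standard deviation (assumed constant, i.e. homoscedastic errors)
residuals = y - X @ beta
sigma = np.sqrt(residuals @ residuals / (len(y) - X.shape[1]))


def normal_quantile_predictions(X_new, quantiles):
    """Map each quantile q to mean + sigma * z_q under a normal error model."""
    mean = X_new @ beta
    return {q: mean + sigma * NormalDist().inv_cdf(q) for q in quantiles}


X_new = np.array([[1.0, 0.0], [1.0, 1.0]])
preds = normal_quantile_predictions(X_new, [0.1, 0.5, 0.9])
for q, values in preds.items():
    print(q, values)
```

Under this assumption the 0.5 quantile coincides with the ordinary point prediction, and the other quantiles are spaced symmetrically around it with the same width for every record.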

Setup and data preparation¶

# Import necessary libraries
import warnings
warnings.filterwarnings("ignore")

import logging
logging.getLogger("joblib").setLevel(logging.ERROR)

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.datasets import load_diabetes

pd.set_option("display.width", 600)
pd.set_option("display.max_columns", 10)
pd.set_option("display.expand_frame_repr", False)

# Import Microimpute tools
from microimpute.utils.data import preprocess_data
from microimpute.evaluations import cross_validate_model
from microimpute.models import OLS
from microimpute.config import QUANTILES
from microimpute.visualizations import model_performance_results

# Load the diabetes dataset
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Display the first few rows of the dataset
df.head()

# Define variables for the model
predictors = ["age", "sex", "bmi", "bp"]
imputed_variables = [
    "s1",
    "s4",
]  # We'll impute 's1' (total serum cholesterol) and 's4' (total cholesterol/HDL ratio)

# Create a subset with only needed columns
diabetes_df = df[predictors + imputed_variables]

# Display summary statistics
diabetes_df.describe()

# Split data into training and testing sets
X_train, X_test = preprocess_data(diabetes_df)

# Let's see how many records we have in each set
print(f"Training set size: {X_train.shape[0]} records")
print(f"Testing set size: {X_test.shape[0]} records")

Training set size: 353 records
Testing set size: 89 records

Simulating missing data¶

For this example, we’ll simulate missing data in our test set by removing the values we want to impute.

# Create a copy of the test set with missing values
X_test_missing = X_test.copy()

# Store the actual values for later comparison
actual_values = X_test_missing[imputed_variables].copy()

# Remove the values to be imputed
X_test_missing[imputed_variables] = np.nan

X_test_missing.head()

Training and using the OLS imputer¶

Now we’ll train the OLS imputer and use it to impute the missing values in our test set.

# Define quantiles we want to model
# We'll use the default quantiles from the config module
print(f"Modeling these quantiles: {QUANTILES}")

Modeling these quantiles: [np.float64(0.05), np.float64(0.1), np.float64(0.15), np.float64(0.2), np.float64(0.25), np.float64(0.3), np.float64(0.35), np.float64(0.4), np.float64(0.45), np.float64(0.5), np.float64(0.55), np.float64(0.6), np.float64(0.65), np.float64(0.7), np.float64(0.75), np.float64(0.8), np.float64(0.85), np.float64(0.9), np.float64(0.95)]

# Initialize the OLS imputer
ols_imputer = OLS()

# Fit the model with our training data
# This trains a linear regression model
fitted_ols_imputer = ols_imputer.fit(X_train, predictors, imputed_variables)

# Impute values in the test set
# This uses the trained OLS model to predict missing values
imputed_values = fitted_ols_imputer.predict(X_test_missing, QUANTILES)

# Display the first few imputed values at the median (0.5 quantile)
imputed_values[0.5].head()

Evaluating the imputation results¶

Now let’s compare the imputed values with the actual values to evaluate the performance of our imputer. To see how well OLS captures variability across quantiles, we find, for each data point, the quantile prediction closest to the true value and plot it against that value.

# Define your quantiles
quantiles = list(imputed_values.keys())

# Convert imputed_values dict to a 2D array: (n_samples * n_variables, n_quantiles)
# Flattening stacks the values of all imputed variables row by row
pred_matrix = np.stack(
    [imputed_values[q].values.flatten() for q in quantiles], axis=1
)

# Actual values flattened in the same row-by-row order
actual = actual_values.values.flatten()

# Compute absolute error matrix: shape (n_samples * n_variables, n_quantiles)
abs_error = np.abs(pred_matrix - actual[:, None])

# Find index of closest prediction for each sample
closest_indices = abs_error.argmin(axis=1)

# Select the closest predictions
closest_predictions = np.array(
    [pred_matrix[i, idx] for i, idx in enumerate(closest_indices)]
)

# Wrap as DataFrame for plotting
closest_df = pd.DataFrame(
    {
        "Actual": actual,
        "ClosestPrediction": closest_predictions,
    }
)

# Extract median predictions for evaluation
median_predictions = imputed_values[0.5]

# Create a scatter plot comparing actual vs. imputed values
min_val = min(actual_values.min().min(), median_predictions.min().min())
max_val = max(actual_values.max().max(), median_predictions.max().max())

# Create the scatter plot
fig = px.scatter(
    closest_df,
    x="Actual",
    y="ClosestPrediction",
    opacity=0.7,
    title="Comparison of actual vs. imputed values using OLS",
)

# Add the diagonal line (perfect prediction line)
fig.add_trace(
    go.Scatter(
        x=[min_val, max_val],
        y=[min_val, max_val],
        mode="lines",
        line=dict(color="red", dash="dash"),
        name="Perfect prediction",
    )
)

# Update layout
fig.update_layout(
    xaxis_title="Actual values",
    yaxis_title="Imputed values",
    width=750,
    height=600,
    template="plotly_white",
    margin=dict(l=50, r=50, t=80, b=50),  # Adjust margins
)

fig.show()

This scatter plot compares actual observed values with those imputed by an OLS linear regression model, providing a visual assessment of imputation accuracy. Each point represents one imputed value, with the x-axis showing the true value and the y-axis showing the quantile prediction closest to it. Because the closest quantile is chosen with knowledge of the true value, the plot gives an optimistic view of accuracy rather than the error of a single point prediction. The red dashed line represents the ideal 1:1 relationship, where predictions perfectly match actual values. Most points cluster around this line, suggesting that the OLS model captures the underlying linear structure of the diabetes data. However, it tends to underpredict in the upper tail of the distribution. This suggests that OLS can be a powerful method for imputing missing values when the relationship between features and the target variable is linear and homoscedastic, but may perform worse otherwise.
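A numerical summary can complement the visual check. The sketch below uses toy numbers (not the notebook's results) and a hypothetical helper `point_metrics` to compare the error of the median prediction with the error of the closest-quantile prediction used in the plot.

```python
import numpy as np


def point_metrics(actual, predicted):
    """Mean absolute error, root mean squared error, and mean bias."""
    errors = np.asarray(predicted) - np.asarray(actual)
    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors**2))),
        "bias": float(np.mean(errors)),
    }


# Toy data: four records, predictions at three quantiles (Q10, Q50, Q90)
actual = np.array([0.02, -0.01, 0.05, 0.00])
pred_matrix = np.array(
    [
        [-0.03, 0.01, 0.05],
        [-0.04, 0.00, 0.04],
        [-0.02, 0.02, 0.06],
        [-0.03, 0.01, 0.05],
    ]
)

median_pred = pred_matrix[:, 1]

# Oracle choice: for each record, the quantile prediction nearest the truth
closest_idx = np.abs(pred_matrix - actual[:, None]).argmin(axis=1)
closest_pred = pred_matrix[np.arange(len(actual)), closest_idx]

median_metrics = point_metrics(actual, median_pred)
closest_metrics = point_metrics(actual, closest_pred)
print("Median prediction: ", median_metrics)
print("Closest prediction:", closest_metrics)
```

The closest-quantile error can never exceed the median error, since the median is one of the candidates. The median metrics are therefore the fairer measure of point accuracy.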

Examining quantile predictions¶

The OLS imputer generates quantile predictions based on the normal distribution assumption, which can help understand prediction uncertainty.

# Compare predictions at different quantiles for the first 5 records
quantiles_to_show = QUANTILES
comparison_df = pd.DataFrame(index=range(5))

# Add actual values
comparison_df["Actual"] = actual_values.iloc[:5, 0].values

# Add quantile predictions
for q in quantiles_to_show:
    comparison_df[f"Q{int(q*100)}"] = imputed_values[q].iloc[:5, 0].values

comparison_df

Visualizing prediction intervals¶

By visualizing the prediction intervals of the model’s imputations we can better understand the uncertainty in our imputed values.

# Create a prediction interval plot for the first 10 records
# Number of records to plot
n_records = 10

# Prepare data for plotting
records = list(range(n_records))
actuals = actual_values.iloc[:n_records, 0].values
medians = imputed_values[0.5].iloc[:n_records, 0].values
q30 = imputed_values[0.3].iloc[:n_records, 0].values
q70 = imputed_values[0.7].iloc[:n_records, 0].values
q10 = imputed_values[0.1].iloc[:n_records, 0].values
q90 = imputed_values[0.9].iloc[:n_records, 0].values

# Create the base figure
fig = go.Figure()

# Add 80% prediction interval (Q10-Q90)
for i in range(n_records):
    fig.add_trace(
        go.Scatter(
            x=[i, i],
            y=[q10[i], q90[i]],
            mode="lines",
            line=dict(width=10, color="rgba(173, 216, 230, 0.3)"),
            hoverinfo="none",
            showlegend=False,
        )
    )

# Add 40% prediction interval (Q30-Q70)
for i in range(n_records):
    fig.add_trace(
        go.Scatter(
            x=[i, i],
            y=[q30[i], q70[i]],
            mode="lines",
            line=dict(width=10, color="rgba(70, 130, 180, 0.5)"),
            hoverinfo="none",
            showlegend=False,
        )
    )

# Add actual values
fig.add_trace(
    go.Scatter(
        x=records,
        y=actuals,
        mode="markers",
        marker=dict(color="black", size=8),
        name="Actual",
    )
)

# Add median predictions
fig.add_trace(
    go.Scatter(
        x=records,
        y=medians,
        mode="markers",
        marker=dict(color="red", size=8),
        name="Median (Q50)",
    )
)

# Add a legend entry for the 80% interval (dummy trace)
fig.add_trace(
    go.Scatter(
        x=[-1, -1],  # Dummy points for legend
        y=[0, 0],  # Dummy points for legend
        mode="lines",
        line=dict(color="rgba(173, 216, 230, 0.3)", width=10),
        name="80% PI (Q10-Q90)",
    )
)

# Add a legend entry for the 40% interval (dummy trace)
fig.add_trace(
    go.Scatter(
        x=[-1, -1],  # Dummy points for legend
        y=[0, 0],  # Dummy points for legend
        mode="lines",
        line=dict(color="rgba(70, 130, 180, 0.5)", width=10),
        name="40% PI (Q30-Q70)",
    )
)

# Update layout with smaller width to fit in the book layout
fig.update_layout(
    title="OLS imputation prediction intervals",
    xaxis=dict(
        title="Data record index",
        showgrid=True,
        gridwidth=1,
        gridcolor="rgba(211, 211, 211, 0.7)",
    ),
    yaxis=dict(
        title="Total serum cholesterol (s1)",
        showgrid=True,
        gridwidth=1,
        gridcolor="rgba(211, 211, 211, 0.7)",
    ),
    width=750,
    height=600,
    template="plotly_white",
    margin=dict(l=50, r=50, t=80, b=50),  # Adjust margins
    legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99),
)

fig.show()

This plot illustrates the prediction intervals generated by an OLS linear regression model for imputing total serum cholesterol values across ten records. For each observation, the red dot represents the median prediction (Q50), while the black dot indicates the true observed value. Vertical bars depict the model’s 40% prediction interval (Q30–Q70) in darker blue and the 80% prediction interval (Q10–Q90) in light blue. The intervals convey the model’s estimate of uncertainty, with wider intervals indicating less certainty about the imputed value. In some cases, the actual value falls within the 80% interval, suggesting that the OLS model is reasonably well calibrated. However, the intervals are symmetric around the median and relatively wide, and they sometimes miss the real values altogether. This reflects the linear nature of OLS with normally distributed errors: it is less responsive to local heteroscedasticity or skewness, and possibly limited in imputing power. Compared to Quantile Regression Forests, which can produce more adaptive and asymmetric intervals, the intervals here are more uniform in shape and spread. Overall, this plot shows that OLS can perform fairly well on homoscedastic, simple linear datasets, though the fit may be quite limited in highly nonlinear settings.
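Calibration can also be checked numerically by measuring how often the actual value falls inside an interval. The following sketch uses made-up numbers and a hypothetical helper `interval_coverage`; for a well-calibrated model, the empirical coverage of the Q10 to Q90 interval should approach 0.8 on a large enough sample.

```python
import numpy as np


def interval_coverage(actual, lower, upper):
    """Fraction of actual values inside [lower, upper], bounds included."""
    actual, lower, upper = map(np.asarray, (actual, lower, upper))
    return float(np.mean((actual >= lower) & (actual <= upper)))


# Toy values for five records (illustrative only)
actual = np.array([0.00, 0.03, -0.05, 0.01, 0.08])
q10 = np.array([-0.04, -0.03, -0.04, -0.05, -0.02])
q90 = np.array([0.04, 0.05, 0.04, 0.03, 0.06])

coverage_80 = interval_coverage(actual, q10, q90)
print(f"Empirical coverage of the nominal 80% interval: {coverage_80:.0%}")
```

With the notebook's objects, the same function could be applied to `actual_values` and the `imputed_values[0.1]` and `imputed_values[0.9]` frames, column by column.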

Assessing the method’s performance¶

To check whether our model is overfitting and ensure robust results we can perform cross-validation and visualize the results.

predictors = ["age", "sex", "bmi", "bp"]
imputed_variables = ["s1", "s4"]

# Run cross-validation on the same data set
ols_results = cross_validate_model(
    OLS, diabetes_df, predictors, imputed_variables
)

# Check if we have quantile loss results (for numerical variables)
if "quantile_loss" in ols_results:
    print("Quantile loss results:")
    print(ols_results["quantile_loss"]["results"])

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.

Quantile loss results:
           0.05      0.10      0.15      0.20      0.25  ...      0.75      0.80      0.85      0.90      0.95
train  0.003837  0.006478  0.008737  0.010719  0.012360  ...  0.014393  0.012980  0.011086  0.008605  0.005269
test   0.003877  0.006548  0.008872  0.010896  0.012535  ...  0.014515  0.013096  0.011194  0.008689  0.005352

[2 rows x 19 columns]
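For reference, quantile loss is commonly defined as the pinball loss, which penalizes underprediction with weight q and overprediction with weight 1 - q. The sketch below shows that standard definition; whether `cross_validate_model` computes exactly this form is an assumption.

```python
import numpy as np


def quantile_loss(q, actual, predicted):
    """Mean pinball loss of predictions for quantile level q."""
    diff = np.asarray(actual) - np.asarray(predicted)
    # diff > 0 (underprediction) costs q * diff;
    # diff < 0 (overprediction) costs (1 - q) * |diff|
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


actual = np.array([1.0, 2.0, 3.0])
predicted = np.array([1.5, 1.5, 1.5])

for q in (0.1, 0.5, 0.9):
    print(f"q={q}: {quantile_loss(q, actual, predicted):.4f}")
```

At q = 0.5 the loss reduces to half the mean absolute error, which is why losses at different quantile levels are best compared level by level rather than against each other.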

# Plot the results for numerical variables
if "quantile_loss" in ols_results:
    perf_results_viz = model_performance_results(
        results=ols_results["quantile_loss"]["results"],
        model_name="OLS",
        method_name="Cross-validation quantile loss average",
    )
    fig = perf_results_viz.plot(
        title="OLS cross-validation performance",
    )
    fig.show()

Categorical variable imputation¶

OLS automatically handles categorical variables through logistic regression classification. Let’s evaluate its performance on categorical imputation tasks.

# Create a dataset with categorical variables
np.random.seed(42)

# Create synthetic categorical variables based on diabetes features
df_categorical = pd.DataFrame()
df_categorical['age'] = df['age']
df_categorical['sex'] = df['sex']  
df_categorical['bmi'] = df['bmi']
df_categorical['bp'] = df['bp']
df_categorical['risk_level'] = pd.qcut(df['s1'], 
                                q=3, 
                                labels=['low', 'medium', 'high'],
                                ).astype(str)

print("Categorical variable distribution:")
print(pd.Series(df_categorical['risk_level']).value_counts())
print(f"\nPercentage distribution:")
print(pd.Series(df_categorical['risk_level']).value_counts(normalize=True))
print(f"\nData types: {df_categorical.dtypes.to_dict()}")

# Split the categorical data for training and testing
X_train_cat, X_test_cat = preprocess_data(df_categorical)

print(f"\nTraining set size: {X_train_cat.shape[0]} records")
print(f"Testing set size: {X_test_cat.shape[0]} records")

Categorical variable distribution:
risk_level
low       148
high      148
medium    146
Name: count, dtype: int64

Percentage distribution:
risk_level
low       0.334842
high      0.334842
medium    0.330317
Name: proportion, dtype: float64

Data types: {'age': dtype('float64'), 'sex': dtype('float64'), 'bmi': dtype('float64'), 'bp': dtype('float64'), 'risk_level': dtype('O')}

Training set size: 353 records
Testing set size: 89 records

# Fit OLS model for categorical imputation
predictors_cat = ["age", "sex", "bmi", "bp"]
imputed_variables_cat = ["risk_level"]

# Initialize and fit the OLS imputer
ols_cat_imputer = OLS()
fitted_ols_cat = ols_cat_imputer.fit(X_train_cat, predictors_cat, imputed_variables_cat)

print("OLS model fitted for categorical variable imputation")

# Create test set with missing categorical values
X_test_cat_missing = X_test_cat.copy()
actual_cat_values = X_test_cat_missing[imputed_variables_cat].copy()
X_test_cat_missing[imputed_variables_cat] = np.nan

# Impute the categorical values
# For categorical variables, all quantiles return the same prediction
imputed_cat_values = fitted_ols_cat.predict(X_test_cat_missing, [0.5])

OLS model fitted for categorical variable imputation

Assessing categorical imputation performance¶

We can look at the accuracy of the model’s predictions to understand the quality of its categorical imputations. Cross-validation will employ log loss to evaluate the performance of the logistic regression method used.

# Evaluate categorical imputation accuracy
from sklearn.metrics import accuracy_score, confusion_matrix

# Get predictions and actual values
predicted = imputed_cat_values[0.5]['risk_level'].values
actual = actual_cat_values['risk_level'].values

# Calculate accuracy
accuracy = accuracy_score(actual, predicted)
print(f"Categorical imputation accuracy: {accuracy:.2%}")

# Create confusion matrix with an explicit label order.
# Without `labels`, scikit-learn sorts the classes alphabetically
# (high, low, medium), which would not match the row/column names below.
class_order = ['low', 'medium', 'high']
conf_matrix = pd.DataFrame(
    confusion_matrix(actual, predicted, labels=class_order),
    index=[f'Actual: {c}' for c in class_order],
    columns=[f'Predicted: {c}' for c in class_order]
)
print("\nConfusion matrix:")
print(conf_matrix)

Categorical imputation accuracy: 40.45%

Confusion matrix:
                Predicted: low  Predicted: medium  Predicted: high
Actual: low                 20                  0                9
Actual: medium              19                  1               10
Actual: high                11                  4               15

# Run cross-validation for categorical variables
predictors_cat = ["age", "sex", "bmi", "bp"]
imputed_variables_cat = ["risk_level"]

ols_categorical_results = cross_validate_model(
    OLS, df_categorical, predictors_cat, imputed_variables_cat
)

# Display results
print("Categorical imputation cross-validation results (log loss):")
print(f"Mean train log loss: {ols_categorical_results['log_loss']['mean_train']:.4f}")
print(f"Mean test log loss: {ols_categorical_results['log_loss']['mean_test']:.4f}")

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.

Categorical imputation cross-validation results (log loss):
Mean train log loss: 1.0673
Mean test log loss: 1.0776
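A useful reference point for these numbers is the log loss of a model that assigns equal probability to every class, which equals ln(K) for K classes. The sketch below computes that baseline for three classes with a small hand-written helper (`multiclass_log_loss` is illustrative, not part of MicroImpute).

```python
import numpy as np


def multiclass_log_loss(y_true_idx, proba, eps=1e-15):
    """Mean negative log probability assigned to the true class."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1.0)
    rows = np.arange(len(y_true_idx))
    return float(-np.mean(np.log(proba[rows, y_true_idx])))


n_classes = 3
y_true_idx = np.array([0, 1, 2, 1])  # toy class indices

# Uniform predictions: probability 1/3 for each class, for every record
uniform = np.full((len(y_true_idx), n_classes), 1.0 / n_classes)
baseline = multiclass_log_loss(y_true_idx, uniform)

print(f"Uniform-guess log loss for {n_classes} classes: {baseline:.4f}")
print(f"ln({n_classes}) = {np.log(n_classes):.4f}")
```

The baseline is about 1.0986, so the mean test log loss of 1.0776 reported above is only slightly better than uniform guessing. This is consistent with the modest accuracy: the four predictors carry limited information about the cholesterol-based risk level.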

# Plot the categorical imputation performance
cat_perf_results_viz = model_performance_results(
    results=ols_categorical_results,
    model_name="OLS",
    method_name="Cross-validation log loss average",
    metric="log_loss",
)
fig = cat_perf_results_viz.plot(
    title="OLS categorical imputation cross-validation performance",
)
fig.show()