Preprocessing transformations can improve model performance by normalizing scale differences or handling skewed distributions. These are supported by preprocess_data and transformation-specific functions.
Transformation options¶
Microimpute supports three transformation types that can be applied to numeric columns before training. Each transformation automatically excludes categorical and boolean columns to prevent encoding issues.
Normalization (z-score) standardizes data to have mean 0 and standard deviation 1. This transformation is useful when predictors have different scales, ensuring that all features contribute equally to distance-based or gradient-based models.
Log transformation applies the natural logarithm to values. This is effective for right-skewed distributions common in financial data like income or wealth. The transformation requires all values to be strictly positive.
Asinh transformation applies the inverse hyperbolic sine function, which behaves like $\ln(2x)$ for large positive values and $-\ln(2|x|)$ for large negative values, while remaining approximately linear near zero. Unlike log transformation, asinh handles zero and negative values, making it suitable for variables like net worth that can take any real value.
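The asymptotic behavior of asinh can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `asinh_vs_log` is hypothetical and is not part of Microimpute.

```python
import math
from typing import Tuple


def asinh_vs_log(x: float) -> Tuple[float, float]:
    """Return (asinh(x), sign(x) * ln(2|x|)) for a nonzero x.

    The second value is the large-|x| approximation of asinh.
    """
    sign = 1.0 if x > 0 else -1.0
    return math.asinh(x), sign * math.log(2 * abs(x))


# For large magnitudes the two values agree closely.
for x in (1e3, -1e3, 1e6):
    exact, approx = asinh_vs_log(x)
    print(f"x={x:>10.0f}  asinh={exact:.6f}  log approximation={approx:.6f}")

# Near zero asinh is close to the identity, and it is defined at 0
# and for negative inputs, where the logarithm is not.
print(math.asinh(0.0), math.asinh(0.01), math.asinh(-0.01))
```

The approximation is only meaningful for large `|x|`; near zero the logarithmic form diverges while asinh stays close to `x`.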
preprocess_data¶
The main entry point for data preparation, combining splitting and transformation.
def preprocess_data(
    data: pd.DataFrame,
    full_data: Optional[bool] = False,
    train_size: Optional[float] = TRAIN_SIZE,
    test_size: Optional[float] = TEST_SIZE,
    random_state: Optional[int] = RANDOM_STATE,
    normalize: Optional[Union[bool, List[str]]] = False,
    log_transform: Optional[Union[bool, List[str]]] = False,
    asinh_transform: Optional[Union[bool, List[str]]] = False,
) -> Union[Tuple[pd.DataFrame, dict], Tuple[pd.DataFrame, pd.DataFrame, dict]]

| Parameter | Type | Default | Description |
|---|---|---|---|
| data | pd.DataFrame | - | DataFrame containing the data to preprocess |
| full_data | bool | False | If True, return complete dataset without splitting |
| train_size | float | 0.8 | Proportion of data for training split |
| test_size | float | 0.2 | Proportion of data for test split |
| random_state | int | 42 | Random seed for reproducibility |
| normalize | bool or List[str] | False | True for all numeric columns, or list of specific columns |
| log_transform | bool or List[str] | False | True for all numeric columns, or list of specific columns |
| asinh_transform | bool or List[str] | False | True for all numeric columns, or list of specific columns |
The return type depends on parameters. If full_data=True and transformations are applied, returns (data, transform_params). If full_data=False with transformations, returns (X_train, X_test, transform_params). Without transformations, the transform_params dict is omitted.
normalize_data¶
def normalize_data(
    data: pd.DataFrame,
    columns_to_normalize: Optional[List[str]] = None,
) -> Tuple[pd.DataFrame, dict]

| Parameter | Type | Default | Description |
|---|---|---|---|
| data | pd.DataFrame | - | DataFrame to normalize |
| columns_to_normalize | List[str] | None | Specific columns to normalize. If None, all numeric columns |
Returns a tuple of (normalized_data, normalization_params), where normalization_params maps each column name to {"mean": float, "std": float}.
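To make the shape of the parameter dictionary concrete, here is a minimal pandas sketch of z-score normalization. It is an illustration, not Microimpute's implementation: the function name `normalize_sketch` is hypothetical, and the use of the sample standard deviation (pandas' default, `ddof=1`) is an assumption about a detail the text above does not specify.

```python
import pandas as pd


def normalize_sketch(data: pd.DataFrame, columns=None):
    """Illustrative z-score normalization (hypothetical helper)."""
    if columns is None:
        # "number" matches integer and float dtypes; bool and object
        # columns are not selected.
        columns = data.select_dtypes(include="number").columns.tolist()
    out = data.copy()
    params = {}
    for col in columns:
        mean = float(data[col].mean())
        std = float(data[col].std())  # assumption: sample std (ddof=1)
        out[col] = (data[col] - mean) / std
        params[col] = {"mean": mean, "std": std}
    return out, params


df = pd.DataFrame(
    {
        "age": [20, 30, 40],
        "income": [10.0, 20.0, 60.0],
        "region": ["a", "b", "c"],  # non-numeric, left untouched
    }
)
normalized, params = normalize_sketch(df)
print(params["age"])
print(normalized)
```

A constant column has zero standard deviation and would need separate handling; the sketch does not cover that case.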
log_transform_data¶
def log_transform_data(
    data: pd.DataFrame,
    columns_to_transform: Optional[List[str]] = None,
) -> Tuple[pd.DataFrame, dict]

| Parameter | Type | Default | Description |
|---|---|---|---|
| data | pd.DataFrame | - | DataFrame to transform |
| columns_to_transform | List[str] | None | Specific columns to transform. If None, all numeric columns |
Returns a tuple of (log_transformed_data, log_transform_params).
Note: Raises ValueError if any values are non-positive.
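The positivity check can be pictured with a short sketch. This is a hypothetical stand-in (`log_transform_sketch`) written with pandas and NumPy, not the library function; it omits the parameter dictionary because its contents are not described here.

```python
import numpy as np
import pandas as pd


def log_transform_sketch(data: pd.DataFrame, columns) -> pd.DataFrame:
    """Illustrative natural-log transform with a positivity check."""
    # Collect every requested column that contains a zero or negative value.
    bad = [col for col in columns if (data[col] <= 0).any()]
    if bad:
        raise ValueError(f"Non-positive values found in columns: {bad}")
    out = data.copy()
    for col in columns:
        out[col] = np.log(data[col])
    return out


incomes = pd.DataFrame({"income": [1.0, 1_000.0, 50_000.0]})
print(log_transform_sketch(incomes, ["income"]))

try:
    log_transform_sketch(pd.DataFrame({"income": [0.0, 5.0]}), ["income"])
except ValueError as err:
    print(f"Rejected: {err}")
```

For a column that may contain zeros or negative values, the asinh transformation is the alternative described in this section.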
asinh_transform_data¶
def asinh_transform_data(
    data: pd.DataFrame,
    columns_to_transform: Optional[List[str]] = None,
) -> Tuple[pd.DataFrame, dict]

| Parameter | Type | Default | Description |
|---|---|---|---|
| data | pd.DataFrame | - | DataFrame to transform |
| columns_to_transform | List[str] | None | Specific columns to transform. If None, all numeric columns |
Returns a tuple of (asinh_transformed_data, asinh_transform_params).
Reversal functions¶
After imputation, predictions can be converted back to the original scale using the stored parameters.
def unnormalize_predictions(imputations: dict, normalization_params: dict) -> dict
def unlog_transform_predictions(imputations: dict, log_transform_params: dict) -> dict
def un_asinh_transform_predictions(imputations: dict, asinh_transform_params: dict) -> dict

Each function takes the imputation dictionary (mapping quantiles to DataFrames) and the parameter dictionary returned by the corresponding transform function, returning imputations in the original scale.
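As a rough picture of what reversal involves, the sketch below undoes a z-score normalization on a quantile-keyed dictionary of DataFrames. The helper `unnormalize_sketch` is hypothetical and only mirrors the documented input and output shapes; the analogous inverses for the other two transformations are the exponential (for log) and the hyperbolic sine (for asinh).

```python
import pandas as pd


def unnormalize_sketch(imputations: dict, normalization_params: dict) -> dict:
    """Illustrative inverse of z-score normalization (hypothetical helper)."""
    restored = {}
    for quantile, frame in imputations.items():
        frame = frame.copy()  # leave the caller's DataFrames unchanged
        for col, p in normalization_params.items():
            if col in frame.columns:
                # Invert z = (x - mean) / std
                frame[col] = frame[col] * p["std"] + p["mean"]
        restored[quantile] = frame
    return restored


params = {"income": {"mean": 50_000.0, "std": 10_000.0}}
imputations = {0.5: pd.DataFrame({"income": [-1.0, 0.0, 2.0]})}
restored = unnormalize_sketch(imputations, params)
print(restored[0.5]["income"].tolist())  # [40000.0, 50000.0, 70000.0]
```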
Usage with autoimpute¶
The autoimpute() function accepts a preprocessing parameter that specifies transformations per variable:
from microimpute.comparisons.autoimpute import autoimpute
result = autoimpute(
    donor_data=donor,
    receiver_data=receiver,
    predictors=["age", "education"],
    imputed_variables=["income", "wealth"],
    preprocessing={
        "income": "log",      # Log transform (positive values only)
        "wealth": "asinh",    # Asinh transform (handles zeros/negatives)
        "age": "normalize",   # Z-score normalization
    },
)

The transformations are applied automatically before model training and reversed after prediction, so the returned imputations are in the original scale.
Constraints¶
Each column can only have one transformation applied. Attempting to apply multiple transformations to the same column raises a ValueError. When specifying transformations as True (apply to all), only one transformation type can be used. For different transformations on different columns, use the list format to specify columns explicitly.
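The one-transformation-per-column rule can be illustrated with a small validation sketch. The helper `check_no_overlap` is hypothetical and is not the library's own check; the dictionary keys simply mirror the `normalize`, `log_transform`, and `asinh_transform` parameters of preprocess_data in their list format.

```python
from typing import Dict, List


def check_no_overlap(spec: Dict[str, List[str]]) -> None:
    """Raise ValueError if a column is listed under more than one transformation."""
    seen: Dict[str, str] = {}
    for transform, columns in spec.items():
        for col in columns:
            if col in seen:
                raise ValueError(
                    f"Column '{col}' is assigned both "
                    f"'{seen[col]}' and '{transform}'"
                )
            seen[col] = transform


# Different transformations on different columns: accepted.
check_no_overlap(
    {
        "normalize": ["age"],
        "log_transform": ["income"],
        "asinh_transform": ["wealth"],
    }
)

# The same column under two transformations: rejected.
try:
    check_no_overlap({"normalize": ["income"], "log_transform": ["income"]})
except ValueError as err:
    print(f"Rejected: {err}")
```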