Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Calibration Pipeline User’s Manual

The unified calibration pipeline reweights cloned CPS records to match administrative targets using L0-regularized optimization. This guide covers the main workflows: lightweight build-then-fit, full pipeline with PUF, and fitting from a saved package.

Quick Start

# Build matrix only from stratified CPS (no PUF, no re-imputation):
python -m policyengine_us_data.calibration.unified_calibration \
  --target-config policyengine_us_data/calibration/target_config.yaml \
  --skip-source-impute \
  --skip-takeup-rerandomize \
  --build-only

# Fit weights from a saved package:
python -m policyengine_us_data.calibration.unified_calibration \
  --package-path storage/calibration/calibration_package.pkl \
  --epochs 500 --device cuda

# Full pipeline with PUF (build + fit in one shot):
make calibrate

Architecture Overview

The pipeline has two phases:

  1. Matrix build: Clone CPS records, assign geography, compute all target variable values, assemble a sparse calibration matrix. Optionally includes PUF cloning (doubles record count) and source re-imputation.

  2. Weight fitting (~5-20 min on GPU): L0-regularized optimization to find household weights that reproduce administrative targets.

The calibration package checkpoint lets you run phase 1 once and iterate on phase 2 with different hyperparameters or target selections---without rebuilding.

Prerequisites

The matrix build requires two inputs from the data pipeline:

Both must exist before running calibration. The stratified CPS already contains all CPS variables needed for calibration; PUF cloning and source re-imputation are optional enhancements that happen at calibration time.

Workflows

Build the matrix from the stratified CPS without PUF cloning or re-imputation. This is the fastest way to get a calibration package for experimentation.

Step 1: Build the matrix (~12K base records x 436 clones = ~5.2M columns).

python -m policyengine_us_data.calibration.unified_calibration \
  --target-config policyengine_us_data/calibration/target_config.yaml \
  --skip-source-impute \
  --skip-takeup-rerandomize \
  --build-only

This saves storage/calibration/calibration_package.pkl (default location). Use --package-output to specify a different path.

Step 2: Fit weights from the package (fast, repeatable).

python -m policyengine_us_data.calibration.unified_calibration \
  --package-path storage/calibration/calibration_package.pkl \
  --epochs 1000 \
  --lambda-l0 1e-8 \
  --beta 0.65 \
  --lambda-l2 1e-8 \
  --device cuda

You can re-run Step 2 as many times as you want with different hyperparameters. The expensive matrix build only happens once.

2. Full pipeline with PUF

Adding --puf-dataset doubles the record count (~24K base records x 436 clones = ~10.4M columns) by creating PUF-imputed copies of every CPS record. This also triggers source re-imputation unless skipped.

Single-pass (build + fit):

python -m policyengine_us_data.calibration.unified_calibration \
  --puf-dataset policyengine_us_data/storage/puf_2024.h5 \
  --target-config policyengine_us_data/calibration/target_config.yaml \
  --epochs 200 \
  --device cuda

Or equivalently: make calibrate

Output:

Build-only (save package for later fitting):

python -m policyengine_us_data.calibration.unified_calibration \
  --puf-dataset policyengine_us_data/storage/puf_2024.h5 \
  --target-config policyengine_us_data/calibration/target_config.yaml \
  --build-only

Or equivalently: make calibrate-build

This saves storage/calibration/calibration_package.pkl (default location). Use --package-output to specify a different path.

Then fit from the package using the same Step 2 command from Workflow 1.

3. Re-filtering a saved package

A saved package contains all targets from the database (before target config filtering). You can apply a different target config at fit time:

python -m policyengine_us_data.calibration.unified_calibration \
  --package-path storage/calibration/calibration_package.pkl \
  --target-config my_custom_config.yaml \
  --epochs 200

This lets you experiment with which targets to include without rebuilding the matrix.

4. Running on Modal (GPU cloud)

From a pre-built package (recommended):

Use --package-path to point at a local .pkl file. The runner automatically uploads it to the Modal Volume and then fits from it on the GPU, avoiding the function argument size limit.

modal run modal_app/remote_calibration_runner.py \
  --package-path policyengine_us_data/storage/calibration/calibration_package.pkl \
  --branch calibration-pipeline-improvements \
  --gpu T4 \
  --epochs 1000 \
  --beta 0.65 \
  --lambda-l0 1e-8 \
  --lambda-l2 1e-8

If a package already exists on the volume from a previous upload, you can also use --prebuilt-matrices to fit directly without re-uploading.

Full pipeline (builds matrix from scratch on Modal):

modal run modal_app/remote_calibration_runner.py \
  --branch calibration-pipeline-improvements \
  --gpu T4 \
  --epochs 1000 \
  --beta 0.65 \
  --lambda-l0 1e-8 \
  --lambda-l2 1e-8 \
  --target-config policyengine_us_data/calibration/target_config.yaml

The target config YAML is read from the cloned repo inside the container, so it must be committed to the branch you specify.

5. Portable fitting (Kaggle, Colab, etc.)

Transfer the package file to any environment with scipy, numpy, pandas, torch, and l0-python installed:

from policyengine_us_data.calibration.unified_calibration import (
    load_calibration_package,
    apply_target_config,
    fit_l0_weights,
)

package = load_calibration_package("calibration_package.pkl")
targets_df = package["targets_df"]
X_sparse = package["X_sparse"]

weights = fit_l0_weights(
    X_sparse=X_sparse,
    targets=targets_df["value"].values,
    lambda_l0=1e-8,
    epochs=500,
    device="cuda",
    beta=0.65,
    lambda_l2=1e-8,
)

Target Config

The target config controls which targets reach the optimizer. It uses a YAML exclusion list:

exclude:
  - variable: rent
    geo_level: national
  - variable: eitc
    geo_level: district
  - variable: snap
    geo_level: state
    domain_variable: snap   # optional: further narrow the match

Each rule drops rows from the calibration matrix where all specified fields match. Unrecognized variables silently match nothing.

Fields

FieldRequiredValuesDescription
variableYesAny variable name in target_overviewThe calibration target variable
geo_levelYesnational, state, districtGeographic aggregation level
domain_variableNoAny domain variable in target_overviewNarrows match to a specific domain

Default config

The checked-in config at policyengine_us_data/calibration/target_config.yaml reproduces the junkyard notebook’s 22 excluded target groups. It drops:

Applying this config reduces targets from ~37K to ~21K, matching the junkyard’s target selection.

Writing a custom config

To experiment, copy the default and edit:

cp policyengine_us_data/calibration/target_config.yaml my_config.yaml
# Edit my_config.yaml to add/remove exclusion rules
python -m policyengine_us_data.calibration.unified_calibration \
  --package-path storage/calibration/calibration_package.pkl \
  --target-config my_config.yaml \
  --epochs 200

To see what variables and geo_levels are available in the database:

SELECT DISTINCT variable, geo_level
FROM target_overview
ORDER BY variable, geo_level;

CLI Reference

Core flags

FlagDefaultDescription
--datasetstorage/stratified_extended_cps_2024.h5Path to CPS h5 file
--db-pathstorage/calibration/policy_data.dbPath to target database
--outputstorage/calibration/calibration_weights.npyWeight output path
--puf-datasetNonePath to PUF h5 (enables PUF cloning)
--presetlocalL0 preset: local (1e-8) or national (1e-4)
--lambda-l0NoneCustom L0 penalty (overrides --preset)
--epochs100Training epochs
--devicecpucpu or cuda
--n-clones436Number of dataset clones
--seed42Random seed for geography assignment

Target selection

FlagDefaultDescription
--target-configNonePath to YAML exclusion config
--domain-variablesNoneComma-separated domain filter (SQL-level)
--hierarchical-domainsNoneDomains for hierarchical uprating

Checkpoint flags

FlagDefaultDescription
--build-onlyFalseBuild matrix, save package, skip fitting
--package-pathNoneLoad pre-built package (uploads to Modal volume automatically when using Modal runner)
--package-outputAuto (when --build-only)Where to save package

Hyperparameter flags

FlagDefaultJunkyard valueDescription
--beta0.350.65L0 gate temperature (higher = softer gates)
--lambda-l21e-121e-8L2 regularization on weights
--learning-rate0.150.15Optimizer learning rate

Skip flags

FlagDescription
--skip-pufSkip PUF clone + QRF imputation
--skip-source-imputeSkip ACS/SIPP/SCF re-imputation
--skip-takeup-rerandomizeSkip takeup re-randomization

Calibration Package Format

The package is a pickled Python dict:

{
    "X_sparse": scipy.sparse.csr_matrix,  # (n_targets, n_records)
    "targets_df": pd.DataFrame,           # target metadata + values
    "target_names": list[str],            # human-readable names
    "metadata": {
        "dataset_path": str,
        "db_path": str,
        "n_clones": int,
        "n_records": int,
        "seed": int,
        "created_at": str,       # ISO timestamp
        "target_config": dict,   # config used at build time
    },
}

The targets_df DataFrame has columns: variable, geo_level, geographic_id, domain_variable, value, and others from the database.

Validating a Package

Before uploading a package to Modal, validate it:

# Default package location
python -m policyengine_us_data.calibration.validate_package

# Specific package
python -m policyengine_us_data.calibration.validate_package path/to/calibration_package.pkl

# Strict mode: fail if any target has row_sum/target < 1%
python -m policyengine_us_data.calibration.validate_package --strict

Exit codes: 0 = pass, 1 = impossible targets, 2 = strict ratio failures.

Validation also runs automatically after --build-only.

Hyperparameter Tuning Guide

The three key hyperparameters control the tradeoff between target accuracy and sparsity:

Suggested starting points

For local-area calibration (millions of records):

--lambda-l0 1e-8 --beta 0.65 --lambda-l2 1e-8 --epochs 500

For national web app (~50K records):

--lambda-l0 1e-4 --beta 0.35 --lambda-l2 1e-12 --epochs 200

Makefile Targets

TargetDescription
make calibrateFull pipeline with PUF and target config
make calibrate-buildBuild-only mode (saves package, no fitting)
make pipelineEnd-to-end: data, upload, calibrate, stage
make validate-stagingValidate staged H5s against targets (states only)
make validate-staging-fullValidate staged H5s (states + districts)
make upload-validationPush validation_results.csv to HF
make check-stagingSmoke test: sum key variables across all state H5s
make check-sanityQuick structural integrity check on one state
make upload-calibrationUpload weights, blocks, and logs to HF

Takeup Rerandomization

The calibration pipeline uses two independent code paths to compute the same target variables:

  1. Matrix builder (UnifiedMatrixBuilder.build_matrix): Computes a sparse calibration matrix XX where each row is a target and each column is a cloned household. The optimizer finds weights ww that minimize Xwt\|Xw - t\| (target values).

  2. Stacked builder (create_sparse_cd_stacked_dataset): Produces the .h5 files that users load in Microsimulation. It reconstructs each congressional district by combining base CPS records with calibrated weights and block-level geography.

For the calibration to be meaningful, both paths must produce identical values for every target variable. If the matrix builder computes Xsnap,NCw=$5.2BX_{snap,NC} \cdot w = \$5.2B but the stacked NC.h5 file yields sim.calculate("snap") * household_weight = $4.8B, then the optimizer’s solution does not actually match the target.

The problem with takeup variables

Variables like snap, aca_ptc, ssi, and medicaid depend on takeup draws — random Bernoulli samples that determine whether an eligible household actually claims the benefit. By default, PolicyEngine draws these at simulation time using Python’s built-in hash(), which is randomized per process.

This means loading the same H5 file in two different processes can produce different SNAP totals, even with the same weights. Worse, the matrix builder runs in process A while the stacked builder runs in process B, so their draws can diverge.

The solution: block-level seeding

Both paths call seeded_rng(variable_name, salt=f"{block_geoid}:{household_id}") to generate deterministic takeup draws. This ensures:

All takeup variables in SIMPLE_TAKEUP_VARS (in utils/takeup.py) receive block-seeded draws in the H5 builder, including would_file_taxes_voluntarily. The calibration matrix uses TAKEUP_AFFECTED_TARGETS to identify which target variables need takeup-adjusted rows, but the H5 builder applies draws to all SIMPLE_TAKEUP_VARS so that every takeup variable gets proper block-seeded values.

The --skip-takeup-rerandomize flag disables this rerandomization for faster iteration when you only care about non-takeup variables. Do not use it for production calibrations.

Block-Level Seeding

Each cloned household is assigned to a Census block (15-digit GEOID) during the assign_random_geography step. The first 2 digits are the state FIPS code, which determines the household’s takeup rates (since benefit eligibility rules are state-specific).

Mechanism

rng = seeded_rng(variable_name, salt=f"{block_geoid}:{household_id}")
draw = rng.random()
takes_up = draw < takeup_rate[state_fips]

The seeded_rng function uses _stable_string_hash — a deterministic hash that does not depend on Python’s PYTHONHASHSEED. This is critical because Python’s built-in hash() is randomized per process by default (since Python 3.3).

Why block (not CD or state)?

Blocks are the finest Census geography. A household’s block assignment stays the same regardless of how blocks are aggregated — the same household-block-draw triple produces the same result whether you are building an H5 for a state, a congressional district, or a county. This means:

Inactive records

When converting to stacked format, households that are not assigned to a given CD get zero weight. These inactive records must receive an empty string "" as their block GEOID, not a real block. If they received real blocks, they would inflate the entity count n passed to the RNG, shifting the draw positions for active entities and breaking the XwX \cdot w consistency invariant.

The XwX \cdot w Consistency Invariant

Formal statement

For every target variable vv and geography gg:

Xv,gw=igsim.calculate(v)i×wiX_{v,g} \cdot w = \sum_{i \in g} \text{sim.calculate}(v)_i \times w_i

where the left side comes from the matrix builder and the right side comes from loading the stacked H5 and running Microsimulation.calculate().

Why it matters

This invariant is what makes calibration meaningful. Without it, the optimizer’s solution (which minimizes Xwt\|Xw - t\|) does not actually produce a dataset that matches the targets. The weights would be “correct” in the matrix builder’s view but produce different totals in the H5 files that users actually load.

Known sources of drift

  1. Mismatched takeup draws: The matrix builder and stacked builder use different RNG states. Solved by block-level seeding (see above).

  2. Different block assignments: The stacked format uses first-clone-wins for multi-clone-same-CD records. With ~11M blocks and 3-10 clones, collision rate is ~0.7-10% of records. In practice, the residual mismatch is negligible.

  3. Inactive records in RNG calls: If inactive records (w=0) receive real block GEOIDs, they inflate the entity count for that block’s RNG call, shifting draw positions. Solved by using "" for inactive blocks.

  4. Entity ordering: Both paths must iterate over entities in the same order (sim.calculate("{entity}_id", map_to=entity)). NumPy boolean masking preserves order, so draws[i] maps to the same entity in both paths.

Testing

The test_xw_consistency.py test (pytest -m slow) verifies this invariant end-to-end:

  1. Load base dataset, create geography with uniform weights

  2. Build XX with the matrix builder (including takeup rerandomization)

  3. Convert weights to stacked format

  4. Build stacked H5 for selected CDs

  5. Compare XwX \cdot w vs sim.calculate() * household_weight — assert ratio within 1%

Post-Calibration Gating Workflow

After the pipeline stages H5 files to HuggingFace, two manual review gates determine whether to promote to production.

Gate 1: Review calibration fit

Load calibration_log.csv in the microcalibrate dashboard. This file contains the XwX \cdot w values from the matrix builder for every target at every epoch.

What to check:

If fit is poor, re-calibrate with different hyperparameters (learning rate, lambda_l0, beta, epochs).

Gate 2: Review simulation quality

make validate-staging          # states only (~30 min)
make validate-staging-full     # states + districts (~3 hrs)
make upload-validation         # push CSV to HF

This produces validation_results.csv with sim.calculate() values for every target. Load it in the dashboard’s Combined tab alongside calibration_log.csv.

What to check:

Promote

If both gates pass:

Structural pre-flight

For a quick structural check without loading the full database:

make check-sanity              # one state, ~2 min

This runs weight non-negativity, entity ID uniqueness, NaN/Inf detection, person-household mapping, boolean takeup validation, and per-household range checks.