Local Area Calibration Setup

This notebook demonstrates the clone-based calibration pipeline: how raw CPS records become a calibration matrix and, ultimately, CD-level stacked datasets.

The paradigm shift from the old approach: instead of replicating every household into every congressional district, we clone each record N times and assign each clone a random census block drawn from a population-weighted distribution. Each clone inherits a state, CD, and block — and gets re-simulated under the rules of its assigned state.

We follow one household (record_idx=8629, household_id 128694, SNAP $18,396) through the entire pipeline:

  1. Clone and assign geography

  2. Simulate under new state rules (_simulate_clone)

  3. Geographic column masking

  4. Re-randomize takeup per census block

  5. Build the calibration matrix

  6. Create stacked datasets from calibrated weights

Companion notebook: calibration_matrix.ipynb covers the finished matrix — row/column anatomy, target groups, sparsity. This notebook covers the process that creates it and what happens after (stacked datasets).

Requirements: policy_data.db, block_cd_distributions.csv.gz, and the stratified CPS h5 file in STORAGE_FOLDER.

Section 1: Setup & Configuration

import numpy as np
import pandas as pd
from collections import defaultdict

from policyengine_us import Microsimulation
from policyengine_us_data.storage import STORAGE_FOLDER
from policyengine_us_data.calibration.clone_and_assign import (
    assign_random_geography,
    GeographyAssignment,
    load_global_block_distribution,
)
from policyengine_us_data.calibration.unified_matrix_builder import (
    UnifiedMatrixBuilder,
)
from policyengine_us_data.calibration.unified_calibration import (
    rerandomize_takeup,
    SIMPLE_TAKEUP_VARS,
)
from policyengine_us_data.utils.randomness import seeded_rng
from policyengine_us_data.parameters import load_take_up_rate
from policyengine_us_data.datasets.cps.local_area_calibration.calibration_utils import (
    get_calculated_variables,
    STATE_CODES,
    get_all_cds_from_database,
)
from policyengine_us_data.datasets.cps.local_area_calibration.stacked_dataset_builder import (
    create_sparse_cd_stacked_dataset,
)

db_path = STORAGE_FOLDER / "calibration" / "policy_data.db"
db_uri = f"sqlite:///{db_path}"
dataset_path = str(STORAGE_FOLDER / "stratified_extended_cps_2024.h5")

N_CLONES = 3
SEED = 42
sim = Microsimulation(dataset=dataset_path)
hh_ids = sim.calculate("household_id", map_to="household").values
snap_values = sim.calculate("snap", map_to="household").values
n_records = len(hh_ids)

record_idx = 8629  # High SNAP ($18k), lands in TX/PA/NY with seed=42
example_hh_id = hh_ids[record_idx]
print(f"Base dataset: {n_records:,} households")
print(
    f"Example household: record_idx={record_idx}, "
    f"household_id={example_hh_id}, "
    f"SNAP=${snap_values[record_idx]:,.2f}"
)
Base dataset: 11,999 households
Example household: record_idx=8629, household_id=128694, SNAP=$18,396.00

Section 2: Geography Assignment

assign_random_geography creates n_records * n_clones total records, each assigned a random census block from a population-weighted distribution. State and CD are derived from the block GEOID. The result is a GeographyAssignment dataclass with arrays indexed as clone_idx * n_records + record_idx.

geography = assign_random_geography(n_records, n_clones=N_CLONES, seed=SEED)
n_total = n_records * N_CLONES

print(f"Total cloned records: {n_total:,}")
print(f"Unique states: {len(np.unique(geography.state_fips))}")
print(f"Unique CDs: {len(np.unique(geography.cd_geoid))}")
print(f"Unique blocks: {len(np.unique(geography.block_geoid))}")
Total cloned records: 35,997
Unique states: 50
Unique CDs: 435
Unique blocks: 35508
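
Before tracing the example household, a minimal sketch of the indexing convention: because columns are laid out as clone_idx * n_records + record_idx, a plain divmod recovers both indices from a flat column number.

# Minimal sketch: invert the clone_idx * n_records + record_idx layout.
col = 1 * n_records + record_idx  # clone 1's column for the example household
clone_idx, rec = divmod(col, n_records)
print(clone_idx, rec)  # -> 1 8629
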
print(
    f"Example household (record_idx={record_idx}) across {N_CLONES} clones:\n"
)
rows = []
for c in range(N_CLONES):
    col = c * n_records + record_idx
    rows.append(
        {
            "clone": c,
            "col": col,
            "state_fips": geography.state_fips[col],
            "abbr": STATE_CODES.get(geography.state_fips[col], "??"),
            "cd_geoid": geography.cd_geoid[col],
            "block_geoid": geography.block_geoid[col],
        }
    )
pd.DataFrame(rows)
Example household (record_idx=8629) across 3 clones:

(DataFrame: one row per clone with col, state_fips, abbr, cd_geoid, and block_geoid)

One household, three parallel geographic identities. Each clone will be simulated under different state rules, producing different benefit amounts.

Note: With only N_CLONES=3 (~36K total samples), small-population areas like DC may not appear in the random draw. The production pipeline uses N_CLONES=10, which covers all 51 state-equivalents and 436 CDs.
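
A quick coverage check makes the note concrete. This is a hedged sketch: it assumes get_all_cds_from_database (imported in Section 1) returns CD GEOIDs that compare equal, as strings, to geography.cd_geoid.

# Hedged sketch: which CDs never received a clone at this clone count?
db_cds = {str(cd) for cd in get_all_cds_from_database(db_uri)}
drawn_cds = {str(cd) for cd in np.unique(geography.cd_geoid)}
print(f"CDs with no clones at N_CLONES={N_CLONES}: {len(db_cds - drawn_cds)}")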

blocks, cds, states, probs = load_global_block_distribution()
print(f"Global block distribution: {len(blocks):,} blocks")
print(f"Top 5 states by total probability:")
state_prob = pd.Series(probs, index=states).groupby(level=0).sum()
top5 = state_prob.nlargest(5)
for fips, p in top5.items():
    print(f"  {STATE_CODES.get(fips, '??')} ({fips}): {p:.3%}")
Global block distribution: 5,765,442 blocks
Top 5 states by total probability:
  CA (6): 11.954%
  TX (48): 8.736%
  FL (12): 6.437%
  NY (36): 5.977%
  PA (42): 3.908%

Section 3: Inside _simulate_clone — State-Swap

For each clone, _simulate_clone does four things:

  1. Creates a fresh Microsimulation from the base dataset

  2. Overwrites state_fips with the clone’s assigned states

  3. Optionally calls a sim_modifier (e.g., takeup re-randomization)

  4. Clears cached formulas via get_calculated_variables — preserving survey inputs and IDs while forcing recalculation of state-dependent variables like SNAP

Let’s reproduce this manually for clone 0.

clone_idx = 0
col_start = clone_idx * n_records
col_end = col_start + n_records
clone_states = geography.state_fips[col_start:col_end]

clone_sim = Microsimulation(dataset=dataset_path)
clone_sim.set_input("state_fips", 2024, clone_states.astype(np.int32))
for var in get_calculated_variables(clone_sim):
    clone_sim.delete_arrays(var)

new_snap = clone_sim.calculate("snap", map_to="household").values

orig_state = sim.calculate("state_fips", map_to="household").values[record_idx]
new_state = clone_states[record_idx]

print(f"Example household (record_idx={record_idx}):")
print(
    f"  Original state: {STATE_CODES.get(int(orig_state), '??')} "
    f"({int(orig_state)})"
)
print(
    f"  Clone 0 state:  {STATE_CODES.get(int(new_state), '??')} "
    f"({int(new_state)})"
)
print(f"  Original SNAP:  ${snap_values[record_idx]:,.2f}")
print(f"  Clone 0 SNAP:   ${new_snap[record_idx]:,.2f}")
Example household (record_idx=8629):
  Original state: NC (37)
  Clone 0 state:  TX (48)
  Original SNAP:  $18,396.00
  Clone 0 SNAP:   $18,396.00
print(f"SNAP for record_idx={record_idx} across all {N_CLONES} clones:\n")
rows = []
for c in range(N_CLONES):
    cs = geography.state_fips[c * n_records + record_idx]
    s = Microsimulation(dataset=dataset_path)
    s.set_input(
        "state_fips",
        2024,
        geography.state_fips[c * n_records : (c + 1) * n_records].astype(
            np.int32
        ),
    )
    for var in get_calculated_variables(s):
        s.delete_arrays(var)
    clone_snap = s.calculate("snap", map_to="household").values
    rows.append(
        {
            "clone": c,
            "state": STATE_CODES.get(int(cs), "??"),
            "state_fips": int(cs),
            "SNAP": f"${clone_snap[record_idx]:,.2f}",
        }
    )
pd.DataFrame(rows)
SNAP for record_idx=8629 across all 3 clones:

(DataFrame: clone, state, state_fips, and SNAP for clones 0-2)

get_calculated_variables is selective: it identifies variables with formulas (state-dependent computations) while preserving survey-reported inputs and entity IDs. This is what allows the same demographic household to produce different benefit amounts under different state rules.
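
To see that split concretely, a small check (assuming, as the delete_arrays loop above suggests, that get_calculated_variables returns variable names):

# Small check: formula-backed variables that get cleared per clone.
calc_vars = sorted(get_calculated_variables(sim))
print(f"{len(calc_vars)} formula-backed variables, e.g. {calc_vars[:3]}")
print("'snap' among them:", "snap" in calc_vars)  # state-dependent, so recalculated
print("'household_id' among them:", "household_id" in calc_vars)  # IDs are preserved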

Section 4: Geographic Column Masking

When assembling the calibration matrix, each target row only “sees” columns (clones) whose geography matches the target’s geography. This is implemented via state_to_cols and cd_to_cols dictionaries built from the GeographyAssignment.

This is step 3 of build_matrix — reproduced here for transparency.

state_col_lists = defaultdict(list)
cd_col_lists = defaultdict(list)
for col in range(n_total):
    state_col_lists[int(geography.state_fips[col])].append(col)
    cd_col_lists[str(geography.cd_geoid[col])].append(col)

state_to_cols = {s: np.array(c) for s, c in state_col_lists.items()}
cd_to_cols = {cd: np.array(c) for cd, c in cd_col_lists.items()}

print(f"Unique states mapped: {len(state_to_cols)}")
print(f"Unique CDs mapped: {len(cd_to_cols)}")

state_counts = {s: len(c) for s, c in state_to_cols.items()}
sc_series = pd.Series(state_counts)
print(
    f"\nColumns per state: min={sc_series.min()}, "
    f"median={sc_series.median():.0f}, max={sc_series.max()}"
)
Unique states mapped: 50
Unique CDs mapped: 435

Columns per state: min=62, median=494, max=4311
print(f"Example household clone visibility:\n")
for c in range(N_CLONES):
    col = c * n_records + record_idx
    state = int(geography.state_fips[col])
    cd = str(geography.cd_geoid[col])
    abbr = STATE_CODES.get(state, "??")
    print(f"Clone {c} ({abbr}, CD {cd}):")
    print(
        f"  Visible to {abbr} state targets: "
        f"col {col} in state_to_cols[{state}]? "
        f"{col in state_to_cols.get(state, [])}"
    )
    print(
        f"  Visible to CD {cd} targets: "
        f"col {col} in cd_to_cols['{cd}']? "
        f"{col in cd_to_cols.get(cd, [])}"
    )
    # Check an unrelated state
    print(
        f"  Visible to NC (37) targets: " f"{col in state_to_cols.get(37, [])}"
    )
    print()
Example household clone visibility:

Clone 0 (TX, CD 4817):
  Visible to TX state targets: col 8629 in state_to_cols[48]? True
  Visible to CD 4817 targets: col 8629 in cd_to_cols['4817']? True
  Visible to NC (37) targets: False

Clone 1 (PA, CD 4201):
  Visible to PA state targets: col 20628 in state_to_cols[42]? True
  Visible to CD 4201 targets: col 20628 in cd_to_cols['4201']? True
  Visible to NC (37) targets: False

Clone 2 (NY, CD 3611):
  Visible to NY state targets: col 32627 in state_to_cols[36]? True
  Visible to CD 3611 targets: col 32627 in cd_to_cols['3611']? True
  Visible to NC (37) targets: False

This is the mechanism behind the sparsity pattern in calibration_matrix.ipynb: a household clone assigned to TX can contribute to TX state targets and TX CD targets, but produces a zero entry for NC or AK targets. The matrix is sparse because each clone only intersects a small fraction of all geographic targets.
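
A one-line bound makes this concrete: a state-level target row can only have non-zero entries in that state's columns, so the Section 4 dictionaries cap each row's density.

# Minimal sketch: upper bound on non-zeros in any TX state-level target row.
tx_cols = state_to_cols[48]
print(
    f"A TX target row touches at most {len(tx_cols):,} of {n_total:,} columns "
    f"({len(tx_cols) / n_total:.1%})"
)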

Section 5: Takeup Re-randomization

The base CPS has fixed takeup decisions (e.g., “this household takes up SNAP”). But when we clone a household into different census blocks, each block should have independently drawn takeup — otherwise every clone of a SNAP-participating household would still participate, regardless of geography.

rerandomize_takeup solves this: for each census block, it uses seeded_rng(variable_name, salt=block_geoid) to draw new takeup booleans. The seed is deterministic per (variable, block) pair, so results are reproducible.
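
Conceptually, one block's draw looks like the sketch below. The rate (0.82) mirrors the SNAP takeup rate printed in the next cell and the unit count is illustrative; the real function looks rates up via load_take_up_rate and maps the booleans onto the right entities.

# Hedged sketch of a single block's SNAP takeup draw (illustrative rate/count).
block = str(geography.block_geoid[record_idx])  # clone 0's block for the example HH
rate = 0.82
n_units = 4  # pretend this many SPM units sit in the block
rng = seeded_rng("takes_up_snap_if_eligible", salt=block)
new_takeup = rng.random(n_units) < rate  # same (var, block) -> same booleans every run
print(new_takeup)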

print(f"{len(SIMPLE_TAKEUP_VARS)} takeup variables:\n")
for spec in SIMPLE_TAKEUP_VARS:
    rate_key = spec["rate_key"]
    if rate_key == "voluntary_filing":
        rate = 0.05
    else:
        rate = load_take_up_rate(rate_key, 2024)
    rate_str = (
        f"{rate:.2%}"
        if isinstance(rate, float)
        else f"dict ({len(rate)} entries)"
    )
    print(
        f"  {spec['variable']:40s} "
        f"entity={spec['entity']:10s} rate={rate_str}"
    )
8 takeup variables:

  takes_up_snap_if_eligible                entity=spm_unit   rate=82.00%
  takes_up_aca_if_eligible                 entity=tax_unit   rate=67.20%
  takes_up_dc_ptc                          entity=tax_unit   rate=32.00%
  takes_up_head_start_if_eligible          entity=person     rate=30.00%
  takes_up_early_head_start_if_eligible    entity=person     rate=9.00%
  takes_up_ssi_if_eligible                 entity=person     rate=50.00%
  would_file_taxes_voluntarily             entity=tax_unit   rate=5.00%
  takes_up_medicaid_if_eligible            entity=person     rate=dict (51 entries)
block_a = "482011234567890"
block_b = "170311234567890"
var = "takes_up_snap_if_eligible"

rng_a1 = seeded_rng(var, salt=block_a)
rng_a2 = seeded_rng(var, salt=block_a)
rng_b = seeded_rng(var, salt=block_b)
rng_other = seeded_rng("takes_up_aca_if_eligible", salt=block_a)

draws_a1 = rng_a1.random(5)
draws_a2 = rng_a2.random(5)
draws_b = rng_b.random(5)
draws_other = rng_other.random(5)

print("Same block + same var (reproducible):")
print(f"  {draws_a1}")
print(f"  {draws_a2}")
print(f"  Match: {np.allclose(draws_a1, draws_a2)}")
print(f"\nDifferent block, same var:")
print(f"  {draws_b}")
print(f"  Match: {np.allclose(draws_a1, draws_b)}")
print(f"\nSame block, different var:")
print(f"  {draws_other}")
print(f"  Match: {np.allclose(draws_a1, draws_other)}")
Same block + same var (reproducible):
  [0.50514599 0.75213437 0.9703409  0.18048868 0.31969517]
  [0.50514599 0.75213437 0.9703409  0.18048868 0.31969517]
  Match: True

Different block, same var:
  [0.15503168 0.96707026 0.79019745 0.67544525 0.85245009]
  Match: False

Same block, different var:
  [0.93155876 0.8912794  0.50838888 0.32192278 0.01005173]
  Match: False
test_sim = Microsimulation(dataset=dataset_path)
clone_0_states = geography.state_fips[:n_records]
clone_0_blocks = geography.block_geoid[:n_records]
test_sim.set_input("state_fips", 2024, clone_0_states.astype(np.int32))

before = {}
for spec in SIMPLE_TAKEUP_VARS:
    v = spec["variable"]
    vals = test_sim.calculate(v, map_to=spec["entity"]).values
    before[v] = vals.mean()

rerandomize_takeup(test_sim, clone_0_blocks, clone_0_states, 2024)

print("Takeup rates before/after re-randomization (clone 0):\n")
for spec in SIMPLE_TAKEUP_VARS:
    v = spec["variable"]
    vals = test_sim.calculate(v, map_to=spec["entity"]).values
    after = vals.mean()
    print(f"  {v:40s} before={before[v]:.3%}  after={after:.3%}")
Takeup rates before/after re-randomization (clone 0):

  takes_up_snap_if_eligible                before=82.333%  after=82.381%
  takes_up_aca_if_eligible                 before=66.718%  after=67.486%
  takes_up_dc_ptc                          before=31.483%  after=32.044%
  takes_up_head_start_if_eligible          before=29.963%  after=29.689%
  takes_up_early_head_start_if_eligible    before=8.869%  after=8.721%
  takes_up_ssi_if_eligible                 before=100.000%  after=49.776%
  would_file_taxes_voluntarily             before=0.000%  after=4.905%
  takes_up_medicaid_if_eligible            before=84.496%  after=80.051%
medicaid_rates = load_take_up_rate("medicaid", 2024)
print("Medicaid takeup rates (state-specific), first 10 states:\n")
for state, rate in sorted(medicaid_rates.items())[:10]:
    print(f"  {state}: {rate:.2%}")
Medicaid takeup rates (state-specific), first 10 states:

  AK: 88.00%
  AL: 92.00%
  AR: 79.00%
  AZ: 95.00%
  CA: 78.00%
  CO: 99.00%
  CT: 89.00%
  DC: 99.00%
  DE: 86.00%
  FL: 98.00%

In the full pipeline, rerandomize_takeup is passed to build_matrix as a sim_modifier callback. For each clone, after state_fips is set but before formula caches are cleared, the callback draws new takeup booleans per census block. This means the same household in block A might take up SNAP while in block B it doesn’t — matching the statistical reality that takeup varies by geography.
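
A rough sketch of that wiring is below. The callback signature is an assumption inferred from the rerandomize_takeup call above; check build_matrix's actual sim_modifier contract before relying on it.

# Assumed callback shape: gets the clone's sim plus that clone's blocks and
# states, and mutates the sim in place (as rerandomize_takeup did in Section 5).
def takeup_modifier(clone_sim, blocks, states):
    rerandomize_takeup(clone_sim, blocks, states, 2024)

# Hypothetically: builder.build_matrix(geography, sim, sim_modifier=takeup_modifier, ...)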

Section 6: Matrix Build Verification

Let’s run the full build_matrix pipeline and verify the example household’s pattern matches our Section 4 predictions. We use the same target_filter as in calibration_matrix.ipynb but without sim_modifier to match that notebook’s output.

builder = UnifiedMatrixBuilder(
    db_uri=db_uri,
    time_period=2024,
    dataset_path=dataset_path,
)

targets_df, X_sparse, target_names = builder.build_matrix(
    geography,
    sim,
    target_filter={"domain_variables": ["snap"]},
)

print(f"Matrix shape: {X_sparse.shape}")
print(f"Non-zero entries: {X_sparse.nnz:,}")
print(f"Density: {X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1]):.6f}")
2026-02-13 17:11:22,384 - INFO - Processing clone 1/3 (cols 0-11998, 50 unique states)...
2026-02-13 17:11:23,509 - INFO - Processing clone 2/3 (cols 11999-23997, 50 unique states)...
2026-02-13 17:11:24,645 - INFO - Processing clone 3/3 (cols 23998-35996, 50 unique states)...
2026-02-13 17:11:25,769 - INFO - Assembling matrix from 3 clones...
2026-02-13 17:11:25,771 - INFO - Matrix: 538 targets x 35997 cols, 14946 nnz
Matrix shape: (538, 35997)
Non-zero entries: 14,946
Density: 0.000772
print(f"Example household non-zero pattern across clones:\n")
for c in range(N_CLONES):
    col = c * n_records + record_idx
    col_vec = X_sparse[:, col]
    nz_rows = col_vec.nonzero()[0]
    state = int(geography.state_fips[col])
    cd = geography.cd_geoid[col]
    abbr = STATE_CODES.get(state, "??")
    print(f"Clone {c} ({abbr}, CD {cd}): {len(nz_rows)} non-zero rows")
    for r in nz_rows:
        row = targets_df.iloc[r]
        print(
            f"  row {r}: {row['variable']} "
            f"(geo={row['geographic_id']}): "
            f"{X_sparse[r, col]:.2f}"
        )
Example household non-zero pattern across clones:

Clone 0 (TX, CD 4817): 3 non-zero rows
  row 39: household_count (geo=48): 1.00
  row 90: snap (geo=48): 18396.00
  row 410: household_count (geo=4817): 1.00
Clone 1 (PA, CD 4201): 3 non-zero rows
  row 34: household_count (geo=42): 1.00
  row 85: snap (geo=42): 18396.00
  row 358: household_count (geo=4201): 1.00
Clone 2 (NY, CD 3611): 3 non-zero rows
  row 27: household_count (geo=36): 1.00
  row 78: snap (geo=36): 18396.00
  row 292: household_count (geo=3611): 1.00

Section 7: From Weights to Datasets

create_sparse_cd_stacked_dataset takes calibrated weights and builds an h5 file with only the non-zero-weight households, reindexed per CD. Internally it does its own state-swap simulation — loading the base dataset, assigning state_fips for the target CD’s state, and recalculating benefits from scratch. This means SNAP values in the output reflect the destination state’s rules (e.g., a $70 SNAP household from ME may get $0 under AK rules).

Format gap: The calibration produces weights in clone layout (n_records * n_clones,) where each clone maps to one specific CD via the GeographyAssignment. The stacked dataset builder expects CD layout (n_cds * n_households,) where every CD has a weight slot for every household. Converting between these — accumulating clone weights into their assigned CDs — is a separate step not yet implemented. The demo below constructs artificial CD-layout weights directly to show how the builder works.

print("Dimension mismatch:")
print(
    f"  Calibration output: ({n_records} * {N_CLONES},) "
    f"= {n_records * N_CLONES:,} (clone layout)"
)

all_cds = get_all_cds_from_database(db_uri)
n_cds = len(all_cds)
print(
    f"  Stacked builder expects: ({n_cds} * {n_records},) "
    f"= {n_cds * n_records:,} (CD layout)"
)
Dimension mismatch:
  Calibration output: (11999 * 3,) = 35,997 (clone layout)
  Stacked builder expects: (436 * 11999,) = 5,231,564 (CD layout)
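
One way the missing conversion could work, as a hedged sketch rather than the library's implementation: scatter each clone's calibrated weight into the slot for its assigned CD. clone_weights below is hypothetical, standing in for the optimizer's clone-layout output.

# Hedged sketch: clone-layout weights -> CD-layout weights.
clone_weights = np.ones(n_total)  # hypothetical calibrated weights (clone layout)
cd_index = {str(cd): i for i, cd in enumerate(all_cds)}
w_cd_layout = np.zeros(n_cds * n_records)
for col in range(n_total):
    rec = col % n_records              # base household index
    cd = str(geography.cd_geoid[col])  # this clone's assigned CD
    w_cd_layout[cd_index[cd] * n_records + rec] += clone_weights[col]
print(f"CD-layout vector: {len(w_cd_layout):,} entries, total {w_cd_layout.sum():,.0f}")
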
import os

demo_cds = ["3701", "201"]
n_demo_cds = len(demo_cds)

w = (
    np.random.default_rng(42)
    .binomial(n=1, p=0.01, size=n_demo_cds * n_records)
    .astype(float)
)

# Seed our example household into both CDs
cd_idx_3701 = demo_cds.index("3701")
w[cd_idx_3701 * n_records + record_idx] = 2.5

cd_idx_201 = demo_cds.index("201")
w[cd_idx_201 * n_records + record_idx] = 3.5

output_dir = "calibration_output"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, "results.h5")

print(
    f"Weight vector: {len(w):,} entries "
    f"({n_demo_cds} CDs x {n_records:,} HH)"
)
print(f"Non-zero weights: {(w > 0).sum()}")
print(
    f"Example HH weight in CD 3701: {w[cd_idx_3701 * n_records + record_idx]}"
)
print(f"Example HH weight in CD 201: {w[cd_idx_201 * n_records + record_idx]}")
Weight vector: 23,998 entries (2 CDs x 11,999 HH)
Non-zero weights: 277
Example HH weight in CD 3701: 2.5
Example HH weight in CD 201: 3.5
create_sparse_cd_stacked_dataset(
    w,
    demo_cds,
    cd_subset=demo_cds,
    dataset_path=dataset_path,
    output_path=output_path,
)
Processing subset of 2 CDs: 3701, 201...
Output path: calibration_output/results.h5

Original dataset has 11,999 households
Extracted weights for 2 CDs from full weight matrix
Total active household-CD pairs: 277
Total weight in W matrix: 281
Processing CD 201 (2/2)...

Combining 2 CD DataFrames...
Total households across all CDs: 277
Combined DataFrame shape: (726, 222)

Reindexing all entity IDs using 25k ranges per CD...
  Created 277 unique households across 2 CDs
  Reindexing persons using 25k ranges...
  Reindexing tax units...
  Reindexing SPM units...
  Reindexing marital units...
  Reindexing families...
  Final persons: 726
  Final households: 277
  Final tax units: 373
  Final SPM units: 291
  Final marital units: 586
  Final families: 309

Weights in combined_df AFTER reindexing:
  HH weight sum: 0.00M
  Person weight sum: 0.00M
  Ratio: 1.00

Overflow check:
  Max person ID after reindexing: 5,025,335
  Max person ID × 100: 502,533,500
  int32 max: 2,147,483,647
  ✓ No overflow risk!

Creating Dataset from combined DataFrame...
Building simulation from Dataset...

Saving to calibration_output/results.h5...
Found 175 input variables to save
Variables saved: 218
Variables skipped: 3763
Sparse CD-stacked dataset saved successfully!
Household mapping saved to calibration_output/mappings/results_household_mapping.csv

Verifying saved file...
  Final households: 277
  Final persons: 726
  Total population (from household weights): 281
'calibration_output/results.h5'
sim_after = Microsimulation(dataset=f"./{output_path}")
hh_after_df = pd.DataFrame(
    sim_after.calculate_dataframe(
        [
            "household_id",
            "congressional_district_geoid",
            "household_weight",
            "state_fips",
            "snap",
        ]
    )
)
print(f"Stacked dataset: {len(hh_after_df)} households\n")

mapping_df = pd.read_csv(
    f"{output_dir}/mappings/results_household_mapping.csv"
)
example_mapping = mapping_df.loc[
    mapping_df.original_household_id == example_hh_id
]
print(f"Example household (original_id={example_hh_id}) " f"in mapping:\n")
print(example_mapping.to_string(index=False))

new_ids = example_mapping.new_household_id
print(f"\nIn stacked dataset:\n")
print(
    hh_after_df.loc[hh_after_df.household_id.isin(new_ids)].to_string(
        index=False
    )
)
Stacked dataset: 277 households

Example household (original_id=128694) in mapping:

 new_household_id  original_household_id  congressional_district  state_fips
              108                 128694                     201           2
            25097                 128694                    3701          37

In stacked dataset:

 household_id  congressional_district_geoid  household_weight  state_fips    snap
          108                           201               3.5           2 23640.0
        25097                          3701               2.5          37 18396.0
import shutil

shutil.rmtree(output_dir)
print(f"Cleaned up {output_dir}/")
Cleaned up calibration_output/

Summary

The clone-based calibration pipeline has six stages:

  1. Clone + assign geography: assign_random_geography() creates N copies of each CPS record, each with a population-weighted random census block.

  2. Simulate: _simulate_clone() sets each clone’s state_fips and recalculates state-dependent benefits.

  3. Geographic masking: state_to_cols / cd_to_cols restrict each target row to geographically relevant columns.

  4. Re-randomize takeup: rerandomize_takeup() draws new takeup per census block, breaking the fixed-takeup assumption.

  5. Build matrix: UnifiedMatrixBuilder.build_matrix() assembles the sparse CSR matrix from all clones.

  6. Stacked datasets: create_sparse_cd_stacked_dataset() converts calibrated weights into CD-level h5 files.

For matrix diagnostics (row/column anatomy, target groups, sparsity analysis), see calibration_matrix.ipynb.