Local Area Calibration Setup - PolicyEngine US Data

This notebook demonstrates the sparse matrix construction for local area (congressional district) calibration. It uses a subset of CDs from NC, HI, MT, and AK for manageable runtime.

Section 1: Setup & Configuration¶

from sqlalchemy import create_engine, text
import pandas as pd
import numpy as np

from policyengine_us import Microsimulation
from policyengine_us_data.storage import STORAGE_FOLDER
from policyengine_us_data.datasets.cps.local_area_calibration.sparse_matrix_builder import (
    SparseMatrixBuilder,
)
from policyengine_us_data.datasets.cps.local_area_calibration.matrix_tracer import (
    MatrixTracer,
)
from policyengine_us_data.datasets.cps.local_area_calibration.calibration_utils import (
    get_calculated_variables,
    create_target_groups,
)

/home/baogorek/envs/pe/lib/python3.13/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

TEST_LITE == False

db_path = STORAGE_FOLDER / "calibration" / "policy_data.db"
db_uri = f"sqlite:///{db_path}"
dataset_path = str(STORAGE_FOLDER / "stratified_extended_cps_2023.h5")

engine = create_engine(db_uri)

Section 2: Select Test Congressional Districts¶

We use CDs from 4 states for testing:

NC (37): 14 CDs (3701-3714) - provides same-state different-CD test cases
HI (15): 2 CDs (1501-1502)
MT (30): 2 CDs (3001-3002)
AK (2): 1 CD (200)

query = """
SELECT DISTINCT sc.value as cd_geoid
FROM strata s
JOIN stratum_constraints sc ON s.stratum_id = sc.stratum_id
WHERE s.stratum_group_id = 1
  AND sc.constraint_variable = 'congressional_district_geoid'
  AND (
    sc.value LIKE '37__'
    OR sc.value LIKE '150_'
    OR sc.value LIKE '300_'
    OR sc.value = '200' OR sc.value = '201'
  )
ORDER BY sc.value
"""

with engine.connect() as conn:
    result = conn.execute(text(query)).fetchall()
    test_cds = [row[0] for row in result]

print(f"Testing with {len(test_cds)} congressional districts:")
print(f"  NC (37): {[cd for cd in test_cds if cd.startswith('37')]}")
print(f"  HI (15): {[cd for cd in test_cds if cd.startswith('15')]}")
print(f"  MT (30): {[cd for cd in test_cds if cd.startswith('30')]}")
print(f"  AK (2):  {[cd for cd in test_cds if cd.startswith('20')]}")

Testing with 19 congressional districts:
  NC (37): ['3701', '3702', '3703', '3704', '3705', '3706', '3707', '3708', '3709', '3710', '3711', '3712', '3713', '3714']
  HI (15): ['1501', '1502']
  MT (30): ['3001', '3002']
  AK (2):  ['201']

Section 3: Build the Sparse Matrix¶

The sparse matrix X_sparse has:

Rows: Calibration targets (e.g., SNAP totals by geography)
Columns: (household × CD) pairs - each household appears once per CD

We filter to SNAP targets only (stratum_group_id=4) for this demonstration.

sim = Microsimulation(dataset=dataset_path)

builder = SparseMatrixBuilder(
    db_uri,
    time_period=2023,
    cds_to_calibrate=test_cds,
    dataset_path=dataset_path,
)

targets_df, X_sparse, household_id_mapping = builder.build_matrix(
    sim, target_filter={"stratum_group_ids": [4], "variables": ["snap"]}
)

print(f"X_sparse shape: {X_sparse.shape}")
print(f"  Rows (targets): {X_sparse.shape[0]}")
print(f"  Columns (household × CD pairs): {X_sparse.shape[1]}")
print(f"  Non-zero entries: {X_sparse.nnz:,}")
print(f"  Sparsity: {1 - X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1]):.2%}")

X_sparse shape: (539, 256633)
  Rows (targets): 539
  Columns (household × CD pairs): 256633
  Non-zero entries: 67,756
  Sparsity: 99.95%

Section 4: Understanding the Matrix Structure with MatrixTracer¶

The MatrixTracer helps navigate the sparse matrix by providing lookups between:

Column indices ↔ (household_id, CD) pairs
Row indices ↔ target definitions

tracer = MatrixTracer(
    targets_df, X_sparse, household_id_mapping, test_cds, sim
)

tracer.print_matrix_structure()


================================================================================
MATRIX STRUCTURE BREAKDOWN
================================================================================

Matrix dimensions: 539 rows x 256633 columns
  Rows = 539 targets
  Columns = 13507 households x 19 CDs
           = 13,507 x 19 = 256,633

--------------------------------------------------------------------------------
COLUMN STRUCTURE (Households stacked by CD)
--------------------------------------------------------------------------------

Showing first and last 5 CDs of 19 total:

First 5 CDs:
cd_geoid  start_col  end_col  n_households
    1501          0    13506         13507
    1502      13507    27013         13507
     201      27014    40520         13507
    3001      40521    54027         13507
    3002      54028    67534         13507

Last 5 CDs:
cd_geoid  start_col  end_col  n_households
    3710     189098   202604         13507
    3711     202605   216111         13507
    3712     216112   229618         13507
    3713     229619   243125         13507
    3714     243126   256632         13507

--------------------------------------------------------------------------------
ROW STRUCTURE (Targets)
--------------------------------------------------------------------------------

Total targets: 539

Targets by stratum group:
                  n_targets  n_unique_vars
stratum_group_id                          
1                         1              1
4                       538              2

--------------------------------------------------------------------------------
TARGET GROUPS (for loss calculation)
--------------------------------------------------------------------------------

=== Creating Target Groups ===

National targets:
  Group 0: Snap = 107,062,860,000

State targets:
  Group 1: SNAP Household Count (51 targets)
  Group 2: Snap (51 targets)

District targets:
  Group 3: SNAP Household Count (436 targets)

Total groups created: 4
========================================
  Group 0: National Snap (1 target, value=107,062,860,000) - rows [0]
  Group 1: State SNAP Household Count (51 targets) - rows [1, 2, 3, ..., 50, 51]
  Group 2: State Snap (51 targets) - rows [52, 53, 54, ..., 101, 102]
  Group 3: District SNAP Household Count (436 targets) - rows [103, 104, 105, ..., 537, 538]

================================================================================

target_groups, group_info = create_target_groups(targets_df)


=== Creating Target Groups ===

National targets:
  Group 0: Snap = 107,062,860,000

State targets:
  Group 1: SNAP Household Count (51 targets)
  Group 2: Snap (51 targets)

District targets:
  Group 3: SNAP Household Count (436 targets)

Total groups created: 4
========================================

target_group = tracer.get_group_rows(2)
row_loc = target_group.iloc[28]['row_index']  # Manually found the index value 28
row_info = tracer.get_row_info(row_loc)
var = row_info['variable']
var_desc = row_info['variable_desc']
target_geo_id = int(row_info['geographic_id'])

print("Row info for North Carolina's SNAP benefit amount:")
print(row_info)

Row info for North Carolina's SNAP benefit amount:
{'row_index': 80, 'variable': 'snap', 'variable_desc': 'SNAP allotment', 'geographic_id': '37', 'target_value': 4041086120.0, 'stratum_id': 9799, 'stratum_group_id': 4}

hh_snap_df = pd.DataFrame(sim.calculate_dataframe([
    "household_id", "household_weight", "state_fips", "snap"])                                        
)
print(hh_snap_df)

       household_id  household_weight  state_fips        snap
0                25       1229.699951          23  789.199951
1                76       3119.330078          23    0.000000
2                92       1368.089966          23    0.000000
3               110       1457.579956          23    0.000000
4               140       1445.209961          23    0.000000
...             ...               ...         ...         ...
13502        178916          0.000000          15    0.000000
13503        178918          0.000000          15    0.000000
13504        178926          0.000000          15    0.000000
13505        178935          0.000000          15    0.000000
13506        178945          0.000000          15    0.000000

[13507 rows x 4 columns]

If we were to include congressional_district_geoid above, they would all be zeros. It’s not until we do the calibration, i.e., come back with a vector of weights w to multiply X_sparse with, that we will set congressional_district_geoid.

However, every household is already a donor to every contressional district. You can get the column positions for every household (remember targets are on the rows, donor households on the columns) by running tracer’s get_household_column_positions with the original household_id.

# Reverse lookup: get all column positions for a specific household
hh_id = hh_snap_df.loc[hh_snap_df.snap > 0].household_id.values[0]
print(hh_snap_df.loc[hh_snap_df.household_id == hh_id])

print("\nEvaluating the tracer.get_household_column_positions dictionary:\n")
positions = tracer.get_household_column_positions(hh_id)
print(positions)

   household_id  household_weight  state_fips        snap
0            25       1229.699951          23  789.199951

Evaluating the tracer.get_household_column_positions dictionary:

{'1501': 0, '1502': 13507, '201': 27014, '3001': 40521, '3002': 54028, '3701': 67535, '3702': 81042, '3703': 94549, '3704': 108056, '3705': 121563, '3706': 135070, '3707': 148577, '3708': 162084, '3709': 175591, '3710': 189098, '3711': 202605, '3712': 216112, '3713': 229619, '3714': 243126}

Section 5: Understanding the cells of the X_Sparse matrix and Target vector¶

print("Remember, this is a North Carolina target:\n")
print(targets_df.iloc[row_loc])

print("\nHousehold donated to NC's 2nd district, 2023 SNAP dollars:")
print(X_sparse[row_loc, positions['3702']])  # Household donated to NC's 2nd district

print("\nHousehold donated to NC's 2nd district, 2023 SNAP dollars:")
print(X_sparse[row_loc, positions['201']])  # Household donated to AK's at Large District

Remember, this is a North Carolina target:

target_id                   9372
stratum_id                  9799
variable                    snap
value               4041086120.0
period                      2023
stratum_group_id               4
geographic_id                 37
Name: 80, dtype: object

Household donated to NC's 2nd district, 2023 SNAP dollars:
789.19995

Household donated to NC's 2nd district, 2023 SNAP dollars:
0.0

Key property: For state-level targets, only CDs in that state should have non-zero values.

Example: A NC state SNAP target should have zeros for HI, MT, and AK CD columns.

So let’s see that same household’s value for the Alaska state target:

target_group = tracer.get_group_rows(2)
new_row_loc = target_group.iloc[10]['row_index']   # Manually found the index value 10
row_info = tracer.get_row_info(row_loc)
var = row_info['variable']
var_desc = row_info['variable_desc']
target_geo_id = int(row_info['geographic_id'])

print("Row info for Alaska's SNAP benefit amount:")
print(row_info)

Row info for Alaska's SNAP benefit amount:
{'row_index': 80, 'variable': 'snap', 'variable_desc': 'SNAP allotment', 'geographic_id': '37', 'target_value': 4041086120.0, 'stratum_id': 9799, 'stratum_group_id': 4}

print("\nHousehold donated to AK's 1st district, 2023 SNAP dollars:")
print(X_sparse[new_row_loc, positions['201']])  # Household donated to AK's at Large District


Household donated to AK's 1st district, 2023 SNAP dollars:
342.48004

Section 6: Simulating State-Swapped Calculations¶

When a household is “transplanted” to a different state, state-dependent benefits like SNAP are recalculated under the destination state’s rules.

def create_state_simulation(state_fips):
    """Create a simulation with all households assigned to a specific state."""
    s = Microsimulation(dataset=dataset_path)
    s.set_input(
        "state_fips", 2023, np.full(hh_snap_df.shape[0], state_fips, dtype=np.int32)
    )
    for var in get_calculated_variables(s):
        s.delete_arrays(var)
    return s

# Compare SNAP for first 5 households under NC vs AK rules
nc_sim = create_state_simulation(37)  # NC
ak_sim = create_state_simulation(2)   # AK

nc_snap = nc_sim.calculate("snap", map_to="household").values[:5]
ak_snap = ak_sim.calculate("snap", map_to="household").values[:5]

print("SNAP values for first 5 households under different state rules:")
print(f"  NC rules: {nc_snap}")
print(f"  AK rules: {ak_snap}")
print(f"  Difference: {ak_snap - nc_snap}")

SNAP values for first 5 households under different state rules:
  NC rules: [789.19995117   0.           0.           0.           0.        ]
  AK rules: [342.4800415   0.          0.          0.          0.       ]
  Difference: [-446.71990967    0.            0.            0.            0.        ]

Section 7: Creating the h5 files¶

w (required)

The calibrated weight vector from L0 calibration
Shape: (n_cds * n_households,) — a flattened matrix where each CD has weights for all households
Gets reshaped to (n_cds, n_households) internally

cds_to_calibrate (required)

The ordered list of CD GEOIDs used when building w
Serves two purposes: a. Tells us how to reshape w (via its length) b. Provides the index mapping so we can extract the right rows for any cd_subset

cd_subset (optional, default None)

Which CDs to actually include in the output dataset
Must be a subset of cds_to_calibrate
If None, all CDs are included
Use cases: build a single-state file, a single-CD file for testing, etc.

output_path (optional but effectively required — raises if None)

Where to save the resulting .h5 file
Creates parent directories if needed

dataset_path (optional, default None)

Path to the base .h5 dataset that was used during calibration
This is the “template” — household structure, demographics, etc.
The function loads this, reweights households per CD, updates geography, and stacks

import os

from policyengine_us_data.datasets.cps.local_area_calibration.stacked_dataset_builder import create_sparse_cd_stacked_dataset

# Initialize the weights w for demonstration
# We can't allow too many w cells to be positive for a given state, or the reindexing will fail
w = np.random.binomial(n=1, p=0.01, size=X_sparse.shape[1]).astype(float)

# We'll make sure our earlier household is included:
household_ids = sim.calculate("household_id", map_to="household").values
hh_idx = np.where(household_ids == hh_id)[0][0]

cd_idx = test_cds.index('3701')
flat_idx = cd_idx * len(household_ids) + hh_idx
w[flat_idx] = 2.5

cd_idx = test_cds.index('201')
flat_idx = cd_idx * len(household_ids) + hh_idx
w[flat_idx] = 3.5

# Create a folder for the outputs of the function that is to come.
new_folder_name = "calibration_output"
os.makedirs(new_folder_name, exist_ok=True)
output_path = os.path.join(new_folder_name, "results.h5")

cd_subset = ['3701', '201']
create_sparse_cd_stacked_dataset(
    w,
    test_cds, # cds_to_calibrate - Defines the structure of the weight vector w
    cd_subset=cd_subset, #  cd_subset - Specifies which CDs to actually include in the output dataset (optional, defaults to all).
    dataset_path=dataset_path,
    output_path=output_path,
)

Processing subset of 2 CDs: 3701, 201...
Output path: calibration_output/results.h5

Original dataset has 13,507 households
Extracted weights for 2 CDs from full weight matrix
Total active household-CD pairs: 284
Total weight in W matrix: 288
Processing CD 201 (2/2)...

Combining 2 CD DataFrames...
Total households across all CDs: 284
Combined DataFrame shape: (898, 167)

Reindexing all entity IDs using 25k ranges per CD...
  Created 284 unique households across 2 CDs
  Reindexing persons using 25k ranges...
  Reindexing tax units...
  Reindexing SPM units...
  Reindexing marital units...
  Final persons: 898
  Final households: 284
  Final tax units: 438
  Final SPM units: 298
  Final marital units: 693

Weights in combined_df AFTER reindexing:
  HH weight sum: 0.00M
  Person weight sum: 0.00M
  Ratio: 1.00

Overflow check:
  Max person ID after reindexing: 5,125,419
  Max person ID × 100: 512,541,900
  int32 max: 2,147,483,647
  ✓ No overflow risk!

Creating Dataset from combined DataFrame...
Building simulation from Dataset...

Saving to calibration_output/results.h5...
Found 165 input variables to save
Variables saved: 163
Variables skipped: 3491
Sparse CD-stacked dataset saved successfully!
Household mapping saved to calibration_output/mappings/results_household_mapping.csv

Verifying saved file...
  Final households: 284
  Final persons: 898
  Total population (from household weights): 288

'calibration_output/results.h5'

%ls calibration_output

mappings/  results.h5

Note that there is a mappings directory that has also been created by create_sparse_cd_stacked_dataset. This contains the CSV file that links the original households to the donor households. The reason it’s a seperate folder is to keep the h5 files and the mapping CSVs organized when this function is run for all districts or states.

%ls calibration_output/mappings

results_household_mapping.csv

sim_after = Microsimulation(dataset="./calibration_output/results.h5")

hh_after_df =  pd.DataFrame(sim_after.calculate_dataframe([
    "household_id", "congressional_district_geoid", "county", "household_weight", "state_fips", "snap"])                                        
)
print(hh_after_df)

     household_id  congressional_district_geoid  \
0           50000                           201   
1           50001                           201   
2           50002                           201   
3           50003                           201   
4           50004                           201   
..            ...                           ...   
279        125125                          3701   
280        125126                          3701   
281        125127                          3701   
282        125128                          3701   
283        125129                          3701   

                            county  household_weight  state_fips         snap  
0    ALEUTIANS_WEST_CENSUS_AREA_AK               3.5           2   342.480042  
1     YUKON_KOYUKUK_CENSUS_AREA_AK               1.0           2  3433.199707  
2        ALEUTIANS_EAST_BOROUGH_AK               1.0           2     0.000000  
3        SITKA_CITY_AND_BOROUGH_AK               1.0           2     0.000000  
4        ANCHORAGE_MUNICIPALITY_AK               1.0           2     0.000000  
..                             ...               ...         ...          ...  
279               LENOIR_COUNTY_NC               1.0          37     0.000000  
280               CHOWAN_COUNTY_NC               1.0          37     0.000000  
281               LENOIR_COUNTY_NC               1.0          37     0.000000  
282                WAYNE_COUNTY_NC               1.0          37     0.000000  
283            EDGECOMBE_COUNTY_NC               1.0          37     0.000000  

[284 rows x 6 columns]

We can see one of the correct instances above but let’s confirm that this new household id does in fact link back to the original in both cases.

mapping_df = pd.read_csv("calibration_output/mappings/results_household_mapping.csv")
mapping_df.loc[mapping_df.original_household_id == hh_id]

new_hh_ids = mapping_df.loc[mapping_df.original_household_id == hh_id].new_household_id
hh_after_df.loc[hh_after_df.household_id.isin(new_hh_ids)]

And we can see that the snap numbers still match their values from the different US state systems. However note that due to the use of policyengine-core’s random function in a component of snap_gross_income, for some households, the value in the final simulation will not match the one used in creating the X matrix (X_sparse here). This is outlined in Issue 412.

%rm -r calibration_output