This notebook demonstrates the sparse matrix construction for local area (congressional district) calibration. It uses a subset of CDs from NC, HI, MT, and AK for manageable runtime.
Section 1: Setup & Configuration¶
from sqlalchemy import create_engine, text
import pandas as pd
import numpy as np
from policyengine_us import Microsimulation
from policyengine_us_data.storage import STORAGE_FOLDER
from policyengine_us_data.datasets.cps.local_area_calibration.sparse_matrix_builder import (
SparseMatrixBuilder,
)
from policyengine_us_data.datasets.cps.local_area_calibration.matrix_tracer import (
MatrixTracer,
)
from policyengine_us_data.datasets.cps.local_area_calibration.calibration_utils import (
get_calculated_variables,
create_target_groups,
)/home/baogorek/envs/pe/lib/python3.13/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
TEST_LITE == False
db_path = STORAGE_FOLDER / "calibration" / "policy_data.db"
db_uri = f"sqlite:///{db_path}"
dataset_path = str(STORAGE_FOLDER / "stratified_extended_cps_2023.h5")
engine = create_engine(db_uri)Section 2: Select Test Congressional Districts¶
We use CDs from 4 states for testing:
NC (37): 14 CDs (3701-3714) - provides same-state different-CD test cases
HI (15): 2 CDs (1501-1502)
MT (30): 2 CDs (3001-3002)
AK (2): 1 CD (200)
query = """
SELECT DISTINCT sc.value as cd_geoid
FROM strata s
JOIN stratum_constraints sc ON s.stratum_id = sc.stratum_id
WHERE s.stratum_group_id = 1
AND sc.constraint_variable = 'congressional_district_geoid'
AND (
sc.value LIKE '37__'
OR sc.value LIKE '150_'
OR sc.value LIKE '300_'
OR sc.value = '200' OR sc.value = '201'
)
ORDER BY sc.value
"""
with engine.connect() as conn:
result = conn.execute(text(query)).fetchall()
test_cds = [row[0] for row in result]
print(f"Testing with {len(test_cds)} congressional districts:")
print(f" NC (37): {[cd for cd in test_cds if cd.startswith('37')]}")
print(f" HI (15): {[cd for cd in test_cds if cd.startswith('15')]}")
print(f" MT (30): {[cd for cd in test_cds if cd.startswith('30')]}")
print(f" AK (2): {[cd for cd in test_cds if cd.startswith('20')]}")Testing with 19 congressional districts:
NC (37): ['3701', '3702', '3703', '3704', '3705', '3706', '3707', '3708', '3709', '3710', '3711', '3712', '3713', '3714']
HI (15): ['1501', '1502']
MT (30): ['3001', '3002']
AK (2): ['201']
Section 3: Build the Sparse Matrix¶
The sparse matrix X_sparse has:
Rows: Calibration targets (e.g., SNAP totals by geography)
Columns: (household × CD) pairs - each household appears once per CD
We filter to SNAP targets only (stratum_group_id=4) for this demonstration.
sim = Microsimulation(dataset=dataset_path)
builder = SparseMatrixBuilder(
db_uri,
time_period=2023,
cds_to_calibrate=test_cds,
dataset_path=dataset_path,
)
targets_df, X_sparse, household_id_mapping = builder.build_matrix(
sim, target_filter={"stratum_group_ids": [4], "variables": ["snap"]}
)
print(f"X_sparse shape: {X_sparse.shape}")
print(f" Rows (targets): {X_sparse.shape[0]}")
print(f" Columns (household × CD pairs): {X_sparse.shape[1]}")
print(f" Non-zero entries: {X_sparse.nnz:,}")
print(f" Sparsity: {1 - X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1]):.2%}")X_sparse shape: (539, 256633)
Rows (targets): 539
Columns (household × CD pairs): 256633
Non-zero entries: 67,756
Sparsity: 99.95%
Section 4: Understanding the Matrix Structure with MatrixTracer¶
The MatrixTracer helps navigate the sparse matrix by providing lookups between:
Column indices ↔ (household_id, CD) pairs
Row indices ↔ target definitions
tracer = MatrixTracer(
targets_df, X_sparse, household_id_mapping, test_cds, sim
)
tracer.print_matrix_structure()
================================================================================
MATRIX STRUCTURE BREAKDOWN
================================================================================
Matrix dimensions: 539 rows x 256633 columns
Rows = 539 targets
Columns = 13507 households x 19 CDs
= 13,507 x 19 = 256,633
--------------------------------------------------------------------------------
COLUMN STRUCTURE (Households stacked by CD)
--------------------------------------------------------------------------------
Showing first and last 5 CDs of 19 total:
First 5 CDs:
cd_geoid start_col end_col n_households
1501 0 13506 13507
1502 13507 27013 13507
201 27014 40520 13507
3001 40521 54027 13507
3002 54028 67534 13507
Last 5 CDs:
cd_geoid start_col end_col n_households
3710 189098 202604 13507
3711 202605 216111 13507
3712 216112 229618 13507
3713 229619 243125 13507
3714 243126 256632 13507
--------------------------------------------------------------------------------
ROW STRUCTURE (Targets)
--------------------------------------------------------------------------------
Total targets: 539
Targets by stratum group:
n_targets n_unique_vars
stratum_group_id
1 1 1
4 538 2
--------------------------------------------------------------------------------
TARGET GROUPS (for loss calculation)
--------------------------------------------------------------------------------
=== Creating Target Groups ===
National targets:
Group 0: Snap = 107,062,860,000
State targets:
Group 1: SNAP Household Count (51 targets)
Group 2: Snap (51 targets)
District targets:
Group 3: SNAP Household Count (436 targets)
Total groups created: 4
========================================
Group 0: National Snap (1 target, value=107,062,860,000) - rows [0]
Group 1: State SNAP Household Count (51 targets) - rows [1, 2, 3, ..., 50, 51]
Group 2: State Snap (51 targets) - rows [52, 53, 54, ..., 101, 102]
Group 3: District SNAP Household Count (436 targets) - rows [103, 104, 105, ..., 537, 538]
================================================================================
target_groups, group_info = create_target_groups(targets_df)
=== Creating Target Groups ===
National targets:
Group 0: Snap = 107,062,860,000
State targets:
Group 1: SNAP Household Count (51 targets)
Group 2: Snap (51 targets)
District targets:
Group 3: SNAP Household Count (436 targets)
Total groups created: 4
========================================
target_group = tracer.get_group_rows(2)
row_loc = target_group.iloc[28]['row_index'] # Manually found the index value 28
row_info = tracer.get_row_info(row_loc)
var = row_info['variable']
var_desc = row_info['variable_desc']
target_geo_id = int(row_info['geographic_id'])
print("Row info for North Carolina's SNAP benefit amount:")
print(row_info)Row info for North Carolina's SNAP benefit amount:
{'row_index': 80, 'variable': 'snap', 'variable_desc': 'SNAP allotment', 'geographic_id': '37', 'target_value': 4041086120.0, 'stratum_id': 9799, 'stratum_group_id': 4}
hh_snap_df = pd.DataFrame(sim.calculate_dataframe([
"household_id", "household_weight", "state_fips", "snap"])
)
print(hh_snap_df) household_id household_weight state_fips snap
0 25 1229.699951 23 789.199951
1 76 3119.330078 23 0.000000
2 92 1368.089966 23 0.000000
3 110 1457.579956 23 0.000000
4 140 1445.209961 23 0.000000
... ... ... ... ...
13502 178916 0.000000 15 0.000000
13503 178918 0.000000 15 0.000000
13504 178926 0.000000 15 0.000000
13505 178935 0.000000 15 0.000000
13506 178945 0.000000 15 0.000000
[13507 rows x 4 columns]
If we were to include congressional_district_geoid above, they would all be zeros. It’s not until we do the calibration, i.e., come back with a vector of weights w to multiply X_sparse with, that we will set congressional_district_geoid.
However, every household is already a donor to every contressional district. You can get the column positions for every household (remember targets are on the rows, donor households on the columns) by running tracer’s get_household_column_positions with the original household_id.
# Reverse lookup: get all column positions for a specific household
hh_id = hh_snap_df.loc[hh_snap_df.snap > 0].household_id.values[0]
print(hh_snap_df.loc[hh_snap_df.household_id == hh_id])
print("\nEvaluating the tracer.get_household_column_positions dictionary:\n")
positions = tracer.get_household_column_positions(hh_id)
print(positions) household_id household_weight state_fips snap
0 25 1229.699951 23 789.199951
Evaluating the tracer.get_household_column_positions dictionary:
{'1501': 0, '1502': 13507, '201': 27014, '3001': 40521, '3002': 54028, '3701': 67535, '3702': 81042, '3703': 94549, '3704': 108056, '3705': 121563, '3706': 135070, '3707': 148577, '3708': 162084, '3709': 175591, '3710': 189098, '3711': 202605, '3712': 216112, '3713': 229619, '3714': 243126}
Section 5: Understanding the cells of the X_Sparse matrix and Target vector¶
print("Remember, this is a North Carolina target:\n")
print(targets_df.iloc[row_loc])
print("\nHousehold donated to NC's 2nd district, 2023 SNAP dollars:")
print(X_sparse[row_loc, positions['3702']]) # Household donated to NC's 2nd district
print("\nHousehold donated to NC's 2nd district, 2023 SNAP dollars:")
print(X_sparse[row_loc, positions['201']]) # Household donated to AK's at Large DistrictRemember, this is a North Carolina target:
target_id 9372
stratum_id 9799
variable snap
value 4041086120.0
period 2023
stratum_group_id 4
geographic_id 37
Name: 80, dtype: object
Household donated to NC's 2nd district, 2023 SNAP dollars:
789.19995
Household donated to NC's 2nd district, 2023 SNAP dollars:
0.0
Key property: For state-level targets, only CDs in that state should have non-zero values.
Example: A NC state SNAP target should have zeros for HI, MT, and AK CD columns.
So let’s see that same household’s value for the Alaska state target:
target_group = tracer.get_group_rows(2)
new_row_loc = target_group.iloc[10]['row_index'] # Manually found the index value 10
row_info = tracer.get_row_info(row_loc)
var = row_info['variable']
var_desc = row_info['variable_desc']
target_geo_id = int(row_info['geographic_id'])
print("Row info for Alaska's SNAP benefit amount:")
print(row_info)Row info for Alaska's SNAP benefit amount:
{'row_index': 80, 'variable': 'snap', 'variable_desc': 'SNAP allotment', 'geographic_id': '37', 'target_value': 4041086120.0, 'stratum_id': 9799, 'stratum_group_id': 4}
print("\nHousehold donated to AK's 1st district, 2023 SNAP dollars:")
print(X_sparse[new_row_loc, positions['201']]) # Household donated to AK's at Large District
Household donated to AK's 1st district, 2023 SNAP dollars:
342.48004
Section 6: Simulating State-Swapped Calculations¶
When a household is “transplanted” to a different state, state-dependent benefits like SNAP are recalculated under the destination state’s rules.
def create_state_simulation(state_fips):
"""Create a simulation with all households assigned to a specific state."""
s = Microsimulation(dataset=dataset_path)
s.set_input(
"state_fips", 2023, np.full(hh_snap_df.shape[0], state_fips, dtype=np.int32)
)
for var in get_calculated_variables(s):
s.delete_arrays(var)
return s
# Compare SNAP for first 5 households under NC vs AK rules
nc_sim = create_state_simulation(37) # NC
ak_sim = create_state_simulation(2) # AK
nc_snap = nc_sim.calculate("snap", map_to="household").values[:5]
ak_snap = ak_sim.calculate("snap", map_to="household").values[:5]
print("SNAP values for first 5 households under different state rules:")
print(f" NC rules: {nc_snap}")
print(f" AK rules: {ak_snap}")
print(f" Difference: {ak_snap - nc_snap}")SNAP values for first 5 households under different state rules:
NC rules: [789.19995117 0. 0. 0. 0. ]
AK rules: [342.4800415 0. 0. 0. 0. ]
Difference: [-446.71990967 0. 0. 0. 0. ]
Section 7: Creating the h5 files¶
w (required)
The calibrated weight vector from L0 calibration
Shape: (n_cds * n_households,) — a flattened matrix where each CD has weights for all households
Gets reshaped to (n_cds, n_households) internally
cds_to_calibrate (required)
The ordered list of CD GEOIDs used when building w
Serves two purposes: a. Tells us how to reshape w (via its length) b. Provides the index mapping so we can extract the right rows for any cd_subset
cd_subset (optional, default None)
Which CDs to actually include in the output dataset
Must be a subset of cds_to_calibrate
If None, all CDs are included
Use cases: build a single-state file, a single-CD file for testing, etc.
output_path (optional but effectively required — raises if None)
Where to save the resulting .h5 file
Creates parent directories if needed
dataset_path (optional, default None)
Path to the base .h5 dataset that was used during calibration
This is the “template” — household structure, demographics, etc.
The function loads this, reweights households per CD, updates geography, and stacks
import os
from policyengine_us_data.datasets.cps.local_area_calibration.stacked_dataset_builder import create_sparse_cd_stacked_dataset
# Initialize the weights w for demonstration
# We can't allow too many w cells to be positive for a given state, or the reindexing will fail
w = np.random.binomial(n=1, p=0.01, size=X_sparse.shape[1]).astype(float)
# We'll make sure our earlier household is included:
household_ids = sim.calculate("household_id", map_to="household").values
hh_idx = np.where(household_ids == hh_id)[0][0]
cd_idx = test_cds.index('3701')
flat_idx = cd_idx * len(household_ids) + hh_idx
w[flat_idx] = 2.5
cd_idx = test_cds.index('201')
flat_idx = cd_idx * len(household_ids) + hh_idx
w[flat_idx] = 3.5
# Create a folder for the outputs of the function that is to come.
new_folder_name = "calibration_output"
os.makedirs(new_folder_name, exist_ok=True)
output_path = os.path.join(new_folder_name, "results.h5")cd_subset = ['3701', '201']
create_sparse_cd_stacked_dataset(
w,
test_cds, # cds_to_calibrate - Defines the structure of the weight vector w
cd_subset=cd_subset, # cd_subset - Specifies which CDs to actually include in the output dataset (optional, defaults to all).
dataset_path=dataset_path,
output_path=output_path,
)Processing subset of 2 CDs: 3701, 201...
Output path: calibration_output/results.h5
Original dataset has 13,507 households
Extracted weights for 2 CDs from full weight matrix
Total active household-CD pairs: 284
Total weight in W matrix: 288
Processing CD 201 (2/2)...
Combining 2 CD DataFrames...
Total households across all CDs: 284
Combined DataFrame shape: (898, 167)
Reindexing all entity IDs using 25k ranges per CD...
Created 284 unique households across 2 CDs
Reindexing persons using 25k ranges...
Reindexing tax units...
Reindexing SPM units...
Reindexing marital units...
Final persons: 898
Final households: 284
Final tax units: 438
Final SPM units: 298
Final marital units: 693
Weights in combined_df AFTER reindexing:
HH weight sum: 0.00M
Person weight sum: 0.00M
Ratio: 1.00
Overflow check:
Max person ID after reindexing: 5,125,419
Max person ID × 100: 512,541,900
int32 max: 2,147,483,647
✓ No overflow risk!
Creating Dataset from combined DataFrame...
Building simulation from Dataset...
Saving to calibration_output/results.h5...
Found 165 input variables to save
Variables saved: 163
Variables skipped: 3491
Sparse CD-stacked dataset saved successfully!
Household mapping saved to calibration_output/mappings/results_household_mapping.csv
Verifying saved file...
Final households: 284
Final persons: 898
Total population (from household weights): 288
'calibration_output/results.h5'%ls calibration_outputmappings/ results.h5
Note that there is a mappings directory that has also been created by create_sparse_cd_stacked_dataset. This contains the CSV file that links the original households to the donor households. The reason it’s a seperate folder is to keep the h5 files and the mapping CSVs organized when this function is run for all districts or states.
%ls calibration_output/mappingsresults_household_mapping.csv
sim_after = Microsimulation(dataset="./calibration_output/results.h5")
hh_after_df = pd.DataFrame(sim_after.calculate_dataframe([
"household_id", "congressional_district_geoid", "county", "household_weight", "state_fips", "snap"])
)
print(hh_after_df) household_id congressional_district_geoid \
0 50000 201
1 50001 201
2 50002 201
3 50003 201
4 50004 201
.. ... ...
279 125125 3701
280 125126 3701
281 125127 3701
282 125128 3701
283 125129 3701
county household_weight state_fips snap
0 ALEUTIANS_WEST_CENSUS_AREA_AK 3.5 2 342.480042
1 YUKON_KOYUKUK_CENSUS_AREA_AK 1.0 2 3433.199707
2 ALEUTIANS_EAST_BOROUGH_AK 1.0 2 0.000000
3 SITKA_CITY_AND_BOROUGH_AK 1.0 2 0.000000
4 ANCHORAGE_MUNICIPALITY_AK 1.0 2 0.000000
.. ... ... ... ...
279 LENOIR_COUNTY_NC 1.0 37 0.000000
280 CHOWAN_COUNTY_NC 1.0 37 0.000000
281 LENOIR_COUNTY_NC 1.0 37 0.000000
282 WAYNE_COUNTY_NC 1.0 37 0.000000
283 EDGECOMBE_COUNTY_NC 1.0 37 0.000000
[284 rows x 6 columns]
We can see one of the correct instances above but let’s confirm that this new household id does in fact link back to the original in both cases.
mapping_df = pd.read_csv("calibration_output/mappings/results_household_mapping.csv")
mapping_df.loc[mapping_df.original_household_id == hh_id]new_hh_ids = mapping_df.loc[mapping_df.original_household_id == hh_id].new_household_id
hh_after_df.loc[hh_after_df.household_id.isin(new_hh_ids)]And we can see that the snap numbers still match their values from the different US state systems. However note that due to the use of policyengine-core’s random function in a component of snap_gross_income, for some households, the value in the final simulation will not match the one used in creating the X matrix (X_sparse here). This is outlined in Issue 412.
%rm -r calibration_output