This notebook demonstrates the clone-based calibration pipeline: how raw CPS records become a calibration matrix and, ultimately, CD-level stacked datasets.
The paradigm shift from the old approach: instead of replicating every household into every congressional district, we clone each record N times and assign each clone a random census block drawn from a population-weighted distribution. Each clone inherits a state, CD, and block — and gets re-simulated under the rules of its assigned state.
We follow one household (record_idx=8629, household_id 128694, SNAP $18,396) through the entire pipeline:
1. Clone and assign geography
2. Simulate under new state rules (_simulate_clone)
3. Geographic column masking
4. Re-randomize takeup per census block
5. Build the calibration matrix
6. Create stacked datasets from calibrated weights
Companion notebook: calibration_matrix.ipynb
Requirements: policy_data.db, block_cd_distributions.csv.gz, and the stratified CPS h5 file in STORAGE_FOLDER.
Section 1: Setup & Configuration
import numpy as np
import pandas as pd
from collections import defaultdict
from policyengine_us import Microsimulation
from policyengine_us_data.storage import STORAGE_FOLDER
from policyengine_us_data.calibration.clone_and_assign import (
assign_random_geography,
GeographyAssignment,
load_global_block_distribution,
)
from policyengine_us_data.calibration.unified_matrix_builder import (
UnifiedMatrixBuilder,
)
from policyengine_us_data.calibration.unified_calibration import (
rerandomize_takeup,
SIMPLE_TAKEUP_VARS,
)
from policyengine_us_data.utils.randomness import seeded_rng
from policyengine_us_data.parameters import load_take_up_rate
from policyengine_us_data.datasets.cps.local_area_calibration.calibration_utils import (
get_calculated_variables,
STATE_CODES,
get_all_cds_from_database,
)
from policyengine_us_data.datasets.cps.local_area_calibration.stacked_dataset_builder import (
create_sparse_cd_stacked_dataset,
)
db_path = STORAGE_FOLDER / "calibration" / "policy_data.db"
db_uri = f"sqlite:///{db_path}"
dataset_path = str(STORAGE_FOLDER / "stratified_extended_cps_2024.h5")
N_CLONES = 3
SEED = 42
sim = Microsimulation(dataset=dataset_path)
hh_ids = sim.calculate("household_id", map_to="household").values
snap_values = sim.calculate("snap", map_to="household").values
n_records = len(hh_ids)
record_idx = 8629 # High SNAP ($18k), lands in TX/PA/NY with seed=42
example_hh_id = hh_ids[record_idx]
print(f"Base dataset: {n_records:,} households")
print(
f"Example household: record_idx={record_idx}, "
f"household_id={example_hh_id}, "
f"SNAP=${snap_values[record_idx]:,.2f}"
)
Base dataset: 11,999 households
Example household: record_idx=8629, household_id=128694, SNAP=$18,396.00
Section 2: Geography Assignment
assign_random_geography creates n_records * n_clones total records, each assigned a random census block from a population-weighted distribution. State and CD are derived from the block GEOID. The result is a GeographyAssignment dataclass with arrays indexed as clone_idx * n_records + record_idx.
geography = assign_random_geography(n_records, n_clones=N_CLONES, seed=SEED)
n_total = n_records * N_CLONES
print(f"Total cloned records: {n_total:,}")
print(f"Unique states: {len(np.unique(geography.state_fips))}")
print(f"Unique CDs: {len(np.unique(geography.cd_geoid))}")
print(f"Unique blocks: {len(np.unique(geography.block_geoid))}")Total cloned records: 35,997
Unique states: 50
Unique CDs: 435
Unique blocks: 35508
print(
f"Example household (record_idx={record_idx}) across {N_CLONES} clones:\n"
)
rows = []
for c in range(N_CLONES):
col = c * n_records + record_idx
rows.append(
{
"clone": c,
"col": col,
"state_fips": geography.state_fips[col],
"abbr": STATE_CODES.get(geography.state_fips[col], "??"),
"cd_geoid": geography.cd_geoid[col],
"block_geoid": geography.block_geoid[col],
}
)
pd.DataFrame(rows)
Example household (record_idx=8629) across 3 clones:
One household, three parallel geographic identities. Each clone will be simulated under different state rules, producing different benefit amounts.
Note: With only N_CLONES=3 (~36K total samples), small-population areas like DC may not appear in the random draw. The production pipeline uses N_CLONES=10, which covers all 51 state-equivalents and 436 CDs.
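A quick coverage check makes the note concrete (a sketch; it assumes STATE_CODES is keyed by integer FIPS codes and that get_all_cds_from_database returns CD GEOIDs convertible to the string form used elsewhere in this notebook):
drawn_states = {int(s) for s in np.unique(geography.state_fips)}
missing_states = sorted(set(STATE_CODES) - drawn_states)
print(
    "State-equivalents missing from this draw:",
    [f"{STATE_CODES[f]} ({f})" for f in missing_states],
)
db_cds = {str(cd) for cd in get_all_cds_from_database(db_uri)}
drawn_cds = {str(cd) for cd in np.unique(geography.cd_geoid)}
print("CDs missing from this draw:", sorted(db_cds - drawn_cds))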
blocks, cds, states, probs = load_global_block_distribution()
print(f"Global block distribution: {len(blocks):,} blocks")
print(f"Top 5 states by total probability:")
state_prob = pd.Series(probs, index=states).groupby(level=0).sum()
top5 = state_prob.nlargest(5)
for fips, p in top5.items():
print(f" {STATE_CODES.get(fips, '??')} ({fips}): {p:.3%}")Global block distribution: 5,765,442 blocks
Top 5 states by total probability:
CA (6): 11.954%
TX (48): 8.736%
FL (12): 6.437%
NY (36): 5.977%
PA (42): 3.908%
Section 3: Inside _simulate_clone — State-Swap
For each clone, _simulate_clone does four things:
1. Creates a fresh Microsimulation from the base dataset
2. Overwrites state_fips with the clone's assigned states
3. Optionally calls a sim_modifier (e.g., takeup re-randomization)
4. Clears cached formulas via get_calculated_variables — preserving survey inputs and IDs while forcing recalculation of state-dependent variables like SNAP
Let’s reproduce this manually for clone 0.
clone_idx = 0
col_start = clone_idx * n_records
col_end = col_start + n_records
clone_states = geography.state_fips[col_start:col_end]
clone_sim = Microsimulation(dataset=dataset_path)
clone_sim.set_input("state_fips", 2024, clone_states.astype(np.int32))
for var in get_calculated_variables(clone_sim):
clone_sim.delete_arrays(var)
new_snap = clone_sim.calculate("snap", map_to="household").values
orig_state = sim.calculate("state_fips", map_to="household").values[record_idx]
new_state = clone_states[record_idx]
print(f"Example household (record_idx={record_idx}):")
print(
f" Original state: {STATE_CODES.get(int(orig_state), '??')} "
f"({int(orig_state)})"
)
print(
f" Clone 0 state: {STATE_CODES.get(int(new_state), '??')} "
f"({int(new_state)})"
)
print(f" Original SNAP: ${snap_values[record_idx]:,.2f}")
print(f" Clone 0 SNAP: ${new_snap[record_idx]:,.2f}")Example household (record_idx=8629):
Original state: NC (37)
Clone 0 state: TX (48)
Original SNAP: $18,396.00
Clone 0 SNAP: $18,396.00
print(f"SNAP for record_idx={record_idx} across all {N_CLONES} clones:\n")
rows = []
for c in range(N_CLONES):
cs = geography.state_fips[c * n_records + record_idx]
s = Microsimulation(dataset=dataset_path)
s.set_input(
"state_fips",
2024,
geography.state_fips[c * n_records : (c + 1) * n_records].astype(
np.int32
),
)
for var in get_calculated_variables(s):
s.delete_arrays(var)
clone_snap = s.calculate("snap", map_to="household").values
rows.append(
{
"clone": c,
"state": STATE_CODES.get(int(cs), "??"),
"state_fips": int(cs),
"SNAP": f"${clone_snap[record_idx]:,.2f}",
}
)
pd.DataFrame(rows)
SNAP for record_idx=8629 across all 3 clones:
get_calculated_variables is selective: it identifies variables with formulas (state-dependent computations) while preserving survey-reported inputs and entity IDs. This is what allows the same demographic household to produce different benefit amounts under different state rules.
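To see that selectivity directly, we can inspect what it returns on the untouched base simulation (a sketch; which specific variables carry formulas is an assumption, flagged in the comment):
calc_vars = set(get_calculated_variables(sim))
print(f"{len(calc_vars)} formula-driven variables would be cleared, e.g.:")
print(sorted(calc_vars)[:5])
# Expectation (assumption): benefit formulas like snap are cleared, while
# entity IDs like household_id are preserved.
for v in ["snap", "household_id"]:
    print(f"  {v:15s} cleared? {v in calc_vars}")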
Section 4: Geographic Column Masking
When assembling the calibration matrix, each target row only “sees” columns (clones) whose geography matches the target’s geography. This is implemented via state_to_cols and cd_to_cols dictionaries built from the GeographyAssignment.
This is step 3 of build_matrix — reproduced here for transparency.
state_col_lists = defaultdict(list)
cd_col_lists = defaultdict(list)
for col in range(n_total):
state_col_lists[int(geography.state_fips[col])].append(col)
cd_col_lists[str(geography.cd_geoid[col])].append(col)
state_to_cols = {s: np.array(c) for s, c in state_col_lists.items()}
cd_to_cols = {cd: np.array(c) for cd, c in cd_col_lists.items()}
print(f"Unique states mapped: {len(state_to_cols)}")
print(f"Unique CDs mapped: {len(cd_to_cols)}")
state_counts = {s: len(c) for s, c in state_to_cols.items()}
sc_series = pd.Series(state_counts)
print(
f"\nColumns per state: min={sc_series.min()}, "
f"median={sc_series.median():.0f}, max={sc_series.max()}"
)
Unique states mapped: 50
Unique CDs mapped: 435
Columns per state: min=62, median=494, max=4311
print(f"Example household clone visibility:\n")
for c in range(N_CLONES):
col = c * n_records + record_idx
state = int(geography.state_fips[col])
cd = str(geography.cd_geoid[col])
abbr = STATE_CODES.get(state, "??")
print(f"Clone {c} ({abbr}, CD {cd}):")
print(
f" Visible to {abbr} state targets: "
f"col {col} in state_to_cols[{state}]? "
f"{col in state_to_cols.get(state, [])}"
)
print(
f" Visible to CD {cd} targets: "
f"col {col} in cd_to_cols['{cd}']? "
f"{col in cd_to_cols.get(cd, [])}"
)
# Check an unrelated state
print(
f" Visible to NC (37) targets: " f"{col in state_to_cols.get(37, [])}"
)
print()
Example household clone visibility:
Clone 0 (TX, CD 4817):
Visible to TX state targets: col 8629 in state_to_cols[48]? True
Visible to CD 4817 targets: col 8629 in cd_to_cols['4817']? True
Visible to NC (37) targets: False
Clone 1 (PA, CD 4201):
Visible to PA state targets: col 20628 in state_to_cols[42]? True
Visible to CD 4201 targets: col 20628 in cd_to_cols['4201']? True
Visible to NC (37) targets: False
Clone 2 (NY, CD 3611):
Visible to NY state targets: col 32627 in state_to_cols[36]? True
Visible to CD 3611 targets: col 32627 in cd_to_cols['3611']? True
Visible to NC (37) targets: False
This is the mechanism behind the sparsity pattern in calibration_matrix.ipynb: a household clone assigned to TX can contribute to TX state targets and TX CD targets, but produces a zero entry for NC or AK targets. The matrix is sparse because each clone only intersects a small fraction of all geographic targets.
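One consequence worth checking (a quick sketch): because blocks are drawn from the population-weighted distribution, the share of columns visible to each state's targets should roughly track the state probabilities from Section 2.
visible_share = pd.Series(
    {STATE_CODES.get(s, s): len(cols) / n_total for s, cols in state_to_cols.items()}
).sort_values(ascending=False)
print("Share of columns visible to each state's targets (top 5):")
print(visible_share.head(5).map("{:.2%}".format))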
Section 5: Takeup Re-randomization
The base CPS has fixed takeup decisions (e.g., “this household takes up SNAP”). But when we clone a household into different census blocks, each block should have independently drawn takeup — otherwise every clone of a SNAP-participating household would still participate, regardless of geography.
rerandomize_takeup solves this: for each census block, it uses seeded_rng(variable_name, salt=block_geoid) to draw new takeup booleans. The seed is deterministic per (variable, block) pair, so results are reproducible.
print(f"{len(SIMPLE_TAKEUP_VARS)} takeup variables:\n")
for spec in SIMPLE_TAKEUP_VARS:
rate_key = spec["rate_key"]
if rate_key == "voluntary_filing":
rate = 0.05
else:
rate = load_take_up_rate(rate_key, 2024)
rate_str = (
f"{rate:.2%}"
if isinstance(rate, float)
else f"dict ({len(rate)} entries)"
)
print(
f" {spec['variable']:40s} "
f"entity={spec['entity']:10s} rate={rate_str}"
)
8 takeup variables:
takes_up_snap_if_eligible entity=spm_unit rate=82.00%
takes_up_aca_if_eligible entity=tax_unit rate=67.20%
takes_up_dc_ptc entity=tax_unit rate=32.00%
takes_up_head_start_if_eligible entity=person rate=30.00%
takes_up_early_head_start_if_eligible entity=person rate=9.00%
takes_up_ssi_if_eligible entity=person rate=50.00%
would_file_taxes_voluntarily entity=tax_unit rate=5.00%
takes_up_medicaid_if_eligible entity=person rate=dict (51 entries)
block_a = "482011234567890"
block_b = "170311234567890"
var = "takes_up_snap_if_eligible"
rng_a1 = seeded_rng(var, salt=block_a)
rng_a2 = seeded_rng(var, salt=block_a)
rng_b = seeded_rng(var, salt=block_b)
rng_other = seeded_rng("takes_up_aca_if_eligible", salt=block_a)
draws_a1 = rng_a1.random(5)
draws_a2 = rng_a2.random(5)
draws_b = rng_b.random(5)
draws_other = rng_other.random(5)
print("Same block + same var (reproducible):")
print(f" {draws_a1}")
print(f" {draws_a2}")
print(f" Match: {np.allclose(draws_a1, draws_a2)}")
print(f"\nDifferent block, same var:")
print(f" {draws_b}")
print(f" Match: {np.allclose(draws_a1, draws_b)}")
print(f"\nSame block, different var:")
print(f" {draws_other}")
print(f" Match: {np.allclose(draws_a1, draws_other)}")Same block + same var (reproducible):
[0.50514599 0.75213437 0.9703409 0.18048868 0.31969517]
[0.50514599 0.75213437 0.9703409 0.18048868 0.31969517]
Match: True
Different block, same var:
[0.15503168 0.96707026 0.79019745 0.67544525 0.85245009]
Match: False
Same block, different var:
[0.93155876 0.8912794 0.50838888 0.32192278 0.01005173]
Match: False
test_sim = Microsimulation(dataset=dataset_path)
clone_0_states = geography.state_fips[:n_records]
clone_0_blocks = geography.block_geoid[:n_records]
test_sim.set_input("state_fips", 2024, clone_0_states.astype(np.int32))
before = {}
for spec in SIMPLE_TAKEUP_VARS:
v = spec["variable"]
vals = test_sim.calculate(v, map_to=spec["entity"]).values
before[v] = vals.mean()
rerandomize_takeup(test_sim, clone_0_blocks, clone_0_states, 2024)
print("Takeup rates before/after re-randomization (clone 0):\n")
for spec in SIMPLE_TAKEUP_VARS:
v = spec["variable"]
vals = test_sim.calculate(v, map_to=spec["entity"]).values
after = vals.mean()
print(f" {v:40s} before={before[v]:.3%} after={after:.3%}")Takeup rates before/after re-randomization (clone 0):
takes_up_snap_if_eligible before=82.333% after=82.381%
takes_up_aca_if_eligible before=66.718% after=67.486%
takes_up_dc_ptc before=31.483% after=32.044%
takes_up_head_start_if_eligible before=29.963% after=29.689%
takes_up_early_head_start_if_eligible before=8.869% after=8.721%
takes_up_ssi_if_eligible before=100.000% after=49.776%
would_file_taxes_voluntarily before=0.000% after=4.905%
takes_up_medicaid_if_eligible before=84.496% after=80.051%
medicaid_rates = load_take_up_rate("medicaid", 2024)
print("Medicaid takeup rates (state-specific), first 10 states:\n")
for state, rate in sorted(medicaid_rates.items())[:10]:
print(f" {state}: {rate:.2%}")Medicaid takeup rates (state-specific), first 10 states:
AK: 88.00%
AL: 92.00%
AR: 79.00%
AZ: 95.00%
CA: 78.00%
CO: 99.00%
CT: 89.00%
DC: 99.00%
DE: 86.00%
FL: 98.00%
In the full pipeline, rerandomize_takeup is passed to build_matrix as a sim_modifier callback. For each clone, after state_fips is set but before formula caches are cleared, the callback draws new takeup booleans per census block. This means the same household in block A might take up SNAP while in block B it doesn’t — matching the statistical reality that takeup varies by geography.
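Because every draw is seeded only by the (variable, block_geoid) pair, the re-randomization itself is reproducible end to end. A quick check on a fresh simulation (a sketch):
check_sim = Microsimulation(dataset=dataset_path)
check_sim.set_input("state_fips", 2024, clone_0_states.astype(np.int32))
rerandomize_takeup(check_sim, clone_0_blocks, clone_0_states, 2024)
v = "takes_up_snap_if_eligible"
same = np.array_equal(
    test_sim.calculate(v, map_to="spm_unit").values,
    check_sim.calculate(v, map_to="spm_unit").values,
)
print(f"Re-running rerandomize_takeup reproduces identical {v} draws: {same}")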
Section 6: Matrix Build Verification
Let’s run the full build_matrix pipeline and verify the example household’s pattern matches our Section 4 predictions. We use the same target_filter as in calibration_matrix.ipynb but without sim_modifier to match that notebook’s output.
builder = UnifiedMatrixBuilder(
db_uri=db_uri,
time_period=2024,
dataset_path=dataset_path,
)
targets_df, X_sparse, target_names = builder.build_matrix(
geography,
sim,
target_filter={"domain_variables": ["snap"]},
)
print(f"Matrix shape: {X_sparse.shape}")
print(f"Non-zero entries: {X_sparse.nnz:,}")
print(f"Density: {X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1]):.6f}")2026-02-13 17:11:22,384 - INFO - Processing clone 1/3 (cols 0-11998, 50 unique states)...
2026-02-13 17:11:23,509 - INFO - Processing clone 2/3 (cols 11999-23997, 50 unique states)...
2026-02-13 17:11:24,645 - INFO - Processing clone 3/3 (cols 23998-35996, 50 unique states)...
2026-02-13 17:11:25,769 - INFO - Assembling matrix from 3 clones...
2026-02-13 17:11:25,771 - INFO - Matrix: 538 targets x 35997 cols, 14946 nnz
Matrix shape: (538, 35997)
Non-zero entries: 14,946
Density: 0.000772
print(f"Example household non-zero pattern across clones:\n")
for c in range(N_CLONES):
col = c * n_records + record_idx
col_vec = X_sparse[:, col]
nz_rows = col_vec.nonzero()[0]
state = int(geography.state_fips[col])
cd = geography.cd_geoid[col]
abbr = STATE_CODES.get(state, "??")
print(f"Clone {c} ({abbr}, CD {cd}): {len(nz_rows)} non-zero rows")
for r in nz_rows:
row = targets_df.iloc[r]
print(
f" row {r}: {row['variable']} "
f"(geo={row['geographic_id']}): "
f"{X_sparse[r, col]:.2f}"
)
Example household non-zero pattern across clones:
Clone 0 (TX, CD 4817): 3 non-zero rows
row 39: household_count (geo=48): 1.00
row 90: snap (geo=48): 18396.00
row 410: household_count (geo=4817): 1.00
Clone 1 (PA, CD 4201): 3 non-zero rows
row 34: household_count (geo=42): 1.00
row 85: snap (geo=42): 18396.00
row 358: household_count (geo=4201): 1.00
Clone 2 (NY, CD 3611): 3 non-zero rows
row 27: household_count (geo=36): 1.00
row 78: snap (geo=36): 18396.00
row 292: household_count (geo=3611): 1.00
Section 7: From Weights to Datasets
create_sparse_cd_stacked_dataset takes calibrated weights and builds an h5 file with only the non-zero-weight households, reindexed per CD. Internally it does its own state-swap simulation — loading the base dataset, assigning state_fips for the target CD's state, and recalculating benefits from scratch. This means SNAP values in the output reflect the destination state's rules (for our example household, a larger SNAP amount under AK's rules than under NC's, as the verification at the end of this section shows).
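That internal recalculation follows the same recipe as Section 3. A sketch of the idea (not the builder's actual code; treating everything before the last two digits of the CD GEOID as the state FIPS is an assumption that matches the GEOIDs used in this notebook):
cd = "201"  # AK's single district, used in the demo below
cd_state = int(cd[:-2])  # "201" -> state FIPS 2, "3701" -> 37 (assumed convention)
cd_sim = Microsimulation(dataset=dataset_path)
cd_sim.set_input("state_fips", 2024, np.full(n_records, cd_state, dtype=np.int32))
for var in get_calculated_variables(cd_sim):
    cd_sim.delete_arrays(var)
cd_snap = cd_sim.calculate("snap", map_to="household").values
print(
    f"Example HH SNAP under {STATE_CODES.get(cd_state, '??')} rules: "
    f"${cd_snap[record_idx]:,.2f}"
)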
Format gap: The calibration produces weights in clone layout (n_records * n_clones,) where each clone maps to one specific CD via the GeographyAssignment. The stacked dataset builder expects CD layout (n_cds * n_households,) where every CD has a weight slot for every household. Converting between these — accumulating clone weights into their assigned CDs — is a separate step not yet implemented. The demo below constructs artificial CD-layout weights directly to show how the builder works.
print("Dimension mismatch:")
print(
f" Calibration output: ({n_records} * {N_CLONES},) "
f"= {n_records * N_CLONES:,} (clone layout)"
)
all_cds = get_all_cds_from_database(db_uri)
n_cds = len(all_cds)
print(
f" Stacked builder expects: ({n_cds} * {n_records},) "
f"= {n_cds * n_records:,} (CD layout)"
)
Dimension mismatch:
Calibration output: (11999 * 3,) = 35,997 (clone layout)
Stacked builder expects: (436 * 11999,) = 5,231,564 (CD layout)
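A minimal sketch of the missing conversion (a hypothetical helper, not part of the codebase): scatter-add each clone column's calibrated weight into the CD-layout slot of its assigned CD.
def clone_to_cd_layout(clone_weights, geography, cds, n_records):
    """Accumulate clone-layout weights into CD layout (hypothetical sketch)."""
    cd_pos = {str(cd): i for i, cd in enumerate(cds)}
    w_cd = np.zeros(len(cds) * n_records)
    for col, weight in enumerate(clone_weights):
        if weight == 0:
            continue
        record = col % n_records  # col = clone_idx * n_records + record_idx
        w_cd[cd_pos[str(geography.cd_geoid[col])] * n_records + record] += weight
    return w_cd

# e.g. clone_to_cd_layout(calibrated_weights, geography, all_cds, n_records)
# would yield the (n_cds * n_records,) CD-layout vector the builder expects.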
import os
demo_cds = ["3701", "201"]
n_demo_cds = len(demo_cds)
w = (
np.random.default_rng(42)
.binomial(n=1, p=0.01, size=n_demo_cds * n_records)
.astype(float)
)
# Seed our example household into both CDs
cd_idx_3701 = demo_cds.index("3701")
w[cd_idx_3701 * n_records + record_idx] = 2.5
cd_idx_201 = demo_cds.index("201")
w[cd_idx_201 * n_records + record_idx] = 3.5
output_dir = "calibration_output"
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, "results.h5")
print(
f"Weight vector: {len(w):,} entries "
f"({n_demo_cds} CDs x {n_records:,} HH)"
)
print(f"Non-zero weights: {(w > 0).sum()}")
print(
f"Example HH weight in CD 3701: {w[cd_idx_3701 * n_records + record_idx]}"
)
print(f"Example HH weight in CD 201: {w[cd_idx_201 * n_records + record_idx]}")Weight vector: 23,998 entries (2 CDs x 11,999 HH)
Non-zero weights: 277
Example HH weight in CD 3701: 2.5
Example HH weight in CD 201: 3.5
create_sparse_cd_stacked_dataset(
w,
demo_cds,
cd_subset=demo_cds,
dataset_path=dataset_path,
output_path=output_path,
)
Processing subset of 2 CDs: 3701, 201...
Output path: calibration_output/results.h5
Original dataset has 11,999 households
Extracted weights for 2 CDs from full weight matrix
Total active household-CD pairs: 277
Total weight in W matrix: 281
Processing CD 201 (2/2)...
Combining 2 CD DataFrames...
Total households across all CDs: 277
Combined DataFrame shape: (726, 222)
Reindexing all entity IDs using 25k ranges per CD...
Created 277 unique households across 2 CDs
Reindexing persons using 25k ranges...
Reindexing tax units...
Reindexing SPM units...
Reindexing marital units...
Reindexing families...
Final persons: 726
Final households: 277
Final tax units: 373
Final SPM units: 291
Final marital units: 586
Final families: 309
Weights in combined_df AFTER reindexing:
HH weight sum: 0.00M
Person weight sum: 0.00M
Ratio: 1.00
Overflow check:
Max person ID after reindexing: 5,025,335
Max person ID × 100: 502,533,500
int32 max: 2,147,483,647
✓ No overflow risk!
Creating Dataset from combined DataFrame...
Building simulation from Dataset...
Saving to calibration_output/results.h5...
Found 175 input variables to save
Variables saved: 218
Variables skipped: 3763
Sparse CD-stacked dataset saved successfully!
Household mapping saved to calibration_output/mappings/results_household_mapping.csv
Verifying saved file...
Final households: 277
Final persons: 726
Total population (from household weights): 281
'calibration_output/results.h5'
sim_after = Microsimulation(dataset=f"./{output_path}")
hh_after_df = pd.DataFrame(
sim_after.calculate_dataframe(
[
"household_id",
"congressional_district_geoid",
"household_weight",
"state_fips",
"snap",
]
)
)
print(f"Stacked dataset: {len(hh_after_df)} households\n")
mapping_df = pd.read_csv(
f"{output_dir}/mappings/results_household_mapping.csv"
)
example_mapping = mapping_df.loc[
mapping_df.original_household_id == example_hh_id
]
print(f"Example household (original_id={example_hh_id}) " f"in mapping:\n")
print(example_mapping.to_string(index=False))
new_ids = example_mapping.new_household_id
print(f"\nIn stacked dataset:\n")
print(
hh_after_df.loc[hh_after_df.household_id.isin(new_ids)].to_string(
index=False
)
)
Stacked dataset: 277 households
Example household (original_id=128694) in mapping:
new_household_id original_household_id congressional_district state_fips
108 128694 201 2
25097 128694 3701 37
In stacked dataset:
household_id congressional_district_geoid household_weight state_fips snap
108 201 3.5 2 23640.0
25097 3701 2.5 37 18396.0
import shutil
shutil.rmtree(output_dir)
print(f"Cleaned up {output_dir}/")Cleaned up calibration_output/
Summary
The clone-based calibration pipeline has six stages:
1. Clone + assign geography — assign_random_geography() creates N copies of each CPS record, each with a population-weighted random census block.
2. Simulate — _simulate_clone() sets each clone's state_fips and recalculates state-dependent benefits.
3. Geographic masking — state_to_cols / cd_to_cols restrict each target row to geographically relevant columns.
4. Re-randomize takeup — rerandomize_takeup() draws new takeup per census block, breaking the fixed-takeup assumption.
5. Build matrix — UnifiedMatrixBuilder.build_matrix() assembles the sparse CSR matrix from all clones.
6. Stacked datasets — create_sparse_cd_stacked_dataset() converts calibrated weights into CD-level h5 files.
For matrix diagnostics (row/column anatomy, target groups, sparsity analysis), see calibration_matrix.ipynb.