Storage substrate for microdata artifacts (design sketch)
Status. Design sketch, not accepted. Two independent reviews (general-purpose stress test + codex review) reshaped scope. Release bundles remains the authoritative doc for the user-facing certification + citation surface, and this sketch does not propose replacing that surface.
Scope (read this first)
Codex’s review caught an important conflation in an earlier draft: there are two separable systems in play, and the earlier sketch quietly merged them.
| System | Concern | Authoritative home |
|---|---|---|
| Certified release bundle | What policyengine.py 4.x users get when they cite a paper | release-bundles.md (unchanged) |
| Storage substrate | Where the artifact bytes actually live, how they’re fetched, how stale caches invalidate | This doc |
The earlier draft described a single unified system and pitched it as “replace PyPI + HF + GitHub + release manifests.” That was overreach. Release bundles are a scientific citation surface (with certification, staged promotion, compatibility rules, and formal replicability guarantees that a reviewer 5 years from now can audit). A storage substrate is an operational concern underneath. Conflating them means the certification story weakens when it should strengthen, and a separate certification layer reappears on top within 2 years (codex: 60–85% probability).
This doc is scoped to the storage layer only. Release bundles continue to own certification, citation, promotion, and the reproducibility guarantee. The storage substrate is a cleaner cache/mirror/distribution primitive beneath release bundles.
Motivating pains (and what’s actually on the critical path)
Two concrete frictions pushed this sketch forward:
- HF model-vs-dataset repo confusion (hit in #310): the refresh helper assumed huggingface.co/datasets/... paths; the actual microdata lives at huggingface.co/... (model-type repo).
- “Is our data out of date?” is a manual audit today because no automated job compares the current country-model sha256 against what the certified artifact was built with.
Neither requires the sketch below. A one-time HF repo-type migration (or a smarter URL resolver, already done in bundle._hf_dataset_sha256) fixes the first. A ~50-line CI job diffing the bundled release manifest’s certified_for_model_version against importlib.metadata.version("policyengine-us") fixes the second. Both are days of work, not weeks.
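As a rough illustration of how small the staleness check could be, here is a minimal sketch. It assumes the bundled release manifest is a JSON file with certified_for_model_version as a top-level key; that placement and both function names are assumptions for illustration, not the actual manifest schema.

```python
import json
from importlib import metadata
from pathlib import Path
from typing import Optional, Tuple


def installed_model_version(package: str = "policyengine-us") -> Optional[str]:
    """Return the installed package version, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_model_version(manifest_path: Path, installed: str) -> Tuple[bool, str]:
    """Compare the manifest's certified model version with the installed one.

    Assumes `certified_for_model_version` is a top-level key (hypothetical layout).
    """
    manifest = json.loads(manifest_path.read_text())
    certified = manifest["certified_for_model_version"]
    if certified == installed:
        return True, f"data is current for model {installed}"
    return False, f"data certified for {certified}, installed model is {installed}"
```

A CI job would call check_model_version with the result of installed_model_version() and fail the build (or open an issue) when the first element is False.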
The value proposition below is architectural, not bug-fix.
What the storage substrate would provide
Four concrete properties the current HF + PyPI + GitHub Releases pairing does not:
- Cheap schema introspection. A 2 KB sidecar manifest per artifact records the column set, dtypes, entity mapping, weight column, and row counts, so agents and tooling can learn the shape of a dataset without streaming 100 MB.
- Content-addressed cache keys. Local caches keyed on the artifact’s output-byte sha256 (not a mutable HF tag) can’t go stale after a retag. pe.us.ensure_datasets(...) always returns the bytes the release bundle pinned, or nothing.
- Operational channels. Small JSON pointers at channels/{country}/{name}.json let CI dashboards and bleeding-edge developers subscribe to updates without cutting a new policyengine.py release. These are operational aliases, not a scientific citation surface; release bundles remain the thing papers cite.
- Simpler certification mechanics. certify_data_release(country, data_producer=...) becomes “fetch producer manifest → validate → write the certified bundle data release” with no sha256 juggling.
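To make the schema-introspection property concrete, here is a sketch of what a sidecar might hold and how tooling could query it without touching the payload. The doc only fixes the categories of information (column set, dtypes, entity mapping, weight column, row counts); every field name and value below is hypothetical.

```python
import json
from typing import Dict

# Hypothetical sidecar content; field names and values are illustrative only.
EXAMPLE_SIDECAR = json.dumps({
    "schema_version": 1,
    "weight_column": "household_weight",
    "row_counts": {"household": 3, "person": 7},
    "columns": {
        "household_id": {"dtype": "int64", "entity": "household"},
        "household_weight": {"dtype": "float64", "entity": "household"},
        "age": {"dtype": "int32", "entity": "person"},
    },
})


def columns_for_entity(sidecar_json: str, entity: str) -> Dict[str, str]:
    """Return {column: dtype} for one entity, reading only the small sidecar."""
    sidecar = json.loads(sidecar_json)
    return {
        name: spec["dtype"]
        for name, spec in sidecar["columns"].items()
        if spec["entity"] == entity
    }
```

An agent asking "which person-level columns exist?" would call columns_for_entity(sidecar, "person") after one small fetch, rather than opening the full dataset.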
Notably absent from that list compared to earlier drafts: no claim of org-independent build identity, no claim of retagging-impossible certification, no claim of replacing the release bundle. Those were overreach.
Identity: output-hash, not input-hash
The earlier draft framed the primary identifier as build_id = sha256(inputs) — “two orgs rebuilding from the same recipe get the same ID without exchanging files.” Codex’s review is correct that this is weaker than it sounded:
- data_vintage: "cps_asec_2024" is a label, not a raw-bytes hash. Two orgs honestly using “CPS ASEC 2024” can have different source bytes. The current release-bundles.md schema already records raw-input hashes; a regression from that would be real.
- built_at / built_by fields in the manifest break bitwise identity across org rebuilds even when the payload is identical.
- Genuine bit-level determinism across orgs (libc, CPU microcode, torch seeds, dict iteration, pandas groupby order) is a multi-month project, not a v1 flag.
The revised proposal: the primary identifier is artifact_sha256 = sha256 of the output bytes. Input digest is recorded in the manifest as a derived queryable field (inputs.composite_digest), not the primary key. That matches how OCI/Nix work in the parts that actually deliver: content-addressed at the output, with provenance recorded alongside.
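The split between primary key and derived provenance field can be sketched as follows. The names artifact_sha256 and inputs.composite_digest come from this doc; the way the composite digest is built (sha256 over canonical JSON of the raw-input hashes) and the raw_hashes field are assumptions for illustration, not a settled spec.

```python
import hashlib
import json
from typing import Dict


def artifact_sha256(payload: bytes) -> str:
    """Primary identifier: sha256 of the output bytes."""
    return hashlib.sha256(payload).hexdigest()


def composite_digest(input_hashes: Dict[str, str]) -> str:
    """Derived, queryable field: one digest over the raw-input hashes.

    Canonical JSON (sorted keys, fixed separators) keeps the result
    independent of dict ordering. This construction is an assumption.
    """
    canonical = json.dumps(input_hashes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(payload: bytes, input_hashes: Dict[str, str]) -> dict:
    """Assemble a manifest keyed on output bytes, with provenance alongside."""
    return {
        "artifact_sha256": artifact_sha256(payload),
        "inputs": {
            "raw_hashes": dict(input_hashes),  # hypothetical field name
            "composite_digest": composite_digest(input_hashes),
        },
    }
```

Two builds from identical inputs that differ by a single output byte get different artifact_sha256 values but share one inputs.composite_digest, so "everything built from these inputs" stays a cheap query.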
Storage layout becomes:

    s3://policyengine-data/
      {country}/
        {artifact_sha256_prefix}/{artifact_sha256}.parquet
        {artifact_sha256_prefix}/{artifact_sha256}.manifest.json
      channels/
        {country}/
          latest.json   # { "artifact_sha256": "…" }
          next.json     # staging; feeds into release-bundles promotion

The channels/ tree intentionally drops stable and lts-* — those carry semantics (“what should researchers treat as authoritative?”) that belong to the release bundle, not the storage substrate. At the storage layer we only need “operationally newest” (latest) and “nominated-for-certification” (next).
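A small sketch of how object keys could be derived from that layout. The layout does not say how long {artifact_sha256_prefix} is; the two-character default below is an assumption, as are the function names.

```python
import re
from typing import Dict

_SHA256_HEX = re.compile(r"[0-9a-f]{64}")
_CHANNELS = ("latest", "next")  # the only channels the storage layer defines


def object_keys(country: str, artifact_sha256: str, prefix_len: int = 2) -> Dict[str, str]:
    """Return bucket-relative keys for an artifact's payload and manifest.

    prefix_len=2 is an assumed sharding width, not part of the sketch.
    """
    if not _SHA256_HEX.fullmatch(artifact_sha256):
        raise ValueError("expected a 64-character lowercase hex sha256")
    base = f"{country}/{artifact_sha256[:prefix_len]}/{artifact_sha256}"
    return {"payload": f"{base}.parquet", "manifest": f"{base}.manifest.json"}


def channel_key(country: str, channel: str) -> str:
    """Return the bucket-relative key of a channel pointer."""
    if channel not in _CHANNELS:
        raise ValueError(f"storage layer has no channel named {channel!r}")
    return f"channels/{country}/{channel}.json"
```

Rejecting names such as "stable" at this level keeps the narrow channel vocabulary enforceable in code rather than by convention.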
Channel semantics (deliberately narrow)
| Channel | Purpose | Updated by |
|---|---|---|
| latest | Output of the most recent successful CI build. May be broken, uncalibrated, experimental. | CI on every policyengine-{country}-data main-branch merge |
| next | Staging artifact that has passed validation and is nominated for the next release bundle. | Manual promotion from latest after review |
“Certified” / “stable” stay on the release-bundle side. This avoids the codex failure mode where “stable” silently means four different things to four different audiences.
The release-bundle boundary (what doesn’t change)
release-bundles.md remains authoritative:
- The certification process (who signs off, what validations, what compatibility checks): unchanged.
- src/policyengine/data/bundle/manifest.json remains the source record of what a given policyengine.py release guarantees and is packaged directly into the wheel.
- The staged provisional → certified → retired lifecycle: unchanged.
- *.trace.tro.jsonld sidecars: unchanged (shorter to build because inputs are already in the storage manifest, but the emitted TRO has the same shape and trov: / pe: fields).
- The replicability guarantee wording: unchanged.
The storage substrate is an implementation detail that the certification process pulls from. When a release bundle is certified, it promotes an artifact from channels/next to a concrete artifact_sha256 pin in the country release manifest. After that, the release manifest is what papers cite; the storage channel is just the cache.
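The promotion step amounts to copying a value out of a mutable pointer into an immutable record. A minimal sketch, assuming the channel pointer is the { "artifact_sha256": … } JSON shown in the layout and that the release manifest keeps pins under a datasets mapping (that mapping and the function name are hypothetical; signature validation of the pointer is omitted here):

```python
import copy


def pin_from_channel(release_manifest: dict, channel_pointer: dict, dataset: str) -> dict:
    """Return a copy of the release manifest with `dataset` pinned to the
    artifact the channel pointer names at the moment of certification."""
    sha = channel_pointer["artifact_sha256"]
    pinned = copy.deepcopy(release_manifest)
    pinned.setdefault("datasets", {})[dataset] = {"artifact_sha256": sha}
    return pinned
```

Because the pin is copied by value, later movement of channels/next has no effect on an already-certified release manifest.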
Consumer resolver (what pe.py changes)
Minimal. The existing pe.us.ensure_datasets takes a URI today:

    pe.us.ensure_datasets(
        datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
        years=[2026],
    )

Under the substrate, the URI scheme gains a new prefix:

    # The release manifest pins a specific artifact:
    pe.us.ensure_datasets(
        datasets=["pe-data://us/enhanced_cps_2024@sha256:4e92b340…"],
        years=[2026],
    )

    # A developer asking for operational newest:
    pe.us.ensure_datasets(
        datasets=["pe-data://us/enhanced_cps_2024@latest"],  # resolves via channel
        years=[2026],
    )

The HF scheme stays supported indefinitely for backward compatibility; the substrate is additive. The local cache is keyed on artifact_sha256.
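One way the resolver might split a pe-data:// URI into a pinned hash or a channel name is sketched below. The grammar is inferred from the two examples in this section only; the DataRef type, the function name, and the decision to accept next as a consumer-facing reference are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional

_PE_DATA = re.compile(
    r"^pe-data://(?P<country>[a-z]+)/(?P<name>[A-Za-z0-9_.-]+)@(?P<ref>.+)$"
)


@dataclass(frozen=True)
class DataRef:
    country: str
    name: str
    sha256: Optional[str]   # set when the URI pins an artifact
    channel: Optional[str]  # set when the URI names a channel


def parse_pe_data_uri(uri: str) -> DataRef:
    """Parse a pe-data:// URI into either a pinned hash or a channel reference."""
    match = _PE_DATA.match(uri)
    if match is None:
        raise ValueError(f"not a pe-data:// URI: {uri!r}")
    ref = match["ref"]
    if ref.startswith("sha256:"):
        digest = ref[len("sha256:"):]
        if not re.fullmatch(r"[0-9a-f]{64}", digest):
            raise ValueError("sha256 pin must be 64 lowercase hex characters")
        return DataRef(match["country"], match["name"], digest, None)
    if ref in ("latest", "next"):
        return DataRef(match["country"], match["name"], None, ref)
    raise ValueError(f"unknown reference: {ref!r}")
```

A pinned reference can be served straight from the local cache; a channel reference needs one extra fetch of the channel pointer before the cache lookup.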
Unresolved risks (carried forward from prior reviews)
UK Data Service audit trail
Today HF logs who pulls a private-gated tag via auth token. Under the sketch, auth happens at the bucket but identity resolves through two hops (channel JSON → manifest + payload), and object-store access logs record GETs against opaque artifact_sha256 paths, not “user X downloaded UK enhanced FRS derived from FRS 2023-24”.
Decision needed: gate the manifest fetch, not just the payload, so the resolver hit is auditable; maintain a per-country access log keyed on manifest content. Without this, UK support regresses vs. today.
Silent-promote attack
An adversary with bucket write access rewrites channels/us/latest.json to point at an artifact with a manifest that has a quietly wrong inputs.composite_digest. sha256 verification only protects payload-vs-manifest integrity; it doesn’t authenticate the channel pointer itself. Today’s PyPI/HF platforms have account auth auditing on tag pushes that the sketch does not.
Decision needed: before any release-bundle certification can pull from a channel, the channel JSON must be signed with a key pinned in pe.py. Channels can stay unsigned for the operational-latest use case (the certification step verifies), but the nomination → certification boundary must validate a signature.
Non-deterministic builds (storage-layer version)
With output-hash identity (not input-hash), two CI runs from the same inputs producing slightly different bytes produce two different artifact_sha256 values. They don’t collide. This is actually cleaner than the recipe-addressed framing: the storage layer doesn’t need to promise determinism. The release-bundle certification step is where determinism matters, and it’s already responsible for picking one specific artifact to pin.
Licence revocation vs “immutable forever”
The earlier sketch called storage “immutable forever.” In practice, Census / ONS / DWP can yank redistribution rights, and we must be able to respond. The storage substrate must support tombstoning: an artifact_sha256 resolves to a manifest with status: "revoked" and no payload. Release bundles that pinned the revoked artifact get marked as “unreproducible: licence revoked” in the certification registry (a new release-bundle concept, not this sketch’s).
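Tombstoning could look roughly like this at the manifest level. Only status: "revoked" comes from this doc; payload_key, revocation_reason, and the exception and function names are hypothetical.

```python
class ArtifactRevoked(Exception):
    """Raised when a manifest resolves to a tombstone instead of a payload."""


def tombstone(manifest: dict, reason: str) -> dict:
    """Return the manifest a revoked artifact would resolve to:
    provenance is kept, the payload pointer is dropped."""
    kept = {key: value for key, value in manifest.items() if key != "payload_key"}
    kept["status"] = "revoked"
    kept["revocation_reason"] = reason  # hypothetical field
    return kept


def resolve_payload_key(manifest: dict) -> str:
    """Return the payload location, or fail loudly for a revoked artifact."""
    if manifest.get("status") == "revoked":
        raise ArtifactRevoked(
            f"artifact {manifest.get('artifact_sha256')} is revoked: "
            f"{manifest.get('revocation_reason', 'no reason recorded')}"
        )
    return manifest["payload_key"]
```

Keeping the manifest in place means the hash still resolves, so a consumer sees an explicit revocation rather than an ambiguous missing object.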
Cross-cloud replication
Payloads mirror trivially — sha256 verifies. Channels don’t (a single authoritative URL). An EU partner wanting their own mirror runs their own channel namespace. The substrate should not promise one-click cross-cloud channels; it should promise one-click payload mirrors and make channel namespacing explicit.
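The "payloads mirror trivially" half can be sketched with the local filesystem standing in for both buckets. The function names and the two-character shard directory are assumptions; a real mirror would use the object store's own client.

```python
import hashlib
import shutil
from pathlib import Path


def verify_payload(path: Path, expected_sha256: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file through sha256 and compare with the expected digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256


def mirror_payload(source: Path, mirror_root: Path, expected_sha256: str) -> Path:
    """Copy a payload into a mirror and keep it only if the bytes verify."""
    target = mirror_root / expected_sha256[:2] / f"{expected_sha256}.parquet"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    if not verify_payload(target, expected_sha256):
        target.unlink()  # never leave unverified bytes under a hash-named key
        raise ValueError("mirrored bytes do not match the expected sha256")
    return target
```

Nothing here needs to trust the source bucket: the expected hash comes from the release manifest, so any mirror that verifies is as good as the original. Channel pointers have no equivalent check, which is why they stay namespaced per operator.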
Relationship to release-bundles.md (and what stays load-bearing)
The earlier draft’s central claim was “this could replace release bundles.” It cannot, and it shouldn’t try. Release bundles carry the certification contract with external stakeholders:
- The UK Data Service licence negotiation hangs off the fact that policyengine.py releases are the thing that’s audited, reviewed, and approved. The storage substrate changes the mechanism of how bytes reach users, not the contract about what’s been certified.
- Academic replication reviewers need something at citation time that is vN.M.P-shaped, not sha256:…-shaped. Release bundle versions fill that role.
- “Is this the PolicyEngine release?” has a legal/regulatory answer that release bundles track. The storage substrate does not attempt that.
The load-bearing sentence of this sketch: “When a release bundle is certified, it promotes an artifact from channels/next to a concrete artifact_sha256 pin in the country release manifest. After that, the release manifest is what papers cite; the storage channel is just the cache.”
What this fixes that today’s HF/PyPI pairing doesn’t
Narrow list, honestly scoped:
| Pain | Today | Under substrate |
|---|---|---|
| “Did a retag silently change the artifact?” | Possible, HF tags are mutable | Impossible: cache keyed on output sha256 |
| “What’s the schema of enhanced_cps_2024?” | Download 100 MB, open with h5py | Fetch 2 KB manifest |
| “Where’s model-vs-dataset repo type?” | Tripped up bundle._hf_dataset_sha256 (#310) | No such distinction |
| “How does an EU partner mirror the payload bytes?” | Coordinate with HF, PyPI, release cadence | Re-upload bytes to their bucket; sha256 verifies |
What it doesn’t fix, which the earlier draft overclaimed:
- “Bump stable to the newest data”: that’s a release-bundle certification decision and stays manual.
- “Reproduce a paper from 5 years ago”: depends on the release bundle being preserved, which is a release-bundle concern.
- “Two orgs can independently produce the same build”: bit-level determinism is out of scope; the substrate just doesn’t pretend otherwise.
Migration cost (realistic)
Revised after the stress tests:
| Work item | Estimate |
|---|---|
| Bucket + manifest schema + one US build end-to-end | 1–2 weeks |
| Consumer resolver in pe.py (pe-data:// URI scheme, cache, sha256 verify) | 1 week |
| UK gating with auditable manifest hits | 1–2 weeks |
| Channel signing + trust-root rollover story (for next → certification) | 2–3 weeks |
| Tombstone + release-bundle “unreproducible” state | 1–2 weeks |
| Retire the legacy HF resolver path (after 2–3 pe.py releases) | 1 week |
Total: 7–11 engineer-weeks. With two engineers + agents running in parallel on independent tracks, ~5–7 calendar weeks is realistic. Not a v4.x stretch; candidate for v5 if pursued at all.
Whether to pursue
Honest read from both stress tests combined:
- Keep the storage substrate idea. Output-hash-addressed storage + a schema-sidecar manifest + a pe-data:// URI scheme is a real improvement over HF for our use case, independent of everything else.
- Drop the “replace release bundles” framing entirely. That was the codex review’s main correction, and it holds.
- Don’t build it to fix #310 or “is our data stale?” Both have cheap, targeted fixes already within reach.
- If the UK Data Service relationship is going to get stricter (an external trigger, not an internal one), revisit. A substrate with first-class audit is defensible in a way that the current HF private-repo setup is not.
Open questions (narrowed)
- Object store: GCS or S3? (Lean: GCS — build pipelines already run on GCP.)
- Payload format for new builds: parquet or HDF5? (Lean: parquet for new, keep HDF5 for the Enhanced CPS legacy until consumers migrate.)
- Should the manifest schema have a schema_version and a formal migration policy? (Lean: yes, borrow from PEP 621 style pragmatic evolution.)