Data Integrity and the Hash Chain
Every stage of the extraction pipeline produces files that depend on the output of the previous stage. The XML produces the extraction, the extraction produces the embeddings, and the embeddings enable semantic search. But what happens if you re-download the XML, or re-extract with a different model? The downstream files become stale — they were built from data that no longer matches.
The hash chain is a simple mechanism that detects this staleness automatically. Each downstream artifact records the SHA-256 hash of the input it was built from. When you run a command that uses those artifacts, the tool recomputes the hash and compares. If they don’t match, you get a warning.
The Chain
BILLS-*.xml ──sha256──▶ metadata.json (source_xml_sha256)
│
extraction.json ──sha256──▶ bill_meta.json (extraction_sha256)
extraction.json ──sha256──▶ embeddings.json (extraction_sha256)
│
vectors.bin ──sha256──▶ embeddings.json (vectors_sha256)
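The diagram shows four links, each connecting an input to the artifact that records its hash. Three are detailed below; the fourth, from extraction.json to bill_meta.json, records extraction_sha256 just as Link 2 does.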
Link 1: Source XML → Metadata
When extraction runs, it computes the SHA-256 hash of the source XML file (BILLS-*.xml) and stores it in metadata.json:
{
  "model": "claude-opus-4-6",
  "source_xml_sha256": "a3f7b2c4e8d1..."
}
If someone re-downloads the XML (perhaps a corrected version was published), the hash in metadata.json no longer matches the file on disk. This tells you the extraction was built from a different version of the source.
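As a rough illustration of Link 1, the sketch below records the XML hash at extraction time and checks it later. The helper names (`sha256_hex`, `record_source_hash`, `source_xml_matches`) are hypothetical and not part of the tool; only the `model` and `source_xml_sha256` fields come from the example above.

```python
import hashlib
import json
from pathlib import Path


def sha256_hex(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_source_hash(xml_path: Path, metadata_path: Path, model: str) -> dict:
    """Write metadata.json with the hash of the XML it was built from.

    Hypothetical helper: the real metadata.json may hold more fields.
    """
    metadata = {
        "model": model,
        "source_xml_sha256": sha256_hex(xml_path),
    }
    metadata_path.write_text(json.dumps(metadata, indent=2))
    return metadata


def source_xml_matches(xml_path: Path, metadata_path: Path) -> bool:
    """True if the XML on disk is the version the extraction was built from."""
    recorded = json.loads(metadata_path.read_text()).get("source_xml_sha256")
    return recorded == sha256_hex(xml_path)
```

If a corrected XML is downloaded over the old one, `source_xml_matches` returns False until the bill is re-extracted and the hash is recorded again.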
Link 2: Extraction → Embeddings
When embeddings are generated, the SHA-256 hash of extraction.json is stored in embeddings.json:
{
  "schema_version": "1.0",
  "model": "text-embedding-3-large",
  "dimensions": 3072,
  "count": 2364,
  "extraction_sha256": "b5d9e1f3a7c2...",
  "vectors_file": "vectors.bin",
  "vectors_sha256": "c8f2a4b6d0e3..."
}
If you re-extract the bill (with a different model, or after a prompt improvement), the new extraction.json has a different hash than what embeddings.json recorded. The provisions may have changed — different provision count, different classifications, different text — but the embedding vectors still correspond to the old provisions.
Link 3: Vectors → Embeddings
The SHA-256 hash of vectors.bin is also stored in embeddings.json. This is an integrity check: if the binary file is corrupted, truncated, or replaced, the hash mismatch is detected.
How Staleness Detection Works
The staleness.rs module implements the checking logic. It’s called by commands that depend on embeddings — primarily search --semantic and search --similar.
What happens on every query
- The tool loads extraction.json for each bill
- If the command uses embeddings, it loads embeddings.json for each bill
- It computes the SHA-256 hash of the current extraction.json on disk
- It compares that hash to the extraction_sha256 stored in embeddings.json
- If they differ, it prints a warning to stderr
The warning
⚠ H.R. 4366: embeddings are stale (extraction.json has changed)
This warning is advisory only — it never blocks execution. The tool still runs your query, still computes cosine similarity, and still returns results. But the results may be unreliable because the provision indices in the embedding vectors may not correspond to the current provisions.
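The check and its advisory warning can be sketched in a few lines. This is a Python approximation of the behavior described above, not the contents of staleness.rs; the function names and the `label` parameter are assumptions made for illustration.

```python
import hashlib
import json
import sys
from pathlib import Path


def sha256_hex(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def embeddings_are_stale(bill_dir: Path) -> bool:
    """Compare the current extraction.json hash to the one embeddings.json recorded."""
    embeddings = json.loads((bill_dir / "embeddings.json").read_text())
    return embeddings["extraction_sha256"] != sha256_hex(bill_dir / "extraction.json")


def warn_if_stale(bill_dir: Path, label: str) -> bool:
    """Print an advisory warning to stderr and return the staleness flag.

    Staleness never raises here: the caller continues with the query.
    """
    stale = embeddings_are_stale(bill_dir)
    if stale:
        print(
            f"⚠ {label}: embeddings are stale (extraction.json has changed)",
            file=sys.stderr,
        )
    return stale
```

Because the function only returns a flag and prints to stderr, a caller can run it for every bill and still proceed with the query, which matches the advisory design.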
Why warnings don’t block
Strict enforcement (refusing to run with stale data) would be frustrating in practice. You might have re-extracted one bill out of twenty and want to run a query across all of them while you regenerate embeddings in the background. The warning tells you what’s stale; you decide whether it matters for your current task.
When Staleness Occurs
| Action | What Becomes Stale | Fix |
|---|---|---|
| Re-download XML | extraction.json (built from old XML) | Re-extract: congress-approp extract --dir <path> |
| Re-extract bill | embeddings.json + vectors.bin (built from old extraction) | Re-embed: congress-approp embed --dir <path> |
| Upgrade extraction data | embeddings.json + vectors.bin (extraction.json changed) | Re-embed: congress-approp embed --dir <path> |
| Manually edit extraction.json | embeddings.json + vectors.bin | Re-embed |
| Move files to a new machine | Nothing — hashes are content-based, not path-based | No fix needed |
| Copy bill directory | Nothing — all files move together | No fix needed |
Automatic Skip for Up-to-Date Bills
The embed command uses the hash chain to avoid unnecessary work. When you run:
congress-approp embed --dir data
For each bill, it checks:
- Does embeddings.json exist?
- Does the stored extraction_sha256 match the current SHA-256 of extraction.json?
- Does the stored vectors_sha256 match the current SHA-256 of vectors.bin?
If all three pass, the bill is skipped:
Skipping H.R. 9468: embeddings up to date
This makes it safe to run embed --dir data repeatedly — only bills with new or changed extractions are processed. The same logic applies when running embed after upgrading some bills but not others.
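The three-question skip decision might look roughly like this. It is a sketch under the assumption that the checks are exactly the three listed above; `should_skip_embedding` is a hypothetical name, and the field names are taken from the embeddings.json example earlier on this page.

```python
import hashlib
import json
from pathlib import Path


def sha256_hex(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def should_skip_embedding(bill_dir: Path) -> bool:
    """True only if embeddings.json exists and both recorded hashes still match."""
    emb_path = bill_dir / "embeddings.json"
    extraction_path = bill_dir / "extraction.json"
    # Check 1: embeddings.json must exist (and there must be something to embed).
    if not emb_path.exists() or not extraction_path.exists():
        return False

    emb = json.loads(emb_path.read_text())
    vectors_path = bill_dir / emb.get("vectors_file", "vectors.bin")
    if not vectors_path.exists():
        return False

    # Check 2: the extraction is the one the embeddings were built from.
    extraction_ok = emb.get("extraction_sha256") == sha256_hex(extraction_path)
    # Check 3: the vectors file is intact.
    vectors_ok = emb.get("vectors_sha256") == sha256_hex(vectors_path)
    return extraction_ok and vectors_ok
```

Any failed check falls through to "re-embed", so a missing or corrupted vectors.bin is repaired on the next run without special handling.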
Performance
Hash computation is fast:
| Operation | Time |
|---|---|
| SHA-256 of H.R. 9468 extraction.json (~15 KB) | <1ms |
| SHA-256 of H.R. 4366 extraction.json (~12 MB) | ~5ms |
| SHA-256 of H.R. 4366 vectors.bin (~29 MB) | ~8ms |
| Total for all example bills | ~50ms |
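At scale (20 congresses, ~60 bills), total hashing time grows with the number and size of files. Even if every bill were as large as H.R. 4366 (about 13 ms for extraction.json plus vectors.bin), 60 bills would take under a second. There is no performance reason to skip or cache hash checks.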
The tool always checks — it never caches hash results. Since the check takes milliseconds and the files are immutable in normal operation, this is the right tradeoff: simplicity and correctness over micro-optimization.
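To measure hashing cost on your own data, a small timing helper is enough. The sketch below reads the file in chunks so that a large vectors.bin is not loaded into memory at once; the function names and the 1 MiB chunk size are illustrative choices, and timings will vary by machine.

```python
import hashlib
import time
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file incrementally, reading chunk_size bytes at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def timed_sha256(path: Path):
    """Return (hex digest, elapsed milliseconds) for hashing one file."""
    start = time.perf_counter()
    digest = sha256_file(path)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return digest, elapsed_ms
```

Calling `timed_sha256(Path("data/118-hr9468/extraction.json"))` gives a figure to compare against the table above.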
What’s NOT in the Hash Chain
chunks/ directory
The chunks/ directory contains per-chunk LLM artifacts — thinking traces, raw responses, conversion reports. These are local provenance records for debugging and analysis. They are:
- Not part of the hash chain — no downstream artifact records their hashes
- Not required for any operation — all query commands work without them
- Gitignored by default — they contain model thinking content and aren’t meant for distribution
If the chunks are deleted, nothing breaks. They’re useful for understanding why the LLM classified a provision a certain way, but they’re not part of the data integrity chain.
verification.json
The verification report is regenerated by the upgrade command and could be regenerated at any time from extraction.json + BILLS-*.xml. It’s not part of the hash chain because it’s a derived artifact — you can always reproduce it from its inputs.
tokens.json
Token usage records from the extraction are informational only. They don’t affect any downstream operation and aren’t part of the hash chain.
The Immutability Model
The hash chain works because of the write-once principle: a file is never modified in place after creation, only replaced wholesale when its stage is re-run. This means:
- No concurrent modification. Two processes reading the same bill data will never see partially written files.
- No invalidation logic. There’s nothing to invalidate — files are either current (hashes match) or stale (hashes don’t match).
- No locking. Read operations don’t need to coordinate. Write operations (extract, embed, upgrade) overwrite files atomically.
The one exception is links/links.json, which is edited in place: new links are added via link accept, and existing links can be removed via link remove. Even this follows a simple consistency model: links reference provision indices in specific bill directories, and if those bills are re-extracted, the links become invalid (detectable via the hash chain).
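The text above says write operations overwrite files atomically but does not say how. One common way to get that property is to write to a temporary file in the same directory and rename it over the target. The sketch below shows that pattern as an assumption, not as the tool's actual implementation; `write_json_atomically` is a hypothetical name.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Replace `path` with new JSON so readers see the old or new file, never a partial one."""
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic replacement of the destination
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

With this pattern, a reader that hashes the file mid-write gets the hash of the complete old version, so the staleness check never sees a half-written artifact.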
Verifying Integrity Manually
You can verify the hash chain yourself using standard tools:
Check source XML against metadata
# Compute the current SHA-256 of the source XML
shasum -a 256 data/118-hr9468/BILLS-118hr9468enr.xml
# Compare to what metadata.json recorded
python3 -c "
import json
meta = json.load(open('data/118-hr9468/metadata.json'))
print(f'Recorded: {meta.get(\"source_xml_sha256\", \"NOT SET\")}')
"
Check embeddings against extraction
# Compute the current SHA-256 of extraction.json
shasum -a 256 data/118-hr9468/extraction.json
# Compare to what embeddings.json recorded
python3 -c "
import json
emb = json.load(open('data/118-hr9468/embeddings.json'))
print(f'Recorded: {emb[\"extraction_sha256\"]}')
"
Check vectors.bin integrity
# Compute the current SHA-256 of vectors.bin
shasum -a 256 data/118-hr9468/vectors.bin
# Compare to what embeddings.json recorded
python3 -c "
import json
emb = json.load(open('data/118-hr9468/embeddings.json'))
print(f'Recorded: {emb[\"vectors_sha256\"]}')
"
If all three pairs match, the data is consistent across the entire chain.
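The three manual checks can also be combined into one script that does the comparison for you. This is a convenience sketch, not a command shipped with the tool; `verify_chain` and `check` are hypothetical names, and it assumes one BILLS-*.xml file per bill directory.

```python
import hashlib
import json
from pathlib import Path


def sha256_hex(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check(recorded, path: Path):
    """Return True/False for match/mismatch, or None if the link cannot be checked."""
    if recorded is None or not path.exists():
        return None
    return recorded == sha256_hex(path)


def verify_chain(bill_dir: Path) -> dict:
    """Report the state of each link that has a recorded hash in this bill directory."""
    report = {}

    xml_files = sorted(bill_dir.glob("BILLS-*.xml"))
    meta_path = bill_dir / "metadata.json"
    if xml_files and meta_path.exists():
        meta = json.loads(meta_path.read_text())
        report["xml -> metadata.json"] = check(
            meta.get("source_xml_sha256"), xml_files[0]
        )

    emb_path = bill_dir / "embeddings.json"
    if emb_path.exists():
        emb = json.loads(emb_path.read_text())
        report["extraction.json -> embeddings.json"] = check(
            emb.get("extraction_sha256"), bill_dir / "extraction.json"
        )
        report["vectors.bin -> embeddings.json"] = check(
            emb.get("vectors_sha256"), bill_dir / "vectors.bin"
        )
    return report
```

For example, `verify_chain(Path("data/118-hr9468"))` returns a dictionary with one entry per link; a value of False marks the stale or corrupted artifact.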
Design Decisions
Why SHA-256?
SHA-256 is:
- Collision-resistant: the probability of two different files producing the same hash is astronomically small
- Fast: computing a hash takes milliseconds even for the largest files in the pipeline
- Standard: available in every language and platform via the sha2 crate in Rust, hashlib in Python, and shasum on the command line
- Deterministic: the same file always produces the same hash, regardless of when or where it’s computed
Why content-based hashing instead of timestamps?
Timestamps tell you when a file was modified, not whether its content changed. If you copy a bill directory to a new machine, the timestamps change but the content doesn’t. Content-based hashing correctly reports “no staleness” in this case.
Similarly, if you re-extract a bill and the LLM happens to produce identical output, the timestamps change but the content doesn’t. Content-based hashing correctly reports “no staleness” here too: the embeddings are still valid because the extraction didn’t actually change.
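The difference between the two approaches is easy to demonstrate. The sketch below is illustrative only: the two comparison functions are hypothetical, and they simply contrast a modification-time check with a content-hash check on a file and its copy.

```python
import hashlib
import os
import shutil
from pathlib import Path


def sha256_hex(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def differs_by_timestamp(original: Path, copy: Path) -> bool:
    """A timestamp-based check: flags any difference in modification time."""
    return os.stat(original).st_mtime_ns != os.stat(copy).st_mtime_ns


def differs_by_content(original: Path, copy: Path) -> bool:
    """A content-based check: flags a difference only if the bytes differ."""
    return sha256_hex(original) != sha256_hex(copy)


def copy_and_compare(original: Path, copy: Path) -> dict:
    """Copy a file without preserving metadata, then run both checks."""
    shutil.copyfile(original, copy)  # copyfile copies bytes only, not mtime
    return {
        "timestamp_says_changed": differs_by_timestamp(original, copy),
        "content_says_changed": differs_by_content(original, copy),
    }
```

For a file last modified some time ago, the copy gets a fresh modification time, so the timestamp check reports a change while the content check does not.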
Why warn instead of error?
Stale embeddings still produce some results — they may just not correspond perfectly to the current provisions. In practice, re-extraction often produces very similar provisions (same accounts, same amounts, slightly different wording), so stale embeddings are “mostly correct” even when technically outdated. Blocking execution would be overly strict for this use case.
The warning goes to stderr so it doesn’t interfere with stdout output (which may be piped to jq or a file).
Summary
| Component | Records Hash Of | Stored In | Checked When |
|---|---|---|---|
| Source XML hash | BILLS-*.xml | metadata.json | extract, upgrade |
| Extraction hash | extraction.json | embeddings.json | embed, search --semantic, search --similar |
| Vectors hash | vectors.bin | embeddings.json | embed, search --semantic, search --similar |
The hash chain is simple by design: a few recorded hashes, SHA-256, advisory warnings, millisecond overhead. It provides confidence that the artifacts you’re querying were built from the data you think they were built from, without imposing any operational burden.
Next Steps
- The Extraction Pipeline — the six stages that produce the artifacts in the hash chain
- Generate Embeddings — how the embed command uses the hash chain to skip up-to-date bills
- Data Directory Layout — where each file lives and what it contains