The Extraction Pipeline
A bill flows through six stages (one of them optional) on its way from raw XML on Congress.gov to queryable, verified, searchable data on your machine. Every stage that writes to disk produces immutable files: once a stage completes for a bill, its output is never modified unless you deliberately re-extract or upgrade.
This chapter explains each stage in detail: what it does, what it produces, and why it’s designed the way it is.
Pipeline Overview
                      ┌──────────┐
    Congress.gov ───▶ │ Download │ ───▶ BILLS-*.xml
                      └──────────┘
                           │
                      ┌──────────┐
                      │  Parse   │ ───▶ clean text + chunk boundaries
                      │  + XML   │
                      └──────────┘
                           │
                      ┌──────────┐
    Anthropic API ◀── │ Extract  │ ───▶ extraction.json + verification.json
                      │  (LLM)   │      metadata.json + tokens.json + chunks/
                      └──────────┘
                           │
                      ┌──────────┐
                      │  Enrich  │ ───▶ bill_meta.json (offline, no API)
                      │(optional)│
                      └──────────┘
                           │
                      ┌──────────┐
    OpenAI API ◀───── │  Embed   │ ───▶ embeddings.json + vectors.bin
                      └──────────┘
                           │
                      ┌──────────┐
                      │  Query   │ ───▶ search, compare, summary, audit, relate
                      └──────────┘
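Only Stage 3 (Extract) and Stage 4 (Embed) call model APIs. Download makes network requests to the Congress.gov API, and search --semantic makes one embedding call per query. Everything else (parsing, enrichment, verification, linking, querying) runs locally and deterministically.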
Stage 1: Download
The download command fetches enrolled bill XML from the Congress.gov API.
What “enrolled” means: When a bill passes both the House and Senate in identical form and is sent to the President for signature, that final text is the “enrolled” version. Once signed, it becomes law. This is the authoritative text — the version that actually governs how money is spent.
What the XML looks like: Congressional bill XML uses semantic markup defined by the Government Publishing Office (GPO). Tags like <division>, <title>, <section>, <appropriations-major>, <appropriations-small>, <quote>, and <proviso> describe the legislative structure, not just formatting. This semantic markup is what makes reliable parsing possible — you can identify account name headings, dollar amounts, proviso clauses, and structural boundaries directly from the XML tree.
What gets created:
data/118/hr/9468/
└── BILLS-118hr9468enr.xml ← Enrolled bill XML from Congress.gov
Requires: CONGRESS_API_KEY (free from api.congress.gov)
No transformation is applied. The XML is saved exactly as received from Congress.gov.
Stage 2: Parse
Parsing happens at the beginning of the extract command — it’s not a separate CLI step. The xml.rs module reads the bill XML using roxmltree (a pure-Rust XML parser with no C dependencies) and produces two things:
Clean text extraction
The parser walks the XML tree and extracts human-readable text with two important conventions:
- Quote delimiters: Account names in bill XML are wrapped in <quote> tags. The parser renders these as ''Account Name'' (double single-quotes) to match the format the LLM system prompt expects. For example, <quote>Compensation and Pensions</quote> becomes ''Compensation and Pensions''.
- Structural markers: Division headers, title headers, and section numbers are preserved in the clean text so the LLM can identify structural boundaries.
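As a rough illustration of the quote-delimiter convention, here is a minimal sketch that scans an XML fragment as a plain string, drops every tag, and emits '' wherever a <quote> tag opens or closes. The function name clean_text is hypothetical, and the string scanning is a simplification: the actual parser walks a roxmltree document and also preserves structural markers.

```rust
/// Strip tags from an XML fragment, rendering <quote>…</quote> as ''…''.
/// Simplified stand-in: scans the string instead of walking an XML tree.
fn clean_text(fragment: &str) -> String {
    let mut out = String::new();
    let mut rest = fragment;
    while let Some(open) = rest.find('<') {
        // Copy the text that precedes the tag.
        out.push_str(&rest[..open]);
        let Some(close) = rest[open..].find('>') else {
            // Unterminated tag: keep the remainder verbatim.
            out.push_str(&rest[open..]);
            return out;
        };
        let tag = &rest[open + 1..open + close];
        if tag == "quote" || tag == "/quote" {
            out.push_str("''");
        }
        rest = &rest[open + close + 1..];
    }
    out.push_str(rest);
    out
}
```

Calling clean_text on `<quote>Compensation and Pensions</quote>` yields the delimited account name shown in the example above.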
Chunk boundaries
Large bills need to be split into smaller pieces for the LLM — you can’t send a 1,500-page omnibus as a single prompt. The parser identifies semantic chunk boundaries by walking the XML tree structure:
- Primary splits: At <division> boundaries (Division A, Division B, etc.)
- Secondary splits: At <title> boundaries within each division
- Tertiary splits: If a single title or division still exceeds the maximum chunk token limit (~3,000 tokens), it’s further split at paragraph boundaries
This is semantic chunking, not arbitrary token-limit splitting. Whenever a title or division fits under the limit, the chunk contains that complete legislative unit, so the LLM sees full context. This matters because provisions often reference “the amount made available under this heading” or “the previous paragraph,” and the LLM needs to see those references in context.
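The tertiary split can be sketched as a greedy packing of paragraphs. This is a minimal sketch under stated assumptions: paragraphs are separated by blank lines, tokens are estimated at roughly four characters each, and the names estimate_tokens and split_at_paragraphs are hypothetical rather than taken from xml.rs.

```rust
/// Rough token estimate: about four characters per token (an assumption).
fn estimate_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4
}

/// Greedily pack paragraphs into chunks that stay at or under `max_tokens`.
/// A single paragraph larger than the limit becomes its own chunk.
fn split_at_paragraphs(section: &str, max_tokens: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for para in section.split("\n\n").map(str::trim).filter(|p| !p.is_empty()) {
        if current.is_empty() {
            current.push_str(para);
            continue;
        }
        // Would adding this paragraph push the chunk over the limit?
        let candidate = format!("{current}\n\n{para}");
        if estimate_tokens(&candidate) > max_tokens {
            // Close the current chunk and start a new one with this paragraph.
            chunks.push(std::mem::replace(&mut current, para.to_string()));
        } else {
            current = candidate;
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

A title that already fits under the limit comes back as a single chunk, which is the common case the division and title splits aim for.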
Chunk counts for the example data:
| Bill | XML Size | Chunks |
|---|---|---|
| H.R. 9468 (supplemental) | 9 KB | 1 |
| H.R. 5860 (CR) | 131 KB | 5 |
| H.R. 4366 (omnibus) | 1.8 MB | 75 |
No files are written. The clean text and chunk boundaries exist only in memory, passed directly to the extraction stage.
No API calls. Pure Rust computation.
Stage 3: Extract
This is the core stage — the only one that uses an LLM. Each chunk of bill text is sent to Claude with a detailed system prompt (~300 lines) that defines every provision type, shows real JSON examples, constrains the output format, and includes specific instructions for edge cases. The LLM reads the actual legislative language and produces structured JSON — there is no intermediate regex extraction step.
The system prompt
The system prompt (defined in prompts.rs) is the instruction manual for the LLM. It covers:
- Reading instructions: How to interpret ''Account Name'' delimiters, dollar amounts, “Provided, That” provisos, “notwithstanding” clauses, and section numbering
- Bill type guidance: How regular appropriations, continuing resolutions, omnibus bills, and supplementals differ
- Provision type definitions: All 11 types (appropriation, rescission, transfer_authority, limitation, directed_spending, cr_substitution, mandatory_spending_extension, directive, rider, continuing_resolution_baseline, other) with examples
- Detail level rules: When to classify a provision as top_level, line_item, sub_allocation, or proviso_amount
- Sub-allocation semantics: Explicit instructions that “of which $X shall be for…” breakdowns are reference_amount, not new_budget_authority
- CR substitution requirements: Both the new and old amounts must be extracted with dollar values, semantics, and text_as_written
- Output format: The exact JSON schema the LLM must produce
The prompt is sent with cache_control enabled, so subsequent chunks within the same bill benefit from prompt caching — the system prompt tokens are served from cache rather than re-processed, reducing both latency and cost.
Parallel chunk processing
Chunks are extracted in parallel using bounded concurrency (default 5 simultaneous LLM calls, configurable via --parallel). A progress dashboard shows real-time status:
5/42, 187 provs [4m 23s] 842 tok/s | 📝A-IIb ~8K 180/s | 🤔B-I ~3K | 📝B-III ~1K 95/s
Each chunk produces a JSON array of provisions. The LLM’s response is captured along with its “thinking” content (internal reasoning) and saved to the chunks/ directory as a permanent provenance record.
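Bounded concurrency can be illustrated without any async runtime. The sketch below deliberately swaps in std scoped threads for the tokio tasks the real extractor uses, and the function name process_chunks is hypothetical. Each worker claims the next unprocessed chunk index, so at most `parallel` calls are in flight, and results come back in chunk order.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Run `work` over every chunk with at most `parallel` workers and return
/// the results in chunk order. Stand-in for the async version: std threads
/// instead of tokio tasks.
fn process_chunks<T, R, F>(chunks: &[T], parallel: usize, work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let next = AtomicUsize::new(0);
    let results: Mutex<Vec<Option<R>>> =
        Mutex::new((0..chunks.len()).map(|_| None).collect());
    thread::scope(|scope| {
        for _ in 0..parallel.max(1).min(chunks.len()) {
            scope.spawn(|| loop {
                // Claim the next unprocessed chunk index.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= chunks.len() {
                    break;
                }
                let result = work(&chunks[i]);
                results.lock().unwrap()[i] = Some(result);
            });
        }
    });
    results
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|r| r.expect("every chunk index is claimed exactly once"))
        .collect()
}
```

Storing each result at its chunk index is what lets the later merge step order provisions by chunk sequence regardless of which call finishes first.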
Resilient JSON parsing
The LLM doesn’t always produce perfect JSON. Missing fields, wrong types, unexpected enum values, extra fields — all of these can occur. The from_value.rs module handles this with a resilient parsing strategy:
- Missing fields get defaults (empty string, null, empty array)
- Wrong types are coerced where possible (string "$10,000,000" → integer 10000000)
- Unknown provision types become Provision::Other with the LLM’s original classification preserved
- Extra fields on known types are silently ignored
- Failed provisions are logged but don’t abort the extraction
Every compromise is counted in a ConversionReport — you can see exactly how many null-to-default conversions, type coercions, and unknown types occurred.
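To make the coercion rule concrete, here is a minimal sketch of turning a dollar string into an integer while counting the compromise. The struct fields and the function name coerce_dollars are illustrative assumptions; the real ConversionReport in from_value.rs tracks more categories.

```rust
/// Counts of the compromises made while parsing (field names are illustrative).
#[derive(Debug, Default, PartialEq)]
struct ConversionReport {
    type_coercions: usize,
    failures: usize,
}

/// Coerce a dollar string such as "$10,000,000" into an integer,
/// recording the outcome in the report.
fn coerce_dollars(raw: &str, report: &mut ConversionReport) -> Option<i64> {
    // Drop the leading dollar sign and the thousands separators.
    let digits: String = raw
        .trim()
        .trim_start_matches('$')
        .chars()
        .filter(|c| *c != ',')
        .collect();
    match digits.parse::<i64>() {
        Ok(value) => {
            report.type_coercions += 1;
            Some(value)
        }
        Err(_) => {
            // Logged and counted, but the extraction keeps going.
            report.failures += 1;
            None
        }
    }
}
```

Returning None instead of an error mirrors the principle that a single failed provision should not abort the whole extraction.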
Merge and compute
After all chunks complete:
- Provisions are merged into a single flat array, ordered by chunk sequence
- Budget authority totals are computed from the individual provisions by summing new_budget_authority provisions at top_level and line_item detail levels. The LLM also produces a summary with totals, but these are never used for computation, only for diagnostics. This design means a bug in the LLM’s arithmetic can’t corrupt budget totals.
- Chunk provenance is recorded: the chunk_map field in extraction.json links each provision back to the chunk it came from
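The budget authority rule above can be sketched as a filtered sum. The types here are simplified assumptions (the real provision model has many more fields and semantics values); only the filtering logic is taken from the description.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Semantics {
    NewBudgetAuthority,
    ReferenceAmount,
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum DetailLevel {
    TopLevel,
    LineItem,
    SubAllocation,
    ProvisoAmount,
}

/// A pared-down provision: just the fields the total depends on.
#[allow(dead_code)]
struct Provision {
    account: &'static str,
    dollars: i64,
    semantics: Semantics,
    detail_level: DetailLevel,
}

/// Sum new budget authority at top_level and line_item detail only, so
/// "of which" sub-allocations and proviso amounts are not double counted.
fn total_budget_authority(provisions: &[Provision]) -> i64 {
    provisions
        .iter()
        .filter(|p| p.semantics == Semantics::NewBudgetAuthority)
        .filter(|p| matches!(p.detail_level, DetailLevel::TopLevel | DetailLevel::LineItem))
        .map(|p| p.dollars)
        .sum()
}
```

Because the total is derived only from individual provisions, a wrong figure in the LLM's own summary never enters the calculation.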
Deterministic verification
Verification runs immediately after extraction, with no LLM involvement. It answers three questions:
- “Are the dollar amounts real?” For every provision with a text_as_written dollar string (e.g., "$2,285,513,000"), search for that exact string in the source bill text. Result: verified (found once), ambiguous (found multiple times), or not_found.
- “Is the quoted text actually from the bill?” For every provision’s raw_text excerpt, check if it’s a substring of the source text using tiered matching:
  - Exact: Byte-identical substring (95.6% of provisions in example data)
  - Normalized: Matches after collapsing whitespace and normalizing Unicode quotes/dashes (2.8%)
  - Spaceless: Matches after removing all spaces (0%)
  - No match: Not found at any tier (1.5%, all non-dollar statutory amendments)
- “Did we miss anything?” Count every dollar-sign pattern in the source text and check how many are accounted for by extracted provisions. This produces the coverage percentage.
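The first two checks can be sketched with plain string operations. This is a minimal illustration, not the project's verification code: the enum and function names are hypothetical, and the exact set of characters folded during normalization is an assumption.

```rust
#[derive(Debug, PartialEq)]
enum AmountStatus {
    Verified,
    Ambiguous,
    NotFound,
}

/// Count exact occurrences of the dollar string in the source text.
fn check_amount(source: &str, text_as_written: &str) -> AmountStatus {
    if text_as_written.is_empty() {
        return AmountStatus::NotFound;
    }
    match source.matches(text_as_written).count() {
        0 => AmountStatus::NotFound,
        1 => AmountStatus::Verified,
        _ => AmountStatus::Ambiguous,
    }
}

#[derive(Debug, PartialEq)]
enum MatchTier {
    Exact,
    Normalized,
    Spaceless,
    NoMatch,
}

/// Fold curly quotes and dashes to ASCII and collapse whitespace runs.
/// The character set handled here is an assumption.
fn normalize(text: &str) -> String {
    let mapped: String = text
        .chars()
        .map(|c| match c {
            '\u{2018}' | '\u{2019}' => '\'',
            '\u{201C}' | '\u{201D}' => '"',
            '\u{2013}' | '\u{2014}' => '-',
            other => other,
        })
        .collect();
    mapped.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn strip_spaces(text: &str) -> String {
    text.chars().filter(|c| !c.is_whitespace()).collect()
}

/// Try each tier in order, from strictest to loosest.
fn match_tier(source: &str, raw_text: &str) -> MatchTier {
    if source.contains(raw_text) {
        MatchTier::Exact
    } else if normalize(source).contains(&normalize(raw_text)) {
        MatchTier::Normalized
    } else if strip_spaces(&normalize(source)).contains(&strip_spaces(&normalize(raw_text))) {
        MatchTier::Spaceless
    } else {
        MatchTier::NoMatch
    }
}
```

Trying the strictest tier first means a provision is always reported at the highest level of fidelity it actually achieves.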
See How Verification Works for the complete technical details.
What gets created
data/118/hr/9468/
├── BILLS-118hr9468enr.xml ← Source XML (unchanged)
├── extraction.json ← All provisions, bill info, summary, chunk map
├── verification.json ← Amount checks, raw text checks, completeness
├── metadata.json ← Model name, prompt version, timestamps, source hash
├── tokens.json ← Input/output/cache token counts per chunk
└── chunks/ ← Per-chunk LLM artifacts (gitignored)
    ├── 01JRWN9T5RR0JTQ6C9FYYE96A8.json
    └── ...
Requires: ANTHROPIC_API_KEY
Stage 3.5: Enrich (Optional)
The enrich command generates bill-level metadata by parsing the source XML structure and analyzing the already-extracted provisions. It bridges the gap between raw extraction and informed querying — adding structural knowledge that the LLM extraction doesn’t capture.
Why this stage exists: The LLM extracts provisions faithfully — every dollar amount, every account name, every section reference. But it doesn’t know that Division A in H.R. 7148 covers Defense while Division A in H.R. 6938 covers CJS. It doesn’t know that “shall become available on October 1, 2024” in a FY2024 bill means the money is for FY2025 (an advance appropriation). It doesn’t know that “Grants-In-Aid for Airports” and “Grants-in-Aid for Airports” are the same account. The enrich command adds this structural and normalization knowledge.
What it does:
- Parses division titles from XML. The enrolled bill XML contains <division><enum>A</enum><header>Department of Defense Appropriations Act, 2026</header> elements. The enrich command extracts each division’s letter and title, then classifies the title to a jurisdiction using case-insensitive pattern matching against known subcommittee names.
- Classifies advance vs current-year. For each budget authority provision, the command checks the availability field and raw_text for “October 1, YYYY” or “first quarter of fiscal year YYYY” patterns. It compares the referenced year to the bill’s fiscal year: if the money becomes available after the bill’s FY ends, it’s advance.
- Normalizes account names. Each account name is lowercased and stripped of hierarchical em-dash prefixes (e.g., “Department of VA—Compensation and Pensions” → “compensation and pensions”) for cross-bill matching.
- Classifies bill nature. The provision type distribution and subcommittee count determine whether the bill is an omnibus (5+ subcommittees), minibus (2-4), full-year CR with appropriations (CR baseline + hundreds of regular appropriations), or other type.
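Three of these rules are simple enough to sketch directly. The function names below are hypothetical, the advance check covers only the “October 1, YYYY” pattern, and the bill-nature sketch uses the subcommittee count alone (the CR variants, which depend on the provision type distribution, are left out).

```rust
/// Lowercase the name and keep only the part after the last em dash.
fn normalize_account_name(name: &str) -> String {
    let leaf = name.rsplit('\u{2014}').next().unwrap_or(name);
    leaf.trim().to_lowercase()
}

/// Fiscal year N ends on September 30 of calendar year N, so money that
/// becomes available on October 1 of year Y belongs to fiscal year Y + 1.
fn is_advance(bill_fiscal_year: u32, october_1_year: u32) -> bool {
    let funded_fiscal_year = october_1_year + 1;
    funded_fiscal_year > bill_fiscal_year
}

#[derive(Debug, PartialEq)]
enum BillNature {
    Omnibus,
    Minibus,
    Other,
}

/// Classify by subcommittee count only: 5+ is an omnibus, 2-4 a minibus.
fn nature_from_subcommittees(count: usize) -> BillNature {
    match count {
        0 | 1 => BillNature::Other,
        2..=4 => BillNature::Minibus,
        _ => BillNature::Omnibus,
    }
}
```

With this normalization, “Grants-In-Aid for Airports” and “Grants-in-Aid for Airports” collapse to the same key, which is what makes cross-bill matching possible.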
Input: extraction.json + BILLS-*.xml
Output: bill_meta.json
Requires: Nothing — no API keys, no network access.
This stage is optional. All commands from v3.x continue to work without it. It is required for --subcommittee filtering, --show-advance display, and enriched bill classification display. See Enrich Bills with Metadata for a complete guide.
Stage 4: Embed
The embed command generates semantic embedding vectors for every provision using OpenAI’s text-embedding-3-large model. This is the foundation for meaning-based search and cross-bill matching.
How provision text is built
Each provision is represented as a concatenation of its meaningful fields:
Account: Child Nutrition Programs | Agency: Department of Agriculture | Text: For necessary expenses of the Food and Nutrition Service...
This construction is deterministic — the same provision always produces the same embedding text, computed by query::build_embedding_text(). The exact fields included depend on the provision type:
- Appropriations/Rescissions: Account name, agency, program, raw text
- CR Substitutions: Account name, reference act, reference section, raw text
- Directives/Riders: Description, raw text
- Other types: Description or LLM classification, raw text
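A minimal sketch of the construction for the appropriation case follows. The function name embedding_text_sketch is hypothetical (it is not the real query::build_embedding_text), the program field is omitted for brevity, and skipping absent fields is an assumption about how missing values are handled.

```rust
/// Join the labeled fields with " | ". Fields that are None are skipped
/// (an assumption); the raw text is always included.
fn embedding_text_sketch(account: Option<&str>, agency: Option<&str>, raw_text: &str) -> String {
    let mut parts = Vec::new();
    if let Some(account) = account {
        parts.push(format!("Account: {account}"));
    }
    if let Some(agency) = agency {
        parts.push(format!("Agency: {agency}"));
    }
    parts.push(format!("Text: {raw_text}"));
    parts.join(" | ")
}
```

Because the function is pure, the same provision always maps to the same string, and therefore to the same embedding input.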
Batch processing
Provisions are sent to the OpenAI API in batches (default 100 provisions per call). Each call returns a vector of 3,072 floating-point numbers per provision — the embedding that captures the provision’s meaning in high-dimensional space.
All vectors are L2-normalized (unit length), which means cosine similarity equals the simple dot product — a fast computation.
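The equivalence is easy to check numerically. The sketch below (function names are illustrative) normalizes two small vectors and compares their dot product with the full cosine formula.

```rust
/// Scale a vector to unit length in place; a zero vector is left unchanged.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Full cosine formula, used here only to check the shortcut.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot(a, b) / (norm(a) * norm(b))
}
```

For the vectors [3, 4] and [4, 3], the cosine similarity is 24/25 = 0.96, and the dot product of their normalized forms [0.6, 0.8] and [0.8, 0.6] gives the same value without computing any norms at query time.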
Binary storage
Embeddings are stored in a split format for efficiency:
- embeddings.json (~200 bytes): Human-readable metadata: model name, dimensions, count, and SHA-256 hashes for the hash chain
- vectors.bin (count × 3,072 × 4 bytes): Raw little-endian float32 array with no header
For the FY2024 omnibus (2,364 provisions), vectors.bin is 29 MB and loads in under 2 milliseconds. The same data as JSON float arrays would be ~57 MB and take ~175ms to parse. Since this is a read-heavy system — load once per CLI invocation, query many times — the binary format keeps startup instant.
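Decoding a headerless little-endian float32 file needs nothing beyond the standard library. The sketch below is illustrative: the function names are hypothetical, and returning one Vec per provision is a simplification (a flat buffer would avoid the extra allocations).

```rust
/// Expected file size: one 4-byte float per dimension per provision.
fn expected_size_bytes(count: usize, dimensions: usize) -> usize {
    count * dimensions * 4
}

/// Decode a raw little-endian f32 buffer into one vector per provision.
/// Returns None if the length is not a whole number of vectors.
fn decode_vectors(bytes: &[u8], dimensions: usize) -> Option<Vec<Vec<f32>>> {
    let row_bytes = dimensions.checked_mul(4)?;
    if row_bytes == 0 || bytes.len() % row_bytes != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(row_bytes)
            .map(|row| {
                row.chunks_exact(4)
                    .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
                    .collect()
            })
            .collect(),
    )
}
```

For 2,364 provisions at 3,072 dimensions, expected_size_bytes gives 29,048,832 bytes, which matches the 29 MB figure above.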
What gets created
data/118/hr/9468/
├── ...existing files...
├── embeddings.json ← Metadata: model, dimensions, count, hashes
└── vectors.bin ← Raw float32 vectors [count × 3072]
Requires: OPENAI_API_KEY
Stage 5: Query
All query operations — search, summary, compare, audit — run locally against the JSON and binary files on disk. There are no API calls at query time, with one exception: search --semantic makes a single API call to embed your query text (~100ms).
How queries work
- Load: loading.rs recursively walks the --dir path, finds every extraction.json, and deserializes it along with sibling files (verification.json, metadata.json) into LoadedBill structs.
- Filter: For search queries, each provision is tested against the specified filters (type, agency, account, keyword, division, dollar range). All filters use AND logic.
- Rank: For semantic searches, the query text is embedded via OpenAI, and cosine similarity is computed against every matching provision’s pre-stored vector. For --similar, the source provision’s stored vector is used directly (no API call).
- Compute: For summary, budget authority and rescissions are computed from provisions. For compare, accounts are matched by (agency, account_name) and deltas are calculated. For audit, verification metrics are aggregated.
- Format: The CLI layer (main.rs) renders results as tables, JSON, JSONL, or CSV depending on the --format flag.
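The Filter and Rank steps can be sketched together. This is a simplified illustration with hypothetical types: only three of the filters are modeled, agency matching is exact equality (an assumption), and vectors are taken to be L2-normalized so the dot product serves as cosine similarity.

```rust
/// A flattened provision with just the fields this sketch filters on.
struct Provision {
    provision_type: &'static str,
    agency: &'static str,
    dollars: Option<i64>,
    vector: Vec<f32>, // assumed L2-normalized, as described for Stage 4
}

/// Filters left as None are ignored; every filter that is set must match.
#[derive(Default)]
struct Filters {
    provision_type: Option<&'static str>,
    agency: Option<&'static str>,
    min_dollars: Option<i64>,
}

impl Filters {
    fn matches(&self, p: &Provision) -> bool {
        self.provision_type.map_or(true, |t| p.provision_type == t)
            && self.agency.map_or(true, |a| p.agency == a)
            && self.min_dollars.map_or(true, |min| p.dollars.map_or(false, |d| d >= min))
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Return indices of matching provisions, most similar to `query` first.
fn filter_and_rank(provisions: &[Provision], filters: &Filters, query: &[f32]) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = provisions
        .iter()
        .enumerate()
        .filter(|(_, p)| filters.matches(p))
        .map(|(i, p)| (i, dot(query, &p.vector)))
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(i, _)| i).collect()
}
```

Filtering before scoring keeps the similarity pass limited to provisions that already satisfy every filter.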
Performance
All of this is fast:
| Operation | Time | Notes |
|---|---|---|
| Load 14 bills (extraction.json) | ~40ms | JSON parsing |
| Load embeddings (14 bills, binary) | ~8ms | Memory read |
| Hash all files (14 bills) | ~8ms | SHA-256 |
| Cosine search (8,500 provisions) | <0.5ms | Dot products |
| Total cold-start query | ~50ms | Load + hash + search |
| Embed query text (OpenAI API) | ~100ms | Network round-trip |
At 20 congresses (~60 bills, ~15,000 provisions): cold start ~100ms, search <1ms. The system scales linearly and stays interactive at any realistic data volume.
The only API call at query time is the one search --semantic makes to embed the query text. The --similar flag uses only stored vectors and works completely offline.
The Write-Once Principle
Every file in the pipeline is write-once. After a bill is extracted and embedded, its files are never modified (unless you deliberately re-extract or upgrade). This design has several advantages:
- No file locking needed. Multiple processes can read simultaneously without coordination.
- No database needed. JSON files on disk are the right abstraction for a read-dominated workload with ~15 writes per year (when Congress enacts bills) and thousands of reads.
- No caching needed. The files ARE the cache. There’s nothing to invalidate.
- Git-friendly. All files are diffable JSON (except vectors.bin, which is gitattributed as binary).
- Trivially relocatable. Copy a bill directory anywhere and it works: no registry, no config, no state files outside the directory.
The one exception to strict immutability is the links/links.json file, which holds accepted cross-bill relationships. It is updated in place rather than regenerated: link accept adds an entry and link remove deletes one.
The Hash Chain
Each downstream artifact records the SHA-256 hash of its input, forming a chain that enables staleness detection:
BILLS-*.xml ──sha256──▶ metadata.json (source_xml_sha256)
│
extraction.json ──sha256──▶ embeddings.json (extraction_sha256)
│
vectors.bin ──sha256──▶ embeddings.json (vectors_sha256)
If you re-download the XML (producing a new file), metadata.json still references the old hash. If you re-extract (producing a new extraction.json), embeddings.json still references the old extraction hash. The staleness.rs module checks these hashes on commands that use embeddings and prints warnings:
⚠ H.R. 4366: embeddings are stale (extraction.json has changed)
Warnings are advisory — they never block execution. Hashing all files for 14 bills takes ~8ms, so there’s no performance reason to skip checks.
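The staleness check reduces to comparing recorded hashes with freshly computed ones. In the sketch below the hex digests are plain inputs (computing them, for example with the sha2 crate, is out of scope), the struct and function names are hypothetical, and the wording of the second warning is an assumption.

```rust
/// Hashes recorded in embeddings.json when the embeddings were generated.
struct EmbeddingsMeta {
    extraction_sha256: String,
    vectors_sha256: String,
}

/// Compare recorded hashes with current ones and collect advisory warnings.
/// Nothing here blocks execution; the caller just prints the messages.
fn staleness_warnings(
    bill: &str,
    meta: &EmbeddingsMeta,
    extraction_now: &str,
    vectors_now: &str,
) -> Vec<String> {
    let mut warnings = Vec::new();
    if meta.extraction_sha256 != extraction_now {
        warnings.push(format!(
            "⚠ {bill}: embeddings are stale (extraction.json has changed)"
        ));
    }
    if meta.vectors_sha256 != vectors_now {
        // The wording of this message is an assumption.
        warnings.push(format!(
            "⚠ {bill}: vectors.bin does not match embeddings.json"
        ));
    }
    warnings
}
```

Returning a list of messages rather than an error keeps the check advisory, in line with the rule that warnings never block execution.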
See Data Integrity and the Hash Chain for more details.
Dependencies
The pipeline uses a minimal set of Rust crates:
| Stage | Key Crate | Role |
|---|---|---|
| Download | reqwest | HTTP client for Congress.gov API |
| Parse | roxmltree | Pure-Rust XML parsing, zero-copy where possible |
| Extract | reqwest + tokio | Async HTTP for Anthropic API with parallel chunk processing |
| Parse LLM output | serde_json | JSON deserialization with custom resilient parsing |
| Verify | sha2 | SHA-256 hashing for the hash chain |
| Embed | reqwest | HTTP client for OpenAI API |
| Query | walkdir | Recursive directory traversal to find bill data |
| Output | comfy-table + csv | Terminal table formatting and CSV export |
All API clients use rustls-tls (pure Rust TLS) — no OpenSSL dependency.
What Can Go Wrong
Understanding the pipeline helps you diagnose issues:
| Symptom | Likely Stage | Investigation |
|---|---|---|
| “No XML files found” | Download | Check that BILLS-*.xml exists in the directory |
| Low provision count | Extract | Check audit coverage; examine chunk artifacts in chunks/ |
| NotFound > 0 in audit | Extract + Verify | Run audit --verbose; check if the LLM hallucinated an amount |
| “Embeddings are stale” | Embed | Run embed to regenerate after re-extraction |
| Semantic search returns no results | Embed | Check that embeddings.json and vectors.bin exist |
| Budget authority doesn’t match expectations | Extract | Check detail_level and semantics; see Budget Authority Calculation |
Next Steps
- How Verification Works — deep dive into the three verification checks
- How Semantic Search Works — embeddings, cosine similarity, and vector storage
- Budget Authority Calculation — exactly how totals are computed from provisions
- Data Integrity and the Hash Chain — staleness detection across the pipeline