Running the Complete Pipeline

This guide walks through every step needed to process appropriations bills from raw XML to a queryable account registry. Each step adds new files without modifying previous outputs, with one exception: verify-text --repair rewrites extraction.json in place after saving a backup. You can stop at any step and still get value from the data produced so far.

Prerequisites

cargo install --path .    # Build the tool (Rust 1.93+)

API keys (only needed for specific steps):

Key           Environment Variable   Required For
Congress.gov  CONGRESS_API_KEY       download (free at api.congress.gov)
Anthropic     ANTHROPIC_API_KEY      extract, resolve-tas (LLM tier)
OpenAI        OPENAI_API_KEY         embed (text-embedding-3-large)

No API keys are needed for verify-text, enrich, authority build, or any query command when working with pre-processed data.
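As a convenience, the key requirements above can be checked before starting a long run. The following is a minimal sketch in Python, not part of the tool: the `missing_keys` helper and the `REQUIRED_KEY` mapping are illustrative names, and only the environment variable names and step names come from the table.

```python
import os

# Environment variable required by each pipeline step, per the table above.
# Steps not listed here need no API key.
REQUIRED_KEY = {
    "download": "CONGRESS_API_KEY",
    "extract": "ANTHROPIC_API_KEY",
    "resolve-tas": "ANTHROPIC_API_KEY",  # only when the LLM tier is used
    "embed": "OPENAI_API_KEY",
}


def missing_keys(steps, env=None):
    """Return {step: variable} for steps whose API key is unset or empty."""
    env = os.environ if env is None else env
    return {
        step: REQUIRED_KEY[step]
        for step in steps
        if step in REQUIRED_KEY and not env.get(REQUIRED_KEY[step])
    }


if __name__ == "__main__":
    planned = ["download", "extract", "enrich", "embed"]
    for step, var in missing_keys(planned).items():
        print(f"{step}: set {var} before running")
```

Steps such as enrich or verify-text never appear in the result because they have no entry in the mapping.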

The Pipeline

Step 1: download        → BILLS-*.xml
Step 2: extract         → extraction.json, verification.json, metadata.json
Step 3: verify-text     → source_span on every provision (modifies extraction.json)
Step 4: enrich          → bill_meta.json
Step 5: resolve-tas     → tas_mapping.json
Step 6: embed           → embeddings.json, vectors.bin
Step 7: authority build → authorities.json

Step 1: Download bill XML

# Download all enacted bills for a congress
congress-approp download --congress 119 --enacted-only

# Or download a specific bill
congress-approp download --congress 119 --type hr --number 7148

This fetches the enrolled (signed-into-law) XML from Congress.gov into data/{congress}-{type}{number}/. Each bill gets its own directory.

Cost: Free (Congress.gov API is free). Time: ~30 seconds per congress. Needs: CONGRESS_API_KEY

You can skip this step entirely if you already have bill XML files — just place them in the expected directory structure.
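If you place XML files by hand, the directory name has to follow the data/{congress}-{type}{number}/ pattern described above. A small Python sketch of that naming rule follows; the helper names are illustrative, and the assumption that the bill type is always lowercase letters is inferred from the examples in this guide rather than confirmed.

```python
import re
from pathlib import Path


def bill_dir(congress: int, bill_type: str, number: int, root: str = "data") -> Path:
    """Build the per-bill directory path: {root}/{congress}-{type}{number}."""
    return Path(root) / f"{congress}-{bill_type}{number}"


def parse_bill_dir(name: str):
    """Split a directory name such as '119-hr7148' into (congress, type, number)."""
    # Assumes a lowercase alphabetic bill type, as in the examples in this guide.
    match = re.fullmatch(r"(\d+)-([a-z]+)(\d+)", name)
    if match is None:
        raise ValueError(f"not a bill directory name: {name!r}")
    congress, bill_type, number = match.groups()
    return int(congress), bill_type, int(number)
```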

Step 2: Extract provisions

congress-approp extract --dir data --parallel 5

Sends bill text to Claude Opus 4.6 for structured extraction. Large bills are split into chunks and processed in parallel. Every provision — appropriations, rescissions, CR anomalies, riders, directives — is captured as typed JSON.

The command skips bills that already have extraction.json. Use --force to re-extract.
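To see in advance which bills an extract run would touch, the skip rule can be approximated outside the tool. This Python sketch assumes each bill directory is a direct child of the data root and holds a BILLS-*.xml file; `bills_needing_extraction` is a hypothetical helper, and its `force` argument only mimics the effect of --force.

```python
from pathlib import Path


def bills_needing_extraction(data_root, force=False):
    """List bill directories that have source XML but no extraction.json yet."""
    pending = []
    for bill in sorted(Path(data_root).iterdir()):
        if not bill.is_dir():
            continue
        has_xml = any(bill.glob("BILLS-*.xml"))
        has_output = (bill / "extraction.json").exists()
        # With force=True every bill that has XML is listed, like --force.
        if has_xml and (force or not has_output):
            pending.append(bill.name)
    return pending
```

Calling `bills_needing_extraction("data")` before a run gives a rough idea of how many paid extractions are about to start.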

Cost: ~$0.10 per chunk (small bills: $0.10–0.50; omnibus bills: $5–15). Time: 1–2 minutes for small bills; 30–60 minutes for an omnibus. Needs: ANTHROPIC_API_KEY

This is the expensive step. Once done, you do not need to re-extract unless the model or prompt improves significantly.

Produces per bill:

File               Content
extraction.json    Structured provisions (the main output)
verification.json  Dollar amount and raw text verification
metadata.json      Provenance (model, timestamps, chunk completion)
conversion.json    LLM JSON parsing report
tokens.json        API token usage for cost tracking
BILLS-*.txt        Clean text extracted from XML (used for verification)

Step 3: Verify and repair raw text

congress-approp verify-text --dir data --repair

Deterministically checks that every provision’s raw_text field is a verbatim substring of the enrolled bill source text. Repairs LLM copying errors (word substitutions like “clause” instead of “subsection”, whitespace differences, quote character mismatches) using a 3-tier algorithm:

  1. Prefix match — find the longest matching prefix, copy source bytes
  2. Substring match — find a distinctive internal phrase, walk backward to the provision start
  3. Normalized position mapping — search in whitespace/quote-normalized space, map back to original byte positions
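The third tier is the least obvious of the three, so here is a simplified Python sketch of the idea. It is an illustration under stated assumptions, not the tool's implementation: it normalizes only whitespace runs and four curly-quote characters, and the names `normalize` and `locate` are made up for this example.

```python
QUOTES = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}


def normalize(text):
    """Collapse whitespace runs and straighten quotes.

    Returns the normalized string plus, for each normalized character,
    the index of the original character it came from.
    """
    out, origin = [], []
    prev_space = False
    for i, ch in enumerate(text):
        if ch.isspace():
            if prev_space:
                continue  # skip the rest of a whitespace run
            ch, prev_space = " ", True
        else:
            ch, prev_space = QUOTES.get(ch, ch), False
        out.append(ch)
        origin.append(i)
    return "".join(out), origin


def locate(source, raw_text):
    """Return (start, end) UTF-8 byte offsets of raw_text in source, or None."""
    norm_src, origin = normalize(source)
    norm_raw, _ = normalize(raw_text.strip())
    pos = norm_src.find(norm_raw)
    if pos < 0 or not norm_raw:
        return None
    # Map the match in normalized space back to original character indices.
    first = origin[pos]
    last = origin[pos + len(norm_raw) - 1]
    # Convert character indices to byte offsets in the UTF-8 encoding.
    start = len(source[:first].encode("utf-8"))
    end = len(source[: last + 1].encode("utf-8"))
    return start, end
```

Once a span is found, the repaired raw_text would simply be the source bytes between `start` and `end`, which is what makes the result verbatim by construction.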

After repair, every provision carries a source_span with exact UTF-8 byte offsets into the source .txt file.

Cost: Free (no API calls). Time: ~10 seconds for all 32 bills. Needs: Nothing.

Without --repair, the command only analyzes and does not modify any files. With --repair, a backup (extraction.json.pre-repair) is created before any modification is written.

Invariant: After this step, for every provision p:

source_file_bytes[p.source_span.start .. p.source_span.end] == p.raw_text

This is mechanically verifiable. The start and end values are UTF-8 byte offsets (matching Rust’s native str indexing). Languages that use character-based indexing (Python, JavaScript) must use byte-level slicing:

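# `provision` is one provision record loaded from extraction.json.
span = provision["source_span"]
raw_bytes = open("BILLS-118hr2882enr.txt", "rb").read()  # read bytes, not text
actual = raw_bytes[span["start"]:span["end"]].decode("utf-8")
assert actual == provision["raw_text"]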
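The difference between character and byte offsets only shows up when the text contains non-ASCII characters, which enrolled bill text does (curly quotes, section signs). The short Python example below uses a made-up sentence to show the two offsets diverging; `slice_by_bytes` is an illustrative helper.

```python
text = "\u201cSEC. 101.\u201d Funds are available."
needle = "Funds"

char_start = text.index(needle)                      # character offset: 12
byte_start = len(text[:char_start].encode("utf-8"))  # byte offset: 16


def slice_by_bytes(source: str, start: int, end: int) -> str:
    """Slice a string using UTF-8 byte offsets, as source_span requires."""
    return source.encode("utf-8")[start:end].decode("utf-8")


# Each curly quote is 1 character but 3 bytes in UTF-8, so the two
# offsets differ by 4 here. Slicing the str with byte offsets is wrong.
print(slice_by_bytes(text, byte_start, byte_start + len(needle)))
print(text[byte_start:byte_start + len(needle)])
```

The first print shows `Funds`; the second shows the wrong text because it treats byte offsets as character positions.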

Step 4: Enrich with metadata

congress-approp enrich --dir data

Generates bill_meta.json per bill with fiscal year metadata, subcommittee/jurisdiction mappings, advance appropriation classification, and enriched bill nature (omnibus, minibus, full-year CR, etc.). Uses XML parsing and deterministic keyword matching — no LLM calls.

Cost: Free. Time: ~30 seconds for all bills. Needs: Nothing.

Enables --fy, --subcommittee, and --show-advance flags on query commands.

Step 5: Resolve Treasury Account Symbols

# Full resolution (deterministic + LLM)
congress-approp resolve-tas --dir data

# Deterministic only (free, no API key, ~56% resolution)
congress-approp resolve-tas --dir data --no-llm

# Preview cost before running
congress-approp resolve-tas --dir data --dry-run

Maps each top-level budget authority provision to a Federal Account Symbol (FAS) — a stable identifier assigned by the Treasury that persists through account renames and reorganizations.

Two tiers:

  • Deterministic (~56%): Matches provision account names against the bundled FAST Book reference (fas_reference.json). Free, instant, zero false positives.
  • LLM (~44%): Sends ambiguous provisions to Claude Opus with the relevant FAS codes for the provision’s agency. Verifies each returned code against the FAST Book.
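The two tiers can be pictured with the following Python sketch. It is an assumption-laden illustration, not the tool's matching logic: the provision fields (`id`, `agency`, `account_name`), the shape of the reference mapping, and the example codes used with it are all hypothetical, and the real layout of fas_reference.json may differ.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase and strip punctuation so trivially different spellings compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def resolve_deterministic(provisions, reference):
    """Split provisions into (resolved, unresolved).

    `reference` maps an (agency, account name) pair to a FAS code. A provision
    resolves only on an exact normalized match that is unique, which is what
    keeps this sketch free of false positives.
    """
    index = {}
    for (agency, name), code in reference.items():
        index.setdefault((agency, normalize_name(name)), set()).add(code)

    resolved, unresolved = {}, []
    for p in provisions:
        codes = index.get((p["agency"], normalize_name(p["account_name"])), set())
        if len(codes) == 1:
            resolved[p["id"]] = next(iter(codes))
        else:
            unresolved.append(p["id"])  # left for the LLM tier
    return resolved, unresolved


def accept_llm_code(code: str, reference) -> bool:
    """Accept an LLM-proposed code only if it exists in the reference."""
    return code in set(reference.values())
```

The design point is that the deterministic tier refuses to guess: anything ambiguous or unmatched falls through, and any code coming back from the LLM tier is still checked against the reference before it is stored.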

Cost: Free with --no-llm; ~$85 for the full 32-bill dataset with the LLM tier ($2–4 per omnibus). Time: Instant with --no-llm; ~5 minutes per omnibus with the LLM tier. Needs: ANTHROPIC_API_KEY for the LLM tier only.

This is a one-time cost per bill. The FAS code assignment does not need to be repeated unless the bill is re-extracted.

Step 6: Generate embeddings

congress-approp embed --dir data

Generates OpenAI embedding vectors (text-embedding-3-large, 3072 dimensions) for every provision. Enables semantic search (--semantic), similar-provision matching (--similar), the relate command, and link suggest.

Cost: ~$14 for 34,568 provisions. Time: ~10–15 minutes for all bills. Needs: OPENAI_API_KEY

Optional. If you only need TAS-based account tracking, keyword search, and fiscal year comparisons, you can skip this step.
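For readers unfamiliar with how embedding vectors support semantic search, the Python sketch below ranks provisions by cosine similarity to a query vector. Cosine similarity is the usual choice for embedding search, but whether the tool uses exactly this measure is an assumption, and the function names and the in-memory dictionary of vectors are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query, vectors, top=5):
    """Rank provision ids by cosine similarity to the query vector."""
    scored = [(cosine(query, vec), pid) for pid, vec in vectors.items()]
    scored.sort(reverse=True)
    return [(pid, score) for score, pid in scored[:top]]
```

In practice the vectors would have 3072 dimensions rather than the handful used in a toy test, but the ranking logic is the same.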

Step 7: Build the authority registry

congress-approp authority build --dir data

Aggregates all tas_mapping.json files into a single authorities.json at the data root. Groups provisions by FAS code into account authorities with name variants, provision references, fiscal year coverage, dollar totals, and detected lifecycle events (renames).

Cost: Free. Time: ~1 second. Needs: At least one tas_mapping.json from Step 5.
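The aggregation described above amounts to a group-by on the FAS code. The Python sketch below shows the general shape under stated assumptions: the record keys (`fas_code`, `account_name`, `fiscal_year`, `amount`, `provision_ref`) and the output layout are hypothetical and may not match the actual tas_mapping.json or authorities.json schemas.

```python
from collections import defaultdict


def build_authorities(mappings):
    """Group provision records by FAS code into per-account summaries."""
    groups = defaultdict(list)
    for record in mappings:
        groups[record["fas_code"]].append(record)

    authorities = {}
    for code, records in groups.items():
        records.sort(key=lambda r: r["fiscal_year"])
        names = []
        for r in records:
            if r["account_name"] not in names:
                names.append(r["account_name"])  # keep first-seen order
        # A name change between consecutive records is flagged as a rename.
        renames = [
            {"fiscal_year": cur["fiscal_year"],
             "from": prev["account_name"],
             "to": cur["account_name"]}
            for prev, cur in zip(records, records[1:])
            if prev["account_name"] != cur["account_name"]
        ]
        authorities[code] = {
            "name_variants": names,
            "provisions": [r["provision_ref"] for r in records],
            "fiscal_years": sorted({r["fiscal_year"] for r in records}),
            "total": sum(r["amount"] for r in records),
            "events": renames,
        }
    return authorities
```

Because the FAS code is the grouping key, a renamed account stays in one group and the rename surfaces as an event rather than as two unrelated accounts.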

Querying the Data

After the pipeline completes, all query commands work:

# What bills do I have?
congress-approp summary --dir data

# Filter to one fiscal year
congress-approp summary --dir data --fy 2026

# Track an account across fiscal years
congress-approp trace 070-0400 --dir data
congress-approp trace "coast guard operations" --dir data

# Browse the account registry
congress-approp authority list --dir data --agency 070

# Search by meaning
congress-approp search --dir data --semantic "disaster relief funding" --top 5

# Compare fiscal years with TAS matching
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud \
    --dir data --use-authorities

# Audit data quality
congress-approp audit --dir data

# Verify source traceability
congress-approp verify-text --dir data

Adding a New Bill

When Congress enacts a new bill, add it to the dataset:

congress-approp download --congress 119 --type hr --number 9999
congress-approp extract --dir data/119-hr9999 --parallel 5
congress-approp verify-text --dir data --bill 119-hr9999 --repair
congress-approp enrich --dir data/119-hr9999
congress-approp resolve-tas --dir data --bill 119-hr9999
congress-approp embed --dir data/119-hr9999
congress-approp authority build --dir data --force

The --force on the last command rebuilds authorities.json to include the new bill. All existing data is unchanged.
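If you add bills regularly, the sequence above can be generated rather than retyped. This Python sketch only builds the argument lists and does not run anything; `new_bill_commands` is a hypothetical helper, and every flag in it is copied from the commands shown above.

```python
def new_bill_commands(congress: int, bill_type: str, number: int, root: str = "data"):
    """Return the argv lists for adding one bill, in pipeline order."""
    bill = f"{congress}-{bill_type}{number}"
    bill_path = f"{root}/{bill}"
    tool = "congress-approp"
    return [
        [tool, "download", "--congress", str(congress),
         "--type", bill_type, "--number", str(number)],
        [tool, "extract", "--dir", bill_path, "--parallel", "5"],
        [tool, "verify-text", "--dir", root, "--bill", bill, "--repair"],
        [tool, "enrich", "--dir", bill_path],
        [tool, "resolve-tas", "--dir", root, "--bill", bill],
        [tool, "embed", "--dir", bill_path],
        [tool, "authority", "build", "--dir", root, "--force"],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` in order, so that a failing step stops the sequence before later steps spend API budget.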

Rebuilding From Scratch

If you have only the XML files, you can rebuild everything:

congress-approp extract --dir data --parallel 5      # ~$100, ~4 hours
congress-approp verify-text --dir data --repair       # free, ~10 seconds
congress-approp enrich --dir data                     # free, ~30 seconds
congress-approp resolve-tas --dir data                # ~$85, ~1 hour
congress-approp embed --dir data                      # ~$14, ~15 minutes
congress-approp authority build --dir data             # free, ~1 second

Total cost to rebuild from scratch: ~$200. Total time: ~6 hours (mostly waiting for LLM responses). The XML files themselves are permanent government records available from Congress.gov.
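The totals can be checked by adding up the per-step estimates from the command comments. The short Python tally below does that; the step figures are copied from above, and the variable names are illustrative.

```python
# Per-step estimates from the command comments: (cost in USD, time in seconds).
STEPS = {
    "extract":         (100, 4 * 3600),
    "verify-text":     (0, 10),
    "enrich":          (0, 30),
    "resolve-tas":     (85, 1 * 3600),
    "embed":           (14, 15 * 60),
    "authority build": (0, 1),
}

total_cost = sum(cost for cost, _ in STEPS.values())
total_seconds = sum(seconds for _, seconds in STEPS.values())
total_hours = total_seconds / 3600

print(f"cost: ~${total_cost}")            # 100 + 85 + 14 = 199
print(f"time: ~{total_hours:.1f} hours")  # 18941 seconds, about 5.3 hours
```

The cost rounds to the ~$200 quoted above. The listed step times sum to a little over five hours, so the ~6 hour figure presumably leaves some headroom for retries and API latency.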

Pipeline Dependencies

download (1) ─────────┐
                       ▼
extract (2) ──────► verify-text (3) ──────┐
     │                                     │
     ├──────────► enrich (4) ◄────────────┘
     │                │
     ├──────────► resolve-tas (5) ◄── fas_reference.json
     │                │
     └──────────► embed (6)
                      │
                      ├──► link suggest
                      │
authority build (7) ◄─── resolve-tas outputs from all bills

Steps 4, 5, and 6 are independent of each other — they all read from extraction.json and can run in any order after Step 3. Step 7 requires Step 5 to have run on all bills you want included.
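The ordering constraints can also be written down as a dependency graph and sorted mechanically. The Python sketch below uses the standard library's graphlib (Python 3.9+); the `DEPENDS_ON` mapping is a reading of the text above, simplified so that each step lists only its immediate predecessor.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps it depends on, as described above.
DEPENDS_ON = {
    "download": set(),
    "extract": {"download"},
    "verify-text": {"extract"},
    "enrich": {"verify-text"},
    "resolve-tas": {"verify-text"},
    "embed": {"verify-text"},
    "authority build": {"resolve-tas"},
}


def run_order():
    """Return one valid linear order for the whole pipeline."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())


def ready_groups():
    """Yield batches of steps whose dependencies are all complete."""
    sorter = TopologicalSorter(DEPENDS_ON)
    sorter.prepare()
    while sorter.is_active():
        batch = sorted(sorter.get_ready())
        yield batch
        sorter.done(*batch)


if __name__ == "__main__":
    for batch in ready_groups():
        print(batch)
```

The fourth batch contains embed, enrich, and resolve-tas together, which reflects the statement that those three steps are independent of each other.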

Output File Reference

Per-bill files

File               Step  Size (typical)  Content
BILLS-*.xml        1     12K–9.4MB       Enrolled bill XML (source of truth)
BILLS-*.txt        2     3K–3MB          Clean text from XML
extraction.json    2+3   20K–2MB         Provisions + source spans
verification.json  2     5K–500K         Verification report
metadata.json      2     500B            Provenance
bill_meta.json     4     2K–20K          FY, subcommittee, timing
tas_mapping.json   5     5K–200K         FAS codes per provision
embeddings.json    6     1K–50K          Embedding metadata
vectors.bin        6     100K–35MB       Binary float32 vectors

Cross-bill files (at data root)

File                Step              Content
fas_reference.json  bundled           2,768 FAS codes from the FAST Book
authorities.json    7                 Account registry with timelines and events
dataset.json        normalize accept  Entity resolution rules (optional)
links/links.json    link accept       Embedding-based cross-bill links (optional)
links/links.jsonlink acceptEmbedding-based cross-bill links (optional)