What This Tool Does

The Problem

Every year, Congress passes appropriations bills providing roughly $1.7 trillion in discretionary spending: the money that funds federal agencies, military operations, scientific research, infrastructure, veterans’ benefits, and thousands of other programs. These bills run to approximately 1,500 pages annually, published as XML on Congress.gov.

The text is public, but it’s practically unsearchable at the provision level. If you want to know how much Congress appropriated for a specific program, you have three options:

  1. Read the bill yourself. The FY2024 omnibus alone is over 1,800 pages of dense legislative text with nested cross-references, “of which” sub-allocations, and provisions scattered across twelve divisions.
  2. Read CBO cost estimates or committee reports. These are expert summaries, but they aggregate — you get totals by title or account, not individual provisions. They also don’t cover every bill type the same way.
  3. Search Congress.gov full text. You can find keywords, but you can’t filter by provision type, sort by dollar amount, or compare the same program across bills.

None of these let you ask structured questions like “show me every rescission over $10 million” or “which programs got a different amount in the continuing resolution than in the omnibus” or “find all provisions related to opioid treatment, including ones that don’t use the word ‘opioid.’”

What This Tool Does

congress-approp turns appropriations bill text into structured, queryable, verified data:

  • Downloads enrolled bill XML from Congress.gov via its official API — the authoritative, machine-readable source
  • Extracts every spending provision into structured JSON using Claude, capturing account names, dollar amounts, agencies, availability periods, provision types, section references, and more
  • Verifies every dollar amount against the source text using deterministic string matching — no LLM in the verification loop
  • Generates semantic embeddings so you can search by meaning rather than by exact keywords
  • Provides CLI query tools to search, compare, summarize, and audit provisions across any number of extracted bills
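As a rough illustration of the meaning-based search in the list above, embedding search is commonly implemented by ranking stored vectors by cosine similarity to a query vector. The sketch below assumes that approach; the page does not state which embedding model or similarity measure congress-approp uses, and the function names, labels, and three-dimensional toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_by_meaning(query_vec, provisions, top_k=3):
    """Return the top_k (score, label) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), label)
              for label, vec in provisions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy data (hypothetical): real embeddings have hundreds of dimensions.
toy_provisions = [
    ("Substance use treatment grants", [0.9, 0.1, 0.0]),
    ("Highway construction", [0.0, 0.2, 0.9]),
    ("Overdose prevention programs", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # stands in for the embedding of "opioid treatment"
results = rank_by_meaning(query, toy_provisions, top_k=2)
```

Neither of the two top-ranked toy labels contains the word "opioid", which is the point of searching by vector similarity rather than by keyword.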

The Trust Model

LLM extraction is not infallible. This tool is designed around a simple principle: the LLM extracts once; deterministic code verifies everything.

The verification pipeline runs after extraction and checks every claim the LLM made against the source bill text. No language model is involved in verification — it’s pure string matching with tiered fallback (exact → normalized → spaceless). The result across the included dataset:

Metric                                                     Result
Dollar amounts not found in source                         1 out of 18,584 (99.995% found)
Source traceability                                        100% (every provision has byte-level source spans)
Raw text byte-identical to source                          94.6%
CR substitution pairs verified                             100%
Sub-allocations correctly excluded from budget authority

Every extracted dollar amount can be traced back to an exact byte position in the enrolled bill text. The audit command shows this verification breakdown for any set of bills. If a number can’t be verified, it’s flagged — not silently accepted. For the full breakdown, see Accuracy Metrics.
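The tiered fallback described above (exact → normalized → spaceless) can be sketched as follows. This is a minimal illustration, not the tool's code: the page names the three tiers but not their rules, so the normalization used here (collapsing whitespace and straightening curly quotes) and the function names are assumptions.

```python
import re


def normalize(text):
    """Assumed normalization: straighten curly quotes, collapse whitespace."""
    text = (text.replace("\u2019", "'")
                .replace("\u201c", '"')
                .replace("\u201d", '"'))
    return re.sub(r"\s+", " ", text).strip()


def spaceless(text):
    """Remove all whitespace so line wraps and stray spaces cannot matter."""
    return re.sub(r"\s+", "", text)


def match_tier(needle, source):
    """Return the first tier at which needle is found in source, else None."""
    if needle in source:
        return "exact"
    if normalize(needle) in normalize(source):
        return "normalized"
    if spaceless(needle) in spaceless(source):
        return "spaceless"
    return None  # unverified: should be flagged, not silently accepted


def exact_byte_span(needle, source):
    """Byte offsets (start, end) of an exact match in the UTF-8 source."""
    data = source.encode("utf-8")
    start = data.find(needle.encode("utf-8"))
    if start == -1:
        return None
    return (start, start + len(needle.encode("utf-8")))


sample = ("For necessary expenses, $1,250,000,000,\n"
          "   to remain available until expended.")
```

Because each tier is a plain substring test, the same input always yields the same verdict, and a claim that fails all three tiers is simply reported as not found.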

The ~5% of provisions where raw_text isn’t a byte-identical substring are cases where the LLM truncated a very long provision or normalized whitespace. The verify-text command repairs these deterministically — and the dollar amounts in those provisions are still independently verified.

What’s Included

The tool ships with 32 enacted appropriations bills across 4 congresses (116th–119th), covering FY2019 through FY2026. Every major bill type is represented — omnibus, minibus, continuing resolutions, supplementals, and authorizations. See the Recipes & Demos page for the full bill inventory, or run congress-approp summary --dir data to see them all.

Each bill directory includes the source XML, extracted provisions (extraction.json), verification report, extraction metadata, TAS mapping, bill metadata, and pre-computed embeddings. No API keys are required to query this data.
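Once provisions are structured records, a question like "show me every rescission over $10 million" becomes a simple filter. The sketch below runs on in-memory records; the field names (account_name, provision_type, amount) and the sample values are hypothetical and are not taken from the actual extraction.json schema.

```python
def find_provisions(provisions, provision_type, min_amount=0):
    """Provisions of one type with amount strictly above min_amount,
    largest first. Records without an amount are treated as 0."""
    hits = [
        p for p in provisions
        if p.get("provision_type") == provision_type
        and (p.get("amount") or 0) > min_amount
    ]
    return sorted(hits, key=lambda p: p.get("amount") or 0, reverse=True)


# Hypothetical records for illustration only.
sample_provisions = [
    {"account_name": "Example Account A", "provision_type": "rescission",
     "amount": 25_000_000},
    {"account_name": "Example Account B", "provision_type": "rescission",
     "amount": 4_000_000},
    {"account_name": "Example Account C", "provision_type": "appropriation",
     "amount": 900_000_000},
    {"account_name": "Example Account D", "provision_type": "rescission",
     "amount": 150_000_000},
]

big_rescissions = find_provisions(sample_provisions, "rescission", 10_000_000)
for p in big_rescissions:
    print(f'{p["account_name"]}: ${p["amount"]:,}')
```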

Five Things You Can Do Right Now

All of these work immediately with the included example data — no API keys needed.

1. See budget totals for all included bills:

congress-approp summary --dir data

Shows each bill’s provision count, gross budget authority, rescissions, and net budget authority in a formatted table.
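The relationship between these totals can be sketched as below, assuming net budget authority is gross budget authority minus rescissions, and that "of which" sub-allocations are skipped because they are already counted inside their parent appropriation. The is_sub_allocation flag, the other field names, and the amounts are hypothetical.

```python
def budget_totals(provisions):
    """Gross, rescinded, and net totals for a list of provision records."""
    gross = sum(
        p["amount"] for p in provisions
        if p["provision_type"] == "appropriation"
        and not p.get("is_sub_allocation", False)  # avoid double counting
    )
    rescinded = sum(
        p["amount"] for p in provisions
        if p["provision_type"] == "rescission"
    )
    return {"gross": gross, "rescissions": rescinded, "net": gross - rescinded}


# Hypothetical bill: one account, one "of which" carve-out, one rescission.
example_bill = [
    {"provision_type": "appropriation", "amount": 1_000_000_000},
    {"provision_type": "appropriation", "amount": 200_000_000,
     "is_sub_allocation": True},  # "of which $200,000,000 shall be for ..."
    {"provision_type": "rescission", "amount": 50_000_000},
]

totals = budget_totals(example_bill)
```

Counting the carve-out as its own appropriation would overstate gross budget authority by $200 million in this example, which is why the exclusion matters.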

2. Search all appropriations provisions:

congress-approp search --dir data --type appropriation

Lists every appropriation-type provision across all bills with account name, amount, division, and agency.

3. Find FEMA funding:

congress-approp search --dir data --keyword "Federal Emergency Management"

Searches provision text across all bills for the phrase “Federal Emergency Management”, which matches provisions that spell out FEMA’s name.

4. See what the continuing resolution changed:

congress-approp search --dir data/118-hr5860 --type cr_substitution

Shows the 13 “anomalies” — programs where the CR set a different funding level instead of continuing at the prior-year rate.

5. Audit verification status:

congress-approp audit --dir data

Displays a detailed verification breakdown for each bill: how many dollar amounts were verified, how many raw text excerpts matched the source, and the completeness coverage metric.