
Congressional Appropriations Analyzer

congress-approp is a Rust CLI tool and library that downloads U.S. federal appropriations bills from Congress.gov, extracts every spending provision into structured JSON using Claude, and verifies each dollar amount against the source text. The included dataset covers 32 enacted bills across FY2019–FY2026 with 34,568 provisions and $21.5 trillion in budget authority.

Dollar amounts are verified by deterministic string matching against the enrolled bill text — no LLM in the verification loop. 99.995% of extracted dollar amounts are confirmed present in the source (18,583 of 18,584). Every provision carries a source_span with exact byte offsets into the enrolled bill for independent verification.

Jump straight to working examples: Recipes & Demos — track any federal account across fiscal years, compare subcommittees with inflation adjustment, load the data in Python, and more. No API keys needed.

What’s Included

This book ships with 32 enacted appropriations bills across 4 congresses (116th–119th), covering FY2019 through FY2026. All twelve appropriations subcommittees are represented for FY2020–FY2024 and FY2026. You don’t need any API keys to explore them — just install the tool and start querying.

116th Congress (FY2019–FY2021) — 11 bills

| Bill | Classification | Provisions | Budget Auth |
|---|---|---|---|
| H.R. 1865 | Omnibus (FY2020, 8 subcommittees) | 3,338 | $1,710B |
| H.R. 1158 | Minibus (FY2020, Defense + CJS + FinServ + Homeland) | 1,519 | $887B |
| H.R. 133 | Omnibus (FY2021, all 12 subcommittees) | 6,739 | $3,378B |
| H.R. 2157 | Supplemental (FY2019, disaster relief) | 116 | $19B |
| H.R. 3401 | Supplemental (FY2019, humanitarian) | 55 | $5B |
| H.R. 6074 | Supplemental (FY2020, COVID preparedness) | 55 | $8B |
| + 5 CRs | Continuing resolutions | 351 | $31B |

117th Congress (FY2021–FY2023) — 7 bills

| Bill | Classification | Provisions | Budget Auth |
|---|---|---|---|
| H.R. 2471 | Omnibus (FY2022) | 5,063 | $3,031B |
| H.R. 2617 | Omnibus (FY2023) | 5,910 | $3,379B |
| H.R. 3237 | Supplemental (FY2021, Capitol security) | 47 | $2B |
| H.R. 7691 | Supplemental (FY2022, Ukraine) | 67 | $40B |
| H.R. 6833 | CR + Ukraine supplemental | 240 | $46B |
| + 2 CRs | Continuing resolutions | 37 | $0 |

118th Congress (FY2024/FY2025) — 10 bills

| Bill | Classification | Provisions | Budget Auth |
|---|---|---|---|
| H.R. 4366 | Omnibus (MilCon-VA, Ag, CJS, E&W, Interior, THUD) | 2,323 | $921B |
| H.R. 2882 | Omnibus (Defense, FinServ, Homeland, Labor-HHS, LegBranch, State) | 2,608 | $2,451B |
| H.R. 815 | Supplemental (Ukraine/Israel/Taiwan) | 306 | $95B |
| H.R. 9468 | Supplemental (VA) | 7 | $3B |
| H.R. 5860 | Continuing Resolution + 13 anomalies | 136 | $16B |
| S. 870 | Authorization (Fire Admin) | 51 | $0 |
| + 4 CRs | Continuing resolutions | 233 | $0 |

119th Congress (FY2025/FY2026) — 4 bills

| Bill | Classification | Provisions | Budget Auth |
|---|---|---|---|
| H.R. 7148 | Omnibus (Defense + Labor-HHS + THUD + FinServ + State) | 2,774 | $2,841B |
| H.R. 5371 | Minibus (CR + Ag + LegBranch + MilCon-VA) | 1,051 | $681B |
| H.R. 6938 | Minibus (CJS + Energy-Water + Interior) | 1,028 | $196B |
| H.R. 1968 | Full-Year CR with Appropriations (FY2025) | 514 | $1,786B |

Totals: 32 bills, 34,568 provisions, $21.5 trillion in budget authority, 1,051 accounts tracked by Treasury Account Symbol across FY2019–FY2026.

What Can You Do?

“How did THUD funding change from FY2024 to FY2026?”

congress-approp enrich --dir data                    # Generate metadata (once, no API key)
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data

82 accounts matched across fiscal years — Tenant-Based Rental Assistance up $6.1B (+18.7%), Transit Formula Grants reclassified at $14.6B, Capital Investment Grants down $505M.

“What’s the FY2026 MilCon-VA budget, and how much is advance?”

congress-approp summary --dir data --fy 2026 --subcommittee milcon-va --show-advance
┌───────────┬────────────────┬────────────┬─────────────────┬─────────────────┬─────────────────┬─────────────────┬─────────────────┐
│ Bill      ┆ Classification ┆ Provisions ┆     Current ($) ┆     Advance ($) ┆    Total BA ($) ┆ Rescissions ($) ┆      Net BA ($) │
╞═══════════╪════════════════╪════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╡
│ H.R. 5371 ┆ Minibus        ┆        257 ┆ 101,742,083,450 ┆ 393,689,946,000 ┆ 495,432,029,450 ┆  16,499,000,000 ┆ 478,933,029,450 │
└───────────┴────────────────┴────────────┴─────────────────┴─────────────────┴─────────────────┴─────────────────┴─────────────────┘

79.5% of MilCon-VA is advance appropriations for the next fiscal year — without --show-advance, you’d overstate current-year VA spending by $394 billion.

“Trace VA Compensation and Pensions across all fiscal years”

congress-approp relate 118-hr9468:0 --dir data --fy-timeline

Shows every matching provision across FY2024–FY2026 with current/advance/supplemental split, plus deterministic hashes you can save as persistent links for future comparisons.

“Find everything about FEMA disaster relief”

congress-approp search --dir data --semantic "FEMA disaster relief funding" --top 5

Finds FEMA provisions across 5 different bills by meaning, not just keywords — even when the bill text says “Federal Emergency Management Agency—Disaster Relief Fund” instead of “FEMA.”

Key Concepts

  • enrich generates bill metadata offline (no API keys) — enabling fiscal year filtering, subcommittee scoping, and advance appropriation detection.
  • --fy 2026 filters any command to bills covering that fiscal year.
  • --subcommittee thud scopes to a specific appropriations jurisdiction, resolving division letters automatically (Division D in one bill, Division F in another — both map to THUD).
  • --show-advance separates current-year spending from advance appropriations (money enacted now but available in a future fiscal year). Critical for year-over-year comparisons.
  • relate traces one provision across all bills with a fiscal year timeline.
  • link suggest / link accept persist cross-bill relationships so compare --use-links can handle renames automatically.

This book is organized so you can jump to whatever fits your needs:

  • Recipes & Demos — Worked examples for account tracking, fiscal year comparisons, inflation adjustment, Python/pandas integration, and data export. Interactive visualizations included.
  • Getting Started — Install the tool and run your first query in under five minutes. Covers installation and first commands.
  • Getting to Know the Tool — Background reading on what this tool does, who it’s for, and a primer on how federal appropriations work if you’re new to the domain.
  • Tutorials — Step-by-step walkthroughs for common tasks: finding spending on a topic, comparing bills, tracking programs, exporting data, and more.
  • How-To Guides — Task-oriented recipes for specific operations like downloading bills, extracting provisions, and generating embeddings.
  • Explanation — Deep dives into how the extraction pipeline, verification, semantic search, provision types, and budget authority calculation work under the hood.
  • Reference — Lookup material: CLI commands, JSON field definitions, provision types, environment variables, data directory layout, and the glossary.
  • For Contributors — Architecture overview, code map, and guides for adding new provision types, commands, and tests.

Version

This documentation covers congress-approp v6.0.0.

What This Tool Does

The Problem

Every year, Congress passes appropriations bills authorizing roughly $1.7 trillion in discretionary spending — the money that funds federal agencies, military operations, scientific research, infrastructure, veterans’ benefits, and thousands of other programs. These bills run to thousands of pages annually, published as XML on Congress.gov.

The text is public, but it’s practically unsearchable at the provision level. If you want to know how much Congress appropriated for a specific program, you have three options:

  1. Read the bill yourself. The FY2024 omnibus alone is over 1,800 pages of dense legislative text with nested cross-references, “of which” sub-allocations, and provisions scattered across twelve divisions.
  2. Read CBO cost estimates or committee reports. These are expert summaries, but they aggregate — you get totals by title or account, not individual provisions. They also don’t cover every bill type the same way.
  3. Search Congress.gov full text. You can find keywords, but you can’t filter by provision type, sort by dollar amount, or compare the same program across bills.

None of these let you ask structured questions like “show me every rescission over $10 million” or “which programs got a different amount in the continuing resolution than in the omnibus” or “find all provisions related to opioid treatment, including ones that don’t use the word ‘opioid.’”

What This Tool Does

congress-approp turns appropriations bill text into structured, queryable, verified data:

  • Downloads enrolled bill XML from Congress.gov via its official API — the authoritative, machine-readable source
  • Extracts every spending provision into structured JSON using Claude, capturing account names, dollar amounts, agencies, availability periods, provision types, section references, and more
  • Verifies every dollar amount against the source text using deterministic string matching — no LLM in the verification loop
  • Generates semantic embeddings for meaning-based search, enabling search by meaning rather than exact keywords
  • Provides CLI query tools to search, compare, summarize, and audit provisions across any number of extracted bills

The Trust Model

LLM extraction is not infallible. This tool is designed around a simple principle: the LLM extracts once; deterministic code verifies everything.

The verification pipeline runs after extraction and checks every claim the LLM made against the source bill text. No language model is involved in verification — it’s pure string matching with tiered fallback (exact → normalized → spaceless). The result across the included dataset:

| Metric | Result |
|---|---|
| Dollar amounts not found in source | 1 of 18,584 (99.995% verified) |
| Source traceability | 100% — every provision has byte-level source spans |
| Raw text byte-identical to source | 94.6% |
| CR substitution pairs verified | 100% |
| Sub-allocations correctly excluded from budget authority | |

Every extracted dollar amount can be traced back to an exact byte position in the enrolled bill text. The audit command shows this verification breakdown for any set of bills. If a number can’t be verified, it’s flagged — not silently accepted. For the full breakdown, see Accuracy Metrics.

The ~5% of provisions where raw_text isn’t a byte-identical substring are cases where the LLM truncated a very long provision or normalized whitespace. The verify-text command repairs these deterministically — and the dollar amounts in those provisions are still independently verified.

What’s Included

The tool ships with 32 enacted appropriations bills across 4 congresses (116th–119th), covering FY2019 through FY2026. Every major bill type is represented — omnibus, minibus, continuing resolutions, supplementals, and authorizations. See the Recipes & Demos page for the full bill inventory, or run congress-approp summary --dir data to see them all.

Each bill directory includes the source XML, extracted provisions (extraction.json), verification report, extraction metadata, TAS mapping, bill metadata, and pre-computed embeddings. No API keys are required to query this data.
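Because extraction.json is plain JSON, you can gather every provision with nothing but the standard library. A minimal Python sketch; the directory layout and the top-level "provisions" key are assumptions based on the file list above:

```python
import json
from pathlib import Path

def load_provisions(data_dir: str) -> list[dict]:
    """Collect provision objects from every bill directory under data_dir.

    Assumes each bill directory contains an extraction.json whose
    "provisions" key holds a list of provision objects.
    """
    provisions: list[dict] = []
    for path in sorted(Path(data_dir).glob("*/extraction.json")):
        provisions.extend(json.loads(path.read_text()).get("provisions", []))
    return provisions

# Against a clone of the repository this prints the full provision count;
# with no data/ directory present it harmlessly prints 0.
print(len(load_provisions("data")))
```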

Five Things You Can Do Right Now

All of these work immediately with the included example data — no API keys needed.

1. See budget totals for all included bills:

congress-approp summary --dir data

Shows each bill’s provision count, gross budget authority, rescissions, and net budget authority in a formatted table.

2. Search all appropriations provisions:

congress-approp search --dir data --type appropriation

Lists every appropriation-type provision across all bills with account name, amount, division, and agency.

3. Find FEMA funding:

congress-approp search --dir data --keyword "Federal Emergency Management"

Searches provision text for any mention of FEMA across all bills.

4. See what the continuing resolution changed:

congress-approp search --dir data/118-hr5860 --type cr_substitution

Shows the 13 “anomalies” — programs where the CR set a different funding level instead of continuing at the prior-year rate.

5. Audit verification status:

congress-approp audit --dir data

Displays a detailed verification breakdown for each bill: how many dollar amounts were verified, how many raw text excerpts matched the source, and the completeness coverage metric.

Who This Is For

congress-approp is built for anyone who needs to work with the details of federal appropriations bills — not just the headline numbers, but the individual provisions. This chapter describes five audiences and how each can get the most out of the tool.


Journalists & Policy Researchers

What you’d use this for:

  • Fact-checking spending claims. A press release says “Congress cut Program X by 15%.” You can pull up every provision mentioning that program, compare the dollar amounts to the prior year’s bill, and confirm or refute the claim against the enrolled bill text — not a summary or a committee report, but the law itself.
  • Comparing spending across fiscal years. “How did THUD funding change from FY2024 to FY2026?” Use compare --base-fy 2024 --current-fy 2026 --subcommittee thud and get a per-account comparison: Tenant-Based Rental Assistance up $6.1B (+18.7%), Capital Investment Grants down $505M. No need to know which bills or divisions to look at — the tool resolves that automatically.
  • Finding provisions by topic. You’re writing a story about opioid treatment funding. Semantic search finds relevant provisions even when the bill text says “Substance Use Treatment and Prevention” instead of “opioid.” Combine with --fy 2026 --subcommittee labor-hhs to scope results to a specific year and jurisdiction.
  • Separating advance from current-year spending. 79.5% of MilCon-VA budget authority is advance appropriations for the next fiscal year. Without --show-advance, a reporter comparing year-over-year VA spending would be off by hundreds of billions of dollars. The tool flags this automatically.
  • Tracing a program across all bills. Use relate 118-hr9468:0 --fy-timeline to see VA Compensation and Pensions across FY2024–FY2026, with current/advance/supplemental split per year and links to every matching provision.

Start here: Getting Started → Find Spending on a Topic → Compare Two Bills → Enrich Bills with Metadata

API keys needed: None for querying pre-extracted example data (including FY filtering, subcommittee scoping, advance splits, and relate). OPENAI_API_KEY if you want semantic (meaning-based) search. CONGRESS_API_KEY + ANTHROPIC_API_KEY if you want to download and extract additional bills yourself.


Congressional Staffers & Analysts

What you’d use this for:

  • Tracking program funding across bills. Use relate to trace a specific account — say, VA Compensation and Pensions — across every bill in the dataset, with a fiscal year timeline showing the current-year, advance, and supplemental split. Save the matches as persistent links with link accept so you can reuse them in future comparisons.
  • Subcommittee-level analysis. “What’s the FY2026 Defense budget?” Use summary --fy 2026 --subcommittee defense and get $836B in budget authority from H.R. 7148 Division A. The tool maps division letters to canonical jurisdictions automatically — Division A means Defense in H.R. 7148 but CJS in H.R. 6938.
  • Identifying CR anomalies. Continuing resolutions fund the government at prior-year rates except for specific anomalies. The tool extracts every cr_substitution as structured data so you can see exactly which programs got different treatment: congress-approp search --dir data/118-hr5860 --type cr_substitution.
  • Enriched bill classifications. The tool distinguishes omnibus (5+ subcommittees), minibus (2–4), full-year CR with appropriations (like H.R. 1968 with $1.786T in appropriations alongside a CR mechanism), and supplementals — not just the raw LLM classification.
  • Exporting for briefings and spreadsheets. Every query command supports --format csv output. Pipe it to a file and open it in Excel: congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data --format csv > thud_compare.csv.

Start here: Getting Started → Compare Two Bills → Enrich Bills with Metadata → Track a Program Across Bills

API keys needed: None for querying pre-extracted data (including FY filtering, subcommittee scoping, advance splits, relate, and link management). Most staffers won’t need to run extractions themselves — the included example data covers 32 enacted bills across FY2019–FY2026.


Data Scientists & Developers

What you’d use this for:

  • Building dashboards and visualizations. The --format json and --format jsonl output modes give you machine-readable provision data ready for ingestion into dashboards, notebooks, or databases. Every provision includes structured fields for amount, agency, account, division, section, provision type, and more.
  • Integrating into data pipelines. congress-approp is both a CLI tool and a Rust library (congress_appropriations). You can call it from scripts via the CLI or embed it directly in Rust projects via the library API. The JSON schema is stable within major versions.
  • Extending with new provision types or analysis. The extraction schema supports 11 provision types today. If you need to capture something new — say, a specific category of earmark or a new kind of spending limitation — the Adding a New Provision Type guide walks you through it.

Start here: Getting Started → Export Data for Spreadsheets and Scripts → Use the Library API from Rust → Architecture Overview

API keys needed: Depends on your workflow. None for querying existing extractions. OPENAI_API_KEY for generating embeddings (semantic search). CONGRESS_API_KEY + ANTHROPIC_API_KEY for downloading and extracting new bills.


Auditors & Oversight Staff

What you’d use this for:

  • Validating extracted numbers. The audit command gives you a per-bill breakdown of verification status: how many dollar amounts were found in the source text, how many raw text excerpts matched byte-for-byte, and a completeness metric showing what percentage of dollar strings in the source were accounted for. Across the included dataset, 99.995% of dollar amounts are verified against the source text. See Accuracy Metrics for the full breakdown.
  • Assessing extraction completeness. The verification report flags any dollar amount that appears in the source XML but isn’t captured by an extracted provision. A completeness percentage below 100% doesn’t necessarily indicate a missed provision — many dollar strings in bill text are statutory cross-references, loan guarantee ceilings, or old amounts being struck by amendments — but it gives you a starting point for investigation.
  • Tracing numbers to source. Every verified dollar amount includes a character position in the source text. Every provision includes raw_text that can be matched against the bill XML. You can independently confirm any number the tool reports by opening the source file and checking the indicated position.
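That independent check is trivial to script. A hedged Python sketch — the check_span helper is hypothetical, and real source_span offsets are byte positions, which coincide with string indices only for ASCII text:

```python
def check_span(source: str, start: int, end: int, expected: str) -> bool:
    """True if the slice source[start:end] is exactly the expected excerpt.

    In the real data, start/end would come from a provision's source_span;
    here we derive them inline so the example is self-contained.
    """
    return source[start:end] == expected

bill_text = ("For an additional amount for ''Compensation and Pensions'', "
             "$2,285,513,000, to remain available until expended.")
amount = "$2,285,513,000"
start = bill_text.index(amount)  # the dataset supplies this offset directly
print(check_span(bill_text, start, start + len(amount), amount))  # prints True
```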

Start here: Getting Started → Verify Extraction Accuracy → LLM Reliability and Guardrails

API keys needed: None. All verification and audit operations work entirely offline against already-extracted data.


Contributors

What you’d use this for:

  • Adding features. The tool is open source under MIT/Apache-2.0. Whether you want to add a new CLI subcommand, support a new bill format, or improve the extraction prompt, the contributor guides walk you through the codebase and conventions.
  • Fixing bugs. The Testing Strategy chapter explains how the test suite is structured — including golden-file tests against the example bills — so you can reproduce issues and verify fixes.
  • Understanding the architecture. The Architecture Overview and Code Map chapters explain how the pipeline stages connect, where each module lives, and how data flows from XML download through LLM extraction and verification to query output.

Start here: Architecture Overview → Code Map → Testing Strategy → Style Guide and Conventions

API keys needed: CONGRESS_API_KEY + ANTHROPIC_API_KEY if you’re working on download or extraction features. OPENAI_API_KEY if you’re working on embedding or semantic search features. None if you’re working on query, verification, or CLI features — the example data is sufficient.

How Federal Appropriations Work

This chapter covers the essentials of federal appropriations — fiscal years, bill types, provision structure, and key terminology. Readers already familiar with the appropriations process can skip to the tutorials.

The Federal Budget in 60 Seconds

The U.S. federal government spends roughly $6.7 trillion per year. That breaks down into three major categories:

| Category | Share | What It Covers |
|---|---|---|
| Mandatory spending | ~63% | Social Security, Medicare, Medicaid, SNAP, and other programs where spending is determined by eligibility rules set in permanent law — not annual votes |
| Discretionary spending | ~26% | Everything Congress votes on each year through appropriations bills: defense, veterans’ health care, scientific research, federal law enforcement, national parks, foreign aid, and thousands of other programs |
| Net interest | ~11% | Interest payments on the national debt |

This tool covers the 26% — discretionary spending — plus certain mandatory spending lines that appear as appropriation provisions in the bill text (for example, SNAP funding appears as a line item in the Agriculture appropriations division even though it’s technically mandatory spending). That’s why the budget authority total for H.R. 4366 is ~$921 billion, not the ~$1.7 trillion figure you’ll sometimes see for total discretionary spending (which spans all twelve bills, including Defense), and certainly not the ~$6.7 trillion total federal budget.

The Fiscal Year

The federal fiscal year runs from October 1 through September 30. It’s named for the calendar year in which it ends, not the one in which it begins. So:

  • FY2024 = October 1, 2023 – September 30, 2024
  • FY2025 = October 1, 2024 – September 30, 2025

Bills are labeled by the fiscal year they fund, not the calendar year they were enacted in. The Consolidated Appropriations Act, 2024 (H.R. 4366) was signed into law on March 23, 2024 — nearly six months into the fiscal year it was supposed to fund from the start.
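The naming rule reduces to one line of arithmetic. A small illustrative helper:

```python
from datetime import date

def fiscal_year(d: date) -> int:
    """Federal fiscal year containing date d.

    The FY is named for the calendar year in which it ends, so any
    date on or after October 1 belongs to the next calendar year's FY.
    """
    return d.year + 1 if d.month >= 10 else d.year

print(fiscal_year(date(2023, 10, 1)))   # prints 2024 — first day of FY2024
print(fiscal_year(date(2024, 3, 23)))   # prints 2024 — H.R. 4366 was signed mid-FY2024
```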

The Twelve Appropriations Bills

Each year, Congress is supposed to pass twelve individual appropriations bills, one for each subcommittee of the House and Senate Appropriations Committees:

  1. Agriculture, Rural Development, FDA
  2. Commerce, Justice, Science (CJS)
  3. Defense
  4. Energy and Water Development
  5. Financial Services and General Government
  6. Homeland Security
  7. Interior, Environment
  8. Labor, Health and Human Services, Education (Labor-HHS)
  9. Legislative Branch
  10. Military Construction, Veterans Affairs (MilCon-VA)
  11. State, Foreign Operations
  12. Transportation, Housing and Urban Development (THUD)

In practice, Congress rarely passes all twelve on time. Instead, it bundles them:

  • An omnibus packages all (or nearly all) twelve bills into a single piece of legislation.
  • A minibus bundles a few of the twelve together.
  • Individual bills are occasionally passed on their own, but this has become increasingly rare.

When none of the twelve are done by October 1, Congress passes a continuing resolution to keep the government funded temporarily while it finishes negotiations.

Bill Types

The included dataset covers 32 enacted appropriations bills spanning all major bill types. Here’s what each one is, with the real example from this tool:

Regular / Omnibus

A regular appropriations bill provides new funding for one of the twelve subcommittee jurisdictions for the coming fiscal year. An omnibus combines multiple regular bills into one legislative vehicle, organized into lettered divisions (Division A, Division B, etc.). H.R. 4366, the Consolidated Appropriations Act, 2024, is an omnibus covering MilCon-VA, Agriculture, CJS, Energy-Water, Interior, THUD, and other matters across multiple divisions. It contains 2,323 provisions and authorizes $921 billion in budget authority.

Continuing Resolution

A continuing resolution (CR) provides temporary funding — usually at the prior fiscal year’s rate — for agencies whose regular appropriations bills haven’t been enacted yet. Most provisions in a CR simply say “continue at last year’s level,” but specific programs may get different treatment through anomalies (formally called CR substitutions). H.R. 5860, the Continuing Appropriations Act, 2024, contains 136 provisions including 13 CR substitutions — programs where Congress set a specific dollar amount rather than defaulting to the prior-year rate. It also includes mandatory spending extensions and other legislative riders.

Supplemental

A supplemental appropriation provides additional funding outside the regular annual cycle, typically in response to emergencies — natural disasters, military operations, public health crises, or (in this case) an unexpected funding shortfall. H.R. 9468, the Veterans Benefits Continuity and Accountability Supplemental Appropriations Act, 2024, contains 7 provisions providing $2.9 billion for VA Compensation and Pensions and Readjustment Benefits, plus reporting requirements and an Inspector General review.

Rescissions

A rescission bill cancels previously enacted budget authority. Rescissions also appear as individual provisions within larger bills — H.R. 4366 includes $24.7 billion in rescissions alongside its new appropriations.

Anatomy of a Provision

To see how bill text becomes structured data, let’s walk through a real example from H.R. 9468. Here’s what Congress wrote:

For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended.

And here is the structured JSON that congress-approp extracted from that sentence:

{
  "provision_type": "appropriation",
  "agency": "Department of Veterans Affairs",
  "account_name": "Compensation and Pensions",
  "amount": {
    "value": { "kind": "specific", "dollars": 2285513000 },
    "semantics": "new_budget_authority",
    "text_as_written": "$2,285,513,000"
  },
  "detail_level": "top_level",
  "availability": "to remain available until expended",
  "fiscal_year": 2024,
  "raw_text": "For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended.",
  "confidence": 0.99
}

Here’s what each piece means:

  • account_name: Pulled from the double-quoted name in the bill text (the ''Compensation and Pensions'' delimiters are a legislative drafting convention).
  • amount: The dollar value is parsed to an integer (2285513000), the original text is preserved ("$2,285,513,000"), and the meaning is classified — this is new_budget_authority, meaning Congress is granting new spending authority, not referencing an existing amount.
  • detail_level: This is a top_level appropriation — the full amount for the account, not a sub-allocation (“of which $X for Y”).
  • availability: Captured from the bill text. “To remain available until expended” means this is no-year money — the agency can spend it over multiple fiscal years, unlike annual funds that expire at the end of the fiscal year.
  • raw_text: The original bill text, verified against the source XML.
  • Verification: The string $2,285,513,000 was found at character position 431 in the source XML. The raw_text is a byte-identical substring of the source starting at position 371.

Key Concepts

Budget Authority vs. Outlays

Budget authority (BA) is what Congress authorizes — the legal permission for agencies to enter into obligations (sign contracts, award grants, hire staff). Outlays are what the Treasury actually disburses. The two differ because agencies often obligate funds in one year but spend them over several years (especially for construction, procurement, and multi-year grants).

This tool reports budget authority, because that’s what the bill text specifies. When you see “$846B” for H.R. 4366, that’s the sum of new_budget_authority provisions at the top_level and line_item detail levels — what Congress authorized, not what agencies will spend this year.

Sub-Allocations Are Not Additional Money

Many provisions include “of which” clauses: “For the Office of Science, $8,220,000,000, of which $300,000,000 shall be for fusion energy research.” The $300 million is a sub-allocation — a directive about how to spend part of the $8.2 billion, not money on top of it. The tool captures sub-allocations at detail_level: "sub_allocation" and correctly excludes them from budget authority totals to avoid double-counting.
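The exclusion rule can be sketched as a small aggregation. Field names follow the provision JSON shown earlier in this chapter; the sample records and the rescission handling are simplified for illustration:

```python
def net_budget_authority(provisions: list[dict]) -> int:
    """Net BA: new budget authority at top_level/line_item detail,
    minus rescissions. A simplified sketch of the totaling rule."""
    gross = sum(
        p["amount"]["value"]["dollars"]
        for p in provisions
        if p.get("amount", {}).get("semantics") == "new_budget_authority"
        and p.get("detail_level") in ("top_level", "line_item")
    )
    rescinded = sum(
        p["amount"]["value"]["dollars"]
        for p in provisions
        if p.get("provision_type") == "rescission"
    )
    return gross - rescinded

sample = [
    {"provision_type": "appropriation", "detail_level": "top_level",
     "amount": {"value": {"kind": "specific", "dollars": 8_220_000_000},
                "semantics": "new_budget_authority"}},
    # "of which" sub-allocation: part of the $8.22B above, not extra money
    {"provision_type": "appropriation", "detail_level": "sub_allocation",
     "amount": {"value": {"kind": "specific", "dollars": 300_000_000},
                "semantics": "new_budget_authority"}},
    {"provision_type": "rescission", "detail_level": "top_level",
     "amount": {"value": {"kind": "specific", "dollars": 500_000_000},
                "semantics": "rescission"}},
]
print(net_budget_authority(sample))  # prints 7720000000
```

Counting the $300M sub-allocation would double-count money already inside the $8.22B top-level appropriation, which is exactly the mistake the detail_level field exists to prevent.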

Advance Appropriations

Sometimes Congress enacts budget authority in this year’s bill but makes it available starting in the next fiscal year. These advance appropriations are included in the bill’s budget authority total (because the bill does enact them) but are noted in the provision’s notes field.

Congress Numbers

Each Congress spans two calendar years. The 118th Congress served from January 2023 through January 2025; the 119th Congress runs from January 2025 through January 2027. Bills are identified by their Congress — H.R. 4366 of the 118th Congress is an entirely different bill from H.R. 4366 of any other Congress. The bills walked through in this chapter (H.R. 4366, H.R. 5860, and H.R. 9468) are all from the 118th Congress.

Essential Glossary

These five terms come up throughout the book. A comprehensive glossary is available in the Glossary reference chapter.

| Term | Definition |
|---|---|
| Budget authority | The legal authority Congress grants to federal agencies to enter into financial obligations. This is the dollar figure in an appropriation provision — what Congress authorizes, as distinct from what agencies ultimately spend (outlays). |
| Provision | A single identifiable directive in an appropriations bill: an appropriation, a rescission, a spending limitation, a transfer authority, a CR anomaly, a policy rider, or any other discrete instruction. This is the fundamental unit of data in congress-approp. |
| Enrolled | The final text of a bill as passed by both the House and Senate and presented to the President for signature. This is the version congress-approp downloads — the authoritative text that becomes law. |
| Rescission | A provision that cancels previously enacted budget authority. A rescission of $500 million reduces the net budget authority by that amount. In the summary table, rescissions appear in their own column and are subtracted to produce the Net BA figure. |
| Continuing resolution (CR) | Temporary legislation that funds the government at the prior year’s rate for agencies whose regular appropriations bills have not been enacted. Specific exceptions, called anomalies (or CR substitutions), set different funding levels for particular programs. |

Installation

You will need: A computer running macOS or Linux, and an internet connection.

You will learn: How to install congress-approp and verify it’s working.

Install Rust

congress-approp is written in Rust and requires Rust 1.93 or later. If you don’t have Rust installed, the easiest way is via rustup:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

If you already have Rust, make sure it’s up to date:

rustup update

Verify your version:

rustc --version
# Should show 1.93.0 or later

Cloning the repository gives you the full dataset — 32 enacted appropriations bills (FY2019–FY2026) with pre-computed embeddings, ready to query with no API keys.

git clone https://github.com/cgorski/congress-appropriations.git
cd congress-appropriations
cargo install --path .

This compiles the project and places the congress-approp binary on your PATH. The first build takes a few minutes; subsequent builds are much faster.

Install from crates.io

If you just want the binary without cloning the full repository:

cargo install congress-appropriations

Note: The crates.io package does not include the data/ directory or pre-computed embedding vectors because they exceed the crates.io upload limit. If you install via crates.io, clone the repository separately to get the dataset, or download and extract your own bills.

Verify the Installation

Run the summary command against the included data:

congress-approp summary --dir data

You should see a table listing all 32 bills with their provision counts, budget authority, and rescissions. The last line confirms data integrity:

0 dollar amounts unverified across all bills. Run `congress-approp audit` for detailed verification.

If you see 32 bills and 34,568 total provisions across FY2019–FY2026, everything is working. You’re ready to start querying.

Tip: If you’re running from the cloned repo directory, data is a relative path that points to the included dataset. If you installed via cargo install and are running from a different directory, provide the full path to the data/ directory inside your clone.

API Keys (Optional)

No API keys are needed to query the pre-extracted dataset. Keys are only required if you want to download new bills, extract provisions from them, or use semantic search:

  • CONGRESS_API_KEY — downloading bill XML (download command). Free — sign up at api.congress.gov.
  • ANTHROPIC_API_KEY — extracting provisions (extract command). Sign up at console.anthropic.com.
  • OPENAI_API_KEY — generating embeddings (embed command) and semantic search (search --semantic). Sign up at platform.openai.com.

Set them in your shell when needed:

export CONGRESS_API_KEY="your-key-here"
export ANTHROPIC_API_KEY="your-key-here"
export OPENAI_API_KEY="your-key-here"

See Environment Variables and API Keys for details.

Rebuilding After Source Changes

If you modify the source code (or pull updates), rebuild and reinstall with:

cargo install --path .

For development iteration without reinstalling:

cargo build --release
./target/release/congress-approp summary --dir data

Next Steps

Next: Your First Query.

Your First Query

You will need: congress-approp installed (Installation), access to the data/ directory from the cloned repository.

You will learn: How to explore the included FY2024 appropriations data using five core commands — no API keys required.

This chapter walks through five core commands using the included dataset. Every command shown here produces output you can verify against the data files.

Step 1: See What Bills You Have

Start with the summary command to get an overview:

congress-approp summary --dir data
┌───────────┬───────────────────────┬────────────┬─────────────────┬─────────────────┬─────────────────┐
│ Bill      ┆ Classification        ┆ Provisions ┆ Budget Auth ($) ┆ Rescissions ($) ┆      Net BA ($) │
╞═══════════╪═══════════════════════╪════════════╪═════════════════╪═════════════════╪═════════════════╡
│ H.R. 4366 ┆ Omnibus               ┆       2364 ┆ 846,137,099,554 ┆  24,659,349,709 ┆ 821,477,749,845 │
│ H.R. 5860 ┆ Continuing Resolution ┆        130 ┆  16,000,000,000 ┆               0 ┆  16,000,000,000 │
│ H.R. 9468 ┆ Supplemental          ┆          7 ┆   2,882,482,000 ┆               0 ┆   2,882,482,000 │
│ TOTAL     ┆                       ┆       2501 ┆ 865,019,581,554 ┆  24,659,349,709 ┆ 840,360,231,845 │
└───────────┴───────────────────────┴────────────┴─────────────────┴─────────────────┴─────────────────┘

0 dollar amounts unverified across all bills. Run `congress-approp audit` for detailed verification.

Here’s what each column means:

  • Bill — the bill identifier (e.g., H.R. 4366)
  • Classification — what kind of appropriations bill: Omnibus, Continuing Resolution, or Supplemental
  • Provisions — total number of provisions extracted from the bill
  • Budget Auth ($) — sum of all provisions with new_budget_authority semantics — what Congress authorized agencies to spend. Computed from the actual provisions, not from any LLM-generated summary
  • Rescissions ($) — sum of all rescission provisions — money Congress is taking back from prior appropriations
  • Net BA ($) — Budget Authority minus Rescissions — the net new spending authority
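
The Net BA arithmetic can be checked directly; here is a quick sketch in Python with the figures hard-coded from the summary table above:

```python
# Recompute Net BA = Budget Auth - Rescissions from the summary rows above.
rows = {
    "H.R. 4366": (846_137_099_554, 24_659_349_709),
    "H.R. 5860": (16_000_000_000, 0),
    "H.R. 9468": (2_882_482_000, 0),
}

net = {bill: ba - resc for bill, (ba, resc) in rows.items()}
assert net["H.R. 4366"] == 821_477_749_845
assert sum(net.values()) == 840_360_231_845  # matches the TOTAL row
```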

The footer line — “0 dollar amounts unverified” — tells you that every extracted dollar amount was confirmed to exist in the source bill text. This is the headline trust metric.

Step 2: Search for Provisions

The search command finds provisions matching your criteria. Let’s start broad — all appropriation-type provisions across all bills:

congress-approp search --dir data --type appropriation

This returns a table with hundreds of rows. Let’s narrow it down. Find all provisions mentioning FEMA:

congress-approp search --dir data --keyword "Federal Emergency Management"
┌───┬───────────┬───────────────┬───────────────────────────────────────────────┬────────────────┬──────────┬─────┐
│ $ ┆ Bill      ┆ Type          ┆ Description / Account                         ┆     Amount ($) ┆ Section  ┆ Div │
╞═══╪═══════════╪═══════════════╪═══════════════════════════════════════════════╪════════════════╪══════════╪═════╡
│   ┆ H.R. 5860 ┆ other         ┆ Allows FEMA Disaster Relief Fund to be appor… ┆              — ┆ SEC. 128 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ appropriation ┆ Federal Emergency Management Agency—Disast…   ┆ 16,000,000,000 ┆ SEC. 129 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ appropriation ┆ Office of the Inspector General—Operations…   ┆      2,000,000 ┆ SEC. 129 ┆ A   │
└───┴───────────┴───────────────┴───────────────────────────────────────────────┴────────────────┴──────────┴─────┘
3 provisions found

$ = Amount status: ✓ found (unique), ≈ found (multiple matches), ✗ not found

Understanding the $ column — the verification status for each provision’s dollar amount:

  • ✓ — dollar amount string found at exactly one position in the source text — highest confidence
  • ≈ — dollar amount found at multiple positions (common for round numbers like $5,000,000) — amount is correct but can’t be pinned to a unique location
  • ✗ — dollar amount not found in the source text — needs manual review
  • (blank) — provision doesn’t carry a dollar amount (riders, directives)
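
The logic behind these symbols is essentially an occurrence count of the dollar string in the source text. A minimal sketch of the idea (the classify helper is illustrative, not the tool’s actual code):

```python
def classify(amount_str: str, source: str) -> str:
    """Map the occurrence count of a dollar string to a verification status."""
    n = source.count(amount_str)
    if n == 1:
        return "found"       # unique position: highest confidence
    if n > 1:
        return "ambiguous"   # multiple positions: amount correct, location unpinned
    return "not_found"       # needs manual review

bill_text = "For 'Disaster Relief Fund', $16,000,000,000; also $2,000,000 and $2,000,000."
assert classify("$16,000,000,000", bill_text) == "found"
assert classify("$2,000,000", bill_text) == "ambiguous"
assert classify("$7,777", bill_text) == "not_found"
```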

Now try searching by account name. This matches against the structured account_name field rather than searching the full text:

congress-approp search --dir data --account "Child Nutrition"
┌───┬───────────┬───────────────┬─────────────────────────────────────────────┬────────────────┬─────────┬─────┐
│ $ ┆ Bill      ┆ Type          ┆ Description / Account                       ┆     Amount ($) ┆ Section ┆ Div │
╞═══╪═══════════╪═══════════════╪═════════════════════════════════════════════╪════════════════╪═════════╪═════╡
│ ✓ ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆ 33,266,226,000 ┆         ┆ B   │
│ ✓ ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆     18,004,000 ┆         ┆ B   │
│ ...                                                                                                          │
└───┴───────────┴───────────────┴─────────────────────────────────────────────┴────────────────┴─────────┴─────┘

The top result — $33.27 billion for Child Nutrition Programs — is the top-level appropriation. The smaller amounts below it are sub-allocations and reference amounts within the same account.

You can combine filters. For example, find all appropriations over $1 billion in Division A (MilCon-VA):

congress-approp search --dir data/118-hr4366 --type appropriation --division A --min-dollars 1000000000

Step 3: Look at the VA Supplemental

The smallest bill, H.R. 9468, is a good place to see the full picture. It has only 7 provisions:

congress-approp search --dir data/118-hr9468
┌───┬───────────┬───────────────┬───────────────────────────────────────────────┬───────────────┬──────────┬─────┐
│ $ ┆ Bill      ┆ Type          ┆ Description / Account                         ┆    Amount ($) ┆ Section  ┆ Div │
╞═══╪═══════════╪═══════════════╪═══════════════════════════════════════════════╪═══════════════╪══════════╪═════╡
│ ✓ ┆ H.R. 9468 ┆ appropriation ┆ Compensation and Pensions                     ┆ 2,285,513,000 ┆          ┆     │
│ ✓ ┆ H.R. 9468 ┆ appropriation ┆ Readjustment Benefits                         ┆   596,969,000 ┆          ┆     │
│   ┆ H.R. 9468 ┆ rider         ┆ Establishes that each amount appropriated o…  ┆             — ┆ SEC. 101 ┆     │
│   ┆ H.R. 9468 ┆ rider         ┆ Unless otherwise provided, the additional a…  ┆             — ┆ SEC. 102 ┆     │
│   ┆ H.R. 9468 ┆ directive     ┆ Requires the Secretary of Veterans Affairs …  ┆             — ┆ SEC. 103 ┆     │
│   ┆ H.R. 9468 ┆ directive     ┆ Requires the Secretary of Veterans Affairs …  ┆             — ┆ SEC. 103 ┆     │
│   ┆ H.R. 9468 ┆ directive     ┆ Requires the Inspector General of the Depar…  ┆             — ┆ SEC. 104 ┆     │
└───┴───────────┴───────────────┴───────────────────────────────────────────────┴───────────────┴──────────┴─────┘
7 provisions found

This is the complete bill: two appropriations ($2.3B for Comp & Pensions, $597M for Readjustment Benefits), two policy riders (SEC. 101 and 102 establishing that these amounts are additional to regular appropriations), and three directives requiring the VA Secretary and Inspector General to submit reports about the funding shortfall that necessitated this supplemental.

Notice how the two appropriations have ✓ in the dollar column, while the riders and directives show no symbol — they don’t carry dollar amounts, so there’s nothing to verify.

Step 4: See What the CR Changed

Continuing resolutions normally fund agencies at prior-year rates, but specific programs can get different treatment through “anomalies” — formally called CR substitutions. These are provisions that say “substitute $X for $Y,” setting a new level instead of continuing the old one.

congress-approp search --dir data/118-hr5860 --type cr_substitution
┌───┬───────────┬──────────────────────────────────────────┬───────────────┬───────────────┬──────────────┬──────────┬─────┐
│ $ ┆ Bill      ┆ Account                                  ┆       New ($) ┆       Old ($) ┆    Delta ($) ┆ Section  ┆ Div │
╞═══╪═══════════╪══════════════════════════════════════════╪═══════════════╪═══════════════╪══════════════╪══════════╪═════╡
│ ✓ ┆ H.R. 5860 ┆ Rural Housing Service—Rural Community…   ┆    25,300,000 ┆    75,300,000 ┆  -50,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Rural Utilities Service—Rural Water a…   ┆    60,000,000 ┆   325,000,000 ┆ -265,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆                                          ┆   122,572,000 ┆   705,768,000 ┆ -583,196,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ National Science Foundation—STEM Educ…   ┆    92,000,000 ┆   217,000,000 ┆ -125,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ National Oceanic and Atmospheric Admini… ┆    42,000,000 ┆    62,000,000 ┆  -20,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ National Science Foundation—Research …   ┆   608,162,000 ┆   818,162,000 ┆ -210,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Department of State—Administration of…   ┆    87,054,000 ┆   147,054,000 ┆  -60,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Bilateral Economic Assistance—Funds A…   ┆   637,902,000 ┆   937,902,000 ┆ -300,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Bilateral Economic Assistance—Departm…   ┆   915,048,000 ┆ 1,535,048,000 ┆ -620,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ International Security Assistance—Dep…   ┆    74,996,000 ┆   374,996,000 ┆ -300,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Office of Personnel Management—Salari…   ┆   219,076,000 ┆   190,784,000 ┆  +28,292,000 ┆ SEC. 126 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Department of Transportation—Federal …   ┆   617,000,000 ┆   570,000,000 ┆  +47,000,000 ┆ SEC. 137 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Department of Transportation—Federal …   ┆ 2,174,200,000 ┆ 2,221,200,000 ┆  -47,000,000 ┆ SEC. 137 ┆ A   │
└───┴───────────┴──────────────────────────────────────────┴───────────────┴───────────────┴──────────────┴──────────┴─────┘
13 provisions found

Notice how the table automatically changes shape for CR substitutions — it shows New, Old, and Delta columns instead of a single Amount. This tells you exactly which programs Congress funded above or below the prior-year rate:

  • Most programs were cut: Migration and Refugee Assistance lost $620 million (-40.4%), NSF Research lost $210 million (-25.7%)
  • Two programs increased: OPM Salaries and Expenses gained $28 million (+14.8%) and FAA Facilities and Equipment gained $47 million (+8.2%)
  • Every dollar amount has ✓ — both the new and old amounts were verified in the source text
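
The deltas and percentages cited above can be recomputed directly from the New ($) and Old ($) columns (account labels abbreviated here):

```python
# Recompute Delta and percent change from the New ($) and Old ($) columns.
subs = {
    "Migration and Refugee Assistance": (915_048_000, 1_535_048_000),
    "NSF Research and Related Activities": (608_162_000, 818_162_000),
    "OPM Salaries and Expenses": (219_076_000, 190_784_000),
}

changes = {}
for account, (new, old) in subs.items():
    delta = new - old                          # negative = cut, positive = increase
    changes[account] = (delta, round(100 * delta / old, 1))

assert changes["Migration and Refugee Assistance"] == (-620_000_000, -40.4)
assert changes["OPM Salaries and Expenses"] == (28_292_000, 14.8)
```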

Step 5: Check Data Quality

The audit command shows how well the extraction held up against the source text:

congress-approp audit --dir data
┌───────────┬────────────┬──────────┬──────────┬───────┬───────┬──────────┬───────────┬──────────┬──────────┐
│ Bill      ┆ Provisions ┆ Verified ┆ NotFound ┆ Ambig ┆ Exact ┆ NormText ┆ Spaceless ┆ TextMiss ┆ Coverage │
╞═══════════╪════════════╪══════════╪══════════╪═══════╪═══════╪══════════╪═══════════╪══════════╪══════════╡
│ H.R. 4366 ┆       2364 ┆      762 ┆        0 ┆   723 ┆  2285 ┆       59 ┆         0 ┆       20 ┆    94.2% │
│ H.R. 5860 ┆        130 ┆       33 ┆        0 ┆     2 ┆   102 ┆       12 ┆         0 ┆       16 ┆    61.1% │
│ H.R. 9468 ┆          7 ┆        2 ┆        0 ┆     0 ┆     5 ┆        0 ┆         0 ┆        2 ┆   100.0% │
│ TOTAL     ┆       2501 ┆      797 ┆        0 ┆   725 ┆  2392 ┆       71 ┆         0 ┆       38 ┆          │
└───────────┴────────────┴──────────┴──────────┴───────┴───────┴──────────┴───────────┴──────────┴──────────┘

The key number: NotFound = 0 for every bill. Every dollar amount the tool extracted actually exists in the source bill text. Here’s a quick guide to the other columns:

  • Verified — dollar amount found at exactly one position in source. Higher is better.
  • NotFound — dollar amounts NOT found in source. Should be 0.
  • Ambig — dollar amount found at multiple positions (e.g., “$5,000,000” appears 50 times). Not a problem — amount is correct.
  • Exact — raw_text excerpt is byte-identical to source. Higher is better.
  • NormText — raw_text matches after whitespace/quote normalization. Minor formatting difference.
  • TextMiss — raw_text not found at any matching tier. Review manually.
  • Coverage — percentage of dollar strings in source text matched to a provision. 100% is ideal; <100% is often fine.

For a deeper dive into what these numbers mean, see Verify Extraction Accuracy and What Coverage Means.

Step 6: Export to JSON

Every command supports --format json for machine-readable output. This is useful for piping to jq, loading into Python, or just seeing the full data:

congress-approp search --dir data/118-hr9468 --type appropriation --format json
[
  {
    "account_name": "Compensation and Pensions",
    "agency": "Department of Veterans Affairs",
    "amount_status": "found",
    "bill": "H.R. 9468",
    "description": "Compensation and Pensions",
    "division": "",
    "dollars": 2285513000,
    "match_tier": "exact",
    "old_dollars": null,
    "provision_index": 0,
    "provision_type": "appropriation",
    "quality": "strong",
    "raw_text": "For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended.",
    "section": "",
    "semantics": "new_budget_authority"
  },
  {
    "account_name": "Readjustment Benefits",
    "agency": "Department of Veterans Affairs",
    "amount_status": "found",
    "bill": "H.R. 9468",
    "description": "Readjustment Benefits",
    "division": "",
    "dollars": 596969000,
    "match_tier": "exact",
    "old_dollars": null,
    "provision_index": 1,
    "provision_type": "appropriation",
    "quality": "strong",
    "raw_text": "For an additional amount for ''Readjustment Benefits'', $596,969,000, to remain available until expended.",
    "section": "",
    "semantics": "new_budget_authority"
  }
]

The JSON output includes every field for each provision — more detail than the table can show. Key fields to know:

  • dollars: The dollar amount as an integer (no formatting)
  • semantics: What the amount means — new_budget_authority counts toward budget totals
  • raw_text: The verbatim excerpt from the bill text
  • match_tier: How closely raw_text matched the source — exact means byte-identical
  • quality: Overall quality assessment — strong, moderate, or weak
  • provision_index: Position in the bill’s provision list (useful for --similar searches)
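
For instance, the JSON array above can be loaded with Python’s standard json module and totaled by semantics (the two records below are abbreviated from the output above):

```python
import json

# Two records abbreviated from the search --format json output above.
raw = """[
  {"account_name": "Compensation and Pensions", "dollars": 2285513000,
   "semantics": "new_budget_authority"},
  {"account_name": "Readjustment Benefits", "dollars": 596969000,
   "semantics": "new_budget_authority"}
]"""

provisions = json.loads(raw)
total = sum(p["dollars"] for p in provisions
            if p["semantics"] == "new_budget_authority")
assert total == 2_882_482_000  # matches H.R. 9468's Budget Auth in the summary
```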

Other output formats are also available: --format csv for spreadsheets, --format jsonl for streaming one-object-per-line output. See Output Formats for details.

Enrich for Fiscal Year and Subcommittee Filtering

The example data includes pre-enriched metadata, but if you extract your own bills, run enrich to enable fiscal year and subcommittee filtering:

congress-approp enrich --dir data      # No API key needed — runs offline

Once enriched, you can scope any command to a specific fiscal year and subcommittee:

# FY2026 THUD subcommittee only
congress-approp summary --dir data --fy 2026 --subcommittee thud

# See advance vs current-year spending
congress-approp summary --dir data --fy 2026 --subcommittee milcon-va --show-advance

# Compare THUD across fiscal years
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data

# Trace one provision across all bills
congress-approp relate 118-hr9468:0 --dir data --fy-timeline

See Enrich Bills with Metadata for the full guide.

What’s Next

Next: Understanding the Output.

Understanding the Output

You will need: congress-approp installed, access to the data/ directory.

You will learn: How to read every table the tool produces — what each column means, what the symbols indicate, and how to interpret the numbers.

Before diving into tutorials and specific tasks, let’s build a solid understanding of the output formats you’ll encounter. Every command in congress-approp uses consistent conventions, but the tables adapt their shape depending on what you’re looking at.

The Summary Table

The summary command gives you the bird’s-eye view:

congress-approp summary --dir data
┌───────────┬───────────────────────┬────────────┬─────────────────┬─────────────────┬─────────────────┐
│ Bill      ┆ Classification        ┆ Provisions ┆ Budget Auth ($) ┆ Rescissions ($) ┆      Net BA ($) │
╞═══════════╪═══════════════════════╪════════════╪═════════════════╪═════════════════╪═════════════════╡
│ H.R. 4366 ┆ Omnibus               ┆       2364 ┆ 846,137,099,554 ┆  24,659,349,709 ┆ 821,477,749,845 │
│ H.R. 5860 ┆ Continuing Resolution ┆        130 ┆  16,000,000,000 ┆               0 ┆  16,000,000,000 │
│ H.R. 9468 ┆ Supplemental          ┆          7 ┆   2,882,482,000 ┆               0 ┆   2,882,482,000 │
│ TOTAL     ┆                       ┆       2501 ┆ 865,019,581,554 ┆  24,659,349,709 ┆ 840,360,231,845 │
└───────────┴───────────────────────┴────────────┴─────────────────┴─────────────────┴─────────────────┘

0 dollar amounts unverified across all bills. Run `congress-approp audit` for detailed verification.

Column-by-column

  • Bill — the bill identifier as printed in the legislation (e.g., “H.R. 4366”). The TOTAL row sums across all loaded bills.
  • Classification — the type of appropriations bill: Omnibus, Continuing Resolution, Supplemental, Regular, Minibus, or Rescissions.
  • Provisions — the total count of extracted provisions of all types — appropriations, rescissions, riders, directives, and everything else.
  • Budget Auth ($) — the sum of all provisions where the amount semantics is new_budget_authority and the detail level is top_level or line_item. Sub-allocations and proviso amounts are excluded to prevent double-counting. This number is computed from individual provisions, never from an LLM-generated summary.
  • Rescissions ($) — the absolute value sum of all provisions of type rescission with rescission semantics. This is money Congress is canceling from prior appropriations.
  • Net BA ($) — Budget Authority minus Rescissions. This is the net new spending authority enacted by the bill. For most reporting purposes, Net BA is the number you want.

The line below the table — “0 dollar amounts unverified across all bills” — is a quick trust check. It counts provisions across all loaded bills where the dollar amount string was not found in the source bill text. Zero means every extracted number was confirmed against the source. If this number is ever greater than zero, the audit command will show you exactly which provisions need review.

By-agency view

Add --by-agency to see budget authority broken down by parent department:

congress-approp summary --dir data --by-agency

This appends a second table showing every agency, its total budget authority, rescissions, and provision count, sorted by budget authority descending. For example, Department of Veterans Affairs shows ~$343B (which includes mandatory programs like Compensation and Pensions that appear as appropriation lines in the bill text).

The Search Table

The search command produces tables that adapt their columns based on what you’re searching for. This is one of the most important things to understand about the output.

Standard search table

For most searches, you see this layout:

congress-approp search --dir data/118-hr9468
┌───┬───────────┬───────────────┬───────────────────────────────────────────────┬───────────────┬──────────┬─────┐
│ $ ┆ Bill      ┆ Type          ┆ Description / Account                         ┆    Amount ($) ┆ Section  ┆ Div │
╞═══╪═══════════╪═══════════════╪═══════════════════════════════════════════════╪═══════════════╪══════════╪═════╡
│ ✓ ┆ H.R. 9468 ┆ appropriation ┆ Compensation and Pensions                     ┆ 2,285,513,000 ┆          ┆     │
│ ✓ ┆ H.R. 9468 ┆ appropriation ┆ Readjustment Benefits                         ┆   596,969,000 ┆          ┆     │
│   ┆ H.R. 9468 ┆ rider         ┆ Establishes that each amount appropriated o…  ┆             — ┆ SEC. 101 ┆     │
│   ┆ H.R. 9468 ┆ rider         ┆ Unless otherwise provided, the additional a…  ┆             — ┆ SEC. 102 ┆     │
│   ┆ H.R. 9468 ┆ directive     ┆ Requires the Secretary of Veterans Affairs …  ┆             — ┆ SEC. 103 ┆     │
│   ┆ H.R. 9468 ┆ directive     ┆ Requires the Secretary of Veterans Affairs …  ┆             — ┆ SEC. 103 ┆     │
│   ┆ H.R. 9468 ┆ directive     ┆ Requires the Inspector General of the Depar…  ┆             — ┆ SEC. 104 ┆     │
└───┴───────────┴───────────────┴───────────────────────────────────────────────┴───────────────┴──────────┴─────┘
7 provisions found

  • $ — verification status of the dollar amount (see the symbols table below)
  • Bill — which bill this provision comes from
  • Type — the provision type: appropriation, rescission, rider, directive, limitation, transfer_authority, cr_substitution, mandatory_spending_extension, directed_spending, continuing_resolution_baseline, or other
  • Description / Account — the account name for appropriations and rescissions, or a description for other provision types. Long text is truncated with …
  • Amount ($) — the dollar amount. Shows — for provisions without a dollar value (riders, directives).
  • Section — the section reference from the bill text (e.g., “SEC. 101”). Empty if the provision appears under a heading without a section number.
  • Div — the division letter for omnibus bills (e.g., “A” for MilCon-VA in H.R. 4366). Empty for bills without divisions.

The $ column — verification symbols

The leftmost column tells you the verification status of each provision’s dollar amount:

  • ✓ — the exact dollar string (e.g., $2,285,513,000) was found at one unique position in the source bill text. Should you worry? No — this is the best result.
  • ≈ — the dollar string was found at multiple positions in the source text. The amount is correct, but it can’t be pinned to a single location. No — very common for round numbers like $5,000,000, which may appear 50 times in an omnibus.
  • ✗ — the dollar string was not found in the source text. Yes — this provision needs manual review. Across the included dataset, this occurs only once in 18,584 dollar amounts (99.995% verified).
  • (blank) — the provision doesn’t carry a dollar amount (riders, directives, some policy provisions). No — nothing to verify.

CR substitution table

When you search for cr_substitution type provisions, the table automatically changes shape to show the old and new amounts:

congress-approp search --dir data/118-hr5860 --type cr_substitution
┌───┬───────────┬──────────────────────────────────────────┬───────────────┬───────────────┬──────────────┬──────────┬─────┐
│ $ ┆ Bill      ┆ Account                                  ┆       New ($) ┆       Old ($) ┆    Delta ($) ┆ Section  ┆ Div │
╞═══╪═══════════╪══════════════════════════════════════════╪═══════════════╪═══════════════╪══════════════╪══════════╪═════╡
│ ✓ ┆ H.R. 5860 ┆ Rural Housing Service—Rural Community…   ┆    25,300,000 ┆    75,300,000 ┆  -50,000,000 ┆ SEC. 101 ┆ A   │
│ ...                                                                                                                      │
│ ✓ ┆ H.R. 5860 ┆ Office of Personnel Management—Salari…   ┆   219,076,000 ┆   190,784,000 ┆  +28,292,000 ┆ SEC. 126 ┆ A   │
└───┴───────────┴──────────────────────────────────────────┴───────────────┴───────────────┴──────────────┴──────────┴─────┘
13 provisions found

Instead of a single Amount column, you get:

  • New ($) — the new dollar amount the CR substitutes in
  • Old ($) — the old dollar amount being replaced
  • Delta ($) — New minus Old. Negative means a cut, positive means an increase

Semantic search table

When you use --semantic or --similar, a Sim (similarity) column appears at the left:

┌──────┬───────────┬───────────────┬───────────────────────────────────────┬────────────────┬─────┐
│ Sim  ┆ Bill      ┆ Type          ┆ Description / Account                 ┆     Amount ($) ┆ Div │
╞══════╪═══════════╪═══════════════╪═══════════════════════════════════════╪════════════════╪═════╡
│ 0.51 ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs              ┆ 33,266,226,000 ┆ B   │
│ 0.46 ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs              ┆     10,000,000 ┆ B   │
└──────┴───────────┴───────────────┴───────────────────────────────────────┴────────────────┴─────┘

The Sim score is the cosine similarity between your query and the provision’s embedding vector, ranging from 0 to 1:

  • > 0.80 — almost certainly the same program (when comparing across bills)
  • 0.60 – 0.80 — related topic, same policy area
  • 0.45 – 0.60 — loosely related
  • < 0.45 — probably not meaningfully related

Results are sorted by similarity descending and limited to --top N (default 20).
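
Cosine similarity itself is a short computation; a minimal sketch with illustrative low-dimensional vectors (the real embedding vectors have many more dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.6, 0.8, 0.0]       # illustrative 3-d vectors only
provision = [0.6, 0.7, 0.1]

assert abs(cosine(query, query) - 1.0) < 1e-9   # identical vectors score 1.0
assert 0.0 < cosine(query, provision) < 1.0
```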

The Audit Table

The audit command provides the most detailed quality view:

congress-approp audit --dir data
┌───────────┬────────────┬──────────┬──────────┬───────┬───────┬──────────┬───────────┬──────────┬──────────┐
│ Bill      ┆ Provisions ┆ Verified ┆ NotFound ┆ Ambig ┆ Exact ┆ NormText ┆ Spaceless ┆ TextMiss ┆ Coverage │
╞═══════════╪════════════╪══════════╪══════════╪═══════╪═══════╪══════════╪═══════════╪══════════╪══════════╡
│ H.R. 4366 ┆       2364 ┆      762 ┆        0 ┆   723 ┆  2285 ┆       59 ┆         0 ┆       20 ┆    94.2% │
│ H.R. 5860 ┆        130 ┆       33 ┆        0 ┆     2 ┆   102 ┆       12 ┆         0 ┆       16 ┆    61.1% │
│ H.R. 9468 ┆          7 ┆        2 ┆        0 ┆     0 ┆     5 ┆        0 ┆         0 ┆        2 ┆   100.0% │
│ TOTAL     ┆       2501 ┆      797 ┆        0 ┆   725 ┆  2392 ┆       71 ┆         0 ┆       38 ┆          │
└───────────┴────────────┴──────────┴──────────┴───────┴───────┴──────────┴───────────┴──────────┴──────────┘

The audit table has two groups of columns: amount verification (left side) and text verification (right side).

Amount verification columns

These check whether the dollar amount string (e.g., "$2,285,513,000") exists in the source bill text:

  • Verified — provisions whose dollar string was found at exactly one position in the source. Higher is better.
  • NotFound — provisions whose dollar string was not found anywhere in the source text. Must be 0 — any value above 0 means you should investigate.
  • Ambig — provisions whose dollar string was found at multiple positions (ambiguous location but correct amount). Not a problem — common for round numbers.
The sum of Verified + Ambig equals the total number of provisions that have dollar amounts. NotFound should always be zero. Across the included example data, it is.
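
Both invariants are easy to check by hand; a quick arithmetic sketch with figures hard-coded from the H.R. 4366 audit row (the text-verification tiers also sum to the provision count):

```python
# Figures hard-coded from the H.R. 4366 audit row above.
provisions, verified, ambig = 2364, 762, 723
exact, norm_text, spaceless, text_miss = 2285, 59, 0, 20

dollar_bearing = verified + ambig   # provisions that carry a dollar amount
assert dollar_bearing == 1485
# The text-verification tiers account for every provision:
assert exact + norm_text + spaceless + text_miss == provisions
```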

Text verification columns

These check whether the raw_text excerpt (the first ~150 characters of the bill language for each provision) is a substring of the source text:

  • Exact — byte-identical substring match. The raw text was copied verbatim from the source — best case. 95.5% of provisions across the 13-bill dataset.
  • NormText — matches after normalizing whitespace, curly quotes (“ ”), and em-dashes (—). Minor formatting differences from XML-to-text conversion. Content is correct.
  • Spaceless — matches only after removing all spaces. Catches word-joining artifacts. Zero occurrences in the example data.
  • TextMiss — not found at any matching tier. The raw text may be paraphrased or truncated. In the example data, all 38 TextMiss cases are non-dollar provisions (statutory amendments) where the LLM slightly reformatted section references.

Coverage column

Coverage is the percentage of all dollar-sign patterns found in the source bill text that were matched to an extracted provision. This measures completeness, not accuracy.

  • 100% (H.R. 9468): Every dollar amount in the source was captured — perfect.
  • 94.2% (H.R. 4366): Most dollar amounts were captured. The remaining 5.8% are typically statutory cross-references, loan guarantee ceilings, or old amounts being struck by amendments — dollar figures that appear in the text but aren’t independent provisions.
  • 61.1% (H.R. 5860): Lower coverage is expected for continuing resolutions because most of the bill text consists of references to prior-year appropriations acts, which contain many dollar amounts that are contextual references, not new provisions.

Coverage below 100% does not mean the extracted numbers are wrong. It means the bill text contains dollar strings that aren’t captured as provisions. See What Coverage Means (and Doesn’t) for a detailed explanation.
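
Conceptually, coverage is the fraction of dollar strings in the source that map to an extracted provision. A simplified sketch of the idea (the tool’s real matcher is more sophisticated than this toy example):

```python
import re

# A toy source: one captured appropriation plus one contextual dollar reference.
source = ("For 'Compensation and Pensions', $2,285,513,000, to remain available; "
          "as authorized under the $1,000,000,000 ceiling in prior law.")
extracted = {"$2,285,513,000"}  # dollar strings captured as provisions

dollar_strings = re.findall(r"\$\d{1,3}(?:,\d{3})*", source)
matched = sum(1 for d in dollar_strings if d in extracted)
coverage = 100 * matched / len(dollar_strings)
assert coverage == 50.0  # the ceiling is a dollar string, but not a provision
```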

Quick decision guide

After running audit, here’s how to interpret the results:

  • NotFound = 0, Coverage ≥ 90% — excellent: all extracted amounts verified, high completeness. Use with confidence.
  • NotFound = 0, Coverage 60–90% — good: all extracted amounts verified, some dollar strings in the source uncaptured. Fine for most purposes; check unaccounted amounts if completeness matters.
  • NotFound = 0, Coverage < 60% — amounts are correct but extraction may be incomplete. Consider re-extracting; review with audit --verbose.
  • NotFound > 0 — some amounts need review. Run audit --verbose to see which provisions failed; verify manually against the source XML.

The Compare Table

The compare command shows account-level differences between two sets of bills:

congress-approp compare --base data/118-hr4366 --current data/118-hr9468
┌─────────────────────────────────────┬──────────────────────┬─────────────────┬───────────────┬──────────────────┬─────────┬──────────────┐
│ Account                             ┆ Agency               ┆        Base ($) ┆   Current ($) ┆        Delta ($) ┆     Δ % ┆ Status       │
╞═════════════════════════════════════╪══════════════════════╪═════════════════╪═══════════════╪══════════════════╪═════════╪══════════════╡
│ Compensation and Pensions           ┆ Department of Veter… ┆ 197,382,903,000 ┆ 2,285,513,000 ┆ -195,097,390,000 ┆  -98.8% ┆ changed      │
│ Readjustment Benefits               ┆ Department of Veter… ┆  13,774,657,000 ┆   596,969,000 ┆  -13,177,688,000 ┆  -95.7% ┆ changed      │
│ ...                                                                                                                                       │
│ Supplemental Nutrition Assistance … ┆ Department of Agric… ┆ 122,382,521,000 ┆             0 ┆ -122,382,521,000 ┆ -100.0% ┆ only in base │
└─────────────────────────────────────┴──────────────────────┴─────────────────┴───────────────┴──────────────────┴─────────┴──────────────┘
| Column | Meaning |
|---|---|
| Account | The account name, matched between bills |
| Agency | The parent agency or department |
| Base ($) | Total budget authority for this account in the --base bills |
| Current ($) | Total budget authority in the --current bills |
| Delta ($) | Current minus Base |
| Δ % | Percentage change |
| Status | changed (in both, different amounts), unchanged (in both, same amount), only in base (not in current), or only in current (not in base) |

Results are sorted by the absolute value of Delta, largest changes first.

Interpreting cross-type comparisons: When comparing an omnibus to a supplemental (as above), most accounts will show “only in base” because the supplemental only touches a few accounts. The tool warns you about this: “Comparing Omnibus to Supplemental. Accounts in one but not the other may be expected.” The compare command is most informative when comparing bills of the same type — for example, an FY2023 omnibus to an FY2024 omnibus.

Output Formats

Every query command supports four output formats via --format:

Table (default)

congress-approp search --dir data/118-hr9468 --format table

Human-readable formatted table. Best for interactive use and quick exploration. Column widths adapt to content. Long text is truncated.

JSON

congress-approp search --dir data/118-hr9468 --format json

A JSON array of objects. Includes every field for each matching provision — more data than the table shows. Best for programmatic consumption, piping to jq, or loading into scripts.

JSONL (JSON Lines)

congress-approp search --dir data/118-hr9468 --format jsonl

One JSON object per line, no enclosing array. Best for streaming processing, piping to while read, or working with very large result sets. Each line is independently parseable.
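A minimal Python sketch of the streaming pattern — each line parses independently, so nothing needs to be buffered. The field names below are illustrative stand-ins for the real provision fields:

```python
import io
import json

# Simulated JSONL output: one independent JSON object per line,
# as produced by `congress-approp search --format jsonl`.
jsonl_stream = io.StringIO(
    '{"provision_type": "appropriation", "dollars": 1000000}\n'
    '{"provision_type": "rider", "dollars": 0}\n'
)

# No enclosing array: parse and process one line at a time.
total = 0
for line in jsonl_stream:
    record = json.loads(line)
    if record["provision_type"] == "appropriation":
        total += record["dollars"]

print(total)  # -> 1000000
```

In a real pipeline, replace the StringIO with the process's stdout or a `.jsonl` file handle.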

CSV

congress-approp search --dir data/118-hr9468 --format csv > provisions.csv

Comma-separated values suitable for importing into Excel, Google Sheets, R, or pandas. Includes a header row. Dollar amounts are plain integers (not formatted with commas).

Tip: When exporting to CSV for Excel, make sure to import the file with UTF-8 encoding. Some bill text contains em-dashes (—) and other Unicode characters that may display incorrectly with the default Windows encoding.
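From Python, the standard csv module handles the export cleanly, including the plain-integer dollar amounts and any Unicode in account names. A self-contained sketch using an in-memory CSV in place of a real export file:

```python
import csv
import io

# Simulated CSV export: header row, plain-integer dollars,
# and an em-dash in the account name (as appears in bill text).
csv_text = 'account_name,dollars\nSecret Service—Operations,2336401000\n'

# For a file on disk: open(path, newline='', encoding='utf-8')
with io.StringIO(csv_text) as f:
    rows = list(csv.DictReader(f))

print(rows[0]["account_name"])   # Secret Service—Operations
print(int(rows[0]["dollars"]))   # 2336401000
```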

For a detailed guide with examples and recipes for each format, see Output Formats.

Provision Types at a Glance

You’ll encounter these provision types throughout the tool. Use --list-types for a quick reference:

congress-approp search --dir data --list-types
Available provision types:
  appropriation                    Budget authority grant
  rescission                       Cancellation of prior budget authority
  cr_substitution                  CR anomaly (substituting $X for $Y)
  transfer_authority               Permission to move funds between accounts
  limitation                       Cap or prohibition on spending
  directed_spending                Earmark / community project funding
  mandatory_spending_extension     Amendment to authorizing statute
  directive                        Reporting requirement or instruction
  rider                            Policy provision (no direct spending)
  continuing_resolution_baseline   Core CR funding mechanism
  other                            Unclassified provisions

The distribution varies by bill type. In the FY2024 omnibus (H.R. 4366), the breakdown is:

| Type | Count | What These Are |
|---|---|---|
| appropriation | 1,216 | Grant of budget authority — the core spending provisions |
| limitation | 456 | Caps and prohibitions (“not more than”, “none of the funds”) |
| rider | 285 | Policy provisions that don’t directly spend or limit money |
| directive | 120 | Reporting requirements and instructions to agencies |
| other | 84 | Provisions that don’t fit neatly into the standard types |
| rescission | 78 | Cancellations of previously appropriated funds |
| transfer_authority | 77 | Permission to move funds between accounts |
| mandatory_spending_extension | 40 | Amendments to authorizing statutes |
| directed_spending | 8 | Earmarks and community project funding |

The continuing resolution (H.R. 5860) has a very different profile: 49 riders, 44 mandatory spending extensions, 13 CR substitutions, and only 5 standalone appropriations. This reflects the CR’s structure — it mostly continues prior-year funding rather than setting new levels.

For detailed documentation of each provision type including all fields and real examples, see Provision Types.

Enriched Output

When you run congress-approp enrich --dir data (no API key needed), the tool generates bill metadata that enhances the output:

  • Enriched classifications — the summary table shows “Full-Year CR with Appropriations” instead of “Continuing Resolution” for hybrid bills like H.R. 1968, and “Minibus” instead of “Omnibus” for bills covering only 2–4 subcommittees.
  • Advance appropriation split — use --show-advance on summary to separate current-year spending from advance appropriations (money enacted now but available in a future fiscal year). This is critical for VA accounts, where 79.4% of MilCon-VA budget authority is advance.
  • Fiscal year and subcommittee filtering — use --fy 2026 and --subcommittee thud to scope any command to a specific year and jurisdiction, automatically resolving division letters across bills.

See Enrich Bills with Metadata for the full guide.

Next Steps

Related chapters:

Recipes & Demos

Worked examples using the included 32-bill dataset (data/). All commands run locally against the pre-extracted data with no API keys unless noted. Semantic search requires OPENAI_API_KEY.

The book/cookbook/cookbook.py script reproduces all CSVs, charts, and JSON shown on this page. See Run All Demos Yourself at the bottom.


Dataset Overview

  • 116th Congress (2019–2021): 11 bills — FY2019, FY2020, FY2021
  • 117th Congress (2021–2023): 7 bills — FY2021, FY2022, FY2023
  • 118th Congress (2023–2025): 10 bills — FY2024, FY2025
  • 119th Congress (2025–2027): 4 bills — FY2025, FY2026
  • Total: 32 bills, 34,568 provisions, $21.5 trillion in budget authority
  • Accounts tracked: 1,051 unique Federal Account Symbols across 937 cross-bill links
  • Source traceability: 100% — every provision has exact byte positions in the enrolled bill
  • Dollar verification: 99.995% — 18,583 of 18,584 dollar amounts confirmed in source text

Subcommittee coverage by fiscal year

The --subcommittee filter requires bills with separate divisions per jurisdiction. FY2025 was funded through H.R. 1968, a full-year continuing resolution that wraps all 12 subcommittees into a single division — so --subcommittee cannot break it apart. Use trace or search --fy 2025 to access FY2025 data by account.

| Fiscal Year | Subcommittee filter | Notes |
|---|---|---|
| FY2019 | Partial | Only supplemental and disaster relief bills |
| FY2020–FY2024 | ✅ Full | Traditional omnibus/minibus bills with per-subcommittee divisions |
| FY2025 | ❌ Not available | Funded via full-year CR (H.R. 1968) — all jurisdictions in one division |
| FY2026 | ✅ Full | Three bills cover all 12 subcommittees |

Quick Reference

# Track any federal account across all fiscal years (by FAS code or name search)
congress-approp trace "child nutrition" --dir data

# Budget totals for FY2026
congress-approp summary --dir data --fy 2026

# Find FEMA provisions across all bills covering FY2026
congress-approp search --dir data --keyword "Federal Emergency Management" --fy 2026

# Compare THUD funding FY2024 → FY2026 with inflation adjustment
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data --use-authorities --real

# Verification quality across all 32 bills
congress-approp audit --dir data

Searching and Tracking Accounts

The --keyword flag searches the raw_text field — the verbatim bill language stored with each provision. It is case-insensitive. Combine with --type to filter by provision type, --fy by fiscal year, --agency by department, or --min-dollars / --max-dollars for dollar ranges. All filters are ANDed.

congress-approp search --dir data --keyword "veterans" --type appropriation
┌───┬───────────┬───────────────┬───────────────────────────────────────────────┬─────────────────┬─────────┬─────┐
│ $ ┆ Bill      ┆ Type          ┆ Description / Account                         ┆      Amount ($) ┆ Section ┆ Div │
╞═══╪═══════════╪═══════════════╪═══════════════════════════════════════════════╪═════════════════╪═════════╪═════╡
│ ✓ ┆ H.R. 133  ┆ appropriation ┆ Compensation and Pensions                     ┆   6,110,251,552 ┆         ┆ J   │
│ ✓ ┆ H.R. 133  ┆ appropriation ┆ Readjustment Benefits                         ┆  14,946,618,000 ┆         ┆ J   │
│ ✓ ┆ H.R. 133  ┆ appropriation ┆ General Operating Expenses, Veterans Benefit… ┆   3,180,000,000 ┆         ┆ J   │
│ ...                                                                                                              │

Column reference:

| Column | Meaning |
|---|---|
| $ | Dollar amount verification status. ✓ = dollar string found at one unique position in the enrolled bill text. Other markers indicate found at multiple positions (common for round numbers — correct but location ambiguous) or not found in source (needs review). Blank = provision has no dollar amount. |
| Bill | The enacted legislation this provision comes from |
| Type | Provision classification: appropriation (grant of budget authority), rescission (cancellation of prior funds), transfer_authority (permission to move funds), rider (policy provision, no spending), directive (reporting requirement), limitation (spending cap), cr_substitution (CR anomaly replacing one dollar amount with another), and others |
| Description / Account | Account name (for appropriations, rescissions) or description text (for riders, directives), exactly as written in the bill text |
| Amount ($) | Budget authority in dollars. Blank = provision carries no dollar value. |
| Section | Section reference in the bill (e.g., SEC. 1701). Empty if no numbered section. |
| Div | Division letter for omnibus/minibus bills. Division letters are bill-internal — Division A means different things in different bills. |

Tracking an account across fiscal years

The trace command follows a single federal account across every bill in the dataset using its Federal Account Symbol (FAS code) — a government-assigned identifier that persists through name changes and reorganizations.

Finding the FAS code by name:

congress-approp trace "child nutrition" --dir data

If the name matches multiple accounts, the tool lists them with their FAS codes. Use the code for the specific account:

congress-approp trace 012-3539 --dir data
TAS 012-3539: Child Nutrition Programs, Food and Nutrition Service, Agriculture
  Agency: Department of Agriculture

┌──────┬──────────────────────┬───────────┬──────────────────────────┐
│ FY   ┆ Budget Authority ($) ┆ Bill(s)   ┆ Account Name(s)          │
╞══════╪══════════════════════╪═══════════╪══════════════════════════╡
│ 2020 ┆       23,615,098,000 ┆ H.R. 1865 ┆ Child Nutrition Programs │
│ 2021 ┆       25,118,440,000 ┆ H.R. 133  ┆ Child Nutrition Programs │
│ 2022 ┆       26,883,922,000 ┆ H.R. 2471 ┆ Child Nutrition Programs │
│ 2023 ┆       28,545,432,000 ┆ H.R. 2617 ┆ Child Nutrition Programs │
│ 2024 ┆       33,266,226,000 ┆ H.R. 4366 ┆ Child Nutrition Programs │
│ 2026 ┆       37,841,674,000 ┆ H.R. 5371 ┆ Child Nutrition Programs │
└──────┴──────────────────────┴───────────┴──────────────────────────┘

  6 fiscal years, 6 bills, 175,270,792,000 total

| Column | Meaning |
|---|---|
| FY | Federal fiscal year (Oct 1 – Sep 30). FY2024 = Oct 2023 – Sep 2024. |
| Budget Authority ($) | What Congress authorized the agency to obligate. This is budget authority, not outlays. |
| Bill(s) | Enacted legislation providing the funding. (CR) = continuing resolution; (supplemental) = emergency funding. |
| Account Name(s) | Account name as written in each bill. May vary across congresses — the FAS code is the stable identifier. |
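The FY boundary is easy to get wrong in scripts. A small helper (hypothetical — not part of the tool) maps a calendar date to its federal fiscal year:

```python
from datetime import date

def federal_fiscal_year(d: date) -> int:
    """FY N runs Oct 1 of calendar year N-1 through Sep 30 of year N."""
    return d.year + 1 if d.month >= 10 else d.year

print(federal_fiscal_year(date(2023, 10, 1)))   # -> 2024 (first day of FY2024)
print(federal_fiscal_year(date(2024, 9, 30)))   # -> 2024 (last day of FY2024)
```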

FY2025 is absent here because H.R. 1968 (the full-year CR) continued FY2024 rates without a separate line item for this account.

Accounts with name changes demonstrate why FAS codes are necessary for cross-bill tracking:

congress-approp trace 070-0400 --dir data
TAS 070-0400: Operations and Support, United States Secret Service, Homeland Security
  Agency: Department of Homeland Security

┌──────┬──────────────────────┬────────────────┬─────────────────────────────────────────────┐
│ FY   ┆ Budget Authority ($) ┆ Bill(s)        ┆ Account Name(s)                             │
╞══════╪══════════════════════╪════════════════╪═════════════════════════════════════════════╡
│ 2020 ┆        2,336,401,000 ┆ H.R. 1158      ┆ United States Secret Service—Operations an… │
│ 2021 ┆        2,373,109,000 ┆ H.R. 133       ┆ United States Secret Service—Operations an… │
│ 2022 ┆        2,554,729,000 ┆ H.R. 2471      ┆ Operations and Support                      │
│ 2023 ┆        2,734,267,000 ┆ H.R. 2617      ┆ Operations and Support                      │
│ 2024 ┆        3,007,982,000 ┆ H.R. 2882      ┆ Operations and Support                      │
│ 2025 ┆          231,000,000 ┆ H.R. 9747 (CR) ┆ United States Secret Service—Operations an… │
└──────┴──────────────────────┴────────────────┴─────────────────────────────────────────────┘

  Name variants across bills:
    "Operations and Support" (117-hr2471, 117-hr2617, 118-hr2882) [prefix]
    "United States Secret Service—Operations and Sup…" (116-hr1158, 116-hr133, 118-hr9747) [canonical]

  6 fiscal years, 6 bills, 13,237,488,000 total

The account was renamed between the 116th and 117th Congress — the “United States Secret Service—” prefix was dropped. FAS code 070-0400 unifies both names. The FY2025 row shows $231M from H.R. 9747 (a CR supplement), not the full-year level.


When the official program name is unknown, semantic search matches provisions by meaning rather than keywords. Requires OPENAI_API_KEY (one API call per query, ~100ms).

export OPENAI_API_KEY="your-key"
congress-approp search --dir data --semantic "school lunch programs for kids" --top 3
┌──────┬───────────────────┬───────────────┬──────────────────────────┬────────────────┐
│ Sim  ┆ Bill              ┆ Type          ┆ Description / Account    ┆     Amount ($) │
╞══════╪═══════════════════╪═══════════════╪══════════════════════════╪════════════════╡
│ 0.52 ┆ H.R. 1865 (116th) ┆ appropriation ┆ Child Nutrition Programs ┆ 23,615,098,000 │
│ 0.51 ┆ H.R. 4366 (118th) ┆ appropriation ┆ Child Nutrition Programs ┆ 33,266,226,000 │
│ 0.51 ┆ H.R. 2471 (117th) ┆ appropriation ┆ Child Nutrition Programs ┆ 26,883,922,000 │
└──────┴───────────────────┴───────────────┴──────────────────────────┴────────────────┘

“school lunch programs for kids” shares no keywords with “Child Nutrition Programs”, but semantic search matches them by meaning. The Sim column is cosine similarity between the query and provision embeddings:

| Sim Score | Interpretation |
|---|---|
| > 0.80 | Almost certainly the same program (when comparing provisions across bills) |
| 0.60–0.80 | Related topic, same policy area |
| 0.45–0.60 | Loosely related |
| < 0.45 | Unlikely to be meaningfully related |

Scores reflect the full provision text (account name + agency + raw bill language), not just the account name, which is why good matches are often in the 0.45–0.55 range rather than near 1.0.
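Cosine similarity itself is simple arithmetic. A sketch with toy 3-dimensional vectors — real embeddings come from the OpenAI API and have thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for a query and a provision.
query = [0.2, 0.9, 0.1]
provision = [0.3, 0.8, 0.2]

print(round(cosine_similarity(query, provision), 2))  # -> 0.98
```

A vector compared with itself scores exactly 1.0; unrelated directions score near 0.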

Additional examples (tested against the dataset):

| Query | Top Result | Sim |
|---|---|---|
| opioid crisis drug treatment | Substance Abuse Treatment | 0.48 |
| space exploration | Exploration (NASA) | 0.57 |
| military pay raises for soldiers | Military Personnel, Army | 0.53 |
| fighting wildfires | Wildland Fire Management | 0.53 |
| veterans mental health | VA mental health counseling directives | 0.53 |

Comparing Across Fiscal Years

Year-over-year comparison with inflation adjustment

congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud \
    --dir data --use-authorities --real
| Flag | Purpose |
|---|---|
| --base-fy 2024 | Use all bills covering FY2024 as the baseline |
| --current-fy 2026 | Use all bills covering FY2026 as the comparison |
| --subcommittee thud | Scope to Transportation, Housing and Urban Development. The tool resolves which division in each bill corresponds to THUD. |
| --use-authorities | Match accounts using Treasury Account Symbols instead of name strings. Handles renames and agency reorganizations. |
| --real | Add inflation-adjusted columns using bundled CPI-U data. |
20 orphan(s) rescued via TAS authority matching
Comparing: H.R. 4366 (118th)  →  H.R. 7148 (119th)

┌─────────────────────────────────────┬──────────────────────┬────────────────┬────────────────┬─────────────────┬─────────┬───────────┬───┬──────────┐
│ Account                             ┆ Agency               ┆       Base ($) ┆    Current ($) ┆       Delta ($) ┆     Δ % ┆ Real Δ %* ┆   ┆ Status   │
╞═════════════════════════════════════╪══════════════════════╪════════════════╪════════════════╪═════════════════╪═════════╪═══════════╪═══╪══════════╡
│ Tenant-Based Rental Assistance      ┆ Department of Housi… ┆ 32,386,831,000 ┆ 38,438,557,000 ┆  +6,051,726,000 ┆  +18.7% ┆    +13.8% ┆ ▲ ┆ changed  │
│ Federal-Aid Highways                ┆ Federal Highway Adm… ┆ 60,834,782,888 ┆ 63,396,105,821 ┆  +2,561,322,933 ┆   +4.2% ┆     -0.1% ┆ ▼ ┆ changed  │
│ Operations                          ┆ Federal Aviation Ad… ┆ 12,729,627,000 ┆ 13,710,000,000 ┆    +980,373,000 ┆   +7.7% ┆     +3.2% ┆ ▲ ┆ changed  │
│ Facilities and Equipment            ┆ Federal Aviation Ad… ┆  3,191,250,000 ┆  4,000,000,000 ┆    +808,750,000 ┆  +25.3% ┆    +20.1% ┆ ▲ ┆ changed  │
│ Capital Investment Grants           ┆ Federal Transit Adm… ┆  2,205,000,000 ┆  1,700,000,000 ┆    -505,000,000 ┆  -22.9% ┆    -26.1% ┆ ▼ ┆ changed  │
│ Public Housing Fund                 ┆ Department of Housi… ┆  8,810,784,000 ┆  8,319,393,000 ┆    -491,391,000 ┆   -5.6% ┆     -9.5% ┆ ▼ ┆ changed  │
│ ...                                 ┆                      ┆                ┆                ┆                 ┆         ┆           ┆   ┆          │

Column reference:

| Column | Meaning |
|---|---|
| Account | Appropriations account name, matched between the two fiscal years |
| Agency | Parent department or agency |
| Base ($) | Total budget authority for this account in FY2024 |
| Current ($) | Total budget authority in FY2026 |
| Delta ($) | Current minus Base |
| Δ % | Nominal percentage change (not inflation-adjusted) |
| Real Δ %* | Inflation-adjusted percentage change using CPI-U data. Asterisk indicates this is computed from a price index, not a number verified against bill text. |
| ▲ / ▼ / — | ▲ = real increase (beat inflation), ▼ = real cut or inflation erosion, — = unchanged |
| Status | changed = in both FYs, different amounts. unchanged = same amount. only in base = not in FY2026. only in current = new in FY2026. matched (TAS …) (normalized) = matched via Treasury Account Symbol because the name differed. |

The Federal-Aid Highways row illustrates why inflation adjustment matters: nominal +4.2%, but real -0.1%. The nominal increase does not keep pace with inflation.
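The real-change arithmetic can be checked by hand: real_delta = (current / base) * (cpi_base / cpi_current) - 1. A sketch using the Federal-Aid Highways figures from the table; the CPI index values below are illustrative placeholders, not the bundled CPI-U series:

```python
def real_pct_change(base, current, cpi_base, cpi_current):
    """Deflate the nominal change by the price-index ratio."""
    return (current / base) * (cpi_base / cpi_current) - 1.0

# Federal-Aid Highways, FY2024 -> FY2026 (dollar figures from the table above;
# the CPI values are made-up stand-ins chosen only to illustrate the formula).
base, current = 60_834_782_888, 63_396_105_821
nominal = current / base - 1.0
real = real_pct_change(base, current, cpi_base=310.0, cpi_current=323.5)

print(f"nominal {nominal:+.1%}, real {real:+.1%}")  # nominal +4.2%, real -0.1%
```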

The --real flag works on any compare command — any subcommittee, any fiscal year pair. No API key needed.

The “20 orphan(s) rescued via TAS authority matching” message indicates 20 accounts that would have appeared unmatched (different names between FY2024 and FY2026) were paired using their FAS codes.


Subcommittee budget authority across fiscal years

Individual subcommittee totals can be retrieved per fiscal year using summary --fy Y --subcommittee S. The book/cookbook/cookbook.py script runs all combinations; the resulting table:

| Subcommittee | FY2020 | FY2021 | FY2022 | FY2023 | FY2024 | FY2026 | Change |
|---|---|---|---|---|---|---|---|
| Defense | $693B | $695B | $723B | $791B | $819B | $836B | +21% |
| Labor-HHS | $1,089B | $1,167B | $1,305B | $1,408B | $1,435B | $1,729B | +59% |
| THUD | $97B | $87B | $112B | $162B | $184B | $183B | +88% |
| MilCon-VA | $256B | $272B | $316B | $332B | $360B | $495B | +94% |
| Homeland Security | $73B | $75B | $81B | $85B | $88B | | +20% |
| Agriculture | $120B | $205B | $197B | $212B | $187B | $177B | +48% |
| CJS | $84B | $81B | $84B | $89B | $88B | $88B | +5% |
| Energy & Water | $50B | $53B | $57B | $61B | $63B | $69B | +38% |
| Interior | $37B | $37B | $39B | $45B | $40B | $40B | +7% |
| State-Foreign Ops | $56B | $62B | $59B | $61B | $62B | $53B | -6% |
| Financial Services | $37B | $38B | $39B | $41B | $40B | $41B | +11% |
| Legislative Branch | $5B | $5B | $6B | $7B | $7B | $7B | +43% |

FY2025 is omitted for individual subcommittees because it was funded through a full-year CR with all jurisdictions under one division — see the coverage note above.

All values are budget authority. These include mandatory spending programs that appear as appropriation lines (e.g., SNAP under Agriculture, Medicaid under Labor-HHS). The MilCon-VA figure ($495B for FY2026) includes $394B in advance appropriations — see the next section.


Advance vs. current-year appropriations

congress-approp summary --dir data --fy 2026 --subcommittee milcon-va --show-advance
┌───────────────────┬──────┬────────────────┬────────────┬─────────────────┬─────────────────┬─────────────────┬─────────────────┬─────────────────┐
│ Bill              ┆ FYs  ┆ Classification ┆ Provisions ┆     Current ($) ┆     Advance ($) ┆    Total BA ($) ┆ Rescissions ($) ┆      Net BA ($) │
╞═══════════════════╪══════╪════════════════╪════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╡
│ H.R. 5371 (119th) ┆ 2026 ┆ Minibus        ┆        263 ┆ 101,839,976,450 ┆ 393,592,053,000 ┆ 495,432,029,450 ┆  16,499,000,000 ┆ 478,933,029,450 │
│ TOTAL             ┆      ┆                ┆        263 ┆ 101,839,976,450 ┆ 393,592,053,000 ┆ 495,432,029,450 ┆  16,499,000,000 ┆ 478,933,029,450 │
└───────────────────┴──────┴────────────────┴────────────┴─────────────────┴─────────────────┴─────────────────┴─────────────────┴─────────────────┘

| Column | Meaning |
|---|---|
| Current ($) | Budget authority available in the current fiscal year (FY2026) |
| Advance ($) | Budget authority enacted in this bill but available starting in a future fiscal year (FY2027+). Common for VA medical accounts. |
| Total BA ($) | Current + Advance. This is the number shown without --show-advance. |
| Rescissions ($) | Cancellations of previously enacted budget authority (absolute value) |
| Net BA ($) | Total BA minus Rescissions |

79.4% of FY2026 MilCon-VA budget authority ($394B of $495B) is advance appropriations for FY2027. Only $102B is current-year spending. Without --show-advance, the total combines both, which can distort year-over-year comparisons by hundreds of billions of dollars.
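The column arithmetic can be verified directly from the row above — a quick sanity check in Python:

```python
# FY2026 MilCon-VA figures from the --show-advance table above.
current = 101_839_976_450   # available in FY2026
advance = 393_592_053_000   # enacted now, available FY2027+
rescind = 16_499_000_000    # cancellations of prior budget authority

total_ba = current + advance
net_ba = total_ba - rescind
advance_share = advance / total_ba

print(f"Total BA:      {total_ba:,}")        # 495,432,029,450
print(f"Net BA:        {net_ba:,}")          # 478,933,029,450
print(f"Advance share: {advance_share:.1%}") # 79.4%
```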

The classification uses bill_meta.json generated by enrich (run once, no API key). The algorithm compares each provision’s availability dates against the bill’s fiscal year.


CR substitutions — what the continuing resolution changed

Continuing resolutions fund the government at prior-year rates, except for specific anomalies (CR substitutions) where Congress sets a different level.

congress-approp search --dir data/118-hr5860 --type cr_substitution
┌───┬───────────┬──────────────────────────────────────────┬───────────────┬───────────────┬──────────────┬──────────┬─────┐
│ $ ┆ Bill      ┆ Account                                  ┆       New ($) ┆       Old ($) ┆    Delta ($) ┆ Section  ┆ Div │
╞═══╪═══════════╪══════════════════════════════════════════╪═══════════════╪═══════════════╪══════════════╪══════════╪═════╡
│ ✓ ┆ H.R. 5860 ┆ Rural Housing Service—Rural Community…   ┆    25,300,000 ┆    75,300,000 ┆  -50,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Rural Utilities Service—Rural Water a…   ┆    60,000,000 ┆   325,000,000 ┆ -265,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ National Science Foundation—STEM Educ…   ┆    92,000,000 ┆   217,000,000 ┆ -125,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ National Science Foundation—Research …   ┆   608,162,000 ┆   818,162,000 ┆ -210,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Office of Personnel Management—Salari…   ┆   219,076,000 ┆   190,784,000 ┆  +28,292,000 ┆ SEC. 126 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Department of Transportation—Federal …   ┆   617,000,000 ┆   570,000,000 ┆  +47,000,000 ┆ SEC. 137 ┆ A   │
│ ...                                                                                                                      │
└───┴───────────┴──────────────────────────────────────────┴───────────────┴───────────────┴──────────────┴──────────┴─────┘
13 provisions found

The cr_substitution table shows New (the CR level), Old (the prior-year rate being replaced), and Delta (the difference). Negative delta = funding cut below the prior-year rate. The full dataset contains 123 CR substitutions across all bills.

To see all CR substitutions: congress-approp search --dir data --type cr_substitution


Working with the Data Programmatically

Loading extraction.json in Python

Each bill’s provisions are in data/{bill_dir}/extraction.json:

import json
from collections import Counter

ext = json.load(open('data/119-hr7148/extraction.json'))
provisions = ext['provisions']

# Count by type
type_counts = Counter(p['provision_type'] for p in provisions)
for ptype, count in type_counts.most_common():
    print(f"  {ptype}: {count}")
  appropriation: 1201
  limitation: 553
  rider: 325
  directive: 285
  transfer_authority: 107
  rescission: 98
  mandatory_spending_extension: 82
  other: 63
  directed_spending: 59
  continuing_resolution_baseline: 1

Field access patterns:

p = provisions[0]
p['provision_type']       # → 'appropriation'
p['account_name']         # → 'Military Personnel, Army'
p['agency']               # → 'Department of Defense'

# Dollar amount (defensive — some fields can be null)
amt = p.get('amount') or {}
value = (amt.get('value') or {}).get('dollars', 0) or 0
# → 54538366000

amt['semantics']          # → 'new_budget_authority'
#   'new_budget_authority' — counts toward budget totals
#   'rescission'           — cancellation of prior funds
#   'transfer_ceiling'     — max transfer amount (not new spending)
#   'limitation'           — spending cap
#   'reference_amount'     — sub-allocation or contextual (not counted)
#   'mandatory_spending'   — mandatory program in the appropriation text

p['detail_level']         # → 'top_level'
#   'top_level'       — main account appropriation (counts toward totals)
#   'line_item'       — numbered item within a section (counts)
#   'sub_allocation'  — "of which" breakdown (does NOT count)
#   'proviso_amount'  — amount in a "Provided, That" clause (does NOT count)

p['raw_text'][:80]        # → verbatim bill language
p['confidence']           # → 0.97 (LLM self-assessed; not calibrated above 0.90)
p['section']              # → '' (empty if no section number)
p['division']             # → 'A'

# Source span — exact byte position in the enrolled bill
span = p.get('source_span') or {}
span['start']             # → UTF-8 byte offset in the source text file
span['end']               # → exclusive end byte
span['file']              # → 'BILLS-119hr7148enr.txt'
span['verified']          # → True (source_bytes[start:end] == raw_text)

Filtering to top-level budget authority provisions (the ones counted in totals):

for p in provisions:
    if p.get('provision_type') != 'appropriation':
        continue
    amt = p.get('amount') or {}
    if amt.get('semantics') != 'new_budget_authority':
        continue
    dl = p.get('detail_level', '')
    if dl in ('sub_allocation', 'proviso_amount'):
        continue
    dollars = (amt.get('value') or {}).get('dollars', 0) or 0
    print(f"{p['account_name'][:50]:50s}  ${dollars:>15,}")

Building a pandas DataFrame from authorities.json

data/authorities.json contains the cross-bill account registry — 1,051 accounts with provisions, name variants, and rename events. To flatten it into a DataFrame:

import json
import pandas as pd

auth = json.load(open('data/authorities.json'))

rows = []
for a in auth['authorities']:
    for prov in a.get('provisions', []):
        for fy in prov.get('fiscal_years', []):
            rows.append({
                'fas_code': a['fas_code'],
                'agency_code': a['agency_code'],
                'agency': a['agency_name'],
                'title': a['fas_title'],
                'fiscal_year': fy,
                'dollars': prov.get('dollars', 0) or 0,
                'bill': prov['bill_identifier'],
                'bill_dir': prov['bill_dir'],
                'confidence': prov['confidence'],
                'method': prov['method'],
            })

df = pd.DataFrame(rows)

Key fields:

| Column | Meaning |
|---|---|
| fas_code | Federal Account Symbol — primary key. Format: {agency_code}-{main_account} (e.g., 070-0400). Assigned by Treasury, stable across renames. |
| agency_code | CGAC agency code. 021 = Army, 017 = Navy, 057 = Air Force, 097 = DOD-wide, 070 = DHS, 075 = HHS, 036 = VA. |
| confidence | TAS resolution confidence. verified = deterministic match. high = LLM-resolved, confirmed in FAST Book. inferred = LLM-resolved, not directly confirmed. |
| method | Resolution method. direct_match, suffix_match, agency_disambiguated = deterministic. llm_resolved = Claude Opus. |

Common operations:

# Budget authority by fiscal year
df.groupby('fiscal_year')['dollars'].sum().sort_index()

# Top 10 agencies
df.groupby('agency')['dollars'].sum().sort_values(ascending=False).head(10)

# Pivot: one row per account, one column per FY
df.pivot_table(values='dollars', index=['fas_code', 'title'],
               columns='fiscal_year', aggfunc='sum', fill_value=0)

# Export
df.to_csv('budget_timeline.csv', index=False)

CLI CSV export and analysis

Export provisions from the CLI, then load in Python or a spreadsheet:

congress-approp search --dir data --type appropriation --fy 2026 --format csv > fy2026_approps.csv
import pandas as pd

df = pd.read_csv('fy2026_approps.csv')

CSV field reference:

| Field | Meaning |
|---|---|
| bill | Bill identifier with congress (e.g., H.R. 7148 (119th)) |
| congress | Congress number (116–119) |
| provision_type | One of the 11 provision types |
| account_name | Account name from the bill text |
| agency | Department or agency |
| dollars | Dollar amount as plain integer |
| old_dollars | For cr_substitution only: the replaced amount |
| semantics | What the amount means (see field guide above) |
| detail_level | top_level, line_item, sub_allocation, or proviso_amount |
| amount_status | found (unique), found_multiple, not_found, or empty |
| quality | strong, moderate, or weak |
| match_tier | exact, normalized, or no_match |
| raw_text | Verbatim bill language (~150 chars) |
| provision_index | Zero-based position in the bill’s provisions array |

Do not sum the dollars column directly. Filter to semantics == 'new_budget_authority' and exclude detail_level in ('sub_allocation', 'proviso_amount') to avoid double-counting. Or use congress-approp summary which handles this automatically.

ba = df[(df['semantics'] == 'new_budget_authority') &
        (~df['detail_level'].isin(['sub_allocation', 'proviso_amount']))]
print(f"FY2026 BA provisions: {len(ba)}")
print(f"Total: ${ba['dollars'].sum():,.0f}")

Other export formats: --format json (array), --format jsonl (one object per line for streaming), --format csv.

jq one-liners:

# Top 5 rescissions by dollar amount
congress-approp search --dir data --type rescission --format json | \
  jq 'sort_by(-.dollars) | .[0:5] | .[] | {bill, account_name, dollars}'

# Count provisions by type for FY2026
congress-approp search --dir data --fy 2026 --format json | \
  jq 'group_by(.provision_type) | map({type: .[0].provision_type, count: length}) | sort_by(-.count)'

Source span verification

Every provision carries a source_span with exact byte offsets into the enrolled bill text. To independently verify a provision:

import json

ext = json.load(open('data/118-hr9468/extraction.json'))
p = ext['provisions'][0]
span = p['source_span']

source_bytes = open(f"data/118-hr9468/{span['file']}", 'rb').read()
actual = source_bytes[span['start']:span['end']].decode('utf-8')

assert actual == p['raw_text']

dollars = ((p.get('amount') or {}).get('value') or {}).get('dollars')
print(f"Account:  {p['account_name']}")
print(f"Dollars:  ${dollars:,}")
print(f"Span:     bytes {span['start']}..{span['end']} in {span['file']}")
print(f"Match:    {actual == p['raw_text']}")
Account:  Compensation and Pensions
Dollars:  $2,285,513,000
Span:     bytes 371..482 in BILLS-118hr9468enr.txt
Match:    True

start and end are UTF-8 byte offsets. In Python, use open(path, 'rb').read()[start:end].decode('utf-8') — not character-based indexing.
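To see why byte offsets and character indices diverge, consider a string containing an em-dash (a 3-byte character in UTF-8):

```python
text = "Secret Service—Operations"   # the em-dash is 1 char but 3 bytes
data = text.encode("utf-8")

print(len(text))  # 25 characters
print(len(data))  # 27 bytes

# Slicing the *bytes* and then decoding recovers the exact span;
# character indices computed on the str would drift past the em-dash.
start, end = data.index("Operations".encode("utf-8")), len(data)
print(data[start:end].decode("utf-8"))  # Operations
```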

Field        Meaning
start        Start byte offset (inclusive)
end          End byte offset (exclusive) — standard Python slice semantics
file         Source filename (e.g., BILLS-118hr9468enr.txt)
verified     true if source_bytes[start:end] is byte-identical to raw_text
match_tier   exact, repaired_prefix, repaired_substring, or repaired_normalized

To verify all provisions across multiple bills:

import json

for bill_dir in ['118-hr9468', '119-hr7148', '119-hr5371']:
    ext = json.load(open(f'data/{bill_dir}/extraction.json'))
    for i, p in enumerate(ext['provisions']):
        span = p.get('source_span') or {}
        if not span.get('file'):
            continue
        source = open(f'data/{bill_dir}/{span["file"]}', 'rb').read()
        actual = source[span['start']:span['end']].decode('utf-8')
        assert actual == p['raw_text'], f'{bill_dir} provision {i}: MISMATCH'
    print(f'{bill_dir}: {len(ext["provisions"])} provisions verified')

Visualizations

Generated by book/cookbook/cookbook.py. The images below are included in the repository; run the script to regenerate from the current data.

FY2026 Interactive Treemap

FY2026 budget authority ($5.6 trillion across 1,076 accounts) organized by jurisdiction → agency → account. The file is a self-contained HTML page — open it in your browser.

Hierarchy: jurisdiction (subcommittee) → agency (department) → account. Click to zoom. Color intensity encodes dollar amount.

Defense vs. Non-Defense Spending Trend

Defense vs. Non-Defense Spending FY2019–FY2026

Dark blue = Defense. Light blue = all other subcommittees. Defense grew from $693B to $836B (+21%) over this period. Non-defense growth is primarily driven by mandatory spending programs (Medicaid, SNAP, VA Compensation) that appear as appropriation lines in the bill text. See Why the Numbers Might Not Match Headlines.

Top 6 Federal Accounts by Budget Authority

Top 6 Account Spending Trends

Each line is one Treasury Account Symbol (FAS code). The top accounts are dominated by mandatory programs that appear as appropriation line items: Medicaid, Health Care Trust Funds, and VA Compensation & Pensions.

Note on FY2025→FY2026 jumps: Some accounts show sharp increases between FY2025 and FY2026 (e.g., Medicaid $261B → $1,086B). This is because FY2025 was covered by a single full-year CR while FY2026 has multiple omnibus/minibus bills — the amounts are correct per bill, but the visual jump reflects different legislative coverage.

Verification Quality Heatmap

Each row is a bill; each column is a verification metric. Color intensity shows the percentage of provisions meeting that criterion.

Column            What it measures                                                      Dataset result
$ Verified        Dollar string at unique position in source                            10,468 (56.3% of provisions with amounts)
$ Ambiguous       Dollar string at multiple positions — correct but location uncertain  8,115
$ Not Found       Dollar string not in source                                           1 (0.005%)
Text Exact        raw_text byte-identical to source                                     32,691 (94.6%)
Text Normalized   Matches after whitespace/quote normalization                          1,287 (3.7%)
Text No Match     Not found at any tier                                                 585 (1.7%)

Bills with low $ Verified percentages (e.g., CRs) are expected — most CR provisions do not carry dollar amounts.
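The headline percentages follow directly from the raw counts in the table:

```python
# Dataset-wide counts from the verification table above.
verified, ambiguous, not_found = 10_468, 8_115, 1
with_amounts = verified + ambiguous + not_found   # 18,584 provisions carry amounts

print(f"{verified / with_amounts:.1%}")    # $ Verified share -> 56.3%
print(f"{not_found / with_amounts:.3%}")   # $ Not Found share -> 0.005%
```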


Run All Demos Yourself

book/cookbook/cookbook.py runs 24 demos including everything above plus TAS resolution quality per bill, account rename events, directed spending analysis, advance appropriation breakdown, and more.

Setup

source .venv/bin/activate
pip install -r book/cookbook/requirements.txt

Run

python book/cookbook/cookbook.py

For semantic search demos (optional):

export OPENAI_API_KEY="your-key"
python book/cookbook/cookbook.py

Output

All files go to tmp/demo_output/:

File                           Description
fy2026_treemap.html            Interactive budget treemap
defense_vs_nondefense.png      Stacked bar chart
spending_trends_top6.png       Line chart — top 6 accounts
verification_heatmap.png       Verification quality heatmap
authorities_flat.csv           Full dataset as flat CSV — every provision-FY pair
biggest_changes_2024_2026.csv  Account-level changes FY2024 → FY2026
cr_substitutions.csv           Every CR substitution across all bills
rename_events.csv              Account rename events with fiscal year boundaries
subcommittee_scorecard.csv     12 subcommittees × 7 fiscal years
fy2026_by_agency.csv           FY2026 budget authority by agency
semantic_search_demos.json     Semantic query results
dataset_summary.json           Summary statistics

Find How Much Congress Spent on a Topic

You will need: congress-approp installed, access to the data/ directory. For semantic search: OPENAI_API_KEY.

You will learn: Three ways to find spending provisions — by account name, by keyword, and by semantic meaning — and when to use each one.

Start with the Agency Rollup

If your question is about an entire department, the fastest answer is the by-agency summary:

congress-approp summary --dir data --by-agency

This prints the standard bill summary table, followed by a second table breaking down budget authority by parent department. Here’s the top of that second table:

┌─────────────────────────────────────────────────────┬─────────────────┬─────────────────┬────────────┐
│ Department                                           ┆ Budget Auth ($) ┆ Rescissions ($) ┆ Provisions │
╞═════════════════════════════════════════════════════╪═════════════════╪═════════════════╪════════════╡
│ Department of Veterans Affairs                       ┆ 343,238,707,982 ┆   9,799,155,560 ┆         51 │
│ Department of Agriculture                            ┆ 187,748,124,000 ┆     351,891,000 ┆        266 │
│ Department of Housing and Urban Development          ┆  75,743,762,466 ┆      85,000,000 ┆        116 │
│ Department of Energy                                 ┆  50,776,281,000 ┆               0 ┆         62 │
│ Department of Justice                                ┆  37,960,158,000 ┆   1,158,272,000 ┆        186 │
│ ...                                                                                                    │
└─────────────────────────────────────────────────────┴─────────────────┴─────────────────┴────────────┘

So the answer to “how much did the VA get?” is approximately $343 billion in budget authority across all bills in the dataset, with $9.8 billion in rescissions.

Important caveat: This total includes mandatory spending programs that appear as appropriation lines in the bill text. VA’s Compensation and Pensions account alone is $197 billion — that’s a mandatory entitlement, not discretionary spending, even though it appears in the appropriations bill. See Why the Numbers Might Not Match Headlines for more on this distinction.

Search by Account Name

When you know the program’s official name (or part of it), --account is the most precise filter. It matches against the structured account_name field:

congress-approp search --dir data --account "Child Nutrition"
┌───┬───────────┬───────────────┬─────────────────────────────────────────────┬────────────────┬─────────┬─────┐
│ $ ┆ Bill      ┆ Type          ┆ Description / Account                       ┆     Amount ($) ┆ Section ┆ Div │
╞═══╪═══════════╪═══════════════╪═════════════════════════════════════════════╪════════════════╪═════════╪═════╡
│ ✓ ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆ 33,266,226,000 ┆         ┆ B   │
│ ✓ ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆     18,004,000 ┆         ┆ B   │
│ ✓ ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆     21,005,000 ┆         ┆ B   │
│ ✓ ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆      5,000,000 ┆         ┆ B   │
│ ≈ ┆ H.R. 4366 ┆ limitation    ┆ Child Nutrition Programs                    ┆        500,000 ┆         ┆ B   │
│ ≈ ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆     10,000,000 ┆         ┆ B   │
│ ≈ ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆      1,000,000 ┆         ┆ B   │
│ ✓ ┆ H.R. 4366 ┆ appropriation ┆ McGovern-Dole International Food for Educ…  ┆    240,000,000 ┆         ┆ B   │
│ ≈ ┆ H.R. 4366 ┆ limitation    ┆ McGovern-Dole International Food for Educ…  ┆     24,000,000 ┆         ┆ B   │
└───┴───────────┴───────────────┴─────────────────────────────────────────────┴────────────────┴─────────┴─────┘

The top result — $33,266,226,000 — is the top-level appropriation for Child Nutrition Programs. The smaller amounts below it are sub-allocations (“of which $18,004,000 shall be for…”) and proviso amounts that break down how the top-level figure is to be spent. These sub-allocations have reference_amount semantics and are not counted again in the budget authority total — no double-counting.

The McGovern-Dole account also matches because it has “Child Nutrition” in its full name.

When to use --account vs. --keyword

  • --account matches against the structured account_name field extracted by the LLM — the official name of the appropriations account.
  • --keyword searches the full raw_text field — the actual bill language.

Sometimes the account name doesn’t contain the term you’re looking for, but the bill text does. Other times, the bill text doesn’t mention a term that is in the account name. Use both when you want to be thorough.
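The distinction is easy to see in miniature. A sketch assuming simple case-insensitive substring matching (the tool’s exact matching rules may differ), using the account_name and raw_text fields from the Child Nutrition provision shown later in this tutorial:

```python
# Fields copied from the Child Nutrition provision's JSON output.
p = {"account_name": "Child Nutrition Programs",
     "raw_text": "For necessary expenses of the Food and Nutrition Service..."}

def account_match(term):
    # Roughly what --account filters on: the structured account name.
    return term.lower() in p["account_name"].lower()

def keyword_match(term):
    # Roughly what --keyword filters on: the raw bill language.
    return term.lower() in p["raw_text"].lower()

print(account_match("Food and Nutrition Service"))  # False — not in the account name
print(keyword_match("Food and Nutrition Service"))  # True — it is in the bill text
```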

Search by Keyword in Bill Text

The --keyword flag searches the raw_text field — the excerpt of actual bill language stored with each provision. This finds provisions where the term appears anywhere in the source text, regardless of account name:

congress-approp search --dir data --keyword "Federal Emergency Management"
┌───┬───────────┬───────────────┬───────────────────────────────────────────────┬────────────────┬──────────┬─────┐
│ $ ┆ Bill      ┆ Type          ┆ Description / Account                         ┆     Amount ($) ┆ Section  ┆ Div │
╞═══╪═══════════╪═══════════════╪═══════════════════════════════════════════════╪════════════════╪══════════╪═════╡
│   ┆ H.R. 5860 ┆ other         ┆ Allows FEMA Disaster Relief Fund to be appor… ┆              — ┆ SEC. 128 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ appropriation ┆ Federal Emergency Management Agency—Disast…   ┆ 16,000,000,000 ┆ SEC. 129 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ appropriation ┆ Office of the Inspector General—Operations…   ┆      2,000,000 ┆ SEC. 129 ┆ A   │
└───┴───────────┴───────────────┴───────────────────────────────────────────────┴────────────────┴──────────┴─────┘
3 provisions found

This found three provisions: the $16B FEMA Disaster Relief Fund appropriation, a $2M Inspector General appropriation, and a non-dollar provision about how the fund can be apportioned. All three are in the continuing resolution (H.R. 5860), not the omnibus — because FEMA’s regular funding falls under the Homeland Security appropriations bill, which isn’t one of the divisions included in this particular omnibus.

Useful keywords for exploring

Here are some keywords that surface interesting provisions in the example data:

Keyword                   What It Finds
"notwithstanding"         Provisions that override other legal requirements — often important policy exceptions
"is hereby rescinded"     Rescission provisions (also findable with --type rescission)
"shall submit a report"   Reporting requirements and directives
"not to exceed"           Caps and limitations on spending
"transfer"                Fund transfer authorities
"Veterans Affairs"        All VA-related provisions across all bills

Combining filters

All search filters are combined with AND logic. Every provision in the result must match every filter you specify:

# Appropriations over $1 billion in Division A (MilCon-VA)
congress-approp search --dir data --type appropriation --division A --min-dollars 1000000000

# Rescissions from the Department of Justice
congress-approp search --dir data --type rescission --agency "Justice"

# Directives in the VA supplemental
congress-approp search --dir data/118-hr9468 --type directive

Search by Semantic Meaning

Keyword search has a fundamental limitation: it only finds provisions that use the exact words you search for. If you search for “school lunch” but the bill says “Child Nutrition Programs,” keyword search returns nothing.

Semantic search solves this. It uses embedding vectors to understand the meaning of your query and rank provisions by conceptual similarity — even when the words don’t overlap at all.

Prerequisites: Semantic search requires OPENAI_API_KEY (to embed your query text at search time) and pre-computed embeddings for the bills you’re searching. The included example data has pre-computed embeddings, so you just need the API key.

export OPENAI_API_KEY="your-key-here"
congress-approp search --dir data --semantic "school lunch programs for kids" --top 5
┌──────┬───────────┬───────────────┬─────────────────────────────────────────────┬────────────────┬─────┐
│ Sim  ┆ Bill      ┆ Type          ┆ Description / Account                       ┆     Amount ($) ┆ Div │
╞══════╪═══════════╪═══════════════╪═════════════════════════════════════════════╪════════════════╪═════╡
│ 0.51 ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆ 33,266,226,000 ┆ B   │
│ 0.46 ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆     10,000,000 ┆ B   │
│ 0.45 ┆ H.R. 4366 ┆ rider         ┆ Pilot project grant recipients shall be r…  ┆              — ┆ B   │
│ 0.45 ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆     18,004,000 ┆ B   │
│ 0.44 ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆      5,000,000 ┆ B   │
└──────┴───────────┴───────────────┴─────────────────────────────────────────────┴────────────────┴─────┘
5 provisions found

The query “school lunch programs for kids” shares no keywords with “Child Nutrition Programs”, but semantic search matches them by meaning. The similarity score of 0.51 reflects the conceptual relationship between the query and the provision text.

More semantic search examples

Try these queries against the example data to get a feel for how semantic search finds provisions that keyword search would miss:

# "Fixing roads and bridges" → finds Highway Infrastructure Programs, Federal-Aid Highways
congress-approp search --dir data --semantic "money for fixing roads and bridges" --top 5

# "Space exploration" → finds NASA Exploration, Space Operations, Space Technology
congress-approp search --dir data --semantic "space exploration" --top 5

# "Clean energy" → finds Energy Efficiency and Renewable Energy, Nuclear Energy
congress-approp search --dir data --semantic "clean energy research" --top 5

Combining semantic search with filters

You can narrow semantic results with hard filters. For example, find only appropriation-type provisions about clean energy with at least $100 million:

congress-approp search --dir data --semantic "clean energy" --type appropriation --min-dollars 100000000 --top 10

The filters are applied first (hard constraints that must match), then the remaining provisions are ranked by semantic similarity.

When semantic search doesn’t help

Semantic search is not always the right tool:

  • Exact account name lookup: If you know the account name, use --account. It’s faster, deterministic, and doesn’t require an API key.
  • No conceptual match: If nothing in the dataset relates to your query, similarity scores will be low (below 0.40). Low scores are an honest answer — the tool isn’t hallucinating relevance.
  • Provision type distinction: Embeddings don’t strongly encode whether something is a rider vs. an appropriation. If you need only appropriations, add --type appropriation as a hard filter.

Get the Full Details in JSON

Once you’ve found interesting provisions in the table view, switch to JSON to see every field:

congress-approp search --dir data --account "Child Nutrition" --type appropriation --format json

This returns the full structured data for each matching provision, including fields the table truncates: raw_text (the full excerpt), semantics, detail_level, agency, division, notes, cross_references, and more.

For example, the top-level Child Nutrition Programs appropriation includes:

{
  "account_name": "Child Nutrition Programs",
  "agency": "Department of Agriculture",
  "bill": "H.R. 4366",
  "dollars": 33266226000,
  "semantics": "new_budget_authority",
  "detail_level": "top_level",
  "division": "B",
  "provision_type": "appropriation",
  "quality": "strong",
  "amount_status": "found",
  "match_tier": "exact",
  "raw_text": "For necessary expenses of the Food and Nutrition Service..."
}

Key fields to check:

  • semantics: new_budget_authority means this counts toward the budget authority total. reference_amount means it’s a sub-allocation or contextual amount.
  • detail_level: top_level is the main account appropriation. sub_allocation is an “of which” breakdown. line_item is a numbered item within a section.
  • quality: strong means the dollar amount was verified and the raw text matched the source. moderate or weak means something didn’t check out as well.
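These checks are easy to script against the JSON export. A minimal sketch using an inline sample in the same shape (illustrative values copied from the Child Nutrition example above):

```python
# Sample provisions mirroring the --format json output shape.
provisions = [
    {"account_name": "Child Nutrition Programs", "dollars": 33_266_226_000,
     "semantics": "new_budget_authority", "detail_level": "top_level", "quality": "strong"},
    {"account_name": "Child Nutrition Programs", "dollars": 18_004_000,
     "semantics": "reference_amount", "detail_level": "sub_allocation", "quality": "strong"},
]

# Apply the no-double-counting rule: only new budget authority, and never
# sub-allocations or proviso amounts, counts toward the total.
total = sum(p["dollars"] for p in provisions
            if p["semantics"] == "new_budget_authority"
            and p["detail_level"] not in ("sub_allocation", "proviso_amount"))
print(f"${total:,}")  # the "of which" sub-allocation is excluded
```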

Cross-Check Against the Source

For any provision you plan to cite, you can verify it directly against the bill XML. The raw_text field contains the excerpt, and the text_as_written dollar string can be searched in the source file:

# Find the dollar string in the source XML
grep '33,266,226,000' data/118-hr4366/BILLS-118hr4366enr.xml

If the string is found (which it will be — the audit confirms this), you know the extraction is accurate. For a full verification procedure, see Verify Extraction Accuracy.

Export for Further Analysis

Once you’ve identified the provisions you care about, export them for further work:

# CSV for Excel or Google Sheets
congress-approp search --dir data --account "Child Nutrition" --format csv > child_nutrition.csv

# JSON for Python, R, or jq
congress-approp search --dir data --agency "Veterans" --type appropriation --format json > va_appropriations.json

See Export Data for Spreadsheets and Scripts for detailed recipes.

Summary: Which Search Method to Use

Method           Flag                            Best For                              Limitations
Account name     --account                       Known program names                   Only matches the account_name field
Keyword          --keyword                       Terms that appear in bill text        Only finds exact word matches
Agency           --agency                        Department-level filtering            Case-insensitive substring match
Semantic         --semantic                      Finding provisions by meaning         Requires embeddings + OPENAI_API_KEY
Provision type   --type                          Filtering by category                 Relies on LLM classification accuracy
Division         --division                      Scoping to a part of an omnibus bill  Only applicable to multi-division bills
Dollar range     --min-dollars / --max-dollars   Finding large or small provisions     Only filters on absolute value

For the most thorough search, try multiple approaches. Start with --account or --keyword for precision, then use --semantic to catch provisions you might have missed with different terminology.

Next Steps

Compare Two Bills

You will need: congress-approp installed, access to the data/ directory.

You will learn: How to use the compare command to see which accounts gained, lost, or changed funding between two sets of bills.

One of the most common questions in appropriations analysis is: “What changed?” Maybe you’re comparing a continuing resolution to the full-year omnibus to see which programs got different treatment. Maybe you’re comparing this year’s omnibus to last year’s. Or maybe a supplemental added emergency funding on top of the base bill and you want to see exactly where the money went.

The compare command answers these questions by matching accounts across two sets of bills and computing the dollar difference.

Your First Comparison

Let’s compare the FY2024 omnibus (H.R. 4366) to the VA supplemental (H.R. 9468) to see which accounts got additional emergency funding:

congress-approp compare --base data/118-hr4366 --current data/118-hr9468

The tool first prints a warning:

⚠  Comparing Omnibus to Supplemental. Accounts in one but not the other may be expected
    — this does not necessarily indicate policy changes.

This is important context. A supplemental only touches a handful of accounts, so most accounts from the omnibus will show up as “only in base.” That’s expected — the supplemental didn’t eliminate those programs.

The comparison table follows, sorted by largest absolute change first:

┌─────────────────────────────────────┬──────────────────────┬─────────────────┬───────────────┬──────────────────┬─────────┬──────────────┐
│ Account                             ┆ Agency               ┆        Base ($) ┆   Current ($) ┆        Delta ($) ┆     Δ % ┆ Status       │
╞═════════════════════════════════════╪══════════════════════╪═════════════════╪═══════════════╪══════════════════╪═════════╪══════════════╡
│ Compensation and Pensions           ┆ Department of Veter… ┆ 197,382,903,000 ┆ 2,285,513,000 ┆ -195,097,390,000 ┆  -98.8% ┆ changed      │
│ Supplemental Nutrition Assistance … ┆ Department of Agric… ┆ 122,382,521,000 ┆             0 ┆ -122,382,521,000 ┆ -100.0% ┆ only in base │
│ Medical Services                    ┆ Department of Veter… ┆  71,000,000,000 ┆             0 ┆  -71,000,000,000 ┆ -100.0% ┆ only in base │
│ Child Nutrition Programs            ┆ Department of Agric… ┆  33,266,226,000 ┆             0 ┆  -33,266,226,000 ┆ -100.0% ┆ only in base │
│ ...                                                                                                                                       │
│ Readjustment Benefits               ┆ Department of Veter… ┆  13,774,657,000 ┆   596,969,000 ┆  -13,177,688,000 ┆  -95.7% ┆ changed      │
│ ...                                                                                                                                       │
└─────────────────────────────────────┴──────────────────────┴─────────────────┴───────────────┴──────────────────┴─────────┴──────────────┘

Understanding the Columns

Column        Meaning
Account       The appropriations account name, matched between the two bill sets
Agency        The parent department or agency
Base ($)      Total budget authority for this account in the --base bills
Current ($)   Total budget authority for this account in the --current bills
Delta ($)     Current minus Base
Δ %           Percentage change from base to current
Status        How the account appears across the two sets (see below)

Status values

Status            Meaning
changed           Account exists in both base and current with different dollar amounts
unchanged         Account exists in both with the same amount (rare in practice)
only in base      Account exists in the base bills but not in the current bills
only in current   Account exists in the current bills but not in the base bills

Interpreting Cross-Type Comparisons

The comparison above — omnibus vs. supplemental — is instructive but requires careful interpretation:

Why “Compensation and Pensions” shows -98.8%: The omnibus has $197B for Comp & Pensions (which includes mandatory spending). The supplemental has $2.3B. The compare command shows the raw dollar values in each set — it doesn’t add them together. The supplemental is additional funding on top of the omnibus, but the compare table shows the amounts within each set, not cumulative totals.

Why most accounts show “only in base”: The supplemental only funds two accounts (Comp & Pensions and Readjustment Benefits). Every other account in the omnibus has zero representation in the supplemental. This doesn’t mean those programs lost funding — it means the supplemental didn’t touch them.

The classification warning: The tool detects when you’re comparing different bill types (Omnibus vs. Supplemental, CR vs. Regular, etc.) and prints a warning. These cross-type comparisons can be misleading if you interpret “only in base” as “program eliminated.”

A More Natural Comparison: Filtering by Agency

To focus on just the accounts that matter, use --agency to narrow the comparison:

congress-approp compare --base data/118-hr4366 --current data/118-hr9468 --agency "Veterans"

This filters both sides to only show accounts from the Department of Veterans Affairs, making the comparison much easier to read. You’ll see the two “changed” accounts (Comp & Pensions and Readjustment Benefits) plus the VA accounts that are “only in base.”

When Compare Shines: Same-Type Comparisons

The compare command is most useful when comparing bills of the same type:

  • FY2023 omnibus → FY2024 omnibus: See which programs gained or lost funding year over year
  • House version → Senate version: Track differences during the conference process
  • FY2024 omnibus → FY2025 omnibus: Year-over-year trend analysis

To do this, extract both bills into separate directories, then:

# Example: comparing two fiscal years (requires extracting both bills first)
congress-approp compare --base data/117/hr/2471 --current data/118/hr/4366

Accounts are matched by (agency, account_name) with automatic normalization. Results are sorted by the absolute value of the delta, so the biggest changes appear first.

Handling Account Name Mismatches

The compare command matches accounts by exact normalized name. If Congress renames an account between fiscal years — say, “Cybersecurity and Infrastructure Security Agency” becomes “CISA Operations and Support” — the compare command will show the old name as “only in base” and the new name as “only in current” rather than matching them.

For accounts with different names that represent the same program, use the --similar flag on search to find the semantic match:

congress-approp search --dir data --similar 118-hr9468:0 --top 5

This uses embedding vectors to match by meaning rather than account name. See Track a Program Across Bills for details.

The compare --use-links flag uses persistent cross-bill relationships (created via link accept) to inform the matching, handling renames automatically. See Track a Program Across Bills for the full link workflow.

Export Comparisons

Like all query commands, compare supports multiple output formats:

# CSV for Excel analysis
congress-approp compare --base data/118-hr4366 --current data/118-hr9468 --format csv > comparison.csv

# JSON for programmatic processing
congress-approp compare --base data/118-hr4366 --current data/118-hr9468 --format json

The JSON output includes every field for each account delta:

[
  {
    "account_name": "Compensation and Pensions",
    "agency": "Department of Veterans Affairs",
    "base_dollars": 197382903000,
    "current_dollars": 2285513000,
    "delta": -195097390000,
    "delta_pct": -98.84,
    "status": "changed"
  }
]

This is useful for building year-over-year tracking dashboards or automated change reports.
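A minimal sketch of such processing, using inline records in the JSON shape shown above (values from the comparison earlier on this page):

```python
# Two account deltas in the compare --format json shape.
deltas = [
    {"account_name": "Readjustment Benefits",
     "base_dollars": 13_774_657_000, "current_dollars": 596_969_000,
     "delta": -13_177_688_000, "status": "changed"},
    {"account_name": "Compensation and Pensions",
     "base_dollars": 197_382_903_000, "current_dollars": 2_285_513_000,
     "delta": -195_097_390_000, "status": "changed"},
]

# Rank by magnitude of change, matching the CLI's default sort order.
for d in sorted(deltas, key=lambda d: abs(d["delta"]), reverse=True):
    print(f'{d["account_name"]}: {d["delta"]:+,}')
```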

Practical Examples

Which programs got the biggest increases?

congress-approp compare --base data/fy2023 --current data/fy2024 --format json | \
  jq '[.[] | select(.delta > 0)] | sort_by(-.delta) | .[:10]'

Which programs were eliminated?

congress-approp compare --base data/fy2023 --current data/fy2024 --format json | \
  jq '[.[] | select(.status == "only in base")] | sort_by(-.base_dollars)'

What’s new this year?

congress-approp compare --base data/fy2023 --current data/fy2024 --format json | \
  jq '[.[] | select(.status == "only in current")] | sort_by(-.current_dollars)'

Summary

The compare command is your tool for answering “what changed?” at the account level:

  • Use --base and --current to point at any two directories containing extracted bills
  • Results are sorted by the absolute value of the change — biggest impacts first
  • The --agency filter helps focus on specific departments
  • Pay attention to the classification warning when comparing different bill types
  • Export to CSV or JSON for further analysis
  • For accounts that change names between bills, use --similar semantic matching

Next Steps

Track a Program Across Bills

You will need: congress-approp installed, access to the data/ directory. Optionally: OPENAI_API_KEY for semantic search.

You will learn: How to follow a specific program’s funding across multiple bills using --similar, and how to interpret cross-bill matching results.

A single program — say, VA Compensation and Pensions — can appear in multiple bills within the same fiscal year: the full-year omnibus, a continuing resolution, and an emergency supplemental. Tracking it across all three tells you the complete funding story. But account names aren’t always consistent between bills, and keyword search only works when you know the exact terminology each bill uses.

The --similar flag solves this by using pre-computed embedding vectors to find provisions that mean the same thing, even when the words differ.

The Scenario

H.R. 9468 (the VA Supplemental) appropriated $2,285,513,000 for “Compensation and Pensions.” You want to find every related provision in the omnibus (H.R. 4366) and the continuing resolution (H.R. 5860).

Step 1: Identify the Source Provision

First, find the provision you want to track. You can use any search command to locate it:

congress-approp search --dir data/118-hr9468 --type appropriation
┌───┬───────────┬───────────────┬─────────────────────────────┬───────────────┬─────────┬─────┐
│ $ ┆ Bill      ┆ Type          ┆ Description / Account       ┆    Amount ($) ┆ Section ┆ Div │
╞═══╪═══════════╪═══════════════╪═════════════════════════════╪═══════════════╪═════════╪═════╡
│ ✓ ┆ H.R. 9468 ┆ appropriation ┆ Compensation and Pensions   ┆ 2,285,513,000 ┆         ┆     │
│ ✓ ┆ H.R. 9468 ┆ appropriation ┆ Readjustment Benefits       ┆   596,969,000 ┆         ┆     │
└───┴───────────┴───────────────┴─────────────────────────────┴───────────────┴─────────┴─────┘
2 provisions found

Compensation and Pensions is the first provision listed. To use --similar, you need the bill directory name and provision index. The directory is 118-hr9468 (the directory name inside data/), and the index is 0 (first provision, zero-indexed).

You can also see the index in JSON output:

congress-approp search --dir data/118-hr9468 --type appropriation --format json

Look for the "provision_index": 0 field in the first result.

Step 2: Find Similar Provisions Across All Bills

Now use --similar to find the closest matches across every loaded bill:

congress-approp search --dir data --similar 118-hr9468:0 --top 10
┌──────┬───────────┬───────────────┬────────────────────────────────┬─────────────────┬─────┐
│ Sim  ┆ Bill      ┆ Type          ┆ Description / Account          ┆      Amount ($) ┆ Div │
╞══════╪═══════════╪═══════════════╪════════════════════════════════╪═════════════════╪═════╡
│ 0.86 ┆ H.R. 4366 ┆ appropriation ┆ Compensation and Pensions      ┆ 182,310,515,000 ┆ A   │
│ 0.78 ┆ H.R. 4366 ┆ appropriation ┆ Compensation and Pensions      ┆  15,072,388,000 ┆ A   │
│ 0.73 ┆ H.R. 4366 ┆ limitation    ┆ Compensation and Pensions      ┆      22,109,000 ┆ A   │
│ 0.70 ┆ H.R. 9468 ┆ appropriation ┆ Readjustment Benefits          ┆     596,969,000 ┆     │
│ 0.68 ┆ H.R. 4366 ┆ rescission    ┆ Medical Support and Compliance ┆   1,550,000,000 ┆ A   │
│ ...                                                                                         │
└──────┴───────────┴───────────────┴────────────────────────────────┴─────────────────┴─────┘

This is the complete picture of Comp & Pensions across the dataset:

  1. 0.86 similarity — The omnibus’s main Comp & Pensions appropriation: $182.3 billion. This is the regular-year funding for the same account that the supplemental topped up by $2.3 billion.
  2. 0.78 similarity — The omnibus’s advance appropriation for Comp & Pensions: $15.1 billion. This is money enacted in FY2024 but available for FY2025.
  3. 0.73 similarity — A $22 million limitation on the Comp & Pensions account.
  4. 0.70 similarity — Readjustment Benefits from the same supplemental. This is a different VA account, but conceptually close because it’s also VA mandatory benefits.
  5. 0.68 similarity — A rescission of Medical Support and Compliance funds. Related VA account, lower similarity because it’s a different type of action (rescission vs. appropriation).

Why no CR matches?

The continuing resolution (H.R. 5860) doesn’t have a specific Comp & Pensions provision because CRs fund at the prior-year rate by default. Only the 13 programs with anomalies (CR substitutions) appear as explicit provisions. VA Comp & Pensions wasn’t one of them — it was simply continued at its prior-year level.

Step 3: How --similar Works Under the Hood

The --similar flag does not make any API calls. Here’s what happens:

  1. It looks up the embedding vector for 118-hr9468:0 from the pre-computed vectors.bin file
  2. It loads the embedding vectors for every provision in every bill under --dir
  3. It computes the cosine similarity between the source vector and every other vector
  4. It ranks by similarity descending and returns the top N results

Because everything is pre-computed and stored locally, this operation takes less than a millisecond for 2,500 provisions. The only prerequisite is that embeddings have been generated (via congress-approp embed) for all the bills you want to search.
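The ranking step is plain cosine similarity over the stored vectors. Here is a minimal Python sketch of the same idea, using toy 3-dimensional vectors in place of real embeddings (the function names and data are illustrative, not the tool's actual implementation):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_n(source, candidates, n):
    # Score every candidate against the source, then rank descending
    scored = [(cosine_similarity(source, vec), key) for key, vec in candidates.items()]
    return sorted(scored, reverse=True)[:n]

# Toy vectors standing in for per-provision embedding vectors
vectors = {"a": [1.0, 0.0, 0.0], "b": [0.9, 0.1, 0.0], "c": [0.0, 1.0, 0.0]}
print(top_n([1.0, 0.0, 0.1], vectors, 2))
```

Because the comparison is a handful of multiply-adds per vector, scanning a few thousand provisions is effectively instantaneous.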

Step 4: Interpret Similarity Scores

The similarity score tells you how closely related two provisions are in “meaning space”:

| Score | Interpretation | Example |
|---|---|---|
| > 0.80 | Almost certainly the same program | VA Supp “Comp & Pensions” ↔ Omnibus “Comp & Pensions” (0.86) |
| 0.60 – 0.80 | Related topic, same policy area | “Comp & Pensions” ↔ “Medical Support and Compliance” (0.68) |
| 0.45 – 0.60 | Loosely related | VA provisions ↔ non-VA provisions with similar structure |
| < 0.45 | Probably not meaningfully related | VA provisions ↔ transportation or energy provisions |

For cross-bill tracking, focus on matches above 0.75 — these are very likely the same account in a different bill.
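If you are scripting over search output, the rubric above is easy to encode as a helper. A sketch (the thresholds come from the table; the function name is ours):

```python
def interpret_similarity(score):
    # Bands follow the interpretation rubric above
    if score > 0.80:
        return "almost certainly the same program"
    if score >= 0.60:
        return "related topic, same policy area"
    if score >= 0.45:
        return "loosely related"
    return "probably not meaningfully related"

for s in (0.86, 0.68, 0.50, 0.30):
    print(f"{s:.2f}: {interpret_similarity(s)}")
```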

Step 5: Track the Second Account

Repeat for Readjustment Benefits (provision index 1 in the supplemental):

congress-approp search --dir data --similar 118-hr9468:1 --top 5
┌──────┬───────────┬───────────────┬────────────────────────────────┬─────────────────┬─────┐
│ Sim  ┆ Bill      ┆ Type          ┆ Description / Account          ┆      Amount ($) ┆ Div │
╞══════╪═══════════╪═══════════════╪════════════════════════════════╪═════════════════╪═════╡
│ 0.88 ┆ H.R. 4366 ┆ appropriation ┆ Readjustment Benefits          ┆  13,399,805,000 ┆ A   │
│ 0.76 ┆ H.R. 9468 ┆ appropriation ┆ Compensation and Pensions      ┆   2,285,513,000 ┆     │
│ ...                                                                                         │
└──────┴───────────┴───────────────┴────────────────────────────────┴─────────────────┴─────┘

Top match at 0.88: the omnibus Readjustment Benefits account at $13.4 billion. The supplemental added $597 million on top of that.

When Account Names Differ Between Bills

The example data happens to use the same account names across bills, but this isn’t always the case. Continuing resolutions often use hierarchical names like:

  • CR: "Rural Housing Service—Rural Community Facilities Program Account"
  • Omnibus: "Rural Community Facilities Program Account"

Keyword matching would miss this, but --similar handles it because the embeddings capture the meaning of the provision, not just the words.

To demonstrate, let’s find the omnibus counterparts of the CR substitutions that have different naming conventions:

# First, find a CR substitution provision index
congress-approp search --dir data/118-hr5860 --type cr_substitution --format json
# Note: the first CR substitution (Rural Housing) is at some index — check provision_index

# Then find similar provisions in the omnibus
congress-approp search --dir data --similar 118-hr5860:<INDEX> --top 3

Even though “Rural Housing Service—Rural Community Facilities Program Account” and “Rural Community Facilities Program Account” are different strings, the embedding similarity will be in the 0.75–0.80 range — high enough to confidently identify them as the same program.

Building a Funding Timeline

Once you can match accounts across bills, you can assemble a complete funding picture. For VA Comp & Pensions in FY2024:

| Source | Amount | Type |
|---|---|---|
| H.R. 4366 (Omnibus) | $182,310,515,000 | Regular appropriation |
| H.R. 4366 (Omnibus) | $15,072,388,000 | Advance appropriation (FY2025) |
| H.R. 9468 (Supplemental) | $2,285,513,000 | Emergency supplemental |
| H.R. 5860 (CR) | (prior-year rate) | No explicit provision — funded by CR baseline |
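Assembled programmatically, this picture is a short dictionary-and-sum exercise. A sketch (figures copied from the timeline above; the CR contributes no explicit amount):

```python
# VA Comp & Pensions enacted for FY2024, by source bill
fy2024_comp_pensions = {
    "H.R. 4366 regular appropriation": 182_310_515_000,
    "H.R. 4366 advance appropriation (FY2025)": 15_072_388_000,
    "H.R. 9468 emergency supplemental": 2_285_513_000,
}

# H.R. 5860 (CR) adds no explicit provision; funding continued at the prior-year rate
total = sum(fy2024_comp_pensions.values())
print(f"Total enacted for the account: ${total:,}")  # $199,668,416,000
```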

With multiple fiscal years extracted, you could extend this to a multi-year timeline. The --similar command makes cross-year matching possible even when account names evolve.

Deep-Dive with Relate

The relate command provides a focused view of one provision across all bills, with a fiscal year timeline showing advance/current/supplemental splits:

# Trace VA Compensation and Pensions across all fiscal years
congress-approp relate 118-hr9468:0 --dir data --fy-timeline

Each match includes a deterministic 8-character hash that you can use to persist the relationship.

You can save cross-bill relationships using the link system, so they persist across sessions and can be used by compare --use-links:

# Discover link candidates from embeddings
congress-approp link suggest --dir data --scope cross --limit 20

# Accept specific matches by hash (from relate or link suggest output)
congress-approp link accept --dir data a3f7b2c4 e5d1c8a9

# Or batch-accept all verified + high-confidence candidates
congress-approp link accept --dir data --auto

# Use accepted links in compare to handle renames
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data --use-links

# View and manage saved links
congress-approp link list --dir data
congress-approp link remove --dir data a3f7b2c4

Links are stored at <dir>/links/links.json and are indexed by deterministic hashes — the same provision pair always produces the same hash, so you can script the workflow reliably. See Enrich Bills with Metadata and the CLI Reference for details.

Together, these commands enable automatic cross-year matching even when account names change, with human review reserved for ambiguous cases.

Tips for Cross-Bill Tracking

  1. Start from the smaller bill. If you’re tracking between a supplemental (7 provisions) and an omnibus (2,364 provisions), start from the supplemental and search into the omnibus. It’s easier to review 5–10 matches than 2,364.

  2. Use --top 3 to reduce noise. You rarely need more than the top 3 matches. The best match is almost always the right one.

  3. Combine with --type for precision. If you’re matching appropriations, add --type appropriation to exclude riders, directives, and other provision types from the results:

    congress-approp search --dir data --similar 118-hr9468:0 --type appropriation --top 5
    
  4. Check both directions. If provision A in bill X matches provision B in bill Y at 0.85, provision B in bill Y should also match provision A in bill X at a similar score. If it doesn’t, something is off.

  5. Low max similarity means the program is unique. If your source provision’s best match in another bill is below 0.55, the program may genuinely not exist in that bill. This is useful for identifying new programs or eliminated ones.

Summary

| Task | Command |
|---|---|
| Find the omnibus version of a supplemental provision | search --dir data --similar 118-hr9468:0 --top 3 |
| Find related provisions across all bills | search --dir data --similar 118-hr4366:42 --top 10 |
| Restrict matches to appropriations only | search --dir data --similar 118-hr9468:0 --type appropriation --top 5 |
| Find provisions in a specific bill | search --dir data/118-hr4366 --similar 118-hr9468:0 --top 5 |

Next Steps

Extract Your Own Bill

You will need: congress-approp installed, CONGRESS_API_KEY (free), ANTHROPIC_API_KEY. Optionally: OPENAI_API_KEY for embeddings.

You will learn: How to go from zero to queryable data — downloading a bill from Congress.gov, extracting provisions with Claude, verifying the results, and optionally generating embeddings for semantic search.

The included example data covers three FY2024 bills, but there are dozens of enacted appropriations bills across recent congresses. This tutorial walks you through the full pipeline for extracting any bill you want.

Step 1: Get Your API Keys

You need two keys to run the full pipeline. A third is optional for semantic search.

| Key | Purpose | Cost | Sign Up |
|---|---|---|---|
| CONGRESS_API_KEY | Download bill XML from Congress.gov | Free | api.congress.gov/sign-up |
| ANTHROPIC_API_KEY | Extract provisions using Claude | Pay-per-use | console.anthropic.com |
| OPENAI_API_KEY | Generate embeddings for semantic search (optional) | Pay-per-use | platform.openai.com |

Set them in your shell:

export CONGRESS_API_KEY="your-congress-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
# Optional:
export OPENAI_API_KEY="your-openai-key"

Step 2: Test Connectivity

Verify that your API keys work before spending time on a full extraction:

congress-approp api test

This checks both the Congress.gov and Anthropic APIs. You should see confirmation that both are reachable and your keys are valid.

Step 3: Discover Available Bills

Use the api bill list command to see what appropriations bills exist for a given congress:

# List all appropriations bills for the 118th Congress (2023-2024)
congress-approp api bill list --congress 118

# List only enacted appropriations bills
congress-approp api bill list --congress 118 --enacted-only

The --enacted-only flag filters to bills that were signed into law — these are the ones that actually became binding spending authority. You’ll see a list with bill type, number, title, and status.

Congress numbers

Each Congress spans two years:

| Congress | Years | Example |
|---|---|---|
| 117th | 2021–2022 | FY2022 and FY2023 bills |
| 118th | 2023–2024 | FY2024 and FY2025 bills |
| 119th | 2025–2026 | FY2026 bills |

Bill type codes

When downloading a specific bill, you need the bill type code:

| Code | Meaning | Example |
|---|---|---|
| hr | House bill | H.R. 4366 |
| s | Senate bill | S. 1234 |
| hjres | House joint resolution | H.J.Res. 100 |
| sjres | Senate joint resolution | S.J.Res. 50 |

Most enacted appropriations bills originate in the House (hr): the Constitution requires revenue bills to originate there, and by long-standing practice appropriations bills do as well.

Step 4: Download the Bill

Download a single bill

If you know the specific bill you want:

congress-approp download --congress 118 --type hr --number 9468 --output-dir data

This fetches the enrolled (final, signed into law) XML from Congress.gov and saves it to data/118/hr/9468/BILLS-118hr9468enr.xml.

Download all enacted bills for a congress

To get everything at once:

congress-approp download --congress 118 --enacted-only --output-dir data

This scans for all enacted appropriations bills in the 118th Congress and downloads their enrolled XML. It may take a minute or two depending on how many bills there are.

Preview without downloading

Use --dry-run to see what would be downloaded without actually fetching anything:

congress-approp download --congress 118 --enacted-only --output-dir data --dry-run

Step 5: Preview the Extraction (Dry Run)

Before making any LLM API calls, preview what the extraction will look like:

congress-approp extract --dir data/118/hr/9468 --dry-run

The dry run shows you:

  • Chunk count: How many chunks the bill will be split into. Small bills (like the VA supplemental) are a single chunk. The FY2024 omnibus splits into 75 chunks.
  • Estimated input tokens: How many tokens will be sent to the LLM. This helps you estimate cost before committing.

Here’s what to expect for different bill sizes:

| Bill Type | Typical XML Size | Chunks | Estimated Input Tokens |
|---|---|---|---|
| Supplemental (small) | ~10 KB | 1 | ~1,200 |
| Continuing Resolution | ~130 KB | 5 | ~25,000 |
| Omnibus (large) | ~1.8 MB | 75 | ~315,000 |
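For a quick pre-flight sanity check, the numbers above imply roughly 25 KB of XML per chunk. A back-of-envelope estimator (the 25 KB/chunk figure is inferred from these examples, not a documented constant; the real chunker splits on structural boundaries, so treat this as an order-of-magnitude guide only):

```python
def estimate_chunks(xml_bytes, chunk_bytes=25_000):
    # Rough rule of thumb: ~25 KB of XML per chunk (inferred, not guaranteed)
    return max(1, round(xml_bytes / chunk_bytes))

for size in (10_000, 130_000, 1_800_000):
    print(f"{size:>9} bytes -> ~{estimate_chunks(size)} chunks")
```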

Step 6: Run the Extraction

Now run the actual extraction:

congress-approp extract --dir data/118/hr/9468

For the small VA supplemental, this completes in under a minute. Here’s what happens:

  1. Parse: The XML is parsed to extract clean text and identify chunk boundaries
  2. Extract: Each chunk is sent to Claude with a detailed system prompt defining every provision type
  3. Merge: Provisions from all chunks are combined into a single list
  4. Compute: Budget authority totals are computed from the individual provisions (never trusting the LLM’s arithmetic)
  5. Verify: Every dollar amount and text excerpt is checked against the source XML
  6. Write: All artifacts are saved to disk

Controlling parallelism

For large bills with many chunks, you can control how many LLM calls run simultaneously:

# Default: 5 concurrent calls
congress-approp extract --dir data/118/hr/4366

# Faster but uses more API quota
congress-approp extract --dir data/118/hr/4366 --parallel 8

# Conservative — one at a time
congress-approp extract --dir data/118/hr/4366 --parallel 1

Higher parallelism is faster but may hit API rate limits. The default of 5 is a good balance.

Using a different model

By default, extraction uses claude-opus-4-6. You can override this:

# Via flag
congress-approp extract --dir data/118/hr/9468 --model claude-sonnet-4-20250514

# Via environment variable
export APPROP_MODEL="claude-sonnet-4-20250514"
congress-approp extract --dir data/118/hr/9468

Caution: The system prompt and expected output format are tuned for Claude Opus. Other models may produce lower-quality extractions with more classification errors or missing provisions. Always check the audit output after extracting with a non-default model.

Progress display

For multi-chunk bills, a progress dashboard shows real-time status:

  5/42, 187 provs [4m 23s] 842 tok/s | 📝A-IIb ~8K 180/s | 🤔B-I ~3K | 📝B-III ~1K 95/s

This tells you: 5 of 42 chunks complete, 187 provisions extracted so far, running for 4 minutes 23 seconds, with three chunks currently being processed.

Step 7: Check the Output Files

After extraction, your bill directory contains several new files:

data/118/hr/9468/
├── BILLS-118hr9468enr.xml     ← Source XML (downloaded in Step 4)
├── extraction.json            ← All provisions with amounts, accounts, sections
├── verification.json          ← Deterministic checks against source text
├── metadata.json              ← Model name, prompt version, timestamps, source hash
├── tokens.json                ← LLM token usage (input, output, cache hits)
└── chunks/                    ← Per-chunk LLM artifacts (thinking traces, raw responses)

| File | What It Contains |
|---|---|
| extraction.json | The main output: every extracted provision with structured fields. This is the file all query commands read. |
| verification.json | Deterministic verification: dollar amount checks, raw text matching, completeness analysis. No LLM involved. |
| metadata.json | Provenance: which model was used, prompt version, extraction timestamp, SHA-256 of the source XML. |
| tokens.json | Token usage: input tokens, output tokens, cache read/create tokens, total API calls. |
| chunks/ | Per-chunk artifacts: the model’s thinking content, raw response, parsed JSON, and conversion report for each chunk. These are local provenance records, gitignored by default. |
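Because metadata.json records a SHA-256 of the source XML, you can re-verify provenance yourself with a few lines of Python. A sketch, assuming the hash is stored under a key named source_sha256 (that key name is our guess; inspect your metadata.json for the actual field):

```python
import hashlib
import json
from pathlib import Path

def verify_source_hash(bill_dir):
    bill_dir = Path(bill_dir)
    meta = json.loads((bill_dir / "metadata.json").read_text())
    # Hypothetical key name; check your metadata.json for the real one
    recorded = meta["source_sha256"]
    xml_path = next(bill_dir.glob("BILLS-*.xml"))
    actual = hashlib.sha256(xml_path.read_bytes()).hexdigest()
    return recorded == actual

# verify_source_hash("data/118/hr/9468")
```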

Step 8: Verify the Extraction

Run the audit command to check quality:

congress-approp audit --dir data/118/hr/9468
┌───────────┬────────────┬──────────┬──────────┬───────┬───────┬──────────┬───────────┬──────────┬──────────┐
│ Bill      ┆ Provisions ┆ Verified ┆ NotFound ┆ Ambig ┆ Exact ┆ NormText ┆ Spaceless ┆ TextMiss ┆ Coverage │
╞═══════════╪════════════╪══════════╪══════════╪═══════╪═══════╪══════════╪═══════════╪══════════╪══════════╡
│ H.R. 9468 ┆          7 ┆        2 ┆        0 ┆     0 ┆     5 ┆        0 ┆         0 ┆        2 ┆   100.0% │
└───────────┴────────────┴──────────┴──────────┴───────┴───────┴──────────┴───────────┴──────────┴──────────┘

What to check:

  1. NotFound should be 0. If any dollar amounts weren’t found in the source text, investigate with audit --verbose.
  2. Exact should be high. This means the raw text excerpts are byte-identical to the source — the LLM copied the text faithfully.
  3. Coverage ideally ≥ 90%. Coverage below 100% isn’t necessarily a problem — see What Coverage Means.

If NotFound > 0, run the verbose audit to see which provisions failed:

congress-approp audit --dir data/118/hr/9468 --verbose

This lists each problematic provision with its dollar string, allowing you to manually check against the source XML.

Step 9: Query Your Data

All the same commands you used with the example data now work on your extracted bill:

# Summary
congress-approp summary --dir data/118/hr/9468

# Search for specific provisions
congress-approp search --dir data/118/hr/9468 --type appropriation

# Compare with the examples
congress-approp compare --base data/118-hr4366 --current data/118/hr/9468

You can also point --dir at a parent directory to load multiple bills at once:

# Load everything under data/
congress-approp summary --dir data

# Search across all extracted bills
congress-approp search --dir data --keyword "Veterans Affairs"

The loader walks recursively from whatever --dir you specify, finding every extraction.json file.
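The same discovery logic is easy to reproduce when you want to process extractions in your own scripts. A minimal sketch (the nested bill/identifier field matches the extraction.json structure shown later in this book):

```python
import json
from pathlib import Path

def find_extractions(root):
    # Mirror the loader: find every extraction.json under root, recursively
    return sorted(Path(root).rglob("extraction.json"))

def load_bills(root):
    bills = {}
    for path in find_extractions(root):
        data = json.loads(path.read_text())
        bills[data["bill"]["identifier"]] = data
    return bills

# bills = load_bills("data")
# print(list(bills))
```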

Step 10 (Optional): Generate Embeddings

If you want semantic search and --similar matching for your newly extracted bill, generate embeddings:

export OPENAI_API_KEY="your-key"
congress-approp embed --dir data/118/hr/9468

This sends each provision’s text to OpenAI’s text-embedding-3-large model and saves the vectors locally. For a small bill (7 provisions), this takes a few seconds. For the omnibus (2,364 provisions), about 30 seconds.

Preview token usage

congress-approp embed --dir data/118/hr/9468 --dry-run

Shows how many provisions would be embedded and estimated token count without making any API calls.

After embedding

Now semantic search works on your bill:

congress-approp search --dir data --semantic "school lunch programs" --top 5
congress-approp search --dir data --similar 118-hr9468:0 --top 5

The embed command writes two files:

  • embeddings.json — Metadata: model name, dimensions, provision count, SHA-256 of the extraction it was built from
  • vectors.bin — Binary float32 vectors (count × dimensions × 4 bytes)
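Given that layout (count × dimensions × 4-byte float32 values), vectors.bin can be read with the standard library alone. A sketch, assuming embeddings.json exposes the count and dimensions (the field names provision_count and dimensions are our guess; check your embeddings.json):

```python
import array
import json
from pathlib import Path

def load_vectors(bill_dir):
    bill_dir = Path(bill_dir)
    meta = json.loads((bill_dir / "embeddings.json").read_text())
    # Field names are assumptions; inspect your embeddings.json
    count, dims = meta["provision_count"], meta["dimensions"]
    floats = array.array("f")  # native-endian float32
    floats.frombytes((bill_dir / "vectors.bin").read_bytes())
    assert len(floats) == count * dims, "size mismatch: stale or truncated file?"
    # Slice the flat buffer into one vector per provision
    return [floats[i * dims:(i + 1) * dims] for i in range(count)]
```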

See Generate Embeddings for detailed options.

Re-Extracting a Bill

If you want to re-extract a bill — perhaps with a newer model or after a schema update — simply run extract again. It will overwrite the existing extraction.json and verification.json.

After re-extracting, the embeddings become stale. The tool detects this via the hash chain and warns you:

⚠ H.R. 9468: embeddings are stale (extraction.json has changed)

Run embed again to regenerate them.

If you only need to re-verify without re-extracting (for example, after a schema upgrade), use the upgrade command instead:

congress-approp upgrade --dir data/118/hr/9468

This re-deserializes the existing extraction through the current code’s schema, re-runs verification, and updates the files — no LLM calls needed. See Upgrade Extraction Data for details.

Estimating Costs

The tokens.json file records exact token usage after extraction. Here are typical numbers from the example bills:

| Bill | Type | Chunks | Input Tokens | Output Tokens |
|---|---|---|---|---|
| H.R. 9468 | Supplemental (9 KB XML) | 1 | ~1,200 | ~1,500 |
| H.R. 5860 | CR (131 KB XML) | 5 | ~25,000 | ~15,000 |
| H.R. 4366 | Omnibus (1.8 MB XML) | 75 | ~315,000 | ~200,000 |

Embedding costs are much lower — approximately $0.01 per bill for text-embedding-3-large.

Use extract --dry-run and embed --dry-run to preview token counts before committing to API calls.

Quick Reference: Full Pipeline

Here’s the complete sequence for extracting a bill from scratch:

# 1. Set API keys
export CONGRESS_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export OPENAI_API_KEY="..."  # optional, for embeddings

# 2. Find the bill
congress-approp api bill list --congress 118 --enacted-only

# 3. Download
congress-approp download --congress 118 --type hr --number 4366 --output-dir data

# 4. Preview extraction
congress-approp extract --dir data/118/hr/4366 --dry-run

# 5. Extract
congress-approp extract --dir data/118/hr/4366 --parallel 6

# 6. Verify
congress-approp audit --dir data/118/hr/4366

# 7. Generate embeddings (optional)
congress-approp embed --dir data/118/hr/4366

# 8. Query
congress-approp summary --dir data
congress-approp search --dir data --type appropriation

Troubleshooting

“No XML files found”

Make sure you downloaded the bill first (congress-approp download). The extract command looks for BILLS-*.xml files in the specified directory.

“Rate limited” errors during extraction

Reduce parallelism: extract --parallel 2. Anthropic’s API has per-minute token limits that can be exceeded with high concurrency on large bills.

Low coverage after extraction

Run audit --verbose to see which dollar amounts in the source text weren’t captured. Common causes:

  • Statutory cross-references: Dollar amounts from other laws cited in the bill text — correctly excluded
  • Struck amounts: “Striking ‘$50,000’ and inserting ‘$75,000’” — the old amount shouldn’t be extracted
  • Loan guarantee ceilings: Not budget authority — correctly excluded

If legitimate provisions are missing, consider re-extracting with a higher-capability model.

Stale embeddings warning

After re-extracting, the hash chain detects that extraction.json has changed but embeddings.json still references the old version. Run congress-approp embed --dir <path> to regenerate.

Next Steps

Export Data for Spreadsheets and Scripts

You will need: congress-approp installed, access to the data/ directory.

You will learn: How to get appropriations data into Excel, Google Sheets, Python, R, and shell pipelines using the four output formats: CSV, JSON, JSONL, and table.

The congress-approp CLI is great for interactive exploration, but most analysis workflows eventually need the data in another tool — a spreadsheet for a briefing, a pandas DataFrame for statistical analysis, or a jq pipeline for automation. Every query command supports four output formats via the --format flag, and this tutorial shows you how to use each one effectively.

CSV for Spreadsheets

CSV is the most portable format for getting data into Excel, Google Sheets, LibreOffice Calc, or any other spreadsheet application.

Basic export

congress-approp search --dir data --type appropriation --format csv > appropriations.csv

This writes a file with a header row and one row per matching provision. Here’s what the first few lines look like:

bill,provision_type,account_name,description,agency,dollars,old_dollars,semantics,detail_level,section,division,raw_text,amount_status,match_tier,quality,provision_index
H.R. 9468,appropriation,Compensation and Pensions,Compensation and Pensions,Department of Veterans Affairs,2285513000,,new_budget_authority,,,,For an additional amount for ''Compensation and Pensions''...,found,exact,strong,0
H.R. 9468,appropriation,Readjustment Benefits,Readjustment Benefits,Department of Veterans Affairs,596969000,,new_budget_authority,,,,For an additional amount for ''Readjustment Benefits''...,found,exact,strong,1

Columns in CSV output

The CSV includes the same fields as the JSON output, flattened into columns:

| Column | Description |
|---|---|
| bill | Bill identifier (e.g., “H.R. 4366”) |
| provision_type | Type: appropriation, rescission, rider, etc. |
| account_name | The appropriations account name |
| description | Description of the provision |
| agency | Parent department or agency |
| dollars | Dollar amount as a plain integer (no commas or $) |
| old_dollars | For CR substitutions: the old amount being replaced |
| semantics | What the amount means: new_budget_authority, rescission, reference_amount, etc. |
| section | Section reference (e.g., “SEC. 101”) |
| division | Division letter for omnibus bills (e.g., “A”) |
| amount_status | Verification result: found, found_multiple, not_found |
| quality | Overall quality: strong, moderate, weak, n/a |
| raw_text | Excerpt of the actual bill language |
| provision_index | Position in the bill’s provision array (zero-indexed) |
| match_tier | How raw_text matched the source: exact, normalized, spaceless, no_match |
| fiscal_year | Fiscal year the provision is for (appropriations only) |
| detail_level | Structural granularity: top_level, line_item, sub_allocation, proviso_amount |
| confidence | LLM confidence score (0.00–1.00) |

⚠️ Don’t sum the dollars column directly. The export includes sub-allocations and reference amounts that would double-count money already in a parent line item. Without filtering, a naive sum can overcount budget authority by 2x or more.

To compute correct budget authority totals:

  • Filter to semantics == new_budget_authority
  • Exclude detail_level == sub_allocation and detail_level == proviso_amount

Or use congress-approp summary which does this correctly and automatically.

Computing totals correctly

In Excel or Google Sheets:

  1. Open the CSV
  2. Add a filter on the semantics column → select only new_budget_authority
  3. Add a filter on the detail_level column → deselect sub_allocation and proviso_amount
  4. Sum the filtered dollars column

With jq (command line):

congress-approp search --dir data --type appropriation --format jsonl \
  | jq -s '[.[] | select(.semantics == "new_budget_authority" and .detail_level != "sub_allocation" and .detail_level != "proviso_amount") | .dollars] | add'

With Python:

import csv
with open("provisions.csv") as f:
    rows = list(csv.DictReader(f))
ba = sum(int(r["dollars"]) for r in rows
         if r["dollars"]
         and r["semantics"] == "new_budget_authority"
         and r["detail_level"] not in ("sub_allocation", "proviso_amount"))
print(f"Budget Authority: ${ba:,}")

Tip: When you export to CSV/JSON/JSONL, the tool prints a summary to stderr showing how many provisions have each semantics type and the budget authority total. Watch for this — it tells you immediately whether filtering is needed.

Opening in Excel

  1. Open Excel
  2. File → Open → navigate to your .csv file
  3. If Excel doesn’t auto-detect columns, use Data → From Text/CSV and select UTF-8 encoding
  4. The dollars column will be numeric — you can format it as currency or with comma separators

Gotchas to watch for:

  • Large numbers: Excel may display very large dollar amounts in scientific notation (e.g., 8.46E+11). Format the column as Number with 0 decimal places.
  • Leading zeros: Not an issue here since bill numbers don’t have leading zeros, but be aware that CSV import can strip them in other contexts.
  • UTF-8 characters: Bill text contains em-dashes (—), curly quotes, and other Unicode characters. Make sure your import specifies UTF-8 encoding. On Windows, this sometimes requires the “From Text/CSV” import wizard rather than a simple File → Open.
  • Commas in text: The raw_text and description fields may contain commas. The CSV output properly quotes these fields, but some older CSV parsers may not handle quoted fields correctly.

Opening in Google Sheets

  1. Go to Google Sheets → File → Import → Upload
  2. Select your .csv file
  3. Import location: “Replace current sheet” or “Insert new sheet”
  4. Separator type: Comma (should auto-detect)
  5. Google Sheets handles UTF-8 natively — no encoding issues

Useful CSV exports

# All appropriations across all example bills
congress-approp search --dir data --type appropriation --format csv > all_appropriations.csv

# Just the VA accounts
congress-approp search --dir data --agency "Veterans" --format csv > va_provisions.csv

# Rescissions over $100 million
congress-approp search --dir data --type rescission --min-dollars 100000000 --format csv > big_rescissions.csv

# CR substitutions with old and new amounts
congress-approp search --dir data --type cr_substitution --format csv > cr_anomalies.csv

# Everything in Division A (MilCon-VA)
congress-approp search --dir data/118-hr4366 --division A --format csv > milcon_va.csv

# Summary table as CSV
congress-approp summary --dir data --format csv > bill_summary.csv

JSON for Programmatic Use

JSON output includes every field for each matching provision as an array of objects. It’s the richest output format and the best choice for Python, JavaScript, R, or any other programming language.

Basic export

congress-approp search --dir data/118-hr9468 --type appropriation --format json
[
  {
    "account_name": "Compensation and Pensions",
    "agency": "Department of Veterans Affairs",
    "amount_status": "found",
    "bill": "H.R. 9468",
    "description": "Compensation and Pensions",
    "division": "",
    "dollars": 2285513000,
    "match_tier": "exact",
    "old_dollars": null,
    "provision_index": 0,
    "provision_type": "appropriation",
    "quality": "strong",
    "raw_text": "For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended.",
    "section": "",
    "semantics": "new_budget_authority"
  },
  {
    "account_name": "Readjustment Benefits",
    "agency": "Department of Veterans Affairs",
    "amount_status": "found",
    "bill": "H.R. 9468",
    "description": "Readjustment Benefits",
    "division": "",
    "dollars": 596969000,
    "match_tier": "exact",
    "old_dollars": null,
    "provision_index": 1,
    "provision_type": "appropriation",
    "quality": "strong",
    "raw_text": "For an additional amount for ''Readjustment Benefits'', $596,969,000, to remain available until expended.",
    "section": "",
    "semantics": "new_budget_authority"
  }
]

Five jq One-Liners Every Analyst Needs

If you have jq installed (a lightweight JSON processor), you can do filtering and aggregation directly from the command line:

1. Total budget authority across all appropriations:

congress-approp search --dir data --type appropriation --format json | \
  jq '[.[] | select(.semantics == "new_budget_authority") | .dollars] | add'
862137099554

2. Top 10 accounts by dollar amount:

congress-approp search --dir data --type appropriation --format json | \
  jq -r '[.[] | select(.dollars != null)] | sort_by(-.dollars) | .[:10] | .[] | "\(.dollars)\t\(.account_name)"'

3. Group by agency and sum budget authority:

congress-approp search --dir data --type appropriation --format json | \
  jq 'group_by(.agency) | map({
    agency: .[0].agency,
    total: [.[] | .dollars // 0] | add,
    count: length
  }) | sort_by(-.total) | .[:10]'

4. Find all provisions in Division A over $1 billion:

congress-approp search --dir data --format json | \
  jq '[.[] | select(.division == "A" and (.dollars // 0) > 1000000000)]'

5. Extract just account names (unique, sorted):

congress-approp search --dir data --type appropriation --format json | \
  jq -r '[.[].account_name] | unique | .[]'

Loading JSON in Python

import json

# Method 1: From a file
with open("appropriations.json") as f:
    provisions = json.load(f)

# Method 2: From subprocess
import subprocess
result = subprocess.run(
    ["congress-approp", "search", "--dir", "data",
     "--type", "appropriation", "--format", "json"],
    capture_output=True, text=True
)
provisions = json.loads(result.stdout)

# Work with the data
for p in provisions:
    if p["dollars"] and p["dollars"] > 1_000_000_000:
        print(f"{p['account_name']}: ${p['dollars']:,.0f}")

Loading JSON in pandas

import pandas as pd
import json

# Load search output
df = pd.read_json("appropriations.json")

# Basic analysis
print(f"Total provisions: {len(df)}")
print(f"Total BA: ${df[df['semantics'] == 'new_budget_authority']['dollars'].sum():,.0f}")
print(f"\nBy agency:")
print(df.groupby("agency")["dollars"].sum().sort_values(ascending=False).head(10))

Loading JSON in R

library(jsonlite)

provisions <- fromJSON("appropriations.json")

# Filter to appropriations with budget authority
ba <- provisions[provisions$semantics == "new_budget_authority" & !is.na(provisions$dollars), ]

# Top 10 by dollars
head(ba[order(-ba$dollars), c("account_name", "agency", "dollars")], 10)

JSONL for Streaming

JSONL (JSON Lines) outputs one JSON object per line, with no enclosing array brackets. This is ideal for:

  • Streaming processing (each line is independently parseable)
  • Piping to while read loops in shell scripts
  • Processing very large result sets without loading everything into memory
  • Tools like xargs and parallel

Basic usage

congress-approp search --dir data --type appropriation --format jsonl

Each line is a complete JSON object:

{"account_name":"Compensation and Pensions","agency":"Department of Veterans Affairs","amount_status":"found","bill":"H.R. 9468","description":"Compensation and Pensions","division":"","dollars":2285513000,...}
{"account_name":"Readjustment Benefits","agency":"Department of Veterans Affairs","amount_status":"found","bill":"H.R. 9468","description":"Readjustment Benefits","division":"","dollars":596969000,...}
...

Shell processing examples

# Count provisions per bill
congress-approp search --dir data --format jsonl | \
  jq -r '.bill' | sort | uniq -c | sort -rn

# Extract account names line by line
congress-approp search --dir data --type appropriation --format jsonl | \
  while IFS= read -r line; do
    echo "$line" | jq -r '.account_name'
  done

# Filter and reformat in one pipeline
congress-approp search --dir data --type rescission --format jsonl | \
  jq -r 'select(.dollars > 1000000000) | "\(.bill)\t$\(.dollars)\t\(.account_name)"'

When to use JSONL vs. JSON

Format   Use When
JSON     Loading the full result set into memory (Python, R, JavaScript). Result is a single parseable array.
JSONL    Streaming line-by-line processing, very large result sets, piping to jq/xargs/parallel. Each line is independent.

Working with extraction.json Directly

Sometimes the CLI search output doesn’t give you exactly what you need. The raw extraction.json file contains the complete data with nested structures that the CLI flattens.

Structure

{
  "schema_version": "1.0",
  "bill": {
    "identifier": "H.R. 9468",
    "classification": "supplemental",
    "short_title": "Veterans Benefits Continuity and Accountability Supplemental Appropriations Act, 2024",
    "fiscal_years": [2024],
    "divisions": [],
    "public_law": null
  },
  "provisions": [
    {
      "provision_type": "appropriation",
      "account_name": "Compensation and Pensions",
      "agency": "Department of Veterans Affairs",
      "amount": {
        "value": { "kind": "specific", "dollars": 2285513000 },
        "semantics": "new_budget_authority",
        "text_as_written": "$2,285,513,000"
      },
      "detail_level": "top_level",
      "availability": "to remain available until expended",
      "fiscal_year": 2024,
      "confidence": 0.99,
      "raw_text": "For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended.",
      "notes": ["Supplemental appropriation under Veterans Benefits Administration heading", "No-year funding"],
      "cross_references": [],
      "section": "",
      "division": null,
      "title": null,
      "provisos": [],
      "earmarks": [],
      "parent_account": null,
      "program": null
    }
  ],
  "summary": { ... },
  "chunk_map": []
}

Key differences from CLI JSON output:

  • Nested amount object with value, semantics, and text_as_written sub-fields
  • notes array — explanatory annotations the LLM added
  • cross_references array — references to other laws and sections
  • provisos array — “Provided, That” conditions
  • earmarks array — community project funding items
  • confidence float — LLM self-assessed confidence (0.0–1.0)
  • availability string — fund availability period

Flattening nested data in Python

import json
import pandas as pd

with open("data/118-hr9468/extraction.json") as f:
    data = json.load(f)

# Flatten provisions with nested amounts
rows = []
for p in data["provisions"]:
    row = {
        "provision_type": p["provision_type"],
        "account_name": p.get("account_name", ""),
        "agency": p.get("agency", ""),
        "section": p.get("section", ""),
        "division": p.get("division", ""),
        "confidence": p.get("confidence", 0),
        "raw_text": p.get("raw_text", ""),
        "notes": "; ".join(p.get("notes", [])),
    }

    # Flatten the amount
    amt = p.get("amount")
    if amt:
        val = amt.get("value", {})
        row["dollars"] = val.get("dollars") if val.get("kind") == "specific" else None
        row["semantics"] = amt.get("semantics", "")
        row["text_as_written"] = amt.get("text_as_written", "")

    rows.append(row)

df = pd.DataFrame(rows)
print(df[["provision_type", "account_name", "dollars", "semantics"]].to_string())

Finding provisions with specific notes

The notes field contains useful annotations that the CLI doesn’t display:

import json

with open("data/118-hr4366/extraction.json") as f:
    data = json.load(f)

# Find all provisions noted as advance appropriations
for i, p in enumerate(data["provisions"]):
    notes = p.get("notes", [])
    for note in notes:
        if "advance" in note.lower():
            acct = p.get("account_name", "unknown")
            # amount may be null for riders/directives, so guard each level
            val = (p.get("amount") or {}).get("value") or {}
            dollars = val.get("dollars")
            amt = f"${dollars:,}" if isinstance(dollars, int) else "N/A"
            print(f"[{i}] {acct}: {amt} — {note}")

Summary: Choosing the Right Format

Format                     Flag                       Best For                                              Preserves Nested Data?
Table                      --format table (default)   Interactive exploration, quick lookups                No — truncates long fields
CSV                        --format csv               Excel, Google Sheets, R, simple tabular analysis      No — flattened columns
JSON                       --format json              Python, JavaScript, jq, programmatic processing       Partially — CLI flattens some fields
JSONL                      --format jsonl             Streaming, piping, line-by-line processing            Partially — same as JSON per line
extraction.json (direct)   Read the file directly     Full nested data, notes, cross-references, provisos   Yes — complete data

For most analysis tasks, start with --format json or --format csv. Only read extraction.json directly when you need nested fields like notes, cross_references, or provisos that the CLI output flattens away.

Next Steps

Use Semantic Search

You will need: congress-approp installed, access to the data/ directory, OPENAI_API_KEY environment variable set.

You will learn: How to find provisions by meaning instead of keywords, how to interpret similarity scores, how to use --similar for cross-bill matching, and when semantic search is (and isn’t) the right tool.

Keyword search finds provisions that contain the exact words you type. Semantic search finds provisions that mean what you’re looking for — even when the words are completely different. This is the difference between searching for “school lunch” (zero results in appropriations language) and finding “$33 billion for Child Nutrition Programs” (the actual provision that funds school lunches).

This tutorial walks through setup, real queries against the example data, and practical techniques for getting the best results.

Prerequisites

Semantic search requires two things:

  1. Pre-computed embeddings for the bills you want to search. The included example data already has these — you don’t need to generate them.
  2. OPENAI_API_KEY set in your environment. This is needed at query time to embed your search text (a single API call, ~100ms, costs fractions of a cent).
export OPENAI_API_KEY="your-key-here"

If you’re working with your own extracted bills that don’t have embeddings yet, generate them first:

congress-approp embed --dir your-data-directory

See Generate Embeddings for details.

The following example searches for a concept using everyday language that shares no keywords with the matching provision:

congress-approp search --dir data --semantic "school lunch programs for kids" --top 5
┌──────┬───────────┬───────────────┬─────────────────────────────────────────────┬────────────────┬─────┐
│ Sim  ┆ Bill      ┆ Type          ┆ Description / Account                       ┆     Amount ($) ┆ Div │
╞══════╪═══════════╪═══════════════╪═════════════════════════════════════════════╪════════════════╪═════╡
│ 0.51 ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆ 33,266,226,000 ┆ B   │
│ 0.46 ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆     10,000,000 ┆ B   │
│ 0.45 ┆ H.R. 4366 ┆ rider         ┆ Pilot project grant recipients shall be r…  ┆              — ┆ B   │
│ 0.45 ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆     18,004,000 ┆ B   │
│ 0.44 ┆ H.R. 4366 ┆ appropriation ┆ Child Nutrition Programs                    ┆      5,000,000 ┆ B   │
└──────┴───────────┴───────────────┴─────────────────────────────────────────────┴────────────────┴─────┘
5 provisions found

Not a single word in “school lunch programs for kids” appears in “Child Nutrition Programs” — and yet it’s the top result at 0.51 similarity. The embedding model understands that school lunches and child nutrition are the same concept.

Compare this to a keyword search for the same phrase:

congress-approp search --dir data --keyword "school lunch"
0 provisions found

Zero results. Keyword search can only find provisions containing the literal words “school lunch,” which no provision in any of these bills does.

Understanding the Sim Column

When you use --semantic or --similar, the table gains a Sim column showing the cosine similarity between your query and each provision’s embedding vector. Cosine similarity ranges from -1 to 1 in principle, but for these embeddings scores fall between 0 and 1 in practice:

Score Range   What It Means                                                                      Example
> 0.80        Nearly identical meaning — almost certainly the same program in a different bill   VA Supp “Comp & Pensions” ↔ Omnibus “Comp & Pensions”
0.60 – 0.80   Related topic, same policy area                                                    “Clean energy” ↔ “Energy Efficiency and Renewable Energy”
0.45 – 0.60   Conceptually connected but not a direct match                                      “School lunch” ↔ “Child Nutrition Programs” (0.51)
0.30 – 0.45   Weak connection; may be coincidental                                               “Cryptocurrency regulation” ↔ “Regulation and Technology”
< 0.30        No meaningful relationship                                                         Random topic ↔ unrelated provision

Key insight: A score of 0.51 for “school lunch” → “Child Nutrition Programs” is strong for a conceptual translation query. Scores above 0.80 typically occur only when comparing the same program in different bills.

More Queries to Try

These examples demonstrate different types of semantic matching. Try each one against the example data:

Layperson → Bureaucratic Translation

The most common use case — you know what you want in plain English, but the bill uses formal government terminology:

# Plain language → official program names
congress-approp search --dir data --semantic "money for fixing roads and bridges" --top 5
# → Highway Infrastructure Programs, Federal-Aid Highways, National Infrastructure Investments

congress-approp search --dir data --semantic "space exploration and rockets" --top 5
# → Exploration (NASA), Space Operations, Space Technology

congress-approp search --dir data --semantic "fighting wildfires" --top 5
# → Wildland Fire Management, Wildfire Suppression Operations Reserve Fund

congress-approp search --dir data --semantic "help for homeless veterans" --top 5
# → Homeless Assistance Grants, various VA provisions

Topic Discovery

When you’re exploring a policy area without knowing specific program names:

# What's in the bill about clean energy?
congress-approp search --dir data --semantic "clean energy research" --top 10

# What about drug enforcement?
congress-approp search --dir data --semantic "drug enforcement and narcotics control" --top 10

# Nuclear weapons and defense?
congress-approp search --dir data --semantic "nuclear weapons maintenance and modernization" --top 10

News Story → Provisions

Paste a phrase from a news article to find the relevant provisions:

# From a headline about the opioid crisis
congress-approp search --dir data --semantic "opioid crisis drug treatment" --top 5

# From a story about border security
congress-approp search --dir data --semantic "border wall construction and immigration enforcement" --top 5

# From a story about scientific research funding
congress-approp search --dir data --semantic "federal funding for scientific research grants" --top 10

Combining Semantic Search with Filters

Semantic search provides the ranking (which provisions are most relevant to your query). Hard filters provide constraints (which provisions are even eligible to appear). When combined, the filters apply first, then semantic ranking orders the remaining results.

Filter by provision type

If you only want appropriation-type provisions (not riders, directives, or limitations):

congress-approp search --dir data --semantic "clean energy" --type appropriation --top 5

This is useful because semantic search doesn’t distinguish provision types — a rider about clean energy policy scores as high as an appropriation for clean energy funding. Adding --type appropriation ensures you only see provisions with dollar amounts.

Filter by dollar range

Find large provisions about a topic:

congress-approp search --dir data --semantic "scientific research" --type appropriation --min-dollars 1000000000 --top 5

This returns only appropriations of $1 billion or more that are semantically related to scientific research.

Filter by division

Focus on a specific part of the omnibus:

# Only Division A (MilCon-VA)
congress-approp search --dir data --semantic "veterans health care" --division A --top 5

# Only Division B (Agriculture)
congress-approp search --dir data --semantic "farm subsidies" --division B --top 5

Combine multiple filters

congress-approp search --dir data \
  --semantic "renewable energy and climate" \
  --type appropriation \
  --min-dollars 100000000 \
  --division D \
  --top 10

This finds the top 10 appropriations of $100M+ in Division D (Energy and Water) related to renewable energy and climate.

Finding Similar Provisions with --similar

While --semantic embeds a text query and searches for matching provisions, --similar takes an existing provision and finds the most similar provisions across all loaded bills. This is the cross-bill matching tool.

Basic usage

The syntax is --similar <bill_directory>:<provision_index>:

congress-approp search --dir data --similar 118-hr9468:0 --top 5
┌──────┬───────────┬───────────────┬────────────────────────────────┬─────────────────┬─────┐
│ Sim  ┆ Bill      ┆ Type          ┆ Description / Account          ┆      Amount ($) ┆ Div │
╞══════╪═══════════╪═══════════════╪════════════════════════════════╪═════════════════╪═════╡
│ 0.86 ┆ H.R. 4366 ┆ appropriation ┆ Compensation and Pensions      ┆ 182,310,515,000 ┆ A   │
│ 0.78 ┆ H.R. 4366 ┆ appropriation ┆ Compensation and Pensions      ┆  15,072,388,000 ┆ A   │
│ 0.73 ┆ H.R. 4366 ┆ limitation    ┆ Compensation and Pensions      ┆      22,109,000 ┆ A   │
│ 0.70 ┆ H.R. 9468 ┆ appropriation ┆ Readjustment Benefits          ┆     596,969,000 ┆     │
│ 0.68 ┆ H.R. 4366 ┆ rescission    ┆ Medical Support and Compliance ┆   1,550,000,000 ┆ A   │
└──────┴───────────┴───────────────┴────────────────────────────────┴─────────────────┴─────┘
5 provisions found

Here 118-hr9468:0 means “provision index 0 in the hr9468 directory” — that’s the VA Supplemental’s Compensation and Pensions appropriation. The top match in the omnibus is the same account at 0.86 similarity.

Key difference from --semantic

Feature                    --semantic                                         --similar
Input                      A text query you type                              An existing provision by directory:index
API call?                  Yes — embeds your query text via OpenAI (~100ms)   No — uses pre-computed vectors from vectors.bin
Use case                   Find provisions matching a concept                 Match the same program across bills
Requires OPENAI_API_KEY?   Yes                                                No

Because --similar doesn’t make any API calls, it’s instant and free. It looks up the source provision’s pre-computed vector and computes cosine similarity against every other provision’s vector locally.

Finding the provision index

To use --similar, you need the provision index. There are several ways to find it:

Method 1: Use --format json and look for the provision_index field:

congress-approp search --dir data/118-hr9468 --type appropriation --format json | \
  jq '.[] | "\(.provision_index): \(.account_name) $\(.dollars)"'
"0: Compensation and Pensions $2285513000"
"1: Readjustment Benefits $596969000"

Method 2: In the table output, count rows from the top (zero-indexed). The first row is index 0, the second is index 1, and so on within each bill.

Method 3: For a specific account, search for it and note the provision_index in the JSON output.

Cross-bill matching with different naming conventions

CRs and omnibus bills often use different naming conventions for the same account. Embeddings handle this because they capture meaning, not just words:

  • CR: "Rural Housing Service—Rural Community Facilities Program Account"
  • Omnibus: "Rural Community Facilities Program Account"

Despite the different names, --similar will match these at approximately 0.78 similarity — well above the threshold for confident matching.

When Semantic Search Doesn’t Work

Semantic search has limitations. Here are situations where other approaches work better:

Exact account name lookups

If you know the precise account name, --account is faster, deterministic, and doesn’t require an API key:

# Better than semantic search for exact lookups
congress-approp search --dir data --account "Child Nutrition Programs"

No conceptual match in the dataset

If you search for a topic that genuinely isn’t in the bills, similarity scores will be low — and that’s the correct answer:

congress-approp search --dir data --semantic "cryptocurrency regulation bitcoin blockchain" --top 3
┌──────┬───────────┬───────────────┬───────────────────────────────┬─────────────┬─────┐
│ Sim  ┆ Bill      ┆ Type          ┆ Description / Account         ┆  Amount ($) ┆ Div │
╞══════╪═══════════╪═══════════════╪═══════════════════════════════╪═════════════╪═════╡
│ 0.30 ┆ H.R. 4366 ┆ appropriation ┆ Regulation and Technology     ┆  62,400,000 ┆ E   │
│ 0.29 ┆ H.R. 4366 ┆ appropriation ┆ Regulation and Technology     ┆      40,000 ┆ E   │
│ 0.29 ┆ H.R. 4366 ┆ appropriation ┆ Regulation and Technology     ┆ 116,186,000 ┆ E   │
└──────┴───────────┴───────────────┴───────────────────────────────┴─────────────┴─────┘
3 provisions found

Scores of 0.29–0.30 are well below any meaningful threshold. The tool correctly surfaces the closest things it has (NRC “Regulation and Technology” — the word “regulation” provides a weak signal) but the low scores tell you: nothing in this dataset is actually about cryptocurrency.

Treat scores below 0.40 as “no meaningful match.”

Distinguishing provision types by embedding

Embeddings capture what the provision is about, not what type of action it is. A rider that prohibits funding for abortions and an appropriation for reproductive health services may score highly similar because they’re about the same topic — even though they represent opposite policy actions.

If provision type matters, always combine semantic search with --type:

# Find appropriations about reproductive health, not policy riders
congress-approp search --dir data --semantic "reproductive health" --type appropriation --top 5

Query instability

Different phrasings of the same question can produce somewhat different results. In experiments, five different phrasings of a FEMA-related query shared only one common provision in their top-5 results. This is a known property of embedding models.

Mitigation: If the topic matters, try 2–3 different phrasings and take the union of results. A future --multi-query feature will automate this.
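Until then, taking the union by hand is straightforward. A Python sketch, assuming each result list was parsed from --format json output and treating the (bill, account_name) pair as good enough to de-duplicate on (the helper name is illustrative):

```python
def union_results(result_lists):
    """Merge result lists from several query phrasings, keeping first occurrences."""
    seen = set()
    merged = []
    for results in result_lists:
        for p in results:
            key = (p["bill"], p.get("account_name"))
            if key not in seen:
                seen.add(key)
                merged.append(p)
    return merged

a = [{"bill": "H.R. 4366", "account_name": "Disaster Relief Fund"}]
b = [{"bill": "H.R. 4366", "account_name": "Disaster Relief Fund"},
     {"bill": "H.R. 2471", "account_name": "Disaster Relief Fund"}]
print(len(union_results([a, b])))  # → 2
```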

Cost and Performance

Semantic search is fast and inexpensive:

Operation                                 Time     Cost
Embed your query text (one API call)      ~100ms   ~$0.0001
Cosine similarity over 2,500 provisions   <0.1ms   Free (local)
Load embedding vectors from disk          ~2ms     Free (local)
Total per search                          ~100ms   ~$0.0001

Embedding generation (one-time per bill):

Bill                       Provisions   Time          Approximate Cost
H.R. 9468 (supplemental)   7            ~2 seconds    < $0.01
H.R. 5860 (CR)             130          ~5 seconds    < $0.01
H.R. 4366 (omnibus)        2,364        ~30 seconds   < $0.01

The embedding model is text-embedding-3-large with 3,072 dimensions. Vectors are stored as binary float32 files that load in milliseconds.
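If you want to inspect the vectors yourself, NumPy can read a flat float32 file in one call. This sketch assumes a bare flat layout with 3,072 values per provision; the actual vectors.bin layout is an internal detail and may include a header:

```python
import numpy as np

DIM = 3072  # text-embedding-3-large dimensionality

def load_vectors(path, dim=DIM):
    """Load a flat float32 file as an (n, dim) matrix, L2-normalized per row.

    The flat layout is an assumption for illustration; the real vectors.bin
    format may differ.
    """
    vecs = np.fromfile(path, dtype=np.float32).reshape(-1, dim)
    # Normalize rows so cosine similarity reduces to a plain dot product
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / norms
```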

How It Works Under the Hood

For a detailed technical explanation, see How Semantic Search Works. In brief:

  1. At embed time: Each provision’s meaningful text (account name + agency + bill text) is sent to OpenAI’s embedding model, which returns a 3,072-dimensional vector. These vectors are stored in vectors.bin.

  2. At query time (--semantic): Your search text is sent to the same model (one API call). The returned vector is compared to every stored provision vector using cosine similarity (the dot product of normalized vectors). Results are ranked by similarity.

  3. At query time (--similar): The source provision’s vector is looked up from the stored vectors.bin. No API call needed — everything is local.

  4. The math: Cosine similarity measures the angle between two vectors in 3,072-dimensional space. Vectors pointing in the same direction (similar meaning) have high cosine similarity; vectors pointing in different directions (different meanings) have low similarity.
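The math in step 4 fits in a few lines. A sketch over plain Python lists:

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity: 1.0 for identical direction, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 2))  # → 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 2))  # → 0.0
```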

Tips for Better Queries

  1. Be descriptive, not terse. “Federal funding for scientific research at universities” works better than just “science.” Longer queries give the embedding model more context.

  2. Use domain language when you know it. “SNAP benefits supplemental nutrition” will rank higher than “food stamps for poor people” because the embedding model has seen more formal language in its training data.

  3. Combine with hard filters. Semantic search ranks; filters constrain. Use them together:

    congress-approp search --dir data --semantic "your query" --type appropriation --min-dollars 1000000 --top 10
    
  4. Try both --semantic and --similar. If you find one good provision via semantic search, switch to --similar with that provision’s index to find related provisions across other bills without additional API calls.

  5. Trust low scores. If the best match is below 0.40, the topic likely isn’t in the dataset. Don’t force an interpretation.

  6. Check results with keyword search. After semantic search finds a promising account, verify with --account or --keyword to make sure you’re seeing the complete picture:

    # Semantic search found "Child Nutrition Programs" — now get everything for that account
    congress-approp search --dir data --account "Child Nutrition"
    

Quick Reference

Task                                       Command
Search by meaning                          search --semantic "your query" --top 10
Search by meaning, only appropriations     search --semantic "your query" --type appropriation --top 10
Search by meaning, large provisions only   search --semantic "your query" --min-dollars 1000000000 --top 10
Find similar provisions across bills       search --similar 118-hr9468:0 --top 5
Find similar appropriations only           search --similar 118-hr9468:0 --type appropriation --top 5

Next Steps

Download Bills from Congress.gov

You will need: congress-approp installed, CONGRESS_API_KEY environment variable set.

You will learn: How to discover available appropriations bills, download their enrolled XML, and set up a data directory for extraction.

This guide covers every option for downloading bill XML from Congress.gov. If you just want the quick path, skip to Quick Reference at the end.

Set Up Your API Key

The Congress.gov API requires a free API key. Sign up at api.congress.gov/sign-up — approval is usually instant.

Set the key in your environment:

export CONGRESS_API_KEY="your-key-here"

You can verify connectivity with:

congress-approp api test

Discover Available Bills

Before downloading, you’ll usually want to see what’s available. The api bill list command queries Congress.gov for appropriations bills:

List all appropriations bills for a congress

congress-approp api bill list --congress 118

This returns every bill in the 118th Congress (2023–2024) that Congress.gov classifies as an appropriations bill — introduced, passed, vetoed, or enacted.

List only enacted bills

Most of the time you only want bills that became law:

congress-approp api bill list --congress 118 --enacted-only

The --enacted-only flag filters to bills signed by the President (or with a veto override). These are the authoritative spending laws.

Congress numbers

Each Congress spans two years. Here are the recent ones:

Congress   Years       Fiscal Years Typically Covered
116th      2019–2020   FY2020, FY2021
117th      2021–2022   FY2022, FY2023
118th      2023–2024   FY2024, FY2025
119th      2025–2026   FY2026, FY2027

Note that fiscal years don’t align perfectly with congresses — a bill enacted in the 118th Congress might fund FY2024 (which started October 1, 2023) or FY2025.
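The date-to-fiscal-year mapping itself is mechanical: FY N runs from October 1 of year N-1 through September 30 of year N. In Python:

```python
import datetime

def fiscal_year(d):
    """U.S. federal fiscal year of a date: FY N spans Oct 1, N-1 through Sep 30, N."""
    return d.year + 1 if d.month >= 10 else d.year

print(fiscal_year(datetime.date(2023, 10, 1)))   # → 2024
print(fiscal_year(datetime.date(2024, 9, 30)))   # → 2024
```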

Get metadata for a specific bill

If you know which bill you want, you can inspect its metadata before downloading:

congress-approp api bill get --congress 118 --type hr --number 4366

Check available text versions

Bills have multiple text versions (introduced, engrossed, enrolled, etc.). To see what’s available:

congress-approp api bill text --congress 118 --type hr --number 4366

This lists every text version with its format (XML, PDF, HTML) and download URL. For extraction, you want the enrolled (enr) version — the final text signed into law.

Bill Type Codes

When specifying a bill, you need the type code:

Code    Meaning                   Example
hr      House bill                H.R. 4366
s       Senate bill               S. 1234
hjres   House joint resolution    H.J.Res. 100
sjres   Senate joint resolution   S.J.Res. 50

Most enacted appropriations bills originate in the House (hr): the Constitution’s Origination Clause requires revenue bills to start there, and by long-standing practice appropriations bills do too. Joint resolutions (hjres, sjres) are sometimes used for continuing resolutions.

Download a Single Bill

To download one specific bill’s enrolled XML:

congress-approp download --congress 118 --type hr --number 9468 --output-dir data

This creates the directory structure and saves the XML:

data/
└── 118/
    └── hr/
        └── 9468/
            └── BILLS-118hr9468enr.xml

The file name follows the Government Publishing Office convention: BILLS-{congress}{type}{number}enr.xml.
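The convention is simple enough to reproduce when scripting around the tool; the helper below is illustrative:

```python
def enrolled_xml_name(congress, bill_type, number):
    """Build the GPO file name: BILLS-{congress}{type}{number}enr.xml."""
    return f"BILLS-{congress}{bill_type}{number}enr.xml"

print(enrolled_xml_name(118, "hr", 9468))  # → BILLS-118hr9468enr.xml
```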

Only the enrolled version is downloaded

By default, the tool downloads only the enrolled version (the final text signed into law). This is the version you need for extraction and analysis — one XML file per bill, no clutter.

If you need other text versions (for example, to compare the House-passed version to the final enrolled version), you can request specific versions or all versions:

# Download only the introduced version
congress-approp download --congress 118 --type hr --number 4366 --output-dir data --version ih

# Download all available text versions (introduced, engrossed, enrolled, etc.)
congress-approp download --congress 118 --type hr --number 4366 --output-dir data --all-versions

Available version codes for --version:

Code   Version                Description
enr    Enrolled               Final version, signed into law (downloaded by default)
ih     Introduced in House    As originally introduced
is     Introduced in Senate   As originally introduced
eh     Engrossed in House     As passed by the House
es     Engrossed in Senate    As passed by the Senate

Tip: For extraction and analysis, always use the enrolled version (the default). Non-enrolled versions may have different XML structures that the parser doesn’t support. The --all-versions flag is for advanced workflows like tracking how a bill changed during the legislative process.

Download multiple formats

You can download both XML (for extraction) and PDF (for reading) at once:

congress-approp download --congress 118 --type hr --number 4366 --output-dir data --format xml,pdf

Download All Enacted Bills for a Congress

To batch-download every enacted appropriations bill:

congress-approp download --congress 118 --enacted-only --output-dir data

This scans Congress.gov for all enacted appropriations bills in the specified congress, then downloads the enrolled XML for each one. The process may take a minute or two depending on how many bills exist and the API’s response time.

Each bill gets its own directory:

data/
└── 118/
    └── hr/
        ├── 4366/
        │   └── BILLS-118hr4366enr.xml
        ├── 5860/
        │   └── BILLS-118hr5860enr.xml
        └── 9468/
            └── BILLS-118hr9468enr.xml

Preview Without Downloading

Use --dry-run to see what would be downloaded without actually fetching anything:

congress-approp download --congress 118 --enacted-only --output-dir data --dry-run

This queries the API and lists each bill that would be downloaded, along with the file size and output path. Useful for estimating how much data you’re about to pull down.

Choosing an Output Directory

The --output-dir flag controls where bills are saved. The default is ./data. You can use any directory structure you like:

# Default location
congress-approp download --congress 118 --type hr --number 4366

# Custom location
congress-approp download --congress 118 --type hr --number 4366 --output-dir ~/appropriations-data

# Organized by fiscal year (your choice of structure)
congress-approp download --congress 118 --type hr --number 4366 --output-dir data/fy2024

The tool creates intermediate directories as needed. Later, when you run extract, search, summary, and other commands, you point --dir at whatever directory contains your bills — the loader walks recursively to find all extraction.json files.
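If you want to see which bills a given --dir would pick up, the same recursive discovery is a one-line glob (the helper name is illustrative):

```python
from pathlib import Path

def find_extractions(root):
    """List every extraction.json under root, the way the loader discovers bills."""
    return sorted(Path(root).rglob("extraction.json"))
```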

Handling Rate Limits and Errors

The Congress.gov API has rate limits (typically 5,000 requests per hour for registered users). If you’re downloading many bills in quick succession, you may encounter rate limiting.

Symptoms: HTTP 429 (Too Many Requests) errors, or slow responses.

Solutions:

  • Wait a few minutes and retry
  • Download bills one at a time rather than in batch
  • The tool handles most retries automatically, but persistent rate limiting may require reducing your request frequency

Other common issues:

Error                          Cause                                                      Solution
“API key not set”              CONGRESS_API_KEY not in environment                        export CONGRESS_API_KEY="your-key"
“Bill not found” (404)         Wrong congress number, bill type, or number                Double-check using api bill list
“No enrolled text available”   Bill hasn’t been enrolled yet, or text not yet published   Check api bill text for available versions; some bills take days to appear after signing
“Connection refused”           Network issue or Congress.gov maintenance                  Check your internet connection; try again later

After Downloading

Once you have the XML, the next step is extraction:

# Preview extraction (no API calls)
congress-approp extract --dir data/118/hr/9468 --dry-run

# Run extraction
congress-approp extract --dir data/118/hr/9468

See Extract Provisions from a Bill for the full extraction guide, or Extract Your Own Bill for the end-to-end tutorial.

Quick Reference

# Set your API key
export CONGRESS_API_KEY="your-key"

# Test connectivity
congress-approp api test

# List enacted bills for a congress
congress-approp api bill list --congress 118 --enacted-only

# Download a single bill
congress-approp download --congress 118 --type hr --number 4366 --output-dir data

# Download all enacted bills for a congress
congress-approp download --congress 118 --enacted-only --output-dir data

# Preview without downloading
congress-approp download --congress 118 --enacted-only --output-dir data --dry-run

# Check available text versions for a bill
congress-approp api bill text --congress 118 --type hr --number 4366

Full Command Reference

congress-approp download [OPTIONS] --congress <CONGRESS>

Options:
    --congress <CONGRESS>      Congress number (e.g., 118 for 2023-2024)
    --type <TYPE>              Bill type: hr, s, hjres, sjres
    --number <NUMBER>          Bill number (used with --type for single-bill download)
    --output-dir <OUTPUT_DIR>  Output directory [default: ./data]
    --enacted-only             Only download enacted (signed into law) bills
    --format <FORMAT>          Download format: xml, pdf [comma-separated] [default: xml]
    --version <VERSION>        Text version filter: enr, ih, eh, es, is
    --all-versions             Download all text versions instead of just enrolled
    --dry-run                  Show what would be downloaded without fetching

Next Steps

Extract Provisions from a Bill

You will need: congress-approp installed, downloaded bill XML (see Download Bills), ANTHROPIC_API_KEY environment variable set.

You will learn: How to run the extraction pipeline, control parallelism and model selection, interpret the output files, and handle common issues.

Extraction is the core step of the pipeline — it sends bill text to Claude, which identifies and classifies every spending provision, then deterministic verification checks every dollar amount against the source. This guide covers all the options and considerations.

Prerequisites

  1. Downloaded bill XML. You need at least one BILLS-*.xml file in a bill directory. See Download Bills from Congress.gov.
  2. Anthropic API key. Set it in your environment:
export ANTHROPIC_API_KEY="your-key-here"

Preview Before Extracting

Always start with a dry run to see what the extraction will involve:

congress-approp extract --dir data/118/hr/9468 --dry-run

The dry run shows you:

  • Bill identifier parsed from the XML
  • Chunk count — how many pieces the bill will be split into for parallel processing
  • Estimated input tokens — helps you estimate API cost before committing

Typical chunk counts by bill size:

| Bill Type | XML Size | Chunks | Est. Input Tokens |
|---|---|---|---|
| Supplemental (small) | ~10 KB | 1 | ~1,200 |
| Continuing Resolution | ~130 KB | 3–5 | ~25,000 |
| Individual regular bill | ~200–500 KB | 10–20 | ~50,000–100,000 |
| Omnibus (large) | ~1–2 MB | 50–75 | ~200,000–315,000 |

No API calls are made during a dry run.
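The estimated token count from a dry run translates directly into an approximate dollar cost. A minimal sketch, assuming a placeholder price per million input tokens (the price here is illustrative; check Anthropic's current pricing for the model you use):

```python
def estimate_input_cost(input_tokens: int, price_per_mtok: float) -> float:
    """Rough input-side API cost: tokens / 1,000,000 * price per million tokens."""
    return input_tokens / 1_000_000 * price_per_mtok

# An omnibus at ~315,000 estimated input tokens, at a hypothetical
# $15 per million input tokens:
cost = estimate_input_cost(315_000, 15.0)
```

Output-side tokens add to this, so treat the result as a lower bound.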

Run Extraction

Single bill

congress-approp extract --dir data/118/hr/9468

For a small bill (like the VA supplemental), this completes in under a minute. The tool:

  1. Parses the XML to extract clean text and identify structural boundaries (divisions, titles)
  2. Splits large bills into chunks at division and title boundaries
  3. Sends each chunk to Claude with a ~300-line system prompt defining every provision type
  4. Merges provisions from all chunks into a single list
  5. Computes budget authority totals from the individual provisions (never trusting the LLM’s arithmetic)
  6. Verifies every dollar amount and text excerpt against the source XML
  7. Writes all artifacts to disk
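Step 5 is worth emphasizing: totals are always recomputed in code from the individual provisions. A sketch of that roll-up (the field name `dollars` is illustrative, not necessarily the tool's exact schema):

```python
def total_budget_authority(provisions: list[dict]) -> int:
    """Deterministic roll-up: sum each provision's own dollar amount
    rather than trusting any LLM-reported total. Provisions without
    a dollar amount contribute zero."""
    return sum(p.get("dollars", 0) for p in provisions)

provisions = [
    {"dollars": 5_000_000},
    {"dollars": 2_285_513_000},
    {},  # a rider with no dollar amount
]
total = total_budget_authority(provisions)
```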

Multiple bills

Point --dir at a parent directory to extract all bills found underneath:

congress-approp extract --dir data

The tool walks recursively, finds every directory containing a BILLS-*.xml file, and extracts each one. Bills that already have extraction.json are automatically skipped — you can safely re-run the same command after a partial failure and it picks up where it left off. To force re-extraction of already-processed bills, use --force:

# Re-extract everything, even bills that already have extraction.json
congress-approp extract --dir data --force
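The walk-and-skip behavior can be sketched in a few lines (illustrative Python, not the tool's Rust implementation):

```python
from pathlib import PurePosixPath

def pending_bills(files, force=False):
    """Given an iterable of file paths in a data tree, return the bill
    directories that contain a BILLS-*.xml but (unless force) no
    extraction.json -- mirroring the skip logic described above."""
    xml_dirs, done_dirs = set(), set()
    for f in files:
        p = PurePosixPath(f)
        if p.name.startswith("BILLS-") and p.name.endswith(".xml"):
            xml_dirs.add(str(p.parent))
        elif p.name == "extraction.json":
            done_dirs.add(str(p.parent))
    return xml_dirs if force else xml_dirs - done_dirs
```

Because a failed bill never gets an extraction.json, it naturally stays in the pending set on the next run.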

Enrolled versions only

When a bill directory contains multiple XML versions (enrolled, introduced, engrossed, etc.), the extract command automatically uses only the enrolled version (*enr.xml). Non-enrolled versions are ignored. If no enrolled version exists, all available versions are processed.

This means you don’t need to worry about cleaning up extra XML files — the tool picks the right one automatically.
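The selection rule is simple enough to sketch (illustrative, assuming detection by the `enr` suffix in the filename):

```python
def select_versions(xml_names: list[str]) -> list[str]:
    """Prefer enrolled versions (*enr.xml); fall back to all
    available versions if no enrolled text exists."""
    enrolled = [n for n in xml_names if n.endswith("enr.xml")]
    return enrolled if enrolled else list(xml_names)
```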

Resilient processing

If an XML file fails to parse (for example, a non-enrolled version with a different XML structure), the tool logs a warning and continues to the next bill instead of aborting the entire run:

⚠ Skipping data/118/hr/2872/BILLS-118hr2872eas.xml: Failed to parse ... (not a parseable bill XML?)

This means one bad file won’t kill a multi-bill extraction run.

Chunk failure handling

Large bills are split into many chunks for parallel extraction. If any chunk permanently fails after all retries (typically due to API rate limiting or empty responses), the tool aborts that bill by default — it does not write extraction.json. This prevents garbage partial extractions from being saved and mistaken for valid data.

✗ 7148: 113 of 115 chunks failed for H.R. 7148. Aborting to prevent partial extraction.
  Use --continue-on-error to save partial results.
  No extraction.json written for this bill.

The tool then continues to the next bill in the queue. Since no extraction.json was written for the failed bill, re-running the same command will automatically retry it.

If you explicitly want partial results (for example, a bill where 59 of 92 chunks succeeded and you want the 1,600+ provisions that were extracted), use --continue-on-error:

congress-approp extract --dir data/118/hr/2882 --parallel 6 --continue-on-error

This saves the partial extraction.json with whatever chunks succeeded. The audit command will show lower coverage for these partial extractions.

Extract all downloaded bills with parallelism

congress-approp extract --dir data --parallel 6

Controlling Parallelism

The --parallel flag controls how many LLM API calls run simultaneously. This affects both speed and API rate limit usage:

# Default: 5 concurrent calls
congress-approp extract --dir data/118/hr/4366

# Faster — good for large bills if your API quota allows
congress-approp extract --dir data/118/hr/4366 --parallel 8

# Conservative — avoids rate limits, good for debugging
congress-approp extract --dir data/118/hr/4366 --parallel 1

| Parallelism | Speed | Rate Limit Risk | Best For |
|---|---|---|---|
| 1 | Slowest | None | Debugging, small bills |
| 3 | Moderate | Low | Conservative extraction |
| 5 (default) | Good | Moderate | Most use cases |
| 8–10 | Fast | Higher | Large bills with high API quota |

For the FY2024 omnibus (75 chunks), --parallel 6 completes in approximately 60 minutes. At --parallel 1, it would take several hours.

Progress display

For multi-chunk bills, a live progress dashboard shows extraction status:

  5/42, 187 provs [4m 23s] 842 tok/s | 📝A-IIb ~8K 180/s | 🤔B-I ~3K | 📝B-III ~1K 95/s

Reading left to right:

  • 5/42 — 5 of 42 chunks complete
  • 187 provs — 187 provisions extracted so far
  • [4m 23s] — elapsed time
  • 842 tok/s — average token throughput
  • The remaining items show currently active chunks: 📝 = receiving response, 🤔 = model is thinking

Choosing a Model

By default, extraction uses claude-opus-4-6, which produces the highest quality results. You can override this:

# Via command-line flag
congress-approp extract --dir data/118/hr/9468 --model claude-sonnet-4-20250514

# Via environment variable (useful for scripting)
export APPROP_MODEL="claude-sonnet-4-20250514"
congress-approp extract --dir data/118/hr/9468

The command-line flag takes precedence over the environment variable.

Quality warning: The system prompt and expected output format are specifically tuned for Claude Opus. Other models may produce:

  • More classification errors (e.g., marking an appropriation as a rider)
  • Missing provisions (especially sub-allocations and proviso amounts)
  • Inconsistent JSON formatting (mostly absorbed by the resilient from_value.rs parser, but still a risk)
  • Lower coverage scores in the audit

Always check audit output after extracting with a non-default model.

The model name is recorded in metadata.json so you always know which model produced a given extraction.

Output Files

After extraction, the bill directory contains:

data/118/hr/9468/
├── BILLS-118hr9468enr.xml     ← Source XML (unchanged)
├── extraction.json            ← All provisions with amounts, accounts, sections
├── verification.json          ← Deterministic checks against source text
├── metadata.json              ← Model name, prompt version, timestamps, source hash
├── tokens.json                ← LLM token usage (input, output, cache hits)
└── chunks/                    ← Per-chunk LLM artifacts (gitignored)
    ├── 01JRWN9T5RR0JTQ6C9FYYE96A8.json
    └── ...

extraction.json

The main output. Contains:

  • bill — Identifier, classification, short title, fiscal years, divisions
  • provisions — Array of every extracted provision with full structured data
  • summary — LLM-generated summary statistics (used for diagnostics, never for computation)
  • chunk_map — Links each provision to the chunk it was extracted from
  • schema_version — Version of the extraction schema

This is the file all query commands (search, summary, compare, audit) read.

verification.json

Deterministic verification of every provision against the source text. No LLM involved:

  • amount_checks — Was each dollar string found in the source?
  • raw_text_checks — Is each raw text excerpt a substring of the source?
  • completeness — How many dollar strings in the source were captured?
  • summary — Roll-up metrics (verified, not_found, ambiguous, match tiers)

metadata.json

Extraction provenance:

  • model — Which LLM model was used
  • prompt_version — Hash of the system prompt
  • extraction_timestamp — When the extraction ran
  • source_xml_sha256 — SHA-256 hash of the source XML (for the hash chain)

tokens.json

API token usage:

  • total_input — Total input tokens across all chunks
  • total_output — Total output tokens
  • total_cache_read — Tokens served from prompt cache (reduces cost)
  • total_cache_create — Tokens added to prompt cache
  • calls — Number of API calls made

chunks/ directory

Per-chunk LLM artifacts stored with ULID filenames. Each file contains:

  • The model’s thinking content (internal reasoning)
  • The raw JSON response before parsing
  • The parsed provisions for that chunk
  • A conversion report showing any type coercions or missing fields

These are permanent provenance records — useful for debugging why a particular provision was classified a certain way. They are gitignored by default (not part of the hash chain, not needed for downstream operations).

Verify After Extraction

Always run the audit after extracting:

congress-approp audit --dir data/118/hr/9468

What to check:

| Metric | Good Value | Action if Bad |
|---|---|---|
| NotFound | 0 | Run audit --verbose to see which provisions failed; check source XML manually |
| Exact | > 90% of provisions | Minor formatting differences are handled by the NormText tier; only worry if TextMiss is high |
| Coverage | > 80% for regular bills | Review unaccounted amounts — many are legitimately excluded (statutory refs, loan ceilings) |
| Provisions count | Reasonable for bill size | A small bill with 500+ provisions or a large bill with <50 may indicate extraction issues |

For a detailed verification procedure, see Verify Extraction Accuracy.

Re-Extracting a Bill

To re-extract (for example, with a newer model or after prompt improvements), use the --force flag:

# Re-extract even though extraction.json already exists
congress-approp extract --dir data/118/hr/9468 --force

Without --force, the extract command skips bills that already have extraction.json. This makes it safe to re-run extract --dir data after failures — bills that succeeded are skipped, and bills that failed (no extraction.json written) are retried automatically.

After re-extraction:

  • extraction.json and verification.json are overwritten
  • metadata.json and tokens.json are overwritten
  • A new set of chunk artifacts is created in chunks/
  • Embeddings become stale — the tool will warn you, and you’ll need to run embed again

Upgrade without re-extracting

If you only need to re-verify against a newer schema (no LLM calls), use upgrade instead:

congress-approp upgrade --dir data/118/hr/9468

This re-deserializes the existing extraction through the current code’s schema, re-runs verification, and updates the files. Much faster and free. See Upgrade Extraction Data.

Handling Large Bills

Omnibus bills (1,000+ pages) require special attention:

Chunk splitting

Large bills are automatically split into chunks at XML <division> and <title> boundaries. This is semantic chunking — each chunk contains a complete legislative section with full context. The FY2024 omnibus (H.R. 4366) splits into approximately 75 chunks.

If a single title or division exceeds the maximum chunk token limit (~3,000 tokens), it’s further split at paragraph boundaries. This is rare but happens for very long sections.

Time estimates

| Bill | Chunks | --parallel 5 | --parallel 8 |
|---|---|---|---|
| Small supplemental | 1 | ~30 seconds | ~30 seconds |
| Continuing resolution | 5 | ~3 minutes | ~2 minutes |
| Regular bill | 15–20 | ~15 minutes | ~10 minutes |
| Omnibus | 75 | ~75 minutes | ~50 minutes |

Handling interruptions

If extraction is interrupted (network error, rate limit, crash), you’ll need to re-run it from the beginning for that bill. There is no per-chunk checkpoint/resume mechanism — the tool extracts all chunks and merges them atomically. Bills that already completed are still skipped on re-run.

Troubleshooting

“N of M chunks failed … Aborting”

This means some LLM API calls failed after all retries — typically due to rate limiting on large bills. The tool did not write extraction.json to prevent saving garbage data.

Fix: Wait a few minutes for API quotas to reset, then re-run the same command. Since no extraction.json was written, the failed bill will be retried automatically. If the bill is very large (90+ chunks), try reducing parallelism:

congress-approp extract --dir data/119/hr/7148 --parallel 3

If you want to save whatever chunks succeeded (accepting an incomplete extraction), add --continue-on-error:

congress-approp extract --dir data/119/hr/7148 --parallel 6 --continue-on-error --force

“All bills already extracted”

This means every bill directory already has extraction.json. Use --force to re-extract:

congress-approp extract --dir data/118/hr/9468 --force

“No XML files found”

Make sure you downloaded the bill first. The extract command looks for files matching BILLS-*.xml in the specified directory.

ls data/118/hr/9468/BILLS-*.xml

“Rate limited” or 429 errors

Reduce parallelism:

congress-approp extract --dir data/118/hr/4366 --parallel 2

Anthropic’s API has per-minute token limits. High concurrency on large bills can exceed these limits.

Low provision count

If a large bill produces surprisingly few provisions, check:

  1. The XML file — is it the correct version? Some partial texts are available on Congress.gov.
  2. The audit output — low coverage combined with low provision count suggests the extraction missed sections.
  3. The chunk artifacts — look in chunks/ for any chunks that produced zero provisions or error responses.

“Unexpected token” or JSON parsing errors

The from_value.rs resilient parser handles most LLM output quirks automatically. If you see parsing warnings in the verbose output, they’re usually minor (a missing field defaulting to empty, a string where a number was expected being coerced). The conversion.json report in each chunk directory shows exactly what was adjusted.

If extraction fails entirely, try with --parallel 1 to isolate which chunk is problematic, then examine that chunk’s artifacts in chunks/.

Quick Reference

# Set API key
export ANTHROPIC_API_KEY="your-key"

# Preview extraction (no API calls)
congress-approp extract --dir data/118/hr/9468 --dry-run

# Extract a single bill
congress-approp extract --dir data/118/hr/9468

# Extract with higher parallelism
congress-approp extract --dir data/118/hr/4366 --parallel 8

# Extract all bills under a directory (skips already-extracted bills)
congress-approp extract --dir data --parallel 6

# Re-extract a bill that was already extracted
congress-approp extract --dir data/118/hr/9468 --force

# Save partial results even when some chunks fail
congress-approp extract --dir data/118/hr/2882 --parallel 6 --continue-on-error

# Verify after extraction
congress-approp audit --dir data/118/hr/9468

Full Command Reference

congress-approp extract [OPTIONS]

Options:
    --dir <DIR>            Data directory containing downloaded bill XML [default: ./data]
    --dry-run              Show what would be extracted without calling LLM
    --parallel <PARALLEL>  Parallel LLM calls [default: 5]
    --model <MODEL>        LLM model override [env: APPROP_MODEL=]
    --force                Re-extract bills even if extraction.json already exists
    --continue-on-error    Save partial results when some chunks fail (default: abort bill)

Next Steps

Generate Embeddings

You will need: congress-approp installed, extracted bill data (with extraction.json), OPENAI_API_KEY environment variable set.

You will learn: How to generate embedding vectors for semantic search and --similar matching, configure embedding options, detect and handle staleness, and manage embedding storage.

Embeddings are what power semantic search (--semantic) and cross-bill matching (--similar). Each provision’s text is converted into a 3,072-dimensional vector that captures its meaning. Provisions about similar topics — even with completely different wording — will have vectors pointing in similar directions.

You only need to generate embeddings once per bill. After that, all semantic operations use the stored vectors locally, with the single exception of --semantic queries which make one small API call to embed your query text.

Prerequisites

  1. Extracted bill data. You need extraction.json in each bill directory. See Extract Provisions from a Bill.
  2. OpenAI API key. Embeddings use OpenAI’s text-embedding-3-large model.
export OPENAI_API_KEY="your-key-here"

Note: The included example data (data/118-hr4366, data/118-hr5860, data/118-hr9468) ships with pre-generated embeddings. You don’t need to run embed for the examples unless you want to regenerate them.

Generate Embeddings

Single bill directory

congress-approp embed --dir data/118/hr/9468

For a small bill (7 provisions), this takes a few seconds. For the FY2024 omnibus (2,364 provisions), about 30 seconds.

All bills under a directory

congress-approp embed --dir data

The tool walks recursively, finds every directory with an extraction.json, and generates embeddings for each one. Bills that already have up-to-date embeddings are skipped automatically.

Preview without calling the API

congress-approp embed --dir data --dry-run

Shows how many provisions would be embedded and the estimated token count for each bill, without making any API calls.

What Gets Created

The embed command writes two files to each bill directory:

embeddings.json

A small JSON metadata file (~200 bytes, human-readable):

{
  "schema_version": "1.0",
  "model": "text-embedding-3-large",
  "dimensions": 3072,
  "count": 7,
  "extraction_sha256": "a1b2c3d4e5f6...",
  "vectors_file": "vectors.bin",
  "vectors_sha256": "f6e5d4c3b2a1..."
}

| Field | Description |
|---|---|
| schema_version | Embedding schema version |
| model | The OpenAI model used to generate embeddings |
| dimensions | Number of dimensions per vector |
| count | Number of provisions embedded (should match the provisions array length in extraction.json) |
| extraction_sha256 | SHA-256 hash of the extraction.json this was built from — enables staleness detection |
| vectors_file | Filename of the binary vectors file |
| vectors_sha256 | SHA-256 hash of the vectors file — integrity check |

vectors.bin

A binary file containing raw little-endian float32 vectors. The file size is exactly count × dimensions × 4 bytes:

| Bill | Provisions | Dimensions | File Size |
|---|---|---|---|
| H.R. 9468 (supplemental) | 7 | 3,072 | 86 KB |
| H.R. 5860 (CR) | 130 | 3,072 | 1.6 MB |
| H.R. 4366 (omnibus) | 2,364 | 3,072 | 29 MB |

There is no header in the file — the count and dimensions come from embeddings.json. Vectors are stored in provision order (provision 0 first, then provision 1, etc.).
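That layout makes integrity checking trivial. A quick sanity check and per-provision reader, based on the metadata fields above (a sketch, not the tool's own code):

```python
import struct

def check_vectors(meta: dict, raw: bytes):
    """Validate a vectors.bin payload against its embeddings.json metadata:
    the size must be exactly count * dimensions * 4 bytes (little-endian
    float32, no header). Returns a reader for provision i's vector."""
    count, dims = meta["count"], meta["dimensions"]
    expected = count * dims * 4
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")

    def vector(i: int):
        off = i * dims * 4
        return struct.unpack(f"<{dims}f", raw[off:off + dims * 4])

    return vector
```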

Embedding Options

Model

The default model is text-embedding-3-large, which provides the best quality embeddings available from OpenAI. You can override this:

congress-approp embed --dir data --model text-embedding-3-small

Warning: All embeddings in a dataset must use the same model. You cannot compare vectors from different models. If you change models, regenerate embeddings for all bills.

Dimensions

By default, the tool requests the full 3,072 dimensions from text-embedding-3-large. You can request fewer dimensions for smaller storage at the cost of some quality:

congress-approp embed --dir data --dimensions 1024

Experimental results from this project’s testing:

| Dimensions | Storage (omnibus) | Top-20 Overlap vs. 3072 |
|---|---|---|
| 256 | ~2.4 MB | 16/20 (lossy) |
| 512 | ~4.8 MB | 18/20 (near-lossless) |
| 1024 | ~9.7 MB | 19/20 |
| 3072 (default) | ~29 MB | 20/20 (ground truth) |

Since binary vector files load in under 2ms regardless of size, there is little practical reason to truncate dimensions.

Warning: Like models, all embeddings in a dataset must use the same dimension count. Cosine similarity between vectors of different dimensions is undefined.

Batch size

Provisions are sent to the API in batches. The default batch size is 100 provisions per API call:

congress-approp embed --dir data --batch-size 50

Smaller batch sizes make more API calls but reduce the impact of a single failed call. The default of 100 is efficient for most use cases.

How Provision Text Is Built

Each provision is embedded using a deterministic text representation built by build_embedding_text(). The text concatenates the provision’s meaningful fields:

Account: Child Nutrition Programs | Agency: Department of Agriculture | Text: For necessary expenses of the Food and Nutrition Service...

The exact fields included depend on the provision type:

  • Appropriations/Rescissions: Account name, agency, program, raw text
  • CR Substitutions: Account name, reference act, reference section, raw text
  • Directives/Riders: Description, raw text
  • Other types: Description or LLM classification, raw text

This deterministic construction means the same provision always produces the same embedding text, regardless of when or where you run the command.
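A sketch of this construction for appropriation-type provisions (the field names here are illustrative, not necessarily the exact extraction.json schema):

```python
def build_embedding_text(provision: dict) -> str:
    """Concatenate a provision's meaningful fields into the
    'Key: value | Key: value' layout shown above. Missing fields
    are simply omitted, keeping the output deterministic."""
    parts = []
    if provision.get("account_name"):
        parts.append(f"Account: {provision['account_name']}")
    if provision.get("agency"):
        parts.append(f"Agency: {provision['agency']}")
    if provision.get("raw_text"):
        parts.append(f"Text: {provision['raw_text']}")
    return " | ".join(parts)
```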

Staleness Detection

The hash chain connects embeddings to their source extraction:

extraction.json ──sha256──▶ embeddings.json (extraction_sha256)
vectors.bin ──sha256──▶ embeddings.json (vectors_sha256)

If you re-extract a bill (producing a new extraction.json), the embeddings become stale. Commands that use embeddings will warn you:

⚠ H.R. 4366: embeddings are stale (extraction.json has changed)

This warning is advisory — the tool still works, but similarity results may not match the current provisions. To fix it, regenerate embeddings:

congress-approp embed --dir data/118/hr/4366

The embed command automatically detects stale embeddings and regenerates them. Up-to-date embeddings are skipped.

Skipping Up-to-Date Bills

When you run embed on a directory with multiple bills, the tool checks each one:

  1. Does embeddings.json exist?
  2. Does extraction_sha256 in embeddings.json match the current SHA-256 of extraction.json?
  3. Does vectors_sha256 in embeddings.json match the current SHA-256 of vectors.bin?

If all three checks pass, the bill is skipped with a message like:

Skipping H.R. 9468: embeddings up to date

This makes it safe to run embed --dir data repeatedly — it only does work where needed.
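The three checks amount to a couple of SHA-256 comparisons. A sketch operating on file contents already read into memory (the real tool reads the files itself):

```python
import hashlib

def embeddings_up_to_date(meta, extraction_bytes: bytes, vectors_bytes: bytes) -> bool:
    """Return True only if all three freshness checks pass:
    metadata exists, and both recorded hashes match the current files."""
    if meta is None:  # 1. embeddings.json missing
        return False
    ext_ok = meta.get("extraction_sha256") == hashlib.sha256(extraction_bytes).hexdigest()
    vec_ok = meta.get("vectors_sha256") == hashlib.sha256(vectors_bytes).hexdigest()
    return ext_ok and vec_ok  # 2. and 3.
```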

Cost Estimates

Embedding generation is inexpensive compared to extraction:

| Bill | Provisions | Estimated Cost |
|---|---|---|
| H.R. 9468 (supplemental) | 7 | < $0.001 |
| H.R. 5860 (CR) | 130 | < $0.01 |
| H.R. 4366 (omnibus) | 2,364 | < $0.01 |

The text-embedding-3-large model charges per token. Even the largest omnibus bill with 2,364 provisions uses only a few tens of thousands of tokens total, which costs pennies.

Use --dry-run to preview the exact token count before committing.

Reading Vectors in Python

If you want to work with the embeddings outside of congress-approp:

import json
import struct

# Load metadata
with open("data/118-hr9468/embeddings.json") as f:
    meta = json.load(f)

dims = meta["dimensions"]  # 3072
count = meta["count"]       # 7

# Load vectors
with open("data/118-hr9468/vectors.bin", "rb") as f:
    raw = f.read()

# Parse into list of vectors
vectors = []
for i in range(count):
    start = i * dims * 4
    end = start + dims * 4
    vec = struct.unpack(f"<{dims}f", raw[start:end])
    vectors.append(vec)

# Vectors are L2-normalized (norm ≈ 1.0), so cosine similarity = dot product
def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

# Compare provision 0 to provision 1
print(f"Similarity: {cosine(vectors[0], vectors[1]):.4f}")

You can also load the vectors into numpy for faster computation:

import numpy as np

vectors = np.frombuffer(raw, dtype=np.float32).reshape(count, dims)

# Cosine similarity matrix
similarity_matrix = vectors @ vectors.T

After Generating Embeddings

Once embeddings are generated, you can use:

  • Semantic search: congress-approp search --dir data --semantic "your query" --top 10
  • Similar provisions: congress-approp search --dir data --similar 118-hr9468:0 --top 5

The --similar flag does not make any API calls — it uses the stored vectors directly. The --semantic flag makes one API call to embed your query text (~100ms).

Troubleshooting

“OPENAI_API_KEY environment variable not set”

Set your API key:

export OPENAI_API_KEY="your-key-here"

“No extraction.json found”

You need to extract the bill before generating embeddings. Run congress-approp extract first.

Embeddings stale warning after re-extraction

This is expected. Run congress-approp embed --dir <path> to regenerate.

Very large vectors.bin file

The omnibus bill produces a ~29 MB vectors.bin file. This is expected for 2,364 provisions × 3,072 dimensions × 4 bytes per float. The file loads in under 2ms despite its size.

These files are excluded from the crates.io package (via Cargo.toml exclude field) because they exceed the 10 MB upload limit. They are included in the git repository for users who clone.

Quick Reference

# Set API key
export OPENAI_API_KEY="your-key"

# Generate embeddings for one bill
congress-approp embed --dir data/118/hr/9468

# Generate embeddings for all bills
congress-approp embed --dir data

# Preview without API calls
congress-approp embed --dir data --dry-run

# Use a different model
congress-approp embed --dir data --model text-embedding-3-small

# Use fewer dimensions
congress-approp embed --dir data --dimensions 1024

# Smaller batch size
congress-approp embed --dir data --batch-size 50

Full Command Reference

congress-approp embed [OPTIONS]

Options:
    --dir <DIR>                Data directory [default: ./data]
    --model <MODEL>            Embedding model [default: text-embedding-3-large]
    --dimensions <DIMENSIONS>  Request this many dimensions from the API [default: 3072]
    --batch-size <BATCH_SIZE>  Provisions per API batch [default: 100]
    --dry-run                  Preview without calling API

Next Steps

Verify Extraction Accuracy

You will need: congress-approp installed, access to extracted bill data (the data/ directory works).

You will learn: How to run a full verification audit, interpret every metric, trace individual provisions back to source XML, and decide whether extraction quality is sufficient for your use case.

Extraction uses an LLM to classify and structure provisions from bill text. Verification uses deterministic code — no LLM involved — to check every claim the extraction made against the source. This guide walks you through the complete verification workflow.

Step 1: Run the Audit

The audit command is your primary verification tool:

congress-approp audit --dir data
┌───────────┬────────────┬──────────┬──────────┬───────┬───────┬──────────┬───────────┬──────────┬──────────┐
│ Bill      ┆ Provisions ┆ Verified ┆ NotFound ┆ Ambig ┆ Exact ┆ NormText ┆ Spaceless ┆ TextMiss ┆ Coverage │
╞═══════════╪════════════╪══════════╪══════════╪═══════╪═══════╪══════════╪═══════════╪══════════╪══════════╡
│ H.R. 4366 ┆       2364 ┆      762 ┆        0 ┆   723 ┆  2285 ┆       59 ┆         0 ┆       20 ┆    94.2% │
│ H.R. 5860 ┆        130 ┆       33 ┆        0 ┆     2 ┆   102 ┆       12 ┆         0 ┆       16 ┆    61.1% │
│ H.R. 9468 ┆          7 ┆        2 ┆        0 ┆     0 ┆     5 ┆        0 ┆         0 ┆        2 ┆   100.0% │
│ TOTAL     ┆       2501 ┆      797 ┆        0 ┆   725 ┆  2392 ┆       71 ┆         0 ┆       38 ┆          │
└───────────┴────────────┴──────────┴──────────┴───────┴───────┴──────────┴───────────┴──────────┴──────────┘

Column Guide:
  Verified   Dollar amount string found at exactly one position in source text
  NotFound   Dollar amounts NOT found in source — not present in source, review manually
  Ambig      Dollar amounts found multiple times in source — correct but position uncertain
  Exact      raw_text is byte-identical substring of source — verbatim copy
  NormText   raw_text matches after whitespace/quote/dash normalization — content correct
  Spaceless  raw_text matches only after removing all spaces — PDF artifact, review
  TextMiss   raw_text not found at any tier — may be paraphrased, review manually
  Coverage   Percentage of dollar strings in source text matched to a provision

Key:
  NotFound = 0 and Coverage = 100%   →  All amounts captured and found in source
  NotFound = 0 and Coverage < 100%   →  Extracted amounts correct, but bill has more
  NotFound > 0                       →  Some amounts need manual review

This is a lot of information. Let’s break it down column by column.

Step 2: Check for Unverifiable Amounts (The Critical Metric)

The single most important number in the audit is the NotFound column. It counts provisions where the extracted dollar string (e.g., "$2,285,513,000") was not found anywhere in the source bill text.

| NotFound Value | Interpretation | Action |
|---|---|---|
| 0 | Every extracted dollar amount exists in the source text. | No action needed — this is the ideal result. |
| 1–5 | A small number of amounts couldn’t be verified. | Run audit --verbose to identify which provisions; manually check them against the source XML. |
| > 5 | Significant number of unverifiable amounts. | Investigate whether extraction used the wrong source file, the model hallucinated amounts, or the XML is corrupted. Consider re-extracting. |

Across the included example data: NotFound = 0 for every bill. 99.995% of extracted dollar amounts were confirmed to exist in the source text. See Accuracy Metrics for the full breakdown.

Verified vs. Ambiguous

The remaining provisions with dollar amounts fall into two categories:

  • Verified: The dollar string was found at exactly one position in the source. This provides the strongest attribution — you know exactly where in the bill this amount comes from.
  • Ambiguous (Ambig): The dollar string was found at multiple positions. The amount is correct — it’s definitely in the bill — but it appears more than once, so you can’t automatically pin it to a single location.

Ambiguous matches are common and expected. Round numbers like $5,000,000 can appear 50+ times in a large omnibus bill. In H.R. 4366, 723 of 1,485 provisions with dollar amounts are ambiguous — mostly because common round-number amounts recur throughout the bill’s 2,364 provisions.

Ambiguous does not mean inaccurate. The amount is verified to exist in the source; only the precise location is uncertain.
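The verified/ambiguous/not-found distinction is just substring occurrence counting, which can be sketched as:

```python
def classify_amount(source: str, dollar_string: str) -> str:
    """One occurrence in the source = verified (exact attribution),
    several = ambiguous (present, but position uncertain),
    none = not_found (needs manual review)."""
    n = source.count(dollar_string)
    if n == 0:
        return "not_found"
    return "verified" if n == 1 else "ambiguous"
```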

Provisions without dollar amounts

Not all provisions have dollar amounts. Riders, directives, and some policy provisions carry no dollars. These provisions don’t appear in the Verified/NotFound/Ambig counts. In the example data:

  • H.R. 4366: 2,364 provisions, 1,485 with dollar amounts (762 verified + 723 ambiguous), 879 without
  • H.R. 5860: 130 provisions, 35 with dollar amounts (33 verified + 2 ambiguous), 95 without
  • H.R. 9468: 7 provisions, 2 with dollar amounts (2 verified + 0 ambiguous), 5 without

Step 3: Examine Raw Text Matching

The right side of the audit table checks whether each provision’s raw_text excerpt (the first ~150 characters of the bill language) is a substring of the source text. This is checked in four tiers:

Tier 1: Exact (best)

The raw_text is a byte-identical substring of the source bill text. This means the LLM copied the text perfectly — not a single character was changed.

In the example data: approximately 95.5% of provisions match at the Exact tier across the three example bills. This is excellent and provides strong evidence that the provision is attributed to the correct location in the bill.

Tier 2: Normalized

The raw_text matches after normalizing whitespace, curly quotes (“ ”), and em-dashes (—). These differences arise from the XML-to-text conversion process — the source XML uses Unicode characters that the LLM may render differently.

In the example data: 71 provisions (2.8%) match at the Normalized tier. The content is correct; only formatting details differ.

Tier 3: Spaceless

The raw_text matches only after removing all spaces. This catches cases where word boundaries differ — for example, (1)not less than vs. (1) not less than. This is typically caused by XML tags being stripped without inserting spaces.

In the example data: 0 provisions match at the Spaceless tier.

Tier 4: No Match (TextMiss)

The raw_text was not found at any tier. Possible causes:

  • Truncation: The LLM truncated a very long provision and the truncated text doesn’t appear as-is in the source.
  • Paraphrasing: The LLM rephrased the statutory language (especially common for complex amendments like “Section X is amended by striking Y and inserting Z”).
  • Concatenation: The LLM combined text from adjacent sections into one raw_text string.

In the example data: 38 provisions (1.5%) are TextMiss. Examining them reveals they are all non-dollar provisions — statutory amendments (riders and mandatory spending extensions) where the LLM slightly reformatted section references. No provision with a dollar amount has a TextMiss in the example data.

What TextMiss does and doesn’t mean

TextMiss does NOT mean the provision is fabricated. The provision’s other fields (account_name, description, dollar amounts) may still be correct — it’s only the raw_text excerpt that doesn’t match. Dollar amounts are verified independently through the amount checks.

TextMiss DOES mean you should review manually if the provision is important to your analysis. Use audit --verbose to see which provisions are affected.

Step 4: Use Verbose Mode for Details

When any metric raises a concern, use --verbose to see specific problematic provisions:

congress-approp audit --dir data --verbose

This adds a list of individual provisions that didn’t pass verification at the highest tier. For each one, you’ll see:

  • The provision index
  • The provision type and account name (if applicable)
  • The dollar string (if applicable) and whether it was found
  • The raw text preview and which match tier it achieved

This gives you enough information to manually check any provision against the source XML.

Step 5: Trace a Specific Provision to Source

For any provision you want to verify yourself — perhaps one you plan to cite in a report or story — here’s how to trace it back to the source:

1. Get the provision details

congress-approp search --dir data/118-hr9468 --type appropriation --format json

Look for the provision you’re interested in. Note the dollars, raw_text, and provision_index fields.

For example, provision 0 of H.R. 9468:

{
  "dollars": 2285513000,
  "raw_text": "For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended.",
  "provision_index": 0,
  "amount_status": "found",
  "match_tier": "exact"
}

2. Verify the dollar string in the source XML

Search for the text_as_written dollar string in the source file:

grep '$2,285,513,000' data/118-hr9468/BILLS-118hr9468enr.xml

If it’s found (and amount_status is “found”), the amount is verified. If found exactly once, the attribution is unambiguous.
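The unique-vs-multiple distinction amounts to counting occurrences of the dollar string. A sketch (the status labels mirror the audit legend; the source string is a made-up fragment):

```python
def amount_status(source: str, dollar_string: str) -> str:
    # Count non-overlapping occurrences of the exact dollar string
    n = source.count(dollar_string)
    if n == 0:
        return "not found"
    return "found (unique)" if n == 1 else "found (multiple matches)"

source = "for ''Compensation and Pensions'', $2,285,513,000, to remain available"
print(amount_status(source, "$2,285,513,000"))  # found (unique)
print(amount_status(source, "$9,999,999"))      # not found
```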

3. Read the surrounding context

To see what the bill actually says around that dollar amount:

grep -B2 -A5 '2,285,513,000' data/118-hr9468/BILLS-118hr9468enr.xml

Or in Python for cleaner output:

import re

with open("data/118-hr9468/BILLS-118hr9468enr.xml") as f:
    text = f.read()

idx = text.find("2,285,513,000")
if idx >= 0:
    # Get surrounding context, strip XML tags
    start = max(0, idx - 200)
    end = min(len(text), idx + 200)
    context = re.sub(r'<[^>]+>', ' ', text[start:end])
    context = re.sub(r'\s+', ' ', context).strip()
    print(f"Context: ...{context}...")

4. Compare to the extracted data

Does the context match what the provision claims? Is the account name correct? Is the amount attributed to the right program? The structured raw_text field should be recognizable in the source context.

For the VA Supplemental example, the source text reads:

For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended.

And the extracted raw_text is identical — byte-for-byte.

Step 6: Interpret Coverage

The Coverage column shows the percentage of dollar-sign patterns in the source bill text that were matched to an extracted provision. This measures extraction completeness, not accuracy.
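A rough approximation of this metric: find every dollar pattern in the source and check whether it appears among the extracted dollar strings. This sketch matches by string value only, while the tool's real matching is position-aware:

```python
import re

# Made-up source fragment and extracted dollar strings
source = "For ''Account A'', $1,000,000. For ''Account B'', $2,500,000, of which $500,000 is transferred."
extracted = {"$1,000,000", "$2,500,000"}

# Dollar-sign patterns: $ followed by comma-grouped digits
patterns = re.findall(r"\$\d{1,3}(?:,\d{3})*", source)
matched = [p for p in patterns if p in extracted]
coverage = 100 * len(matched) / len(patterns)
print(f"coverage: {coverage:.1f}% ({len(matched)} of {len(patterns)})")
```

Here the `$500,000` inside the proviso counts against coverage even though, as described below, such sub-references often shouldn't be independent provisions.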

100% coverage (H.R. 9468)

Every dollar amount in the source was captured by a provision. This is ideal and common for small, simple bills.

94.2% coverage (H.R. 4366)

Most dollar amounts were captured, but 5.8% were not. For a 1,500-page omnibus, this is expected. The unmatched dollar strings are typically:

  • Statutory cross-references: Dollar amounts from other laws cited in the bill text (e.g., “as authorized under section 1241(a)” where the referenced section contains a dollar amount)
  • Loan guarantee ceilings: “$3,500,000,000 for guaranteed farm ownership loans” — these are loan volume limits, not budget authority
  • Struck amounts: “Striking ‘$50,000’ and inserting ‘$75,000’” — the old amount being struck shouldn’t be an independent provision
  • Proviso sub-references: Amounts in conditions that don’t constitute independent provisions

61.1% coverage (H.R. 5860)

Continuing resolutions have inherently lower coverage because most of the bill text consists of references to prior-year appropriations acts. Those referenced acts contain many dollar amounts that appear in the CR’s text but aren’t new provisions — they’re contextual citations. Only the 13 CR substitutions and a few standalone appropriations are genuine new provisions in this bill.

When low coverage IS concerning

Coverage below 60% on a regular appropriations bill (not a CR) may indicate that the extraction missed entire sections. Investigate by:

  1. Running audit --verbose to see which dollar amounts are unaccounted for
  2. Checking whether major accounts you expect are present in search --type appropriation
  3. Comparing the provision count to what you’d expect for a bill of that size

See What Coverage Means (and Doesn’t) for a detailed explanation.

Step 7: Cross-Check with External Sources

For high-stakes analysis, cross-check the tool’s totals against independent sources:

CBO cost estimates

The Congressional Budget Office publishes cost estimates for most appropriations bills. These aggregate numbers can serve as a sanity check for the tool’s budget authority totals. Note that CBO estimates may use slightly different accounting conventions (e.g., including or excluding advance appropriations differently).

Committee reports

The House and Senate Appropriations Committees publish detailed reports accompanying each bill. These contain account-level funding tables that can be compared to the tool’s per-account breakdowns.

Known sources of discrepancy

Even with perfect extraction, the tool’s totals may differ from external sources because:

  • Mandatory spending lines (SNAP, VA Comp & Pensions) appear as appropriation provisions in the bill text but are not “discretionary” in the budget sense
  • Advance appropriations are enacted in the current bill but available in a future fiscal year
  • Sub-allocations use reference_amount semantics and are excluded from budget authority totals, while some external sources include them
  • Transfer authorities have dollar ceilings that are not new spending

See Why the Numbers Might Not Match Headlines for a comprehensive explanation.

Step 8: Decide Whether to Re-Extract

Based on your audit results, here’s a decision framework:

| Situation | Recommendation |
|---|---|
| NotFound = 0, Coverage > 80%, TextMiss < 5% | Use as-is. Quality is high. |
| NotFound = 0, Coverage 60–80%, TextMiss < 10% | Use with awareness. Extraction is accurate but may be incomplete. Check specific accounts you care about. |
| NotFound = 0, Coverage < 60% (non-CR bill) | Consider re-extracting. Major sections may be missing. Try --parallel 1 for more reliable extraction of tricky sections. |
| NotFound > 0 | Investigate and possibly re-extract. Some dollar amounts weren’t found in the source. Run audit --verbose, manually verify the flagged provisions, and re-extract if the issues are systemic. |
| TextMiss > 10% on dollar-bearing provisions | Re-extract. The LLM may have been paraphrasing rather than quoting the bill text. |

Re-extraction vs. upgrade

  • Re-extract (congress-approp extract --dir <path>): Makes new LLM API calls. Use when you want a fresh extraction, possibly with a different model or after prompt improvements.
  • Upgrade (congress-approp upgrade --dir <path>): No LLM calls. Re-deserializes existing data through the current schema and re-runs verification. Use when the schema or verification logic has been updated but the extraction itself is fine.

Automated Verification in Scripts

For CI/CD or automated pipelines, you can check verification programmatically:

# Regression guard: budget authority totals must match known-good values
congress-approp summary --dir data --format json | python3 -c "
import sys, json
bills = json.load(sys.stdin)
expected = {'H.R. 4366': 846137099554, 'H.R. 5860': 16000000000, 'H.R. 9468': 2882482000}
for b in bills:
    assert b['budget_authority'] == expected[b['identifier']], \
        f\"{b['identifier']} budget authority mismatch: {b['budget_authority']} != {expected[b['identifier']]}\"
print('All budget authority totals match expected values')
"

This is the same check used in the project’s integration test suite to guard against data regressions.

Quick Decision Table

| I need to… | Command |
|---|---|
| Run a full audit | audit --dir data |
| See individual problematic provisions | audit --dir data --verbose |
| Check a specific provision’s dollar amount | grep '$AMOUNT' data/118-hr4366/BILLS-*.xml |
| Verify a provision’s raw text | Compare raw_text from JSON output to source XML |
| Check budget authority totals | summary --dir data --format json |
| Compare to external sources | summary --dir data --by-agency for department-level totals |

Next Steps

Filter and Search Provisions

You will need: congress-approp installed, access to the data/ directory. For semantic search: OPENAI_API_KEY.

You will learn: Every filter flag available on the search command, how to combine them, and practical recipes for common queries.

The search command is the most versatile tool in congress-approp. It supports ten filter flags that can be combined freely — all filters use AND logic, meaning every provision in the results must match every filter you specify. This guide covers each flag with real examples from the included data.

Quick Reference: All Search Flags

| Flag | Short | Type | Description |
|---|---|---|---|
| --dir | | path | Directory containing extracted bills (required) |
| --type | -t | string | Filter by provision type |
| --agency | -a | string | Filter by agency name (case-insensitive substring) |
| --account | | string | Filter by account name (case-insensitive substring) |
| --keyword | -k | string | Search in raw_text (case-insensitive substring) |
| --bill | | string | Filter to a specific bill identifier |
| --division | | string | Filter by division letter |
| --min-dollars | | integer | Minimum dollar amount (absolute value) |
| --max-dollars | | integer | Maximum dollar amount (absolute value) |
| --format | | string | Output format: table, json, jsonl, csv |
| --semantic | | string | Rank by meaning similarity (requires embeddings + OPENAI_API_KEY) |
| --similar | | string | Find provisions similar to a specific one (format: dir:index) |
| --top | | integer | Maximum results for semantic/similar search (default 20) |
| --list-types | | flag | List all valid provision types and exit |

Filter by Provision Type (--type)

The most common filter. Restricts results to a single provision type.

# All appropriations across all bills
congress-approp search --dir data --type appropriation

# All rescissions
congress-approp search --dir data --type rescission

# CR substitutions (anomalies) — table auto-adapts to show New/Old/Delta columns
congress-approp search --dir data --type cr_substitution

# Reporting requirements and instructions to agencies
congress-approp search --dir data --type directive

# Policy provisions (no direct spending)
congress-approp search --dir data --type rider

Available provision types

Use --list-types to see all valid values:

congress-approp search --dir data --list-types
Available provision types:
  appropriation                    Budget authority grant
  rescission                       Cancellation of prior budget authority
  cr_substitution                  CR anomaly (substituting $X for $Y)
  transfer_authority               Permission to move funds between accounts
  limitation                       Cap or prohibition on spending
  directed_spending                Earmark / community project funding
  mandatory_spending_extension     Amendment to authorizing statute
  directive                        Reporting requirement or instruction
  rider                            Policy provision (no direct spending)
  continuing_resolution_baseline   Core CR funding mechanism
  other                            Unclassified provisions

Type distribution by bill

Not every bill contains every type. Here’s the distribution across the example data:

| Type | H.R. 4366 (Omnibus) | H.R. 5860 (CR) | H.R. 9468 (Supp) |
|---|---|---|---|
| appropriation | 1,216 | 5 | 2 |
| limitation | 456 | 4 | |
| rider | 285 | 49 | 2 |
| directive | 120 | 2 | 3 |
| other | 84 | 12 | |
| rescission | 78 | | |
| transfer_authority | 77 | | |
| mandatory_spending_extension | 40 | 44 | |
| directed_spending | 8 | | |
| cr_substitution | | 13 | |
| continuing_resolution_baseline | | 1 | |

Filter by Agency (--agency)

Matches the agency field using a case-insensitive substring search:

# All provisions from the Department of Veterans Affairs
congress-approp search --dir data --agency "Veterans"

# All provisions from the Department of Energy
congress-approp search --dir data --agency "Energy"

# All NASA provisions
congress-approp search --dir data --agency "Aeronautics"

# All DOJ provisions
congress-approp search --dir data --agency "Justice"

The --agency flag matches against the structured agency field that the LLM extracted — typically the full department name (e.g., “Department of Veterans Affairs”). You only need to provide a substring; the match is case-insensitive.

Tip: Some provisions don’t have an agency field (riders, directives, and some other types). These will never appear in agency-filtered results.

Combine with type for focused results

# Only VA appropriations
congress-approp search --dir data --agency "Veterans" --type appropriation

# Only VA rescissions
congress-approp search --dir data --agency "Veterans" --type rescission

# DOJ directives
congress-approp search --dir data --agency "Justice" --type directive

Filter by Account Name (--account)

Matches the account_name field using a case-insensitive substring search. This is more specific than --agency — it targets the individual appropriations account:

# All provisions for Child Nutrition Programs
congress-approp search --dir data --account "Child Nutrition"

# All provisions for the FBI
congress-approp search --dir data --account "Federal Bureau of Investigation"

# All provisions for Disaster Relief
congress-approp search --dir data --account "Disaster Relief"

# All provisions for Medical Services (VA)
congress-approp search --dir data --account "Medical Services"

The account name is extracted from the bill text — it’s usually the text between '' delimiters in the legislative language (e.g., ''Compensation and Pensions'').

Account vs. Agency

| Flag | Matches Against | Granularity | Example |
|---|---|---|---|
| --agency | Parent department or agency | Broad | “Department of Veterans Affairs” |
| --account | Specific appropriations account | Narrow | “Compensation and Pensions” |

Many provisions under the same agency have different account names. Use --agency for a department-wide view and --account when you know the specific program.

Gotcha: “Salaries and Expenses”

The account name “Salaries and Expenses” appears under dozens of different agencies. If you search --account "Salaries and Expenses" without an agency filter, you’ll get results from across the entire government. Combine with --agency to narrow:

congress-approp search --dir data --account "Salaries and Expenses" --agency "Justice"

Filter by Keyword in Bill Text (--keyword)

Searches the raw_text field — the actual bill language excerpt stored with each provision. This is a case-insensitive substring match:

# Find provisions mentioning FEMA
congress-approp search --dir data --keyword "Federal Emergency Management"

# Find provisions with "notwithstanding" (often signals important policy exceptions)
congress-approp search --dir data --keyword "notwithstanding"

# Find provisions about transfer authority
congress-approp search --dir data --keyword "may transfer"

# Find provisions about reporting requirements
congress-approp search --dir data --keyword "shall submit a report"

# Find provisions referencing a specific public law
congress-approp search --dir data --keyword "Public Law 118"

Keyword vs. Account vs. Semantic

| Search Method | Searches | Best For | Misses |
|---|---|---|---|
| --keyword | The raw_text excerpt (~150 chars of bill language) | Exact terms you know appear in the text | Provisions where the term is in the account name but not the raw_text excerpt, or where synonyms are used |
| --account | The structured account_name field | Known program names | Provisions that reference the program without naming the account |
| --semantic | The full provision meaning (via embeddings) | Concepts and topics, layperson language | Nothing — it searches everything, but scores may be low for weak matches |

For the most thorough search, try all three approaches. Start with --keyword or --account for precision, then use --semantic to find provisions you might have missed.

Filter by Bill (--bill)

Restricts results to a specific bill by its identifier string:

# Only provisions from H.R. 4366
congress-approp search --dir data --bill "H.R. 4366"

# Only provisions from H.R. 9468
congress-approp search --dir data --bill "H.R. 9468"

The value must match the bill identifier as it appears in the data (e.g., “H.R. 4366”, including the space and period). This is a case-sensitive exact match.

Alternative: Point --dir at a specific bill directory. Instead of --bill, you can scope the search by directory:

# These are equivalent for single-bill searches:
congress-approp search --dir data --bill "H.R. 4366"
congress-approp search --dir data/118-hr4366

The --dir approach is simpler for single-bill searches. The --bill flag is useful when you have multiple bills loaded via a parent directory and want to filter to one.

Filter by Division (--division)

Omnibus bills are organized into lettered divisions (Division A, Division B, etc.), each covering a different set of agencies. The --division flag scopes results to a single division:

# Division A = MilCon-VA in H.R. 4366
congress-approp search --dir data/118-hr4366 --division A

# Division B = Agriculture in H.R. 4366
congress-approp search --dir data/118-hr4366 --division B

# Division C = Commerce, Justice, Science in H.R. 4366
congress-approp search --dir data/118-hr4366 --division C

# Division D = Energy and Water in H.R. 4366
congress-approp search --dir data/118-hr4366 --division D

The division letter is a single character (A, B, C, etc.). Bills without divisions (like the VA supplemental H.R. 9468) have no division field, so --division effectively returns no results for those bills.

Combine with type for division-level analysis

# All appropriations in MilCon-VA (Division A) over $1 billion
congress-approp search --dir data/118-hr4366 --division A --type appropriation --min-dollars 1000000000

# All rescissions in Commerce-Justice-Science (Division C)
congress-approp search --dir data/118-hr4366 --division C --type rescission

# All riders in Agriculture (Division B)
congress-approp search --dir data/118-hr4366 --division B --type rider

Filter by Dollar Range (--min-dollars, --max-dollars)

Filters provisions by the absolute value of their dollar amount:

# Provisions of $1 billion or more
congress-approp search --dir data --min-dollars 1000000000

# Provisions between $100 million and $500 million
congress-approp search --dir data --min-dollars 100000000 --max-dollars 500000000

# Small provisions under $1 million
congress-approp search --dir data --max-dollars 1000000

# Large rescissions
congress-approp search --dir data --type rescission --min-dollars 1000000000

The filter uses the absolute value of the dollar amount, so rescissions (which may be stored as negative values internally) are compared by their magnitude.

Provisions without dollar amounts (riders, directives, etc.) are excluded from results when --min-dollars or --max-dollars is specified.
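The semantics of the range check can be sketched as follows (an illustration of the behavior described above, not the tool's source code):

```python
def in_range(dollars, min_dollars=None, max_dollars=None):
    if min_dollars is None and max_dollars is None:
        return True   # no range filter: everything passes
    if dollars is None:
        return False  # riders/directives excluded once a range is set
    magnitude = abs(dollars)  # rescissions compared by magnitude
    if min_dollars is not None and magnitude < min_dollars:
        return False
    if max_dollars is not None and magnitude > max_dollars:
        return False
    return True

print(in_range(-1_500_000_000, min_dollars=1_000_000_000))  # True: large rescission
print(in_range(None, min_dollars=1))                        # False: no dollar amount
```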

Combining Multiple Filters

All filters use AND logic — every filter must match for a provision to appear. This lets you build very specific queries:

# VA appropriations over $1 billion in Division A
congress-approp search --dir data \
  --agency "Veterans" \
  --type appropriation \
  --division A \
  --min-dollars 1000000000

# DOJ rescissions in Division C
congress-approp search --dir data \
  --agency "Justice" \
  --type rescission \
  --division C

# Provisions mentioning "notwithstanding" in the omnibus under $10 million
congress-approp search --dir data/118-hr4366 \
  --keyword "notwithstanding" \
  --max-dollars 10000000

# Energy-related appropriations in Division D between $100M and $1B
congress-approp search --dir data/118-hr4366 \
  --division D \
  --type appropriation \
  --min-dollars 100000000 \
  --max-dollars 1000000000

Filter order doesn’t matter

The tool applies filters in the order that’s most efficient internally. The command-line order of flags has no effect on results — these two commands produce identical output:

congress-approp search --dir data --type appropriation --agency "Veterans"
congress-approp search --dir data --agency "Veterans" --type appropriation

Semantic Search (--semantic)

Semantic search ranks provisions by meaning similarity instead of keyword matching. It requires pre-computed embeddings and an OPENAI_API_KEY:

export OPENAI_API_KEY="your-key"

# Find provisions about school lunch programs (no keyword overlap with "Child Nutrition Programs")
congress-approp search --dir data --semantic "school lunch programs for kids" --top 5

# Find provisions about road and bridge infrastructure
congress-approp search --dir data --semantic "money for fixing roads and bridges" --top 5

Combining semantic search with hard filters

Hard filters apply first (constraining which provisions are eligible), then semantic ranking orders the remaining results:

# Appropriations about clean energy, at least $100M
congress-approp search --dir data \
  --semantic "clean energy research" \
  --type appropriation \
  --min-dollars 100000000 \
  --top 10

For a full tutorial on semantic search, see Use Semantic Search.

Find Similar Provisions (--similar)

Find provisions most similar to a specific one across all loaded bills. The syntax is --similar <bill_directory>:<provision_index>:

# Find provisions similar to VA Supplemental provision 0 (Comp & Pensions)
congress-approp search --dir data --similar 118-hr9468:0 --top 5

# Find provisions similar to omnibus provision 620 (FBI Salaries and Expenses)
congress-approp search --dir data --similar hr4366:620 --top 5

Unlike --semantic, the --similar flag does not make any API calls — it uses pre-computed vectors directly. This makes it instant and free.
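Ranking by pre-computed vectors reduces to cosine similarity over stored embeddings. A minimal sketch with made-up four-dimensional vectors (real embeddings have far more dimensions):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical embeddings keyed by "dir:index"
vectors = {
    "118-hr9468:0": [0.9, 0.1, 0.0, 0.4],
    "118-hr4366:12": [0.8, 0.2, 0.1, 0.5],
    "118-hr5860:3": [0.0, 0.9, 0.8, 0.1],
}

query = vectors["118-hr9468:0"]
ranked = sorted(
    ((k, cosine(query, v)) for k, v in vectors.items() if k != "118-hr9468:0"),
    key=lambda kv: -kv[1],
)
for key, score in ranked:
    print(f"{key}  {score:.3f}")
```

Because no embedding needs to be generated at query time, this is pure local arithmetic — which is why `--similar` is instant and free.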

You can also combine --similar with hard filters:

# Find appropriations similar to a specific provision
congress-approp search --dir data --similar 118-hr9468:0 --type appropriation --top 5

For a full tutorial, see Track a Program Across Bills.

Controlling the Number of Results (--top)

The --top flag limits results for semantic and similar searches (default 20). It has no effect on non-semantic searches (which return all matching provisions):

# Top 3 results
congress-approp search --dir data --semantic "veterans health care" --top 3

# Top 50 results
congress-approp search --dir data --semantic "veterans health care" --top 50

Output Formats (--format)

All search results can be output in four formats:

# Human-readable table (default)
congress-approp search --dir data --type appropriation --format table

# JSON array (full fields, for programmatic use)
congress-approp search --dir data --type appropriation --format json

# JSON Lines (one object per line, for streaming)
congress-approp search --dir data --type appropriation --format jsonl

# CSV (for spreadsheets)
congress-approp search --dir data --type appropriation --format csv > provisions.csv

JSON and CSV include more fields than the table view — notably raw_text, semantics, detail_level, amount_status, match_tier, quality, and provision_index.

For detailed format documentation and recipes, see Export Data for Spreadsheets and Scripts and Output Formats.

Practical Recipes

Here are battle-tested queries for common analysis tasks:

Find the biggest appropriations in a bill

congress-approp search --dir data/118-hr4366 --type appropriation --min-dollars 10000000000 --format table

Find all provisions for a specific agency

congress-approp search --dir data --agency "Department of Energy" --format table

Export all rescissions to a spreadsheet

congress-approp search --dir data --type rescission --format csv > rescissions.csv

Find reporting requirements for the VA

congress-approp search --dir data --keyword "Veterans Affairs" --type directive

Find all provisions that override other law

congress-approp search --dir data --keyword "notwithstanding"

Find which mandatory programs were extended in the CR

congress-approp search --dir data/118-hr5860 --type mandatory_spending_extension --format json

Find provisions in a specific dollar range

# "Small" appropriations: $1M to $10M
congress-approp search --dir data --type appropriation --min-dollars 1000000 --max-dollars 10000000

# "Large" appropriations: over $10B
congress-approp search --dir data --type appropriation --min-dollars 10000000000

Count provisions by type across all bills

congress-approp search --dir data --format json | \
  jq 'group_by(.provision_type) | map({type: .[0].provision_type, count: length}) | sort_by(-.count)'

Export everything and filter later

If you’re not sure what you need yet, export all provisions and filter in your analysis tool:

# All provisions, all fields, all bills
congress-approp search --dir data --format json > all_provisions.json

# Or as CSV for Excel
congress-approp search --dir data --format csv > all_provisions.csv

Tips

  1. Start broad, then narrow. Begin with --type or --agency alone, see how many results you get, then add more filters to focus.

  2. Use --format json to see all fields. The table view truncates long text and hides some fields. JSON shows everything.

  3. Use --dir scoping for single-bill searches. Instead of --bill "H.R. 4366", use --dir data/118-hr4366 — it’s simpler and slightly faster.

  4. Combine keyword and account searches. An account name search finds provisions named after a program. A keyword search finds provisions that mention a program in their text. Use both for completeness.

  5. Try semantic search as a second pass. After keyword/account search gives you the obvious results, run a semantic search on the same topic to find provisions you might have missed because the bill uses different terminology.

  6. Check --list-types when unsure. If you can’t remember the exact type name, --list-types shows all valid values with descriptions.

Next Steps

Work with CR Substitutions

You will need: congress-approp installed, access to the data/ directory.

You will learn: What CR substitutions are in legislative context, how to find and interpret them, how to match them to their omnibus counterparts, and how to export them for analysis.

Continuing resolutions (CRs) fund the government at prior-year rates — but not uniformly. Specific programs get different treatment through anomalies, formally known as CR substitutions. These are provisions that say “substitute $X for $Y,” replacing one dollar amount with another. They’re politically significant because they reveal which programs Congress chose to fund above or below the default rate.

The tool extracts CR substitutions as structured data with both the new and old amounts, making them easy to find, compare, and analyze.

What a CR Substitution Looks Like

In bill text, a CR substitution looks like this:

…shall be applied by substituting “$25,300,000” for “$75,300,000”…

This means: instead of continuing the Rural Community Facilities Program at its prior-year level of $75.3 million, fund it at $25.3 million — a $50 million cut.

The tool captures both sides:

{
  "provision_type": "cr_substitution",
  "account_name": "Rural Housing Service—Rural Community Facilities Program Account",
  "new_amount": {
    "value": { "kind": "specific", "dollars": 25300000 },
    "semantics": "new_budget_authority",
    "text_as_written": "$25,300,000"
  },
  "old_amount": {
    "value": { "kind": "specific", "dollars": 75300000 },
    "semantics": "new_budget_authority",
    "text_as_written": "$75,300,000"
  },
  "raw_text": "except section 521(a)(2) shall be applied by substituting ''$25,300,000'' for ''$75,300,000''",
  "section": "SEC. 101",
  "division": "A"
}

Both dollar amounts — the new and the old — are independently verified against the source bill text.

Find All CR Substitutions

The --type cr_substitution filter finds every anomaly in a continuing resolution:

congress-approp search --dir data/118-hr5860 --type cr_substitution
┌───┬───────────┬──────────────────────────────────────────┬───────────────┬───────────────┬──────────────┬──────────┬─────┐
│ $ ┆ Bill      ┆ Account                                  ┆       New ($) ┆       Old ($) ┆    Delta ($) ┆ Section  ┆ Div │
╞═══╪═══════════╪══════════════════════════════════════════╪═══════════════╪═══════════════╪══════════════╪══════════╪═════╡
│ ✓ ┆ H.R. 5860 ┆ Rural Housing Service—Rural Community…   ┆    25,300,000 ┆    75,300,000 ┆  -50,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Rural Utilities Service—Rural Water a…   ┆    60,000,000 ┆   325,000,000 ┆ -265,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆                                          ┆   122,572,000 ┆   705,768,000 ┆ -583,196,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ National Science Foundation—STEM Educ…   ┆    92,000,000 ┆   217,000,000 ┆ -125,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ National Oceanic and Atmospheric Admini… ┆    42,000,000 ┆    62,000,000 ┆  -20,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ National Science Foundation—Research …   ┆   608,162,000 ┆   818,162,000 ┆ -210,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Department of State—Administration of…   ┆    87,054,000 ┆   147,054,000 ┆  -60,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Bilateral Economic Assistance—Funds A…   ┆   637,902,000 ┆   937,902,000 ┆ -300,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Bilateral Economic Assistance—Departm…   ┆   915,048,000 ┆ 1,535,048,000 ┆ -620,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ International Security Assistance—Dep…   ┆    74,996,000 ┆   374,996,000 ┆ -300,000,000 ┆ SEC. 101 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Office of Personnel Management—Salari…   ┆   219,076,000 ┆   190,784,000 ┆  +28,292,000 ┆ SEC. 126 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Department of Transportation—Federal …   ┆   617,000,000 ┆   570,000,000 ┆  +47,000,000 ┆ SEC. 137 ┆ A   │
│ ✓ ┆ H.R. 5860 ┆ Department of Transportation—Federal …   ┆ 2,174,200,000 ┆ 2,221,200,000 ┆  -47,000,000 ┆ SEC. 137 ┆ A   │
└───┴───────────┴──────────────────────────────────────────┴───────────────┴───────────────┴──────────────┴──────────┴─────┘
13 provisions found

$ = Amount status: ✓ found (unique), ≈ found (multiple matches), ✗ not found

Notice how the table automatically changes shape when you search for CR substitutions — instead of a single Amount column, you get three:

| Column | Meaning |
|---|---|
| New ($) | The new dollar amount the CR substitutes in (the “X” in “substituting X for Y”) |
| Old ($) | The old dollar amount being replaced (the “Y”) |
| Delta ($) | New minus Old. Negative means a cut, positive means an increase. |

Every dollar amount has ✓ verification — both the new and old amounts were found in the source bill text. All 13 CR substitutions in H.R. 5860 are fully verified.

Interpret the Results

Which programs were cut?

Eleven of the thirteen CR substitutions are negative deltas — Congress funded these programs below the prior-year level during the temporary spending period. The largest cuts:

| Account | New | Old | Delta | Cut % |
|---|---|---|---|---|
| Migration and Refugee Assistance | $915M | $1,535M | -$620M | -40.4% |
| (section 521(d)(1) reference) | $123M | $706M | -$583M | -82.6% |
| Bilateral Economic Assistance | $638M | $938M | -$300M | -32.0% |
| Int’l Narcotics Control | $75M | $375M | -$300M | -80.0% |
| Rural Water and Waste Disposal | $60M | $325M | -$265M | -81.5% |

Which programs got more?

Only two programs received increases:

| Account | New | Old | Delta | Increase % |
|---|---|---|---|---|
| OPM Salaries and Expenses | $219M | $191M | +$28M | +14.8% |
| FAA Facilities and Equipment | $617M | $570M | +$47M | +8.2% |

Missing account names

The third row in the table has no account name — just $122,572,000 / $705,768,000. This happens when the CR language references a section of law rather than naming an account directly:

except section 521(d)(1) shall be applied by substituting ''$122,572,000'' for ''$705,768,000''

Section 521(d)(1) refers to the rental assistance voucher program under the Housing Act of 1949. The tool captures the amounts and the raw text but can’t always infer the account name when the bill text uses a statutory reference instead.

You can see the full details in JSON:

congress-approp search --dir data/118-hr5860 --type cr_substitution --format json

The raw_text field will show the full excerpt for each provision, including the statutory reference.

Export CR Substitutions

CSV for spreadsheets

congress-approp search --dir data/118-hr5860 --type cr_substitution --format csv > cr_anomalies.csv

The CSV includes the dollars column (new amount), old_dollars column (old amount), and all other fields. You can compute the delta in Excel as =A2-B2 or use the dollars and old_dollars columns directly.

JSON for scripts

congress-approp search --dir data/118-hr5860 --type cr_substitution --format json > cr_anomalies.json

JSON output includes every field:

{
  "account_name": "Rural Housing Service—Rural Community Facilities Program Account",
  "amount_status": "found",
  "bill": "H.R. 5860",
  "description": "Rural Housing Service—Rural Community Facilities Program Account",
  "division": "A",
  "dollars": 25300000,
  "match_tier": "exact",
  "old_dollars": 75300000,
  "provision_index": 3,
  "provision_type": "cr_substitution",
  "quality": "strong",
  "raw_text": "except section 521(a)(2) shall be applied by substituting ''$25,300,000'' for ''$75,300,000''",
  "section": "SEC. 101",
  "semantics": "new_budget_authority"
}

Sort by largest cut using jq

congress-approp search --dir data/118-hr5860 --type cr_substitution --format json | \
  jq -r 'map(. + {delta: (.dollars - .old_dollars)}) | sort_by(.delta) | .[] |
    "\(.delta)\t\(.account_name // "unnamed")"'

The -r flag makes jq emit raw tab-separated lines instead of JSON-quoted strings.
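If you are consuming the JSON from a program rather than a shell pipeline, the same delta-and-sort is easy in Rust. This is a sketch with a hypothetical struct that mirrors the JSON fields shown above (dollars is the new amount, old_dollars the old one); it is not the crate's own result type:

```rust
// Hypothetical stand-in for the JSON records above; a sketch, not the
// crate's SearchResult type.
struct CrSub {
    account_name: Option<String>,
    dollars: i64,     // new amount
    old_dollars: i64, // old amount
}

// Deepest cuts first: ascending by delta, exactly like jq's sort_by(.delta)
fn sorted_deltas(subs: &[CrSub]) -> Vec<(i64, String)> {
    let mut rows: Vec<(i64, String)> = subs
        .iter()
        .map(|s| {
            (
                s.dollars - s.old_dollars,
                s.account_name.clone().unwrap_or_else(|| "unnamed".into()),
            )
        })
        .collect();
    rows.sort_by_key(|(delta, _)| *delta);
    rows
}

fn main() {
    let subs = [
        CrSub { account_name: Some("OPM Salaries and Expenses".into()), dollars: 219_076_000, old_dollars: 190_784_000 },
        CrSub { account_name: None, dollars: 122_572_000, old_dollars: 705_768_000 },
    ];
    for (delta, name) in sorted_deltas(&subs) {
        println!("{}\t{}", delta, name);
    }
}
```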

Match CR Substitutions to Omnibus Provisions

A natural follow-up question is: “This CR cut Rural Water from $325M to $60M. What did the full-year omnibus give it?”

Using --similar

If embeddings are available, use --similar to find the omnibus counterpart. First, find the CR substitution’s provision index:

congress-approp search --dir data/118-hr5860 --type cr_substitution --format json | \
  jq '.[] | select(.account_name | test("Rural.*Water"; "i")) | .provision_index'

Then find similar provisions across all bills:

congress-approp search --dir data --similar hr5860:<INDEX> --type appropriation --top 3

Even though the CR names accounts differently than the omnibus (e.g., “Rural Utilities Service—Rural Water and Waste Disposal Program Account” vs. “Rural Water and Waste Disposal Program Account”), the embedding similarity is typically in the 0.75–0.80 range — well above the threshold for confident matching.

Using --account

If the names are close enough, a substring search works:

congress-approp search --dir data/118-hr4366 --account "Rural Water" --type appropriation

This will find the omnibus appropriation for the same program, letting you compare the CR anomaly level to the full-year funding.

Understanding the CR Structure

Not all provisions in a CR are substitutions. The full structure of H.R. 5860 includes:

| Type | Count | Role |
|---|---|---|
| rider | 49 | Policy provisions extending or modifying existing authorities |
| mandatory_spending_extension | 44 | Extensions of mandatory programs that would otherwise expire |
| cr_substitution | 13 | Anomalies — programs funded at different-than-prior-year rates |
| other | 12 | Miscellaneous provisions |
| appropriation | 5 | Standalone new appropriations (FEMA disaster relief, IG funding) |
| limitation | 4 | Spending caps and prohibitions |
| directive | 2 | Reporting requirements |
| continuing_resolution_baseline | 1 | The core mechanism (SEC. 101) establishing prior-year rates |

The continuing_resolution_baseline provision (usually SEC. 101) establishes the default rule: fund everything at the prior fiscal year’s rate. The CR substitutions are exceptions to that rule. Everything else — riders, mandatory extensions, limitations — modifies or supplements the baseline.
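The default-plus-exceptions mechanism can be modeled in a few lines. This is a toy sketch of the concept (the names and types here are illustrative, not the crate's API): every account falls back to its prior-year rate unless a substitution names it.

```rust
use std::collections::HashMap;

// Toy model of the CR mechanism: SEC. 101 funds every account at its
// prior-year rate, and a cr_substitution overrides that default for the
// handful of accounts it names.
fn cr_funding(
    prior_year: &HashMap<&str, i64>,
    substitutions: &HashMap<&str, i64>,
) -> HashMap<String, i64> {
    prior_year
        .iter()
        .map(|(acct, &rate)| {
            let funded = substitutions.get(acct).copied().unwrap_or(rate);
            (acct.to_string(), funded)
        })
        .collect()
}

fn main() {
    // Amounts in $M, for illustration only
    let prior = HashMap::from([("Rural Water", 325), ("FBI Salaries", 500)]);
    let subs = HashMap::from([("Rural Water", 60)]); // anomaly: below prior year
    let funded = cr_funding(&prior, &subs);
    assert_eq!(funded["Rural Water"], 60);   // substitution wins
    assert_eq!(funded["FBI Salaries"], 500); // default: prior-year rate
}
```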

To see the full picture:

# All provisions in the CR
congress-approp search --dir data/118-hr5860

# The baseline mechanism
congress-approp search --dir data/118-hr5860 --type continuing_resolution_baseline

# Mandatory programs extended
congress-approp search --dir data/118-hr5860 --type mandatory_spending_extension

# Standalone appropriations (FEMA, etc.)
congress-approp search --dir data/118-hr5860 --type appropriation

Verify CR Substitution Amounts

Both dollar amounts in each CR substitution are independently verified. You can confirm this in the audit:

congress-approp audit --dir data/118-hr5860

The audit shows NotFound = 0 for H.R. 5860, meaning every dollar string — including both the “new” and “old” amounts in all 13 CR substitutions — was found in the source bill text.

To verify a specific pair manually:

# Check that both amounts from the Migration and Refugee Assistance anomaly exist
grep '915,048,000' data/118-hr5860/BILLS-118hr5860enr.xml
grep '1,535,048,000' data/118-hr5860/BILLS-118hr5860enr.xml

Both should return matches. The source text will show them adjacent to each other in a “substituting X for Y” pattern.
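The “substituting X for Y” pattern is regular enough that you can pull both amounts out with plain string search. This is a minimal sketch, not the tool's actual parser; it relies on the enrolled text quoting amounts with doubled apostrophes:

```rust
// Digits-only parse: strips "$" and commas from a quoted amount
fn parse_dollars(s: &str) -> Option<i64> {
    let digits: String = s.chars().filter(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
}

// Extract (new, old) from a "substituting ''$X'' for ''$Y''" clause
fn substitution_amounts(text: &str) -> Option<(i64, i64)> {
    let start = text.find("substituting ''")? + "substituting ''".len();
    let rest = &text[start..];
    let end_new = rest.find("''")?;
    let new = parse_dollars(&rest[..end_new])?;
    let rest = &rest[end_new..];
    let start_old = rest.find("for ''")? + "for ''".len();
    let rest = &rest[start_old..];
    let end_old = rest.find("''")?;
    let old = parse_dollars(&rest[..end_old])?;
    Some((new, old))
}

fn main() {
    let clause = "except section 521(d)(1) shall be applied by substituting \
''$122,572,000'' for ''$705,768,000''";
    assert_eq!(substitution_amounts(clause), Some((122_572_000, 705_768_000)));
    println!("parsed both amounts");
}
```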

Tips for CR Analysis

  1. CRs don’t show the full funding picture. Programs not mentioned in CR substitutions are funded at the prior-year rate. The CR itself doesn’t state what that rate is — you need the prior year’s appropriations bill to know the baseline.

  2. Watch for paired substitutions. The two FAA provisions at the bottom of the table (SEC. 137) have opposite deltas: +$47M for Facilities and Equipment and -$47M for a second FAA account. This is a reallocation within the agency, not a net change in FAA funding.

  3. Some substitutions reference statute sections, not accounts. When the bill says “section 521(d)(1) shall be applied by substituting X for Y,” the tool captures both amounts but may not identify the account name. Check the raw_text field for the statutory reference and look it up in the U.S. Code.

  4. Export and sort by delta for the narrative. The story is always “which programs got cut, which got more, and by how much.” Export to CSV, sort by delta, and you have the outline for a briefing or article.

  5. Use --similar to find the regular appropriation. Every CR anomaly corresponds to a regular appropriation in an omnibus or annual bill. The --similar command finds that correspondence even when naming conventions differ between bills.

Quick Reference

# Find all CR substitutions
congress-approp search --dir data/118-hr5860 --type cr_substitution

# Export to CSV
congress-approp search --dir data/118-hr5860 --type cr_substitution --format csv > cr_subs.csv

# Export to JSON
congress-approp search --dir data/118-hr5860 --type cr_substitution --format json

# Find the full-year omnibus equivalent of a CR account
congress-approp search --dir data --similar hr5860:<INDEX> --type appropriation --top 3

# See all CR provisions (not just substitutions)
congress-approp search --dir data/118-hr5860

# Audit CR verification
congress-approp audit --dir data/118-hr5860

Next Steps

Use the Library API from Rust

You will need: A Rust project with congress-appropriations as a dependency.

You will learn: How to load extracted bill data, query it programmatically using the library API, and build custom analysis tools on top of the structured provision data.

congress-appropriations is both a CLI tool and a Rust library. The library exposes the same query functions the CLI uses — summarize, search, compare, audit, rollup_by_department, and build_embedding_text — as pure functions that take loaded bill data and return plain data structs. No I/O, no formatting, no side effects.

This guide shows you how to use the library in your own Rust projects.

Add the Dependency

Add congress-appropriations to your Cargo.toml:

[dependencies]
congress-appropriations = "3.0"

The crate re-exports the key types you need:

#![allow(unused)]
fn main() {
use congress_appropriations::{load_bills, query, LoadedBill};
use congress_appropriations::approp::query::SearchFilter;
}

Load Bills

The entry point is load_bills(), which recursively walks a directory to find all extraction.json files and loads them along with their sibling verification and metadata files:

use congress_appropriations::load_bills;
use std::path::Path;

fn main() -> anyhow::Result<()> {
    let bills = load_bills(Path::new("data"))?;
    println!("Loaded {} bills", bills.len());

    for bill in &bills {
        println!(
            "  {} ({}) — {} provisions",
            bill.extraction.bill.identifier,
            bill.extraction.bill.classification,
            bill.extraction.provisions.len()
        );
    }

    Ok(())
}

Expected output with the included example data:

Loaded 3 bills
  H.R. 4366 (Omnibus) — 2364 provisions
  H.R. 5860 (Continuing Resolution) — 130 provisions
  H.R. 9468 (Supplemental) — 7 provisions

What LoadedBill Contains

Each LoadedBill has four fields:

#![allow(unused)]
fn main() {
pub struct LoadedBill {
    /// Path to the bill directory on disk
    pub dir: PathBuf,
    /// The extraction output: bill info, provisions array, summary
    pub extraction: BillExtraction,
    /// Verification report (if verification.json exists)
    pub verification: Option<VerificationReport>,
    /// Extraction metadata (if metadata.json exists)
    pub metadata: Option<ExtractionMetadata>,
}
}

Only extraction is required — verification and metadata are loaded if their files exist but are None otherwise. This means you can use the library on data that was only partially extracted.

Summarize Bills

The summarize function computes per-bill budget authority, rescissions, and net BA:

use congress_appropriations::{load_bills, query};
use std::path::Path;

fn main() -> anyhow::Result<()> {
    let bills = load_bills(Path::new("data"))?;
    let summaries = query::summarize(&bills);

    for s in &summaries {
        println!(
            "{}: ${:>15} BA, ${:>13} rescissions, ${:>15} net",
            s.identifier,
            format_dollars(s.budget_authority),
            format_dollars(s.rescissions),
            format_dollars(s.net_ba),
        );
    }

    Ok(())
}

fn format_dollars(n: i64) -> String {
    // Simple comma formatting for display
    let s = n.to_string();
    let mut result = String::new();
    for (i, c) in s.chars().rev().enumerate() {
        if i > 0 && i % 3 == 0 && c != '-' {
            result.push(',');
        }
        result.push(c);
    }
    result.chars().rev().collect()
}

BillSummary Fields

#![allow(unused)]
fn main() {
pub struct BillSummary {
    pub identifier: String,      // e.g., "H.R. 4366"
    pub classification: String,  // e.g., "Omnibus"
    pub provisions: usize,       // total provision count
    pub budget_authority: i64,   // sum of new_budget_authority provisions
    pub rescissions: i64,        // sum of rescission provisions (absolute)
    pub net_ba: i64,             // budget_authority - rescissions
    pub completeness_pct: Option<f64>, // from verification, if available
}
}

Budget authority is computed from the actual provisions — it sums all Appropriation provisions where semantics == NewBudgetAuthority and detail_level is not sub_allocation or proviso_amount. The LLM’s self-reported totals are never used.
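The filter-and-sum rule can be written out with stand-in types. This is a sketch of the stated logic, not the crate's real enums or field names:

```rust
// Stand-in types illustrating the summation rule described above
#[derive(PartialEq)]
enum Semantics { NewBudgetAuthority, Rescission }
#[derive(PartialEq)]
enum DetailLevel { Account, SubAllocation, ProvisoAmount }

struct Approp { dollars: i64, semantics: Semantics, detail_level: DetailLevel }

// Sum NewBudgetAuthority provisions, skipping sub_allocation and
// proviso_amount rows so sub-items aren't double-counted
fn budget_authority(provisions: &[Approp]) -> i64 {
    provisions
        .iter()
        .filter(|p| p.semantics == Semantics::NewBudgetAuthority)
        .filter(|p| {
            p.detail_level != DetailLevel::SubAllocation
                && p.detail_level != DetailLevel::ProvisoAmount
        })
        .map(|p| p.dollars)
        .sum()
}

fn main() {
    let provisions = [
        Approp { dollars: 100, semantics: Semantics::NewBudgetAuthority, detail_level: DetailLevel::Account },
        Approp { dollars: 40, semantics: Semantics::NewBudgetAuthority, detail_level: DetailLevel::SubAllocation },
        Approp { dollars: 25, semantics: Semantics::Rescission, detail_level: DetailLevel::Account },
    ];
    // Sub-allocation and rescission rows are excluded from BA
    assert_eq!(budget_authority(&provisions), 100);
}
```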

Search Provisions

The search function takes a SearchFilter and returns matching provisions:

#![allow(unused)]
fn main() {
use congress_appropriations::approp::query::{SearchFilter, SearchResult};

let results = query::search(&bills, &SearchFilter {
    provision_type: Some("appropriation"),
    agency: Some("Veterans"),
    min_dollars: Some(1_000_000_000),
    ..Default::default()
});

for r in &results {
    println!(
        "[{}] {} — ${:?}",
        r.bill_identifier, r.account_name, r.dollars
    );
}
}

SearchFilter Fields

All fields are optional and use AND logic — every field that is Some must match:

#![allow(unused)]
fn main() {
pub struct SearchFilter<'a> {
    pub provision_type: Option<&'a str>,  // e.g., "appropriation"
    pub agency: Option<&'a str>,          // case-insensitive substring
    pub account: Option<&'a str>,         // case-insensitive substring
    pub keyword: Option<&'a str>,         // search in raw_text
    pub bill: Option<&'a str>,            // exact bill identifier
    pub division: Option<&'a str>,        // division letter, e.g., "A"
    pub min_dollars: Option<i64>,         // minimum absolute dollar amount
    pub max_dollars: Option<i64>,         // maximum absolute dollar amount
}
}
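The AND semantics are worth seeing concretely. Here is a trimmed-down sketch (two illustrative fields, not the crate's full filter) showing how every Some field must match while None fields are unrestricted:

```rust
// Trimmed-down stand-in for the filter: every Some field must match (AND),
// every None field is unrestricted.
#[derive(Default)]
struct Filter<'a> {
    agency: Option<&'a str>,   // case-insensitive substring
    min_dollars: Option<i64>,  // minimum absolute amount
}

struct Row { agency: String, dollars: i64 }

fn matches(f: &Filter, r: &Row) -> bool {
    f.agency
        .map_or(true, |a| r.agency.to_lowercase().contains(&a.to_lowercase()))
        && f.min_dollars.map_or(true, |m| r.dollars.abs() >= m)
}

fn main() {
    let row = Row { agency: "Department of Veterans Affairs".into(), dollars: 2_000_000_000 };
    let f = Filter { agency: Some("veterans"), min_dollars: Some(1_000_000_000) };
    assert!(matches(&f, &row));
    // An all-None filter matches everything
    assert!(matches(&Filter::default(), &row));
}
```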

You can construct a filter with defaults for all fields and override just the ones you care about:

#![allow(unused)]
fn main() {
let filter = SearchFilter {
    provision_type: Some("rescission"),
    min_dollars: Some(100_000_000),
    ..Default::default()
};
}

Compare Bills

The compare function computes account-level deltas between two sets of bills:

#![allow(unused)]
fn main() {
let base_bills = load_bills(Path::new("data/118-hr4366"))?;
let current_bills = load_bills(Path::new("data/118-hr9468"))?;

let deltas = query::compare(&base_bills, &current_bills, None);

for d in &deltas {
    println!(
        "{}: base=${}, current=${}, delta={} ({})",
        d.account_name,
        d.base_dollars,
        d.current_dollars,
        d.delta,
        d.status,
    );
}
}

The optional third parameter is an agency filter (Option<&str>) that restricts the comparison to accounts from a specific agency.

AccountDelta Fields

#![allow(unused)]
fn main() {
pub struct AccountDelta {
    pub agency: String,
    pub account_name: String,
    pub base_dollars: i64,
    pub current_dollars: i64,
    pub delta: i64,
    pub delta_pct: f64,
    pub status: String,  // "changed", "unchanged", "only in base", "only in current"
}
}

Results are sorted by the absolute value of delta, largest changes first.
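That ordering is easy to reproduce with std::cmp::Reverse on the absolute delta. A sketch with a stand-in struct (not the crate's AccountDelta):

```rust
use std::cmp::Reverse;

// Stand-in struct showing the ordering rule: sort by |delta| descending
// so the largest change leads, whatever its sign.
struct Delta { account: &'static str, delta: i64 }

fn sort_largest_first(deltas: &mut [Delta]) {
    deltas.sort_by_key(|d| Reverse(d.delta.abs()));
}

fn main() {
    let mut deltas = [
        Delta { account: "Bilateral Economic Assistance", delta: -300 },
        Delta { account: "FAA Facilities and Equipment", delta: 47 },
        Delta { account: "Migration and Refugee Assistance", delta: -620 },
    ];
    sort_largest_first(&mut deltas);
    let order: Vec<_> = deltas.iter().map(|d| d.account).collect();
    assert_eq!(order, [
        "Migration and Refugee Assistance",
        "Bilateral Economic Assistance",
        "FAA Facilities and Equipment",
    ]);
}
```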

Audit Bills

The audit function returns per-bill verification metrics:

#![allow(unused)]
fn main() {
let audit_rows = query::audit(&bills);

for row in &audit_rows {
    println!(
        "{}: {} provisions, {} verified, {} not found, {:.1}% coverage",
        row.identifier,
        row.provisions,
        row.verified,
        row.not_found,
        row.completeness_pct.unwrap_or(0.0),
    );
}
}

AuditRow Fields

#![allow(unused)]
fn main() {
pub struct AuditRow {
    pub identifier: String,
    pub provisions: usize,
    pub verified: usize,       // dollar amounts found at unique position
    pub not_found: usize,      // dollar amounts NOT found in source
    pub ambiguous: usize,      // dollar amounts found at multiple positions
    pub exact: usize,          // raw_text byte-identical match
    pub normalized: usize,     // raw_text normalized match
    pub spaceless: usize,      // raw_text spaceless match
    pub no_match: usize,       // raw_text not found
    pub completeness_pct: Option<f64>,
}
}

The critical metric is not_found — it should be 0 for every bill. Across the included example data, it is.

Roll Up by Department

The rollup_by_department function aggregates budget authority by parent department. This is a query-time computation — it never modifies stored data:

#![allow(unused)]
fn main() {
let agencies = query::rollup_by_department(&bills);

for a in &agencies {
    println!(
        "{}: ${} BA, ${} rescissions, {} provisions",
        a.department,
        a.budget_authority,
        a.rescissions,
        a.provision_count,
    );
}
}

Agency names are split at the first comma to extract the parent department (e.g., “Salaries and Expenses, Federal Bureau of Investigation” → “Federal Bureau of Investigation”). The exception is “Office of Inspector General, …” which takes the text after the comma.

Results are sorted by budget authority descending.
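One plausible reading of the comma heuristic, assuming the take-the-text-after-the-comma interpretation described above (a sketch, not the crate's code):

```rust
// Take the text after the first comma as the parent department, falling
// back to the whole name when there is no comma. This mirrors the
// heuristic described above; the crate's actual rules may differ.
fn parent_department(agency: &str) -> &str {
    match agency.split_once(',') {
        Some((_, after)) => after.trim(),
        None => agency,
    }
}

fn main() {
    assert_eq!(
        parent_department("Salaries and Expenses, Federal Bureau of Investigation"),
        "Federal Bureau of Investigation"
    );
    // No comma: the name is its own department
    assert_eq!(
        parent_department("Food and Drug Administration"),
        "Food and Drug Administration"
    );
}
```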

Build Embedding Text

The build_embedding_text function constructs the deterministic text representation used for embedding a provision. This is useful if you want to use your own embedding model instead of OpenAI:

#![allow(unused)]
fn main() {
use congress_appropriations::approp::ontology::Provision;

for provision in bills[0].extraction.provisions.iter().take(3) {
    let text = query::build_embedding_text(provision);
    // Preview by chars, not bytes: a byte slice can panic mid-character
    let preview: String = text.chars().take(100).collect();
    println!("Embedding text ({} chars): {}...", text.len(), preview);
}
}

The text concatenates the provision’s meaningful fields (account name, agency, program, raw text) in a consistent format. The same provision always produces the same text, regardless of when or where you call the function.

Access Provision Fields Directly

The Provision enum has 11 variants. Accessor methods provide a uniform interface across all variants:

#![allow(unused)]
fn main() {
use congress_appropriations::approp::ontology::{Provision, AmountSemantics};

for bill in &bills {
    for p in &bill.extraction.provisions {
        // These methods work on all provision variants:
        let ptype = p.provision_type_str();   // e.g., "appropriation"
        let account = p.account_name();       // "" if not applicable
        let agency = p.agency();              // "" if not applicable
        let section = p.section();            // e.g., "SEC. 101"
        let division = p.division();          // Some("A") or None
        let raw_text = p.raw_text();          // bill text excerpt
        let confidence = p.confidence();      // 0.0-1.0

        // Amount access returns Option<&DollarAmount>
        if let Some(amt) = p.amount() {
            if matches!(amt.semantics, AmountSemantics::NewBudgetAuthority) {
                if let Some(dollars) = amt.dollars() {
                    println!("{}: ${}", account, dollars);
                }
            }
        }
    }
}
}

Key accessor methods

| Method | Returns | Notes |
|---|---|---|
| provision_type_str() | &str | e.g., "appropriation", "rescission" |
| account_name() | &str | Empty string for types without accounts |
| agency() | &str | Empty string for types without agencies |
| section() | &str | e.g., "SEC. 101" or empty |
| division() | Option<&str> | Some("A") or None |
| raw_text() | &str | Bill text excerpt (~150 chars) |
| confidence() | f32 | LLM self-assessed confidence, 0.0–1.0 |
| amount() | Option<&DollarAmount> | The primary dollar amount, if any |
| description() | &str | Description field, if applicable |

Pattern matching for type-specific fields

When you need fields specific to a provision type, use pattern matching:

#![allow(unused)]
fn main() {
match p {
    Provision::Appropriation {
        account_name,
        agency,
        amount,
        detail_level,
        parent_account,
        fiscal_year,
        availability,
        ..
    } => {
        println!("Appropriation: {} (detail: {})", account_name, detail_level);
        if let Some(parent) = parent_account {
            println!("  Sub-allocation of: {}", parent);
        }
    }
    Provision::CrSubstitution {
        account_name,
        new_amount,
        old_amount,
        ..
    } => {
        let new_d = new_amount.dollars().unwrap_or(0);
        let old_d = old_amount.dollars().unwrap_or(0);
        println!("CR Sub: {} — ${} → ${} (delta: ${})",
            account_name.as_deref().unwrap_or("unnamed"),
            old_d, new_d, new_d - old_d);
    }
    Provision::Rescission {
        account_name,
        amount,
        reference_law,
        ..
    } => {
        println!("Rescission: {} — ${}", account_name, amount.dollars().unwrap_or(0));
        if let Some(law) = reference_law {
            println!("  From: {}", law);
        }
    }
    _ => {
        // Handle other provision types generically
        println!("{}: {}", p.provision_type_str(), p.description());
    }
}
}

Compute Budget Authority Manually

The BillExtraction struct has a compute_totals() method that returns (budget_authority, rescissions):

#![allow(unused)]
fn main() {
for bill in &bills {
    let (ba, rescissions) = bill.extraction.compute_totals();
    let net = ba - rescissions;
    println!("{}: BA=${}, Rescissions=${}, Net=${}",
        bill.extraction.bill.identifier, ba, rescissions, net);
}
}

This uses the same logic as the summary command: it sums Appropriation provisions where semantics == NewBudgetAuthority and detail_level is not sub_allocation or proviso_amount.

Full Working Example

Here’s a complete program that loads all example bills, finds the top 10 appropriations by dollar amount, and prints them:

use congress_appropriations::{load_bills, query};
use congress_appropriations::approp::query::SearchFilter;
use std::path::Path;

fn main() -> anyhow::Result<()> {
    // Load all bills under data/
    let bills = load_bills(Path::new("data"))?;
    println!("Loaded {} bills with {} total provisions\n",
        bills.len(),
        bills.iter().map(|b| b.extraction.provisions.len()).sum::<usize>()
    );

    // Search for all appropriations
    let results = query::search(&bills, &SearchFilter {
        provision_type: Some("appropriation"),
        ..Default::default()
    });

    // Sort by dollars descending, take top 10
    let mut with_dollars: Vec<_> = results.iter()
        .filter(|r| r.dollars.is_some())
        .collect();
    with_dollars.sort_by(|a, b| b.dollars.unwrap().abs().cmp(&a.dollars.unwrap().abs()));

    println!("Top 10 appropriations by dollar amount:");
    println!("{:<50} {:>20} {}", "Account", "Amount", "Bill");
    println!("{}", "-".repeat(85));

    for r in with_dollars.iter().take(10) {
        // Truncate by chars, not bytes: account names contain "—", and a
        // byte slice that splits a multi-byte character panics
        let name: String = r.account_name.chars().take(48).collect();
        println!("{:<50} ${:>18} {}", name, r.dollars.unwrap(), r.bill_identifier);
    }

    // Budget summary
    println!("\nBudget Summary:");
    for s in query::summarize(&bills) {
        println!("  {}: ${} BA, ${} rescissions",
            s.identifier, s.budget_authority, s.rescissions);
    }

    Ok(())
}

Design Principles

The library API follows these conventions:

  1. All query functions are pure. They take &[LoadedBill] and return data. No side effects, no I/O, no API calls, no formatting.

  2. The CLI formats; the library computes. main.rs handles table/JSON/CSV/JSONL rendering. The library returns structs that derive Serialize for easy JSON output.

  3. Semantic search is separate. Embedding loading and cosine similarity live in embeddings.rs, not query.rs. This keeps the library usable without OpenAI. The CLI wires them together for --semantic and --similar searches.

  4. Error handling uses anyhow. All fallible functions return anyhow::Result<T>. For library consumers who prefer typed errors, the underlying error types from thiserror are also available.

  5. Serde for everything. All data types derive Serialize and Deserialize. You can serialize any query result to JSON with serde_json::to_string(&results)?.

Working with Embeddings

The embeddings module is separate from the query module. If you want to work with embedding vectors directly:

#![allow(unused)]
fn main() {
use congress_appropriations::approp::embeddings;
use std::path::Path;

// Load embeddings for a bill
if let Some(loaded) = embeddings::load(Path::new("data/118-hr9468"))? {
    println!("Loaded {} vectors of {} dimensions",
        loaded.count(), loaded.dimensions());

    // Get the vector for provision 0
    let vec0 = loaded.vector(0);
    println!("First 5 dimensions: {:?}", &vec0[..5]);

    // Compute cosine similarity between two provisions
    let sim = embeddings::cosine_similarity(loaded.vector(0), loaded.vector(1));
    println!("Similarity between provisions 0 and 1: {:.4}", sim);
}
}

Key embedding functions

| Function | Description |
|---|---|
| embeddings::load(dir) | Load embeddings from a bill directory. Returns Option<LoadedEmbeddings>. |
| embeddings::save(dir, metadata, vectors) | Save embeddings to a bill directory. |
| embeddings::cosine_similarity(a, b) | Compute cosine similarity between two vectors. |
| embeddings::normalize(vec) | L2-normalize a vector in place. |
| loaded.vector(i) | Get the embedding vector for provision at index i. |
| loaded.count() | Number of provisions with embeddings. |
| loaded.dimensions() | Number of dimensions per vector (e.g., 3072). |
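If you want to see what the two math helpers presumably compute, here is a std-only sketch (the crate's own signatures may differ):

```rust
// Cosine similarity: dot product divided by the product of L2 norms.
// Returns 0.0 for a zero vector to avoid dividing by zero.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

// L2-normalize in place: after this, the vector has unit length
fn normalize(v: &mut [f32]) {
    let norm: f32 = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

fn main() {
    let mut v = [3.0f32, 4.0];
    normalize(&mut v);
    assert!((v[0] - 0.6).abs() < 1e-6 && (v[1] - 0.8).abs() < 1e-6);
    // Same direction: similarity 1.0; orthogonal: 0.0
    assert!((cosine_similarity(&[1.0, 0.0], &[2.0, 0.0]) - 1.0).abs() < 1e-6);
    assert!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
}
```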

Tips

  1. Load once, query many times. load_bills() does all the file I/O. After that, all query functions work on in-memory data and are extremely fast.

  2. Use SearchFilter::default() as a base. Override only the fields you need — all None fields are unrestricted.

  3. Check provision_type_str() instead of pattern matching when you just need the type name as a string.

  4. The amount() accessor returns None for provisions without dollar amounts. Riders, directives, and some other types don’t carry amounts. Always handle the None case.

  5. Budget authority totals should match the CLI. If compute_totals() returns different numbers than congress-approp summary, something is wrong. The included example data produces these exact totals: H.R. 4366 = $846,137,099,554 BA; H.R. 5860 = $16,000,000,000 BA; H.R. 9468 = $2,882,482,000 BA.

Next Steps

Upgrade Extraction Data

You will need: congress-approp installed, existing extracted bill data (with extraction.json).

You will learn: How to use the upgrade command to migrate extraction data to the latest schema version, re-verify against current code, and update files — all without making any LLM API calls.

The upgrade command is your tool for keeping extraction data current without re-extracting. When the tool’s schema evolves — new fields, renamed fields, new verification checks, or updated deserialization logic — upgrade applies those changes to your existing data. It re-deserializes each bill’s extraction.json through the current code’s parsing logic, re-runs deterministic verification against the source XML, and writes updated files.

No LLM API calls are made. Upgrade is fast, free, and safe.

When to Use Upgrade

Use upgrade when:

  • You’ve updated congress-approp to a new version that includes schema changes, new provision type handling, or improved verification logic. The upgrade command applies those improvements to your existing extractions.
  • You want to re-verify without re-extracting. Maybe you suspect the verification logic has been improved, or you want to check data integrity after moving files between systems.
  • You see schema version warnings. If your data was extracted with an older schema version and the tool detects this, it may suggest running upgrade.
  • You want to normalize data. Upgrade re-serializes through the current schema, which normalizes field names, fills in defaults for new fields, and standardizes enum values.

Do NOT use upgrade when:

  • You want a fresh extraction with a different model. Use extract instead — that makes new LLM API calls.
  • Your source XML has changed. If you re-downloaded the bill, you need to re-extract, not upgrade.

Preview Before Upgrading

Always start with a dry run:

congress-approp upgrade --dir data --dry-run

This shows what would change for each bill without writing any files:

  • Which bills would be upgraded
  • Whether the schema version would change
  • How many provisions would be re-parsed
  • Whether verification results would differ

No files are modified during a dry run.

Run the Upgrade

Upgrade all bills in a directory

congress-approp upgrade --dir data

The tool walks recursively from the specified directory, finds every extraction.json, and upgrades each one. For each bill:

  1. Load the existing extraction.json
  2. Re-deserialize every provision through the current from_value.rs parsing logic, which handles missing fields, type coercions, and unknown provision types
  3. Re-compute the schema_version field
  4. Re-run verification against the source XML (if BILLS-*.xml is present in the same directory)
  5. Write updated extraction.json and verification.json

Upgrade a single bill

congress-approp upgrade --dir data/118/hr/9468

Verbose output

Add -v for detailed logging:

congress-approp upgrade --dir data -v

This shows per-provision details: which fields were defaulted, which types were coerced, and any warnings from the deserialization process.

What Upgrade Changes

extraction.json

  • schema_version is set to the current version
  • New fields added in recent versions get their default values (e.g., a new Option<String> field defaults to null)
  • Renamed fields are mapped from old names to new names
  • Type coercions are applied — for example, if a dollar amount was stored as a string "$10,000,000" in an old extraction, upgrade converts it to the integer 10000000
  • Unknown provision types that have since been added to the schema are re-parsed into their proper variant instead of falling back to Other

The provision data itself is not re-generated — upgrade works with whatever the LLM originally produced. It only normalizes the representation, not the content.
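The string-to-integer coercion mentioned above is simple to sketch. This is illustrative, not the tool's actual coercion code:

```rust
// Coerce an old-format dollar string like "$10,000,000" to an i64 by
// keeping only the digits. Returns None when there are no digits at all
// (e.g., "such sums as may be necessary").
fn coerce_dollars(raw: &str) -> Option<i64> {
    let digits: String = raw.chars().filter(|c| c.is_ascii_digit()).collect();
    if digits.is_empty() { None } else { digits.parse().ok() }
}

fn main() {
    assert_eq!(coerce_dollars("$10,000,000"), Some(10_000_000));
    assert_eq!(coerce_dollars("such sums"), None);
}
```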

verification.json

Verification is fully re-run against the source XML:

  • Amount checks — Every text_as_written dollar string is searched for in the source text
  • Raw text checks — Every raw_text excerpt is checked as a substring of the source (exact → normalized → spaceless → no match)
  • Completeness — The percentage of dollar strings in the source text matched to extracted provisions is recomputed

If the source XML (BILLS-*.xml) is not present in the bill directory, verification is skipped and the existing verification.json is left unchanged.

metadata.json

The source_xml_sha256 field is added or updated if the source XML is present. This is part of the hash chain that enables staleness detection for downstream artifacts (embeddings).
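The staleness check behind that chain is just a stored-hash comparison. A conceptual sketch follows; the real chain hashes file contents with SHA-256, and std's DefaultHasher stands in here only so the example needs no external crate:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in fingerprint (the real tool uses SHA-256 of the file bytes)
fn fingerprint(contents: &str) -> u64 {
    let mut h = DefaultHasher::new();
    contents.hash(&mut h);
    h.finish()
}

// Embeddings are stale when extraction.json no longer matches the hash
// recorded when the embeddings were generated
fn embeddings_stale(extraction_now: &str, recorded_at_embed_time: u64) -> bool {
    fingerprint(extraction_now) != recorded_at_embed_time
}

fn main() {
    let original = r#"{"provisions": []}"#;
    let recorded = fingerprint(original);
    assert!(!embeddings_stale(original, recorded));
    // After upgrade rewrites extraction.json, the recorded hash no longer matches
    assert!(embeddings_stale(r#"{"provisions": [], "schema_version": 4}"#, recorded));
}
```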

What is NOT changed

  • The provisions themselves — the LLM’s original extraction is preserved. Upgrade doesn’t re-classify provisions, change account names, or modify dollar amounts.
  • tokens.json — Token usage records from the original extraction are untouched.
  • chunks/ — Per-chunk LLM artifacts are not modified.
  • embeddings.json / vectors.bin — Embeddings are not regenerated. If the upgrade changes extraction.json, the embeddings become stale. The tool will warn you about this, and you can run embed to regenerate.

Handling the SuchSums Fix

One specific issue that upgrade addresses: in early versions, SuchSums amount variants (for “such sums as may be necessary” provisions) could serialize incorrectly. The upgrade command detects and fixes this, converting them to the proper tagged enum format. This is transparent — you don’t need to do anything special.

After Upgrading

Check the audit

Run audit to see whether verification metrics improved:

congress-approp audit --dir data

If the upgrade applied new verification logic, you may see changes in the Exact/NormText/TextMiss columns. The NotFound column should remain at 0 (it would only increase if the upgrade somehow corrupted dollar amount strings, which it doesn’t).

Check for stale embeddings

If upgrade modified extraction.json, the hash chain detects that embeddings are stale:

⚠ H.R. 4366: embeddings are stale (extraction.json has changed)

Regenerate embeddings if you use semantic search:

congress-approp embed --dir data

Verify budget authority totals

As a sanity check, confirm that budget authority totals haven’t changed:

congress-approp summary --dir data --format json

Upgrade should never change the dollar amounts in provisions, so budget authority totals should be identical before and after. If they differ, something unexpected happened — file a bug report.

For the included example data, the expected totals are:

| Bill | Budget Authority | Rescissions |
|---|---|---|
| H.R. 4366 | $846,137,099,554 | $24,659,349,709 |
| H.R. 5860 | $16,000,000,000 | $0 |
| H.R. 9468 | $2,882,482,000 | $0 |

Upgrade vs. Re-Extract: Decision Guide

| Situation | Use upgrade | Use extract |
|---|---|---|
| Updated to a new version of congress-approp | ✓ | |
| Want to try a different LLM model | | ✓ |
| Schema version is outdated | ✓ | |
| Low coverage — want more provisions extracted | | ✓ |
| Verification logic improved | ✓ | |
| Source XML was re-downloaded | ✓ | |
| Want to normalize field names and types | ✓ | |
| NotFound > 0 and you suspect extraction errors | | ✓ |

Key principle: upgrade preserves the LLM’s work and improves how it’s stored and verified. extract discards the LLM’s work and starts over.

Troubleshooting

“No extraction.json found”

The upgrade command only processes directories that already contain extraction.json. If you haven’t extracted a bill yet, use extract first.

“No source XML found — skipping verification”

Upgrade re-runs verification against the source XML. If the BILLS-*.xml file isn’t in the bill directory (maybe you moved files around), verification is skipped. The extraction data is still upgraded, but verification.json won’t be updated.

To fix, make sure the source XML is in the same directory as extraction.json:

ls data/118/hr/9468/
# Should show both BILLS-118hr9468enr.xml and extraction.json

Budget authority totals changed after upgrade

This should not happen. If it does:

  1. Compare the pre-upgrade and post-upgrade extraction.json using diff or a JSON diff tool
  2. Look for provisions whose detail_level or semantics changed — these fields affect the budget authority calculation
  3. File a bug report with the before/after data

Quick Reference

# Preview what would change (no files modified)
congress-approp upgrade --dir data --dry-run

# Upgrade all bills under a directory
congress-approp upgrade --dir data

# Upgrade a single bill
congress-approp upgrade --dir data/118/hr/9468

# Upgrade with verbose logging
congress-approp upgrade --dir data -v

# Verify after upgrading
congress-approp audit --dir data

# Regenerate stale embeddings after upgrade
congress-approp embed --dir data

Full Command Reference

congress-approp upgrade [OPTIONS]

Options:
    --dir <DIR>  Data directory to upgrade [default: ./data]
    --dry-run    Show what would change without writing files

Next Steps

Enrich Bills with Metadata

The enrich command generates bill_meta.json for each bill directory, enabling fiscal year filtering, subcommittee scoping, and advance appropriation classification. Unlike extraction (which requires an Anthropic API key) or embedding (which requires an OpenAI API key), enrichment runs entirely offline.

Quick Start

# Enrich all bills in the data directory
congress-approp enrich --dir data

This creates a bill_meta.json file in each bill directory. You only need to run it once per bill — the tool skips bills that already have metadata unless you pass --force.

What It Enables

After enriching, you can use these filtering options on summary, search, and compare:

# See only FY2026 bills
congress-approp summary --dir data --fy 2026

# Search within a specific subcommittee
congress-approp search --dir data --type appropriation --fy 2026 --subcommittee thud

# Combine semantic search with FY and subcommittee filtering
congress-approp search --dir data --semantic "housing assistance" --fy 2026 --subcommittee thud --top 5

# Compare THUD funding across fiscal years
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data

Note: The --fy flag works without enrich — it uses the fiscal year data already in extraction.json. But --subcommittee requires the division-to-jurisdiction mapping that only enrich provides.

Note on embeddings: Semantic search (the --semantic flag) requires embedding vectors. If you cloned the git repository, pre-generated vectors.bin files are included for all example bills. If you installed via cargo install, the embedding files are not included (they exceed the crates.io size limit) — run congress-approp embed --dir data to generate them (~30 seconds per bill, requires OPENAI_API_KEY). The enrich command itself does not require embeddings and does not use any API keys.

What It Generates

The enrich command creates a bill_meta.json file in each bill directory containing five categories of metadata:

Subcommittee Mappings

Each division in an omnibus or minibus bill gets mapped to a canonical jurisdiction. The tool parses division titles directly from the enrolled bill XML and classifies them using pattern matching:

| Division | Title (from XML) | Jurisdiction |
|---|---|---|
| A | Department of Defense Appropriations Act, 2026 | defense |
| B | Departments of Labor, Health and Human Services… | labor-hhs |
| D | Transportation, Housing and Urban Development… | thud |
| G | Other Matters | other |

This solves the problem where Division A means Defense in one bill but CJS in another — the --subcommittee flag uses the canonical jurisdiction, not the letter.

Available subcommittee slugs for --subcommittee:

| Slug | Jurisdiction |
|---|---|
| defense | Department of Defense |
| labor-hhs | Labor, Health and Human Services, Education |
| thud | Transportation, Housing and Urban Development |
| financial-services | Financial Services and General Government |
| cjs | Commerce, Justice, Science |
| energy-water | Energy and Water Development |
| interior | Interior, Environment |
| agriculture | Agriculture, Rural Development |
| legislative-branch | Legislative Branch |
| milcon-va | Military Construction, Veterans Affairs |
| state-foreign-ops | State, Foreign Operations |
| homeland-security | Homeland Security |

Advance Appropriation Classification

Each budget authority provision is classified as:

  • current_year — money available in the fiscal year the bill funds
  • advance — money enacted now but available in a future fiscal year
  • supplemental — additional emergency or supplemental funding
  • unknown — a future fiscal year is referenced but no known pattern was matched

The classification uses a fiscal-year-aware algorithm:

  1. Extract “October 1, YYYY” from the provision’s availability text — this means funds available starting fiscal year YYYY+1
  2. Extract “first quarter of fiscal year YYYY” — this means funds for FY YYYY
  3. Compare the availability year to the bill’s fiscal year
  4. If the availability year is later than the bill’s fiscal year → advance
  5. If the availability year equals the bill’s fiscal year → current_year (start of the funded FY)
  6. Check provision notes for “supplemental” → supplemental
  7. Default to current_year

This correctly handles cases like:

  • H.R. 4366 (FY2024): VA Compensation and Pensions “available October 1, 2024” → advance for FY2025 ($182 billion)
  • H.R. 7148 (FY2026): Medicaid “for the first quarter of fiscal year 2027” → advance for FY2027 ($316 billion)
  • H.R. 7148 (FY2026): Tenant-Based Rental Assistance “available October 1, 2026” → advance for FY2027 ($4 billion)
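The fiscal-year comparison in steps 1 through 5 can be sketched as follows (the regex patterns and function name are illustrative, and the supplemental check in step 6 is omitted):

```python
import re

def classify_timing(availability_text: str, bill_fy: int) -> str:
    """Classify a provision as 'advance' or 'current_year' by comparing the
    fiscal year its funds become available against the bill's fiscal year."""
    avail_fy = None
    m = re.search(r"October 1, (\d{4})", availability_text)
    if m:
        # funds available starting October 1, YYYY belong to FY YYYY+1
        avail_fy = int(m.group(1)) + 1
    m = re.search(r"first quarter of fiscal year (\d{4})", availability_text)
    if m:
        avail_fy = int(m.group(1))
    if avail_fy is not None and avail_fy > bill_fy:
        return "advance"
    return "current_year"
```

Applied to the VA example above: "available October 1, 2024" in an FY2024 bill yields availability FY2025 > 2024, hence advance.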

Across the 13-bill dataset, the algorithm identifies $1.49 trillion in advance appropriations — approximately 24% of total budget authority. Failing to separate advance from current-year can cause year-over-year comparisons to be off by hundreds of billions of dollars.

Bill Nature

The enriched bill classification provides finer distinctions than the original LLM classification:

| Original Classification | Enriched Bill Nature | Reason |
|---|---|---|
| continuing_resolution | full_year_cr_with_appropriations | H.R. 1968 has 260 appropriations + a CR baseline — it’s a hybrid containing $1.786 trillion in full-year appropriations |
| omnibus | minibus | H.R. 5371 covers only 3 subcommittees (Agriculture, Legislative Branch, MilCon-VA) |
| supplemental_appropriations | supplemental | H.R. 815 is normalized to the canonical enum value |

The classification uses provision type distribution and subcommittee count: 5+ real subcommittees = omnibus, 2-4 = minibus, CR baseline + many appropriations without multiple subcommittees = full-year CR with appropriations.

Canonical Account Names

Every account name is normalized for cross-bill matching:

| Original | Canonical |
|---|---|
| Grants-In-Aid for Airports | grants-in-aid for airports |
| Grants-in-Aid for Airports | grants-in-aid for airports |
| Grants-in-aid for Airports | grants-in-aid for airports |
| Department of VA—Compensation and Pensions | compensation and pensions |

Normalization lowercases, strips em-dash and en-dash prefixes, and trims whitespace. This eliminates false orphans in compare caused by capitalization differences and hierarchical naming conventions.
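The rule can be sketched in a few lines (an illustration of the stated behavior, not the tool's exact implementation):

```python
def canonical_account_name(name: str) -> str:
    """Lowercase, drop any em-dash or en-dash hierarchy prefix,
    and trim surrounding whitespace."""
    name = name.lower().strip()
    for dash in ("\u2014", "\u2013"):   # em dash, en dash
        if dash in name:
            # keep only the segment after the last dash
            name = name.rsplit(dash, 1)[-1]
    return name.strip()
```

Ordinary hyphens (as in "Grants-in-Aid") are untouched; only the long-dash hierarchy separators are treated as prefixes.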

Classification Provenance

Every classification in bill_meta.json records how it was determined:

{
  "timing": "advance",
  "available_fy": 2027,
  "source": {
    "type": "fiscal_year_comparison",
    "availability_fy": 2027,
    "bill_fy": 2026
  }
}

This means: “classified as advance because the money becomes available in FY2027 but the bill covers FY2026.” Provenance types include xml_structure, pattern_match, fiscal_year_comparison, note_text, and default_rule.

When to Re-Enrich

The tool automatically detects when bill_meta.json is stale — when extraction.json has changed since enrichment. You will see a warning:

⚠ H.R. 7148: bill metadata is stale (extraction.json has changed). Run `enrich --force`.

Run enrich --force to regenerate metadata for all bills.

Flags

| Flag | Description |
|---|---|
| --dir <DIR> | Data directory [default: ./data] |
| --dry-run | Show what would be generated without writing files |
| --force | Re-enrich even if bill_meta.json already exists |

Previewing Before Writing

Use --dry-run to see what the enrich command would produce without writing any files:

congress-approp enrich --dir data --dry-run
  would enrich H.R. 1968: nature=FullYearCrWithAppropriations, 3 divisions, 192 BA provisions (8 advance, 3 supplemental)
  would enrich H.R. 4366: nature=Omnibus, 7 divisions, 511 BA provisions (11 advance, 4 supplemental)
  would enrich H.R. 7148: nature=Omnibus, 9 divisions, 505 BA provisions (11 advance, 4 supplemental)
  ...

Using with Compare

The compare command benefits most from enrichment. Without enrich, comparing two omnibus bills that cover different subcommittees produces hundreds of false orphans. With enrichment and --subcommittee scoping:

# Before: 759 orphans (mixing Defense with Agriculture)
congress-approp compare --base data/118-hr4366 --current data/118-hr7148

# After: 43 meaningful changes, 12 unchanged
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data

The --base-fy and --current-fy flags automatically select the right bills for each fiscal year and the --subcommittee flag scopes to the correct division in each bill.

Known Limitations

  • Sub-agency mismatches — the LLM sometimes uses sub-agency names (e.g., “Maritime Administration”) in one bill and parent department names (e.g., “Department of Transportation”) in another. The compare command includes a 35-entry sub-agency-to-parent-department lookup table that resolves most of these, but some agency naming inconsistencies (~5-15 orphans per subcommittee) may remain for agencies not in the table.
  • 17 supplemental policy division titles (e.g., “FEND Off Fentanyl Act”, “Protecting Americans from Foreign Adversary Controlled Applications Act”) are classified as other jurisdiction by default. These are from just two bills (H.R. 815 and S. 870) and don’t affect regular appropriations bill analysis.
  • Advance detection patterns cover “October 1, YYYY” and “first quarter of fiscal year YYYY.” If Congress uses novel phrasing in future bills, those provisions would default to current_year. The tool logs a warning when it detects a provision referencing a future fiscal year but not matching any known advance pattern.

Adjust for Inflation

When comparing appropriations across fiscal years, nominal dollar changes can be misleading. A program that received $100M in FY2024 and $104M in FY2026 looks like it got a 4% increase — but if inflation over that period was 3.9%, the real increase is only 0.1%. The program’s purchasing power barely changed.

The --real flag on compare adds inflation-adjusted context to every row, showing you which programs received real increases and which ones lost ground to inflation.

Quick Start

# Compare THUD FY2024 → FY2026 with inflation adjustment
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data --real

The output adds two columns: Real Δ %* (the inflation-adjusted percentage change) and a directional indicator:

  • ▲ — real increase (nominal change exceeded inflation)
  • ▼ — real cut or inflation erosion (purchasing power decreased)
  • — unchanged in both nominal and real terms

The asterisk on “Real Δ %*” reminds you this is a computed value based on an external price index, not a number verified against bill text.

A summary line at the bottom counts how many programs beat inflation and how many fell behind.

What It Shows

Account                   Base ($)       Current ($)   Δ %    Real Δ %*  
TBRA                   28,386,831,000  34,438,557,000  +21.3%  +16.7%  ▲
Project-Based Rental   16,010,000,000  18,543,000,000  +15.8%  +11.4%  ▲
Operations (FAA)       12,729,627,000  13,710,000,000   +7.7%   +3.6%  ▲
Public Housing Fund     8,810,784,000   8,319,393,000   -5.6%   -9.1%  ▼
Capital Inv Grants      2,205,000,000   1,700,000,000  -22.9%  -25.8%  ▼
Payment to NRC            158,000,000     158,000,000    0.0%   -3.9%  ▼

45 beat inflation, 17 fell behind | CPI-U FY2024→FY2026: 3.9% (2 months of FY2026 data)

Key insight: “Payment to NRC” got the exact same dollar amount both years. Nominally that’s “unchanged.” But after adjusting for 3.9% inflation, it’s effectively a 3.9% cut in purchasing power. The flag makes this visible at a glance.

How It Works

The tool ships with a bundled CPI data file containing monthly Consumer Price Index values from the Bureau of Labor Statistics (CPI-U All Items, series CUUR0000SA0). When you pass --real:

  1. The tool identifies the base and current fiscal years from the comparison
  2. It computes fiscal-year-weighted CPI averages (October through September) from the monthly data
  3. The inflation rate is the ratio: current_fy_cpi / base_fy_cpi - 1
  4. For each row, the real percentage change is: (current / (base × (1 + inflation))) - 1
  5. The inflation flag compares the nominal change to the inflation rate
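Steps 3 and 4 are easy to check by hand against the TBRA row in the sample output above; the CPI values here are placeholders scaled to an approximately 3.9% rate:

```python
def real_delta_pct(base: float, current: float,
                   base_cpi: float, current_cpi: float) -> float:
    """Inflation-adjusted percentage change: deflate the current-year
    amount by the CPI ratio before comparing it to the base amount."""
    inflation = current_cpi / base_cpi - 1                  # step 3
    return (current / (base * (1 + inflation)) - 1) * 100   # step 4

# TBRA: +21.3% nominal becomes roughly +16.7% real at ~3.9% inflation
real = real_delta_pct(28_386_831_000, 34_438_557_000, 100.0, 103.9)
```

Note that a nominally unchanged amount always comes out negative under positive inflation, which is exactly the "Payment to NRC" case.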

The bundled CPI data is compiled into the binary — no network access is needed. It’s updated with each tool release.

Using Your Own Price Index

The default deflator is CPI-U (Consumer Price Index for All Urban Consumers), which is the standard measure used in journalism and public policy discussion. However, different analyses may call for different deflators:

  • GDP Deflator — used by CBO for aggregate budget analysis; broader than CPI
  • PCE Price Index — the Federal Reserve’s preferred measure; typically 0.3–0.5% below CPI
  • Sector-specific deflators — DoD procurement indices, medical care CPI, construction cost indices

To use a different deflator, provide your own data file:

congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data \
  --real --cpi-file my_gdp_deflator.json

The file must follow this JSON schema:

{
  "source": "GDP Deflator (BEA NIPA Table 1.1.4)",
  "retrieved": "2026-03-15",
  "note": "Quarterly values interpolated to monthly",
  "monthly": {
    "2023-10": 118.432,
    "2023-11": 118.576,
    "2023-12": 118.701,
    "2024-01": 118.823,
    "...": "..."
  }
}

The tool reads the monthly values and computes fiscal-year averages (Oct–Sep) from them. The source and note fields are displayed in the output footer, so the reader knows exactly which deflator was used.
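The Oct–Sep averaging can be sketched as follows; this is a minimal illustration, and the tool's actual handling of partial years may weight differently:

```python
def fiscal_year_cpi(monthly: dict[str, float], fy: int) -> float:
    """Average the monthly index over federal fiscal year `fy`, which runs
    October of the prior calendar year through September. Months missing
    from the table (a partial year) are simply skipped."""
    keys = [f"{fy - 1}-{m:02d}" for m in (10, 11, 12)]
    keys += [f"{fy}-{m:02d}" for m in range(1, 10)]
    values = [monthly[k] for k in keys if k in monthly]
    return sum(values) / len(values)
```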

If you provide calendar-year annual averages instead of monthly data, you can use:

{
  "source": "My custom deflator",
  "retrieved": "2026-03-15",
  "annual_averages": {
    "2024": 118.9,
    "2025": 121.3,
    "2026": 123.1
  },
  "partial_years": {
    "2026": { "months": 2, "through": "2026-02" }
  }
}

The tool prefers monthly data for precise fiscal year computation, falling back to annual_averages (calendar year proxy) when monthly data is not available.

Understanding the Output

Nominal vs. Real

| Column | What It Means |
|---|---|
| Δ % | The nominal percentage change — what Congress actually voted. Verifiable against bill text. |
| Real Δ %* | The inflation-adjusted percentage change — what the money can buy. Computed from an external price index. |

The nominal number answers: “What did Congress decide?” The real number answers: “Did the program’s purchasing power go up or down?”

Inflation Flags

| Flag | Meaning | Example |
|---|---|---|
| ▲ | Real increase — nominal growth exceeded inflation | +7.7% nominal with 3.9% inflation = real increase |
| ▼ | Real cut — program lost purchasing power | -5.6% nominal = real cut regardless of inflation |
| ▼ | Inflation erosion — nominal increase but below inflation | +2.0% nominal with 3.9% inflation = real cut |
| | Unchanged — zero nominal change, zero real change | Only when both base and current are $0 |

The most important insight is inflation erosion: programs that received a nominal increase but still lost purchasing power. These are politically described as “increases” but economically function as cuts. The --real flag makes this visible.

Every inflation-adjusted output includes a footer showing:

  • The deflator used (CPI-U by default, or whatever --cpi-file specifies)
  • The base and current fiscal year CPI values
  • The inflation rate between them
  • How many months of data are available for partial years
  • A count of programs that beat or fell behind inflation

This metadata ensures the analysis is reproducible and the methodology is transparent.

CSV and JSON Output

CSV

With --real --format csv, the CSV output adds three columns:

account_name,agency,base_dollars,current_dollars,delta,delta_pct,status,real_delta_pct,inflation_flag

The inflation_flag values are: real_increase, real_cut, inflation_erosion, or unchanged. These are designed for filtering in spreadsheets — sort or filter on inflation_flag to find all programs that lost ground.
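The same filtering works outside a spreadsheet; for instance, with Python's csv module (the second row here is invented for the example, the FAA row mirrors the sample output):

```python
import csv
import io

# Illustrative rows in the CSV column layout listed above.
sample = (
    "account_name,agency,base_dollars,current_dollars,delta,delta_pct,"
    "status,real_delta_pct,inflation_flag\n"
    "Operations (FAA),FAA,12729627000,13710000000,980373000,7.7,changed,3.6,real_increase\n"
    "Example Grants,HUD,1000000000,1020000000,20000000,2.0,changed,-1.8,inflation_erosion\n"
)

# every program that got a nominal increase but still lost purchasing power
eroded = [
    row["account_name"]
    for row in csv.DictReader(io.StringIO(sample))
    if row["inflation_flag"] == "inflation_erosion"
]
```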

JSON

With --real --format json, the output includes an inflation metadata object:

{
  "inflation": {
    "source": "Bureau of Labor Statistics, CPI-U All Items (CUUR0000SA0)",
    "base_fy": 2024,
    "current_fy": 2026,
    "base_cpi": 311.6,
    "current_cpi": 325.1,
    "rate": 0.0434,
    "current_fy_months": 4,
    "note": "FY2026 based on 4 months of data (Oct 2025 – Jan 2026)"
  },
  "rows": [
    {
      "account_name": "Tenant-Based Rental Assistance",
      "base_dollars": 28386831000,
      "current_dollars": 34438557000,
      "delta": 6051726000,
      "delta_pct": 21.3,
      "real_delta_pct": 16.7,
      "inflation_flag": "real_increase",
      "status": "changed"
    }
  ],
  "summary": {
    "beat_inflation": 45,
    "fell_behind": 17,
    "inflation_rate_pct": 4.34
  }
}

Important Caveats

CPI-U is a consumer measure

CPI-U measures the cost of goods and services purchased by urban consumers — groceries, rent, gasoline, healthcare. Government spending has a different cost structure: federal employee salaries, military procurement, construction, transfer payments. CPI-U is the standard deflator for public-facing analysis but may not precisely reflect the cost pressures facing a specific government program.

For sector-specific analysis, consider using --cpi-file with a deflator appropriate to the spending category (medical care CPI for VA health, construction cost index for infrastructure, etc.).

Partial-year data

For the most recent fiscal year, CPI data may be incomplete. The output always notes how many months of data are available. The inflation rate may shift as more months are published — typically by 0.1–0.3 percentage points.

This is analysis, not extraction

Nominal dollar amounts in this tool are verified against the enrolled bill text — every number traces to a specific position in the source XML. Inflation-adjusted numbers are computed values that depend on an external data source (BLS) and methodology choices (CPI-U, fiscal year weighting). The asterisk on “Real Δ %*” marks this distinction. When citing inflation-adjusted figures, note the deflator used.

Updating the Bundled CPI Data

The tool includes CPI-U data current as of its release date. To use more recent data:

  1. Download fresh monthly CPI from the BLS Public Data API or FRED
  2. Format as the JSON schema shown above
  3. Pass via --cpi-file

Alternatively, wait for the next tool release — each version bundles the latest available CPI data.

Resolving Agency and Account Name Differences Across Bills

When comparing appropriations across fiscal years, the same program sometimes appears under different agency names. The Army’s research budget might be listed under “Department of Defense—Army” in one bill and “Department of Defense—Department of the Army” in another. These are the same program, but the tool can’t tell without your help.

The dataset.json file at the root of your data directory is where you record these equivalences. Once recorded, every command — compare, relate, link suggest — uses them automatically.

The Problem

Run a Defense comparison and you’ll likely see orphan pairs:

congress-approp compare --base-fy 2024 --current-fy 2026 \
    --subcommittee defense --dir data
only in base    "RDT&E, Army"  agency="Department of Defense—Army"         $17.1B
only in current "RDT&E, Army"  agency="Department of Defense—Dept of Army" $16.7B

Same account name. Same program. Different agency string. The tool treats them as different accounts.

Two Ways to Discover Naming Variants

normalize suggest-text-match — Local analysis

congress-approp normalize suggest-text-match --dir data

Scans your data for orphan pairs (same account name on both sides of a cross-FY comparison, different agency name) and structural patterns (preposition variants like “of” vs “for”, prefix expansion like “Defense—Army” vs “Defense—Department of the Army”).

Runs entirely offline. No API calls. Instant.

Found 94 suggested agency groups (252 orphan pairs resolvable):

  1. [064847a5] [orphan-pair] "Department of Health and Human Services"
     = "National Institutes of Health"
     Evidence: 27 shared accounts (e.g., national cancer institute, ...)

  2. [3dec4083] [orphan-pair] "Centers for Disease Control and Prevention"
     = "Department of Health and Human Services"
     Evidence: 13 shared accounts (e.g., environmental health, ...)

Each suggestion has an 8-character hash for use with normalize accept.

Use --format hashes to output just the hashes (one per line) for scripting:

congress-approp normalize suggest-text-match --dir data --format hashes

Use --min-accounts N to only show pairs sharing N or more account names (higher = stronger evidence):

congress-approp normalize suggest-text-match --dir data --min-accounts 3

normalize suggest-llm — LLM-assisted classification

congress-approp normalize suggest-llm --dir data

Sends unresolved ambiguous accounts to Claude along with the XML heading context from each bill. The LLM sees the full organizational structure surrounding each provision — the [MAJOR] and [SUBHEADING] headings from the enrolled bill XML — and classifies agency pairs as SAME or DIFFERENT.

Requires ANTHROPIC_API_KEY. Uses Claude Opus.

The LLM uses three types of evidence:

  • XML heading hierarchy — which department/agency heading the provision appears under in the bill structure
  • Dollar amounts — similar amounts across years suggest the same program
  • Institutional knowledge — understanding organizational relationships (e.g., Space Force is under Department of the Air Force)

Both suggest commands cache their results. Neither writes to dataset.json directly — use normalize accept to review and persist.

Accepting Suggestions

After running either suggest command, accept specific suggestions by hash:

congress-approp normalize accept 064847a5 3dec4083 --dir data

Or accept all cached suggestions at once:

congress-approp normalize accept --auto --dir data

The accept command reads from the suggestion cache (~/.congress-approp/cache/), matches hashes, and writes the accepted groups to dataset.json. If dataset.json already exists, new groups are merged with existing ones.

What dataset.json Looks Like

Open data/dataset.json in any text editor:

{
  "schema_version": "1.0",
  "entities": {
    "agency_groups": [
      {
        "canonical": "Department of Health and Human Services",
        "members": [
          "National Institutes of Health",
          "Centers for Disease Control and Prevention"
        ]
      }
    ],
    "account_aliases": [
      {
        "canonical": "Office for Civil Rights",
        "aliases": ["Office of Civil Rights"]
      }
    ]
  }
}

Each agency group says: when matching, treat all these agency names as equivalent. The canonical name is what appears in compare output. The members are variants that get mapped to it.

Each account alias maps variant spellings of an account name to a preferred form.

This file contains only user knowledge — decisions that cannot be derived from scanning bill files. There is no cached or derived data.

How Matching Works

When you run compare, relate, or link suggest, the tool matches provisions by (agency, account name). Here’s exactly what happens:

  1. Both agency and account name are lowercased
  2. Account name em-dash prefixes are stripped (“Dept—Account” → “account”)
  3. If dataset.json exists, agency names are mapped through the agency groups
  4. If dataset.json exists, account names are mapped through account aliases
  5. Provisions with the same (mapped agency, normalized account) are matched

No other normalization happens. The tool does not silently rename agencies or merge accounts. If two provisions don’t match, they appear as orphans — and you can decide whether to add a group.
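The five steps reduce to a small key function; a sketch assuming the dataset.json groups and aliases have been flattened into variant-to-canonical dicts (lowercased):

```python
def match_key(agency: str, account: str,
              agency_map: dict[str, str],
              alias_map: dict[str, str]) -> tuple[str, str]:
    """Build the (agency, account) matching key: lowercase both, strip an
    em-dash prefix from the account, then map variants to canonical names."""
    agency = agency.lower()
    account = account.lower()
    if "\u2014" in account:                    # "Dept—Account" -> "account"
        account = account.rsplit("\u2014", 1)[-1]
    agency = agency_map.get(agency, agency)    # dataset.json agency groups
    account = alias_map.get(account, account)  # dataset.json account aliases
    return agency, account
```

Two provisions match exactly when their keys are equal; with empty maps this degrades to the plain --exact behavior.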

When normalization is applied, the compare output marks it:

Account                          Base ($)        Current ($)    Status
RDT&E, Army                      $17,115,037,000 $16,705,760,000 changed (normalized)
Tenant-Based Rental Assistance   $32,386,831,000 $38,438,557,000 changed

The (normalized) marker tells you this match used an agency group from dataset.json. Matches without the marker are exact. In CSV output, normalized is a separate true/false column rather than a status suffix.

Using --exact to Disable Normalization

congress-approp compare --exact --base-fy 2024 --current-fy 2026 --dir data

Ignores dataset.json entirely. Every match is exact lowercased strings only. Use this to see the raw matching results without any entity resolution applied.

When dataset.json Doesn’t Exist

The tool uses exact matching only. No implicit normalization. This is the default behavior — explicit and predictable. To create a dataset.json:

congress-approp normalize suggest-text-match --dir data
congress-approp normalize accept --auto --dir data

Viewing Current Rules

congress-approp normalize list --dir data

Displays all agency groups and account aliases currently in dataset.json.

Editing by Hand

You can edit dataset.json directly in any text editor. The format is simple JSON with two sections:

  • agency_groups — each group has a canonical name and a list of members that should be treated as equivalent
  • account_aliases — each alias has a canonical name and a list of alternative spellings

Typical Workflow

  1. Run compare, notice orphan pairs in the output
  2. Run normalize suggest-text-match to discover obvious naming variants
  3. Review suggestions — check the hashes, evidence, and shared accounts
  4. Accept the ones you trust: normalize accept HASH1 HASH2 --dir data
  5. Re-run compare — orphans are now matched, marked (normalized)
  6. For remaining ambiguous pairs, run normalize suggest-llm for LLM-assisted classification with XML evidence
  7. Accept LLM suggestions the same way: normalize accept HASH --dir data

Tips

  • Start with suggest-text-match. It finds the obvious pairs for free. Run suggest-llm only for the remaining ambiguous cases.
  • Use --min-accounts 3 to focus on the strongest suggestions first — pairs sharing 3+ account names are very likely the same agency.
  • Review every suggestion. Especially from the LLM. Check the reasoning.
  • Verify merges. After accepting groups, re-run compare and check that the merged numbers make sense. If a merged amount looks too high, you may have grouped agencies that should be separate.
  • One file per dataset. The dataset.json file is specific to the data directory it lives in. Different data directories can have different normalization rules.
  • Version control it. If your data directory is in git, commit dataset.json alongside your bill data. It records the decisions you made about entity identity.
  • Use --exact to verify. At any time, run compare --exact to see the raw matching results without normalization. This is your ground truth.

Cache Details

Both suggest commands store their results in ~/.congress-approp/cache/. The cache is:

  • Keyed by data directory — different --dir values get separate caches
  • Auto-invalidated — when any bill’s extraction.json changes (added, removed, or re-extracted), the cache is invalidated and suggest recomputes
  • Read by normalize accept — the accept command reads from cache rather than recomputing, making the suggest → accept workflow fast
  • Deletable — if anything seems wrong, delete ~/.congress-approp/cache/ and re-run suggest

Resolving Treasury Account Symbols

Every federal budget account has a Federal Account Symbol (FAS) — a stable identifier assigned by the Treasury Department that persists through account renames and reorganizations. The resolve-tas command maps each extracted appropriation provision to its FAS code, enabling cross-bill account tracking regardless of how Congress names the account in different years.

Why TAS Resolution Matters

The same budget account can appear under different names across bills:

| Fiscal Year | Bill | Account Name |
|---|---|---|
| FY2020 | H.R. 1158 | United States Secret Service—Operations and Support |
| FY2022 | H.R. 2471 | Operations and Support |
| FY2024 | H.R. 2882 | Operations and Support |

Without TAS resolution, these look like different accounts. With it, all three map to FAS code 070-0400 — the same Treasury account.

The FAST Book Reference

The tool ships with fas_reference.json, derived from the Federal Account Symbols and Titles (FAST) Book published by the Bureau of the Fiscal Service at tfx.treasury.gov. This reference contains:

  • 2,768 active FAS codes across 156 agencies
  • 485 discontinued General Fund accounts from the Changes sheet
  • Official titles, agency names, fund types, and legislation references

The FAS code format is {agency_code}-{main_account}:

  • 070-0400 → agency 070 (DHS), main account 0400 (Secret Service Ops)
  • 021-2020 → agency 021 (Army), main account 2020 (Operation and Maintenance)
  • 075-0350 → agency 075 (HHS), main account 0350 (NIH)

Running TAS Resolution

Preview what will happen (no API calls)

congress-approp resolve-tas --dir data --dry-run

This shows how many provisions need resolution per bill and estimates the LLM cost:

  H.R. 2882             448/491  deterministic, 43 need LLM (~$1.29)
  H.R. 4366             467/498  deterministic, 31 need LLM (~$0.93)

Deterministic only (free, no API key)

congress-approp resolve-tas --dir data --no-llm

Matches provisions against the FAST Book using string comparison. Handles ~56% of provisions — those with unique account names or where the agency code disambiguates among multiple candidates. Zero false positives.

Full resolution (deterministic + LLM)

congress-approp resolve-tas --dir data

Provisions that cannot be matched deterministically are sent to Claude Opus in batches, grouped by agency. The LLM receives the provision’s account name, agency, and dollar amount along with all FAS codes for that agency. Each returned FAS code is verified against the FAST Book — if the code is not in the reference, the match is flagged as inferred rather than high.

Achieves ~99.4% resolution across the full dataset.

Resolve a single bill

congress-approp resolve-tas --dir data --bill 118-hr2882

Re-resolve after changes

congress-approp resolve-tas --dir data --bill 118-hr2882 --force

How the Two-Tier Matching Works

Tier 1: Deterministic (free, instant)

For each top-level budget authority appropriation:

  1. Direct match: Lowercase the account name, look up in the FAS short-title index. If exactly one FAS code has this title, match it.

  2. Short-title match: Extract the first comma-delimited segment of the account name (e.g., “Operation and Maintenance” from “Operation and Maintenance, Army”). Look up in the index. If unique, match.

  3. Suffix match: Strip any em-dash agency prefix (e.g., “United States Secret Service—Operations and Support” → “Operations and Support”). Look up the suffix. If unique, match.

  4. Agency disambiguation: If multiple FAS codes share the same title (151 agencies have “Salaries and Expenses”), use the provision’s agency to narrow the candidates. If exactly one candidate matches the agency, match it.

  5. DOD service branch detection: When the agency is “Department of Defense” but the account name contains “, Army”, “, Navy”, “, Air Force”, etc., the resolver uses the service-specific CGAC code (021, 017, 057) instead of the DOD umbrella code (097).

If none of these strategies produce a single unambiguous match, the provision is left unmatched for the LLM tier.
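
The cascade can be sketched in Python. This is a simplified illustration, not the tool's Rust implementation: `index` is a hypothetical dict from lowercased FAST Book titles to candidate (FAS code, agency) pairs, the Coast Guard code in the toy index is invented, and the DOD service-branch step is omitted.

```python
def resolve_deterministic(account_name: str, agency: str, index: dict):
    """Apply the matching strategies in order; return a FAS code or None."""
    name = account_name.lower()

    # 1. Direct match on the full account name
    candidates = index.get(name, [])
    # 2. Short-title match: first comma-delimited segment
    if len(candidates) != 1:
        alt = index.get(name.split(",")[0].strip(), [])
        candidates = alt or candidates
    # 3. Suffix match: strip an em-dash agency prefix
    if len(candidates) != 1 and "\u2014" in name:
        alt = index.get(name.split("\u2014")[-1].strip(), [])
        candidates = alt or candidates
    if len(candidates) == 1:
        return candidates[0][0]
    # 4. Agency disambiguation among multiple candidates
    narrowed = [c for c in candidates if c[1] == agency]
    if len(narrowed) == 1:
        return narrowed[0][0]
    return None  # unmatched: left for the LLM tier

index = {
    "operations and support": [
        ("070-0400", "United States Secret Service"),
        ("070-0530", "U.S. Coast Guard"),  # illustrative code
    ],
}
print(resolve_deterministic("Operations and Support",
                            "United States Secret Service", index))
# 070-0400
```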

Tier 2: LLM (requires ANTHROPIC_API_KEY)

Unmatched provisions are batched by agency and sent to Claude Opus with:

  • The provision’s account name, agency, and dollar amount
  • All FAS codes for that agency from the FAST Book

The LLM returns a FAS code and reasoning for each provision. Each returned code is verified against the FAST Book. Codes confirmed in the reference are marked high confidence; codes the LLM knows from training but that are not in the reference are marked inferred.

Understanding the Output

The command produces tas_mapping.json per bill:

{
  "schema_version": "1.0",
  "bill_identifier": "H.R. 2882",
  "fas_reference_hash": "a1b2c3...",
  "mappings": [
    {
      "provision_index": 0,
      "account_name": "Operations and Support",
      "agency": "United States Secret Service",
      "dollars": 3007982000,
      "fas_code": "070-0400",
      "fas_title": "Operations and Support, United States Secret Service, Homeland Security",
      "confidence": "verified",
      "method": "direct_match"
    }
  ],
  "summary": {
    "total_provisions": 491,
    "deterministic_matched": 448,
    "llm_matched": 39,
    "unmatched": 4,
    "match_rate_pct": 99.2
  }
}
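
The summary block is internally consistent and easy to check. A sketch using the example values above:

```python
summary = {
    "total_provisions": 491,
    "deterministic_matched": 448,
    "llm_matched": 39,
    "unmatched": 4,
    "match_rate_pct": 99.2,
}

# Matched + unmatched must account for every provision
matched = summary["deterministic_matched"] + summary["llm_matched"]
assert matched + summary["unmatched"] == summary["total_provisions"]

# The reported rate is the matched share, rounded to one decimal
rate = round(100 * matched / summary["total_provisions"], 1)
assert rate == summary["match_rate_pct"]
print(f"{matched}/{summary['total_provisions']} resolved ({rate}%)")
# 487/491 resolved (99.2%)
```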

Confidence levels

| Level | Meaning |
|---|---|
| verified | Deterministic match confirmed against the FAST Book. Mechanically provable. |
| high | LLM matched, and the FAS code exists in the FAST Book. |
| inferred | LLM matched, but the FAS code is not in the FAST Book (known from training data). |
| unmatched | Could not resolve. Typically edge cases: Postal Service, intelligence community, newly created accounts. |

Match methods

| Method | How the match was made |
|---|---|
| direct_match | Account name uniquely matched one FAS short title. |
| suffix_match | After stripping the em-dash agency prefix, the suffix uniquely matched. |
| agency_disambiguated | Multiple FAS codes shared the title, but the agency code narrowed to one. |
| llm_resolved | Claude Opus provided the mapping. |

The 40 Unmatched Provisions

Across the full 32-bill dataset, 40 provisions (0.6%) could not be resolved even with the LLM. These are genuine edge cases:

  • Postal Service accounts — USPS has its own funding structure
  • Intelligence community accounts — classified budget lines
  • FDIC Inspector General — FDIC is self-funded
  • Newly created programs — not yet in the FAST Book

These 40 provisions represent less than 0.05% of total budget authority.

Updating the FAST Book Reference

The FAST Book is updated periodically by the Bureau of the Fiscal Service. To refresh the bundled reference data:

  1. Download the updated Excel from tfx.treasury.gov/reference-books/fast-book
  2. Save as tmp/fast_book_part_ii_iii.xlsx
  3. Run: python scripts/convert_fast_book.py
  4. The updated data/fas_reference.json will be generated
  5. Re-run resolve-tas --force to apply the new reference

Cost Summary

| Scenario | Cost | What you get |
|---|---|---|
| --no-llm (free) | $0 | ~56% of provisions resolved deterministically |
| Full resolution (one bill) | $1–4 | ~99% resolution for that bill |
| Full resolution (32 bills) | ~$85 | 99.4% resolution across the dataset |

This is a one-time cost per bill. Once tas_mapping.json is produced, the FAS codes are permanent — they do not change unless the bill is re-extracted.

Verifying Extraction Data

The verify-text command checks that every provision’s raw_text field is a verbatim substring of the enrolled bill source text, and optionally repairs any discrepancies. After verification, every provision carries a source_span with exact byte positions linking it back to the enrolled bill.

Quick Start

# Analyze without modifying anything
congress-approp verify-text --dir data

# Repair mismatches and add source spans
congress-approp verify-text --dir data --repair

# Verify a single bill
congress-approp verify-text --dir data --bill 118-hr2882 --repair

What It Checks

During LLM extraction, the model is instructed to copy the first ~150 characters of each provision’s source text verbatim into the raw_text field. In practice, the model occasionally makes small substitutions:

  • Word substitutions: “clause” instead of “subsection”, “on” instead of “in”
  • Quote character differences: straight quotes ('') instead of Unicode curly quotes (‘’)
  • Whitespace normalization: newlines collapsed into spaces

The verify-text command detects these mismatches by searching for each provision’s raw_text in the bill’s source text file (BILLS-*.txt).

The 3-Tier Repair Algorithm

When --repair is specified, mismatched provisions are repaired using a deterministic algorithm that requires no LLM calls:

Tier 1: Prefix Match

Find the longest prefix of raw_text (15–80 characters) that appears in the source text. When found, copy the actual source bytes from that position.

This handles single-word substitutions that occur after a long correct prefix. For example, if the first 80 characters match but then the model wrote “clause” where the source says “subsection”, the prefix matcher finds the correct position and copies the real text.
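
A minimal Python sketch of the prefix strategy, assuming the 15–80 character window described above (the real implementation is in Rust; here the copy length simply mirrors the mismatched text):

```python
def repair_by_prefix(raw_text: str, source: str,
                     min_len: int = 15, max_len: int = 80):
    """Find the longest prefix of raw_text (within the window) that
    occurs in source, then copy the actual source text from there."""
    for n in range(min(max_len, len(raw_text)), min_len - 1, -1):
        pos = source.find(raw_text[:n])
        if pos != -1:
            # Copy as many source characters as the bad text had
            return source[pos:pos + len(raw_text)]
    return None  # prefix too generic; fall through to Tier 2

source = "Provided, That amounts under this subsection shall remain available"
bad = "Provided, That amounts under this clause shall"
print(repair_by_prefix(bad, source))
# Provided, That amounts under this subsection s
```

The copy ends mid-word because this sketch copies exactly `len(raw_text)` characters; the real repair copies the provision's full excerpt from the source.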

Tier 2: Substring Match

If the prefix is too short (e.g., the provision starts with “(a) ” which appears thousands of times), search for the longest internal substring (starting from various offsets within raw_text). Walk backward from the match position to recover the provision’s start.

This handles cases where the first few characters are generic but a distinctive phrase later in the text is unique in the source.

Tier 3: Normalized Position Mapping

Build a character-level map between a normalized version of the source (whitespace and quote characters collapsed) and the original source. Search in normalized space, then map the hit position back to original byte offsets.

This handles curly-quote vs. straight-quote differences and newline-vs-space mismatches that the first two tiers cannot resolve.
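
The map-back idea can be sketched in Python. This simplified normalizer collapses whitespace runs and straightens curly quotes, and works in character space for clarity; the actual tool records UTF-8 byte offsets:

```python
def build_normalized(source: str):
    """Return (normalized, idx_map): idx_map[i] gives the position in
    `source` of normalized character i."""
    out, idx_map = [], []
    prev_space = False
    for i, ch in enumerate(source):
        if ch.isspace():
            if prev_space:
                continue              # collapse whitespace runs
            ch, prev_space = " ", True
        else:
            prev_space = False
            if ch in "\u2018\u2019":
                ch = "'"              # curly -> straight single quote
            elif ch in "\u201c\u201d":
                ch = '"'              # curly -> straight double quote
        out.append(ch)
        idx_map.append(i)
    return "".join(out), idx_map

def find_span(raw_text: str, source: str):
    """Search in normalized space, map the hit back to source offsets."""
    norm_src, idx_map = build_normalized(source)
    norm_needle, _ = build_normalized(raw_text)
    pos = norm_src.find(norm_needle)
    if pos == -1:
        return None
    return idx_map[pos], idx_map[pos + len(norm_needle) - 1] + 1

source = "the term \u201ccovered\n  program\u201d means"
start, end = find_span('the term "covered program"', source)
print((start, end))
# (0, 28)
```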

Properties

  • All three tiers are deterministic: same input produces same output.
  • Every repair is guaranteed to be a verbatim substring of the source, because the algorithm copies directly from the source text.
  • No LLM calls are made. The entire process runs in under 10 seconds for 34,568 provisions.

The Source Span Invariant

After verify-text --repair, every provision has a source_span field:

{
  "source_span": {
    "start": 45892,
    "end": 46042,
    "file": "BILLS-118hr2882enr.txt",
    "verified": true,
    "match_tier": "exact"
  }
}

The invariant:

source_file_bytes[start .. end] == provision.raw_text

where start and end are UTF-8 byte offsets into the source file.

Byte Offsets vs. Character Offsets

The start and end values match Rust’s native str indexing, which operates on byte positions. In files containing multi-byte UTF-8 characters (such as curly quotes, which are 3 bytes each), byte offsets differ from character offsets.

To verify the invariant in Python, use byte-level slicing:

import json

extraction = json.load(open("data/118-hr2882/extraction.json"))
source_bytes = open("data/118-hr2882/BILLS-118hr2882enr.txt", "rb").read()

for provision in extraction["provisions"]:
    span = provision.get("source_span")
    if span and span.get("verified"):
        actual = source_bytes[span["start"]:span["end"]].decode("utf-8")
        assert actual == provision["raw_text"], f"Invariant violated at {span}"

Do not use Python’s character-based string slicing (source_str[start:end]) — it will produce incorrect results when the file contains multi-byte characters.

Match Tiers

The match_tier field on each source span records how the span was established:

| Tier | Meaning |
|---|---|
| exact | raw_text was already a verbatim substring of the source. No repair needed. |
| repaired_prefix | Fixed via Tier 1 — longest prefix match + source byte copy. |
| repaired_substring | Fixed via Tier 2 — internal substring match + walk-back. |
| repaired_normalized | Fixed via Tier 3 — normalized position mapping. |

Output

Analysis mode (no --repair)

34568 provisions: 34568 exact, 0 repaired (0 prefix, 0 substring, 0 normalized), 0 unverified
Traceable: 34568/34568 (100.000%)

✅ Every provision is traceable to the enrolled bill source text.

After repair

The command modifies extraction.json to:

  1. Replace any incorrect raw_text with the verbatim source excerpt.
  2. Add source_span to each provision.

A backup is created at extraction.json.pre-repair before any modifications.

JSON output

congress-approp verify-text --dir data --format json
{
  "total": 34568,
  "exact": 34568,
  "repaired_prefix": 0,
  "repaired_substring": 0,
  "repaired_normalized": 0,
  "unverified": 0,
  "spans_added": 0,
  "traceable_pct": 100.0
}

When to Run

Run verify-text --repair once after extraction. The command is idempotent — running it again on already-repaired data produces no changes (all provisions are already exact).

If you re-extract a bill (extract --force), run verify-text --repair again on that bill to update the source spans.

Technical Details

The verify-text command works at the serde_json::Value level rather than through the typed Provision enum. This allows it to write the source_span field on each provision object in the JSON without modifying the Rust type definitions for all 11 provision variants. The field is ignored by the Rust deserializer (Serde skips unknown fields) but is available to any consumer reading the JSON directly.

Running the Complete Pipeline

This guide walks through every step to process appropriations bills from raw XML to a queryable account registry. Each step adds data without modifying previous outputs. You can stop at any step and still get value from the data produced so far.

Prerequisites

cargo install --path .    # Build the tool (Rust 1.93+)

API keys (only needed for specific steps):

| Key | Environment Variable | Required For |
|---|---|---|
| Congress.gov | CONGRESS_API_KEY | download (free at api.congress.gov) |
| Anthropic | ANTHROPIC_API_KEY | extract, resolve-tas (LLM tier) |
| OpenAI | OPENAI_API_KEY | embed (text-embedding-3-large) |

No API keys are needed for verify-text, enrich, authority build, or any query command when working with pre-processed data.

The Pipeline

Step 1: download       → BILLS-*.xml
Step 2: extract        → extraction.json, verification.json, metadata.json
Step 3: verify-text    → source_span on every provision (modifies extraction.json)
Step 4: enrich         → bill_meta.json
Step 5: resolve-tas    → tas_mapping.json
Step 6: embed          → embeddings.json, vectors.bin
Step 7: authority build → authorities.json

Step 1: Download bill XML

# Download all enacted bills for a congress
congress-approp download --congress 119 --enacted-only

# Or download a specific bill
congress-approp download --congress 119 --type hr --number 7148

This fetches the enrolled (signed-into-law) XML from Congress.gov into data/{congress}-{type}{number}/. Each bill gets its own directory.

Cost: Free (Congress.gov API is free). Time: ~30 seconds per congress. Needs: CONGRESS_API_KEY

You can skip this step entirely if you already have bill XML files — just place them in the expected directory structure.

Step 2: Extract provisions

congress-approp extract --dir data --parallel 5

Sends bill text to Claude Opus 4.6 for structured extraction. Large bills are split into chunks and processed in parallel. Every provision — appropriations, rescissions, CR anomalies, riders, directives — is captured as typed JSON.

The command skips bills that already have extraction.json. Use --force to re-extract.

Cost: ~$0.10 per chunk. Small bills: $0.10–0.50. Omnibus bills: $5–15. Time: Small bills: 1–2 minutes. Omnibus: 30–60 minutes. Needs: ANTHROPIC_API_KEY

This is the expensive step. Once done, you do not need to re-extract unless the model or prompt improves significantly.

Produces per bill:

| File | Content |
|---|---|
| extraction.json | Structured provisions (the main output) |
| verification.json | Dollar amount and raw text verification |
| metadata.json | Provenance (model, timestamps, chunk completion) |
| conversion.json | LLM JSON parsing report |
| tokens.json | API token usage for cost tracking |
| BILLS-*.txt | Clean text extracted from XML (used for verification) |

Step 3: Verify and repair raw text

congress-approp verify-text --dir data --repair

Deterministically checks that every provision’s raw_text field is a verbatim substring of the enrolled bill source text. Repairs LLM copying errors (word substitutions like “clause” instead of “subsection”, whitespace differences, quote character mismatches) using a 3-tier algorithm:

  1. Prefix match — find the longest matching prefix, copy source bytes
  2. Substring match — find a distinctive internal phrase, walk backward to the provision start
  3. Normalized position mapping — search in whitespace/quote-normalized space, map back to original byte positions

After repair, every provision carries a source_span with exact UTF-8 byte offsets into the source .txt file.

Cost: Free (no API calls). Time: ~10 seconds for all 32 bills. Needs: Nothing.

Without --repair, the command analyzes but does not modify any files. A backup (extraction.json.pre-repair) is created before any modifications.

Invariant: After this step, for every provision p:

source_file_bytes[p.source_span.start .. p.source_span.end] == p.raw_text

This is mechanically verifiable. The start and end values are UTF-8 byte offsets (matching Rust’s native str indexing). Languages that use character-based indexing (Python, JavaScript) must use byte-level slicing:

raw_bytes = open("BILLS-118hr2882enr.txt", "rb").read()
actual = raw_bytes[span["start"]:span["end"]].decode("utf-8")
assert actual == provision["raw_text"]

Step 4: Enrich with metadata

congress-approp enrich --dir data

Generates bill_meta.json per bill with fiscal year metadata, subcommittee/jurisdiction mappings, advance appropriation classification, and enriched bill nature (omnibus, minibus, full-year CR, etc.). Uses XML parsing and deterministic keyword matching — no LLM calls.

Cost: Free. Time: ~30 seconds for all bills. Needs: Nothing.

Enables --fy, --subcommittee, and --show-advance flags on query commands.

Step 5: Resolve Treasury Account Symbols

# Full resolution (deterministic + LLM)
congress-approp resolve-tas --dir data

# Deterministic only (free, no API key, ~56% resolution)
congress-approp resolve-tas --dir data --no-llm

# Preview cost before running
congress-approp resolve-tas --dir data --dry-run

Maps each top-level budget authority provision to a Federal Account Symbol (FAS) — a stable identifier assigned by the Treasury that persists through account renames and reorganizations.

Two tiers:

  • Deterministic (~56%): Matches provision account names against the bundled FAST Book reference (fas_reference.json). Free, instant, zero false positives.
  • LLM (~44%): Sends ambiguous provisions to Claude Opus with the relevant FAS codes for the provision’s agency. Verifies each returned code against the FAST Book.

Cost: Free with --no-llm. $85 for the full 32-bill dataset with LLM tier ($2–4 per omnibus). Time: Instant for --no-llm. ~5 minutes per omnibus with LLM. Needs: ANTHROPIC_API_KEY for LLM tier.

This is a one-time cost per bill. The FAS code assignment does not need to be repeated unless the bill is re-extracted.

Step 6: Generate embeddings

congress-approp embed --dir data

Generates OpenAI embedding vectors (text-embedding-3-large, 3072 dimensions) for every provision. Enables semantic search (--semantic), similar-provision matching (--similar), the relate command, and link suggest.

Cost: ~$14 for 34,568 provisions. Time: ~10–15 minutes for all bills. Needs: OPENAI_API_KEY

Optional. If you only need TAS-based account tracking, keyword search, and fiscal year comparisons, you can skip this step.

Step 7: Build the authority registry

congress-approp authority build --dir data

Aggregates all tas_mapping.json files into a single authorities.json at the data root. Groups provisions by FAS code into account authorities with name variants, provision references, fiscal year coverage, dollar totals, and detected lifecycle events (renames).

Cost: Free. Time: ~1 second. Needs: At least one tas_mapping.json from Step 5.

Querying the Data

After the pipeline completes, all query commands work:

# What bills do I have?
congress-approp summary --dir data

# Filter to one fiscal year
congress-approp summary --dir data --fy 2026

# Track an account across fiscal years
congress-approp trace 070-0400 --dir data
congress-approp trace "coast guard operations" --dir data

# Browse the account registry
congress-approp authority list --dir data --agency 070

# Search by meaning
congress-approp search --dir data --semantic "disaster relief funding" --top 5

# Compare fiscal years with TAS matching
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud \
    --dir data --use-authorities

# Audit data quality
congress-approp audit --dir data

# Verify source traceability
congress-approp verify-text --dir data

Adding a New Bill

When Congress enacts a new bill, add it to the dataset:

congress-approp download --congress 119 --type hr --number 9999
congress-approp extract --dir data/119-hr9999 --parallel 5
congress-approp verify-text --dir data --bill 119-hr9999 --repair
congress-approp enrich --dir data/119-hr9999
congress-approp resolve-tas --dir data --bill 119-hr9999
congress-approp embed --dir data/119-hr9999
congress-approp authority build --dir data --force

The --force on the last command rebuilds authorities.json to include the new bill. All existing data is unchanged.

Rebuilding From Scratch

If you have only the XML files, you can rebuild everything:

congress-approp extract --dir data --parallel 5      # ~$100, ~4 hours
congress-approp verify-text --dir data --repair       # free, ~10 seconds
congress-approp enrich --dir data                     # free, ~30 seconds
congress-approp resolve-tas --dir data                # ~$85, ~1 hour
congress-approp embed --dir data                      # ~$14, ~15 minutes
congress-approp authority build --dir data             # free, ~1 second

Total cost to rebuild from scratch: ~$200. Total time: ~6 hours (mostly waiting for LLM responses). The XML files themselves are permanent government records available from Congress.gov.

Pipeline Dependencies

download (1) ─────────┐
                       ▼
extract (2) ──────► verify-text (3) ──────┐
     │                                     │
     ├──────────► enrich (4) ◄────────────┘
     │                │
     ├──────────► resolve-tas (5) ◄── fas_reference.json
     │                │
     └──────────► embed (6)
                      │
                      ├──► link suggest
                      │
authority build (7) ◄─── resolve-tas outputs from all bills

Steps 4, 5, and 6 are independent of each other — they all read from extraction.json and can run in any order after Step 3. Step 7 requires Step 5 to have run on all bills you want included.

Output File Reference

Per-bill files

| File | Step | Size (typical) | Content |
|---|---|---|---|
| BILLS-*.xml | 1 | 12K–9.4MB | Enrolled bill XML (source of truth) |
| BILLS-*.txt | 2 | 3K–3MB | Clean text from XML |
| extraction.json | 2+3 | 20K–2MB | Provisions + source spans |
| verification.json | 2 | 5K–500K | Verification report |
| metadata.json | 2 | 500B | Provenance |
| bill_meta.json | 4 | 2K–20K | FY, subcommittee, timing |
| tas_mapping.json | 5 | 5K–200K | FAS codes per provision |
| embeddings.json | 6 | 1K–50K | Embedding metadata |
| vectors.bin | 6 | 100K–35MB | Binary float32 vectors |

Cross-bill files (at data root)

| File | Step | Content |
|---|---|---|
| fas_reference.json | bundled | 2,768 FAS codes from the FAST Book |
| authorities.json | 7 | Account registry with timelines and events |
| dataset.json | normalize accept | Entity resolution rules (optional) |
| links/links.json | link accept | Embedding-based cross-bill links (optional) |

The Authority System

The authority system solves a fundamental problem in federal budget analysis: the same budget account can appear under different names, different agencies, and different bill structures across fiscal years. Without a stable identity for each account, tracking spending over time requires manual reconciliation of thousands of name variants.

The Problem

Consider the Secret Service’s main operating account:

| Fiscal Year | Bill | Name Used |
|---|---|---|
| FY2020 | H.R. 1158 | United States Secret Service—Operations and Support |
| FY2021 | H.R. 133 | United States Secret Service—Operations and Support |
| FY2022 | H.R. 2471 | Operations and Support |
| FY2023 | H.R. 2617 | Operations and Support |
| FY2024 | H.R. 2882 | Operations and Support |

These are all the same account. But the LLM extraction faithfully reproduces whatever name the bill text uses, which varies across congresses. A string-based comparison would treat “United States Secret Service—Operations and Support” and “Operations and Support” as different accounts.

The problem is worse for generic names. 151 different agencies have an account called “Salaries and Expenses.” Without knowing which agency a provision belongs to, the name alone is meaningless.

The Solution: Federal Account Symbols

The U.S. Treasury assigns every budget account a Federal Account Symbol (FAS) — a code in the format {agency_code}-{main_account} that persists for the life of the account regardless of what Congress calls it in bill text.

The Secret Service example resolves cleanly:

| FY | Name in Bill | FAS Code |
|---|---|---|
| FY2020 | United States Secret Service—Operations and Support | 070-0400 |
| FY2022 | Operations and Support | 070-0400 |
| FY2024 | Operations and Support | 070-0400 |

Same code, every year. The code 070 identifies the Department of Homeland Security and 0400 identifies the Secret Service Operations account within DHS.

How the Authority System Works

The authority system has three layers:

Layer 1: The FAST Book Reference

The tool ships with fas_reference.json, derived from the Federal Account Symbols and Titles (FAST) Book published by the Bureau of the Fiscal Service. This file contains 2,768 active FAS codes and 485 discontinued General Fund accounts — the complete catalog of federal budget accounts as defined by the Treasury.

Layer 2: TAS Mapping (per bill)

The resolve-tas command maps each top-level budget authority provision to a FAS code. It uses deterministic string matching for unambiguous names (~56%) and Claude Opus for ambiguous cases (~44%), achieving 99.4% resolution across the dataset. Each mapping is verified against the FAST Book reference.

The result is a tas_mapping.json per bill containing entries like:

{
  "provision_index": 15,
  "account_name": "Operations and Support",
  "agency": "United States Secret Service",
  "fas_code": "070-0400",
  "confidence": "high",
  "method": "llm_resolved"
}

Layer 3: The Authority Registry

The authority build command aggregates all per-bill TAS mappings into a single authorities.json file. Each FAS code becomes one authority — a record that collects every provision for that account across all bills and fiscal years.

An authority record contains:

  • FAS code — the stable identifier (e.g., 070-0400)
  • Official title — from the FAST Book
  • Provisions — every instance across all bills, with bill identifier, fiscal year, dollar amount, and the account name the LLM extracted
  • Name variants — all distinct names used for this account, classified by type
  • Events — detected lifecycle changes (renames)

Name Variant Classification

When the same FAS code has different account names across bills, the system classifies each variant:

| Classification | Meaning | Example |
|---|---|---|
| canonical | The primary name (most frequently used) | “Salaries and Expenses” |
| case_variant | Differs only in capitalization | “salaries and expenses” |
| prefix_variant | Differs by em-dash agency prefix | “USSS—Operations and Support” vs “Operations and Support” |
| name_change | A genuine rename with a temporal boundary | “Allowances and Expenses” → “Members’ Representational Allowances” |
| inconsistent_extraction | The LLM used different names without a clear pattern | Different formatting across bill editions |

The first three categories (canonical, case, prefix) account for the vast majority of variants and are harmless — they reflect different formatting conventions in different bills, not actual program changes.
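
The harmless classes can be detected with plain string checks. A sketch (the function name is hypothetical; telling name_change apart from inconsistent_extraction needs the fiscal-year timeline, which this sketch does not model):

```python
def classify_variant(canonical: str, variant: str) -> str:
    """Classify a name variant relative to the canonical name."""
    if variant == canonical:
        return "canonical"
    if variant.lower() == canonical.lower():
        return "case_variant"
    # Em-dash agency prefix: "Agency\u2014Name" vs bare "Name", either direction
    if variant.split("\u2014")[-1].strip() == canonical \
            or canonical.split("\u2014")[-1].strip() == variant:
        return "prefix_variant"
    # Distinguishing a genuine rename from inconsistent extraction
    # requires temporal analysis of which fiscal years used each name
    return "needs_review"

print(classify_variant(
    "Operations and Support",
    "United States Secret Service\u2014Operations and Support"))
# prefix_variant
```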

Authority Events

When the system detects a clear temporal boundary — one name used exclusively before a fiscal year, another used exclusively after — it records a rename event:

TAS 000-0438: Contingent Expenses, House of Representatives
  ⟹  FY2025: renamed from "Allowances and Expenses"
                         to "Members' Representational Allowances"

Across the 32-bill dataset spanning FY2019–FY2026, the system detects 40 rename events. These are cases where Congress formally changed an account’s title in the enacted bill text.

Events currently cover renames only. Future versions may detect agency moves (e.g., Secret Service moving from Treasury to DHS in 2003), account splits, and account merges.

Using the Authority System

Track an account across fiscal years

# By FAS code
congress-approp trace 070-0400 --dir data

# By name (searches across title, agency, and all name variants)
congress-approp trace "coast guard operations" --dir data

The timeline output shows budget authority per fiscal year, which bills contributed, and the account names used. Continuing resolution and supplemental bills are labeled.

Browse the registry

# All authorities
congress-approp authority list --dir data

# Filter to one agency
congress-approp authority list --dir data --agency 070

# JSON output for programmatic use
congress-approp authority list --dir data --format json

Use in comparisons

congress-approp compare --base-fy 2024 --current-fy 2026 \
    --subcommittee thud --dir data --use-authorities

The --use-authorities flag matches accounts by FAS code instead of by name, resolving orphan pairs where the same account has different names or agency attributions across fiscal years.

What the FAS Code Represents

The FAS code is a two-part identifier:

070-0400
 │    │
 │    └── Main account code (4 digits) — the specific account
 └─────── CGAC agency code (3 digits) — the department or agency

Key properties:

  • Stable through renames. When “Salaries and Expenses” became “Operations and Support” for DHS accounts around FY2017, the FAS code did not change.

  • Changes on reorganization. When the Secret Service moved from Treasury (agency 020) to DHS (agency 070) in 2003, it received new FAS codes under the 070 prefix. For tracking across reorganizations, the authority system would need historical cross-references (not yet implemented).

  • Assigned by Treasury. These are not invented identifiers — they are the government’s own account numbering system, published in the FAST Book and used across USASpending.gov, the OMB budget database, and Treasury financial reports.

Scope and Limitations

The authority system covers discretionary appropriations — the spending that Congress votes on annually through the twelve appropriations bills, plus supplementals and continuing resolutions. This is roughly 26% of total federal spending.

It does not cover:

  • Mandatory spending (Social Security, Medicare, Medicaid — ~63% of spending)
  • Net interest on the national debt (~11% of spending)
  • Trust funds, revolving funds, or other non-appropriated accounts

The dollar amounts represent budget authority (what Congress authorizes agencies to obligate), not outlays (what the Treasury actually disburses). Budget authority and outlays can differ significantly, especially for multi-year accounts.

40 provisions (0.6%) across the dataset could not be resolved to a FAS code. These are genuine edge cases: Postal Service accounts, intelligence community programs, FDIC self-funded accounts, and newly created programs not yet in the FAST Book. They represent less than 0.05% of total budget authority.

Data Files

| File | Location | Content |
|---|---|---|
| fas_reference.json | data/ | Bundled FAST Book reference (2,768 FAS codes) |
| tas_mapping.json | Per bill directory | FAS code per top-level appropriation provision |
| authorities.json | data/ | Aggregated account registry with timelines and events |

The authorities.json file is rebuilt from scratch by authority build. It is a derived artifact — delete it and rebuild at any time from the per-bill tas_mapping.json files.

The Extraction Pipeline

A bill flows through six stages on its way from raw XML on Congress.gov to queryable, verified, searchable data on your machine. Each stage produces immutable files. Once a stage completes for a bill, its output is never modified — unless you deliberately re-extract or upgrade.

This chapter explains each stage in detail: what it does, what it produces, and why it’s designed the way it is.

Pipeline Overview

                    ┌──────────┐
  Congress.gov ───▶ │ Download │ ───▶ BILLS-*.xml
                    └──────────┘
                         │
                    ┌──────────┐
                    │  Parse   │ ───▶ clean text + chunk boundaries
                    │  + XML   │
                    └──────────┘
                         │
                    ┌──────────┐
  Anthropic API ◀── │ Extract  │ ───▶ extraction.json + verification.json
                    │  (LLM)   │      metadata.json + tokens.json + chunks/
                    └──────────┘
                         │
                    ┌──────────┐
                    │ Enrich   │ ───▶ bill_meta.json          (offline, no API)
                    │(optional)│
                    └──────────┘
                         │
                    ┌──────────┐
  OpenAI API ◀───── │  Embed   │ ───▶ embeddings.json + vectors.bin
                    └──────────┘
                         │
                    ┌──────────┐
                    │  Query   │ ───▶ search, compare, summary, audit, relate
                    └──────────┘

Only stages 3 (Extract) and 5 (Embed) call external APIs. Everything else — downloading, parsing, enrichment, verification, linking, querying — runs locally and deterministically.

Stage 1: Download

The download command fetches enrolled bill XML from the Congress.gov API.

What “enrolled” means: When a bill passes both the House and Senate in identical form and is sent to the President for signature, that final text is the “enrolled” version. Once signed, it becomes law. This is the authoritative text — the version that actually governs how money is spent.

What the XML looks like: Congressional bill XML uses semantic markup defined by the Government Publishing Office (GPO). Tags like <division>, <title>, <section>, <appropriations-major>, <appropriations-small>, <quote>, and <proviso> describe the legislative structure, not just formatting. This semantic markup is what makes reliable parsing possible — you can identify account name headings, dollar amounts, proviso clauses, and structural boundaries directly from the XML tree.

What gets created:

data/118/hr/9468/
└── BILLS-118hr9468enr.xml     ← Enrolled bill XML from Congress.gov

Requires: CONGRESS_API_KEY (free from api.congress.gov)

No transformation is applied. The XML is saved exactly as received from Congress.gov.

Stage 2: Parse

Parsing happens at the beginning of the extract command — it’s not a separate CLI step. The xml.rs module reads the bill XML using roxmltree (a pure-Rust XML parser with no C dependencies) and produces two things:

Clean text extraction

The parser walks the XML tree and extracts human-readable text with two important conventions:

  1. Quote delimiters: Account names in bill XML are wrapped in <quote> tags. The parser renders these as ''Account Name'' (double single-quotes) to match the format the LLM system prompt expects. For example:

    <quote>Compensation and Pensions</quote>
    

    becomes:

    ''Compensation and Pensions''
    
  2. Structural markers: Division headers, title headers, and section numbers are preserved in the clean text so the LLM can identify structural boundaries.

Chunk boundaries

Large bills need to be split into smaller pieces for the LLM — you can’t send a 1,500-page omnibus as a single prompt. The parser identifies semantic chunk boundaries by walking the XML tree structure:

  • Primary splits: At <division> boundaries (Division A, Division B, etc.)
  • Secondary splits: At <title> boundaries within each division
  • Tertiary splits: If a single title or division still exceeds the maximum chunk size (~3,000 tokens), it’s further split at paragraph boundaries

This is semantic chunking, not arbitrary token-limit splitting. Each chunk contains a complete legislative section — a full title or division — so the LLM sees complete context. This matters because provisions often reference “the amount made available under this heading” or “the previous paragraph,” and the LLM needs to see those references in context.
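The tertiary, paragraph-boundary split can be sketched as a simple packer under a character budget. This is an illustrative sketch, not the xml.rs implementation; the function name and the ~4-characters-per-token heuristic are assumptions:

```rust
// Sketch: pack paragraphs into chunks without exceeding an approximate
// token budget (assumed heuristic: ~4 characters per token).
fn split_at_paragraphs(text: &str, max_tokens: usize) -> Vec<String> {
    let max_chars = max_tokens * 4;
    let mut chunks = Vec::new();
    let mut current = String::new();
    for para in text.split("\n\n") {
        // Start a new chunk if appending this paragraph would exceed the budget.
        if !current.is_empty() && current.len() + para.len() + 2 > max_chars {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(para);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Because splits only ever happen at paragraph boundaries, no provision is cut mid-sentence.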

Chunk counts for the example data:

Bill | XML Size | Chunks
H.R. 9468 (supplemental) | 9 KB | 1
H.R. 5860 (CR) | 131 KB | 5
H.R. 4366 (omnibus) | 1.8 MB | 75

No files are written. The clean text and chunk boundaries exist only in memory, passed directly to the extraction stage.

No API calls. Pure Rust computation.

Stage 3: Extract

This is the core stage — the only one that uses an LLM. Each chunk of bill text is sent to Claude with a detailed system prompt (~300 lines) that defines every provision type, shows real JSON examples, constrains the output format, and includes specific instructions for edge cases. The LLM reads the actual legislative language and produces structured JSON — there is no intermediate regex extraction step.

The system prompt

The system prompt (defined in prompts.rs) is the instruction manual for the LLM. It covers:

  • Reading instructions: How to interpret ''Account Name'' delimiters, dollar amounts, “Provided, That” provisos, “notwithstanding” clauses, and section numbering
  • Bill type guidance: How regular appropriations, continuing resolutions, omnibus bills, and supplementals differ
  • Provision type definitions: All 11 types (appropriation, rescission, transfer_authority, limitation, directed_spending, cr_substitution, mandatory_spending_extension, directive, rider, continuing_resolution_baseline, other) with examples
  • Detail level rules: When to classify a provision as top_level, line_item, sub_allocation, or proviso_amount
  • Sub-allocation semantics: Explicit instructions that “of which $X shall be for…” breakdowns are reference_amount, not new_budget_authority
  • CR substitution requirements: Both the new and old amounts must be extracted with dollar values, semantics, and text_as_written
  • Output format: The exact JSON schema the LLM must produce

The prompt is sent with cache_control enabled, so subsequent chunks within the same bill benefit from prompt caching — the system prompt tokens are served from cache rather than re-processed, reducing both latency and cost.

Parallel chunk processing

Chunks are extracted in parallel using bounded concurrency (default 5 simultaneous LLM calls, configurable via --parallel). A progress dashboard shows real-time status:

  5/42, 187 provs [4m 23s] 842 tok/s | 📝A-IIb ~8K 180/s | 🤔B-I ~3K | 📝B-III ~1K 95/s

Each chunk produces a JSON array of provisions. The LLM’s response is captured along with its “thinking” content (internal reasoning) and saved to the chunks/ directory as a permanent provenance record.

Resilient JSON parsing

The LLM doesn’t always produce perfect JSON. Missing fields, wrong types, unexpected enum values, extra fields — all of these can occur. The from_value.rs module handles this with a resilient parsing strategy:

  • Missing fields get defaults (empty string, null, empty array)
  • Wrong types are coerced where possible (string "$10,000,000" → integer 10000000)
  • Unknown provision types become Provision::Other with the LLM’s original classification preserved
  • Extra fields on known types are silently ignored
  • Failed provisions are logged but don’t abort the extraction

Every compromise is counted in a ConversionReport — you can see exactly how many null-to-default conversions, type coercions, and unknown types occurred.
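One of the coercions above can be sketched as follows. This is an illustrative stand-in for what from_value.rs does, with a hypothetical helper name; it accepts a comma-and-dollar-sign formatted string and recovers the integer:

```rust
// Sketch: coerce a formatted dollar string like "$10,000,000" to an integer
// by keeping only the digits. Returns None when no digits are present,
// which the resilient parser would log rather than abort on.
fn coerce_dollars(raw: &str) -> Option<i64> {
    let digits: String = raw.chars().filter(|c| c.is_ascii_digit()).collect();
    if digits.is_empty() {
        None
    } else {
        digits.parse().ok()
    }
}
```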

Merge and compute

After all chunks complete:

  1. Provisions are merged into a single flat array, ordered by chunk sequence
  2. Budget authority totals are computed from the individual provisions — summing new_budget_authority provisions at top_level and line_item detail levels. The LLM also produces a summary with totals, but these are never used for computation — only for diagnostics. This design means a bug in the LLM’s arithmetic can’t corrupt budget totals.
  3. Chunk provenance is recorded — the chunk_map field in extraction.json links each provision back to the chunk it came from
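The total in step 2 can be sketched roughly like this. The types and field names are illustrative, not the tool's actual schema; the point is that only new_budget_authority provisions at the top_level and line_item detail levels contribute:

```rust
// Illustrative types standing in for the real provision schema.
enum Detail { TopLevel, LineItem, SubAllocation, ProvisoAmount }
enum Semantics { NewBudgetAuthority, ReferenceAmount }

struct Provision {
    dollars: i64,
    detail: Detail,
    semantics: Semantics,
}

// Sum only provisions that create new budget authority at the two
// countable detail levels; sub-allocations are reference amounts and
// must not be double-counted.
fn total_budget_authority(provisions: &[Provision]) -> i64 {
    provisions
        .iter()
        .filter(|p| matches!(p.semantics, Semantics::NewBudgetAuthority))
        .filter(|p| matches!(p.detail, Detail::TopLevel | Detail::LineItem))
        .map(|p| p.dollars)
        .sum()
}
```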

Deterministic verification

Verification runs immediately after extraction, with no LLM involvement. It answers three questions:

  1. “Are the dollar amounts real?” — For every provision with a text_as_written dollar string (e.g., "$2,285,513,000"), search for that exact string in the source bill text. Result: verified (found once), ambiguous (found multiple times), or not_found.

  2. “Is the quoted text actually from the bill?” — For every provision’s raw_text excerpt, check if it’s a substring of the source text using tiered matching:

    • Exact: Byte-identical substring (95.6% of provisions in example data)
    • Normalized: Matches after collapsing whitespace and normalizing Unicode quotes/dashes (2.8%)
    • Spaceless: Matches after removing all spaces (0%)
    • No match: Not found at any tier (1.5% — all non-dollar statutory amendments)
  3. “Did we miss anything?” — Count every dollar-sign pattern in the source text and check how many are accounted for by extracted provisions. This produces the coverage percentage.

See How Verification Works for the complete technical details.

What gets created

data/118/hr/9468/
├── BILLS-118hr9468enr.xml     ← Source XML (unchanged)
├── extraction.json            ← All provisions, bill info, summary, chunk map
├── verification.json          ← Amount checks, raw text checks, completeness
├── metadata.json              ← Model name, prompt version, timestamps, source hash
├── tokens.json                ← Input/output/cache token counts per chunk
└── chunks/                    ← Per-chunk LLM artifacts (gitignored)
    ├── 01JRWN9T5RR0JTQ6C9FYYE96A8.json
    └── ...

Requires: ANTHROPIC_API_KEY

Stage 3.5: Enrich (Optional)

The enrich command generates bill-level metadata by parsing the source XML structure and analyzing the already-extracted provisions. It bridges the gap between raw extraction and informed querying — adding structural knowledge that the LLM extraction doesn’t capture.

Why this stage exists: The LLM extracts provisions faithfully — every dollar amount, every account name, every section reference. But it doesn’t know that Division A in H.R. 7148 covers Defense while Division A in H.R. 6938 covers CJS. It doesn’t know that “shall become available on October 1, 2024” in a FY2024 bill means the money is for FY2025 (an advance appropriation). It doesn’t know that “Grants-In-Aid for Airports” and “Grants-in-Aid for Airports” are the same account. The enrich command adds this structural and normalization knowledge.

What it does:

  1. Parses division titles from XML. The enrolled bill XML contains <division><enum>A</enum><header>Department of Defense Appropriations Act, 2026</header>…</division> elements. The enrich command extracts each division’s letter and title, then classifies the title to a jurisdiction using case-insensitive pattern matching against known subcommittee names.

  2. Classifies advance vs current-year. For each budget authority provision, the command checks the availability field and raw_text for “October 1, YYYY” or “first quarter of fiscal year YYYY” patterns. It compares the referenced year to the bill’s fiscal year: if the money becomes available after the bill’s FY ends, it’s advance.

  3. Normalizes account names. Each account name is lowercased and stripped of hierarchical em-dash prefixes (e.g., “Department of VA—Compensation and Pensions” → “compensation and pensions”) for cross-bill matching.

  4. Classifies bill nature. The provision type distribution and subcommittee count determine whether the bill is an omnibus (5+ subcommittees), minibus (2-4), full-year CR with appropriations (CR baseline + hundreds of regular appropriations), or other type.
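The normalization in step 3 might look roughly like this (a sketch with a hypothetical function name, not the enrich command's code):

```rust
// Sketch: normalize an account name for cross-bill matching by keeping
// only the text after the last hierarchical em-dash, then lowercasing.
fn normalize_account(name: &str) -> String {
    let tail = name.rsplit('\u{2014}').next().unwrap_or(name);
    tail.trim().to_lowercase()
}
```

Lowercasing also makes casing variants like “Grants-In-Aid for Airports” and “Grants-in-Aid for Airports” compare equal.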

Input: extraction.json + BILLS-*.xml Output: bill_meta.json Requires: Nothing — no API keys, no network access.

This stage is optional. All commands from v3.x continue to work without it. It is required for --subcommittee filtering, --show-advance display, and enriched bill classification display. See Enrich Bills with Metadata for a complete guide.

Stage 4: Embed

The embed command generates semantic embedding vectors for every provision using OpenAI’s text-embedding-3-large model. This is the foundation for meaning-based search and cross-bill matching.

How provision text is built

Each provision is represented as a concatenation of its meaningful fields:

Account: Child Nutrition Programs | Agency: Department of Agriculture | Text: For necessary expenses of the Food and Nutrition Service...

This construction is deterministic — the same provision always produces the same embedding text, computed by query::build_embedding_text(). The exact fields included depend on the provision type:

  • Appropriations/Rescissions: Account name, agency, program, raw text
  • CR Substitutions: Account name, reference act, reference section, raw text
  • Directives/Riders: Description, raw text
  • Other types: Description or LLM classification, raw text
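For an appropriation-type provision, the construction reduces to deterministic string concatenation. A sketch (the field set is simplified relative to the real build_embedding_text):

```rust
// Sketch: build the embedding input text for an appropriation-type
// provision from its key fields. Deterministic: same inputs, same output.
fn build_embedding_text(account: &str, agency: &str, raw_text: &str) -> String {
    format!("Account: {account} | Agency: {agency} | Text: {raw_text}")
}
```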

Batch processing

Provisions are sent to the OpenAI API in batches (default 100 provisions per call). Each call returns a vector of 3,072 floating-point numbers per provision — the embedding that captures the provision’s meaning in high-dimensional space.

All vectors are L2-normalized (unit length), which means cosine similarity equals the simple dot product — a fast computation.
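A minimal sketch of why normalization enables this shortcut: cosine similarity is the dot product divided by the two vector norms, so once every vector has unit length the division disappears.

```rust
// Scale a vector to unit length (L2 norm = 1).
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

// For unit vectors, this dot product IS the cosine similarity.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```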

Binary storage

Embeddings are stored in a split format for efficiency:

  • embeddings.json (~200 bytes): Human-readable metadata — model name, dimensions, count, and SHA-256 hashes for the hash chain
  • vectors.bin (count × 3,072 × 4 bytes): Raw little-endian float32 array with no header

For the FY2024 omnibus (2,364 provisions), vectors.bin is 29 MB and loads in under 2 milliseconds. The same data as JSON float arrays would be ~57 MB and take ~175ms to parse. Since this is a read-heavy system — load once per CLI invocation, query many times — the binary format keeps startup instant.
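Decoding such a headerless little-endian float32 file is a byte reinterpretation rather than a parse, which is where the speed comes from. A sketch (the function name and the dims parameter are illustrative):

```rust
// Sketch: decode a headerless little-endian f32 array into per-provision
// vectors of `dims` floats each (the real file uses dims = 3072).
fn decode_vectors(bytes: &[u8], dims: usize) -> Vec<Vec<f32>> {
    let floats: Vec<f32> = bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();
    floats.chunks(dims).map(|v| v.to_vec()).collect()
}
```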

What gets created

data/118/hr/9468/
├── ...existing files...
├── embeddings.json            ← Metadata: model, dimensions, count, hashes
└── vectors.bin                ← Raw float32 vectors [count × 3072]

Requires: OPENAI_API_KEY

Stage 5: Query

All query operations — search, summary, compare, audit — run locally against the JSON and binary files on disk. There are no API calls at query time, with one exception: search --semantic makes a single API call to embed your query text (~100ms).

How queries work

  1. Load: loading.rs recursively walks the --dir path, finds every extraction.json, and deserializes it along with sibling files (verification.json, metadata.json) into LoadedBill structs.

  2. Filter: For search queries, each provision is tested against the specified filters (type, agency, account, keyword, division, dollar range). All filters use AND logic.

  3. Rank: For semantic searches, the query text is embedded via OpenAI, and cosine similarity is computed against every matching provision’s pre-stored vector. For --similar, the source provision’s stored vector is used directly (no API call).

  4. Compute: For summary, budget authority and rescissions are computed from provisions. For compare, accounts are matched by (agency, account_name) and deltas are calculated. For audit, verification metrics are aggregated.

  5. Format: The CLI layer (main.rs) renders results as tables, JSON, JSONL, or CSV depending on the --format flag.

Performance

All of this is fast:

Operation | Time | Notes
Load 14 bills (extraction.json) | ~40ms | JSON parsing
Load embeddings (14 bills, binary) | ~8ms | Memory read
Hash all files (14 bills) | ~8ms | SHA-256
Cosine search (8,500 provisions) | <0.5ms | Dot products
Total cold-start query | ~50ms | Load + hash + search
Embed query text (OpenAI API) | ~100ms | Network round-trip

At 20 congresses (~60 bills, ~15,000 provisions): cold start ~100ms, search <1ms. The system scales linearly and stays interactive at any realistic data volume.

No API calls at query time unless you use --semantic (one call to embed the query). The --similar command uses only stored vectors — completely offline.

The Write-Once Principle

Every file in the pipeline is write-once. After a bill is extracted and embedded, its files are never modified (unless you deliberately re-extract or upgrade). This design has several advantages:

  • No file locking needed. Multiple processes can read simultaneously without coordination.
  • No database needed. JSON files on disk are the right abstraction for a read-dominated workload with ~15 writes per year (when Congress enacts bills) and thousands of reads.
  • No caching needed. The files ARE the cache. There’s nothing to invalidate.
  • Git-friendly. All files are diffable JSON (except vectors.bin, which is gitattributed as binary).
  • Trivially relocatable. Copy a bill directory anywhere and it works — no registry, no config, no state files outside the directory.

The one exception to strict immutability is the links/links.json file, which stores accepted cross-bill relationships. Links are added via link accept and removed via link remove, so this one file is rewritten in place as the link set changes rather than being write-once.

The Hash Chain

Each downstream artifact records the SHA-256 hash of its input, forming a chain that enables staleness detection:

BILLS-*.xml ──sha256──▶ metadata.json (source_xml_sha256)
                              │
extraction.json ──sha256──▶ embeddings.json (extraction_sha256)
                              │
vectors.bin ──sha256──▶ embeddings.json (vectors_sha256)

If you re-download the XML (producing a new file), metadata.json still references the old hash. If you re-extract (producing a new extraction.json), embeddings.json still references the old extraction hash. The staleness.rs module checks these hashes on commands that use embeddings and prints warnings:

⚠ H.R. 4366: embeddings are stale (extraction.json has changed)

Warnings are advisory — they never block execution. Hashing all files for 14 bills takes ~8ms, so there’s no performance reason to skip checks.
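The chain idea can be sketched with a toy digest standing in for SHA-256 (the real tool hashes file bytes with the sha2 crate; all names here are illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy digest standing in for SHA-256 in this sketch.
fn digest(data: &str) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// A downstream artifact records the digest of its input at generation time.
struct EmbeddingsMeta {
    extraction_digest: u64,
}

// Staleness check: re-hash the current input and compare with the
// recorded digest. A mismatch means the input changed after generation.
fn embeddings_stale(meta: &EmbeddingsMeta, extraction_json: &str) -> bool {
    meta.extraction_digest != digest(extraction_json)
}
```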

See Data Integrity and the Hash Chain for more details.

Dependencies

The pipeline uses a minimal set of Rust crates:

Stage | Key Crate | Role
Download | reqwest | HTTP client for the Congress.gov API
Parse | roxmltree | Pure-Rust XML parsing, zero-copy where possible
Extract | reqwest + tokio | Async HTTP for the Anthropic API with parallel chunk processing
Parse LLM output | serde_json | JSON deserialization with custom resilient parsing
Verify | sha2 | SHA-256 hashing for the hash chain
Embed | reqwest | HTTP client for the OpenAI API
Query | walkdir | Recursive directory traversal to find bill data
Output | comfy-table + csv | Terminal table formatting and CSV export

All API clients use rustls-tls (pure Rust TLS) — no OpenSSL dependency.

What Can Go Wrong

Understanding the pipeline helps you diagnose issues:

Symptom | Likely Stage | Investigation
“No XML files found” | Download | Check that BILLS-*.xml exists in the directory
Low provision count | Extract | Check audit coverage; examine chunk artifacts in chunks/
NotFound > 0 in audit | Extract + Verify | Run audit --verbose; check whether the LLM hallucinated an amount
“Embeddings are stale” | Embed | Run embed to regenerate after re-extraction
Semantic search returns no results | Embed | Check that embeddings.json and vectors.bin exist
Budget authority doesn’t match expectations | Extract | Check detail_level and semantics; see Budget Authority Calculation

Next Steps

How Verification Works

Extraction uses an LLM to understand legislative language and classify provisions. Verification uses deterministic code — with zero LLM involvement — to check every claim the extraction made against the source bill text. This chapter explains the three verification checks in detail: amount verification, raw text matching, and completeness analysis.

The Core Principle

The verification pipeline answers three independent questions:

  1. “Are the extracted dollar amounts real?” — Does the dollar string actually exist in the source bill text?
  2. “Is the quoted text actually from the bill?” — Is the raw text excerpt a verbatim substring of the source?
  3. “Did we miss anything?” — How many dollar amounts in the source text were captured by extracted provisions?

Each question is answered by a different check. All three are deterministic string operations — no language model, no heuristics, no probabilistic matching. The code in verification.rs runs pure string searches against the source text extracted from the bill XML.

Amount Verification

For every provision that carries a dollar amount, the verifier takes the text_as_written field (e.g., "$2,285,513,000") and searches for that exact string in the source bill text.

How it works

  1. The text_index module builds a positional index of every dollar-sign pattern ($X,XXX,XXX) in the source text
  2. For each provision with a text_as_written value, the verifier searches the index for that string
  3. It counts how many times the string appears and records the character positions
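The classification step reduces to counting exact occurrences. A minimal sketch (names are illustrative and the positional index the real code builds is elided; only the counting logic is shown):

```rust
#[derive(Debug, PartialEq)]
enum AmountStatus {
    Verified,  // found exactly once
    Ambiguous, // found more than once
    NotFound,  // not found anywhere
}

// Count exact occurrences of the extracted dollar string in the source
// bill text and classify the result.
fn check_amount(source: &str, text_as_written: &str) -> AmountStatus {
    match source.match_indices(text_as_written).count() {
        0 => AmountStatus::NotFound,
        1 => AmountStatus::Verified,
        _ => AmountStatus::Ambiguous,
    }
}
```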

Three possible outcomes

Result | Meaning | Count in Example Data
Verified (found) | The dollar string was found at exactly one position in the source text. This is the strongest result — the amount exists, and its location is unambiguous. | 797 of 1,522 provisions with amounts
Ambiguous (found_multiple) | The dollar string was found at multiple positions. The amount is correct — it’s definitely in the bill — but the same string appears more than once, so we can’t automatically pin it to a specific location. | 725 of 1,522
Not Found (not_found) | The dollar string was not found anywhere in the source text. This means the LLM may have hallucinated the amount, or the text_as_written field has formatting differences from the source. | 0 of 1,522

Why ambiguous is common and acceptable

Round numbers appear frequently throughout appropriations bills. In the FY2024 omnibus (H.R. 4366):

Dollar String | Occurrences in Source
$5,000,000 | 50
$1,000,000 | 45
$10,000,000 | 38
$15,000,000 | 27
$3,000,000 | 25

When the tool finds $5,000,000 in 50 places, it can confirm the amount is real but can’t determine which of the 50 occurrences corresponds to this specific provision. That’s an “ambiguous” result — correct amount, uncertain location.

The 762 “verified” provisions in H.R. 4366 are the ones with unique dollar amounts — numbers specific enough (like $10,643,713,000 for FBI Salaries and Expenses) that they appear exactly once in the entire bill.

Why not_found is critical

A not_found result means the extracted dollar string does not exist anywhere in the source bill text. This is the strongest signal of a potential extraction error — the LLM may have:

  • Hallucinated a dollar amount
  • Misread or transposed digits
  • Formatted the amount differently than it appears in the source

Across the included example data: not_found = 0 for every bill. All 1,522 provisions with dollar amounts (797 verified + 725 ambiguous) were confirmed to exist in the source text.

Internal consistency check

Beyond searching the source text, verification also checks that the parsed integer in amount.value.dollars is consistent with the text_as_written string. For example, if text_as_written is "$2,285,513,000" and dollars is 2285513000, these are consistent. If dollars were 228551300 (a digit dropped), this would be flagged as a mismatch.

Across all example data: 0 internal consistency mismatches.

Raw Text Matching

Every provision includes a raw_text field — the first ~150 characters of the bill language that the provision was extracted from. The verifier checks whether this text is a verbatim substring of the source bill text. This is more than an amount check — it verifies that the provision’s context (not just its dollar figure) comes from the actual bill.

Four-tier matching

The verifier tries four progressively more lenient matching strategies:

Tier 1: Exact Match

The raw_text is searched as a byte-identical substring of the source text. No normalization, no transformation — the exact bytes must appear in the source.

Example — exact match:

  • Source text: For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended.
  • Extracted raw_text: For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended.
  • Result: ✓ Exact — byte-identical substring

In the example data: 95.6% of provisions across the full dataset match at the exact tier. This is the strongest evidence that the provision was faithfully extracted from the correct location in the bill.

Tier 2: Normalized Match

If exact matching fails, the verifier normalizes both the raw_text and the source text before comparing:

  • Collapse multiple whitespace characters to a single space
  • Convert curly quotes (“ ” ‘ ’) to straight quotes (" ')
  • Convert em-dashes (—) and en-dashes (–) to hyphens (-)
  • Trim leading and trailing whitespace

Why this tier exists: The XML-to-text conversion process can introduce minor formatting differences. The source XML may use Unicode curly quotes while the LLM output uses straight quotes. Whitespace around XML tags may be collapsed differently. These are formatting artifacts, not content errors.

In the example data: 71 provisions (2.8%) match at the normalized tier.

Tier 3: Spaceless Match

If normalized matching also fails, the verifier removes all spaces from both strings and compares. This catches cases where word boundaries differ due to XML tag stripping — for example, (1)not less than vs. (1) not less than.

In the example data: 0 provisions match at the spaceless tier.

Tier 4: No Match

If none of the three tiers find a match, the provision is marked as no_match. The raw text was not found in the source at any level of normalization.
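The tier cascade can be sketched as follows. The normalization mirrors the rules described above, but this is an illustrative reimplementation, not the tool's code:

```rust
#[derive(Debug, PartialEq)]
enum MatchTier {
    Exact,
    Normalized,
    Spaceless,
    NoMatch,
}

// Map curly quotes and dashes to ASCII, then collapse all whitespace runs.
fn normalize(s: &str) -> String {
    let mapped: String = s
        .chars()
        .map(|c| match c {
            '\u{2018}' | '\u{2019}' => '\'',
            '\u{201C}' | '\u{201D}' => '"',
            '\u{2013}' | '\u{2014}' => '-',
            other => other,
        })
        .collect();
    mapped.split_whitespace().collect::<Vec<_>>().join(" ")
}

// Try each tier in order, from strictest to most lenient.
fn match_tier(source: &str, excerpt: &str) -> MatchTier {
    if source.contains(excerpt) {
        return MatchTier::Exact;
    }
    if normalize(source).contains(&normalize(excerpt)) {
        return MatchTier::Normalized;
    }
    let spaceless = |s: &str| s.chars().filter(|c| !c.is_whitespace()).collect::<String>();
    if spaceless(source).contains(&spaceless(excerpt)) {
        return MatchTier::Spaceless;
    }
    MatchTier::NoMatch
}
```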

Common causes of no_match:

  • Truncation: The LLM truncated a very long provision, and the truncated text includes text from adjacent provisions that don’t appear together in the source
  • Paraphrasing: The LLM rephrased the statutory language instead of quoting it verbatim (most common for complex amendments like “Section X is amended by striking Y and inserting Z”)
  • Concatenation: The LLM combined text from multiple subsections into one raw_text field

In the example data: 38 provisions (1.5%) are no_match. Examining them reveals an important pattern: all 38 are non-dollar provisions — riders and mandatory spending extensions that amend existing statutes. The LLM slightly reformatted section references in these provisions. No provision with a dollar amount has a no_match in the example data.

What raw text matching proves (and doesn’t)

What it proves:

  • The provision text was taken from the actual bill, not fabricated
  • At the exact tier: the provision is attributed to a specific, locatable passage in the source
  • Combined with amount verification: the dollar figure and its context both trace to the source

What it doesn’t prove:

  • That the provision is classified correctly (is it really a “rider” vs. a “directive”?)
  • That the dollar amount is attributed to the correct account (the amount exists in the source, but is it under the heading the LLM says it is?)
  • That sub-allocation relationships are correct (is this really a sub-allocation of that parent account?)

The 95.6% exact match rate provides strong but not absolute attribution confidence. For the remaining 4.4%, the dollar amounts are still independently verified — you just can’t be as certain about the exact source location from the raw text alone.

Completeness Analysis

The third verification check measures how much of the bill’s content was captured by the extraction.

How it works

  1. The text_index module scans the entire source text for every dollar-sign pattern (e.g., $51,181,397,000, $500,000, $0)
  2. For each dollar pattern found, it checks whether any extracted provision has a matching text_as_written value
  3. The completeness percentage is: (matched dollar patterns) / (total dollar patterns in source) × 100
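The percentage itself is a simple ratio. A sketch, assuming the dollar strings have already been collected from the source text and from the extracted provisions:

```rust
// Sketch: share of dollar strings found in the source that some
// extracted provision's text_as_written accounts for.
fn coverage_pct(source_amounts: &[&str], extracted: &[&str]) -> f64 {
    if source_amounts.is_empty() {
        return 100.0;
    }
    let matched = source_amounts
        .iter()
        .filter(|s| extracted.contains(*s))
        .count();
    matched as f64 / source_amounts.len() as f64 * 100.0
}
```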

Interpreting coverage

Bill | Coverage | Interpretation
H.R. 9468 | 100.0% | Every dollar amount in the source was captured. Perfect completeness — expected for a small, simple bill.
H.R. 4366 | 94.2% | Most dollar amounts captured. The remaining 5.8% are dollar strings in the source text that no provision accounts for.
H.R. 5860 | 61.1% | Many dollar strings in the source text are not captured. Expected for a CR — see explanation below.

Why coverage below 100% is often correct

Many dollar strings in bill text are not independent provisions and should not be extracted:

Statutory cross-references: “as authorized under section 1241(a) of the Food Security Act” — the referenced section contains dollar amounts, but those are amounts from a different law being cited for context.

Loan guarantee ceilings: “$3,500,000,000 for guaranteed farm ownership loans” — these are loan volume limits, not budget authority. They represent how much the government will guarantee in private lending, not how much it will spend.

Struck amounts: “striking ‘$50,000’ and inserting ‘$75,000’” — when the bill amends another law by changing a dollar figure, the old amount being struck should not be extracted as a new provision.

Prior-year references in CRs: Continuing resolutions reference prior-year appropriations acts extensively. Those referenced acts contain many dollar amounts that appear in the CR’s text but are citations, not new provisions. This is why H.R. 5860 has only 61.1% coverage — most dollar strings in the bill are references to prior-year levels, not new appropriations.

When low coverage IS concerning

Low coverage on a regular appropriations bill (not a CR) may indicate missed provisions. Warning signs:

  • Coverage below 60% on a regular bill or omnibus
  • Known major accounts not appearing in search --type appropriation
  • Coverage dropping significantly after re-extracting with a different model
  • Large sections of the bill with no extracted provisions at all

If these signs appear, consider re-extracting with the default model and higher parallelism.

Putting It All Together

The three checks provide layered confidence:

Check | What It Verifies | Confidence Level
Amount: verified | The dollar amount exists in the source at a unique position | Highest — amount is real and unambiguously located
Amount: ambiguous | The dollar amount exists in the source at multiple positions | High — amount is real, location is uncertain
Amount: not_found | The dollar amount doesn’t exist in the source | Alarm — possible hallucination or formatting error
Raw text: exact | The bill text excerpt is byte-identical to the source | Highest — provision text is faithful and locatable
Raw text: normalized | The text matches after Unicode normalization | High — content is correct, formatting differs slightly
Raw text: no_match | The text isn’t found in the source | Review needed — may be paraphrased or truncated
Coverage: 100% | All dollar strings in source are accounted for | Complete — nothing was missed
Coverage: >80% | Most dollar strings are accounted for | Good — some uncaptured strings are likely legitimate exclusions
Coverage: <60% (non-CR) | Many dollar strings are unaccounted for | Investigate — significant provisions may be missing

For the included example data, the combined picture is strong:

  • 99.995% of dollar amounts verified against source text across the full dataset
  • 95.6% of raw text excerpts are byte-identical to the source
  • 0 internal consistency mismatches between parsed dollars and text_as_written
  • 13/13 CR substitution pairs fully verified (both new and old amounts)

The verification.json File

All verification results are stored in verification.json alongside the extraction. This file contains:

  • amount_checks — One entry per provision with a dollar amount: the text_as_written string, whether it was found, source positions, and status
  • raw_text_checks — One entry per provision: the raw text preview, match tier (exact/normalized/spaceless/no_match), and found position
  • completeness — Total dollar amounts in source, number accounted for, and a list of unaccounted dollar strings with their positions and surrounding context
  • summary — Roll-up metrics: total provisions, amounts verified/not_found/ambiguous, raw text exact/normalized/spaceless/no_match, and completeness percentage

The audit command renders this data as the audit table. The search command uses it to populate the $ column (✓/≈/✗), the amount_status, match_tier, and quality fields in JSON/CSV output.

See verification.json Fields for the complete field reference.

What Verification Cannot Check

Verification has clear boundaries:

  1. Classification correctness. Verification cannot tell you whether a provision classified as “rider” should actually be a “directive.” That’s LLM judgment, not a string-matching question.

  2. Attribution correctness. Verification confirms that a dollar amount exists in the source text and that the raw text excerpt is faithful — but it cannot prove that the dollar amount was attributed to the correct account. If the bill says “$500 million for Program A” on line 100 and “$500 million for Program B” on line 200, and the LLM attributes $500M to Program B but pulls raw text from the Program A paragraph, the amount check says “ambiguous” (found multiple times) but doesn’t catch the misattribution. The 95.6% exact raw text match rate provides strong evidence against this scenario — when the raw text matches exactly, attribution is very likely correct.

  3. Completeness of non-dollar provisions. The completeness check counts dollar strings in the source. Riders, directives, and other provisions without dollar amounts are not part of the coverage metric. There is no automated way to measure whether all non-dollar provisions were captured.

  4. Correctness of sub-allocation relationships. The tool checks that detail_level: sub_allocation provisions have reference_amount semantics (so they don’t double-count), but it doesn’t verify that the parent-child relationship between a sub-allocation and its parent account is correct.

  5. Fiscal year attribution. The tool extracts fiscal_year from context, but verification doesn’t independently confirm that the LLM assigned the right fiscal year to each provision.

For high-stakes analysis, use the audit command to establish baseline trust, then manually spot-check critical provisions using the procedure described in Verify Extraction Accuracy.
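The completeness check mentioned in item 3 reduces to counting dollar strings. The sketch below is illustrative only, not the tool's actual implementation: the regex, the normalization rules, and the `completeness` helper are assumptions made for the example.

```python
import re

# Hypothetical sketch: count dollar-string occurrences in the bill text,
# then measure what fraction of them the extraction captured.
DOLLAR_RE = re.compile(r"\$[0-9][0-9,]*")

def completeness(source_text: str, extracted_amounts: list[str]) -> float:
    """Fraction of dollar strings in the source that appear in the extraction."""
    found = DOLLAR_RE.findall(source_text)
    captured = set(extracted_amounts)
    hits = sum(1 for s in found if s in captured)
    return hits / len(found) if found else 1.0

text = "For Program A, $500,000,000: Provided, That $25,000,000 shall be..."
print(completeness(text, ["$500,000,000"]))  # 1 of 2 dollar strings captured -> 0.5
```

Note what this metric does and does not measure: a dollar string missed by the extraction lowers the score, but riders and directives without amounts never enter the denominator.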

How Semantic Search Works

Semantic search lets you find provisions by meaning rather than keywords. The query “school lunch programs for kids” finds “Child Nutrition Programs” even though the words don’t overlap — because the meaning is similar. This chapter explains the technology behind this capability: what embeddings are, how cosine similarity works, how vectors are stored, and why certain queries work better than others.

The Intuition

Imagine every provision is a point on a map of “meaning.” Programs about similar things are close together on this map. “Child Nutrition Programs” and “school lunch programs for kids” are at nearby points even though they share zero words — because they mean similar things.

Your search query is also placed on this map, and the tool finds the nearest points. That’s semantic search.

The “map” is actually a 3,072-dimensional vector space (far more dimensions than a physical map’s two), and “nearby” is measured by the angle between vectors. But the intuition holds: similar meanings are close together, dissimilar meanings are far apart.

What Actually Happens

At Embed Time (One-Time Setup)

When you run congress-approp embed, each provision’s text is sent to OpenAI’s text-embedding-3-large model. The model returns a vector — a list of 3,072 floating-point numbers — that represents the provision’s meaning in high-dimensional space.

The text sent to the model is built deterministically from the provision’s key fields:

Account: Child Nutrition Programs | Agency: Department of Agriculture | Text: For necessary expenses of the Food and Nutrition Service...

This combined text gives the embedding model enough context to understand what the provision is about. The exact fields included depend on the provision type:

  • Appropriations/Rescissions: Account name, agency, program, raw text
  • CR Substitutions: Account name, reference act, reference section, raw text
  • Directives/Riders: Description, raw text
  • Other types: Description or LLM classification, raw text

The resulting vectors are stored locally:

  • embeddings.json — metadata (model, dimensions, count, hashes)
  • vectors.bin — raw float32 array, count × 3072 × 4 bytes

For the FY2024 omnibus with 2,364 provisions, vectors.bin is 29 MB and loads in under 2 milliseconds.
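The file size is easy to verify from the layout described above: count × dimensions × 4 bytes of float32.

```python
# vectors.bin size for the FY2024 omnibus: count x dimensions x 4 bytes (float32)
count, dims = 2364, 3072
size_bytes = count * dims * 4
print(size_bytes)                    # 29,048,832 bytes, i.e. ~29 MB in decimal units
print(round(size_bytes / 2**20, 1))  # ~27.7 MiB if measured in binary units
```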

At Query Time (--semantic)

When you run search --semantic "school lunch programs for kids":

  1. Your query text is sent to the same OpenAI embedding model (single API call, ~100ms, costs fractions of a cent)
  2. The model returns a 3,072-dimensional query vector
  3. The tool loads the pre-computed provision vectors from vectors.bin
  4. It computes the cosine similarity between the query vector and every provision vector
  5. Results are ranked by similarity descending, filtered by any hard constraints (--type, --division, --min-dollars, etc.), and truncated to --top N
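Steps 3 to 5 can be sketched in a few lines of NumPy. This is a simplified stand-in, not the tool's Rust implementation; the toy vectors and the `rank_semantic` helper are invented for illustration, and the real query vector would come from the OpenAI API.

```python
import numpy as np

def rank_semantic(query_vec: np.ndarray, vectors: np.ndarray, top: int = 5):
    """Dot products (cosine for unit vectors), sort descending, truncate to top N."""
    scores = vectors @ query_vec          # one dot product per provision
    order = np.argsort(-scores)[:top]     # highest similarity first
    return [(int(i), float(scores[i])) for i in order]

# Toy data standing in for real provision vectors (3 provisions, 4 dims, unit norm)
rng = np.random.default_rng(0)
vecs = rng.normal(size=(3, 4)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
print(rank_semantic(vecs[0], vecs, top=2))  # provision 0 ranks itself first, score ~1.0
```

Hard filters such as --type and --min-dollars would be applied to the ranked list before truncation, exactly as step 5 describes.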

At Query Time (--similar)

When you run search --similar 118-hr9468:0:

  1. The tool looks up provision 0’s pre-computed vector from the hr9468 directory’s vectors.bin
  2. It computes cosine similarity against every other provision’s vector across all loaded bills
  3. Results are ranked by similarity descending

No API call is made — the source provision’s vector is already stored locally. This makes --similar instant and free.

Cosine Similarity

Cosine similarity is the mathematical measure of how similar two vectors are. It computes the cosine of the angle between them in high-dimensional space.

The Formula

For two vectors a and b:

cosine_similarity(a, b) = (a · b) / (|a| × |b|)

Where a · b is the dot product (sum of element-wise products) and |a| is the L2 norm (square root of sum of squared elements).

Since OpenAI embedding vectors are L2-normalized (every vector has norm = 1.0), the formula simplifies to just the dot product:

cosine_similarity(a, b) = a · b = Σ(aᵢ × bᵢ)

This is extremely fast to compute — just 3,072 multiplications and additions per pair. Across the roughly 2,500 provisions in the example data, the entire search takes less than 0.1 milliseconds.
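Both forms of the formula can be checked numerically. The snippet below computes the full formula on a small example, then shows that after L2 normalization the plain dot product gives the same answer.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Full formula: (a . b) / (|a| * |b|)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])
print(cosine(a, b))  # 24 / 25 = 0.96

# After L2 normalization, the plain dot product gives the same value
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(float(a_n @ b_n))  # 0.96
```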

Score Ranges

Cosine similarity ranges from -1 to 1 in theory, but for text embeddings the practical range is much narrower. Here’s what scores mean for appropriations provisions:

| Score Range | Interpretation | Real Example |
|---|---|---|
| > 0.80 | Almost certainly the same program in a different bill | VA Supplemental “Comp & Pensions” ↔ Omnibus “Comp & Pensions” = 0.86 |
| 0.60 – 0.80 | Related topic, same policy area | “Comp & Pensions” ↔ “Readjustment Benefits” = 0.70 |
| 0.45 – 0.60 | Conceptually connected but not a direct match | “school lunch programs for kids” ↔ “Child Nutrition Programs” = 0.51 |
| 0.30 – 0.45 | Weak connection; may be coincidental | “cryptocurrency regulation” ↔ NRC “Regulation and Technology” = 0.30 |
| < 0.30 | No meaningful relationship | Random topic ↔ unrelated provision |

These thresholds were calibrated through 30 experiments on the example data. They are specific to appropriations provisions and may not generalize to other domains.

Why Cosine Instead of Euclidean Distance?

Cosine similarity measures the direction vectors point, ignoring their magnitude. Since all embedding vectors are normalized to unit length, magnitude is already removed — but the conceptual advantage remains: provisions about the same topic point in the same direction regardless of how long or detailed their text is.

In experiments on this project’s data, cosine similarity, Euclidean distance, and dot product all produced identical rankings (Spearman ρ = 1.0). This is mathematically expected for L2-normalized vectors — all three metrics are monotone transformations of each other when norms are constant.

What Embeddings Capture (and Don’t)

What works well

Layperson → bureaucratic translation. The embedding model understands that “school lunch programs for kids” and “Child Nutrition Programs” mean the same thing because it was trained on vast amounts of text that connects these concepts. This is particularly useful when the user does not know the official program name.

Cross-bill matching. The same program in different bills — even with different naming conventions — produces similar vectors:

| CR Account Name | Omnibus Account Name | Similarity |
|---|---|---|
| Rural Housing Service—Rural Community Facilities Program Account | Rural Community Facilities Program Account | ~0.78 |
| National Science Foundation—Research and Related Activities | Research and Related Activities | ~0.77 |

The embedding model ignores the hierarchical prefix (“Rural Housing Service—”) and focuses on the semantic content.

Topic discovery. Searching for “clean energy research” finds Energy Efficiency and Renewable Energy, Nuclear Energy, and related accounts even though the specific program names don’t match the query.

Same-account matching across bills. VA Supplemental “Compensation and Pensions” matches Omnibus “Compensation and Pensions” at 0.86 — the highest similarities in the dataset come from the same program appearing in different bills.

What doesn’t work well

Provision type classification. Embeddings don’t strongly encode whether something is a rider vs. an appropriation vs. a limitation. A rider prohibiting funding for X and an appropriation funding X may have similar embeddings because they’re about the same topic. If type matters, combine semantic search with --type.

Vector arithmetic. Analogies like “MilCon Army - Army + Navy = MilCon Navy” don’t work. The embedding space doesn’t support linear arithmetic the way word2vec sometimes does.

Clustering. Attempting DBSCAN or k-means clustering on the provision embeddings collapses almost everything into one cluster. Appropriations provisions are too semantically similar to each other (they’re all about government spending) for global clustering to produce useful groups.

Query stability. Different phrasings of the same question can produce somewhat different top-5 results. In experiments, five different FEMA-related queries shared only 1 of 5 common results in their top-5 lists. This is a known property of embedding models — the ranking is sensitive to exact wording.

The Embedding Model

The tool uses OpenAI’s text-embedding-3-large model with the full 3,072 native output dimensions.

Why this model?

  • Quality: Best-in-class performance on semantic similarity benchmarks at the time of development
  • Dimensionality: 3,072 dimensions provide lossless representation — experiments showed that truncating to 1,024 dimensions lost 1 of 20 top results, and truncating to 256 lost 4 of 20
  • Determinism: Embedding the same text produces nearly identical vectors across calls (max deviation ~1e-6)
  • Normalization: Outputs are L2-normalized, so cosine similarity reduces to a dot product

Why full 3,072 dimensions?

Experiments compared truncated dimensions:

| Dimensions | Top-20 Overlap vs. 3072 | Storage (Omnibus) |
|---|---|---|
| 256 | 16/20 (lossy) | ~2.4 MB |
| 512 | 18/20 (near-lossless) | ~4.8 MB |
| 1024 | 19/20 | ~9.7 MB |
| 3072 | 20/20 (ground truth) | ~29 MB |

Since binary vector files load in under 2ms regardless of size and storage is negligible for this use case, there was no reason to truncate. The full 3,072 dimensions are used.

Consistency requirement

All embeddings in a dataset must use the same model and dimension count. Cosine similarity between vectors from different models or different dimension counts is undefined and will produce garbage results.

If you change models, you must regenerate embeddings for all bills in the dataset. The hash chain in embeddings.json helps detect this — the model and dimensions fields record what was used.

Binary Vector Storage

Embeddings are stored in a split format optimized for the read-heavy access pattern:

embeddings.json (metadata)

{
  "schema_version": "1.0",
  "model": "text-embedding-3-large",
  "dimensions": 3072,
  "count": 2364,
  "extraction_sha256": "ae912e3427b8...",
  "vectors_file": "vectors.bin",
  "vectors_sha256": "7bd7821176bc..."
}

Human-readable, ~200 bytes. Contains everything you need to interpret the binary file: the model, dimensions, and count. Also contains SHA-256 hashes for the hash chain (linking embeddings to the extraction that produced them).

vectors.bin (data)

Raw little-endian float32 array. No header, no delimiters, no structure — just count × dimensions floating-point numbers in sequence.

[provision_0_dim_0] [provision_0_dim_1] ... [provision_0_dim_3071]
[provision_1_dim_0] [provision_1_dim_1] ... [provision_1_dim_3071]
...
[provision_N_dim_0] [provision_N_dim_1] ... [provision_N_dim_3071]

To read provision i’s vector, seek to byte offset i × dimensions × 4 and read dimensions × 4 bytes.

Why binary instead of JSON? Performance. The omnibus bill’s vectors as a JSON array of float arrays would be ~57 MB and take ~175ms to parse. As binary, it’s 29 MB and loads in <2ms. Since the tool loads vectors once per CLI invocation and queries many times, fast loading matters.

Reading vectors in Python

import json
import struct
import numpy as np

# Load metadata
with open("data/118-hr4366/embeddings.json") as f:
    meta = json.load(f)

dims = meta["dimensions"]  # 3072
count = meta["count"]       # 2364

# Option 1: Using struct (standard library)
with open("data/118-hr4366/vectors.bin", "rb") as f:
    raw = f.read()
for i in range(count):
    vec = struct.unpack(f"<{dims}f", raw[i*dims*4 : (i+1)*dims*4])

# Option 2: Using numpy (much faster for large files)
vectors = np.fromfile("data/118-hr4366/vectors.bin", dtype=np.float32).reshape(count, dims)

# Compute cosine similarity (vectors are already normalized)
similarity = vectors[0] @ vectors[1]  # dot product = cosine for unit vectors

Performance Characteristics

| Operation | Time | Notes |
|---|---|---|
| Load vectors from disk (14 bills) | ~8ms | Binary file I/O |
| Cosine similarity (one query vs. 8,500 provisions) | <0.5ms | 8,500 dot products of 3,072 dimensions |
| Embed query text (OpenAI API) | ~100ms | Network round-trip |
| Total --semantic search | ~110ms | Dominated by the API call |
| Total --similar search | ~8ms | No API call needed |

At 20 congresses (~60 bills, ~15,000 provisions), cosine computation would still be under 1ms. The bottleneck is always the network call for --semantic, which is inherently ~100ms regardless of dataset size.

Staleness Detection

The hash chain links embeddings to the extraction they were built from:

extraction.json ──sha256──▶ embeddings.json (extraction_sha256)
vectors.bin ──sha256──▶ embeddings.json (vectors_sha256)

If you re-extract a bill (producing a new extraction.json with different provisions), the stored extraction_sha256 in embeddings.json no longer matches. The tool detects this and warns:

⚠ H.R. 4366: embeddings are stale (extraction.json has changed)

Stale embeddings still work — cosine similarity is still computed correctly — but the provision indices may have shifted, so the vectors may not correspond to the right provisions. Regenerate with congress-approp embed to fix.
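The staleness check is just a hash comparison. The sketch below shows the idea in Python under the file layout described above (embeddings.json next to extraction.json in a bill directory); the `is_stale` helper is illustrative, not the tool's actual code.

```python
import hashlib
import json
import os

def is_stale(bill_dir: str) -> bool:
    """Compare the stored extraction hash with the current extraction.json."""
    with open(os.path.join(bill_dir, "embeddings.json")) as f:
        meta = json.load(f)
    with open(os.path.join(bill_dir, "extraction.json"), "rb") as f:
        current = hashlib.sha256(f.read()).hexdigest()
    return current != meta["extraction_sha256"]
```

If extraction.json is re-generated, its SHA-256 changes and the stored `extraction_sha256` no longer matches, which is exactly the condition the warning above reports.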

Keyword vs. Semantic Search

| Feature | Keyword Search (--keyword) | Semantic Search (--semantic) |
|---|---|---|
| Finds exact word matches | ✓ Always | Not guaranteed — may rank lower |
| Finds conceptual matches | ✗ Never | ✓ Core strength |
| Requires API key | No | Yes (OPENAI_API_KEY) |
| Requires pre-computed data | No | Yes (embeddings) |
| Deterministic | Yes — same query always returns same results | Nearly — scores vary by ~1e-6 across runs |
| Speed | ~1ms (string matching) | ~100ms (API call) |
| Cost per query | Free | ~$0.0001 |
| Best for | Known terms in bill text | Concepts, topics, layperson language |

Recommendation: Use keyword search when you know the exact term. Use semantic search when you don’t know the official terminology, when you want to discover related provisions, or when you want to match across bills with different naming conventions. Use both for the most thorough coverage.

Experimental Results

The embedding approach was validated through 30 experiments on the example data:

Successful use cases

  • Layperson → bureaucratic: “school lunch for kids” → “Child Nutrition Programs” (6/7 correct results)
  • Cross-bill matching: VA Supplemental “Comp & Pensions” → Omnibus “Comp & Pensions” at 0.86
  • News clip → provisions: Pasted news article excerpts found relevant provisions
  • Topic classification: 15 policy topics correctly assigned via embedding nearest-neighbor
  • Orphan detection: Provisions unique to one bill identified by low max-similarity to any other bill

Failed use cases

  • Vector arithmetic/analogy: “MilCon Army - Army + Navy” failed
  • Global clustering: All provisions collapsed to one cluster
  • Provision type classification via embeddings: Riders classified at 11% accuracy
  • Query stability: five FEMA rephrasings shared only one common result across their top-5 lists

Key calibration numbers

  • >0.80 = same account across bills (use for confident cross-bill matching)
  • 0.60–0.80 = related topic, same policy area (use for discovery)
  • 0.45–0.60 = loosely related (use as hints, not answers)
  • <0.45 = unlikely to be meaningfully related (treat as no match)

These thresholds are stable across the dataset but may need recalibration for very different bill types or future congresses.

Tips for Better Results

  1. Be descriptive. “Federal funding for scientific research at universities” works better than “science.” More context gives the embedding model more signal.

  2. Use domain language when you know it. “SNAP benefits supplemental nutrition” will outperform “food stamps for poor people.”

  3. Combine with hard filters. Semantic search provides ranking; --type, --division, --min-dollars provide constraints. Use both.

  4. Try multiple phrasings. Query instability is real. If the topic matters, try 2–3 different phrasings and take the union of results.

  5. Follow up --semantic with --similar. If semantic search finds one good provision, use its index with --similar to find related provisions across other bills without additional API calls.

  6. Trust low scores. If the best match is below 0.40, the topic genuinely isn’t in the dataset. That’s the correct answer, not a failure.

The Provision Type System

Every provision extracted from an appropriations bill is classified into one of 11 types. This classification determines what fields are available, how dollar amounts are interpreted, and how the provision contributes to budget authority calculations. This chapter documents each type in detail with real examples from the included data.

Overview

The Provision enum in the Rust source code uses tagged serialization — each JSON object self-identifies with a provision_type field:

{"provision_type": "appropriation", "account_name": "...", "amount": {...}, ...}
{"provision_type": "rescission", "account_name": "...", "amount": {...}, ...}
{"provision_type": "rider", "description": "...", ...}

This means you can always determine a provision’s type by reading the provision_type field. Different types carry different fields, but all share a set of common fields.
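Because the tag is always present, consumers can dispatch on it without guessing the schema. A minimal Python sketch (the `describe` helper and its field handling are invented for illustration; the real schema carries many more fields per type):

```python
import json

# Tagged serialization: provision_type tells you which fields to expect.
def describe(provision: dict) -> str:
    ptype = provision["provision_type"]
    if ptype in ("appropriation", "rescission", "cr_substitution"):
        return f'{ptype}: {provision["account_name"]}'   # account-bearing types
    return f'{ptype}: {provision.get("description", "(no description)")}'

line = '{"provision_type": "rider", "description": "Funds are additive."}'
print(describe(json.loads(line)))  # rider: Funds are additive.
```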

Common Fields (All Provision Types)

Every provision, regardless of type, has these fields:

| Field | Type | Description |
|---|---|---|
| provision_type | string | The type discriminator (e.g., "appropriation", "rescission") |
| section | string | Section header from the bill (e.g., "SEC. 101"). Empty string if no section applies. |
| division | string or null | Division letter for omnibus bills (e.g., "A"). Null for bills without divisions. |
| title | string or null | Title numeral (e.g., "IV", "XIII"). Null if not determinable. |
| confidence | float | LLM self-assessed confidence, 0.0–1.0. Not calibrated — useful only for identifying outliers below 0.90. Values above 0.90 are not meaningfully differentiated. |
| raw_text | string | Verbatim excerpt from the bill text (~first 150 characters of the provision). Verified against the source text. |
| notes | array of strings | Explanatory annotations — flags unusual patterns, drafting inconsistencies, or contextual information like “advance appropriation” or “no-year funding.” |
| cross_references | array of objects | References to other laws, sections, or bills. Each has ref_type, target, and optional description. |

Distribution in the Example Data

Not every bill contains every type. The distribution reflects the nature of each bill:

| Type | H.R. 4366 (Omnibus) | H.R. 5860 (CR) | H.R. 9468 (Supp) | Total |
|---|---|---|---|---|
| appropriation | 1,216 | 5 | 2 | 1,223 |
| limitation | 456 | 4 | | 460 |
| rider | 285 | 49 | 2 | 336 |
| directive | 120 | 2 | 3 | 125 |
| other | 84 | 12 | | 96 |
| rescission | 78 | | | 78 |
| transfer_authority | 77 | | | 77 |
| mandatory_spending_extension | 40 | 44 | | 84 |
| directed_spending | 8 | | | 8 |
| cr_substitution | | 13 | | 13 |
| continuing_resolution_baseline | | 1 | | 1 |
| Total | | | | 34,568 (across 32 bills) |

Key patterns:

  • The omnibus is dominated by appropriations (51%), limitations (19%), and riders (12%)
  • The CR is dominated by riders (38%) and mandatory spending extensions (34%), with only 13 CR substitutions and 5 standalone appropriations
  • The supplemental has just 2 appropriations and 5 non-spending provisions (riders and directives)

The 11 Provision Types

appropriation

What it is: A grant of budget authority — the core spending provision. This is what most people think of when they think of an appropriations bill: Congress authorizing an agency to spend a specific amount of money.

In bill text: Typically appears as: “For necessary expenses of [account], $X,XXX,XXX,XXX…”

Real example from H.R. 9468:

{
  "provision_type": "appropriation",
  "account_name": "Compensation and Pensions",
  "agency": "Department of Veterans Affairs",
  "amount": {
    "value": { "kind": "specific", "dollars": 2285513000 },
    "semantics": "new_budget_authority",
    "text_as_written": "$2,285,513,000"
  },
  "detail_level": "top_level",
  "availability": "to remain available until expended",
  "fiscal_year": 2024,
  "parent_account": null,
  "provisos": [],
  "earmarks": [],
  "raw_text": "For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended.",
  "confidence": 0.99
}

Type-specific fields:

| Field | Type | Description |
|---|---|---|
| account_name | string | The appropriations account name, extracted from '' delimiters in the bill text |
| agency | string or null | Parent department or agency |
| program | string or null | Sub-account or program name if specified |
| amount | Amount | Dollar amount with semantics (see Amount Fields below) |
| fiscal_year | integer or null | Fiscal year the funds are available for |
| availability | string or null | Fund availability period (e.g., “to remain available until expended”) |
| provisos | array | “Provided, That” conditions attached to the appropriation |
| earmarks | array | Community project funding items |
| detail_level | string | "top_level", "line_item", "sub_allocation", or "proviso_amount" |
| parent_account | string or null | For sub-allocations, the parent account name |

Budget authority impact: Appropriations with semantics: "new_budget_authority" at detail_level: "top_level" or "line_item" are counted in the budget authority total. Sub-allocations and proviso amounts are excluded to prevent double-counting.

Count: 1,223 across example data (49% of all provisions)


rescission

What it is: Cancellation of previously appropriated funds. Congress is taking back money it already gave — reducing net budget authority.

In bill text: Typically contains phrases like “is hereby rescinded” or “is rescinded.”

Real example from H.R. 4366:

{
  "provision_type": "rescission",
  "account_name": "Nonrecurring Expenses Fund",
  "agency": "Department of Health and Human Services",
  "amount": {
    "value": { "kind": "specific", "dollars": 12440000000 },
    "semantics": "rescission",
    "text_as_written": "$12,440,000,000"
  },
  "reference_law": "Fiscal Responsibility Act of 2023",
  "fiscal_years": null
}

Type-specific fields:

| Field | Type | Description |
|---|---|---|
| account_name | string | Account being rescinded from |
| agency | string or null | Department or agency |
| amount | Amount | Dollar amount (semantics will be "rescission") |
| reference_law | string or null | The law whose funds are being rescinded |
| fiscal_years | string or null | Which fiscal years’ funds are affected |

Budget authority impact: Rescissions are summed separately and subtracted to produce Net BA in the summary table. The $12.44B Nonrecurring Expenses Fund rescission in the example above is the largest single rescission in the FY2024 omnibus.

Count: 78 across example data (3.1%)


cr_substitution

What it is: A continuing resolution anomaly that substitutes one dollar amount for another. The bill says “apply by substituting ‘$X’ for ‘$Y’” — meaning fund the program at $X instead of the prior-year level of $Y.

In bill text: “…shall be applied by substituting ‘$25,300,000’ for ‘$75,300,000’…”

Real example from H.R. 5860:

{
  "provision_type": "cr_substitution",
  "account_name": "Rural Housing Service—Rural Community Facilities Program Account",
  "new_amount": {
    "value": { "kind": "specific", "dollars": 25300000 },
    "semantics": "new_budget_authority",
    "text_as_written": "$25,300,000"
  },
  "old_amount": {
    "value": { "kind": "specific", "dollars": 75300000 },
    "semantics": "new_budget_authority",
    "text_as_written": "$75,300,000"
  },
  "reference_act": "Further Consolidated Appropriations Act, 2024",
  "reference_section": "title I",
  "section": "SEC. 101",
  "division": "A"
}

Type-specific fields:

| Field | Type | Description |
|---|---|---|
| account_name | string or null | Account affected (may be null if the bill references a statute section instead) |
| new_amount | Amount | The new dollar amount ($X in “substituting $X for $Y”) |
| old_amount | Amount | The old dollar amount being replaced ($Y) |
| reference_act | string | The act being modified |
| reference_section | string | Section being modified |

Both new_amount and old_amount are independently verified against the source text. In the example data, all 13 CR substitution pairs are fully verified.

Display: When you search for --type cr_substitution, the table automatically shows New, Old, and Delta columns instead of a single Amount column.
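The Delta column is just the difference between the two verified amounts. A trivial sketch using the H.R. 5860 example above (the formatting is an assumption; the CLI's exact rendering may differ):

```python
# Delta for the Rural Community Facilities substitution: new minus old.
new_dollars = 25_300_000
old_dollars = 75_300_000
delta = new_dollars - old_dollars
print(f"{delta:+,}")  # -50,000,000: the CR funds this account $50M below prior year
```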

Count: 13 across example data (all in H.R. 5860)


transfer_authority

What it is: Permission to move funds between accounts. The dollar amount is a ceiling (maximum that may be transferred), not new spending.

In bill text: “…may transfer not to exceed $X from [source] to [destination]…”

Type-specific fields:

| Field | Type | Description |
|---|---|---|
| from_scope | string | Source account(s) or scope |
| to_scope | string | Destination account(s) or scope |
| limit | TransferLimit | Transfer ceiling (percentage, fixed amount, or description) |
| conditions | array of strings | Conditions that must be met |

Budget authority impact: Transfer authority provisions have semantics: "transfer_ceiling". These are not counted in budget authority totals because they don’t represent new spending — they’re permission to reallocate existing funds.

Count: 77 across example data (all in H.R. 4366)


limitation

What it is: A cap or prohibition on spending. “Not more than $X”, “none of the funds”, “shall not exceed $X.”

In bill text: “Provided, That not to exceed $279,000 shall be available for official reception and representation expenses.”

Type-specific fields:

| Field | Type | Description |
|---|---|---|
| description | string | What is being limited |
| amount | Amount or null | Dollar cap, if one is specified |
| account_name | string or null | Account the limitation applies to |
| parent_account | string or null | Parent account for proviso-based limitations |

Budget authority impact: Limitations have semantics: "limitation" and are not counted in budget authority totals. They constrain how appropriated funds may be used, but they don’t provide new spending authority.

Count: 460 across example data (18.4%)


directed_spending

What it is: Earmark or community project funding directed to a specific recipient.

Type-specific fields:

| Field | Type | Description |
|---|---|---|
| account_name | string | Account providing the funds |
| amount | Amount | Dollar amount directed |
| earmark | Earmark or null | Recipient details: recipient, location, requesting_member |
| detail_level | string | Typically "sub_allocation" or "line_item" |
| parent_account | string or null | Parent account name |

Note: Most earmarks in appropriations bills are listed in the joint explanatory statement — a separate document not included in the enrolled bill XML. The provisions extracted here are earmarks that appear in the bill text itself, which is relatively rare. Only 8 appear in the example data.

Count: 8 across example data (all in H.R. 4366)


mandatory_spending_extension

What it is: An amendment to an authorizing statute — common in continuing resolutions and Division B/C of omnibus bills. These provisions extend, modify, or reauthorize mandatory spending programs that would otherwise expire.

In bill text: “Section 330B(b)(2) of the Public Health Service Act is amended by striking ‘2023’ and inserting ‘2024’.”

Type-specific fields:

| Field | Type | Description |
|---|---|---|
| program_name | string | Program being extended |
| statutory_reference | string | The statute being amended |
| amount | Amount or null | Dollar amount if specified |
| period | string or null | Duration of the extension |
| extends_through | string or null | End date or fiscal year |

Budget authority impact: If an amount is present and has semantics: "mandatory_spending", it is tracked separately from discretionary budget authority.

Count: 84 across example data (40 in omnibus, 44 in CR)


directive

What it is: A reporting requirement or instruction to an agency. No direct spending impact.

In bill text: “The Secretary shall submit a report to Congress within 30 days…”

Real example from H.R. 9468:

{
  "provision_type": "directive",
  "description": "Requires the Inspector General of the Department of Veterans Affairs to conduct a review of the circumstances surrounding and underlying causes of the announced VBA funding shortfall for FY2024...",
  "deadlines": ["180 days after enactment"],
  "section": "SEC. 104"
}

Type-specific fields:

| Field | Type | Description |
|---|---|---|
| description | string | What is being directed |
| deadlines | array of strings | Any deadlines mentioned |

Budget authority impact: None — directives don’t carry dollar amounts.

Count: 125 across example data


rider

What it is: A policy provision that doesn’t directly appropriate, rescind, or limit funds. Riders establish rules, extend authorities, or set policy conditions.

In bill text: “Each amount appropriated or made available by this Act is in addition to amounts otherwise appropriated for the fiscal year involved.”

Type-specific fields:

| Field | Type | Description |
|---|---|---|
| description | string | What the rider does |
| policy_area | string or null | Policy domain if identifiable |

Budget authority impact: None — riders don’t carry dollar amounts.

Count: 336 across example data (the second most common type)


continuing_resolution_baseline

What it is: The core CR mechanism — usually SEC. 101 or equivalent — that establishes the default rule: “fund everything at the prior fiscal year’s rate.”

In bill text: “Such amounts as may be necessary…under the authority and conditions provided in the applicable appropriations Act for fiscal year 2023…”

Type-specific fields:

| Field | Type | Description |
|---|---|---|
| reference_year | integer or null | The fiscal year used as the baseline rate |
| reference_laws | array of strings | Laws providing the baseline funding levels |
| rate | string or null | Rate description |
| duration | string or null | How long the CR lasts |
| anomalies | array | Explicit anomalies (usually captured as separate cr_substitution provisions) |

Budget authority impact: The CR baseline itself doesn’t have a specific dollar amount — it says “fund at last year’s rate” without stating what that rate is. The CR substitutions are the exceptions to this baseline.

Count: 1 across example data (in H.R. 5860)


other

What it is: A catch-all for provisions that don’t fit neatly into any of the 10 specific types. The LLM uses this when it can’t confidently classify a provision, or when the provision represents an unusual legislative pattern.

Real examples include: Authority for corporations to make expenditures, emergency designations under budget enforcement rules, recoveries of unobligated balances, and fee collection authorities.

Type-specific fields:

| Field | Type | Description |
|---|---|---|
| llm_classification | string | The LLM’s original description of what this provision is |
| description | string | Summary of the provision |
| amounts | array of Amount | Any dollar amounts mentioned |
| references | array of strings | Any references mentioned |
| metadata | object | Arbitrary key-value pairs for fields that didn’t fit the standard schema |

Important: When the LLM produces a provision_type that doesn’t match any of the 10 known types, the resilient parser in from_value.rs wraps it as Other with the original classification preserved in llm_classification. This means the data is never lost — it’s just put in the catch-all bucket with full transparency about why.

In the example data, all 96 other provisions were deliberately classified as “other” by the LLM itself (not caught by the fallback). They represent genuinely unusual provisions like budget enforcement designations, fee authorities, and fund recovery provisions.

Count: 96 across example data (3.8%)

Amount Fields

Many provision types include an amount field (or new_amount/old_amount for CR substitutions). The amount structure has three components:

AmountValue (value)

The actual dollar figure:

| Kind | Fields | Description |
|---|---|---|
| specific | dollars (integer) | An exact dollar amount. Always whole dollars. Can be negative for rescissions. |
| such_sums | | Open-ended: “such sums as may be necessary.” No dollar figure. |
| none | | No dollar amount — the provision doesn’t carry a dollar value. |

Amount Semantics (semantics)

What the dollar amount represents in budget terms:

| Value | Meaning | Counted in BA? |
|---|---|---|
| new_budget_authority | New spending power granted to an agency | Yes (at top_level/line_item detail) |
| rescission | Cancellation of prior budget authority | Separately, as rescissions |
| reference_amount | A dollar figure for context (sub-allocations, “of which” breakdowns) | No |
| limitation | A cap on spending | No |
| transfer_ceiling | Maximum transfer amount | No |
| mandatory_spending | Mandatory program referenced in the bill | Tracked separately |

Distribution in example data:

| Semantics | Count | Notes |
|---|---|---|
| reference_amount | 649 | Most common — sub-allocations, proviso amounts, contextual references |
| new_budget_authority | 511 | The core spending provisions |
| limitation | 167 | Caps and restrictions |
| rescission | 78 | Cancellations |
| other | 43 | Miscellaneous |
| mandatory_spending | 13 | Mandatory program amounts |
| transfer_ceiling | 2 | Transfer limits |

The fact that reference_amount is the most common semantics value (not new_budget_authority) reflects the hierarchical structure of appropriations: many provisions are breakdowns of a parent account (“of which $X shall be for…”), not independent spending authority.

Text As Written (text_as_written)

The verbatim dollar string from the bill text (e.g., "$2,285,513,000"). This is what the verification pipeline searches for in the source text to confirm the amount is real.
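The core of the verification idea fits in a few lines — a minimal sketch, assuming the enrolled bill text has already been read into a string (the real pipeline also records byte offsets via source_span):

```python
# Minimal sketch of the deterministic verification idea: an extracted
# amount is "verified" if its verbatim string occurs in the source
# bill text. Plain substring matching; no LLM involved.

def verify_amount(text_as_written: str, bill_text: str) -> bool:
    """Return True if the verbatim dollar string occurs in the source text."""
    return text_as_written in bill_text

bill_text = "For Compensation and Pensions, $2,285,513,000, to remain available"
print(verify_amount("$2,285,513,000", bill_text))  # True
print(verify_amount("$2,285,513,001", bill_text))  # False
```

Because the match is exact and deterministic, a failure can only mean the LLM hallucinated or transcribed a number that is not in the bill.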

Detail Levels

The detail_level field on appropriation provisions indicates where the provision sits in the funding hierarchy:

| Level | Meaning | Counted in BA? |
|---|---|---|
| top_level | The main account appropriation (e.g., "$57B for Medical Services") | Yes |
| line_item | A numbered item within a section (e.g., "(1) $3.5B for guaranteed farm ownership loans") | Yes |
| sub_allocation | An "of which" breakdown ("of which $300M shall be for fusion energy research") | No |
| proviso_amount | A dollar amount in a "Provided, That" clause | No |
| "" (empty) | Provisions where detail level doesn't apply (directives, riders) | N/A |

Why this matters: The compute_totals() function uses detail_level to avoid double-counting. If an account appropriates $8.2B and has an “of which $300M for fusion research” sub-allocation, only the $8.2B is counted — the $300M is a breakdown, not additional money. The sub-allocation has semantics: "reference_amount" AND detail_level: "sub_allocation" to make this unambiguous.

Distribution for appropriation-type provisions in H.R. 4366:

| Detail Level | Count |
|---|---|
| top_level | 483 |
| sub_allocation | 396 |
| line_item | 272 |
| proviso_amount | 65 |

Nearly a third of appropriation provisions are sub-allocations — breakdowns that should not be double-counted.

How Types Affect the CLI

The search command adapts its table display based on the provision types in the results:

  • Standard display: Shows Bill, Type, Description/Account, Amount, Section, Div
  • CR substitutions: Automatically shows New, Old, and Delta columns instead of a single Amount
  • Semantic search: Adds a Sim (similarity) column at the left

The summary command uses provision types to compute budget authority (only appropriation type with new_budget_authority semantics) and rescissions (only rescission type).

The compare command only matches appropriation provisions between the base and current bill sets — other types are excluded from the comparison.

Adding Custom Provision Types

If you need to capture a legislative pattern not covered by the existing 11 types, see Adding a New Provision Type for the implementation guide. The key files involved are:

  1. ontology.rs — Add the enum variant
  2. from_value.rs — Add the parsing logic
  3. prompts.rs — Update the LLM system prompt
  4. main.rs — Update display logic

The Other type serves as a bridge — provisions that could be a new type today are captured as Other with full metadata, so historical data doesn’t need to be re-extracted when a new type is added.

Next Steps

Budget Authority Calculation

The budget authority number is the primary fiscal output of this tool. This chapter explains exactly how it’s computed, what’s included, what’s excluded, and why.

The Formula

Budget authority is computed by the compute_totals() function in ontology.rs. The logic is simple and deterministic:

Budget Authority = sum of amount.value.dollars
    WHERE provision_type = "appropriation"
    AND   amount.semantics = "new_budget_authority"
    AND   detail_level NOT IN ("sub_allocation", "proviso_amount")

Rescissions are computed separately:

Rescissions = sum of |amount.value.dollars|
    WHERE provision_type = "rescission"
    AND   amount.semantics = "rescission"

Net Budget Authority = Budget Authority − Rescissions.

This computation uses the actual provisions — never the LLM’s self-reported summary totals. The LLM also produces an ExtractionSummary with its own total_budget_authority field, but this is used only for diagnostics. If the LLM’s arithmetic is wrong, it doesn’t matter — the provision-level sum is authoritative.

What’s Included in Budget Authority

Top-level appropriations

The main account appropriation — the headline dollar figure for each account. For example:

{
  "provision_type": "appropriation",
  "account_name": "Compensation and Pensions",
  "amount": {
    "value": { "kind": "specific", "dollars": 2285513000 },
    "semantics": "new_budget_authority"
  },
  "detail_level": "top_level"
}

This $2.285 billion counts toward budget authority because:

  • provision_type is "appropriation"
  • semantics is "new_budget_authority"
  • detail_level is "top_level" (not excluded)

Line items

Numbered items within a section — for example, when a section lists multiple accounts:

(1) $3,500,000,000 for guaranteed farm ownership loans
(2) $3,100,000,000 for farm ownership direct loans
(3) $2,118,491,000 for unsubsidized guaranteed operating loans

Each is extracted as a separate provision with detail_level: "line_item". Line items count toward budget authority because they represent distinct funding decisions, not breakdowns of a parent amount.

Mandatory spending lines

Programs like SNAP ($122 billion) and VA Compensation and Pensions ($182 billion) appear as appropriation lines in the bill text, even though they’re technically mandatory spending. The tool extracts what the bill says — it doesn’t distinguish mandatory from discretionary. These amounts are included in the budget authority total because they have provision_type: "appropriation" and semantics: "new_budget_authority".

This is why the omnibus total ($846 billion) is much larger than what you might expect for discretionary spending alone. See Why the Numbers Might Not Match Headlines for more on this distinction.

Advance appropriations

Some provisions enact budget authority in the current bill but make it available starting in a future fiscal year. For example, VA Medical Services often includes an advance appropriation for the next fiscal year. These are included in the budget authority total because the bill does enact them — the notes field typically flags them with “advance appropriation” or similar language.

What’s Excluded from Budget Authority

Sub-allocations (detail_level: "sub_allocation")

When a provision says “of which $300,000,000 shall be for fusion energy research,” the $300 million is a breakdown of the parent account’s funding, not money on top of it. Including both the parent and the sub-allocation would double-count.

Sub-allocations are captured as separate provisions with:

  • detail_level: "sub_allocation"
  • semantics: "reference_amount"
  • parent_account pointing to the parent account name

Both the detail level and the semantics independently exclude them from the budget authority sum.

Example: The FBI Salaries and Expenses account has:

| Provision | Amount | Detail Level | Semantics | Counted? |
|---|---|---|---|---|
| FBI S&E (main) | $10,643,713,000 | top_level | new_budget_authority | ✓ Yes |
| "of which" sub-allocation | $216,900,000 | sub_allocation | reference_amount | ✗ No |
| Reception expense limitation | $279,000 | (limitation type) | limitation | ✗ No |

Only the $10.6 billion top-level amount counts. The $216.9 million is a directive about how to spend part of the $10.6 billion, not additional funding.

Proviso amounts (detail_level: "proviso_amount")

Dollar amounts in “Provided, That” clauses are also excluded. These clauses attach conditions to an appropriation — they may specify sub-uses or transfer authorities, but they don’t add new money.

Transfer ceilings (semantics: "transfer_ceiling")

Transfer authority provisions specify the maximum amount that may be moved between accounts. This isn’t new spending — it’s permission to reallocate existing funds. Transfer ceilings have semantics: "transfer_ceiling" and are excluded from budget authority.

Limitations (semantics: "limitation")

Spending caps (“not more than $X”) constrain how appropriated funds may be used but don’t provide new authority. They have semantics: "limitation" and are excluded.

Reference amounts (semantics: "reference_amount")

Dollar figures mentioned for context — statutory cross-references, prior-year comparisons, loan guarantee ceilings — that don’t represent new spending authority. These have semantics: "reference_amount" and are excluded.

Non-appropriation provision types

Only provisions with provision_type: "appropriation" contribute to the budget authority total. Other types are excluded entirely:

  • Rescissions are summed separately (and subtracted for Net BA)
  • CR substitutions set funding levels but are not directly counted as new BA in the summary (CRs fund at prior-year rates plus adjustments — the tool captures the substituted amounts but doesn’t model the baseline)
  • Transfer authority, limitations, directives, riders, mandatory spending extensions, directed spending, continuing resolution baselines, and other provisions are all excluded from the BA calculation

Verifying the Calculation

You can independently verify the budget authority calculation against the example data.

Using the CLI

congress-approp summary --dir data --format json

This produces:

[
  {
    "identifier": "H.R. 4366",
    "budget_authority": 846137099554,
    "rescissions": 24659349709,
    "net_ba": 821477749845
  },
  {
    "identifier": "H.R. 5860",
    "budget_authority": 16000000000,
    "rescissions": 0,
    "net_ba": 16000000000
  },
  {
    "identifier": "H.R. 9468",
    "budget_authority": 2882482000,
    "rescissions": 0,
    "net_ba": 2882482000
  }
]

Using Python directly

You can replicate the calculation by reading extraction.json and applying the same filters:

import json

with open("data/118-hr4366/extraction.json") as f:
    data = json.load(f)

ba = 0
for p in data["provisions"]:
    if p["provision_type"] != "appropriation":
        continue
    amt = p.get("amount")
    if not amt or amt.get("semantics") != "new_budget_authority":
        continue
    val = amt.get("value", {})
    if val.get("kind") != "specific":
        continue
    dl = p.get("detail_level", "")
    if dl in ("sub_allocation", "proviso_amount"):
        continue
    ba += val["dollars"]

print(f"Budget Authority: ${ba:,.0f}")
# Output: Budget Authority: $846,137,099,554

The Python calculation produces exactly the same number as the CLI. If these ever diverge, something is wrong — file a bug report.
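The same approach extends to rescissions and net BA. A sketch, restating the formulas from this chapter as one function (field names as in extraction.json; here applied to synthetic provisions rather than the real file):

```python
def compute_totals(provisions):
    """Replicate the BA / rescission / net-BA formulas from this chapter."""
    ba = 0
    resc = 0
    for p in provisions:
        amt = p.get("amount") or {}
        val = amt.get("value", {})
        if val.get("kind") != "specific":
            continue
        if (p.get("provision_type") == "appropriation"
                and amt.get("semantics") == "new_budget_authority"
                and p.get("detail_level", "") not in ("sub_allocation", "proviso_amount")):
            ba += val["dollars"]
        elif (p.get("provision_type") == "rescission"
                and amt.get("semantics") == "rescission"):
            resc += abs(val["dollars"])  # rescissions reported as positive
    return ba, resc, ba - resc

# Synthetic example: one $100 appropriation, one $30 rescission
provisions = [
    {"provision_type": "appropriation", "detail_level": "top_level",
     "amount": {"semantics": "new_budget_authority",
                "value": {"kind": "specific", "dollars": 100}}},
    {"provision_type": "rescission",
     "amount": {"semantics": "rescission",
                "value": {"kind": "specific", "dollars": -30}}},
]
print(compute_totals(provisions))  # (100, 30, 70)
```

Pointing the same function at data["provisions"] from an extraction.json should reproduce the CLI's budget_authority, rescissions, and net_ba fields.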

The $22 million difference

If you sum all appropriation provisions with new_budget_authority semantics without excluding sub-allocations and proviso amounts, you get $846,159,099,554 — about $22 million more than the official total. That $22 million represents sub-allocations and proviso amounts that are correctly excluded from the budget authority sum.

This is by design: the detail_level filter prevents double-counting between parent accounts and their “of which” breakdowns.

How Rescissions Work

Rescissions are cancellations of previously appropriated funds. They reduce the net budget authority:

Net BA = Budget Authority − Rescissions
       = $846,137,099,554 − $24,659,349,709
       = $821,477,749,845  (for H.R. 4366)

Rescissions are always displayed as positive numbers in the summary table (absolute value), even though they represent a reduction. The subtraction happens in the Net BA column.

The largest rescissions in the example data

| Account | Amount | Division |
|---|---|---|
| Nonrecurring Expenses Fund (HHS) | $12,440,000,000 | C |
| Medical Services (VA) | $3,034,205,000 | A |
| Medical Community Care (VA) | $2,657,977,000 | A |
| Veterans Health Administration | $1,951,750,000 | A |
| Medical Support and Compliance (VA) | $1,550,000,000 | A |

The $12.44 billion HHS rescission is from the Fiscal Responsibility Act of 2023 — Congress clawing back unspent pandemic-era funds. The VA rescissions are from prior-year unobligated balances being recovered.

CR Budget Authority

Continuing resolutions present a special case. The H.R. 5860 summary shows $16 billion in budget authority. This comes from the standalone appropriations in the CR (principally the $16 billion for FEMA Disaster Relief Fund), not from the CR baseline mechanism.

The CR baseline — “fund at prior-year rates” — doesn’t have an explicit dollar amount in the bill. The tool captures the 13 CR substitutions (anomalies) that set specific levels for specific programs, but it doesn’t model the total funding implied by the “continue at prior-year rate” provision. To know the full funding picture during a CR, you need both the CR data and the prior-year regular appropriations bill data.

Why Budget Authority ≠ What You Read in Headlines

Three common sources of confusion:

1. This tool reports budget authority, not outlays

Budget authority is what Congress authorizes; outlays are what Treasury spends. The two differ because agencies often obligate funds in one year but disburse them over several years. Headline federal spending figures ($6.7 trillion) are in outlays. This tool reports budget authority.

2. Mandatory spending appears in the totals

Programs like SNAP ($122 billion) and VA Compensation and Pensions ($182 billion) appear as appropriation lines in the bill text. They’re technically mandatory spending (determined by eligibility rules, not annual votes), but they show up in appropriations bills. The tool extracts what the bill says.

3. Not all 12 appropriations bills are in one omnibus

The FY2024 omnibus (H.R. 4366) covers MilCon-VA, Agriculture, CJS, Energy-Water, Interior, THUD, and other matters — but it does NOT cover Defense, Labor-HHS, Homeland Security, State-Foreign Ops, Financial Services, or Legislative Branch. Those were in separate legislation. So the $846 billion total represents 7 of 12 bills, not the entire discretionary budget.

See Why the Numbers Might Not Match Headlines for a comprehensive explanation of these differences.

The Trust Model for Budget Authority

The budget authority number has several layers of protection against errors:

  1. Computed from provisions, not LLM summaries. The compute_totals() function sums individual provisions. The LLM’s self-reported totals are diagnostic only.

  2. Dollar amounts are verified against source text. Every text_as_written dollar string is searched for in the bill XML. Across the full dataset: 99.995% of dollar amounts verified against the source text.

  3. Sub-allocation exclusion prevents double-counting. The detail_level filter is deterministic and applied in Rust code, not by the LLM.

  4. Regression-tested. The project’s integration test suite hardcodes the exact budget authority for each example bill ($846,137,099,554 / $16,000,000,000 / $2,882,482,000). Any change in extraction data or computation logic that would alter these numbers is caught by tests.

  5. Independently reproducible. The Python calculation above reproduces the same number from the same JSON data. Anyone can verify the computation.

The weakest link is the LLM’s classification of semantics and detail_level — if the LLM incorrectly labels a sub-allocation as top_level, it would be included in the total when it shouldn’t be. The 95.6% exact raw text match rate provides indirect evidence that provisions are attributed correctly, and the hardcoded regression totals catch systematic errors, but there’s no automated per-provision check of detail_level correctness.

For high-stakes analysis, spot-check a sample of provisions with search --format json and verify that the detail_level and semantics assignments match what the bill text actually says.

Quick Reference

| Component | Computation | Example Data Total |
|---|---|---|
| Budget Authority | Sum of appropriation provisions with new_budget_authority semantics at top_level or line_item detail | $865,019,581,554 (across the three example bills) |
| Rescissions | Sum of rescission provisions (absolute value) | $24,659,349,709 |
| Net BA | Budget Authority − Rescissions | $840,360,231,845 |

Per bill:

| Bill | Budget Authority | Rescissions | Net BA |
|---|---|---|---|
| H.R. 4366 (Omnibus) | $846,137,099,554 | $24,659,349,709 | $821,477,749,845 |
| H.R. 5860 (CR) | $16,000,000,000 | $0 | $16,000,000,000 |
| H.R. 9468 (Supplemental) | $2,882,482,000 | $0 | $2,882,482,000 |

Next Steps

Why the Numbers Might Not Match Headlines

If you run congress-approp summary --dir data and see the budget numbers, your first reaction might be: “That doesn’t match any number I’ve seen in the news.” Headlines about the federal budget typically cite figures like $6.7 trillion (total spending), $1.7 trillion (total discretionary), or sometimes $1.2 trillion or $886 billion (specific spending cap categories).

This chapter explains the five main reasons for the discrepancy — and why the tool’s number is correct for what it measures.

The Three Budget Numbers

There are at least three different “federal budget” numbers in common use, and they measure fundamentally different things:

| Number | What It Measures | Source |
|---|---|---|
| ~$6.7 trillion | Total federal spending (outlays) — mandatory + discretionary + interest | CBO, OMB, Treasury |
| ~$1.7 trillion | Total discretionary budget authority — all 12 appropriations bills combined | CBO scoring of appropriations acts |
| $846 billion (this tool, H.R. 4366) | Budget authority enacted in one specific bill (7 of 12 appropriations bills, plus mandatory lines that appear in the text) | Computed from individual provisions |

None of these numbers are wrong — they just measure different things at different levels of aggregation.

Reason 1: This Omnibus Doesn’t Cover All 12 Bills

Congress is supposed to pass 12 annual appropriations bills, one for each subcommittee jurisdiction. In practice, they’re often bundled into an omnibus or split across multiple legislative vehicles.

The FY2024 omnibus (H.R. 4366, the Consolidated Appropriations Act, 2024) covers these divisions:

| Division | Coverage |
|---|---|
| A | Military Construction, Veterans Affairs |
| B | Agriculture, Rural Development, FDA |
| C | Commerce, Justice, Science |
| D | Energy and Water Development |
| E | Interior, Environment |
| F | Transportation, Housing and Urban Development |
| G–H | Other matters |

It does not include:

  • Defense (by far the largest single appropriations bill; the FY2024 national defense topline was ~$886 billion)
  • Labor, HHS, Education (typically the largest domestic bill)
  • Homeland Security
  • State, Foreign Operations
  • Financial Services and General Government
  • Legislative Branch

Those were addressed through other legislative vehicles for FY2024. Since the tool only extracts what’s in the bills you give it, the $846 billion total reflects 7 of 12 subcommittee jurisdictions — not the full discretionary budget.

To get the full picture: Extract all enacted appropriations bills for a congress, then run summary --dir data across all of them.

Reason 2: Mandatory Spending Appears in Appropriations Bills

Some of the largest federal programs — technically classified as “mandatory spending” — appear as appropriation line items in the bill text. The tool extracts what the bill says without distinguishing mandatory from discretionary.

Notable mandatory programs in the H.R. 4366 example data:

| Account | Amount | Technically… |
|---|---|---|
| Compensation and Pensions (VA) | $197,382,903,000 | Mandatory entitlement |
| Supplemental Nutrition Assistance Program (SNAP) | $122,382,521,000 | Mandatory entitlement |
| Child Nutrition Programs | $33,266,226,000 | Mostly mandatory |
| Readjustment Benefits (VA) | $13,774,657,000 | Mandatory entitlement |

These four programs alone account for over $366 billion — about 43% of the omnibus total. They’re in the bill because Congress appropriates the funds even though the spending levels are determined by eligibility rules in permanent law (the authorizing statutes), not by the annual appropriations process.
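The arithmetic is easy to check from the table above:

```python
# Sum the four large mandatory lines from the H.R. 4366 example data
# and compare against the bill's $846,137,099,554 budget authority total.
mandatory = {
    "Compensation and Pensions (VA)": 197_382_903_000,
    "SNAP": 122_382_521_000,
    "Child Nutrition Programs": 33_266_226_000,
    "Readjustment Benefits (VA)": 13_774_657_000,
}
total = sum(mandatory.values())
print(f"${total:,}")                     # $366,806,307,000
print(f"{total / 846_137_099_554:.1%}")  # 43.4%
```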

Why the tool includes them: The tool faithfully extracts every provision in the bill text. A provision that says “For Compensation and Pensions, $197,382,903,000” is an appropriation provision regardless of whether budget analysts classify the underlying program as mandatory. Distinguishing mandatory from discretionary requires authorizing-law context beyond the bill itself — context the tool doesn’t have.

How to identify mandatory lines: Look for very large amounts in Division A (VA) and Division B (Agriculture). Programs with amounts in the tens or hundreds of billions are almost certainly mandatory. The notes field sometimes flags these, and you can filter them using --max-dollars to exclude the largest accounts from analysis.

Reason 3: Budget Authority vs. Outlays

The most fundamental distinction in federal budgeting:

  • Budget Authority (BA): The legal authority Congress grants to agencies to enter into financial obligations — sign contracts, award grants, hire staff. This is what the bill text specifies and what this tool reports.

  • Outlays: The actual cash disbursements by the U.S. Treasury. This is what the government actually spends in a given year.

Budget authority and outlays differ because agencies often obligate funds in one year but spend them over several years. A multi-year construction project might receive $500 million in budget authority in FY2024, but the Treasury only disburses $100 million in FY2024, $200 million in FY2025, and $200 million in FY2026.

Headline federal spending numbers are in outlays. When you read “the federal government spent $6.7 trillion in FY2024,” that’s outlays — actual cash out the door. This tool reports budget authority — the amount Congress authorized agencies to commit. The two numbers are related but not identical: a given year’s outlays include spending from prior years’ budget authority, while some of this year’s budget authority won’t be disbursed until future years.

| Concept | What It Measures | Reported By This Tool? |
|---|---|---|
| Budget Authority (BA) | What Congress authorizes | Yes |
| Obligations | What agencies commit to spend | No |
| Outlays | What Treasury actually pays out | No |

Why BA is the right measure for this tool: Budget authority is what the bill text specifies. It’s the number Congress votes on, the number the Appropriations Committee reports, and the number that determines whether spending caps are breached. It’s the most precise measure of congressional intent — “how much did Congress decide to give this program?”

Reason 4: Advance Appropriations

Some provisions enact budget authority in the current year’s bill but make the funds available starting in a future fiscal year. These advance appropriations are common for VA medical accounts:

For example, H.R. 4366 includes both:

  • $71 billion for VA Medical Services in FY2024 (current-year appropriation)
  • Advance appropriation amounts for VA Medical Services in FY2025

Both are counted in the bill’s budget authority total because both are enacted by this bill. But from a fiscal year perspective, the advance amounts will be “FY2025 spending” even though the legal authority was enacted in the FY2024 bill.

The tool captures advance appropriations and typically flags them in the notes field. CBO scores may attribute them to different fiscal years than this tool’s simple per-bill sum.

Reason 5: Gross vs. Net Budget Authority

The summary table shows both gross budget authority and rescissions separately:

│ H.R. 4366 ┆ Omnibus ┆ 2364 ┆ 846,137,099,554 ┆ 24,659,349,709 ┆ 821,477,749,845 │
  • Budget Auth ($846.1B): Gross new budget authority
  • Rescissions ($24.7B): Previously appropriated funds being canceled
  • Net BA ($821.5B): The actual net new spending authority

Some external sources report gross BA, some report net BA, and some report net BA after other adjustments (offsets, fees, etc.). Make sure you’re comparing like to like.

How to Reconcile with External Sources

CBO cost estimates

The Congressional Budget Office publishes cost estimates for most appropriations bills. These are the authoritative source for budget scoring. To compare:

  1. Find the CBO cost estimate for the specific bill (e.g., H.R. 4366)
  2. Look at the “discretionary” budget authority line
  3. Note that CBO separates discretionary from mandatory — this tool does not
  4. Note that CBO may attribute advance appropriations to different fiscal years

Appropriations Committee reports

House and Senate Appropriations Committee reports contain detailed funding tables by account. These are useful for account-level verification:

  1. Find the committee report for the bill’s division (e.g., Division A report for MilCon-VA)
  2. Compare individual account amounts — these should match exactly
  3. Compare title-level or division-level subtotals

OMB Budget Appendix

The Office of Management and Budget publishes the Budget Appendix with account-level detail. This is useful for cross-checking agency totals but uses a different fiscal year attribution than this tool.

Summary: What This Tool’s Numbers Mean

When you see a budget authority figure from this tool, it means:

  1. It’s computed from individual provisions — not from any summary or LLM estimate
  2. It includes both discretionary and mandatory spending lines that appear in the bill text
  3. It covers only the bills you’ve loaded — not necessarily all 12 appropriations bills
  4. It reports budget authority — what Congress authorized, not what agencies will actually spend
  5. It may include advance appropriations — funds enacted now but available in future fiscal years
  6. Sub-allocations are correctly excluded — “of which” breakdowns don’t double-count
  7. Every dollar amount was verified against the source bill text (0 unverifiable amounts across example data)

The number is precisely what the bill says. Whether that matches a headline depends on which bill, which measure (BA vs. outlays), and which programs (discretionary only vs. including mandatory) the headline is reporting.

Quick Reference: Common Discrepancy Sources

| Your Number Seems… | Likely Cause | How to Check |
|---|---|---|
| Too high vs. “discretionary spending” | Mandatory spending lines (SNAP, VA Comp & Pensions) included | Filter with --max-dollars 50000000000 to see totals without the largest accounts |
| Too low vs. “total federal budget” | BA ≠ outlays; not all 12 bills loaded | Check which divisions/bills are in your data |
| Different from CBO score | Advance appropriations, mandatory/discretionary split, net vs. gross | Compare specific accounts rather than totals |
| Doesn’t match committee report | Sub-allocations excluded from BA total; different aggregation level | Use search --account for account-level comparison |

Next Steps

Data Integrity and the Hash Chain

Every stage of the extraction pipeline produces files that depend on the output of the previous stage. The XML produces the extraction, the extraction produces the embeddings, and the embeddings enable semantic search. But what happens if you re-download the XML, or re-extract with a different model? The downstream files become stale — they were built from data that no longer matches.

The hash chain is a simple mechanism that detects this staleness automatically. Each downstream artifact records the SHA-256 hash of the input it was built from. When you run a command that uses those artifacts, the tool recomputes the hash and compares. If they don’t match, you get a warning.

The Chain

BILLS-*.xml ──sha256──▶ metadata.json (source_xml_sha256)
                              │
extraction.json ──sha256──▶ bill_meta.json (extraction_sha256)
extraction.json ──sha256──▶ embeddings.json (extraction_sha256)
                              │
vectors.bin ──sha256──▶ embeddings.json (vectors_sha256)

Four links, each connecting an input to the artifact that records its hash:

When extraction runs, it computes the SHA-256 hash of the source XML file (BILLS-*.xml) and stores it in metadata.json:

{
  "model": "claude-opus-4-6",
  "source_xml_sha256": "a3f7b2c4e8d1..."
}

If someone re-downloads the XML (perhaps a corrected version was published), the hash in metadata.json no longer matches the file on disk. This tells you the extraction was built from a different version of the source.

When embeddings are generated, the SHA-256 hash of extraction.json is stored in embeddings.json:

{
  "schema_version": "1.0",
  "model": "text-embedding-3-large",
  "dimensions": 3072,
  "count": 2364,
  "extraction_sha256": "b5d9e1f3a7c2...",
  "vectors_file": "vectors.bin",
  "vectors_sha256": "c8f2a4b6d0e3..."
}

If you re-extract the bill (with a different model, or after a prompt improvement), the new extraction.json has a different hash than what embeddings.json recorded. The provisions may have changed — different provision count, different classifications, different text — but the embedding vectors still correspond to the old provisions.

The SHA-256 hash of vectors.bin is also stored in embeddings.json. This is an integrity check: if the binary file is corrupted, truncated, or replaced, the hash mismatch is detected.

How Staleness Detection Works

The staleness.rs module implements the checking logic. It’s called by commands that depend on embeddings — primarily search --semantic and search --similar.

What happens on every query

  1. The tool loads extraction.json for each bill
  2. If the command uses embeddings, it loads embeddings.json for each bill
  3. It computes the SHA-256 hash of the current extraction.json on disk
  4. It compares that hash to the extraction_sha256 stored in embeddings.json
  5. If they differ, it prints a warning to stderr
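The check itself is small — a minimal sketch in Python (the real logic lives in staleness.rs; file names follow the hash chain described above):

```python
import hashlib
import json
from pathlib import Path

def embeddings_stale(bill_dir: str) -> bool:
    """True if extraction.json has changed since embeddings were generated."""
    d = Path(bill_dir)
    # Hash the extraction file as it exists on disk right now...
    current = hashlib.sha256((d / "extraction.json").read_bytes()).hexdigest()
    # ...and compare against the hash recorded when embeddings were built.
    recorded = json.loads((d / "embeddings.json").read_text())["extraction_sha256"]
    return current != recorded
```

If this returns True, the provision indices behind the embedding vectors may no longer line up with the current provisions — the condition the stderr warning reports.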

The warning

⚠ H.R. 4366: embeddings are stale (extraction.json has changed)

This warning is advisory only — it never blocks execution. The tool still runs your query, still computes cosine similarity, and still returns results. But the results may be unreliable because the provision indices in the embedding vectors may not correspond to the current provisions.

Why warnings don’t block

Strict enforcement (refusing to run with stale data) would be frustrating in practice. You might have re-extracted one bill out of twenty and want to run a query across all of them while you regenerate embeddings in the background. The warning tells you what’s stale; you decide whether it matters for your current task.

When Staleness Occurs

| Action | What Becomes Stale | Fix |
|---|---|---|
| Re-download XML | extraction.json (built from old XML) | Re-extract: congress-approp extract --dir <path> |
| Re-extract bill | embeddings.json + vectors.bin (built from old extraction) | Re-embed: congress-approp embed --dir <path> |
| Upgrade extraction data | embeddings.json + vectors.bin (extraction.json changed) | Re-embed: congress-approp embed --dir <path> |
| Manually edit extraction.json | embeddings.json + vectors.bin | Re-embed |
| Move files to a new machine | Nothing — hashes are content-based, not path-based | No fix needed |
| Copy bill directory | Nothing — all files move together | No fix needed |

Automatic Skip for Up-to-Date Bills

The embed command uses the hash chain to avoid unnecessary work. When you run:

congress-approp embed --dir data

For each bill, it checks:

  1. Does embeddings.json exist?
  2. Does the stored extraction_sha256 match the current SHA-256 of extraction.json?
  3. Does the stored vectors_sha256 match the current SHA-256 of vectors.bin?

If all three pass, the bill is skipped:

Skipping H.R. 9468: embeddings up to date

This makes it safe to run embed --dir data repeatedly — only bills with new or changed extractions are processed. The same logic applies when running embed after upgrading some bills but not others.

Performance

Hash computation is fast:

| Operation | Time |
|---|---|
| SHA-256 of H.R. 9468 extraction.json (~15 KB) | <1 ms |
| SHA-256 of H.R. 4366 extraction.json (~12 MB) | ~5 ms |
| SHA-256 of H.R. 4366 vectors.bin (~29 MB) | ~8 ms |
| Total for all example bills | ~50 ms |

At scale (20 congresses, ~60 bills), total hashing time would still be a fraction of a second — negligible compared to the cost of parsing the JSON itself. There is no performance reason to skip or cache hash checks.

The tool always checks — it never caches hash results. Since the check takes milliseconds and the files are immutable in normal operation, this is the right tradeoff: simplicity and correctness over micro-optimization.
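You can reproduce the order of magnitude yourself with Python’s hashlib (absolute timings depend on hardware; the 12 MB payload size matches the largest example extraction.json):

```python
import hashlib
import time

payload = b"\0" * (12 * 1024 * 1024)  # ~12 MB of data to hash

start = time.perf_counter()
digest = hashlib.sha256(payload).hexdigest()
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"SHA-256 of 12 MB took {elapsed_ms:.1f} ms")
```

On a modern CPU this lands in the single-digit-millisecond range, consistent with the table above.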

What’s NOT in the Hash Chain

chunks/ directory

The chunks/ directory contains per-chunk LLM artifacts — thinking traces, raw responses, conversion reports. These are local provenance records for debugging and analysis. They are:

  • Not part of the hash chain — no downstream artifact records their hashes
  • Not required for any operation — all query commands work without them
  • Gitignored by default — they contain model thinking content and aren’t meant for distribution

If the chunks are deleted, nothing breaks. They’re useful for understanding why the LLM classified a provision a certain way, but they’re not part of the data integrity chain.

verification.json

The verification report is regenerated by the upgrade command and could be regenerated at any time from extraction.json + BILLS-*.xml. It’s not part of the hash chain because it’s a derived artifact — you can always reproduce it from its inputs.

tokens.json

Token usage records from the extraction are informational only. They don’t affect any downstream operation and aren’t part of the hash chain.

The Immutability Model

The hash chain works because of the write-once principle: every file is immutable after creation. This means:

  • No concurrent modification. Two processes reading the same bill data will never see partially written files.
  • No invalidation logic. There’s nothing to invalidate — files are either current (hashes match) or stale (hashes don’t match).
  • No locking. Read operations don’t need to coordinate. Write operations (extract, embed, upgrade) overwrite files atomically.

The one exception is links/links.json, which is append-only — new links are added via link accept, existing links can be removed via link remove. Even this follows a simple consistency model: links reference provision indices in specific bill directories, and if those bills are re-extracted, the links become invalid (detectable via hash chain).

Verifying Integrity Manually

You can verify the hash chain yourself using standard tools:

Check extraction against metadata

# Compute the current SHA-256 of the source XML
shasum -a 256 data/118-hr9468/BILLS-118hr9468enr.xml

# Compare to what metadata.json recorded
python3 -c "
import json
meta = json.load(open('data/118-hr9468/metadata.json'))
print(f'Recorded: {meta.get(\"source_xml_sha256\", \"NOT SET\")}')
"

Check embeddings against extraction

# Compute the current SHA-256 of extraction.json
shasum -a 256 data/118-hr9468/extraction.json

# Compare to what embeddings.json recorded
python3 -c "
import json
emb = json.load(open('data/118-hr9468/embeddings.json'))
print(f'Recorded: {emb[\"extraction_sha256\"]}')
"

Check vectors.bin integrity

# Compute the current SHA-256 of vectors.bin
shasum -a 256 data/118-hr9468/vectors.bin

# Compare to what embeddings.json recorded
python3 -c "
import json
emb = json.load(open('data/118-hr9468/embeddings.json'))
print(f'Recorded: {emb[\"vectors_sha256\"]}')
"

If all three pairs match, the data is consistent across the entire chain.
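The three checks above can be combined into a single script. This is a sketch, assuming the field names shown above and a bill directory laid out as described in this book:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_chain(bill_dir: Path) -> list[str]:
    """Return a list of mismatch descriptions; an empty list means the chain is intact."""
    problems = []
    meta = json.loads((bill_dir / "metadata.json").read_text())
    emb = json.loads((bill_dir / "embeddings.json").read_text())

    # Link 1: metadata.json records the source XML hash
    xml = next(bill_dir.glob("BILLS-*.xml"))
    if meta.get("source_xml_sha256") != sha256_of(xml):
        problems.append("source XML does not match metadata.json")

    # Link 2: embeddings.json records the extraction hash
    if emb.get("extraction_sha256") != sha256_of(bill_dir / "extraction.json"):
        problems.append("extraction.json does not match embeddings.json")

    # Link 3: embeddings.json records the vectors hash
    if emb.get("vectors_sha256") != sha256_of(bill_dir / "vectors.bin"):
        problems.append("vectors.bin does not match embeddings.json")
    return problems
```

Running `verify_chain(Path("data/118-hr9468"))` and getting an empty list is equivalent to all three manual checks passing.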

Design Decisions

Why SHA-256?

SHA-256 is:

  • Collision-resistant — the probability of two different files producing the same hash is astronomically small
  • Fast — computing a hash takes milliseconds even for the largest files in the pipeline
  • Standard — available in every language and platform via the sha2 crate in Rust, hashlib in Python, shasum on the command line
  • Deterministic — the same file always produces the same hash, regardless of when or where it’s computed

Why content-based hashing instead of timestamps?

Timestamps tell you when a file was modified, not whether its content changed. If you copy a bill directory to a new machine, the timestamps change but the content doesn’t. Content-based hashing correctly reports “no staleness” in this case.

Similarly, if you re-extract a bill and the LLM happens to produce identical output, the timestamps change but the content doesn't. Content-based hashing correctly reports "no staleness" here too — the embeddings are still valid because the extraction didn't actually change.

Why warn instead of error?

Stale embeddings still produce some results — they may just not correspond perfectly to the current provisions. In practice, re-extraction often produces very similar provisions (same accounts, same amounts, slightly different wording), so stale embeddings are “mostly correct” even when technically outdated. Blocking execution would be overly strict for this use case.

The warning goes to stderr so it doesn’t interfere with stdout output (which may be piped to jq or a file).

Summary

| Component | Records Hash Of | Stored In | Checked When |
|---|---|---|---|
| Source XML hash | BILLS-*.xml | metadata.json | extract, upgrade |
| Extraction hash | extraction.json | embeddings.json | embed, search --semantic, search --similar |
| Vectors hash | vectors.bin | embeddings.json | embed, search --semantic, search --similar |

The hash chain is simple by design — three links, SHA-256, advisory warnings, millisecond overhead. It provides confidence that the artifacts you’re querying were built from the data you think they were built from, without imposing any operational burden.

Next Steps

LLM Reliability and Guardrails

Anyone evaluating whether to trust this tool’s output will eventually ask: “How do I know the LLM didn’t make this up?” This chapter answers that question comprehensively — explaining the trust model, documenting the accuracy metrics, cataloguing known failure modes, and describing what the tool can and cannot guarantee.

The Trust Model

The architecture is designed around a single principle:

The LLM extracts once. Deterministic code verifies everything.

The LLM (Claude) touches the data at exactly one point in the pipeline: during extraction (Stage 3). It reads bill text and produces structured JSON — classifying provisions, extracting dollar amounts, identifying account names, and assigning metadata like division, section, and detail level.

After that, the LLM is never consulted again. Every downstream operation — verification, budget authority computation, querying, searching, comparing, auditing — is deterministic code. If you don’t trust the LLM’s classification of a provision, the raw_text field lets you read the original bill language yourself.

This separation means:

  • Dollar amount verification is a string search in the source XML. No LLM judgment involved.
  • Budget authority totals are computed by summing individual provisions in Rust code. The LLM also produces its own totals, but these are diagnostic only — never used for computation.
  • Raw text matching is byte-level substring comparison against the source. The LLM’s output is checked, not trusted.
  • Semantic search ranking uses pre-computed vectors and cosine similarity. The LLM plays no role at query time (except one small API call to embed your search text).
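As an illustration of the first point, amount verification reduces to substring search. A minimal sketch (`verify_amount` is an illustrative name, not the tool's API); the status values mirror the amount_status field described later:

```python
def verify_amount(source_text: str, text_as_written: str) -> str:
    """Classify an extracted dollar string by how it appears in the source text."""
    first = source_text.find(text_as_written)
    if first == -1:
        return "not_found"          # the amount does not occur anywhere in the bill
    second = source_text.find(text_as_written, first + 1)
    return "found" if second == -1 else "found_multiple"

bill = "For necessary expenses, $5,000,000: Provided further, $5,000,000 more."
print(verify_amount(bill, "$5,000,000"))   # found_multiple
print(verify_amount(bill, "$7,000,000"))   # not_found
```

Nothing here requires a model: either the exact string occurs in the source, or it doesn't.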

Accuracy Metrics Across Example Data

The included dataset — 32 bills across FY2019–FY2026 — provides a concrete benchmark for extraction quality:

Dollar amount verification

| Metric | Result |
|---|---|
| Total provisions with dollar amounts | 1,522 |
| Dollar amounts found at unique position in source | 797 (52.4%) |
| Dollar amounts found at multiple positions in source | 725 (47.6%) |
| Dollar amounts not found in source | 0 (0.0%) |

Every single dollar amount the LLM extracted actually exists in the source bill text. The 47.6% “ambiguous” rate is expected — round numbers like $5,000,000 appear dozens of times in a large omnibus.

Internal consistency

| Metric | Result |
|---|---|
| Mismatches between parsed dollars integer and text_as_written string | 0 |
| CR substitution pairs where both amounts verified | 13/13 (100%) |

When the LLM extracts "text_as_written": "$2,285,513,000" and "dollars": 2285513000, these are independently checked for consistency. Zero mismatches across all example data.
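The check itself is a parse-and-compare; a sketch with an illustrative helper name:

```python
def parse_dollars(text_as_written: str) -> int:
    """Parse a formatted dollar string like "$2,285,513,000" into an integer."""
    return int(text_as_written.replace("$", "").replace(",", ""))

def amounts_consistent(text_as_written: str, dollars: int) -> bool:
    """True when the LLM's integer matches its own formatted string."""
    return parse_dollars(text_as_written) == dollars

print(amounts_consistent("$2,285,513,000", 2_285_513_000))  # True
print(amounts_consistent("$2,285,513,000", 2_285_513_001))  # False
```

Because both fields come from the same extraction, a mismatch signals that the LLM garbled one of them.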

Raw text faithfulness

| Match Tier | Count | Percentage |
|---|---|---|
| Exact (byte-identical substring of source) | 2,392 | 95.6% |
| Normalized (matches after whitespace/quote normalization) | 71 | 2.8% |
| Spaceless (matches after removing all spaces) | 0 | 0.0% |
| No match (not found at any tier) | 38 | 1.5% |

95.6% of provisions have raw_text that is a byte-for-byte copy of the source bill text. The 1.5% that don’t match are all non-dollar provisions — statutory amendments where the LLM slightly reformatted section references. No provision with a dollar amount has a raw text mismatch.
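The tiers can be approximated like this (a simplified sketch; the tool's actual normalization rules may differ):

```python
import re

def match_tier(source: str, raw_text: str) -> str:
    """Classify how faithfully raw_text reproduces the source, in descending tiers."""
    if raw_text in source:
        return "exact"                 # byte-identical substring
    def norm(s: str) -> str:
        # Collapse whitespace and fold curly quotes to straight quotes
        return re.sub(r"\s+", " ", s.replace("\u201c", '"').replace("\u201d", '"')).strip()
    if norm(raw_text) in norm(source):
        return "normalized"            # matches after whitespace/quote normalization
    def spaceless(s: str) -> str:
        return re.sub(r"\s+", "", s)
    if spaceless(raw_text) in spaceless(source):
        return "spaceless"             # matches after removing all spaces
    return "no_match"
```

Each tier is strictly weaker than the one above it, so a provision is reported at the strongest tier it satisfies.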

Completeness

| Bill | Coverage |
|---|---|
| H.R. 9468 (supplemental, 7 provisions) | 100.0% |
| H.R. 4366 (omnibus, 2,364 provisions) | 94.2% |
| H.R. 5860 (CR, 130 provisions) | 61.1% |

Coverage measures what percentage of dollar strings in the source text were captured by an extracted provision. Below 100% doesn’t necessarily indicate errors — see What Coverage Means.

Classification

| Metric | Result |
|---|---|
| Provisions classified into one of 10 specific types | 2,405 (96.2%) |
| Provisions classified as other (catch-all) | 96 (3.8%) |
| Unknown provision types caught by fallback parser | 0 |

The LLM classified 96.2% of provisions into specific types. The remaining 3.8% are genuinely unusual provisions (budget enforcement designations, fee authorities, fund recovery provisions) that the LLM correctly placed in the catch-all category rather than forcing into an inappropriate type.

What the LLM Does Well

Structured extraction from complex text

Appropriations bills are among the most structurally complex legislative documents — nested provisos, cross-references to other laws, hierarchical account structures, and domain-specific conventions. The LLM handles these well:

  • Account names are correctly extracted from between '' delimiters in the bill text
  • Dollar amounts are parsed from formatted strings ($10,643,713,000) to integers (10643713000)
  • Sub-allocations are correctly identified as breakdowns of parent accounts, not additional money
  • CR substitutions are extracted with both the new and old amounts
  • Provisos (“Provided, That” clauses) are recognized and categorized

Handling edge cases

The system prompt includes specific instructions for legislative edge cases:

  • “Such sums as may be necessary” — open-ended authorizations without a specific dollar figure, captured as AmountValue::SuchSums
  • Transfer authority ceilings — marked as transfer_ceiling semantics so they don’t inflate budget authority
  • Advance appropriations — flagged in the notes field
  • Sub-allocation semantics — marked as reference_amount to prevent double-counting

Graceful degradation

When the LLM encounters something it can’t confidently classify, it falls back to other rather than guessing. The llm_classification field preserves the LLM’s description of what it thinks the provision is, so information is never lost.

The from_value.rs resilient parser adds another layer: if the LLM produces unexpected JSON — missing fields, wrong types, extra fields, or unknown enum values — the parser absorbs the variance, counts it, and produces a ConversionReport documenting every compromise. Extraction rarely fails entirely.

Known Failure Modes

1. LLM non-determinism

Re-extracting the same bill may produce slightly different results:

  • Provision counts may vary by a small number (typically ±1-3% for large bills)
  • Classifications may shift — a provision classified as rider in one extraction might become limitation in another
  • Detail levels may change — a sub-allocation might be classified as a line item or vice versa
  • Notes and descriptions are generated text and will differ between runs

Mitigation: Dollar amounts are verified against the source text regardless of classification. Budget authority totals are regression-tested against hardcoded expected values. If the numbers match, classification differences are cosmetic.

2. Paraphrased raw text on statutory amendments

The 38 no_match provisions in the example data are all statutory amendments — provisions that modify existing law by striking and inserting text. The LLM sometimes reformats the section numbering:

  • Source: Section 1886(d)(5)(G) of the Social Security Act (42 U.S.C. 1395ww(d)(5)(G)) is amended—
  • LLM: Section 1886(d)(5)(G) of the Social Security Act (42 U.S.C. 1395ww(d)(5)(G)) is amended— (1) clause...

The LLM includes text from the next line, creating a raw_text that doesn’t appear as-is in the source. The statutory reference and substance are correct; the excerpt boundary is slightly off.

Mitigation: These provisions don’t carry dollar amounts, so the amount verification is unaffected. The match_tier: "no_match" flag lets you identify and manually review them.

3. Missing provisions on large bills

The FY2024 omnibus has 94.2% coverage — meaning 5.8% of dollar strings in the source text weren’t captured by any provision. For a 1,500-page bill, some provisions may be missed entirely.

Common causes:

  • Token limit truncation — if a chunk is very long, the LLM may not process all of it
  • Ambiguous provision boundaries — the LLM may merge two provisions or skip one
  • Unusual formatting — provisions with atypical structure may not be recognized

Mitigation: The audit command shows completeness metrics. If coverage is low for a regular bill (not a CR), re-extracting with --parallel 1 (which may handle tricky sections more carefully) or reviewing the chunk artifacts in chunks/ can help identify what was missed.

4. Sub-allocation misclassification

The LLM occasionally marks a sub-allocation as top_level or a top-level provision as sub_allocation. This affects budget authority calculations because top_level provisions are counted and sub_allocation provisions are not.

Mitigation: Budget authority totals are regression-tested. For the example data, the exact totals ($846,137,099,554 / $16,000,000,000 / $2,882,482,000) are hardcoded in the test suite. Any misclassification that would change these totals would be caught. For newly extracted bills, manual spot-checking of large provisions is recommended.
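To see why detail level matters for totals, here is a sketch with hypothetical provision records (field names follow this book's descriptions, not necessarily the exact schema):

```python
def budget_authority(provisions: list[dict]) -> int:
    """Sum only top-level provisions; sub-allocations are breakdowns, not new money."""
    return sum(
        p["dollars"]
        for p in provisions
        if p.get("detail_level") == "top_level" and isinstance(p.get("dollars"), int)
    )

provisions = [
    {"detail_level": "top_level", "dollars": 10_000_000},
    {"detail_level": "sub_allocation", "dollars": 7_000_000},  # part of the $10M above
    {"detail_level": "top_level", "dollars": 5_000_000},
]
print(budget_authority(provisions))  # 15000000
```

If the $7M sub-allocation were misclassified as top_level, the total would wrongly become $22M, which is exactly the kind of shift the regression tests would catch.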

5. Agency attribution errors

The agency field is inferred by the LLM from context — the heading hierarchy in the bill text. Occasionally the LLM assigns a provision to the wrong agency, especially near division or title boundaries where the context shifts.

Mitigation: The account_name is usually more reliable than agency because it’s extracted from explicit '' delimiters in the bill text. If agency attribution matters, cross-check using --keyword to find the provision by its text content, then verify the heading hierarchy in the source XML.

6. Confidence scores are uncalibrated

The LLM assigns a confidence score (0.0–1.0) to each provision, but these scores are not calibrated against actual accuracy:

  • Scores above 0.90 are not meaningfully differentiated — 0.95 is not reliably more accurate than 0.91
  • Scores below 0.80 may indicate genuine uncertainty and are worth reviewing
  • The scores are useful only for identifying outliers, not for quantitative quality assessment

Mitigation: Don’t use confidence scores for automated filtering. Use the verification metrics (amount_status, match_tier, quality) instead — these are computed from deterministic checks, not LLM self-assessment.

The Resilient Parsing Layer

Between the LLM’s raw JSON output and the structured Rust types, there’s a translation layer (from_value.rs) that handles the messiness of LLM output:

| LLM Output Problem | How from_value.rs Handles It |
|---|---|
| Missing field (e.g., no fiscal_year) | Defaults to None or empty string; increments null_to_default counter |
| Wrong type (e.g., string "$10,000,000" instead of integer 10000000) | Strips formatting and parses; increments type_coercions counter |
| Unknown provision type (e.g., "earmark_extension") | Wraps as Provision::Other with original classification preserved; increments unknown_provision_types counter |
| Extra fields not in schema | Silently ignored for known types; preserved in metadata map for Other type |
| Completely unparseable provision | Logged as warning, skipped; increments provisions_failed counter |

Every compromise is counted in the ConversionReport, which is saved with each chunk’s artifacts. You can see exactly how many null-to-default conversions, type coercions, and unknown types occurred during extraction.

This design philosophy — absorb variance, count it, never crash — means extraction almost never fails entirely, even when the LLM produces imperfect JSON.
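A simplified Python analogue of this layer (the real implementation is in Rust; the counter names mirror the table above, but the helper itself is illustrative):

```python
from dataclasses import dataclass

@dataclass
class ConversionReport:
    """Counts every compromise made while absorbing imperfect LLM output."""
    null_to_default: int = 0
    type_coercions: int = 0
    unknown_provision_types: int = 0

def coerce_dollars(value, report: ConversionReport):
    """Accept an integer, or coerce a formatted string like "$10,000,000"; count either fix."""
    if value is None:
        report.null_to_default += 1     # missing field -> default, counted
        return None
    if isinstance(value, str):
        report.type_coercions += 1      # wrong type -> parsed, counted
        return int(value.replace("$", "").replace(",", ""))
    return value

report = ConversionReport()
print(coerce_dollars("$10,000,000", report))  # 10000000
print(coerce_dollars(None, report))           # None
print(report.type_coercions, report.null_to_default)  # 1 1
```

The point is that every deviation from the expected shape is repaired and tallied rather than raised as an error.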

What This Tool Cannot Guarantee

Classification correctness

The tool cannot guarantee that a provision classified as rider is actually a rider and not a limitation or directive. Classification is LLM judgment, and there is currently no gold-standard evaluation set to measure classification accuracy.

The 11 provision types are well-defined in the system prompt, and the LLM is generally consistent, but edge cases exist. A provision that limits spending (“none of the funds shall be used for…”) could be classified as either a limitation or a rider depending on context.

Complete extraction on large bills

The tool cannot guarantee 100% completeness on large omnibus bills. The 94.2% coverage on H.R. 4366 is good but not perfect. Some provisions may be missed, especially those with unusual formatting or those that fall at chunk boundaries.

Correct attribution

The tool verifies that dollar amounts exist in the source text (not fabricated) and that raw text excerpts are faithful (not paraphrased). But it cannot prove that the dollar amount is attributed to the correct account. If $500,000,000 appears 20 times in the bill, the verification says “amount is real” but not “this $500M belongs to Program A and not Program B.”

The 95.6% exact raw text match rate provides strong indirect evidence of correct attribution — when the exact bill text matches, the provision is almost certainly from the right location. But “almost certainly” is not “guaranteed.”

Consistency across re-extractions

Different extraction runs of the same bill may produce slightly different results due to LLM non-determinism. The verification pipeline ensures dollar amounts are always correct, but provision counts, classifications, and descriptions may vary.

Fiscal year correctness

The fiscal_year field is inferred from context. The tool does not independently verify that the LLM assigned the correct fiscal year to each provision.

How to Build Confidence in the Data

For individual provisions

  1. Check amount_status — should be "found" or "found_multiple", never "not_found"
  2. Check match_tier — "exact" is best, "normalized" is fine, "no_match" warrants review
  3. Check quality — "strong" means both amount and text verified; "moderate" or "weak" means something didn't check out fully
  4. Read raw_text — the bill language is right there; does it match what the provision claims?
  5. Verify against source — grep the dollar string in the XML for independent confirmation
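Checks 1–3 can be applied mechanically across a whole extraction. The field names below follow the descriptions above, but treat the exact JSON layout as an assumption:

```python
def needs_review(p: dict) -> bool:
    """True when any deterministic check suggests this provision deserves a manual look."""
    return (
        p.get("amount_status") == "not_found"      # amount missing from source
        or p.get("match_tier") == "no_match"       # raw text not found at any tier
        or p.get("quality") in ("moderate", "weak")  # something didn't fully check out
    )

provisions = [
    {"amount_status": "found", "match_tier": "exact", "quality": "strong"},
    {"amount_status": "found", "match_tier": "no_match", "quality": "moderate"},
]
print([needs_review(p) for p in provisions])  # [False, True]
```

This kind of filter narrows a few thousand provisions down to the handful worth reading against the source XML.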

For aggregate results

  1. Run audit — check that NotFound = 0 for every bill
  2. Check budget totals — compare to CBO scores or committee reports for sanity
  3. Spot-check — pick 5-10 provisions at random, verify each against the source XML
  4. Cross-reference — compare the by-agency rollup to known department-level totals

For publication

If you’re publishing numbers from this tool:

  1. Always cite the specific bill and provision
  2. Note that amounts are budget authority, not outlays
  3. Note whether the number includes mandatory spending
  4. Verify the specific provision against the source XML (takes 30 seconds with grep)
  5. Link to the source bill on Congress.gov for reader verification

Comparison to Alternatives

| Approach | Accuracy | Coverage | Structured? | Cost |
|---|---|---|---|---|
| This tool | High (0 unverifiable amounts) | Good (94% omnibus, 100% small bills) | Yes — 11 typed provisions with full fields | LLM API costs for extraction |
| Manual reading | Perfect (human judgment) | Low (nobody reads 1,500 pages) | No — notes and spreadsheets | Staff time |
| CBO cost estimates | High (expert analysis) | Partial (aggregated by title/function) | No — PDF reports | Free (published) |
| Committee reports | High (staff analysis) | Good (account-level tables) | No — PDF/HTML reports | Free (published) |
| Keyword search on Congress.gov | Perfect (exact text) | Low (can't filter by type/amount/agency) | No — raw text search | Free |

The tool’s advantage is the combination of structured data (searchable, filterable, comparable) with verification against source (every dollar amount traced to the bill text). No other approach provides both.

Summary

| Question | Answer |
|---|---|
| Can the LLM hallucinate dollar amounts? | In theory, yes. In practice, 99.995% of dollar amounts were verified across the full dataset (1 unverifiable out of 18,584). |
| Can the LLM misclassify provisions? | Yes — classification is LLM judgment. Dollar amounts and raw text are verified; classification is not. |
| Can the LLM miss provisions? | Yes — 94.2% coverage on the omnibus means some provisions may be missed. |
| Is the budget authority total reliable? | Yes — computed from provisions (not LLM summaries), regression-tested, and independently reproducible. |
| Should I verify before publishing? | Yes — spot-check specific provisions against the source XML. The audit command is your first-pass quality check. |
| Is the tool better than reading the bill myself? | For finding specific provisions across 1,500 pages, absolutely. For understanding a single provision in depth, read the bill. |

Next Steps

What Coverage Means (and Doesn’t)

The audit command includes a Coverage column that shows the percentage of dollar-sign patterns in the source bill text that were matched to an extracted provision. This metric is frequently misunderstood — it measures extraction completeness, not accuracy. A bill can have 0 unverifiable dollar amounts (perfect accuracy) and still show 61% coverage (incomplete extraction). This chapter explains exactly what coverage measures, why it’s often below 100%, and when you should (and shouldn’t) worry about it.

The Definition

Coverage is computed by the completeness check in verification.rs:

Coverage = (dollar patterns matched to a provision) / (total dollar patterns in source text) × 100%

The numerator counts dollar-sign patterns in the source bill text (e.g., $51,181,397,000, $500,000, $0) that were matched to at least one extracted provision’s text_as_written field.

The denominator counts every dollar-sign pattern in the source text — including many that should not be extracted as provisions.

Coverage in the Example Data

| Bill | Provisions | Coverage | Interpretation |
|---|---|---|---|
| H.R. 9468 (supplemental) | 7 | 100.0% | Every dollar amount in the source was captured |
| H.R. 4366 (omnibus) | 2,364 | 94.2% | Most captured; 5.8% are dollar strings that aren't independent provisions |
| H.R. 5860 (CR) | 130 | 61.1% | Many dollar strings are prior-year references in the CR text, not new provisions |

Notice that all bills in the dataset have 0 unverifiable dollar amounts (NotFound = 0 in the audit). Coverage and accuracy are independent metrics:

  • Accuracy (NotFound) answers: “Are the extracted amounts real?” → Yes, all of them.
  • Coverage answers: “Did we capture every dollar amount in the bill?” → Not necessarily, and that’s often fine.

Why Coverage Below 100% Is Usually Fine

Many dollar strings in bill text are not independent provisions and should not be extracted. Here are the most common categories:

Statutory cross-references

Bills frequently cite dollar amounts from other laws for context. For example:

…pursuant to section 1241(a) of the Food Security Act ($500,000,000 for each fiscal year)…

The $500 million is from a different law being referenced — it’s not a new appropriation in this bill. The dollar string appears in the source text but correctly should not be extracted as a provision.

Loan guarantee ceilings

Agricultural and housing bills contain loan guarantee volumes:

$3,500,000,000 for guaranteed farm ownership loans and $3,100,000,000 for farm ownership direct loans

These are loan volume limits — how much the government will guarantee in private lending. They’re not budget authority (the government isn’t spending this money directly). The subsidy cost of the loan guarantee may be extracted as a separate provision, but the face value of the loan volume is correctly excluded.

Struck amounts in amendments

When a bill amends another law by changing a dollar figure:

…by striking “$50,000” and inserting “$75,000”…

The old amount ($50,000) appears in the source text but should not be extracted as a new provision. Only the new amount ($75,000) represents the current-law level.

Prior-year references in continuing resolutions

This is the main reason H.R. 5860 has only 61.1% coverage. Continuing resolutions reference prior-year appropriations acts extensively:

…under the authority and conditions provided in the applicable appropriations Act for fiscal year 2023…

The referenced prior-year act contains hundreds of dollar amounts that appear in the CR’s text as part of the legal citation. These are contextual references — they describe the baseline funding level — but they’re not new provisions in the CR. Only the 13 CR substitutions (anomalies) and a few standalone appropriations represent new funding decisions in the CR itself.

Proviso sub-references within already-captured provisions

Some dollar amounts appear within provisos that are already captured as part of a parent provision’s context:

Provided, That of the total amount available under this heading, $7,000,000 shall be for the Urban Agriculture program

If this $7M is captured as a sub-allocation provision, it’s accounted for. But if it’s part of the parent provision’s raw_text and not separately extracted, the $7M appears in the source text but isn’t “matched to a provision” in the completeness calculation. This can happen when the proviso amount is too small or too contextual to warrant a separate provision.

Fee offsets and receipts

Some provisions reference fee amounts that offset spending:

…of which not to exceed $520,000,000 shall be derived from fee collections

Fee collections appear as dollar strings in the text but represent revenue, not expenditure. They may or may not be extracted as provisions depending on context.

When Low Coverage IS Concerning

While coverage below 100% is often fine, certain patterns warrant investigation:

Coverage below 60% on a regular appropriations bill

CRs routinely have low coverage (lots of prior-year references). But a regular appropriations bill or omnibus should generally be above 80%. If you see 50-60% coverage on a bill that should have hundreds of provisions, significant sections may have been missed.

What to do: Run audit --verbose to see the unaccounted dollar amounts. Check whether major accounts you expect are present in search --type appropriation. Look for gaps — are entire divisions or titles missing?

Known major accounts not appearing

If you know a bill includes funding for a specific large program and that program doesn’t appear in the search results, the extraction may have missed it — even if overall coverage looks acceptable.

What to do: Search by keyword: search --keyword "program name". If nothing appears, check the source XML to confirm the program is in the bill, then consider re-extracting.

Coverage dropping significantly after re-extraction

If you re-extract a bill with a different model and coverage drops from 94% to 75%, the new model may be less capable at identifying provisions.

What to do: Compare provision counts between the old and new extractions. Check whether the new extraction missed entire sections. Consider reverting to the original extraction or using a higher-capability model.

Large unaccounted dollar amounts

The audit --verbose output lists every unaccounted dollar string with its context. If you see large amounts ($1 billion+) that aren’t captured by any provision, those are worth investigating — they may represent missed appropriations rather than innocent cross-references.

What to do: Look at the context for each large unaccounted amount. If it starts with “For necessary expenses of…” or similar appropriation language, it’s a genuine miss. If it’s in the middle of a statutory reference or amendment language, it’s correctly excluded.

Why Coverage Was Removed from the Summary Table

In version 2.1.0, the coverage column was removed from the default summary table output. The reason: it was routinely misinterpreted as an accuracy metric.

Users would see “94.2% coverage” and think “5.8% of the data is wrong.” In reality, 0% of the extracted data is wrong (NotFound = 0) — the 5.8% represents dollar strings in the source text that weren’t captured, most of which are correctly excluded.

Coverage is still available in:

  • audit command — shown as the rightmost column with the full column guide
  • summary --format json — available as the completeness_pct field
  • verification.json — available as summary.completeness_pct

The decision to keep coverage in audit but remove it from summary reflects the difference in audience: summary is for quick overview (journalists, analysts), while audit is for detailed quality assessment (auditors, developers).

How Coverage Is Computed: Technical Details

The completeness check in verification.rs works as follows:

Step 1: Build the dollar pattern index

The text_index module scans the entire source bill text (extracted from XML) for every pattern matching a dollar sign followed by digits and commas: $X, $X,XXX, $X,XXX,XXX, etc.

For H.R. 4366, this finds approximately 1,734 dollar patterns (with 1,046 unique strings, since round numbers like $5,000,000 appear multiple times).

Step 2: Match against extracted provisions

For each dollar pattern found in the source, the tool checks whether any extracted provision has a text_as_written field matching that dollar string.

A dollar pattern is “accounted for” if at least one provision claims it. Multiple provisions can claim the same dollar string (common for ambiguous amounts like $5,000,000).

Step 3: Compute the percentage

Coverage = (accounted dollar patterns) / (total dollar patterns) × 100%

For H.R. 4366: approximately 1,634 of 1,734 dollar patterns are accounted for → 94.2%.
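Steps 1–3 amount to a regex scan plus a set lookup. A sketch (the pattern approximates the text_index scan; it is not the exact implementation):

```python
import re

# Dollar sign followed by digits and commas: $X, $X,XXX, $X,XXX,XXX, ...
DOLLAR = re.compile(r"\$\d[\d,]*")

def coverage(source_text: str, extracted_amounts: set[str]) -> float:
    """Percentage of dollar patterns in the source claimed by some provision."""
    patterns = DOLLAR.findall(source_text)           # step 1: build the pattern index
    if not patterns:
        return 100.0
    accounted = sum(1 for p in patterns if p in extracted_amounts)  # step 2: match
    return 100.0 * accounted / len(patterns)         # step 3: compute the percentage

source = '... $5,000,000 for X ... by striking "$50,000" and inserting "$75,000" ...'
print(round(coverage(source, {"$5,000,000", "$75,000"}), 1))  # 66.7
```

Note how the struck amount ($50,000) correctly counts against coverage even though excluding it was the right extraction decision, which is exactly why coverage below 100% is often fine.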

Step 4: List unaccounted amounts

The verification.json file includes a completeness.unaccounted array listing every dollar string that wasn’t matched to a provision. Each entry includes:

  • text — the dollar string (e.g., "$500,000")
  • value — parsed dollar value
  • position — character offset in the source text
  • context — surrounding text for identification

The audit --verbose command displays these unaccounted amounts, making it easy to review whether they’re legitimate exclusions or genuine misses.

A Decision Framework for Coverage

| Situation | Coverage | Action |
|---|---|---|
| Small simple bill (supplemental, single purpose) | 100% | No action needed — perfect |
| Omnibus, regular bill | 85–100% | Good — spot-check any unaccounted amounts >$1B |
| Omnibus, regular bill | 60–85% | Review — some provisions may be missed; run audit --verbose |
| Omnibus, regular bill | <60% | Investigate — likely missing entire sections; consider re-extracting |
| Continuing resolution | 50–70% | Expected — most dollar strings are prior-year references |
| Continuing resolution | <50% | Review — even for a CR, this is unusually low |

The key insight: Coverage is a completeness heuristic, not an accuracy measure. It tells you how much of the bill’s dollar content was captured. NotFound (which should be 0) tells you whether the captured content is trustworthy.

Improving Coverage

If coverage is lower than expected, consider these approaches:

Re-extract with --parallel 1

Higher parallelism is faster but can occasionally cause issues with API rate limits or token budget allocation. Running with --parallel 1 ensures each chunk gets full attention:

congress-approp extract --dir data/118/hr/4366 --parallel 1

This is much slower for large bills but may capture provisions that were missed with higher parallelism.

Use the default model

If you extracted with a non-default model (e.g., Claude Sonnet instead of Claude Opus), the lower-capability model may have missed provisions. Re-extracting with the default model often improves coverage:

congress-approp extract --dir data/118/hr/4366

Check chunk artifacts

The chunks/ directory contains per-chunk LLM artifacts. If a specific section of the bill seems to have missing provisions, find the chunk that covers that section and examine its raw response to see what the LLM produced.

Accept the gap

For many use cases, 94% coverage is more than sufficient. If the unaccounted amounts are all statutory references, loan ceilings, and struck amounts, the extraction is correct — it just doesn’t capture every dollar string in the text, which is the right behavior.

Summary

| Question | Answer |
|---|---|
| What does coverage measure? | The percentage of dollar strings in the source text matched to an extracted provision |
| Does low coverage mean the data is wrong? | No — accuracy (NotFound) and coverage are independent metrics |
| Why is coverage below 100%? | Many dollar strings in bill text are cross-references, loan ceilings, struck amounts, or prior-year citations — not independent provisions |
| Why is CR coverage especially low? | CRs reference prior-year acts extensively, creating many dollar strings that aren't new provisions |
| When should I worry about low coverage? | When a regular bill (not a CR) is below 60%, or when known major accounts are missing |
| Where can I see coverage? | The `audit` command, `summary --format json`, and `verification.json` |
| Why isn't coverage in the summary table? | Removed in v2.1.0 because it was routinely misinterpreted as an accuracy metric |

Next Steps

CLI Command Reference

This is the complete reference for every congress-approp command and flag. For tutorials and worked examples, see the Tutorials section. For task-oriented guides, see How-To Guides.

Global Options

These flags can be used with any command:

| Flag | Short | Description |
|---|---|---|
| `--verbose` | `-v` | Enable verbose (debug-level) logging. Shows detailed progress, file paths, and internal state. |
| `--help` | `-h` | Print help for the command |
| `--version` | `-V` | Print version (top-level only) |

summary

Show a per-bill overview of all extracted data: provision counts, budget authority, rescissions, and net budget authority.

congress-approp summary [OPTIONS]

| Flag | Type | Default | Description |
|---|---|---|---|
| `--dir` | path | `./data` | Data directory containing extracted bills. Try `data` for the included FY2019–FY2026 dataset. Walks recursively to find all `extraction.json` files. |
| `--format` | string | `table` | Output format: `table`, `json`, `jsonl`, `csv` |
| `--by-agency` | flag | | Append a second table showing budget authority totals by parent department, sorted descending |
| `--fy` | integer | | Filter to bills covering this fiscal year (e.g., 2026). Uses `bill.fiscal_years` from extraction data — works without `enrich`. |
| `--subcommittee` | string | | Filter by subcommittee jurisdiction (e.g., `defense`, `thud`, `cjs`). Requires `bill_meta.json` — run `enrich` first. See Enrich Bills with Metadata for valid slugs. |

Examples

# FY2026 bills only
congress-approp summary --dir data --fy 2026

# FY2026 THUD subcommittee only (requires enrich)
congress-approp summary --dir data --fy 2026 --subcommittee thud
# Basic summary of included example data
congress-approp summary --dir data

# JSON output for scripting
congress-approp summary --dir data --format json

# Show department-level rollup
congress-approp summary --dir data --by-agency

# CSV for spreadsheet import
congress-approp summary --dir data --format csv > bill_summary.csv

Output

The summary table shows one row per loaded bill plus a TOTAL row:

┌───────────┬───────────────────────┬────────────┬─────────────────┬─────────────────┬─────────────────┐
│ Bill      ┆ Classification        ┆ Provisions ┆ Budget Auth ($) ┆ Rescissions ($) ┆      Net BA ($) │
╞═══════════╪═══════════════════════╪════════════╪═════════════════╪═════════════════╪═════════════════╡
│ H.R. 4366 ┆ Omnibus               ┆       2364 ┆ 846,137,099,554 ┆  24,659,349,709 ┆ 821,477,749,845 │
│ H.R. 5860 ┆ Continuing Resolution ┆        130 ┆  16,000,000,000 ┆               0 ┆  16,000,000,000 │
│ H.R. 9468 ┆ Supplemental          ┆          7 ┆   2,882,482,000 ┆               0 ┆   2,882,482,000 │
│ TOTAL     ┆                       ┆       2501 ┆ 865,019,581,554 ┆  24,659,349,709 ┆ 840,360,231,845 │
└───────────┴───────────────────────┴────────────┴─────────────────┴─────────────────┴─────────────────┘

0 dollar amounts unverified across all bills. Run `congress-approp audit` for detailed verification.

Budget Authority is computed from provisions (not from any LLM-generated summary). See Budget Authority Calculation for the formula.

The --by-agency flag appends a second table with columns: Department, Budget Auth ($), Rescissions ($), Provisions.
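
The Net BA column in the sample output above is simply budget authority minus rescissions, which can be re-checked by hand. A minimal Python sketch using the H.R. 4366 row (see Budget Authority Calculation for how the per-provision totals themselves are derived):

```python
# Net BA = Budget Auth - Rescissions, per the summary table.
# Values below are the H.R. 4366 row from the sample output above.
budget_auth = 846_137_099_554
rescissions = 24_659_349_709

net_ba = budget_auth - rescissions
print(f"{net_ba:,}")  # 821,477,749,845 — matches the Net BA column
```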


search

Search provisions across all extracted bills. Supports filtering by type, agency, account, keyword, division, dollar range, and meaning-based semantic search.

congress-approp search [OPTIONS]

Filter Flags

| Flag | Short | Type | Description |
|---|---|---|---|
| `--dir` | | path | Data directory containing extracted bills. Default: `./data` |
| `--type` | `-t` | string | Filter by provision type. Use `--list-types` to see valid values. |
| `--agency` | `-a` | string | Filter by agency name (case-insensitive substring match) |
| `--account` | | string | Filter by account name (case-insensitive substring match) |
| `--keyword` | `-k` | string | Search in the `raw_text` field (case-insensitive substring match) |
| `--bill` | | string | Filter to a specific bill identifier (e.g., "H.R. 4366") |
| `--division` | | string | Filter by division letter (e.g., A, B, C) |
| `--min-dollars` | | integer | Minimum dollar amount (absolute value) |
| `--max-dollars` | | integer | Maximum dollar amount (absolute value) |
| `--fy` | | integer | Filter to bills covering this fiscal year (e.g., 2026). Works without `enrich`. |
| `--subcommittee` | | string | Filter by subcommittee jurisdiction (e.g., `thud`, `defense`). Requires `enrich`. |

All filters use AND logic — every provision in the result must match every specified filter. Filter order on the command line has no effect on results.

Semantic Search Flags

| Flag | Type | Description |
|---|---|---|
| `--semantic` | string | Rank results by meaning similarity to this query text. Requires pre-computed embeddings and `OPENAI_API_KEY`. |
| `--similar` | string | Find provisions similar to the one specified. Format: `<bill_directory>:<provision_index>` (e.g., `118-hr9468:0`). Uses stored vectors — no API call needed. |
| `--top` | integer | Maximum number of results for `--semantic` or `--similar` searches. Default: 20. Has no effect on non-semantic searches (which return all matching provisions). |

Output Flags

| Flag | Type | Default | Description |
|---|---|---|---|
| `--format` | string | `table` | Output format: `table`, `json`, `jsonl`, `csv` |
| `--list-types` | flag | | Print all valid provision types and exit (ignores other flags) |

Examples

# All appropriations across all example bills
congress-approp search --dir data --type appropriation

# VA appropriations over $1 billion in Division A
congress-approp search --dir data --type appropriation --agency "Veterans" --division A --min-dollars 1000000000

# FEMA-related provisions by keyword
congress-approp search --dir data --keyword "Federal Emergency Management"

# CR substitutions (table auto-adapts to show New/Old/Delta columns)
congress-approp search --dir data/118-hr5860 --type cr_substitution

# All directives in the VA supplemental
congress-approp search --dir data/118-hr9468 --type directive

# Semantic search — find by meaning, not keywords
congress-approp search --dir data --semantic "school lunch programs for kids" --top 5

# Find provisions similar to a specific one across all bills
congress-approp search --dir data --similar 118-hr9468:0 --top 5

# Combine semantic with hard filters
congress-approp search --dir data --semantic "clean energy" --type appropriation --min-dollars 100000000 --top 10

# Export to CSV for spreadsheet analysis
congress-approp search --dir data --type appropriation --format csv > appropriations.csv

# Export to JSON for programmatic use
congress-approp search --dir data --type rescission --format json

# List all valid provision types
congress-approp search --dir data --list-types

Available Provision Types

  appropriation                    Budget authority grant
  rescission                       Cancellation of prior budget authority
  cr_substitution                  CR anomaly (substituting $X for $Y)
  transfer_authority               Permission to move funds between accounts
  limitation                       Cap or prohibition on spending
  directed_spending                Earmark / community project funding
  mandatory_spending_extension     Amendment to authorizing statute
  directive                        Reporting requirement or instruction
  rider                            Policy provision (no direct spending)
  continuing_resolution_baseline   Core CR funding mechanism
  other                            Unclassified provisions

Table Output Columns

The table adapts its shape based on the provision types in the results.

Standard search table:

| Column | Description |
|---|---|
| $ | Verification status: found unique, found multiple, not found, or blank (no dollar amount) |
| Bill | Bill identifier |
| Type | Provision type |
| Description / Account | Account name for appropriations/rescissions, description for other types |
| Amount ($) | Dollar amount, or — for provisions without amounts |
| Section | Section reference from the bill (e.g., SEC. 101) |
| Div | Division letter for omnibus bills |

CR substitution table: Replaces Amount ($) with New ($), Old ($), and Delta ($).

Semantic/similar table: Adds a Sim column at the left showing cosine similarity (0.0–1.0).

JSON/CSV Output Fields

JSON and CSV output include more fields than the table:

| Field | Type | Description |
|---|---|---|
| `bill` | string | Bill identifier |
| `provision_type` | string | Provision type |
| `account_name` | string | Account name |
| `description` | string | Description |
| `agency` | string | Agency name |
| `dollars` | integer or null | Dollar amount |
| `old_dollars` | integer or null | Old amount (CR substitutions only) |
| `semantics` | string | Amount semantics (e.g., `new_budget_authority`) |
| `section` | string | Section reference |
| `division` | string | Division letter |
| `raw_text` | string | Bill text excerpt |
| `amount_status` | string or null | `found`, `found_multiple`, `not_found`, or null |
| `match_tier` | string | `exact`, `normalized`, `spaceless`, `no_match` |
| `quality` | string | `strong`, `moderate`, `weak`, or `n/a` |
| `provision_index` | integer | Index in the bill's provision array (zero-based) |
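
The JSON output can be post-processed with the fields documented above. A small Python sketch, using hypothetical records shaped like that schema (the values are invented for illustration; real input would come from `congress-approp search ... --format json`):

```python
import json

# Hypothetical records shaped like the documented JSON fields.
raw = json.dumps([
    {"bill": "H.R. 4366", "provision_type": "appropriation",
     "agency": "Department of Veterans Affairs", "dollars": 2_882_482_000,
     "amount_status": "found", "provision_index": 0},
    {"bill": "H.R. 4366", "provision_type": "rider",
     "agency": "Department of Veterans Affairs", "dollars": None,
     "amount_status": None, "provision_index": 1},
])

provisions = json.loads(raw)

# Keep only verified amounts of $1B or more (mirrors --min-dollars plus an
# amount_status check; riders with null dollars fall out naturally).
big_verified = [p for p in provisions
                if p["dollars"] is not None
                and p["dollars"] >= 1_000_000_000
                and p["amount_status"] == "found"]
print([p["provision_index"] for p in big_verified])  # [0]
```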

compare

Compare provisions between two sets of bills. Matches accounts by (agency, account_name) and computes dollar deltas. Account names are matched case-insensitively with em-dash prefix stripping. If a dataset.json file exists in the data directory, agency groups and account aliases are applied for cross-bill matching. Use --exact to disable all normalization and match on exact lowercased strings only. See Resolve Agency and Account Name Differences for details.

There are two ways to specify what to compare:

Directory-based (compare two specific directories):

congress-approp compare --base <BASE> --current <CURRENT> [OPTIONS]

FY-based (compare all bills for one fiscal year against another):

congress-approp compare --base-fy <YEAR> --current-fy <YEAR> --dir <DIR> [OPTIONS]

| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| `--base` | | path | | Base directory for comparison (e.g., prior fiscal year) |
| `--current` | | path | | Current directory for comparison (e.g., current fiscal year) |
| `--base-fy` | | integer | | Use all bills covering this FY as the base set (alternative to `--base`) |
| `--current-fy` | | integer | | Use all bills covering this FY as the current set (alternative to `--current`) |
| `--dir` | | path | `./data` | Data directory (required with `--base-fy`/`--current-fy`) |
| `--subcommittee` | | string | | Scope comparison to one subcommittee jurisdiction. Requires `enrich`. |
| `--agency` | `-a` | string | | Filter by agency name (case-insensitive substring) |
| `--real` | | flag | | Add an inflation-adjusted "Real Δ %*" column using CPI-U. Shows which programs beat inflation (▲) and which fell behind (▼). |
| `--cpi-file` | | path | | Path to a custom CPI/deflator JSON file. Overrides the bundled CPI-U data. See Adjust for Inflation for the file format. |
| `--format` | | string | `table` | Output format: `table`, `json`, `csv` |

You must provide either --base + --current (directory paths) or --base-fy + --current-fy + --dir.
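
To illustrate what an inflation-adjusted column like Real Δ %* computes, here is a hedged Python sketch: restate the base amount in current-year dollars using the CPI ratio, then take the percentage change. The index values are placeholders, not the tool's bundled CPI-U data, and the tool's exact formula may differ:

```python
def real_delta_pct(base_dollars, current_dollars, cpi_base, cpi_current):
    # Restate the base amount in current-year dollars, then take the change.
    base_in_current = base_dollars * (cpi_current / cpi_base)
    return (current_dollars / base_in_current - 1.0) * 100.0

# Toy numbers: $100 -> $110 nominal, with CPI-U rising from 300.0 to 318.0 (+6%).
nominal = (110 / 100 - 1) * 100
real = real_delta_pct(100, 110, 300.0, 318.0)
print(f"nominal {nominal:+.1f}%, real {real:+.2f}%")  # nominal +10.0%, real +3.77%
```

A program can grow in nominal dollars yet shrink in real terms whenever inflation outpaces its nominal growth, which is exactly what the ▲/▼ markers distinguish.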

Examples

# Compare omnibus to supplemental (directory-based)
congress-approp compare --base data/118-hr4366 --current data/118-hr9468

# Compare THUD funding: FY2024 → FY2026 (FY-based with subcommittee scope)
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data

# Compare all FY2024 vs FY2026 (no subcommittee scope)
congress-approp compare --base-fy 2024 --current-fy 2026 --dir data

# Show inflation-adjusted changes (which programs beat inflation?)
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data --real

# Filter to VA accounts only
congress-approp compare --base data/118-hr4366 --current data/118-hr9468 --agency "Veterans"

# Export comparison to CSV
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data --format csv > thud_compare.csv

Matching Behavior

Account matching uses several normalization layers:

  • Case-insensitive: “Grants-In-Aid for Airports” matches “Grants-in-Aid for Airports”
  • Em-dash prefix stripping: “Department of VA—Compensation and Pensions” matches “Compensation and Pensions”
  • Sub-agency normalization: “Maritime Administration” matches “Department of Transportation” for the same account name
  • Hierarchical CR name matching: “Federal Emergency Management Agency—Disaster Relief Fund” matches “Disaster Relief Fund”

Output Columns

| Column | Description |
|---|---|
| Account | Account name, matched between bills |
| Agency | Parent department or agency |
| Base ($) | Budget authority in the `--base` or `--base-fy` bills |
| Current ($) | Budget authority in the `--current` or `--current-fy` bills |
| Delta ($) | Current minus Base |
| Δ % | Percentage change |
| Status | `changed`, `unchanged`, `only in base`, or `only in current` |

Results are sorted by absolute delta, largest changes first. The tool warns when comparing different bill classifications (e.g., Omnibus vs. Supplemental).
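
The first two normalization layers (case-insensitivity and em-dash prefix stripping) can be sketched in a few lines of Python. This illustrates the idea only; it is not the tool's actual normalization code:

```python
def normalize_account(name: str) -> str:
    # Keep only the text after the last em-dash prefix, then lowercase.
    # A sketch of the first two matching layers, not the real implementation.
    return name.rsplit("\u2014", 1)[-1].strip().lower()

a = normalize_account("Department of VA\u2014Compensation and Pensions")
b = normalize_account("Compensation and Pensions")
print(a == b)  # True — both normalize to "compensation and pensions"
```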


audit

Show a detailed verification and quality report for all extracted bills.

congress-approp audit [OPTIONS]

| Flag | Type | Default | Description |
|---|---|---|---|
| `--dir` | path | `./data` | Data directory to audit. Try `data` for the included FY2019–FY2026 dataset. |
| `--verbose` | flag | | Show individual problematic provisions (those with `not_found` amounts or `no_match` raw text) |

Examples

# Standard audit
congress-approp audit --dir data

# Verbose — see individual problematic provisions
congress-approp audit --dir data --verbose

Output

┌───────────┬────────────┬──────────┬──────────┬───────┬───────┬──────────┬───────────┬──────────┬──────────┐
│ Bill      ┆ Provisions ┆ Verified ┆ NotFound ┆ Ambig ┆ Exact ┆ NormText ┆ Spaceless ┆ TextMiss ┆ Coverage │
╞═══════════╪════════════╪══════════╪══════════╪═══════╪═══════╪══════════╪═══════════╪══════════╪══════════╡
│ H.R. 4366 ┆       2364 ┆      762 ┆        0 ┆   723 ┆  2285 ┆       59 ┆         0 ┆       20 ┆    94.2% │
│ H.R. 5860 ┆        130 ┆       33 ┆        0 ┆     2 ┆   102 ┆       12 ┆         0 ┆       16 ┆    61.1% │
│ H.R. 9468 ┆          7 ┆        2 ┆        0 ┆     0 ┆     5 ┆        0 ┆         0 ┆        2 ┆   100.0% │
│ TOTAL     ┆       2501 ┆      797 ┆        0 ┆   725 ┆  2392 ┆       71 ┆         0 ┆       38 ┆          │
└───────────┴────────────┴──────────┴──────────┴───────┴───────┴──────────┴───────────┴──────────┴──────────┘

Column Reference

Amount verification (left side):

| Column | Description |
|---|---|
| Verified | Dollar amount found at exactly one position in source text |
| NotFound | Dollar amount NOT found in source — should be 0; review manually if > 0 |
| Ambig | Dollar amount found at multiple positions — correct but location is uncertain |

Raw text verification (right side):

| Column | Description |
|---|---|
| Exact | `raw_text` is a byte-identical substring of source text |
| NormText | `raw_text` matches after whitespace/quote/dash normalization |
| Spaceless | `raw_text` matches only after removing all spaces |
| TextMiss | `raw_text` not found at any tier — may be paraphrased or truncated |

Completeness:

| Column | Description |
|---|---|
| Coverage | Percentage of dollar strings in source text matched to a provision. See What Coverage Means. |

See Understanding the Output and Verify Extraction Accuracy for detailed interpretation guidance.
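
As a rough illustration of the coverage heuristic, the sketch below counts dollar strings in a snippet of source text and checks their values against a set of extracted amounts. The regex and the value-membership test are simplified assumptions, not the tool's exact matching algorithm:

```python
import re

def coverage(source_text: str, extracted_amounts: set[int]) -> float:
    # Find every "$1,234,..." string and check whether its numeric value
    # appears among the extracted provision amounts.
    dollar_strings = re.findall(r"\$([\d,]+)", source_text)
    values = [int(s.replace(",", "")) for s in dollar_strings]
    if not values:
        return 100.0
    matched = sum(1 for v in values if v in extracted_amounts)
    return 100.0 * matched / len(values)

text = ("$5,000,000 for operations, not to exceed $1,000,000, "
        "and $250,000 under section 3")
# $1,000,000 is a limitation ceiling, so it was (correctly) not extracted:
print(f"{coverage(text, {5_000_000, 250_000}):.1f}%")  # 66.7%
```

Note that the unmatched string lowers coverage even though the extraction is correct, which is why coverage is a completeness heuristic rather than an accuracy measure.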


download

Download appropriations bill XML from Congress.gov.

congress-approp download [OPTIONS] --congress <CONGRESS>

| Flag | Type | Default | Description |
|---|---|---|---|
| `--congress` | integer | (required) | Congress number (e.g., 118 for 2023–2024) |
| `--type` | string | | Bill type code: `hr`, `s`, `hjres`, `sjres` |
| `--number` | integer | | Bill number (used with `--type` for single-bill download) |
| `--output-dir` | path | `./data` | Output directory. Intermediate directories are created as needed. |
| `--enacted-only` | flag | | Only download bills signed into law |
| `--format` | string | `xml` | Download format: `xml` (for extraction), `pdf` (for reading). Comma-separated for multiple. |
| `--version` | string | | Text version filter: `enr` (enrolled/final), `ih` (introduced), `eh` (engrossed). When omitted, only enrolled is downloaded. |
| `--all-versions` | flag | | Download all text versions (introduced, engrossed, enrolled, etc.) instead of just enrolled |
| `--dry-run` | flag | | Show what would be downloaded without fetching |

Requires: CONGRESS_API_KEY environment variable.

Examples

# Download a specific bill (enrolled version only, by default)
congress-approp download --congress 118 --type hr --number 4366 --output-dir data

# Download all enacted bills for a congress (enrolled versions only)
congress-approp download --congress 118 --enacted-only --output-dir data

# Preview without downloading
congress-approp download --congress 118 --enacted-only --output-dir data --dry-run

# Download both XML and PDF
congress-approp download --congress 118 --type hr --number 4366 --output-dir data --format xml,pdf

# Download all text versions (introduced, engrossed, enrolled, etc.)
congress-approp download --congress 118 --type hr --number 4366 --output-dir data --all-versions

extract

Extract spending provisions from bill XML using Claude. Parses the XML, sends text chunks to the LLM in parallel, merges results, and runs deterministic verification.

congress-approp extract [OPTIONS]

| Flag | Type | Default | Description |
|---|---|---|---|
| `--dir` | path | `./data` | Data directory containing downloaded bill XML |
| `--dry-run` | flag | | Show chunk count and estimated tokens without calling the LLM |
| `--parallel` | integer | 5 | Number of concurrent LLM API calls. Higher is faster but uses more API quota. |
| `--model` | string | `claude-opus-4-6` | LLM model for extraction. Can also be set via the `APPROP_MODEL` env var. The flag takes precedence. |
| `--force` | flag | | Re-extract bills even if `extraction.json` already exists. Without this flag, already-extracted bills are skipped. |
| `--continue-on-error` | flag | | Save partial results when some chunks fail. Without this flag, the tool aborts a bill if any chunk permanently fails and does not write `extraction.json`. |

Requires: ANTHROPIC_API_KEY environment variable (not required if all bills are already extracted).

Behavior notes:

  • Aborts on chunk failure by default. If any chunk permanently fails (after all retries), the bill’s extraction is aborted and no extraction.json is written. This prevents garbage partial extractions from being saved to disk. Use --continue-on-error to save partial results instead.
  • Per-bill error handling. In a multi-bill run, a failure on one bill does not abort the entire run. The failed bill is skipped (no files written) and extraction continues with the remaining bills. Re-running the same command retries only the failed bills.
  • Skips already-extracted bills by default. If every bill in --dir already has extraction.json, the command exits without requiring an API key. Use --force to re-extract.
  • Prefers enrolled XML. When a directory has multiple BILLS-*.xml files, only the enrolled version (*enr.xml) is processed. Non-enrolled versions are ignored.
  • Resilient to parse failures. If an XML file fails to parse (e.g., a non-enrolled version with a different structure), the tool logs a warning and continues to the next bill instead of aborting.

Examples

# Preview extraction (no API calls)
congress-approp extract --dir data/118/hr/9468 --dry-run

# Extract a single bill
congress-approp extract --dir data/118/hr/9468

# Extract with higher parallelism for large bills
congress-approp extract --dir data/118/hr/4366 --parallel 8

# Extract all bills under a directory (skips already-extracted bills)
congress-approp extract --dir data --parallel 6

# Re-extract a bill that was already processed
congress-approp extract --dir data/118/hr/9468 --force

# Save partial results even when some chunks fail (rate limiting, etc.)
congress-approp extract --dir data/118/hr/2882 --parallel 6 --continue-on-error

# Use a different model
congress-approp extract --dir data/118/hr/9468 --model claude-sonnet-4-20250514

Output Files

| File | Description |
|---|---|
| `extraction.json` | All provisions with structured fields |
| `verification.json` | Deterministic verification against source text |
| `metadata.json` | Model, prompt version, timestamps, source XML hash |
| `tokens.json` | Token usage (input, output, cache) |
| `chunks/` | Per-chunk LLM artifacts (gitignored) |

embed

Generate semantic embedding vectors for extracted provisions using OpenAI’s embedding model. Enables --semantic and --similar on the search command.

congress-approp embed [OPTIONS]

| Flag | Type | Default | Description |
|---|---|---|---|
| `--dir` | path | `./data` | Data directory containing extracted bills |
| `--model` | string | `text-embedding-3-large` | OpenAI embedding model |
| `--dimensions` | integer | 3072 | Number of dimensions to request from the API |
| `--batch-size` | integer | 100 | Provisions per API batch call |
| `--dry-run` | flag | | Preview token counts without calling the API |

Requires: OPENAI_API_KEY environment variable.

Bills with up-to-date embeddings are automatically skipped (detected via hash chain).

Examples

# Generate embeddings for all bills
congress-approp embed --dir data

# Preview without calling API
congress-approp embed --dir data --dry-run

# Generate for a single bill
congress-approp embed --dir data/118/hr/9468

# Use fewer dimensions (not recommended — see Generate Embeddings guide)
congress-approp embed --dir data --dimensions 1024

Output Files

| File | Description |
|---|---|
| `embeddings.json` | Metadata: model, dimensions, count, SHA-256 hashes |
| `vectors.bin` | Raw little-endian float32 vectors (count × dimensions × 4 bytes) |
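
Given the documented vectors.bin layout (count × dimensions little-endian float32 values), the file can be read with nothing but the Python standard library. A sketch using toy 3-dimensional vectors rather than real embedding output:

```python
import math
import struct

def read_vectors(data: bytes, count: int, dims: int):
    # vectors.bin layout: count * dims little-endian f32, 4 bytes each.
    assert len(data) == count * dims * 4
    floats = struct.unpack(f"<{count * dims}f", data)
    return [list(floats[i * dims:(i + 1) * dims]) for i in range(count)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Two toy vectors packed the same way the file is written.
blob = struct.pack("<6f", 1.0, 0.0, 0.0, 0.0, 1.0, 0.0)
vecs = read_vectors(blob, count=2, dims=3)
print(cosine(vecs[0], vecs[1]))  # 0.0 — orthogonal vectors
```

For real data, `count` and `dims` would come from the metadata in `embeddings.json`.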

enrich

Generate bill metadata for fiscal year filtering, subcommittee scoping, and advance appropriation classification. This command parses the source XML and analyzes the extraction output — no API keys are required.

congress-approp enrich [OPTIONS]

| Flag | Type | Default | Description |
|---|---|---|---|
| `--dir` | path | `./data` | Data directory containing extracted bills |
| `--dry-run` | flag | | Preview what would be generated without writing files |
| `--force` | flag | | Re-enrich even if `bill_meta.json` already exists |

What It Generates

For each bill directory, enrich creates a bill_meta.json file containing:

  • Congress number — parsed from the XML filename
  • Subcommittee mappings — division letter → jurisdiction (e.g., Division A → Defense)
  • Bill nature — enriched classification (omnibus, minibus, full-year CR with appropriations, etc.)
  • Advance appropriation classification — each budget authority provision classified as current-year, advance, or supplemental using a fiscal-year-aware algorithm
  • Canonical account names — case-normalized, prefix-stripped names for cross-bill matching

Examples

# Enrich all bills
congress-approp enrich --dir data

# Preview without writing files
congress-approp enrich --dir data --dry-run

# Force re-enrichment
congress-approp enrich --dir data --force

When to Run

Run enrich once after extracting bills, before using --subcommittee filters. The --fy flag on other commands works without enrich (it uses fiscal year data already in extraction.json), but --subcommittee requires the division-to-jurisdiction mapping that only enrich provides.

The tool warns when bill_meta.json is stale (when extraction.json has changed since enrichment). Run enrich --force to regenerate.

See Enrich Bills with Metadata for a detailed guide including subcommittee slugs, advance classification algorithm, and provenance tracking.


verify-text

Check that every provision’s raw_text is a verbatim substring of the enrolled bill source text. Optionally repair mismatches and add source_span byte positions. No API key required.

congress-approp verify-text [OPTIONS]
  --dir <DIR>       Data directory [default: ./data]
  --repair          Fix broken raw_text and add source_span to every provision
  --bill <BILL>     Single bill directory (e.g., 118-hr2882)
  --format <FMT>    Output format: table, json [default: table]

Examples

# Analyze all bills (no changes)
congress-approp verify-text --dir data

# Repair and add source spans
congress-approp verify-text --dir data --repair

# Single bill
congress-approp verify-text --dir data --bill 118-hr2882 --repair

Output

Reports the number of provisions at each match tier:

34568 provisions: 34568 exact, 0 repaired (0 prefix, 0 substring, 0 normalized), 0 unverified
Traceable: 34568/34568 (100.000%)

✅ Every provision is traceable to the enrolled bill source text.

When --repair is used, a backup is created at extraction.json.pre-repair before any modifications. Each provision gets a source_span field with UTF-8 byte offsets into the source .txt file.

See Verifying Extraction Data for details on the 3-tier repair algorithm and the source span invariant.
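
The span invariant itself can be checked in a few lines: the bytes at [start, end) of the source text must decode to exactly the provision's raw_text. The sample text and the "start"/"end" field names below are illustrative assumptions about the span's shape:

```python
# A toy source text and provision; real inputs would come from the bill's
# .txt file and extraction.json.
source = "SEC. 101. For an additional amount, $2,882,482,000, to remain available"
source_bytes = source.encode("utf-8")

raw_text = "$2,882,482,000"
start = source_bytes.index(raw_text.encode("utf-8"))
end = start + len(raw_text.encode("utf-8"))
source_span = {"start": start, "end": end}  # assumed field names

# The invariant: slicing the source bytes by the span reproduces raw_text.
span_ok = source_bytes[source_span["start"]:source_span["end"]].decode("utf-8") == raw_text
print(span_ok)  # True
```

Because the offsets are UTF-8 byte offsets (not character offsets), the slice must be taken on the encoded bytes, as above.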


resolve-tas

Map each top-level budget authority provision to a Federal Account Symbol (FAS) code from the Treasury’s FAST Book. Uses deterministic string matching for unambiguous names and Claude Opus for the rest.

congress-approp resolve-tas [OPTIONS]
  --dir <DIR>              Data directory [default: ./data]
  --bill <BILL>            Single bill directory (e.g., 118-hr2882)
  --dry-run                Show what would be resolved and estimated cost
  --no-llm                 Deterministic matching only (no API key needed)
  --force                  Re-resolve even if tas_mapping.json exists
  --batch-size <N>         Provisions per LLM batch [default: 40]
  --fas-reference <PATH>   Path to FAS reference JSON [default: data/fas_reference.json]

Requires ANTHROPIC_API_KEY for the LLM tier. With --no-llm, no API key is needed (resolves ~56% of provisions).

Examples

# Preview cost before running
congress-approp resolve-tas --dir data --dry-run

# Full resolution (deterministic + LLM)
congress-approp resolve-tas --dir data

# Free mode (deterministic only, no API key)
congress-approp resolve-tas --dir data --no-llm

# Single bill
congress-approp resolve-tas --dir data --bill 118-hr2882

Output

Produces tas_mapping.json per bill with one mapping per top-level budget authority provision. Reports match rates:

6685 provisions: 6645 matched (99.4%), 40 unmatched
  Deterministic: 3731, LLM: 2914

See Resolving Treasury Account Symbols for details on the two-tier matching algorithm, confidence levels, and the FAST Book reference.


authority build

Aggregate all tas_mapping.json files into a single authorities.json account registry at the data root. Groups provisions by FAS code, collects name variants, and detects rename events.

congress-approp authority build [OPTIONS]
  --dir <DIR>       Data directory [default: ./data]
  --force           Rebuild even if authorities.json already exists

No API key required. Runs in ~1 second.

Example

congress-approp authority build --dir data

# Output:
# Built authorities.json:
#   1051 authorities, 6645 provisions, 24 bills, FYs [2019, 2020, ..., 2026]
#   937 in multiple bills, 443 with name variants

authority list

Browse the account authority registry. Shows FAS code, bill count, fiscal years, total budget authority, and official title for each authority.

congress-approp authority list [OPTIONS]
  --dir <DIR>       Data directory [default: ./data]
  --agency <CODE>   Filter by CGAC agency code (e.g., 070 for DHS)
  --format <FMT>    Output format: table, json [default: table]

Examples

# List all authorities
congress-approp authority list --dir data

# Filter to DHS accounts
congress-approp authority list --dir data --agency 070

# JSON for programmatic use
congress-approp authority list --dir data --format json

trace

Show the funding timeline for a federal budget account across all fiscal years in the dataset. Accepts a FAS code or a name search query.

congress-approp trace <QUERY> [OPTIONS]
  <QUERY>           FAS code (e.g., 070-0400) or account name fragment
  --dir <DIR>       Data directory [default: ./data]
  --format <FMT>    Output format: table, json [default: table]

Name search splits the query into words and matches authorities where all words appear across the title, agency name, FAS code, and name variants. If multiple authorities match, the command lists candidates and asks you to be more specific.
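
The all-words rule can be sketched as follows; the authority record shape (and its field names) is an assumption for illustration only:

```python
def matches(query: str, authority: dict) -> bool:
    # An authority matches when every query word appears somewhere across
    # its title, agency name, FAS code, and name variants.
    haystack = " ".join(
        [authority["title"], authority["agency"], authority["fas"]]
        + authority.get("variants", [])
    ).lower()
    return all(word in haystack for word in query.lower().split())

auth = {"title": "Operating Expenses", "agency": "United States Coast Guard",
        "fas": "070-0610", "variants": ["Coast Guard Operations and Support"]}
print(matches("coast guard operations", auth))   # True
print(matches("coast guard procurement", auth))  # False
```

Since each word may match a different field, a query like "coast guard operations" succeeds even though no single field contains all three words.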

Examples

# By FAS code (exact)
congress-approp trace 070-0400 --dir data

# By name (word-level search)
congress-approp trace "coast guard operations" --dir data
congress-approp trace "disaster relief" --dir data

# JSON output
congress-approp trace 070-0400 --dir data --format json

Output

TAS 070-0400: Operations and Support, United States Secret Service, Homeland Security
  Agency: Department of Homeland Security

┌──────┬──────────────────────┬────────────────┬──────────────────────────────┐
│ FY   ┆ Budget Authority ($) ┆ Bill(s)        ┆ Account Name(s)              │
╞══════╪══════════════════════╪════════════════╪══════════════════════════════╡
│ 2020 ┆        2,336,401,000 ┆ H.R. 1158      ┆ United States Secret Servi…  │
│ 2021 ┆        2,373,109,000 ┆ H.R. 133       ┆ United States Secret Servi…  │
│ 2022 ┆        2,554,729,000 ┆ H.R. 2471      ┆ Operations and Support       │
│ 2024 ┆        3,007,982,000 ┆ H.R. 2882      ┆ Operations and Support       │
│ 2025 ┆          231,000,000 ┆ H.R. 9747 (CR) ┆ United States Secret Servi…  │
└──────┴──────────────────────┴────────────────┴──────────────────────────────┘

Bill classification labels — (CR), (supplemental), (full-year CR) — are shown when the bill is not a regular or omnibus appropriation. Detected rename events are shown below the timeline. Name variants are listed with their classification type.

See The Authority System for details on how account tracking works across fiscal years.


normalize suggest-text-match

Discover agency and account naming variants using orphan-pair analysis and structural regex patterns. Scans all bills for cross-FY orphan pairs (same account name, different agency) and common naming patterns (prefix expansion, preposition variants, abbreviation differences). Results are cached for the normalize accept command.

No API calls. No network access. Runs in milliseconds.

congress-approp normalize suggest-text-match [OPTIONS]
  --dir <DIR>            Data directory [default: ./data]
  --format <FORMAT>      Output format: table, json, hashes [default: table]
  --min-accounts <N>     Minimum shared accounts to include a suggestion [default: 1]

Use --format hashes to output one hash per line for scripting. Use --min-accounts 3 to filter to stronger suggestions (pairs sharing 3+ account names).

Suggestions are cached in ~/.congress-approp/cache/ and consumed by normalize accept.


normalize suggest-llm

Discover agency and account naming variants using LLM classification with XML heading context. Sends unresolved ambiguous account clusters to Claude with the bill’s XML organizational structure, dollar amounts, and fiscal year information. The LLM classifies agency pairs as SAME or DIFFERENT.

Requires ANTHROPIC_API_KEY. Uses Claude Opus.

congress-approp normalize suggest-llm [OPTIONS]
  --dir <DIR>            Data directory [default: ./data]
  --batch-size <N>       Maximum clusters per API call [default: 15]
  --format <FORMAT>      Output format: table, json, hashes [default: table]

Only processes clusters not already resolved by suggest-text-match or existing dataset.json entries. Results are cached for the normalize accept command.


normalize accept

Accept suggested normalizations by hash. Reads from the suggestion cache populated by suggest-text-match or suggest-llm, matches the specified hashes, and writes the accepted groups to dataset.json.

congress-approp normalize accept [OPTIONS] [HASHES]...
  --dir <DIR>            Data directory [default: ./data]
  --auto                 Accept all cached suggestions without specifying hashes

If no cache exists, prints an error suggesting to run suggest-text-match first.


normalize list

Display current entity resolution rules from dataset.json.

congress-approp normalize list [OPTIONS]
  --dir <DIR>            Data directory [default: ./data]

Shows all agency groups and account aliases. If no dataset.json exists, shows a helpful message suggesting how to create one.


relate

Deep-dive on one provision across all bills. Finds similar provisions by embedding similarity, groups them by confidence tier, and optionally builds a fiscal year timeline with advance/current/supplemental split. Requires pre-computed embeddings but no API keys (uses stored vectors).

congress-approp relate <SOURCE> [OPTIONS]

The <SOURCE> argument is a provision reference in the format bill_directory:index (e.g., 118-hr9468:0). Use the provision_index from search output.

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --dir | path | ./data | Data directory |
| --top | integer | 10 | Max related provisions per confidence tier |
| --format | string | table | Output format: table, json, hashes |
| --fy-timeline | flag | | Show fiscal year timeline with advance/current/supplemental split |

Output

The table output shows two sections:

  • Same Account — high-confidence matches (verified name match or high similarity + same agency). Each row includes a deterministic 8-char hash, similarity score, bill, account name, dollar amount, funding timing, and confidence label.
  • Related — lower-confidence matches (uncertain zone, 0.55–0.65 similarity or name mismatch).

With --fy-timeline, a third section shows the fiscal year timeline: current-year BA, advance BA, supplemental BA, and contributing bills for each fiscal year.

Examples

# Deep-dive on VA Compensation and Pensions
congress-approp relate 118-hr9468:0 --dir data --fy-timeline

# Get just the link hashes for piping to `link accept`
congress-approp relate 118-hr9468:0 --dir data --format hashes

# JSON output with timeline
congress-approp relate 118-hr9468:0 --dir data --format json --fy-timeline

Each match includes a deterministic 8-character hex hash (e.g., b7e688d7). These hashes are computed from the source provision, target provision, and embedding model — the same inputs always produce the same hash. Use --format hashes to output just the hashes of same-account matches, suitable for piping to link accept:

congress-approp relate 118-hr9468:0 --dir data --format hashes | \
  xargs congress-approp link accept --dir data
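
The determinism property can be reproduced with a short sketch. The exact serialization the tool hashes is an assumption here (as are the target reference and model name in the demo); only the documented behavior is modeled — the same three inputs always yield the same 8 hex characters.

```python
import hashlib

def link_hash(source_ref: str, target_ref: str, model: str) -> str:
    # Illustrative only: derive a deterministic 8-char hex digest from
    # the source provision, target provision, and embedding model.
    payload = f"{source_ref}|{target_ref}|{model}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]

a = link_hash("118-hr9468:0", "118-hr4366:812", "embedding-model")
b = link_hash("118-hr9468:0", "118-hr4366:812", "embedding-model")
assert a == b and len(a) == 8  # same inputs, same hash
```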

link suggest

Compute cross-bill link candidates from embeddings. For each top-level budget authority provision, finds the best match in every other bill above the similarity threshold and classifies it by confidence tier.

congress-approp link suggest [OPTIONS]

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --dir | path | ./data | Data directory |
| --threshold | float | 0.55 | Minimum similarity for candidates |
| --scope | string | all | Which bill pairs to compare: intra (within same FY), cross (across FYs), all |
| --limit | integer | 100 | Max candidates to output |
| --format | string | table | Output format: table, json, hashes |

Confidence Tiers

Based on empirically calibrated thresholds from analysis of 6.7M pairwise comparisons:

| Tier | Criteria | Meaning |
|------|----------|---------|
| verified | Canonical account name match (case-insensitive, prefix-stripped) | Almost certainly the same account |
| high | Similarity ≥ 0.65 AND same normalized agency | Very likely the same account |
| uncertain | Similarity 0.55–0.65, or name mismatch above 0.65 | Needs manual review |
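
The tier criteria can be expressed as a small decision function. This is a sketch of the documented rules, not the tool's actual implementation:

```python
def classify_tier(similarity: float, name_match: bool, same_agency: bool):
    # Mirrors the tier table: a name match wins, then similarity + agency.
    if name_match:
        return "verified"
    if similarity >= 0.65 and same_agency:
        return "high"
    if similarity >= 0.55:
        return "uncertain"  # 0.55-0.65 zone, or name mismatch above 0.65
    return None  # below threshold: not a candidate at all

assert classify_tier(0.90, True, False) == "verified"
assert classify_tier(0.70, False, True) == "high"
assert classify_tier(0.70, False, False) == "uncertain"
assert classify_tier(0.50, False, False) is None
```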

Examples

# Cross-fiscal-year candidates (year-over-year tracking)
congress-approp link suggest --dir data --scope cross --limit 20

# All candidates above 0.65 similarity
congress-approp link suggest --dir data --threshold 0.65 --limit 50

# Output just the hashes of new (un-accepted) candidates
congress-approp link suggest --dir data --format hashes

link accept

Persist link candidates by accepting them into links/links.json at the data root.

congress-approp link accept [OPTIONS] [HASHES...]

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --dir | path | ./data | Data directory |
| --note | string | | Optional annotation (e.g., “Account renamed from X to Y”) |
| --auto | flag | | Accept all verified + high-confidence candidates without specifying hashes |
| HASHES | positional | | One or more 8-char link hashes to accept |

Examples

# Accept specific links by hash
congress-approp link accept --dir data a3f7b2c4 e5d1c8a9

# Accept with a note
congress-approp link accept --dir data a3f7b2c4 --note "Same VA account, different bill vehicles"

# Auto-accept all verified and high-confidence candidates
congress-approp link accept --dir data --auto

# Pipe from relate output
congress-approp relate 118-hr9468:0 --dir data --format hashes | \
  xargs congress-approp link accept --dir data

link remove

Remove accepted links by hash.

congress-approp link remove --dir <DIR> <HASHES...>

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --dir | path | ./data | Data directory |
| HASHES | positional | (required) | One or more 8-char link hashes to remove |

Example

congress-approp link remove --dir data a3f7b2c4

link list

Show accepted links, optionally filtered by bill.

congress-approp link list [OPTIONS]

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --dir | path | ./data | Data directory |
| --format | string | table | Output format: table, json |
| --bill | string | | Filter to links involving this bill (case-insensitive substring) |

Examples

# Show all accepted links
congress-approp link list --dir data

# Filter to links involving H.R. 4366
congress-approp link list --dir data --bill hr4366

# JSON output for programmatic use
congress-approp link list --dir data --format json

compare --use-authorities

The compare command accepts a --use-authorities flag that rescues orphan provisions by matching on TAS code instead of account name. When two provisions share a TAS code but have different names or agency attributions, they are recognized as the same account.

congress-approp compare --base-fy 2024 --current-fy 2026 \
    --subcommittee thud --dir data --use-authorities

Requires tas_mapping.json files for the bills being compared (run resolve-tas first). Orphan provisions rescued via TAS matching are labeled with their TAS code in the status column (e.g., matched (TAS 069-1775)).

This flag can be combined with --use-links, --real, and --exact. Entity resolution via dataset.json still applies unless --exact is specified.


upgrade

Upgrade extraction data to the latest schema version. Re-deserializes existing data through the current parsing logic and re-runs verification. No LLM API calls.

congress-approp upgrade [OPTIONS]

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --dir | path | ./data | Data directory to upgrade |
| --dry-run | flag | | Show what would change without writing files |

Examples

# Preview changes
congress-approp upgrade --dir data --dry-run

# Upgrade all bills
congress-approp upgrade --dir data

# Upgrade a single bill
congress-approp upgrade --dir data/118/hr/9468

api test

Test API connectivity for Congress.gov and Anthropic.

congress-approp api test

Verifies that CONGRESS_API_KEY and ANTHROPIC_API_KEY are set and that both APIs are reachable. No flags.


api bill list

List appropriations bills for a given congress.

congress-approp api bill list [OPTIONS]

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| --congress | integer | (required) | Congress number |
| --type | string | | Filter by bill type (hr, s, hjres, sjres) |
| --offset | integer | 0 | Pagination offset |
| --limit | integer | 20 | Maximum results per page |
| --enacted-only | flag | | Only show enacted (signed into law) bills |

Requires: CONGRESS_API_KEY

Examples

# All appropriations bills for the 118th Congress
congress-approp api bill list --congress 118

# Only enacted bills
congress-approp api bill list --congress 118 --enacted-only

api bill get

Get metadata for a specific bill.

congress-approp api bill get --congress <N> --type <TYPE> --number <N>

| Flag | Type | Description |
|------|------|-------------|
| --congress | integer | Congress number |
| --type | string | Bill type (hr, s, hjres, sjres) |
| --number | integer | Bill number |

Requires: CONGRESS_API_KEY


api bill text

Get text versions and download URLs for a bill.

congress-approp api bill text --congress <N> --type <TYPE> --number <N>

| Flag | Type | Description |
|------|------|-------------|
| --congress | integer | Congress number |
| --type | string | Bill type (hr, s, hjres, sjres) |
| --number | integer | Bill number |

Requires: CONGRESS_API_KEY

Lists every text version (introduced, engrossed, enrolled, etc.) with available formats (XML, PDF, HTML) and download URLs.

Example

congress-approp api bill text --congress 118 --type hr --number 4366

Common Patterns

Query pre-extracted example data (no API keys needed)

congress-approp summary --dir data
congress-approp search --dir data --type appropriation
congress-approp audit --dir data
congress-approp compare --base data/118-hr4366 --current data/118-hr9468

Full extraction pipeline

export CONGRESS_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export OPENAI_API_KEY="..."

congress-approp download --congress 118 --enacted-only --output-dir data
congress-approp extract --dir data --parallel 6
congress-approp audit --dir data
congress-approp embed --dir data
congress-approp summary --dir data

Export workflows

# All appropriations to CSV
congress-approp search --dir data --type appropriation --format csv > all.csv

# JSON for jq processing
congress-approp search --dir data --format json | jq '.[].account_name' | sort -u

# JSONL for streaming
congress-approp search --dir data --format jsonl | while IFS= read -r line; do echo "$line" | jq '.dollars'; done

Environment Variables

| Variable | Used By | Description |
|----------|---------|-------------|
| CONGRESS_API_KEY | download, api commands | Congress.gov API key (free signup) |
| ANTHROPIC_API_KEY | extract | Anthropic API key for Claude |
| OPENAI_API_KEY | embed, search --semantic | OpenAI API key for embeddings |
| APPROP_MODEL | extract | Override default LLM model (flag takes precedence) |

See Environment Variables and API Keys for details.

Next Steps

Provision Types

Quick reference for all 11 provision types in the extraction schema. For detailed explanations with real examples and distribution data, see The Provision Type System.

At a Glance

| Type | What It Is | Has Dollar Amount? | Counted in BA? |
|------|------------|--------------------|----------------|
| appropriation | Grant of budget authority | Yes | Yes (at top_level/line_item) |
| rescission | Cancellation of prior funds | Yes | Separately (subtracted for Net BA) |
| cr_substitution | CR anomaly — substituting $X for $Y | Yes (new + old) | No (CR baseline amounts) |
| transfer_authority | Permission to move funds between accounts | Sometimes (ceiling) | No |
| limitation | Cap or prohibition on spending | Sometimes | No |
| directed_spending | Earmark / community project funding | Yes | Depends on detail_level |
| mandatory_spending_extension | Amendment to authorizing statute | Sometimes | No (tracked separately) |
| directive | Reporting requirement or instruction | No | No |
| rider | Policy provision (no direct spending) | No | No |
| continuing_resolution_baseline | Core CR mechanism (SEC. 101) | No | No |
| other | Catch-all for unclassifiable provisions | Sometimes | No |

Common Fields (All Types)

Every provision carries these fields regardless of type:

| Field | Type | Description |
|-------|------|-------------|
| provision_type | string | The type discriminator |
| section | string | Section header (e.g., "SEC. 101"). Empty string if none. |
| division | string or null | Division letter (e.g., "A"). Null for bills without divisions. |
| title | string or null | Title numeral (e.g., "IV"). Null if not determinable. |
| confidence | float | LLM self-assessed confidence, 0.0–1.0. Not calibrated — useful only for identifying outliers below 0.90. |
| raw_text | string | Verbatim excerpt from the bill text (~first 150 characters). Verified against source. |
| notes | array of strings | Explanatory annotations (e.g., “advance appropriation”, “no-year funding”). |
| cross_references | array of CrossReference | References to other laws, sections, or bills. |

CrossReference Fields

| Field | Type | Description |
|-------|------|-------------|
| ref_type | string | Relationship: baseline_from, amends, notwithstanding, subject_to, see_also, transfer_to, rescinds_from, modifies, references, other |
| target | string | The referenced law or section (e.g., "31 U.S.C. 1105(a)") |
| description | string or null | Optional clarifying note |

appropriation

Grant of budget authority — the core spending provision.

Bill text pattern: “For necessary expenses of [account], $X,XXX,XXX,XXX…”

| Field | Type | Description |
|-------|------|-------------|
| account_name | string | Appropriations account name from '' delimiters in bill text |
| agency | string or null | Parent department or agency |
| program | string or null | Sub-account or program name |
| amount | Amount | Dollar amount with semantics |
| fiscal_year | integer or null | Fiscal year the funds are available for |
| availability | string or null | Fund availability (e.g., "to remain available until expended") |
| provisos | array of Proviso | “Provided, That” conditions |
| earmarks | array of Earmark | Community project funding items |
| detail_level | string | "top_level", "line_item", "sub_allocation", or "proviso_amount" |
| parent_account | string or null | Parent account for sub-allocations |

Budget authority: Counted when semantics == "new_budget_authority" AND detail_level is "top_level" or "line_item". Sub-allocations and proviso amounts are excluded to prevent double-counting.

Example (from H.R. 9468):

{
  "provision_type": "appropriation",
  "account_name": "Compensation and Pensions",
  "agency": "Department of Veterans Affairs",
  "amount": {
    "value": { "kind": "specific", "dollars": 2285513000 },
    "semantics": "new_budget_authority",
    "text_as_written": "$2,285,513,000"
  },
  "detail_level": "top_level",
  "availability": "to remain available until expended",
  "fiscal_year": 2024,
  "confidence": 0.99,
  "raw_text": "For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended."
}

Count in example data: 1,223 (49% of all provisions)


rescission

Cancellation of previously appropriated funds.

Bill text pattern: “…is hereby rescinded” or “Of the unobligated balances… $X is rescinded”

| Field | Type | Description |
|-------|------|-------------|
| account_name | string | Account being rescinded from |
| agency | string or null | Department or agency |
| amount | Amount | Dollar amount (semantics: "rescission") |
| reference_law | string or null | The law whose funds are being rescinded |
| fiscal_years | string or null | Which fiscal years’ funds are affected |

Budget authority: Summed separately and subtracted to produce Net BA.

Example (from H.R. 4366):

{
  "provision_type": "rescission",
  "account_name": "Nonrecurring Expenses Fund",
  "agency": "Department of Health and Human Services",
  "amount": {
    "value": { "kind": "specific", "dollars": 12440000000 },
    "semantics": "rescission",
    "text_as_written": "$12,440,000,000"
  },
  "reference_law": "Fiscal Responsibility Act of 2023"
}

Count in example data: 78 (3.1%)


cr_substitution

Continuing resolution anomaly — substitutes one dollar amount for another.

Bill text pattern: “…shall be applied by substituting ‘$X’ for ‘$Y’…”

| Field | Type | Description |
|-------|------|-------------|
| account_name | string or null | Account affected (null if bill references a statute section) |
| new_amount | Amount | The new dollar amount ($X — the replacement level) |
| old_amount | Amount | The old dollar amount ($Y — the level being replaced) |
| reference_act | string | The act being modified |
| reference_section | string | Section being modified |

Both amounts are independently verified. The search table automatically shows New, Old, and Delta columns.

Example (from H.R. 5860):

{
  "provision_type": "cr_substitution",
  "account_name": "Rural Housing Service—Rural Community Facilities Program Account",
  "new_amount": {
    "value": { "kind": "specific", "dollars": 25300000 },
    "semantics": "new_budget_authority",
    "text_as_written": "$25,300,000"
  },
  "old_amount": {
    "value": { "kind": "specific", "dollars": 75300000 },
    "semantics": "new_budget_authority",
    "text_as_written": "$75,300,000"
  },
  "section": "SEC. 101",
  "division": "A"
}

Count in example data: 13 (all in H.R. 5860)
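
The Delta column is simply the new amount minus the old. A minimal sketch using the documented Amount shape (the helper name is made up for illustration):

```python
def substitution_delta(provision: dict) -> int:
    # Delta = replacement level minus the level being replaced.
    new = provision["new_amount"]["value"]["dollars"]
    old = provision["old_amount"]["value"]["dollars"]
    return new - old

p = {
    "new_amount": {"value": {"kind": "specific", "dollars": 25_300_000}},
    "old_amount": {"value": {"kind": "specific", "dollars": 75_300_000}},
}
assert substitution_delta(p) == -50_000_000  # a $50M reduction vs. the prior level
```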


transfer_authority

Permission to move funds between accounts. The dollar amount is a ceiling, not new spending.

| Field | Type | Description |
|-------|------|-------------|
| from_scope | string | Source account(s) or scope |
| to_scope | string | Destination account(s) or scope |
| limit | TransferLimit | Transfer ceiling (percentage, fixed amount, or description) |
| conditions | array of strings | Conditions that must be met |

Budget authority: Not counted — semantics: "transfer_ceiling".

Count in example data: 77 (all in H.R. 4366)


limitation

Cap or prohibition on spending.

Bill text pattern: “not more than $X”, “none of the funds”, “shall not exceed”

| Field | Type | Description |
|-------|------|-------------|
| description | string | What is being limited |
| amount | Amount or null | Dollar cap, if specified |
| account_name | string or null | Account the limitation applies to |
| parent_account | string or null | Parent account for proviso-based limitations |

Budget authority: Not counted — semantics: "limitation".

Count in example data: 460 (18.4%)


directed_spending

Earmark or community project funding directed to a specific recipient.

| Field | Type | Description |
|-------|------|-------------|
| account_name | string | Account providing the funds |
| amount | Amount | Dollar amount directed |
| earmark | Earmark or null | recipient, location, requesting_member |
| detail_level | string | Typically "sub_allocation" or "line_item" |
| parent_account | string or null | Parent account name |

Note: Most earmarks are in the joint explanatory statement (a separate document), not the enrolled bill XML. Only earmarks in the bill text itself appear here.

Count in example data: 8 (all in H.R. 4366)


mandatory_spending_extension

Amendment to an authorizing statute — extends, modifies, or reauthorizes mandatory programs.

| Field | Type | Description |
|-------|------|-------------|
| program_name | string | Program being extended |
| statutory_reference | string | The statute being amended (e.g., "Section 330B(b)(2) of the Public Health Service Act") |
| amount | Amount or null | Dollar amount if specified |
| period | string or null | Duration of the extension |
| extends_through | string or null | End date or fiscal year |

Count in example data: 84 (40 in omnibus, 44 in CR)


directive

Reporting requirement or instruction to an agency.

| Field | Type | Description |
|-------|------|-------------|
| description | string | What is being directed |
| deadlines | array of strings | Any deadlines mentioned (e.g., "30 days after enactment") |

Budget authority: None — directives don’t carry dollar amounts.

Example (from H.R. 9468):

{
  "provision_type": "directive",
  "description": "Requires the Inspector General of the Department of Veterans Affairs to conduct a review of the circumstances surrounding and underlying causes of the announced VBA funding shortfall for FY2024...",
  "deadlines": ["180 days after enactment"],
  "section": "SEC. 104"
}

Count in example data: 125


rider

Policy provision that doesn’t directly appropriate, rescind, or limit funds.

| Field | Type | Description |
|-------|------|-------------|
| description | string | What the rider does |
| policy_area | string or null | Policy domain if identifiable |

Budget authority: None.

Count in example data: 336


continuing_resolution_baseline

The core CR mechanism — usually SEC. 101 — establishing the default funding rule.

| Field | Type | Description |
|-------|------|-------------|
| reference_year | integer or null | Fiscal year used as the baseline rate |
| reference_laws | array of strings | Laws providing baseline funding levels |
| rate | string or null | Rate description (e.g., “the rate for operations”) |
| duration | string or null | How long the CR lasts |
| anomalies | array of CrAnomaly | Explicit anomalies (usually captured as separate cr_substitution provisions) |

Count in example data: 1 (in H.R. 5860)


other

Catch-all for provisions that don’t fit any of the 10 specific types.

| Field | Type | Description |
|-------|------|-------------|
| llm_classification | string | The LLM’s original description of what this provision is |
| description | string | Summary of the provision |
| amounts | array of Amount | Any dollar amounts mentioned |
| references | array of strings | Any references mentioned |
| metadata | object | Arbitrary key-value pairs for non-standard fields |

When the LLM produces an unknown provision_type string, the resilient parser wraps it as Other with the original classification preserved in llm_classification. In the example data, all 96 other provisions were deliberately classified as “other” by the LLM — none triggered the fallback parser.

Count in example data: 96 (3.8%)


Amount Fields

Dollar amounts appear on many provision types. Each amount has three components:

AmountValue (value)

| Kind | Fields | Description |
|------|--------|-------------|
| specific | dollars (integer) | Exact whole-dollar amount. Can be negative for rescissions. |
| such_sums | | Open-ended: “such sums as may be necessary” |
| none | | No dollar amount |

Amount Semantics (semantics)

| Value | Meaning | Counted in Budget Authority? |
|-------|---------|------------------------------|
| new_budget_authority | New spending power | Yes (at top_level/line_item) |
| rescission | Cancellation of prior BA | Separately (subtracted for Net BA) |
| reference_amount | Contextual amount (sub-allocations, “of which” breakdowns) | No |
| limitation | Cap on spending | No |
| transfer_ceiling | Maximum transfer amount | No |
| mandatory_spending | Mandatory program amount | Tracked separately |

Text As Written (text_as_written)

The verbatim dollar string from the bill text (e.g., "$2,285,513,000"). Used for verification — the string is searched for in the source XML to confirm the amount is real.
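
The check is a plain substring search — no normalization, no LLM. A sketch of the idea (the line of bill text below is a stand-in for the enrolled XML, and the helper is hypothetical):

```python
def verify_amount(text_as_written: str, source_text: str) -> bool:
    # The verbatim dollar string must appear in the source bill text.
    return text_as_written in source_text

source = ("For an additional amount for ''Compensation and Pensions'', "
          "$2,285,513,000, to remain available until expended.")
assert verify_amount("$2,285,513,000", source)       # present: verified
assert not verify_amount("$2,285,513,001", source)   # absent: flagged
```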

Detail Levels (Appropriation Type Only)

| Level | Meaning | Counted in BA? |
|-------|---------|----------------|
| top_level | Main account appropriation | Yes |
| line_item | Numbered item within a section | Yes |
| sub_allocation | “Of which” breakdown | No |
| proviso_amount | Dollar amount in a “Provided, That” clause | No |
| "" (empty) | Not applicable (non-appropriation types) | N/A |

Proviso Fields

Conditions attached to appropriations via “Provided, That” clauses:

| Field | Type | Description |
|-------|------|-------------|
| proviso_type | string | limitation, transfer, reporting, condition, prohibition, other |
| description | string | Summary of the proviso |
| amount | Amount or null | Dollar amount if specified |
| references | array of strings | Referenced laws or sections |
| raw_text | string | Source text excerpt |

Earmark Fields

Community project funding items:

| Field | Type | Description |
|-------|------|-------------|
| recipient | string | Who receives the funds |
| location | string or null | Geographic location |
| requesting_member | string or null | Member of Congress who requested it |

Distribution in Example Data

The distribution varies by bill type. Here’s a sample from three FY2024 bills to illustrate — run congress-approp search --dir data --list-types for current counts across the full 32-bill dataset:

| Type | H.R. 4366 (Omnibus) | H.R. 5860 (CR) | H.R. 9468 (Supp) |
|------|---------------------|----------------|------------------|
| appropriation | 1,216 | 5 | 2 |
| limitation | 456 | 4 | |
| rider | 285 | 49 | 2 |
| directive | 120 | 2 | 3 |
| other | 84 | 12 | |
| rescission | 78 | | |
| transfer_authority | 77 | | |
| mandatory_spending_extension | 40 | 44 | |
| directed_spending | 8 | | |
| cr_substitution | | 13 | |
| continuing_resolution_baseline | | 1 | |

Notice how bill type shapes the distribution: the omnibus is dominated by appropriations and limitations, the CR by riders and mandatory spending extensions, and the supplemental by a handful of targeted appropriations and directives.

Next Steps

extraction.json Fields

Complete reference for every field in extraction.json — the primary output of the extract command and the file all query commands read.

Top-Level Structure

{
  "schema_version": "1.0",
  "bill": { ... },
  "provisions": [ ... ],
  "summary": { ... },
  "chunk_map": [ ... ]
}

| Field | Type | Description |
|-------|------|-------------|
| schema_version | string or null | Schema version identifier (e.g., "1.0"). Null in pre-versioned extractions. |
| bill | BillInfo | Bill-level metadata |
| provisions | array of Provision | Every extracted provision — the core data |
| summary | ExtractionSummary | LLM-generated summary statistics. Diagnostic only — never used for budget authority computation. |
| chunk_map | array | Maps chunk IDs to provision index ranges for traceability. Empty for single-chunk bills. |

BillInfo (bill)

| Field | Type | Description |
|-------|------|-------------|
| identifier | string | Bill number as printed (e.g., "H.R. 9468", "H.R. 4366") |
| classification | string | Bill type: regular, continuing_resolution, omnibus, minibus, supplemental, rescissions, or a free-text string |
| short_title | string or null | The bill’s short title if one is given (e.g., "Veterans Benefits Continuity and Accountability Supplemental Appropriations Act, 2024") |
| fiscal_years | array of integers | Fiscal years covered (e.g., [2024] or [2024, 2025]) |
| divisions | array of strings | Division letters present in the bill (e.g., ["A", "B", "C", "D", "E", "F"]). Empty array if the bill has no divisions. |
| public_law | string or null | Public law number if enacted (e.g., "P.L. 118-158"). Null if not identified in the text. |

Example (H.R. 9468):

{
  "identifier": "H.R. 9468",
  "classification": "supplemental",
  "short_title": "Veterans Benefits Continuity and Accountability Supplemental Appropriations Act, 2024",
  "fiscal_years": [2024],
  "divisions": [],
  "public_law": null
}

Provisions (provisions)

An array of provision objects. Each provision has a provision_type field that determines which type-specific fields are present, plus the common fields shared by all types.

See Provision Types for the complete type-by-type reference including type-specific fields and examples.

Common Fields (All Provision Types)

| Field | Type | Description |
|-------|------|-------------|
| provision_type | string | Type discriminator: appropriation, rescission, cr_substitution, transfer_authority, limitation, directed_spending, mandatory_spending_extension, directive, rider, continuing_resolution_baseline, other |
| section | string | Section header (e.g., "SEC. 101"). Empty string if no section header applies. |
| division | string or null | Division letter (e.g., "A"). Null if the bill has no divisions. |
| title | string or null | Title numeral (e.g., "IV", "XIII"). Null if not determinable. |
| confidence | float | LLM self-assessed confidence, 0.0–1.0. Not calibrated. Useful only for identifying outliers below 0.90. |
| raw_text | string | Verbatim excerpt from the bill text (~first 150 characters of the provision). Verified against source. |
| notes | array of strings | Explanatory annotations. Flags unusual patterns, drafting inconsistencies, or contextual information (e.g., "advance appropriation", "no-year funding", "supplemental appropriation"). |
| cross_references | array of CrossReference | References to other laws, sections, or bills. |

CrossReference

| Field | Type | Description |
|-------|------|-------------|
| ref_type | string | Relationship type: baseline_from, amends, notwithstanding, subject_to, see_also, transfer_to, rescinds_from, modifies, references, other |
| target | string | The referenced law or section (e.g., "31 U.S.C. 1105(a)", "P.L. 118-47, Division A") |
| description | string or null | Optional clarifying note |

Amount

Dollar amounts appear throughout the schema — on appropriation, rescission, limitation, directed_spending, mandatory_spending_extension, and other provision types. CR substitutions have new_amount and old_amount instead of a single amount.

Each amount has three sub-fields:

AmountValue (value)

Tagged by the kind field:

| Kind | Fields | Description |
|------|--------|-------------|
| specific | dollars (integer) | An exact dollar amount. Always whole dollars, no cents. Can be negative for rescissions. Example: {"kind": "specific", "dollars": 2285513000} |
| such_sums | | Open-ended: “such sums as may be necessary.” No dollar figure. Example: {"kind": "such_sums"} |
| none | | No dollar amount — the provision doesn’t carry a dollar value. Example: {"kind": "none"} |

Amount Semantics (semantics)

| Value | Meaning | Counted in Budget Authority? |
|-------|---------|------------------------------|
| new_budget_authority | New spending power granted to an agency | Yes (at top_level/line_item detail) |
| rescission | Cancellation of prior budget authority | Summed separately as rescissions |
| reference_amount | Dollar figure for context (sub-allocations, “of which” breakdowns) | No |
| limitation | Cap on how much may be spent for a purpose | No |
| transfer_ceiling | Maximum amount transferable between accounts | No |
| mandatory_spending | Mandatory spending referenced or extended | Tracked separately |
| (any other string) | Catch-all for unrecognized semantics | No |

Text As Written (text_as_written)

The verbatim dollar string from the bill text (e.g., "$2,285,513,000"). Used by the verification pipeline — this exact string is searched for in the source XML.

Complete Amount Example

{
  "value": {
    "kind": "specific",
    "dollars": 2285513000
  },
  "semantics": "new_budget_authority",
  "text_as_written": "$2,285,513,000"
}
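
Consumers can branch on the kind tag to pull the dollar figure out safely. A minimal helper, hypothetical and not part of the tool itself:

```python
def amount_dollars(amount: dict):
    # Only the "specific" kind carries a figure; such_sums and none do not.
    value = amount["value"]
    return value["dollars"] if value["kind"] == "specific" else None

assert amount_dollars({
    "value": {"kind": "specific", "dollars": 2285513000},
    "semantics": "new_budget_authority",
    "text_as_written": "$2,285,513,000",
}) == 2285513000
assert amount_dollars({"value": {"kind": "such_sums"}}) is None
```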

Detail Level (Appropriation Type Only)

The detail_level field on appropriation provisions indicates structural position in the funding hierarchy:

| Level | Meaning | Counted in BA? | Example |
|-------|---------|----------------|---------|
| top_level | Main account appropriation | Yes | "$10,643,713,000" for FBI Salaries and Expenses |
| line_item | Numbered item within a section | Yes | "(1) $3,500,000,000 for guaranteed farm ownership loans" |
| sub_allocation | “Of which” breakdown | No | "of which $216,900,000 shall remain available until expended" |
| proviso_amount | Dollar amount in a “Provided, That” clause | No | "Provided, That not to exceed $279,000 for reception expenses" |
| "" (empty) | Not applicable (non-appropriation provision types) | N/A | Directives, riders, etc. |

The compute_totals() function uses detail_level to prevent double-counting. Sub-allocations and proviso amounts are breakdowns of a parent appropriation, not additional money.
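
The counting rule can be sketched in a few lines. This is a simplified Python restatement of the documented logic, not the tool's actual Rust compute_totals():

```python
def compute_totals(provisions: list) -> tuple:
    # Gross BA: new_budget_authority at top_level/line_item only.
    # Rescissions: summed separately, subtracted to produce Net BA.
    gross = rescinded = 0
    for p in provisions:
        amt = p.get("amount") or {}
        val = amt.get("value") or {}
        if val.get("kind") != "specific":
            continue  # such_sums / none carry no figure
        if (amt.get("semantics") == "new_budget_authority"
                and p.get("detail_level") in ("top_level", "line_item")):
            gross += val["dollars"]
        elif amt.get("semantics") == "rescission":
            rescinded += val["dollars"]
    return gross, rescinded, gross - rescinded

provisions = [
    {"amount": {"value": {"kind": "specific", "dollars": 100},
                "semantics": "new_budget_authority"},
     "detail_level": "top_level"},
    {"amount": {"value": {"kind": "specific", "dollars": 40},
                "semantics": "new_budget_authority"},
     "detail_level": "sub_allocation"},  # breakdown of the 100, excluded
    {"amount": {"value": {"kind": "specific", "dollars": 10},
                "semantics": "rescission"}},
]
assert compute_totals(provisions) == (100, 10, 90)  # gross, rescissions, net
```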


Proviso

Conditions attached to appropriations via “Provided, That” clauses:

| Field | Type | Description |
|-------|------|-------------|
| proviso_type | string | limitation, transfer, reporting, condition, prohibition, other |
| description | string | Summary of the proviso |
| amount | Amount or null | Dollar amount if the proviso specifies one |
| references | array of strings | Referenced laws or sections |
| raw_text | string | Source text excerpt |

Earmark

Community project funding or directed spending items:

| Field | Type | Description |
|-------|------|-------------|
| recipient | string | Who receives the funds |
| location | string or null | Geographic location |
| requesting_member | string or null | Member of Congress who requested it |

CrAnomaly

Anomaly entries within a continuing_resolution_baseline provision:

| Field | Type | Description |
|-------|------|-------------|
| account | string | Account being modified |
| modification | string | What’s changing |
| delta | integer or null | Dollar change if applicable |
| raw_text | string | Source text excerpt |

ExtractionSummary (summary)

LLM-produced self-check totals. These are diagnostic only — budget authority displayed by the summary command is always computed from individual provisions, never from these fields.

| Field | Type | Description |
|-------|------|-------------|
| total_provisions | integer | Count of all provisions the LLM reported extracting |
| by_division | object | Provision count per division (e.g., {"A": 130, "B": 10}) |
| by_type | object | Provision count per type (e.g., {"appropriation": 2, "rider": 2}) |
| total_budget_authority | integer | LLM’s self-reported sum of budget authority. Not used for computation. |
| total_rescissions | integer | LLM’s self-reported sum of rescissions. Not used for computation. |
| sections_with_no_provisions | array of strings | Section headers where no provision was extracted — helps verify completeness |
| flagged_issues | array of strings | Anything unusual the LLM noticed: drafting inconsistencies, ambiguous language, potential errors |

Chunk Map (chunk_map)

Links provisions to the extraction chunks they came from. For single-chunk bills (like H.R. 9468), this is an empty array. For multi-chunk bills, each entry maps a chunk ID (ULID) to a range of provision indices:

[
  {
    "chunk_id": "01JRWN9T5RR0JTQ6C9FYYE96A8",
    "label": "A-I",
    "provision_start": 0,
    "provision_end": 42
  },
  {
    "chunk_id": "01JRWNA2B3C4D5E6F7G8H9J0K1",
    "label": "A-II",
    "provision_start": 42,
    "provision_end": 95
  }
]

This enables full audit trails — you can trace any provision back to the specific chunk and LLM call that produced it.
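
A provision's chunk can be found with a simple range scan. The ranges appear to be half-open (start inclusive, end exclusive, since one chunk ends at 42 and the next starts at 42); that convention is assumed here:

```python
def chunk_for_provision(chunk_map: list, index: int):
    # Return the chunk_id whose [provision_start, provision_end) range
    # contains the provision index, or None (e.g., single-chunk bills).
    for entry in chunk_map:
        if entry["provision_start"] <= index < entry["provision_end"]:
            return entry["chunk_id"]
    return None

chunk_map = [
    {"chunk_id": "01JRWN9T5RR0JTQ6C9FYYE96A8", "label": "A-I",
     "provision_start": 0, "provision_end": 42},
    {"chunk_id": "01JRWNA2B3C4D5E6F7G8H9J0K1", "label": "A-II",
     "provision_start": 42, "provision_end": 95},
]
assert chunk_for_provision(chunk_map, 41) == "01JRWN9T5RR0JTQ6C9FYYE96A8"
assert chunk_for_provision(chunk_map, 42) == "01JRWNA2B3C4D5E6F7G8H9J0K1"
assert chunk_for_provision([], 0) is None  # single-chunk bill: empty map
```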


Complete Minimal Example (H.R. 9468)

{
  "schema_version": "1.0",
  "bill": {
    "identifier": "H.R. 9468",
    "classification": "supplemental",
    "short_title": "Veterans Benefits Continuity and Accountability Supplemental Appropriations Act, 2024",
    "fiscal_years": [2024],
    "divisions": [],
    "public_law": null
  },
  "provisions": [
    {
      "provision_type": "appropriation",
      "account_name": "Compensation and Pensions",
      "agency": "Department of Veterans Affairs",
      "program": null,
      "amount": {
        "value": { "kind": "specific", "dollars": 2285513000 },
        "semantics": "new_budget_authority",
        "text_as_written": "$2,285,513,000"
      },
      "fiscal_year": 2024,
      "availability": "to remain available until expended",
      "provisos": [],
      "earmarks": [],
      "detail_level": "top_level",
      "parent_account": null,
      "section": "",
      "division": null,
      "title": null,
      "confidence": 0.99,
      "raw_text": "For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended.",
      "notes": [
        "Supplemental appropriation under Veterans Benefits Administration heading",
        "No-year funding"
      ],
      "cross_references": []
    },
    {
      "provision_type": "appropriation",
      "account_name": "Readjustment Benefits",
      "agency": "Department of Veterans Affairs",
      "program": null,
      "amount": {
        "value": { "kind": "specific", "dollars": 596969000 },
        "semantics": "new_budget_authority",
        "text_as_written": "$596,969,000"
      },
      "fiscal_year": 2024,
      "availability": "to remain available until expended",
      "provisos": [],
      "earmarks": [],
      "detail_level": "top_level",
      "parent_account": null,
      "section": "",
      "division": null,
      "title": null,
      "confidence": 0.99,
      "raw_text": "For an additional amount for ''Readjustment Benefits'', $596,969,000, to remain available until expended.",
      "notes": [
        "Supplemental appropriation under Veterans Benefits Administration heading",
        "No-year funding"
      ],
      "cross_references": []
    },
    {
      "provision_type": "rider",
      "description": "Establishes that each amount appropriated or made available by this Act is in addition to amounts otherwise appropriated for the fiscal year involved.",
      "policy_area": null,
      "section": "SEC. 101",
      "division": null,
      "title": null,
      "confidence": 0.98,
      "raw_text": "SEC. 101. Each amount appropriated or made available by this Act is in addition to amounts otherwise appropriated for the fiscal year involved.",
      "notes": [],
      "cross_references": []
    },
    {
      "provision_type": "directive",
      "description": "Requires the Secretary of Veterans Affairs to submit a report detailing corrections the Department will make to improve forecasting, data quality, and budget assumptions.",
      "deadlines": ["30 days after enactment"],
      "section": "SEC. 103",
      "division": null,
      "title": null,
      "confidence": 0.97,
      "raw_text": "SEC. 103. (a) Not later than 30 days after the date of enactment of this Act, the Secretary of Veterans Affairs shall submit to the Committees on App",
      "notes": [],
      "cross_references": []
    }
  ],
  "summary": {
    "total_provisions": 7,
    "by_division": {},
    "by_type": {
      "appropriation": 2,
      "rider": 2,
      "directive": 3
    },
    "total_budget_authority": 2882482000,
    "total_rescissions": 0,
    "sections_with_no_provisions": [],
    "flagged_issues": []
  },
  "chunk_map": []
}

Note: The example above is abbreviated — the actual H.R. 9468 extraction has 7 provisions (2 appropriations, 2 riders, 3 directives). Only 4 are shown here for brevity.


Accessing extraction.json

From the CLI

All query commands (search, summary, compare, audit) read extraction.json automatically. You don’t need to interact with the file directly for normal use.

From Python

import json

with open("data/118-hr9468/extraction.json") as f:
    data = json.load(f)

# Bill info
print(data["bill"]["identifier"])  # "H.R. 9468"

# Provisions
for p in data["provisions"]:
    ptype = p["provision_type"]
    if ptype == "appropriation":
        dollars = p["amount"]["value"]["dollars"]
        account = p["account_name"]
        print(f"{account}: ${dollars:,}")

From Rust (Library API)

#![allow(unused)]
fn main() {
use congress_appropriations::load_bills;
use std::path::Path;

let bills = load_bills(Path::new("examples"))?;
for bill in &bills {
    println!("{}: {} provisions",
        bill.extraction.bill.identifier,
        bill.extraction.provisions.len());
}
}

See Use the Library API from Rust for the full guide.


Schema Versioning

The schema_version field tracks the extraction data format. When the schema evolves (new fields, renamed fields), the upgrade command migrates existing data to the latest version without re-extraction.

| Version | Description |
|---|---|
| null | Pre-versioned data (before v1.1.0) |
| "1.0" | Current schema with all documented fields |

The upgrade command adds schema_version to pre-versioned files and applies any necessary field migrations. See Upgrade Extraction Data.


verification.json Fields

Complete reference for every field in verification.json — the deterministic verification report produced by the extract and upgrade commands. No LLM is involved in generating this file; it is pure string matching and arithmetic against the source bill text.

Top-Level Structure

{
  "amount_checks": [ ... ],
  "raw_text_checks": [ ... ],
  "arithmetic_checks": [ ... ],
  "completeness": { ... },
  "summary": { ... }
}
| Field | Type | Description |
|---|---|---|
| amount_checks | array of AmountCheck | One entry per provision with a dollar amount |
| raw_text_checks | array of RawTextCheck | One entry per provision |
| arithmetic_checks | array of ArithmeticCheck | Group-level sum verification (deprecated in newer files) |
| completeness | Completeness | Dollar amount coverage analysis |
| summary | VerificationSummary | Roll-up metrics for the entire bill |

Amount Checks (amount_checks)

One entry for each provision that has a text_as_written dollar string. Checks whether that exact string exists in the source bill text.

| Field | Type | Description |
|---|---|---|
| provision_index | integer | Index into the provisions array in extraction.json (0-based) |
| text_as_written | string | The dollar string being checked (e.g., "$2,285,513,000") |
| found_in_source | boolean | Whether the string was found anywhere in the source text |
| source_positions | array of integers | Character offset(s) where the string was found. Empty if not found. |
| status | string | Verification result (see below) |

Status Values

| Status | Meaning | Action |
|---|---|---|
| verified | Dollar string found at exactly one position in the source text. Highest confidence — amount is real and location is unambiguous. | None needed |
| ambiguous | Dollar string found at multiple positions. Amount is correct but location is uncertain (common for round numbers like $5,000,000). | Acceptable — not an error |
| not_found | Dollar string not found anywhere in the source text. The LLM may have hallucinated or misformatted the amount. | Review manually — check the source XML |
| mismatch | Internal consistency check failed — the parsed dollars integer doesn't match the text_as_written string. | Review manually — likely a parsing issue |
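The status logic can be reproduced with plain string matching. Below is a minimal sketch, not the tool's actual Rust implementation; the mismatch case, which compares the parsed dollars integer against the string, is omitted here.

```python
def amount_status(source_text: str, text_as_written: str) -> dict:
    # Collect every character offset where the exact dollar string occurs.
    positions, start = [], 0
    while (idx := source_text.find(text_as_written, start)) != -1:
        positions.append(idx)
        start = idx + 1
    if len(positions) == 1:
        status = "verified"       # unique match: highest confidence
    elif positions:
        status = "ambiguous"      # amount is real, location uncertain
    else:
        status = "not_found"      # possible hallucination or misformat
    return {"text_as_written": text_as_written,
            "found_in_source": bool(positions),
            "source_positions": positions,
            "status": status}

bill = "For ''Compensation and Pensions'', $2,285,513,000, to remain available"
print(amount_status(bill, "$2,285,513,000")["status"])  # verified
```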

Example

{
  "provision_index": 0,
  "text_as_written": "$2,285,513,000",
  "found_in_source": true,
  "source_positions": [431],
  "status": "verified"
}

Counts in Example Data

| Bill | Verified | Ambiguous | Not Found |
|---|---|---|---|
| H.R. 4366 | 762 | 723 | 0 |
| H.R. 5860 | 33 | 2 | 0 |
| H.R. 9468 | 2 | 0 | 0 |
| Total | 797 | 725 | 0 |

Raw Text Checks (raw_text_checks)

One entry per provision. Checks whether the provision’s raw_text excerpt is a substring of the source bill text, using tiered matching.

| Field | Type | Description |
|---|---|---|
| provision_index | integer | Index into the provisions array (0-based) |
| raw_text_preview | string | First ~80 characters of the raw text being checked |
| is_verbatim_substring | boolean | True only for exact tier matches |
| match_tier | string | How closely the raw text matched (see below) |
| found_at_position | integer or null | Character offset if exact match; null otherwise |

Match Tiers

| Tier | Method | What It Handles | Count in Example Data |
|---|---|---|---|
| exact | Byte-identical substring match | Clean, faithful extractions | 2,392 (95.6%) |
| normalized | Matches after collapsing whitespace and normalizing curly quotes and dashes | Unicode formatting differences from XML-to-text conversion | 71 (2.8%) |
| spaceless | Matches after removing all spaces | Word-joining artifacts from XML tag stripping | 0 (0.0%) |
| no_match | Not found at any tier | Paraphrased, truncated, or concatenated text from adjacent sections | 38 (1.5%) |
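The tiers can be sketched in Python as follows. This is an illustration of the idea; the tool's exact normalization rules are assumptions here.

```python
import re

def match_tier(raw_text: str, source: str) -> str:
    if raw_text in source:
        return "exact"                        # byte-identical substring

    def norm(s: str) -> str:
        # Straighten curly quotes and dashes, then collapse whitespace runs.
        for a, b in (("\u201c", '"'), ("\u201d", '"'),
                     ("\u2018", "'"), ("\u2019", "'"),
                     ("\u2013", "-"), ("\u2014", "-")):
            s = s.replace(a, b)
        return re.sub(r"\s+", " ", s).strip()

    if norm(raw_text) in norm(source):
        return "normalized"
    if norm(raw_text).replace(" ", "") in norm(source).replace(" ", ""):
        return "spaceless"                    # absorbs word-joining artifacts
    return "no_match"

print(match_tier("for \u2018\u2018Compensation\u2019\u2019",
                 "for ''Compensation'' and"))  # normalized
```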

Example

{
  "provision_index": 0,
  "raw_text_preview": "For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to r",
  "is_verbatim_substring": true,
  "match_tier": "exact",
  "found_at_position": 371
}

Arithmetic Checks (arithmetic_checks)

Group-level sum verification — checks whether line items within a section or title sum to a stated total.

Note: This field is deprecated in newer extraction files. It may be absent or empty. When present, it uses this structure:

| Field | Type | Description |
|---|---|---|
| scope | string | What's being summed (e.g., a title or division) |
| extracted_sum | integer | Sum of extracted provisions in this scope |
| stated_total | integer or null | Total stated in the bill, if any |
| status | string | verified, not_found, mismatch, or no_reference |

Old files that include this field still load correctly. New extractions and upgrades omit it.


Completeness (completeness)

Checks whether every dollar-sign pattern in the source bill text is accounted for by at least one extracted provision.
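Conceptually, the check works like the sketch below. The dollar regex is an assumption for illustration; the tool's actual text index may differ in detail.

```python
import re

# Assumed dollar pattern; the real text index may be more sophisticated.
DOLLAR_RE = re.compile(r"\$[0-9][0-9,]*")

def completeness(source: str, extracted_amounts: set) -> dict:
    found = DOLLAR_RE.findall(source)
    unaccounted = [t for t in found if t not in extracted_amounts]
    return {"total_dollar_amounts_in_text": len(found),
            "accounted_for": len(found) - len(unaccounted),
            "unaccounted": unaccounted}

src = "appropriated $2,285,513,000 and $596,969,000 for benefits"
print(completeness(src, {"$2,285,513,000", "$596,969,000"}))
# {'total_dollar_amounts_in_text': 2, 'accounted_for': 2, 'unaccounted': []}
```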

| Field | Type | Description |
|---|---|---|
| total_dollar_amounts_in_text | integer | How many dollar patterns the text index found in the source bill text |
| accounted_for | integer | How many of those patterns were matched to an extracted provision's text_as_written |
| unaccounted | array of UnaccountedAmount | Dollar amounts in the bill that no provision captured |

UnaccountedAmount

Each entry represents a dollar string found in the source text that wasn’t matched to any extracted provision:

| Field | Type | Description |
|---|---|---|
| text | string | The dollar string (e.g., "$500,000") |
| value | integer | Parsed dollar value |
| position | integer | Character offset in the source text |
| context | string | Surrounding text (~100 characters) for identification |

Example

{
  "total_dollar_amounts_in_text": 2,
  "accounted_for": 2,
  "unaccounted": []
}

For a bill with unaccounted amounts:

{
  "total_dollar_amounts_in_text": 1734,
  "accounted_for": 1634,
  "unaccounted": [
    {
      "text": "$500,000",
      "value": 500000,
      "position": 45023,
      "context": "pursuant to section 502(b) of the Agricultural Credit Act, $500,000 for each State"
    }
  ]
}

The unaccounted amounts are typically statutory cross-references, loan guarantee ceilings, struck amounts in amendments, or prior-year references in CRs. See What Coverage Means (and Doesn’t) for detailed interpretation.

Coverage Calculation

Coverage = (accounted_for / total_dollar_amounts_in_text) × 100%
| Bill | Total | Accounted | Coverage |
|---|---|---|---|
| H.R. 4366 | ~1,734 | ~1,634 | 94.2% |
| H.R. 5860 | ~36 | ~22 | 61.1% |
| H.R. 9468 | 2 | 2 | 100.0% |
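The same calculation as a small helper, applied to a completeness object loaded from verification.json:

```python
def coverage_pct(completeness: dict) -> float:
    # completeness_pct as reported in the verification summary
    return (100.0 * completeness["accounted_for"]
            / completeness["total_dollar_amounts_in_text"])

c = {"total_dollar_amounts_in_text": 1734, "accounted_for": 1634}
print(f"{coverage_pct(c):.1f}%")  # 94.2%
```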

Verification Summary (summary)

Roll-up metrics for the entire bill — these are the numbers displayed by the audit command.

| Field | Type | Description |
|---|---|---|
| total_provisions | integer | Total provisions checked |
| amounts_verified | integer | Provisions whose dollar amount was found at exactly one position |
| amounts_not_found | integer | Provisions whose dollar amount was NOT found in source text |
| amounts_ambiguous | integer | Provisions whose dollar amount appeared at multiple positions |
| raw_text_exact | integer | Provisions with exact (byte-identical) raw text match |
| raw_text_normalized | integer | Provisions with normalized match |
| raw_text_spaceless | integer | Provisions with spaceless match |
| raw_text_no_match | integer | Provisions with no raw text match at any tier |
| completeness_pct | float | Percentage of source dollar amounts accounted for (100.0 = all captured) |
| provisions_by_detail_level | object | Count of provisions at each detail level (e.g., {"top_level": 483, "sub_allocation": 396}) |

Example (H.R. 9468)

{
  "total_provisions": 7,
  "amounts_verified": 2,
  "amounts_not_found": 0,
  "amounts_ambiguous": 0,
  "raw_text_exact": 5,
  "raw_text_normalized": 0,
  "raw_text_spaceless": 0,
  "raw_text_no_match": 2,
  "completeness_pct": 100.0,
  "provisions_by_detail_level": {
    "top_level": 2
  }
}

Mapping to Audit Table Columns

| Audit Column | Summary Field |
|---|---|
| Provisions | total_provisions |
| Verified | amounts_verified |
| NotFound | amounts_not_found |
| Ambig | amounts_ambiguous |
| Exact | raw_text_exact |
| NormText | raw_text_normalized |
| Spaceless | raw_text_spaceless |
| TextMiss | raw_text_no_match |
| Coverage | completeness_pct |

How verification.json Is Used

By the audit command

The audit command reads verification.json for each bill and renders the summary metrics as the audit table.

By the search command

Search uses verification data to populate these output fields:

| Search Output Field | Source in verification.json |
|---|---|
| amount_status | amount_checks[i].status — mapped to "found", "found_multiple", or "not_found" |
| match_tier | raw_text_checks[i].match_tier — "exact", "normalized", "spaceless", or "no_match" |
| quality | Derived from both: "strong" if amount verified + text exact; "moderate" if either is imperfect; "weak" if amount not found; "n/a" for provisions without dollar amounts |
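The quality derivation can be sketched as follows. This is an assumed reconstruction for illustration; the authoritative rules live in the tool.

```python
def quality(amount_status, match_tier):
    if amount_status is None:
        return "n/a"       # provision has no dollar amount
    if amount_status == "not_found":
        return "weak"      # amount may be hallucinated
    if amount_status == "found" and match_tier == "exact":
        return "strong"    # amount unique AND text byte-identical
    return "moderate"      # one of the two signals is imperfect

print(quality("found", "exact"))           # strong
print(quality("found_multiple", "exact"))  # moderate
print(quality(None, None))                 # n/a
```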

By the summary command

The summary footer (“0 dollar amounts unverified across all bills”) counts the total amounts_not_found across all loaded bills.
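That count is easy to reproduce outside the tool. A sketch, assuming the data/&lt;bill&gt;/verification.json layout used throughout this book:

```python
import json
from pathlib import Path

def total_unverified(data_dir: str) -> int:
    # Sum amounts_not_found over every bill's verification.json,
    # reproducing the number shown in the summary footer.
    total = 0
    for vf in Path(data_dir).glob("*/verification.json"):
        total += json.loads(vf.read_text())["summary"]["amounts_not_found"]
    return total

print(f"{total_unverified('data')} dollar amounts unverified across all bills")
```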


When verification.json Is Generated

  • By extract: Automatically after LLM extraction completes. Verification runs against the source XML with no LLM involvement.
  • By upgrade: Re-generated when upgrading extraction data to a new schema version. The source XML must be present in the bill directory for verification to run.

If the source XML (BILLS-*.xml) is not present, verification is skipped and verification.json is not created or updated.


Accessing verification.json

From the CLI

You don’t need to read this file directly — the audit and search commands surface its data in user-friendly formats.

From Python

import json

with open("data/118-hr9468/verification.json") as f:
    v = json.load(f)

# Summary metrics
print(f"Not found: {v['summary']['amounts_not_found']}")
print(f"Coverage: {v['summary']['completeness_pct']:.1f}%")
print(f"Exact text matches: {v['summary']['raw_text_exact']}")

# Check individual provisions
for check in v["amount_checks"]:
    if check["status"] == "not_found":
        print(f"WARNING: Provision {check['provision_index']}: {check['text_as_written']} not found in source")

# See unaccounted dollar amounts
for ua in v["completeness"]["unaccounted"]:
    print(f"Unaccounted: {ua['text']} at position {ua['position']}")
    print(f"  Context: {ua['context']}")

embeddings.json Fields

Complete reference for the embedding metadata file and its companion binary vector file. These are produced by the congress-approp embed command and consumed by search --semantic and search --similar.

Overview

Embeddings use a split storage format:

  • embeddings.json — Small JSON metadata file (~200 bytes, human-readable)
  • vectors.bin — Binary float32 array (can be tens of megabytes for large bills)

The metadata file tells you everything you need to interpret the binary file: which model produced the vectors, how many dimensions each vector has, how many provisions are embedded, and SHA-256 hashes for the data integrity chain.


embeddings.json Structure

{
  "schema_version": "1.0",
  "model": "text-embedding-3-large",
  "dimensions": 3072,
  "count": 2364,
  "extraction_sha256": "ae912e3427b8...",
  "vectors_file": "vectors.bin",
  "vectors_sha256": "7bd7821176bc..."
}

Fields

| Field | Type | Description |
|---|---|---|
| schema_version | string | Embedding schema version. Currently "1.0". |
| model | string | The OpenAI embedding model used (e.g., "text-embedding-3-large"). All embeddings in a dataset must use the same model — you cannot compare vectors from different models. |
| dimensions | integer | Number of dimensions per vector. Default is 3072 for text-embedding-3-large. All embeddings in a dataset must use the same dimension count. |
| count | integer | Number of provisions embedded. Should equal the length of the provisions array in the corresponding extraction.json. |
| extraction_sha256 | string | SHA-256 hash of the extraction.json file these embeddings were built from. Used for staleness detection — if the extraction changes, this hash won't match and the tool warns that embeddings are stale. |
| vectors_file | string | Filename of the binary vectors file. Always "vectors.bin". |
| vectors_sha256 | string | SHA-256 hash of the vectors.bin file. Integrity check — detects corruption or truncation. |

Example Files from Included Data

| Bill | Count | Dimensions | embeddings.json Size | vectors.bin Size |
|---|---|---|---|---|
| H.R. 4366 (omnibus) | 2,364 | 3,072 | ~230 bytes | 29,048,832 bytes (29 MB) |
| H.R. 5860 (CR) | 130 | 3,072 | ~230 bytes | 1,597,440 bytes (1.6 MB) |
| H.R. 9468 (supplemental) | 7 | 3,072 | ~230 bytes | 86,016 bytes (86 KB) |

vectors.bin Format

A flat binary file containing raw little-endian float32 values. There is no header, no delimiter, and no structure — just count × dimensions floating-point numbers in sequence.

Layout

[provision_0_dim_0] [provision_0_dim_1] ... [provision_0_dim_3071]
[provision_1_dim_0] [provision_1_dim_1] ... [provision_1_dim_3071]
...
[provision_N_dim_0] [provision_N_dim_1] ... [provision_N_dim_3071]

Each float32 is 4 bytes, stored in little-endian byte order. Provisions are stored in the same order as the provisions array in extraction.json — provision index 0 comes first, then index 1, and so on.

File Size Formula

file_size = count × dimensions × 4  (bytes)

For the omnibus: 2364 × 3072 × 4 = 29,048,832 bytes

If the actual file size doesn’t match this formula, the file is corrupted or truncated. The vectors_sha256 hash in embeddings.json provides an independent integrity check.
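Both checks can be performed together. This sketch takes the raw vectors.bin bytes and the parsed embeddings.json metadata; the sample data below is fabricated for demonstration:

```python
import hashlib

def vectors_ok(meta: dict, data: bytes) -> bool:
    # Check the size formula first, then the stored SHA-256.
    expected_size = meta["count"] * meta["dimensions"] * 4
    if len(data) != expected_size:
        return False  # truncated or padded
    return hashlib.sha256(data).hexdigest() == meta["vectors_sha256"]

data = bytes(7 * 3072 * 4)  # stand-in for vectors.bin contents
meta = {"count": 7, "dimensions": 3072,
        "vectors_sha256": hashlib.sha256(data).hexdigest()}
print(vectors_ok(meta, data))        # True
print(vectors_ok(meta, data[:-4]))   # False: one float32 short
```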

Reading a Specific Provision’s Vector

To read the vector for provision at index i:

byte_offset = i × dimensions × 4
byte_length = dimensions × 4

Seek to byte_offset and read byte_length bytes, then interpret as dimensions little-endian float32 values.
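In Python, that random-access read looks like this (the path and helper name are illustrative):

```python
import struct

def read_vector(bin_path: str, dimensions: int, i: int) -> tuple:
    # Seek straight to provision i's vector; no need to load the whole file.
    byte_offset = i * dimensions * 4
    byte_length = dimensions * 4
    with open(bin_path, "rb") as f:
        f.seek(byte_offset)
        raw = f.read(byte_length)
    return struct.unpack(f"<{dimensions}f", raw)  # little-endian float32s

# e.g. vec = read_vector("data/118-hr9468/vectors.bin", 3072, 5)
```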

Vector Properties

All vectors are L2-normalized — each vector has a Euclidean norm of approximately 1.0. This means:

  • Cosine similarity equals the dot product: cos(a, b) = a · b (since |a| = |b| = 1)
  • Values range from approximately -0.1 to +0.1 per dimension (spread across 3,072 dimensions)
  • Similarity scores range from approximately 0.2 to 0.9 in practice for appropriations data

Reading Vectors in Python

Using struct (standard library)

import json
import struct

with open("data/118-hr9468/embeddings.json") as f:
    meta = json.load(f)

dims = meta["dimensions"]  # 3072
count = meta["count"]       # 7

with open("data/118-hr9468/vectors.bin", "rb") as f:
    raw = f.read()

# Verify file size
assert len(raw) == count * dims * 4, "File size mismatch — possible corruption"

# Parse into list of tuples
vectors = []
for i in range(count):
    start = i * dims * 4
    end = start + dims * 4
    vec = struct.unpack(f"<{dims}f", raw[start:end])
    vectors.append(vec)

# Check normalization
norm = sum(x * x for x in vectors[0]) ** 0.5
print(f"Vector 0 L2 norm: {norm:.6f}")  # Should be ~1.000000

Using numpy (faster for large files)

import numpy as np
import json

with open("data/118-hr4366/embeddings.json") as f:
    meta = json.load(f)

vectors = np.fromfile(
    "data/118-hr4366/vectors.bin",
    dtype=np.float32
).reshape(meta["count"], meta["dimensions"])

print(f"Shape: {vectors.shape}")  # (2364, 3072)
print(f"Vector 0 norm: {np.linalg.norm(vectors[0]):.6f}")  # ~1.000000

# Cosine similarity matrix (fast — vectors are normalized)
similarity = vectors @ vectors.T
print(f"Provision 0 vs 1 similarity: {similarity[0, 1]:.4f}")

Computing Cosine Similarity

Since vectors are L2-normalized, cosine similarity is just the dot product:

def cosine_similarity(a, b):
    return sum(x * y for x, y in zip(a, b))

# Or with numpy:
sim = np.dot(vectors[0], vectors[1])

Reading Vectors in Rust

The congress-approp library provides the embeddings module:

#![allow(unused)]
fn main() {
use congress_appropriations::approp::embeddings;
use std::path::Path;

if let Some(loaded) = embeddings::load(Path::new("data/118-hr9468"))? {
    println!("Model: {}", loaded.metadata.model);
    println!("Dimensions: {}", loaded.dimensions());
    println!("Count: {}", loaded.count());

    // Get vector for provision 0
    let vec0: &[f32] = loaded.vector(0);

    // Cosine similarity between provisions 0 and 1
    let sim = embeddings::cosine_similarity(loaded.vector(0), loaded.vector(1));
    println!("Similarity: {:.4}", sim);
}
}

Key Functions

| Function | Signature | Description |
|---|---|---|
| embeddings::load(dir) | fn load(dir: &Path) -> Result<Option<LoadedEmbeddings>> | Load embeddings from a bill directory. Returns None if no embeddings.json exists. |
| embeddings::save(dir, meta, vecs) | fn save(dir: &Path, metadata: &EmbeddingsMetadata, vectors: &[f32]) -> Result<()> | Save embeddings to a bill directory. Writes both embeddings.json and vectors.bin. |
| embeddings::cosine_similarity(a, b) | fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 | Compute cosine similarity (dot product for normalized vectors). |
| embeddings::normalize(vec) | fn normalize(vec: &mut [f32]) | L2-normalize a vector in place. |
| loaded.vector(i) | fn vector(&self, i: usize) -> &[f32] | Get the embedding vector for provision at index i. |
| loaded.count() | fn count(&self) -> usize | Number of embedded provisions. |
| loaded.dimensions() | fn dimensions(&self) -> usize | Number of dimensions per vector. |

The Hash Chain

Embeddings participate in the data integrity hash chain:

extraction.json ──sha256──▶ embeddings.json (extraction_sha256)
vectors.bin ──sha256──▶ embeddings.json (vectors_sha256)

Staleness Detection

When you run a command that uses embeddings (search --semantic or search --similar), the tool:

  1. Computes the SHA-256 of the current extraction.json on disk
  2. Compares it to extraction_sha256 in embeddings.json
  3. If they differ, prints a warning to stderr:
⚠ H.R. 4366: embeddings are stale (extraction.json has changed)

This means the extraction was modified (re-extracted or upgraded) after the embeddings were generated. The provision indices in the vectors may no longer correspond to the current provisions. The warning is advisory — execution continues, but results may be unreliable.

Fix: Regenerate embeddings with congress-approp embed --dir <path>.
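The staleness comparison itself is simple to reproduce. A sketch that takes a path to extraction.json and the parsed embeddings.json metadata:

```python
import hashlib

def embeddings_stale(extraction_path: str, embeddings_meta: dict) -> bool:
    # Hash the extraction file as it exists now and compare to the hash
    # recorded when the embeddings were generated.
    with open(extraction_path, "rb") as f:
        current = hashlib.sha256(f.read()).hexdigest()
    return current != embeddings_meta["extraction_sha256"]

# e.g. embeddings_stale("data/118-hr4366/extraction.json", meta)
```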

Integrity Check

The vectors_sha256 field verifies that vectors.bin hasn’t been corrupted. If the hash doesn’t match, the binary file was modified, truncated, or replaced since embeddings were generated.

Automatic Skip

The embed command checks the hash chain before processing each bill. If extraction_sha256 matches the current extraction and vectors_sha256 matches the current vectors file, the bill is skipped:

Skipping H.R. 9468: embeddings up to date

This makes it safe to run embed --dir data repeatedly — only bills with new or changed extractions are processed.


Consistency Requirements

Same model across all bills

All embeddings in a dataset must use the same model. Cosine similarity between vectors from different models is undefined. The model field in embeddings.json records which model was used.

If you change models, regenerate embeddings for all bills:

# Delete existing embeddings (optional — embed will overwrite)
congress-approp embed --dir data --model text-embedding-3-large

Same dimensions across all bills

All embeddings must use the same dimension count. The default is 3,072 (the native output of text-embedding-3-large). If you truncate dimensions with --dimensions 1024, all bills must use 1,024.

The dimensions field in embeddings.json records the dimension count. The tool does not currently check for dimension mismatches across bills — comparing vectors of different dimensions will silently produce garbage results.

Provision count alignment

The count field should equal the number of provisions in extraction.json. If the extraction is re-run (producing a different number of provisions), the stored vectors no longer align with the provisions — the hash chain detects this as staleness.
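A quick alignment check along the same lines, mirroring what the hash chain catches indirectly:

```python
def counts_aligned(extraction: dict, embeddings_meta: dict) -> bool:
    # Vectors are stored in provision order, so the counts must match
    # for index i in vectors.bin to refer to provisions[i].
    return embeddings_meta["count"] == len(extraction["provisions"])

print(counts_aligned({"provisions": [{}] * 7}, {"count": 7}))  # True
```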


Storage on crates.io

The vectors.bin files are excluded from the crates.io package via the exclude field in Cargo.toml:

exclude = ["data/"]

This is because the omnibus bill’s vectors.bin (29 MB) exceeds crates.io’s 10 MB upload limit. Users who install from crates.io can generate embeddings themselves:

export OPENAI_API_KEY="your-key"
congress-approp embed --dir data

Users who clone the GitHub repository get the pre-generated vectors.bin files.


Embedding Model Details

The default model is OpenAI’s text-embedding-3-large:

| Property | Value |
|---|---|
| Model name | text-embedding-3-large |
| Native dimensions | 3,072 |
| Normalization | L2-normalized (unit vectors) |
| Determinism | Near-perfect — max deviation ~1e-6 across repeated embeddings of the same text |
| Supported dimension truncation | 256, 512, 1024, 3072 (via --dimensions flag) |

Dimension Truncation Trade-offs

Experimental results from this project:

| Dimensions | Top-20 Overlap vs. 3072 | vectors.bin Size (Omnibus) | Load Time |
|---|---|---|---|
| 256 | 16/20 (lossy) | ~2.4 MB | <1ms |
| 512 | 18/20 (near-lossless) | ~4.8 MB | <1ms |
| 1024 | 19/20 | ~9.7 MB | ~1ms |
| 3072 (default) | 20/20 (ground truth) | ~29 MB | ~2ms |

Since binary files load in milliseconds regardless of size, the full 3,072 dimensions are recommended. There is no practical performance benefit to truncation.


Output Formats

Every query command (search, summary, compare, audit) supports multiple output formats via the --format flag. This reference documents each format with examples and usage notes.

Available Formats

| Format | Flag | Best For |
|---|---|---|
| Table | --format table (default) | Interactive exploration, quick lookups, terminal display |
| JSON | --format json | Programmatic consumption, Python/R/JavaScript, piping to jq |
| JSONL | --format jsonl | Streaming line-by-line processing, xargs, parallel, large result sets |
| CSV | --format csv | Excel, Google Sheets, R, pandas, any spreadsheet application |

All formats are available on search, summary, and compare. The audit command only supports table output.


Table (Default)

Human-readable formatted table with Unicode box-drawing characters. Columns adapt to content width. Long text is truncated with an ellipsis (…).

congress-approp search --dir data/118-hr9468
┌───┬───────────┬───────────────┬───────────────────────────────────────────────┬───────────────┬──────────┬─────┐
│ $ ┆ Bill      ┆ Type          ┆ Description / Account                         ┆    Amount ($) ┆ Section  ┆ Div │
╞═══╪═══════════╪═══════════════╪═══════════════════════════════════════════════╪═══════════════╪══════════╪═════╡
│ ✓ ┆ H.R. 9468 ┆ appropriation ┆ Compensation and Pensions                     ┆ 2,285,513,000 ┆          ┆     │
│ ✓ ┆ H.R. 9468 ┆ appropriation ┆ Readjustment Benefits                         ┆   596,969,000 ┆          ┆     │
│   ┆ H.R. 9468 ┆ rider         ┆ Establishes that each amount appropriated o…  ┆             — ┆ SEC. 101 ┆     │
│   ┆ H.R. 9468 ┆ rider         ┆ Unless otherwise provided, the additional a…  ┆             — ┆ SEC. 102 ┆     │
│   ┆ H.R. 9468 ┆ directive     ┆ Requires the Secretary of Veterans Affairs …  ┆             — ┆ SEC. 103 ┆     │
│   ┆ H.R. 9468 ┆ directive     ┆ Requires the Secretary of Veterans Affairs …  ┆             — ┆ SEC. 103 ┆     │
│   ┆ H.R. 9468 ┆ directive     ┆ Requires the Inspector General of the Depar…  ┆             — ┆ SEC. 104 ┆     │
└───┴───────────┴───────────────┴───────────────────────────────────────────────┴───────────────┴──────────┴─────┘
7 provisions found

Table characteristics

  • Dollar amounts are formatted with commas (e.g., 2,285,513,000)
  • Missing amounts show — (em-dash) for provisions without dollar values
  • Long text is truncated with … to fit terminal width
  • Verification symbols in the $ column: ✓ (found unique), distinct marks for found-multiple and not-found, blank (no amount)
  • Row count is shown below the table

Adaptive table layouts

The table changes its column structure depending on what you’re searching for:

Standard search: $, Bill, Type, Description/Account, Amount ($), Section, Div

CR substitution search (--type cr_substitution): $, Bill, Account, New ($), Old ($), Delta ($), Section, Div

Semantic/similar search (--semantic or --similar): Sim, Bill, Type, Description/Account, Amount ($), Div

Summary table: Bill, Classification, Provisions, Budget Auth ($), Rescissions ($), Net BA ($)

Compare table: Account, Agency, Base ($), Current ($), Delta ($), Δ %, Status

When to use

  • Interactive exploration at the terminal
  • Quick spot-checks and lookups
  • Sharing results in chat or email (the Unicode formatting renders well in most contexts)
  • Any situation where you’re reading results directly rather than processing them

JSON

A JSON array of objects. Every matching provision is included with all available fields — more data than the table can show.

congress-approp search --dir data/118-hr9468 --type appropriation --format json
[
  {
    "account_name": "Compensation and Pensions",
    "agency": "Department of Veterans Affairs",
    "amount_status": "found",
    "bill": "H.R. 9468",
    "description": "Compensation and Pensions",
    "division": "",
    "dollars": 2285513000,
    "match_tier": "exact",
    "old_dollars": null,
    "provision_index": 0,
    "provision_type": "appropriation",
    "quality": "strong",
    "raw_text": "For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended.",
    "section": "",
    "semantics": "new_budget_authority"
  },
  {
    "account_name": "Readjustment Benefits",
    "agency": "Department of Veterans Affairs",
    "amount_status": "found",
    "bill": "H.R. 9468",
    "description": "Readjustment Benefits",
    "division": "",
    "dollars": 596969000,
    "match_tier": "exact",
    "old_dollars": null,
    "provision_index": 1,
    "provision_type": "appropriation",
    "quality": "strong",
    "raw_text": "For an additional amount for ''Readjustment Benefits'', $596,969,000, to remain available until expended.",
    "section": "",
    "semantics": "new_budget_authority"
  }
]

JSON fields (search output)

| Field | Type | Description |
|---|---|---|
| bill | string | Bill identifier (e.g., "H.R. 9468") |
| provision_type | string | Provision type (e.g., "appropriation") |
| provision_index | integer | Zero-based index in the bill's provision array |
| account_name | string | Account name (empty string if not applicable) |
| description | string | Description of the provision |
| agency | string | Agency name (empty string if not applicable) |
| dollars | integer or null | Dollar amount as plain integer, or null if no amount |
| old_dollars | integer or null | Old amount for CR substitutions, null for other types |
| semantics | string | Amount semantics: new_budget_authority, rescission, reference_amount, limitation, transfer_ceiling, mandatory_spending |
| section | string | Section reference (e.g., "SEC. 101") |
| division | string | Division letter (empty string if none) |
| raw_text | string | Bill text excerpt (~150 characters) |
| amount_status | string or null | "found", "found_multiple", "not_found", or null (no amount) |
| match_tier | string | "exact", "normalized", "spaceless", "no_match" |
| quality | string | "strong", "moderate", "weak", or "n/a" |

JSON fields (summary output)

congress-approp summary --dir data --format json
[
  {
    "identifier": "H.R. 4366",
    "classification": "Omnibus",
    "provisions": 2364,
    "budget_authority": 846137099554,
    "rescissions": 24659349709,
    "net_ba": 821477749845,
    "completeness_pct": 94.23298731257208
  }
]
| Field | Type | Description |
|---|---|---|
| identifier | string | Bill identifier |
| classification | string | Bill classification |
| provisions | integer | Total provision count |
| budget_authority | integer | Total budget authority (computed from provisions) |
| rescissions | integer | Total rescissions (absolute value) |
| net_ba | integer | Budget authority minus rescissions |
| completeness_pct | float | Coverage percentage from verification |

JSON fields (compare output)

congress-approp compare --base data/118-hr4366 --current data/118-hr9468 --format json
| Field | Type | Description |
|---|---|---|
| account_name | string | Account name |
| agency | string | Agency name |
| base_dollars | integer | Budget authority in --base bills |
| current_dollars | integer | Budget authority in --current bills |
| delta | integer | Current minus base |
| delta_pct | float | Percentage change |
| status | string | "changed", "unchanged", "only in base", "only in current" |

Piping to jq

JSON output is designed for piping to jq:

# Total budget authority
congress-approp search --dir data --type appropriation --format json | \
  jq '[.[] | select(.semantics == "new_budget_authority") | .dollars] | add'

# Top 5 by dollars
congress-approp search --dir data --type appropriation --format json | \
  jq 'sort_by(-.dollars) | .[:5] | .[] | "\(.dollars)\t\(.account_name)"'

# Unique account names
congress-approp search --dir data --type appropriation --format json | \
  jq '[.[].account_name] | unique | sort | .[]'

# Group by agency
congress-approp search --dir data --type appropriation --format json | \
  jq 'group_by(.agency) | map({agency: .[0].agency, count: length, total: [.[].dollars // 0] | add}) | sort_by(-.total)'

Loading in Python

import json
import subprocess

# From a file
with open("provisions.json") as f:
    data = json.load(f)

# From subprocess
result = subprocess.run(
    ["congress-approp", "search", "--dir", "data",
     "--type", "appropriation", "--format", "json"],
    capture_output=True, text=True
)
provisions = json.loads(result.stdout)

# With pandas
import pandas as pd
df = pd.read_json("provisions.json")

Loading in R

library(jsonlite)
provisions <- fromJSON("provisions.json")

When to use

  • Any programmatic consumption (Python, R, JavaScript, shell scripts)
  • Piping to jq for ad-hoc filtering and aggregation
  • When you need fields that the table truncates or hides
  • When you need the provision_index for --similar searches

JSONL (JSON Lines)

One JSON object per line, with no enclosing array brackets. Each line is independently parseable.

congress-approp search --dir data/118-hr9468 --type appropriation --format jsonl
{"account_name":"Compensation and Pensions","agency":"Department of Veterans Affairs","amount_status":"found","bill":"H.R. 9468","description":"Compensation and Pensions","division":"","dollars":2285513000,"match_tier":"exact","old_dollars":null,"provision_index":0,"provision_type":"appropriation","quality":"strong","raw_text":"For an additional amount for ''Compensation and Pensions'', $2,285,513,000, to remain available until expended.","section":"","semantics":"new_budget_authority"}
{"account_name":"Readjustment Benefits","agency":"Department of Veterans Affairs","amount_status":"found","bill":"H.R. 9468","description":"Readjustment Benefits","division":"","dollars":596969000,"match_tier":"exact","old_dollars":null,"provision_index":1,"provision_type":"appropriation","quality":"strong","raw_text":"For an additional amount for ''Readjustment Benefits'', $596,969,000, to remain available until expended.","section":"","semantics":"new_budget_authority"}

JSONL characteristics

  • Same fields as JSON — each line contains the same fields as a JSON array element
  • No array wrapper — no [ at the start or ] at the end
  • Each line is self-contained — can be parsed independently without reading the entire output
  • No trailing comma issues — each line is a complete JSON object
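Because each line is a complete JSON document, a consumer can process output of any size in constant memory. A minimal Python sketch of a streaming JSONL reader (the two inline records mirror the field shapes shown in the example output above, trimmed to two fields):

```python
import json
from io import StringIO

def stream_provisions(fp):
    """Yield one provision dict per JSONL line, skipping blank lines."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)

# Two inline records shaped like the H.R. 9468 output above
sample = StringIO(
    '{"bill":"H.R. 9468","dollars":2285513000}\n'
    '{"bill":"H.R. 9468","dollars":596969000}\n'
)
total = sum(p["dollars"] for p in stream_provisions(sample))
print(total)  # 2882482000
```

In real use you would pass `sys.stdin` or an open file instead of the `StringIO` stand-in, and the generator never holds more than one record at a time.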

Shell processing

# Count provisions per bill
congress-approp search --dir data --format jsonl | \
  jq -r '.bill' | sort | uniq -c | sort -rn

# Line-by-line processing
congress-approp search --dir data --type appropriation --format jsonl | \
  while IFS= read -r line; do
    echo "$line" | jq -r '"\(.bill)\t\(.account_name)\t\(.dollars)"'
  done

# Filter with jq (works identically to JSON since jq handles JSONL natively)
congress-approp search --dir data --format jsonl | \
  jq -r 'select(.dollars > 1000000000) | "\(.bill)\t$\(.dollars)\t\(.account_name)"'

When to use JSONL vs. JSON

Scenario                         | Use JSON     | Use JSONL
Loading into Python/R/JavaScript | ✓            |
Piping to jq                     | Either works | ✓ (slightly more natural for streaming)
Line-by-line shell processing    |              | ✓
xargs or parallel pipelines      |              | ✓
Very large result sets           |              | ✓ (no need to load entire array into memory)
Appending to a log file          |              | ✓
Need a single parseable document | ✓            |

CSV

Comma-separated values with a header row. Suitable for import into any spreadsheet application or data analysis tool.

congress-approp search --dir data/118-hr9468 --type appropriation --format csv
bill,provision_type,account_name,description,agency,dollars,old_dollars,semantics,detail_level,section,division,raw_text,amount_status,match_tier,quality,provision_index
H.R. 9468,appropriation,Compensation and Pensions,Compensation and Pensions,Department of Veterans Affairs,2285513000,,new_budget_authority,,,,For an additional amount for ''Compensation and Pensions''...,found,exact,strong,0
H.R. 9468,appropriation,Readjustment Benefits,Readjustment Benefits,Department of Veterans Affairs,596969000,,new_budget_authority,,,,For an additional amount for ''Readjustment Benefits''...,found,exact,strong,1

CSV columns

The CSV output includes all the same fields as JSON, flattened into columns:

Column          | Type             | Description
bill            | string           | Bill identifier
provision_type  | string           | Provision type
account_name    | string           | Account name
description     | string           | Description
agency          | string           | Agency name
dollars         | integer or empty | Dollar amount (no formatting, no $ sign)
old_dollars     | integer or empty | Old amount for CR substitutions
semantics       | string           | Amount semantics
detail_level    | string           | Detail level (appropriation types only)
section         | string           | Section reference
division        | string           | Division letter
raw_text        | string           | Bill text excerpt
amount_status   | string or empty  | Verification status
match_tier      | string           | Raw text match tier
quality         | string           | Quality assessment
provision_index | integer          | Provision index

Opening in Excel

  1. Save the output to a file: congress-approp search --dir data --format csv > provisions.csv
  2. Open Excel → File → Open → navigate to provisions.csv
  3. If columns aren’t detected automatically, use Data → From Text/CSV and select:
    • Encoding: UTF-8 (important for em-dashes and other Unicode characters)
    • Delimiter: Comma
    • Data type detection: Based on entire file

Common gotchas:

Issue                                                 | Cause                                     | Fix
Large numbers in scientific notation (e.g., 8.46E+11) | Excel auto-formatting                     | Format the dollars column as Number with 0 decimal places
Garbled characters (em-dashes, curly quotes)          | Wrong encoding                            | Import with UTF-8 encoding explicitly
Extra line breaks in rows                             | raw_text or description contains newlines | The CSV properly quotes these fields; use the Import Wizard if simple Open doesn’t handle them

Opening in Google Sheets

  1. File → Import → Upload → select your .csv file
  2. Import location: “Replace current sheet” or “Insert new sheet”
  3. Separator type: Comma (should auto-detect)
  4. Google Sheets handles UTF-8 natively

Loading in pandas

import pandas as pd

df = pd.read_csv("provisions.csv")

# Basic analysis
print(f"Total provisions: {len(df)}")
print(f"Total BA: ${df[df['semantics'] == 'new_budget_authority']['dollars'].sum():,.0f}")
print(df.groupby("agency")["dollars"].sum().sort_values(ascending=False).head(10))

Loading in R

provisions <- read.csv("provisions.csv", stringsAsFactors = FALSE)

When to use

  • Importing into Excel or Google Sheets
  • Loading into R or pandas when you prefer CSV to JSON
  • Any tabular data tool that doesn’t support JSON
  • Sharing data with non-technical colleagues who work in spreadsheets

Summary: Choosing the Right Format

I want to…                                  | Use
Explore data interactively at the terminal  | --format table (default)
Process data in Python, R, or JavaScript    | --format json
Pipe to jq for quick filtering              | --format json or --format jsonl
Stream results line by line in shell        | --format jsonl
Import into Excel or Google Sheets          | --format csv
Get all available fields                    | --format json or --format csv (table truncates)
Append to a log file incrementally          | --format jsonl
Share results with non-technical colleagues | --format csv (for spreadsheets) or --format table (for email/chat)

Field availability comparison

Field                      | Table            | JSON         | JSONL        | CSV
bill                       | ✓                | ✓            | ✓            | ✓
provision_type             | ✓                | ✓            | ✓            | ✓
account_name / description | ✓ (truncated)    | ✓ (full)     | ✓ (full)     | ✓ (full)
dollars                    | ✓ (formatted)    | ✓ (integer)  | ✓ (integer)  | ✓ (integer)
old_dollars                | ✓ (CR subs only) | ✓            | ✓            | ✓
section                    |                  | ✓            | ✓            | ✓
division                   |                  | ✓            | ✓            | ✓
agency                     |                  | ✓            | ✓            | ✓
semantics                  |                  | ✓            | ✓            | ✓
detail_level               |                  | ✓            | ✓            | ✓
raw_text                   |                  | ✓ (full)     | ✓ (full)     | ✓ (full)
amount_status              | ✓ (as symbol)    | ✓ (as string)| ✓ (as string)| ✓ (as string)
match_tier                 |                  | ✓            | ✓            | ✓
quality                    |                  | ✓            | ✓            | ✓
provision_index            |                  | ✓            | ✓            | ✓

Redirecting Output to Files

All formats can be redirected to a file using standard shell redirection:

# Save table output (includes Unicode characters)
congress-approp search --dir data --type appropriation > results.txt

# Save JSON
congress-approp search --dir data --type appropriation --format json > results.json

# Save JSONL
congress-approp search --dir data --type appropriation --format jsonl > results.jsonl

# Save CSV
congress-approp search --dir data --type appropriation --format csv > results.csv

Note: The tool writes output to stdout and warnings/errors to stderr. Redirecting with > captures only stdout, so warnings (like “embeddings are stale”) still appear on the terminal. To capture everything: congress-approp search --dir data --format json > results.json 2> warnings.txt


Next Steps

Environment Variables and API Keys

Complete reference for all environment variables used by congress-approp. No API keys are needed to query pre-extracted example data — keys are only required for downloading new bills, extracting provisions, or using semantic search.

API Keys

Variable          | Used By                                                    | Required For                                       | Cost        | How to Get
CONGRESS_API_KEY  | download, api test, api bill list, api bill get, api bill text | Downloading bill XML from Congress.gov        | Free        | api.congress.gov/sign-up
ANTHROPIC_API_KEY | extract                                                    | Extracting provisions using Claude                 | Pay-per-use | console.anthropic.com
OPENAI_API_KEY    | embed, search --semantic                                   | Generating embeddings and embedding search queries | Pay-per-use | platform.openai.com

Setting API Keys

Set keys in your shell before running commands:

export CONGRESS_API_KEY="your-congress-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export OPENAI_API_KEY="your-openai-key"

To persist across sessions, add the export lines to your shell profile (~/.bashrc, ~/.zshrc, or equivalent).

Testing API Keys

Verify that your Congress.gov and Anthropic keys are working:

congress-approp api test

There is no built-in test for the OpenAI key — the embed command will fail with a clear error message if the key is missing or invalid.

Configuration Variables

Variable     | Used By | Description                                                                                              | Default
APPROP_MODEL | extract | Override the default LLM model for extraction. The --model command-line flag takes precedence if both are set. | claude-opus-4-6

Setting the Model Override

# Use a different model for all extractions in this session
export APPROP_MODEL="claude-sonnet-4-20250514"
congress-approp extract --dir data/118/hr/9468

# Or override per-command with the flag (takes precedence over env var)
congress-approp extract --dir data/118/hr/9468 --model claude-sonnet-4-20250514

Quality note: The system prompt and expected output format are specifically tuned for Claude Opus. Other models may produce lower-quality extractions. Always check audit output after extracting with a non-default model.

Which Keys Do I Need?

Querying pre-extracted data (no keys needed)

These commands work with the included data/ directory and any previously extracted bills — no API keys required:

congress-approp summary --dir data
congress-approp search --dir data --type appropriation
congress-approp search --dir data --keyword "Veterans"
congress-approp audit --dir data
congress-approp compare --base data/118-hr4366 --current data/118-hr9468
congress-approp upgrade --dir data --dry-run

Semantic search (OPENAI_API_KEY only)

Semantic search requires one API call to embed your query text (~100ms, costs fractions of a cent):

export OPENAI_API_KEY="your-key"
congress-approp search --dir data --semantic "school lunch programs" --top 5

The --similar flag does not require an API key — it uses pre-computed vectors stored locally:

# No API key needed for --similar
congress-approp search --dir data --similar 118-hr9468:0 --top 5

Downloading bills (CONGRESS_API_KEY only)

export CONGRESS_API_KEY="your-key"
congress-approp download --congress 118 --type hr --number 9468 --output-dir data
congress-approp api bill list --congress 118 --enacted-only

Extracting provisions (ANTHROPIC_API_KEY only)

export ANTHROPIC_API_KEY="your-key"
congress-approp extract --dir data/118/hr/9468

Generating embeddings (OPENAI_API_KEY only)

export OPENAI_API_KEY="your-key"
congress-approp embed --dir data/118/hr/9468

Full pipeline (all three keys)

export CONGRESS_API_KEY="your-congress-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export OPENAI_API_KEY="your-openai-key"

congress-approp download --congress 118 --enacted-only --output-dir data
congress-approp extract --dir data --parallel 6
congress-approp embed --dir data
congress-approp summary --dir data

Error Messages

Error                                                | Missing Variable             | Fix
"CONGRESS_API_KEY environment variable not set"      | CONGRESS_API_KEY             | export CONGRESS_API_KEY="your-key"
"ANTHROPIC_API_KEY environment variable not set"     | ANTHROPIC_API_KEY            | export ANTHROPIC_API_KEY="your-key"
"OPENAI_API_KEY environment variable not set"        | OPENAI_API_KEY               | export OPENAI_API_KEY="your-key"
"API key invalid" or 401 error                       | Key is set but incorrect     | Double-check the key value; regenerate if necessary
"Rate limited" or 429 error                          | Key is valid but quota exceeded | Wait and retry; reduce --parallel for extraction

Security Best Practices

  • Never hardcode API keys in scripts, configuration files checked into version control, or command-line arguments (which may be logged in shell history).

  • Use environment variables as shown above, or source them from a file that is not checked into version control:

    # Create a file (add to .gitignore!)
    echo 'export CONGRESS_API_KEY="your-key"' > ~/.congress-approp-keys
    echo 'export ANTHROPIC_API_KEY="your-key"' >> ~/.congress-approp-keys
    echo 'export OPENAI_API_KEY="your-key"' >> ~/.congress-approp-keys
    
    # Source before use
    source ~/.congress-approp-keys
    congress-approp extract --dir data
    
  • Rotate keys periodically, especially if they may have been exposed.

  • Use separate keys for development and production if your organization supports it.

Cost Estimates

The tool tracks token usage but never displays dollar costs. Here are approximate costs for reference:

Extraction (Anthropic)

Bill Type                           | Estimated Input Tokens | Estimated Output Tokens
Small supplemental (~10 KB XML)     | ~1,200                 | ~1,500
Continuing resolution (~130 KB XML) | ~25,000                | ~15,000
Omnibus (~1.8 MB XML)               | ~315,000               | ~200,000

Token usage is recorded in tokens.json after extraction. Use extract --dry-run to preview token counts before committing.

Embeddings (OpenAI)

Bill Type             | Provisions | Estimated Cost
Small supplemental    | 7          | < $0.001
Continuing resolution | 130        | < $0.01
Omnibus               | 2,364      | < $0.01

Semantic Search (OpenAI)

Each --semantic query makes one API call to embed the query text: approximately $0.0001 per search.

The --similar flag uses stored vectors and makes no API calls — completely free after initial embedding.

Summary

Task                                             | Keys Needed
Query pre-extracted data                         | None
search --similar (cross-bill matching)           | None (uses stored vectors)
search --semantic (meaning-based search)         | OPENAI_API_KEY
Download bills from Congress.gov                 | CONGRESS_API_KEY
Extract provisions from bill XML                 | ANTHROPIC_API_KEY
Generate embeddings                              | OPENAI_API_KEY
Full pipeline (download → extract → embed → query) | All three

Next Steps

Data Directory Layout

Complete reference for the file and directory structure used by congress-approp. Every bill lives in its own directory. Files are discovered by recursively walking from whatever --dir path you provide, looking for extraction.json as the anchor file.

Directory Structure

data/                              ← any --dir path works
├── hr4366/                        ← bill directory (FY2024 omnibus)
│   ├── BILLS-118hr4366enr.xml     ← source XML from Congress.gov
│   ├── extraction.json            ← structured provisions (REQUIRED — anchor file)
│   ├── verification.json          ← deterministic verification report
│   ├── metadata.json              ← extraction provenance (model, hashes, timestamps)
│   ├── tokens.json                ← LLM token usage from extraction
│   ├── bill_meta.json             ← bill metadata: FY, jurisdictions, advance classification (enrich)
│   ├── embeddings.json            ← embedding metadata (model, dimensions, hashes)
│   ├── vectors.bin                ← raw float32 embedding vectors
│   └── chunks/                    ← per-chunk LLM artifacts (gitignored)
│       ├── 01JRWN9T5RR0JTQ6C9FYYE96A8.json
│       ├── 01JRWNA2B3C4D5E6F7G8H9J0K1.json
│       └── ...
├── hr5860/                        ← bill directory (FY2024 CR)
│   ├── BILLS-118hr5860enr.xml
│   ├── extraction.json
│   ├── verification.json
│   ├── metadata.json
│   ├── tokens.json
│   ├── embeddings.json
│   ├── vectors.bin
│   └── chunks/
└── hr9468/                        ← bill directory (VA supplemental)
    ├── BILLS-118hr9468enr.xml
    ├── extraction.json
    ├── verification.json
    ├── metadata.json
    ├── embeddings.json
    ├── vectors.bin
    └── chunks/

File Reference

File              | Required?      | Written By       | Read By                                      | Mutable?                       | Size (Omnibus)
BILLS-*.xml       | For extraction | download         | extract, upgrade, enrich                     | Never                          | ~1.8 MB
extraction.json   | Yes (anchor)   | extract, upgrade | All query commands                           | Only by re-extract or upgrade  | ~12 MB
verification.json | No             | extract, upgrade | audit, search (for quality fields)           | Only by re-extract or upgrade  | ~2 MB
metadata.json     | No             | extract          | Staleness detection                          | Only by re-extract             | ~300 bytes
tokens.json       | No             | extract          | Informational only                           | Never                          | ~200 bytes
bill_meta.json    | No             | enrich           | --subcommittee filtering, staleness detection | Only by re-enrich             | ~5 KB
embeddings.json   | No             | embed            | Semantic search, staleness detection         | Only by re-embed               | ~230 bytes
vectors.bin       | No             | embed            | search --semantic, search --similar          | Only by re-embed               | ~29 MB
chunks/*.json     | No             | extract          | Debugging and analysis only                  | Never                          | Varies

Which files are required?

Only extraction.json is required. The loader (loading.rs) walks recursively from the --dir path, finds every file named extraction.json, and treats each one as a bill directory. Everything else is optional:

  • Without verification.json: The audit command won’t work, and search results won’t include amount_status, match_tier, or quality fields.
  • Without metadata.json: Staleness detection for the source XML link is unavailable.
  • Without BILLS-*.xml: Extraction, upgrade, and enrich can’t run (they need the source XML). Query commands work fine.
  • Without bill_meta.json: The --subcommittee flag is unavailable. The --fy flag still works (it uses fiscal year data from extraction.json). Run congress-approp enrich to generate this file — no API keys required.
  • Without embeddings.json + vectors.bin: --semantic and --similar searches are unavailable. If you cloned the git repository, these files are included for the example data. If you installed via cargo install, run congress-approp embed --dir data to generate them (~30 seconds per bill, requires OPENAI_API_KEY).
  • Without tokens.json: No impact on any operation.
  • Without chunks/: No impact on any operation (these are local provenance records).
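The anchor-file discovery rule is simple enough to sketch directly. The following Python snippet mimics the loader's behavior (a hypothetical re-implementation, not the tool's actual `loading.rs` code), demonstrated against a throwaway tree mixing both nesting styles:

```python
import tempfile
from pathlib import Path

def find_bill_dirs(root):
    """Every directory containing extraction.json is treated as a bill directory."""
    return sorted(p.parent for p in Path(root).rglob("extraction.json"))

# Demo: build a throwaway tree with flat and nested layouts
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("118-hr9468", "118/hr/4366"):
        d = Path(tmp) / rel
        d.mkdir(parents=True)
        (d / "extraction.json").write_text("{}")
    names = {b.name for b in find_bill_dirs(tmp)}

print(names)  # {'118-hr9468', '4366'}
```

Both layouts are found because the walk is fully recursive; nothing else about the directory shape matters.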

File Descriptions

BILLS-*.xml

The enrolled bill XML downloaded from Congress.gov. The filename follows the GPO convention:

BILLS-{congress}{type}{number}enr.xml

Examples:

  • BILLS-118hr4366enr.xml — H.R. 4366, 118th Congress, enrolled version
  • BILLS-118hr5860enr.xml — H.R. 5860, 118th Congress, enrolled version
  • BILLS-118hr9468enr.xml — H.R. 9468, 118th Congress, enrolled version

The XML uses semantic markup from the GPO bill DTD: <division>, <title>, <section>, <appropriations-small>, <quote>, <proviso>, and many more. This semantic structure is what enables reliable parsing and chunk boundary detection.
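To illustrate how that markup can drive chunk boundaries, here is a sketch against a toy fragment (not real GPO XML — element content and ids are invented) that walks `<division>` and `<title>` elements the way a chunker might:

```python
import xml.etree.ElementTree as ET

# Toy fragment standing in for an enrolled bill
toy = """<bill>
  <division id="A"><title id="I">text</title><title id="II">text</title></division>
  <division id="B"><title id="I">text</title></division>
</bill>"""

root = ET.fromstring(toy)
# One candidate chunk boundary per (division, title) pair
chunks = [(d.get("id"), t.get("id"))
          for d in root.iter("division")
          for t in d.iter("title")]
print(chunks)  # [('A', 'I'), ('A', 'II'), ('B', 'I')]
```

Because the boundaries fall on complete elements, each chunk handed to the LLM contains a whole legislative unit rather than text cut mid-provision.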

Immutable after download. The source text is never modified by any operation.

extraction.json

The primary output of the extract command. Contains:

  • bill — Bill-level metadata: identifier, classification, short title, fiscal years, divisions
  • provisions — Array of every extracted provision with full structured fields
  • summary — LLM-generated summary statistics (diagnostic only — never used for computation)
  • chunk_map — Links each provision to the extraction chunk that produced it
  • schema_version — Version of the extraction schema

This is the anchor file — the loader discovers bill directories by finding this file. All query commands (search, summary, compare, audit) read it.

See extraction.json Fields for the complete field reference.

verification.json

Deterministic verification of every provision against the source bill text. No LLM involved — pure string matching.

Contains:

  • amount_checks — Was each dollar string found in the source?
  • raw_text_checks — Is each raw text excerpt a substring of the source?
  • completeness — How many dollar strings in the source were matched to provisions?
  • summary — Roll-up metrics (verified, not_found, ambiguous, match tiers, coverage)

See verification.json Fields for the complete field reference.

metadata.json

Extraction provenance — records which model produced the extraction and when:

{
  "model": "claude-opus-4-6",
  "prompt_version": "a1b2c3d4...",
  "extraction_timestamp": "2024-03-17T14:30:00Z",
  "source_xml_sha256": "e5f6a7b8c9d0..."
}

The source_xml_sha256 field is part of the hash chain — it records the SHA-256 of the source XML so the tool can detect if the XML has been re-downloaded.

bill_meta.json

Bill-level metadata generated by the enrich command. Contains fiscal year scoping, subcommittee jurisdiction mappings (division letter → canonical jurisdiction), advance appropriation classification for each budget authority provision, enriched bill nature (omnibus, minibus, full-year CR with appropriations, etc.), and canonical (case-normalized) account names for cross-bill matching.

{
  "schema_version": "1.0",
  "congress": 119,
  "fiscal_years": [2026],
  "bill_nature": "omnibus",
  "subcommittees": [
    { "division": "A", "jurisdiction": "defense", "title": "...", "source": { "type": "pattern_match", "pattern": "department of defense" } }
  ],
  "provision_timing": [
    { "provision_index": 1370, "timing": "advance", "available_fy": 2027, "source": { "type": "fiscal_year_comparison", "availability_fy": 2027, "bill_fy": 2026 } }
  ],
  "canonical_accounts": [
    { "provision_index": 0, "canonical_name": "military personnel, army" }
  ],
  "extraction_sha256": "b461a687..."
}

This file is entirely optional. All commands that existed before v4.0 work without it. It is required only for --subcommittee filtering. The --fy flag works without it (falling back to extraction.json fiscal year data). The extraction_sha256 field is part of the hash chain — it records the SHA-256 of extraction.json at enrichment time, enabling staleness detection.

Requires no API keys to generate. Run congress-approp enrich --dir data to create this file for all bills. See Enrich Bills with Metadata for a detailed guide.

tokens.json

LLM token usage from extraction:

{
  "total_input": 1200,
  "total_output": 1500,
  "total_cache_read": 800,
  "total_cache_create": 400,
  "calls": 1
}

Informational only — not used by any downstream operation. Useful for cost estimation and monitoring.
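Since every bill directory gets its own tokens.json, a rough cost estimate means summing them. A small sketch (the `total_tokens` helper is hypothetical, using the field names from the example above), demonstrated against two fake bills:

```python
import json
import tempfile
from pathlib import Path

def total_tokens(root):
    """Sum input/output token counts across every tokens.json under root."""
    totals = {"total_input": 0, "total_output": 0}
    for p in Path(root).rglob("tokens.json"):
        t = json.loads(p.read_text())
        for k in totals:
            totals[k] += t.get(k, 0)
    return totals

# Demo with two fabricated bills
with tempfile.TemporaryDirectory() as tmp:
    for name, tok in [("a", {"total_input": 1200, "total_output": 1500}),
                      ("b", {"total_input": 25000, "total_output": 15000})]:
        d = Path(tmp) / name
        d.mkdir()
        (d / "tokens.json").write_text(json.dumps(tok))
    grand = total_tokens(tmp)

print(grand)  # {'total_input': 26200, 'total_output': 16500}
```

Multiply the totals by your provider's current per-token rates to turn them into dollar figures.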

embeddings.json

Embedding metadata — a small JSON file (~230 bytes) that describes the companion vectors.bin file:

{
  "schema_version": "1.0",
  "model": "text-embedding-3-large",
  "dimensions": 3072,
  "count": 2364,
  "extraction_sha256": "a1b2c3d4...",
  "vectors_file": "vectors.bin",
  "vectors_sha256": "e5f6a7b8..."
}

The extraction_sha256 and vectors_sha256 fields are part of the hash chain for staleness detection.

See embeddings.json Fields for the complete field reference.

vectors.bin

Raw little-endian float32 embedding vectors. No header — just count × dimensions × 4 bytes of floating-point data. The count and dimensions come from embeddings.json.
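Given that layout, reading the file is a matter of unpacking count × dimensions little-endian float32 values. A self-contained sketch (round-tripping a tiny fake file, since the real vectors are large; `read_vectors` is an illustrative helper, not part of the tool):

```python
import struct
import tempfile

def read_vectors(path, count, dimensions):
    """Read count x dimensions little-endian float32 vectors (no header)."""
    with open(path, "rb") as f:
        data = f.read()
    # Size must agree with count and dimensions from embeddings.json
    assert len(data) == count * dimensions * 4, "size mismatch with embeddings.json"
    flat = struct.unpack(f"<{count * dimensions}f", data)
    return [list(flat[i * dimensions:(i + 1) * dimensions]) for i in range(count)]

# Round-trip a fake file: 2 vectors of dimension 3
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack("<6f", 1.0, 0.0, 0.0, 0.5, 0.5, 0.25))
    path = f.name

vecs = read_vectors(path, count=2, dimensions=3)
print(vecs)  # [[1.0, 0.0, 0.0], [0.5, 0.5, 0.25]]
```

For a real bill, `count` and `dimensions` come from the companion embeddings.json (e.g., 2,364 × 3,072 for H.R. 4366).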

File sizes for the example data:

Bill      | Provisions | Dimensions | File Size
H.R. 4366 | 2,364      | 3,072      | 29,048,832 bytes (29 MB)
H.R. 5860 | 130        | 3,072      | 1,597,440 bytes (1.6 MB)
H.R. 9468 | 7          | 3,072      | 86,016 bytes (86 KB)

These files are excluded from the crates.io package (Cargo.toml exclude field) because they exceed the 10 MB upload limit. They are included in the git repository for users who clone.

See embeddings.json Fields for reading instructions.

chunks/ directory

Per-chunk LLM artifacts stored with ULID filenames (e.g., 01JRWN9T5RR0JTQ6C9FYYE96A8.json). Each file contains:

  • Thinking content — The model’s internal reasoning for this chunk
  • Raw response — The raw JSON the LLM produced before parsing
  • Parsed provisions — The provisions extracted from this chunk after resilient parsing
  • Conversion report — Type coercions, null-to-default conversions, and warnings

These are permanent provenance records — useful for understanding why the LLM classified a particular provision a certain way, or for debugging extraction issues. They are:

  • Gitignored by default (.gitignore includes chunks/)
  • Not part of the hash chain — no downstream artifact references them
  • Not required for any query operation
  • Not included in the crates.io package

Deleting the chunks/ directory has no effect on any operation.


Nesting Flexibility

The --dir flag accepts any directory path. The loader walks recursively from that path, finding every extraction.json. This means any nesting structure works:

# Flat structure (like the examples)
congress-approp summary --dir data
# Finds: data/118-hr4366/extraction.json, data/118-hr5860/extraction.json, data/118-hr9468/extraction.json

# Nested by congress/type/number
congress-approp summary --dir data
# Finds: data/118/hr/4366/extraction.json, data/118/hr/5860/extraction.json, etc.

# Single bill directory
congress-approp summary --dir data/118/hr/9468
# Finds: data/118/hr/9468/extraction.json

# Any arbitrary nesting
congress-approp summary --dir ~/my-appropriations-project/fy2024
# Finds all extraction.json files anywhere under that path

The directory name is used as the bill identifier for --similar references. For example, if the path is data/118-hr9468/extraction.json, the bill directory name is 118-hr9468, and you’d reference it as --similar 118-hr9468:0.


The Hash Chain

Each downstream artifact records the SHA-256 hash of its input, enabling staleness detection:

BILLS-*.xml ──sha256──▶ metadata.json (source_xml_sha256)
                              │
extraction.json ──sha256──▶ bill_meta.json (extraction_sha256)     ← NEW in v4.0
extraction.json ──sha256──▶ embeddings.json (extraction_sha256)
                              │
vectors.bin ──sha256──▶ embeddings.json (vectors_sha256)

If any link in the chain breaks (input file changed but downstream wasn’t regenerated), the tool warns but doesn’t block. See Data Integrity and the Hash Chain for details.
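One link of that chain can be checked with a few lines of hashing. The sketch below (a hypothetical re-implementation of the extraction → embeddings link, not the tool's own code) writes a consistent pair of files, then breaks the link by editing extraction.json:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def embeddings_stale(bill_dir):
    """True if extraction.json changed after embeddings.json recorded its hash."""
    d = Path(bill_dir)
    recorded = json.loads((d / "embeddings.json").read_text())["extraction_sha256"]
    actual = hashlib.sha256((d / "extraction.json").read_bytes()).hexdigest()
    return recorded != actual

with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "extraction.json").write_text('{"provisions": []}')
    digest = hashlib.sha256((d / "extraction.json").read_bytes()).hexdigest()
    (d / "embeddings.json").write_text(json.dumps({"extraction_sha256": digest}))
    fresh = embeddings_stale(d)                           # chain intact
    (d / "extraction.json").write_text('{"provisions": [1]}')
    stale = embeddings_stale(d)                           # link broken

print(fresh, stale)  # False True
```

The same pattern applies to source_xml_sha256 in metadata.json and vectors_sha256 in embeddings.json.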


Immutability Model

Every file except links/links.json is write-once. The links file is append-only (link accept adds entries, link remove deletes them):

File              | Written When     | Modified When
BILLS-*.xml       | download         | Never
extraction.json   | extract, upgrade | Only by deliberate re-extraction or upgrade
verification.json | extract, upgrade | Only by deliberate re-extraction or upgrade
metadata.json     | extract          | Only by re-extraction
tokens.json       | extract          | Never
bill_meta.json    | enrich           | Only by re-enrichment (enrich --force)
embeddings.json   | embed            | Only by re-embedding
vectors.bin       | embed            | Only by re-embedding
chunks/*.json     | extract          | Never

This write-once design means:

  • No file locking needed — multiple read processes can run simultaneously
  • No database needed — JSON files on disk are the right abstraction for a read-dominated workload
  • No caching needed — the files ARE the cache
  • Trivially relocatable — copy a bill directory anywhere and it works

The write:read ratio is approximately 1:500. Bills are extracted ~15 times per year (when Congress enacts new legislation), but queried hundreds to thousands of times.


Git Configuration

The project includes two git-related configurations for the data files:

.gitignore

chunks/          # Per-chunk LLM artifacts (local provenance, not for distribution)
NEXT_STEPS.md    # Internal context handoff document
.venv/           # Python virtual environment

The chunks/ directory is gitignored because it contains model thinking traces that are useful for local debugging but not needed for downstream operations or distribution.

.gitattributes

*.bin binary

The vectors.bin files are marked as binary in git to prevent line-ending conversion and diff attempts on float32 data.


Size Estimates

Component         | H.R. 9468 (Supp) | H.R. 5860 (CR) | H.R. 4366 (Omnibus)
Source XML        | 9 KB             | 131 KB         | 1.8 MB
extraction.json   | 15 KB            | 200 KB         | 12 MB
verification.json | 5 KB             | 40 KB          | 2 MB
metadata.json     | ~300 B           | ~300 B         | ~300 B
tokens.json       | ~200 B           | ~200 B         | ~200 B
bill_meta.json    | ~1 KB            | ~2 KB          | ~5 KB
embeddings.json   | ~230 B           | ~230 B         | ~230 B
vectors.bin       | 86 KB            | 1.6 MB         | 29 MB
chunks/           | ~10 KB           | ~100 KB        | ~15 MB
Total             | ~120 KB          | ~2 MB          | ~60 MB

For 20 congresses (~60 bills), total storage would be approximately 200–400 MB, dominated by vectors.bin files for large omnibus bills.


Glossary

Definitions of key terms used throughout this documentation and in the tool’s output. Terms are listed alphabetically.


Advance Appropriation — Budget authority enacted in the current year’s appropriations bill but not available for obligation until a future fiscal year. Common for VA medical accounts, where FY2024 legislation may include advance appropriations available starting in FY2025. The enrich command classifies each budget authority provision as current_year, advance, or supplemental using a fiscal-year-aware algorithm that compares “October 1, YYYY” and “first quarter of fiscal year YYYY” dates to the bill’s fiscal year. This classification is stored in bill_meta.json in the provision_timing array. Advance appropriations represent approximately 18% ($1.49 trillion) of total budget authority across the 13-bill dataset. Use --show-advance on summary to see the current/advance split. Failing to separate advance from current-year can cause year-over-year comparisons to be off by hundreds of billions of dollars. See Enrich Bills with Metadata and Budget Authority Calculation.

Ambiguous (verification status) — A dollar amount verification result indicating that the text_as_written dollar string was found at multiple positions in the source bill text. The amount is confirmed to exist in the source — it’s correct — but can’t be pinned to a single location. Common for round numbers like $5,000,000 which may appear 50+ times in a large omnibus. Displayed as in the search table’s $ column. Not an error. See How Verification Works.

Anomaly — See CR Substitution.

Appropriation — A provision that grants budget authority — the legal permission for a federal agency to enter into financial obligations (sign contracts, award grants, hire staff) up to a specified dollar amount. This is the core spending provision type and the most common provision in appropriations bills. In the tool, provisions with provision_type: "appropriation" and semantics: "new_budget_authority" at top_level or line_item detail are counted toward the budget authority total. See Provision Types.

Bill Classification — The type of appropriations bill: regular (one of the twelve annual bills), omnibus (multiple bills combined), minibus (a few bills combined), continuing_resolution (temporary funding at prior-year rates), supplemental (additional funding outside the regular cycle), or rescissions (a bill primarily canceling prior budget authority). Displayed in the Classification column of the summary table. When bill_meta.json is present (from enrich), the summary displays the enriched bill nature instead, which provides finer distinctions. See Bill Nature and How Federal Appropriations Work.

Bill Nature — An enriched classification of an appropriations bill that provides finer distinctions than the LLM’s original classification field. Where the extraction might classify H.R. 1968 as continuing_resolution, the bill nature recognizes it as full_year_cr_with_appropriations — a hybrid vehicle containing $1.786 trillion in full-year appropriations alongside a CR mechanism. Generated by the enrich command and stored in bill_meta.json. Values: regular, omnibus, minibus, continuing_resolution, full_year_cr_with_appropriations, supplemental, authorization, or a free-text string. See Enrich Bills with Metadata.

Budget Authority (BA) — The legal authority Congress grants to federal agencies to enter into financial obligations. This is the dollar figure specified in an appropriation provision — what Congress authorizes agencies to commit to spend. Distinct from outlays, which are the actual cash disbursements by the Treasury. This tool reports budget authority. In the summary table, the “Budget Auth ($)” column sums all provisions with semantics: "new_budget_authority" at top_level or line_item detail levels. See Budget Authority Calculation.

Canonical Account Name — A normalized version of an account name used for cross-bill matching: lowercased, em-dash and en-dash prefixes stripped, whitespace trimmed. For example, "Department of Veterans Affairs—Veterans Benefits Administration—Compensation and Pensions" becomes "compensation and pensions". This ensures that the same account matches across bills even when the LLM uses different naming conventions or capitalization. Generated by enrich and stored in bill_meta.json. Used internally by compare for case-insensitive matching.
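One plausible reading of that normalization rule, sketched in Python (this is a guess at the behavior described above, not the tool's actual code — in particular, keeping only the last dash-separated segment is inferred from the example):

```python
import re

def canonicalize(name):
    """Keep the last em/en-dash segment, then lowercase and trim whitespace.
    Hypothetical re-implementation inferred from the glossary example."""
    last = re.split("[\u2014\u2013]", name)[-1]   # split on em/en dashes
    return last.strip().lower()

full = ("Department of Veterans Affairs\u2014Veterans Benefits Administration"
        "\u2014Compensation and Pensions")
print(canonicalize(full))  # compensation and pensions
```

The same function also handles a bare dash-prefixed name like "—Compensation and Pensions", since the empty leading segment is discarded.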

Chunk — A segment of bill text sent to the LLM as a single extraction request. Large bills (omnibus, continuing resolutions) are split into chunks at XML <division> and <title> boundaries so each chunk contains a complete legislative section. The FY2024 omnibus (H.R. 4366) splits into approximately 75 chunks. Chunk artifacts are stored in the chunks/ directory (gitignored). See The Extraction Pipeline.

Classification Source — A provenance record in bill_meta.json that documents how each automated classification was determined. Every jurisdiction mapping, advance/current timing classification, and bill nature determination records whether it came from XML structure parsing, pattern matching, fiscal year comparison, note text analysis, a default rule, or LLM classification. This enables auditing: you can see exactly why the tool classified a provision as “advance” or a division as “defense.” See Enrich Bills with Metadata.

Completeness — See Coverage.

Confidence — A float value (0.0–1.0) on each provision representing the LLM’s self-assessed certainty in its extraction. Not calibrated — values above 0.90 are not meaningfully differentiated. Useful only for identifying outliers below 0.80, which may warrant manual review. Do not use for automated quality filtering; use the verification-derived quality field instead.

Congress Number — An identifier for the two-year term of a U.S. Congress. The 118th Congress served January 2023 – January 2025; the 119th Congress serves January 2025 – January 2027. Bills are identified by their congress number — H.R. 4366 of the 118th Congress is a different bill from H.R. 4366 of any other Congress. All three example bills in this tool are from the 118th Congress.

Continuing Resolution (CR) — Temporary legislation that funds the federal government at the prior fiscal year’s rate for agencies whose regular appropriations bills have not been enacted. Most provisions in a CR simply continue prior-year funding, but specific programs may get different treatment through anomalies (formally called CR substitutions). H.R. 5860 in the example data is a continuing resolution with 13 CR substitutions. See Work with CR Substitutions.

Cosine Similarity — The mathematical measure used to compare embedding vectors. For L2-normalized vectors (which is what this tool stores), cosine similarity equals the dot product. Scores range from approximately 0.2 to 0.9 in practice for appropriations data. Above 0.80 indicates nearly identical provisions (same program in different bills); 0.45–0.60 indicates loose conceptual connection; below 0.40 suggests no meaningful relationship. See How Semantic Search Works.

Coverage — The percentage of dollar-sign patterns found in the source bill text that were matched to at least one extracted provision. This measures extraction completeness, not accuracy. Coverage below 100% is often correct — many dollar strings in bill text are statutory cross-references, loan guarantee ceilings, struck amounts in amendments, or prior-year citations that should not be extracted as provisions. Displayed in the Coverage column of the audit table. Removed from the summary table in v2.1.0 to prevent misinterpretation as an accuracy metric. See What Coverage Means (and Doesn’t).

CR Substitution — A provision in a continuing resolution that replaces one dollar amount with another: the bill says “shall be applied by substituting ‘$X’ for ‘$Y’,” meaning fund the program at $X instead of the prior-year level of $Y. Also called an anomaly. The tool captures both the new amount ($X) and old amount ($Y) and automatically shows them in New/Old/Delta columns in the search table. In the example data, H.R. 5860 contains 13 CR substitutions. See Work with CR Substitutions.

Cross Reference — A structured reference from one provision to another law, section, or bill. Stored in the cross_references array with fields ref_type (e.g., amends, notwithstanding, rescinds_from), target (e.g., "31 U.S.C. 1105(a)"), and optional description.

Detail Level — A classification on appropriation provisions indicating where the provision sits in the funding hierarchy: top_level (main account appropriation), line_item (numbered item within a section), sub_allocation (“of which” breakdown), or proviso_amount (dollar amount in a “Provided, That” clause). The compute_totals() function uses this to prevent double-counting: only top_level and line_item provisions count toward budget authority. See The Provision Type System.

Directive — A provision type representing a reporting requirement or instruction to a federal agency (e.g., “The Secretary shall submit a report within 30 days…”). Directives don’t carry dollar amounts and don’t affect budget authority. See Provision Types.

Division — A lettered section of an omnibus or minibus bill (Division A, Division B, etc.), each typically corresponding to one of the twelve appropriations subcommittee jurisdictions. Division letters are bill-internal — Division A means Defense in H.R. 7148 but CJS in H.R. 6938 and MilCon-VA in H.R. 4366. For cross-bill filtering, use --subcommittee (which resolves division letters to canonical jurisdictions via bill_meta.json) instead of --division. The --division flag is still available for within-bill filtering when you know the specific letter. See Jurisdiction.

Earmark — Funding directed to a specific recipient, location, or project, often requested by a specific Member of Congress. Also called “community project funding” or “congressionally directed spending.” Most earmarks are listed in the joint explanatory statement (a separate document not in the enrolled bill XML), so only earmarks that appear in the bill text itself are captured by the tool. See the directed_spending provision type.

Embedding — A high-dimensional vector (list of 3,072 floating-point numbers) that represents the semantic meaning of a provision. Provisions about similar topics have vectors that point in similar directions, enabling meaning-based search and cross-bill matching. Generated by OpenAI’s text-embedding-3-large model and stored in vectors.bin. See How Semantic Search Works.

Enacted — Signed into law by the President (or passed over a veto). This tool downloads and extracts enacted bills — the versions that actually became binding law and authorized spending. The --enacted-only flag on the download command filters to these bills.

Enrich — The process of generating bill-level metadata (bill_meta.json) from the source XML and extraction output. Unlike extraction (which requires an LLM API key) or embedding (which requires an OpenAI API key), enrichment runs entirely offline using XML parsing and deterministic classification rules. Run congress-approp enrich --dir data to enrich all bills. See Enrich Bills with Metadata.

Enrolled — The final version of a bill as passed by both the House and Senate in identical form and sent to the President for signature. This is the text version that congress-approp downloads by default — the authoritative text that becomes law. Distinguished from introduced, engrossed, and other intermediate versions.

Exact (match tier) — A raw text verification result indicating that the provision’s raw_text excerpt is a byte-identical substring of the source bill text. The strongest evidence of faithful extraction — the LLM copied the text perfectly. 95.6% of provisions in the example data match at this tier. See How Verification Works.

Extraction — The process of sending bill text to the LLM (Claude) to identify and classify every spending provision into structured JSON. This is Stage 3 of the pipeline and the only stage that uses an LLM. See The Extraction Pipeline.

Fiscal Year (FY) — The federal government’s accounting year, running from October 1 through September 30. Named for the calendar year in which it ends: FY2024 = October 1, 2023 – September 30, 2024. Bills are labeled by the fiscal year they fund, not the calendar year they were enacted in. Use --fy <YEAR> to filter commands to bills covering a specific fiscal year.

Funding Timing — Whether a budget authority provision’s money is available in the current fiscal year (current_year), a future fiscal year (advance), or was provided as emergency/supplemental funding (supplemental). Determined by the enrich command using a fiscal-year-aware algorithm that compares “October 1, YYYY” and “first quarter of fiscal year YYYY” dates in the availability text to the bill’s fiscal year. Critical for year-over-year comparisons — without separating advance from current-year funding, a reporter might overstate FY2024 VA spending by $182 billion (the advance appropriation for FY2025). Use --show-advance on summary to see the split. See Enrich Bills with Metadata.
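
The date comparison can be sketched as follows. This is illustrative only: the function name and exact matching rules are assumptions, and the real classifier in enrich also handles "first quarter of fiscal year YYYY" phrasing and supplemental markers.

```rust
/// Hedged sketch of the timing heuristic: a date of "October 1, YYYY"
/// marks the start of fiscal year YYYY+1. If that fiscal year is later
/// than the bill's own, the money is an advance appropriation.
fn classify_timing(availability: &str, bill_fy: u32) -> &'static str {
    if let Some(pos) = availability.find("October 1, ") {
        let year_str: String = availability[pos + "October 1, ".len()..]
            .chars()
            .take_while(|c| c.is_ascii_digit())
            .collect();
        if let Ok(year) = year_str.parse::<u32>() {
            if year + 1 > bill_fy {
                return "advance";
            }
        }
    }
    "current_year"
}

fn main() {
    // FY2024 bill; funds first available October 1, 2024 (= FY2025): advance.
    assert_eq!(classify_timing("available beginning October 1, 2024", 2024), "advance");
    // FY2024 bill; funds available October 1, 2023 (= FY2024 itself): current.
    assert_eq!(classify_timing("available October 1, 2023", 2024), "current_year");
}
```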

Hash Chain — A series of SHA-256 hash links connecting each pipeline artifact to the input it was built from. The source XML hash is recorded in metadata.json; the extraction hash is recorded in embeddings.json; the vectors hash is also recorded in embeddings.json. Enables staleness detection — if an upstream file changes, downstream artifacts are detected as potentially stale. See Data Integrity and the Hash Chain.

Jurisdiction — The appropriations subcommittee responsible for a division of an omnibus or minibus bill. The twelve traditional jurisdictions are: Defense, Labor-HHS, THUD (Transportation-Housing-Urban Development), Financial Services, CJS (Commerce-Justice-Science), Energy-Water, Interior, Agriculture, Legislative Branch, MilCon-VA (Military Construction-Veterans Affairs), State-Foreign Operations, and Homeland Security. Division letters are bill-internal (Division A means Defense in one bill but CJS in another), so the enrich command maps each division to its canonical jurisdiction. Used with the --subcommittee flag. See Enrich Bills with Metadata.

Limitation — A provision type representing a cap or prohibition on spending (e.g., “not more than $X,” “none of the funds shall be used for…”). Limitations have semantics: "limitation" and are not counted in budget authority totals. See Provision Types.

Link Hash — A deterministic 8-character hexadecimal identifier for a relationship between two provisions across different bills. Computed from the source provision, target provision, and embedding model using SHA-256. Because the hash is deterministic, the same provision pair always produces the same hash across runs. Displayed in the relate and link suggest command output, and used with link accept to persist cross-bill relationships. See the relate and link commands in the CLI Reference.

Mandatory Spending — Federal spending determined by eligibility rules in permanent law (Social Security, Medicare, Medicaid, SNAP, VA Compensation and Pensions) rather than annual appropriations votes. Accounts for approximately 63% of total federal spending. Some mandatory programs appear as appropriation line items in appropriations bill text — the tool extracts these but does not distinguish them from discretionary spending. See Why the Numbers Might Not Match Headlines.

Mandatory Spending Extension — A provision type representing an amendment to an authorizing statute, typically extending a mandatory program that would otherwise expire. Common in continuing resolutions and certain divisions of omnibus bills. See Provision Types.

Match Tier — The level at which a provision’s raw_text excerpt was confirmed as a substring of the source text: exact (byte-identical), normalized (matches after whitespace/quote/dash normalization), spaceless (matches after removing all spaces), or no_match (not found at any tier). See How Verification Works.

Net Budget Authority (Net BA) — Budget authority minus rescissions. This is the net new spending authority enacted by a bill. Displayed in the “Net BA ($)” column of the summary table. For most reporting purposes, this is the number to cite.

Not Found (verification status) — A dollar amount verification result indicating that the text_as_written dollar string was not found anywhere in the source bill text. This is the most serious verification failure — the LLM may have hallucinated the amount. Flagged in the search table’s $ column. Across the included dataset there is 1 occurrence in 18,584 dollar amounts, leaving 99.995% confirmed (see Accuracy Metrics); in a clean extraction this count should be 0. See How Verification Works.

Omnibus — A single bill packaging multiple (often all twelve) annual appropriations bills together, organized into lettered divisions. Congress frequently uses omnibuses when individual bills stall. H.R. 4366 in the example data is an omnibus covering seven of twelve appropriations subcommittee jurisdictions. See How Federal Appropriations Work.

Outlays — Actual cash disbursements by the U.S. Treasury. Distinct from budget authority, which is the legal permission to commit to spending. Budget authority and outlays differ because agencies often obligate funds in one year but spend them over several years. Headline federal spending figures (~$6.7 trillion) are in outlays. This tool reports budget authority, not outlays.

Provision — A single identifiable directive in an appropriations bill: an appropriation, a rescission, a spending limitation, a transfer authority, a CR anomaly, a policy rider, or any other discrete instruction. This is the fundamental unit of data in congress-approp. The tool classifies each provision into one of 11 types. See Provision Types.

Proviso — A condition attached to an appropriation via a “Provided, That” clause (e.g., “Provided, That not to exceed $279,000 shall be available for official reception expenses”). Provisos may contain dollar amounts, limitations, transfer authorities, or reporting requirements. They are stored in the provisos array within appropriation provisions.

Quality — A derived assessment of a provision’s verification status: strong (dollar amount verified AND raw text exact match), moderate (one of the two is imperfect), weak (dollar amount not found), or n/a (provision has no dollar amount). Available in JSON/CSV output as the quality field. Computed from the deterministic verification data, not from the LLM’s confidence score.
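
One plausible reading of these rules, sketched as a function. This is an assumption about precedence (here, "amount not found" outranks the raw-text result), not the tool's actual code.

```rust
/// Hypothetical sketch of the quality derivation: inputs are the two
/// deterministic verification outcomes plus whether the provision
/// carries a dollar amount at all.
fn quality(has_amount: bool, amount_found: bool, raw_text_exact: bool) -> &'static str {
    match (has_amount, amount_found, raw_text_exact) {
        (false, _, _) => "n/a",        // no dollar amount to verify
        (_, false, _) => "weak",       // dollar amount not found in source
        (_, true, true) => "strong",   // amount verified AND raw text exact
        (_, true, false) => "moderate" // amount verified, raw text imperfect
    }
}

fn main() {
    assert_eq!(quality(true, true, true), "strong");
    assert_eq!(quality(true, true, false), "moderate");
    assert_eq!(quality(true, false, true), "weak");
    assert_eq!(quality(false, false, false), "n/a");
}
```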

Raw Text — The raw_text field on every provision — a verbatim excerpt from the bill text (typically the first ~150 characters of the provision). Verified against the source text using tiered matching (exact → normalized → spaceless). Allows users to see the actual bill language without opening the XML.

Rescission — A provision that cancels previously enacted budget authority. A rescission of $500 million reduces the net budget authority by that amount. In the summary table, rescissions appear in their own column and are subtracted from gross budget authority to produce Net BA. See Provision Types.

Rider — A policy provision that doesn’t directly appropriate, rescind, or limit funds. Riders establish rules, extend authorities, set policy conditions, or make legislative findings. They don’t carry dollar amounts and don’t affect budget authority calculations. The second most common provision type across the dataset. See Provision Types.

Semantic Search — A search method that finds provisions by meaning rather than exact keywords. Uses embedding vectors to understand that “school lunch programs for kids” means “Child Nutrition Programs” even though the words don’t overlap. Invoked with --semantic "your query" on the search command. Requires pre-computed embeddings and OPENAI_API_KEY. See How Semantic Search Works.

Semantics (amount) — The semantics field on a dollar amount, indicating what the amount represents in budget terms: new_budget_authority (new spending power — counted in BA totals), rescission (cancellation of prior BA), reference_amount (contextual — sub-allocations, “of which” breakdowns), limitation (cap on spending), transfer_ceiling (maximum transfer amount), or mandatory_spending (mandatory program amount). See Provision Types.

Staleness — The condition where a downstream artifact was built from a version of its input that no longer matches the current file on disk. Detected via the hash chain — if extraction.json changes but embeddings.json still records the old hash, the embeddings are stale. The tool warns but never blocks execution. See Data Integrity and the Hash Chain.
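
The detection logic amounts to comparing a recorded fingerprint against a recomputed one. A minimal sketch, using std's DefaultHasher as a stand-in so the example needs no external crates (the tool itself uses SHA-256 over file bytes):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in fingerprint (real tool: SHA-256 of the file contents).
fn fingerprint(contents: &str) -> u64 {
    let mut h = DefaultHasher::new();
    contents.hash(&mut h);
    h.finish()
}

/// An artifact is stale when its input no longer matches the hash
/// recorded when the artifact was built.
fn is_stale(current_input: &str, recorded_fingerprint: u64) -> bool {
    fingerprint(current_input) != recorded_fingerprint
}

fn main() {
    let extraction_v1 = r#"{"provisions": []}"#;
    let recorded = fingerprint(extraction_v1); // stored alongside embeddings

    // Upstream file unchanged: embeddings are fresh.
    assert!(!is_stale(extraction_v1, recorded));

    // Upstream file re-extracted: embeddings are stale (warn, don't block).
    let extraction_v2 = r#"{"provisions": [{"provision_type": "rider"}]}"#;
    assert!(is_stale(extraction_v2, recorded));
}
```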

Sub-Allocation — A breakdown within a parent account: “of which $X shall be for Y.” Sub-allocations have detail_level: "sub_allocation" and semantics: "reference_amount". They are not additional money — they specify how part of the parent appropriation should be spent. Excluded from budget authority totals to prevent double-counting. See Budget Authority Calculation.

Supplemental — An additional appropriation enacted outside the regular annual cycle, typically in response to emergencies — natural disasters, military operations, public health crises, or funding shortfalls. H.R. 9468 in the example data is a supplemental providing $2.9 billion for VA Compensation and Pensions and Readjustment Benefits. See How Federal Appropriations Work.

Text As Written — The text_as_written field on a dollar amount — the verbatim dollar string from the bill text (e.g., "$2,285,513,000"). This is the string searched for in the source XML during amount verification. It preserves the exact formatting from the bill, including commas and the dollar sign.

Title — A numbered subdivision within a division of an omnibus bill (e.g., Title I, Title II). Identified by Roman numerals in the bill text. The title field in provision data contains the numeral (e.g., "IV", "XIII"). The same title number may appear in different divisions — Division A Title I and Division B Title I are different sections.

Transfer Authority — A provision granting permission to move funds between accounts. The dollar amount is a ceiling (maximum that may be transferred), not new spending. Transfer authority provisions have semantics: "transfer_ceiling" and are not counted in budget authority totals. See Provision Types.

Treasury Account Symbol (TAS) — The master account identifier assigned by the Department of the Treasury to every federal appropriation, receipt, or fund account. Composed of up to 8 fields including the Agency Identifier (CGAC code), Main Account Code, and Period of Availability. The Federal Account Symbol (FAS) is the time-independent version: just the agency code + main account code, collapsing all annual vintages into one persistent identifier. The resolve-tas command maps provisions to FAS codes. See Resolving Treasury Account Symbols and The Authority System.

Verified (verification status) — A dollar amount verification result indicating that the text_as_written dollar string was found at exactly one position in the source bill text. The strongest verification result — the amount is confirmed real and its location is unambiguous. Flagged in the search table’s $ column. See How Verification Works.

Architecture Overview

This chapter provides a high-level map of how congress-approp is structured — for developers who want to understand the codebase, contribute features, or debug issues.

The Pipeline

Every bill flows through five stages. Each stage is implemented by a distinct set of modules:

Stage 1: Download    →  api/congress/       →  BILLS-*.xml
Stage 2: Parse       →  approp/xml.rs       →  clean text + chunk boundaries
Stage 3: Extract     →  approp/extraction.rs →  extraction.json + verification.json
Stage 4: Embed       →  api/openai/         →  embeddings.json + vectors.bin
Stage 5: Query       →  approp/query.rs     →  search, compare, summary, audit output

Only stages 3 (Extract) and 4 (Embed) call external APIs. Everything else is local and deterministic.

Module Map

src/
  main.rs                    ← CLI entry point, clap definitions, output formatting (~4,200 lines)
  lib.rs                     ← Re-exports: api:: and approp::, plus load_bills and query
  api/
    mod.rs                   ← pub mod anthropic; pub mod congress; pub mod openai;
    anthropic/               ← Claude API client (~660 lines)
      client.rs              ← Message creation with streaming, thinking, caching
      mod.rs
    congress/                ← Congress.gov API client (~850 lines)
      bill.rs                ← Bill listing, metadata, text versions
      client.rs              ← HTTP client with auth
      mod.rs
    openai/                  ← OpenAI API client (~75 lines)
      client.rs              ← Embeddings endpoint only
      mod.rs
  approp/
    mod.rs                   ← pub mod for all submodules
    ontology.rs              ← ALL data types (~960 lines)
    extraction.rs            ← ExtractionPipeline: parallel chunk processing (~840 lines)
    from_value.rs            ← Resilient JSON→Provision parsing (~690 lines)
    xml.rs                   ← Congressional bill XML parsing (~590 lines)
    text_index.rs            ← Dollar amount indexing, section detection (~670 lines)
    prompts.rs               ← System prompt for Claude (~310 lines)
    verification.rs          ← Deterministic verification (~370 lines)
    loading.rs               ← Directory walking, JSON loading, bill_meta (~340 lines)
    query.rs                 ← Library API: search, compare, summarize, audit, relate (~1,300 lines)
    embeddings.rs            ← Embedding storage, cosine similarity (~260 lines)
    staleness.rs             ← Hash chain checking (~165 lines)
    progress.rs              ← Extraction progress bar (~170 lines)
tests/
  cli_tests.rs               ← 42 integration tests against test-data/ and data/ (~1,200 lines)

Total: approximately 9,500 lines of Rust.

Core Data Types (ontology.rs)

The Provision enum is the heart of the data model. It has 11 variants, each representing a different type of legislative provision:

| Variant | Key Fields |
|---|---|
| Appropriation | account_name, agency, amount, detail_level, parent_account, fiscal_year, availability, provisos, earmarks |
| Rescission | account_name, agency, amount, reference_law |
| TransferAuthority | from_scope, to_scope, limit, conditions |
| Limitation | description, amount, account_name |
| DirectedSpending | account_name, amount, earmark, detail_level |
| CrSubstitution | new_amount, old_amount, account_name, reference_act |
| MandatorySpendingExtension | program_name, statutory_reference, amount, period |
| Directive | description, deadlines |
| Rider | description, policy_area |
| ContinuingResolutionBaseline | reference_year, reference_laws, rate, duration |
| Other | llm_classification, description, amounts, metadata |

All variants share common fields: section, division, title, confidence, raw_text, notes, cross_references.

The enum uses tagged serde: #[serde(tag = "provision_type", rename_all = "snake_case")], so each JSON object self-identifies.

Supporting Types

  • DollarAmount — value (AmountValue), semantics (AmountSemantics), text_as_written
  • AmountValue — Specific { dollars: i64 }, SuchSums, None
  • AmountSemantics — NewBudgetAuthority, Rescission, ReferenceAmount, Limitation, TransferCeiling, MandatorySpending, Other(String)
  • BillExtraction — top-level structure: bill, provisions, summary, chunk_map, schema_version
  • BillInfo — identifier, classification, short_title, fiscal_years, divisions, public_law
  • ExtractionSummary — LLM self-check totals (diagnostic only, never used for computation)

The BillExtraction::compute_totals() method deterministically computes budget authority and rescissions from the provisions array, filtering by semantics and detail_level.
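
A minimal sketch of that filtering rule, with simplified types (the real compute_totals() operates on the full ontology):

```rust
/// Simplified stand-ins for the ontology types.
#[derive(PartialEq)]
enum Semantics { NewBudgetAuthority, Rescission, ReferenceAmount }
enum DetailLevel { TopLevel, LineItem, SubAllocation }

struct Amount { dollars: i64, semantics: Semantics, detail_level: DetailLevel }

/// Budget authority counts only new_budget_authority at top_level or
/// line_item, so "of which" sub-allocations are never double-counted.
fn compute_totals(amounts: &[Amount]) -> (i64, i64) {
    let ba = amounts.iter()
        .filter(|a| a.semantics == Semantics::NewBudgetAuthority)
        .filter(|a| matches!(a.detail_level, DetailLevel::TopLevel | DetailLevel::LineItem))
        .map(|a| a.dollars)
        .sum();
    let rescissions = amounts.iter()
        .filter(|a| a.semantics == Semantics::Rescission)
        .map(|a| a.dollars)
        .sum();
    (ba, rescissions)
}

fn main() {
    let amounts = vec![
        Amount { dollars: 1_000, semantics: Semantics::NewBudgetAuthority, detail_level: DetailLevel::TopLevel },
        Amount { dollars: 400, semantics: Semantics::ReferenceAmount, detail_level: DetailLevel::SubAllocation },
        Amount { dollars: 250, semantics: Semantics::Rescission, detail_level: DetailLevel::TopLevel },
    ];
    // Gross BA 1,000 and rescissions 250; the sub-allocation is excluded.
    assert_eq!(compute_totals(&amounts), (1_000, 250));
}
```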

The Extraction Pipeline (extraction.rs)

ExtractionPipeline orchestrates the LLM extraction process:

  1. Parse XML — calls xml::parse_bill_xml() to get clean text and chunk boundaries
  2. Build chunks — each chunk gets the full system prompt plus its section of bill text
  3. Extract in parallel — sends chunks to Claude via the Anthropic API with bounded concurrency (--parallel N)
  4. Parse responses — from_value::parse_bill_extraction() handles LLM output with resilient parsing
  5. Merge — provisions from all chunks are combined into a single list
  6. Compute totals — budget authority is summed from provisions (never trusting LLM arithmetic)
  7. Verify — verification::verify_extraction() runs deterministic checks
  8. Write — all artifacts saved to disk

Progress updates are sent via a channel to a rendering task that displays the live dashboard.
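
The channel pattern might look like this with std threads (the real pipeline uses async tasks; the names here are illustrative):

```rust
use std::sync::mpsc;
use std::thread;

/// Messages workers send to the single rendering task.
enum Progress { ChunkDone { index: usize }, Finished }

/// Drain progress messages and return how many chunks completed.
/// Only this task touches the terminal, so output never interleaves.
fn render(rx: mpsc::Receiver<Progress>) -> usize {
    let mut done = 0;
    for msg in rx {
        match msg {
            Progress::ChunkDone { index } => {
                done += 1;
                println!("chunk {index} complete");
            }
            Progress::Finished => break,
        }
    }
    done
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        for index in 0..3 {
            // ... extract chunk `index` via the API here ...
            tx.send(Progress::ChunkDone { index }).unwrap();
        }
        tx.send(Progress::Finished).unwrap();
    });
    assert_eq!(render(rx), 3);
    worker.join().unwrap();
}
```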

Resilient Parsing (from_value.rs)

This module bridges the gap between the LLM’s JSON output and Rust’s strict type system:

  • Missing fields → defaults (empty string, null, empty array)
  • Wrong types → coerced (string "$10,000,000" → integer 10000000)
  • Unknown provision types → wrapped as Provision::Other with original classification preserved
  • Extra fields → silently ignored for known types; preserved in metadata map for Other
  • Failed provisions → logged as warnings, skipped

Every compromise is counted in a ConversionReport — the tool never silently hides parsing issues.
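
The type-coercion bullet, for example, might reduce to something like this (a sketch, not the tool's code):

```rust
/// Coerce a dollar string such as "$10,000,000" into the integer the
/// schema expects; return None for non-numeric values like "such sums".
fn coerce_dollars(raw: &str) -> Option<i64> {
    let digits: String = raw.chars().filter(|c| c.is_ascii_digit()).collect();
    if digits.is_empty() { None } else { digits.parse().ok() }
}

fn main() {
    assert_eq!(coerce_dollars("$10,000,000"), Some(10_000_000));
    assert_eq!(coerce_dollars("such sums as may be necessary"), None);
}
```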

Verification (verification.rs)

Three deterministic checks, no LLM involved:

  1. Amount checks — search for each text_as_written dollar string in the source text
  2. Raw text checks — check if raw_text is a substring of source (exact → normalized → spaceless → no_match)
  3. Completeness — count dollar-sign patterns in source and check how many are accounted for

The text_index.rs module builds a positional index of every dollar amount and section header in the source text, used by verification and for chunk boundary computation.
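
The tiered raw-text check can be sketched like this. Each tier relaxes the text a little more before searching; the exact normalization rules (which quotes and dashes get unified) are assumptions here:

```rust
/// Return the first tier at which `excerpt` is found in `source`.
fn match_tier(excerpt: &str, source: &str) -> &'static str {
    // Tier 1: byte-identical substring.
    if source.contains(excerpt) {
        return "exact";
    }
    // Tier 2: collapse whitespace, unify curly quotes and em-dashes.
    let normalize = |s: &str| {
        s.split_whitespace().collect::<Vec<_>>().join(" ")
            .replace('\u{2018}', "'")
            .replace('\u{2019}', "'")
            .replace('\u{2014}', "-")
    };
    if normalize(source).contains(&normalize(excerpt)) {
        return "normalized";
    }
    // Tier 3: remove all whitespace entirely.
    let spaceless = |s: &str| s.chars().filter(|c| !c.is_whitespace()).collect::<String>();
    if spaceless(source).contains(&spaceless(excerpt)) {
        return "spaceless";
    }
    "no_match"
}

fn main() {
    let source = "Provided, That  not to\nexceed $279,000 shall be available";
    // Fails the exact tier (line break in source) but passes normalized.
    assert_eq!(match_tier("not to exceed $279,000", source), "normalized");
    assert_eq!(match_tier("completely absent text", source), "no_match");
}
```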

Library API (query.rs)

Pure functions that take &[LoadedBill] and return data structs:

pub fn summarize(bills: &[LoadedBill]) -> Vec<BillSummary>
pub fn search(bills: &[LoadedBill], filter: &SearchFilter) -> Vec<SearchResult>
pub fn compare(base: &[LoadedBill], current: &[LoadedBill], agency: Option<&str>) -> Vec<AccountDelta>
pub fn audit(bills: &[LoadedBill]) -> Vec<AuditRow>
pub fn rollup_by_department(bills: &[LoadedBill]) -> Vec<AgencyRollup>
pub fn build_embedding_text(provision: &Provision) -> String

Design contract: No I/O, no formatting, no API calls, no side effects. The CLI layer (main.rs) handles all formatting and output.

Embeddings (embeddings.rs)

Split storage: JSON metadata + binary float32 vectors.

Key functions:

  • load(dir) → Option<LoadedEmbeddings> — loads metadata and binary vectors
  • save(dir, metadata, vectors) — writes both files atomically
  • cosine_similarity(a, b) → f32 — dot product (vectors are L2-normalized)
  • normalize(vec) — L2-normalize in place
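
Because the stored vectors are L2-normalized, both helpers are a few lines each. A sketch consistent with the signatures above:

```rust
/// Scale a vector in place so its Euclidean length is 1.
fn normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// For unit vectors, cosine similarity is just the dot product.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let mut a = vec![3.0, 4.0];
    let mut b = vec![3.0, 4.0];
    normalize(&mut a);
    normalize(&mut b);
    // Identical directions give similarity 1.0 (within float tolerance).
    assert!((cosine_similarity(&a, &b) - 1.0).abs() < 1e-6);
}
```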

Loading (loading.rs)

load_bills(dir) recursively walks from a path, finds every extraction.json, and loads it along with sibling verification.json and metadata.json into LoadedBill structs. Results are sorted by bill identifier.

CLI Layer (main.rs)

The CLI is built with clap derive macros. The Commands enum defines all subcommands. Each command has a handler function:

| Command | Handler | Lines | Async? |
|---|---|---|---|
| summary | handle_summary() | ~160 | No |
| search | handle_search() | ~530 | Yes (semantic path) |
| search --semantic | handle_semantic_search() | ~330 | Yes |
| compare | handle_compare() | ~210 | No |
| audit | handle_audit() | ~180 | No |
| extract | handle_extract() | ~310 | Yes |
| embed | handle_embed() | ~120 | Yes |
| download | handle_download() | ~400 | Yes |
| upgrade | handle_upgrade() | ~150 | No |

Known technical debt: main.rs is ~4,200 lines. While the summary and compare handlers have been consolidated to call library functions in query.rs, the search handler still contains substantial inline formatting logic. Each provision type has its own table column layout, and the semantic search path has ~200 lines of inline filtering. A future refactor could reduce main.rs by extracting the table formatting into a dedicated module.

Key Design Decisions

1. LLM isolation

The LLM touches data exactly once (extraction). Every downstream operation is deterministic. If you don’t trust the LLM’s classification, the raw_text field lets you read the original bill language.

2. Budget totals from provisions, not summaries

compute_totals() sums individual provisions filtered by semantics and detail_level. The LLM’s self-reported total_budget_authority is never used for computation.

3. Semantic chunking

Bills are split at XML <division> and <title> boundaries, not at arbitrary token limits. Each chunk contains a complete legislative section, preserving context for the LLM.
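
The idea can be sketched with plain string markers (the real splitter walks the bill XML rather than searching raw text):

```rust
/// Split text at structural marker offsets rather than fixed token
/// counts, so each chunk is a complete section.
fn chunk_at_boundaries(text: &str, marker: &str) -> Vec<String> {
    let mut starts: Vec<usize> = text.match_indices(marker).map(|(i, _)| i).collect();
    if starts.first() != Some(&0) {
        starts.insert(0, 0); // keep any preamble before the first marker
    }
    starts.push(text.len());
    starts.windows(2).map(|w| text[w[0]..w[1]].to_string()).collect()
}

fn main() {
    let bill = "TITLE I funding...TITLE II more funding...";
    let chunks = chunk_at_boundaries(bill, "TITLE");
    assert_eq!(chunks.len(), 2);
    assert!(chunks[0].starts_with("TITLE I"));
    assert!(chunks[1].starts_with("TITLE II"));
}
```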

4. Tagged enum deserialization

Provision uses #[serde(tag = "provision_type")]. Each JSON object self-identifies. Forward-compatible and human-readable.

5. Resilient LLM output parsing

from_value.rs manually walks the serde_json::Value tree with fallbacks rather than using strict deserialization. An unknown provision type becomes Other with the original data preserved. Extraction rarely fails entirely.

6. Schema evolution without re-extraction

The upgrade command re-deserializes through the current schema, re-runs verification, and updates files — no LLM calls needed. New fields get defaults, renamed fields get mapped.

7. Write-once, read-many

All artifacts are immutable after creation. No file locking, no database, no caching needed. The files ARE the cache. Hash checks are ~2ms and run on every load.

Dependencies

| Crate | Role |
|---|---|
| clap | CLI argument parsing (derive macros) |
| roxmltree | XML parsing — pure Rust, read-only |
| reqwest | HTTP client for all three APIs (with rustls-tls) |
| tokio | Async runtime for parallel API calls |
| serde / serde_json | Serialization for all JSON artifacts |
| walkdir | Recursive directory traversal |
| comfy-table | Terminal table formatting |
| csv | CSV output |
| sha2 | SHA-256 hashing for the hash chain |
| chrono | Timestamps in metadata |
| ulid | Unique IDs for chunk artifacts |
| anyhow / thiserror | Error handling (anyhow for CLI, thiserror for library) |
| tracing / tracing-subscriber | Structured logging |
| futures | Stream processing for parallel extraction |

All API clients use rustls-tls — no OpenSSL dependency.

Performance Characteristics

| Operation | Time | Notes |
|---|---|---|
| Load 14 bills (JSON parsing) | ~40ms | |
| Load embeddings (14 bills, binary) | ~8ms | Memory read |
| SHA-256 hash all files (14 bills) | ~8ms | |
| Cosine search (8,500 provisions) | <0.5ms | Dot products |
| Total cold-start query | ~50ms | Load + hash + search |
| Embed query text (OpenAI API) | ~100ms | Network round-trip |
| Full extraction (omnibus, 75 chunks) | ~60 min | Parallel LLM calls |
| Generate embeddings (2,500 provisions) | ~30 sec | Batch API calls |

At 20 congresses (~60 bills, ~15,000 provisions): cold start ~80ms, search <1ms. The system scales linearly and stays interactive at any realistic data volume.

Code Map

A file-by-file guide to the codebase — where each module lives, what it does, how many lines it contains, and when you’d need to edit it.

Source Layout

src/
├── main.rs                          ← CLI entry point (~4,200 lines)
├── lib.rs                           ← Library re-exports (5 lines)
├── api/
│   ├── mod.rs                       ← pub mod anthropic; pub mod congress; pub mod openai;
│   ├── anthropic/
│   │   ├── mod.rs                   ← Re-exports
│   │   └── client.rs               ← Claude API client (~340 lines)
│   ├── congress/
│   │   ├── mod.rs                   ← Types and re-exports
│   │   ├── client.rs               ← Congress.gov HTTP client
│   │   └── bill.rs                 ← Bill listing, metadata, text versions
│   └── openai/
│       ├── mod.rs                   ← Re-exports
│       └── client.rs               ← Embeddings endpoint (~45 lines)
└── approp/
    ├── mod.rs                       ← pub mod for all submodules
    ├── ontology.rs                  ← All data types (~960 lines)
    ├── bill_meta.rs                 ← Bill metadata + classification (~1,280 lines)
    ├── extraction.rs                ← Extraction pipeline (~840 lines)
    ├── from_value.rs                ← Resilient JSON parsing (~690 lines)
    ├── xml.rs                       ← Congressional XML parser (~590 lines)
    ├── text_index.rs                ← Dollar amount indexing (~670 lines)
    ├── prompts.rs                   ← LLM system prompt (~310 lines)
    ├── verification.rs              ← Deterministic verification (~370 lines)
    ├── links.rs                     ← Cross-bill link persistence (~790 lines)
    ├── loading.rs                   ← Directory walking, bill loading (~340 lines)
    ├── query.rs                     ← Library API (~1,300 lines)
    ├── embeddings.rs                ← Embedding storage (~260 lines)
    ├── staleness.rs                 ← Hash chain checking incl bill_meta (~165 lines)
    └── progress.rs                  ← Extraction progress bar (~170 lines)

Supporting Files

tests/
└── cli_tests.rs                     ← 42 integration tests (~1,200 lines)

docs/
├── ARCHITECTURE.md                  ← Architecture doc (~416 lines)
└── FIELD_REFERENCE.md               ← JSON field reference (~348 lines)

book/
└── src/                             ← This mdbook documentation

data/
├── hr4366/                          ← FY2024 omnibus (2,364 provisions)
├── hr5860/                          ← FY2024 continuing resolution (130 provisions)
└── hr9468/                          ← VA supplemental (7 provisions)

File-by-File Reference

Core: CLI and Library Entry Points

FileLinesPurposeWhen to Edit
src/main.rs~4,200CLI entry point. Clap argument definitions, command handlers, output formatting (table/JSON/CSV/JSONL). Contains handlers for all commands: handle_search, handle_summary, handle_compare, handle_audit, handle_extract, handle_embed, handle_download, handle_upgrade, handle_enrich, handle_relate, handle_link, and helper functions including filter_bills_to_subcommittee.Adding new CLI commands or flags; changing output formatting; wiring new library functions to the CLI.
src/lib.rs5Library re-exports: pub mod api; pub mod approp; plus pub use approp::loading::{LoadedBill, load_bills}; pub use approp::query;Adding new top-level re-exports for library consumers.

Core: Data Types

FileLinesPurposeWhen to Edit
src/approp/ontology.rs~960All data types. The Provision enum (11 variants), BillExtraction, BillInfo, DollarAmount, AmountValue, AmountSemantics, ExtractionSummary, ExtractionMetadata, Proviso, Earmark, CrossReference, CrAnomaly, TransferLimit, FundAvailability, BillClassification, SourceSpan, and all accessor methods on Provision. Also contains BillExtraction::compute_totals().Adding new provision types; adding new fields to existing types; changing budget authority calculation logic.
src/approp/from_value.rs~690Resilient JSON → Provision parsing. Manually walks serde_json::Value trees with fallbacks for missing fields, wrong types, and unknown enum variants. Contains parse_bill_extraction(), parse_provision(), parse_dollar_amount(), and dozens of helper functions. Produces ConversionReport documenting every compromise.Adding new provision types (must add a match arm in parse_provision()); handling new LLM output quirks; adding new fields that need special parsing.

Core: Extraction Pipeline

FileLinesPurposeWhen to Edit
src/approp/extraction.rs~840ExtractionPipeline. Orchestrates the full extraction process: XML parsing → chunk splitting → parallel LLM calls → response parsing → merge → compute totals → verify → write artifacts. Contains TokenTracker, ChunkProgress, build_metadata(), and the parallel streaming logic using futures::stream.Changing the extraction flow; adding new artifact types; modifying chunk processing logic. Rarely edited — extraction is stable.
src/approp/xml.rs~590Congressional bill XML parsing via roxmltree. Extracts clean text with ''quote'' delimiters, identifies <appropriations-major> headings, and splits into chunks at <division> and <title> boundaries. Contains parse_bill_xml(), parse_bill_xml_str(), and the recursive XML tree walker.Handling new XML element types; fixing text extraction edge cases; changing chunk splitting logic.
src/approp/text_index.rs~670Dollar amount indexing. Builds a positional index of every $X,XXX,XXX pattern, section header, and proviso clause in the source text. Used by verification for amount checking and by extraction for chunk boundary computation. Contains TextIndex, ExtractionChunk.Adding new text patterns to index; changing how chunks are bounded.
src/approp/prompts.rs~310System prompt for Claude. The EXTRACTION_SYSTEM constant (~300 lines) defines every provision type, shows real JSON examples, constrains output format, and includes specific instructions for edge cases (CR substitutions, sub-allocations, mandatory spending extensions).Improving extraction quality; adding new provision type definitions; fixing edge case handling. Caution: Changes invalidate all existing extractions — re-extraction is needed for affected bills.
src/approp/progress.rs~170Extraction progress bar rendering. Displays the live dashboard during multi-chunk extraction.Changing the progress display format.
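The core idea behind text_index.rs can be illustrated with a std-only sketch. This is simplified and hypothetical: the real index also records section headers and proviso clauses, and its pattern rules may differ.

```rust
/// Find every `$X,XXX,...` pattern and its byte offset in the source text.
/// Simplified illustration of the positional indexing described above.
pub fn dollar_offsets(text: &str) -> Vec<(usize, String)> {
    let mut out = Vec::new();
    for (i, _) in text.match_indices('$') {
        let tail = &text[i + 1..];
        // Scan forward over digits and thousands separators.
        let len = tail
            .find(|c: char| !(c.is_ascii_digit() || c == ','))
            .unwrap_or(tail.len());
        let mut amount = &tail[..len];
        // Trim a trailing list comma, e.g. "$5,000, and".
        while amount.ends_with(',') {
            amount = &amount[..amount.len() - 1];
        }
        if !amount.is_empty() {
            out.push((i, format!("${amount}")));
        }
    }
    out
}
```

Byte offsets like these are what let verification search for an extracted amount string deterministically, with no LLM in the loop.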

Core: Verification and Quality

FileLinesPurposeWhen to Edit
src/approp/verification.rs~370Deterministic verification. Three checks: (1) dollar amount strings searched in source text, (2) raw_text matched via a three-tier system (exact → normalized → spaceless, falling back to no_match), (3) completeness — percentage of dollar strings in source matched to provisions. Contains verify_extraction(), AmountCheck, RawTextCheck, MatchTier, CheckResult, VerificationReport.Adding new verification checks (e.g., arithmetic checks); changing match tier logic.
src/approp/staleness.rs~165Hash chain checking. Computes SHA-256 of files, compares to stored hashes, returns StaleWarning if mismatched. Contains check(), file_sha256(), StaleWarning enum with ExtractionStale, EmbeddingsStale, and BillMetaStale variants.Adding new staleness checks for additional pipeline artifacts.
Core: Query and Loading

FileLinesPurposeWhen to Edit
src/approp/query.rs~1,300Library API. Pure functions: summarize(), search(), compare(), audit(), relate(), rollup_by_department(), build_embedding_text(), compute_link_hash(). Also contains normalize_agency() (35-entry sub-agency lookup) and normalize_account_name(). The compare() function includes cross-semantics orphan rescue. All functions take &[LoadedBill] and return plain data structs. No I/O, no formatting, no side effects.Adding new query functions; adding new search filter fields; changing budget authority logic; adding new output fields.
src/approp/loading.rs~340Directory walking and bill loading. load_bills() recursively finds extraction.json files, deserializes them along with sibling verification.json, metadata.json, and bill_meta.json, and returns Vec<LoadedBill>.Adding new artifact types to load; changing discovery logic.
src/approp/embeddings.rs~260Embedding storage. load() / save() for the JSON metadata + binary vectors format. cosine_similarity(), normalize(), top_n_similar(). The split JSON+binary format is optimized for fast loading (~2ms for 29 MB).Adding new similarity functions; changing storage format; adding batch operations.
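The similarity math in embeddings.rs reduces to standard cosine similarity. A std-only sketch with an illustrative signature (the real module may differ):

```rust
/// Cosine similarity between two equal-length f32 vectors.
/// Returns 0.0 for zero-magnitude inputs rather than dividing by zero.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}
```

If vectors are pre-normalized (as the normalize() function suggests), the denominator is 1 and this collapses to a dot product.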

API Clients

FileLinesPurposeWhen to Edit
src/api/anthropic/client.rs~340Anthropic API client. Message creation with streaming response handling, thinking/extended thinking support, prompt caching. Uses reqwest with rustls-tls.Adding retry logic; supporting new API features; handling new response formats.
src/api/congress/~850 (total)Congress.gov API client. Bill listing, metadata lookup, text version discovery, XML download. Rate limit handling.Adding new API endpoints; handling pagination edge cases.
src/api/openai/client.rs~45OpenAI API client. Embeddings endpoint only — minimal implementation. Sends batches of text, receives float32 vectors.Adding retry logic; supporting new embedding models; adding new endpoints.

Tests

FileLinesPurposeWhen to Edit
tests/cli_tests.rs~1,20042 integration tests. Runs the actual congress-approp binary against the data/ directory and checks stdout/stderr. Includes budget authority total pinning, search output validation, format tests, enrich/relate/link workflow tests, FY/subcommittee filtering tests, --show-advance verification, and case-insensitive compare tests.Adding tests for new CLI commands or flags; updating expected output when behavior changes intentionally.

In addition to integration tests, most modules contain inline unit tests in #[cfg(test)] mod tests { } blocks at the bottom of the file.

Data Flow Diagrams

How search --semantic flows through the code

main.rs: main()
  → match Commands::Search
  → handle_search()           [detects semantic.is_some()]
  → handle_semantic_search()  [async]
    → loading::load_bills()   [finds extraction.json files]
    → embeddings::load()      [for each bill directory]
    → OpenAIClient::embed()   [embeds query text — single API call, ~100ms]
    → for each provision:
        apply hard filters (type, division, dollars, etc.)
        cosine_similarity(query_vector, provision_vector)
    → sort by similarity descending
    → truncate to --top N
    → format output (table/json/jsonl/csv)

How extract flows through the code

main.rs: main()
  → match Commands::Extract
  → handle_extract()                     [async]
    → xml::parse_bill_xml()              [parse XML, get clean text + chunks]
    → ExtractionPipeline::new()
    → pipeline.extract_parallel()        [sends chunks to Claude in parallel]
      → for each chunk (bounded concurrency):
          AnthropicClient::create_message()
          from_value::parse_bill_extraction()
          save chunk artifacts to chunks/
    → merge provisions from all chunks
    → BillExtraction::compute_totals()   [sums provisions, never LLM arithmetic]
    → verification::verify_extraction()  [deterministic string matching]
    → write extraction.json, verification.json, metadata.json, tokens.json

How --similar flows through the code

main.rs: main()
  → match Commands::Search
  → handle_search()
  → handle_semantic_search()  [same entry point as --semantic]
    → loading::load_bills()
    → embeddings::load()      [for each bill]
    → look up source provision vector from stored vectors.bin  [NO API call]
    → cosine_similarity against all other provisions
    → sort, filter, truncate, format

Key Patterns to Follow

1. Library function first, CLI second

New logic goes in query.rs (or a new module). The CLI handler in main.rs calls the library function and formats output. Never put business logic in main.rs.

2. All query functions take &[LoadedBill] and return structs

No I/O, no formatting, no side effects in library code. All output structs derive Serialize for JSON output.

#![allow(unused)]
fn main() {
// Good:
pub fn my_query(bills: &[LoadedBill]) -> Vec<MyResult> { ... }

// Bad:
pub fn my_query(dir: &Path) -> Result<()> { ... }  // Does I/O
}

3. Serde for everything

All data types derive Serialize and Deserialize. This enables JSON, JSONL, and CSV output for free.

4. Tests in the same file

Unit tests go in #[cfg(test)] mod tests { } at the bottom of each module. Integration tests go in tests/cli_tests.rs.
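A sketch of the inline-test convention (the function and values are illustrative, not taken from the crate):

```rust
/// Example pure helper with its unit test co-located in the same file.
/// Inserts thousands separators: 1234567 -> "1,234,567".
pub fn format_dollars_plain(dollars: i64) -> String {
    let s = dollars.abs().to_string();
    let mut out = String::new();
    for (i, c) in s.chars().enumerate() {
        if i > 0 && (s.len() - i) % 3 == 0 {
            out.push(',');
        }
        out.push(c);
    }
    if dollars < 0 { format!("-{out}") } else { out }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn formats_thousands() {
        assert_eq!(format_dollars_plain(1_234_567), "1,234,567");
        assert_eq!(format_dollars_plain(-16_000_000_000), "-16,000,000,000");
    }
}
```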

5. Clippy clean with -D warnings

Clippy treats warnings as errors in CI. Fix all clippy suggestions at the root cause — don’t suppress with #[allow] unless absolutely necessary. Use #[allow(clippy::too_many_arguments)] sparingly.

6. Format with cargo fmt before committing

The CI rejects improperly formatted code.

Existing CLI Command Definitions

For reference when adding new commands, here are the existing command patterns in main.rs:

congress-approp download   --congress N --type T --number N --output-dir DIR [--enacted-only] [--format F] [--version V] [--dry-run]
congress-approp extract    --dir DIR [--parallel N] [--model M] [--dry-run]
congress-approp embed      --dir DIR [--model M] [--dimensions D] [--batch-size N] [--dry-run]
congress-approp search     --dir DIR [-t TYPE] [-a AGENCY] [--account A] [-k KW] [--bill B] [--division D] [--min-dollars N] [--max-dollars N] [--semantic Q] [--similar S] [--top N] [--format F] [--list-types]
congress-approp summary    --dir DIR [--format F] [--by-agency]
congress-approp compare    --base DIR --current DIR [-a AGENCY] [--format F]
congress-approp audit      --dir DIR [--verbose]
congress-approp upgrade    --dir DIR [--dry-run]
congress-approp api test
congress-approp api bill list --congress N [--type T] [--offset N] [--limit N] [--enacted-only]
congress-approp api bill get --congress N --type T --number N
congress-approp api bill text --congress N --type T --number N

Recently Shipped Modules

These modules were designed on the roadmap ahead of implementation and have since shipped:

FilePurposeStatus
src/approp/bill_meta.rsBill metadata types, XML parsing, jurisdiction classification, FY-aware advance detection, account normalization (33 unit tests)Shipped in v4.0
src/approp/links.rsCross-bill link types, suggest algorithm, accept/remove, load/save for links/links.json (10 unit tests)Shipped in v4.0
relate commandDeep-dive on one provision across all bills with FY timeline, confidence tiers, and deterministic link hashesShipped in v4.0

See NEXT_STEPS.md (gitignored) for detailed implementation plans.


Adding a New Provision Type

This guide walks through the complete process of adding a new provision type to the extraction schema. It’s the most common contributor task and touches seven files across the codebase. We’ll use a hypothetical authorization_extension type as a worked example.

When You Need This

Add a new provision type when the existing 11 types don’t adequately capture a recurring legislative pattern. Signs that a new type is warranted:

  • Multiple other provisions share a pattern. If you see 20+ provisions in the other catch-all with similar llm_classification values, they probably deserve their own type.
  • The pattern has distinct fields. A new type should have at least one field that doesn’t exist on any current type. If it can be fully represented by an existing type’s fields, consider improving the LLM prompt to classify it correctly instead of adding a new type.
  • The pattern recurs across bills. A one-off provision in a single bill doesn’t justify a new type. A pattern that appears in every omnibus does.

The Checklist (7 Files)

Every new provision type requires changes in these files, in this order:

StepFileWhat to Add
1src/approp/ontology.rsNew variant on the Provision enum with type-specific fields
2src/approp/ontology.rsAccessor method arms for the new variant (raw_text, section, etc.)
3src/approp/from_value.rsMatch arm in parse_provision() for the new type
4src/approp/prompts.rsType definition and example in the LLM system prompt
5src/main.rsTable rendering for the new type; add to KNOWN_PROVISION_TYPES
6src/approp/query.rsUpdate search/summary logic if the type has special display needs
7tests/cli_tests.rsIntegration test for the new type

Step 1: Add the Enum Variant (ontology.rs)

Add a new variant to the Provision enum. Every variant must include the common fields (section, division, title, confidence, raw_text, notes, cross_references) plus its type-specific fields.

#![allow(unused)]
fn main() {
// In src/approp/ontology.rs, inside the Provision enum:

AuthorizationExtension {
    /// The program being reauthorized
    #[serde(default)]
    program_name: String,
    /// The statute being extended
    #[serde(default)]
    statutory_reference: String,
    /// New authorization level, if specified
    #[serde(default)]
    amount: Option<DollarAmount>,
    /// How long the authorization is extended
    #[serde(default)]
    extension_period: Option<String>,
    /// New expiration date or fiscal year
    #[serde(default)]
    expires: Option<String>,
    // Common fields (must be on every variant):
    #[serde(default)]
    section: String,
    #[serde(default)]
    division: Option<String>,
    #[serde(default)]
    title: Option<String>,
    #[serde(default)]
    confidence: f32,
    #[serde(default)]
    raw_text: String,
    #[serde(default)]
    notes: Vec<String>,
    #[serde(default)]
    cross_references: Vec<CrossReference>,
},
}

Important conventions

  • Use #[serde(default)] on every field. This ensures that missing fields in JSON input get their default values rather than causing a deserialization error.
  • Use Option<T> for fields that may not always be present.
  • Use String (not &str) for owned text fields.
  • Include all 7 common fields. The accessor methods expect them on every variant.

Step 2: Add Accessor Method Arms (ontology.rs)

Every accessor method on Provision exhaustively matches all variants. You must add a match arm for your new variant to each one. The compiler will tell you which methods are missing — look for “non-exhaustive patterns” errors.

Key methods that need arms:

#![allow(unused)]
fn main() {
// raw_text() — returns &str
Provision::AuthorizationExtension { raw_text, .. } => raw_text,

// section() — returns &str
Provision::AuthorizationExtension { section, .. } => section,

// division() — returns Option<&str>
Provision::AuthorizationExtension { division, .. } => division,

// title() — returns Option<&str>
Provision::AuthorizationExtension { title, .. } => title,

// confidence() — returns f32
Provision::AuthorizationExtension { confidence, .. } => *confidence,

// notes() — returns &[String]
Provision::AuthorizationExtension { notes, .. } => notes,

// cross_references() — returns &[CrossReference]
Provision::AuthorizationExtension { cross_references, .. } => cross_references,

// account_name() — returns &str
// If your type has an account_name field, return it. Otherwise return "".
Provision::AuthorizationExtension { .. } => "",

// agency() — returns &str
// Same pattern — return "" if not applicable.
Provision::AuthorizationExtension { .. } => "",

// amount() — returns Option<&DollarAmount>
Provision::AuthorizationExtension { amount, .. } => amount.as_ref(),

// description() — return a meaningful description
Provision::AuthorizationExtension { program_name, .. } => program_name,

// provision_type_str() — returns &str
Provision::AuthorizationExtension { .. } => "authorization_extension",
}

Tip: Let the compiler guide you

After adding the variant, run cargo build. The compiler will emit errors for every match expression that doesn’t cover the new variant. Fix them one by one — this is faster and more reliable than trying to find all match sites manually.

Step 3: Add Parsing Logic (from_value.rs)

In from_value.rs, the parse_provision() function has a match provision_type.as_str() block that dispatches to type-specific parsing. Add a new arm:

#![allow(unused)]
fn main() {
"authorization_extension" => Ok(Provision::AuthorizationExtension {
    program_name: get_str_or_warn(obj, "program_name", report),
    statutory_reference: get_str_or_warn(obj, "statutory_reference", report),
    amount: parse_dollar_amount(obj.get("amount"), report),
    extension_period: get_opt_str(obj, "extension_period"),
    expires: get_opt_str(obj, "expires"),
    section,
    division,
    title,
    confidence,
    raw_text,
    notes,
    cross_references,
}),
}

Parsing conventions

  • Use get_str(obj, "field") for required string fields that default to empty string if missing
  • Use get_str_or_warn(obj, "field", report) for string fields where absence should be logged
  • Use get_opt_str(obj, "field") for optional string fields (returns Option<String>)
  • Use get_opt_u32(obj, "field") for optional integers
  • Use parse_dollar_amount(obj.get("amount"), report) for dollar amount fields
  • Use get_string_array(obj, "field") for arrays of strings

Before you add this arm, the existing unknown => fallback at the bottom of the match catches any provision the LLM outputs with your new type name and wraps it as Provision::Other with the original classification preserved. This means historical extractions that already contain your new type (classified as other) still load correctly. After you add the arm and run the upgrade command, they'll be re-parsed into the proper new variant.

Step 4: Update the System Prompt (prompts.rs)

In prompts.rs, the EXTRACTION_SYSTEM constant contains the instructions for Claude. Add your new type to the PROVISION TYPES section:

- authorization_extension: Extension or reauthorization of an existing program's authorization
  - MUST have program_name (the program being reauthorized)
  - MUST have statutory_reference (the statute being amended)
  - May have an amount (new authorization level) and extension_period

Also add a JSON example in the examples section of the prompt:

{
  "provision_type": "authorization_extension",
  "program_name": "Community Health Centers",
  "statutory_reference": "Section 330 of the Public Health Service Act (42 U.S.C. 254b)",
  "amount": {
    "value": {"kind": "specific", "dollars": 4000000000},
    "semantics": "mandatory_spending",
    "text_as_written": "$4,000,000,000"
  },
  "extension_period": "2 years",
  "expires": "September 30, 2026",
  "section": "SEC. 201",
  "division": "B",
  "confidence": 0.95,
  "raw_text": "Section 330(r)(1) of the Public Health Service Act is amended by striking '2024' and inserting '2026'."
}

Caution: Changing the system prompt invalidates all existing extractions. Bills extracted with the old prompt won’t have provisions classified under the new type — they’ll be in the other catch-all or classified as something else. You’ll need to re-extract any bills where you want the new type to be used. The upgrade command can re-parse existing data but cannot re-classify provisions — that requires re-extraction.

Step 5: Update CLI Display (main.rs)

Add to KNOWN_PROVISION_TYPES

In main.rs, find the KNOWN_PROVISION_TYPES constant (around line 943) and add your new type:

#![allow(unused)]
fn main() {
const KNOWN_PROVISION_TYPES: &[(&str, &str)] = &[
    ("appropriation", "Budget authority grant"),
    ("rescission", "Cancellation of prior budget authority"),
    // ... existing types ...
    ("authorization_extension", "Extension of program authorization"),
    ("other", "Unclassified provisions"),
];
}

This makes the new type appear in --list-types output.

Update table rendering

If your type needs special table columns (like CR substitutions show New/Old/Delta), add the rendering logic in the handle_search function. If it uses the standard display (Description/Account, Amount, Section, Div), no changes are needed — the default rendering handles it.

Update the Match struct

In the Match struct within handle_search, ensure the new type’s fields are mapped correctly to the output fields (account_name, description, dollars, etc.).

Step 6: Update Query Logic (query.rs)

If your new type:

  • Should contribute to budget authority totals — update BillExtraction::compute_totals() in ontology.rs
  • Has special search display needs — update search() in query.rs to include the type in relevant filters
  • Should appear in comparisons — update compare() in query.rs if the type should be matched across bills

For most new types, no changes to query.rs are needed — the existing search filter (--type authorization_extension) will work automatically because the filter matches against provision_type_str().
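The filter match described above amounts to roughly the following (illustrative; the case-insensitivity here is an assumption, not confirmed for the type filter):

```rust
/// A --type filter passes when absent, or when it matches the
/// provision's type string. Case-insensitive comparison is assumed.
fn matches_type_filter(provision_type: &str, filter: Option<&str>) -> bool {
    filter.map_or(true, |f| provision_type.eq_ignore_ascii_case(f))
}
```

Because the comparison key is whatever provision_type_str() returns, a new variant is filterable the moment that accessor arm exists.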

Step 7: Add Tests

Unit test (ontology.rs)

Add a test in the #[cfg(test)] mod tests block at the bottom of ontology.rs:

#![allow(unused)]
fn main() {
#[test]
fn authorization_extension_round_trip() {
    let json = r#"{
        "provision_type": "authorization_extension",
        "program_name": "Test Program",
        "statutory_reference": "Section 100 of Test Act",
        "section": "SEC. 201",
        "confidence": 0.95,
        "raw_text": "Test raw text"
    }"#;

    let prov: Provision = serde_json::from_str(json).unwrap();
    assert_eq!(prov.provision_type_str(), "authorization_extension");
    assert_eq!(prov.section(), "SEC. 201");
    assert_eq!(prov.raw_text(), "Test raw text");
}
}

Integration test (cli_tests.rs)

If the example data contains provisions that would be classified under your new type, add a test. Otherwise, the existing tests should still pass — your changes shouldn’t affect the example data’s provision counts or budget totals.

Critical: Run the budget authority regression test:

cargo test budget_authority_totals_match_expected

If this fails, your changes inadvertently affected the budget authority calculation. The expected values are:

BillBudget AuthorityRescissions
H.R. 4366$846,137,099,554$24,659,349,709
H.R. 5860$16,000,000,000$0
H.R. 9468$2,882,482,000$0

Testing Your Changes

Run the full test cycle:

cargo fmt                           # Format code
cargo fmt --check                   # Verify formatting
cargo clippy -- -D warnings         # Lint (CI treats warnings as errors)
cargo test                          # Run all tests (130 unit + 42 integration)

All four must pass before committing.

Backward Compatibility

Adding a new provision type is backward-compatible by design:

  • Old data loads fine. Provisions in existing extraction.json files that were classified as other (because the new type didn’t exist yet) will continue to load as other. The from_value.rs unknown => arm catches them.
  • The upgrade command helps. After adding the new type, running upgrade re-deserializes existing data through the updated parsing logic. If any other provisions have llm_classification matching your new type name, they’ll be re-parsed into the proper variant.
  • Re-extraction is optional. Only needed if you want the LLM to actively use the new type (which requires the updated prompt).

What NOT to Do

  1. Don’t add a type for a single provision. If only one provision in one bill would use the type, leave it as other — the catch-all exists for exactly this purpose.

  2. Don’t duplicate existing types. Before adding a new type, check whether the pattern is actually a variant of an existing type (e.g., a limitation with special characteristics, or an appropriation with a unique availability pattern).

  3. Don’t add fields to existing types unless you also handle missing fields in from_value.rs. Existing extractions won’t have the new field, so #[serde(default)] is mandatory.

  4. Don’t suppress clippy warnings with #[allow]. Fix them at the root cause. The CI rejects code with clippy warnings.

Summary Checklist

  • Added variant to Provision enum in ontology.rs with all common fields
  • Added match arms to all accessor methods in ontology.rs
  • Added parsing arm in parse_provision() in from_value.rs
  • Added type definition and example in EXTRACTION_SYSTEM prompt in prompts.rs
  • Added to KNOWN_PROVISION_TYPES in main.rs
  • Updated table rendering in main.rs if needed
  • Updated query.rs if the type has special search/compare/summary behavior
  • Added unit test for round-trip serialization
  • Verified budget authority totals unchanged: cargo test budget_authority_totals_match_expected
  • Full test cycle passes: cargo fmt --check && cargo clippy -- -D warnings && cargo test


Adding a New CLI Command

This guide walks through the process of adding a new subcommand to congress-approp. The pattern is consistent: define the command in clap, write a library function, create a CLI handler, and add tests.

Overview

Every CLI command follows the same architecture:

1. Define command + flags     →  main.rs (Commands enum, clap derive)
2. Write library function     →  query.rs or new module (pure function, no I/O)
3. Write CLI handler          →  main.rs (parse args → call library → format output)
4. Wire into main()           →  main.rs (match arm in the main dispatch)
5. Add integration test       →  tests/cli_tests.rs
6. Update documentation       →  book/src/reference/cli.md + relevant chapters

The key principle: library function first, CLI second. The library function does the computation; the CLI handler does the I/O and formatting.

Step 1: Define the Command (main.rs)

Add a new variant to the Commands enum with clap derive attributes:

#![allow(unused)]
fn main() {
// In the Commands enum in main.rs:

/// Show the top N provisions by dollar amount
Top {
    /// Data directory containing extracted bills
    #[arg(long, default_value = "./data")]
    dir: String,

    /// Number of provisions to show
    #[arg(long, short = 'n', default_value = "10")]
    count: usize,

    /// Filter by provision type
    #[arg(long, short = 't')]
    r#type: Option<String>,

    /// Output format: table, json, jsonl, csv
    #[arg(long, default_value = "table")]
    format: String,

    /// Enable verbose logging
    #[arg(short, long)]
    verbose: bool,
},
}

Conventions for flags

PatternConvention
Data directory--dir with default "./data"
Output format--format with default "table", options: table, json, jsonl, csv
Provision type filter--type / -t (use r#type for the Rust keyword)
Agency filter--agency / -a
Dry run--dry-run flag
Verbose-v / --verbose (also available as global flag)

Look at existing commands for consistent naming and help text style.

Step 2: Write the Library Function (query.rs)

Add a pure function to src/approp/query.rs that takes &[LoadedBill] and returns a data struct:

#![allow(unused)]
fn main() {
// In src/approp/query.rs:

/// A provision ranked by dollar amount.
#[derive(Debug, Serialize)]
pub struct TopProvision {
    pub bill_identifier: String,
    pub provision_index: usize,
    pub provision_type: String,
    pub account_name: String,
    pub agency: String,
    pub dollars: i64,
    pub semantics: String,
    pub section: String,
    pub division: String,
}

/// Return the top N provisions by absolute dollar amount.
pub fn top_provisions(
    bills: &[LoadedBill],
    count: usize,
    provision_type: Option<&str>,
) -> Vec<TopProvision> {
    let mut results: Vec<TopProvision> = Vec::new();

    for loaded in bills {
        let bill_id = &loaded.extraction.bill.identifier;

        for (i, p) in loaded.extraction.provisions.iter().enumerate() {
            // Apply type filter
            if let Some(ptype) = provision_type {
                if p.provision_type_str() != ptype {
                    continue;
                }
            }

            // Only include provisions with specific dollar amounts
            if let Some(amt) = p.amount() {
                if let Some(dollars) = amt.dollars() {
                    results.push(TopProvision {
                        bill_identifier: bill_id.clone(),
                        provision_index: i,
                        provision_type: p.provision_type_str().to_string(),
                        account_name: p.account_name().to_string(),
                        agency: p.agency().to_string(),
                        dollars,
                        semantics: format!("{}", amt.semantics),
                        section: p.section().to_string(),
                        division: p.division().unwrap_or("").to_string(),
                    });
                }
            }
        }
    }

    // Sort by absolute dollar amount descending
    results.sort_by(|a, b| b.dollars.abs().cmp(&a.dollars.abs()));
    results.truncate(count);
    results
}
}

Library function conventions

  • Take &[LoadedBill] — never a file path. I/O is the CLI’s job.
  • Return a struct that derives Serialize — enables JSON/JSONL/CSV output for free.
  • No formatting, no printing, no side effects.
  • Document with doc comments (///) — these appear in cargo doc output.

Step 3: Write the CLI Handler (main.rs)

Create a handler function in main.rs that bridges the CLI arguments to the library function and formats the output:

#![allow(unused)]
fn main() {
fn handle_top(dir: &str, count: usize, ptype: Option<&str>, format: &str) -> Result<()> {
    let start = Instant::now();
    let bills = loading::load_bills(Path::new(dir))?;

    if bills.is_empty() {
        println!("No extracted bills found in {dir}");
        return Ok(());
    }

    let results = query::top_provisions(&bills, count, ptype);

    match format {
        "json" => {
            println!("{}", serde_json::to_string_pretty(&results)?);
        }
        "jsonl" => {
            for r in &results {
                println!("{}", serde_json::to_string(r)?);
            }
        }
        "csv" => {
            let mut wtr = csv::Writer::from_writer(std::io::stdout());
            for r in &results {
                wtr.serialize(r)?;
            }
            wtr.flush()?;
        }
        _ => {
            // Table output
            let mut table = Table::new();
            table.load_preset(UTF8_FULL_CONDENSED);
            table.set_header(vec![
                Cell::new("Bill"),
                Cell::new("Type"),
                Cell::new("Account"),
                Cell::new("Amount ($)").set_alignment(CellAlignment::Right),
                Cell::new("Section"),
                Cell::new("Div"),
            ]);

            for r in &results {
                table.add_row(vec![
                    Cell::new(&r.bill_identifier),
                    Cell::new(&r.provision_type),
                    Cell::new(truncate(&r.account_name, 45)),
                    Cell::new(format_dollars(r.dollars))
                        .set_alignment(CellAlignment::Right),
                    Cell::new(&r.section),
                    Cell::new(&r.division),
                ]);
            }

            println!("{table}");
            println!("\n{} provisions shown", results.len());
        }
    }

    tracing::debug!("Completed in {:?}", start.elapsed());
    Ok(())
}
}

Handler conventions

  • Name: handle_<command> (e.g., handle_top)
  • Signature: Takes parsed arguments as simple types, returns Result<()>
  • Pattern: Load bills → call library function → format output based on --format flag
  • Table formatting: Use comfy_table with UTF8_FULL_CONDENSED preset (matching existing commands)
  • Timing: Use Instant::now() + tracing::debug! for elapsed time (visible with -v)
  • Empty results: Handle gracefully with a message, don’t panic

Async or sync?

  • If your handler makes no API calls, make it a regular fn (sync).
  • If it needs to call an external API (like handle_embed or handle_semantic_search), make it async fn and .await the API calls.

Important: Don’t use block_on() inside an async function — this causes “cannot start a runtime from within a runtime” panics. If your handler is async, the entire call chain from main() must use .await.

Step 4: Wire into main() Dispatch

In the main() function, add a match arm for your new command:

#![allow(unused)]
fn main() {
// In the main() function's match on cli.command:

Commands::Top {
    dir,
    count,
    r#type,
    format,
    verbose: _,
} => {
    handle_top(&dir, count, r#type.as_deref(), &format)?;
}
}

For async handlers:

#![allow(unused)]
fn main() {
Commands::Top { dir, count, r#type, format, verbose: _ } => {
    handle_top(&dir, count, r#type.as_deref(), &format).await?;
}
}

Step 5: Add Integration Tests (cli_tests.rs)

Add tests in tests/cli_tests.rs that run the actual binary against the example data:

#![allow(unused)]
fn main() {
// In tests/cli_tests.rs:

#[test]
fn top_runs_successfully() {
    cmd()
        .args(["top", "--dir", "data", "-n", "5"])
        .assert()
        .success()
        .stdout(predicates::str::contains("H.R. 4366"));
}

#[test]
fn top_json_output_is_valid() {
    let output = cmd()
        .args(["top", "--dir", "data", "-n", "3", "--format", "json"])
        .output()
        .unwrap();

    assert!(output.status.success());
    let stdout = str::from_utf8(&output.stdout).unwrap();
    let data: Vec<serde_json::Value> = serde_json::from_str(stdout).unwrap();
    assert_eq!(data.len(), 3);

    // Verify the top result has the largest dollar amount
    let first_dollars = data[0]["dollars"].as_i64().unwrap();
    let second_dollars = data[1]["dollars"].as_i64().unwrap();
    assert!(first_dollars.abs() >= second_dollars.abs());
}

#[test]
fn top_with_type_filter() {
    let output = cmd()
        .args(["top", "--dir", "data", "-n", "5", "--type", "rescission", "--format", "json"])
        .output()
        .unwrap();

    assert!(output.status.success());
    let stdout = str::from_utf8(&output.stdout).unwrap();
    let data: Vec<serde_json::Value> = serde_json::from_str(stdout).unwrap();

    for entry in &data {
        assert_eq!(entry["provision_type"].as_str().unwrap(), "rescission");
    }
}
}

Test conventions

  • Use the cmd() helper function (defined at the top of cli_tests.rs) to get a Command for the binary
  • Test with --dir data to use the included example data
  • Test all output formats (table, json, csv)
  • Test filter combinations
  • Verify JSON output parses correctly
  • Never change the expected budget authority totals — the budget_authority_totals_match_expected test is a critical regression guard

Step 6: Update Documentation

CLI Reference (book/src/reference/cli.md)

Add a section for your new command following the existing format:

## top

Show the top N provisions by dollar amount.

\`\`\`text
congress-approp top [OPTIONS]
\`\`\`

| Flag | Short | Type | Default | Description |
|------|-------|------|---------|-------------|
| `--dir` | | path | `./data` | Data directory |
| `--count` | `-n` | integer | `10` | Number of provisions to show |
| `--type` | `-t` | string | — | Filter by provision type |
| `--format` | | string | `table` | Output format: table, json, jsonl, csv |

### Examples

\`\`\`bash
congress-approp top --dir data -n 5
congress-approp top --dir data -n 10 --type rescission
congress-approp top --dir data -n 20 --format csv > top_provisions.csv
\`\`\`

Other documentation

  • Update the SUMMARY.md table of contents if the command deserves its own how-to guide
  • Add a mention in what-this-tool-does.md if the command represents a significant new capability
  • Update the CHANGELOG.md with the new feature

Complete Test Cycle

Before committing, run the full test cycle:

cargo fmt                           # Format code
cargo fmt --check                   # Verify formatting (CI does this)
cargo clippy -- -D warnings         # Lint (CI treats warnings as errors)
cargo test                          # Run all tests

# Data integrity check (budget totals must be unchanged):
./target/release/congress-approp summary --dir data --format json | python3 -c "
import sys, json
expected = {'H.R. 4366': 846137099554, 'H.R. 5860': 16000000000, 'H.R. 9468': 2882482000}
for b in json.load(sys.stdin):
    assert b['budget_authority'] == expected[b['identifier']]
print('Data integrity: OK')
"

All must pass. The CI runs fmt --check, clippy -D warnings, and cargo test on every push.

Commit Message Format

Add `top` command — show provisions ranked by dollar amount

Adds a new CLI subcommand that ranks provisions by absolute dollar
amount across all loaded bills. Supports --type filter and all
output formats (table/json/jsonl/csv).

Library function: query::top_provisions()
CLI handler: handle_top()

Verified:
- cargo fmt/clippy/test: clean, 98 tests pass (77 unit + 21 integration)
- Budget totals unchanged: $846B/$16B/$2.9B

Gotchas

  1. handle_search is async because the --semantic path calls OpenAI. If your new command doesn’t call any APIs, keep it sync — don’t make it async just because other handlers are.

  2. The format_dollars and truncate helper functions are in main.rs (not in a shared module). You can use them directly in your handler.

  3. Provision accessor methods return &str, not Option<&str> in some cases. p.account_name() returns "" (not None) for provisions without accounts. Check with .is_empty() if you need to handle the empty case.

  4. The r#type naming is required because type is a Rust keyword. Use r#type in the struct definition and r#type.as_deref() when passing to functions that expect Option<&str>.

  5. CSV output uses serde_json::to_string(r)? for each row in some handlers, but the cleaner approach is csv::Writer::from_writer with wtr.serialize(r)? as shown above. Make sure your output struct derives Serialize.

  6. Run cargo install --path . after making changes to test the actual installed binary (integration tests use the debug binary from cargo test, not the installed release binary).
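Gotcha 3 in practice: a minimal, self-contained sketch of handling the empty-string case. The account_label helper is hypothetical (not part of the codebase); it only illustrates checking .is_empty() instead of matching on None:

```rust
// Hypothetical display helper illustrating gotcha 3: accessors return ""
// rather than None for provisions without accounts, so test emptiness.
fn account_label(account_name: &str) -> &str {
    if account_name.is_empty() {
        "(no account)" // placeholder shown in table output
    } else {
        account_name
    }
}

fn main() {
    assert_eq!(account_label(""), "(no account)");
    assert_eq!(account_label("Operations and Support"), "Operations and Support");
    println!("ok");
}
```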

Example: Reviewing Existing Commands

The best way to learn the patterns is to read existing handlers. Start with these as templates:

| If your command is like… | Study this handler |
|------|------|
| Read-only query, no API calls | handle_summary() (~160 lines, sync) |
| Query with filters | handle_search() (~530 lines, async because of semantic path) |
| Two-directory comparison | handle_compare() (~210 lines, sync) |
| API-calling command | handle_embed() (~120 lines, async) |
| Schema migration command | handle_upgrade() (~150 lines, sync) |

Next Steps

Testing Strategy

This chapter explains how the test suite is structured, how to run tests, what the key regression guards are, and how to add tests for new features.

Test Overview

The project has two categories of tests:

| Category | Location | Count | What They Test |
|------|------|------|------|
| Unit tests | Inline #[cfg(test)] mod tests in each module | ~130 | Individual functions, type round-trips, parsing logic, classification, link management |
| Integration tests | tests/cli_tests.rs | 42 | Full CLI commands against the data/ example data, including enrich, relate, the link workflow, FY/subcommittee filtering, --show-advance, and case-insensitive compare |
| Total | | ~172 | |
All tests run with cargo test and must pass before every commit.

Running Tests

Full test cycle (do this before every commit)

cargo fmt                           # Format code
cargo fmt --check                   # Verify formatting (CI does this)
cargo clippy -- -D warnings         # Lint (CI treats warnings as errors)
cargo test                          # Run all tests

All four must pass. The CI runs fmt --check, clippy -D warnings, and cargo test on every push to main and every pull request.

Running specific tests

# Run only unit tests
cargo test --lib

# Run only integration tests
cargo test --test cli_tests

# Run a specific test by name
cargo test budget_authority_totals

# Run tests with output visible (normally captured)
cargo test -- --nocapture

# Run tests matching a pattern
cargo test search

Testing with verbose output

# See which tests are running
cargo test -- --test-threads=1

# See stdout/stderr from tests
cargo test -- --nocapture

The Critical Regression Guard

The single most important test in the suite is budget_authority_totals_match_expected:

#![allow(unused)]
fn main() {
#[test]
fn budget_authority_totals_match_expected() {
    let output = cmd()
        .args(["summary", "--dir", "data", "--format", "json"])
        .output()
        .unwrap();

    assert!(output.status.success());
    let stdout = str::from_utf8(&output.stdout).unwrap();
    let data: Vec<serde_json::Value> = serde_json::from_str(stdout).unwrap();

    let expected: Vec<(&str, i64, i64)> = vec![
        ("H.R. 4366", 846_137_099_554, 24_659_349_709),
        ("H.R. 5860", 16_000_000_000, 0),
        ("H.R. 9468", 2_882_482_000, 0),
    ];

    for (bill, expected_ba, expected_resc) in &expected {
        let entry = data
            .iter()
            .find(|b| b["identifier"].as_str().unwrap() == *bill)
            .unwrap_or_else(|| panic!("Missing bill: {bill}"));

        let ba = entry["budget_authority"].as_i64().unwrap();
        let resc = entry["rescissions"].as_i64().unwrap();

        assert_eq!(ba, *expected_ba, "{bill} budget authority mismatch");
        assert_eq!(resc, *expected_resc, "{bill} rescissions mismatch");
    }
}
}

This test hardcodes the exact budget authority and rescission totals for every example bill:

| Bill | Budget Authority | Rescissions |
|------|------|------|
| H.R. 4366 | $846,137,099,554 | $24,659,349,709 |
| H.R. 5860 | $16,000,000,000 | $0 |
| H.R. 9468 | $2,882,482,000 | $0 |

Any change to the extraction data, the compute_totals() function, the provision parsing logic, or the budget authority calculation that would alter these numbers is caught immediately. This is the tool’s financial integrity guard.

If this test fails, stop and investigate. Either the change was intentional (and the test values need updating with justification) or the change introduced a regression in the budget authority calculation.
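The exclusion rule this guard protects can be sketched in a few lines. This is a simplified illustration, not the real compute_totals() (which also checks amount semantics such as new_budget_authority); the Prov struct and field names here are hypothetical:

```rust
// Simplified sketch: sub-allocations and proviso amounts are breakdowns
// of a parent account, not additional money, so only the remaining
// provisions count toward budget authority.
struct Prov {
    detail_level: &'static str,
    dollars: i64,
}

fn budget_authority(provisions: &[Prov]) -> i64 {
    provisions
        .iter()
        .filter(|p| p.detail_level != "sub_allocation" && p.detail_level != "proviso_amount")
        .map(|p| p.dollars)
        .sum()
}

fn main() {
    let provisions = [
        Prov { detail_level: "top_level", dollars: 1_000_000 },
        Prov { detail_level: "sub_allocation", dollars: 400_000 }, // excluded
    ];
    assert_eq!(budget_authority(&provisions), 1_000_000);
    println!("ok");
}
```

A change that accidentally lets sub-allocations through would double-count money and shift the hardcoded totals, which is exactly what the test catches.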

Unit Test Patterns

Unit tests are inline in each module, in a #[cfg(test)] mod tests block at the bottom of the file:

#![allow(unused)]
fn main() {
// Example from ontology.rs:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn provision_round_trip_appropriation() {
        let json = r#"{
            "provision_type": "appropriation",
            "account_name": "Test Account",
            "agency": "Test Agency",
            "amount": {
                "value": {"kind": "specific", "dollars": 1000000},
                "semantics": "new_budget_authority",
                "text_as_written": "$1,000,000"
            },
            "detail_level": "top_level",
            "section": "SEC. 101",
            "confidence": 0.95,
            "raw_text": "For necessary expenses..."
        }"#;

        let p: Provision = serde_json::from_str(json).unwrap();
        assert_eq!(p.provision_type_str(), "appropriation");
        assert_eq!(p.account_name(), "Test Account");
        assert_eq!(p.section(), "SEC. 101");

        // Round-trip: serialize back to JSON and re-parse
        let serialized = serde_json::to_string(&p).unwrap();
        let p2: Provision = serde_json::from_str(&serialized).unwrap();
        assert_eq!(p2.provision_type_str(), "appropriation");
        assert_eq!(p2.account_name(), "Test Account");
    }

    #[test]
    fn compute_totals_excludes_sub_allocations() {
        // Create a bill extraction with a top-level and sub-allocation
        // Verify that only top-level counts toward BA
        // ...
    }
}
}

What to unit test

| Module | What to Test |
|------|------|
| ontology.rs | Provision serialization round-trips, compute_totals() with various scenarios, accessor methods |
| from_value.rs | Resilient parsing: missing fields, wrong types, unknown provision types, edge cases |
| verification.rs | Amount checking logic, raw text matching tiers, completeness calculation |
| embeddings.rs | Cosine similarity, vector normalization, load/save round-trip |
| staleness.rs | Hash computation, staleness detection |
| query.rs | Search filters, compare matching, summarize aggregation, rollup logic |
| xml.rs | XML parsing edge cases, chunk splitting |
| text_index.rs | Dollar pattern detection, section header detection |

Unit test conventions

  1. Place tests at the bottom of the module they test, in #[cfg(test)] mod tests { use super::*; ... }
  2. Name tests descriptively — compute_totals_excludes_sub_allocations is better than test_compute
  3. Test edge cases — empty inputs, null fields, zero-dollar amounts, maximum values
  4. Use real-world-ish data — test with provision structures similar to what the LLM actually produces
  5. Keep tests fast — no file I/O, no network calls, no sleeping
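Conventions 2 and 3 applied to a concrete helper. The truncate function below is a hypothetical stand-in (similar in spirit to the helper in main.rs), used only to show descriptive test names and edge-case coverage with no I/O:

```rust
// Hypothetical helper: shorten a string to `max` characters,
// marking any cut with an ellipsis.
fn truncate(s: &str, max: usize) -> String {
    if s.chars().count() <= max {
        s.to_string()
    } else {
        let head: String = s.chars().take(max.saturating_sub(1)).collect();
        format!("{head}…")
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn truncate_leaves_short_strings_unchanged() {
        assert_eq!(truncate("Defense", 45), "Defense");
    }

    #[test]
    fn truncate_handles_empty_input() {
        assert_eq!(truncate("", 45), "");
    }

    #[test]
    fn truncate_marks_cut_strings_with_ellipsis() {
        assert_eq!(truncate("abcdef", 4), "abc…");
    }
}

fn main() {
    // Standalone check mirroring the tests above.
    assert_eq!(truncate("abcdef", 4), "abc…");
    println!("ok");
}
```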

Integration Test Patterns

Integration tests live in tests/cli_tests.rs and run the actual compiled binary against the example data in data/:

#![allow(unused)]
fn main() {
use assert_cmd::Command;
use std::str;

fn cmd() -> Command {
    Command::cargo_bin("congress-approp").unwrap()
}

#[test]
fn summary_table_runs_successfully() {
    cmd()
        .args(["summary", "--dir", "data"])
        .assert()
        .success()
        .stdout(predicates::str::contains("H.R. 4366"))
        .stdout(predicates::str::contains("H.R. 5860"))
        .stdout(predicates::str::contains("H.R. 9468"))
        .stdout(predicates::str::contains("Omnibus"))
        .stdout(predicates::str::contains("Continuing Resolution"))
        .stdout(predicates::str::contains("Supplemental"));
}
}

Existing integration tests

The test suite covers these commands and scenarios:

| Test | What It Checks |
|------|------|
| budget_authority_totals_match_expected | Critical — exact BA and rescission totals for all three bills |
| summary_table_runs_successfully | Summary command outputs all three bills with correct classifications |
| summary_json_output_is_valid | JSON output parses correctly with expected fields |
| summary_csv_output_has_header | CSV output includes a header row |
| summary_by_agency_shows_departments | --by-agency flag produces department rollup |
| search_by_type_appropriation | Type filter returns results with correct type |
| search_by_type_rescission | Rescission search returns results |
| search_by_type_cr_substitution | CR substitution search returns 13 results |
| search_by_agency | Agency filter narrows results |
| search_by_keyword | Keyword search finds provisions containing the term |
| search_json_output_is_valid | JSON output parses with expected fields |
| search_csv_output | CSV output is parseable |
| search_list_types | --list-types flag shows all provision types |
| compare_runs_successfully | Compare command produces output with expected accounts |
| compare_json_output_is_valid | Compare JSON output parses correctly |
| audit_runs_successfully | Audit command shows all three bills |
| audit_shows_zero_not_found | Critical — NotFound = 0 for all bills |
| upgrade_dry_run | Upgrade dry run completes without modifying files |

Writing new integration tests

#![allow(unused)]
fn main() {
#[test]
fn my_new_command_works() {
    // 1. Run the command against example data
    let output = cmd()
        .args(["my-command", "--dir", "data", "--format", "json"])
        .output()
        .unwrap();

    // 2. Check it succeeded
    assert!(output.status.success(), "Command failed: {}", 
        str::from_utf8(&output.stderr).unwrap());

    // 3. Parse the output
    let stdout = str::from_utf8(&output.stdout).unwrap();
    let data: Vec<serde_json::Value> = serde_json::from_str(stdout).unwrap();

    // 4. Verify expected properties
    assert!(!data.is_empty(), "Expected at least one result");
    assert!(data[0]["some_field"].is_string(), "Expected some_field to be a string");
}
}

Integration test conventions

  1. Always use --dir data — the included example data is the test fixture
  2. Test all output formats (table, json, csv) for new commands
  3. Parse JSON output and verify structure — don’t just check for substring matches on JSON
  4. Check for specific expected values where possible (like the budget authority totals)
  5. Test error cases — what happens with a bad --dir path, an invalid --type value, etc.
  6. Don’t test semantic search in CI — there’s no OPENAI_API_KEY in the CI environment. Cosine similarity and vector loading have unit tests instead.

What Is NOT Tested

Semantic search (no API key in CI)

The GitHub Actions CI environment does not have an OPENAI_API_KEY. This means:

  • search --semantic is not tested in CI
  • embed is not tested in CI
  • The OpenAI API client is not tested in CI

These are tested locally by the developer. The underlying cosine similarity, vector loading, and embedding text construction functions have unit tests that don’t require API access.

LLM extraction quality

There are no automated tests that verify the quality of LLM extraction — that would require calling the Anthropic API and comparing results to ground truth. Instead:

  • Budget authority totals serve as a proxy for extraction quality (if totals match, major provisions are correct)
  • The verification pipeline (audit) provides automated quality metrics
  • Manual review of new extractions is expected before committing example data

Performance benchmarks

There are no automated performance tests. The performance characteristics documented in the architecture chapter are based on manual measurement and informal benchmarking.

Data Integrity Check (Manual)

In addition to cargo test, the project includes a manual data integrity check that can be run as a shell command:

./target/release/congress-approp summary --dir data --format json | python3 -c "
import sys, json
expected = {'H.R. 4366': 846137099554, 'H.R. 5860': 16000000000, 'H.R. 9468': 2882482000}
for b in json.load(sys.stdin):
    assert b['budget_authority'] == expected[b['identifier']]
print('Data integrity: OK')
"

This is the same check as the budget_authority_totals_match_expected test but runs against the release binary. It’s useful as a final verification before committing or publishing.

CI/CD Pipeline

GitHub Actions (.github/workflows/ci.yml) runs on every push to main and every pull request:

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
        with:
          components: rustfmt, clippy
      - uses: Swatinem/rust-cache@v2
      - name: Check formatting
        run: cargo fmt --check
      - name: Clippy
        run: cargo clippy -- -D warnings
      - name: Test
        run: cargo test

Three checks, all must pass:

  1. cargo fmt --check — Code must be formatted according to rustfmt rules
  2. cargo clippy -- -D warnings — No clippy warnings allowed (warnings are errors)
  3. cargo test — All unit and integration tests must pass

The CI does NOT:

  • Run semantic search tests (no OPENAI_API_KEY)
  • Run extraction tests (no ANTHROPIC_API_KEY)
  • Run download tests (no CONGRESS_API_KEY)
  • Test against real API endpoints

Adding Tests for New Features

For a new CLI command

  1. Add at least three integration tests:
    • Basic execution with --dir data succeeds
    • JSON output parses correctly with expected fields
    • Filters work as expected
  2. Add unit tests for the library function it calls

For a new provision type

  1. Add a unit test in ontology.rs for serialization round-trip
  2. Add a unit test in from_value.rs for resilient parsing (missing fields, wrong types)
  3. Verify budget_authority_totals_match_expected still passes — your new type shouldn’t change existing totals unless deliberately designed to

For a new search filter

  1. Add an integration test in cli_tests.rs that exercises the filter
  2. Verify the filter works with --format json (check the output structure)
  3. Test the filter in combination with existing filters

For a new output format

  1. Add integration tests for the new format on at least search and summary commands
  2. Verify the output is parseable by its target consumer (e.g., valid CSV, valid JSON)

Debugging Test Failures

“budget_authority_totals_match_expected” failed

This means the budget authority or rescission totals changed. Possible causes:

  1. Example data changed — was extraction.json modified accidentally?
  2. compute_totals() logic changed — did the filtering criteria for budget authority change?
  3. from_value.rs parsing changed — did a change in the resilient parser alter how amounts are parsed?
  4. A new provision type was added that unintentionally contributes to budget authority

Investigation steps:

# Check the actual values
./target/release/congress-approp summary --dir data --format json | python3 -c "
import sys, json
for b in json.load(sys.stdin):
    print(f\"{b['identifier']}: BA={b['budget_authority']}, Resc={b['rescissions']}\")
"

# Compare to expected
# H.R. 4366: BA=846137099554, Resc=24659349709
# H.R. 5860: BA=16000000000, Resc=0
# H.R. 9468: BA=2882482000, Resc=0

Tests pass locally but fail in CI

Common causes:

  1. Unformatted code — run cargo fmt locally (CI checks with cargo fmt --check)
  2. Clippy warnings — run cargo clippy -- -D warnings locally (CI treats warnings as errors)
  3. Platform differences — the CI runs on Ubuntu; if you develop on macOS, there may be subtle differences in text handling
  4. Missing cargo build — integration tests need the binary; cargo test builds it automatically, but sometimes caching can cause stale binaries

A test is flaky (passes sometimes, fails sometimes)

This shouldn’t happen in the current test suite because there’s no randomness, no network calls, and no timing dependencies. If you encounter a flaky test:

  1. Run it with --test-threads=1 to rule out parallelism issues
  2. Check if it depends on filesystem ordering (use sort on any directory listings)
  3. Check if it depends on HashMap iteration order (use BTreeMap or sort results)
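Point 3 above, sketched as a standalone example. The agency_totals function is illustrative only: it aggregates into a HashMap, then sorts before returning so assertions never depend on hash iteration order:

```rust
use std::collections::HashMap;

// Illustrative aggregation: HashMap for the rollup, then an explicit
// sort so the output order is deterministic for tests and tables.
fn agency_totals(rows: &[(&str, i64)]) -> Vec<(String, i64)> {
    let mut totals: HashMap<String, i64> = HashMap::new();
    for (agency, dollars) in rows {
        *totals.entry((*agency).to_string()).or_insert(0) += dollars;
    }
    let mut out: Vec<(String, i64)> = totals.into_iter().collect();
    out.sort(); // remove HashMap iteration order from the equation
    out
}

fn main() {
    let rows = [("DHS", 5), ("DOD", 10), ("DHS", 5)];
    let totals = agency_totals(&rows);
    assert_eq!(
        totals,
        vec![("DHS".to_string(), 10), ("DOD".to_string(), 10)]
    );
    println!("ok");
}
```

BTreeMap achieves the same determinism without the explicit sort; either is fine as long as nothing asserts on raw HashMap order.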

Summary

| Rule | Reason |
|------|------|
| Run cargo fmt && cargo clippy -- -D warnings && cargo test before every commit | CI rejects improperly formatted or warning-producing code |
| Never change the expected budget authority totals without justification | They're the tool's financial integrity guard |
| Test all output formats for new commands | Users depend on JSON/CSV parsability |
| Unit test library functions, integration test CLI commands | Two layers of confidence |
| Don't test semantic search in CI | No API keys in CI; test cosine similarity with unit tests instead |

Next Steps

Style Guide and Conventions

Coding standards and practices for contributing to congress-approp. These conventions are enforced by CI — pull requests that don’t follow them will be rejected automatically.

The Non-Negotiables

These three checks run on every push and every pull request. All must pass.

1. Format with rustfmt

cargo fmt

Run this before every commit. The CI checks with cargo fmt --check and rejects improperly formatted code. There is no .rustfmt.toml override — the project uses the default rustfmt configuration.

2. No clippy warnings

cargo clippy -- -D warnings

Clippy warnings are treated as errors in CI. Fix every warning at its root cause.

Do NOT suppress warnings with #[allow] annotations unless there is a compelling reason and the team agrees. The most common exception is #[allow(clippy::too_many_arguments)] on functions that genuinely need many parameters (like provision constructors), but even this should be used sparingly.

Do NOT use _ prefixes on variable names just to suppress “unused variable” warnings. If a variable is unused, remove it. If it’s a function parameter that must exist for API compatibility but isn’t used in the current implementation, use _name (single underscore prefix) — but consider whether the function signature should change instead.

3. All tests pass

cargo test

All ~172 tests (130 unit + 42 integration) must pass. See Testing Strategy for details.

The full cycle

cargo fmt && cargo clippy -- -D warnings && cargo test

Run this as a single command before every commit. If any step fails, fix it before proceeding.

Code Organization

Library function first, CLI second

New logic goes in library modules (query.rs, embeddings.rs, or a new module under src/approp/). The CLI handler in main.rs calls the library function and formats the output.

#![allow(unused)]
fn main() {
// Good: Library function is pure, CLI handler formats
// In query.rs:
pub fn top_provisions(bills: &[LoadedBill], count: usize) -> Vec<TopProvision> { ... }

// In main.rs:
fn handle_top(dir: &str, count: usize, format: &str) -> Result<()> {
    let bills = loading::load_bills(Path::new(dir))?;
    let results = query::top_provisions(&bills, count);
    // ... format and print results ...
}
}
#![allow(unused)]
fn main() {
// Bad: Business logic in main.rs
fn handle_top(dir: &str, count: usize, format: &str) -> Result<()> {
    let bills = loading::load_bills(Path::new(dir))?;
    let mut all_provisions = Vec::new();
    for bill in &bills {
        for p in &bill.extraction.provisions {
            // ... inline filtering and sorting logic ...
        }
    }
    // ... 200 lines of inline computation ...
}
}

All query functions take &[LoadedBill]

Library functions in query.rs take loaded data as input and return plain structs. They never do I/O, never format output, never call APIs, and never print anything.

#![allow(unused)]
fn main() {
// Good: Pure function
pub fn summarize(bills: &[LoadedBill]) -> Vec<BillSummary> { ... }

// Bad: Does I/O
pub fn summarize(dir: &Path) -> Result<()> { ... }

// Bad: Formats output
pub fn summarize(bills: &[LoadedBill]) -> String { ... }
}

Serde for everything

All data types derive Serialize and Deserialize:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct MyType {
    pub field: String,
    pub amount: i64,
}
}

Output structs (returned by library functions for CLI consumption) derive at least Serialize:

#![allow(unused)]
fn main() {
#[derive(Debug, Serialize)]
pub struct SearchResult {
    pub bill: String,
    pub dollars: Option<i64>,
    // ...
}
}

This enables JSON, JSONL, and CSV output for free — the CLI handler just calls serde_json::to_string() or csv::Writer::serialize().

Tests in the same file

Unit tests go in a #[cfg(test)] mod tests block at the bottom of the module they test:

#![allow(unused)]
fn main() {
// At the bottom of src/approp/query.rs:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn summarize_computes_correct_totals() {
        // ...
    }

    #[test]
    fn search_filters_by_type() {
        // ...
    }
}
}

Integration tests (which run the actual binary) go in tests/cli_tests.rs.

Naming Conventions

Files and modules

  • snake_case for all file and module names: from_value.rs, text_index.rs, cli_tests.rs
  • Module names should describe what they contain, not what they do: ontology.rs (types), not define_types.rs

Types and enums

  • CamelCase for all type names: BillExtraction, AmountSemantics, LoadedBill
  • Enum variants are also CamelCase: Provision::Appropriation, AmountValue::Specific
  • Acronyms are treated as words: CrSubstitution (not CRSubstitution), XmlParser (not XMLParser)

Functions and methods

  • snake_case for all function names: compute_totals(), load_bills(), parse_provision()
  • CLI handler functions are prefixed with handle_: handle_search(), handle_summary(), handle_extract()
  • Boolean-returning methods use is_ prefix: is_definite(), is_empty()
  • Getter methods use the field name without get_ prefix: account_name(), division(), amount()

Constants

  • SCREAMING_SNAKE_CASE for constants: DEFAULT_MODEL, MAX_TOKENS, KNOWN_PROVISION_TYPES

Command-line flags

  • kebab-case for multi-word flags: --dry-run, --output-dir, --min-dollars, --by-agency
  • Single-character short flags where natural: -v (verbose), -t (type), -a (agency), -k (keyword), -n (count)
  • Use r#type in Rust (since type is a keyword): r#type: Option<String>

Error Handling

Use anyhow for CLI code

#![allow(unused)]
fn main() {
use anyhow::{Context, Result};

fn handle_summary(dir: &str) -> Result<()> {
    let bills = loading::load_bills(Path::new(dir))
        .context("Failed to load bills")?;
    // ...
    Ok(())
}
}

The .context() method adds human-readable context to errors. Use it on every fallible operation that could fail for user-facing reasons (file not found, API error, parse error).

Use thiserror for library errors

If a library module needs typed errors (rather than anyhow::Error), define them with thiserror:

#![allow(unused)]
fn main() {
use thiserror::Error;

#[derive(Error, Debug)]
pub enum LoadError {
    #[error("No extraction.json found in {0}")]
    NoExtraction(PathBuf),
    #[error("Failed to parse {path}: {source}")]
    ParseError {
        path: PathBuf,
        source: serde_json::Error,
    },
}
}

Never panic in library code

Library functions should return Result<T> instead of panicking. Use .unwrap() only in tests or when the invariant is provably guaranteed (e.g., after a .is_some() check).

#![allow(unused)]
fn main() {
// Good:
pub fn load_bills(dir: &Path) -> Result<Vec<LoadedBill>> { ... }

// Bad:
pub fn load_bills(dir: &Path) -> Vec<LoadedBill> {
    // panics on error — caller can't handle it gracefully
}
}

Panicking is fine in CLI handlers

CLI handlers (the handle_* functions in main.rs) can use ? freely since errors propagate to main() and are displayed to the user. The anyhow crate formats the error chain nicely.

Documentation

Doc comments on public items

Every public function, type, and module should have a /// doc comment:

#![allow(unused)]
fn main() {
/// Compute (total_budget_authority, total_rescissions) from the actual provisions.
///
/// This is deterministic — does not use the LLM's self-reported summary.
/// Budget authority includes all `Appropriation` provisions where
/// `semantics == NewBudgetAuthority` and `detail_level` is not
/// `sub_allocation` or `proviso_amount`.
pub fn compute_totals(&self) -> (i64, i64) {
    // ...
}
}

Module-level documentation

Each module should have a //! doc comment at the top explaining its purpose:

#![allow(unused)]
fn main() {
//! Query operations over loaded bill data.
//!
//! These functions take `&[LoadedBill]` and return plain data structs
//! suitable for any output format. The CLI layer handles formatting.
}

Inline comments

Use // comments sparingly — prefer self-documenting code (descriptive names, small functions). When you do comment, explain why, not what:

#![allow(unused)]
fn main() {
// Good: Explains why
// Exclude sub-allocations and proviso amounts — they are
// breakdowns of a parent account, not additional money.
if dl != "sub_allocation" && dl != "proviso_amount" {
    ba += amt.dollars().unwrap_or(0);
}

// Bad: Restates the code
// Add dollars to ba if detail level is not sub_allocation or proviso_amount
if dl != "sub_allocation" && dl != "proviso_amount" {
    ba += amt.dollars().unwrap_or(0);
}
}

Serde Conventions

Use #[serde(default)] on all provision fields

#![allow(unused)]
fn main() {
Appropriation {
    #[serde(default)]
    account_name: String,
    #[serde(default)]
    agency: Option<String>,
    // ...
}
}

This ensures that fields missing from JSON input (common with LLM-generated JSON) get default values rather than causing deserialization errors.
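The fallback values `#[serde(default)]` supplies are exactly `Default::default()` for each field type, which can be illustrated without serde itself (the `dollars` field here is a hypothetical addition for illustration):

```rust
// Sketch: the values `#[serde(default)]` falls back to are exactly
// `Default::default()` for each field's type.
#[derive(Debug, Default, PartialEq)]
struct Appropriation {
    account_name: String,   // defaults to ""
    agency: Option<String>, // defaults to None
    dollars: i64,           // defaults to 0
}

fn main() {
    // Struct-update syntax mirrors what serde does for missing JSON fields:
    // present fields keep their value, absent ones take the default.
    let partial = Appropriation {
        account_name: "Medical Services".to_string(),
        ..Default::default()
    };
    assert_eq!(partial.agency, None);
    assert_eq!(partial.dollars, 0);
    println!("{partial:?}");
}
```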

Use #[serde(tag = "...", rename_all = "snake_case")] for tagged enums

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "provision_type", rename_all = "snake_case")]
pub enum Provision {
    Appropriation { ... },
    Rescission { ... },
    // ...
}
}

Use #[non_exhaustive] on enums that may grow

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Serialize, Deserialize)]
#[non_exhaustive]
pub enum AmountValue {
    Specific { dollars: i64 },
    SuchSums,
    None,
}
}

This prevents external code from exhaustively matching, ensuring forward compatibility when new variants are added.
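Within the defining crate, exhaustive matches are still permitted; the attribute bites in downstream crates, which must write a wildcard arm. A minimal sketch (the enum mirrors `AmountValue` above; `describe` is a hypothetical consumer):

```rust
#[derive(Debug)]
#[non_exhaustive]
enum AmountValue {
    Specific { dollars: i64 },
    SuchSums,
    None,
}

// Downstream code must include a wildcard arm; when a new variant is
// added later, this function still compiles.
fn describe(v: &AmountValue) -> String {
    match v {
        AmountValue::Specific { dollars } => format!("${dollars}"),
        AmountValue::SuchSums => "such sums as may be necessary".to_string(),
        _ => "no amount".to_string(),
    }
}

fn main() {
    println!("{}", describe(&AmountValue::Specific { dollars: 16_000_000_000 }));
    println!("{}", describe(&AmountValue::None));
}
```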

Async Conventions

Only use async when calling external APIs

Most of the codebase is synchronous. Only these operations are async:

  • handle_extract() — calls the Anthropic API
  • handle_embed() — calls the OpenAI API
  • handle_search() — the --semantic path calls the OpenAI API
  • handle_download() — calls the Congress.gov API

If your new code doesn’t call an external API, keep it synchronous.

Never use block_on() inside an async function

#![allow(unused)]
fn main() {
// WRONG — causes "cannot start a runtime from within a runtime" panic
async fn handle_my_command() {
    let result = tokio::runtime::Runtime::new()
        .unwrap()
        .block_on(some_async_fn()); // PANIC!
}

// RIGHT — use .await
async fn handle_my_command() {
    let result = some_async_fn().await;
}
}

The main function is async

The main() function uses #[tokio::main] and dispatches to handler functions. Async handlers are .awaited; sync handlers are called directly.

Commit Messages

Use this format:

Short summary of the change (imperative mood, ≤72 characters)

Longer description of what changed and why. Wrap at 72 characters.
Explain the motivation, not just the mechanics.

Verified:
- cargo fmt/clippy/test: clean, N tests pass
- Budget totals unchanged: $846B/$16B/$2.9B

Examples:

Add --division filter to search command

Scopes search results to a single division letter (e.g., --division A
for MilCon-VA in the FY2024 omnibus). Uses case-insensitive exact
match against the provision's division field.

Verified:
- cargo fmt/clippy/test: clean, 172 tests pass
- Budget totals unchanged: $846B/$16B/$2.9B

Fix SuchSums serialization in upgrade path

The upgrade command was not correctly re-serializing SuchSums amount
variants — they were missing the "kind" tag. Fixed by normalizing
through the current AmountValue enum during upgrade.

Verified:
- cargo fmt/clippy/test: clean, 95 tests pass
- Budget totals unchanged: $846B/$16B/$2.9B

Verification line

Always include the verification line in your commit message. It tells reviewers that you ran the full test cycle and checked data integrity. The budget total shorthand ($846B/$16B/$2.9B) refers to the three example bills’ budget authority.

Dependencies

Adding new dependencies

Before adding a new crate dependency:

  1. Check if an existing dependency can do the job. The project already uses reqwest, serde, serde_json, tokio, anyhow, thiserror, sha2, chrono, walkdir, comfy-table, and csv.
  2. Prefer pure-Rust crates. The project avoids C dependencies (uses roxmltree instead of libxml2, rustls-tls instead of OpenSSL).
  3. Check the crate’s maintenance status. Prefer well-maintained crates with recent releases.
  4. Keep the dependency count low. Each new dependency is a maintenance burden and a potential supply-chain risk.

Feature flags

Use feature flags to keep optional dependencies from bloating the binary:

# In Cargo.toml:
reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls", "stream"] }
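As a sketch of how an optional dependency might be gated behind a feature (the crate and feature names here are hypothetical, not taken from the project's Cargo.toml):

```toml
# Hypothetical: the default build skips the optional crate entirely;
# users opt in with `cargo build --features semantic`.
[features]
default = []
semantic = ["dep:some-embedding-crate"]

[dependencies]
some-embedding-crate = { version = "1", optional = true }
```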

Logging

Use tracing for structured logging

#![allow(unused)]
fn main() {
use tracing::{debug, info, warn};

debug!(bill = %loaded.extraction.bill.identifier, "Loaded bill");
info!(chunks = chunks.len(), "Starting parallel extraction");
warn!(bill = %identifier, "Embeddings are stale");
}

Log levels

| Level | When to Use |
|---|---|
| `error!` | Something failed and the operation can’t continue |
| `warn!` | Something unexpected happened but the operation continues (e.g., stale embeddings) |
| `info!` | High-level progress updates (e.g., “Loaded 3 bills”, “Extraction complete”) |
| `debug!` | Detailed progress for debugging (e.g., per-provision details, timing) |
| `trace!` | Very detailed internal state (rarely used) |

Users see info! and above by default. The -v flag enables debug! level.

Never log to stdout

All logging goes to stderr via tracing-subscriber. Stdout is reserved for command output (tables, JSON, CSV) so it can be piped and redirected cleanly.
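A minimal sketch of the separation, using plain `println!`/`eprintln!` rather than `tracing` (the `render_row` helper is hypothetical):

```rust
use std::io::Write;

// Command output (pipeable) is built as data, separate from diagnostics.
fn render_row(account: &str, dollars: i64) -> String {
    format!("{account}\t{dollars}")
}

fn main() {
    // Diagnostic: goes to stderr, so a pipe like `... | jq` never sees it.
    eprintln!("loaded 1 bill");
    // Data: goes to stdout, so it survives piping and redirection.
    let row = render_row("Medical Services", 71_000_000_000);
    let mut out = std::io::stdout().lock();
    writeln!(out, "{row}").expect("stdout write failed");
}
```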

Summary

| Rule | Why |
|---|---|
| cargo fmt before every commit | CI rejects unformatted code |
| cargo clippy -- -D warnings before every commit | CI rejects code with warnings |
| Fix clippy at root cause, not with #[allow] | Suppressing warnings hides real issues |
| Library function first, CLI second | Separates computation from presentation |
| All query functions take &[LoadedBill] | Keeps library functions pure and testable |
| Serde on everything | Enables all output formats for free |
| Tests in the same file | Easy to find, easy to maintain |
| anyhow for CLI, thiserror for library | Right error handling tool for each context |
| Never block_on() in async | Causes runtime panics |
| Include verification line in commits | Proves you ran the full test cycle |

Writing and Documentation Tone

All documentation, comments, commit messages, and user-facing text should be direct, factual, and professional. The project’s credibility depends on the data and the verification methodology — not on persuasive language.

Do

  • State what the tool does and how: “Dollar amounts are verified by deterministic string matching against the enrolled bill text.”
  • Let the data speak: “99.995% of dollar amounts confirmed in source text (18,583 of 18,584).”
  • Describe limitations plainly: “FY2025 subcommittee filtering is not available because H.R. 1968 wraps all jurisdictions into a single division.”
  • Use precise language: “budget authority” not “spending”; “enrolled bill” not “the law”.

Do not

  • Use marketing language: “Turn federal spending bills into searchable, structured data.”
  • Use breathless phrasing: “Copy-paste and go!”, “Zero keyword overlap — yet it’s the top result!”
  • Label features by audience: “For Journalists”, “For Staffers”. Describe the task instead.
  • Use callout labels like “Trust callout” or “Key insight” — if the information is important, state it directly.
  • Editorialize about what numbers mean: “That’s a story-saving feature.” Describe the data; let the reader draw conclusions.

README and book chapter guidelines

  • The README and book chapters should read like technical documentation, not a product landing page.
  • Embed specific numbers only in the cookbook dataset card and the accuracy-metrics appendix. Other pages should use relative language (“across the full dataset”) and link to those reference pages. This prevents staleness when bills are added.
  • Every command example should use output that was verified against the actual dataset. Do not fabricate or approximate CLI output.

Next Steps

Included Bills

The data/ directory contains 32 enacted appropriations bills across 4 congresses (116th–119th), covering FY2019 through FY2026. These are real enacted laws with real data — no API keys are needed to query them. All twelve appropriations subcommittees are represented for FY2024 and FY2026.

Each bill directory contains the enrolled XML, extracted provisions (extraction.json) with source spans, verification report, extraction metadata, bill metadata (bill_meta.json), TAS mapping (tas_mapping.json where applicable), and pre-computed embeddings (embeddings.json + vectors.bin). The data root also contains fas_reference.json (FAST Book reference data) and authorities.json (the cross-bill account registry).

Dataset totals: 34,568 provisions, $21.5 trillion in budget authority, 1,051 accounts tracked by Treasury Account Symbol across 937 cross-bill links.

Bill Summary

118th Congress (FY2024/FY2025)

| Directory | Bill | Classification | Subcommittees | Provisions | Budget Auth |
|---|---|---|---|---|---|
| data/118-hr4366/ | H.R. 4366 | Omnibus | MilCon-VA, Ag, CJS, E&W, Interior, THUD | 2,364 | $846B |
| data/118-hr5860/ | H.R. 5860 | Continuing Resolution | (all, at prior-year rates) | 130 | $16B |
| data/118-hr9468/ | H.R. 9468 | Supplemental | VA | 7 | $2.9B |
| data/118-hr815/ | H.R. 815 | Supplemental | Defense, State (Ukraine/Israel/Taiwan) | 303 | $95B |
| data/118-hr2872/ | H.R. 2872 | Continuing Resolution | (further CR) | 31 | $0 |
| data/118-hr6363/ | H.R. 6363 | Continuing Resolution | (further CR + extensions) | 74 | ~$0 |
| data/118-hr7463/ | H.R. 7463 | Continuing Resolution | (CR extension) | 10 | $0 |
| data/118-hr9747/ | H.R. 9747 | Continuing Resolution | (CR + extensions, FY2025) | 114 | $383M |
| data/118-s870/ | S. 870 | Authorization | Fire administration | 49 | $0 |

119th Congress (FY2025/FY2026)

| Directory | Bill | Classification | Subcommittees | Provisions | Budget Auth |
|---|---|---|---|---|---|
| data/119-hr1968/ | H.R. 1968 | Full-Year CR with Appropriations | Defense, Homeland, Labor-HHS, others | 526 | $1,786B |
| data/119-hr5371/ | H.R. 5371 | Minibus | CR + Ag + LegBranch + MilCon-VA | 1,048 | $681B |
| data/119-hr6938/ | H.R. 6938 | Minibus | CJS + Energy-Water + Interior | 1,061 | $196B |
| data/119-hr7148/ | H.R. 7148 | Omnibus | Defense + Labor-HHS + THUD + FinServ + State | 2,837 | $2,788B |

Totals: 34,568 provisions, $21.5 trillion in budget authority, 99.995% dollar verification, 100% source traceability. See Accuracy Metrics for the full breakdown.


H.R. 4366 — The FY2024 Omnibus

What it is

The Consolidated Appropriations Act, 2024 is an omnibus — a single legislative vehicle packaging multiple annual appropriations bills together. It covers seven of the twelve appropriations subcommittee jurisdictions, organized into lettered divisions:

| Division | Subcommittee Jurisdiction |
|---|---|
| A | Military Construction, Veterans Affairs |
| B | Agriculture, Rural Development, Food and Drug Administration |
| C | Commerce, Justice, Science |
| D | Energy and Water Development |
| E | Interior, Environment |
| F | Transportation, Housing and Urban Development |
| G–H | Other matters |

Not included: Defense, Labor-HHS-Education, Homeland Security, State-Foreign Operations, Financial Services, and Legislative Branch (these were addressed through other legislation for FY2024).

Why it matters

This is the largest and most complex bill in the example data. At 2,364 provisions across ~1,500 pages of legislative text, it’s a comprehensive test of the tool’s extraction, verification, and query capabilities. It includes every provision type except cr_substitution and continuing_resolution_baseline (which are specific to continuing resolutions).

Provision type breakdown

| Type | Count | Percentage |
|---|---|---|
| appropriation | 1,216 | 51.4% |
| limitation | 456 | 19.3% |
| rider | 285 | 12.1% |
| directive | 120 | 5.1% |
| other | 84 | 3.6% |
| rescission | 78 | 3.3% |
| transfer_authority | 77 | 3.3% |
| mandatory_spending_extension | 40 | 1.7% |
| directed_spending | 8 | 0.3% |
| Total | 2,364 | 100% |

Key accounts (top 10 by budget authority)

| Account | Agency | Budget Authority |
|---|---|---|
| Compensation and Pensions | Department of Veterans Affairs | $197,382,903,000 |
| Supplemental Nutrition Assistance Program | Department of Agriculture | $122,382,521,000 |
| Medical Services | Department of Veterans Affairs | $71,000,000,000 |
| Child Nutrition Programs | Department of Agriculture | $33,266,226,000 |
| Tenant-Based Rental Assistance | Dept. of Housing and Urban Development | $32,386,831,000 |
| Medical Community Care | Department of Veterans Affairs | $20,382,000,000 |
| Weapons Activities | Department of Energy | $19,108,000,000 |
| Project-Based Rental Assistance | Dept. of Housing and Urban Development | $16,010,000,000 |
| Readjustment Benefits | Department of Veterans Affairs | $13,774,657,000 |
| Operations | Federal Aviation Administration | $12,729,627,000 |

Note: The largest accounts (VA Comp & Pensions, SNAP, VA Medical Services) are mandatory spending programs that appear as appropriation lines in the bill text. See Why the Numbers Might Not Match Headlines.

Verification metrics

| Metric | Value |
|---|---|
| Dollar amounts verified (unique position) | 762 |
| Dollar amounts not found | 0 |
| Dollar amounts ambiguous (multiple positions) | 723 |
| Raw text exact match | 2,285 (96.7%) |
| Raw text normalized match | 59 (2.5%) |
| Raw text no match | 20 (0.8%) |
| Coverage | 94.2% |

The 20 “no match” provisions are all non-dollar statutory amendments where the LLM slightly reformatted section references. No provision with a dollar amount has a text mismatch.

Key data files

| File | Size | Description |
|---|---|---|
| BILLS-118hr4366enr.xml | 1.8 MB | Enrolled bill XML from Congress.gov |
| extraction.json | ~12 MB | 2,364 structured provisions |
| verification.json | ~2 MB | Full verification report |
| metadata.json | ~300 bytes | Extraction provenance (model, hashes) |
| embeddings.json | ~230 bytes | Embedding metadata |
| vectors.bin | 29 MB | 2,364 × 3,072 float32 embedding vectors |

Try it

# Summary
congress-approp summary --dir data/118-hr4366

# All appropriations in Division A (MilCon-VA)
congress-approp search --dir data/118-hr4366 --type appropriation --division A

# Rescissions over $1 billion
congress-approp search --dir data/118-hr4366 --type rescission --min-dollars 1000000000

# Everything about the FBI
congress-approp search --dir data/118-hr4366 --account "Federal Bureau of Investigation"

# Budget authority by department
congress-approp summary --dir data/118-hr4366 --by-agency

# Full audit
congress-approp audit --dir data/118-hr4366

H.R. 5860 — The FY2024 Continuing Resolution

What it is

The Continuing Appropriations Act, 2024 is a continuing resolution (CR) — temporary legislation that funded the federal government at FY2023 rates while Congress finished negotiating the full-year omnibus. It was enacted on November 16, 2023, about seven weeks into FY2024 (which started October 1).

The CR’s core mechanism (SEC. 101) says: fund everything at last year’s level. But 13 specific programs got different treatment through CR substitutions (anomalies) — provisions that substitute one dollar amount for another, setting a different level than the default prior-year rate.

Why it matters

CRs are politically significant because the anomalies reveal congressional priorities — which programs Congress chose to fund above or below the default rate. The tool extracts these as structured data with both the new and old amounts, making analysis straightforward.

CRs also have a very different provision profile than omnibus bills: dominated by riders and mandatory spending extensions rather than new appropriations. This tests the tool’s ability to handle diverse provision types.

Provision type breakdown

| Type | Count | Percentage |
|---|---|---|
| rider | 49 | 37.7% |
| mandatory_spending_extension | 44 | 33.8% |
| cr_substitution | 13 | 10.0% |
| other | 12 | 9.2% |
| appropriation | 5 | 3.8% |
| limitation | 4 | 3.1% |
| directive | 2 | 1.5% |
| continuing_resolution_baseline | 1 | 0.8% |
| Total | 130 | 100% |

The 13 CR substitutions

These are the programs where Congress set a specific funding level instead of continuing at the prior-year rate:

| Account | New Amount | Old Amount | Delta | Change |
|---|---|---|---|---|
| Bilateral Econ. Assistance—Migration and Refugee Assistance | $915,048,000 | $1,535,048,000 | -$620,000,000 | -40.4% |
| (section 521(d)(1) reference) | $122,572,000 | $705,768,000 | -$583,196,000 | -82.6% |
| Bilateral Econ. Assistance—International Disaster Assistance | $637,902,000 | $937,902,000 | -$300,000,000 | -32.0% |
| Int’l Security Assistance—Narcotics Control | $74,996,000 | $374,996,000 | -$300,000,000 | -80.0% |
| Rural Utilities Service—Rural Water | $60,000,000 | $325,000,000 | -$265,000,000 | -81.5% |
| NSF—Research and Related Activities | $608,162,000 | $818,162,000 | -$210,000,000 | -25.7% |
| NSF—STEM Education | $92,000,000 | $217,000,000 | -$125,000,000 | -57.6% |
| State Dept—Diplomatic Programs | $87,054,000 | $147,054,000 | -$60,000,000 | -40.8% |
| Rural Housing Service—Community Facilities | $25,300,000 | $75,300,000 | -$50,000,000 | -66.4% |
| DOT—FAA Facilities and Equipment | $2,174,200,000 | $2,221,200,000 | -$47,000,000 | -2.1% |
| NOAA—Operations, Research, and Facilities | $42,000,000 | $62,000,000 | -$20,000,000 | -32.3% |
| DOT—FAA Facilities and Equipment | $617,000,000 | $570,000,000 | +$47,000,000 | +8.2% |
| OPM—Salaries and Expenses | $219,076,000 | $190,784,000 | +$28,292,000 | +14.8% |

Eleven of thirteen substitutions are cuts. Only OPM Salaries and one FAA account received increases. All 13 pairs are fully verified — both the new and old dollar amounts were found in the source bill text.
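The Delta and Change columns can be reproduced directly from each new/old pair, as in this sketch (`substitution_delta` is a hypothetical helper, not part of the crate's API):

```rust
// Sketch: derive the Delta and Change columns from the new/old amount
// pair carried by each cr_substitution provision.
fn substitution_delta(new: i64, old: i64) -> (i64, f64) {
    let delta = new - old;
    let pct = delta as f64 / old as f64 * 100.0;
    (delta, pct)
}

fn main() {
    // Migration and Refugee Assistance: $915,048,000 replacing $1,535,048,000.
    let (delta, pct) = substitution_delta(915_048_000, 1_535_048_000);
    assert_eq!(delta, -620_000_000);
    println!("delta = {delta}, change = {pct:.1}%");
}
```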

The $16 billion FEMA appropriation

The CR’s $16 billion budget authority comes primarily from SEC. 129, which appropriated $16 billion for the Federal Emergency Management Agency Disaster Relief Fund — a standalone emergency appropriation outside the CR’s baseline mechanism. This is the largest single appropriation in the CR.

Verification metrics

| Metric | Value |
|---|---|
| Dollar amounts verified (unique position) | 33 |
| Dollar amounts not found | 0 |
| Dollar amounts ambiguous (multiple positions) | 2 |
| Raw text exact match | 102 (78.5%) |
| Raw text normalized match | 12 (9.2%) |
| Raw text no match | 16 (12.3%) |
| Coverage | 61.1% |

The lower coverage (61.1%) is expected for a CR — most dollar strings in the text are references to prior-year appropriations acts, not new provisions. The 16 “no match” raw text provisions are riders and mandatory spending extensions that amend existing statutes, where the LLM slightly reformatted section references.

Try it

# Summary
congress-approp summary --dir data/118-hr5860

# All CR substitutions (table auto-adapts to show New/Old/Delta)
congress-approp search --dir data/118-hr5860 --type cr_substitution

# The core CR mechanism
congress-approp search --dir data/118-hr5860 --type continuing_resolution_baseline

# Mandatory programs extended
congress-approp search --dir data/118-hr5860 --type mandatory_spending_extension

# Standalone appropriations (FEMA, etc.)
congress-approp search --dir data/118-hr5860 --type appropriation

# Full audit
congress-approp audit --dir data/118-hr5860

H.R. 9468 — The VA Supplemental

What it is

The Veterans Benefits Continuity and Accountability Supplemental Appropriations Act, 2024 is a supplemental — emergency funding enacted outside the regular annual cycle. It was passed after the VA disclosed an unexpected shortfall in its Compensation and Pensions and Readjustment Benefits accounts.

At only 7 provisions, it’s the smallest bill in the example data and serves as an excellent introduction to the tool — small enough to read every provision, yet representative of real appropriations legislation.

Why it matters

This bill tells a complete story in 7 provisions:

  1. $2,285,513,000 for Compensation and Pensions — additional funding to cover the shortfall
  2. $596,969,000 for Readjustment Benefits — additional funding for veteran readjustment
  3. SEC. 101 (rider) — establishes that these amounts are “in addition to” regular appropriations
  4. SEC. 102 (rider) — makes the funds available under normal authorities and conditions
  5. SEC. 103(a) (directive) — requires the VA Secretary to report on corrective actions within 30 days
  6. SEC. 103(b) (directive) — requires quarterly status reports on fund usage through September 2026
  7. SEC. 104 (directive) — requires the VA Inspector General to review the causes of the shortfall within 180 days

The two appropriations provide the money; the two riders establish the legal framework; the three directives impose accountability. This is a typical supplemental pattern — emergency funding paired with oversight requirements.

Provision type breakdown

| Type | Count |
|---|---|
| directive | 3 |
| appropriation | 2 |
| rider | 2 |
| Total | 7 |

Verification metrics

| Metric | Value |
|---|---|
| Dollar amounts verified (unique position) | 2 |
| Dollar amounts not found | 0 |
| Dollar amounts ambiguous | 0 |
| Raw text exact match | 5 (71.4%) |
| Raw text normalized match | 0 |
| Raw text no match | 2 (28.6%) |
| Coverage | 100.0% |

Perfect coverage — every dollar amount in the source text is captured. The only two dollar strings in the bill ($2,285,513,000 and $596,969,000) are both verified. The 2 “no match” raw text provisions are the longer SEC. 103 directives, where the LLM truncated the excerpt.

A teaching example

The VA Supplemental is used throughout this documentation as the primary teaching example because:

  • It’s small enough to show completely. All 7 provisions fit in a single JSON output.
  • It covers three provision types. Appropriations, riders, and directives.
  • Both dollar amounts are unique. No ambiguity — each amount maps to exactly one position in the source.
  • It has real-world significance. The VA funding shortfall was a major news story in 2024.
  • It cross-references the omnibus. The same accounts (Comp & Pensions, Readjustment Benefits) appear in H.R. 4366, enabling cross-bill matching demonstrations.

Try it

# See all 7 provisions
congress-approp search --dir data/118-hr9468

# Just the two appropriations
congress-approp search --dir data/118-hr9468 --type appropriation

# The three directives (reporting requirements)
congress-approp search --dir data/118-hr9468 --type directive

# Full JSON for the complete picture
congress-approp search --dir data/118-hr9468 --format json

# Compare to the omnibus — see the same accounts in both
congress-approp compare --base data/118-hr4366 --current data/118-hr9468 --agency "Veterans"

# Find the omnibus counterpart of the Comp & Pensions provision
congress-approp search --dir data --similar 118-hr9468:0 --top 5

# Audit
congress-approp audit --dir data/118-hr9468

What Each Bill Directory Contains

Every bill directory in the example data has the same file structure:

data/118-hr9468/
├── BILLS-118hr9468enr.xml     ← Source XML from Congress.gov (enrolled version)
├── extraction.json            ← All provisions with structured fields
├── verification.json          ← Deterministic verification against source text
├── metadata.json              ← Extraction provenance (model, hashes, timestamps)
├── embeddings.json            ← Embedding metadata (model, dimensions, hashes)
└── vectors.bin                ← Binary float32 embedding vectors (3,072 dimensions)

Note: tokens.json (LLM token usage) is not included in the example data because the extractions were produced during development. The chunks/ directory is also not included — it’s gitignored as local provenance.

See Data Directory Layout for the complete file reference.
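As an illustration of how a flat vector file like vectors.bin could be decoded, here is a sketch assuming little-endian float32 values in row-major order (the actual on-disk format is defined by the tool, not by this example):

```rust
// Sketch: decode a row-major float32 vector file. Byte order (little-endian)
// and layout (one fixed-length row per provision) are assumptions here.
fn decode_vectors(bytes: &[u8], dims: usize) -> Vec<Vec<f32>> {
    bytes
        .chunks_exact(4 * dims)
        .map(|row| {
            row.chunks_exact(4)
                .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
                .collect()
        })
        .collect()
}

fn main() {
    // Two 2-dimensional vectors instead of 2,364 × 3,072, for illustration.
    let mut bytes = Vec::new();
    for v in [1.0f32, 2.0, 3.0, 4.0] {
        bytes.extend_from_slice(&v.to_le_bytes());
    }
    let vectors = decode_vectors(&bytes, 2);
    assert_eq!(vectors, vec![vec![1.0, 2.0], vec![3.0, 4.0]]);
    println!("decoded {} vectors", vectors.len());
}
```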


Aggregate Metrics Across All Thirteen Bills

| Metric | Value |
|---|---|
| Total provisions | 34,568 |
| Total budget authority | $6,412,476,574,673 |
| Total rescissions | $84,074,524,379 |
| Amounts NOT found in source | 0 |
| Raw text exact match rate | 95.5% |
| Advance appropriations detected | $1.49 trillion (18% of total BA) |
| FY2026 subcommittee coverage | All 12 subcommittees |

Across the full dataset, 99.995% of dollar amounts are verified across 34,568 provisions from 32 bills. Every provision has byte-level source spans in the enrolled bill text.


Using Example Data for Development

The example data serves multiple purposes:

As test fixtures

The integration test suite (tests/cli_tests.rs) runs against data/ and hardcodes exact budget authority totals. Any change to the example data or to the budget authority calculation logic that would alter these numbers is caught immediately.

As documentation source

Every command example, output table, and JSON snippet in this documentation was generated from the example data. The data is the documentation’s source of truth.

As training data for understanding

If you’re new to appropriations, reading through data/118-hr9468/extraction.json (just 7 provisions) is the fastest way to understand what the tool produces. Then explore data/118-hr5860 for CR-specific patterns, and data/118-hr4366 for the full complexity of an omnibus.

As baseline for comparison

When you extract your own bills, you can compare them to the examples:

# Compare your FY2025 omnibus to the FY2024 omnibus
congress-approp compare --base data/118-hr4366 --current data/119-hrNNNN

# Find similar provisions across fiscal years
congress-approp search --dir data --similar 118-hr9468:0 --top 5

Updating Example Data

The example data is checked into the git repository and should only be updated deliberately. The update process:

  1. Run extraction against the source XML: congress-approp extract --dir data/hrNNNN
  2. Run the audit to verify quality: congress-approp audit --dir data/hrNNNN
  3. Regenerate embeddings: congress-approp embed --dir data/hrNNNN
  4. Run the full test suite: cargo test
  5. Verify budget authority totals match expected values
  6. Update the hardcoded test values in tests/cli_tests.rs if totals changed (with justification)
  7. Update documentation if provision counts or metrics changed

Caution: LLM non-determinism means re-extraction may produce slightly different provision counts or classifications. The verification pipeline ensures dollar amounts are always correct, but provision-level details may vary. Only re-extract example data when there’s a specific reason (schema change, prompt improvement, new model).


Future Example Data

The goal is to eventually include all enacted appropriations bills so users can query without running the LLM extraction themselves. Planned additions:

  • FY2023 appropriations (117th and 118th Congress bills)
  • FY2025 appropriations (119th Congress bills, as they are enacted)
  • Defense appropriations (the largest single bill, not covered by the current omnibus example)
  • Labor-HHS-Education (the largest domestic bill, also not in the current examples)

Contributors who extract additional bills and verify their quality are welcome to submit them as additions to the example data.

Next Steps

Accuracy Metrics

This appendix provides a comprehensive breakdown of every verification metric across the included dataset. These numbers are the empirical basis for the trust claims made throughout this documentation.

All verification metrics are deterministic — computed by code against the source bill text, with zero LLM involvement. TAS resolution metrics include both deterministic matching and LLM-verified results.

Aggregate Summary

| Metric | Value |
|---|---|
| Bills processed | 32 (across 4 congresses, 116th–119th) |
| Fiscal years covered | FY2019–FY2026 (8 years) |
| Total provisions extracted | 34,568 |
| Total budget authority | $21.5 trillion |
| Dollar amounts NOT found in source | 1 (0.005% — a multi-amount edge case in H.R. 2471) |
| Dollar amounts verified (unique match) | 10,468 (56.3%) |
| Dollar amounts ambiguous (multiple matches) | 8,115 (43.7%) |
| Source traceability (raw_text in source) | 34,568 / 34,568 (100.000%) |
| Source spans (byte-level provenance) | 34,568 / 34,568 (100%) |
| Raw text byte-identical to source | 33,276 (96.3%) |
| Raw text repaired by verify-text | 1,292 (3.7%) — deterministic, zero LLM calls |
| Raw text not found at any tier | 0 |
| TAS resolution (provisions mapped to FAS codes) | 6,645 / 6,685 (99.4%) |
| TAS deterministic matches | 3,731 (55.8%) — zero false positives |
| TAS LLM-resolved matches | 2,914 (43.6%) — 20/20 spot-check correct |
| TAS unresolved | 40 (0.6%) — edge cases: Postal, intelligence, FDIC |
| Authority registry accounts | 1,051 unique FAS codes |
| Cross-bill linked accounts | 937 (appear in 2+ bills) |
| Name variants tracked | 443 authorities with multiple names |
| Rename events detected | 40 (with fiscal year boundary) |
| Budget regression pins | 8 / 8 bills match expected totals |
| Total rescissions | $24,659,349,709 |
| Total net budget authority | $840,360,231,845 |

Across the dataset, 99.995% of dollar amounts are verified across 34,568 provisions from 32 bills: all but one of the extracted dollar amounts were confirmed to exist in the source bill text.


Per-Bill Breakdown

H.R. 4366 — Consolidated Appropriations Act, 2024 (Omnibus)

| Category | Metric | Value |
|---|---|---|
| Provisions | Total extracted | 2,364 |
| | Appropriations | 1,216 (51.4%) |
| | Limitations | 456 (19.3%) |
| | Riders | 285 (12.1%) |
| | Directives | 120 (5.1%) |
| | Other | 84 (3.6%) |
| | Rescissions | 78 (3.3%) |
| | Transfer authorities | 77 (3.3%) |
| | Mandatory spending extensions | 40 (1.7%) |
| | Directed spending | 8 (0.3%) |
| Dollar Amounts | Provisions with amounts | 1,485 |
| | Verified (unique position) | 762 |
| | Ambiguous (multiple positions) | 723 |
| | Not found | 0 |
| Raw Text | Exact match | 2,285 (96.7%) |
| | Normalized match | 59 (2.5%) |
| | Spaceless match | 0 (0.0%) |
| | No match | 20 (0.8%) |
| Completeness | Dollar patterns in source | ~1,734 |
| | Accounted for by provisions | ~1,634 |
| | Coverage | 94.2% |
| Budget Authority | Gross BA | $846,137,099,554 |
| | Rescissions | $24,659,349,709 |
| | Net BA | $821,477,749,845 |

Notes on H.R. 4366 metrics:

  • The 723 ambiguous dollar amounts reflect the high frequency of round numbers in a 1,500-page bill. The most common: $5,000,000 appears 50 times, $1,000,000 appears 45 times, and $10,000,000 appears 38 times in the source text.
  • The 20 “no match” raw text provisions are all non-dollar provisions — statutory amendments (riders and mandatory spending extensions) where the LLM slightly reformatted section references. No provision with a dollar amount has a raw text mismatch.
  • Coverage of 94.2% means 5.8% of dollar strings in the source text were not matched to a provision. These are primarily statutory cross-references, loan guarantee ceilings, struck amounts in amendments, and proviso sub-references that are correctly excluded from extraction. See What Coverage Means (and Doesn’t).

H.R. 5860 — Continuing Appropriations Act, 2024 (CR)

| Category | Metric | Value |
|---|---|---|
| Provisions | Total extracted | 130 |
| | Riders | 49 (37.7%) |
| | Mandatory spending extensions | 44 (33.8%) |
| | CR substitutions | 13 (10.0%) |
| | Other | 12 (9.2%) |
| | Appropriations | 5 (3.8%) |
| | Limitations | 4 (3.1%) |
| | Directives | 2 (1.5%) |
| | CR baseline | 1 (0.8%) |
| Dollar Amounts | Provisions with amounts | 35 |
| | Verified (unique position) | 33 |
| | Ambiguous (multiple positions) | 2 |
| | Not found | 0 |
| CR Substitutions | Total pairs | 13 |
| | Both amounts verified | 13 (100%) |
| | Programs with cuts (negative delta) | 11 |
| | Programs with increases (positive delta) | 2 |
| | Largest cut | -$620,000,000 (Migration and Refugee Assistance) |
| | Largest increase | +$47,000,000 (FAA Facilities and Equipment) |
| Raw Text | Exact match | 102 (78.5%) |
| | Normalized match | 12 (9.2%) |
| | Spaceless match | 0 (0.0%) |
| | No match | 16 (12.3%) |
| Completeness | Dollar patterns in source | ~36 |
| | Accounted for by provisions | ~22 |
| | Coverage | 61.1% |
| Budget Authority | Gross BA | $16,000,000,000 |
| | Rescissions | $0 |
| | Net BA | $16,000,000,000 |

Notes on H.R. 5860 metrics:

  • The CR has a much higher proportion of non-spending provisions (riders and mandatory spending extensions) compared to an omnibus. Only 5 provisions are standalone appropriations — principally the $16 billion FEMA Disaster Relief Fund.
  • All 13 CR substitution pairs are fully verified: both the new amount ($X) and old amount ($Y) were found in the source text.
  • The 16 “no match” raw text provisions are riders and mandatory spending extensions that amend existing statutes. The LLM sometimes reformats section numbering in these provisions (e.g., adding a space after a closing parenthesis).
  • Coverage of 61.1% is expected for a continuing resolution. CRs reference prior-year appropriations acts extensively — those references contain dollar amounts that appear in the CR’s text but are contextual citations, not new provisions.

H.R. 9468 — Veterans Benefits Supplemental (Supplemental)

| Category | Metric | Value |
|---|---|---|
| Provisions | Total extracted | 7 |
| | Directives | 3 (42.9%) |
| | Appropriations | 2 (28.6%) |
| | Riders | 2 (28.6%) |
| Dollar Amounts | Provisions with amounts | 2 |
| | Verified (unique position) | 2 |
| | Ambiguous (multiple positions) | 0 |
| | Not found | 0 |
| Raw Text | Exact match | 5 (71.4%) |
| | Normalized match | 0 (0.0%) |
| | Spaceless match | 0 (0.0%) |
| | No match | 2 (28.6%) |
| Completeness | Dollar patterns in source | 2 |
| | Accounted for by provisions | 2 |
| | Coverage | 100.0% |
| Budget Authority | Gross BA | $2,882,482,000 |
| | Rescissions | $0 |
| | Net BA | $2,882,482,000 |

Notes on H.R. 9468 metrics:

  • This is the simplest bill in the example data — only 2 dollar amounts in the entire source text, both uniquely verifiable.
  • Perfect coverage: every dollar string in the source is accounted for.
  • The 2 “no match” raw text provisions are the SEC. 103 directives (reporting requirements), where the LLM’s raw text excerpt was truncated and doesn’t appear as-is in the source. The content is correct; only the excerpt boundary is slightly off.
  • Both appropriations ($2,285,513,000 for Compensation and Pensions + $596,969,000 for Readjustment Benefits) are verified at unique positions — the strongest possible verification result.

Amount Verification Detail

The verification pipeline searches for each provision’s text_as_written dollar string (e.g., "$2,285,513,000") verbatim in the source bill text.

Three outcomes

| Status | Meaning | Count | Percentage |
|---|---|---|---|
| Verified | Dollar string found at exactly one position — unambiguous location | 797 | 52.4% |
| Ambiguous | Dollar string found at multiple positions — correct but can’t pin location | 725 | 47.6% |
| Not Found | Dollar string not found anywhere in source — possible hallucination | 0 | 0.0% |
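The three outcomes reduce to counting verbatim occurrences of the dollar string in the bill text. A minimal sketch in Python (illustrative only, not the tool’s Rust implementation):

```python
def verify_amount(source: str, text_as_written: str) -> str:
    """Classify a dollar string by how many times it appears verbatim in the source."""
    n = source.count(text_as_written)
    if n == 1:
        return "verified"    # unique position: unambiguous location
    if n > 1:
        return "ambiguous"   # amount is real, but location can't be pinned
    return "not_found"       # possible hallucination

bill = "For Compensation and Pensions, $2,285,513,000: $5,000,000 here, $5,000,000 there"
print(verify_amount(bill, "$2,285,513,000"))  # verified
print(verify_amount(bill, "$5,000,000"))      # ambiguous
print(verify_amount(bill, "$9,999,999"))      # not_found
```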

Why ambiguous is so common

Round numbers appear frequently in appropriations bills. In H.R. 4366:

| Dollar String | Occurrences in Source |
|---|---|
| $5,000,000 | 50 |
| $1,000,000 | 45 |
| $10,000,000 | 38 |
| $15,000,000 | 27 |
| $3,000,000 | 25 |
| $500,000 | 24 |
| $50,000,000 | 20 |
| $30,000,000 | 19 |
| $2,000,000 | 19 |
| $25,000,000 | 16 |

When the tool finds $5,000,000 at 50 positions, it confirms the amount is real but can’t determine which of the 50 occurrences corresponds to this specific provision. That’s “ambiguous” — correct amount, uncertain location.

The 797 “verified” provisions have dollar amounts unique enough to appear exactly once in the entire bill — amounts like $10,643,713,000 (FBI Salaries and Expenses) or $33,266,226,000 (Child Nutrition Programs).

Internal consistency check

Beyond source text verification, the pipeline also checks that the parsed integer in amount.value.dollars is consistent with the text_as_written string. For example:

| text_as_written | Parsed dollars | Consistent? |
|---|---|---|
| "$2,285,513,000" | 2285513000 | ✓ Yes |
| "$596,969,000" | 596969000 | ✓ Yes |

Across all 1,522 provisions with dollar amounts: 0 internal consistency mismatches.
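The consistency check amounts to re-parsing the dollar string and comparing integers. A sketch, assuming text_as_written sits alongside value inside the amount object (the field placement here is an assumption, not confirmed schema):

```python
def is_consistent(provision: dict) -> bool:
    """Re-parse the text_as_written dollar string and compare to amount.value.dollars."""
    amt = provision["amount"]
    # "$2,285,513,000" -> 2285513000
    parsed = int(amt["text_as_written"].lstrip("$").replace(",", ""))
    return parsed == amt["value"]["dollars"]

p = {"amount": {"text_as_written": "$596,969,000", "value": {"dollars": 596969000}}}
print(is_consistent(p))  # True
```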


Raw Text Verification Detail

Each provision’s raw_text excerpt (~first 150 characters of the bill language) is checked as a substring of the source text using four-tier matching.

Tier results across all example data

| Tier | Method | Count | Percentage | What It Catches |
|---|---|---|---|---|
| Exact | Byte-identical substring | 2,392 | 95.6% | Clean, faithful extractions |
| Normalized | After collapsing whitespace and normalizing quotes and dashes to ASCII | 71 | 2.8% | Unicode formatting differences from XML-to-text conversion |
| Spaceless | After removing all spaces | 0 | 0.0% | Word-joining artifacts (none in this data) |
| No Match | Not found at any tier | 38 | 1.5% | Paraphrased, truncated, or concatenated excerpts |
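The tiers can be sketched as successively looser substring tests. This illustrates the tier order only; the exact normalization rules (which characters are folded, how whitespace is collapsed) are assumptions here:

```python
import re

# Fold curly quotes and en/em dashes to their ASCII equivalents.
_TRANS = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'",
                        "\u2019": "'", "\u2013": "-", "\u2014": "-"})

def _normalize(s: str) -> str:
    # Collapse whitespace runs and apply the character folding above.
    return re.sub(r"\s+", " ", s.translate(_TRANS)).strip()

def match_tier(source: str, excerpt: str) -> str:
    """Return the loosest transform needed to locate the excerpt in the source."""
    if excerpt in source:
        return "exact"
    if _normalize(excerpt) in _normalize(source):
        return "normalized"
    if _normalize(excerpt).replace(" ", "") in _normalize(source).replace(" ", ""):
        return "spaceless"
    return "no_match"

src = "the sum of $5,000,000 shall remain available"
print(match_tier(src, "sum of $5,000,000"))   # exact
print(match_tier(src, "sum  of\n$5,000,000")) # normalized
```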

Analysis of the 38 no-match provisions

All 38 “no match” provisions share a critical property: none of them carry dollar amounts. They are all non-dollar provisions — riders and mandatory spending extensions that amend existing statutes.

The typical pattern:

  • Source text: Section 1886(d)(5)(G) of the Social Security Act (42 U.S.C. 1395ww(d)(5)(G)) is amended—
  • LLM raw_text: Section 1886(d)(5)(G) of the Social Security Act (42 U.S.C. 1395ww(d)(5)(G)) is amended— (1) clause...

The LLM included text from the next line, creating a raw_text that doesn’t appear as a contiguous substring in the source. The statutory reference and substance are correct; the excerpt boundary is slightly off.

Implication: The 38 no-match provisions don’t undermine the tool’s financial accuracy — they affect only the provenance trail for non-dollar legislative provisions. Dollar amounts are verified independently through the amount checks, which show 0 not-found across all data.

Per-bill breakdown

| Bill | Exact | Normalized | Spaceless | No Match | Total |
|---|---|---|---|---|---|
| H.R. 4366 | 2,285 (96.7%) | 59 (2.5%) | 0 (0.0%) | 20 (0.8%) | 2,364 |
| H.R. 5860 | 102 (78.5%) | 12 (9.2%) | 0 (0.0%) | 16 (12.3%) | 130 |
| H.R. 9468 | 5 (71.4%) | 0 (0.0%) | 0 (0.0%) | 2 (28.6%) | 7 |
| Total | 2,392 (95.6%) | 71 (2.8%) | 0 (0.0%) | 38 (1.5%) | 2,501 |

Note: The detailed per-bill breakdown above covers the original three FY2024 example bills (H.R. 4366, H.R. 5860, H.R. 9468). The aggregate metrics at the top of this page reflect all 32 bills across FY2019–FY2026.

The omnibus has the highest exact match rate (96.7%), which makes sense — it’s the most straightforward appropriations text. The CR and supplemental have more statutory amendments (which are harder to quote exactly), contributing to their higher no-match rates.


Completeness (Coverage) Detail

Coverage measures what percentage of dollar-sign patterns in the source text were matched to at least one extracted provision’s text_as_written field.
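Mechanically, this is a set-membership count over dollar-sign patterns. A rough sketch, where both the `\$[0-9][0-9,]*` pattern and the flat provision structure are assumptions about the real matcher:

```python
import re

def coverage_pct(source: str, provisions: list) -> float:
    """Percent of $-patterns in the source matched by some provision's text_as_written."""
    patterns = re.findall(r"\$[0-9][0-9,]*", source)
    captured = {p["text_as_written"] for p in provisions}
    accounted = sum(1 for s in patterns if s in captured)
    return 100.0 * accounted / len(patterns)

src = "appropriated $2,285,513,000 and $596,969,000 and citing $500,000,000 elsewhere"
provs = [{"text_as_written": "$2,285,513,000"}, {"text_as_written": "$596,969,000"}]
print(f"{coverage_pct(src, provs):.1f}%")  # 66.7% -- the cited amount is unaccounted
```

Note that the uncaptured `$500,000,000` in this toy example plays the role of a contextual citation: lowering coverage is the correct outcome for it.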

Per-bill coverage

| Bill | Dollar Patterns in Source | Accounted For | Coverage |
|---|---|---|---|
| H.R. 4366 | ~1,734 | ~1,634 | 94.2% |
| H.R. 5860 | ~36 | ~22 | 61.1% |
| H.R. 9468 | 2 | 2 | 100.0% |

Why coverage varies

H.R. 9468 (100%): The simplest bill — only 2 dollar amounts in the entire source text, both captured.

H.R. 4366 (94.2%): The ~100 unaccounted dollar strings are primarily:

  • Statutory cross-references to other laws (dollar amounts cited for context, not new provisions)
  • Loan guarantee face values (not budget authority)
  • Old amounts being struck by amendments (“striking ‘$50,000’ and inserting ‘$75,000’”)
  • Proviso sub-amounts that are part of a parent provision’s context

H.R. 5860 (61.1%): Continuing resolutions reference prior-year appropriations acts extensively. Those referenced acts contain many dollar amounts that appear in the CR’s text but are citations of prior-year levels, not new provisions. Only the 13 CR substitutions, 5 standalone appropriations, and a few limitations represent genuine new provisions with dollar amounts.

Why coverage < 100% doesn’t mean errors

Coverage below 100% means there are dollar strings in the source text that weren’t captured as provisions. For most of these, non-capture is the correct behavior:

  • A statutory reference like “section 1241(a) ($500,000,000 for each fiscal year)” contains a dollar amount from another law — it’s not a new appropriation in this bill.
  • A loan guarantee ceiling like “$3,500,000,000 for guaranteed farm ownership loans” is a loan volume limit, not budget authority.
  • An amendment language like “striking ‘$50,000’” contains an old amount that’s being replaced — the replacement amount is the one that matters.

See What Coverage Means (and Doesn’t) for a comprehensive explanation with examples.


CR Substitution Verification

All 13 CR substitutions in H.R. 5860 are fully verified — both the new amount ($X in “substituting $X for $Y”) and the old amount ($Y) were found in the source bill text:

| # | Account | New Amount Verified? | Old Amount Verified? |
|---|---|---|---|
| 1 | Rural Housing Service—Rural Community Facilities | ✓ | ✓ |
| 2 | Rural Utilities Service—Rural Water and Waste Disposal | ✓ | ✓ |
| 3 | (section 521(d)(1) reference) | ✓ | ✓ |
| 4 | NSF—STEM Education | ✓ | ✓ |
| 5 | NOAA—Operations, Research, and Facilities | ✓ | ✓ |
| 6 | NSF—Research and Related Activities | ✓ | ✓ |
| 7 | State Dept—Diplomatic Programs | ✓ | ✓ |
| 8 | Bilateral Econ. Assistance—International Disaster Assistance | ✓ | ✓ |
| 9 | Bilateral Econ. Assistance—Migration and Refugee Assistance | ✓ | ✓ |
| 10 | Int’l Security Assistance—Narcotics Control | ✓ | ✓ |
| 11 | OPM—Salaries and Expenses | ✓ | ✓ |
| 12 | DOT—FAA Facilities and Equipment (#1) | ✓ | ✓ |
| 13 | DOT—FAA Facilities and Equipment (#2) | ✓ | ✓ |

26 of 26 dollar amounts verified (13 new + 13 old). This is the strongest verification possible for CR substitutions — both sides of every “substituting X for Y” pair are confirmed in the source text.
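Verifying a substitution pair is just two substring tests against the source. A minimal sketch; the quoted amounts below are made up for the example and are not from H.R. 5860:

```python
def verify_substitution(source: str, new_amount: str, old_amount: str) -> bool:
    """Both sides of a 'substituting $X for $Y' pair must appear in the bill text."""
    return new_amount in source and old_amount in source

cr = 'shall be applied by substituting "$16,000,000,000" for "$12,000,000,000"'
print(verify_substitution(cr, "$16,000,000,000", "$12,000,000,000"))  # True
```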


Budget Authority Verification

Budget authority is computed deterministically from provisions — never from LLM-generated summaries.

The formula

Budget Authority = sum of amount.value.dollars
    WHERE provision_type = "appropriation"
    AND   amount.semantics = "new_budget_authority"
    AND   detail_level NOT IN ("sub_allocation", "proviso_amount")

Detail level filtering

In H.R. 4366, the detail level distribution for appropriation-type provisions is:

| Detail Level | Count | Included in BA? |
|---|---|---|
| top_level | 483 | Yes |
| sub_allocation | 396 | No — breakdowns of parent accounts |
| line_item | 272 | Yes |
| proviso_amount | 65 | No — conditions, not independent appropriations |

Without the detail level filter, the budget authority sum would be $846,159,099,554 — exactly $22 million higher than the correct total of $846,137,099,554. That $22 million represents sub-allocations and proviso amounts correctly excluded from the total.

Regression testing

The exact budget authority totals are hardcoded in the integration test suite:

let expected: Vec<(&str, i64, i64)> = vec![
    ("H.R. 4366", 846_137_099_554, 24_659_349_709),
    ("H.R. 5860", 16_000_000_000, 0),
    ("H.R. 9468", 2_882_482_000, 0),
];

Any change to the extraction data, provision parsing, or budget authority calculation that would alter these numbers is caught immediately by the budget_authority_totals_match_expected test. This is the tool’s primary financial integrity guard.

Independent reproducibility

The budget authority calculation can be independently reproduced in Python:

import json

with open("data/118-hr4366/extraction.json") as f:
    data = json.load(f)

ba = 0
for p in data["provisions"]:
    if p["provision_type"] != "appropriation":
        continue
    amt = p.get("amount")
    if not amt or amt.get("semantics") != "new_budget_authority":
        continue
    val = amt.get("value", {})
    if val.get("kind") != "specific":
        continue
    dl = p.get("detail_level", "")
    if dl in ("sub_allocation", "proviso_amount"):
        continue
    ba += val["dollars"]

print(f"Budget Authority: ${ba:,.0f}")
# Output: Budget Authority: $846,137,099,554

This produces exactly the same number as the CLI. If the Python and Rust calculations ever disagree, something is wrong.


What These Metrics Do and Don’t Prove

What the metrics prove

| Claim | Evidence |
|---|---|
| Extracted dollar amounts are real | 0 of 1,522 dollar amounts not found in source text |
| Dollar parsing is consistent | 0 internal mismatches between text_as_written and parsed dollars |
| CR substitution pairs are complete | 26 of 26 amounts (13 new + 13 old) verified in source |
| Raw text excerpts are faithful | 95.6% byte-identical to source; the no-match excerpts are all non-dollar provisions |
| Budget authority is deterministic | Computed from provisions, not LLM summaries; regression-tested; independently reproducible |
| Sub-allocations don’t double-count | Detail level filter excludes them; $22M difference confirms correct filtering |

What the metrics don’t prove

| Limitation | Why |
|---|---|
| Classification correctness | Verification can’t check whether a “rider” should really be a “limitation” — that’s LLM judgment |
| Attribution correctness for ambiguous amounts | When $5,000,000 appears 50 times, verification confirms the amount exists but can’t prove it’s attributed to the right account |
| Completeness of non-dollar provisions | The coverage metric only counts dollar strings; riders and directives without dollar amounts are not measured |
| Fiscal year correctness | The fiscal_year field is inferred by the LLM; verification doesn’t independently confirm it |
| Detail level correctness | If the LLM marks a sub-allocation as top_level, it would be incorrectly included in budget authority; this is not automatically detected per-provision |

The 95.6% exact match rate as attribution evidence

While verification cannot mathematically prove attribution (that a dollar amount is assigned to the correct account), the 95.6% exact raw text match rate provides strong indirect evidence:

  • If the raw text excerpt is byte-identical to a passage in the source, and that passage mentions an account name and a dollar amount, the provision is almost certainly attributed correctly.
  • The 38 provisions without text matches are all non-dollar provisions, so attribution is a non-issue for them.
  • For the 725 ambiguous dollar amounts, the combination of a verified dollar amount and an exact raw text match narrows the attribution to the specific passage the raw text came from.

For high-stakes analysis, supplement the automated verification with manual spot-checks of critical provisions. See Verify Extraction Accuracy for the procedure.


Reproducing These Metrics

You can reproduce every metric in this appendix using the included example data:

# The full audit table
congress-approp audit --dir data

# Budget authority totals
congress-approp summary --dir data --format json

# Provision type counts
congress-approp search --dir data --format json | \
  jq 'group_by(.provision_type) | map({type: .[0].provision_type, count: length}) | sort_by(-.count)'

# CR substitution verification
congress-approp search --dir data/118-hr5860 --type cr_substitution --format json | jq length

# Detailed verification data
cat data/118-hr9468/verification.json | python3 -m json.tool | head -50

All of these commands work with no API keys against the included data/ directory.


How Metrics Change with Re-Extraction

Due to LLM non-determinism, re-extracting the same bill may produce slightly different metrics:

| Metric | Stability | Notes |
|---|---|---|
| Dollar amounts not found | Very stable (always 0) | Dollar verification is independent of classification |
| Budget authority total | Stable (within ±0.1%) | Small provision count changes rarely affect the aggregate |
| Provision count | Moderately stable (±1–3%) | The LLM may split or merge provisions differently |
| Raw text exact match rate | Moderately stable (±2%) | Different excerpt boundaries may shift a few provisions between tiers |
| Coverage | Moderately stable (±3%) | Depends on how many sub-amounts the LLM captures |
| Classification distribution | Less stable (±5%) | A provision may be classified as rider in one run and limitation in another |

The verification pipeline ensures that dollar amount accuracy is invariant across re-extractions — even if provision counts or classifications change, the verified amounts are always correct because they’re checked against the source text, not against the LLM’s internal state.



Changelog

All notable changes to congress-approp are documented here. The format is based on Keep a Changelog.

For the full changelog with technical details, see CHANGELOG.md in the repository.


[5.1.0] — 2026-03-20

Breaking Changes

  • examples/ renamed to data/ with congress-prefixed directory naming (118-hr4366, 119-hr7148). Default --dir changed to ./data. Provision references use congress prefix: 118-hr9468:0.
  • Implicit agency normalization removed. The hardcoded SUB_AGENCY_TO_PARENT lookup table has been replaced with explicit, user-managed dataset.json rules. Compare uses exact matching by default.
  • compare() library API gains agency_groups and account_aliases parameters.
  • Crate no longer includes bill data. Package size reduced from 5.4MB to ~500KB. Use git clone for the full dataset.

Added

  • dataset.json — user-managed entity resolution file for agency groups and account aliases.
  • normalize suggest-text-match — discovers agency naming variants via orphan-pair analysis and regex patterns. Caches results for the accept command.
  • normalize suggest-llm — LLM-assisted entity resolution with XML heading context. Caches results for the accept command.
  • normalize accept — accepts suggestions by hash from cached suggest results. Supports --auto for accepting all.
  • normalize list — displays current entity resolution rules.
  • compare --exact — disables all normalization from dataset.json.
  • (normalized) marker in compare table output; separate normalized column in CSV.
  • Orphan-pair hint in compare stderr suggesting normalize suggest-text-match.
  • Congress number in all output — H.R. 7148 (119th) in summary, search, compare, semantic search.
  • Cache system (~/.congress-approp/cache/) for suggest/accept workflows with automatic invalidation.
  • H.R. 2882 (FY2024 second omnibus) — completes FY2024 with all 12 subcommittees. Dataset: 14 bills, 11,136 provisions, $8.9 trillion.
  • test-data/ directory with 3 small bills for crate integration tests.
  • Download command creates flat {congress}-{type}{number} directories.
  • 220 tests (169 unit + 51 integration).

Fixed

  • Documentation updated from examples/ to data/ across README and ~30 book chapters.
  • Inconsistent --dir defaults unified to ./data.
  • Export tutorial column table now matches actual CSV output.

[4.2.1] — 2026-03-19

Added

  • H.R. 2882 (FY2024 second omnibus) — 2,582 provisions, $2.45 trillion, covering Defense, Financial Services, Homeland Security, Labor-HHS, Legislative Branch, State-Foreign Operations.

[4.2.0] — 2026-03-19

Added

  • fiscal_year, detail_level, confidence, provision_index, and match_tier columns in search --format csv output. The CSV now matches the documented column set.
  • fiscal_year() and detail_level() accessor methods on the Provision enum in the library API.
  • fiscal_years field in BillSummary and a new “FYs” column in the summary table showing which fiscal years each bill covers.
  • Smart export warning — when exporting to CSV/JSON/JSONL, stderr shows a breakdown by semantics type and warns about sub-allocation summing when mixed semantics are present.
  • Export Data section in README with quick export patterns and a sub-allocation warning.
  • 3 new integration tests plus 5 new assertions on existing tests. Total: 191 tests (146 unit + 45 integration).

Fixed

  • Documentation: Export tutorial listed CSV columns that didn’t exist. Code now matches docs. Added bold warning about sub-allocation summing trap and “Computing Totals Correctly” subsection.

[4.1.0] — 2026-03-19

Added

  • --real flag on compare — inflation-adjusted “Real Δ %*” column using CPI-U data from the Bureau of Labor Statistics.
  • --cpi-file <PATH> flag on compare — override bundled CPI-U data with a custom price index file.
  • inflation.rs module — CPI data loading, fiscal-year-weighted averages, inflation rate calculation, real delta computation. 16 unit tests.
  • Bundled CPI data (cpi.json) — monthly CPI-U values from Jan 2013 through Feb 2026. No network access required at runtime.
  • Inflation flags — ▲ (real increase), ▼ (real cut or inflation erosion), — (unchanged) in compare output.
  • Inflation-aware CSV and JSON output with real_delta_pct and inflation_flag columns/fields.
  • Staleness warning when bundled CPI data is more than 60 days old.
  • Inflation adjustment how-to chapter in the documentation book.

[4.0.0] — 2026-03-19

Added

  • enrich command — generates bill_meta.json per bill with fiscal year metadata, subcommittee/jurisdiction mappings, advance appropriation classification, bill nature enrichment, and canonical account names. No API keys required.
  • relate command — deep-dive on one provision across all bills with embedding similarity, confidence tiers, fiscal year timeline (--fy-timeline), and deterministic link hashes (--format hashes).
  • link suggest / link accept / link remove / link list — persistent cross-bill provision links. Discover candidates via embedding similarity, accept by hash, manage saved relationships.
  • --fy <YEAR> on summary, search, compare — filter to bills covering a specific fiscal year.
  • --subcommittee <SLUG> on summary, search, compare — filter by appropriations subcommittee jurisdiction (requires enrich).
  • --show-advance on summary — separates current-year from advance appropriations in the output.
  • --base-fy / --current-fy on compare — compare all bills for one fiscal year against another.
  • compare --use-links — uses accepted links for matching across renames.
  • Advance appropriation detection — fiscal-year-aware classification identifying $1.49 trillion in advance appropriations across the 13-bill dataset.
  • Cross-semantics orphan rescue in compare — recovers provisions like Transit Formula Grants ($14.6B) that have different semantics across bills.
  • Sub-agency normalization — 35-entry lookup table resolving agency granularity mismatches in compare (e.g., “Maritime Administration” ↔ “Department of Transportation”).
  • Pre-enriched bill_meta.json for all 13 example bills.

Changed

  • Compare uses case-insensitive account matching — resolves 52 false orphans from capitalization differences.
  • Summary displays enriched bill classification when bill_meta.json is available (e.g., “Full-Year CR with Appropriations” instead of “Continuing Resolution”).
  • Summary handler consolidated to call query::summarize() instead of reimplementing inline.
  • Hash chain extended to cover bill_meta.json.
  • Version bumped to 4.0.0.

[3.2.0] — 2026-03-18

Added

  • --continue-on-error flag on extract — opt-in to saving partial results when some chunks fail.

Changed

  • Extract aborts on chunk failure by default. Prevents garbage partial extractions.
  • Per-bill error handling in multi-bill extraction runs.

[3.1.0] — 2026-03-18

Added

  • --all-versions flag on download — explicitly download all text versions (introduced, engrossed, enrolled, etc.) when needed for conference tracking or bill comparison workflows.
  • --force flag on extract — re-extract bills even if extraction.json already exists. Without this flag, already-extracted bills are automatically skipped, making it safe to re-run after partial failures.

Changed

  • Download defaults to enrolled only. The download command now fetches only the enrolled (signed into law) XML by default, instead of every available text version. This prevents downloading 4–6 unnecessary files per bill and avoids wasted API calls during extraction. Use --version to request a specific version or --all-versions for all versions.
  • Extract prefers enrolled XML. When a bill directory contains multiple BILLS-*.xml files, the extract command automatically uses only the enrolled version (*enr.xml) and ignores other versions.
  • Extract skips already-extracted bills. If extraction.json already exists in a bill directory, extract skips it with an informational message. Use --force to override. The ANTHROPIC_API_KEY is not required when all bills are already extracted.
  • Extract is resilient to parse failures. If an XML file fails to parse (e.g., a non-enrolled version with an unexpected structure), the tool logs a warning and continues to the next bill instead of aborting the entire run.
  • Better error messages on XML parse failure. Parse errors now include the filename that failed.
  • Version bumped to 3.1.0.

[3.0.0] — 2026-03-17

Added

  • Semantic search — --semantic "query" on the search command ranks provisions by meaning similarity using OpenAI embeddings. Finds “Child Nutrition Programs” from “school lunch programs for kids” with zero keyword overlap. See Use Semantic Search.
  • Find similar — --similar bill_dir:index finds provisions most similar to a specific one across all loaded bills. Useful for cross-bill matching and year-over-year tracking. No API call needed — uses pre-computed vectors. See Track a Program Across Bills.
  • embed command — generates embeddings for extracted bills using OpenAI text-embedding-3-large. Writes embeddings.json (metadata) + vectors.bin (binary float32 vectors) per bill directory. Skips up-to-date bills automatically. See Generate Embeddings.
  • Pre-generated embeddings for all three example bills (3,072 dimensions). Semantic search works on example data without running embed.
  • OpenAI API client (src/api/openai/) for the embeddings endpoint.
  • Hash chain — source_xml_sha256 in metadata.json, extraction_sha256 in embeddings.json. Enables staleness detection across the full pipeline. See Data Integrity and the Hash Chain.
  • Staleness detection (src/approp/staleness.rs) — checks whether downstream artifacts are consistent with their inputs. Warns but never blocks.
  • --top N flag on search for controlling semantic/similar result count (default 20).
  • Cosine similarity utilities in embeddings.rs with unit tests.
  • build_embedding_text() in query.rs — deterministic text builder for provision embeddings.

Changed

  • handle_search is now async to support OpenAI embedding API calls.
  • README: removed coverage percentages from intro and bill table (was confusing). Updated summary table example to match current output.
  • chunks/ directory renamed from .chunks/ — LLM artifacts kept as local provenance (gitignored, not part of hash chain).
  • Example metadata.json files updated with source_xml_sha256 field.

[2.1.0] — 2026-03-17

Added

  • --division filter on search command — scope results to a single division letter (e.g., --division A for MilCon-VA).
  • --min-dollars and --max-dollars filters on search command — find provisions within a dollar range.
  • --format jsonl output on search and summary — one JSON object per line, pipeable with jq. See Output Formats.
  • Enhanced --dry-run on extract — now shows chunk count and estimated input tokens.
  • Footer on summary table showing count of unverified dollar amounts across all bills.
  • This changelog.

Changed

  • summary table no longer shows the Coverage column — it was routinely misinterpreted as an accuracy metric when it actually measures what percentage of dollar strings in the source text were matched to a provision. Many unmatched dollar strings (statutory references, loan ceilings, old amounts being struck) are correctly excluded. The coverage metric remains available in audit and in --format json output as completeness_pct. See What Coverage Means (and Doesn’t).

Fixed

  • cargo fmt and cargo clippy clean.

[2.0.0] — 2026-03-17

Added

  • --model flag and APPROP_MODEL environment variable on extract command — override the default LLM model. See Extract Provisions from a Bill.
  • upgrade command — migrate extraction data to the latest schema version and re-verify without LLM. See Upgrade Extraction Data.
  • audit command (replaces report) — detailed verification breakdown per bill. See Verify Extraction Accuracy.
  • compare command warns when comparing different bill classifications (e.g., supplemental vs. omnibus).
  • amount_status field in search output — found, found_multiple, or not_found.
  • quality field in search output — strong, moderate, or weak derived from verification data.
  • match_tier field in search output — exact, normalized, spaceless, or no_match.
  • schema_version field in extraction.json and verification.json.
  • 18 integration tests covering all CLI commands with pinned budget authority totals.

Changed

  • report command renamed to audit (report kept as alias).
  • Search output field verified renamed to amount_status with richer values.
  • compare output status labels changed: eliminated → only in base, new → only in current.
  • arithmetic_checks field in verification.json deprecated — omitted from new files, old files still load.

Removed

  • hallucinated terminology removed from all output and documentation.

[1.2.0] — 2026-03-16

Added

  • audit command with column guide explaining every metric.
  • compare command guard rails for cross-classification comparisons.

Changed

  • Terminology overhaul: report → audit throughout documentation.

[1.1.0] — 2026-03-16

Added

  • Schema versioning (schema_version: "1.0") in extraction and verification files.
  • upgrade command for migrating pre-versioned data.
  • Verification clarity improvements — column guide in audit output.

Fixed

  • SuchSums amount variants now serialize correctly (fixed via upgrade path).

[1.0.0] — 2026-03-16

Initial release.

Features

See Included Example Bills for detailed profiles of each bill.


Version Numbering

This project uses Semantic Versioning:

  • Major (e.g., 2.0.0 → 3.0.0): Breaking changes to the CLI interface, JSON output schema, or library API. Existing scripts or integrations may need updates.
  • Minor (e.g., 2.0.0 → 2.1.0): New features, new commands, new flags, new output fields. Backward-compatible — existing scripts continue to work.
  • Patch (e.g., 3.0.0 → 3.0.1): Bug fixes, documentation improvements, dependency updates. No behavioral changes.

The extraction data schema has its own version (schema_version field in extraction.json). The upgrade command handles schema migrations without re-extraction.