Accuracy Metrics
This appendix provides a comprehensive breakdown of every verification metric across the included dataset. These numbers are the empirical basis for the trust claims made throughout this documentation.
All verification metrics are deterministic — computed by code against the source bill text, with zero LLM involvement. The one exception is TAS resolution, whose metrics include both deterministic matching and LLM-resolved results.
Aggregate Summary
| Metric | Value |
|---|---|
| Bills processed | 32 (across 4 congresses, 116th–119th) |
| Fiscal years covered | FY2019–FY2026 (8 years) |
| Total provisions extracted | 34,568 |
| Total budget authority | $21.5 trillion |
| Dollar amounts NOT found in source | 1 (0.005% — a multi-amount edge case in H.R. 2471) |
| Dollar amounts verified (unique match) | 10,468 (56.3%) |
| Dollar amounts ambiguous (multiple matches) | 8,115 (43.7%) |
| Source traceability (raw_text in source) | 34,568 / 34,568 (100.000%) |
| Source spans (byte-level provenance) | 34,568 / 34,568 (100%) |
| Raw text byte-identical to source | 33,276 (96.3%) |
| Raw text repaired by verify-text | 1,292 (3.7%) — deterministic, zero LLM calls |
| Raw text not found at any tier | 0 |
| TAS resolution (provisions mapped to FAS codes) | 6,645 / 6,685 (99.4%) |
| TAS deterministic matches | 3,731 (55.8%) — zero false positives |
| TAS LLM-resolved matches | 2,914 (43.6%) — 20/20 spot-check correct |
| TAS unresolved | 40 (0.6%) — edge cases: Postal, intelligence, FDIC |
| Authority registry accounts | 1,051 unique FAS codes |
| Cross-bill linked accounts | 937 (appear in 2+ bills) |
| Name variants tracked | 443 authorities with multiple names |
| Rename events detected | 40 (with fiscal year boundary) |
| Budget regression pins | 8 / 8 bills match expected totals |
| Total rescissions (FY2024 example bills) | $24,659,349,709 |
| Total net budget authority (FY2024 example bills) | $840,360,231,845 |
The single most important number: 99.995% of dollar amounts verified across 34,568 provisions from 32 bills. Every extracted dollar amount except the single H.R. 2471 edge case noted above was confirmed to exist in the source bill text.
Per-Bill Breakdown
H.R. 4366 — Consolidated Appropriations Act, 2024 (Omnibus)
| Category | Metric | Value |
|---|---|---|
| Provisions | Total extracted | 2,364 |
| | Appropriations | 1,216 (51.4%) |
| | Limitations | 456 (19.3%) |
| | Riders | 285 (12.1%) |
| | Directives | 120 (5.1%) |
| | Other | 84 (3.6%) |
| | Rescissions | 78 (3.3%) |
| | Transfer authorities | 77 (3.3%) |
| | Mandatory spending extensions | 40 (1.7%) |
| | Directed spending | 8 (0.3%) |
| Dollar Amounts | Provisions with amounts | 1,485 |
| | Verified (unique position) | 762 |
| | Ambiguous (multiple positions) | 723 |
| | Not found | 0 |
| Raw Text | Exact match | 2,285 (96.7%) |
| | Normalized match | 59 (2.5%) |
| | Spaceless match | 0 (0.0%) |
| | No match | 20 (0.8%) |
| Completeness | Dollar patterns in source | ~1,734 |
| | Accounted for by provisions | ~1,634 |
| | Coverage | 94.2% |
| Budget Authority | Gross BA | $846,137,099,554 |
| | Rescissions | $24,659,349,709 |
| | Net BA | $821,477,749,845 |
Notes on H.R. 4366 metrics:
- The 723 ambiguous dollar amounts reflect the high frequency of round numbers in a 1,500-page bill. The most common: $5,000,000 appears 50 times, $1,000,000 appears 45 times, and $10,000,000 appears 38 times in the source text.
- The 20 “no match” raw text provisions are all non-dollar provisions — statutory amendments (riders and mandatory spending extensions) where the LLM slightly reformatted section references. No provision with a dollar amount has a raw text mismatch.
- Coverage of 94.2% means 5.8% of dollar strings in the source text were not matched to a provision. These are primarily statutory cross-references, loan guarantee ceilings, struck amounts in amendments, and proviso sub-references that are correctly excluded from extraction. See What Coverage Means (and Doesn’t).
H.R. 5860 — Continuing Appropriations Act, 2024 (CR)
| Category | Metric | Value |
|---|---|---|
| Provisions | Total extracted | 130 |
| | Riders | 49 (37.7%) |
| | Mandatory spending extensions | 44 (33.8%) |
| | CR substitutions | 13 (10.0%) |
| | Other | 12 (9.2%) |
| | Appropriations | 5 (3.8%) |
| | Limitations | 4 (3.1%) |
| | Directives | 2 (1.5%) |
| | CR baseline | 1 (0.8%) |
| Dollar Amounts | Provisions with amounts | 35 |
| | Verified (unique position) | 33 |
| | Ambiguous (multiple positions) | 2 |
| | Not found | 0 |
| CR Substitutions | Total pairs | 13 |
| | Both amounts verified | 13 (100%) |
| | Programs with cuts (negative delta) | 11 |
| | Programs with increases (positive delta) | 2 |
| | Largest cut | -$620,000,000 (Migration and Refugee Assistance) |
| | Largest increase | +$47,000,000 (FAA Facilities and Equipment) |
| Raw Text | Exact match | 102 (78.5%) |
| | Normalized match | 12 (9.2%) |
| | Spaceless match | 0 (0.0%) |
| | No match | 16 (12.3%) |
| Completeness | Dollar patterns in source | ~36 |
| | Accounted for by provisions | ~22 |
| | Coverage | 61.1% |
| Budget Authority | Gross BA | $16,000,000,000 |
| | Rescissions | $0 |
| | Net BA | $16,000,000,000 |
Notes on H.R. 5860 metrics:
- The CR has a much higher proportion of non-spending provisions (riders and mandatory spending extensions) compared to an omnibus. Only 5 provisions are standalone appropriations — principally the $16 billion FEMA Disaster Relief Fund.
- All 13 CR substitution pairs are fully verified: both the new amount ($X) and old amount ($Y) were found in the source text.
- The 16 “no match” raw text provisions are riders and mandatory spending extensions that amend existing statutes. The LLM sometimes reformats section numbering in these provisions (e.g., adding a space after a closing parenthesis).
- Coverage of 61.1% is expected for a continuing resolution. CRs reference prior-year appropriations acts extensively — those references contain dollar amounts that appear in the CR’s text but are contextual citations, not new provisions.
H.R. 9468 — Veterans Benefits Supplemental (Supplemental)
| Category | Metric | Value |
|---|---|---|
| Provisions | Total extracted | 7 |
| | Directives | 3 (42.9%) |
| | Appropriations | 2 (28.6%) |
| | Riders | 2 (28.6%) |
| Dollar Amounts | Provisions with amounts | 2 |
| | Verified (unique position) | 2 |
| | Ambiguous (multiple positions) | 0 |
| | Not found | 0 |
| Raw Text | Exact match | 5 (71.4%) |
| | Normalized match | 0 (0.0%) |
| | Spaceless match | 0 (0.0%) |
| | No match | 2 (28.6%) |
| Completeness | Dollar patterns in source | 2 |
| | Accounted for by provisions | 2 |
| | Coverage | 100.0% |
| Budget Authority | Gross BA | $2,882,482,000 |
| | Rescissions | $0 |
| | Net BA | $2,882,482,000 |
Notes on H.R. 9468 metrics:
- This is the simplest bill in the example data — only 2 dollar amounts in the entire source text, both uniquely verifiable.
- Perfect coverage: every dollar string in the source is accounted for.
- The 2 “no match” raw text provisions are the SEC. 103 directives (reporting requirements), where the LLM’s raw text excerpt was truncated and doesn’t appear as-is in the source. The content is correct; only the excerpt boundary is slightly off.
- Both appropriations ($2,285,513,000 for Compensation and Pensions + $596,969,000 for Readjustment Benefits) are verified at unique positions — the strongest possible verification result.
Amount Verification Detail
The verification pipeline searches for each provision’s text_as_written dollar string (e.g., "$2,285,513,000") verbatim in the source bill text. The counts in this section cover the three FY2024 example bills profiled above; the aggregate table at the top of this page covers all 32 bills.
Three outcomes
| Status | Meaning | Count | Percentage |
|---|---|---|---|
| Verified | Dollar string found at exactly one position — unambiguous location | 797 | 52.4% |
| Ambiguous | Dollar string found at multiple positions — correct but can’t pin location | 725 | 47.6% |
| Not Found | Dollar string not found anywhere in source — possible hallucination | 0 | 0.0% |
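The outcome logic is simple enough to sketch in a few lines of Python. This is a minimal illustration of the verbatim-count approach described above, not the tool’s actual implementation; the helper and the toy source string are hypothetical.

```python
def classify_amount(source_text: str, text_as_written: str) -> str:
    """Classify a provision's dollar string by how many times it
    appears verbatim in the source bill text."""
    count = source_text.count(text_as_written)
    if count == 0:
        return "not_found"   # possible hallucination
    if count == 1:
        return "verified"    # unique, unambiguous position
    return "ambiguous"       # real amount, uncertain location

# Example: a unique large amount vs. a common round number
source = "... $10,643,713,000 ... $5,000,000 ... $5,000,000 ..."
print(classify_amount(source, "$10,643,713,000"))  # verified
print(classify_amount(source, "$5,000,000"))       # ambiguous
```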
Why ambiguous is so common
Round numbers appear frequently in appropriations bills. In H.R. 4366:
| Dollar String | Occurrences in Source |
|---|---|
| $5,000,000 | 50 |
| $1,000,000 | 45 |
| $10,000,000 | 38 |
| $15,000,000 | 27 |
| $3,000,000 | 25 |
| $500,000 | 24 |
| $50,000,000 | 20 |
| $30,000,000 | 19 |
| $2,000,000 | 19 |
| $25,000,000 | 16 |
When the tool finds $5,000,000 at 50 positions, it confirms the amount is real but can’t determine which of the 50 occurrences corresponds to this specific provision. That’s “ambiguous” — correct amount, uncertain location.
The 797 “verified” provisions have dollar amounts unique enough to appear exactly once in the entire bill — amounts like $10,643,713,000 (FBI Salaries and Expenses) or $33,266,226,000 (Child Nutrition Programs).
Internal consistency check
Beyond source text verification, the pipeline also checks that the parsed integer in amount.value.dollars is consistent with the text_as_written string. For example:
| text_as_written | Parsed dollars | Consistent? |
|---|---|---|
"$2,285,513,000" | 2285513000 | ✓ Yes |
"$596,969,000" | 596969000 | ✓ Yes |
Across all 1,522 provisions with dollar amounts: 0 internal consistency mismatches.
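A minimal sketch of that consistency check, assuming text_as_written always starts with a dollar sign and uses comma separators (the helper names are hypothetical, not the pipeline’s code):

```python
def parse_dollars(text_as_written: str) -> int:
    """Parse a written string like "$2,285,513,000" into an integer."""
    return int(text_as_written.lstrip("$").replace(",", ""))

def is_consistent(text_as_written: str, parsed_dollars: int) -> bool:
    """True if amount.value.dollars matches the written string."""
    return parse_dollars(text_as_written) == parsed_dollars

assert is_consistent("$2,285,513,000", 2_285_513_000)
assert is_consistent("$596,969,000", 596_969_000)
```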
Raw Text Verification Detail
Each provision’s raw_text excerpt (~first 150 characters of the bill language) is checked as a substring of the source text using four-tier matching.
Tier results across all example data
| Tier | Method | Count | Percentage | What It Catches |
|---|---|---|---|---|
| Exact | Byte-identical substring | 2,392 | 95.6% | Clean, faithful extractions |
| Normalized | After collapsing whitespace and normalizing typographic quotes and dashes to ASCII (e.g., — → -) | 71 | 2.8% | Unicode formatting differences from XML-to-text conversion |
| Spaceless | After removing all spaces | 0 | 0.0% | Word-joining artifacts (none in this data) |
| No Match | Not found at any tier | 38 | 1.5% | Paraphrased, truncated, or concatenated excerpts |
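A simplified sketch of the tiered check. The normalization rules below approximate what the table describes (whitespace collapsing, quote and dash normalization); the tool’s exact rule set may differ.

```python
import re

def normalize(text: str) -> str:
    """Collapse whitespace and map typographic quotes/dashes to ASCII."""
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u2014", "-").replace("\u2013", "-")
    return re.sub(r"\s+", " ", text).strip()

def match_tier(source_text: str, raw_text: str) -> str:
    """Return the first tier at which the excerpt is found in the source."""
    if raw_text in source_text:
        return "exact"
    if normalize(raw_text) in normalize(source_text):
        return "normalized"
    if raw_text.replace(" ", "") in source_text.replace(" ", ""):
        return "spaceless"
    return "no_match"
```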
Analysis of the 38 no-match provisions
All 38 “no match” provisions share a critical property: none of them carry dollar amounts. They are all non-dollar provisions — riders and mandatory spending extensions that amend existing statutes.
The typical pattern:
- Source text: Section 1886(d)(5)(G) of the Social Security Act (42 U.S.C. 1395ww(d)(5)(G)) is amended—
- LLM raw_text: Section 1886(d)(5)(G) of the Social Security Act (42 U.S.C. 1395ww(d)(5)(G)) is amended— (1) clause...
The LLM included text from the next line, creating a raw_text that doesn’t appear as a contiguous substring in the source. The statutory reference and substance are correct; the excerpt boundary is slightly off.
Implication: The 38 no-match provisions don’t undermine the tool’s financial accuracy — they affect only the provenance trail for non-dollar legislative provisions. Dollar amounts are verified independently through the amount checks, which show 0 not-found across all data.
Per-bill breakdown
| Bill | Exact | Normalized | Spaceless | No Match | Total |
|---|---|---|---|---|---|
| H.R. 4366 | 2,285 (96.7%) | 59 (2.5%) | 0 (0.0%) | 20 (0.8%) | 2,364 |
| H.R. 5860 | 102 (78.5%) | 12 (9.2%) | 0 (0.0%) | 16 (12.3%) | 130 |
| H.R. 9468 | 5 (71.4%) | 0 (0.0%) | 0 (0.0%) | 2 (28.6%) | 7 |
| Total | 2,392 (95.6%) | 71 (2.8%) | 0 (0.0%) | 38 (1.5%) | 2,501 |
Note: The detailed per-bill breakdown above covers the original three FY2024 example bills (H.R. 4366, H.R. 5860, H.R. 9468). The aggregate metrics at the top of this page reflect all 32 bills across FY2019–FY2026.
The omnibus has the highest exact match rate (96.7%), which makes sense — it’s the most straightforward appropriations text. The CR and supplemental have more statutory amendments (which are harder to quote exactly), contributing to their higher no-match rates.
Completeness (Coverage) Detail
Coverage measures what percentage of dollar-sign patterns in the source text were matched to at least one extracted provision’s text_as_written field.
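A rough sketch of how such a figure can be computed, assuming a simple dollar-amount regex and that text_as_written lives on each provision’s amount object; the tool’s own pattern and schema details may differ, so this would only approximate the numbers below.

```python
import re

# A simple dollar-amount pattern; the tool's own pattern may differ.
DOLLAR_PATTERN = re.compile(r"\$\d{1,3}(?:,\d{3})*")

def coverage(source_text: str, provisions: list[dict]) -> float:
    """Share of dollar-pattern occurrences in the source that match the
    text_as_written of at least one extracted provision."""
    extracted = {
        (p.get("amount") or {}).get("text_as_written")
        for p in provisions
    }
    extracted.discard(None)
    occurrences = DOLLAR_PATTERN.findall(source_text)
    if not occurrences:
        return 1.0
    accounted = sum(1 for s in occurrences if s in extracted)
    return accounted / len(occurrences)
```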
Per-bill coverage
| Bill | Dollar Patterns in Source | Accounted For | Coverage |
|---|---|---|---|
| H.R. 4366 | ~1,734 | ~1,634 | 94.2% |
| H.R. 5860 | ~36 | ~22 | 61.1% |
| H.R. 9468 | 2 | 2 | 100.0% |
Why coverage varies
H.R. 9468 (100%): The simplest bill — only 2 dollar amounts in the entire source text, both captured.
H.R. 4366 (94.2%): The ~100 unaccounted dollar strings are primarily:
- Statutory cross-references to other laws (dollar amounts cited for context, not new provisions)
- Loan guarantee face values (not budget authority)
- Old amounts being struck by amendments (“striking ‘$50,000’ and inserting ‘$75,000’”)
- Proviso sub-amounts that are part of a parent provision’s context
H.R. 5860 (61.1%): Continuing resolutions reference prior-year appropriations acts extensively. Those referenced acts contain many dollar amounts that appear in the CR’s text but are citations of prior-year levels, not new provisions. Only the 13 CR substitutions, 5 standalone appropriations, and a few limitations represent genuine new provisions with dollar amounts.
Why coverage < 100% doesn’t mean errors
Coverage below 100% means there are dollar strings in the source text that weren’t captured as provisions. For most of these, non-capture is the correct behavior:
- A statutory reference like “section 1241(a) ($500,000,000 for each fiscal year)” contains a dollar amount from another law — it’s not a new appropriation in this bill.
- A loan guarantee ceiling like “$3,500,000,000 for guaranteed farm ownership loans” is a loan volume limit, not budget authority.
- Amendment language like “striking ‘$50,000’” contains an old amount that’s being replaced — the replacement amount is the one that matters.
See What Coverage Means (and Doesn’t) for a comprehensive explanation with examples.
CR Substitution Verification
All 13 CR substitutions in H.R. 5860 are fully verified — both the new amount ($X in “substituting $X for $Y”) and the old amount ($Y) were found in the source bill text:
| # | Account | New Amount Verified? | Old Amount Verified? |
|---|---|---|---|
| 1 | Rural Housing Service—Rural Community Facilities | ✓ | ✓ |
| 2 | Rural Utilities Service—Rural Water and Waste Disposal | ✓ | ✓ |
| 3 | (section 521(d)(1) reference) | ✓ | ✓ |
| 4 | NSF—STEM Education | ✓ | ✓ |
| 5 | NOAA—Operations, Research, and Facilities | ✓ | ✓ |
| 6 | NSF—Research and Related Activities | ✓ | ✓ |
| 7 | State Dept—Diplomatic Programs | ✓ | ✓ |
| 8 | Bilateral Econ. Assistance—International Disaster Assistance | ✓ | ✓ |
| 9 | Bilateral Econ. Assistance—Migration and Refugee Assistance | ✓ | ✓ |
| 10 | Int’l Security Assistance—Narcotics Control | ✓ | ✓ |
| 11 | OPM—Salaries and Expenses | ✓ | ✓ |
| 12 | DOT—FAA Facilities and Equipment (#1) | ✓ | ✓ |
| 13 | DOT—FAA Facilities and Equipment (#2) | ✓ | ✓ |
26 of 26 dollar amounts verified (13 new + 13 old). This is the strongest verification possible for CR substitutions — both sides of every “substituting X for Y” pair are confirmed in the source text.
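The check itself reduces to a pair of substring tests per substitution. A minimal sketch, with hypothetical inputs (a list of (new, old) written-amount pairs):

```python
def verify_substitutions(source_text: str, pairs: list[tuple[str, str]]) -> int:
    """Count substitution pairs whose new and old amounts both appear verbatim."""
    return sum(
        1 for new_amt, old_amt in pairs
        if new_amt in source_text and old_amt in source_text
    )

# 13 pairs in, 13 verified out would reproduce the table above.
```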
Budget Authority Verification
Budget authority is computed deterministically from provisions — never from LLM-generated summaries.
The formula
```text
Budget Authority = sum of amount.value.dollars
  WHERE provision_type = "appropriation"
    AND amount.semantics = "new_budget_authority"
    AND detail_level NOT IN ("sub_allocation", "proviso_amount")
```
Detail level filtering
In H.R. 4366, the detail level distribution for appropriation-type provisions is:
| Detail Level | Count | Included in BA? |
|---|---|---|
| top_level | 483 | Yes |
| sub_allocation | 396 | No — breakdowns of parent accounts |
| line_item | 272 | Yes |
| proviso_amount | 65 | No — conditions, not independent appropriations |
Without the detail level filter, the budget authority sum would be $846,159,099,554 — approximately $22 million higher than the correct total of $846,137,099,554. The $22 million represents sub-allocations and proviso amounts correctly excluded from the total.
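A sketch that makes the $22 million difference concrete by summing with and without the detail-level filter, reusing the same field names as the reproducibility script later in this appendix; per the paragraph above, the unfiltered figure should come out roughly $22 million higher.

```python
import json

with open("data/118-hr4366/extraction.json") as f:
    provisions = json.load(f)["provisions"]

def ba_total(excluded_detail_levels: set) -> int:
    """Sum new budget authority, skipping the given detail levels."""
    total = 0
    for p in provisions:
        if p["provision_type"] != "appropriation":
            continue
        amt = p.get("amount") or {}
        if amt.get("semantics") != "new_budget_authority":
            continue
        val = amt.get("value") or {}
        if val.get("kind") != "specific":
            continue
        if p.get("detail_level", "") in excluded_detail_levels:
            continue
        total += val["dollars"]
    return total

with_filter = ba_total({"sub_allocation", "proviso_amount"})
without_filter = ba_total(set())
print(f"With filter:    ${with_filter:,}")     # expected: $846,137,099,554
print(f"Without filter: ${without_filter:,}")  # roughly $22 million higher
```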
Regression testing
The exact budget authority totals are hardcoded in the integration test suite:
```rust
let expected: Vec<(&str, i64, i64)> = vec![
    ("H.R. 4366", 846_137_099_554, 24_659_349_709),
    ("H.R. 5860", 16_000_000_000, 0),
    ("H.R. 9468", 2_882_482_000, 0),
];
```
Any change to the extraction data, provision parsing, or budget authority calculation that would alter these numbers is caught immediately by the budget_authority_totals_match_expected test. This is the tool’s primary financial integrity guard.
Independent reproducibility
The budget authority calculation can be independently reproduced in Python:
```python
import json

with open("data/118-hr4366/extraction.json") as f:
    data = json.load(f)

ba = 0
for p in data["provisions"]:
    if p["provision_type"] != "appropriation":
        continue
    amt = p.get("amount")
    if not amt or amt.get("semantics") != "new_budget_authority":
        continue
    val = amt.get("value", {})
    if val.get("kind") != "specific":
        continue
    dl = p.get("detail_level", "")
    if dl in ("sub_allocation", "proviso_amount"):
        continue
    ba += val["dollars"]

print(f"Budget Authority: ${ba:,.0f}")
# Output: Budget Authority: $846,137,099,554
```
This produces exactly the same number as the CLI. If the Python and Rust calculations ever disagree, something is wrong.
What These Metrics Do and Don’t Prove
What the metrics prove
| Claim | Evidence |
|---|---|
| Extracted dollar amounts are real | 0 of 1,522 dollar amounts not found in source text |
| Dollar parsing is consistent | 0 internal mismatches between text_as_written and parsed dollars |
| CR substitution pairs are complete | 26 of 26 amounts (13 new + 13 old) verified in source |
| Raw text excerpts are faithful | 95.6% byte-identical to source; the remainder match after normalization or carry no dollar amounts |
| Budget authority is deterministic | Computed from provisions, not LLM summaries; regression-tested; independently reproducible |
| Sub-allocations don’t double-count | Detail level filter excludes them; $22M difference confirms correct filtering |
What the metrics don’t prove
| Limitation | Why |
|---|---|
| Classification correctness | Verification can’t check whether a “rider” should really be a “limitation” — that’s LLM judgment |
| Attribution correctness for ambiguous amounts | When $5,000,000 appears 50 times, verification confirms the amount exists but can’t prove it’s attributed to the right account |
| Completeness of non-dollar provisions | The coverage metric only counts dollar strings; riders and directives without dollar amounts are not measured |
| Fiscal year correctness | The fiscal_year field is inferred by the LLM; verification doesn’t independently confirm it |
| Detail level correctness | If the LLM marks a sub-allocation as top_level, it would be incorrectly included in budget authority; this is not automatically detected per-provision |
The 95.6% exact match rate as attribution evidence
While verification cannot mathematically prove attribution (that a dollar amount is assigned to the correct account), the 95.6% exact raw text match rate provides strong indirect evidence:
- If the raw text excerpt is byte-identical to a passage in the source, and that passage mentions an account name and a dollar amount, the provision is almost certainly attributed correctly.
- The 38 provisions without text matches are all non-dollar provisions, so attribution is a non-issue for them.
- For the 725 ambiguous dollar amounts, the combination of a verified dollar amount and an exact raw text match narrows the attribution to the specific passage the raw text came from.
For high-stakes analysis, supplement the automated verification with manual spot-checks of critical provisions. See Verify Extraction Accuracy for the procedure.
Reproducing These Metrics
You can reproduce every metric in this appendix using the included example data:
```bash
# The full audit table
congress-approp audit --dir data

# Budget authority totals
congress-approp summary --dir data --format json

# Provision type counts
congress-approp search --dir data --format json | \
  jq 'group_by(.provision_type) | map({type: .[0].provision_type, count: length}) | sort_by(-.count)'

# CR substitution verification
congress-approp search --dir data/118-hr5860 --type cr_substitution --format json | jq length

# Detailed verification data
cat data/118-hr9468/verification.json | python3 -m json.tool | head -50
```
All of these commands work with no API keys against the included data/ directory.
How Metrics Change with Re-Extraction
Due to LLM non-determinism, re-extracting the same bill may produce slightly different metrics:
| Metric | Stability | Notes |
|---|---|---|
| Dollar amounts not found | Very stable (always 0) | Dollar verification is independent of classification |
| Budget authority total | Stable (within ±0.1%) | Small provision count changes rarely affect the aggregate |
| Provision count | Moderately stable (±1-3%) | The LLM may split or merge provisions differently |
| Raw text exact match rate | Moderately stable (±2%) | Different excerpt boundaries may shift a few provisions between tiers |
| Coverage | Moderately stable (±3%) | Depends on how many sub-amounts the LLM captures |
| Classification distribution | Less stable (±5%) | A provision may be classified as rider in one run and limitation in another |
The verification pipeline ensures that dollar amount accuracy is invariant across re-extractions — even if provision counts or classifications change, the verified amounts are always correct because they’re checked against the source text, not against the LLM’s internal state.
Next Steps
- Verify Extraction Accuracy — practical guide for running your own audit
- How Verification Works — technical details of the verification pipeline
- What Coverage Means (and Doesn’t) — understanding the completeness metric
- Included Example Bills — detailed profiles of each example bill