Accuracy Metrics

This appendix provides a comprehensive breakdown of every verification metric across the included dataset. These numbers are the empirical basis for the trust claims made throughout this documentation.

All verification metrics are deterministic — computed by code against the source bill text, with zero LLM involvement. TAS resolution metrics include both deterministic matching and LLM-verified results.

Aggregate Summary

| Metric | Value |
| --- | --- |
| Bills processed | 32 (across 4 congresses, 116th–119th) |
| Fiscal years covered | FY2019–FY2026 (8 years) |
| Total provisions extracted | 34,568 |
| Total budget authority | $21.5 trillion |
| Dollar amounts NOT found in source | 1 (0.005% — a multi-amount edge case in H.R. 2471) |
| Dollar amounts verified (unique match) | 10,468 (56.3%) |
| Dollar amounts ambiguous (multiple matches) | 8,115 (43.7%) |
| Source traceability (raw_text in source) | 34,568 / 34,568 (100.000%) |
| Source spans (byte-level provenance) | 34,568 / 34,568 (100%) |
| Raw text byte-identical to source | 33,276 (96.3%) |
| Raw text repaired by verify-text | 1,292 (3.7%) — deterministic, zero LLM calls |
| Raw text not found at any tier | 0 |
| TAS resolution (provisions mapped to FAS codes) | 6,645 / 6,685 (99.4%) |
| TAS deterministic matches | 3,731 (55.8%) — zero false positives |
| TAS LLM-resolved matches | 2,914 (43.6%) — 20/20 spot-check correct |
| TAS unresolved | 40 (0.6%) — edge cases: Postal, intelligence, FDIC |
| Authority registry accounts | 1,051 unique FAS codes |
| Cross-bill linked accounts | 937 (appear in 2+ bills) |
| Name variants tracked | 443 authorities with multiple names |
| Rename events detected | 40 (with fiscal year boundary) |
| Budget regression pins | 8 / 8 bills match expected totals |
| Total rescissions | $24,659,349,709 |
| Total net budget authority | $840,360,231,845 |

The single most important number: 99.995% of dollar amounts verified across 34,568 provisions from 32 bills. All but one extracted dollar amount (the single multi-amount edge case in H.R. 2471) was confirmed to exist in the source bill text.


Per-Bill Breakdown

H.R. 4366 — Consolidated Appropriations Act, 2024 (Omnibus)

| Category | Metric | Value |
| --- | --- | --- |
| Provisions | Total extracted | 2,364 |
| | Appropriations | 1,216 (51.4%) |
| | Limitations | 456 (19.3%) |
| | Riders | 285 (12.1%) |
| | Directives | 120 (5.1%) |
| | Other | 84 (3.6%) |
| | Rescissions | 78 (3.3%) |
| | Transfer authorities | 77 (3.3%) |
| | Mandatory spending extensions | 40 (1.7%) |
| | Directed spending | 8 (0.3%) |
| Dollar Amounts | Provisions with amounts | 1,485 |
| | Verified (unique position) | 762 |
| | Ambiguous (multiple positions) | 723 |
| | Not found | 0 |
| Raw Text | Exact match | 2,285 (96.7%) |
| | Normalized match | 59 (2.5%) |
| | Spaceless match | 0 (0.0%) |
| | No match | 20 (0.8%) |
| Completeness | Dollar patterns in source | ~1,734 |
| | Accounted for by provisions | ~1,634 |
| | Coverage | 94.2% |
| Budget Authority | Gross BA | $846,137,099,554 |
| | Rescissions | $24,659,349,709 |
| | Net BA | $821,477,749,845 |

Notes on H.R. 4366 metrics:

  • The 723 ambiguous dollar amounts reflect the high frequency of round numbers in a 1,500-page bill. The most common: $5,000,000 appears 50 times, $1,000,000 appears 45 times, and $10,000,000 appears 38 times in the source text.
  • The 20 “no match” raw text provisions are all non-dollar provisions — statutory amendments (riders and mandatory spending extensions) where the LLM slightly reformatted section references. No provision with a dollar amount has a raw text mismatch.
  • Coverage of 94.2% means 5.8% of dollar strings in the source text were not matched to a provision. These are primarily statutory cross-references, loan guarantee ceilings, struck amounts in amendments, and proviso sub-references that are correctly excluded from extraction. See What Coverage Means (and Doesn’t).

H.R. 5860 — Continuing Appropriations Act, 2024 (CR)

| Category | Metric | Value |
| --- | --- | --- |
| Provisions | Total extracted | 130 |
| | Riders | 49 (37.7%) |
| | Mandatory spending extensions | 44 (33.8%) |
| | CR substitutions | 13 (10.0%) |
| | Other | 12 (9.2%) |
| | Appropriations | 5 (3.8%) |
| | Limitations | 4 (3.1%) |
| | Directives | 2 (1.5%) |
| | CR baseline | 1 (0.8%) |
| Dollar Amounts | Provisions with amounts | 35 |
| | Verified (unique position) | 33 |
| | Ambiguous (multiple positions) | 2 |
| | Not found | 0 |
| CR Substitutions | Total pairs | 13 |
| | Both amounts verified | 13 (100%) |
| | Programs with cuts (negative delta) | 11 |
| | Programs with increases (positive delta) | 2 |
| | Largest cut | -$620,000,000 (Migration and Refugee Assistance) |
| | Largest increase | +$47,000,000 (FAA Facilities and Equipment) |
| Raw Text | Exact match | 102 (78.5%) |
| | Normalized match | 12 (9.2%) |
| | Spaceless match | 0 (0.0%) |
| | No match | 16 (12.3%) |
| Completeness | Dollar patterns in source | ~36 |
| | Accounted for by provisions | ~22 |
| | Coverage | 61.1% |
| Budget Authority | Gross BA | $16,000,000,000 |
| | Rescissions | $0 |
| | Net BA | $16,000,000,000 |

Notes on H.R. 5860 metrics:

  • The CR has a much higher proportion of non-spending provisions (riders and mandatory spending extensions) compared to an omnibus. Only 5 provisions are standalone appropriations — principally the $16 billion FEMA Disaster Relief Fund.
  • All 13 CR substitution pairs are fully verified: both the new amount ($X) and old amount ($Y) were found in the source text.
  • The 16 “no match” raw text provisions are riders and mandatory spending extensions that amend existing statutes. The LLM sometimes reformats section numbering in these provisions (e.g., adding a space after a closing parenthesis).
  • Coverage of 61.1% is expected for a continuing resolution. CRs reference prior-year appropriations acts extensively — those references contain dollar amounts that appear in the CR’s text but are contextual citations, not new provisions.

H.R. 9468 — Veterans Benefits Supplemental (Supplemental)

| Category | Metric | Value |
| --- | --- | --- |
| Provisions | Total extracted | 7 |
| | Directives | 3 (42.9%) |
| | Appropriations | 2 (28.6%) |
| | Riders | 2 (28.6%) |
| Dollar Amounts | Provisions with amounts | 2 |
| | Verified (unique position) | 2 |
| | Ambiguous (multiple positions) | 0 |
| | Not found | 0 |
| Raw Text | Exact match | 5 (71.4%) |
| | Normalized match | 0 (0.0%) |
| | Spaceless match | 0 (0.0%) |
| | No match | 2 (28.6%) |
| Completeness | Dollar patterns in source | 2 |
| | Accounted for by provisions | 2 |
| | Coverage | 100.0% |
| Budget Authority | Gross BA | $2,882,482,000 |
| | Rescissions | $0 |
| | Net BA | $2,882,482,000 |

Notes on H.R. 9468 metrics:

  • This is the simplest bill in the example data — only 2 dollar amounts in the entire source text, both uniquely verifiable.
  • Perfect coverage: every dollar string in the source is accounted for.
  • The 2 “no match” raw text provisions are the SEC. 103 directives (reporting requirements), where the LLM’s raw text excerpt was truncated and doesn’t appear as-is in the source. The content is correct; only the excerpt boundary is slightly off.
  • Both appropriations ($2,285,513,000 for Compensation and Pensions + $596,969,000 for Readjustment Benefits) are verified at unique positions — the strongest possible verification result.

Amount Verification Detail

The verification pipeline searches for each provision’s text_as_written dollar string (e.g., "$2,285,513,000") verbatim in the source bill text.

Three outcomes

| Status | Meaning | Count | Percentage |
| --- | --- | --- | --- |
| Verified | Dollar string found at exactly one position — unambiguous location | 797 | 52.4% |
| Ambiguous | Dollar string found at multiple positions — correct but can’t pin location | 725 | 47.6% |
| Not Found | Dollar string not found anywhere in source — possible hallucination | 0 | 0.0% |
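A minimal sketch of this three-way check, assuming the source bill text is loaded as a plain string (the function name and sample text here are illustrative, not the tool’s actual API):

```python
def classify_amount(source: str, dollar_string: str) -> str:
    """Classify a text_as_written dollar string against the source bill text."""
    count = source.count(dollar_string)
    if count == 0:
        return "not_found"   # amount absent from source -- possible hallucination
    if count == 1:
        return "verified"    # exactly one position -- unambiguous
    return "ambiguous"       # real amount, but the position can't be pinned

source = "... $2,285,513,000 for compensation ... $5,000,000 ... $5,000,000 ..."
print(classify_amount(source, "$2,285,513,000"))  # verified
print(classify_amount(source, "$5,000,000"))      # ambiguous
```

A production implementation would also need to avoid matching a shorter amount inside a longer one (e.g. $5,000,000 inside $15,000,000), for example by requiring a non-digit boundary on both sides of the match.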

Why ambiguous is so common

Round numbers appear frequently in appropriations bills. In H.R. 4366:

| Dollar String | Occurrences in Source |
| --- | --- |
| $5,000,000 | 50 |
| $1,000,000 | 45 |
| $10,000,000 | 38 |
| $15,000,000 | 27 |
| $3,000,000 | 25 |
| $500,000 | 24 |
| $50,000,000 | 20 |
| $30,000,000 | 19 |
| $2,000,000 | 19 |
| $25,000,000 | 16 |

When the tool finds $5,000,000 at 50 positions, it confirms the amount is real but can’t determine which of the 50 occurrences corresponds to this specific provision. That’s “ambiguous” — correct amount, uncertain location.

The 797 “verified” provisions have dollar amounts unique enough to appear exactly once in the entire bill — amounts like $10,643,713,000 (FBI Salaries and Expenses) or $33,266,226,000 (Child Nutrition Programs).

Internal consistency check

Beyond source text verification, the pipeline also checks that the parsed integer in amount.value.dollars is consistent with the text_as_written string. For example:

| text_as_written | Parsed dollars | Consistent? |
| --- | --- | --- |
| "$2,285,513,000" | 2285513000 | ✓ Yes |
| "$596,969,000" | 596969000 | ✓ Yes |

Across all 1,522 provisions with dollar amounts: 0 internal consistency mismatches.
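The consistency check reduces to a parse-and-compare; a sketch with hypothetical helper names (not the tool’s actual functions):

```python
def parse_dollars(text_as_written: str) -> int:
    """Parse a dollar string like "$2,285,513,000" into an integer dollar count."""
    return int(text_as_written.lstrip("$").replace(",", ""))

def is_consistent(text_as_written: str, parsed_dollars: int) -> bool:
    """True when the stored amount.value.dollars matches the written string."""
    return parse_dollars(text_as_written) == parsed_dollars

print(is_consistent("$2,285,513,000", 2_285_513_000))  # True
print(is_consistent("$596,969,000", 596_969_000))      # True
```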


Raw Text Verification Detail

Each provision’s raw_text excerpt (~first 150 characters of the bill language) is checked as a substring of the source text using four-tier matching.

Tier results across all example data

| Tier | Method | Count | Percentage | What It Catches |
| --- | --- | --- | --- | --- |
| Exact | Byte-identical substring | 2,392 | 95.6% | Clean, faithful extractions |
| Normalized | After collapsing whitespace and normalizing Unicode quotes and dashes to ASCII | 71 | 2.8% | Unicode formatting differences from XML-to-text conversion |
| Spaceless | After removing all spaces | 0 | 0.0% | Word-joining artifacts (none in this data) |
| No Match | Not found at any tier | 38 | 1.5% | Paraphrased, truncated, or concatenated excerpts |
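The tiers can be sketched as a cascade, assuming simple normalization rules (the tool’s actual normalization table may differ):

```python
import re

def normalize(s: str) -> str:
    """Collapse whitespace runs and fold curly quotes and long dashes to ASCII."""
    s = re.sub(r"\s+", " ", s).strip()
    for src, dst in {"\u201c": '"', "\u201d": '"', "\u2019": "'",
                     "\u2014": "-", "\u2013": "-"}.items():
        s = s.replace(src, dst)
    return s

def match_tier(source: str, raw_text: str) -> str:
    """Return the first tier at which the excerpt is found as a substring."""
    if raw_text in source:
        return "exact"
    if normalize(raw_text) in normalize(source):
        return "normalized"
    if raw_text.replace(" ", "") in source.replace(" ", ""):
        return "spaceless"
    return "no_match"

print(match_tier("the \u201cSecretary\u201d shall", 'the "Secretary" shall'))  # normalized
```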

Analysis of the 38 no-match provisions

All 38 “no match” provisions share a critical property: none of them carry dollar amounts. They are all non-dollar provisions — riders and mandatory spending extensions that amend existing statutes.

The typical pattern:

  • Source text: Section 1886(d)(5)(G) of the Social Security Act (42 U.S.C. 1395ww(d)(5)(G)) is amended—
  • LLM raw_text: Section 1886(d)(5)(G) of the Social Security Act (42 U.S.C. 1395ww(d)(5)(G)) is amended— (1) clause...

The LLM included text from the next line, creating a raw_text that doesn’t appear as a contiguous substring in the source. The statutory reference and substance are correct; the excerpt boundary is slightly off.

Implication: The 38 no-match provisions don’t undermine the tool’s financial accuracy — they affect only the provenance trail for non-dollar legislative provisions. Dollar amounts are verified independently through the amount checks, which show 0 not-found across all data.

Per-bill breakdown

| Bill | Exact | Normalized | Spaceless | No Match | Total |
| --- | --- | --- | --- | --- | --- |
| H.R. 4366 | 2,285 (96.7%) | 59 (2.5%) | 0 (0.0%) | 20 (0.8%) | 2,364 |
| H.R. 5860 | 102 (78.5%) | 12 (9.2%) | 0 (0.0%) | 16 (12.3%) | 130 |
| H.R. 9468 | 5 (71.4%) | 0 (0.0%) | 0 (0.0%) | 2 (28.6%) | 7 |
| Total | 2,392 (95.6%) | 71 (2.8%) | 0 (0.0%) | 38 (1.5%) | 2,501 |

Note: The detailed per-bill breakdown above covers the original three FY2024 example bills (H.R. 4366, H.R. 5860, H.R. 9468). The aggregate metrics at the top of this page reflect all 32 bills across FY2019–FY2026.

The omnibus has the highest exact match rate (96.7%), which makes sense — it’s the most straightforward appropriations text. The CR and supplemental have more statutory amendments (which are harder to quote exactly), contributing to their higher no-match rates.


Completeness (Coverage) Detail

Coverage measures what percentage of dollar-sign patterns in the source text were matched to at least one extracted provision’s text_as_written field.
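A sketch of the computation, assuming a simple regex for dollar patterns (the tool’s actual pattern is not documented here, so this is illustrative only):

```python
import re

# Assumed pattern: "$" followed by comma-grouped digits, e.g. "$16,000,000,000"
DOLLAR_PATTERN = re.compile(r"\$\d{1,3}(?:,\d{3})*")

def coverage(source: str, extracted: list[str]) -> float:
    """Fraction of dollar strings in the source matched by some text_as_written."""
    found = DOLLAR_PATTERN.findall(source)
    known = set(extracted)
    matched = sum(1 for s in found if s in known)
    return matched / len(found) if found else 1.0

source = "provides $16,000,000,000 for FEMA and cites $500,000,000 under prior law"
print(f"{coverage(source, ['$16,000,000,000']):.1%}")  # 50.0%
```

In this toy example the cited prior-year amount is correctly left unmatched, which is exactly the behavior described in the per-bill notes below 100% coverage.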

Per-bill coverage

| Bill | Dollar Patterns in Source | Accounted For | Coverage |
| --- | --- | --- | --- |
| H.R. 4366 | ~1,734 | ~1,634 | 94.2% |
| H.R. 5860 | ~36 | ~22 | 61.1% |
| H.R. 9468 | 2 | 2 | 100.0% |

Why coverage varies

H.R. 9468 (100%): The simplest bill — only 2 dollar amounts in the entire source text, both captured.

H.R. 4366 (94.2%): The ~100 unaccounted dollar strings are primarily:

  • Statutory cross-references to other laws (dollar amounts cited for context, not new provisions)
  • Loan guarantee face values (not budget authority)
  • Old amounts being struck by amendments (“striking ‘$50,000’ and inserting ‘$75,000’”)
  • Proviso sub-amounts that are part of a parent provision’s context

H.R. 5860 (61.1%): Continuing resolutions reference prior-year appropriations acts extensively. Those referenced acts contain many dollar amounts that appear in the CR’s text but are citations of prior-year levels, not new provisions. Only the 13 CR substitutions, 5 standalone appropriations, and a few limitations represent genuine new provisions with dollar amounts.

Why coverage < 100% doesn’t mean errors

Coverage below 100% means there are dollar strings in the source text that weren’t captured as provisions. For most of these, non-capture is the correct behavior:

  • A statutory reference like “section 1241(a) ($500,000,000 for each fiscal year)” contains a dollar amount from another law — it’s not a new appropriation in this bill.
  • A loan guarantee ceiling like “$3,500,000,000 for guaranteed farm ownership loans” is a loan volume limit, not budget authority.
  • Amendment language like “striking ‘$50,000’” contains an old amount that’s being replaced — the replacement amount is the one that matters.

See What Coverage Means (and Doesn’t) for a comprehensive explanation with examples.


CR Substitution Verification

All 13 CR substitutions in H.R. 5860 are fully verified — both the new amount ($X in “substituting $X for $Y”) and the old amount ($Y) were found in the source bill text:

| # | Account | New Amount Verified? | Old Amount Verified? |
| --- | --- | --- | --- |
| 1 | Rural Housing Service—Rural Community Facilities | ✓ | ✓ |
| 2 | Rural Utilities Service—Rural Water and Waste Disposal | ✓ | ✓ |
| 3 | (section 521(d)(1) reference) | ✓ | ✓ |
| 4 | NSF—STEM Education | ✓ | ✓ |
| 5 | NOAA—Operations, Research, and Facilities | ✓ | ✓ |
| 6 | NSF—Research and Related Activities | ✓ | ✓ |
| 7 | State Dept—Diplomatic Programs | ✓ | ✓ |
| 8 | Bilateral Econ. Assistance—International Disaster Assistance | ✓ | ✓ |
| 9 | Bilateral Econ. Assistance—Migration and Refugee Assistance | ✓ | ✓ |
| 10 | Int’l Security Assistance—Narcotics Control | ✓ | ✓ |
| 11 | OPM—Salaries and Expenses | ✓ | ✓ |
| 12 | DOT—FAA Facilities and Equipment (#1) | ✓ | ✓ |
| 13 | DOT—FAA Facilities and Equipment (#2) | ✓ | ✓ |

26 of 26 dollar amounts verified (13 new + 13 old). This is the strongest verification possible for CR substitutions — both sides of every “substituting X for Y” pair are confirmed in the source text.
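Each pair check reduces to two substring searches plus a signed delta; a sketch with hypothetical amounts (the actual per-pair dollar figures are not reproduced in the table above):

```python
def verify_substitution(source: str, new_amount: str, old_amount: str) -> bool:
    """Both sides of 'substituting $X for $Y' must appear verbatim in the source."""
    return new_amount in source and old_amount in source

def substitution_delta(new_amount: str, old_amount: str) -> int:
    """Negative delta = cut from the prior-year level; positive = increase."""
    def to_int(s: str) -> int:
        return int(s.lstrip("$").replace(",", ""))
    return to_int(new_amount) - to_int(old_amount)

cr_text = "substituting $25,000,000 for $40,000,000"  # hypothetical CR language
print(verify_substitution(cr_text, "$25,000,000", "$40,000,000"))  # True
print(substitution_delta("$25,000,000", "$40,000,000"))            # -15000000 (a cut)
```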


Budget Authority Verification

Budget authority is computed deterministically from provisions — never from LLM-generated summaries.

The formula

```
Budget Authority = sum of amount.value.dollars
    WHERE provision_type = "appropriation"
    AND   amount.semantics = "new_budget_authority"
    AND   detail_level NOT IN ("sub_allocation", "proviso_amount")
```

Detail level filtering

In H.R. 4366, the detail level distribution for appropriation-type provisions is:

| Detail Level | Count | Included in BA? |
| --- | --- | --- |
| top_level | 483 | Yes |
| sub_allocation | 396 | No — breakdowns of parent accounts |
| line_item | 272 | Yes |
| proviso_amount | 65 | No — conditions, not independent appropriations |

Without the detail level filter, the budget authority sum would be $846,159,099,554 — approximately $22 million higher than the correct total of $846,137,099,554. The $22 million represents sub-allocations and proviso amounts correctly excluded from the total.

Regression testing

The exact budget authority totals are hardcoded in the integration test suite:

```rust
let expected: Vec<(&str, i64, i64)> = vec![
    ("H.R. 4366", 846_137_099_554, 24_659_349_709),
    ("H.R. 5860", 16_000_000_000, 0),
    ("H.R. 9468", 2_882_482_000, 0),
];
```

Any change to the extraction data, provision parsing, or budget authority calculation that would alter these numbers is caught immediately by the budget_authority_totals_match_expected test. This is the tool’s primary financial integrity guard.

Independent reproducibility

The budget authority calculation can be independently reproduced in Python:

```python
import json

with open("data/118-hr4366/extraction.json") as f:
    data = json.load(f)

ba = 0
for p in data["provisions"]:
    if p["provision_type"] != "appropriation":
        continue
    amt = p.get("amount")
    if not amt or amt.get("semantics") != "new_budget_authority":
        continue
    val = amt.get("value", {})
    if val.get("kind") != "specific":
        continue
    dl = p.get("detail_level", "")
    if dl in ("sub_allocation", "proviso_amount"):
        continue
    ba += val["dollars"]

print(f"Budget Authority: ${ba:,.0f}")
# Output: Budget Authority: $846,137,099,554
```

This produces exactly the same number as the CLI. If the Python and Rust calculations ever disagree, something is wrong.


What These Metrics Do and Don’t Prove

What the metrics prove

| Claim | Evidence |
| --- | --- |
| Extracted dollar amounts are real | 0 of 1,522 dollar amounts not found in source text |
| Dollar parsing is consistent | 0 internal mismatches between text_as_written and parsed dollars |
| CR substitution pairs are complete | 26 of 26 amounts (13 new + 13 old) verified in source |
| Raw text excerpts are faithful | 95.6% byte-identical to source; the 38 no-match provisions carry no dollar amounts |
| Budget authority is deterministic | Computed from provisions, not LLM summaries; regression-tested; independently reproducible |
| Sub-allocations don’t double-count | Detail level filter excludes them; $22M difference confirms correct filtering |

What the metrics don’t prove

| Limitation | Why |
| --- | --- |
| Classification correctness | Verification can’t check whether a “rider” should really be a “limitation” — that’s LLM judgment |
| Attribution correctness for ambiguous amounts | When $5,000,000 appears 50 times, verification confirms the amount exists but can’t prove it’s attributed to the right account |
| Completeness of non-dollar provisions | The coverage metric only counts dollar strings; riders and directives without dollar amounts are not measured |
| Fiscal year correctness | The fiscal_year field is inferred by the LLM; verification doesn’t independently confirm it |
| Detail level correctness | If the LLM marks a sub-allocation as top_level, it would be incorrectly included in budget authority; this is not automatically detected per-provision |

The 95.6% exact match rate as attribution evidence

While verification cannot mathematically prove attribution (that a dollar amount is assigned to the correct account), the 95.6% exact raw text match rate provides strong indirect evidence:

  • If the raw text excerpt is byte-identical to a passage in the source, and that passage mentions an account name and a dollar amount, the provision is almost certainly attributed correctly.
  • The 38 provisions without text matches are all non-dollar provisions, so attribution is a non-issue for them.
  • For the 725 ambiguous dollar amounts, the combination of a verified dollar amount and an exact raw text match narrows the attribution to the specific passage the raw text came from.

For high-stakes analysis, supplement the automated verification with manual spot-checks of critical provisions. See Verify Extraction Accuracy for the procedure.


Reproducing These Metrics

You can reproduce every metric in this appendix using the included example data:

```sh
# The full audit table
congress-approp audit --dir data

# Budget authority totals
congress-approp summary --dir data --format json

# Provision type counts
congress-approp search --dir data --format json | \
  jq 'group_by(.provision_type) | map({type: .[0].provision_type, count: length}) | sort_by(-.count)'

# CR substitution verification
congress-approp search --dir data/118-hr5860 --type cr_substitution --format json | jq length

# Detailed verification data
cat data/118-hr9468/verification.json | python3 -m json.tool | head -50
```

All of these commands work with no API keys against the included data/ directory.


How Metrics Change with Re-Extraction

Due to LLM non-determinism, re-extracting the same bill may produce slightly different metrics:

| Metric | Stability | Notes |
| --- | --- | --- |
| Dollar amounts not found | Very stable (always 0) | Dollar verification is independent of classification |
| Budget authority total | Stable (within ±0.1%) | Small provision count changes rarely affect the aggregate |
| Provision count | Moderately stable (±1–3%) | The LLM may split or merge provisions differently |
| Raw text exact match rate | Moderately stable (±2%) | Different excerpt boundaries may shift a few provisions between tiers |
| Coverage | Moderately stable (±3%) | Depends on how many sub-amounts the LLM captures |
| Classification distribution | Less stable (±5%) | A provision may be classified as rider in one run and limitation in another |

The verification pipeline ensures that dollar amount accuracy is invariant across re-extractions — even if provision counts or classifications change, the verified amounts are always correct because they’re checked against the source text, not against the LLM’s internal state.


Next Steps