Enrich Bills with Metadata

The enrich command generates bill_meta.json for each bill directory, enabling fiscal year filtering, subcommittee scoping, and advance appropriation classification. Unlike extraction (which requires an Anthropic API key) or embedding (which requires an OpenAI API key), enrichment runs entirely offline.

Quick Start

# Enrich all bills in the data directory
congress-approp enrich --dir data

This creates a bill_meta.json file in each bill directory. You only need to run it once per bill — the tool skips bills that already have metadata unless you pass --force.

What It Enables

After enriching, you can use these filtering options on summary, search, and compare:

# See only FY2026 bills
congress-approp summary --dir data --fy 2026

# Search within a specific subcommittee
congress-approp search --dir data --type appropriation --fy 2026 --subcommittee thud

# Combine semantic search with FY and subcommittee filtering
congress-approp search --dir data --semantic "housing assistance" --fy 2026 --subcommittee thud --top 5

# Compare THUD funding across fiscal years
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data

Note: The --fy flag works without enrich — it uses the fiscal year data already in extraction.json. But --subcommittee requires the division-to-jurisdiction mapping that only enrich provides.

Note on embeddings: Semantic search (the --semantic flag) requires embedding vectors. If you cloned the git repository, pre-generated vectors.bin files are included for all example bills. If you installed via cargo install, the embedding files are not included (they exceed the crates.io size limit) — run congress-approp embed --dir data to generate them (~30 seconds per bill, requires OPENAI_API_KEY). The enrich command itself does not require embeddings and does not use any API keys.

What It Generates

The enrich command creates a bill_meta.json file in each bill directory containing five categories of metadata:

Subcommittee Mappings

Each division in an omnibus or minibus bill gets mapped to a canonical jurisdiction. The tool parses division titles directly from the enrolled bill XML and classifies them using pattern matching:

| Division | Title (from XML) | Jurisdiction |
|---|---|---|
| A | Department of Defense Appropriations Act, 2026 | defense |
| B | Departments of Labor, Health and Human Services… | labor-hhs |
| D | Transportation, Housing and Urban Development… | thud |
| G | Other Matters | other |

This solves the problem where Division A means Defense in one bill but CJS in another — the --subcommittee flag uses the canonical jurisdiction, not the letter.

Available subcommittee slugs for --subcommittee:

| Slug | Jurisdiction |
|---|---|
| defense | Department of Defense |
| labor-hhs | Labor, Health and Human Services, Education |
| thud | Transportation, Housing and Urban Development |
| financial-services | Financial Services and General Government |
| cjs | Commerce, Justice, Science |
| energy-water | Energy and Water Development |
| interior | Interior, Environment |
| agriculture | Agriculture, Rural Development |
| legislative-branch | Legislative Branch |
| milcon-va | Military Construction, Veterans Affairs |
| state-foreign-ops | State, Foreign Operations |
| homeland-security | Homeland Security |
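As a rough illustration of title-based classification, the sketch below maps a division title to one of the slugs above. The keyword patterns and the `classify_division` name are assumptions made for this example; the tool's real patterns are not shown in this guide and may differ.

```python
import re

# Hypothetical keyword patterns, checked in order. More specific
# patterns come first so that broader ones do not shadow them.
JURISDICTION_PATTERNS = [
    ("milcon-va", r"military construction"),
    ("defense", r"department of defense"),
    ("labor-hhs", r"labor.*health and human services"),
    ("thud", r"transportation.*housing and urban development"),
    ("financial-services", r"financial services"),
    ("cjs", r"commerce.*justice.*science"),
    ("energy-water", r"energy and water"),
    ("interior", r"interior"),
    ("agriculture", r"agriculture"),
    ("legislative-branch", r"legislative branch"),
    ("state-foreign-ops", r"state.*foreign operations"),
    ("homeland-security", r"homeland security"),
]


def classify_division(title: str) -> str:
    """Return the first slug whose pattern matches the title, else 'other'."""
    lowered = title.lower()
    for slug, pattern in JURISDICTION_PATTERNS:
        if re.search(pattern, lowered):
            return slug
    return "other"


print(classify_division("Department of Defense Appropriations Act, 2026"))
print(classify_division("Other Matters"))
```

Because the result depends only on the title text, the same title yields the same slug regardless of which division letter it carries in a given bill.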

Advance Appropriation Classification

Each budget authority provision is classified as:

  • current_year — money available in the fiscal year the bill funds
  • advance — money enacted now but available in a future fiscal year
  • supplemental — additional emergency or supplemental funding
  • unknown — a future fiscal year is referenced but no known pattern was matched

The classification uses a fiscal-year-aware algorithm:

  1. Extract “October 1, YYYY” from the provision’s availability text — this means funds available starting fiscal year YYYY+1
  2. Extract “first quarter of fiscal year YYYY” — this means funds for FY YYYY
  3. Compare the availability year to the bill’s fiscal year
  4. If the availability year is later than the bill’s fiscal year → advance
  5. If the availability year equals the bill’s fiscal year → current_year (start of the funded FY)
  6. Check provision notes for “supplemental” → supplemental
  7. Default to current_year

This correctly handles cases like:

  • H.R. 4366 (FY2024): VA Compensation and Pensions “available October 1, 2024” → advance for FY2025 ($182 billion)
  • H.R. 7148 (FY2026): Medicaid “for the first quarter of fiscal year 2027” → advance for FY2027 ($316 billion)
  • H.R. 7148 (FY2026): Tenant-Based Rental Assistance “available October 1, 2026” → advance for FY2027 ($4 billion)

Across the 13-bill dataset, the algorithm identifies $1.49 trillion in advance appropriations, approximately 24% of total budget authority. Failing to separate advance funding from current-year funding can skew year-over-year comparisons by hundreds of billions of dollars.
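The numbered steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the function names, argument names, and regular expressions are assumptions, and the `unknown` outcome is left out because the numbered steps do not describe when it is assigned.

```python
import re
from typing import Optional


def availability_fy(text: str) -> Optional[int]:
    """Return the fiscal year in which funds become available, if a known pattern matches."""
    match = re.search(r"October 1, (\d{4})", text)
    if match:
        # October 1 of calendar year YYYY is the first day of fiscal year YYYY+1.
        return int(match.group(1)) + 1
    match = re.search(r"first quarter of fiscal year (\d{4})", text)
    if match:
        return int(match.group(1))
    return None


def classify_timing(availability_text: str, notes: str, bill_fy: int) -> str:
    """Apply the fiscal-year comparison first, then the supplemental check, then the default."""
    fy = availability_fy(availability_text)
    if fy is not None:
        if fy > bill_fy:
            return "advance"
        if fy == bill_fy:
            return "current_year"
    if "supplemental" in notes.lower():
        return "supplemental"
    return "current_year"


# The three cases listed above, using only the quoted availability phrases.
print(classify_timing("available October 1, 2024", "", 2024))
print(classify_timing("for the first quarter of fiscal year 2027", "", 2026))
print(classify_timing("available October 1, 2026", "", 2026))
```

All three calls return `advance`, matching the classifications listed above. Note that "October 1, 2025" in an FY2026 bill would come back as `current_year`, since that date is the start of the funded fiscal year.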

Bill Nature

The enriched bill classification provides finer distinctions than the original LLM classification:

| Original Classification | Enriched Bill Nature | Reason |
|---|---|---|
| continuing_resolution | full_year_cr_with_appropriations | H.R. 1968 has 260 appropriations + a CR baseline; it’s a hybrid containing $1.786 trillion in full-year appropriations |
| omnibus | minibus | H.R. 5371 covers only 3 subcommittees (Agriculture, Legislative Branch, MilCon-VA) |
| supplemental_appropriations | supplemental | H.R. 815 is normalized to the canonical enum value |

The classification uses the distribution of provision types and the subcommittee count. Five or more real subcommittees means omnibus; two to four means minibus; a CR baseline plus many appropriations, without multiple subcommittees, means a full-year CR with appropriations.
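The decision rule can be written as a short function. The sketch below is an assumption-laden illustration: the function and parameter names are invented for this example, and `many_threshold` is a hypothetical cutoff, since this guide does not say how many appropriations count as "many".

```python
# Normalization of legacy labels to canonical values, per the table above.
CANONICAL_NATURE = {"supplemental_appropriations": "supplemental"}


def refine_nature(
    original: str,
    real_subcommittees: int,
    has_cr_baseline: bool,
    appropriation_count: int,
    many_threshold: int = 100,  # hypothetical cutoff for "many appropriations"
) -> str:
    """Refine the original classification using subcommittee count and provision mix."""
    if real_subcommittees >= 5:
        return "omnibus"
    if real_subcommittees >= 2:
        return "minibus"
    if has_cr_baseline and appropriation_count >= many_threshold:
        return "full_year_cr_with_appropriations"
    # No rule applies: keep the original label, normalized if needed.
    return CANONICAL_NATURE.get(original, original)


print(refine_nature("omnibus", 3, False, 0))
print(refine_nature("continuing_resolution", 1, True, 260))
```

The first call prints `minibus` and the second prints `full_year_cr_with_appropriations`. The input values are illustrative and are not taken from the actual bill data.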

Canonical Account Names

Every account name is normalized for cross-bill matching:

| Original | Canonical |
|---|---|
| Grants-In-Aid for Airports | grants-in-aid for airports |
| Grants-in-Aid for Airports | grants-in-aid for airports |
| Grants-in-aid for Airports | grants-in-aid for airports |
| Department of VA—Compensation and Pensions | compensation and pensions |

Normalization lowercases the name, strips any hierarchical prefix that ends in an em dash or en dash, and trims whitespace. This eliminates false orphans in compare caused by capitalization differences and hierarchical naming conventions.
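A minimal sketch of this normalization follows. The function name is hypothetical, and the choice to keep the text after the last dash (rather than the first) is an assumption; the examples in the table have only one dash, so either reading gives the same result for them.

```python
import re


def canonical_account_name(name: str) -> str:
    """Lowercase the name, drop any prefix ending in an em dash or en dash, trim whitespace."""
    # Split on em dash (U+2014) or en dash (U+2013) and keep the final segment.
    # Ordinary hyphens, as in "Grants-in-Aid", are left untouched.
    tail = re.split(r"[\u2014\u2013]", name)[-1]
    return tail.strip().lower()


print(canonical_account_name("Grants-In-Aid for Airports"))
print(canonical_account_name("Department of VA\u2014Compensation and Pensions"))
```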

Classification Provenance

Every classification in bill_meta.json records how it was determined:

{
  "timing": "advance",
  "available_fy": 2027,
  "source": {
    "type": "fiscal_year_comparison",
    "availability_fy": 2027,
    "bill_fy": 2026
  }
}

This means: “classified as advance because the money becomes available in FY2027 but the bill covers FY2026.” Provenance types include xml_structure, pattern_match, fiscal_year_comparison, note_text, and default_rule.
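To show how a provenance record might be consumed, the sketch below turns the record above into the plain-language explanation. It assumes only the fields shown in the example; the `explain` helper is hypothetical, and the overall layout of bill_meta.json beyond this one record is not assumed.

```python
import json

# The record from the example above.
record = json.loads("""
{
  "timing": "advance",
  "available_fy": 2027,
  "source": {
    "type": "fiscal_year_comparison",
    "availability_fy": 2027,
    "bill_fy": 2026
  }
}
""")


def explain(record: dict) -> str:
    """Render a provenance record as a one-line explanation."""
    source = record["source"]
    if source["type"] == "fiscal_year_comparison":
        return (
            f"classified as {record['timing']} because the money becomes available "
            f"in FY{source['availability_fy']} but the bill covers FY{source['bill_fy']}"
        )
    # Other provenance types carry fields not shown in this guide.
    return f"classified as {record['timing']} via {source['type']}"


print(explain(record))
```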

When to Re-Enrich

The tool automatically detects when bill_meta.json is stale — when extraction.json has changed since enrichment. You will see a warning:

⚠ H.R. 7148: bill metadata is stale (extraction.json has changed). Run `enrich --force`.

Run enrich --force to regenerate metadata for all bills.
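This guide does not say how staleness is detected. One common approach, shown here purely as an assumption, is to store a hash of extraction.json in the metadata at enrichment time and compare it later. The `extraction_sha256` field name and both function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def is_stale(bill_dir: Path) -> bool:
    """Report whether extraction.json has changed since the metadata was written."""
    meta = json.loads((bill_dir / "bill_meta.json").read_text())
    # "extraction_sha256" is a hypothetical field used for illustration only.
    recorded = meta.get("extraction_sha256")
    return recorded != sha256_of(bill_dir / "extraction.json")
```

A content hash has the advantage of ignoring file timestamp changes, such as those caused by a fresh git checkout, that do not alter the extraction itself.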

Flags

| Flag | Description |
|---|---|
| --dir <DIR> | Data directory [default: ./data] |
| --dry-run | Show what would be generated without writing files |
| --force | Re-enrich even if bill_meta.json already exists |

Previewing Before Writing

Use --dry-run to see what the enrich command would produce without writing any files:

congress-approp enrich --dir data --dry-run
  would enrich H.R. 1968: nature=FullYearCrWithAppropriations, 3 divisions, 192 BA provisions (8 advance, 3 supplemental)
  would enrich H.R. 4366: nature=Omnibus, 7 divisions, 511 BA provisions (11 advance, 4 supplemental)
  would enrich H.R. 7148: nature=Omnibus, 9 divisions, 505 BA provisions (11 advance, 4 supplemental)
  ...

Using with Compare

The compare command benefits most from enrichment. Without enrich, comparing two omnibus bills that cover different subcommittees produces hundreds of false orphans. With enrichment and --subcommittee scoping:

# Before: 759 orphans (mixing Defense with Agriculture)
congress-approp compare --base data/118-hr4366 --current data/118-hr7148

# After: 43 meaningful changes, 12 unchanged
congress-approp compare --base-fy 2024 --current-fy 2026 --subcommittee thud --dir data

The --base-fy and --current-fy flags automatically select the right bills for each fiscal year and the --subcommittee flag scopes to the correct division in each bill.

Known Limitations

  • Sub-agency mismatches — the LLM sometimes uses sub-agency names (e.g., “Maritime Administration”) in one bill and parent department names (e.g., “Department of Transportation”) in another. The compare command includes a 35-entry sub-agency-to-parent-department lookup table that resolves most of these, but some agency naming inconsistencies (~5-15 orphans per subcommittee) may remain for agencies not in the table.
  • 17 supplemental policy division titles (e.g., “FEND Off Fentanyl Act”, “Protecting Americans from Foreign Adversary Controlled Applications Act”) are classified as other jurisdiction by default. These are from just two bills (H.R. 815 and S. 870) and don’t affect regular appropriations bill analysis.
  • Advance detection patterns cover “October 1, YYYY” and “first quarter of fiscal year YYYY.” If Congress uses novel phrasing in future bills, those provisions would not be recognized as advance and would default to current_year. When a provision references a future fiscal year but matches no known advance pattern, the tool classifies it as unknown and logs a warning.
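The sub-agency lookup mentioned in the first limitation can be pictured as a simple dictionary. The sketch below contains only the single pair named in that example; the tool's actual table has 35 entries that are not listed in this guide, and the `resolve_agency` name is hypothetical.

```python
# Illustrative table with one entry taken from the example above.
PARENT_DEPARTMENT = {
    "maritime administration": "department of transportation",
}


def resolve_agency(name: str) -> str:
    """Map a sub-agency to its parent department; names not in the table pass through."""
    key = name.strip().lower()
    return PARENT_DEPARTMENT.get(key, key)


print(resolve_agency("Maritime Administration"))
print(resolve_agency("Department of Transportation"))
```

Both calls resolve to the same department name, so the two bills' provisions can be matched. An agency missing from the table keeps its own name, which is how the residual orphans described above would arise.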