Data Directory Layout

Complete reference for the file and directory structure used by congress-approp. Every bill lives in its own directory. Files are discovered by recursively walking from whatever --dir path you provide, looking for extraction.json as the anchor file.

Directory Structure

data/                              ← any --dir path works
├── hr4366/                        ← bill directory (FY2024 omnibus)
│   ├── BILLS-118hr4366enr.xml     ← source XML from Congress.gov
│   ├── extraction.json            ← structured provisions (REQUIRED — anchor file)
│   ├── verification.json          ← deterministic verification report
│   ├── metadata.json              ← extraction provenance (model, hashes, timestamps)
│   ├── tokens.json                ← LLM token usage from extraction
│   ├── bill_meta.json             ← bill metadata: FY, jurisdictions, advance classification (enrich)
│   ├── embeddings.json            ← embedding metadata (model, dimensions, hashes)
│   ├── vectors.bin                ← raw float32 embedding vectors
│   └── chunks/                    ← per-chunk LLM artifacts (gitignored)
│       ├── 01JRWN9T5RR0JTQ6C9FYYE96A8.json
│       ├── 01JRWNA2B3C4D5E6F7G8H9J0K1.json
│       └── ...
├── hr5860/                        ← bill directory (FY2024 CR)
│   ├── BILLS-118hr5860enr.xml
│   ├── extraction.json
│   ├── verification.json
│   ├── metadata.json
│   ├── tokens.json
│   ├── embeddings.json
│   ├── vectors.bin
│   └── chunks/
└── hr9468/                        ← bill directory (VA supplemental)
    ├── BILLS-118hr9468enr.xml
    ├── extraction.json
    ├── verification.json
    ├── metadata.json
    ├── embeddings.json
    ├── vectors.bin
    └── chunks/

File Reference

| File | Required? | Written By | Read By | Mutable? | Size (Omnibus) |
|---|---|---|---|---|---|
| BILLS-*.xml | For extraction | download | extract, upgrade, enrich | Never | ~1.8 MB |
| extraction.json | Yes (anchor) | extract, upgrade | All query commands | Only by re-extract or upgrade | ~12 MB |
| verification.json | No | extract, upgrade | audit, search (for quality fields) | Only by re-extract or upgrade | ~2 MB |
| metadata.json | No | extract | Staleness detection | Only by re-extract | ~300 bytes |
| tokens.json | No | extract | Informational only | Never | ~200 bytes |
| bill_meta.json | No | enrich | --subcommittee filtering, staleness detection | Only by re-enrich | ~5 KB |
| embeddings.json | No | embed | Semantic search, staleness detection | Only by re-embed | ~230 bytes |
| vectors.bin | No | embed | search --semantic, search --similar | Only by re-embed | ~29 MB |
| chunks/*.json | No | extract | Debugging and analysis only | Never | Varies |

Which files are required?

Only extraction.json is required. The loader (loading.rs) walks recursively from the --dir path, finds every file named extraction.json, and treats each one as a bill directory. Everything else is optional:

  • Without verification.json: The audit command won’t work, and search results won’t include amount_status, match_tier, or quality fields.
  • Without metadata.json: Staleness detection for the source XML link is unavailable.
  • Without BILLS-*.xml: Extraction, upgrade, and enrich can’t run (they need the source XML). Query commands work fine.
  • Without bill_meta.json: The --subcommittee flag is unavailable. The --fy flag still works (it uses fiscal year data from extraction.json). Run congress-approp enrich to generate this file — no API keys required.
  • Without embeddings.json + vectors.bin: --semantic and --similar searches are unavailable. If you cloned the git repository, these files are included for the example data. If you installed via cargo install, run congress-approp embed --dir data to generate them (~30 seconds per bill, requires OPENAI_API_KEY).
  • Without tokens.json: No impact on any operation.
  • Without chunks/: No impact on any operation (these are local provenance records).
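The rules above can be summarized as a small capability check. The following is a minimal Python sketch, not part of congress-approp: the function name `available_features` and the keys of the returned dictionary are hypothetical, and the check only looks at which files exist, not whether their contents are valid.

```python
from pathlib import Path


def available_features(bill_dir):
    """Report which operations a bill directory can support, based on file presence.

    Hypothetical helper: the feature names are illustrative, not tool output.
    """
    d = Path(bill_dir)

    def has(name):
        return (d / name).is_file()

    return {
        # extraction.json is the only required file for query commands.
        "query": has("extraction.json"),
        # audit and the quality fields in search results need verification.json.
        "audit": has("verification.json"),
        # metadata.json enables staleness detection for the source XML link.
        "source_staleness_check": has("metadata.json"),
        # extract, upgrade, and enrich need the source XML.
        "extract_upgrade_enrich": any(d.glob("BILLS-*.xml")),
        # --subcommittee needs bill_meta.json.
        "subcommittee_filter": has("bill_meta.json"),
        # --semantic and --similar need both embedding files.
        "semantic_search": has("embeddings.json") and has("vectors.bin"),
    }
```

Calling `available_features("data/hr9468")` on a directory that holds only extraction.json would report `query` as available and everything else as unavailable.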

File Descriptions

BILLS-*.xml

The enrolled bill XML downloaded from Congress.gov. The filename follows the GPO convention:

BILLS-{congress}{type}{number}enr.xml

Examples:

  • BILLS-118hr4366enr.xml — H.R. 4366, 118th Congress, enrolled version
  • BILLS-118hr5860enr.xml — H.R. 5860, 118th Congress, enrolled version
  • BILLS-118hr9468enr.xml — H.R. 9468, 118th Congress, enrolled version
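As an illustration of the naming convention, a filename can be split into its parts with a regular expression. This is a sketch under stated assumptions: the helper `parse_bill_filename` is hypothetical, the bill type is assumed to be lowercase letters (as in `hr`), and only the enrolled (`enr`) suffix shown above is handled.

```python
import re

# BILLS-{congress}{type}{number}enr.xml, for example BILLS-118hr4366enr.xml
BILL_XML = re.compile(
    r"^BILLS-(?P<congress>\d+)(?P<type>[a-z]+)(?P<number>\d+)enr\.xml$"
)


def parse_bill_filename(name):
    """Return congress, type, and number for an enrolled bill XML filename.

    Returns None when the name does not follow the convention.
    """
    m = BILL_XML.match(name)
    if m is None:
        return None
    return {
        "congress": int(m["congress"]),
        "type": m["type"],
        "number": int(m["number"]),
    }


print(parse_bill_filename("BILLS-118hr4366enr.xml"))
```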

The XML uses semantic markup from the GPO bill DTD: <division>, <title>, <section>, <appropriations-small>, <quote>, <proviso>, and many more. This semantic structure is what enables reliable parsing and chunk boundary detection.

Immutable after download. The source text is never modified by any operation.

extraction.json

The primary output of the extract command. Contains:

  • bill — Bill-level metadata: identifier, classification, short title, fiscal years, divisions
  • provisions — Array of every extracted provision with full structured fields
  • summary — LLM-generated summary statistics (diagnostic only — never used for computation)
  • chunk_map — Links each provision to the extraction chunk that produced it
  • schema_version — Version of the extraction schema

This is the anchor file — the loader discovers bill directories by finding this file. All query commands (search, summary, compare, audit) read it.

See extraction.json Fields for the complete field reference.

verification.json

Deterministic verification of every provision against the source bill text. No LLM involved — pure string matching.

Contains:

  • amount_checks — Was each dollar string found in the source?
  • raw_text_checks — Is each raw text excerpt a substring of the source?
  • completeness — How many dollar strings in the source were matched to provisions?
  • summary — Roll-up metrics (verified, not_found, ambiguous, match tiers, coverage)

See verification.json Fields for the complete field reference.
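To make "pure string matching" concrete, here is a minimal Python sketch of the two kinds of check. It is not the tool's implementation. In particular, the mapping from occurrence counts to the statuses `verified`, `not_found`, and `ambiguous` is an assumption made for illustration (one occurrence, none, and more than one, respectively), and both function names are hypothetical.

```python
def check_amount(source_text, dollar_string):
    """Classify a dollar string by how often it occurs in the source text.

    Assumed semantics: exactly one occurrence is "verified", none is
    "not_found", and more than one is "ambiguous".
    """
    occurrences = source_text.count(dollar_string)
    if occurrences == 0:
        return "not_found"
    if occurrences == 1:
        return "verified"
    return "ambiguous"


def check_raw_text(source_text, raw_text):
    """Return True when the excerpt appears verbatim in the source text."""
    return raw_text in source_text


source = "For salaries, $1,000,000; for travel, $250,000; for grants, $250,000."
print(check_amount(source, "$1,000,000"))
print(check_amount(source, "$250,000"))
print(check_raw_text(source, "for travel, $250,000"))
```

Because both checks are plain substring operations, running them twice on the same inputs always gives the same result, which is what makes the report deterministic.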

metadata.json

Extraction provenance — records which model produced the extraction and when:

{
  "model": "claude-opus-4-6",
  "prompt_version": "a1b2c3d4...",
  "extraction_timestamp": "2024-03-17T14:30:00Z",
  "source_xml_sha256": "e5f6a7b8c9d0..."
}

The source_xml_sha256 field is part of the hash chain — it records the SHA-256 of the source XML so the tool can detect if the XML has been re-downloaded.

bill_meta.json

Bill-level metadata generated by the enrich command. Contains fiscal year scoping, subcommittee jurisdiction mappings (division letter → canonical jurisdiction), advance appropriation classification for each budget authority provision, enriched bill nature (omnibus, minibus, full-year CR with appropriations, etc.), and canonical (case-normalized) account names for cross-bill matching.

{
  "schema_version": "1.0",
  "congress": 119,
  "fiscal_years": [2026],
  "bill_nature": "omnibus",
  "subcommittees": [
    { "division": "A", "jurisdiction": "defense", "title": "...", "source": { "type": "pattern_match", "pattern": "department of defense" } }
  ],
  "provision_timing": [
    { "provision_index": 1370, "timing": "advance", "available_fy": 2027, "source": { "type": "fiscal_year_comparison", "availability_fy": 2027, "bill_fy": 2026 } }
  ],
  "canonical_accounts": [
    { "provision_index": 0, "canonical_name": "military personnel, army" }
  ],
  "extraction_sha256": "b461a687..."
}

This file is entirely optional. All commands that existed before v4.0 work without it. It is required only for --subcommittee filtering. The --fy flag works without it (falling back to extraction.json fiscal year data). The extraction_sha256 field is part of the hash chain — it records the SHA-256 of extraction.json at enrichment time, enabling staleness detection.

Requires no API keys to generate. Run congress-approp enrich --dir data to create this file for all bills. See Enrich Bills with Metadata for a detailed guide.
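For readers consuming bill_meta.json from their own scripts, the sketch below shows one way to pull out the division-to-jurisdiction mapping and the advance appropriations. It relies only on the field names in the sample above; the helper names `division_jurisdictions` and `advance_provisions` are hypothetical and not part of congress-approp.

```python
import json


def division_jurisdictions(bill_meta):
    """Map each division letter to its canonical jurisdiction."""
    return {
        entry["division"]: entry["jurisdiction"]
        for entry in bill_meta.get("subcommittees", [])
    }


def advance_provisions(bill_meta):
    """Return the provision indexes classified as advance appropriations."""
    return [
        entry["provision_index"]
        for entry in bill_meta.get("provision_timing", [])
        if entry.get("timing") == "advance"
    ]


# Trimmed copy of the sample document shown above.
sample = json.loads("""
{
  "fiscal_years": [2026],
  "subcommittees": [{"division": "A", "jurisdiction": "defense"}],
  "provision_timing": [
    {"provision_index": 1370, "timing": "advance", "available_fy": 2027}
  ]
}
""")

print(division_jurisdictions(sample))
print(advance_provisions(sample))
```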

tokens.json

LLM token usage from extraction:

{
  "total_input": 1200,
  "total_output": 1500,
  "total_cache_read": 800,
  "total_cache_create": 400,
  "calls": 1
}

Informational only — not used by any downstream operation. Useful for cost estimation and monitoring.
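A cost estimate from these counters might look like the following sketch. The function `estimate_cost` is hypothetical, the rates are caller-supplied placeholders rather than real prices, and the sketch assumes the four counters are billed independently, which may not match how a given provider accounts for cached tokens.

```python
TOKEN_FIELDS = (
    "total_input",
    "total_output",
    "total_cache_read",
    "total_cache_create",
)


def estimate_cost(tokens, usd_per_million):
    """Estimate cost in USD from a tokens.json dict.

    usd_per_million maps each token field to a price per million tokens.
    Fields without a rate are treated as free.
    """
    total = sum(
        tokens.get(field, 0) * usd_per_million.get(field, 0.0)
        for field in TOKEN_FIELDS
    )
    return total / 1_000_000


tokens = {
    "total_input": 1200,
    "total_output": 1500,
    "total_cache_read": 800,
    "total_cache_create": 400,
    "calls": 1,
}
# Placeholder rates for illustration only; substitute your provider's pricing.
rates = {"total_input": 1.0, "total_output": 2.0}
print(estimate_cost(tokens, rates))
```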

embeddings.json

Embedding metadata — a small JSON file (~230 bytes) that describes the companion vectors.bin file:

{
  "schema_version": "1.0",
  "model": "text-embedding-3-large",
  "dimensions": 3072,
  "count": 2364,
  "extraction_sha256": "a1b2c3d4...",
  "vectors_file": "vectors.bin",
  "vectors_sha256": "e5f6a7b8..."
}

The extraction_sha256 and vectors_sha256 fields are part of the hash chain for staleness detection.

See embeddings.json Fields for the complete field reference.

vectors.bin

Raw little-endian float32 embedding vectors. No header — just count × dimensions × 4 bytes of floating-point data. The count and dimensions come from embeddings.json.

File sizes for the example data:

| Bill | Provisions | Dimensions | File Size |
|---|---|---|---|
| H.R. 4366 | 2,364 | 3,072 | 29,048,832 bytes (29 MB) |
| H.R. 5860 | 130 | 3,072 | 1,597,440 bytes (1.6 MB) |
| H.R. 9468 | 7 | 3,072 | 86,016 bytes (86 KB) |

These files are excluded from the crates.io package (Cargo.toml exclude field) because they exceed the 10 MB upload limit. They are included in the git repository for users who clone.

See embeddings.json Fields for reading instructions.
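Since the format is headerless, a reader needs embeddings.json to interpret the bytes. The sketch below is one way to load the vectors with the Python standard library, assuming the layout described above (little-endian float32, row-major, one row per provision). The function `load_vectors` is hypothetical, and for the omnibus a NumPy array would likely be more practical than a list of tuples.

```python
import json
import struct
from pathlib import Path


def load_vectors(bill_dir):
    """Read vectors.bin as a list of float tuples, one tuple per provision."""
    d = Path(bill_dir)
    meta = json.loads((d / "embeddings.json").read_text())
    count, dims = meta["count"], meta["dimensions"]

    raw = (d / meta.get("vectors_file", "vectors.bin")).read_bytes()

    # No header: the file must be exactly count x dimensions x 4 bytes.
    expected = count * dims * 4
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, found {len(raw)}")

    # "<" selects little-endian; "f" is a 4-byte float32.
    flat = struct.unpack(f"<{count * dims}f", raw)
    return [flat[i * dims:(i + 1) * dims] for i in range(count)]
```

The size check reproduces the arithmetic in the table: for H.R. 9468, 7 × 3,072 × 4 = 86,016 bytes.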

chunks/ directory

Per-chunk LLM artifacts stored with ULID filenames (e.g., 01JRWN9T5RR0JTQ6C9FYYE96A8.json). Each file contains:

  • Thinking content — The model’s internal reasoning for this chunk
  • Raw response — The raw JSON the LLM produced before parsing
  • Parsed provisions — The provisions extracted from this chunk after resilient parsing
  • Conversion report — Type coercions, null-to-default conversions, and warnings

These are permanent provenance records — useful for understanding why the LLM classified a particular provision a certain way, or for debugging extraction issues. They are:

  • Gitignored by default (.gitignore includes chunks/)
  • Not part of the hash chain — no downstream artifact references them
  • Not required for any query operation
  • Not included in the crates.io package

Deleting the chunks/ directory has no effect on any operation.


Nesting Flexibility

The --dir flag accepts any directory path. The loader walks recursively from that path, finding every extraction.json. This means any nesting structure works:

# Flat structure (like the examples)
congress-approp summary --dir data
# Finds: data/118-hr4366/extraction.json, data/118-hr5860/extraction.json, data/118-hr9468/extraction.json

# Nested by congress/type/number
congress-approp summary --dir data
# Finds: data/118/hr/4366/extraction.json, data/118/hr/5860/extraction.json, etc.

# Single bill directory
congress-approp summary --dir data/118/hr/9468
# Finds: data/118/hr/9468/extraction.json

# Any arbitrary nesting
congress-approp summary --dir ~/my-appropriations-project/fy2024
# Finds all extraction.json files anywhere under that path

The directory name is used as the bill identifier for --similar references. For example, if the path is data/118-hr9468/extraction.json, the bill directory name is 118-hr9468, and you’d reference it as --similar 118-hr9468:0.
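The discovery rule is simple enough to mirror in a few lines. The following Python sketch is not the loader in loading.rs; it only illustrates the behavior described here, and the names `find_bill_dirs` and `bill_id` are hypothetical.

```python
from pathlib import Path


def find_bill_dirs(root):
    """Return every directory under root that contains an extraction.json."""
    root = Path(root).expanduser()
    # rglob also matches root/extraction.json, so a single bill directory works.
    return sorted(p.parent for p in root.rglob("extraction.json"))


def bill_id(bill_dir):
    """The bill identifier is the name of the directory holding extraction.json."""
    return Path(bill_dir).name
```

Under this sketch, a flat layout and a deeply nested one are handled the same way: whatever directory holds extraction.json is the bill directory, and its final path component is the identifier.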


The Hash Chain

Each downstream artifact records the SHA-256 hash of its input, enabling staleness detection:

BILLS-*.xml ──sha256──▶ metadata.json (source_xml_sha256)
                              │
extraction.json ──sha256──▶ bill_meta.json (extraction_sha256)     ← NEW in v4.0
extraction.json ──sha256──▶ embeddings.json (extraction_sha256)
                              │
vectors.bin ──sha256──▶ embeddings.json (vectors_sha256)

If any link in the chain breaks (input file changed but downstream wasn’t regenerated), the tool warns but doesn’t block. See Data Integrity and the Hash Chain for details.
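The chain can be checked from outside the tool as well. The sketch below recomputes each input hash and compares it with the recorded field. It assumes the recorded values are full lowercase hexadecimal SHA-256 digests (the samples in this document are truncated, so this is not confirmed), and the names `sha256_file` and `stale_links` are hypothetical. Like the tool, it reports mismatches rather than raising.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def stale_links(bill_dir):
    """List hash-chain links whose recorded digest no longer matches the input."""
    d = Path(bill_dir)
    xml = next(iter(sorted(d.glob("BILLS-*.xml"))), None)
    # (input file, file that records its hash, field holding the hash)
    links = [
        (xml, "metadata.json", "source_xml_sha256"),
        (d / "extraction.json", "bill_meta.json", "extraction_sha256"),
        (d / "extraction.json", "embeddings.json", "extraction_sha256"),
        (d / "vectors.bin", "embeddings.json", "vectors_sha256"),
    ]
    stale = []
    for source, recorder, field in links:
        record_path = d / recorder
        if source is None or not source.is_file() or not record_path.is_file():
            continue  # optional file missing: this link cannot be checked
        recorded = json.loads(record_path.read_text()).get(field)
        if recorded != sha256_file(source):
            stale.append(f"{recorder}: {field} does not match {source.name}")
    return stale
```

An empty list means every link that could be checked is intact. Missing optional files are skipped, consistent with only extraction.json being required.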


Immutability Model

Every file except links/links.json is write-once: a command writes it, and it changes only when that command is deliberately re-run. The links file is the one mutable file (link accept adds entries, link remove deletes them):

| File | Written When | Modified When |
|---|---|---|
| BILLS-*.xml | download | Never |
| extraction.json | extract, upgrade | Only by deliberate re-extraction or upgrade |
| verification.json | extract, upgrade | Only by deliberate re-extraction or upgrade |
| metadata.json | extract | Only by re-extraction |
| tokens.json | extract | Never |
| bill_meta.json | enrich | Only by re-enrichment (enrich --force) |
| embeddings.json | embed | Only by re-embedding |
| vectors.bin | embed | Only by re-embedding |
| chunks/*.json | extract | Never |

This write-once design means:

  • No file locking needed — multiple read processes can run simultaneously
  • No database needed — JSON files on disk are the right abstraction for a read-dominated workload
  • No caching needed — the files ARE the cache
  • Trivially relocatable — copy a bill directory anywhere and it works

The write:read ratio is approximately 1:500. Bills are extracted ~15 times per year (when Congress enacts new legislation), but queried hundreds to thousands of times.


Git Configuration

The project includes two git-related configurations for the data files:

.gitignore

chunks/          # Per-chunk LLM artifacts (local provenance, not for distribution)
NEXT_STEPS.md    # Internal context handoff document
.venv/           # Python virtual environment

The chunks/ directory is gitignored because it contains model thinking traces that are useful for local debugging but not needed for downstream operations or distribution.

.gitattributes

*.bin binary

The vectors.bin files are marked as binary in git to prevent line-ending conversion and diff attempts on float32 data.


Size Estimates

| Component | H.R. 9468 (Supp) | H.R. 5860 (CR) | H.R. 4366 (Omnibus) |
|---|---|---|---|
| Source XML | 9 KB | 131 KB | 1.8 MB |
| extraction.json | 15 KB | 200 KB | 12 MB |
| verification.json | 5 KB | 40 KB | 2 MB |
| metadata.json | ~300 B | ~300 B | ~300 B |
| tokens.json | ~200 B | ~200 B | ~200 B |
| bill_meta.json | ~1 KB | ~2 KB | ~5 KB |
| embeddings.json | ~230 B | ~230 B | ~230 B |
| vectors.bin | 86 KB | 1.6 MB | 29 MB |
| chunks/ | ~10 KB | ~100 KB | ~15 MB |
| Total | ~120 KB | ~2 MB | ~60 MB |

For 20 congresses (~60 bills), total storage would be approximately 200–400 MB, dominated by vectors.bin files for large omnibus bills.