embeddings.json Fields

Complete reference for the embedding metadata file and its companion binary vector file. These are produced by the congress-approp embed command and consumed by search --semantic and search --similar.

Overview

Embeddings use a split storage format:

embeddings.json: small JSON metadata file (~230 bytes, human-readable)
vectors.bin: binary float32 array (can be tens of megabytes for large bills)

The metadata file tells you everything you need to interpret the binary file: which model produced the vectors, how many dimensions each vector has, how many provisions are embedded, and SHA-256 hashes for the data integrity chain.

embeddings.json Structure

{
  "schema_version": "1.0",
  "model": "text-embedding-3-large",
  "dimensions": 3072,
  "count": 2364,
  "extraction_sha256": "ae912e3427b8...",
  "vectors_file": "vectors.bin",
  "vectors_sha256": "7bd7821176bc..."
}

Fields

Field	Type	Description
`schema_version`	string	Embedding schema version. Currently `"1.0"`.
`model`	string	The OpenAI embedding model used (e.g., `"text-embedding-3-large"`). All embeddings in a dataset must use the same model — you cannot compare vectors from different models.
`dimensions`	integer	Number of dimensions per vector. Default is `3072` for `text-embedding-3-large`. All embeddings in a dataset must use the same dimension count.
`count`	integer	Number of provisions embedded. Should equal the length of the `provisions` array in the corresponding `extraction.json`.
`extraction_sha256`	string	SHA-256 hash of the `extraction.json` file these embeddings were built from. Used for staleness detection — if the extraction changes, this hash won’t match and the tool warns that embeddings are stale.
`vectors_file`	string	Filename of the binary vectors file. Always `"vectors.bin"`.
`vectors_sha256`	string	SHA-256 hash of the `vectors.bin` file. Integrity check — detects corruption or truncation.
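
As a rough illustration of how a consumer might sanity-check these fields before touching vectors.bin, the sketch below validates the types listed in the table. The helper name validate_metadata is hypothetical and is not part of congress-approp; the sample values are the truncated ones from the structure shown above.

```python
import json

# Expected Python type for each field, following the table above.
EXPECTED_FIELDS = {
    "schema_version": str,
    "model": str,
    "dimensions": int,
    "count": int,
    "extraction_sha256": str,
    "vectors_file": str,
    "vectors_sha256": str,
}


def validate_metadata(meta):
    """Return a list of problems found in a parsed embeddings.json dict.

    Hypothetical helper for illustration; not part of congress-approp.
    """
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in meta:
            problems.append(f"missing field: {name}")
            continue
        value = meta[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    return problems


# Sample metadata copied from the structure shown above (hashes truncated there).
sample = json.loads("""
{
  "schema_version": "1.0",
  "model": "text-embedding-3-large",
  "dimensions": 3072,
  "count": 2364,
  "extraction_sha256": "ae912e3427b8...",
  "vectors_file": "vectors.bin",
  "vectors_sha256": "7bd7821176bc..."
}
""")

print(validate_metadata(sample))  # prints []
```

An empty list means every documented field is present with the expected type. The sketch does not verify the hashes themselves.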

Example Files from Included Data

Bill	Count	Dimensions	embeddings.json Size	vectors.bin Size
H.R. 4366 (omnibus)	2,364	3,072	~230 bytes	29,048,832 bytes (29 MB)
H.R. 5860 (CR)	130	3,072	~230 bytes	1,597,440 bytes (1.6 MB)
H.R. 9468 (supplemental)	7	3,072	~230 bytes	86,016 bytes (86 KB)

vectors.bin Format

A flat binary file containing raw little-endian float32 values. There is no header, no delimiter, and no structure — just count × dimensions floating-point numbers in sequence.

Layout

[provision_0_dim_0] [provision_0_dim_1] ... [provision_0_dim_3071]
[provision_1_dim_0] [provision_1_dim_1] ... [provision_1_dim_3071]
...
[provision_N_dim_0] [provision_N_dim_1] ... [provision_N_dim_3071]

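Each float32 is 4 bytes, stored in little-endian byte order. Provisions are stored in the same order as the provisions array in extraction.json: provision index 0 comes first, then index 1, and so on up to index N = count − 1. The layout above shows the default 3,072 dimensions, so the last dimension index is 3071.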

File Size Formula

file_size = count × dimensions × 4  (bytes)

For the omnibus: 2364 × 3072 × 4 = 29,048,832 bytes

If the actual file size doesn’t match this formula, the file is corrupted or truncated, or embeddings.json describes a different vectors file. The vectors_sha256 hash in embeddings.json provides an independent integrity check.
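
The size check can be sketched in a few lines of Python. The helper names expected_size and size_matches are made up for this example, and the demo writes a throwaway file of zero bytes rather than reading real data.

```python
import os
import tempfile

BYTES_PER_FLOAT32 = 4


def expected_size(count, dimensions):
    """Size in bytes that vectors.bin should have for this metadata."""
    return count * dimensions * BYTES_PER_FLOAT32


def size_matches(path, count, dimensions):
    """True if the file on disk has exactly the expected size."""
    return os.path.getsize(path) == expected_size(count, dimensions)


# Demo: a placeholder file sized for 7 provisions x 3072 dimensions.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.bin")
    with open(path, "wb") as f:
        f.write(bytes(expected_size(7, 3072)))
    ok = size_matches(path, 7, 3072)
    # Metadata claiming 8 provisions no longer matches the file.
    mismatch_detected = not size_matches(path, 8, 3072)

print(expected_size(2364, 3072))  # 29048832
print(ok, mismatch_detected)      # True True
```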

Reading a Specific Provision’s Vector

To read the vector for provision at index i:

byte_offset = i × dimensions × 4
byte_length = dimensions × 4

Seek to byte_offset and read byte_length bytes, then interpret as dimensions little-endian float32 values.
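
A minimal sketch of that seek-and-read step, using only the standard library. The function name read_vector is hypothetical, and the demo uses a tiny 3 × 4 file so the result is easy to check by eye.

```python
import os
import struct
import tempfile


def read_vector(path, index, dimensions):
    """Read one provision's vector without loading the whole file."""
    byte_length = dimensions * 4
    byte_offset = index * byte_length
    with open(path, "rb") as f:
        f.seek(byte_offset)
        raw = f.read(byte_length)
    if len(raw) != byte_length:
        raise IndexError(f"provision index {index} is past the end of the file")
    # "<" selects little-endian, "f" is a 4-byte float32.
    return struct.unpack(f"<{dimensions}f", raw)


# Demo: write 3 provisions x 4 dimensions, then read the middle one back.
rows = [[0.0, 0.5, 1.0, 1.5], [2.0, 2.5, 3.0, 3.5], [4.0, 4.5, 5.0, 5.5]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.bin")
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack("<4f", *row))
    second = read_vector(path, 1, 4)
    try:
        read_vector(path, 3, 4)
        out_of_range = False
    except IndexError:
        out_of_range = True

print(second)  # (2.0, 2.5, 3.0, 3.5)
```

For a real bill, dimensions would come from embeddings.json rather than being hard-coded.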

Vector Properties

All vectors are L2-normalized — each vector has a Euclidean norm of approximately 1.0. This means:

Cosine similarity equals the dot product: cos(a, b) = a · b (since |a| = |b| = 1)
Values range from approximately -0.1 to +0.1 per dimension (spread across 3,072 dimensions)
Similarity scores range from approximately 0.2 to 0.9 in practice for appropriations data
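
To see why the first property holds, the following sketch compares the general cosine formula with a plain dot product on normalized copies of the same vectors. The three-dimensional inputs are toy values chosen for easy arithmetic, not real embeddings.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Return a copy of vec scaled to Euclidean norm 1."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """General cosine similarity, valid for vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


raw_a = [3.0, 4.0, 0.0]
raw_b = [4.0, 3.0, 0.0]
a = l2_normalize(raw_a)
b = l2_normalize(raw_b)

print(f"{cosine(raw_a, raw_b):.4f}")  # 0.9600
print(f"{dot(a, b):.4f}")             # 0.9600
```

Once the vectors are normalized, the division in the general formula is a division by 1 and can be skipped.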

Reading Vectors in Python

Using struct (standard library)

import json
import struct

with open("data/118-hr9468/embeddings.json") as f:
    meta = json.load(f)

dims = meta["dimensions"]  # 3072
count = meta["count"]       # 7

with open("data/118-hr9468/vectors.bin", "rb") as f:
    raw = f.read()

# Verify file size
assert len(raw) == count * dims * 4, "File size mismatch — possible corruption"

# Parse into list of tuples
vectors = []
for i in range(count):
    start = i * dims * 4
    end = start + dims * 4
    vec = struct.unpack(f"<{dims}f", raw[start:end])
    vectors.append(vec)

# Check normalization
norm = sum(x * x for x in vectors[0]) ** 0.5
print(f"Vector 0 L2 norm: {norm:.6f}")  # Should be ~1.000000

Using numpy (faster for large files)

import numpy as np
import json

with open("data/118-hr4366/embeddings.json") as f:
    meta = json.load(f)

vectors = np.fromfile(
    "data/118-hr4366/vectors.bin",
    dtype=np.float32
).reshape(meta["count"], meta["dimensions"])

print(f"Shape: {vectors.shape}")  # (2364, 3072)
print(f"Vector 0 norm: {np.linalg.norm(vectors[0]):.6f}")  # ~1.000000

# Cosine similarity matrix (fast — vectors are normalized)
similarity = vectors @ vectors.T
print(f"Provision 0 vs 1 similarity: {similarity[0, 1]:.4f}")

Computing Cosine Similarity

Since vectors are L2-normalized, cosine similarity is just the dot product:

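def cosine_similarity(a, b):
    # Valid only for L2-normalized inputs; otherwise divide by both norms.
    return sum(x * y for x, y in zip(a, b))

# Or with numpy (np and vectors as loaded in the numpy example above):
sim = np.dot(vectors[0], vectors[1])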
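
Building on the dot product, a brute-force ranking of the provisions most similar to a given one might look like the sketch below. This is only an illustration of the idea; it is not claimed to be how search --similar is implemented, and the function name most_similar and the two-dimensional unit vectors are invented for the example.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def most_similar(vectors, query_index, k=5):
    """Return up to k (index, score) pairs, best first, excluding the query."""
    query = vectors[query_index]
    scored = (
        (dot(query, vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    )
    return [(i, score) for score, i in heapq.nlargest(k, scored)]


# Toy unit vectors standing in for provision embeddings.
toy_vectors = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0), (0.6, 0.8)]

top = most_similar(toy_vectors, query_index=0, k=2)
print(top)  # [(1, 0.8), (3, 0.6)]
```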

Reading Vectors in Rust

The congress-approp library provides the embeddings module:

use congress_appropriations::approp::embeddings;
use std::path::Path;

// Assumes the library's error type converts into Box<dyn std::error::Error>.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    if let Some(loaded) = embeddings::load(Path::new("data/118-hr9468"))? {
        println!("Model: {}", loaded.metadata.model);
        println!("Dimensions: {}", loaded.dimensions());
        println!("Count: {}", loaded.count());

        // Get vector for provision 0
        let vec0: &[f32] = loaded.vector(0);

        // Cosine similarity between provisions 0 and 1
        let sim = embeddings::cosine_similarity(vec0, loaded.vector(1));
        println!("Similarity: {:.4}", sim);
    }
    Ok(())
}

Key Functions

Function	Signature	Description
`embeddings::load(dir)`	`fn load(dir: &Path) -> Result<Option<LoadedEmbeddings>>`	Load embeddings from a bill directory. Returns `None` if no `embeddings.json` exists.
`embeddings::save(dir, meta, vecs)`	`fn save(dir: &Path, metadata: &EmbeddingsMetadata, vectors: &[f32]) -> Result<()>`	Save embeddings to a bill directory. Writes both `embeddings.json` and `vectors.bin`.
`embeddings::cosine_similarity(a, b)`	`fn cosine_similarity(a: &[f32], b: &[f32]) -> f32`	Compute cosine similarity (dot product for normalized vectors).
`embeddings::normalize(vec)`	`fn normalize(vec: &mut [f32])`	L2-normalize a vector in place.
`loaded.vector(i)`	`fn vector(&self, i: usize) -> &[f32]`	Get the embedding vector for provision at index `i`.
`loaded.count()`	`fn count(&self) -> usize`	Number of embedded provisions.
`loaded.dimensions()`	`fn dimensions(&self) -> usize`	Number of dimensions per vector.

The Hash Chain

Embeddings participate in the data integrity hash chain:

extraction.json ──sha256──▶ embeddings.json (extraction_sha256)
vectors.bin ──sha256──▶ embeddings.json (vectors_sha256)

Staleness Detection

When you run a command that uses embeddings (search --semantic or search --similar), the tool:

Computes the SHA-256 of the current extraction.json on disk
Compares it to extraction_sha256 in embeddings.json
If they differ, prints a warning to stderr:

⚠ H.R. 4366: embeddings are stale (extraction.json has changed)

This means the extraction was modified (re-extracted or upgraded) after the embeddings were generated. The provision indices in the vectors may no longer correspond to the current provisions. The warning is advisory — execution continues, but results may be unreliable.

Fix: Regenerate embeddings with congress-approp embed --dir <path>.

Integrity Check

The vectors_sha256 field verifies that vectors.bin hasn’t been corrupted. If the hash doesn’t match, the binary file was modified, truncated, or replaced since embeddings were generated.
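
Both checks can be reproduced outside the tool. The sketch below assumes the stored hashes are lowercase hex SHA-256 digests of the raw file bytes, which the source does not state explicitly; the function names sha256_file and check_hash_chain are hypothetical, and the demo builds a throwaway bill directory instead of reading real data.

```python
import hashlib
import json
import os
import tempfile


def sha256_file(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks so large vectors.bin files are fine."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_hash_chain(bill_dir):
    """Return (stale, corrupted) for one bill directory.

    Assumption: hashes in embeddings.json are hex digests of the file bytes.
    """
    with open(os.path.join(bill_dir, "embeddings.json")) as f:
        meta = json.load(f)
    extraction_hash = sha256_file(os.path.join(bill_dir, "extraction.json"))
    vectors_hash = sha256_file(os.path.join(bill_dir, meta["vectors_file"]))
    stale = extraction_hash != meta["extraction_sha256"]
    corrupted = vectors_hash != meta["vectors_sha256"]
    return stale, corrupted


with tempfile.TemporaryDirectory() as bill_dir:
    extraction_path = os.path.join(bill_dir, "extraction.json")
    vectors_path = os.path.join(bill_dir, "vectors.bin")
    with open(extraction_path, "w") as f:
        json.dump({"provisions": []}, f)
    with open(vectors_path, "wb") as f:
        f.write(bytes(16))
    meta = {
        "extraction_sha256": sha256_file(extraction_path),
        "vectors_file": "vectors.bin",
        "vectors_sha256": sha256_file(vectors_path),
    }
    with open(os.path.join(bill_dir, "embeddings.json"), "w") as f:
        json.dump(meta, f)
    fresh = check_hash_chain(bill_dir)

    # Simulate a re-extraction: the file changes, the recorded hash does not.
    with open(extraction_path, "w") as f:
        json.dump({"provisions": [{}]}, f)
    after_change = check_hash_chain(bill_dir)

print(fresh)         # (False, False)
print(after_change)  # (True, False)
```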

Automatic Skip

The embed command checks the hash chain before processing each bill. If extraction_sha256 matches the current extraction and vectors_sha256 matches the current vectors file, the bill is skipped:

Skipping H.R. 9468: embeddings up to date

This makes it safe to run embed --dir data repeatedly — only bills with new or changed extractions are processed.

Consistency Requirements

Same model across all bills

All embeddings in a dataset must use the same model. Cosine similarity between vectors from different models is undefined. The model field in embeddings.json records which model was used.

If you change models, regenerate embeddings for all bills:

# Re-embed every bill with the new model (deleting the old embeddings first is optional; embed overwrites them)
congress-approp embed --dir data --model text-embedding-3-large

Same dimensions across all bills

All embeddings must use the same dimension count. The default is 3,072 (the native output of text-embedding-3-large). If you truncate dimensions with --dimensions 1024, all bills must use 1,024.

The dimensions field in embeddings.json records the dimension count. The tool does not currently check for dimension mismatches across bills — comparing vectors of different dimensions will silently produce garbage results.
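
Because the tool does not perform this check, a small script can do it before any cross-bill comparison. The sketch below assumes the data directory holds one subdirectory per bill, as the paths in the examples suggest; the function name group_by_model_and_dimensions is hypothetical, and the 1,024-dimension entry in the demo is a deliberately introduced mismatch, not a description of the included data.

```python
import json
import os
import tempfile


def group_by_model_and_dimensions(data_dir):
    """Map (model, dimensions) to the bill directories that use that pair."""
    groups = {}
    for name in sorted(os.listdir(data_dir)):
        meta_path = os.path.join(data_dir, name, "embeddings.json")
        if not os.path.isfile(meta_path):
            continue  # bill has no embeddings yet
        with open(meta_path) as f:
            meta = json.load(f)
        key = (meta["model"], meta["dimensions"])
        groups.setdefault(key, []).append(name)
    return groups


# Demo: two bills, one of them deliberately embedded at a different size.
with tempfile.TemporaryDirectory() as data_dir:
    for bill, dims in [("118-hr4366", 3072), ("118-hr9468", 1024)]:
        os.makedirs(os.path.join(data_dir, bill))
        with open(os.path.join(data_dir, bill, "embeddings.json"), "w") as f:
            json.dump({"model": "text-embedding-3-large", "dimensions": dims}, f)
    groups = group_by_model_and_dimensions(data_dir)

consistent = len(groups) <= 1
for (model, dims), bills in sorted(groups.items()):
    print(model, dims, bills)
print("consistent:", consistent)  # consistent: False
```

More than one key in the result means at least one bill needs to be re-embedded before its vectors are compared with the others.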

Provision count alignment

The count field should equal the number of provisions in extraction.json. If the extraction is re-run (producing a different number of provisions), the stored vectors no longer align with the provisions — the hash chain detects this as staleness.
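
The alignment rule itself is a one-line comparison. The sketch below assumes provisions is a top-level key of extraction.json, which is an assumption for this example; the helper name count_is_aligned is also hypothetical, and the dictionaries stand in for the parsed files.

```python
def count_is_aligned(meta, extraction):
    """True if embeddings.json count equals the number of provisions.

    Assumption: "provisions" is a top-level array in extraction.json.
    """
    return meta["count"] == len(extraction["provisions"])


meta = {"count": 7}
extraction = {"provisions": [{} for _ in range(7)]}
aligned_before = count_is_aligned(meta, extraction)

# A re-extraction that yields one more provision breaks the alignment.
extraction["provisions"].append({})
aligned_after = count_is_aligned(meta, extraction)

print(aligned_before, aligned_after)  # True False
```

Matching counts are necessary but not sufficient: a re-extraction can keep the same number of provisions while changing their order or content, which is why the hash comparison is the authoritative staleness signal.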

Storage on crates.io

The vectors.bin files are excluded from the crates.io package via the exclude field in Cargo.toml:

exclude = ["data/"]

This is because the omnibus bill’s vectors.bin (29 MB) exceeds crates.io’s 10 MB upload limit. Users who install from crates.io can generate embeddings themselves:

export OPENAI_API_KEY="your-key"
congress-approp embed --dir data

Users who clone the GitHub repository get the pre-generated vectors.bin files.

Embedding Model Details

The default model is OpenAI’s text-embedding-3-large:

Property	Value
Model name	`text-embedding-3-large`
Native dimensions	3,072
Normalization	L2-normalized (unit vectors)
Determinism	Near-perfect — max deviation ~1e-6 across repeated embeddings of the same text
Supported dimension truncation	256, 512, 1024, 3072 (via `--dimensions` flag)

Dimension Truncation Trade-offs

Experimental results from this project:

Dimensions	Top-20 Overlap vs. 3072	vectors.bin Size (Omnibus)	Load Time
256	16/20 (lossy)	~2.4 MB	<1ms
512	18/20 (near-lossless)	~4.8 MB	<1ms
1024	19/20	~9.7 MB	~1ms
3072 (default)	20/20 (ground truth)	~29 MB	~2ms

Since binary files load in milliseconds regardless of size, the full 3,072 dimensions are recommended. There is no practical performance benefit to truncation.
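
The source does not define how "Top-20 Overlap" was measured. One plausible reading is the number of provisions that appear in both top-20 result lists, which can be sketched as follows. The function name top_k_overlap and the index lists are invented for illustration and are not the project's experimental data.

```python
def top_k_overlap(reference, candidate, k=20):
    """Count items shared by the first k entries of two ranked lists."""
    return len(set(reference[:k]) & set(candidate[:k]))


# Hypothetical rankings of provision indices, best match first.
reference_ranking = list(range(20))             # e.g. results at full dimensions
candidate_ranking = list(range(18)) + [40, 41]  # e.g. results after truncation

overlap = top_k_overlap(reference_ranking, candidate_ranking, k=20)
print(f"{overlap}/20")  # 18/20
```

This measure ignores ordering within the top 20; a rank-sensitive metric would be stricter.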

How Semantic Search Works — how embeddings enable meaning-based search
Generate Embeddings — creating and managing embeddings
Data Integrity and the Hash Chain — staleness detection across the pipeline
Data Directory Layout — where embedding files fit in the directory structure
