Design Principles

Five principles govern every architectural decision in this project. They are listed in priority order — when two principles conflict, the higher-ranked principle wins.

1. Deterministic First

If a deterministic method produces correct results, use it. Do not add machine learning, embeddings, or LLM calls where string matching, regex, or lookup tables suffice. Layers L0 and L1 contain zero ML — name decomposition, FIPS enrichment, keyword-based office classification, and hash computation are all deterministic operations that produce identical output from identical input on every run. Deterministic methods are not preferred because they are cheaper (budget is not a constraint). They are preferred because they are reproducible, auditable, and incapable of hallucination. When a journalist asks “why did your system say these two candidates are the same person?”, the answer should be “because their canonical first names, last names, and suffixes are identical” — not “because a language model said so.” Determinism is the default. Non-determinism requires justification.
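A minimal sketch of what this looks like in practice. The suffix table, field names, and hash scheme below are illustrative assumptions, not the pipeline's actual implementation; the point is that every step is a pure function of its input.

```python
import hashlib
import re

# Illustrative suffix table; the real list is an assumption here.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def decompose(raw: str) -> dict:
    """Deterministic name decomposition: same input, same output, every run."""
    parts = re.split(r"\s+", raw.strip())
    suffix = ""
    if parts and parts[-1].rstrip(".").lower() in SUFFIXES:
        suffix = parts.pop()
    first = parts[0] if parts else ""
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1]) if len(parts) > 2 else ""
    return {"raw": raw, "first": first, "middle": middle,
            "last": last, "suffix": suffix}

def record_hash(rec: dict) -> str:
    # Hash over sorted keys: identical input always yields an identical digest,
    # which is what makes the result auditable and reproducible.
    canonical = "|".join(f"{k}={rec[k]}" for k in sorted(rec))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Nothing here can hallucinate: rerunning `decompose` and `record_hash` on the same record produces byte-identical output, so any disputed result can be replayed and verified.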

2. Preserve Signal

Every piece of information in the source data is potential disambiguation signal. Middle initials distinguish David S. Marshall (Maine) from David A. Marshall (Florida) — dropping them collapses two people into one. Suffixes distinguish Robert Williams from Robert Williams Jr. — stripping them merges father and son. Nicknames reveal that Charlie Crist and Charles Joseph Crist are the same person — normalizing too early destroys that connection. The rule at L1 is: decompose names into structured components (raw, first, middle, last, suffix, canonical_first) and preserve every component. Do not discard middle initials. Do not strip suffixes. Do not overwrite the raw name with a canonical form. Clean without collapsing. Downstream layers (L2 embedding, L3 matching, L4 canonicalization) consume these components selectively. The raw material must survive intact through L1 for those layers to function.
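A sketch of the component structure described above. The frozen dataclass and the nickname table are assumptions for illustration; the essential property is that `raw` is carried alongside the canonical form, never replaced by it.

```python
from dataclasses import dataclass

# Hypothetical nickname table; the real mapping is an assumption here.
NICKNAMES = {"charlie": "charles", "bob": "robert", "bill": "william"}

@dataclass(frozen=True)
class NameComponents:
    raw: str              # never overwritten
    first: str
    middle: str           # "S." vs "A." distinguishes the two Marshalls
    last: str
    suffix: str           # "Jr." distinguishes father from son
    canonical_first: str  # stored alongside first, not in place of it

def canonicalize_first(first: str) -> str:
    # Map a nickname to its canonical form without discarding the original.
    return NICKNAMES.get(first.lower(), first.lower())

rec = NameComponents(raw="Charlie Crist", first="Charlie", middle="",
                     last="Crist", suffix="",
                     canonical_first=canonicalize_first("Charlie"))
```

Because every component survives, downstream layers can choose their own signal: L3 matching can compare `canonical_first` while an audit trail can still display `raw` exactly as it appeared in the source file.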

3. LLMs for Confirmation, Not Discovery

Embeddings retrieve candidates. LLMs confirm matches. The embedding model (text-embedding-3-large, 3,072 dimensions) identifies pairs that might be the same entity — Charlie Crist at cosine 0.451, Robert Williams Jr at 0.862. The LLM (Claude Sonnet) then examines the full context — structured name components, vote counts, office, state, party — and renders a judgment: match or no-match, with confidence and reasoning. The LLM is never the first line of analysis. It does not scan raw files, parse CSV columns, compute FIPS codes, or generate embeddings. It is called only when cheaper methods have narrowed the problem to a specific, bounded question: “Are these two records the same person?” or “What type of office is the Santa Rosa Island Authority?” This ordering exists for speed (70% of entity resolution is exact match), reproducibility (deterministic steps produce identical results), and auditability (every LLM decision is logged with its prompt, response, and reasoning).
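The ordering can be sketched as a three-stage function. The embedding and LLM calls are stubbed out as injected callables (the threshold, field names, and return shapes are assumptions); what the sketch preserves is the priority: exact match first, embedding retrieval second, and the LLM only for a bounded yes/no question on a shortlisted pair.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def resolve(record, candidates, embed, confirm_with_llm, threshold=0.4):
    # 1. Exact match on structured name components: no ML involved.
    for c in candidates:
        if (c["first"], c["last"], c["suffix"]) == \
           (record["first"], record["last"], record["suffix"]):
            return {"match": c, "method": "exact"}
    # 2. Embeddings retrieve plausible candidates, best first.
    rv = embed(record["raw"])
    scored = sorted(((cosine(rv, embed(c["raw"])), c) for c in candidates),
                    key=lambda t: t[0], reverse=True)
    # 3. The LLM answers one bounded question per surviving pair.
    for score, c in scored:
        if score < threshold:
            break
        verdict = confirm_with_llm(record, c)  # prompt/response/reasoning would be logged
        if verdict["match"]:
            return {"match": c, "method": "llm", "reasoning": verdict["reasoning"]}
    return {"match": None, "method": "none"}
```

The LLM never sees a record that deterministic matching already resolved, and it never sees an unbounded search problem, only a single candidate pair with full structured context.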

4. Immutable Layers

Outputs are append-only. L0 raw files are never modified. L1 cleaned records are never updated in place — if the parser changes, a new L1 run produces new records with a new parser_version. L2 embeddings are never re-computed by overwriting existing vectors — a new embedding model produces new L2 output alongside the old. L3 match decisions are never silently revised — an override produces a new decision record referencing the original. L4 canonical exports are versioned snapshots, not mutable databases. This immutability serves two purposes. First, provenance: the hash chain from L4 back to L0 depends on every intermediate record remaining unchanged. Modifying an L1 record without incrementing the parser version breaks the chain. Second, debugging: when a result looks wrong, you can inspect every layer’s output at the time it was produced, without worrying that a subsequent run overwrote the evidence.
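A sketch of the append-only hash chain this implies. The record shape and field names are assumptions; the property being illustrated is that every record commits to its parent's hash and its parser version, so any in-place edit anywhere in the chain is detectable.

```python
import hashlib
import json

def make_record(layer: str, payload: dict, parent_hash: str,
                parser_version: str) -> dict:
    """Append a new record; never mutate an existing one."""
    body = {"layer": layer, "payload": payload,
            "parent_hash": parent_hash, "parser_version": parser_version}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": digest}

def verify_chain(records: list) -> bool:
    # Each hash must still match its body, and each parent_hash must equal
    # the previous record's hash: mutation anywhere breaks the chain.
    prev = None
    for rec in records:
        body = {k: v for k, v in rec.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["hash"] != expected:
            return False
        if prev is not None and rec["parent_hash"] != prev["hash"]:
            return False
        prev = rec
    return True
```

A parser change therefore appends a fresh record with a new `parser_version` and a new hash; the old record, and everything derived from it, remains verifiable as it was.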

5. Document Sources, Don’t Store Data

This project does not redistribute election data. Each source — MEDSL, NC SBE, OpenElections, VEST, Census, FEC — publishes data under its own license, on its own schedule, at its own URLs. We provide exact download URLs, file size expectations, schema documentation, known data quality issues, and the tools to process the data. We do not provide the data itself. The reasons are legal (license terms vary), practical (the current corpus exceeds 8 GB and grows with each election cycle), and epistemic (a stale copy of a dataset that the source has since corrected is worse than no copy at all). Users download data from authoritative sources, verify file integrity against documented hashes, and run the pipeline locally. The L0 manifest records exactly where each file came from and when it was retrieved, so any result can be traced back to its authoritative origin.
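A sketch of what such a manifest entry and integrity check might look like. The field names and helper functions are illustrative assumptions, not the project's actual manifest schema; the idea is that the manifest records provenance (source, URL, retrieval time) while verification compares the local file against the documented hash.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def manifest_entry(source: str, url: str, local_path: str,
                   expected_sha256: str) -> dict:
    """Record where a file came from and when it was retrieved."""
    return {"source": source, "url": url, "local_path": local_path,
            "expected_sha256": expected_sha256,
            "retrieved_at": datetime.now(timezone.utc).isoformat()}

def verify_download(entry: dict) -> bool:
    # Compare the local file's digest against the documented hash.
    digest = hashlib.sha256(Path(entry["local_path"]).read_bytes()).hexdigest()
    return digest == entry["expected_sha256"]
```

With this in place, the pipeline never needs to ship data: a user who downloads from the authoritative URL and passes `verify_download` is working from exactly the bytes the documentation describes.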