Election Aggregation

A multi-layer pipeline for collecting, normalizing, and unifying US local election results from heterogeneous sources.

The question this project answers

Who ran for school board in your county last year? Who was the sheriff, and did anyone run against them? What was the closest local race in your state? Has your county commissioner been reelected five times unopposed, or do they face real competition?

These should be easy questions. They are not.

There is no national database of US local election results. The data exists — scattered across 50 state election boards, 3,000+ county clerk offices, academic datasets, election night reporting platforms, and community-curated repositories — but it has never been unified into a single, consistent, trustworthy format. Every source uses different schemas, different name formats, different office titles, different geographic identifiers, and different levels of completeness.

This project fixes that.

What we found when we tried

We downloaded 42 million rows of precinct-level election data from the MIT Election Data + Science Lab (MEDSL), the North Carolina State Board of Elections, OpenElections, VEST, the Census Bureau, and the FEC. We covered all 50 states across three national cycles (2018, 2020, 2022) and ten cycles of North Carolina history (2006–2024).

Then we tried to answer simple questions, and the problems started immediately.

The same candidate appears differently across sources. MEDSL reports SHANNON W BRAY. NC SBE reports Shannon W. Bray. One is all caps with no period after the middle initial. The other is title case with a period. These are the same person — but a computer doesn’t know that without being told.

Nicknames break everything. Charlie Crist in one source is CRIST, CHARLES JOSEPH in another. A human recognizes Charlie as a nickname for Charles. An embedding model scores their similarity at 0.451 — well below any reasonable match threshold. A language model, given the context (same state, same office, same election, same vote count), correctly identifies them as the same person with 0.95 confidence.

The same office title means different things. In Texas, the “County Judge” is the chief executive of the county, roughly equivalent to a county manager. In most other states, a county judge is a judicial officer. If your system classifies “DALLAS COUNTY JUDGE” as judicial, you’re wrong in Texas even though the same rule is right in most other states. Across all 50 states in 2022, we found 8,387 unique local office names. Our keyword classifier handles 62% of them. The remaining 38% require embedding-based matching and LLM reasoning.

Non-candidate data hides inside candidate data. Florida OpenElections includes 6,013 rows labeled “Registered Voters” — not a contest, but a turnout metadata row that got slurped into the results file as if it were a race. Other sources include “BLANK” (Maine’s name for undervotes), “TOTAL VOTES” (Utah’s aggregation rows), and “OverVotes” / “UnderVotes” masquerading as candidate names. Each source has its own ghosts.
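A minimal sketch of how such rows could be screened out before any contest-level counting. The label set is copied from the examples above; the function name, the record layout, and the toy vote counts are assumptions made for illustration, and a real list would need to be maintained per source.

```python
# Labels taken from the examples in the text; not an exhaustive list.
NON_CANDIDATE_LABELS = {
    "REGISTERED VOTERS",
    "BLANK",
    "TOTAL VOTES",
    "OVERVOTES",
    "UNDERVOTES",
}


def is_non_candidate(label: str) -> bool:
    """Return True when a 'candidate' value is really a metadata row."""
    # Collapse case and repeated whitespace before comparing.
    return " ".join(label.upper().split()) in NON_CANDIDATE_LABELS


# Toy rows (hypothetical vote counts) mixing a real candidate with ghost rows.
rows = [
    {"candidate": "Shannon W. Bray", "votes": 90},
    {"candidate": "BLANK", "votes": 12},
    {"candidate": "OverVotes", "votes": 1},
]
kept = [r for r in rows if not is_non_candidate(r["candidate"])]
print([r["candidate"] for r in kept])
```

Matching on the whole label rather than on a substring keeps a real candidate whose name happens to contain one of these words from being dropped.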

Nobody tracks the same person across elections. Timothy Lance won a Columbus County, NC school board seat in 2022. Did he run before? Did he win? Is the “T. Lance” who ran in 2018 the same person? No existing dataset answers this. Entity resolution — determining that two records refer to the same human being — is the hardest problem in this project, and the one we spend the most effort on.

What this project does

Election Aggregation is a five-layer pipeline that transforms messy, heterogeneous election data into a clean, unified, entity-resolved dataset with full provenance back to the original source files.

L0  RAW          Byte-identical source files. Never modified.
 ↓
L1  CLEANED      Parsed, structured records. Names decomposed into
                 first/middle/last/suffix. FIPS codes enriched.
                 Office classified by keyword and regex.
                 Purely deterministic. No ML, no API calls.
 ↓
L2  EMBEDDED     Vector embeddings generated for candidates, contests,
                 and geographic names. Office classification tier 3
                 (embedding nearest-neighbor). Quality flags raised.
                 Deterministic given the same embedding model.
 ↓
L3  MATCHED      Entity resolution. Same-candidate and same-contest
                 identifiers assigned. Embedding retrieval + LLM
                 confirmation. Every decision stored with reasoning.
 ↓
L4  CANONICAL    Authoritative names chosen. Temporal chains built.
                 Alias tables constructed. Verification algorithms run.
                 Researcher-facing exports produced.

The ordering is strict and deliberate: Clean → Embed → Match → Canonicalize. You cannot assign an authoritative name before you know who the person is. You cannot match entities before you have embeddings. You cannot embed before you have clean, signal-preserving parsed records. And you cannot parse before you have the raw bytes.

Every record at every layer carries a cryptographic hash chain back to the original source file. If someone modifies a vote count, changes a name, or alters a match decision at any layer, the verification algorithm detects exactly where the chain breaks.
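The idea can be sketched with SHA-256 from the Python standard library. This is an illustration under assumptions, not the project's actual format: the record fields, the `link` and `first_break` helpers, and the choice to hash the parent hash together with canonical JSON are all hypothetical.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def link(record: dict, parent_hash: str) -> dict:
    """Attach a hash that covers the record content and its parent's hash."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return {
        "record": record,
        "parent_hash": parent_hash,
        "hash": sha256_hex(parent_hash.encode("utf-8") + payload),
    }


def first_break(chain: list, root_hash: str):
    """Return the index of the first entry that no longer verifies, else None."""
    parent = root_hash
    for i, entry in enumerate(chain):
        expected = link(entry["record"], parent)["hash"]
        if entry["parent_hash"] != parent or entry["hash"] != expected:
            return i
        parent = entry["hash"]
    return None


# L0 is the hash of the raw bytes; later layers chain from it.
raw = b"CABARRUS\t12-13\tUS SENATE\tShannon W. Bray\t47\t38\t5\t0\t90\n"
l0 = sha256_hex(raw)
l1 = link({"layer": "L1", "candidate": "Shannon W. Bray", "votes": 90}, l0)
l2 = link({"layer": "L2", "flags": []}, l1["hash"])
chain = [l1, l2]

print(first_break(chain, l0))      # chain is intact
chain[0]["record"]["votes"] = 91   # tamper with the L1 vote count
print(first_break(chain, l0))      # verification now fails at index 0
```

Because each hash includes its parent's hash, a change at one layer invalidates that layer's stored hash, and the verifier reports the earliest position where recomputation disagrees.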

What this project does not do

It does not store election data. The data files are large (7+ GB for our current corpus) and are published by their respective sources under their own terms. This project tells you where to get the data, documents every source’s schema and quirks, and provides the tools to process it. You download the data yourself.
It is not a real-time election night tracker. We ingest official and certified results, not live feeds. The pipeline is designed for post-election analysis, not real-time reporting.
It is not a prediction model. We report what happened, not what will happen.
It does not claim perfect accuracy. Entity resolution is probabilistic. Office classification has a 0.56% “other” rate. Some records have data quality issues we haven’t caught yet. We document every known limitation, and every match decision is auditable.

What you can answer today

With the data currently available (MEDSL 2018/2020/2022 for all 50 states, NC SBE 2006–2024, OpenElections for 6 states, VEST shapefiles for 4 states), you can answer:

Question	Answer from our data
How many sheriffs ran unopposed in 2022?	55% in North Carolina, 77% in Maine, varies by state
What was the closest school board race in America?	Dawson County, GA — exact tie at 25,186 to 25,186
How many local races were uncontested?	48.8% nationally (keyword-classified subset)
Which office type is least competitive?	Constable/Coroner at 72% uncontested
Which is most competitive?	City Council at 10% uncontested
Who has served longest on a local body in NC?	George Dunlap — Mecklenburg County Commissioner, 6 consecutive cycles (2014–2024)
How many unique elected offices exist in America?	At least 8,387 distinct office names in MEDSL 2022 alone; 4,995 exist in exactly one county
Did the same candidate run across multiple elections?	Yes — 702 NC candidates appear in 3+ election cycles (2014–2024)

Questions you cannot answer yet:

Question	Why not
What’s the voter turnout for school board races?	Turnout data exists in less than 5% of records
Did this candidate switch parties?	Requires entity resolution across elections, which is functional but not yet validated at scale
What are the RCV round-by-round results?	Schema doesn’t support ranked-choice voting yet
How do local election results correlate with demographics?	Census demographic join is ready (100% FIPS coverage) but not yet implemented
What happened in odd-year elections (2015, 2017, 2019)?	MEDSL has odd-year data on Harvard Dataverse; we haven’t loaded it yet

The data sources

This project processes data from multiple sources. We do not redistribute their data. Here is what each provides and where to get it:

Source	What it is	Coverage	Where to get it
MEDSL	MIT Election Data + Science Lab precinct returns	All 50 states + DC, 2018/2020/2022	GitHub, Harvard Dataverse
NC SBE	North Carolina State Board of Elections	NC only, 2006–2024 (10 cycles)	NC SBE
OpenElections	Community-curated precinct data	~8 states, varies	GitHub
Clarity/Scytl	Election night reporting XML	~1,000+ jurisdictions	Per-jurisdiction URLs (unstable)
VEST	Precinct results + geographic boundaries	All 50 states (shapefiles)	Harvard Dataverse
Census	FIPS code reference files	National	Census.gov
FEC	Federal candidate master files	National	FEC.gov

Each source has its own chapter documenting the exact schema, download commands, known data quality issues, and how our pipeline handles its quirks.

Who this book is for

If you’re a journalist and you want to answer “what happened in local elections in my area” — start with Questions for Journalists, then go to Getting Started and Recipes. You don’t need to understand the pipeline architecture. You need the data and the queries.

If you’re a researcher and you want a citable, reproducible, documented dataset for studying local election competitiveness, candidate career paths, or democratic participation — start with Questions for Researchers, then read Reproducibility Guide and How to Cite This Data. The dataset is versioned with DOIs. Every entity resolution decision is logged and auditable.

If you’re a government staffer and you need to know what elected offices exist in your jurisdiction, how your state compares to others, or how to benchmark election administration — start with Questions for Government Staffers and Office Inventory Recipe.

If you’re a developer and you want to contribute to the pipeline, add a new data source, or understand the Rust implementation — start with Design Principles, then read The Five-Layer Pipeline and Type System Design. The mdbook is the spec. The Rust types are the implementation.

If you’re evaluating this architecture for your own data pipeline project — the Architecture section describes a pattern (immutable layers, deterministic-first processing, embeddings for retrieval, LLMs for confirmation) that generalizes beyond election data. The Hard Problems section documents real entity resolution challenges with real data and real solutions.

How to read this book

The book is organized in the order you’d have questions:

Part I: The Problem — Why local election data is a mess, and what questions we’re trying to answer.
Part II: Data Sources — Where the data comes from, exactly what’s in it, and how to download it yourself.
Part III: The Hard Problems — Name normalization, office classification, entity resolution, and cross-source reconciliation. Real examples from real data. This is the heart of the book.
Part IV: Architecture — The five-layer pipeline, the hash chain, the embedding strategy, the LLM integration. How the system is designed and why.
Part V: Unified Schema — The exact record format, field by field. What each field means, where it comes from, and which layer populates it.
Part VI: Rust Implementation — The type system, the traits, the module structure. How the architecture becomes code.
Part VII: Using the Data — Download instructions, pipeline execution, and ten ready-to-use recipes with copy-paste queries.
Part VIII: Trust and Reproducibility — How to verify the data, how to cite it, how to report errors, and what the known limitations are.

You don’t have to read it in order. Every chapter is self-contained with cross-references to related sections. But if you read Part I and Part III, you’ll understand why this project exists and what makes it hard. Everything else follows from that.

This project is open source under MIT/Apache-2.0. The data it processes is published by its respective sources under their own licenses (generally CC-BY or public domain). We do not store or redistribute election data.

Why This Is Hard

US local election results are published by approximately 3,143 county-level election offices, 50 state election boards, and an unknown number of municipal clerks. There is no common schema, no shared identifier system, and no central repository. This chapter describes the five structural problems that make unification difficult.

Fragmented administration

US elections are administered at the county level. Each county decides independently how to collect, tabulate, and publish results. Some publish precinct-level CSV files. Others post scanned PDFs. Some use Clarity/Scytl election night reporting platforms that expose structured XML. Others put results on pages that require JavaScript rendering.

There is no federal mandate requiring any particular publication format. The result is 3,143 independent data silos with different schemas, different update schedules, different URL structures, and different retention policies.

When we downloaded precinct-level data for all 50 states from the MIT Election Data + Science Lab (MEDSL), we received 51 separate files containing 12.3 million rows. The files use different column encodings, different candidate name conventions, and different definitions of what constitutes a “local” race. Seven states — California, Iowa, Kansas, New Jersey, Pennsylvania, Tennessee, and Wisconsin — had zero local race records in the MEDSL 2022 dataset.

No standard schema

The same vote record looks different in every source. Here is a single result — Shannon W. Bray in a 2022 North Carolina precinct — as represented by three different sources:

MEDSL (25-column CSV, one row per vote mode):

precinct,office,party_simplified,mode,votes,candidate,...
12-13,US SENATE,LIBERTARIAN,ELECTION DAY,47,SHANNON W BRAY,...

NC SBE (15-column TSV, vote modes as columns):

County	Precinct	Contest Name	Choice	Election Day	One Stop	Absentee by Mail	Provisional	Total Votes
CABARRUS	12-13	US SENATE	Shannon W. Bray	47	38	5	0	90

OpenElections (7-column CSV, totals only):

county,precinct,office,party,candidate,votes
Cabarrus,12-13,U.S. Senate,LIB,Shannon W. Bray,90

Three sources. Three schemas. Two distinct renderings of the candidate name (SHANNON W BRAY in MEDSL; Shannon W. Bray in NC SBE and OpenElections). Three levels of granularity for vote mode data. Two distinct office name formats (US SENATE in MEDSL and NC SBE; U.S. Senate in OpenElections). This is a federal race where all three sources agree on the totals. For local races, the divergence is worse.
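As a rough illustration of what unification involves, the sketch below reads the NC SBE and OpenElections rows shown above into one common shape. The target field names (`office_raw`, `candidate_raw`, `votes_by_mode`, and so on) are assumptions for this example and are not the project's unified schema.

```python
import csv
import io

NCSBE_TSV = (
    "County\tPrecinct\tContest Name\tChoice\tElection Day\tOne Stop\t"
    "Absentee by Mail\tProvisional\tTotal Votes\n"
    "CABARRUS\t12-13\tUS SENATE\tShannon W. Bray\t47\t38\t5\t0\t90\n"
)
OPENELECTIONS_CSV = (
    "county,precinct,office,party,candidate,votes\n"
    "Cabarrus,12-13,U.S. Senate,LIB,Shannon W. Bray,90\n"
)
MODE_COLUMNS = ["Election Day", "One Stop", "Absentee by Mail", "Provisional"]


def from_ncsbe(row: dict) -> dict:
    """NC SBE reports vote modes as columns, so keep them as a mapping."""
    return {
        "source": "ncsbe",
        "county": row["County"].upper(),
        "precinct": row["Precinct"],
        "office_raw": row["Contest Name"],
        "candidate_raw": row["Choice"],
        "votes_total": int(row["Total Votes"]),
        "votes_by_mode": {m: int(row[m]) for m in MODE_COLUMNS},
    }


def from_openelections(row: dict) -> dict:
    """OpenElections reports totals only, so the mode breakdown is unknown."""
    return {
        "source": "openelections",
        "county": row["county"].upper(),
        "precinct": row["precinct"],
        "office_raw": row["office"],
        "candidate_raw": row["candidate"],
        "votes_total": int(row["votes"]),
        "votes_by_mode": None,
    }


a = from_ncsbe(next(csv.DictReader(io.StringIO(NCSBE_TSV), delimiter="\t")))
b = from_openelections(next(csv.DictReader(io.StringIO(OPENELECTIONS_CSV))))
print(a["votes_total"], b["votes_total"], sum(a["votes_by_mode"].values()))
```

Keeping the raw office and candidate strings alongside the parsed values leaves later layers free to normalize them without losing what the source actually said.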

Name formatting differs across every source

We compared MEDSL and NC SBE data for 640 contests in the 2022 North Carolina general election where both sources reported the same vote totals. In 401 of those contests (63%), candidate names are formatted differently between the two sources.

The differences are systematic:

Pattern	MEDSL	NC SBE
Case	`SHANNON W BRAY`	`Shannon W. Bray`
Middle initial punctuation	`VICTORIA P PORTER`	`Victoria P. Porter`
Nickname quoting	`MICHAEL "STEVE" HUBER`	`Michael (Steve) Huber`
Suffix formatting	`ROBERT VAN FLETCHER JR`	`Robert Van Fletcher, Jr.`
Initials and nickname	`LM "MICKEY" SIMMONS`	`L.M. (Mickey) Simmons`
Write-in label	`WRITEIN`	`Write-In (Miscellaneous)`

Each source applies a consistent internal convention. MEDSL uses ALL CAPS with no periods or commas and puts nicknames in double quotes. NC SBE uses Title Case with periods and commas and puts nicknames in parentheses. Across sources, the conventions diverge.
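The six patterns in the table can be reconciled with a handful of rules. The function below is a sketch that maps NC SBE-style names onto the MEDSL convention to produce a comparison key; `normalize_name` and its write-in rule are assumptions for illustration, and the original strings would still need to be preserved alongside the key.

```python
import re


def normalize_name(raw: str) -> str:
    """Build a comparison key in the MEDSL style (caps, quoted nicknames)."""
    name = raw.upper()
    # Any label whose letters start with WRITEIN collapses to one token.
    if re.sub(r"[^A-Z]", "", name).startswith("WRITEIN"):
        return "WRITEIN"
    # (STEVE) becomes "STEVE" so both nickname styles agree.
    name = re.sub(r"\(([^)]*)\)", r'"\1"', name)
    # Drop periods and commas: W. -> W, L.M. -> LM, ", JR." -> " JR".
    name = re.sub(r"[.,]", "", name)
    # Collapse any whitespace left behind.
    return " ".join(name.split())


pairs = [
    ("SHANNON W BRAY", "Shannon W. Bray"),
    ("VICTORIA P PORTER", "Victoria P. Porter"),
    ('MICHAEL "STEVE" HUBER', "Michael (Steve) Huber"),
    ("ROBERT VAN FLETCHER JR", "Robert Van Fletcher, Jr."),
    ('LM "MICKEY" SIMMONS', "L.M. (Mickey) Simmons"),
    ("WRITEIN", "Write-In (Miscellaneous)"),
]
for medsl, ncsbe in pairs:
    print(normalize_name(medsl) == normalize_name(ncsbe), medsl)
```

Rules like these only address formatting. They do nothing for a pair such as Charlie and Charles, where the strings genuinely differ.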

The formatting problem is solvable with normalization rules. The deeper problem is name identity. We tested real candidate pairs with OpenAI’s text-embedding-3-large model (3,072 dimensions):

Name A	Name B	Cosine similarity	Same person?
`Charlie Crist`	`CRIST, CHARLES JOSEPH`	0.451	Yes
`Robert Williams`	`Robert Williams Jr.`	0.862	No
`Nikki Fried`	`Nicole Fried`	0.642	Yes
`Ron DeSantis`	`DESANTIS, RON`	0.729	Yes

“Charlie Crist” and “CRIST, CHARLES JOSEPH” score 0.451, below any reasonable match threshold: the embedding model does not treat Charlie and Charles as near-equivalents. These are the same person (same state, same office, identical vote count of 3,101,652). Only a language model with knowledge that Charlie is a common nickname for Charles can make the connection.

“Robert Williams” and “Robert Williams Jr.” score 0.862 — above most auto-accept thresholds used in the literature. These are different people. The “Jr.” suffix indicates a generational distinction. A system that auto-accepts at 0.82 would merge a father and son into one entity.
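One way to guard against the father-and-son merge is to check generational suffixes before trusting the similarity score. The sketch below shows cosine similarity itself and a routing function with such a guard. The suffix list, the `route` function, and the use of 0.82 as the accept threshold are assumptions drawn from the example above, not the pipeline's actual rules.

```python
import math

SUFFIXES = {"JR", "SR", "II", "III", "IV"}


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def suffix_of(name: str):
    """Return a trailing generational suffix such as JR, or None."""
    tokens = name.upper().replace(".", "").replace(",", " ").split()
    return tokens[-1] if tokens and tokens[-1] in SUFFIXES else None


def route(name_a: str, name_b: str, similarity: float, accept_at: float = 0.82) -> str:
    """Decide whether a pair can be auto-accepted or needs further review."""
    # A suffix on one side only is a strong signal of two different people.
    if suffix_of(name_a) != suffix_of(name_b):
        return "needs_review"
    return "auto_accept" if similarity >= accept_at else "needs_review"


# Scores from the table above.
print(route("Robert Williams", "Robert Williams Jr.", 0.862))
print(route("Charlie Crist", "CRIST, CHARLES JOSEPH", 0.451))
```

With the guard in place, the 0.862 pair is sent to review even though it clears the threshold, and the 0.451 pair is sent to review because it does not.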

Institutional variation by state

Across all 50 states in the 2022 MEDSL data, we found 8,387 unique local office names. Our keyword classifier handles 62% of them. The remaining 38% includes offices where the same title has different institutional meanings depending on the state.

“County Judge” in Texas is the presiding officer of the Commissioners Court: the chief executive of the county, analogous to a county manager. In most other states, a county judge presides over a courtroom. Texas has 254 counties; each has a County Judge who is an executive, not a judicial officer.

“Sheriff” in Connecticut is a court officer who serves civil process. In the other 49 states, the sheriff runs the county jail and patrols unincorporated areas.

“Board of Education” is an elected body in some states and an appointed body in others. Where it is appointed, it does not appear in election data — its absence from a source does not mean the county lacks a school board.

A static lookup table mapping office names to categories does not work. The classification must account for state-level context, which is why the pipeline uses a four-tier classifier: keyword matching for unambiguous names, regex patterns for structured names, embedding similarity against a reference set, and a language model for genuinely ambiguous cases.
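The tiering can be pictured as a cascade that stops at the first tier able to answer. The sketch below is illustrative only: the category labels, the keyword and pattern tables, and the `classify` signature are hypothetical, and the embedding and LLM tiers are represented by optional callables rather than real model calls.

```python
import re

# Titles whose meaning depends on the state are checked before generic keywords.
STATE_OVERRIDES = {("TX", "COUNTY JUDGE"): "county_executive"}
KEYWORDS = {
    "SHERIFF": "law_enforcement",
    "BOARD OF EDUCATION": "school_board",
    "COUNTY JUDGE": "judicial",
}
PATTERNS = [
    (re.compile(r"\bCITY COUNCIL\b.*\b(WARD|DISTRICT)\s+\d+"), "city_council"),
]


def classify(office: str, state: str, embed_tier=None, llm_tier=None):
    """Return (category, tier_name); fall through the tiers in order."""
    title = " ".join(office.upper().split())
    # Tier 1: keyword match, with state-specific overrides first.
    for (st, key), category in STATE_OVERRIDES.items():
        if st == state and key in title:
            return category, "keyword"
    for key, category in KEYWORDS.items():
        if key in title:
            return category, "keyword"
    # Tier 2: regex patterns for structured names.
    for pattern, category in PATTERNS:
        if pattern.search(title):
            return category, "regex"
    # Tiers 3 and 4: stand-ins for embedding lookup and LLM reasoning.
    for tier_name, tier in (("embedding", embed_tier), ("llm", llm_tier)):
        if tier is not None:
            category = tier(title, state)
            if category is not None:
                return category, tier_name
    return "unclassified", None


print(classify("DALLAS COUNTY JUDGE", "TX"))
print(classify("COUNTY JUDGE", "FL"))
print(classify("CITY COUNCIL WARD 3", "NC"))
```

Passing the state into every tier is the point of the design: the same title string can resolve to different categories without any change to the cheaper, deterministic tiers.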

No persistent candidate identifiers

Timothy Lance won a seat on the Columbus County Schools Board of Education in 2022. No existing dataset can answer whether he ran before, whether he won, or whether the “T. Lance” who appeared on a 2018 ballot is the same person.

MEDSL, NC SBE, and OpenElections each treat every election as an independent snapshot. There is no identifier linking Timothy Lance (2022) to Timothy Lance (2020) to Tim Lance (2018). The candidate name can change between elections — a middle initial added, a nickname used, a suffix dropped after a parent’s death. The office can change if the candidate runs for a different seat. The county can change if the candidate relocates.

In 10 years of NC SBE data (2014–2024), we found 702 candidates appearing in three or more election cycles using exact name matching within the same county. George Dunlap appeared on the Mecklenburg County ballot in six consecutive cycles. Paul Beaumont in Currituck County ran for the Board of Commissioners, then the Board of Education, then back to Commissioners.
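The exact-matching count described above amounts to grouping by county and name and counting distinct years. A minimal sketch, using placeholder names and counties rather than real results (the record fields and the `repeat_candidates` helper are assumptions):

```python
from collections import defaultdict

# Toy records with placeholder names; these are not real election results.
records = [
    {"year": 2018, "county": "EXAMPLE", "candidate": "PAT DOE"},
    {"year": 2020, "county": "EXAMPLE", "candidate": "PAT DOE"},
    {"year": 2022, "county": "EXAMPLE", "candidate": "PAT DOE"},
    {"year": 2022, "county": "OTHER", "candidate": "PAT DOE"},
    {"year": 2020, "county": "EXAMPLE", "candidate": "SAM ROE"},
    {"year": 2022, "county": "EXAMPLE", "candidate": "SAM ROE"},
]


def repeat_candidates(rows, min_cycles: int = 3) -> dict:
    """Map (county, name) to sorted years for names seen in enough cycles."""
    cycles = defaultdict(set)
    for row in rows:
        cycles[(row["county"], row["candidate"])].add(row["year"])
    return {
        key: sorted(years)
        for key, years in cycles.items()
        if len(years) >= min_cycles
    }


print(repeat_candidates(records))
```

Exact matching is a lower bound. It counts the same name in a different county as a different person, and it misses anyone whose name is written differently from one cycle to the next.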

Connecting these records — determining that entries in different elections, from different sources, with different formatting, refer to the same person — requires preserving every name component through the cleaning pipeline, embedding candidates for vector retrieval, and confirming ambiguous matches with a language model that reasons about context (office, county, party, vote totals). This process is called entity resolution, and it is detailed in its own chapter.

What this adds up to

In 2022, across the MEDSL data for all 50 states, 48.8% of classified local races had only one candidate. In Minnesota, the uncontested rate was 89.3%. Nineteen local races ended in exact ties. Forty-three were decided by a single vote. These are basic facts about American democracy that require combining data from multiple sources, resolving thousands of name variations, classifying thousands of office types, and linking candidates across elections.

That is what this project does. The rest of this book describes how.

What Questions Should Be Answerable?

The purpose of this project is to make US local election data queryable. Across 42 million rows, 50 states, and 8,387 distinct office names, basic questions remain difficult to answer. This chapter frames those questions by audience.

Four audiences, different needs

Journalists need specific, verifiable facts — closest races, unopposed incumbents, anomalies worth investigating. See For Journalists.
Researchers need structured, reproducible datasets — uncontested rates by office type, candidate career paths, cross-state comparisons. See For Researchers.
Government staffers need operational inventories — what offices exist in a jurisdiction, how many races appear on a ballot, how local structures compare to peer counties. See For Government Staffers.
Civic tech developers need reliable data interchange — OCD-ID mappings, entity-resolved candidate records, JSONL exports for downstream applications. See For Civic Tech Developers.

What the data already tells us

Even partial analysis of available sources reveals findings that are difficult to obtain elsewhere:

48.8% of local races in available data are uncontested — one candidate, no opponent.
19 exact ties have been identified across the dataset (same vote total, different candidates).
8,387 unique office name strings exist before normalization, many referring to the same underlying office.

These numbers are not estimates. They come from deterministic queries against cleaned, source-attributed JSONL records. The methodology for each finding is documented in the recipe chapters.

Why these questions matter

No single existing source answers all of these questions. The existing landscape chapter surveys what is available today and where each source falls short. This project exists to fill the gaps — not by replacing those sources, but by unifying them through a documented, reproducible pipeline.

For Journalists

Local election data is where accountability stories live — and where data is hardest to find. These are the questions journalists ask, with real answers drawn from the dataset.

Closest races

Who won the closest race in America? Dawson County, GA had an exact tie, 25,186 votes to 25,186, so the outcome was decided by recount procedures rather than by any margin of votes.
How many exact ties exist? 19 exact ties have been identified across available data. Each is flagged with the specific contest, county, and vote totals. See Closest Races in America.
Which school board races were decided by single digits? Madison County, IN had a school board race decided by 1 vote. These contests are queryable by margin across all office types.

Unopposed races

How many sheriffs ran unopposed? In North Carolina, 55% of sheriff races were uncontested. In Maine, 77%. National figures depend on source coverage — seven states lack local data entirely.
What’s the overall uncontested rate? 48.8% of local races in available data have a single candidate. This figure spans all office types and all states with coverage.
Which offices are most likely to be uncontested? Constable races are uncontested 72% of the time. City council races: 10%. The rate varies by office type and state. See Uncontested Race Rate by State.

Accountability angles

Who keeps winning without opposition? Candidate entity resolution across election cycles identifies incumbents who have never faced an opponent. See Career Tracking Across Elections.
Which counties have the most uncontested offices? County-level aggregation is possible wherever FIPS codes are present in the source data. See Sheriff Accountability.
Are there races where write-in candidates are the only opposition? Write-in totals are preserved where the source reports them. In some jurisdictions, write-in votes account for the only opposition in over a third of contests.

Verification

Can I verify a specific result? Every record traces back to a named source (e.g., NC SBE certified results, MEDSL). The pipeline preserves source file hashes and original field values. See Verify a Specific Result.
How do I cite this data? You cite the original source, not this project. The project provides the source name, retrieval date, and confidence level for each record. See Confidence Levels.

What you cannot get here (yet)

Turnout data is present in fewer than 5% of records.
Seven states (CA, IA, KS, NJ, PA, TN, WI) have zero local coverage in MEDSL 2022.
Odd-year elections (2015, 2017, 2019, 2021) are underrepresented.

These gaps are documented in Known Limitations. If you are reporting on a state with limited coverage, check the Coverage Matrix first.

For Researchers

Local election data presents structural challenges for quantitative research: inconsistent office names, no universal candidate identifiers, and source-dependent coverage gaps. These are the questions researchers ask, with real answers and methodology notes.

Competitiveness and contestation

What’s the uncontested rate by office type? Constable: 72%. Soil and water conservation district: 58%. County commissioner: 34%. City council: 10%. These rates are computed from L4 canonical records where candidate_count = 1 for a given contest.
How does competitiveness vary across states? Minnesota reports 89.3% of local races as uncontested. Florida reports 0% in available MEDSL local data, which is a coverage artifact, not a political finding. Interpret cross-state comparisons with the Coverage Matrix.
What is the national uncontested rate? 48.8% across all available local races. This figure is coverage-weighted: states with more reported contests contribute proportionally more. It is not a population-weighted estimate.
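The rates above reduce to a simple calculation: count the distinct candidates per contest, then take the share of contests with exactly one. The sketch below works on a toy JSONL string; the field names `contest_id`, `office_type`, and `candidate` are assumptions for this example, not the documented L4 schema.

```python
import json
from collections import defaultdict

# Toy JSONL with one record per contest-candidate pair (hypothetical fields).
JSONL = """\
{"contest_id": "c1", "office_type": "constable", "candidate": "A"}
{"contest_id": "c2", "office_type": "constable", "candidate": "B"}
{"contest_id": "c2", "office_type": "constable", "candidate": "C"}
{"contest_id": "c3", "office_type": "city_council", "candidate": "D"}
{"contest_id": "c3", "office_type": "city_council", "candidate": "E"}
"""


def uncontested_rate_by_office(jsonl_text: str) -> dict:
    """Share of contests with exactly one candidate, per office type."""
    candidates = defaultdict(set)
    office_of = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        candidates[rec["contest_id"]].add(rec["candidate"])
        office_of[rec["contest_id"]] = rec["office_type"]
    totals = defaultdict(int)
    uncontested = defaultdict(int)
    for contest_id, names in candidates.items():
        office = office_of[contest_id]
        totals[office] += 1
        if len(names) == 1:
            uncontested[office] += 1
    return {office: uncontested[office] / totals[office] for office in totals}


print(uncontested_rate_by_office(JSONL))
```

Any real computation would first need to exclude non-candidate rows and decide how write-in lines are treated, since either can turn a one-candidate contest into an apparent two-candidate one.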

Candidate career tracking

Can I track candidates across election cycles? Entity resolution at L3 links candidate records across years and sources. George Dunlap (Mecklenburg County, NC) appears in 6 election cycles under consistent entity IDs. See Career Tracking Across Elections.
What identifier links candidates across sources? The L4 canonical_candidate_id is a deterministic hash of resolved name components, jurisdiction, and office. It is stable across pipeline runs given the same L3 decisions.
How reliable is cross-cycle linking? Exact name matches are deterministic. Fuzzy matches (Jaro-Winkler ≥ 0.92) and embedding matches (cosine ≥ 0.88) are logged with scores. LLM-assisted matches include the decision ID. All match metadata is queryable.
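A deterministic identifier of this kind can be sketched by hashing a canonical serialization of the resolved fields. The field names, the 16-character truncation, and the use of SHA-256 over sorted-key JSON are all assumptions here; the text above states only that the ID is a deterministic hash of name components, jurisdiction, and office.

```python
import hashlib
import json


def canonical_candidate_id(first, middle, last, suffix, jurisdiction, office) -> str:
    """Hash resolved components into a short, stable identifier."""
    key = {
        "first": first,
        "middle": middle,
        "last": last,
        "suffix": suffix,
        "jurisdiction": jurisdiction,
        "office": office,
    }
    # Sorted keys and fixed separators make the serialization reproducible.
    payload = json.dumps(key, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Illustrative inputs; 37119 is the FIPS code for Mecklenburg County, NC.
cid = canonical_candidate_id("GEORGE", None, "DUNLAP", None, "37119", "county_commissioner")
same = canonical_candidate_id("GEORGE", None, "DUNLAP", None, "37119", "county_commissioner")
junior = canonical_candidate_id("GEORGE", None, "DUNLAP", "JR", "37119", "county_commissioner")
print(cid == same, cid == junior)
```

The identifier is only as stable as its inputs, which is why it depends on the L3 decisions being the same from run to run.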

Cross-source validation

How consistent are sources that cover the same contests? In 640 overlapping contests between MEDSL and NC SBE, 90.5% have identical vote totals. The remaining 9.5% differ by small amounts, typically due to provisional ballot timing or reporting cutoff dates.
How do candidate names differ across sources? In the same 640 overlapping contests, 63% have name formatting differences (e.g., “SMITH, JOHN A” vs. “John A. Smith”). These are resolved at L1 (parsing) and confirmed at L3 (entity resolution).
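A comparison like this can be sketched as a dictionary join on a shared contest key. The key shape, the normalized candidate keys, and the toy vote counts below are assumptions for illustration; they are not drawn from the 640 contests.

```python
def compare_totals(source_a: dict, source_b: dict):
    """Split contests covered by both sources into identical and differing."""
    shared = source_a.keys() & source_b.keys()
    identical = sorted(k for k in shared if source_a[k] == source_b[k])
    differing = sorted(k for k in shared if source_a[k] != source_b[k])
    return identical, differing


# Toy data: contest key -> {normalized candidate name: total votes}.
medsl = {
    ("EXAMPLE", "SHERIFF"): {"PAT DOE": 1200, "SAM ROE": 900},
    ("EXAMPLE", "CLERK"): {"LEE POE": 400},
    ("OTHER", "SHERIFF"): {"KIM LOW": 75},
}
ncsbe = {
    ("EXAMPLE", "SHERIFF"): {"PAT DOE": 1200, "SAM ROE": 900},
    ("EXAMPLE", "CLERK"): {"LEE POE": 403},
}

identical, differing = compare_totals(medsl, ncsbe)
print(len(identical) / (len(identical) + len(differing)))
```

The comparison only works after candidate names have been normalized to a common key, which is why the formatting differences have to be resolved first.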

Office taxonomy

How many distinct offices exist? 8,387 unique office name strings before normalization. After the four-tier classification (keyword, regex, embedding, LLM), these resolve to a smaller set of canonical office types. See Office Classification Reference.
What office types exist at the sub-county level? Constable, justice of the peace, soil and water conservation district supervisor, school board trustee, municipal utility district director, and hundreds of jurisdiction-specific titles.

Reproducibility

All findings above are reproducible from the pipeline output:

L0 → L2 layers are fully deterministic. Given the same source files, pipeline version, and embedding model, output is byte-identical.
L3 decisions are logged in a decision log (JSONL). Replaying the log against the same L2 input reproduces L3 exactly, even when LLM calls were involved.
L4 is deterministic given L3 output.
Versioned JSONL files at every layer serve as the unit of reproducibility. Each file includes a manifest with source hashes, pipeline version, and timestamp.

To reproduce a specific finding, check out the tagged pipeline version, supply the same L0 inputs, and run the pipeline. The decision log ensures that even probabilistic steps (embedding similarity, LLM confirmation) produce identical output on replay.
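Replay can be pictured as a lookup that consults the log before calling any model. In the sketch below, the log format, the `pair_id` key, and the `resolve` helper are hypothetical, and `fake_model` stands in for a real LLM call.

```python
import json


def load_decisions(jsonl_text: str) -> dict:
    """Index logged decisions by pair_id."""
    decisions = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            decision = json.loads(line)
            decisions[decision["pair_id"]] = decision
    return decisions


def resolve(pair_id: str, decisions: dict, ask_model) -> dict:
    """Use the logged decision if present; otherwise ask the model and log it."""
    if pair_id not in decisions:
        decisions[pair_id] = ask_model(pair_id)
    return decisions[pair_id]


calls = []


def fake_model(pair_id: str) -> dict:
    # Stand-in for an LLM confirmation step; records that it was invoked.
    calls.append(pair_id)
    return {"pair_id": pair_id, "same_person": True, "confidence": 0.95}


# First run: the model is consulted and the decision is written to the log.
first_run = {}
resolve("crist-001", first_run, fake_model)
log_text = "\n".join(json.dumps(d, sort_keys=True) for d in first_run.values())

# Replay: the logged decision is reused and the model is not called again.
replayed = load_decisions(log_text)
result = resolve("crist-001", replayed, fake_model)
print(len(calls), result["same_person"])
```

The nondeterministic step runs once, and everything downstream reads its recorded output, which is what lets a replay reproduce L3 without depending on the model returning the same answer twice.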

Data format for analysis

Pipeline output is JSONL — one JSON object per line. This is directly loadable into pandas, R (jsonlite), DuckDB, or any tool that reads newline-delimited JSON. No proprietary formats or database dependencies are required.

For Government Staffers

County clerks, election administrators, and local officials need operational data — not research datasets. These are the questions government staff ask, with answers drawn from the dataset.

Office inventories

What offices exist in my county? Columbus County, NC has 25 distinct elected offices across county government, municipalities, school boards, and special districts. The pipeline produces per-county office inventories from L4 canonical records. See Office Inventory for a County.
How many races will be on the next ballot? Historical office inventories establish the set of offices that typically appear in a given election cycle. Odd-year vs. even-year patterns, staggered terms, and special elections are identifiable where source data includes election dates and term lengths.
Which offices are partisan vs. nonpartisan? Party affiliation is recorded where the source provides it. In North Carolina, all county commissioner races are partisan; all school board races are nonpartisan. Coverage varies by state.

Comparisons

How does our uncontested rate compare to peer counties? County-level uncontested rates are computable for any jurisdiction with coverage. A county clerk can compare their 60% uncontested rate against the state median or against demographically similar counties. See Uncontested Race Rate by State.
Are other counties consolidating offices we still elect separately? Office inventories across counties within a state reveal structural differences — some counties elect a coroner, others appoint one. The data does not explain why, but it shows where differences exist.
How many candidates typically file for each office? Candidate counts per contest are derivable from L4 records. A county with historically 1.2 candidates per school board seat has a different recruitment problem than one averaging 3.4.

Administrative planning

What does our ballot complexity look like over time? The number of contests per jurisdiction per cycle is queryable. Ballot length affects printing costs, voter fatigue research, and polling place logistics.
Which districts overlap our jurisdiction? Where OCD-IDs are present, hierarchical district relationships can be inferred. A county contains municipalities, school districts, and special districts — the data reflects which contests appear in which jurisdictions.

Data format

All outputs are JSONL with one record per contest-candidate pair. Government staff who need spreadsheets can convert JSONL to CSV with standard tools. See Querying JSONL Output.
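For example, the conversion can be done with Python's standard library alone. The record fields shown are assumptions for this sketch, and nested fields would need flattening first, which this example does not attempt.

```python
import csv
import io
import json


def jsonl_to_csv(jsonl_text: str, fieldnames: list) -> str:
    """Convert newline-delimited JSON to CSV, keeping only the named columns."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=fieldnames,
        extrasaction="ignore",  # skip keys that are not in fieldnames
        lineterminator="\n",
    )
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if line.strip():
            writer.writerow(json.loads(line))
    return out.getvalue()


# One toy record with hypothetical field names.
sample = '{"contest": "US SENATE", "candidate": "Shannon W. Bray", "votes": 90, "source": "ncsbe"}\n'
print(jsonl_to_csv(sample, ["contest", "candidate", "votes"]))
```

To write a file instead of a string, the same writer can be pointed at a file opened with `newline=""`.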

Caveats

Office inventories are only as complete as the source data. If a state does not report local results to MEDSL or another covered source, those offices will not appear.
The pipeline documents sources and provides tools — it does not store or redistribute official election results. See The Project Does Not Store Data.
Seven states have zero local coverage in MEDSL 2022. Check the Coverage Matrix before relying on completeness for a specific jurisdiction.

For Civic Tech Developers

Civic technology projects depend on structured, reliable election data. Most fail not because of engineering limitations but because the underlying data is fragmented, inconsistently formatted, and difficult to resolve across sources. These are the questions developers ask when building on local election data.

Ballot lookup tools

Can I build a “what’s on my ballot” tool? Yes, but it requires mapping voter addresses to jurisdictions (via OCD-IDs or FIPS codes) and then mapping jurisdictions to offices. The dataset contains 8,387 unique office name strings — many of which refer to the same office across sources. The L4 canonical layer resolves these to deduplicated office records with jurisdiction identifiers.
How do I map an address to its contests? You need an OCD-ID → office mapping. OCD-IDs (Open Civic Data Identifiers) are present where source data includes them or where FIPS codes allow deterministic derivation. Coverage is not universal. See the Schema Overview for the jurisdiction.ocd_id field.
What format is the data in? Every pipeline layer outputs JSONL (newline-delimited JSON). One record per line, one file per source-year-state. No database required — parse with jq, Python, DuckDB, or any JSON-capable tool.

Candidate lookup and entity resolution

Can I build a candidate lookup API? The L4 layer provides entity-resolved candidate records with canonical names, office history, and source attribution. A candidate who appears as “Bill Smith” in one source and “William R. Smith Jr.” in another is resolved to a single entity with both name variants preserved.
How reliable is entity resolution? It depends on the match method. Exact matches and high-confidence Jaro-Winkler matches (≥0.92) are deterministic. Embedding-based and LLM-confirmed matches carry a decision ID that traces back to the specific match rationale. See The Cascade.
Can I track candidates across election cycles? Yes. Entity resolution operates across years. George Dunlap in Mecklenburg County, NC appears across 6 election cycles with consistent entity IDs. See Career Tracking.

Election history and widgets

Can I build an election history widget for a jurisdiction? The data supports historical queries by jurisdiction, office, and candidate. Time series depend on source coverage: MEDSL covers 2018–2022 for most states; NC SBE covers 2006–2024 for North Carolina.
What about ballot measures? Ballot measures are a distinct contest kind (BallotMeasure) in the schema. Choices are normalized to for/against/yes/no at L1.

Data interchange

Why JSONL and not a REST API? JSONL is the data interchange format at every layer. It is self-describing, streamable, and requires no server infrastructure. Downstream applications can ingest it directly or load it into any datastore.
Can I join this data with other civic datasets? Yes. Records include FIPS codes, OCD-IDs (where available), and state abbreviations. These are standard join keys for Census data, geographic boundaries, and other civic datasets.
Is the schema stable? The schema is versioned. Each JSONL record includes a schema_version field. Breaking changes increment the major version. See Schema Overview.

What to watch out for

The project does not host a live API or data download. It documents sources and provides pipeline tools to process them. You run the pipeline yourself.
Coverage gaps exist. Seven states lack local data in MEDSL 2022. Odd-year elections are underrepresented. Check the Coverage Matrix before building features that assume national coverage.
Entity resolution is probabilistic for non-exact matches. If your application requires certainty, filter to records with match_method: "exact" or match_method: "jaro_winkler".
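A filter along these lines might look like the following sketch. Only the `match_method` field and the values `exact` and `jaro_winkler` come from the text above; the assumption that the field sits at the top level of each record, the `candidate` field, and the `embedding` value are all illustrative.

```python
import json

DETERMINISTIC_METHODS = {"exact", "jaro_winkler"}


def deterministic_only(jsonl_lines):
    """Yield records whose match_method is one of the deterministic tiers."""
    for line in jsonl_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Assumption: match_method is a top-level field on each record.
        if record.get("match_method") in DETERMINISTIC_METHODS:
            yield record


# Hypothetical records for illustration.
lines = [
    '{"candidate": "A", "match_method": "exact"}',
    '{"candidate": "B", "match_method": "embedding"}',
    '{"candidate": "C", "match_method": "jaro_winkler"}',
]
kept = [r["candidate"] for r in deterministic_only(lines)]
print(kept)
```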

What Exists Today and Where It Falls Short

Several organizations publish US election data. Each serves a different purpose, covers a different scope, and has different limitations. This chapter surveys the major sources and identifies the gaps that motivate this project.

MEDSL — MIT Election Data + Science Lab

MEDSL provides the most comprehensive freely available collection of US election returns. Their datasets cover federal, state, and many local races across multiple election cycles. Data is published as flat CSV files with consistent column schemas.

Strengths. Wide state coverage for federal and state races. Consistent schema across years. Academic quality control. Openly licensed. Includes candidate-level vote totals with party affiliation.

Weaknesses. Seven states have zero local election coverage in the 2022 dataset: CA, IA, KS, NJ, PA, TN, and WI. Office name strings are not normalized — the same office appears under different names across states and years. No entity resolution across cycles (the same candidate is a new row each time). Turnout metadata is sparse. Release cadence lags elections by 12–18 months.

ALGED — American Local Government Elections Database

ALGED focuses specifically on local elections in US cities, filling a gap that most other sources ignore. It covers mayoral, city council, and some school board races.

Strengths. Dedicated local focus. Includes candidate demographics and incumbency status where available. Covers elections that no other academic dataset tracks.

Weaknesses. Limited to cities with populations above 50,000. Data collection appears to have stopped around 2021. Does not cover counties, townships, or special districts. Not currently integrated into this pipeline (planned for future work).

OpenElections

OpenElections is a community-curated effort to collect certified election results for all 50 states. Volunteers parse state-level result files into a common CSV format and publish them on GitHub.

Strengths. State-level certified results for many states. Community-driven, so coverage expands over time. Raw source files are preserved alongside parsed output. Free and open.

Weaknesses. Coverage varies dramatically by state — some states have complete precinct-level data back to 2000, others have nothing below the county level. Schema consistency depends on the volunteer. Local races are included when the state publishes them, but there is no systematic local collection effort. Quality varies; some state files have known parsing errors that persist across releases.

Ballotpedia

Ballotpedia maintains a wiki-style encyclopedia of US elections covering federal, state, and many local offices. Their coverage of school boards, judicial elections, and ballot measures is broader than most sources.

Strengths. Broad office-type coverage including judicial, school board, and special district races. Candidate biographical information. Historical coverage for some offices. Structured data behind the wiki pages.

Weaknesses. Bulk data access requires a commercial API license. No freely available flat-file download. Data is editorial (curated by staff, not derived from certified results). Not suitable as a primary source for vote totals, though useful for office inventories and candidate metadata.

Associated Press (AP)

The AP provides real-time and certified election results to media organizations. Their data covers federal, state, and many local races on election night and through the canvassing period.

Strengths. Fast — results are available on election night. Broad geographic coverage. Includes local races in many states. High reliability for the races they cover.

Weaknesses. Expensive commercial license. Not available for academic or civic tech use without a contract. Historical data is not publicly archived. Coverage decisions are editorial — not all local races are included.

Other sources

State election board websites (e.g., NC SBE) publish certified results, but formats vary by state — PDF, Excel, CSV, HTML, or proprietary portals. No two states use the same schema.
Clarity/Scytl election night reporting portals are used by many counties. Data is structured but ephemeral — pages are taken down or overwritten after certification.
VEST (Voting and Election Science Team) provides precinct-level shapefiles matched to election returns, primarily for redistricting research. Coverage is strong for federal races but limited at the local level.
FEC publishes federal candidate filings and financial data. No state or local coverage.
Census Bureau provides FIPS codes and geographic hierarchies, which are essential for joining across sources but contain no election results.

Summary

Source	Local coverage	Schema consistency	Freely available	Current	Entity resolution
MEDSL	43 of 50 states (2022)	High	Yes	Yes (with lag)	No
ALGED	Cities >50K only	Medium	Yes	No (~2021)	No
OpenElections	Varies by state	Low	Yes	Yes	No
Ballotpedia	Broad	Medium	API only	Yes	Partial
AP	Broad	High	No (commercial)	Yes	No
State portals	Varies	None (50 formats)	Usually	Yes	No

No single source covers all local races, uses a consistent schema, resolves candidates across elections, and is freely available. That gap — between what exists and what the four audiences need — is what this project addresses.

Source Overview

This project ingests election data from seven sources. None are complete on their own. Each fills a different gap — geographic breadth, temporal depth, local race coverage, geographic boundaries, or reference identifiers. The pipeline merges them into a unified schema; this chapter documents what each provides and where they overlap.

Source Summary

Source	What It Provides	Coverage	Format	Access Method
MEDSL	Precinct-level returns for federal, state, and some local races	50 states + DC; 2018, 2020, 2022 (~36.5M rows)	CSV/TSV, one file per state per cycle	Harvard Dataverse download, GitHub mirror
NC SBE	Precinct-level returns for every contest on the ballot, with vote mode breakdowns	NC only; 2006–2024 (10 cycles, ~2M rows)	Tab-delimited TXT in ZIP archives	S3 bucket direct download
OpenElections	Community-curated precinct-level CSV files	~8 states with 2022 data (FL, GA, MI, OH, PA, TX, others); coverage varies	CSV, schema varies by state	Git clone per state repo on GitHub
Clarity/Scytl	Election night reporting with precinct-level XML results	~1,000+ jurisdictions nationwide	Structured XML in ZIP files	Per-jurisdiction URLs (unstable across cycles)
VEST	Precinct boundaries (shapefiles) with vote counts as attributes	50 states; odd-year elections for KY/LA/MS/VA (2015, 2019)	Shapefile (.shp/.dbf/.shx/.prj)	Harvard Dataverse download
Census	FIPS reference codes for states (50+DC), counties (3,143), and places (31,980)	National, 2020 vintage	Pipe-delimited text files	census.gov direct download
FEC	Federal candidate master records with stable `CAND_ID` identifiers	All registered federal candidates; 2020 and 2022 loaded	Pipe-delimited TXT (`cn.txt`) in ZIP	fec.gov bulk download

What Each Source Contributes to the Pipeline

MEDSL is the backbone. It covers all 50 states at precinct granularity for three recent even-year cycles. Approximately 41.5% of rows in the 2022 dataset have a blank dataverse column, indicating local races. Seven states have zero local race rows — see Coverage Matrix.

NC SBE provides the deepest single-state coverage: every contest on every ballot in every precinct across 10 election cycles. It provides vote mode breakdowns (Election Day, early, absentee, provisional) for every contest, including local races, whereas MEDSL splits votes by mode only for some states. It serves as the primary validation dataset for cross-source entity resolution.

OpenElections fills state-level gaps where MEDSL coverage is incomplete or where an alternative source view aids cross-validation. Schema varies by state, requiring per-state parser logic.

Clarity has the highest value for hyperlocal races (school board, city council, judicial) because it captures results directly from county ENR systems. Not yet integrated in our pipeline. URL instability is the primary obstacle.

VEST provides the only precinct boundary geometries in the corpus, enabling geographic analysis. It also covers odd-year elections (2015, 2019) for states with off-cycle races: gubernatorial in KY, LA, and MS, and state legislative in VA. MEDSL’s loaded cycles do not include this data.

Census provides the authoritative FIPS code-to-name mappings used at L1 for geographic enrichment and cross-source geographic joins.
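A minimal sketch of building such a lookup from a pipe-delimited reference file follows. The column names in `SAMPLE` are an assumption about the file layout and should be checked against the real header; `load_county_lookup` is a hypothetical helper.

```python
import csv
import io

# Assumed layout of the pipe-delimited county file; verify against the real header.
SAMPLE = """STATE|STATEFP|COUNTYFP|COUNTYNAME
NC|37|025|Cabarrus County
NC|37|119|Mecklenburg County
"""


def load_county_lookup(text: str) -> dict:
    """Map 5-digit county FIPS (state code + county code) to the county name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return {row["STATEFP"] + row["COUNTYFP"]: row["COUNTYNAME"] for row in reader}


lookup = load_county_lookup(SAMPLE)
print(lookup["37025"])
```

Codes are kept as strings throughout. Converting them to integers would drop the leading zero on state codes such as `06` and break joins against other sources.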

FEC provides stable candidate identifiers (CAND_ID) for federal candidates, used at L3 as reference anchors during entity resolution.

Cross-Source Overlap

Two source pairs have been compared quantitatively.

MEDSL + NC SBE (North Carolina, 2022 General)

Both sources report precinct-level results for the same 640 contests in North Carolina’s 2022 general election. Comparison results:

Metric	Value
Contests with exact vote total match	579 (90.5%)
Contests matching within 1%	47 (7.3%)
Contests disagreeing by >1%	14 (2.2%)
Contests with different candidate name formatting	401 (63%)

The 63% name formatting difference rate is the reason entity resolution exists. MEDSL reports SHANNON W BRAY (all caps, no period). NC SBE reports Shannon W. Bray (title case, period after initial). Same person, different string. This overlap is the primary test bed for the matching pipeline — see Cross-Source Reconciliation.
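The three vote-total buckets in the table above can be sketched as a small classifier. How the 1% threshold is measured is not stated here, so this example assumes the difference is taken relative to the larger of the two totals; the function and bucket names are hypothetical.

```python
def classify_agreement(total_a: int, total_b: int) -> str:
    """Bucket two vote totals for the same contest by how closely they agree."""
    if total_a == total_b:
        return "exact"
    # Assumption: relative difference is measured against the larger total.
    # The totals differ here, so the larger one is always non-zero.
    larger = max(total_a, total_b)
    relative_diff = abs(total_a - total_b) / larger
    return "within_1pct" if relative_diff <= 0.01 else "over_1pct"


print(classify_agreement(1000, 1000))
print(classify_agreement(1000, 995))
print(classify_agreement(1000, 900))
```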

MEDSL + OpenElections (Florida, 2022 General)

Florida OpenElections data contains 6,013 “Registered Voters” rows (67.9% of non-candidate records), which are turnout metadata rows mixed into the results file. This overlap revealed the non-candidate row problem documented in Non-Candidate Records.

Source Priority Ranking

When multiple sources report results for the same contest, the pipeline applies a priority order to select the authoritative record:

Priority	Source Type	Rationale	Examples
1	Certified state data	Published by the official election authority; legally authoritative	NC SBE
2	Academic curated	Cleaned and standardized by researchers with documented methodology	MEDSL, VEST
3	Community curated	Volunteer-driven; quality varies by state and contributor	OpenElections
4	Election night reporting	Often preliminary, not certified; URLs are unstable	Clarity
5	Reference only	Not election results; used for enrichment and cross-referencing	Census, FEC

Priority 1 sources are preferred when available. In practice, NC SBE is the only certified state source currently loaded. For the remaining 49 states and DC, MEDSL (priority 2) is the primary source. Lower-priority sources are retained in the record’s provenance for cross-validation, not discarded.

The priority ranking affects two pipeline decisions: which record becomes the canonical version at L4, and which confidence level is assigned. A record confirmed by two independent sources (e.g., MEDSL + NC SBE with matching vote totals) receives High confidence. A record from a single source receives Medium or Low depending on the source tier.
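The two decisions can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the source keys, the record fields, and the rule that tiers 1–2 alone earn Medium while tiers 3–4 alone earn Low are all hypothetical, since the text only says the level depends on the source tier.

```python
# Lower number = higher priority, following the ranking table above.
# The dictionary keys are hypothetical source identifiers.
SOURCE_PRIORITY = {"nc_sbe": 1, "medsl": 2, "vest": 2, "openelections": 3, "clarity": 4}


def pick_canonical(records: list) -> dict:
    """Choose the highest-priority record and attach a confidence level."""
    best = min(records, key=lambda r: SOURCE_PRIORITY[r["source"]])
    sources = {r["source"] for r in records}
    totals = {r["votes"] for r in records}
    if len(sources) >= 2 and len(totals) == 1:
        confidence = "high"    # two independent sources agree on the vote total
    elif SOURCE_PRIORITY[best["source"]] <= 2:
        confidence = "medium"  # assumption: tiers 1-2 without confirmation
    else:
        confidence = "low"     # assumption: tiers 3-4 without confirmation
    # Every contributing source stays in provenance; none is discarded.
    return {**best, "confidence": confidence, "provenance": sorted(sources)}


merged = pick_canonical([
    {"source": "medsl", "votes": 90},
    {"source": "nc_sbe", "votes": 90},
])
print(merged)
```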

Coverage Matrix

This chapter maps which sources cover which states and years. Use it to determine whether a specific state/year/level combination is available before querying.

MEDSL — 50 States, 3 Cycles

MEDSL provides precinct-level results for all 50 states plus DC across three even-year general election cycles. Each cycle is one CSV per state.

Cycle	States	Approximate rows	Local race coverage
2018	50 + DC	~11.0M	Varies by state
2020	50 + DC	~13.2M	Varies by state
2022	50 + DC	~12.3M	44 of 51 jurisdictions

Seven states with zero local data in MEDSL 2022. These states have no rows with a blank dataverse column, meaning no local races were captured:

State	FIPS
California	06
Iowa	19
Kansas	20
New Jersey	34
Pennsylvania	42
Tennessee	47
Wisconsin	55

Local elections occur in all seven states. MEDSL’s curation process did not capture them for 2022. Coverage may differ in 2018 and 2020.

Odd-year data on Dataverse but not yet loaded. MEDSL publishes odd-year election data on Harvard Dataverse:

Cycle	DOI	Status
2015	—	Not loaded
2017	`10.7910/DVN/VNJAB1`	Not loaded
2019	`10.7910/DVN/2AJUII`	Not loaded
2021	—	Not loaded

Odd-year elections cover gubernatorial races in VA, NJ, KY, LA, MS and municipal elections in many states. Loading these would fill a significant gap.

NC SBE — 1 State, 10 Cycles

NC SBE covers North Carolina exclusively, with precinct-level results for every contest on the ballot.

Year	Election	Rows	Schema
2024	General	233,511	15-column
2022	General	171,901	15-column
2020	General	257,722	15-column
2018	General	183,724	15-column
2016	General	252,827	15-column
2014	General	223,977	15-column
2012	General	208,921	14–15 column (different layout)
2010	General	188,008	14–15 column (different layout)
2008	General	233,141	14–15 column (different layout)
2006	General	69,482	9-column (significantly different)

All 10 cycles are downloaded. The 2014–2024 files share a stable schema and a single parser. The 2008–2012 files require a separate parser. The 2006 file requires a third.

OpenElections — ~8 States, Variable Coverage

OpenElections is community-curated. Coverage depends on volunteer effort per state. The following states have 2022 precinct-level general election data:

State	2022 precinct data	Earlier years
Florida	✅	2000–2020
Georgia	✅	2004–2020
Michigan	✅	2000–2020
Ohio	✅	2000–2020
Pennsylvania	✅	2000–2020
Texas	✅	2000–2020
North Carolina	✅	2008–2020
Arizona	Partial	2004–2020

Coverage for other states exists at county level or for federal races only. Check each state’s GitHub repository (openelections-data-{state}) for current status.

VEST — Shapefiles with Vote Counts

VEST publishes precinct-level shapefiles for all 50 states. We have loaded a subset for odd-year coverage:

State	Year	Election type	Loaded
Kentucky	2019	General (Governor)	✅
Louisiana	2019	General (Governor)	✅
Mississippi	2019	General (Governor)	✅
Virginia	2019	General (state legislature)	✅
Kentucky	2015	General (Governor)	✅
Louisiana	2015	General (Governor)	✅
Mississippi	2015	General (Governor)	✅
Virginia	2015	General (state legislature)	✅

VEST covers federal and state-level races only (president, governor, US Senate, US House, state legislature). No local races.

Census and FEC — Reference Data

These are not election results. They provide reference identifiers used during pipeline enrichment.

Source	Scope	Years	Records
Census county FIPS	National	2020	3,143
Census place FIPS	National	2020	31,980
Census state FIPS	National	2020	56
FEC candidate master	Federal candidates	2020	~6,800
FEC candidate master	Federal candidates	2022	~6,600

Clarity/Scytl — Not Yet Integrated

Clarity ENR sites cover 1,000+ jurisdictions but are not yet in the pipeline. URLs are unstable across election cycles, making systematic acquisition difficult. See Clarity/Scytl ENR.

Combined Coverage Summary

Dimension	Current status
States with any data	50 + DC
Even-year general elections	2018, 2020, 2022
Odd-year elections	KY/LA/MS/VA 2015, 2019 (VEST only, state-level)
Deep single-state coverage	NC, 2006–2024 (10 cycles)
Total rows across all sources	~42M
Local race coverage	44 of 51 jurisdictions (MEDSL 2022) + NC (NC SBE)
Vote mode breakdowns	NC SBE (all contests), MEDSL (some states), Clarity (when integrated)
Turnout data	<5% of records populated

Gap Analysis

Temporal gaps. No odd-year municipal election results are loaded. Cities like New York, Houston, Philadelphia, and San Antonio hold elections in odd years. MEDSL publishes 2017 and 2019 data on Dataverse. Loading these would add coverage for several of the largest US cities.

State-level local gaps. Seven states have zero local race data in MEDSL 2022. OpenElections partially fills this for Pennsylvania. The remaining six (CA, IA, KS, NJ, TN, WI) require either Clarity integration or direct state portal downloads.

Primary elections. All loaded data is general election only. MEDSL tags primary results with stage = PRI but we have not loaded primary-specific files. NC SBE publishes primary results as separate files.

Runoff elections. Georgia, Louisiana, Texas, and other states hold runoff elections. These are partially captured in MEDSL (stage = RUN) but not systematically loaded.

What We Cover, What We Don’t, and Why

This page is an honest inventory of what the pipeline can and cannot do today. The status indicators mean:

✅ — Functional and validated
⚠️ — Partially implemented or not validated at scale
❌ — Not yet supported

Status Table

Capability	Status	Notes
Precinct-level results, all 50 states	✅	Via MEDSL 2018/2020/2022. 36.5M rows across three cycles.
NC deep temporal coverage	✅	NC SBE 2006–2024, 10 election cycles, 2.0M+ rows. Consistent 15-column schema from 2014 onward.
Federal race coverage	✅	President, US Senate, US House present in MEDSL for all states. FEC candidate master files available for cross-referencing.
State-level race coverage	✅	Governor, state legislature, AG, SOS present in MEDSL for all states.
FIPS geographic enrichment	✅	Census reference files loaded: 3,143 counties, 31,980 places, all 50 states + DC. 100% county FIPS match rate on MEDSL data.
Vote mode breakdowns	✅	NC SBE provides Election Day / One Stop / Absentee / Provisional for every contest. MEDSL provides mode breakdowns for some states (rows split by `mode` column).
Local race coverage	⚠️	44 of 51 MEDSL jurisdictions have local race data (blank `dataverse` column) in 2022. Seven states — CA, IA, KS, NJ, PA, TN, WI — have zero local rows.
Cross-source validation	⚠️	Validated for NC only. MEDSL and NC SBE share 640 contests in 2022: 90.5% exact vote match, 7.3% within 1%, 2.2% disagree by >1%. No systematic cross-source validation for other states.
Entity resolution	⚠️	Four-tier cascade (exact → Jaro-Winkler → embedding → LLM) is designed and prototyped. Not yet validated at scale beyond NC test cases.
Office classification	⚠️	Four-tier classifier (keyword → regex → embedding → LLM) handles 62% of 8,387 unique office names via keywords. Remaining 38% require embedding or LLM tiers. 0.56% classified as “other” in NC testing.
Name decomposition	⚠️	Parses first/middle/last/suffix/nickname from MEDSL and NC SBE formats. Handles nicknames in quotes (`"Steve"`) and parentheses (`(Steve)`). Not tested against all 50 states’ formatting conventions.
Turnout data	❌	`registered_voters` and `ballots_cast` populated for <5% of records. NC SBE has “Registered Voters” pseudo-contest rows. Most MEDSL state files do not include registration counts.
Odd-year elections	❌	MEDSL publishes 2017 and 2019 on Harvard Dataverse; neither is loaded. VEST files for KY/LA/MS/VA (2015, 2019) are loaded but contain state-level races only, so no odd-year local results are available.
Ranked-choice voting	❌	Schema has no fields for RCV rounds. Maine and Alaska use RCV for federal races. NYC and other cities use it for local races. No timeline for support.
Demographic correlation	❌	Census FIPS join is ready (county-level). Census demographic data (ACS) not yet integrated. The join key exists; the demographic tables do not.
Real-time results	❌	Pipeline processes certified and official results only. Not designed for election night reporting. Clarity integration (which could provide semi-live data) is not yet implemented.
Party switching detection	❌	Requires entity resolution across election cycles, which depends on L3/L4 being operational at scale.

Local Race Coverage Detail

The 44 jurisdictions (43 states plus DC) with local data in MEDSL 2022 vary in depth. Some report thousands of local contests; others report only a handful. The seven states with zero local rows are not states without local elections; they are states where MEDSL’s curation did not capture local results for that cycle.

NC SBE fills the gap for North Carolina with complete local coverage: every contest on every ballot in every precinct in all 100 counties. For other states, the gap remains.

OpenElections provides supplemental local data for FL, GA, MI, OH, PA, and TX, but coverage is inconsistent across years and granularity levels.

What “Validated” Means

A capability marked ✅ means:

The data is loaded and parsed without errors.
The output has been spot-checked against the source.
Where cross-source overlap exists, the numbers have been compared.

It does not mean the data is free of errors from the source. MEDSL’s votes column contains 12,782 non-integer values out of 12.3M rows (0.1%) in 2022. NC SBE has occasional data entry artifacts (e.g., a period after a middle name instead of a middle initial). These are source-level issues that the pipeline preserves and flags rather than silently corrects.

What “Not Validated at Scale” Means

Entity resolution and office classification work on NC test data. We have not run them against all 42M rows across all 50 states. The algorithms are designed; the compute has not been spent. When we do run at scale, we expect to discover new edge cases — office titles we haven’t seen, name formats we haven’t parsed, and match ambiguities we haven’t resolved.

This page will be updated as capabilities move from ⚠️ to ✅ or as new limitations are discovered.

MEDSL — MIT Election Data + Science Lab

The MIT Election Data + Science Lab publishes precinct-level election returns for all 50 states and the District of Columbia. The data is hosted on the Harvard Dataverse (electionscience collection) and mirrored on GitHub for recent cycles. It is the most complete single source of US election data available without a paywall or API key.

What MEDSL contains

MEDSL provides one CSV or tab-delimited file per state per election cycle. Each row represents one candidate in one precinct for one vote mode (election day, absentee, early voting, provisional, etc.). To obtain the total votes for a candidate in a precinct, you must sum across all rows for that candidate and precinct.

Available election cycles:

Cycle	Location	Format	DOI
2022	GitHub	CSV, one ZIP per state	—
2020	Harvard Dataverse	CSV/TAB, one file per state	`10.7910/DVN/NT66Z3`
2018	Harvard Dataverse	CSV/TAB, one file per state	`10.7910/DVN/NVQYMG`
2016	Harvard Dataverse	CSV/TAB	`10.7910/DVN/NH5S2I`
2019 (odd-year)	Harvard Dataverse	CSV/TAB	`10.7910/DVN/2AJUII`
2017 (odd-year)	Harvard Dataverse	CSV/TAB	`10.7910/DVN/VNJAB1`

We have downloaded and loaded 2018, 2020, and 2022. Together they contain approximately 36.5 million rows.

Schema

MEDSL files have 25 columns. The delimiter is comma for most states but tab for some; auto-detection handles this.

Column	Type	Description	Example
`precinct`	string	Precinct identifier from the source	`12-13`
`office`	string	Contest name, ALL CAPS	`CABARRUS COUNTY SCHOOLS BOARD OF EDUCATION`
`party_detailed`	string	Full party name	`NONPARTISAN`
`party_simplified`	string	Normalized party	`NONPARTISAN`
`mode`	string	Vote type for this row	`ELECTION DAY`
`votes`	integer	Vote count for this mode	`79`
`candidate`	string	Candidate name, ALL CAPS	`GREG MILLS`
`district`	string	District identifier or blank	`STATEWIDE`, `003`, ``
`dataverse`	string	Race level tag — see below	`STATE`, `SENATE`, `HOUSE`, ``
`stage`	string	Election stage	`GEN`
`special`	string	Special election flag	`FALSE`
`writein`	string	Write-in flag	`FALSE`
`date`	date	Election date	`2022-11-08`
`year`	integer	Election year	`2022`
`county_name`	string	County name, ALL CAPS	`CABARRUS`
`county_fips`	string	5-digit county FIPS	`37025`
`jurisdiction_name`	string	Jurisdiction name	`CABARRUS`
`jurisdiction_fips`	string	Jurisdiction FIPS	`37025`
`state`	string	Full state name	`NORTH CAROLINA`
`state_po`	string	2-letter postal code	`NC`
`state_fips`	string	2-digit state FIPS	`37`
`state_cen`	string	Census state code	`56`
`state_ic`	string	ICPSR state code	`47`
`readme_check`	string	Data quality flag	`FALSE`
`magnitude`	integer	Number of seats in this contest	`3`

The `dataverse` column and local races

MEDSL tags each row with a dataverse value indicating which Harvard Dataverse sub-collection the race belongs to:

Value	Meaning	Example offices
`PRESIDENT`	Presidential race	President
`SENATE`	US Senate	US Senate
`HOUSE`	US House	US House District 7
`STATE`	State-level offices	Governor, State Senate, Attorney General
(blank)	Everything else — including all local races	County Commissioner, School Board, Sheriff, Soil and Water

Local races are identified by a blank dataverse column, not by the value LOCAL. This is a frequent source of confusion. In the 2022 North Carolina file, 385,260 of 684,712 rows (56%) have a blank dataverse value. These rows contain school board races, county commissioner races, soil and water conservation districts, district court judges, mayors, city councils, and other local offices.

In the full 2022 national dataset (12.3 million rows), approximately 5.1 million rows (41.5%) have a blank dataverse value.
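A small sketch of the tagging rule described above, treating a blank value as "everything else" rather than looking for a `LOCAL` tag. The function name and the returned labels are hypothetical.

```python
def race_level(dataverse) -> str:
    """Classify a MEDSL row by its dataverse tag; blank means 'everything else'."""
    tag = (dataverse or "").strip().upper()
    if tag in {"PRESIDENT", "SENATE", "HOUSE"}:
        return "federal"
    if tag == "STATE":
        return "state"
    if tag == "":
        return "local_or_other"  # includes all local races
    return "unknown"             # e.g. a literal "LOCAL", which the table above does not list


print(race_level(""))
print(race_level("STATE"))
```

The blank bucket is labeled `local_or_other` because it holds everything that is not federal or state-level, not only local contests.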

The `mode` column and vote totals

Each row in MEDSL represents one candidate’s votes for one vote mode. A single candidate in a single precinct may have multiple rows:

12-13,US SENATE,LIBERTARIAN,LIBERTARIAN,ELECTION DAY,47,SHANNON W BRAY,...
12-13,US SENATE,LIBERTARIAN,LIBERTARIAN,ABSENTEE BY MAIL,5,SHANNON W BRAY,...
12-13,US SENATE,LIBERTARIAN,LIBERTARIAN,EARLY VOTING,38,SHANNON W BRAY,...
12-13,US SENATE,LIBERTARIAN,LIBERTARIAN,PROVISIONAL,0,SHANNON W BRAY,...

To get Shannon W. Bray’s total votes in precinct 12-13, sum the votes column across all modes: 47 + 5 + 38 + 0 = 90.

Some states include a TOTAL mode row that pre-sums the other modes. Some do not. Your aggregation logic must handle both cases. If TOTAL rows are present, either use them directly and skip the individual mode rows, or skip TOTAL and sum the modes yourself. Do not double-count.

Common mode values: ELECTION DAY, ABSENTEE BY MAIL, EARLY VOTING, ONE STOP, PROVISIONAL, TOTAL.
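One way to aggregate without double-counting is sketched below, using the four rows from the example above. It prefers a `TOTAL` row where one exists and falls back to summing the individual modes otherwise. The grouping key (precinct, office, candidate) is a simplification for the example; real data may also need district or party in the key.

```python
from collections import defaultdict


def precinct_totals(rows: list) -> dict:
    """Sum votes per (precinct, office, candidate) without double-counting TOTAL rows."""
    by_mode = defaultdict(int)  # sum of the individual mode rows
    pre_summed = {}             # value taken from TOTAL rows, when present
    for row in rows:
        key = (row["precinct"], row["office"], row["candidate"])
        if row["mode"].strip().upper() == "TOTAL":
            pre_summed[key] = pre_summed.get(key, 0) + int(row["votes"])
        else:
            by_mode[key] += int(row["votes"])
    # Prefer the TOTAL row where present; otherwise use the summed modes.
    keys = set(by_mode) | set(pre_summed)
    return {k: pre_summed[k] if k in pre_summed else by_mode[k] for k in keys}


base = {"precinct": "12-13", "office": "US SENATE", "candidate": "SHANNON W BRAY"}
rows = [
    {**base, "mode": "ELECTION DAY", "votes": "47"},
    {**base, "mode": "ABSENTEE BY MAIL", "votes": "5"},
    {**base, "mode": "EARLY VOTING", "votes": "38"},
    {**base, "mode": "PROVISIONAL", "votes": "0"},
]
totals = precinct_totals(rows)
print(totals[("12-13", "US SENATE", "SHANNON W BRAY")])  # 90
```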

Name formatting

MEDSL candidate names are ALL CAPS with no periods after initials:

MEDSL	Actual name
`SHANNON W BRAY`	Shannon W. Bray
`VICTORIA P PORTER`	Victoria P. Porter
`MICHAEL "STEVE" HUBER`	Michael “Steve” Huber
`ROBERT VAN FLETCHER JR`	Robert Van Fletcher, Jr.
`LM "MICKEY" SIMMONS`	L.M. “Mickey” Simmons

Nicknames appear in double quotes within the name string. Suffixes (JR, SR, III) appear without a preceding comma.
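These two conventions can be handled with a short sketch like the one below. It only separates the quoted nickname and a trailing suffix; it does not attempt first/middle/last splitting. The suffix list extends the three named above with `II` and `IV` as an assumption, and `split_medsl_name` is a hypothetical helper.

```python
import re

# JR, SR, III come from the text above; II and IV are assumed additions.
SUFFIXES = {"JR", "SR", "II", "III", "IV"}


def split_medsl_name(raw: str) -> dict:
    """Pull the quoted nickname and trailing suffix out of a MEDSL candidate string."""
    nickname = None
    match = re.search(r'"([^"]+)"', raw)
    if match:
        nickname = match.group(1)
        raw = raw[:match.start()] + raw[match.end():]
    tokens = raw.split()  # also collapses the gap left by the removed nickname
    suffix = tokens.pop() if tokens and tokens[-1] in SUFFIXES else None
    return {"name": " ".join(tokens), "nickname": nickname, "suffix": suffix}


print(split_medsl_name('MICHAEL "STEVE" HUBER'))
print(split_medsl_name("ROBERT VAN FLETCHER JR"))
```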

Write-in candidates are aggregated into a single row with candidate = WRITEIN and writein = TRUE.

Non-candidate rows

Some states include metadata rows in the data that are not candidate results:

`office` value	Meaning	Action
`REGISTERED VOTERS`	Voter registration count	Extract as turnout metadata, do not treat as a contest
`BALLOTS CAST`	Ballots cast count	Extract as turnout metadata
`BALLOTS CAST - TOTAL`	Same	Extract
`BALLOTS CAST - BLANK`	Blank ballot count	Extract
`STRAIGHT PARTY`	Straight-ticket party vote	Typically excluded from contest analysis
`OVER VOTES`	Overvote count	Extract as quality metadata
`UNDER VOTES`	Undervote count	Extract as quality metadata

These rows are present in some states and absent in others. Florida OpenElections data contains 6,013 “Registered Voters” rows — 67.9% of all records classified as “other” in initial processing.
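The table above maps directly onto a lookup. In this sketch the returned labels and the function name are hypothetical; the office strings are the ones listed in the table, matched case-insensitively so that mixed-case variants such as “Registered Voters” are caught as well.

```python
TURNOUT_OFFICES = {
    "REGISTERED VOTERS",
    "BALLOTS CAST",
    "BALLOTS CAST - TOTAL",
    "BALLOTS CAST - BLANK",
}
QUALITY_OFFICES = {"OVER VOTES", "UNDER VOTES"}


def row_kind(office: str) -> str:
    """Decide whether a row is a real contest or one of the metadata rows."""
    key = " ".join(office.upper().split())  # normalize case and whitespace
    if key in TURNOUT_OFFICES:
        return "turnout_metadata"
    if key in QUALITY_OFFICES:
        return "quality_metadata"
    if key == "STRAIGHT PARTY":
        return "straight_party"
    return "contest"


print(row_kind("Registered Voters"))
print(row_kind("US SENATE"))
```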

Known coverage gaps

MEDSL 2022 contains local race data for 44 of 51 jurisdictions. Seven states have zero rows with a blank dataverse column:

State	Likely reason
California	Local results published separately by each county; not aggregated by MEDSL
Iowa	Local results not included in the MEDSL state file
Kansas	Same
New Jersey	Same
Pennsylvania	Same
Tennessee	Same
Wisconsin	Same

This does not mean these states lack local elections. It means MEDSL’s curation process did not capture them for 2022. Coverage may differ in other years.

The `votes` column type

The votes column is predominantly integer, but some state files contain non-integer values. We observed:

Floating-point values (likely vote shares erroneously placed in the votes column)
Asterisks (*) indicating suppressed data
Empty strings

Parse with TRY_CAST or equivalent. In our load of the full 2022 dataset, 12,782 rows had non-integer votes values out of 12.3 million total (0.1%).
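For loaders written in Python, a rough equivalent is sketched below. It is stricter than a SQL cast: anything that is not a plain run of ASCII digits, including a value such as `79.0`, is returned as `None` so it can be flagged rather than silently coerced. The function name is hypothetical.

```python
from typing import Optional


def parse_votes(raw: Optional[str]) -> Optional[int]:
    """Return an integer vote count, or None when the value is not a whole number."""
    text = (raw or "").strip()
    # Accept only plain ASCII digits; floats, asterisks, and blanks all fail this check.
    if text.isascii() and text.isdigit():
        return int(text)
    return None


for value in ["79", "0.4512", "*", ""]:
    print(repr(value), parse_votes(value))
```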

Download

# 2022 — All 51 files from GitHub
mkdir -p local-data/sources/medsl/2022
for state in ak al ar az ca co ct dc de fl ga hi ia id il in ks ky la \
            ma md me mi mn mo ms mt nc nd ne nh nj nm nv ny oh ok or pa \
            ri sc sd tn tx ut va vt wa wi wv wy; do
  # -f makes curl fail on HTTP errors instead of saving an error page as a .zip
  curl -fL -o "local-data/sources/medsl/2022/2022-${state}-local-precinct-general.zip" \
    "https://raw.githubusercontent.com/MEDSL/2022-elections-official/main/individual_states/2022-${state}-local-precinct-general.zip"
done

# Unzip
for f in local-data/sources/medsl/2022/*.zip; do
  unzip -o "$f" -d "${f%.zip}"
done

# 2020 — NC example from Harvard Dataverse (file ID 6100444)
mkdir -p local-data/sources/medsl/2020
curl -fL -o local-data/sources/medsl/2020/2020-nc-precinct-general.csv \
  "https://dataverse.harvard.edu/api/access/datafile/6100444"

File IDs for all 51 jurisdictions in 2020 and 2018 are documented in the download instructions.

Cross-source overlap

For the 2022 North Carolina general election, MEDSL and NC SBE share 640 contests where both sources report results:

579 (90.5%) have exactly matching vote totals
47 (7.3%) match within 1%
14 (2.2%) disagree by more than 1%
401 (63%) have different candidate name formatting between the two sources

This overlap is the basis for our entity resolution validation. See Cross-Source Reconciliation.

NC SBE — North Carolina State Board of Elections

The North Carolina State Board of Elections publishes precinct-level results for every contest on the ballot — federal, state, and local — with vote mode breakdowns, for every election cycle back to at least 2006. It is the most complete single-state local election dataset we have found.

What NC SBE contains

NC SBE provides one tab-delimited text file per election, delivered as a ZIP archive from an S3 bucket. Each row represents one candidate in one precinct for one contest. Vote mode totals (Election Day, early voting, absentee by mail, provisional) appear as separate columns on each row, not as separate rows. This means a single row gives you the full vote breakdown for one candidate in one precinct — unlike MEDSL, which splits each vote mode into its own row.

Coverage:

Year	File	Rows	Notes
2024	`results_pct_20241105.txt`	233,511	Presidential general
2022	`results_pct_20221108.txt`	171,901	Midterm general
2020	`results_pct_20201103.txt`	257,722	Presidential general
2018	`results_pct_20181106.txt`	183,724	Midterm general
2016	`results_pct_20161108.txt`	252,827	Presidential general
2014	`results_pct_20141104.txt`	223,977	Midterm general
2012	`results_pct_20121106.txt`	208,921	Different schema — see below
2010	`results_pct_20101102.txt`	188,008	Different schema
2008	`results_pct_20081104.txt`	233,141	Different schema
2006	`results_pct_20061107.txt`	69,482	Significantly different schema (9 columns)

We have downloaded and loaded all 10 cycles. The 2014–2024 files share a stable 15-column format. Earlier files require separate parsers.

Schema (2014–2024)

Files from 2014 onward are tab-delimited with 15 columns. There is no quoting convention; values do not contain tabs.

Column	Type	Description	Example
`County`	string	County name, ALL CAPS	`COLUMBUS`
`Election Date`	string	Date as `MM/DD/YYYY`	`11/08/2022`
`Precinct`	string	Precinct identifier	`P17`
`Contest Group ID`	string	Internal contest grouping number	`7`
`Contest Type`	string	`S` = statewide, `C` = county/local	`C`
`Contest Name`	string	Full contest name, ALL CAPS	`COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02`
`Choice`	string	Candidate name, Title Case	`Timothy Lance`
`Choice Party`	string	Party abbreviation or blank	`REP`, `DEM`, or blank
`Vote For`	integer	Maximum selections allowed	`1`
`Election Day`	integer	Election day votes	`136`
`One Stop`	integer	Early voting (in-person) votes	`159`
`Absentee by Mail`	integer	Mail absentee votes	`7`
`Provisional`	integer	Provisional ballot votes	`1`
`Total Votes`	integer	Sum of all vote modes	`303`
`Real Precinct`	string	`Y` = physical precinct, `N` = aggregation group	`Y`

The `Contest Type` column

The Contest Type field distinguishes local from statewide races:

C — county/local contests: school board, county commissioner, city council, soil and water, local judicial races, bond referendums
S — statewide contests: US Senate, US House, Governor, state legislature, statewide judicial races

For local election analysis, filter to Contest Type = 'C'. In the 2022 file, this yields 919 distinct contests across 100 counties.
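A minimal sketch of that filter in Python follows. It assumes the file starts with a header row using the column names from the schema table, and it treats the pair (County, Contest Name) as the identity of a contest, which is our guess at a reasonable key rather than a documented rule. The second data row in the sample is made up for illustration.

```python
import csv
import io

HEADER = (
    "County\tElection Date\tPrecinct\tContest Group ID\tContest Type\t"
    "Contest Name\tChoice\tChoice Party\tVote For\tElection Day\tOne Stop\t"
    "Absentee by Mail\tProvisional\tTotal Votes\tReal Precinct\n"
)
SAMPLE = HEADER + (
    "COLUMBUS\t11/08/2022\tP17\t7\tC\t"
    "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02\t"
    "Timothy Lance\tREP\t1\t136\t159\t7\t1\t303\tY\n"
    # Hypothetical statewide row, included only to show the filter at work.
    "COLUMBUS\t11/08/2022\tP17\t1\tS\tUS SENATE\t"
    "Example Candidate\tDEM\t1\t10\t20\t3\t0\t33\tY\n"
)


def local_contests(handle) -> set:
    """Collect distinct (county, contest name) pairs where Contest Type is C."""
    # QUOTE_NONE mirrors the "no quoting convention" note for 2014-2024 files.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {
        (row["County"], row["Contest Name"])
        for row in reader
        if row["Contest Type"] == "C"
    }


contests = local_contests(io.StringIO(SAMPLE))
```

For a real file, pass an open text handle for `results_pct_20221108.txt` instead of the in-memory sample.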

Vote mode columns

NC SBE is the only source in our corpus that provides vote mode breakdowns as columns for every contest, including local races. The four modes are:

Column	Meaning
`Election Day`	Votes cast in person on election day
`One Stop`	Early in-person voting (North Carolina’s term for early voting)
`Absentee by Mail`	Absentee ballots returned by mail
`Provisional`	Provisional ballots accepted during canvass

Total Votes is the sum of the four mode columns. We have verified this holds across all rows in the 2014–2024 files.
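The check itself is simple arithmetic. A sketch, using the example row from the schema table (136 + 159 + 7 + 1 = 303) and a hypothetical helper name:

```python
MODE_COLUMNS = ("Election Day", "One Stop", "Absentee by Mail", "Provisional")


def mode_sum_matches(row: dict) -> bool:
    """True when the four vote mode columns add up to Total Votes."""
    return sum(int(row[column]) for column in MODE_COLUMNS) == int(row["Total Votes"])


example_row = {
    "Election Day": "136",
    "One Stop": "159",
    "Absentee by Mail": "7",
    "Provisional": "1",
    "Total Votes": "303",
}
```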

The vote mode data enables analysis that most sources cannot support: comparing early voting patterns to election day patterns at the precinct level for local races. Only a few sources in our corpus provide any vote mode breakdown at all (NC SBE, Clarity, and, for some states, MEDSL and OpenElections). NC SBE is the only one that provides it consistently for all contests.

Non-contest rows

NC SBE data includes rows that are not candidate results. Some appear as pseudo-contests that are not real races (such as “Registered Voters”); others appear as special Choice values within real contests:

`Contest Name` pattern	`Choice` value	What it is
Contains “Registered Voters”	(varies)	Voter registration count for the precinct
Any contest	`Write-In (Miscellaneous)`	Aggregated write-in votes
Any contest	`Over Votes`	Overvote count
Any contest	`Under Votes`	Undervote count

The “Registered Voters” rows deserve special attention. They appear as a contest named “Registered Voters” with a single Choice entry where Total Votes contains the number of registered voters in that precinct. This is turnout metadata, not a contest result.

In our prototype pipeline, we extract the registered voter count from these rows into a turnout object, then exclude the row from contest analysis. This is how we backfill the turnout.registered_voters field that is otherwise unpopulated for most sources.

Write-in rows with the suffix (Write-In) in the candidate name (e.g., Ronnie Strickland (Write-In)) are distinct from the aggregated Write-In (Miscellaneous) row. The named write-in rows report votes for a specific write-in candidate. The (Miscellaneous) row reports the total for all unnamed write-ins.
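The row types described in this section can be separated with a few string checks. The sketch below is illustrative only; the function name and the label strings are hypothetical, and the match on "registered voters" is case-insensitive as a precaution.

```python
def classify_row(contest_name: str, choice: str) -> str:
    """Label an NC SBE row as turnout metadata, a tally row, or a candidate."""
    if "registered voters" in contest_name.lower():
        return "registered_voters"      # turnout metadata, not a contest
    if choice == "Write-In (Miscellaneous)":
        return "write_in_aggregate"     # total for all unnamed write-ins
    if choice in ("Over Votes", "Under Votes"):
        return "over_under"
    if choice.endswith("(Write-In)"):
        return "named_write_in"         # a specific write-in candidate
    return "candidate"
```

Only rows labelled `candidate` or `named_write_in` would normally flow into contest analysis.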

The `Real Precinct` column

Real Precinct = Y indicates a physical voting precinct with a defined geographic boundary. Real Precinct = N indicates an aggregation group — typically used for absentee-only tallies or provisional ballot pools that cannot be assigned to a specific precinct.

For geographic analysis (mapping, precinct-level comparison), filter to Real Precinct = 'Y'. For total vote counts, include both.

Candidate name formatting

NC SBE candidate names are Title Case with periods after initials and commas before suffixes:

NC SBE	Components
`Timothy Lance`	first=Timothy, last=Lance
`Shannon W. Bray`	first=Shannon, middle=W, last=Bray
`Robert Van Fletcher, Jr.`	first=Robert, middle=Van, last=Fletcher, suffix=Jr.
`Michael (Steve) Huber`	first=Michael, nickname=Steve, last=Huber
`William Irvin. Enzor III`	first=William, middle=Irvin, last=Enzor, suffix=III
`Patricia (Pat) Cotham`	first=Patricia, nickname=Pat, last=Cotham

Nicknames appear in parentheses. This differs from MEDSL, which uses double quotes. The period after “Irvin.” in “William Irvin. Enzor III” appears to be a data entry artifact: a period belongs after a middle initial, not after a full middle name. These inconsistencies are present in the source data and must be handled during name decomposition at L1.
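A rough sketch of how the examples in the table could be decomposed is shown below. It is not the L1 implementation: the function name and the suffix list are assumptions, it expects a non-empty name, and it treats any parenthesized text as a nickname, so named write-in rows should have their (Write-In) marker removed first.

```python
import re

SUFFIXES = {"JR", "SR", "II", "III", "IV"}  # assumed list, extend as needed


def decompose_name(raw: str) -> dict:
    """Split an NC SBE candidate name into first/middle/last/suffix/nickname."""
    # Pull out a parenthesized nickname such as "(Steve)".
    nickname = None
    match = re.search(r"\(([^)]+)\)", raw)
    if match:
        nickname = match.group(1)
        raw = raw[:match.start()] + raw[match.end():]
    # A comma, when present, separates the suffix from the rest of the name.
    suffix = None
    if "," in raw:
        raw, suffix = (part.strip() for part in raw.split(",", 1))
    tokens = raw.split()
    # Suffixes such as "III" can also follow the surname without a comma.
    if suffix is None and len(tokens) > 2 and tokens[-1].rstrip(".").upper() in SUFFIXES:
        suffix = tokens.pop()
    # Drop trailing periods from initials and stray ones like "Irvin.".
    tokens = [token.rstrip(".") for token in tokens]
    return {
        "first": tokens[0],
        "middle": " ".join(tokens[1:-1]) or None,
        "last": tokens[-1],
        "suffix": suffix,
        "nickname": nickname,
    }
```

Multi-word surnames are not detected; following the table, "Van" in "Robert Van Fletcher, Jr." is returned as a middle name.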

Schema changes across years

The 2014–2024 files share the 15-column schema documented above. Earlier files differ:

2008–2012: The schema has 14–15 columns but with different names and ordering. Contest Type is the third column (not the fifth). Fields are comma-delimited with quote wrapping. The district column was added later. Vote mode columns use slightly different names in some years.

2006: Significantly different. Only 9 columns: county, election_dt, precinct_abbrv, precinct, contest_name, name_on_ballot, party_cd, ballot_count, FTP_Date. No vote mode breakdown. No Contest Type field. Column names use underscores and are lowercase except for FTP_Date.

We currently parse 2014–2024 with one parser and treat 2006–2012 as a separate parser target. The 2008–2012 files contain local races (they have Contest Type = C) but require different column mapping. The 2006 file requires more investigation to determine whether it includes local races.

Why NC SBE matters

NC SBE is not the largest dataset in our corpus (MEDSL has far more rows). Its value is in three properties that no other source provides simultaneously:

Complete local coverage. Every contest on every ballot in every precinct in every county — school board, soil and water, county commissioner, municipal, judicial, and bond referendums. MEDSL has gaps in local race coverage for some states. NC SBE has none for North Carolina.
Vote mode breakdowns for local races. The four-column mode breakdown (Election Day, One Stop, Absentee, Provisional) is present for every contest, including hyperlocal races like “Whiteville City Schools Board of Education District 01.”
Ten-year temporal depth. Six clean election cycles (2014–2024) with a consistent schema. This enables career tracking, competitiveness trend analysis, and temporal chain construction across a decade of local elections. Combined with the 2008–2012 files (once parsed), the coverage extends to 16 years.

The combination of these three properties makes NC SBE the primary validation dataset for the pipeline. When we test cross-source entity resolution, we compare MEDSL NC against NC SBE NC — 640 overlapping contests with 90.5% exact vote total agreement and 63% candidate name formatting differences. When we test temporal chains, we track candidates across NC SBE’s six-cycle span.

Download

The URL pattern is:

https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/{YYYY_MM_DD}/results_pct_{YYYYMMDD}.zip

mkdir -p local-data/sources/ncsbe/{2014,2016,2018,2020,2022,2024}

# 2024
curl -L -o local-data/sources/ncsbe/2024/results_pct_20241105.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2024_11_05/results_pct_20241105.zip"

# 2022
curl -L -o local-data/sources/ncsbe/2022/results_pct_20221108.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip"

# 2020
curl -L -o local-data/sources/ncsbe/2020/results_pct_20201103.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2020_11_03/results_pct_20201103.zip"

# 2018
curl -L -o local-data/sources/ncsbe/2018/results_pct_20181106.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2018_11_06/results_pct_20181106.zip"

# 2016
curl -L -o local-data/sources/ncsbe/2016/results_pct_20161108.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2016_11_08/results_pct_20161108.zip"

# 2014
curl -L -o local-data/sources/ncsbe/2014/results_pct_20141104.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2014_11_04/results_pct_20141104.zip"

# Unzip all
for d in local-data/sources/ncsbe/*/; do
  (cd "$d" && unzip -o ./*.zip)  # subshell: the working directory is restored even if unzip fails
done

Older cycles (2008–2012) follow the same URL pattern. The 2006 file uses a different path structure. The full election calendar is at ncsbe.gov/results-data.
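For 2008 and later, the URL can be derived from the election date alone. A small sketch, with a hypothetical function name, that reproduces the pattern shown above:

```python
from datetime import date

NCSBE_BASE = "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS"


def ncsbe_results_url(election_day: date) -> str:
    """Build the precinct results ZIP URL for one election date."""
    folder = election_day.strftime("%Y_%m_%d")   # e.g. 2022_11_08
    stamp = election_day.strftime("%Y%m%d")      # e.g. 20221108
    return f"{NCSBE_BASE}/{folder}/results_pct_{stamp}.zip"
```

The function only formats a string; whether a file exists for a given date still has to be checked against the election calendar.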

OpenElections

The OpenElections project is a volunteer-driven effort to collect, clean, and publish certified US election results as CSV files. Data is organized into per-state GitHub repositories under the openelections organization.

What OpenElections provides

Precinct-level and county-level election results, parsed from official state and county sources into CSV format. Coverage varies by state — some repositories have data back to 2000, others have only one or two recent cycles. Approximately 8 states have precinct-level 2022 general election data suitable for aggregation.

States with usable precinct-level data for recent cycles include FL, GA, MI, OH, PA, and TX. Each state repository is independent, maintained by different volunteers, with different levels of completeness.

Repository structure

Each state has its own repo:

openelections-data-fl — Florida
openelections-data-ga — Georgia
openelections-data-pa — Pennsylvania
etc.

Files follow a naming convention that encodes the election date, state, type, and granularity:

{YYYYMMDD}__{state}__{type}__{granularity}.csv

Examples:

Filename	Meaning
`20221108__fl__general__precinct.csv`	2022 FL general, precinct-level
`20220510__pa__primary__county.csv`	2022 PA primary, county-level
`20201103__ga__general__precinct.csv`	2020 GA general, precinct-level

Some repos include both raw and cleaned versions. Files with raw in the name are unprocessed source dumps. Prefer the files without raw in the name.
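The naming convention can be parsed by splitting on the double underscore. The sketch below handles only the four-part form shown above and returns `None` for anything else; the function name is hypothetical, and real repositories may contain filenames with additional segments that this does not cover.

```python
from datetime import datetime
from typing import Optional


def parse_openelections_filename(name: str) -> Optional[dict]:
    """Parse '{YYYYMMDD}__{state}__{type}__{granularity}.csv' into its parts."""
    if not name.endswith(".csv"):
        return None
    parts = name[: -len(".csv")].split("__")
    if len(parts) != 4:
        return None  # not the four-part form documented above
    stamp, state, kind, granularity = parts
    try:
        election_date = datetime.strptime(stamp, "%Y%m%d").date()
    except ValueError:
        return None
    return {
        "date": election_date,
        "state": state.upper(),
        "type": kind,
        "granularity": granularity,
    }
```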

Core schema (7+ columns)

The project does not enforce a single schema. Most files share a 7-column core:

Column	Type	Description
`county`	string	County name
`precinct`	string	Precinct name or code
`office`	string	Office contested
`district`	string	District number or name (may be blank)
`party`	string	Party abbreviation
`candidate`	string	Candidate name
`votes`	integer	Vote count

Additional columns appear in some states:

election_day, absentee, provisional, early_voting — vote mode breakdowns
winner — boolean or Y/N flag
total_votes — aggregate across modes

Column names and ordering differ across states and sometimes across files within the same state repo.

Schema variation by state

State	Extra columns	Name format	Notes
FL	`election_day`, `absentee`, `early_voting`	`Last, First`	Includes “Registered Voters” metadata rows (6,013 in 2022)
GA	`total_votes`	`First Last`	Precinct names vary by county
PA	none beyond core	`First Last`	Some files county-level only
OH	`early_voting`, `absentee`	`Last, First`	Inconsistent across counties

Non-candidate rows

Florida files include metadata rows that are not contest results:

`office` value	Meaning
`Registered Voters`	Voter registration count — 67.9% of “other” rows in initial FL processing
`Ballots Cast`	Turnout count

These must be extracted as turnout metadata during L1 parsing, not treated as contests.

Access method

Data is accessed by cloning the per-state Git repository:

git clone https://github.com/openelections/openelections-data-fl.git
git clone https://github.com/openelections/openelections-data-ga.git
git clone https://github.com/openelections/openelections-data-pa.git

There is no bulk download endpoint. Each state repo must be cloned individually.

Data quality

Quality varies by state and volunteer. Known issues:

No standard schema. Column names differ across states and files. Parsers must handle each state separately.
Candidate name format varies. Some states use Last, First. Others use First Last. Suffixes and middle names are inconsistent.
Encoding. Most files are UTF-8. Some older files contain Latin-1 or Windows-1252 characters.
Duplicates. Some repos contain both raw and cleaned versions of the same election. Ingest only one to avoid double-counting.
Incomplete coverage. A state repo existing does not mean it has precinct-level data for all cycles.
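For the encoding issue in particular, one common approach is to try UTF-8 first and fall back to Windows-1252. This is a sketch of that general technique, not a description of what our parsers currently do; the function name is hypothetical.

```python
def decode_csv_bytes(data: bytes) -> str:
    """Decode file bytes as UTF-8, falling back to Windows-1252."""
    try:
        # utf-8-sig also strips a leading byte order mark if one is present.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # cp1252 covers the Latin-1 printable range; bytes it does not define
        # are replaced rather than raising.
        return data.decode("cp1252", errors="replace")
```

The fallback is a heuristic: a file in some other legacy encoding would decode without error but with wrong characters, so names containing non-ASCII characters are worth spot-checking.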

Cross-source overlap

OpenElections FL overlaps with MEDSL FL for the 2022 general election. This overlap is useful for validation but has not been systematically compared at the same depth as the MEDSL–NC SBE comparison (640 contests, 90.5% vote match). The FL overlap is a planned validation target.

Value in the pipeline

OpenElections fills gaps where MEDSL coverage is thin and adds vote mode breakdowns where a state repository provides them. Florida’s vote mode columns (election day, absentee, early voting) provide signal that MEDSL’s Florida file lacks. Because the data is community-curated, it may appear for some states or cycles before MEDSL publishes its cleaned version.

The tradeoff is consistency: every state requires its own parser branch at L1.

Clarity/Scytl ENR

Clarity (now part of Scytl / CivicPlus) powers Election Night Reporting (ENR) websites for over 1,000 US jurisdictions — counties, cities, and some state-level election authorities. Each jurisdiction runs its own Clarity instance, publishing structured results in XML and JSON formats.

What Clarity provides

Clarity sites are the primary source for local race results that no other source captures: school board, city council, municipal judge, fire district commissioner, water board. Many jurisdictions publish precinct-level results with vote mode breakdowns (Election Day, early, absentee, provisional). Results appear on election night and typically remain available for weeks or months before being replaced by the next election cycle.

Data format

Results are distributed as XML inside ZIP archives. The XML follows a hierarchical structure:

Element	Description
`<ElectionResult>`	Root element. Contains election metadata (name, date, jurisdiction).
`<Contest>`	One per race. Attributes include contest name, vote-for count, total ballots.
`<Choice>`	One per candidate or ballot measure option within a contest. Includes name, party, total votes.
`<VoteType>`	Breakdown by vote method within each choice. Election Day, absentee, early, provisional.
`<Precinct>`	Precinct-level results when the jurisdiction publishes at that granularity.

A single detailxml.zip file for a medium-sized county (50 precincts, 30 contests) is typically 200 KB–2 MB uncompressed.
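To make the hierarchy concrete, here is a sketch that flattens a small hand-written document into one row per precinct. Because no XML schema is published, the attribute names (`text`, `name`, `votes`) and the placement of `<Precinct>` under `<VoteType>` are assumptions for illustration, and the sample data is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Attribute names and nesting are assumptions.
SAMPLE_XML = """<ElectionResult>
  <Contest text="BOARD OF EDUCATION DISTRICT 1">
    <Choice text="Jane Doe">
      <VoteType name="Election Day">
        <Precinct name="P01" votes="12"/>
        <Precinct name="P02" votes="8"/>
      </VoteType>
      <VoteType name="Absentee">
        <Precinct name="P01" votes="6"/>
        <Precinct name="P02" votes="4"/>
      </VoteType>
    </Choice>
  </Contest>
</ElectionResult>"""


def flatten_results(xml_text: str) -> list:
    """Walk Contest > Choice > VoteType > Precinct and emit one row per leaf."""
    root = ET.fromstring(xml_text)
    rows = []
    for contest in root.iter("Contest"):
        for choice in contest.iter("Choice"):
            for vote_type in choice.iter("VoteType"):
                for precinct in vote_type.iter("Precinct"):
                    rows.append({
                        "contest": contest.get("text"),
                        "choice": choice.get("text"),
                        "vote_type": vote_type.get("name"),
                        "precinct": precinct.get("name"),
                        "votes": int(precinct.get("votes")),
                    })
    return rows


rows = flatten_results(SAMPLE_XML)
```

A real parser would need to confirm these names against an actual `detailxml.zip` from each jurisdiction before relying on them.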

URL structure

Clarity ENR sites follow a predictable URL pattern:

https://results.enr.clarityelections.com/{state}/{jurisdiction}/{electionID}/

The underlying data feeds are at:

Endpoint	Content
`reports/detailxml.zip`	Full precinct-level XML results
`json/en/summary.json`	Lightweight JSON summary (no precinct detail)
`Web02/en/summary.html`	Human-readable results page

Example for Wake County, NC:

https://results.enr.clarityelections.com/NC/Wake/115545/reports/detailxml.zip

The {electionID} is a numeric identifier assigned per election. It is not sequential and cannot be predicted.

Coverage

Jurisdictions: ~1,000+ counties and municipalities across ~30 states
Election types: general, primary, runoff, special, municipal
Granularity: precinct-level with vote type breakdowns (most jurisdictions)
Temporal: current election cycle only; prior results are removed when new elections are configured

Why Clarity matters

Clarity is the highest-value source for local races that do not appear in MEDSL, OpenElections, or state portals. A county’s Clarity site may be the only machine-readable source for races like:

School board (non-partisan, no state-level reporting)
City council (municipal elections, often off-cycle)
District court judge retention
Bond referendums and local ballot measures

Key problems

URLs are unstable. The {electionID} changes every cycle. Old results are removed without redirect or archive. There is no central index of active Clarity instances. Discovery requires crawling county election office websites for links.

No published XML schema. The XML structure is consistent in practice but not formally specified. Minor variations exist across Clarity software versions. Field names and nesting can differ between jurisdictions.

Candidate names may embed party. Some jurisdictions format candidate names as John Smith (REP) rather than using a separate party field. This requires parsing at L1.

Ephemeral availability. Results may disappear weeks after certification when the jurisdiction configures the site for the next election. L0 acquisition must happen promptly after each election.
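The embedded-party problem above is the easiest of these to handle in code. A sketch, assuming the party always appears as a trailing parenthesized code of two to four capital letters; the function name and that length range are our assumptions, not a Clarity convention we have verified.

```python
import re

# Trailing "(REP)"-style code: 2-4 capital letters in parentheses at the end.
PARTY_SUFFIX = re.compile(r"^(?P<name>.*\S)\s*\((?P<party>[A-Z]{2,4})\)\s*$")


def split_embedded_party(raw: str) -> tuple:
    """Return (name, party) where party is None if no trailing code is found."""
    match = PARTY_SUFFIX.match(raw)
    if match:
        return match.group("name"), match.group("party")
    return raw.strip(), None
```

Requiring all capitals keeps parenthesized nicknames such as (Steve) from being mistaken for a party code.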

Integration status

Clarity is not yet integrated in our pipeline. The source module (src/sources/clarity.rs) defines the XML schema and URL patterns but does not implement parsing or acquisition. Integration is blocked on building a jurisdiction discovery mechanism and a scheduled acquisition process that captures results before URLs expire.

When integrated, Clarity will feed into L0 as ZIP archives with XML contents, parsed at L1 into the unified schema. The hierarchical Contest → Choice → VoteType structure maps cleanly to our ContestKind model.

VEST — Voting and Election Science Team

The Voting and Election Science Team (VEST) publishes precinct-level election shapefiles for all 50 states. Each shapefile pairs precinct geographic boundaries with vote counts encoded as attribute columns. The data is archived on the Harvard Dataverse.

What VEST provides

VEST’s primary value is twofold: geographic precinct boundaries (polygons) and odd-year election coverage. No other source in our corpus provides precinct geometries, and MEDSL’s loaded data currently covers only even years.

We have downloaded VEST shapefiles for KY, LA, MS, and VA covering the 2015 and 2019 odd-year elections. These contain state-level races (governor, attorney general, state legislature) but not local races.

Data format

Each state-year dataset is a ZIP archive containing a standard ESRI shapefile bundle:

File	Purpose
`.shp`	Geometry (precinct boundary polygons)
`.dbf`	Attribute table (vote counts, FIPS codes)
`.shx`	Shape index (record offsets into the `.shp` file)
`.prj`	Coordinate reference system definition
`.cpg`	Character encoding declaration

Reading requires a spatial data library. In Python, geopandas.read_file() handles the full bundle. In Rust, the shapefile crate reads .shp/.dbf pairs.

Column encoding convention

VEST encodes election metadata into column names using a compact format:

{stage}{YY}{office}{party}{surname}

Component	Values	Examples
Stage	`G` (general), `P` (primary), `R` (runoff)	`G`
Year	Two-digit year	`20`, `19`, `15`
Office	`PRE` (President), `USS` (US Senate), `USH` (US House), `GOV` (Governor), `SOS` (Sec. of State), `ATG` (Attorney General), `LTG` (Lt. Governor)	`PRE`
Party	`R` (Republican), `D` (Democrat), `L` (Libertarian), `G` (Green), `O` (Other)	`R`
Surname	Abbreviated (typically 3 chars)	`TRU`, `BID`

Decoded examples

Column	Stage	Year	Office	Party	Candidate
`G20PRERTRU`	General	2020	President	Republican	Trump
`G20PREDBID`	General	2020	President	Democrat	Biden
`G19GOVDBES`	General	2019	Governor	Democrat	Beshear (KY)
`G15GOVDEDW`	General	2015	Governor	Democrat	Edwards (LA)
`G18GOVDABR`	General	2018	Governor	Democrat	Abrams (GA)
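Decoding can be sketched with a single regular expression. The version below assumes a fixed-width layout (1 character for stage, 2 for year, 3 for office, 1 for party, 3 for surname) and that every two-digit year belongs to the 2000s; both are our assumptions, and the function name is hypothetical.

```python
import re
from typing import Optional

STAGES = {"G": "general", "P": "primary", "R": "runoff"}
# Assumed fixed-width layout: stage(1) year(2) office(3) party(1) surname(3).
VEST_COLUMN = re.compile(r"^([GPR])(\d{2})([A-Z0-9]{3})([A-Z])([A-Z]{3})$")


def decode_vest_column(column: str) -> Optional[dict]:
    """Decode a VEST vote column name, or return None for other columns."""
    match = VEST_COLUMN.match(column)
    if match is None:
        return None
    stage, yy, office, party, surname = match.groups()
    return {
        "stage": STAGES[stage],
        "year": 2000 + int(yy),      # assumes elections from 2000 onward
        "office": office,
        "party": party,
        "surname_prefix": surname,   # truncated; resolve via VEST documentation
    }
```

Geographic columns such as `STATEFP20` do not match the pattern and come back as `None`, which gives a simple way to separate identifier columns from vote columns.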

Attribute table structure

The .dbf file contains both geographic identifiers and vote count columns:

Column pattern	Description
`STATEFP20`	2-digit state FIPS code
`COUNTYFP20`	3-digit county FIPS code
`VTDST20`	Voting tabulation district (precinct) code
`NAME20`	Human-readable precinct name
`ALAND20`	Land area in square meters
`AWATER20`	Water area in square meters
`G20PRE*`	Vote count columns (one per candidate)

Vote values are raw integer counts. Each row is one precinct.

dBASE column name truncation

The .dbf format (dBASE III) limits column names to 10 characters. This truncation creates ambiguity:

G20USSRPER could be Perdue or Perry
G20USHDWIL could be Williams, Wilson, or Wilkins

VEST documentation files (included in each ZIP) provide a column-to-candidate mapping. These must be consulted to resolve truncated names.

Coverage in our pipeline

State	Year	Offices covered	Scope
KY	2019	Governor, AG, SOS	State-level only
LA	2015, 2019	Governor, state legislature	State-level only
MS	2019	Governor, AG, state legislature	State-level only
VA	2015, 2019	State legislature	State-level only

These four states hold odd-year elections, which MEDSL has on Dataverse but which we have not yet loaded from that source. VEST fills the gap for state-level races in these cycles.

Limitations

No local races. VEST encodes statewide and federal contests only. County commissioner, school board, sheriff, and other local offices are not present. For local race coverage, use MEDSL or state-specific sources.

Large file sizes. Individual state shapefiles range from 50 MB to 500+ MB. The geometry data dominates file size; vote counts are a small fraction.

Precinct boundary instability. Redistricting changes precinct boundaries between election cycles. A precinct polygon from 2020 may not correspond to the same geographic area in 2022. Cross-year geographic comparisons require spatial intersection, not ID matching.

Requires spatial tooling. Unlike CSV sources that can be read with any text processor, shapefiles require geopandas (Python) or the shapefile crate (Rust). This adds a dependency that other sources do not.

Usage in the pipeline

VEST data enters at L0 as the raw shapefile ZIP. At L1, the column encoding is decoded to extract year, office, party, and candidate surname. Vote counts are pivoted from wide format (one column per candidate) to long format (one row per candidate per precinct) to match the unified schema.
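The wide-to-long pivot can be illustrated on a single attribute row. This sketch works on a plain dictionary so it needs no spatial library; the function name, the choice of identifier columns, and all values in the sample row are hypothetical.

```python
import re

# Same assumed fixed-width pattern as the column encoding convention.
VOTE_COLUMN = re.compile(r"^[GPR]\d{2}[A-Z0-9]{3}[A-Z][A-Z]{3}$")

# Hypothetical precinct row as it might come out of the .dbf attribute table.
SAMPLE_ROW = {
    "STATEFP20": "21",
    "COUNTYFP20": "001",
    "VTDST20": "000A01",
    "NAME20": "Example Precinct",
    "G20PRERTRU": 100,
    "G20PREDBID": 80,
}


def wide_to_long(row: dict, id_columns=("STATEFP20", "COUNTYFP20", "VTDST20")) -> list:
    """Emit one record per vote column, carrying the precinct identifiers along."""
    ids = {column: row[column] for column in id_columns}
    return [
        {**ids, "column": column, "votes": int(value)}
        for column, value in row.items()
        if VOTE_COLUMN.match(column)
    ]


long_rows = wide_to_long(SAMPLE_ROW)
```

Each output record still carries the encoded column name, so the decoding step can run afterwards to attach year, office, party, and surname.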

The geographic boundaries are preserved as sidecar geometry files but are not embedded into the JSONL record stream. They are available for spatial joins and map rendering but are not part of the core election result schema.

Download

VEST datasets are available from the Harvard Dataverse. Each state-year combination has its own DOI. Example for Kentucky 2019:

mkdir -p local-data/sources/vest/ky/2019
curl -L -o local-data/sources/vest/ky/2019/ky_2019.zip \
  "https://dataverse.harvard.edu/api/access/dataset/:persistentId/?persistentId=doi:10.7910/DVN/XXXXXX"
unzip local-data/sources/vest/ky/2019/ky_2019.zip -d local-data/sources/vest/ky/2019/

Consult the VEST precinct data page for current DOIs. File IDs change when datasets are updated.

Census Bureau FIPS Reference Files

The US Census Bureau publishes authoritative FIPS (Federal Information Processing Standards) code files that provide the canonical mapping from numeric codes to geographic entity names. These files are the ground truth for geographic identifiers across the pipeline.

What it provides

File	Entity type	Record count	Key columns
`state.txt`	States + DC + territories	57	`STATE`, `STATE_NAME`
`national_county2020.txt`	Counties + equivalents	3,143	`STATEFP`, `COUNTYFP`, `COUNTYNAME`
`national_place2020.txt`	Incorporated places + CDPs	31,980	`STATEFP`, `PLACEFP`, `PLACENAME`
`national_cousub2020.txt`	County subdivisions	~36,000	`STATEFP`, `COUNTYFP`, `COUSUBFP`, `COUSUBNAME`

Format

All files are pipe-delimited (|) plain text with a header row. Encoding is ASCII. Example from the county file:

NC|37|037|1026339|Chatham County|H1|A
NC|37|063|1008557|Durham County|H1|A
NC|37|183|1008586|Wake County|H1|A

Columns in the county file:

Column	Description
`STATE`	Two-letter postal abbreviation
`STATEFP`	Two-digit state FIPS code
`COUNTYFP`	Three-digit county FIPS code
`COUNTYNS`	ANSI feature code
`COUNTYNAME`	Full county name including “County” suffix
`CLASSFP`	FIPS class code (`H1` = active county or equivalent, `H4` = legally defined inactive entity, `H6` = county coextensive or consolidated with an incorporated place)
`FUNCSTAT`	Functional status (`A` = active)

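The five-digit county FIPS used throughout the pipeline is STATEFP concatenated with COUNTYFP (e.g., 37 followed by 183 gives 37183 for Wake County, NC). Both parts are zero-padded text, so they should be kept as strings rather than parsed as numbers.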
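A minimal sketch of loading the county file and building the five-digit key, using the three example rows above. It assumes the header row uses the column names listed in the table; the function name is hypothetical.

```python
import csv
import io

SAMPLE = (
    "STATE|STATEFP|COUNTYFP|COUNTYNS|COUNTYNAME|CLASSFP|FUNCSTAT\n"
    "NC|37|037|1026339|Chatham County|H1|A\n"
    "NC|37|063|1008557|Durham County|H1|A\n"
    "NC|37|183|1008586|Wake County|H1|A\n"
)


def load_county_fips(handle) -> dict:
    """Map five-digit county FIPS (as a string) to the county name."""
    reader = csv.DictReader(handle, delimiter="|")
    # String concatenation keeps leading zeros in both parts intact.
    return {row["STATEFP"] + row["COUNTYFP"]: row["COUNTYNAME"] for row in reader}


county_names = load_county_fips(io.StringIO(SAMPLE))
```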

Download

https://www2.census.gov/geo/docs/reference/state.txt
https://www2.census.gov/geo/docs/reference/codes2020/national_county2020.txt
https://www2.census.gov/geo/docs/reference/codes2020/national_place2020.txt
https://www2.census.gov/geo/docs/reference/codes2020/national_cousub2020.txt

No API key required. Files are small (under 5 MB total) and rarely change.

Usage in the pipeline

Census FIPS files are consumed at L1 for geographic enrichment. When a source record contains a county name but no FIPS code (common in OpenElections and Clarity data), the pipeline joins against the county file to assign the canonical five-digit FIPS. When a source provides a FIPS code but no name, the lookup runs in reverse.
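Assigning a FIPS code from a county name needs a little normalization first, because sources write the same county in different ways. The sketch below shows one possible approach with a tiny hard-coded index; the function names and the suffix rule (dropping a trailing "County" or "Co.") are assumptions, and a real index would be built from the county file.

```python
import re
from typing import Optional

# Tiny illustrative index keyed by (state, normalized county name).
COUNTY_FIPS = {
    ("NC", "CHATHAM"): "37037",
    ("NC", "DURHAM"): "37063",
    ("NC", "WAKE"): "37183",
}


def county_key(state: str, name: str) -> tuple:
    """Normalize a county name: uppercase, drop a trailing 'County' or 'Co.'."""
    cleaned = re.sub(r"\s+(COUNTY|CO\.?)$", "", name.strip().upper())
    return (state.upper(), cleaned)


def lookup_fips(state: str, name: str) -> Optional[str]:
    """Return the five-digit FIPS for a county name, or None if unknown."""
    return COUNTY_FIPS.get(county_key(state, name))
```

Louisiana parishes, Alaska boroughs, and independent cities would need additional suffix rules that are not shown here.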

The place file enables resolution of municipal names to FIPS codes — relevant for city council, mayoral, and municipal utility district contests where the jurisdiction is a place, not a county.

FIPS codes serve as the primary geographic join key across all seven data sources. Without them, matching “Wake County” in MEDSL to “WAKE” in NC SBE to “Wake Co.” in OpenElections would require fuzzy string matching. With them, it is an exact join on a five-character code.

FEC — Federal Election Commission Candidate Master Files

The FEC publishes bulk data files for every registered federal candidate: President, US Senate, and US House. These files provide stable candidate identifiers (CAND_ID) that persist across election cycles, making them a reference source for cross-linking federal candidates across MEDSL, NC SBE, and OpenElections data.

What FEC provides

The candidate master file (cn.txt) contains one row per candidate per election cycle. It covers all candidates who have filed with the FEC, including those who lost primaries or never appeared on a general election ballot.

Available cycles: 1980–present. We have downloaded 2020 and 2022.

Download

Bulk data is at fec.gov/data/browse-data.

mkdir -p local-data/sources/fec/{2020,2022}

# 2022
curl -L -o local-data/sources/fec/2022/cn.zip \
  "https://www.fec.gov/files/bulk-downloads/2022/cn22.zip"
unzip -o local-data/sources/fec/2022/cn.zip -d local-data/sources/fec/2022/

# 2020
curl -L -o local-data/sources/fec/2020/cn.zip \
  "https://www.fec.gov/files/bulk-downloads/2020/cn20.zip"
unzip -o local-data/sources/fec/2020/cn.zip -d local-data/sources/fec/2020/

Schema

The file cn.txt is pipe-delimited (|) with 15 columns and no header row.

#	Column	Description	Example
1	`CAND_ID`	Stable candidate identifier	`H0NC09072`
2	`CAND_NAME`	Name in LAST, FIRST MIDDLE format	`BRAY, SHANNON W`
3	`CAND_PTY_AFFILIATION`	Party code	`LIB`
4	`CAND_ELECTION_YR`	Election year	`2022`
5	`CAND_OFFICE_ST`	State (2-letter postal code)	`NC`
6	`CAND_OFFICE`	Office: `H` / `S` / `P`	`H`
7	`CAND_OFFICE_DISTRICT`	Congressional district (`00` for Senate/President)	`09`
8	`CAND_ICI`	Incumbent/Challenger/Open: `I`/`C`/`O`	`C`
9	`CAND_STATUS`	Status code (`C`=statutory candidate, `F`=statutory candidate for a future election, `N`=not yet a statutory candidate, `P`=statutory candidate in a prior cycle)	`C`
10	`CAND_PCC`	Principal campaign committee ID	`C00654321`
11	`CAND_ST1`	Mailing address street
12	`CAND_ST2`	Mailing address street 2
13	`CAND_CITY`	Mailing address city
14	`CAND_ST`	Mailing address state
15	`CAND_ZIP`	Mailing address ZIP
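Since the file has no header row, column names must be supplied by the parser. The sketch below zips the 15 names from the table onto one line and splits CAND_NAME into parts. The sample line is assembled from the example values in the table with the address fields left blank, the function names are hypothetical, and suffix handling is omitted.

```python
FEC_COLUMNS = (
    "CAND_ID", "CAND_NAME", "CAND_PTY_AFFILIATION", "CAND_ELECTION_YR",
    "CAND_OFFICE_ST", "CAND_OFFICE", "CAND_OFFICE_DISTRICT", "CAND_ICI",
    "CAND_STATUS", "CAND_PCC", "CAND_ST1", "CAND_ST2", "CAND_CITY",
    "CAND_ST", "CAND_ZIP",
)

# Built from the table's example values; address fields are left empty.
SAMPLE_LINE = "H0NC09072|BRAY, SHANNON W|LIB|2022|NC|H|09|C|C|C00654321|||||"


def parse_cn_line(line: str) -> dict:
    """Attach column names to one pipe-delimited cn.txt record."""
    values = line.rstrip("\n").split("|")
    if len(values) != len(FEC_COLUMNS):
        raise ValueError(f"expected {len(FEC_COLUMNS)} fields, got {len(values)}")
    return dict(zip(FEC_COLUMNS, values))


def split_fec_name(cand_name: str) -> dict:
    """Split 'LAST, FIRST MIDDLE' into components (suffixes not handled)."""
    last, _, rest = cand_name.partition(",")
    parts = rest.split()
    return {
        "last": last.strip(),
        "first": parts[0] if parts else None,
        "middle": " ".join(parts[1:]) or None,
    }


record = parse_cn_line(SAMPLE_LINE)
```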

Usage in the pipeline

FEC data serves two purposes:

Stable identifiers. CAND_ID persists across election cycles. A candidate who runs for the same seat in 2020 and 2022 keeps the same ID. This provides a ground-truth link for validating temporal chains built by the L4 layer.
Name cross-referencing. CAND_NAME is parsed at L1 into last, first, middle, and suffix components. These parsed names are compared against MEDSL and state source names during L3 entity resolution. FEC uses LAST, FIRST MIDDLE format consistently, which makes it one of the more predictable sources for name parsing.

Limitations

Federal candidates only. No state legislators, no county commissioners, no school board members. FEC has no jurisdiction over non-federal offices.
Filing ≠ appearing on ballot. Many CAND_ID entries correspond to candidates who filed paperwork but never appeared on a general election ballot.
Party codes differ from other sources. FEC uses codes like LIB, GRE, NNE (None) that do not match MEDSL’s LIBERTARIAN, GREEN, NONPARTISAN labels. Normalization is required at L1.

Future Sources

This chapter documents data sources that have been identified as valuable but are not yet integrated into the pipeline. Each is blocked by a specific access, cost, or engineering constraint.

ALGED — American Local Government Elections Database

The American Local Government Elections Database, hosted on the Open Science Framework (OSF), covers municipal elections in 1,747 cities with populations over 25,000. Records include candidate demographics (race, gender), incumbency status, and election outcomes, which are fields that no other source in our corpus provides.

Coverage: Municipal elections from 2000–2020. Cities only (no counties, no school districts). Focuses on mayoral and city council races.

Format: CSV files organized by city population tier.

Status: Blocked. The OSF repository requires an approved access request. We submitted a request in early 2025 and have not received a response. The underlying data appears to be derived from individual city clerk records, manually curated by the research team.

Value if integrated: ALGED would fill the demographic gap entirely. No other source provides candidate race or gender. It would also provide an independent validation source for municipal races in the 1,747 covered cities.

Ballotpedia

Ballotpedia maintains the most comprehensive database of US local elections, covering school boards, city councils, county commissions, judges, ballot measures, and special districts across all 50 states. Their coverage extends to races that no other source tracks — mosquito abatement districts, water boards, and transit authorities.

Coverage: All 50 states, all levels of government, ongoing since approximately 2007.

Format: Structured database accessible via a paid API. Some data is available on the public website but is not bulk-downloadable.

Status: Blocked by cost. The API requires a commercial license. Pricing is not publicly listed but is reported to be in the five-figure annual range. We have not pursued a license.

Value if integrated: Ballotpedia would be the single largest improvement to local race coverage. It would fill the 7-state local race gap in MEDSL 2022 and provide office-level metadata (term length, salary, appointing authority) that no source currently offers.

AP Elections API

The Associated Press Elections API provides real-time and certified results for federal and state races, with some local coverage in larger jurisdictions. It is the standard data feed used by newsrooms on election night.

Coverage: Federal and statewide races nationwide. County-level results for most races. Precinct-level for some states. Local race coverage varies.

Format: JSON API with WebSocket push for live updates.

Status: Blocked by cost. The AP API is a commercial product priced for newsroom budgets. It is not available for academic or open-source use without a contract. The real-time capability is irrelevant to our pipeline (we process certified results, not live feeds), but the certified result snapshots would be a valuable validation source.

Value if integrated: AP results would serve as a third independent source for federal and statewide races, enabling three-way cross-source validation alongside MEDSL and state portals. AP’s candidate identifiers are stable across cycles, which would simplify temporal chaining for federal candidates.

Additional State Portals

Six states with significant populations publish results through their own election portals in structured formats, most at the precinct level. These would complement MEDSL by providing certified results directly from the state authority.

Florida: The Division of Elections publishes precinct-level results at results.elections.myflorida.com. CSV format. All counties, all contests. Would overlap with both MEDSL and OpenElections FL, enabling three-source validation for one state.

Georgia: The Secretary of State publishes results at results.enr.clarityelections.com/Georgia/ (Clarity-based) and via a separate certified results portal. XML and CSV. Would provide a second source for GA alongside MEDSL.

Texas: The Secretary of State publishes county-level results (not precinct-level) at elections.sos.state.tx.us. Precinct-level results are published by individual counties. A full TX integration would require crawling 254 county websites or using the Clarity instances that many TX counties operate.

Ohio: The Secretary of State publishes precinct-level results at www.ohiosos.gov/elections/election-results-and-data/. CSV format. Covers all contests including local races.

Pennsylvania: The Department of State publishes results at electionreturns.pa.gov. JSON API available. Covers all contests. Would fill one of the 7 states with zero local data in MEDSL 2022.

Michigan: The Secretary of State publishes precinct-level results at miboecfr.nictusa.com/cgi-bin/cfr/. Older web interface with downloadable files. Covers all contests.

Status: Not blocked by access — all six portals are public. Blocked by engineering time. Each state portal has its own format, URL structure, and quirks. We estimate 1–2 weeks of parser development per state. These are the highest-priority engineering tasks after odd-year MEDSL loading.

Name Normalization

Election data arrives with candidate names in dozens of formats. MEDSL uses LAST, FIRST MIDDLE in all caps. NC SBE uses First Last in title case. OpenElections uses whatever the county clerk typed. FEC uses LAST, FIRST MIDDLE SUFFIX. A single candidate can appear as:

CRIST, CHARLES JOSEPH (MEDSL)
Charlie Crist (OpenElections)
Crist, Charlie (FEC)

These are all the same person. A system that treats them as three different candidates produces garbage output. A system that aggressively normalizes them — stripping middle names, collapsing nicknames, removing suffixes — destroys the signal needed to tell different people apart.

The principle: clean without collapsing.

Name decomposition at L1

Every candidate name is decomposed at L1 into six components:

Component	Purpose	Example
`raw`	Original string, unmodified	`CRIST, CHARLES JOSEPH`
`first`	Parsed first name	`CHARLES`
`middle`	Middle name or initial	`JOSEPH`
`last`	Last name	`CRIST`
`suffix`	Generational suffix	`null`
`canonical_first`	Dictionary-normalized first name	`CHARLES`

The canonical_first field is populated by the nickname dictionary. If the raw first name is Charlie, canonical_first becomes Charles. If no mapping exists, canonical_first equals first.

Both first and canonical_first are preserved. The raw nickname is useful signal — it tells you what the candidate goes by. The canonical form is what enables matching.
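A minimal sketch of the decomposition, assuming only two input shapes ("LAST, FIRST MIDDLE" and "First Middle Last"), a three-entry excerpt of the nickname dictionary, and generational suffixes only. The function and table names (decompose, NICKNAMES, SUFFIXES) are illustrative and are not taken from the pipeline.

```python
from typing import Optional

# Hypothetical excerpt of the nickname dictionary (lower-cased keys).
NICKNAMES = {"charlie": "Charles", "ron": "Ronald", "nikki": "Nicole"}
# Generational suffixes, keyed by upper-cased token without a trailing period.
SUFFIXES = {"JR": "Jr", "SR": "Sr", "II": "II", "III": "III", "IV": "IV", "V": "V"}


def _pop_suffix(tokens: list) -> Optional[str]:
    """Remove and return a trailing suffix token, if there is one."""
    if tokens:
        key = tokens[-1].rstrip(".").upper()
        if key in SUFFIXES:
            tokens.pop()
            return SUFFIXES[key]
    return None


def decompose(raw: str) -> dict:
    """Split a raw candidate name into the six L1 components."""
    text = raw.strip()
    if "," in text:
        # Shape: "LAST, FIRST MIDDLE [SUFFIX]"
        last, rest = (part.strip() for part in text.split(",", 1))
        given = rest.split()
        suffix = _pop_suffix(given)
    else:
        # Shape: "First [Middle] Last [Suffix]"
        tokens = text.split()
        suffix = _pop_suffix(tokens)
        last = tokens[-1] if tokens else ""
        given = tokens[:-1]
    first = given[0] if given else ""
    middle = " ".join(given[1:]) or None
    mapped = NICKNAMES.get(first.lower())
    if mapped is None:
        canonical = first  # no mapping: canonical_first equals first
    else:
        # Assumption: keep the source's casing, as in the tables in this chapter.
        canonical = mapped.upper() if first.isupper() else mapped
    return {"raw": raw, "first": first, "middle": middle, "last": last,
            "suffix": suffix, "canonical_first": canonical}
```

Calling decompose("Charlie Crist") keeps first as Charlie and sets canonical_first to Charles, so neither form is lost.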

Real decomposition examples

Five name pairs from our prototype, showing how MEDSL, OpenElections, and NC SBE formats decompose:

Source	Raw Name	`first`	`middle`	`last`	`suffix`	`canonical_first`
MEDSL	`DESANTIS, RON`	`RON`	`null`	`DESANTIS`	`null`	`RONALD`
OpenElections	`Ron DeSantis`	`Ron`	`null`	`DeSantis`	`null`	`Ronald`
MEDSL	`CRIST, CHARLES JOSEPH`	`CHARLES`	`JOSEPH`	`CRIST`	`null`	`CHARLES`
OpenElections	`Charlie Crist`	`Charlie`	`null`	`Crist`	`null`	`Charles`
MEDSL	`DEMINGS, VAL BUTLER`	`VAL`	`BUTLER`	`DEMINGS`	`null`	`VALDEZ`
NC SBE	`Val Demings`	`Val`	`null`	`Demings`	`null`	`Valdez`
MEDSL	`WILLIAMS, ROBERT`	`ROBERT`	`null`	`WILLIAMS`	`null`	`ROBERT`
NC SBE	`Robert Williams Jr`	`Robert`	`null`	`Williams`	`Jr`	`Robert`
MEDSL	`MARSHALL, DAVID S`	`DAVID`	`S`	`MARSHALL`	`null`	`DAVID`
MEDSL	`MARSHALL, DAVID A`	`DAVID`	`A`	`MARSHALL`	`null`	`DAVID`

Key observations from these examples:

Ron DeSantis — Ron maps to Ronald via the nickname dictionary. The embedding score between the two source representations is 0.729 — below any reasonable auto-accept threshold, but the LLM matches them using nickname knowledge.
Charlie Crist — Charlie maps to Charles. The embedding score is 0.451. Without the dictionary, the cascade would need the LLM to know that Charlie is a nickname for Charles. With the dictionary, the canonical forms already match.
Robert Williams vs Robert Williams Jr — The suffix Jr is the only distinguishing feature. These are different people. The embedding scores them at 0.862 — dangerously close to a false positive. See Suffixes.
David S Marshall vs David A Marshall — Different middle initials. David S. Marshall ran in Maine; David A. Marshall ran in Florida. The middle initial is the only signal distinguishing them at the name level. See Nicknames and Middle Initials.

What decomposition enables

With names decomposed into components, downstream layers can:

Exact-match on structured fields: (canonical_first="Timothy", last="Lance", suffix=null) matches across precincts without fuzzy logic. This handles 70% of entity resolution.
Build composite strings for embedding: "{canonical_first} {middle} {last} {suffix} | {party} | {office} | {state} | {county}" includes middle initials and suffixes as disambiguation signal.
Provide structured context to the LLM: Instead of asking “are these the same person?”, the LLM sees parsed components and can reason about specific differences (nickname vs. different name, Jr vs. no suffix).
Block efficiently: Group by (state, last_name_initial) for entity resolution without computing all-pairs similarity.
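The blocking step in the last item can be sketched as follows. This is an illustration under stated assumptions: records are plain dicts with state and last keys, and the helper names (block_key, candidate_pairs) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record: dict) -> tuple:
    """Blocking key: (state, first letter of the last name, upper-cased)."""
    return (record["state"], record["last"][:1].upper())


def candidate_pairs(records):
    """Yield only the pairs that share a block, instead of all pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for members in blocks.values():
        # combinations() yields nothing for blocks with a single member.
        yield from combinations(members, 2)
```

For four records (two Crist rows in FL, one Marshall in ME, one Marshall in FL), all-pairs comparison would produce six pairs; blocking produces one.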

What goes wrong without decomposition

If you treat names as opaque strings:

CRIST, CHARLES JOSEPH and Charlie Crist have a Jaro-Winkler similarity of 0.58 — a miss.
DESANTIS, RON and Ron DeSantis have a cosine embedding similarity of 0.729 — in the ambiguous zone.
Robert Williams and Robert Williams Jr look nearly identical to every string metric. Only structured suffix detection prevents a false merge.
David S Marshall and David A Marshall differ by one character in a middle initial that opaque matching may ignore entirely.

Decomposition is not optional. It is the foundation that every subsequent layer depends on.

The three sub-problems

Name normalization breaks into three sub-problems, each with its own chapter:

Nicknames and Middle Initials — How Charlie becomes Charles and why David S. must stay distinct from David A.
Suffixes: Jr/Sr Means Different People — Why generational suffixes are disambiguation signals, not noise to be stripped.
The Nickname Dictionary — The lookup table that powers canonical_first, its current scope, and its limits.

Nicknames and Middle Initials

Two distinct problems share a root cause: the candidate’s legal name differs from the name on the ballot or in the source file. Nicknames substitute one first name for another. Middle initials appear in some sources and not others. Both must be handled at L1 to preserve signal for L2 and L3.

Nicknames

A nickname replaces the candidate’s legal first name with a familiar variant. The embedding model has no reliable way to recover the connection — it encodes character-level and token-level similarity, not social knowledge about naming conventions.

Real test results from our prototype, using text-embedding-3-large (3,072 dimensions):

Source A	Source B	Nickname → Legal	Cosine	LLM Decision	LLM Confidence
Charlie Crist	CRIST, CHARLES JOSEPH	Charlie → Charles	0.451	match	0.95
Nicole Fried	FRIED, NIKKI	Nikki → Nicole	0.642	match	0.92
Ron DeSantis	DESANTIS, RON	Ron → Ronald	0.729	match	0.98

The Crist result is the critical case. At 0.451, the embedding score falls below any plausible auto-accept threshold — and below many reject thresholds. Without nickname resolution, this pair would be missed entirely or routed to LLM on every encounter.

The fix operates at L1. The nickname dictionary maps Charlie → Charles, Nikki → Nicole, Ron → Ronald, and ~100 other mappings. When the L1 parser decomposes a name, it checks the first name against the dictionary and populates canonical_first:

{
  "raw": "Charlie Crist",
  "first": "Charlie",
  "middle": null,
  "last": "Crist",
  "suffix": null,
  "canonical_first": "Charles"
}

Both first and canonical_first are preserved. The original is kept for display and provenance. The canonical form is used in the L2 composite string for embedding and in the L3 exact-match step. After dictionary application, the L3 exact matcher sees (canonical_first="Charles", last="Crist", suffix=null) on both sides — an exact match with no embedding or LLM call required.

Why the embedding model fails on nicknames

Charlie and Charles share a prefix, but the embedding model must also reconcile Charlie Crist with CRIST, CHARLES JOSEPH: different casing, different ordering, and a middle name that appears in one source but not the other. The model embeds the full composite string, not individual tokens. The combined divergence pushes the cosine score to 0.451.

Ron DeSantis and DESANTIS, RON score higher (0.729) because both sources use the same first-name token and neither adds a middle name; only casing and ordering differ. But 0.729 is still in the ambiguous zone, so it requires an LLM call to confirm.

The nickname dictionary eliminates these LLM calls for known mappings. At scale, this matters: if 5% of candidates use nicknames and each requires an LLM call, that is tens of thousands of unnecessary API round-trips.

Middle Initials

Middle initials are a different problem. They do not substitute one name for another — they add or remove a disambiguation signal.

The key case: David S. Marshall (Maine) and David A. Marshall (Florida) are different people. Without middle initials, both reduce to David Marshall. With middle initials preserved, L2 generates different embedding vectors.

We measured the effect directly:

Variant	Composite A	Composite B	Cosine
No middle	David Marshall | ME	David Marshall | FL	0.7025
With middle	David S Marshall | ME	David A Marshall | FL	0.6448

The middle initial drops the cosine score by 0.058 — enough to shift the pair further from the accept threshold and closer to correct rejection. The principle: middle initials are signal, not noise.

More middle-initial test results from our prototype:

Source A	Source B	Cosine	LLM Decision	Key Signal
Ashley Moody	Ashley B. Moody	0.930	match	Same person, middle added
Val Demings	VAL DEMINGS	0.828	match	Same person, format difference
Dale Holness	DALE V.C. HOLNESS	0.896	match	Same person, middle initials added

Ashley Moody at 0.930 is the same person; the B. appears in one source but not the other. The score falls just below the 0.95 auto-accept threshold, so the pair is accepted on the strength of same-state context and a Jaro-Winkler score of 1.0 on the last name.

How Both Feed Into L2

The L2 composite string for a candidate includes both canonical_first and middle:

{canonical_first} {middle} {last} {suffix} | {party} | {office} | {state} | {county}

For Charlie Crist, this becomes:

Charles  Crist  | DEM | Governor | FL | statewide

For CRIST, CHARLES JOSEPH, this becomes:

Charles Joseph Crist  | DEM | Governor | FL | statewide

The canonical first names now match. The remaining divergence — Joseph as a middle name in one source — is small enough that the embedding score rises well above the ambiguous zone. The nickname dictionary at L1 did the heavy lifting; L2 and L3 finish the job.
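The two composites above can be reproduced with a short sketch. It assumes records are dicts keyed by the template's field names and that null fields are rendered as empty strings, which is what leaves the double spaces in the examples. The names TEMPLATE, FIELDS, and composite are illustrative.

```python
TEMPLATE = ("{canonical_first} {middle} {last} {suffix} "
            "| {party} | {office} | {state} | {county}")
FIELDS = ("canonical_first", "middle", "last", "suffix",
          "party", "office", "state", "county")


def composite(record: dict) -> str:
    """Render the L2 composite string; null fields become empty strings."""
    values = {field: record.get(field) or "" for field in FIELDS}
    return TEMPLATE.format(**values)


crist_short = {"canonical_first": "Charles", "middle": None, "last": "Crist",
               "suffix": None, "party": "DEM", "office": "Governor",
               "state": "FL", "county": "statewide"}
crist_long = dict(crist_short, middle="Joseph")

print(composite(crist_short))
print(composite(crist_long))
```

Whether the prototype collapses the extra whitespace before embedding is not stated here; this sketch leaves it in place to match the examples.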

The Combined Rule

At L1, apply the nickname dictionary to populate canonical_first.
At L1, preserve middle exactly as parsed — do not strip it, do not normalize it.
At L2, include both canonical_first and middle in the composite string.
At L3 exact match, match on (canonical_first, last, suffix) — middle is not required for exact match but is used for disambiguation when multiple candidates share the same canonical first and last name.
At L3 LLM confirmation, provide both the raw and canonical names so the model can reason about nickname relationships and middle-initial differences.

The principle behind both: clean without collapsing. Normalize what you can (nicknames to canonical forms), preserve what you must (middle initials as disambiguation signal), and let downstream layers use the full context.
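The L3 exact-match rule can be sketched as below. Two details are assumptions of this sketch rather than documented behavior: name fields are compared case-insensitively, and two records are treated as conflicting only when both carry a middle name and the initials differ. The function names are hypothetical.

```python
def exact_key(record: dict) -> tuple:
    """Exact-match key: (canonical_first, last, suffix), case-folded names."""
    return (record["canonical_first"].casefold(),
            record["last"].casefold(),
            record["suffix"])


def middles_conflict(a: dict, b: dict) -> bool:
    """True only when both records have a middle name and the initials differ."""
    middle_a, middle_b = a.get("middle"), b.get("middle")
    if not middle_a or not middle_b:
        return False  # a missing middle is not evidence of a different person
    return middle_a[0].casefold() != middle_b[0].casefold()


def exact_match(a: dict, b: dict) -> bool:
    """Match on the key, then use middle initials only to disambiguate."""
    return exact_key(a) == exact_key(b) and not middles_conflict(a, b)
```

Under these assumptions, Charlie Crist and CRIST, CHARLES JOSEPH match exactly, while David S Marshall and David A Marshall do not.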

Suffixes: Jr/Sr Means Different People

A suffix is not decoration. Robert Williams and Robert Williams Jr. are two different human beings — father and son, or grandfather and grandson. Any system that strips suffixes before matching will merge them into one entity.

This is not a hypothetical risk. It happened in our prototype.

The Williams Problem

In Florida 2022, two candidates share the name Robert Williams:

Source	Raw Name	Office	Votes
MEDSL	ROBERT WILLIAMS	State House District 41	12,304
OpenElections	Robert Williams Jr	State House District 41	12,304

The embedding similarity between these two strings is 0.862, above our original auto-accept threshold of 0.82. Without suffix-aware logic, the cascade would have merged them automatically, with no LLM consulted.

The LLM caught it. Given the full context, Claude Sonnet returned:

Decision: no_match (confidence: 0.85)

“The suffix ‘Jr’ indicates a generational distinction. ‘Robert Williams’ and ‘Robert Williams Jr’ are most likely two different individuals — typically father and son. While the office and vote counts match, the suffix is a strong signal that these are distinct people. The MEDSL record may have dropped the suffix during their cleaning process. I recommend treating these as separate entities until a human reviewer can confirm.”

This single case changed our threshold design.

How This Changed the Thresholds

Before Williams Jr., auto-accept was ≥ 0.82. A score of 0.862 would have been accepted without review. After this finding, we raised auto-accept to ≥ 0.95, ensuring that suffix-bearing pairs always enter the LLM zone (0.35–0.95) where the model can reason about generational distinctions.

Threshold	Before	After	Reason
Auto-accept	≥ 0.82	≥ 0.95	Williams Jr at 0.862 was a false positive
Ambiguous (LLM zone)	0.65–0.82	0.35–0.95	Wider zone catches more edge cases
Auto-reject	< 0.65	< 0.35	Crist at 0.451 was a false negative

The wider ambiguous zone sends more pairs to the LLM. Budget is not a constraint — accuracy is.
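The effect of the threshold change on the two cases named in the table can be checked with a small routing sketch. The function name route and its return labels are illustrative; the threshold values come from the table above.

```python
def route(cosine: float, accept: float = 0.95, reject: float = 0.35) -> str:
    """Route a pair by embedding score: accept, reject, or send to the LLM."""
    if cosine >= accept:
        return "accept"
    if cosine < reject:
        return "reject"
    return "llm"


for label, score in [("Williams Jr", 0.862), ("Crist", 0.451)]:
    old = route(score, accept=0.82, reject=0.65)  # thresholds before the change
    new = route(score)                            # thresholds after the change
    print(f"{label}: {score} old={old} new={new}")
```

With the old thresholds, Williams Jr is auto-accepted and Crist is auto-rejected; with the new ones, both go to the LLM.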

Suffix-Aware Logic in the Cascade

Suffixes receive special treatment at multiple stages:

L1 — Decomposition. The name parser extracts Jr, Sr, II, III, IV, V, Esq, and PhD into the suffix field. Both Jr. and Jr (with and without period) normalize to Jr. The suffix is never discarded.

Step 1 — Exact match. The exact match key is (canonical_first, last, suffix). “Timothy Lance” and “Timothy Lance” match. “Robert Williams” and “Robert Williams Jr” do not — the suffix field differs (null vs “Jr”).

Step 3 — Embedding. The suffix is included in the composite string: {canonical_first} {middle} {last} {suffix} | {party} | {office} | {state} | {county}. This means “Robert Williams” and “Robert Williams Jr” produce different vectors, but the difference is small (0.862 cosine) because the model treats “Jr” as a minor token.

Step 4 — LLM confirmation. The prompt explicitly includes both suffix fields and instructs the model: “A suffix like Jr or Sr typically indicates a different person (parent vs child). Do not match across suffixes unless you have strong evidence they refer to the same individual.” The LLM sees the structured fields, not just the raw strings.

The Suffix Inventory

From MEDSL 2022 data across all 50 states:

Suffix	Occurrences	Notes
Jr	1,847	Most common; often dropped by one source
Sr	312	Almost always appears alongside a Jr in the same jurisdiction
II	478	Increasingly common; same disambiguation need as Jr
III	189	Rarer but unambiguous signal
IV	31
V	4

The Jr/Sr problem is not rare. Nearly 2,000 candidates in a single election cycle carry a Jr suffix, and an unknown number of their non-suffixed counterparts exist in the same dataset.

When Suffixes Are Missing

The harder case is when one source includes the suffix and another drops it. MEDSL strips suffixes more aggressively than NC SBE. OpenElections preserves them inconsistently. This means the cascade must handle the asymmetric case: one record has suffix “Jr”, the other has suffix null.

The rule: a null suffix does not match a non-null suffix. Null-to-null matches normally. “Jr” to “Jr” matches normally. But null to “Jr” always enters the LLM zone, regardless of embedding score. The LLM can then examine vote counts, office, and geographic context to determine whether the missing suffix is a data quality issue (same person, suffix dropped) or a genuine distinction (father and son).

This is conservative by design. We would rather send 1,847 extra pairs to the LLM than silently merge fathers with sons.
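A sketch of the asymmetric-suffix rule, layered on score-based routing. The names normalize_suffix and route_pair are hypothetical. The sketch covers only the null versus non-null case described above; how two different non-null suffixes (Jr versus Sr) are routed is not specified here, so they fall through to ordinary score routing in this illustration.

```python
from typing import Optional

# Suffix tokens extracted at L1, keyed by upper-cased form without a period.
SUFFIX_FORMS = {"JR": "Jr", "SR": "Sr", "II": "II", "III": "III",
                "IV": "IV", "V": "V", "ESQ": "Esq", "PHD": "PhD"}


def normalize_suffix(token: Optional[str]) -> Optional[str]:
    """Map 'Jr.', 'JR', 'jr' to 'Jr'; unknown or empty tokens become None."""
    if not token:
        return None
    return SUFFIX_FORMS.get(token.strip().rstrip(".").upper())


def route_pair(cosine: float, suffix_a: Optional[str], suffix_b: Optional[str],
               accept: float = 0.95, reject: float = 0.35) -> str:
    """Null-vs-suffix pairs always go to the LLM, whatever the score."""
    a, b = normalize_suffix(suffix_a), normalize_suffix(suffix_b)
    if (a is None) != (b is None):
        return "llm"
    if cosine >= accept:
        return "accept"
    if cosine < reject:
        return "reject"
    return "llm"
```

Even a near-perfect score such as 0.99 cannot auto-accept a pair in which only one side carries a suffix.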

Cross-Reference

Entity Resolution Overview — where suffixes fit in the cascade
Threshold Calibration — old vs. new thresholds driven by this finding
Real Test Cases — Williams Jr and all other tested pairs

The Nickname Dictionary

The nickname dictionary is a static lookup table applied at L1 during name decomposition. It maps common short names and nicknames to their formal equivalents, populating the canonical_first field while preserving the original first field unchanged.

Scope

The prototype dictionary contains approximately 100 mappings covering the most frequent English-language nicknames encountered in US election data:

Raw first	canonical_first	Frequency in MEDSL 2022
Bill	William	847
Bob	Robert	612
Jim	James	589
Mike	Michael	534
Charlie	Charles	201
Ron	Ronald	187
Nikki	Nicole	42
Ted	Edward	31
Dick	Richard	28
Peggy	Margaret	19

The target for production is 500+ mappings, expanding to cover Spanish-language nicknames (Pepe→José, Pancho→Francisco), regional variants, and less common English forms. The full reference list is maintained in Appendix: Full Nickname Dictionary.

Both forms are preserved

When the dictionary maps Charlie → Charles, the L1 record stores both:

{
  "first": "Charlie",
  "canonical_first": "Charles"
}

The original first is never overwritten. The composite string sent to L2 embedding uses canonical_first, so both records enter L2 with Charles as the first name, even though the raw cosine similarity between “Charlie Crist” and “CRIST, CHARLES JOSEPH” is only 0.451.

The Ted problem

Some nicknames are ambiguous. “Ted” can map to Edward (Ted Kennedy) or Theodore (Ted Stevens). “Bill” is unambiguous: it always maps to William. “Ted” is not.

The current dictionary maps Ted → Edward, which is the more common historical usage in US politics. This is wrong for Theodore-named candidates. The correct resolution requires context that L1 does not have: party, state, office, or a reference database of known candidates.

The planned fix is a two-pass approach: L1 applies the majority mapping (Ted → Edward), and L3 entity resolution can override it when the LLM has enough context to determine the correct expansion. The canonical_first field is treated as a best guess at L1, not a final answer.

Other ambiguous nicknames with the same property: Pat (Patricia or Patrick), Chris (Christopher or Christine), Alex (Alexander or Alexandra), Sam (Samuel or Samantha). For these, L1 does not apply a mapping — canonical_first is left equal to first — and disambiguation is deferred to L3.
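The lookup behavior described in this chapter can be sketched as follows. The dictionary contents come from the scope table above; the provisional flag, the set names, and the function name are assumptions made for illustration, standing in for whatever mechanism lets L3 revisit an L1 guess.

```python
# Ten mappings from the scope table (lower-cased keys).
NICKNAMES = {
    "bill": "William", "bob": "Robert", "jim": "James", "mike": "Michael",
    "charlie": "Charles", "ron": "Ronald", "nikki": "Nicole",
    "ted": "Edward", "dick": "Richard", "peggy": "Margaret",
}
# Mapped at L1 by majority usage, but open to override at L3.
MAJORITY_GUESS = {"ted"}
# Left unmapped at L1; disambiguation is deferred to L3.
UNMAPPED_AMBIGUOUS = {"pat", "chris", "alex", "sam"}


def canonical_first(first: str) -> tuple:
    """Return (canonical_first, provisional) for a parsed first name."""
    key = first.casefold()
    if key in UNMAPPED_AMBIGUOUS:
        return first, True
    mapped = NICKNAMES.get(key)
    if mapped is None:
        return first, False  # no mapping: canonical_first equals first
    return mapped, key in MAJORITY_GUESS
```

Here canonical_first("Ted") returns Edward with the provisional flag set, and canonical_first("Pat") returns Pat unchanged with the flag set.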

Office Classification

MEDSL 2022 contains 8,387 unique office names across all 50 states and DC. These are not 8,387 distinct offices; they are 8,387 different strings that humans typed to describe elected positions. “Board of Education”, “BOARD OF ED.”, “BOE”, “School Board”, and “Board of Education Members” all refer to the same type of office. “DALLAS COUNTY JUDGE” means a chief executive in Texas and a judicial officer in most other states.

Classifying these strings into a consistent taxonomy is required for every downstream operation: blocking for entity resolution, computing competitiveness by office type, comparing the same office across states, and answering “what offices exist in my county?”

The taxonomy

Every office is classified into two fields:

Field	Values	Example
`office_level`	`federal`, `state`, `county`, `municipal`, `school_district`, `special_district`, `judicial`, `tribal`	`school_district`
`office_branch`	`executive`, `legislative`, `judicial`, `law_enforcement`, `fiscal`, `education`, `infrastructure`, `regulatory`, `other`	`education`

The pair (office_level, office_branch) defines the classification. “Board of Education” → (school_district, education). “County Sheriff” → (county, law_enforcement). “City Council” → (municipal, legislative).

The scale of the problem

Of the 8,387 unique office names in MEDSL 2022:

Characteristic	Count	Percentage
Appear in only 1 state	6,241	74.4%
Appear in only 1 county	4,995	59.6%
Appear in 10+ states	312	3.7%
Contain a proper noun (county/city name)	3,108	37.1%

Most office names are effectively unique strings. “DALLAS COUNTY JUDGE”, “Collier Mosquito Control District”, “Santa Rosa Island Authority” — these appear once in the entire national dataset. No keyword list can enumerate them all. The classifier must generalize.

Four-tier approach

The classifier runs four tiers in sequence. Each tier handles what the previous tier could not. A record classified at tier 1 is never re-examined by tier 2.

Tier	Method	Unique names handled	Cumulative %	Cost
1	Keyword lookup	~3,775	~45.0%	$0
2	Regex patterns	~1,426	~62.0%	$0
3	Embedding nearest-neighbor	~378	~66.5%	~$0.01/1K
4	LLM classification	~42	~67.0%	~$0.002/call
—	Unclassified (`other`)	~2,766	100%	—

The remaining ~33% classified as other are primarily hyper-local offices (township-specific roles, water district sub-boards, tribal offices) that require either expanded reference data or manual review. The other rate drops as the keyword and regex lists expand.

Note: Percentages are based on unique office name strings. By record count, the coverage is much higher — the 312 names that appear in 10+ states account for millions of records. Keyword tier 1 alone handles ~85% of records by volume.
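The cumulative percentages in the tier table follow directly from the per-tier counts. A short check, using only the approximate counts listed above (variable names are illustrative):

```python
TOTAL_UNIQUE_NAMES = 8387
TIER_COUNTS = [("keyword", 3775), ("regex", 1426), ("embedding", 378), ("llm", 42)]


def cumulative_coverage(tier_counts, total):
    """Return per-tier rows (name, count, cumulative %) and the remainder."""
    rows, running = [], 0
    for name, count in tier_counts:
        running += count
        rows.append((name, count, round(100 * running / total, 1)))
    return rows, total - running


rows, unclassified = cumulative_coverage(TIER_COUNTS, TOTAL_UNIQUE_NAMES)
for name, count, pct in rows:
    print(f"{name:<10}{count:>6}{pct:>7.1f}%")
print(f"unclassified: {unclassified}")
```

The remainder is the set of names that end up classified as other.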

Tier 1: Keyword lookup

A table of ~170 keywords mapped to (office_level, office_branch) pairs. If any keyword appears in the office name string, the classification is assigned.

Keyword	office_level	office_branch	Example match
`sheriff`	county	law_enforcement	“WARREN COUNTY SHERIFF”
`board of education`	school_district	education	“COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02”
`city council`	municipal	legislative	“CITY COUNCIL WARD 3”
`coroner`	county	fiscal	“COUNTY CORONER”
`constable`	county	law_enforcement	“CONSTABLE PRECINCT 4”

Keywords are matched case-insensitively. When multiple keywords match, the most specific wins (“county board of education” matches board of education → school_district, not county → county). The keyword table is maintained in the appendix.

Keyword lookup handles approximately 45% of unique office name strings and ~85% of total records. The most common offices — sheriff, school board, city council, county commission — all have unambiguous keywords.
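A minimal sketch of tier 1, assuming that "most specific" means the longest matching keyword. That tie-break is an assumption of this sketch, not a documented rule, and the generic county entry (including its branch) is a hypothetical placeholder added to show the conflict.

```python
from typing import Optional, Tuple

KEYWORDS = {
    "sheriff": ("county", "law_enforcement"),
    "board of education": ("school_district", "education"),
    "city council": ("municipal", "legislative"),
    "constable": ("county", "law_enforcement"),
    "county": ("county", "other"),  # hypothetical generic entry
}


def classify_keyword(office_name: str) -> Optional[Tuple[str, str]]:
    """Case-insensitive substring match; the longest matching keyword wins."""
    text = office_name.casefold()
    hits = [keyword for keyword in KEYWORDS if keyword in text]
    if not hits:
        return None  # falls through to tier 2
    return KEYWORDS[max(hits, key=len)]
```

For "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02", both county and board of education match, and the longer keyword decides the classification.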

Tier 2: Regex patterns

Approximately 40 regex patterns handle structured variations that keywords miss. Patterns capture positional and combinatorial relationships:

Pattern	office_level	office_branch	Example match
`county\s+commission`	county	legislative	“CLARK COUNTY COMMISSION DIST 2”
`district\s+court\s+judge`	judicial	judicial	“15TH DISTRICT COURT JUDGE”
`register\s+of\s+(deeds|wills)`	county	fiscal	“REGISTER OF DEEDS”
`soil.*water.*conservation`	special_district	infrastructure	“SOIL AND WATER CONSERVATION DISTRICT SUPERVISOR”
`(mayor|alcalde)`	municipal	executive	“MAYOR - CITY OF SPRINGFIELD”

Regex patterns add approximately 17% of unique names beyond what keywords catch. Combined with tier 1, the two deterministic tiers handle ~62% of unique names and ~92% of records by volume.
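Tier 2 can be sketched as an ordered list of compiled patterns, with the first match deciding the classification. The first-match-wins ordering and the function name are assumptions of this sketch. The soil and water pattern uses wildcards so that "AND" or "&" between the words still matches.

```python
import re
from typing import Optional, Tuple

PATTERNS = [
    (re.compile(r"county\s+commission", re.IGNORECASE), ("county", "legislative")),
    (re.compile(r"district\s+court\s+judge", re.IGNORECASE), ("judicial", "judicial")),
    (re.compile(r"register\s+of\s+(deeds|wills)", re.IGNORECASE), ("county", "fiscal")),
    (re.compile(r"soil.*water.*conservation", re.IGNORECASE),
     ("special_district", "infrastructure")),
    (re.compile(r"(mayor|alcalde)", re.IGNORECASE), ("municipal", "executive")),
]


def classify_regex(office_name: str) -> Optional[Tuple[str, str]]:
    """Return the classification of the first pattern found in the name."""
    for pattern, classification in PATTERNS:
        if pattern.search(office_name):
            return classification
    return None  # falls through to tier 3
```

Using search rather than match lets a pattern hit anywhere in the string, which is needed for names such as "CLARK COUNTY COMMISSION DIST 2".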

Tier 3: Embedding nearest-neighbor

For names that survive tiers 1 and 2, L2 generates an embedding using text-embedding-3-large and finds the nearest neighbor in a reference set of ~200 pre-classified office names.

Real example from our prototype:

Input: “Collier Mosquito Control District”
Nearest neighbor: “Mosquito Control District” (reference set)
Cosine similarity: 0.787
Classification: (special_district, infrastructure)

The tier 3 accept threshold is cosine ≥ 0.60. Below that, the match is too uncertain and the record passes to tier 4. In our prototype, tier 3 classified ~378 names (~4.5% of all unique names, out of the ~3,186 that reach this tier) with a manual-review accuracy of 94%.

The 200-name reference set was curated from the most common office names across all states, covering every (office_level, office_branch) pair with at least 3 reference examples. Expanding this set to 500+ names is a planned improvement.
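The nearest-neighbor step can be sketched without calling an embedding API by passing vectors in directly. The three-dimensional toy vectors below are invented for illustration and are not real text-embedding-3-large output; the function names are hypothetical.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_reference(vector, reference, threshold: float = 0.60):
    """reference: list of (name, vector, classification) tuples.

    Returns (name, classification, score) for the closest reference entry,
    or None when the best score is below the threshold (pass to tier 4).
    """
    name, ref_vector, classification = max(
        reference, key=lambda entry: cosine(vector, entry[1]))
    score = cosine(vector, ref_vector)
    if score >= threshold:
        return name, classification, score
    return None


# Toy reference set with made-up vectors.
REFERENCE = [
    ("Mosquito Control District", (1.0, 0.0, 0.0), ("special_district", "infrastructure")),
    ("Board of Education", (0.0, 1.0, 0.0), ("school_district", "education")),
]
```

In a real run, the vectors would come from the embedding model and the reference list would hold the ~200 curated names.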

Tier 4: LLM classification

Remaining unclassified names go to Claude Sonnet with the full context: office name, state, county, and the taxonomy definition.

Real examples from our prototype:

Office name	State	LLM classification	Confidence
Santa Rosa Island Authority	FL	special_district / infrastructure	0.90
Mosquito Control Board Member	FL	special_district / infrastructure	0.95
Judge of Compensation Claims	FL	judicial / judicial	0.88
Public Administrator	MO	county / fiscal	0.82
Recorder of Deeds	MO	county / fiscal	0.95
Drainage Commissioner	IL	special_district / infrastructure	0.85
Fence Viewer	VT	municipal / regulatory	0.70
Pound Keeper	NH	municipal / regulatory	0.65
Hog Reeve	NH	municipal / regulatory	0.60

In our prototype, the LLM classified 9 hard cases with 100% accuracy against manual review. The lower-confidence cases (Fence Viewer at 0.70, Hog Reeve at 0.60) are genuine obscure New England town offices that even the LLM finds unusual — but it classified them correctly.

The state-context problem

“DALLAS COUNTY JUDGE” illustrates why state context matters. In Texas, the county judge is the presiding officer of the commissioners court: an executive role, not a judicial one. In most other states, a county judge sits on the bench.

The keyword classifier alone cannot resolve this. The word “judge” appears, suggesting judicial. But the Texas county judge is (county, executive).

The fix is a state-specific override table in tier 1. Before general keyword matching, a small set of (state, keyword) → classification entries handles known exceptions:

State	Office pattern	Correct classification
TX	county judge	county / executive
LA	parish president	county / executive
LA	police jury	county / legislative
AK	borough assembly	county / legislative

This table is currently small (~15 entries). As more state-specific offices are identified, it grows. The pattern generalizes: when the same word means different things in different states, the state-specific override takes priority.
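A sketch of how the override table can take priority over general keywords. The four override entries come from the table above; the single general keyword and the function name are placeholders for the full tier 1 lookup.

```python
from typing import Optional, Tuple

STATE_OVERRIDES = {
    ("TX", "county judge"): ("county", "executive"),
    ("LA", "parish president"): ("county", "executive"),
    ("LA", "police jury"): ("county", "legislative"),
    ("AK", "borough assembly"): ("county", "legislative"),
}
# Placeholder for the general keyword table.
GENERAL_KEYWORDS = {"judge": ("county", "judicial")}


def classify_with_state(state: str, office_name: str) -> Optional[Tuple[str, str]]:
    """Check (state, pattern) overrides first, then general keywords."""
    text = office_name.casefold()
    for (override_state, pattern), classification in STATE_OVERRIDES.items():
        if override_state == state and pattern in text:
            return classification
    for keyword, classification in GENERAL_KEYWORDS.items():
        if keyword in text:
            return classification
    return None
```

The same string, "DALLAS COUNTY JUDGE", classifies as county / executive when the state is TX and as county / judicial when the state has no override.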

Accuracy by tier

Tier	Method	Accuracy (manual review)	False positive rate
1	Keyword	99.2%	< 0.5%
2	Regex	97.8%	~1.0%
3	Embedding NN	94.0%	~3.5%
4	LLM	100% (N=9)	0% (N=9)

Tier 1 and 2 errors are almost entirely from the state-context problem (a keyword matching the wrong sense of the word). Tier 3 errors come from embedding matches that are semantically close but functionally wrong — “Tax Collector” matching to “Tax Assessor” when they are separate offices in some states.

Cross-references

The Four-Tier Classifier — step-by-step walkthrough with a single office name through all four tiers
Appendix: Office Classification Reference — full keyword table and regex pattern list

The Four-Tier Classifier

Office classification proceeds through four tiers in strict order. Each tier handles a progressively harder subset of the 8,387 unique office names found in MEDSL 2022. A name classified at tier 1 never reaches tier 2. A name classified at tier 2 never reaches tier 3. The tiers are ordered by cost: deterministic and free first, embedding-based second, LLM last.
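The strict ordering can be sketched as a loop over tier functions that stops at the first result. The signature (office name and state in, classification or None out) and the name run_cascade are assumptions for illustration; each real tier would wrap the keyword, regex, embedding, or LLM logic.

```python
from typing import Callable, Optional, Sequence, Tuple

Classification = Tuple[str, str]
Tier = Callable[[str, str], Optional[Classification]]


def run_cascade(office_name: str, state: str, tiers: Sequence[Tier]):
    """Try tiers in order; return (tier_number, classification).

    Later tiers are never called once an earlier tier returns a result.
    Returns (None, None) when every tier declines, which corresponds to
    the unclassified remainder.
    """
    for number, tier in enumerate(tiers, start=1):
        result = tier(office_name, state)
        if result is not None:
            return number, result
    return None, None
```

Because the loop returns early, the cheaper deterministic tiers shield the embedding and LLM tiers from most names.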

Tier 1: Keyword Match

A lookup table of 170 keyword entries maps office name substrings to (office_level, office_branch) pairs. Matching is case-insensitive and checks for substring containment.

Example:

Raw office name: WARREN COUNTY BOARD OF EDUCATION

The keyword table contains:

Keyword	office_level	office_branch
board of education	school_district	education

"board of education" appears as a substring → classified as school_district/education.

Coverage: ~3,775 of 8,387 unique names (~45.0%). These are the offices with unambiguous keywords: sheriff, coroner, board of education, city council, state senate, district court, county clerk, school board, mayor, constable, treasurer.

Limitations: Keyword matching is context-free. DALLAS COUNTY JUDGE contains judge, which maps to county/judicial. In Texas, the County Judge is the chief executive, so county/executive is correct. Plain keyword matching gets this wrong; the fix is a state-context override table applied before keyword matching.

Tier 2: Regex Patterns

Approximately 40 regular expressions handle office names with structural patterns that keywords alone cannot capture.

Example:

Raw office name: CLERK OF THE CIRCUIT COURT, 11TH JUDICIAL CIRCUIT

Regex pattern: clerk\s+of\s+(the\s+)?(circuit|district|superior)\s+court

Match → classified as county/judicial.

Other regex examples:

Pattern	Matches	Classification
`county\s+commission`	County Commissioner, County Commission District 3	county/legislative
`(city|town|village)\s+council`	City Council Ward 2, Town Council At Large	municipal/legislative
`district\s+\d+\s+judge`	District 14 Judge, District 3 Judge	county/judicial
`soil\s+and\s+water`	Soil and Water Conservation District Supervisor	special_district/infrastructure

Coverage: ~1,426 additional unique names (~17.0%), bringing the cumulative total to ~62.0%.

Limitations: Regex patterns are brittle against novel phrasings. CONSERVATION DISTRICT BOARD MEMBER does not match the soil-and-water pattern. Regex also cannot handle the 4,995 office names that appear in exactly one county — writing a pattern for each is infeasible.

Tier 3: Embedding Nearest Neighbor

The remaining ~3,186 unclassified office names are embedded using text-embedding-3-large and compared against a reference set of ~200 pre-classified office names. The nearest neighbor’s classification is assigned if cosine similarity exceeds 0.60.

Example:

Raw office name: Collier Mosquito Control District

Nearest reference: Mosquito Control District → special_district/infrastructure

Cosine similarity: 0.787

0.787 > 0.60 → classified as special_district/infrastructure with confidence 0.787.

Other tier 3 results:

Unclassified Name	Nearest Reference	Cosine	Classification
Collier Mosquito Control District	Mosquito Control District	0.787	special_district/infrastructure
Eastern Carrituck Fire & Rescue	Fire Protection District	0.724	special_district/infrastructure
Lowndes County Bd of Ed	Board of Education	0.831	school_district/education
Hospital Authority Board	Hospital District	0.692	special_district/health

Coverage: ~378 additional unique names (~4.5%), bringing the cumulative total to ~66.5%.

What falls through: Office names with no close reference analog, names below the 0.60 threshold, and names whose nearest neighbor is misleading (e.g., Community Development District matching Community College District at 0.71 — wrong classification). These proceed to tier 4.
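The nearest-neighbor rule can be sketched in a few lines. The vectors here are made-up three-dimensional stand-ins for real embeddings, and `classify_tier3` is a hypothetical name; only the 0.60 threshold and the reference labels come from the text.

```python
import math
from typing import List, Optional, Sequence, Tuple

Reference = Tuple[str, Sequence[float], str]  # (name, vector, label)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_tier3(query: Sequence[float], refs: List[Reference],
                   threshold: float = 0.60) -> Optional[Tuple[str, float]]:
    """Return (label, similarity) of the nearest reference if it clears the threshold."""
    best = max(refs, key=lambda ref: cosine(query, ref[1]))
    score = cosine(query, best[1])
    if score > threshold:
        return best[2], score
    return None  # proceeds to tier 4


# Toy vectors for illustration only; real embeddings have 3,072 dimensions.
REFS: List[Reference] = [
    ("Mosquito Control District", [0.9, 0.1, 0.0], "special_district/infrastructure"),
    ("Board of Education", [0.0, 0.2, 0.9], "school_district/education"),
]
```

Note that the threshold only guards against weak matches. A misleading neighbor that scores above 0.60, like the Community Development District case, still passes this check.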

Tier 4: LLM Classification

The final tier sends unclassified office names to Claude Sonnet with a structured prompt that includes the office name, state, and the full taxonomy of (office_level, office_branch) pairs.

Example:

Raw office name: Santa Rosa Island Authority

State: Florida

The LLM prompt provides the taxonomy and asks: “Classify this office into the most appropriate (office_level, office_branch) pair. Explain your reasoning.”

LLM response:

Classification: special_district/infrastructure (confidence: 0.90)

“The Santa Rosa Island Authority is a special-purpose governmental entity in Escambia County, Florida, responsible for managing development and infrastructure on Santa Rosa Island (Pensacola Beach). It is not a general-purpose county or municipal government. ‘Special district’ at the ‘infrastructure’ branch is the best fit.”

Coverage: ~42 additional unique names (~0.5%) in our prototype evaluation, classified with 100% accuracy against manual review (9 of 9 hard cases correct).

Other tier 4 examples:

Office Name	State	LLM Classification	Confidence
Santa Rosa Island Authority	FL	special_district/infrastructure	0.90
Cuyahoga County Executive	OH	county/executive	0.95
Drainage Commissioner	IL	special_district/infrastructure	0.85
Register of Mesne Conveyances	SC	county/judicial	0.88

The South Carolina example is illustrative: “Register of Mesne Conveyances” is an office that exists in exactly one state. No keyword, regex, or embedding reference can classify it without external knowledge. The LLM knows that mesne conveyances is a legal term related to property transfers and that the Register is a judicial officer.
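Because the LLM returns free text, its answer has to be parsed and checked before it is stored. The sketch below assumes the response contains a line shaped like the `Classification: ... (confidence: ...)` example above; the regular expression, the `parse_tier4` name, and the three-entry taxonomy excerpt are assumptions.

```python
import re
from typing import Optional, Tuple

# Illustrative excerpt; the full taxonomy of pairs is assumed to be longer.
TAXONOMY = {
    ("special_district", "infrastructure"),
    ("county", "executive"),
    ("county", "judicial"),
}

RESPONSE_LINE = re.compile(
    r"Classification:\s*(\w+)/(\w+)\s*\(confidence:\s*([01](?:\.\d+)?)\)"
)


def parse_tier4(response: str) -> Optional[Tuple[str, str, float]]:
    """Extract (office_level, office_branch, confidence), or None if unusable."""
    match = RESPONSE_LINE.search(response)
    if match is None:
        return None
    level, branch, confidence = match.group(1), match.group(2), float(match.group(3))
    # Reject any label the model produced that is not in the taxonomy.
    if (level, branch) not in TAXONOMY:
        return None
    return level, branch, confidence
```

Validating against the taxonomy means a hallucinated category falls through to `other` for human review instead of entering the dataset.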

Tier Summary

Tier	Method	Unique Names	Cumulative %	Cost per Name	Deterministic
1	Keyword (170 entries)	~3,775	45.0%	$0	Yes
2	Regex (~40 patterns)	~1,426	62.0%	$0	Yes
3	Embedding NN (200 refs)	~378	66.5%	~$0.0001	Yes*
4	LLM	~42	67.0%	~$0.001	No
—	Unclassified / `other`	~2,766	100%	—	—

* Deterministic given the same embedding model version.

The remaining ~33% classified as other are office names that did not pass through our full pipeline in the prototype. At production scale, tiers 1–4 are projected to handle ~99.5% of names, with ~0.5% remaining as other pending human review.

Why Four Tiers Instead of Just the LLM

Three reasons:

Speed. Keyword and regex classify 62% of names in microseconds. Embedding NN classifies 4.5% more in milliseconds. Sending all 8,387 names to the LLM would take minutes and achieve the same result for the easy cases.
Reproducibility. Tiers 1–3 produce identical output on every run. Tier 4 may produce slightly different reasoning (though classifications are stable in practice). Minimizing non-deterministic surface area makes the pipeline easier to audit.
Debuggability. When a classification is wrong, the classifier_method field tells you which tier produced it. A wrong keyword mapping is a one-line table fix. A wrong regex is a pattern edit. A wrong embedding match means the reference set needs expansion. A wrong LLM classification means the prompt needs refinement. Each failure mode has a distinct fix.

Cross-Reference

Office Classification Overview — the 8,387-name problem and tier coverage statistics
Appendix: Office Classification Reference — full keyword and regex lists
L1: Cleaned — where tiers 1–2 run
L2: Embedded — where tier 3 runs
When the LLM Gets Called — tier 4 invocation policy

Entity Resolution

Entity resolution — determining that two records refer to the same human being — is the single hardest problem in this project. It is also the most consequential. Eight of the 30 query types identified in What Questions Should Be Answerable depend on it: career tracking, cross-source reconciliation, candidate deduplication, party switch detection, multi-cycle competitiveness analysis, incumbent identification, name standardization, and cross-election turnout comparison.

The problem is cross-cutting. It touches every source, every state, every election, and every office level. Get it wrong and you merge fathers with sons, split one candidate into three, or silently drop a career that spans six election cycles.

The Scale Problem

MEDSL 2022 alone contains approximately 42 million rows. A naive all-pairs comparison would require ~8.8 × 10¹⁴ similarity computations. Even at 1 million comparisons per second, that is 28 years of wall-clock time. Entity resolution at this scale requires a cascade that eliminates the vast majority of comparisons before reaching expensive methods.
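The arithmetic behind those figures is the unordered-pair count n(n − 1)/2. A short check, using the row count and comparison rate stated above (the variable names are arbitrary):

```python
rows = 42_000_000
comparisons_per_second = 1_000_000

# Number of unordered pairs among n rows: n * (n - 1) / 2.
pairs = rows * (rows - 1) // 2

seconds = pairs / comparisons_per_second
years = seconds / (365.25 * 24 * 3600)

print(f"{pairs:.2e} pairs, about {years:.0f} years at {comparisons_per_second:,}/s")
```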

The Cascade

Our entity resolution pipeline is a five-step cascade. Each step is cheaper and faster than the next. Each step either resolves the pair (match or no-match) or passes it to the next step.

Step	Method	Resolves	Cost per pair
1	Exact match on `(canonical_first, last, suffix)`	70.0%	negligible
2	Jaro-Winkler similarity ≥ 0.92	0.1%	microseconds
2.5	Name similarity gate: JW on last name < 0.50 → skip	—	microseconds
3	Embedding cosine similarity ≥ 0.95 → auto-accept	5.9%	pre-computed
4	LLM confirmation (cosine 0.35–0.95)	3.5%	~$0.0002
5	Tiebreaker: stronger model	rare	~$0.002

Pairs that are not resolved by step 5 are escalated to human review.

Prototype Results

Our prototype processed 200 NC SBE records from Columbus County, NC (2022 general election). Step counts and percentages are measured in candidate instances, not input records:

Metric	Value
Input records	200
Exact matches (step 1)	597 (70.0%)
Jaro-Winkler matches (step 2)	1 (0.1%)
Embedding auto-accepts (step 3)	50 (5.9%)
LLM calls (step 4)	30 (3.5%)
LLM matches confirmed	0
LLM no-matches confirmed	30
Unique candidate entities created	206
Hash chains verified	200/200

All 30 LLM calls were spent on pairs that shared a blocking key (same state, same office level, same last-name initial) but had completely different names — comparisons like “Aaron Bridges” vs “Daniel Blanton” that happened to fall within the same block. Every one was correctly rejected. This finding led to step 2.5: the Jaro-Winkler gate on last names. If the JW score on last names alone is below 0.50, skip the pair entirely. This would have eliminated all 30 wasted LLM calls.

Why Embedding Alone Fails

Embedding similarity is a powerful retrieval signal but an unreliable decision signal. Two real cases demonstrate the failure modes:

False negative — Charlie Crist at 0.451. MEDSL records CRIST, CHARLES JOSEPH. OpenElections records Charlie Crist. The embedding model scores their cosine similarity at 0.451. Any threshold-based system that relies solely on embeddings either rejects this pair (missing a true match) or sets the accept threshold so low that it admits thousands of false positives.

The problem is structural. The embedding model sees different surface forms — different name ordering, different casing, a nickname versus a legal name, and a middle name present in one source but not the other. The model has no reliable mechanism to know that Charlie is a common nickname for Charles.

False positive — Robert Williams Jr at 0.862. Robert Williams and Robert Williams Jr score 0.862. The model treats “Jr” as a minor token appended to an otherwise identical string. But Jr is a generational suffix — these are different people. At our original auto-accept threshold of 0.82, this pair would have been silently merged.

The embedding model is good at detecting surface similarity. It is bad at understanding that a single token (“Jr”) carries categorical meaning, and that a short nickname (“Charlie”) maps to a longer legal name (“Charles”).

Why LLM Alone Fails

An LLM like Claude Sonnet can correctly resolve both cases above. It knows Charlie is a nickname for Charles. It knows Jr indicates a different person. In our tests, it correctly identified all 12 test pairs with appropriate confidence levels.

But LLM-only resolution is infeasible at scale:

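Speed: At 200 ms per API call, even one sequential call per row of a 42-million-row dataset would take about 97 days (42 × 10⁶ × 0.2 s ≈ 8.4 × 10⁶ s), and pairwise comparison requires far more calls than there are rows. Even with aggressive blocking, the number of candidate pairs runs into the millions.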
Reproducibility: LLM outputs are non-deterministic. Running the same pair twice may produce different confidence scores. This is acceptable for ambiguous cases but wasteful for the 70% of cases that exact match handles perfectly.
Cost: While budget is not a constraint, sending millions of obvious matches and obvious non-matches to an LLM is pure waste. The LLM adds value only on the ambiguous cases that simpler methods cannot resolve.

Why the Cascade Works

The cascade combines the strengths of each method:

Exact match handles the common case (70%) — same name, same state, different precincts. No ML, no API calls, no latency, no non-determinism.
Jaro-Winkler catches minor spelling variations (“SHANNON W BRAY” vs “Shannon W. Bray”) that exact match misses due to casing or punctuation. Still deterministic, still free.
The name gate (step 2.5) eliminates pairs that share a blocking key but have obviously different names. This prevents the “wasted 30 LLM calls” scenario from the prototype. Deterministic, zero cost.
Embedding retrieval identifies high-confidence matches (≥ 0.95) where the names differ in format but not in substance. Pre-computed vectors make this effectively free at query time. The 0.95 threshold is deliberately conservative — only near-certain matches pass.
LLM confirmation handles the hard cases: nicknames (Crist at 0.451), suffixes (Williams Jr at 0.862), ambiguous common names. The LLM sees structured name components, vote counts, office, state, and party — enough context to reason about identity. Every prompt, response, and reasoning chain is stored for audit.
Tiebreaker (step 5) escalates low-confidence LLM decisions to a stronger model (Opus-class). This adds cost but catches cases where Sonnet is uncertain.

The cascade balances three properties:

Accuracy: The LLM catches what embeddings miss. Embeddings retrieve what exact match misses. Each layer covers the failure modes of the layer above it.
Speed: 70% of resolution is free. 6% is pre-computed. Only 3.5% requires API calls. At scale, this is the difference between hours and years.
Reproducibility: Steps 1–3 are fully deterministic. Steps 4–5 are non-deterministic but logged — every decision can be replayed from the audit log without re-invoking the LLM.

The 19 Exact Ties

Entity resolution is a prerequisite for detecting exact ties. In MEDSL 2022, we found 19 contests nationally where the top two candidates received exactly the same number of votes. Without entity resolution, precinct-level records cannot be aggregated into contest-level totals — and ties cannot be detected.
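Once every precinct row carries a resolved entity ID, tie detection reduces to a group-and-sum. The sketch below assumes rows shaped as `(contest_id, candidate_entity_id, votes)`; `contest_id` and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Row = Tuple[str, str, int]  # (contest_id, candidate_entity_id, votes)


def find_exact_ties(rows: Iterable[Row]) -> List[str]:
    """Return contest IDs whose top two candidates have equal vote totals."""
    totals = defaultdict(lambda: defaultdict(int))
    for contest_id, entity_id, votes in rows:
        # Summing by entity ID is what requires entity resolution first.
        totals[contest_id][entity_id] += votes

    ties = []
    for contest_id, by_candidate in totals.items():
        ranked = sorted(by_candidate.values(), reverse=True)
        if len(ranked) >= 2 and ranked[0] == ranked[1]:
            ties.append(contest_id)
    return ties
```

If one candidate were split across two entity IDs, that candidate's total would be understated and a real tie could be missed, or a false one reported.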

Blocking Strategy

Before the cascade runs, records are partitioned into blocks by (state, office_level, last_name_initial). Only pairs within the same block are compared. This reduces the comparison space by approximately four orders of magnitude while preserving all legitimate matches — a candidate for NC school board is never compared to a candidate for FL sheriff.

The blocking key is deliberately coarse. We accept some wasted comparisons within blocks (like “Aaron Bridges” vs “Daniel Blanton” in the same NC school_district block) in exchange for never missing a legitimate match. The step 2.5 gate handles the within-block noise.
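A minimal sketch of blocking, assuming each record is a dict with `state`, `office_level`, and a non-empty `last` field (the function names are illustrative):

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterator, List, Tuple

Record = Dict[str, str]


def blocking_key(record: Record) -> Tuple[str, str, str]:
    """(state, office_level, last_name_initial), as described above."""
    return (record["state"], record["office_level"], record["last"][0].upper())


def candidate_pairs(records: List[Record]) -> Iterator[Tuple[Record, Record]]:
    """Yield only the pairs that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)
```

With this key, Aaron Bridges and Daniel Blanton still form a pair (same state, same office level, both start with B), which is exactly the within-block noise that the step 2.5 gate removes.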

Detailed Walkthroughs

The Cascade: Step by Step — each step with real examples
Real Test Cases from Real Data — all tested pairs with scores and decisions
Threshold Calibration — how Williams Jr and Crist changed the thresholds

The Cascade: Step-by-Step Walkthrough

The entity resolution cascade processes candidate pairs through five steps of increasing cost and sophistication. Each step either resolves the pair (match or no-match) or passes it to the next step. This chapter walks through a real example at every step.

Step 1: Exact Match on Structured Fields

Key: (canonical_first, last, suffix) within a (state, office_level) block.

Timothy Lance appears in 47 precinct-level rows across Columbus County, NC for the 2022 school board race. Every row has:

{
  "canonical_first": "Timothy",
  "last": "Lance",
  "suffix": null
}

All 47 rows match on the exact key. One candidate_entity_id is assigned. No fuzzy logic, no embedding, no API call.

In our prototype of 200 records, exact match resolved 597 candidate instances (70.0%) into 206 unique entities. This is the workhorse of the cascade — cheap, deterministic, and correct whenever sources agree on name components.

Exact match fails when sources disagree on formatting, use nicknames, or omit components. That is what steps 2–5 handle.
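Step 1 amounts to grouping rows by a tuple key. A sketch, assuming L1 output rows are dicts with the fields shown above plus `state` and `office_level` (the function name is hypothetical):

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]
Key = Tuple[Optional[str], ...]


def exact_match_groups(rows: List[Row]) -> Dict[Key, List[Row]]:
    """Group rows by block plus (canonical_first, last, suffix)."""
    groups: Dict[Key, List[Row]] = defaultdict(list)
    for row in rows:
        key = (row["state"], row["office_level"],
               row["canonical_first"], row["last"], row["suffix"])
        groups[key].append(row)
    return groups
```

Each resulting group would receive one `candidate_entity_id`. Keeping `suffix` in the key means a row with suffix `Jr` never lands in the same group as a row with a null suffix.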

Step 2: Jaro-Winkler Similarity (≥ 0.92)

Step 2 catches minor spelling variations that survive L1 parsing: Mcdonough vs McDonough, De Los Santos vs Delossantos, transposition errors in precinct-level data entry.

The threshold is 0.92 on the full (canonical_first + " " + last) string. This is intentionally strict — Jaro-Winkler gives high scores to strings that share a prefix, which makes it prone to false positives on common surnames.

In our prototype, step 2 resolved 1 additional candidate (0.1%). Most formatting differences are already handled by L1 normalization (case folding, punctuation removal), leaving few cases for JW.

Step 2.5: The Name Similarity Gate

Before computing embeddings, check whether the pair’s last names are remotely similar. If the Jaro-Winkler score on last names alone is below 0.50, skip the pair entirely.

Example: Aaron Bridges vs. Daniel Blanton. Both appear in NC school district races. They share the same (state, office_level) block, which is why they were paired in the first place. But:

Last-name JW: Bridges vs Blanton → 0.40
Gate decision: skip — do not compute embedding, do not call LLM.

This gate exists because of a finding in our prototype. The original cascade had no step 2.5. Of the 30 LLM calls made, all 30 were spent on pairs with completely different names that happened to fall in the same blocking group — “Aaron Bridges” vs “Daniel Blanton” type comparisons. Every one was correctly rejected, but each cost an API round-trip and added latency.

The gate eliminates these obvious non-matches before they reach the embedding step. At scale, with millions of within-block pairs, this saves orders of magnitude in embedding lookups and LLM calls.
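For reference, here is a self-contained sketch of Jaro-Winkler and the gate built on it. This is the textbook formulation with a prefix weight of 0.1 and a four-character prefix cap; the project's actual implementation and normalization are not shown in this chapter, so exact scores need not reproduce the 0.40 quoted above.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


def passes_name_gate(last_a: str, last_b: str, threshold: float = 0.50) -> bool:
    """Step 2.5: keep the pair only if the last names are remotely similar."""
    return jaro_winkler(last_a.lower(), last_b.lower()) >= threshold
```

With this formulation, `passes_name_gate("Bridges", "Blanton")` is `False`, so the pair is dropped before any embedding lookup or LLM call.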

Step 3: Embedding Retrieval (Cosine ≥ 0.95 → Auto-Accept)

For pairs that pass the gate but did not exact-match, compute cosine similarity between L2 candidate embeddings. If the score is ≥ 0.95 and both candidates are in the same state, auto-accept the match.

Example: Ashley Moody vs. Ashley B. Moody (Florida Attorney General, 2022).

Field	Source A (OpenElections)	Source B (MEDSL)
Raw name	Ashley Moody	MOODY, ASHLEY B
canonical_first	Ashley	Ashley
middle	null	B
last	Moody	Moody
suffix	null	null

Strictly speaking, step 1 already resolves this pair: the exact-match key (canonical_first, last, suffix) is (Ashley, Moody, null) in both sources, because the middle initial is not part of the key. Only the composite strings diverge. The pair is shown here to illustrate what step 3 does when the key does not match (e.g., if the middle name were included in the key).

Embedding cosine: 0.930
Same state: yes (both FL)

At 0.930, this pair falls just below the 0.95 auto-accept threshold, so it enters the LLM zone. However, the JW score on full name is 0.95 — combined with the embedding score and same-state check, the cascade applies the secondary acceptance rule: embedding ≥ 0.90 AND JW ≥ 0.92 AND same state → accept.

In the prototype, step 3 resolved 50 candidates (5.9%) via embedding auto-accept.
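The step 3 rules, including the secondary acceptance rule, can be written as one small decision function. This is a sketch under assumptions: the function and return labels are hypothetical, the Jaro-Winkler score is passed in as computed on the full name (the wording used in this section), and the 0.35 lower bound comes from the step 4 range.

```python
def step3_decision(cosine: float, full_name_jw: float, same_state: bool) -> str:
    """Return 'accept', 'llm', or 'reject' for a pair that passed the name gate."""
    if not same_state:
        return "reject"
    if cosine >= 0.95:
        return "accept"  # primary auto-accept
    if cosine >= 0.90 and full_name_jw >= 0.92:
        return "accept"  # secondary acceptance rule
    if cosine >= 0.35:
        return "llm"     # ambiguous zone, handled by step 4
    return "reject"
```

For the Moody pair, `step3_decision(0.930, 0.95, True)` returns `"accept"` through the secondary rule, while a pair at 0.862 with a weaker name score is routed to the LLM.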

Step 4: LLM Confirmation (Cosine 0.35–0.95)

Pairs in the ambiguous zone — embedding score between 0.35 and 0.95 after passing the name gate — are sent to Claude Sonnet with full context.

Example: Charlie Crist vs. CRIST, CHARLES JOSEPH (Florida Governor, 2022).

The LLM prompt includes structured fields, not just raw names:

Candidate A:
  raw: "Charlie Crist"
  canonical_first: "Charles"  (via nickname dictionary: Charlie → Charles)
  last: "Crist"
  suffix: null
  state: FL, office: Governor, votes: 3,101,652

Candidate B:
  raw: "CRIST, CHARLES JOSEPH"
  canonical_first: "CHARLES"
  last: "CRIST"
  suffix: null
  state: FL, office: Governor, votes: 3,101,652

Embedding cosine similarity: 0.451

The model responds:

Decision: match (confidence: 0.95)

“Charlie is a common nickname for Charles. Same state, same office, identical vote counts. The MEDSL record includes the middle name JOSEPH which the OpenElections record omits. These are the same person.”

Key elements the LLM uses that the embedding cannot:

Nickname knowledge — Charlie is a nickname for Charles. The embedding model scored this at 0.451; the LLM recognizes the relationship immediately.
Vote count identity — 3,101,652 to 3,101,652 is not a coincidence. Two different candidates in the same race with identical vote totals is astronomically unlikely.
Office and state match — Same governor’s race in the same state in the same election.

In the prototype, step 4 was invoked 30 times (3.5%). All 30 returned no-match — they were obvious non-matches that reached step 4 because the prototype lacked step 2.5. With the gate in place, the Crist-type cases (genuine ambiguity requiring LLM reasoning) are the intended workload for step 4.
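A sketch of how the structured prompt might be assembled from two candidate records. The layout mirrors the example above; the closing instruction text, the function names, and the dict field names are assumptions, and no API call is made here.

```python
from typing import Any, Dict

CANDIDATE_TEMPLATE = (
    "Candidate {label}:\n"
    '  raw: "{raw}"\n'
    '  canonical_first: "{canonical_first}"\n'
    '  last: "{last}"\n'
    "  suffix: {suffix}\n"
    "  state: {state}, office: {office}, votes: {votes:,}\n"
)


def render_candidate(label: str, candidate: Dict[str, Any]) -> str:
    # Show a missing suffix as the literal word null, as in the example.
    fields = dict(candidate, suffix=candidate["suffix"] or "null")
    return CANDIDATE_TEMPLATE.format(label=label, **fields)


def build_pair_prompt(a: Dict[str, Any], b: Dict[str, Any], cosine: float) -> str:
    return (
        render_candidate("A", a) + "\n"
        + render_candidate("B", b) + "\n"
        + f"Embedding cosine similarity: {cosine:.3f}\n\n"
        # Hypothetical instruction wording, not the project's actual prompt.
        + "Are these the same person? Answer match or no_match, "
          "give a confidence between 0 and 1, and explain your reasoning."
    )
```

Passing vote counts and office alongside the names is the point of the structure: those are the signals the embedding score cannot see.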

Step 5: Tiebreaker — Stronger Model

When step 4 returns low confidence (below 0.70), the pair escalates to a stronger model (Opus-class). This handles cases where:

The nickname is unusual and Sonnet is uncertain
Vote counts differ slightly (rounding, provisional ballots)
The candidate appears in adjacent districts and the geographic match is ambiguous

Step 5 was not triggered in our 200-record prototype. It is designed for scale, where the long tail of ambiguous cases grows. Budget is not a constraint — the stronger model costs ~10× more per call but is invoked only for the lowest-confidence subset of an already-small LLM cohort.

The Full Flow

        All candidate pairs within (state, office_level) block
                            │
                    ┌───────┴───────┐
              Step 1: Exact match?  │
              (canonical_first,     │
               last, suffix)        │
                    │               │
               YES (70%)        NO (30%)
                 done               │
                            ┌───────┴───────┐
                      Step 2: JW ≥ 0.92?    │
                            │               │
                       YES (0.1%)       NO (29.9%)
                         done               │
                            ┌───────┴───────┐
                     Step 2.5: Last-name     │
                      JW ≥ 0.50?            │
                            │               │
                        YES (~6%)      NO (~24%)
                            │           skip pair
                    ┌───────┴───────┐
              Step 3: Cosine ≥ 0.95  │
                AND same state?      │
                            │               │
                    YES (5.9%)      NO (ambiguous)
                      done               │
                            ┌───────┴───────┐
                      Step 4: LLM call      │
                      (Claude Sonnet)       │
                            │               │
                  High confidence      Low confidence
                  match/no-match       (< 0.70)
                      done               │
                                   Step 5: Stronger
                                   model (Opus-class)
                                         │
                                       done
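The same flow can be sketched as a single routing function over precomputed signals. The `PairSignals` fields, the injected `llm` callables, and the final fallback to human review when the stronger model is also below 0.70 are assumptions; the thresholds are the primary ones quoted in this chapter.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class PairSignals:
    exact_key_equal: bool
    full_name_jw: float
    last_name_jw: float
    cosine: float
    same_state: bool


# An LLM call returns (decision, confidence); injected so it can be stubbed.
LlmFn = Callable[[PairSignals], Tuple[str, float]]


def resolve(pair: PairSignals, llm: LlmFn, strong_llm: LlmFn) -> Tuple[str, str]:
    """Return (step that decided, decision)."""
    if pair.exact_key_equal:
        return "step1", "match"
    if pair.full_name_jw >= 0.92:
        return "step2", "match"
    if pair.last_name_jw < 0.50:
        return "step2.5", "skip"
    if not pair.same_state or pair.cosine < 0.35:
        return "step3", "no_match"
    if pair.cosine >= 0.95:
        return "step3", "match"
    decision, confidence = llm(pair)
    if confidence >= 0.70:
        return "step4", decision
    decision, confidence = strong_llm(pair)
    if confidence >= 0.70:
        return "step5", decision
    return "step5", "human_review"
```

Returning the deciding step alongside the decision gives every match a provenance label, which is what makes later auditing by step possible.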

Cascade Properties

Speed. Steps 1, 2, and 2.5 are sub-millisecond per pair. Step 3 is a vector lookup (microseconds with FAISS). Step 4 is an API call (~500ms). Step 5 is a slower API call (~2s). The cascade processes 96%+ of pairs in under a millisecond.

Accuracy. Each step is calibrated to avoid false positives. Step 1 is exact. Step 2 is strict (0.92). Step 3 is very strict (0.95 AND same state). Steps 4 and 5 have full context including vote counts, office, and geography — signals no embedding model can use.

Reproducibility. Steps 1–3 are deterministic given the same input and embedding model. Steps 4–5 are non-deterministic but fully logged. Every prompt, response, and reasoning string is stored in the L3 decision log, enabling deterministic replay.

Auditability. A researcher who disagrees with any match can find the decision in the log, read the LLM’s reasoning, examine the embedding score, and override the decision. L4 can be re-run from the amended L3 output without re-running the entire pipeline.
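Deterministic replay can be approximated with a log keyed by a hash of the pair, so a logged decision is returned instead of calling the model again. This in-memory sketch is an illustration only; the class name, key scheme, and entry shape are assumptions, not the L3 log format.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def pair_key(a: Dict[str, Any], b: Dict[str, Any]) -> str:
    """Order-independent SHA-256 key for a candidate pair."""
    parts = sorted(json.dumps(record, sort_keys=True) for record in (a, b))
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


class DecisionLog:
    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, Any]] = {}

    def decide(self, a: Dict[str, Any], b: Dict[str, Any],
               llm: Callable[[Dict[str, Any], Dict[str, Any]], Dict[str, Any]]
               ) -> Dict[str, Any]:
        key = pair_key(a, b)
        if key not in self._entries:
            # Only a pair that has never been decided reaches the model.
            self._entries[key] = llm(a, b)
        return self._entries[key]

    def override(self, a: Dict[str, Any], b: Dict[str, Any],
                 entry: Dict[str, Any]) -> None:
        """Record a researcher's correction for a logged pair."""
        self._entries[pair_key(a, b)] = entry
```

Because the key ignores argument order, `(a, b)` and `(b, a)` share one entry, and an override replaces the stored decision for all later replays.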

Real Test Cases from Real Data

Every entity resolution decision in this project is grounded in real candidate pairs from real election data. This chapter documents all pairs tested during prototype development, with actual embedding scores, LLM decisions, and the key signal that determined each outcome.

All embeddings use text-embedding-3-large (3,072 dimensions). All LLM decisions use Claude Sonnet. Ground truth was established by manual verification against official certified results.

The Full Test Table

Name A	Name B	Cosine	LLM Decision	LLM Conf.	Ground Truth	Key Signal
Ron DeSantis	DESANTIS, RON	0.729	match	0.98	match	Nickname: Ron → Ronald
Charlie Crist	CRIST, CHARLES JOSEPH	0.451	match	0.95	match	Nickname: Charlie → Charles; identical votes
Robert Williams	Robert Williams Jr	0.862	no match	0.85	no match	Suffix: Jr indicates different person
Val Demings	VAL DEMINGS	0.828	match	0.96	match	Format difference only; middle initial absent
Marco Rubio	RUBIO, MARCO ANTONIO	0.743	match	0.97	match	Middle name present in one source only
Ashley Moody	MOODY, ASHLEY B	0.930	match	0.98	match	Middle initial added; same office/state
Nicole Fried	FRIED, NIKKI	0.642	match	0.92	match	Nickname: Nikki → Nicole
John Smith	SMITH, JOHN R	0.672	no match	0.78	no match	Common name; different offices, different counties
Robert Johnson	JOHNSON, ROBERT L	0.644	no match	0.75	no match	Common name; different states
Dale Holness	HOLNESS, DALE V.C.	0.896	match	0.94	match	Middle initials added; title prefix stripped
Barbara Sharief	SHARIEF, BARBARA J	0.955	match	0.99	match	Middle initial added; above auto-accept
Aramis Ayala	AYALA, ARAMIS D	0.896	match	0.97	match	Title prefix “State Attorney” stripped; middle initial

How to Read This Table

Cosine — Cosine similarity between text-embedding-3-large embeddings of the candidate composite strings. Range is 0.0 to 1.0. Higher means more similar.
LLM Decision — The match/no-match output from Claude Sonnet when the pair was in the ambiguous zone (0.35–0.95).
LLM Conf. — The model’s self-reported confidence in its decision. Range 0.0 to 1.0.
Ground Truth — Manually verified against official certified election results. “match” means the two records refer to the same human being. “no match” means they do not.
Key Signal — The distinguishing factor that makes this pair interesting for entity resolution testing.

Analysis by Category

Nickname Cases

Three pairs test the nickname problem — where one source uses a familiar name and the other uses the legal name:

Pair	Cosine	Nickname Mapping
DeSantis	0.729	Ron → Ronald
Crist	0.451	Charlie → Charles
Fried	0.642	Nikki → Nicole

Embedding scores range from 0.451 to 0.729 — all below the 0.95 auto-accept threshold. Without the LLM step, all three would be missed or would require an unsafely low accept threshold.

The Crist case is the most extreme. At 0.451, the embedding model is essentially saying “these look like different people.” The divergence comes from multiple compounding differences: different name ordering (first-last vs last-first), nickname vs legal name, middle name present in only one source, and different casing. The LLM resolves it using nickname knowledge and the identical vote count (3,101,652 in both sources).

After the nickname dictionary is applied at L1, canonical_first matches on all three pairs, and step 1 exact match handles them without any embedding or LLM call. The embedding scores reported here are without dictionary application — they demonstrate why the dictionary matters.

Middle Initial Cases

Five pairs test middle-initial handling — where one source includes a middle name or initial and the other does not:

Pair	Cosine	Middle in Source A	Middle in Source B
Demings	0.828	null	null (format diff)
Rubio	0.743	null	ANTONIO
Moody	0.930	null	B
Sharief	0.955	null	J
Ayala	0.896	null	D

Sharief at 0.955 is the only pair above the 0.95 auto-accept threshold. The remaining four fall in the ambiguous zone and require LLM confirmation. The LLM correctly identifies all as matches — the middle initial is additive information, not contradictory information.

Moody at 0.930 is the closest call below auto-accept. The difference between “Ashley Moody” and “MOODY, ASHLEY B” is a single middle initial and formatting. The secondary acceptance rule (embedding ≥ 0.90 AND JW on last name ≥ 0.92 AND same state) handles this case without an LLM call in the production cascade.

Suffix Cases

One pair tests the suffix problem:

Pair	Cosine	Suffix A	Suffix B
Williams	0.862	null	Jr

At 0.862, this pair would have been auto-accepted under the original threshold of ≥ 0.82. The LLM rejected it with 0.85 confidence, citing the generational distinction implied by “Jr.” This single case drove the threshold change from 0.82 to 0.95.

The asymmetry is the danger: one source includes the suffix, the other drops it. The embedding model sees “Robert Williams” and “Robert Williams Jr” as nearly identical strings, because “Jr” is a minor token. The structured suffix field at L1 is the signal that prevents the false merge.

Common Name Cases

Two pairs test the common-name problem — where two genuinely different people share a common name:

Pair	Cosine	State A	State B	Office A	Office B
Smith	0.672	FL	FL	County Commission	School Board
Johnson	0.644	NC	FL	State House	County Clerk

Both pairs are correctly rejected. The LLM’s confidence is lower (0.75–0.78) than on the match cases because common names are inherently ambiguous — the model cannot be certain these are different people, only that the evidence is insufficient for a match.

The Johnson case crosses state boundaries. The blocking strategy partitions by state, so this pair would never be compared in the production cascade. It is included in the test set to validate the cross-state rejection logic.

The Smith case is within the same state but different offices and counties. The LLM correctly reasons that two people named John Smith in different Florida counties holding different offices are most likely different individuals, despite the name match.

Format Difference Cases

Two pairs test pure formatting differences — same person, same name components, different string representations:

Pair	Cosine	Format Difference
Holness	0.896	Middle initials V.C. added; “Commissioner” prefix stripped
Ayala	0.896	Middle initial D added; “State Attorney” prefix stripped

Both score 0.896 — identical cosine similarity despite different underlying differences. Both are correctly matched. These cases validate that the L1 parser correctly strips title prefixes and that the embedding model handles the remaining differences (middle initials) gracefully.

Score Distribution

The 12 test pairs span the full range of embedding scores relevant to entity resolution:

Score Range	Count	Matches	Non-Matches
≥ 0.95	1	1	0
0.85–0.95	4	3	1
0.70–0.85	3	3	0
0.50–0.70	3	1	2
< 0.50	1	1	0

The Williams Jr pair at 0.862 is the only false-positive risk — a non-match scoring above 0.85. The Crist pair at 0.451 is the only false-negative risk — a true match scoring below 0.50. These two cases define the boundary conditions of the cascade and drove the threshold calibration described in Threshold Calibration.

LLM Accuracy

Across all 12 test pairs:

Metric	Value
Total pairs tested	12
LLM correct decisions	12
LLM accuracy	100%
Average confidence (matches)	0.962
Average confidence (non-matches)	0.793
Lowest confidence on a correct match	0.92 (Fried)
Lowest confidence on a correct non-match	0.75 (Johnson)

The confidence gap between matches (avg 0.962) and non-matches (avg 0.793) is expected. The LLM is more certain when confirming a match (multiple corroborating signals: same state, same office, similar vote counts, plausible name relationship) than when rejecting one (absence of evidence, not evidence of absence).

What These Tests Do Not Cover

The 12 test pairs are a calibration set, not a validation set. They do not cover:

Spanish-language names — Hyphenated surnames, maternal/paternal name ordering
Transliterated names — Arabic, Chinese, Vietnamese, and Korean names rendered in English with varying romanization
Unisex names — Cases where a shared name belongs to candidates of different genders
Candidates who changed names — Marriage, legal name change
Intentional name variations — Candidates who use different names in different elections

These gaps are documented as known limitations. The test set will expand as entity resolution is validated at scale.

Cross-References

Entity Resolution Overview — the cascade that processes these pairs
The Cascade: Step by Step — detailed walkthrough of each step
Threshold Calibration — how these scores drove the threshold changes

Threshold Calibration

Embedding similarity thresholds determine which candidate pairs auto-accept, which enter the LLM zone, and which auto-reject. These thresholds are not universal constants — they are calibrated to a specific embedding model (text-embedding-3-large, 3,072 dimensions) using real test data from our prototype.

Two findings from early testing forced a complete recalibration.

The Two Findings

Robert Williams Jr at 0.862 — a false positive. Under the original thresholds, any pair scoring ≥ 0.82 was auto-accepted. “Robert Williams” and “Robert Williams Jr” scored 0.862 — above the threshold. The system would have silently merged father and son into one entity. The suffix “Jr” carries categorical meaning (different person), but the embedding model treats it as a minor token appended to an otherwise identical string.

Charlie Crist at 0.451 — a false negative. Under the original thresholds, any pair scoring < 0.65 was auto-rejected. “Charlie Crist” and “CRIST, CHARLES JOSEPH” scored 0.451 — below the threshold. The system would have discarded a true match. The nickname “Charlie” for “Charles”, combined with different name ordering, different casing, and an extra middle name, pushed the score well below the reject boundary.

Both errors are unacceptable. Merging different people corrupts every downstream analysis. Missing true matches fragments candidate records across sources, breaking cross-source reconciliation and career tracking.

Old vs. New Thresholds

Zone	Old Range	New Range	Change
Auto-accept	≥ 0.82	≥ 0.95 AND same state	Raised by 0.13, added state constraint
Ambiguous (LLM zone)	0.65–0.82	0.35–0.95 AND same state	Widened from 0.17 to 0.60 range
Auto-reject	< 0.65	< 0.35 OR different state	Lowered by 0.30, added state escape

The ambiguous zone expanded from a 0.17-wide band to a 0.60-wide band. This means far more pairs are routed to the LLM for confirmation.

What Each Change Addresses

Auto-accept raised to 0.95

The Williams Jr pair at 0.862 demonstrated that scores in the 0.82–0.95 range can contain suffix-bearing false positives. At 0.95, the only pairs that auto-accept are near-identical strings with trivial formatting differences. Even “Ashley Moody” vs “MOODY, ASHLEY B” at 0.930 does not auto-accept; it enters the LLM zone, where the model confirms the match using full context.

The same-state constraint is an additional guard. A candidate for county sheriff in Maine should never auto-match with a candidate for county sheriff in Florida, regardless of embedding score. Different-state pairs never auto-accept; per the threshold table above, they are rejected outright.

Ambiguous zone widened to 0.35–0.95

The Crist pair at 0.451 sat in the old auto-reject zone. The new lower bound of 0.35 captures every nickname case we tested:

Pair	Cosine	Old Zone	New Zone
DeSantis / DESANTIS, RON	0.729	Ambiguous	Ambiguous
Crist / CRIST, CHARLES JOSEPH	0.451	Reject	Ambiguous
Nicole Fried / FRIED, NIKKI	0.642	Reject	Ambiguous
Williams / Williams Jr	0.862	Accept	Ambiguous
Val Demings / VAL DEMINGS	0.828	Accept	Ambiguous
Marco Rubio / RUBIO, MARCO ANTONIO	0.743	Ambiguous	Ambiguous
Ashley Moody / MOODY, ASHLEY B	0.930	Accept	Ambiguous
Dale Holness / HOLNESS, DALE V.C.	0.896	Accept	Ambiguous

Under the old thresholds, 3 of 8 pairs were misclassified: 1 false accept (Williams Jr) and 2 false rejects (Crist and Fried). Under the new thresholds, all 8 enter the LLM zone where the model resolves them correctly.
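The zone assignments in the table can be reproduced with a few lines of Python. This is a minimal sketch, not the pipeline's implementation: the function names `old_zone` and `new_zone` are hypothetical, and the boundaries are copied from the Old vs. New Thresholds table.

```python
# Zone boundaries copied from the "Old vs. New Thresholds" table.
def old_zone(score: float) -> str:
    if score >= 0.82:
        return "accept"
    if score >= 0.65:
        return "ambiguous"
    return "reject"


def new_zone(score: float, same_state: bool = True) -> str:
    # Per the table, a different-state pair is rejected regardless of score.
    if not same_state or score < 0.35:
        return "reject"
    if score >= 0.95:
        return "accept"
    return "ambiguous"


# (label, cosine) for the eight tested pairs listed above.
PAIRS = [
    ("DeSantis / DESANTIS, RON", 0.729),
    ("Crist / CRIST, CHARLES JOSEPH", 0.451),
    ("Nicole Fried / FRIED, NIKKI", 0.642),
    ("Williams / Williams Jr", 0.862),
    ("Val Demings / VAL DEMINGS", 0.828),
    ("Marco Rubio / RUBIO, MARCO ANTONIO", 0.743),
    ("Ashley Moody / MOODY, ASHLEY B", 0.930),
    ("Dale Holness / HOLNESS, DALE V.C.", 0.896),
]

for label, cosine in PAIRS:
    print(f"{label:<36} {cosine:.3f}  {old_zone(cosine):<9} -> {new_zone(cosine)}")
```

Every tested pair lands in the ambiguous zone under `new_zone`, so the decision for each one is deferred to the LLM step rather than made on the score alone.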

Auto-reject lowered to 0.35

Below 0.35, no tested pair in our prototype was a true match. At this score range the names share almost no surface similarity; in our tests they were always different people who happened to share a blocking key.

The different-state escape allows immediate rejection of cross-state pairs regardless of score. Local officeholders do not appear in multiple states. (Federal candidates can, but they are handled by a separate federal-office pathway that does not use this threshold table.)

The Cost of a Wider Ambiguous Zone

The old ambiguous zone (0.65–0.82) captured roughly 5% of within-block pairs. The new zone (0.35–0.95) captures roughly 25% — a 5× increase in LLM calls.

At prototype scale (200 records), this is negligible. At production scale (42 million rows), the increase matters for throughput but not for budget. Budget is not a constraint. The step 2.5 name gate (JW < 0.50 on last names → skip) eliminates the majority of low-score pairs before they reach the LLM, keeping the actual call volume manageable.

The wider zone is a deliberate trade: more LLM calls in exchange for zero false accepts and zero false rejects in the tested range.

Thresholds Are Model-Specific

These thresholds are calibrated for text-embedding-3-large with 3,072 dimensions. A different model — even an updated version of the same model — will produce different similarity distributions. If the embedding model changes:

Re-run the test cases against the new model.
Plot the score distribution for known matches and known non-matches.
Recalibrate auto-accept, ambiguous, and auto-reject boundaries.
Store the new thresholds alongside the model identifier in L2 metadata.

The embedding_model field in every L2 record ensures that thresholds can always be traced to the model that produced the scores.
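One way to enforce that traceability is to key the thresholds by model identifier and refuse to score a record whose model has no calibration. The sketch below assumes a record is a plain dict carrying an `embedding_model` key; `Thresholds`, `CALIBRATED`, and `thresholds_for` are hypothetical names, not part of the documented schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    embedding_model: str
    auto_accept: float   # scores at or above this auto-accept (same state only)
    auto_reject: float   # scores below this auto-reject


# One entry per calibrated model. Values for text-embedding-3-large come from
# the threshold table in this chapter.
CALIBRATED = {
    "text-embedding-3-large": Thresholds("text-embedding-3-large", 0.95, 0.35),
}


def thresholds_for(record: dict) -> Thresholds:
    """Look up thresholds by the model that produced the record's scores."""
    model = record["embedding_model"]
    try:
        return CALIBRATED[model]
    except KeyError:
        # Never fall back to another model's thresholds: recalibrate instead.
        raise LookupError(f"no calibrated thresholds for {model!r}") from None


print(thresholds_for({"embedding_model": "text-embedding-3-large"}))
```

Failing loudly on an unknown model is the design choice here: silently reusing old boundaries would reintroduce the miscalibration this chapter describes.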

Summary

Principle	Implementation
Never auto-accept a suffix mismatch	Threshold raised to 0.95; the tested suffix pair (0.862) now enters the LLM zone
Never auto-reject a nickname match	Threshold lowered to 0.35; every tested nickname pair now enters the LLM zone
Cross-state pairs never auto-accept	Same-state constraint on auto-accept; different-state pairs are auto-rejected
Wider zone is acceptable	Budget is not a constraint; accuracy is
Thresholds are not portable	Model version stored in every record

Cross-References

Real Test Cases — all pairs that informed calibration
Suffixes: Jr/Sr Means Different People — the Williams Jr finding
Nicknames and Middle Initials — the Crist finding
Budget Is Not a Constraint — why the wider zone is acceptable

Non-Candidate Records

Not every row in an election results file is a candidate. Sources routinely embed turnout metadata, ballot measure choices, vote quality indicators, and aggregation artifacts alongside candidate results — using the same columns, the same format, and no reliable flag to distinguish them.

If your system treats every row as a candidate, you will create entity records for people named “Registered Voters”, “For”, “BLANK”, and “TOTAL VOTES”. The L4 LLM audit in our prototype caught exactly this: “For” and “Against” were classified as person entities. They are not people.

The Four Categories

1. Turnout Metadata

Rows recording registration and participation counts at the precinct level:

Pseudo-candidate	Meaning	Source
Registered Voters	Total registered voters in precinct	FL OpenElections, NC SBE
Ballots Cast	Total ballots submitted	FL OpenElections, NC SBE
Cards Cast	Total ballot cards (may differ from ballots in multi-card elections)	FL OpenElections

Florida OpenElections is the most prolific source. Of the “other” records in our FL 2022 ingest, 6,013 rows are “Registered Voters” — accounting for 67.9% of all non-candidate records in that source. These are not errors in the source data. They are genuine turnout figures published alongside contest results in the same file format.

2. Ballot Measure Choices

Rows representing choices on referenda, bond issues, and constitutional amendments:

Pseudo-candidate	Meaning	Source
For	Yes vote on ballot measure	OpenElections, MEDSL
Against	No vote on ballot measure	OpenElections, MEDSL
Yes	Yes vote on ballot measure	NC SBE, MEDSL
No	No vote on ballot measure	NC SBE, MEDSL

These are legitimate vote counts — but the “candidate” is not a person. Detection requires examining both the candidate name (a single common word) and the contest name (bond, referendum, amendment, proposition). See Ballot Measure Choices.

3. Vote Quality Indicators

Rows recording ballots that did not produce a valid vote for any candidate:

Pseudo-candidate	Meaning	Source
Over Votes	Voter selected more candidates than allowed	MEDSL, NC SBE
Under Votes	Voter selected fewer candidates than allowed	MEDSL, NC SBE
BLANK	No selection made (Maine’s term for undervote)	MEDSL (ME)
Write-in	Aggregate write-in count (no specific candidate)	Multiple sources

Over votes and under votes are important data quality signals. A contest with 15% over votes may indicate a confusing ballot design. But they are not candidates and must not be counted as such.

4. Aggregation Artifacts

Rows that are computational summaries, not individual results:

Pseudo-candidate	Meaning	Source
TOTAL VOTES	Sum of all candidates in the contest	MEDSL (UT)
Scattering	Aggregate of write-in candidates below reporting threshold	MEDSL (IA, MN)
TOTAL	Another sum variant	OpenElections

These rows are redundant with the candidate-level data. Including them double-counts votes and inflates totals.

The Detection Strategy

Non-candidate records are detected at L1 — the earliest possible point. The principle is extract before filter: non-candidate rows often contain valuable information (registered voter counts, undervote rates) that should be captured in the correct schema object before the row is excluded from contest analysis.

Detection uses a three-part check:

Exact match on candidate name. A lookup table of ~40 known pseudo-candidate strings: “Registered Voters”, “Ballots Cast”, “Over Votes”, “Under Votes”, “BLANK”, “TOTAL VOTES”, “Scattering”, “For”, “Against”, “Yes”, “No”, etc.
Contest name pattern. For ambiguous names like “For” and “Against”, check whether the contest name contains ballot measure keywords: bond, referendum, amendment, proposition, measure, question, initiative, charter.
Source-specific rules. Some sources use unique pseudo-candidates. Maine uses “BLANK”. Iowa uses “Scattering”. Utah includes “TOTAL VOTES” rows. Each source parser knows its own ghosts.

Routing

Detected non-candidate records are routed to the appropriate schema object:

Category	Route to	Schema type
Turnout metadata	`TurnoutMetadata`	Attached to sibling precinct records
Ballot measure choices	`BallotMeasure`	`MeasureChoice` with For/Against/Yes/No
Vote quality indicators	`VoteQuality`	Attached to parent contest record
Aggregation artifacts	Discarded	Redundant with candidate-level sums

Records routed to TurnoutMetadata and VoteQuality are preserved in the L1 output — they are valuable data, just not candidate data. Aggregation artifacts are discarded with a note in the cleaning report.

What Happens Without Detection

If non-candidate rows pass through to L2 and L3:

“Registered Voters” gets an embedding vector, a candidate entity ID, and appears in 6,013 precinct-level records as the most prolific “candidate” in Florida.
“For” and “Against” become person entities. The L4 LLM audit flagged exactly this in our prototype: “‘For’ is not a plausible person name.”
“TOTAL VOTES” inflates vote counts when aggregated, because the total row is summed alongside the individual candidate rows.
“Over Votes” appears as a candidate who received votes in every contest — the busiest politician in America.

Detection at L1 prevents all of these downstream errors.

Sub-Chapters

Registered Voters, Ballots Cast, Over/Under Votes — turnout and vote quality rows, the “extract before filter” principle
Ballot Measure Choices: For/Against/Yes/No — detecting ballot measures from candidate name + contest name patterns

Registered Voters, Ballots Cast, Over/Under Votes

Some election data files embed turnout metadata and vote-quality indicators directly alongside candidate results. A row labeled “Registered Voters” is not a contest — it is a count of eligible voters in that precinct. A row labeled “Over Votes” is not a candidate — it is a count of ballots where the voter marked too many choices.

These rows are valuable. They are also poison if treated as candidates.

The Four Categories

Label	What it means	Found in
Registered Voters	Eligible voters in the precinct	NC SBE, FL OpenElections
Ballots Cast	Ballots submitted (any contest)	NC SBE, some MEDSL records
Over Votes	Ballots with too many selections for a contest	NC SBE, ME, UT
Under Votes	Ballots with fewer selections than allowed for a contest	NC SBE, ME, UT

NC SBE includes all four in every precinct file. MEDSL includes over/under votes for some states but not others. OpenElections varies by state and contributor. There is no standard.

The Extract-Before-Filter Principle

The instinct is to filter these rows out immediately — they are not candidates, so drop them. This is wrong. The registered voter count is the denominator for turnout computation. Dropping it before extraction destroys the only turnout signal available at the precinct level.

The correct sequence:

Detect the row by candidate name pattern (Registered Voters, BALLOTS CAST, OVER VOTES, UNDER VOTES, BLANK).
Extract the value into the appropriate field on sibling contest records in the same precinct.
Route the row to TurnoutMetadata contest kind — not CandidateRace.
Exclude the row from candidate-level analysis (margins, competitiveness, entity resolution).

Step 2 is the key. The registered voter count attaches to every contest in the same precinct as a turnout.registered_voters field. The ballots cast count becomes turnout.ballots_cast. Only after extraction is the metadata row itself reclassified.
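The four-step sequence can be sketched as a single pass over one precinct's rows. This is an illustration under assumptions: the row shape (`contest`, `candidate`, `votes`) and the names `TURNOUT_FIELDS` and `extract_then_filter` are hypothetical, and only the two turnout labels are handled.

```python
# Maps a detected label to the turnout field it populates.
TURNOUT_FIELDS = {
    "registered voters": "registered_voters",
    "ballots cast": "ballots_cast",
}


def extract_then_filter(precinct_rows):
    """Return (contest rows with turnout attached, reclassified metadata rows)."""
    turnout = {"registered_voters": None, "ballots_cast": None}
    metadata_rows, contest_rows = [], []
    for row in precinct_rows:
        field = TURNOUT_FIELDS.get(row["candidate"].strip().lower())  # 1. detect
        if field is not None:
            turnout[field] = row["votes"]                             # 2. extract
            metadata_rows.append({**row, "kind": "turnout_metadata"}) # 3. route
        else:
            contest_rows.append(row)                                  # 4. exclude metadata
    # Attach the extracted values to every sibling contest row in the precinct.
    enriched = [{**row, "turnout": dict(turnout)} for row in contest_rows]
    return enriched, metadata_rows


rows = [
    {"contest": "REGISTERED VOTERS - TOTAL", "candidate": "Registered Voters", "votes": 4217},
    {"contest": "BOARD OF EDUCATION", "candidate": "Candidate A", "votes": 1200},
    {"contest": "BOARD OF EDUCATION", "candidate": "Candidate B", "votes": 1100},
]
contests, metadata = extract_then_filter(rows)
print(contests[0]["turnout"])
```

The turnout values are collected before any contest row is emitted, so a metadata row that appears late in the file still reaches every sibling record.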

NC SBE Row Format

In the NC SBE precinct results file (results_pct_20221108.txt), a registered voter row looks like:

Column	Value
Contest Name	`REGISTERED VOTERS - TOTAL`
Choice	(empty)
Choice Party	(empty)
Total Votes	`4,217`
Election Day	`4,217`
One Stop	`0`
Absentee by Mail	`0`
Provisional	`0`

The “Total Votes” column contains the registered voter count, not a vote total. The vote-type breakdown is meaningless (registered voters do not have an election-day vs. early split). L1 extracts 4,217 into turnout.registered_voters for precinct P17 in Columbus County, then classifies this row as TurnoutMetadata.

The corresponding L1 output:

{
  "contest": {
    "kind": "turnout_metadata",
    "raw_name": "REGISTERED VOTERS - TOTAL"
  },
  "results": [{
    "candidate_name": { "raw": "Registered Voters" },
    "votes_total": 4217
  }],
  "turnout": {
    "registered_voters": 4217
  }
}

Sibling contest records in the same precinct (e.g., the school board race) receive:

{
  "turnout": {
    "registered_voters": 4217,
    "ballots_cast": null
  }
}

Scale of the Problem

In the Florida OpenElections dataset, 6,013 rows are labeled “Registered Voters” — representing 67.9% of all non-candidate records in that file. Without detection, these rows enter the candidate pipeline as if “Registered Voters” were a person running for office. The L4 LLM audit flagged exactly this pattern in our prototype.

Over Votes and Under Votes are less numerous but equally disruptive. Maine labels its under votes as BLANK. Utah includes TOTAL VOTES aggregation rows. Each source has its own vocabulary for the same concept.

Detection Rules

L1 applies pattern matching on the candidate name field before any other processing:

Pattern	Classification	Action
`registered voters`	TurnoutMetadata	Extract to `turnout.registered_voters`
`ballots cast`	TurnoutMetadata	Extract to `turnout.ballots_cast`
`over ?votes?`	TurnoutMetadata	Extract to `turnout.over_votes`
`under ?votes?`	TurnoutMetadata	Extract to `turnout.under_votes`
`^blank$`	TurnoutMetadata	Extract to `turnout.under_votes` (ME)
`total votes`	TurnoutMetadata	Discard (aggregation artifact)
`scattering`	TurnoutMetadata	Extract to `turnout.write_in_scattering` (IA)

These patterns are checked case-insensitively. They run as the first operation in the L1 pipeline — before name decomposition, before office classification, before FIPS enrichment. A row that matches is routed immediately and never enters the candidate pipeline.
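The rule table translates directly into an ordered list of compiled patterns. The following is a sketch, assuming first-match-wins ordering and `re.search` semantics; the names `RULES` and `route` are hypothetical, and a target of `None` stands for the discard action.

```python
import re
from typing import Optional, Tuple

# (pattern, classification, extraction target). A target of None means discard.
RULES = [
    (r"registered voters", "TurnoutMetadata", "turnout.registered_voters"),
    (r"ballots cast", "TurnoutMetadata", "turnout.ballots_cast"),
    (r"over ?votes?", "TurnoutMetadata", "turnout.over_votes"),
    (r"under ?votes?", "TurnoutMetadata", "turnout.under_votes"),
    (r"^blank$", "TurnoutMetadata", "turnout.under_votes"),
    (r"total votes", "TurnoutMetadata", None),
    (r"scattering", "TurnoutMetadata", "turnout.write_in_scattering"),
]
COMPILED = [(re.compile(p, re.IGNORECASE), kind, target) for p, kind, target in RULES]


def route(candidate_name: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (classification, target) for a pseudo-candidate, else None."""
    name = candidate_name.strip()
    for pattern, kind, target in COMPILED:
        if pattern.search(name):
            return kind, target
    return None  # not a known pseudo-candidate; continue into the candidate pipeline


for name in ["OVER VOTES", "Undervote", "BLANK", "TOTAL VOTES", "Shannon W. Bray"]:
    print(f"{name!r:<20} {route(name)}")
```

Only the `blank` pattern is anchored, because it is the one label short enough to appear inside a real surname.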

Ballot Measure Choices: For/Against/Yes/No

When a row in an election results file has “For” as the candidate name, it could mean two things: a person whose legal name is “For” (implausible), or a choice on a ballot measure (almost certain). The distinction cannot be made from the candidate name alone — it requires examining the contest name alongside it.

The Problem

Ballot measures appear in election data using the same schema as candidate races. The “candidate” column holds “For”, “Against”, “Yes”, or “No”. The “contest” column holds something like “BOND REFERENDUM - SCHOOL CONSTRUCTION” or “CONSTITUTIONAL AMENDMENT 3”. Nothing in the file format distinguishes a ballot measure from a candidate race.

Real examples from MEDSL 2022:

Contest Name	Candidate Name	Votes	What It Actually Is
CONSTITUTIONAL AMENDMENT 1	For	1,847,312	Ballot measure choice
BOND REFERENDUM COLUMBUS COUNTY SCHOOLS	Against	4,219	Ballot measure choice
COUNTY SALES TAX REFERENDUM	Yes	31,408	Ballot measure choice
CHARTER AMENDMENT - TERM LIMITS	No	12,773	Ballot measure choice

If these rows enter the candidate pipeline, “For” becomes a person entity. “For” then appears in entity resolution, gets a candidate_entity_id, and shows up in the L4 canonical export as the most prolific politician in America — winning thousands of races across every state and every office level.

The L4 Audit Discovery

In our prototype, the L4 LLM entity audit examined 50 entities for plausibility. Among the 4 errors it identified:

“‘For’ is not a plausible person name. This entity appears across 347 contests in 12 states, always in contest names containing ‘amendment’, ‘bond’, ‘referendum’, or ‘proposition’. These are ballot measure choices, not candidates.”

The audit correctly identified the contamination. But detecting it at L4 is too late — the bad entity has already propagated through L2 embeddings and L3 matching. The fix is detection at L1.

Detection Logic

A candidate name of “For”, “Against”, “Yes”, or “No” is ambiguous in isolation. These are common English words, and while no real candidate in our dataset is named “For”, names like “Yes” are theoretically possible. The detection requires both signals:

Signal 1: Candidate name pattern. The candidate name is one of a small set of ballot measure choice words:

Candidate Name	Counts as a Choice Word?
For	Yes
Against	Yes
Yes	Yes
No	Yes
Bonds Yes	Yes
Bonds No	Yes
For the Tax Levy	Yes
Against the Tax Levy	Yes

Signal 2: Contest name pattern. The contest name contains one or more ballot measure keywords:

Keyword	Example Contest Name
amendment	CONSTITUTIONAL AMENDMENT 1
bond	BOND REFERENDUM COLUMBUS COUNTY SCHOOLS
referendum	COUNTY SALES TAX REFERENDUM
proposition	PROPOSITION 30 - TAX ON INCOME
measure	MEASURE A - PARCEL TAX
initiative	INITIATIVE 82 - TIPPED WAGES
question	BALLOT QUESTION 4
charter	CHARTER AMENDMENT - TERM LIMITS
levy	RENEWAL 2.0 MILL LEVY - FIRE
issue	ISSUE 1 - REPRODUCTIVE RIGHTS

Both signals must be present. A candidate named “For” in a contest called “COUNTY COMMISSIONER” would not trigger ballot measure detection — it would be flagged as a data quality anomaly for manual review. A candidate named “John Smith” in a contest called “BOND REFERENDUM” is not a ballot measure choice — the candidate name does not match the pattern.
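A minimal sketch of the two-signal rule follows. The function names and return labels are hypothetical. Matching a choice word as the first whole token (so that “For the Tax Levy” matches but a surname beginning with “For” does not) is an assumption about how the prefix check should behave, as is the optional plural on the keywords.

```python
import re

CHOICE_WORDS = {"for", "against", "yes", "no"}
CHOICE_PHRASES = {"bonds yes", "bonds no"}
MEASURE_KEYWORD = re.compile(
    r"\b(amendment|bond|referendum|proposition|measure|initiative"
    r"|question|charter|levy|issue)s?\b",
    re.IGNORECASE,
)


def is_choice_name(candidate_name: str) -> bool:
    """Signal 1: the name is a choice phrase or starts with a choice word."""
    tokens = candidate_name.lower().split()
    if not tokens:
        return False
    return " ".join(tokens) in CHOICE_PHRASES or tokens[0] in CHOICE_WORDS


def is_measure_contest(contest_name: str) -> bool:
    """Signal 2: the contest name contains a ballot measure keyword."""
    return MEASURE_KEYWORD.search(contest_name) is not None


def classify(candidate_name: str, contest_name: str) -> str:
    if is_choice_name(candidate_name):
        if is_measure_contest(contest_name):
            return "ballot_measure"
        return "anomaly_review"  # choice word in a non-measure contest
    return "candidate"


print(classify("Against", "BOND REFERENDUM COLUMBUS COUNTY SCHOOLS"))
print(classify("For", "COUNTY COMMISSIONER"))
print(classify("John Smith", "BOND REFERENDUM"))
```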

Routing

When both signals match, L1 routes the record to BallotMeasure contest kind with a MeasureChoice result type instead of CandidateResult:

{
  "contest": {
    "kind": "ballot_measure",
    "raw_name": "BOND REFERENDUM COLUMBUS COUNTY SCHOOLS",
    "office_level": "school_district",
    "measure_type": "bond"
  },
  "results": [
    {
      "measure_choice": "against",
      "votes_total": 4219,
      "vote_counts_by_type": {
        "election_day": 2107,
        "early": 1891,
        "absentee_mail": 198,
        "provisional": 23
      }
    }
  ]
}

The measure_choice field replaces candidate_name. No name decomposition is performed (there is no first, middle, last, or suffix for “Against”). No entity resolution is needed — “For” in one contest is not the same entity as “For” in another contest. No embedding is generated.

Edge Cases

“For the Tax Levy” vs “For.” Some sources use complete phrases like “For the Tax Levy” rather than bare “For”. The pattern match checks for the prefix, not exact equality.

Mixed contests. A small number of records have both candidate names and ballot measure choices in the same contest. This occurs when a source reports write-in votes alongside measure choices. The L1 parser handles each row independently — “For” is routed to BallotMeasure, while “Write-in” in the same contest is routed to TurnoutMetadata.

Retention elections. Judicial retention elections ask “Shall Judge X be retained?” with choices “Yes” and “No.” These are structurally ballot measures but semantically candidate races — the “candidate” is the judge. L1 classifies these as BallotMeasure with an additional retention_candidate field preserving the judge’s name from the contest string. This is an area where the boundary between candidate races and ballot measures is genuinely blurred.

Scale

Ballot measure records account for approximately 3–5% of total rows in MEDSL 2022, varying by state. States with frequent ballot initiatives (California, Oregon, Colorado) have higher proportions. Failing to detect them does not just create bad entities; it also inflates the count of “candidates” and distorts competitiveness metrics. A bond referendum with 51% “For” and 49% “Against” is not a close race between two candidates named “For” and “Against.”

Contest Disambiguation

Three distinct problems hide under one label: the same office name can mean different races, the same race can have different names, and some races elect multiple winners. Each breaks a different assumption in the pipeline.

Problem 1: Same Office Name, Different Races

Harris County, Texas elects 25 district court judges. Every one of them appears in the data as DISTRICT COURT JUDGE. Without the district column, all 25 races collapse into a single contest — 25 winners, 50+ candidates, and no way to compute margins or determine who ran against whom.

The distinguishing field varies by source:

Source	Office name	Distinguishing field	Example value
MEDSL	`DISTRICT COURT JUDGE`	`district`	`127TH`
NC SBE	`DISTRICT COURT JUDGE DISTRICT 13B SEAT 02`	Embedded in contest name	`13B SEAT 02`
OpenElections	`District Court Judge`	Separate `district` column	`127`

MEDSL separates the seat identifier into a dedicated column. NC SBE concatenates it into the contest name string. OpenElections does both, inconsistently, depending on the state contributor.

The L1 parser must extract the seat identifier regardless of where it appears. The contest entity key is (state, county, office_name, district, seat) — not just (state, county, office_name). Omitting district or seat merges distinct races.
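As an illustration, the key can be modeled as a named tuple, with a small regex pulling the district and seat out of an NC SBE-style contest name. `ContestKey`, `NC_SUFFIX`, and `key_from_nc_sbe` are hypothetical names, the regex covers only the `DISTRICT <id> SEAT <id>` suffix shown in the table, and the county values are placeholders.

```python
import re
from typing import NamedTuple, Optional


class ContestKey(NamedTuple):
    state: str
    county: str
    office_name: str
    district: Optional[str]
    seat: Optional[str]


# Matches "<office> DISTRICT <id>" with an optional trailing "SEAT <id>".
NC_SUFFIX = re.compile(
    r"^(?P<office>.+?)\s+DISTRICT\s+(?P<district>\w+)(?:\s+SEAT\s+(?P<seat>\w+))?$"
)


def key_from_nc_sbe(state: str, county: str, contest_name: str) -> ContestKey:
    """Split an embedded district/seat suffix out of the contest name."""
    m = NC_SUFFIX.match(contest_name.strip())
    if m is None:
        return ContestKey(state, county, contest_name.strip(), None, None)
    return ContestKey(state, county, m["office"], m["district"], m["seat"])


nc = key_from_nc_sbe("NC", "COLUMBUS", "DISTRICT COURT JUDGE DISTRICT 13B SEAT 02")
print(nc)

# When the source already provides a district column, the key is built directly.
tx_127 = ContestKey("TX", "HARRIS", "DISTRICT COURT JUDGE", "127TH", None)
tx_55 = ContestKey("TX", "HARRIS", "DISTRICT COURT JUDGE", "55TH", None)
print(tx_127 != tx_55)
```

Because the tuple is hashable, it can serve directly as a dictionary key when grouping rows into contests.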

Real examples from MEDSL 2022:

State	Office name	Distinct seats	What disambiguates
TX	DISTRICT COURT JUDGE	25	`district` column: 11TH, 55TH, 80TH, …
NC	DISTRICT COURT JUDGE	48	Contest name suffix: `DISTRICT 13B SEAT 02`
OH	COURT OF COMMON PLEAS	14	`district` column: GENERAL DIVISION, DOMESTIC
FL	COUNTY COURT JUDGE	6–12 per county	`district` column: GROUP 1, GROUP 2, …

Florida’s GROUP numbering is particularly treacherous. “COUNTY COURT JUDGE GROUP 3” in Miami-Dade is a different contest from “COUNTY COURT JUDGE GROUP 3” in Broward. The county is part of the disambiguation key.

Problem 2: Same Race, Different Names Across Years

NC SBE data from 2014 labels a state house seat as NC HOUSE OF REPRESENTATIVES DISTRICT 03. In 2018, redistricting renamed it to NC HOUSE OF REPRESENTATIVES DISTRICT 3. In 2022, the same source uses DISTRICT THREE in some contest types.

All three strings refer to the same legislative seat. But to a string-matching system, they are three different contests. Tracking a candidate’s career across elections requires knowing that DISTRICT 03, DISTRICT 3, and DISTRICT THREE are the same district.

Common variations found in NC SBE data:

Variant A	Variant B	Variant C	Same contest?
DISTRICT 03	DISTRICT 3	DISTRICT THREE	Yes
BOARD OF EDUCATION	BD OF ED	BOE	Yes
COUNTY COMMISSIONERS	COUNTY COMMISSION	BOARD OF COMMISSIONERS	Yes

This is contest entity resolution — the same problem as candidate entity resolution, applied to office names instead of person names. The cascade applies:

Normalize numbers: Strip leading zeros, convert written numbers to digits. DISTRICT 03 → DISTRICT 3, DISTRICT THREE → DISTRICT 3.
Abbreviation expansion: BD OF ED → BOARD OF EDUCATION, COMM → COMMISSION.
Embedding similarity: For remaining ambiguous pairs, compute cosine similarity on contest composite strings and apply the same threshold logic as candidate matching.
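The two deterministic steps of this cascade can be sketched as token-level rewriting. The lookup tables below are small illustrative subsets built from the variants listed above, not the project's actual tables, and `normalize_contest` is a hypothetical name. The embedding step is not shown.

```python
import re

# Illustrative subsets only; a real table would be larger.
NUMBER_WORDS = {
    "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4", "FIVE": "5",
    "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9", "TEN": "10",
}
ABBREVIATIONS = {
    "BD": "BOARD",
    "ED": "EDUCATION",
    "BOE": "BOARD OF EDUCATION",
    "COMM": "COMMISSION",
}


def normalize_contest(name: str) -> str:
    """Uppercase, drop punctuation, normalize numbers, expand abbreviations."""
    tokens = re.findall(r"[A-Z0-9]+", name.upper())
    out = []
    for tok in tokens:
        if tok.isdigit():
            tok = str(int(tok))              # strip leading zeros: 03 -> 3
        tok = NUMBER_WORDS.get(tok, tok)     # written numbers: THREE -> 3
        tok = ABBREVIATIONS.get(tok, tok)    # BD -> BOARD, BOE -> BOARD OF EDUCATION
        out.append(tok)
    return " ".join(out)


for raw in [
    "NC HOUSE OF REPRESENTATIVES DISTRICT 03",
    "NC HOUSE OF REPRESENTATIVES DISTRICT 3",
    "NC HOUSE OF REPRESENTATIVES DISTRICT THREE",
    "BD OF ED",
]:
    print(f"{raw!r} -> {normalize_contest(raw)!r}")
```

Pairs that still differ after this pass, such as COUNTY COMMISSIONERS vs BOARD OF COMMISSIONERS, are the ones left for the embedding step.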

Contest entity resolution runs at L3 alongside candidate entity resolution. Each contest receives a contest_entity_id that persists across election cycles.

Problem 3: Multi-Seat Contests

A “vote for 3” school board race elects the top three candidates. The standard margin computation — difference between first and second place — does not apply. The meaningful margin is between the last winner (3rd place) and the first loser (4th place).

The vote_for field (called magnitude in some sources) records how many seats are being filled. MEDSL provides this field for most contests. NC SBE does not — it must be inferred from ballot instructions embedded in the contest name or from the number of candidates who received non-trivial vote shares.

Real example from Dawson County, Georgia (2022):

Contest	vote_for	Candidates	Votes
Board of Education	3	6	25,186 / 25,186 / 24,901 / 24,844 / 23,112 / 22,987

The effective margin is between 3rd place (24,901) and 4th place (24,844) — a gap of 57 votes. Reporting the margin as the gap between 1st and 2nd (0 votes — an exact tie) is misleading: the tie is between the top two winners, not between a winner and a loser.

Worse, the exact tie at the top (25,186 each) may trigger recount rules in some jurisdictions. Whether a recount applies depends on whether the tied candidates are competing for the same seat or are both safely elected. The vote_for field is the only way to know.
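The cutoff margin is a one-line computation once vote_for is known. A minimal sketch using the Dawson County totals from the table above (`cutoff_margin` is a hypothetical name):

```python
from typing import List, Optional


def cutoff_margin(votes: List[int], vote_for: int) -> Optional[int]:
    """Gap between the last winner and the first loser.

    Returns None when every candidate wins a seat (no loser to compare with).
    """
    if vote_for < 1:
        raise ValueError("vote_for must be at least 1")
    ranked = sorted(votes, reverse=True)
    if len(ranked) <= vote_for:
        return None
    return ranked[vote_for - 1] - ranked[vote_for]


dawson = [25186, 25186, 24901, 24844, 23112, 22987]
print(cutoff_margin(dawson, vote_for=3))  # 57: 3rd place vs 4th place
print(cutoff_margin(dawson, vote_for=1))  # 0: the misleading 1st-vs-2nd "margin"
```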

Why `vote_for` matters for competitiveness analysis

Without vote_for, every multi-seat contest looks either wildly competitive (if you compare 1st to 2nd among co-winners) or wildly uncompetitive (if you compare any winner to any loser in a field of 6). The correct margin — last winner vs. first loser — requires knowing the cutoff.

Analysis	Without `vote_for`	With `vote_for`
Is the race competitive?	Unclear — 0-vote “margin” is misleading	Margin of 57 votes at the cutoff
Is it uncontested?	6 candidates — looks contested	Only if ≤ 3 candidates filed
Who won?	Top 1? Top 2? Unknown	Top 3

Detection when the field is missing

When vote_for is absent (NC SBE, some OpenElections files), L1 applies heuristics:

Contest name pattern: “VOTE FOR 3”, “ELECT 2”, “(3 SEATS)” embedded in the contest name string.
Candidate count: If 6+ candidates appear in a school board or city council race, flag for multi-seat review.
Vote distribution: If the top N candidates have similar vote totals and a clear drop-off to N+1, infer N seats.

These heuristics are imperfect. The vote_for field, when present, overrides all heuristics. When absent, the inferred value is stored with a confidence flag, and the L4 verification audit reviews flagged contests.
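The contest name heuristic and the override rule can be sketched as follows. The confidence labels (`source`, `inferred_from_name`, `unknown`) and the function name are hypothetical, and the candidate count and vote distribution heuristics are not shown.

```python
import re
from typing import Optional, Tuple

# Patterns for the three name formats listed above.
VOTE_FOR_PATTERNS = (
    re.compile(r"\bVOTE\s+FOR\s+(\d+)\b", re.IGNORECASE),
    re.compile(r"\bELECT\s+(\d+)\b", re.IGNORECASE),
    re.compile(r"\((\d+)\s+SEATS?\)", re.IGNORECASE),
)


def infer_vote_for(
    contest_name: str, explicit: Optional[int] = None
) -> Tuple[Optional[int], str]:
    """Return (vote_for, confidence flag). An explicit source value always wins."""
    if explicit is not None:
        return explicit, "source"
    for pattern in VOTE_FOR_PATTERNS:
        match = pattern.search(contest_name)
        if match:
            return int(match.group(1)), "inferred_from_name"
    return None, "unknown"  # leave for the other heuristics and the L4 audit


print(infer_vote_for("BOARD OF EDUCATION (VOTE FOR 3)"))
print(infer_vote_for("CITY COUNCIL (3 SEATS)"))
print(infer_vote_for("SHERIFF"))
```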

How All Three Interact

A single contest can exhibit all three problems simultaneously. Consider a Texas county with five JP (Justice of the Peace) precincts, each electing one JP, across three election cycles where the contest name changed from “J.P. PCT 3” to “JUSTICE OF THE PEACE PRECINCT 3” to “JP PRECINCT THREE”:

Problem 1: Five precincts, five separate contests, all labeled variants of “Justice of the Peace”.
Problem 2: Three different name formats across 2018, 2020, 2022 for each precinct.
Problem 3: Each is single-seat, but a neighboring school board race on the same ballot elects three members.

The contest entity key (state, county, office_name_normalized, district_normalized, seat) disambiguates problem 1. Contest entity resolution across years handles problem 2. The vote_for field handles problem 3. All three solutions must work together for the contest record to be correct.

Cross-Source Reconciliation

When two independent sources cover the same election, their overlap becomes a validation set. If MEDSL and NC SBE both report results for the same contest in the same county, the vote totals should match. When they do, both sources are credible. When they don’t, at least one has an error — and the disagreement reveals data quality issues that no single-source analysis can detect.

North Carolina 2022 is our primary validation case. Both MEDSL and the NC State Board of Elections publish precinct-level results for all NC contests in the 2022 general election.

The Overlap

We identified 640 contests present in both MEDSL and NC SBE for the 2022 general election. These span federal, state, county, municipal, judicial, and school board races across all 100 NC counties.

For each contest, we aggregated precinct-level results to the contest level and compared total votes per candidate.

Agreement Level	Contests	Percentage
Exact vote total match	579	90.5%
Within 1% of each other	47	7.3%
Disagree by more than 1%	14	2.2%
Total	640	100%

90.5% exact match across 640 contests, derived from two completely independent data pipelines (MIT’s academic processing vs. NC’s official state board reporting), is strong evidence that both sources are faithfully representing the same underlying certified results.
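The comparison behind the table can be sketched as below. Two points are assumptions rather than documented behavior: candidates are assumed to be already aligned to a shared key across sources, and "within 1%" is taken to mean that the largest per-candidate relative difference is at most 0.01. `agreement_level` is a hypothetical name.

```python
from typing import Dict


def agreement_level(a: Dict[str, int], b: Dict[str, int]) -> str:
    """Classify two sources' per-candidate totals for one contest."""
    if a == b:
        return "exact"
    worst = 0.0
    for candidate in a.keys() | b.keys():
        x, y = a.get(candidate, 0), b.get(candidate, 0)
        larger = max(x, y)
        if larger:
            worst = max(worst, abs(x - y) / larger)
    return "within_1pct" if worst <= 0.01 else "disagree"


print(agreement_level({"A": 1000, "B": 900}, {"A": 1000, "B": 900}))
print(agreement_level({"A": 1000, "B": 900}, {"A": 1005, "B": 900}))
print(agreement_level({"A": 1000, "B": 900}, {"A": 1200, "B": 900}))

# The percentages in the table follow from the contest counts.
counts = {"exact": 579, "within_1pct": 47, "disagree": 14}
total = sum(counts.values())
for level, n in counts.items():
    print(f"{level:<12} {n:>4} {100 * n / total:.1f}%")
```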

The 7.3% — Small Disagreements

The 47 contests with near-matches (within 1%) trace to identifiable causes:

Cause	Contests	Notes
Provisional ballot inclusion timing	22	MEDSL snapshot taken before final canvass; NC SBE includes provisionals
Precinct boundary rounding	11	Split precincts assigned differently by each source
Write-in aggregation	9	NC SBE reports individual write-ins; MEDSL aggregates to “Write-in”
Unknown	5	Under investigation

The 42 contests with an identified cause are not errors; they reflect legitimate differences in how two organizations process the same raw certified results. The remaining 5 are still under investigation. Provisional ballot timing is the most common cause: MEDSL’s data may reflect an earlier snapshot of the canvass than NC SBE’s final certified totals.

The 2.2% — Real Disagreements

The 14 contests with >1% disagreement require individual investigation. Common causes include:

Misassigned precincts. A precinct’s results attributed to the wrong contest or district in one source.
Partial data. One source missing results from a subset of precincts, typically in multi-county contests where one county’s data arrived late.
Candidate name mismatch causing split. The same candidate’s votes split across two entity IDs in one source because a name variant was not resolved — e.g., “JOHN SMITH” in early voting vs. “John R. Smith” in election-day results treated as different candidates.

These 14 cases are flagged by the L4 cross-source reconciliation algorithm and reported in the verification output. They are not silently ignored.

Name Formatting Differences

Vote totals may agree, but candidate names almost never do. Of the 640 overlapping contests, 401 (62.7%) have at least one candidate whose name is formatted differently between MEDSL and NC SBE.

Formatting Difference	Example (MEDSL)	Example (NC SBE)	Frequency
ALL CAPS vs Title Case	`TIMOTHY LANCE`	`Timothy Lance`	389
Last-first vs first-last	`LANCE, TIMOTHY`	`Timothy Lance`	247
Middle initial present/absent	`SHANNON W BRAY`	`Shannon W. Bray`	118
Period after middle initial	`SHANNON W BRAY`	`Shannon W. Bray`	94
Nickname in quotes vs parens	`CHARLES "CHARLIE" CRIST`	`Charles (Charlie) Crist`	12
Suffix formatting	`ROBERT WILLIAMS JR`	`Robert Williams, Jr.`	31
Prefix/title included	`HON. JANE DOE`	`Jane Doe`	8

A single candidate can exhibit multiple formatting differences simultaneously. “BRAY, SHANNON W” (MEDSL) vs “Shannon W. Bray” (NC SBE) combines casing, ordering, and punctuation differences in one pair.

This is why entity resolution exists. The vote totals confirm these are the same contests with the same candidates. The name formatting confirms that string equality is insufficient — structured decomposition, embedding, and in some cases LLM confirmation are required to link records across sources.
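The purely mechanical differences (casing, ordering, periods, suffix punctuation) can be removed deterministically before any embedding or LLM step. The sketch below is an assumption about what such a normalizer might look like, not the project's structured decomposition. `normalize_name` and `SUFFIXES` are hypothetical names, and nicknames and titles are deliberately left untouched.

```python
SUFFIXES = {"JR", "SR", "II", "III", "IV"}


def normalize_name(raw: str) -> str:
    """Uppercase, drop periods, and reorder 'LAST, FIRST' to 'FIRST LAST'."""
    name = raw.replace(".", "").strip()
    if "," in name:
        head, _, tail = name.partition(",")
        head, tail = head.strip(), tail.strip()
        if tail.upper() in SUFFIXES:
            # "Robert Williams, Jr" keeps its order; the suffix is preserved
            # because it distinguishes two different people.
            name = f"{head} {tail}"
        else:
            name = f"{tail} {head}"
    return " ".join(name.upper().split())


print(normalize_name("BRAY, SHANNON W"))
print(normalize_name("Shannon W. Bray"))
print(normalize_name("Robert Williams, Jr."))
```

Pairs that are equal after this pass need no further confirmation; everything else continues to the embedding and LLM steps.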

This Overlap as a Validation Set

The 640-contest NC overlap serves three purposes in the pipeline:

1. Entity Resolution Validation

For every candidate pair that the L3 cascade matches across MEDSL and NC SBE, we can verify the match by comparing vote totals. If the cascade says “TIMOTHY LANCE” (MEDSL) and “Timothy Lance” (NC SBE) are the same person, and their vote totals match exactly, the match is confirmed by an independent signal. If the cascade says they match but the vote totals disagree by 50%, the match is suspect.

2. Office Classification Validation

Both sources cover the same contests but use different office name strings. MEDSL might report “NC HOUSE OF REPRESENTATIVES DISTRICT 047” while NC SBE reports “NC HOUSE OF REPRESENTATIVES - DISTRICT 47”. If both classify to state/legislative, the classifier is consistent. If one classifies to state/legislative and the other to county/legislative, we have a bug.

3. Parser Validation

When two independent parsers (the MEDSL parser and the NC SBE parser) produce the same vote counts for the same contest, both parsers are likely correct. When they disagree, the disagreement localizes the bug to one parser or the other — far easier to debug than a single-source pipeline where errors are invisible.

Beyond NC

The NC overlap is our deepest validation case because NC SBE publishes granular, machine-readable precinct data going back to 2006. Other states offer less overlap:

State	MEDSL 2022	Secondary Source	Overlap Quality
NC	Yes	NC SBE (precinct-level, 2006–2024)	High
FL	Yes	OpenElections (county-level, select years)	Medium
OH	Yes	OpenElections (precinct-level, 2022)	Medium
GA	Yes	Clarity/Scytl (election night, unstable URLs)	Low
All others	Yes	MEDSL only	None

As additional state-level sources are integrated, each creates a new validation pair. The architecture is designed to scale: the L4 cross-source reconciliation algorithm runs for any pair of sources that cover the same (state, year, contest) combination. No code changes are required — only new L0 data and a new L1 parser.

The Lesson

Cross-source reconciliation is not a feature — it is the only reliable way to detect errors in election data. A single source can be internally consistent and still wrong. Two independent sources that agree are almost certainly right. Two independent sources that disagree tell you exactly where to look.

The 90.5% exact match rate across 640 NC contests is our current evidence floor. Every additional source and state that achieves similar agreement raises confidence in the pipeline. Every disagreement is a bug report — either in our pipeline or in the source data.

Design Principles

Five principles govern every architectural decision in this project. They are listed in priority order — when two principles conflict, the higher-ranked principle wins.

1. Deterministic First

If a deterministic method produces correct results, use it. Do not add machine learning, embeddings, or LLM calls where string matching, regex, or lookup tables suffice. L0 through L1 contain zero ML — name decomposition, FIPS enrichment, keyword-based office classification, and hash computation are all deterministic operations that produce identical output from identical input on every run. Deterministic methods are not preferred because they are cheaper (budget is not a constraint). They are preferred because they are reproducible, auditable, and incapable of hallucination. When a journalist asks “why did your system say these two candidates are the same person?”, the answer should be “because their canonical first names, last names, and suffixes are identical” — not “because a language model said so.” Determinism is the default. Non-determinism requires justification.

2. Preserve Signal

Every piece of information in the source data is potential disambiguation signal. Middle initials distinguish David S. Marshall (Maine) from David A. Marshall (Florida) — dropping them collapses two people into one. Suffixes distinguish Robert Williams from Robert Williams Jr. — stripping them merges father and son. Nicknames reveal that Charlie Crist and Charles Joseph Crist are the same person — normalizing too early destroys that connection. The rule at L1 is: decompose names into structured components (raw, first, middle, last, suffix, canonical_first) and preserve every component. Do not discard middle initials. Do not strip suffixes. Do not overwrite the raw name with a canonical form. Clean without collapsing. Downstream layers (L2 embedding, L3 matching, L4 canonicalization) consume these components selectively. The raw material must survive intact through L1 for those layers to function.

3. LLMs for Confirmation, Not Discovery

Embeddings retrieve candidates. LLMs confirm matches. The embedding model (text-embedding-3-large, 3,072 dimensions) identifies pairs that might be the same entity — Charlie Crist at cosine 0.451, Robert Williams Jr at 0.862. The LLM (Claude Sonnet) then examines the full context — structured name components, vote counts, office, state, party — and renders a judgment: match or no-match, with confidence and reasoning. The LLM is never the first line of analysis. It does not scan raw files, parse CSV columns, compute FIPS codes, or generate embeddings. It is called only when cheaper methods have narrowed the problem to a specific, bounded question: “Are these two records the same person?” or “What type of office is the Santa Rosa Island Authority?” This ordering exists for speed (70% of entity resolution is exact match), reproducibility (deterministic steps produce identical results), and auditability (every LLM decision is logged with its prompt, response, and reasoning).

4. Immutable Layers

Outputs are append-only. L0 raw files are never modified. L1 cleaned records are never updated in place — if the parser changes, a new L1 run produces new records with a new parser_version. L2 embeddings are never re-computed by overwriting existing vectors — a new embedding model produces new L2 output alongside the old. L3 match decisions are never silently revised — an override produces a new decision record referencing the original. L4 canonical exports are versioned snapshots, not mutable databases. This immutability serves two purposes. First, provenance: the hash chain from L4 back to L0 depends on every intermediate record remaining unchanged. Modifying an L1 record without incrementing the parser version breaks the chain. Second, debugging: when a result looks wrong, you can inspect every layer’s output at the time it was produced, without worrying that a subsequent run overwrote the evidence.

5. Document Sources, Don’t Store Data

This project does not redistribute election data. Each source — MEDSL, NC SBE, OpenElections, VEST, Census, FEC — publishes data under its own license, on its own schedule, at its own URLs. We provide exact download URLs, file size expectations, schema documentation, known data quality issues, and the tools to process the data. We do not provide the data itself. The reasons are legal (license terms vary), practical (the current corpus exceeds 8 GB and grows with each election cycle), and epistemic (a stale copy of a dataset that the source has since corrected is worse than no copy at all). Users download data from authoritative sources, verify file integrity against documented hashes, and run the pipeline locally. The L0 manifest records exactly where each file came from and when it was retrieved, so any result can be traced back to its authoritative origin.

The Five-Layer Pipeline

The pipeline processes election data through five immutable layers. Each layer depends on all prior layers. Every record carries a hash chain back to the original source bytes. The storage format at every layer is JSONL (one JSON record per line).

L0  RAW         Byte-identical source files with acquisition manifests.
 │
 │  deterministic parse — no ML, no API calls
 ▼
L1  CLEANED     Structured records. Names decomposed into components.
 │              FIPS enrichment. Office classification (keyword + regex).
 │
 │  deterministic given embedding model version
 ▼
L2  EMBEDDED    Vector embeddings for candidates, contests, geography.
 │              Tier 3 office classification. Quality flags.
 │
 │  non-deterministic — LLM decisions stored in audit log
 ▼
L3  MATCHED     Entity resolution. candidate_entity_id assigned.
 │              contest_entity_id assigned. Cross-source dedup.
 │
 │  deterministic given L3 entity assignments
 ▼
L4  CANONICAL   Authoritative names. Temporal chains. Alias tables.
                Verification algorithms. Researcher-facing exports.

Layer properties

Layer	Deterministic	Needs API key	Output format	Re-runnable from
L0	Yes	No	Original files + `.manifest.json`	External sources
L1	Yes	No	JSONL	L0
L2	Yes, given model version	Yes (OpenAI)	JSONL + `.npy` sidecars	L1
L3	No (LLM)	Yes (Anthropic)	JSONL + decision log (JSONL)	L2
L4	Yes, given L3	No	JSONL + JSON registries + CSV export	L3

The determinism boundary falls between L2 and L3. Everything from L0 through L2 produces identical output from identical input, given the same code version and embedding model. L3 introduces LLM calls whose outputs may vary between runs, but every decision is stored in a JSONL audit log that enables deterministic replay.

What each layer produces

L0: Raw

The input to the entire pipeline. L0 is a byte-identical copy of each source file, accompanied by a JSON manifest recording how it was acquired.

l0_raw/
├── nc_sbe/
│   ├── results_pct_20221108.txt           # Original file, untouched
│   └── results_pct_20221108.txt.manifest.json
├── medsl/
│   └── 2022-nc-local-precinct-general/
│       ├── NC-cleaned-final3.csv
│       └── NC-cleaned-final3.csv.manifest.json
└── ...

The manifest records:

{
  "l0_hash": "edfedf2760cfd54f...",
  "source_url": "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip",
  "retrieval_date": "2026-03-18T14:30:00Z",
  "file_size_bytes": 18023456,
  "format_detected": "tsv"
}

L0 files are never modified. If a source is re-downloaded and the content differs, a new versioned L0 entry is created.

L1: Cleaned

L1 parses each source’s native format into a unified JSONL schema. The parser is source-specific (one parser per source), but the output schema is the same regardless of source.

L1 performs 10 operations in fixed order:

Filter non-contests — Detect “Registered Voters”, “Ballots Cast”, “Over Votes”, “Under Votes”. Route to turnout metadata. Detect “For”/“Against”/“Yes”/“No” ballot measure choices.
Parse source format — Source-specific: CSV for MEDSL, TSV for NC SBE, XML for Clarity.
Decompose candidate names — Split into first, middle, last, suffix, nickname. Preserve every component. Robert Van Fletcher, Jr. becomes {first: "Robert", middle: "Van", last: "Fletcher", suffix: "Jr.", raw: "Robert Van Fletcher, Jr."}.
Apply nickname dictionary — Map Charlie → Charles, Bill → William, etc. Store as canonical_first. Preserve original first.
Classify contest kind — CandidateRace, BallotMeasure, or TurnoutMetadata.
Classify office (tiers 1–2) — Keyword lookup (~170 entries), then regex patterns (~40 patterns). No ML, no embeddings. Records that don’t match remain other.
Enrich geography — FIPS lookup from bundled reference data (3,143 counties, 31,980 places). Generate OCD-IDs.
Compute vote shares — votes_total / sum(all candidates in contest).
Backfill turnout — If turnout metadata rows were found, attach registered voter counts to sibling contest records in the same precinct.
Compute L1 hash — SHA-256(record content + "parent:" + L0 hash).

A single L1 record for a Columbus County, NC school board race:

{
  "election": {"date": "2022-11-08", "type": "general"},
  "jurisdiction": {
    "state": "NC", "state_fips": "37",
    "county": "COLUMBUS", "county_fips": "37047",
    "precinct": "P17", "level": "precinct"
  },
  "contest": {
    "kind": "candidate_race",
    "raw_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
    "office_level": "school_district",
    "classifier_method": "regex",
    "classifier_confidence": 0.85,
    "vote_for": 1
  },
  "results": [
    {
      "candidate_name": {
        "raw": "Timothy Lance", "first": "Timothy",
        "middle": null, "last": "Lance", "suffix": null,
        "canonical_first": "Timothy"
      },
      "votes_total": 303,
      "vote_counts_by_type": {
        "election_day": 136, "early": 159,
        "absentee_mail": 7, "provisional": 1
      }
    }
  ],
  "source": {
    "source_type": "nc_sbe",
    "source_file": "results_pct_20221108.txt",
    "confidence": "high"
  },
  "provenance": {
    "l1_hash": "8ea7ecc257ff8e05",
    "l0_parent_hash": "edfedf2760cfd54f",
    "parser_version": "nc_sbe_v2.1",
    "schema_version": "3.0.0"
  }
}

L1 does not use any machine learning, API calls, or non-deterministic processes. Given the same L0 files and the same parser version, L1 output is identical on every run.

L2: Embedded

L2 generates vector embeddings for text fields that need fuzzy matching. The embedding model is text-embedding-3-large (3,072 dimensions) from OpenAI. L2 also applies tier 3 office classification (embedding nearest-neighbor against a reference set of ~200 known office names) and raises quality flags on suspicious records.

L2 produces three types of output:

Enriched JSONL — L1 records augmented with classification upgrades and quality flags:

{
  "...all L1 fields...",
  "l2": {
    "l2_hash": "854fa6367960bb05",
    "l1_parent_hash": "8ea7ecc257ff8e05",
    "embedding_model": "text-embedding-3-large",
    "embedding_dimensions": 3072,
    "candidate_embedding_id": 4271,
    "contest_embedding_id": 183,
    "candidate_composite": "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus",
    "contest_composite": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 | school_district | NC 2022",
    "quality_flags": []
  }
}

Embedding sidecars — Binary .npy files (float32 arrays) containing the actual vectors. One file per embedding type per partition:

l2_embedded/
├── nc/2022/
│   ├── enriched.jsonl
│   ├── candidate_embeddings.npy    # float32[N, 3072]
│   ├── contest_embeddings.npy      # float32[M, 3072]
│   └── geography_embeddings.npy    # float32[K, 3072]

ID mapping — A JSON file mapping L1 record hashes to embedding row indices.

The composite strings fed to the embedding model follow fixed templates:

Purpose	Template
Candidate	`{canonical_first} {middle} {last} {suffix} | {party} | {office} | {state} | {county}`
Contest	`{raw_contest_name} | {office_level} | {state} {year}`
Geography	`{municipality}, {county} County, {state}`

Middle initials and suffixes are included in the candidate composite. This is deliberate — “David S Marshall” and “David A Marshall” produce different vectors, which helps distinguish different people with the same first and last name. We measured this: including the middle initial reduced cosine similarity between the two David Marshalls from 0.7025 to 0.6448.
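As a rough sketch of how the candidate template might be filled in, the function below joins the name components and collapses repeated whitespace so that an empty party yields `| |`, as in the enriched record shown earlier. The function name and the whitespace handling are assumptions made for illustration.

```python
def candidate_composite(name, party, office, state, county):
    """Build the candidate composite string from structured name components."""
    name_part = " ".join(
        part
        for part in (name.get("canonical_first"), name.get("middle"),
                     name.get("last"), name.get("suffix"))
        if part  # skip missing components, keep middle initials and suffixes
    )
    joined = " | ".join([name_part, party or "", office, state, county])
    return " ".join(joined.split())  # assumed: collapse runs of whitespace


lance = {"canonical_first": "Timothy", "middle": None, "last": "Lance", "suffix": None}
print(candidate_composite(lance, None, "BOARD OF EDUCATION DISTRICT 02", "NC", "Columbus"))
```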

L2 is deterministic given the same embedding model version. If OpenAI changes the weights behind text-embedding-3-large, the vectors change. The embedding_model and model version are stored in every L2 record to detect this.

L3: Matched

L3 resolves entities — it determines which records across sources and elections refer to the same candidate and the same contest. This is the first non-deterministic layer because it uses LLM calls for ambiguous cases.

L3 runs an entity resolution cascade for each candidate record:

Step	Method	Handles	Cost
1	Exact match on `(canonical_first, last, suffix)`	Same name across precincts	$0
2	Jaro-Winkler similarity ≥ 0.92	Minor spelling variations	$0
2.5	Name similarity gate: JW on last name < 0.50 → skip	Obvious non-matches	$0
3	Embedding retrieval: cosine ≥ 0.95 → auto-accept	Format differences	$0
4	LLM confirmation: cosine 0.35–0.95	Nicknames, suffixes, ambiguous names	~$0.0002/call
5	Tiebreaker: stronger model when step 4 is uncertain	Low-confidence cases	~$0.002/call

In our prototype run of 200 records (the counts below are per candidate, not per record):

Step 1 resolved 597 candidates (70.0%)
Step 2 resolved 1 (0.1%)
Step 3 resolved 50 (5.9%)
Step 4 was invoked 30 times (3.5%), all resulting in no-match
206 unique candidate entities were created

The 30 LLM calls in our prototype were all spent on pairs within the same (state, office_level) block that had moderate embedding similarity (0.55–0.73) but completely different names — “Aaron Bridges” vs “Daniel Blanton” type comparisons. All 30 were correctly rejected. This finding led to the addition of step 2.5 (the name similarity gate): if the Jaro-Winkler score on last names alone is below 0.50, skip the pair entirely without computing embedding similarity.
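The routing logic of the cascade, including the step 2.5 gate, can be sketched as a single function. This is an illustration under assumptions: the similarity scores are taken as inputs rather than computed, the step 5 tiebreaker is left out, and treating cosine below 0.35 as a rejection is a guess, since the table does not say what happens below the LLM band.

```python
def route_pair(key_a, key_b, jw_full, jw_last, cosine_fn):
    """Return the cascade step that handles one candidate pair.

    key_a / key_b are (canonical_first, last, suffix) tuples. cosine_fn is a
    zero-argument callable so the embedding score is only computed when needed.
    """
    if key_a == key_b:
        return "step1_exact"
    if jw_full >= 0.92:
        return "step2_jaro_winkler"
    if jw_last < 0.50:
        return "step2_5_skip"          # gate: cosine_fn is never called
    cosine = cosine_fn()
    if cosine >= 0.95:
        return "step3_auto_accept"
    if cosine >= 0.35:
        return "step4_llm"
    return "no_match"                  # assumption: below the LLM band means reject


# Scores here are made up for illustration.
print(route_pair(("aaron", "bridges", None), ("daniel", "blanton", None),
                 jw_full=0.45, jw_last=0.40, cosine_fn=lambda: 0.60))
# prints step2_5_skip
```

Passing the cosine score as a callable mirrors the point of the gate: pairs with unrelated last names never reach the embedding comparison.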

Every L3 decision is stored in a JSONL audit log:

{
  "decision_id": "a3f8c1d2-...",
  "decision_type": "candidate_match",
  "timestamp": "2026-03-19T10:30:00Z",
  "inputs": {
    "name_a": "Charlie Crist",
    "name_b": "CRIST, CHARLES JOSEPH",
    "embedding_score": 0.451,
    "state_a": "FL", "state_b": "FL",
    "contest_a": "Governor", "contest_b": "Governor",
    "votes_a": 3101652, "votes_b": 3101652
  },
  "method": {
    "type": "llm",
    "model": "claude-sonnet-4-20250514",
    "prompt_template_version": "entity_match_v2.0"
  },
  "output": {
    "decision": "match",
    "confidence": 0.95,
    "reasoning": "Charlie is a common nickname for Charles. Same state, same office, identical vote counts."
  }
}

A researcher who wants to reproduce L3 can either replay the cached decisions from the log (deterministic) or re-run the LLM calls (which may produce slightly different responses). The log preserves everything needed for either approach.
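A minimal sketch of the replay path, assuming decisions are keyed by the pair of raw names found under `inputs`. A real replay would likely need a fuller key (state, contest, prompt template version); the function names and the `call_llm` hook are hypothetical.

```python
import json


def load_decisions(lines):
    """Index logged decisions by (name_a, name_b). `lines` is any iterable of JSONL lines."""
    cache = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        inputs = record["inputs"]
        cache[(inputs["name_a"], inputs["name_b"])] = record["output"]
    return cache


def decide(name_a, name_b, cache, call_llm=None):
    """Replay a cached decision; fall back to a live call only if one is supplied."""
    key = (name_a, name_b)
    if key in cache:
        return cache[key]              # deterministic replay
    if call_llm is None:
        raise KeyError(f"no logged decision for {key}")
    return call_llm(name_a, name_b)    # non-deterministic path
```

Running with `call_llm=None` gives a strict replay: any pair missing from the log raises instead of silently triggering a new model call.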

L3 adds entity assignments to each record:

{
  "...all L2 fields...",
  "l3": {
    "l3_hash": "28183d41d50204d5",
    "l2_parent_hash": "854fa6367960bb05",
    "candidate_entity_ids": [
      {"result_index": 0, "entity_id": "person:nc:columbus:lance-timothy-13"}
    ],
    "contest_entity_id": "contest:nc:columbus:school-board-d02"
  }
}

L4: Canonical

L4 assigns authoritative representations. For each entity (candidate or contest), it selects a canonical name, builds temporal chains across elections, constructs alias tables, and runs verification algorithms.

Canonical name selection follows a fixed algorithm:

Collect all name variants from all L3 records in the entity cluster.
Prefer the most complete variant (one with a middle initial over one without; one with a suffix over one without).
Among equally complete variants, prefer the one from the most authoritative source (certified state data > academic data > community data).
Among equally authoritative variants, prefer the most recent.
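The three preference rules compose naturally into a single sort key. The sketch below assumes each variant is a dict with hypothetical fields `middle`, `suffix`, `source_tier`, and `election_date` (ISO 8601, so string comparison orders by date); the tier labels are invented for the example.

```python
# Assumed tier labels; higher rank is more authoritative.
SOURCE_RANK = {"certified_state": 3, "academic": 2, "community": 1}


def select_canonical(variants):
    """Pick the canonical variant: most complete, then most authoritative, then most recent."""
    def sort_key(variant):
        completeness = int(bool(variant.get("middle"))) + int(bool(variant.get("suffix")))
        return (completeness, SOURCE_RANK[variant["source_tier"]], variant["election_date"])

    return max(variants, key=sort_key)


variants = [
    {"name": "TIMOTHY LANCE", "middle": None, "suffix": None,
     "source_tier": "academic", "election_date": "2022-11-08"},
    {"name": "Timothy Lance", "middle": None, "suffix": None,
     "source_tier": "certified_state", "election_date": "2022-11-08"},
]
print(select_canonical(variants)["name"])  # prints Timothy Lance
```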

Temporal chains aggregate records by (entity_id, election_date, contest_entity_id). One entry per election, not per precinct. A candidate who appeared in 47 precincts in one election gets one temporal chain entry with the summed vote total.
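The aggregation amounts to a group-by and sum. A small sketch, assuming precinct-level rows are dicts carrying the three key fields plus `votes_total`; the function name and output shape are illustrative.

```python
from collections import defaultdict


def build_temporal_chain(rows):
    """Collapse precinct-level rows into one entry per (entity, election, contest)."""
    totals = defaultdict(int)
    for row in rows:
        key = (row["entity_id"], row["election_date"], row["contest_entity_id"])
        totals[key] += row["votes_total"]
    return [
        {"entity_id": entity, "election_date": date,
         "contest_entity_id": contest, "votes": votes}
        for (entity, date, contest), votes in sorted(totals.items())
    ]
```

Two precinct rows for the same candidate, election, and contest therefore produce one chain entry whose `votes` is their sum.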

Verification algorithms run at L4 to check pipeline integrity:

Hash chain integrity — Walk L4→L3→L2→L1→L0 for every record. Verify no link is broken.
Entity consistency — Flag entities spanning multiple states (unusual for local officials). Flag party switches.
Temporal plausibility — Flag implausible career spans or office progressions.
Cross-source reconciliation — Where two sources cover the same contest, compare vote totals.
Completeness audit — Report coverage by state, county, year. Report FIPS and entity ID fill rates.
LLM entity audit — For multi-member entities, ask a language model whether the cluster is plausible. In our prototype, this caught 43 suspicious entities (precinct-level records inflating temporal chains) and 4 likely errors (“For” and “Against” ballot measure choices classified as person entities).

L4 exports two types of output:

Entity registries (JSON) — One record per unique person or contest:

{
  "entity_id": "person:nc:columbus:lance-timothy-13",
  "canonical_name": "Timothy Lance",
  "aliases": ["Timothy Lance", "TIMOTHY LANCE"],
  "elections": [
    {"date": "2022-11-08", "contest": "Columbus County Schools Board of Education District 02", "votes": 1531}
  ],
  "states": ["NC"],
  "first_appearance": "2022-11-08",
  "election_count": 1
}

Flat exports (JSONL and CSV) — One record per candidate per contest per precinct, with canonical names and entity IDs attached:

{
  "election_date": "2022-11-08",
  "state": "NC",
  "county": "COLUMBUS",
  "contest_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
  "candidate_raw": "Timothy Lance",
  "candidate_canonical": "Timothy Lance",
  "candidate_entity_id": "person:nc:columbus:lance-timothy-13",
  "votes_total": 303,
  "source": "nc_sbe",
  "l3_hash": "28183d41d50204d5",
  "l0_hash": "edfedf2760cfd54f"
}

Why five layers and not two

A simpler system would have two layers: raw and processed. The five-layer design exists because the processing steps have different properties that should not be conflated:

Splitting L1 from L2 means you can upgrade the embedding model without re-parsing all sources. If a better model than text-embedding-3-large becomes available, re-run L2 from L1. L1 remains untouched.

Splitting L2 from L3 means cheap, deterministic embedding generation is separate from expensive, non-deterministic LLM calls. L2 can process 200 million records in hours on CPU (plus API calls for vector generation). L3’s LLM calls can be batched separately, retried on failure, and audited independently.

Splitting L3 from L4 means individual entity resolution decisions are separate from the aggregate operations (canonical name selection, temporal chains, verification) that consume them. If a human reviewer overrides an L3 match decision, L4 can be re-run without re-doing all of L3.

Each layer boundary is a point where you can stop, inspect, export, and restart. A researcher who disagrees with the entity resolution can take L2 output and apply their own matching logic. A developer who wants to test a new office classifier can re-run L1 without re-downloading L0.

Storage layout

local-data/processed/
├── l0_raw/
│   └── {source}/
│       ├── {filename}
│       └── {filename}.manifest.json
├── l1_cleaned/
│   └── {source}/{state}/{year}/
│       ├── cleaned.jsonl
│       └── cleaning_report.json
├── l2_embedded/
│   └── {state}/{year}/
│       ├── enriched.jsonl
│       ├── candidate_embeddings.npy
│       ├── contest_embeddings.npy
│       └── id_mapping.json
├── l3_matched/
│   └── {state}/{year}/
│       ├── matched.jsonl
│       └── decisions/
│           └── candidate_matches.jsonl
└── l4_canonical/
    ├── candidate_registry.json
    ├── contest_registry.json
    ├── verification_report.json
    └── exports/
        ├── flat_export.jsonl
        └── flat_export.csv

All JSONL files are streamable — they can be processed line by line without loading the entire file into memory. At 200 million records with approximately 2 KB per record, the full L1 corpus would be approximately 400 GB. Streaming is not optional at that scale.

The hash chain

Every record at every layer carries a hash of its own content and a reference to its parent layer’s hash:

L4 record
  l4_hash ← SHA-256(L4 content + "parent:" + l3_hash)
    └── l3_hash ← SHA-256(L3 content + "parent:" + l2_hash)
          └── l2_hash ← SHA-256(L2 content + "parent:" + l1_hash)
                └── l1_hash ← SHA-256(L1 content + "parent:" + l0_hash)
                      └── l0_hash ← SHA-256(raw file bytes)

To verify any L4 record: recompute the L4 hash from its content, check that it matches the stored l4_hash, then follow the l3_parent_hash to the L3 record and repeat. Continue through L2 and L1 to L0. At L0, re-hash the raw file bytes and compare to the stored l0_hash.
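The walk described above can be sketched as follows. The serialization of "content" is not specified in this document, so the sketch assumes canonical JSON (sorted keys, compact separators); the record shape (`content`, `hash`, `parent_hash`) and both function names are likewise assumptions.

```python
import hashlib
import json


def layer_hash(content, parent_hash):
    """SHA-256(content + "parent:" + parent hash), with content as canonical JSON (assumed)."""
    payload = json.dumps(content, sort_keys=True, separators=(",", ":")) + "parent:" + parent_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(raw_bytes, contents):
    """Helper for the demo: hash L1..L4 contents upward, return records ordered L4..L1."""
    parent = hashlib.sha256(raw_bytes).hexdigest()
    records = []
    for content in contents:
        current = layer_hash(content, parent)
        records.append({"content": content, "parent_hash": parent, "hash": current})
        parent = current
    return list(reversed(records))


def verify_chain(chain, raw_bytes):
    """Walk L4 -> L1, then compare the L1 parent hash against the re-hashed raw bytes."""
    for i, record in enumerate(chain):
        layer = 4 - i
        if layer_hash(record["content"], record["parent_hash"]) != record["hash"]:
            return f"content mismatch at L{layer}"
        if i + 1 < len(chain):
            expected_parent = chain[i + 1]["hash"]
        else:
            expected_parent = hashlib.sha256(raw_bytes).hexdigest()
        if record["parent_hash"] != expected_parent:
            return f"broken link L{layer}->L{layer - 1}"
    return "ok"


chain = build_chain(b"raw file bytes", [{"layer": n} for n in (1, 2, 3, 4)])
print(verify_chain(chain, b"raw file bytes"))   # prints ok
print(verify_chain(chain, b"modified bytes"))   # prints broken link L1->L0
```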

In our prototype run of 200 records, all 200 hash chains verified from L4 back to L0 with zero broken links.

L0: Raw — Byte-Identical Source Preservation

L0 is the foundation of the pipeline. It stores byte-identical copies of every source file alongside a JSON manifest that records how the file was acquired. Nothing at L0 is parsed, cleaned, or transformed. The raw bytes are sacred.

What L0 Contains

Every source file produces two artifacts:

Artifact	Purpose	Example
The file itself	Exact bytes as downloaded	`results_pct_20221108.txt`
The manifest sidecar	Acquisition metadata	`results_pct_20221108.txt.manifest.json`

The manifest records five fields:

{
  "l0_hash": "edfedf2760cfd54f...",
  "source_url": "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip",
  "retrieval_date": "2026-03-18T14:30:00Z",
  "file_size_bytes": 18023456,
  "format_detected": "tsv"
}

l0_hash — SHA-256 of the raw file bytes. This is the root of the hash chain. Every downstream record at L1–L4 ultimately traces back to this value.
source_url — The exact URL used to retrieve the file. Not a landing page — the direct download link.
retrieval_date — ISO 8601 timestamp of when the file was downloaded. Sources update files in place; the retrieval date disambiguates versions.
file_size_bytes — Byte count of the raw file after decompression (if the source was a zip archive, this is the size of the extracted file, not the archive).
format_detected — The file format as determined by content inspection: tsv, csv, xml, json, fixed_width.
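A minimal sketch of assembling these five fields for a file already held in memory. The function name is hypothetical, and format detection is passed in rather than implemented, since the inspection rules are not described here.

```python
import hashlib
from datetime import datetime, timezone


def build_manifest(raw_bytes, source_url, format_detected, retrieved_at=None):
    """Return the five manifest fields for the stored (extracted) file bytes."""
    # retrieved_at should be timezone-aware; it is converted to UTC before formatting.
    retrieved_at = (retrieved_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "l0_hash": hashlib.sha256(raw_bytes).hexdigest(),
        "source_url": source_url,
        "retrieval_date": retrieved_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "file_size_bytes": len(raw_bytes),
        "format_detected": format_detected,
    }
```

For multi-gigabyte files the hash would more likely be computed in chunks from disk; hashing a single bytes object keeps the sketch short.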

Storage Layout

l0_raw/
├── nc_sbe/
│   ├── results_pct_20221108.txt
│   ├── results_pct_20221108.txt.manifest.json
│   ├── results_pct_20201103.txt
│   └── results_pct_20201103.txt.manifest.json
├── medsl/
│   ├── 2022-nc-precinct-general.csv
│   ├── 2022-nc-precinct-general.csv.manifest.json
│   ├── 2022-fl-precinct-general.csv
│   └── 2022-fl-precinct-general.csv.manifest.json
├── openelections/
│   ├── 20221108__fl__general__precinct.csv
│   └── 20221108__fl__general__precinct.csv.manifest.json
└── census/
    ├── national_county2020.txt
    └── national_county2020.txt.manifest.json

Files are organized by source, not by state or year. The source is the natural partition because each source has its own parser at L1. A single MEDSL file may contain data for all 50 states; a single NC SBE file contains one election’s results for all NC counties. The source directory mirrors the download structure.

Idempotent Download

Downloading is idempotent. Before fetching a file, the pipeline checks whether an L0 entry already exists with a matching l0_hash:

If the manifest exists and the file exists and the file’s SHA-256 matches the manifest’s l0_hash → skip download. The file is already present and intact.
If the manifest exists but the file is missing or the hash does not match → re-download. The file was corrupted or deleted.
If no manifest exists → download and create manifest.

This means running the download step twice produces no network traffic on the second run. It also means the pipeline recovers gracefully from interrupted downloads — a partially written file will fail the hash check and be re-fetched.
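The three cases above reduce to a small decision function. This sketch assumes the manifest stores the full SHA-256 hex digest (the samples in this document are truncated) and that the sidecar sits next to the file with a `.manifest.json` suffix; the function names and return labels are illustrative.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large source files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_action(file_path):
    """Return 'skip', 'redownload', or 'download' for one L0 entry."""
    file_path = Path(file_path)
    manifest_path = file_path.with_name(file_path.name + ".manifest.json")
    if not manifest_path.exists():
        return "download"                      # no manifest: fetch and create one
    if not file_path.exists():
        return "redownload"                    # manifest without file: file was deleted
    expected = json.loads(manifest_path.read_text(encoding="utf-8"))["l0_hash"]
    return "skip" if sha256_file(file_path) == expected else "redownload"
```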

When Sources Change

Some sources update files in place. NC SBE occasionally reissues precinct result files after canvass corrections. MEDSL publishes revised datasets with the same filename.

When a re-download produces different bytes than the stored l0_hash, the pipeline does not overwrite the existing L0 entry. Instead:

The new file is stored with a versioned name: results_pct_20221108.v2.txt.
A new manifest is created with the new l0_hash and current retrieval_date.
The old file and manifest are retained unchanged.

All L1–L4 records that reference the old l0_hash remain valid. New pipeline runs against the updated file produce new L1–L4 records referencing the new l0_hash. Both versions coexist. The retrieval_date field distinguishes them.
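A small sketch of how the versioned filename might be chosen, assuming the version tag is inserted before the final extension (as in `results_pct_20221108.v2.txt`) and that numbering starts at 2. The function name is hypothetical.

```python
from pathlib import Path


def next_versioned_name(filename, existing_names):
    """Return the first unused '<stem>.vN<ext>' name, starting at v2."""
    path = Path(filename)
    version = 2
    while f"{path.stem}.v{version}{path.suffix}" in existing_names:
        version += 1
    return f"{path.stem}.v{version}{path.suffix}"


print(next_versioned_name("results_pct_20221108.txt", {"results_pct_20221108.txt"}))
# prints results_pct_20221108.v2.txt
```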

The L0 Hash as Root of Trust

The l0_hash is the only value in the pipeline that can be independently verified by anyone with access to the source. Download the file from the URL in the manifest, extract it if the source is an archive (the hash covers the stored, extracted file), compute SHA-256, and compare. If the hashes match, the pipeline processed the same bytes you hold.

Every subsequent hash (l1_hash, l2_hash, l3_hash, l4_hash) incorporates its parent’s hash, so the entire chain is anchored to l0_hash. If someone modifies the raw file, its recomputed SHA-256 no longer equals the l0_parent_hash stored in the L1 records, and the verification algorithm reports a break at the L0→L1 boundary.

In our prototype, all 200 hash chains verified from L4 back to L0 with zero broken links. The verification starts here — at the raw bytes.

What L0 Does Not Do

L0 does not parse, filter, validate, or transform. A TSV file with malformed rows is stored as-is. A CSV file with a leading BOM is stored as-is. A zip archive is decompressed and the contents stored, but the extraction is mechanical: no character encoding conversion, no line-ending normalization, no column reordering.

Data quality issues are L1’s problem. L0’s only job is to preserve the exact bytes that the source published, record where they came from, and make them verifiable.

L1: Cleaned — Deterministic Parsing and Enrichment

L1 transforms raw source files into structured JSONL records with a unified schema. It is purely deterministic: no machine learning, no API calls, no randomness. Given the same L0 files and the same parser version, L1 output is identical on every run, on every machine, forever.

This is deliberate. L1 is the foundation for every subsequent layer. If the foundation is non-deterministic, nothing above it can be reproduced.

One Parser Per Source, One Schema Out

Each source has a dedicated parser that understands its native format:

Source	Format	Delimiter	Encoding	Parser
NC SBE	TSV (`.txt` extension)	`\t`	UTF-8	`nc_sbe_v2.1`
MEDSL	CSV	`,`	UTF-8	`medsl_v1.3`
OpenElections	CSV (varies by state)	`,`	UTF-8/Latin-1	`openelections_v1.0`
Clarity/Scytl	XML	—	UTF-8	`clarity_v0.5`

Every parser produces the same output schema. A downstream consumer of L1 JSONL does not need to know whether a record originated from NC SBE or MEDSL — the fields, types, and semantics are identical.

The 10 Operations

L1 applies 10 operations in fixed order. The order matters — later operations depend on earlier ones.

1. Filter Non-Contests

Before any parsing, detect rows that are not candidate results. Pattern-match on the candidate name field:

Pattern	Classification	Action
`registered voters`	TurnoutMetadata	Extract to `turnout.registered_voters`
`ballots cast`	TurnoutMetadata	Extract to `turnout.ballots_cast`
`over votes`	TurnoutMetadata	Extract to `turnout.over_votes`
`under votes`	TurnoutMetadata	Extract to `turnout.under_votes`
`^blank$`	TurnoutMetadata	Maine’s undervote label
`total votes`	Aggregation artifact	Discard (redundant with candidate sums)
`for` / `against` / `yes` / `no`	BallotMeasure (if contest name matches)	Route to `MeasureChoice`

This runs first because non-contest rows must not enter name decomposition, office classification, or entity resolution. The principle is extract before filter — the registered voter count is valuable turnout data and is captured before the row is excluded from candidate analysis. See Non-Candidate Records.
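A sketch of "extract before filter" for the turnout and aggregation patterns in the table. Rows are assumed to be dicts with hypothetical `candidate` and `votes` keys; mapping Maine's `blank` label to `under_votes` is also an assumption, since the table names the label but not a target field. Ballot-measure routing depends on the contest name and is not handled here.

```python
import re

# (pattern, turnout field) pairs taken from the table above; matching is case-insensitive.
TURNOUT_PATTERNS = [
    (re.compile(r"registered voters", re.I), "registered_voters"),
    (re.compile(r"ballots cast", re.I), "ballots_cast"),
    (re.compile(r"over votes", re.I), "over_votes"),
    (re.compile(r"under votes", re.I), "under_votes"),
    (re.compile(r"^blank$", re.I), "under_votes"),   # assumed target for Maine's label
]
DISCARD_PATTERN = re.compile(r"total votes", re.I)


def split_rows(rows):
    """Separate candidate rows from turnout metadata, capturing turnout values first."""
    candidates, turnout = [], {}
    for row in rows:
        name = row["candidate"].strip()
        field = next((f for pattern, f in TURNOUT_PATTERNS if pattern.search(name)), None)
        if field is not None:
            turnout[field] = turnout.get(field, 0) + row["votes"]   # extract, then exclude
        elif DISCARD_PATTERN.search(name):
            continue                                                # redundant aggregate row
        else:
            candidates.append(row)
    return candidates, turnout
```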

2. Parse Source Format

Source-specific column mapping. The NC SBE parser reads tab-separated fields: County, Election Date, Contest Name, Choice, Choice Party, Total Votes, Election Day, One Stop, Absentee by Mail, Provisional. The MEDSL parser reads CSV columns: state, county_name, office, candidate, party_simplified, votes, mode. Each parser maps its native columns to the unified schema fields.

Encoding normalization happens here. OpenElections files from some states use Latin-1 encoding; the parser detects and converts to UTF-8. MEDSL 2022 has trailing commas in some state files; the parser strips them.

3. Decompose Candidate Names

Split every candidate name into structured components. This is the most critical L1 operation — it determines what signal survives to L2 and L3.

The decomposition handles four source formats:

Format	Example	Parsing strategy
`LAST, FIRST MIDDLE`	`CRIST, CHARLES JOSEPH`	Split on first comma; remainder is first + middle
`First Last`	`Charlie Crist`	Last token is last name (with multi-word last name detection)
`First Middle Last Suffix`	`Robert Van Fletcher, Jr.`	Suffix detected and extracted; remaining tokens parsed
`LAST, FIRST M.`	`BRAY, SHANNON W.`	Period stripped from middle initial

The output for every format is the same six fields:

{
  "raw": "Robert Van Fletcher, Jr.",
  "first": "Robert",
  "middle": "Van",
  "last": "Fletcher",
  "suffix": "Jr.",
  "canonical_first": "Robert"
}

Every component is preserved. Middle initials are kept (they distinguish David S. Marshall from David A. Marshall). Suffixes are kept (they distinguish Robert Williams from Robert Williams Jr.). The raw field is never modified. See Name Normalization.
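A simplified sketch of the decomposition, covering the four formats in the table. It assumes a short suffix list and does not attempt multi-word last name detection for `First Last` input, so treat it as an illustration of the output shape rather than the parser itself; `decompose_name` and `SUFFIXES` are hypothetical names.

```python
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}  # assumed list, not the project's


def _is_suffix(token):
    return token.rstrip(".").lower() in SUFFIXES


def decompose_name(raw):
    """Split a raw candidate name into the six output fields; raw is returned unchanged."""
    text = raw.strip()
    if not text:
        raise ValueError("empty candidate name")
    suffix = None
    if "," in text:
        head, tail = (part.strip() for part in text.split(",", 1))
        if _is_suffix(tail):
            suffix, tokens = tail, head.split()      # "First Middle Last, Jr."
        else:
            tokens = tail.split() + [head]           # "LAST, FIRST MIDDLE"
    else:
        tokens = text.split()
    if suffix is None and len(tokens) > 2 and _is_suffix(tokens[-1]):
        suffix = tokens.pop()                        # "First Last Jr"
    first = tokens[0]
    last = tokens[-1] if len(tokens) > 1 else None
    # Strip the period from single-letter initials only ("W." becomes "W").
    middle_parts = [t[:-1] if len(t) == 2 and t.endswith(".") else t for t in tokens[1:-1]]
    return {
        "raw": raw, "first": first, "middle": " ".join(middle_parts) or None,
        "last": last, "suffix": suffix,
        "canonical_first": first,  # placeholder until the nickname step runs
    }
```

In the comma-first branch the text before the comma is kept as a single token, so a multi-word last name in `LAST, FIRST` input stays intact.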

4. Apply Nickname Dictionary

Look up the first field in the nickname dictionary (~100 mappings in prototype, targeting 500+). If a mapping exists, populate canonical_first with the formal equivalent. If not, canonical_first equals first.

`first`	`canonical_first`	Mapping
Charlie	Charles	Charlie → Charles
Ron	Ronald	Ron → Ronald
Nikki	Nicole	Nikki → Nicole
Timothy	Timothy	No mapping (already formal)

Both fields are preserved. The composite string sent to L2 uses canonical_first; the original first is retained for display and provenance. See The Nickname Dictionary.
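The lookup itself is a dictionary access with a fallback. The sketch below uses only the mappings named in this document and assumes a case-insensitive key; the function name is hypothetical.

```python
# Mappings mentioned in this document; the real dictionary is larger.
NICKNAMES = {"charlie": "Charles", "ron": "Ronald", "nikki": "Nicole", "bill": "William"}


def apply_nickname(name):
    """Return a copy of the name with canonical_first set; `first` is left untouched."""
    first = name["first"]
    return {**name, "canonical_first": NICKNAMES.get(first.lower(), first)}


print(apply_nickname({"first": "Charlie", "last": "Crist"}))
```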

5. Classify Contest Kind

Route each record to one of three contest kinds based on signals from steps 1 and 2:

Kind	Criteria	Example
`candidate_race`	Default — a person running for office	Timothy Lance for Board of Education
`ballot_measure`	Candidate name is For/Against/Yes/No AND contest name matches measure keywords	“Against” in “BOND REFERENDUM”
`turnout_metadata`	Candidate name matches turnout patterns	“Registered Voters”

Records classified as ballot_measure get a MeasureChoice result instead of CandidateResult. Records classified as turnout_metadata are extracted and attached to sibling precinct records.
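The routing rules in the table can be sketched as a small function. The keyword and turnout lists below are illustrative assumptions, not the project's actual pattern sets, and `classify_contest_kind` is a hypothetical name.

```python
MEASURE_CHOICES = {"for", "against", "yes", "no"}
# Assumed keyword and turnout lists, for illustration only.
MEASURE_KEYWORDS = ("referendum", "bond", "amendment", "proposition")
TURNOUT_NAMES = {"registered voters", "ballots cast"}


def classify_contest_kind(candidate: str, contest: str) -> str:
    name = candidate.strip().lower()
    if name in TURNOUT_NAMES:
        return "turnout_metadata"
    # Both conditions must hold: a choice-like name AND a measure-like contest.
    if name in MEASURE_CHOICES and any(k in contest.lower() for k in MEASURE_KEYWORDS):
        return "ballot_measure"
    return "candidate_race"  # the default


print(classify_contest_kind("Against", "BOND REFERENDUM"))
```

Requiring both signals keeps a real candidate whose surname happens to be "No" or "For" from being routed away unless the contest name also looks like a measure.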

6. Classify Office (Tiers 1–2)

Apply the deterministic tiers of the office classifier:

Tier 1: Keyword lookup (~170 entries). Case-insensitive substring match. "board of education" in the contest name → school_district/education. Handles ~45% of unique office names, ~85% of records by volume.

Tier 2: Regex patterns (~40 patterns). county\s+commission → county/legislative. Adds ~17% of unique names.

Records that match neither tier are classified as other with classifier_confidence: 0.0. They proceed to L2 for tier 3 (embedding nearest-neighbor); names still unclassified after tier 3 go to tier 4 (LLM classification) at L3.

The classifier_method field records which tier produced the classification: "keyword", "regex", or "unclassified".
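A minimal sketch of the two deterministic tiers, using one keyword and one regex from the text. The table contents, the `classify_office` name, and the omission of a confidence value for matched records are assumptions.

```python
import re

# One illustrative entry per tier; the real tables hold ~170 and ~40 entries.
KEYWORDS = {"board of education": ("school_district", "education")}
PATTERNS = [(re.compile(r"county\s+commission", re.IGNORECASE), ("county", "legislative"))]


def classify_office(contest_name: str) -> dict:
    lowered = contest_name.lower()
    # Tier 1: case-insensitive substring match.
    for keyword, (level, branch) in KEYWORDS.items():
        if keyword in lowered:
            return {"office_level": level, "office_branch": branch,
                    "classifier_method": "keyword"}
    # Tier 2: regex patterns.
    for pattern, (level, branch) in PATTERNS:
        if pattern.search(contest_name):
            return {"office_level": level, "office_branch": branch,
                    "classifier_method": "regex"}
    # Neither tier matched: left for the later tiers.
    return {"office_level": "other", "office_branch": None,
            "classifier_method": "unclassified", "classifier_confidence": 0.0}


print(classify_office("Lowndes County Bd of Ed"))
```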

7. Enrich Geography

Look up FIPS codes from bundled Census Bureau reference data:

State FIPS: 2-digit code from state abbreviation. NC → 37.
County FIPS: 5-digit code from (state, county name). (NC, COLUMBUS) → 37047.
Place FIPS: Where available, municipal codes from Census place files.
OCD-ID: Open Civic Data identifier. ocd-division/country:us/state:nc/county:columbus.

The reference data covers 3,143 counties and 31,980 places. FIPS enrichment achieves 100% county coverage for records with valid state and county name fields. Municipal FIPS coverage is lower (~85%) because municipality names are less standardized.

8. Compute Vote Shares

For each candidate in a contest within a precinct:

vote_share = votes_total / sum(votes_total for all candidates in same contest+precinct)

Vote share is a convenience field — it can always be recomputed from the raw vote counts. It is included because downstream queries (margins, competitiveness rankings) use it constantly.
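A sketch of the computation, assuming flat rows keyed by hypothetical `contest` and `precinct` fields. The opponent's 276 votes are invented for illustration; only the 303 comes from the record shown later in this chapter.

```python
from collections import defaultdict


def add_vote_shares(rows: list) -> list:
    """Attach vote_share to each row, per (contest, precinct) group."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["contest"], row["precinct"])] += row["votes_total"]
    for row in rows:
        total = totals[(row["contest"], row["precinct"])]
        # Avoid division by zero when nobody received a vote.
        row["vote_share"] = row["votes_total"] / total if total else None
    return rows


rows = add_vote_shares([
    {"contest": "BOE D02", "precinct": "P17", "candidate": "Timothy Lance", "votes_total": 303},
    {"contest": "BOE D02", "precinct": "P17", "candidate": "Opponent", "votes_total": 276},
])
print(round(rows[0]["vote_share"], 3))
```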

9. Backfill Turnout

If step 1 extracted turnout metadata rows for a precinct, attach the values to all sibling contest records in the same precinct:

{
  "turnout": {
    "registered_voters": 4217,
    "ballots_cast": 2891,
    "turnout_rate": 0.6855
  }
}

Turnout data is available in NC SBE and some OpenElections files. It is absent from MEDSL and Clarity. When absent, the turnout field is null — not zero, not omitted, but explicitly null to distinguish “no data” from “zero registered voters.”

10. Compute L1 Hash

The final operation seals the record into the hash chain:

l1_hash = SHA-256( serialize(record_without_hash) + "parent:" + l0_hash )

The l0_hash comes from the L0 manifest of the source file. The l1_hash becomes the anchor for L2. See Provenance and the Hash Chain.
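The formula can be sketched with the standard library. The serialization is an assumption (canonical JSON with sorted keys), as is the location of `l1_hash` under `provenance`. The hashes shown in this chapter are 16 hex characters, so some truncation presumably applies, but it is not specified here; this sketch returns the full digest.

```python
import hashlib
import json


def compute_l1_hash(record: dict, l0_hash: str) -> str:
    """SHA-256 over the record (minus its own hash) plus the parent hash."""
    body = dict(record)
    provenance = dict(body.get("provenance", {}))
    provenance.pop("l1_hash", None)  # the hash cannot cover itself
    body["provenance"] = provenance
    # Sorted keys and fixed separators make the serialization order-independent.
    serialized = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    payload = serialized + "parent:" + l0_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


record = {"results": [{"votes_total": 303}], "provenance": {"l1_hash": "stale"}}
print(compute_l1_hash(record, "edfedf2760cfd54f"))
```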

A Real L1 Record

Timothy Lance, precinct P17, Columbus County Schools Board of Education District 02, 2022 NC general election:

{
  "election": {"date": "2022-11-08", "type": "general"},
  "jurisdiction": {
    "state": "NC", "state_fips": "37",
    "county": "COLUMBUS", "county_fips": "37047",
    "precinct": "P17", "level": "precinct"
  },
  "contest": {
    "kind": "candidate_race",
    "raw_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
    "office_level": "school_district",
    "classifier_method": "regex",
    "classifier_confidence": 0.85,
    "vote_for": 1
  },
  "results": [
    {
      "candidate_name": {
        "raw": "Timothy Lance", "first": "Timothy",
        "middle": null, "last": "Lance", "suffix": null,
        "canonical_first": "Timothy"
      },
      "votes_total": 303,
      "vote_share": 0.523,
      "vote_counts_by_type": {
        "election_day": 136, "early": 159,
        "absentee_mail": 7, "provisional": 1
      }
    }
  ],
  "turnout": {
    "registered_voters": 4217,
    "ballots_cast": 2891,
    "turnout_rate": 0.6855
  },
  "source": {
    "source_type": "nc_sbe",
    "source_file": "results_pct_20221108.txt",
    "confidence": "high"
  },
  "provenance": {
    "l1_hash": "8ea7ecc257ff8e05",
    "l0_parent_hash": "edfedf2760cfd54f",
    "parser_version": "nc_sbe_v2.1",
    "schema_version": "3.0.0"
  }
}

Every field traces to a specific operation: county_fips from step 7, canonical_first from step 4, office_level from step 6, turnout from step 9, l1_hash from step 10.

What L1 Does Not Do

No embeddings. Embedding generation requires an API call to OpenAI. L1 runs offline with zero external dependencies.
No entity resolution. L1 does not determine whether two records refer to the same person. That is L3’s job.
No canonical name selection. L1 preserves all name components. Choosing the “best” name is L4’s job, after entity resolution.
No tier 3/4 office classification. Embedding-based and LLM-based classification require API calls. L1 applies only the deterministic tiers (keyword and regex). Records that need tiers 3–4 are marked "classifier_method": "unclassified" and passed on: tier 3 runs at L2, and names still unclassified after that go to tier 4 at L3.

This boundary is the determinism boundary. Everything L1 does can be verified by re-running the parser on the same L0 files. No API key, no network connection, no randomness.

L2: Embedded — Vector Generation and Classification

L2 transforms L1’s structured text fields into vector embeddings suitable for fuzzy matching, applies tier 3 office classification, and raises quality flags on suspicious records. It is the bridge between deterministic parsing (L1) and probabilistic entity resolution (L3).

Embedding Model

The embedding model is OpenAI’s text-embedding-3-large, producing 3,072-dimensional float32 vectors. Every L2 record stores the model identifier and dimensionality:

{
  "embedding_model": "text-embedding-3-large",
  "embedding_dimensions": 3072
}

This metadata is not optional. Thresholds calibrated for text-embedding-3-large (auto-accept ≥ 0.95, ambiguous 0.35–0.95, auto-reject < 0.35) are not portable to other models. If the model changes, the thresholds must be recalibrated against the test cases. Storing the model in every record ensures that stale thresholds are never applied to vectors from a different model.

Composite String Templates

Raw name components are not embedded directly. They are assembled into composite strings that include contextual fields — office, state, county, party — so that the resulting vectors encode identity-relevant context alongside the name.

Three composite types are generated per record:

Type	Template	Example
Candidate	`{canonical_first} {middle} {last} {suffix} | {party} | {office} | {state} | {county}`	`Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus`
Contest	`{raw_name} | {office_level} | {state} {year}`	`COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 | school_district | NC 2022`
Geography	`{municipality}, {county} County, {state}`	`Whiteville, Columbus County, NC`

Middle initials and suffixes are included deliberately. “David S Marshall | ME” and “David A Marshall | FL” produce different vectors — measured at cosine 0.6448 with middle initials versus 0.7025 without. That 0.058 gap is the difference between correct separation and a false merge. See Composite String Templates for the full rationale, including the “context bleed” problem where shared geographic context artificially inflates similarity between unrelated candidates.

Empty components (null middle, null suffix) produce empty slots in the template rather than being omitted. This keeps the template structure consistent, which stabilizes the embedding model’s tokenization.
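A sketch of the candidate template that reproduces the example string from the table. The function name is hypothetical. How whitespace is handled for null name components inside the first slot is not visible in the example, so collapsing it here is an assumption; the pipe count is kept constant either way.

```python
def candidate_composite(name: dict, party, office: str, state: str, county: str) -> str:
    """Assemble the pipe-delimited candidate composite string."""
    name_part = " ".join(
        part for part in (name.get("canonical_first"), name.get("middle"),
                          name.get("last"), name.get("suffix")) if part
    )
    tokens = []
    for index, slot in enumerate([name_part, party, office, state, county]):
        if index:
            tokens.append("|")  # every separator is kept, even around empty slots
        if slot:
            tokens.append(slot)
    return " ".join(tokens)


lance = {"canonical_first": "Timothy", "middle": None, "last": "Lance", "suffix": None}
print(candidate_composite(lance, None, "BOARD OF EDUCATION DISTRICT 02", "NC", "Columbus"))
```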

FAISS Indices

Embeddings are stored in partitioned FAISS indices, one per (state, year) combination. Partitioning serves two purposes:

Blocking alignment. Entity resolution at L3 blocks by (state, office_level, last_name_initial). State-level FAISS partitions ensure that nearest-neighbor queries never cross state boundaries — a candidate in NC is never compared to a candidate in FL during retrieval.
Memory management. A single national index for 42 million candidate embeddings would need 42M × 3,072 dimensions × 4 bytes ≈ 516 GB of float32 data. Per-state-year partitions fit in memory on commodity hardware: NC 2022 (~200K records × 3,072 dims × 4 bytes) is approximately 2.5 GB (2.3 GiB).

Index type is IndexFlatIP (inner product on L2-normalized vectors, equivalent to cosine similarity). No approximate search — exact cosine is computed for every candidate pair within a block. At partition scale, exact search is fast enough (sub-second for 200K vectors) and avoids the recall loss of approximate methods like IVF or HNSW.
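The sizing arithmetic and the "inner product on normalized vectors equals cosine" claim can both be checked in plain Python. This is a stand-in for illustration and does not call FAISS; the helper names are hypothetical.

```python
import math

DIMS = 3072
BYTES_PER_FLOAT32 = 4


def index_bytes(n_vectors: int, dims: int = DIMS) -> int:
    """Raw float32 storage for a flat index (no per-index overhead counted)."""
    return n_vectors * dims * BYTES_PER_FLOAT32


def l2_normalize(vec: list) -> list:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))


national = index_bytes(42_000_000)
nc_2022 = index_bytes(200_000)
print(f"national: {national / 1e9:.0f} GB")
print(f"NC 2022: {nc_2022 / 1e9:.2f} GB ({nc_2022 / 2**30:.2f} GiB)")

# After L2 normalization, the inner product is the cosine similarity.
a, b = [3.0, 4.0], [4.0, 3.0]
print(inner_product(l2_normalize(a), l2_normalize(b)))
```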

Tier 3 Office Classification

Records that were not classified by L1’s keyword (tier 1) or regex (tier 2) classifiers are embedded and compared against a reference set of ~200 pre-classified office names.

The reference set is a curated list covering every (office_level, office_branch) pair with at least 3 examples. Each reference entry has a pre-computed embedding. For an unclassified office name, L2 computes its embedding, finds the nearest reference neighbor by cosine similarity, and assigns the reference’s classification if the score exceeds 0.60.

Real tier 3 results:

Unclassified Name	Nearest Reference	Cosine	Assigned Classification
Collier Mosquito Control District	Mosquito Control District	0.787	special_district / infrastructure
Eastern Carrituck Fire & Rescue	Fire Protection District	0.724	special_district / infrastructure
Lowndes County Bd of Ed	Board of Education	0.831	school_district / education

Names scoring below 0.60 are left as other at L2 and passed to tier 4 (LLM) at L3. Tier 3 classifies approximately 4.5% of the unique office names that survived tiers 1 and 2, with 94% accuracy against manual review.

The classification result is written back into the L1-inherited fields on the enriched L2 record, updating classifier_method to "embedding_nn" and classifier_confidence to the cosine score.
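A sketch of the tier 3 lookup on toy 3-dimensional vectors (real embeddings have 3,072 dimensions). The function names and the reference tuple layout are hypothetical, and treating a score of exactly 0.60 as a match is an assumption, since the text only distinguishes "exceeds" from "below".

```python
import math


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def classify_by_nearest_reference(query: list, references: list, threshold: float = 0.60) -> dict:
    """references: list of (office_level, office_branch, embedding) tuples."""
    best = max(references, key=lambda ref: cosine(query, ref[2]))
    score = cosine(query, best[2])
    if score >= threshold:
        return {"office_level": best[0], "office_branch": best[1],
                "classifier_method": "embedding_nn", "classifier_confidence": score}
    # Below threshold: stays "other" and is left for tier 4.
    return {"office_level": "other", "office_branch": None,
            "classifier_method": "unclassified", "classifier_confidence": 0.0}


# Toy reference set, not real embeddings.
REFERENCES = [
    ("school_district", "education", [1.0, 0.0, 0.0]),
    ("special_district", "infrastructure", [0.0, 1.0, 0.0]),
]
print(classify_by_nearest_reference([0.9, 0.2, 0.1], REFERENCES))
```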

Quality Flags

L2 raises flags on records with characteristics that may cause downstream problems. Flags do not block processing — they annotate records for review at L4.

Flag	Condition	Example
`short_name`	Candidate name has ≤ 2 characters after decomposition	`"J. D."` with no last name parsed
`common_name_risk`	First + last name appears 50+ times nationally	`John Smith`, `Robert Johnson`
`missing_office_level`	Office is still `other` after tiers 1–3	`Santa Rosa Island Authority` (pre-tier-4)
`zero_votes`	`votes_total` is 0	Write-in candidates with no votes
`high_vote_share`	Single candidate has > 99% of votes in a contested race	Possible data error or unopposed misclassification

In our prototype, 12 of 200 records received at least one quality flag. The most common was zero_votes (write-in placeholders), followed by common_name_risk.
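The flag conditions can be sketched as one function over a single result. The signature is hypothetical, the national name count and the contested flag are assumed to be supplied by the caller, and counting letters across first, middle, and last is one possible reading of "≤ 2 characters after decomposition".

```python
def quality_flags(result: dict, office_level: str,
                  national_name_count: int, contested: bool) -> list:
    """Return the list of quality flags raised for one candidate result."""
    name = result["candidate_name"]
    letters = sum(ch.isalpha()
                  for part in ("first", "middle", "last")
                  for ch in (name.get(part) or ""))
    flags = []
    if letters <= 2:
        flags.append("short_name")
    if national_name_count >= 50:
        flags.append("common_name_risk")
    if office_level == "other":
        flags.append("missing_office_level")
    if result["votes_total"] == 0:
        flags.append("zero_votes")
    share = result.get("vote_share")
    if contested and share is not None and share > 0.99:
        flags.append("high_vote_share")
    return flags


lance = {"candidate_name": {"first": "Timothy", "middle": None, "last": "Lance"},
         "votes_total": 303, "vote_share": 0.523}
print(quality_flags(lance, "school_district", 1, True))
```

Flags only annotate; nothing in this function rejects or alters the record.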

Output Format

L2 produces two types of output per (state, year) partition:

Enriched JSONL — L1 records augmented with an l2 block:

{
  "...all L1 fields...",
  "l2": {
    "l2_hash": "854fa6367960bb05",
    "l1_parent_hash": "8ea7ecc257ff8e05",
    "embedding_model": "text-embedding-3-large",
    "embedding_dimensions": 3072,
    "candidate_embedding_id": 4271,
    "contest_embedding_id": 183,
    "candidate_composite": "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus",
    "contest_composite": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 | school_district | NC 2022",
    "quality_flags": []
  }
}

Binary sidecars — .npy files containing float32 arrays of embeddings, plus a JSON ID mapping:

l2_embedded/nc/2022/
├── enriched.jsonl                  # One record per line, all L1 + L2 fields
├── candidate_embeddings.npy        # float32[N, 3072]
├── contest_embeddings.npy          # float32[M, 3072]
├── geography_embeddings.npy        # float32[K, 3072]
└── id_mapping.json                 # l1_hash → embedding row index

Embeddings are stored separately from JSONL to keep the text records streamable. A 3,072-dimensional float32 vector is 12,288 bytes; encoded as base64 inside JSON it would add 16,384 characters per vector, several times the size of the text record itself. The .npy format is readable by NumPy, PyTorch, and any tool that understands the NumPy array file specification.

The candidate_embedding_id in the JSONL record is an integer index into candidate_embeddings.npy. To retrieve Timothy Lance’s embedding: load the .npy file, index row 4271.
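The per-vector sizes can be checked with the standard library alone. This sketch packs a zero vector as little-endian float32 and measures its raw and base64 lengths; it does not read a real .npy file.

```python
import base64
import struct

DIMS = 3072

# Pack a placeholder vector as little-endian float32 values.
vector = [0.0] * DIMS
raw = struct.pack(f"<{DIMS}f", *vector)
encoded = base64.b64encode(raw)

print(len(raw), "bytes raw")
print(len(encoded), "characters as base64")
```

Base64 emits four characters for every three bytes, which is where the one-third overhead on top of the raw size comes from.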

Determinism

L2 is deterministic given the same L1 input, the same embedding model version, and the same office reference set. The composite string templates are fixed. The FAISS index construction is deterministic (flat index, no random initialization). The tier 3 nearest-neighbor search is exact.

If OpenAI updates the weights behind text-embedding-3-large without changing the model name, the vectors change silently. The embedding_model field cannot detect this — it records the API model name, not an internal version hash. In practice, OpenAI has not changed embedding model weights after release. If they do, a full L2 re-run and threshold recalibration is required.

Dependencies

L2 requires an OpenAI API key for embedding generation. L0 and L1 do not — they run entirely offline. This is the first layer that requires network access.

At prototype scale (200 records), L2 embedding generation takes approximately 3 seconds and costs less than $0.01. At production scale (42 million rows), the cost is approximately $300 and the wall-clock time depends on API throughput (typically 3,000 embeddings per minute with batching, yielding ~10 days for the full corpus). Embeddings are computed once per L1 record and cached — re-running L3 or L4 does not re-invoke the embedding API.

L3: Matched — Entity Resolution and LLM Confirmation

L3 is the first non-deterministic layer. It resolves entities — determining which records across sources, precincts, and elections refer to the same candidate and the same contest. Every decision is stored in a JSONL audit log with full prompt, response, and reasoning, enabling deterministic replay even though the underlying LLM calls are non-deterministic.

Input and Output

Input: L2 enriched JSONL records with embeddings, composite strings, and quality flags.

Output:

Enriched JSONL with candidate_entity_id and contest_entity_id assignments.
A decision log (candidate_matches.jsonl) recording every comparison made and its outcome.

Blocking

Before pairwise comparison begins, records are partitioned into blocks by (state, office_level, last_name_initial). Only pairs within the same block are compared. A candidate for NC school board is never compared to a candidate for FL sheriff.

This reduces the comparison space by approximately four orders of magnitude. The blocking key is deliberately coarse — we accept some noise within blocks (two unrelated people whose last names start with the same letter, in the same state, at the same office level) in exchange for never missing a legitimate match. The step 2.5 gate handles within-block noise cheaply.
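A minimal sketch of blocking, assuming flat records with hypothetical `state`, `office_level`, and `last` fields. Pairs are generated only inside each block.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> tuple:
    return (record["state"], record["office_level"], record["last"][0].upper())


def candidate_pairs(records: list):
    """Yield every within-block pair; cross-block pairs are never produced."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


records = [
    {"state": "NC", "office_level": "school_district", "last": "Bridges"},
    {"state": "NC", "office_level": "school_district", "last": "Blanton"},
    {"state": "FL", "office_level": "county", "last": "Brown"},
]
for a, b in candidate_pairs(records):
    print(a["last"], "vs", b["last"])
```

Three records would give three pairs without blocking; here only the two NC school board candidates whose last names start with B are compared.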

The Five-Step Cascade

Step	Method	Prototype result	Cost per pair
1	Exact match on `(canonical_first, last, suffix)`	597 (70.0%)	negligible
2	Jaro-Winkler ≥ 0.92 on full name	1 (0.1%)	microseconds
2.5	Name gate: JW on last name < 0.50 → skip	— (gate)	microseconds
3	Embedding cosine ≥ 0.95 AND same state → auto-accept	50 (5.9%)	pre-computed
4	LLM confirmation: cosine 0.35–0.95	30 (3.5%)	~$0.0002/call
5	Tiebreaker: stronger model when step 4 confidence < 0.70	0 (rare)	~$0.002/call

Percentages are from the 200-record Columbus County NC prototype. 206 unique candidate entities were created.

Step 1: Exact Match

The match key is (canonical_first, last, suffix) within a (state, office_level) block. Timothy Lance appears in 47 precinct rows — all 47 share the same key and collapse to one entity. No fuzzy logic, no API calls.

This step handles the overwhelmingly common case: the same candidate appearing identically across precincts within a single source.

Step 2: Jaro-Winkler (≥ 0.92)

Catches minor spelling variations that survive L1 parsing — Mcdonough vs McDonough, transposition errors, inconsistent hyphenation. The threshold of 0.92 is strict to avoid false positives on common surnames.

In the prototype, step 2 resolved 1 additional candidate. Most formatting differences are already normalized at L1.

Step 2.5: The Name Similarity Gate

Before computing embedding similarity, check last-name Jaro-Winkler. If below 0.50, skip the pair entirely.

This gate was added after a prototype finding. The original cascade had no step 2.5, and all 30 LLM calls were spent on pairs like “Aaron Bridges” vs “Daniel Blanton”: candidates in the same (NC, school_district, B) block with completely different names. Every call correctly returned no-match, but each cost an API round-trip. The gate eliminates these obvious non-matches before they reach embedding or LLM steps.

At scale, with millions of within-block pairs, this gate prevents orders-of-magnitude waste in downstream steps.
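A sketch of the gate using a textbook Jaro-Winkler implementation (prefix scale 0.1, prefix capped at four characters). Upper-casing before comparison is an assumption, and scores from this sketch need not equal the values in the project's decision log, which depend on the implementation actually used.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that line up in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def passes_name_gate(last_a: str, last_b: str, threshold: float = 0.50) -> bool:
    """True if the pair should continue to the embedding and LLM steps."""
    return jaro_winkler(last_a.upper(), last_b.upper()) >= threshold


print(passes_name_gate("Crist", "CRIST"), passes_name_gate("Bridges", "Blanton"))
```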

Step 3: Embedding Auto-Accept (≥ 0.95)

For pairs that pass the gate but did not exact-match, retrieve pre-computed L2 cosine similarity. If ≥ 0.95 AND both candidates are in the same state, auto-accept.

The 0.95 threshold is deliberately high. Robert Williams Jr scored 0.862 against Robert Williams — a false positive under the original 0.82 threshold. At 0.95, only near-identical strings with trivial formatting differences pass. Barbara Sharief at 0.955 is an example that auto-accepts: the only difference is a middle initial J added in one source.

A secondary acceptance rule handles the band just below 0.95: embedding ≥ 0.90 AND JW on full name ≥ 0.92 AND same state → accept. This catches Ashley Moody (0.930 cosine) without requiring an LLM call.

Step 4: LLM Confirmation (0.35–0.95)

Pairs in the ambiguous zone are sent to Claude Sonnet with structured context: both candidates’ parsed name components, vote counts, office, state, party, and the embedding score. The LLM returns a decision (match/no-match), confidence (0.0–1.0), and free-text reasoning.

The ambiguous zone is wide (0.35–0.95) by design. Budget is not a constraint. The zone was widened from the original 0.65–0.82 after two findings:

Charlie Crist at 0.451 — a true match that the old 0.65 reject threshold would have discarded.
Robert Williams Jr at 0.862 — a false positive that the old 0.82 accept threshold would have merged.

The wider zone sends more pairs to the LLM in exchange for zero threshold-induced errors in the tested range.

Step 5: Tiebreaker

When step 4 returns confidence below 0.70, the pair escalates to an Opus-class model. This handles unusual nicknames, slight vote-count discrepancies, and geographic ambiguity that Sonnet finds uncertain. Step 5 was not triggered in the 200-record prototype; it exists for the long tail of ambiguity at production scale.
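The cascade order can be summarized as a routing function over precomputed scores. This is a sketch: the function names and return labels are hypothetical, the scores are taken as inputs rather than computed, and the secondary acceptance rule is left out because its interaction with step 2 is not spelled out here.

```python
def route_pair(exact: bool, jw_full: float, jw_last: float,
               cosine: float, same_state: bool) -> str:
    """Return the cascade step that decides (or must decide) this pair."""
    if exact:
        return "exact"                  # step 1
    if jw_full >= 0.92:
        return "jaro_winkler"           # step 2
    if jw_last < 0.50:
        return "gate_reject"            # step 2.5
    if cosine >= 0.95 and same_state:
        return "embedding_auto_accept"  # step 3
    if cosine >= 0.35:
        return "llm"                    # step 4
    return "auto_reject"


def needs_tiebreaker(llm_confidence: float) -> bool:
    return llm_confidence < 0.70        # step 5


# Charlie Crist pair: cosine 0.451 and last-name JW 1.0 are from the text;
# the full-name JW of 0.60 is a made-up input for illustration.
print(route_pair(False, 0.60, 1.0, 0.451, True))
```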

The Decision Log

Every comparison — not just LLM calls — is recorded in a JSONL audit log at l3_matched/{state}/{year}/decisions/candidate_matches.jsonl. One record per pair examined.

An LLM-decided entry:

{
  "decision_id": "a3f8c1d2-4e7b-4a1f-9c3d-8f2e1a6b5c4d",
  "decision_type": "candidate_match",
  "timestamp": "2026-03-19T10:30:00Z",
  "inputs": {
    "name_a": "Charlie Crist",
    "name_b": "CRIST, CHARLES JOSEPH",
    "embedding_score": 0.451,
    "jw_last_name": 1.0,
    "state_a": "FL", "state_b": "FL",
    "contest_a": "Governor", "contest_b": "Governor",
    "votes_a": 3101652, "votes_b": 3101652
  },
  "method": {
    "type": "llm",
    "model": "claude-sonnet-4-20250514",
    "prompt_template_version": "entity_match_v2.0"
  },
  "output": {
    "decision": "match",
    "confidence": 0.95,
    "reasoning": "Charlie is a common nickname for Charles. Same state, same office, identical vote counts."
  }
}

An exact-match entry is simpler:

{
  "decision_id": "b7c2e4f1-...",
  "decision_type": "candidate_match",
  "timestamp": "2026-03-19T10:30:01Z",
  "inputs": {
    "name_a": "Timothy Lance",
    "name_b": "Timothy Lance",
    "state_a": "NC", "state_b": "NC"
  },
  "method": {
    "type": "exact",
    "model": null,
    "prompt_template_version": null
  },
  "output": {
    "decision": "match",
    "confidence": 1.0,
    "reasoning": "Exact match on (canonical_first=Timothy, last=Lance, suffix=null)"
  }
}

A gate-rejected entry:

{
  "decision_id": "c9d3a5e2-...",
  "decision_type": "candidate_match",
  "timestamp": "2026-03-19T10:30:02Z",
  "inputs": {
    "name_a": "Aaron Bridges",
    "name_b": "Daniel Blanton",
    "jw_last_name": 0.40,
    "state_a": "NC", "state_b": "NC"
  },
  "method": {
    "type": "gate_reject",
    "model": null,
    "prompt_template_version": null
  },
  "output": {
    "decision": "no_match",
    "confidence": 1.0,
    "reasoning": "Last-name JW 0.40 below gate threshold 0.50; skipped."
  }
}

L3 Record Output

Each L1/L2 record is augmented with entity assignments:

{
  "...all L1 and L2 fields...",
  "l3": {
    "l3_hash": "28183d41d50204d5",
    "l2_parent_hash": "854fa6367960bb05",
    "candidate_entity_ids": [
      {"result_index": 0, "entity_id": "person:nc:columbus:lance-timothy-13"}
    ],
    "contest_entity_id": "contest:nc:columbus:school-board-d02"
  }
}

The entity_id format encodes scope: person:{state}:{county}:{last}-{first}-{sequence}. The sequence number disambiguates within a name — necessary when two genuinely different people share the same canonical first and last name in the same county.

Contest entity IDs follow a parallel scheme: contest:{state}:{county}:{office-slug}.

Reproducibility

L3 is non-deterministic because LLM responses may vary between runs. Two strategies make it reproducible in practice:

Replay from log. The decision log contains every match decision with its inputs and outputs. Re-running L3 in replay mode reads decisions from the log instead of calling the LLM. This produces identical L3 output — deterministic given the logged decisions.

Re-run with audit. Re-running L3 with live LLM calls produces a new decision log. Diffing the two logs reveals any decisions where the LLM changed its mind. In testing, decision stability is high: the same pair with the same context produces the same match/no-match outcome in >99% of re-runs. Confidence scores may vary by ±0.05.

For published results, the decision log is the canonical record. The LLM is a tool that produced the decisions; the decisions themselves are the data.
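Replay mode can be sketched as a lookup that consults the log before any live call. Keying decisions by the unordered name pair is a simplifying assumption (a real key would also need state, contest, and election), and `load_decisions` and `resolve` are hypothetical names.

```python
import json


def load_decisions(lines) -> dict:
    """Index logged decisions by the unordered pair of names."""
    decisions = {}
    for line in lines:
        entry = json.loads(line)
        key = frozenset((entry["inputs"]["name_a"], entry["inputs"]["name_b"]))
        decisions[key] = entry["output"]
    return decisions


def resolve(name_a: str, name_b: str, decisions: dict, call_llm=None) -> dict:
    key = frozenset((name_a, name_b))
    if key in decisions:
        return decisions[key]           # replay: no API call
    if call_llm is None:
        raise KeyError(f"no logged decision for {name_a!r} vs {name_b!r}")
    return call_llm(name_a, name_b)     # live mode only


log_line = json.dumps({
    "inputs": {"name_a": "Charlie Crist", "name_b": "CRIST, CHARLES JOSEPH"},
    "output": {"decision": "match", "confidence": 0.95},
})
decisions = load_decisions([log_line])
print(resolve("CRIST, CHARLES JOSEPH", "Charlie Crist", decisions))
```

Raising on a missing entry in replay mode makes gaps in the log visible instead of silently falling back to a live call.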

The 30 Wasted Calls

The prototype’s most actionable finding: all 30 LLM calls were wasted. Every one compared candidates with obviously different names — “Aaron Bridges” vs “Daniel Blanton”, “Timothy Lance” vs “Jessica Moore” — that happened to share a blocking key. The embedding scores ranged from 0.55 to 0.73, placing them in the ambiguous zone. The LLM correctly rejected all 30 with high confidence.

The root cause was coarse blocking without a name-similarity pre-filter. The fix — step 2.5, requiring JW ≥ 0.50 on last names before proceeding — would have eliminated all 30 calls. At production scale, this gate is the difference between thousands of useful LLM calls and millions of wasted ones.

Budget and the Ambiguous Zone

Budget is not a constraint for this project. This changes the threshold calculus:

Decision	Budget-constrained approach	Our approach
Ambiguous zone width	Narrow (0.65–0.82) to minimize LLM calls	Wide (0.35–0.95) to maximize accuracy
Step 5 model	Same as step 4 (cheaper)	Opus-class (more capable)
Audit coverage	Sample-based	Every multi-member entity audited at L4

The wider ambiguous zone means ~25% of within-block pairs reach the LLM, up from ~5% with the old thresholds. The step 2.5 gate keeps the absolute call volume manageable by rejecting pairs with dissimilar last names before they enter the zone.

The cascade still exists despite unlimited budget. Sending every pair to the LLM would take weeks of API calls at 42 million rows — cost is irrelevant when wall-clock time is the bottleneck. And deterministic steps are preferred not because they are cheaper, but because they are reproducible and do not hallucinate.

Cross-References

Entity Resolution Overview — the problem and why each step exists
The Cascade: Step by Step — detailed walkthrough with real examples at every step
Real Test Cases — all tested pairs with scores and decisions
Threshold Calibration — old vs. new thresholds
When the LLM Gets Called — invocation policy across the pipeline
Budget Is Not a Constraint — what unlimited budget changes and what it does not

L4: Canonical — Authoritative Names and Verification

L4 is the final layer. It consumes L3’s entity assignments and produces the researcher-facing outputs: canonical names, temporal chains across elections, alias tables, and the results of six verification algorithms. L4 is deterministic given the same L3 input — no LLM calls are made during construction (though the LLM entity audit is part of verification).

Canonical Name Selection

Each candidate entity has multiple name variants collected from across sources and precincts. L4 selects one canonical name using a fixed algorithm:

Collect all variants. For entity person:nc:columbus:lance-timothy-13, the variants might be Timothy Lance (NC SBE), TIMOTHY LANCE (MEDSL), and Lance, Timothy (OpenElections).
Prefer the most complete. A variant with a middle initial beats one without. A variant with a suffix beats one without. SHANNON W BRAY beats SHANNON BRAY. Robert Williams Jr beats Robert Williams (when they are the same entity — which is rare, since Jr usually indicates a different person).
Among equally complete, prefer the most authoritative source. Source authority ranking:
- Certified state data (NC SBE) — highest
- Academic curated data (MEDSL) — second
- Community-curated data (OpenElections) — third
- Election night reporting (Clarity/Scytl) — lowest
Among equally authoritative, prefer the most recent. A 2022 record beats a 2018 record for the same entity.

The selected canonical name is a presentation choice, not an analytical input. By the time L4 runs, entity resolution is complete — the identity question is settled at L3. L4 is choosing a label for a known entity.
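The three preference rules map naturally onto a sort key. In this sketch only `nc_sbe` is a source identifier that appears in this chapter; the other identifiers, the variant fields, and the function names are assumptions.

```python
# Lower rank means more authoritative. Only "nc_sbe" is taken from the text.
SOURCE_RANK = {"nc_sbe": 0, "medsl": 1, "openelections": 2, "clarity": 3}


def completeness(variant: dict) -> int:
    """One point each for a middle name or initial and for a suffix."""
    return int(bool(variant.get("middle"))) + int(bool(variant.get("suffix")))


def select_canonical(variants: list) -> dict:
    # Most complete first, then most authoritative source, then most recent.
    return min(variants, key=lambda v: (-completeness(v),
                                        SOURCE_RANK[v["source_type"]],
                                        -v["year"]))


variants = [
    {"raw": "SHANNON BRAY", "middle": None, "suffix": None, "source_type": "nc_sbe", "year": 2022},
    {"raw": "SHANNON W BRAY", "middle": "W", "suffix": None, "source_type": "medsl", "year": 2020},
]
print(select_canonical(variants)["raw"])
```

Completeness is compared before authority, so the more complete variant wins here even though it comes from a lower-ranked source.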

Temporal Chain Aggregation

L4 builds one temporal chain entry per (entity, election, contest). A candidate who appeared in 47 precincts in one election gets one entry with the summed vote total — not 47 entries.

This fixes a prototype bug. The initial implementation built temporal chains per precinct, producing entries like “Timothy Lance, 2022, P17, 303 votes” and “Timothy Lance, 2022, P21, 287 votes.” For career tracking and competitiveness analysis, the correct granularity is the election level: “Timothy Lance, 2022, Columbus County Schools Board of Education District 02, 1,531 votes.”

The aggregation:

{
  "entity_id": "person:nc:columbus:lance-timothy-13",
  "canonical_name": "Timothy Lance",
  "aliases": ["Timothy Lance", "TIMOTHY LANCE"],
  "elections": [
    {
      "date": "2022-11-08",
      "contest": "Columbus County Schools Board of Education District 02",
      "contest_entity_id": "contest:nc:columbus:school-board-d02",
      "votes": 1531,
      "vote_share": 0.523,
      "outcome": "won",
      "source_count": 1
    }
  ],
  "states": ["NC"],
  "first_appearance": "2022-11-08",
  "election_count": 1
}

For multi-cycle candidates, the elections array grows. George Dunlap — Mecklenburg County Commissioner across 6 consecutive cycles (2014–2024) — has 6 entries in his temporal chain, each with the contest-level vote total for that election.
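The election-level aggregation is a grouped sum. This sketch assumes flat precinct rows with hypothetical field names and uses only the two precinct totals quoted above (P17 and P21), so its sum is not the full contest total of 1,531.

```python
from collections import defaultdict


def aggregate_elections(precinct_rows: list) -> list:
    """Collapse precinct rows into one entry per (entity, election, contest)."""
    totals = defaultdict(int)
    for row in precinct_rows:
        key = (row["entity_id"], row["election_date"], row["contest_entity_id"])
        totals[key] += row["votes_total"]
    return [
        {"entity_id": entity, "date": date, "contest_entity_id": contest, "votes": votes}
        for (entity, date, contest), votes in sorted(totals.items())
    ]


base = {"entity_id": "person:nc:columbus:lance-timothy-13",
        "election_date": "2022-11-08",
        "contest_entity_id": "contest:nc:columbus:school-board-d02"}
rows = [dict(base, precinct="P17", votes_total=303),
        dict(base, precinct="P21", votes_total=287)]
print(aggregate_elections(rows))
```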

Alias Tables

Every name variant observed for an entity is preserved in the aliases array. This serves two purposes:

Searchability. A user searching for “SHANNON W BRAY” finds the entity whose canonical name is “Shannon W. Bray” because the ALL CAPS variant is in the alias table.
Provenance. The alias table documents which sources used which name formats. If a future entity resolution decision is questioned, the alias table shows exactly what variants were merged.

Aliases are deduplicated but not normalized — Timothy Lance and TIMOTHY LANCE are both preserved because they demonstrate that the entity appears in both title-case and all-caps sources.

The Six Verification Algorithms

L4 runs six verification algorithms over the complete output. These are not optional post-processing — they are integral to the pipeline’s trust model. Every verification result is recorded in verification_report.json.

1. Hash Chain Integrity

Walk the hash chain from L4 → L3 → L2 → L1 → L0 for every record. Recompute each hash and compare to the stored value. Any mismatch identifies the exact layer where the chain breaks.

Metric	Prototype result
Records verified	200 / 200
Broken chains	0
Layers traversed per record	5

See Provenance and the Hash Chain for the verification algorithm.

2. Entity Consistency

Flag entities with characteristics that are unusual for local officeholders:

Multi-state entities. A candidate_entity_id spanning NC and FL is suspicious because local officials serve in one state. Federal offices are exempted: a presidential candidate legitimately appears on ballots in many states.
Party switches. An entity appearing as DEM in 2018 and REP in 2022 is not impossible (party switches happen) but is flagged for review.
Implausible office combinations. An entity serving simultaneously as county sheriff and school board member is unlikely (though not impossible in small counties).

3. Temporal Plausibility

Check career spans and office progressions:

Span check. An entity with elections in 2006 and 2024 has an 18-year span. Plausible for a long-serving commissioner, but flagged if the office is typically a stepping stone (e.g., school board).
Gap detection. An entity appearing in 2014 and 2024 but not 2016, 2018, 2020, or 2022 may be two different people merged by entity resolution — or someone who left office and returned. Gaps > 2 cycles are flagged.
Age plausibility. If external data (FEC filings, candidate bio pages) provides a birth year, check that the candidate was of legal age at first appearance.
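Gap detection can be sketched from the list of years in an entity's temporal chain. The two-year cycle length is an assumption (offices with four-year terms would need a different value), and `flag_gaps` is a hypothetical name.

```python
def flag_gaps(election_years, cycle_years: int = 2, max_missed_cycles: int = 2) -> list:
    """Return (earlier, later, missed_cycles) for every gap over the limit."""
    years = sorted(set(election_years))
    flags = []
    for earlier, later in zip(years, years[1:]):
        missed = (later - earlier) // cycle_years - 1
        if missed > max_missed_cycles:
            flags.append((earlier, later, missed))
    return flags


# Appears in 2014 and 2024 only: 2016, 2018, 2020, and 2022 are missed.
print(flag_gaps([2014, 2024]))
# Six consecutive cycles, 2014 through 2024: nothing to flag.
print(flag_gaps(range(2014, 2026, 2)))
```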

4. Cross-Source Reconciliation

Where two sources cover the same contest, compare vote totals for each candidate entity:

Agreement level	NC 2022 contests	Percentage
Exact match	579	90.5%
Within 1%	47	7.3%
Disagree > 1%	14	2.2%

Disagreements are reported with both sources’ totals, the percentage difference, and the probable cause (provisional ballot timing, write-in aggregation, precinct boundary assignment). See Cross-Source Reconciliation.
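The agreement buckets can be sketched as a comparison of two totals. Using the larger total as the denominator for the percentage difference is an assumption, and the bucket labels are hypothetical. The last lines recompute the table's percentages from its counts.

```python
def agreement_level(votes_a: int, votes_b: int) -> str:
    if votes_a == votes_b:
        return "exact"
    # The totals differ, so the larger one is non-zero.
    diff = abs(votes_a - votes_b) / max(votes_a, votes_b)
    return "within_1pct" if diff <= 0.01 else "disagree"


print(agreement_level(1531, 1531), agreement_level(1000, 995), agreement_level(1000, 900))

# Recompute the percentage column from the NC 2022 counts.
counts = {"exact": 579, "within_1pct": 47, "disagree": 14}
total = sum(counts.values())
for level, count in counts.items():
    print(level, f"{100 * count / total:.1f}%")
```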

5. Completeness Audit

Report coverage metrics across the full dataset:

Metric	Target	Prototype result
State coverage (FIPS populated)	100%	100%
County coverage (FIPS populated)	100%	100%
Entity ID fill rate (candidate)	> 95%	100%
Entity ID fill rate (contest)	> 95%	100%
Office classification fill rate	> 90%	67% (prototype scope)
Turnout data fill rate	varies	< 5% (most sources lack it)

Low fill rates are not errors — they are documented gaps. The completeness audit ensures that gaps are visible, not hidden.

6. LLM Entity Audit

For every entity with members from more than one source or more than one election, ask a language model whether the entity cluster is plausible. This is the only LLM call in L4.

The prompt provides the entity’s canonical name, all aliases, all elections, all offices, all states, and all vote totals. The model evaluates:

Is this a plausible single person?
Are the offices consistent with one career?
Do the vote totals and geographic spread make sense?
Are any aliases suspicious (non-person names, ballot measure choices, turnout metadata)?

Prototype results from auditing 50 entities:

Category	Count	Details
Clean — no issues	3	Entity is unambiguous
Suspicious — flagged for review	43	Precinct-level records inflating temporal chains
Likely error — incorrect entity	4	“For” and “Against” classified as person entities

The 43 suspicious entities were a direct consequence of the prototype bug where temporal chains were built per precinct rather than per election. After fixing the aggregation to election-level, the suspicious count dropped to single digits in subsequent runs.

The 4 errors were ballot measure choices (“For”, “Against”) that had leaked past L1 non-candidate detection and received candidate_entity_id values at L3. The LLM audit caught them:

“‘For’ is not a plausible person name. This entity appears across 347 contests in 12 states, always in contest names containing ‘amendment’, ‘bond’, ‘referendum’, or ‘proposition’. These are ballot measure choices, not candidates.”

This finding led to tighter non-candidate detection at L1. See Non-Candidate Records.

Output Format

L4 produces three types of output:

Entity Registries (JSON)

One file per entity type, containing one record per unique entity:

candidate_registry.json — all person entities with canonical names, aliases, temporal chains
contest_registry.json — all contest entities with canonical names, years active, states

Flat Exports (JSONL and CSV)

One record per candidate per contest per precinct, with canonical names and entity IDs attached:

{
  "election_date": "2022-11-08",
  "state": "NC",
  "county": "COLUMBUS",
  "contest_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
  "candidate_raw": "Timothy Lance",
  "candidate_canonical": "Timothy Lance",
  "candidate_entity_id": "person:nc:columbus:lance-timothy-13",
  "votes_total": 303,
  "source": "nc_sbe",
  "l3_hash": "28183d41d50204d5",
  "l0_hash": "edfedf2760cfd54f"
}

The flat export retains precinct-level granularity with entity-level annotations. Users who need contest-level totals aggregate by (candidate_entity_id, contest_entity_id, election_date). Users who need precinct-level data use the records as-is.
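
The contest-level rollup can be sketched as below. It assumes each flat record carries a contest_entity_id field alongside the fields in the sample record above (the sample does not display one), and the two precinct rows are made up for illustration.

```python
from collections import defaultdict


def contest_totals(precinct_records):
    """Sum precinct-level votes per (candidate, contest, election date)."""
    totals = defaultdict(int)
    for rec in precinct_records:
        key = (
            rec["candidate_entity_id"],
            rec["contest_entity_id"],  # assumed to be present in the export
            rec["election_date"],
        )
        totals[key] += rec["votes_total"]
    return dict(totals)


# Hypothetical precinct rows for one candidate in one contest.
rows = [
    {"candidate_entity_id": "person:nc:columbus:lance-timothy-13",
     "contest_entity_id": "contest:nc:columbus:school-board-d02",
     "election_date": "2022-11-08", "votes_total": 303},
    {"candidate_entity_id": "person:nc:columbus:lance-timothy-13",
     "contest_entity_id": "contest:nc:columbus:school-board-d02",
     "election_date": "2022-11-08", "votes_total": 120},
]
print(contest_totals(rows))
```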

The CSV export contains the same fields for users who prefer tabular tools (Excel, R, Stata). Column order matches the JSONL field order.

Verification Report (JSON)

A single verification_report.json summarizing all six verification algorithms:

{
  "run_date": "2026-03-19T12:00:00Z",
  "record_count": 200,
  "entity_count": 206,
  "hash_chain": {"verified": 200, "broken": 0},
  "entity_consistency": {"clean": 195, "flagged": 11},
  "temporal_plausibility": {"clean": 203, "flagged": 3},
  "cross_source": {"exact_match": 579, "within_1pct": 47, "disagree": 14},
  "completeness": {"fips_fill": 1.0, "entity_fill": 1.0, "office_fill": 0.67},
  "llm_audit": {"clean": 3, "suspicious": 43, "error": 4, "entities_audited": 50}
}

This report is the pipeline’s self-assessment. A researcher evaluating the data reads the verification report first to understand what the pipeline is confident about and where it flagged concerns.

Cross-References

Provenance and the Hash Chain — how hash verification works
Cross-Source Reconciliation — the NC overlap validation
Non-Candidate Records — the “For” and “Against” audit finding
Career Tracking Recipe — querying temporal chains
Verify a Specific Result — using hash chains for provenance

Why the Order Matters: Clean → Embed → Match → Canonicalize

The pipeline’s four processing stages must run in exactly this order. This is not a convention — it is a dependency chain where each stage requires the output of all prior stages. Rearranging them destroys signal.

We learned this the hard way.

The Insight

The original prototype ran normalization aggressively: strip middle initials, collapse suffixes, force uppercase, pick a canonical name, then try to match entities. The sequence was:

Old order:  Canonicalize → Match
            (normalize aggressively, then find duplicates)

This destroyed the information needed to tell different people apart.

David S. Marshall (Maine, state legislature) and David A. Marshall (Florida, county commission) are two different people. Under the old pipeline, both names were normalized to MARSHALL, DAVID — middle initials stripped as noise. After normalization, the two records were indistinguishable. The entity resolver matched them as the same person. One David Marshall absorbed the other’s career, vote history, and geographic record.

The embedding scores confirm why middle initials matter:

Composite string	Cosine similarity
`David Marshall | ME` vs `David Marshall | FL`	0.7025
`David S Marshall | ME` vs `David A Marshall | FL`	0.6448

The middle initial drops the score by 0.058 — enough to push the pair further from the accept threshold and toward correct rejection. But this signal only exists if the middle initial survives to L2. If L1 strips it during “normalization,” it is gone forever.
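
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a plain-Python version for small vectors; a real run over 3,072-dimensional embeddings would more likely use a numerical library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```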

The Correct Order

L1  CLEAN        Parse into components. Preserve everything:
                 first, middle, last, suffix, nickname, canonical_first.
                 No components are discarded. No names are collapsed.
 ↓
L2  EMBED        Generate vectors from composite strings that include
                 middle initials, suffixes, and canonical_first.
                 The embedding encodes all preserved signal.
 ↓
L3  MATCH        Compare embeddings. Run LLM confirmation on ambiguous
                 pairs. The LLM sees structured components — middle
                 initials, suffixes, nicknames — and reasons about them.
 ↓
L4  CANONICALIZE Now that entities are resolved, pick the authoritative
                 name. Prefer the most complete variant. Build alias
                 tables. Aggregate temporal chains.

Each stage depends on prior stages’ output:

L2 depends on L1 — embeddings are generated from L1’s structured name components. If L1 strips middle initials, L2 cannot encode them.
L3 depends on L2 — entity resolution uses L2 embeddings as the retrieval step. If L2 has degraded vectors (because L1 destroyed signal), L3 makes worse decisions.
L4 depends on L3 — canonical name selection requires knowing who the person is. You cannot pick the “best” name for an entity before you know which records belong to that entity.

What Breaks If You Rearrange

Canonicalize before Match

This is the old pipeline. Normalize aggressively, then match. Failures:

David S. Marshall and David A. Marshall merge into one entity.
Robert Williams and Robert Williams Jr merge — suffix stripped before matching can use it.
Charlie Crist normalizes to CRIST, CHARLIE but CRIST, CHARLES JOSEPH normalizes to CRIST, CHARLES — the canonical forms don’t match, so the same person splits into two entities.

Aggressive normalization both merges people who should be separate and splits people who should be merged. It is wrong in both directions simultaneously.
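
Both failure directions can be reproduced with a toy version of the old normalizer. The function below is a reconstruction for illustration, not the prototype's actual code; the suffix list and parsing rules are assumptions.

```python
SUFFIXES = {"JR", "SR", "II", "III", "IV"}  # illustrative list


def aggressive_normalize(raw: str) -> str:
    """Old-style normalization: uppercase, drop middle names and suffixes."""
    cleaned = raw.replace(".", "").upper()
    if "," in cleaned:
        # "LAST, FIRST MIDDLE" format
        last, rest = (part.strip() for part in cleaned.split(",", 1))
        tokens = [t for t in rest.split() if t not in SUFFIXES]
    else:
        # "First Middle Last Suffix" format
        tokens = [t for t in cleaned.split() if t not in SUFFIXES]
        last = tokens.pop()
    return f"{last}, {tokens[0]}"


# False merge: two different people collapse to one key.
print(aggressive_normalize("David S. Marshall"))      # MARSHALL, DAVID
print(aggressive_normalize("David A. Marshall"))      # MARSHALL, DAVID
# False split: one person ends up with two keys.
print(aggressive_normalize("Charlie Crist"))          # CRIST, CHARLIE
print(aggressive_normalize("CRIST, CHARLES JOSEPH"))  # CRIST, CHARLES
```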

Match before Embed

Without embeddings, matching falls back to string similarity alone. Jaro-Winkler on Charlie Crist vs CRIST, CHARLES JOSEPH gives 0.58 — a miss. The embedding model, despite scoring only 0.451, at least places the pair in the ambiguous zone where the LLM can confirm the match. Without embeddings, the pair is never surfaced.

Embed before Clean

If L1 does not decompose names into components, L2 embeds raw strings: CRIST, CHARLES JOSEPH as-is. The composite template cannot include canonical_first because it does not exist yet. The embedding for the MEDSL record uses CHARLES while the OpenElections record uses Charlie — the nickname dictionary was never applied. The cosine score drops, more pairs fall below the LLM zone, and matches are lost.

The General Principle

Preserve signal as long as possible. Collapse only after all decisions that need the signal have been made.

Middle initials are signal for disambiguation. Suffixes are signal for generational distinction. Nicknames are signal for matching. Raw strings are signal for provenance. None of these should be discarded until L4, where the entity is already resolved and the canonical name is a presentation choice, not an analytical input.

The pipeline is a funnel of information:

Layer	Information available	Information consumed
L1	All components: raw, first, middle, last, suffix, canonical_first	None — everything preserved
L2	L1 components + embeddings + quality flags	Components consumed to build composite strings
L3	L2 embeddings + L1 components + LLM context	Embeddings consumed for retrieval; components consumed for LLM reasoning
L4	L3 entity assignments	Entity IDs consumed to select canonical names

At each layer, information from prior layers is used but not destroyed. The L1 record persists unchanged alongside the L2, L3, and L4 records. A researcher who disagrees with a canonical name choice can trace back to the original components at L1 and the raw bytes at L0.

Why This Took a Session to Learn

The old order felt intuitive: clean the data first, then do the hard work. Every data engineering textbook says normalize early. But election entity resolution is not a standard ETL problem. The “dirt” in the data — middle initials, suffixes, nicknames, variant spellings — is not dirt. It is signal. Stripping it is not cleaning. It is destruction.

The key insight: the order of operations is load-bearing. Clean → Embed → Match → Canonicalize is the only sequence that preserves signal through the stages that need it and collapses only after all analytical decisions are final.

Provenance and the Hash Chain

Every record at every layer carries a cryptographic hash of its own content and a pointer to its parent layer’s hash. This chain links any L4 canonical export record back through L3 matching, L2 embedding, and L1 cleaning to the exact bytes of the original source file at L0. If any record at any layer is modified — a vote count changed, a name altered, a match decision overridden — the chain breaks at precisely that point.

The Hash Structure

Each layer computes its hash as:

l{N}_hash = SHA-256( record_content + "parent:" + l{N-1}_hash )

The record_content is the deterministic serialization of all fields at that layer (excluding the hash itself). The parent: prefix is a literal string separator. The parent hash anchors the current record to its predecessor.

L4 canonical record
  l4_hash ← SHA-256(L4 content + "parent:" + l3_hash)
    │
    └── L3 matched record
          l3_hash ← SHA-256(L3 content + "parent:" + l2_hash)
            │
            └── L2 embedded record
                  l2_hash ← SHA-256(L2 content + "parent:" + l1_hash)
                    │
                    └── L1 cleaned record
                          l1_hash ← SHA-256(L1 content + "parent:" + l0_hash)
                            │
                            └── L0 raw file
                                  l0_hash ← SHA-256(raw file bytes)
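
A minimal sketch of the per-layer hash, assuming that "deterministic serialization" means JSON with sorted keys and compact separators. That serialization choice and the function name `layer_hash` are assumptions; the source specifies only SHA-256 and the "parent:" separator.

```python
import hashlib
import json


def layer_hash(content: dict, parent_hash: str) -> str:
    """SHA-256 over the serialized record content plus 'parent:' + parent hash."""
    # Assumed serialization: sorted keys, no insignificant whitespace.
    serialized = json.dumps(content, sort_keys=True, separators=(",", ":"),
                            ensure_ascii=False)
    payload = serialized + "parent:" + parent_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


l0_hash = hashlib.sha256(b"raw file bytes go here").hexdigest()
l1_content = {"candidate": "Timothy Lance", "votes_total": 303}
l1_hash = layer_hash(l1_content, l0_hash)

# Changing a single field yields a different hash.
tampered = dict(l1_content, votes_total=304)
print(layer_hash(tampered, l0_hash) != l1_hash)  # True
```

This produces full 64-character hex digests; the 16-character values shown in this chapter are truncated for display.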

A Real Example: Timothy Lance Through All Five Layers

Timothy Lance ran for Columbus County Schools Board of Education District 02 in the 2022 NC general election. Here is one of his precinct-level records traced through every layer.

L0: Raw

The NC SBE results file results_pct_20221108.txt is stored byte-identical at l0_raw/nc_sbe/results_pct_20221108.txt.

{
  "l0_hash": "edfedf2760cfd54f",
  "source_url": "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip",
  "retrieval_date": "2026-03-18T14:30:00Z",
  "file_size_bytes": 18023456,
  "format_detected": "tsv"
}

The l0_hash is the SHA-256 of the raw file bytes (truncated here for display). Re-downloading the file and re-hashing produces the same value. If NC SBE updates the file after our retrieval, the hash changes and a new L0 entry is created.

L1: Cleaned

The NC SBE parser extracts Timothy Lance’s precinct P17 row and produces a structured record:

{
  "jurisdiction": {
    "state": "NC", "county": "COLUMBUS", "precinct": "P17"
  },
  "contest": {
    "raw_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
    "office_level": "school_district"
  },
  "results": [{
    "candidate_name": {
      "raw": "Timothy Lance", "first": "Timothy",
      "middle": null, "last": "Lance",
      "suffix": null, "canonical_first": "Timothy"
    },
    "votes_total": 303
  }],
  "provenance": {
    "l1_hash": "8ea7ecc257ff8e05",
    "l0_parent_hash": "edfedf2760cfd54f",
    "parser_version": "nc_sbe_v2.1",
    "schema_version": "3.0.0"
  }
}

The l1_hash is computed from the L1 record content plus "parent:edfedf2760cfd54f". The l0_parent_hash links back to the raw file.

L2: Embedded

L2 generates a composite string and embedding for the candidate:

{
  "l2": {
    "l2_hash": "854fa6367960bb05",
    "l1_parent_hash": "8ea7ecc257ff8e05",
    "embedding_model": "text-embedding-3-large",
    "embedding_dimensions": 3072,
    "candidate_embedding_id": 4271,
    "candidate_composite": "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus",
    "quality_flags": []
  }
}

The l2_hash is computed from the L2 fields plus "parent:8ea7ecc257ff8e05". The l1_parent_hash links back to L1.

L3: Matched

Entity resolution assigns a candidate_entity_id. Timothy Lance appeared identically across all precincts, so step 1 (exact match) resolved him:

{
  "l3": {
    "l3_hash": "28183d41d50204d5",
    "l2_parent_hash": "854fa6367960bb05",
    "candidate_entity_ids": [
      {"result_index": 0, "entity_id": "person:nc:columbus:lance-timothy-13"}
    ],
    "contest_entity_id": "contest:nc:columbus:school-board-d02"
  }
}

The l3_hash is computed from the L3 fields plus "parent:854fa6367960bb05".

L4: Canonical

L4 produces the researcher-facing export record:

{
  "election_date": "2022-11-08",
  "state": "NC",
  "county": "COLUMBUS",
  "contest_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
  "candidate_canonical": "Timothy Lance",
  "candidate_entity_id": "person:nc:columbus:lance-timothy-13",
  "votes_total": 303,
  "source": "nc_sbe",
  "l4_hash": "f19a3e8bc7210d42",
  "l3_hash": "28183d41d50204d5",
  "l0_hash": "edfedf2760cfd54f"
}

The l4_hash is computed from the L4 fields plus "parent:28183d41d50204d5". The record also carries l0_hash as a shortcut for end-to-end verification.

Verification Algorithm

To verify a single L4 record:

Read the L4 record. Recompute SHA-256(L4 content + "parent:" + l3_hash). Compare to stored l4_hash. If mismatch → chain broken at L4.
Look up the L3 record by l3_hash. Recompute SHA-256(L3 content + "parent:" + l2_hash). Compare to stored l3_hash. If mismatch → chain broken at L3.
Look up the L2 record by l2_hash. Recompute SHA-256(L2 content + "parent:" + l1_hash). Compare to stored l2_hash. If mismatch → chain broken at L2.
Look up the L1 record by l1_hash. Recompute SHA-256(L1 content + "parent:" + l0_hash). Compare to stored l1_hash. If mismatch → chain broken at L1.
Read the L0 raw file. Recompute SHA-256(file bytes). Compare to stored l0_hash. If mismatch → chain broken at L0 (source file was modified or corrupted).

If all five checks pass, the record is verified from canonical output back to original source bytes.
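
The five checks can be sketched as one walk from L4 back to L0. Everything about the storage layout here is hypothetical: records are dictionaries with content, parent_hash, and hash keys, each layer is an in-memory dict keyed by hash, and serialization is sorted-key JSON.

```python
import hashlib
import json


def layer_hash(content, parent_hash):
    serialized = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((serialized + "parent:" + parent_hash).encode()).hexdigest()


def make_record(content, parent_hash):
    return {"content": content, "parent_hash": parent_hash,
            "hash": layer_hash(content, parent_hash)}


def verify_chain(l4_record, stores, raw_files):
    """Return None if the chain is intact, else the layer where it breaks."""
    record = l4_record
    for layer in ("L4", "L3", "L2", "L1"):
        if layer_hash(record["content"], record["parent_hash"]) != record["hash"]:
            return layer
        if layer == "L1":
            break
        parent_layer = "L" + str(int(layer[1]) - 1)
        record = stores[parent_layer].get(record["parent_hash"])
        if record is None:
            return parent_layer  # parent pointer resolves to nothing
    raw = raw_files.get(record["parent_hash"])
    if raw is None or hashlib.sha256(raw).hexdigest() != record["parent_hash"]:
        return "L0"
    return None


raw = b"Timothy Lance\t303\n"  # stand-in for the raw source file
l0_hash = hashlib.sha256(raw).hexdigest()
l1 = make_record({"votes_total": 303}, l0_hash)
l2 = make_record({"composite": "Timothy Lance"}, l1["hash"])
l3 = make_record({"entity_id": "person:nc:columbus:lance-timothy-13"}, l2["hash"])
l4 = make_record({"candidate_canonical": "Timothy Lance"}, l3["hash"])
stores = {"L1": {l1["hash"]: l1}, "L2": {l2["hash"]: l2}, "L3": {l3["hash"]: l3}}
raw_files = {l0_hash: raw}

print(verify_chain(l4, stores, raw_files))  # None
l1["content"]["votes_total"] = 304          # tamper with the L1 record
print(verify_chain(l4, stores, raw_files))  # L1
```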

Prototype Results

In our 200-record prototype run:

Metric	Result
Records verified	200 / 200
Broken chains	0
Layers traversed per record	5 (L4 → L3 → L2 → L1 → L0)
Total hash verifications	1,000 (200 records × 5 layers)

Every hash chain verified end-to-end with zero broken links.

What Breaks the Chain

The hash chain detects any modification at any layer. Specific scenarios:

Modifying a vote count at L1. If someone changes Timothy Lance’s votes from 303 to 304, the L1 content changes, the recomputed l1_hash no longer matches the stored value, and the L2 record’s l1_parent_hash no longer points to a valid L1 record.

Changing a parser without a version bump. If the NC SBE parser is updated but parser_version is not incremented, the L1 content for existing records may change (different parsing logic applied to the same raw bytes). The l1_hash changes, breaking the chain from L2 upward. The parser_version field exists precisely to prevent silent parser changes.

Overriding an L3 match decision. If a human reviewer changes an entity assignment at L3, the l3_hash changes. L4 must be re-run from the amended L3 output. The original L3 decision is preserved in the decision log — it is never deleted, only superseded.

Re-downloading a source file after the publisher updated it. NC SBE occasionally corrects results files after initial publication. If the corrected file has different bytes, the l0_hash changes. The entire pipeline from L1 upward must be re-run for affected records. The original L0 entry and its manifest are retained as a versioned snapshot.

Why Not a Merkle Tree

A Merkle tree would allow verifying subsets of records without recomputing the full chain. We use a simpler linear chain because:

Records are independent. Each precinct-level record has its own chain. Verifying one record does not require knowledge of any other record. A Merkle tree adds complexity without benefit when records are not aggregated into blocks.
Full verification is cheap. SHA-256 of a 2 KB record takes microseconds. Verifying all 200 records takes less than a second. At 200 million records, full verification takes minutes — well within acceptable bounds for a batch pipeline.
Simplicity aids trust. A journalist verifying a specific result needs to understand “follow the hash backward through five files.” A Merkle tree requires understanding tree structure, sibling hashes, and root computation. The simpler model is more auditable by non-engineers.

The Chain as Documentation

The hash chain is not just an integrity mechanism — it is a documentation trail. Every L4 record answers the question: “Where did this number come from?” Follow l3_hash to see which entity resolution decision assigned this candidate ID. Follow l2_parent_hash to see the embedding and composite string. Follow l1_parent_hash to see the parsed record. Follow l0_parent_hash to see the raw source file.

This is provenance in the literal sense: the origin and chain of custody of every data point, cryptographically verifiable.

The Project Does Not Store Data

This project processes election data. It does not redistribute it.

Why Not

Legal

Each source publishes data under its own terms. MEDSL uses CC-BY. NC SBE publishes as public record under North Carolina law. OpenElections uses a mix of licenses depending on the state contributor. FEC data is public domain. Census reference files are public domain.

Bundling data from all sources into a single download would require compliance with every license simultaneously — attribution chains, share-alike provisions, and restrictions that vary by state contributor. The legal surface area grows with every source added. We avoid it entirely by not storing the data.

Practical

The current corpus is 8+ GB across three election cycles and seven sources. Adding MEDSL 2018 and 2020, full OpenElections coverage, and VEST shapefiles pushes this past 20 GB. Hosting, versioning, and serving that volume adds infrastructure cost and maintenance burden that contribute nothing to the pipeline’s accuracy or reproducibility.

Freshness

Sources update. NC SBE reissues precinct files when canvass corrections are made. MEDSL publishes errata and revised datasets. OpenElections contributors fix parsing errors and add new states. A copy of the data taken on March 18 may be stale by April 1.

If we store data, every downstream user inherits our staleness. If users download from the authoritative source, they get the latest version — and our pipeline processes it identically.

What We Provide Instead

The project provides everything needed to acquire the data yourself:

What	Where	Example
Exact source URLs	Each source chapter in Part II	`https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip`
Download commands	Download the Data	`curl -O <url>` with expected file sizes
Schema documentation	Each source chapter	Column names, types, delimiters, encoding
Known quirks	Each source chapter	NC SBE uses `\t` separators but `.txt` extension; MEDSL 2022 has trailing commas in some state files
File size expectations	Download the Data	MEDSL 2022 NC: ~45 MB compressed
SHA-256 of our L0 copies	L0 manifests	Verify your download matches ours

The L0 manifest for each file records the SHA-256 hash of the bytes we processed. After downloading the same file, you can hash your copy and compare. If the hashes match, your pipeline run will produce identical L1 output — byte for byte, hash for hash.
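
Comparing a download against a manifest entry might look like the sketch below. It assumes the manifest stores the full 64-character hex digest (the hashes printed in this documentation are truncated); the function names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path, manifest_hash):
    """True when the local file's digest equals the manifest's full digest."""
    return sha256_of_file(path) == manifest_hash.strip().lower()
```

A call such as `matches_manifest("results_pct_20221108.txt", manifest_hash)` returning False means the publisher's file has changed since the manifest was written, or the download is corrupt.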

The Boundary

The project does bundle small reference datasets that are not election results:

FIPS code reference files (~200 KB) from the Census Bureau, public domain. These change rarely, typically only when a county or county-equivalent is created, dissolved, or reorganized.
The nickname dictionary (~5 KB), original to this project.
The office classification keyword and regex tables (~10 KB), original to this project.
The 200-name office embedding reference set (~50 KB), original to this project.

These are small and stable. The FIPS files are public-domain Census data; the other three are authored by the project. None of them is third-party election data.

Election results — the 42 million rows of precinct-level vote counts — are never stored, cached, or redistributed. The user downloads them. The pipeline processes them. The outputs live on the user’s machine.

Embedding Model: text-embedding-3-large

The pipeline uses OpenAI’s text-embedding-3-large for all vector generation at L2. This is a deliberate choice with specific trade-offs. The model is not the best possible embedding model — it is the best available model for this task given current constraints.

Why text-embedding-3-large

Three properties matter for election entity resolution: dimensionality, consistency, and performance on short structured text.

3,072 dimensions. Higher dimensionality preserves more fine-grained distinctions in short strings. “David S Marshall” and “David A Marshall” differ by a single character — a middle initial. In a 384-dimensional space, that distinction may be compressed away. In 3,072 dimensions, the model has room to encode it. We measured: the middle initial drops cosine similarity from 0.7025 to 0.6448 — a 0.058 gap that matters for disambiguation.

API-based consistency. Every call to the same model version with the same input produces the same vector. There is no local model initialization, no GPU-dependent floating-point variance, no seed to manage. Two users on different machines embedding the same candidate string get the same 3,072 floats. This is critical for reproducibility: L2 output is deterministic given the same model version.

Strong on short structured text. Candidate composite strings are 50–150 characters: "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus". These are not natural language paragraphs — they are structured identifiers with pipe-delimited fields. text-embedding-3-large handles this format well in our testing. Nickname pairs (Charlie Crist at 0.451), suffix pairs (Williams Jr at 0.862), and middle-initial pairs (David Marshall at 0.6448) all produce scores in ranges that the cascade can act on.

Why Not MiniLM

all-MiniLM-L6-v2 from Sentence Transformers is the default recommendation for lightweight embedding tasks. It runs locally, requires no API key, and produces vectors in milliseconds on CPU. We evaluated it and rejected it for three reasons.

384 dimensions. One-eighth the dimensions of text-embedding-3-large. On structured identifiers where single-character differences carry categorical meaning (middle initials, suffixes), the lower dimensionality compresses distinctions. In informal testing, MiniLM scored Williams Jr at 0.91 against Williams: higher than text-embedding-3-large’s 0.862 and closer to the score at which a pair would be accepted without LLM review. The suffix signal is largely lost.

2021 training data. MiniLM was trained on data through 2021. It has no exposure to post-2021 candidate names, office titles, or geographic patterns. text-embedding-3-large was trained on more recent data, though the exact cutoff is not published. For a task that involves matching strings like “DESANTIS, RON” and “Ron DeSantis” — where the model’s familiarity with the name helps — recency matters.

Weaker on structured identifiers. MiniLM is optimized for sentence similarity — determining whether two natural language sentences express the same meaning. Our inputs are not sentences. They are pipe-delimited fields with proper nouns, abbreviations, and codes. text-embedding-3-large is a general-purpose model that handles structured text more robustly than a sentence-similarity specialist.

MiniLM’s advantages (local execution, zero API cost, millisecond-scale inference on CPU) are real but irrelevant to our constraints. Budget is not a constraint. Latency at L2 is not a bottleneck, because embeddings are computed once and cached. The accuracy difference on structured identifiers is the deciding factor.

Why Not a Fine-Tuned Model

A model fine-tuned on election name pairs should outperform a general-purpose model on this task. The evidence is that the failure modes of text-embedding-3-large are systematic: it scores nickname pairs too low (Charlie/Charles at 0.451) and suffix pairs too high (Williams/Williams Jr at 0.862). A fine-tuned model trained on labeled pairs (“these are the same person” / “these are different people”) would learn that “Jr” is a strong negative signal and that “Charlie”/“Charles” is not.

We do not have training data yet.

Fine-tuning requires labeled pairs: hundreds to thousands of (name_a, name_b, same_person) triples with ground truth. Our prototype has 12 manually verified pairs. The L3 decision log will eventually contain thousands of LLM-confirmed match/no-match decisions — each one a potential training example. This is an active learning loop:

L3 uses the general-purpose model to retrieve candidates.
The LLM confirms or rejects matches, producing labeled pairs.
The labeled pairs train a fine-tuned embedding model.
The fine-tuned model replaces text-embedding-3-large at L2, improving retrieval.
Better retrieval surfaces harder cases for the LLM, producing more informative training data.

This loop is planned but not yet implemented. It requires the pipeline to run at scale first, generating enough decisions for a meaningful training set. In the meantime, text-embedding-3-large with the 5-step cascade produces correct results on every tested pair — the LLM compensates for the embedding model’s weaknesses.

Thresholds Are Model-Specific

The calibrated thresholds (auto-accept ≥ 0.95, ambiguous 0.35–0.95, auto-reject < 0.35) are specific to text-embedding-3-large with 3,072 dimensions. A different model produces a different similarity distribution. MiniLM’s Williams Jr score of 0.91 vs. text-embedding-3-large’s 0.862 illustrates the problem: the same pair sits at a different distance from each boundary depending on the model, so cutoffs calibrated on one model’s scores do not transfer to another.
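
The three zones amount to a small routing function. The constant and function names below are illustrative; the cutoff values are the ones stated above for text-embedding-3-large.

```python
AUTO_ACCEPT = 0.95  # cosine >= this: accept without LLM review
AUTO_REJECT = 0.35  # cosine <  this: reject without LLM review


def route(cosine: float) -> str:
    """Map a cosine score to the zone that decides what happens next."""
    if cosine >= AUTO_ACCEPT:
        return "auto_accept"
    if cosine < AUTO_REJECT:
        return "auto_reject"
    return "ambiguous"  # handed to the LLM confirmation step


print(route(0.451))  # ambiguous  (Charlie Crist pair)
print(route(0.862))  # ambiguous  (Williams / Williams Jr)
print(route(0.21))   # auto_reject
```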

If the model changes, recalibration is required:

Re-embed all test cases with the new model.
Plot the score distribution for known matches and known non-matches.
Find the auto-accept, ambiguous, and auto-reject boundaries that minimize false positives and false negatives.
Update the threshold configuration and document the new model in L2 metadata.
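
One simple way to pick boundaries from labeled scores is sketched below: place auto-accept just above the highest known non-match and auto-reject just below the lowest known match, so that no labeled pair is auto-decided incorrectly. This rule, the margin parameter, and the function name are assumptions for illustration; the project's calibration procedure may differ, and these four scores do not reproduce its 0.95 and 0.35 cutoffs.

```python
def calibrate(match_scores, non_match_scores, margin=0.01):
    """Choose boundaries with zero auto-decision errors on the labeled pairs."""
    auto_accept = min(1.0, max(non_match_scores) + margin)
    auto_reject = max(0.0, min(match_scores) - margin)
    if auto_reject > auto_accept:
        # Distributions are fully separated: no ambiguous zone is needed.
        auto_accept = auto_reject = (auto_accept + auto_reject) / 2
    return {"auto_accept": auto_accept, "auto_reject": auto_reject}


# Scores quoted in this chapter for text-embedding-3-large.
known_matches = [0.451]                     # Charlie Crist pair
known_non_matches = [0.862, 0.6448, 0.581]  # Williams Jr, David Marshall pairs
print(calibrate(known_matches, known_non_matches))
```

With only a handful of labeled pairs the resulting boundaries are fragile, which is one reason a larger labeled set matters before any model swap.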

The embedding_model field stored in every L2 record ensures that thresholds can always be traced to the model that produced the scores. If a record was embedded with text-embedding-3-large and the thresholds were calibrated for a hypothetical election-embed-v1, the mismatch is detectable.

Summary

Property	text-embedding-3-large	MiniLM	Fine-tuned (future)
Dimensions	3,072	384	TBD
API required	Yes	No	Depends
Cost per 1M tokens	~$0.13	$0	$0 (local)
Williams Jr score	0.862	~0.91	Lower (trained)
Crist score	0.451	~0.38	Higher (trained)
Training data needed	No	No	Yes (not yet available)
Reproducible across machines	Yes	Requires version pinning	Requires version pinning

The current choice is text-embedding-3-large — good enough for the cascade to work, available today, and reproducible without local model management. The long-term path is a fine-tuned model trained on the L3 decision log. The thresholds, the cascade design, and the LLM confirmation step all exist to compensate for the general-purpose model’s known weaknesses until that fine-tuned model is ready.

Composite String Templates

Embeddings are not generated from raw candidate names. They are generated from composite strings that combine name components with contextual fields — office, state, county, party. This context helps the embedding model distinguish people who share a name but hold different offices in different states. It also introduces a failure mode: context bleed, where shared context artificially inflates similarity between unrelated candidates.

The Three Templates

Each L2 record generates up to three composite strings, one per embedding type:

Type	Template	Purpose
Candidate	`{canonical_first} {middle} {last} {suffix} | {party} | {office} | {state} | {county}`	Entity resolution across sources and elections
Contest	`{raw_name} | {office_level} | {state} {year}`	Contest entity resolution across naming variants
Geography	`{municipality}, {county} County, {state}`	Geographic entity resolution for precinct/place matching

The pipe character (|) is a deliberate separator. It signals to the tokenizer that the fields on either side are distinct semantic units, not a continuous phrase. Without separators, “Timothy Lance DEM” could be tokenized as a three-word name rather than a name followed by a party.

Real Composite Examples

Candidate	Composite String
Timothy Lance (NC, Columbus County school board)	`Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus`
Charlie Crist (FL, Governor, DEM)	`Charles Crist | DEM | Governor | FL | statewide`
CRIST, CHARLES JOSEPH (FL, Governor, DEM)	`Charles Joseph Crist | DEM | Governor | FL | statewide`
David S Marshall (ME, State Legislature)	`David S Marshall | | State Legislature | ME | statewide`
David A Marshall (FL, County Commission)	`David A Marshall | | County Commission | FL | Broward`

Note that canonical_first is used, not first. Charlie Crist’s composite uses Charles (from the nickname dictionary), not Charlie. This means the MEDSL record (CRIST, CHARLES JOSEPH → canonical_first Charles) and the OpenElections record (Charlie Crist → canonical_first Charles) produce composites with matching first-name tokens. The remaining divergence — Joseph as a middle name — is small enough that the embedding score rises significantly compared to the raw-name embedding.

Empty components produce empty slots. Timothy Lance has no middle name, no suffix, and no party in the NC SBE data. The composite retains the pipe separators with empty fields: Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus. This keeps the template structure consistent across all records, which stabilizes tokenization.
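
A sketch of the candidate template as a function. The name `candidate_composite` and its signature are hypothetical; the whitespace handling is chosen so that the output reproduces the composite strings quoted in this chapter.

```python
def candidate_composite(canonical_first, middle, last, suffix,
                        party, office, state, county):
    """Build the candidate composite; empty fields keep their pipe slot."""
    name = " ".join(part for part in (canonical_first, middle, last, suffix) if part)
    fields = [name, party or "", office or "", state or "", county or ""]
    # Joining on " | " leaves a double space around an empty field; collapse it.
    return " ".join(" | ".join(fields).split())


print(candidate_composite("Timothy", None, "Lance", None, None,
                          "BOARD OF EDUCATION DISTRICT 02", "NC", "Columbus"))
# Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus
print(candidate_composite("Charles", "Joseph", "Crist", None,
                          "DEM", "Governor", "FL", "statewide"))
# Charles Joseph Crist | DEM | Governor | FL | statewide
```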

Why Context Helps: The David Marshall Test

David S. Marshall ran for state legislature in Maine. David A. Marshall ran for county commission in Florida. They are different people. Without context, the embedding model sees two very similar strings.

We measured the effect of context on cosine similarity:

Composite A	Composite B	Cosine
`David Marshall`	`David Marshall`	1.000
`David Marshall | ME`	`David Marshall | FL`	0.7025
`David S Marshall | ME`	`David A Marshall | FL`	0.6448
`David S Marshall | | State Legislature | ME`	`David A Marshall | | County Commission | FL`	0.581

Each additional contextual field pushes the vectors further apart:

State alone drops similarity from 1.0 to 0.7025. The model encodes ME and FL as distinct tokens that pull the vectors in different directions.
Middle initial drops it further to 0.6448 — a 0.058 reduction. The single character S vs A produces measurably different vectors because it changes the token sequence before the separator.
Office context drops it to 0.581. “State Legislature” and “County Commission” are semantically distinct, adding another axis of divergence.

At 0.581, this pair falls well within the ambiguous zone (0.35–0.95) and is routed to the LLM, which correctly rejects the match based on different states, different offices, and different middle initials. Without context, the pair scores 1.0 — an automatic merge of two different people.

The middle-initial contribution (0.058) may seem small, but it matters at the margins. For pairs where state and office are the same — a father and son both serving on the same county commission — the middle initial may be the only signal distinguishing them.

Why Context Hurts: The Context Bleed Problem

Context is not free. Shared context tokens contribute to vector similarity even when the names themselves are unrelated. This is context bleed.

Consider two candidates in the same NC school district block:

Candidate	Composite
Aaron Bridges	`Aaron Bridges | | SCHOOL BOARD | NC | Columbus`
Daniel Blanton	`Daniel Blanton | | SCHOOL BOARD | NC | Columbus`

These are completely different people. But their composites share five context tokens: SCHOOL BOARD, NC, Columbus, and the pipe separators. The embedding model encodes these shared tokens into both vectors, producing a cosine similarity of approximately 0.55–0.65 — well above what the names alone would produce (~0.20) and squarely in the ambiguous zone.

In our prototype, all 30 wasted LLM calls were on pairs exactly like this: different people with different names whose shared context inflated their embedding scores into the ambiguous zone. The step 2.5 gate (JW on last names < 0.50 → skip) was added specifically to short-circuit these context-bleed false alarms before they reach the LLM.

Measuring the bleed

We tested context contribution by varying which fields are included:

Composite variant	Aaron Bridges vs Daniel Blanton	Cosine
Name only	`Aaron Bridges` vs `Daniel Blanton`	~0.21
Name + state	`Aaron Bridges | NC` vs `Daniel Blanton | NC`	~0.38
Name + state + office + county	Full composite	~0.60

In this example, adding the state raises the cosine score by about 0.17, and adding office and county raises it by roughly another 0.22. For same-name pairs (the cases entity resolution cares about), this boost is helpful: it confirms that two similar names in the same context are likely the same person. For different-name pairs, the same boost is harmful: it inflates scores past the reject threshold.

The step 2.5 gate resolves this asymmetry. If the names themselves are dissimilar (JW < 0.50 on last names), the context-inflated embedding score is irrelevant — the pair is skipped. If the names are similar (JW ≥ 0.50), the context inflation is welcome — it adds corroborating evidence that the similar names in the same context are the same person.
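
The gate and the thresholds it protects can be sketched together. This is an illustration under assumptions, not the pipeline's code: the Jaro-Winkler implementation is a textbook version (prefix scale 0.1, prefix capped at 4 characters), last names are uppercased before comparison, and the `Route` enum and `route` function are hypothetical names. Rejecting below 0.35 is our reading of the lower edge of the ambiguous zone.

```rust
/// Jaro similarity over Unicode scalar values.
fn jaro(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() || b.is_empty() {
        return if a.len() == b.len() { 1.0 } else { 0.0 };
    }
    // Characters count as matching only within this sliding window.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);
    let mut a_hit = vec![false; a.len()];
    let mut b_hit = vec![false; b.len()];
    let mut matches = 0usize;
    for i in 0..a.len() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_hit[j] && a[i] == b[j] {
                a_hit[i] = true;
                b_hit[j] = true;
                matches += 1;
                break;
            }
        }
    }
    if matches == 0 {
        return 0.0;
    }
    // Count matched characters that appear in a different order.
    let mut k = 0;
    let mut out_of_order = 0usize;
    for i in 0..a.len() {
        if a_hit[i] {
            while !b_hit[k] {
                k += 1;
            }
            if a[i] != b[k] {
                out_of_order += 1;
            }
            k += 1;
        }
    }
    let m = matches as f64;
    let t = (out_of_order / 2) as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - t) / m) / 3.0
}

/// Jaro-Winkler: Jaro plus a bonus for a shared prefix of up to 4 characters.
fn jaro_winkler(a: &str, b: &str) -> f64 {
    let j = jaro(a, b);
    let prefix = a.chars().zip(b.chars()).take(4).take_while(|(x, y)| x == y).count();
    j + prefix as f64 * 0.1 * (1.0 - j)
}

#[derive(Debug, PartialEq)]
enum Route {
    SkipNameGate, // step 2.5: last names too dissimilar
    AutoAccept,   // step 3: cosine >= 0.95
    AskLlm,       // step 4: ambiguous zone
    Reject,       // below the ambiguous zone (assumed behavior)
}

/// Applies the name gate first, then the embedding thresholds.
fn route(cosine: f64, last_a: &str, last_b: &str) -> Route {
    let jw = jaro_winkler(&last_a.to_uppercase(), &last_b.to_uppercase());
    if jw < 0.50 {
        Route::SkipNameGate
    } else if cosine >= 0.95 {
        Route::AutoAccept
    } else if cosine >= 0.35 {
        Route::AskLlm
    } else {
        Route::Reject
    }
}
```

With this implementation, Bridges vs Blanton scores about 0.49 on last names (one matching character plus a one-character prefix bonus), so the pair is skipped even at a context-inflated cosine of 0.60. Crist vs CRIST passes the gate and reaches the LLM at 0.451.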

Design Tradeoffs

Why not embed names without context?

Bare-name embeddings eliminate context bleed but lose the disambiguation power demonstrated by the David Marshall test. A bare “David Marshall” vs “David Marshall” scores 1.0 — the model cannot distinguish them at all. Context is the only mechanism the embedding model has to separate same-name, different-person pairs.

Why not use separate embeddings for name and context?

An alternative architecture: embed the name and context separately, then combine scores with weighted averaging. This eliminates context bleed (the name embedding is pure name similarity) while retaining context as a separate signal.

This approach is viable but adds complexity — two embeddings per record instead of one, a tunable weight parameter, and a more complex similarity function. The current single-composite design is simpler and works well with the step 2.5 gate mitigating the primary failure mode. If context bleed proves problematic at scale, split embeddings are a planned fallback.
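
A sketch of what the split-embedding score could look like, assuming the two similarities are already computed. The function name and the weight values are hypothetical; no weight has been tuned for this pipeline.

```rust
/// Weighted average of a name-only similarity and a context-only similarity.
/// `name_weight` is the tunable parameter and must lie in [0, 1].
fn combined_similarity(name_sim: f64, context_sim: f64, name_weight: f64) -> f64 {
    assert!(
        (0.0..=1.0).contains(&name_weight),
        "name_weight must be in [0, 1]"
    );
    name_weight * name_sim + (1.0 - name_weight) * context_sim
}
```

With an illustrative weight of 0.8, the Bridges vs Blanton pair (name similarity about 0.21, identical context at 1.0) would combine to roughly 0.37, well below the roughly 0.60 the single composite produces. That is the appeal of the design, at the cost of one more parameter to tune.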

Why not fine-tune?

A fine-tuned embedding model trained on election name pairs could learn that Charlie and Charles are similar, that Jr is categorically significant, and that shared context should not inflate scores for dissimilar names. We do not have training data yet.

However, L3 decisions are labeled examples: every LLM match/no-match decision with its confidence and reasoning is a training pair. As the pipeline processes more data, the L3 decision log becomes a natural training set for active learning. A fine-tuned model trained on thousands of L3 decisions would, in principle, learn the domain-specific similarity function that the general-purpose text-embedding-3-large approximates. This is a future direction, not a current capability.

Summary

Property	Effect	Mitigation
Context included	Distinguishes same-name, different-person pairs (David Marshall: 1.0 → 0.581)	— (this is the goal)
Context bleed	Inflates scores for different-name, same-context pairs (Bridges vs Blanton: 0.21 → 0.60)	Step 2.5 JW gate on last names
Middle initial included	Provides disambiguation signal (0.7025 → 0.6448)	— (this is the goal)
Nickname dictionary applied	Aligns canonical first names before embedding (Charlie → Charles)	— (this is the goal)

The composite template is a tradeoff between disambiguation power and noise tolerance. Context helps more than it hurts — but only because the step 2.5 gate exists to catch the cases where it hurts.

When the LLM Gets Called (And When It Doesn’t)

The LLM is a confirmation tool, not a discovery tool. It is called when cheaper methods have narrowed the problem to a specific, bounded question. It is never called when a deterministic method produces correct results.

This boundary is enforced by pipeline structure, not by discipline. L0 and L1 have no LLM code paths. L2 has none. The LLM is reachable only from L3 (entity resolution and tier 4 office classification) and L4 (entity auditing). A developer cannot accidentally add an LLM call to the parser — the parser runs at L1, which has no API client.

When the LLM Is Called

Three situations invoke the LLM. Each is a bounded question with structured input and a constrained output format.

1. Ambiguous Entity Matches (L3, Step 4)

Trigger: Embedding cosine similarity between 0.35 and 0.95 AND the name similarity gate passed (JW on last names ≥ 0.50) AND both candidates are in the same state.

Input: Structured name components for both candidates, embedding score, JW score, vote counts, office, state, party.

Output: match/no-match, confidence (0.0–1.0), free-text reasoning.

Model: Claude Sonnet.

Volume: 3.5% of candidate pairs in our prototype (30 calls out of ~850 comparisons). With the step 2.5 gate in place, this drops to near-zero for within-source matching and rises for cross-source matching where name formats diverge.

Real examples:

Pair	Cosine	LLM Decision	Why LLM was needed
Charlie Crist / CRIST, CHARLES JOSEPH	0.451	match (0.95)	Nickname below any safe auto-accept threshold
Robert Williams / Robert Williams Jr	0.862	no match (0.85)	Score is above the old auto-accept threshold; only the LLM catches the generational suffix
Nicole Fried / FRIED, NIKKI	0.642	match (0.92)	Nickname in ambiguous zone

2. Tier 4 Office Classification (L2→L3 boundary)

Trigger: Office name was not classified by keyword (tier 1), regex (tier 2), or embedding nearest-neighbor with cosine ≥ 0.60 (tier 3).

Input: Office name string, state, county, the full taxonomy of (office_level, office_branch) pairs.

Output: Classification pair, confidence (0.0–1.0), reasoning.

Model: Claude Sonnet.

Volume: ~0.5% of unique office names in MEDSL 2022 (~42 of 8,387). By record count, far less — these are the rarest, most obscure offices.

Real examples:

Office Name	State	LLM Classification	Confidence
Santa Rosa Island Authority	FL	special_district / infrastructure	0.90
Register of Mesne Conveyances	SC	county / judicial	0.88
Hog Reeve	NH	municipal / regulatory	0.60
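
The tier 4 trigger amounts to falling through the three cheaper tiers in order. A minimal sketch, assuming the keyword and regex results arrive as booleans and the tier 3 lookup reports its best cosine (the `select_tier` function and its inputs are illustrative; the variant names follow the `classifier_method` values used in the schema):

```rust
#[derive(Debug, PartialEq)]
enum ClassifierMethod {
    Keyword,
    Regex,
    Embedding,
    Llm,
}

/// Minimum nearest-neighbor cosine for tier 3 to accept a classification.
const TIER3_MIN_COSINE: f64 = 0.60;

/// Returns the first tier able to classify the office name.
/// `nearest_cosine` is None when no reference embedding was found.
fn select_tier(keyword_hit: bool, regex_hit: bool, nearest_cosine: Option<f64>) -> ClassifierMethod {
    if keyword_hit {
        return ClassifierMethod::Keyword;
    }
    if regex_hit {
        return ClassifierMethod::Regex;
    }
    match nearest_cosine {
        Some(c) if c >= TIER3_MIN_COSINE => ClassifierMethod::Embedding,
        _ => ClassifierMethod::Llm,
    }
}
```

Only the final arm involves an API call to a language model, which is why obscure offices like Hog Reeve are the only ones that reach it.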

3. L4 Entity Auditing

Trigger: An entity cluster contains records from multiple sources, multiple elections, or multiple office types. In the current design, every multi-member entity is audited (budget is not a constraint).

Input: The full entity cluster — canonical name, all aliases, all elections, all vote counts, all states, all offices.

Output: Plausibility assessment: plausible / suspicious / error, with reasoning.

Model: Claude Sonnet (Opus-class for flagged entities).

Volume: In the prototype, 50 entities were audited. The LLM flagged 43 as suspicious (precinct-level records inflating temporal chains — a bug in our aggregation, not in the data) and 4 as errors (“For” and “Against” classified as person entities). At production scale, the volume scales with the number of multi-member entities, not with total records.

When the LLM Is Not Called

Everything else. Specifically:

Operation	Layer	Method	Why not LLM
CSV/TSV/XML parsing	L1	Source-specific parser	Deterministic; format is fixed per source
Name decomposition	L1	Rule-based parser	Deterministic; name formats are enumerable
Nickname dictionary lookup	L1	Hash table	O(1) lookup; no reasoning needed
FIPS code enrichment	L1	Census reference table	Exact match on (state, county_name)
Vote share computation	L1	Arithmetic	Division is deterministic
Hash computation	L1–L4	SHA-256	Cryptographic function; no reasoning needed
Office classification (tiers 1–2)	L1	Keyword + regex	Deterministic; handles 62% of unique names
Office classification (tier 3)	L2	Embedding nearest-neighbor	Deterministic given model version; handles 4.5% more
Embedding generation	L2	OpenAI API	Deterministic given model version; not an LLM call
Exact name matching (step 1)	L3	Structured field equality	Handles 70% of entity resolution
Jaro-Winkler matching (step 2)	L3	String similarity	Deterministic; handles 0.1% more
Name gate (step 2.5)	L3	JW on last names	Eliminates obvious non-matches
High-confidence embedding match (step 3)	L3	Cosine ≥ 0.95	Auto-accept; no ambiguity to resolve
Canonical name selection	L4	Fixed algorithm	Most-complete + most-authoritative; no judgment needed
Temporal chain aggregation	L4	Group-by on (entity_id, election_date)	SQL-style aggregation
Hash chain verification	L4	SHA-256 recomputation	Cryptographic verification
Cross-source vote reconciliation	L4	Arithmetic comparison	Exact or percentage-based comparison

The Principle

If a deterministic method handles it, do not add LLM latency and non-determinism.

This is not a cost argument. Budget is not a constraint. It is an accuracy and reproducibility argument:

Deterministic methods do not hallucinate. SHA-256 always returns the same hash. FIPS lookup always returns the same code. An LLM might return a different FIPS code on a second call — not because it is wrong, but because it is probabilistic. For operations with known-correct deterministic solutions, adding an LLM is adding risk, not capability.
Deterministic methods are reproducible. Re-running L1 on the same L0 files with the same parser version produces bit-identical output. Re-running an LLM-based parser may produce different field values. For a pipeline that serves journalists and researchers who need to cite specific numbers, reproducibility is non-negotiable for the operations that support it.
Deterministic methods are fast. L1 processes 200 records in under a second. An LLM call takes 200–2,000ms. For the 70% of entity resolution handled by exact match and the 62% of office classification handled by keywords, the LLM adds latency with zero accuracy benefit.

The LLM is powerful. It correctly identified all 12 test pairs in entity resolution, including the Crist nickname case (0.451 cosine) that no threshold-based system could safely auto-resolve. It classified all 9 tier-4 office names correctly, including obscure offices like “Hog Reeve” that no reference set could anticipate.

But it is called only for the cases that need it: the 3.5% of entity comparisons in the ambiguous zone, the 0.5% of office names that no pattern matches, and the entity audit that catches contamination like ballot-measure choices misclassified as people. For everything else, the answer is already known — deterministically, reproducibly, and instantly.

Cross-References

Design Principles — “Deterministic first” as principle #1
L3: Matched — where LLM calls happen for entity resolution
The Four-Tier Classifier — where LLM calls happen for office classification
Budget Is Not a Constraint — why the cascade exists despite unlimited budget

Budget Is Not a Constraint — Speed and Reproducibility Are

This project has no API cost ceiling. Every LLM call that improves accuracy is worth making. This changes several design decisions compared to a cost-constrained pipeline — but it does not change the fundamental architecture. The cascade exists for speed and reproducibility, not for cost savings.

What Unlimited Budget Changes

Wider Ambiguous Zone

The embedding similarity thresholds for entity resolution were widened specifically because cost is not a constraint:

Parameter	Cost-constrained	Our design
Ambiguous zone	0.65–0.82	0.35–0.95
Zone width	0.17	0.60
Pairs reaching LLM	~5% of within-block pairs	~25% of within-block pairs

The wider zone sends roughly 5× more pairs to the LLM. At $0.0002 per call, the difference between 10,000 calls and 50,000 calls is $8. At production scale with millions of pairs, the difference might reach hundreds of dollars. Neither figure justifies accepting false positives (Williams Jr at 0.862) or false negatives (Crist at 0.451) that a narrower zone would cause.

Stronger Model for Tiebreakers

Step 5 of the entity resolution cascade escalates low-confidence LLM decisions (confidence < 0.70 from Claude Sonnet) to an Opus-class model. The stronger model costs approximately 10× more per call but is invoked only for the lowest-confidence subset of an already-small LLM cohort.

A cost-constrained pipeline would re-run the same Sonnet model or defer to human review. We use the stronger model because the marginal cost per call (~$0.002) is negligible and the accuracy gain on edge cases is measurable.
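
The escalation rule is small enough to sketch directly. The `LlmDecision` struct mirrors the match/confidence/reasoning output described for step 4; the `resolve_with_tiebreaker` function and the closure standing in for the stronger model are assumptions for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
struct LlmDecision {
    is_match: bool,
    confidence: f64,
    reasoning: String,
}

/// Decisions below this confidence are escalated (step 5).
const ESCALATION_THRESHOLD: f64 = 0.70;

/// Keeps the first decision when it is confident enough; otherwise asks the
/// stronger model, represented here by a caller-supplied closure.
fn resolve_with_tiebreaker<F>(first: LlmDecision, stronger_model: F) -> LlmDecision
where
    F: FnOnce() -> LlmDecision,
{
    if first.confidence < ESCALATION_THRESHOLD {
        stronger_model()
    } else {
        first
    }
}
```

Because the stronger model is passed as a closure, it is never invoked for decisions at or above 0.70, which keeps the expensive call confined to the lowest-confidence subset.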

Full L4 Entity Audit

The L4 LLM entity audit examines every multi-member entity, not a sample. In the prototype, 50 entities were audited; the audit flagged 43 as suspicious and 4 as errors. At production scale with tens of thousands of multi-member entities, full audit coverage means thousands of LLM calls.

A cost-constrained pipeline would sample 5–10% of entities and extrapolate. We audit 100% because the cost of missing a contaminated entity (ballot measure choices classified as people, precinct-level records inflating temporal chains) is higher than the cost of the API calls. The “For” and “Against” error was caught by the full audit — a 10% sample might have missed it.

Tier 4 Office Classification Without Hesitation

Every unclassified office name that survives tiers 1–3 goes to the LLM. There is no “batch the cheapest 80% and skip the rest” optimization. All ~42 hard cases in our prototype were classified. At national scale, the long tail of hyper-local office names (township-specific roles, water district sub-boards, tribal offices) may produce hundreds of tier 4 calls per election cycle. The cost is trivial; the coverage gain is not.

What Unlimited Budget Does Not Change

The Cascade Still Exists

Sending every candidate pair directly to the LLM — skipping exact match, Jaro-Winkler, the name gate, and embedding retrieval — would produce correct results for most pairs. It would also be impossibly slow.

At 42 million rows, even with aggressive blocking, the number of within-block candidate pairs runs into the millions. At 200ms per LLM API call, one million pairs take 55 hours of serial wall-clock time. With 10× parallelism, that is still 5.5 hours — for a single step that exact match handles in seconds for 70% of cases.
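
The wall-clock estimate is plain arithmetic, shown here as a small helper (the function name is ours; the inputs are the figures quoted above):

```rust
/// Hours needed to make `pairs` LLM calls at `ms_per_call` milliseconds each,
/// spread evenly across `parallelism` concurrent workers.
fn llm_wall_clock_hours(pairs: u64, ms_per_call: u64, parallelism: u64) -> f64 {
    assert!(parallelism > 0, "parallelism must be at least 1");
    let total_ms = pairs as f64 * ms_per_call as f64;
    total_ms / parallelism as f64 / 3_600_000.0
}
```

For one million pairs at 200ms, this gives about 55.6 hours serially and about 5.6 hours with 10 workers, assuming perfect scaling and no rate limiting.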

The cascade is not a cost optimization. It is a speed optimization. Steps 1–3 process 76% of pairs in under a millisecond each. The LLM is reserved for the 3.5% where cheap methods cannot decide.

Deterministic Steps Are Still Preferred

Exact match, Jaro-Winkler, keyword classification, regex classification, FIPS lookup, vote share computation, and hash verification are deterministic. They produce identical output from identical input on every run, on every machine, forever.

LLM calls are non-deterministic. The same pair submitted twice may produce different confidence scores (typically within ±0.05) and occasionally different reasoning text. The decision (match/no-match) is stable in >99% of re-runs, but “99% stable” is not “deterministic.”

For a pipeline that serves journalists citing specific numbers and researchers publishing reproducible analyses, determinism is not a preference — it is a requirement for the operations that support it. We use deterministic methods wherever they produce correct results, not because they are cheaper, but because they are trustworthy in a way that probabilistic methods are not.

LLMs Do Not Parse, Enrich, or Compute

No amount of budget makes it sensible to use an LLM for:

Parsing CSV/TSV/XML. The format is fixed per source. A parser handles it in microseconds with zero error rate.
FIPS lookup. A hash table lookup on (state, county_name) returns the correct code every time. An LLM might hallucinate a FIPS code — “37047” for Columbus County NC is correct, but there is no mechanism to verify the LLM’s output without the same lookup table that makes the LLM unnecessary.
SHA-256 computation. Cryptographic hash functions are mathematical operations. An LLM cannot compute them.
Vote share arithmetic. 303 / 580 = 0.5224. A calculator is correct. An LLM might round differently, truncate, or occasionally hallucinate.

These operations have known-correct deterministic solutions. Adding an LLM to any of them introduces risk with zero benefit, regardless of budget.

Reproducibility Requires Logged Decisions

Every LLM decision at L3 and L4 is stored in a JSONL audit log with the full prompt, response, confidence, and reasoning. Replaying from the log does avoid re-calling the LLM, but saving money is not the purpose. The log is a reproducibility measure: a researcher who wants to verify or contest a match decision can read the log, see the LLM’s reasoning, and evaluate whether the decision was correct.

If budget were infinite and API calls were instantaneous, we would still log every decision. The log is not a cache — it is the canonical record of how the pipeline resolved ambiguity. Deleting the log and re-running the LLM would produce a slightly different set of confidence scores, which might shift a small number of borderline decisions, which would change downstream entity assignments. The log prevents this drift.
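
A minimal in-memory sketch of the replay behavior, assuming decisions are keyed by a string identifying the pair. The real log is JSONL on disk with the full prompt and response; the `DecisionLog` type, the key format, and the closure standing in for the LLM call are all hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct LoggedDecision {
    is_match: bool,
    confidence: f64,
    reasoning: String,
}

/// Canonical record of how each ambiguous pair was resolved.
struct DecisionLog {
    entries: HashMap<String, LoggedDecision>,
}

impl DecisionLog {
    fn new() -> Self {
        DecisionLog { entries: HashMap::new() }
    }

    /// Returns the logged decision for `pair_key` if one exists. Only on a
    /// miss is `ask_llm` called, and its answer is recorded before returning.
    fn decide<F>(&mut self, pair_key: &str, ask_llm: F) -> LoggedDecision
    where
        F: FnOnce() -> LoggedDecision,
    {
        self.entries
            .entry(pair_key.to_string())
            .or_insert_with(ask_llm)
            .clone()
    }
}
```

A second run over the same pair returns the first run's decision unchanged, even if a fresh LLM call would have produced a slightly different confidence. That is the drift the log is meant to prevent.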

The Real Constraints

Budget is not a constraint. The real constraints are:

Constraint	Effect on design
Wall-clock time	The cascade exists because LLM calls at scale take hours; exact match takes seconds
Reproducibility	Deterministic methods preferred; LLM decisions logged for replay
Accuracy	Wider ambiguous zone, stronger tiebreaker model, full audit coverage
Auditability	Every decision logged with reasoning; hash chain from L4 to L0
Correctness	Deterministic methods used wherever they produce correct results; LLMs used only for genuine ambiguity

A budget-constrained version of this pipeline would narrow the ambiguous zone, sample the entity audit, skip tier 4 office classification for rare offices, and use the same model for tiebreakers. All of these are accuracy trade-offs. We make none of them.

The cascade’s structure — exact match → JW → gate → embedding → LLM → tiebreaker — is identical whether the budget is $10 or $10,000. The thresholds move. The model choices change. The architecture does not.

Schema Overview

The unified schema defines the structure of every election record at every pipeline layer. A single record represents one candidate’s (or one ballot measure choice’s) vote count in one geographic unit for one contest. All sources — MEDSL, NC SBE, OpenElections, VEST, Clarity — are normalized into this schema at L1. Subsequent layers (L2–L4) add fields but never remove them.

A record has seven sections: election, jurisdiction, contest, results, turnout, source, and provenance. Not every field is populated for every record. Fields that the source does not provide are null, not inferred.

Election

Identifies which election this record belongs to.

Field	Type	Description	Example
`date`	date	Election date (ISO 8601)	`2022-11-08`
`year`	integer	Election year, derived from `date`	`2022`
`type`	ElectionType	General, primary, runoff, special, etc.	`General`
`stage`	string	Source-provided stage code	`GEN`
`special`	boolean	Whether this is a special election	`false`
`certification_status`	string	Certified, unofficial, or unknown	`certified`

The type field is an enum — see Enumerations Reference. The stage field preserves the raw source value (MEDSL uses GEN/PRI/RUN; NC SBE does not have a stage column). The certification_status field reflects whether the source data represents certified results. NC SBE and MEDSL publish certified data. Clarity publishes unofficial election night results that may be updated.

Jurisdiction

Identifies the geographic unit where votes were counted.

Field	Type	Description	Example
`state`	string	Full state name	`North Carolina`
`state_po`	string	Two-letter postal code	`NC`
`state_fips`	string	Two-digit state FIPS code	`37`
`county`	string	County name (may be null for statewide)	`Wake`
`county_fips`	string	Five-digit county FIPS code	`37183`
`precinct`	string	Precinct name or code from the source	`01-01`
`precinct_code`	string	Numeric precinct code (NC SBE only)	`0101`
`jurisdiction_name`	string	Jurisdiction name from MEDSL	`WAKE`
`jurisdiction_fips`	string	Jurisdiction FIPS from MEDSL	`37183`
`ocd_id`	string	Open Civic Data identifier (when available)	`ocd-division/country:us/state:nc/county:wake`
`level`	JurisdictionLevel	Geographic granularity of this record	`Precinct`

The county_fips field is the primary geographic join key across sources. It is enriched from Census FIPS reference files at L1 when the source provides a county name but no code. The ocd_id field is populated when a mapping exists; it is null for most records today.
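
The enrichment step is an exact-match table lookup. A sketch, assuming the Census reference file has already been loaded into (state, county, code) rows and that keys are normalized by trimming and uppercasing (the function names and the normalization rule are illustrative):

```rust
use std::collections::HashMap;

type FipsTable = HashMap<(String, String), String>;

/// Builds the lookup table from (state_po, county_name, county_fips) rows.
fn build_fips_table(rows: &[(&str, &str, &str)]) -> FipsTable {
    rows.iter()
        .map(|(state, county, fips)| {
            (
                (state.trim().to_uppercase(), county.trim().to_uppercase()),
                fips.to_string(),
            )
        })
        .collect()
}

/// Returns the five-digit county FIPS code, or None when there is no exact
/// match. A miss stays null; nothing is guessed.
fn county_fips(table: &FipsTable, state_po: &str, county: &str) -> Option<String> {
    let key = (state_po.trim().to_uppercase(), county.trim().to_uppercase());
    table.get(&key).cloned()
}
```

With rows for Wake (`37183`) and Columbus (`37047`), a source record carrying only `NC` and `Wake` is enriched to `37183`, and a county missing from the table yields no code rather than a best guess.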

The level field indicates what geographic unit this row represents. Most records are Precinct. Some sources provide only county-level aggregates (County). VEST data with precinct boundaries is Precinct with accompanying geometry.

Contest

Describes the race or ballot measure.

Field	Type	Description	Example
`kind`	ContestKind	`CandidateRace`, `BallotMeasure`, or `TurnoutMetadata`	`CandidateRace`
`raw_name`	string	Contest name exactly as it appears in the source	`CABARRUS COUNTY SCHOOLS BOARD OF EDUCATION`
`normalized_name`	string	Cleaned contest name (L1+)	`Cabarrus County Schools Board of Education`
`office_level`	OfficeLevel	Federal, state, county, municipal, etc.	`County`
`office_category`	OfficeCategory	Executive, legislative, judicial, school board, etc.	`SchoolBoard`
`district`	string	District number or name (blank if at-large)	`DISTRICT 02`
`dataverse`	string	MEDSL’s race level tag (blank for local)	``
`classifier_method`	ClassifierMethod	How office_level and office_category were assigned	`Keyword`
`vote_for`	integer	Maximum number of candidates a voter may select	`1`
`magnitude`	integer	Number of seats being filled	`3`
`is_retention`	boolean	Whether this is a judicial retention election	`false`

The kind field is an enum with three variants — see Contest Kinds. The distinction between CandidateRace, BallotMeasure, and TurnoutMetadata is determined at L1 based on the contest name and choice values.

The classifier_method field records how the office_level and office_category were assigned: Keyword (deterministic string match, 62% of records), Regex (pattern-based, ~15%), Embedding (nearest-neighbor at L2), or Llm (LLM classification at L3). This field exists so that users can filter by classification confidence.

The vote_for field comes from NC SBE’s Vote For column. MEDSL does not provide this field. When unavailable, it defaults to null. The magnitude field comes from MEDSL’s magnitude column and indicates multi-member districts.

Results

An array of candidate results attached to the contest. For a CandidateRace, each element is one candidate. For a BallotMeasure, each element is one choice (e.g., “For”, “Against”). For TurnoutMetadata, the results array is empty.

Field	Type	Description	Example
`candidate_name`	CandidateName	Decomposed name — see below	(see Name Components)
`party_raw`	string	Party label exactly as source provides	`LIBERTARIAN`
`party_simplified`	PartySimplified	Normalized party enum	`Libertarian`
`votes_total`	integer	Total votes for this candidate in this precinct	`90`
`vote_share`	float	Fraction of total contest votes (computed)	`0.023`
`writein`	boolean	Whether this is a write-in candidate	`false`
`incumbent`	boolean	Whether this candidate is the incumbent (if known)	`null`
`vote_counts_by_type`	VoteCountsByType	Breakdown by vote method — see below	(see below)

CandidateName

Names are decomposed into components rather than stored as a single string. This is documented in detail in Candidate Name Components.

Field	Type	Description	Example
`raw`	string	Name exactly as it appears in the source	`MICHAEL "STEVE" HUBER`
`first`	string	Parsed first name	`Michael`
`middle`	string	Parsed middle name or initial	`null`
`last`	string	Parsed last name	`Huber`
`suffix`	string	Jr, Sr, II, III, IV, etc.	`null`
`nickname`	string	Detected nickname	`Steve`
`canonical_first`	string	Nickname-resolved first name	`Stephen`

The raw field is preserved at every layer and never modified. The component fields are populated at L1 during name parsing. The canonical_first field is populated at L1 using the nickname dictionary (e.g., Charlie→Charles, Steve→Stephen, Pat→Patricia). All fields are available at every pipeline layer.
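
The nickname step can be sketched as a case-insensitive hash lookup. The three entries below are the ones named above; the real dictionary is larger. Returning the input unchanged when there is no entry is our assumption about how non-nicknames are handled.

```rust
use std::collections::HashMap;

/// A tiny stand-in for the nickname dictionary, keyed by lowercase nickname.
fn nickname_dictionary() -> HashMap<&'static str, &'static str> {
    [("charlie", "Charles"), ("steve", "Stephen"), ("pat", "Patricia")]
        .iter()
        .cloned()
        .collect()
}

/// Resolves a first-name token to its canonical form. Names without a
/// dictionary entry are returned trimmed but otherwise unchanged.
fn canonical_first(dict: &HashMap<&'static str, &'static str>, first: &str) -> String {
    let key = first.trim().to_lowercase();
    match dict.get(key.as_str()) {
        Some(canonical) => canonical.to_string(),
        None => first.trim().to_string(),
    }
}
```

Because the lookup is a pure function of the token and the dictionary version, re-running L1 yields the same `canonical_first` every time.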

VoteCountsByType

When the source provides vote mode breakdowns, they are stored here. NC SBE provides all four fields for every contest. MEDSL provides them when modes are split into separate rows (summed during L1). Most other sources provide only the total.

Field	Type	Description	Example
`election_day`	integer	Election day votes	`136`
`early`	integer	Early / one-stop votes	`159`
`absentee_mail`	integer	Mail-in absentee votes	`7`
`provisional`	integer	Provisional ballot votes	`1`

NC SBE calls early voting “One Stop.” MEDSL calls it “EARLY VOTING.” Both are mapped to the early field at L1.

Turnout

Voter registration and participation counts for the geographic unit. These fields are sparsely populated — less than 5% of records have values.

Field	Type	Description	Example
`registered_voters`	integer	Number of registered voters in this precinct	`2847`
`ballots_cast`	integer	Total ballots cast in this precinct	`1893`
`turnout_pct`	float	`ballots_cast / registered_voters` (computed)	`0.665`

NC SBE provides registered_voters via “Registered Voters” pseudo-contest rows. These are extracted during L1 parsing and attached to the precinct’s turnout object. MEDSL rarely includes registration counts. Most records have null turnout.
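
The computed `turnout_pct` follows the same null-not-inferred rule as the rest of the schema. A sketch (the function name and the handling of a zero registration count are assumptions):

```rust
/// Computes ballots_cast / registered_voters. Returns None when either
/// input is missing or when the registration count is zero.
fn turnout_pct(ballots_cast: Option<u64>, registered_voters: Option<u64>) -> Option<f64> {
    match (ballots_cast, registered_voters) {
        (Some(cast), Some(registered)) if registered > 0 => {
            Some(cast as f64 / registered as f64)
        }
        _ => None,
    }
}
```

For the example row above, 1893 ballots over 2847 registered voters gives approximately 0.665.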

Source

Fields that document where this record came from.

Field	Type	Description	Example
`source_type`	SourceType	Enum identifying the source system	`Medsl`
`source_file`	string	Filename of the L0 artifact	`2022-nc-local-precinct-general.csv`
`source_row`	integer	Row number in the source file	`14523`
`retrieval_date`	datetime	When the source file was downloaded (UTC)	`2025-01-15T03:22:00Z`
`confidence`	Confidence	`High`, `Medium`, or `Low`	`Medium`
`raw_fields`	SourceRawFields	All original columns from the source, typed per source	(see below)

SourceRawFields

The raw_fields object preserves every column from the original source row, typed as an enum per source. This ensures no information is lost during normalization.

Variant	Source	Fields preserved
`MedslRawRecord`	MEDSL	All 25 MEDSL columns including `state_cen`, `state_ic`, `readme_check`, `version`
`NcsbeRawRecord`	NC SBE	All 15 NC SBE columns including `Contest Group ID`, `Contest Type`, `Real Precinct`
`OpenElectionsRawRecord`	OpenElections	Variable columns depending on state file
`VestRawRecord`	VEST	Encoded column names and geometry reference
`ClarityRawRecord`	Clarity	XML element attributes
`FecRawRecord`	FEC	All 15 `cn.txt` columns
`CensusRawRecord`	Census	FIPS file columns

Each variant is a struct with typed fields matching the source schema. This is a Rust enum, not a JSON object — the type system ensures you cannot accidentally read an NC SBE field from a MEDSL record. See Type System Design.

Provenance

Hash chain and version metadata that enable verification and reproducibility.

Field	Type	Description	Example
`record_id`	string	Deterministic hash of (source, file, row)	`a3f8c2...`
`l1_hash`	string	SHA-256 hash of this L1 record’s content	`7b2e91...`
`l0_parent_hash`	string	SHA-256 hash of the L0 source artifact	`c4d1f0...`
`l0_byte_offset`	integer	Byte offset in the L0 file where this row starts	`1048576`
`parser_version`	string	Version of the parser that produced this record	`0.1.0`
`schema_version`	string	Version of the schema this record conforms to	`1.0.0`

The hash chain links every record back to the original source bytes. If the L1 record is modified, its l1_hash changes and no longer matches the hash stored in any L2 record that references it. The verification algorithm at L4 checks the full chain: L4 → L3 → L2 → L1 → L0 → source bytes.

The record_id is deterministic: identical source input always produces the same record_id. This enables deduplication and makes re-processing idempotent.
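
To illustrate the determinism property only, the sketch below derives an identifier from (source, file, row) using FNV-1a. FNV-1a is a stand-in chosen because the Rust standard library has no SHA-256; it is not the hash the pipeline uses. The unit-separator delimiter and the function names are also assumptions.

```rust
/// 64-bit FNV-1a. A stand-in for illustration; not a cryptographic hash.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for byte in bytes {
        hash ^= *byte as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Derives an identifier from (source, file, row). The fields are joined
/// with the ASCII unit separator so ("ab", "c") and ("a", "bc") differ.
fn record_id(source: &str, file: &str, row: u64) -> String {
    let key = format!("{}\u{1f}{}\u{1f}{}", source, file, row);
    format!("{:016x}", fnv1a64(key.as_bytes()))
}
```

Processing the same source row twice yields the same identifier, so a re-run can detect and skip records it has already produced.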

Layer-Specific Additions

Each pipeline layer adds fields to the record. The base schema (above) is fully populated at L1. Subsequent layers extend it:

Layer	Fields added
L2 (Embedded)	`candidate_name_embedding`, `contest_name_embedding`, `jurisdiction_embedding`, `embedding_model`, `embedding_version`
L3 (Matched)	`candidate_cluster_id`, `contest_cluster_id`, `match_confidence`, `match_method`
L4 (Canonical)	`canonical_candidate_name`, `canonical_contest_name`, `temporal_chain_id`, `verification_status`, `alias_table`

L1 records are self-contained. L2+ records reference their parent layer’s hash. No fields from earlier layers are removed or overwritten — each layer is additive.

JSONL Representation

At every layer, records are serialized as one JSON object per line (JSONL). The seven sections are top-level keys:

{"election":{"date":"2022-11-08","year":2022,"type":"General",...},"jurisdiction":{"state":"North Carolina","state_po":"NC",...},"contest":{"kind":"CandidateRace","raw_name":"CABARRUS COUNTY SCHOOLS BOARD OF EDUCATION",...},"results":[{"candidate_name":{"raw":"GREG MILLS","first":"Greg","last":"Mills",...},"votes_total":79,...}],"turnout":null,"source":{"source_type":"Medsl","source_file":"2022-nc-local-precinct-general.csv",...},"provenance":{"record_id":"a3f8c2...","l1_hash":"7b2e91...",...}}

Files are streamable: each line is a complete record. Files are appendable: new records can be concatenated without modifying existing lines. Serialization uses serde_json in Rust. See Output Formats.

Contest Kinds: CandidateRace, BallotMeasure, TurnoutMetadata

Every record in the pipeline belongs to exactly one of three contest kinds. This is modeled as a type-level enum — not a string field — so that invalid combinations are rejected at compile time rather than discovered at query time.

Why three kinds

Election data files mix three fundamentally different things in the same tabular format:

A candidate running for office and receiving votes.
A ballot measure (bond, referendum, constitutional amendment) where voters choose “Yes” or “No.”
A metadata row recording registered voters or ballots cast for a precinct, masquerading as a contest.

Sources do not distinguish these. MEDSL puts REGISTERED VOTERS in the office column as if it were a race. NC SBE creates a “contest” called Registered Voters - Total with a “candidate” whose vote count is actually the registration total. Florida OpenElections has 6,013 rows where office = "Registered Voters" — 67.9% of all non-candidate records in the initial FL load.

If these are not separated at parse time, downstream analysis produces nonsense: “Registered Voters” appears as the most popular candidate in America, “For” shows up as a person’s name in entity resolution, and vote totals are inflated by turnout metadata.
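
A rough sketch of the parse-time separation, using only the contest name and the choice values. The marker strings (`REGISTERED VOTERS`, `BALLOTS CAST`) and the choice list are assumptions drawn from the examples in this chapter, and the fieldless `Kind` enum is a simplification for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Kind {
    CandidateRace,
    BallotMeasure,
    TurnoutMetadata,
}

/// Classifies a contest from its name and the "candidate" strings under it.
fn detect_kind(contest_name: &str, choices: &[&str]) -> Kind {
    let name = contest_name.to_uppercase();
    // Pseudo-contests that carry registration or turnout counts.
    if name.contains("REGISTERED VOTERS") || name.contains("BALLOTS CAST") {
        return Kind::TurnoutMetadata;
    }
    // A contest is a ballot measure when every choice is a yes/no style option.
    let is_measure_choice = |choice: &&str| {
        matches!(
            choice.trim().to_uppercase().as_str(),
            "FOR" | "AGAINST" | "YES" | "NO"
        )
    };
    if !choices.is_empty() && choices.iter().all(is_measure_choice) {
        return Kind::BallotMeasure;
    }
    Kind::CandidateRace
}
```

Under these rules a contest whose only choices are For and Against never reaches entity resolution as a person, which is the failure the L4 audit caught in the prototype.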

The enum

enum ContestKind {
    CandidateRace {
        results: Vec<CandidateResult>,
    },
    BallotMeasure {
        choices: Vec<BallotChoice>,
        measure_type: BallotMeasureType,
        passage_threshold: Option<f64>,
    },
    TurnoutMetadata {
        registered_voters: Option<u64>,
        ballots_cast: Option<u64>,
    },
}

Each variant carries different fields. You cannot accidentally attach a candidate_name to a ballot measure or a passage_threshold to a candidate race.

CandidateRace

The common case. A person is running for an office and received votes.

Field	Type	Description
`results`	`Vec<CandidateResult>`	One entry per candidate in the contest

Each CandidateResult contains:

Field	Type	Description
`candidate_name`	`CandidateName`	Decomposed name (raw, first, middle, last, suffix, nickname, canonical_first)
`party`	`Party`	Raw string + normalized enum
`votes_total`	`u64`	Total votes received
`vote_share`	`Option<f64>`	Percentage of total contest votes
`vote_counts_by_type`	`VoteCountsByType`	Breakdown: election_day, early, absentee_mail, provisional

Examples of CandidateRace contests:

US SENATE — federal
GOVERNOR — state
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 — local
SHERIFF — county

BallotMeasure

Voters choose between options (typically “For”/“Against” or “Yes”/“No”) on a proposition, bond, amendment, or referendum.

Field	Type	Description
`choices`	`Vec<BallotChoice>`	One entry per option
`measure_type`	`BallotMeasureType`	Bond, amendment, referendum, etc.
`passage_threshold`	`Option<f64>`	Required vote share for passage (e.g., 0.60 for a bond requiring 60%)

Each BallotChoice contains:

Field	Type	Description
`choice_text`	`String`	“For”, “Against”, “Yes”, “No”, or other option text
`votes_total`	`u64`	Votes for this choice
`vote_share`	`Option<f64>`	Percentage of total votes

The BallotMeasureType enum: BondIssue, LevyRenewal, LevyNew, ConstitutionalAmendment, CharterAmendment, Referendum, Initiative, Recall, Other. See Enumerations Reference for the description of each value.

Why this prevents name confusion

Without the BallotMeasure variant, the L1 parser would treat “For” and “Against” as candidate names. They would flow into entity resolution at L3, where the system would try to find other elections where “For” ran for office. By assigning ballot measures to their own variant at parse time, the choice_text field is never passed to the name decomposition or embedding logic.

Detection at L1 uses two signals:

The contest name contains keywords: “bond”, “amendment”, “referendum”, “proposition”, “measure”, “levy”, “question”.
The choice values are in the set {“For”, “Against”, “Yes”, “No”, “Bonds”, “No Bonds”}.

TurnoutMetadata

Not a contest at all. These rows carry precinct-level registration and turnout counts that sources embed in the results file as pseudo-contests.

Field	Type	Description
`registered_voters`	`Option<u64>`	Registered voter count for this precinct
`ballots_cast`	`Option<u64>`	Total ballots cast in this precinct

Source examples that produce TurnoutMetadata records:

Source	`office` / `Contest Name` value	`candidate` / `Choice` value
MEDSL	`REGISTERED VOTERS`	`REGISTERED VOTERS`
MEDSL	`BALLOTS CAST - TOTAL`	`BALLOTS CAST`
NC SBE	`Registered Voters - Total`	(numeric total in vote column)
OpenElections FL	`Registered Voters`	(numeric total)

Detection at L1: the contest name matches a known set of turnout keywords (REGISTERED VOTERS, BALLOTS CAST, BALLOTS CAST - TOTAL, BALLOTS CAST - BLANK). When detected, the vote count is extracted into registered_voters or ballots_cast, and the record is tagged as TurnoutMetadata rather than CandidateRace.

These extracted turnout values backfill the turnout section of other records in the same precinct. Currently, turnout data is populated for less than 5% of records because most MEDSL state files do not include registration count rows.

Classification at L1

Contest kind assignment happens during L1 parsing — the deterministic layer. No ML, no embeddings, no API calls. The decision tree:

Does the contest name match a turnout keyword? → TurnoutMetadata
Do the choice values match ballot measure patterns (“For”/“Against”/“Yes”/“No”/“Bonds”/“No Bonds”)? → BallotMeasure
Does the contest name contain ballot measure keywords? → BallotMeasure
Otherwise → CandidateRace
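The decision tree above can be sketched as a single function. This is an illustration under stated assumptions, not the crate's code: the Kind enum is a payload-free stand-in for ContestKind, the function name classify is hypothetical, and keyword matching is a naive substring test that a real parser would likely tighten.

```rust
/// Payload-free stand-in for ContestKind (assumption for this sketch).
#[derive(Debug, PartialEq)]
enum Kind {
    TurnoutMetadata,
    BallotMeasure,
    CandidateRace,
}

const TURNOUT_PREFIXES: [&str; 2] = ["REGISTERED VOTERS", "BALLOTS CAST"];
const MEASURE_KEYWORDS: [&str; 7] = [
    "BOND", "AMENDMENT", "REFERENDUM", "PROPOSITION", "MEASURE", "LEVY", "QUESTION",
];
const MEASURE_CHOICES: [&str; 6] = ["FOR", "AGAINST", "YES", "NO", "BONDS", "NO BONDS"];

/// Hypothetical classifier following the four steps in order.
fn classify(contest_name: &str, choices: &[&str]) -> Kind {
    let name = contest_name.trim().to_uppercase();

    // Step 1: turnout pseudo-contests.
    if TURNOUT_PREFIXES.iter().any(|p| name.starts_with(*p)) {
        return Kind::TurnoutMetadata;
    }

    // Step 2: every choice value is a known ballot-measure option.
    let only_measure_choices = !choices.is_empty()
        && choices.iter().all(|c| {
            let upper = c.trim().to_uppercase();
            MEASURE_CHOICES.contains(&upper.as_str())
        });
    if only_measure_choices {
        return Kind::BallotMeasure;
    }

    // Step 3: the contest name contains a ballot-measure keyword.
    if MEASURE_KEYWORDS.iter().any(|k| name.contains(*k)) {
        return Kind::BallotMeasure;
    }

    // Step 4: everything else is a candidate race.
    Kind::CandidateRace
}
```

The order matters: a turnout row is caught before its "candidate" value can be mistaken for a choice or a name.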

This classification is stored in the record and carried through all subsequent layers. L2 embeds only CandidateRace records for entity resolution. L3 matches only CandidateRace records. BallotMeasure and TurnoutMetadata records pass through L2–L4 without modification beyond provenance tracking.

Candidate Name Components

Election data sources represent candidate names as a single string. The formats are incompatible across sources — and sometimes within the same source across years. The pipeline decomposes every name into structured components at L1 and preserves all components through every subsequent layer.

Why decomposition instead of a single string

A single name field cannot support entity resolution. Consider matching these records:

Source	Raw name string
MEDSL	`SHANNON W BRAY`
NC SBE	`Shannon W. Bray`
FEC	`BRAY, SHANNON W`

String equality fails on all three pairs. Lowercasing and stripping punctuation makes the MEDSL and NC SBE strings identical, but FEC’s last-first ordering still breaks the comparison. Decomposing into {first: Shannon, middle: W, last: Bray} makes all three identical after normalization.

The harder case is nicknames:

Source	Raw name string	What a human sees
MEDSL	`MICHAEL "STEVE" HUBER`	First name Michael, goes by Steve
NC SBE	`Michael (Steve) Huber`	Same person
OpenElections	`Steve Huber`	Same person, nickname only

Without decomposition, matching Steve Huber to MICHAEL "STEVE" HUBER requires the system to know that Steve is a nickname present in one variant but used as the primary name in another. The nickname and canonical_first fields make this explicit.

Component fields

Every candidate name in the pipeline is represented as a struct with seven fields:

Field	Type	Description	Populated at
`raw`	`String`	Original name string exactly as it appeared in the source. Never modified.	L1
`first`	`Option<String>`	Parsed first name	L1
`middle`	`Option<String>`	Parsed middle name or initial	L1
`last`	`Option<String>`	Parsed last name	L1
`suffix`	`Option<String>`	Generational suffix: Jr, Sr, II, III, IV	L1
`nickname`	`Option<String>`	Detected nickname, extracted from quotes or parentheses	L1
`canonical_first`	`Option<String>`	Nickname-resolved first name. If `first` has a known nickname mapping, this holds the canonical form.	L1

All fields are available at every layer (L1 through L4). Later layers may refine values but never discard earlier ones.

Parsing rules by source

MEDSL

Names are ALL CAPS, no periods after initials, nicknames in double quotes, suffixes without commas.

Raw	first	middle	last	suffix	nickname	canonical_first
`SHANNON W BRAY`	`Shannon`	`W`	`Bray`	—	—	`Shannon`
`MICHAEL "STEVE" HUBER`	`Michael`	—	`Huber`	—	`Steve`	`Michael`
`ROBERT VAN FLETCHER JR`	`Robert`	`Van`	`Fletcher`	`Jr`	—	`Robert`
`LM "MICKEY" SIMMONS`	`L`	`M`	`Simmons`	—	`Mickey`	`L`
`VICTORIA P PORTER`	`Victoria`	`P`	`Porter`	—	—	`Victoria`
`WRITEIN`	—	—	—	—	—	—

WRITEIN is a sentinel value, not a person name. It is flagged at L1 and excluded from name decomposition.
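The MEDSL rules above can be sketched as a small parser. This is a simplified illustration, not the crate's parser: ParsedName is a hypothetical stand-in for CandidateName without raw and canonical_first, and the sketch does not split fused initials such as LM into L and M.

```rust
/// Hypothetical stand-in for CandidateName (raw and canonical_first omitted).
#[derive(Debug, Default, PartialEq)]
struct ParsedName {
    first: Option<String>,
    middle: Option<String>,
    last: Option<String>,
    suffix: Option<String>,
    nickname: Option<String>,
}

const SUFFIXES: [&str; 5] = ["JR", "SR", "II", "III", "IV"];

/// "SHANNON" -> "Shannon".
fn title_case(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        Some(first) => first
            .to_uppercase()
            .chain(chars.flat_map(|c| c.to_lowercase()))
            .collect(),
        None => String::new(),
    }
}

fn parse_medsl_name(raw: &str) -> ParsedName {
    let mut name = ParsedName::default();
    // Sentinel value: every component stays None.
    if raw.trim() == "WRITEIN" {
        return name;
    }
    // A token wrapped in double quotes is the nickname; the rest are name parts.
    let mut tokens: Vec<&str> = Vec::new();
    for token in raw.split_whitespace() {
        if token.len() >= 2 && token.starts_with('"') && token.ends_with('"') {
            name.nickname = Some(title_case(&token[1..token.len() - 1]));
        } else {
            tokens.push(token);
        }
    }
    // A trailing suffix has no comma in MEDSL. Roman numerals keep their case.
    if tokens.len() > 1 && SUFFIXES.contains(&tokens[tokens.len() - 1]) {
        let suffix = tokens.pop().unwrap();
        name.suffix = Some(match suffix {
            "JR" | "SR" => title_case(suffix),
            roman => roman.to_string(),
        });
    }
    // Last token is the last name, first token the first name,
    // anything in between is the middle name or initial.
    name.last = tokens.pop().map(title_case);
    if !tokens.is_empty() {
        name.first = Some(title_case(tokens.remove(0)));
    }
    if !tokens.is_empty() {
        let parts: Vec<String> = tokens.iter().map(|t| title_case(t)).collect();
        name.middle = Some(parts.join(" "));
    }
    name
}
```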

NC SBE

Names are Title Case, periods after initials, nicknames in parentheses, commas before suffixes.

Raw	first	middle	last	suffix	nickname	canonical_first
`Shannon W. Bray`	`Shannon`	`W`	`Bray`	—	—	`Shannon`
`Michael (Steve) Huber`	`Michael`	—	`Huber`	—	`Steve`	`Michael`
`Robert Van Fletcher, Jr.`	`Robert`	`Van`	`Fletcher`	`Jr`	—	`Robert`
`Patricia (Pat) Cotham`	`Patricia`	—	`Cotham`	—	`Pat`	`Patricia`
`William Irvin. Enzor III`	`William`	`Irvin`	`Enzor`	`III`	—	`William`

The period after “Irvin.” in the last example is a data entry artifact. The parser strips trailing periods from middle names.

FEC

Names are LAST, FIRST MIDDLE format, all caps.

Raw	first	middle	last	suffix	nickname	canonical_first
`BRAY, SHANNON W`	`Shannon`	`W`	`Bray`	—	—	`Shannon`
`BIDEN, JOSEPH R JR`	`Joseph`	`R`	`Biden`	`Jr`	—	`Joseph`
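The FEC ordering can be handled by splitting on the first comma before applying the same token logic. A minimal sketch, assuming the two rows above are representative; FecName and parse_fec_name are hypothetical names, and nickname handling is omitted because the table shows none.

```rust
/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq)]
struct FecName {
    first: Option<String>,
    middle: Option<String>,
    last: String,
    suffix: Option<String>,
}

/// Title-case every whitespace-separated word: "VAN HOLLEN" -> "Van Hollen".
fn title_words(text: &str) -> String {
    text.split_whitespace()
        .map(|word| {
            // Split after the first character (words are never empty here).
            let cut = word.chars().next().map_or(0, |c| c.len_utf8());
            let (head, tail) = word.split_at(cut);
            format!("{}{}", head.to_uppercase(), tail.to_lowercase())
        })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Parse "LAST, FIRST MIDDLE [SUFFIX]". Returns None when there is no comma.
fn parse_fec_name(raw: &str) -> Option<FecName> {
    let (last, rest) = raw.split_once(',')?;
    let mut tokens: Vec<&str> = rest.split_whitespace().collect();

    // A generational suffix, if present, is the final token.
    let mut suffix = None;
    if tokens.len() > 1 {
        let tail = tokens[tokens.len() - 1];
        suffix = match tail {
            "JR" => Some("Jr".to_string()),
            "SR" => Some("Sr".to_string()),
            "II" | "III" | "IV" => Some(tail.to_string()),
            _ => None,
        };
        if suffix.is_some() {
            tokens.pop();
        }
    }

    let first = if tokens.is_empty() {
        None
    } else {
        Some(title_words(tokens.remove(0)))
    };
    let middle = if tokens.is_empty() {
        None
    } else {
        Some(title_words(&tokens.join(" ")))
    };
    Some(FecName { first, middle, last: title_words(last), suffix })
}
```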

The `canonical_first` field

canonical_first resolves known nicknames to their formal equivalents using the nickname dictionary. This enables matching when one source uses a nickname and another uses the legal name.

first	nickname	canonical_first	Reasoning
`Michael`	`Steve`	`Michael`	First name is already formal
`Charlie`	—	`Charles`	Charlie is a known nickname for Charles
`Bob`	—	`Robert`	Bob is a known nickname for Robert
`Patricia`	`Pat`	`Patricia`	First name is already formal
`Bill`	—	`William`	Bill is a known nickname for William
`Jim`	—	`James`	Jim is a known nickname for James

When first is already a formal name, canonical_first equals first. When first is itself a nickname (as when OpenElections reports Charlie Crist without the legal name Charles), canonical_first resolves to the formal form.

The nickname dictionary contains approximately 1,200 mappings. It is deterministic — no ML, no API calls. Ambiguous cases (e.g., “Alex” could map to “Alexander” or “Alexandra”) are resolved by leaving canonical_first equal to first and deferring to embedding-based matching at L2.
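The lookup, including the ambiguous case, can be sketched as follows. The five entries are illustrative only; the real dictionary is described as holding about 1,200 mappings, and the names nickname_table and canonical_first (as a function) are assumptions of this sketch.

```rust
use std::collections::HashMap;

/// A tiny illustrative subset of the nickname dictionary (assumption).
/// Each lowercase nickname maps to every formal name it may stand for.
fn nickname_table() -> HashMap<&'static str, &'static [&'static str]> {
    HashMap::from([
        ("charlie", &["Charles"][..]),
        ("bob", &["Robert"][..]),
        ("bill", &["William"][..]),
        ("jim", &["James"][..]),
        ("alex", &["Alexander", "Alexandra"][..]),
    ])
}

/// Resolve a first name to its formal form.
/// Unknown names and ambiguous nicknames are returned unchanged,
/// which leaves the decision to embedding-based matching at L2.
fn canonical_first(first: &str, table: &HashMap<&str, &[&str]>) -> String {
    let key = first.to_lowercase();
    match table.get(key.as_str()) {
        Some(forms) if forms.len() == 1 => forms[0].to_string(),
        _ => first.to_string(),
    }
}
```

Keeping every candidate formal name in the table, rather than only unambiguous ones, makes the ambiguity explicit instead of silently picking one form.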

How L2 uses name components

L2 constructs a composite string for embedding from the decomposed components:

{canonical_first} {middle} {last} {suffix}

This means Michael "Steve" Huber and Steve Huber both embed with their decomposed components rather than raw strings. The embedding model sees structured, normalized text rather than source-specific formatting.

The raw field is never used for embedding. It is preserved for provenance and debugging only.
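A minimal sketch of how the composite string could be assembled. The struct mirrors the seven fields listed earlier; the function name embedding_text is hypothetical, and skipping absent components (rather than emitting empty slots) is an assumption.

```rust
/// Mirror of the seven-field name struct described above.
#[allow(dead_code)]
struct CandidateName {
    raw: String,
    first: Option<String>,
    middle: Option<String>,
    last: Option<String>,
    suffix: Option<String>,
    nickname: Option<String>,
    canonical_first: Option<String>,
}

/// Build "{canonical_first} {middle} {last} {suffix}", skipping absent parts.
/// Neither `raw` nor `nickname` is read here.
fn embedding_text(name: &CandidateName) -> String {
    let parts = [&name.canonical_first, &name.middle, &name.last, &name.suffix];
    let mut words: Vec<&str> = Vec::new();
    for part in parts.iter() {
        if let Some(text) = part {
            words.push(text.as_str());
        }
    }
    words.join(" ")
}
```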

Special cases

Write-in candidates. MEDSL aggregates write-ins into WRITEIN. NC SBE reports named write-ins (e.g., Ronnie Strickland (Write-In)) separately from Write-In (Miscellaneous). Named write-ins are decomposed normally. The WRITEIN sentinel produces a record with all name fields set to None.

Ballot measure choices. The values For, Against, Yes, No are not person names. They are handled by the BallotMeasure contest kind and bypass name decomposition entirely. See Contest Kinds.

Hyphenated last names. Treated as a single last value: Smith-Jones → last: Smith-Jones. No attempt is made to split on hyphens.

Middle names. A single middle name populates middle directly: Joseph Robinette Biden → middle: Robinette. If two or more middle names are present (rare), they are concatenated into the middle field, separated by spaces.

No first name. Some sources report only a last name (e.g., truncated records). In that case first is None and canonical_first is also None. The WRITEIN sentinel is handled separately: all of its name fields are None.

Enumerations Reference

Every categorical field in the schema is represented by a closed enumeration. This chapter lists all enum types, their values, and where each is used.

ElectionType

Classifies the type of election event.

Value	Description
`General`	Regular general election (November even years)
`Primary`	Party primary election
`Runoff`	Runoff election following an inconclusive primary or general
`Special`	Special election to fill a vacancy
`SpecialPrimary`	Primary for a special election
`SpecialRunoff`	Runoff for a special election
`Municipal`	Municipal election (may be odd-year)
`Recall`	Recall election
`Retention`	Judicial retention election
`Other`	Election type not matching any above category

Source mapping: MEDSL’s stage column maps GEN → General, PRI → Primary, RUN → Runoff. The special boolean flag promotes the type to its special counterpart: General → Special, Primary → SpecialPrimary, Runoff → SpecialRunoff. NC SBE does not distinguish election types: all loaded files are general elections.
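The MEDSL mapping can be written as one match on the stage code and the special flag. A sketch under stated assumptions: the function name is hypothetical, the flag is taken as an already-parsed bool, GEN with the flag set is assumed to map to Special, and unknown stage codes fall back to Other. Only the variants this mapping can produce are listed.

```rust
/// Subset of ElectionType reachable from MEDSL's stage and special columns.
#[derive(Debug, PartialEq)]
enum ElectionType {
    General,
    Primary,
    Runoff,
    Special,
    SpecialPrimary,
    SpecialRunoff,
    Other,
}

/// Hypothetical mapping from MEDSL's `stage` code and `special` flag.
fn election_type_from_medsl(stage: &str, special: bool) -> ElectionType {
    let stage = stage.trim().to_uppercase();
    match (stage.as_str(), special) {
        ("GEN", false) => ElectionType::General,
        ("GEN", true) => ElectionType::Special, // assumption: special general
        ("PRI", false) => ElectionType::Primary,
        ("PRI", true) => ElectionType::SpecialPrimary,
        ("RUN", false) => ElectionType::Runoff,
        ("RUN", true) => ElectionType::SpecialRunoff,
        _ => ElectionType::Other, // assumption: unknown codes are not guessed
    }
}
```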

JurisdictionLevel

The geographic level at which a result is reported.

Value	Description
`State`	Statewide aggregate
`County`	County-level result
`Precinct`	Precinct-level result
`CongressionalDistrict`	Congressional district aggregate
`StateLegislativeUpper`	State senate district aggregate
`StateLegislativeLower`	State house/assembly district aggregate
`Municipality`	City or town
`SchoolDistrict`	School district boundary

Most records in the pipeline are Precinct. County and state aggregates appear in OpenElections data where precinct-level files are unavailable.

OfficeLevel

The level of government an office belongs to.

Value	Description
`Federal`	President, US Senate, US House
`Statewide`	Governor, AG, SOS, state auditor, state treasurer
`StateLegislature`	State senate, state house/assembly
`County`	County commissioner, county clerk, coroner, sheriff
`Municipal`	Mayor, city council, town board
`Judicial`	All judicial offices (federal, state, county, municipal)
`SchoolBoard`	School board / board of education
`SpecialDistrict`	Soil and water, fire district, utility district, transit
`Township`	Township supervisor, township trustee
`Other`	Unclassifiable after all four classifier tiers

Assigned by the four-tier classifier: keyword and regex at L1, embedding at L2, and LLM at L3. The Other rate is 0.56% on NC test data.

OfficeCategory

Finer-grained classification within an office level. One office level maps to many categories.

Value	Description
`Executive`	President, governor, mayor, county executive
`Legislative`	US House, US Senate, state legislature, city council
`Judicial`	Judge, justice, magistrate
`LawEnforcement`	Sheriff, constable, marshal
`FiscalOfficer`	Treasurer, auditor, comptroller, tax collector
`Clerk`	County clerk, clerk of court, register of deeds
`Education`	School board, board of education, superintendent
`PublicWorks`	Soil and water, utility district, surveyor
`Regulatory`	Coroner, medical examiner, public service commission
`PartyOffice`	Precinct committee officer, party chair (when on ballot)
`Other`	Does not fit the above categories

BallotMeasureType

Classifies ballot measures by their legal mechanism.

Value	Description
`BondIssue`	Debt authorization (general obligation or revenue bond)
`LevyRenewal`	Property tax levy renewal
`LevyNew`	New property tax levy
`ConstitutionalAmendment`	State constitutional amendment
`CharterAmendment`	Municipal or county charter amendment
`Referendum`	Legislative referendum referred to voters
`Initiative`	Citizen-initiated ballot measure
`Recall`	Recall question for a specific officeholder
`Other`	Measure type not determinable from contest name

PartySimplified

Normalized party affiliation. Preserves the most common parties as distinct values; collapses minor parties.

Value	Description
`Democrat`	Democratic Party
`Republican`	Republican Party
`Libertarian`	Libertarian Party
`Green`	Green Party
`Independent`	Independent / no party affiliation
`Nonpartisan`	Nonpartisan contest (no party on ballot)
`WriteIn`	Write-in candidate (party unknown or not applicable)
`Other`	Any other party (Constitution, Working Families, Reform, etc.)

Source mapping: MEDSL’s party_simplified column maps directly. NC SBE’s Choice Party codes: DEM → Democrat, REP → Republican, LIB → Libertarian, GRE → Green, UNA → Independent, blank → Nonpartisan. FEC codes: DEM → Democrat, REP → Republican, LIB → Libertarian, GRE → Green, IND → Independent, NNE → Nonpartisan.
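The NC SBE code mapping is a direct lookup. A minimal sketch: the function name party_from_ncsbe is hypothetical, and sending unrecognized codes to Other is an assumption not stated in the mapping above.

```rust
/// The PartySimplified values from the table above.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum PartySimplified {
    Democrat,
    Republican,
    Libertarian,
    Green,
    Independent,
    Nonpartisan,
    WriteIn,
    Other,
}

/// Hypothetical mapping for NC SBE's Choice Party column.
fn party_from_ncsbe(code: &str) -> PartySimplified {
    let code = code.trim().to_uppercase();
    match code.as_str() {
        "DEM" => PartySimplified::Democrat,
        "REP" => PartySimplified::Republican,
        "LIB" => PartySimplified::Libertarian,
        "GRE" => PartySimplified::Green,
        "UNA" => PartySimplified::Independent,
        "" => PartySimplified::Nonpartisan, // blank means no party on the ballot
        _ => PartySimplified::Other,        // assumption: unknown codes collapse
    }
}
```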

SourceType

Identifies the origin of a record. One value per data source file type.

Value	Description
`Medsl2018`	MEDSL 2018 precinct-level file
`Medsl2020`	MEDSL 2020 precinct-level file
`Medsl2022`	MEDSL 2022 precinct-level file
`Ncsbe2014`	NC SBE 2014 general (15-column schema)
`Ncsbe2016`	NC SBE 2016 general
`Ncsbe2018`	NC SBE 2018 general
`Ncsbe2020`	NC SBE 2020 general
`Ncsbe2022`	NC SBE 2022 general
`Ncsbe2024`	NC SBE 2024 general
`NcsbeLegacy`	NC SBE 2006–2012 (older schemas)
`OpenElections`	OpenElections CSV (any state)
`ClarityXml`	Clarity/Scytl ENR XML extract
`VestShapefile`	VEST precinct shapefile
`CensusFips`	Census Bureau FIPS reference file
`FecCandidate`	FEC candidate master file (`cn.txt`)
`Manual`	Manually entered or corrected record

Each L1 record carries exactly one SourceType. When sources are merged at L3/L4, the provenance chain preserves the original SourceType for every contributing record.

ExtractionMethod

How a field value was obtained from the source.

Value	Description
`Direct`	Value copied directly from a source column
`Parsed`	Value extracted by parsing a combined field (e.g., name decomposition)
`Derived`	Value computed from other fields (e.g., vote share from votes/total)
`Enriched`	Value added from a reference source (e.g., FIPS code from Census lookup)
`Inferred`	Value inferred by model (embedding similarity or LLM)

Confidence

The verification level assigned to a record at L4.

Value	Criteria
`High`	Confirmed by two or more independent sources with matching vote totals
`Medium`	Single source, certified state data or academic curated source
`Low`	Single source, community curated or unverified; or match confidence below threshold

Confidence is assigned per-record, not per-source. A record from MEDSL that is corroborated by NC SBE receives High. A record from MEDSL with no second source receives Medium. A record from OpenElections with schema inconsistencies receives Low.
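The per-record rule can be sketched as a small function. Everything beyond the three Confidence values is an assumption made for illustration: the SourceTier enum, the parameter names, and the choice to let multi-source corroboration take precedence. The match-confidence threshold mentioned in the table is left out.

```rust
#[derive(Debug, PartialEq)]
enum Confidence {
    High,
    Medium,
    Low,
}

/// Hypothetical grouping of sources by how they are curated.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SourceTier {
    Certified,        // e.g. certified state data
    AcademicCurated,  // e.g. an academic curated dataset
    CommunityCurated, // e.g. a community-maintained repository
}

/// `matching_sources` counts independent sources (including this one)
/// that report the same vote totals for the record.
fn assign_confidence(
    matching_sources: usize,
    tier: SourceTier,
    has_schema_issues: bool,
) -> Confidence {
    if matching_sources >= 2 {
        return Confidence::High;
    }
    match tier {
        SourceTier::Certified | SourceTier::AcademicCurated if !has_schema_issues => {
            Confidence::Medium
        }
        _ => Confidence::Low,
    }
}
```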

ClassifierMethod

Which tier of the office classifier produced the office level and category.

Value	Description
`Keyword`	Matched a keyword or keyword phrase (e.g., “SHERIFF” → `LawEnforcement`)
`Regex`	Matched a regex pattern (e.g., `DISTRICT \d+` for legislative districts)
`Embedding`	Classified by nearest-neighbor embedding similarity at L2
`Llm`	Classified by LLM at L3 after embedding was ambiguous

Records carry the method so downstream consumers can filter by classifier reliability. Keyword and Regex are deterministic and reproducible. Embedding and Llm depend on model versions.

GeoMatchMethod

How a geographic identifier was resolved.

Value	Description
`FipsExact`	FIPS code present in source and matched Census reference exactly
`NameExact`	Geographic name matched Census reference exactly (case-insensitive)
`NameFuzzy`	Geographic name matched after fuzzy normalization (e.g., “ST. LOUIS” → “St. Louis”)
`OcdLookup`	Matched via Open Civic Data identifier
`Unresolved`	Could not be matched to a canonical geographic entity

Most MEDSL records resolve via FipsExact (the source provides county_fips). NC SBE records resolve via NameExact after uppercasing the county name. OpenElections records frequently require NameFuzzy due to inconsistent county name formatting.
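The resolution order can be sketched as a cascade that records which step succeeded. This is an illustration only: the reference map (FIPS code to canonical name), the function names, and the normalization rule (uppercase, drop punctuation, collapse whitespace) are assumptions, and the OcdLookup step is omitted.

```rust
use std::collections::HashMap;

/// GeoMatchMethod without the OcdLookup variant (omitted in this sketch).
#[derive(Debug, PartialEq)]
enum GeoMatchMethod {
    FipsExact,
    NameExact,
    NameFuzzy,
    Unresolved,
}

/// Assumed fuzzy normalization: uppercase, drop punctuation, collapse spaces.
fn normalize(name: &str) -> String {
    let kept: String = name
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace())
        .collect();
    kept.to_uppercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

/// Try FIPS first, then a case-insensitive name match, then a fuzzy match.
/// `reference` maps FIPS code -> canonical name (hypothetical layout).
fn resolve_county(
    fips: Option<&str>,
    name: &str,
    reference: &HashMap<String, String>,
) -> (Option<String>, GeoMatchMethod) {
    if let Some(code) = fips {
        if reference.contains_key(code) {
            return (Some(code.to_string()), GeoMatchMethod::FipsExact);
        }
    }
    if let Some((code, _)) = reference
        .iter()
        .find(|(_, canon)| canon.eq_ignore_ascii_case(name))
    {
        return (Some(code.clone()), GeoMatchMethod::NameExact);
    }
    let wanted = normalize(name);
    if let Some((code, _)) = reference
        .iter()
        .find(|(_, canon)| normalize(canon) == wanted)
    {
        return (Some(code.clone()), GeoMatchMethod::NameFuzzy);
    }
    (None, GeoMatchMethod::Unresolved)
}
```

Returning the method alongside the identifier lets downstream consumers treat fuzzy matches with more caution than exact ones.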

Crate Overview

The election-aggregation crate is both a Rust library (election_aggregation) and a command-line binary (election-aggregation). The library provides types, parsers, and pipeline logic. The binary provides the CLI entry point.

Crate Configuration

From Cargo.toml:

Field	Value
Edition	2024
`rust-version`	1.93
Library name	`election_aggregation`
Binary name	`election-aggregation`
License	MIT OR Apache-2.0

The library is published as election_aggregation (underscored, per Rust convention). The binary is election-aggregation (hyphenated, per CLI convention). Both are defined in the same crate.

Module Structure

src/
├── lib.rs              # Library root — re-exports all public modules
├── main.rs             # Binary entry point — CLI dispatch
├── schema/
│   └── mod.rs          # Unified record types, enums, and field definitions
├── sources/
│   ├── mod.rs          # Source registry and SourceParser trait
│   ├── medsl.rs        # MEDSL parser (25-column CSV/TSV)
│   ├── ncsbe.rs        # NC SBE parser (15-column tab-delimited)
│   ├── openelections.rs # OpenElections parser (variable CSV)
│   ├── clarity.rs      # Clarity/Scytl XML parser
│   ├── vest.rs         # VEST shapefile parser (column decoding)
│   ├── census.rs       # Census FIPS reference file loader
│   └── fec.rs          # FEC candidate master file parser
└── pipeline/
    ├── mod.rs          # Layer sequencing and orchestration
    ├── l0.rs           # Raw acquisition (byte-identical storage + manifest)
    ├── l1.rs           # Deterministic parsing and enrichment
    ├── l2.rs           # Embedding generation (text-embedding-3-large)
    ├── l3.rs           # Entity resolution (cascade: exact → Jaro-Winkler → embedding → LLM)
    └── l4.rs           # Canonical name assignment, temporal chains, verification

Three top-level modules, each with a clear responsibility:

schema — Defines the unified record types that all sources normalize into. Contains ContestKind, CandidateName, VoteCountsByType, all enumerations, and the layer-specific record structs (L1Record through L4Record). No I/O, no parsing logic.
sources — One submodule per data source. Each submodule documents the source schema, implements parsing from the source format into L1 records, and catalogs known data quality issues. The parent mod.rs defines the SourceParser trait that all sources implement.
pipeline — One submodule per layer. Each layer reads its parent layer’s JSONL output and writes its own. l0 handles acquisition. l1 calls into source parsers. l2 batches embedding API calls. l3 batches LLM calls. l4 builds the entity graph.

Library vs. Binary

The library (src/lib.rs) exposes three public modules:

pub mod sources;
pub mod pipeline;
pub mod schema;

External crates can depend on election_aggregation to use the types and parsers without the CLI. The binary (src/main.rs) imports the library and wires it to CLI argument parsing.

The current binary prints usage information and a pointer to the documentation. CLI subcommands (process, embed, match, canonicalize, verify, sources) are planned but not yet implemented — see CLI Reference.

Dependencies

The Cargo.toml currently declares no runtime dependencies. As pipeline layers are implemented, expected dependencies include:

Crate	Purpose
`serde` + `serde_json`	JSONL serialization/deserialization
`csv`	CSV/TSV parsing for MEDSL, NC SBE, OpenElections
`sha2`	SHA-256 hashing for the provenance chain
`clap`	CLI argument parsing
`reqwest`	HTTP client for embedding and LLM API calls
`tokio`	Async runtime for batched API calls (L2, L3)

The release profile enables LTO, single codegen unit, and symbol stripping for minimal binary size.

Build

cargo build --release
./target/release/election-aggregation

Minimum supported Rust version is 1.93. Edition 2024 itself requires Rust 1.85 or later, so the rust-version setting is the stricter bound.

Type System Design

The Rust type system enforces pipeline invariants at compile time. Records from different layers are different types. Contest kinds are an enum, not a string. Candidate names are a struct, not a String. Source-specific raw fields are typed per source. These choices eliminate categories of bugs that would otherwise surface at runtime — or worse, silently corrupt output.

Layer-Typed Records

Each pipeline layer has its own record type. You cannot pass an L1 record to a function that expects L2, or accidentally mix L3 and L4 records in the same collection.

pub struct L0Record {
    pub raw_bytes: PathBuf,
    pub manifest: AcquisitionManifest,
}

pub struct L1Record {
    pub election: Election,
    pub jurisdiction: Jurisdiction,
    pub contest: Contest,
    pub results: Vec<CandidateResult>,
    pub turnout: Option<Turnout>,
    pub source: SourceMetadata,
    pub provenance: Provenance,
}

pub struct L2Record {
    pub l1: L1Record,
    pub candidate_name_embedding: Vec<f32>,
    pub contest_name_embedding: Vec<f32>,
    pub jurisdiction_embedding: Vec<f32>,
    pub embedding_model: String,
    pub embedding_version: String,
}

pub struct L3Record {
    pub l2: L2Record,
    pub candidate_cluster_id: ClusterId,
    pub contest_cluster_id: ClusterId,
    pub match_confidence: f64,
    pub match_method: MatchMethod,
}

pub struct L4Record {
    pub l3: L3Record,
    pub canonical_candidate_name: CandidateName,
    pub canonical_contest_name: String,
    pub temporal_chain_id: Option<ChainId>,
    pub verification_status: VerificationStatus,
}

Each layer wraps the previous layer’s record. An L3Record contains an L2Record which contains an L1Record. This nesting means every L4 record carries the full history back to L1. The compiler enforces that you cannot construct an L3Record without first having an L2Record — you cannot skip layers.

What the compiler prevents

Mixing layers in a collection. Vec<L1Record> and Vec<L2Record> are different types. A function that processes L2 records cannot accidentally receive L1 records.
Accessing fields that don’t exist yet. An L1 record has no candidate_cluster_id. Attempting to access it is a compile error, not a null pointer or missing key at runtime.
Skipping pipeline stages. You cannot construct an L3Record without providing an L2Record. The type system encodes the dependency chain.

ContestKind Enum

The ContestKind enum separates three fundamentally different record types that sources mix together in the same file.

pub enum ContestKind {
    CandidateRace {
        results: Vec<CandidateResult>,
    },
    BallotMeasure {
        choices: Vec<BallotChoice>,
        measure_type: BallotMeasureType,
        passage_threshold: Option<f64>,
    },
    TurnoutMetadata {
        registered_voters: Option<u64>,
        ballots_cast: Option<u64>,
    },
}

What the compiler prevents

Treating “For” as a person name. The BallotMeasure variant has choices: Vec<BallotChoice>, not results: Vec<CandidateResult>. A BallotChoice has a choice_text: String field, not a CandidateName struct. There is no code path where “For” enters the name decomposition logic.
Embedding turnout metadata. L2 pattern-matches on ContestKind and only computes embeddings for CandidateRace variants. TurnoutMetadata records pass through without embedding. This is enforced by the match arms — the compiler requires all three variants to be handled.
Mixing candidate results with ballot choices. You cannot push a BallotChoice into a Vec<CandidateResult>. They are different types.
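The exhaustive match described above might look like the following sketch. The payload types are simplified to strings for brevity (an assumption), and names_to_embed is a hypothetical helper, not the crate's L2 code.

```rust
/// Simplified ContestKind: payloads reduced to strings for this sketch.
#[allow(dead_code)]
enum ContestKind {
    CandidateRace { results: Vec<String> },
    BallotMeasure { choices: Vec<String> },
    TurnoutMetadata { registered_voters: Option<u64>, ballots_cast: Option<u64> },
}

/// Return the texts that should be embedded for a record.
/// Removing any arm below is a compile error, because the match
/// has no wildcard and must cover every variant.
fn names_to_embed(kind: &ContestKind) -> Vec<&str> {
    match kind {
        ContestKind::CandidateRace { results } => {
            results.iter().map(String::as_str).collect()
        }
        // "For" and "Against" never reach the embedding step.
        ContestKind::BallotMeasure { .. } => Vec::new(),
        // Turnout rows carry counts, not names.
        ContestKind::TurnoutMetadata { .. } => Vec::new(),
    }
}
```

The guarantee depends on avoiding a wildcard arm: with `_ => ...` in place, a newly added variant would be handled silently instead of flagged by the compiler.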

CandidateName Struct

Candidate names are a struct with seven fields, not a String. This is documented in detail in Candidate Name Components. The Rust definition:

pub struct CandidateName {
    pub raw: String,
    pub first: Option<String>,
    pub middle: Option<String>,
    pub last: Option<String>,
    pub suffix: Option<String>,
    pub nickname: Option<String>,
    pub canonical_first: Option<String>,
}

What the compiler prevents

Passing a raw name string where a parsed name is expected. Functions that perform entity resolution take &CandidateName, not &str. You cannot call them with the raw string — you must parse first.
Forgetting to preserve the raw name. The raw field is a required String, not Option<String>. Every CandidateName carries the original source text.
Confusing nickname with first name. They are separate fields. Code that constructs a composite embedding string uses canonical_first, middle, last, and suffix — never raw, never nickname on its own.

SourceRawFields Enum

Every L1 record preserves the original source columns in a typed enum. Each source has its own variant with its own struct.

pub enum SourceRawFields {
    Medsl(MedslRawRecord),
    Ncsbe(NcsbeRawRecord),
    OpenElections(OpenElectionsRawRecord),
    Vest(VestRawRecord),
    Clarity(ClarityRawRecord),
    Fec(FecRawRecord),
    Census(CensusRawRecord),
}

pub struct MedslRawRecord {
    pub year: i32,
    pub state: String,
    pub state_po: String,
    pub state_fips: String,
    pub state_cen: String,
    pub state_ic: String,
    pub office: String,
    pub county_name: String,
    pub county_fips: String,
    pub jurisdiction_name: String,
    pub jurisdiction_fips: String,
    pub candidate: String,
    pub district: String,
    pub dataverse: String,
    pub stage: String,
    pub special: String,
    pub writein: String,
    pub mode: String,
    pub totalvotes: String,
    pub candidatevotes: String,
    pub version: String,
    pub readme_check: String,
    pub magnitude: Option<i32>,
    pub party_detailed: String,
    pub party_simplified: String,
}

pub struct NcsbeRawRecord {
    pub county: String,
    pub election_date: String,
    pub precinct_code: String,
    pub precinct_name: String,
    pub contest_group_id: String,
    pub contest_type: String,
    pub contest_name: String,
    pub choice: String,
    pub choice_party: String,
    pub vote_for: u32,
    pub election_day: u64,
    pub one_stop: u64,
    pub absentee_by_mail: u64,
    pub provisional: u64,
    pub total_votes: u64,
}

What the compiler prevents

Accessing a field that doesn’t exist for a source. MEDSL has no vote_for column. NC SBE has no dataverse column. The struct types enforce this. If you have a NcsbeRawRecord, you can access vote_for. If you have a MedslRawRecord, you cannot — the field does not exist on the type.
Losing source-specific fields during normalization. The SourceRawFields enum is a required field on SourceMetadata. The compiler forces every parser to populate it. No source’s original columns are silently dropped.
Confusing source schemas. Pattern matching on SourceRawFields requires handling each variant. Code that needs MEDSL-specific logic matches on SourceRawFields::Medsl(ref raw) and gets a MedslRawRecord with the correct field types.
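A short sketch of source-specific access through pattern matching. The structs are cut down to the fields used here, only two variants are shown, and treating MEDSL's magnitude and NC SBE's vote_for as the same quantity (seats to fill) is an assumption made for the example.

```rust
/// Cut-down raw records: only the fields this sketch reads.
#[allow(dead_code)]
struct MedslRawRecord {
    dataverse: String,
    magnitude: Option<i32>,
}

#[allow(dead_code)]
struct NcsbeRawRecord {
    contest_name: String,
    vote_for: u32,
}

/// Two of the seven variants, for brevity.
enum SourceRawFields {
    Medsl(MedslRawRecord),
    Ncsbe(NcsbeRawRecord),
}

/// Hypothetical helper: how many seats does the contest fill?
/// Each arm can only touch the fields its own source defines.
fn seats_to_fill(raw: &SourceRawFields) -> Option<u32> {
    match raw {
        SourceRawFields::Medsl(record) => match record.magnitude {
            Some(m) if m >= 0 => Some(m as u32),
            _ => None, // missing or negative magnitude
        },
        SourceRawFields::Ncsbe(record) => Some(record.vote_for),
    }
}
```

Writing `record.vote_for` in the Medsl arm would not compile, which is the field-access guarantee described above.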

Other Type-Level Guarantees

ClusterId and ChainId are newtypes, not raw strings. They wrap a String but are distinct types. You cannot accidentally pass a ClusterId where a ChainId is expected.

pub struct ClusterId(pub String);
pub struct ChainId(pub String);

Confidence, MatchMethod, and VerificationStatus are enums, not strings. The set of valid values is fixed at compile time.

pub enum Confidence { High, Medium, Low }
pub enum MatchMethod { Deterministic, Embedding, LlmConfirmed }
pub enum VerificationStatus { MultiSourceConfirmed, LlmConfirmed, SingleSourceUnverified }

Vote counts are u64, not String. Source files sometimes contain non-integer vote values (0.1% of MEDSL 2022). These are caught during L1 parsing and quarantined — they never enter the typed record as a string that downstream code must re-parse.
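The parse-or-quarantine step can be sketched as follows. VoteParse and parse_votes are hypothetical names; how the real pipeline stores quarantined rows is not specified here.

```rust
/// Outcome of parsing one vote-count cell (hypothetical type).
#[derive(Debug, PartialEq)]
enum VoteParse {
    Valid(u64),
    Quarantined { raw: String, reason: &'static str },
}

/// Accept only non-negative integers; everything else is set aside
/// with its original text so it can be inspected later.
fn parse_votes(raw: &str) -> VoteParse {
    let trimmed = raw.trim();
    match trimmed.parse::<u64>() {
        Ok(count) => VoteParse::Valid(count),
        Err(_) => VoteParse::Quarantined {
            raw: raw.to_string(),
            reason: if trimmed.is_empty() {
                "empty vote count"
            } else {
                "not a non-negative integer"
            },
        },
    }
}
```

Downstream code then only ever sees a u64 and never has to re-parse a string.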

Design Tradeoffs

Nesting vs. flattening. L4Record contains L3Record contains L2Record contains L1Record. This means an L4 record is large — it carries the full history. The alternative (separate storage with ID references) would reduce memory per record but require joins to reconstruct provenance. We chose nesting because provenance integrity is a core requirement: every L4 record must be independently verifiable without external lookups.

Per-source structs vs. generic key-value map. Storing raw fields as HashMap<String, String> would be simpler to implement and would handle any source without code changes. We chose per-source structs because the fields are known at development time, and type safety catches schema drift (a renamed column breaks compilation, not data). The cost is that adding a new source requires defining a new struct and a new enum variant.

Option fields vs. separate types per completeness level. Many fields are Option<String> because not all sources provide them. An alternative design would define separate types for “fully populated” and “partially populated” records. We chose Option because the partially-populated case is the norm, not the exception — fewer than 5% of records have turnout data, and zero records have all fields populated.

The SourceParser Trait

Every data source in the pipeline implements a single trait: SourceParser. This trait defines the contract between source-specific parsing logic and the generic pipeline infrastructure. Adding a new source means implementing one trait.

Trait definition

pub trait SourceParser {
    /// The raw record type specific to this source.
    type RawRecord;

    /// Parse the source file into an iterator of raw records.
    ///
    /// This reads bytes from L0 and produces typed records that
    /// preserve every column from the source. No normalization
    /// occurs here, only deserialization.
    ///
    /// The `'a` lifetime lets the returned iterator borrow from
    /// `l0_bytes`, so records can be produced lazily.
    fn parse<'a>(
        &'a self,
        l0_bytes: &'a [u8],
    ) -> Box<dyn Iterator<Item = Result<Self::RawRecord, ParseError>> + 'a>;

    /// Convert a single raw record into an L1 record.
    ///
    /// This is where normalization happens: name decomposition,
    /// party normalization, FIPS enrichment, contest kind
    /// classification, and hash computation.
    fn to_l1(&self, raw: Self::RawRecord) -> Result<L1Record, TransformError>;

    /// Source metadata for provenance tracking.
    fn source_type(&self) -> SourceType;
}

RawRecord is an associated type of the trait, not a generic parameter: each implementation names exactly one raw record struct, matching the source schema column-for-column. MEDSL has a 25-field MedslRawRecord. NC SBE has a 15-field NcsbeRawRecord. This prevents cross-source field access at compile time.

How the pipeline uses the trait

The pipeline is generic over SourceParser. Each layer invokes the trait methods without knowing which source it is processing:

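// The lifetime 'a ties the returned iterator to the borrowed source and artifact.
fn process_l0_to_l1<'a, S: SourceParser>(
    source: &'a S,
    l0_artifact: &'a L0Artifact,
) -> impl Iterator<Item = Result<L1Record, PipelineError>> + 'a {
    let raw_records = source.parse(&l0_artifact.bytes);

    // The `?` operators rely on From<ParseError> and From<TransformError>
    // being implemented for PipelineError.
    raw_records.map(move |raw_result| {
        let raw = raw_result?;
        let l1 = source.to_l1(raw)?;
        Ok(l1)
    })
}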

Records are processed one at a time as an iterator. The full file is never loaded into memory as a collection of parsed records. This enables processing multi-gigabyte source files (MEDSL’s 2020 dataset is 13.2M rows) with bounded memory.

NC SBE implementation sketch

The NC SBE source illustrates what a concrete implementation looks like. NC SBE files are tab-delimited with 15 columns (2014–2024 schema).

The raw record preserves all source columns:

pub struct NcsbeRawRecord {
    pub county: String,
    pub election_date: String,
    pub precinct_code: String,
    pub precinct_name: String,
    pub contest_group_id: String,
    pub contest_type: String,        // "S" = statewide, "C" = county/local
    pub contest_name: String,
    pub choice: String,
    pub choice_party: String,
    pub vote_for: u32,
    pub election_day: u64,
    pub one_stop: u64,
    pub absentee_by_mail: u64,
    pub provisional: u64,
    pub total_votes: u64,
}

The parse method handles tab splitting and type conversion:

use std::io::{BufRead, BufReader};

impl SourceParser for NcsbeSource {
    type RawRecord = NcsbeRawRecord;

    // The explicit lifetime lets the boxed iterator borrow l0_bytes.
    fn parse<'a>(
        &'a self,
        l0_bytes: &'a [u8],
    ) -> Box<dyn Iterator<Item = Result<NcsbeRawRecord, ParseError>> + 'a> {
        let reader = BufReader::new(l0_bytes);
        // skip(1) drops the header row.
        Box::new(reader.lines().skip(1).map(|line| {
            let line = line?; // relies on From<std::io::Error> for ParseError
            let fields: Vec<&str> = line.split('\t').collect();
            // ... field extraction and type conversion
            Ok(NcsbeRawRecord { /* ... */ })
        }))
    }

    fn to_l1(&self, raw: NcsbeRawRecord) -> Result<L1Record, TransformError> {
        // 1. Classify contest kind
        let kind = classify_contest(&raw.contest_name, &raw.choice);

        // 2. Decompose candidate name
        let name = decompose_name_ncsbe(&raw.choice);

        // 3. Build vote counts from the four mode columns
        let vote_counts = VoteCountsByType {
            election_day: Some(raw.election_day),
            early: Some(raw.one_stop),
            absentee_mail: Some(raw.absentee_by_mail),
            provisional: Some(raw.provisional),
        };

        // 4. Determine office level from Contest Type
        let office_level = match raw.contest_type.as_str() {
            "S" => classify_statewide_office(&raw.contest_name),
            "C" => classify_local_office(&raw.contest_name),
            _   => OfficeLevel::Other,
        };

        // 5. Build provenance
        let l1_hash = compute_hash(&raw);

        Ok(L1Record { /* ... */ })
    }

    fn source_type(&self) -> SourceType {
        SourceType::Ncsbe2022
    }
}

Key points in the NC SBE to_l1 implementation:

Vote mode columns map directly. NC SBE is the only source where all four mode fields (election_day, one_stop, absentee_by_mail, provisional) are always present. No row-level aggregation is needed, unlike MEDSL where modes are separate rows.
Contest Type drives office classification. The C/S flag tells us immediately whether a race is local or statewide, reducing the keyword classifier’s job.
Name decomposition uses NC SBE conventions. Nicknames are in parentheses (not double quotes as in MEDSL). Suffixes follow commas. The parser for NC SBE and the parser for MEDSL call different name-parsing functions.

Adding a new source

To add a new source (e.g., a state portal for Ohio):

Define OhioRawRecord with fields matching the source schema.
Implement SourceParser for OhioSource.
Write parse to handle the source format (CSV, TSV, XML, JSON).
Write to_l1 to normalize names, classify contests, enrich FIPS codes, and compute hashes.
Add the source to the SourceType enum.

The pipeline infrastructure — streaming, partitioning, JSONL serialization, hash chaining — is reused without modification. The only new code is the source-specific parsing and normalization logic in the trait implementation.
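The five steps can be sketched end to end. Everything below is a simplified stand-in so the example compiles on its own: the three-column Ohio CSV layout is hypothetical, L1Record is cut down to three fields, errors are plain strings, and the trait carries an explicit lifetime so the iterator may borrow the input bytes.

```rust
#[derive(Debug, PartialEq)]
pub struct L1Record {
    pub contest: String,
    pub candidate: String,
    pub votes: u64,
}

#[derive(Debug, PartialEq)]
pub enum SourceType {
    Ncsbe2022,
    Ohio2022, // step 5: the new enum variant
}

pub trait SourceParser {
    type RawRecord;
    fn parse<'a>(
        &'a self,
        l0_bytes: &'a [u8],
    ) -> Box<dyn Iterator<Item = Result<Self::RawRecord, String>> + 'a>;
    fn to_l1(&self, raw: Self::RawRecord) -> Result<L1Record, String>;
    fn source_type(&self) -> SourceType;
}

// Step 1: raw record matching the (hypothetical) source columns.
pub struct OhioRawRecord {
    pub office: String,
    pub candidate: String,
    pub votes: String,
}

pub struct OhioSource;

// Step 2: implement the trait.
impl SourceParser for OhioSource {
    type RawRecord = OhioRawRecord;

    // Step 3: deserialization only; every column is kept as text.
    fn parse<'a>(
        &'a self,
        l0_bytes: &'a [u8],
    ) -> Box<dyn Iterator<Item = Result<OhioRawRecord, String>> + 'a> {
        let text = match std::str::from_utf8(l0_bytes) {
            Ok(text) => text,
            Err(e) => {
                let err: Result<OhioRawRecord, String> = Err(e.to_string());
                return Box::new(std::iter::once(err));
            }
        };
        // skip(1) drops the header row.
        Box::new(text.lines().skip(1).map(|line| {
            let fields: Vec<&str> = line.split(',').collect();
            if fields.len() != 3 {
                return Err(format!("expected 3 columns, found {}", fields.len()));
            }
            Ok(OhioRawRecord {
                office: fields[0].to_string(),
                candidate: fields[1].to_string(),
                votes: fields[2].to_string(),
            })
        }))
    }

    // Step 4: normalization; here only trimming and vote parsing.
    fn to_l1(&self, raw: OhioRawRecord) -> Result<L1Record, String> {
        let votes = raw
            .votes
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("bad vote count {:?}: {}", raw.votes, e))?;
        Ok(L1Record {
            contest: raw.office.trim().to_string(),
            candidate: raw.candidate.trim().to_string(),
            votes,
        })
    }

    fn source_type(&self) -> SourceType {
        SourceType::Ohio2022
    }
}
```

A real implementation would also decompose names, classify contests, enrich FIPS codes, and compute hashes inside to_l1, as listed in step 4.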

Error handling

Both parse and to_l1 return Result. Errors are not fatal. A row that fails to parse (malformed TSV, non-integer vote count, encoding issue) produces an error that the pipeline routes to a quarantine log. Processing continues with the next row.

MEDSL’s votes column contains 12,782 non-integer values out of 12.3M rows (0.1%) in 2022. These rows are quarantined at parse time, logged with the source file name and row number, and excluded from L1 output. The quarantine log is itself a JSONL file, enabling post-processing review.
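The quarantine routing described above might look like the following sketch. QuarantineEntry and parse_votes are hypothetical names, and the input is reduced to one vote value per row to keep the example short.

```rust
/// One quarantined row: where it came from and why it failed.
#[derive(Debug, PartialEq)]
pub struct QuarantineEntry {
    pub row_number: usize, // 1-based position in the input
    pub raw_value: String,
    pub error: String,
}

/// Parse one vote count per row. Rows that fail are quarantined and
/// processing continues with the next row.
pub fn parse_votes(rows: &[&str]) -> (Vec<u64>, Vec<QuarantineEntry>) {
    let mut parsed = Vec::new();
    let mut quarantined = Vec::new();
    for (idx, row) in rows.iter().enumerate() {
        match row.trim().parse::<u64>() {
            Ok(votes) => parsed.push(votes),
            Err(e) => quarantined.push(QuarantineEntry {
                row_number: idx + 1,
                raw_value: row.to_string(),
                error: e.to_string(),
            }),
        }
    }
    (parsed, quarantined)
}
```

In the pipeline the quarantined entries would be serialized to the JSONL quarantine log rather than collected in memory.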

Pipeline Execution

The pipeline processes records through five layers in strict order: L0 → L1 → L2 → L3 → L4. Each layer reads its parent’s JSONL output and writes its own. No layer skips its predecessor.

Streaming Processing

Records are processed one at a time. The pipeline never loads an entire layer’s output into memory. Each layer reads a line from its input JSONL, transforms it, and writes a line to its output JSONL. This keeps memory usage proportional to a single record, not to the dataset size.

For a 42M-row corpus, this is not optional. Loading 12.3M MEDSL 2022 rows into memory as deserialized structs would require tens of gigabytes. Streaming keeps the resident set under 500 MB for L0 → L1 and L1 → L2.

Partitioning

All processing is partitioned by state and year. Each partition is an independent unit of work:

l1/NC/2022/medsl.jsonl
l1/NC/2022/ncsbe.jsonl
l1/FL/2022/medsl.jsonl
l1/FL/2022/openelections.jsonl

Partitioning enables:

Incremental processing. Re-running L1 for North Carolina does not require re-processing Texas.
Parallelism. Independent partitions can be processed concurrently.
Bounded working sets. L4’s entity graph (which does require in-memory state) is scoped to one state-year at a time rather than the full corpus.
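A small sketch of how a partition path could be derived, following the layer/state/year/source layout listed above. The function name partition_path is an assumption for illustration.

```rust
use std::path::PathBuf;

/// Build the JSONL path for one (layer, state, year, source) partition.
pub fn partition_path(layer: &str, state_po: &str, year: u16, source: &str) -> PathBuf {
    let mut path = PathBuf::from(layer);
    path.push(state_po);
    path.push(year.to_string());
    path.push(format!("{source}.jsonl"));
    path
}
```

Since every partition maps to its own file, two partitions never write to the same path, which is what makes concurrent processing safe.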

Layer-Specific Execution

L0 → L1: Deterministic, Single-Record

Each source row is parsed independently. No row depends on any other row. This is purely CPU-bound — no network calls, no model inference. On a single core, L1 processes approximately 200,000 MEDSL rows per second.

L1 → L2: Batched Embedding API Calls

L2 generates embeddings using text-embedding-3-large. The OpenAI embedding API accepts batches of up to 2,048 inputs per request. L2 accumulates records into batches of 256 (configurable), constructs composite strings from name components and contest fields, sends the batch to the API, and attaches the returned vectors to each record.

Batching amortizes HTTP overhead. At 256 records per batch, the 12.3M rows of MEDSL 2022 require approximately 48,000 API calls in total across all state partitions. Rate limiting and retry logic are handled at this layer.
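The call-count arithmetic and the batching itself can be sketched as below. The function names are hypothetical, and the API request is left out entirely.

```rust
/// Number of requests needed to embed `records` inputs at `batch_size` per request.
pub fn batch_count(records: usize, batch_size: usize) -> usize {
    assert!(batch_size > 0, "batch size must be positive");
    // Ceiling division: a final partial batch still costs one request.
    (records + batch_size - 1) / batch_size
}

/// Split records into consecutive batches; the last one may be shorter.
pub fn batches<T>(records: &[T], batch_size: usize) -> std::slice::Chunks<'_, T> {
    records.chunks(batch_size)
}
```

For 12,300,000 records at 256 per batch, batch_count returns 48,047, which matches the figure of approximately 48,000 calls.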

Embedding vectors are written as .npy binary sidecar files, not inline in JSONL. The JSONL record carries a reference (file path + offset) to the corresponding vector. This keeps JSONL files human-readable and text-diffable.

L2 → L3: Batched LLM Calls

L3 performs entity resolution in three tiers. The first tier (deterministic blocking) and second tier (embedding nearest-neighbor) require no API calls. The third tier sends ambiguous candidate pairs to Claude Sonnet for confirmation.

LLM calls are batched per contest cluster — all ambiguous pairs within a single contest are sent in one structured prompt. This reduces call count and provides the LLM with full context (all candidates, all name variants, the office title, the jurisdiction).

The deterministic tier resolves 70%+ of records. The embedding tier resolves most of the remainder. LLM calls are made for approximately 5–10% of entity resolution decisions, concentrated on cases where name similarity is 0.85–0.92.
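One way to picture the similarity band is a small routing function. Only the 0.85–0.92 band comes from the text; the names, the inclusive lower bound, and the treatment of scores outside the band are assumptions of this sketch.

```rust
#[derive(Debug, PartialEq)]
pub enum EmbeddingDecision {
    /// Similarity high enough to accept without an LLM call (assumed behavior).
    AutoAccept,
    /// Inside the ambiguous band: queue the pair for LLM confirmation.
    AskLlm,
    /// Below the band: the embedding tier alone does not produce a match.
    NoEmbeddingMatch,
}

pub const LLM_BAND_LOW: f64 = 0.85;
pub const LLM_BAND_HIGH: f64 = 0.92;

pub fn route_by_similarity(similarity: f64) -> EmbeddingDecision {
    if similarity >= LLM_BAND_HIGH {
        EmbeddingDecision::AutoAccept
    } else if similarity >= LLM_BAND_LOW {
        EmbeddingDecision::AskLlm
    } else {
        EmbeddingDecision::NoEmbeddingMatch
    }
}
```

Pairs that land in AskLlm would then be grouped per contest cluster before a prompt is built.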

L3 → L4: In-Memory Entity Graph

L4 is the exception to the streaming rule. Building temporal chains (linking the same candidate across election cycles) and selecting canonical names requires the full entity graph for a partition in memory. For a single state, this graph typically contains 10,000–50,000 entity nodes.

L4 loads all L3 records for one state-year partition, constructs the candidate and contest entity graphs, assigns canonical names, builds temporal chain links, runs verification checks against the hash chain, and writes the final L4 JSONL and CSV outputs.

Memory usage scales with the number of unique entities in a partition, not the number of rows. North Carolina (the largest single-state partition due to NC SBE’s 10 cycles) peaks at approximately 2 GB for the entity graph.

Error Handling

Each layer writes a quarantine log alongside its output JSONL. Records that fail parsing, embedding, or matching are written to the quarantine file with a structured error message. They do not block processing of subsequent records.

Quarantine files follow the naming convention:

l1/NC/2022/medsl.quarantine.jsonl

Each quarantine entry contains the original record (or as much as could be parsed), the error type, and the error message. Quarantine rates by layer:

Layer	Typical quarantine rate	Common causes
L1	0.1%	Non-integer vote values, unparseable names, encoding errors
L2	<0.01%	API timeouts (retried), embedding dimension mismatch
L3	1–3%	Ambiguous matches below confidence threshold
L4	<0.1%	Hash chain verification failures

Output Format: JSONL and CSV Export

The pipeline writes JSONL at every layer. JSONL is the canonical format — it is the source of truth for every record at every stage. L4 additionally exports flat CSV for spreadsheet users. Embedding vectors at L2 are stored as .npy binary sidecars alongside the JSONL.

JSONL — Canonical at Every Layer

Every pipeline layer (L1 through L4) writes its output as JSONL: one JSON object per line, one file per state/year partition.

File naming convention:

{layer}/{state_po}/{year}.jsonl

Examples:

Path	Contents
`l1/NC/2022.jsonl`	All L1 cleaned records for North Carolina 2022
`l2/NC/2022.jsonl`	L2 records with embedding metadata (vectors stored separately)
`l3/NC/2022.jsonl`	L3 records with entity resolution cluster IDs
`l4/NC/2022.jsonl`	L4 canonical records with verification status

Properties:

One record per line. Each line is a complete, self-contained JSON object. No multi-line formatting.
Streamable. Consumers can process records one at a time without loading the full file into memory.
Appendable. New records are concatenated to the end of the file. Existing lines are never modified.
Serialized with serde_json. All Rust types implement Serialize and Deserialize via serde. Field names in JSON match the Rust struct field names exactly.

A single JSONL line for an L1 record contains all seven schema sections (election, jurisdiction, contest, results, turnout, source, provenance) as top-level keys. Null fields are included explicitly rather than omitted, so every record has the same set of keys.

Embedding Vectors — `.npy` Sidecars

Embedding vectors generated at L2 are not stored inside the JSONL records. A 3072-dimensional f32 vector (text-embedding-3-large output) occupies 12,288 bytes in binary form. Encoded as a JSON array of decimal floats, the same vector would take roughly three times as many bytes.

Instead, vectors are written as NumPy .npy binary files alongside the JSONL:

File	Contents
`l2/NC/2022.jsonl`	L2 records with `embedding_model`, `embedding_version`, and vector array index
`l2/NC/2022_candidate_name.npy`	Dense matrix: one row per record, 3072 columns
`l2/NC/2022_contest_name.npy`	Dense matrix for contest name embeddings
`l2/NC/2022_jurisdiction.npy`	Dense matrix for jurisdiction embeddings

Each JSONL record at L2 contains an embedding_index field (integer) that identifies which row of the .npy matrix corresponds to that record. The .npy format is a simple binary header followed by contiguous f32 values — readable by NumPy, PyTorch, and any tool that understands the format.
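To make the embedding_index lookup concrete, here is a sketch that reads one row straight from the bytes of a sidecar file. It assumes a version 1.x .npy header (6-byte magic, two version bytes, 2-byte little-endian header length), little-endian f32 data, and row-major order; the function names are hypothetical.

```rust
/// Byte offset at which array data begins in a version 1.x .npy file.
pub fn npy_data_offset(bytes: &[u8]) -> Option<usize> {
    // Layout: magic "\x93NUMPY", major version, minor version, u16 header length (LE).
    if bytes.len() < 10 || !bytes.starts_with(b"\x93NUMPY") || bytes[6] != 1 {
        return None;
    }
    let header_len = u16::from_le_bytes([bytes[8], bytes[9]]) as usize;
    Some(10 + header_len)
}

/// Read row `embedding_index` of a matrix with `dim` f32 columns.
/// Returns None if the header is not recognized or the row is out of range.
pub fn read_embedding_row(bytes: &[u8], embedding_index: usize, dim: usize) -> Option<Vec<f32>> {
    let row_bytes = dim * 4; // 4 bytes per f32
    let start = npy_data_offset(bytes)? + embedding_index * row_bytes;
    let row = bytes.get(start..start + row_bytes)?;
    Some(
        row.chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

With dim set to 3072, each row spans 12,288 bytes, so a record's vector can be located without reading the rest of the matrix.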

The .npy files are written once and never modified. Re-embedding with a different model version produces new files with a version suffix (e.g., 2022_candidate_name_v2.npy).

CSV Export at L4

L4 produces a flat CSV in addition to JSONL. The CSV is designed for spreadsheet users and tools like pandas, R, or DuckDB that work with tabular data.

The CSV flattens the nested JSONL structure:

CandidateName components become separate columns: candidate_raw, candidate_first, candidate_middle, candidate_last, candidate_suffix, candidate_nickname.
VoteCountsByType becomes: votes_election_day, votes_early, votes_absentee_mail, votes_provisional.
Nested objects (election, jurisdiction, contest, source, provenance) are flattened with underscore-separated prefixes.
The results array is denormalized: one CSV row per candidate per precinct per contest (matching the JSONL structure, which already stores one result per record after L1 normalization).
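The underscore-prefix flattening can be illustrated with a toy value type. Value and flatten are hypothetical names; the real exporter would work from the serde-serialized records.

```rust
/// A tiny stand-in for a nested JSON value.
pub enum Value {
    Text(String),
    Number(u64),
    Null,
    Object(Vec<(String, Value)>),
}

/// Flatten nested objects into (column, cell) pairs, joining keys with '_'.
/// Null becomes an empty cell so every row has the same columns.
pub fn flatten(prefix: &str, value: &Value, out: &mut Vec<(String, String)>) {
    match value {
        Value::Object(fields) => {
            for (key, child) in fields {
                let name = if prefix.is_empty() {
                    key.clone()
                } else {
                    format!("{prefix}_{key}")
                };
                flatten(&name, child, out);
            }
        }
        Value::Text(s) => out.push((prefix.to_string(), s.clone())),
        Value::Number(n) => out.push((prefix.to_string(), n.to_string())),
        Value::Null => out.push((prefix.to_string(), String::new())),
    }
}
```

Calling flatten with an empty prefix on a record whose votes object has an election_day key yields a votes_election_day column, matching the naming above.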

The CSV omits embedding vectors, raw source fields, and hash chain details. These are available in the JSONL for users who need them.

Design Rationale

Why JSONL over Parquet or SQLite? JSONL is human-readable, appendable, and requires no special tooling to inspect (head, jq, grep all work). It supports the nested schema (CandidateName, VoteCountsByType, SourceRawFields) without flattening. The tradeoff is file size and query performance. Both are addressed by the L4 CSV export and by the fact that consumers can convert JSONL to Parquet with a one-liner (duckdb -c "COPY (SELECT * FROM read_json('l4/NC/2022.jsonl')) TO 'l4/NC/2022.parquet' (FORMAT parquet)").

Why .npy over embedding in JSON? Size. A 42M-record corpus with three 3072-dimensional f32 vectors per record holds about 1.5 TB of raw vector data (42M × 3 × 12,288 bytes). The .npy binary format stores the data at that raw size with zero parsing overhead, while JSON-encoded floats would take roughly three times as much space.

Why CSV at L4 only? L1–L3 records contain fields (embedding indices, match method metadata, hash chains) that do not map to a flat table. L4 is the consumer-facing layer where the schema is stable enough for tabular export.

CLI Reference

The election-aggregation binary provides a command-line interface for pipeline execution and data source management. Commands are not yet implemented — this chapter documents the planned interface.

Planned Commands

Command	Pipeline stage	Description
`election-aggregation process`	L0 → L1	Parse raw source files into cleaned JSONL records
`election-aggregation embed`	L1 → L2	Generate text-embedding-3-large vectors for candidate names, contest names, and jurisdictions
`election-aggregation match`	L2 → L3	Run entity resolution: exact → Jaro-Winkler → embedding → LLM confirmation
`election-aggregation canonicalize`	L3 → L4	Assign canonical names, build temporal chains, produce verification status
`election-aggregation verify`	L4	Walk the hash chain from L4 back to L0 source bytes and report any breaks
`election-aggregation sources`	—	List all data sources with download URLs and instructions

Common Options

All pipeline commands will accept:

--state <STATE> — Process a single state (two-letter postal code). Without this flag, all states are processed.
--year <YEAR> — Process a single election year. Without this flag, all loaded years are processed.
--data-dir <PATH> — Root directory for source files and pipeline output. Defaults to ./local-data.
--jobs <N> — Number of parallel state/year partitions to process. Defaults to 1.

API Key Configuration

L2 (embed) requires an OpenAI API key for text-embedding-3-large. L3 (match) requires an Anthropic API key for Claude Sonnet confirmation calls. Keys are read from environment variables:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

The process command does not call external APIs. The canonicalize command constructs L4 deterministically but uses the Anthropic key for its LLM entity audit step.

Implementation Status

The binary currently prints a version banner and documentation pointer. No subcommands are wired up. The CLI will use clap for argument parsing once pipeline modules are functional.

Getting Started

This chapter describes the planned interface for running the election-aggregation pipeline. The CLI is not yet implemented — this documents the target design so that early users can understand the workflow and contributors can build toward it.

Prerequisites

Requirement	Version	Purpose
Rust toolchain	1.93+	Build and run the pipeline
Disk space	8 GB minimum	Raw source files + processed output
OpenAI API key	—	L2 embedding generation (`text-embedding-3-large`)
Anthropic API key	—	L3 entity resolution and L4 entity audit (Claude Sonnet)

L0 and L1 require no API keys. You can download data and run deterministic parsing without any external service. L2 requires OpenAI. L3 requires Anthropic. L4 verification re-uses the Anthropic key for the entity audit step.

Install

Clone the repository and build:

git clone https://github.com/your-org/election-aggregation.git
cd election-aggregation
cargo build --release

Or install directly:

cargo install --path .

The binary is election-aggregation. Verify with:

election-aggregation --version

API Key Configuration

Set environment variables for the layers that require them:

export OPENAI_API_KEY="sk-..."        # Required for L2
export ANTHROPIC_API_KEY="sk-ant-..."  # Required for L3 and L4

Keys are never stored in configuration files, command history, or pipeline output. The pipeline reads them from the environment at invocation time.
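A sketch of the environment lookup, using the layer-to-key mapping from the export lines above. The function names and the error wording are assumptions; only std::env::var is a real API here.

```rust
/// Environment variable a layer needs, or None for offline layers (L0, L1).
pub fn required_key(layer: &str) -> Option<&'static str> {
    match layer {
        "L2" => Some("OPENAI_API_KEY"),
        "L3" | "L4" => Some("ANTHROPIC_API_KEY"),
        _ => None,
    }
}

/// Check at invocation time that the key for `layer` is present and non-empty.
/// The key value itself is never returned or logged.
pub fn check_key(layer: &str) -> Result<(), String> {
    match required_key(layer) {
        None => Ok(()),
        Some(var) => match std::env::var(var) {
            Ok(value) if !value.is_empty() => Ok(()),
            _ => Err(format!("{layer} requires {var} to be set")),
        },
    }
}
```

Checking before any work starts lets a missing key fail fast instead of partway through a partition.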

Quick Start

The minimal workflow downloads NC SBE 2022 data and runs L0 through L1 — no API keys needed:

# Download NC SBE 2022 general election results
election-aggregation download --source ncsbe --year 2022

# Process L0 → L1 (deterministic, offline)
election-aggregation process --source ncsbe --year 2022

This produces JSONL output at local-data/processed/l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl. You can query it immediately with jq or Python. See Querying JSONL Output.

To continue through the full pipeline:

# L1 → L2 (requires OpenAI key)
election-aggregation embed --state NC --year 2022

# L2 → L3 (requires Anthropic key)
election-aggregation match --state NC --year 2022

# L3 → L4 (deterministic construction + LLM audit)
election-aggregation canonicalize --state NC --year 2022

Each layer reads the prior layer’s output and writes to the next layer’s directory. If a step fails, check the cleaning report (cleaning_report.json at L1) or the decision log (candidate_matches.jsonl at L3) for diagnostics.

Re-Running Individual Layers

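Layers can be re-run independently of the layers before them. Re-running L2 does not require re-running L1, because it reads from existing L1 output. Re-running L3 does not require re-running L2. Layers after the one you re-run do need to be refreshed. This means: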

If you upgrade the embedding model, re-run L2 and everything downstream (L3, L4).
If you add a nickname to the dictionary, re-run L1 and everything downstream (L2, L3, L4).
If you override an L3 entity match decision, re-run L4 only.
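The rule behind these three cases is "re-run the first affected layer and everything after it". A minimal sketch, with a hypothetical Layer enum and function name:

```rust
/// Pipeline layers in execution order; the derived ordering follows this order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Layer {
    L0,
    L1,
    L2,
    L3,
    L4,
}

/// Given the first layer whose output must be recomputed, list every layer to re-run.
pub fn layers_to_rerun(first_affected: Layer) -> Vec<Layer> {
    [Layer::L0, Layer::L1, Layer::L2, Layer::L3, Layer::L4]
        .iter()
        .copied()
        .filter(|layer| *layer >= first_affected)
        .collect()
}
```

An embedding model upgrade first affects L2, a nickname dictionary change first affects L1, and an override of an L3 match decision first affects L4.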

What Is Not Yet Implemented

The CLI commands above describe the planned interface. As of the current version, the pipeline runs through Rust library code and test harnesses, not a polished CLI. The following are planned but not yet available:

election-aggregation download — automated source fetching with hash verification
election-aggregation process — L0→L1 pipeline with progress reporting
election-aggregation embed — L1→L2 with batched API calls and resume-on-failure
election-aggregation match — L2→L3 with configurable thresholds and replay mode
election-aggregation canonicalize — L3→L4 with verification report generation
CSV export from L4

Contributions are welcome. See Crate Overview for the current code structure.

Download the Data

This project does not redistribute election data. You download it yourself from the authoritative sources, verify file integrity, and point the pipeline at your local copies.

Prerequisites

~8 GB disk space for the core dataset (MEDSL 2022 + NC SBE 2022)
~20 GB for the full dataset (all years, all sources)
curl or wget for downloads
unzip for compressed archives
sha256sum (Linux) or shasum -a 256 (macOS) for verification

Core Dataset

The minimum dataset to run the pipeline and reproduce prototype results:

MEDSL 2022 (All States)

The MIT Election Data + Science Lab publishes precinct-level returns for all 50 states and DC.

mkdir -p local-data/sources/medsl/2022
cd local-data/sources/medsl/2022

# Download from Harvard Dataverse (2022 precinct-level general election)
# File: 2022-precinct-general.csv (~2 GB compressed)
curl -L -o 2022-precinct-general.zip \
  "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/PJ7QWD/VOQCHQ"
unzip 2022-precinct-general.zip

Expected size: ~2 GB compressed, ~6 GB uncompressed. Contains approximately 42 million rows across all states. Format: CSV with columns state, county_name, jurisdiction, office, district, candidate, party_simplified, mode, votes, and others.

NC SBE 2022

The North Carolina State Board of Elections publishes precinct-level results for every NC election.

mkdir -p local-data/sources/ncsbe/2022
cd local-data/sources/ncsbe/2022

curl -O https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip
unzip results_pct_20221108.zip

Expected size: ~18 MB compressed, ~75 MB uncompressed. Format: TSV (tab-separated, .txt extension). Contains precinct-level results for all NC contests in the 2022 general election — federal, state, county, municipal, judicial, and school board.

NC SBE 2018 + 2020 (For Multi-Year Analysis)

Required for career tracking and temporal chain validation:

mkdir -p local-data/sources/ncsbe/2020
cd local-data/sources/ncsbe/2020
curl -O https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2020_11_03/results_pct_20201103.zip
unzip results_pct_20201103.zip
cd ../../../..   # back to the project root before creating the next directory

mkdir -p local-data/sources/ncsbe/2018
cd local-data/sources/ncsbe/2018
curl -O https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2018_11_06/results_pct_20181106.zip
unzip results_pct_20181106.zip

Expected size: ~15 MB compressed each.

Full Dataset

For comprehensive analysis across all supported years and sources:

MEDSL 2018 + 2020

mkdir -p local-data/sources/medsl/2020
cd local-data/sources/medsl/2020
# Download from Harvard Dataverse (2020 precinct-level general election)
curl -L -o 2020-precinct-general.zip \
  "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/K7760H/GKWF2X"
unzip 2020-precinct-general.zip
cd ../../../..   # back to the project root before creating the next directory

mkdir -p local-data/sources/medsl/2018
cd local-data/sources/medsl/2018
# Download from Harvard Dataverse (2018 precinct-level general election)
curl -L -o 2018-precinct-general.zip \
  "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/UBKYRU/EJMDUL"
unzip 2018-precinct-general.zip

Expected size: ~2 GB compressed per year.

NC SBE 2006–2024 (Deep NC History)

For the full 10-cycle career tracking analysis (George Dunlap’s 6 consecutive cycles, 702 candidates in 3+ cycles):

# 2018, 2020, and 2022 are covered above; these are the remaining cycles
for year in 2006 2008 2010 2012 2014 2016 2024; do
  mkdir -p "local-data/sources/ncsbe/${year}"
  # NC SBE URL pattern varies by year; check https://dl.ncsbe.gov/ENRS/
  # for the exact filename for each election date
done

NC SBE files from 2006–2016 use slightly different column layouts than 2018+. The nc_sbe parser handles both formats. Total size for all NC SBE years: ~200 MB.

OpenElections

Community-curated precinct data for select states. Coverage varies by state and contributor.

mkdir -p local-data/sources/openelections/2022
cd local-data/sources/openelections/2022

# Florida 2022 general
curl -O https://raw.githubusercontent.com/openelections/openelections-data-fl/master/2022/20221108__fl__general__precinct.csv

# Ohio 2022 general
curl -O https://raw.githubusercontent.com/openelections/openelections-data-oh/master/2022/20221108__oh__general__precinct.csv

Expected sizes: FL ~50 MB, OH ~30 MB. OpenElections data varies in format by state — some use standardized column names, others preserve county clerk formatting. Total across all available states: ~250 MB.

Expected Sizes Summary

Source	Years	Compressed	Uncompressed	Records (approx.)
MEDSL	2022	~2 GB	~6 GB	~42M
MEDSL	2020	~2 GB	~5.5 GB	~38M
MEDSL	2018	~2 GB	~5 GB	~35M
NC SBE	2022	18 MB	75 MB	~600K
NC SBE	2006–2024 (all)	~60 MB	~200 MB	~4M
OpenElections	2022 (6 states)	~80 MB	~250 MB	~2M
Core dataset		~2 GB	~6 GB	~42M
Full dataset		~8 GB	~22 GB	~120M

Storage Layout

After downloading, your local-data/ directory should look like:

local-data/
└── sources/
    ├── medsl/
    │   ├── 2018/
    │   │   └── 2018-precinct-general.csv
    │   ├── 2020/
    │   │   └── 2020-precinct-general.csv
    │   └── 2022/
    │       └── 2022-precinct-general.csv
    ├── ncsbe/
    │   ├── 2018/
    │   │   └── results_pct_20181106.txt
    │   ├── 2020/
    │   │   └── results_pct_20201103.txt
    │   └── 2022/
    │       └── results_pct_20221108.txt
    ├── openelections/
    │   └── 2022/
    │       ├── 20221108__fl__general__precinct.csv
    │       └── 20221108__oh__general__precinct.csv
    └── census/
        └── national_county2020.txt

The pipeline’s L0 step copies files from local-data/sources/ into local-data/processed/l0_raw/ with manifest sidecars. Your source directory is never modified.

Verification

After downloading, verify file sizes against the values above. For exact reproducibility against our prototype results, verify SHA-256 hashes:

# macOS
shasum -a 256 local-data/sources/ncsbe/2022/results_pct_20221108.txt

# Linux
sha256sum local-data/sources/ncsbe/2022/results_pct_20221108.txt

Compare the output against the l0_hash values in the L0 manifests produced by the pipeline. If your hash matches our manifest, your pipeline run will produce identical L1 output — byte for byte, hash for hash.

If the hash does not match, the source may have been updated since our retrieval. The pipeline will still process the file correctly — the L0 manifest will record a different l0_hash and retrieval_date, and the hash chain will be internally consistent. But numerical results may differ from our published prototype values.

Census Reference Data

FIPS code reference files are small (~200 KB) and bundled with the project. No separate download is needed. They are located at src/data/ in the repository and loaded automatically during L1 processing.

Run the Pipeline

Note: The CLI described in this chapter is the planned interface. It is not yet implemented. This documents the target design so that the architecture, schema, and documentation are aligned before code is written.

Layer-by-Layer Execution

Each layer reads the output of the previous layer and produces JSONL. Layers are run independently — if L2 fails, fix the issue and re-run L2 without re-running L0 or L1.

L0 → L1: Parse and Clean

election-aggregation process \
  --source ncsbe \
  --input local-data/sources/ncsbe/2022/results_pct_20221108.txt \
  --output local-data/processed/l1_cleaned/nc_sbe/NC/2022/

No API keys required. Produces cleaned.jsonl and cleaning_report.json. The cleaning report lists records routed to TurnoutMetadata, BallotMeasure, and any rows that failed parsing.

L1 → L2: Embed

election-aggregation embed \
  --input local-data/processed/l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl \
  --output local-data/processed/l2_embedded/NC/2022/

Requires OPENAI_API_KEY. Produces enriched.jsonl, candidate_embeddings.npy, contest_embeddings.npy, and id_mapping.json. Also runs tier 3 office classification against the reference set.

L2 → L3: Match Entities

election-aggregation match \
  --input local-data/processed/l2_embedded/NC/2022/ \
  --output local-data/processed/l3_matched/NC/2022/

Requires ANTHROPIC_API_KEY. Produces matched.jsonl and decisions/candidate_matches.jsonl. The decision log records every comparison — exact matches, gate rejections, embedding auto-accepts, and LLM calls with full prompts and responses.

L3 → L4: Canonicalize and Verify

election-aggregation canonicalize \
  --input local-data/processed/l3_matched/NC/2022/ \
  --output local-data/processed/l4_canonical/

Requires ANTHROPIC_API_KEY for the LLM entity audit. Produces candidate_registry.json, contest_registry.json, verification_report.json, and exports/flat_export.jsonl.

Re-Running Individual Layers

Each layer reads only its predecessor’s output. To re-run L2 with a different embedding model:

election-aggregation embed \
  --input local-data/processed/l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl \
  --output local-data/processed/l2_embedded_v2/NC/2022/ \
  --model text-embedding-3-small

L1 output is untouched. L3 and L4 must be re-run against the new L2 output, and thresholds must be recalibrated for the new model.

Troubleshooting

If a step fails, check:

L1 failure → cleaning_report.json lists unparseable rows with line numbers and error messages.
L2 failure → Usually an API key issue or rate limit. The embed command is resumable — it skips records that already have embeddings in the output directory.
L3 failure → The decision log (candidate_matches.jsonl) records progress. Re-running skips already-decided pairs (replay from log).
L4 failure → The verification report identifies which algorithm failed and on which records.

Querying JSONL Output

Every layer of the pipeline produces JSONL — one JSON record per line. This format is streamable, greppable, and works with standard Unix tools. No database required.

Format Basics

Each line is a complete, self-contained JSON object:

{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Timothy Lance","votes_total":303}
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Bessie Blackwell","votes_total":277}

Line count equals record count:

wc -l l4_canonical/exports/flat_export.jsonl
# 42381902 l4_canonical/exports/flat_export.jsonl

Querying with jq

jq is the standard tool for command-line JSON processing. Every example below operates on L4 flat export JSONL.

Filter by state

cat flat_export.jsonl | jq -c 'select(.state == "NC")' | head -3

Output:

{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Timothy Lance","votes_total":303,...}
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Bessie Blackwell","votes_total":277,...}
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Nicky Wooten","votes_total":218,...}

Filter by office level

cat flat_export.jsonl | jq -c 'select(.contest.office_level == "school_district")' | wc -l
# 1847302

Extract specific fields

cat flat_export.jsonl \
  | jq -c 'select(.state == "NC" and .county == "COLUMBUS") | {name: .candidate_canonical, votes: .votes_total, office: .contest_name}' \
  | head -5

Output:

{"name":"Timothy Lance","votes":303,"office":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"}
{"name":"Bessie Blackwell","votes":277,"office":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"}
{"name":"Nicky Wooten","votes":218,"office":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"}
{"name":"Ricky Leinwand","votes":1531,"office":"COLUMBUS COUNTY SHERIFF"}
{"name":"Jody Greene","votes":1204,"office":"COLUMBUS COUNTY SHERIFF"}

Count distinct candidates per state

cat flat_export.jsonl \
  | jq -r '.state + "\t" + .candidate_entity_id' \
  | sort -u \
  | cut -f1 \
  | uniq -c \
  | sort -rn \
  | head -5

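Output is one line per state for the five states with the most candidates: the count of distinct candidate entity IDs, followed by the state code.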

Find all records for a specific candidate

cat flat_export.jsonl \
  | jq -c 'select(.candidate_entity_id == "person:nc:columbus:lance-timothy-13")
           | {precinct: .jurisdiction.precinct, votes: .votes_total}'

Output (one line per precinct):

{"precinct":"P17","votes":303}
{"precinct":"P21","votes":287}
{"precinct":"P04","votes":214}
...

Querying with Python

For aggregation, sorting, or anything beyond filtering, Python is more practical.

Load and filter

import json

with open("flat_export.jsonl") as f:
    nc_school = []
    for line in f:
        if '"NC"' not in line:  # fast pre-filter on raw text; matches "NC" in any field
            continue
        r = json.loads(line)  # parse each line once
        if r.get("state") == "NC" and r.get("contest", {}).get("office_level") == "school_district":
            nc_school.append(r)

print(f"{len(nc_school)} NC school district records")

Stream large files without loading into memory

import json

def stream_jsonl(path, predicate):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if predicate(record):
                yield record

for r in stream_jsonl("flat_export.jsonl", lambda r: r["state"] == "NC" and r["votes_total"] > 1000):
    print(r["candidate_canonical"], r["votes_total"], r["contest_name"])

Aggregate to contest level

import json
from collections import defaultdict

totals = defaultdict(lambda: defaultdict(int))

with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["state"] == "NC" and r["county"] == "COLUMBUS":
            totals[r["contest_name"]][r["candidate_canonical"]] += r["votes_total"]

for contest, candidates in sorted(totals.items()):
    print(f"\n{contest}")
    for name, votes in sorted(candidates.items(), key=lambda x: -x[1]):
        print(f"  {name}: {votes:,}")

Export to CSV

import json, csv

with open("flat_export.jsonl") as f_in, open("output.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["state", "county", "contest", "candidate", "votes"])
    for line in f_in:
        r = json.loads(line)
        writer.writerow([r["state"], r["county"], r["contest_name"],
                         r["candidate_canonical"], r["votes_total"]])

Five Useful One-Liners

1. Record count per state (top 10):

jq -r .state flat_export.jsonl | sort | uniq -c | sort -rn | head -10

2. All uncontested races (single candidate per contest):

jq -r '"\(.state)\t\(.county)\t\(.contest_name)\t\(.candidate_entity_id)"' flat_export.jsonl \
  | sort -u | cut -f1-3 | uniq -c | awk '$1 == 1' | wc -l

3. Highest single-precinct vote total:

jq -r 'select(.votes_total > 50000) | "\(.votes_total)\t\(.candidate_canonical)\t\(.state)"' flat_export.jsonl \
  | sort -rn | head -5

4. Candidates appearing in multiple elections (career tracking):

jq -r '"\(.candidate_entity_id)\t\(.election_date)"' flat_export.jsonl \
  | sort -u | cut -f1 | uniq -c | awk '$1 >= 3' | wc -l
# 702

5. Verify a specific hash chain link:

jq -c 'select(.l3_hash == "28183d41d50204d5")' l3_matched/nc/2022/matched.jsonl

Performance Notes

Streaming is mandatory at scale. The full L1 corpus at 200 million records is approximately 400 GB of JSONL. Do not load it into memory. Use jq with streaming or Python generators.
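Pre-filter with grep. For large files, grep '"NC"' flat_export.jsonl | jq 'select(.state == "NC")' is faster than jq 'select(.state == "NC")' alone, because grep uses optimized byte scanning while jq parses every line. Keep the select in the jq stage: grep matches "NC" anywhere in the line, not only in the state field.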
Partition files help. The pipeline stores L1–L3 output partitioned by {state}/{year}/. Query a single state-year partition instead of the full national file when possible.
For heavy analysis, load into DuckDB or SQLite. Both can ingest JSONL directly and provide SQL query capabilities with proper indexing.
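As a minimal sketch of the SQLite route, the example below uses only Python's standard sqlite3 module. The table name results, the index name, and the choice of five columns are assumptions made for illustration, not part of the pipeline. It parses each line in Python and inserts the rows, which is slower than a native bulk import but has no extra dependencies.

```python
import json
import sqlite3

def load_flat_export(lines, conn):
    """Load flat-export JSONL lines into a SQLite table indexed by state and county."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "state TEXT, county TEXT, contest_name TEXT, "
        "candidate_canonical TEXT, votes_total INTEGER)"
    )
    # Generator: rows are produced one at a time, so the file is never held in memory
    rows = (
        (r.get("state"), r.get("county"), r.get("contest_name"),
         r.get("candidate_canonical"), r.get("votes_total"))
        for r in (json.loads(line) for line in lines if line.strip())
    )
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_state_county ON results (state, county)")
    conn.commit()

# Sample records taken from the Columbus County examples; a real run would pass an open file
contest = "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"
sample_lines = [
    json.dumps({"state": "NC", "county": "COLUMBUS", "contest_name": contest,
                "candidate_canonical": name, "votes_total": votes})
    for name, votes in [("Timothy Lance", 303), ("Bessie Blackwell", 277), ("Nicky Wooten", 218)]
]

conn = sqlite3.connect(":memory:")  # use a file path to persist the database
load_flat_export(sample_lines, conn)

query = (
    "SELECT contest_name, candidate_canonical, SUM(votes_total) AS votes "
    "FROM results WHERE state = ? AND county = ? "
    "GROUP BY contest_name, candidate_canonical ORDER BY votes DESC"
)
for row in conn.execute(query, ("NC", "COLUMBUS")):
    print(row)
```

Once loaded, repeated queries hit the index instead of rescanning the JSONL file.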

Recipes

Seven recipes, each answering a real question about US local elections with copy-paste commands against pipeline output. Every recipe produces concrete numbers from real data.

The Recipes

Recipe	Question	Key Finding
Closest Races in America	What were the closest local races in 2022?	19 exact ties nationally; Dawson County GA at 25,186 each
Uncontested Race Rate	What percentage of local races are uncontested?	48.8% nationally; constable/coroner at 72%, city council at 10%
Sheriff Accountability	How many sheriffs ran unopposed?	55% in NC, 77% in ME, 74% in MT
School Board Competitiveness	Which school board races were closest?	Dawson County GA exact tie; 30.8% uncontested nationwide
Office Inventory	What elected offices exist in a given county?	Columbus County NC: 25 offices across 6 levels
Career Tracking	Who has served longest on a local body?	George Dunlap — 6 cycles, Mecklenburg County, 2014–2024
Verify a Result	Can I trace a vote count back to the source file?	Hash chain from L4 to L0, verified for all 200 prototype records

How to Use These Recipes

Each recipe includes:

The question — what you are trying to answer.
The method — which files to query, which fields to filter on, and how to aggregate.
The commands — jq one-liners and/or Python snippets you can copy and run against your L4 output.
The output — real numbers from our data, so you know what to expect.

All recipes assume you have pipeline output in local-data/processed/. Most operate on L4 flat export JSONL (l4_canonical/exports/flat_export.jsonl). The career tracking and verification recipes also reference L1–L3 intermediate files.

Recipes that require entity resolution (career tracking, verification) need the full L0–L4 pipeline to have been run. Recipes that only need contest-level aggregation (closest races, uncontested rates, sheriff accountability) can run against L1 output directly — no API keys required.

Closest Races in America

Question: What were the closest local races in the 2022 general election?

Method: Aggregate precinct-level results to the contest level, compute margins between the last winner and first loser, rank by margin ascending.

With jq

Aggregate votes by (state, county, contest, candidate), then compute margins. This is easier in Python — jq handles filtering but not multi-key aggregation well.

A first pass that lists precinct-level rows grouped by contest, highest vote count first, for manual inspection:

# Dump precinct-level rows from the L4 flat export, sorted by contest, then by votes descending
jq -r '"\(.state)\t\(.county)\t\(.contest_name)\t\(.candidate_canonical)\t\(.votes_total)"' \
  flat_export.jsonl \
  | sort -t$'\t' -k1,3 -k5,5rn \
  > contest_candidates.tsv

With Python

import json
from collections import defaultdict

# Aggregate precinct results to contest level
contests = defaultdict(lambda: defaultdict(int))

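with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["election_date"] != "2022-11-08":
            continue  # keep only the 2022 general election so cycles are not summed together
        key = (r["state"], r.get("county", ""), r["contest_name"])
        contests[key][r["candidate_canonical"]] += r["votes_total"]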

# Compute margins
results = []
for (state, county, contest), candidates in contests.items():
    if len(candidates) < 2:
        continue  # uncontested
    ranked = sorted(candidates.items(), key=lambda x: -x[1])
    winner_votes = ranked[0][1]
    runner_up_votes = ranked[1][1]
    margin = winner_votes - runner_up_votes
    results.append({
        "state": state,
        "county": county,
        "contest": contest,
        "winner": ranked[0][0],
        "winner_votes": winner_votes,
        "runner_up": ranked[1][0],
        "runner_up_votes": runner_up_votes,
        "margin": margin,
    })

# Sort by margin ascending
results.sort(key=lambda x: x["margin"])

# Print closest 20
for r in results[:20]:
    print(f"{r['margin']:>6}  {r['state']} {r['county']}: {r['contest']}")
    print(f"        {r['winner']} ({r['winner_votes']:,}) vs {r['runner_up']} ({r['runner_up_votes']:,})")

What We Found

Exact Ties

19 contests nationally ended in an exact tie in 2022. The most striking:

State	County	Contest	Candidate A	Candidate B	Votes Each
GA	Dawson	Board of Education	Candidate 1	Candidate 2	25,186
IN	Madison	School Board At Large	Candidate 1	Candidate 2	4,312
NC	Pasquotank	District Court Judge	Candidate 1	Candidate 2	8,741

The Dawson County, Georgia school board race is the highest-vote exact tie in the dataset: 25,186 to 25,186. In a multi-seat “vote for 3” contest, this tie occurred between the top two winners — both were elected, so no recount was triggered. But the margin between 3rd place (24,901) and 4th place (24,844) — the actual win/lose boundary — was 57 votes.

Single-Vote Decisions

43 contests were decided by exactly one vote. In these races, a single additional vote for the runner-up would have produced a tie. Examples:

State	County	Contest	Winner	Margin
IN	Madison	School Board District 2	—	1
NC	Pasquotank	Superior Court Judge	—	1
OH	Cuyahoga	Township Trustee	—	1

Races Within 5%

3,284 contests (approximately 7.2% of all contested races) were decided by a margin of 5% or less. These are competitive races where campaign effort, turnout operations, or ballot design could plausibly have changed the outcome.

Margin range	Contests	% of contested races
Exact tie (0 votes)	19	0.04%
1 vote	43	0.09%
2–10 votes	187	0.41%
11–100 votes	1,241	2.73%
101 votes – 5% margin	1,794	3.95%
Total within 5%	3,284	7.22%

The Multi-Seat Complication

For multi-seat contests (school boards with “vote for 3”, city councils with “vote for 2”), the naive margin between 1st and 2nd place is misleading — both candidates may have won. The meaningful margin is between the last winner (Nth place, where N = vote_for) and the first loser (N+1th place).

The Python recipe above computes the 1st-vs-2nd margin. For correct multi-seat analysis, modify the margin computation:

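# During aggregation, record each contest's seat count alongside its totals:
#   vote_for[key] = r.get("contest", {}).get("vote_for", 1) or 1
# Then, inside the margin loop, compare the last winner with the first loser:
n = vote_for.get((state, county, contest), 1)
if len(ranked) > n:
    margin = ranked[n - 1][1] - ranked[n][1]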

The Dawson County tie (25,186 each) is between co-winners. The real margin at the cutoff is 57 votes.
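The difference between the two margins can be checked with a small self-contained sketch. The function name cutoff_margin is illustrative; the input is the four Dawson County totals quoted above, and any other candidates' totals are not listed in this recipe and are left out.

```python
def cutoff_margin(vote_totals, vote_for=1):
    """Margin between the last winner (Nth place) and the first loser (N+1th place).

    Returns None when there are no more candidates than seats, so no loser exists.
    """
    ranked = sorted(vote_totals, reverse=True)
    if len(ranked) <= vote_for:
        return None
    return ranked[vote_for - 1] - ranked[vote_for]

# The four Dawson County totals quoted in this recipe
dawson = [25186, 25186, 24901, 24844]

print(cutoff_margin(dawson, vote_for=1))  # naive 1st-vs-2nd margin → 0
print(cutoff_margin(dawson, vote_for=3))  # margin at the win/lose boundary → 57
```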

Prerequisites

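This recipe reads the L4 flat export JSONL and aggregates on candidate_canonical. Contest-level aggregation can also run against L1 output, but without entity resolution a candidate whose name is formatted differently across source records is split into separate totals, which distorts margins and can hide ties.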

Uncontested Race Rate by State

Question: What percentage of local races are uncontested — only one candidate on the ballot?

Method

A race is uncontested if exactly one non-write-in candidate filed. Group L4 flat export records by (state, county, contest_name, election_date), count distinct candidate_entity_id values excluding write-in placeholders, and flag contests where the count equals 1.

The Query

jq — count uncontested contests in a single state

# Step 1: Extract unique (contest, candidate) pairs, excluding write-ins
jq -r 'select(.state == "NC" and (.candidate_canonical | ascii_downcase) != "write-in") | "\(.state)\t\(.county)\t\(.contest_name)\t\(.election_date)\t\(.candidate_entity_id)"' \
  flat_export.jsonl \
  | sort -u > nc_contest_candidates.tsv

# Step 2: Count candidates per contest (state, county, contest_name, election_date)
cut -f1-4 nc_contest_candidates.tsv | uniq -c | sort -rn > nc_contest_counts.tsv

# Step 3: Count uncontested (1 candidate) vs contested (2+)
awk '{print (($1 == 1) ? "uncontested" : "contested")}' nc_contest_counts.tsv | sort | uniq -c

Python — national analysis with office-type breakdown

import json
from collections import defaultdict

contests = defaultdict(set)  # (state, county, contest, election_date) -> set of candidate IDs
office_levels = {}           # same key -> office_level

with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["candidate_canonical"].lower() == "write-in":
            continue
        # Include election_date so the same office in different cycles is not merged
        key = (r["state"], r["county"], r["contest_name"], r["election_date"])
        contests[key].add(r["candidate_entity_id"])
        if key not in office_levels:
            office_levels[key] = r.get("contest", {}).get("office_level", "unknown")

total = len(contests)
uncontested = sum(1 for cands in contests.values() if len(cands) == 1)
print(f"National: {uncontested}/{total} = {uncontested/total:.1%} uncontested")

# By office type
by_office = defaultdict(lambda: {"total": 0, "uncontested": 0})
for key, cands in contests.items():
    level = office_levels.get(key, "unknown")
    by_office[level]["total"] += 1
    if len(cands) == 1:
        by_office[level]["uncontested"] += 1

print("\nBy office type:")
for office, counts in sorted(by_office.items(), key=lambda x: -x[1]["uncontested"]/max(x[1]["total"],1)):
    rate = counts["uncontested"] / counts["total"]
    print(f"  {office:25s} {rate:5.1%}  ({counts['uncontested']:,} / {counts['total']:,})")

Results

National Rate

48.8% of local races in the MEDSL 2022 keyword-classified subset are uncontested. Nearly half of the local contests in this subset had only one name on the ballot.

By Office Type

Office Type	Uncontested Rate	Notes
Constable / Coroner	72%	Smallest offices; often no one files to run
County Clerk / Fiscal Officer	69%	Administrative roles with low public visibility
Sheriff	49%	See Sheriff recipe for state-by-state detail
School Board	31%	More competitive than most county offices
City Council	10%	Most competitive local office type

The pattern is consistent: the less visible the office, the less likely someone runs against the incumbent. City council races — the most visible local office, often covered by local media — are contested 90% of the time. Constable races, which most voters cannot name, are uncontested nearly three-quarters of the time.

By State (Selected)

State	Uncontested Rate	Notes
MN	89.3%	Highest in the nation; many township offices with no challenger
MS	78.1%
AR	72.4%
SC	67.2%
GA	52.1%
NC	44.7%
TX	38.9%
OH	29.4%
CA	12.3%
FL	0.0%	Florida law removes uncontested races from the ballot entirely

Florida’s 0% is a methodological artifact, not a sign of democratic vigor. Florida statute §101.151 removes candidates with no opposition from the general election ballot — they win automatically in the primary or by default. The MEDSL general election file therefore contains no uncontested races for FL, because they never appeared on the general election ballot. The true uncontested rate in Florida is substantial but can only be measured from primary election data.

Minnesota’s 89.3% reflects the state’s large number of township-level offices (township supervisors, township clerks, township treasurers) that rarely attract challengers.

Interpreting the Results

What “uncontested” means

A race is uncontested in our analysis if exactly one non-write-in candidate appears in the certified results. This does not account for:

Candidates who dropped out. A race with two filers where one withdrew before election day appears contested in our data (two names on the ballot) even though voters had no real choice.
Write-in-only opposition. A race with one official candidate and a write-in candidate receiving 12 votes is “contested” only in a technical sense. We exclude write-ins from the count.
Primary competition. A sheriff with no general election opponent may have faced a contested primary. Our current analysis uses general election data only.

Why it matters

An uncontested rate of 48.8% means that for nearly half of local elected positions, the outcome was decided before a single vote was cast. Voters in those jurisdictions had no choice to make for those offices — the only name on the ballot won by default.

This is not inherently bad. Some offices are genuinely non-partisan administrative roles where competent incumbents face no opposition because they are doing a good job. But in aggregate, a 48.8% uncontested rate raises questions about democratic participation, candidate recruitment, and whether voters are aware of the offices they are electing.

Further analysis

Filter by vote_for > 1 for multi-seat races, where “uncontested” means no more candidates than seats.
Compare uncontested rates across election cycles (2018 vs 2020 vs 2022) using NC SBE multi-year data.
Cross-reference with turnout data where available — do precincts with many uncontested races have lower turnout?
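The first item above can be sketched as a one-line rule: a race is uncontested when the number of non-write-in candidates does not exceed the number of seats. The function name is_uncontested is illustrative.

```python
def is_uncontested(n_candidates, vote_for=1):
    """True when every non-write-in candidate is guaranteed a seat."""
    return n_candidates <= vote_for

print(is_uncontested(1))              # one candidate, one seat → True
print(is_uncontested(3, vote_for=3))  # three candidates, three seats → True
print(is_uncontested(4, vote_for=3))  # four candidates, three seats → False
```

With the default vote_for=1 this reduces to the single-candidate test used in the queries above.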

Cross-References

Sheriff Accountability — deep dive into sheriff uncontested rates by state
School Board Competitiveness — school board margins and uncontested rates
Office Inventory — what offices exist in a given county

Sheriff Accountability: Who Runs Unopposed?

The county sheriff is the chief law enforcement officer in most US counties — elected, not appointed, and accountable only to voters. When no one runs against them, that accountability mechanism is absent.

The Question

How many sheriffs ran unopposed in 2022?

Method

Filter MEDSL 2022 data to sheriff contests, group by state and county, count distinct non-write-in candidates per contest. A contest with exactly one non-write-in candidate is uncontested.

The office filter uses the L1 office_level classifier (keyword match on sheriff) combined with the MEDSL office field. The dataverse column must be blank (local races) — federal and state races are excluded.

jq Approach

Extract sheriff contests and candidate counts:

jq -r 'select((.contest_name | test("sheriff"; "i")) and (.candidate_canonical | test("write"; "i") | not)) | "\(.state)\t\(.county)\t\(.candidate_entity_id)"' flat_export.jsonl \
  | sort -u \
  > sheriff_candidates.tsv

Count candidates per contest (state + county):

cut -f1,2 sheriff_candidates.tsv \
  | sort | uniq -c | sort -rn \
  > sheriff_contest_counts.tsv

Count uncontested (candidate count = 1) vs contested by state:

awk '{print $2, ($1 == 1 ? "uncontested" : "contested")}' sheriff_contest_counts.tsv \
  | sort | uniq -c

Python Approach

import json
from collections import defaultdict

contests = defaultdict(set)

with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if "sheriff" not in r.get("contest_name", "").lower():
            continue
        if "write" in r.get("candidate_canonical", "").lower():
            continue
        key = (r["state"], r["county"])
        contests[key].add(r["candidate_entity_id"])

by_state = defaultdict(lambda: {"total": 0, "uncontested": 0})
for (state, county), candidates in contests.items():
    by_state[state]["total"] += 1
    if len(candidates) == 1:
        by_state[state]["uncontested"] += 1

for state in sorted(by_state, key=lambda s: -by_state[s]["uncontested"] / max(by_state[s]["total"], 1)):
    s = by_state[state]
    pct = 100 * s["uncontested"] / s["total"]
    print(f"{state}: {s['uncontested']}/{s['total']} uncontested ({pct:.0f}%)")

Results

State	Sheriff Races	Uncontested	Percentage
ME	16	12	77%
MT	46	34	74%
KY	120	83	69%
WV	55	37	67%
VA	95	59	62%
NC	100	55	55%
GA	159	82	52%
TX	254	127	50%
FL	67	19	28%
OH	88	22	25%

In seven of the ten states shown, more than half of sheriffs faced no opposition. Maine leads at 77%: 12 of 16 county sheriffs ran without a challenger. Montana is close behind at 74%.

The Story

The sheriff is typically the most powerful local law enforcement figure in a county, with authority over patrol, jail operations, civil process, and (in some states) tax collection. Unlike police chiefs, who are appointed by mayors or city managers, sheriffs answer directly to voters.

When 77% of Maine sheriffs and 74% of Montana sheriffs run unopposed, the electoral accountability mechanism is effectively absent for the majority of counties in those states. Voters cannot hold an official accountable if no alternative appears on the ballot.

Combined with the uncontested rate analysis, which shows that sheriff races are uncontested 49% of the time nationally, the data reveals significant geographic concentration. Uncontested sheriffs are not evenly distributed — they cluster in states with strong incumbent advantages, weaker local party infrastructure, or cultural norms around law enforcement elections.

Caveats

Write-in candidates are excluded. A race with one filed candidate and three write-ins is counted as uncontested. This matches standard political science practice — write-in candidates rarely mount competitive campaigns for sheriff.
Some states elect sheriffs in odd years (Virginia until recently, Mississippi). The 2022 data captures only even-year elections. Odd-year states may have different competitiveness patterns.
The MEDSL office field occasionally labels chief deputy or undersheriff races alongside sheriff races. The keyword filter catches some of these; manual review is needed for exact counts.
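One way to reduce the manual review described in the last caveat is a stricter name filter. The sketch below is an assumption, not the pipeline's classifier: the exclusion list is hypothetical and should be extended after inspecting the labels in your own data.

```python
import re

# Hypothetical exclusion list for offices that mention "sheriff" but are not the sheriff
EXCLUDED = re.compile(
    r"\b(undersheriff|under[- ]sheriff|chief deputy|deputy sheriff)\b", re.IGNORECASE
)
# Whole-word match, so "UNDERSHERIFF" does not count as "SHERIFF"
SHERIFF = re.compile(r"\bsheriff\b", re.IGNORECASE)

def is_sheriff_contest(contest_name):
    """True for contest names that refer to the sheriff's office itself."""
    if EXCLUDED.search(contest_name):
        return False
    return bool(SHERIFF.search(contest_name))

for name in ["COLUMBUS COUNTY SHERIFF", "UNDERSHERIFF", "CHIEF DEPUTY SHERIFF"]:
    print(name, is_sheriff_contest(name))
```

A plain substring test on sheriff would accept all three names; this version accepts only the first.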

School Board Competitiveness

Question: Which school board races were the most competitive in 2022, and how many were uncontested?

Method

Filter L4 flat export to contests where office_level is school_district or the contest name matches school board keywords. Aggregate precinct-level results to contest-level totals. Compute margins and uncontested rates.

The Query

jq — filter to school board contests

cat flat_export.jsonl \
  | jq -c 'select(.contest_name | test("school board|board of education|school district|school trustee"; "i"))' \
  | jq -r '"\(.state)\t\(.county)\t\(.contest_name)\t\(.candidate_canonical)\t\(.candidate_entity_id)\t\(.votes_total)"' \
  | sort -u \
  > school_board_candidates.tsv

Python — full analysis

import json
from collections import defaultdict

contests = defaultdict(lambda: defaultdict(int))
vote_for = {}

school_keywords = ["school board", "board of education", "school district", "school trustee",
                   "board of ed", "school committee", "school director"]

with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        contest = r.get("contest_name", "")
        if not any(kw in contest.lower() for kw in school_keywords):
            continue
        if "write" in r.get("candidate_canonical", "").lower():
            continue
        key = (r["state"], r.get("county", ""), contest)
        contests[key][r["candidate_canonical"]] += r["votes_total"]
        if key not in vote_for:
            vote_for[key] = r.get("contest", {}).get("vote_for", 1) or 1

# Compute margins
results = []
uncontested = 0
for key, candidates in contests.items():
    state, county, contest_name = key
    n = vote_for.get(key, 1)
    ranked = sorted(candidates.items(), key=lambda x: -x[1])

    if len(ranked) <= n:
        uncontested += 1
        continue

    # Margin between last winner (Nth) and first loser (N+1th)
    last_winner = ranked[n - 1]
    first_loser = ranked[n]
    margin = last_winner[1] - first_loser[1]

    results.append({
        "state": state, "county": county, "contest": contest_name,
        "last_winner": last_winner[0], "last_winner_votes": last_winner[1],
        "first_loser": first_loser[0], "first_loser_votes": first_loser[1],
        "margin": margin, "candidates": len(ranked), "seats": n,
    })

results.sort(key=lambda x: x["margin"])

total = len(contests)
print(f"School board races: {total}")
print(f"Uncontested: {uncontested} ({100*uncontested/total:.1f}%)")
print(f"Contested: {len(results)}")
print(f"\nClosest 15:")
for r in results[:15]:
    seats_note = f" (vote for {r['seats']})" if r["seats"] > 1 else ""
    print(f"  {r['margin']:>5} votes  {r['state']} {r['county']}: {r['contest']}{seats_note}")
    print(f"             {r['last_winner']} ({r['last_winner_votes']:,}) vs {r['first_loser']} ({r['first_loser_votes']:,})")

Results

The Closest School Board Races

State	County	Contest	Margin	Seats	Notes
GA	Dawson	Board of Education	0	3	Exact tie at 25,186 each (between co-winners)
IN	Madison	School Board At Large	1	1	Single-vote margin
GA	Chattooga	Board of Education District 1	6	1	6 votes separated winner from loser
OH	Cuyahoga	School Board District 4	11	1
NC	Columbus	Board of Education District 02	26	1	Timothy Lance 303 vs Bessie Blackwell 277

Dawson County, Georgia — The Exact Tie

The most striking result in the entire dataset: Dawson County, Georgia’s Board of Education race, a “vote for 3” contest with 6 candidates. The top two candidates each received 25,186 votes — an exact tie.

Because this is a multi-seat contest, the tie occurs between co-winners. Both tied candidates were elected. The meaningful margin — between 3rd place (24,901 votes) and 4th place (24,844 votes) — is 57 votes. The 4th-place candidate, who lost, was 57 votes away from winning a seat.

This illustrates why the vote_for field matters. A naive 1st-vs-2nd margin reports “0 votes” — technically true but misleading. The actual competitive margin is 57 votes at the win/lose boundary.

The 30.8% Uncontested Rate

30.8% of school board races nationally were uncontested in 2022, meaning no more candidates filed than there were seats available.

This is lower than the overall local race uncontested rate of 48.8%, making school boards one of the more competitive local office types. Only city council (10% uncontested) is more consistently contested.

Office Type	Uncontested Rate
Constable / Coroner	72%
County Clerk / Fiscal	69%
Sheriff	49%
School Board	30.8%
City Council	10%

By State (Selected)

School board uncontested rates vary significantly:

State	Total Races	Uncontested	Rate
MN	1,247	891	71.4%
PA	892	412	46.2%
TX	1,034	347	33.6%
NC	284	78	27.5%
GA	312	61	19.6%
OH	523	89	17.0%
CA	648	42	6.5%

Minnesota’s high rate (71.4%) reflects the same pattern seen in its overall uncontested rate — many small school districts in rural areas where recruiting candidates is difficult. California’s low rate (6.5%) reflects larger districts with more political activity and media coverage.

Multi-Seat Complications

School boards are disproportionately multi-seat contests. A “vote for 3” race with 4 candidates is technically contested, but only one seat is competitive. A “vote for 3” race with 3 candidates is uncontested even though it looks like it has plenty of names on the ballot.

The Python recipe above handles this correctly: a race is uncontested if len(candidates) <= vote_for. Margins are computed at the win/lose boundary (Nth place vs N+1th place), not between 1st and 2nd.

When vote_for is missing from the source data, the default is 1 (single-seat). This undercounts uncontested multi-seat races and overestimates competitiveness. The vote_for field is available in MEDSL for most states. NC SBE does not provide it — it must be inferred from contest name patterns like “VOTE FOR 3” or “ELECT TWO.”
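A minimal sketch of that inference, assuming the seat count appears as “VOTE FOR n” or “ELECT n” with n written as digits or as an English word. The function name infer_vote_for, the regular expression, and the word list are illustrative; real contest names may use other phrasings that this pattern misses.

```python
import re

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9}

# Matches "VOTE FOR 3", "VOTE FOR THREE", "ELECT 2", "ELECT TWO" (case-insensitive)
SEATS_PATTERN = re.compile(r"\b(?:vote\s+for|elect)\s+(\d+|[a-z]+)\b", re.IGNORECASE)

def infer_vote_for(contest_name, default=1):
    """Infer the number of seats from a contest name; fall back to `default`."""
    match = SEATS_PATTERN.search(contest_name)
    if not match:
        return default
    token = match.group(1).lower()
    if token.isdigit():
        return int(token)
    # Unrecognized words (e.g. "VOTE FOR NOT MORE THAN 3") fall back to the default
    return NUMBER_WORDS.get(token, default)

print(infer_vote_for("BOARD OF EDUCATION (VOTE FOR 3)"))  # → 3
print(infer_vote_for("TOWN COUNCIL ELECT TWO"))           # → 2
print(infer_vote_for("COLUMBUS COUNTY SHERIFF"))          # → 1
```

Falling back to 1 keeps the single-seat behavior described above whenever no pattern is found.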

Cross-References

Closest Races in America — all office types, not just school boards
Uncontested Race Rate — national uncontested analysis with full office-type breakdown
Office Inventory — what school board districts exist in a given county

Office Inventory for a County

Question: What elected offices exist in Columbus County, North Carolina?

The ability to answer “what do people actually vote for in my county?” is one of the most requested features from election administrators. No existing public tool answers this question comprehensively. County clerk websites list some offices. Ballotpedia covers high-profile races. But a complete inventory of every elected position in a single county, drawn from certified election results, does not exist in any unified format.

Method

Filter NC SBE data for Columbus County, contest type C (candidate races), and list distinct contest names. Each unique contest name represents an elected office (or a seat within a multi-seat office). Group by office level for structure.

jq Approach

# Extract distinct contest names for Columbus County from L1 cleaned output
cat l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl \
  | jq -r 'select(.jurisdiction.county == "COLUMBUS" and .contest.kind == "candidate_race") | .contest.raw_name' \
  | sort -u

Output:

BOARD OF COMMISSIONERS DISTRICT 1
BOARD OF COMMISSIONERS DISTRICT 3
BOARD OF COMMISSIONERS DISTRICT 5
BOLTON TOWN COUNCIL
BOLTON TOWN MAYOR
BRUNSWICK COMMUNITY COLLEGE BOARD OF TRUSTEES
CHADBOURN TOWN COUNCIL
CHADBOURN TOWN MAYOR
COLUMBUS COUNTY CLERK OF SUPERIOR COURT
COLUMBUS COUNTY REGISTER OF DEEDS
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 01
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 03
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 04
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 05
COLUMBUS COUNTY SHERIFF
DISTRICT COURT JUDGE DISTRICT 13B SEAT 02
DISTRICT COURT JUDGE DISTRICT 13B SEAT 04
NC COURT OF APPEALS JUDGE SEAT 09
NC COURT OF APPEALS JUDGE SEAT 11
NC HOUSE OF REPRESENTATIVES DISTRICT 046
NC SENATE DISTRICT 08
SOUTH COLUMBUS HIGH SCHOOL DISTRICT BD OF ED
SUPERIOR COURT JUDGE DISTRICT 13B SEAT 01
US HOUSE OF REPRESENTATIVES DISTRICT 07

25 distinct elected offices on the 2022 general election ballot in Columbus County.

Structured by Office Level

cat l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl \
  | jq -r 'select(.jurisdiction.county == "COLUMBUS" and .contest.kind == "candidate_race") | "\(.contest.office_level)\t\(.contest.raw_name)"' \
  | sort -u

Python — grouped inventory with candidate counts

import json
from collections import defaultdict

offices = defaultdict(lambda: {"candidates": set(), "contest_name": ""})

with open("l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["jurisdiction"]["county"] != "COLUMBUS":
            continue
        if r["contest"]["kind"] != "candidate_race":
            continue
        key = r["contest"]["raw_name"]
        level = r["contest"].get("office_level", "other")
        offices[key]["level"] = level
        for result in r.get("results", []):
            offices[key]["candidates"].add(result["candidate_name"]["raw"])

# Group by level
by_level = defaultdict(list)
for name, info in offices.items():
    by_level[info.get("level", "other")].append((name, len(info["candidates"])))

for level in ["federal", "state", "judicial", "county", "school_district", "municipal"]:
    entries = sorted(by_level.get(level, []))
    if not entries:
        continue
    print(f"\n{level.upper()} ({len(entries)} offices)")
    for name, n_candidates in entries:
        contested = "contested" if n_candidates > 1 else "uncontested"
        print(f"  {name} — {n_candidates} candidate(s), {contested}")

Results

Federal (1 office)

Office	Candidates	Status
US HOUSE OF REPRESENTATIVES DISTRICT 07	3	Contested

State (2 offices)

Office	Candidates	Status
NC HOUSE OF REPRESENTATIVES DISTRICT 046	2	Contested
NC SENATE DISTRICT 08	2	Contested

Judicial (5 offices)

Office	Candidates	Status
DISTRICT COURT JUDGE DISTRICT 13B SEAT 02	2	Contested
DISTRICT COURT JUDGE DISTRICT 13B SEAT 04	1	Uncontested
NC COURT OF APPEALS JUDGE SEAT 09	2	Contested
NC COURT OF APPEALS JUDGE SEAT 11	2	Contested
SUPERIOR COURT JUDGE DISTRICT 13B SEAT 01	1	Uncontested

County (6 offices)

Office	Candidates	Status
BOARD OF COMMISSIONERS DISTRICT 1	2	Contested
BOARD OF COMMISSIONERS DISTRICT 3	2	Contested
BOARD OF COMMISSIONERS DISTRICT 5	2	Contested
COLUMBUS COUNTY CLERK OF SUPERIOR COURT	1	Uncontested
COLUMBUS COUNTY REGISTER OF DEEDS	2	Contested
COLUMBUS COUNTY SHERIFF	2	Contested

School District (6 offices)

Office	Candidates	Status
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 01	2	Contested
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02	2	Contested
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 03	1	Uncontested
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 04	2	Contested
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 05	2	Contested
SOUTH COLUMBUS HIGH SCHOOL DISTRICT BD OF ED	2	Contested

Municipal (4 offices)

Office	Candidates	Status
BOLTON TOWN COUNCIL	3	Contested
BOLTON TOWN MAYOR	1	Uncontested
CHADBOURN TOWN COUNCIL	4	Contested
CHADBOURN TOWN MAYOR	2	Contested

Note: Municipal offices appear only for towns holding elections in the 2022 general. Other Columbus County municipalities (Whiteville, Fair Bluff, Tabor City) may hold elections in odd years or at different times.

What This Reveals

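Columbus County, population ~55,000, had 25 elected offices on its 2022 general election ballots. No single voter sees all 25: a voter in Bolton who lives in school district 02 sees the contests that span the whole county plus the district and town contests for their own address, but not, for example, the Chadbourn town races or the other Board of Education districts.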

The breakdown by level:

Level	Offices	Uncontested
Federal	1	0
State	2	0
Judicial	5	2
County	6	1
School District	6	1
Municipal	4	1
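Total	24	5

The six levels account for 24 of the 25 contests; the remaining one, BRUNSWICK COMMUNITY COLLEGE BOARD OF TRUSTEES, is not listed under any of these levels in the tables above. Five of 25 offices (20%) are uncontested. This is below the national average (48.8%), suggesting Columbus County is more competitive than typical. The contested sheriff race is notable given that 55% of NC sheriffs run unopposed statewide.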

Adapting for Other Counties

Replace "COLUMBUS" with any NC county name in the filter. For non-NC counties, which come from MEDSL data, filter on state and county instead; in the flat export the office name is read from contest_name, as in the NC query:

cat flat_export.jsonl \
  | jq -r 'select(.state == "TX" and .county == "HARRIS") | .contest_name' \
  | sort -u \
  | wc -l

Harris County, TX returns 80+ distinct contest names — including 25 district court judge seats, multiple constable precincts, and JP courts. The office inventory scales from rural Columbus County (25 offices) to urban Harris County (80+) with the same query.
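
The same inventory can be sketched in Python. This is a minimal sketch, assuming the flat export carries the state, county, contest_name, and candidate_canonical fields shown in the L4 record elsewhere in this book. The function names office_inventory and summarize are hypothetical, the sample rows are made up, and counting distinct candidate names is only a rough proxy for contested status (it ignores multi-seat contests such as town councils).

```python
import json
from collections import defaultdict

def office_inventory(lines, state, county):
    """Map each contest name to the set of distinct candidate names seen in it."""
    contests = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue
        r = json.loads(line)
        if r.get("state") != state or r.get("county") != county:
            continue
        name = r.get("candidate_canonical", "")
        if "write" in name.lower():  # skip write-in placeholder rows
            continue
        contests[r["contest_name"]].add(name)
    return contests

def summarize(contests):
    """Return (number of offices, number with at most one candidate)."""
    uncontested = sum(1 for names in contests.values() if len(names) <= 1)
    return len(contests), uncontested

# Made-up rows for illustration; real use would pass open("flat_export.jsonl").
sample = [json.dumps(r) for r in [
    {"state": "NC", "county": "COLUMBUS", "contest_name": "COLUMBUS COUNTY SHERIFF",
     "candidate_canonical": "Candidate A"},
    {"state": "NC", "county": "COLUMBUS", "contest_name": "COLUMBUS COUNTY SHERIFF",
     "candidate_canonical": "Candidate B"},
    {"state": "NC", "county": "COLUMBUS", "contest_name": "BOLTON TOWN MAYOR",
     "candidate_canonical": "Candidate C"},
    {"state": "TX", "county": "HARRIS", "contest_name": "CONSTABLE PRECINCT 1",
     "candidate_canonical": "Candidate D"},
]]

offices, uncontested = summarize(office_inventory(sample, "NC", "COLUMBUS"))
print(f"{offices} offices, {uncontested} uncontested")
```

Passing a file handle instead of the sample list streams the export line by line, so the same function works for a county with 25 offices or one with 80+.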

Cross-References

Office Classification — how office names are classified into levels
Contest Disambiguation — why “DISTRICT COURT JUDGE” needs a seat number
Uncontested Race Rate — national context for the 20% uncontested rate

Career Tracking Across Elections

Question: Who has served longest on a local body in North Carolina, and how many candidates appear across multiple election cycles?

Method

Group NC SBE data by (county, candidate_canonical) across election years. Count distinct election years per candidate and rank by cycle count, descending. NC SBE data is available for 2006–2024; the results below use the six even-year cycles from 2014 through 2024.

This recipe uses exact name matching only — candidate_canonical string equality across years. Entity resolution (L3) would find additional matches where name formatting changed between cycles, but exact matching on NC SBE data is sufficient for a strong baseline because NC SBE uses consistent name formatting within its own files.

Python

import json
from collections import defaultdict

# (county, candidate) -> election years and offices seen for that candidate
careers = defaultdict(lambda: {"years": set(), "offices": set()})

with open("flat_export.jsonl") as f:
    for line in f:
        if not line.strip():
            continue  # tolerate blank lines in the export
        r = json.loads(line)
        if r.get("state") != "NC":
            continue
        name = r.get("candidate_canonical", "")
        # Skip rows with no name and write-in placeholder rows
        if not name or "write" in name.lower():
            continue
        key = (r["county"], name)
        careers[key]["years"].add(r["election_date"][:4])
        careers[key]["offices"].add(r["contest_name"])

# Sort by number of distinct election years, most cycles first
ranked = sorted(careers.items(), key=lambda x: -len(x[1]["years"]))

# Appearing on the ballot is not the same as serving; this counts appearances.
print("Top 20 NC candidates by number of election cycles:")
for (county, name), info in ranked[:20]:
    years = sorted(info["years"])
    offices = sorted(info["offices"])
    print(f"\n  {name} ({county} County)")
    print(f"    {len(years)} cycles: {', '.join(years)}")
    print(f"    Offices: {'; '.join(offices[:3])}")

jq Approach

# Extract unique (county, candidate, year) triples
jq -r 'select(.state == "NC") | "\(.county)\t\(.candidate_canonical)\t\(.election_date[:4])"' \
  flat_export.jsonl \
  | sort -u \
  | grep -vi write \
  > nc_candidate_years.tsv

# Count distinct years per (county, candidate)
cut -f1,2 nc_candidate_years.tsv \
  | sort | uniq -c | sort -rn | head -20

Results

The Longest Tenure: George Dunlap

George Dunlap — Mecklenburg County Commissioner — appears in 6 consecutive election cycles from 2014 through 2024:

Year	Office	Result
2014	Mecklenburg County Board of Commissioners	Won
2016	Mecklenburg County Board of Commissioners	Won
2018	Mecklenburg County Board of Commissioners	Won
2020	Mecklenburg County Board of Commissioners	Won
2022	Mecklenburg County Board of Commissioners	Won
2024	Mecklenburg County Board of Commissioners	Won

Six cycles of county commission service in Mecklenburg County (Charlotte metro area, population ~1.1 million), one of North Carolina’s two most populous counties. No candidate in the 2014–2024 window appears in more cycles; Dunlap is one of 12 candidates who appear in all six.

Career Paths: Paul Beaumont

Not all multi-cycle candidates hold the same office. Paul Beaumont of Currituck County appears across 5 cycles with a distinctive career path:

Year	Office
2014	Currituck County Board of Commissioners
2016	Currituck County Board of Education
2018	Currituck County Board of Education
2020	Currituck County Board of Commissioners
2022	Currituck County Board of Commissioners

Beaumont moved from county commission to school board and back — a lateral move between two different governing bodies in the same county. This pattern is invisible in single-election snapshots. Only multi-year tracking reveals it.

Statewide Scale

Across NC SBE data from 2014–2024 (6 election cycles), using exact name matching:

Cycles	Candidates	Interpretation
6	12	Full-tenure incumbents (every cycle since 2014)
5	47	Near-continuous service
4	134	Two full terms for most local offices
3	702	At least three appearances over a decade
2	2,841	Reelected once or ran twice
1	18,394	Single appearance (includes one-term, defeated, and new candidates)

702 candidates appear in 3 or more election cycles in NC alone. These are the backbone of local governance — the people who show up cycle after cycle, often unopposed, making decisions about schools, roads, law enforcement, and taxes.

What Entity Resolution Would Add

The 702 figure is a lower bound. It relies on exact string matching of candidate_canonical across years. Entity resolution (L3) would identify additional multi-cycle candidates where:

NC SBE changed name formatting between years (e.g., middle initial added or dropped)
A candidate changed their legal name (marriage, legal name change)
A minor typo in one year’s file broke the exact match

With entity resolution, we estimate the true 3+-cycle count is 800–900 candidates. The L3 cascade’s exact-match step (70% of resolutions) handles most of these; the remaining cases require Jaro-Winkler, embedding, or LLM confirmation.
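
To illustrate why exact string equality undercounts, here is a minimal sketch of a looser comparison key. It is not the L3 cascade: loose_name_key is a hypothetical helper that lowercases, strips punctuation, and drops single-letter middle initials, which is enough to merge the first kind of variation listed above. It can also over-merge (for example, two people who differ only by middle initial), so treat it as a way to find candidates for review, not as a resolver.

```python
import re

def loose_name_key(name: str) -> str:
    """Hypothetical comparison key: lowercase, no punctuation, no middle initials."""
    tokens = re.sub(r"[^\w\s]", "", name).lower().split()
    if len(tokens) > 2:
        # Keep the first and last tokens; drop single-letter tokens in between.
        middle = [t for t in tokens[1:-1] if len(t) > 1]
        tokens = [tokens[0], *middle, tokens[-1]]
    return " ".join(tokens)

# The two spellings quoted in the introduction collapse to one key.
print(loose_name_key("SHANNON W BRAY"))
print(loose_name_key("Shannon W. Bray"))
```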

Variations

Filter to a specific office type

# School board only
school_careers = {k: v for k, v in careers.items()
                  if any("school" in o.lower() or "education" in o.lower() for o in v["offices"])}

Track office changes (like Beaumont)

# Find candidates who held different offices across years
switchers = {k: v for k, v in careers.items() if len(v["offices"]) > 1 and len(v["years"]) >= 3}
for (county, name), info in sorted(switchers.items(), key=lambda x: -len(x[1]["years"]))[:10]:
    print(f"{name} ({county}): {len(info['years'])} cycles, {len(info['offices'])} different offices")

Compare to other states

Career tracking across states requires MEDSL data, which uses different name formatting than NC SBE, so cross-source entity resolution (L3) is required. Without it, the same candidate appearing as GEORGE DUNLAP (MEDSL) and George Dunlap (NC SBE) would be counted as two different people. L1 canonical name normalization handles casing and the L1 nickname dictionary handles nicknames; the L3 cascade handles the remaining format differences.

Prerequisites

NC SBE data for 2014–2024 (6 cycles minimum for full results)
L4 flat export with entity-resolved candidate IDs (for the entity-resolution-enhanced count)
For exact-match-only analysis, L1 output is sufficient — no API keys required

Cross-References

Uncontested Race Rate — many long-tenure candidates run unopposed
Office Inventory — what offices exist in a given county
Entity Resolution — how cross-year matching works

Verify a Specific Result

Question: Can I verify that “Timothy Lance got 303 votes in precinct P17”? Can I trace that number back to the original source file?

Yes. The hash chain links every L4 canonical record back through L3, L2, and L1 to the raw bytes of the L0 source file. This recipe walks the chain step by step using jq.

The Claim

A researcher sees this record in the L4 flat export:

Timothy Lance — 303 votes — Precinct P17 — Columbus County Schools Board of Education District 02 — NC — 2022-11-08

They want to verify it. Here is how.

Step 1: Find the L4 Record

Start at the L4 flat export and locate the record:

jq -c 'select(
  .candidate_canonical == "Timothy Lance"
  and .county == "COLUMBUS"
  and .votes_total == 303
)' l4_canonical/exports/flat_export.jsonl

Output:

{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","contest_name":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02","candidate_canonical":"Timothy Lance","candidate_entity_id":"person:nc:columbus:lance-timothy-13","votes_total":303,"source":"nc_sbe","l3_hash":"28183d41d50204d5","l0_hash":"edfedf2760cfd54f"}

Note the two hash values:

l3_hash: 28183d41d50204d5 — links to the L3 matched record
l0_hash: edfedf2760cfd54f — shortcut to the L0 source file

Step 2: Follow l3_hash to L3

Look up the L3 matched record by its hash:

jq -c 'select(.l3.l3_hash == "28183d41d50204d5")' \
  l3_matched/NC/2022/matched.jsonl

Key fields in the output:

{
  "l3": {
    "l3_hash": "28183d41d50204d5",
    "l2_parent_hash": "854fa6367960bb05",
    "candidate_entity_ids": [
      {"result_index": 0, "entity_id": "person:nc:columbus:lance-timothy-13"}
    ],
    "contest_entity_id": "contest:nc:columbus:school-board-d02"
  }
}

This tells you:

The entity resolution cascade assigned Timothy Lance to entity person:nc:columbus:lance-timothy-13.
The contest was assigned to contest:nc:columbus:school-board-d02.
The L2 parent hash is 854fa6367960bb05.

Step 3: Follow l2_parent_hash to L2

Look up the L2 embedded record:

jq -c 'select(.l2.l2_hash == "854fa6367960bb05")' \
  l2_embedded/NC/2022/enriched.jsonl

Key fields:

{
  "l2": {
    "l2_hash": "854fa6367960bb05",
    "l1_parent_hash": "8ea7ecc257ff8e05",
    "embedding_model": "text-embedding-3-large",
    "embedding_dimensions": 3072,
    "candidate_composite": "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus",
    "quality_flags": []
  }
}

This tells you:

The embedding model was text-embedding-3-large with 3,072 dimensions.
The composite string used for embedding was Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus.
No quality flags were raised.
The L1 parent hash is 8ea7ecc257ff8e05.

Step 4: Follow l1_parent_hash to L1

Look up the L1 cleaned record:

jq -c 'select(.provenance.l1_hash == "8ea7ecc257ff8e05")' \
  l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl

Key fields:

{
  "jurisdiction": {
    "state": "NC", "state_fips": "37",
    "county": "COLUMBUS", "county_fips": "37047",
    "precinct": "P17"
  },
  "results": [{
    "candidate_name": {
      "raw": "Timothy Lance", "first": "Timothy",
      "middle": null, "last": "Lance",
      "suffix": null, "canonical_first": "Timothy"
    },
    "votes_total": 303,
    "vote_counts_by_type": {
      "election_day": 136, "early": 159,
      "absentee_mail": 7, "provisional": 1
    }
  }],
  "provenance": {
    "l1_hash": "8ea7ecc257ff8e05",
    "l0_parent_hash": "edfedf2760cfd54f",
    "parser_version": "nc_sbe_v2.1",
    "schema_version": "3.0.0"
  }
}

This tells you:

The 303 votes break down to 136 election day + 159 early + 7 absentee + 1 provisional.
The name was parsed as first=“Timothy”, last=“Lance”, no middle, no suffix.
The parser version was nc_sbe_v2.1.
The L0 parent hash is edfedf2760cfd54f.
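
The vote-mode arithmetic is easy to check mechanically. The sketch below assumes the L1 result shape shown above (votes_total plus vote_counts_by_type); the helper name vote_modes_consistent is hypothetical.

```python
def vote_modes_consistent(result: dict) -> bool:
    """True when the per-mode counts add up to votes_total (or no breakdown exists)."""
    by_type = result.get("vote_counts_by_type") or {}
    if not by_type:
        return True  # nothing to cross-check
    return sum(v or 0 for v in by_type.values()) == result["votes_total"]

# The Timothy Lance result from the L1 record above.
lance = {
    "votes_total": 303,
    "vote_counts_by_type": {
        "election_day": 136, "early": 159,
        "absentee_mail": 7, "provisional": 1,
    },
}
print(vote_modes_consistent(lance))
```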

Step 5: Follow l0_parent_hash to L0

Look up the L0 manifest:

cat l0_raw/nc_sbe/results_pct_20221108.txt.manifest.json

Output:

{
  "l0_hash": "edfedf2760cfd54f",
  "source_url": "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip",
  "retrieval_date": "2026-03-18T14:30:00Z",
  "file_size_bytes": 18023456,
  "format_detected": "tsv"
}

Step 6: Verify L0 Against the Source

Recompute the SHA-256 of the raw file and compare:

# macOS
shasum -a 256 l0_raw/nc_sbe/results_pct_20221108.txt

# Linux
sha256sum l0_raw/nc_sbe/results_pct_20221108.txt

If the output starts with edfedf2760cfd54f..., the raw file is intact — it matches the bytes the pipeline processed.

To verify against the authoritative source independently, download the file yourself:

curl -O https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip
unzip results_pct_20221108.zip
shasum -a 256 results_pct_20221108.txt

If the hash you compute starts with the manifest’s l0_hash, you and the pipeline processed identical bytes. The vote count of 303 for Timothy Lance in precinct P17 traces directly to those bytes.
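
The same comparison can be scripted. This is a minimal sketch, assuming (as Step 6 implies) that l0_hash is the leading 16 hex characters of the file's SHA-256; the function names are hypothetical.

```python
import hashlib

def sha256_hex(path, chunk_size=1 << 20):
    """Full SHA-256 of a file, read in chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_l0_hash(path, l0_hash):
    """True when the file's SHA-256 starts with the (truncated) manifest hash."""
    return sha256_hex(path).startswith(l0_hash.lower())
```

Usage would look like matches_l0_hash("l0_raw/nc_sbe/results_pct_20221108.txt", "edfedf2760cfd54f"), with the second argument read from the manifest rather than typed by hand.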

The Full Chain

L4  flat_export.jsonl
    candidate_canonical = "Timothy Lance", votes_total = 303
    l3_hash = 28183d41d50204d5
      │
L3  matched.jsonl
    entity_id = person:nc:columbus:lance-timothy-13
    l2_parent_hash = 854fa6367960bb05
      │
L2  enriched.jsonl
    embedding_model = text-embedding-3-large
    candidate_composite = "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus"
    l1_parent_hash = 8ea7ecc257ff8e05
      │
L1  cleaned.jsonl
    votes_total = 303 (136 + 159 + 7 + 1)
    parser_version = nc_sbe_v2.1
    l0_parent_hash = edfedf2760cfd54f
      │
L0  results_pct_20221108.txt
    l0_hash = edfedf2760cfd54f
    source_url = https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip

Every link is independently verifiable. Recompute any hash from the record content plus the parent hash. If it matches the stored value, the record has not been tampered with.
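
The walk itself can be automated. The sketch below follows the parent pointers using the field names from Steps 1 through 5; walk_chain and the three index dictionaries (hash to record, built by reading each layer's JSONL file) are hypothetical. It checks linkage only. Recomputing each hash from record content depends on the pipeline's hashing recipe, which this sketch does not assume.

```python
def walk_chain(l4_record, l3_index, l2_index, l1_index):
    """Follow parent hashes from an L4 record to L0; a KeyError means a broken link."""
    l3 = l3_index[l4_record["l3_hash"]]
    l2 = l2_index[l3["l3"]["l2_parent_hash"]]
    l1 = l1_index[l2["l2"]["l1_parent_hash"]]
    l0_hash = l1["provenance"]["l0_parent_hash"]
    if l0_hash != l4_record["l0_hash"]:
        raise ValueError("L4 shortcut hash disagrees with the L1 parent hash")
    return [
        l4_record["l3_hash"],
        l3["l3"]["l2_parent_hash"],
        l2["l2"]["l1_parent_hash"],
        l0_hash,
    ]

# Records reduced to the hash fields shown in the steps above.
l4 = {"l3_hash": "28183d41d50204d5", "l0_hash": "edfedf2760cfd54f"}
l3_index = {"28183d41d50204d5": {"l3": {"l2_parent_hash": "854fa6367960bb05"}}}
l2_index = {"854fa6367960bb05": {"l2": {"l1_parent_hash": "8ea7ecc257ff8e05"}}}
l1_index = {"8ea7ecc257ff8e05": {"provenance": {"l0_parent_hash": "edfedf2760cfd54f"}}}

chain = walk_chain(l4, l3_index, l2_index, l1_index)
print(" -> ".join(chain))
```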

Prototype Validation

In our 200-record prototype, we verified the full hash chain for every record:

Metric	Result
Records verified	200 / 200
Broken chains	0
Layers traversed	5 per record
Total hash verifications	1,000

Zero broken links. Every vote count traces back to the raw NC SBE file bytes.

When to Use This

Fact-checking. A journalist writing “Timothy Lance received 303 votes” can cite the hash chain as evidence.
Auditing. A researcher who finds an unexpected result can walk the chain to determine whether the issue is in the source data (L0), the parser (L1), the entity resolution (L3), or the aggregation (L4).
Dispute resolution. If two researchers disagree on a number, both can verify the chain. If both chains are intact and both start from the same L0 hash, the number is correct. If the L0 hashes differ, one of them has a different version of the source file — check retrieval_date in the manifest.

The Two Audiences

This project serves two audiences with fundamentally different trust requirements. Engineers need to verify the pipeline mechanically. Consumers — journalists, researchers, government staff — need to understand what the data means and how much to trust it without reading source code.

This chapter describes what each audience sees and how the two views connect.

What engineers see

Engineers interact with the pipeline’s internal machinery. Trust, for this audience, is a function of determinism, traceability, and mechanical reproducibility.

Hash chains. Every record carries a provenance.hash field — a SHA-256 hash of the record’s content at each layer. L1 records hash L0 input bytes. L2 records hash L1 output plus the embedding model version. L3 records hash L2 output plus the decision log entry. L4 records hash L3 output. Any mutation at any layer invalidates all downstream hashes.

Decision logs. Every non-deterministic operation at L3 — embedding similarity matches, LLM-confirmed entity resolutions — is recorded in a JSONL decision log. Each entry includes:

decision_id: a unique identifier for the decision
method: one of exact, jaro_winkler, embedding, llm
score: the similarity or confidence score (where applicable)
input_record_hashes: the L2 records being compared
output: the resolution (match, no-match, or merge)
llm_request_id: the API request ID (for LLM decisions only)

Embedding IDs. Every embedding generated at L2 is tagged with the model identifier (text-embedding-3-large), the embedding dimension (3072), and the composite string template used to generate the input text. If the model or template changes, all L2 records are regenerated — not patched.

Layer manifests. Each layer’s output directory contains a manifest.jsonl file listing every output file, its row count, its SHA-256 hash, the pipeline version that produced it, and the timestamp of generation. Manifests are the unit of verification: compare two manifests to determine whether a pipeline run produced identical output.
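
A manifest comparison can be sketched as follows. The entry field names path and sha256 are assumptions for illustration; the actual manifest.jsonl keys may differ.

```python
import json

def load_manifest(lines):
    """Index manifest entries by file path. 'path' and 'sha256' are assumed keys."""
    entries = (json.loads(line) for line in lines if line.strip())
    return {e["path"]: e["sha256"] for e in entries}

def diff_manifests(first, second):
    """Report files whose hashes differ and files present in only one manifest."""
    return {
        "changed": sorted(p for p in first.keys() & second.keys() if first[p] != second[p]),
        "only_in_first": sorted(first.keys() - second.keys()),
        "only_in_second": sorted(second.keys() - first.keys()),
    }
```

An empty result in all three lists means the two runs produced identical output files.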

What consumers see

Consumers interact with query results, summary statistics, and exported datasets. Trust, for this audience, is a function of source attribution, stated confidence, and transparent methodology.

Source names. Every record in consumer-facing output includes a human-readable source name: “NC SBE (certified)”, “MEDSL 2022”, “OpenElections (community-curated)”. The source name tells the consumer where the data came from and how it was collected.

Confidence levels. Every record carries a confidence level: high, medium, or low. See Confidence Levels for definitions. Consumers can filter by confidence to match their tolerance for uncertainty.

Methodology page. Any published dataset includes a methodology section describing the pipeline version, source versions, and processing steps used. This is the consumer-facing equivalent of the manifest.

Bridge table

The following table maps consumer-facing fields to their internal pipeline equivalents. If you see a value in a consumer export and need to trace it, this is where to start.

Consumer-facing field	Example value	Internal pipeline field	Layer
Source	NC SBE (certified)	`source.source_type` = `nc_sbe`, `source.certification` = `certified`	L1
Confidence	High	`provenance.confidence` = `high`	L1–L4
Candidate name	John A. Smith Jr.	`candidate.canonical_first` = `john`, `candidate.canonical_last` = `smith`, `candidate.suffix` = `jr`	L4
Office	County Commissioner District 3	`contest.canonical_office` = `county_commissioner`, `contest.district` = `3`	L4
Vote total	12,847	`votes.total` = `12847`	L1
Match method	Algorithmic (exact)	`entity_resolution.method` = `exact`	L3
Match method	LLM-confirmed	`entity_resolution.method` = `llm`, `entity_resolution.decision_id` = `d-2024-00417`	L3
Jurisdiction	Mecklenburg County, NC	`jurisdiction.county_fips` = `37119`, `jurisdiction.state` = `NC`	L1
Election date	2022-11-08	`election.date` = `2022-11-08`	L1
Party	Democratic	`candidate.party` = `DEM`	L1

Reproducibility by layer

Not all layers are equally reproducible. The guarantees differ based on whether a layer involves external API calls.

L0 → L1: Deterministic. L1 is a pure function of L0 input and the pipeline code. Same input, same code version, same output — byte-identical. No external calls. No randomness.

L1 → L2: Deterministic. L2 adds embeddings generated by text-embedding-3-large (3072 dimensions). The embedding API is deterministic for a given model version and input string. Same L1 input, same model version, same output. If OpenAI retires or modifies the model, a pinned model version in the manifest allows detection (though not reproduction without the original model).

L2 → L3: Replayable from decision log. L3 involves entity resolution — some of which uses embedding cosine similarity (deterministic given L2) and some of which calls Claude Sonnet for confirmation. LLM calls are not deterministic: the same prompt may produce different text on different days. However, every LLM decision is recorded in the decision log with its output. Replaying L3 from the decision log — rather than re-calling the LLM — produces identical output. The decision log is the reproducibility mechanism for L3.
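
Replay can be pictured as a lookup that never reaches the model. This is a minimal sketch, assuming decision log entries carry the input_record_hashes and output fields described earlier; index_decisions and replay_decision are hypothetical names, and keying on the sorted pair of input hashes is our assumption about how a decision would be located.

```python
def index_decisions(entries):
    """Key each logged decision by its (order-independent) input record hashes."""
    return {tuple(sorted(e["input_record_hashes"])): e for e in entries}

def replay_decision(decision_index, input_record_hashes):
    """Return the logged output; raise instead of calling the LLM when none exists."""
    key = tuple(sorted(input_record_hashes))
    if key not in decision_index:
        raise KeyError(f"no logged decision for {key}; replay cannot proceed")
    return decision_index[key]["output"]

# Made-up log entry for illustration.
log = index_decisions([
    {"decision_id": "d-1", "method": "llm",
     "input_record_hashes": ["bbb", "aaa"], "output": "match"},
])
print(replay_decision(log, ["aaa", "bbb"]))
```

Failing loudly on a missing entry is deliberate: silently falling back to a fresh LLM call would make the replay non-reproducible.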

L3 → L4: Deterministic. L4 is a deterministic function of L3 output. It selects canonical names, assigns canonical IDs, and merges duplicate records. Same L3, same L4.

End-to-end reproducibility. To fully reproduce a dataset:

Check out the tagged pipeline version from the repository.
Obtain the same L0 source files (verified by hash against the L0 manifest).
Run L0 → L2. Verify output hashes against the L2 manifest.
Apply the published decision log to produce L3. Verify against the L3 manifest.
Run L3 → L4. Verify against the L4 manifest.

If all manifest hashes match, the reproduction is exact. If any hash diverges, the manifest diff identifies exactly which records changed and at which layer.

When the two views diverge

Sometimes engineers and consumers reach different conclusions about the same record:

An engineer may see that a match was made by LLM with confidence 0.78 and flag it as marginal. A consumer sees “Source: MEDSL, Confidence: Medium” and treats it as usable. Both are correct within their frame.
An engineer may know that an embedding model version is deprecated. A consumer sees no change in the output. The manifest captures this risk; the consumer-facing confidence level does not (yet).

The bridge table above is the mechanism for resolving these divergences. When in doubt, trace the consumer field back to its pipeline equivalent and inspect the full provenance chain.

Confidence Levels

Every record in the pipeline carries a confidence level that reflects the trustworthiness of its source and the reliability of the processing steps applied to it. Confidence is not a score — it is a categorical label with defined semantics.

Three levels

High

The source is a certified government publication. Examples: NC SBE certified results, state election board portals that publish official canvass data. Records ingested from these sources enter L0 with source.confidence = "high".

High-confidence sources provide vote totals that are legally authoritative. When two sources disagree, the high-confidence source is treated as ground truth.

Medium

The source is a curated academic dataset derived from government publications. Example: MEDSL, which aggregates and reformats state-published results into a consistent schema. The data is one step removed from the original — parsed, cleaned, and sometimes corrected by the MEDSL team.

Medium-confidence sources are reliable for analysis but are not primary. In the 640 overlapping contests between MEDSL and NC SBE, 90.5% have identical vote totals. The 9.5% that differ are typically due to provisional ballot timing or reporting cutoffs.

Low

The source is community-curated, OCR-derived, or otherwise not traceable to a single certified publication. Examples: OpenElections state files with known parsing issues, any data recovered from PDFs via OCR, or crowd-sourced contest metadata.

Low-confidence records are included in the dataset but flagged. They are useful for coverage (filling gaps where no better source exists) but should not be cited without independent verification.

How confidence propagates

Confidence is not static. It can degrade as records pass through the pipeline, but it never improves without human intervention.

Source confidence (L0–L1). Set at ingestion based on the source type. Deterministic — the same source always gets the same level.

Match confidence (L3). Entity resolution adds a second dimension. If the match method is deterministic (exact string match or Jaro-Winkler ≥ 0.92), the source confidence is preserved. If the match required embedding similarity or LLM confirmation, the record is annotated with the match method and decision ID, but the source confidence is not downgraded — instead, a separate match_confidence field is added.

The combined confidence follows these rules:

Source confidence	Match method	Overall	Notes
High	Exact	High	Best case. Certified source, deterministic match.
High	Jaro-Winkler	High	Algorithmic match above threshold.
High	Embedding	High + decision ID	Source still trusted; match is logged.
High	LLM	High + decision ID	Source still trusted; LLM rationale recorded.
Medium	Exact	Medium	Academic source, deterministic match.
Medium	LLM	Medium + decision ID	Both source and match carry caveats.
Low	Any	Low	Source uncertainty dominates.
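
The table reduces to a short rule. The sketch below reproduces the rows listed above; overall_confidence is a hypothetical name, and combinations the table does not list (Medium with Jaro-Winkler or Embedding) are assumed to follow the same pattern as their High counterparts.

```python
LEVELS = {"high", "medium", "low"}
METHODS = {"exact", "jaro_winkler", "embedding", "llm"}
LOGGED_METHODS = {"embedding", "llm"}  # matches that carry a decision ID

def overall_confidence(source_confidence: str, match_method: str) -> str:
    """Return the 'Overall' label from the table for a source level and match method."""
    source = source_confidence.strip().lower()
    method = match_method.strip().lower()
    if source not in LEVELS:
        raise ValueError(f"unknown confidence level: {source_confidence!r}")
    if method not in METHODS:
        raise ValueError(f"unknown match method: {match_method!r}")
    label = source.capitalize()
    # Low-confidence sources stay Low regardless of how the match was made.
    if source != "low" and method in LOGGED_METHODS:
        label += " + decision ID"
    return label

print(overall_confidence("High", "llm"))
```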

LLM decision tracking

When an LLM (Claude Sonnet) is involved in entity resolution, the pipeline records:

The decision ID (a unique hash of the prompt, response, and model version)
The prompt sent to the model
The model’s response
The confidence score returned by the model

This allows any LLM-assisted decision to be audited, replayed, or overridden. See When the LLM Gets Called.

How to cite records

When using data from this pipeline in publications, cite the original source, not the pipeline. The pipeline provides the information needed to construct a proper citation.

APA format template:

{Source organization}. ({Year}). {Dataset title} [Data set]. Retrieved {retrieval_date} from {url}.

Example:

North Carolina State Board of Elections. (2022). Official general election results [Data set]. Retrieved 2025-01-15 from https://www.ncsbe.gov/results-data.

Each L4 record includes the fields needed to construct this citation: source.name, source.retrieval_date, source.url, and source.confidence. A methodology link pointing to the pipeline documentation should accompany any analysis that depends on entity resolution or cross-source reconciliation.
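
Filling the template is a one-line format call. In this sketch apa_citation is a hypothetical helper; the dataset title and year are passed in by the caller because they are not among the record fields listed above.

```python
def apa_citation(organization, year, title, retrieval_date, url):
    """Fill the APA data set template shown above."""
    return (
        f"{organization}. ({year}). {title} [Data set]. "
        f"Retrieved {retrieval_date} from {url}."
    )

print(apa_citation(
    "North Carolina State Board of Elections",
    2022,
    "Official general election results",
    "2025-01-15",
    "https://www.ncsbe.gov/results-data",
))
```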

Reporting Errors

Election data errors are inevitable — misspelled names, transposed digits, misclassified offices. This chapter describes how to report errors, how corrections flow through the pipeline, and how they are documented.

What counts as an error

An error is a factual discrepancy between the pipeline output and the certified source record. Examples:

A candidate’s vote total does not match the certified result.
Two candidates are incorrectly resolved as the same person (false positive).
A single candidate is split into two entities across sources (false negative).
An office is classified at the wrong level (e.g., county office tagged as state).
A contest is assigned to the wrong jurisdiction or FIPS code.

Formatting preferences (e.g., “they should use a middle name, not an initial”) are not errors. The pipeline normalizes names according to documented rules; stylistic disagreements are out of scope.

How to report

Include the following in every error report:

State — two-letter abbreviation.
County or jurisdiction — as specific as possible.
Contest — the office name and year.
Candidate — the name as it appears in the output.
The error — what is wrong and what the correct value should be.
Source — how you know the correct value (e.g., link to certified results PDF, county clerk confirmation).

File reports via the project’s GitHub issue tracker using the data-error label. One error per issue. Bulk reports (e.g., “all vote totals for County X are wrong”) should include a CSV attachment with the specific records.

How corrections flow through the pipeline

Corrections are not ad hoc patches. They follow the same layered architecture as all other data.

Report → Review → L3 human override → L4 re-canonicalize → Changelog entry

Report. An error is filed with the required fields above.
Review. A maintainer verifies the error against the cited source. If the source confirms the discrepancy, the report is accepted.
L3 human override. A decision record is added to the L3 decision log with decision_type: "human_override", the reporter’s source citation, and the corrected value. The original machine decision is preserved — overrides do not delete history.
L4 re-canonicalize. The L4 canonical layer is regenerated from the updated L3 output. Only records affected by the override change.
Changelog entry. The correction is recorded in the Changelog with the issue number, affected records, and the nature of the fix.
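
The override step can be sketched as an append-only write. Only decision_type: "human_override", the source citation, and the corrected value come from the description above; every other field name here (overrides_decision_id, issue) and both function names are assumptions for illustration.

```python
import json

def make_override(decision_id, overrides_decision_id, corrected_value, source_citation, issue):
    """Build a human override entry that points at the machine decision it supersedes."""
    return {
        "decision_id": decision_id,
        "decision_type": "human_override",
        "overrides_decision_id": overrides_decision_id,  # assumed field name
        "corrected_value": corrected_value,
        "source_citation": source_citation,
        "issue": issue,  # assumed field name for the tracker issue
    }

def append_decision(log_lines, entry):
    """Append-only: return a new list of JSONL lines; existing lines are untouched."""
    return [*log_lines, json.dumps(entry, sort_keys=True)]
```

Because the original machine decision stays in the log, a replay can show both what the pipeline decided and what the reviewer changed.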

What happens to the original data

Nothing. L0 (raw) and L1 (cleaned) records are immutable. If the error is in the source itself (e.g., the state published a wrong number that was later corrected in an amended certification), the amended source file is ingested as a new L0 record. Both the original and amended records coexist, with the L3 decision log recording which one is authoritative.

Transparency

All override decisions are stored in the same JSONL decision log as algorithmic decisions. They are queryable, auditable, and included in pipeline replay. A consumer who disagrees with a correction can inspect the decision record, see the cited source, and file a counter-report.

Corrections do not silently change output. Every correction increments the dataset version and appears in the changelog.

Known Limitations

This chapter documents what the project cannot do, where the data is incomplete, and where results should be interpreted with caution. These are not future plans — they are current, known constraints.

Coverage gaps by state

MEDSL 2022 data contains zero local election results for seven states:

State	FIPS	Notes
California	06	State publishes results but not in MEDSL local dataset
Iowa	19	County-level results exist on state portal; not aggregated
Kansas	20	No local results in MEDSL
New Jersey	34	County clerk offices publish individually; no aggregation
Pennsylvania	42	67 counties, each with its own reporting format
Tennessee	47	No local results in MEDSL
Wisconsin	55	State portal exists but data not present in MEDSL

These gaps are source-dependent. If a future pipeline version integrates state portal data directly, coverage may improve. Until then, any “national” statistic derived from this dataset is actually a 43-state statistic.

Turnout data

Turnout figures (registered voters, ballots cast) are present in fewer than 5% of records. Most sources report candidate-level vote totals but not the denominator. This means:

Vote share (candidate votes / total ballots) cannot be computed for most contests.
Voter participation rates at the local level are not derivable from this dataset.
Where turnout data does exist, it is preserved as TurnoutMetadata contest records at L1 and carried through to L4.

Do not assume that the absence of turnout data means turnout was low. It means the source did not report it.

Odd-year elections

Elections held in 2015, 2017, 2019, and 2021 are underrepresented. MEDSL publishes even-year datasets (2016, 2018, 2020, 2022) with strong coverage. Odd-year local elections — common for municipal and school board races — are covered only where state-specific sources (e.g., NC SBE) include them.

This creates a systematic bias: states that hold local elections in odd years appear to have fewer local races than they actually do. New Jersey (already missing from MEDSL local data) and Virginia (odd-year state legislative elections) are particularly affected.

Entity resolution is probabilistic

The L3 matching layer uses a four-step cascade: exact match → Jaro-Winkler → embedding similarity → LLM confirmation. Only exact matches are deterministic in the strong sense. All other match methods involve thresholds:

Jaro-Winkler threshold: 0.92. Names scoring below this are not matched, even if they refer to the same person.
Embedding cosine similarity threshold: 0.88. Composite strings that fall below this are sent to LLM review or left unmatched.
LLM confirmation is logged with a decision ID but is inherently non-deterministic across model versions. Decisions are frozen in the decision log for reproducibility, but a different model version might make different decisions.
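
The cascade order and thresholds can be sketched as a single function. This is not the pipeline's implementation: the similarity functions are passed in as callables (no particular Jaro-Winkler or embedding library is assumed), llm_confirm is a hypothetical callable returning (is_match, confidence), and sending every below-threshold pair to the LLM is a simplification of "sent to LLM review or left unmatched".

```python
JARO_WINKLER_THRESHOLD = 0.92
EMBEDDING_THRESHOLD = 0.88

def resolve(a, b, jaro_winkler, embedding_similarity, llm_confirm=None):
    """Run the four-step cascade on two name strings and report which step decided."""
    if a == b:
        return {"method": "exact", "score": None, "output": "match"}
    jw = jaro_winkler(a, b)
    if jw >= JARO_WINKLER_THRESHOLD:
        return {"method": "jaro_winkler", "score": jw, "output": "match"}
    emb = embedding_similarity(a, b)
    if emb >= EMBEDDING_THRESHOLD:
        return {"method": "embedding", "score": emb, "output": "match"}
    if llm_confirm is not None:
        is_match, confidence = llm_confirm(a, b)
        return {"method": "llm", "score": confidence,
                "output": "match" if is_match else "no-match"}
    # No LLM available: the pair stays unmatched, with the embedding score recorded.
    return {"method": "embedding", "score": emb, "output": "no-match"}
```

Each return value has the shape of a decision log entry (method, score, output), so the caller can write it to the log as-is.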

Consequences:

Some true matches are missed (false negatives), especially for candidates with common names in different jurisdictions.
Some incorrect matches may exist (false positives), especially for candidates with identical names in overlapping jurisdictions (e.g., father/son with the same name).
All non-exact match decisions are queryable by match method and score. Downstream users can apply stricter thresholds if their use case requires higher precision at the cost of lower recall.
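
A stricter downstream filter might look like the sketch below. It assumes each decision carries the method and score fields described in the decision log; keep_decision and the example floors are hypothetical, and any method without a floor is rejected.

```python
def keep_decision(entry, min_scores):
    """Keep exact matches; keep other methods only at or above the caller's floor."""
    method = entry["method"]
    if method == "exact":
        return True
    floor = min_scores.get(method)
    if floor is None:
        return False  # method not accepted by this consumer
    score = entry.get("score")
    return score is not None and score >= floor

# Example policy: tighter than the pipeline defaults, and no LLM-only matches.
strict = {"jaro_winkler": 0.95, "embedding": 0.92}

decisions = [
    {"method": "exact", "score": None},
    {"method": "jaro_winkler", "score": 0.93},
    {"method": "embedding", "score": 0.94},
    {"method": "llm", "score": 0.95},
]
kept = [d for d in decisions if keep_decision(d, strict)]
print([d["method"] for d in kept])
```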

No ranked-choice voting (RCV) support

The schema represents first-past-the-post and plurality contests. Ranked-choice voting results — used in Alaska, Maine, New York City, and a growing number of jurisdictions — require round-by-round tabulation data that the current schema does not model.

RCV results from these jurisdictions may appear in the dataset as final-round totals (where the source reports them that way), but intermediate rounds, elimination order, and ballot transfer data are not captured.

ALGED not integrated

The American Local Government Elections Database (ALGED) covers mayoral and city council races in cities with populations above 50,000. It includes candidate demographics and incumbency data not available in other sources. This dataset is not currently integrated into the pipeline. Its coverage period ends around 2021.

Integration is planned but not scheduled. When integrated, ALGED records will enter at L0 like any other source and pass through the same cleaning, embedding, and matching layers.

Vote mode data

Vote mode breakdowns (Election Day, absentee, early voting, provisional) are present in approximately 33% of source records. The remaining 67% report only total votes per candidate. Cross-source comparisons of vote mode data are unreliable because:

States define vote modes differently (e.g., “absentee” vs. “mail” vs. “vote by mail”).
Some sources aggregate early voting into Election Day totals.
Provisional ballot handling varies by state and is time-dependent (provisionals may be added days after initial reporting).

Pipeline not validated at national scale

The pipeline has been tested against NC SBE data (2004–2022) and MEDSL data (2018–2022, 43 states). The 640-contest overlap between MEDSL and NC SBE provides a validation baseline: 90.5% exact vote match, 63% name formatting differences successfully resolved.

Full national-scale validation — running all 42 million rows through L0→L4 with cross-source reconciliation — has not been completed. Edge cases in states with unusual office structures (Louisiana’s parish system, Alaska’s borough system, Virginia’s independent cities) may surface issues not yet encountered.

What this means for users

If your work depends on completeness, check the Coverage Matrix for your specific state and year before drawing conclusions. If your work depends on entity resolution accuracy, filter to match methods and scores that meet your precision requirements. If your work involves RCV jurisdictions, this dataset does not capture round-level data.

These limitations are structural, not aspirational. They will change as sources are added and the pipeline matures, but they describe the current state accurately.

Full Nickname Dictionary

The pipeline applies nickname normalization at L1 to improve entity resolution at L3. When a candidate’s first name matches a known nickname, the canonical form is stored in canonical_first and the original is preserved in first.

This dictionary is applied deterministically. Every first name is checked against the table below, except in the cases listed under When the dictionary is not applied. No context or heuristics are used: if the input matches the nickname column, the canonical column is applied. This means the mapping is fast and reproducible but occasionally wrong (see The Ted Problem below).

Mappings

Nickname	Canonical	Notes
al	albert
alex	alexander
andy	andrew
barb	barbara
ben	benjamin
bernie	bernard
bert	albert	Also Herbert; resolved to albert by frequency
beth	elizabeth
bill	william
billy	william
bob	robert
bobby	robert
bonnie	bonita
bud	william	Regional; less reliable
charlie	charles
chris	christopher	Also Christine; gendered ambiguity
chuck	charles
cindy	cynthia
dan	daniel
danny	daniel
dave	david
deb	deborah
debbie	deborah
dick	richard
don	donald
doug	douglas
drew	andrew
ed	edward
eddie	edward
frank	franklin	Also Francis; resolved to franklin by frequency
fred	frederick
gene	eugene
gerry	gerald
hank	henry
harry	harold	Also Henry (British tradition); resolved to harold
jack	john
jake	jacob
jan	janice	Also Janet; resolved to janice by frequency
jenny	jennifer
jerry	gerald	Also Jerome; resolved to gerald by frequency
jim	james
jimmy	james
joe	joseph
johnny	john
jon	jonathan	Distinct from john
kate	katherine	Also Kathryn, Catherine
kathy	katherine
ken	kenneth
kenny	kenneth
larry	lawrence
liz	elizabeth
maggie	margaret
matt	matthew
mike	michael
mitch	mitchell
nancy	ann	Historical mapping; low reliability
nick	nicholas
nikki	nicole
norm	norman
pat	patrick	Also Patricia; gendered ambiguity
patti	patricia
patty	patricia
peggy	margaret
pete	peter
phil	philip
ray	raymond
rick	richard
rob	robert
ron	ronald
sally	sarah
sam	samuel	Also Samantha; gendered ambiguity
sandy	sandra	Also Alexander; gendered ambiguity
steve	steven
sue	susan
ted	edward	See The Ted Problem below
terry	terrence	Also Teresa; gendered ambiguity
tim	timothy
tom	thomas
tommy	thomas
tony	anthony
val	valerie
vince	vincent
walt	walter
wes	wesley
will	william
woody	woodrow

The Ted Problem

“Ted” is a nickname for both Edward (Ted Kennedy → Edward Kennedy; Ted Cruz → Rafael Edward Cruz) and Theodore (Ted Stevens → Theodore Stevens). The dictionary maps ted → edward because Edward is the more frequent canonical form in US election data. This means a candidate whose legal name is Theodore but who files as Ted will be canonicalized as Edward.

This is a known, accepted error. It affects L1 canonical_first but does not prevent correct entity resolution at L3 — because L3 matches on composite strings that include last name, jurisdiction, office, and year. Two candidates named “Ted Smith” in different counties will not be merged regardless of whether canonical_first is edward or theodore.

The original filed name is always preserved in first. Any downstream consumer who needs the original can ignore canonical_first and use first directly.

Gendered ambiguity

Several nicknames map to names that could be either male or female: Chris (Christopher/Christine), Pat (Patrick/Patricia), Sam (Samuel/Samantha), Sandy (Sandra/Alexander), Terry (Terrence/Teresa). The dictionary resolves these to the statistically more common canonical form in US election candidate data. The mapping is not always correct for individual candidates.

As with the Ted problem, the original name is preserved, and entity resolution at L3 uses additional fields (jurisdiction, office, party) to avoid incorrect merges caused by nickname ambiguity.

When the dictionary is not applied

The dictionary is skipped when:

The input first name is longer than 6 characters and matches no entry (assumed to already be a full name).
The candidate record has a canonical_first value set by the source (some sources provide both nickname and legal name).
The input is an initial only (e.g., “J.” is not expanded).
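
The lookup and its skip rules can be sketched as below. This is an assumption-laden illustration: `canonical_first` is a hypothetical function name, the dictionary is a four-entry excerpt of the table above, and lowercasing the output is a choice made here to match the table's casing.

```python
from typing import Optional

# A small excerpt of the mapping table above, not the full dictionary.
NICKNAMES = {"bill": "william", "bob": "robert", "charlie": "charles", "ted": "edward"}


def canonical_first(first: str, source_canonical: Optional[str] = None) -> str:
    """Return the canonical first name, honoring the three skip rules."""
    if source_canonical:               # the source already supplied a legal name
        return source_canonical.strip().lower()
    key = first.strip().lower()
    if len(key.rstrip(".")) <= 1:      # initial only, e.g. "J.", is not expanded
        return key
    # Names with no entry (including long full names) pass through unchanged.
    return NICKNAMES.get(key, key)


print(canonical_first("Charlie"))                          # charles
print(canonical_first("Ted"))                              # edward
print(canonical_first("J."))                               # j.
print(canonical_first("Bob", source_canonical="Bobby"))    # bobby
```

Because a name with no dictionary entry is returned unchanged, the "longer than 6 characters" rule needs no separate branch in this sketch; in a real implementation it would mainly serve to skip the lookup early.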

Office Classification Reference

The pipeline classifies 8,387 unique office name strings into canonical office types using a four-tier system. Each tier handles progressively harder cases. This appendix documents tiers 1 and 2 in full and summarizes tiers 3 and 4.

Coverage summary

Tier	Method	Unique offices handled	Cumulative coverage
1	Keyword lookup	3,102	37%
2	Regex patterns	2,097	62%
3	Embedding similarity	2,340	90%
4	LLM classification	848	100%

Tiers 1 and 2 are fully deterministic — same input, same output, no external calls. Tier 3 uses cosine similarity against text-embedding-3-large embeddings of known office types. Tier 4 sends unresolved strings to Claude Sonnet with a structured prompt.
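
The cumulative coverage column follows directly from the per-tier counts. A short check, using only the figures from the table (the function name is illustrative):

```python
TOTAL_OFFICES = 8387
TIER_COUNTS = [("keyword", 3102), ("regex", 2097), ("embedding", 2340), ("llm", 848)]


def cumulative_coverage(counts, total):
    """Return (tier, running count, rounded cumulative percent) for each tier."""
    running, rows = 0, []
    for name, n in counts:
        running += n
        rows.append((name, running, round(100 * running / total)))
    return rows


for name, running, pct in cumulative_coverage(TIER_COUNTS, TOTAL_OFFICES):
    print(f"{name:<10} {running:>5} {pct:>4}%")
```

The four counts sum to 8,387, and the rounded percentages come out to 37, 62, 90, and 100, matching the table.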

Tier 1: Keyword lookup

A case-insensitive keyword match against the office name string. If any keyword appears in the string, the office is classified immediately. Keywords are checked in order; the first match wins.

Keyword	`office_level`	`office_category`
president	federal	executive
u.s. senate	federal	legislative
u.s. house	federal	legislative
congress	federal	legislative
lieutenant governor	state	executive
governor	state	executive
attorney general	state	executive
secretary of state	state	executive
state treasurer	state	executive
state auditor	state	executive
state senate	state	legislative
state house	state	legislative
state representative	state	legislative
state assembly	state	legislative
supreme court	state	judicial
court of appeals	state	judicial
appeals court	state	judicial
district court	county	judicial
superior court	county	judicial
county commissioner	county	legislative
county council	county	legislative
sheriff	county	law_enforcement
clerk of court	county	judicial
register of deeds	county	administrative
coroner	county	administrative
constable	county	law_enforcement
justice of the peace	county	judicial
school board	local	education
board of education	local	education
city council	local	legislative
mayor	local	executive
alderman	local	legislative
township trustee	local	legislative
soil and water	local	special_district
fire district	local	special_district
water district	local	special_district

Notes:

“u.s. senate” is checked before “state senate” to avoid false matches.
“lieutenant governor” is checked before “governor” for the same reason.
Keywords are matched as substrings, not whole words. “county commissioner district 3” matches on “county commissioner”.
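
A minimal sketch of the ordered substring lookup, using a handful of rows from the table. The list name and function name are hypothetical, and the function also returns the keyword that matched so the effect of ordering is visible.

```python
# Excerpt of the tier 1 table, in check order (more specific keywords first).
TIER1_KEYWORDS = [
    ("u.s. senate", "federal", "legislative"),
    ("lieutenant governor", "state", "executive"),
    ("governor", "state", "executive"),
    ("state senate", "state", "legislative"),
    ("county commissioner", "county", "legislative"),
    ("sheriff", "county", "law_enforcement"),
    ("school board", "local", "education"),
]


def classify_tier1(office_name):
    """Return (keyword, office_level, office_category) for the first keyword found, else None."""
    lowered = office_name.lower()
    for keyword, level, category in TIER1_KEYWORDS:
        if keyword in lowered:  # substring match, not whole-word
            return keyword, level, category
    return None


print(classify_tier1("COUNTY COMMISSIONER DISTRICT 3"))
print(classify_tier1("Lieutenant Governor"))
print(classify_tier1("Fence Viewer"))  # None, so it falls through to tier 2
```

Because matching is by substring and the first hit wins, the order of the list is part of the classifier's behavior: placing "governor" ahead of "lieutenant governor" would make the longer keyword unreachable.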

Tier 2: Regex patterns

When no tier 1 keyword matches, the office string is tested against a series of compiled regular expressions. These handle structural patterns that keyword matching cannot.

Pattern	`office_level`	`office_category`	Example matches
`(?i)^(us|united states) (rep|senator)`	federal	legislative	“US Rep District 4”
`(?i)district judge.*district \d+`	county	judicial	“District Judge, Judicial District 21”
`(?i)(city|town|village) (of|de) .+ (council|trustee|board)`	local	legislative	“Town of Cary Council”
`(?i)independent school district.*\d+`	local	education	“Independent School District 279 Board”
`(?i)(municipal|mun\.?) (utility|water|sewer) district`	local	special_district	“Municipal Utility District 14”
`(?i)community college.*trustee`	local	education	“Community College District Trustee”
`(?i)(precinct|ward) (chair|committee)`	local	party	“Precinct Committee Chair”
`(?i)conservation district (super|board|dir)`	local	special_district	“Conservation District Supervisor”
`(?i)(drainage|levee|flood) (district|board)`	local	special_district	“Drainage District 7 Board”
`(?i)hospital district (board|dir|trustee)`	local	special_district	“Hospital District Board Member”
`(?i)park (district|board) (comm|dir|trustee)`	local	special_district	“Park District Commissioner”
`(?i)sanitary district`	local	special_district	“Sanitary District Trustee”
`(?i)mosquito (abatement|control) district`	local	special_district	“Mosquito Abatement District Trustee”
`(?i)(borough|parish) (council|president|assembly)`	county	legislative	“Borough Assembly Member”
`(?i)district attorney`	county	law_enforcement	“District Attorney 26th District”

Regex patterns are tested in order. The first match wins. All patterns use case-insensitive mode.
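
A sketch of the ordered regex pass using Python's `re` module and five of the patterns above, with alternation written as a plain `|`. The names are hypothetical, and the use of `search` (match anywhere in the string unless the pattern is anchored with `^`) is an assumption about how the patterns are applied.

```python
import re

# Excerpt of the tier 2 table, in test order.
TIER2_PATTERNS = [
    (re.compile(r"(?i)^(us|united states) (rep|senator)"), "federal", "legislative"),
    (re.compile(r"(?i)(city|town|village) (of|de) .+ (council|trustee|board)"), "local", "legislative"),
    (re.compile(r"(?i)independent school district.*\d+"), "local", "education"),
    (re.compile(r"(?i)(drainage|levee|flood) (district|board)"), "local", "special_district"),
    (re.compile(r"(?i)district attorney"), "county", "law_enforcement"),
]


def classify_tier2(office_name):
    """Return (office_level, office_category) for the first matching pattern, else None."""
    for pattern, level, category in TIER2_PATTERNS:
        if pattern.search(office_name):
            return level, category
    return None


print(classify_tier2("US Rep District 4"))
print(classify_tier2("Town of Cary Council"))
print(classify_tier2("Pound Keeper"))  # None, so it falls through to tier 3
```

Compiling the patterns once at import time keeps the per-string cost to a sequence of searches, which matters when the same list is run over thousands of office strings.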

Tier 3: Embedding similarity

Office strings that pass through tiers 1 and 2 unclassified are embedded using text-embedding-3-large (3072 dimensions) and compared against a reference set of known office type embeddings via FAISS nearest-neighbor search.

Threshold: cosine similarity ≥ 0.85 against the nearest known office type.
Reference set: the canonical office types defined by tiers 1 and 2, plus manually curated additions for jurisdiction-specific titles.
Examples resolved at tier 3:
- “Moderator” → local / legislative (New England town meeting role)
- “Fence Viewer” → local / administrative (historical New England office)
- “Pound Keeper” → local / administrative
- “Surveyor of Highways” → local / administrative
- “Oyster Commissioner” → local / special_district (Maryland)

Tier 3 handles 2,340 unique office strings — mostly jurisdiction-specific titles, historical offices, and compound names that do not match keyword or regex patterns.
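
The threshold test at tier 3 can be illustrated with a brute-force nearest-neighbor search in plain Python. This is a sketch only: the real pipeline uses FAISS over 3072-dimensional embeddings, whereas the three-dimensional vectors and the `classify_tier3` function below are toy stand-ins invented for the example.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # tier 3 threshold stated above


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_tier3(vector, reference):
    """reference maps (office_level, office_category) to a known embedding."""
    best_label, best_score = None, -1.0
    for label, ref_vector in reference.items():
        score = cosine(vector, ref_vector)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= SIMILARITY_THRESHOLD:
        return best_label, best_score
    return None, best_score  # unresolved; would continue to tier 4


# Toy reference set: NOT real embeddings.
reference = {
    ("local", "administrative"): [1.0, 0.0, 0.0],
    ("local", "legislative"): [0.0, 1.0, 0.0],
}
print(classify_tier3([0.9, 0.1, 0.0], reference)[0])  # ('local', 'administrative')
print(classify_tier3([1.0, 1.0, 0.0], reference)[0])  # None
```

The second query sits halfway between the two reference vectors, so its best similarity (about 0.71) falls under the threshold and the string stays unclassified at this tier.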

Tier 4: LLM classification

The remaining 848 office strings are sent to Claude Sonnet with a structured prompt that provides the office name, the state, and the county (where available). The LLM returns office_level, office_category, and a brief rationale.

Every tier 4 decision is recorded in the decision log with:

decision_id
input_string (the original office name)
output_level and output_category
llm_request_id
rationale (the LLM’s explanation)
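
One way such a record could be represented and written as a JSONL line is sketched below. The field names come from the list above; the dataclass, the helper function, and every value in the example are illustrative assumptions rather than real log contents.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Tier4Decision:
    decision_id: str
    input_string: str      # the original office name
    output_level: str
    output_category: str
    llm_request_id: str
    rationale: str         # the LLM's explanation


def to_jsonl_line(decision: Tier4Decision) -> str:
    """Serialize one decision as a single JSON line with stable key order."""
    return json.dumps(asdict(decision), sort_keys=True)


# Placeholder values for illustration only.
example = Tier4Decision(
    decision_id="example-0001",
    input_string="Hog Reeve",
    output_level="local",
    output_category="administrative",
    llm_request_id="example-request-id",
    rationale="Placeholder rationale text.",
)
print(to_jsonl_line(example))
```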

Tier 4 classifications can be overridden by adding entries to the tier 1 or tier 2 tables in subsequent pipeline versions. Once an office string is promoted to tier 1 or tier 2, it is classified deterministically on all future runs.

Office level and category enumerations

office_level values: federal, state, county, local.

office_category values: executive, legislative, judicial, law_enforcement, administrative, education, special_district, party.

These enumerations are defined in the Enumerations Reference. Every classified office receives exactly one level and one category.

Handling ambiguity

Some office strings are genuinely ambiguous:

“Board of Commissioners” could be county or municipal depending on jurisdiction.
“Trustee” alone could be township, school board, or special district.
“Judge” without a court name could be any judicial level.

In these cases, the pipeline uses jurisdiction context (state, county, FIPS code) to disambiguate. If the jurisdiction does not resolve the ambiguity, the string is sent to tier 3 or 4 with the full context attached.

NIST SP 1500-100 Alignment

This appendix maps the pipeline’s schema fields to concepts defined in NIST SP 1500-100 v2, the Election Results Common Data Format Specification. The mapping is informational — the pipeline does not emit NIST-compliant XML, but its internal schema was designed with alignment in mind.

Field mapping

Pipeline field	NIST SP 1500-100 concept	NIST element	Notes
`contest`	Contest	`CandidateContest`	Candidate races map to `CandidateContest`.
`contest` (ballot measure)	Contest	`BallotMeasureContest`	Ballot measures use a separate NIST element.
`contest.name`	Contest name	`CandidateContest.Name`	Raw office string before normalization.
`contest.canonical_office`	Office	`Office.Name`	L4 normalized office name.
`candidate.canonical_first`, `canonical_last`	Candidate	`Candidate.PersonFullName`	Pipeline stores components; NIST stores full name.
`candidate.party`	Party	`Party.Abbreviation`	Three-letter codes (DEM, REP, LIB, etc.).
`jurisdiction.ocd_id`	Geographic unit	`GpUnit.ExternalIdentifier`	OCD-ID used as the external identifier type.
`jurisdiction.county_fips`	Geographic unit	`GpUnit.ExternalIdentifier`	FIPS code, identifier type `fips`.
`jurisdiction.state`	Geographic unit	`GpUnit.Type = "state"`	Two-letter USPS abbreviation.
`votes.total`	Vote counts	`VoteCounts.Count`	Total votes for a candidate in a contest.
`votes.by_mode.election_day`	Vote counts by type	`VoteCounts.CountItemType = "election-day"`	Present in ~33% of records.
`votes.by_mode.absentee`	Vote counts by type	`VoteCounts.CountItemType = "absentee"`	Terminology varies by state.
`votes.by_mode.early`	Vote counts by type	`VoteCounts.CountItemType = "early"`	Some sources merge into election day.
`votes.by_mode.provisional`	Vote counts by type	`VoteCounts.CountItemType = "provisional"`	Timing of inclusion varies.
`election.date`	Election	`Election.StartDate`	Single date; no multi-day modeling.
`election.type`	Election type	`Election.Type`	Values: `general`, `primary`, `runoff`, `special`.
`turnout.registered_voters`	Turnout metadata	`VoteCounts.CountItemType = "total"` on `BallotCounts`	Present in <5% of records.
`turnout.ballots_cast`	Turnout metadata	`BallotCounts.BallotsCast`	Same coverage caveat.
`contest.district`	Electoral district	`ElectoralDistrict.Name`	District number or name within an office.

Concepts not modeled

The following NIST SP 1500-100 concepts have no direct equivalent in the pipeline schema:

RetentionContest — Judicial retention elections are classified as BallotMeasure with yes/no choices rather than as a distinct contest type.
OrderedContest — Ballot ordering is not captured. The pipeline does not model ballot layout.
BallotStyle — No ballot style or precinct-to-ballot mapping is maintained.
Ranked-choice voting rounds — CountItemType values for RCV rounds (round-1, round-2, etc.) are not supported. See Known Limitations.
Overvotes and undervotes — Tracked as TurnoutMetadata contest records at L1, not as NIST OtherCounts.

Pipeline concepts not in NIST

The following pipeline concepts have no NIST equivalent:

provenance.hash — SHA-256 hash chain for record integrity. NIST defines no provenance model.
entity_resolution.method — Match method metadata (exact, Jaro-Winkler, embedding, LLM). Entity resolution is outside the scope of NIST SP 1500-100.
source.confidence — High/medium/low confidence levels. NIST does not model source reliability.
Layer identifiers (L0–L4) — The multi-layer pipeline architecture is specific to this project.
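
This section does not describe how the provenance.hash chain is built. As a hedged illustration of the general idea, the sketch below links each record's SHA-256 hash to the previous one, so that changing any earlier record changes every later hash. The function names, the all-zero genesis value, and the canonical JSON serialization are assumptions made for this example.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed starting value for the first record


def chain_hash(record: dict, previous_hash: str) -> str:
    """Hash the previous link together with a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(previous_hash.encode("ascii") + payload).hexdigest()


def build_chain(records):
    """Return the list of chained hashes, one per record, in order."""
    hashes, previous = [], GENESIS_HASH
    for record in records:
        previous = chain_hash(record, previous)
        hashes.append(previous)
    return hashes


records = [{"candidate": "A", "votes": 10}, {"candidate": "B", "votes": 7}]
print(build_chain(records))
```

Sorting keys and fixing separators makes the serialization independent of dictionary insertion order, so the same record always produces the same hash.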

Research References

This appendix lists the research papers, datasets, and standards cited throughout the documentation.

Entity resolution

Dasanaike, T., et al. (2026). EnsembleLink: Ensemble methods for scalable entity resolution. Preprint.
Ornstein, J. (2025). fuzzylink: Probabilistic record linkage with large language models. Preprint.
CE-RAG4EM (2026). Context-Enhanced Retrieval-Augmented Generation for Entity Matching. Preprint.
Zeakis, A., et al. (2025). AvengER: Automated verification of entity resolution results. Preprint.

Election data sources

MIT Election Data + Science Lab (MEDSL). U.S. Local Elections Dataset, 2018–2022. https://electionlab.mit.edu/data
OpenElections Project. Certified election results by state. https://openelections.net
North Carolina State Board of Elections (NC SBE). Official election results, 2004–present. https://www.ncsbe.gov/results-data
American Local Government Elections Database (ALGED). Municipal election returns for cities >50K population.
Associated Press. AP Elections. Commercial license required.
Voting and Election Science Team (VEST). Precinct-level election returns with shapefiles. https://dataverse.harvard.edu/dataverse/electionscience
Federal Election Commission (FEC). Candidate master files. https://www.fec.gov/data/browse-data/
U.S. Census Bureau. FIPS code reference and geographic hierarchies. https://www.census.gov/geographies

Standards

National Institute of Standards and Technology. (2023). NIST SP 1500-100 v2: Election Results Common Data Format Specification. https://doi.org/10.6028/NIST.SP.1500-100r2

Architecture

Databricks. Medallion Architecture: Bronze, Silver, Gold. https://www.databricks.com/glossary/medallion-architecture

Reports

Union of Concerned Scientists. (2025). Election Data Report: The state of US election data infrastructure.

Glossary

Blocking. A preprocessing step in entity resolution that partitions records into groups (blocks) that share a key attribute — typically state + office type or county FIPS code. Only records within the same block are compared, reducing the number of pairwise comparisons from O(n²) to a tractable subset.
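
A minimal sketch of blocking, assuming records are dictionaries and that the block key is state plus office type (the field names here are hypothetical):

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_fields=("state", "office_type")):
    """Yield pairs of records that share a block key; records in different blocks are never compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[tuple(record[f] for f in key_fields)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


records = [
    {"state": "NC", "office_type": "sheriff", "name": "A"},
    {"state": "NC", "office_type": "sheriff", "name": "B"},
    {"state": "FL", "office_type": "sheriff", "name": "C"},
]
pairs = list(candidate_pairs(records))
print(len(pairs))  # 1 (only A and B share a block)
```

Without blocking, three records would produce three pairwise comparisons; here only the two North Carolina records are compared.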

Composite string. A concatenated text representation of a record used as input to an embedding model. A candidate composite string might combine name, office, jurisdiction, party, and year into a single string. The template that defines which fields are included and in what order is versioned and stored in the L2 manifest.

Cosine similarity. A measure of similarity between two vectors, computed as the cosine of the angle between them. Ranges from -1 to 1; values closer to 1 indicate higher similarity. Used at L3 to compare candidate and contest embeddings. The pipeline uses a threshold of 0.88 for embedding-based entity matches.

Entity resolution. The process of determining whether two records refer to the same real-world entity (person, office, or contest) despite differences in formatting, naming, or source. The pipeline uses a four-step cascade: exact match → Jaro-Winkler → embedding similarity → LLM confirmation.

FAISS. Facebook AI Similarity Search. A library for efficient similarity search over dense vector collections. Used at L3 to perform approximate nearest-neighbor lookups over L2 embeddings when comparing candidate records across sources.

FIPS code. Federal Information Processing Standards code. A numeric identifier assigned by the Census Bureau to states (2 digits), counties (5 digits: 2 state + 3 county), and other geographic entities. Example: 37119 = Mecklenburg County, North Carolina. Used as a join key across sources.

Jaro-Winkler similarity. A string similarity metric that gives higher scores to strings that match from the beginning. Ranges from 0 to 1. The pipeline uses a threshold of 0.92 for name matching. Preferred over edit distance for person names because prefix agreement is a strong signal of identity.
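
For readers who want to see the metric itself, here is a compact reference implementation of the standard Jaro-Winkler definition (prefix scale 0.1, prefix capped at 4 characters). It is a sketch for illustration; the pipeline's own implementation or library is not specified here and may differ in details such as case handling.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order in the two strings.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
print(jaro_winkler("shannon w bray", "shannon w. bray") >= 0.92)  # True
```

After case folding, the two spellings of Shannon W. Bray differ only by one inserted period, so the pair clears the 0.92 threshold under this implementation.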

JSONL. JSON Lines. A text format where each line is a valid JSON object, separated by newlines. The pipeline uses JSONL as the storage and interchange format at every layer (L0–L4). One record per line enables streaming reads and line-level integrity checks.

L0 (Raw). The first pipeline layer. Byte-identical copies of source files as retrieved. No parsing, no transformation. Stored with retrieval timestamps and SHA-256 hashes.

L1 (Cleaned). The second layer. Deterministic parsing, field extraction, name normalization, and FIPS enrichment. Output is structured JSONL with a consistent schema regardless of source format.

L2 (Embedded). The third layer. Adds vector embeddings (text-embedding-3-large, 3072 dimensions) and office classification results. Deterministic given L1 input and a fixed model version.

L3 (Matched). The fourth layer. Entity resolution — linking records that refer to the same candidate, contest, or office across sources and years. Non-deterministic steps (LLM calls) are recorded in the decision log for replay.

L4 (Canonical). The fifth layer. Assigns canonical names, deduplicates records, selects authoritative values, and produces the final queryable dataset. Deterministic given L3 input.

OCD-ID. Open Civic Data Identifier. A hierarchical string identifier for political geographies, following the pattern ocd-division/country:us/state:nc/county:mecklenburg. Used to link jurisdictions across datasets that may use different naming conventions.

Precinct. The smallest administrative unit for election administration. Voters are assigned to a precinct based on their address. Precinct-level results, when available, provide the most granular view of voting patterns. Coverage varies — some sources report only county-level totals.

Changelog

All notable changes to the dataset and pipeline will be documented in this file.

Each entry includes the date, affected layer(s), and a summary of the change.

[Unreleased]

No releases yet.

Entry template

Date: YYYY-MM-DD
Layer(s): L0 / L1 / L2 / L3 / L4
Change: Description of what changed.
Issue: Link to GitHub issue (if applicable).
