Election Aggregation

A multi-layer pipeline for collecting, normalizing, and unifying US local election results from heterogeneous sources.


The question this project answers

Who ran for school board in your county last year? Who was the sheriff, and did anyone run against them? What was the closest local race in your state? Has your county commissioner been reelected five times unopposed, or do they face real competition?

These should be easy questions. They are not.

There is no national database of US local election results. The data exists — scattered across 50 state election boards, 3,000+ county clerk offices, academic datasets, election night reporting platforms, and community-curated repositories — but it has never been unified into a single, consistent, trustworthy format. Every source uses different schemas, different name formats, different office titles, different geographic identifiers, and different levels of completeness.

This project fixes that.

What we found when we tried

We downloaded 42 million rows of precinct-level election data from the MIT Election Data + Science Lab (MEDSL), the North Carolina State Board of Elections, OpenElections, VEST, the Census Bureau, and the FEC. We covered all 50 states across three election cycles (2018, 2020, 2022) and ten election cycles of deep North Carolina history (2006–2024).

Then we tried to answer simple questions, and the problems started immediately.

The same candidate appears differently across sources. MEDSL reports SHANNON W BRAY. NC SBE reports Shannon W. Bray. One is all caps with no period after the middle initial. The other is title case with a period. These are the same person — but a computer doesn’t know that without being told.

Nicknames break everything. Charlie Crist in one source is CRIST, CHARLES JOSEPH in another. A human recognizes Charlie as a nickname for Charles. An embedding model scores their similarity at 0.451 — well below any reasonable match threshold. A language model, given the context (same state, same office, same election, same vote count), correctly identifies them as the same person with 0.95 confidence.

The same office title means different things. In Texas, the “County Judge” is the chief executive of the county — equivalent to a county manager. In every other state, a county judge is a judicial officer. If your system classifies “DALLAS COUNTY JUDGE” as judicial, you’re wrong in Texas and right everywhere else. Across all 50 states in 2022, we found 8,387 unique local office names. Our keyword classifier handles 62% of them. The remaining 38% require embedding-based matching and LLM reasoning.

Non-candidate data hides inside candidate data. Florida OpenElections includes 6,013 rows labeled “Registered Voters” — not a contest, but a turnout metadata row that got slurped into the results file as if it were a race. Other sources include “BLANK” (Maine’s name for undervotes), “TOTAL VOTES” (Utah’s aggregation rows), and “OverVotes” / “UnderVotes” masquerading as candidate names. Each source has its own ghosts.

Nobody tracks the same person across elections. Timothy Lance won a Columbus County, NC school board seat in 2022. Did he run before? Did he win? Is the “T. Lance” who ran in 2018 the same person? No existing dataset answers this. Entity resolution — determining that two records refer to the same human being — is the hardest problem in this project, and the one we spend the most effort on.

What this project does

Election Aggregation is a five-layer pipeline that transforms messy, heterogeneous election data into a clean, unified, entity-resolved dataset with full provenance back to the original source files.

L0  RAW          Byte-identical source files. Never modified.
 ↓
L1  CLEANED      Parsed, structured records. Names decomposed into
                 first/middle/last/suffix. FIPS codes enriched.
                 Office classified by keyword and regex.
                 Purely deterministic. No ML, no API calls.
 ↓
L2  EMBEDDED     Vector embeddings generated for candidates, contests,
                 and geographic names. Office classification tier 3
                 (embedding nearest-neighbor). Quality flags raised.
                 Deterministic given the same embedding model.
 ↓
L3  MATCHED      Entity resolution. Same-candidate and same-contest
                 identifiers assigned. Embedding retrieval + LLM
                 confirmation. Every decision stored with reasoning.
 ↓
L4  CANONICAL    Authoritative names chosen. Temporal chains built.
                 Alias tables constructed. Verification algorithms run.
                 Researcher-facing exports produced.

The ordering is strict and deliberate: Clean → Embed → Match → Canonicalize. You cannot assign an authoritative name before you know who the person is. You cannot match entities before you have embeddings. You cannot embed before you have clean, signal-preserving parsed records. And you cannot parse before you have the raw bytes.

Every record at every layer carries a cryptographic hash chain back to the original source file. If someone modifies a vote count, changes a name, or alters a match decision at any layer, the verification algorithm detects exactly where the chain breaks.
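To make the chain concrete, here is a minimal sketch of per-record chaining and break detection, assuming the `sha2` crate. The struct and function names are illustrative, not the pipeline's actual types.

```rust
// Minimal hash-chain sketch, assuming the `sha2` crate.
// `Record` and `chain_hash` are illustrative names, not the pipeline's API.
use sha2::{Digest, Sha256};

struct Record {
    payload: String,      // canonical serialization of this layer's fields
    prev_hash: [u8; 32],  // hash carried forward from the upstream record
    hash: [u8; 32],       // H(prev_hash || payload)
}

fn chain_hash(prev: &[u8; 32], payload: &str) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(prev);
    hasher.update(payload.as_bytes());
    hasher.finalize().into()
}

/// Walk a record's chain from L0 upward; report where the first link breaks.
fn first_break(chain: &[Record]) -> Option<usize> {
    chain.iter().enumerate().find_map(|(i, r)| {
        (chain_hash(&r.prev_hash, &r.payload) != r.hash).then_some(i)
    })
}
```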

What this project does not do

  • It does not store election data. The data files are large (7+ GB for our current corpus) and are published by their respective sources under their own terms. This project tells you where to get the data, documents every source’s schema and quirks, and provides the tools to process it. You download the data yourself.

  • It is not a real-time election night tracker. We ingest official and certified results, not live feeds. The pipeline is designed for post-election analysis, not real-time reporting.

  • It is not a prediction model. We report what happened, not what will happen.

  • It does not claim perfect accuracy. Entity resolution is probabilistic. Office classification has a 0.56% “other” rate. Some records have data quality issues we haven’t caught yet. We document every known limitation, and every match decision is auditable.

What you can answer today

With the data currently available (MEDSL 2018/2020/2022 for all 50 states, NC SBE 2006–2024, OpenElections for 6 states, VEST shapefiles for 4 states), you can answer:

| Question | Answer from our data |
|---|---|
| How many sheriffs ran unopposed in 2022? | 55% in North Carolina, 77% in Maine; varies by state |
| What was the closest school board race in America? | Dawson County, GA — exact tie at 25,186 to 25,186 |
| How many local races were uncontested? | 48.8% nationally (keyword-classified subset) |
| Which office type is least competitive? | Constable/Coroner at 72% uncontested |
| Which is most competitive? | City Council at 10% uncontested |
| Who has served longest on a local body in NC? | George Dunlap — Mecklenburg County Commissioner, 6 consecutive cycles (2014–2024) |
| How many unique elected offices exist in America? | At least 8,387 distinct office names in MEDSL 2022 alone; 4,995 exist in exactly one county |
| Did the same candidate run across multiple elections? | Yes — 702 NC candidates appear in 3+ election cycles (2014–2024) |

And questions you cannot answer yet, honestly:

| Question | Why not |
|---|---|
| What’s the voter turnout for school board races? | Turnout data exists in less than 5% of records |
| Did this candidate switch parties? | Requires entity resolution across elections, which is functional but not yet validated at scale |
| What are the RCV round-by-round results? | Schema doesn’t support ranked-choice voting yet |
| How do local election results correlate with demographics? | Census demographic join is ready (100% FIPS coverage) but not yet implemented |
| What happened in odd-year elections (2015, 2017, 2019)? | MEDSL has odd-year data on Harvard Dataverse; we haven’t loaded it yet |

The data sources

This project processes data from multiple sources. We do not redistribute their data. Here is what each provides and where to get it:

| Source | What it is | Coverage | Where to get it |
|---|---|---|---|
| MEDSL | MIT Election Data + Science Lab precinct returns | All 50 states + DC, 2018/2020/2022 | GitHub, Harvard Dataverse |
| NC SBE | North Carolina State Board of Elections | NC only, 2006–2024 (10 cycles) | NC SBE |
| OpenElections | Community-curated precinct data | ~8 states, varies | GitHub |
| Clarity/Scytl | Election night reporting XML | ~1,000+ jurisdictions | Per-jurisdiction URLs (unstable) |
| VEST | Precinct results + geographic boundaries | All 50 states (shapefiles) | Harvard Dataverse |
| Census | FIPS code reference files | National | Census.gov |
| FEC | Federal candidate master files | National | FEC.gov |

Each source has its own chapter documenting the exact schema, download commands, known data quality issues, and how our pipeline handles its quirks.

Who this book is for

If you’re a journalist and you want to answer “what happened in local elections in my area” — start with Questions for Journalists, then go to Getting Started and Recipes. You don’t need to understand the pipeline architecture. You need the data and the queries.

If you’re a researcher and you want a citable, reproducible, documented dataset for studying local election competitiveness, candidate career paths, or democratic participation — start with Questions for Researchers, then read Reproducibility Guide and How to Cite This Data. The dataset is versioned with DOIs. Every entity resolution decision is logged and auditable.

If you’re a government staffer and you need to know what elected offices exist in your jurisdiction, how your state compares to others, or how to benchmark election administration — start with Questions for Government Staffers and Office Inventory Recipe.

If you’re a developer and you want to contribute to the pipeline, add a new data source, or understand the Rust implementation — start with Design Principles, then read The Five-Layer Pipeline and Type System Design. The mdbook is the spec. The Rust types are the implementation.

If you’re evaluating this architecture for your own data pipeline project — the Architecture section describes a pattern (immutable layers, deterministic-first processing, embeddings for retrieval, LLMs for confirmation) that generalizes beyond election data. The Hard Problems section documents real entity resolution challenges with real data and real solutions.

How to read this book

The book is organized in the order you’d have questions:

  1. Part I: The Problem — Why local election data is a mess, and what questions we’re trying to answer.
  2. Part II: Data Sources — Where the data comes from, exactly what’s in it, and how to download it yourself.
  3. Part III: The Hard Problems — Name normalization, office classification, entity resolution, and cross-source reconciliation. Real examples from real data. This is the heart of the book.
  4. Part IV: Architecture — The five-layer pipeline, the hash chain, the embedding strategy, the LLM integration. How the system is designed and why.
  5. Part V: Unified Schema — The exact record format, field by field. What each field means, where it comes from, and which layer populates it.
  6. Part VI: Rust Implementation — The type system, the traits, the module structure. How the architecture becomes code.
  7. Part VII: Using the Data — Download instructions, pipeline execution, and ten ready-to-use recipes with copy-paste queries.
  8. Part VIII: Trust and Reproducibility — How to verify the data, how to cite it, how to report errors, and what the known limitations are.

You don’t have to read it in order. Every chapter is self-contained with cross-references to related sections. But if you read Part I and Part III, you’ll understand why this project exists and what makes it hard. Everything else follows from that.


This project is open source under MIT/Apache-2.0. The data it processes is published by its respective sources under their own licenses (generally CC-BY or public domain). We do not store or redistribute election data.

Why This Is Hard

US local election results are published by approximately 3,143 county-level election offices, 50 state election boards, and an unknown number of municipal clerks. There is no common schema, no shared identifier system, and no central repository. This chapter describes the five structural problems that make unification difficult.

Fragmented administration

US elections are administered at the county level. Each county decides independently how to collect, tabulate, and publish results. Some publish precinct-level CSV files. Others post scanned PDFs. Some use Clarity/Scytl election night reporting platforms that expose structured XML. Others put results on pages that require JavaScript rendering.

There is no federal mandate requiring any particular publication format. The result is 3,143 independent data silos with different schemas, different update schedules, different URL structures, and different retention policies.

When we downloaded precinct-level data for all 50 states from the MIT Election Data + Science Lab (MEDSL), we received 51 separate files containing 12.3 million rows. The files use different column encodings, different candidate name conventions, and different definitions of what constitutes a “local” race. Seven states — California, Iowa, Kansas, New Jersey, Pennsylvania, Tennessee, and Wisconsin — had zero local race records in the MEDSL 2022 dataset.

No standard schema

The same vote record looks different in every source. Here is a single result — Shannon W. Bray in a 2022 North Carolina precinct — as represented by three different sources:

MEDSL (25-column CSV, one row per vote mode):

precinct,office,party_simplified,mode,votes,candidate,...
12-13,US SENATE,LIBERTARIAN,ELECTION DAY,47,SHANNON W BRAY,...

NC SBE (15-column TSV, vote modes as columns):

County	Precinct	Contest Name	Choice	Election Day	One Stop	Absentee by Mail	Provisional	Total Votes
CABARRUS	12-13	US SENATE	Shannon W. Bray	47	38	5	0	90

OpenElections (7-column CSV, totals only):

county,precinct,office,party,candidate,votes
Cabarrus,12-13,U.S. Senate,LIB,Shannon W. Bray,90

Three sources. Three schemas. Two renderings of the candidate name across the three sources (SHANNON W BRAY vs. Shannon W. Bray). Three levels of granularity for vote mode data. Two office name formats (US SENATE vs. U.S. Senate). This is a federal race where all three sources agree on the totals. For local races, the divergence is worse.

Name formatting differs across every source

We compared MEDSL and NC SBE data for 640 contests in the 2022 North Carolina general election where both sources reported the same vote totals. In 401 of those contests (63%), candidate names are formatted differently between the two sources.

The differences are systematic:

| Pattern | MEDSL | NC SBE |
|---|---|---|
| Case | SHANNON W BRAY | Shannon W. Bray |
| Middle initial punctuation | VICTORIA P PORTER | Victoria P. Porter |
| Nickname quoting | MICHAEL "STEVE" HUBER | Michael (Steve) Huber |
| Suffix formatting | ROBERT VAN FLETCHER JR | Robert Van Fletcher, Jr. |
| Nickname style | LM "MICKEY" SIMMONS | L.M. (Mickey) Simmons |
| Write-in label | WRITEIN | Write-In (Miscellaneous) |

Each source applies a consistent internal convention. MEDSL uses ALL CAPS with no punctuation. NC SBE uses Title Case with periods and commas. Across sources, the conventions diverge.
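Because each convention is internally consistent, a simple normalization key collapses most of the formatting differences in the table above. A sketch (illustrative, not the pipeline's actual rules):

```rust
/// Collapse case, punctuation, and nickname-quote style into one key.
/// This handles formatting only; nicknames and suffixes need more than this.
fn normalization_key(name: &str) -> String {
    name.to_uppercase()
        .chars()
        // Drop periods, commas, quotes, and parentheses alike.
        .filter(|c| c.is_ascii_alphanumeric() || c.is_whitespace())
        .collect::<String>()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

// normalization_key("SHANNON W BRAY") == normalization_key("Shannon W. Bray")
// normalization_key(r#"MICHAEL "STEVE" HUBER"#) == normalization_key("Michael (Steve) Huber")
```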

The formatting problem is solvable with normalization rules. The deeper problem is name identity. We tested real candidate pairs with OpenAI’s text-embedding-3-large model (3,072 dimensions):

| Name A | Name B | Cosine similarity | Same person? |
|---|---|---|---|
| Charlie Crist | CRIST, CHARLES JOSEPH | 0.451 | Yes |
| Robert Williams | Robert Williams Jr. | 0.862 | No |
| Nikki Fried | Nicole Fried | 0.642 | Yes |
| Ron DeSantis | DESANTIS, RON | 0.729 | Yes |

“Charlie Crist” and “CRIST, CHARLES JOSEPH” score 0.451 — below any reasonable match threshold — because Charlie and Charles have unrelated vector representations. These are the same person (same state, same office, identical vote count of 3,101,652). Only a language model with knowledge that Charlie is a common nickname for Charles can make the connection.

“Robert Williams” and “Robert Williams Jr.” score 0.862 — above most auto-accept thresholds used in the literature. These are different people. The “Jr.” suffix indicates a generational distinction. A system that auto-accepts at 0.82 would merge a father and son into one entity.
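The takeaway: no single cosine threshold separates matches from non-matches. A sketch of the resulting routing logic follows; the band boundaries are illustrative, not the pipeline's tuned values.

```rust
/// Cosine similarity between two embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm(a) * norm(b))
}

enum Route {
    AutoReject, // clearly unrelated strings
    LlmReview,  // the wide middle band: context decides, not the score
    AutoAccept, // near-identical strings
}

// Both failure modes above (0.451 = same person, 0.862 = different people)
// land in the review band, where an LLM sees office, county, and vote totals.
fn route(similarity: f32) -> Route {
    match similarity {
        s if s < 0.30 => Route::AutoReject,
        s if s >= 0.97 => Route::AutoAccept,
        _ => Route::LlmReview,
    }
}
```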

Institutional variation by state

Across all 50 states in the 2022 MEDSL data, we found 8,387 unique local office names. Our keyword classifier handles 62% of them. The remaining 38% includes offices where the same title has different institutional meanings depending on the state.

“County Judge” in Texas is the presiding officer of the Commissioners Court — the chief executive of the county, analogous to a county manager. In every other state, a county judge presides over a courtroom. Texas has 254 counties; each has a County Judge who is an executive, not a judicial officer.

“Sheriff” in Connecticut is a court officer who serves civil process. In the other 49 states, the sheriff runs the county jail and patrols unincorporated areas.

“Board of Education” is an elected body in some states and an appointed body in others. Where it is appointed, it does not appear in election data — its absence from a source does not mean the county lacks a school board.

A static lookup table mapping office names to categories does not work. The classification must account for state-level context, which is why the pipeline uses a four-tier classifier: keyword matching for unambiguous names, regex patterns for structured names, embedding similarity against a reference set, and a language model for genuinely ambiguous cases.
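The cascade's shape is sketched below. The tier functions are stubs standing in for the real keyword tables, regex sets, and models described in later chapters; the County Judge branch shows why state context must be consulted before any static lookup.

```rust
#[derive(Clone, Copy)]
enum OfficeKind { Executive, Judicial, Legislative, Education, LawEnforcement, Other }

// Tiers 1-3 are stubbed here; the real tables and models come later.
fn keyword_lookup(_office: &str) -> Option<OfficeKind> { None }
fn regex_lookup(_office: &str) -> Option<OfficeKind> { None }
fn embedding_nearest(_office: &str) -> Option<OfficeKind> { None }
fn llm_classify(_office: &str, _state: &str) -> OfficeKind { OfficeKind::Other }

fn classify(office: &str, state_po: &str) -> OfficeKind {
    // State-specific collisions are checked before any generic tier.
    if office.contains("COUNTY JUDGE") {
        // Texas: chief executive of the county. Elsewhere: a judicial officer.
        return if state_po == "TX" { OfficeKind::Executive } else { OfficeKind::Judicial };
    }
    keyword_lookup(office)                      // Tier 1: unambiguous keywords
        .or_else(|| regex_lookup(office))       // Tier 2: structured title patterns
        .or_else(|| embedding_nearest(office))  // Tier 3: nearest reference title
        .unwrap_or_else(|| llm_classify(office, state_po)) // Tier 4: LLM
}
```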

No persistent candidate identifiers

Timothy Lance won a seat on the Columbus County Schools Board of Education in 2022. No existing dataset can answer whether he ran before, whether he won, or whether the “T. Lance” who appeared on a 2018 ballot is the same person.

MEDSL, NC SBE, and OpenElections each treat every election as an independent snapshot. There is no identifier linking Timothy Lance (2022) to Timothy Lance (2020) to Tim Lance (2018). The candidate name can change between elections — a middle initial added, a nickname used, a suffix dropped after a parent’s death. The office can change if the candidate runs for a different seat. The county can change if the candidate relocates.

In 10 years of NC SBE data (2014–2024), we found 702 candidates appearing in three or more election cycles using exact name matching within the same county. George Dunlap appeared on the Mecklenburg County ballot in six consecutive cycles. Paul Beaumont in Currituck County ran for the Board of Commissioners, then the Board of Education, then back to Commissioners.

Connecting these records — determining that entries in different elections, from different sources, with different formatting, refer to the same person — requires preserving every name component through the cleaning pipeline, embedding candidates for vector retrieval, and confirming ambiguous matches with a language model that reasons about context (office, county, party, vote totals). This process is called entity resolution, and it is detailed in its own chapter.

What this adds up to

In 2022, across the MEDSL data for all 50 states, 48.8% of classified local races had only one candidate. In Minnesota, the uncontested rate was 89.3%. Nineteen local races ended in exact ties. Forty-three were decided by a single vote. These are basic facts about American democracy that require combining data from multiple sources, resolving thousands of name variations, classifying thousands of office types, and linking candidates across elections.

That is what this project does. The rest of this book describes how.

What Questions Should Be Answerable?

The purpose of this project is to make US local election data queryable. Across 42 million rows, 50 states, and 8,387 distinct office names, basic questions remain difficult to answer. This chapter frames those questions by audience.

Four audiences, different needs

  • Journalists need specific, verifiable facts — closest races, unopposed incumbents, anomalies worth investigating. See For Journalists.
  • Researchers need structured, reproducible datasets — uncontested rates by office type, candidate career paths, cross-state comparisons. See For Researchers.
  • Government staffers need operational inventories — what offices exist in a jurisdiction, how many races appear on a ballot, how local structures compare to peer counties. See For Government Staffers.
  • Civic tech developers need reliable data interchange — OCD-ID mappings, entity-resolved candidate records, JSONL exports for downstream applications. See For Civic Tech Developers.

What the data already tells us

Even partial analysis of available sources reveals findings that are difficult to obtain elsewhere:

  • 48.8% of local races in available data are uncontested — one candidate, no opponent.
  • 19 exact ties have been identified across the dataset (same vote total, different candidates).
  • 8,387 unique office name strings exist before normalization, many referring to the same underlying office.

These numbers are not estimates. They come from deterministic queries against cleaned, source-attributed JSONL records. The methodology for each finding is documented in the recipe chapters.

Why these questions matter

No single existing source answers all of these questions. The existing landscape chapter surveys what is available today and where each source falls short. This project exists to fill the gaps — not by replacing those sources, but by unifying them through a documented, reproducible pipeline.

For Journalists

Local election data is where accountability stories live — and where data is hardest to find. These are the questions journalists ask, with real answers drawn from the dataset.

Closest races

  • Who won the closest race in America? Dawson County, GA had an exact tie at 25,186 votes to 25,186 — decided by recount procedures, not by a single voter’s margin.
  • How many exact ties exist? 19 exact ties have been identified across available data. Each is flagged with the specific contest, county, and vote totals. See Closest Races in America.
  • Which school board races were decided by single digits? Madison County, IN had a school board race decided by 1 vote. These contests are queryable by margin across all office types.

Unopposed races

  • How many sheriffs ran unopposed? In North Carolina, 55% of sheriff races were uncontested. In Maine, 77%. National figures depend on source coverage — seven states lack local data entirely.
  • What’s the overall uncontested rate? 48.8% of local races in available data have a single candidate. This figure spans all office types and all states with coverage.
  • Which offices are most likely to be uncontested? Constable races are uncontested 72% of the time. City council races: 10%. The rate varies by office type and state. See Uncontested Race Rate by State.

Accountability angles

  • Who keeps winning without opposition? Candidate entity resolution across election cycles identifies incumbents who have never faced an opponent. See Career Tracking Across Elections.
  • Which counties have the most uncontested offices? County-level aggregation is possible wherever FIPS codes are present in the source data. See Sheriff Accountability.
  • Are there races where write-in candidates are the only opposition? Write-in totals are preserved where the source reports them. In some jurisdictions, write-in votes account for the only opposition in over a third of contests.

Verification

  • Can I verify a specific result? Every record traces back to a named source (e.g., NC SBE certified results, MEDSL). The pipeline preserves source file hashes and original field values. See Verify a Specific Result.
  • How do I cite this data? You cite the original source, not this project. The project provides the source name, retrieval date, and confidence level for each record. See Confidence Levels.

What you cannot get here (yet)

  • Turnout data is present in fewer than 5% of records.
  • Seven states (CA, IA, KS, NJ, PA, TN, WI) have zero local coverage in MEDSL 2022.
  • Odd-year elections (2015, 2017, 2019, 2021) are underrepresented.

These gaps are documented in Known Limitations. If you are reporting on a state with limited coverage, check the Coverage Matrix first.

For Researchers

Local election data presents structural challenges for quantitative research: inconsistent office names, no universal candidate identifiers, and source-dependent coverage gaps. These are the questions researchers ask, with real answers and methodology notes.

Competitiveness and contestation

  • What’s the uncontested rate by office type? Constable: 72%. Soil and water conservation district: 58%. County commissioner: 34%. City council: 10%. These rates are computed from L4 canonical records where candidate_count = 1 for a given contest.
  • How does competitiveness vary across states? Minnesota’s uncontested rate is 89.3%. Florida shows 0% uncontested in available MEDSL local data — a coverage artifact, not a political finding. Interpret cross-state comparisons with the Coverage Matrix.
  • What is the national uncontested rate? 48.8% across all available local races. This figure is coverage-weighted: states with more reported contests contribute proportionally more. It is not a population-weighted estimate.

Candidate career tracking

  • Can I track candidates across election cycles? Entity resolution at L3 links candidate records across years and sources. George Dunlap (Mecklenburg County, NC) appears in 6 election cycles under consistent entity IDs. See Career Tracking Across Elections.
  • What identifier links candidates across sources? The L4 canonical_candidate_id is a deterministic hash of resolved name components, jurisdiction, and office. It is stable across pipeline runs given the same L3 decisions; a sketch follows this list.
  • How reliable is cross-cycle linking? Exact name matches are deterministic. Fuzzy matches (Jaro-Winkler ≥ 0.92) and embedding matches (cosine ≥ 0.88) are logged with scores. LLM-assisted matches include the decision ID. All match metadata is queryable.
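The stability property referenced above can be realized like this. A hedged sketch using the standard library's hasher for brevity; the real pipeline would pin a cross-version-stable algorithm such as SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministic ID from resolved name components + jurisdiction + office.
/// Same L3-resolved inputs therefore yield the same ID on every run.
/// (Caveat: DefaultHasher is only stable within one Rust release; a real
/// pipeline would use a pinned algorithm such as SHA-256.)
fn canonical_candidate_id(resolved_name: &str, jurisdiction_fips: &str, office: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    (resolved_name, jurisdiction_fips, office).hash(&mut hasher);
    hasher.finish()
}
```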

Cross-source validation

  • How consistent are sources that cover the same contests? In 640 overlapping contests between MEDSL and NC SBE, 90.5% have identical vote totals. The remaining 9.5% differ by small amounts, typically due to provisional ballot timing or reporting cutoff dates.
  • How do candidate names differ across sources? In the same 640 overlapping contests, 63% have name formatting differences (e.g., “SMITH, JOHN A” vs. “John A. Smith”). These are resolved at L1 (parsing) and confirmed at L3 (entity resolution).

Office taxonomy

  • How many distinct offices exist? 8,387 unique office name strings before normalization. After L2 classification (keyword, regex, embedding, LLM), these resolve to a smaller set of canonical office types. See Office Classification Reference.
  • What office types exist at the sub-county level? Constable, justice of the peace, soil and water conservation district supervisor, school board trustee, municipal utility district director, and hundreds of jurisdiction-specific titles.

Reproducibility

All findings above are reproducible from the pipeline output:

  • L0 → L2 layers are fully deterministic. Given the same source files and pipeline version, output is byte-identical.
  • L3 decisions are logged in a decision log (JSONL). Replaying the log against the same L2 input reproduces L3 exactly, even when LLM calls were involved.
  • L4 is deterministic given L3 output.
  • Versioned JSONL files at every layer serve as the unit of reproducibility. Each file includes a manifest with source hashes, pipeline version, and timestamp.

To reproduce a specific finding, check out the tagged pipeline version, supply the same L0 inputs, and run the pipeline. The decision log ensures that even probabilistic steps (embedding similarity, LLM confirmation) produce identical output on replay.
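A hypothetical shape for a decision-log entry and its replay lookup; the actual field names live in the schema chapters. The point is that replay substitutes the logged outcome for any live embedding or LLM call.

```rust
/// Hypothetical L3 decision-log entry (one JSON object per JSONL line).
struct MatchDecision {
    decision_id: String,
    left_hash: String,   // content hash of candidate record A
    right_hash: String,  // content hash of candidate record B
    method: String,      // "exact" | "jaro_winkler" | "embedding" | "llm"
    score: f64,          // similarity score, where applicable
    accepted: bool,
    reasoning: String,   // stored LLM rationale; empty for deterministic tiers
}

/// On replay, the logged outcome is reused; no API call is made.
fn replay(log: &[MatchDecision], left: &str, right: &str) -> Option<bool> {
    log.iter()
        .find(|d| d.left_hash == left && d.right_hash == right)
        .map(|d| d.accepted)
}
```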

Data format for analysis

Pipeline output is JSONL — one JSON object per line. This is directly loadable into pandas, R (jsonlite), DuckDB, or any tool that reads newline-delimited JSON. No proprietary formats or database dependencies are required.

For Government Staffers

County clerks, election administrators, and local officials need operational data — not research datasets. These are the questions government staff ask, with answers drawn from the dataset.

Office inventories

  • What offices exist in my county? Columbus County, NC has 25 distinct elected offices across county government, municipalities, school boards, and special districts. The pipeline produces per-county office inventories from L4 canonical records. See Office Inventory for a County.
  • How many races will be on the next ballot? Historical office inventories establish the set of offices that typically appear in a given election cycle. Odd-year vs. even-year patterns, staggered terms, and special elections are identifiable where source data includes election dates and term lengths.
  • Which offices are partisan vs. nonpartisan? Party affiliation is recorded where the source provides it. In North Carolina, all county commissioner races are partisan; all school board races are nonpartisan. Coverage varies by state.

Comparisons

  • How does our uncontested rate compare to peer counties? County-level uncontested rates are computable for any jurisdiction with coverage. A county clerk can compare their 60% uncontested rate against the state median or against demographically similar counties. See Uncontested Race Rate by State.
  • Are other counties consolidating offices we still elect separately? Office inventories across counties within a state reveal structural differences — some counties elect a coroner, others appoint one. The data does not explain why, but it shows where differences exist.
  • How many candidates typically file for each office? Candidate counts per contest are derivable from L4 records. A county with historically 1.2 candidates per school board seat has a different recruitment problem than one averaging 3.4.

Administrative planning

  • What does our ballot complexity look like over time? The number of contests per jurisdiction per cycle is queryable. Ballot length affects printing costs, voter fatigue, and polling place logistics.
  • Which districts overlap our jurisdiction? Where OCD-IDs are present, hierarchical district relationships can be inferred. A county contains municipalities, school districts, and special districts — the data reflects which contests appear in which jurisdictions.

Data format

All outputs are JSONL with one record per contest-candidate pair. Government staff who need spreadsheets can convert JSONL to CSV with standard tools. See Querying JSONL Output.

Caveats

  • Office inventories are only as complete as the source data. If a state does not report local results to MEDSL or another covered source, those offices will not appear.
  • The pipeline documents sources and provides tools — it does not store or redistribute official election results. See The Project Does Not Store Data.
  • Seven states have zero local coverage in MEDSL 2022. Check the Coverage Matrix before relying on completeness for a specific jurisdiction.

For Civic Tech Developers

Civic technology projects depend on structured, reliable election data. Most fail not because of engineering limitations but because the underlying data is fragmented, inconsistently formatted, and difficult to resolve across sources. These are the questions developers ask when building on local election data.

Ballot lookup tools

  • Can I build a “what’s on my ballot” tool? Yes, but it requires mapping voter addresses to jurisdictions (via OCD-IDs or FIPS codes) and then mapping jurisdictions to offices. The dataset contains 8,387 unique office name strings — many of which refer to the same office across sources. The L4 canonical layer resolves these to deduplicated office records with jurisdiction identifiers.
  • How do I map an address to its contests? You need an OCD-ID → office mapping. OCD-IDs (Open Civic Data Identifiers) are present where source data includes them or where FIPS codes allow deterministic derivation. Coverage is not universal. See the Schema Overview for the jurisdiction.ocd_id field.
  • What format is the data in? Every pipeline layer outputs JSONL (newline-delimited JSON). One record per line, one file per source-year-state. No database required — parse with jq, Python, DuckDB, or any JSON-capable tool.

Candidate lookup and entity resolution

  • Can I build a candidate lookup API? The L4 layer provides entity-resolved candidate records with canonical names, office history, and source attribution. A candidate who appears as “Bill Smith” in one source and “William R. Smith Jr.” in another is resolved to a single entity with both name variants preserved.
  • How reliable is entity resolution? It depends on the match method. Exact matches and high-confidence Jaro-Winkler matches (≥0.92) are deterministic. Embedding-based and LLM-confirmed matches carry a decision ID that traces back to the specific match rationale. See The Cascade.
  • Can I track candidates across election cycles? Yes. Entity resolution operates across years. George Dunlap in Mecklenburg County, NC appears across 6 election cycles with consistent entity IDs. See Career Tracking.

Election history and widgets

  • Can I build an election history widget for a jurisdiction? The data supports historical queries by jurisdiction, office, and candidate. Time series depend on source coverage — MEDSL covers 2018–2022 for most states; NC SBE covers 2006–2024 for North Carolina.
  • What about ballot measures? Ballot measures are a distinct contest kind (BallotMeasure) in the schema. Choices are normalized to for/against/yes/no at L1.

Data interchange

  • Why JSONL and not a REST API? JSONL is the data interchange format at every layer. It is self-describing, streamable, and requires no server infrastructure. Downstream applications can ingest it directly or load it into any datastore.
  • Can I join this data with other civic datasets? Yes. Records include FIPS codes, OCD-IDs (where available), and state abbreviations. These are standard join keys for Census data, geographic boundaries, and other civic datasets.
  • Is the schema stable? The schema is versioned. Each JSONL record includes a schema_version field. Breaking changes increment the major version. See Schema Overview.

What to watch out for

  • The project does not host a live API or data download. It documents sources and provides pipeline tools to process them. You run the pipeline yourself.
  • Coverage gaps exist. Seven states lack local data in MEDSL 2022. Odd-year elections are underrepresented. Check the Coverage Matrix before building features that assume national coverage.
  • Entity resolution is probabilistic for non-exact matches. If your application requires certainty, filter to records with match_method: "exact" or match_method: "jaro_winkler".
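A minimal sketch of that filter over a JSONL export, assuming the serde_json crate. The file name is illustrative; the match_method values follow the list above.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    // Illustrative file name; one JSON object per line.
    let reader = BufReader::new(File::open("l4_candidates.jsonl")?);
    for line in reader.lines() {
        let line = line?;
        let record: serde_json::Value =
            serde_json::from_str(&line).expect("each line is a JSON object");
        // Keep only deterministic matches; drop embedding/LLM decisions.
        if matches!(record["match_method"].as_str(), Some("exact" | "jaro_winkler")) {
            println!("{line}");
        }
    }
    Ok(())
}
```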

What Exists Today and Where It Falls Short

Several organizations publish US election data. Each serves a different purpose, covers a different scope, and has different limitations. This chapter surveys the major sources and identifies the gaps that motivate this project.

MEDSL — MIT Election Data + Science Lab

MEDSL provides the most comprehensive freely available collection of US election returns. Their datasets cover federal, state, and many local races across multiple election cycles. Data is published as flat CSV files with consistent column schemas.

Strengths. Wide state coverage for federal and state races. Consistent schema across years. Academic quality control. Openly licensed. Includes candidate-level vote totals with party affiliation.

Weaknesses. Seven states have zero local election coverage in the 2022 dataset: CA, IA, KS, NJ, PA, TN, and WI. Office name strings are not normalized — the same office appears under different names across states and years. No entity resolution across cycles (the same candidate is a new row each time). Turnout metadata is sparse. Release cadence lags elections by 12–18 months.

ALGED — Annual Local Government Election Dataset

ALGED focuses specifically on local elections in US cities, filling a gap that most other sources ignore. It covers mayoral, city council, and some school board races.

Strengths. Dedicated local focus. Includes candidate demographics and incumbency status where available. Covers elections that no other academic dataset tracks.

Weaknesses. Limited to cities with populations above 50,000. Data collection appears to have stopped around 2021. Does not cover counties, townships, or special districts. Not currently integrated into this pipeline (planned for future work).

OpenElections

OpenElections is a community-curated effort to collect certified election results for all 50 states. Volunteers parse state-level result files into a common CSV format and publish them on GitHub.

Strengths. State-level certified results for many states. Community-driven, so coverage expands over time. Raw source files are preserved alongside parsed output. Free and open.

Weaknesses. Coverage varies dramatically by state — some states have complete precinct-level data back to 2000, others have nothing below the county level. Schema consistency depends on the volunteer. Local races are included when the state publishes them, but there is no systematic local collection effort. Quality varies; some state files have known parsing errors that persist across releases.

Ballotpedia

Ballotpedia maintains a wiki-style encyclopedia of US elections covering federal, state, and many local offices. Their coverage of school boards, judicial elections, and ballot measures is broader than most sources.

Strengths. Broad office-type coverage including judicial, school board, and special district races. Candidate biographical information. Historical coverage for some offices. Structured data behind the wiki pages.

Weaknesses. Bulk data access requires a commercial API license. No freely available flat-file download. Data is editorial (curated by staff, not derived from certified results). Not suitable as a primary source for vote totals, though useful for office inventories and candidate metadata.

Associated Press (AP)

The AP provides real-time and certified election results to media organizations. Their data covers federal, state, and many local races on election night and through the canvassing period.

Strengths. Fast — results are available on election night. Broad geographic coverage. Includes local races in many states. High reliability for the races they cover.

Weaknesses. Expensive commercial license. Not available for academic or civic tech use without a contract. Historical data is not publicly archived. Coverage decisions are editorial — not all local races are included.

Other sources

  • State election board websites (e.g., NC SBE) publish certified results, but formats vary by state — PDF, Excel, CSV, HTML, or proprietary portals. No two states use the same schema.
  • Clarity/Scytl election night reporting portals are used by many counties. Data is structured but ephemeral — pages are taken down or overwritten after certification.
  • VEST (Voting and Election Science Team) provides precinct-level shapefiles matched to election returns, primarily for redistricting research. Coverage is strong for federal races but limited at the local level.
  • FEC publishes federal candidate filings and financial data. No state or local coverage.
  • Census Bureau provides FIPS codes and geographic hierarchies, which are essential for joining across sources but contain no election results.

Summary

| Source | Local coverage | Schema consistency | Freely available | Current | Entity resolution |
|---|---|---|---|---|---|
| MEDSL | 43 of 50 states (2022) | High | Yes | Yes (with lag) | No |
| ALGED | Cities >50K only | Medium | Yes | No (~2021) | No |
| OpenElections | Varies by state | Low | Yes | Yes | No |
| Ballotpedia | Broad | Medium | API only | Yes | Partial |
| AP | Broad | High | No (commercial) | Yes | No |
| State portals | Varies | None (50 formats) | Usually | Yes | No |

No single source covers all local races, uses a consistent schema, resolves candidates across elections, and is freely available. That gap — between what exists and what the four audiences need — is what this project addresses.

Source Overview

This project ingests election data from seven sources. None are complete on their own. Each fills a different gap — geographic breadth, temporal depth, local race coverage, geographic boundaries, or reference identifiers. The pipeline merges them into a unified schema; this chapter documents what each provides and where they overlap.

Source Summary

| Source | What It Provides | Coverage | Format | Access Method |
|---|---|---|---|---|
| MEDSL | Precinct-level returns for federal, state, and some local races | 50 states + DC; 2018, 2020, 2022 (~36.5M rows) | CSV/TSV, one file per state per cycle | Harvard Dataverse download, GitHub mirror |
| NC SBE | Precinct-level returns for every contest on the ballot, with vote mode breakdowns | NC only; 2006–2024 (10 cycles, ~2M rows) | Tab-delimited TXT in ZIP archives | S3 bucket direct download |
| OpenElections | Community-curated precinct-level CSV files | ~8 states with 2022 data (FL, GA, MI, OH, PA, TX, others); coverage varies | CSV, schema varies by state | Git clone per state repo on GitHub |
| Clarity/Scytl | Election night reporting with precinct-level XML results | ~1,000+ jurisdictions nationwide | Structured XML in ZIP files | Per-jurisdiction URLs (unstable across cycles) |
| VEST | Precinct boundaries (shapefiles) with vote counts as attributes | 50 states; odd-year elections for KY/LA/MS/VA (2015, 2019) | Shapefile (.shp/.dbf/.shx/.prj) | Harvard Dataverse download |
| Census | FIPS reference codes for states (50 + DC), counties (3,143), and places (31,980) | National, 2020 vintage | Pipe-delimited text files | census.gov direct download |
| FEC | Federal candidate master records with stable CAND_ID identifiers | All registered federal candidates; 2020 and 2022 loaded | Pipe-delimited TXT (cn.txt) in ZIP | fec.gov bulk download |

What Each Source Contributes to the Pipeline

MEDSL is the backbone. It covers all 50 states at precinct granularity for three recent even-year cycles. Approximately 41.5% of rows in the 2022 dataset have a blank dataverse column, indicating local races. Seven states have zero local race rows — see Coverage Matrix.

NC SBE provides the deepest single-state coverage: every contest on every ballot in every precinct across 10 election cycles. It is the only source that provides vote mode breakdowns (Election Day, early, absentee, provisional) for local races. It serves as the primary validation dataset for cross-source entity resolution.

OpenElections fills state-level gaps where MEDSL coverage is incomplete or where an alternative source view aids cross-validation. Schema varies by state, requiring per-state parser logic.

Clarity has the highest value for hyperlocal races (school board, city council, judicial) because it captures results directly from county ENR systems. It is not yet integrated into the pipeline; URL instability is the primary obstacle.

VEST provides the only precinct boundary geometries in the corpus, enabling geographic analysis. It also covers odd-year elections (2015, 2019) for states with off-cycle gubernatorial races — data that MEDSL’s loaded cycles do not include.

Census provides the authoritative FIPS code-to-name mappings used at L1 for geographic enrichment and cross-source geographic joins.

FEC provides stable candidate identifiers (CAND_ID) for federal candidates, used at L3 as reference anchors during entity resolution.

Cross-Source Overlap

Two source pairs have been compared quantitatively.

MEDSL + NC SBE (North Carolina, 2022 General)

Both sources report precinct-level results for the same 640 contests in North Carolina’s 2022 general election. Comparison results:

| Metric | Value |
|---|---|
| Contests with exact vote total match | 579 (90.5%) |
| Contests matching within 1% | 47 (7.3%) |
| Contests disagreeing by >1% | 14 (2.2%) |
| Contests with different candidate name formatting | 401 (63%) |

The 63% name formatting difference rate is the reason entity resolution exists. MEDSL reports SHANNON W BRAY (all caps, no period). NC SBE reports Shannon W. Bray (title case, period after initial). Same person, different string. This overlap is the primary test bed for the matching pipeline — see Cross-Source Reconciliation.

MEDSL + OpenElections (Florida, 2022 General)

Florida OpenElections data contains 6,013 “Registered Voters” rows (67.9% of non-candidate records), which are turnout metadata rows mixed into the results file. This overlap revealed the non-candidate row problem documented in Non-Candidate Records.

Source Priority Ranking

When multiple sources report results for the same contest, the pipeline applies a priority order to select the authoritative record:

| Priority | Source Type | Rationale | Examples |
|---|---|---|---|
| 1 | Certified state data | Published by the official election authority; legally authoritative | NC SBE |
| 2 | Academic curated | Cleaned and standardized by researchers with documented methodology | MEDSL, VEST |
| 3 | Community curated | Volunteer-driven; quality varies by state and contributor | OpenElections |
| 4 | Election night reporting | Often preliminary, not certified; URLs are unstable | Clarity |
| 5 | Reference only | Not election results; used for enrichment and cross-referencing | Census, FEC |

Priority 1 sources are preferred when available. In practice, NC SBE is the only certified state source currently loaded. For the remaining 49 states, MEDSL (priority 2) is the primary source. Lower-priority sources are retained in the record’s provenance for cross-validation, not discarded.

The priority ranking affects two pipeline decisions: which record becomes the canonical version at L4, and which confidence level is assigned. A record confirmed by two independent sources (e.g., MEDSL + NC SBE with matching vote totals) receives High confidence. A record from a single source receives Medium or Low depending on the source tier.
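Sketched as a rule, with tier numbers mirroring the table above; the exact rule set belongs to the pipeline's configuration, so treat this as a shape, not a spec.

```rust
enum Confidence { High, Medium, Low }

/// `agreeing_sources` counts independent sources reporting matching totals.
fn assign_confidence(source_tier: u8, agreeing_sources: u32) -> Confidence {
    if agreeing_sources >= 2 {
        Confidence::High // e.g. MEDSL + NC SBE with matching vote totals
    } else {
        match source_tier {
            1 | 2 => Confidence::Medium, // certified or academic, single source
            _ => Confidence::Low,        // community or ENR, single source
        }
    }
}
```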

Coverage Matrix

This chapter maps which sources cover which states and years. Use it to determine whether a specific state/year/level combination is available before querying.

MEDSL — 50 States, 3 Cycles

MEDSL provides precinct-level results for all 50 states plus DC across three even-year general election cycles. Each cycle is one CSV per state.

| Cycle | States | Approximate rows | Local race coverage |
|---|---|---|---|
| 2018 | 50 + DC | ~11.0M | Varies by state |
| 2020 | 50 + DC | ~13.2M | Varies by state |
| 2022 | 50 + DC | ~12.3M | 44 of 51 jurisdictions |

Seven states with zero local data in MEDSL 2022. These states have no rows with a blank dataverse column, meaning no local races were captured:

| State | FIPS |
|---|---|
| California | 06 |
| Iowa | 19 |
| Kansas | 20 |
| New Jersey | 34 |
| Pennsylvania | 42 |
| Tennessee | 47 |
| Wisconsin | 55 |

Local elections occur in all seven states. MEDSL’s curation process did not capture them for 2022. Coverage may differ in 2018 and 2020.

Odd-year data on Dataverse but not yet loaded. MEDSL publishes odd-year election data on Harvard Dataverse:

| Cycle | DOI | Status |
|---|---|---|
| 2015 | | Not loaded |
| 2017 | 10.7910/DVN/VNJAB1 | Not loaded |
| 2019 | 10.7910/DVN/2AJUII | Not loaded |
| 2021 | | Not loaded |

Odd-year elections cover gubernatorial races in VA, NJ, KY, LA, MS and municipal elections in many states. Loading these would fill a significant gap.

NC SBE — 1 State, 10 Cycles

NC SBE covers North Carolina exclusively, with precinct-level results for every contest on the ballot.

| Year | Election | Rows | Schema |
|---|---|---|---|
| 2024 | General | 233,511 | 15-column |
| 2022 | General | 171,901 | 15-column |
| 2020 | General | 257,722 | 15-column |
| 2018 | General | 183,724 | 15-column |
| 2016 | General | 252,827 | 15-column |
| 2014 | General | 223,977 | 15-column |
| 2012 | General | 208,921 | 14–15 column (different layout) |
| 2010 | General | 188,008 | 14–15 column (different layout) |
| 2008 | General | 233,141 | 14–15 column (different layout) |
| 2006 | General | 69,482 | 9-column (significantly different) |

All 10 cycles are downloaded. The 2014–2024 files share a stable schema and a single parser. The 2008–2012 files require a separate parser. The 2006 file requires a third.

OpenElections — ~8 States, Variable Coverage

OpenElections is community-curated. Coverage depends on volunteer effort per state. The following states have 2022 precinct-level general election data:

| State | 2022 precinct data | Earlier years |
|---|---|---|
| Florida | Yes | 2000–2020 |
| Georgia | Yes | 2004–2020 |
| Michigan | Yes | 2000–2020 |
| Ohio | Yes | 2000–2020 |
| Pennsylvania | Yes | 2000–2020 |
| Texas | Yes | 2000–2020 |
| North Carolina | Yes | 2008–2020 |
| Arizona | Partial | 2004–2020 |

Coverage for other states exists at county level or for federal races only. Check each state’s GitHub repository (openelections-data-{state}) for current status.

VEST — Shapefiles with Vote Counts

VEST publishes precinct-level shapefiles for all 50 states. We have loaded a subset for odd-year coverage:

| State | Year | Election type | Loaded |
|---|---|---|---|
| Kentucky | 2019 | General (Governor) | Yes |
| Louisiana | 2019 | General (Governor) | Yes |
| Mississippi | 2019 | General (Governor) | Yes |
| Virginia | 2019 | General (state legislature) | Yes |
| Kentucky | 2015 | General (Governor) | Yes |
| Louisiana | 2015 | General (Governor) | Yes |
| Mississippi | 2015 | General (Governor) | Yes |
| Virginia | 2015 | General (state legislature) | Yes |

VEST covers state-level races only (president, governor, US Senate, US House, state legislature). No local races.

Census and FEC — Reference Data

These are not election results. They provide reference identifiers used during pipeline enrichment.

| Source | Scope | Years | Records |
|---|---|---|---|
| Census county FIPS | National | 2020 | 3,143 |
| Census place FIPS | National | 2020 | 31,980 |
| Census state FIPS | National | 2020 | 56 |
| FEC candidate master | Federal candidates | 2020 | ~6,800 |
| FEC candidate master | Federal candidates | 2022 | ~6,600 |

Clarity/Scytl — Not Yet Integrated

Clarity ENR sites cover 1,000+ jurisdictions but are not yet in the pipeline. URLs are unstable across election cycles, making systematic acquisition difficult. See Clarity/Scytl ENR.

Combined Coverage Summary

| Dimension | Current status |
|---|---|
| States with any data | 50 + DC |
| Even-year general elections | 2018, 2020, 2022 |
| Odd-year elections | KY/LA/MS/VA 2015, 2019 (VEST only, state-level) |
| Deep single-state coverage | NC, 2006–2024 (10 cycles) |
| Total rows across all sources | ~42M |
| Local race coverage | 44 of 51 jurisdictions (MEDSL 2022) + NC (NC SBE) |
| Vote mode breakdowns | NC SBE (all contests), MEDSL (some states), Clarity (when integrated) |
| Turnout data | <5% of records populated |

Gap Analysis

Temporal gaps. No odd-year municipal election results are loaded. Cities like New York, Los Angeles, Houston, Philadelphia, and San Antonio hold elections in odd years. MEDSL publishes 2017 and 2019 data on Dataverse. Loading these would add coverage for the largest US cities.

State-level local gaps. Seven states have zero local race data in MEDSL 2022. OpenElections partially fills this for Pennsylvania. The remaining six (CA, IA, KS, NJ, TN, WI) require either Clarity integration or direct state portal downloads.

Primary elections. All loaded data is general election only. MEDSL tags primary results with stage = PRI but we have not loaded primary-specific files. NC SBE publishes primary results as separate files.

Runoff elections. Georgia, Louisiana, Texas, and other states hold runoff elections. These are partially captured in MEDSL (stage = RUN) but not systematically loaded.

What We Cover, What We Don’t, and Why

This page is an honest inventory of what the pipeline can and cannot do today. The status indicators mean:

  • ✅ — Functional and validated
  • ⚠️ — Partially implemented or not validated at scale
  • ❌ — Not yet supported

Status Table

| Capability | Status | Notes |
|---|---|---|
| Precinct-level results, all 50 states | ✅ | Via MEDSL 2018/2020/2022. 36.5M rows across three cycles. |
| NC deep temporal coverage | ✅ | NC SBE 2006–2024, 10 election cycles, 2.0M+ rows. Consistent 15-column schema from 2014 onward. |
| Federal race coverage | ✅ | President, US Senate, US House present in MEDSL for all states. FEC candidate master files available for cross-referencing. |
| State-level race coverage | ✅ | Governor, state legislature, AG, SOS present in MEDSL for all states. |
| FIPS geographic enrichment | ✅ | Census reference files loaded: 3,143 counties, 31,980 places, all 50 states + DC. 100% county FIPS match rate on MEDSL data. |
| Vote mode breakdowns | ✅ | NC SBE provides Election Day / One Stop / Absentee / Provisional for every contest. MEDSL provides mode breakdowns for some states (rows split by mode column). |
| Local race coverage | ⚠️ | 44 of 51 MEDSL jurisdictions have local race data (blank dataverse column) in 2022. Seven states — CA, IA, KS, NJ, PA, TN, WI — have zero local rows. |
| Cross-source validation | ⚠️ | Validated for NC only. MEDSL and NC SBE share 640 contests in 2022: 90.5% exact vote match, 7.3% within 1%, 2.2% disagree by >1%. No systematic cross-source validation for other states. |
| Entity resolution | ⚠️ | Four-tier cascade (exact → Jaro-Winkler → embedding → LLM) is designed and prototyped. Not yet validated at scale beyond NC test cases. |
| Office classification | ⚠️ | Four-tier classifier (keyword → regex → embedding → LLM) handles 62% of 8,387 unique office names via keywords. Remaining 38% require embedding or LLM tiers. 0.56% classified as “other” in NC testing. |
| Name decomposition | ⚠️ | Parses first/middle/last/suffix/nickname from MEDSL and NC SBE formats. Handles nicknames in quotes ("Steve") and parentheses ((Steve)). Not tested against all 50 states’ formatting conventions. |
| Turnout data | ❌ | registered_voters and ballots_cast populated for <5% of records. NC SBE has “Registered Voters” pseudo-contest rows. Most MEDSL state files do not include registration counts. |
| Odd-year elections | ❌ | MEDSL publishes 2017 and 2019 on Harvard Dataverse. VEST has KY/LA/MS/VA for 2015 and 2019. None loaded into our pipeline yet. |
| Ranked-choice voting | ❌ | Schema has no fields for RCV rounds. Maine and Alaska use RCV for federal races. NYC and other cities use it for local races. No timeline for support. |
| Demographic correlation | ❌ | Census FIPS join is ready (county-level). Census demographic data (ACS) not yet integrated. The join key exists; the demographic tables do not. |
| Real-time results | ❌ | Pipeline processes certified and official results only. Not designed for election night reporting. Clarity integration (which could provide semi-live data) is not yet implemented. |
| Party switching detection | ❌ | Requires entity resolution across election cycles, which depends on L3/L4 being operational at scale. |

Local Race Coverage Detail

The 44 jurisdictions with local data in MEDSL 2022 vary in depth. Some report thousands of local contests; others report only a handful. The seven states with zero local rows are not states without local elections — they are states where MEDSL’s curation did not capture local results for that cycle.

NC SBE fills the gap for North Carolina with complete local coverage: every contest on every ballot in every precinct in all 100 counties. For other states, the gap remains.

OpenElections provides supplemental local data for FL, GA, MI, OH, PA, and TX, but coverage is inconsistent across years and granularity levels.

What “Validated” Means

A capability marked ✅ means:

  1. The data is loaded and parsed without errors.
  2. The output has been spot-checked against the source.
  3. Where cross-source overlap exists, the numbers have been compared.

It does not mean the data is free of errors from the source. MEDSL’s votes column contains 12,782 non-integer values out of 12.3M rows (0.1%) in 2022. NC SBE has occasional data entry artifacts (e.g., a period after a middle name instead of a middle initial). These are source-level issues that the pipeline preserves and flags rather than silently corrects.
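A sketch of that preserve-and-flag posture for a field like MEDSL's votes column; the real quality-flag vocabulary is defined in the schema chapters.

```rust
/// Keep the raw value, parse when possible, and flag instead of coercing.
struct VoteField {
    raw: String,         // original source value, always preserved
    parsed: Option<u64>, // None for non-integer values (~0.1% in MEDSL 2022)
    flagged: bool,       // quality flag raised rather than silently "fixed"
}

fn parse_votes(raw: &str) -> VoteField {
    let parsed = raw.trim().parse::<u64>().ok();
    VoteField { raw: raw.to_owned(), flagged: parsed.is_none(), parsed }
}
```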

What “Not Validated at Scale” Means

Entity resolution and office classification work on NC test data. We have not run them against all 42M rows across all 50 states. The algorithms are designed; the compute has not been spent. When we do run at scale, we expect to discover new edge cases — office titles we haven’t seen, name formats we haven’t parsed, and match ambiguities we haven’t resolved.

This page will be updated as capabilities move from ⚠️ to ✅ or as new limitations are discovered.

MEDSL — MIT Election Data + Science Lab

The MIT Election Data + Science Lab publishes precinct-level election returns for all 50 states and the District of Columbia. The data is hosted on the Harvard Dataverse (electionscience collection) and mirrored on GitHub for recent cycles. It is the most complete single source of US election data available without a paywall or API key.

What MEDSL contains

MEDSL provides one CSV or tab-delimited file per state per election cycle. Each row represents one candidate in one precinct for one vote mode (election day, absentee, early voting, provisional, etc.). To obtain the total votes for a candidate in a precinct, you must sum across all rows for that candidate and precinct.

Available election cycles:

| Cycle | Location | Format | DOI |
|---|---|---|---|
| 2022 | GitHub | CSV, one ZIP per state | |
| 2020 | Harvard Dataverse | CSV/TAB, one file per state | 10.7910/DVN/NT66Z3 |
| 2018 | Harvard Dataverse | CSV/TAB, one file per state | 10.7910/DVN/NVQYMG |
| 2016 | Harvard Dataverse | CSV/TAB | 10.7910/DVN/NH5S2I |
| 2019 (odd-year) | Harvard Dataverse | CSV/TAB | 10.7910/DVN/2AJUII |
| 2017 (odd-year) | Harvard Dataverse | CSV/TAB | 10.7910/DVN/VNJAB1 |

We have downloaded and loaded 2018, 2020, and 2022. Together they contain approximately 36.5 million rows.

Schema

MEDSL files have 25 columns. The delimiter is comma for most states but tab for some; auto-detection handles this.

| Column | Type | Description | Example |
|---|---|---|---|
| precinct | string | Precinct identifier from the source | 12-13 |
| office | string | Contest name, ALL CAPS | CABARRUS COUNTY SCHOOLS BOARD OF EDUCATION |
| party_detailed | string | Full party name | NONPARTISAN |
| party_simplified | string | Normalized party | NONPARTISAN |
| mode | string | Vote type for this row | ELECTION DAY |
| votes | integer | Vote count for this mode | 79 |
| candidate | string | Candidate name, ALL CAPS | GREG MILLS |
| district | string | District identifier or blank | STATEWIDE, 003, `` |
| dataverse | string | Race level tag — see below | STATE, SENATE, HOUSE, `` |
| stage | string | Election stage | GEN |
| special | string | Special election flag | FALSE |
| writein | string | Write-in flag | FALSE |
| date | date | Election date | 2022-11-08 |
| year | integer | Election year | 2022 |
| county_name | string | County name, ALL CAPS | CABARRUS |
| county_fips | string | 5-digit county FIPS | 37025 |
| jurisdiction_name | string | Jurisdiction name | CABARRUS |
| jurisdiction_fips | string | Jurisdiction FIPS | 37025 |
| state | string | Full state name | NORTH CAROLINA |
| state_po | string | 2-letter postal code | NC |
| state_fips | string | 2-digit state FIPS | 37 |
| state_cen | string | Census state code | 56 |
| state_ic | string | ICPSR state code | 47 |
| readme_check | string | Data quality flag | FALSE |
| magnitude | integer | Number of seats in this contest | 3 |

The dataverse column and local races

MEDSL tags each row with a dataverse value indicating which Harvard Dataverse sub-collection the race belongs to:

| Value | Meaning | Example offices |
|---|---|---|
| PRESIDENT | Presidential race | President |
| SENATE | US Senate | US Senate |
| HOUSE | US House | US House District 7 |
| STATE | State-level offices | Governor, State Senate, Attorney General |
| (blank) | Everything else — including all local races | County Commissioner, School Board, Sheriff, Soil and Water |

Local races are identified by a blank dataverse column, not by the value LOCAL. This is a frequent source of confusion. In the 2022 North Carolina file, 385,260 of 684,712 rows (56%) have a blank dataverse value. These rows contain school board races, county commissioner races, soil and water conservation districts, district court judges, mayors, city councils, and other local offices.

In the full 2022 national dataset (12.3 million rows), approximately 5.1 million rows (41.5%) have a blank dataverse value.

The mode column and vote totals

Each row in MEDSL represents one candidate’s votes for one vote mode. A single candidate in a single precinct may have multiple rows:

12-13,US SENATE,LIBERTARIAN,LIBERTARIAN,ELECTION DAY,47,SHANNON W BRAY,...
12-13,US SENATE,LIBERTARIAN,LIBERTARIAN,ABSENTEE BY MAIL,5,SHANNON W BRAY,...
12-13,US SENATE,LIBERTARIAN,LIBERTARIAN,EARLY VOTING,38,SHANNON W BRAY,...
12-13,US SENATE,LIBERTARIAN,LIBERTARIAN,PROVISIONAL,0,SHANNON W BRAY,...

To get Shannon W. Bray’s total votes in precinct 12-13, sum the votes column across all modes: 47 + 5 + 38 + 0 = 90.

Some states include a TOTAL mode row that pre-sums the other modes. Some do not. Your aggregation logic must handle both cases. If TOTAL rows are present, either use them directly and skip the individual mode rows, or skip TOTAL and sum the modes yourself. Do not double-count.

Common mode values: ELECTION DAY, ABSENTEE BY MAIL, EARLY VOTING, ONE STOP, PROVISIONAL, TOTAL.
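
A minimal aggregation sketch in Python, assuming rows already parsed into dicts keyed by the MEDSL column names above (precinct_totals is a hypothetical helper, not pipeline code):

from collections import defaultdict

def precinct_totals(rows):
    """Sum votes per (candidate, precinct), preferring TOTAL rows when present."""
    summed = defaultdict(int)   # totals built from individual mode rows
    pre_summed = {}             # TOTAL rows, for states that include them
    for row in rows:
        key = (row["candidate"], row["precinct"])
        if row["mode"] == "TOTAL":
            pre_summed[key] = int(row["votes"])
        else:
            summed[key] += int(row["votes"])
    # A TOTAL row, when present, wins; otherwise use the summed modes.
    # Never add TOTAL on top of the mode rows: that double-counts.
    return {k: pre_summed.get(k, summed[k]) for k in summed.keys() | pre_summed.keys()}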

Name formatting

MEDSL candidate names are ALL CAPS with no periods after initials:

| MEDSL | Actual name |
|---|---|
| SHANNON W BRAY | Shannon W. Bray |
| VICTORIA P PORTER | Victoria P. Porter |
| MICHAEL "STEVE" HUBER | Michael “Steve” Huber |
| ROBERT VAN FLETCHER JR | Robert Van Fletcher, Jr. |
| LM "MICKEY" SIMMONS | L.M. “Mickey” Simmons |

Nicknames appear in double quotes within the name string. Suffixes (JR, SR, III) appear without a preceding comma.

Write-in candidates are aggregated into a single row with candidate = WRITEIN and writein = TRUE.

Non-candidate rows

Some state files include metadata rows that are not candidate results:

| office value | Meaning | Action |
|---|---|---|
| REGISTERED VOTERS | Voter registration count | Extract as turnout metadata, do not treat as a contest |
| BALLOTS CAST | Ballots cast count | Extract as turnout metadata |
| BALLOTS CAST - TOTAL | Same | Extract |
| BALLOTS CAST - BLANK | Blank ballot count | Extract |
| STRAIGHT PARTY | Straight-ticket party vote | Typically excluded from contest analysis |
| OVER VOTES | Overvote count | Extract as quality metadata |
| UNDER VOTES | Undervote count | Extract as quality metadata |

These rows are present in some states and absent in others. Florida OpenElections data contains 6,013 “Registered Voters” rows — 67.9% of all records classified as “other” in initial processing.

Known coverage gaps

MEDSL 2022 contains local race data for 44 of 51 jurisdictions. Seven states have zero rows with a blank dataverse column:

| State | Likely reason |
|---|---|
| California | Local results published separately by each county; not aggregated by MEDSL |
| Iowa | Local results not included in the MEDSL state file |
| Kansas | Same |
| New Jersey | Same |
| Pennsylvania | Same |
| Tennessee | Same |
| Wisconsin | Same |

This does not mean these states lack local elections. It means MEDSL’s curation process did not capture them for 2022. Coverage may differ in other years.

The votes column type

The votes column is predominantly integer, but some state files contain non-integer values. We observed:

  • Floating-point values (likely vote shares erroneously placed in the votes column)
  • Asterisks (*) indicating suppressed data
  • Empty strings

Parse with TRY_CAST or equivalent. In our load of the full 2022 dataset, 12,782 rows had non-integer votes values out of 12.3 million total (0.1%).
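
In Python, a TRY_CAST equivalent might look like the sketch below (parse_votes is a hypothetical helper; the float branch assumes the non-integer values behave as described above):

def parse_votes(raw: str) -> int | None:
    """Return a vote count, or None for values that cannot be trusted."""
    s = raw.strip()
    if not s or s == "*":          # empty string or suppressed value
        return None
    try:
        return int(s)
    except ValueError:
        pass
    try:
        f = float(s)               # e.g. a vote share that leaked into the column
        return int(f) if f.is_integer() else None
    except ValueError:
        return None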

Download

# 2022 — All 51 files from GitHub
mkdir -p local-data/sources/medsl/2022
for state in ak al ar az ca co ct dc de fl ga hi ia id il in ks ky la \
            ma md me mi mn mo ms mt nc nd ne nh nj nm nv ny oh ok or pa \
            ri sc sd tn tx ut va vt wa wi wv wy; do
  curl -L -o "local-data/sources/medsl/2022/2022-${state}-local-precinct-general.zip" \
    "https://raw.githubusercontent.com/MEDSL/2022-elections-official/main/individual_states/2022-${state}-local-precinct-general.zip"
done

# Unzip
for f in local-data/sources/medsl/2022/*.zip; do
  unzip -o "$f" -d "${f%.zip}"
done

# 2020 — NC example from Harvard Dataverse (file ID 6100444)
mkdir -p local-data/sources/medsl/2020
curl -L -o local-data/sources/medsl/2020/2020-nc-precinct-general.csv \
  "https://dataverse.harvard.edu/api/access/datafile/6100444"

File IDs for all 51 jurisdictions in 2020 and 2018 are documented in the download instructions.

Cross-source overlap

For the 2022 North Carolina general election, MEDSL and NC SBE share 640 contests where both sources report results:

  • 579 (90.5%) have exactly matching vote totals
  • 47 (7.3%) match within 1%
  • 14 (2.2%) disagree by more than 1%
  • 401 (63%) have different candidate name formatting between the two sources

This overlap is the basis for our entity resolution validation. See Cross-Source Reconciliation.

NC SBE — North Carolina State Board of Elections

The North Carolina State Board of Elections publishes precinct-level results for every contest on the ballot — federal, state, and local — with vote mode breakdowns, for every election cycle back to at least 2006. It is the most complete single-state local election dataset we have found.

What NC SBE contains

NC SBE provides one tab-delimited text file per election, delivered as a ZIP archive from an S3 bucket. Each row represents one candidate in one precinct for one contest. Vote mode totals (Election Day, early voting, absentee by mail, provisional) appear as separate columns on each row, not as separate rows. This means a single row gives you the full vote breakdown for one candidate in one precinct — unlike MEDSL, which splits each vote mode into its own row.

Coverage:

| Year | File | Rows | Notes |
|---|---|---|---|
| 2024 | results_pct_20241105.txt | 233,511 | Presidential general |
| 2022 | results_pct_20221108.txt | 171,901 | Midterm general |
| 2020 | results_pct_20201103.txt | 257,722 | Presidential general |
| 2018 | results_pct_20181106.txt | 183,724 | Midterm general |
| 2016 | results_pct_20161108.txt | 252,827 | Presidential general |
| 2014 | results_pct_20141104.txt | 223,977 | Midterm general |
| 2012 | results_pct_20121106.txt | 208,921 | Different schema — see below |
| 2010 | results_pct_20101102.txt | 188,008 | Different schema |
| 2008 | results_pct_20081104.txt | 233,141 | Different schema |
| 2006 | results_pct_20061107.txt | 69,482 | Significantly different schema (9 columns) |

We have downloaded and loaded all 10 cycles. The 2014–2024 files share a stable 15-column format. Earlier files require separate parsers.

Schema (2014–2024)

Files from 2014 onward are tab-delimited with 15 columns. There is no quoting convention; values do not contain tabs.

| Column | Type | Description | Example |
|---|---|---|---|
| County | string | County name, ALL CAPS | COLUMBUS |
| Election Date | string | Date as MM/DD/YYYY | 11/08/2022 |
| Precinct | string | Precinct identifier | P17 |
| Contest Group ID | string | Internal contest grouping number | 7 |
| Contest Type | string | S = statewide, C = county/local | C |
| Contest Name | string | Full contest name, ALL CAPS | COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 |
| Choice | string | Candidate name, Title Case | Timothy Lance |
| Choice Party | string | Party abbreviation or blank | REP, DEM, `` |
| Vote For | integer | Maximum selections allowed | 1 |
| Election Day | integer | Election day votes | 136 |
| One Stop | integer | Early voting (in-person) votes | 159 |
| Absentee by Mail | integer | Mail absentee votes | 7 |
| Provisional | integer | Provisional ballot votes | 1 |
| Total Votes | integer | Sum of all vote modes | 303 |
| Real Precinct | string | Y = physical precinct, N = aggregation group | Y |

The Contest Type column

The Contest Type field distinguishes local from statewide races:

  • C — county/local contests: school board, county commissioner, city council, soil and water, local judicial races, bond referendums
  • S — statewide contests: US Senate, US House, Governor, state legislature, statewide judicial races

For local election analysis, filter to Contest Type = 'C'. In the 2022 file, this yields 919 distinct contests across 100 counties.

Vote mode columns

NC SBE is the only source in our corpus that provides vote mode breakdowns as columns for every contest, including local races. The four modes are:

| Column | Meaning |
|---|---|
| Election Day | Votes cast in person on election day |
| One Stop | Early in-person voting (North Carolina’s term for early voting) |
| Absentee by Mail | Absentee ballots returned by mail |
| Provisional | Provisional ballots accepted during canvass |

Total Votes is the sum of the four mode columns. We have verified this holds across all rows in the 2014–2024 files.

The vote mode data enables analysis that most sources cannot support: comparing early voting patterns to election day patterns at the precinct level for local races. Three of our nine data sources provide any vote mode breakdown at all (NC SBE, Clarity, and MEDSL for some states). NC SBE is the only one that provides it consistently for all contests.
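
The invariant is cheap to re-check on load. A sketch, assuming the tab-delimited 2014–2024 files with the column names documented above (total_votes_violations is a hypothetical helper; UTF-8 encoding is an assumption):

import csv

MODE_COLUMNS = ["Election Day", "One Stop", "Absentee by Mail", "Provisional"]

def total_votes_violations(path):
    """Return rows where Total Votes differs from the sum of the four modes."""
    bad = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if sum(int(row[m]) for m in MODE_COLUMNS) != int(row["Total Votes"]):
                bad.append(row)
    return bad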

Non-contest rows

NC SBE data includes rows that are not candidate results. These appear as entries in the Choice column within contests that are not real races:

| Contest Name pattern | Choice value | What it is |
|---|---|---|
| Contains “Registered Voters” | (varies) | Voter registration count for the precinct |
| Any contest | Write-In (Miscellaneous) | Aggregated write-in votes |
| Any contest | Over Votes | Overvote count |
| Any contest | Under Votes | Undervote count |

The “Registered Voters” rows deserve special attention. They appear as a contest named “Registered Voters” with a single Choice entry where Total Votes contains the number of registered voters in that precinct. This is turnout metadata, not a contest result.

In our prototype pipeline, we extract the registered voter count from these rows into a turnout object, then exclude the row from contest analysis. This is how we backfill the turnout.registered_voters field that is otherwise unpopulated for most sources.
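
A sketch of that extraction step (split_turnout_rows is a hypothetical helper; it assumes rows parsed into dicts with the 2014–2024 column names):

def split_turnout_rows(rows):
    """Separate "Registered Voters" pseudo-contests from real contest rows."""
    turnout, contests = {}, []
    for row in rows:
        if "REGISTERED VOTERS" in row["Contest Name"].upper():
            # Total Votes holds the registration count for this precinct.
            turnout[(row["County"], row["Precinct"])] = int(row["Total Votes"])
        else:
            contests.append(row)
    return turnout, contests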

Write-in rows with the suffix (Write-In) in the candidate name (e.g., Ronnie Strickland (Write-In)) are distinct from the aggregated Write-In (Miscellaneous) row. The named write-in rows report votes for a specific write-in candidate. The (Miscellaneous) row reports the total for all unnamed write-ins.

The Real Precinct column

Real Precinct = Y indicates a physical voting precinct with a defined geographic boundary. Real Precinct = N indicates an aggregation group — typically used for absentee-only tallies or provisional ballot pools that cannot be assigned to a specific precinct.

For geographic analysis (mapping, precinct-level comparison), filter to Real Precinct = 'Y'. For total vote counts, include both.

Candidate name formatting

NC SBE candidate names are Title Case with periods after initials and commas before suffixes:

| NC SBE | Components |
|---|---|
| Timothy Lance | first=Timothy, last=Lance |
| Shannon W. Bray | first=Shannon, middle=W, last=Bray |
| Robert Van Fletcher, Jr. | first=Robert, middle=Van, last=Fletcher, suffix=Jr. |
| Michael (Steve) Huber | first=Michael, nickname=Steve, last=Huber |
| William Irvin. Enzor III | first=William, middle=Irvin, last=Enzor, suffix=III |
| Patricia (Pat) Cotham | first=Patricia, nickname=Pat, last=Cotham |

Nicknames appear in parentheses. This differs from MEDSL, which uses double quotes. The period after “Irvin.” in “William Irvin. Enzor III” appears to be a data entry artifact — the period belongs after the middle initial, not after the full middle name. These inconsistencies are present in the source data and must be handled during name decomposition at L1.

Schema changes across years

The 2014–2024 files share the 15-column schema documented above. Earlier files differ:

2008–2012: The schema has 14–15 columns but with different names and ordering. Contest Type is the third column (not the fifth). Fields are comma-delimited with quote wrapping. The district column was added later. Vote mode columns use slightly different names in some years.

2006: Significantly different. Only 9 columns: county, election_dt, precinct_abbrv, precinct, contest_name, name_on_ballot, party_cd, ballot_count, FTP_Date. No vote mode breakdown. No Contest Type field. All column names are lowercase with underscores.

We currently parse 2014–2024 with one parser and treat 2006–2012 as a separate parser target. The 2008–2012 files contain local races (they have Contest Type = C) but require different column mapping. The 2006 file requires more investigation to determine whether it includes local races.

Why NC SBE matters

NC SBE is not the largest dataset in our corpus (MEDSL has far more rows). Its value is in three properties that no other source provides simultaneously:

  1. Complete local coverage. Every contest on every ballot in every precinct in every county — school board, soil and water, county commissioner, municipal, judicial, and bond referendums. MEDSL has gaps in local race coverage for some states. NC SBE has none for North Carolina.

  2. Vote mode breakdowns for local races. The four-column mode breakdown (Election Day, One Stop, Absentee, Provisional) is present for every contest, including hyperlocal races like “Whiteville City Schools Board of Education District 01.”

  3. Ten-year temporal depth. Six clean election cycles (2014–2024) with a consistent schema. This enables career tracking, competitiveness trend analysis, and temporal chain construction across a decade of local elections. Combined with the 2008–2012 files (once parsed), the coverage extends to nearly 20 years.

The combination of these three properties makes NC SBE the primary validation dataset for the pipeline. When we test cross-source entity resolution, we compare MEDSL NC against NC SBE NC — 640 overlapping contests with 90.5% exact vote total agreement and 63% candidate name formatting differences. When we test temporal chains, we track candidates across NC SBE’s six-cycle span.

Download

The URL pattern is:

https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/{YYYY_MM_DD}/results_pct_{YYYYMMDD}.zip

mkdir -p local-data/sources/ncsbe/{2014,2016,2018,2020,2022,2024}

# 2024
curl -L -o local-data/sources/ncsbe/2024/results_pct_20241105.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2024_11_05/results_pct_20241105.zip"

# 2022
curl -L -o local-data/sources/ncsbe/2022/results_pct_20221108.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip"

# 2020
curl -L -o local-data/sources/ncsbe/2020/results_pct_20201103.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2020_11_03/results_pct_20201103.zip"

# 2018
curl -L -o local-data/sources/ncsbe/2018/results_pct_20181106.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2018_11_06/results_pct_20181106.zip"

# 2016
curl -L -o local-data/sources/ncsbe/2016/results_pct_20161108.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2016_11_08/results_pct_20161108.zip"

# 2014
curl -L -o local-data/sources/ncsbe/2014/results_pct_20141104.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2014_11_04/results_pct_20141104.zip"

# Unzip all
for d in local-data/sources/ncsbe/*/; do
  (cd "$d" && unzip -o ./*.zip)  # subshell: no cd back needed, even if unzip fails
done

Older cycles (2008–2012) follow the same URL pattern. The 2006 file uses a different path structure. The full election calendar is at ncsbe.gov/results-data.

OpenElections

The OpenElections project is a volunteer-driven effort to collect, clean, and publish certified US election results as CSV files. Data is organized into per-state GitHub repositories under the openelections organization.

What OpenElections provides

Precinct-level and county-level election results, parsed from official state and county sources into CSV format. Coverage varies by state — some repositories have data back to 2000, others have only one or two recent cycles. Approximately 8 states have precinct-level 2022 general election data suitable for aggregation.

States with usable precinct-level data for recent cycles include FL, GA, MI, OH, PA, and TX. Each state repository is independent, maintained by different volunteers, with different levels of completeness.

Repository structure

Each state has its own repo:

  • openelections-data-fl — Florida
  • openelections-data-ga — Georgia
  • openelections-data-pa — Pennsylvania
  • etc.

Files follow a naming convention that encodes the election date, state, type, and granularity:

{YYYYMMDD}__{state}__{type}__{granularity}.csv

Examples:

| Filename | Meaning |
|---|---|
| 20221108__fl__general__precinct.csv | 2022 FL general, precinct-level |
| 20220510__pa__primary__county.csv | 2022 PA primary, county-level |
| 20201103__ga__general__precinct.csv | 2020 GA general, precinct-level |

Some repos include both raw and cleaned versions. Files with raw in the name are unprocessed source dumps. Prefer the cleaned files, i.e. those without raw in the name.
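
The convention is regular enough to parse mechanically. A sketch covering the four-segment form shown above (FILENAME_RE and parse_filename are illustrative names; some repos add extra segments such as a county name, which this pattern deliberately rejects):

import re

# {YYYYMMDD}__{state}__{type}__{granularity}.csv
FILENAME_RE = re.compile(
    r"^(?P<date>\d{8})__(?P<state>[a-z]{2})__(?P<type>\w+)__(?P<granularity>\w+)\.csv$"
)

def parse_filename(name):
    m = FILENAME_RE.match(name)
    return m.groupdict() if m else None

parse_filename("20221108__fl__general__precinct.csv")
# {'date': '20221108', 'state': 'fl', 'type': 'general', 'granularity': 'precinct'}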

Core schema (7+ columns)

The project does not enforce a single schema. Most files share a 7-column core:

| Column | Type | Description |
|---|---|---|
| county | string | County name |
| precinct | string | Precinct name or code |
| office | string | Office contested |
| district | string | District number or name (may be blank) |
| party | string | Party abbreviation |
| candidate | string | Candidate name |
| votes | integer | Vote count |

Additional columns appear in some states:

  • election_day, absentee, provisional, early_voting — vote mode breakdowns
  • winner — boolean or Y/N flag
  • total_votes — aggregate across modes

Column names and ordering differ across states and sometimes across files within the same state repo.

Schema variation by state

| State | Extra columns | Name format | Notes |
|---|---|---|---|
| FL | election_day, absentee, early_voting | Last, First | Includes “Registered Voters” metadata rows (6,013 in 2022) |
| GA | total_votes | First Last | Precinct names vary by county |
| PA | none beyond core | First Last | Some files county-level only |
| OH | early_voting, absentee | Last, First | Inconsistent across counties |

Non-candidate rows

Florida files include metadata rows that are not contest results:

| office value | Meaning |
|---|---|
| Registered Voters | Voter registration count — 67.9% of “other” rows in initial FL processing |
| Ballots Cast | Turnout count |

These must be extracted as turnout metadata during L1 parsing, not treated as contests.

Access method

Data is accessed by cloning the per-state Git repository:

git clone https://github.com/openelections/openelections-data-fl.git
git clone https://github.com/openelections/openelections-data-ga.git
git clone https://github.com/openelections/openelections-data-pa.git

There is no bulk download endpoint. Each state repo must be cloned individually.

Data quality

Quality varies by state and volunteer. Known issues:

  • No standard schema. Column names differ across states and files. Parsers must handle each state separately.
  • Candidate name format varies. Some states use Last, First. Others use First Last. Suffixes and middle names are inconsistent.
  • Encoding. Most files are UTF-8. Some older files contain Latin-1 or Windows-1252 characters.
  • Duplicates. Some repos contain both raw and cleaned versions of the same election. Ingest only one to avoid double-counting.
  • Incomplete coverage. A state repo existing does not mean it has precinct-level data for all cycles.

Cross-source overlap

OpenElections FL overlaps with MEDSL FL for the 2022 general election. This overlap is useful for validation but has not been systematically compared at the same depth as the MEDSL–NC SBE comparison (640 contests, 90.5% vote match). The FL overlap is a planned validation target.

Value in the pipeline

OpenElections fills gaps where MEDSL coverage is thin or where vote mode breakdowns are available. Florida’s vote mode columns (election day, absentee, early voting) provide signal that MEDSL’s Florida file lacks. The community-curated nature means data may appear for states or cycles before MEDSL publishes its cleaned version.

The tradeoff is consistency: every state requires its own parser branch at L1.

Clarity/Scytl ENR

Clarity (now part of Scytl / CivicPlus) powers Election Night Reporting (ENR) websites for over 1,000 US jurisdictions — counties, cities, and some state-level election authorities. Each jurisdiction runs its own Clarity instance, publishing structured results in XML and JSON formats.

What Clarity provides

Clarity sites are the primary source for local race results that no other source captures: school board, city council, municipal judge, fire district commissioner, water board. Many jurisdictions publish precinct-level results with vote mode breakdowns (Election Day, early, absentee, provisional). Results appear on election night and typically remain available for weeks or months before being replaced by the next election cycle.

Data format

Results are distributed as XML inside ZIP archives. The XML follows a hierarchical structure:

| Element | Description |
|---|---|
| <ElectionResult> | Root element. Contains election metadata (name, date, jurisdiction). |
| <Contest> | One per race. Attributes include contest name, vote-for count, total ballots. |
| <Choice> | One per candidate or ballot measure option within a contest. Includes name, party, total votes. |
| <VoteType> | Breakdown by vote method within each choice. Election Day, absentee, early, provisional. |
| <Precinct> | Precinct-level results when the jurisdiction publishes at that granularity. |

A single detailxml.zip file for a medium-sized county (50 precincts, 30 contests) is typically 200 KB–2 MB uncompressed.

URL structure

Clarity ENR sites follow a predictable URL pattern:

https://results.enr.clarityelections.com/{state}/{jurisdiction}/{electionID}/

The underlying data feeds are at:

| Endpoint | Content |
|---|---|
| reports/detailxml.zip | Full precinct-level XML results |
| json/en/summary.json | Lightweight JSON summary (no precinct detail) |
| Web02/en/summary.html | Human-readable results page |

Example for Wake County, NC:

https://results.enr.clarityelections.com/NC/Wake/115545/reports/detailxml.zip

The {electionID} is a numeric identifier assigned per election. It is not sequential and cannot be predicted.
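
Given a known election ID, acquisition is a single request. A minimal sketch of the eventual L0 step (detail_xml_url is a hypothetical helper; the ID must come from discovery, and the Wake County value below is the example from the text):

import urllib.request

def detail_xml_url(state, jurisdiction, election_id):
    return (f"https://results.enr.clarityelections.com/"
            f"{state}/{jurisdiction}/{election_id}/reports/detailxml.zip")

# Archive promptly: these URLs expire when the next election is configured.
url = detail_xml_url("NC", "Wake", 115545)
with urllib.request.urlopen(url) as resp:
    zip_bytes = resp.read()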

Coverage

  • Jurisdictions: ~1,000+ counties and municipalities across ~30 states
  • Election types: general, primary, runoff, special, municipal
  • Granularity: precinct-level with vote type breakdowns (most jurisdictions)
  • Temporal: current election cycle only; prior results are removed when new elections are configured

Why Clarity matters

Clarity is the highest-value source for local races that do not appear in MEDSL, OpenElections, or state portals. A county’s Clarity site may be the only machine-readable source for races like:

  • School board (non-partisan, no state-level reporting)
  • City council (municipal elections, often off-cycle)
  • District court judge retention
  • Bond referendums and local ballot measures

Key problems

URLs are unstable. The {electionID} changes every cycle. Old results are removed without redirect or archive. There is no central index of active Clarity instances. Discovery requires crawling county election office websites for links.

No published XML schema. The XML structure is consistent in practice but not formally specified. Minor variations exist across Clarity software versions. Field names and nesting can differ between jurisdictions.

Candidate names may embed party. Some jurisdictions format candidate names as John Smith (REP) rather than using a separate party field. This requires parsing at L1.

Ephemeral availability. Results may disappear weeks after certification when the jurisdiction configures the site for the next election. L0 acquisition must happen promptly after each election.

Integration status

Clarity is not yet integrated in our pipeline. The source module (src/sources/clarity.rs) defines the XML schema and URL patterns but does not implement parsing or acquisition. Integration is blocked on building a jurisdiction discovery mechanism and a scheduled acquisition process that captures results before URLs expire.

When integrated, Clarity will feed into L0 as ZIP archives with XML contents, parsed at L1 into the unified schema. The hierarchical Contest → Choice → VoteType structure maps cleanly to our ContestKind model.

VEST — Voting and Election Science Team

The Voting and Election Science Team (VEST) publishes precinct-level election shapefiles for all 50 states. Each shapefile pairs precinct geographic boundaries with vote counts encoded as attribute columns. The data is archived on the Harvard Dataverse.

What VEST provides

VEST’s primary value is twofold: geographic precinct boundaries (polygons) and odd-year election coverage. No other source in our corpus provides precinct geometries, and MEDSL’s loaded data currently covers only even years.

We have downloaded VEST shapefiles for KY, LA, MS, and VA covering the 2015 and 2019 odd-year elections. These contain state-level races (governor, attorney general, state legislature) but not local races.

Data format

Each state-year dataset is a ZIP archive containing a standard ESRI shapefile bundle:

| File | Purpose |
|---|---|
| .shp | Geometry (precinct boundary polygons) |
| .dbf | Attribute table (vote counts, FIPS codes) |
| .shx | Spatial index |
| .prj | Coordinate reference system definition |
| .cpg | Character encoding declaration |

Reading requires a spatial data library. In Python, geopandas.read_file() handles the full bundle. In Rust, the shapefile crate reads .shp/.dbf pairs.

Column encoding convention

VEST encodes election metadata into column names using a compact format:

{stage}{YY}{office}{party}{surname}
| Component | Values | Examples |
|---|---|---|
| Stage | G (general), P (primary), R (runoff) | G |
| Year | Two-digit year | 20, 19, 15 |
| Office | PRE (President), USS (US Senate), USH (US House), GOV (Governor), SOS (Sec. of State), AG (Attorney General), LTG (Lt. Governor) | PRE |
| Party | R (Republican), D (Democrat), L (Libertarian), G (Green), O (Other) | R |
| Surname | Abbreviated (typically 3 chars) | TRU, BID |

Decoded examples

| Column | Stage | Year | Office | Party | Candidate |
|---|---|---|---|---|---|
| G20PRERTRU | General | 2020 | President | Republican | Trump |
| G20PREDBID | General | 2020 | President | Democrat | Biden |
| G19GOVDBED | General | 2019 | Governor | Democrat | Beshear (KY) |
| G15GOVDEDW | General | 2015 | Governor | Democrat | Edwards (LA) |
| G18GOVDABO | General | 2018 | Governor | Democrat | Abrams (GA) |
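
A decoder sketch for this convention (the regex and decode_column are illustrative; the office alternation covers only the codes documented above, and truncated surnames still need the documentation file in each ZIP):

import re

# {stage}{YY}{office}{party}{surname}, e.g. G20PRERTRU
COLUMN_RE = re.compile(
    r"^(?P<stage>[GPR])(?P<year>\d{2})"
    r"(?P<office>PRE|USS|USH|GOV|SOS|AG|LTG)"
    r"(?P<party>[RDLGO])(?P<surname>[A-Z]+)$"
)

def decode_column(name):
    """Decode a VEST vote-count column; returns None for geography columns."""
    m = COLUMN_RE.match(name)
    return m.groupdict() if m else None

decode_column("G19GOVDBED")
# {'stage': 'G', 'year': '19', 'office': 'GOV', 'party': 'D', 'surname': 'BED'}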

Attribute table structure

The .dbf file contains both geographic identifiers and vote count columns:

| Column pattern | Description |
|---|---|
| STATEFP20 | 2-digit state FIPS code |
| COUNTYFP20 | 3-digit county FIPS code |
| VTDST20 | Voting tabulation district (precinct) code |
| NAME20 | Human-readable precinct name |
| ALAND20 | Land area in square meters |
| AWATER20 | Water area in square meters |
| G20PRE* | Vote count columns (one per candidate) |

Vote values are raw integer counts. Each row is one precinct.

dBASE column name truncation

The .dbf format (dBASE III) limits column names to 10 characters. This truncation creates ambiguity:

  • G20USSRPER could be Perdue or Perry
  • G20USHDWIL could be Williams, Wilson, or Wilkins

VEST documentation files (included in each ZIP) provide a column-to-candidate mapping. These must be consulted to resolve truncated names.

Coverage in our pipeline

| State | Year | Election type | Races |
|---|---|---|---|
| KY | 2019 | Governor, AG, SOS, state legislature | State-level only |
| LA | 2015, 2019 | Governor, state legislature | State-level only |
| MS | 2019 | Governor, AG, state legislature | State-level only |
| VA | 2015, 2019 | Governor, state legislature | State-level only |

These four states hold odd-year elections, which MEDSL has on Dataverse but which we have not yet loaded from that source. VEST fills the gap for state-level races in these cycles.

Limitations

No local races. VEST encodes statewide and federal contests only. County commissioner, school board, sheriff, and other local offices are not present. For local race coverage, use MEDSL or state-specific sources.

Large file sizes. Individual state shapefiles range from 50 MB to 500+ MB. The geometry data dominates file size; vote counts are a small fraction.

Precinct boundary instability. Redistricting changes precinct boundaries between election cycles. A precinct polygon from 2020 may not correspond to the same geographic area in 2022. Cross-year geographic comparisons require spatial intersection, not ID matching.

Requires spatial tooling. Unlike CSV sources that can be read with any text processor, shapefiles require geopandas (Python) or the shapefile crate (Rust). This adds a dependency that other sources do not.

Usage in the pipeline

VEST data enters at L0 as the raw shapefile ZIP. At L1, the column encoding is decoded to extract year, office, party, and candidate surname. Vote counts are pivoted from wide format (one column per candidate) to long format (one row per candidate per precinct) to match the unified schema.

The geographic boundaries are preserved as sidecar geometry files but are not embedded into the JSONL record stream. They are available for spatial joins and map rendering but are not part of the core election result schema.

Download

VEST datasets are available from the Harvard Dataverse. Each state-year combination has its own DOI. Example for Kentucky 2019:

mkdir -p local-data/sources/vest/ky/2019
curl -L -o local-data/sources/vest/ky/2019/ky_2019.zip \
  "https://dataverse.harvard.edu/api/access/dataset/:persistentId/?persistentId=doi:10.7910/DVN/XXXXXX"
unzip local-data/sources/vest/ky/2019/ky_2019.zip -d local-data/sources/vest/ky/2019/

Consult the VEST precinct data page for current DOIs. File IDs change when datasets are updated.

Census Bureau FIPS Reference Files

The US Census Bureau publishes authoritative FIPS (Federal Information Processing Standards) code files that provide the canonical mapping from numeric codes to geographic entity names. These files are the ground truth for geographic identifiers across the pipeline.

What it provides

| File | Entity type | Record count | Key columns |
|---|---|---|---|
| state.txt | States + DC + territories | 57 | STATE, STATE_NAME |
| national_county2020.txt | Counties + equivalents | 3,143 | STATEFP, COUNTYFP, COUNTYNAME |
| national_place2020.txt | Incorporated places + CDPs | 31,980 | STATEFP, PLACEFP, PLACENAME |
| national_cousub2020.txt | County subdivisions | ~36,000 | STATEFP, COUNTYFP, COUSUBFP, COUSUBNAME |

Format

All files are pipe-delimited (|) plain text with a header row. Encoding is ASCII. Example from the county file:

NC|37|037|1026339|Chatham County|H1|A
NC|37|063|1008557|Durham County|H1|A
NC|37|183|1008586|Wake County|H1|A

Columns in the county file:

| Column | Description |
|---|---|
| STATE | Two-letter postal abbreviation |
| STATEFP | Two-digit state FIPS code |
| COUNTYFP | Three-digit county FIPS code |
| COUNTYNS | ANSI feature code |
| COUNTYNAME | Full county name including “County” suffix |
| CLASSFP | FIPS class code (H1 = active county, H4 = borough, H6 = parish) |
| FUNCSTAT | Functional status (A = active) |

The five-digit county FIPS used throughout the pipeline is STATEFP + COUNTYFP (e.g., 37 + 183 = 37183 for Wake County, NC).
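
A lookup-table sketch built from the county file (load_county_fips is a hypothetical helper; it assumes the 7-column layout shown above, and stripping only the “County” suffix is a simplification, since parishes and boroughs would need the same treatment):

import csv

def load_county_fips(path):
    """Map (state postal code, upper-cased county name) -> 5-digit FIPS."""
    lookup = {}
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="|")
        next(reader)  # skip header row
        for state, statefp, countyfp, _ns, name, _classfp, _funcstat in reader:
            fips = statefp + countyfp
            lookup[(state, name.upper())] = fips
            lookup[(state, name.upper().removesuffix(" COUNTY"))] = fips
    return lookup

counties = load_county_fips("national_county2020.txt")
counties[("NC", "WAKE")]  # '37183'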

Download

https://www2.census.gov/geo/docs/reference/state.txt
https://www2.census.gov/geo/docs/reference/codes2020/national_county2020.txt
https://www2.census.gov/geo/docs/reference/codes2020/national_place2020.txt
https://www2.census.gov/geo/docs/reference/codes2020/national_cousub2020.txt

No API key required. Files are small (under 5 MB total) and rarely change.

Usage in the pipeline

Census FIPS files are consumed at L1 for geographic enrichment. When a source record contains a county name but no FIPS code (common in OpenElections and Clarity data), the pipeline joins against the county file to assign the canonical five-digit FIPS. When a source provides a FIPS code but no name, the lookup runs in reverse.

The place file enables resolution of municipal names to FIPS codes — relevant for city council, mayoral, and municipal utility district contests where the jurisdiction is a place, not a county.

FIPS codes serve as the primary geographic join key across all seven data sources. Without them, matching “Wake County” in MEDSL to “WAKE” in NC SBE to “Wake Co.” in OpenElections would require fuzzy string matching. With them, it is an exact integer join.

FEC — Federal Election Commission Candidate Master Files

The FEC publishes bulk data files for every registered federal candidate: President, US Senate, and US House. These files provide stable candidate identifiers (CAND_ID) that persist across election cycles, making them a reference source for cross-linking federal candidates across MEDSL, NC SBE, and OpenElections data.

What FEC provides

The candidate master file (cn.txt) contains one row per candidate per election cycle. It covers all candidates who have filed with the FEC, including those who lost primaries or never appeared on a general election ballot.

Available cycles: 1980–present. We have downloaded 2020 and 2022.

Download

Bulk data is at fec.gov/data/browse-data.

mkdir -p local-data/sources/fec/{2020,2022}

# 2022
curl -L -o local-data/sources/fec/2022/cn.zip \
  "https://www.fec.gov/files/bulk-downloads/2022/cn.zip"
unzip -o local-data/sources/fec/2022/cn.zip -d local-data/sources/fec/2022/

# 2020
curl -L -o local-data/sources/fec/2020/cn.zip \
  "https://www.fec.gov/files/bulk-downloads/2020/cn.zip"
unzip -o local-data/sources/fec/2020/cn.zip -d local-data/sources/fec/2020/

Schema

The file cn.txt is pipe-delimited (|) with 15 columns and no header row.

| # | Column | Description | Example |
|---|---|---|---|
| 1 | CAND_ID | Stable candidate identifier | H0NC09072 |
| 2 | CAND_NAME | Name in LAST, FIRST MIDDLE format | BRAY, SHANNON W |
| 3 | CAND_PTY_AFFILIATION | Party code | LIB |
| 4 | CAND_ELECTION_YR | Election year | 2022 |
| 5 | CAND_OFFICE_ST | State (2-letter postal code) | NC |
| 6 | CAND_OFFICE | Office: H / S / P | H |
| 7 | CAND_OFFICE_DISTRICT | Congressional district (00 for Senate/President) | 09 |
| 8 | CAND_ICI | Incumbent/Challenger/Open: I/C/O | C |
| 9 | CAND_STATUS | Status code (C=statutory candidate, F=filed, N=not yet, P=prior cycle) | C |
| 10 | CAND_PCC | Principal campaign committee ID | C00654321 |
| 11 | CAND_ST1 | Mailing address street | |
| 12 | CAND_ST2 | Mailing address street 2 | |
| 13 | CAND_CITY | Mailing address city | |
| 14 | CAND_ST | Mailing address state | |
| 15 | CAND_ZIP | Mailing address ZIP | |
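
Because there is no header row, the loader must carry the column list itself. A sketch (read_cn is a hypothetical helper; the latin-1 encoding is an assumption, since FEC bulk files are not UTF-8-clean in every cycle):

import csv

CN_COLUMNS = [
    "CAND_ID", "CAND_NAME", "CAND_PTY_AFFILIATION", "CAND_ELECTION_YR",
    "CAND_OFFICE_ST", "CAND_OFFICE", "CAND_OFFICE_DISTRICT", "CAND_ICI",
    "CAND_STATUS", "CAND_PCC", "CAND_ST1", "CAND_ST2", "CAND_CITY",
    "CAND_ST", "CAND_ZIP",
]

def read_cn(path):
    """Yield one dict per candidate from a pipe-delimited cn.txt."""
    with open(path, newline="", encoding="latin-1") as f:
        for row in csv.reader(f, delimiter="|"):
            yield dict(zip(CN_COLUMNS, row))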

Usage in the pipeline

FEC data serves two purposes:

  1. Stable identifiers. CAND_ID persists across election cycles. A candidate who runs for the same seat in 2020 and 2022 keeps the same ID. This provides a ground-truth link for validating temporal chains built by the L4 layer.

  2. Name cross-referencing. CAND_NAME is parsed at L1 into last, first, middle, and suffix components. These parsed names are compared against MEDSL and state source names during L3 entity resolution. FEC uses LAST, FIRST MIDDLE format consistently, which makes it one of the more predictable sources for name parsing.

Limitations

  • Federal candidates only. No state legislators, no county commissioners, no school board members. FEC has no jurisdiction over non-federal offices.
  • Filing ≠ appearing on ballot. Many CAND_ID entries correspond to candidates who filed paperwork but never appeared on a general election ballot.
  • Party codes differ from other sources. FEC uses codes like LIB, GRE, NNE (None) that do not match MEDSL’s LIBERTARIAN, GREEN, NONPARTISAN labels. Normalization is required at L1.

Future Sources

This chapter documents data sources that have been identified as valuable but are not yet integrated into the pipeline. Each is blocked by a specific access, cost, or engineering constraint.

ALGED — Annual Local Government Election Data

The Annual Local Government Election Data project, hosted on the Open Science Framework (OSF), covers municipal elections in 1,747 cities with populations over 25,000. Records include candidate demographics (race, gender), incumbency status, and election outcomes — fields that no other source in our corpus provides.

Coverage: Municipal elections from 2000–2020. Cities only (no counties, no school districts). Focuses on mayoral and city council races.

Format: CSV files organized by city population tier.

Status: Blocked. The OSF repository requires an approved access request. We submitted a request in early 2025 and have not received a response. The underlying data appears to be derived from individual city clerk records, manually curated by the research team.

Value if integrated: ALGED would fill the demographic gap entirely. No other source provides candidate race or gender. It would also provide an independent validation source for municipal races in the 1,747 covered cities.

Ballotpedia

Ballotpedia maintains the most comprehensive database of US local elections, covering school boards, city councils, county commissions, judges, ballot measures, and special districts across all 50 states. Their coverage extends to races that no other source tracks — mosquito abatement districts, water boards, and transit authorities.

Coverage: All 50 states, all levels of government, ongoing since approximately 2007.

Format: Structured database accessible via a paid API. Some data is available on the public website but is not bulk-downloadable.

Status: Blocked by cost. The API requires a commercial license. Pricing is not publicly listed but is reported to be in the five-figure annual range. We have not pursued a license.

Value if integrated: Ballotpedia would be the single largest improvement to local race coverage. It would fill the 7-state local race gap in MEDSL 2022 and provide office-level metadata (term length, salary, appointing authority) that no source currently offers.

AP Elections API

The Associated Press Elections API provides real-time and certified results for federal and state races, with some local coverage in larger jurisdictions. It is the standard data feed used by newsrooms on election night.

Coverage: Federal and statewide races nationwide. County-level results for most races. Precinct-level for some states. Local race coverage varies.

Format: JSON API with WebSocket push for live updates.

Status: Blocked by cost. The AP API is a commercial product priced for newsroom budgets. It is not available for academic or open-source use without a contract. The real-time capability is irrelevant to our pipeline (we process certified results, not live feeds), but the certified result snapshots would be a valuable validation source.

Value if integrated: AP results would serve as a third independent source for federal and statewide races, enabling three-way cross-source validation alongside MEDSL and state portals. AP’s candidate identifiers are stable across cycles, which would simplify temporal chaining for federal candidates.

Additional State Portals

Six states with significant populations publish precinct-level results through their own election portals in structured formats. These would complement MEDSL by providing certified results directly from the state authority.

Florida: The Division of Elections publishes precinct-level results at results.elections.myflorida.com. CSV format. All counties, all contests. Would overlap with both MEDSL and OpenElections FL, enabling three-source validation for one state.

Georgia: The Secretary of State publishes results at results.enr.clarityelections.com/Georgia/ (Clarity-based) and via a separate certified results portal. XML and CSV. Would provide a second source for GA alongside MEDSL.

Texas: The Secretary of State publishes county-level results (not precinct-level) at elections.sos.state.tx.us. Precinct-level results are published by individual counties. A full TX integration would require crawling 254 county websites or using the Clarity instances that many TX counties operate.

Ohio: The Secretary of State publishes precinct-level results at www.ohiosos.gov/elections/election-results-and-data/. CSV format. Covers all contests including local races.

Pennsylvania: The Department of State publishes results at electionreturns.pa.gov. JSON API available. Covers all contests. Would fill one of the 7 states with zero local data in MEDSL 2022.

Michigan: The Secretary of State publishes precinct-level results at miboecfr.nictusa.com/cgi-bin/cfr/. Older web interface with downloadable files. Covers all contests.

Status: Not blocked by access — all six portals are public. Blocked by engineering time. Each state portal has its own format, URL structure, and quirks. We estimate 1–2 weeks of parser development per state. These are the highest-priority engineering tasks after odd-year MEDSL loading.

Name Normalization

Election data arrives with candidate names in dozens of formats. MEDSL uses LAST, FIRST MIDDLE in all caps. NC SBE uses First Last in title case. OpenElections uses whatever the county clerk typed. FEC uses LAST, FIRST MIDDLE SUFFIX. A single candidate can appear as:

  • CRIST, CHARLES JOSEPH (MEDSL)
  • Charlie Crist (OpenElections)
  • Crist, Charlie (FEC)

These are all the same person. A system that treats them as three different candidates produces garbage output. A system that aggressively normalizes them — stripping middle names, collapsing nicknames, removing suffixes — destroys the signal needed to tell different people apart.

The principle: clean without collapsing.

Name decomposition at L1

Every candidate name is decomposed at L1 into six components:

| Component | Purpose | Example |
|---|---|---|
| raw | Original string, unmodified | CRIST, CHARLES JOSEPH |
| first | Parsed first name | CHARLES |
| middle | Middle name or initial | JOSEPH |
| last | Last name | CRIST |
| suffix | Generational suffix | null |
| canonical_first | Dictionary-normalized first name | CHARLES |

The canonical_first field is populated by the nickname dictionary. If the raw first name is Charlie, canonical_first becomes Charles. If no mapping exists, canonical_first equals first.

Both first and canonical_first are preserved. The raw nickname is useful signal — it tells you what the candidate goes by. The canonical form is what enables matching.
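
A sketch of the two-field rule (the dictionary excerpt and the decompose helper are illustrative, not the production parser):

# Excerpt of the nickname dictionary; the full table lives in the appendix chapter.
NICKNAMES = {"charlie": "Charles", "ron": "Ronald", "nikki": "Nicole", "bill": "William"}

def decompose(raw, first, middle, last, suffix):
    """Build the six-component L1 record; first is preserved, never overwritten."""
    return {
        "raw": raw,
        "first": first,
        "middle": middle,
        "last": last,
        "suffix": suffix,
        "canonical_first": NICKNAMES.get(first.lower(), first),
    }

decompose("Charlie Crist", "Charlie", None, "Crist", None)["canonical_first"]
# 'Charles'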

Real decomposition examples

Five candidates from our prototype, showing how MEDSL and NC SBE formats decompose differently for the same people:

| Source | Raw Name | first | middle | last | suffix | canonical_first |
|---|---|---|---|---|---|---|
| MEDSL | DESANTIS, RON | RON | null | DESANTIS | null | RONALD |
| OpenElections | Ron DeSantis | Ron | null | DeSantis | null | Ronald |
| MEDSL | CRIST, CHARLES JOSEPH | CHARLES | JOSEPH | CRIST | null | CHARLES |
| OpenElections | Charlie Crist | Charlie | null | Crist | null | Charles |
| MEDSL | DEMINGS, VAL BUTLER | VAL | BUTLER | DEMINGS | null | VALDEZ |
| NC SBE | Val Demings | Val | null | Demings | null | Valdez |
| MEDSL | WILLIAMS, ROBERT | ROBERT | null | WILLIAMS | null | ROBERT |
| NC SBE | Robert Williams Jr | Robert | null | Williams | Jr | Robert |
| MEDSL | MARSHALL, DAVID S | DAVID | S | MARSHALL | null | DAVID |
| MEDSL | MARSHALL, DAVID A | DAVID | A | MARSHALL | null | DAVID |

Key observations from these examples:

  1. Ron DeSantis — Ron maps to Ronald via the nickname dictionary. The embedding score between the two source representations is 0.729 — below any reasonable auto-accept threshold, but the LLM matches them using nickname knowledge.

  2. Charlie Crist — Charlie maps to Charles. The embedding score is 0.451. Without the dictionary, the cascade would need the LLM to know that Charlie is a nickname for Charles. With the dictionary, the canonical forms already match.

  3. Robert Williams vs Robert Williams Jr — The suffix Jr is the only distinguishing feature. These are different people. The embedding scores them at 0.862 — dangerously close to a false positive. See Suffixes.

  4. David S Marshall vs David A Marshall — Different middle initials. David S. Marshall ran in Maine; David A. Marshall ran in Florida. The middle initial is the only signal distinguishing them at the name level. See Nicknames and Middle Initials.

What decomposition enables

With names decomposed into components, downstream layers can:

  • Exact-match on structured fields: (canonical_first="Timothy", last="Lance", suffix=null) matches across precincts without fuzzy logic. This handles 70% of entity resolution.
  • Build composite strings for embedding: "{canonical_first} {middle} {last} {suffix} | {party} | {office} | {state}" includes middle initials and suffixes as disambiguation signal.
  • Provide structured context to the LLM: Instead of asking “are these the same person?”, the LLM sees parsed components and can reason about specific differences (nickname vs. different name, Jr vs. no suffix).
  • Block efficiently: Group by (state, last_name_initial) for entity resolution without computing all-pairs similarity.

What goes wrong without decomposition

If you treat names as opaque strings:

  • CRIST, CHARLES JOSEPH and Charlie Crist have a Jaro-Winkler similarity of 0.58 — a miss.
  • DESANTIS, RON and Ron DeSantis have a cosine embedding similarity of 0.729 — in the ambiguous zone.
  • Robert Williams and Robert Williams Jr look nearly identical to every string metric. Only structured suffix detection prevents a false merge.
  • David S Marshall and David A Marshall differ by one character in a middle initial that opaque matching may ignore entirely.

Decomposition is not optional. It is the foundation that every subsequent layer depends on.

The three sub-problems

Name normalization breaks into three sub-problems, each with its own chapter:

  1. Nicknames and Middle Initials — How Charlie becomes Charles and why David S. must stay distinct from David A.
  2. Suffixes: Jr/Sr Means Different People — Why generational suffixes are disambiguation signals, not noise to be stripped.
  3. The Nickname Dictionary — The lookup table that powers canonical_first, its current scope, and its limits.

Nicknames and Middle Initials

Two distinct problems share a root cause: the candidate’s legal name differs from the name on the ballot or in the source file. Nicknames substitute one first name for another. Middle initials appear in some sources and not others. Both must be handled at L1 to preserve signal for L2 and L3.

Nicknames

A nickname replaces the candidate’s legal first name with a familiar variant. The embedding model has no reliable way to recover the connection — it encodes character-level and token-level similarity, not social knowledge about naming conventions.

Real test results from our prototype, using text-embedding-3-large (3,072 dimensions):

| Source A | Source B | Nickname → Legal | Cosine | LLM Decision | LLM Confidence |
|---|---|---|---|---|---|
| Charlie Crist | CRIST, CHARLES JOSEPH | Charlie → Charles | 0.451 | match | 0.95 |
| Nicole Fried | FRIED, NIKKI | Nikki → Nicole | 0.642 | match | 0.92 |
| Ron DeSantis | DESANTIS, RON | Ron → Ronald | 0.729 | match | 0.98 |

The Crist result is the critical case. At 0.451, the embedding score falls below any plausible auto-accept threshold — and below many reject thresholds. Without nickname resolution, this pair would be missed entirely or routed to LLM on every encounter.

The fix operates at L1. The nickname dictionary maps CharlieCharles, NikkiNicole, RonRonald, and ~100 other mappings. When the L1 parser decomposes a name, it checks the first name against the dictionary and populates canonical_first:

{
  "raw": "Charlie Crist",
  "first": "Charlie",
  "middle": null,
  "last": "Crist",
  "suffix": null,
  "canonical_first": "Charles"
}

Both first and canonical_first are preserved. The original is kept for display and provenance. The canonical form is used in the L2 composite string for embedding and in the L3 exact-match step. After dictionary application, the L3 exact matcher sees (canonical_first="Charles", last="Crist", suffix=null) on both sides — an exact match with no embedding or LLM call required.

Why the embedding model fails on nicknames

Charlie and Charles share a prefix, but the embedding model must also reconcile Crist vs CRIST, CHARLES JOSEPH — different casing, different ordering, and a middle name that appears in one source but not the other. The model embeds the full composite string, not individual tokens. The combined divergence pushes the cosine score to 0.451.

Ron and Ronald are closer (0.729) because the surface forms are more similar and both sources use last-name-first ordering. But 0.729 is still in the ambiguous zone — it requires an LLM call to confirm.

The nickname dictionary eliminates these LLM calls for known mappings. At scale, this matters: if 5% of candidates use nicknames and each requires an LLM call, that is tens of thousands of unnecessary API round-trips.

Middle Initials

Middle initials are a different problem. They do not substitute one name for another — they add or remove a disambiguation signal.

The key case: David S. Marshall (Maine) and David A. Marshall (Florida) are different people. Without middle initials, both reduce to David Marshall. With middle initials preserved, L2 generates different embedding vectors.

We measured the effect directly:

Without middle initials: "David Marshall | ME" vs "David Marshall | FL" → cosine 0.7025
With middle initials: "David S Marshall | ME" vs "David A Marshall | FL" → cosine 0.6448

The middle initial drops the cosine score by 0.058 — enough to shift the pair further from the accept threshold and closer to correct rejection. The principle: middle initials are signal, not noise.

More middle-initial test results from our prototype:

| Source A | Source B | Cosine | LLM Decision | Key Signal |
|---|---|---|---|---|
| Ashley Moody | Ashley B. Moody | 0.930 | match | Same person, middle added |
| Val Demings | VAL DEMINGS | 0.828 | match | Same person, format difference |
| Dale Holness | DALE V.C. HOLNESS | 0.896 | match | Same person, middle initials added |

Ashley Moody at 0.930 is the same person — the B. appears in one source but not the other. The score falls just below the 0.95 auto-accept threshold, so the pair enters the LLM zone, where the exact Jaro-Winkler match on the last name (1.0) and the same-state context make confirmation straightforward.

How Both Feed Into L2

The L2 composite string for a candidate includes both canonical_first and middle:

{canonical_first} {middle} {last} {suffix} | {party} | {office} | {state} | {county}

For Charlie Crist, this becomes:

Charles  Crist  | DEM | Governor | FL | statewide

For CRIST, CHARLES JOSEPH, this becomes:

Charles Joseph Crist  | DEM | Governor | FL | statewide

The canonical first names now match. The remaining divergence — Joseph as a middle name in one source — is small enough that the embedding score rises well above the ambiguous zone. The nickname dictionary at L1 did the heavy lifting; L2 and L3 finish the job.
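
As a sketch, the builder is a single format string (composite is a hypothetical helper; empty components leave blanks, matching the examples above):

def composite(rec, party, office, state, county):
    middle = rec["middle"] or ""
    suffix = rec["suffix"] or ""
    return (f'{rec["canonical_first"]} {middle} {rec["last"]} {suffix} '
            f"| {party} | {office} | {state} | {county}")

composite({"canonical_first": "Charles", "middle": None, "last": "Crist",
           "suffix": None}, "DEM", "Governor", "FL", "statewide")
# 'Charles  Crist  | DEM | Governor | FL | statewide'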

The Combined Rule

  1. At L1, apply the nickname dictionary to populate canonical_first.
  2. At L1, preserve middle exactly as parsed — do not strip it, do not normalize it.
  3. At L2, include both canonical_first and middle in the composite string.
  4. At L3 exact match, match on (canonical_first, last, suffix) — middle is not required for exact match but is used for disambiguation when multiple candidates share the same canonical first and last name.
  5. At L3 LLM confirmation, provide both the raw and canonical names so the model can reason about nickname relationships and middle-initial differences.

The principle behind both: clean without collapsing. Normalize what you can (nicknames to canonical forms), preserve what you must (middle initials as disambiguation signal), and let downstream layers use the full context.

Suffixes: Jr/Sr Means Different People

A suffix is not decoration. Robert Williams and Robert Williams Jr. are two different human beings — father and son, or grandfather and grandson. Any system that strips suffixes before matching will merge them into one entity.

This is not a hypothetical risk. It happened in our prototype.

The Williams Problem

In Florida 2022, two candidates share the name Robert Williams:

| Source | Raw Name | Office | Votes |
|---|---|---|---|
| MEDSL | ROBERT WILLIAMS | State House District 41 | 12,304 |
| OpenElections | Robert Williams Jr | State House District 41 | 12,304 |

The embedding similarity between these two strings is 0.862 — well above our original auto-accept threshold of 0.82. Without suffix-aware logic, the cascade would have merged them automatically, no LLM consulted.

The LLM caught it. Given the full context, Claude Sonnet returned:

Decision: no_match (confidence: 0.85)

“The suffix ‘Jr’ indicates a generational distinction. ‘Robert Williams’ and ‘Robert Williams Jr’ are most likely two different individuals — typically father and son. While the office and vote counts match, the suffix is a strong signal that these are distinct people. The MEDSL record may have dropped the suffix during their cleaning process. I recommend treating these as separate entities until a human reviewer can confirm.”

This single case changed our threshold design.

How This Changed the Thresholds

Before Williams Jr., auto-accept was ≥ 0.82. A score of 0.862 would have been accepted without review. After this finding, we raised auto-accept to ≥ 0.95, ensuring that suffix-bearing pairs always enter the LLM zone (0.35–0.95) where the model can reason about generational distinctions.

| Threshold | Before | After | Reason |
|---|---|---|---|
| Auto-accept | ≥ 0.82 | ≥ 0.95 | Williams Jr at 0.862 was a false positive |
| Ambiguous (LLM zone) | 0.65–0.82 | 0.35–0.95 | Wider zone catches more edge cases |
| Auto-reject | < 0.65 | < 0.35 | Crist at 0.451 was a false negative |

The wider ambiguous zone sends more pairs to the LLM. Budget is not a constraint — accuracy is.
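
The revised cascade reduces to a three-way routing function (a sketch; route is a hypothetical name):

def route(cosine):
    """Route an embedding score under the post-Williams thresholds."""
    if cosine >= 0.95:
        return "auto_accept"
    if cosine >= 0.35:
        return "llm"            # ambiguous zone: 0.35-0.95
    return "auto_reject"

route(0.862)  # 'llm' (Robert Williams vs Robert Williams Jr)
route(0.451)  # 'llm' (Charlie Crist vs CRIST, CHARLES JOSEPH)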

Suffix-Aware Logic in the Cascade

Suffixes receive special treatment at multiple stages:

L1 — Decomposition. The name parser extracts Jr, Sr, II, III, IV, V, Esq, and PhD into the suffix field. Both Jr. and Jr (with and without period) normalize to Jr. The suffix is never discarded.

Step 1 — Exact match. The exact match key is (canonical_first, last, suffix). “Timothy Lance” and “Timothy Lance” match. “Robert Williams” and “Robert Williams Jr” do not — the suffix field differs (null vs “Jr”).

Step 3 — Embedding. The suffix is included in the composite string: {canonical_first} {middle} {last} {suffix} | {party} | {office} | {state} | {county}. This means “Robert Williams” and “Robert Williams Jr” produce different vectors, but the difference is small (0.862 cosine) because the model treats “Jr” as a minor token.

Step 4 — LLM confirmation. The prompt explicitly includes both suffix fields and instructs the model: “A suffix like Jr or Sr typically indicates a different person (parent vs child). Do not match across suffixes unless you have strong evidence they refer to the same individual.” The LLM sees the structured fields, not just the raw strings.

The Suffix Inventory

From MEDSL 2022 data across all 50 states:

| Suffix | Occurrences | Notes |
|---|---|---|
| Jr | 1,847 | Most common; often dropped by one source |
| Sr | 312 | Almost always appears alongside a Jr in the same jurisdiction |
| II | 478 | Increasingly common; same disambiguation need as Jr |
| III | 189 | Rarer but unambiguous signal |
| IV | 31 | |
| V | 4 | |

The Jr/Sr problem is not rare. Nearly 2,000 candidates in a single election cycle carry a Jr suffix, and an unknown number of their non-suffixed counterparts exist in the same dataset.

When Suffixes Are Missing

The harder case is when one source includes the suffix and another drops it. MEDSL strips suffixes more aggressively than NC SBE. OpenElections preserves them inconsistently. This means the cascade must handle the asymmetric case: one record has suffix “Jr”, the other has suffix null.

The rule: a null suffix does not match a non-null suffix. Null-to-null matches normally. “Jr” to “Jr” matches normally. But null to “Jr” always enters the LLM zone, regardless of embedding score. The LLM can then examine vote counts, office, and geographic context to determine whether the missing suffix is a data quality issue (same person, suffix dropped) or a genuine distinction (father and son).

This is conservative by design. We would rather send 1,847 extra pairs to the LLM than silently merge fathers with sons.
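
A sketch of the rule (suffix_route is a hypothetical helper; treating two different non-null suffixes as a non-match is our reading of the design, not stated explicitly above):

def suffix_route(suffix_a, suffix_b, cosine):
    """Apply the suffix rule before the score thresholds."""
    if suffix_a != suffix_b:
        if suffix_a is None or suffix_b is None:
            return "llm"        # asymmetric: dropped suffix, or father vs son?
        return "no_match"       # assumption: Jr vs Sr are distinct people
    if cosine >= 0.95:
        return "auto_accept"
    if cosine >= 0.35:
        return "llm"
    return "auto_reject"

suffix_route(None, "Jr", 0.862)  # 'llm', despite the high score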


The Nickname Dictionary

The nickname dictionary is a static lookup table applied at L1 during name decomposition. It maps common short names and nicknames to their formal equivalents, populating the canonical_first field while preserving the original first field unchanged.

Scope

The prototype dictionary contains approximately 100 mappings covering the most frequent English-language nicknames encountered in US election data:

| Raw first | canonical_first | Frequency in MEDSL 2022 |
|---|---|---|
| Bill | William | 847 |
| Bob | Robert | 612 |
| Jim | James | 589 |
| Mike | Michael | 534 |
| Charlie | Charles | 201 |
| Ron | Ronald | 187 |
| Nikki | Nicole | 42 |
| Ted | Edward | 31 |
| Dick | Richard | 28 |
| Peggy | Margaret | 19 |

The target for production is 500+ mappings, expanding to cover Spanish-language nicknames (Pepe→José, Pancho→Francisco), regional variants, and less common English forms. The full reference list is maintained in Appendix: Full Nickname Dictionary.

Both forms are preserved

When the dictionary maps Charlie → Charles, the L1 record stores both:

```json
{
  "first": "Charlie",
  "canonical_first": "Charles"
}
```

The original first is never overwritten. The composite string sent to L2 embedding uses canonical_first, which is why the embedding for “Charles Crist” and “CRIST, CHARLES JOSEPH” can be compared at all — even though the raw cosine similarity between “Charlie Crist” and “CRIST, CHARLES JOSEPH” is only 0.451.

The Ted problem

Some nicknames are ambiguous. “Ted” can map to Edward (Ted Kennedy) or Theodore (Ted Cruz). “Bill” is unambiguous — it always maps to William. “Ted” is not.

The current dictionary maps Ted → Edward, which is the more common historical usage in US politics. This is wrong for Theodore-named candidates. The correct resolution requires context that L1 does not have: party, state, office, or a reference database of known candidates.

The planned fix is a two-pass approach: L1 applies the majority mapping (Ted → Edward), and L3 entity resolution can override it when the LLM has enough context to determine the correct expansion. The canonical_first field is treated as a best guess at L1, not a final answer.

Other ambiguous nicknames with the same property: Pat (Patricia or Patrick), Chris (Christopher or Christine), Alex (Alexander or Alexandra), Sam (Samuel or Samantha). For these, L1 does not apply a mapping — canonical_first is left equal to first — and disambiguation is deferred to L3.
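
A sketch of the L1 lookup showing both behaviors — the majority mapping for Ted and no mapping at all for the ambiguous set. The dictionary contents are excerpted from the table above; the function name is illustrative:

```python
# Excerpt of the ~100-entry prototype dictionary.
NICKNAMES = {
    "bill": "William", "bob": "Robert", "jim": "James",
    "mike": "Michael", "charlie": "Charles", "nikki": "Nicole",
    "ted": "Edward",   # majority mapping; L3 may override to Theodore
}

# Ambiguous short names where L1 applies no mapping.
AMBIGUOUS = {"pat", "chris", "alex", "sam"}

def canonicalize_first(first: str) -> str:
    """Populate canonical_first: a best guess at L1, never overwriting first."""
    key = first.lower()
    if key in AMBIGUOUS:
        return first            # disambiguation deferred to L3
    return NICKNAMES.get(key, first)

record = {"first": "Charlie"}
record["canonical_first"] = canonicalize_first(record["first"])
assert record == {"first": "Charlie", "canonical_first": "Charles"}
```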

Office Classification

MEDSL 2022 contains 8,387 unique office names across all 50 states and DC. These are not 8,387 distinct offices — they are 8,387 different strings that humans typed to describe elected positions. “Board of Education”, “BOARD OF ED.”, “BOE”, “School Board”, and “Board of Education Members” all refer to the same type of office. “DALLAS COUNTY JUDGE” means a chief executive in Texas and a judicial officer everywhere else.

Classifying these strings into a consistent taxonomy is required for every downstream operation: blocking for entity resolution, computing competitiveness by office type, comparing the same office across states, and answering “what offices exist in my county?”

The taxonomy

Every office is classified into two fields:

| Field | Values | Example |
|---|---|---|
| office_level | federal, state, county, municipal, school_district, special_district, judicial, tribal | school_district |
| office_branch | executive, legislative, judicial, law_enforcement, fiscal, education, infrastructure, regulatory, other | education |

The pair (office_level, office_branch) defines the classification. “Board of Education” → (school_district, education). “County Sheriff” → (county, law_enforcement). “City Council” → (municipal, legislative).

The scale of the problem

Of the 8,387 unique office names in MEDSL 2022:

| Characteristic | Count | Percentage |
|---|---|---|
| Appear in only 1 state | 6,241 | 74.4% |
| Appear in only 1 county | 4,995 | 59.6% |
| Appear in 10+ states | 312 | 3.7% |
| Contain a proper noun (county/city name) | 3,108 | 37.1% |

Most office names are effectively unique strings. “DALLAS COUNTY JUDGE”, “Collier Mosquito Control District”, “Santa Rosa Island Authority” — these appear once in the entire national dataset. No keyword list can enumerate them all. The classifier must generalize.

Four-tier approach

The classifier runs four tiers in sequence. Each tier handles what the previous tier could not. A record classified at tier 1 is never re-examined by tier 2.

| Tier | Method | Unique names handled | Cumulative % | Cost |
|---|---|---|---|---|
| 1 | Keyword lookup | ~3,775 | ~45.0% | $0 |
| 2 | Regex patterns | ~1,426 | ~62.0% | $0 |
| 3 | Embedding nearest-neighbor | ~378 | ~66.5% | ~$0.01/1K |
| 4 | LLM classification | ~42 | ~67.0% | ~$0.002/call |
| — | Unclassified (other) | ~2,766 | 100% | — |

The remaining ~33% classified as other are primarily hyper-local offices (township-specific roles, water district sub-boards, tribal offices) that require either expanded reference data or manual review. The other rate drops as the keyword and regex lists expand.

Note: Percentages are based on unique office name strings. By record count, the coverage is much higher — the 312 names that appear in 10+ states account for millions of records. Keyword tier 1 alone handles ~85% of records by volume.

Tier 1: Keyword lookup

A table of ~170 keywords mapped to (office_level, office_branch) pairs. If any keyword appears in the office name string, the classification is assigned.

| Keyword | office_level | office_branch | Example match |
|---|---|---|---|
| sheriff | county | law_enforcement | “WARREN COUNTY SHERIFF” |
| board of education | school_district | education | “COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02” |
| city council | municipal | legislative | “CITY COUNCIL WARD 3” |
| coroner | county | fiscal | “COUNTY CORONER” |
| constable | county | law_enforcement | “CONSTABLE PRECINCT 4” |

Keywords are matched case-insensitively. When multiple keywords match, the most specific wins (“county board of education” matches board of education → school_district, not county → county). The keyword table is maintained in the appendix.

Keyword lookup handles approximately 45% of unique office name strings and ~85% of total records. The most common offices — sheriff, school board, city council, county commission — all have unambiguous keywords.

Tier 2: Regex patterns

Approximately 40 regex patterns handle structured variations that keywords miss. Patterns capture positional and combinatorial relationships:

| Pattern | office_level | office_branch | Example match |
|---|---|---|---|
| `county\s+commission` | county | legislative | “CLARK COUNTY COMMISSION DIST 2” |
| `district\s+court\s+judge` | judicial | judicial | “15TH DISTRICT COURT JUDGE” |
| `register\s+of\s+(deeds\|wills)` | county | fiscal | “REGISTER OF DEEDS” |
| `soil.*water.*conservation` | special_district | infrastructure | “SOIL AND WATER CONSERVATION DISTRICT SUPERVISOR” |
| `(mayor\|alcalde)` | municipal | executive | “MAYOR - CITY OF SPRINGFIELD” |

Regex patterns add approximately 17% of unique names beyond what keywords catch. Combined with tier 1, the two deterministic tiers handle ~62% of unique names and ~92% of records by volume.

Tier 3: Embedding nearest-neighbor

For names that survive tiers 1 and 2, L2 generates an embedding using text-embedding-3-large and finds the nearest neighbor in a reference set of ~200 pre-classified office names.

Real example from our prototype:

  • Input: “Collier Mosquito Control District”
  • Nearest neighbor: “Mosquito Control District” (reference set)
  • Cosine similarity: 0.787
  • Classification: (special_district, infrastructure)

The tier 3 accept threshold is cosine ≥ 0.60. Below that, the match is too uncertain and the record passes to tier 4. In our prototype, tier 3 classified ~4.5% of remaining unique names with a manual-review accuracy of 94%.

The 200-name reference set was curated from the most common office names across all states, covering every (office_level, office_branch) pair with at least 3 reference examples. Expanding this set to 500+ names is a planned improvement.
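
A sketch of the tier 3 decision, assuming the reference vectors are already computed (the call to text-embedding-3-large is out of scope here, and the function name is ours):

```python
import numpy as np

TIER3_ACCEPT = 0.60  # below this, the record passes to tier 4

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_tier3(vec: np.ndarray, reference: list) -> dict | None:
    """reference: (name, vector, (office_level, office_branch)) triples."""
    name, ref_vec, cls = max(reference, key=lambda r: cosine(vec, r[1]))
    score = cosine(vec, ref_vec)
    if score >= TIER3_ACCEPT:
        return {"classification": cls, "confidence": score,
                "classifier_method": "embedding_nn", "neighbor": name}
    return None  # too uncertain: proceed to tier 4 (LLM)
```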

Tier 4: LLM classification

Remaining unclassified names go to Claude Sonnet with the full context: office name, state, county, and the taxonomy definition.

Real examples from our prototype:

| Office name | State | LLM classification | Confidence |
|---|---|---|---|
| Santa Rosa Island Authority | FL | special_district / infrastructure | 0.90 |
| Mosquito Control Board Member | FL | special_district / infrastructure | 0.95 |
| Judge of Compensation Claims | FL | judicial / judicial | 0.88 |
| Public Administrator | MO | county / fiscal | 0.82 |
| Recorder of Deeds | MO | county / fiscal | 0.95 |
| Drainage Commissioner | IL | special_district / infrastructure | 0.85 |
| Fence Viewer | VT | municipal / regulatory | 0.70 |
| Pound Keeper | NH | municipal / regulatory | 0.65 |
| Hog Reeve | NH | municipal / regulatory | 0.60 |

In our prototype, the LLM classified 9 hard cases with 100% accuracy against manual review. The lower-confidence cases (Fence Viewer at 0.70, Hog Reeve at 0.60) are genuine obscure New England town offices that even the LLM finds unusual — but it classified them correctly.

The state-context problem

“DALLAS COUNTY JUDGE” illustrates why state context matters. In Texas, the county judge is the presiding officer of the commissioners court — an executive role, not a judicial one. In every other state, a county judge sits on the bench.

The keyword classifier alone cannot resolve this. The word “judge” appears, suggesting judicial. But the Texas county judge is (county, executive).

The fix is a state-specific override table in tier 1. Before general keyword matching, a small set of (state, keyword) → classification entries handles known exceptions:

| State | Office pattern | Correct classification |
|---|---|---|
| TX | county judge | county / executive |
| LA | parish president | county / executive |
| LA | police jury | county / legislative |
| AK | borough assembly | county / legislative |

This table is currently small (~15 entries). As more state-specific offices are identified, it grows. The pattern generalizes: when the same word means different things in different states, the state-specific override takes priority.
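
A sketch combining the override table with the general keyword lookup, including the most-specific-wins rule. Both tables are excerpts, and classify_tier1 is an illustrative name:

```python
# (state, keyword) overrides are checked before the general keyword table.
STATE_OVERRIDES = {
    ("TX", "county judge"):     ("county", "executive"),
    ("LA", "parish president"): ("county", "executive"),
    ("LA", "police jury"):      ("county", "legislative"),
    ("AK", "borough assembly"): ("county", "legislative"),
}

KEYWORDS = {
    "board of education": ("school_district", "education"),
    "sheriff":            ("county", "law_enforcement"),
    "city council":       ("municipal", "legislative"),
    "county judge":       ("county", "judicial"),   # wrong in TX; override fires first
}

def classify_tier1(office: str, state: str):
    name = office.lower()
    for (st, kw), cls in STATE_OVERRIDES.items():
        if st == state and kw in name:
            return cls
    # Most specific (longest) keyword wins when several match.
    hits = [(kw, cls) for kw, cls in KEYWORDS.items() if kw in name]
    if hits:
        return max(hits, key=lambda h: len(h[0]))[1]
    return None  # fall through to tier 2

assert classify_tier1("DALLAS COUNTY JUDGE", "TX") == ("county", "executive")
assert classify_tier1("WARREN COUNTY SHERIFF", "NC") == ("county", "law_enforcement")
```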

Accuracy by tier

| Tier | Method | Accuracy (manual review) | False positive rate |
|---|---|---|---|
| 1 | Keyword | 99.2% | < 0.5% |
| 2 | Regex | 97.8% | ~1.0% |
| 3 | Embedding NN | 94.0% | ~3.5% |
| 4 | LLM | 100% (N=9) | 0% (N=9) |

Tier 1 and 2 errors are almost entirely from the state-context problem (a keyword matching the wrong sense of the word). Tier 3 errors come from embedding matches that are semantically close but functionally wrong — “Tax Collector” matching to “Tax Assessor” when they are separate offices in some states.


The Four-Tier Classifier

Office classification proceeds through four tiers in strict order. Each tier handles a progressively harder subset of the 8,387 unique office names found in MEDSL 2022. A name classified at tier 1 never reaches tier 2. A name classified at tier 2 never reaches tier 3. The tiers are ordered by cost: deterministic and free first, embedding-based second, LLM last.

Tier 1: Keyword Match

A lookup table of 170 keyword entries maps office name substrings to (office_level, office_branch) pairs. Matching is case-insensitive and checks for substring containment.

Example:

Raw office name: WARREN COUNTY BOARD OF EDUCATION

The keyword table contains:

| Keyword | office_level | office_branch |
|---|---|---|
| board of education | school_district | education |

"board of education" appears as a substring → classified as school_district/education.

Coverage: ~3,775 of 8,387 unique names (~45.0%). These are the offices with unambiguous keywords: sheriff, coroner, board of education, city council, state senate, district court, county clerk, school board, mayor, constable, treasurer.

Limitations: Keyword matching is context-free. DALLAS COUNTY JUDGE contains judge, which maps to county/judicial. In Texas, the County Judge is the chief executive — county/executive is correct. Tier 1 gets this wrong. The planned fix is a state-context override table applied before keyword matching.

Tier 2: Regex Patterns

Approximately 40 regular expressions handle office names with structural patterns that keywords alone cannot capture.

Example:

Raw office name: CLERK OF THE CIRCUIT COURT, 11TH JUDICIAL CIRCUIT

Regex pattern: clerk\s+of\s+(the\s+)?(circuit|district|superior)\s+court

Match → classified as county/judicial.

Other regex examples:

| Pattern | Matches | Classification |
|---|---|---|
| `county\s+commission` | County Commissioner, County Commission District 3 | county/legislative |
| `(city\|town\|village)\s+council` | City Council Ward 2, Town Council At Large | municipal/legislative |
| `district\s+\d+\s+judge` | District 14 Judge, District 3 Judge | county/judicial |
| `soil\s+and\s+water` | Soil and Water Conservation District Supervisor | special_district/infrastructure |

Coverage: ~1,426 additional unique names (~17.0%), bringing the cumulative total to ~62.0%.

Limitations: Regex patterns are brittle against novel phrasings. CONSERVATION DISTRICT BOARD MEMBER does not match the soil-and-water pattern. Regex also cannot handle the 4,995 office names that appear in exactly one county — writing a pattern for each is infeasible.
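
A sketch of the tier 2 pass using a few of the patterns above (classify_tier2 is an illustrative name):

```python
import re

# A handful of the ~40 tier 2 patterns.
PATTERNS = [
    (re.compile(r"county\s+commission", re.I),           ("county", "legislative")),
    (re.compile(r"(city|town|village)\s+council", re.I), ("municipal", "legislative")),
    (re.compile(r"clerk\s+of\s+(the\s+)?(circuit|district|superior)\s+court", re.I),
                                                         ("county", "judicial")),
    (re.compile(r"soil\s+and\s+water", re.I),            ("special_district", "infrastructure")),
]

def classify_tier2(office: str):
    for pattern, cls in PATTERNS:
        if pattern.search(office):
            return cls
    return None  # fall through to tier 3

assert classify_tier2("CLERK OF THE CIRCUIT COURT, 11TH JUDICIAL CIRCUIT") == ("county", "judicial")
assert classify_tier2("CONSERVATION DISTRICT BOARD MEMBER") is None  # brittle, as noted above
```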

Tier 3: Embedding Nearest Neighbor

The remaining ~3,186 unclassified office names are embedded using text-embedding-3-large and compared against a reference set of ~200 pre-classified office names. The nearest neighbor’s classification is assigned if cosine similarity exceeds 0.60.

Example:

Raw office name: Collier Mosquito Control District

Nearest reference: Mosquito Control District → special_district/infrastructure

Cosine similarity: 0.787

0.787 > 0.60 → classified as special_district/infrastructure with confidence 0.787.

Other tier 3 results:

| Unclassified Name | Nearest Reference | Cosine | Classification |
|---|---|---|---|
| Collier Mosquito Control District | Mosquito Control District | 0.787 | special_district/infrastructure |
| Eastern Carrituck Fire & Rescue | Fire Protection District | 0.724 | special_district/infrastructure |
| Lowndes County Bd of Ed | Board of Education | 0.831 | school_district/education |
| Hospital Authority Board | Hospital District | 0.692 | special_district/health |

Coverage: ~378 additional unique names (~4.5%), bringing the cumulative total to ~66.5%.

What falls through: Office names with no close reference analog, names below the 0.60 threshold, and names whose nearest neighbor is misleading (e.g., Community Development District matching Community College District at 0.71 — wrong classification). These proceed to tier 4.

Tier 4: LLM Classification

The final tier sends unclassified office names to Claude Sonnet with a structured prompt that includes the office name, state, and the full taxonomy of (office_level, office_branch) pairs.

Example:

Raw office name: Santa Rosa Island Authority

State: Florida

The LLM prompt provides the taxonomy and asks: “Classify this office into the most appropriate (office_level, office_branch) pair. Explain your reasoning.”

LLM response:

Classification: special_district/infrastructure (confidence: 0.90)

“The Santa Rosa Island Authority is a special-purpose governmental entity in Escambia County, Florida, responsible for managing development and infrastructure on Santa Rosa Island (Pensacola Beach). It is not a general-purpose county or municipal government. ‘Special district’ at the ‘infrastructure’ branch is the best fit.”

Coverage: ~42 additional unique names (~0.5%) in our prototype evaluation, classified with 100% accuracy against manual review (9 of 9 hard cases correct).

Other tier 4 examples:

| Office Name | State | LLM Classification | Confidence |
|---|---|---|---|
| Santa Rosa Island Authority | FL | special_district/infrastructure | 0.90 |
| Cuyahoga County Executive | OH | county/executive | 0.95 |
| Drainage Commissioner | IL | special_district/infrastructure | 0.85 |
| Register of Mesne Conveyances | SC | county/judicial | 0.88 |

The South Carolina example is illustrative: “Register of Mesne Conveyances” is an office that exists in exactly one state. No keyword, regex, or embedding reference can classify it without external knowledge. The LLM knows that mesne conveyances is a legal term related to property transfers and that the Register is a judicial officer.

Tier Summary

| Tier | Method | Unique Names | Cumulative % | Cost per Name | Deterministic |
|---|---|---|---|---|---|
| 1 | Keyword (170 entries) | ~3,775 | 45.0% | $0 | Yes |
| 2 | Regex (~40 patterns) | ~1,426 | 62.0% | $0 | Yes |
| 3 | Embedding NN (200 refs) | ~378 | 66.5% | ~$0.0001 | Yes* |
| 4 | LLM | ~42 | 67.0% | ~$0.001 | No |
| — | Unclassified / other | ~2,766 | 100% | — | — |

* Deterministic given the same embedding model version.

The remaining ~33% classified as other are office names that did not pass through our full pipeline in the prototype. At production scale, tiers 1–4 are projected to handle ~99.5% of names, with ~0.5% remaining as other pending human review.

Why Four Tiers Instead of Just the LLM

Three reasons:

  1. Speed. Keyword and regex classify 62% of names in microseconds. Embedding NN classifies 4.5% more in milliseconds. Sending all 8,387 names to the LLM would take minutes and achieve the same result for the easy cases.

  2. Reproducibility. Tiers 1–3 produce identical output on every run. Tier 4 may produce slightly different reasoning (though classifications are stable in practice). Minimizing non-deterministic surface area makes the pipeline easier to audit.

  3. Debuggability. When a classification is wrong, the classifier_method field tells you which tier produced it. A wrong keyword mapping is a one-line table fix. A wrong regex is a pattern edit. A wrong embedding match means the reference set needs expansion. A wrong LLM classification means the prompt needs refinement. Each failure mode has a distinct fix.


Entity Resolution

Entity resolution — determining that two records refer to the same human being — is the single hardest problem in this project. It is also the most consequential. Eight of the 30 query types identified in What Questions Should Be Answerable depend on it: career tracking, cross-source reconciliation, candidate deduplication, party switch detection, multi-cycle competitiveness analysis, incumbent identification, name standardization, and cross-election turnout comparison.

The problem is cross-cutting. It touches every source, every state, every election, and every office level. Get it wrong and you merge fathers with sons, split one candidate into three, or silently drop a career that spans six election cycles.

The Scale Problem

MEDSL 2022 alone contains approximately 42 million rows. A naive all-pairs comparison would require ~8.8 × 10¹⁴ similarity computations. Even at 1 million comparisons per second, that is 28 years of wall-clock time. Entity resolution at this scale requires a cascade that eliminates the vast majority of comparisons before reaching expensive methods.

The Cascade

Our entity resolution pipeline is a five-step cascade. Each step is cheaper and faster than the next. Each step either resolves the pair (match or no-match) or passes it to the next step.

| Step | Method | Resolves | Cost per pair |
|---|---|---|---|
| 1 | Exact match on (canonical_first, last, suffix) | 70.0% | negligible |
| 2 | Jaro-Winkler similarity ≥ 0.92 | 0.1% | microseconds |
| 2.5 | Name similarity gate: JW on last name < 0.50 → skip | — | microseconds |
| 3 | Embedding cosine similarity ≥ 0.95 → auto-accept | 5.9% | pre-computed |
| 4 | LLM confirmation (cosine 0.35–0.95) | 3.5% | ~$0.0002 |
| 5 | Tiebreaker: stronger model | rare | ~$0.002 |

Pairs that are not resolved by step 5 are escalated to human review.
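
The control flow of the cascade, sketched with the step implementations injected as callables. The names are ours, not the project's API; each step is detailed in the walkthrough chapter below:

```python
from typing import Callable

def resolve_pair(
    a: dict,
    b: dict,
    jw: Callable[[str, str], float],        # Jaro-Winkler similarity
    cos: Callable[[dict, dict], float],     # cosine on pre-computed vectors
    llm: Callable[..., tuple[str, float]],  # returns (decision, confidence)
) -> str:
    """Run one within-block pair through steps 1 -> 5."""
    if (a["canonical_first"], a["last"], a["suffix"]) == \
       (b["canonical_first"], b["last"], b["suffix"]):
        return "match"                                      # step 1
    full = lambda r: f'{r["canonical_first"]} {r["last"]}'
    if jw(full(a), full(b)) >= 0.92:
        return "match"                                      # step 2
    if jw(a["last"], b["last"]) < 0.50:
        return "no_match"                                   # step 2.5 gate
    score = cos(a, b)
    if score >= 0.95 and a["state"] == b["state"]:
        return "match"                                      # step 3
    decision, confidence = llm(a, b, score)                 # step 4
    if confidence < 0.70:
        decision, confidence = llm(a, b, score, model="opus-class")  # step 5
    return decision
```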

Prototype Results

Our prototype processed 200 NC SBE records from Columbus County, NC (2022 general election):

| Metric | Value |
|---|---|
| Input records | 200 |
| Exact matches (step 1) | 597 (70.0%) |
| Jaro-Winkler matches (step 2) | 1 (0.1%) |
| Embedding auto-accepts (step 3) | 50 (5.9%) |
| LLM calls (step 4) | 30 (3.5%) |
| LLM matches confirmed | 0 |
| LLM no-matches confirmed | 30 |
| Unique candidate entities created | 206 |
| Hash chains verified | 200/200 |

All 30 LLM calls were spent on pairs that shared a blocking key (same state, same office level, same last-name initial) but had completely different names — comparisons like “Aaron Bridges” vs “Daniel Blanton” that happened to fall within the same block. Every one was correctly rejected. This finding led to step 2.5: the Jaro-Winkler gate on last names. If the JW score on last names alone is below 0.50, skip the pair entirely. This would have eliminated all 30 wasted LLM calls.

Why Embedding Alone Fails

Embedding similarity is a powerful retrieval signal but an unreliable decision signal. Two real cases demonstrate the failure modes:

False negative — Charlie Crist at 0.451. MEDSL records CRIST, CHARLES JOSEPH. OpenElections records Charlie Crist. The embedding model scores their cosine similarity at 0.451. Any threshold-based system that relies solely on embeddings either rejects this pair (missing a true match) or sets the accept threshold so low that it admits thousands of false positives.

The problem is structural. The embedding model sees different surface forms — different name ordering, different casing, a nickname versus a legal name, and a middle name present in one source but not the other. The model has no reliable mechanism to know that Charlie is a common nickname for Charles.

False positive — Robert Williams Jr at 0.862. Robert Williams and Robert Williams Jr score 0.862. The model treats “Jr” as a minor token appended to an otherwise identical string. But Jr is a generational suffix — these are different people. At our original auto-accept threshold of 0.82, this pair would have been silently merged.

The embedding model is good at detecting surface similarity. It is bad at understanding that a single token (“Jr”) carries categorical meaning, and that a short nickname (“Charlie”) maps to a longer legal name (“Charles”).

Why LLM Alone Fails

An LLM like Claude Sonnet can correctly resolve both cases above. It knows Charlie is a nickname for Charles. It knows Jr indicates a different person. In our tests, it correctly identified all 11 test pairs with appropriate confidence levels.

But LLM-only resolution is infeasible at scale:

  • Speed: At 200ms per API call, resolving 42 million pairwise comparisons would take years. Even with aggressive blocking, the number of candidate pairs runs into millions.
  • Reproducibility: LLM outputs are non-deterministic. Running the same pair twice may produce different confidence scores. This is acceptable for ambiguous cases but wasteful for the 70% of cases that exact match handles perfectly.
  • Cost: While budget is not a constraint, sending millions of obvious matches and obvious non-matches to an LLM is pure waste. The LLM adds value only on the ambiguous cases that simpler methods cannot resolve.

Why the Cascade Works

The cascade combines the strengths of each method:

  1. Exact match handles the common case (70%) — same name, same state, different precincts. No ML, no API calls, no latency, no non-determinism.

  2. Jaro-Winkler catches minor spelling variations (“SHANNON W BRAY” vs “Shannon W. Bray”) that exact match misses due to casing or punctuation. Still deterministic, still free.

  3. The name gate (step 2.5) eliminates pairs that share a blocking key but have obviously different names. This prevents the “wasted 30 LLM calls” scenario from the prototype. Deterministic, zero cost.

  4. Embedding retrieval identifies high-confidence matches (≥ 0.95) where the names differ in format but not in substance. Pre-computed vectors make this effectively free at query time. The 0.95 threshold is deliberately conservative — only near-certain matches pass.

  5. LLM confirmation handles the hard cases: nicknames (Crist at 0.451), suffixes (Williams Jr at 0.862), ambiguous common names. The LLM sees structured name components, vote counts, office, state, and party — enough context to reason about identity. Every prompt, response, and reasoning chain is stored for audit.

  6. Tiebreaker (step 5) escalates low-confidence LLM decisions to a stronger model (Opus-class). This adds cost but catches cases where Sonnet is uncertain.

The cascade balances three properties:

  • Accuracy: The LLM catches what embeddings miss. Embeddings retrieve what exact match misses. Each layer covers the failure modes of the layer above it.
  • Speed: 70% of resolution is free. 6% is pre-computed. Only 3.5% requires API calls. At scale, this is the difference between hours and years.
  • Reproducibility: Steps 1–3 are fully deterministic. Steps 4–5 are non-deterministic but logged — every decision can be replayed from the audit log without re-invoking the LLM.

The 19 Exact Ties

Entity resolution is a prerequisite for detecting exact ties. In MEDSL 2022, we found 19 contests nationally where the top two candidates received exactly the same number of votes. Without entity resolution, precinct-level records cannot be aggregated into contest-level totals — and ties cannot be detected.

Blocking Strategy

Before the cascade runs, records are partitioned into blocks by (state, office_level, last_name_initial). Only pairs within the same block are compared. This reduces the comparison space by approximately four orders of magnitude while preserving all legitimate matches — a candidate for NC school board is never compared to a candidate for FL sheriff.

The blocking key is deliberately coarse. We accept some wasted comparisons within blocks (like “Aaron Bridges” vs “Daniel Blanton” in the same NC school_district block) in exchange for never missing a legitimate match. The step 2.5 gate handles the within-block noise.
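
A minimal sketch of the blocking step; block_key and the field names are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def block_key(record: dict) -> tuple:
    """(state, office_level, last-name initial) — deliberately coarse."""
    return (record["state"], record["office_level"], record["last"][:1].upper())

def candidate_pairs(records):
    """Yield only within-block pairs; cross-block pairs are never compared."""
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    for block in blocks.values():
        yield from combinations(block, 2)

records = [
    {"state": "NC", "office_level": "school_district", "last": "Bridges"},
    {"state": "NC", "office_level": "school_district", "last": "Blanton"},
    {"state": "FL", "office_level": "county", "last": "Brown"},
]
# Bridges/Blanton share a block (NC, school_district, 'B'); Brown does not.
assert len(list(candidate_pairs(records))) == 1
```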

Detailed Walkthroughs

The Cascade: Step-by-Step Walkthrough

The entity resolution cascade processes candidate pairs through five steps of increasing cost and sophistication. Each step either resolves the pair (match or no-match) or passes it to the next step. This chapter walks through a real example at every step.

Step 1: Exact Match on Structured Fields

Key: (canonical_first, last, suffix) within a (state, office_level) block.

Timothy Lance appears in 47 precinct-level rows across Columbus County, NC for the 2022 school board race. Every row has:

```json
{
  "canonical_first": "Timothy",
  "last": "Lance",
  "suffix": null
}
```

All 47 rows match on the exact key. One candidate_entity_id is assigned. No fuzzy logic, no embedding, no API call.

In our prototype of 200 records, exact match resolved 597 candidate instances (70.0%) into 206 unique entities. This is the workhorse of the cascade — cheap, deterministic, and correct whenever sources agree on name components.

Exact match fails when sources disagree on formatting, use nicknames, or omit components. That is what steps 2–5 handle.

Step 2: Jaro-Winkler Similarity (≥ 0.92)

Step 2 catches minor spelling variations that survive L1 parsing: Mcdonough vs McDonough, De Los Santos vs Delossantos, transposition errors in precinct-level data entry.

The threshold is 0.92 on the full (canonical_first + " " + last) string. This is intentionally strict — Jaro-Winkler gives high scores to strings that share a prefix, which makes it prone to false positives on common surnames.

In our prototype, step 2 resolved 1 additional candidate (0.1%). Most formatting differences are already handled by L1 normalization (case folding, punctuation removal), leaving few cases for JW.

Step 2.5: The Name Similarity Gate

Before computing embeddings, check whether the pair’s last names are remotely similar. If the Jaro-Winkler score on last names alone is below 0.50, skip the pair entirely.

Example: Aaron Bridges vs. Daniel Blanton. Both appear in NC school district races. They share the same (state, office_level) block, which is why they were paired in the first place. But:

  • Last-name JW: Bridges vs Blanton → 0.40
  • Gate decision: skip — do not compute embedding, do not call LLM.

This gate exists because of a finding in our prototype. The original cascade had no step 2.5. Of the 30 LLM calls made, all 30 were spent on pairs with completely different names that happened to fall in the same blocking group — “Aaron Bridges” vs “Daniel Blanton” type comparisons. Every one was correctly rejected, but each cost an API round-trip and added latency.

The gate eliminates these obvious non-matches before they reach the embedding step. At scale, with millions of within-block pairs, this saves orders of magnitude in embedding lookups and LLM calls.
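
A sketch of the gate, assuming the jellyfish library for Jaro-Winkler (any implementation works; exact scores may differ slightly from the prototype's):

```python
import jellyfish  # assumed dependency for Jaro-Winkler similarity

GATE = 0.50

def passes_name_gate(last_a: str, last_b: str) -> bool:
    """Step 2.5: skip the pair when last names are not remotely similar."""
    return jellyfish.jaro_winkler_similarity(last_a.lower(), last_b.lower()) >= GATE

# The prototype reported ~0.40 for Bridges vs Blanton: gated out before
# any embedding lookup or LLM call.
print(passes_name_gate("Bridges", "Blanton"))
```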

Step 3: Embedding Retrieval (Cosine ≥ 0.95 → Auto-Accept)

For pairs that pass the gate but did not exact-match, compute cosine similarity between L2 candidate embeddings. If the score is ≥ 0.95 and both candidates are in the same state, auto-accept the match.

Example: Ashley Moody vs. Ashley B. Moody (Florida Attorney General, 2022).

| Field | Source A (OpenElections) | Source B (MEDSL) |
|---|---|---|
| Raw name | Ashley Moody | MOODY, ASHLEY B |
| canonical_first | Ashley | Ashley |
| middle | null | B |
| last | Moody | Moody |
| suffix | null | null |

A note on step 1: the exact-match key itself, (Ashley, Moody, null), does match here, so in production this pair would resolve at step 1. The walkthrough treats it as a step 3 case to illustrate how the cascade handles pairs whose keys diverge, for example when a source folds the middle initial into a name component.

  • Embedding cosine: 0.930
  • Same state: yes (both FL)

At 0.930, this pair falls just below the 0.95 auto-accept threshold, so it enters the LLM zone. However, the JW score on full name is 0.95 — combined with the embedding score and same-state check, the cascade applies the secondary acceptance rule: embedding ≥ 0.90 AND JW ≥ 0.92 AND same state → accept.

In the prototype, step 3 resolved 50 candidates (5.9%) via embedding auto-accept.
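
The embedding-stage routing, sketched. The text above states the secondary rule against full-name JW while the test-case chapter states last-name JW; either passes for Moody, and jw_name below stands in for whichever the production rule uses:

```python
def step3_decision(cos: float, jw_name: float, same_state: bool) -> str:
    """Embedding-stage routing: primary and secondary acceptance rules."""
    if same_state and cos >= 0.95:
        return "accept"                     # primary rule
    if same_state and cos >= 0.90 and jw_name >= 0.92:
        return "accept"                     # secondary rule (Moody case)
    return "llm_zone"                       # ambiguous: go to step 4

assert step3_decision(0.930, 0.95, True) == "accept"    # Ashley Moody
assert step3_decision(0.862, 0.97, True) == "llm_zone"  # Williams Jr: cos < 0.90
```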

Step 4: LLM Confirmation (Cosine 0.35–0.95)

Pairs in the ambiguous zone — embedding score between 0.35 and 0.95 after passing the name gate — are sent to Claude Sonnet with full context.

Example: Charlie Crist vs. CRIST, CHARLES JOSEPH (Florida Governor, 2022).

The LLM prompt includes structured fields, not just raw names:

```text
Candidate A:
  raw: "Charlie Crist"
  canonical_first: "Charles"  (via nickname dictionary: Charlie → Charles)
  last: "Crist"
  suffix: null
  state: FL, office: Governor, votes: 3,101,652

Candidate B:
  raw: "CRIST, CHARLES JOSEPH"
  canonical_first: "CHARLES"
  last: "CRIST"
  suffix: null
  state: FL, office: Governor, votes: 3,101,652

Embedding cosine similarity: 0.451
```

The model responds:

Decision: match (confidence: 0.95)

“Charlie is a common nickname for Charles. Same state, same office, identical vote counts. The MEDSL record includes the middle name JOSEPH which the OpenElections record omits. These are the same person.”

Key elements the LLM uses that the embedding cannot:

  1. Nickname knowledge — Charlie is a nickname for Charles. The embedding model scored this at 0.451; the LLM recognizes the relationship immediately.
  2. Vote count identity — 3,101,652 to 3,101,652 is not a coincidence. Two different candidates in the same race with identical vote totals is astronomically unlikely.
  3. Office and state match — Same governor’s race in the same state in the same election.

In the prototype, step 4 was invoked 30 times (3.5%). All 30 returned no-match — they were obvious non-matches that reached step 4 because the prototype lacked step 2.5. With the gate in place, the Crist-type cases (genuine ambiguity requiring LLM reasoning) are the intended workload for step 4.
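
A sketch of the step 4 call, assuming the anthropic Python client. The model id, the JSON reply contract, and the llm_confirm name are illustrative, not the project's actual prompt or API:

```python
import json
import anthropic  # assumed dependency; any LLM SDK follows the same shape

def llm_confirm(pair: dict, cosine: float) -> dict:
    """Send structured fields (not raw strings) to the model, per step 4."""
    prompt = (
        "Are these two election records the same person? "
        "A suffix like Jr or Sr typically indicates a different person. "
        'Reply as JSON: {"decision": "match"|"no_match", '
        '"confidence": 0.0-1.0, "reasoning": "..."}\n\n'
        f"{json.dumps(pair, indent=2)}\n\nEmbedding cosine: {cosine}"
    )
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model id
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model complies with the JSON contract; the raw response
    # is what gets stored verbatim in the L3 decision log for audit.
    return json.loads(msg.content[0].text)
```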

Step 5: Tiebreaker — Stronger Model

When step 4 returns low confidence (below 0.70), the pair escalates to a stronger model (Opus-class). This handles cases where:

  • The nickname is unusual and Sonnet is uncertain
  • Vote counts differ slightly (rounding, provisional ballots)
  • The candidate appears in adjacent districts and the geographic match is ambiguous

Step 5 was not triggered in our 200-record prototype. It is designed for scale, where the long tail of ambiguous cases grows. Budget is not a constraint — the stronger model costs ~10× more per call but is invoked only for the lowest-confidence subset of an already-small LLM cohort.

The Full Flow

        All candidate pairs within (state, office_level) block
                            │
                    ┌───────┴───────┐
              Step 1: Exact match?  │
              (canonical_first,     │
               last, suffix)        │
                    │               │
               YES (70%)        NO (30%)
                 done               │
                            ┌───────┴───────┐
                      Step 2: JW ≥ 0.92?    │
                            │               │
                       YES (0.1%)       NO (29.9%)
                         done               │
                            ┌───────┴───────┐
                     Step 2.5: Last-name     │
                      JW ≥ 0.50?            │
                            │               │
                        YES (~6%)      NO (~24%)
                            │           skip pair
                    ┌───────┴───────┐
              Step 3: Cosine ≥ 0.95  │
                AND same state?      │
                            │               │
                    YES (5.9%)      NO (ambiguous)
                      done               │
                            ┌───────┴───────┐
                      Step 4: LLM call      │
                      (Claude Sonnet)       │
                            │               │
                  High confidence      Low confidence
                  match/no-match       (< 0.70)
                      done               │
                                   Step 5: Stronger
                                   model (Opus-class)
                                         │
                                       done

Cascade Properties

Speed. Steps 1, 2, and 2.5 are sub-millisecond per pair. Step 3 is a vector lookup (microseconds with FAISS). Step 4 is an API call (~500ms). Step 5 is a slower API call (~2s). The cascade processes 96%+ of pairs in under a millisecond.

Accuracy. Each step is calibrated to avoid false positives. Step 1 is exact. Step 2 is strict (0.92). Step 3 is very strict (0.95 AND same state). Steps 4 and 5 have full context including vote counts, office, and geography — signals no embedding model can use.

Reproducibility. Steps 1–3 are deterministic given the same input and embedding model. Steps 4–5 are non-deterministic but fully logged. Every prompt, response, and reasoning string is stored in the L3 decision log, enabling deterministic replay.

Auditability. A researcher who disagrees with any match can find the decision in the log, read the LLM’s reasoning, examine the embedding score, and override the decision. L4 can be re-run from the amended L3 output without re-running the entire pipeline.

Real Test Cases from Real Data

Every entity resolution decision in this project is grounded in real candidate pairs from real election data. This chapter documents all pairs tested during prototype development, with actual embedding scores, LLM decisions, and the key signal that determined each outcome.

All embeddings use text-embedding-3-large (3,072 dimensions). All LLM decisions use Claude Sonnet. Ground truth was established by manual verification against official certified results.

The Full Test Table

| Name A | Name B | Cosine | LLM Decision | LLM Conf. | Ground Truth | Key Signal |
|---|---|---|---|---|---|---|
| Ron DeSantis | DESANTIS, RON | 0.729 | match | 0.98 | match | Nickname: Ron → Ronald |
| Charlie Crist | CRIST, CHARLES JOSEPH | 0.451 | match | 0.95 | match | Nickname: Charlie → Charles; identical votes |
| Robert Williams | Robert Williams Jr | 0.862 | no match | 0.85 | no match | Suffix: Jr indicates different person |
| Val Demings | VAL DEMINGS | 0.828 | match | 0.96 | match | Format difference only; middle initial absent |
| Marco Rubio | RUBIO, MARCO ANTONIO | 0.743 | match | 0.97 | match | Middle name present in one source only |
| Ashley Moody | MOODY, ASHLEY B | 0.930 | match | 0.98 | match | Middle initial added; same office/state |
| Nicole Fried | FRIED, NIKKI | 0.642 | match | 0.92 | match | Nickname: Nikki → Nicole |
| John Smith | SMITH, JOHN R | 0.672 | no match | 0.78 | no match | Common name; different offices, different counties |
| Robert Johnson | JOHNSON, ROBERT L | 0.644 | no match | 0.75 | no match | Common name; different states |
| Dale Holness | HOLNESS, DALE V.C. | 0.896 | match | 0.94 | match | Middle initials added; title prefix stripped |
| Barbara Sharief | SHARIEF, BARBARA J | 0.955 | match | 0.99 | match | Middle initial added; above auto-accept |
| Aramis Ayala | AYALA, ARAMIS D | 0.896 | match | 0.97 | match | Title prefix “State Attorney” stripped; middle initial |

How to Read This Table

  • Cosine — Cosine similarity between text-embedding-3-large embeddings of the candidate composite strings. Range is 0.0 to 1.0. Higher means more similar.
  • LLM Decision — The match/no-match output from Claude Sonnet when the pair was in the ambiguous zone (0.35–0.95).
  • LLM Conf. — The model’s self-reported confidence in its decision. Range 0.0 to 1.0.
  • Ground Truth — Manually verified against official certified election results. “match” means the two records refer to the same human being. “no match” means they do not.
  • Key Signal — The distinguishing factor that makes this pair interesting for entity resolution testing.

Analysis by Category

Nickname Cases

Three pairs test the nickname problem — where one source uses a familiar name and the other uses the legal name:

| Pair | Cosine | Nickname Mapping |
|---|---|---|
| DeSantis | 0.729 | Ron → Ronald |
| Crist | 0.451 | Charlie → Charles |
| Fried | 0.642 | Nikki → Nicole |

Embedding scores range from 0.451 to 0.729 — all below the 0.95 auto-accept threshold. Without the LLM step, all three would be missed or would require an unsafely low accept threshold.

The Crist case is the most extreme. At 0.451, the embedding model is essentially saying “these look like different people.” The divergence comes from multiple compounding differences: different name ordering (first-last vs last-first), nickname vs legal name, middle name present in only one source, and different casing. The LLM resolves it using nickname knowledge and the identical vote count (3,101,652 in both sources).

After the nickname dictionary is applied at L1, canonical_first matches on all three pairs, and step 1 exact match handles them without any embedding or LLM call. The embedding scores reported here are without dictionary application — they demonstrate why the dictionary matters.

Middle Initial Cases

Five pairs test middle-initial handling — where one source includes a middle name or initial and the other does not:

| Pair | Cosine | Middle in Source A | Middle in Source B |
|---|---|---|---|
| Demings | 0.828 | null | null (format diff) |
| Rubio | 0.743 | null | ANTONIO |
| Moody | 0.930 | null | B |
| Sharief | 0.955 | null | J |
| Ayala | 0.896 | null | D |

Sharief at 0.955 is the only pair above the 0.95 auto-accept threshold. The remaining four fall in the ambiguous zone and require LLM confirmation. The LLM correctly identifies all as matches — the middle initial is additive information, not contradictory information.

Moody at 0.930 is the closest call below auto-accept. The difference between “Ashley Moody” and “MOODY, ASHLEY B” is a single middle initial and formatting. The secondary acceptance rule (embedding ≥ 0.90 AND JW on last name ≥ 0.92 AND same state) handles this case without an LLM call in the production cascade.

Suffix Cases

One pair tests the suffix problem:

| Pair | Cosine | Suffix A | Suffix B |
|---|---|---|---|
| Williams | 0.862 | null | Jr |

At 0.862, this pair would have been auto-accepted under the original threshold of ≥ 0.82. The LLM rejected it with 0.85 confidence, citing the generational distinction implied by “Jr.” This single case drove the threshold change from 0.82 to 0.95.

The asymmetry is the danger: one source includes the suffix, the other drops it. The embedding model sees “Robert Williams” and “Robert Williams Jr” as nearly identical strings, because “Jr” is a minor token. The structured suffix field at L1 is the signal that prevents the false merge.

Common Name Cases

Two pairs test the common-name problem — where two genuinely different people share a common name:

| Pair | Cosine | State A | State B | Office A | Office B |
|---|---|---|---|---|---|
| Smith | 0.672 | FL | FL | County Commission | School Board |
| Johnson | 0.644 | NC | FL | State House | County Clerk |

Both pairs are correctly rejected. The LLM’s confidence is lower (0.75–0.78) than on the match cases because common names are inherently ambiguous — the model cannot be certain these are different people, only that the evidence is insufficient for a match.

The Johnson case crosses state boundaries. The blocking strategy partitions by state, so this pair would never be compared in the production cascade. It is included in the test set to validate the cross-state rejection logic.

The Smith case is within the same state but different offices and counties. The LLM correctly reasons that two people named John Smith in different Florida counties holding different offices are most likely different individuals, despite the name match.

Format Difference Cases

Two pairs test pure formatting differences — same person, same name components, different string representations:

| Pair | Cosine | Format Difference |
|---|---|---|
| Holness | 0.896 | Middle initials V.C. added; “Commissioner” prefix stripped |
| Ayala | 0.896 | Middle initial D added; “State Attorney” prefix stripped |

Both score 0.896 — identical cosine similarity despite different underlying differences. Both are correctly matched. These cases validate that the L1 parser correctly strips title prefixes and that the embedding model handles the remaining differences (middle initials) gracefully.

Score Distribution

The 12 test pairs span the full range of embedding scores relevant to entity resolution:

| Score Range | Count | Matches | Non-Matches |
|---|---|---|---|
| ≥ 0.95 | 1 | 1 | 0 |
| 0.85–0.95 | 4 | 3 | 1 |
| 0.70–0.85 | 3 | 3 | 0 |
| 0.50–0.70 | 3 | 1 | 2 |
| < 0.50 | 1 | 1 | 0 |

The Williams Jr pair at 0.862 is the only false-positive risk — a non-match scoring above 0.85. The Crist pair at 0.451 is the only false-negative risk — a true match scoring below 0.50. These two cases define the boundary conditions of the cascade and drove the threshold calibration described in Threshold Calibration.

LLM Accuracy

Across all 12 test pairs:

| Metric | Value |
|---|---|
| Total pairs tested | 12 |
| LLM correct decisions | 12 |
| LLM accuracy | 100% |
| Average confidence (matches) | 0.962 |
| Average confidence (non-matches) | 0.793 |
| Lowest confidence on a correct match | 0.92 (Fried) |
| Lowest confidence on a correct non-match | 0.75 (Johnson) |

The confidence gap between matches (avg 0.962) and non-matches (avg 0.793) is expected. The LLM is more certain when confirming a match (multiple corroborating signals: same state, same office, similar vote counts, plausible name relationship) than when rejecting one (absence of evidence, not evidence of absence).

What These Tests Do Not Cover

The 12 test pairs are a calibration set, not a validation set. They do not cover:

  • Spanish-language names — Hyphenated surnames, maternal/paternal name ordering
  • Transliterated names — Arabic, Chinese, Vietnamese, and Korean names rendered in English with varying romanization
  • Unisex names — Cases where a shared name belongs to candidates of different genders
  • Candidate who changed names — Marriage, legal name change
  • Intentional name variations — Candidates who use different names in different elections

These gaps are documented as known limitations. The test set will expand as entity resolution is validated at scale.


Threshold Calibration

Embedding similarity thresholds determine which candidate pairs auto-accept, which enter the LLM zone, and which auto-reject. These thresholds are not universal constants — they are calibrated to a specific embedding model (text-embedding-3-large, 3,072 dimensions) using real test data from our prototype.

Two findings from early testing forced a complete recalibration.

The Two Findings

Robert Williams Jr at 0.862 — a false positive. Under the original thresholds, any pair scoring ≥ 0.82 was auto-accepted. “Robert Williams” and “Robert Williams Jr” scored 0.862 — above the threshold. The system would have silently merged father and son into one entity. The suffix “Jr” carries categorical meaning (different person), but the embedding model treats it as a minor token appended to an otherwise identical string.

Charlie Crist at 0.451 — a false negative. Under the original thresholds, any pair scoring < 0.65 was auto-rejected. “Charlie Crist” and “CRIST, CHARLES JOSEPH” scored 0.451 — below the threshold. The system would have discarded a true match. The nickname “Charlie” for “Charles”, combined with different name ordering, different casing, and an extra middle name, pushed the score well below the reject boundary.

Both errors are unacceptable. Merging different people corrupts every downstream analysis. Missing true matches fragments candidate records across sources, breaking cross-source reconciliation and career tracking.

Old vs. New Thresholds

| Zone | Old Range | New Range | Change |
|---|---|---|---|
| Auto-accept | ≥ 0.82 | ≥ 0.95 AND same state | Raised by 0.13, added state constraint |
| Ambiguous (LLM zone) | 0.65–0.82 | 0.35–0.95 AND same state | Widened from 0.17 to 0.60 range |
| Auto-reject | < 0.65 | < 0.35 OR different state | Lowered by 0.30, added state escape |

The ambiguous zone expanded from a 0.17-wide band to a 0.60-wide band. This means far more pairs are routed to the LLM for confirmation.
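
The new thresholds reduce to a small routing function. A sketch (the function name is ours; the values are the calibrated thresholds for text-embedding-3-large from the table above):

```python
def route(cos: float, same_state: bool) -> str:
    """Zone routing under the recalibrated thresholds."""
    if not same_state:
        return "auto_reject"      # different-state escape
    if cos >= 0.95:
        return "auto_accept"
    if cos >= 0.35:
        return "llm_zone"
    return "auto_reject"

assert route(0.862, True) == "llm_zone"     # Williams Jr: no longer auto-accepted
assert route(0.451, True) == "llm_zone"     # Crist: no longer auto-rejected
assert route(0.930, False) == "auto_reject" # cross-state pair, any score
```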

What Each Change Addresses

Auto-accept raised to 0.95

The Williams Jr pair at 0.862 demonstrated that scores in the 0.82–0.95 range can contain suffix-bearing false positives. At 0.95, the only pairs that auto-accept are near-identical strings with trivial formatting differences — “ASHLEY MOODY” vs “Ashley Moody” (0.930 would not auto-accept; it enters the LLM zone where the model confirms the match using full context).

The same-state constraint is an additional guard. A candidate for county sheriff in Maine should never auto-match with a candidate for county sheriff in Florida, regardless of embedding score. Different-state pairs never auto-accept; under the new table they are auto-rejected outright via the different-state escape described below.

Ambiguous zone widened to 0.35–0.95

The Crist pair at 0.451 sat in the old auto-reject zone. The new lower bound of 0.35 captures every nickname case we tested:

| Pair | Cosine | Old Zone | New Zone |
|---|---|---|---|
| DeSantis / DESANTIS, RON | 0.729 | Ambiguous | Ambiguous |
| Crist / CRIST, CHARLES JOSEPH | 0.451 | Reject | Ambiguous |
| Nicole Fried / FRIED, NIKKI | 0.642 | Reject | Ambiguous |
| Williams / Williams Jr | 0.862 | Accept | Ambiguous |
| Val Demings / VAL DEMINGS | 0.828 | Accept | Ambiguous |
| Marco Rubio / RUBIO, MARCO ANTONIO | 0.743 | Ambiguous | Ambiguous |
| Ashley Moody / MOODY, ASHLEY B | 0.930 | Accept | Ambiguous |
| Dale Holness / HOLNESS, DALE V.C. | 0.896 | Accept | Ambiguous |

Under the old thresholds, 3 of 8 pairs were misclassified: 1 false accept (Williams Jr) and 2 false rejects (Crist and Fried). Under the new thresholds, all 8 enter the LLM zone where the model resolves them correctly.

Auto-reject lowered to 0.35

Below 0.35, no tested pair in our prototype was a true match. At this score range, the names share almost no surface similarity — they are genuinely different people who happen to share a blocking key.

The different-state escape allows immediate rejection of cross-state pairs regardless of score. Local officeholders do not appear in multiple states. (Federal candidates can, but they are handled by a separate federal-office pathway that does not use this threshold table.)

The Cost of a Wider Ambiguous Zone

The old ambiguous zone (0.65–0.82) captured roughly 5% of within-block pairs. The new zone (0.35–0.95) captures roughly 25% — a 5× increase in LLM calls.

At prototype scale (200 records), this is negligible. At production scale (42 million rows), the increase matters for throughput but not for budget. Budget is not a constraint. The step 2.5 name gate (JW < 0.50 on last names → skip) eliminates the majority of low-score pairs before they reach the LLM, keeping the actual call volume manageable.

The wider zone is a deliberate trade: more LLM calls in exchange for zero false accepts and zero false rejects in the tested range.

Thresholds Are Model-Specific

These thresholds are calibrated for text-embedding-3-large with 3,072 dimensions. A different model — even an updated version of the same model — will produce different similarity distributions. If the embedding model changes:

  1. Re-run the test cases against the new model.
  2. Plot the score distribution for known matches and known non-matches.
  3. Recalibrate auto-accept, ambiguous, and auto-reject boundaries.
  4. Store the new thresholds alongside the model identifier in L2 metadata.

The embedding_model field in every L2 record ensures that thresholds can always be traced to the model that produced the scores.

Summary

| Principle | Implementation |
|---|---|
| Never auto-accept a suffix mismatch | Threshold raised to 0.95; suffixes always enter LLM zone |
| Never auto-reject a nickname match | Threshold lowered to 0.35; nicknames always enter LLM zone |
| Cross-state pairs never match | Different-state escape in the auto-reject zone |
| Wider zone is acceptable | Budget is not a constraint; accuracy is |
| Thresholds are not portable | Model version stored in every record |


Non-Candidate Records

Not every row in an election results file is a candidate. Sources routinely embed turnout metadata, ballot measure choices, vote quality indicators, and aggregation artifacts alongside candidate results — using the same columns, the same format, and no reliable flag to distinguish them.

If your system treats every row as a candidate, you will create entity records for people named “Registered Voters”, “For”, “BLANK”, and “TOTAL VOTES”. The L4 LLM audit in our prototype caught exactly this: “For” and “Against” were classified as person entities. They are not people.

The Four Categories

1. Turnout Metadata

Rows recording registration and participation counts at the precinct level:

Pseudo-candidateMeaningSource
Registered VotersTotal registered voters in precinctFL OpenElections, NC SBE
Ballots CastTotal ballots submittedFL OpenElections, NC SBE
Cards CastTotal ballot cards (may differ from ballots in multi-card elections)FL OpenElections

Florida OpenElections is the most prolific source. Of the “other” records in our FL 2022 ingest, 6,013 rows are “Registered Voters” — accounting for 67.9% of all non-candidate records in that source. These are not errors in the source data. They are genuine turnout figures published alongside contest results in the same file format.

2. Ballot Measure Choices

Rows representing choices on referenda, bond issues, and constitutional amendments:

| Pseudo-candidate | Meaning | Source |
|---|---|---|
| For | Yes vote on ballot measure | OpenElections, MEDSL |
| Against | No vote on ballot measure | OpenElections, MEDSL |
| Yes | Yes vote on ballot measure | NC SBE, MEDSL |
| No | No vote on ballot measure | NC SBE, MEDSL |

These are legitimate vote counts — but the “candidate” is not a person. Detection requires examining both the candidate name (a single common word) and the contest name (bond, referendum, amendment, proposition). See Ballot Measure Choices.

3. Vote Quality Indicators

Rows recording ballots that did not produce a valid vote for any candidate:

| Pseudo-candidate | Meaning | Source |
|---|---|---|
| Over Votes | Voter selected more candidates than allowed | MEDSL, NC SBE |
| Under Votes | Voter selected fewer candidates than allowed | MEDSL, NC SBE |
| BLANK | No selection made (Maine’s term for undervote) | MEDSL (ME) |
| Write-in | Aggregate write-in count (no specific candidate) | Multiple sources |

Over votes and under votes are important data quality signals. A contest with 15% over votes may indicate a confusing ballot design. But they are not candidates and must not be counted as such.

4. Aggregation Artifacts

Rows that are computational summaries, not individual results:

| Pseudo-candidate | Meaning | Source |
|---|---|---|
| TOTAL VOTES | Sum of all candidates in the contest | MEDSL (UT) |
| Scattering | Aggregate of write-in candidates below reporting threshold | MEDSL (IA, MN) |
| TOTAL | Another sum variant | OpenElections |

These rows are redundant with the candidate-level data. Including them double-counts votes and inflates totals.

The Detection Strategy

Non-candidate records are detected at L1 — the earliest possible point. The principle is extract before filter: non-candidate rows often contain valuable information (registered voter counts, undervote rates) that should be captured in the correct schema object before the row is excluded from contest analysis.

Detection uses a three-part check (a code sketch follows the list):

  1. Exact match on candidate name. A lookup table of ~40 known pseudo-candidate strings: “Registered Voters”, “Ballots Cast”, “Over Votes”, “Under Votes”, “BLANK”, “TOTAL VOTES”, “Scattering”, “For”, “Against”, “Yes”, “No”, etc.

  2. Contest name pattern. For ambiguous names like “For” and “Against”, check whether the contest name contains ballot measure keywords: bond, referendum, amendment, proposition, measure, question, initiative, charter.

  3. Source-specific rules. Some sources use unique pseudo-candidates. Maine uses “BLANK”. Iowa uses “Scattering”. Utah includes “TOTAL VOTES” rows. Each source parser knows its own ghosts.
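
A minimal sketch of parts 1 and 2 of the check; part 3 lives in each source parser. The set contents are excerpts of the ~40-entry table, and is_non_candidate is an illustrative name:

```python
import re

PSEUDO_CANDIDATES = {          # part 1: exact-match lookup (excerpt)
    "registered voters", "ballots cast", "over votes", "under votes",
    "blank", "total votes", "scattering",
}
CHOICE_WORDS = {"for", "against", "yes", "no"}   # ambiguous single words
MEASURE_KEYWORDS = re.compile(                    # part 2: contest-name check
    r"bond|referendum|amendment|proposition|measure|question|initiative|charter",
    re.I,
)

def is_non_candidate(candidate: str, contest: str) -> bool:
    name = candidate.strip().lower()
    if name in PSEUDO_CANDIDATES:
        return True
    if name in CHOICE_WORDS and MEASURE_KEYWORDS.search(contest):
        return True
    return False  # part 3, source-specific rules, handled per parser

assert is_non_candidate("Registered Voters", "REGISTERED VOTERS - TOTAL")
assert is_non_candidate("For", "CONSTITUTIONAL AMENDMENT 1")
assert not is_non_candidate("For", "FORSYTH COUNTY SHERIFF")  # no measure keyword
```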

Routing

Detected non-candidate records are routed to the appropriate schema object:

| Category | Route to | Schema type |
|---|---|---|
| Turnout metadata | TurnoutMetadata | Attached to sibling precinct records |
| Ballot measure choices | BallotMeasure | MeasureChoice with For/Against/Yes/No |
| Vote quality indicators | VoteQuality | Attached to parent contest record |
| Aggregation artifacts | Discarded | Redundant with candidate-level sums |

Records routed to TurnoutMetadata and VoteQuality are preserved in the L1 output — they are valuable data, just not candidate data. Aggregation artifacts are discarded with a note in the cleaning report.

What Happens Without Detection

If non-candidate rows pass through to L2 and L3:

  • “Registered Voters” gets an embedding vector, a candidate entity ID, and appears in 6,013 precinct-level records as the most prolific “candidate” in Florida.
  • “For” and “Against” become person entities. The L4 LLM audit flagged exactly this in our prototype: “‘For’ is not a plausible person name.”
  • “TOTAL VOTES” inflates vote counts when aggregated, because the total row is summed alongside the individual candidate rows.
  • “Over Votes” appears as a candidate who received votes in every contest — the busiest politician in America.

Detection at L1 prevents all of these downstream errors.


Registered Voters, Ballots Cast, Over/Under Votes

Some election data files embed turnout metadata and vote-quality indicators directly alongside candidate results. A row labeled “Registered Voters” is not a contest — it is a count of eligible voters in that precinct. A row labeled “Over Votes” is not a candidate — it is a count of ballots where the voter marked too many choices.

These rows are valuable. They are also poison if treated as candidates.

The Four Categories

| Label | What it means | Found in |
|---|---|---|
| Registered Voters | Eligible voters in the precinct | NC SBE, FL OpenElections |
| Ballots Cast | Ballots submitted (any contest) | NC SBE, some MEDSL records |
| Over Votes | Ballots with too many selections for a contest | NC SBE, ME, UT |
| Under Votes | Contests where the voter made no selection | NC SBE, ME, UT |

NC SBE includes all four in every precinct file. MEDSL includes over/under votes for some states but not others. OpenElections varies by state and contributor. There is no standard.

The Extract-Before-Filter Principle

The instinct is to filter these rows out immediately — they are not candidates, so drop them. This is wrong. The registered voter count is the denominator for turnout computation. Dropping it before extraction destroys the only turnout signal available at the precinct level.

The correct sequence:

  1. Detect the row by candidate name pattern (Registered Voters, BALLOTS CAST, OVER VOTES, UNDER VOTES, BLANK).
  2. Extract the value into the appropriate field on sibling contest records in the same precinct.
  3. Route the row to TurnoutMetadata contest kind — not CandidateRace.
  4. Exclude the row from candidate-level analysis (margins, competitiveness, entity resolution).

Step 2 is the key. The registered voter count attaches to every contest in the same precinct as a turnout.registered_voters field. The ballots cast count becomes turnout.ballots_cast. Only after extraction is the metadata row itself reclassified.
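
A sketch of the sibling-attachment step, assuming rows is one precinct's parsed records and the field names are illustrative:

```python
def apply_turnout(rows: list[dict]) -> None:
    """Attach the precinct's registered-voter count to every sibling record,
    in place, before the metadata row itself is reclassified."""
    registered = next(
        (r["votes_total"] for r in rows
         if r["candidate"].strip().lower() == "registered voters"),
        None,
    )
    for r in rows:
        r.setdefault("turnout", {})["registered_voters"] = registered

precinct = [
    {"candidate": "Registered Voters", "votes_total": 4217},
    {"candidate": "Timothy Lance", "votes_total": 1892},
]
apply_turnout(precinct)
assert precinct[1]["turnout"]["registered_voters"] == 4217
```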

NC SBE Row Format

In the NC SBE precinct results file (results_pct_20221108.txt), a registered voter row looks like:

| Column | Value |
|---|---|
| Contest Name | REGISTERED VOTERS - TOTAL |
| Choice | (empty) |
| Choice Party | (empty) |
| Total Votes | 4,217 |
| Election Day | 4,217 |
| One Stop | 0 |
| Absentee by Mail | 0 |
| Provisional | 0 |

The “Total Votes” column contains the registered voter count, not a vote total. The vote-type breakdown is meaningless (registered voters do not have an election-day vs. early split). L1 extracts 4,217 into turnout.registered_voters for precinct P17 in Columbus County, then classifies this row as TurnoutMetadata.

The corresponding L1 output:

```json
{
  "contest": {
    "kind": "turnout_metadata",
    "raw_name": "REGISTERED VOTERS - TOTAL"
  },
  "results": [{
    "candidate_name": { "raw": "Registered Voters" },
    "votes_total": 4217
  }],
  "turnout": {
    "registered_voters": 4217
  }
}
```

Sibling contest records in the same precinct (e.g., the school board race) receive:

```json
{
  "turnout": {
    "registered_voters": 4217,
    "ballots_cast": null
  }
}
```

Scale of the Problem

In the Florida OpenElections dataset, 6,013 rows are labeled “Registered Voters” — representing 67.9% of all non-candidate records in that file. Without detection, these rows enter the candidate pipeline as if “Registered Voters” were a person running for office. The L4 LLM audit flagged exactly this pattern in our prototype.

Over Votes and Under Votes are less numerous but equally disruptive. Maine labels its under votes as BLANK. Utah includes TOTAL VOTES aggregation rows. Each source has its own vocabulary for the same concept.

Detection Rules

L1 applies pattern matching on the candidate name field before any other processing:

| Pattern | Classification | Action |
|---|---|---|
| `registered voters` | TurnoutMetadata | Extract to turnout.registered_voters |
| `ballots cast` | TurnoutMetadata | Extract to turnout.ballots_cast |
| `over ?votes?` | TurnoutMetadata | Extract to turnout.over_votes |
| `under ?votes?` | TurnoutMetadata | Extract to turnout.under_votes |
| `^blank$` | TurnoutMetadata | Extract to turnout.under_votes (ME) |
| `total votes` | TurnoutMetadata | Discard (aggregation artifact) |
| `scattering` | TurnoutMetadata | Extract to turnout.write_in_scattering (IA) |

These patterns are checked case-insensitively. They run as the first operation in the L1 pipeline — before name decomposition, before office classification, before FIPS enrichment. A row that matches is routed immediately and never enters the candidate pipeline.
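As a sketch, the table above translates directly into an ordered list of case-insensitive patterns checked before any other L1 operation (the structure is assumed for illustration, not the pipeline's actual code):

import re

# (pattern, turnout field) pairs mirroring the table; None means discard.
DETECTION_RULES = [
    (re.compile(r"registered voters", re.I), "turnout.registered_voters"),
    (re.compile(r"ballots cast", re.I),      "turnout.ballots_cast"),
    (re.compile(r"over ?votes?", re.I),      "turnout.over_votes"),
    (re.compile(r"under ?votes?", re.I),     "turnout.under_votes"),
    (re.compile(r"^blank$", re.I),           "turnout.under_votes"),          # ME
    (re.compile(r"total votes", re.I),       None),                           # discard
    (re.compile(r"scattering", re.I),        "turnout.write_in_scattering"),  # IA
]

def route_candidate_name(name: str):
    """Return ('turnout_metadata', field) on a match, else ('candidate_race', None)."""
    for pattern, field in DETECTION_RULES:
        if pattern.search(name.strip()):
            return ("turnout_metadata", field)
    return ("candidate_race", None)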

Ballot Measure Choices: For/Against/Yes/No

When a row in an election results file has “For” as the candidate name, it could mean two things: a person whose legal name is “For” (implausible), or a choice on a ballot measure (almost certain). The distinction cannot be made from the candidate name alone — it requires examining the contest name alongside it.

The Problem

Ballot measures appear in election data using the same schema as candidate races. The “candidate” column holds “For”, “Against”, “Yes”, or “No”. The “contest” column holds something like “BOND REFERENDUM - SCHOOL CONSTRUCTION” or “CONSTITUTIONAL AMENDMENT 3”. Nothing in the file format distinguishes a ballot measure from a candidate race.

Real examples from MEDSL 2022:

| Contest Name | Candidate Name | Votes | What It Actually Is |
|---|---|---|---|
| CONSTITUTIONAL AMENDMENT 1 | For | 1,847,312 | Ballot measure choice |
| BOND REFERENDUM COLUMBUS COUNTY SCHOOLS | Against | 4,219 | Ballot measure choice |
| COUNTY SALES TAX REFERENDUM | Yes | 31,408 | Ballot measure choice |
| CHARTER AMENDMENT - TERM LIMITS | No | 12,773 | Ballot measure choice |

If these rows enter the candidate pipeline, “For” becomes a person entity. “For” then appears in entity resolution, gets a candidate_entity_id, and shows up in the L4 canonical export as the most prolific politician in America — winning thousands of races across every state and every office level.

The L4 Audit Discovery

In our prototype, the L4 LLM entity audit examined 50 entities for plausibility. Among the 4 errors it identified:

“‘For’ is not a plausible person name. This entity appears across 347 contests in 12 states, always in contest names containing ‘amendment’, ‘bond’, ‘referendum’, or ‘proposition’. These are ballot measure choices, not candidates.”

The audit correctly identified the contamination. But detecting it at L4 is too late — the bad entity has already propagated through L2 embeddings and L3 matching. The fix is detection at L1.

Detection Logic

A candidate name of “For”, “Against”, “Yes”, or “No” is ambiguous in isolation. These are common English words, and while no real candidate in our dataset is named “For”, names like “Yes” are theoretically possible. The detection requires both signals:

Signal 1: Candidate name pattern. The candidate name is one of a small set of ballot measure choice words:

| Candidate Name | Ballot Measure Choice? |
|---|---|
| For | Yes |
| Against | Yes |
| Yes | Yes |
| No | Yes |
| Bonds Yes | Yes |
| Bonds No | Yes |
| For the Tax Levy | Yes |
| Against the Tax Levy | Yes |

Signal 2: Contest name pattern. The contest name contains one or more ballot measure keywords:

| Keyword | Example Contest Name |
|---|---|
| amendment | CONSTITUTIONAL AMENDMENT 1 |
| bond | BOND REFERENDUM COLUMBUS COUNTY SCHOOLS |
| referendum | COUNTY SALES TAX REFERENDUM |
| proposition | PROPOSITION 30 - TAX ON INCOME |
| measure | MEASURE A - PARCEL TAX |
| initiative | INITIATIVE 82 - TIPPED WAGES |
| question | BALLOT QUESTION 4 |
| charter | CHARTER AMENDMENT - TERM LIMITS |
| levy | RENEWAL 2.0 MILL LEVY - FIRE |
| issue | ISSUE 1 - REPRODUCTIVE RIGHTS |

Both signals must be present. A candidate named “For” in a contest called “COUNTY COMMISSIONER” would not trigger ballot measure detection — it would be flagged as a data quality anomaly for manual review. A candidate named “John Smith” in a contest called “BOND REFERENDUM” is not a ballot measure choice — the candidate name does not match the pattern.
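A sketch of the two-signal check, with the keyword lists abridged to the tables above (names illustrative):

import re

CHOICE_WORDS = {"for", "against", "yes", "no", "bonds yes", "bonds no"}
CHOICE_PREFIXES = ("for the ", "against the ")   # e.g. "For the Tax Levy"
MEASURE_KEYWORDS = re.compile(
    r"amendment|bond|referendum|proposition|measure|"
    r"initiative|question|charter|levy|issue", re.I)

def is_ballot_measure(candidate_name: str, contest_name: str) -> bool:
    """Both signals must fire: a choice-word name AND a measure-keyword contest."""
    name = candidate_name.strip().lower()
    name_signal = name in CHOICE_WORDS or name.startswith(CHOICE_PREFIXES)
    return name_signal and bool(MEASURE_KEYWORDS.search(contest_name))

is_ballot_measure("Against", "BOND REFERENDUM COLUMBUS COUNTY SCHOOLS")  # True
is_ballot_measure("For", "COUNTY COMMISSIONER")     # False: flag for manual review
is_ballot_measure("John Smith", "BOND REFERENDUM")  # False: name signal absent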

Routing

When both signals match, L1 routes the record to BallotMeasure contest kind with a MeasureChoice result type instead of CandidateResult:

{
  "contest": {
    "kind": "ballot_measure",
    "raw_name": "BOND REFERENDUM COLUMBUS COUNTY SCHOOLS",
    "office_level": "school_district",
    "measure_type": "bond"
  },
  "results": [
    {
      "measure_choice": "against",
      "votes_total": 4219,
      "vote_counts_by_type": {
        "election_day": 2107,
        "early": 1891,
        "absentee_mail": 198,
        "provisional": 23
      }
    }
  ]
}

The measure_choice field replaces candidate_name. No name decomposition is performed (there is no first, middle, last, or suffix for “Against”). No entity resolution is needed — “For” in one contest is not the same entity as “For” in another contest. No embedding is generated.

Edge Cases

“For the Tax Levy” vs “For.” Some sources use complete phrases like “For the Tax Levy” rather than bare “For”. The pattern match checks for the prefix, not exact equality.

Mixed contests. A small number of records have both candidate names and ballot measure choices in the same contest. This occurs when a source reports write-in votes alongside measure choices. The L1 parser handles each row independently — “For” is routed to BallotMeasure, while “Write-in” in the same contest is routed to TurnoutMetadata.

Retention elections. Judicial retention elections ask “Shall Judge X be retained?” with choices “Yes” and “No.” These are structurally ballot measures but semantically candidate races — the “candidate” is the judge. L1 classifies these as BallotMeasure with an additional retention_candidate field preserving the judge’s name from the contest string. This is an area where the boundary between candidate races and ballot measures is genuinely blurred.

Scale

Ballot measure records account for approximately 3–5% of total rows in MEDSL 2022, varying by state. States with frequent ballot initiatives (California, Oregon, Colorado) have higher proportions. Failing to detect them does not just create bad entities — it inflates the count of “candidates” and distorts competitiveness metrics. A bond referendum with 51% “For” and 49% “Against” is not an uncontested race with one candidate named “For.”

Contest Disambiguation

Three distinct problems hide under one label: the same office name can mean different races, the same race can have different names, and some races elect multiple winners. Each breaks a different assumption in the pipeline.

Problem 1: Same Office Name, Different Races

Harris County, Texas elects 25 district court judges. Every one of them appears in the data as DISTRICT COURT JUDGE. Without the district column, all 25 races collapse into a single contest — 25 winners, 50+ candidates, and no way to compute margins or determine who ran against whom.

The distinguishing field varies by source:

| Source | Office name | Distinguishing field | Example value |
|---|---|---|---|
| MEDSL | DISTRICT COURT JUDGE | district column | 127TH |
| NC SBE | DISTRICT COURT JUDGE DISTRICT 13B SEAT 02 | Embedded in contest name | 13B SEAT 02 |
| OpenElections | District Court Judge | Separate district column | 127 |

MEDSL separates the seat identifier into a dedicated column. NC SBE concatenates it into the contest name string. OpenElections does both, inconsistently, depending on the state contributor.

The L1 parser must extract the seat identifier regardless of where it appears. The contest entity key is (state, county, office_name, district, seat) — not just (state, county, office_name). Omitting district or seat merges distinct races.
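Expressed as a data structure, the key looks like this (a sketch; the field order is assumed):

from typing import NamedTuple, Optional

class ContestKey(NamedTuple):
    """Entity key for a contest. Dropping district or seat merges distinct races."""
    state: str
    county: str
    office_name: str
    district: Optional[str]
    seat: Optional[str]

# Two of Harris County's 25 district court races remain distinct:
ContestKey("TX", "HARRIS", "DISTRICT COURT JUDGE", "11TH", None)
ContestKey("TX", "HARRIS", "DISTRICT COURT JUDGE", "55TH", None)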

Real examples from MEDSL 2022:

| State | Office name | Distinct seats | What disambiguates |
|---|---|---|---|
| TX | DISTRICT COURT JUDGE | 25 | district column: 11TH, 55TH, 80TH, … |
| NC | DISTRICT COURT JUDGE | 48 | Contest name suffix: DISTRICT 13B SEAT 02 |
| OH | COURT OF COMMON PLEAS | 14 | district column: GENERAL DIVISION, DOMESTIC |
| FL | COUNTY COURT JUDGE | 6–12 per county | district column: GROUP 1, GROUP 2, … |

Florida’s GROUP numbering is particularly treacherous. “COUNTY COURT JUDGE GROUP 3” in Miami-Dade is a different contest from “COUNTY COURT JUDGE GROUP 3” in Broward. The county is part of the disambiguation key.

Problem 2: Same Race, Different Names Across Years

NC SBE data from 2014 labels a state house seat as NC HOUSE OF REPRESENTATIVES DISTRICT 03. In 2018, redistricting renamed it to NC HOUSE OF REPRESENTATIVES DISTRICT 3. In 2022, the same source uses DISTRICT THREE in some contest types.

All three strings refer to the same legislative seat. But to a string-matching system, they are three different contests. Tracking a candidate’s career across elections requires knowing that DISTRICT 03, DISTRICT 3, and DISTRICT THREE are the same district.

Common variations found in NC SBE data:

| Variant A | Variant B | Variant C | Same contest? |
|---|---|---|---|
| DISTRICT 03 | DISTRICT 3 | DISTRICT THREE | Yes |
| BOARD OF EDUCATION | BD OF ED | BOE | Yes |
| COUNTY COMMISSIONERS | COUNTY COMMISSION | BOARD OF COMMISSIONERS | Yes |

This is contest entity resolution — the same problem as candidate entity resolution, applied to office names instead of person names. The cascade applies:

  1. Normalize numbers: Strip leading zeros, convert written numbers to digits. DISTRICT 03 → DISTRICT 3, DISTRICT THREE → DISTRICT 3.
  2. Abbreviation expansion: BD OF ED → BOARD OF EDUCATION, COMM → COMMISSION.
  3. Embedding similarity: For remaining ambiguous pairs, compute cosine similarity on contest composite strings and apply the same threshold logic as candidate matching.

Contest entity resolution runs at L3 alongside candidate entity resolution. Each contest receives a contest_entity_id that persists across election cycles.
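A sketch of the deterministic steps (1 and 2), with abridged lookup tables; the real dictionaries are larger:

import re

WRITTEN_NUMBERS = {"ONE": "1", "TWO": "2", "THREE": "3"}          # abridged
ABBREVIATIONS = {"BD OF ED": "BOARD OF EDUCATION", "COMM": "COMMISSION"}

def normalize_contest_name(name: str) -> str:
    name = name.upper().strip()
    for abbr, full in ABBREVIATIONS.items():                      # step 2
        name = re.sub(rf"\b{re.escape(abbr)}\b", full, name)
    for word, digit in WRITTEN_NUMBERS.items():                   # step 1
        name = re.sub(rf"\b{word}\b", digit, name)
    return re.sub(r"\b0+(\d)", r"\1", name)                       # strip leading zeros

# Both variants collapse to the same string:
assert normalize_contest_name("NC HOUSE OF REPRESENTATIVES DISTRICT 03") == \
       normalize_contest_name("NC HOUSE OF REPRESENTATIVES DISTRICT THREE")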

Problem 3: Multi-Seat Contests

A “vote for 3” school board race elects the top three candidates. The standard margin computation — difference between first and second place — does not apply. The meaningful margin is between the last winner (3rd place) and the first loser (4th place).

The vote_for field (called magnitude in some sources) records how many seats are being filled. MEDSL provides this field for most contests. NC SBE does not — it must be inferred from ballot instructions embedded in the contest name or from the number of candidates who received non-trivial vote shares.

Real example from Dawson County, Georgia (2022):

| Contest | vote_for | Candidates | Votes |
|---|---|---|---|
| Board of Education | 3 | 6 | 25,186 / 25,186 / 24,901 / 24,844 / 23,112 / 22,987 |

The effective margin is between 3rd place (24,901) and 4th place (24,844) — a gap of 57 votes. Reporting the margin as the gap between 1st and 2nd (0 votes — an exact tie) is misleading: the tie is between the top two winners, not between a winner and a loser.

Worse, the exact tie at the top (25,186 each) may trigger recount rules in some jurisdictions. Whether a recount applies depends on whether the tied candidates are competing for the same seat or are both safely elected. The vote_for field is the only way to know.
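A sketch of the margin computation once vote_for is known:

def effective_margin(vote_totals: list[int], vote_for: int):
    """Margin between the last winner and the first loser."""
    ranked = sorted(vote_totals, reverse=True)
    if len(ranked) <= vote_for:
        return None                      # fewer candidates than seats: uncontested
    return ranked[vote_for - 1] - ranked[vote_for]

# Dawson County, GA 2022 Board of Education (vote_for = 3):
votes = [25186, 25186, 24901, 24844, 23112, 22987]
assert effective_margin(votes, 3) == 57  # 3rd place vs. 4th place
assert effective_margin(votes, 1) == 0   # the misleading 1st-vs-2nd "margin"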

Why vote_for matters for competitiveness analysis

Without vote_for, every multi-seat contest looks either wildly competitive (if you compare 1st to 2nd among co-winners) or wildly uncompetitive (if you compare any winner to any loser in a field of 6). The correct margin — last winner vs. first loser — requires knowing the cutoff.

| Analysis | Without vote_for | With vote_for |
|---|---|---|
| Is the race competitive? | Unclear — 0-vote “margin” is misleading | Margin of 57 votes at the cutoff |
| Is it uncontested? | 6 candidates — looks contested | Only if ≤ 3 candidates filed |
| Who won? | Top 1? Top 2? Unknown | Top 3 |

Detection when the field is missing

When vote_for is absent (NC SBE, some OpenElections files), L1 applies heuristics:

  1. Contest name pattern: “VOTE FOR 3”, “ELECT 2”, “(3 SEATS)” embedded in the contest name string.
  2. Candidate count: If 6+ candidates appear in a school board or city council race, flag for multi-seat review.
  3. Vote distribution: If the top N candidates have similar vote totals and a clear drop-off to N+1, infer N seats.

These heuristics are imperfect. The vote_for field, when present, overrides all heuristics. When absent, the inferred value is stored with a confidence flag, and the L4 verification audit reviews flagged contests.

How All Three Interact

A single contest can exhibit all three problems simultaneously. Consider a Texas county with five JP (Justice of the Peace) precincts, each electing one JP, across three election cycles where the contest name changed from “J.P. PCT 3” to “JUSTICE OF THE PEACE PRECINCT 3” to “JP PRECINCT THREE”:

  • Problem 1: Five precincts, five separate contests, all labeled variants of “Justice of the Peace”.
  • Problem 2: Three different name formats across 2018, 2020, 2022 for each precinct.
  • Problem 3: Each is single-seat, but a neighboring school board race on the same ballot elects three members.

The contest entity key (state, county, office_name_normalized, district_normalized, seat) disambiguates problem 1. Contest entity resolution across years handles problem 2. The vote_for field handles problem 3. All three solutions must work together for the contest record to be correct.

Cross-Source Reconciliation

When two independent sources cover the same election, their overlap becomes a validation set. If MEDSL and NC SBE both report results for the same contest in the same county, the vote totals should match. When they do, both sources are credible. When they don’t, at least one has an error — and the disagreement reveals data quality issues that no single-source analysis can detect.

North Carolina 2022 is our primary validation case. Both MEDSL and the NC State Board of Elections publish precinct-level results for all NC contests in the 2022 general election.

The Overlap

We identified 640 contests present in both MEDSL and NC SBE for the 2022 general election. These span federal, state, county, municipal, judicial, and school board races across all 100 NC counties.

For each contest, we aggregated precinct-level results to the contest level and compared total votes per candidate.

| Agreement Level | Contests | Percentage |
|---|---|---|
| Exact vote total match | 579 | 90.5% |
| Within 1% of each other | 47 | 7.3% |
| Disagree by more than 1% | 14 | 2.2% |
| Total | 640 | 100% |

90.5% exact match across 640 contests, derived from two completely independent data pipelines (MIT’s academic processing vs. NC’s official state board reporting), is strong evidence that both sources are faithfully representing the same underlying certified results.
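A sketch of the comparison for one contest, assuming candidate records have already been linked across sources via entity IDs (field names illustrative):

from collections import defaultdict

def contest_totals(rows):
    """Sum precinct-level rows to candidate-level totals."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["candidate_entity_id"]] += r["votes_total"]
    return dict(totals)

def agreement_level(medsl_rows, ncsbe_rows) -> str:
    a, b = contest_totals(medsl_rows), contest_totals(ncsbe_rows)
    if a == b:
        return "exact"
    worst = max(abs(a.get(k, 0) - b.get(k, 0)) / max(a.get(k, 0), b.get(k, 0), 1)
                for k in set(a) | set(b))
    return "within_1pct" if worst <= 0.01 else "disagree"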

The 7.3% — Small Disagreements

The 47 contests with near-matches (within 1%) trace to identifiable causes:

| Cause | Contests | Notes |
|---|---|---|
| Provisional ballot inclusion timing | 22 | MEDSL snapshot taken before final canvass; NC SBE includes provisionals |
| Precinct boundary rounding | 11 | Split precincts assigned differently by each source |
| Write-in aggregation | 9 | NC SBE reports individual write-ins; MEDSL aggregates to “Write-in” |
| Unknown | 5 | Under investigation |

These are not errors — they are legitimate differences in how two organizations process the same raw certified results. Provisional ballot timing is the most common cause: MEDSL’s data may reflect an earlier snapshot of the canvass than NC SBE’s final certified totals.

The 2.2% — Real Disagreements

The 14 contests with >1% disagreement require individual investigation. Common causes include:

  • Misassigned precincts. A precinct’s results attributed to the wrong contest or district in one source.
  • Partial data. One source missing results from a subset of precincts, typically in multi-county contests where one county’s data arrived late.
  • Candidate name mismatch causing split. The same candidate’s votes split across two entity IDs in one source because a name variant was not resolved — e.g., “JOHN SMITH” in early voting vs. “John R. Smith” in election-day results treated as different candidates.

These 14 cases are flagged by the L4 cross-source reconciliation algorithm and reported in the verification output. They are not silently ignored.

Name Formatting Differences

Vote totals may agree, but candidate names almost never do. Of the 640 overlapping contests, 401 (62.7%) have at least one candidate whose name is formatted differently between MEDSL and NC SBE.

| Formatting Difference | Example (MEDSL) | Example (NC SBE) | Frequency |
|---|---|---|---|
| ALL CAPS vs Title Case | TIMOTHY LANCE | Timothy Lance | 389 |
| Last-first vs first-last | LANCE, TIMOTHY | Timothy Lance | 247 |
| Middle initial present/absent | SHANNON W BRAY | Shannon W. Bray | 118 |
| Period after middle initial | SHANNON W BRAY | Shannon W. Bray | 94 |
| Nickname in quotes vs parens | CHARLES "CHARLIE" CRIST | Charles (Charlie) Crist | 12 |
| Suffix formatting | ROBERT WILLIAMS JR | Robert Williams, Jr. | 31 |
| Prefix/title included | HON. JANE DOE | Jane Doe | 8 |

A single candidate can exhibit multiple formatting differences simultaneously. “BRAY, SHANNON W” (MEDSL) vs “Shannon W. Bray” (NC SBE) combines casing, ordering, and punctuation differences in one pair.

This is why entity resolution exists. The vote totals confirm these are the same contests with the same candidates. The name formatting confirms that string equality is insufficient — structured decomposition, embedding, and in some cases LLM confirmation are required to link records across sources.

This Overlap as a Validation Set

The 640-contest NC overlap serves three purposes in the pipeline:

1. Entity Resolution Validation

For every candidate pair that the L3 cascade matches across MEDSL and NC SBE, we can verify the match by comparing vote totals. If the cascade says “TIMOTHY LANCE” (MEDSL) and “Timothy Lance” (NC SBE) are the same person, and their vote totals match exactly, the match is confirmed by an independent signal. If the cascade says they match but the vote totals disagree by 50%, the match is suspect.

2. Office Classification Validation

Both sources cover the same contests but use different office name strings. MEDSL might report “NC HOUSE OF REPRESENTATIVES DISTRICT 047” while NC SBE reports “NC HOUSE OF REPRESENTATIVES - DISTRICT 47”. If both classify to state/legislative, the classifier is consistent. If one classifies to state/legislative and the other to county/legislative, we have a bug.

3. Parser Validation

When two independent parsers (the MEDSL parser and the NC SBE parser) produce the same vote counts for the same contest, both parsers are likely correct. When they disagree, the disagreement localizes the bug to one parser or the other — far easier to debug than a single-source pipeline where errors are invisible.

Beyond NC

The NC overlap is our deepest validation case because NC SBE publishes granular, machine-readable precinct data going back to 2006. Other states offer less overlap:

| State | MEDSL 2022 | Secondary Source | Overlap Quality |
|---|---|---|---|
| NC | Yes | NC SBE (precinct-level, 2006–2024) | High |
| FL | Yes | OpenElections (county-level, select years) | Medium |
| OH | Yes | OpenElections (precinct-level, 2022) | Medium |
| GA | Yes | Clarity/Scytl (election night, unstable URLs) | Low |
| All others | Yes | MEDSL only | None |

As additional state-level sources are integrated, each creates a new validation pair. The architecture is designed to scale: the L4 cross-source reconciliation algorithm runs for any pair of sources that cover the same (state, year, contest) combination. No code changes are required — only new L0 data and a new L1 parser.

The Lesson

Cross-source reconciliation is not a feature — it is the only reliable way to detect errors in election data. A single source can be internally consistent and still wrong. Two independent sources that agree are almost certainly right. Two independent sources that disagree tell you exactly where to look.

The 90.5% exact match rate across 640 NC contests is our current evidence floor. Every additional source and state that achieves similar agreement raises confidence in the pipeline. Every disagreement is a bug report — either in our pipeline or in the source data.

Design Principles

Five principles govern every architectural decision in this project. They are listed in priority order — when two principles conflict, the higher-ranked principle wins.

1. Deterministic First

If a deterministic method produces correct results, use it. Do not add machine learning, embeddings, or LLM calls where string matching, regex, or lookup tables suffice. L0 through L1 contain zero ML — name decomposition, FIPS enrichment, keyword-based office classification, and hash computation are all deterministic operations that produce identical output from identical input on every run. Deterministic methods are not preferred because they are cheaper (budget is not a constraint). They are preferred because they are reproducible, auditable, and incapable of hallucination. When a journalist asks “why did your system say these two candidates are the same person?”, the answer should be “because their canonical first names, last names, and suffixes are identical” — not “because a language model said so.” Determinism is the default. Non-determinism requires justification.

2. Preserve Signal

Every piece of information in the source data is potential disambiguation signal. Middle initials distinguish David S. Marshall (Maine) from David A. Marshall (Florida) — dropping them collapses two people into one. Suffixes distinguish Robert Williams from Robert Williams Jr. — stripping them merges father and son. Nicknames reveal that Charlie Crist and Charles Joseph Crist are the same person — normalizing too early destroys that connection. The rule at L1 is: decompose names into structured components (raw, first, middle, last, suffix, canonical_first) and preserve every component. Do not discard middle initials. Do not strip suffixes. Do not overwrite the raw name with a canonical form. Clean without collapsing. Downstream layers (L2 embedding, L3 matching, L4 canonicalization) consume these components selectively. The raw material must survive intact through L1 for those layers to function.

3. LLMs for Confirmation, Not Discovery

Embeddings retrieve candidates. LLMs confirm matches. The embedding model (text-embedding-3-large, 3,072 dimensions) identifies pairs that might be the same entity — Charlie Crist at cosine 0.451, Robert Williams Jr at 0.862. The LLM (Claude Sonnet) then examines the full context — structured name components, vote counts, office, state, party — and renders a judgment: match or no-match, with confidence and reasoning. The LLM is never the first line of analysis. It does not scan raw files, parse CSV columns, compute FIPS codes, or generate embeddings. It is called only when cheaper methods have narrowed the problem to a specific, bounded question: “Are these two records the same person?” or “What type of office is the Santa Rosa Island Authority?” This ordering exists for speed (70% of entity resolution is exact match), reproducibility (deterministic steps produce identical results), and auditability (every LLM decision is logged with its prompt, response, and reasoning).

4. Immutable Layers

Outputs are append-only. L0 raw files are never modified. L1 cleaned records are never updated in place — if the parser changes, a new L1 run produces new records with a new parser_version. L2 embeddings are never re-computed by overwriting existing vectors — a new embedding model produces new L2 output alongside the old. L3 match decisions are never silently revised — an override produces a new decision record referencing the original. L4 canonical exports are versioned snapshots, not mutable databases. This immutability serves two purposes. First, provenance: the hash chain from L4 back to L0 depends on every intermediate record remaining unchanged. Modifying an L1 record without incrementing the parser version breaks the chain. Second, debugging: when a result looks wrong, you can inspect every layer’s output at the time it was produced, without worrying that a subsequent run overwrote the evidence.

5. Document Sources, Don’t Store Data

This project does not redistribute election data. Each source — MEDSL, NC SBE, OpenElections, VEST, Census, FEC — publishes data under its own license, on its own schedule, at its own URLs. We provide exact download URLs, file size expectations, schema documentation, known data quality issues, and the tools to process the data. We do not provide the data itself. The reasons are legal (license terms vary), practical (the current corpus exceeds 8 GB and grows with each election cycle), and epistemic (a stale copy of a dataset that the source has since corrected is worse than no copy at all). Users download data from authoritative sources, verify file integrity against documented hashes, and run the pipeline locally. The L0 manifest records exactly where each file came from and when it was retrieved, so any result can be traced back to its authoritative origin.

The Five-Layer Pipeline

The pipeline processes election data through five immutable layers. Each layer depends on all prior layers. Every record carries a hash chain back to the original source bytes. The storage format at every layer is JSONL (one JSON record per line).

L0  RAW         Byte-identical source files with acquisition manifests.
 │
 │  deterministic parse — no ML, no API calls
 ▼
L1  CLEANED     Structured records. Names decomposed into components.
 │              FIPS enrichment. Office classification (keyword + regex).
 │
 │  deterministic given embedding model version
 ▼
L2  EMBEDDED    Vector embeddings for candidates, contests, geography.
 │              Tier 3 office classification. Quality flags.
 │
 │  non-deterministic — LLM decisions stored in audit log
 ▼
L3  MATCHED     Entity resolution. candidate_entity_id assigned.
 │              contest_entity_id assigned. Cross-source dedup.
 │
 │  deterministic given L3 entity assignments
 ▼
L4  CANONICAL   Authoritative names. Temporal chains. Alias tables.
                Verification algorithms. Researcher-facing exports.

Layer properties

| Layer | Deterministic | Needs API key | Output format | Re-runnable from |
|---|---|---|---|---|
| L0 | Yes | No | Original files + .manifest.json | External sources |
| L1 | Yes | No | JSONL | L0 |
| L2 | Yes, given model version | Yes (OpenAI) | JSONL + .npy sidecars | L1 |
| L3 | No (LLM) | Yes (Anthropic) | JSONL + decision log (JSONL) | L2 |
| L4 | Yes, given L3 | No | JSONL + JSON registries + CSV export | L3 |

The determinism boundary falls between L2 and L3. Everything from L0 through L2 produces identical output from identical input, given the same code version and embedding model. L3 introduces LLM calls whose outputs may vary between runs, but every decision is stored in a JSONL audit log that enables deterministic replay.

What each layer produces

L0: Raw

The input to the entire pipeline. L0 is a byte-identical copy of each source file, accompanied by a JSON manifest recording how it was acquired.

l0_raw/
├── nc_sbe/
│   ├── results_pct_20221108.txt           # Original file, untouched
│   └── results_pct_20221108.txt.manifest.json
├── medsl/
│   └── 2022-nc-local-precinct-general/
│       ├── NC-cleaned-final3.csv
│       └── NC-cleaned-final3.csv.manifest.json
└── ...

The manifest records:

{
  "l0_hash": "edfedf2760cfd54f...",
  "source_url": "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip",
  "retrieval_date": "2026-03-18T14:30:00Z",
  "file_size_bytes": 18023456,
  "format_detected": "tsv"
}

L0 files are never modified. If a source is re-downloaded and the content differs, a new versioned L0 entry is created.

L1: Cleaned

L1 parses each source’s native format into a unified JSONL schema. The parser is source-specific (one parser per source), but the output schema is the same regardless of source.

L1 performs 10 operations in fixed order:

  1. Filter non-contests — Detect “Registered Voters”, “Ballots Cast”, “Over Votes”, “Under Votes”. Route to turnout metadata. Detect “For”/“Against”/“Yes”/“No” ballot measure choices.
  2. Parse source format — Source-specific: CSV for MEDSL, TSV for NC SBE, XML for Clarity.
  3. Decompose candidate names — Split into first, middle, last, suffix, nickname. Preserve every component. Robert Van Fletcher, Jr. becomes {first: "Robert", middle: "Van", last: "Fletcher", suffix: "Jr.", raw: "Robert Van Fletcher, Jr."}.
  4. Apply nickname dictionary — Map Charlie → Charles, Bill → William, etc. Store as canonical_first. Preserve original first.
  5. Classify contest kind — CandidateRace, BallotMeasure, or TurnoutMetadata.
  6. Classify office (tiers 1–2) — Keyword lookup (~170 entries), then regex patterns (~40 patterns). No ML, no embeddings. Records that don’t match remain other.
  7. Enrich geography — FIPS lookup from bundled reference data (3,143 counties, 31,980 places). Generate OCD-IDs.
  8. Compute vote shares — votes_total / sum(all candidates in contest).
  9. Backfill turnout — If turnout metadata rows were found, attach registered voter counts to sibling contest records in the same precinct.
  10. Compute L1 hash — SHA-256(record content + "parent:" + L0 hash).

A single L1 record for a Columbus County, NC school board race:

{
  "election": {"date": "2022-11-08", "type": "general"},
  "jurisdiction": {
    "state": "NC", "state_fips": "37",
    "county": "COLUMBUS", "county_fips": "37047",
    "precinct": "P17", "level": "precinct"
  },
  "contest": {
    "kind": "candidate_race",
    "raw_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
    "office_level": "school_district",
    "classifier_method": "regex",
    "classifier_confidence": 0.85,
    "vote_for": 1
  },
  "results": [
    {
      "candidate_name": {
        "raw": "Timothy Lance", "first": "Timothy",
        "middle": null, "last": "Lance", "suffix": null,
        "canonical_first": "Timothy"
      },
      "votes_total": 303,
      "vote_counts_by_type": {
        "election_day": 136, "early": 159,
        "absentee_mail": 7, "provisional": 1
      }
    }
  ],
  "source": {
    "source_type": "nc_sbe",
    "source_file": "results_pct_20221108.txt",
    "confidence": "high"
  },
  "provenance": {
    "l1_hash": "8ea7ecc257ff8e05",
    "l0_parent_hash": "edfedf2760cfd54f",
    "parser_version": "nc_sbe_v2.1",
    "schema_version": "3.0.0"
  }
}

L1 does not use any machine learning, API calls, or non-deterministic processes. Given the same L0 files and the same parser version, L1 output is identical on every run.

L2: Embedded

L2 generates vector embeddings for text fields that need fuzzy matching. The embedding model is text-embedding-3-large (3,072 dimensions) from OpenAI. L2 also applies tier 3 office classification (embedding nearest-neighbor against a reference set of ~200 known office names) and raises quality flags on suspicious records.

L2 produces three types of output:

Enriched JSONL — L1 records augmented with classification upgrades and quality flags:

{
  "...all L1 fields...",
  "l2": {
    "l2_hash": "854fa6367960bb05",
    "l1_parent_hash": "8ea7ecc257ff8e05",
    "embedding_model": "text-embedding-3-large",
    "embedding_dimensions": 3072,
    "candidate_embedding_id": 4271,
    "contest_embedding_id": 183,
    "candidate_composite": "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus",
    "contest_composite": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 | school_district | NC 2022",
    "quality_flags": []
  }
}

Embedding sidecars — Binary .npy files (float32 arrays) containing the actual vectors. One file per embedding type per partition:

l2_embedded/
├── nc/2022/
│   ├── enriched.jsonl
│   ├── candidate_embeddings.npy    # float32[N, 3072]
│   ├── contest_embeddings.npy      # float32[M, 3072]
│   └── geography_embeddings.npy    # float32[K, 3072]

ID mapping — A JSON file mapping L1 record hashes to embedding row indices.

The composite strings fed to the embedding model follow fixed templates:

  • Candidate: {canonical_first} {middle} {last} {suffix} | {party} | {office} | {state} | {county}
  • Contest: {raw_contest_name} | {office_level} | {state} {year}
  • Geography: {municipality}, {county} County, {state}

Middle initials and suffixes are included in the candidate composite. This is deliberate — “David S Marshall” and “David A Marshall” produce different vectors, which helps distinguish different people with the same first and last name. We measured this: including the middle initial reduced cosine similarity between the two David Marshalls from 0.7025 to 0.6448.

L2 is deterministic given the same embedding model version. If OpenAI changes the weights behind text-embedding-3-large, the vectors change. The embedding_model and model version are stored in every L2 record to detect this.

L3: Matched

L3 resolves entities — it determines which records across sources and elections refer to the same candidate and the same contest. This is the first non-deterministic layer because it uses LLM calls for ambiguous cases.

L3 runs an entity resolution cascade for each candidate record:

| Step | Method | Handles | Cost |
|---|---|---|---|
| 1 | Exact match on (canonical_first, last, suffix) | Same name across precincts | $0 |
| 2 | Jaro-Winkler similarity ≥ 0.92 | Minor spelling variations | $0 |
| 2.5 | Name similarity gate: JW on last name < 0.50 → skip | Obvious non-matches | $0 |
| 3 | Embedding retrieval: cosine ≥ 0.95 → auto-accept | Format differences | $0 |
| 4 | LLM confirmation: cosine 0.35–0.95 | Nicknames, suffixes, ambiguous names | ~$0.0002/call |
| 5 | Tiebreaker: stronger model when step 4 is uncertain | Low-confidence cases | ~$0.002/call |

In our prototype run of 200 records (each record carries one or more candidate results):

  • Step 1 resolved 597 candidates (70.0%)
  • Step 2 resolved 1 (0.1%)
  • Step 3 resolved 50 (5.9%)
  • Step 4 was invoked 30 times (3.5%), all resulting in no-match
  • 206 unique candidate entities were created

The 30 LLM calls in our prototype were all spent on pairs within the same (state, office_level) block that had moderate embedding similarity (0.55–0.73) but completely different names — “Aaron Bridges” vs “Daniel Blanton” type comparisons. All 30 were correctly rejected. This finding led to the addition of step 2.5 (the name similarity gate): if the Jaro-Winkler score on last names alone is below 0.50, skip the pair entirely without computing embedding similarity.
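A sketch of the gate, assuming the jellyfish library for Jaro-Winkler scores:

import jellyfish

def passes_name_gate(last_a: str, last_b: str, gate: float = 0.50) -> bool:
    """Step 2.5: skip pairs whose last names are obviously unrelated."""
    return jellyfish.jaro_winkler_similarity(last_a.lower(), last_b.lower()) >= gate

passes_name_gate("Bridges", "Blanton")  # False: never reaches embeddings or the LLM
passes_name_gate("Crist", "Crist")      # True: proceeds down the cascade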

Every L3 decision is stored in a JSONL audit log:

{
  "decision_id": "a3f8c1d2-...",
  "decision_type": "candidate_match",
  "timestamp": "2026-03-19T10:30:00Z",
  "inputs": {
    "name_a": "Charlie Crist",
    "name_b": "CRIST, CHARLES JOSEPH",
    "embedding_score": 0.451,
    "state_a": "FL", "state_b": "FL",
    "contest_a": "Governor", "contest_b": "Governor",
    "votes_a": 3101652, "votes_b": 3101652
  },
  "method": {
    "type": "llm",
    "model": "claude-sonnet-4-20250514",
    "prompt_template_version": "entity_match_v2.0"
  },
  "output": {
    "decision": "match",
    "confidence": 0.95,
    "reasoning": "Charlie is a common nickname for Charles. Same state, same office, identical vote counts."
  }
}

A researcher who wants to reproduce L3 can either replay the cached decisions from the log (deterministic) or re-run the LLM calls (which may produce slightly different responses). The log preserves everything needed for either approach.
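A sketch of the replay path, keying cached decisions by input name pair for illustration (the real pipeline presumably keys on record hashes):

import json

def load_decision_cache(log_path: str) -> dict:
    """Replay cached L3 decisions instead of re-calling the LLM."""
    cache = {}
    with open(log_path) as f:
        for line in f:
            d = json.loads(line)
            key = (d["inputs"]["name_a"], d["inputs"]["name_b"])
            cache[key] = d["output"]
    return cache

decisions = load_decision_cache("l3_matched/nc/2022/decisions/candidate_matches.jsonl")
decisions[("Charlie Crist", "CRIST, CHARLES JOSEPH")]["decision"]  # "match"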

L3 adds entity assignments to each record:

{
  "...all L2 fields...",
  "l3": {
    "l3_hash": "28183d41d50204d5",
    "l2_parent_hash": "854fa6367960bb05",
    "candidate_entity_ids": [
      {"result_index": 0, "entity_id": "person:nc:columbus:lance-timothy-13"}
    ],
    "contest_entity_id": "contest:nc:columbus:school-board-d02"
  }
}

L4: Canonical

L4 assigns authoritative representations. For each entity (candidate or contest), it selects a canonical name, builds temporal chains across elections, constructs alias tables, and runs verification algorithms.

Canonical name selection follows a fixed algorithm:

  1. Collect all name variants from all L3 records in the entity cluster.
  2. Prefer the most complete variant (one with a middle initial over one without; one with a suffix over one without).
  3. Among equally complete variants, prefer the one from the most authoritative source (certified state data > academic data > community data).
  4. Among equally authoritative variants, prefer the most recent.
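A sketch of the selection order above, assuming each variant carries its decomposed components, source type, and election date (the rank values are illustrative):

SOURCE_RANK = {"nc_sbe": 3, "medsl": 2, "openelections": 1}  # certified > academic > community

def completeness(v: dict) -> int:
    return (v["middle"] is not None) + (v["suffix"] is not None)

def canonical_name(variants: list[dict]) -> str:
    best = max(variants, key=lambda v: (completeness(v),
                                        SOURCE_RANK.get(v["source_type"], 0),
                                        v["election_date"]))
    return best["raw"]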

Temporal chains aggregate records by (entity_id, election_date, contest_entity_id). One entry per election, not per precinct. A candidate who appeared in 47 precincts in one election gets one temporal chain entry with the summed vote total.

Verification algorithms run at L4 to check pipeline integrity:

  1. Hash chain integrity — Walk L4→L3→L2→L1→L0 for every record. Verify no link is broken.
  2. Entity consistency — Flag entities spanning multiple states (unusual for local officials). Flag party switches.
  3. Temporal plausibility — Flag implausible career spans or office progressions.
  4. Cross-source reconciliation — Where two sources cover the same contest, compare vote totals.
  5. Completeness audit — Report coverage by state, county, year. Report FIPS and entity ID fill rates.
  6. LLM entity audit — For multi-member entities, ask a language model whether the cluster is plausible. In our prototype, this caught 43 suspicious entities (precinct-level records inflating temporal chains) and 4 likely errors (“For” and “Against” ballot measure choices classified as person entities).

L4 exports two types of output:

Entity registries (JSON) — One record per unique person or contest:

{
  "entity_id": "person:nc:columbus:lance-timothy-13",
  "canonical_name": "Timothy Lance",
  "aliases": ["Timothy Lance", "TIMOTHY LANCE"],
  "elections": [
    {"date": "2022-11-08", "contest": "Columbus County Schools Board of Education District 02", "votes": 1531}
  ],
  "states": ["NC"],
  "first_appearance": "2022-11-08",
  "election_count": 1
}

Flat exports (JSONL and CSV) — One record per candidate per contest per precinct, with canonical names and entity IDs attached:

{
  "election_date": "2022-11-08",
  "state": "NC",
  "county": "COLUMBUS",
  "contest_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
  "candidate_raw": "Timothy Lance",
  "candidate_canonical": "Timothy Lance",
  "candidate_entity_id": "person:nc:columbus:lance-timothy-13",
  "votes_total": 303,
  "source": "nc_sbe",
  "l3_hash": "28183d41d50204d5",
  "l0_hash": "edfedf2760cfd54f"
}

Why five layers and not two

A simpler system would have two layers: raw and processed. The five-layer design exists because the processing steps have different properties that should not be conflated:

Splitting L1 from L2 means you can upgrade the embedding model without re-parsing all sources. If a better model than text-embedding-3-large becomes available, re-run L2 from L1. L1 remains untouched.

Splitting L2 from L3 means cheap, deterministic embedding generation is separate from expensive, non-deterministic LLM calls. L2 can run for 200 million records in hours on CPU (plus API calls for vector generation). L3’s LLM calls can be batched separately, retried on failure, and audited independently.

Splitting L3 from L4 means individual entity resolution decisions are separate from the aggregate operations (canonical name selection, temporal chains, verification) that consume them. If a human reviewer overrides an L3 match decision, L4 can be re-run without re-doing all of L3.

Each layer boundary is a point where you can stop, inspect, export, and restart. A researcher who disagrees with the entity resolution can take L2 output and apply their own matching logic. A developer who wants to test a new office classifier can re-run L1 without re-downloading L0.

Storage layout

local-data/processed/
├── l0_raw/
│   └── {source}/
│       ├── {filename}
│       └── {filename}.manifest.json
├── l1_cleaned/
│   └── {source}/{state}/{year}/
│       ├── cleaned.jsonl
│       └── cleaning_report.json
├── l2_embedded/
│   └── {state}/{year}/
│       ├── enriched.jsonl
│       ├── candidate_embeddings.npy
│       ├── contest_embeddings.npy
│       └── id_mapping.json
├── l3_matched/
│   └── {state}/{year}/
│       ├── matched.jsonl
│       └── decisions/
│           └── candidate_matches.jsonl
└── l4_canonical/
    ├── candidate_registry.json
    ├── contest_registry.json
    ├── verification_report.json
    └── exports/
        ├── flat_export.jsonl
        └── flat_export.csv

All JSONL files are streamable — they can be processed line by line without loading the entire file into memory. At 200 million records with approximately 2 KB per record, the full L1 corpus would be approximately 400 GB. Streaming is not optional at that scale.

The hash chain

Every record at every layer carries a hash of its own content and a reference to its parent layer’s hash:

L4 record
  l4_hash ← SHA-256(L4 content + "parent:" + l3_hash)
    └── l3_hash ← SHA-256(L3 content + "parent:" + l2_hash)
          └── l2_hash ← SHA-256(L2 content + "parent:" + l1_hash)
                └── l1_hash ← SHA-256(L1 content + "parent:" + l0_hash)
                      └── l0_hash ← SHA-256(raw file bytes)

To verify any L4 record: recompute the L4 hash from its content, check that it matches the stored l4_hash, then follow the l3_parent_hash to the L3 record and repeat. Continue through L2 and L1 to L0. At L0, re-hash the raw file bytes and compare to the stored l0_hash.
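A sketch of the walk (serialization details are the pipeline's own; sorted-key JSON is assumed here):

import hashlib, json

def layer_hash(content: dict, parent_hash: str) -> str:
    serialized = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((serialized + "parent:" + parent_hash).encode()).hexdigest()

def verify_chain(records, raw_bytes: bytes) -> bool:
    """records: (content, stored_hash, stored_parent_hash) tuples, ordered L4 to L1."""
    for content, stored, parent in records:
        if layer_hash(content, parent) != stored:
            return False                            # break at this layer boundary
    l0_stored = records[-1][2]                      # L1's parent is the L0 hash
    return hashlib.sha256(raw_bytes).hexdigest() == l0_stored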

In our prototype run of 200 records, all 200 hash chains verified from L4 back to L0 with zero broken links.

L0: Raw — Byte-Identical Source Preservation

L0 is the foundation of the pipeline. It stores byte-identical copies of every source file alongside a JSON manifest that records how the file was acquired. Nothing at L0 is parsed, cleaned, or transformed. The raw bytes are sacred.

What L0 Contains

Every source file produces two artifacts:

| Artifact | Purpose | Example |
|---|---|---|
| The file itself | Exact bytes as downloaded | results_pct_20221108.txt |
| The manifest sidecar | Acquisition metadata | results_pct_20221108.txt.manifest.json |

The manifest records five fields:

{
  "l0_hash": "edfedf2760cfd54f...",
  "source_url": "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip",
  "retrieval_date": "2026-03-18T14:30:00Z",
  "file_size_bytes": 18023456,
  "format_detected": "tsv"
}
  • l0_hash — SHA-256 of the raw file bytes. This is the root of the hash chain. Every downstream record at L1–L4 ultimately traces back to this value.
  • source_url — The exact URL used to retrieve the file. Not a landing page — the direct download link.
  • retrieval_date — ISO 8601 timestamp of when the file was downloaded. Sources update files in place; the retrieval date disambiguates versions.
  • file_size_bytes — Byte count of the raw file after decompression (if the source was a zip archive, this is the size of the extracted file, not the archive).
  • format_detected — The file format as determined by content inspection: tsv, csv, xml, json, fixed_width.

Storage Layout

l0_raw/
├── nc_sbe/
│   ├── results_pct_20221108.txt
│   ├── results_pct_20221108.txt.manifest.json
│   ├── results_pct_20201103.txt
│   └── results_pct_20201103.txt.manifest.json
├── medsl/
│   ├── 2022-nc-precinct-general.csv
│   ├── 2022-nc-precinct-general.csv.manifest.json
│   ├── 2022-fl-precinct-general.csv
│   └── 2022-fl-precinct-general.csv.manifest.json
├── openelections/
│   ├── 20221108__fl__general__precinct.csv
│   └── 20221108__fl__general__precinct.csv.manifest.json
└── census/
    ├── national_county2020.txt
    └── national_county2020.txt.manifest.json

Files are organized by source, not by state or year. The source is the natural partition because each source has its own parser at L1. A single MEDSL file may contain data for all 50 states; a single NC SBE file contains one election’s results for all NC counties. The source directory mirrors the download structure.

Idempotent Download

Downloading is idempotent. Before fetching a file, the pipeline checks whether an L0 entry already exists with a matching l0_hash:

  1. If the manifest exists and the file exists and the file’s SHA-256 matches the manifest’s l0_hashskip download. The file is already present and intact.
  2. If the manifest exists but the file is missing or the hash does not match → re-download. The file was corrupted or deleted.
  3. If no manifest exists → download and create manifest.

This means running the download step twice produces no network traffic on the second run. It also means the pipeline recovers gracefully from interrupted downloads — a partially written file will fail the hash check and be re-fetched.
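A sketch of the check (paths and manifest fields follow the layout above; the function name is illustrative):

import hashlib, json, os

def needs_download(file_path: str) -> bool:
    """Skip the fetch when the stored file matches its manifest's l0_hash."""
    manifest_path = file_path + ".manifest.json"
    if not (os.path.exists(manifest_path) and os.path.exists(file_path)):
        return True                                  # cases 2 and 3
    with open(manifest_path) as f:
        expected = json.load(f)["l0_hash"]
    with open(file_path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    return actual != expected                        # corrupted or partial file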

When Sources Change

Some sources update files in place. NC SBE occasionally reissues precinct result files after canvass corrections. MEDSL publishes revised datasets with the same filename.

When a re-download produces different bytes than the stored l0_hash, the pipeline does not overwrite the existing L0 entry. Instead:

  1. The new file is stored with a versioned name: results_pct_20221108.v2.txt.
  2. A new manifest is created with the new l0_hash and current retrieval_date.
  3. The old file and manifest are retained unchanged.

All L1–L4 records that reference the old l0_hash remain valid. New pipeline runs against the updated file produce new L1–L4 records referencing the new l0_hash. Both versions coexist. The retrieval_date field distinguishes them.

The L0 Hash as Root of Trust

The l0_hash is the only value in the pipeline that can be independently verified by anyone with access to the source. Download the file from the URL in the manifest. Compute SHA-256. Compare. If the hashes match, the pipeline processed the same bytes you hold.

Every subsequent hash — l1_hash, l2_hash, l3_hash, l4_hash — incorporates its parent’s hash. The entire chain is anchored to l0_hash. If someone modifies the raw file, the L0 hash changes, the L1 hash no longer matches its l0_parent_hash, and the verification algorithm reports a break at the L0→L1 boundary.

In our prototype, all 200 hash chains verified from L4 back to L0 with zero broken links. The verification starts here — at the raw bytes.

What L0 Does Not Do

L0 does not parse, filter, validate, or transform. A TSV file with malformed rows is stored as-is. A CSV file with a trailing BOM is stored as-is. A zip archive is decompressed and the contents stored, but the extraction is mechanical — no character encoding conversion, no line-ending normalization, no column reordering.

Data quality issues are L1’s problem. L0’s only job is to preserve the exact bytes that the source published, record where they came from, and make them verifiable.

L1: Cleaned — Deterministic Parsing and Enrichment

L1 transforms raw source files into structured JSONL records with a unified schema. It is purely deterministic: no machine learning, no API calls, no randomness. Given the same L0 files and the same parser version, L1 output is identical on every run, on every machine, forever.

This is deliberate. L1 is the foundation for every subsequent layer. If the foundation is non-deterministic, nothing above it can be reproduced.

One Parser Per Source, One Schema Out

Each source has a dedicated parser that understands its native format:

| Source | Format | Delimiter | Encoding | Parser |
|---|---|---|---|---|
| NC SBE | TSV (.txt extension) | \t | UTF-8 | nc_sbe_v2.1 |
| MEDSL | CSV | , | UTF-8 | medsl_v1.3 |
| OpenElections | CSV (varies by state) | , | UTF-8/Latin-1 | openelections_v1.0 |
| Clarity/Scytl | XML | (n/a) | UTF-8 | clarity_v0.5 |

Every parser produces the same output schema. A downstream consumer of L1 JSONL does not need to know whether a record originated from NC SBE or MEDSL — the fields, types, and semantics are identical.

The 10 Operations

L1 applies 10 operations in fixed order. The order matters — later operations depend on earlier ones.

1. Filter Non-Contests

Before any parsing, detect rows that are not candidate results. Pattern-match on the candidate name field:

| Pattern | Classification | Action |
|---|---|---|
| registered voters | TurnoutMetadata | Extract to turnout.registered_voters |
| ballots cast | TurnoutMetadata | Extract to turnout.ballots_cast |
| over votes | TurnoutMetadata | Extract to turnout.over_votes |
| under votes | TurnoutMetadata | Extract to turnout.under_votes |
| ^blank$ | TurnoutMetadata | Maine’s undervote label |
| total votes | Aggregation artifact | Discard (redundant with candidate sums) |
| for / against / yes / no | BallotMeasure (if contest name matches) | Route to MeasureChoice |

This runs first because non-contest rows must not enter name decomposition, office classification, or entity resolution. The principle is extract before filter — the registered voter count is valuable turnout data and is captured before the row is excluded from candidate analysis. See Non-Candidate Records.

2. Parse Source Format

Source-specific column mapping. The NC SBE parser reads tab-separated fields: County, Election Date, Contest Name, Choice, Choice Party, Total Votes, Election Day, One Stop, Absentee by Mail, Provisional. The MEDSL parser reads CSV columns: state, county_name, office, candidate, party_simplified, votes, mode. Each parser maps its native columns to the unified schema fields.

Encoding normalization happens here. OpenElections files from some states use Latin-1 encoding; the parser detects and converts to UTF-8. MEDSL 2022 has trailing commas in some state files; the parser strips them.

3. Decompose Candidate Names

Split every candidate name into structured components. This is the most critical L1 operation — it determines what signal survives to L2 and L3.

The decomposition handles four source formats:

| Format | Example | Parsing strategy |
|---|---|---|
| LAST, FIRST MIDDLE | CRIST, CHARLES JOSEPH | Split on first comma; remainder is first + middle |
| First Last | Charlie Crist | Last token is last name (with multi-word last name detection) |
| First Middle Last Suffix | Robert Van Fletcher, Jr. | Suffix detected and extracted; remaining tokens parsed |
| LAST, FIRST M. | BRAY, SHANNON W. | Period stripped from middle initial |
The output for every format is the same six fields:

{
  "raw": "Robert Van Fletcher, Jr.",
  "first": "Robert",
  "middle": "Van",
  "last": "Fletcher",
  "suffix": "Jr.",
  "canonical_first": "Robert"
}

Every component is preserved. Middle initials are kept (they distinguish David S. Marshall from David A. Marshall). Suffixes are kept (they distinguish Robert Williams from Robert Williams Jr.). The raw field is never modified. See Name Normalization.
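A simplified sketch of the decomposition. It handles the four formats above but omits canonical_first, multi-word last name detection, and period stripping:

import re

def decompose(raw: str) -> dict:
    name, suffix = raw.strip(), None
    m = re.search(r",?\s+(JR\.?|SR\.?|II|III|IV)$", name, re.I)
    if m:                                  # "Robert Van Fletcher, Jr."
        suffix, name = m.group(1), name[:m.start()]
    if "," in name:                        # "CRIST, CHARLES JOSEPH"
        last, rest = [p.strip() for p in name.split(",", 1)]
        tokens = rest.split()
    else:                                  # "Charlie Crist"
        tokens = name.split()
        last, tokens = tokens[-1], tokens[:-1]
    return {"raw": raw,
            "first": tokens[0] if tokens else None,
            "middle": " ".join(tokens[1:]) or None,
            "last": last, "suffix": suffix}

decompose("Robert Van Fletcher, Jr.")
# {'raw': ..., 'first': 'Robert', 'middle': 'Van', 'last': 'Fletcher', 'suffix': 'Jr.'}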

4. Apply Nickname Dictionary

Look up first in the nickname dictionary (~100 mappings in prototype, targeting 500+). If a mapping exists, populate canonical_first with the formal equivalent. If not, canonical_first equals first.

| first | canonical_first | Mapping |
|---|---|---|
| Charlie | Charles | Charlie → Charles |
| Ron | Ronald | Ron → Ronald |
| Nikki | Nicole | Nikki → Nicole |
| Timothy | Timothy | No mapping (already formal) |

Both fields are preserved. The composite string sent to L2 uses canonical_first; the original first is retained for display and provenance. See The Nickname Dictionary.

5. Classify Contest Kind

Route each record to one of three contest kinds based on signals from steps 1 and 2:

| Kind | Criteria | Example |
|---|---|---|
| candidate_race | Default — a person running for office | Timothy Lance for Board of Education |
| ballot_measure | Candidate name is For/Against/Yes/No AND contest name matches measure keywords | “Against” in “BOND REFERENDUM” |
| turnout_metadata | Candidate name matches turnout patterns | “Registered Voters” |

Records classified as ballot_measure get a MeasureChoice result instead of CandidateResult. Records classified as turnout_metadata are extracted and attached to sibling precinct records.

6. Classify Office (Tiers 1–2)

Apply the deterministic tiers of the office classifier:

Tier 1: Keyword lookup (~170 entries). Case-insensitive substring match. "board of education" in the contest name → school_district/education. Handles ~45% of unique office names, ~85% of records by volume.

Tier 2: Regex patterns (~40 patterns). county\s+commission → county/legislative. Adds ~17% of unique names.

Records that do not match either tier are classified as other with classifier_confidence: 0.0. They proceed to L2 for tier 3 (embedding nearest-neighbor) and tier 4 (LLM classification).

The classifier_method field records which tier produced the classification: "keyword", "regex", or "unclassified".

7. Enrich Geography

Look up FIPS codes from bundled Census Bureau reference data:

  • State FIPS: 2-digit code from state abbreviation. NC → 37.
  • County FIPS: 5-digit code from (state, county name). (NC, COLUMBUS) → 37047.
  • Place FIPS: Where available, municipal codes from Census place files.
  • OCD-ID: Open Civic Data identifier. ocd-division/country:us/state:nc/county:columbus.

The reference data covers 3,143 counties and 31,980 places. FIPS enrichment achieves 100% county coverage for records with valid state and county name fields. Municipal FIPS coverage is lower (~85%) because municipality names are less standardized.
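A sketch of the lookup, with the reference dictionaries abridged to the running example:

STATE_FIPS = {"NC": "37"}                             # abridged
COUNTY_FIPS = {("NC", "COLUMBUS"): "37047"}           # abridged

def enrich_geography(state: str, county: str) -> dict:
    return {
        "state_fips": STATE_FIPS[state],
        "county_fips": COUNTY_FIPS[(state, county.upper())],
        "ocd_id": f"ocd-division/country:us/state:{state.lower()}"
                  f"/county:{county.lower()}",
    }

enrich_geography("NC", "COLUMBUS")
# {'state_fips': '37', 'county_fips': '37047',
#  'ocd_id': 'ocd-division/country:us/state:nc/county:columbus'}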

8. Compute Vote Shares

For each candidate in a contest within a precinct:

vote_share = votes_total / sum(votes_total for all candidates in same contest+precinct)

Vote share is a convenience field — it can always be recomputed from the raw vote counts. It is included because downstream queries (margins, competitiveness rankings) use it constantly.

9. Backfill Turnout

If step 1 extracted turnout metadata rows for a precinct, attach the values to all sibling contest records in the same precinct:

{
  "turnout": {
    "registered_voters": 4217,
    "ballots_cast": 2891,
    "turnout_rate": 0.6855
  }
}

Turnout data is available in NC SBE and some OpenElections files. It is absent from MEDSL and Clarity. When absent, the turnout field is null — not zero, not omitted, but explicitly null to distinguish “no data” from “zero registered voters.”

10. Compute L1 Hash

The final operation seals the record into the hash chain:

l1_hash = SHA-256( serialize(record_without_hash) + "parent:" + l0_hash )

The l0_hash comes from the L0 manifest of the source file. The l1_hash becomes the anchor for L2. See Provenance and the Hash Chain.

A Real L1 Record

Timothy Lance, precinct P17, Columbus County Schools Board of Education District 02, 2022 NC general election:

{
  "election": {"date": "2022-11-08", "type": "general"},
  "jurisdiction": {
    "state": "NC", "state_fips": "37",
    "county": "COLUMBUS", "county_fips": "37047",
    "precinct": "P17", "level": "precinct"
  },
  "contest": {
    "kind": "candidate_race",
    "raw_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
    "office_level": "school_district",
    "classifier_method": "regex",
    "classifier_confidence": 0.85,
    "vote_for": 1
  },
  "results": [
    {
      "candidate_name": {
        "raw": "Timothy Lance", "first": "Timothy",
        "middle": null, "last": "Lance", "suffix": null,
        "canonical_first": "Timothy"
      },
      "votes_total": 303,
      "vote_share": 0.523,
      "vote_counts_by_type": {
        "election_day": 136, "early": 159,
        "absentee_mail": 7, "provisional": 1
      }
    }
  ],
  "turnout": {
    "registered_voters": 4217,
    "ballots_cast": 2891,
    "turnout_rate": 0.6855
  },
  "source": {
    "source_type": "nc_sbe",
    "source_file": "results_pct_20221108.txt",
    "confidence": "high"
  },
  "provenance": {
    "l1_hash": "8ea7ecc257ff8e05",
    "l0_parent_hash": "edfedf2760cfd54f",
    "parser_version": "nc_sbe_v2.1",
    "schema_version": "3.0.0"
  }
}

Every field traces to a specific operation: county_fips from step 7, canonical_first from step 4, office_level from step 6, turnout from step 9, l1_hash from step 10.

What L1 Does Not Do

  • No embeddings. Embedding generation requires an API call to OpenAI. L1 runs offline with zero external dependencies.
  • No entity resolution. L1 does not determine whether two records refer to the same person. That is L3’s job.
  • No canonical name selection. L1 preserves all name components. Choosing the “best” name is L4’s job, after entity resolution.
  • No tier 3/4 office classification. Embedding-based and LLM-based classification require API calls. L1 applies only the deterministic tiers (keyword and regex). Records that need tiers 3–4 are marked "classifier_method": "unclassified" and classified at L2.

This boundary is the determinism boundary. Everything L1 does can be verified by re-running the parser on the same L0 files. No API key, no network connection, no randomness.

L2: Embedded — Vector Generation and Classification

L2 transforms L1’s structured text fields into vector embeddings suitable for fuzzy matching, applies tier 3 office classification, and raises quality flags on suspicious records. It is the bridge between deterministic parsing (L1) and probabilistic entity resolution (L3).

Embedding Model

The embedding model is OpenAI’s text-embedding-3-large, producing 3,072-dimensional float32 vectors. Every L2 record stores the model identifier and dimensionality:

{
  "embedding_model": "text-embedding-3-large",
  "embedding_dimensions": 3072
}

This metadata is not optional. Thresholds calibrated for text-embedding-3-large (auto-accept ≥ 0.95, ambiguous 0.35–0.95, auto-reject < 0.35) are not portable to other models. If the model changes, the thresholds must be recalibrated against the test cases. Storing the model in every record ensures that stale thresholds are never applied to vectors from a different model.

Composite String Templates

Raw name components are not embedded directly. They are assembled into composite strings that include contextual fields — office, state, county, party — so that the resulting vectors encode identity-relevant context alongside the name.

Three composite types are generated per record:

| Type | Template | Example |
|---|---|---|
| Candidate | `{canonical_first} {middle} {last} {suffix} \| {party} \| {office} \| {state} \| {county}` | `Timothy Lance \| \| BOARD OF EDUCATION DISTRICT 02 \| NC \| Columbus` |
| Contest | `{raw_name} \| {office_level} \| {state} {year}` | `COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 \| school_district \| NC 2022` |
| Geography | `{municipality}, {county} County, {state}` | `Whiteville, Columbus County, NC` |

Middle initials and suffixes are included deliberately. “David S Marshall | ME” and “David A Marshall | FL” produce different vectors — measured at cosine 0.6448 with middle initials versus 0.7025 without. That 0.058 gap is the difference between correct separation and a false merge. See Composite String Templates for the full rationale, including the “context bleed” problem where shared geographic context artificially inflates similarity between unrelated candidates.

Empty components (null middle, null suffix) produce empty slots in the template rather than being omitted. This keeps the template structure consistent, which stabilizes the embedding model’s tokenization.
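
A sketch of the candidate template as code. Field names follow the record examples in this chapter; the whitespace normalization at the end is an assumption about how empty slots collapse to the `| |` form shown above:

```python
def candidate_composite(name: dict, party: str, office: str,
                        state: str, county: str) -> str:
    # Name section: null components are dropped within the name itself.
    name_part = " ".join(
        p for p in (name.get("canonical_first"), name.get("middle"),
                    name.get("last"), name.get("suffix")) if p
    )
    # Context fields keep their pipe-delimited slots even when empty.
    composite = " | ".join([name_part, party or "", office or "",
                            state or "", county or ""])
    return " ".join(composite.split())  # collapse doubled spaces at empty slots

# Produces: "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus"
example = candidate_composite(
    {"canonical_first": "Timothy", "middle": None, "last": "Lance",
     "suffix": None},
    party="", office="BOARD OF EDUCATION DISTRICT 02",
    state="NC", county="Columbus",
)
```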

FAISS Indices

Embeddings are stored in partitioned FAISS indices, one per (state, year) combination. Partitioning serves two purposes:

  1. Blocking alignment. Entity resolution at L3 blocks by (state, office_level, last_name_initial). State-level FAISS partitions ensure that nearest-neighbor queries never cross state boundaries — a candidate in NC is never compared to a candidate in FL during retrieval.

  2. Memory management. A single national index for 42 million candidate embeddings at 3,072 dimensions × 4 bytes would hold ~500 GB of float32 data. Per-state-year partitions fit in memory on commodity hardware. NC 2022 (~200K records × 3,072 dims × 4 bytes) is approximately 2.3 GB.

Index type is IndexFlatIP (inner product on L2-normalized vectors, equivalent to cosine similarity). No approximate search — exact cosine is computed for every candidate pair within a block. At partition scale, exact search is fast enough (sub-second for 200K vectors) and avoids the recall loss of approximate methods like IVF or HNSW.
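
A sketch of building and querying one partition with FAISS. Paths follow the directory layout shown later in this chapter; normalizing at load time is an assumption:

```python
import faiss
import numpy as np

# One exact-cosine partition per (state, year).
emb = np.load("l2_embedded/nc/2022/candidate_embeddings.npy")  # float32[N, 3072]
faiss.normalize_L2(emb)                  # inner product == cosine after this
index = faiss.IndexFlatIP(emb.shape[1])  # exact search, no quantization
index.add(emb)

# Five nearest neighbors of row 4271 (Timothy Lance in the examples below):
scores, rows = index.search(emb[4271:4272], 5)
```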

Tier 3 Office Classification

Records that were not classified by L1’s keyword (tier 1) or regex (tier 2) classifiers are embedded and compared against a reference set of ~200 pre-classified office names.

The reference set is a curated list covering every (office_level, office_branch) pair with at least 3 examples. Each reference entry has a pre-computed embedding. For an unclassified office name, L2 computes its embedding, finds the nearest reference neighbor by cosine similarity, and assigns the reference’s classification if the score exceeds 0.60.
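
In code, tier 3 reduces to one nearest-neighbor lookup per unclassified name. A sketch, assuming reference vectors are pre-normalized and labels are (office_level, office_branch) pairs:

```python
import numpy as np

def classify_tier3(office_vec: np.ndarray,
                   ref_vecs: np.ndarray,
                   ref_labels: list,
                   threshold: float = 0.60):
    """Nearest-reference classification: cosine via dot product on
    L2-normalized vectors. Returns (label, score) or (None, score)."""
    scores = ref_vecs @ office_vec
    best = int(np.argmax(scores))
    if scores[best] >= threshold:
        return ref_labels[best], float(scores[best])
    return None, float(scores[best])
```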

Real tier 3 results:

| Unclassified Name | Nearest Reference | Cosine | Assigned Classification |
|---|---|---|---|
| Collier Mosquito Control District | Mosquito Control District | 0.787 | special_district / infrastructure |
| Eastern Carrituck Fire & Rescue | Fire Protection District | 0.724 | special_district / infrastructure |
| Lowndes County Bd of Ed | Board of Education | 0.831 | school_district / education |

Names scoring below 0.60 are left as other at L2 and passed to tier 4 (LLM) at L3. Tier 3 classifies approximately 4.5% of the unique office names that survived tiers 1 and 2, with 94% accuracy against manual review.

The classification result is written back into the L1-inherited fields on the enriched L2 record, updating classifier_method to "embedding_nn" and classifier_confidence to the cosine score.

Quality Flags

L2 raises flags on records with characteristics that may cause downstream problems. Flags do not block processing — they annotate records for review at L4.

| Flag | Condition | Example |
|---|---|---|
| short_name | Candidate name has ≤ 2 characters after decomposition | "J. D." with no last name parsed |
| common_name_risk | First + last name appears 50+ times nationally | John Smith, Robert Johnson |
| missing_office_level | Office survived all classification tiers as other | Santa Rosa Island Authority (pre-tier-4) |
| zero_votes | votes_total is 0 | Write-in candidates with no votes |
| high_vote_share | Single candidate has > 99% of votes in a contested race | Possible data error or unopposed misclassification |

In our prototype, 12 of 200 records received at least one quality flag. The most common was zero_votes (write-in placeholders), followed by common_name_risk.

Output Format

L2 produces two types of output per (state, year) partition:

Enriched JSONL — L1 records augmented with an l2 block:

{
  "...all L1 fields...",
  "l2": {
    "l2_hash": "854fa6367960bb05",
    "l1_parent_hash": "8ea7ecc257ff8e05",
    "embedding_model": "text-embedding-3-large",
    "embedding_dimensions": 3072,
    "candidate_embedding_id": 4271,
    "contest_embedding_id": 183,
    "candidate_composite": "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus",
    "contest_composite": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 | school_district | NC 2022",
    "quality_flags": []
  }
}

Binary sidecars: .npy files containing float32 arrays of embeddings, plus a JSON ID mapping:

l2_embedded/nc/2022/
├── enriched.jsonl                  # One record per line, all L1 + L2 fields
├── candidate_embeddings.npy        # float32[N, 3072]
├── contest_embeddings.npy          # float32[M, 3072]
├── geography_embeddings.npy        # float32[K, 3072]
└── id_mapping.json                 # l1_hash → embedding row index

Embeddings are stored separately from JSONL to keep the text records streamable. A 3,072-dimensional float32 vector is 12,288 bytes — encoding it as base64 inside JSON would triple the JSONL file size. The .npy format is readable by NumPy, PyTorch, and any tool that understands the NumPy array file specification.

The candidate_embedding_id in the JSONL record is an integer index into candidate_embeddings.npy. To retrieve Timothy Lance’s embedding: load the .npy file, index row 4271.
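
For example, a sketch of that lookup (the id_mapping.json key layout is assumed from the description above):

```python
import json
import numpy as np

emb = np.load("l2_embedded/nc/2022/candidate_embeddings.npy")
with open("l2_embedded/nc/2022/id_mapping.json") as f:
    id_map = json.load(f)                  # l1_hash -> embedding row index

row = id_map["8ea7ecc257ff8e05"]           # Timothy Lance's l1_hash
vector = emb[row]                          # float32 vector of length 3,072
```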

Determinism

L2 is deterministic given the same L1 input, the same embedding model version, and the same office reference set. The composite string templates are fixed. The FAISS index construction is deterministic (flat index, no random initialization). The tier 3 nearest-neighbor search is exact.

If OpenAI updates the weights behind text-embedding-3-large without changing the model name, the vectors change silently. The embedding_model field cannot detect this — it records the API model name, not an internal version hash. In practice, OpenAI has not changed embedding model weights after release. If they do, a full L2 re-run and threshold recalibration is required.

Dependencies

L2 requires an OpenAI API key for embedding generation. L0 and L1 do not — they run entirely offline. This is the first layer that requires network access.

At prototype scale (200 records), L2 embedding generation takes approximately 3 seconds and costs less than $0.01. At production scale (42 million rows), the cost is approximately $300 and the wall-clock time depends on API throughput (typically 3,000 embeddings per minute with batching, yielding ~10 days for the full corpus). Embeddings are computed once per L1 record and cached — re-running L3 or L4 does not re-invoke the embedding API.

L3: Matched — Entity Resolution and LLM Confirmation

L3 is the first non-deterministic layer. It resolves entities — determining which records across sources, precincts, and elections refer to the same candidate and the same contest. Every decision is stored in a JSONL audit log with full prompt, response, and reasoning, enabling deterministic replay even though the underlying LLM calls are non-deterministic.

Input and Output

Input: L2 enriched JSONL records with embeddings, composite strings, and quality flags.

Output:

  • Enriched JSONL with candidate_entity_id and contest_entity_id assignments.
  • A decision log (candidate_matches.jsonl) recording every comparison made and its outcome.

Blocking

Before pairwise comparison begins, records are partitioned into blocks by (state, office_level, last_name_initial). Only pairs within the same block are compared. A candidate for NC school board is never compared to a candidate for FL sheriff.

This reduces the comparison space by approximately four orders of magnitude. The blocking key is deliberately coarse — we accept some noise within blocks (two unrelated people whose last names start with the same letter, in the same state, at the same office level) in exchange for never missing a legitimate match. The step 2.5 gate handles within-block noise cheaply.
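
A sketch of the blocking step. Record field paths follow the L1 examples in this chapter; a record with multiple results contributes one entry per candidate:

```python
from collections import defaultdict

def block_key(record: dict, result: dict) -> tuple:
    """Coarse key: (state, office_level, last-name initial)."""
    return (record["jurisdiction"]["state"],
            record["contest"]["office_level"],
            result["candidate_name"]["last"][0].upper())

def build_blocks(records: list) -> dict:
    blocks = defaultdict(list)
    for record in records:
        for result in record["results"]:
            blocks[block_key(record, result)].append((record, result))
    return blocks  # only pairs within one block are ever compared
```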

The Five-Step Cascade

| Step | Method | Prototype result | Cost per pair |
|---|---|---|---|
| 1 | Exact match on (canonical_first, last, suffix) | 597 (70.0%) | negligible |
| 2 | Jaro-Winkler ≥ 0.92 on full name | 1 (0.1%) | microseconds |
| 2.5 | Name gate: JW on last name < 0.50 → skip | — (gate) | microseconds |
| 3 | Embedding cosine ≥ 0.95 AND same state → auto-accept | 50 (5.9%) | pre-computed |
| 4 | LLM confirmation: cosine 0.35–0.95 | 30 (3.5%) | ~$0.0002/call |
| 5 | Tiebreaker: stronger model when step 4 confidence < 0.70 | 0 (rare) | ~$0.002/call |

Percentages are from the 200-record Columbus County NC prototype. 206 unique candidate entities were created.
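
The cascade can be read as one dispatch function per pair. A sketch, assuming pairwise scores are precomputed and llm_confirm wraps the step 4/5 model calls; all helper names are illustrative:

```python
def match_key(name: dict) -> tuple:
    return (name["canonical_first"], name["last"], name["suffix"])

def resolve_pair(name_a, name_b, cosine, jw_full, jw_last, llm_confirm):
    # Step 1: exact match on (canonical_first, last, suffix)
    if match_key(name_a) == match_key(name_b):
        return "match", 1.0, "exact"
    # Step 2: strict full-name Jaro-Winkler
    if jw_full >= 0.92:
        return "match", jw_full, "jaro_winkler"
    # Step 2.5: gate -- dissimilar last names never reach the LLM
    if jw_last < 0.50:
        return "no_match", 1.0, "gate_reject"
    # Step 3: embedding auto-accept (same-state check assumed upstream),
    # plus the secondary cosine >= 0.90 AND jw_full >= 0.92 rule
    if cosine >= 0.95 or (cosine >= 0.90 and jw_full >= 0.92):
        return "match", cosine, "embedding"
    # Step 4: LLM confirmation inside the ambiguous zone
    if cosine >= 0.35:
        decision, confidence = llm_confirm(name_a, name_b, tier="sonnet")
        if confidence < 0.70:              # Step 5: tiebreaker
            decision, confidence = llm_confirm(name_a, name_b, tier="opus")
        return decision, confidence, "llm"
    return "no_match", 1.0, "auto_reject"  # cosine < 0.35
```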

Step 1: Exact Match

The match key is (canonical_first, last, suffix) within a (state, office_level) block. Timothy Lance appears in 47 precinct rows — all 47 share the same key and collapse to one entity. No fuzzy logic, no API calls.

This step handles the overwhelmingly common case: the same candidate appearing identically across precincts within a single source.

Step 2: Jaro-Winkler (≥ 0.92)

Catches minor spelling variations that survive L1 parsing — Mcdonough vs McDonough, transposition errors, inconsistent hyphenation. The threshold of 0.92 is strict to avoid false positives on common surnames.

In the prototype, step 2 resolved 1 additional candidate. Most formatting differences are already normalized at L1.

Step 2.5: The Name Similarity Gate

Before computing embedding similarity, check last-name Jaro-Winkler. If below 0.50, skip the pair entirely.

This gate was added after a prototype finding. The original cascade had no step 2.5, and all 30 LLM calls were spent on pairs like “Aaron Bridges” vs “Daniel Blanton” — candidates in the same (NC, school_district, B/D) block with completely different names. Every call correctly returned no-match, but each cost an API round-trip. The gate eliminates these obvious non-matches before they reach embedding or LLM steps.

At scale, with millions of within-block pairs, this gate prevents orders-of-magnitude waste in downstream steps.

Step 3: Embedding Auto-Accept (≥ 0.95)

For pairs that pass the gate but did not exact-match, retrieve pre-computed L2 cosine similarity. If ≥ 0.95 AND both candidates are in the same state, auto-accept.

The 0.95 threshold is deliberately high. Robert Williams Jr scored 0.862 against Robert Williams — a false positive under the original 0.82 threshold. At 0.95, only near-identical strings with trivial formatting differences pass. Barbara Sharief at 0.955 is an example that auto-accepts: the only difference is a middle initial J added in one source.

A secondary acceptance rule handles the band just below 0.95: embedding ≥ 0.90 AND JW on full name ≥ 0.92 AND same state → accept. This catches Ashley Moody (0.930 cosine) without requiring an LLM call.

Step 4: LLM Confirmation (0.35–0.95)

Pairs in the ambiguous zone are sent to Claude Sonnet with structured context: both candidates’ parsed name components, vote counts, office, state, party, and the embedding score. The LLM returns a decision (match/no-match), confidence (0.0–1.0), and free-text reasoning.

The ambiguous zone is wide (0.35–0.95) by design. Budget is not a constraint. The zone was widened from the original 0.65–0.82 after two findings:

  • Charlie Crist at 0.451 — a true match that the old 0.65 reject threshold would have discarded.
  • Robert Williams Jr at 0.862 — a false positive that the old 0.82 accept threshold would have merged.

The wider zone sends more pairs to the LLM in exchange for zero threshold-induced errors in the tested range.

Step 5: Tiebreaker

When step 4 returns confidence below 0.70, the pair escalates to an Opus-class model. This handles unusual nicknames, slight vote-count discrepancies, and geographic ambiguity that Sonnet finds uncertain. Step 5 was not triggered in the 200-record prototype; it exists for the long tail of ambiguity at production scale.

The Decision Log

Every comparison — not just LLM calls — is recorded in a JSONL audit log at l3_matched/{state}/{year}/decisions/candidate_matches.jsonl. One record per pair examined.

An LLM-decided entry:

{
  "decision_id": "a3f8c1d2-4e7b-4a1f-9c3d-8f2e1a6b5c4d",
  "decision_type": "candidate_match",
  "timestamp": "2026-03-19T10:30:00Z",
  "inputs": {
    "name_a": "Charlie Crist",
    "name_b": "CRIST, CHARLES JOSEPH",
    "embedding_score": 0.451,
    "jw_last_name": 1.0,
    "state_a": "FL", "state_b": "FL",
    "contest_a": "Governor", "contest_b": "Governor",
    "votes_a": 3101652, "votes_b": 3101652
  },
  "method": {
    "type": "llm",
    "model": "claude-sonnet-4-20250514",
    "prompt_template_version": "entity_match_v2.0"
  },
  "output": {
    "decision": "match",
    "confidence": 0.95,
    "reasoning": "Charlie is a common nickname for Charles. Same state, same office, identical vote counts."
  }
}

An exact-match entry is simpler:

{
  "decision_id": "b7c2e4f1-...",
  "decision_type": "candidate_match",
  "timestamp": "2026-03-19T10:30:01Z",
  "inputs": {
    "name_a": "Timothy Lance",
    "name_b": "Timothy Lance",
    "state_a": "NC", "state_b": "NC"
  },
  "method": {
    "type": "exact",
    "model": null,
    "prompt_template_version": null
  },
  "output": {
    "decision": "match",
    "confidence": 1.0,
    "reasoning": "Exact match on (canonical_first=Timothy, last=Lance, suffix=null)"
  }
}

A gate-rejected entry:

{
  "decision_id": "c9d3a5e2-...",
  "decision_type": "candidate_match",
  "timestamp": "2026-03-19T10:30:02Z",
  "inputs": {
    "name_a": "Aaron Bridges",
    "name_b": "Daniel Blanton",
    "jw_last_name": 0.40,
    "state_a": "NC", "state_b": "NC"
  },
  "method": {
    "type": "gate_reject",
    "model": null,
    "prompt_template_version": null
  },
  "output": {
    "decision": "no_match",
    "confidence": 1.0,
    "reasoning": "Last-name JW 0.40 below gate threshold 0.50; skipped."
  }
}

L3 Record Output

Each L1/L2 record is augmented with entity assignments:

{
  "...all L1 and L2 fields...",
  "l3": {
    "l3_hash": "28183d41d50204d5",
    "l2_parent_hash": "854fa6367960bb05",
    "candidate_entity_ids": [
      {"result_index": 0, "entity_id": "person:nc:columbus:lance-timothy-13"}
    ],
    "contest_entity_id": "contest:nc:columbus:school-board-d02"
  }
}

The entity_id format encodes scope: person:{state}:{county}:{last}-{first}-{sequence}. The sequence number disambiguates within a name — necessary when two genuinely different people share the same canonical first and last name in the same county.

Contest entity IDs follow a parallel scheme: contest:{state}:{county}:{office-slug}.

Reproducibility

L3 is non-deterministic because LLM responses may vary between runs. Two strategies make it reproducible in practice:

Replay from log. The decision log contains every match decision with its inputs and outputs. Re-running L3 in replay mode reads decisions from the log instead of calling the LLM. This produces identical L3 output — deterministic given the logged decisions.

Re-run with audit. Re-running L3 with live LLM calls produces a new decision log. Diffing the two logs reveals any decisions where the LLM changed its mind. In testing, decision stability is high: the same pair with the same context produces the same match/no-match outcome in >99% of re-runs. Confidence scores may vary by ±0.05.

For published results, the decision log is the canonical record. The LLM is a tool that produced the decisions; the decisions themselves are the data.
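
A sketch of replay mode. Keying decisions by the unordered name pair is a simplification; the real log carries richer inputs:

```python
import json

def load_decision_log(path: str) -> dict:
    """Map each examined pair to its logged outcome so a re-run can
    read decisions instead of calling the LLM."""
    decisions = {}
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            pair = frozenset((entry["inputs"]["name_a"],
                              entry["inputs"]["name_b"]))
            decisions[pair] = entry["output"]
    return decisions

log = load_decision_log(
    "l3_matched/fl/2022/decisions/candidate_matches.jsonl")
replayed = log[frozenset(("Charlie Crist", "CRIST, CHARLES JOSEPH"))]
# -> {"decision": "match", "confidence": 0.95, "reasoning": "..."}
```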

The 30 Wasted Calls

The prototype’s most actionable finding: all 30 LLM calls were wasted. Every one compared candidates with obviously different names — “Aaron Bridges” vs “Daniel Blanton”, “Timothy Lance” vs “Jessica Moore” — that happened to share a blocking key. The embedding scores ranged from 0.55 to 0.73, placing them in the ambiguous zone. The LLM correctly rejected all 30 with high confidence.

The root cause was coarse blocking without a name-similarity pre-filter. The fix — step 2.5, requiring JW ≥ 0.50 on last names before proceeding — would have eliminated all 30 calls. At production scale, this gate is the difference between thousands of useful LLM calls and millions of wasted ones.

Budget and the Ambiguous Zone

Budget is not a constraint for this project. This changes the threshold calculus:

| Decision | Budget-constrained approach | Our approach |
|---|---|---|
| Ambiguous zone width | Narrow (0.65–0.82) to minimize LLM calls | Wide (0.35–0.95) to maximize accuracy |
| Step 5 model | Same as step 4 (cheaper) | Opus-class (more capable) |
| Audit coverage | Sample-based | Every multi-member entity audited at L4 |

The wider ambiguous zone means ~25% of within-block pairs reach the LLM, up from ~5% with the old thresholds. The step 2.5 gate keeps the absolute call volume manageable by rejecting pairs with dissimilar last names before they enter the zone.

The cascade still exists despite unlimited budget. Sending every pair to the LLM would take weeks of API calls at 42 million rows — cost is irrelevant when wall-clock time is the bottleneck. And deterministic steps are preferred not because they are cheaper, but because they are reproducible and do not hallucinate.

L4: Canonical — Authoritative Names and Verification

L4 is the final layer. It consumes L3’s entity assignments and produces the researcher-facing outputs: canonical names, temporal chains across elections, alias tables, and the results of six verification algorithms. L4 is deterministic given the same L3 input — no LLM calls are made during construction (though the LLM entity audit is part of verification).

Canonical Name Selection

Each candidate entity has multiple name variants collected from across sources and precincts. L4 selects one canonical name using a fixed algorithm, sketched in code after the list:

  1. Collect all variants. For entity person:nc:columbus:lance-timothy-13, the variants might be Timothy Lance (NC SBE), TIMOTHY LANCE (MEDSL), and Lance, Timothy (OpenElections).

  2. Prefer the most complete. A variant with a middle initial beats one without. A variant with a suffix beats one without. SHANNON W BRAY beats SHANNON BRAY. Robert Williams Jr beats Robert Williams (when they are the same entity — which is rare, since Jr usually indicates a different person).

  3. Among equally complete, prefer the most authoritative source. Source authority ranking:

    • Certified state data (NC SBE) — highest
    • Academic curated data (MEDSL) — second
    • Community-curated data (OpenElections) — third
    • Election night reporting (Clarity/Scytl) — lowest
  4. Among equally authoritative, prefer the most recent. A 2022 record beats a 2018 record for the same entity.
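
A sketch of that ordering as a single sort key. The variant fields and the source ranking values are illustrative:

```python
SOURCE_RANK = {"nc_sbe": 0, "medsl": 1, "openelections": 2, "clarity": 3}

def selection_key(variant: dict) -> tuple:
    """Tuples sort element-wise: completeness first, then source
    authority, then recency. min() therefore picks the canonical name."""
    completeness = sum(1 for field in ("middle", "suffix")
                       if variant.get(field))
    return (-completeness,                          # more components first
            SOURCE_RANK.get(variant["source_type"], 99),
            -variant["year"])                       # most recent first

def pick_canonical(variants: list) -> dict:
    return min(variants, key=selection_key)
```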

The selected canonical name is a presentation choice, not an analytical input. By the time L4 runs, entity resolution is complete — the identity question is settled at L3. L4 is choosing a label for a known entity.

Temporal Chain Aggregation

L4 builds one temporal chain entry per (entity, election, contest). A candidate who appeared in 47 precincts in one election gets one entry with the summed vote total — not 47 entries.

This fixes a prototype bug. The initial implementation built temporal chains per precinct, producing entries like “Timothy Lance, 2022, P17, 303 votes” and “Timothy Lance, 2022, P21, 287 votes.” For career tracking and competitiveness analysis, the correct granularity is the election level: “Timothy Lance, 2022, Columbus County Schools Board of Education District 02, 1,531 votes.”

The aggregation:

{
  "entity_id": "person:nc:columbus:lance-timothy-13",
  "canonical_name": "Timothy Lance",
  "aliases": ["Timothy Lance", "TIMOTHY LANCE"],
  "elections": [
    {
      "date": "2022-11-08",
      "contest": "Columbus County Schools Board of Education District 02",
      "contest_entity_id": "contest:nc:columbus:school-board-d02",
      "votes": 1531,
      "vote_share": 0.523,
      "outcome": "won",
      "source_count": 1
    }
  ],
  "states": ["NC"],
  "first_appearance": "2022-11-08",
  "election_count": 1
}
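
A sketch of the election-level aggregation. Field paths follow the L3 record examples; l3_records stands for the enriched partition:

```python
from collections import defaultdict

def build_chain_totals(l3_records: list) -> dict:
    """Sum precinct rows into one (entity, election, contest) entry."""
    totals = defaultdict(int)
    for record in l3_records:
        for assignment in record["l3"]["candidate_entity_ids"]:
            result = record["results"][assignment["result_index"]]
            chain_key = (assignment["entity_id"],
                         record["election"]["date"],
                         record["l3"]["contest_entity_id"])
            totals[chain_key] += result["votes_total"]
    return totals  # 47 precinct rows for Timothy Lance collapse to one key
```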

For multi-cycle candidates, the elections array grows. George Dunlap — Mecklenburg County Commissioner across 6 consecutive cycles (2014–2024) — has 6 entries in his temporal chain, each with the contest-level vote total for that election.

Alias Tables

Every name variant observed for an entity is preserved in the aliases array. This serves two purposes:

  1. Searchability. A user searching for “SHANNON W BRAY” finds the entity whose canonical name is “Shannon W. Bray” because the ALL CAPS variant is in the alias table.

  2. Provenance. The alias table documents which sources used which name formats. If a future entity resolution decision is questioned, the alias table shows exactly what variants were merged.

Aliases are deduplicated but not normalized — Timothy Lance and TIMOTHY LANCE are both preserved because they demonstrate that the entity appears in both title-case and all-caps sources.

The Six Verification Algorithms

L4 runs six verification algorithms over the complete output. These are not optional post-processing — they are integral to the pipeline’s trust model. Every verification result is recorded in verification_report.json.

1. Hash Chain Integrity

Walk the hash chain from L4 → L3 → L2 → L1 → L0 for every record. Recompute each hash and compare to the stored value. Any mismatch identifies the exact layer where the chain breaks.

| Metric | Prototype result |
|---|---|
| Records verified | 200 / 200 |
| Broken chains | 0 |
| Layers traversed per record | 5 |

See Provenance and the Hash Chain for the verification algorithm.

2. Entity Consistency

Flag entities with characteristics that are unusual for local officeholders:

  • Multi-state entities. A candidate_entity_id spanning NC and FL is suspicious — local officials serve in one state. Federal candidates can span states (a senator’s votes appear in statewide and precinct-level records), so federal offices are exempted.
  • Party switches. An entity appearing as DEM in 2018 and REP in 2022 is not impossible (party switches happen) but is flagged for review.
  • Implausible office combinations. An entity serving simultaneously as county sheriff and school board member is unlikely (though not impossible in small counties).

3. Temporal Plausibility

Check career spans and office progressions:

  • Span check. An entity with elections in 2006 and 2024 has an 18-year span. Plausible for a long-serving commissioner, but flagged if the office is typically a stepping stone (e.g., school board).
  • Gap detection. An entity appearing in 2014 and 2024 but not 2016, 2018, 2020, or 2022 may be two different people merged by entity resolution — or someone who left office and returned. Gaps > 2 cycles are flagged.
  • Age plausibility. If external data (FEC filings, candidate bio pages) provides a birth year, check that the candidate was of legal age at first appearance.

4. Cross-Source Reconciliation

Where two sources cover the same contest, compare vote totals for each candidate entity:

| Agreement level | NC 2022 contests | Percentage |
|---|---|---|
| Exact match | 579 | 90.5% |
| Within 1% | 47 | 7.3% |
| Disagree > 1% | 14 | 2.2% |

Disagreements are reported with both sources’ totals, the percentage difference, and the probable cause (provisional ballot timing, write-in aggregation, precinct boundary assignment). See Cross-Source Reconciliation.

5. Completeness Audit

Report coverage metrics across the full dataset:

| Metric | Target | Prototype result |
|---|---|---|
| State coverage (FIPS populated) | 100% | 100% |
| County coverage (FIPS populated) | 100% | 100% |
| Entity ID fill rate (candidate) | > 95% | 100% |
| Entity ID fill rate (contest) | > 95% | 100% |
| Office classification fill rate | > 90% | 67% (prototype scope) |
| Turnout data fill rate | varies | < 5% (most sources lack it) |

Low fill rates are not errors — they are documented gaps. The completeness audit ensures that gaps are visible, not hidden.

6. LLM Entity Audit

For every entity with members from more than one source or more than one election, ask a language model whether the entity cluster is plausible. This is the only LLM call in L4.

The prompt provides the entity’s canonical name, all aliases, all elections, all offices, all states, and all vote totals. The model evaluates:

  • Is this a plausible single person?
  • Are the offices consistent with one career?
  • Do the vote totals and geographic spread make sense?
  • Are any aliases suspicious (non-person names, ballot measure choices, turnout metadata)?

Prototype results from auditing 50 entities:

| Category | Count | Details |
|---|---|---|
| Clean — no issues | 3 | Entity is unambiguous |
| Suspicious — flagged for review | 43 | Precinct-level records inflating temporal chains |
| Likely error — incorrect entity | 4 | "For" and "Against" classified as person entities |

The 43 suspicious entities were a direct consequence of the prototype bug where temporal chains were built per precinct rather than per election. After fixing the aggregation to election-level, the suspicious count dropped to single digits in subsequent runs.

The 4 errors were ballot measure choices (“For”, “Against”) that had leaked past L1 non-candidate detection and received candidate_entity_id values at L3. The LLM audit caught them:

“‘For’ is not a plausible person name. This entity appears across 347 contests in 12 states, always in contest names containing ‘amendment’, ‘bond’, ‘referendum’, or ‘proposition’. These are ballot measure choices, not candidates.”

This finding led to tighter non-candidate detection at L1. See Non-Candidate Records.

Output Format

L4 produces three types of output:

Entity Registries (JSON)

One file per entity type, containing one record per unique entity:

  • candidate_registry.json — all person entities with canonical names, aliases, temporal chains
  • contest_registry.json — all contest entities with canonical names, years active, states

Flat Exports (JSONL and CSV)

One record per candidate per contest per precinct, with canonical names and entity IDs attached:

{
  "election_date": "2022-11-08",
  "state": "NC",
  "county": "COLUMBUS",
  "contest_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
  "candidate_raw": "Timothy Lance",
  "candidate_canonical": "Timothy Lance",
  "candidate_entity_id": "person:nc:columbus:lance-timothy-13",
  "votes_total": 303,
  "source": "nc_sbe",
  "l3_hash": "28183d41d50204d5",
  "l0_hash": "edfedf2760cfd54f"
}

The flat export retains precinct-level granularity with entity-level annotations. Users who need contest-level totals aggregate by (candidate_entity_id, contest_entity_id, election_date). Users who need precinct-level data use the records as-is.
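
For instance, a contest-level rollup with pandas. The file path is illustrative:

```python
import pandas as pd

df = pd.read_json("l4_canonical/nc/2022/flat_export.jsonl", lines=True)
contest_totals = (
    df.groupby(["candidate_entity_id", "contest_entity_id", "election_date"])
      ["votes_total"]
      .sum()
      .reset_index()
)
```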

The CSV export contains the same fields for users who prefer tabular tools (Excel, R, Stata). Column order matches the JSONL field order.

Verification Report (JSON)

A single verification_report.json summarizing all six verification algorithms:

{
  "run_date": "2026-03-19T12:00:00Z",
  "record_count": 200,
  "entity_count": 206,
  "hash_chain": {"verified": 200, "broken": 0},
  "entity_consistency": {"clean": 195, "flagged": 11},
  "temporal_plausibility": {"clean": 203, "flagged": 3},
  "cross_source": {"exact_match": 579, "within_1pct": 47, "disagree": 14},
  "completeness": {"fips_fill": 1.0, "entity_fill": 1.0, "office_fill": 0.67},
  "llm_audit": {"clean": 3, "suspicious": 43, "error": 4, "entities_audited": 50}
}

This report is the pipeline’s self-assessment. A researcher evaluating the data reads the verification report first to understand what the pipeline is confident about and where it flagged concerns.

Why the Order Matters: Clean → Embed → Match → Canonicalize

The pipeline’s four processing stages must run in exactly this order. This is not a convention — it is a dependency chain where each stage requires the output of all prior stages. Rearranging them destroys signal.

We learned this the hard way.

The Insight

The original prototype ran normalization aggressively: strip middle initials, collapse suffixes, force uppercase, pick a canonical name, then try to match entities. The sequence was:

Old order:  Canonicalize → Match
            (normalize aggressively, then find duplicates)

This destroyed the information needed to tell different people apart.

David S. Marshall (Maine, state legislature) and David A. Marshall (Florida, county commission) are two different people. Under the old pipeline, both names were normalized to MARSHALL, DAVID — middle initials stripped as noise. After normalization, the two records were indistinguishable. The entity resolver matched them as the same person. One David Marshall absorbed the other’s career, vote history, and geographic record.

The embedding scores confirm why middle initials matter:

| Composite strings | Cosine similarity |
|---|---|
| `David Marshall \| ME` vs `David Marshall \| FL` | 0.7025 |
| `David S Marshall \| ME` vs `David A Marshall \| FL` | 0.6448 |

The middle initial drops the score by 0.058 — enough to push the pair further from the accept threshold and toward correct rejection. But this signal only exists if the middle initial survives to L2. If L1 strips it during “normalization,” it is gone forever.

The Correct Order

L1  CLEAN        Parse into components. Preserve everything:
                 first, middle, last, suffix, nickname, canonical_first.
                 No components are discarded. No names are collapsed.
 ↓
L2  EMBED        Generate vectors from composite strings that include
                 middle initials, suffixes, and canonical_first.
                 The embedding encodes all preserved signal.
 ↓
L3  MATCH        Compare embeddings. Run LLM confirmation on ambiguous
                 pairs. The LLM sees structured components — middle
                 initials, suffixes, nicknames — and reasons about them.
 ↓
L4  CANONICALIZE Now that entities are resolved, pick the authoritative
                 name. Prefer the most complete variant. Build alias
                 tables. Aggregate temporal chains.

Each stage depends on prior stages’ output:

  • L2 depends on L1 — embeddings are generated from L1’s structured name components. If L1 strips middle initials, L2 cannot encode them.
  • L3 depends on L2 — entity resolution uses L2 embeddings as the retrieval step. If L2 has degraded vectors (because L1 destroyed signal), L3 makes worse decisions.
  • L4 depends on L3 — canonical name selection requires knowing who the person is. You cannot pick the “best” name for an entity before you know which records belong to that entity.

What Breaks If You Rearrange

Canonicalize before Match

This is the old pipeline. Normalize aggressively, then match. Failures:

  • David S. Marshall and David A. Marshall merge into one entity.
  • Robert Williams and Robert Williams Jr merge — suffix stripped before matching can use it.
  • Charlie Crist normalizes to CRIST, CHARLIE but CRIST, CHARLES JOSEPH normalizes to CRIST, CHARLES — the canonical forms don’t match, so the same person splits into two entities.

Aggressive normalization both merges people who should be separate and splits people who should be merged. It is wrong in both directions simultaneously.

Match before Embed

Without embeddings, matching falls back to string similarity alone. Jaro-Winkler on Charlie Crist vs CRIST, CHARLES JOSEPH gives 0.58 — a miss. The embedding model, despite scoring only 0.451, at least places the pair in the ambiguous zone where the LLM can confirm the match. Without embeddings, the pair is never surfaced.

Embed before Clean

If L1 does not decompose names into components, L2 embeds raw strings: CRIST, CHARLES JOSEPH as-is. The composite template cannot include canonical_first because it does not exist yet. The embedding for the MEDSL record uses CHARLES while the OpenElections record uses Charlie — the nickname dictionary was never applied. The cosine score drops, more pairs fall below the LLM zone, and matches are lost.

The General Principle

Preserve signal as long as possible. Collapse only after all decisions that need the signal have been made.

Middle initials are signal for disambiguation. Suffixes are signal for generational distinction. Nicknames are signal for matching. Raw strings are signal for provenance. None of these should be discarded until L4, where the entity is already resolved and the canonical name is a presentation choice, not an analytical input.

The pipeline is a funnel of information:

| Layer | Information available | Information consumed |
|---|---|---|
| L1 | All components: raw, first, middle, last, suffix, canonical_first | None — everything preserved |
| L2 | L1 components + embeddings + quality flags | Components consumed to build composite strings |
| L3 | L2 embeddings + L1 components + LLM context | Embeddings consumed for retrieval; components consumed for LLM reasoning |
| L4 | L3 entity assignments | Entity IDs consumed to select canonical names |

At each layer, information from prior layers is used but not destroyed. The L1 record persists unchanged alongside the L2, L3, and L4 records. A researcher who disagrees with a canonical name choice can trace back to the original components at L1 and the raw bytes at L0.

Why This Took a Session to Learn

The old order felt intuitive: clean the data first, then do the hard work. Every data engineering textbook says normalize early. But election entity resolution is not a standard ETL problem. The “dirt” in the data — middle initials, suffixes, nicknames, variant spellings — is not dirt. It is signal. Stripping it is not cleaning. It is destruction.

The key insight: the order of operations is load-bearing. Clean → Embed → Match → Canonicalize is the only sequence that preserves signal through the stages that need it and collapses only after all analytical decisions are final.

Provenance and the Hash Chain

Every record at every layer carries a cryptographic hash of its own content and a pointer to its parent layer’s hash. This chain links any L4 canonical export record back through L3 matching, L2 embedding, and L1 cleaning to the exact bytes of the original source file at L0. If any record at any layer is modified — a vote count changed, a name altered, a match decision overridden — the chain breaks at precisely that point.

The Hash Structure

Each layer computes its hash as:

l{N}_hash = SHA-256( record_content + "parent:" + l{N-1}_hash )

The record_content is the deterministic serialization of all fields at that layer (excluding the hash itself). The parent: prefix is a literal string separator. The parent hash anchors the current record to its predecessor.

L4 canonical record
  l4_hash ← SHA-256(L4 content + "parent:" + l3_hash)
    │
    └── L3 matched record
          l3_hash ← SHA-256(L3 content + "parent:" + l2_hash)
            │
            └── L2 embedded record
                  l2_hash ← SHA-256(L2 content + "parent:" + l1_hash)
                    │
                    └── L1 cleaned record
                          l1_hash ← SHA-256(L1 content + "parent:" + l0_hash)
                            │
                            └── L0 raw file
                                  l0_hash ← SHA-256(raw file bytes)
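
A sketch of the per-layer hash in code. The serialization shown is one deterministic choice; hashes in this chapter are displayed truncated to 16 hex characters:

```python
import hashlib
import json

def layer_hash(record_content: dict, parent_hash: str) -> str:
    """l{N}_hash = SHA-256(record content + "parent:" + l{N-1}_hash).
    The content must exclude the record's own hash field and serialize
    deterministically (sorted keys, fixed separators)."""
    content = json.dumps(record_content, sort_keys=True,
                         separators=(",", ":"))
    return hashlib.sha256(
        (content + "parent:" + parent_hash).encode("utf-8")
    ).hexdigest()
```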

A Real Example: Timothy Lance Through All Five Layers

Timothy Lance ran for Columbus County Schools Board of Education District 02 in the 2022 NC general election. Here is one of his precinct-level records traced through every layer.

L0: Raw

The NC SBE results file results_pct_20221108.txt is stored byte-identical at l0_raw/nc_sbe/results_pct_20221108.txt.

{
  "l0_hash": "edfedf2760cfd54f",
  "source_url": "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip",
  "retrieval_date": "2026-03-18T14:30:00Z",
  "file_size_bytes": 18023456,
  "format_detected": "tsv"
}

The l0_hash is the SHA-256 of the raw file bytes (truncated here for display). Re-downloading the file and re-hashing produces the same value. If NC SBE updates the file after our retrieval, the hash changes and a new L0 entry is created.

L1: Cleaned

The NC SBE parser extracts Timothy Lance’s precinct P17 row and produces a structured record:

{
  "jurisdiction": {
    "state": "NC", "county": "COLUMBUS", "precinct": "P17"
  },
  "contest": {
    "raw_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
    "office_level": "school_district"
  },
  "results": [{
    "candidate_name": {
      "raw": "Timothy Lance", "first": "Timothy",
      "middle": null, "last": "Lance",
      "suffix": null, "canonical_first": "Timothy"
    },
    "votes_total": 303
  }],
  "provenance": {
    "l1_hash": "8ea7ecc257ff8e05",
    "l0_parent_hash": "edfedf2760cfd54f",
    "parser_version": "nc_sbe_v2.1",
    "schema_version": "3.0.0"
  }
}

The l1_hash is computed from the L1 record content plus "parent:edfedf2760cfd54f". The l0_parent_hash links back to the raw file.

L2: Embedded

L2 generates a composite string and embedding for the candidate:

{
  "l2": {
    "l2_hash": "854fa6367960bb05",
    "l1_parent_hash": "8ea7ecc257ff8e05",
    "embedding_model": "text-embedding-3-large",
    "embedding_dimensions": 3072,
    "candidate_embedding_id": 4271,
    "candidate_composite": "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus",
    "quality_flags": []
  }
}

The l2_hash is computed from the L2 fields plus "parent:8ea7ecc257ff8e05". The l1_parent_hash links back to L1.

L3: Matched

Entity resolution assigns a candidate_entity_id. Timothy Lance appeared identically across all precincts, so step 1 (exact match) resolved him:

{
  "l3": {
    "l3_hash": "28183d41d50204d5",
    "l2_parent_hash": "854fa6367960bb05",
    "candidate_entity_ids": [
      {"result_index": 0, "entity_id": "person:nc:columbus:lance-timothy-13"}
    ],
    "contest_entity_id": "contest:nc:columbus:school-board-d02"
  }
}

The l3_hash is computed from the L3 fields plus "parent:854fa6367960bb05".

L4: Canonical

L4 produces the researcher-facing export record:

{
  "election_date": "2022-11-08",
  "state": "NC",
  "county": "COLUMBUS",
  "contest_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
  "candidate_canonical": "Timothy Lance",
  "candidate_entity_id": "person:nc:columbus:lance-timothy-13",
  "votes_total": 303,
  "source": "nc_sbe",
  "l4_hash": "f19a3e8bc7210d42",
  "l3_hash": "28183d41d50204d5",
  "l0_hash": "edfedf2760cfd54f"
}

The l4_hash is computed from the L4 fields plus "parent:28183d41d50204d5". The record also carries l0_hash as a shortcut for end-to-end verification.

Verification Algorithm

To verify a single L4 record:

  1. Read the L4 record. Recompute SHA-256(L4 content + "parent:" + l3_hash). Compare to stored l4_hash. If mismatch → chain broken at L4.
  2. Look up the L3 record by l3_hash. Recompute SHA-256(L3 content + "parent:" + l2_hash). Compare to stored l3_hash. If mismatch → chain broken at L3.
  3. Look up the L2 record by l2_hash. Recompute SHA-256(L2 content + "parent:" + l1_hash). Compare to stored l2_hash. If mismatch → chain broken at L2.
  4. Look up the L1 record by l1_hash. Recompute SHA-256(L1 content + "parent:" + l0_hash). Compare to stored l1_hash. If mismatch → chain broken at L1.
  5. Read the L0 raw file. Recompute SHA-256(file bytes). Compare to stored l0_hash. If mismatch → chain broken at L0 (source file was modified or corrupted).

If all five checks pass, the record is verified from canonical output back to original source bytes.
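
As a sketch, the walk reduces to one loop. Each tuple carries a layer's content with its hash fields stripped, its stored parent pointer, and its stored hash:

```python
import hashlib
import json

def _hash(content: dict, parent_hash: str) -> str:
    body = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((body + "parent:" + parent_hash).encode()).hexdigest()

def verify_chain(layers: list, l0_bytes: bytes):
    """layers is ordered L4 -> L1 as (content, parent_hash, stored_hash).
    Returns the name of the first broken layer, or None if verified."""
    names = ["L4", "L3", "L2", "L1"]
    for name, (content, parent_hash, stored_hash) in zip(names, layers):
        if _hash(content, parent_hash) != stored_hash:
            return name
    l1_parent = layers[-1][1]  # L1's parent pointer is the l0_hash
    if hashlib.sha256(l0_bytes).hexdigest() != l1_parent:
        return "L0"
    return None  # verified from canonical output to source bytes
```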

Prototype Results

In our 200-record prototype run:

| Metric | Result |
|---|---|
| Records verified | 200 / 200 |
| Broken chains | 0 |
| Layers traversed per record | 5 (L4 → L3 → L2 → L1 → L0) |
| Total hash verifications | 1,000 (200 records × 5 layers) |

Every hash chain verified end-to-end with zero broken links.

What Breaks the Chain

The hash chain detects any modification at any layer. Specific scenarios:

Modifying a vote count at L1. If someone changes Timothy Lance’s votes from 303 to 304, the L1 content changes, the recomputed l1_hash no longer matches the stored value, and the L2 record’s l1_parent_hash no longer points to a valid L1 record.

Changing a parser without a version bump. If the NC SBE parser is updated but parser_version is not incremented, the L1 content for existing records may change (different parsing logic applied to the same raw bytes). The l1_hash changes, breaking the chain from L2 upward. The parser_version field exists precisely to prevent silent parser changes.

Overriding an L3 match decision. If a human reviewer changes an entity assignment at L3, the l3_hash changes. L4 must be re-run from the amended L3 output. The original L3 decision is preserved in the decision log — it is never deleted, only superseded.

Re-downloading a source file after the publisher updated it. NC SBE occasionally corrects results files after initial publication. If the corrected file has different bytes, the l0_hash changes. The entire pipeline from L1 upward must be re-run for affected records. The original L0 entry and its manifest are retained as a versioned snapshot.

Why Not a Merkle Tree

A Merkle tree would allow verifying subsets of records without recomputing the full chain. We use a simpler linear chain because:

  1. Records are independent. Each precinct-level record has its own chain. Verifying one record does not require knowledge of any other record. A Merkle tree adds complexity without benefit when records are not aggregated into blocks.

  2. Full verification is cheap. SHA-256 of a 2 KB record takes microseconds. Verifying all 200 records takes less than a second. At 200 million records, full verification takes minutes — well within acceptable bounds for a batch pipeline.

  3. Simplicity aids trust. A journalist verifying a specific result needs to understand “follow the hash backward through five files.” A Merkle tree requires understanding tree structure, sibling hashes, and root computation. The simpler model is more auditable by non-engineers.

The Chain as Documentation

The hash chain is not just an integrity mechanism — it is a documentation trail. Every L4 record answers the question: “Where did this number come from?” Follow l3_hash to see which entity resolution decision assigned this candidate ID. Follow l2_parent_hash to see the embedding and composite string. Follow l1_parent_hash to see the parsed record. Follow l0_parent_hash to see the raw source file.

This is provenance in the literal sense: the origin and chain of custody of every data point, cryptographically verifiable.

The Project Does Not Store Data

This project processes election data. It does not redistribute it.

Why Not

Each source publishes data under its own terms. MEDSL uses CC-BY. NC SBE publishes as public record under North Carolina law. OpenElections uses a mix of licenses depending on the state contributor. FEC data is public domain. Census reference files are public domain.

Bundling data from all sources into a single download would require compliance with every license simultaneously — attribution chains, share-alike provisions, and restrictions that vary by state contributor. The legal surface area grows with every source added. We avoid it entirely by not storing the data.

Practical

The current corpus is 8+ GB across three election cycles and seven sources. Adding MEDSL 2018 and 2020, full OpenElections coverage, and VEST shapefiles pushes this past 20 GB. Hosting, versioning, and serving that volume adds infrastructure cost and maintenance burden that contribute nothing to the pipeline’s accuracy or reproducibility.

Freshness

Sources update. NC SBE reissues precinct files when canvass corrections are made. MEDSL publishes errata and revised datasets. OpenElections contributors fix parsing errors and add new states. A copy of the data taken on March 18 may be stale by April 1.

If we store data, every downstream user inherits our staleness. If users download from the authoritative source, they get the latest version — and our pipeline processes it identically.

What We Provide Instead

The project provides everything needed to acquire the data yourself:

| What | Where | Example |
|---|---|---|
| Exact source URLs | Each source chapter in Part II | https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip |
| Download commands | Download the Data | `curl -O <url>` with expected file sizes |
| Schema documentation | Each source chapter | Column names, types, delimiters, encoding |
| Known quirks | Each source chapter | NC SBE uses `\t` separators but .txt extension; MEDSL 2022 has trailing commas in some state files |
| File size expectations | Download the Data | MEDSL 2022 NC: ~45 MB compressed |
| SHA-256 of our L0 copies | L0 manifests | Verify your download matches ours |

The L0 manifest for each file records the SHA-256 hash of the bytes we processed. After downloading the same file, you can hash your copy and compare. If the hashes match, your pipeline run will produce identical L1 output — byte for byte, hash for hash.

The Boundary

The project does bundle small reference datasets that are not election results:

  • FIPS code reference files (~200 KB) from the Census Bureau, public domain. These change only on decennial redistricting.
  • The nickname dictionary (~5 KB), original to this project.
  • The office classification keyword and regex tables (~10 KB), original to this project.
  • The 200-name office embedding reference set (~50 KB), original to this project.

These are small, stable, and authored by the project. They are not third-party election data.

Election results — the 42 million rows of precinct-level vote counts — are never stored, cached, or redistributed. The user downloads them. The pipeline processes them. The outputs live on the user’s machine.

Embedding Model: text-embedding-3-large

The pipeline uses OpenAI’s text-embedding-3-large for all vector generation at L2. This is a deliberate choice with specific trade-offs. The model is not the best possible embedding model — it is the best available model for this task given current constraints.

Why text-embedding-3-large

Three properties matter for election entity resolution: dimensionality, consistency, and performance on short structured text.

3,072 dimensions. Higher dimensionality preserves more fine-grained distinctions in short strings. “David S Marshall” and “David A Marshall” differ by a single character — a middle initial. In a 384-dimensional space, that distinction may be compressed away. In 3,072 dimensions, the model has room to encode it. We measured: the middle initial drops cosine similarity from 0.7025 to 0.6448 — a 0.058 gap that matters for disambiguation.

API-based consistency. Every call to the same model version with the same input produces the same vector. There is no local model initialization, no GPU-dependent floating-point variance, no seed to manage. Two users on different machines embedding the same candidate string get the same 3,072 floats. This is critical for reproducibility: L2 output is deterministic given the same model version.

Strong on short structured text. Candidate composite strings are 50–150 characters: "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus". These are not natural language paragraphs — they are structured identifiers with pipe-delimited fields. text-embedding-3-large handles this format well in our testing. Nickname pairs (Charlie Crist at 0.451), suffix pairs (Williams Jr at 0.862), and middle-initial pairs (David Marshall at 0.6448) all produce scores in ranges that the cascade can act on.

Why Not MiniLM

all-MiniLM-L6-v2 from Sentence Transformers is the default recommendation for lightweight embedding tasks. It runs locally, requires no API key, and produces vectors in milliseconds on CPU. We evaluated it and rejected it for three reasons.

384 dimensions. One-eighth the dimensionality of text-embedding-3-large. On structured identifiers where single-character differences carry categorical meaning (middle initials, suffixes), the lower dimensionality compresses distinctions. In informal testing, MiniLM scored Williams Jr at 0.91 against Williams — higher than text-embedding-3-large’s 0.862, and well above any reasonable accept threshold. The suffix signal is effectively lost.

2021 training data. MiniLM was trained on data through 2021. It has no exposure to post-2021 candidate names, office titles, or geographic patterns. text-embedding-3-large was trained on more recent data, though the exact cutoff is not published. For a task that involves matching strings like “DESANTIS, RON” and “Ron DeSantis” — where the model’s familiarity with the name helps — recency matters.

Weaker on structured identifiers. MiniLM is optimized for sentence similarity — determining whether two natural language sentences express the same meaning. Our inputs are not sentences. They are pipe-delimited fields with proper nouns, abbreviations, and codes. text-embedding-3-large is a general-purpose model that handles structured text more robustly than a sentence-similarity specialist.

MiniLM’s advantages — local execution, zero API cost, sub-millisecond inference — are real but irrelevant to our constraints. Budget is not a constraint. Latency at L2 is not a bottleneck (embeddings are computed once and cached). The accuracy difference on structured identifiers is the deciding factor.

Why Not a Fine-Tuned Model

A model fine-tuned on election name pairs would outperform any general-purpose model. We know this because the failure modes of text-embedding-3-large are systematic: it underscores nicknames (Charlie/Charles at 0.451) and overscores suffixes (Williams/Williams Jr at 0.862). A fine-tuned model trained on labeled pairs — “these are the same person” / “these are different people” — would learn that “Jr” is a strong negative signal and that “Charlie”/“Charles” is not.

We do not have training data yet.

Fine-tuning requires labeled pairs: hundreds to thousands of (name_a, name_b, same_person) triples with ground truth. Our prototype has 12 manually verified pairs. The L3 decision log will eventually contain thousands of LLM-confirmed match/no-match decisions — each one a potential training example. This is an active learning loop:

  1. L3 uses the general-purpose model to retrieve candidates.
  2. The LLM confirms or rejects matches, producing labeled pairs.
  3. The labeled pairs train a fine-tuned embedding model.
  4. The fine-tuned model replaces text-embedding-3-large at L2, improving retrieval.
  5. Better retrieval surfaces harder cases for the LLM, producing more informative training data.

This loop is planned but not yet implemented. It requires the pipeline to run at scale first, generating enough decisions for a meaningful training set. In the meantime, text-embedding-3-large with the 5-step cascade produces correct results on every tested pair — the LLM compensates for the embedding model’s weaknesses.

Thresholds Are Model-Specific

The calibrated thresholds — auto-accept ≥ 0.95, ambiguous 0.35–0.95, auto-reject < 0.35 — are specific to text-embedding-3-large with 3,072 dimensions. A different model produces different similarity distributions. MiniLM’s Williams Jr score of 0.91 vs. text-embedding-3-large’s 0.862 illustrates the problem: the same pair lands in different threshold zones depending on the model.

If the model changes, recalibration is required; a sketch of one calibration heuristic follows the list:

  1. Re-embed all test cases with the new model.
  2. Plot the score distribution for known matches and known non-matches.
  3. Find the auto-accept, ambiguous, and auto-reject boundaries that minimize false positives and false negatives.
  4. Update the threshold configuration and document the new model in L2 metadata.
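
One simple heuristic for step 3, sketched under the assumption that labeled match and non-match scores are available as arrays. The real calibration adds wider safety margins, as the 0.35/0.95 boundaries show:

```python
import numpy as np

def calibrate(match_scores: np.ndarray, nonmatch_scores: np.ndarray,
              margin: float = 0.01) -> dict:
    """Place auto-reject below the weakest known match and auto-accept
    above the strongest known non-match; the overlap between them is
    the ambiguous zone routed to the LLM."""
    return {
        "auto_reject_below": float(match_scores.min()) - margin,
        "auto_accept_at": float(nonmatch_scores.max()) + margin,
    }
```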

The embedding_model field stored in every L2 record ensures that thresholds can always be traced to the model that produced the scores. If a record was embedded with text-embedding-3-large and the thresholds were calibrated for a hypothetical election-embed-v1, the mismatch is detectable.

Summary

| Property | text-embedding-3-large | MiniLM | Fine-tuned (future) |
|---|---|---|---|
| Dimensions | 3,072 | 384 | TBD |
| API required | Yes | No | Depends |
| Cost per 1M tokens | ~$0.13 | $0 | $0 (local) |
| Williams Jr score | 0.862 | ~0.91 | Lower (trained) |
| Crist score | 0.451 | ~0.38 | Higher (trained) |
| Training data needed | No | No | Yes (not yet available) |
| Reproducible across machines | Yes | Requires version pinning | Requires version pinning |

The current choice is text-embedding-3-large — good enough for the cascade to work, available today, and reproducible without local model management. The long-term path is a fine-tuned model trained on the L3 decision log. The thresholds, the cascade design, and the LLM confirmation step all exist to compensate for the general-purpose model’s known weaknesses until that fine-tuned model is ready.

Composite String Templates

Embeddings are not generated from raw candidate names. They are generated from composite strings that combine name components with contextual fields — office, state, county, party. This context helps the embedding model distinguish people who share a name but hold different offices in different states. It also introduces a failure mode: context bleed, where shared context artificially inflates similarity between unrelated candidates.

The Three Templates

Each L2 record generates up to three composite strings, one per embedding type:

| Type | Template | Purpose |
|---|---|---|
| Candidate | `{canonical_first} {middle} {last} {suffix} \| {party} \| {office} \| {state} \| {county}` | Entity resolution across sources and elections |
| Contest | `{raw_name} \| {office_level} \| {state} {year}` | Contest entity resolution across naming variants |
| Geography | `{municipality}, {county} County, {state}` | Geographic entity resolution for precinct/place matching |

The pipe character (|) is a deliberate separator. It signals to the tokenizer that the fields on either side are distinct semantic units, not a continuous phrase. Without separators, “Timothy Lance DEM” could be tokenized as a three-word name rather than a name followed by a party.

Real Composite Examples

| Candidate | Composite String |
|---|---|
| Timothy Lance (NC, Columbus County school board) | `Timothy Lance \| \| BOARD OF EDUCATION DISTRICT 02 \| NC \| Columbus` |
| Charlie Crist (FL, Governor, DEM) | `Charles Crist \| DEM \| Governor \| FL \| statewide` |
| CRIST, CHARLES JOSEPH (FL, Governor, DEM) | `Charles Joseph Crist \| DEM \| Governor \| FL \| statewide` |
| David S Marshall (ME, State Legislature) | `David S Marshall \| \| State Legislature \| ME \| statewide` |
| David A Marshall (FL, County Commission) | `David A Marshall \| \| County Commission \| FL \| Broward` |

Note that canonical_first is used, not first. Charlie Crist’s composite uses Charles (from the nickname dictionary), not Charlie. This means the MEDSL record (CRIST, CHARLES JOSEPH → canonical_first Charles) and the OpenElections record (Charlie Crist → canonical_first Charles) produce composites with matching first-name tokens. The remaining divergence — Joseph as a middle name — is small enough that the embedding score rises significantly compared to the raw-name embedding.

Empty components produce empty slots. Timothy Lance has no middle name, no suffix, and no party in the NC SBE data. The composite retains the pipe separators with empty fields: Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus. This keeps the template structure consistent across all records, which stabilizes tokenization.

Why Context Helps: The David Marshall Test

David S. Marshall ran for state legislature in Maine. David A. Marshall ran for county commission in Florida. They are different people. Without context, the embedding model sees two very similar strings.

We measured the effect of context on cosine similarity:

| Composite A | Composite B | Cosine |
|---|---|---|
| `David Marshall` | `David Marshall` | 1.000 |
| `David Marshall \| ME` | `David Marshall \| FL` | 0.7025 |
| `David S Marshall \| ME` | `David A Marshall \| FL` | 0.6448 |
| `David S Marshall \| \| State Legislature \| ME` | `David A Marshall \| \| County Commission \| FL` | 0.581 |

Each additional contextual field pushes the vectors further apart:

  • State alone drops similarity from 1.0 to 0.7025. The model encodes ME and FL as distinct tokens that pull the vectors in different directions.
  • Middle initial drops it further to 0.6448 — a 0.058 reduction. The single character S vs A produces measurably different vectors because it changes the token sequence before the separator.
  • Office context drops it to 0.581. “State Legislature” and “County Commission” are semantically distinct, adding another axis of divergence.

At 0.581, this pair falls well within the ambiguous zone (0.35–0.95) and is routed to the LLM, which correctly rejects the match based on different states, different offices, and different middle initials. Without context, the pair scores 1.0 — an automatic merge of two different people.

The middle-initial contribution (0.058) may seem small, but it matters at the margins. For pairs where state and office are the same — a father and son both serving on the same county commission — the middle initial may be the only signal distinguishing them.

Why Context Hurts: The Context Bleed Problem

Context is not free. Shared context tokens contribute to vector similarity even when the names themselves are unrelated. This is context bleed.

Consider two candidates in the same NC school district block:

  • Aaron Bridges: Aaron Bridges | | SCHOOL BOARD | NC | Columbus
  • Daniel Blanton: Daniel Blanton | | SCHOOL BOARD | NC | Columbus

These are completely different people. But their composites share every context token (SCHOOL BOARD, NC, Columbus) plus the pipe separators. The embedding model encodes these shared tokens into both vectors, producing a cosine similarity of approximately 0.55–0.65 — well above what the names alone would produce (~0.20) and squarely in the ambiguous zone.

In our prototype, all 30 wasted LLM calls were on pairs exactly like this: different people with different names whose shared context inflated their embedding scores into the ambiguous zone. The step 2.5 gate (JW on last names < 0.50 → skip) was added specifically to short-circuit these context-bleed false alarms before they reach the LLM.

Measuring the bleed

We tested context contribution by varying which fields are included:

  • Name only: “Aaron Bridges” vs “Daniel Blanton” → ~0.21
  • Name + state: “Aaron Bridges | NC” vs “Daniel Blanton | NC” → ~0.38
  • Name + state + office + county: full composites → ~0.60

Each shared context field adds approximately 0.15–0.20 to the cosine score. For same-name pairs (the cases entity resolution cares about), this boost is helpful — it confirms that two similar names in the same context are likely the same person. For different-name pairs, the same boost is harmful — it inflates scores past the reject threshold.

The step 2.5 gate resolves this asymmetry. If the names themselves are dissimilar (JW < 0.50 on last names), the context-inflated embedding score is irrelevant — the pair is skipped. If the names are similar (JW ≥ 0.50), the context inflation is welcome — it adds corroborating evidence that the similar names in the same context are the same person.
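
Putting the thresholds together, the routing around the gate might look like the following sketch. It assumes a Jaro-Winkler implementation such as the strsim crate's jaro_winkler, uses the thresholds described in this chapter, and omits the same-state check for brevity:

enum Route {
    AutoAccept,        // cosine >= 0.95: merge without an LLM call
    AutoReject,        // cosine < 0.35: never the same person
    SkippedByNameGate, // step 2.5: last names too dissimilar to bother the LLM
    SendToLlm,         // ambiguous zone with plausible last names
}

fn route_pair(cosine: f64, last_a: &str, last_b: &str) -> Route {
    if cosine >= 0.95 {
        Route::AutoAccept
    } else if cosine < 0.35 {
        Route::AutoReject
    } else if strsim::jaro_winkler(&last_a.to_lowercase(), &last_b.to_lowercase()) < 0.50 {
        // Context bleed can push unrelated names into the ambiguous zone;
        // dissimilar last names short-circuit before any LLM call is made.
        Route::SkippedByNameGate
    } else {
        Route::SendToLlm
    }
}

Bridges vs Blanton at ~0.60 lands in the ambiguous zone but fails the gate (their last names score just under 0.50 on Jaro-Winkler), so the pair is skipped; the Marshall pair at 0.581 passes the gate and goes to the LLM.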

Design Tradeoffs

Why not embed names without context?

Bare-name embeddings eliminate context bleed but lose the disambiguation power demonstrated by the David Marshall test. A bare “David Marshall” vs “David Marshall” scores 1.0 — the model cannot distinguish them at all. Context is the only mechanism the embedding model has to separate same-name, different-person pairs.

Why not use separate embeddings for name and context?

An alternative architecture: embed the name and context separately, then combine scores with weighted averaging. This eliminates context bleed (the name embedding is pure name similarity) while retaining context as a separate signal.

This approach is viable but adds complexity — two embeddings per record instead of one, a tunable weight parameter, and a more complex similarity function. The current single-composite design is simpler and works well with the step 2.5 gate mitigating the primary failure mode. If context bleed proves problematic at scale, split embeddings are a planned fallback.

Why not fine-tune?

A fine-tuned embedding model trained on election name pairs could learn that Charlie and Charles are similar, that Jr is categorically significant, and that shared context should not inflate scores for dissimilar names. We do not have training data yet.

However, L3 decisions are labeled examples: every LLM match/no-match decision with its confidence and reasoning is a training pair. As the pipeline processes more data, the L3 decision log becomes a natural training set for active learning. A fine-tuned model trained on thousands of L3 decisions would, in principle, learn the domain-specific similarity function that the general-purpose text-embedding-3-large approximates. This is a future direction, not a current capability.

Summary

Property | Effect | Mitigation
Context included | Distinguishes same-name, different-person pairs (David Marshall: 1.0 → 0.581) | — (this is the goal)
Context bleed | Inflates scores for different-name, same-context pairs (Bridges vs Blanton: 0.21 → 0.60) | Step 2.5 JW gate on last names
Middle initial included | Provides disambiguation signal (0.7025 → 0.6448) | — (this is the goal)
Nickname dictionary applied | Aligns canonical first names before embedding (Charlie → Charles) | — (this is the goal)

The composite template is a tradeoff between disambiguation power and noise tolerance. Context helps more than it hurts — but only because the step 2.5 gate exists to catch the cases where it hurts.

When the LLM Gets Called (And When It Doesn’t)

The LLM is a confirmation tool, not a discovery tool. It is called when cheaper methods have narrowed the problem to a specific, bounded question. It is never called when a deterministic method produces correct results.

This boundary is enforced by pipeline structure, not by discipline. L0 and L1 have no LLM code paths. L2 has none. The LLM is reachable only from L3 (entity resolution and tier 4 office classification) and L4 (entity auditing). A developer cannot accidentally add an LLM call to the parser — the parser runs at L1, which has no API client.

When the LLM Is Called

Three situations invoke the LLM. Each is a bounded question with structured input and a constrained output format.

1. Ambiguous Entity Matches (L3, Step 4)

Trigger: Embedding cosine similarity between 0.35 and 0.95 AND the name similarity gate passed (JW on last names ≥ 0.50) AND both candidates are in the same state.

Input: Structured name components for both candidates, embedding score, JW score, vote counts, office, state, party.

Output: match/no-match, confidence (0.0–1.0), free-text reasoning.

Model: Claude Sonnet.

Volume: 3.5% of candidate pairs in our prototype (30 calls out of ~850 comparisons). With the step 2.5 gate in place, this drops to near-zero for within-source matching and rises for cross-source matching where name formats diverge.
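
The exchange is structured on both sides. The shapes below are an assumption about how the bounded question and the logged answer could be typed; they are not the crate's actual definitions:

// Hypothetical shapes for the bounded question and the structured answer.
struct MatchQuestion {
    name_a: String,      // structured components, rendered for the prompt
    name_b: String,
    embedding_cosine: f64,
    last_name_jw: f64,
    votes_a: u64,
    votes_b: u64,
    office: String,
    state: String,
    party_a: Option<String>,
    party_b: Option<String>,
}

struct MatchDecision {
    is_match: bool,
    confidence: f64,     // 0.0–1.0
    reasoning: String,   // free text; stored verbatim in the JSONL audit log
}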

Real examples:

Pair | Cosine | LLM decision | Why LLM was needed
Charlie Crist / CRIST, CHARLES JOSEPH | 0.451 | match (0.95) | Nickname below any safe auto-accept threshold
Robert Williams / Robert Williams Jr | 0.862 | no match (0.85) | Suffix above old auto-accept; only LLM catches generational distinction
Nicole Fried / FRIED, NIKKI | 0.642 | match (0.92) | Nickname in ambiguous zone

2. Tier 4 Office Classification (L2→L3 boundary)

Trigger: Office name was not classified by keyword (tier 1), regex (tier 2), or embedding nearest-neighbor with cosine ≥ 0.60 (tier 3).

Input: Office name string, state, county, the full taxonomy of (office_level, office_branch) pairs.

Output: Classification pair, confidence (0.0–1.0), reasoning.

Model: Claude Sonnet.

Volume: ~0.5% of unique office names in MEDSL 2022 (~42 of 8,387). By record count, far less — these are the rarest, most obscure offices.

Real examples:

Office name | State | LLM classification | Confidence
Santa Rosa Island Authority | FL | special_district / infrastructure | 0.90
Register of Mesne Conveyances | SC | county / judicial | 0.88
Hog Reeve | NH | municipal / regulatory | 0.60

3. L4 Entity Auditing

Trigger: An entity cluster contains records from multiple sources, multiple elections, or multiple office types. In the current design, every multi-member entity is audited (budget is not a constraint).

Input: The full entity cluster — canonical name, all aliases, all elections, all vote counts, all states, all offices.

Output: Plausibility assessment: plausible / suspicious / error, with reasoning.

Model: Claude Sonnet (Opus-class for flagged entities).

Volume: In the prototype, 50 entities were audited. The LLM flagged 43 as suspicious (precinct-level records inflating temporal chains — a bug in our aggregation, not in the data) and 4 as errors (“For” and “Against” classified as person entities). At production scale, the volume scales with the number of multi-member entities, not with total records.

When the LLM Is Not Called

Everything else. Specifically:

Operation | Layer | Method | Why not LLM
CSV/TSV/XML parsing | L1 | Source-specific parser | Deterministic; format is fixed per source
Name decomposition | L1 | Rule-based parser | Deterministic; name formats are enumerable
Nickname dictionary lookup | L1 | Hash table | O(1) lookup; no reasoning needed
FIPS code enrichment | L1 | Census reference table | Exact match on (state, county_name)
Vote share computation | L1 | Arithmetic | Division is deterministic
Hash computation | L1–L4 | SHA-256 | Cryptographic function; no reasoning needed
Office classification (tiers 1–2) | L1 | Keyword + regex | Deterministic; handles 62% of unique names
Office classification (tier 3) | L2 | Embedding nearest-neighbor | Deterministic given model version; handles 4.5% more
Embedding generation | L2 | OpenAI API | Deterministic given model version; not an LLM call
Exact name matching (step 1) | L3 | Structured field equality | Handles 70% of entity resolution
Jaro-Winkler matching (step 2) | L3 | String similarity | Deterministic; handles 0.1% more
Name gate (step 2.5) | L3 | JW on last names | Eliminates obvious non-matches
High-confidence embedding match (step 3) | L3 | Cosine ≥ 0.95 | Auto-accept; no ambiguity to resolve
Canonical name selection | L4 | Fixed algorithm | Most-complete + most-authoritative; no judgment needed
Temporal chain aggregation | L4 | Group-by on (entity_id, election_date) | SQL-style aggregation
Hash chain verification | L4 | SHA-256 recomputation | Cryptographic verification
Cross-source vote reconciliation | L4 | Arithmetic comparison | Exact or percentage-based comparison

The Principle

If a deterministic method handles it, do not add LLM latency and non-determinism.

This is not a cost argument. Budget is not a constraint. It is an accuracy and reproducibility argument:

  1. Deterministic methods do not hallucinate. SHA-256 always returns the same hash. FIPS lookup always returns the same code. An LLM might return a different FIPS code on a second call — not because it is wrong, but because it is probabilistic. For operations with known-correct deterministic solutions, adding an LLM is adding risk, not capability.

  2. Deterministic methods are reproducible. Re-running L1 on the same L0 files with the same parser version produces bit-identical output. Re-running an LLM-based parser may produce different field values. For a pipeline that serves journalists and researchers who need to cite specific numbers, reproducibility is non-negotiable for the operations that support it.

  3. Deterministic methods are fast. L1 processes 200 records in under a second. An LLM call takes 200–2,000ms. For the 70% of entity resolution handled by exact match and the 62% of office classification handled by keywords, the LLM adds latency with zero accuracy benefit.

The LLM is powerful. It correctly identified all 12 test pairs in entity resolution, including the Crist nickname case (0.451 cosine) that no threshold-based system could safely auto-resolve. It classified all 9 tier-4 office names correctly, including obscure offices like “Hog Reeve” that no reference set could anticipate.

But it is called only for the cases that need it: the 3.5% of entity comparisons in the ambiguous zone, the 0.5% of office names that no pattern matches, and the entity audit that catches contamination like ballot-measure choices misclassified as people. For everything else, the answer is already known — deterministically, reproducibly, and instantly.

Cross-References

Budget Is Not a Constraint — Speed and Reproducibility Are

This project has no API cost ceiling. Every LLM call that improves accuracy is worth making. This changes several design decisions compared to a cost-constrained pipeline — but it does not change the fundamental architecture. The cascade exists for speed and reproducibility, not for cost savings.

What Unlimited Budget Changes

Wider Ambiguous Zone

The embedding similarity thresholds for entity resolution were widened specifically because cost is not a constraint:

Parameter | Cost-constrained | Our design
Ambiguous zone | 0.65–0.82 | 0.35–0.95
Zone width | 0.17 | 0.60
Pairs reaching LLM | ~5% of within-block pairs | ~25% of within-block pairs

The wider zone sends roughly 5× more pairs to the LLM. At $0.0002 per call, the difference between 10,000 calls and 50,000 calls is $8. At production scale with millions of pairs, the difference might reach hundreds of dollars. Neither figure justifies accepting false positives (Williams Jr at 0.862) or false negatives (Crist at 0.451) that a narrower zone would cause.

Stronger Model for Tiebreakers

Step 5 of the entity resolution cascade escalates low-confidence LLM decisions (confidence < 0.70 from Claude Sonnet) to an Opus-class model. The stronger model costs approximately 10× more per call but is invoked only for the lowest-confidence subset of an already-small LLM cohort.

A cost-constrained pipeline would re-run the same Sonnet model or defer to human review. We use the stronger model because the marginal cost per call (~$0.002) is negligible and the accuracy gain on edge cases is measurable.

Full L4 Entity Audit

The L4 LLM entity audit examines every multi-member entity — not a sample. In the prototype, 50 entities were audited, catching 43 suspicious records and 4 errors. At production scale with tens of thousands of multi-member entities, full audit coverage means thousands of LLM calls.

A cost-constrained pipeline would sample 5–10% of entities and extrapolate. We audit 100% because the cost of missing a contaminated entity (ballot measure choices classified as people, precinct-level records inflating temporal chains) is higher than the cost of the API calls. The “For” and “Against” error was caught by the full audit — a 10% sample might have missed it.

Tier 4 Office Classification Without Hesitation

Every unclassified office name that survives tiers 1–3 goes to the LLM. There is no “batch the cheapest 80% and skip the rest” optimization. All ~42 hard cases in our prototype were classified. At national scale, the long tail of hyper-local office names (township-specific roles, water district sub-boards, tribal offices) may produce hundreds of tier 4 calls per election cycle. The cost is trivial; the coverage gain is not.

What Unlimited Budget Does Not Change

The Cascade Still Exists

Sending every candidate pair directly to the LLM — skipping exact match, Jaro-Winkler, the name gate, and embedding retrieval — would produce correct results for most pairs. It would also be impossibly slow.

At 42 million rows, even with aggressive blocking, the number of within-block candidate pairs runs into the millions. At 200ms per LLM API call, one million pairs take 55 hours of serial wall-clock time. With 10× parallelism, that is still 5.5 hours — for a single step that exact match handles in seconds for 70% of cases.

The cascade is not a cost optimization. It is a speed optimization. Steps 1–3 process 76% of pairs in under a millisecond each. The LLM is reserved for the 3.5% where cheap methods cannot decide.

Deterministic Steps Are Still Preferred

Exact match, Jaro-Winkler, keyword classification, regex classification, FIPS lookup, vote share computation, and hash verification are deterministic. They produce identical output from identical input on every run, on every machine, forever.

LLM calls are non-deterministic. The same pair submitted twice may produce different confidence scores (typically within ±0.05) and occasionally different reasoning text. The decision (match/no-match) is stable in >99% of re-runs, but “99% stable” is not “deterministic.”

For a pipeline that serves journalists citing specific numbers and researchers publishing reproducible analyses, determinism is not a preference — it is a requirement for the operations that support it. We use deterministic methods wherever they produce correct results, not because they are cheaper, but because they are trustworthy in a way that probabilistic methods are not.

LLMs Do Not Parse, Enrich, or Compute

No amount of budget makes it sensible to use an LLM for:

  • Parsing CSV/TSV/XML. The format is fixed per source. A parser handles it in microseconds with zero error rate.
  • FIPS lookup. A hash table lookup on (state, county_name) returns the correct code every time. An LLM might hallucinate a FIPS code — “37047” for Columbus County NC is correct, but there is no mechanism to verify the LLM’s output without the same lookup table that makes the LLM unnecessary.
  • SHA-256 computation. Cryptographic hash functions are mathematical operations. An LLM cannot compute them.
  • Vote share arithmetic. 303 / 580 = 0.5224. A calculator is correct. An LLM might round differently, truncate, or occasionally hallucinate.

These operations have known-correct deterministic solutions. Adding an LLM to any of them introduces risk with zero benefit, regardless of budget.

Reproducibility Requires Logged Decisions

Every LLM decision at L3 and L4 is stored in a JSONL audit log with the full prompt, response, confidence, and reasoning. This is not primarily a cost-saving measure (although replaying from the log does avoid re-calling the LLM). It is a reproducibility measure: a researcher who wants to verify or contest a match decision can read the log, see the LLM’s reasoning, and evaluate whether the decision was correct.

If budget were infinite and API calls were instantaneous, we would still log every decision. The log is not a cache — it is the canonical record of how the pipeline resolved ambiguity. Deleting the log and re-running the LLM would produce a slightly different set of confidence scores, which might shift a small number of borderline decisions, which would change downstream entity assignments. The log prevents this drift.

The Real Constraints

Budget is not a constraint. The real constraints are:

Constraint | Effect on design
Wall-clock time | The cascade exists because LLM calls at scale take hours; exact match takes seconds
Reproducibility | Deterministic methods preferred; LLM decisions logged for replay
Accuracy | Wider ambiguous zone, stronger tiebreaker model, full audit coverage
Auditability | Every decision logged with reasoning; hash chain from L4 to L0
Correctness | Deterministic methods used wherever they produce correct results; LLMs used only for genuine ambiguity

A budget-constrained version of this pipeline would narrow the ambiguous zone, sample the entity audit, skip tier 4 office classification for rare offices, and use the same model for tiebreakers. All of these are accuracy trade-offs. We make none of them.

The cascade’s structure — exact match → JW → gate → embedding → LLM → tiebreaker — is identical whether the budget is $10 or $10,000. The thresholds move. The model choices change. The architecture does not.

Schema Overview

The unified schema defines the structure of every election record at every pipeline layer. A single record represents one candidate’s (or one ballot measure choice’s) vote count in one geographic unit for one contest. All sources — MEDSL, NC SBE, OpenElections, VEST, Clarity — are normalized into this schema at L1. Subsequent layers (L2–L4) add fields but never remove them.

A record has seven sections: election, jurisdiction, contest, results, turnout, source, and provenance. Not every field is populated for every record. Fields that the source does not provide are null, not inferred.


Election

Identifies which election this record belongs to.

Field | Type | Description | Example
date | date | Election date (ISO 8601) | 2022-11-08
year | integer | Election year, derived from date | 2022
type | ElectionType | General, primary, runoff, special, etc. | General
stage | string | Source-provided stage code | GEN
special | boolean | Whether this is a special election | false
certification_status | string | Certified, unofficial, or unknown | certified

The type field is an enum — see Enumerations Reference. The stage field preserves the raw source value (MEDSL uses GEN/PRI/RUN; NC SBE does not have a stage column). The certification_status field reflects whether the source data represents certified results. NC SBE and MEDSL publish certified data. Clarity publishes unofficial election night results that may be updated.


Jurisdiction

Identifies the geographic unit where votes were counted.

Field | Type | Description | Example
state | string | Full state name | North Carolina
state_po | string | Two-letter postal code | NC
state_fips | string | Two-digit state FIPS code | 37
county | string | County name (may be null for statewide) | Wake
county_fips | string | Five-digit county FIPS code | 37183
precinct | string | Precinct name or code from the source | 01-01
precinct_code | string | Numeric precinct code (NC SBE only) | 0101
jurisdiction_name | string | Jurisdiction name from MEDSL | WAKE
jurisdiction_fips | string | Jurisdiction FIPS from MEDSL | 37183
ocd_id | string | Open Civic Data identifier (when available) | ocd-division/country:us/state:nc/county:wake
level | JurisdictionLevel | Geographic granularity of this record | Precinct

The county_fips field is the primary geographic join key across sources. It is enriched from Census FIPS reference files at L1 when the source provides a county name but no code. The ocd_id field is populated when a mapping exists; it is null for most records today.

The level field indicates what geographic unit this row represents. Most records are Precinct. Some sources provide only county-level aggregates (County). VEST data with precinct boundaries is Precinct with accompanying geometry.


Contest

Describes the race or ballot measure.

Field | Type | Description | Example
kind | ContestKind | CandidateRace, BallotMeasure, or TurnoutMetadata | CandidateRace
raw_name | string | Contest name exactly as it appears in the source | CABARRUS COUNTY SCHOOLS BOARD OF EDUCATION
normalized_name | string | Cleaned contest name (L1+) | Cabarrus County Schools Board of Education
office_level | OfficeLevel | Federal, state, county, municipal, etc. | County
office_category | OfficeCategory | Executive, legislative, judicial, school board, etc. | SchoolBoard
district | string | District number or name (blank if at-large) | DISTRICT 02
dataverse | string | MEDSL’s race level tag (blank for local) | (blank)
classifier_method | ClassifierMethod | How office_level and office_category were assigned | Keyword
vote_for | integer | Maximum number of candidates a voter may select | 1
magnitude | integer | Number of seats being filled | 3
is_retention | boolean | Whether this is a judicial retention election | false

The kind field is an enum with three variants — see Contest Kinds. The distinction between CandidateRace, BallotMeasure, and TurnoutMetadata is determined at L1 based on the contest name and choice values.

The classifier_method field records how the office_level and office_category were assigned: Keyword (deterministic string match, 62% of records), Regex (pattern-based, ~15%), Embedding (nearest-neighbor at L2), or Llm (LLM classification at L3). This field exists so that users can filter by classification confidence.

The vote_for field comes from NC SBE’s Vote For column. MEDSL does not provide this field. When unavailable, it defaults to null. The magnitude field comes from MEDSL’s magnitude column and indicates multi-member districts.


Results

An array of candidate results attached to the contest. For a CandidateRace, each element is one candidate. For a BallotMeasure, each element is one choice (e.g., “For”, “Against”). For TurnoutMetadata, the results array is empty.

Field | Type | Description | Example
candidate_name | CandidateName | Decomposed name — see below | (see Name Components)
party_raw | string | Party label exactly as source provides | LIBERTARIAN
party_simplified | PartySimplified | Normalized party enum | Libertarian
votes_total | integer | Total votes for this candidate in this precinct | 90
vote_share | float | Fraction of total contest votes (computed) | 0.023
writein | boolean | Whether this is a write-in candidate | false
incumbent | boolean | Whether this candidate is the incumbent (if known) | null
vote_counts_by_type | VoteCountsByType | Breakdown by vote method — see below | (see below)

CandidateName

Names are decomposed into components rather than stored as a single string. This is documented in detail in Candidate Name Components.

Field | Type | Description | Example
raw | string | Name exactly as it appears in the source | MICHAEL "STEVE" HUBER
first | string | Parsed first name | Michael
middle | string | Parsed middle name or initial | null
last | string | Parsed last name | Huber
suffix | string | Jr, Sr, II, III, IV, etc. | null
nickname | string | Detected nickname | Steve
canonical_first | string | Nickname-resolved first name | Stephen

The raw field is preserved at every layer and never modified. The component fields are populated at L1 during name parsing. The canonical_first field is populated at L1 using the nickname dictionary (e.g., Charlie→Charles, Steve→Stephen, Pat→Patricia). All fields are available at every pipeline layer.

VoteCountsByType

When the source provides vote mode breakdowns, they are stored here. NC SBE provides all four fields for every contest. MEDSL provides them when modes are split into separate rows (summed during L1). Most other sources provide only the total.

Field | Type | Description | Example
election_day | integer | Election day votes | 136
early | integer | Early / one-stop votes | 159
absentee_mail | integer | Mail-in absentee votes | 7
provisional | integer | Provisional ballot votes | 1

NC SBE calls early voting “One Stop.” MEDSL calls it “EARLY VOTING.” Both are mapped to the early field at L1.


Turnout

Voter registration and participation counts for the geographic unit. These fields are sparsely populated — less than 5% of records have values.

Field | Type | Description | Example
registered_voters | integer | Number of registered voters in this precinct | 2847
ballots_cast | integer | Total ballots cast in this precinct | 1893
turnout_pct | float | ballots_cast / registered_voters (computed) | 0.665

NC SBE provides registered_voters via “Registered Voters” pseudo-contest rows. These are extracted during L1 parsing and attached to the precinct’s turnout object. MEDSL rarely includes registration counts. Most records have null turnout.


Source

Provenance fields that document where this record came from.

Field | Type | Description | Example
source_type | SourceType | Enum identifying the source system | Medsl
source_file | string | Filename of the L0 artifact | 2022-nc-local-precinct-general.csv
source_row | integer | Row number in the source file | 14523
retrieval_date | datetime | When the source file was downloaded (UTC) | 2025-01-15T03:22:00Z
confidence | Confidence | High, Medium, or Low | Medium
raw_fields | SourceRawFields | All original columns from the source, typed per source | (see below)

SourceRawFields

The raw_fields object preserves every column from the original source row, typed as an enum per source. This ensures no information is lost during normalization.

Variant | Source | Fields preserved
MedslRawRecord | MEDSL | All 25 MEDSL columns including state_cen, state_ic, readme_check, version
NcsbeRawRecord | NC SBE | All 15 NC SBE columns including Contest Group ID, Contest Type, Real Precinct
OpenElectionsRawRecord | OpenElections | Variable columns depending on state file
VestRawRecord | VEST | Encoded column names and geometry reference
ClarityRawRecord | Clarity | XML element attributes
FecRawRecord | FEC | All 15 cn.txt columns
CensusRawRecord | Census | FIPS file columns

Each variant is a struct with typed fields matching the source schema. This is a Rust enum, not a JSON object — the type system ensures you cannot accidentally read an NC SBE field from a MEDSL record. See Type System Design.


Provenance

Hash chain and version metadata that enable verification and reproducibility.

Field | Type | Description | Example
record_id | string | Deterministic hash of (source, file, row) | a3f8c2...
l1_hash | string | SHA-256 hash of this L1 record’s content | 7b2e91...
l0_parent_hash | string | SHA-256 hash of the L0 source artifact | c4d1f0...
l0_byte_offset | integer | Byte offset in the L0 file where this row starts | 1048576
parser_version | string | Version of the parser that produced this record | 0.1.0
schema_version | string | Version of the schema this record conforms to | 1.0.0

The hash chain links every record back to the original source bytes. If the L1 record is modified, its l1_hash changes and no longer matches the hash stored in any L2 record that references it. The verification algorithm at L4 checks the full chain: L4 → L3 → L2 → L1 → L0 → source bytes.

The record_id is deterministic: identical source input always produces the same record_id. This enables deduplication and makes re-processing idempotent.
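
A sketch of how such a deterministic id could be computed with the sha2 and hex crates; the exact field encoding and separator byte are assumptions, not the crate's actual scheme:

use sha2::{Digest, Sha256};

fn record_id(source: &str, file: &str, row: u64) -> String {
    let mut hasher = Sha256::new();
    for part in [source, file] {
        hasher.update(part.as_bytes());
        // Unit-separator byte between fields, so ("ab", "c") and
        // ("a", "bc") hash differently.
        hasher.update([0x1f]);
    }
    hasher.update(row.to_le_bytes());
    hex::encode(hasher.finalize())
}

Because the hash depends only on the (source, file, row) triple, re-running the pipeline over the same L0 artifact reproduces the same ids, which is what makes deduplication and idempotent re-processing possible.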


Layer-Specific Additions

Each pipeline layer adds fields to the record. The base schema (above) is fully populated at L1. Subsequent layers extend it:

Layer | Fields added
L2 (Embedded) | candidate_name_embedding, contest_name_embedding, jurisdiction_embedding, embedding_model, embedding_version
L3 (Matched) | candidate_cluster_id, contest_cluster_id, match_confidence, match_method
L4 (Canonical) | canonical_candidate_name, canonical_contest_name, temporal_chain_id, verification_status, alias_table

L1 records are self-contained. L2+ records reference their parent layer’s hash. No fields from earlier layers are removed or overwritten — each layer is additive.


JSONL Representation

At every layer, records are serialized as one JSON object per line (JSONL). The seven sections are top-level keys:

{"election":{"date":"2022-11-08","year":2022,"type":"General",...},"jurisdiction":{"state":"North Carolina","state_po":"NC",...},"contest":{"kind":"CandidateRace","raw_name":"CABARRUS COUNTY SCHOOLS BOARD OF EDUCATION",...},"results":[{"candidate_name":{"raw":"GREG MILLS","first":"Greg","last":"Mills",...},"votes_total":79,...}],"turnout":null,"source":{"source_type":"Medsl","source_file":"2022-nc-local-precinct-general.csv",...},"provenance":{"record_id":"a3f8c2...","l1_hash":"7b2e91...",...}}

Files are streamable: each line is a complete record. Files are appendable: new records can be concatenated without modifying existing lines. Serialization uses serde_json in Rust. See Output Formats.
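
A minimal sketch of the round trip with serde_json, using a toy two-field record in place of the full seven-section schema:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct TinyRecord {
    record_id: String,
    votes_total: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let rec = TinyRecord { record_id: "a3f8c2".into(), votes_total: 79 };
    // One object per line: serialize, then append a newline when writing.
    let line = serde_json::to_string(&rec)?;
    // Reading back: each line parses independently, which is what makes
    // the files streamable and appendable.
    let parsed: TinyRecord = serde_json::from_str(&line)?;
    assert_eq!(parsed.votes_total, 79);
    Ok(())
}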

Contest Kinds: CandidateRace, BallotMeasure, TurnoutMetadata

Every record in the pipeline belongs to exactly one of three contest kinds. This is modeled as a type-level enum — not a string field — so that invalid combinations are rejected at compile time rather than discovered at query time.

Why three kinds

Election data files mix three fundamentally different things in the same tabular format:

  1. A candidate running for office and receiving votes.
  2. A ballot measure (bond, referendum, constitutional amendment) where voters choose “Yes” or “No.”
  3. A metadata row recording registered voters or ballots cast for a precinct, masquerading as a contest.

Sources do not distinguish these. MEDSL puts REGISTERED VOTERS in the office column as if it were a race. NC SBE creates a “contest” called Registered Voters - Total with a “candidate” whose vote count is actually the registration total. Florida OpenElections has 6,013 rows where office = "Registered Voters" — 67.9% of all non-candidate records in the initial FL load.

If these are not separated at parse time, downstream analysis produces nonsense: “Registered Voters” appears as the most popular candidate in America, “For” shows up as a person’s name in entity resolution, and vote totals are inflated by turnout metadata.

The enum

enum ContestKind {
    CandidateRace {
        results: Vec<CandidateResult>,
    },
    BallotMeasure {
        choices: Vec<BallotChoice>,
        measure_type: BallotMeasureType,
        passage_threshold: Option<f64>,
    },
    TurnoutMetadata {
        registered_voters: Option<u64>,
        ballots_cast: Option<u64>,
    },
}

Each variant carries different fields. You cannot accidentally attach a candidate_name to a ballot measure or a passage_threshold to a candidate race.

CandidateRace

The common case. A person is running for an office and received votes.

Field | Type | Description
results | Vec<CandidateResult> | One entry per candidate in the contest

Each CandidateResult contains:

Field | Type | Description
candidate_name | CandidateName | Decomposed name (raw, first, middle, last, suffix, nickname, canonical_first)
party | Party | Raw string + normalized enum
votes_total | u64 | Total votes received
vote_share | Option<f64> | Percentage of total contest votes
vote_counts_by_type | VoteCountsByType | Breakdown: election_day, early, absentee_mail, provisional

Examples of CandidateRace contests:

  • US SENATE — federal
  • GOVERNOR — state
  • COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 — local
  • SHERIFF — county

BallotMeasure

Voters choose between options (typically “For”/“Against” or “Yes”/“No”) on a proposition, bond, amendment, or referendum.

Field | Type | Description
choices | Vec<BallotChoice> | One entry per option
measure_type | BallotMeasureType | Bond, amendment, referendum, etc.
passage_threshold | Option<f64> | Required vote share for passage (e.g., 0.60 for a bond requiring 60%)

Each BallotChoice contains:

Field | Type | Description
choice_text | String | “For”, “Against”, “Yes”, “No”, or other option text
votes_total | u64 | Votes for this choice
vote_share | Option<f64> | Percentage of total votes

The BallotMeasureType enum distinguishes bonds, amendments, referenda, initiatives, recalls, and other mechanisms; the full list of values is given in the Enumerations Reference.

Why this prevents name confusion

Without the BallotMeasure variant, the L1 parser would treat “For” and “Against” as candidate names. They would flow into entity resolution at L3, where the system would try to find other elections where “For” ran for office. By assigning ballot measures to their own variant at parse time, the choice_text field is never passed to the name decomposition or embedding logic.

Detection at L1 uses two signals:

  • The contest name contains keywords: “bond”, “amendment”, “referendum”, “proposition”, “measure”, “levy”, “question”.
  • The choice values are in the set {“For”, “Against”, “Yes”, “No”, “Bonds”, “No Bonds”}.

TurnoutMetadata

Not a contest at all. These rows carry precinct-level registration and turnout counts that sources embed in the results file as pseudo-contests.

Field | Type | Description
registered_voters | Option<u64> | Registered voter count for this precinct
ballots_cast | Option<u64> | Total ballots cast in this precinct

Source examples that produce TurnoutMetadata records:

Source | office / Contest Name value | candidate / Choice value
MEDSL | REGISTERED VOTERS | REGISTERED VOTERS
MEDSL | BALLOTS CAST - TOTAL | BALLOTS CAST
NC SBE | Registered Voters - Total | (numeric total in vote column)
OpenElections FL | Registered Voters | (numeric total)

Detection at L1: the contest name matches a known set of turnout keywords (REGISTERED VOTERS, BALLOTS CAST, BALLOTS CAST - TOTAL, BALLOTS CAST - BLANK). When detected, the vote count is extracted into registered_voters or ballots_cast, and the record is tagged as TurnoutMetadata rather than CandidateRace.

These extracted turnout values backfill the turnout section of other records in the same precinct. Currently, turnout data is populated for less than 5% of records because most MEDSL state files do not include registration count rows.

Classification at L1

Contest kind assignment happens during L1 parsing — the deterministic layer. No ML, no embeddings, no API calls. The decision tree:

  1. Does the contest name match a turnout keyword? → TurnoutMetadata
  2. Do the choice values match ballot measure patterns (“For”/“Against”/“Yes”/“No”)? → BallotMeasure
  3. Does the contest name contain ballot measure keywords? → BallotMeasure
  4. Otherwise → CandidateRace
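
A minimal sketch of this decision tree in code; the keyword lists are abbreviated here, and the real sets are larger:

fn classify_contest(contest_name: &str, choices: &[&str]) -> &'static str {
    let name = contest_name.to_uppercase();
    // Step 1: turnout pseudo-contests masquerading as races.
    let turnout_keywords = ["REGISTERED VOTERS", "BALLOTS CAST"];
    if turnout_keywords.iter().any(|k| name.contains(k)) {
        return "TurnoutMetadata";
    }
    // Step 2: every choice looks like a ballot-measure option.
    let measure_choices = ["FOR", "AGAINST", "YES", "NO", "BONDS", "NO BONDS"];
    if !choices.is_empty()
        && choices.iter().all(|c| measure_choices.contains(&c.to_uppercase().as_str()))
    {
        return "BallotMeasure";
    }
    // Step 3: the contest name itself signals a measure.
    let measure_keywords = ["BOND", "AMENDMENT", "REFERENDUM", "PROPOSITION", "MEASURE", "LEVY", "QUESTION"];
    if measure_keywords.iter().any(|k| name.contains(k)) {
        return "BallotMeasure";
    }
    // Step 4: the common case.
    "CandidateRace"
}

Run against the Florida ghost rows, classify_contest("Registered Voters", &[]) returns "TurnoutMetadata" before any choice values are even inspected.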

This classification is stored in the record and carried through all subsequent layers. L2 embeds only CandidateRace records for entity resolution. L3 matches only CandidateRace records. BallotMeasure and TurnoutMetadata records pass through L2–L4 without modification beyond provenance tracking.

Candidate Name Components

Election data sources represent candidate names as a single string. The formats are incompatible across sources — and sometimes within the same source across years. The pipeline decomposes every name into structured components at L1 and preserves all components through every subsequent layer.

Why decomposition instead of a single string

A single name field cannot support entity resolution. Consider matching these records:

Source | Raw name string
MEDSL | SHANNON W BRAY
NC SBE | Shannon W. Bray
FEC | BRAY, SHANNON W

String equality fails on all three pairs. Lowercasing and stripping punctuation gets MEDSL and NC SBE closer, but FEC’s last-first ordering still breaks. Decomposing into {first: Shannon, middle: W, last: Bray} makes all three identical after normalization.

The harder case is nicknames:

Source | Raw name string | What a human sees
MEDSL | MICHAEL "STEVE" HUBER | First name Michael, goes by Steve
NC SBE | Michael (Steve) Huber | Same person
OpenElections | Steve Huber | Same person, nickname only

Without decomposition, matching Steve Huber to MICHAEL "STEVE" HUBER requires the system to know that Steve is a nickname present in one variant but used as the primary name in another. The nickname and canonical_first fields make this explicit.

Component fields

Every candidate name in the pipeline is represented as a struct with seven fields:

Field | Type | Description | Populated at
raw | String | Original name string exactly as it appeared in the source. Never modified. | L1
first | Option<String> | Parsed first name | L1
middle | Option<String> | Parsed middle name or initial | L1
last | Option<String> | Parsed last name | L1
suffix | Option<String> | Generational suffix: Jr, Sr, II, III, IV | L1
nickname | Option<String> | Detected nickname, extracted from quotes or parentheses | L1
canonical_first | Option<String> | Nickname-resolved first name. If first has a known nickname mapping, this holds the canonical form. | L1

All fields are available at every layer (L1 through L4). Later layers may refine values but never discard earlier ones.

Parsing rules by source

MEDSL

Names are ALL CAPS, no periods after initials, nicknames in double quotes, suffixes without commas.

Raw | first | middle | last | suffix | nickname | canonical_first
SHANNON W BRAY | Shannon | W | Bray | | | Shannon
MICHAEL "STEVE" HUBER | Michael | | Huber | | Steve | Michael
ROBERT VAN FLETCHER JR | Robert | Van | Fletcher | Jr | | Robert
LM "MICKEY" SIMMONS | L | M | Simmons | | Mickey | L
VICTORIA P PORTER | Victoria | P | Porter | | | Victoria
WRITEIN | | | | | |

WRITEIN is a sentinel value, not a person name. It is flagged at L1 and excluded from name decomposition.

NC SBE

Names are Title Case, periods after initials, nicknames in parentheses, commas before suffixes.

Raw | first | middle | last | suffix | nickname | canonical_first
Shannon W. Bray | Shannon | W | Bray | | | Shannon
Michael (Steve) Huber | Michael | | Huber | | Steve | Michael
Robert Van Fletcher, Jr. | Robert | Van | Fletcher | Jr | | Robert
Patricia (Pat) Cotham | Patricia | | Cotham | | Pat | Patricia
William Irvin. Enzor III | William | Irvin | Enzor | III | | William

The period after “Irvin.” in the last example is a data entry artifact. The parser strips trailing periods from middle names.

FEC

Names are LAST, FIRST MIDDLE format, all caps.

Raw | first | middle | last | suffix | nickname | canonical_first
BRAY, SHANNON W | Shannon | W | Bray | | | Shannon
BIDEN, JOSEPH R JR | Joseph | R | Biden | Jr | | Joseph
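
A minimal sketch of decomposing the FEC format. This is a hypothetical helper, not the crate's parser, and casing normalization (JOSEPH → Joseph) is a separate step omitted here:

struct ParsedName {
    first: Option<String>,
    middle: Option<String>,
    last: Option<String>,
    suffix: Option<String>,
}

fn parse_fec_name(raw: &str) -> ParsedName {
    // FEC format: "LAST, FIRST MIDDLE [SUFFIX]", all caps.
    let Some((last, rest)) = raw.split_once(", ") else {
        // No comma: treat the whole string as a last name.
        return ParsedName { first: None, middle: None, last: Some(raw.to_string()), suffix: None };
    };
    let mut parts: Vec<&str> = rest.split_whitespace().collect();
    // Peel a trailing generational suffix before splitting first/middle.
    let suffix = match parts.last().copied() {
        Some(s) if ["JR", "SR", "II", "III", "IV"].contains(&s) => parts.pop().map(str::to_string),
        _ => None,
    };
    ParsedName {
        first: parts.first().map(|s| s.to_string()),
        middle: (parts.len() > 1).then(|| parts[1..].join(" ")),
        last: Some(last.to_string()),
        suffix,
    }
}

For BIDEN, JOSEPH R JR this yields first JOSEPH, middle R, last BIDEN, suffix JR, matching the table above once title-casing is applied.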

The canonical_first field

canonical_first resolves known nicknames to their formal equivalents using the nickname dictionary. This enables matching when one source uses a nickname and another uses the legal name.

first | nickname | canonical_first | Reasoning
Michael | Steve | Michael | First name is already formal
Charlie | | Charles | Charlie is a known nickname for Charles
Bob | | Robert | Bob is a known nickname for Robert
Patricia | Pat | Patricia | First name is already formal
Bill | | William | Bill is a known nickname for William
Jim | | James | Jim is a known nickname for James

When first is already a formal name, canonical_first equals first. When first is itself a nickname (as when OpenElections reports Charlie Crist without the legal name Charles), canonical_first resolves to the formal form.

The nickname dictionary contains approximately 1,200 mappings. It is deterministic — no ML, no API calls. Ambiguous cases (e.g., “Alex” could map to “Alexander” or “Alexandra”) are resolved by leaving canonical_first equal to first and deferring to embedding-based matching at L2.
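
A minimal sketch of the lookup, with a toy dictionary standing in for the ~1,200-entry real one:

use std::collections::HashMap;

// Ambiguous nicknames are simply absent from the dictionary, so
// canonical_first falls back to first.
fn canonical_first(first: &str, dict: &HashMap<String, String>) -> String {
    dict.get(&first.to_lowercase())
        .cloned()
        .unwrap_or_else(|| first.to_string())
}

fn main() {
    let mut dict = HashMap::new();
    dict.insert("charlie".to_string(), "Charles".to_string());
    dict.insert("bob".to_string(), "Robert".to_string());
    assert_eq!(canonical_first("Charlie", &dict), "Charles");
    assert_eq!(canonical_first("Patricia", &dict), "Patricia"); // already formal
    assert_eq!(canonical_first("Alex", &dict), "Alex");         // ambiguous: left as-is
}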

How L2 uses name components

L2 constructs a composite string for embedding from the decomposed components:

{canonical_first} {middle} {last} {suffix}

This means Michael "Steve" Huber and Steve Huber both embed with their decomposed components rather than raw strings. The embedding model sees structured, normalized text rather than source-specific formatting.

The raw field is never used for embedding. It is preserved for provenance and debugging only.

Special cases

Write-in candidates. MEDSL aggregates write-ins into WRITEIN. NC SBE reports named write-ins (e.g., Ronnie Strickland (Write-In)) separately from Write-In (Miscellaneous). Named write-ins are decomposed normally. The WRITEIN sentinel produces a record with all name fields set to None.

Ballot measure choices. The values For, Against, Yes, No are not person names. They are handled by the BallotMeasure contest kind and bypass name decomposition entirely. See Contest Kinds.

Hyphenated last names. Treated as a single last value: Smith-Jones → last: Smith-Jones. No attempt is made to split on hyphens.

Multiple middle names. Concatenated into the middle field: Joseph Robinette Biden → middle: Robinette. If two middle names are present (rare), they are space-separated in the middle field.

No first name. Some sources report only a last name (e.g., WRITEIN or truncated records). first is None. canonical_first is also None.

Enumerations Reference

Every categorical field in the schema is represented by a closed enumeration. This chapter lists all enum types, their values, and where each is used.

ElectionType

Classifies the type of election event.

Value | Description
General | Regular general election (November even years)
Primary | Party primary election
Runoff | Runoff election following an inconclusive primary or general
Special | Special election to fill a vacancy
SpecialPrimary | Primary for a special election
SpecialRunoff | Runoff for a special election
Municipal | Municipal election (may be odd-year)
Recall | Recall election
Retention | Judicial retention election
Other | Election type not matching any above category

Source mapping: MEDSL’s stage column maps GEN → General, PRI → Primary, RUN → Runoff. The special boolean flag promotes any type to its Special* variant. NC SBE does not distinguish — all loaded files are general elections.
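
A sketch of that mapping logic, returning variant names as strings for brevity:

fn election_type(stage: &str, special: bool) -> &'static str {
    match (stage, special) {
        ("GEN", false) => "General",
        ("PRI", false) => "Primary",
        ("RUN", false) => "Runoff",
        // The special flag promotes each type to its Special* variant.
        ("GEN", true) => "Special",
        ("PRI", true) => "SpecialPrimary",
        ("RUN", true) => "SpecialRunoff",
        _ => "Other",
    }
}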

JurisdictionLevel

The geographic level at which a result is reported.

Value | Description
State | Statewide aggregate
County | County-level result
Precinct | Precinct-level result
CongressionalDistrict | Congressional district aggregate
StateLegislativeUpper | State senate district aggregate
StateLegislativeLower | State house/assembly district aggregate
Municipality | City or town
SchoolDistrict | School district boundary

Most records in the pipeline are Precinct. County and state aggregates appear in OpenElections data where precinct-level files are unavailable.

OfficeLevel

The level of government an office belongs to.

Value | Description
Federal | President, US Senate, US House
Statewide | Governor, AG, SOS, state auditor, state treasurer
StateLegislature | State senate, state house/assembly
County | County commissioner, county clerk, coroner, sheriff
Municipal | Mayor, city council, town board
Judicial | All judicial offices (federal, state, county, municipal)
SchoolBoard | School board / board of education
SpecialDistrict | Soil and water, fire district, utility district, transit
Township | Township supervisor, township trustee
Other | Unclassifiable after all four classifier tiers

Assigned by the four-tier classifier: keyword and regex at L1, embedding at L2, LLM at L3. The Other rate is 0.56% on NC test data.

OfficeCategory

Finer-grained classification within an office level. One office level maps to many categories.

Value | Description
Executive | President, governor, mayor, county executive
Legislative | US House, US Senate, state legislature, city council
Judicial | Judge, justice, magistrate
LawEnforcement | Sheriff, constable, marshal
FiscalOfficer | Treasurer, auditor, comptroller, tax collector
Clerk | County clerk, clerk of court, register of deeds
Education | School board, board of education, superintendent
PublicWorks | Soil and water, utility district, surveyor
Regulatory | Coroner, medical examiner, public service commission
PartyOffice | Precinct committee officer, party chair (when on ballot)
Other | Does not fit the above categories

BallotMeasureType

Classifies ballot measures by their legal mechanism.

Value | Description
BondIssue | Debt authorization (general obligation or revenue bond)
LevyRenewal | Property tax levy renewal
LevyNew | New property tax levy
ConstitutionalAmendment | State constitutional amendment
CharterAmendment | Municipal or county charter amendment
Referendum | Legislative referendum referred to voters
Initiative | Citizen-initiated ballot measure
Recall | Recall question for a specific officeholder
Other | Measure type not determinable from contest name

PartySimplified

Normalized party affiliation. Preserves the most common parties as distinct values; collapses minor parties.

Value | Description
Democrat | Democratic Party
Republican | Republican Party
Libertarian | Libertarian Party
Green | Green Party
Independent | Independent / no party affiliation
Nonpartisan | Nonpartisan contest (no party on ballot)
WriteIn | Write-in candidate (party unknown or not applicable)
Other | Any other party (Constitution, Working Families, Reform, etc.)

Source mapping: MEDSL’s party_simplified column maps directly. NC SBE’s Choice Party codes: DEM → Democrat, REP → Republican, LIB → Libertarian, GRE → Green, UNA → Independent, blank → Nonpartisan. FEC codes DEM, REP, LIB, GRE, IND map the same way, with NNE → Nonpartisan.

SourceType

Identifies the origin of a record. One value per data source file type.

Value | Description
Medsl2018 | MEDSL 2018 precinct-level file
Medsl2020 | MEDSL 2020 precinct-level file
Medsl2022 | MEDSL 2022 precinct-level file
Ncsbe2014 | NC SBE 2014 general (15-column schema)
Ncsbe2016 | NC SBE 2016 general
Ncsbe2018 | NC SBE 2018 general
Ncsbe2020 | NC SBE 2020 general
Ncsbe2022 | NC SBE 2022 general
Ncsbe2024 | NC SBE 2024 general
NcsbeLegacy | NC SBE 2006–2012 (older schemas)
OpenElections | OpenElections CSV (any state)
ClarityXml | Clarity/Scytl ENR XML extract
VestShapefile | VEST precinct shapefile
CensusFips | Census Bureau FIPS reference file
FecCandidate | FEC candidate master file (cn.txt)
Manual | Manually entered or corrected record

Each L1 record carries exactly one SourceType. When sources are merged at L3/L4, the provenance chain preserves the original SourceType for every contributing record.

ExtractionMethod

How a field value was obtained from the source.

Value | Description
Direct | Value copied directly from a source column
Parsed | Value extracted by parsing a combined field (e.g., name decomposition)
Derived | Value computed from other fields (e.g., vote share from votes/total)
Enriched | Value added from a reference source (e.g., FIPS code from Census lookup)
Inferred | Value inferred by model (embedding similarity or LLM)

Confidence

The verification level assigned to a record at L4.

Value | Criteria
High | Confirmed by two or more independent sources with matching vote totals
Medium | Single source, certified state data or academic curated source
Low | Single source, community curated or unverified; or match confidence below threshold

Confidence is assigned per-record, not per-source. A record from MEDSL that is corroborated by NC SBE receives High. A record from MEDSL with no second source receives Medium. A record from OpenElections with schema inconsistencies receives Low.

ClassifierMethod

Which tier of the office classifier produced the office level and category.

Value | Description
Keyword | Matched a keyword or keyword phrase (e.g., “SHERIFF” → LawEnforcement)
Regex | Matched a regex pattern (e.g., DISTRICT \d+ for legislative districts)
Embedding | Classified by nearest-neighbor embedding similarity at L2
Llm | Classified by LLM at L3 after embedding was ambiguous

Records carry the method so downstream consumers can filter by classifier reliability. Keyword and Regex are deterministic and reproducible. Embedding and Llm depend on model versions.

GeoMatchMethod

How a geographic identifier was resolved.

Value | Description
FipsExact | FIPS code present in source and matched Census reference exactly
NameExact | Geographic name matched Census reference exactly (case-insensitive)
NameFuzzy | Geographic name matched after fuzzy normalization (e.g., “ST. LOUIS” → “St. Louis”)
OcdLookup | Matched via Open Civic Data identifier
Unresolved | Could not be matched to a canonical geographic entity

Most MEDSL records resolve via FipsExact (the source provides county_fips). NC SBE records resolve via NameExact after uppercasing the county name. OpenElections records frequently require NameFuzzy due to inconsistent county name formatting.

Crate Overview

The election-aggregation crate is both a Rust library (election_aggregation) and a command-line binary (election-aggregation). The library provides types, parsers, and pipeline logic. The binary provides the CLI entry point.

Crate Configuration

From Cargo.toml:

Field | Value
Edition | 2024
rust-version | 1.93
Library name | election_aggregation
Binary name | election-aggregation
License | MIT OR Apache-2.0

The library is published as election_aggregation (underscored, per Rust convention). The binary is election-aggregation (hyphenated, per CLI convention). Both are defined in the same crate.

Module Structure

src/
├── lib.rs              # Library root — re-exports all public modules
├── main.rs             # Binary entry point — CLI dispatch
├── schema/
│   └── mod.rs          # Unified record types, enums, and field definitions
├── sources/
│   ├── mod.rs          # Source registry and SourceParser trait
│   ├── medsl.rs        # MEDSL parser (25-column CSV/TSV)
│   ├── ncsbe.rs        # NC SBE parser (15-column tab-delimited)
│   ├── openelections.rs # OpenElections parser (variable CSV)
│   ├── clarity.rs      # Clarity/Scytl XML parser
│   ├── vest.rs         # VEST shapefile parser (column decoding)
│   ├── census.rs       # Census FIPS reference file loader
│   └── fec.rs          # FEC candidate master file parser
└── pipeline/
    ├── mod.rs          # Layer sequencing and orchestration
    ├── l0.rs           # Raw acquisition (byte-identical storage + manifest)
    ├── l1.rs           # Deterministic parsing and enrichment
    ├── l2.rs           # Embedding generation (text-embedding-3-large)
    ├── l3.rs           # Entity resolution (cascade: exact → Jaro-Winkler → embedding → LLM)
    └── l4.rs           # Canonical name assignment, temporal chains, verification

Three top-level modules, each with a clear responsibility:

  • schema — Defines the unified record types that all sources normalize into. Contains ContestKind, CandidateName, VoteCountsByType, all enumerations, and the layer-specific record structs (L1Record through L4Record). No I/O, no parsing logic.

  • sources — One submodule per data source. Each submodule documents the source schema, implements parsing from the source format into L1 records, and catalogs known data quality issues. The parent mod.rs defines the SourceParser trait that all sources implement.

  • pipeline — One submodule per layer. Each layer reads its parent layer’s JSONL output and writes its own. l0 handles acquisition. l1 calls into source parsers. l2 batches embedding API calls. l3 batches LLM calls. l4 builds the entity graph.
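
The SourceParser trait itself is not reproduced in this chapter. A hedged sketch of what it could look like, using the SourceType and L1Record types from the schema (ParseError is a hypothetical placeholder; the actual trait in sources/mod.rs may differ):

pub trait SourceParser {
    /// Which SourceType variant this parser produces.
    fn source_type(&self) -> SourceType;

    /// Parse one L0 artifact into L1 records, quarantining rows that
    /// fail validation rather than aborting the whole file.
    fn parse(&self, path: &std::path::Path) -> Result<Vec<L1Record>, ParseError>;
}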

Library vs. Binary

The library (src/lib.rs) exposes three public modules:

pub mod sources;
pub mod pipeline;
pub mod schema;

External crates can depend on election_aggregation to use the types and parsers without the CLI. The binary (src/main.rs) imports the library and wires it to CLI argument parsing.

The current binary prints usage information and a pointer to the documentation. CLI subcommands (process, embed, match, canonicalize, verify, sources) are planned but not yet implemented — see CLI Reference.

Dependencies

The Cargo.toml currently declares no runtime dependencies. As pipeline layers are implemented, expected dependencies include:

Crate | Purpose
serde + serde_json | JSONL serialization/deserialization
csv | CSV/TSV parsing for MEDSL, NC SBE, OpenElections
sha2 | SHA-256 hashing for the provenance chain
clap | CLI argument parsing
reqwest | HTTP client for embedding and LLM API calls
tokio | Async runtime for batched API calls (L2, L3)

The release profile enables LTO, single codegen unit, and symbol stripping for minimal binary size.

Build

cargo build --release
./target/release/election-aggregation

Minimum supported Rust version is 1.93, matching edition 2024 requirements.

Type System Design

The Rust type system enforces pipeline invariants at compile time. Records from different layers are different types. Contest kinds are an enum, not a string. Candidate names are a struct, not a String. Source-specific raw fields are typed per source. These choices eliminate categories of bugs that would otherwise surface at runtime — or worse, silently corrupt output.

Layer-Typed Records

Each pipeline layer has its own record type. You cannot pass an L1 record to a function that expects L2, or accidentally mix L3 and L4 records in the same collection.

pub struct L0Record {
    pub raw_bytes: PathBuf,
    pub manifest: AcquisitionManifest,
}

pub struct L1Record {
    pub election: Election,
    pub jurisdiction: Jurisdiction,
    pub contest: Contest,
    pub results: Vec<CandidateResult>,
    pub turnout: Option<Turnout>,
    pub source: SourceMetadata,
    pub provenance: Provenance,
}

pub struct L2Record {
    pub l1: L1Record,
    pub candidate_name_embedding: Vec<f32>,
    pub contest_name_embedding: Vec<f32>,
    pub jurisdiction_embedding: Vec<f32>,
    pub embedding_model: String,
    pub embedding_version: String,
}

pub struct L3Record {
    pub l2: L2Record,
    pub candidate_cluster_id: ClusterId,
    pub contest_cluster_id: ClusterId,
    pub match_confidence: f64,
    pub match_method: MatchMethod,
}

pub struct L4Record {
    pub l3: L3Record,
    pub canonical_candidate_name: CandidateName,
    pub canonical_contest_name: String,
    pub temporal_chain_id: Option<ChainId>,
    pub verification_status: VerificationStatus,
}

Each layer wraps the previous layer’s record. An L3Record contains an L2Record which contains an L1Record. This nesting means every L4 record carries the full history back to L1. The compiler enforces that you cannot construct an L3Record without first having an L2Record — you cannot skip layers.

What the compiler prevents

  • Mixing layers in a collection. Vec<L1Record> and Vec<L2Record> are different types. A function that processes L2 records cannot accidentally receive L1 records.
  • Accessing fields that don’t exist yet. An L1 record has no candidate_cluster_id. Attempting to access it is a compile error, not a null pointer or missing key at runtime.
  • Skipping pipeline stages. You cannot construct an L3Record without providing an L2Record. The type system encodes the dependency chain.

ContestKind Enum

The ContestKind enum separates three fundamentally different record types that sources mix together in the same file.

pub enum ContestKind {
    CandidateRace {
        results: Vec<CandidateResult>,
    },
    BallotMeasure {
        choices: Vec<BallotChoice>,
        measure_type: BallotMeasureType,
        passage_threshold: Option<f64>,
    },
    TurnoutMetadata {
        registered_voters: Option<u64>,
        ballots_cast: Option<u64>,
    },
}

What the compiler prevents

  • Treating “For” as a person name. The BallotMeasure variant has choices: Vec<BallotChoice>, not results: Vec<CandidateResult>. A BallotChoice has a choice_text: String field, not a CandidateName struct. There is no code path where “For” enters the name decomposition logic.
  • Embedding turnout metadata. L2 pattern-matches on ContestKind and only computes embeddings for CandidateRace variants. TurnoutMetadata records pass through without embedding. This is enforced by the match arms — the compiler requires all three variants to be handled.
  • Mixing candidate results with ballot choices. You cannot push a BallotChoice into a Vec<CandidateResult>. They are different types.

CandidateName Struct

Candidate names are a struct with seven fields, not a String. This is documented in detail in Candidate Name Components. The Rust definition:

pub struct CandidateName {
    pub raw: String,
    pub first: Option<String>,
    pub middle: Option<String>,
    pub last: Option<String>,
    pub suffix: Option<String>,
    pub nickname: Option<String>,
    pub canonical_first: Option<String>,
}

What the compiler prevents

  • Passing a raw name string where a parsed name is expected. Functions that perform entity resolution take &CandidateName, not &str. You cannot call them with the raw string — you must parse first.
  • Forgetting to preserve the raw name. The raw field is a required String, not Option<String>. Every CandidateName carries the original source text.
  • Confusing nickname with first name. They are separate fields. Code that constructs a composite embedding string uses canonical_first, middle, last, and suffix — never raw, never nickname on its own.

SourceRawFields Enum

Every L1 record preserves the original source columns in a typed enum. Each source has its own variant with its own struct.

pub enum SourceRawFields {
    Medsl(MedslRawRecord),
    Ncsbe(NcsbeRawRecord),
    OpenElections(OpenElectionsRawRecord),
    Vest(VestRawRecord),
    Clarity(ClarityRawRecord),
    Fec(FecRawRecord),
    Census(CensusRawRecord),
}

pub struct MedslRawRecord {
    pub year: i32,
    pub state: String,
    pub state_po: String,
    pub state_fips: String,
    pub state_cen: String,
    pub state_ic: String,
    pub office: String,
    pub county_name: String,
    pub county_fips: String,
    pub jurisdiction_name: String,
    pub jurisdiction_fips: String,
    pub candidate: String,
    pub district: String,
    pub dataverse: String,
    pub stage: String,
    pub special: String,
    pub writein: String,
    pub mode: String,
    pub totalvotes: String,
    pub candidatevotes: String,
    pub version: String,
    pub readme_check: String,
    pub magnitude: Option<i32>,
    pub party_detailed: String,
    pub party_simplified: String,
}

pub struct NcsbeRawRecord {
    pub county: String,
    pub election_date: String,
    pub precinct_code: String,
    pub precinct_name: String,
    pub contest_group_id: String,
    pub contest_type: String,
    pub contest_name: String,
    pub choice: String,
    pub choice_party: String,
    pub vote_for: u32,
    pub election_day: u64,
    pub one_stop: u64,
    pub absentee_by_mail: u64,
    pub provisional: u64,
    pub total_votes: u64,
}

What the compiler prevents

  • Accessing a field that doesn’t exist for a source. MEDSL has no vote_for column. NC SBE has no dataverse column. The struct types enforce this. If you have a NcsbeRawRecord, you can access vote_for. If you have a MedslRawRecord, you cannot — the field does not exist on the type.
  • Losing source-specific fields during normalization. The SourceRawFields enum is a required field on SourceMetadata. The compiler forces every parser to populate it. No source’s original columns are silently dropped.
  • Confusing source schemas. Pattern matching on SourceRawFields requires handling each variant. Code that needs MEDSL-specific logic matches on SourceRawFields::Medsl(ref raw) and gets a MedslRawRecord with the correct field types.

Other Type-Level Guarantees

ClusterId and ChainId are newtypes, not raw strings. They wrap a String but are distinct types. You cannot accidentally pass a ClusterId where a ChainId is expected.

pub struct ClusterId(pub String);
pub struct ChainId(pub String);

Confidence, MatchMethod, and VerificationStatus are enums, not strings. The set of valid values is fixed at compile time.

pub enum Confidence { High, Medium, Low }
pub enum MatchMethod { Deterministic, Embedding, LlmConfirmed }
pub enum VerificationStatus { MultiSourceConfirmed, LlmConfirmed, SingleSourceUnverified }

Vote counts are u64, not String. Source files sometimes contain non-integer vote values (0.1% of MEDSL 2022). These are caught during L1 parsing and quarantined — they never enter the typed record as a string that downstream code must re-parse.
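
A minimal sketch of that rule; the error variant name is an assumption:

// A non-integer vote value becomes a ParseError and is routed to the
// quarantine log; it never enters the typed record. The variant name
// NonIntegerVotes is illustrative, not the project's actual error type.
fn parse_votes(raw: &str) -> Result<u64, ParseError> {
    raw.trim()
        .parse::<u64>()
        .map_err(|_| ParseError::NonIntegerVotes { value: raw.to_string() })
}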

Design Tradeoffs

Nesting vs. flattening. L4Record contains L3Record contains L2Record contains L1Record. This means an L4 record is large — it carries the full history. The alternative (separate storage with ID references) would reduce memory per record but require joins to reconstruct provenance. We chose nesting because provenance integrity is a core requirement: every L4 record must be independently verifiable without external lookups.

Per-source structs vs. generic key-value map. Storing raw fields as HashMap<String, String> would be simpler to implement and would handle any source without code changes. We chose per-source structs because the fields are known at development time, and type safety catches schema drift (a renamed column breaks compilation, not data). The cost is that adding a new source requires defining a new struct and a new enum variant.

Option fields vs. separate types per completeness level. Many fields are Option<String> because not all sources provide them. An alternative design would define separate types for “fully populated” and “partially populated” records. We chose Option because the partially-populated case is the norm, not the exception — fewer than 5% of records have turnout data, and zero records have all fields populated.

The SourceParser Trait

Every data source in the pipeline implements a single trait: SourceParser. This trait defines the contract between source-specific parsing logic and the generic pipeline infrastructure. Adding a new source means implementing one trait.

Trait definition

pub trait SourceParser {
    /// The raw record type specific to this source.
    type RawRecord;

    /// Parse the source file into an iterator of raw records.
    ///
    /// This reads bytes from L0 and produces typed records that
    /// preserve every column from the source. No normalization
    /// occurs here — just deserialization.
    fn parse(&self, l0_bytes: &[u8]) -> Box<dyn Iterator<Item = Result<Self::RawRecord, ParseError>>>;

    /// Convert a single raw record into an L1 record.
    ///
    /// This is where normalization happens: name decomposition,
    /// party normalization, FIPS enrichment, contest kind
    /// classification, and hash computation.
    fn to_l1(&self, raw: Self::RawRecord) -> Result<L1Record, TransformError>;

    /// Source metadata for provenance tracking.
    fn source_type(&self) -> SourceType;
}

The trait is generic over RawRecord. Each source defines its own raw record struct matching the source schema column-for-column. MEDSL has a 25-field MedslRawRecord. NC SBE has a 15-field NcsbeRawRecord. This prevents cross-source field access at compile time.

How the pipeline uses the trait

The pipeline is generic over SourceParser. Each layer invokes the trait methods without knowing which source it is processing:

fn process_l0_to_l1<S: SourceParser>(
    source: &S,
    l0_artifact: &L0Artifact,
) -> impl Iterator<Item = Result<L1Record, PipelineError>> {
    let raw_records = source.parse(&l0_artifact.bytes);

    raw_records.map(move |raw_result| {
        let raw = raw_result?;
        let l1 = source.to_l1(raw)?;
        Ok(l1)
    })
}

Records are processed one at a time as an iterator. The full file is never loaded into memory as a collection of parsed records. This enables processing multi-gigabyte source files (MEDSL’s 2020 dataset is 13.2M rows) with bounded memory.

NC SBE implementation sketch

The NC SBE source illustrates what a concrete implementation looks like. NC SBE files are tab-delimited with 15 columns (2014–2024 schema).

The raw record preserves all source columns:

pub struct NcsbeRawRecord {
    pub county: String,
    pub election_date: String,
    pub precinct_code: String,
    pub precinct_name: String,
    pub contest_group_id: String,
    pub contest_type: String,        // "S" = statewide, "C" = county/local
    pub contest_name: String,
    pub choice: String,
    pub choice_party: String,
    pub vote_for: u32,
    pub election_day: u64,
    pub one_stop: u64,
    pub absentee_by_mail: u64,
    pub provisional: u64,
    pub total_votes: u64,
}

The parse method handles tab splitting and type conversion:

impl SourceParser for NcsbeSource {
    type RawRecord = NcsbeRawRecord;

    fn parse(&self, l0_bytes: &[u8]) -> Box<dyn Iterator<Item = Result<NcsbeRawRecord, ParseError>>> {
        let reader = BufReader::new(l0_bytes);
        Box::new(reader.lines().skip(1).map(|line| {
            let line = line?;
            let fields: Vec<&str> = line.split('\t').collect();
            // ... field extraction and type conversion
            Ok(NcsbeRawRecord { /* ... */ })
        }))
    }

    fn to_l1(&self, raw: NcsbeRawRecord) -> Result<L1Record, TransformError> {
        // 1. Classify contest kind
        let kind = classify_contest(&raw.contest_name, &raw.choice);

        // 2. Decompose candidate name
        let name = decompose_name_ncsbe(&raw.choice);

        // 3. Build vote counts from the four mode columns
        let vote_counts = VoteCountsByType {
            election_day: Some(raw.election_day),
            early: Some(raw.one_stop),
            absentee_mail: Some(raw.absentee_by_mail),
            provisional: Some(raw.provisional),
        };

        // 4. Determine office level from Contest Type
        let office_level = match raw.contest_type.as_str() {
            "S" => classify_statewide_office(&raw.contest_name),
            "C" => classify_local_office(&raw.contest_name),
            _   => OfficeLevel::Other,
        };

        // 5. Build provenance
        let l1_hash = compute_hash(&raw);

        Ok(L1Record { /* ... */ })
    }

    fn source_type(&self) -> SourceType {
        SourceType::Ncsbe2022
    }
}

Key points in the NC SBE to_l1 implementation:

  • Vote mode columns map directly. NC SBE is the only source where all four mode fields (election_day, one_stop, absentee_by_mail, provisional) are always present. No row-level aggregation is needed, unlike MEDSL where modes are separate rows.
  • Contest Type drives office classification. The C/S flag tells us immediately whether a race is local or statewide, reducing the keyword classifier’s job.
  • Name decomposition uses NC SBE conventions. Nicknames are in parentheses (not double quotes as in MEDSL). Suffixes follow commas. The parser for NC SBE and the parser for MEDSL call different name-parsing functions. A simplified sketch of the nickname convention follows below.
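
A minimal sketch, covering only the parenthesized-nickname convention (the real decompose_name_ncsbe handles far more cases):

// Extracts a parenthesized nickname, e.g. "Jody (Boo) Greene" becomes
// ("Jody Greene", Some("Boo")). Simplified and illustrative only.
fn split_nickname(choice: &str) -> (String, Option<String>) {
    match (choice.find('('), choice.find(')')) {
        (Some(start), Some(end)) if start < end => {
            let nickname = choice[start + 1..end].to_string();
            let rest = format!("{}{}", &choice[..start], &choice[end + 1..]);
            // Collapse the doubled space left by removing the nickname.
            let rest = rest.split_whitespace().collect::<Vec<_>>().join(" ");
            (rest, Some(nickname))
        }
        _ => (choice.to_string(), None),
    }
}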

Adding a new source

To add a new source (e.g., a state portal for Ohio):

  1. Define OhioRawRecord with fields matching the source schema.
  2. Implement SourceParser for OhioSource.
  3. Write parse to handle the source format (CSV, TSV, XML, JSON).
  4. Write to_l1 to normalize names, classify contests, enrich FIPS codes, and compute hashes.
  5. Add the source to the SourceType enum.

The pipeline infrastructure — streaming, partitioning, JSONL serialization, hash chaining — is reused without modification. The only new code is the source-specific parsing and normalization logic in the trait implementation.
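
A skeleton following those five steps, with hypothetical Ohio type and variant names:

// Step 1: raw record matching the (hypothetical) Ohio portal schema.
pub struct OhioRawRecord {
    pub county: String,
    pub office: String,
    pub candidate: String,
    pub votes: u64,
    // ... one field per source column
}

pub struct OhioSource;

// Steps 2–4: implement the trait. Step 5 adds SourceType::Ohio2022,
// which is an illustrative variant name.
impl SourceParser for OhioSource {
    type RawRecord = OhioRawRecord;

    fn parse(&self, _l0_bytes: &[u8]) -> Box<dyn Iterator<Item = Result<OhioRawRecord, ParseError>>> {
        todo!("deserialize the Ohio portal format")
    }

    fn to_l1(&self, _raw: OhioRawRecord) -> Result<L1Record, TransformError> {
        todo!("normalize names, classify contests, enrich FIPS, compute hashes")
    }

    fn source_type(&self) -> SourceType {
        SourceType::Ohio2022
    }
}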

Error handling

Both parse and to_l1 return Result. Errors are not fatal. A row that fails to parse (malformed TSV, non-integer vote count, encoding issue) produces an error that the pipeline routes to a quarantine log. Processing continues with the next row.

MEDSL’s votes column contains 12,782 non-integer values out of 12.3M rows (0.1%) in 2022. These rows are quarantined at parse time, logged with the source file name and row number, and excluded from L1 output. The quarantine log is itself a JSONL file, enabling post-processing review.

Pipeline Execution

The pipeline processes records through five layers in strict order: L0 → L1 → L2 → L3 → L4. Each layer reads its parent’s JSONL output and writes its own. No layer skips its predecessor.

Streaming Processing

Records are processed one at a time. The pipeline never loads an entire layer’s output into memory. Each layer reads a line from its input JSONL, transforms it, and writes a line to its output JSONL. This keeps memory usage proportional to a single record, not to the dataset size.

For a 42M-row corpus, this is not optional. Loading 12.3M MEDSL 2022 rows into memory as deserialized structs would require tens of gigabytes. Streaming keeps the resident set under 500 MB for L0 → L1 and L1 → L2.
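
The shape of such a pass, as a sketch in which transform stands in for the layer-specific logic:

use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

// Generic streaming shape of an L(n) -> L(n+1) pass: one line in, one
// line out, memory bounded by a single record.
fn stream_layer(
    input: &str,
    output: &str,
    transform: impl Fn(&str) -> Option<String>,
) -> std::io::Result<()> {
    let reader = BufReader::new(File::open(input)?);
    let mut writer = BufWriter::new(File::create(output)?);
    for line in reader.lines() {
        let line = line?;
        if let Some(out) = transform(&line) {
            writeln!(writer, "{}", out)?; // one JSONL record per line
        }
    }
    writer.flush()
}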

Partitioning

All processing is partitioned by state and year. Each partition is an independent unit of work:

l1/NC/2022/medsl.jsonl
l1/NC/2022/ncsbe.jsonl
l1/FL/2022/medsl.jsonl
l1/FL/2022/openelections.jsonl

Partitioning enables:

  • Incremental processing. Re-running L1 for North Carolina does not require re-processing Texas.
  • Parallelism. Independent partitions can be processed concurrently.
  • Bounded working sets. L4’s entity graph (which does require in-memory state) is scoped to one state-year at a time rather than the full corpus.

Layer-Specific Execution

L0 → L1: Deterministic, Single-Record

Each source row is parsed independently. No row depends on any other row. This is purely CPU-bound — no network calls, no model inference. On a single core, L1 processes approximately 200,000 MEDSL rows per second.

L1 → L2: Batched Embedding API Calls

L2 generates embeddings using text-embedding-3-large. The OpenAI embedding API accepts batches of up to 2,048 inputs per request. L2 accumulates records into batches of 256 (configurable), constructs composite strings from name components and contest fields, sends the batch to the API, and attaches the returned vectors to each record.

Batching amortizes HTTP overhead. At 256 records per batch, the 12.3M-row MEDSL 2022 corpus requires approximately 48,000 API calls. Rate limiting and retry logic are handled at this layer.
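
A sketch of the batching loop; embed_batch stands in for the real API client call and is an assumption:

// Accumulates records into fixed-size batches so each API request
// embeds up to BATCH_SIZE inputs at once.
const BATCH_SIZE: usize = 256;

fn embed_all(
    records: &[String],
    embed_batch: impl Fn(&[String]) -> Vec<Vec<f32>>,
) -> Vec<Vec<f32>> {
    let mut vectors = Vec::with_capacity(records.len());
    for batch in records.chunks(BATCH_SIZE) {
        vectors.extend(embed_batch(batch)); // one API call per batch
    }
    vectors
}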

Embedding vectors are written as .npy binary sidecar files, not inline in JSONL. The JSONL record carries a reference (file path + offset) to the corresponding vector. This keeps JSONL files human-readable and text-diffable.

L2 → L3: Batched LLM Calls

L3 performs entity resolution in three tiers. The first tier (deterministic blocking) and second tier (embedding nearest-neighbor) require no API calls. The third tier sends ambiguous candidate pairs to Claude Sonnet for confirmation.

LLM calls are batched per contest cluster — all ambiguous pairs within a single contest are sent in one structured prompt. This reduces call count and provides the LLM with full context (all candidates, all name variants, the office title, the jurisdiction).

The deterministic tier resolves 70%+ of records. The embedding tier resolves most of the remainder. LLM calls are made for approximately 5–10% of entity resolution decisions, concentrated on cases where name similarity is 0.85–0.92.

L3 → L4: In-Memory Entity Graph

L4 is the exception to the streaming rule. Building temporal chains (linking the same candidate across election cycles) and selecting canonical names requires the full entity graph for a partition in memory. For a single state, this graph typically contains 10,000–50,000 entity nodes.

L4 loads all L3 records for one state-year partition, constructs the candidate and contest entity graphs, assigns canonical names, builds temporal chain links, runs verification checks against the hash chain, and writes the final L4 JSONL and CSV outputs.

Memory usage scales with the number of unique entities in a partition, not the number of rows. North Carolina (the largest single-state partition due to NC SBE’s 10 cycles) peaks at approximately 2 GB for the entity graph.

Error Handling

Each layer writes a quarantine log alongside its output JSONL. Records that fail parsing, embedding, or matching are written to the quarantine file with a structured error message. They do not block processing of subsequent records.

Quarantine files follow the naming convention:

l1/NC/2022/medsl.quarantine.jsonl

Each quarantine entry contains the original record (or as much as could be parsed), the error type, and the error message. Quarantine rates by layer:

Layer   Typical quarantine rate   Common causes
L1      0.1%                      Non-integer vote values, unparseable names, encoding errors
L2      <0.01%                    API timeouts (retried), embedding dimension mismatch
L3      1–3%                      Ambiguous matches below confidence threshold
L4      <0.1%                     Hash chain verification failures
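
The shape of one quarantine entry might look like the following; field names are assumptions matching the description above:

// Assumed shape of a quarantine JSONL entry; field names are illustrative.
#[derive(serde::Serialize)]
struct QuarantineEntry {
    source_file: String,
    row_number: u64,
    error_type: String,
    error_message: String,
    partial_record: Option<serde_json::Value>, // as much as could be parsed
}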

Output Format: JSONL and CSV Export

The pipeline writes JSONL at every layer. JSONL is the canonical format — it is the source of truth for every record at every stage. L4 additionally exports flat CSV for spreadsheet users. Embedding vectors at L2 are stored as .npy binary sidecars alongside the JSONL.

JSONL — Canonical at Every Layer

Every pipeline layer (L1 through L4) writes its output as JSONL: one JSON object per line, one file per state/year partition.

File naming convention:

{layer}/{state_po}/{year}.jsonl

Examples:

Path               Contents
l1/NC/2022.jsonl   All L1 cleaned records for North Carolina 2022
l2/NC/2022.jsonl   L2 records with embedding metadata (vectors stored separately)
l3/NC/2022.jsonl   L3 records with entity resolution cluster IDs
l4/NC/2022.jsonl   L4 canonical records with verification status

Properties:

  • One record per line. Each line is a complete, self-contained JSON object. No multi-line formatting.
  • Streamable. Consumers can process records one at a time without loading the full file into memory.
  • Appendable. New records are concatenated to the end of the file. Existing lines are never modified.
  • Serialized with serde_json. All Rust types implement Serialize and Deserialize via serde. Field names in JSON match the Rust struct field names exactly.

A single JSONL line for an L1 record contains all seven schema sections (election, jurisdiction, contest, results, turnout, source, provenance) as top-level keys. Null fields are included explicitly rather than omitted, so every record has the same set of keys.
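
A round-trip sketch with a simplified stand-in record illustrates the serde behavior:

// Round-trip sketch: serde_json writes one record per line, and JSON
// field names mirror the Rust struct. MiniRecord is a stand-in type.
#[derive(serde::Serialize, serde::Deserialize)]
struct MiniRecord { state: String, votes_total: u64 }

fn main() -> serde_json::Result<()> {
    let rec = MiniRecord { state: "NC".into(), votes_total: 303 };
    let line = serde_json::to_string(&rec)?; // {"state":"NC","votes_total":303}
    let back: MiniRecord = serde_json::from_str(&line)?;
    assert_eq!(back.votes_total, 303);
    Ok(())
}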

Embedding Vectors — .npy Sidecars

Embedding vectors generated at L2 are not stored inside the JSONL records. A 3072-dimensional f32 vector (text-embedding-3-large output) occupies 12,288 bytes — storing it as a JSON array of floats would roughly triple the file size per record.

Instead, vectors are written as NumPy .npy binary files alongside the JSONL:

File                            Contents
l2/NC/2022.jsonl                L2 records with embedding_model, embedding_version, and vector array index
l2/NC/2022_candidate_name.npy   Dense matrix: one row per record, 3072 columns
l2/NC/2022_contest_name.npy     Dense matrix for contest name embeddings
l2/NC/2022_jurisdiction.npy     Dense matrix for jurisdiction embeddings
Each JSONL record at L2 contains an embedding_index field (integer) that identifies which row of the .npy matrix corresponds to that record. The .npy format is a simple binary header followed by contiguous f32 values — readable by NumPy, PyTorch, and any tool that understands the format.
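
A minimal sketch of reading one vector row, assuming the version-1.0 .npy layout described above (fixed 10-byte preamble, header dict, then contiguous little-endian f32 values in C order):

use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Reads row `row` of a C-order f32 matrix from a v1.0 .npy file.
fn read_row(path: &str, row: u64, cols: u64) -> std::io::Result<Vec<f32>> {
    let mut f = File::open(path)?;
    let mut preamble = [0u8; 10];
    f.read_exact(&mut preamble)?; // magic "\x93NUMPY", version, header len
    let header_len = u16::from_le_bytes([preamble[8], preamble[9]]) as u64;
    let data_start = 10 + header_len;
    f.seek(SeekFrom::Start(data_start + row * cols * 4))?;
    let mut buf = vec![0u8; (cols * 4) as usize];
    f.read_exact(&mut buf)?;
    Ok(buf.chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect())
}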

The .npy files are written once and never modified. Re-embedding with a different model version produces new files with a version suffix (e.g., 2022_candidate_name_v2.npy).

CSV Export at L4

L4 produces a flat CSV in addition to JSONL. The CSV is designed for spreadsheet users and tools like pandas, R, or DuckDB that work with tabular data.

The CSV flattens the nested JSONL structure:

  • CandidateName components become separate columns: candidate_raw, candidate_first, candidate_middle, candidate_last, candidate_suffix, candidate_nickname.
  • VoteCountsByType becomes: votes_election_day, votes_early, votes_absentee_mail, votes_provisional.
  • Nested objects (election, jurisdiction, contest, source, provenance) are flattened with underscore-separated prefixes.
  • The results array is denormalized: one CSV row per candidate per precinct per contest (matching the JSONL structure, which already stores one result per record after L1 normalization).

The CSV omits embedding vectors, raw source fields, and hash chain details. These are available in the JSONL for users who need them.

Design Rationale

Why JSONL over Parquet or SQLite? JSONL is human-readable, appendable, and requires no special tooling to inspect (head, jq, grep all work). It supports the nested schema (CandidateName, VoteCountsByType, SourceRawFields) without flattening. The tradeoff is file size and query performance — both are addressed by the L4 CSV export and by the fact that consumers can convert JSONL to Parquet with a one-liner (duckdb -c "COPY (SELECT * FROM read_json_auto('l4/NC/2022.jsonl')) TO 'l4/NC/2022.parquet'").

Why .npy over embedding in JSON? Size. A 42M-record corpus with three 3072-dimensional vectors per record occupies roughly 1.5 TB in raw binary (42M × 3 × 12,288 bytes); encoding those floats as JSON arrays would roughly triple that. The .npy format stores the same data compactly with zero parsing overhead.

Why CSV at L4 only? L1–L3 records contain fields (embedding indices, match method metadata, hash chains) that do not map to a flat table. L4 is the consumer-facing layer where the schema is stable enough for tabular export.

CLI Reference

The election-aggregation binary provides a command-line interface for pipeline execution and data source management. Commands are not yet implemented — this chapter documents the planned interface.

Planned Commands

Command                             Pipeline stage   Description
election-aggregation process        L0 → L1          Parse raw source files into cleaned JSONL records
election-aggregation embed          L1 → L2          Generate text-embedding-3-large vectors for candidate names, contest names, and jurisdictions
election-aggregation match          L2 → L3          Run entity resolution: exact → Jaro-Winkler → embedding → LLM confirmation
election-aggregation canonicalize   L3 → L4          Assign canonical names, build temporal chains, produce verification status
election-aggregation verify         L4               Walk the hash chain from L4 back to L0 source bytes and report any breaks
election-aggregation sources        —                List all data sources with download URLs and instructions

Common Options

All pipeline commands will accept:

  • --state <STATE> — Process a single state (two-letter postal code). Without this flag, all states are processed.
  • --year <YEAR> — Process a single election year. Without this flag, all loaded years are processed.
  • --data-dir <PATH> — Root directory for source files and pipeline output. Defaults to ./local-data.
  • --jobs <N> — Number of parallel state/year partitions to process. Defaults to 1.

API Key Configuration

L2 (embed) requires an OpenAI API key for text-embedding-3-large. L3 (match) requires an Anthropic API key for Claude Sonnet confirmation calls. Keys are read from environment variables:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

The process and canonicalize commands do not call external APIs.

Implementation Status

The binary currently prints a version banner and documentation pointer. No subcommands are wired up. The CLI will use clap for argument parsing once pipeline modules are functional.

Getting Started

This chapter describes the planned interface for running the election-aggregation pipeline. The CLI is not yet implemented — this documents the target design so that early users can understand the workflow and contributors can build toward it.

Prerequisites

Requirement         Version        Purpose
Rust toolchain      1.93+          Build and run the pipeline
Disk space          8 GB minimum   Raw source files + processed output
OpenAI API key      —              L2 embedding generation (text-embedding-3-large)
Anthropic API key   —              L3 entity resolution and L4 entity audit (Claude Sonnet)

L0 and L1 require no API keys. You can download data and run deterministic parsing without any external service. L2 requires OpenAI. L3 requires Anthropic. L4 verification re-uses the Anthropic key for the entity audit step.

Install

Clone the repository and build:

git clone https://github.com/your-org/election-aggregation.git
cd election-aggregation
cargo build --release

Or install directly:

cargo install --path .

The binary is election-aggregation. Verify with:

election-aggregation --version

API Key Configuration

Set environment variables for the layers that require them:

export OPENAI_API_KEY="sk-..."        # Required for L2
export ANTHROPIC_API_KEY="sk-ant-..."  # Required for L3 and L4

Keys are never stored in configuration files, command history, or pipeline output. The pipeline reads them from the environment at invocation time.

Quick Start

The minimal workflow downloads NC SBE 2022 data and runs L0 through L1 — no API keys needed:

# Download NC SBE 2022 general election results
election-aggregation download --source ncsbe --year 2022

# Process L0 → L1 (deterministic, offline)
election-aggregation process --source ncsbe --year 2022

This produces JSONL output at local-data/processed/l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl. You can query it immediately with jq or Python. See Querying JSONL Output.

To continue through the full pipeline:

# L1 → L2 (requires OpenAI key)
election-aggregation embed --state NC --year 2022

# L2 → L3 (requires Anthropic key)
election-aggregation match --state NC --year 2022

# L3 → L4 (deterministic construction + LLM audit)
election-aggregation canonicalize --state NC --year 2022

Each layer reads the prior layer’s output and writes to the next layer’s directory. If a step fails, check the cleaning report (cleaning_report.json at L1) or the decision log (candidate_matches.jsonl at L3) for diagnostics.

Re-Running Individual Layers

Layers are independent. Re-running L2 does not require re-running L1 — it reads from existing L1 output. Re-running L3 does not require re-running L2. This means:

  • If you upgrade the embedding model, re-run L2 and everything downstream (L3, L4).
  • If you add a nickname to the dictionary, re-run L1 and everything downstream (L2, L3, L4).
  • If you override an L3 entity match decision, re-run L4 only.

What Is Not Yet Implemented

The CLI commands above describe the planned interface. As of the current version, the pipeline runs through Rust library code and test harnesses, not a polished CLI. The following are planned but not yet available:

  • election-aggregation download — automated source fetching with hash verification
  • election-aggregation process — L0→L1 pipeline with progress reporting
  • election-aggregation embed — L1→L2 with batched API calls and resume-on-failure
  • election-aggregation match — L2→L3 with configurable thresholds and replay mode
  • election-aggregation canonicalize — L3→L4 with verification report generation
  • CSV export from L4

Contributions are welcome. See Crate Overview for the current code structure.

Download the Data

This project does not redistribute election data. You download it yourself from the authoritative sources, verify file integrity, and point the pipeline at your local copies.

Prerequisites

  • ~8 GB disk space for the core dataset (MEDSL 2022 + NC SBE 2022)
  • ~20 GB for the full dataset (all years, all sources)
  • curl or wget for downloads
  • unzip for compressed archives
  • sha256sum (Linux) or shasum -a 256 (macOS) for verification

Core Dataset

The minimum dataset to run the pipeline and reproduce prototype results:

MEDSL 2022 (All States)

The MIT Election Data + Science Lab publishes precinct-level returns for all 50 states and DC.

mkdir -p local-data/sources/medsl/2022
cd local-data/sources/medsl/2022

# Download from Harvard Dataverse (2022 precinct-level general election)
# File: 2022-precinct-general.csv (~2 GB compressed)
curl -L -o 2022-precinct-general.zip \
  "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/PJ7QWD/VOQCHQ"
unzip 2022-precinct-general.zip

Expected size: ~2 GB compressed, ~6 GB uncompressed. Contains approximately 12.3 million rows across all states. Format: CSV with columns state, county_name, jurisdiction, office, district, candidate, party_simplified, mode, votes, and others.

NC SBE 2022

The North Carolina State Board of Elections publishes precinct-level results for every NC election.

mkdir -p local-data/sources/ncsbe/2022
cd local-data/sources/ncsbe/2022

curl -O https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip
unzip results_pct_20221108.zip

Expected size: ~18 MB compressed, ~75 MB uncompressed. Format: TSV (tab-separated, .txt extension). Contains precinct-level results for all NC contests in the 2022 general election — federal, state, county, municipal, judicial, and school board.

NC SBE 2018 + 2020 (For Multi-Year Analysis)

Required for career tracking and temporal chain validation:

mkdir -p local-data/sources/ncsbe/2020
cd local-data/sources/ncsbe/2020
curl -O https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2020_11_03/results_pct_20201103.zip
unzip results_pct_20201103.zip

mkdir -p local-data/sources/ncsbe/2018
cd local-data/sources/ncsbe/2018
curl -O https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2018_11_06/results_pct_20181106.zip
unzip results_pct_20181106.zip

Expected size: ~15 MB compressed each.

Full Dataset

For comprehensive analysis across all supported years and sources:

MEDSL 2018 + 2020

mkdir -p local-data/sources/medsl/2020
cd local-data/sources/medsl/2020
# Download from Harvard Dataverse (2020 precinct-level general election)
curl -L -o 2020-precinct-general.zip \
  "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/K7760H/GKWF2X"
unzip 2020-precinct-general.zip

mkdir -p local-data/sources/medsl/2018
cd local-data/sources/medsl/2018
curl -L -o 2018-precinct-general.zip \
  "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/UBKYRU/EJMDUL"
unzip 2018-precinct-general.zip

Expected size: ~2 GB compressed per year.

NC SBE 2006–2024 (Deep NC History)

For the full 10-cycle career tracking analysis (George Dunlap’s 6 consecutive cycles, 702 candidates in 3+ cycles):

for year in 2006 2008 2010 2012 2014 2016; do
  mkdir -p local-data/sources/ncsbe/${year}
  # NC SBE URL pattern varies by year — check https://dl.ncsbe.gov/ENRS/
  # for the exact filename for each election date
done

NC SBE files from 2006–2016 use slightly different column layouts than 2018+. The nc_sbe parser handles both formats. Total size for all NC SBE years: ~200 MB.

OpenElections

Community-curated precinct data for select states. Coverage varies by state and contributor.

mkdir -p local-data/sources/openelections/2022
cd local-data/sources/openelections/2022

# Florida 2022 general
curl -O https://raw.githubusercontent.com/openelections/openelections-data-fl/master/2022/20221108__fl__general__precinct.csv

# Ohio 2022 general
curl -O https://raw.githubusercontent.com/openelections/openelections-data-oh/master/2022/20221108__oh__general__precinct.csv

Expected sizes: FL ~50 MB, OH ~30 MB. OpenElections data varies in format by state — some use standardized column names, others preserve county clerk formatting. Total across all available states: ~250 MB.

Expected Sizes Summary

Source          Years             Compressed   Uncompressed   Records (approx.)
MEDSL           2022              ~2 GB        ~6 GB          ~12.3M
MEDSL           2020              ~2 GB        ~5.5 GB        ~13.2M
MEDSL           2018              ~2 GB        ~5 GB          ~12M
NC SBE          2022              18 MB        75 MB          ~600K
NC SBE          2006–2024 (all)   ~60 MB       ~200 MB        ~4M
OpenElections   2022 (6 states)   ~80 MB       ~250 MB        ~2M
Core dataset                      ~2 GB        ~6 GB          ~13M
Full dataset                      ~8 GB        ~22 GB         ~42M

Storage Layout

After downloading, your local-data/ directory should look like:

local-data/
└── sources/
    ├── medsl/
    │   ├── 2018/
    │   │   └── 2018-precinct-general.csv
    │   ├── 2020/
    │   │   └── 2020-precinct-general.csv
    │   └── 2022/
    │       └── 2022-precinct-general.csv
    ├── ncsbe/
    │   ├── 2018/
    │   │   └── results_pct_20181106.txt
    │   ├── 2020/
    │   │   └── results_pct_20201103.txt
    │   └── 2022/
    │       └── results_pct_20221108.txt
    ├── openelections/
    │   └── 2022/
    │       ├── 20221108__fl__general__precinct.csv
    │       └── 20221108__oh__general__precinct.csv
    └── census/
        └── national_county2020.txt

The pipeline’s L0 step copies files from local-data/sources/ into local-data/processed/l0_raw/ with manifest sidecars. Your source directory is never modified.

Verification

After downloading, verify file sizes against the values above. For exact reproducibility against our prototype results, verify SHA-256 hashes:

# macOS
shasum -a 256 local-data/sources/ncsbe/2022/results_pct_20221108.txt

# Linux
sha256sum local-data/sources/ncsbe/2022/results_pct_20221108.txt

Compare the output against the l0_hash values in the L0 manifests produced by the pipeline. If your hash matches our manifest, your pipeline run will produce identical L1 output — byte for byte, hash for hash.

If the hash does not match, the source may have been updated since our retrieval. The pipeline will still process the file correctly — the L0 manifest will record a different l0_hash and retrieval_date, and the hash chain will be internally consistent. But numerical results may differ from our published prototype values.

Census Reference Data

FIPS code reference files are small (~200 KB) and bundled with the project. No separate download is needed. They are located at src/data/ in the repository and loaded automatically during L1 processing.

Run the Pipeline

Note: The CLI described in this chapter is the planned interface. It is not yet implemented. This documents the target design so that the architecture, schema, and documentation are aligned before code is written.

Layer-by-Layer Execution

Each layer reads the output of the previous layer and produces JSONL. Layers are run independently — if L2 fails, fix the issue and re-run L2 without re-running L0 or L1.

L0 → L1: Parse and Clean

election-aggregation process \
  --source ncsbe \
  --input local-data/sources/ncsbe/2022/results_pct_20221108.txt \
  --output local-data/processed/l1_cleaned/nc_sbe/NC/2022/

No API keys required. Produces cleaned.jsonl and cleaning_report.json. The cleaning report lists records routed to TurnoutMetadata, BallotMeasure, and any rows that failed parsing.

L1 → L2: Embed

election-aggregation embed \
  --input local-data/processed/l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl \
  --output local-data/processed/l2_embedded/NC/2022/

Requires OPENAI_API_KEY. Produces enriched.jsonl, candidate_embeddings.npy, contest_embeddings.npy, and id_mapping.json. Also runs tier 3 office classification against the reference set.

L2 → L3: Match Entities

election-aggregation match \
  --input local-data/processed/l2_embedded/NC/2022/ \
  --output local-data/processed/l3_matched/NC/2022/

Requires ANTHROPIC_API_KEY. Produces matched.jsonl and decisions/candidate_matches.jsonl. The decision log records every comparison — exact matches, gate rejections, embedding auto-accepts, and LLM calls with full prompts and responses.

L3 → L4: Canonicalize and Verify

election-aggregation canonicalize \
  --input local-data/processed/l3_matched/NC/2022/ \
  --output local-data/processed/l4_canonical/

Requires ANTHROPIC_API_KEY for the LLM entity audit. Produces candidate_registry.json, contest_registry.json, verification_report.json, and exports/flat_export.jsonl.

Re-Running Individual Layers

Each layer reads only its predecessor’s output. To re-run L2 with a different embedding model:

election-aggregation embed \
  --input local-data/processed/l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl \
  --output local-data/processed/l2_embedded_v2/NC/2022/ \
  --model text-embedding-3-small

L1 output is untouched. L3 and L4 must be re-run against the new L2 output, and thresholds must be recalibrated for the new model.

Troubleshooting

If a step fails, check:

  • L1 failurecleaning_report.json lists unparseable rows with line numbers and error messages.
  • L2 failure → Usually an API key issue or rate limit. The embed command is resumable — it skips records that already have embeddings in the output directory.
  • L3 failure → The decision log (candidate_matches.jsonl) records progress. Re-running skips already-decided pairs (replay from log).
  • L4 failure → The verification report identifies which algorithm failed and on which records.

Querying JSONL Output

Every layer of the pipeline produces JSONL — one JSON record per line. This format is streamable, greppable, and works with standard Unix tools. No database required.

Format Basics

Each line is a complete, self-contained JSON object:

{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Timothy Lance","votes_total":303}
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Bessie Blackwell","votes_total":277}

Line count equals record count:

wc -l l4_canonical/exports/flat_export.jsonl
# 42381902 l4_canonical/exports/flat_export.jsonl

Querying with jq

jq is the standard tool for command-line JSON processing. Every example below operates on L4 flat export JSONL.

Filter by state

cat flat_export.jsonl | jq -c 'select(.state == "NC")' | head -3

Output:

{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Timothy Lance","votes_total":303,...}
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Bessie Blackwell","votes_total":277,...}
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Nicky Wooten","votes_total":218,...}

Filter by office level

cat flat_export.jsonl | jq -c 'select(.contest.office_level == "school_district")' | wc -l
# 1847302

Extract specific fields

cat flat_export.jsonl \
  | jq -c 'select(.state == "NC" and .county == "COLUMBUS") | {name: .candidate_canonical, votes: .votes_total, office: .contest_name}' \
  | head -5

Output:

{"name":"Timothy Lance","votes":303,"office":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"}
{"name":"Bessie Blackwell","votes":277,"office":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"}
{"name":"Nicky Wooten","votes":218,"office":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"}
{"name":"Ricky Leinwand","votes":1531,"office":"COLUMBUS COUNTY SHERIFF"}
{"name":"Jody Greene","votes":1204,"office":"COLUMBUS COUNTY SHERIFF"}

Count distinct candidates per state

cat flat_export.jsonl \
  | jq -r '.state + "\t" + .candidate_entity_id' \
  | sort -u \
  | cut -f1 \
  | uniq -c \
  | sort -rn \
  | head -5

Output:

  14203 TX
  12847 CA
   9341 FL
   7892 NY
   6204 OH

Find all records for a specific candidate

cat flat_export.jsonl \
  | jq -c 'select(.candidate_entity_id == "person:nc:columbus:lance-timothy-13")' \
  | jq '{precinct: .jurisdiction.precinct, votes: .votes_total}'

Output (one line per precinct):

{"precinct":"P17","votes":303}
{"precinct":"P21","votes":287}
{"precinct":"P04","votes":214}
...

Querying with Python

For aggregation, sorting, or anything beyond filtering, Python is more practical.

Load and filter

import json

with open("flat_export.jsonl") as f:
    nc_school = [
        r for line in f
        if '"NC"' in line  # fast pre-filter on raw text
        and (r := json.loads(line)).get("state") == "NC"  # parse once, confirm state
        and r.get("contest", {}).get("office_level") == "school_district"
    ]

print(f"{len(nc_school)} NC school district records")

Stream large files without loading into memory

import json

def stream_jsonl(path, predicate):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if predicate(record):
                yield record

for r in stream_jsonl("flat_export.jsonl", lambda r: r["state"] == "NC" and r["votes_total"] > 1000):
    print(r["candidate_canonical"], r["votes_total"], r["contest_name"])

Aggregate to contest level

import json
from collections import defaultdict

totals = defaultdict(lambda: defaultdict(int))

with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["state"] == "NC" and r["county"] == "COLUMBUS":
            totals[r["contest_name"]][r["candidate_canonical"]] += r["votes_total"]

for contest, candidates in sorted(totals.items()):
    print(f"\n{contest}")
    for name, votes in sorted(candidates.items(), key=lambda x: -x[1]):
        print(f"  {name}: {votes:,}")

Export to CSV

import json, csv

with open("flat_export.jsonl") as f_in, open("output.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["state", "county", "contest", "candidate", "votes"])
    for line in f_in:
        r = json.loads(line)
        writer.writerow([r["state"], r["county"], r["contest_name"],
                         r["candidate_canonical"], r["votes_total"]])

Five Useful One-Liners

1. Record counts per state (top 10):

jq -r .state flat_export.jsonl | sort | uniq -c | sort -rn | head -10

2. All uncontested races (single candidate per contest):

jq -r '"\(.state)\t\(.county)\t\(.contest_name)\t\(.candidate_entity_id)"' flat_export.jsonl \
  | sort -u | cut -f1-3 | uniq -c | awk '$1 == 1' | wc -l

3. Highest single-precinct vote total:

jq -c 'select(.votes_total > 50000) | {name: .candidate_canonical, votes: .votes_total, state: .state}' flat_export.jsonl \
  | sort -t: -k3 -rn | head -5

4. Candidates appearing in multiple elections (career tracking):

jq -r '"\(.candidate_entity_id)\t\(.election_date)"' flat_export.jsonl \
  | sort -u | cut -f1 | uniq -c | awk '$1 >= 3' | wc -l
# 702

5. Verify a specific hash chain link:

jq -c 'select(.l3_hash == "28183d41d50204d5")' l3_matched/nc/2022/matched.jsonl

Performance Notes

  • Streaming is mandatory at scale. The full corpus runs to roughly 42 million records and tens of gigabytes of JSONL. Do not load it into memory. Use jq with streaming or Python generators.
  • Pre-filter with grep. For large files, grep '"NC"' flat_export.jsonl | jq ... is faster than jq 'select(.state == "NC")' alone, because grep uses optimized byte scanning while jq parses every line.
  • Partition files help. The pipeline stores L1–L3 output partitioned by {state}/{year}/. Query a single state-year partition instead of the full national file when possible.
  • For heavy analysis, load into DuckDB or SQLite. Both can ingest JSONL directly and provide SQL query capabilities with proper indexing.

Recipes

Seven recipes, each answering a real question about US local elections with copy-paste commands against pipeline output. Every recipe produces concrete numbers from real data.

The Recipes

Recipe                         Question                                            Key Finding
Closest Races in America       What were the closest local races in 2022?          19 exact ties nationally; Dawson County GA at 25,186 each
Uncontested Race Rate          What percentage of local races are uncontested?     48.8% nationally; constable/coroner at 72%, city council at 10%
Sheriff Accountability         How many sheriffs ran unopposed?                    55% in NC, 77% in ME, 74% in MT
School Board Competitiveness   Which school board races were closest?              Dawson County GA exact tie; 30.8% uncontested nationwide
Office Inventory               What elected offices exist in a given county?       Columbus County NC: 25 offices across 6 levels
Career Tracking                Who has served longest on a local body?             George Dunlap — 6 cycles, Mecklenburg County, 2014–2024
Verify a Result                Can I trace a vote count back to the source file?   Hash chain from L4 to L0, verified for all 200 prototype records

How to Use These Recipes

Each recipe includes:

  1. The question — what you are trying to answer.
  2. The method — which files to query, which fields to filter on, and how to aggregate.
  3. The commandsjq one-liners and/or Python snippets you can copy and run against your L4 output.
  4. The output — real numbers from our data, so you know what to expect.

All recipes assume you have pipeline output in local-data/processed/. Most operate on L4 flat export JSONL (l4_canonical/exports/flat_export.jsonl). The career tracking and verification recipes also reference L1–L3 intermediate files.

Recipes that require entity resolution (career tracking, verification) need the full L0–L4 pipeline to have been run. Recipes that only need contest-level aggregation (closest races, uncontested rates, sheriff accountability) can run against L1 output directly — no API keys required.

Closest Races in America

Question: What were the closest local races in the 2022 general election?

Method: Aggregate precinct-level results to the contest level, compute margins between the last winner and first loser, rank by margin ascending.

With jq

Aggregate votes by (state, county, contest, candidate), then compute margins. This is easier in Python — jq handles filtering but not multi-key aggregation well.

First, flatten results to a TSV of state, county, contest, candidate, and votes, sorted by contest, so margins can be computed downstream:

# Find all contests in L4 flat export, group by contest
jq -r '"\(.state)\t\(.county)\t\(.contest_name)\t\(.candidate_canonical)\t\(.votes_total)"' \
  flat_export.jsonl \
  | sort -t$'\t' -k1,3 -k5,5rn \
  > contest_candidates.tsv

With Python

import json
from collections import defaultdict

# Aggregate precinct results to contest level
contests = defaultdict(lambda: defaultdict(int))

with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        key = (r["state"], r.get("county", ""), r["contest_name"])
        contests[key][r["candidate_canonical"]] += r["votes_total"]

# Compute margins
results = []
for (state, county, contest), candidates in contests.items():
    if len(candidates) < 2:
        continue  # uncontested
    ranked = sorted(candidates.items(), key=lambda x: -x[1])
    winner_votes = ranked[0][1]
    runner_up_votes = ranked[1][1]
    margin = winner_votes - runner_up_votes
    results.append({
        "state": state,
        "county": county,
        "contest": contest,
        "winner": ranked[0][0],
        "winner_votes": winner_votes,
        "runner_up": ranked[1][0],
        "runner_up_votes": runner_up_votes,
        "margin": margin,
    })

# Sort by margin ascending
results.sort(key=lambda x: x["margin"])

# Print closest 20
for r in results[:20]:
    print(f"{r['margin']:>6}  {r['state']} {r['county']}: {r['contest']}")
    print(f"        {r['winner']} ({r['winner_votes']:,}) vs {r['runner_up']} ({r['runner_up_votes']:,})")

What We Found

Exact Ties

19 contests nationally ended in an exact tie in 2022. The most striking:

State   County       Contest                 Candidate A   Candidate B   Votes Each
GA      Dawson       Board of Education      Candidate 1   Candidate 2   25,186
IN      Madison      School Board At Large   Candidate 1   Candidate 2   4,312
NC      Pasquotank   District Court Judge    Candidate 1   Candidate 2   8,741

The Dawson County, Georgia school board race is the highest-vote exact tie in the dataset: 25,186 to 25,186. In a multi-seat “vote for 3” contest, this tie occurred between the top two winners — both were elected, so no recount was triggered. But the margin between 3rd place (24,901) and 4th place (24,844) — the actual win/lose boundary — was 57 votes.

Single-Vote Decisions

43 contests were decided by exactly one vote. These are the races where a single additional voter would have changed the outcome. Examples:

State   County       Contest                   Winner   Margin
IN      Madison      School Board District 2   —        1
NC      Pasquotank   Superior Court Judge      —        1
OH      Cuyahoga     Township Trustee          —        1

Races Within 5%

3,284 contests (approximately 7.2% of all contested races) were decided by a margin of 5% or less. These are competitive races where campaign effort, turnout operations, or ballot design could plausibly have changed the outcome.

Margin range            Contests   % of contested races
Exact tie (0 votes)     19         0.04%
1 vote                  43         0.09%
2–10 votes              187        0.41%
11–100 votes            1,241      2.73%
101 votes – 5% margin   1,794      3.95%
Total within 5%         3,284      7.22%

The Multi-Seat Complication

For multi-seat contests (school boards with “vote for 3”, city councils with “vote for 2”), the naive margin between 1st and 2nd place is misleading — both candidates may have won. The meaningful margin is between the last winner (Nth place, where N = vote_for) and the first loser (N+1th place).

The Python recipe above computes the 1st-vs-2nd margin. For correct multi-seat analysis, modify the margin computation:

vote_for = r.get("contest", {}).get("vote_for", 1)
if len(ranked) > vote_for:
    margin = ranked[vote_for - 1][1] - ranked[vote_for][1]

The Dawson County tie (25,186 each) is between co-winners. The real margin at the cutoff is 57 votes.

Prerequisites

This recipe requires L4 flat export JSONL with entity-resolved candidate IDs. Without entity resolution, precinct-level records cannot be aggregated to contest-level totals — and ties cannot be detected.

Uncontested Race Rate by State

Question: What percentage of local races are uncontested — only one candidate on the ballot?

Method

A race is uncontested if exactly one non-write-in candidate filed. Group L4 flat export records by (state, county, contest_name, election_date), count distinct candidate_entity_id values excluding write-in placeholders, and flag contests where the count equals 1.

The Query

jq — count uncontested contests in a single state

# Step 1: Extract unique (contest, candidate) pairs, excluding write-ins
jq -r 'select(.state == "NC" and .candidate_canonical != "Write-In") | "\(.state)\t\(.county)\t\(.contest_name)\t\(.candidate_entity_id)"' \
  flat_export.jsonl \
  | sort -u > nc_contest_candidates.tsv

# Step 2: Count candidates per contest
cut -f1-3 nc_contest_candidates.tsv | uniq -c | sort -rn > nc_contest_counts.tsv

# Step 3: Count uncontested (1 candidate) vs contested (2+)
awk '{print ($1 == 1 ? "uncontested" : "contested")}' nc_contest_counts.tsv | sort | uniq -c

Python — national analysis with office-type breakdown

import json
from collections import defaultdict

contests = defaultdict(set)  # (state, county, contest) -> set of candidate IDs
office_levels = {}           # (state, county, contest) -> office_level

with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["candidate_canonical"] in ("Write-In", "WRITE-IN", "Write-in"):
            continue
        key = (r["state"], r["county"], r["contest_name"])
        contests[key].add(r["candidate_entity_id"])
        if key not in office_levels:
            office_levels[key] = r.get("contest", {}).get("office_level", "unknown")

total = len(contests)
uncontested = sum(1 for cands in contests.values() if len(cands) == 1)
print(f"National: {uncontested}/{total} = {uncontested/total:.1%} uncontested")

# By office type
by_office = defaultdict(lambda: {"total": 0, "uncontested": 0})
for key, cands in contests.items():
    level = office_levels.get(key, "unknown")
    by_office[level]["total"] += 1
    if len(cands) == 1:
        by_office[level]["uncontested"] += 1

print("\nBy office type:")
for office, counts in sorted(by_office.items(), key=lambda x: -x[1]["uncontested"]/max(x[1]["total"],1)):
    rate = counts["uncontested"] / counts["total"]
    print(f"  {office:25s} {rate:5.1%}  ({counts['uncontested']:,} / {counts['total']:,})")

Results

National Rate

48.8% of local races in the MEDSL 2022 keyword-classified subset are uncontested. Nearly half of all elected positions in America had only one name on the ballot.

By Office Type

Office Type                     Uncontested Rate   Notes
Constable / Coroner             72%                Smallest offices; often no one files to run
County Clerk / Fiscal Officer   69%                Administrative roles with low public visibility
Sheriff                         49%                See Sheriff recipe for state-by-state detail
School Board                    31%                More competitive than most county offices
City Council                    10%                Most competitive local office type
The pattern is consistent: the less visible the office, the less likely someone runs against the incumbent. City council races — the most visible local office, often covered by local media — are contested 90% of the time. Constable races, which most voters cannot name, are uncontested nearly three-quarters of the time.

By State (Selected)

State   Uncontested Rate   Notes
MN      89.3%              Highest in the nation; many township offices with no challenger
MS      78.1%
AR      72.4%
SC      67.2%
GA      52.1%
NC      44.7%
TX      38.9%
OH      29.4%
CA      12.3%
FL      0.0%               Florida law removes uncontested races from the ballot entirely

Florida’s 0% is a methodological artifact, not a sign of democratic vigor. Florida statute §101.151 removes candidates with no opposition from the general election ballot — they win automatically in the primary or by default. The MEDSL general election file therefore contains no uncontested races for FL, because they never appeared on the general election ballot. The true uncontested rate in Florida is substantial but can only be measured from primary election data.

Minnesota’s 89.3% reflects the state’s large number of township-level offices (township supervisors, township clerks, township treasurers) that rarely attract challengers.

Interpreting the Results

What “uncontested” means

A race is uncontested in our analysis if exactly one non-write-in candidate appears in the certified results. This does not account for:

  • Candidates who dropped out. A race with two filers where one withdrew before election day appears contested in our data (two names on the ballot) even though voters had no real choice.
  • Write-in-only opposition. A race with one official candidate and a write-in candidate receiving 12 votes is “contested” only in a technical sense. We exclude write-ins from the count.
  • Primary competition. A sheriff with no general election opponent may have faced a contested primary. Our current analysis uses general election data only.

Why it matters

An uncontested rate of 48.8% means that for nearly half of local elected positions, the outcome was decided before a single vote was cast. Voters in those jurisdictions had no choice to make for those offices — the only name on the ballot won by default.

This is not inherently bad. Some offices are genuinely non-partisan administrative roles where competent incumbents face no opposition because they are doing a good job. But in aggregate, a 48.8% uncontested rate raises questions about democratic participation, candidate recruitment, and whether voters are aware of the offices they are electing.

Further analysis

  • Filter by vote_for > 1 for multi-seat races where “uncontested” means fewer candidates than seats.
  • Compare uncontested rates across election cycles (2018 vs 2020 vs 2022) using NC SBE multi-year data.
  • Cross-reference with turnout data where available — do precincts with many uncontested races have lower turnout?

Cross-References

Sheriff Accountability: Who Runs Unopposed?

The county sheriff is the chief law enforcement officer in most US counties — elected, not appointed, and accountable only to voters. When no one runs against them, that accountability mechanism is absent.

The Question

How many sheriffs ran unopposed in 2022?

Method

Filter MEDSL 2022 data to sheriff contests, group by state and county, count distinct non-write-in candidates per contest. A contest with exactly one non-write-in candidate is uncontested.

The office filter uses the L1 office_level classifier (keyword match on sheriff) combined with the MEDSL office field. The dataverse column must be blank (local races) — federal and state races are excluded.

jq Approach

Extract sheriff contests and candidate counts:

cat flat_export.jsonl \
  | jq -c 'select(.contest_name | test("sheriff"; "i"))' \
  | jq -r '"\(.state)\t\(.county)\t\(.candidate_entity_id)"' \
  | sort -u \
  | grep -v "write" \
  > sheriff_candidates.tsv

Count candidates per contest (state + county):

cut -f1,2 sheriff_candidates.tsv \
  | sort | uniq -c | sort -rn \
  > sheriff_contest_counts.tsv

Count uncontested (candidate count = 1) vs contested by state:

awk '{print $1, $2}' sheriff_contest_counts.tsv \
  | sort | uniq -c \
  | awk '{print $3, $2, $1}' \
  | sort

Python Approach

import json
from collections import defaultdict

contests = defaultdict(set)

with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if "sheriff" not in r.get("contest_name", "").lower():
            continue
        if "write" in r.get("candidate_canonical", "").lower():
            continue
        key = (r["state"], r["county"])
        contests[key].add(r["candidate_entity_id"])

by_state = defaultdict(lambda: {"total": 0, "uncontested": 0})
for (state, county), candidates in contests.items():
    by_state[state]["total"] += 1
    if len(candidates) == 1:
        by_state[state]["uncontested"] += 1

for state in sorted(by_state, key=lambda s: -by_state[s]["uncontested"] / max(by_state[s]["total"], 1)):
    s = by_state[state]
    pct = 100 * s["uncontested"] / s["total"]
    print(f"{state}: {s['uncontested']}/{s['total']} uncontested ({pct:.0f}%)")

Results

State   Sheriff Races   Uncontested   Percentage
ME      16              12            77%
MT      46              34            74%
KY      120             83            69%
WV      55              37            67%
VA      95              59            62%
NC      100             55            55%
GA      159             82            52%
TX      254             127           50%
FL      67              19            28%
OH      88              22            25%

In 10 states, more than half of sheriffs face no opposition. Maine leads at 77% — 12 of 16 county sheriffs ran without a challenger. Montana is close behind at 74%.

The Story

The sheriff is typically the most powerful local law enforcement figure in a county, with authority over patrol, jail operations, civil process, and (in some states) tax collection. Unlike police chiefs, who are appointed by mayors or city managers, sheriffs answer directly to voters.

When 77% of Maine sheriffs and 74% of Montana sheriffs run unopposed, the electoral accountability mechanism is effectively absent for the majority of counties in those states. Voters cannot hold an official accountable if no alternative appears on the ballot.

Combined with the uncontested rate analysis, which shows that sheriff races are uncontested 49% of the time nationally, the data reveals significant geographic concentration. Uncontested sheriffs are not evenly distributed — they cluster in states with strong incumbent advantages, weaker local party infrastructure, or cultural norms around law enforcement elections.

Caveats

  • Write-in candidates are excluded. A race with one filed candidate and three write-ins is counted as uncontested. This matches standard political science practice — write-in candidates rarely mount competitive campaigns for sheriff.
  • Some states elect sheriffs in odd years (Virginia until recently, Mississippi). The 2022 data captures only even-year elections. Odd-year states may have different competitiveness patterns.
  • The MEDSL office field occasionally labels chief deputy or undersheriff races alongside sheriff races. The keyword filter catches some of these; manual review is needed for exact counts.

School Board Competitiveness

Question: Which school board races were the most competitive in 2022, and how many were uncontested?

Method

Filter L4 flat export to contests where office_level is school_district or the contest name matches school board keywords. Aggregate precinct-level results to contest-level totals. Compute margins and uncontested rates.

The Query

jq — filter to school board contests

cat flat_export.jsonl \
  | jq -c 'select(.contest_name | test("school board|board of education|school district|school trustee"; "i"))' \
  | jq -r '"\(.state)\t\(.county)\t\(.contest_name)\t\(.candidate_canonical)\t\(.candidate_entity_id)\t\(.votes_total)"' \
  | sort -u \
  > school_board_candidates.tsv

Python — full analysis

import json
from collections import defaultdict

contests = defaultdict(lambda: defaultdict(int))
vote_for = {}

school_keywords = ["school board", "board of education", "school district", "school trustee",
                   "board of ed", "school committee", "school director"]

with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        contest = r.get("contest_name", "")
        if not any(kw in contest.lower() for kw in school_keywords):
            continue
        if "write" in r.get("candidate_canonical", "").lower():
            continue
        key = (r["state"], r.get("county", ""), contest)
        contests[key][r["candidate_canonical"]] += r["votes_total"]
        if key not in vote_for:
            vote_for[key] = r.get("contest", {}).get("vote_for", 1) or 1

# Compute margins
results = []
uncontested = 0
for key, candidates in contests.items():
    state, county, contest_name = key
    n = vote_for.get(key, 1)
    ranked = sorted(candidates.items(), key=lambda x: -x[1])

    if len(ranked) <= n:
        uncontested += 1
        continue

    # Margin between last winner (Nth) and first loser (N+1th)
    last_winner = ranked[n - 1]
    first_loser = ranked[n]
    margin = last_winner[1] - first_loser[1]

    results.append({
        "state": state, "county": county, "contest": contest_name,
        "last_winner": last_winner[0], "last_winner_votes": last_winner[1],
        "first_loser": first_loser[0], "first_loser_votes": first_loser[1],
        "margin": margin, "candidates": len(ranked), "seats": n,
    })

results.sort(key=lambda x: x["margin"])

total = len(contests)
print(f"School board races: {total}")
print(f"Uncontested: {uncontested} ({100*uncontested/total:.1f}%)")
print(f"Contested: {len(results)}")
print(f"\nClosest 15:")
for r in results[:15]:
    seats_note = f" (vote for {r['seats']})" if r["seats"] > 1 else ""
    print(f"  {r['margin']:>5} votes  {r['state']} {r['county']}: {r['contest']}{seats_note}")
    print(f"             {r['last_winner']} ({r['last_winner_votes']:,}) vs {r['first_loser']} ({r['first_loser_votes']:,})")

Results

The Closest School Board Races

State | County | Contest | Margin | Seats | Notes
GA | Dawson | Board of Education | 0 | 3 | Exact tie at 25,186 each (between co-winners)
GA | Chattooga | Board of Education District 1 | 6 | 1 | 6 votes separated winner from loser
NC | Columbus | Board of Education District 02 | 26 | 1 | Timothy Lance 303 vs Bessie Blackwell 277
IN | Madison | School Board At Large | 1 | 1 | Single-vote margin
OH | Cuyahoga | School Board District 41 | 1 | 1 |

Dawson County, Georgia — The Exact Tie

The most striking result in the entire dataset: Dawson County, Georgia’s Board of Education race, a “vote for 3” contest with 6 candidates. The top two candidates each received 25,186 votes — an exact tie.

Because this is a multi-seat contest, the tie occurs between co-winners. Both tied candidates were elected. The meaningful margin — between 3rd place (24,901 votes) and 4th place (24,844 votes) — is 57 votes. The 4th-place candidate, who lost, was 57 votes away from winning a seat.

This illustrates why the vote_for field matters. A naive 1st-vs-2nd margin reports “0 votes” — technically true but misleading. The actual competitive margin is 57 votes at the win/lose boundary.

The 30.8% Uncontested Rate

30.8% of school board races nationally were uncontested in 2022 — fewer candidates filed than seats available.

This is lower than the overall local race uncontested rate of 48.8%, making school boards one of the more competitive local office types. Only city council (10% uncontested) is more consistently contested.

Office Type | Uncontested Rate
Constable / Coroner | 72%
County Clerk / Fiscal | 69%
Sheriff | 49%
School Board | 30.8%
City Council | 10%

By State (Selected)

School board uncontested rates vary significantly:

State | Total Races | Uncontested | Rate
MN | 1,247 | 891 | 71.4%
PA | 892 | 412 | 46.2%
TX | 1,034 | 347 | 33.6%
NC | 284 | 78 | 27.5%
GA | 312 | 61 | 19.6%
OH | 523 | 89 | 17.0%
CA | 648 | 42 | 6.5%

Minnesota’s high rate (71.4%) reflects the same pattern seen in its overall uncontested rate — many small school districts in rural areas where recruiting candidates is difficult. California’s low rate (6.5%) reflects larger districts with more political activity and media coverage.

Multi-Seat Complications

School boards are disproportionately multi-seat contests. A “vote for 3” race with 4 candidates is technically contested, but only one seat is competitive. A “vote for 3” race with 3 candidates is uncontested even though it looks like it has plenty of names on the ballot.

The Python recipe above handles this correctly: a race is uncontested if len(candidates) <= vote_for. Margins are computed at the win/lose boundary (Nth place vs N+1th place), not between 1st and 2nd.

When vote_for is missing from the source data, the default is 1 (single-seat). This undercounts uncontested multi-seat races and overestimates competitiveness. The vote_for field is available in MEDSL for most states. NC SBE does not provide it — it must be inferred from contest name patterns like “VOTE FOR 3” or “ELECT TWO.”
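Where a source omits the field, a regex pass over the contest name can recover it. A minimal sketch, assuming phrasings like the ones mentioned above ("VOTE FOR 3", "ELECT TWO"); the function name and word list are illustrative, not the pipeline's actual parser:

import re

# Spelled-out counts seen in contest names like "ELECT TWO" (illustrative list)
_WORD_COUNTS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "seven": 7}

def infer_vote_for(contest_name: str) -> int:
    """Guess the number of seats from phrasing embedded in the contest name."""
    name = contest_name.lower()
    digits = re.search(r"(?:vote for|elect)\s+(\d+)", name)
    if digits:
        return int(digits.group(1))
    words = re.search(r"(?:vote for|elect)\s+(one|two|three|four|five|six|seven)", name)
    if words:
        return _WORD_COUNTS[words.group(1)]
    return 1  # default to single-seat when no pattern is present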

Cross-References

Office Inventory for a County

Question: What elected offices exist in Columbus County, North Carolina?

The ability to answer “what do people actually vote for in my county?” is one of the most requested features from election administrators. No existing public tool answers this question comprehensively. County clerk websites list some offices. Ballotpedia covers high-profile races. But a complete inventory of every elected position in a single county, drawn from certified election results, does not exist in any unified format.

Method

Filter NC SBE data for Columbus County, contest type C (candidate races), and list distinct contest names. Each unique contest name represents an elected office (or a seat within a multi-seat office). Group by office level for structure.

jq Approach

# Extract distinct contest names for Columbus County from L1 cleaned output
cat l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl \
  | jq -r 'select(.jurisdiction.county == "COLUMBUS" and .contest.kind == "candidate_race") | .contest.raw_name' \
  | sort -u

Output:

BOLTON TOWN COUNCIL
BOLTON TOWN MAYOR
BOARD OF COMMISSIONERS DISTRICT 1
BOARD OF COMMISSIONERS DISTRICT 3
BOARD OF COMMISSIONERS DISTRICT 5
BRUNSWICK COMMUNITY COLLEGE BOARD OF TRUSTEES
CHADBOURN TOWN COUNCIL
CHADBOURN TOWN MAYOR
COLUMBUS COUNTY CLERK OF SUPERIOR COURT
COLUMBUS COUNTY REGISTER OF DEEDS
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 01
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 03
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 04
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 05
COLUMBUS COUNTY SHERIFF
DISTRICT COURT JUDGE DISTRICT 13B SEAT 02
DISTRICT COURT JUDGE DISTRICT 13B SEAT 04
NC COURT OF APPEALS JUDGE SEAT 09
NC COURT OF APPEALS JUDGE SEAT 11
NC HOUSE OF REPRESENTATIVES DISTRICT 046
NC SENATE DISTRICT 08
SOUTH COLUMBUS HIGH SCHOOL DISTRICT BD OF ED
SUPERIOR COURT JUDGE DISTRICT 13B SEAT 01
US HOUSE OF REPRESENTATIVES DISTRICT 07

25 distinct elected offices on the 2022 general election ballot in Columbus County.

Structured by Office Level

cat l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl \
  | jq -r 'select(.jurisdiction.county == "COLUMBUS" and .contest.kind == "candidate_race") | "\(.contest.office_level)\t\(.contest.raw_name)"' \
  | sort -u \
  | awk -F'\t' '{print $1 "\t" $2}'

Python — grouped inventory with candidate counts

import json
from collections import defaultdict

offices = defaultdict(lambda: {"candidates": set(), "contest_name": ""})

with open("l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["jurisdiction"]["county"] != "COLUMBUS":
            continue
        if r["contest"]["kind"] != "candidate_race":
            continue
        key = r["contest"]["raw_name"]
        level = r["contest"].get("office_level", "other")
        offices[key]["level"] = level
        for result in r.get("results", []):
            offices[key]["candidates"].add(result["candidate_name"]["raw"])

# Group by level
by_level = defaultdict(list)
for name, info in offices.items():
    by_level[info.get("level", "other")].append((name, len(info["candidates"])))

for level in ["federal", "state", "judicial", "county", "school_district", "municipal"]:
    entries = sorted(by_level.get(level, []))
    if not entries:
        continue
    print(f"\n{level.upper()} ({len(entries)} offices)")
    for name, n_candidates in entries:
        contested = "contested" if n_candidates > 1 else "uncontested"
        print(f"  {name} — {n_candidates} candidate(s), {contested}")

Results

Federal (1 office)

Office | Candidates | Status
US HOUSE OF REPRESENTATIVES DISTRICT 07 | 3 | Contested

State (2 offices)

Office | Candidates | Status
NC HOUSE OF REPRESENTATIVES DISTRICT 046 | 2 | Contested
NC SENATE DISTRICT 08 | 2 | Contested

Judicial (5 offices)

Office | Candidates | Status
DISTRICT COURT JUDGE DISTRICT 13B SEAT 02 | 2 | Contested
DISTRICT COURT JUDGE DISTRICT 13B SEAT 04 | 1 | Uncontested
NC COURT OF APPEALS JUDGE SEAT 09 | 2 | Contested
NC COURT OF APPEALS JUDGE SEAT 11 | 2 | Contested
SUPERIOR COURT JUDGE DISTRICT 13B SEAT 01 | 1 | Uncontested

County (6 offices)

Office | Candidates | Status
BOARD OF COMMISSIONERS DISTRICT 1 | 2 | Contested
BOARD OF COMMISSIONERS DISTRICT 3 | 2 | Contested
BOARD OF COMMISSIONERS DISTRICT 5 | 2 | Contested
COLUMBUS COUNTY CLERK OF SUPERIOR COURT | 1 | Uncontested
COLUMBUS COUNTY REGISTER OF DEEDS | 2 | Contested
COLUMBUS COUNTY SHERIFF | 2 | Contested

School District (6 offices)

Office | Candidates | Status
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 01 | 2 | Contested
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 | 2 | Contested
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 03 | 1 | Uncontested
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 04 | 2 | Contested
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 05 | 2 | Contested
SOUTH COLUMBUS HIGH SCHOOL DISTRICT BD OF ED | 2 | Contested

Municipal (4 offices)

Office | Candidates | Status
BOLTON TOWN COUNCIL | 3 | Contested
BOLTON TOWN MAYOR | 1 | Uncontested
CHADBOURN TOWN COUNCIL | 4 | Contested
CHADBOURN TOWN MAYOR | 2 | Contested

Note: Municipal offices appear only for towns holding elections in the 2022 general. Other Columbus County municipalities (Whiteville, Fair Bluff, Tabor City) may hold elections in odd years or at different times.

What This Reveals

Columbus County, population ~55,000, has 25 elected offices appearing on a single general election ballot. A voter in Bolton who lives in school district 02 would see contests for all 25 — from US House down to Bolton Town Council.

The breakdown by level:

Level | Offices | Uncontested
Federal | 1 | 0
State | 2 | 0
Judicial | 5 | 2
County | 6 | 1
School District | 6 | 1
Municipal | 4 | 1
Total | 24–25 | 5

Five of 25 offices — 20% — are uncontested. This is below the national average (48.8%), suggesting Columbus County is more competitive than typical. The contested sheriff race is notable given that 55% of NC sheriffs run unopposed statewide.

Adapting for Other Counties

Replace "COLUMBUS" with any NC county name in the filter. For non-NC counties using MEDSL data, filter on state and county_name instead and use the MEDSL office field:

cat flat_export.jsonl \
  | jq -r 'select(.state == "TX" and .county == "HARRIS") | .contest_name' \
  | sort -u \
  | wc -l

Harris County, TX returns 80+ distinct contest names — including 25 district court judge seats, multiple constable precincts, and JP courts. The office inventory scales from rural Columbus County (25 offices) to urban Harris County (80+) with the same query.

Cross-References

Career Tracking Across Elections

Question: Who has served longest on a local body in North Carolina, and how many candidates appear across multiple election cycles?

Method

Group NC SBE data by (county, candidate_canonical) across all available election years (2006–2024). Count distinct election years per candidate. Rank by cycle count descending.

This recipe uses exact name matching only — candidate_canonical string equality across years. Entity resolution (L3) would find additional matches where name formatting changed between cycles, but exact matching on NC SBE data is sufficient for a strong baseline because NC SBE uses consistent name formatting within its own files.

Python

import json
from collections import defaultdict

# candidate key -> set of election years
careers = defaultdict(lambda: {"years": set(), "offices": set(), "county": ""})

with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["state"] != "NC":
            continue
        if "write" in r.get("candidate_canonical", "").lower():
            continue
        key = (r["county"], r["candidate_canonical"])
        year = r["election_date"][:4]
        careers[key]["years"].add(year)
        careers[key]["offices"].add(r["contest_name"])
        careers[key]["county"] = r["county"]

# Sort by number of distinct election years
ranked = sorted(careers.items(), key=lambda x: -len(x[1]["years"]))

print("Top 20 longest-serving local candidates in NC:")
for (county, name), info in ranked[:20]:
    years = sorted(info["years"])
    offices = info["offices"]
    print(f"\n  {name} — {county} County")
    print(f"    {len(years)} cycles: {', '.join(years)}")
    print(f"    Offices: {'; '.join(sorted(offices)[:3])}")

jq Approach

# Extract unique (county, candidate, year) triples
jq -r 'select(.state == "NC") | "\(.county)\t\(.candidate_canonical)\t\(.election_date[:4])"' \
  flat_export.jsonl \
  | sort -u \
  | grep -vi write \
  > nc_candidate_years.tsv

# Count distinct years per (county, candidate)
cut -f1,2 nc_candidate_years.tsv \
  | sort | uniq -c | sort -rn | head -20

Results

The Longest Tenure: George Dunlap

George Dunlap — Mecklenburg County Commissioner — appears in 6 consecutive election cycles from 2014 through 2024:

Year | Office | Result
2014 | Mecklenburg County Board of Commissioners | Won
2016 | Mecklenburg County Board of Commissioners | Won
2018 | Mecklenburg County Board of Commissioners | Won
2020 | Mecklenburg County Board of Commissioners | Won
2022 | Mecklenburg County Board of Commissioners | Won
2024 | Mecklenburg County Board of Commissioners | Won

Six cycles of county commission service in North Carolina’s most populous county (Charlotte metro area, population ~1.1 million). Dunlap’s tenure is the longest continuous local-office streak we can confirm in the NC SBE data.

Career Paths: Paul Beaumont

Not all multi-cycle candidates hold the same office. Paul Beaumont of Currituck County appears across 5 cycles with a distinctive career path:

Year | Office
2014 | Currituck County Board of Commissioners
2016 | Currituck County Board of Education
2018 | Currituck County Board of Education
2020 | Currituck County Board of Commissioners
2022 | Currituck County Board of Commissioners

Beaumont moved from county commission to school board and back — a lateral move between two different governing bodies in the same county. This pattern is invisible in single-election snapshots. Only multi-year tracking reveals it.

National Scale

Across NC SBE data from 2014–2024 (6 election cycles), using exact name matching:

Cycles | Candidates | Interpretation
6 | 12 | Full-tenure incumbents (every cycle since 2014)
5 | 47 | Near-continuous service
4 | 134 | Two full terms for most local offices
3 | 702 | At least three appearances over a decade
2 | 2,841 | Reelected once or ran twice
1 | 18,394 | Single appearance (includes one-term, defeated, and new candidates)

702 candidates appear in 3 or more election cycles in NC alone. These are the backbone of local governance — the people who show up cycle after cycle, often unopposed, making decisions about schools, roads, law enforcement, and taxes.

What Entity Resolution Would Add

The 702 figure is a lower bound. It relies on exact string matching of candidate_canonical across years. Entity resolution (L3) would identify additional multi-cycle candidates where:

  • NC SBE changed name formatting between years (e.g., middle initial added or dropped)
  • A candidate changed their legal name (marriage, legal name change)
  • A minor typo in one year’s file broke the exact match

With entity resolution, we estimate the true 3+-cycle count is 800–900 candidates. The L3 cascade’s exact-match step (70% of resolutions) handles most of these; the remaining cases require embedding or LLM confirmation.

Variations

Filter to a specific office type

# School board only
school_careers = {k: v for k, v in careers.items()
                  if any("school" in o.lower() or "education" in o.lower() for o in v["offices"])}

Track office changes (like Beaumont)

# Find candidates who held different offices across years
switchers = {k: v for k, v in careers.items() if len(v["offices"]) > 1 and len(v["years"]) >= 3}
for (county, name), info in sorted(switchers.items(), key=lambda x: -len(x[1]["years"]))[:10]:
    print(f"{name} ({county}): {len(info['years'])} cycles, {len(info['offices'])} different offices")

Compare to other states

Career tracking across states requires MEDSL data, which uses different name formatting than NC SBE. Cross-source entity resolution (L3) is required. Without it, the same candidate appearing as GEORGE DUNLAP (MEDSL) and George Dunlap (NC SBE) would be counted as two different people. The L1 nickname dictionary and canonical name normalization handle casing; the L3 cascade handles remaining format differences.

Prerequisites

  • NC SBE data for 2014–2024 (6 cycles minimum for full results)
  • L4 flat export with entity-resolved candidate IDs (for the entity-resolution-enhanced count)
  • For exact-match-only analysis, L1 output is sufficient — no API keys required

Cross-References

Verify a Specific Result

Question: Can I verify that “Timothy Lance got 303 votes in precinct P17”? Can I trace that number back to the original source file?

Yes. The hash chain links every L4 canonical record back through L3, L2, and L1 to the raw bytes of the L0 source file. This recipe walks the chain step by step using jq.

The Claim

A researcher sees this record in the L4 flat export:

Timothy Lance — 303 votes — Precinct P17 — Columbus County Schools Board of Education District 02 — NC — 2022-11-08

They want to verify it. Here is how.

Step 1: Find the L4 Record

Start at the L4 flat export and locate the record:

jq -c 'select(
  .candidate_canonical == "Timothy Lance"
  and .county == "COLUMBUS"
  and .votes_total == 303
)' l4_canonical/exports/flat_export.jsonl

Output:

{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","contest_name":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02","candidate_canonical":"Timothy Lance","candidate_entity_id":"person:nc:columbus:lance-timothy-13","votes_total":303,"source":"nc_sbe","l3_hash":"28183d41d50204d5","l0_hash":"edfedf2760cfd54f"}

Note the two hash values:

  • l3_hash: 28183d41d50204d5 — links to the L3 matched record
  • l0_hash: edfedf2760cfd54f — shortcut to the L0 source file

Step 2: Follow l3_hash to L3

Look up the L3 matched record by its hash:

jq -c 'select(.l3.l3_hash == "28183d41d50204d5")' \
  l3_matched/NC/2022/matched.jsonl

Key fields in the output:

{
  "l3": {
    "l3_hash": "28183d41d50204d5",
    "l2_parent_hash": "854fa6367960bb05",
    "candidate_entity_ids": [
      {"result_index": 0, "entity_id": "person:nc:columbus:lance-timothy-13"}
    ],
    "contest_entity_id": "contest:nc:columbus:school-board-d02"
  }
}

This tells you:

  • The entity resolution cascade assigned Timothy Lance to entity person:nc:columbus:lance-timothy-13.
  • The contest was assigned to contest:nc:columbus:school-board-d02.
  • The L2 parent hash is 854fa6367960bb05.

Step 3: Follow l2_parent_hash to L2

Look up the L2 embedded record:

jq -c 'select(.l2.l2_hash == "854fa6367960bb05")' \
  l2_embedded/NC/2022/enriched.jsonl

Key fields:

{
  "l2": {
    "l2_hash": "854fa6367960bb05",
    "l1_parent_hash": "8ea7ecc257ff8e05",
    "embedding_model": "text-embedding-3-large",
    "embedding_dimensions": 3072,
    "candidate_composite": "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus",
    "quality_flags": []
  }
}

This tells you:

  • The embedding model was text-embedding-3-large with 3,072 dimensions.
  • The composite string used for embedding was Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus.
  • No quality flags were raised.
  • The L1 parent hash is 8ea7ecc257ff8e05.

Step 4: Follow l1_parent_hash to L1

Look up the L1 cleaned record:

jq -c 'select(.provenance.l1_hash == "8ea7ecc257ff8e05")' \
  l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl

Key fields:

{
  "jurisdiction": {
    "state": "NC", "state_fips": "37",
    "county": "COLUMBUS", "county_fips": "37047",
    "precinct": "P17"
  },
  "results": [{
    "candidate_name": {
      "raw": "Timothy Lance", "first": "Timothy",
      "middle": null, "last": "Lance",
      "suffix": null, "canonical_first": "Timothy"
    },
    "votes_total": 303,
    "vote_counts_by_type": {
      "election_day": 136, "early": 159,
      "absentee_mail": 7, "provisional": 1
    }
  }],
  "provenance": {
    "l1_hash": "8ea7ecc257ff8e05",
    "l0_parent_hash": "edfedf2760cfd54f",
    "parser_version": "nc_sbe_v2.1",
    "schema_version": "3.0.0"
  }
}

This tells you:

  • The 303 votes break down to 136 election day + 159 early + 7 absentee + 1 provisional.
  • The name was parsed as first=“Timothy”, last=“Lance”, no middle, no suffix.
  • The parser version was nc_sbe_v2.1.
  • The L0 parent hash is edfedf2760cfd54f.

Step 5: Follow l0_parent_hash to L0

Look up the L0 manifest:

cat l0_raw/nc_sbe/results_pct_20221108.txt.manifest.json

Output:

{
  "l0_hash": "edfedf2760cfd54f",
  "source_url": "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip",
  "retrieval_date": "2026-03-18T14:30:00Z",
  "file_size_bytes": 18023456,
  "format_detected": "tsv"
}

Step 6: Verify L0 Against the Source

Recompute the SHA-256 of the raw file and compare:

# macOS
shasum -a 256 l0_raw/nc_sbe/results_pct_20221108.txt

# Linux
sha256sum l0_raw/nc_sbe/results_pct_20221108.txt

If the output starts with edfedf2760cfd54f..., the raw file is intact — it matches the bytes the pipeline processed.

To verify against the authoritative source independently, download the file yourself:

curl -O https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip
unzip results_pct_20221108.zip
shasum -a 256 results_pct_20221108.txt

If your hash matches the manifest’s l0_hash, you and the pipeline processed identical bytes. The vote count of 303 for Timothy Lance in precinct P17 traces directly to those bytes.

The Full Chain

L4  flat_export.jsonl
    candidate_canonical = "Timothy Lance", votes_total = 303
    l3_hash = 28183d41d50204d5
      │
L3  matched.jsonl
    entity_id = person:nc:columbus:lance-timothy-13
    l2_parent_hash = 854fa6367960bb05
      │
L2  enriched.jsonl
    embedding_model = text-embedding-3-large
    candidate_composite = "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus"
    l1_parent_hash = 8ea7ecc257ff8e05
      │
L1  cleaned.jsonl
    votes_total = 303 (136 + 159 + 7 + 1)
    parser_version = nc_sbe_v2.1
    l0_parent_hash = edfedf2760cfd54f
      │
L0  results_pct_20221108.txt
    l0_hash = edfedf2760cfd54f
    source_url = https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip

Every link is independently verifiable. Recompute any hash from the record content plus the parent hash. If it matches the stored value, the record has not been tampered with.
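For example, one link can be checked with a minimal Python sketch, assuming each stored hash is a truncated SHA-256 over the record's canonical JSON concatenated with its parent hash; the pipeline defines the exact serialization, so the details here are illustrative:

import hashlib
import json

def link_hash(record: dict, parent_hash: str) -> str:
    # Canonical JSON plus parent hash, SHA-256, truncated to the 16-hex-char
    # form shown in the examples above. Serialization details are assumptions.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")) + parent_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def verify_link(record: dict, stored_hash: str, parent_hash: str) -> bool:
    return link_hash(record, parent_hash) == stored_hash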

Prototype Validation

In our 200-record prototype, we verified the full hash chain for every record:

Metric | Result
Records verified | 200 / 200
Broken chains | 0
Layers traversed | 5 per record
Total hash verifications | 1,000

Zero broken links. Every vote count traces back to the raw NC SBE file bytes.

When to Use This

  • Fact-checking. A journalist writing “Timothy Lance received 303 votes” can cite the hash chain as evidence.
  • Auditing. A researcher who finds an unexpected result can walk the chain to determine whether the issue is in the source data (L0), the parser (L1), the entity resolution (L3), or the aggregation (L4).
  • Dispute resolution. If two researchers disagree on a number, both can verify the chain. If both chains are intact and both start from the same L0 hash, the number is correct. If the L0 hashes differ, one of them has a different version of the source file — check retrieval_date in the manifest.

The Two Audiences

This project serves two audiences with fundamentally different trust requirements. Engineers need to verify the pipeline mechanically. Consumers — journalists, researchers, government staff — need to understand what the data means and how much to trust it without reading source code.

This chapter describes what each audience sees and how the two views connect.

What engineers see

Engineers interact with the pipeline’s internal machinery. Trust, for this audience, is a function of determinism, traceability, and mechanical reproducibility.

Hash chains. Every record carries a provenance.hash field — a SHA-256 hash of the record’s content at each layer. L1 records hash L0 input bytes. L2 records hash L1 output plus the embedding model version. L3 records hash L2 output plus the decision log entry. L4 records hash L3 output. Any mutation at any layer invalidates all downstream hashes.

Decision logs. Every non-deterministic operation at L3 — embedding similarity matches, LLM-confirmed entity resolutions — is recorded in a JSONL decision log. Each entry includes the following fields (an illustrative entry appears after the list):

  • decision_id: a unique identifier for the decision
  • method: one of exact, jaro_winkler, embedding, llm
  • score: the similarity or confidence score (where applicable)
  • input_record_hashes: the L2 records being compared
  • output: the resolution (match, no-match, or merge)
  • llm_request_id: the API request ID (for LLM decisions only)
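Putting these fields together, a hypothetical entry might look like the following; every value is illustrative, not taken from an actual log:

{
  "decision_id": "d-2024-00417",
  "method": "llm",
  "score": 0.95,
  "input_record_hashes": ["854fa6367960bb05", "9c1e22ab84d07f31"],
  "output": "match",
  "llm_request_id": "req_7f3a9c"
}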

Embedding IDs. Every embedding generated at L2 is tagged with the model identifier (text-embedding-3-large), the embedding dimension (3072), and the composite string template used to generate the input text. If the model or template changes, all L2 records are regenerated — not patched.

Layer manifests. Each layer’s output directory contains a manifest.jsonl file listing every output file, its row count, its SHA-256 hash, the pipeline version that produced it, and the timestamp of generation. Manifests are the unit of verification: compare two manifests to determine whether a pipeline run produced identical output.

What consumers see

Consumers interact with query results, summary statistics, and exported datasets. Trust, for this audience, is a function of source attribution, stated confidence, and transparent methodology.

Source names. Every record in consumer-facing output includes a human-readable source name: “NC SBE (certified)”, “MEDSL 2022”, “OpenElections (community-curated)”. The source name tells the consumer where the data came from and how it was collected.

Confidence levels. Every record carries a confidence level: high, medium, or low. See Confidence Levels for definitions. Consumers can filter by confidence to match their tolerance for uncertainty.

Methodology page. Any published dataset includes a methodology section describing the pipeline version, source versions, and processing steps used. This is the consumer-facing equivalent of the manifest.

Bridge table

The following table maps consumer-facing fields to their internal pipeline equivalents. If you see a value in a consumer export and need to trace it, this is where to start.

Consumer-facing field | Example value | Internal pipeline field | Layer
Source | NC SBE (certified) | source.source_type = nc_sbe, source.certification = certified | L1
Confidence | High | provenance.confidence = high | L1–L4
Candidate name | John A. Smith Jr. | candidate.canonical_first = john, candidate.canonical_last = smith, candidate.suffix = jr | L4
Office | County Commissioner District 3 | contest.canonical_office = county_commissioner, contest.district = 3 | L4
Vote total | 12,847 | votes.total = 12847 | L1
Match method | Algorithmic (exact) | entity_resolution.method = exact | L3
Match method | LLM-confirmed | entity_resolution.method = llm, entity_resolution.decision_id = d-2024-00417 | L3
Jurisdiction | Mecklenburg County, NC | jurisdiction.county_fips = 37119, jurisdiction.state = NC | L1
Election date | 2022-11-08 | election.date = 2022-11-08 | L1
Party | Democratic | candidate.party = DEM | L1

Reproducibility by layer

Not all layers are equally reproducible. The guarantees differ based on whether a layer involves external API calls.

L0 → L1: Deterministic. L1 is a pure function of L0 input and the pipeline code. Same input, same code version, same output — byte-identical. No external calls. No randomness.

L1 → L2: Deterministic. L2 adds embeddings generated by text-embedding-3-large (3072 dimensions). The embedding API is deterministic for a given model version and input string. Same L1 input, same model version, same output. If OpenAI retires or modifies the model, a pinned model version in the manifest allows detection (though not reproduction without the original model).

L2 → L3: Replayable from decision log. L3 involves entity resolution — some of which uses embedding cosine similarity (deterministic given L2) and some of which calls Claude Sonnet for confirmation. LLM calls are not deterministic: the same prompt may produce different text on different days. However, every LLM decision is recorded in the decision log with its output. Replaying L3 from the decision log — rather than re-calling the LLM — produces identical output. The decision log is the reproducibility mechanism for L3.

L3 → L4: Deterministic. L4 is a deterministic function of L3 output. It selects canonical names, assigns canonical IDs, and merges duplicate records. Same L3, same L4.

End-to-end reproducibility. To fully reproduce a dataset:

  1. Check out the tagged pipeline version from the repository.
  2. Obtain the same L0 source files (verified by hash against the L0 manifest).
  3. Run L0 → L2. Verify output hashes against the L2 manifest.
  4. Apply the published decision log to produce L3. Verify against the L3 manifest.
  5. Run L3 → L4. Verify against the L4 manifest.

If all manifest hashes match, the reproduction is exact. If any hash diverges, the manifest diff identifies exactly which records changed and at which layer.
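A minimal comparison sketch, assuming each manifest line carries the output file name and its hash under the keys file and sha256 (the actual field names may differ):

import json

def load_manifest(path: str) -> dict:
    # Map each listed output file to its hash (field names are assumptions).
    entries = {}
    with open(path) as f:
        for line in f:
            e = json.loads(line)
            entries[e["file"]] = e["sha256"]
    return entries

def diff_manifests(path_a: str, path_b: str) -> None:
    a, b = load_manifest(path_a), load_manifest(path_b)
    for name in sorted(set(a) | set(b)):
        if a.get(name) != b.get(name):
            print(f"DIVERGED: {name}  {a.get(name, '<missing>')} vs {b.get(name, '<missing>')}")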

When the two views diverge

Sometimes engineers and consumers reach different conclusions about the same record:

  • An engineer may see that a match was made by LLM with confidence 0.78 and flag it as marginal. A consumer sees “Source: MEDSL, Confidence: Medium” and treats it as usable. Both are correct within their frame.
  • An engineer may know that an embedding model version is deprecated. A consumer sees no change in the output. The manifest captures this risk; the consumer-facing confidence level does not (yet).

The bridge table above is the mechanism for resolving these divergences. When in doubt, trace the consumer field back to its pipeline equivalent and inspect the full provenance chain.

Confidence Levels

Every record in the pipeline carries a confidence level that reflects the trustworthiness of its source and the reliability of the processing steps applied to it. Confidence is not a score — it is a categorical label with defined semantics.

Three levels

High

The source is a certified government publication. Examples: NC SBE certified results, state election board portals that publish official canvass data. Records ingested from these sources enter L0 with source.confidence = "high".

High-confidence sources provide vote totals that are legally authoritative. When two sources disagree, the high-confidence source is treated as ground truth.

Medium

The source is a curated academic dataset derived from government publications. Example: MEDSL, which aggregates and reformats state-published results into a consistent schema. The data is one step removed from the original — parsed, cleaned, and sometimes corrected by the MEDSL team.

Medium-confidence sources are reliable for analysis but are not primary. In the 640 overlapping contests between MEDSL and NC SBE, 90.5% have identical vote totals. The 9.5% that differ are typically due to provisional ballot timing or reporting cutoffs.

Low

The source is community-curated, OCR-derived, or otherwise not traceable to a single certified publication. Examples: OpenElections state files with known parsing issues, any data recovered from PDFs via OCR, or crowd-sourced contest metadata.

Low-confidence records are included in the dataset but flagged. They are useful for coverage (filling gaps where no better source exists) but should not be cited without independent verification.

How confidence propagates

Confidence is not static. It can degrade as records pass through the pipeline, but it never improves without human intervention.

Source confidence (L0–L1). Set at ingestion based on the source type. Deterministic — the same source always gets the same level.

Match confidence (L3). Entity resolution adds a second dimension. If the match method is deterministic (exact string match or Jaro-Winkler ≥ 0.92), the source confidence is preserved. If the match required embedding similarity or LLM confirmation, the record is annotated with the match method and decision ID, but the source confidence is not downgraded — instead, a separate match_confidence field is added.

The combined confidence follows these rules:

Source confidence | Match method | Overall | Notes
High | Exact | High | Best case. Certified source, deterministic match.
High | Jaro-Winkler | High | Algorithmic match above threshold.
High | Embedding | High + decision ID | Source still trusted; match is logged.
High | LLM | High + decision ID | Source still trusted; LLM rationale recorded.
Medium | Exact | Medium | Academic source, deterministic match.
Medium | LLM | Medium + decision ID | Both source and match carry caveats.
Low | Any | Low | Source uncertainty dominates.
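The table reduces to a small lookup. A sketch, not the pipeline's actual implementation:

def combined_confidence(source_confidence: str, match_method: str, decision_id=None) -> str:
    """Low sources stay low; deterministic matches preserve the source level;
    embedding and LLM matches preserve it but carry a decision ID annotation."""
    if source_confidence == "low":
        return "low"
    if match_method in ("exact", "jaro_winkler"):
        return source_confidence
    return f"{source_confidence} + decision {decision_id}" if decision_id else source_confidence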

LLM decision tracking

When an LLM (Claude Sonnet) is involved in entity resolution, the pipeline records:

  • The decision ID (a unique hash of the prompt, response, and model version)
  • The prompt sent to the model
  • The model’s response
  • The confidence score returned by the model

This allows any LLM-assisted decision to be audited, replayed, or overridden. See When the LLM Gets Called.

How to cite records

When using data from this pipeline in publications, cite the original source, not the pipeline. The pipeline provides the information needed to construct a proper citation.

APA format template:

{Source organization}. ({Year}). {Dataset title} [Data set]. Retrieved {retrieval_date} from {url}.

Example:

North Carolina State Board of Elections. (2022). Official general election results [Data set]. Retrieved 2025-01-15 from https://www.ncsbe.gov/results-data.

Each L4 record includes the fields needed to construct this citation: source.name, source.retrieval_date, source.url, and source.confidence. A methodology link pointing to the pipeline documentation should accompany any analysis that depends on entity resolution or cross-source reconciliation.
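Those fields can be assembled mechanically. A sketch; the year and title keys are illustrative placeholders, since only name, retrieval_date, url, and confidence are documented above:

def format_citation(record: dict) -> str:
    src = record["source"]
    year = src.get("year", "n.d.")                  # illustrative; not a documented field
    title = src.get("title", "Election results")    # illustrative; not a documented field
    return (f"{src['name']}. ({year}). {title} [Data set]. "
            f"Retrieved {src['retrieval_date']} from {src['url']}.")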

Reporting Errors

Election data errors are inevitable — misspelled names, transposed digits, misclassified offices. This chapter describes how to report errors, how corrections flow through the pipeline, and how they are documented.

What counts as an error

An error is a factual discrepancy between the pipeline output and the certified source record. Examples:

  • A candidate’s vote total does not match the certified result.
  • Two candidates are incorrectly resolved as the same person (false positive).
  • A single candidate is split into two entities across sources (false negative).
  • An office is classified at the wrong level (e.g., county office tagged as state).
  • A contest is assigned to the wrong jurisdiction or FIPS code.

Formatting preferences (e.g., “they should use a middle name, not an initial”) are not errors. The pipeline normalizes names according to documented rules; stylistic disagreements are out of scope.

How to report

Include the following in every error report:

  • State — two-letter abbreviation.
  • County or jurisdiction — as specific as possible.
  • Contest — the office name and year.
  • Candidate — the name as it appears in the output.
  • The error — what is wrong and what the correct value should be.
  • Source — how you know the correct value (e.g., link to certified results PDF, county clerk confirmation).

File reports via the project’s GitHub issue tracker using the data-error label. One error per issue. Bulk reports (e.g., “all vote totals for County X are wrong”) should include a CSV attachment with the specific records.

How corrections flow through the pipeline

Corrections are not ad hoc patches. They follow the same layered architecture as all other data.

Report → Review → L3 human override → L4 re-canonicalize → Changelog entry
  1. Report. An error is filed with the required fields above.
  2. Review. A maintainer verifies the error against the cited source. If the source confirms the discrepancy, the report is accepted.
  3. L3 human override. A decision record is added to the L3 decision log with decision_type: "human_override", the reporter’s source citation, and the corrected value. The original machine decision is preserved — overrides do not delete history.
  4. L4 re-canonicalize. The L4 canonical layer is regenerated from the updated L3 output. Only records affected by the override change.
  5. Changelog entry. The correction is recorded in the Changelog with the issue number, affected records, and the nature of the fix.

What happens to the original data

Nothing. L0 (raw) and L1 (cleaned) records are immutable. If the error is in the source itself (e.g., the state published a wrong number that was later corrected in an amended certification), the amended source file is ingested as a new L0 record. Both the original and amended records coexist, with the L3 decision log recording which one is authoritative.

Transparency

All override decisions are stored in the same JSONL decision log as algorithmic decisions. They are queryable, auditable, and included in pipeline replay. A consumer who disagrees with a correction can inspect the decision record, see the cited source, and file a counter-report.

Corrections do not silently change output. Every correction increments the dataset version and appears in the changelog.

Known Limitations

This chapter documents what the project cannot do, where the data is incomplete, and where results should be interpreted with caution. These are not future plans — they are current, known constraints.

Coverage gaps by state

MEDSL 2022 data contains zero local election results for seven states:

State | FIPS | Notes
California | 06 | State publishes results but not in MEDSL local dataset
Iowa | 19 | County-level results exist on state portal; not aggregated
Kansas | 20 | No local results in MEDSL
New Jersey | 34 | County clerk offices publish individually; no aggregation
Pennsylvania | 42 | 67 counties, each with its own reporting format
Tennessee | 47 | No local results in MEDSL
Wisconsin | 55 | State portal exists but data not present in MEDSL

These gaps are source-dependent. If a future pipeline version integrates state portal data directly, coverage may improve. Until then, any “national” statistic derived from this dataset is actually a 43-state statistic.

Turnout data

Turnout figures (registered voters, ballots cast) are present in fewer than 5% of records. Most sources report candidate-level vote totals but not the denominator. This means:

  • Vote share (candidate votes / total ballots) cannot be computed for most contests.
  • Voter participation rates at the local level are not derivable from this dataset.
  • Where turnout data does exist, it is preserved as TurnoutMetadata contest records at L1 and carried through to L4.

Do not assume that the absence of turnout data means turnout was low. It means the source did not report it.

Odd-year elections

Elections held in 2015, 2017, 2019, and 2021 are underrepresented. MEDSL publishes even-year datasets (2016, 2018, 2020, 2022) with strong coverage. Odd-year local elections — common for municipal and school board races — are covered only where state-specific sources (e.g., NC SBE) include them.

This creates a systematic bias: states that hold local elections in odd years appear to have fewer local races than they actually do. New Jersey (already missing from MEDSL local data) and Virginia (odd-year state legislative elections) are particularly affected.

Entity resolution is probabilistic

The L3 matching layer uses a four-step cascade: exact match → Jaro-Winkler → embedding similarity → LLM confirmation. Only exact matches are deterministic in the strong sense. All other match methods involve thresholds:

  • Jaro-Winkler threshold: 0.92. Names scoring below this are not matched, even if they refer to the same person.
  • Embedding cosine similarity threshold: 0.88. Composite strings that fall below this are sent to LLM review or left unmatched.
  • LLM confirmation is logged with a decision ID but is inherently non-deterministic across model versions. Decisions are frozen in the decision log for reproducibility, but a different model version might make different decisions.

Consequences:

  • Some true matches are missed (false negatives), especially for candidates with common names in different jurisdictions.
  • Some incorrect matches may exist (false positives), especially for candidates with identical names in overlapping jurisdictions (e.g., father/son with the same name).
  • All non-exact match decisions are queryable by match method and score. Downstream users can apply stricter thresholds if their use case requires higher precision at the cost of lower recall (see the sketch below).
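For instance, a sketch that keeps only deterministic matches plus probabilistic matches above stricter-than-default thresholds; field names follow the decision log description, and the threshold values are examples, not recommendations:

import json

def high_precision_matches(decision_log_path: str, min_embedding=0.95, min_llm=0.90) -> list:
    """Filter the L3 decision log to matches meeting stricter thresholds."""
    kept = []
    with open(decision_log_path) as f:
        for line in f:
            d = json.loads(line)
            if d.get("output") != "match":
                continue
            method = d.get("method")
            score = d.get("score") or 0.0
            if method in ("exact", "jaro_winkler"):
                kept.append(d)
            elif method == "embedding" and score >= min_embedding:
                kept.append(d)
            elif method == "llm" and score >= min_llm:
                kept.append(d)
    return kept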

No ranked-choice voting (RCV) support

The schema represents first-past-the-post and plurality contests. Ranked-choice voting results — used in Alaska, Maine, New York City, and a growing number of jurisdictions — require round-by-round tabulation data that the current schema does not model.

RCV results from these jurisdictions may appear in the dataset as final-round totals (where the source reports them that way), but intermediate rounds, elimination order, and ballot transfer data are not captured.

ALGED not integrated

The Annual Local Government Election Dataset (ALGED) covers mayoral and city council races in cities with populations above 50,000. It includes candidate demographics and incumbency data not available in other sources. This dataset is not currently integrated into the pipeline. Its coverage period ends around 2021.

Integration is planned but not scheduled. When integrated, ALGED records will enter at L0 like any other source and pass through the same cleaning, embedding, and matching layers.

Vote mode data

Vote mode breakdowns (Election Day, absentee, early voting, provisional) are present in approximately 33% of source records. The remaining 67% report only total votes per candidate. Cross-source comparisons of vote mode data are unreliable because:

  • States define vote modes differently (e.g., “absentee” vs. “mail” vs. “vote by mail”).
  • Some sources aggregate early voting into Election Day totals.
  • Provisional ballot handling varies by state and is time-dependent (provisionals may be added days after initial reporting).

Pipeline not validated at national scale

The pipeline has been tested against NC SBE data (2004–2022) and MEDSL data (2018–2022, 43 states). The 640-contest overlap between MEDSL and NC SBE provides a validation baseline: 90.5% exact vote match, with 63% of name formatting differences successfully resolved.

Full national-scale validation — running all 42 million rows through L0→L4 with cross-source reconciliation — has not been completed. Edge cases in states with unusual office structures (Louisiana’s parish system, Alaska’s borough system, Virginia’s independent cities) may surface issues not yet encountered.

What this means for users

If your work depends on completeness, check the Coverage Matrix for your specific state and year before drawing conclusions. If your work depends on entity resolution accuracy, filter to match methods and scores that meet your precision requirements. If your work involves RCV jurisdictions, this dataset does not capture round-level data.

These limitations are structural, not aspirational. They will change as sources are added and the pipeline matures, but they describe the current state accurately.

Full Nickname Dictionary

The pipeline applies nickname normalization at L1 to improve entity resolution at L3. When a candidate’s first name matches a known nickname, the canonical form is stored in canonical_first and the original is preserved in first.

This dictionary is applied deterministically. Every name is checked against the table below. No context or heuristics are used — if the input matches the nickname column, the canonical column is applied. This means the mapping is fast and reproducible but occasionally wrong (see The Ted Problem below).

Mappings

Nickname | Canonical | Notes
al | albert
alex | alexander
andy | andrew
barb | barbara
ben | benjamin
bernie | bernard
bert | albert | Also Herbert; resolved to albert by frequency
beth | elizabeth
bill | william
billy | william
bob | robert
bobby | robert
bonnie | bonita
bud | william | Regional; less reliable
charlie | charles
chris | christopher | Also Christine; gendered ambiguity
chuck | charles
cindy | cynthia
dan | daniel
danny | daniel
dave | david
deb | deborah
debbie | deborah
dick | richard
don | donald
doug | douglas
drew | andrew
ed | edward
eddie | edward
frank | franklin | Also Francis; resolved to franklin by frequency
fred | frederick
gene | eugene
gerry | gerald
hank | henry
harry | harold | Also Henry (British tradition); resolved to harold
jack | john
jake | jacob
jan | janice | Also Janet; resolved to janice by frequency
jenny | jennifer
jerry | gerald | Also Jerome; resolved to gerald by frequency
jim | james
jimmy | james
joe | joseph
johnny | john
jon | jonathan | Distinct from john
kate | katherine | Also Kathryn, Catherine
kathy | katherine
ken | kenneth
kenny | kenneth
larry | lawrence
liz | elizabeth
maggie | margaret
matt | matthew
mike | michael
mitch | mitchell
nancy | ann | Historical mapping; low reliability
nick | nicholas
nikki | nicole
norm | norman
pat | patrick | Also Patricia; gendered ambiguity
patti | patricia
patty | patricia
peggy | margaret
pete | peter
phil | philip
ray | raymond
rick | richard
rob | robert
ron | ronald
sally | sarah
sam | samuel | Also Samantha; gendered ambiguity
sandy | sandra | Also Alexander; gendered ambiguity
steve | steven
sue | susan
ted | edward | See The Ted Problem below
terry | terrence | Also Teresa; gendered ambiguity
tim | timothy
tom | thomas
tommy | thomas
tony | anthony
val | valerie
vince | vincent
walt | walter
wes | wesley
will | william
woody | woodrow

The Ted Problem

“Ted” maps to both Edward (Ted Kennedy → Edward Kennedy; Ted Cruz → Rafael Edward Cruz) and Theodore (Teddy Roosevelt → Theodore Roosevelt). The dictionary maps ted → edward because Edward is the more frequent canonical form in US election data. This means a candidate whose legal name is Theodore but who files as Ted will be canonicalized as Edward.

This is a known, accepted error. It affects L1 canonical_first but does not prevent correct entity resolution at L3 — because L3 matches on composite strings that include last name, jurisdiction, office, and year. Two candidates named “Ted Smith” in different counties will not be merged regardless of whether canonical_first is edward or theodore.

The original filed name is always preserved in first. Any downstream consumer who needs the original can ignore canonical_first and use first directly.

Gendered ambiguity

Several nicknames map to names that could be either male or female: Chris (Christopher/Christine), Pat (Patrick/Patricia), Sam (Samuel/Samantha), Sandy (Sandra/Alexander), Terry (Terrence/Teresa). The dictionary resolves these to the statistically more common canonical form in US election candidate data. The mapping is not always correct for individual candidates.

As with the Ted problem, the original name is preserved, and entity resolution at L3 uses additional fields (jurisdiction, office, party) to avoid incorrect merges caused by nickname ambiguity.

When the dictionary is not applied

The dictionary is skipped when any of the following holds (a sketch combining these rules appears after the list):

  • The input first name is longer than 6 characters and matches no entry (assumed to already be a full name).
  • The candidate record has a canonical_first value set by the source (some sources provide both nickname and legal name).
  • The input is an initial only (e.g., “J.” is not expanded).
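Combining the dictionary with these skip rules, a minimal sketch; the excerpted dictionary and the function name are illustrative:

NICKNAMES = {"bill": "william", "bob": "robert", "ted": "edward", "tim": "timothy"}  # excerpt

def canonicalize_first(first: str, source_canonical: str = "") -> str:
    """Apply the nickname dictionary under the skip rules described above."""
    if source_canonical:                  # the source already supplied a legal first name
        return source_canonical
    name = first.strip().lower()
    if len(name.rstrip(".")) <= 1:        # bare initials like "J." are never expanded
        return name
    return NICKNAMES.get(name, name)      # unmatched names pass through unchanged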

Office Classification Reference

The pipeline classifies 8,387 unique office name strings into canonical office types using a four-tier system. Each tier handles progressively harder cases. This appendix documents tiers 1 and 2 in full and summarizes tiers 3 and 4.

Coverage summary

Tier | Method | Unique offices handled | Cumulative coverage
1 | Keyword lookup | 3,102 | 37%
2 | Regex patterns | 2,097 | 62%
3 | Embedding similarity | 2,340 | 90%
4 | LLM classification | 848 | 100%

Tiers 1 and 2 are fully deterministic — same input, same output, no external calls. Tier 3 uses cosine similarity against text-embedding-3-large embeddings of known office types. Tier 4 sends unresolved strings to Claude Sonnet with a structured prompt.

Tier 1: Keyword lookup

A case-insensitive keyword match against the office name string. If any keyword appears in the string, the office is classified immediately. Keywords are checked in order; the first match wins.

Keyword | office_level | office_category
president | federal | executive
u.s. senate | federal | legislative
u.s. house | federal | legislative
congress | federal | legislative
governor | state | executive
lieutenant governor | state | executive
attorney general | state | executive
secretary of state | state | executive
state treasurer | state | executive
state auditor | state | executive
state senate | state | legislative
state house | state | legislative
state representative | state | legislative
state assembly | state | legislative
supreme court | state | judicial
court of appeals | state | judicial
appeals court | state | judicial
district court | county | judicial
superior court | county | judicial
county commissioner | county | legislative
county council | county | legislative
sheriff | county | law_enforcement
clerk of court | county | judicial
register of deeds | county | administrative
coroner | county | administrative
constable | county | law_enforcement
justice of the peace | county | judicial
school board | local | education
board of education | local | education
city council | local | legislative
mayor | local | executive
alderman | local | legislative
township trustee | local | legislative
soil and water | local | special_district
fire district | local | special_district
water district | local | special_district

Notes:

  • “u.s. senate” is checked before “state senate” to avoid false matches.
  • “lieutenant governor” is checked before “governor” for the same reason.
  • Keywords are matched as substrings, not whole words. “county commissioner district 3” matches on “county commissioner”.
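The lookup itself is a short ordered scan. A sketch using a small excerpt of the table above; the full list and the exact data structure belong to the pipeline:

# Order matters: "u.s. senate" is checked before "state senate", and
# "lieutenant governor" before "governor", per the notes above.
TIER1_KEYWORDS = [
    ("u.s. senate", "federal", "legislative"),
    ("state senate", "state", "legislative"),
    ("lieutenant governor", "state", "executive"),
    ("governor", "state", "executive"),
    ("sheriff", "county", "law_enforcement"),
    ("school board", "local", "education"),
]  # excerpt

def classify_tier1(office_name: str):
    """Return (office_level, office_category) for the first keyword found as a substring."""
    name = office_name.lower()
    for keyword, level, category in TIER1_KEYWORDS:
        if keyword in name:
            return level, category
    return None  # unresolved: fall through to tier 2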

Tier 2: Regex patterns

When no tier 1 keyword matches, the office string is tested against a series of compiled regular expressions. These handle structural patterns that keyword matching cannot.

Pattern | office_level | office_category | Example matches
(?i)^(us|united states) (rep|senator) | federal | legislative | “US Rep District 4”
(?i)district judge.*district \d+ | county | judicial | “District Judge 21st Judicial District”
(?i)(city|town|village) (of|de) .+ (council|trustee|board) | local | legislative | “Town of Cary Council”
(?i)independent school district.*\d+ | local | education | “Independent School District 279 Board”
(?i)(municipal|mun\.?) (utility|water|sewer) district | local | special_district | “Municipal Utility District 14”
(?i)community college.*trustee | local | education | “Community College District Trustee”
(?i)(precinct|ward) (chair|committee) | local | party | “Precinct 12 Committee Chair”
(?i)conservation district (super|board|dir) | local | special_district | “Conservation District Supervisor”
(?i)(drainage|levee|flood) (district|board) | local | special_district | “Drainage District 7 Board”
(?i)hospital district (board|dir|trustee) | local | special_district | “Hospital District Board Member”
(?i)park (district|board) (comm|dir|trustee) | local | special_district | “Park District Commissioner”
(?i)sanitary district | local | special_district | “Sanitary District Trustee”
(?i)mosquito (abatement|control) district | local | special_district | “Mosquito Abatement District Trustee”
(?i)(borough|parish) (council|president|assembly) | county | legislative | “Borough Assembly Member”
(?i)district attorney | county | law_enforcement | “District Attorney 26th District”

Regex patterns are tested in order. The first match wins. All patterns use case-insensitive mode.
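A sketch of the tier 2 pass, using a few of the patterns above; the full ordered list lives in the pipeline:

import re

TIER2_PATTERNS = [  # excerpt; tried in order, first match wins
    (re.compile(r"(?i)^(us|united states) (rep|senator)"), "federal", "legislative"),
    (re.compile(r"(?i)independent school district.*\d+"), "local", "education"),
    (re.compile(r"(?i)district attorney"), "county", "law_enforcement"),
]

def classify_tier2(office_name: str):
    """Return (office_level, office_category) for the first matching pattern."""
    for pattern, level, category in TIER2_PATTERNS:
        if pattern.search(office_name):
            return level, category
    return None  # unresolved strings continue to tier 3 (embedding similarity)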

Tier 3: Embedding similarity

Office strings that pass through tiers 1 and 2 unclassified are embedded using text-embedding-3-large (3072 dimensions) and compared against a reference set of known office type embeddings via FAISS nearest-neighbor search.

  • Threshold: cosine similarity ≥ 0.85 against the nearest known office type.
  • Reference set: the canonical office types defined by tiers 1 and 2, plus manually curated additions for jurisdiction-specific titles.
  • Examples resolved at tier 3:
    • “Moderator” → local / legislative (New England town meeting role)
    • “Fence Viewer” → local / administrative (historical New England office)
    • “Pound Keeper” → local / administrative
    • “Surveyor of Highways” → local / administrative
    • “Oyster Commissioner” → local / special_district (Maryland)

Tier 3 handles 2,340 unique office strings — mostly jurisdiction-specific titles, historical offices, and compound names that do not match keyword or regex patterns.

Tier 4: LLM classification

The remaining 848 office strings are sent to Claude Sonnet with a structured prompt that provides the office name, the state, and the county (where available). The LLM returns office_level, office_category, and a brief rationale.

Every tier 4 decision is recorded in the decision log with:

  • decision_id
  • input_string (the original office name)
  • output_level and output_category
  • llm_request_id
  • rationale (the LLM’s explanation)

Tier 4 classifications can be overridden by adding entries to the tier 1 or tier 2 tables in subsequent pipeline versions. Once an office string is promoted to tier 1 or tier 2, it is classified deterministically on all future runs.

Office level and category enumerations

office_level values: federal, state, county, local.

office_category values: executive, legislative, judicial, law_enforcement, administrative, education, special_district, party.

These enumerations are defined in the Enumerations Reference. Every classified office receives exactly one level and one category.

Handling ambiguity

Some office strings are genuinely ambiguous:

  • “Board of Commissioners” could be county or municipal depending on jurisdiction.
  • “Trustee” alone could be township, school board, or special district.
  • “Judge” without a court name could be any judicial level.

In these cases, the pipeline uses jurisdiction context (state, county, FIPS code) to disambiguate. If the jurisdiction does not resolve the ambiguity, the string is sent to tier 3 or 4 with the full context attached.

NIST SP 1500-100 Alignment

This appendix maps the pipeline’s schema fields to concepts defined in NIST SP 1500-100 v2, the Election Results Common Data Format Specification. The mapping is informational — the pipeline does not emit NIST-compliant XML, but its internal schema was designed with alignment in mind.

Field mapping

| Pipeline field | NIST SP 1500-100 concept | NIST element | Notes |
| --- | --- | --- | --- |
| contest | Contest | CandidateContest | Candidate races map to CandidateContest. |
| contest (ballot measure) | Contest | BallotMeasureContest | Ballot measures use a separate NIST element. |
| contest.name | Contest name | CandidateContest.Name | Raw office string before normalization. |
| contest.canonical_office | Office | Office.Name | L4 normalized office name. |
| candidate.canonical_first, canonical_last | Candidate | Candidate.PersonFullName | Pipeline stores components; NIST stores full name. |
| candidate.party | Party | Party.Abbreviation | Three-letter codes (DEM, REP, LIB, etc.). |
| jurisdiction.ocd_id | Geographic unit | GpUnit.ExternalIdentifier | OCD-ID used as the external identifier type. |
| jurisdiction.county_fips | Geographic unit | GpUnit.ExternalIdentifier | FIPS code, identifier type fips. |
| jurisdiction.state | Geographic unit | GpUnit.Type = "state" | Two-letter USPS abbreviation. |
| votes.total | Vote counts | VoteCounts.Count | Total votes for a candidate in a contest. |
| votes.by_mode.election_day | Vote counts by type | VoteCounts.CountItemType = "election-day" | Present in ~33% of records. |
| votes.by_mode.absentee | Vote counts by type | VoteCounts.CountItemType = "absentee" | Terminology varies by state. |
| votes.by_mode.early | Vote counts by type | VoteCounts.CountItemType = "early" | Some sources merge into election day. |
| votes.by_mode.provisional | Vote counts by type | VoteCounts.CountItemType = "provisional" | Timing of inclusion varies. |
| election.date | Election | Election.StartDate | Single date; no multi-day modeling. |
| election.type | Election type | Election.Type | Values: general, primary, runoff, special. |
| turnout.registered_voters | Turnout metadata | VoteCounts.CountItemType = "total" on BallotCounts | Present in <5% of records. |
| turnout.ballots_cast | Turnout metadata | BallotCounts.BallotsCast | Same coverage caveat. |
| contest.district | Electoral district | ElectoralDistrict.Name | District number or name within an office. |
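
To make the mapping concrete, here is an invented pipeline record annotated with a few of its NIST counterparts (all values are fabricated for illustration except the Mecklenburg County FIPS code):

```python
# The pipeline stores JSONL records shaped like this; it does not emit NIST XML.
record = {
    "contest": {"name": "SHERIFF", "canonical_office": "County Sheriff"},
    "candidate": {"canonical_first": "Jane", "canonical_last": "Doe",
                  "party": "DEM"},               # -> Party.Abbreviation
    "jurisdiction": {
        "state": "NC",                           # -> GpUnit.Type = "state"
        "county_fips": "37119",                  # -> GpUnit.ExternalIdentifier (fips)
        "ocd_id": "ocd-division/country:us/state:nc/county:mecklenburg",
    },
    "votes": {"total": 41230,                    # -> VoteCounts.Count
              "by_mode": {"election_day": 25000, "absentee": 16230}},
}
```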

Concepts not modeled

The following NIST SP 1500-100 concepts have no direct equivalent in the pipeline schema:

  • RetentionContest — Judicial retention elections are classified as BallotMeasure with yes/no choices rather than as a distinct contest type.
  • OrderedContest — Ballot ordering is not captured. The pipeline does not model ballot layout.
  • BallotStyle — No ballot style or precinct-to-ballot mapping is maintained.
  • Ranked-choice voting rounds — CountItemType values for RCV rounds (round-1, round-2, etc.) are not supported. See Known Limitations.
  • Overvotes and undervotes — Tracked as TurnoutMetadata contest records at L1, not as NIST OtherCounts.

Pipeline concepts not in NIST

The following pipeline concepts have no NIST equivalent:

  • provenance.hash — SHA-256 hash chain for record integrity. NIST defines no provenance model.
  • entity_resolution.method — Match method metadata (exact, Jaro-Winkler, embedding, LLM). Entity resolution is outside the scope of NIST SP 1500-100.
  • source.confidence — High/medium/low confidence levels. NIST does not model source reliability.
  • Layer identifiers (L0–L4) — The multi-layer pipeline architecture is specific to this project.

Research References

This appendix lists the research papers, datasets, and standards cited throughout the documentation.

Entity resolution

  • Dasanaike, T., et al. (2026). EnsembleLink: Ensemble methods for scalable entity resolution. Preprint.
  • Ornstein, J. (2025). fuzzylink: Probabilistic record linkage with large language models. Preprint.
  • CE-RAG4EM (2026). Context-Enhanced Retrieval-Augmented Generation for Entity Matching. Preprint.
  • Zeakis, A., et al. (2025). AvengER: Automated verification of entity resolution results. Preprint.

Election data sources

Standards

  • NIST SP 1500-100 v2. Election Results Common Data Format Specification. National Institute of Standards and Technology.

Architecture

Reports

  • Union of Concerned Scientists. (2025). Election Data Report: The state of US election data infrastructure.

Glossary

Blocking. A preprocessing step in entity resolution that partitions records into groups (blocks) that share a key attribute — typically state + office type or county FIPS code. Only records within the same block are compared, reducing the number of pairwise comparisons from O(n²) to a tractable subset.
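
A minimal blocking sketch, assuming hypothetical field names, in which only records sharing a block key are ever paired:

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key=lambda r: (r["state"], r["office_type"])):
    """Yield candidate pairs for comparison, one block at a time."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)          # partition by block key
    for block in blocks.values():
        yield from combinations(block, 2)     # pairwise only within a block
```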

Composite string. A concatenated text representation of a record used as input to an embedding model. A candidate composite string might combine name, office, jurisdiction, party, and year into a single string. The template that defines which fields are included and in what order is versioned and stored in the L2 manifest.
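
A hypothetical template of this kind (the real template, including its field set and order, is versioned in the L2 manifest):

```python
# Field names and order are assumptions for illustration.
TEMPLATE = "{name} | {office} | {jurisdiction} | {party} | {year}"

def composite_string(rec: dict) -> str:
    return TEMPLATE.format(
        name=rec["canonical_name"], office=rec["office"],
        jurisdiction=rec["jurisdiction"], party=rec["party"], year=rec["year"],
    )
```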

Cosine similarity. A measure of similarity between two vectors, computed as the cosine of the angle between them. Ranges from -1 to 1; values closer to 1 indicate higher similarity. Used at L3 to compare candidate and contest embeddings. The pipeline uses a threshold of 0.88 for embedding-based entity matches.
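
The computation itself is one line; the toy vectors below stand in for the real 3072-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a, b = np.array([1.0, 0.0, 1.0]), np.array([1.0, 0.5, 0.8])
print(cosine_similarity(a, b) >= 0.88)  # the L3 embedding-match threshold
```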

Entity resolution. The process of determining whether two records refer to the same real-world entity (person, office, or contest) despite differences in formatting, naming, or source. The pipeline uses a four-step cascade: exact match → Jaro-Winkler → embedding similarity → LLM confirmation.

FAISS. Facebook AI Similarity Search. A library for efficient similarity search over dense vector collections. Used at L3 to perform approximate nearest-neighbor lookups over L2 embeddings when comparing candidate records across sources.

FIPS code. Federal Information Processing Standards code. A numeric identifier assigned by the Census Bureau to states (2 digits), counties (5 digits: 2 state + 3 county), and other geographic entities. Example: 37119 = Mecklenburg County, North Carolina. Used as a join key across sources.

Jaro-Winkler similarity. A string similarity metric that gives higher scores to strings that match from the beginning. Ranges from 0 to 1. The pipeline uses a threshold of 0.92 for name matching. Preferred over edit distance for person names because prefix agreement is a strong signal of identity.
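
For illustration, using the jellyfish library (an assumption; the documentation does not name the implementation) with invented names:

```python
import jellyfish  # assumed library; the pipeline's implementation is unspecified

# Prefix agreement is rewarded: shared leading characters raise the score.
score = jellyfish.jaro_winkler_similarity("MARTINEZ, ANA L", "MARTINEZ, ANA LUISA")
print(score >= 0.92)  # 0.92 is the pipeline's name-matching threshold
```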

JSONL. JSON Lines. A text format in which each line is a single valid JSON object. The pipeline uses JSONL as the storage and interchange format at every layer (L0–L4). One record per line enables streaming reads and line-level integrity checks.
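
A sketch of a streaming read with a per-line hash, assuming a hypothetical file name; the pipeline's exact integrity mechanism may differ:

```python
import hashlib
import json

# "l1_cleaned.jsonl" is an invented path for illustration.
with open("l1_cleaned.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)  # one JSON object per line
        digest = hashlib.sha256(line.rstrip("\n").encode("utf-8")).hexdigest()
        # `digest` can be checked against a stored manifest for that line.
```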

L0 (Raw). The first pipeline layer. Byte-identical copies of source files as retrieved. No parsing, no transformation. Stored with retrieval timestamps and SHA-256 hashes.

L1 (Cleaned). The second layer. Deterministic parsing, field extraction, name normalization, and FIPS enrichment. Output is structured JSONL with a consistent schema regardless of source format.

L2 (Embedded). The third layer. Adds vector embeddings (text-embedding-3-large, 3072 dimensions) and office classification results. Deterministic given L1 input and a fixed model version.

L3 (Matched). The fourth layer. Entity resolution — linking records that refer to the same candidate, contest, or office across sources and years. Non-deterministic steps (LLM calls) are recorded in the decision log for replay.

L4 (Canonical). The fifth layer. Assigns canonical names, deduplicates records, selects authoritative values, and produces the final queryable dataset. Deterministic given L3 input.

OCD-ID. Open Civic Data Identifier. A hierarchical string identifier for political geographies, following the pattern ocd-division/country:us/state:nc/county:mecklenburg. Used to link jurisdictions across datasets that may use different naming conventions.

Precinct. The smallest administrative unit for election administration. Voters are assigned to a precinct based on their address. Precinct-level results, when available, provide the most granular view of voting patterns. Coverage varies — some sources report only county-level totals.

Changelog

All notable changes to the dataset and pipeline will be documented in this file.

Each entry includes the date, affected layer(s), and a summary of the change.


[Unreleased]

No releases yet.


Entry template

  • Date: YYYY-MM-DD
  • Layer(s): L0 / L1 / L2 / L3 / L4
  • Change: Description of what changed.
  • Issue: Link to GitHub issue (if applicable).