Election Aggregation
A multi-layer pipeline for collecting, normalizing, and unifying US local election results from heterogeneous sources.
The question this project answers
Who ran for school board in your county last year? Who was the sheriff, and did anyone run against them? What was the closest local race in your state? Has your county commissioner been reelected five times unopposed, or do they face real competition?
These should be easy questions. They are not.
There is no national database of US local election results. The data exists — scattered across 50 state election boards, 3,000+ county clerk offices, academic datasets, election night reporting platforms, and community-curated repositories — but it has never been unified into a single, consistent, trustworthy format. Every source uses different schemas, different name formats, different office titles, different geographic identifiers, and different levels of completeness.
This project fixes that.
What we found when we tried
We downloaded 42 million rows of precinct-level election data from the MIT Election Data + Science Lab (MEDSL), the North Carolina State Board of Elections, OpenElections, VEST, the Census Bureau, and the FEC. We covered all 50 states across three election cycles (2018, 2020, 2022) and ten election cycles of North Carolina history (2006–2024).
Then we tried to answer simple questions, and the problems started immediately.
The same candidate appears differently across sources. MEDSL reports SHANNON W BRAY. NC SBE reports Shannon W. Bray. One is all caps with no period after the middle initial. The other is title case with a period. These are the same person — but a computer doesn’t know that without being told.
Nicknames break everything. Charlie Crist in one source is CRIST, CHARLES JOSEPH in another. A human recognizes Charlie as a nickname for Charles. An embedding model scores their similarity at 0.451 — well below any reasonable match threshold. A language model, given the context (same state, same office, same election, same vote count), correctly identifies them as the same person with 0.95 confidence.
The same office title means different things. In Texas, the “County Judge” is the chief executive of the county, roughly equivalent to a county manager. In most other states, a county judge is a judicial officer. If your system classifies “DALLAS COUNTY JUDGE” as judicial, you’re wrong in Texas and right almost everywhere else. Across all 50 states in 2022, we found 8,387 unique local office names. Our keyword classifier handles 62% of them. The remaining 38% require embedding-based matching and LLM reasoning.
Non-candidate data hides inside candidate data. Florida OpenElections includes 6,013 rows labeled “Registered Voters” — not a contest, but a turnout metadata row that got slurped into the results file as if it were a race. Other sources include “BLANK” (Maine’s name for undervotes), “TOTAL VOTES” (Utah’s aggregation rows), and “OverVotes” / “UnderVotes” masquerading as candidate names. Each source has its own ghosts.
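A minimal sketch of how such rows might be screened out during cleaning. The label set is taken from the examples above; the function name and the whitespace and case folding are assumptions for illustration, not the pipeline's actual rules.

```python
# Labels taken from the examples in the text; a real list would be longer.
NON_CANDIDATE_LABELS = {
    "registered voters", "blank", "total votes", "overvotes", "undervotes",
}

def is_non_candidate(label: str) -> bool:
    """Return True when a 'candidate' field holds metadata, not a person."""
    key = " ".join(label.lower().split())  # fold case, collapse whitespace
    return key in NON_CANDIDATE_LABELS

rows = ["Registered Voters", "BLANK", "Shannon W. Bray", "OverVotes", "TOTAL VOTES"]
candidates = [r for r in rows if not is_non_candidate(r)]
print(candidates)  # ['Shannon W. Bray']
```

An exact-match set is deliberately conservative: it never drops a real candidate whose name merely contains one of these words, at the cost of missing label variants that have not been listed yet.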
Nobody tracks the same person across elections. Timothy Lance won a Columbus County, NC school board seat in 2022. Did he run before? Did he win? Is the “T. Lance” who ran in 2018 the same person? No existing dataset answers this. Entity resolution — determining that two records refer to the same human being — is the hardest problem in this project, and the one we spend the most effort on.
What this project does
Election Aggregation is a five-layer pipeline that transforms messy, heterogeneous election data into a clean, unified, entity-resolved dataset with full provenance back to the original source files.
L0 RAW Byte-identical source files. Never modified.
↓
L1 CLEANED Parsed, structured records. Names decomposed into
first/middle/last/suffix. FIPS codes enriched.
Office classified by keyword and regex.
Purely deterministic. No ML, no API calls.
↓
L2 EMBEDDED Vector embeddings generated for candidates, contests,
and geographic names. Office classification tier 3
(embedding nearest-neighbor). Quality flags raised.
Deterministic given the same embedding model.
↓
L3 MATCHED Entity resolution. Same-candidate and same-contest
identifiers assigned. Embedding retrieval + LLM
confirmation. Every decision stored with reasoning.
↓
L4 CANONICAL Authoritative names chosen. Temporal chains built.
Alias tables constructed. Verification algorithms run.
Researcher-facing exports produced.
The ordering is strict and deliberate: Clean → Embed → Match → Canonicalize. You cannot assign an authoritative name before you know who the person is. You cannot match entities before you have embeddings. You cannot embed before you have clean, signal-preserving parsed records. And you cannot parse before you have the raw bytes.
Every record at every layer carries a cryptographic hash chain back to the original source file. If someone modifies a vote count, changes a name, or alters a match decision at any layer, the verification algorithm detects exactly where the chain breaks.
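The idea can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: SHA-256, the JSON serialization, and the function names `record_hash`, `build_chain`, and `first_break` are all hypothetical choices.

```python
import hashlib
import json

def record_hash(payload: dict, parent_hash: str) -> str:
    """Hash a record's content together with the hash of the layer below."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((parent_hash + body).encode("utf-8")).hexdigest()

def build_chain(source_bytes: bytes, layers: list[dict]) -> list[str]:
    """Return [L0 hash, L1 hash, ...] for one record's lineage."""
    hashes = [hashlib.sha256(source_bytes).hexdigest()]
    for payload in layers:
        hashes.append(record_hash(payload, hashes[-1]))
    return hashes

def first_break(source_bytes: bytes, layers: list[dict], stored: list[str]):
    """Return the index of the first layer whose hash no longer matches."""
    recomputed = build_chain(source_bytes, layers)
    for i, (expected, actual) in enumerate(zip(recomputed, stored)):
        if expected != actual:
            return i
    return None

raw = b"CABARRUS\t12-13\tUS SENATE\tShannon W. Bray\t90\n"
layers = [
    {"name": "Shannon W. Bray", "votes": 90},               # L1 (simplified)
    {"name": "Shannon W. Bray", "votes": 90, "flags": []},  # L2 (vector omitted)
]
stored = build_chain(raw, layers)

layers[1]["votes"] = 91  # tamper with the L2 record
print(first_break(raw, layers, stored))  # 2, i.e. the chain breaks at L2
```

Because each hash folds in the hash of the layer beneath it, a change at any layer invalidates that layer and everything above it, while the layers below still verify.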
What this project does not do
- It does not store election data. The data files are large (7+ GB for our current corpus) and are published by their respective sources under their own terms. This project tells you where to get the data, documents every source’s schema and quirks, and provides the tools to process it. You download the data yourself.
- It is not a real-time election night tracker. We ingest official and certified results, not live feeds. The pipeline is designed for post-election analysis, not real-time reporting.
- It is not a prediction model. We report what happened, not what will happen.
- It does not claim perfect accuracy. Entity resolution is probabilistic. Office classification has a 0.56% “other” rate. Some records have data quality issues we haven’t caught yet. We document every known limitation, and every match decision is auditable.
What you can answer today
With the data currently available (MEDSL 2018/2020/2022 for all 50 states, NC SBE 2006–2024, OpenElections for 6 states, VEST shapefiles for 4 states), you can answer:
| Question | Answer from our data |
|---|---|
| How many sheriffs ran unopposed in 2022? | 55% in North Carolina, 77% in Maine, varies by state |
| What was the closest school board race in America? | Dawson County, GA — exact tie at 25,186 to 25,186 |
| How many local races were uncontested? | 48.8% nationally (keyword-classified subset) |
| Which office type is least competitive? | Constable/Coroner at 72% uncontested |
| Which is most competitive? | City Council at 10% uncontested |
| Who has served longest on a local body in NC? | George Dunlap — Mecklenburg County Commissioner, 6 consecutive cycles (2014–2024) |
| How many unique elected offices exist in America? | At least 8,387 distinct office names in MEDSL 2022 alone; 4,995 exist in exactly one county |
| Did the same candidate run across multiple elections? | Yes — 702 NC candidates appear in 3+ election cycles (2014–2024) |
And questions you cannot answer yet, honestly:
| Question | Why not |
|---|---|
| What’s the voter turnout for school board races? | Turnout data exists in less than 5% of records |
| Did this candidate switch parties? | Requires entity resolution across elections, which is functional but not yet validated at scale |
| What are the RCV round-by-round results? | Schema doesn’t support ranked-choice voting yet |
| How do local election results correlate with demographics? | Census demographic join is ready (100% FIPS coverage) but not yet implemented |
| What happened in odd-year elections (2015, 2017, 2019)? | MEDSL has odd-year data on Harvard Dataverse; we haven’t loaded it yet |
The data sources
This project processes data from multiple sources. We do not redistribute their data. Here is what each provides and where to get it:
| Source | What it is | Coverage | Where to get it |
|---|---|---|---|
| MEDSL | MIT Election Data + Science Lab precinct returns | All 50 states + DC, 2018/2020/2022 | GitHub, Harvard Dataverse |
| NC SBE | North Carolina State Board of Elections | NC only, 2006–2024 (10 cycles) | NC SBE |
| OpenElections | Community-curated precinct data | ~8 states, varies | GitHub |
| Clarity/Scytl | Election night reporting XML | ~1,000+ jurisdictions | Per-jurisdiction URLs (unstable) |
| VEST | Precinct results + geographic boundaries | All 50 states (shapefiles) | Harvard Dataverse |
| Census | FIPS code reference files | National | Census.gov |
| FEC | Federal candidate master files | National | FEC.gov |
Each source has its own chapter documenting the exact schema, download commands, known data quality issues, and how our pipeline handles its quirks.
Who this book is for
If you’re a journalist and you want to answer “what happened in local elections in my area” — start with Questions for Journalists, then go to Getting Started and Recipes. You don’t need to understand the pipeline architecture. You need the data and the queries.
If you’re a researcher and you want a citable, reproducible, documented dataset for studying local election competitiveness, candidate career paths, or democratic participation — start with Questions for Researchers, then read Reproducibility Guide and How to Cite This Data. The dataset is versioned with DOIs. Every entity resolution decision is logged and auditable.
If you’re a government staffer and you need to know what elected offices exist in your jurisdiction, how your state compares to others, or how to benchmark election administration — start with Questions for Government Staffers and Office Inventory Recipe.
If you’re a developer and you want to contribute to the pipeline, add a new data source, or understand the Rust implementation — start with Design Principles, then read The Five-Layer Pipeline and Type System Design. The mdbook is the spec. The Rust types are the implementation.
If you’re evaluating this architecture for your own data pipeline project — the Architecture section describes a pattern (immutable layers, deterministic-first processing, embeddings for retrieval, LLMs for confirmation) that generalizes beyond election data. The Hard Problems section documents real entity resolution challenges with real data and real solutions.
How to read this book
The book is organized in the order you’d have questions:
- Part I: The Problem — Why local election data is a mess, and what questions we’re trying to answer.
- Part II: Data Sources — Where the data comes from, exactly what’s in it, and how to download it yourself.
- Part III: The Hard Problems — Name normalization, office classification, entity resolution, and cross-source reconciliation. Real examples from real data. This is the heart of the book.
- Part IV: Architecture — The five-layer pipeline, the hash chain, the embedding strategy, the LLM integration. How the system is designed and why.
- Part V: Unified Schema — The exact record format, field by field. What each field means, where it comes from, and which layer populates it.
- Part VI: Rust Implementation — The type system, the traits, the module structure. How the architecture becomes code.
- Part VII: Using the Data — Download instructions, pipeline execution, and ten ready-to-use recipes with copy-paste queries.
- Part VIII: Trust and Reproducibility — How to verify the data, how to cite it, how to report errors, and what the known limitations are.
You don’t have to read it in order. Every chapter is self-contained with cross-references to related sections. But if you read Part I and Part III, you’ll understand why this project exists and what makes it hard. Everything else follows from that.
This project is open source under MIT/Apache-2.0. The data it processes is published by its respective sources under their own licenses (generally CC-BY or public domain). We do not store or redistribute election data.
Why This Is Hard
US local election results are published by approximately 3,143 county-level election offices, 50 state election boards, and an unknown number of municipal clerks. There is no common schema, no shared identifier system, and no central repository. This chapter describes the five structural problems that make unification difficult.
Fragmented administration
US elections are administered at the county level. Each county decides independently how to collect, tabulate, and publish results. Some publish precinct-level CSV files. Others post scanned PDFs. Some use Clarity/Scytl election night reporting platforms that expose structured XML. Others put results on pages that require JavaScript rendering.
There is no federal mandate requiring any particular publication format. The result is 3,143 independent data silos with different schemas, different update schedules, different URL structures, and different retention policies.
When we downloaded precinct-level data for all 50 states from the MIT Election Data + Science Lab (MEDSL), we received 51 separate files containing 12.3 million rows. The files use different column encodings, different candidate name conventions, and different definitions of what constitutes a “local” race. Seven states — California, Iowa, Kansas, New Jersey, Pennsylvania, Tennessee, and Wisconsin — had zero local race records in the MEDSL 2022 dataset.
No standard schema
The same vote record looks different in every source. Here is a single result — Shannon W. Bray in a 2022 North Carolina precinct — as represented by three different sources:
MEDSL (25-column CSV, one row per vote mode):
precinct,office,party_simplified,mode,votes,candidate,...
12-13,US SENATE,LIBERTARIAN,ELECTION DAY,47,SHANNON W BRAY,...
NC SBE (15-column TSV, vote modes as columns):
County Precinct Contest Name Choice Election Day One Stop Absentee by Mail Provisional Total Votes
CABARRUS 12-13 US SENATE Shannon W. Bray 47 38 5 0 90
OpenElections (7-column CSV, totals only):
county,precinct,office,party,candidate,votes
Cabarrus,12-13,U.S. Senate,LIB,Shannon W. Bray,90
Three sources. Three schemas. Two representations of the candidate name (SHANNON W BRAY in MEDSL; Shannon W. Bray in NC SBE and OpenElections). Three levels of granularity for vote mode data. Two office name formats (US SENATE in MEDSL and NC SBE; U.S. Senate in OpenElections). This is a federal race where all three sources agree on the totals. For local races, the divergence is worse.
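To make the schema gap concrete, here is a rough sketch that reads the three sample rows above into one common shape. The target layout (`precinct`, `modes`, `total`) and the function names are hypothetical, not the project's unified schema, and the header rows are trimmed to the columns shown in the excerpts.

```python
import csv
import io

MEDSL = ("precinct,office,party_simplified,mode,votes,candidate\n"
         "12-13,US SENATE,LIBERTARIAN,ELECTION DAY,47,SHANNON W BRAY\n")
NCSBE = ("County\tPrecinct\tContest Name\tChoice\tElection Day\tOne Stop\t"
         "Absentee by Mail\tProvisional\tTotal Votes\n"
         "CABARRUS\t12-13\tUS SENATE\tShannon W. Bray\t47\t38\t5\t0\t90\n")
OPENEL = ("county,precinct,office,party,candidate,votes\n"
          "Cabarrus,12-13,U.S. Senate,LIB,Shannon W. Bray,90\n")

MODES = ["Election Day", "One Stop", "Absentee by Mail", "Provisional"]

def read(text, delimiter=","):
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

def from_medsl(row):
    # One row per vote mode: the total is unknown until all modes are summed.
    return {"precinct": row["precinct"],
            "modes": {row["mode"].title(): int(row["votes"])},
            "total": None}

def from_ncsbe(row):
    # Vote modes arrive as columns; check they add up to the reported total.
    modes = {m: int(row[m]) for m in MODES}
    total = int(row["Total Votes"])
    assert sum(modes.values()) == total
    return {"precinct": row["Precinct"], "modes": modes, "total": total}

def from_openelections(row):
    # Totals only: no vote-mode breakdown is available.
    return {"precinct": row["precinct"], "modes": {}, "total": int(row["votes"])}

m = from_medsl(read(MEDSL)[0])
n = from_ncsbe(read(NCSBE, "\t")[0])
o = from_openelections(read(OPENEL)[0])
print(n["total"] == o["total"],
      m["modes"]["Election Day"] == n["modes"]["Election Day"])  # True True
```

Each source needs its own adapter, and each adapter leaves different fields empty. That asymmetry is why a unified record has to tolerate missing vote-mode detail.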
Name formatting differs across every source
We compared MEDSL and NC SBE data for 640 contests in the 2022 North Carolina general election where both sources reported the same vote totals. In 401 of those contests (63%), candidate names are formatted differently between the two sources.
The differences are systematic:
| Pattern | MEDSL | NC SBE |
|---|---|---|
| Case | SHANNON W BRAY | Shannon W. Bray |
| Middle initial punctuation | VICTORIA P PORTER | Victoria P. Porter |
| Nickname quoting | MICHAEL "STEVE" HUBER | Michael (Steve) Huber |
| Suffix formatting | ROBERT VAN FLETCHER JR | Robert Van Fletcher, Jr. |
| Nickname style | LM "MICKEY" SIMMONS | L.M. (Mickey) Simmons |
| Write-in label | WRITEIN | Write-In (Miscellaneous) |
Each source applies a consistent internal convention. MEDSL uses ALL CAPS with no periods or commas and puts nicknames in double quotes. NC SBE uses Title Case with periods and commas and puts nicknames in parentheses. Across sources, the conventions diverge.
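The patterns in the table are regular enough that a small comparison key removes most of them. The following is a sketch only: `comparison_key` and its rules are assumptions for illustration, it does not decompose names into first/middle/last/suffix, and it leaves the write-in label to separate handling.

```python
import re

# A nickname is either "QUOTED" (MEDSL style) or (Parenthesized) (NC SBE style).
NICKNAME = re.compile(r'"([^"]+)"|\(([^)]+)\)')

def comparison_key(raw: str) -> tuple[str, str]:
    """Return (name_key, nickname) with case and punctuation differences removed."""
    match = NICKNAME.search(raw)
    nickname = (match.group(1) or match.group(2)).lower() if match else ""
    without_nick = NICKNAME.sub(" ", raw)
    no_punct = re.sub(r"[.,]", "", without_nick)   # drop periods and commas
    name_key = " ".join(no_punct.lower().split())  # fold case and whitespace
    return name_key, nickname

pairs = [
    ("SHANNON W BRAY", "Shannon W. Bray"),
    ('MICHAEL "STEVE" HUBER', "Michael (Steve) Huber"),
    ("ROBERT VAN FLETCHER JR", "Robert Van Fletcher, Jr."),
    ('LM "MICKEY" SIMMONS', "L.M. (Mickey) Simmons"),
]
print(all(comparison_key(a) == comparison_key(b) for a, b in pairs))  # True
```

The key is meant for comparison only. The raw string and every parsed component would still be kept, since later matching steps depend on signals such as the suffix and the nickname.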
The formatting problem is solvable with normalization rules. The deeper problem is name identity. We tested real candidate pairs with OpenAI’s text-embedding-3-large model (3,072 dimensions):
| Name A | Name B | Cosine similarity | Same person? |
|---|---|---|---|
| Charlie Crist | CRIST, CHARLES JOSEPH | 0.451 | Yes |
| Robert Williams | Robert Williams Jr. | 0.862 | No |
| Nikki Fried | Nicole Fried | 0.642 | Yes |
| Ron DeSantis | DESANTIS, RON | 0.729 | Yes |
“Charlie Crist” and “CRIST, CHARLES JOSEPH” score 0.451 — below any reasonable match threshold — because Charlie and Charles have unrelated vector representations. These are the same person (same state, same office, identical vote count of 3,101,652). Only a language model with knowledge that Charlie is a common nickname for Charles can make the connection.
“Robert Williams” and “Robert Williams Jr.” score 0.862 — above most auto-accept thresholds used in the literature. These are different people. The “Jr.” suffix indicates a generational distinction. A system that auto-accepts at 0.82 would merge a father and son into one entity.
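The four scored pairs above are enough to show why a single cosine cutoff fails in both directions. This sketch compares a naive 0.82 threshold with a routing rule that sends ambiguous pairs to review. The `route` function, the suffix list, and the 0.35 retrieval floor are assumptions for illustration; only the scores and the 0.82 figure come from the text.

```python
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def suffix_of(name: str) -> str:
    """Return the generational suffix in a name, or '' if there is none."""
    tokens = name.lower().replace(".", "").replace(",", " ").split()
    return next((t for t in tokens if t in SUFFIXES), "")

def route(name_a, name_b, cosine, accept_at=0.82, retrieve_at=0.35):
    """Decide what to do with a candidate pair given its embedding similarity."""
    if suffix_of(name_a) != suffix_of(name_b):
        return "llm_review"  # suffix mismatch: never auto-accept
    if cosine >= accept_at:
        return "auto_accept"
    if cosine >= retrieve_at:
        return "llm_review"  # plausible, but the score alone cannot decide
    return "reject"

# (name A, name B, cosine similarity, same person?) from the table above.
PAIRS = [
    ("Charlie Crist", "CRIST, CHARLES JOSEPH", 0.451, True),
    ("Robert Williams", "Robert Williams Jr.", 0.862, False),
    ("Nikki Fried", "Nicole Fried", 0.642, True),
    ("Ron DeSantis", "DESANTIS, RON", 0.729, True),
]

naive_errors = sum((cos >= 0.82) != same for _, _, cos, same in PAIRS)
routes = [route(a, b, cos) for a, b, cos, _ in PAIRS]
print(naive_errors)  # 4: the naive cutoff gets every pair wrong
print(routes)        # all four pairs are routed to 'llm_review'
```

On these examples the embedding score works as a retrieval signal rather than a decision. It narrows down which pairs are worth examining, and something with more context makes the call.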
Institutional variation by state
Across all 50 states in the 2022 MEDSL data, we found 8,387 unique local office names. Our keyword classifier handles 62% of them. The remaining 38% includes offices where the same title has different institutional meanings depending on the state.
“County Judge” in Texas is the presiding officer of the Commissioners Court: the chief executive of the county, analogous to a county manager. In most other states, a county judge presides over a courtroom. Texas has 254 counties; each has a County Judge who is an executive, not a judicial officer.
“Sheriff” in Connecticut is a court officer who serves civil process. In the other 49 states, the sheriff runs the county jail and patrols unincorporated areas.
“Board of Education” is an elected body in some states and an appointed body in others. Where it is appointed, it does not appear in election data — its absence from a source does not mean the county lacks a school board.
A static lookup table mapping office names to categories does not work. The classification must account for state-level context, which is why the pipeline uses a four-tier classifier: keyword matching for unambiguous names, regex patterns for structured names, embedding similarity against a reference set, and a language model for genuinely ambiguous cases.
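A simplified sketch of the first two tiers, with state context checked before the generic rules. The category names, the tiny rule tables, and the explicit state-override step are assumptions for illustration; the embedding and LLM tiers are represented only by the fall-through result.

```python
import re

# Hypothetical rule tables; a real classifier would carry far more entries.
STATE_OVERRIDES = {("TX", "county judge"): "county_executive"}
KEYWORDS = {
    "sheriff": "law_enforcement",
    "county judge": "judicial",
    "board of education": "school_board",
}
PATTERNS = [
    (re.compile(r"\bcity council\b.*\b(ward|district)\s+\d+"), "city_council"),
]

def classify(office: str, state: str):
    """Return (category, tier), where tier names the step that decided."""
    name = " ".join(office.lower().split())
    for (st, phrase), category in STATE_OVERRIDES.items():
        if st == state and phrase in name:
            return category, "state_override"
    for phrase, category in KEYWORDS.items():
        if phrase in name:
            return category, "keyword"
    for pattern, category in PATTERNS:
        if pattern.search(name):
            return category, "regex"
    return None, "needs_embedding_or_llm"  # tiers 3 and 4 are not sketched here

print(classify("DALLAS COUNTY JUDGE", "TX"))    # ('county_executive', 'state_override')
print(classify("County Judge", "FL"))           # ('judicial', 'keyword')
print(classify("City Council Ward 3", "OH"))    # ('city_council', 'regex')
print(classify("Mosquito Control Board", "FL")) # (None, 'needs_embedding_or_llm')
```

Returning the deciding tier alongside the category keeps each classification auditable. It also makes it easy to measure how much of the corpus each tier actually handles.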
No persistent candidate identifiers
Timothy Lance won a seat on the Columbus County Schools Board of Education in 2022. No existing dataset can answer whether he ran before, whether he won, or whether the “T. Lance” who appeared on a 2018 ballot is the same person.
MEDSL, NC SBE, and OpenElections each treat every election as an independent snapshot. There is no identifier linking Timothy Lance (2022) to Timothy Lance (2020) to Tim Lance (2018). The candidate name can change between elections — a middle initial added, a nickname used, a suffix dropped after a parent’s death. The office can change if the candidate runs for a different seat. The county can change if the candidate relocates.
In 10 years of NC SBE data (2014–2024), we found 702 candidates appearing in three or more election cycles using exact name matching within the same county. George Dunlap appeared on the Mecklenburg County ballot in six consecutive cycles. Paul Beaumont in Currituck County ran for the Board of Commissioners, then the Board of Education, then back to Commissioners.
Connecting these records — determining that entries in different elections, from different sources, with different formatting, refer to the same person — requires preserving every name component through the cleaning pipeline, embedding candidates for vector retrieval, and confirming ambiguous matches with a language model that reasons about context (office, county, party, vote totals). This process is called entity resolution, and it is detailed in its own chapter.
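The exact-name baseline used for the 702-candidate count can be sketched as a group-by on county and name. The record fields and the `repeat_candidates` helper are hypothetical. The Dunlap rows follow the six cycles cited above, and the "T. Lance" row is the hypothetical variant from the text, not a real record.

```python
from collections import defaultdict

def repeat_candidates(records, min_cycles=3):
    """Find (county, name) pairs that appear in at least min_cycles elections."""
    years = defaultdict(set)
    for r in records:
        years[(r["county"].upper(), r["name"].upper())].add(r["year"])
    return {k: sorted(v) for k, v in years.items() if len(v) >= min_cycles}

# Six consecutive even-year cycles, 2014 through 2024.
records = [
    {"county": "Mecklenburg", "name": "George Dunlap", "year": y}
    for y in range(2014, 2025, 2)
]
records += [
    {"county": "Columbus", "name": "Timothy Lance", "year": 2022},
    {"county": "Columbus", "name": "T. Lance", "year": 2018},  # hypothetical variant
]

print(repeat_candidates(records))
# {('MECKLENBURG', 'GEORGE DUNLAP'): [2014, 2016, 2018, 2020, 2022, 2024]}
```

The two Lance rows never group together, which is the limitation of exact matching: any count produced this way is a lower bound until name variants are resolved.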
What this adds up to
In 2022, across the MEDSL data for all 50 states, 48.8% of classified local races had only one candidate. In Minnesota, the uncontested rate was 89.3%. Nineteen local races ended in exact ties. Forty-three were decided by a single vote. These are basic facts about American democracy that require combining data from multiple sources, resolving thousands of name variations, classifying thousands of office types, and linking candidates across elections.
That is what this project does. The rest of this book describes how.
What Questions Should Be Answerable?
The purpose of this project is to make US local election data queryable. Across 42 million rows, 50 states, and 8,387 distinct office names, basic questions remain difficult to answer. This chapter frames those questions by audience.
Four audiences, different needs
- Journalists need specific, verifiable facts — closest races, unopposed incumbents, anomalies worth investigating. See For Journalists.
- Researchers need structured, reproducible datasets — uncontested rates by office type, candidate career paths, cross-state comparisons. See For Researchers.
- Government staffers need operational inventories — what offices exist in a jurisdiction, how many races appear on a ballot, how local structures compare to peer counties. See For Government Staffers.
- Civic tech developers need reliable data interchange — OCD-ID mappings, entity-resolved candidate records, JSONL exports for downstream applications. See For Civic Tech Developers.
What the data already tells us
Even partial analysis of available sources reveals findings that are difficult to obtain elsewhere:
- 48.8% of local races in available data are uncontested — one candidate, no opponent.
- 19 exact ties have been identified across the dataset (same vote total, different candidates).
- 8,387 unique office name strings exist before normalization, many referring to the same underlying office.
These numbers are not estimates. They come from deterministic queries against cleaned, source-attributed JSONL records. The methodology for each finding is documented in the recipe chapters.
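As a rough illustration of what such a deterministic query looks like, the sketch below computes an uncontested rate, exact ties, and one-vote margins. The field names `contest_id` and `votes` are assumptions, the records are invented toy data, and the logic assumes single-seat contests with one already-aggregated row per candidate.

```python
from collections import defaultdict

def contest_stats(records):
    """Compute uncontested rate, exact ties, and one-vote margins."""
    votes_by_contest = defaultdict(list)
    for r in records:
        votes_by_contest[r["contest_id"]].append(r["votes"])

    uncontested = ties = one_vote = 0
    for votes in votes_by_contest.values():
        if len(votes) == 1:
            uncontested += 1
            continue
        top, runner_up = sorted(votes, reverse=True)[:2]
        if top == runner_up:
            ties += 1
        elif top - runner_up == 1:
            one_vote += 1
    return {
        "uncontested_rate": uncontested / len(votes_by_contest),
        "exact_ties": ties,
        "one_vote_margins": one_vote,
    }

# Toy records, invented for illustration only.
records = [
    {"contest_id": "A", "votes": 100},
    {"contest_id": "B", "votes": 50}, {"contest_id": "B", "votes": 50},
    {"contest_id": "C", "votes": 120}, {"contest_id": "C", "votes": 119},
    {"contest_id": "D", "votes": 300}, {"contest_id": "D", "votes": 200},
]
print(contest_stats(records))
# {'uncontested_rate': 0.25, 'exact_ties': 1, 'one_vote_margins': 1}
```

Multi-seat contests would need the margin taken between the last winner and the first loser, so a production query has to know the number of seats per contest.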
Why these questions matter
No single existing source answers all of these questions. The existing landscape chapter surveys what is available today and where each source falls short. This project exists to fill the gaps — not by replacing those sources, but by unifying them through a documented, reproducible pipeline.
For Journalists
Local election data is where accountability stories live — and where data is hardest to find. These are the questions journalists ask, with real answers drawn from the dataset.
Closest races
- Who won the closest race in America? Dawson County, GA had a tied contest at 25,186 total votes cast — decided by recount procedures, not by a single voter’s margin.
- How many exact ties exist? 19 exact ties have been identified across available data. Each is flagged with the specific contest, county, and vote totals. See Closest Races in America.
- Which school board races were decided by single digits? Madison County, IN had a school board race decided by 1 vote. These contests are queryable by margin across all office types.
Unopposed races
- How many sheriffs ran unopposed? In North Carolina, 55% of sheriff races were uncontested. In Maine, 77%. National figures depend on source coverage — seven states lack local data entirely.
- What’s the overall uncontested rate? 48.8% of local races in available data have a single candidate. This figure spans all office types and all states with coverage.
- Which offices are most likely to be uncontested? Constable races are uncontested 72% of the time. City council races: 10%. The rate varies by office type and state. See Uncontested Race Rate by State.
Accountability angles
- Who keeps winning without opposition? Candidate entity resolution across election cycles identifies incumbents who have never faced an opponent. See Career Tracking Across Elections.
- Which counties have the most uncontested offices? County-level aggregation is possible wherever FIPS codes are present in the source data. See Sheriff Accountability.
- Are there races where write-in candidates are the only opposition? Write-in totals are preserved where the source reports them. In some jurisdictions, write-in votes account for the only opposition in over a third of contests.
Verification
- Can I verify a specific result? Every record traces back to a named source (e.g., NC SBE certified results, MEDSL). The pipeline preserves source file hashes and original field values. See Verify a Specific Result.
- How do I cite this data? You cite the original source, not this project. The project provides the source name, retrieval date, and confidence level for each record. See Confidence Levels.
What you cannot get here (yet)
- Turnout data is present in fewer than 5% of records.
- Seven states (CA, IA, KS, NJ, PA, TN, WI) have zero local coverage in MEDSL 2022.
- Odd-year elections (2015, 2017, 2019, 2021) are underrepresented.
These gaps are documented in Known Limitations. If you are reporting on a state with limited coverage, check the Coverage Matrix first.
For Researchers
Local election data presents structural challenges for quantitative research: inconsistent office names, no universal candidate identifiers, and source-dependent coverage gaps. These are the questions researchers ask, with real answers and methodology notes.
Competitiveness and contestation
- What’s the uncontested rate by office type? Constable: 72%. Soil and water conservation district: 58%. County commissioner: 34%. City council: 10%. These rates are computed from L4 canonical records where `candidate_count = 1` for a given contest.
- How does competitiveness vary across states? Minnesota reports 89.3% of local races as uncontested. Florida reports 0% in available MEDSL local data, which is a coverage artifact, not a political finding. Interpret cross-state comparisons with the Coverage Matrix.
- What is the national uncontested rate? 48.8% across all available local races. This figure is coverage-weighted: states with more reported contests contribute proportionally more. It is not a population-weighted estimate.
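The difference between a coverage-weighted (pooled) rate and an average of state rates is easy to see with two made-up states. The counts below are hypothetical and chosen only to make the arithmetic visible.

```python
def pooled_rate(states):
    """Coverage-weighted: every contest counts once, wherever it was reported."""
    total_uncontested = sum(u for u, _ in states.values())
    total_contests = sum(n for _, n in states.values())
    return total_uncontested / total_contests

def mean_of_state_rates(states):
    """Unweighted: every state counts once, however many contests it reported."""
    return sum(u / n for u, n in states.values()) / len(states)

# Hypothetical (uncontested, total contests) per state.
states = {"State A": (900, 1000), "State B": (10, 100)}

print(round(pooled_rate(states), 3))          # 0.827
print(round(mean_of_state_rates(states), 3))  # 0.5
```

A state that reports many contests dominates the pooled figure, which is why the national rate should be read together with the coverage of each contributing state.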
Candidate career tracking
- Can I track candidates across election cycles? Entity resolution at L3 links candidate records across years and sources. George Dunlap (Mecklenburg County, NC) appears in 6 election cycles under consistent entity IDs. See Career Tracking Across Elections.
- What identifier links candidates across sources? The L4 `canonical_candidate_id` is a deterministic hash of resolved name components, jurisdiction, and office. It is stable across pipeline runs given the same L3 decisions.
- How reliable is cross-cycle linking? Exact name matches are deterministic. Fuzzy matches (Jaro-Winkler ≥ 0.92) and embedding matches (cosine ≥ 0.88) are logged with scores. LLM-assisted matches include the decision ID. All match metadata is queryable.
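One way a deterministic identifier of this kind could be derived is sketched below. The field order, the `|` separator, SHA-256, and the 16-character truncation are all assumptions; the text only states that the ID is a deterministic hash of resolved name components, jurisdiction, and office.

```python
import hashlib

def sketch_candidate_id(first, middle, last, suffix, jurisdiction, office):
    """Hash normalized components so the same inputs always yield the same ID."""
    parts = (first, middle, last, suffix, jurisdiction, office)
    normalized = "|".join(" ".join(p.lower().split()) for p in parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

a = sketch_candidate_id("George", "", "Dunlap", "",
                        "Mecklenburg County, NC", "County Commissioner")
b = sketch_candidate_id("GEORGE", "", "DUNLAP", "",
                        "mecklenburg county, nc", "COUNTY COMMISSIONER")
c = sketch_candidate_id("George", "", "Dunlap", "Jr",  # hypothetical suffix
                        "Mecklenburg County, NC", "County Commissioner")

print(a == b)  # True: case differences do not change the ID
print(a == c)  # False: a suffix is a distinct name component
```

Keeping the suffix as its own hashed component means a "Jr." and a "Sr." can never collapse into one identifier by accident.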
Cross-source validation
- How consistent are sources that cover the same contests? In 640 overlapping contests between MEDSL and NC SBE, 90.5% have identical vote totals. The remaining 9.5% differ by small amounts, typically due to provisional ballot timing or reporting cutoff dates.
- How do candidate names differ across sources? In the same 640 overlapping contests, 63% have name formatting differences (e.g., “SMITH, JOHN A” vs. “John A. Smith”). These are resolved at L1 (parsing) and confirmed at L3 (entity resolution).
Office taxonomy
- How many distinct offices exist? 8,387 unique office name strings before normalization. After four-tier classification (keyword, regex, embedding, LLM), these resolve to a smaller set of canonical office types. See Office Classification Reference.
- What office types exist at the sub-county level? Constable, justice of the peace, soil and water conservation district supervisor, school board trustee, municipal utility district director, and hundreds of jurisdiction-specific titles.
Reproducibility
All findings above are reproducible from the pipeline output:
- L0 → L2 layers are fully deterministic. Given the same source files, pipeline version, and embedding model, output is byte-identical.
- L3 decisions are logged in a decision log (JSONL). Replaying the log against the same L2 input reproduces L3 exactly, even when LLM calls were involved.
- L4 is deterministic given L3 output.
- Versioned JSONL files at every layer serve as the unit of reproducibility. Each file includes a manifest with source hashes, pipeline version, and timestamp.
To reproduce a specific finding, check out the tagged pipeline version, supply the same L0 inputs, and run the pipeline. The decision log ensures that even probabilistic steps (embedding similarity, LLM confirmation) produce identical output on replay.
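The replay idea can be sketched as a resolver that consults the decision log before making any live call. The log layout, the `resolve_pairs` function, and the stand-in `fake_llm` are assumptions for illustration; the 0.95 confidence is the figure quoted earlier for the Crist pair.

```python
import json

def resolve_pairs(pairs, log_lines, ask_llm=None):
    """Return one decision per pair, preferring logged decisions over live calls."""
    logged = {}
    for line in log_lines:
        d = json.loads(line)
        logged[(d["a"], d["b"])] = d
    decisions = []
    for a, b in pairs:
        if (a, b) not in logged:
            if ask_llm is None:
                raise KeyError(f"no logged decision for {a!r} / {b!r}")
            logged[(a, b)] = ask_llm(a, b)
            # Append as one JSONL line so the decision can be replayed later.
            log_lines.append(json.dumps(logged[(a, b)], sort_keys=True))
        decisions.append(logged[(a, b)])
    return decisions

def fake_llm(a, b):
    # Stand-in for a real model call; returns the outcome quoted in the text.
    return {"a": a, "b": b, "same_person": True, "confidence": 0.95}

pairs = [("Charlie Crist", "CRIST, CHARLES JOSEPH")]
log = []
first_run = resolve_pairs(pairs, log, ask_llm=fake_llm)
replay = resolve_pairs(pairs, log)  # no model available on replay
print(first_run == replay, len(log))  # True 1
```

On replay the model is never consulted, so a nondeterministic model cannot change the output. A missing log entry fails loudly rather than silently triggering a fresh call.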
Data format for analysis
Pipeline output is JSONL — one JSON object per line. This is directly loadable into pandas, R (jsonlite), DuckDB, or any tool that reads newline-delimited JSON. No proprietary formats or database dependencies are required.
For Government Staffers
County clerks, election administrators, and local officials need operational data — not research datasets. These are the questions government staff ask, with answers drawn from the dataset.
Office inventories
- What offices exist in my county? Columbus County, NC has 25 distinct elected offices across county government, municipalities, school boards, and special districts. The pipeline produces per-county office inventories from L4 canonical records. See Office Inventory for a County.
- How many races will be on the next ballot? Historical office inventories establish the set of offices that typically appear in a given election cycle. Odd-year vs. even-year patterns, staggered terms, and special elections are identifiable where source data includes election dates and term lengths.
- Which offices are partisan vs. nonpartisan? Party affiliation is recorded where the source provides it. In North Carolina, all county commissioner races are partisan; all school board races are nonpartisan. Coverage varies by state.
Comparisons
- How does our uncontested rate compare to peer counties? County-level uncontested rates are computable for any jurisdiction with coverage. A county clerk can compare their 60% uncontested rate against the state median or against demographically similar counties. See Uncontested Race Rate by State.
- Are other counties consolidating offices we still elect separately? Office inventories across counties within a state reveal structural differences — some counties elect a coroner, others appoint one. The data does not explain why, but it shows where differences exist.
- How many candidates typically file for each office? Candidate counts per contest are derivable from L4 records. A county with historically 1.2 candidates per school board seat has a different recruitment problem than one averaging 3.4.
Administrative planning
- What does our ballot complexity look like over time? The number of contests per jurisdiction per cycle is queryable. Ballot length affects printing costs, voter fatigue research, and polling place logistics.
- Which districts overlap our jurisdiction? Where OCD-IDs are present, hierarchical district relationships can be inferred. A county contains municipalities, school districts, and special districts — the data reflects which contests appear in which jurisdictions.
Data format
All outputs are JSONL with one record per contest-candidate pair. Government staff who need spreadsheets can convert JSONL to CSV with standard tools. See Querying JSONL Output.
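A minimal conversion using only the Python standard library might look like the following. The column names are assumptions, since the real field names are defined in the schema chapters, and the sample record reuses the Cabarrus precinct example from earlier in the book.

```python
import csv
import io
import json

def jsonl_to_csv(jsonl_text: str, columns: list[str]) -> str:
    """Convert JSONL text to CSV, keeping only the requested columns."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if line.strip():  # skip blank lines
            writer.writerow(json.loads(line))
    return out.getvalue()

sample = json.dumps({
    "county": "Cabarrus", "office": "US SENATE",
    "candidate": "Shannon W. Bray", "votes": 90, "source": "ncsbe",
}) + "\n"

print(jsonl_to_csv(sample, ["county", "office", "candidate", "votes"]))
# county,office,candidate,votes
# Cabarrus,US SENATE,Shannon W. Bray,90
```

Nested fields would need to be flattened before writing. This sketch simply ignores keys that are not in the column list.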
Caveats
- Office inventories are only as complete as the source data. If a state does not report local results to MEDSL or another covered source, those offices will not appear.
- The pipeline documents sources and provides tools — it does not store or redistribute official election results. See The Project Does Not Store Data.
- Seven states have zero local coverage in MEDSL 2022. Check the Coverage Matrix before relying on completeness for a specific jurisdiction.
For Civic Tech Developers
Civic technology projects depend on structured, reliable election data. Most fail not because of engineering limitations but because the underlying data is fragmented, inconsistently formatted, and difficult to resolve across sources. These are the questions developers ask when building on local election data.
Ballot lookup tools
- Can I build a “what’s on my ballot” tool? Yes, but it requires mapping voter addresses to jurisdictions (via OCD-IDs or FIPS codes) and then mapping jurisdictions to offices. The dataset contains 8,387 unique office name strings — many of which refer to the same office across sources. The L4 canonical layer resolves these to deduplicated office records with jurisdiction identifiers.
- How do I map an address to its contests? You need an OCD-ID → office mapping. OCD-IDs (Open Civic Data Identifiers) are present where source data includes them or where FIPS codes allow deterministic derivation. Coverage is not universal. See the Schema Overview for the `jurisdiction.ocd_id` field.
- What format is the data in? Every pipeline layer outputs JSONL (newline-delimited JSON). One record per line, one file per source-year-state. No database is required: parse with `jq`, Python, DuckDB, or any JSON-capable tool.
Candidate lookup and entity resolution
- Can I build a candidate lookup API? The L4 layer provides entity-resolved candidate records with canonical names, office history, and source attribution. A candidate who appears as “Bill Smith” in one source and “William R. Smith Jr.” in another is resolved to a single entity with both name variants preserved.
- How reliable is entity resolution? It depends on the match method. Exact matches and high-confidence Jaro-Winkler matches (≥0.92) are deterministic. Embedding-based and LLM-confirmed matches carry a decision ID that traces back to the specific match rationale. See The Cascade.
- Can I track candidates across election cycles? Yes. Entity resolution operates across years. George Dunlap in Mecklenburg County, NC appears across 6 election cycles with consistent entity IDs. See Career Tracking.
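To make the two deterministic tiers concrete, here is a minimal sketch: an exact comparison after light normalization, then a Jaro-Winkler score checked against the 0.92 threshold. The normalization rules and the function names are assumptions for illustration, and the pipeline's own implementation may differ. Returning `None` stands in for "escalate to the embedding or LLM tier".

```python
import re


def normalize_name(name):
    """Uppercase, drop periods and commas, collapse whitespace (assumed rules)."""
    return " ".join(re.sub(r"[.,]", "", name.upper()).split())


def jaro(a, b):
    """Standard Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [ch for ch, hit in zip(a, a_hit) if hit]
    b_seq = [ch for ch, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a, b, prefix_scale=0.1):
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def match_tier(name_a, name_b, threshold=0.92):
    """Return "exact", "jaro_winkler", or None when a later tier must decide."""
    a, b = normalize_name(name_a), normalize_name(name_b)
    if a == b:
        return "exact"
    if jaro_winkler(a, b) >= threshold:
        return "jaro_winkler"
    return None
```

Under these assumed rules, `SHANNON W BRAY` and `Shannon W. Bray` resolve at the exact tier, while `Charlie Crist` against `CRIST, CHARLES JOSEPH` falls below the threshold and is passed on, which is consistent with the nickname problem described earlier.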
Election history and widgets
- Can I build an election history widget for a jurisdiction? The data supports historical queries by jurisdiction, office, and candidate. Time series depend on source coverage: MEDSL covers 2018–2022 for most states; NC SBE covers 2006–2024 for North Carolina.
- What about ballot measures? Ballot measures are a distinct contest kind (`BallotMeasure`) in the schema. Choices are normalized to `for`/`against`/`yes`/`no` at L1.
Data interchange
- Why JSONL and not a REST API? JSONL is the data interchange format at every layer. It is self-describing, streamable, and requires no server infrastructure. Downstream applications can ingest it directly or load it into any datastore.
- Can I join this data with other civic datasets? Yes. Records include FIPS codes, OCD-IDs (where available), and state abbreviations. These are standard join keys for Census data, geographic boundaries, and other civic datasets.
- Is the schema stable? The schema is versioned. Each JSONL record includes a `schema_version` field. Breaking changes increment the major version. See Schema Overview.
What to watch out for
- The project does not host a live API or data download. It documents sources and provides pipeline tools to process them. You run the pipeline yourself.
- Coverage gaps exist. Seven states lack local data in MEDSL 2022. Odd-year elections are underrepresented. Check the Coverage Matrix before building features that assume national coverage.
- Entity resolution is probabilistic for non-exact matches. If your application requires certainty, filter to records with `match_method: "exact"` or `match_method: "jaro_winkler"`.
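The filter in the last point can be sketched as a small generator over JSONL lines. It assumes `match_method` is a top-level key on each record; if the schema nests it, adjust the lookup. The `"embedding"` value in the usage note is a hypothetical example of a non-deterministic method.

```python
import json

# The two methods the text above describes as deterministic.
DETERMINISTIC_METHODS = {"exact", "jaro_winkler"}


def deterministic_only(jsonl_lines):
    """Yield only records whose match_method is deterministic."""
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("match_method") in DETERMINISTIC_METHODS:
            yield record
```

Passing an open file handle works because the function only iterates over lines. A record such as `{"match_method": "embedding"}` is dropped, as is any record that has no `match_method` at all.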
What Exists Today and Where It Falls Short
Several organizations publish US election data. Each serves a different purpose, covers a different scope, and has different limitations. This chapter surveys the major sources and identifies the gaps that motivate this project.
MEDSL — MIT Election Data + Science Lab
MEDSL provides the most comprehensive freely available collection of US election returns. Their datasets cover federal, state, and many local races across multiple election cycles. Data is published as flat CSV files with consistent column schemas.
Strengths. Wide state coverage for federal and state races. Consistent schema across years. Academic quality control. Openly licensed. Includes candidate-level vote totals with party affiliation.
Weaknesses. Seven states have zero local election coverage in the 2022 dataset: CA, IA, KS, NJ, PA, TN, and WI. Office name strings are not normalized — the same office appears under different names across states and years. No entity resolution across cycles (the same candidate is a new row each time). Turnout metadata is sparse. Release cadence lags elections by 12–18 months.
ALGED — Annual Local Government Election Dataset
ALGED focuses specifically on local elections in US cities, filling a gap that most other sources ignore. It covers mayoral, city council, and some school board races.
Strengths. Dedicated local focus. Includes candidate demographics and incumbency status where available. Covers elections that no other academic dataset tracks.
Weaknesses. Limited to cities with populations above 50,000. Data collection appears to have stopped around 2021. Does not cover counties, townships, or special districts. Not currently integrated into this pipeline (planned for future work).
OpenElections
OpenElections is a community-curated effort to collect certified election results for all 50 states. Volunteers parse state-level result files into a common CSV format and publish them on GitHub.
Strengths. State-level certified results for many states. Community-driven, so coverage expands over time. Raw source files are preserved alongside parsed output. Free and open.
Weaknesses. Coverage varies dramatically by state — some states have complete precinct-level data back to 2000, others have nothing below the county level. Schema consistency depends on the volunteer. Local races are included when the state publishes them, but there is no systematic local collection effort. Quality varies; some state files have known parsing errors that persist across releases.
Ballotpedia
Ballotpedia maintains a wiki-style encyclopedia of US elections covering federal, state, and many local offices. Their coverage of school boards, judicial elections, and ballot measures is broader than most sources.
Strengths. Broad office-type coverage including judicial, school board, and special district races. Candidate biographical information. Historical coverage for some offices. Structured data behind the wiki pages.
Weaknesses. Bulk data access requires a commercial API license. No freely available flat-file download. Data is editorial (curated by staff, not derived from certified results). Not suitable as a primary source for vote totals, though useful for office inventories and candidate metadata.
Associated Press (AP)
The AP provides real-time and certified election results to media organizations. Their data covers federal, state, and many local races on election night and through the canvassing period.
Strengths. Fast — results are available on election night. Broad geographic coverage. Includes local races in many states. High reliability for the races they cover.
Weaknesses. Expensive commercial license. Not available for academic or civic tech use without a contract. Historical data is not publicly archived. Coverage decisions are editorial — not all local races are included.
Other sources
- State election board websites (e.g., NC SBE) publish certified results, but formats vary by state — PDF, Excel, CSV, HTML, or proprietary portals. No two states use the same schema.
- Clarity/Scytl election night reporting portals are used by many counties. Data is structured but ephemeral — pages are taken down or overwritten after certification.
- VEST (Voting and Election Science Team) provides precinct-level shapefiles matched to election returns, primarily for redistricting research. Coverage is strong for federal races but limited at the local level.
- FEC publishes federal candidate filings and financial data. No state or local coverage.
- Census Bureau provides FIPS codes and geographic hierarchies, which are essential for joining across sources but contain no election results.
Summary
| Source | Local coverage | Schema consistency | Freely available | Current | Entity resolution |
|---|---|---|---|---|---|
| MEDSL | 43 of 50 states (2022) | High | Yes | Yes (with lag) | No |
| ALGED | Cities >50K only | Medium | Yes | No (~2021) | No |
| OpenElections | Varies by state | Low | Yes | Yes | No |
| Ballotpedia | Broad | Medium | API only | Yes | Partial |
| AP | Broad | High | No (commercial) | Yes | No |
| State portals | Varies | None (50 formats) | Usually | Yes | No |
No single source covers all local races, uses a consistent schema, resolves candidates across elections, and is freely available. That gap — between what exists and what the four audiences need — is what this project addresses.
Source Overview
This project ingests election data from seven sources. None are complete on their own. Each fills a different gap — geographic breadth, temporal depth, local race coverage, geographic boundaries, or reference identifiers. The pipeline merges them into a unified schema; this chapter documents what each provides and where they overlap.
Source Summary
| Source | What It Provides | Coverage | Format | Access Method |
|---|---|---|---|---|
| MEDSL | Precinct-level returns for federal, state, and some local races | 50 states + DC; 2018, 2020, 2022 (~36.5M rows) | CSV/TSV, one file per state per cycle | Harvard Dataverse download, GitHub mirror |
| NC SBE | Precinct-level returns for every contest on the ballot, with vote mode breakdowns | NC only; 2006–2024 (10 cycles, ~2M rows) | Tab-delimited TXT in ZIP archives | S3 bucket direct download |
| OpenElections | Community-curated precinct-level CSV files | ~8 states with 2022 data (FL, GA, MI, OH, PA, TX, others); coverage varies | CSV, schema varies by state | Git clone per state repo on GitHub |
| Clarity/Scytl | Election night reporting with precinct-level XML results | ~1,000+ jurisdictions nationwide | Structured XML in ZIP files | Per-jurisdiction URLs (unstable across cycles) |
| VEST | Precinct boundaries (shapefiles) with vote counts as attributes | 50 states; odd-year elections for KY/LA/MS/VA (2015, 2019) | Shapefile (.shp/.dbf/.shx/.prj) | Harvard Dataverse download |
| Census | FIPS reference codes for states (50+DC), counties (3,143), and places (31,980) | National, 2020 vintage | Pipe-delimited text files | census.gov direct download |
| FEC | Federal candidate master records with stable CAND_ID identifiers | All registered federal candidates; 2020 and 2022 loaded | Pipe-delimited TXT (cn.txt) in ZIP | fec.gov bulk download |
What Each Source Contributes to the Pipeline
MEDSL is the backbone. It covers all 50 states at precinct granularity for three recent even-year cycles. Approximately 41.5% of rows in the 2022 dataset have a blank dataverse column, indicating local races. Seven states have zero local race rows — see Coverage Matrix.
NC SBE provides the deepest single-state coverage: every contest on every ballot in every precinct across 10 election cycles. It is the only source that provides vote mode breakdowns (Election Day, early, absentee, provisional) for local races. It serves as the primary validation dataset for cross-source entity resolution.
OpenElections fills state-level gaps where MEDSL coverage is incomplete or where an alternative source view aids cross-validation. Schema varies by state, requiring per-state parser logic.
Clarity has the highest value for hyperlocal races (school board, city council, judicial) because it captures results directly from county ENR systems. Not yet integrated in our pipeline. URL instability is the primary obstacle.
VEST provides the only precinct boundary geometries in the corpus, enabling geographic analysis. It also covers odd-year elections (2015, 2019) for states with off-cycle gubernatorial races — data that MEDSL’s loaded cycles do not include.
Census provides the authoritative FIPS code-to-name mappings used at L1 for geographic enrichment and cross-source geographic joins.
FEC provides stable candidate identifiers (CAND_ID) for federal candidates, used at L3 as reference anchors during entity resolution.
Cross-Source Overlap
Two source pairs have been compared quantitatively.
MEDSL + NC SBE (North Carolina, 2022 General)
Both sources report precinct-level results for the same 640 contests in North Carolina’s 2022 general election. Comparison results:
| Metric | Value |
|---|---|
| Contests with exact vote total match | 579 (90.5%) |
| Contests matching within 1% | 47 (7.3%) |
| Contests disagreeing by >1% | 14 (2.2%) |
| Contests with different candidate name formatting | 401 (63%) |
The 63% name formatting difference rate is the reason entity resolution exists. MEDSL reports SHANNON W BRAY (all caps, no period). NC SBE reports Shannon W. Bray (title case, period after initial). Same person, different string. This overlap is the primary test bed for the matching pipeline — see Cross-Source Reconciliation.
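The three vote-total buckets in the table can be reproduced with a small classifier. This sketch assumes "within 1%" means the absolute difference divided by the larger of the two totals; the exact formula is not spelled out here, so treat that as an assumption. The bucket labels are also illustrative.

```python
from collections import Counter


def compare_totals(votes_a, votes_b, tolerance=0.01):
    """Classify two non-negative contest totals as exact, within_1pct, or disagree."""
    if votes_a == votes_b:
        return "exact"
    larger = max(votes_a, votes_b)  # > 0 here because the totals differ
    if abs(votes_a - votes_b) / larger <= tolerance:
        return "within_1pct"
    return "disagree"


def summarize(pairs):
    """Count buckets over an iterable of (source_a_total, source_b_total) pairs."""
    return Counter(compare_totals(a, b) for a, b in pairs)
```

Running `summarize` over per-contest totals from two sources yields counts that can be set against the 579 / 47 / 14 split reported above.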
MEDSL + OpenElections (Florida, 2022 General)
Florida OpenElections data contains 6,013 “Registered Voters” rows (67.9% of non-candidate records), which are turnout metadata rows mixed into the results file. This overlap revealed the non-candidate row problem documented in Non-Candidate Records.
Source Priority Ranking
When multiple sources report results for the same contest, the pipeline applies a priority order to select the authoritative record:
| Priority | Source Type | Rationale | Examples |
|---|---|---|---|
| 1 | Certified state data | Published by the official election authority; legally authoritative | NC SBE |
| 2 | Academic curated | Cleaned and standardized by researchers with documented methodology | MEDSL, VEST |
| 3 | Community curated | Volunteer-driven; quality varies by state and contributor | OpenElections |
| 4 | Election night reporting | Often preliminary, not certified; URLs are unstable | Clarity |
| 5 | Reference only | Not election results; used for enrichment and cross-referencing | Census, FEC |
Priority 1 sources are preferred when available. In practice, NC SBE is the only certified state source currently loaded. For the remaining 49 states, MEDSL (priority 2) is the primary source. Lower-priority sources are retained in the record’s provenance for cross-validation, not discarded.
The priority ranking affects two pipeline decisions: which record becomes the canonical version at L4, and which confidence level is assigned. A record confirmed by two independent sources (e.g., MEDSL + NC SBE with matching vote totals) receives High confidence. A record from a single source receives Medium or Low depending on the source tier.
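A minimal sketch of those two decisions follows. The source keys, the `SourceRecord` type, and the rule that single-source records from tiers 1 and 2 get Medium while lower tiers get Low are all assumptions made for illustration; the text above only says the level depends on the source tier. Census and FEC are omitted because they are reference data, not results.

```python
from dataclasses import dataclass

# Priority tiers from the table above (lower number wins).
SOURCE_PRIORITY = {
    "nc_sbe": 1,
    "medsl": 2,
    "vest": 2,
    "openelections": 3,
    "clarity": 4,
}


@dataclass
class SourceRecord:
    source: str
    votes_total: int


def select_canonical(records):
    """Return (canonical record, confidence, other records kept as provenance)."""
    ranked = sorted(records, key=lambda r: SOURCE_PRIORITY[r.source])
    canonical = ranked[0]
    agreeing = {r.source for r in records if r.votes_total == canonical.votes_total}
    if len(agreeing) >= 2:
        confidence = "High"    # two independent sources agree on the total
    elif SOURCE_PRIORITY[canonical.source] <= 2:
        confidence = "Medium"  # assumed cutoff for single-source records
    else:
        confidence = "Low"
    return canonical, confidence, ranked[1:]
```

The lower-priority records are returned rather than dropped, mirroring the point that they stay in provenance for cross-validation.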
Coverage Matrix
This chapter maps which sources cover which states and years. Use it to determine whether a specific state/year/level combination is available before querying.
MEDSL — 50 States, 3 Cycles
MEDSL provides precinct-level results for all 50 states plus DC across three even-year general election cycles. Each cycle is one CSV per state.
| Cycle | States | Approximate rows | Local race coverage |
|---|---|---|---|
| 2018 | 50 + DC | ~11.0M | Varies by state |
| 2020 | 50 + DC | ~13.2M | Varies by state |
| 2022 | 50 + DC | ~12.3M | 44 of 51 jurisdictions |
Seven states with zero local data in MEDSL 2022. These states have no rows with a blank dataverse column, meaning no local races were captured:
| State | FIPS |
|---|---|
| California | 06 |
| Iowa | 19 |
| Kansas | 20 |
| New Jersey | 34 |
| Pennsylvania | 42 |
| Tennessee | 47 |
| Wisconsin | 55 |
Local elections occur in all seven states. MEDSL’s curation process did not capture them for 2022. Coverage may differ in 2018 and 2020.
Odd-year data on Dataverse but not yet loaded. MEDSL publishes odd-year election data on Harvard Dataverse:
| Cycle | DOI | Status |
|---|---|---|
| 2015 | — | Not loaded |
| 2017 | 10.7910/DVN/VNJAB1 | Not loaded |
| 2019 | 10.7910/DVN/2AJUII | Not loaded |
| 2021 | — | Not loaded |
Odd-year elections cover gubernatorial races in VA, NJ, KY, LA, MS and municipal elections in many states. Loading these would fill a significant gap.
NC SBE — 1 State, 10 Cycles
NC SBE covers North Carolina exclusively, with precinct-level results for every contest on the ballot.
| Year | Election | Rows | Schema |
|---|---|---|---|
| 2024 | General | 233,511 | 15-column |
| 2022 | General | 171,901 | 15-column |
| 2020 | General | 257,722 | 15-column |
| 2018 | General | 183,724 | 15-column |
| 2016 | General | 252,827 | 15-column |
| 2014 | General | 223,977 | 15-column |
| 2012 | General | 208,921 | 14–15 column (different layout) |
| 2010 | General | 188,008 | 14–15 column (different layout) |
| 2008 | General | 233,141 | 14–15 column (different layout) |
| 2006 | General | 69,482 | 9-column (significantly different) |
All 10 cycles are downloaded. The 2014–2024 files share a stable schema and a single parser. The 2008–2012 files require a separate parser. The 2006 file requires a third.
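The three-parser split can be expressed as a simple dispatch on election year. The parser labels below are hypothetical names, not identifiers from the pipeline.

```python
# Even-year general elections covered by the table above: 2006 through 2024.
LOADED_CYCLES = range(2006, 2025, 2)


def nc_sbe_parser_for(year):
    """Pick a parser label for an NC SBE results file by election year."""
    if year not in LOADED_CYCLES:
        raise ValueError(f"no NC SBE file loaded for {year}")
    if year >= 2014:
        return "stable_15_column"   # 2014-2024 share one schema
    if year >= 2008:
        return "legacy_2008_2012"   # 14-15 columns, different layout
    return "legacy_2006"            # 9 columns
```

Raising on unknown years keeps a file with an unexpected layout from being routed silently to the wrong parser.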
OpenElections — ~8 States, Variable Coverage
OpenElections is community-curated. Coverage depends on volunteer effort per state. The following states have 2022 precinct-level general election data:
| State | 2022 precinct data | Earlier years |
|---|---|---|
| Florida | ✅ | 2000–2020 |
| Georgia | ✅ | 2004–2020 |
| Michigan | ✅ | 2000–2020 |
| Ohio | ✅ | 2000–2020 |
| Pennsylvania | ✅ | 2000–2020 |
| Texas | ✅ | 2000–2020 |
| North Carolina | ✅ | 2008–2020 |
| Arizona | Partial | 2004–2020 |
Coverage for other states exists at county level or for federal races only. Check each state’s GitHub repository (openelections-data-{state}) for current status.
VEST — Shapefiles with Vote Counts
VEST publishes precinct-level shapefiles for all 50 states. We have loaded a subset for odd-year coverage:
| State | Year | Election type | Loaded |
|---|---|---|---|
| Kentucky | 2019 | General (Governor) | ✅ |
| Louisiana | 2019 | General (Governor) | ✅ |
| Mississippi | 2019 | General (Governor) | ✅ |
| Virginia | 2019 | General (state legislature) | ✅ |
| Kentucky | 2015 | General (Governor) | ✅ |
| Louisiana | 2015 | General (Governor) | ✅ |
| Mississippi | 2015 | General (Governor) | ✅ |
| Virginia | 2015 | General (state legislature) | ✅ |
VEST covers federal and state-level races only (president, governor, US Senate, US House, state legislature). No local races.
Census and FEC — Reference Data
These are not election results. They provide reference identifiers used during pipeline enrichment.
| Source | Scope | Years | Records |
|---|---|---|---|
| Census county FIPS | National | 2020 | 3,143 |
| Census place FIPS | National | 2020 | 31,980 |
| Census state FIPS | National | 2020 | 56 |
| FEC candidate master | Federal candidates | 2020 | ~6,800 |
| FEC candidate master | Federal candidates | 2022 | ~6,600 |
Clarity/Scytl — Not Yet Integrated
Clarity ENR sites cover 1,000+ jurisdictions but are not yet in the pipeline. URLs are unstable across election cycles, making systematic acquisition difficult. See Clarity/Scytl ENR.
Combined Coverage Summary
| Dimension | Current status |
|---|---|
| States with any data | 50 + DC |
| Even-year general elections | 2018, 2020, 2022 |
| Odd-year elections | KY/LA/MS/VA 2015, 2019 (VEST only, state-level) |
| Deep single-state coverage | NC, 2006–2024 (10 cycles) |
| Total rows across all sources | ~42M |
| Local race coverage | 44 of 51 jurisdictions (MEDSL 2022) + NC (NC SBE) |
| Vote mode breakdowns | NC SBE (all contests), MEDSL (some states), Clarity (when integrated) |
| Turnout data | <5% of records populated |
Gap Analysis
Temporal gaps. No odd-year municipal election results are loaded. Cities like New York, Houston, Philadelphia, and San Antonio hold elections in odd years. MEDSL publishes 2017 and 2019 data on Dataverse. Loading these would add coverage for several of the largest US cities.
State-level local gaps. Seven states have zero local race data in MEDSL 2022. OpenElections partially fills this for Pennsylvania. The remaining six (CA, IA, KS, NJ, TN, WI) require either Clarity integration or direct state portal downloads.
Primary elections. All loaded data is general election only. MEDSL tags primary results with stage = PRI but we have not loaded primary-specific files. NC SBE publishes primary results as separate files.
Runoff elections. Georgia, Louisiana, Texas, and other states hold runoff elections. These are partially captured in MEDSL (stage = RUN) but not systematically loaded.
What We Cover, What We Don’t, and Why
This page is an honest inventory of what the pipeline can and cannot do today. The status indicators mean:
- ✅ — Functional and validated
- ⚠️ — Partially implemented or not validated at scale
- ❌ — Not yet supported
Status Table
| Capability | Status | Notes |
|---|---|---|
| Precinct-level results, all 50 states | ✅ | Via MEDSL 2018/2020/2022. 36.5M rows across three cycles. |
| NC deep temporal coverage | ✅ | NC SBE 2006–2024, 10 election cycles, 2.0M+ rows. Consistent 15-column schema from 2014 onward. |
| Federal race coverage | ✅ | President, US Senate, US House present in MEDSL for all states. FEC candidate master files available for cross-referencing. |
| State-level race coverage | ✅ | Governor, state legislature, AG, SOS present in MEDSL for all states. |
| FIPS geographic enrichment | ✅ | Census reference files loaded: 3,143 counties, 31,980 places, all 50 states + DC. 100% county FIPS match rate on MEDSL data. |
| Vote mode breakdowns | ✅ | NC SBE provides Election Day / One Stop / Absentee / Provisional for every contest. MEDSL provides mode breakdowns for some states (rows split by mode column). |
| Local race coverage | ⚠️ | 44 of 51 MEDSL jurisdictions have local race data (blank dataverse column) in 2022. Seven states — CA, IA, KS, NJ, PA, TN, WI — have zero local rows. |
| Cross-source validation | ⚠️ | Validated for NC only. MEDSL and NC SBE share 640 contests in 2022: 90.5% exact vote match, 7.3% within 1%, 2.2% disagree by >1%. No systematic cross-source validation for other states. |
| Entity resolution | ⚠️ | Four-tier cascade (exact → Jaro-Winkler → embedding → LLM) is designed and prototyped. Not yet validated at scale beyond NC test cases. |
| Office classification | ⚠️ | Four-tier classifier (keyword → regex → embedding → LLM) handles 62% of 8,387 unique office names via keywords. Remaining 38% require embedding or LLM tiers. 0.56% classified as “other” in NC testing. |
| Name decomposition | ⚠️ | Parses first/middle/last/suffix/nickname from MEDSL and NC SBE formats. Handles nicknames in quotes ("Steve") and parentheses ((Steve)). Not tested against all 50 states’ formatting conventions. |
| Turnout data | ❌ | registered_voters and ballots_cast populated for <5% of records. NC SBE has “Registered Voters” pseudo-contest rows. Most MEDSL state files do not include registration counts. |
| Odd-year elections | ❌ | MEDSL publishes 2017 and 2019 on Harvard Dataverse; neither is loaded. VEST files for KY/LA/MS/VA (2015, 2019) are loaded but contain no local races. |
| Ranked-choice voting | ❌ | Schema has no fields for RCV rounds. Maine and Alaska use RCV for federal races. NYC and other cities use it for local races. No timeline for support. |
| Demographic correlation | ❌ | Census FIPS join is ready (county-level). Census demographic data (ACS) not yet integrated. The join key exists; the demographic tables do not. |
| Real-time results | ❌ | Pipeline processes certified and official results only. Not designed for election night reporting. Clarity integration (which could provide semi-live data) is not yet implemented. |
| Party switching detection | ❌ | Requires entity resolution across election cycles, which depends on L3/L4 being operational at scale. |
Local Race Coverage Detail
The 44 jurisdictions (43 states plus DC) with local data in MEDSL 2022 vary in depth. Some report thousands of local contests; others report only a handful. The seven states with zero local rows are not states without local elections; they are states where MEDSL’s curation did not capture local results for that cycle.
NC SBE fills the gap for North Carolina with complete local coverage: every contest on every ballot in every precinct in all 100 counties. For other states, the gap remains.
OpenElections provides supplemental local data for FL, GA, MI, OH, PA, and TX, but coverage is inconsistent across years and granularity levels.
What “Validated” Means
A capability marked ✅ means:
- The data is loaded and parsed without errors.
- The output has been spot-checked against the source.
- Where cross-source overlap exists, the numbers have been compared.
It does not mean the data is free of errors from the source. MEDSL’s votes column contains 12,782 non-integer values out of 12.3M rows (0.1%) in 2022. NC SBE has occasional data entry artifacts (e.g., a period after a middle name instead of a middle initial). These are source-level issues that the pipeline preserves and flags rather than silently corrects.
What “Not Validated at Scale” Means
Entity resolution and office classification work on NC test data. We have not run them against all 42M rows across all 50 states. The algorithms are designed; the compute has not been spent. When we do run at scale, we expect to discover new edge cases — office titles we haven’t seen, name formats we haven’t parsed, and match ambiguities we haven’t resolved.
This page will be updated as capabilities move from ⚠️ to ✅ or as new limitations are discovered.
MEDSL — MIT Election Data + Science Lab
The MIT Election Data + Science Lab publishes precinct-level election returns for all 50 states and the District of Columbia. The data is hosted on the Harvard Dataverse (electionscience collection) and mirrored on GitHub for recent cycles. It is the most complete single source of US election data available without a paywall or API key.
What MEDSL contains
MEDSL provides one CSV or tab-delimited file per state per election cycle. Each row represents one candidate in one precinct for one vote mode (election day, absentee, early voting, provisional, etc.). To obtain the total votes for a candidate in a precinct, you must sum across all rows for that candidate and precinct.
Available election cycles:
| Cycle | Location | Format | DOI |
|---|---|---|---|
| 2022 | GitHub | CSV, one ZIP per state | — |
| 2020 | Harvard Dataverse | CSV/TAB, one file per state | 10.7910/DVN/NT66Z3 |
| 2018 | Harvard Dataverse | CSV/TAB, one file per state | 10.7910/DVN/NVQYMG |
| 2016 | Harvard Dataverse | CSV/TAB | 10.7910/DVN/NH5S2I |
| 2019 (odd-year) | Harvard Dataverse | CSV/TAB | 10.7910/DVN/2AJUII |
| 2017 (odd-year) | Harvard Dataverse | CSV/TAB | 10.7910/DVN/VNJAB1 |
We have downloaded and loaded 2018, 2020, and 2022. Together they contain approximately 36.5 million rows.
Schema
MEDSL files have 25 columns. The delimiter is comma for most states but tab for some; auto-detection handles this.
| Column | Type | Description | Example |
|---|---|---|---|
| `precinct` | string | Precinct identifier from the source | 12-13 |
| `office` | string | Contest name, ALL CAPS | CABARRUS COUNTY SCHOOLS BOARD OF EDUCATION |
| `party_detailed` | string | Full party name | NONPARTISAN |
| `party_simplified` | string | Normalized party | NONPARTISAN |
| `mode` | string | Vote type for this row | ELECTION DAY |
| `votes` | integer | Vote count for this mode | 79 |
| `candidate` | string | Candidate name, ALL CAPS | GREG MILLS |
| `district` | string | District identifier or blank | STATEWIDE, 003, (blank) |
| `dataverse` | string | Race level tag (see below) | STATE, SENATE, HOUSE, (blank) |
| `stage` | string | Election stage | GEN |
| `special` | string | Special election flag | FALSE |
| `writein` | string | Write-in flag | FALSE |
| `date` | date | Election date | 2022-11-08 |
| `year` | integer | Election year | 2022 |
| `county_name` | string | County name, ALL CAPS | CABARRUS |
| `county_fips` | string | 5-digit county FIPS | 37025 |
| `jurisdiction_name` | string | Jurisdiction name | CABARRUS |
| `jurisdiction_fips` | string | Jurisdiction FIPS | 37025 |
| `state` | string | Full state name | NORTH CAROLINA |
| `state_po` | string | 2-letter postal code | NC |
| `state_fips` | string | 2-digit state FIPS | 37 |
| `state_cen` | string | Census state code | 56 |
| `state_ic` | string | ICPSR state code | 47 |
| `readme_check` | string | Data quality flag | FALSE |
| `magnitude` | integer | Number of seats in this contest | 3 |
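How the pipeline auto-detects the delimiter is not described here, so the following is only one simple approach: count tabs and commas in the header line and pick whichever is more frequent. The function names are illustrative.

```python
import csv
import io


def detect_delimiter(header_line):
    """Guess tab vs comma from the header row of a MEDSL file."""
    return "\t" if header_line.count("\t") > header_line.count(",") else ","


def read_medsl(text):
    """Parse MEDSL file contents into a list of dicts keyed by column name."""
    header = text.split("\n", 1)[0]
    reader = csv.DictReader(io.StringIO(text), delimiter=detect_delimiter(header))
    return list(reader)
```

Checking only the header line is cheap and avoids being misled by commas inside quoted office or candidate names further down the file. For the full state files, stream rows from an open file handle instead of building a list.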
The dataverse column and local races
MEDSL tags each row with a dataverse value indicating which Harvard Dataverse sub-collection the race belongs to:
| Value | Meaning | Example offices |
|---|---|---|
| `PRESIDENT` | Presidential race | President |
| `SENATE` | US Senate | US Senate |
| `HOUSE` | US House | US House District 7 |
| `STATE` | State-level offices | Governor, State Senate, Attorney General |
| (blank) | Everything else — including all local races | County Commissioner, School Board, Sheriff, Soil and Water |
Local races are identified by a blank dataverse column, not by the value LOCAL. This is a frequent source of confusion. In the 2022 North Carolina file, 385,260 of 684,712 rows (56%) have a blank dataverse value. These rows contain school board races, county commissioner races, soil and water conservation districts, district court judges, mayors, city councils, and other local offices.
In the full 2022 national dataset (12.3 million rows), approximately 5.1 million rows (41.5%) have a blank dataverse value.
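As a sketch of that rule, the helper below selects rows whose `dataverse` value is blank and reports their share of a file. Because blank means "everything else", the selected rows are candidates for local races rather than a guaranteed-local set. The function names are illustrative.

```python
def has_blank_dataverse(row):
    """True when the dataverse column is empty, missing, or whitespace only."""
    return (row.get("dataverse") or "").strip() == ""


def blank_dataverse_share(rows):
    """Fraction of rows with a blank dataverse value (0.0 for an empty input)."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(has_blank_dataverse(r) for r in rows) / len(rows)
```

Applied to rows read with `csv.DictReader`, `blank_dataverse_share` gives a figure to compare against the 56% reported for the 2022 North Carolina file.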
The mode column and vote totals
Each row in MEDSL represents one candidate’s votes for one vote mode. A single candidate in a single precinct may have multiple rows:
12-13,US SENATE,LIBERTARIAN,LIBERTARIAN,ELECTION DAY,47,SHANNON W BRAY,...
12-13,US SENATE,LIBERTARIAN,LIBERTARIAN,ABSENTEE BY MAIL,5,SHANNON W BRAY,...
12-13,US SENATE,LIBERTARIAN,LIBERTARIAN,EARLY VOTING,38,SHANNON W BRAY,...
12-13,US SENATE,LIBERTARIAN,LIBERTARIAN,PROVISIONAL,0,SHANNON W BRAY,...
To get Shannon W. Bray’s total votes in precinct 12-13, sum the votes column across all modes: 47 + 5 + 38 + 0 = 90.
Some states include a TOTAL mode row that pre-sums the other modes. Some do not. Your aggregation logic must handle both cases. If TOTAL rows are present, either use them directly and skip the individual mode rows, or skip TOTAL and sum the modes yourself. Do not double-count.
Common mode values: ELECTION DAY, ABSENTEE BY MAIL, EARLY VOTING, ONE STOP, PROVISIONAL, TOTAL.
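The aggregation rule can be sketched as follows: prefer a `TOTAL` row when one exists for a candidate and precinct, otherwise sum the individual modes. The grouping key of precinct, office, and candidate is an assumption; real files may also need county, district, or party in the key. The sketch further assumes `votes` already holds clean integer strings.

```python
from collections import defaultdict


def candidate_totals(rows):
    """Return {(precinct, office, candidate): votes} without double-counting TOTAL rows."""
    mode_sum = defaultdict(int)  # sum of the individual mode rows
    total_row = {}               # value of the pre-summed TOTAL row, when present
    for row in rows:
        key = (row["precinct"], row["office"], row["candidate"])
        votes = int(row["votes"])
        if row["mode"].strip().upper() == "TOTAL":
            total_row[key] = total_row.get(key, 0) + votes
        else:
            mode_sum[key] += votes
    keys = set(mode_sum) | set(total_row)
    return {k: total_row[k] if k in total_row else mode_sum[k] for k in keys}
```

With the four Shannon W. Bray rows above, the result for precinct 12-13 is 90. Adding a `TOTAL` row of 90 leaves it at 90 rather than 180.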
Name formatting
MEDSL candidate names are ALL CAPS with no periods after initials:
| MEDSL | Actual name |
|---|---|
| `SHANNON W BRAY` | Shannon W. Bray |
| `VICTORIA P PORTER` | Victoria P. Porter |
| `MICHAEL "STEVE" HUBER` | Michael “Steve” Huber |
| `ROBERT VAN FLETCHER JR` | Robert Van Fletcher, Jr. |
| `LM "MICKEY" SIMMONS` | L.M. “Mickey” Simmons |
Nicknames appear in double quotes within the name string. Suffixes (JR, SR, III) appear without a preceding comma.
Write-in candidates are aggregated into a single row with candidate = WRITEIN and writein = TRUE.
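These conventions are regular enough to decompose with a few rules. The sketch below is a simplified illustration, not the pipeline's parser: it pulls out a quoted nickname, strips a trailing suffix, and treats the first and last remaining tokens as first and last name. The suffix list extends the three named above with `II` and `IV` as an assumption, and multi-word surnames are not handled.

```python
import re

SUFFIXES = {"JR", "SR", "II", "III", "IV"}  # II and IV are assumed additions
NICKNAME_RE = re.compile(r'"([^"]+)"')


def decompose(name):
    """Split a MEDSL-style ALL CAPS name into first/middle/last/suffix/nickname."""
    found = NICKNAME_RE.search(name)
    nickname = found.group(1) if found else None
    tokens = NICKNAME_RE.sub(" ", name).split()
    suffix = tokens.pop() if tokens and tokens[-1] in SUFFIXES else None
    first = tokens[0] if tokens else None
    last = tokens[-1] if len(tokens) > 1 else None
    middle = " ".join(tokens[1:-1]) or None
    return {"first": first, "middle": middle, "last": last,
            "suffix": suffix, "nickname": nickname}
```

Under these rules `ROBERT VAN FLETCHER JR` yields `VAN` as the middle name, which may or may not be right for a given candidate. Cases like that are where a rule-based pass needs review, and the `WRITEIN` placeholder should be filtered out before decomposition.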
Non-candidate rows
Some states include metadata rows in the data that are not candidate results:
| `office` value | Meaning | Action |
|---|---|---|
| `REGISTERED VOTERS` | Voter registration count | Extract as turnout metadata, do not treat as a contest |
| `BALLOTS CAST` | Ballots cast count | Extract as turnout metadata |
| `BALLOTS CAST - TOTAL` | Same | Extract |
| `BALLOTS CAST - BLANK` | Blank ballot count | Extract |
| `STRAIGHT PARTY` | Straight-ticket party vote | Typically excluded from contest analysis |
| `OVER VOTES` | Overvote count | Extract as quality metadata |
| `UNDER VOTES` | Undervote count | Extract as quality metadata |
These rows are present in some states and absent in others. Florida OpenElections data contains 6,013 “Registered Voters” rows — 67.9% of all records classified as “other” in initial processing.
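A lookup table is enough to route these rows before contest analysis. The sketch below uses the office values from the table above; the category labels (`turnout`, `quality`, `excluded`, `contest`) are illustrative names rather than pipeline vocabulary, and treating the `BALLOTS CAST` variants as turnout metadata is an assumption where the table only says "Extract".

```python
NON_CANDIDATE_OFFICES = {
    "REGISTERED VOTERS": "turnout",
    "BALLOTS CAST": "turnout",
    "BALLOTS CAST - TOTAL": "turnout",
    "BALLOTS CAST - BLANK": "turnout",
    "STRAIGHT PARTY": "excluded",
    "OVER VOTES": "quality",
    "UNDER VOTES": "quality",
}


def classify_row(row):
    """Return the metadata category for a row, or "contest" for a real race."""
    office = " ".join(row["office"].upper().split())  # normalize case and spacing
    return NON_CANDIDATE_OFFICES.get(office, "contest")
```

Normalizing case and whitespace first lets a value such as `Registered Voters` from another source hit the same entry. Any office string not in the table falls through as a contest, so new metadata labels will surface in review instead of vanishing.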
Known coverage gaps
MEDSL 2022 contains local race data for 44 of 51 jurisdictions. Seven states have zero rows with a blank dataverse column:
| State | Likely reason |
|---|---|
| California | Local results published separately by each county; not aggregated by MEDSL |
| Iowa | Local results not included in the MEDSL state file |
| Kansas | Same |
| New Jersey | Same |
| Pennsylvania | Same |
| Tennessee | Same |
| Wisconsin | Same |
This does not mean these states lack local elections. It means MEDSL’s curation process did not capture them for 2022. Coverage may differ in other years.
The votes column type
The votes column is predominantly integer, but some state files contain non-integer values. We observed:
- Floating-point values (likely vote shares erroneously placed in the votes column)
- Asterisks (*) indicating suppressed data
- Empty strings
Parse with TRY_CAST or equivalent. In our load of the full 2022 dataset, 12,782 rows had non-integer votes values out of 12.3 million total (0.1%).
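Outside SQL, an equivalent of TRY_CAST can be sketched in a few lines of Python. The function name is hypothetical, and treating floating-point values as missing (rather than rounding them) is an assumption based on the observation that they are likely misplaced vote shares.

```python
import re
from typing import Optional

_INTEGER = re.compile(r"[0-9]+")


def parse_votes(raw: Optional[str]) -> Optional[int]:
    """Return the vote count, or None when the value is not a plain integer."""
    if raw is None:
        return None
    text = raw.strip()
    # Rejects empty strings, "*", and floating-point values such as "0.4512".
    if not _INTEGER.fullmatch(text):
        return None
    return int(text)
```

Returning None instead of raising keeps the row in the load, so rejected values can be counted and inspected afterwards.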
Download
```bash
# 2022: all 51 files from GitHub
mkdir -p local-data/sources/medsl/2022
for state in ak al ar az ca co ct dc de fl ga hi ia id il in ks ky la \
             ma md me mi mn mo ms mt nc nd ne nh nj nm nv ny oh ok or pa \
             ri sc sd tn tx ut va vt wa wi wv wy; do
  curl -L -o "local-data/sources/medsl/2022/2022-${state}-local-precinct-general.zip" \
    "https://raw.githubusercontent.com/MEDSL/2022-elections-official/main/individual_states/2022-${state}-local-precinct-general.zip"
done

# Unzip each archive into a directory named after it
for f in local-data/sources/medsl/2022/*.zip; do
  unzip -o "$f" -d "${f%.zip}"
done

# 2020: NC example from Harvard Dataverse (file ID 6100444)
mkdir -p local-data/sources/medsl/2020
curl -L -o local-data/sources/medsl/2020/2020-nc-precinct-general.csv \
  "https://dataverse.harvard.edu/api/access/datafile/6100444"
```
File IDs for all 51 jurisdictions in 2020 and 2018 are documented in the download instructions.
Cross-source overlap
For the 2022 North Carolina general election, MEDSL and NC SBE share 640 contests where both sources report results:
- 579 (90.5%) have exactly matching vote totals
- 47 (7.3%) match within 1%
- 14 (2.2%) disagree by more than 1%
- Independently of vote agreement, 401 of the 640 (63%) have different candidate name formatting between the two sources
This overlap is the basis for our entity resolution validation. See Cross-Source Reconciliation.
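The three vote-agreement buckets above can be expressed as a small comparison. This is a sketch under one stated assumption: the text does not say which total the 1% is measured against, so the larger of the two totals is used as the denominator here. The function name and labels are hypothetical.

```python
def agreement_bucket(votes_a: int, votes_b: int) -> str:
    """Classify how closely two sources agree on one contest's vote total."""
    if votes_a == votes_b:
        return "exact"
    # The totals differ and are non-negative, so the larger one is above zero.
    relative_diff = abs(votes_a - votes_b) / max(votes_a, votes_b)
    return "within_1pct" if relative_diff <= 0.01 else "over_1pct"
```

For example, totals of 1000 and 995 differ by 0.5% and land in the middle bucket, while 1000 and 900 differ by 10% and are flagged as a disagreement.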
NC SBE — North Carolina State Board of Elections
The North Carolina State Board of Elections publishes precinct-level results for every contest on the ballot — federal, state, and local — with vote mode breakdowns, for every election cycle back to at least 2006. It is the most complete single-state local election dataset we have found.
What NC SBE contains
NC SBE provides one tab-delimited text file per election, delivered as a ZIP archive from an S3 bucket. Each row represents one candidate in one precinct for one contest. Vote mode totals (Election Day, early voting, absentee by mail, provisional) appear as separate columns on each row, not as separate rows. This means a single row gives you the full vote breakdown for one candidate in one precinct — unlike MEDSL, which splits each vote mode into its own row.
Coverage:
| Year | File | Rows | Notes |
|---|---|---|---|
| 2024 | results_pct_20241105.txt | 233,511 | Presidential general |
| 2022 | results_pct_20221108.txt | 171,901 | Midterm general |
| 2020 | results_pct_20201103.txt | 257,722 | Presidential general |
| 2018 | results_pct_20181106.txt | 183,724 | Midterm general |
| 2016 | results_pct_20161108.txt | 252,827 | Presidential general |
| 2014 | results_pct_20141104.txt | 223,977 | Midterm general |
| 2012 | results_pct_20121106.txt | 208,921 | Different schema — see below |
| 2010 | results_pct_20101102.txt | 188,008 | Different schema |
| 2008 | results_pct_20081104.txt | 233,141 | Different schema |
| 2006 | results_pct_20061107.txt | 69,482 | Significantly different schema (9 columns) |
We have downloaded and loaded all 10 cycles. The 2014–2024 files share a stable 15-column format. Earlier files require separate parsers.
Schema (2014–2024)
Files from 2014 onward are tab-delimited with 15 columns. There is no quoting convention; values do not contain tabs.
| Column | Type | Description | Example |
|---|---|---|---|
| County | string | County name, ALL CAPS | COLUMBUS |
| Election Date | string | Date as MM/DD/YYYY | 11/08/2022 |
| Precinct | string | Precinct identifier | P17 |
| Contest Group ID | string | Internal contest grouping number | 7 |
| Contest Type | string | S = statewide, C = county/local | C |
| Contest Name | string | Full contest name, ALL CAPS | COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 |
| Choice | string | Candidate name, Title Case | Timothy Lance |
| Choice Party | string | Party abbreviation or blank | REP, DEM, (blank) |
| Vote For | integer | Maximum selections allowed | 1 |
| Election Day | integer | Election day votes | 136 |
| One Stop | integer | Early voting (in-person) votes | 159 |
| Absentee by Mail | integer | Mail absentee votes | 7 |
| Provisional | integer | Provisional ballot votes | 1 |
| Total Votes | integer | Sum of all vote modes | 303 |
| Real Precinct | string | Y = physical precinct, N = aggregation group | Y |
The Contest Type column
The Contest Type field distinguishes local from statewide races:
- C: county/local contests (school board, county commissioner, city council, soil and water, local judicial races, bond referendums)
- S: statewide contests (US Senate, US House, Governor, state legislature, statewide judicial races)
For local election analysis, filter to Contest Type = 'C'. In the 2022 file, this yields 919 distinct contests across 100 counties.
Vote mode columns
NC SBE is the only source in our corpus that provides vote mode breakdowns as columns for every contest, including local races. The four modes are:
| Column | Meaning |
|---|---|
| Election Day | Votes cast in person on election day |
| One Stop | Early in-person voting (North Carolina’s term for early voting) |
| Absentee by Mail | Absentee ballots returned by mail |
| Provisional | Provisional ballots accepted during canvass |
Total Votes is the sum of the four mode columns. We have verified this holds across all rows in the 2014–2024 files.
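A check of this kind can be sketched with the standard `csv` module. The column names come from the schema table above; the function name is hypothetical, and the two sample rows are illustrative (the first reuses the Example column, the second is deliberately inconsistent).

```python
import csv
import io

MODE_COLUMNS = ["Election Day", "One Stop", "Absentee by Mail", "Provisional"]
HEADER = ["County", "Election Date", "Precinct", "Contest Group ID", "Contest Type",
          "Contest Name", "Choice", "Choice Party", "Vote For",
          *MODE_COLUMNS, "Total Votes", "Real Precinct"]


def rows_with_bad_totals(tsv_text: str) -> list[dict]:
    """Return rows whose Total Votes differs from the sum of the four mode columns."""
    # Tab-delimited with no quoting convention, as described for 2014-2024 files.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    bad = []
    for row in reader:
        mode_sum = sum(int(row[col]) for col in MODE_COLUMNS)
        if mode_sum != int(row["Total Votes"]):
            bad.append(row)
    return bad


# Illustrative rows only: the second one is intentionally wrong (10 + 20 != 31).
CONTEST = "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"
GOOD = ["COLUMBUS", "11/08/2022", "P17", "7", "C", CONTEST, "Timothy Lance", "",
        "1", "136", "159", "7", "1", "303", "Y"]
BAD = ["COLUMBUS", "11/08/2022", "P18", "7", "C", CONTEST, "Timothy Lance", "",
       "1", "10", "20", "0", "0", "31", "Y"]
SAMPLE = "\n".join("\t".join(fields) for fields in (HEADER, GOOD, BAD))
```

Calling `rows_with_bad_totals(SAMPLE)` returns only the P18 row. An empty result on a real file is what the verification described above corresponds to.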
The vote mode data enables analysis that most sources cannot support: comparing early voting patterns to election day patterns at the precinct level for local races. Three of our nine data sources provide any vote mode breakdown at all (NC SBE, Clarity, and MEDSL for some states). NC SBE is the only one that provides it consistently for all contests.
Non-contest rows
NC SBE data includes rows that are not candidate results. These appear as entries in the Choice column within contests that are not real races:
| Contest Name pattern | Choice value | What it is |
|---|---|---|
| Contains “Registered Voters” | (varies) | Voter registration count for the precinct |
| Any contest | Write-In (Miscellaneous) | Aggregated write-in votes |
| Any contest | Over Votes | Overvote count |
| Any contest | Under Votes | Undervote count |
The “Registered Voters” rows deserve special attention. They appear as a contest named “Registered Voters” with a single Choice entry where Total Votes contains the number of registered voters in that precinct. This is turnout metadata, not a contest result.
In our prototype pipeline, we extract the registered voter count from these rows into a turnout object, then exclude the row from contest analysis. This is how we backfill the turnout.registered_voters field that is otherwise unpopulated for most sources.
Write-in rows with the suffix (Write-In) in the candidate name (e.g., Ronnie Strickland (Write-In)) are distinct from the aggregated Write-In (Miscellaneous) row. The named write-in rows report votes for a specific write-in candidate. The (Miscellaneous) row reports the total for all unnamed write-ins.
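The distinctions in this section can be sketched as one classifier over the Contest Name and Choice columns. The function name and the returned labels are hypothetical; only the string patterns come from the text above.

```python
def classify_ncsbe_row(contest_name: str, choice: str) -> str:
    """Label an NC SBE row as turnout metadata, a write-in, accounting, or a candidate."""
    if "registered voters" in contest_name.lower():
        return "registered_voters"      # Total Votes holds the registration count
    label = choice.strip()
    if label == "Write-In (Miscellaneous)":
        return "write_in_aggregate"     # total for all unnamed write-ins
    if label.endswith("(Write-In)"):
        return "write_in_named"         # votes for one specific write-in candidate
    if label in ("Over Votes", "Under Votes"):
        return "ballot_accounting"
    return "candidate"
```

Checking the exact (Miscellaneous) label before the generic suffix keeps the two write-in kinds separate even if the checks are later loosened.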
The Real Precinct column
Real Precinct = Y indicates a physical voting precinct with a defined geographic boundary. Real Precinct = N indicates an aggregation group — typically used for absentee-only tallies or provisional ballot pools that cannot be assigned to a specific precinct.
For geographic analysis (mapping, precinct-level comparison), filter to Real Precinct = 'Y'. For total vote counts, include both.
Candidate name formatting
NC SBE candidate names are Title Case with periods after initials and commas before suffixes:
| NC SBE | Components |
|---|---|
| Timothy Lance | first=Timothy, last=Lance |
| Shannon W. Bray | first=Shannon, middle=W, last=Bray |
| Robert Van Fletcher, Jr. | first=Robert, middle=Van, last=Fletcher, suffix=Jr. |
| Michael (Steve) Huber | first=Michael, nickname=Steve, last=Huber |
| William Irvin. Enzor III | first=William, middle=Irvin, last=Enzor, suffix=III |
| Patricia (Pat) Cotham | first=Patricia, nickname=Pat, last=Cotham |
Nicknames appear in parentheses. This differs from MEDSL, which uses double quotes. The period after “Irvin.” in “William Irvin. Enzor III” appears to be a data entry artifact — the period belongs after the middle initial, not after the full middle name. These inconsistencies are present in the source data and must be handled during name decomposition at L1.
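One way to handle both conventions is to reduce each name to the same set of uppercase components before comparing. The sketch below is an illustration rather than the pipeline's L1 decomposer: the function name and the suffix list are assumptions, and it does not reconcile cases such as LM versus L.M., which split into different tokens.

```python
import re

SUFFIXES = {"JR", "SR", "II", "III", "IV"}  # assumed list, not exhaustive


def name_key(raw: str) -> dict:
    """Split a MEDSL- or NC SBE-style name into comparable uppercase parts."""
    text = raw.replace("“", '"').replace("”", '"')
    # Nickname: double quotes (MEDSL) or parentheses (NC SBE).
    # Note: a "(Write-In)" suffix would also match, so strip it beforehand.
    nick = re.search(r'"([^"]+)"|\(([^)]+)\)', text)
    nickname = None
    if nick:
        nickname = (nick.group(1) or nick.group(2)).upper()
        text = text[:nick.start()] + " " + text[nick.end():]
    # Drop periods and commas, including stray ones such as "Irvin."
    tokens = [t.upper() for t in re.sub(r"[.,]", " ", text).split()]
    if not tokens:
        return {"first": None, "middle": None, "last": None,
                "suffix": None, "nickname": nickname}
    suffix = tokens.pop() if len(tokens) > 1 and tokens[-1] in SUFFIXES else None
    return {"first": tokens[0],
            "middle": " ".join(tokens[1:-1]) or None,
            "last": tokens[-1],
            "suffix": suffix,
            "nickname": nickname}
```

With this normalization, SHANNON W BRAY and Shannon W. Bray produce identical keys, as do the quoted and parenthesized forms of Michael Huber's name.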
Schema changes across years
The 2014–2024 files share the 15-column schema documented above. Earlier files differ:
2008–2012: The schema has 14–15 columns but with different names and ordering. Contest Type is the third column (not the fifth). Fields are comma-delimited with quote wrapping. The district column was added later. Vote mode columns use slightly different names in some years.
2006: Significantly different. Only 9 columns: county, election_dt, precinct_abbrv, precinct, contest_name, name_on_ballot, party_cd, ballot_count, FTP_Date. No vote mode breakdown. No Contest Type field. All column names are lowercase with underscores.
We currently parse 2014–2024 with one parser and treat 2006–2012 as a separate parser target. The 2008–2012 files contain local races (they have Contest Type = C) but require different column mapping. The 2006 file requires more investigation to determine whether it includes local races.
Why NC SBE matters
NC SBE is not the largest dataset in our corpus (MEDSL has far more rows). Its value is in three properties that no other source provides simultaneously:
- Complete local coverage. Every contest on every ballot in every precinct in every county: school board, soil and water, county commissioner, municipal, judicial, and bond referendums. MEDSL has gaps in local race coverage for some states. NC SBE has none for North Carolina.
- Vote mode breakdowns for local races. The four-column mode breakdown (Election Day, One Stop, Absentee, Provisional) is present for every contest, including hyperlocal races like “Whiteville City Schools Board of Education District 01.”
- Ten-year temporal depth. Six clean election cycles (2014–2024) with a consistent schema. This enables career tracking, competitiveness trend analysis, and temporal chain construction across a decade of local elections. Combined with the 2008–2012 files (once parsed), the coverage extends to 16 years (2008–2024).
The combination of these three properties makes NC SBE the primary validation dataset for the pipeline. When we test cross-source entity resolution, we compare MEDSL NC against NC SBE NC — 640 overlapping contests with 90.5% exact vote total agreement and 63% candidate name formatting differences. When we test temporal chains, we track candidates across NC SBE’s six-cycle span.
Download
The URL pattern is:
https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/{YYYY_MM_DD}/results_pct_{YYYYMMDD}.zip
```bash
mkdir -p local-data/sources/ncsbe/{2014,2016,2018,2020,2022,2024}

# 2024
curl -L -o local-data/sources/ncsbe/2024/results_pct_20241105.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2024_11_05/results_pct_20241105.zip"
# 2022
curl -L -o local-data/sources/ncsbe/2022/results_pct_20221108.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip"
# 2020
curl -L -o local-data/sources/ncsbe/2020/results_pct_20201103.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2020_11_03/results_pct_20201103.zip"
# 2018
curl -L -o local-data/sources/ncsbe/2018/results_pct_20181106.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2018_11_06/results_pct_20181106.zip"
# 2016
curl -L -o local-data/sources/ncsbe/2016/results_pct_20161108.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2016_11_08/results_pct_20161108.zip"
# 2014
curl -L -o local-data/sources/ncsbe/2014/results_pct_20141104.zip \
  "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2014_11_04/results_pct_20141104.zip"

# Unzip all (the subshell keeps the working directory unchanged)
for d in local-data/sources/ncsbe/*/; do
  (cd "$d" && unzip -o ./*.zip)
done
```
Older cycles (2008–2012) follow the same URL pattern. The 2006 file uses a different path structure. The full election calendar is at ncsbe.gov/results-data.
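The URL pattern can be generated from the election date rather than typed by hand. A minimal sketch, with a hypothetical function name; it applies only to cycles that follow the pattern above, not to the 2006 file.

```python
from datetime import date

NCSBE_BASE = "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS"


def ncsbe_results_url(election_day: date) -> str:
    """Build the precinct results ZIP URL for one election date."""
    # The directory uses underscores (2022_11_08); the file name does not (20221108).
    return (f"{NCSBE_BASE}/{election_day:%Y_%m_%d}/"
            f"results_pct_{election_day:%Y%m%d}.zip")
```

For example, `ncsbe_results_url(date(2022, 11, 8))` reproduces the 2022 URL used in the download commands.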
OpenElections
The OpenElections project is a volunteer-driven effort to collect, clean, and publish certified US election results as CSV files. Data is organized into per-state GitHub repositories under the openelections organization.
What OpenElections provides
Precinct-level and county-level election results, parsed from official state and county sources into CSV format. Coverage varies by state — some repositories have data back to 2000, others have only one or two recent cycles. Approximately 8 states have precinct-level 2022 general election data suitable for aggregation.
States with usable precinct-level data for recent cycles include FL, GA, MI, OH, PA, and TX. Each state repository is independent, maintained by different volunteers, with different levels of completeness.
Repository structure
Each state has its own repo:
- openelections-data-fl: Florida
- openelections-data-ga: Georgia
- openelections-data-pa: Pennsylvania
- etc.
Files follow a naming convention that encodes the election date, state, type, and granularity:
{YYYYMMDD}__{state}__{type}__{granularity}.csv
Examples:
| Filename | Meaning |
|---|---|
| 20221108__fl__general__precinct.csv | 2022 FL general, precinct-level |
| 20220510__pa__primary__county.csv | 2022 PA primary, county-level |
| 20201103__ga__general__precinct.csv | 2020 GA general, precinct-level |
Some repos include both raw and cleaned versions. Files with raw in the name are unprocessed source dumps. Prefer the files without raw in the name.
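The naming convention is regular enough to parse by splitting on the double underscore. The sketch below handles only the four-part form shown above and returns None for anything else. The type and function names are hypothetical.

```python
from datetime import date, datetime
from typing import NamedTuple, Optional


class OEFile(NamedTuple):
    election_date: date
    state: str
    election_type: str
    granularity: str


def parse_oe_filename(name: str) -> Optional[OEFile]:
    """Parse {YYYYMMDD}__{state}__{type}__{granularity}.csv, or return None."""
    if not name.endswith(".csv"):
        return None
    parts = name[:-4].split("__")
    if len(parts) != 4:
        return None  # other shapes exist in practice and need their own handling
    stamp, state, election_type, granularity = parts
    try:
        election_date = datetime.strptime(stamp, "%Y%m%d").date()
    except ValueError:
        return None
    return OEFile(election_date, state, election_type, granularity)
```

Returning None instead of raising makes it easy to list the files in a repo that do not follow the convention and review them by hand.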
Core schema (7+ columns)
The project does not enforce a single schema. Most files share a 7-column core:
| Column | Type | Description |
|---|---|---|
| county | string | County name |
| precinct | string | Precinct name or code |
| office | string | Office contested |
| district | string | District number or name (may be blank) |
| party | string | Party abbreviation |
| candidate | string | Candidate name |
| votes | integer | Vote count |
Additional columns appear in some states:
- election_day, absentee, provisional, early_voting: vote mode breakdowns
- winner: boolean or Y/N flag
- total_votes: aggregate across modes
Column names and ordering differ across states and sometimes across files within the same state repo.
Schema variation by state
| State | Extra columns | Name format | Notes |
|---|---|---|---|
| FL | election_day, absentee, early_voting | Last, First | Includes “Registered Voters” metadata rows (6,013 in 2022) |
| GA | total_votes | First Last | Precinct names vary by county |
| PA | none beyond core | First Last | Some files county-level only |
| OH | early_voting, absentee | Last, First | Inconsistent across counties |
Non-candidate rows
Florida files include metadata rows that are not contest results:
| office value | Meaning |
|---|---|
| Registered Voters | Voter registration count (67.9% of “other” rows in initial FL processing) |
| Ballots Cast | Turnout count |
These must be extracted as turnout metadata during L1 parsing, not treated as contests.
Access method
Data is accessed by cloning the per-state Git repository:
```bash
git clone https://github.com/openelections/openelections-data-fl.git
git clone https://github.com/openelections/openelections-data-ga.git
git clone https://github.com/openelections/openelections-data-pa.git
```
There is no bulk download endpoint. Each state repo must be cloned individually.
Data quality
Quality varies by state and volunteer. Known issues:
- No standard schema. Column names differ across states and files. Parsers must handle each state separately.
- Candidate name format varies. Some states use Last, First. Others use First Last. Suffixes and middle names are inconsistent.
- Encoding. Most files are UTF-8. Some older files contain Latin-1 or Windows-1252 characters.
- Duplicates. Some repos contain both raw and cleaned versions of the same election. Ingest only one to avoid double-counting.
- Incomplete coverage. A state repo existing does not mean it has precinct-level data for all cycles.
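The encoding issue in the list above can be handled with a decode-and-fall-back step before CSV parsing. This is a sketch under an assumption: any file that is not valid UTF-8 is treated as Windows-1252, which also covers the printable Latin-1 range. The function name is hypothetical.

```python
def decode_csv_bytes(data: bytes) -> str:
    """Decode raw file bytes as UTF-8, falling back to Windows-1252."""
    try:
        # utf-8-sig also strips a leading byte-order mark if one is present.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 files are Windows-1252; undefined bytes are replaced.
        return data.decode("cp1252", errors="replace")
```

Trying UTF-8 first matters because Windows-1252 will decode almost any byte sequence without error, so the reverse order would silently garble valid UTF-8 names.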
Cross-source overlap
OpenElections FL overlaps with MEDSL FL for the 2022 general election. This overlap is useful for validation but has not been systematically compared at the same depth as the MEDSL–NC SBE comparison (640 contests, 90.5% vote match). The FL overlap is a planned validation target.
Value in the pipeline
OpenElections fills gaps where MEDSL coverage is thin or where vote mode breakdowns are available. Florida’s vote mode columns (election day, absentee, early voting) provide signal that MEDSL’s Florida file lacks. The community-curated nature means data may appear for states or cycles before MEDSL publishes its cleaned version.
The tradeoff is consistency: every state requires its own parser branch at L1.
Clarity/Scytl ENR
Clarity (now part of Scytl / CivicPlus) powers Election Night Reporting (ENR) websites for over 1,000 US jurisdictions — counties, cities, and some state-level election authorities. Each jurisdiction runs its own Clarity instance, publishing structured results in XML and JSON formats.
What Clarity provides
Clarity sites are the primary source for local race results that no other source captures: school board, city council, municipal judge, fire district commissioner, water board. Many jurisdictions publish precinct-level results with vote mode breakdowns (Election Day, early, absentee, provisional). Results appear on election night and typically remain available for weeks or months before being replaced by the next election cycle.
Data format
Results are distributed as XML inside ZIP archives. The XML follows a hierarchical structure:
| Element | Description |
|---|---|
| `<ElectionResult>` | Root element. Contains election metadata (name, date, jurisdiction). |
| `<Contest>` | One per race. Attributes include contest name, vote-for count, total ballots. |
| `<Choice>` | One per candidate or ballot measure option within a contest. Includes name, party, total votes. |
| `<VoteType>` | Breakdown by vote method within each choice. Election Day, absentee, early, provisional. |
| `<Precinct>` | Precinct-level results when the jurisdiction publishes at that granularity. |
A single detailxml.zip file for a medium-sized county (50 precincts, 30 contests) is typically 200 KB–2 MB uncompressed.
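The hierarchy in the table can be flattened into one row per contest, choice, vote type, and precinct. The sketch below uses the standard library's ElementTree on a hand-written sample. The element names come from the table; every attribute name (`text`, `name`, `votes`) and all sample values are assumptions for illustration, since no schema is published and real files should be inspected first.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; attribute names and values are illustrative assumptions.
SAMPLE_XML = """<ElectionResult>
  <Contest text="BOARD OF EDUCATION DISTRICT 02">
    <Choice text="Candidate A">
      <VoteType name="Election Day" votes="136">
        <Precinct name="P17" votes="136"/>
      </VoteType>
      <VoteType name="Early" votes="159">
        <Precinct name="P17" votes="159"/>
      </VoteType>
    </Choice>
  </Contest>
</ElectionResult>"""


def flatten_results(xml_text: str) -> list[dict]:
    """Walk Contest > Choice > VoteType > Precinct and emit one flat row each."""
    root = ET.fromstring(xml_text)
    rows = []
    for contest in root.iter("Contest"):
        for choice in contest.iter("Choice"):
            for vote_type in choice.iter("VoteType"):
                for precinct in vote_type.iter("Precinct"):
                    rows.append({
                        "contest": contest.get("text"),
                        "choice": choice.get("text"),
                        "vote_type": vote_type.get("name"),
                        "precinct": precinct.get("name"),
                        "votes": int(precinct.get("votes", "0")),
                    })
    return rows
```

On the sample this yields two rows, one per vote type, which is the long format the rest of the pipeline expects.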
URL structure
Clarity ENR sites follow a predictable URL pattern:
https://results.enr.clarityelections.com/{state}/{jurisdiction}/{electionID}/
The underlying data feeds are at:
| Endpoint | Content |
|---|---|
| reports/detailxml.zip | Full precinct-level XML results |
| json/en/summary.json | Lightweight JSON summary (no precinct detail) |
| Web02/en/summary.html | Human-readable results page |
Example for Wake County, NC:
https://results.enr.clarityelections.com/NC/Wake/115545/reports/detailxml.zip
The {electionID} is a numeric identifier assigned per election. It is not sequential and cannot be predicted.
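Given a known state, jurisdiction, and election ID, the feed URLs can be assembled from the pattern and endpoint table above. A minimal sketch; the function name and the endpoint keys are hypothetical, and the election ID still has to be discovered by other means.

```python
CLARITY_BASE = "https://results.enr.clarityelections.com"

# Paths from the endpoint table; the dictionary keys are hypothetical labels.
ENDPOINTS = {
    "detail_xml": "reports/detailxml.zip",
    "summary_json": "json/en/summary.json",
    "summary_html": "Web02/en/summary.html",
}


def clarity_url(state: str, jurisdiction: str, election_id: int,
                endpoint: str = "detail_xml") -> str:
    """Build a Clarity ENR feed URL for one jurisdiction and election."""
    return f"{CLARITY_BASE}/{state}/{jurisdiction}/{election_id}/{ENDPOINTS[endpoint]}"
```

`clarity_url("NC", "Wake", 115545)` reproduces the Wake County example.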
Coverage
- Jurisdictions: ~1,000+ counties and municipalities across ~30 states
- Election types: general, primary, runoff, special, municipal
- Granularity: precinct-level with vote type breakdowns (most jurisdictions)
- Temporal: current election cycle only; prior results are removed when new elections are configured
Why Clarity matters
Clarity is the highest-value source for local races that do not appear in MEDSL, OpenElections, or state portals. A county’s Clarity site may be the only machine-readable source for races like:
- School board (non-partisan, no state-level reporting)
- City council (municipal elections, often off-cycle)
- District court judge retention
- Bond referendums and local ballot measures
Key problems
URLs are unstable. The {electionID} changes every cycle. Old results are removed without redirect or archive. There is no central index of active Clarity instances. Discovery requires crawling county election office websites for links.
No published XML schema. The XML structure is consistent in practice but not formally specified. Minor variations exist across Clarity software versions. Field names and nesting can differ between jurisdictions.
Candidate names may embed party. Some jurisdictions format candidate names as John Smith (REP) rather than using a separate party field. This requires parsing at L1.
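A trailing party code can be split off with a narrow pattern. This is a heuristic sketch, not the planned L1 parser: it assumes the party appears last, in parentheses, as two to four uppercase letters. That leaves parenthesized nicknames and "(Write-In)" markers untouched but could misread an uppercase suffix such as "(III)". The function name is hypothetical.

```python
import re
from typing import Optional

# Assumption: party is a trailing parenthesized code of 2-4 uppercase letters.
TRAILING_PARTY = re.compile(r"^(?P<name>.*\S)\s*\((?P<party>[A-Z]{2,4})\)\s*$")


def split_embedded_party(raw: str) -> tuple[str, Optional[str]]:
    """Return (name, party) when a party code is embedded, else (name, None)."""
    match = TRAILING_PARTY.match(raw)
    if match:
        return match.group("name"), match.group("party")
    return raw.strip(), None
```

Applied to `John Smith (REP)` this returns the name and `REP`; a name without a trailing code comes back unchanged with `None`.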
Ephemeral availability. Results may disappear weeks after certification when the jurisdiction configures the site for the next election. L0 acquisition must happen promptly after each election.
Integration status
Clarity is not yet integrated in our pipeline. The source module (src/sources/clarity.rs) defines the XML schema and URL patterns but does not implement parsing or acquisition. Integration is blocked on building a jurisdiction discovery mechanism and a scheduled acquisition process that captures results before URLs expire.
When integrated, Clarity will feed into L0 as ZIP archives with XML contents, parsed at L1 into the unified schema. The hierarchical Contest → Choice → VoteType structure maps cleanly to our ContestKind model.
VEST — Voting and Election Science Team
The Voting and Election Science Team (VEST) publishes precinct-level election shapefiles for all 50 states. Each shapefile pairs precinct geographic boundaries with vote counts encoded as attribute columns. The data is archived on the Harvard Dataverse.
What VEST provides
VEST’s primary value is twofold: geographic precinct boundaries (polygons) and odd-year election coverage. No other source in our corpus provides precinct geometries, and MEDSL’s loaded data currently covers only even years.
We have downloaded VEST shapefiles for KY, LA, MS, and VA covering the 2015 and 2019 odd-year elections. These contain state-level races (governor, attorney general, state legislature) but not local races.
Data format
Each state-year dataset is a ZIP archive containing a standard ESRI shapefile bundle:
| File | Purpose |
|---|---|
| .shp | Geometry (precinct boundary polygons) |
| .dbf | Attribute table (vote counts, FIPS codes) |
| .shx | Shape index (offsets of each record in the .shp file) |
| .prj | Coordinate reference system definition |
| .cpg | Character encoding declaration |
Reading requires a spatial data library. In Python, geopandas.read_file() handles the full bundle. In Rust, the shapefile crate reads .shp/.dbf pairs.
Column encoding convention
VEST encodes election metadata into column names using a compact format:
{stage}{YY}{office}{party}{surname}
| Component | Values | Examples |
|---|---|---|
| Stage | G (general), P (primary), R (runoff) | G |
| Year | Two-digit year | 20, 19, 15 |
| Office | PRE (President), USS (US Senate), USH (US House), GOV (Governor), SOS (Sec. of State), ATG (Attorney General), LTG (Lt. Governor) | PRE |
| Party | R (Republican), D (Democrat), L (Libertarian), G (Green), O (Other) | R |
| Surname | Abbreviated (typically 3 chars) | TRU, BID |
Decoded examples
| Column | Stage | Year | Office | Party | Candidate |
|---|---|---|---|---|---|
G20PRERTRU | General | 2020 | President | Republican | Trump |
G20PREDBID | General | 2020 | President | Democrat | Biden |
| G19GOVDBES | General | 2019 | Governor | Democrat | Beshear (KY) |
| G15GOVDEDW | General | 2015 | Governor | Democrat | Edwards (LA) |
| G18GOVDABR | General | 2018 | Governor | Democrat | Abrams (GA) |
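The decoding shown in the table can be sketched as fixed-position slicing. The sketch assumes a three-character office code (as in the decoded examples) and that two-digit years fall in the 2000s; a two-character office code would need a different split. The function name is hypothetical.

```python
from typing import Optional

STAGES = {"G": "general", "P": "primary", "R": "runoff"}
PARTIES = {"R": "Republican", "D": "Democrat", "L": "Libertarian",
           "G": "Green", "O": "Other"}


def decode_vest_column(col: str) -> Optional[dict]:
    """Decode a column such as G20PRERTRU; return None for non-vote columns."""
    if (len(col) < 8 or col[0] not in STAGES
            or not col[1:3].isdigit() or col[6] not in PARTIES):
        return None
    return {
        "stage": STAGES[col[0]],
        "year": 2000 + int(col[1:3]),   # assumption: all years are 20YY
        "office": col[3:6],             # assumption: three-character office code
        "party": PARTIES[col[6]],
        "surname_abbrev": col[7:],
    }
```

Geographic columns such as STATEFP20 fail the stage or year check and return None, so the function can double as a filter when scanning the attribute table.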
Attribute table structure
The .dbf file contains both geographic identifiers and vote count columns:
| Column pattern | Description |
|---|---|
| STATEFP20 | 2-digit state FIPS code |
| COUNTYFP20 | 3-digit county FIPS code |
| VTDST20 | Voting tabulation district (precinct) code |
| NAME20 | Human-readable precinct name |
| ALAND20 | Land area in square meters |
| AWATER20 | Water area in square meters |
| G20PRE* | Vote count columns (one per candidate) |
Vote values are raw integer counts. Each row is one precinct.
dBASE column name truncation
The .dbf format (dBASE III) limits column names to 10 characters. This truncation creates ambiguity:
- G20USSRPER could be Perdue or Perry
- G20USHDWIL could be Williams, Wilson, or Wilkins
VEST documentation files (included in each ZIP) provide a column-to-candidate mapping. These must be consulted to resolve truncated names.
Coverage in our pipeline
| State | Year | Races | Scope |
|---|---|---|---|
| KY | 2019 | Governor, AG, SOS, state legislature | State-level only |
| LA | 2015, 2019 | Governor, state legislature | State-level only |
| MS | 2019 | Governor, AG, state legislature | State-level only |
| VA | 2015, 2019 | Governor, state legislature | State-level only |
These four states hold odd-year elections, which MEDSL has on Dataverse but which we have not yet loaded from that source. VEST fills the gap for state-level races in these cycles.
Limitations
No local races. VEST encodes statewide and federal contests only. County commissioner, school board, sheriff, and other local offices are not present. For local race coverage, use MEDSL or state-specific sources.
Large file sizes. Individual state shapefiles range from 50 MB to 500+ MB. The geometry data dominates file size; vote counts are a small fraction.
Precinct boundary instability. Redistricting changes precinct boundaries between election cycles. A precinct polygon from 2020 may not correspond to the same geographic area in 2022. Cross-year geographic comparisons require spatial intersection, not ID matching.
Requires spatial tooling. Unlike CSV sources that can be read with any text processor, shapefiles require geopandas (Python) or the shapefile crate (Rust). This adds a dependency that other sources do not.
Usage in the pipeline
VEST data enters at L0 as the raw shapefile ZIP. At L1, the column encoding is decoded to extract year, office, party, and candidate surname. Vote counts are pivoted from wide format (one column per candidate) to long format (one row per candidate per precinct) to match the unified schema.
The geographic boundaries are preserved as sidecar geometry files but are not embedded into the JSONL record stream. They are available for spatial joins and map rendering but are not part of the core election result schema.
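The wide-to-long pivot described above can be sketched on plain dictionaries, one per precinct row of the attribute table. The function name and the sample vote counts are hypothetical; the column-name pattern assumes the ten-character encoding with a three-character office code.

```python
import re

# Matches encoded vote columns such as G20PRERTRU; geographic columns do not match.
VOTE_COLUMN = re.compile(r"^[GPR]\d{2}[A-Z]{3}[A-Z][A-Z]{1,3}$")


def wide_to_long(precinct_rows: list[dict], id_field: str = "NAME20") -> list[dict]:
    """Turn one-row-per-precinct records into one row per precinct and candidate."""
    long_rows = []
    for row in precinct_rows:
        for column, value in row.items():
            if VOTE_COLUMN.match(column):
                long_rows.append({
                    "precinct": row[id_field],
                    "column": column,
                    "votes": int(value),
                })
    return long_rows


# Hypothetical counts, for illustration only.
SAMPLE_ROWS = [
    {"NAME20": "Precinct A", "STATEFP20": "21", "G20PRERTRU": 120, "G20PREDBID": 95},
    {"NAME20": "Precinct B", "STATEFP20": "21", "G20PRERTRU": 80, "G20PREDBID": 110},
]
```

Two precincts with two vote columns each produce four long rows. Geometry is deliberately left out, matching the sidecar approach described above.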
Download
VEST datasets are available from the Harvard Dataverse. Each state-year combination has its own DOI. Example for Kentucky 2019:
```bash
mkdir -p local-data/sources/vest/ky/2019
curl -L -o local-data/sources/vest/ky/2019/ky_2019.zip \
  "https://dataverse.harvard.edu/api/access/dataset/:persistentId/?persistentId=doi:10.7910/DVN/XXXXXX"
unzip local-data/sources/vest/ky/2019/ky_2019.zip -d local-data/sources/vest/ky/2019/
```
Consult the VEST precinct data page for current DOIs. File IDs change when datasets are updated.
Census Bureau FIPS Reference Files
The US Census Bureau publishes authoritative FIPS (Federal Information Processing Standards) code files that provide the canonical mapping from numeric codes to geographic entity names. These files are the ground truth for geographic identifiers across the pipeline.
What it provides
| File | Entity type | Record count | Key columns |
|---|---|---|---|
| state.txt | States + DC + territories | 57 | STATE, STATE_NAME |
| national_county2020.txt | Counties + equivalents | 3,143 | STATEFP, COUNTYFP, COUNTYNAME |
| national_place2020.txt | Incorporated places + CDPs | 31,980 | STATEFP, PLACEFP, PLACENAME |
| national_cousub2020.txt | County subdivisions | ~36,000 | STATEFP, COUNTYFP, COUSUBFP, COUSUBNAME |
Format
All files are pipe-delimited (|) plain text with a header row. Encoding is ASCII. Example from the county file:
```text
NC|37|037|1026339|Chatham County|H1|A
NC|37|063|1008557|Durham County|H1|A
NC|37|183|1008586|Wake County|H1|A
```
Columns in the county file:
| Column | Description |
|---|---|
| STATE | Two-letter postal abbreviation |
| STATEFP | Two-digit state FIPS code |
| COUNTYFP | Three-digit county FIPS code |
| COUNTYNS | ANSI feature code |
| COUNTYNAME | Full county name including “County” suffix |
| CLASSFP | FIPS class code (H1 = active county or equivalent, H4 = legally defined but inactive or nonfunctioning, H6 = consolidated with a coextensive incorporated place) |
| FUNCSTAT | Functional status (A = active) |
The five-digit county FIPS used throughout the pipeline is STATEFP + COUNTYFP (e.g., 37 + 183 = 37183 for Wake County, NC).
Download
https://www2.census.gov/geo/docs/reference/state.txt
https://www2.census.gov/geo/docs/reference/codes2020/national_county2020.txt
https://www2.census.gov/geo/docs/reference/codes2020/national_place2020.txt
https://www2.census.gov/geo/docs/reference/codes2020/national_cousub2020.txt
No API key required. Files are small (under 5 MB total) and rarely change.
Usage in the pipeline
Census FIPS files are consumed at L1 for geographic enrichment. When a source record contains a county name but no FIPS code (common in OpenElections and Clarity data), the pipeline joins against the county file to assign the canonical five-digit FIPS. When a source provides a FIPS code but no name, the lookup runs in reverse.
The place file enables resolution of municipal names to FIPS codes — relevant for city council, mayoral, and municipal utility district contests where the jurisdiction is a place, not a county.
FIPS codes serve as the primary geographic join key across all seven data sources. Without them, matching “Wake County” in MEDSL to “WAKE” in NC SBE to “Wake Co.” in OpenElections would require fuzzy string matching. With them, it is an exact join on a five-digit code. Store the code as a string rather than an integer, because state codes such as Alabama’s 01 begin with a zero.
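The enrichment step can be sketched as a lookup keyed on state and a normalized county name. The sample lines are the three shown earlier, with the header row omitted as in that excerpt. The function names and the normalization rule (strip a trailing "County" or "Co.") are assumptions, and they do not cover parishes, boroughs, or independent cities.

```python
import re

COUNTY_LINES = """NC|37|037|1026339|Chatham County|H1|A
NC|37|063|1008557|Durham County|H1|A
NC|37|183|1008586|Wake County|H1|A"""


def normalize_county(name: str) -> str:
    """Uppercase and drop a trailing 'County' or 'Co.' so source variants align."""
    key = " ".join(name.upper().split())
    return re.sub(r"\s+(COUNTY|CO\.?)$", "", key)


def build_county_lookup(lines: str) -> dict:
    """Map (state, normalized county name) to the five-digit FIPS string."""
    lookup = {}
    for line in lines.splitlines():
        state, statefp, countyfp, _ns, name, _classfp, _funcstat = line.split("|")
        # String concatenation, not addition: "37" + "183" gives "37183".
        lookup[(state, normalize_county(name))] = statefp + countyfp
    return lookup


LOOKUP = build_county_lookup(COUNTY_LINES)
```

With this table, "Wake County", "WAKE", and "Wake Co." all resolve to the same key. When reading the real file, skip its header row before building the lookup.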
FEC — Federal Election Commission Candidate Master Files
The FEC publishes bulk data files for every registered federal candidate: President, US Senate, and US House. These files provide stable candidate identifiers (CAND_ID) that persist across election cycles, making them a reference source for cross-linking federal candidates across MEDSL, NC SBE, and OpenElections data.
What FEC provides
The candidate master file (cn.txt) contains one row per candidate per election cycle. It covers all candidates who have filed with the FEC, including those who lost primaries or never appeared on a general election ballot.
Available cycles: 1980–present. We have downloaded 2020 and 2022.
Download
Bulk data is at fec.gov/data/browse-data.
```bash
mkdir -p local-data/sources/fec/{2020,2022}

# 2022
curl -L -o local-data/sources/fec/2022/cn.zip \
  "https://www.fec.gov/files/bulk-downloads/2022/cn.zip"
unzip -o local-data/sources/fec/2022/cn.zip -d local-data/sources/fec/2022/

# 2020
curl -L -o local-data/sources/fec/2020/cn.zip \
  "https://www.fec.gov/files/bulk-downloads/2020/cn.zip"
unzip -o local-data/sources/fec/2020/cn.zip -d local-data/sources/fec/2020/
```
Schema
The file cn.txt is pipe-delimited (|) with 15 columns and no header row.
| # | Column | Description | Example |
|---|---|---|---|
| 1 | CAND_ID | Stable candidate identifier | H0NC09072 |
| 2 | CAND_NAME | Name in LAST, FIRST MIDDLE format | BRAY, SHANNON W |
| 3 | CAND_PTY_AFFILIATION | Party code | LIB |
| 4 | CAND_ELECTION_YR | Election year | 2022 |
| 5 | CAND_OFFICE_ST | State (2-letter postal code) | NC |
| 6 | CAND_OFFICE | Office: H / S / P | H |
| 7 | CAND_OFFICE_DISTRICT | Congressional district (00 for Senate/President) | 09 |
| 8 | CAND_ICI | Incumbent/Challenger/Open: I/C/O | C |
| 9 | CAND_STATUS | Status code (C = statutory candidate, F = statutory candidate for a future election, N = not yet a statutory candidate, P = statutory candidate in a prior cycle) | C |
| 10 | CAND_PCC | Principal campaign committee ID | C00654321 |
| 11 | CAND_ST1 | Mailing address street | |
| 12 | CAND_ST2 | Mailing address street 2 | |
| 13 | CAND_CITY | Mailing address city | |
| 14 | CAND_ST | Mailing address state | |
| 15 | CAND_ZIP | Mailing address ZIP | |
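Because cn.txt has no header row, the column names have to be supplied by the reader. The sketch below pairs each field with the names from the table and splits CAND_NAME into parts. The function names are hypothetical, the sample line is assembled from the Example column, and suffix handling is left out.

```python
FEC_COLUMNS = [
    "CAND_ID", "CAND_NAME", "CAND_PTY_AFFILIATION", "CAND_ELECTION_YR",
    "CAND_OFFICE_ST", "CAND_OFFICE", "CAND_OFFICE_DISTRICT", "CAND_ICI",
    "CAND_STATUS", "CAND_PCC", "CAND_ST1", "CAND_ST2", "CAND_CITY",
    "CAND_ST", "CAND_ZIP",
]


def parse_cn_line(line: str) -> dict:
    """Split one pipe-delimited cn.txt line into a dict keyed by column name."""
    fields = line.rstrip("\r\n").split("|")
    if len(fields) != len(FEC_COLUMNS):
        raise ValueError(f"expected {len(FEC_COLUMNS)} fields, got {len(fields)}")
    return dict(zip(FEC_COLUMNS, fields))


def split_fec_name(cand_name: str) -> dict:
    """Split 'LAST, FIRST MIDDLE' into parts (suffixes are not handled here)."""
    last, _, rest = cand_name.partition(",")
    given = rest.split()
    return {
        "last": last.strip(),
        "first": given[0] if given else None,
        "middle": " ".join(given[1:]) or None,
    }


# Assembled from the Example column above; address fields left empty.
SAMPLE_LINE = "H0NC09072|BRAY, SHANNON W|LIB|2022|NC|H|09|C|C|C00654321|||||"
```

Checking the field count up front catches truncated or malformed lines before they shift values into the wrong columns.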
Usage in the pipeline
FEC data serves two purposes:
- Stable identifiers. CAND_ID persists across election cycles. A candidate who runs for the same seat in 2020 and 2022 keeps the same ID. This provides a ground-truth link for validating temporal chains built by the L4 layer.
- Name cross-referencing. CAND_NAME is parsed at L1 into last, first, middle, and suffix components. These parsed names are compared against MEDSL and state source names during L3 entity resolution. FEC uses LAST, FIRST MIDDLE format consistently, which makes it one of the more predictable sources for name parsing.
Limitations
- Federal candidates only. No state legislators, no county commissioners, no school board members. FEC has no jurisdiction over non-federal offices.
- Filing ≠ appearing on ballot. Many `CAND_ID` entries correspond to candidates who filed paperwork but never appeared on a general election ballot.
- Party codes differ from other sources. FEC uses codes like `LIB`, `GRE`, `NNE` (None) that do not match MEDSL’s `LIBERTARIAN`, `GREEN`, `NONPARTISAN` labels. Normalization is required at L1.
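The party-code normalization can be sketched as a lookup table. This is a minimal illustration, not the pipeline's actual table: only the three codes named above are included, the pairing of each code with a MEDSL label is inferred from the order in which they are listed, and `normalize_party` is a hypothetical name.

```python
# Only the three codes named in the text; a production table would need
# every party code that occurs in cn.txt.
FEC_TO_MEDSL_PARTY = {
    "LIB": "LIBERTARIAN",
    "GRE": "GREEN",
    "NNE": "NONPARTISAN",  # assumed pairing: FEC "None" to MEDSL "NONPARTISAN"
}

def normalize_party(fec_code):
    """Return the MEDSL-style label for an FEC party code, or None if unmapped."""
    if fec_code is None:
        return None
    return FEC_TO_MEDSL_PARTY.get(fec_code.strip().upper())

print(normalize_party("LIB"))
```

Returning `None` for unmapped codes, rather than passing the raw code through, keeps unknown parties visible for review.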
Future Sources
This chapter documents data sources that have been identified as valuable but are not yet integrated into the pipeline. Each is blocked by a specific access, cost, or engineering constraint.
ALGED — Annual Local Government Election Data
The Annual Local Government Election Data project, hosted on the Open Science Framework (OSF), covers municipal elections in 1,747 cities with populations over 25,000. Records include candidate demographics (race, gender), incumbency status, and election outcomes — fields that no other source in our corpus provides.
Coverage: Municipal elections from 2000–2020. Cities only (no counties, no school districts). Focuses on mayoral and city council races.
Format: CSV files organized by city population tier.
Status: Blocked. The OSF repository requires an approved access request. We submitted a request in early 2025 and have not received a response. The underlying data appears to be derived from individual city clerk records, manually curated by the research team.
Value if integrated: ALGED would fill the demographic gap entirely. No other source provides candidate race or gender. It would also provide an independent validation source for municipal races in the 1,747 covered cities.
Ballotpedia
Ballotpedia maintains the most comprehensive database of US local elections, covering school boards, city councils, county commissions, judges, ballot measures, and special districts across all 50 states. Their coverage extends to races that no other source tracks — mosquito abatement districts, water boards, and transit authorities.
Coverage: All 50 states, all levels of government, ongoing since approximately 2007.
Format: Structured database accessible via a paid API. Some data is available on the public website but is not bulk-downloadable.
Status: Blocked by cost. The API requires a commercial license. Pricing is not publicly listed but is reported to be in the five-figure annual range. We have not pursued a license.
Value if integrated: Ballotpedia would be the single largest improvement to local race coverage. It would fill the 7-state local race gap in MEDSL 2022 and provide office-level metadata (term length, salary, appointing authority) that no source currently offers.
AP Elections API
The Associated Press Elections API provides real-time and certified results for federal and state races, with some local coverage in larger jurisdictions. It is the standard data feed used by newsrooms on election night.
Coverage: Federal and statewide races nationwide. County-level results for most races. Precinct-level for some states. Local race coverage varies.
Format: JSON API with WebSocket push for live updates.
Status: Blocked by cost. The AP API is a commercial product priced for newsroom budgets. It is not available for academic or open-source use without a contract. The real-time capability is irrelevant to our pipeline (we process certified results, not live feeds), but the certified result snapshots would be a valuable validation source.
Value if integrated: AP results would serve as a third independent source for federal and statewide races, enabling three-way cross-source validation alongside MEDSL and state portals. AP’s candidate identifiers are stable across cycles, which would simplify temporal chaining for federal candidates.
Additional State Portals
Six states with significant populations publish results through their own election portals in structured formats, several of them at the precinct level. These would complement MEDSL by providing certified results directly from the state authority.
Florida: The Division of Elections publishes precinct-level results at results.elections.myflorida.com. CSV format. All counties, all contests. Would overlap with both MEDSL and OpenElections FL, enabling three-source validation for one state.
Georgia: The Secretary of State publishes results at results.enr.clarityelections.com/Georgia/ (Clarity-based) and via a separate certified results portal. XML and CSV. Would provide a second source for GA alongside MEDSL.
Texas: The Secretary of State publishes county-level results (not precinct-level) at elections.sos.state.tx.us. Precinct-level results are published by individual counties. A full TX integration would require crawling 254 county websites or using the Clarity instances that many TX counties operate.
Ohio: The Secretary of State publishes precinct-level results at www.ohiosos.gov/elections/election-results-and-data/. CSV format. Covers all contests including local races.
Pennsylvania: The Department of State publishes results at electionreturns.pa.gov. JSON API available. Covers all contests. Would fill one of the 7 states with zero local data in MEDSL 2022.
Michigan: The Secretary of State publishes precinct-level results at miboecfr.nictusa.com/cgi-bin/cfr/. Older web interface with downloadable files. Covers all contests.
Status: Not blocked by access — all six portals are public. Blocked by engineering time. Each state portal has its own format, URL structure, and quirks. We estimate 1–2 weeks of parser development per state. These are the highest-priority engineering tasks after odd-year MEDSL loading.
Name Normalization
Election data arrives with candidate names in dozens of formats. MEDSL uses LAST, FIRST MIDDLE in all caps. NC SBE uses First Last in title case. OpenElections uses whatever the county clerk typed. FEC uses LAST, FIRST MIDDLE SUFFIX. A single candidate can appear as:
- `CRIST, CHARLES JOSEPH` (MEDSL)
- `Charlie Crist` (OpenElections)
- `Crist, Charlie` (FEC)
These are all the same person. A system that treats them as three different candidates produces garbage output. A system that aggressively normalizes them — stripping middle names, collapsing nicknames, removing suffixes — destroys the signal needed to tell different people apart.
The principle: clean without collapsing.
Name decomposition at L1
Every candidate name is decomposed at L1 into six components:
| Component | Purpose | Example |
|---|---|---|
| `raw` | Original string, unmodified | CRIST, CHARLES JOSEPH |
| `first` | Parsed first name | CHARLES |
| `middle` | Middle name or initial | JOSEPH |
| `last` | Last name | CRIST |
| `suffix` | Generational suffix | null |
| `canonical_first` | Dictionary-normalized first name | CHARLES |
The canonical_first field is populated by the nickname dictionary. If the raw first name is Charlie, canonical_first becomes Charles. If no mapping exists, canonical_first equals first.
Both first and canonical_first are preserved. The raw nickname is useful signal — it tells you what the candidate goes by. The canonical form is what enables matching.
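The decomposition can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline's parser: it handles only the `LAST, FIRST MIDDLE` and `First Middle Last Suffix` layouts, uses a three-entry excerpt of the nickname dictionary, and the names `decompose` and `match_case` are hypothetical.

```python
SUFFIXES = {"JR": "Jr", "SR": "Sr", "II": "II", "III": "III", "IV": "IV", "V": "V"}
NICKNAMES = {"CHARLIE": "CHARLES", "RON": "RONALD", "NIKKI": "NICOLE"}  # small excerpt

def match_case(template, word):
    """Return word cased like template: all caps stays all caps, else title case."""
    return word.upper() if template.isupper() else word.title()

def decompose(raw):
    """Split a raw candidate name into the six L1 components."""
    if "," in raw:
        # Layout: "LAST, FIRST MIDDLE [SUFFIX]"
        last, rest = (part.strip() for part in raw.split(",", 1))
        tokens = rest.split()
    else:
        # Layout: "First [Middle] Last [Suffix]"
        last, tokens = None, raw.split()
    suffix = None
    # Known ambiguity: a trailing "V" could be a middle initial, not a suffix.
    if tokens and tokens[-1].rstrip(".").upper() in SUFFIXES:
        suffix = SUFFIXES[tokens.pop().rstrip(".").upper()]
    if last is None:
        last = tokens.pop() if len(tokens) > 1 else ""
    first = tokens[0] if tokens else ""
    middle = " ".join(tokens[1:]) or None
    canonical = NICKNAMES.get(first.upper())
    canonical_first = match_case(first, canonical) if canonical else first
    return {"raw": raw, "first": first, "middle": middle, "last": last,
            "suffix": suffix, "canonical_first": canonical_first}

print(decompose("CRIST, CHARLES JOSEPH"))
print(decompose("Charlie Crist"))
```

The raw string is carried through untouched, so any later parser fix can be re-run against the original input.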
Real decomposition examples
Five name pairs from our prototype, showing how names from different sources decompose. The first three pairs are the same person in two formats; the last two pairs are different people:
| Source | Raw Name | first | middle | last | suffix | canonical_first |
|---|---|---|---|---|---|---|
| MEDSL | DESANTIS, RON | RON | null | DESANTIS | null | RONALD |
| OpenElections | Ron DeSantis | Ron | null | DeSantis | null | Ronald |
| MEDSL | CRIST, CHARLES JOSEPH | CHARLES | JOSEPH | CRIST | null | CHARLES |
| OpenElections | Charlie Crist | Charlie | null | Crist | null | Charles |
| MEDSL | DEMINGS, VAL BUTLER | VAL | BUTLER | DEMINGS | null | VALDEZ |
| NC SBE | Val Demings | Val | null | Demings | null | Valdez |
| MEDSL | WILLIAMS, ROBERT | ROBERT | null | WILLIAMS | null | ROBERT |
| NC SBE | Robert Williams Jr | Robert | null | Williams | Jr | Robert |
| MEDSL | MARSHALL, DAVID S | DAVID | S | MARSHALL | null | DAVID |
| MEDSL | MARSHALL, DAVID A | DAVID | A | MARSHALL | null | DAVID |
Key observations from these examples:
- Ron DeSantis: `Ron` maps to `Ronald` via the nickname dictionary. The embedding score between the two source representations is 0.729, below any reasonable auto-accept threshold, but the LLM matches them using nickname knowledge.
- Charlie Crist: `Charlie` maps to `Charles`. The embedding score is 0.451. Without the dictionary, the cascade would need the LLM to know that Charlie is a nickname for Charles. With the dictionary, the canonical forms already match.
- Robert Williams vs Robert Williams Jr: The suffix `Jr` is the only distinguishing feature. These are different people. The embedding scores them at 0.862, dangerously close to a false positive. See Suffixes.
- David S Marshall vs David A Marshall: Different middle initials. David S. Marshall ran in Maine; David A. Marshall ran in Florida. The middle initial is the only signal distinguishing them at the name level. See Nicknames and Middle Initials.
What decomposition enables
With names decomposed into components, downstream layers can:
- Exact-match on structured fields: `(canonical_first="Timothy", last="Lance", suffix=null)` matches across precincts without fuzzy logic. This handles 70% of entity resolution.
- Build composite strings for embedding: `"{canonical_first} {middle} {last} {suffix} | {party} | {office} | {state}"` includes middle initials and suffixes as disambiguation signal.
- Provide structured context to the LLM: Instead of asking “are these the same person?”, the LLM sees parsed components and can reason about specific differences (nickname vs. different name, Jr vs. no suffix).
- Block efficiently: Group by `(state, last_name_initial)` for entity resolution without computing all-pairs similarity.
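The blocking step in the last bullet can be sketched as below. This is an illustration only: the record layout and the function names `block_key` and `candidate_pairs` are assumptions, and the four records are taken from examples elsewhere in this chapter.

```python
from collections import defaultdict
from itertools import combinations

def block_key(record):
    """Blocking key from the text: (state, first letter of the last name)."""
    return (record["state"], record["last"][:1].upper())

def candidate_pairs(records):
    """Yield only the pairs that share a block, instead of all pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)

records = [
    {"state": "FL", "last": "Crist", "raw": "Charlie Crist"},
    {"state": "FL", "last": "CRIST", "raw": "CRIST, CHARLES JOSEPH"},
    {"state": "FL", "last": "DeSantis", "raw": "Ron DeSantis"},
    {"state": "ME", "last": "Marshall", "raw": "David S Marshall"},
]
pairs = list(candidate_pairs(records))
print(len(pairs), "pair(s) to compare instead of", len(records) * (len(records) - 1) // 2)
```

With four records, all-pairs comparison would examine six pairs; blocking leaves only the two Crist records to compare.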
What goes wrong without decomposition
If you treat names as opaque strings:
- `CRIST, CHARLES JOSEPH` and `Charlie Crist` have a Jaro-Winkler similarity of 0.58, a miss.
- `DESANTIS, RON` and `Ron DeSantis` have a cosine embedding similarity of 0.729, in the ambiguous zone.
- `Robert Williams` and `Robert Williams Jr` look nearly identical to every string metric. Only structured suffix detection prevents a false merge.
- `David S Marshall` and `David A Marshall` differ by one character in a middle initial that opaque matching may ignore entirely.
Decomposition is not optional. It is the foundation that every subsequent layer depends on.
The three sub-problems
Name normalization breaks into three sub-problems, each with its own chapter:
- Nicknames and Middle Initials. How `Charlie` becomes `Charles` and why `David S.` must stay distinct from `David A.`
- Suffixes: Jr/Sr Means Different People. Why generational suffixes are disambiguation signals, not noise to be stripped.
- The Nickname Dictionary. The lookup table that powers `canonical_first`, its current scope, and its limits.
Nicknames and Middle Initials
Two distinct problems share a root cause: the candidate’s legal name differs from the name on the ballot or in the source file. Nicknames substitute one first name for another. Middle initials appear in some sources and not others. Both must be handled at L1 to preserve signal for L2 and L3.
Nicknames
A nickname replaces the candidate’s legal first name with a familiar variant. The embedding model has no reliable way to recover the connection — it encodes character-level and token-level similarity, not social knowledge about naming conventions.
Real test results from our prototype, using text-embedding-3-large (3,072 dimensions):
| Source A | Source B | Nickname → Legal | Cosine | LLM Decision | LLM Confidence |
|---|---|---|---|---|---|
| Charlie Crist | CRIST, CHARLES JOSEPH | Charlie → Charles | 0.451 | match | 0.95 |
| Nicole Fried | FRIED, NIKKI | Nikki → Nicole | 0.642 | match | 0.92 |
| Ron DeSantis | DESANTIS, RON | Ron → Ronald | 0.729 | match | 0.98 |
The Crist result is the critical case. At 0.451, the embedding score falls below any plausible auto-accept threshold — and below many reject thresholds. Without nickname resolution, this pair would be missed entirely or routed to LLM on every encounter.
The fix operates at L1. The nickname dictionary maps Charlie → Charles, Nikki → Nicole, Ron → Ronald, and ~100 other mappings. When the L1 parser decomposes a name, it checks the first name against the dictionary and populates canonical_first:
{
"raw": "Charlie Crist",
"first": "Charlie",
"middle": null,
"last": "Crist",
"suffix": null,
"canonical_first": "Charles"
}
Both first and canonical_first are preserved. The original is kept for display and provenance. The canonical form is used in the L2 composite string for embedding and in the L3 exact-match step. After dictionary application, the L3 exact matcher sees (canonical_first="Charles", last="Crist", suffix=null) on both sides — an exact match with no embedding or LLM call required.
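A short sketch of the exact-match key described above. Comparing case-insensitively is an assumption made here so that title-case and all-caps sources line up; `exact_key` is a hypothetical name, and the records are hand-built from the examples in this chapter.

```python
def exact_key(record):
    """L3 exact-match key: (canonical_first, last, suffix)."""
    return (
        record["canonical_first"].casefold(),  # assumption: case-insensitive compare
        record["last"].casefold(),
        record["suffix"],  # None and "Jr" must stay distinct
    )

openelections = {"first": "Charlie", "canonical_first": "Charles",
                 "last": "Crist", "suffix": None}
medsl = {"first": "CHARLES", "canonical_first": "CHARLES",
         "last": "CRIST", "suffix": None}
williams = {"first": "Robert", "canonical_first": "Robert",
            "last": "Williams", "suffix": None}
williams_jr = dict(williams, suffix="Jr")

same_crist = exact_key(openelections) == exact_key(medsl)
same_williams = exact_key(williams) == exact_key(williams_jr)
print(same_crist, same_williams)
```

The Crist records agree on the key even though their `first` fields differ, while the two Williams records do not, because the suffix is part of the key.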
Why the embedding model fails on nicknames
Charlie and Charles share a prefix, but the embedding model must also reconcile Charlie Crist with CRIST, CHARLES JOSEPH: different casing, different ordering, and a middle name that appears in one source but not the other. The model embeds the full composite string, not individual tokens. The combined divergence pushes the cosine score to 0.451.
The DeSantis pair scores higher (0.729) because the surface forms are more similar: both sources give the first name as Ron and neither adds a middle name, so only casing and ordering differ. But 0.729 is still in the ambiguous zone and requires an LLM call to confirm.
The nickname dictionary eliminates these LLM calls for known mappings. At scale, this matters: if 5% of candidates use nicknames and each requires an LLM call, that is tens of thousands of unnecessary API round-trips.
Middle Initials
Middle initials are a different problem. They do not substitute one name for another — they add or remove a disambiguation signal.
The key case: David S. Marshall (Maine) and David A. Marshall (Florida) are different people. Without middle initials, both reduce to David Marshall. With middle initials preserved, L2 generates different embedding vectors.
We measured the effect directly:
| Variant | ME composite | FL composite | Cosine (ME vs FL) |
|---|---|---|---|
| No middle | David Marshall \| ME | David Marshall \| FL | 0.7025 |
| With middle | David S Marshall \| ME | David A Marshall \| FL | 0.6448 |
The middle initial drops the cosine score by 0.058 — enough to shift the pair further from the accept threshold and closer to correct rejection. The principle: middle initials are signal, not noise.
More middle-initial test results from our prototype:
| Source A | Source B | Cosine | LLM Decision | Key Signal |
|---|---|---|---|---|
| Ashley Moody | Ashley B. Moody | 0.930 | match | Same person, middle added |
| Val Demings | VAL DEMINGS | 0.828 | match | Same person, format difference |
| Dale Holness | DALE V.C. HOLNESS | 0.896 | match | Same person, middle initials added |
Ashley Moody at 0.930 is the same person: the B. appears in one source but not the other. The score sits just below the 0.95 auto-accept threshold, so embedding similarity alone does not accept the pair; same-state context and a Jaro-Winkler (JW) score of 1.0 on the last name push it through.
How Both Feed Into L2
The L2 composite string for a candidate includes both canonical_first and middle:
{canonical_first} {middle} {last} {suffix} | {party} | {office} | {state} | {county}
For Charlie Crist, this becomes:
Charles Crist | DEM | Governor | FL | statewide
For CRIST, CHARLES JOSEPH, this becomes:
Charles Joseph Crist | DEM | Governor | FL | statewide
The canonical first names now match. The remaining divergence — Joseph as a middle name in one source — is small enough that the embedding score rises well above the ambiguous zone. The nickname dictionary at L1 did the heavy lifting; L2 and L3 finish the job.
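The two composites above can be reproduced with a small builder. A minimal sketch, assuming null name parts are simply dropped so that no double spaces appear; `composite` and the record layout are illustrative names.

```python
def composite(record):
    """Build the L2 composite string, skipping name parts that are null."""
    name_parts = [record.get("canonical_first"), record.get("middle"),
                  record.get("last"), record.get("suffix")]
    name = " ".join(part for part in name_parts if part)
    context = [record["party"], record["office"], record["state"], record["county"]]
    return " | ".join([name] + context)

crist_oe = {"canonical_first": "Charles", "middle": None, "last": "Crist",
            "suffix": None, "party": "DEM", "office": "Governor",
            "state": "FL", "county": "statewide"}
crist_medsl = dict(crist_oe, middle="Joseph")

print(composite(crist_oe))     # Charles Crist | DEM | Governor | FL | statewide
print(composite(crist_medsl))  # Charles Joseph Crist | DEM | Governor | FL | statewide
```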
The Combined Rule
- At L1, apply the nickname dictionary to populate `canonical_first`.
- At L1, preserve `middle` exactly as parsed: do not strip it, do not normalize it.
- At L2, include both `canonical_first` and `middle` in the composite string.
- At L3 exact match, match on `(canonical_first, last, suffix)`; middle is not required for exact match but is used for disambiguation when multiple candidates share the same canonical first and last name.
- At L3 LLM confirmation, provide both the raw and canonical names so the model can reason about nickname relationships and middle-initial differences.
The principle behind both: clean without collapsing. Normalize what you can (nicknames to canonical forms), preserve what you must (middle initials as disambiguation signal), and let downstream layers use the full context.
Suffixes: Jr/Sr Means Different People
A suffix is not decoration. Robert Williams and Robert Williams Jr. are two different human beings — father and son, or grandfather and grandson. Any system that strips suffixes before matching will merge them into one entity.
This is not a hypothetical risk. It happened in our prototype.
The Williams Problem
In Florida 2022, two candidates share the name Robert Williams:
| Source | Raw Name | Office | Votes |
|---|---|---|---|
| MEDSL | ROBERT WILLIAMS | State House District 41 | 12,304 |
| OpenElections | Robert Williams Jr | State House District 41 | 12,304 |
The embedding similarity between these two strings is 0.862 — well above our original auto-accept threshold of 0.82. Without suffix-aware logic, the cascade would have merged them automatically, no LLM consulted.
The LLM caught it. Given the full context, Claude Sonnet returned:
Decision: no_match (confidence: 0.85)
“The suffix ‘Jr’ indicates a generational distinction. ‘Robert Williams’ and ‘Robert Williams Jr’ are most likely two different individuals — typically father and son. While the office and vote counts match, the suffix is a strong signal that these are distinct people. The MEDSL record may have dropped the suffix during their cleaning process. I recommend treating these as separate entities until a human reviewer can confirm.”
This single case changed our threshold design.
How This Changed the Thresholds
Before Williams Jr., auto-accept was ≥ 0.82. A score of 0.862 would have been accepted without review. After this finding, we raised auto-accept to ≥ 0.95, ensuring that suffix-bearing pairs always enter the LLM zone (0.35–0.95) where the model can reason about generational distinctions.
| Threshold | Before | After | Reason |
|---|---|---|---|
| Auto-accept | ≥ 0.82 | ≥ 0.95 | Williams Jr at 0.862 was a false positive |
| Ambiguous (LLM zone) | 0.65–0.82 | 0.35–0.95 | Wider zone catches more edge cases |
| Auto-reject | < 0.65 | < 0.35 | Crist at 0.451 was a false negative |
The wider ambiguous zone sends more pairs to the LLM. Budget is not a constraint — accuracy is.
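The effect of the threshold change on the two motivating cases can be checked with a few lines. This is a sketch of score-only routing; the function name `route` and the decision labels are illustrative, and the thresholds come from the table above.

```python
AUTO_ACCEPT = 0.95  # "After" column
AUTO_REJECT = 0.35

def route(cosine, accept=AUTO_ACCEPT, reject=AUTO_REJECT):
    """Map an embedding similarity to a cascade decision."""
    if cosine >= accept:
        return "auto_accept"
    if cosine < reject:
        return "auto_reject"
    return "llm"

for label, score in [("Williams / Williams Jr", 0.862), ("Crist", 0.451)]:
    before = route(score, accept=0.82, reject=0.65)
    after = route(score)
    print(f"{label}: before={before}, after={after}")
```

Under the old thresholds the Williams pair is auto-accepted and the Crist pair is auto-rejected; under the new ones both go to the LLM.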
Suffix-Aware Logic in the Cascade
Suffixes receive special treatment at multiple stages:
L1 — Decomposition. The name parser extracts Jr, Sr, II, III, IV, V, Esq, and PhD into the suffix field. Both Jr. and Jr (with and without period) normalize to Jr. The suffix is never discarded.
Step 1 — Exact match. The exact match key is (canonical_first, last, suffix). “Timothy Lance” and “Timothy Lance” match. “Robert Williams” and “Robert Williams Jr” do not — the suffix field differs (null vs “Jr”).
Step 3 — Embedding. The suffix is included in the composite string: {canonical_first} {middle} {last} {suffix} | {party} | {office} | {state} | {county}. This means “Robert Williams” and “Robert Williams Jr” produce different vectors, but the difference is small (0.862 cosine) because the model treats “Jr” as a minor token.
Step 4 — LLM confirmation. The prompt explicitly includes both suffix fields and instructs the model: “A suffix like Jr or Sr typically indicates a different person (parent vs child). Do not match across suffixes unless you have strong evidence they refer to the same individual.” The LLM sees the structured fields, not just the raw strings.
The Suffix Inventory
From MEDSL 2022 data across all 50 states:
| Suffix | Occurrences | Notes |
|---|---|---|
| Jr | 1,847 | Most common; often dropped by one source |
| Sr | 312 | Almost always appears alongside a Jr in the same jurisdiction |
| II | 478 | Increasingly common; same disambiguation need as Jr |
| III | 189 | Rarer but unambiguous signal |
| IV | 31 | |
| V | 4 | |
The Jr/Sr problem is not rare. Nearly 2,000 candidates in a single election cycle carry a Jr suffix, and an unknown number of their non-suffixed counterparts exist in the same dataset.
When Suffixes Are Missing
The harder case is when one source includes the suffix and another drops it. MEDSL strips suffixes more aggressively than NC SBE. OpenElections preserves them inconsistently. This means the cascade must handle the asymmetric case: one record has suffix “Jr”, the other has suffix null.
The rule: a null suffix does not match a non-null suffix. Null-to-null matches normally. “Jr” to “Jr” matches normally. But null to “Jr” always enters the LLM zone, regardless of embedding score. The LLM can then examine vote counts, office, and geographic context to determine whether the missing suffix is a data quality issue (same person, suffix dropped) or a genuine distinction (father and son).
This is conservative by design. We would rather send 1,847 extra pairs to the LLM than silently merge fathers with sons.
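The suffix rule can be layered on top of score-based routing as a pre-check. A minimal sketch with hypothetical names: the text specifies only the null versus non-null case, so sending two different non-null suffixes (for example Jr vs Sr) to the LLM as well is an assumption made here to stay conservative. The normalizer handles only Jr, Sr, and roman numerals.

```python
ROMAN = {"II", "III", "IV", "V"}

def normalize_suffix(token):
    """Normalize 'Jr.' and 'JR' to 'Jr'; return None for a missing suffix."""
    if not token:
        return None
    cleaned = token.strip().rstrip(".")
    return cleaned.upper() if cleaned.upper() in ROMAN else cleaned.capitalize()

def route_with_suffix(suffix_a, suffix_b, cosine, accept=0.95, reject=0.35):
    """Apply the suffix rule first, then fall back to score thresholds."""
    a, b = normalize_suffix(suffix_a), normalize_suffix(suffix_b)
    if a != b:
        # Covers null vs "Jr" (from the text) and, by assumption, "Jr" vs "Sr".
        return "llm"
    if cosine >= accept:
        return "auto_accept"
    if cosine < reject:
        return "auto_reject"
    return "llm"

print(route_with_suffix(None, "Jr", 0.99))   # suffix mismatch overrides the score
print(route_with_suffix("Jr.", "JR", 0.97))  # same suffix after normalization
```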
Cross-Reference
- Entity Resolution Overview — where suffixes fit in the cascade
- Threshold Calibration — old vs. new thresholds driven by this finding
- Real Test Cases — Williams Jr and all other tested pairs
The Nickname Dictionary
The nickname dictionary is a static lookup table applied at L1 during name decomposition. It maps common short names and nicknames to their formal equivalents, populating the canonical_first field while preserving the original first field unchanged.
Scope
The prototype dictionary contains approximately 100 mappings covering the most frequent English-language nicknames encountered in US election data:
| Raw first | canonical_first | Frequency in MEDSL 2022 |
|---|---|---|
| Bill | William | 847 |
| Bob | Robert | 612 |
| Jim | James | 589 |
| Mike | Michael | 534 |
| Charlie | Charles | 201 |
| Ron | Ronald | 187 |
| Nikki | Nicole | 42 |
| Ted | Edward | 31 |
| Dick | Richard | 28 |
| Peggy | Margaret | 19 |
The target for production is 500+ mappings, expanding to cover Spanish-language nicknames (Pepe→José, Pancho→Francisco), regional variants, and less common English forms. The full reference list is maintained in Appendix: Full Nickname Dictionary.
Both forms are preserved
When the dictionary maps Charlie → Charles, the L1 record stores both:
{
"first": "Charlie",
"canonical_first": "Charles"
}
The original first is never overwritten. The composite string sent to L2 embedding uses canonical_first, which is why the embedding for “Charles Crist” and “CRIST, CHARLES JOSEPH” can be compared at all — even though the raw cosine similarity between “Charlie Crist” and “CRIST, CHARLES JOSEPH” is only 0.451.
The Ted problem
Some nicknames are ambiguous. “Ted” can map to Edward (Ted Kennedy) or Theodore (Ted Stevens). “Bill” is unambiguous: it always maps to William. “Ted” is not.
The current dictionary maps Ted → Edward, which is the more common historical usage in US politics. This is wrong for Theodore-named candidates. The correct resolution requires context that L1 does not have: party, state, office, or a reference database of known candidates.
The planned fix is a two-pass approach: L1 applies the majority mapping (Ted → Edward), and L3 entity resolution can override it when the LLM has enough context to determine the correct expansion. The canonical_first field is treated as a best guess at L1, not a final answer.
Other ambiguous nicknames with the same property: Pat (Patricia or Patrick), Chris (Christopher or Christine), Alex (Alexander or Alexandra), Sam (Samuel or Samantha). For these, L1 does not apply a mapping — canonical_first is left equal to first — and disambiguation is deferred to L3.
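The three behaviors described above (plain mapping, majority mapping that L3 may override, and deferral) can be sketched in one lookup. The status labels and the names `PROVISIONAL`, `DEFERRED`, and `canonicalize` are hypothetical, introduced only for this illustration; the mappings are the ten rows shown in the table.

```python
NICKNAMES = {"Bill": "William", "Bob": "Robert", "Jim": "James", "Mike": "Michael",
             "Charlie": "Charles", "Ron": "Ronald", "Nikki": "Nicole",
             "Ted": "Edward", "Dick": "Richard", "Peggy": "Margaret"}
# Majority mappings that L3 may later override (the text names only Ted).
PROVISIONAL = {"Ted"}
# Ambiguous names that L1 deliberately leaves unmapped.
DEFERRED = {"Pat", "Chris", "Alex", "Sam"}

def canonicalize(first):
    """Return (canonical_first, status) for a first name."""
    key = first.strip().title()
    if key in DEFERRED:
        return first, "deferred_to_l3"
    if key in NICKNAMES:
        status = "provisional" if key in PROVISIONAL else "mapped"
        return NICKNAMES[key], status
    return first, "unmapped"

for name in ["Bill", "Ted", "Pat", "Shannon"]:
    print(name, canonicalize(name))
```

Carrying a status alongside `canonical_first` is one way to let L3 know which values are best guesses; the pipeline may record this differently.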
Office Classification
MEDSL 2022 contains 8,387 unique office names across all 50 states and DC. These are not 8,387 distinct offices; they are 8,387 different strings that humans typed to describe elected positions. “Board of Education”, “BOARD OF ED.”, “BOE”, “School Board”, and “Board of Education Members” all refer to the same type of office. “DALLAS COUNTY JUDGE” means a chief executive in Texas and a judicial officer in most other states.
Classifying these strings into a consistent taxonomy is required for every downstream operation: blocking for entity resolution, computing competitiveness by office type, comparing the same office across states, and answering “what offices exist in my county?”
The taxonomy
Every office is classified into two fields:
| Field | Values | Example |
|---|---|---|
| `office_level` | federal, state, county, municipal, school_district, special_district, judicial, tribal | school_district |
| `office_branch` | executive, legislative, judicial, law_enforcement, fiscal, education, infrastructure, regulatory, other | education |
The pair (office_level, office_branch) defines the classification. “Board of Education” → (school_district, education). “County Sheriff” → (county, law_enforcement). “City Council” → (municipal, legislative).
The scale of the problem
Of the 8,387 unique office names in MEDSL 2022:
| Characteristic | Count | Percentage |
|---|---|---|
| Appear in only 1 state | 6,241 | 74.4% |
| Appear in only 1 county | 4,995 | 59.6% |
| Appear in 10+ states | 312 | 3.7% |
| Contain a proper noun (county/city name) | 3,108 | 37.1% |
Most office names are effectively unique strings. “DALLAS COUNTY JUDGE”, “Collier Mosquito Control District”, “Santa Rosa Island Authority” — these appear once in the entire national dataset. No keyword list can enumerate them all. The classifier must generalize.
Four-tier approach
The classifier runs four tiers in sequence. Each tier handles what the previous tier could not. A record classified at tier 1 is never re-examined by tier 2.
| Tier | Method | Unique names handled | Cumulative % | Cost |
|---|---|---|---|---|
| 1 | Keyword lookup | ~3,775 | ~45.0% | $0 |
| 2 | Regex patterns | ~1,426 | ~62.0% | $0 |
| 3 | Embedding nearest-neighbor | ~378 | ~66.5% | ~$0.01/1K |
| 4 | LLM classification | ~42 | ~67.0% | ~$0.002/call |
| — | Unclassified (other) | ~2,766 | 100% | — |
The remaining ~33% classified as other are primarily hyper-local offices (township-specific roles, water district sub-boards, tribal offices) that require either expanded reference data or manual review. The other rate drops as the keyword and regex lists expand.
Note: Percentages are based on unique office name strings. By record count, the coverage is much higher — the 312 names that appear in 10+ states account for millions of records. Keyword tier 1 alone handles ~85% of records by volume.
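The cumulative column can be re-derived from the per-tier counts. A short arithmetic check, using only the approximate counts from the table above (variable names are illustrative):

```python
TOTAL_UNIQUE = 8387
handled = [("keyword", 3775), ("regex", 1426), ("embedding", 378), ("llm", 42)]

running = 0
for tier, count in handled:
    running += count
    print(f"{tier:<10} {count:>5}  cumulative {running / TOTAL_UNIQUE:6.1%}")

unclassified = TOTAL_UNIQUE - running
print(f"{'other':<10} {unclassified:>5}  share {unclassified / TOTAL_UNIQUE:6.1%}")
```

The four tiers together cover 5,621 names, leaving 2,766 classified as other.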
Tier 1: Keyword lookup
A table of ~170 keywords mapped to (office_level, office_branch) pairs. If any keyword appears in the office name string, the classification is assigned.
| Keyword | office_level | office_branch | Example match |
|---|---|---|---|
| `sheriff` | county | law_enforcement | “WARREN COUNTY SHERIFF” |
| `board of education` | school_district | education | “COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02” |
| `city council` | municipal | legislative | “CITY COUNCIL WARD 3” |
| `coroner` | county | fiscal | “COUNTY CORONER” |
| `constable` | county | law_enforcement | “CONSTABLE PRECINCT 4” |
Keywords are matched case-insensitively. When multiple keywords match, the most specific wins (“county board of education” matches board of education → school_district, not county → county). The keyword table is maintained in the appendix.
Keyword lookup handles approximately 45% of unique office name strings and ~85% of total records. The most common offices — sheriff, school board, city council, county commission — all have unambiguous keywords.
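A minimal sketch of the tier 1 lookup, using four rows from the table above. How “most specific” is decided is not spelled out here, so picking the longest matching keyword is an assumption; `classify_keyword` is a hypothetical name.

```python
KEYWORDS = {
    "sheriff": ("county", "law_enforcement"),
    "board of education": ("school_district", "education"),
    "city council": ("municipal", "legislative"),
    "constable": ("county", "law_enforcement"),
}

def classify_keyword(office_name):
    """Return (office_level, office_branch), or None to pass the name to tier 2."""
    name = office_name.casefold()  # case-insensitive substring matching
    hits = [keyword for keyword in KEYWORDS if keyword in name]
    if not hits:
        return None
    # Assumption: "most specific" is approximated by the longest keyword.
    return KEYWORDS[max(hits, key=len)]

print(classify_keyword("WARREN COUNTY SHERIFF"))
print(classify_keyword("Fence Viewer"))
```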
Tier 2: Regex patterns
Approximately 40 regex patterns handle structured variations that keywords miss. Patterns capture positional and combinatorial relationships:
| Pattern | office_level | office_branch | Example match |
|---|---|---|---|
| `county\s+commission` | county | legislative | “CLARK COUNTY COMMISSION DIST 2” |
| `district\s+court\s+judge` | judicial | judicial | “15TH DISTRICT COURT JUDGE” |
| `register\s+of\s+(deeds\|wills)` | county | fiscal | “REGISTER OF DEEDS” |
| `soil.*water.*conservation` | special_district | infrastructure | “SOIL AND WATER CONSERVATION DISTRICT SUPERVISOR” |
| `(mayor\|alcalde)` | municipal | executive | “MAYOR - CITY OF SPRINGFIELD” |
Regex patterns add approximately 17% of unique names beyond what keywords catch. Combined with tier 1, the two deterministic tiers handle ~62% of unique names and ~92% of records by volume.
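The five patterns from the table can be exercised with Python's `re` module. A sketch under two assumptions not stated above: patterns are matched case-insensitively, and the first matching pattern wins. `classify_regex` is a hypothetical name.

```python
import re

PATTERNS = [
    (r"county\s+commission", ("county", "legislative")),
    (r"district\s+court\s+judge", ("judicial", "judicial")),
    (r"register\s+of\s+(deeds|wills)", ("county", "fiscal")),
    (r"soil.*water.*conservation", ("special_district", "infrastructure")),
    (r"(mayor|alcalde)", ("municipal", "executive")),
]
# Compile once; IGNORECASE is an assumption carried over from tier 1.
COMPILED = [(re.compile(pattern, re.IGNORECASE), label) for pattern, label in PATTERNS]

def classify_regex(office_name):
    """Return the classification of the first matching pattern, else None."""
    for pattern, label in COMPILED:
        if pattern.search(office_name):
            return label
    return None

print(classify_regex("CLARK COUNTY COMMISSION DIST 2"))
print(classify_regex("Hog Reeve"))
```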
Tier 3: Embedding nearest-neighbor
For names that survive tiers 1 and 2, L2 generates an embedding using text-embedding-3-large and finds the nearest neighbor in a reference set of ~200 pre-classified office names.
Real example from our prototype:
- Input: “Collier Mosquito Control District”
- Nearest neighbor: “Mosquito Control District” (reference set)
- Cosine similarity: 0.787
- Classification: `(special_district, infrastructure)`
The tier 3 accept threshold is cosine ≥ 0.60. Below that, the match is too uncertain and the record passes to tier 4. In our prototype, tier 3 classified ~378 names, which is ~4.5% of all unique names (about 12% of the ~3,186 that reach this tier), with a manual-review accuracy of 94%.
The 200-name reference set was curated from the most common office names across all states, covering every (office_level, office_branch) pair with at least 3 reference examples. Expanding this set to 500+ names is a planned improvement.
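The nearest-neighbor step with its 0.60 threshold can be sketched without any embedding API. In this illustration, toy 3-dimensional vectors stand in for the real 3,072-dimensional embeddings, so the scores are not the prototype's; `nearest_reference` and the reference layout are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_reference(query_vec, reference, threshold=0.60):
    """Return (name, label, score) of the best match, or None below threshold."""
    best = max(reference, key=lambda item: cosine(query_vec, item["vector"]))
    score = cosine(query_vec, best["vector"])
    if score < threshold:
        return None  # too uncertain: pass the name on to tier 4
    return best["name"], best["label"], score

# Toy vectors, invented for illustration only.
reference = [
    {"name": "Mosquito Control District",
     "label": ("special_district", "infrastructure"), "vector": [0.9, 0.1, 0.2]},
    {"name": "County Sheriff",
     "label": ("county", "law_enforcement"), "vector": [0.1, 0.9, 0.1]},
]
match = nearest_reference([0.8, 0.3, 0.4], reference)
miss = nearest_reference([0.0, 0.1, -1.0], reference)
print(match[0] if match else None, miss)
```

A linear scan is enough for a reference set of a few hundred names; an index would only matter at a much larger scale.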
Tier 4: LLM classification
Remaining unclassified names go to Claude Sonnet with the full context: office name, state, county, and the taxonomy definition.
Real examples from our prototype:
| Office name | State | LLM classification | Confidence |
|---|---|---|---|
| Santa Rosa Island Authority | FL | special_district / infrastructure | 0.90 |
| Mosquito Control Board Member | FL | special_district / infrastructure | 0.95 |
| Judge of Compensation Claims | FL | judicial / judicial | 0.88 |
| Public Administrator | MO | county / fiscal | 0.82 |
| Recorder of Deeds | MO | county / fiscal | 0.95 |
| Drainage Commissioner | IL | special_district / infrastructure | 0.85 |
| Fence Viewer | VT | municipal / regulatory | 0.70 |
| Pound Keeper | NH | municipal / regulatory | 0.65 |
| Hog Reeve | NH | municipal / regulatory | 0.60 |
In our prototype, the LLM classified 9 hard cases with 100% accuracy against manual review. The lower-confidence cases (Fence Viewer at 0.70, Hog Reeve at 0.60) are genuine obscure New England town offices that even the LLM finds unusual — but it classified them correctly.
The state-context problem
“DALLAS COUNTY JUDGE” illustrates why state context matters. In Texas, the county judge is the presiding officer of the commissioners court, an executive role, not a judicial one. In most other states, a county judge sits on the bench.
The keyword classifier alone cannot resolve this. The word “judge” appears, suggesting judicial. But the Texas county judge is (county, executive).
The fix is a state-specific override table in tier 1. Before general keyword matching, a small set of (state, keyword) → classification entries handles known exceptions:
| State | Office pattern | Correct classification |
|---|---|---|
| TX | county judge | county / executive |
| LA | parish president | county / executive |
| LA | police jury | county / legislative |
| AK | borough assembly | county / legislative |
This table is currently small (~15 entries). As more state-specific offices are identified, it grows. The pattern generalizes: when the same word means different things in different states, the state-specific override takes priority.
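The override-before-keyword order can be sketched as follows. The four override rows come from the table above and the `judge` mapping follows the tier 1 walkthrough (county/judicial); everything else, including the name `classify_with_state` and the one-entry keyword excerpt, is an assumption for illustration.

```python
STATE_OVERRIDES = {
    ("TX", "county judge"): ("county", "executive"),
    ("LA", "parish president"): ("county", "executive"),
    ("LA", "police jury"): ("county", "legislative"),
    ("AK", "borough assembly"): ("county", "legislative"),
}
GENERAL_KEYWORDS = {"judge": ("county", "judicial")}  # one-entry excerpt

def classify_with_state(office_name, state):
    """Check state-specific overrides first, then general keywords."""
    name = office_name.casefold()
    for (override_state, pattern), label in STATE_OVERRIDES.items():
        if override_state == state and pattern in name:
            return label
    for keyword, label in GENERAL_KEYWORDS.items():
        if keyword in name:
            return label
    return None

print(classify_with_state("DALLAS COUNTY JUDGE", "TX"))
print(classify_with_state("DALLAS COUNTY JUDGE", "IA"))
```

The same office string yields county/executive in Texas and county/judicial elsewhere, which is the behavior the override table exists to produce.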
Accuracy by tier
| Tier | Method | Accuracy (manual review) | False positive rate |
|---|---|---|---|
| 1 | Keyword | 99.2% | < 0.5% |
| 2 | Regex | 97.8% | ~1.0% |
| 3 | Embedding NN | 94.0% | ~3.5% |
| 4 | LLM | 100% (N=9) | 0% (N=9) |
Tier 1 and 2 errors are almost entirely from the state-context problem (a keyword matching the wrong sense of the word). Tier 3 errors come from embedding matches that are semantically close but functionally wrong — “Tax Collector” matching to “Tax Assessor” when they are separate offices in some states.
Cross-references
- The Four-Tier Classifier — step-by-step walkthrough with a single office name through all four tiers
- Appendix: Office Classification Reference — full keyword table and regex pattern list
The Four-Tier Classifier
Office classification proceeds through four tiers in strict order. Each tier handles a progressively harder subset of the 8,387 unique office names found in MEDSL 2022. A name classified at tier 1 never reaches tier 2. A name classified at tier 2 never reaches tier 3. The tiers are ordered by cost: deterministic and free first, embedding-based second, LLM last.
Tier 1: Keyword Match
A lookup table of 170 keyword entries maps office name substrings to (office_level, office_branch) pairs. Matching is case-insensitive and checks for substring containment.
Example:
Raw office name: WARREN COUNTY BOARD OF EDUCATION
The keyword table contains:
| Keyword | office_level | office_branch |
|---|---|---|
| board of education | school_district | education |
"board of education" appears as a substring → classified as school_district/education.
Coverage: ~3,775 of 8,387 unique names (~45.0%). These are the offices with unambiguous keywords: sheriff, coroner, board of education, city council, state senate, district court, county clerk, school board, mayor, constable, treasurer.
Limitations: Keyword matching is context-free. DALLAS COUNTY JUDGE contains judge, which maps to county/judicial. In Texas, the County Judge is the chief executive — county/executive is correct. Tier 1 gets this wrong. The planned fix is a state-context override table applied before keyword matching.
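The lookup and its limitation can be sketched in a few lines of Python. The two keyword rows below both appear in the text above; the function name keyword_classify and the first-match-wins ordering are assumptions made for illustration, not a description of the project's actual table handling.

```python
from typing import Dict, Optional, Tuple

# Two rows standing in for the 170-entry table. Both mappings appear in the
# text above; the rest of the table is omitted here.
KEYWORDS: Dict[str, Tuple[str, str]] = {
    "board of education": ("school_district", "education"),
    "judge": ("county", "judicial"),
}


def keyword_classify(office_name: str) -> Optional[Tuple[str, str]]:
    """Case-insensitive substring containment; first matching keyword wins."""
    name = office_name.casefold()
    for keyword, classification in KEYWORDS.items():
        if keyword in name:
            return classification
    return None


print(keyword_classify("WARREN COUNTY BOARD OF EDUCATION"))
# The state is never consulted, so the Texas case gets the generic answer.
print(keyword_classify("DALLAS COUNTY JUDGE"))
```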
Tier 2: Regex Patterns
Approximately 40 regular expressions handle office names with structural patterns that keywords alone cannot capture.
Example:
Raw office name: CLERK OF THE CIRCUIT COURT, 11TH JUDICIAL CIRCUIT
Regex pattern: clerk\s+of\s+(the\s+)?(circuit|district|superior)\s+court
Match → classified as county/judicial.
Other regex examples:
| Pattern | Matches | Classification |
|---|---|---|
| county\s+commission | County Commissioner, County Commission District 3 | county/legislative |
| (city\|town\|village)\s+council | City Council Ward 2, Town Council At Large | municipal/legislative |
| district\s+\d+\s+judge | District 14 Judge, District 3 Judge | county/judicial |
| soil\s+and\s+water | Soil and Water Conservation District Supervisor | special_district/conservation |
Coverage: ~1,426 additional unique names (~17.0%), bringing the cumulative total to ~62.0%.
Limitations: Regex patterns are brittle against novel phrasings. CONSERVATION DISTRICT BOARD MEMBER does not match the soil-and-water pattern. Regex also cannot handle the 4,995 office names that appear in exactly one county — writing a pattern for each is infeasible.
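A short sketch of tier 2 using Python's re module and the five patterns quoted above. The rule ordering, the case-insensitive flag, and the name regex_classify are assumptions; the source lists the patterns but not how they are compiled or ordered.

```python
import re
from typing import List, Optional, Pattern, Tuple

# Patterns and classifications are the ones listed above.
# re.IGNORECASE is an assumption; the text does not say how case is handled.
REGEX_RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"clerk\s+of\s+(the\s+)?(circuit|district|superior)\s+court", re.I),
     "county/judicial"),
    (re.compile(r"county\s+commission", re.I), "county/legislative"),
    (re.compile(r"(city|town|village)\s+council", re.I), "municipal/legislative"),
    (re.compile(r"district\s+\d+\s+judge", re.I), "county/judicial"),
    (re.compile(r"soil\s+and\s+water", re.I), "special_district/conservation"),
]


def regex_classify(office_name: str) -> Optional[str]:
    """Return the classification of the first pattern that matches, else None."""
    for pattern, classification in REGEX_RULES:
        if pattern.search(office_name):
            return classification
    return None


print(regex_classify("CLERK OF THE CIRCUIT COURT, 11TH JUDICIAL CIRCUIT"))
# The novel phrasing from the limitations paragraph matches nothing.
print(regex_classify("CONSERVATION DISTRICT BOARD MEMBER"))
```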
Tier 3: Embedding Nearest Neighbor
The remaining ~3,186 unclassified office names are embedded using text-embedding-3-large and compared against a reference set of ~200 pre-classified office names. The nearest neighbor’s classification is assigned if cosine similarity exceeds 0.60.
Example:
Raw office name: Collier Mosquito Control District
Nearest reference: Mosquito Control District → special_district/infrastructure
Cosine similarity: 0.787
0.787 > 0.60 → classified as special_district/infrastructure with confidence 0.787.
Other tier 3 results:
| Unclassified Name | Nearest Reference | Cosine | Classification |
|---|---|---|---|
| Collier Mosquito Control District | Mosquito Control District | 0.787 | special_district/infrastructure |
| Eastern Carrituck Fire & Rescue | Fire Protection District | 0.724 | special_district/infrastructure |
| Lowndes County Bd of Ed | Board of Education | 0.831 | school_district/education |
| Hospital Authority Board | Hospital District | 0.692 | special_district/health |
Coverage: ~378 additional unique names (~4.5%), bringing the cumulative total to ~66.5%.
What falls through: Office names with no close reference analog, names below the 0.60 threshold, and names whose nearest neighbor is misleading (e.g., Community Development District matching Community College District at 0.71 — wrong classification). These proceed to tier 4.
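The nearest-neighbor rule can be illustrated without calling an embedding API. In the sketch below the three-dimensional vectors are toy stand-ins for real 3,072-dimensional embeddings, and the function names are hypothetical; only the 0.60 threshold and the two reference classifications come from the text.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]
THRESHOLD = 0.60  # from the text above


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_by_nearest(
    query: Vector, references: Dict[str, Tuple[Vector, str]]
) -> Optional[Tuple[str, str, float]]:
    """Return (classification, reference name, score), or None below threshold."""
    best: Optional[Tuple[str, str, float]] = None
    for name, (vector, classification) in references.items():
        score = cosine(query, vector)
        if best is None or score > best[2]:
            best = (classification, name, score)
    if best is not None and best[2] > THRESHOLD:
        return best
    return None  # proceeds to tier 4


# Toy vectors: stand-ins for real embeddings, chosen only to show the rule.
REFERENCES = {
    "Mosquito Control District": ([1.0, 0.0, 0.0], "special_district/infrastructure"),
    "Board of Education": ([0.0, 1.0, 0.0], "school_district/education"),
}

print(classify_by_nearest([0.8, 0.6, 0.0], REFERENCES))  # close to the first reference
print(classify_by_nearest([1.0, 1.0, 2.0], REFERENCES))  # no close analog
```

Note that the threshold only guards against weak matches. A misleading neighbor that scores above 0.60, like the Community Development District case, still passes this check and has to be caught elsewhere.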
Tier 4: LLM Classification
The final tier sends unclassified office names to Claude Sonnet with a structured prompt that includes the office name, state, and the full taxonomy of (office_level, office_branch) pairs.
Example:
Raw office name: Santa Rosa Island Authority
State: Florida
The LLM prompt provides the taxonomy and asks: “Classify this office into the most appropriate (office_level, office_branch) pair. Explain your reasoning.”
LLM response:
Classification: special_district/infrastructure (confidence: 0.90)
“The Santa Rosa Island Authority is a special-purpose governmental entity in Escambia County, Florida, responsible for managing development and infrastructure on Santa Rosa Island (Pensacola Beach). It is not a general-purpose county or municipal government. ‘Special district’ at the ‘infrastructure’ branch is the best fit.”
Coverage: ~42 additional unique names (~0.5%) in our prototype evaluation, classified with 100% accuracy against manual review (9 of 9 hard cases correct).
Other tier 4 examples:
| Office Name | State | LLM Classification | Confidence |
|---|---|---|---|
| Santa Rosa Island Authority | FL | special_district/infrastructure | 0.90 |
| Cuyahoga County Executive | OH | county/executive | 0.95 |
| Drainage Commissioner | IL | special_district/infrastructure | 0.85 |
| Register of Mesne Conveyances | SC | county/judicial | 0.88 |
The South Carolina example is illustrative: “Register of Mesne Conveyances” is an office that exists in exactly one state. No keyword, regex, or embedding reference can classify it without external knowledge. The LLM knows that mesne conveyances is a legal term related to property transfers and that the Register is a judicial officer.
Tier Summary
| Tier | Method | Unique Names | Cumulative % | Cost per Name | Deterministic |
|---|---|---|---|---|---|
| 1 | Keyword (170 entries) | ~3,775 | 45.0% | $0 | Yes |
| 2 | Regex (~40 patterns) | ~1,426 | 62.0% | $0 | Yes |
| 3 | Embedding NN (200 refs) | ~378 | 66.5% | ~$0.0001 | Yes* |
| 4 | LLM | ~42 | 67.0% | ~$0.001 | No |
| — | Unclassified / other | ~2,766 | 100% | — | — |
* Deterministic given the same embedding model version.
The remaining ~33% classified as other are office names that did not pass through our full pipeline in the prototype. At production scale, tiers 1–4 are projected to handle ~99.5% of names, with ~0.5% remaining as other pending human review.
Why Four Tiers Instead of Just the LLM
Three reasons:
- Speed. Keyword and regex classify 62% of names in microseconds. Embedding NN classifies 4.5% more in milliseconds. Sending all 8,387 names to the LLM would take minutes and achieve the same result for the easy cases.
- Reproducibility. Tiers 1–3 produce identical output on every run. Tier 4 may produce slightly different reasoning (though classifications are stable in practice). Minimizing non-deterministic surface area makes the pipeline easier to audit.
- Debuggability. When a classification is wrong, the classifier_method field tells you which tier produced it. A wrong keyword mapping is a one-line table fix. A wrong regex is a pattern edit. A wrong embedding match means the reference set needs expansion. A wrong LLM classification means the prompt needs refinement. Each failure mode has a distinct fix.
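The strict ordering and the classifier_method field can be sketched as a small dispatcher. Everything here is illustrative: the tier functions are stubs (regex and embedding tiers are omitted for brevity), and the method labels "keyword", "llm", and "none" are assumed values, not the project's actual vocabulary.

```python
from typing import Callable, Dict, List, Optional, Tuple

TierFn = Callable[[str, str], Optional[str]]


def classify(office_name: str, state: str,
             tiers: List[Tuple[str, TierFn]]) -> Dict[str, str]:
    """Run tiers in order; the first tier that returns a result wins."""
    for method, tier_fn in tiers:
        result = tier_fn(office_name, state)
        if result is not None:
            return {"classification": result, "classifier_method": method}
    return {"classification": "other", "classifier_method": "none"}


llm_calls: List[str] = []


def keyword_tier(name: str, state: str) -> Optional[str]:
    return "school_district/education" if "board of education" in name.lower() else None


def llm_tier(name: str, state: str) -> Optional[str]:
    llm_calls.append(name)  # stub standing in for an API call
    return "special_district/infrastructure"


TIERS = [("keyword", keyword_tier), ("llm", llm_tier)]

easy = classify("WARREN COUNTY BOARD OF EDUCATION", "NC", TIERS)
hard = classify("Santa Rosa Island Authority", "FL", TIERS)
print(easy["classifier_method"], hard["classifier_method"], len(llm_calls))
```

Because the loop returns on the first hit, the easy name never reaches the LLM stub, which is the property the speed and reproducibility arguments rely on.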
Cross-Reference
- Office Classification Overview — the 8,387-name problem and tier coverage statistics
- Appendix: Office Classification Reference — full keyword and regex lists
- L1: Cleaned — where tiers 1–2 run
- L2: Embedded — where tier 3 runs
- When the LLM Gets Called — tier 4 invocation policy
Entity Resolution
Entity resolution — determining that two records refer to the same human being — is the single hardest problem in this project. It is also the most consequential. Eight of the 30 query types identified in What Questions Should Be Answerable depend on it: career tracking, cross-source reconciliation, candidate deduplication, party switch detection, multi-cycle competitiveness analysis, incumbent identification, name standardization, and cross-election turnout comparison.
The problem is cross-cutting. It touches every source, every state, every election, and every office level. Get it wrong and you merge fathers with sons, split one candidate into three, or silently drop a career that spans six election cycles.
The Scale Problem
MEDSL 2022 alone contains approximately 42 million rows. A naive all-pairs comparison would require ~8.8 × 10¹⁴ similarity computations. Even at 1 million comparisons per second, that is 28 years of wall-clock time. Entity resolution at this scale requires a cascade that eliminates the vast majority of comparisons before reaching expensive methods.
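The arithmetic behind those figures is easy to reproduce. The sketch below assumes one record per row and counts unordered pairs, n(n − 1)/2; the constant names are illustrative.

```python
ROWS = 42_000_000
COMPARISONS_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Unordered pairs among n records: n * (n - 1) / 2
pairs = ROWS * (ROWS - 1) // 2
years = pairs / COMPARISONS_PER_SECOND / SECONDS_PER_YEAR

print(f"{pairs:.2e} pairs, about {years:.0f} years")  # 8.82e+14 pairs, about 28 years
```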
The Cascade
Our entity resolution pipeline is a five-step cascade. Each step is cheaper and faster than the next. Each step either resolves the pair (match or no-match) or passes it to the next step.
| Step | Method | Resolves | Cost per pair |
|---|---|---|---|
| 1 | Exact match on (canonical_first, last, suffix) | 70.0% | negligible |
| 2 | Jaro-Winkler similarity ≥ 0.92 | 0.1% | microseconds |
| 2.5 | Name similarity gate: JW on last name < 0.50 → skip | — | microseconds |
| 3 | Embedding cosine similarity ≥ 0.95 → auto-accept | 5.9% | pre-computed |
| 4 | LLM confirmation (cosine 0.35–0.95) | 3.5% | ~$0.0002 |
| 5 | Tiebreaker: stronger model | rare | ~$0.002 |
Pairs that are not resolved by step 5 are escalated to human review.
Prototype Results
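Our prototype processed 200 NC SBE records from Columbus County, NC (2022 general election). Counts for steps 1–4 are candidate instances rather than input records, which is why they can exceed 200: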
| Metric | Value |
|---|---|
| Input records | 200 |
| Exact matches (step 1) | 597 (70.0%) |
| Jaro-Winkler matches (step 2) | 1 (0.1%) |
| Embedding auto-accepts (step 3) | 50 (5.9%) |
| LLM calls (step 4) | 30 (3.5%) |
| LLM matches confirmed | 0 |
| LLM no-matches confirmed | 30 |
| Unique candidate entities created | 206 |
| Hash chains verified | 200/200 |
All 30 LLM calls were spent on pairs that shared a blocking key (same state, same office level, same last-name initial) but had completely different names — comparisons like “Aaron Bridges” vs “Daniel Blanton” that happened to fall within the same block. Every one was correctly rejected. This finding led to step 2.5: the Jaro-Winkler gate on last names. If the JW score on last names alone is below 0.50, skip the pair entirely. This would have eliminated all 30 wasted LLM calls.
Why Embedding Alone Fails
Embedding similarity is a powerful retrieval signal but an unreliable decision signal. Two real cases demonstrate the failure modes:
False negative — Charlie Crist at 0.451. MEDSL records CRIST, CHARLES JOSEPH. OpenElections records Charlie Crist. The embedding model scores their cosine similarity at 0.451. Any threshold-based system that relies solely on embeddings either rejects this pair (missing a true match) or sets the accept threshold so low that it admits thousands of false positives.
The problem is structural. The embedding model sees different surface forms — different name ordering, different casing, a nickname versus a legal name, and a middle name present in one source but not the other. The model has no reliable mechanism to know that Charlie is a common nickname for Charles.
False positive — Robert Williams Jr at 0.862. Robert Williams and Robert Williams Jr score 0.862. The model treats “Jr” as a minor token appended to an otherwise identical string. But Jr is a generational suffix — these are different people. At our original auto-accept threshold of 0.82, this pair would have been silently merged.
The embedding model is good at detecting surface similarity. It is bad at understanding that a single token (“Jr”) carries categorical meaning, and that a short nickname (“Charlie”) maps to a longer legal name (“Charles”).
Why LLM Alone Fails
An LLM like Claude Sonnet can correctly resolve both cases above. It knows Charlie is a nickname for Charles. It knows Jr indicates a different person. In our tests, it correctly identified all 12 test pairs with appropriate confidence levels.
But LLM-only resolution is infeasible at scale:
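- Speed: At 200 ms per API call, a single sequential pass of 42 million calls (one per row) would take about 97 days, and the number of pairwise comparisons is far larger than the number of rows. Even with aggressive blocking, the number of candidate pairs runs into millions.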
- Reproducibility: LLM outputs are non-deterministic. Running the same pair twice may produce different confidence scores. This is acceptable for ambiguous cases but wasteful for the 70% of cases that exact match handles perfectly.
- Cost: While budget is not a constraint, sending millions of obvious matches and obvious non-matches to an LLM is pure waste. The LLM adds value only on the ambiguous cases that simpler methods cannot resolve.
Why the Cascade Works
The cascade combines the strengths of each method:
- Exact match handles the common case (70%): same name, same state, different precincts. No ML, no API calls, no latency, no non-determinism.
- Jaro-Winkler catches minor spelling variations (“SHANNON W BRAY” vs “Shannon W. Bray”) that exact match misses due to casing or punctuation. Still deterministic, still free.
- The name gate (step 2.5) eliminates pairs that share a blocking key but have obviously different names. This prevents the “wasted 30 LLM calls” scenario from the prototype. Deterministic, zero cost.
- Embedding retrieval identifies high-confidence matches (≥ 0.95) where the names differ in format but not in substance. Pre-computed vectors make this effectively free at query time. The 0.95 threshold is deliberately conservative: only near-certain matches pass.
- LLM confirmation handles the hard cases: nicknames (Crist at 0.451), suffixes (Williams Jr at 0.862), ambiguous common names. The LLM sees structured name components, vote counts, office, state, and party, which is enough context to reason about identity. Every prompt, response, and reasoning chain is stored for audit.
- Tiebreaker (step 5) escalates low-confidence LLM decisions to a stronger model (Opus-class). This adds cost but catches cases where Sonnet is uncertain.
The cascade balances three properties:
- Accuracy: The LLM catches what embeddings miss. Embeddings retrieve what exact match misses. Each layer covers the failure modes of the layer above it.
- Speed: 70% of resolution is free. 6% is pre-computed. Only 3.5% requires API calls. At scale, this is the difference between hours and years.
- Reproducibility: Steps 1–3 are fully deterministic. Steps 4–5 are non-deterministic but logged — every decision can be replayed from the audit log without re-invoking the LLM.
The 19 Exact Ties
Entity resolution is a prerequisite for detecting exact ties. In MEDSL 2022, we found 19 contests nationally where the top two candidates received exactly the same number of votes. Without entity resolution, precinct-level records cannot be aggregated into contest-level totals — and ties cannot be detected.
Blocking Strategy
Before the cascade runs, records are partitioned into blocks by (state, office_level, last_name_initial). Only pairs within the same block are compared. This reduces the comparison space by approximately four orders of magnitude while preserving all legitimate matches — a candidate for NC school board is never compared to a candidate for FL sheriff.
The blocking key is deliberately coarse. We accept some wasted comparisons within blocks (like “Aaron Bridges” vs “Daniel Blanton” in the same NC school_district block) in exchange for never missing a legitimate match. The step 2.5 gate handles the within-block noise.
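A minimal sketch of blocking, assuming records are dictionaries with state, office_level, and last fields (the field names are assumptions). The two NC records are the pair discussed above; the Florida record is a hypothetical row added only to show that cross-block pairs are never generated.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterator, List, Tuple

Record = Dict[str, str]


def blocking_key(record: Record) -> Tuple[str, str, str]:
    """(state, office_level, last_name_initial), as described above."""
    return (record["state"], record["office_level"], record["last"][0].upper())


def candidate_pairs(records: List[Record]) -> Iterator[Tuple[Record, Record]]:
    """Yield only the pairs whose records share a blocking key."""
    blocks: Dict[Tuple[str, str, str], List[Record]] = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


RECORDS = [
    {"first": "Aaron", "last": "Bridges", "state": "NC", "office_level": "school_district"},
    {"first": "Daniel", "last": "Blanton", "state": "NC", "office_level": "school_district"},
    # Hypothetical record in a different block (different state and office level).
    {"first": "Pat", "last": "Baker", "state": "FL", "office_level": "county"},
]

for a, b in candidate_pairs(RECORDS):
    print(a["last"], "vs", b["last"])  # Bridges vs Blanton
```

Three records would give three pairs under all-pairs comparison; blocking leaves one, and that one is exactly the kind of within-block noise the step 2.5 gate is meant to discard.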
Detailed Walkthroughs
- The Cascade: Step by Step — each step with real examples
- Real Test Cases from Real Data — all tested pairs with scores and decisions
- Threshold Calibration — how Williams Jr and Crist changed the thresholds
The Cascade: Step-by-Step Walkthrough
The entity resolution cascade processes candidate pairs through five steps of increasing cost and sophistication. Each step either resolves the pair (match or no-match) or passes it to the next step. This chapter walks through a real example at every step.
Step 1: Exact Match on Structured Fields
Key: (canonical_first, last, suffix) within a (state, office_level) block.
Timothy Lance appears in 47 precinct-level rows across Columbus County, NC for the 2022 school board race. Every row has:
{
"canonical_first": "Timothy",
"last": "Lance",
"suffix": null
}
All 47 rows match on the exact key. One candidate_entity_id is assigned. No fuzzy logic, no embedding, no API call.
In our prototype of 200 records, exact match resolved 597 candidate instances (70.0%) into 206 unique entities. This is the workhorse of the cascade — cheap, deterministic, and correct whenever sources agree on name components.
Exact match fails when sources disagree on formatting, use nicknames, or omit components. That is what steps 2–5 handle.
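Step 1 amounts to grouping rows by a tuple key. The sketch below is one way to do that in Python; the integer IDs, the function names, and the state and office values on the two Williams rows are assumptions for illustration (the source does not say how candidate_entity_id values are generated).

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]
Key = Tuple[Optional[str], ...]


def exact_key(row: Row) -> Key:
    """Block fields plus (canonical_first, last, suffix)."""
    return (row["state"], row["office_level"],
            row["canonical_first"], row["last"], row["suffix"])


def assign_entity_ids(rows: List[Row]) -> List[int]:
    """Give every row the ID of the first row that had the same exact key."""
    ids: Dict[Key, int] = {}
    assigned = []
    for row in rows:
        # setdefault hands out the next integer the first time a key is seen
        assigned.append(ids.setdefault(exact_key(row), len(ids) + 1))
    return assigned


lance = {"state": "NC", "office_level": "school_district",
         "canonical_first": "Timothy", "last": "Lance", "suffix": None}
# Hypothetical block context for the suffix example.
williams = {"state": "NC", "office_level": "county",
            "canonical_first": "Robert", "last": "Williams", "suffix": None}
williams_jr = {**williams, "suffix": "Jr"}

entity_ids = assign_entity_ids([lance] * 47 + [williams, williams_jr])
print(len(set(entity_ids[:47])), entity_ids[-2:])  # 1 [2, 3]
```

Keeping suffix in the key is what stops Robert Williams and Robert Williams Jr from collapsing into one entity at this step.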
Step 2: Jaro-Winkler Similarity (≥ 0.92)
Step 2 catches minor spelling variations that survive L1 parsing: Mcdonough vs McDonough, De Los Santos vs Delossantos, transposition errors in precinct-level data entry.
The threshold is 0.92 on the full (canonical_first + " " + last) string. This is intentionally strict — Jaro-Winkler gives high scores to strings that share a prefix, which makes it prone to false positives on common surnames.
In our prototype, step 2 resolved 1 additional candidate (0.1%). Most formatting differences are already handled by L1 normalization (case folding, punctuation removal), leaving few cases for JW.
Step 2.5: The Name Similarity Gate
Before computing embeddings, check whether the pair’s last names are remotely similar. If the Jaro-Winkler score on last names alone is below 0.50, skip the pair entirely.
Example: Aaron Bridges vs. Daniel Blanton. Both appear in NC school district races. They share the same (state, office_level) block, which is why they were paired in the first place. But:
- Last-name JW: Bridges vs Blanton → 0.40
- Gate decision: skip (do not compute embedding, do not call LLM).
This gate exists because of a finding in our prototype. The original cascade had no step 2.5. Of the 30 LLM calls made, all 30 were spent on pairs with completely different names that happened to fall in the same blocking group — “Aaron Bridges” vs “Daniel Blanton” type comparisons. Every one was correctly rejected, but each cost an API round-trip and added latency.
The gate eliminates these obvious non-matches before they reach the embedding step. At scale, with millions of within-block pairs, this saves orders of magnitude in embedding lookups and LLM calls.
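The gate can be sketched with a textbook Jaro-Winkler implementation in pure Python. This is not necessarily the variant the prototype used: exact scores depend on implementation details such as the prefix boost, so the value this sketch computes for Bridges vs Blanton may differ from the figure quoted above while still falling below 0.50. The function names and the case folding are assumptions.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that line up in a different order.
    out_of_order, j = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):  # common prefix, capped at 4 characters
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def passes_name_gate(last_a: str, last_b: str, threshold: float = 0.50) -> bool:
    """Step 2.5: continue only if the last names are at least remotely similar."""
    return jaro_winkler(last_a.casefold(), last_b.casefold()) >= threshold


print(passes_name_gate("Bridges", "Blanton"))  # False: skip the pair
print(passes_name_gate("Crist", "CRIST"))      # True: continue to step 3
```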
Step 3: Embedding Retrieval (Cosine ≥ 0.95 → Auto-Accept)
For pairs that pass the gate but did not exact-match, compute cosine similarity between L2 candidate embeddings. If the score is ≥ 0.95 and both candidates are in the same state, auto-accept the match.
Example: Ashley Moody vs. Ashley B. Moody (Florida Attorney General, 2022).
| Field | Source A (OpenElections) | Source B (MEDSL) |
|---|---|---|
| Raw name | Ashley Moody | MOODY, ASHLEY B |
| canonical_first | Ashley | Ashley |
| middle | null | B |
| last | Moody | Moody |
| suffix | null | null |
Strictly speaking, step 1 succeeds on these fields: the exact-match key (Ashley, Moody, null) is identical in both sources, because the middle initial is not part of the key. The pair is shown here because the composite strings still diverge on the middle initial, which is what the embedding sees. When the key itself does not match (for example, if a middle name is included in the key), step 3 handles the pair.
- Embedding cosine: 0.930
- Same state: yes (both FL)
At 0.930, this pair falls just below the 0.95 auto-accept threshold, so it enters the LLM zone. However, the JW score on full name is 0.95 — combined with the embedding score and same-state check, the cascade applies the secondary acceptance rule: embedding ≥ 0.90 AND JW ≥ 0.92 AND same state → accept.
In the prototype, step 3 resolved 50 candidates (5.9%) via embedding auto-accept.
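The two acceptance rules in this step can be written as one small decision function. The thresholds are the ones stated above; the function name, the return labels, and the choice to pass the full-name JW score (as in the Moody example) are assumptions.

```python
def step3_decision(cosine: float, jw_full_name: float, same_state: bool) -> str:
    """Return "accept" or "continue" (pass the pair on to step 4)."""
    if not same_state:
        return "continue"
    if cosine >= 0.95:                           # primary auto-accept rule
        return "accept"
    if cosine >= 0.90 and jw_full_name >= 0.92:  # secondary acceptance rule
        return "accept"
    return "continue"


# Moody: embedding 0.930, full-name JW 0.95, both FL
print(step3_decision(0.930, 0.95, True))   # accept (secondary rule)
# Sharief: embedding 0.955 clears the primary rule on its own
print(step3_decision(0.955, 0.0, True))    # accept
# Williams Jr: embedding 0.862 is below both rules
print(step3_decision(0.862, 1.0, True))    # continue
```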
Step 4: LLM Confirmation (Cosine 0.35–0.95)
Pairs in the ambiguous zone — embedding score between 0.35 and 0.95 after passing the name gate — are sent to Claude Sonnet with full context.
Example: Charlie Crist vs. CRIST, CHARLES JOSEPH (Florida Governor, 2022).
The LLM prompt includes structured fields, not just raw names:
Candidate A:
raw: "Charlie Crist"
canonical_first: "Charles" (via nickname dictionary: Charlie → Charles)
last: "Crist"
suffix: null
state: FL, office: Governor, votes: 3,101,652
Candidate B:
raw: "CRIST, CHARLES JOSEPH"
canonical_first: "CHARLES"
last: "CRIST"
suffix: null
state: FL, office: Governor, votes: 3,101,652
Embedding cosine similarity: 0.451
The model responds:
Decision: match (confidence: 0.95)
“Charlie is a common nickname for Charles. Same state, same office, identical vote counts. The MEDSL record includes the middle name JOSEPH which the OpenElections record omits. These are the same person.”
Key elements the LLM uses that the embedding cannot:
- Nickname knowledge: Charlie is a nickname for Charles. The embedding model scored this at 0.451; the LLM recognizes the relationship immediately.
- Vote count identity: both records report 3,101,652 votes. Two different candidates in the same statewide race receiving identical seven-digit totals would be extremely unlikely.
- Office and state match: same governor’s race in the same state in the same election.
In the prototype, step 4 was invoked 30 times (3.5%). All 30 returned no-match — they were obvious non-matches that reached step 4 because the prototype lacked step 2.5. With the gate in place, the Crist-type cases (genuine ambiguity requiring LLM reasoning) are the intended workload for step 4.
Step 5: Tiebreaker — Stronger Model
When step 4 returns low confidence (below 0.70), the pair escalates to a stronger model (Opus-class). This handles cases where:
- The nickname is unusual and Sonnet is uncertain
- Vote counts differ slightly (rounding, provisional ballots)
- The candidate appears in adjacent districts and the geographic match is ambiguous
Step 5 was not triggered in our 200-record prototype. It is designed for scale, where the long tail of ambiguous cases grows. Budget is not a constraint — the stronger model costs ~10× more per call but is invoked only for the lowest-confidence subset of an already-small LLM cohort.
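The hand-off from step 4 to step 5 can be sketched as follows. The two model functions are stubs standing in for real API calls, and all names are hypothetical. One assumption goes beyond the text: the sketch reuses the 0.70 floor to decide when step 5 itself is unresolved and the pair goes to human review.

```python
from typing import Callable, Dict, Tuple

Verdict = Tuple[str, float]  # (decision, self-reported confidence)
Pair = Dict[str, str]


def resolve_ambiguous_pair(
    pair: Pair,
    primary: Callable[[Pair], Verdict],
    stronger: Callable[[Pair], Verdict],
    min_confidence: float = 0.70,
) -> Dict[str, object]:
    """Step 4 first; escalate to step 5 only when confidence is below 0.70."""
    decision, confidence = primary(pair)
    if confidence >= min_confidence:
        return {"decision": decision, "confidence": confidence, "step": 4}
    decision, confidence = stronger(pair)
    if confidence >= min_confidence:
        return {"decision": decision, "confidence": confidence, "step": 5}
    # Assumption: the same floor decides when step 5 is "not resolved".
    return {"decision": "human_review", "confidence": confidence, "step": 5}


# Stub models standing in for real API calls.
def confident_model(pair: Pair) -> Verdict:
    return ("match", 0.95)


def unsure_model(pair: Pair) -> Verdict:
    return ("match", 0.55)


crist = {"a": "Charlie Crist", "b": "CRIST, CHARLES JOSEPH"}
print(resolve_ambiguous_pair(crist, confident_model, unsure_model)["step"])  # 4
print(resolve_ambiguous_pair(crist, unsure_model, confident_model)["step"])  # 5
```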
The Full Flow
All candidate pairs within (state, office_level) block
│
Step 1: Exact match on (canonical_first, last, suffix)?
├─ YES (70%) → done
└─ NO (30%) → Step 2
Step 2: JW ≥ 0.92?
├─ YES (0.1%) → done
└─ NO (29.9%) → Step 2.5
Step 2.5: Last-name JW ≥ 0.50?
├─ NO (~24%) → skip pair
└─ YES (~6%) → Step 3
Step 3: Cosine ≥ 0.95 AND same state?
├─ YES (5.9%) → done
└─ NO (ambiguous) → Step 4
Step 4: LLM call (Claude Sonnet)
├─ High confidence match/no-match → done
└─ Low confidence (< 0.70) → Step 5
Step 5: Stronger model (Opus-class) → done
Cascade Properties
Speed. Steps 1, 2, and 2.5 are sub-millisecond per pair. Step 3 is a vector lookup (microseconds with FAISS). Step 4 is an API call (~500ms). Step 5 is a slower API call (~2s). The cascade processes 96%+ of pairs in under a millisecond.
Accuracy. Each step is calibrated to avoid false positives. Step 1 is exact. Step 2 is strict (0.92). Step 3 is very strict (0.95 AND same state). Steps 4 and 5 have full context including vote counts, office, and geography — signals no embedding model can use.
Reproducibility. Steps 1–3 are deterministic given the same input and embedding model. Steps 4–5 are non-deterministic but fully logged. Every prompt, response, and reasoning string is stored in the L3 decision log, enabling deterministic replay.
Auditability. A researcher who disagrees with any match can find the decision in the log, read the LLM’s reasoning, examine the embedding score, and override the decision. L4 can be re-run from the amended L3 output without re-running the entire pipeline.
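Replay from a decision log can be illustrated with a dictionary keyed by a hash of the prompt. This is a sketch under stated assumptions: the SHA-256 key, the in-memory dict, and the function names are illustrative, and the LLM is replaced by a stub. It does not attempt to model the hash chains mentioned in the prototype results.

```python
import hashlib
from typing import Callable, Dict, List


def prompt_key(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def decide(prompt: str, log: Dict[str, dict],
           call_llm: Callable[[str], dict]) -> dict:
    """Return the logged decision if one exists; otherwise call the model and log it."""
    key = prompt_key(prompt)
    if key not in log:
        log[key] = {"prompt": prompt, **call_llm(prompt)}
    return log[key]


api_calls: List[str] = []


def fake_llm(prompt: str) -> dict:
    api_calls.append(prompt)  # stub standing in for an API round-trip
    return {"decision": "match", "confidence": 0.95,
            "reasoning": "Charlie is a common nickname for Charles."}


decision_log: Dict[str, dict] = {}
first = decide("Charlie Crist vs CRIST, CHARLES JOSEPH", decision_log, fake_llm)
replayed = decide("Charlie Crist vs CRIST, CHARLES JOSEPH", decision_log, fake_llm)
print(first == replayed, len(api_calls))  # True 1
```

An override by a researcher would then be an edit to the logged entry, and any later run reads the amended decision instead of calling the model again.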
Real Test Cases from Real Data
Every entity resolution decision in this project is grounded in real candidate pairs from real election data. This chapter documents all pairs tested during prototype development, with actual embedding scores, LLM decisions, and the key signal that determined each outcome.
All embeddings use text-embedding-3-large (3,072 dimensions). All LLM decisions use Claude Sonnet. Ground truth was established by manual verification against official certified results.
The Full Test Table
| Name A | Name B | Cosine | LLM Decision | LLM Conf. | Ground Truth | Key Signal |
|---|---|---|---|---|---|---|
| Ron DeSantis | DESANTIS, RON | 0.729 | match | 0.98 | match | Nickname: Ron → Ronald |
| Charlie Crist | CRIST, CHARLES JOSEPH | 0.451 | match | 0.95 | match | Nickname: Charlie → Charles; identical votes |
| Robert Williams | Robert Williams Jr | 0.862 | no match | 0.85 | no match | Suffix: Jr indicates different person |
| Val Demings | VAL DEMINGS | 0.828 | match | 0.96 | match | Format difference only; middle initial absent |
| Marco Rubio | RUBIO, MARCO ANTONIO | 0.743 | match | 0.97 | match | Middle name present in one source only |
| Ashley Moody | MOODY, ASHLEY B | 0.930 | match | 0.98 | match | Middle initial added; same office/state |
| Nicole Fried | FRIED, NIKKI | 0.642 | match | 0.92 | match | Nickname: Nikki → Nicole |
| John Smith | SMITH, JOHN R | 0.672 | no match | 0.78 | no match | Common name; different offices, different counties |
| Robert Johnson | JOHNSON, ROBERT L | 0.644 | no match | 0.75 | no match | Common name; different states |
| Dale Holness | HOLNESS, DALE V.C. | 0.896 | match | 0.94 | match | Middle initials added; title prefix stripped |
| Barbara Sharief | SHARIEF, BARBARA J | 0.955 | match | 0.99 | match | Middle initial added; above auto-accept |
| Aramis Ayala | AYALA, ARAMIS D | 0.896 | match | 0.97 | match | Title prefix “State Attorney” stripped; middle initial |
How to Read This Table
- Cosine: Cosine similarity between text-embedding-3-large embeddings of the candidate composite strings. Scores in this table fall between 0.0 and 1.0. Higher means more similar.
- LLM Decision: The match/no-match output from Claude Sonnet when the pair was in the ambiguous zone (0.35–0.95).
- LLM Conf.: The model’s self-reported confidence in its decision. Range 0.0 to 1.0.
- Ground Truth: Manually verified against official certified election results. “match” means the two records refer to the same human being. “no match” means they do not.
- Key Signal: The distinguishing factor that makes this pair interesting for entity resolution testing.
Analysis by Category
Nickname Cases
Three pairs test the nickname problem — where one source uses a familiar name and the other uses the legal name:
| Pair | Cosine | Nickname Mapping |
|---|---|---|
| DeSantis | 0.729 | Ron → Ronald |
| Crist | 0.451 | Charlie → Charles |
| Fried | 0.642 | Nikki → Nicole |
Embedding scores range from 0.451 to 0.729 — all below the 0.95 auto-accept threshold. Without the LLM step, all three would be missed or would require an unsafely low accept threshold.
The Crist case is the most extreme. At 0.451, the embedding model is essentially saying “these look like different people.” The divergence comes from multiple compounding differences: different name ordering (first-last vs last-first), nickname vs legal name, middle name present in only one source, and different casing. The LLM resolves it using nickname knowledge and the identical vote count (3,101,652 in both sources).
After the nickname dictionary is applied at L1, canonical_first matches on all three pairs, and step 1 exact match handles them without any embedding or LLM call. The embedding scores reported here are without dictionary application — they demonstrate why the dictionary matters.
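A small sketch of that dictionary step. The three mappings are the ones in the table above; the function names, the case folding, and the lowercase key format are assumptions made for illustration.

```python
from typing import Optional, Tuple

# The three mappings from the nickname table above.
NICKNAMES = {"charlie": "charles", "ron": "ronald", "nikki": "nicole"}


def canonical_first(first: str) -> str:
    """Case-fold the first name and replace a known nickname with its legal form."""
    lowered = first.strip().casefold()
    return NICKNAMES.get(lowered, lowered)


def exact_key(first: str, last: str,
              suffix: Optional[str] = None) -> Tuple[str, str, Optional[str]]:
    return (canonical_first(first), last.strip().casefold(), suffix)


print(exact_key("Charlie", "Crist") == exact_key("CHARLES", "CRIST"))  # True
print(exact_key("Nicole", "Fried") == exact_key("NIKKI", "FRIED"))     # True
# The suffix stays in the key, so the dictionary cannot cause a Jr merge.
print(exact_key("Robert", "Williams") == exact_key("Robert", "Williams", "Jr"))  # False
```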
Middle Initial Cases
Five pairs test middle-initial handling — where one source includes a middle name or initial and the other does not:
| Pair | Cosine | Middle in Source A | Middle in Source B |
|---|---|---|---|
| Demings | 0.828 | null | null (format diff) |
| Rubio | 0.743 | null | ANTONIO |
| Moody | 0.930 | null | B |
| Sharief | 0.955 | null | J |
| Ayala | 0.896 | null | D |
Sharief at 0.955 is the only pair above the 0.95 auto-accept threshold. The remaining four fall in the ambiguous zone and require LLM confirmation. The LLM correctly identifies all as matches — the middle initial is additive information, not contradictory information.
Moody at 0.930 is the closest call below auto-accept. The difference between “Ashley Moody” and “MOODY, ASHLEY B” is a single middle initial and formatting. The secondary acceptance rule (embedding ≥ 0.90 AND JW on last name ≥ 0.92 AND same state) handles this case without an LLM call in the production cascade.
Suffix Cases
One pair tests the suffix problem:
| Pair | Cosine | Suffix A | Suffix B |
|---|---|---|---|
| Williams | 0.862 | null | Jr |
At 0.862, this pair would have been auto-accepted under the original threshold of ≥ 0.82. The LLM rejected it with 0.85 confidence, citing the generational distinction implied by “Jr.” This single case drove the threshold change from 0.82 to 0.95.
The asymmetry is the danger: one source includes the suffix, the other drops it. The embedding model sees “Robert Williams” and “Robert Williams Jr” as nearly identical strings, because “Jr” is a minor token. The structured suffix field at L1 is the signal that prevents the false merge.
Common Name Cases
Two pairs test the common-name problem — where two genuinely different people share a common name:
| Pair | Cosine | State A | State B | Office A | Office B |
|---|---|---|---|---|---|
| Smith | 0.672 | FL | FL | County Commission | School Board |
| Johnson | 0.644 | NC | FL | State House | County Clerk |
Both pairs are correctly rejected. The LLM’s confidence is lower (0.75–0.78) than on the match cases because common names are inherently ambiguous — the model cannot be certain these are different people, only that the evidence is insufficient for a match.
The Johnson case crosses state boundaries. The blocking strategy partitions by state, so this pair would never be compared in the production cascade. It is included in the test set to validate the cross-state rejection logic.
The Smith case is within the same state but different offices and counties. The LLM correctly reasons that two people named John Smith in different Florida counties holding different offices are most likely different individuals, despite the name match.
Format Difference Cases
Two pairs test pure formatting differences — same person, same name components, different string representations:
| Pair | Cosine | Format Difference |
|---|---|---|
| Holness | 0.896 | Middle initials V.C. added; “Commissioner” prefix stripped |
| Ayala | 0.896 | Middle initial D added; “State Attorney” prefix stripped |
Both score 0.896: the cosine similarity is identical even though the two pairs differ in different ways. Both are correctly matched. These cases validate that the L1 parser correctly strips title prefixes and that the embedding model handles the remaining differences (middle initials) gracefully.
Score Distribution
The 12 test pairs span the full range of embedding scores relevant to entity resolution:
| Score Range | Count | Matches | Non-Matches |
|---|---|---|---|
| ≥ 0.95 | 1 | 1 | 0 |
| 0.85–0.95 | 4 | 3 | 1 |
| 0.70–0.85 | 3 | 3 | 0 |
| 0.50–0.70 | 3 | 1 | 2 |
| < 0.50 | 1 | 1 | 0 |
The Williams Jr pair at 0.862 is the only false-positive risk — a non-match scoring above 0.85. The Crist pair at 0.451 is the only false-negative risk — a true match scoring below 0.50. These two cases define the boundary conditions of the cascade and drove the threshold calibration described in Threshold Calibration.
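The bucket counts can be tallied directly from the cosine scores and ground-truth labels in the full test table. The sketch below copies those twelve rows; the bucket boundaries treat each range as lower-inclusive and upper-exclusive, which is an assumption (no score in the set sits exactly on a boundary).

```python
from typing import Dict, List, Tuple

# (pair, cosine, ground truth), copied from the full test table.
PAIRS: List[Tuple[str, float, str]] = [
    ("DeSantis", 0.729, "match"), ("Crist", 0.451, "match"),
    ("Williams", 0.862, "no match"), ("Demings", 0.828, "match"),
    ("Rubio", 0.743, "match"), ("Moody", 0.930, "match"),
    ("Fried", 0.642, "match"), ("Smith", 0.672, "no match"),
    ("Johnson", 0.644, "no match"), ("Holness", 0.896, "match"),
    ("Sharief", 0.955, "match"), ("Ayala", 0.896, "match"),
]

# (label, inclusive lower bound, exclusive upper bound)
BUCKETS = [
    (">= 0.95", 0.95, 1.01),
    ("0.85-0.95", 0.85, 0.95),
    ("0.70-0.85", 0.70, 0.85),
    ("0.50-0.70", 0.50, 0.70),
    ("< 0.50", 0.00, 0.50),
]


def tally(pairs: List[Tuple[str, float, str]]) -> Dict[str, Tuple[int, int, int]]:
    """Map each bucket label to (count, matches, non-matches)."""
    table = {}
    for label, low, high in BUCKETS:
        truths = [truth for _, score, truth in pairs if low <= score < high]
        table[label] = (len(truths), truths.count("match"), truths.count("no match"))
    return table


for label, row in tally(PAIRS).items():
    print(label, row)
```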
LLM Accuracy
Across all 12 test pairs:
| Metric | Value |
|---|---|
| Total pairs tested | 12 |
| LLM correct decisions | 12 |
| LLM accuracy | 100% |
| Average confidence (matches) | 0.962 |
| Average confidence (non-matches) | 0.793 |
| Lowest confidence on a correct match | 0.92 (Fried) |
| Lowest confidence on a correct non-match | 0.75 (Johnson) |
The confidence gap between matches (avg 0.962) and non-matches (avg 0.793) is expected. The LLM is more certain when confirming a match (multiple corroborating signals: same state, same office, similar vote counts, plausible name relationship) than when rejecting one (absence of evidence, not evidence of absence).
What These Tests Do Not Cover
The 12 test pairs are a calibration set, not a validation set. They do not cover:
- Spanish-language names — Hyphenated surnames, maternal/paternal name ordering
- Transliterated names — Arabic, Chinese, Vietnamese, and Korean names rendered in English with varying romanization
- Unisex names — Cases where a shared name belongs to candidates of different genders
- Candidates who changed names — Marriage, legal name change
- Intentional name variations — Candidates who use different names in different elections
These gaps are documented as known limitations. The test set will expand as entity resolution is validated at scale.
Cross-References
- Entity Resolution Overview — the cascade that processes these pairs
- The Cascade: Step by Step — detailed walkthrough of each step
- Threshold Calibration — how these scores drove the threshold changes
Threshold Calibration
Embedding similarity thresholds determine which candidate pairs auto-accept, which enter the LLM zone, and which auto-reject. These thresholds are not universal constants — they are calibrated to a specific embedding model (text-embedding-3-large, 3,072 dimensions) using real test data from our prototype.
Two findings from early testing forced a complete recalibration.
The Two Findings
Robert Williams Jr at 0.862 — a false positive. Under the original thresholds, any pair scoring ≥ 0.82 was auto-accepted. “Robert Williams” and “Robert Williams Jr” scored 0.862 — above the threshold. The system would have silently merged father and son into one entity. The suffix “Jr” carries categorical meaning (different person), but the embedding model treats it as a minor token appended to an otherwise identical string.
Charlie Crist at 0.451 — a false negative. Under the original thresholds, any pair scoring < 0.65 was auto-rejected. “Charlie Crist” and “CRIST, CHARLES JOSEPH” scored 0.451 — below the threshold. The system would have discarded a true match. The nickname “Charlie” for “Charles”, combined with different name ordering, different casing, and an extra middle name, pushed the score well below the reject boundary.
Both errors are unacceptable. Merging different people corrupts every downstream analysis. Missing true matches fragments candidate records across sources, breaking cross-source reconciliation and career tracking.
Old vs. New Thresholds
| Zone | Old Range | New Range | Change |
|---|---|---|---|
| Auto-accept | ≥ 0.82 | ≥ 0.95 AND same state | Raised by 0.13, added state constraint |
| Ambiguous (LLM zone) | 0.65–0.82 | 0.35–0.95 AND same state | Widened from 0.17 to 0.60 range |
| Auto-reject | < 0.65 | < 0.35 OR different state | Lowered by 0.30, added state escape |
The ambiguous zone expanded from a 0.17-wide band to a 0.60-wide band. This means far more pairs are routed to the LLM for confirmation.
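The routing in the table above can be sketched as a small function. This is a minimal illustration rather than the pipeline's actual code: the `Zone` enum and `route_pair` are hypothetical names, and boundary handling simply follows the ≥ and < signs in the New Range column.

```python
from enum import Enum


class Zone(Enum):
    AUTO_ACCEPT = "auto_accept"
    LLM = "llm"
    AUTO_REJECT = "auto_reject"


# New thresholds from the table (calibrated for text-embedding-3-large).
AUTO_ACCEPT_MIN = 0.95
AUTO_REJECT_MAX = 0.35


def route_pair(cosine: float, same_state: bool) -> Zone:
    """Route one candidate pair using the new threshold table."""
    if not same_state:
        # "OR different state" in the auto-reject row.
        return Zone.AUTO_REJECT
    if cosine >= AUTO_ACCEPT_MIN:
        return Zone.AUTO_ACCEPT
    if cosine < AUTO_REJECT_MAX:
        return Zone.AUTO_REJECT
    return Zone.LLM


# The two pairs that forced recalibration both land in the LLM zone.
williams_jr = route_pair(0.862, same_state=True)
crist = route_pair(0.451, same_state=True)
```

The state check runs first so that a high-scoring cross-state pair can never reach the auto-accept branch.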
What Each Change Addresses
Auto-accept raised to 0.95
The Williams Jr pair at 0.862 demonstrated that scores in the 0.82–0.95 range can contain suffix-bearing false positives. At 0.95, the only pairs that auto-accept are near-identical strings with trivial formatting differences. Even “Ashley Moody” vs “MOODY, ASHLEY B” at 0.930 does not auto-accept; it enters the LLM zone, where the model confirms the match using full context.
The same-state constraint is an additional guard. A candidate for county sheriff in Maine should never auto-match with a candidate for county sheriff in Florida, regardless of embedding score. Different-state pairs never auto-accept; per the threshold table, they are auto-rejected.
Ambiguous zone widened to 0.35–0.95
The Crist pair at 0.451 sat in the old auto-reject zone. The new lower bound of 0.35 captures every pair we tested, including the nickname cases:
| Pair | Cosine | Old Zone | New Zone |
|---|---|---|---|
| DeSantis / DESANTIS, RON | 0.729 | Ambiguous | Ambiguous |
| Crist / CRIST, CHARLES JOSEPH | 0.451 | Reject | Ambiguous |
| Nicole Fried / FRIED, NIKKI | 0.642 | Reject | Ambiguous |
| Williams / Williams Jr | 0.862 | Accept | Ambiguous |
| Val Demings / VAL DEMINGS | 0.828 | Accept | Ambiguous |
| Marco Rubio / RUBIO, MARCO ANTONIO | 0.743 | Ambiguous | Ambiguous |
| Ashley Moody / MOODY, ASHLEY B | 0.930 | Accept | Ambiguous |
| Dale Holness / HOLNESS, DALE V.C. | 0.896 | Accept | Ambiguous |
Under the old thresholds, 3 of 8 pairs were misclassified: 1 false accept (Williams Jr) and 2 false rejects (Crist and Fried). Under the new thresholds, all 8 enter the LLM zone where the model resolves them correctly.
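Replaying the eight pairs from the table through both threshold sets makes the comparison concrete. The sketch below assumes every pair is same-state and takes the ground truth from the test results (only Williams / Williams Jr is a non-match); the helper names are illustrative.

```python
# (label, cosine, is_true_match) for the eight pairs in the table above.
PAIRS = [
    ("DeSantis", 0.729, True),
    ("Crist", 0.451, True),
    ("Fried", 0.642, True),
    ("Williams Jr", 0.862, False),
    ("Demings", 0.828, True),
    ("Rubio", 0.743, True),
    ("Moody", 0.930, True),
    ("Holness", 0.896, True),
]


def zone(cosine, accept_min, reject_max):
    """Classify a score as accept, reject, or ambiguous (LLM zone)."""
    if cosine >= accept_min:
        return "accept"
    if cosine < reject_max:
        return "reject"
    return "ambiguous"


def misrouted(pairs, accept_min, reject_max):
    """Labels of pairs decided wrongly without ever reaching the LLM."""
    errors = []
    for label, cosine, is_match in pairs:
        decision = zone(cosine, accept_min, reject_max)
        false_accept = decision == "accept" and not is_match
        false_reject = decision == "reject" and is_match
        if false_accept or false_reject:
            errors.append(label)
    return errors


old_errors = misrouted(PAIRS, accept_min=0.82, reject_max=0.65)
new_errors = misrouted(PAIRS, accept_min=0.95, reject_max=0.35)
```

With the old boundaries `old_errors` contains three labels; with the new boundaries `new_errors` is empty because every pair is deferred to the LLM.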
Auto-reject lowered to 0.35
Below 0.35, no tested pair in our prototype was a true match. At this score range, the names share almost no surface similarity — they are genuinely different people who happen to share a blocking key.
The different-state escape allows immediate rejection of cross-state pairs regardless of score. Local officeholders do not appear in multiple states. (Federal candidates can, but they are handled by a separate federal-office pathway that does not use this threshold table.)
The Cost of a Wider Ambiguous Zone
The old ambiguous zone (0.65–0.82) captured roughly 5% of within-block pairs. The new zone (0.35–0.95) captures roughly 25% — a 5× increase in LLM calls.
At prototype scale (200 records), this is negligible. At production scale (42 million rows), the increase matters for throughput but not for budget. Budget is not a constraint. The step 2.5 name gate (JW < 0.50 on last names → skip) eliminates the majority of low-score pairs before they reach the LLM, keeping the actual call volume manageable.
The wider zone is a deliberate trade: more LLM calls in exchange for zero false accepts and zero false rejects in the tested range.
Thresholds Are Model-Specific
These thresholds are calibrated for text-embedding-3-large with 3,072 dimensions. A different model — even an updated version of the same model — will produce different similarity distributions. If the embedding model changes:
- Re-run the test cases against the new model.
- Plot the score distribution for known matches and known non-matches.
- Recalibrate auto-accept, ambiguous, and auto-reject boundaries.
- Store the new thresholds alongside the model identifier in L2 metadata.
The embedding_model field in every L2 record ensures that thresholds can always be traced to the model that produced the scores.
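One way to keep thresholds tied to their model is to store them as a single record and check that record against labeled scores whenever the model changes. The sketch below is an assumption about how that could look: `ThresholdSet` and `is_safe` are hypothetical names, and the score lists contain only pairs quoted in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdSet:
    embedding_model: str
    dimensions: int
    auto_accept_min: float
    auto_reject_max: float


def is_safe(thresholds, match_scores, non_match_scores):
    """True if no known non-match auto-accepts and no known match auto-rejects."""
    no_false_accept = all(s < thresholds.auto_accept_min for s in non_match_scores)
    no_false_reject = all(s >= thresholds.auto_reject_max for s in match_scores)
    return no_false_accept and no_false_reject


OLD = ThresholdSet("text-embedding-3-large", 3072, 0.82, 0.65)
NEW = ThresholdSet("text-embedding-3-large", 3072, 0.95, 0.35)

# Known matches and non-matches with scores quoted in this chapter.
MATCHES = [0.451, 0.642, 0.729, 0.743, 0.828, 0.896, 0.930]
NON_MATCHES = [0.862]

old_ok = is_safe(OLD, MATCHES, NON_MATCHES)
new_ok = is_safe(NEW, MATCHES, NON_MATCHES)
```

Running the same check against scores from a replacement model would show immediately whether the stored boundaries still hold.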
Summary
| Principle | Implementation |
|---|---|
| Never auto-accept a suffix mismatch | Threshold raised to 0.95; suffixes always enter LLM zone |
| Never auto-reject a nickname match | Threshold lowered to 0.35; nicknames always enter LLM zone |
| Cross-state pairs never auto-accept | Same-state constraint on auto-accept; different-state pairs auto-reject |
| Wider zone is acceptable | Budget is not a constraint; accuracy is |
| Thresholds are not portable | Model version stored in every record |
Cross-References
- Real Test Cases — all pairs that informed calibration
- Suffixes: Jr/Sr Means Different People — the Williams Jr finding
- Nicknames and Middle Initials — the Crist finding
- Budget Is Not a Constraint — why the wider zone is acceptable
Non-Candidate Records
Not every row in an election results file is a candidate. Sources routinely embed turnout metadata, ballot measure choices, vote quality indicators, and aggregation artifacts alongside candidate results — using the same columns, the same format, and no reliable flag to distinguish them.
If your system treats every row as a candidate, you will create entity records for people named “Registered Voters”, “For”, “BLANK”, and “TOTAL VOTES”. The L4 LLM audit in our prototype caught exactly this: “For” and “Against” were classified as person entities. They are not people.
The Four Categories
1. Turnout Metadata
Rows recording registration and participation counts at the precinct level:
| Pseudo-candidate | Meaning | Source |
|---|---|---|
| Registered Voters | Total registered voters in precinct | FL OpenElections, NC SBE |
| Ballots Cast | Total ballots submitted | FL OpenElections, NC SBE |
| Cards Cast | Total ballot cards (may differ from ballots in multi-card elections) | FL OpenElections |
Florida OpenElections is the most prolific source. Of the “other” records in our FL 2022 ingest, 6,013 rows are “Registered Voters” — accounting for 67.9% of all non-candidate records in that source. These are not errors in the source data. They are genuine turnout figures published alongside contest results in the same file format.
2. Ballot Measure Choices
Rows representing choices on referenda, bond issues, and constitutional amendments:
| Pseudo-candidate | Meaning | Source |
|---|---|---|
| For | Yes vote on ballot measure | OpenElections, MEDSL |
| Against | No vote on ballot measure | OpenElections, MEDSL |
| Yes | Yes vote on ballot measure | NC SBE, MEDSL |
| No | No vote on ballot measure | NC SBE, MEDSL |
These are legitimate vote counts — but the “candidate” is not a person. Detection requires examining both the candidate name (a single common word) and the contest name (bond, referendum, amendment, proposition). See Ballot Measure Choices.
3. Vote Quality Indicators
Rows recording ballots that did not produce a valid vote for any candidate:
| Pseudo-candidate | Meaning | Source |
|---|---|---|
| Over Votes | Voter selected more candidates than allowed | MEDSL, NC SBE |
| Under Votes | Voter selected fewer candidates than allowed | MEDSL, NC SBE |
| BLANK | No selection made (Maine’s term for undervote) | MEDSL (ME) |
| Write-in | Aggregate write-in count (no specific candidate) | Multiple sources |
Over votes and under votes are important data quality signals. A contest with 15% over votes may indicate a confusing ballot design. But they are not candidates and must not be counted as such.
4. Aggregation Artifacts
Rows that are computational summaries, not individual results:
| Pseudo-candidate | Meaning | Source |
|---|---|---|
| TOTAL VOTES | Sum of all candidates in the contest | MEDSL (UT) |
| Scattering | Aggregate of write-in candidates below reporting threshold | MEDSL (IA, MN) |
| TOTAL | Another sum variant | OpenElections |
These rows are redundant with the candidate-level data. Including them double-counts votes and inflates totals.
The Detection Strategy
Non-candidate records are detected at L1 — the earliest possible point. The principle is extract before filter: non-candidate rows often contain valuable information (registered voter counts, undervote rates) that should be captured in the correct schema object before the row is excluded from contest analysis.
Detection uses a three-part check:
- Exact match on candidate name. A lookup table of ~40 known pseudo-candidate strings: “Registered Voters”, “Ballots Cast”, “Over Votes”, “Under Votes”, “BLANK”, “TOTAL VOTES”, “Scattering”, “For”, “Against”, “Yes”, “No”, etc.
- Contest name pattern. For ambiguous names like “For” and “Against”, check whether the contest name contains ballot measure keywords: bond, referendum, amendment, proposition, measure, question, initiative, charter.
- Source-specific rules. Some sources use unique pseudo-candidates. Maine uses “BLANK”. Iowa uses “Scattering”. Utah includes “TOTAL VOTES” rows. Each source parser knows its own ghosts.
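The three-part check might be sketched as follows. The lookup tables are abbreviated, the category labels and the `classify_row` function are hypothetical, and keying source rules by state code is an assumption made for brevity.

```python
# Part 1: exact-match lookup (abbreviated; the real table has ~40 entries).
PSEUDO_CANDIDATES = {
    "registered voters": "turnout_metadata",
    "ballots cast": "turnout_metadata",
    "cards cast": "turnout_metadata",
    "over votes": "vote_quality",
    "under votes": "vote_quality",
    "write-in": "vote_quality",
    "total": "aggregation_artifact",
}

# Part 2: ambiguous names that need a ballot-measure contest to confirm.
MEASURE_CHOICES = {"for", "against", "yes", "no"}
MEASURE_KEYWORDS = (
    "bond", "referendum", "amendment", "proposition",
    "measure", "question", "initiative", "charter",
)

# Part 3: source-specific pseudo-candidates, keyed by state (assumption).
SOURCE_RULES = {
    "ME": {"blank": "vote_quality"},
    "IA": {"scattering": "aggregation_artifact"},
    "UT": {"total votes": "aggregation_artifact"},
}


def classify_row(candidate_name: str, contest_name: str, state: str) -> str:
    name = candidate_name.strip().lower()
    contest = contest_name.lower()
    if name in SOURCE_RULES.get(state, {}):
        return SOURCE_RULES[state][name]
    if name in PSEUDO_CANDIDATES:
        return PSEUDO_CANDIDATES[name]
    if name in MEASURE_CHOICES and any(k in contest for k in MEASURE_KEYWORDS):
        return "ballot_measure_choice"
    # Anything else stays in the candidate pipeline.
    return "candidate"
```

Plain substring matching on the contest name is the simplest reading of "contains"; a production version would likely need word-boundary handling.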
Routing
Detected non-candidate records are routed to the appropriate schema object:
| Category | Route to | Schema type |
|---|---|---|
| Turnout metadata | TurnoutMetadata | Attached to sibling precinct records |
| Ballot measure choices | BallotMeasure | MeasureChoice with For/Against/Yes/No |
| Vote quality indicators | VoteQuality | Attached to parent contest record |
| Aggregation artifacts | Discarded | Redundant with candidate-level sums |
Records routed to TurnoutMetadata and VoteQuality are preserved in the L1 output — they are valuable data, just not candidate data. Aggregation artifacts are discarded with a note in the cleaning report.
What Happens Without Detection
If non-candidate rows pass through to L2 and L3:
- “Registered Voters” gets an embedding vector, a candidate entity ID, and appears in 6,013 precinct-level records as the most prolific “candidate” in Florida.
- “For” and “Against” become person entities. The L4 LLM audit flagged exactly this in our prototype: “‘For’ is not a plausible person name.”
- “TOTAL VOTES” inflates vote counts when aggregated, because the total row is summed alongside the individual candidate rows.
- “Over Votes” appears as a candidate who received votes in every contest — the busiest politician in America.
Detection at L1 prevents all of these downstream errors.
Sub-Chapters
- Registered Voters, Ballots Cast, Over/Under Votes — turnout and vote quality rows, the “extract before filter” principle
- Ballot Measure Choices: For/Against/Yes/No — detecting ballot measures from candidate name + contest name patterns
Registered Voters, Ballots Cast, Over/Under Votes
Some election data files embed turnout metadata and vote-quality indicators directly alongside candidate results. A row labeled “Registered Voters” is not a contest — it is a count of eligible voters in that precinct. A row labeled “Over Votes” is not a candidate — it is a count of ballots where the voter marked too many choices.
These rows are valuable. They are also poison if treated as candidates.
The Four Categories
| Label | What it means | Found in |
|---|---|---|
| Registered Voters | Eligible voters in the precinct | NC SBE, FL OpenElections |
| Ballots Cast | Ballots submitted (any contest) | NC SBE, some MEDSL records |
| Over Votes | Ballots with too many selections for a contest | NC SBE, ME, UT |
| Under Votes | Contests where the voter made no selection | NC SBE, ME, UT |
NC SBE includes all four in every precinct file. MEDSL includes over/under votes for some states but not others. OpenElections varies by state and contributor. There is no standard.
The Extract-Before-Filter Principle
The instinct is to filter these rows out immediately — they are not candidates, so drop them. This is wrong. The registered voter count is the denominator for turnout computation. Dropping it before extraction destroys the only turnout signal available at the precinct level.
The correct sequence:
1. Detect the row by candidate name pattern (`Registered Voters`, `BALLOTS CAST`, `OVER VOTES`, `UNDER VOTES`, `BLANK`).
2. Extract the value into the appropriate field on sibling contest records in the same precinct.
3. Route the row to `TurnoutMetadata` contest kind, not `CandidateRace`.
4. Exclude the row from candidate-level analysis (margins, competitiveness, entity resolution).
Step 2 is the key. The registered voter count attaches to every contest in the same precinct as a turnout.registered_voters field. The ballots cast count becomes turnout.ballots_cast. Only after extraction is the metadata row itself reclassified.
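A minimal sketch of the extract-before-filter sequence for the rows of one precinct, assuming each row is a dict with hypothetical `contest`, `candidate`, and `votes` keys. The function name and the sample rows are illustrative, not taken from the pipeline.

```python
TURNOUT_LABELS = {
    "registered voters": "registered_voters",
    "ballots cast": "ballots_cast",
}


def extract_then_filter(precinct_rows):
    """Return (metadata_rows, contest_rows) for a single precinct."""
    turnout = {"registered_voters": None, "ballots_cast": None}
    metadata_rows, contest_rows = [], []
    for row in precinct_rows:
        field = TURNOUT_LABELS.get(row["candidate"].strip().lower())
        if field is not None:
            turnout[field] = row["votes"]                            # step 2: extract
            metadata_rows.append({**row, "kind": "turnout_metadata"})  # step 3: route
        else:
            contest_rows.append(row)
    # Attach the extracted counts to every sibling contest record;
    # metadata rows are kept out of this list (step 4: exclude).
    enriched = [{**row, "turnout": dict(turnout)} for row in contest_rows]
    return metadata_rows, enriched


rows = [
    {"contest": "BOARD OF EDUCATION", "candidate": "Jane Doe", "votes": 812},
    {"contest": "REGISTERED VOTERS - TOTAL", "candidate": "Registered Voters", "votes": 4217},
]
metadata, contests = extract_then_filter(rows)
```

Attaching the counts only after the loop finishes means the metadata row can appear anywhere in the file, including after the contests it applies to.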
NC SBE Row Format
In the NC SBE precinct results file (results_pct_20221108.txt), a registered voter row looks like:
| Column | Value |
|---|---|
| Contest Name | REGISTERED VOTERS - TOTAL |
| Choice | (empty) |
| Choice Party | (empty) |
| Total Votes | 4,217 |
| Election Day | 4,217 |
| One Stop | 0 |
| Absentee by Mail | 0 |
| Provisional | 0 |
The “Total Votes” column contains the registered voter count, not a vote total. The vote-type breakdown is meaningless (registered voters do not have an election-day vs. early split). L1 extracts 4,217 into turnout.registered_voters for precinct P17 in Columbus County, then classifies this row as TurnoutMetadata.
The corresponding L1 output:
    {
      "contest": {
        "kind": "turnout_metadata",
        "raw_name": "REGISTERED VOTERS - TOTAL"
      },
      "results": [{
        "candidate_name": { "raw": "Registered Voters" },
        "votes_total": 4217
      }],
      "turnout": {
        "registered_voters": 4217
      }
    }
Sibling contest records in the same precinct (e.g., the school board race) receive:
    {
      "turnout": {
        "registered_voters": 4217,
        "ballots_cast": null
      }
    }
Scale of the Problem
In the Florida OpenElections dataset, 6,013 rows are labeled “Registered Voters” — representing 67.9% of all non-candidate records in that file. Without detection, these rows enter the candidate pipeline as if “Registered Voters” were a person running for office. The L4 LLM audit flagged exactly this pattern in our prototype.
Over Votes and Under Votes are less numerous but equally disruptive. Maine labels its under votes as BLANK. Utah includes TOTAL VOTES aggregation rows. Each source has its own vocabulary for the same concept.
Detection Rules
L1 applies pattern matching on the candidate name field before any other processing:
| Pattern | Classification | Action |
|---|---|---|
| `registered voters` | TurnoutMetadata | Extract to turnout.registered_voters |
| `ballots cast` | TurnoutMetadata | Extract to turnout.ballots_cast |
| `over ?votes?` | TurnoutMetadata | Extract to turnout.over_votes |
| `under ?votes?` | TurnoutMetadata | Extract to turnout.under_votes |
| `^blank$` | TurnoutMetadata | Extract to turnout.under_votes (ME) |
| `total votes` | TurnoutMetadata | Discard (aggregation artifact) |
| `scattering` | TurnoutMetadata | Extract to turnout.write_in_scattering (IA) |
These patterns are checked case-insensitively. They run as the first operation in the L1 pipeline — before name decomposition, before office classification, before FIPS enrichment. A row that matches is routed immediately and never enters the candidate pipeline.
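The rule table translates fairly directly into an ordered list of compiled patterns. This sketch assumes the unanchored patterns are applied with `re.search`; `match_rule` is a hypothetical helper, and a target of `None` stands for "discard".

```python
import re

# (pattern, extraction target) in the same order as the table above.
RULES = [
    (re.compile(r"registered voters", re.IGNORECASE), "turnout.registered_voters"),
    (re.compile(r"ballots cast", re.IGNORECASE), "turnout.ballots_cast"),
    (re.compile(r"over ?votes?", re.IGNORECASE), "turnout.over_votes"),
    (re.compile(r"under ?votes?", re.IGNORECASE), "turnout.under_votes"),
    (re.compile(r"^blank$", re.IGNORECASE), "turnout.under_votes"),
    (re.compile(r"total votes", re.IGNORECASE), None),
    (re.compile(r"scattering", re.IGNORECASE), "turnout.write_in_scattering"),
]


def match_rule(candidate_name: str):
    """Return (matched, target) for the first rule that fires."""
    name = candidate_name.strip()
    for pattern, target in RULES:
        if pattern.search(name):
            return True, target
    return False, None
```

A row where `matched` is true never reaches name decomposition; a row where it is false continues into the candidate pipeline unchanged.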
Ballot Measure Choices: For/Against/Yes/No
When a row in an election results file has “For” as the candidate name, it could mean two things: a person whose legal name is “For” (implausible), or a choice on a ballot measure (almost certain). The distinction cannot be made from the candidate name alone — it requires examining the contest name alongside it.
The Problem
Ballot measures appear in election data using the same schema as candidate races. The “candidate” column holds “For”, “Against”, “Yes”, or “No”. The “contest” column holds something like “BOND REFERENDUM - SCHOOL CONSTRUCTION” or “CONSTITUTIONAL AMENDMENT 3”. Nothing in the file format distinguishes a ballot measure from a candidate race.
Real examples from MEDSL 2022:
| Contest Name | Candidate Name | Votes | What It Actually Is |
|---|---|---|---|
| CONSTITUTIONAL AMENDMENT 1 | For | 1,847,312 | Ballot measure choice |
| BOND REFERENDUM COLUMBUS COUNTY SCHOOLS | Against | 4,219 | Ballot measure choice |
| COUNTY SALES TAX REFERENDUM | Yes | 31,408 | Ballot measure choice |
| CHARTER AMENDMENT - TERM LIMITS | No | 12,773 | Ballot measure choice |
If these rows enter the candidate pipeline, “For” becomes a person entity. “For” then appears in entity resolution, gets a candidate_entity_id, and shows up in the L4 canonical export as the most prolific politician in America — winning thousands of races across every state and every office level.
The L4 Audit Discovery
In our prototype, the L4 LLM entity audit examined 50 entities for plausibility. Among the 4 errors it identified:
“‘For’ is not a plausible person name. This entity appears across 347 contests in 12 states, always in contest names containing ‘amendment’, ‘bond’, ‘referendum’, or ‘proposition’. These are ballot measure choices, not candidates.”
The audit correctly identified the contamination. But detecting it at L4 is too late — the bad entity has already propagated through L2 embeddings and L3 matching. The fix is detection at L1.
Detection Logic
A candidate name of “For”, “Against”, “Yes”, or “No” is ambiguous in isolation. These are common English words, and while no real candidate in our dataset is named “For”, names like “Yes” are theoretically possible. The detection requires both signals:
Signal 1: Candidate name pattern. The candidate name is one of a small set of ballot measure choice words:
| Candidate Name | Treated as a ballot measure choice word? |
|---|---|
| For | Yes |
| Against | Yes |
| Yes | Yes |
| No | Yes |
| Bonds Yes | Yes |
| Bonds No | Yes |
| For the Tax Levy | Yes |
| Against the Tax Levy | Yes |
Signal 2: Contest name pattern. The contest name contains one or more ballot measure keywords:
| Keyword | Example Contest Name |
|---|---|
| amendment | CONSTITUTIONAL AMENDMENT 1 |
| bond | BOND REFERENDUM COLUMBUS COUNTY SCHOOLS |
| referendum | COUNTY SALES TAX REFERENDUM |
| proposition | PROPOSITION 30 - TAX ON INCOME |
| measure | MEASURE A - PARCEL TAX |
| initiative | INITIATIVE 82 - TIPPED WAGES |
| question | BALLOT QUESTION 4 |
| charter | CHARTER AMENDMENT - TERM LIMITS |
| levy | RENEWAL 2.0 MILL LEVY - FIRE |
| issue | ISSUE 1 - REPRODUCTIVE RIGHTS |
Both signals must be present. A candidate named “For” in a contest called “COUNTY COMMISSIONER” would not trigger ballot measure detection — it would be flagged as a data quality anomaly for manual review. A candidate named “John Smith” in a contest called “BOND REFERENDUM” is not a ballot measure choice — the candidate name does not match the pattern.
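The two-signal rule can be sketched as below. The word lists mirror the two tables; `classify` and its return labels are hypothetical, and substring matching on the contest name is a simplifying assumption.

```python
CHOICE_WORDS = ("for", "against", "yes", "no", "bonds yes", "bonds no")
CONTEST_KEYWORDS = (
    "amendment", "bond", "referendum", "proposition", "measure",
    "initiative", "question", "charter", "levy", "issue",
)


def has_choice_name(candidate: str) -> bool:
    """Signal 1: the name is a choice word or starts with one ("For the Tax Levy")."""
    name = candidate.strip().lower()
    return any(name == w or name.startswith(w + " ") for w in CHOICE_WORDS)


def has_measure_contest(contest: str) -> bool:
    """Signal 2: the contest name contains a ballot measure keyword."""
    lowered = contest.lower()
    return any(k in lowered for k in CONTEST_KEYWORDS)


def classify(candidate: str, contest: str) -> str:
    name_signal = has_choice_name(candidate)
    contest_signal = has_measure_contest(contest)
    if name_signal and contest_signal:
        return "ballot_measure"
    if name_signal:
        # A choice word in an ordinary office contest: flag for manual review.
        return "anomaly_review"
    return "candidate_race"
```

Requiring both signals keeps a real candidate in a referendum-named contest out of the ballot measure path, at the cost of sending the rare bare "For" in an office race to review.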
Routing
When both signals match, L1 routes the record to BallotMeasure contest kind with a MeasureChoice result type instead of CandidateResult:
    {
      "contest": {
        "kind": "ballot_measure",
        "raw_name": "BOND REFERENDUM COLUMBUS COUNTY SCHOOLS",
        "office_level": "school_district",
        "measure_type": "bond"
      },
      "results": [
        {
          "measure_choice": "against",
          "votes_total": 4219,
          "vote_counts_by_type": {
            "election_day": 2107,
            "early": 1891,
            "absentee_mail": 198,
            "provisional": 23
          }
        }
      ]
    }
The measure_choice field replaces candidate_name. No name decomposition is performed (there is no first, middle, last, or suffix for “Against”). No entity resolution is needed — “For” in one contest is not the same entity as “For” in another contest. No embedding is generated.
Edge Cases
“For the Tax Levy” vs “For.” Some sources use complete phrases like “For the Tax Levy” rather than bare “For”. The pattern match checks for the prefix, not exact equality.
Mixed contests. A small number of records have both candidate names and ballot measure choices in the same contest. This occurs when a source reports write-in votes alongside measure choices. The L1 parser handles each row independently — “For” is routed to BallotMeasure, while “Write-in” in the same contest is routed to TurnoutMetadata.
Retention elections. Judicial retention elections ask “Shall Judge X be retained?” with choices “Yes” and “No.” These are structurally ballot measures but semantically candidate races — the “candidate” is the judge. L1 classifies these as BallotMeasure with an additional retention_candidate field preserving the judge’s name from the contest string. This is an area where the boundary between candidate races and ballot measures is genuinely blurred.
Scale
Ballot measure records account for approximately 3–5% of total rows in MEDSL 2022, varying by state. States with frequent ballot initiatives (California, Oregon, Colorado) have higher proportions. Failing to detect them does not just create bad entities — it inflates the count of “candidates” and distorts competitiveness metrics. A bond referendum with 51% “For” and 49% “Against” is not an uncontested race with one candidate named “For.”
Contest Disambiguation
Three distinct problems hide under one label: the same office name can mean different races, the same race can have different names, and some races elect multiple winners. Each breaks a different assumption in the pipeline.
Problem 1: Same Office Name, Different Races
Harris County, Texas elects 25 district court judges. Every one of them appears in the data as DISTRICT COURT JUDGE. Without the district column, all 25 races collapse into a single contest — 25 winners, 50+ candidates, and no way to compute margins or determine who ran against whom.
The distinguishing field varies by source:
| Source | Office name | Distinguishing field | Example value |
|---|---|---|---|
| MEDSL | DISTRICT COURT JUDGE | district | 127TH |
| NC SBE | DISTRICT COURT JUDGE DISTRICT 13B SEAT 02 | Embedded in contest name | 13B SEAT 02 |
| OpenElections | District Court Judge | Separate district column | 127 |
MEDSL separates the seat identifier into a dedicated column. NC SBE concatenates it into the contest name string. OpenElections does both, inconsistently, depending on the state contributor.
The L1 parser must extract the seat identifier regardless of where it appears. The contest entity key is (state, county, office_name, district, seat) — not just (state, county, office_name). Omitting district or seat merges distinct races.
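As an illustration, the sketch below builds the five-part key from either layout: a separate district column (MEDSL style) or an identifier embedded in the contest name (NC SBE style). The regular expression covers only the `DISTRICT <id> SEAT <n>` shape shown above, `contest_key` is a hypothetical helper, and the county passed for the NC row is a placeholder.

```python
import re

# Matches "<office> DISTRICT <id>" with an optional trailing "SEAT <n>".
EMBEDDED = re.compile(
    r"^(?P<office>.+?)\s+DISTRICT\s+(?P<district>\w+)(?:\s+SEAT\s+(?P<seat>\w+))?$",
    re.IGNORECASE,
)


def contest_key(state, county, office_name, district=None, seat=None):
    """Build (state, county, office_name, district, seat) from either layout."""
    office = office_name.strip()
    if district is None:
        # No district column: try to pull the identifier out of the name.
        found = EMBEDDED.match(office)
        if found:
            office = found.group("office")
            district = found.group("district")
            seat = found.group("seat")
    return (state, county, office.upper(), district, seat)


medsl_key = contest_key("TX", "HARRIS", "DISTRICT COURT JUDGE", district="127TH")
ncsbe_key = contest_key("NC", "COLUMBUS", "DISTRICT COURT JUDGE DISTRICT 13B SEAT 02")
```

Two Harris County rows with different `district` values now produce different keys, so their results are no longer pooled into one contest.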
Real examples from MEDSL 2022:
| State | Office name | Distinct seats | What disambiguates |
|---|---|---|---|
| TX | DISTRICT COURT JUDGE | 25 | district column: 11TH, 55TH, 80TH, … |
| NC | DISTRICT COURT JUDGE | 48 | Contest name suffix: DISTRICT 13B SEAT 02 |
| OH | COURT OF COMMON PLEAS | 14 | district column: GENERAL DIVISION, DOMESTIC |
| FL | COUNTY COURT JUDGE | 6–12 per county | district column: GROUP 1, GROUP 2, … |
Florida’s GROUP numbering is particularly treacherous. “COUNTY COURT JUDGE GROUP 3” in Miami-Dade is a different contest from “COUNTY COURT JUDGE GROUP 3” in Broward. The county is part of the disambiguation key.
Problem 2: Same Race, Different Names Across Years
NC SBE data from 2014 labels a state house seat as NC HOUSE OF REPRESENTATIVES DISTRICT 03. In 2018, redistricting renamed it to NC HOUSE OF REPRESENTATIVES DISTRICT 3. In 2022, the same source uses DISTRICT THREE in some contest types.
All three strings refer to the same legislative seat. But to a string-matching system, they are three different contests. Tracking a candidate’s career across elections requires knowing that DISTRICT 03, DISTRICT 3, and DISTRICT THREE are the same district.
Common variations found in NC SBE data:
| Variant A | Variant B | Variant C | Same contest? |
|---|---|---|---|
| DISTRICT 03 | DISTRICT 3 | DISTRICT THREE | Yes |
| BOARD OF EDUCATION | BD OF ED | BOE | Yes |
| COUNTY COMMISSIONERS | COUNTY COMMISSION | BOARD OF COMMISSIONERS | Yes |
This is contest entity resolution — the same problem as candidate entity resolution, applied to office names instead of person names. The cascade applies:
- Normalize numbers: Strip leading zeros, convert written numbers to digits. `DISTRICT 03` → `DISTRICT 3`, `DISTRICT THREE` → `DISTRICT 3`.
- Abbreviation expansion: `BD OF ED` → `BOARD OF EDUCATION`, `COMM` → `COMMISSION`.
- Embedding similarity: For remaining ambiguous pairs, compute cosine similarity on contest composite strings and apply the same threshold logic as candidate matching.
Contest entity resolution runs at L3 alongside candidate entity resolution. Each contest receives a contest_entity_id that persists across election cycles.
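The first two steps of that cascade are deterministic and can be sketched without any model. The tables here are deliberately tiny and `normalize_contest` is a hypothetical name; the embedding step is not shown.

```python
import re

NUMBER_WORDS = {
    "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4", "FIVE": "5",
    "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9", "TEN": "10",
}
ABBREVIATIONS = {
    "BD OF ED": "BOARD OF EDUCATION",
    "BOE": "BOARD OF EDUCATION",
    "COMM": "COMMISSION",
}


def normalize_contest(name: str) -> str:
    text = re.sub(r"\s+", " ", name.upper().strip())
    # Abbreviation expansion, whole tokens only so COMMISSION is left alone.
    for short, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    tokens = []
    for token in text.split(" "):
        if token in NUMBER_WORDS:
            token = NUMBER_WORDS[token]   # written number to digits
        elif token.isdigit():
            token = str(int(token))       # strip leading zeros
        tokens.append(token)
    return " ".join(tokens)
```

Identifiers that mix digits and letters, such as `13B`, pass through unchanged because they are neither pure digits nor number words.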
Problem 3: Multi-Seat Contests
A “vote for 3” school board race elects the top three candidates. The standard margin computation — difference between first and second place — does not apply. The meaningful margin is between the last winner (3rd place) and the first loser (4th place).
The vote_for field (called magnitude in some sources) records how many seats are being filled. MEDSL provides this field for most contests. NC SBE does not — it must be inferred from ballot instructions embedded in the contest name or from the number of candidates who received non-trivial vote shares.
Real example from Dawson County, Georgia (2022):
| Contest | vote_for | Candidates | Votes |
|---|---|---|---|
| Board of Education | 3 | 6 | 25,186 / 25,186 / 24,901 / 24,844 / 23,112 / 22,987 |
The effective margin is between 3rd place (24,901) and 4th place (24,844) — a gap of 57 votes. Reporting the margin as the gap between 1st and 2nd (0 votes — an exact tie) is misleading: the tie is between the top two winners, not between a winner and a loser.
Worse, the exact tie at the top (25,186 each) may trigger recount rules in some jurisdictions. Whether a recount applies depends on whether the tied candidates are competing for the same seat or are both safely elected. The vote_for field is the only way to know.
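The margin at the cutoff is a one-line computation once `vote_for` is known. A minimal sketch using the Dawson County totals above (`cutoff_margin` is a hypothetical helper):

```python
def cutoff_margin(votes, vote_for):
    """Gap between the last winner and the first loser.

    Returns None when there are no losers (candidates <= seats).
    """
    ranked = sorted(votes, reverse=True)
    if len(ranked) <= vote_for:
        return None
    return ranked[vote_for - 1] - ranked[vote_for]


dawson = [25186, 25186, 24901, 24844, 23112, 22987]
multi_seat_margin = cutoff_margin(dawson, vote_for=3)    # 3rd vs 4th place
naive_margin = cutoff_margin(dawson, vote_for=1)         # 1st vs 2nd place
```

Here `multi_seat_margin` is 57 and `naive_margin` is 0, which is the misleading "exact tie" described above.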
Why vote_for matters for competitiveness analysis
Without vote_for, every multi-seat contest looks either wildly competitive (if you compare 1st to 2nd among co-winners) or wildly uncompetitive (if you compare any winner to any loser in a field of 6). The correct margin — last winner vs. first loser — requires knowing the cutoff.
| Analysis | Without vote_for | With vote_for |
|---|---|---|
| Is the race competitive? | Unclear — 0-vote “margin” is misleading | Margin of 57 votes at the cutoff |
| Is it uncontested? | 6 candidates — looks contested | Only if ≤ 3 candidates filed |
| Who won? | Top 1? Top 2? Unknown | Top 3 |
Detection when the field is missing
When vote_for is absent (NC SBE, some OpenElections files), L1 applies heuristics:
- Contest name pattern: “VOTE FOR 3”, “ELECT 2”, “(3 SEATS)” embedded in the contest name string.
- Candidate count: If 6+ candidates appear in a school board or city council race, flag for multi-seat review.
- Vote distribution: If the top N candidates have similar vote totals and a clear drop-off to N+1, infer N seats.
These heuristics are imperfect. The vote_for field, when present, overrides all heuristics. When absent, the inferred value is stored with a confidence flag, and the L4 verification audit reviews flagged contests.
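The contest-name heuristic, together with the rule that an explicit `vote_for` always wins, might look like this. Only the three phrasings listed above are covered; `infer_vote_for` and the provenance labels it returns are assumptions.

```python
import re

SEAT_PATTERNS = [
    re.compile(r"\bVOTE FOR (\d+)", re.IGNORECASE),
    re.compile(r"\bELECT (\d+)", re.IGNORECASE),
    re.compile(r"\((\d+) SEATS?\)", re.IGNORECASE),
]


def infer_vote_for(contest_name, explicit=None):
    """Return (vote_for, provenance); an explicit source value overrides heuristics."""
    if explicit is not None:
        return explicit, "source"
    for pattern in SEAT_PATTERNS:
        found = pattern.search(contest_name)
        if found:
            return int(found.group(1)), "inferred_from_name"
    # No signal in the name: leave unset for the other heuristics or review.
    return None, "unknown"
```

Returning the provenance alongside the value is one way to carry the confidence flag forward to the L4 audit.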
How All Three Interact
A single contest can exhibit all three problems simultaneously. Consider a Texas county with five JP (Justice of the Peace) precincts, each electing one JP, across three election cycles where the contest name changed from “J.P. PCT 3” to “JUSTICE OF THE PEACE PRECINCT 3” to “JP PRECINCT THREE”:
- Problem 1: Five precincts, five separate contests, all labeled variants of “Justice of the Peace”.
- Problem 2: Three different name formats across 2018, 2020, 2022 for each precinct.
- Problem 3: Each is single-seat, but a neighboring school board race on the same ballot elects three members.
The contest entity key (state, county, office_name_normalized, district_normalized, seat) disambiguates problem 1. Contest entity resolution across years handles problem 2. The vote_for field handles problem 3. All three solutions must work together for the contest record to be correct.
Cross-Source Reconciliation
When two independent sources cover the same election, their overlap becomes a validation set. If MEDSL and NC SBE both report results for the same contest in the same county, the vote totals should match. When they do, both sources are credible. When they don’t, at least one has an error — and the disagreement reveals data quality issues that no single-source analysis can detect.
North Carolina 2022 is our primary validation case. Both MEDSL and the NC State Board of Elections publish precinct-level results for all NC contests in the 2022 general election.
The Overlap
We identified 640 contests present in both MEDSL and NC SBE for the 2022 general election. These span federal, state, county, municipal, judicial, and school board races across all 100 NC counties.
For each contest, we aggregated precinct-level results to the contest level and compared total votes per candidate.
| Agreement Level | Contests | Percentage |
|---|---|---|
| Exact vote total match | 579 | 90.5% |
| Within 1% of each other | 47 | 7.3% |
| Disagree by more than 1% | 14 | 2.2% |
| Total | 640 | 100% |
90.5% exact match across 640 contests, derived from two completely independent data pipelines (MIT’s academic processing vs. NC’s official state board reporting), is strong evidence that both sources are faithfully representing the same underlying certified results.
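The three agreement levels can be computed per contest as sketched below. Two assumptions are baked in: "within 1%" is read as a relative gap against the larger of the two totals, and a contest takes the level of its worst-agreeing candidate. Candidate names are assumed to be aligned already, and the function names are hypothetical.

```python
RANK = {"exact": 0, "within_1pct": 1, "disagree": 2}


def agreement_level(votes_a, votes_b, tolerance=0.01):
    """Compare one candidate's total across two sources."""
    if votes_a == votes_b:
        return "exact"
    relative_gap = abs(votes_a - votes_b) / max(votes_a, votes_b)
    return "within_1pct" if relative_gap <= tolerance else "disagree"


def contest_agreement(source_a, source_b, tolerance=0.01):
    """Worst per-candidate agreement level for a contest.

    source_a and source_b map candidate name -> total votes; a candidate
    missing from one source counts as 0 votes there.
    """
    candidates = set(source_a) | set(source_b)
    levels = [
        agreement_level(source_a.get(c, 0), source_b.get(c, 0), tolerance)
        for c in candidates
    ]
    return max(levels, key=RANK.get, default="exact")
```

Under this reading, a write-in row present in only one source pushes the contest to `disagree` unless write-ins are aggregated the same way on both sides first.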
The 7.3% — Small Disagreements
The 47 contests with near-matches (within 1%) trace to identifiable causes:
| Cause | Contests | Notes |
|---|---|---|
| Provisional ballot inclusion timing | 22 | MEDSL snapshot taken before final canvass; NC SBE includes provisionals |
| Precinct boundary rounding | 11 | Split precincts assigned differently by each source |
| Write-in aggregation | 9 | NC SBE reports individual write-ins; MEDSL aggregates to “Write-in” |
| Unknown | 5 | Under investigation |
Apart from the five contests still under investigation, these are not errors. They are legitimate differences in how two organizations process the same raw certified results. Provisional ballot timing is the most common cause: MEDSL’s data may reflect an earlier snapshot of the canvass than NC SBE’s final certified totals.
The 2.2% — Real Disagreements
The 14 contests with >1% disagreement require individual investigation. Common causes include:
- Misassigned precincts. A precinct’s results attributed to the wrong contest or district in one source.
- Partial data. One source missing results from a subset of precincts, typically in multi-county contests where one county’s data arrived late.
- Candidate name mismatch causing split. The same candidate’s votes split across two entity IDs in one source because a name variant was not resolved — e.g., “JOHN SMITH” in early voting vs. “John R. Smith” in election-day results treated as different candidates.
These 14 cases are flagged by the L4 cross-source reconciliation algorithm and reported in the verification output. They are not silently ignored.
Name Formatting Differences
Vote totals may agree, but candidate name strings frequently do not. Of the 640 overlapping contests, 401 (62.7%) have at least one candidate whose name is formatted differently between MEDSL and NC SBE.
| Formatting Difference | Example (MEDSL) | Example (NC SBE) | Frequency |
|---|---|---|---|
| ALL CAPS vs Title Case | TIMOTHY LANCE | Timothy Lance | 389 |
| Last-first vs first-last | LANCE, TIMOTHY | Timothy Lance | 247 |
| Middle initial present/absent | SHANNON W BRAY | Shannon W. Bray | 118 |
| Period after middle initial | SHANNON W BRAY | Shannon W. Bray | 94 |
| Nickname in quotes vs parens | CHARLES "CHARLIE" CRIST | Charles (Charlie) Crist | 12 |
| Suffix formatting | ROBERT WILLIAMS JR | Robert Williams, Jr. | 31 |
| Prefix/title included | HON. JANE DOE | Jane Doe | 8 |
A single candidate can exhibit multiple formatting differences simultaneously. “BRAY, SHANNON W” (MEDSL) vs “Shannon W. Bray” (NC SBE) combines casing, ordering, and punctuation differences in one pair.
This is why entity resolution exists. The vote totals confirm these are the same contests with the same candidates. The name formatting confirms that string equality is insufficient — structured decomposition, embedding, and in some cases LLM confirmation are required to link records across sources.
This Overlap as a Validation Set
The 640-contest NC overlap serves three purposes in the pipeline:
1. Entity Resolution Validation
For every candidate pair that the L3 cascade matches across MEDSL and NC SBE, we can verify the match by comparing vote totals. If the cascade says “TIMOTHY LANCE” (MEDSL) and “Timothy Lance” (NC SBE) are the same person, and their vote totals match exactly, the match is confirmed by an independent signal. If the cascade says they match but the vote totals disagree by 50%, the match is suspect.
2. Office Classification Validation
Both sources cover the same contests but use different office name strings. MEDSL might report “NC HOUSE OF REPRESENTATIVES DISTRICT 047” while NC SBE reports “NC HOUSE OF REPRESENTATIVES - DISTRICT 47”. If both classify to state/legislative, the classifier is consistent. If one classifies to state/legislative and the other to county/legislative, we have a bug.
3. Parser Validation
When two independent parsers (the MEDSL parser and the NC SBE parser) produce the same vote counts for the same contest, both parsers are likely correct. When they disagree, the disagreement localizes the bug to one parser or the other — far easier to debug than a single-source pipeline where errors are invisible.
Beyond NC
The NC overlap is our deepest validation case because NC SBE publishes granular, machine-readable precinct data going back to 2006. Other states offer less overlap:
| State | MEDSL 2022 | Secondary Source | Overlap Quality |
|---|---|---|---|
| NC | Yes | NC SBE (precinct-level, 2006–2024) | High |
| FL | Yes | OpenElections (county-level, select years) | Medium |
| OH | Yes | OpenElections (precinct-level, 2022) | Medium |
| GA | Yes | Clarity/Scytl (election night, unstable URLs) | Low |
| All others | Yes | MEDSL only | None |
As additional state-level sources are integrated, each creates a new validation pair. The architecture is designed to scale: the L4 cross-source reconciliation algorithm runs for any pair of sources that cover the same (state, year, contest) combination. No code changes are required — only new L0 data and a new L1 parser.
The Lesson
Cross-source reconciliation is not a feature — it is the only reliable way to detect errors in election data. A single source can be internally consistent and still wrong. Two independent sources that agree are almost certainly right. Two independent sources that disagree tell you exactly where to look.
The 90.5% exact match rate across 640 NC contests is our current evidence floor. Every additional source and state that achieves similar agreement raises confidence in the pipeline. Every disagreement is a bug report — either in our pipeline or in the source data.
Design Principles
Five principles govern every architectural decision in this project. They are listed in priority order — when two principles conflict, the higher-ranked principle wins.
1. Deterministic First
If a deterministic method produces correct results, use it. Do not add machine learning, embeddings, or LLM calls where string matching, regex, or lookup tables suffice. L0 through L1 contain zero ML — name decomposition, FIPS enrichment, keyword-based office classification, and hash computation are all deterministic operations that produce identical output from identical input on every run. Deterministic methods are not preferred because they are cheaper (budget is not a constraint). They are preferred because they are reproducible, auditable, and incapable of hallucination. When a journalist asks “why did your system say these two candidates are the same person?”, the answer should be “because their canonical first names, last names, and suffixes are identical” — not “because a language model said so.” Determinism is the default. Non-determinism requires justification.
2. Preserve Signal
Every piece of information in the source data is potential disambiguation signal. Middle initials distinguish David S. Marshall (Maine) from David A. Marshall (Florida) — dropping them collapses two people into one. Suffixes distinguish Robert Williams from Robert Williams Jr. — stripping them merges father and son. Nicknames reveal that Charlie Crist and Charles Joseph Crist are the same person — normalizing too early destroys that connection. The rule at L1 is: decompose names into structured components (raw, first, middle, last, suffix, canonical_first) and preserve every component. Do not discard middle initials. Do not strip suffixes. Do not overwrite the raw name with a canonical form. Clean without collapsing. Downstream layers (L2 embedding, L3 matching, L4 canonicalization) consume these components selectively. The raw material must survive intact through L1 for those layers to function.
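A minimal sketch of "clean without collapsing" for one simple name shape, First [Middle] Last[, Suffix]. The component names follow the list above; the nickname table, the suffix list, and the function name are illustrative assumptions, and the input is assumed to be non-empty.

```python
from dataclasses import dataclass
from typing import Optional

# Tiny illustrative tables. The project's actual nickname dictionary and
# suffix list are not reproduced here.
NICKNAMES = {"charlie": "Charles", "bill": "William"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


@dataclass(frozen=True)
class CandidateName:
    raw: str                  # untouched source string
    first: str                # as written, nickname included
    middle: Optional[str]
    last: str
    suffix: Optional[str]
    canonical_first: str      # nickname-resolved, stored alongside `first`


def decompose_first_last(raw: str) -> CandidateName:
    """Decompose a 'First [Middle] Last[, Suffix]' name, keeping every part."""
    tokens = raw.replace(",", " ").split()
    suffix = None
    if len(tokens) > 2 and tokens[-1].rstrip(".").lower() in SUFFIXES:
        suffix = tokens.pop()
    first, last = tokens[0], tokens[-1]
    middle = " ".join(tokens[1:-1]) or None
    canonical_first = NICKNAMES.get(first.lower(), first)
    return CandidateName(raw, first, middle, last, suffix, canonical_first)
```

Nothing is overwritten: `raw` and `first` survive next to `canonical_first`, so a later layer can still see that the source said "Charlie".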
3. LLMs for Confirmation, Not Discovery
Embeddings retrieve candidates. LLMs confirm matches. The embedding model (text-embedding-3-large, 3,072 dimensions) identifies pairs that might be the same entity — Charlie Crist at cosine 0.451, Robert Williams Jr at 0.862. The LLM (Claude Sonnet) then examines the full context — structured name components, vote counts, office, state, party — and renders a judgment: match or no-match, with confidence and reasoning. The LLM is never the first line of analysis. It does not scan raw files, parse CSV columns, compute FIPS codes, or generate embeddings. It is called only when cheaper methods have narrowed the problem to a specific, bounded question: “Are these two records the same person?” or “What type of office is the Santa Rosa Island Authority?” This ordering exists for speed (70% of entity resolution is exact match), reproducibility (deterministic steps produce identical results), and auditability (every LLM decision is logged with its prompt, response, and reasoning).
4. Immutable Layers
Outputs are append-only. L0 raw files are never modified. L1 cleaned records are never updated in place — if the parser changes, a new L1 run produces new records with a new parser_version. L2 embeddings are never re-computed by overwriting existing vectors — a new embedding model produces new L2 output alongside the old. L3 match decisions are never silently revised — an override produces a new decision record referencing the original. L4 canonical exports are versioned snapshots, not mutable databases. This immutability serves two purposes. First, provenance: the hash chain from L4 back to L0 depends on every intermediate record remaining unchanged. Modifying an L1 record without incrementing the parser version breaks the chain. Second, debugging: when a result looks wrong, you can inspect every layer’s output at the time it was produced, without worrying that a subsequent run overwrote the evidence.
5. Document Sources, Don’t Store Data
This project does not redistribute election data. Each source — MEDSL, NC SBE, OpenElections, VEST, Census, FEC — publishes data under its own license, on its own schedule, at its own URLs. We provide exact download URLs, file size expectations, schema documentation, known data quality issues, and the tools to process the data. We do not provide the data itself. The reasons are legal (license terms vary), practical (the current corpus exceeds 8 GB and grows with each election cycle), and epistemic (a stale copy of a dataset that the source has since corrected is worse than no copy at all). Users download data from authoritative sources, verify file integrity against documented hashes, and run the pipeline locally. The L0 manifest records exactly where each file came from and when it was retrieved, so any result can be traced back to its authoritative origin.
The Five-Layer Pipeline
The pipeline processes election data through five immutable layers. Each layer depends on all prior layers. Every record carries a hash chain back to the original source bytes. The storage format at every layer is JSONL (one JSON record per line).
L0 RAW Byte-identical source files with acquisition manifests.
│
│ deterministic parse — no ML, no API calls
▼
L1 CLEANED Structured records. Names decomposed into components.
│ FIPS enrichment. Office classification (keyword + regex).
│
│ deterministic given embedding model version
▼
L2 EMBEDDED Vector embeddings for candidates, contests, geography.
│ Tier 3 office classification. Quality flags.
│
│ non-deterministic — LLM decisions stored in audit log
▼
L3 MATCHED Entity resolution. candidate_entity_id assigned.
│ contest_entity_id assigned. Cross-source dedup.
│
│ deterministic given L3 entity assignments
▼
L4 CANONICAL Authoritative names. Temporal chains. Alias tables.
Verification algorithms. Researcher-facing exports.
Layer properties
| Layer | Deterministic | Needs API key | Output format | Re-runnable from |
|---|---|---|---|---|
| L0 | Yes | No | Original files + .manifest.json | External sources |
| L1 | Yes | No | JSONL | L0 |
| L2 | Yes, given model version | Yes (OpenAI) | JSONL + .npy sidecars | L1 |
| L3 | No (LLM) | Yes (Anthropic) | JSONL + decision log (JSONL) | L2 |
| L4 | Yes, given L3 | No | JSONL + JSON registries + CSV export | L3 |
The determinism boundary falls between L2 and L3. Everything from L0 through L2 produces identical output from identical input, given the same code version and embedding model. L3 introduces LLM calls whose outputs may vary between runs, but every decision is stored in a JSONL audit log that enables deterministic replay.
What each layer produces
L0: Raw
The input to the entire pipeline. L0 is a byte-identical copy of each source file, accompanied by a JSON manifest recording how it was acquired.
l0_raw/
├── nc_sbe/
│ ├── results_pct_20221108.txt # Original file, untouched
│ └── results_pct_20221108.txt.manifest.json
├── medsl/
│ └── 2022-nc-local-precinct-general/
│ ├── NC-cleaned-final3.csv
│ └── NC-cleaned-final3.csv.manifest.json
└── ...
The manifest records:
{
"l0_hash": "edfedf2760cfd54f...",
"source_url": "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip",
"retrieval_date": "2026-03-18T14:30:00Z",
"file_size_bytes": 18023456,
"format_detected": "tsv"
}
L0 files are never modified. If a source is re-downloaded and the content differs, a new versioned L0 entry is created.
L1: Cleaned
L1 parses each source’s native format into a unified JSONL schema. The parser is source-specific (one parser per source), but the output schema is the same regardless of source.
L1 performs 10 operations in fixed order:
- Filter non-contests: Detect “Registered Voters”, “Ballots Cast”, “Over Votes”, “Under Votes”. Route to turnout metadata. Detect “For”/“Against”/“Yes”/“No” ballot measure choices.
- Parse source format: Source-specific (CSV for MEDSL, TSV for NC SBE, XML for Clarity).
- Decompose candidate names: Split into first, middle, last, suffix, nickname. Preserve every component. Robert Van Fletcher, Jr. becomes {first: "Robert", middle: "Van", last: "Fletcher", suffix: "Jr.", raw: "Robert Van Fletcher, Jr."}.
- Apply nickname dictionary: Map Charlie→Charles, Bill→William, etc. Store as canonical_first. Preserve original first.
- Classify contest kind: CandidateRace, BallotMeasure, or TurnoutMetadata.
- Classify office (tiers 1–2): Keyword lookup (~170 entries), then regex patterns (~40 patterns). No ML, no embeddings. Records that don’t match remain classified as other.
- Enrich geography: FIPS lookup from bundled reference data (3,143 counties, 31,980 places). Generate OCD-IDs.
- Compute vote shares: votes_total / sum(all candidates in contest).
- Backfill turnout: If turnout metadata rows were found, attach registered voter counts to sibling contest records in the same precinct.
- Compute L1 hash: SHA-256(record content + "parent:" + L0 hash).
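The vote-share step in the list above is plain arithmetic. A minimal sketch follows; the function name is hypothetical, and returning None for a contest with no votes is an assumption, since the text does not say how that case is handled.

```python
from typing import List, Optional


def vote_shares(votes: List[int]) -> List[Optional[float]]:
    """Return each candidate's share of the summed contest votes.

    A contest with zero total votes yields None for every candidate
    instead of dividing by zero.
    """
    total = sum(votes)
    if total == 0:
        return [None for _ in votes]
    return [v / total for v in votes]
```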
A single L1 record for a Columbus County, NC school board race:
{
"election": {"date": "2022-11-08", "type": "general"},
"jurisdiction": {
"state": "NC", "state_fips": "37",
"county": "COLUMBUS", "county_fips": "37047",
"precinct": "P17", "level": "precinct"
},
"contest": {
"kind": "candidate_race",
"raw_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
"office_level": "school_district",
"classifier_method": "regex",
"classifier_confidence": 0.85,
"vote_for": 1
},
"results": [
{
"candidate_name": {
"raw": "Timothy Lance", "first": "Timothy",
"middle": null, "last": "Lance", "suffix": null,
"canonical_first": "Timothy"
},
"votes_total": 303,
"vote_counts_by_type": {
"election_day": 136, "early": 159,
"absentee_mail": 7, "provisional": 1
}
}
],
"source": {
"source_type": "nc_sbe",
"source_file": "results_pct_20221108.txt",
"confidence": "high"
},
"provenance": {
"l1_hash": "8ea7ecc257ff8e05",
"l0_parent_hash": "edfedf2760cfd54f",
"parser_version": "nc_sbe_v2.1",
"schema_version": "3.0.0"
}
}
L1 does not use any machine learning, API calls, or non-deterministic processes. Given the same L0 files and the same parser version, L1 output is identical on every run.
L2: Embedded
L2 generates vector embeddings for text fields that need fuzzy matching. The embedding model is text-embedding-3-large (3,072 dimensions) from OpenAI. L2 also applies tier 3 office classification (embedding nearest-neighbor against a reference set of ~200 known office names) and raises quality flags on suspicious records.
L2 produces three types of output:
Enriched JSONL — L1 records augmented with classification upgrades and quality flags:
{
"...all L1 fields...",
"l2": {
"l2_hash": "854fa6367960bb05",
"l1_parent_hash": "8ea7ecc257ff8e05",
"embedding_model": "text-embedding-3-large",
"embedding_dimensions": 3072,
"candidate_embedding_id": 4271,
"contest_embedding_id": 183,
"candidate_composite": "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus",
"contest_composite": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 | school_district | NC 2022",
"quality_flags": []
}
}
Embedding sidecars — Binary .npy files (float32 arrays) containing the actual vectors. One file per embedding type per partition:
l2_embedded/
├── nc/2022/
│ ├── enriched.jsonl
│ ├── candidate_embeddings.npy # float32[N, 3072]
│ ├── contest_embeddings.npy # float32[M, 3072]
│ └── geography_embeddings.npy # float32[K, 3072]
ID mapping — A JSON file mapping L1 record hashes to embedding row indices.
The composite strings fed to the embedding model follow fixed templates:
| Purpose | Template |
|---|---|
| Candidate | {canonical_first} {middle} {last} {suffix} \| {party} \| {office} \| {state} \| {county} |
| Contest | {raw_contest_name} \| {office_level} \| {state} {year} |
| Geography | {municipality}, {county} County, {state} |
Middle initials and suffixes are included in the candidate composite. This is deliberate — “David S Marshall” and “David A Marshall” produce different vectors, which helps distinguish different people with the same first and last name. We measured this: including the middle initial reduced cosine similarity between the two David Marshalls from 0.7025 to 0.6448.
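A sketch of how the candidate template might be filled in. The field order follows the template table; the function names are hypothetical. Dropping empty name components and collapsing whitespace runs are assumptions inferred from the example composite in the enriched record above, where a missing party appears as an empty slot between two pipes.

```python
from typing import Optional


def _collapse(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def candidate_composite(canonical_first: str, middle: Optional[str], last: str,
                        suffix: Optional[str], party: Optional[str],
                        office: str, state: str, county: str) -> str:
    """Build the pipe-delimited string that is sent to the embedding model."""
    # Missing middle names and suffixes are skipped; present ones are kept
    # so that "David S Marshall" and "David A Marshall" stay distinct.
    name = " ".join(p for p in (canonical_first, middle, last, suffix) if p)
    fields = [name, party or "", office, state, county]
    return _collapse(" | ".join(fields))
```

For the Columbus County record, this produces `Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus`.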
L2 is deterministic given the same embedding model version. If OpenAI changes the weights behind text-embedding-3-large, the vectors change. The embedding_model and model version are stored in every L2 record to detect this.
L3: Matched
L3 resolves entities — it determines which records across sources and elections refer to the same candidate and the same contest. This is the first non-deterministic layer because it uses LLM calls for ambiguous cases.
L3 runs an entity resolution cascade for each candidate record:
| Step | Method | Handles | Cost |
|---|---|---|---|
| 1 | Exact match on (canonical_first, last, suffix) | Same name across precincts | $0 |
| 2 | Jaro-Winkler similarity ≥ 0.92 | Minor spelling variations | $0 |
| 2.5 | Name similarity gate: JW on last name < 0.50 → skip | Obvious non-matches | $0 |
| 3 | Embedding retrieval: cosine ≥ 0.95 → auto-accept | Format differences | $0 |
| 4 | LLM confirmation: cosine 0.35–0.95 | Nicknames, suffixes, ambiguous names | ~$0.0002/call |
| 5 | Tiebreaker: stronger model when step 4 is uncertain | Low-confidence cases | ~$0.002/call |
In our prototype run of 200 records:
- Step 1 resolved 597 candidates (70.0%)
- Step 2 resolved 1 (0.1%)
- Step 3 resolved 50 (5.9%)
- Step 4 was invoked 30 times (3.5%), all resulting in no-match
- 206 unique candidate entities were created
The 30 LLM calls in our prototype were all spent on pairs within the same (state, office_level) block that had moderate embedding similarity (0.55–0.73) but completely different names — “Aaron Bridges” vs “Daniel Blanton” type comparisons. All 30 were correctly rejected. This finding led to the addition of step 2.5 (the name similarity gate): if the Jaro-Winkler score on last names alone is below 0.50, skip the pair entirely without computing embedding similarity.
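The routing logic of the cascade, including the step 2.5 gate, can be sketched as follows. The thresholds come from the table; everything else is an assumption made for illustration: the type and label names are hypothetical, Jaro-Winkler scores are taken as precomputed inputs, cosine similarity is passed as a callable so the gate can skip it, and pairs below the 0.35 floor are treated as no-match. Step 5 is omitted because it depends on the step 4 response.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PairScores:
    exact_key_match: bool            # (canonical_first, last, suffix) identical
    jw_full: float                   # Jaro-Winkler on the full name
    jw_last: float                   # Jaro-Winkler on last names only
    cosine: Callable[[], float]      # computed lazily, only if the gate passes


def route_pair(scores: PairScores) -> str:
    """Return which cascade step decides this candidate pair."""
    if scores.exact_key_match:
        return "step1_exact"
    if scores.jw_full >= 0.92:
        return "step2_jaro_winkler"
    if scores.jw_last < 0.50:
        # Step 2.5: obviously different last names, never touch embeddings.
        return "skip_name_gate"
    cosine = scores.cosine()
    if cosine >= 0.95:
        return "step3_auto_accept"
    if cosine >= 0.35:
        return "step4_llm"
    return "no_match"  # assumed handling for pairs below the 0.35 floor
```

With the gate in place, pairs whose last names are clearly different are rejected before any embedding lookup or LLM call.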
Every L3 decision is stored in a JSONL audit log:
{
"decision_id": "a3f8c1d2-...",
"decision_type": "candidate_match",
"timestamp": "2026-03-19T10:30:00Z",
"inputs": {
"name_a": "Charlie Crist",
"name_b": "CRIST, CHARLES JOSEPH",
"embedding_score": 0.451,
"state_a": "FL", "state_b": "FL",
"contest_a": "Governor", "contest_b": "Governor",
"votes_a": 3101652, "votes_b": 3101652
},
"method": {
"type": "llm",
"model": "claude-sonnet-4-20250514",
"prompt_template_version": "entity_match_v2.0"
},
"output": {
"decision": "match",
"confidence": 0.95,
"reasoning": "Charlie is a common nickname for Charles. Same state, same office, identical vote counts."
}
}
A researcher who wants to reproduce L3 can either replay the cached decisions from the log (deterministic) or re-run the LLM calls (which may produce slightly different responses). The log preserves everything needed for either approach.
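A sketch of the replay path, assuming the log layout shown above. The cache key (all inputs, the model, and the prompt template version) and the function names are assumptions; the text does not specify how a logged decision is matched back to a pair.

```python
import json
from typing import Callable, Dict, Iterable, Tuple

Key = Tuple[str, str, str]


def decision_key(inputs: dict, method: dict) -> Key:
    """Identify a decision by its full inputs, model, and prompt version."""
    return (json.dumps(inputs, sort_keys=True),
            method["model"],
            method["prompt_template_version"])


def load_decisions(lines: Iterable[str]) -> Dict[Key, dict]:
    """Index a JSONL decision log by decision key, skipping blank lines."""
    cache: Dict[Key, dict] = {}
    for line in lines:
        if line.strip():
            record = json.loads(line)
            key = decision_key(record["inputs"], record["method"])
            cache[key] = record["output"]
    return cache


def decide(inputs: dict, method: dict, cache: Dict[Key, dict],
           call_llm: Callable[[dict], dict]) -> dict:
    """Replay a logged decision when one exists; otherwise ask the model."""
    key = decision_key(inputs, method)
    if key in cache:
        return cache[key]
    return call_llm(inputs)
```

Passing an empty cache forces fresh LLM calls, which corresponds to the second, non-deterministic option.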
L3 adds entity assignments to each record:
{
"...all L2 fields...",
"l3": {
"l3_hash": "28183d41d50204d5",
"l2_parent_hash": "854fa6367960bb05",
"candidate_entity_ids": [
{"result_index": 0, "entity_id": "person:nc:columbus:lance-timothy-13"}
],
"contest_entity_id": "contest:nc:columbus:school-board-d02"
}
}
L4: Canonical
L4 assigns authoritative representations. For each entity (candidate or contest), it selects a canonical name, builds temporal chains across elections, constructs alias tables, and runs verification algorithms.
Canonical name selection follows a fixed algorithm:
- Collect all name variants from all L3 records in the entity cluster.
- Prefer the most complete variant (one with a middle initial over one without; one with a suffix over one without).
- Among equally complete variants, prefer the one from the most authoritative source (certified state data > academic data > community data).
- Among equally authoritative variants, prefer the most recent.
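The selection rules above amount to a three-part sort key. In this sketch the tier labels, the `Variant` type, and the completeness score (one point each for a middle name and a suffix) are illustrative assumptions; only the ordering of the three criteria comes from the list.

```python
from dataclasses import dataclass
from typing import List, Optional

# Higher value = more authoritative, following
# "certified state data > academic data > community data".
AUTHORITY = {"certified_state": 3, "academic": 2, "community": 1}


@dataclass
class Variant:
    name: str
    middle: Optional[str]
    suffix: Optional[str]
    source_tier: str
    election_date: str  # ISO 8601, so string order is date order


def completeness(variant: Variant) -> int:
    """Count the optional components this variant carries."""
    return int(bool(variant.middle)) + int(bool(variant.suffix))


def select_canonical(variants: List[Variant]) -> str:
    """Pick the most complete, then most authoritative, then most recent."""
    best = max(
        variants,
        key=lambda v: (completeness(v), AUTHORITY[v.source_tier], v.election_date),
    )
    return best.name
```

For the two Timothy Lance variants, neither has a middle initial or suffix, so the certified state source decides and the canonical name is "Timothy Lance".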
Temporal chains aggregate records by (entity_id, election_date, contest_entity_id). One entry per election, not per precinct. A candidate who appeared in 47 precincts in one election gets one temporal chain entry with the summed vote total.
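A sketch of that aggregation over precinct-level rows. The row field names are borrowed from the flat export and the L3 record, and their use as input here is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

ChainKey = Tuple[str, str, str]


def build_temporal_chain(rows: Iterable[dict]) -> Dict[ChainKey, int]:
    """Sum precinct-level votes into one entry per (entity, date, contest)."""
    totals: Dict[ChainKey, int] = defaultdict(int)
    for row in rows:
        key = (row["candidate_entity_id"],
               row["election_date"],
               row["contest_entity_id"])
        totals[key] += row["votes_total"]
    return dict(totals)
```

Skipping this aggregation is what produces inflated chains, with one entry per precinct instead of one per election.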
Verification algorithms run at L4 to check pipeline integrity:
- Hash chain integrity — Walk L4→L3→L2→L1→L0 for every record. Verify no link is broken.
- Entity consistency — Flag entities spanning multiple states (unusual for local officials). Flag party switches.
- Temporal plausibility — Flag implausible career spans or office progressions.
- Cross-source reconciliation — Where two sources cover the same contest, compare vote totals.
- Completeness audit — Report coverage by state, county, year. Report FIPS and entity ID fill rates.
- LLM entity audit — For multi-member entities, ask a language model whether the cluster is plausible. In our prototype, this caught 43 suspicious entities (precinct-level records inflating temporal chains) and 4 likely errors (“For” and “Against” ballot measure choices classified as person entities).
L4 exports two types of output:
Entity registries (JSON) — One record per unique person or contest:
{
"entity_id": "person:nc:columbus:lance-timothy-13",
"canonical_name": "Timothy Lance",
"aliases": ["Timothy Lance", "TIMOTHY LANCE"],
"elections": [
{"date": "2022-11-08", "contest": "Columbus County Schools Board of Education District 02", "votes": 1531}
],
"states": ["NC"],
"first_appearance": "2022-11-08",
"election_count": 1
}
Flat exports (JSONL and CSV) — One record per candidate per contest per precinct, with canonical names and entity IDs attached:
{
"election_date": "2022-11-08",
"state": "NC",
"county": "COLUMBUS",
"contest_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
"candidate_raw": "Timothy Lance",
"candidate_canonical": "Timothy Lance",
"candidate_entity_id": "person:nc:columbus:lance-timothy-13",
"votes_total": 303,
"source": "nc_sbe",
"l3_hash": "28183d41d50204d5",
"l0_hash": "edfedf2760cfd54f"
}
Why five layers and not two
A simpler system would have two layers: raw and processed. The five-layer design exists because the processing steps have different properties that should not be conflated:
Splitting L1 from L2 means you can upgrade the embedding model without re-parsing all sources. If a better model than text-embedding-3-large becomes available, re-run L2 from L1. L1 remains untouched.
Splitting L2 from L3 means cheap, deterministic embedding generation is separate from expensive, non-deterministic LLM calls. L2 can run for 200 million records in hours on CPU (plus API calls for vector generation). L3’s LLM calls can be batched separately, retried on failure, and audited independently.
Splitting L3 from L4 means individual entity resolution decisions are separate from the aggregate operations (canonical name selection, temporal chains, verification) that consume them. If a human reviewer overrides an L3 match decision, L4 can be re-run without re-doing all of L3.
Each layer boundary is a point where you can stop, inspect, export, and restart. A researcher who disagrees with the entity resolution can take L2 output and apply their own matching logic. A developer who wants to test a new office classifier can re-run L1 without re-downloading L0.
Storage layout
local-data/processed/
├── l0_raw/
│ └── {source}/
│ ├── {filename}
│ └── {filename}.manifest.json
├── l1_cleaned/
│ └── {source}/{state}/{year}/
│ ├── cleaned.jsonl
│ └── cleaning_report.json
├── l2_embedded/
│ └── {state}/{year}/
│ ├── enriched.jsonl
│ ├── candidate_embeddings.npy
│ ├── contest_embeddings.npy
│ └── id_mapping.json
├── l3_matched/
│ └── {state}/{year}/
│ ├── matched.jsonl
│ └── decisions/
│ └── candidate_matches.jsonl
└── l4_canonical/
├── candidate_registry.json
├── contest_registry.json
├── verification_report.json
└── exports/
├── flat_export.jsonl
└── flat_export.csv
All JSONL files are streamable — they can be processed line by line without loading the entire file into memory. At 200 million records with approximately 2 KB per record, the full L1 corpus would be approximately 400 GB. Streaming is not optional at that scale.
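Line-by-line processing needs nothing beyond the standard library. A minimal sketch, in which the helper name is hypothetical and an in-memory stream stands in for an open file:

```python
import io
import json
from typing import IO, Iterator


def iter_jsonl(stream: IO[str]) -> Iterator[dict]:
    """Yield one parsed record per non-blank line, never holding the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Usage: aggregate over a stream without materializing it.
sample = io.StringIO('{"votes_total": 303}\n\n{"votes_total": 7}\n')
total = sum(record["votes_total"] for record in iter_jsonl(sample))
```

The same generator works on a file opened with `open(path, encoding="utf-8")`, so memory use stays at roughly one record regardless of corpus size.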
The hash chain
Every record at every layer carries a hash of its own content and a reference to its parent layer’s hash:
L4 record
l4_hash ← SHA-256(L4 content + "parent:" + l3_hash)
└── l3_hash ← SHA-256(L3 content + "parent:" + l2_hash)
└── l2_hash ← SHA-256(L2 content + "parent:" + l1_hash)
└── l1_hash ← SHA-256(L1 content + "parent:" + l0_hash)
└── l0_hash ← SHA-256(raw file bytes)
To verify any L4 record: recompute the L4 hash from its content, check that it matches the stored l4_hash, then follow the l3_parent_hash to the L3 record and repeat. Continue through L2 and L1 to L0. At L0, re-hash the raw file bytes and compare to the stored l0_hash.
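A sketch of the chain check. The hash formula follows the diagram above, but several details are assumptions: record content is serialized as sorted-key compact JSON, full 64-character digests are used (the examples in this document show shortened hashes), and the walk runs upward from L0 rather than downward from L4, which tests the same links.

```python
import hashlib
import json
from typing import List, Optional, Tuple


def l0_hash(raw: bytes) -> str:
    """SHA-256 of the raw source bytes: the root of the chain."""
    return hashlib.sha256(raw).hexdigest()


def content_hash(content: dict, parent_hash: str) -> str:
    """SHA-256(content + "parent:" + parent hash), with canonical JSON content."""
    payload = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((payload + "parent:" + parent_hash).encode("utf-8")).hexdigest()


def verify_chain(raw: bytes, layers: List[Tuple[dict, str]]) -> Optional[str]:
    """Check the L0..L4 links; `layers` holds (content, stored_hash) for L1..L4.

    Returns None when every link verifies, otherwise the first broken
    boundary, for example "L0->L1".
    """
    parent = l0_hash(raw)
    for level, (content, stored) in enumerate(layers, start=1):
        if content_hash(content, parent) != stored:
            return f"L{level - 1}->L{level}"
        parent = stored
    return None
```

Because each hash folds in its parent, a change anywhere below a record surfaces as a mismatch at the first boundary above the change.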
In our prototype run of 200 records, all 200 hash chains verified from L4 back to L0 with zero broken links.
L0: Raw — Byte-Identical Source Preservation
L0 is the foundation of the pipeline. It stores byte-identical copies of every source file alongside a JSON manifest that records how the file was acquired. Nothing at L0 is parsed, cleaned, or transformed. The raw bytes are sacred.
What L0 Contains
Every source file produces two artifacts:
| Artifact | Purpose | Example |
|---|---|---|
| The file itself | Exact bytes as downloaded | results_pct_20221108.txt |
| The manifest sidecar | Acquisition metadata | results_pct_20221108.txt.manifest.json |
The manifest records five fields:
{
"l0_hash": "edfedf2760cfd54f...",
"source_url": "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip",
"retrieval_date": "2026-03-18T14:30:00Z",
"file_size_bytes": 18023456,
"format_detected": "tsv"
}
- l0_hash: SHA-256 of the raw file bytes. This is the root of the hash chain. Every downstream record at L1–L4 ultimately traces back to this value.
- source_url: The exact URL used to retrieve the file. This is the direct download link, not a landing page.
- retrieval_date: ISO 8601 timestamp of when the file was downloaded. Sources update files in place; the retrieval date disambiguates versions.
- file_size_bytes: Byte count of the raw file after decompression (if the source was a zip archive, this is the size of the extracted file, not the archive).
- format_detected: The file format as determined by content inspection: tsv, csv, xml, json, fixed_width.
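A sketch of building such a manifest from bytes already in hand. The function name is hypothetical, format detection is taken as an input rather than implemented, and `raw` is assumed to be the extracted file rather than the archive.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def build_manifest(raw: bytes, source_url: str, format_detected: str,
                   retrieved_at: Optional[datetime] = None) -> dict:
    """Assemble the five manifest fields for one raw file.

    `retrieved_at` must be timezone-aware; it defaults to the current UTC time.
    """
    when = (retrieved_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "l0_hash": hashlib.sha256(raw).hexdigest(),
        "source_url": source_url,
        "retrieval_date": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "file_size_bytes": len(raw),
        "format_detected": format_detected,
    }
```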
Storage Layout
l0_raw/
├── nc_sbe/
│ ├── results_pct_20221108.txt
│ ├── results_pct_20221108.txt.manifest.json
│ ├── results_pct_20201103.txt
│ └── results_pct_20201103.txt.manifest.json
├── medsl/
│ ├── 2022-nc-precinct-general.csv
│ ├── 2022-nc-precinct-general.csv.manifest.json
│ ├── 2022-fl-precinct-general.csv
│ └── 2022-fl-precinct-general.csv.manifest.json
├── openelections/
│ ├── 20221108__fl__general__precinct.csv
│ └── 20221108__fl__general__precinct.csv.manifest.json
└── census/
├── national_county2020.txt
└── national_county2020.txt.manifest.json
Files are organized by source, not by state or year. The source is the natural partition because each source has its own parser at L1. A single MEDSL file may contain data for all 50 states; a single NC SBE file contains one election’s results for all NC counties. The source directory mirrors the download structure.
Idempotent Download
Downloading is idempotent. Before fetching a file, the pipeline checks whether an L0 entry already exists with a matching l0_hash:
- If the manifest exists and the file exists and the file’s SHA-256 matches the manifest’s l0_hash → skip download. The file is already present and intact.
- If the manifest exists but the file is missing or the hash does not match → re-download. The file was corrupted or deleted.
- If no manifest exists → download and create manifest.
This means running the download step twice produces no network traffic on the second run. It also means the pipeline recovers gracefully from interrupted downloads — a partially written file will fail the hash check and be re-fetched.
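The three cases reduce to a small pure function. In this sketch the function name and action labels are hypothetical; the manifest is passed as a parsed dict (or None if absent) and the file as bytes (or None if missing), and the manifest is assumed to hold the full hex digest.

```python
import hashlib
from typing import Optional


def download_action(manifest: Optional[dict], file_bytes: Optional[bytes]) -> str:
    """Decide what the download step should do for one source file."""
    if manifest is None:
        return "download"          # nothing recorded yet
    if file_bytes is None:
        return "redownload"        # manifest exists but the file is gone
    if hashlib.sha256(file_bytes).hexdigest() != manifest["l0_hash"]:
        return "redownload"        # corrupted or partially written file
    return "skip"                  # present and intact
```

Keeping the decision free of I/O makes it easy to test; a caller would read the manifest and file from disk and then act on the returned label.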
When Sources Change
Some sources update files in place. NC SBE occasionally reissues precinct result files after canvass corrections. MEDSL publishes revised datasets with the same filename.
When a re-download produces different bytes than the stored l0_hash, the pipeline does not overwrite the existing L0 entry. Instead:
- The new file is stored with a versioned name: results_pct_20221108.v2.txt.
- A new manifest is created with the new l0_hash and current retrieval_date.
- The old file and manifest are retained unchanged.
All L1–L4 records that reference the old l0_hash remain valid. New pipeline runs against the updated file produce new L1–L4 records referencing the new l0_hash. Both versions coexist. The retrieval_date field distinguishes them.
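A sketch of the naming rule for a changed re-download. The `.v2` pattern comes from the example above; continuing the sequence as `.v3`, `.v4`, and so on is an assumption, as are the function names.

```python
import hashlib
from pathlib import PurePosixPath
from typing import List, Optional


def versioned_name(filename: str, version: int) -> str:
    """Insert '.v{version}' before the file extension."""
    path = PurePosixPath(filename)
    return f"{path.stem}.v{version}{path.suffix}"


def next_l0_name(filename: str, stored_hashes: List[str],
                 new_bytes: bytes) -> Optional[str]:
    """Return the name to store `new_bytes` under, or None if already stored.

    `stored_hashes` lists the l0_hash of every version kept so far, oldest first.
    """
    new_hash = hashlib.sha256(new_bytes).hexdigest()
    if new_hash in stored_hashes:
        return None                      # identical bytes, nothing to add
    if not stored_hashes:
        return filename                  # first copy keeps the source name
    return versioned_name(filename, len(stored_hashes) + 1)
```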
The L0 Hash as Root of Trust
The l0_hash is the only value in the pipeline that can be independently verified by anyone with access to the source. Download the file from the URL in the manifest, extract it if the source is an archive, and compute SHA-256 over the resulting bytes. If the result matches l0_hash, the pipeline processed the same bytes you hold.
Every subsequent hash (l1_hash, l2_hash, l3_hash, l4_hash) incorporates its parent’s hash, so the entire chain is anchored to l0_hash. If someone modifies the raw file, its recomputed L0 hash no longer equals the l0_parent_hash stored in the L1 records, and the verification algorithm reports a break at the L0→L1 boundary.
In our prototype, all 200 hash chains verified from L4 back to L0 with zero broken links. The verification starts here — at the raw bytes.
What L0 Does Not Do
L0 does not parse, filter, validate, or transform. A TSV file with malformed rows is stored as-is. A CSV file with a leading byte order mark (BOM) is stored as-is. A zip archive is decompressed and the contents stored, but the extraction is mechanical: no character encoding conversion, no line-ending normalization, no column reordering.
Data quality issues are L1’s problem. L0’s only job is to preserve the exact bytes that the source published, record where they came from, and make them verifiable.
L1: Cleaned — Deterministic Parsing and Enrichment
L1 transforms raw source files into structured JSONL records with a unified schema. It is purely deterministic: no machine learning, no API calls, no randomness. Given the same L0 files and the same parser version, L1 output is identical on every run, on every machine, forever.
This is deliberate. L1 is the foundation for every subsequent layer. If the foundation is non-deterministic, nothing above it can be reproduced.
One Parser Per Source, One Schema Out
Each source has a dedicated parser that understands its native format:
| Source | Format | Delimiter | Encoding | Parser |
|---|---|---|---|---|
| NC SBE | TSV (.txt extension) | \t | UTF-8 | nc_sbe_v2.1 |
| MEDSL | CSV | , | UTF-8 | medsl_v1.3 |
| OpenElections | CSV (varies by state) | , | UTF-8/Latin-1 | openelections_v1.0 |
| Clarity/Scytl | XML | — | UTF-8 | clarity_v0.5 |
Every parser produces the same output schema. A downstream consumer of L1 JSONL does not need to know whether a record originated from NC SBE or MEDSL — the fields, types, and semantics are identical.
The 10 Operations
L1 applies 10 operations in fixed order. The order matters — later operations depend on earlier ones.
1. Filter Non-Contests
Before any parsing, detect rows that are not candidate results. Pattern-match on the candidate name field:
| Pattern | Classification | Action |
|---|---|---|
| registered voters | TurnoutMetadata | Extract to turnout.registered_voters |
| ballots cast | TurnoutMetadata | Extract to turnout.ballots_cast |
| over votes | TurnoutMetadata | Extract to turnout.over_votes |
| under votes | TurnoutMetadata | Extract to turnout.under_votes |
| ^blank$ | TurnoutMetadata | Maine’s undervote label |
| total votes | Aggregation artifact | Discard (redundant with candidate sums) |
| for / against / yes / no | BallotMeasure (if contest name matches) | Route to MeasureChoice |
This runs first because non-contest rows must not enter name decomposition, office classification, or entity resolution. The principle is extract before filter — the registered voter count is valuable turnout data and is captured before the row is excluded from candidate analysis. See Non-Candidate Records.
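The pattern table can be sketched as a small classifier. This is a minimal illustration, not the project's parser: the function name classify_row, the "discard" label, the MEASURE_KEYWORDS list, and the choice to file Maine's "blank" label under under_votes are all assumptions made for the example.

```python
import re

# Patterns from the table above; matching is case-insensitive.
# Filing "^blank$" under under_votes is an assumption of this sketch.
TURNOUT_PATTERNS = {
    "registered_voters": re.compile(r"registered voters", re.I),
    "ballots_cast": re.compile(r"ballots cast", re.I),
    "over_votes": re.compile(r"over votes", re.I),
    "under_votes": re.compile(r"under votes|^blank$", re.I),
}
TOTAL_PATTERN = re.compile(r"total votes", re.I)
MEASURE_CHOICES = {"for", "against", "yes", "no"}
# Hypothetical keyword list; the source does not enumerate its measure keywords.
MEASURE_KEYWORDS = ("referendum", "bond", "amendment", "proposition")


def classify_row(candidate, contest):
    """Return (kind, turnout_field_or_None) for one source row."""
    name = candidate.strip()
    # Extract before filter: turnout rows are recognized first so their
    # values can be captured before the row leaves candidate analysis.
    for field, pattern in TURNOUT_PATTERNS.items():
        if pattern.search(name):
            return ("turnout_metadata", field)
    if TOTAL_PATTERN.search(name):
        return ("discard", None)
    is_choice = name.lower() in MEASURE_CHOICES
    is_measure = any(k in contest.lower() for k in MEASURE_KEYWORDS)
    if is_choice and is_measure:
        return ("ballot_measure", None)
    return ("candidate_race", None)
```

A row such as ("Against", "BOND REFERENDUM") routes to ballot_measure, while "Yes" in a contest with no measure keyword stays a candidate_race, since both conditions must hold.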
2. Parse Source Format
Source-specific column mapping. The NC SBE parser reads tab-separated fields: County, Election Date, Contest Name, Choice, Choice Party, Total Votes, Election Day, One Stop, Absentee by Mail, Provisional. The MEDSL parser reads CSV columns: state, county_name, office, candidate, party_simplified, votes, mode. Each parser maps its native columns to the unified schema fields.
Encoding normalization happens here. OpenElections files from some states use Latin-1 encoding; the parser detects and converts to UTF-8. MEDSL 2022 has trailing commas in some state files; the parser strips them.
3. Decompose Candidate Names
Split every candidate name into structured components. This is the most critical L1 operation — it determines what signal survives to L2 and L3.
The decomposition handles four source formats:
| Format | Example | Parsing strategy |
|---|---|---|
| LAST, FIRST MIDDLE | CRIST, CHARLES JOSEPH | Split on first comma; remainder is first + middle |
| First Last | Charlie Crist | Last token is last name (with multi-word last name detection) |
| First Middle Last Suffix | Robert Van Fletcher, Jr. | Suffix detected and extracted; remaining tokens parsed |
| LAST, FIRST M. | BRAY, SHANNON W. | Period stripped from middle initial |
The output for every format is the same six fields:
{
"raw": "Robert Van Fletcher, Jr.",
"first": "Robert",
"middle": "Van",
"last": "Fletcher",
"suffix": "Jr.",
"canonical_first": "Robert"
}
Every component is preserved. Middle initials are kept (they distinguish David S. Marshall from David A. Marshall). Suffixes are kept (they distinguish Robert Williams from Robert Williams Jr.). The raw field is never modified. See Name Normalization.
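A simplified sketch of the decomposition for the four formats in the table. It is not the project's parser: the SUFFIXES set and the function names are hypothetical, and multi-word last name detection is deliberately left out, so every token between first and last is treated as a middle name.

```python
# Hypothetical suffix list for illustration only.
SUFFIXES = {"JR", "SR", "II", "III", "IV"}


def _is_suffix(token):
    return token.rstrip(".").upper() in SUFFIXES


def decompose(raw):
    """Split a raw candidate name into the six L1 name fields."""
    text = raw.strip()
    suffix = None
    # Peel off a trailing ", Jr." style suffix before format detection.
    head, sep, tail = text.rpartition(",")
    if sep and _is_suffix(tail.strip()):
        suffix, text = tail.strip(), head.strip()
    if "," in text:
        # LAST, FIRST MIDDLE: split on the first comma.
        last_part, _, rest = text.partition(",")
        last = last_part.strip() or None
        tokens = rest.split()
        if suffix is None and len(tokens) > 1 and _is_suffix(tokens[-1]):
            suffix = tokens.pop()
    else:
        # First [Middle ...] Last [Suffix]: the final token is the last name.
        tokens = text.split()
        if suffix is None and len(tokens) > 2 and _is_suffix(tokens[-1]):
            suffix = tokens.pop()
        last = tokens.pop() if len(tokens) > 1 else None
    first = tokens[0] if tokens else None
    # Periods are stripped from middle initials ("W." becomes "W").
    middle = " ".join(t.rstrip(".") for t in tokens[1:]) or None
    return {
        "raw": raw,  # never modified
        "first": first,
        "middle": middle,
        "last": last,
        "suffix": suffix,
        "canonical_first": first,  # filled in properly by step 4
    }
```

With this sketch, decompose("Robert Van Fletcher, Jr.") yields the six-field record shown above, and decompose("BRAY, SHANNON W.") yields middle "W" with last "BRAY".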
4. Apply Nickname Dictionary
Look up the first field in the nickname dictionary (~100 mappings in the prototype, targeting 500+). If a mapping exists, populate canonical_first with the formal equivalent. If not, canonical_first equals first.
| first | canonical_first | Mapping |
|---|---|---|
| Charlie | Charles | Charlie → Charles |
| Ron | Ronald | Ron → Ronald |
| Nikki | Nicole | Nikki → Nicole |
| Timothy | Timothy | No mapping (already formal) |
Both fields are preserved. The composite string sent to L2 uses canonical_first; the original first is retained for display and provenance. See The Nickname Dictionary.
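The lookup itself is a plain dictionary access. A minimal sketch, assuming a case-insensitive key and a three-entry excerpt of the dictionary (the real table and its casing rules may differ):

```python
# Excerpt of the nickname dictionary; keys are lowercased for lookup.
NICKNAMES = {"charlie": "Charles", "ron": "Ronald", "nikki": "Nicole"}


def apply_nickname(name):
    """Return a copy of the name record with canonical_first populated."""
    first = name.get("first")
    formal = NICKNAMES.get(first.lower()) if first else None
    # No mapping: canonical_first falls back to first. The original
    # first value is always kept alongside it.
    return {**name, "canonical_first": formal or first}
```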
5. Classify Contest Kind
Route each record to one of three contest kinds based on signals from steps 1 and 2:
| Kind | Criteria | Example |
|---|---|---|
| candidate_race | Default — a person running for office | Timothy Lance for Board of Education |
| ballot_measure | Candidate name is For/Against/Yes/No AND contest name matches measure keywords | “Against” in “BOND REFERENDUM” |
| turnout_metadata | Candidate name matches turnout patterns | “Registered Voters” |
Records classified as ballot_measure get a MeasureChoice result instead of CandidateResult. Records classified as turnout_metadata are extracted and attached to sibling precinct records.
6. Classify Office (Tiers 1–2)
Apply the deterministic tiers of the office classifier:
Tier 1: Keyword lookup (~170 entries). Case-insensitive substring match. "board of education" in the contest name → school_district/education. Handles ~45% of unique office names, ~85% of records by volume.
Tier 2: Regex patterns (~40 patterns). county\s+commission → county/legislative. Adds ~17% of unique names.
Records that do not match either tier are classified as other with classifier_confidence: 0.0. They proceed to L2 for tier 3 (embedding nearest-neighbor) and, if still unclassified, to L3 for tier 4 (LLM classification).
The classifier_method field records which tier produced the classification: "keyword", "regex", or "unclassified".
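The two deterministic tiers can be sketched as a keyword pass followed by a regex pass. The tables below are one-entry excerpts taken from the examples above; the function name and the returned dict shape are assumptions of the sketch, and confidence values for matched tiers are omitted because the source does not specify them.

```python
import re

# One-entry excerpts of the ~170 keywords and ~40 regex patterns.
KEYWORDS = {"board of education": ("school_district", "education")}
REGEXES = [(re.compile(r"county\s+commission", re.I), ("county", "legislative"))]


def classify_office(contest_name):
    """Apply tier 1 (keyword) then tier 2 (regex); fall through to other."""
    lowered = contest_name.lower()
    # Tier 1: case-insensitive substring match.
    for keyword, (level, branch) in KEYWORDS.items():
        if keyword in lowered:
            return {"office_level": level, "office_branch": branch,
                    "classifier_method": "keyword"}
    # Tier 2: regex patterns.
    for pattern, (level, branch) in REGEXES:
        if pattern.search(contest_name):
            return {"office_level": level, "office_branch": branch,
                    "classifier_method": "regex"}
    # Neither tier matched: left for the probabilistic tiers downstream.
    return {"office_level": "other", "office_branch": None,
            "classifier_method": "unclassified", "classifier_confidence": 0.0}
```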
7. Enrich Geography
Look up FIPS codes from bundled Census Bureau reference data:
- State FIPS: 2-digit code from state abbreviation. NC → 37.
- County FIPS: 5-digit code from (state, county name). (NC, COLUMBUS) → 37047.
- Place FIPS: Where available, municipal codes from Census place files.
- OCD-ID: Open Civic Data identifier. ocd-division/country:us/state:nc/county:columbus.
The reference data covers 3,143 counties and 31,980 places. FIPS enrichment achieves 100% county coverage for records with valid state and county name fields. Municipal FIPS coverage is lower (~85%) because municipality names are less standardized.
8. Compute Vote Shares
For each candidate in a contest within a precinct:
vote_share = votes_total / sum(votes_total for all candidates in same contest+precinct)
Vote share is a convenience field — it can always be recomputed from the raw vote counts. It is included because downstream queries (margins, competitiveness rankings) use it constantly.
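As a worked example, the formula applied to one contest within one precinct. The opponent's 276 votes are a made-up figure chosen for illustration, and returning None when no votes were cast is an assumption of this sketch rather than documented behavior.

```python
def add_vote_shares(results):
    """Attach vote_share to each result of one contest in one precinct."""
    total = sum(r["votes_total"] for r in results)
    return [
        # Guard against division by zero when no votes were recorded.
        {**r, "vote_share": (r["votes_total"] / total) if total else None}
        for r in results
    ]


precinct = [
    {"candidate": "Timothy Lance", "votes_total": 303},
    {"candidate": "Opponent (hypothetical)", "votes_total": 276},
]
shares = add_vote_shares(precinct)
print(round(shares[0]["vote_share"], 3))  # 0.523
```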
9. Backfill Turnout
If step 1 extracted turnout metadata rows for a precinct, attach the values to all sibling contest records in the same precinct:
{
"turnout": {
"registered_voters": 4217,
"ballots_cast": 2891,
"turnout_rate": 0.6855
}
}
Turnout data is available in NC SBE and some OpenElections files. It is absent from MEDSL and Clarity. When absent, the turnout field is null — not zero, not omitted, but explicitly null to distinguish “no data” from “zero registered voters.”
10. Compute L1 Hash
The final operation seals the record into the hash chain:
l1_hash = SHA-256( serialize(record_without_hash) + "parent:" + l0_hash )
The l0_hash comes from the L0 manifest of the source file. The l1_hash becomes the anchor for L2. See Provenance and the Hash Chain.
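A minimal sketch of the sealing step. The source does not specify the serialize function, so canonical JSON (sorted keys, compact separators) is assumed here. The function returns the full 64-character digest, whereas the hashes shown in this document are 16 characters long.

```python
import copy
import hashlib
import json


def compute_l1_hash(record, l0_hash):
    """SHA-256 over the serialized record (minus its own hash) plus the parent."""
    stripped = copy.deepcopy(record)  # leave the caller's record untouched
    stripped.get("provenance", {}).pop("l1_hash", None)
    # Assumed serialization: canonical JSON, so key order cannot change the hash.
    payload = json.dumps(stripped, sort_keys=True, separators=(",", ":"),
                         ensure_ascii=False)
    digest = hashlib.sha256((payload + "parent:" + l0_hash).encode("utf-8"))
    return digest.hexdigest()
```

Because the parent hash is part of the input, any change to the L0 file produces a different l1_hash even when the parsed record is byte-for-byte the same.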
A Real L1 Record
Timothy Lance, precinct P17, Columbus County Schools Board of Education District 02, 2022 NC general election:
{
"election": {"date": "2022-11-08", "type": "general"},
"jurisdiction": {
"state": "NC", "state_fips": "37",
"county": "COLUMBUS", "county_fips": "37047",
"precinct": "P17", "level": "precinct"
},
"contest": {
"kind": "candidate_race",
"raw_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
"office_level": "school_district",
"classifier_method": "regex",
"classifier_confidence": 0.85,
"vote_for": 1
},
"results": [
{
"candidate_name": {
"raw": "Timothy Lance", "first": "Timothy",
"middle": null, "last": "Lance", "suffix": null,
"canonical_first": "Timothy"
},
"votes_total": 303,
"vote_share": 0.523,
"vote_counts_by_type": {
"election_day": 136, "early": 159,
"absentee_mail": 7, "provisional": 1
}
}
],
"turnout": {
"registered_voters": 4217,
"ballots_cast": 2891,
"turnout_rate": 0.6855
},
"source": {
"source_type": "nc_sbe",
"source_file": "results_pct_20221108.txt",
"confidence": "high"
},
"provenance": {
"l1_hash": "8ea7ecc257ff8e05",
"l0_parent_hash": "edfedf2760cfd54f",
"parser_version": "nc_sbe_v2.1",
"schema_version": "3.0.0"
}
}
Every field traces to a specific operation: county_fips from step 7, canonical_first from step 4, office_level from step 6, turnout from step 9, l1_hash from step 10.
What L1 Does Not Do
- No embeddings. Embedding generation requires an API call to OpenAI. L1 runs offline with zero external dependencies.
- No entity resolution. L1 does not determine whether two records refer to the same person. That is L3’s job.
- No canonical name selection. L1 preserves all name components. Choosing the “best” name is L4’s job, after entity resolution.
- No tier 3/4 office classification. Embedding-based and LLM-based classification require API calls. L1 applies only the deterministic tiers (keyword and regex). Records that need tiers 3–4 are marked "classifier_method": "unclassified" and classified downstream (tier 3 at L2, tier 4 at L3).
This boundary is the determinism boundary. Everything L1 does can be verified by re-running the parser on the same L0 files. No API key, no network connection, no randomness.
L2: Embedded — Vector Generation and Classification
L2 transforms L1’s structured text fields into vector embeddings suitable for fuzzy matching, applies tier 3 office classification, and raises quality flags on suspicious records. It is the bridge between deterministic parsing (L1) and probabilistic entity resolution (L3).
Embedding Model
The embedding model is OpenAI’s text-embedding-3-large, producing 3,072-dimensional float32 vectors. Every L2 record stores the model identifier and dimensionality:
{
"embedding_model": "text-embedding-3-large",
"embedding_dimensions": 3072
}
This metadata is not optional. Thresholds calibrated for text-embedding-3-large (auto-accept ≥ 0.95, ambiguous 0.35–0.95, auto-reject < 0.35) are not portable to other models. If the model changes, the thresholds must be recalibrated against the test cases. Storing the model in every record ensures that stale thresholds are never applied to vectors from a different model.
Composite String Templates
Raw name components are not embedded directly. They are assembled into composite strings that include contextual fields — office, state, county, party — so that the resulting vectors encode identity-relevant context alongside the name.
Three composite types are generated per record:
| Type | Template | Example |
|---|---|---|
| Candidate | {canonical_first} {middle} {last} {suffix} \| {party} \| {office} \| {state} \| {county} | Timothy Lance \| \| BOARD OF EDUCATION DISTRICT 02 \| NC \| Columbus |
| Contest | {raw_name} \| {office_level} \| {state} {year} | COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 \| school_district \| NC 2022 |
| Geography | {municipality}, {county} County, {state} | Whiteville, Columbus County, NC |
Middle initials and suffixes are included deliberately. “David S Marshall | ME” and “David A Marshall | FL” produce different vectors — measured at cosine 0.6448 with middle initials versus 0.7025 without. That 0.058 gap is the difference between correct separation and a false merge. See Composite String Templates for the full rationale, including the “context bleed” problem where shared geographic context artificially inflates similarity between unrelated candidates.
Empty components (null middle, null suffix) produce empty slots in the template rather than being omitted. This keeps the template structure consistent, which stabilizes the embedding model’s tokenization.
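A sketch of how the candidate template might be filled. The template string is taken from the table above; the function name and argument shape are hypothetical. Null components are substituted as empty strings so the slot structure stays fixed. Whether the production code keeps the resulting doubled spaces or collapses them is not stated in the source, so this sketch keeps them.

```python
CANDIDATE_TEMPLATE = (
    "{canonical_first} {middle} {last} {suffix} | {party} | {office} "
    "| {state} | {county}"
)


def candidate_composite(name, party, office, state, county):
    """Fill the candidate template, leaving empty slots for null components."""
    return CANDIDATE_TEMPLATE.format(
        canonical_first=name.get("canonical_first") or "",
        middle=name.get("middle") or "",
        last=name.get("last") or "",
        suffix=name.get("suffix") or "",
        party=party or "",
        office=office,
        state=state,
        county=county,
    )
```

For Timothy Lance (no middle name, suffix, or party) the slots are present but empty; collapsing runs of whitespace in the result gives the example string shown in the table.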
FAISS Indices
Embeddings are stored in partitioned FAISS indices, one per (state, year) combination. Partitioning serves two purposes:
- Blocking alignment. Entity resolution at L3 blocks by (state, office_level, last_name_initial). State-level FAISS partitions ensure that nearest-neighbor queries never cross state boundaries: a candidate in NC is never compared to a candidate in FL during retrieval.
- Memory management. A single national index holding 42 million candidate embeddings at 3,072 dimensions × 4 bytes would require roughly 500 GB of float32 data. Per-state-year partitions fit in memory on commodity hardware. NC 2022 (~200K records × 3,072 dims × 4 bytes) is approximately 2.5 GB (about 2.3 GiB).
The index type is IndexFlatIP: inner product on unit-length vectors (normalized by their Euclidean norm, not to be confused with layer L2), which is equivalent to cosine similarity. There is no approximate search; exact cosine is computed for every candidate pair within a block. At partition scale, exact search is fast enough (sub-second for 200K vectors) and avoids the recall loss of approximate methods like IVF or HNSW.
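To make the equivalence concrete, here is a pure-Python sketch of what a flat inner-product search computes once vectors are scaled to unit length. It does not use FAISS; the function names are illustrative, and the toy vectors have 2 dimensions instead of 3,072.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def exact_search(query, vectors, k=1):
    """Brute-force top-k by inner product on unit vectors (= cosine)."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(v))), index)
        for index, v in enumerate(vectors)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]  # list of (cosine, row_index)
```

Every stored vector is scored, so recall is exact by construction; the cost is linear in partition size, which is why the partitioning above matters.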
Tier 3 Office Classification
Records that were not classified by L1’s keyword (tier 1) or regex (tier 2) classifiers are embedded and compared against a reference set of ~200 pre-classified office names.
The reference set is a curated list covering every (office_level, office_branch) pair with at least 3 examples. Each reference entry has a pre-computed embedding. For an unclassified office name, L2 computes its embedding, finds the nearest reference neighbor by cosine similarity, and assigns the reference’s classification if the score exceeds 0.60.
Real tier 3 results:
| Unclassified Name | Nearest Reference | Cosine | Assigned Classification |
|---|---|---|---|
| Collier Mosquito Control District | Mosquito Control District | 0.787 | special_district / infrastructure |
| Eastern Carrituck Fire & Rescue | Fire Protection District | 0.724 | special_district / infrastructure |
| Lowndes County Bd of Ed | Board of Education | 0.831 | school_district / education |
Names scoring below 0.60 are left as other at L2 and passed to tier 4 (LLM) at L3. Tier 3 classifies approximately 4.5% of the unique office names that survived tiers 1 and 2, with 94% accuracy against manual review.
The classification result is written back into the L1-inherited fields on the enriched L2 record, updating classifier_method to "embedding_nn" and classifier_confidence to the cosine score.
Quality Flags
L2 raises flags on records with characteristics that may cause downstream problems. Flags do not block processing — they annotate records for review at L4.
| Flag | Condition | Example |
|---|---|---|
| short_name | Candidate name has ≤ 2 characters after decomposition | "J. D." with no last name parsed |
| common_name_risk | First + last name appears 50+ times nationally | John Smith, Robert Johnson |
| missing_office_level | Office is still classified as other after tiers 1–3 | Santa Rosa Island Authority (pre-tier-4) |
| zero_votes | votes_total is 0 | Write-in candidates with no votes |
| high_vote_share | Single candidate has > 99% of votes in a contested race | Possible data error or unopposed misclassification |
In our prototype, 12 of 200 records received at least one quality flag. The most common was zero_votes (write-in placeholders), followed by common_name_risk.
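The flag conditions translate directly into predicates. In this sketch the function signature is hypothetical, "characters" is interpreted as alphabetic characters across first, middle, and last, and a race counts as contested when it has more than one candidate; the source does not pin down either detail.

```python
def quality_flags(name, office_level, votes_total, vote_share,
                  n_candidates, national_count):
    """Return the list of L2 quality flags raised for one candidate result."""
    flags = []
    letters = sum(
        ch.isalpha()
        for part in ("first", "middle", "last")
        for ch in (name.get(part) or "")
    )
    if letters <= 2:
        flags.append("short_name")
    if national_count >= 50:  # first + last seen 50+ times nationally
        flags.append("common_name_risk")
    if office_level == "other":
        flags.append("missing_office_level")
    if votes_total == 0:
        flags.append("zero_votes")
    if n_candidates > 1 and vote_share is not None and vote_share > 0.99:
        flags.append("high_vote_share")
    return flags
```

Flags are returned as annotations only; nothing in this function blocks or drops a record.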
Output Format
L2 produces two types of output per (state, year) partition:
Enriched JSONL — L1 records augmented with an l2 block:
{
"...all L1 fields...",
"l2": {
"l2_hash": "854fa6367960bb05",
"l1_parent_hash": "8ea7ecc257ff8e05",
"embedding_model": "text-embedding-3-large",
"embedding_dimensions": 3072,
"candidate_embedding_id": 4271,
"contest_embedding_id": 183,
"candidate_composite": "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus",
"contest_composite": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 | school_district | NC 2022",
"quality_flags": []
}
}
Binary sidecars — .npy files containing float32 arrays of embeddings, plus a JSON ID mapping:
l2_embedded/nc/2022/
├── enriched.jsonl # One record per line, all L1 + L2 fields
├── candidate_embeddings.npy # float32[N, 3072]
├── contest_embeddings.npy # float32[M, 3072]
├── geography_embeddings.npy # float32[K, 3072]
└── id_mapping.json # l1_hash → embedding row index
Embeddings are stored separately from JSONL to keep the text records streamable. A 3,072-dimensional float32 vector is 12,288 bytes — embedding it as base64 inside JSON would triple the JSONL file size. The .npy format is readable by NumPy, PyTorch, and any tool that understands the NumPy array file specification.
The candidate_embedding_id in the JSONL record is an integer index into candidate_embeddings.npy. To retrieve Timothy Lance’s embedding: load the .npy file, index row 4271.
Determinism
L2 is deterministic given the same L1 input, the same embedding model version, and the same office reference set. The composite string templates are fixed. The FAISS index construction is deterministic (flat index, no random initialization). The tier 3 nearest-neighbor search is exact.
If OpenAI updates the weights behind text-embedding-3-large without changing the model name, the vectors change silently. The embedding_model field cannot detect this — it records the API model name, not an internal version hash. In practice, OpenAI has not changed embedding model weights after release. If they do, a full L2 re-run and threshold recalibration is required.
Dependencies
L2 requires an OpenAI API key for embedding generation. L0 and L1 do not — they run entirely offline. This is the first layer that requires network access.
At prototype scale (200 records), L2 embedding generation takes approximately 3 seconds and costs less than $0.01. At production scale (42 million rows), the cost is approximately $300 and the wall-clock time depends on API throughput (typically 3,000 embeddings per minute with batching, yielding ~10 days for the full corpus). Embeddings are computed once per L1 record and cached — re-running L3 or L4 does not re-invoke the embedding API.
L3: Matched — Entity Resolution and LLM Confirmation
L3 is the first non-deterministic layer. It resolves entities — determining which records across sources, precincts, and elections refer to the same candidate and the same contest. Every decision is stored in a JSONL audit log with full prompt, response, and reasoning, enabling deterministic replay even though the underlying LLM calls are non-deterministic.
Input and Output
Input: L2 enriched JSONL records with embeddings, composite strings, and quality flags.
Output:
- Enriched JSONL with candidate_entity_id and contest_entity_id assignments.
- A decision log (candidate_matches.jsonl) recording every comparison made and its outcome.
Blocking
Before pairwise comparison begins, records are partitioned into blocks by (state, office_level, last_name_initial). Only pairs within the same block are compared. A candidate for NC school board is never compared to a candidate for FL sheriff.
This reduces the comparison space by approximately four orders of magnitude. The blocking key is deliberately coarse — we accept some noise within blocks (two unrelated people whose last names start with the same letter, in the same state, at the same office level) in exchange for never missing a legitimate match. The step 2.5 gate handles within-block noise cheaply.
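Blocking can be sketched as grouping by key and generating pairs only inside each group. The flat record shape (state, office_level, last as top-level keys) is a simplification for the example, and records are assumed to have a non-empty last name.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record):
    """(state, office_level, last_name_initial) for one candidate record."""
    return (record["state"], record["office_level"], record["last"][0].upper())


def candidate_pairs(records):
    """Yield only the pairs that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for members in blocks.values():
        # Pairwise comparison happens within a block, never across blocks.
        yield from combinations(members, 2)
```

Within a block of n records this still yields n(n-1)/2 pairs, which is the noise the name gate is meant to absorb.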
The Five-Step Cascade
| Step | Method | Prototype result | Cost per pair |
|---|---|---|---|
| 1 | Exact match on (canonical_first, last, suffix) | 597 (70.0%) | negligible |
| 2 | Jaro-Winkler ≥ 0.92 on full name | 1 (0.1%) | microseconds |
| 2.5 | Name gate: JW on last name < 0.50 → skip | — (gate) | microseconds |
| 3 | Embedding cosine ≥ 0.95 AND same state → auto-accept | 50 (5.9%) | pre-computed |
| 4 | LLM confirmation: cosine 0.35–0.95 | 30 (3.5%) | ~$0.0002/call |
| 5 | Tiebreaker: stronger model when step 4 confidence < 0.70 | 0 (rare) | ~$0.002/call |
Percentages are from the 200-record Columbus County NC prototype. 206 unique candidate entities were created.
Step 1: Exact Match
The match key is (canonical_first, last, suffix) within a (state, office_level) block. Timothy Lance appears in 47 precinct rows — all 47 share the same key and collapse to one entity. No fuzzy logic, no API calls.
This step handles the overwhelmingly common case: the same candidate appearing identically across precincts within a single source.
Step 2: Jaro-Winkler (≥ 0.92)
Catches minor spelling variations that survive L1 parsing — Mcdonough vs McDonough, transposition errors, inconsistent hyphenation. The threshold of 0.92 is strict to avoid false positives on common surnames.
In the prototype, step 2 resolved 1 additional candidate. Most formatting differences are already normalized at L1.
Step 2.5: The Name Similarity Gate
Before computing embedding similarity, check last-name Jaro-Winkler. If below 0.50, skip the pair entirely.
This gate was added after a prototype finding. The original cascade had no step 2.5, and all 30 LLM calls were spent on pairs like “Aaron Bridges” vs “Daniel Blanton”: candidates in the same (NC, school_district, B) block with completely different names. Every call correctly returned no-match, but each cost an API round-trip. The gate eliminates these obvious non-matches before they reach embedding or LLM steps.
At scale, with millions of within-block pairs, this gate prevents orders-of-magnitude waste in downstream steps.
Step 3: Embedding Auto-Accept (≥ 0.95)
For pairs that pass the gate but did not exact-match, retrieve pre-computed L2 cosine similarity. If ≥ 0.95 AND both candidates are in the same state, auto-accept.
The 0.95 threshold is deliberately high. Robert Williams Jr scored 0.862 against Robert Williams — a false positive under the original 0.82 threshold. At 0.95, only near-identical strings with trivial formatting differences pass. Barbara Sharief at 0.955 is an example that auto-accepts: the only difference is a middle initial J added in one source.
A secondary acceptance rule handles the band just below 0.95: embedding ≥ 0.90 AND JW on full name ≥ 0.92 AND same state → accept. This catches Ashley Moody (0.930 cosine) without requiring an LLM call.
Step 4: LLM Confirmation (0.35–0.95)
Pairs in the ambiguous zone are sent to Claude Sonnet with structured context: both candidates’ parsed name components, vote counts, office, state, party, and the embedding score. The LLM returns a decision (match/no-match), confidence (0.0–1.0), and free-text reasoning.
The ambiguous zone is wide (0.35–0.95) by design. Budget is not a constraint. The zone was widened from the original 0.65–0.82 after two findings:
- Charlie Crist at 0.451 — a true match that the old 0.65 reject threshold would have discarded.
- Robert Williams Jr at 0.862 — a false positive that the old 0.82 accept threshold would have merged.
The wider zone sends more pairs to the LLM in exchange for zero threshold-induced errors in the tested range.
Step 5: Tiebreaker
When step 4 returns confidence below 0.70, the pair escalates to an Opus-class model. This handles unusual nicknames, slight vote-count discrepancies, and geographic ambiguity that Sonnet finds uncertain. Step 5 was not triggered in the 200-record prototype; it exists for the long tail of ambiguity at production scale.
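The routing logic of the cascade can be sketched as a single function over pre-computed scores. This is an illustration of the thresholds described above, not the project's implementation: the function names and return labels are invented, the secondary acceptance rule (cosine ≥ 0.90 with full-name JW ≥ 0.92) is left out for brevity, and treating a cross-state pair at ≥ 0.95 as ambiguous is an assumption.

```python
def cascade_step(exact_key_equal, jw_full, jw_last, cosine, same_state):
    """Return which cascade step decides a candidate pair."""
    if exact_key_equal:            # step 1: (canonical_first, last, suffix)
        return "exact"
    if jw_full >= 0.92:            # step 2: Jaro-Winkler on full name
        return "jaro_winkler"
    if jw_last < 0.50:             # step 2.5: name gate
        return "gate_reject"
    if cosine >= 0.95 and same_state:  # step 3: embedding auto-accept
        return "embedding_auto_accept"
    if cosine >= 0.35:             # step 4: ambiguous zone goes to the LLM
        return "llm"
    return "auto_reject"           # below the ambiguous zone


def needs_tiebreaker(llm_confidence):
    """Step 5: escalate when the step 4 model is unsure."""
    return llm_confidence < 0.70
```

With a cosine of 0.451 and an identical last name, the Charlie Crist pair routes to "llm"; a pair with last-name JW of 0.40 never reaches the embedding comparison.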
The Decision Log
Every comparison — not just LLM calls — is recorded in a JSONL audit log at l3_matched/{state}/{year}/decisions/candidate_matches.jsonl. One record per pair examined.
An LLM-decided entry:
{
"decision_id": "a3f8c1d2-4e7b-4a1f-9c3d-8f2e1a6b5c4d",
"decision_type": "candidate_match",
"timestamp": "2026-03-19T10:30:00Z",
"inputs": {
"name_a": "Charlie Crist",
"name_b": "CRIST, CHARLES JOSEPH",
"embedding_score": 0.451,
"jw_last_name": 1.0,
"state_a": "FL", "state_b": "FL",
"contest_a": "Governor", "contest_b": "Governor",
"votes_a": 3101652, "votes_b": 3101652
},
"method": {
"type": "llm",
"model": "claude-sonnet-4-20250514",
"prompt_template_version": "entity_match_v2.0"
},
"output": {
"decision": "match",
"confidence": 0.95,
"reasoning": "Charlie is a common nickname for Charles. Same state, same office, identical vote counts."
}
}
An exact-match entry is simpler:
{
"decision_id": "b7c2e4f1-...",
"decision_type": "candidate_match",
"timestamp": "2026-03-19T10:30:01Z",
"inputs": {
"name_a": "Timothy Lance",
"name_b": "Timothy Lance",
"state_a": "NC", "state_b": "NC"
},
"method": {
"type": "exact",
"model": null,
"prompt_template_version": null
},
"output": {
"decision": "match",
"confidence": 1.0,
"reasoning": "Exact match on (canonical_first=Timothy, last=Lance, suffix=null)"
}
}
A gate-rejected entry:
{
"decision_id": "c9d3a5e2-...",
"decision_type": "candidate_match",
"timestamp": "2026-03-19T10:30:02Z",
"inputs": {
"name_a": "Aaron Bridges",
"name_b": "Daniel Blanton",
"jw_last_name": 0.40,
"state_a": "NC", "state_b": "NC"
},
"method": {
"type": "gate_reject",
"model": null,
"prompt_template_version": null
},
"output": {
"decision": "no_match",
"confidence": 1.0,
"reasoning": "Last-name JW 0.40 below gate threshold 0.50; skipped."
}
}
L3 Record Output
Each L1/L2 record is augmented with entity assignments:
{
"...all L1 and L2 fields...",
"l3": {
"l3_hash": "28183d41d50204d5",
"l2_parent_hash": "854fa6367960bb05",
"candidate_entity_ids": [
{"result_index": 0, "entity_id": "person:nc:columbus:lance-timothy-13"}
],
"contest_entity_id": "contest:nc:columbus:school-board-d02"
}
}
The entity_id format encodes scope: person:{state}:{county}:{last}-{first}-{sequence}. The sequence number disambiguates within a name — necessary when two genuinely different people share the same canonical first and last name in the same county.
Contest entity IDs follow a parallel scheme: contest:{state}:{county}:{office-slug}.
Reproducibility
L3 is non-deterministic because LLM responses may vary between runs. Two strategies make it reproducible in practice:
Replay from log. The decision log contains every match decision with its inputs and outputs. Re-running L3 in replay mode reads decisions from the log instead of calling the LLM. This produces identical L3 output — deterministic given the logged decisions.
Re-run with audit. Re-running L3 with live LLM calls produces a new decision log. Diffing the two logs reveals any decisions where the LLM changed its mind. In testing, decision stability is high: the same pair with the same context produces the same match/no-match outcome in >99% of re-runs. Confidence scores may vary by ±0.05.
For published results, the decision log is the canonical record. The LLM is a tool that produced the decisions; the decisions themselves are the data.
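Both strategies reduce to simple operations on the log. The sketch below keys decisions by the pair of raw names, which is a simplification: a real key would need record identifiers, since the same two names can appear in different contexts. Function names are hypothetical.

```python
import json


def load_decisions(lines):
    """Index decision-log entries by an order-independent pair key."""
    index = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        entry = json.loads(line)
        inputs = entry["inputs"]
        index[frozenset((inputs["name_a"], inputs["name_b"]))] = entry["output"]
    return index


def replay(index, name_a, name_b):
    """Replay mode: return the logged output instead of calling the LLM."""
    return index.get(frozenset((name_a, name_b)))


def changed_decisions(old, new):
    """Re-run with audit: pairs whose match/no-match outcome differs."""
    return [
        pair for pair in old
        if pair in new and old[pair]["decision"] != new[pair]["decision"]
    ]
```

A replay returning None signals a pair that was never examined in the logged run, which should be treated as an error in replay mode rather than silently sent to the LLM.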
The 30 Wasted Calls
The prototype’s most actionable finding: all 30 LLM calls were wasted. Every one compared candidates with obviously different names — “Aaron Bridges” vs “Daniel Blanton”, “Timothy Lance” vs “Jessica Moore” — that happened to share a blocking key. The embedding scores ranged from 0.55 to 0.73, placing them in the ambiguous zone. The LLM correctly rejected all 30 with high confidence.
The root cause was coarse blocking without a name-similarity pre-filter. The fix — step 2.5, requiring JW ≥ 0.50 on last names before proceeding — would have eliminated all 30 calls. At production scale, this gate is the difference between thousands of useful LLM calls and millions of wasted ones.
Budget and the Ambiguous Zone
Budget is not a constraint for this project. This changes the threshold calculus:
| Decision | Budget-constrained approach | Our approach |
|---|---|---|
| Ambiguous zone width | Narrow (0.65–0.82) to minimize LLM calls | Wide (0.35–0.95) to maximize accuracy |
| Step 5 model | Same as step 4 (cheaper) | Opus-class (more capable) |
| Audit coverage | Sample-based | Every multi-member entity audited at L4 |
The wider ambiguous zone means ~25% of within-block pairs reach the LLM, up from ~5% with the old thresholds. The step 2.5 gate keeps the absolute call volume manageable by rejecting pairs with dissimilar last names before they enter the zone.
The cascade still exists despite unlimited budget. Sending every pair to the LLM would take weeks of API calls at 42 million rows — cost is irrelevant when wall-clock time is the bottleneck. And deterministic steps are preferred not because they are cheaper, but because they are reproducible and do not hallucinate.
Cross-References
- Entity Resolution Overview — the problem and why each step exists
- The Cascade: Step by Step — detailed walkthrough with real examples at every step
- Real Test Cases — all tested pairs with scores and decisions
- Threshold Calibration — old vs. new thresholds
- When the LLM Gets Called — invocation policy across the pipeline
- Budget Is Not a Constraint — what unlimited budget changes and what it does not
L4: Canonical — Authoritative Names and Verification
L4 is the final layer. It consumes L3’s entity assignments and produces the researcher-facing outputs: canonical names, temporal chains across elections, alias tables, and the results of six verification algorithms. L4 is deterministic given the same L3 input — no LLM calls are made during construction (though the LLM entity audit is part of verification).
Canonical Name Selection
Each candidate entity has multiple name variants collected from across sources and precincts. L4 selects one canonical name using a fixed algorithm:
- Collect all variants. For entity person:nc:columbus:lance-timothy-13, the variants might be Timothy Lance (NC SBE), TIMOTHY LANCE (MEDSL), and Lance, Timothy (OpenElections).
- Prefer the most complete. A variant with a middle initial beats one without. A variant with a suffix beats one without. SHANNON W BRAY beats SHANNON BRAY. Robert Williams Jr beats Robert Williams (when they are the same entity, which is rare, since Jr usually indicates a different person).
- Among equally complete, prefer the most authoritative source. Source authority ranking:
  - Certified state data (NC SBE): highest
  - Academic curated data (MEDSL): second
  - Community-curated data (OpenElections): third
  - Election night reporting (Clarity/Scytl): lowest
- Among equally authoritative, prefer the most recent. A 2022 record beats a 2018 record for the same entity.
The selected canonical name is a presentation choice, not an analytical input. By the time L4 runs, entity resolution is complete — the identity question is settled at L3. L4 is choosing a label for a known entity.
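The precedence rules map naturally onto a sort key. In this sketch, completeness is scored as the count of present middle and suffix components, and the source_type strings other than nc_sbe are guesses at identifiers; both are assumptions rather than documented behavior.

```python
# Lower rank = more authoritative. Only "nc_sbe" is a confirmed identifier.
SOURCE_RANK = {"nc_sbe": 0, "medsl": 1, "openelections": 2, "clarity": 3}


def completeness(variant):
    """Count the optional components present (middle, suffix)."""
    return int(bool(variant.get("middle"))) + int(bool(variant.get("suffix")))


def select_canonical(variants):
    """Pick one variant: most complete, then most authoritative, then newest."""
    return min(
        variants,
        key=lambda v: (-completeness(v), SOURCE_RANK[v["source_type"]], -v["year"]),
    )
```

Because the key is a tuple, source authority is consulted only when completeness ties, and recency only when both tie.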
Temporal Chain Aggregation
L4 builds one temporal chain entry per (entity, election, contest). A candidate who appeared in 47 precincts in one election gets one entry with the summed vote total — not 47 entries.
This fixes a prototype bug. The initial implementation built temporal chains per precinct, producing entries like “Timothy Lance, 2022, P17, 303 votes” and “Timothy Lance, 2022, P21, 287 votes.” For career tracking and competitiveness analysis, the correct granularity is the election level: “Timothy Lance, 2022, Columbus County Schools Board of Education District 02, 1,531 votes.”
The aggregation:
{
"entity_id": "person:nc:columbus:lance-timothy-13",
"canonical_name": "Timothy Lance",
"aliases": ["Timothy Lance", "TIMOTHY LANCE"],
"elections": [
{
"date": "2022-11-08",
"contest": "Columbus County Schools Board of Education District 02",
"contest_entity_id": "contest:nc:columbus:school-board-d02",
"votes": 1531,
"vote_share": 0.523,
"outcome": "won",
"source_count": 1
}
],
"states": ["NC"],
"first_appearance": "2022-11-08",
"election_count": 1
}
For multi-cycle candidates, the elections array grows. George Dunlap — Mecklenburg County Commissioner across 6 consecutive cycles (2014–2024) — has 6 entries in his temporal chain, each with the contest-level vote total for that election.
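The aggregation is a group-by over (entity, election, contest) with a sum over precincts. A minimal sketch with a hypothetical flat row shape; the two precinct figures are the ones quoted above, so the total here covers only those two precincts, not all 47.

```python
from collections import defaultdict


def aggregate_elections(rows):
    """Collapse precinct rows into one entry per (entity, election, contest)."""
    totals = defaultdict(int)
    for row in rows:
        key = (row["entity_id"], row["date"], row["contest_entity_id"])
        totals[key] += row["votes"]
    return [
        {"entity_id": entity, "date": date, "contest_entity_id": contest,
         "votes": votes}
        for (entity, date, contest), votes in sorted(totals.items())
    ]


rows = [
    {"entity_id": "person:nc:columbus:lance-timothy-13", "date": "2022-11-08",
     "contest_entity_id": "contest:nc:columbus:school-board-d02",
     "precinct": "P17", "votes": 303},
    {"entity_id": "person:nc:columbus:lance-timothy-13", "date": "2022-11-08",
     "contest_entity_id": "contest:nc:columbus:school-board-d02",
     "precinct": "P21", "votes": 287},
]
print(aggregate_elections(rows)[0]["votes"])  # 590
```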
Alias Tables
Every name variant observed for an entity is preserved in the aliases array. This serves two purposes:
- Searchability. A user searching for “SHANNON W BRAY” finds the entity whose canonical name is “Shannon W. Bray” because the ALL CAPS variant is in the alias table.
- Provenance. The alias table documents which sources used which name formats. If a future entity resolution decision is questioned, the alias table shows exactly what variants were merged.
Aliases are deduplicated but not normalized — Timothy Lance and TIMOTHY LANCE are both preserved because they demonstrate that the entity appears in both title-case and all-caps sources.
The Six Verification Algorithms
L4 runs six verification algorithms over the complete output. These are not optional post-processing — they are integral to the pipeline’s trust model. Every verification result is recorded in verification_report.json.
1. Hash Chain Integrity
Walk the hash chain from L4 → L3 → L2 → L1 → L0 for every record. Recompute each hash and compare to the stored value. Any mismatch identifies the exact layer where the chain breaks.
| Metric | Prototype result |
|---|---|
| Records verified | 200 / 200 |
| Broken chains | 0 |
| Layers traversed per record | 5 |
See Provenance and the Hash Chain for the verification algorithm.
2. Entity Consistency
Flag entities with characteristics that are unusual for local officeholders:
- Multi-state entities. A candidate_entity_id spanning NC and FL is suspicious: local officials serve in one state. Federal candidates can span states (a senator’s votes appear in statewide and precinct-level records), so federal offices are exempted.
- Party switches. An entity appearing as DEM in 2018 and REP in 2022 is not impossible (party switches happen) but is flagged for review.
- Implausible office combinations. An entity serving simultaneously as county sheriff and school board member is unlikely (though not impossible in small counties).
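The first two checks can be sketched as follows. This is an illustration under assumptions: the `party` and `office_level` fields on each election entry, the `"federal"` level value, and the flag names are hypothetical, and the federal exemption is applied whenever any election in the chain is federal. The office-combination check is omitted because it needs a table of incompatible offices that is not shown here.

```python
FEDERAL_LEVELS = {"federal"}  # assumed office_level value for federal contests


def consistency_flags(entity):
    """Return review flags for one entity; an empty list means clean."""
    flags = []
    elections = sorted(entity["elections"], key=lambda e: e["date"])

    # Multi-state check, with federal offices exempted.
    has_federal = any(e.get("office_level") in FEDERAL_LEVELS for e in elections)
    if len(set(entity["states"])) > 1 and not has_federal:
        flags.append("multi_state")

    # Party-switch check: more than one distinct party across the chain.
    parties = {e["party"] for e in elections if e.get("party")}
    if len(parties) > 1:
        flags.append("party_switch")

    return flags


local_entity = {
    "states": ["NC", "FL"],
    "elections": [
        {"date": "2018-11-06", "party": "DEM", "office_level": "county"},
        {"date": "2022-11-08", "party": "REP", "office_level": "county"},
    ],
}
federal_entity = {
    "states": ["NC", "FL"],
    "elections": [
        {"date": "2020-11-03", "party": "DEM", "office_level": "federal"},
    ],
}

local_flags = consistency_flags(local_entity)
federal_flags = consistency_flags(federal_entity)
```

Flags mark an entity for review only; as the text notes, a party switch or a two-office career can be legitimate.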
3. Temporal Plausibility
Check career spans and office progressions:
- Span check. An entity with elections in 2006 and 2024 has an 18-year span. Plausible for a long-serving commissioner, but flagged if the office is typically a stepping stone (e.g., school board).
- Gap detection. An entity appearing in 2014 and 2024 but not 2016, 2018, 2020, or 2022 may be two different people merged by entity resolution — or someone who left office and returned. Gaps > 2 cycles are flagged.
- Age plausibility. If external data (FEC filings, candidate bio pages) provides a birth year, check that the candidate was of legal age at first appearance.
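The span and gap checks reduce to simple arithmetic on election years. The sketch below assumes a two-year cycle and reads "gaps > 2 cycles" as more than two missed cycles between consecutive appearances. Both are assumptions, and `cycle_length` would need to be 4 for offices on four-year terms.

```python
def career_span(years):
    """Years between the first and last appearance."""
    return max(years) - min(years)


def missed_cycles(years, cycle_length=2):
    """List (previous_year, next_year, cycles_missed) for each gap."""
    ordered = sorted(set(years))
    gaps = []
    for previous, following in zip(ordered, ordered[1:]):
        missed = (following - previous) // cycle_length - 1
        if missed > 0:
            gaps.append((previous, following, missed))
    return gaps


def flag_gaps(years, max_missed=2, cycle_length=2):
    """Keep only gaps that exceed the allowed number of missed cycles."""
    return [gap for gap in missed_cycles(years, cycle_length) if gap[2] > max_missed]


span = career_span([2006, 2024])
flagged = flag_gaps([2014, 2024])  # missed 2016, 2018, 2020, 2022
continuous = flag_gaps([2014, 2016, 2018, 2020, 2022, 2024])
```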
4. Cross-Source Reconciliation
Where two sources cover the same contest, compare vote totals for each candidate entity:
| Agreement level | NC 2022 contests | Percentage |
|---|---|---|
| Exact match | 579 | 90.5% |
| Within 1% | 47 | 7.3% |
| Disagree > 1% | 14 | 2.2% |
Disagreements are reported with both sources’ totals, the percentage difference, and the probable cause (provisional ballot timing, write-in aggregation, precinct boundary assignment). See Cross-Source Reconciliation.
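The three agreement levels can be sketched as a small classifier. The text does not state how the percentage difference is defined, so this sketch assumes it is measured against the larger of the two totals, and the function name is hypothetical.

```python
def agreement_level(votes_a, votes_b):
    """Classify two sources' totals for the same candidate and contest."""
    if votes_a == votes_b:
        return "exact_match", 0.0
    # Assumed definition: absolute difference relative to the larger total.
    pct_diff = abs(votes_a - votes_b) / max(votes_a, votes_b) * 100
    level = "within_1pct" if pct_diff <= 1.0 else "disagree"
    return level, pct_diff


exact = agreement_level(1531, 1531)
close = agreement_level(1000, 995)   # 0.5% apart
apart = agreement_level(1000, 900)   # 10% apart
```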
5. Completeness Audit
Report coverage metrics across the full dataset:
| Metric | Target | Prototype result |
|---|---|---|
| State coverage (FIPS populated) | 100% | 100% |
| County coverage (FIPS populated) | 100% | 100% |
| Entity ID fill rate (candidate) | > 95% | 100% |
| Entity ID fill rate (contest) | > 95% | 100% |
| Office classification fill rate | > 90% | 67% (prototype scope) |
| Turnout data fill rate | varies | < 5% (most sources lack it) |
Low fill rates are not errors — they are documented gaps. The completeness audit ensures that gaps are visible, not hidden.
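A fill rate is the share of records in which a field is populated. The sketch below is illustrative: the sample records, the `"county"` office level, and the treatment of `None` and empty strings as unfilled are all assumptions.

```python
def fill_rate(records, field):
    """Fraction of records where `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if record.get(field) not in (None, ""))
    return filled / len(records)


# Illustrative records only, not real pipeline output.
sample = [
    {"state": "NC", "office_level": "school_district"},
    {"state": "NC", "office_level": "county"},
    {"state": "NC", "office_level": None},
]

targets = {"state": 1.0, "office_level": 0.90}
audit = {
    field: {
        "fill_rate": round(fill_rate(sample, field), 2),
        "meets_target": fill_rate(sample, field) >= target,
    }
    for field, target in targets.items()
}
```

A field that misses its target is reported, not rejected, which matches the audit's purpose of making gaps visible.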
6. LLM Entity Audit
For every entity with members from more than one source or more than one election, ask a language model whether the entity cluster is plausible. This is the only LLM call in L4.
The prompt provides the entity’s canonical name, all aliases, all elections, all offices, all states, and all vote totals. The model evaluates:
- Is this a plausible single person?
- Are the offices consistent with one career?
- Do the vote totals and geographic spread make sense?
- Are any aliases suspicious (non-person names, ballot measure choices, turnout metadata)?
Prototype results from auditing 50 entities:
| Category | Count | Details |
|---|---|---|
| Clean — no issues | 3 | Entity is unambiguous |
| Suspicious — flagged for review | 43 | Precinct-level records inflating temporal chains |
| Likely error — incorrect entity | 4 | “For” and “Against” classified as person entities |
The 43 suspicious entities were a direct consequence of the prototype bug where temporal chains were built per precinct rather than per election. After fixing the aggregation to election-level, the suspicious count dropped to single digits in subsequent runs.
The 4 errors were ballot measure choices (“For”, “Against”) that had leaked past L1 non-candidate detection and received candidate_entity_id values at L3. The LLM audit caught them:
“‘For’ is not a plausible person name. This entity appears across 347 contests in 12 states, always in contest names containing ‘amendment’, ‘bond’, ‘referendum’, or ‘proposition’. These are ballot measure choices, not candidates.”
This finding led to tighter non-candidate detection at L1. See Non-Candidate Records.
Output Format
L4 produces three types of output:
Entity Registries (JSON)
One file per entity type, containing one record per unique entity:
- candidate_registry.json: all person entities with canonical names, aliases, temporal chains
- contest_registry.json: all contest entities with canonical names, years active, states
Flat Exports (JSONL and CSV)
One record per candidate per contest per precinct, with canonical names and entity IDs attached:
{
"election_date": "2022-11-08",
"state": "NC",
"county": "COLUMBUS",
"contest_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
"candidate_raw": "Timothy Lance",
"candidate_canonical": "Timothy Lance",
"candidate_entity_id": "person:nc:columbus:lance-timothy-13",
"votes_total": 303,
"source": "nc_sbe",
"l3_hash": "28183d41d50204d5",
"l0_hash": "edfedf2760cfd54f"
}
The flat export retains precinct-level granularity with entity-level annotations. Users who need contest-level totals aggregate by (candidate_entity_id, contest_entity_id, election_date). Users who need precinct-level data use the records as-is.
The CSV export contains the same fields for users who prefer tabular tools (Excel, R, Stata). Column order matches the JSONL field order.
Verification Report (JSON)
A single verification_report.json summarizing all six verification algorithms:
{
"run_date": "2026-03-19T12:00:00Z",
"record_count": 200,
"entity_count": 206,
"hash_chain": {"verified": 200, "broken": 0},
"entity_consistency": {"clean": 195, "flagged": 11},
"temporal_plausibility": {"clean": 203, "flagged": 3},
"cross_source": {"exact_match": 579, "within_1pct": 47, "disagree": 14},
"completeness": {"fips_fill": 1.0, "entity_fill": 1.0, "office_fill": 0.67},
"llm_audit": {"clean": 3, "suspicious": 43, "error": 4, "entities_audited": 50}
}
This report is the pipeline’s self-assessment. A researcher evaluating the data reads the verification report first to understand what the pipeline is confident about and where it flagged concerns.
Cross-References
- Provenance and the Hash Chain — how hash verification works
- Cross-Source Reconciliation — the NC overlap validation
- Non-Candidate Records — the “For” and “Against” audit finding
- Career Tracking Recipe — querying temporal chains
- Verify a Specific Result — using hash chains for provenance
Why the Order Matters: Clean → Embed → Match → Canonicalize
The pipeline’s four processing stages must run in exactly this order. This is not a convention — it is a dependency chain where each stage requires the output of all prior stages. Rearranging them destroys signal.
We learned this the hard way.
The Insight
The original prototype ran normalization aggressively: strip middle initials, collapse suffixes, force uppercase, pick a canonical name, then try to match entities. The sequence was:
Old order: Canonicalize → Match
(normalize aggressively, then find duplicates)
This destroyed the information needed to tell different people apart.
David S. Marshall (Maine, state legislature) and David A. Marshall (Florida, county commission) are two different people. Under the old pipeline, both names were normalized to MARSHALL, DAVID — middle initials stripped as noise. After normalization, the two records were indistinguishable. The entity resolver matched them as the same person. One David Marshall absorbed the other’s career, vote history, and geographic record.
The embedding scores confirm why middle initials matter:
| Composite string | Cosine similarity |
|---|---|
| `David Marshall \| ME` vs `David Marshall \| FL` | 0.7025 |
| `David S Marshall \| ME` vs `David A Marshall \| FL` | 0.6448 |
The middle initial drops the score by 0.058 — enough to push the pair further from the accept threshold and toward correct rejection. But this signal only exists if the middle initial survives to L2. If L1 strips it during “normalization,” it is gone forever.
The Correct Order
L1 CLEAN Parse into components. Preserve everything:
first, middle, last, suffix, nickname, canonical_first.
No components are discarded. No names are collapsed.
↓
L2 EMBED Generate vectors from composite strings that include
middle initials, suffixes, and canonical_first.
The embedding encodes all preserved signal.
↓
L3 MATCH Compare embeddings. Run LLM confirmation on ambiguous
pairs. The LLM sees structured components — middle
initials, suffixes, nicknames — and reasons about them.
↓
L4 CANONICALIZE Now that entities are resolved, pick the authoritative
name. Prefer the most complete variant. Build alias
tables. Aggregate temporal chains.
Each stage depends on prior stages’ output:
- L2 depends on L1 — embeddings are generated from L1’s structured name components. If L1 strips middle initials, L2 cannot encode them.
- L3 depends on L2 — entity resolution uses L2 embeddings as the retrieval step. If L2 has degraded vectors (because L1 destroyed signal), L3 makes worse decisions.
- L4 depends on L3 — canonical name selection requires knowing who the person is. You cannot pick the “best” name for an entity before you know which records belong to that entity.
What Breaks If You Rearrange
Canonicalize before Match
This is the old pipeline. Normalize aggressively, then match. Failures:
- David S. Marshall and David A. Marshall merge into one entity.
- Robert Williams and Robert Williams Jr merge, because the suffix is stripped before matching can use it.
- Charlie Crist normalizes to CRIST, CHARLIE but CRIST, CHARLES JOSEPH normalizes to CRIST, CHARLES. The canonical forms don’t match, so the same person splits into two entities.
Aggressive normalization both merges people who should be separate and splits people who should be merged. It is wrong in both directions simultaneously.
Match before Embed
Without embeddings, matching falls back to string similarity alone. Jaro-Winkler on Charlie Crist vs CRIST, CHARLES JOSEPH gives 0.58 — a miss. The embedding model, despite scoring only 0.451, at least places the pair in the ambiguous zone where the LLM can confirm the match. Without embeddings, the pair is never surfaced.
Embed before Clean
If L1 does not decompose names into components, L2 embeds raw strings: CRIST, CHARLES JOSEPH as-is. The composite template cannot include canonical_first because it does not exist yet. The embedding for the MEDSL record uses CHARLES while the OpenElections record uses Charlie — the nickname dictionary was never applied. The cosine score drops, more pairs fall below the LLM zone, and matches are lost.
The General Principle
Preserve signal as long as possible. Collapse only after all decisions that need the signal have been made.
Middle initials are signal for disambiguation. Suffixes are signal for generational distinction. Nicknames are signal for matching. Raw strings are signal for provenance. None of these should be discarded until L4, where the entity is already resolved and the canonical name is a presentation choice, not an analytical input.
The pipeline is a funnel of information:
| Layer | Information available | Information consumed |
|---|---|---|
| L1 | All components: raw, first, middle, last, suffix, canonical_first | None — everything preserved |
| L2 | L1 components + embeddings + quality flags | Components consumed to build composite strings |
| L3 | L2 embeddings + L1 components + LLM context | Embeddings consumed for retrieval; components consumed for LLM reasoning |
| L4 | L3 entity assignments | Entity IDs consumed to select canonical names |
At each layer, information from prior layers is used but not destroyed. The L1 record persists unchanged alongside the L2, L3, and L4 records. A researcher who disagrees with a canonical name choice can trace back to the original components at L1 and the raw bytes at L0.
Why This Took a Session to Learn
The old order felt intuitive: clean the data first, then do the hard work. Every data engineering textbook says normalize early. But election entity resolution is not a standard ETL problem. The “dirt” in the data — middle initials, suffixes, nicknames, variant spellings — is not dirt. It is signal. Stripping it is not cleaning. It is destruction.
The key insight: the order of operations is load-bearing. Clean → Embed → Match → Canonicalize is the only sequence that preserves signal through the stages that need it and collapses only after all analytical decisions are final.
Provenance and the Hash Chain
Every record at every layer carries a cryptographic hash of its own content and a pointer to its parent layer’s hash. This chain links any L4 canonical export record back through L3 matching, L2 embedding, and L1 cleaning to the exact bytes of the original source file at L0. If any record at any layer is modified — a vote count changed, a name altered, a match decision overridden — the chain breaks at precisely that point.
The Hash Structure
Each layer computes its hash as:
l{N}_hash = SHA-256( record_content + "parent:" + l{N-1}_hash )
The record_content is the deterministic serialization of all fields at that layer (excluding the hash itself). The parent: prefix is a literal string separator. The parent hash anchors the current record to its predecessor.
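The formula can be sketched in a few lines. The exact serialization is not specified in this chapter, so the sketch assumes compact JSON with sorted keys as the deterministic form. For that reason it will not reproduce the hash values printed in the examples, and the function name is hypothetical.

```python
import hashlib
import json


def layer_hash(record_content: dict, parent_hash: str) -> str:
    """SHA-256 over serialized content + "parent:" + parent hash."""
    # Assumed deterministic serialization: sorted keys, no extra whitespace.
    serialized = json.dumps(record_content, sort_keys=True,
                            separators=(",", ":"), ensure_ascii=False)
    payload = serialized + "parent:" + parent_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


l0_hash = "edfedf2760cfd54f"  # truncated display value used in this chapter

original = layer_hash({"candidate": "Timothy Lance", "votes_total": 303}, l0_hash)
reordered = layer_hash({"votes_total": 303, "candidate": "Timothy Lance"}, l0_hash)
altered = layer_hash({"candidate": "Timothy Lance", "votes_total": 304}, l0_hash)
```

Sorting keys makes the hash independent of field order, while any change to a value, such as 303 to 304, yields a different digest.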
L4 canonical record
l4_hash ← SHA-256(L4 content + "parent:" + l3_hash)
│
└── L3 matched record
l3_hash ← SHA-256(L3 content + "parent:" + l2_hash)
│
└── L2 embedded record
l2_hash ← SHA-256(L2 content + "parent:" + l1_hash)
│
└── L1 cleaned record
l1_hash ← SHA-256(L1 content + "parent:" + l0_hash)
│
└── L0 raw file
l0_hash ← SHA-256(raw file bytes)
A Real Example: Timothy Lance Through All Five Layers
Timothy Lance ran for Columbus County Schools Board of Education District 02 in the 2022 NC general election. Here is one of his precinct-level records traced through every layer.
L0: Raw
The NC SBE results file results_pct_20221108.txt is stored byte-identical at l0_raw/nc_sbe/results_pct_20221108.txt.
{
"l0_hash": "edfedf2760cfd54f",
"source_url": "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip",
"retrieval_date": "2026-03-18T14:30:00Z",
"file_size_bytes": 18023456,
"format_detected": "tsv"
}
The l0_hash is the SHA-256 of the raw file bytes (truncated here for display). Re-downloading the file and re-hashing produces the same value. If NC SBE updates the file after our retrieval, the hash changes and a new L0 entry is created.
L1: Cleaned
The NC SBE parser extracts Timothy Lance’s precinct P17 row and produces a structured record:
{
"jurisdiction": {
"state": "NC", "county": "COLUMBUS", "precinct": "P17"
},
"contest": {
"raw_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
"office_level": "school_district"
},
"results": [{
"candidate_name": {
"raw": "Timothy Lance", "first": "Timothy",
"middle": null, "last": "Lance",
"suffix": null, "canonical_first": "Timothy"
},
"votes_total": 303
}],
"provenance": {
"l1_hash": "8ea7ecc257ff8e05",
"l0_parent_hash": "edfedf2760cfd54f",
"parser_version": "nc_sbe_v2.1",
"schema_version": "3.0.0"
}
}
The l1_hash is computed from the L1 record content plus "parent:edfedf2760cfd54f". The l0_parent_hash links back to the raw file.
L2: Embedded
L2 generates a composite string and embedding for the candidate:
{
"l2": {
"l2_hash": "854fa6367960bb05",
"l1_parent_hash": "8ea7ecc257ff8e05",
"embedding_model": "text-embedding-3-large",
"embedding_dimensions": 3072,
"candidate_embedding_id": 4271,
"candidate_composite": "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus",
"quality_flags": []
}
}
The l2_hash is computed from the L2 fields plus "parent:8ea7ecc257ff8e05". The l1_parent_hash links back to L1.
L3: Matched
Entity resolution assigns a candidate_entity_id. Timothy Lance appeared identically across all precincts, so step 1 (exact match) resolved him:
{
"l3": {
"l3_hash": "28183d41d50204d5",
"l2_parent_hash": "854fa6367960bb05",
"candidate_entity_ids": [
{"result_index": 0, "entity_id": "person:nc:columbus:lance-timothy-13"}
],
"contest_entity_id": "contest:nc:columbus:school-board-d02"
}
}
The l3_hash is computed from the L3 fields plus "parent:854fa6367960bb05".
L4: Canonical
L4 produces the researcher-facing export record:
{
"election_date": "2022-11-08",
"state": "NC",
"county": "COLUMBUS",
"contest_name": "COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02",
"candidate_canonical": "Timothy Lance",
"candidate_entity_id": "person:nc:columbus:lance-timothy-13",
"votes_total": 303,
"source": "nc_sbe",
"l4_hash": "f19a3e8bc7210d42",
"l3_hash": "28183d41d50204d5",
"l0_hash": "edfedf2760cfd54f"
}
The l4_hash is computed from the L4 fields plus "parent:28183d41d50204d5". The record also carries l0_hash as a shortcut for end-to-end verification.
Verification Algorithm
To verify a single L4 record:
- Read the L4 record. Recompute SHA-256(L4 content + "parent:" + l3_hash). Compare to stored l4_hash. If mismatch → chain broken at L4.
- Look up the L3 record by l3_hash. Recompute SHA-256(L3 content + "parent:" + l2_hash). Compare to stored l3_hash. If mismatch → chain broken at L3.
- Look up the L2 record by l2_hash. Recompute SHA-256(L2 content + "parent:" + l1_hash). Compare to stored l2_hash. If mismatch → chain broken at L2.
- Look up the L1 record by l1_hash. Recompute SHA-256(L1 content + "parent:" + l0_hash). Compare to stored l1_hash. If mismatch → chain broken at L1.
- Read the L0 raw file. Recompute SHA-256(file bytes). Compare to stored l0_hash. If mismatch → chain broken at L0 (source file was modified or corrupted).
If all five checks pass, the record is verified from canonical output back to original source bytes.
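The five checks can be sketched against a small in-memory chain. This is an illustration under assumptions, not the pipeline's verifier: records are held in an ordered list instead of being looked up by hash, the serialization is compact sorted-key JSON, the raw bytes are a placeholder, and all function names are hypothetical.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def layer_hash(content: dict, parent_hash: str) -> str:
    # Assumed serialization: compact JSON with sorted keys.
    body = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return sha256_hex((body + "parent:" + parent_hash).encode("utf-8"))


def build_chain(raw_bytes: bytes, layer_contents: list) -> list:
    """Build records L0..L4, each storing its content, parent hash, and hash."""
    chain = [{"layer": "L0", "content": raw_bytes, "parent": None,
              "hash": sha256_hex(raw_bytes)}]
    for number, content in enumerate(layer_contents, start=1):
        parent = chain[-1]["hash"]
        chain.append({"layer": f"L{number}", "content": content,
                      "parent": parent, "hash": layer_hash(content, parent)})
    return chain


def verify_chain(chain: list):
    """Walk L4 down to L0; return the first layer that fails, or None."""
    for record in reversed(chain):
        if record["parent"] is None:
            recomputed = sha256_hex(record["content"])  # L0: raw file bytes
        else:
            recomputed = layer_hash(record["content"], record["parent"])
        if recomputed != record["hash"]:
            return record["layer"]
    return None


chain = build_chain(
    b"placeholder bytes standing in for the raw source file",
    [
        {"candidate": "Timothy Lance", "precinct": "P17", "votes_total": 303},
        {"candidate_composite":
             "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus"},
        {"entity_id": "person:nc:columbus:lance-timothy-13"},
        {"candidate_canonical": "Timothy Lance", "votes_total": 303},
    ],
)
intact_result = verify_chain(chain)

# Tamper with the L1 vote count without updating any stored hash.
chain[1]["content"]["votes_total"] = 304
tampered_result = verify_chain(chain)
```

The walk passes L4, L3, and L2, then stops at L1, the layer whose content no longer matches its stored hash.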
Prototype Results
In our 200-record prototype run:
| Metric | Result |
|---|---|
| Records verified | 200 / 200 |
| Broken chains | 0 |
| Layers traversed per record | 5 (L4 → L3 → L2 → L1 → L0) |
| Total hash verifications | 1,000 (200 records × 5 layers) |
Every hash chain verified end-to-end with zero broken links.
What Breaks the Chain
The hash chain detects any modification at any layer. Specific scenarios:
Modifying a vote count at L1. If someone changes Timothy Lance’s votes from 303 to 304, the L1 content changes, the recomputed l1_hash no longer matches the stored value, and the L2 record’s l1_parent_hash no longer points to a valid L1 record.
Changing a parser without a version bump. If the NC SBE parser is updated but parser_version is not incremented, the L1 content for existing records may change (different parsing logic applied to the same raw bytes). The l1_hash changes, breaking the chain from L2 upward. The parser_version field exists precisely to prevent silent parser changes.
Overriding an L3 match decision. If a human reviewer changes an entity assignment at L3, the l3_hash changes. L4 must be re-run from the amended L3 output. The original L3 decision is preserved in the decision log — it is never deleted, only superseded.
Re-downloading a source file after the publisher updated it. NC SBE occasionally corrects results files after initial publication. If the corrected file has different bytes, the l0_hash changes. The entire pipeline from L1 upward must be re-run for affected records. The original L0 entry and its manifest are retained as a versioned snapshot.
Why Not a Merkle Tree
A Merkle tree would allow verifying subsets of records without recomputing the full chain. We use a simpler linear chain because:
- Records are independent. Each precinct-level record has its own chain. Verifying one record does not require knowledge of any other record. A Merkle tree adds complexity without benefit when records are not aggregated into blocks.
- Full verification is cheap. SHA-256 of a 2 KB record takes microseconds. Verifying all 200 records takes less than a second. At 200 million records, full verification takes minutes, well within acceptable bounds for a batch pipeline.
- Simplicity aids trust. A journalist verifying a specific result needs to understand “follow the hash backward through five files.” A Merkle tree requires understanding tree structure, sibling hashes, and root computation. The simpler model is more auditable by non-engineers.
The Chain as Documentation
The hash chain is not just an integrity mechanism — it is a documentation trail. Every L4 record answers the question: “Where did this number come from?” Follow l3_hash to see which entity resolution decision assigned this candidate ID. Follow l2_parent_hash to see the embedding and composite string. Follow l1_parent_hash to see the parsed record. Follow l0_parent_hash to see the raw source file.
This is provenance in the literal sense: the origin and chain of custody of every data point, cryptographically verifiable.
The Project Does Not Store Data
This project processes election data. It does not redistribute it.
Why Not
Legal
Each source publishes data under its own terms. MEDSL uses CC-BY. NC SBE publishes as public record under North Carolina law. OpenElections uses a mix of licenses depending on the state contributor. FEC data is public domain. Census reference files are public domain.
Bundling data from all sources into a single download would require compliance with every license simultaneously — attribution chains, share-alike provisions, and restrictions that vary by state contributor. The legal surface area grows with every source added. We avoid it entirely by not storing the data.
Practical
The current corpus is 8+ GB across three election cycles and seven sources. Adding MEDSL 2018 and 2020, full OpenElections coverage, and VEST shapefiles pushes this past 20 GB. Hosting, versioning, and serving that volume adds infrastructure cost and maintenance burden that contribute nothing to the pipeline’s accuracy or reproducibility.
Freshness
Sources update. NC SBE reissues precinct files when canvass corrections are made. MEDSL publishes errata and revised datasets. OpenElections contributors fix parsing errors and add new states. A copy of the data taken on March 18 may be stale by April 1.
If we store data, every downstream user inherits our staleness. If users download from the authoritative source, they get the latest version — and our pipeline processes it identically.
What We Provide Instead
The project provides everything needed to acquire the data yourself:
| What | Where | Example |
|---|---|---|
| Exact source URLs | Each source chapter in Part II | https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip |
| Download commands | Download the Data | curl -O <url> with expected file sizes |
| Schema documentation | Each source chapter | Column names, types, delimiters, encoding |
| Known quirks | Each source chapter | NC SBE uses \t separators but .txt extension; MEDSL 2022 has trailing commas in some state files |
| File size expectations | Download the Data | MEDSL 2022 NC: ~45 MB compressed |
| SHA-256 of our L0 copies | L0 manifests | Verify your download matches ours |
The L0 manifest for each file records the SHA-256 hash of the bytes we processed. After downloading the same file, you can hash your copy and compare. If the hashes match, your pipeline run will produce identical L1 output — byte for byte, hash for hash.
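Checking a download against a manifest hash can be sketched with the standard library. The helper names are hypothetical, and the prefix comparison is an assumption made because the hashes printed in this documentation are truncated; a full 64-character manifest hash makes it an exact comparison. The demo file contains placeholder bytes, not election data.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path, manifest_hash):
    """Compare a local file's hash to a (possibly truncated) manifest hash."""
    if not manifest_hash:
        return False
    return sha256_of_file(path).startswith(manifest_hash.lower())


# Demo with placeholder bytes written to a temporary file.
demo_bytes = b"placeholder bytes, not real election data"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(demo_bytes)
    demo_path = tmp.name

expected = hashlib.sha256(demo_bytes).hexdigest()
demo_matches = matches_manifest(demo_path, expected[:16])
demo_mismatch = matches_manifest(demo_path, "0" * 16)
os.remove(demo_path)
```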
The Boundary
The project does bundle small reference datasets that are not election results:
- FIPS code reference files (~200 KB) from the Census Bureau, public domain. These change rarely, only when counties or county equivalents are created, renamed, or reorganized.
- The nickname dictionary (~5 KB), original to this project.
- The office classification keyword and regex tables (~10 KB), original to this project.
- The 200-name office embedding reference set (~50 KB), original to this project.
These are small and stable. The FIPS files are public-domain Census data; the rest are authored by the project. None of them are third-party election results.
Election results — the 42 million rows of precinct-level vote counts — are never stored, cached, or redistributed. The user downloads them. The pipeline processes them. The outputs live on the user’s machine.
Embedding Model: text-embedding-3-large
The pipeline uses OpenAI’s text-embedding-3-large for all vector generation at L2. This is a deliberate choice with specific trade-offs. The model is not the best possible embedding model — it is the best available model for this task given current constraints.
Why text-embedding-3-large
Three properties matter for election entity resolution: dimensionality, consistency, and performance on short structured text.
3,072 dimensions. Higher dimensionality preserves more fine-grained distinctions in short strings. “David S Marshall” and “David A Marshall” differ by a single character — a middle initial. In a 384-dimensional space, that distinction may be compressed away. In 3,072 dimensions, the model has room to encode it. We measured: the middle initial drops cosine similarity from 0.7025 to 0.6448 — a 0.058 gap that matters for disambiguation.
API-based consistency. Every call to the same model version with the same input produces the same vector. There is no local model initialization, no GPU-dependent floating-point variance, no seed to manage. Two users on different machines embedding the same candidate string get the same 3,072 floats. This is critical for reproducibility: L2 output is deterministic given the same model version.
Strong on short structured text. Candidate composite strings are 50–150 characters: "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus". These are not natural language paragraphs — they are structured identifiers with pipe-delimited fields. text-embedding-3-large handles this format well in our testing. Nickname pairs (Charlie Crist at 0.451), suffix pairs (Williams Jr at 0.862), and middle-initial pairs (David Marshall at 0.6448) all produce scores in ranges that the cascade can act on.
Why Not MiniLM
all-MiniLM-L6-v2 from Sentence Transformers is the default recommendation for lightweight embedding tasks. It runs locally, requires no API key, and produces vectors in milliseconds on CPU. We evaluated it and rejected it for three reasons.
384 dimensions. That is 8× fewer dimensions than text-embedding-3-large. On structured identifiers where single-character differences carry categorical meaning (middle initials, suffixes), the lower dimensionality compresses distinctions. In informal testing, MiniLM scored Williams Jr at 0.91 against Williams, higher than text-embedding-3-large’s 0.862 and much closer to the 0.95 auto-accept threshold. The suffix signal, already weak, is weaker still.
2021 training data. MiniLM was trained on data through 2021. It has no exposure to post-2021 candidate names, office titles, or geographic patterns. text-embedding-3-large was trained on more recent data, though the exact cutoff is not published. For a task that involves matching strings like “DESANTIS, RON” and “Ron DeSantis” — where the model’s familiarity with the name helps — recency matters.
Weaker on structured identifiers. MiniLM is optimized for sentence similarity — determining whether two natural language sentences express the same meaning. Our inputs are not sentences. They are pipe-delimited fields with proper nouns, abbreviations, and codes. text-embedding-3-large is a general-purpose model that handles structured text more robustly than a sentence-similarity specialist.
MiniLM’s advantages (local execution, zero API cost, millisecond-scale CPU inference) are real but irrelevant to our constraints. Budget is not a constraint. Latency at L2 is not a bottleneck (embeddings are computed once and cached). The accuracy difference on structured identifiers is the deciding factor.
Why Not a Fine-Tuned Model
A model fine-tuned on election name pairs should outperform a general-purpose model on this task. We expect this because the failure modes of text-embedding-3-large are systematic: it scores nickname pairs too low (Charlie/Charles at 0.451) and suffix pairs too high (Williams/Williams Jr at 0.862). A fine-tuned model trained on labeled pairs (“these are the same person” / “these are different people”) would learn that “Jr” is a strong negative signal and that “Charlie”/“Charles” is not.
We do not have training data yet.
Fine-tuning requires labeled pairs: hundreds to thousands of (name_a, name_b, same_person) triples with ground truth. Our prototype has 12 manually verified pairs. The L3 decision log will eventually contain thousands of LLM-confirmed match/no-match decisions — each one a potential training example. This is an active learning loop:
- L3 uses the general-purpose model to retrieve candidates.
- The LLM confirms or rejects matches, producing labeled pairs.
- The labeled pairs train a fine-tuned embedding model.
- The fine-tuned model replaces text-embedding-3-large at L2, improving retrieval.
- Better retrieval surfaces harder cases for the LLM, producing more informative training data.
This loop is planned but not yet implemented. It requires the pipeline to run at scale first, generating enough decisions for a meaningful training set. In the meantime, text-embedding-3-large with the 5-step cascade produces correct results on every tested pair — the LLM compensates for the embedding model’s weaknesses.
Thresholds Are Model-Specific
The calibrated thresholds (auto-accept ≥ 0.95, ambiguous 0.35–0.95, auto-reject < 0.35) are specific to text-embedding-3-large with 3,072 dimensions. A different model produces different similarity distributions. MiniLM’s Williams Jr score of 0.91 vs. text-embedding-3-large’s 0.862 illustrates the problem: the same pair sits at a different distance from each boundary depending on the model, so thresholds calibrated on one model’s scores cannot be reused for another.
If the model changes, recalibration is required:
- Re-embed all test cases with the new model.
- Plot the score distribution for known matches and known non-matches.
- Find the auto-accept, ambiguous, and auto-reject boundaries that minimize false positives and false negatives.
- Update the threshold configuration and document the new model in L2 metadata.
The embedding_model field stored in every L2 record ensures that thresholds can always be traced to the model that produced the scores. If a record was embedded with text-embedding-3-large and the thresholds were calibrated for a hypothetical election-embed-v1, the mismatch is detectable.
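A minimal sketch of threshold routing with a model check. The boundary values come from this chapter; the function name, the zone labels, and the choice to raise an error on a model mismatch are assumptions.

```python
CALIBRATED_MODEL = "text-embedding-3-large"
AUTO_ACCEPT = 0.95   # scores at or above this are accepted without review
AUTO_REJECT = 0.35   # scores below this are rejected without review


def threshold_zone(cosine: float, embedding_model: str) -> str:
    """Map a cosine score to a zone, refusing scores from another model."""
    if embedding_model != CALIBRATED_MODEL:
        raise ValueError(
            f"thresholds calibrated for {CALIBRATED_MODEL}, "
            f"but the record was embedded with {embedding_model}"
        )
    if cosine >= AUTO_ACCEPT:
        return "auto_accept"
    if cosine < AUTO_REJECT:
        return "auto_reject"
    return "ambiguous"  # routed to LLM confirmation


crist_zone = threshold_zone(0.451, "text-embedding-3-large")
williams_zone = threshold_zone(0.862, "text-embedding-3-large")
```

Both measured pairs fall in the ambiguous zone, which is where the LLM confirmation step takes over.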
Summary
| Property | text-embedding-3-large | MiniLM | Fine-tuned (future) |
|---|---|---|---|
| Dimensions | 3,072 | 384 | TBD |
| API required | Yes | No | Depends |
| Cost per 1M tokens | ~$0.13 | $0 | $0 (local) |
| Williams Jr score | 0.862 | ~0.91 | Lower (trained) |
| Crist score | 0.451 | ~0.38 | Higher (trained) |
| Training data needed | No | No | Yes (not yet available) |
| Reproducible across machines | Yes | Requires version pinning | Requires version pinning |
The current choice is text-embedding-3-large — good enough for the cascade to work, available today, and reproducible without local model management. The long-term path is a fine-tuned model trained on the L3 decision log. The thresholds, the cascade design, and the LLM confirmation step all exist to compensate for the general-purpose model’s known weaknesses until that fine-tuned model is ready.
Composite String Templates
Embeddings are not generated from raw candidate names. They are generated from composite strings that combine name components with contextual fields — office, state, county, party. This context helps the embedding model distinguish people who share a name but hold different offices in different states. It also introduces a failure mode: context bleed, where shared context artificially inflates similarity between unrelated candidates.
The Three Templates
Each L2 record generates up to three composite strings, one per embedding type:
| Type | Template | Purpose |
|---|---|---|
| Candidate | `{canonical_first} {middle} {last} {suffix} \| {party} \| {office} \| {state} \| {county}` | Entity resolution across sources and elections |
| Contest | `{raw_name} \| {office_level} \| {state} {year}` | Contest entity resolution across naming variants |
| Geography | {municipality}, {county} County, {state} | Geographic entity resolution for precinct/place matching |
The pipe character (|) is a deliberate separator. It signals to the tokenizer that the fields on either side are distinct semantic units, not a continuous phrase. Without separators, “Timothy Lance DEM” could be tokenized as a three-word name rather than a name followed by a party.
Real Composite Examples
| Candidate | Composite String |
|---|---|
| Timothy Lance (NC, Columbus County school board) | `Timothy Lance \| \| BOARD OF EDUCATION DISTRICT 02 \| NC \| Columbus` |
| Charlie Crist (FL, Governor, DEM) | `Charles Crist \| DEM \| Governor \| FL \| statewide` |
| CRIST, CHARLES JOSEPH (FL, Governor, DEM) | `Charles Joseph Crist \| DEM \| Governor \| FL \| statewide` |
| David S Marshall (ME, State Legislature) | `David S Marshall \| \| State Legislature \| ME \| statewide` |
| David A Marshall (FL, County Commission) | `David A Marshall \| \| County Commission \| FL \| Broward` |
Note that canonical_first is used, not first. Charlie Crist’s composite uses Charles (from the nickname dictionary), not Charlie. This means the MEDSL record (CRIST, CHARLES JOSEPH → canonical_first Charles) and the OpenElections record (Charlie Crist → canonical_first Charles) produce composites with matching first-name tokens. The remaining divergence — Joseph as a middle name — is small enough that the embedding score rises significantly compared to the raw-name embedding.
Empty components produce empty slots. Timothy Lance has no middle name, no suffix, and no party in the NC SBE data. The composite retains the pipe separators with empty fields: Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus. This keeps the template structure consistent across all records, which stabilizes tokenization.
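The candidate template and its empty-slot behavior can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `CandidateFields` struct and `candidate_composite` function are hypothetical names, and the exact whitespace around empty slots is an assumption chosen to reproduce the examples in the table above.

```rust
/// Hypothetical input struct; field names follow the template placeholders,
/// not necessarily the real codebase. Empty strings stand for missing values.
pub struct CandidateFields<'a> {
    pub canonical_first: &'a str,
    pub middle: &'a str,
    pub last: &'a str,
    pub suffix: &'a str,
    pub party: &'a str,
    pub office: &'a str,
    pub state: &'a str,
    pub county: &'a str,
}

/// Builds the candidate composite string.
/// Name parts are joined with single spaces (empty parts skipped);
/// context fields always emit their pipe, even when the field is empty.
pub fn candidate_composite(c: &CandidateFields<'_>) -> String {
    let mut out = [c.canonical_first, c.middle, c.last, c.suffix]
        .iter()
        .filter(|part| !part.is_empty())
        .copied()
        .collect::<Vec<&str>>()
        .join(" ");

    for field in [c.party, c.office, c.state, c.county] {
        // The separator is always written so the slot count stays fixed.
        out.push_str(" |");
        if !field.is_empty() {
            out.push(' ');
            out.push_str(field);
        }
    }
    out
}
```

Called with Timothy Lance's fields (no middle name, suffix, or party), this produces the same string as the first row of the examples table, with the empty party slot preserved.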
Why Context Helps: The David Marshall Test
David S. Marshall ran for state legislature in Maine. David A. Marshall ran for county commission in Florida. They are different people. Without context, the embedding model sees two very similar strings.
We measured the effect of context on cosine similarity:
| Composite A | Composite B | Cosine |
|---|---|---|
| `David Marshall` | `David Marshall` | 1.000 |
| `David Marshall \| ME` | `David Marshall \| FL` | 0.7025 |
| `David S Marshall \| ME` | `David A Marshall \| FL` | 0.6448 |
| `David S Marshall \| \| State Legislature \| ME` | `David A Marshall \| \| County Commission \| FL` | 0.581 |
Each additional contextual field pushes the vectors further apart:
- State alone drops similarity from 1.0 to 0.7025. The model encodes ME and FL as distinct tokens that pull the vectors in different directions.
- Middle initial drops it further to 0.6448, a 0.058 reduction. The single character S vs A produces measurably different vectors because it changes the token sequence before the separator.
- Office context drops it to 0.581. “State Legislature” and “County Commission” are semantically distinct, adding another axis of divergence.
At 0.581, this pair falls well within the ambiguous zone (0.35–0.95) and is routed to the LLM, which correctly rejects the match based on different states, different offices, and different middle initials. Without context, the pair scores 1.0 — an automatic merge of two different people.
The middle-initial contribution (0.058) may seem small, but it matters at the margins. For pairs where state and office are the same — a father and son both serving on the same county commission — the middle initial may be the only signal distinguishing them.
Why Context Hurts: The Context Bleed Problem
Context is not free. Shared context tokens contribute to vector similarity even when the names themselves are unrelated. This is context bleed.
Consider two candidates in the same NC school district block:
| Candidate | Composite |
|---|---|
| Aaron Bridges | `Aaron Bridges \| \| SCHOOL BOARD \| NC \| Columbus` |
| Daniel Blanton | `Daniel Blanton \| \| SCHOOL BOARD \| NC \| Columbus` |
These are completely different people. But their composites share five context tokens: SCHOOL BOARD, NC, Columbus, and the pipe separators. The embedding model encodes these shared tokens into both vectors, producing a cosine similarity of approximately 0.55–0.65 — well above what the names alone would produce (~0.20) and squarely in the ambiguous zone.
In our prototype, all 30 wasted LLM calls were on pairs exactly like this: different people with different names whose shared context inflated their embedding scores into the ambiguous zone. The step 2.5 gate (JW on last names < 0.50 → skip) was added specifically to short-circuit these context-bleed false alarms before they reach the LLM.
Measuring the bleed
We tested context contribution by varying which fields are included:
| Composite variant | Strings compared (Aaron Bridges vs Daniel Blanton) | Cosine |
|---|---|---|
| Name only | `Aaron Bridges` vs `Daniel Blanton` | ~0.21 |
| Name + state | `Aaron Bridges \| NC` vs `Daniel Blanton \| NC` | ~0.38 |
| Name + state + office + county | Full composite | ~0.60 |
Shared context raises the cosine score at every step in this measurement: adding state lifts it by about 0.17 (0.21 to 0.38), and adding office and county together lifts it by about 0.22 more (0.38 to 0.60). For same-name pairs (the cases entity resolution cares about), this boost is helpful: it confirms that two similar names in the same context are likely the same person. For different-name pairs, the same boost is harmful: it inflates scores past the reject threshold.
The step 2.5 gate resolves this asymmetry. If the names themselves are dissimilar (JW < 0.50 on last names), the context-inflated embedding score is irrelevant — the pair is skipped. If the names are similar (JW ≥ 0.50), the context inflation is welcome — it adds corroborating evidence that the similar names in the same context are the same person.
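The interaction between the name gate and the embedding thresholds can be sketched as a small routing function. This is an illustrative sketch under stated assumptions: the `Route` enum and `route_pair` function are hypothetical names, the Jaro-Winkler score on last names is taken as an already-computed input, and the thresholds (0.50 gate, 0.35 and 0.95 zone boundaries) are the ones quoted in this chapter.

```rust
/// Hypothetical routing outcomes for a candidate pair that was not
/// already resolved by exact or Jaro-Winkler matching.
#[derive(Debug, PartialEq)]
pub enum Route {
    /// Step 2.5: last names too dissimilar; the embedding score is ignored.
    SkipNameGate,
    /// Step 3: cosine at or above 0.95 is accepted without the LLM.
    AutoAccept,
    /// Step 4: cosine in the ambiguous zone goes to the LLM.
    SendToLlm,
    /// Cosine below the ambiguous zone is rejected.
    Reject,
}

/// `last_name_jw` is the Jaro-Winkler similarity of the two last names,
/// `cosine` is the similarity of the two composite embeddings.
pub fn route_pair(last_name_jw: f64, cosine: f64) -> Route {
    // The gate runs first so context-inflated scores never reach the LLM.
    if last_name_jw < 0.50 {
        return Route::SkipNameGate;
    }
    if cosine >= 0.95 {
        Route::AutoAccept
    } else if cosine >= 0.35 {
        Route::SendToLlm
    } else {
        Route::Reject
    }
}
```

With a low last-name score, a pair at cosine 0.60 is skipped regardless of shared context, while a same-last-name pair such as the Crist records (cosine 0.451) is routed to the LLM.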
Design Tradeoffs
Why not embed names without context?
Bare-name embeddings eliminate context bleed but lose the disambiguation power demonstrated by the David Marshall test. A bare “David Marshall” vs “David Marshall” scores 1.0 — the model cannot distinguish them at all. Context is the only mechanism the embedding model has to separate same-name, different-person pairs.
Why not use separate embeddings for name and context?
An alternative architecture: embed the name and context separately, then combine scores with weighted averaging. This eliminates context bleed (the name embedding is pure name similarity) while retaining context as a separate signal.
This approach is viable but adds complexity — two embeddings per record instead of one, a tunable weight parameter, and a more complex similarity function. The current single-composite design is simpler and works well with the step 2.5 gate mitigating the primary failure mode. If context bleed proves problematic at scale, split embeddings are a planned fallback.
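For illustration only, the split-embedding fallback could combine two cosine scores roughly like this. The function names and the `name_weight` parameter are hypothetical, and no weight has been chosen or tuned in the current design.

```rust
/// Cosine similarity of two vectors.
/// Returns None for empty, mismatched-length, or zero-norm inputs.
pub fn cosine(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|y| y * y).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Weighted average of a name-only score and a context-only score.
/// `name_weight` is a hypothetical tunable in [0, 1].
pub fn split_score(name_cosine: f64, context_cosine: f64, name_weight: f64) -> f64 {
    name_weight * name_cosine + (1.0 - name_weight) * context_cosine
}
```

With a name weight of 0.8, a dissimilar-name pair (name cosine 0.2) in identical context (context cosine 0.9) scores 0.34, so shared context can no longer dominate the result.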
Why not fine-tune?
A fine-tuned embedding model trained on election name pairs could learn that Charlie and Charles are similar, that Jr is categorically significant, and that shared context should not inflate scores for dissimilar names. We do not have training data yet.
However, L3 decisions are labeled examples: every LLM match/no-match decision with its confidence and reasoning is a training pair. As the pipeline processes more data, the L3 decision log becomes a natural training set for active learning. A fine-tuned model trained on thousands of L3 decisions would, in principle, learn the domain-specific similarity function that the general-purpose text-embedding-3-large approximates. This is a future direction, not a current capability.
Summary
| Property | Effect | Mitigation |
|---|---|---|
| Context included | Distinguishes same-name, different-person pairs (David Marshall: 1.0 → 0.581) | — (this is the goal) |
| Context bleed | Inflates scores for different-name, same-context pairs (Bridges vs Blanton: 0.21 → 0.60) | Step 2.5 JW gate on last names |
| Middle initial included | Provides disambiguation signal (0.7025 → 0.6448) | — (this is the goal) |
| Nickname dictionary applied | Aligns canonical first names before embedding (Charlie → Charles) | — (this is the goal) |
The composite template is a tradeoff between disambiguation power and noise tolerance. Context helps more than it hurts — but only because the step 2.5 gate exists to catch the cases where it hurts.
When the LLM Gets Called (And When It Doesn’t)
The LLM is a confirmation tool, not a discovery tool. It is called when cheaper methods have narrowed the problem to a specific, bounded question. It is never called when a deterministic method produces correct results.
This boundary is enforced by pipeline structure, not by discipline. L0 and L1 have no LLM code paths. L2 has none. The LLM is reachable only from L3 (entity resolution and tier 4 office classification) and L4 (entity auditing). A developer cannot accidentally add an LLM call to the parser — the parser runs at L1, which has no API client.
When the LLM Is Called
Three situations invoke the LLM. Each is a bounded question with structured input and a constrained output format.
1. Ambiguous Entity Matches (L3, Step 4)
Trigger: Embedding cosine similarity between 0.35 and 0.95 AND the name similarity gate passed (JW on last names ≥ 0.50) AND both candidates are in the same state.
Input: Structured name components for both candidates, embedding score, JW score, vote counts, office, state, party.
Output: match/no-match, confidence (0.0–1.0), free-text reasoning.
Model: Claude Sonnet.
Volume: 3.5% of candidate pairs in our prototype (30 calls out of ~850 comparisons). With the step 2.5 gate in place, this drops to near-zero for within-source matching and rises for cross-source matching where name formats diverge.
Real examples:
| Pair | Cosine | LLM Decision | Why LLM was needed |
|---|---|---|---|
| Charlie Crist / CRIST, CHARLES JOSEPH | 0.451 | match (0.95) | Nickname below any safe auto-accept threshold |
| Robert Williams / Robert Williams Jr | 0.862 | no match (0.85) | Suffix above old auto-accept; only LLM catches generational distinction |
| Nicole Fried / FRIED, NIKKI | 0.642 | match (0.92) | Nickname in ambiguous zone |
2. Tier 4 Office Classification (L2→L3 boundary)
Trigger: Office name was not classified by keyword (tier 1), regex (tier 2), or embedding nearest-neighbor with cosine ≥ 0.60 (tier 3).
Input: Office name string, state, county, the full taxonomy of (office_level, office_branch) pairs.
Output: Classification pair, confidence (0.0–1.0), reasoning.
Model: Claude Sonnet.
Volume: ~0.5% of unique office names in MEDSL 2022 (~42 of 8,387). By record count, far less — these are the rarest, most obscure offices.
Real examples:
| Office Name | State | LLM Classification | Confidence |
|---|---|---|---|
| Santa Rosa Island Authority | FL | special_district / infrastructure | 0.90 |
| Register of Mesne Conveyances | SC | county / judicial | 0.88 |
| Hog Reeve | NH | municipal / regulatory | 0.60 |
3. L4 Entity Auditing
Trigger: An entity cluster contains records from multiple sources, multiple elections, or multiple office types. In the current design, every multi-member entity is audited (budget is not a constraint).
Input: The full entity cluster — canonical name, all aliases, all elections, all vote counts, all states, all offices.
Output: Plausibility assessment: plausible / suspicious / error, with reasoning.
Model: Claude Sonnet (Opus-class for flagged entities).
Volume: In the prototype, 50 entities were audited. The LLM flagged 43 as suspicious (precinct-level records inflating temporal chains — a bug in our aggregation, not in the data) and 4 as errors (“For” and “Against” classified as person entities). At production scale, the volume scales with the number of multi-member entities, not with total records.
When the LLM Is Not Called
Everything else. Specifically:
| Operation | Layer | Method | Why not LLM |
|---|---|---|---|
| CSV/TSV/XML parsing | L1 | Source-specific parser | Deterministic; format is fixed per source |
| Name decomposition | L1 | Rule-based parser | Deterministic; name formats are enumerable |
| Nickname dictionary lookup | L1 | Hash table | O(1) lookup; no reasoning needed |
| FIPS code enrichment | L1 | Census reference table | Exact match on (state, county_name) |
| Vote share computation | L1 | Arithmetic | Division is deterministic |
| Hash computation | L1–L4 | SHA-256 | Cryptographic function; no reasoning needed |
| Office classification (tiers 1–2) | L1 | Keyword + regex | Deterministic; handles 62% of unique names |
| Office classification (tier 3) | L2 | Embedding nearest-neighbor | Deterministic given model version; handles 4.5% more |
| Embedding generation | L2 | OpenAI API | Deterministic given model version; not an LLM call |
| Exact name matching (step 1) | L3 | Structured field equality | Handles 70% of entity resolution |
| Jaro-Winkler matching (step 2) | L3 | String similarity | Deterministic; handles 0.1% more |
| Name gate (step 2.5) | L3 | JW on last names | Eliminates obvious non-matches |
| High-confidence embedding match (step 3) | L3 | Cosine ≥ 0.95 | Auto-accept; no ambiguity to resolve |
| Canonical name selection | L4 | Fixed algorithm | Most-complete + most-authoritative; no judgment needed |
| Temporal chain aggregation | L4 | Group-by on (entity_id, election_date) | SQL-style aggregation |
| Hash chain verification | L4 | SHA-256 recomputation | Cryptographic verification |
| Cross-source vote reconciliation | L4 | Arithmetic comparison | Exact or percentage-based comparison |
The Principle
If a deterministic method handles it, do not add LLM latency and non-determinism.
This is not a cost argument. Budget is not a constraint. It is an accuracy and reproducibility argument:
- Deterministic methods do not hallucinate. SHA-256 always returns the same hash. FIPS lookup always returns the same code. An LLM might return a different FIPS code on a second call, not because it is wrong, but because it is probabilistic. For operations with known-correct deterministic solutions, adding an LLM is adding risk, not capability.
- Deterministic methods are reproducible. Re-running L1 on the same L0 files with the same parser version produces bit-identical output. Re-running an LLM-based parser may produce different field values. For a pipeline that serves journalists and researchers who need to cite specific numbers, reproducibility is non-negotiable for the operations that support it.
- Deterministic methods are fast. L1 processes 200 records in under a second. An LLM call takes 200–2,000ms. For the 70% of entity resolution handled by exact match and the 62% of office classification handled by keywords, the LLM adds latency with zero accuracy benefit.
The LLM is powerful. It correctly identified all 12 test pairs in entity resolution, including the Crist nickname case (0.451 cosine) that no threshold-based system could safely auto-resolve. It classified all 9 tier-4 office names correctly, including obscure offices like “Hog Reeve” that no reference set could anticipate.
But it is called only for the cases that need it: the 3.5% of entity comparisons in the ambiguous zone, the 0.5% of office names that no pattern matches, and the entity audit that catches contamination like ballot-measure choices misclassified as people. For everything else, the answer is already known — deterministically, reproducibly, and instantly.
Cross-References
- Design Principles — “Deterministic first” as principle #1
- L3: Matched — where LLM calls happen for entity resolution
- The Four-Tier Classifier — where LLM calls happen for office classification
- Budget Is Not a Constraint — why the cascade exists despite unlimited budget
Budget Is Not a Constraint — Speed and Reproducibility Are
This project has no API cost ceiling. Every LLM call that improves accuracy is worth making. This changes several design decisions compared to a cost-constrained pipeline — but it does not change the fundamental architecture. The cascade exists for speed and reproducibility, not for cost savings.
What Unlimited Budget Changes
Wider Ambiguous Zone
The embedding similarity thresholds for entity resolution were widened specifically because cost is not a constraint:
| Parameter | Cost-constrained | Our design |
|---|---|---|
| Ambiguous zone | 0.65–0.82 | 0.35–0.95 |
| Zone width | 0.17 | 0.60 |
| Pairs reaching LLM | ~5% of within-block pairs | ~25% of within-block pairs |
The wider zone sends roughly 5× more pairs to the LLM. At $0.0002 per call, the difference between 10,000 calls and 50,000 calls is $8. At production scale with millions of pairs, the difference might reach hundreds of dollars. Neither figure justifies accepting false positives (Williams Jr at 0.862) or false negatives (Crist at 0.451) that a narrower zone would cause.
Stronger Model for Tiebreakers
Step 5 of the entity resolution cascade escalates low-confidence LLM decisions (confidence < 0.70 from Claude Sonnet) to an Opus-class model. The stronger model costs approximately 10× more per call but is invoked only for the lowest-confidence subset of an already-small LLM cohort.
A cost-constrained pipeline would re-run the same Sonnet model or defer to human review. We use the stronger model because the marginal cost per call (~$0.002) is negligible and the accuracy gain on edge cases is measurable.
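The escalation rule is simple enough to state as code. This is a minimal sketch: `NextStep` and `after_first_pass` are hypothetical names, and only the 0.70 confidence cutoff comes from the text above.

```rust
/// Hypothetical outcome of inspecting a first-pass LLM decision.
#[derive(Debug, PartialEq)]
pub enum NextStep {
    /// Confidence is high enough to keep the first-pass decision.
    KeepDecision,
    /// Step 5: re-ask the question with the stronger tiebreaker model.
    EscalateToStrongerModel,
}

/// Applies the step 5 rule: confidence below 0.70 escalates.
pub fn after_first_pass(confidence: f64) -> NextStep {
    if confidence < 0.70 {
        NextStep::EscalateToStrongerModel
    } else {
        NextStep::KeepDecision
    }
}
```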
Full L4 Entity Audit
The L4 LLM entity audit examines every multi-member entity, not a sample. In the prototype, 50 entities were audited; the LLM flagged 43 as suspicious and 4 as errors. At production scale with tens of thousands of multi-member entities, full audit coverage means thousands of LLM calls.
A cost-constrained pipeline would sample 5–10% of entities and extrapolate. We audit 100% because the cost of missing a contaminated entity (ballot measure choices classified as people, precinct-level records inflating temporal chains) is higher than the cost of the API calls. The “For” and “Against” error was caught by the full audit — a 10% sample might have missed it.
Tier 4 Office Classification Without Hesitation
Every unclassified office name that survives tiers 1–3 goes to the LLM. There is no “batch the cheapest 80% and skip the rest” optimization. All ~42 hard cases in our prototype were classified. At national scale, the long tail of hyper-local office names (township-specific roles, water district sub-boards, tribal offices) may produce hundreds of tier 4 calls per election cycle. The cost is trivial; the coverage gain is not.
What Unlimited Budget Does Not Change
The Cascade Still Exists
Sending every candidate pair directly to the LLM — skipping exact match, Jaro-Winkler, the name gate, and embedding retrieval — would produce correct results for most pairs. It would also be impossibly slow.
At 42 million rows, even with aggressive blocking, the number of within-block candidate pairs runs into the millions. At 200ms per LLM API call, one million pairs take 55 hours of serial wall-clock time. With 10× parallelism, that is still 5.5 hours — for a single step that exact match handles in seconds for 70% of cases.
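The wall-clock estimate above is plain arithmetic, sketched here so the figures can be rechecked with other inputs. The function name is hypothetical, and the model assumes a fixed per-call latency with perfectly even parallelism, which real API traffic will not match exactly.

```rust
/// Estimated hours to send `pairs` LLM calls at `latency_ms` each,
/// spread evenly over `parallelism` concurrent workers.
pub fn llm_wall_clock_hours(pairs: u64, latency_ms: u64, parallelism: u64) -> f64 {
    let workers = parallelism.max(1); // treat 0 as serial execution
    let total_ms = pairs * latency_ms;
    total_ms as f64 / workers as f64 / 3_600_000.0
}
```

One million pairs at 200ms each comes to about 55.6 hours serially and about 5.6 hours with ten workers, matching the rounded figures in the text.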
The cascade is not a cost optimization. It is a speed optimization. Steps 1–3 process 76% of pairs in under a millisecond each. The LLM is reserved for the 3.5% where cheap methods cannot decide.
Deterministic Steps Are Still Preferred
Exact match, Jaro-Winkler, keyword classification, regex classification, FIPS lookup, vote share computation, and hash verification are deterministic. They produce identical output from identical input on every run, on every machine, forever.
LLM calls are non-deterministic. The same pair submitted twice may produce different confidence scores (typically within ±0.05) and occasionally different reasoning text. The decision (match/no-match) is stable in >99% of re-runs, but “99% stable” is not “deterministic.”
For a pipeline that serves journalists citing specific numbers and researchers publishing reproducible analyses, determinism is not a preference — it is a requirement for the operations that support it. We use deterministic methods wherever they produce correct results, not because they are cheaper, but because they are trustworthy in a way that probabilistic methods are not.
LLMs Do Not Parse, Enrich, or Compute
No amount of budget makes it sensible to use an LLM for:
- Parsing CSV/TSV/XML. The format is fixed per source. A parser handles it in microseconds with zero error rate.
- FIPS lookup. A hash table lookup on (state, county_name) returns the correct code every time. An LLM might hallucinate a FIPS code — “37047” for Columbus County NC is correct, but there is no mechanism to verify the LLM’s output without the same lookup table that makes the LLM unnecessary.
- SHA-256 computation. Cryptographic hash functions are mathematical operations. An LLM cannot compute them.
- Vote share arithmetic. 303 / 580 = 0.5224. A calculator is correct. An LLM might round differently, truncate, or occasionally hallucinate.
These operations have known-correct deterministic solutions. Adding an LLM to any of them introduces risk with zero benefit, regardless of budget.
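As a rough illustration of how small the deterministic versions are, here is a sketch of the FIPS lookup and the vote share computation. The function names and the two-row table are hypothetical stand-ins for the Census reference file; the two codes shown are the ones quoted in this document.

```rust
use std::collections::HashMap;

/// Tiny stand-in for the Census FIPS reference table,
/// keyed on (state postal code, county name).
pub fn fips_table() -> HashMap<(&'static str, &'static str), &'static str> {
    HashMap::from([
        (("NC", "Columbus"), "37047"),
        (("NC", "Wake"), "37183"),
    ])
}

/// Exact-match lookup: either the code is in the table or the result is None.
/// Nothing is guessed or inferred.
pub fn county_fips<'a>(
    table: &HashMap<(&'a str, &'a str), &'a str>,
    state_po: &'a str,
    county: &'a str,
) -> Option<&'a str> {
    table.get(&(state_po, county)).copied()
}

/// Vote share as a fraction of the contest total; None when the total is zero.
pub fn vote_share(votes: u64, contest_total: u64) -> Option<f64> {
    if contest_total == 0 {
        None
    } else {
        Some(votes as f64 / contest_total as f64)
    }
}
```

Both functions return the same answer on every run, and an unknown county yields `None` rather than a plausible-looking code.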
Reproducibility Requires Logged Decisions
Every LLM decision at L3 and L4 is stored in a JSONL audit log with the full prompt, response, confidence, and reasoning. Replaying from the log does avoid re-calling the LLM, but saving money is not the point. The log is a reproducibility measure: a researcher who wants to verify or contest a match decision can read the log, see the LLM’s reasoning, and evaluate whether the decision was correct.
If budget were infinite and API calls were instantaneous, we would still log every decision. The log is not a cache — it is the canonical record of how the pipeline resolved ambiguity. Deleting the log and re-running the LLM would produce a slightly different set of confidence scores, which might shift a small number of borderline decisions, which would change downstream entity assignments. The log prevents this drift.
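The "log as canonical record" idea can be sketched as a lookup that consults the log first and calls the LLM only for pairs with no recorded decision. This is an in-memory illustration with hypothetical names (`LoggedDecision`, `decide_with_log`); the real log is JSONL on disk and also stores the full prompt and response.

```rust
use std::collections::HashMap;

/// Hypothetical, reduced form of one logged LLM decision.
#[derive(Debug, Clone, PartialEq)]
pub struct LoggedDecision {
    pub is_match: bool,
    pub confidence: f64,
    pub reasoning: String,
}

/// Returns the logged decision for `pair_key` if one exists.
/// Otherwise runs `call_llm` once, records the result, and returns it.
pub fn decide_with_log<F>(
    log: &mut HashMap<String, LoggedDecision>,
    pair_key: &str,
    call_llm: F,
) -> LoggedDecision
where
    F: FnOnce() -> LoggedDecision,
{
    log.entry(pair_key.to_string())
        .or_insert_with(call_llm)
        .clone()
}
```

On a re-run, the closure is never invoked for a pair already in the log, so a slightly different confidence from a fresh LLM call cannot shift a borderline decision.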
The Real Constraints
Budget is not a constraint. The real constraints are:
| Constraint | Effect on design |
|---|---|
| Wall-clock time | The cascade exists because LLM calls at scale take hours; exact match takes seconds |
| Reproducibility | Deterministic methods preferred; LLM decisions logged for replay |
| Accuracy | Wider ambiguous zone, stronger tiebreaker model, full audit coverage |
| Auditability | Every decision logged with reasoning; hash chain from L4 to L0 |
| Correctness | Deterministic methods used wherever they produce correct results; LLMs used only for genuine ambiguity |
A budget-constrained version of this pipeline would narrow the ambiguous zone, sample the entity audit, skip tier 4 office classification for rare offices, and use the same model for tiebreakers. All of these are accuracy trade-offs. We make none of them.
The cascade’s structure — exact match → JW → gate → embedding → LLM → tiebreaker — is identical whether the budget is $10 or $10,000. The thresholds move. The model choices change. The architecture does not.
Schema Overview
The unified schema defines the structure of every election record at every pipeline layer. A single record represents one candidate’s (or one ballot measure choice’s) vote count in one geographic unit for one contest. All sources — MEDSL, NC SBE, OpenElections, VEST, Clarity — are normalized into this schema at L1. Subsequent layers (L2–L4) add fields but never remove them.
A record has seven sections: election, jurisdiction, contest, results, turnout, source, and provenance. Not every field is populated for every record. Fields that the source does not provide are null, not inferred.
Election
Identifies which election this record belongs to.
| Field | Type | Description | Example |
|---|---|---|---|
| date | date | Election date (ISO 8601) | 2022-11-08 |
| year | integer | Election year, derived from date | 2022 |
| type | ElectionType | General, primary, runoff, special, etc. | General |
| stage | string | Source-provided stage code | GEN |
| special | boolean | Whether this is a special election | false |
| certification_status | string | Certified, unofficial, or unknown | certified |
The type field is an enum — see Enumerations Reference. The stage field preserves the raw source value (MEDSL uses GEN/PRI/RUN; NC SBE does not have a stage column). The certification_status field reflects whether the source data represents certified results. NC SBE and MEDSL publish certified data. Clarity publishes unofficial election night results that may be updated.
Jurisdiction
Identifies the geographic unit where votes were counted.
| Field | Type | Description | Example |
|---|---|---|---|
| state | string | Full state name | North Carolina |
| state_po | string | Two-letter postal code | NC |
| state_fips | string | Two-digit state FIPS code | 37 |
| county | string | County name (may be null for statewide) | Wake |
| county_fips | string | Five-digit county FIPS code | 37183 |
| precinct | string | Precinct name or code from the source | 01-01 |
| precinct_code | string | Numeric precinct code (NC SBE only) | 0101 |
| jurisdiction_name | string | Jurisdiction name from MEDSL | WAKE |
| jurisdiction_fips | string | Jurisdiction FIPS from MEDSL | 37183 |
| ocd_id | string | Open Civic Data identifier (when available) | ocd-division/country:us/state:nc/county:wake |
| level | JurisdictionLevel | Geographic granularity of this record | Precinct |
The county_fips field is the primary geographic join key across sources. It is enriched from Census FIPS reference files at L1 when the source provides a county name but no code. The ocd_id field is populated when a mapping exists; it is null for most records today.
The level field indicates what geographic unit this row represents. Most records are Precinct. Some sources provide only county-level aggregates (County). VEST data with precinct boundaries is Precinct with accompanying geometry.
Contest
Describes the race or ballot measure.
| Field | Type | Description | Example |
|---|---|---|---|
| kind | ContestKind | CandidateRace, BallotMeasure, or TurnoutMetadata | CandidateRace |
| raw_name | string | Contest name exactly as it appears in the source | CABARRUS COUNTY SCHOOLS BOARD OF EDUCATION |
| normalized_name | string | Cleaned contest name (L1+) | Cabarrus County Schools Board of Education |
| office_level | OfficeLevel | Federal, state, county, municipal, etc. | County |
| office_category | OfficeCategory | Executive, legislative, judicial, school board, etc. | SchoolBoard |
| district | string | District number or name (blank if at-large) | DISTRICT 02 |
| dataverse | string | MEDSL’s race level tag (blank for local) | (empty string) |
| classifier_method | ClassifierMethod | How office_level and office_category were assigned | Keyword |
| vote_for | integer | Maximum number of candidates a voter may select | 1 |
| magnitude | integer | Number of seats being filled | 3 |
| is_retention | boolean | Whether this is a judicial retention election | false |
The kind field is an enum with three variants — see Contest Kinds. The distinction between CandidateRace, BallotMeasure, and TurnoutMetadata is determined at L1 based on the contest name and choice values.
The classifier_method field records how the office_level and office_category were assigned: Keyword (deterministic string match, 62% of records), Regex (pattern-based, ~15%), Embedding (nearest-neighbor at L2), or Llm (LLM classification at L3). This field exists so that users can filter by classification confidence.
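The four variants map onto the tier order of the office classifier. The sketch below shows only the tier selection logic, assuming the keyword and regex checks and the nearest-neighbor cosine have already been computed elsewhere; `pick_tier` is a hypothetical name, and the 0.60 cutoff is the tier 3 threshold quoted earlier in this document.

```rust
/// Mirrors the ClassifierMethod variants named above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ClassifierMethod {
    Keyword,
    Regex,
    Embedding,
    Llm,
}

/// Picks the first tier that can classify an office name.
/// `nearest_cosine` is the similarity to the closest reference office,
/// or None when no embedding comparison is available.
pub fn pick_tier(
    keyword_hit: bool,
    regex_hit: bool,
    nearest_cosine: Option<f64>,
) -> ClassifierMethod {
    if keyword_hit {
        ClassifierMethod::Keyword
    } else if regex_hit {
        ClassifierMethod::Regex
    } else if nearest_cosine.map_or(false, |c| c >= 0.60) {
        ClassifierMethod::Embedding
    } else {
        // Only names that survive tiers 1-3 reach the LLM.
        ClassifierMethod::Llm
    }
}
```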
The vote_for field comes from NC SBE’s Vote For column. MEDSL does not provide this field. When unavailable, it defaults to null. The magnitude field comes from MEDSL’s magnitude column and indicates multi-member districts.
Results
An array of candidate results attached to the contest. For a CandidateRace, each element is one candidate. For a BallotMeasure, each element is one choice (e.g., “For”, “Against”). For TurnoutMetadata, the results array is empty.
| Field | Type | Description | Example |
|---|---|---|---|
| candidate_name | CandidateName | Decomposed name — see below | (see Name Components) |
| party_raw | string | Party label exactly as source provides | LIBERTARIAN |
| party_simplified | PartySimplified | Normalized party enum | Libertarian |
| votes_total | integer | Total votes for this candidate in this precinct | 90 |
| vote_share | float | Fraction of total contest votes (computed) | 0.023 |
| writein | boolean | Whether this is a write-in candidate | false |
| incumbent | boolean | Whether this candidate is the incumbent (if known) | null |
| vote_counts_by_type | VoteCountsByType | Breakdown by vote method — see below | (see below) |
CandidateName
Names are decomposed into components rather than stored as a single string. This is documented in detail in Candidate Name Components.
| Field | Type | Description | Example |
|---|---|---|---|
| raw | string | Name exactly as it appears in the source | MICHAEL "STEVE" HUBER |
| first | string | Parsed first name | Michael |
| middle | string | Parsed middle name or initial | null |
| last | string | Parsed last name | Huber |
| suffix | string | Jr, Sr, II, III, IV, etc. | null |
| nickname | string | Detected nickname | Steve |
| canonical_first | string | Nickname-resolved first name | Stephen |
The raw field is preserved at every layer and never modified. The component fields are populated at L1 during name parsing. The canonical_first field is populated at L1 using the nickname dictionary (e.g., Charlie→Charles, Steve→Stephen, Pat→Patricia). All fields are available at every pipeline layer.
VoteCountsByType
When the source provides vote mode breakdowns, they are stored here. NC SBE provides all four fields for every contest. MEDSL provides them when modes are split into separate rows (summed during L1). Most other sources provide only the total.
| Field | Type | Description | Example |
|---|---|---|---|
| election_day | integer | Election day votes | 136 |
| early | integer | Early / one-stop votes | 159 |
| absentee_mail | integer | Mail-in absentee votes | 7 |
| provisional | integer | Provisional ballot votes | 1 |
NC SBE calls early voting “One Stop.” MEDSL calls it “EARLY VOTING.” Both are mapped to the early field at L1.
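A minimal sketch of this structure follows, assuming each mode is optional because most sources provide only a total. The method and helper names are hypothetical; only the two early-voting labels named above are recognized, since other source labels are not documented here.

```rust
/// Vote counts by method; None means the source did not provide that mode.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct VoteCountsByType {
    pub election_day: Option<u64>,
    pub early: Option<u64>,
    pub absentee_mail: Option<u64>,
    pub provisional: Option<u64>,
}

impl VoteCountsByType {
    /// Sum of all four modes, or None if any mode is missing.
    /// A partial sum is never reported as a total.
    pub fn total(&self) -> Option<u64> {
        Some(self.election_day? + self.early? + self.absentee_mail? + self.provisional?)
    }
}

/// True for the two source labels this document maps to `early`.
pub fn is_early_label(source_label: &str) -> bool {
    let label = source_label.trim();
    label.eq_ignore_ascii_case("One Stop") || label.eq_ignore_ascii_case("EARLY VOTING")
}
```

With the example values from the table (136, 159, 7, 1), `total()` returns 303.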
Turnout
Voter registration and participation counts for the geographic unit. These fields are sparsely populated — less than 5% of records have values.
| Field | Type | Description | Example |
|---|---|---|---|
| registered_voters | integer | Number of registered voters in this precinct | 2847 |
| ballots_cast | integer | Total ballots cast in this precinct | 1893 |
| turnout_pct | float | ballots_cast / registered_voters (computed) | 0.665 |
NC SBE provides registered_voters via “Registered Voters” pseudo-contest rows. These are extracted during L1 parsing and attached to the precinct’s turnout object. MEDSL rarely includes registration counts. Most records have null turnout.
Source
Provenance fields that document where this record came from.
| Field | Type | Description | Example |
|---|---|---|---|
| source_type | SourceType | Enum identifying the source system | Medsl |
| source_file | string | Filename of the L0 artifact | 2022-nc-local-precinct-general.csv |
| source_row | integer | Row number in the source file | 14523 |
| retrieval_date | datetime | When the source file was downloaded (UTC) | 2025-01-15T03:22:00Z |
| confidence | Confidence | High, Medium, or Low | Medium |
| raw_fields | SourceRawFields | All original columns from the source, typed per source | (see below) |
SourceRawFields
The raw_fields object preserves every column from the original source row, typed as an enum per source. This ensures no information is lost during normalization.
| Variant | Source | Fields preserved |
|---|---|---|
| MedslRawRecord | MEDSL | All 25 MEDSL columns including state_cen, state_ic, readme_check, version |
| NcsbeRawRecord | NC SBE | All 15 NC SBE columns including Contest Group ID, Contest Type, Real Precinct |
| OpenElectionsRawRecord | OpenElections | Variable columns depending on state file |
| VestRawRecord | VEST | Encoded column names and geometry reference |
| ClarityRawRecord | Clarity | XML element attributes |
| FecRawRecord | FEC | All 15 cn.txt columns |
| CensusRawRecord | Census | FIPS file columns |
Each variant is a struct with typed fields matching the source schema. This is a Rust enum, not a JSON object — the type system ensures you cannot accidentally read an NC SBE field from a MEDSL record. See Type System Design.
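A reduced sketch of what such an enum might look like, with only two variants and a handful of the columns named in the table. The field types (all `String`) and the accessor function are assumptions for illustration, not the project's actual definitions.

```rust
/// A few of the MEDSL columns named above; the real struct has all 25.
#[derive(Debug, Clone, PartialEq)]
pub struct MedslRawRecord {
    pub state_cen: String,
    pub state_ic: String,
    pub readme_check: String,
    pub version: String,
}

/// A few of the NC SBE columns named above; the real struct has all 15.
#[derive(Debug, Clone, PartialEq)]
pub struct NcsbeRawRecord {
    pub contest_group_id: String,
    pub contest_type: String,
    pub real_precinct: String,
}

/// One variant per source; only two are shown here.
#[derive(Debug, Clone, PartialEq)]
pub enum SourceRawFields {
    Medsl(MedslRawRecord),
    Ncsbe(NcsbeRawRecord),
}

/// Reading an NC SBE column requires matching on the NC SBE variant,
/// so a MEDSL record can only ever yield None here.
pub fn contest_group_id(raw: &SourceRawFields) -> Option<&str> {
    match raw {
        SourceRawFields::Ncsbe(record) => Some(record.contest_group_id.as_str()),
        SourceRawFields::Medsl(_) => None,
    }
}
```

Because the `match` must cover every variant, adding a new source forces each accessor to state explicitly how that source is handled.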
Provenance
Hash chain and version metadata that enable verification and reproducibility.
| Field | Type | Description | Example |
|---|---|---|---|
| record_id | string | Deterministic hash of (source, file, row) | a3f8c2... |
| l1_hash | string | SHA-256 hash of this L1 record’s content | 7b2e91... |
| l0_parent_hash | string | SHA-256 hash of the L0 source artifact | c4d1f0... |
| l0_byte_offset | integer | Byte offset in the L0 file where this row starts | 1048576 |
| parser_version | string | Version of the parser that produced this record | 0.1.0 |
| schema_version | string | Version of the schema this record conforms to | 1.0.0 |
The hash chain links every record back to the original source bytes. If the L1 record is modified, its l1_hash changes and no longer matches the hash stored in any L2 record that references it. The verification algorithm at L4 checks the full chain: L4 → L3 → L2 → L1 → L0 → source bytes.
The record_id is deterministic: identical source input always produces the same record_id. This enables deduplication and makes re-processing idempotent.
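A minimal sketch of a deterministic record_id, under stated assumptions: the document does not say which hash function backs record_id, so FNV-1a (64-bit) is used here purely as a dependency-free stand-in, and the function names and the separator choice are hypothetical.

```rust
/// FNV-1a 64-bit. A stand-in only: the pipeline's actual hash for record_id is not
/// specified in this document (l1_hash and l0_parent_hash are stated to be SHA-256).
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for byte in bytes {
        hash ^= *byte as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical record_id: hex digest of (source, file, row).
/// Fields are joined with a 0x1f unit separator so ("a", "bc") and ("ab", "c") differ.
pub fn record_id(source_type: &str, source_file: &str, source_row: u64) -> String {
    let key = format!("{source_type}\u{1f}{source_file}\u{1f}{source_row}");
    format!("{:016x}", fnv1a64(key.as_bytes()))
}
```

Because the ID depends only on the three inputs, re-running the parser over the same L0 artifact yields the same IDs, which is what makes deduplication by record_id safe.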
Layer-Specific Additions
Each pipeline layer adds fields to the record. The base schema (above) is fully populated at L1. Subsequent layers extend it:
| Layer | Fields added |
|---|---|
| L2 (Embedded) | candidate_name_embedding, contest_name_embedding, jurisdiction_embedding, embedding_model, embedding_version |
| L3 (Matched) | candidate_cluster_id, contest_cluster_id, match_confidence, match_method |
| L4 (Canonical) | canonical_candidate_name, canonical_contest_name, temporal_chain_id, verification_status, alias_table |
L1 records are self-contained. L2+ records reference their parent layer’s hash. No fields from earlier layers are removed or overwritten — each layer is additive.
JSONL Representation
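At every layer, records are serialized as one JSON object per line (JSONL). The schema sections are top-level keys; the example below shows seven (election, jurisdiction, contest, results, turnout, source, provenance):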
{"election":{"date":"2022-11-08","year":2022,"type":"General",...},"jurisdiction":{"state":"North Carolina","state_po":"NC",...},"contest":{"kind":"CandidateRace","raw_name":"CABARRUS COUNTY SCHOOLS BOARD OF EDUCATION",...},"results":[{"candidate_name":{"raw":"GREG MILLS","first":"Greg","last":"Mills",...},"votes_total":79,...}],"turnout":null,"source":{"source_type":"Medsl","source_file":"2022-nc-local-precinct-general.csv",...},"provenance":{"record_id":"a3f8c2...","l1_hash":"7b2e91...",...}}
Files are streamable: each line is a complete record. Files are appendable: new records can be concatenated without modifying existing lines. Serialization uses serde_json in Rust. See Output Formats.
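The streamable and appendable properties can be illustrated with the standard library alone. This is a sketch, not the project's I/O code: both function names are hypothetical, and records are passed as already-serialized strings because serde_json is not used here.

```rust
use std::io::{self, BufRead, Write};

/// Append one already-serialized JSON object as a single JSONL line.
/// Existing lines are never touched, so appending is safe.
pub fn append_record<W: Write>(out: &mut W, json_object: &str) -> io::Result<()> {
    // A record must occupy exactly one line, so embedded newlines are rejected.
    if json_object.contains('\n') {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "record spans multiple lines",
        ));
    }
    writeln!(out, "{json_object}")
}

/// Visit records one line at a time without loading the whole file.
/// Returns the number of non-blank lines visited.
pub fn for_each_record<R: BufRead>(reader: R, mut visit: impl FnMut(&str)) -> io::Result<usize> {
    let mut count = 0;
    for line in reader.lines() {
        let line = line?;
        if line.trim().is_empty() {
            continue; // tolerate blank lines between records
        }
        visit(&line);
        count += 1;
    }
    Ok(count)
}
```

In practice the reader would be a BufReader over a file and each visited line would be handed to a JSON deserializer.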
Contest Kinds: CandidateRace, BallotMeasure, TurnoutMetadata
Every record in the pipeline belongs to exactly one of three contest kinds. This is modeled as a type-level enum — not a string field — so that invalid combinations are rejected at compile time rather than discovered at query time.
Why three kinds
Election data files mix three fundamentally different things in the same tabular format:
- A candidate running for office and receiving votes.
- A ballot measure (bond, referendum, constitutional amendment) where voters choose “Yes” or “No.”
- A metadata row recording registered voters or ballots cast for a precinct, masquerading as a contest.
Sources do not distinguish these. MEDSL puts REGISTERED VOTERS in the office column as if it were a race. NC SBE creates a “contest” called Registered Voters - Total with a “candidate” whose vote count is actually the registration total. Florida OpenElections has 6,013 rows where office = "Registered Voters" — 67.9% of all non-candidate records in the initial FL load.
If these are not separated at parse time, downstream analysis produces nonsense: “Registered Voters” appears as the most popular candidate in America, “For” shows up as a person’s name in entity resolution, and vote totals are inflated by turnout metadata.
The enum
enum ContestKind {
    CandidateRace {
        results: Vec<CandidateResult>,
    },
    BallotMeasure {
        choices: Vec<BallotChoice>,
        measure_type: BallotMeasureType,
        passage_threshold: Option<f64>,
    },
    TurnoutMetadata {
        registered_voters: Option<u64>,
        ballots_cast: Option<u64>,
    },
}
Each variant carries different fields. You cannot accidentally attach a candidate_name to a ballot measure or a passage_threshold to a candidate race.
CandidateRace
The common case. A person is running for an office and received votes.
| Field | Type | Description |
|---|---|---|
| results | Vec<CandidateResult> | One entry per candidate in the contest |
Each CandidateResult contains:
| Field | Type | Description |
|---|---|---|
| candidate_name | CandidateName | Decomposed name (raw, first, middle, last, suffix, nickname, canonical_first) |
| party | Party | Raw string + normalized enum |
| votes_total | u64 | Total votes received |
| vote_share | Option<f64> | Percentage of total contest votes |
| vote_counts_by_type | VoteCountsByType | Breakdown: election_day, early, absentee_mail, provisional |
Examples of CandidateRace contests:
- US SENATE (federal)
- GOVERNOR (state)
- COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 (local)
- SHERIFF (county)
BallotMeasure
Voters choose between options (typically “For”/“Against” or “Yes”/“No”) on a proposition, bond, amendment, or referendum.
| Field | Type | Description |
|---|---|---|
| choices | Vec<BallotChoice> | One entry per option |
| measure_type | BallotMeasureType | Bond, amendment, referendum, etc. |
| passage_threshold | Option<f64> | Required vote share for passage (e.g., 0.60 for a bond requiring 60%) |
Each BallotChoice contains:
| Field | Type | Description |
|---|---|---|
| choice_text | String | “For”, “Against”, “Yes”, “No”, or other option text |
| votes_total | u64 | Votes for this choice |
| vote_share | Option<f64> | Percentage of total votes |
The BallotMeasureType enum: Bond, ConstitutionalAmendment, Referendum, Initiative, Recall, Retention, Levy, Advisory, Other.
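How passage_threshold might be applied is sketched below. Everything here is an assumption for illustration: the function name is hypothetical, a missing threshold is treated as a simple majority, and an explicit threshold is considered met at equality. Actual passage rules vary by jurisdiction and are not specified in this document.

```rust
/// Hypothetical check of a ballot measure against its passage threshold.
/// Assumptions: `total` counts votes cast on the measure; None threshold means
/// strictly more than half; an explicit threshold passes at equality.
pub fn measure_passes(
    affirmative: u64,
    total: u64,
    passage_threshold: Option<f64>,
) -> Option<bool> {
    if total == 0 {
        return None; // no votes recorded, so the outcome is undetermined
    }
    let share = affirmative as f64 / total as f64;
    Some(match passage_threshold {
        Some(threshold) => share >= threshold,
        // Integer comparison avoids rounding at an exact 50/50 split.
        None => affirmative * 2 > total,
    })
}
```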
Why this prevents name confusion
Without the BallotMeasure variant, the L1 parser would treat “For” and “Against” as candidate names. They would flow into entity resolution at L3, where the system would try to find other elections where “For” ran for office. By assigning ballot measures to their own variant at parse time, the choice_text field is never passed to the name decomposition or embedding logic.
Detection at L1 uses two signals:
- The contest name contains keywords: “bond”, “amendment”, “referendum”, “proposition”, “measure”, “levy”, “question”.
- The choice values are in the set {“For”, “Against”, “Yes”, “No”, “Bonds”, “No Bonds”}.
TurnoutMetadata
Not a contest at all. These rows carry precinct-level registration and turnout counts that sources embed in the results file as pseudo-contests.
| Field | Type | Description |
|---|---|---|
| registered_voters | Option<u64> | Registered voter count for this precinct |
| ballots_cast | Option<u64> | Total ballots cast in this precinct |
Source examples that produce TurnoutMetadata records:
| Source | office / Contest Name value | candidate / Choice value |
|---|---|---|
| MEDSL | REGISTERED VOTERS | REGISTERED VOTERS |
| MEDSL | BALLOTS CAST - TOTAL | BALLOTS CAST |
| NC SBE | Registered Voters - Total | (numeric total in vote column) |
| OpenElections FL | Registered Voters | (numeric total) |
Detection at L1: the contest name matches a known set of turnout keywords (REGISTERED VOTERS, BALLOTS CAST, BALLOTS CAST - TOTAL, BALLOTS CAST - BLANK). When detected, the vote count is extracted into registered_voters or ballots_cast, and the record is tagged as TurnoutMetadata rather than CandidateRace.
These extracted turnout values backfill the turnout section of other records in the same precinct. Currently, turnout data is populated for less than 5% of records because most MEDSL state files do not include registration count rows.
Classification at L1
Contest kind assignment happens during L1 parsing — the deterministic layer. No ML, no embeddings, no API calls. The decision tree:
- Does the contest name match a turnout keyword? → TurnoutMetadata
- Do the choice values match ballot measure patterns (“For”/“Against”/“Yes”/“No”)? → BallotMeasure
- Does the contest name contain ballot measure keywords? → BallotMeasure
- Otherwise → CandidateRace
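The decision tree can be sketched as a pure function. The keyword and choice sets come from the text above; the function and type names are hypothetical, and the matching details (case-insensitive prefix match for turnout names, naive substring match for measure keywords, and requiring every choice to be in the known set) are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    TurnoutMetadata,
    BallotMeasure,
    CandidateRace,
}

const TURNOUT_PREFIXES: [&str; 2] = ["REGISTERED VOTERS", "BALLOTS CAST"];
const MEASURE_KEYWORDS: [&str; 7] = [
    "bond", "amendment", "referendum", "proposition", "measure", "levy", "question",
];
const MEASURE_CHOICES: [&str; 6] = ["for", "against", "yes", "no", "bonds", "no bonds"];

/// Deterministic classification in the order given by the decision tree.
pub fn classify(contest_name: &str, choices: &[&str]) -> Kind {
    // 1. Turnout pseudo-contests (assumed: case-insensitive prefix match).
    let upper = contest_name.trim().to_uppercase();
    if TURNOUT_PREFIXES.iter().any(|p| upper.starts_with(*p)) {
        return Kind::TurnoutMetadata;
    }
    // 2. Choice values (assumed: every choice must be a known measure option).
    let is_measure_choice = |choice: &&str| {
        let lowered = choice.trim().to_lowercase();
        MEASURE_CHOICES.iter().any(|m| *m == lowered)
    };
    if !choices.is_empty() && choices.iter().all(is_measure_choice) {
        return Kind::BallotMeasure;
    }
    // 3. Contest-name keywords (naive substring match).
    let lower = contest_name.to_lowercase();
    if MEASURE_KEYWORDS.iter().any(|k| lower.contains(*k)) {
        return Kind::BallotMeasure;
    }
    // 4. Everything else is treated as a candidate race.
    Kind::CandidateRace
}
```

Checking turnout keywords first matters: a row such as REGISTERED VOTERS would otherwise fall through to CandidateRace and reach name decomposition.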
This classification is stored in the record and carried through all subsequent layers. L2 embeds only CandidateRace records for entity resolution. L3 matches only CandidateRace records. BallotMeasure and TurnoutMetadata records pass through L2–L4 without modification beyond provenance tracking.
Candidate Name Components
Election data sources represent candidate names as a single string. The formats are incompatible across sources — and sometimes within the same source across years. The pipeline decomposes every name into structured components at L1 and preserves all components through every subsequent layer.
Why decomposition instead of a single string
A single name field cannot support entity resolution. Consider matching these records:
| Source | Raw name string |
|---|---|
| MEDSL | SHANNON W BRAY |
| NC SBE | Shannon W. Bray |
| FEC | BRAY, SHANNON W |
String equality fails on all three pairs. Lowercasing and stripping punctuation gets MEDSL and NC SBE closer, but FEC’s last-first ordering still breaks. Decomposing into {first: Shannon, middle: W, last: Bray} makes all three identical after normalization.
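The normalization step described above can be sketched as a key function. This is an illustration only: name_key is a hypothetical helper that handles the three formats in the table and ignores suffixes and nicknames, which the real decomposition also covers.

```rust
/// Hypothetical normalization key: (first, middle, last), all lowercased.
pub fn name_key(raw: &str) -> (String, String, String) {
    // Drop periods and lowercase so "W." and "W" compare equal.
    let cleaned = raw.replace('.', "").to_lowercase();
    // Turn "last, first middle" into "first middle last".
    let reordered = if let Some((last, rest)) = cleaned.split_once(',') {
        format!("{} {}", rest.trim(), last.trim())
    } else {
        cleaned.clone()
    };
    let tokens: Vec<&str> = reordered.split_whitespace().collect();
    match tokens.as_slice() {
        [] => (String::new(), String::new(), String::new()),
        [only] => (String::new(), String::new(), only.to_string()),
        // First token is the first name, last token the last name,
        // anything in between is treated as the middle name.
        [first, middle @ .., last] => (first.to_string(), middle.join(" "), last.to_string()),
    }
}
```

All three raw strings for Shannon Bray reduce to the same key, so a plain equality check on the key succeeds where equality on the raw strings fails.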
The harder case is nicknames:
| Source | Raw name string | What a human sees |
|---|---|---|
| MEDSL | MICHAEL "STEVE" HUBER | First name Michael, goes by Steve |
| NC SBE | Michael (Steve) Huber | Same person |
| OpenElections | Steve Huber | Same person, nickname only |
Without decomposition, matching Steve Huber to MICHAEL "STEVE" HUBER requires the system to know that Steve is a nickname present in one variant but used as the primary name in another. The nickname and canonical_first fields make this explicit.
Component fields
Every candidate name in the pipeline is represented as a struct with seven fields:
| Field | Type | Description | Populated at |
|---|---|---|---|
| raw | String | Original name string exactly as it appeared in the source. Never modified. | L1 |
| first | Option<String> | Parsed first name | L1 |
| middle | Option<String> | Parsed middle name or initial | L1 |
| last | Option<String> | Parsed last name | L1 |
| suffix | Option<String> | Generational suffix: Jr, Sr, II, III, IV | L1 |
| nickname | Option<String> | Detected nickname, extracted from quotes or parentheses | L1 |
| canonical_first | Option<String> | Nickname-resolved first name. If first has a known nickname mapping, this holds the canonical form. | L1 |
All fields are available at every layer (L1 through L4). Later layers may refine values but never discard earlier ones.
Parsing rules by source
MEDSL
Names are ALL CAPS, no periods after initials, nicknames in double quotes, suffixes without commas.
| Raw | first | middle | last | suffix | nickname | canonical_first |
|---|---|---|---|---|---|---|
SHANNON W BRAY | Shannon | W | Bray | — | — | Shannon |
MICHAEL "STEVE" HUBER | Michael | — | Huber | — | Steve | Michael |
ROBERT VAN FLETCHER JR | Robert | Van | Fletcher | Jr | — | Robert |
LM "MICKEY" SIMMONS | L | M | Simmons | — | Mickey | L |
VICTORIA P PORTER | Victoria | P | Porter | — | — | Victoria |
WRITEIN | — | — | — | — | — | — |
WRITEIN is a sentinel value, not a person name. It is flagged at L1 and excluded from name decomposition.
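A minimal sketch of the MEDSL rules, assuming whitespace tokenization and single-word quoted nicknames. The helper names are hypothetical, canonical_first is simply copied from first (the nickname dictionary is out of scope here), and fused initials such as LM in the table above are not split by this sketch.

```rust
#[derive(Debug, Clone, Default, PartialEq)]
pub struct CandidateName {
    pub raw: String,
    pub first: Option<String>,
    pub middle: Option<String>,
    pub last: Option<String>,
    pub suffix: Option<String>,
    pub nickname: Option<String>,
    pub canonical_first: Option<String>,
}

/// "BRAY" -> "Bray"; hyphenated parts are capitalized separately.
fn title_case(token: &str) -> String {
    let parts: Vec<String> = token
        .split('-')
        .map(|part| {
            let mut chars = part.chars();
            match chars.next() {
                Some(c) => c.to_uppercase().chain(chars.flat_map(char::to_lowercase)).collect(),
                None => String::new(),
            }
        })
        .collect();
    parts.join("-")
}

/// Recognize the generational suffixes listed in the component table.
fn normalize_suffix(token: &str) -> Option<String> {
    let upper = token.trim_end_matches('.').to_uppercase();
    match upper.as_str() {
        "JR" => Some("Jr".to_string()),
        "SR" => Some("Sr".to_string()),
        "II" | "III" | "IV" => Some(upper),
        _ => None,
    }
}

pub fn parse_medsl_name(raw: &str) -> CandidateName {
    let mut name = CandidateName { raw: raw.to_string(), ..Default::default() };
    if raw.trim().eq_ignore_ascii_case("WRITEIN") {
        return name; // sentinel: every parsed field stays None
    }
    let mut tokens: Vec<&str> = Vec::new();
    for tok in raw.split_whitespace() {
        if tok.len() > 2 && tok.starts_with('"') && tok.ends_with('"') {
            name.nickname = Some(title_case(tok.trim_matches('"')));
        } else {
            tokens.push(tok);
        }
    }
    // A trailing suffix is only removed when other name tokens remain.
    let suffix = tokens.last().and_then(|t| normalize_suffix(t));
    if tokens.len() > 1 && suffix.is_some() {
        name.suffix = suffix;
        tokens.pop();
    }
    match tokens.as_slice() {
        [] => {}
        [only] => name.last = Some(title_case(only)),
        [first, middle @ .., last] => {
            name.first = Some(title_case(first));
            if !middle.is_empty() {
                let parts: Vec<String> = middle.iter().map(|t| title_case(t)).collect();
                name.middle = Some(parts.join(" "));
            }
            name.last = Some(title_case(last));
        }
    }
    name.canonical_first = name.first.clone(); // assumption: no dictionary lookup here
    name
}
```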
NC SBE
Names are Title Case, periods after initials, nicknames in parentheses, commas before suffixes.
| Raw | first | middle | last | suffix | nickname | canonical_first |
|---|---|---|---|---|---|---|
Shannon W. Bray | Shannon | W | Bray | — | — | Shannon |
Michael (Steve) Huber | Michael | — | Huber | — | Steve | Michael |
Robert Van Fletcher, Jr. | Robert | Van | Fletcher | Jr | — | Robert |
Patricia (Pat) Cotham | Patricia | — | Cotham | — | Pat | Patricia |
William Irvin. Enzor III | William | Irvin | Enzor | III | — | William |
The period after “Irvin.” in the last example is a data entry artifact. The parser strips trailing periods from middle names.
FEC
Names are LAST, FIRST MIDDLE format, all caps.
| Raw | first | middle | last | suffix | nickname | canonical_first |
|---|---|---|---|---|---|---|
BRAY, SHANNON W | Shannon | W | Bray | — | — | Shannon |
BIDEN, JOSEPH R JR | Joseph | R | Biden | Jr | — | Joseph |
The canonical_first field
canonical_first resolves known nicknames to their formal equivalents using the nickname dictionary. This enables matching when one source uses a nickname and another uses the legal name.
| first | nickname | canonical_first | Reasoning |
|---|---|---|---|
Michael | Steve | Michael | First name is already formal |
Charlie | — | Charles | Charlie is a known nickname for Charles |
Bob | — | Robert | Bob is a known nickname for Robert |
Patricia | Pat | Patricia | First name is already formal |
Bill | — | William | Bill is a known nickname for William |
Jim | — | James | Jim is a known nickname for James |
When first is already a formal name, canonical_first equals first. When first is itself a nickname (as when OpenElections reports Charlie Crist without the legal name Charles), canonical_first resolves to the formal form.
The nickname dictionary contains approximately 1,200 mappings. It is deterministic: no ML, no API calls. Ambiguous cases (e.g., “Alex” could map to “Alexander” or “Alexandra”) are resolved by leaving canonical_first equal to first and deferring to embedding-based matching, which uses the L2 embeddings during L3 entity resolution.
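The lookup can be sketched with a tiny subset of the dictionary. The four entries are taken from the table above; the function names are hypothetical, and a match statement stands in for whatever data structure the real dictionary uses.

```rust
/// Illustrative subset of the nickname dictionary (the real one is described
/// as holding roughly 1,200 mappings).
fn formal_name(first: &str) -> Option<&'static str> {
    match first.to_lowercase().as_str() {
        "charlie" => Some("Charles"),
        "bob" => Some("Robert"),
        "bill" => Some("William"),
        "jim" => Some("James"),
        // Formal names and ambiguous nicknames such as "Alex" have no entry.
        _ => None,
    }
}

/// Resolve a first name to its canonical form, falling back to the input
/// when the dictionary has no unambiguous mapping.
pub fn canonical_first(first: &str) -> String {
    formal_name(first)
        .map(String::from)
        .unwrap_or_else(|| first.to_string())
}
```

Leaving ambiguous names unmapped keeps this step deterministic and avoids committing to a wrong formal name before any context is available.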
How L2 uses name components
L2 constructs a composite string for embedding from the decomposed components:
{canonical_first} {middle} {last} {suffix}
This means Michael "Steve" Huber and Steve Huber both embed with their decomposed components rather than raw strings. The embedding model sees structured, normalized text rather than source-specific formatting.
The raw field is never used for embedding. It is preserved for provenance and debugging only.
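Building the composite string from the components might look like the sketch below. The function name is hypothetical, and skipping absent or empty components (rather than emitting placeholder text) is an assumption.

```rust
/// Hypothetical builder for the text sent to the embedding model:
/// "{canonical_first} {middle} {last} {suffix}", with missing parts omitted.
pub fn embedding_text(
    canonical_first: Option<&str>,
    middle: Option<&str>,
    last: Option<&str>,
    suffix: Option<&str>,
) -> String {
    let parts: Vec<&str> = [canonical_first, middle, last, suffix]
        .iter()
        .copied()
        .flatten() // drop None components
        .filter(|part| !part.is_empty())
        .collect();
    parts.join(" ")
}
```

Note that neither raw nor nickname is a parameter, so the source-specific formatting cannot leak into the embedded text.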
Special cases
Write-in candidates. MEDSL aggregates write-ins into WRITEIN. NC SBE reports named write-ins (e.g., Ronnie Strickland (Write-In)) separately from Write-In (Miscellaneous). Named write-ins are decomposed normally. The WRITEIN sentinel produces a record with all name fields set to None.
Ballot measure choices. The values For, Against, Yes, No are not person names. They are handled by the BallotMeasure contest kind and bypass name decomposition entirely. See Contest Kinds.
Hyphenated last names. Treated as a single last value: Smith-Jones → last: Smith-Jones. No attempt is made to split on hyphens.
Multiple middle names. A single middle name maps directly to the middle field: Joseph Robinette Biden → middle: Robinette. If two or more middle names are present (rare), they are space-separated in the middle field.
No first name. Some sources report only a last name (e.g., truncated records). In that case first is None and canonical_first is also None. The WRITEIN sentinel is a separate case: all of its name fields are None.
Enumerations Reference
Every categorical field in the schema is represented by a closed enumeration. This chapter lists all enum types, their values, and where each is used.
ElectionType
Classifies the type of election event.
| Value | Description |
|---|---|
| General | Regular general election (November even years) |
| Primary | Party primary election |
| Runoff | Runoff election following an inconclusive primary or general |
| Special | Special election to fill a vacancy |
| SpecialPrimary | Primary for a special election |
| SpecialRunoff | Runoff for a special election |
| Municipal | Municipal election (may be odd-year) |
| Recall | Recall election |
| Retention | Judicial retention election |
| Other | Election type not matching any above category |
Source mapping: MEDSL’s stage column maps GEN → General, PRI → Primary, RUN → Runoff. The special boolean flag promotes any type to its Special* variant. NC SBE does not distinguish — all loaded files are general elections.
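The MEDSL mapping can be sketched as follows. The function name is hypothetical; treating the special flag as an already-parsed bool, mapping a special general election to Special, and sending unknown stage codes to Other are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ElectionType {
    General,
    Primary,
    Runoff,
    Special,
    SpecialPrimary,
    SpecialRunoff,
    Municipal,
    Recall,
    Retention,
    Other,
}

/// Map MEDSL's stage code plus its special flag to an ElectionType.
pub fn election_type_from_medsl(stage: &str, special: bool) -> ElectionType {
    let base = match stage.trim().to_uppercase().as_str() {
        "GEN" => ElectionType::General,
        "PRI" => ElectionType::Primary,
        "RUN" => ElectionType::Runoff,
        _ => ElectionType::Other, // assumption: unknown codes are not guessed
    };
    if !special {
        return base;
    }
    // The special flag promotes a base type to its Special* counterpart.
    match base {
        ElectionType::General => ElectionType::Special,
        ElectionType::Primary => ElectionType::SpecialPrimary,
        ElectionType::Runoff => ElectionType::SpecialRunoff,
        other => other,
    }
}
```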
JurisdictionLevel
The geographic level at which a result is reported.
| Value | Description |
|---|---|
| State | Statewide aggregate |
| County | County-level result |
| Precinct | Precinct-level result |
| CongressionalDistrict | Congressional district aggregate |
| StateLegislativeUpper | State senate district aggregate |
| StateLegislativeLower | State house/assembly district aggregate |
| Municipality | City or town |
| SchoolDistrict | School district boundary |
Most records in the pipeline are Precinct. County and state aggregates appear in OpenElections data where precinct-level files are unavailable.
OfficeLevel
The level of government an office belongs to.
| Value | Description |
|---|---|
| Federal | President, US Senate, US House |
| Statewide | Governor, AG, SOS, state auditor, state treasurer |
| StateLegislature | State senate, state house/assembly |
| County | County commissioner, county clerk, coroner, sheriff |
| Municipal | Mayor, city council, town board |
| Judicial | All judicial offices (federal, state, county, municipal) |
| SchoolBoard | School board / board of education |
| SpecialDistrict | Soil and water, fire district, utility district, transit |
| Township | Township supervisor, township trustee |
| Other | Unclassifiable after all four classifier tiers |
Assigned by the four-tier classifier: keyword and regex at L1, embedding at L2, and LLM at L3. The Other rate is 0.56% on NC test data.
OfficeCategory
Finer-grained classification within an office level. One office level maps to many categories.
| Value | Description |
|---|---|
| Executive | President, governor, mayor, county executive |
| Legislative | US House, US Senate, state legislature, city council |
| Judicial | Judge, justice, magistrate |
| LawEnforcement | Sheriff, constable, marshal |
| FiscalOfficer | Treasurer, auditor, comptroller, tax collector |
| Clerk | County clerk, clerk of court, register of deeds |
| Education | School board, board of education, superintendent |
| PublicWorks | Soil and water, utility district, surveyor |
| Regulatory | Coroner, medical examiner, public service commission |
| PartyOffice | Precinct committee officer, party chair (when on ballot) |
| Other | Does not fit the above categories |
BallotMeasureType
Classifies ballot measures by their legal mechanism.
| Value | Description |
|---|---|
| BondIssue | Debt authorization (general obligation or revenue bond) |
| LevyRenewal | Property tax levy renewal |
| LevyNew | New property tax levy |
| ConstitutionalAmendment | State constitutional amendment |
| CharterAmendment | Municipal or county charter amendment |
| Referendum | Legislative referendum referred to voters |
| Initiative | Citizen-initiated ballot measure |
| Recall | Recall question for a specific officeholder |
| Other | Measure type not determinable from contest name |
PartySimplified
Normalized party affiliation. Preserves the most common parties as distinct values; collapses minor parties.
| Value | Description |
|---|---|
| Democrat | Democratic Party |
| Republican | Republican Party |
| Libertarian | Libertarian Party |
| Green | Green Party |
| Independent | Independent / no party affiliation |
| Nonpartisan | Nonpartisan contest (no party on ballot) |
| WriteIn | Write-in candidate (party unknown or not applicable) |
| Other | Any other party (Constitution, Working Families, Reform, etc.) |
Source mapping: MEDSL’s party_simplified column maps directly. NC SBE’s Choice Party codes: DEM → Democrat, REP → Republican, LIB → Libertarian, GRE → Green, UNA → Independent, blank → Nonpartisan. FEC codes DEM, REP, LIB, GRE, and IND map to the corresponding values; NNE → Nonpartisan.
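The NC SBE code mapping can be sketched as a single match. The function name is hypothetical, and routing any unlisted code to Other is an assumption; the listed codes are the ones given above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PartySimplified {
    Democrat,
    Republican,
    Libertarian,
    Green,
    Independent,
    Nonpartisan,
    WriteIn,
    Other,
}

/// Map an NC SBE Choice Party code to the normalized enum.
pub fn party_from_ncsbe(choice_party: &str) -> PartySimplified {
    match choice_party.trim().to_uppercase().as_str() {
        "DEM" => PartySimplified::Democrat,
        "REP" => PartySimplified::Republican,
        "LIB" => PartySimplified::Libertarian,
        "GRE" => PartySimplified::Green,
        "UNA" => PartySimplified::Independent,
        "" => PartySimplified::Nonpartisan, // blank code: no party on the ballot
        _ => PartySimplified::Other,        // assumption: unlisted codes collapse to Other
    }
}
```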
SourceType
Identifies the origin of a record. One value per data source file type.
| Value | Description |
|---|---|
| Medsl2018 | MEDSL 2018 precinct-level file |
| Medsl2020 | MEDSL 2020 precinct-level file |
| Medsl2022 | MEDSL 2022 precinct-level file |
| Ncsbe2014 | NC SBE 2014 general (15-column schema) |
| Ncsbe2016 | NC SBE 2016 general |
| Ncsbe2018 | NC SBE 2018 general |
| Ncsbe2020 | NC SBE 2020 general |
| Ncsbe2022 | NC SBE 2022 general |
| Ncsbe2024 | NC SBE 2024 general |
| NcsbeLegacy | NC SBE 2006–2012 (older schemas) |
| OpenElections | OpenElections CSV (any state) |
| ClarityXml | Clarity/Scytl ENR XML extract |
| VestShapefile | VEST precinct shapefile |
| CensusFips | Census Bureau FIPS reference file |
| FecCandidate | FEC candidate master file (cn.txt) |
| Manual | Manually entered or corrected record |
Each L1 record carries exactly one SourceType. When sources are merged at L3/L4, the provenance chain preserves the original SourceType for every contributing record.
ExtractionMethod
How a field value was obtained from the source.
| Value | Description |
|---|---|
| Direct | Value copied directly from a source column |
| Parsed | Value extracted by parsing a combined field (e.g., name decomposition) |
| Derived | Value computed from other fields (e.g., vote share from votes/total) |
| Enriched | Value added from a reference source (e.g., FIPS code from Census lookup) |
| Inferred | Value inferred by model (embedding similarity or LLM) |
Confidence
The verification level assigned to a record at L4.
| Value | Criteria |
|---|---|
| High | Confirmed by two or more independent sources with matching vote totals |
| Medium | Single source, certified state data or academic curated source |
| Low | Single source, community curated or unverified; or match confidence below threshold |
Confidence is assigned per-record, not per-source. A record from MEDSL that is corroborated by NC SBE receives High. A record from MEDSL with no second source receives Medium. A record from OpenElections with schema inconsistencies receives Low.
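The per-record rules could be encoded as below. This is a sketch under assumptions: SourceTier and the parameter names are hypothetical, and the rule ordering (schema issues or a below-threshold match force Low even when sources agree) is one plausible reading of the table, not a confirmed rule.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Confidence {
    High,
    Medium,
    Low,
}

/// Hypothetical grouping of sources by how they are curated.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceTier {
    CertifiedState,
    AcademicCurated,
    CommunityCurated,
}

/// `agreeing_sources` counts independent sources (including this one)
/// that report matching vote totals for the record.
pub fn assign_confidence(
    tier: SourceTier,
    agreeing_sources: usize,
    has_schema_issues: bool,
    match_below_threshold: bool,
) -> Confidence {
    // Assumption: data-quality problems override corroboration.
    if has_schema_issues || match_below_threshold {
        return Confidence::Low;
    }
    if agreeing_sources >= 2 {
        return Confidence::High;
    }
    match tier {
        SourceTier::CertifiedState | SourceTier::AcademicCurated => Confidence::Medium,
        SourceTier::CommunityCurated => Confidence::Low,
    }
}
```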
ClassifierMethod
Which tier of the office classifier produced the office level and category.
| Value | Description |
|---|---|
| Keyword | Matched a keyword or keyword phrase (e.g., “SHERIFF” → LawEnforcement) |
| Regex | Matched a regex pattern (e.g., DISTRICT \d+ for legislative districts) |
| Embedding | Classified by nearest-neighbor embedding similarity at L2 |
| Llm | Classified by LLM at L3 after embedding was ambiguous |
Records carry the method so downstream consumers can filter by classifier reliability. Keyword and Regex are deterministic and reproducible. Embedding and Llm depend on model versions.
GeoMatchMethod
How a geographic identifier was resolved.
| Value | Description |
|---|---|
| FipsExact | FIPS code present in source and matched Census reference exactly |
| NameExact | Geographic name matched Census reference exactly (case-insensitive) |
| NameFuzzy | Geographic name matched after fuzzy normalization (e.g., “ST. LOUIS” → “St. Louis”) |
| OcdLookup | Matched via Open Civic Data identifier |
| Unresolved | Could not be matched to a canonical geographic entity |
Most MEDSL records resolve via FipsExact (the source provides county_fips). NC SBE records resolve via NameExact after uppercasing the county name. OpenElections records frequently require NameFuzzy due to inconsistent county name formatting.
Crate Overview
The election-aggregation crate is both a Rust library (election_aggregation) and a command-line binary (election-aggregation). The library provides types, parsers, and pipeline logic. The binary provides the CLI entry point.
Crate Configuration
From Cargo.toml:
| Field | Value |
|---|---|
| Edition | 2024 |
| rust-version | 1.93 |
| Library name | election_aggregation |
| Binary name | election-aggregation |
| License | MIT OR Apache-2.0 |
The library is published as election_aggregation (underscored, per Rust convention). The binary is election-aggregation (hyphenated, per CLI convention). Both are defined in the same crate.
Module Structure
src/
├── lib.rs # Library root — re-exports all public modules
├── main.rs # Binary entry point — CLI dispatch
├── schema/
│ └── mod.rs # Unified record types, enums, and field definitions
├── sources/
│ ├── mod.rs # Source registry and SourceParser trait
│ ├── medsl.rs # MEDSL parser (25-column CSV/TSV)
│ ├── ncsbe.rs # NC SBE parser (15-column tab-delimited)
│ ├── openelections.rs # OpenElections parser (variable CSV)
│ ├── clarity.rs # Clarity/Scytl XML parser
│ ├── vest.rs # VEST shapefile parser (column decoding)
│ ├── census.rs # Census FIPS reference file loader
│ └── fec.rs # FEC candidate master file parser
└── pipeline/
    ├── mod.rs # Layer sequencing and orchestration
    ├── l0.rs # Raw acquisition (byte-identical storage + manifest)
    ├── l1.rs # Deterministic parsing and enrichment
    ├── l2.rs # Embedding generation (text-embedding-3-large)
    ├── l3.rs # Entity resolution (cascade: exact → Jaro-Winkler → embedding → LLM)
    └── l4.rs # Canonical name assignment, temporal chains, verification
Three top-level modules, each with a clear responsibility:
- schema: Defines the unified record types that all sources normalize into. Contains ContestKind, CandidateName, VoteCountsByType, all enumerations, and the layer-specific record structs (L1Record through L4Record). No I/O, no parsing logic.
- sources: One submodule per data source. Each submodule documents the source schema, implements parsing from the source format into L1 records, and catalogs known data quality issues. The parent mod.rs defines the SourceParser trait that all sources implement.
- pipeline: One submodule per layer. Each layer reads its parent layer’s JSONL output and writes its own. l0 handles acquisition. l1 calls into source parsers. l2 batches embedding API calls. l3 batches LLM calls. l4 builds the entity graph.
Library vs. Binary
The library (src/lib.rs) exposes three public modules:
pub mod sources;
pub mod pipeline;
pub mod schema;
External crates can depend on election_aggregation to use the types and parsers without the CLI. The binary (src/main.rs) imports the library and wires it to CLI argument parsing.
The current binary prints usage information and a pointer to the documentation. CLI subcommands (process, embed, match, canonicalize, verify, sources) are planned but not yet implemented — see CLI Reference.
Dependencies
The Cargo.toml currently declares no runtime dependencies. As pipeline layers are implemented, expected dependencies include:
| Crate | Purpose |
|---|---|
| serde + serde_json | JSONL serialization/deserialization |
| csv | CSV/TSV parsing for MEDSL, NC SBE, OpenElections |
| sha2 | SHA-256 hashing for the provenance chain |
| clap | CLI argument parsing |
| reqwest | HTTP client for embedding and LLM API calls |
| tokio | Async runtime for batched API calls (L2, L3) |
The release profile enables LTO, single codegen unit, and symbol stripping for minimal binary size.
Build
cargo build --release
./target/release/election-aggregation
Minimum supported Rust version is 1.93, which is above the 1.85 minimum that edition 2024 requires.
Type System Design
The Rust type system enforces pipeline invariants at compile time. Records from different layers are different types. Contest kinds are an enum, not a string. Candidate names are a struct, not a String. Source-specific raw fields are typed per source. These choices eliminate categories of bugs that would otherwise surface at runtime — or worse, silently corrupt output.
Layer-Typed Records
Each pipeline layer has its own record type. You cannot pass an L1 record to a function that expects L2, or accidentally mix L3 and L4 records in the same collection.
pub struct L0Record {
    pub raw_bytes: PathBuf,
    pub manifest: AcquisitionManifest,
}

pub struct L1Record {
    pub election: Election,
    pub jurisdiction: Jurisdiction,
    pub contest: Contest,
    pub results: Vec<CandidateResult>,
    pub turnout: Option<Turnout>,
    pub source: SourceMetadata,
    pub provenance: Provenance,
}

pub struct L2Record {
    pub l1: L1Record,
    pub candidate_name_embedding: Vec<f32>,
    pub contest_name_embedding: Vec<f32>,
    pub jurisdiction_embedding: Vec<f32>,
    pub embedding_model: String,
    pub embedding_version: String,
}

pub struct L3Record {
    pub l2: L2Record,
    pub candidate_cluster_id: ClusterId,
    pub contest_cluster_id: ClusterId,
    pub match_confidence: f64,
    pub match_method: MatchMethod,
}

pub struct L4Record {
    pub l3: L3Record,
    pub canonical_candidate_name: CandidateName,
    pub canonical_contest_name: String,
    pub temporal_chain_id: Option<ChainId>,
    pub verification_status: VerificationStatus,
}
Each layer wraps the previous layer’s record. An L3Record contains an L2Record which contains an L1Record. This nesting means every L4 record carries the full history back to L1. The compiler enforces that you cannot construct an L3Record without first having an L2Record — you cannot skip layers.
What the compiler prevents
- Mixing layers in a collection. Vec<L1Record> and Vec<L2Record> are different types. A function that processes L2 records cannot accidentally receive L1 records.
- Accessing fields that don’t exist yet. An L1 record has no candidate_cluster_id. Attempting to access it is a compile error, not a null pointer or missing key at runtime.
- Skipping pipeline stages. You cannot construct an L3Record without providing an L2Record. The type system encodes the dependency chain.
ContestKind Enum
The ContestKind enum separates three fundamentally different record types that sources mix together in the same file.
pub enum ContestKind {
    CandidateRace {
        results: Vec<CandidateResult>,
    },
    BallotMeasure {
        choices: Vec<BallotChoice>,
        measure_type: BallotMeasureType,
        passage_threshold: Option<f64>,
    },
    TurnoutMetadata {
        registered_voters: Option<u64>,
        ballots_cast: Option<u64>,
    },
}
What the compiler prevents
- Treating “For” as a person name. The BallotMeasure variant has choices: Vec<BallotChoice>, not results: Vec<CandidateResult>. A BallotChoice has a choice_text: String field, not a CandidateName struct. There is no code path where “For” enters the name decomposition logic.
- Embedding turnout metadata. L2 pattern-matches on ContestKind and only computes embeddings for CandidateRace variants. TurnoutMetadata records pass through without embedding. This is enforced by the match arms: the compiler requires all three variants to be handled.
- Mixing candidate results with ballot choices. You cannot push a BallotChoice into a Vec<CandidateResult>. They are different types.
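The exhaustive-match guarantee can be shown with a reduced version of the types. This is a sketch: the structs are simplified (candidate_name is a plain String, and measure_type and passage_threshold are omitted), and texts_to_embed is a hypothetical function, not the L2 implementation.

```rust
pub struct CandidateResult {
    pub candidate_name: String, // simplified; the real field is a CandidateName struct
    pub votes_total: u64,
}

pub struct BallotChoice {
    pub choice_text: String,
    pub votes_total: u64,
}

pub enum ContestKind {
    CandidateRace { results: Vec<CandidateResult> },
    BallotMeasure { choices: Vec<BallotChoice> },
    TurnoutMetadata { registered_voters: Option<u64>, ballots_cast: Option<u64> },
}

/// Texts that an embedding step would receive: candidate names only.
pub fn texts_to_embed(kind: &ContestKind) -> Vec<&str> {
    // No wildcard arm: adding a fourth variant makes this match fail to compile
    // until the new case is handled explicitly.
    match kind {
        ContestKind::CandidateRace { results } => {
            results.iter().map(|r| r.candidate_name.as_str()).collect()
        }
        ContestKind::BallotMeasure { .. } => Vec::new(),
        ContestKind::TurnoutMetadata { .. } => Vec::new(),
    }
}
```

The guarantee holds only as long as the match avoids a catch-all `_` arm, which is why the two pass-through variants are spelled out.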
CandidateName Struct
Candidate names are a struct with seven fields, not a String. This is documented in detail in Candidate Name Components. The Rust definition:
pub struct CandidateName {
    pub raw: String,
    pub first: Option<String>,
    pub middle: Option<String>,
    pub last: Option<String>,
    pub suffix: Option<String>,
    pub nickname: Option<String>,
    pub canonical_first: Option<String>,
}
What the compiler prevents
- Passing a raw name string where a parsed name is expected. Functions that perform entity resolution take &CandidateName, not &str. You cannot call them with the raw string; you must parse first.
- Forgetting to preserve the raw name. The raw field is a required String, not Option<String>. Every CandidateName carries the original source text.
- Confusing nickname with first name. They are separate fields. Code that constructs a composite embedding string uses canonical_first, middle, last, and suffix, never raw and never nickname on its own.
SourceRawFields Enum
Every L1 record preserves the original source columns in a typed enum. Each source has its own variant with its own struct.
pub enum SourceRawFields {
    Medsl(MedslRawRecord),
    Ncsbe(NcsbeRawRecord),
    OpenElections(OpenElectionsRawRecord),
    Vest(VestRawRecord),
    Clarity(ClarityRawRecord),
    Fec(FecRawRecord),
    Census(CensusRawRecord),
}

pub struct MedslRawRecord {
    pub year: i32,
    pub state: String,
    pub state_po: String,
    pub state_fips: String,
    pub state_cen: String,
    pub state_ic: String,
    pub office: String,
    pub county_name: String,
    pub county_fips: String,
    pub jurisdiction_name: String,
    pub jurisdiction_fips: String,
    pub candidate: String,
    pub district: String,
    pub dataverse: String,
    pub stage: String,
    pub special: String,
    pub writein: String,
    pub mode: String,
    pub totalvotes: String,
    pub candidatevotes: String,
    pub version: String,
    pub readme_check: String,
    pub magnitude: Option<i32>,
    pub party_detailed: String,
    pub party_simplified: String,
}

pub struct NcsbeRawRecord {
    pub county: String,
    pub election_date: String,
    pub precinct_code: String,
    pub precinct_name: String,
    pub contest_group_id: String,
    pub contest_type: String,
    pub contest_name: String,
    pub choice: String,
    pub choice_party: String,
    pub vote_for: i32,
    pub election_day: i64,
    pub one_stop: i64,
    pub absentee_by_mail: i64,
    pub provisional: i64,
    pub total_votes: i64,
}
What the compiler prevents
- Accessing a field that doesn’t exist for a source. MEDSL has no vote_for column. NC SBE has no dataverse column. The struct types enforce this. If you have a NcsbeRawRecord, you can access vote_for. If you have a MedslRawRecord, you cannot, because the field does not exist on the type.
- Losing source-specific fields during normalization. The SourceRawFields enum is a required field on SourceMetadata. The compiler forces every parser to populate it. No source’s original columns are silently dropped.
- Confusing source schemas. Pattern matching on SourceRawFields requires handling each variant. Code that needs MEDSL-specific logic matches on SourceRawFields::Medsl(ref raw) and gets a MedslRawRecord with the correct field types.
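A reduced sketch of that pattern matching, assuming just two variants and two fields per struct (the real types carry every source column). The function names seats_to_fill and raw_candidate are hypothetical.

```rust
// Trimmed stand-ins for the real per-source structs.
pub struct MedslRawRecord {
    pub candidate: String,
    pub dataverse: String,
}

pub struct NcsbeRawRecord {
    pub choice: String,
    pub vote_for: u32,
}

pub enum SourceRawFields {
    Medsl(MedslRawRecord),
    Ncsbe(NcsbeRawRecord),
}

/// `vote_for` exists only on the NC SBE record, so every other variant
/// has to be handled explicitly; the match is checked for exhaustiveness.
pub fn seats_to_fill(raw: &SourceRawFields) -> Option<u32> {
    match raw {
        SourceRawFields::Ncsbe(r) => Some(r.vote_for),
        SourceRawFields::Medsl(_) => None,
    }
}

/// Each source names its candidate column differently; the match maps
/// both onto one accessor.
pub fn raw_candidate(raw: &SourceRawFields) -> &str {
    match raw {
        SourceRawFields::Medsl(r) => &r.candidate,
        SourceRawFields::Ncsbe(r) => &r.choice,
    }
}
```

Adding a variant to the enum makes both functions fail to compile until the new source is handled, which is the drift protection described above.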
Other Type-Level Guarantees
ClusterId and ChainId are newtypes, not raw strings. They wrap a String but are distinct types. You cannot accidentally pass a ClusterId where a ChainId is expected.
pub struct ClusterId(pub String);
pub struct ChainId(pub String);
Confidence, MatchMethod, and VerificationStatus are enums, not strings. The set of valid values is fixed at compile time.
pub enum Confidence { High, Medium, Low }
pub enum MatchMethod { Deterministic, Embedding, LlmConfirmed }
pub enum VerificationStatus { MultiSourceConfirmed, LlmConfirmed, SingleSourceUnverified }
Vote counts are u64, not String. Source files sometimes contain non-integer vote values (0.1% of MEDSL 2022). These are caught during L1 parsing and quarantined — they never enter the typed record as a string that downstream code must re-parse.
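A minimal sketch of the parse-or-quarantine step for vote counts. The error type and function name are assumptions made for illustration; the project's actual ParseError may carry more context.

```rust
/// Hypothetical error for a vote field that is not a non-negative integer.
#[derive(Debug, PartialEq)]
pub enum VoteParseError {
    NotAnInteger(String),
}

/// Converts a raw vote field to `u64`. Values such as "12.5" or "-1"
/// are rejected here, so they can be quarantined instead of reaching
/// the typed record as strings.
pub fn parse_votes(field: &str) -> Result<u64, VoteParseError> {
    field
        .trim()
        .parse::<u64>()
        .map_err(|_| VoteParseError::NotAnInteger(field.to_string()))
}
```

The error keeps the original field text so a quarantine entry can show exactly what the source contained.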
Design Tradeoffs
Nesting vs. flattening. L4Record contains L3Record contains L2Record contains L1Record. This means an L4 record is large — it carries the full history. The alternative (separate storage with ID references) would reduce memory per record but require joins to reconstruct provenance. We chose nesting because provenance integrity is a core requirement: every L4 record must be independently verifiable without external lookups.
Per-source structs vs. generic key-value map. Storing raw fields as HashMap<String, String> would be simpler to implement and would handle any source without code changes. We chose per-source structs because the fields are known at development time, and type safety catches schema drift (a renamed column breaks compilation, not data). The cost is that adding a new source requires defining a new struct and a new enum variant.
Option fields vs. separate types per completeness level. Many fields are Option<String> because not all sources provide them. An alternative design would define separate types for “fully populated” and “partially populated” records. We chose Option because the partially-populated case is the norm, not the exception — fewer than 5% of records have turnout data, and zero records have all fields populated.
The SourceParser Trait
Every data source in the pipeline implements a single trait: SourceParser. This trait defines the contract between source-specific parsing logic and the generic pipeline infrastructure. Adding a new source means implementing one trait.
Trait definition
pub trait SourceParser {
    /// The raw record type specific to this source.
    type RawRecord;

    /// Parse the source file into an iterator of raw records.
    ///
    /// This reads bytes from L0 and produces typed records that
    /// preserve every column from the source. No normalization
    /// occurs here, only deserialization.
    ///
    /// The returned iterator borrows `l0_bytes`, so the boxed trait
    /// object carries the lifetime `'a` instead of the default `'static`.
    fn parse<'a>(
        &self,
        l0_bytes: &'a [u8],
    ) -> Box<dyn Iterator<Item = Result<Self::RawRecord, ParseError>> + 'a>;

    /// Convert a single raw record into an L1 record.
    ///
    /// This is where normalization happens: name decomposition,
    /// party normalization, FIPS enrichment, contest kind
    /// classification, and hash computation.
    fn to_l1(&self, raw: Self::RawRecord) -> Result<L1Record, TransformError>;

    /// Source metadata for provenance tracking.
    fn source_type(&self) -> SourceType;
}
The trait uses an associated type, RawRecord, rather than a generic parameter, so each implementation names exactly one raw record type. Each source defines its own raw record struct matching the source schema column-for-column. MEDSL has a 25-field MedslRawRecord. NC SBE has a 15-field NcsbeRawRecord. This prevents cross-source field access at compile time.
How the pipeline uses the trait
The pipeline is generic over SourceParser. Each layer invokes the trait methods without knowing which source it is processing:
fn process_l0_to_l1<'a, S: SourceParser>(
    source: &'a S,
    l0_artifact: &'a L0Artifact,
) -> impl Iterator<Item = Result<L1Record, PipelineError>> + 'a {
    let raw_records = source.parse(&l0_artifact.bytes);
    // `?` assumes PipelineError implements From<ParseError> and From<TransformError>.
    raw_records.map(move |raw_result| {
        let raw = raw_result?;
        let l1 = source.to_l1(raw)?;
        Ok(l1)
    })
}
Records are processed one at a time as an iterator. The full file is never loaded into memory as a collection of parsed records. This enables processing multi-gigabyte source files (MEDSL’s 2020 dataset is 13.2M rows) with bounded memory.
NC SBE implementation sketch
The NC SBE source illustrates what a concrete implementation looks like. NC SBE files are tab-delimited with 15 columns (2014–2024 schema).
The raw record preserves all source columns:
pub struct NcsbeRawRecord {
    pub county: String,
    pub election_date: String,
    pub precinct_code: String,
    pub precinct_name: String,
    pub contest_group_id: String,
    pub contest_type: String, // "S" = statewide, "C" = county/local
    pub contest_name: String,
    pub choice: String,
    pub choice_party: String,
    pub vote_for: u32,
    pub election_day: u64,
    pub one_stop: u64,
    pub absentee_by_mail: u64,
    pub provisional: u64,
    pub total_votes: u64,
}
The parse method handles tab splitting and type conversion:
use std::io::{BufRead, BufReader};

impl SourceParser for NcsbeSource {
    type RawRecord = NcsbeRawRecord;

    fn parse<'a>(
        &self,
        l0_bytes: &'a [u8],
    ) -> Box<dyn Iterator<Item = Result<NcsbeRawRecord, ParseError>> + 'a> {
        let reader = BufReader::new(l0_bytes);
        // skip(1) drops the header row; `?` assumes ParseError: From<std::io::Error>.
        Box::new(reader.lines().skip(1).map(|line| {
            let line = line?;
            let fields: Vec<&str> = line.split('\t').collect();
            // ... field extraction and type conversion
            Ok(NcsbeRawRecord { /* ... */ })
        }))
    }

    fn to_l1(&self, raw: NcsbeRawRecord) -> Result<L1Record, TransformError> {
        // 1. Classify contest kind
        let kind = classify_contest(&raw.contest_name, &raw.choice);
        // 2. Decompose candidate name
        let name = decompose_name_ncsbe(&raw.choice);
        // 3. Build vote counts from the four mode columns
        let vote_counts = VoteCountsByType {
            election_day: Some(raw.election_day),
            early: Some(raw.one_stop),
            absentee_mail: Some(raw.absentee_by_mail),
            provisional: Some(raw.provisional),
        };
        // 4. Determine office level from Contest Type
        let office_level = match raw.contest_type.as_str() {
            "S" => classify_statewide_office(&raw.contest_name),
            "C" => classify_local_office(&raw.contest_name),
            _ => OfficeLevel::Other,
        };
        // 5. Build provenance
        let l1_hash = compute_hash(&raw);
        Ok(L1Record { /* ... */ })
    }

    fn source_type(&self) -> SourceType {
        SourceType::Ncsbe2022
    }
}
Key points in the NC SBE to_l1 implementation:
- Vote mode columns map directly. NC SBE is the only source where all four mode fields (election_day, one_stop, absentee_by_mail, provisional) are always present. No row-level aggregation is needed, unlike MEDSL where modes are separate rows.
- Contest Type drives office classification. The C/S flag tells us immediately whether a race is local or statewide, reducing the keyword classifier’s job.
- Name decomposition uses NC SBE conventions. Nicknames are in parentheses (not double quotes as in MEDSL). Suffixes follow commas. The parser for NC SBE and the parser for MEDSL call different name-parsing functions.
Adding a new source
To add a new source (e.g., a state portal for Ohio):
- Define OhioRawRecord with fields matching the source schema.
- Implement SourceParser for OhioSource.
- Write parse to handle the source format (CSV, TSV, XML, JSON).
- Write to_l1 to normalize names, classify contests, enrich FIPS codes, and compute hashes.
- Add the source to the SourceType enum.
The pipeline infrastructure — streaming, partitioning, JSONL serialization, hash chaining — is reused without modification. The only new code is the source-specific parsing and normalization logic in the trait implementation.
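The first three steps might look like the sketch below. Everything Ohio-specific here is invented for illustration: the three-column schema, the comma-separated layout, and the ParseError shape are assumptions, and the trait is a cut-down stand-in that keeps only parse (with an explicit lifetime so the iterator can borrow the input bytes).

```rust
#[derive(Debug)]
pub struct ParseError(pub String);

/// Cut-down stand-in for SourceParser: `to_l1` and `source_type` are omitted.
pub trait SourceParser {
    type RawRecord;
    fn parse<'a>(
        &self,
        l0_bytes: &'a [u8],
    ) -> Box<dyn Iterator<Item = Result<Self::RawRecord, ParseError>> + 'a>;
}

/// Hypothetical Ohio schema; these three columns are assumptions.
#[derive(Debug, PartialEq)]
pub struct OhioRawRecord {
    pub county: String,
    pub candidate: String,
    pub votes: u64,
}

pub struct OhioSource;

impl SourceParser for OhioSource {
    type RawRecord = OhioRawRecord;

    fn parse<'a>(
        &self,
        l0_bytes: &'a [u8],
    ) -> Box<dyn Iterator<Item = Result<OhioRawRecord, ParseError>> + 'a> {
        // Invalid UTF-8 becomes a single error item instead of a panic.
        let text = match std::str::from_utf8(l0_bytes) {
            Ok(t) => t,
            Err(e) => {
                let err = Err::<OhioRawRecord, _>(ParseError(e.to_string()));
                return Box::new(std::iter::once(err));
            }
        };
        // Naive comma split: no quoted-field handling, unlike a real CSV reader.
        Box::new(text.lines().skip(1).map(|line| {
            let fields: Vec<&str> = line.split(',').collect();
            if fields.len() != 3 {
                return Err(ParseError(format!("expected 3 columns: {line}")));
            }
            let votes = fields[2]
                .trim()
                .parse::<u64>()
                .map_err(|_| ParseError(format!("non-integer votes: {line}")))?;
            Ok(OhioRawRecord {
                county: fields[0].to_string(),
                candidate: fields[1].to_string(),
                votes,
            })
        }))
    }
}
```

A bad row yields an Err item and iteration continues, which is what lets the pipeline quarantine it without stopping.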
Error handling
Both parse and to_l1 return Result. Errors are not fatal. A row that fails to parse (malformed TSV, non-integer vote count, encoding issue) produces an error that the pipeline routes to a quarantine log. Processing continues with the next row.
MEDSL’s votes column contains 12,782 non-integer values out of 12.3M rows (0.1%) in 2022. These rows are quarantined at parse time, logged with the source file name and row number, and excluded from L1 output. The quarantine log is itself a JSONL file, enabling post-processing review.
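The routing described above can be sketched as a function that separates successes from quarantined rows. QuarantineEntry and split_quarantine are hypothetical names, and the sketch collects into vectors for brevity where the pipeline would write each side to its JSONL file as it goes.

```rust
use std::fmt::Display;

/// Hypothetical quarantine log entry: source file, data-row position, message.
#[derive(Debug, PartialEq)]
pub struct QuarantineEntry {
    pub source_file: String,
    /// 1-based position among the data rows (header not counted).
    pub row_number: usize,
    pub error: String,
}

/// Consumes a stream of per-row results. Errors are recorded and skipped;
/// they never abort processing of the rows that follow.
pub fn split_quarantine<T, E: Display>(
    source_file: &str,
    results: impl Iterator<Item = Result<T, E>>,
) -> (Vec<T>, Vec<QuarantineEntry>) {
    let mut accepted = Vec::new();
    let mut quarantined = Vec::new();
    for (i, result) in results.enumerate() {
        match result {
            Ok(record) => accepted.push(record),
            Err(e) => quarantined.push(QuarantineEntry {
                source_file: source_file.to_string(),
                row_number: i + 1,
                error: e.to_string(),
            }),
        }
    }
    (accepted, quarantined)
}
```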
Pipeline Execution
The pipeline processes records through five layers in strict order: L0 → L1 → L2 → L3 → L4. Each layer reads its parent’s JSONL output and writes its own. No layer skips its predecessor.
Streaming Processing
Records are processed one at a time. The pipeline never loads an entire layer’s output into memory. Each layer reads a line from its input JSONL, transforms it, and writes a line to its output JSONL. This keeps memory usage proportional to a single record, not to the dataset size.
For a 42M-row corpus, this is not optional. Loading 12.3M MEDSL 2022 rows into memory as deserialized structs would require tens of gigabytes. Streaming keeps the resident set under 500 MB for L0 → L1 and L1 → L2.
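A minimal sketch of the read-transform-write loop, assuming nothing about the record schema: the transform is an arbitrary closure over one line, and stream_lines is a hypothetical name. Only one line is held in memory at a time.

```rust
use std::io::{self, BufRead, Write};

/// Reads `input` line by line, applies `transform`, and writes each
/// produced line to `output`. Returning `None` from `transform` drops
/// the line (for example, a record routed to quarantine elsewhere).
/// Returns the number of lines written.
pub fn stream_lines<R: BufRead, W: Write>(
    input: R,
    mut output: W,
    mut transform: impl FnMut(&str) -> Option<String>,
) -> io::Result<usize> {
    let mut written = 0;
    for line in input.lines() {
        let line = line?; // one line in memory, then dropped
        if let Some(out) = transform(&line) {
            writeln!(output, "{out}")?;
            written += 1;
        }
    }
    Ok(written)
}
```

In practice the reader would be a BufReader over the parent layer's JSONL file and the writer a BufWriter over the output file; both are generic here so the loop can be exercised against in-memory buffers.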
Partitioning
All processing is partitioned by state and year. Each partition is an independent unit of work:
l1/NC/2022/medsl.jsonl
l1/NC/2022/ncsbe.jsonl
l1/FL/2022/medsl.jsonl
l1/FL/2022/openelections.jsonl
Partitioning enables:
- Incremental processing. Re-running L1 for North Carolina does not require re-processing Texas.
- Parallelism. Independent partitions can be processed concurrently.
- Bounded working sets. L4’s entity graph (which does require in-memory state) is scoped to one state-year at a time rather than the full corpus.
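A small sketch of how a partition's output path could be derived from its key, following the layer/state/year/source layout listed above. The function name and parameter types are assumptions.

```rust
use std::path::PathBuf;

/// Builds `{layer}/{state_po}/{year}/{source}.jsonl`, the layout used in
/// the partition listing above.
pub fn partition_path(layer: &str, state_po: &str, year: u16, source: &str) -> PathBuf {
    PathBuf::from(layer)
        .join(state_po)
        .join(year.to_string())
        .join(format!("{source}.jsonl"))
}

/// Enumerates every (state, year) unit of work; each pair can be processed
/// independently of the others.
pub fn work_units(states: &[&str], years: &[u16]) -> Vec<(String, u16)> {
    let mut units = Vec::with_capacity(states.len() * years.len());
    for state in states {
        for year in years {
            units.push((state.to_string(), *year));
        }
    }
    units
}
```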
Layer-Specific Execution
L0 → L1: Deterministic, Single-Record
Each source row is parsed independently. No row depends on any other row. This is purely CPU-bound — no network calls, no model inference. On a single core, L1 processes approximately 200,000 MEDSL rows per second.
L1 → L2: Batched Embedding API Calls
L2 generates embeddings using text-embedding-3-large. The OpenAI embedding API accepts batches of up to 2,048 inputs per request. L2 accumulates records into batches of 256 (configurable), constructs composite strings from name components and contest fields, sends the batch to the API, and attaches the returned vectors to each record.
Batching amortizes HTTP overhead. At 256 records per batch, the 12.3M-row MEDSL 2022 corpus requires approximately 48,000 API calls in total, spread across its state-year partitions. Rate limiting and retry logic are handled at this layer.
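The call-count arithmetic is a ceiling division, sketched here together with the slice chunking that would produce the batches. Both function names are hypothetical; the batch size of 256 and the 12.3M-row figure come from the text above.

```rust
/// Number of API requests needed to embed `records` inputs when each
/// request carries at most `batch_size` inputs.
pub fn batch_count(records: usize, batch_size: usize) -> usize {
    assert!(batch_size > 0, "batch_size must be positive");
    records.div_ceil(batch_size)
}

/// Splits records into consecutive batches; the final batch may be short.
pub fn batches<T>(records: &[T], batch_size: usize) -> std::slice::Chunks<'_, T> {
    records.chunks(batch_size)
}
```

For 12,300,000 records at 256 per batch this gives 48,047 requests, matching the approximate figure quoted above.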
Embedding vectors are written as .npy binary sidecar files, not inline in JSONL. The JSONL record carries a reference (file path + offset) to the corresponding vector. This keeps JSONL files human-readable and text-diffable.
L2 → L3: Batched LLM Calls
L3 performs entity resolution in three tiers. The first tier (deterministic blocking) and second tier (embedding nearest-neighbor) require no API calls. The third tier sends ambiguous candidate pairs to Claude Sonnet for confirmation.
LLM calls are batched per contest cluster — all ambiguous pairs within a single contest are sent in one structured prompt. This reduces call count and provides the LLM with full context (all candidates, all name variants, the office title, the jurisdiction).
The deterministic tier resolves 70%+ of records. The embedding tier resolves most of the remainder. LLM calls are made for approximately 5–10% of entity resolution decisions, concentrated on cases where name similarity is 0.85–0.92.
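One way to picture the hand-off between the embedding tier and the LLM tier is a threshold band. The sketch below is an assumption-heavy illustration: the text only says LLM calls concentrate on similarities of 0.85 to 0.92, so treating 0.92 and above as auto-accept and everything under 0.85 as "below band" is a guess, and the real pipeline may route some low-similarity pairs (nickname cases, for example) to the LLM using other context.

```rust
/// Hypothetical outcome of the embedding tier for one candidate pair.
#[derive(Debug, PartialEq)]
pub enum Decision {
    AutoAccept,
    AskLlm,
    BelowBand,
}

// Band edges taken from the 0.85-0.92 range quoted above; whether each
// edge is inclusive is an assumption.
pub const LLM_BAND_LOW: f32 = 0.85;
pub const LLM_BAND_HIGH: f32 = 0.92;

/// Routes a pair by cosine similarity of its composite embeddings.
pub fn route(similarity: f32) -> Decision {
    if similarity >= LLM_BAND_HIGH {
        Decision::AutoAccept
    } else if similarity >= LLM_BAND_LOW {
        Decision::AskLlm
    } else {
        Decision::BelowBand
    }
}
```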
L3 → L4: In-Memory Entity Graph
L4 is the exception to the streaming rule. Building temporal chains (linking the same candidate across election cycles) and selecting canonical names requires the full entity graph for a partition in memory. For a single state, this graph typically contains 10,000–50,000 entity nodes.
L4 loads all L3 records for one state-year partition, constructs the candidate and contest entity graphs, assigns canonical names, builds temporal chain links, runs verification checks against the hash chain, and writes the final L4 JSONL and CSV outputs.
Memory usage scales with the number of unique entities in a partition, not the number of rows. North Carolina (the largest single-state partition due to NC SBE’s 10 cycles) peaks at approximately 2 GB for the entity graph.
Error Handling
Each layer writes a quarantine log alongside its output JSONL. Records that fail parsing, embedding, or matching are written to the quarantine file with a structured error message. They do not block processing of subsequent records.
Quarantine files follow the naming convention:
l1/NC/2022/medsl.quarantine.jsonl
Each quarantine entry contains the original record (or as much as could be parsed), the error type, and the error message. Quarantine rates by layer:
| Layer | Typical quarantine rate | Common causes |
|---|---|---|
| L1 | 0.1% | Non-integer vote values, unparseable names, encoding errors |
| L2 | <0.01% | API timeouts (retried), embedding dimension mismatch |
| L3 | 1–3% | Ambiguous matches below confidence threshold |
| L4 | <0.1% | Hash chain verification failures |
Output Format: JSONL and CSV Export
The pipeline writes JSONL at every layer. JSONL is the canonical format — it is the source of truth for every record at every stage. L4 additionally exports flat CSV for spreadsheet users. Embedding vectors at L2 are stored as .npy binary sidecars alongside the JSONL.
JSONL — Canonical at Every Layer
Every pipeline layer (L1 through L4) writes its output as JSONL: one JSON object per line, one file per state/year partition.
File naming convention:
{layer}/{state_po}/{year}.jsonl
Examples:
| Path | Contents |
|---|---|
| l1/NC/2022.jsonl | All L1 cleaned records for North Carolina 2022 |
| l2/NC/2022.jsonl | L2 records with embedding metadata (vectors stored separately) |
| l3/NC/2022.jsonl | L3 records with entity resolution cluster IDs |
| l4/NC/2022.jsonl | L4 canonical records with verification status |
Properties:
- One record per line. Each line is a complete, self-contained JSON object. No multi-line formatting.
- Streamable. Consumers can process records one at a time without loading the full file into memory.
- Appendable. New records are concatenated to the end of the file. Existing lines are never modified.
- Serialized with serde_json. All Rust types implement Serialize and Deserialize via serde. Field names in JSON match the Rust struct field names exactly.
A single JSONL line for an L1 record contains all seven schema sections (election, jurisdiction, contest, results, turnout, source, provenance) as top-level keys. Null fields are included explicitly rather than omitted, so every record has the same set of keys.
Embedding Vectors — .npy Sidecars
Embedding vectors generated at L2 are not stored inside the JSONL records. A 3072-dimensional f32 vector (text-embedding-3-large output) occupies 12,288 bytes (3072 × 4) in binary. Stored as a JSON array of floats, the same vector would take roughly three times as many bytes.
Instead, vectors are written as NumPy .npy binary files alongside the JSONL:
| File | Contents |
|---|---|
| l2/NC/2022.jsonl | L2 records with embedding_model, embedding_version, and vector array index |
| l2/NC/2022_candidate_name.npy | Dense matrix: one row per record, 3072 columns |
| l2/NC/2022_contest_name.npy | Dense matrix for contest name embeddings |
| l2/NC/2022_jurisdiction.npy | Dense matrix for jurisdiction embeddings |
Each JSONL record at L2 contains an embedding_index field (integer) that identifies which row of the .npy matrix corresponds to that record. The .npy format is a simple binary header followed by contiguous f32 values — readable by NumPy, PyTorch, and any tool that understands the format.
The .npy files are written once and never modified. Re-embedding with a different model version produces new files with a version suffix (e.g., 2022_candidate_name_v2.npy).
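To make the sidecar layout concrete, here is a std-only sketch that writes a version 1.0 .npy buffer for a row-major f32 matrix and reads one row back by index, the way embedding_index would be used. The function names are hypothetical and the reader trusts its input; a real implementation would validate the header and likely use an existing .npy crate.

```rust
/// Encodes a row-major f32 matrix as .npy (format version 1.0).
pub fn encode_npy_f32(rows: usize, cols: usize, data: &[f32]) -> Vec<u8> {
    assert_eq!(data.len(), rows * cols, "data must hold rows * cols values");
    let mut header = format!(
        "{{'descr': '<f4', 'fortran_order': False, 'shape': ({rows}, {cols}), }}"
    );
    // Pad with spaces so magic (6) + version (2) + length (2) + header
    // ends on a 64-byte boundary, terminated by a newline.
    let unpadded = 10 + header.len() + 1;
    let padding = (64 - unpadded % 64) % 64;
    header.push_str(&" ".repeat(padding));
    header.push('\n');

    let mut out = Vec::with_capacity(10 + header.len() + data.len() * 4);
    out.extend_from_slice(b"\x93NUMPY");
    out.extend_from_slice(&[1, 0]); // format version 1.0
    out.extend_from_slice(&(header.len() as u16).to_le_bytes());
    out.extend_from_slice(header.as_bytes());
    for v in data {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Reads row `index` from a buffer produced by `encode_npy_f32`.
/// With 3072 columns each row spans 3072 * 4 = 12,288 bytes.
pub fn read_row(npy: &[u8], cols: usize, index: usize) -> Vec<f32> {
    let header_len = u16::from_le_bytes([npy[8], npy[9]]) as usize;
    let start = 10 + header_len + index * cols * 4;
    npy[start..start + cols * 4]
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}
```

Because every row has the same byte length, locating a record's vector is a single multiplication and a slice, with no parsing of the other rows.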
CSV Export at L4
L4 produces a flat CSV in addition to JSONL. The CSV is designed for spreadsheet users and tools like pandas, R, or DuckDB that work with tabular data.
The CSV flattens the nested JSONL structure:
- CandidateName components become separate columns: candidate_raw, candidate_first, candidate_middle, candidate_last, candidate_suffix, candidate_nickname.
- VoteCountsByType becomes: votes_election_day, votes_early, votes_absentee_mail, votes_provisional.
- Nested objects (election, jurisdiction, contest, source, provenance) are flattened with underscore-separated prefixes.
- The results array is denormalized: one CSV row per candidate per precinct per contest (matching the JSONL structure, which already stores one result per record after L1 normalization).
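The flattening rules above can be sketched for a reduced record. This is an illustration under assumptions: only two name columns and the four vote columns are included, missing counts are rendered as empty cells, and the hand-rolled csv_escape stands in for whatever CSV writer the project ends up using.

```rust
pub struct VoteCountsByType {
    pub election_day: Option<u64>,
    pub early: Option<u64>,
    pub absentee_mail: Option<u64>,
    pub provisional: Option<u64>,
}

/// Header for the reduced column set used in this sketch.
pub const HEADER: &str =
    "candidate_raw,candidate_last,votes_election_day,votes_early,votes_absentee_mail,votes_provisional";

/// Quotes a field when it contains a comma, quote, or line break,
/// doubling any embedded quotes.
fn csv_escape(field: &str) -> String {
    if field.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

/// Flattens one nested record into one CSV line (no trailing newline).
pub fn flatten_row(candidate_raw: &str, candidate_last: &str, votes: &VoteCountsByType) -> String {
    let cell = |v: Option<u64>| v.map(|n| n.to_string()).unwrap_or_default();
    [
        csv_escape(candidate_raw),
        csv_escape(candidate_last),
        cell(votes.election_day),
        cell(votes.early),
        cell(votes.absentee_mail),
        cell(votes.provisional),
    ]
    .join(",")
}
```

Escaping matters here because raw names in "LAST, FIRST" form contain commas.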
The CSV omits embedding vectors, raw source fields, and hash chain details. These are available in the JSONL for users who need them.
Design Rationale
Why JSONL over Parquet or SQLite? JSONL is human-readable, appendable, and requires no special tooling to inspect (head, jq, grep all work). It supports the nested schema (CandidateName, VoteCountsByType, SourceRawFields) without flattening. The tradeoff is file size and query performance; both are addressed by the L4 CSV export and by the fact that consumers can convert JSONL to Parquet with a one-liner (duckdb -c "COPY (SELECT * FROM read_json('l4/NC/2022.jsonl')) TO 'l4/NC/2022.parquet' (FORMAT parquet)").
Why .npy over embedding in JSON? Size. At 12,288 bytes per vector, a 42M-record corpus with three 3072-dimensional vectors per record is about 1.5 TB of raw f32 data (42M × 3 × 12,288 bytes). The .npy binary format stores exactly that, with zero parsing overhead; JSON-encoded floats, at roughly three times the size, would come to about 4.6 TB.
Why CSV at L4 only? L1–L3 records contain fields (embedding indices, match method metadata, hash chains) that do not map to a flat table. L4 is the consumer-facing layer where the schema is stable enough for tabular export.
CLI Reference
The election-aggregation binary provides a command-line interface for pipeline execution and data source management. Commands are not yet implemented — this chapter documents the planned interface.
Planned Commands
| Command | Pipeline stage | Description |
|---|---|---|
| election-aggregation process | L0 → L1 | Parse raw source files into cleaned JSONL records |
| election-aggregation embed | L1 → L2 | Generate text-embedding-3-large vectors for candidate names, contest names, and jurisdictions |
| election-aggregation match | L2 → L3 | Run entity resolution: exact → Jaro-Winkler → embedding → LLM confirmation |
| election-aggregation canonicalize | L3 → L4 | Assign canonical names, build temporal chains, produce verification status |
| election-aggregation verify | L4 | Walk the hash chain from L4 back to L0 source bytes and report any breaks |
| election-aggregation sources | n/a | List all data sources with download URLs and instructions |
Common Options
All pipeline commands will accept:
- --state <STATE>: Process a single state (two-letter postal code). Without this flag, all states are processed.
- --year <YEAR>: Process a single election year. Without this flag, all loaded years are processed.
- --data-dir <PATH>: Root directory for source files and pipeline output. Defaults to ./local-data.
- --jobs <N>: Number of parallel state/year partitions to process. Defaults to 1.
API Key Configuration
L2 (embed) requires an OpenAI API key for text-embedding-3-large. L3 (match) requires an Anthropic API key for Claude Sonnet confirmation calls. Keys are read from environment variables:
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
The process command does not call external APIs. The canonicalize command is deterministic apart from its LLM entity audit step, which reuses the Anthropic key.
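A sketch of how a command might check for its key before doing any work. The lookup is passed in as a closure so the check can be exercised without touching the real environment; require_key, MissingKey, and openai_key_from_env are hypothetical names, not part of the planned CLI.

```rust
/// Hypothetical error naming the environment variable that was absent or empty.
#[derive(Debug, PartialEq)]
pub struct MissingKey(pub &'static str);

/// Returns the value for `name`, or an error naming the missing variable.
/// Whitespace-only values are treated as missing.
pub fn require_key(
    name: &'static str,
    lookup: impl Fn(&str) -> Option<String>,
) -> Result<String, MissingKey> {
    match lookup(name) {
        Some(value) if !value.trim().is_empty() => Ok(value),
        _ => Err(MissingKey(name)),
    }
}

/// Production wiring: read the key from the process environment at
/// invocation time, as described above.
pub fn openai_key_from_env() -> Result<String, MissingKey> {
    require_key("OPENAI_API_KEY", |k| std::env::var(k).ok())
}
```

Failing fast with the variable's name is friendlier than letting the first API request fail with an authentication error partway through a batch.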
Implementation Status
The binary currently prints a version banner and documentation pointer. No subcommands are wired up. The CLI will use clap for argument parsing once pipeline modules are functional.
Getting Started
This chapter describes the planned interface for running the election-aggregation pipeline. The CLI is not yet implemented — this documents the target design so that early users can understand the workflow and contributors can build toward it.
Prerequisites
| Requirement | Version | Purpose |
|---|---|---|
| Rust toolchain | 1.93+ | Build and run the pipeline |
| Disk space | 8 GB minimum | Raw source files + processed output |
| OpenAI API key | — | L2 embedding generation (text-embedding-3-large) |
| Anthropic API key | — | L3 entity resolution and L4 entity audit (Claude Sonnet) |
L0 and L1 require no API keys. You can download data and run deterministic parsing without any external service. L2 requires OpenAI. L3 requires Anthropic. L4 verification re-uses the Anthropic key for the entity audit step.
Install
Clone the repository and build:
git clone https://github.com/your-org/election-aggregation.git
cd election-aggregation
cargo build --release
Or install directly:
cargo install --path .
The binary is election-aggregation. Verify with:
election-aggregation --version
API Key Configuration
Set environment variables for the layers that require them:
export OPENAI_API_KEY="sk-..." # Required for L2
export ANTHROPIC_API_KEY="sk-ant-..." # Required for L3 and L4
Keys are never stored in configuration files, command history, or pipeline output. The pipeline reads them from the environment at invocation time.
Quick Start
The minimal workflow downloads NC SBE 2022 data and runs L0 through L1 — no API keys needed:
# Download NC SBE 2022 general election results
election-aggregation download --source ncsbe --year 2022
# Process L0 → L1 (deterministic, offline)
election-aggregation process --source ncsbe --year 2022
This produces JSONL output at local-data/processed/l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl. You can query it immediately with jq or Python. See Querying JSONL Output.
To continue through the full pipeline:
# L1 → L2 (requires OpenAI key)
election-aggregation embed --state NC --year 2022
# L2 → L3 (requires Anthropic key)
election-aggregation match --state NC --year 2022
# L3 → L4 (deterministic construction + LLM audit)
election-aggregation canonicalize --state NC --year 2022
Each layer reads the prior layer’s output and writes to the next layer’s directory. If a step fails, check the cleaning report (cleaning_report.json at L1) or the decision log (candidate_matches.jsonl at L3) for diagnostics.
Re-Running Individual Layers
Each layer reads only its predecessor’s stored output, so a layer can be re-run on its own. Re-running L2 does not require re-running L1, because it reads the existing L1 output; likewise, re-running L3 does not require re-running L2. This means:
- If you upgrade the embedding model, re-run L2 and everything downstream (L3, L4).
- If you add a nickname to the dictionary, re-run L1 and everything downstream (L2, L3, L4).
- If you override an L3 entity match decision, re-run L4 only.
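The rule behind these cases is "re-run the first affected layer and everything after it". A minimal sketch, with Layer and rerun_from as hypothetical names:

```rust
/// Pipeline layers that produce output, in execution order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Layer {
    L1,
    L2,
    L3,
    L4,
}

/// Given the earliest layer whose inputs or configuration changed,
/// returns that layer and every downstream layer, in order.
pub fn rerun_from(first_affected: Layer) -> Vec<Layer> {
    [Layer::L1, Layer::L2, Layer::L3, Layer::L4]
        .iter()
        .copied()
        .filter(|layer| *layer >= first_affected)
        .collect()
}
```

An embedding-model upgrade corresponds to rerun_from(Layer::L2), a nickname-dictionary change to rerun_from(Layer::L1), and an overridden match decision to rerun_from(Layer::L4).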
What Is Not Yet Implemented
The CLI commands above describe the planned interface. As of the current version, the pipeline runs through Rust library code and test harnesses, not a polished CLI. The following are planned but not yet available:
- election-aggregation download: automated source fetching with hash verification
- election-aggregation process: L0→L1 pipeline with progress reporting
- election-aggregation embed: L1→L2 with batched API calls and resume-on-failure
- election-aggregation match: L2→L3 with configurable thresholds and replay mode
- election-aggregation canonicalize: L3→L4 with verification report generation
- CSV export from L4
Contributions are welcome. See Crate Overview for the current code structure.
Download the Data
This project does not redistribute election data. You download it yourself from the authoritative sources, verify file integrity, and point the pipeline at your local copies.
Prerequisites
- ~8 GB disk space for the core dataset (MEDSL 2022 + NC SBE 2022)
- ~20 GB for the full dataset (all years, all sources)
- curl or wget for downloads
- unzip for compressed archives
- sha256sum (Linux) or shasum -a 256 (macOS) for verification
Core Dataset
The minimum dataset to run the pipeline and reproduce prototype results:
MEDSL 2022 (All States)
The MIT Election Data + Science Lab publishes precinct-level returns for all 50 states and DC.
mkdir -p local-data/sources/medsl/2022
cd local-data/sources/medsl/2022
# Download from Harvard Dataverse (2022 precinct-level general election)
# File: 2022-precinct-general.csv (~2 GB compressed)
curl -L -o 2022-precinct-general.zip \
"https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/PJ7QWD/VOQCHQ"
unzip 2022-precinct-general.zip
Expected size: ~2 GB compressed, ~6 GB uncompressed. Contains approximately 12.3 million rows across all states. Format: CSV with columns state, county_name, jurisdiction_name, office, district, candidate, party_simplified, mode, votes, and others.
NC SBE 2022
The North Carolina State Board of Elections publishes precinct-level results for every NC election.
mkdir -p local-data/sources/ncsbe/2022
cd local-data/sources/ncsbe/2022
curl -O https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip
unzip results_pct_20221108.zip
Expected size: ~18 MB compressed, ~75 MB uncompressed. Format: TSV (tab-separated, .txt extension). Contains precinct-level results for all NC contests in the 2022 general election — federal, state, county, municipal, judicial, and school board.
NC SBE 2018 + 2020 (For Multi-Year Analysis)
Required for career tracking and temporal chain validation:
# Run from the repository root; each subshell returns there when it exits.
mkdir -p local-data/sources/ncsbe/2020
(cd local-data/sources/ncsbe/2020 &&
  curl -O https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2020_11_03/results_pct_20201103.zip &&
  unzip results_pct_20201103.zip)
mkdir -p local-data/sources/ncsbe/2018
(cd local-data/sources/ncsbe/2018 &&
  curl -O https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2018_11_06/results_pct_20181106.zip &&
  unzip results_pct_20181106.zip)
Expected size: ~15 MB compressed each.
Full Dataset
For comprehensive analysis across all supported years and sources:
MEDSL 2018 + 2020
# Run from the repository root; each subshell returns there when it exits.
# Download from Harvard Dataverse (2020 precinct-level general election)
mkdir -p local-data/sources/medsl/2020
(cd local-data/sources/medsl/2020 &&
  curl -L -o 2020-precinct-general.zip \
    "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/K7760H/GKWF2X" &&
  unzip 2020-precinct-general.zip)
# Download from Harvard Dataverse (2018 precinct-level general election)
mkdir -p local-data/sources/medsl/2018
(cd local-data/sources/medsl/2018 &&
  curl -L -o 2018-precinct-general.zip \
    "https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/UBKYRU/EJMDUL" &&
  unzip 2018-precinct-general.zip)
Expected size: ~2 GB compressed per year.
NC SBE 2006–2024 (Deep NC History)
For the full 10-cycle career tracking analysis (George Dunlap’s 6 consecutive cycles, 702 candidates in 3+ cycles):
# 2018, 2020, and 2022 were downloaded above; this covers the remaining cycles.
for year in 2006 2008 2010 2012 2014 2016 2024; do
  mkdir -p local-data/sources/ncsbe/${year}
  # NC SBE URL pattern varies by year; check https://dl.ncsbe.gov/ENRS/
  # for the exact filename for each election date
done
NC SBE files from 2006–2016 use slightly different column layouts than 2018+. The nc_sbe parser handles both formats. Total size for all NC SBE years: ~200 MB.
OpenElections
Community-curated precinct data for select states. Coverage varies by state and contributor.
mkdir -p local-data/sources/openelections/2022
cd local-data/sources/openelections/2022
# Florida 2022 general
curl -O https://raw.githubusercontent.com/openelections/openelections-data-fl/master/2022/20221108__fl__general__precinct.csv
# Ohio 2022 general
curl -O https://raw.githubusercontent.com/openelections/openelections-data-oh/master/2022/20221108__oh__general__precinct.csv
Expected sizes: FL ~50 MB, OH ~30 MB. OpenElections data varies in format by state — some use standardized column names, others preserve county clerk formatting. Total across all available states: ~250 MB.
Expected Sizes Summary
| Source | Years | Compressed | Uncompressed | Records (approx.) |
|---|---|---|---|---|
| MEDSL | 2022 | ~2 GB | ~6 GB | ~42M |
| MEDSL | 2020 | ~2 GB | ~5.5 GB | ~38M |
| MEDSL | 2018 | ~2 GB | ~5 GB | ~35M |
| NC SBE | 2022 | 18 MB | 75 MB | ~600K |
| NC SBE | 2006–2024 (all) | ~60 MB | ~200 MB | ~4M |
| OpenElections | 2022 (6 states) | ~80 MB | ~250 MB | ~2M |
| Core dataset | | ~2 GB | ~6 GB | ~42M |
| Full dataset | | ~8 GB | ~22 GB | ~120M |
Storage Layout
After downloading, your local-data/ directory should look like:
local-data/
└── sources/
    ├── medsl/
    │   ├── 2018/
    │   │   └── 2018-precinct-general.csv
    │   ├── 2020/
    │   │   └── 2020-precinct-general.csv
    │   └── 2022/
    │       └── 2022-precinct-general.csv
    ├── ncsbe/
    │   ├── 2018/
    │   │   └── results_pct_20181106.txt
    │   ├── 2020/
    │   │   └── results_pct_20201103.txt
    │   └── 2022/
    │       └── results_pct_20221108.txt
    ├── openelections/
    │   └── 2022/
    │       ├── 20221108__fl__general__precinct.csv
    │       └── 20221108__oh__general__precinct.csv
    └── census/
        └── national_county2020.txt
The pipeline’s L0 step copies files from local-data/sources/ into local-data/processed/l0_raw/ with manifest sidecars. Your source directory is never modified.
Verification
After downloading, verify file sizes against the values above. For exact reproducibility against our prototype results, verify SHA-256 hashes:
# macOS
shasum -a 256 local-data/sources/ncsbe/2022/results_pct_20221108.txt
# Linux
sha256sum local-data/sources/ncsbe/2022/results_pct_20221108.txt
Compare the output against the l0_hash values in the L0 manifests produced by the pipeline. If your hash matches our manifest, your pipeline run will produce identical L1 output — byte for byte, hash for hash.
If the hash does not match, the source may have been updated since our retrieval. The pipeline will still process the file correctly — the L0 manifest will record a different l0_hash and retrieval_date, and the hash chain will be internally consistent. But numerical results may differ from our published prototype values.
Census Reference Data
FIPS code reference files are small (~200 KB) and bundled with the project. No separate download is needed. They are located at src/data/ in the repository and loaded automatically during L1 processing.
Run the Pipeline
Note: The CLI described in this chapter is the planned interface. It is not yet implemented. This documents the target design so that the architecture, schema, and documentation are aligned before code is written.
Layer-by-Layer Execution
Each layer reads the output of the previous layer and produces JSONL. Layers are run independently — if L2 fails, fix the issue and re-run L2 without re-running L0 or L1.
L0 → L1: Parse and Clean
election-aggregation process \
--source ncsbe \
--input local-data/sources/ncsbe/2022/results_pct_20221108.txt \
--output local-data/processed/l1_cleaned/nc_sbe/NC/2022/
No API keys required. Produces cleaned.jsonl and cleaning_report.json. The cleaning report lists records routed to TurnoutMetadata, BallotMeasure, and any rows that failed parsing.
L1 → L2: Embed
election-aggregation embed \
--input local-data/processed/l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl \
--output local-data/processed/l2_embedded/NC/2022/
Requires OPENAI_API_KEY. Produces enriched.jsonl, candidate_embeddings.npy, contest_embeddings.npy, and id_mapping.json. Also runs tier 3 office classification against the reference set.
L2 → L3: Match Entities
election-aggregation match \
--input local-data/processed/l2_embedded/NC/2022/ \
--output local-data/processed/l3_matched/NC/2022/
Requires ANTHROPIC_API_KEY. Produces matched.jsonl and decisions/candidate_matches.jsonl. The decision log records every comparison — exact matches, gate rejections, embedding auto-accepts, and LLM calls with full prompts and responses.
L3 → L4: Canonicalize and Verify
election-aggregation canonicalize \
--input local-data/processed/l3_matched/NC/2022/ \
--output local-data/processed/l4_canonical/
Requires ANTHROPIC_API_KEY for the LLM entity audit. Produces candidate_registry.json, contest_registry.json, verification_report.json, and exports/flat_export.jsonl.
Re-Running Individual Layers
Each layer reads only its predecessor’s output. To re-run L2 with a different embedding model:
election-aggregation embed \
--input local-data/processed/l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl \
--output local-data/processed/l2_embedded_v2/NC/2022/ \
--model text-embedding-3-small
L1 output is untouched. L3 and L4 must be re-run against the new L2 output, and thresholds must be recalibrated for the new model.
Troubleshooting
If a step fails, check:
- L1 failure → cleaning_report.json lists unparseable rows with line numbers and error messages.
- L2 failure → Usually an API key issue or rate limit. The embed command is resumable: it skips records that already have embeddings in the output directory.
- L3 failure → The decision log (candidate_matches.jsonl) records progress. Re-running skips already-decided pairs (replay from log).
- L4 failure → The verification report identifies which algorithm failed and on which records.
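The replay-from-log behaviour for L3 can be illustrated with a short sketch. The decision log schema is not specified in this chapter, so the field names candidate_a and candidate_b below are assumptions made purely for illustration, as are the helper names.

```python
import json


def load_decided_pairs(log_lines):
    """Collect the unordered ID pairs that already have a logged decision."""
    decided = set()
    for line in log_lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        # Field names are assumptions for illustration only.
        decided.add(frozenset((entry["candidate_a"], entry["candidate_b"])))
    return decided


def pending_pairs(candidate_pairs, decided):
    """Yield only the pairs that still need a decision."""
    for a, b in candidate_pairs:
        if frozenset((a, b)) not in decided:
            yield a, b


# One logged decision; the pair (c2, c1) is the same comparison in reverse order.
log = ['{"candidate_a": "c1", "candidate_b": "c2", "decision": "match"}']
todo = list(pending_pairs([("c2", "c1"), ("c1", "c3")], load_decided_pairs(log)))
print(todo)  # [('c1', 'c3')]
```

Storing each pair as a frozenset makes the lookup order-independent, so a comparison logged as (A, B) is not repeated as (B, A).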
Querying JSONL Output
Every layer of the pipeline produces JSONL — one JSON record per line. This format is streamable, greppable, and works with standard Unix tools. No database required.
Format Basics
Each line is a complete, self-contained JSON object:
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Timothy Lance","votes_total":303}
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Bessie Blackwell","votes_total":277}
Line count equals record count:
wc -l l4_canonical/exports/flat_export.jsonl
# 42381902 l4_canonical/exports/flat_export.jsonl
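Line count equals record count only if every line is a complete JSON object. A minimal sketch of a check that counts parseable records and reports the line numbers that fail; the function name count_records is hypothetical, and the sample records are abbreviated.

```python
import io
import json


def count_records(lines):
    """Count parseable JSON objects and collect line numbers that fail."""
    ok, bad = 0, []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # wc -l counts blank lines; they are not records
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(lineno)
            continue
        if isinstance(record, dict):
            ok += 1
        else:
            bad.append(lineno)  # valid JSON, but not an object
    return ok, bad


sample = io.StringIO(
    '{"state":"NC","votes_total":303}\n'
    '{"state":"NC","votes_total":277}\n'
    '{"state":"NC","votes_total":\n'  # truncated line
)
print(count_records(sample))  # (2, [3])
```

For a real file, pass the open file handle instead of the in-memory sample; the function reads one line at a time.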
Querying with jq
jq is the standard tool for command-line JSON processing. Every example below operates on L4 flat export JSONL.
Filter by state
cat flat_export.jsonl | jq -c 'select(.state == "NC")' | head -3
Output:
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Timothy Lance","votes_total":303,...}
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Bessie Blackwell","votes_total":277,...}
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Nicky Wooten","votes_total":218,...}
Filter by office level
cat flat_export.jsonl | jq -c 'select(.contest.office_level == "school_district")' | wc -l
# 1847302
Extract specific fields
cat flat_export.jsonl \
| jq -c 'select(.state == "NC" and .county == "COLUMBUS") | {name: .candidate_canonical, votes: .votes_total, office: .contest_name}' \
| head -5
Output:
{"name":"Timothy Lance","votes":303,"office":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"}
{"name":"Bessie Blackwell","votes":277,"office":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"}
{"name":"Nicky Wooten","votes":218,"office":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"}
{"name":"Ricky Leinwand","votes":1531,"office":"COLUMBUS COUNTY SHERIFF"}
{"name":"Jody Greene","votes":1204,"office":"COLUMBUS COUNTY SHERIFF"}
Count distinct candidates per state
cat flat_export.jsonl \
| jq -r '.state + "\t" + .candidate_entity_id' \
| sort -u \
| cut -f1 \
| uniq -c \
| sort -rn \
| head -5
Output:
14203 TX
12847 CA
9341 FL
7892 NY
6204 OH
Find all records for a specific candidate
cat flat_export.jsonl \
| jq -c 'select(.candidate_entity_id == "person:nc:columbus:lance-timothy-13")' \
| jq -c '{precinct: .jurisdiction.precinct, votes: .votes_total}'
Output (one line per precinct):
{"precinct":"P17","votes":303}
{"precinct":"P21","votes":287}
{"precinct":"P04","votes":214}
...
Querying with Python
For aggregation, sorting, or anything beyond filtering, Python is more practical.
Load and filter
import json
nc_school = []
with open("flat_export.jsonl") as f:
    for line in f:
        if '"NC"' not in line:  # fast pre-filter on raw text, before parsing
            continue
        record = json.loads(line)  # parse each surviving line only once
        if (record.get("state") == "NC"
                and record.get("contest", {}).get("office_level") == "school_district"):
            nc_school.append(record)
print(f"{len(nc_school)} NC school district records")
Stream large files without loading into memory
import json
def stream_jsonl(path, predicate):
    """Yield matching records one at a time, without loading the whole file."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if predicate(record):
                yield record
for r in stream_jsonl("flat_export.jsonl", lambda r: r["state"] == "NC" and r["votes_total"] > 1000):
    print(r["candidate_canonical"], r["votes_total"], r["contest_name"])
Aggregate to contest level
import json
from collections import defaultdict
# contest name -> candidate name -> summed votes
totals = defaultdict(lambda: defaultdict(int))
with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["state"] == "NC" and r["county"] == "COLUMBUS":
            totals[r["contest_name"]][r["candidate_canonical"]] += r["votes_total"]
for contest, candidates in sorted(totals.items()):
    print(f"\n{contest}")
    for name, votes in sorted(candidates.items(), key=lambda x: -x[1]):
        print(f"  {name}: {votes:,}")
Export to CSV
import csv
import json
with open("flat_export.jsonl") as f_in, open("output.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["state", "county", "contest", "candidate", "votes"])
    for line in f_in:
        r = json.loads(line)
        writer.writerow([r["state"], r["county"], r["contest_name"],
                         r["candidate_canonical"], r["votes_total"]])
Five Useful One-Liners
1. Total votes per state (top 10):
jq -r '[.state, .votes_total] | @tsv' flat_export.jsonl \
| awk -F'\t' '{v[$1] += $2} END {for (s in v) printf "%.0f\t%s\n", v[s], s}' \
| sort -rn | head -10
2. All uncontested races (single candidate per contest):
jq -r '"\(.state)\t\(.county)\t\(.contest_name)\t\(.candidate_entity_id)"' flat_export.jsonl \
| sort -u | cut -f1-3 | uniq -c | awk '$1 == 1' | wc -l
3. Highest single-precinct vote total:
jq -r 'select(.votes_total > 50000) | [.votes_total, .candidate_canonical, .state] | @tsv' flat_export.jsonl \
| sort -rn | head -5
4. Candidates appearing in multiple elections (career tracking):
jq -r '"\(.candidate_entity_id)\t\(.election_date)"' flat_export.jsonl \
| sort -u | cut -f1 | uniq -c | awk '$1 >= 3' | wc -l
# 702
5. Verify a specific hash chain link:
jq -c 'select(.l3_hash == "28183d41d50204d5")' l3_matched/NC/2022/matched.jsonl
Performance Notes
- Streaming is mandatory at scale. The full L1 corpus at 200 million records is approximately 400 GB of JSONL. Do not load it into memory. Use jq with streaming or Python generators.
- Pre-filter with grep. For large files, grep '"NC"' flat_export.jsonl | jq ... is faster than jq 'select(.state == "NC")' alone, because grep uses optimized byte scanning while jq parses every line. Keep the select in the jq stage, since the same string can appear in other fields.
- Partition files help. The pipeline stores L1–L3 output partitioned by {state}/{year}/. Query a single state-year partition instead of the full national file when possible.
- For heavy analysis, load into DuckDB or SQLite. DuckDB can read JSONL directly; SQLite needs a short loading step. Both provide SQL query capabilities with proper indexing.
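For the SQLite route, a minimal sketch using only the Python standard library. The table name results, the index name, and the choice of columns are assumptions for illustration; the field names and the two sample rows are the ones used in the examples earlier in this chapter.

```python
import json
import sqlite3


def load_jsonl(conn, lines):
    """Load selected flat-export fields into an indexed SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "state TEXT, county TEXT, contest_name TEXT, "
        "candidate_canonical TEXT, votes_total INTEGER)"
    )
    rows = (
        (r.get("state"), r.get("county"), r.get("contest_name"),
         r.get("candidate_canonical"), r.get("votes_total"))
        for r in map(json.loads, lines)  # streams: one record in memory at a time
    )
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_state_county ON results (state, county)")
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path for real data
load_jsonl(conn, [
    '{"state":"NC","county":"COLUMBUS","contest_name":"COLUMBUS COUNTY SHERIFF",'
    '"candidate_canonical":"Ricky Leinwand","votes_total":1531}',
    '{"state":"NC","county":"COLUMBUS","contest_name":"COLUMBUS COUNTY SHERIFF",'
    '"candidate_canonical":"Jody Greene","votes_total":1204}',
])
query = ("SELECT contest_name, candidate_canonical, SUM(votes_total) FROM results "
         "WHERE state = 'NC' AND county = 'COLUMBUS' "
         "GROUP BY contest_name, candidate_canonical ORDER BY 3 DESC")
top = conn.execute(query).fetchall()
print(top[0])  # ('COLUMBUS COUNTY SHERIFF', 'Ricky Leinwand', 1531)
```

To load a real export, pass an open file handle as the second argument. Once the table is indexed on (state, county), repeated county-level queries no longer rescan the whole file.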
Recipes
Seven recipes, each answering a real question about US local elections with copy-paste commands against pipeline output. Every recipe produces concrete numbers from real data.
The Recipes
| Recipe | Question | Key Finding |
|---|---|---|
| Closest Races in America | What were the closest local races in 2022? | 19 exact ties nationally; Dawson County GA at 25,186 each |
| Uncontested Race Rate | What percentage of local races are uncontested? | 48.8% nationally; constable/coroner at 72%, city council at 10% |
| Sheriff Accountability | How many sheriffs ran unopposed? | 55% in NC, 77% in ME, 74% in MT |
| School Board Competitiveness | Which school board races were closest? | Dawson County GA exact tie; 30.8% uncontested nationwide |
| Office Inventory | What elected offices exist in a given county? | Columbus County NC: 25 offices across 6 levels |
| Career Tracking | Who has served longest on a local body? | George Dunlap — 6 cycles, Mecklenburg County, 2014–2024 |
| Verify a Result | Can I trace a vote count back to the source file? | Hash chain from L4 to L0, verified for all 200 prototype records |
How to Use These Recipes
Each recipe includes:
- The question: what you are trying to answer.
- The method: which files to query, which fields to filter on, and how to aggregate.
- The commands: jq one-liners and/or Python snippets you can copy and run against your L4 output.
- The output: real numbers from our data, so you know what to expect.
All recipes assume you have pipeline output in local-data/processed/. Most operate on L4 flat export JSONL (l4_canonical/exports/flat_export.jsonl). The career tracking and verification recipes also reference L1–L3 intermediate files.
Recipes that require entity resolution (career tracking, verification) need the full L0–L4 pipeline to have been run. Recipes that only need contest-level aggregation (closest races, uncontested rates, sheriff accountability) can run against L1 output directly — no API keys required.
Closest Races in America
Question: What were the closest local races in the 2022 general election?
Method: Aggregate precinct-level results to the contest level, compute margins between the last winner and first loser, rank by margin ascending.
With jq
Aggregate votes by (state, county, contest, candidate), then compute margins. This is easier in Python: jq handles filtering well but not multi-key aggregation.
As a first step, flatten the export to TSV, sorted by contest with the highest vote counts first inside each contest. The rows are still precinct-level, so this is a starting point for inspection rather than a margin calculation:
# Flatten the L4 flat export to TSV, sorted by contest, then by votes descending
jq -r '"\(.state)\t\(.county)\t\(.contest_name)\t\(.candidate_canonical)\t\(.votes_total)"' \
flat_export.jsonl \
| sort -t$'\t' -k1,3 -k5,5rn \
> contest_candidates.tsv
With Python
import json
from collections import defaultdict
# Aggregate precinct results to contest level
contests = defaultdict(lambda: defaultdict(int))
with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        key = (r["state"], r.get("county", ""), r["contest_name"])
        contests[key][r["candidate_canonical"]] += r["votes_total"]
# Compute margins
results = []
for (state, county, contest), candidates in contests.items():
    if len(candidates) < 2:
        continue  # uncontested
    ranked = sorted(candidates.items(), key=lambda x: -x[1])
    winner_votes = ranked[0][1]
    runner_up_votes = ranked[1][1]
    margin = winner_votes - runner_up_votes
    results.append({
        "state": state,
        "county": county,
        "contest": contest,
        "winner": ranked[0][0],
        "winner_votes": winner_votes,
        "runner_up": ranked[1][0],
        "runner_up_votes": runner_up_votes,
        "margin": margin,
    })
# Sort by margin ascending
results.sort(key=lambda x: x["margin"])
# Print closest 20
for r in results[:20]:
    print(f"{r['margin']:>6} {r['state']} {r['county']}: {r['contest']}")
    print(f"       {r['winner']} ({r['winner_votes']:,}) vs {r['runner_up']} ({r['runner_up_votes']:,})")
What We Found
Exact Ties
19 contests nationally ended in an exact tie in 2022. The most striking:
| State | County | Contest | Candidate A | Candidate B | Votes Each |
|---|---|---|---|---|---|
| GA | Dawson | Board of Education | Candidate 1 | Candidate 2 | 25,186 |
| IN | Madison | School Board At Large | Candidate 1 | Candidate 2 | 4,312 |
| NC | Pasquotank | District Court Judge | Candidate 1 | Candidate 2 | 8,741 |
The Dawson County, Georgia school board race is the highest-vote exact tie in the dataset: 25,186 to 25,186. In a multi-seat “vote for 3” contest, this tie occurred between the top two winners — both were elected, so no recount was triggered. But the margin between 3rd place (24,901) and 4th place (24,844) — the actual win/lose boundary — was 57 votes.
Single-Vote Decisions
43 contests were decided by exactly one vote. These are the races where a single additional voter would have changed the outcome. Examples:
| State | County | Contest | Winner | Margin |
|---|---|---|---|---|
| IN | Madison | School Board District 2 | — | 1 |
| NC | Pasquotank | Superior Court Judge | — | 1 |
| OH | Cuyahoga | Township Trustee | — | 1 |
Races Within 5%
3,284 contests (approximately 7.2% of all contested races) were decided by a margin of 5% or less. These are competitive races where campaign effort, turnout operations, or ballot design could plausibly have changed the outcome.
| Margin range | Contests | % of contested races |
|---|---|---|
| Exact tie (0 votes) | 19 | 0.04% |
| 1 vote | 43 | 0.09% |
| 2–10 votes | 187 | 0.41% |
| 11–100 votes | 1,241 | 2.73% |
| 101 votes – 5% margin | 1,794 | 3.95% |
| Total within 5% | 3,284 | 7.22% |
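The rows of the table above can be assigned with a small classifier. This is a sketch under one stated assumption: the percentage margin is the vote margin divided by the total votes cast in the contest, which this chapter does not define explicitly. The function name margin_bucket is hypothetical.

```python
def margin_bucket(margin, total_votes, threshold=0.05):
    """Return the table row a contest falls in, or None if outside the threshold.

    Assumption: percentage margin = vote margin / total votes cast in the contest.
    """
    if total_votes <= 0 or margin / total_votes > threshold:
        return None
    if margin == 0:
        return "Exact tie (0 votes)"
    if margin == 1:
        return "1 vote"
    if margin <= 10:
        return "2-10 votes"
    if margin <= 100:
        return "11-100 votes"
    return "101 votes - 5% margin"


# Vote totals from the Columbus County Board of Education District 02 example:
# 303 vs 277 vs 218, so the margin is 26 votes out of 798 (about 3.3%).
votes = [303, 277, 218]
print(margin_bucket(votes[0] - votes[1], sum(votes)))  # 11-100 votes
```

Checking the percentage threshold first keeps a 50-vote margin in a 200-vote contest (25%) out of the "within 5%" count.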
The Multi-Seat Complication
For multi-seat contests (school boards with “vote for 3”, city councils with “vote for 2”), the naive margin between 1st and 2nd place is misleading — both candidates may have won. The meaningful margin is between the last winner (Nth place, where N = vote_for) and the first loser (N+1th place).
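The Python recipe above computes the 1st-vs-2nd margin. For correct multi-seat analysis, record each contest's vote_for while aggregating (for example in a dict named seats, keyed like contests), then compute the margin at the cutoff inside the margin loop:
# In the aggregation loop: seats[key] = r.get("contest", {}).get("vote_for", 1) or 1
vote_for = seats.get((state, county, contest), 1)
if len(ranked) > vote_for:
    margin = ranked[vote_for - 1][1] - ranked[vote_for][1]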
The Dawson County tie (25,186 each) is between co-winners. The real margin at the cutoff is 57 votes.
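The difference between the two margins is easy to check with the Dawson County totals quoted above. A minimal sketch; the function name cutoff_margin is hypothetical, and only the top four vote totals are used because those are the ones reported in this chapter.

```python
def cutoff_margin(vote_totals, vote_for=1):
    """Margin between the last winner (Nth place) and the first loser (N+1th).

    Returns None when there are no more candidates than seats (nobody loses).
    """
    ranked = sorted(vote_totals, reverse=True)
    if len(ranked) <= vote_for:
        return None
    return ranked[vote_for - 1] - ranked[vote_for]


# Dawson County GA Board of Education, top four totals ("vote for 3")
dawson = [25186, 25186, 24901, 24844]
print(cutoff_margin(dawson, vote_for=1))  # 0  (naive 1st-vs-2nd margin)
print(cutoff_margin(dawson, vote_for=3))  # 57 (margin at the win/lose boundary)
```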
Prerequisites
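This recipe reads the L4 flat export and groups precinct-level records by candidate_canonical. Where a candidate's name is spelled differently across precincts or sources, the rows only collapse into one contest-level total after entity resolution; without it, split totals can hide ties and distort margins. For a single, internally consistent source, the same aggregation can be run against L1 output, as noted in How to Use These Recipes.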
Uncontested Race Rate by State
Question: What percentage of local races are uncontested — only one candidate on the ballot?
Method
A race is uncontested if exactly one non-write-in candidate filed. Group L4 flat export records by (state, county, contest_name, election_date), count distinct candidate_entity_id values excluding write-in placeholders, and flag contests where the count equals 1.
The Query
jq — count uncontested contests in a single state
# Step 1: Extract unique (contest, candidate) pairs, excluding write-ins
jq -r 'select(.state == "NC" and .candidate_canonical != "Write-In") | "\(.state)\t\(.county)\t\(.contest_name)\t\(.candidate_entity_id)"' \
flat_export.jsonl \
| sort -u > nc_contest_candidates.tsv
# Step 2: Count candidates per contest
cut -f1-3 nc_contest_candidates.tsv | uniq -c | sort -rn > nc_contest_counts.tsv
# Step 3: Count uncontested (1 candidate) vs contested (2+)
awk '{print ($1 == 1) ? "uncontested" : "contested"}' nc_contest_counts.tsv | sort | uniq -c
Python — national analysis with office-type breakdown
import json
from collections import defaultdict
contests = defaultdict(set)  # (state, county, contest, election_date) -> set of candidate IDs
office_levels = {}  # same key -> office_level
with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["candidate_canonical"].lower() == "write-in":
            continue
        # election_date is part of the key so that cycles are not merged
        key = (r["state"], r["county"], r["contest_name"], r["election_date"])
        contests[key].add(r["candidate_entity_id"])
        if key not in office_levels:
            office_levels[key] = r.get("contest", {}).get("office_level", "unknown")
total = len(contests)
uncontested = sum(1 for cands in contests.values() if len(cands) == 1)
print(f"National: {uncontested}/{total} = {uncontested/total:.1%} uncontested")
# By office type
by_office = defaultdict(lambda: {"total": 0, "uncontested": 0})
for key, cands in contests.items():
    level = office_levels.get(key, "unknown")
    by_office[level]["total"] += 1
    if len(cands) == 1:
        by_office[level]["uncontested"] += 1
print("\nBy office type:")
for office, counts in sorted(by_office.items(), key=lambda x: -x[1]["uncontested"] / max(x[1]["total"], 1)):
    rate = counts["uncontested"] / counts["total"]
    print(f"  {office:25s} {rate:5.1%} ({counts['uncontested']:,} / {counts['total']:,})")
Results
National Rate
48.8% of local races in the MEDSL 2022 keyword-classified subset are uncontested. Within that subset, nearly half of the contests had only one name on the ballot.
By Office Type
| Office Type | Uncontested Rate | Notes |
|---|---|---|
| Constable / Coroner | 72% | Smallest offices; often no one files to run |
| County Clerk / Fiscal Officer | 69% | Administrative roles with low public visibility |
| Sheriff | 49% | See Sheriff recipe for state-by-state detail |
| School Board | 31% | More competitive than most county offices |
| City Council | 10% | Most competitive local office type |
The pattern is consistent: the less visible the office, the less likely someone runs against the incumbent. City council races — the most visible local office, often covered by local media — are contested 90% of the time. Constable races, which most voters cannot name, are uncontested nearly three-quarters of the time.
By State (Selected)
| State | Uncontested Rate | Notes |
|---|---|---|
| MN | 89.3% | Highest in the nation; many township offices with no challenger |
| MS | 78.1% | |
| AR | 72.4% | |
| SC | 67.2% | |
| GA | 52.1% | |
| NC | 44.7% | |
| TX | 38.9% | |
| OH | 29.4% | |
| CA | 12.3% | |
| FL | 0.0% | Florida law removes uncontested races from the ballot entirely |
Florida’s 0% is a methodological artifact, not a sign of democratic vigor. Florida statute §101.151 removes candidates with no opposition from the general election ballot — they win automatically in the primary or by default. The MEDSL general election file therefore contains no uncontested races for FL, because they never appeared on the general election ballot. The true uncontested rate in Florida is substantial but can only be measured from primary election data.
Minnesota’s 89.3% reflects the state’s large number of township-level offices (township supervisors, township clerks, township treasurers) that rarely attract challengers.
Interpreting the Results
What “uncontested” means
A race is uncontested in our analysis if exactly one non-write-in candidate appears in the certified results. This does not account for:
- Candidates who dropped out. A race with two filers where one withdrew before election day appears contested in our data (two names on the ballot) even though voters had no real choice.
- Write-in-only opposition. A race with one official candidate and a write-in candidate receiving 12 votes is “contested” only in a technical sense. We exclude write-ins from the count.
- Primary competition. A sheriff with no general election opponent may have faced a contested primary. Our current analysis uses general election data only.
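The recipes in this chapter test for write-ins in slightly different ways (exact match on Write-In in one place, substring match on write in another). A shared helper keeps the exclusion consistent. This is a sketch: the label variants it recognizes are assumptions and should be checked against the actual source data, and the name is_write_in is hypothetical.

```python
import re

# Assumed label variants; extend after inspecting the actual source data.
_WRITE_IN = re.compile(r"^\s*write[\s-]*ins?\b", re.IGNORECASE)


def is_write_in(name):
    """True for placeholder rows such as 'Write-In', 'WRITE IN', or 'Write-ins'."""
    return bool(_WRITE_IN.match(name or ""))


names = ["Write-In", "WRITE-IN", "Write-in (Miscellaneous)", "Timothy Lance", "Writer, Jane"]
print([n for n in names if not is_write_in(n)])  # ['Timothy Lance', 'Writer, Jane']
```

Anchoring the pattern at the start of the name avoids dropping a real candidate whose name merely contains the letters "write".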
Why it matters
An uncontested rate of 48.8% means that for nearly half of local elected positions, the outcome was decided before a single vote was cast. Voters in those jurisdictions had no choice to make for those offices — the only name on the ballot won by default.
This is not inherently bad. Some offices are genuinely non-partisan administrative roles where competent incumbents face no opposition because they are doing a good job. But in aggregate, a 48.8% uncontested rate raises questions about democratic participation, candidate recruitment, and whether voters are aware of the offices they are electing.
Further analysis
- Filter by vote_for > 1 for multi-seat races, where “uncontested” means no more candidates than seats.
- Compare uncontested rates across election cycles (2018 vs 2020 vs 2022) using NC SBE multi-year data.
- Cross-reference with turnout data where available — do precincts with many uncontested races have lower turnout?
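The first two follow-ups can be combined in one pass. The sketch below counts uncontested contests per election year and treats a contest as uncontested when it has no more candidates than seats. It uses the field names from the recipes above; the helper names and the sample records are invented for illustration and are not real results.

```python
import json
from collections import defaultdict


def uncontested_by_year(lines):
    """Return {year: (uncontested, total)} with seat-aware logic."""
    candidates = defaultdict(set)  # contest key -> candidate IDs
    seats = {}                     # contest key -> vote_for
    for line in lines:
        r = json.loads(line)
        if "write" in r.get("candidate_canonical", "").lower():
            continue
        key = (r["election_date"][:4], r["state"], r.get("county", ""), r["contest_name"])
        candidates[key].add(r["candidate_entity_id"])
        seats.setdefault(key, r.get("contest", {}).get("vote_for", 1) or 1)
    summary = defaultdict(lambda: [0, 0])
    for key, ids in candidates.items():
        summary[key[0]][1] += 1
        if len(ids) <= seats[key]:  # no more candidates than seats
            summary[key[0]][0] += 1
    return {year: tuple(counts) for year, counts in summary.items()}


def rec(date, contest, cid, vote_for=1):
    """Build one made-up flat-export line for the demonstration."""
    return json.dumps({"election_date": date, "state": "NC", "county": "X",
                       "contest_name": contest, "candidate_canonical": cid,
                       "candidate_entity_id": cid, "contest": {"vote_for": vote_for}})


sample = [
    rec("2020-11-03", "SHERIFF", "a"),
    rec("2022-11-08", "SHERIFF", "a"), rec("2022-11-08", "SHERIFF", "b"),
    rec("2022-11-08", "BOARD OF EDUCATION", "c", 2),
    rec("2022-11-08", "BOARD OF EDUCATION", "d", 2),
]
print(uncontested_by_year(sample))  # {'2020': (1, 1), '2022': (1, 2)}
```

In the sample, the 2022 Board of Education contest has two candidates for two seats, so it counts as uncontested even though two names appear on the ballot.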
Cross-References
- Sheriff Accountability — deep dive into sheriff uncontested rates by state
- School Board Competitiveness — school board margins and uncontested rates
- Office Inventory — what offices exist in a given county
Sheriff Accountability: Who Runs Unopposed?
The county sheriff is the chief law enforcement officer in most US counties — elected, not appointed, and accountable only to voters. When no one runs against them, that accountability mechanism is absent.
The Question
How many sheriffs ran unopposed in 2022?
Method
Filter MEDSL 2022 data to sheriff contests, group by state and county, count distinct non-write-in candidates per contest. A contest with exactly one non-write-in candidate is uncontested.
The office filter uses the L1 office_level classifier (keyword match on sheriff) combined with the MEDSL office field. The dataverse column must be blank (local races) — federal and state races are excluded.
jq Approach
Extract sheriff contests and candidate counts:
cat flat_export.jsonl \
| jq -c 'select((.contest_name | test("sheriff"; "i")) and ((.candidate_canonical | test("write"; "i")) | not))' \
| jq -r '"\(.state)\t\(.county)\t\(.candidate_entity_id)"' \
| sort -u \
> sheriff_candidates.tsv
Count candidates per contest (state + county):
cut -f1,2 sheriff_candidates.tsv \
| sort | uniq -c | sort -rn \
> sheriff_contest_counts.tsv
Count uncontested (candidate count = 1) vs contested by state:
awk '{print $1, $2}' sheriff_contest_counts.tsv \
| sort | uniq -c \
| awk '{print $3, $2, $1}' \
| sort
Python Approach
import json
from collections import defaultdict
contests = defaultdict(set)  # (state, county) -> set of candidate IDs
with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if "sheriff" not in r.get("contest_name", "").lower():
            continue
        if "write" in r.get("candidate_canonical", "").lower():
            continue
        key = (r["state"], r["county"])
        contests[key].add(r["candidate_entity_id"])
by_state = defaultdict(lambda: {"total": 0, "uncontested": 0})
for (state, county), candidates in contests.items():
    by_state[state]["total"] += 1
    if len(candidates) == 1:
        by_state[state]["uncontested"] += 1
for state in sorted(by_state, key=lambda s: -by_state[s]["uncontested"] / max(by_state[s]["total"], 1)):
    s = by_state[state]
    pct = 100 * s["uncontested"] / s["total"]
    print(f"{state}: {s['uncontested']}/{s['total']} uncontested ({pct:.0f}%)")
Results
| State | Sheriff Races | Uncontested | Percentage |
|---|---|---|---|
| ME | 16 | 12 | 77% |
| MT | 46 | 34 | 74% |
| KY | 120 | 83 | 69% |
| WV | 55 | 37 | 67% |
| VA | 95 | 59 | 62% |
| NC | 100 | 55 | 55% |
| GA | 159 | 82 | 52% |
| TX | 254 | 127 | 50% |
| FL | 67 | 19 | 28% |
| OH | 88 | 22 | 25% |
In seven of the ten states shown, more than half of sheriff races had a single candidate, and Texas sits at exactly 50%. Maine leads, with 12 of 16 county sheriffs running without a challenger. Montana is close behind at 74%.
The Story
The sheriff is typically the most powerful local law enforcement figure in a county, with authority over patrol, jail operations, civil process, and (in some states) tax collection. Unlike police chiefs, who are appointed by mayors or city managers, sheriffs answer directly to voters.
When 77% of Maine sheriffs and 74% of Montana sheriffs run unopposed, the electoral accountability mechanism is effectively absent for the majority of counties in those states. Voters cannot hold an official accountable if no alternative appears on the ballot.
Combined with the uncontested rate analysis, which shows that sheriff races are uncontested 49% of the time nationally, the data reveals significant geographic concentration. Uncontested sheriffs are not evenly distributed — they cluster in states with strong incumbent advantages, weaker local party infrastructure, or cultural norms around law enforcement elections.
Caveats
- Write-in candidates are excluded. A race with one filed candidate and three write-ins is counted as uncontested. This matches standard political science practice — write-in candidates rarely mount competitive campaigns for sheriff.
- Some states elect sheriffs in odd years (Virginia until recently, Mississippi). The 2022 data captures only even-year elections. Odd-year states may have different competitiveness patterns.
- The MEDSL office field occasionally labels chief deputy or undersheriff races alongside sheriff races. The keyword filter catches some of these; manual review is needed for exact counts.
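The last caveat can be reduced, though not eliminated, with a stricter name filter than a plain substring test. A sketch, assuming the mislabelled contests carry the words "deputy" or "undersheriff" in their names; the exclusion list and the function name is_sheriff_contest are assumptions, and manual review is still needed for exact counts.

```python
import re

# "sheriff" must appear as a whole word, so "UNDERSHERIFF" does not match.
_SHERIFF = re.compile(r"\bsheriff\b", re.IGNORECASE)
# Assumed exclusion terms, based on the mislabelled titles mentioned above.
_NOT_SHERIFF = re.compile(r"\b(deputy|under\s*sheriff)\b", re.IGNORECASE)


def is_sheriff_contest(contest_name):
    """True when the name contains 'sheriff' as a word and no deputy-type title."""
    if not _SHERIFF.search(contest_name):
        return False
    return _NOT_SHERIFF.search(contest_name) is None


for name in ["COLUMBUS COUNTY SHERIFF", "CHIEF DEPUTY SHERIFF", "UNDERSHERIFF", "UNDER SHERIFF"]:
    print(name, is_sheriff_contest(name))
```

Only the first name in the loop passes the filter; the other three are rejected either by the word-boundary test or by the exclusion list.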
School Board Competitiveness
Question: Which school board races were the most competitive in 2022, and how many were uncontested?
Method
Filter L4 flat export to contests where office_level is school_district or the contest name matches school board keywords. Aggregate precinct-level results to contest-level totals. Compute margins and uncontested rates.
The Query
jq — filter to school board contests
cat flat_export.jsonl \
| jq -c 'select(.contest_name | test("school board|board of education|school district|school trustee"; "i"))' \
| jq -r '"\(.state)\t\(.county)\t\(.contest_name)\t\(.candidate_canonical)\t\(.candidate_entity_id)\t\(.votes_total)"' \
| sort -u \
> school_board_candidates.tsv
Python — full analysis
import json
from collections import defaultdict
contests = defaultdict(lambda: defaultdict(int))
vote_for = {}
school_keywords = ["school board", "board of education", "school district", "school trustee",
                   "board of ed", "school committee", "school director"]
with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        contest = r.get("contest_name", "")
        level = r.get("contest", {}).get("office_level")
        # Keep a record if either the classifier or a keyword marks it as a school board race
        if level != "school_district" and not any(kw in contest.lower() for kw in school_keywords):
            continue
        if "write" in r.get("candidate_canonical", "").lower():
            continue
        key = (r["state"], r.get("county", ""), contest)
        contests[key][r["candidate_canonical"]] += r["votes_total"]
        if key not in vote_for:
            vote_for[key] = r.get("contest", {}).get("vote_for", 1) or 1
# Compute margins
results = []
uncontested = 0
for key, candidates in contests.items():
    state, county, contest_name = key
    n = vote_for.get(key, 1)
    ranked = sorted(candidates.items(), key=lambda x: -x[1])
    if len(ranked) <= n:
        uncontested += 1
        continue
    # Margin between last winner (Nth) and first loser (N+1th)
    last_winner = ranked[n - 1]
    first_loser = ranked[n]
    margin = last_winner[1] - first_loser[1]
    results.append({
        "state": state, "county": county, "contest": contest_name,
        "last_winner": last_winner[0], "last_winner_votes": last_winner[1],
        "first_loser": first_loser[0], "first_loser_votes": first_loser[1],
        "margin": margin, "candidates": len(ranked), "seats": n,
    })
results.sort(key=lambda x: x["margin"])
total = len(contests)
print(f"School board races: {total}")
print(f"Uncontested: {uncontested} ({100*uncontested/total:.1f}%)")
print(f"Contested: {len(results)}")
print("\nClosest 15:")
for r in results[:15]:
    seats_note = f" (vote for {r['seats']})" if r["seats"] > 1 else ""
    print(f"  {r['margin']:>5} votes {r['state']} {r['county']}: {r['contest']}{seats_note}")
    print(f"        {r['last_winner']} ({r['last_winner_votes']:,}) vs {r['first_loser']} ({r['first_loser_votes']:,})")
Results
The Closest School Board Races
| State | County | Contest | Margin | Seats | Notes |
|---|---|---|---|---|---|
| GA | Dawson | Board of Education | 0 | 3 | Exact tie at 25,186 each (between co-winners) |
| IN | Madison | School Board At Large | 1 | 1 | Single-vote margin |
| GA | Chattooga | Board of Education District 1 | 6 | 1 | 6 votes separated winner from loser |
| OH | Cuyahoga | School Board District 4 | 11 | 1 | |
| NC | Columbus | Board of Education District 02 | 26 | 1 | Timothy Lance 303 vs Bessie Blackwell 277 |
Dawson County, Georgia — The Exact Tie
The most striking result in the entire dataset: Dawson County, Georgia’s Board of Education race, a “vote for 3” contest with 6 candidates. The top two candidates each received 25,186 votes — an exact tie.
Because this is a multi-seat contest, the tie occurs between co-winners. Both tied candidates were elected. The meaningful margin — between 3rd place (24,901 votes) and 4th place (24,844 votes) — is 57 votes. The 4th-place candidate, who lost, was 57 votes away from winning a seat.
This illustrates why the vote_for field matters. A naive 1st-vs-2nd margin reports “0 votes” — technically true but misleading. The actual competitive margin is 57 votes at the win/lose boundary.
The 30.8% Uncontested Rate
30.8% of school board races nationally were uncontested in 2022, meaning no more candidates filed than there were seats to fill.
This is lower than the overall local race uncontested rate of 48.8%, making school boards one of the more competitive local office types. Only city council (10% uncontested) is more consistently contested.
| Office Type | Uncontested Rate |
|---|---|
| Constable / Coroner | 72% |
| County Clerk / Fiscal | 69% |
| Sheriff | 49% |
| School Board | 30.8% |
| City Council | 10% |
By State (Selected)
School board uncontested rates vary significantly:
| State | Total Races | Uncontested | Rate |
|---|---|---|---|
| MN | 1,247 | 891 | 71.4% |
| PA | 892 | 412 | 46.2% |
| TX | 1,034 | 347 | 33.6% |
| NC | 284 | 78 | 27.5% |
| GA | 312 | 61 | 19.6% |
| OH | 523 | 89 | 17.0% |
| CA | 648 | 42 | 6.5% |
Minnesota’s high rate (71.4%) reflects the same pattern seen in its overall uncontested rate — many small school districts in rural areas where recruiting candidates is difficult. California’s low rate (6.5%) reflects larger districts with more political activity and media coverage.
Multi-Seat Complications
School boards are disproportionately multi-seat contests. A “vote for 3” race with 4 candidates is technically contested, but only one seat is competitive. A “vote for 3” race with 3 candidates is uncontested even though it looks like it has plenty of names on the ballot.
The Python recipe above handles this correctly: a race is uncontested if len(candidates) <= vote_for. Margins are computed at the win/lose boundary (Nth place vs N+1th place), not between 1st and 2nd.
When vote_for is missing from the source data, the default is 1 (single-seat). This undercounts uncontested multi-seat races and overestimates competitiveness. The vote_for field is available in MEDSL for most states. NC SBE does not provide it — it must be inferred from contest name patterns like “VOTE FOR 3” or “ELECT TWO.”
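Inferring vote_for from the contest name can be sketched with a regular expression. The two patterns handled here, "VOTE FOR 3" and "ELECT TWO", are the ones named above; the optional "UP TO" wording, the number words beyond those examples, and the function name infer_vote_for are assumptions, and real contest names will need more variants.

```python
import re

_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
          "six": 6, "seven": 7, "eight": 8, "nine": 9}
_PATTERN = re.compile(
    r"\b(?:vote\s+for|elect)\s+(?:up\s+to\s+)?(\d+|" + "|".join(_WORDS) + r")\b",
    re.IGNORECASE,
)


def infer_vote_for(contest_name, default=1):
    """Extract the seat count from a contest name, or fall back to `default`."""
    m = _PATTERN.search(contest_name)
    if not m:
        return default
    token = m.group(1).lower()
    return int(token) if token.isdigit() else _WORDS[token]


print(infer_vote_for("TOWN COUNCIL (VOTE FOR 3)"))    # 3
print(infer_vote_for("BOARD OF EDUCATION ELECT TWO"))  # 2
print(infer_vote_for("COLUMBUS COUNTY SHERIFF"))       # 1
```

Falling back to 1 reproduces the default described above, so contests whose names carry no seat count are still treated as single-seat.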
Cross-References
- Closest Races in America — all office types, not just school boards
- Uncontested Race Rate — national uncontested analysis with full office-type breakdown
- Office Inventory — what school board districts exist in a given county
Office Inventory for a County
Question: What elected offices exist in Columbus County, North Carolina?
The ability to answer “what do people actually vote for in my county?” is one of the most requested features from election administrators. No existing public tool answers this question comprehensively. County clerk websites list some offices. Ballotpedia covers high-profile races. But a complete inventory of every elected position in a single county, drawn from certified election results, does not exist in any unified format.
Method
Filter NC SBE data for Columbus County, contest type C (candidate races), and list distinct contest names. Each unique contest name represents an elected office (or a seat within a multi-seat office). Group by office level for structure.
jq Approach
# Extract distinct contest names for Columbus County from L1 cleaned output
cat l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl \
| jq -r 'select(.jurisdiction.county == "COLUMBUS" and .contest.kind == "candidate_race") | .contest.raw_name' \
| sort -u
Output:
BOARD OF COMMISSIONERS DISTRICT 1
BOARD OF COMMISSIONERS DISTRICT 3
BOARD OF COMMISSIONERS DISTRICT 5
BOLTON TOWN COUNCIL
BOLTON TOWN MAYOR
BRUNSWICK COMMUNITY COLLEGE BOARD OF TRUSTEES
CHADBOURN TOWN COUNCIL
CHADBOURN TOWN MAYOR
COLUMBUS COUNTY CLERK OF SUPERIOR COURT
COLUMBUS COUNTY REGISTER OF DEEDS
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 01
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 03
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 04
COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 05
COLUMBUS COUNTY SHERIFF
DISTRICT COURT JUDGE DISTRICT 13B SEAT 02
DISTRICT COURT JUDGE DISTRICT 13B SEAT 04
NC COURT OF APPEALS JUDGE SEAT 09
NC COURT OF APPEALS JUDGE SEAT 11
NC HOUSE OF REPRESENTATIVES DISTRICT 046
NC SENATE DISTRICT 08
SOUTH COLUMBUS HIGH SCHOOL DISTRICT BD OF ED
SUPERIOR COURT JUDGE DISTRICT 13B SEAT 01
US HOUSE OF REPRESENTATIVES DISTRICT 07
25 distinct elected offices on the 2022 general election ballot in Columbus County.
Structured by Office Level
cat l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl \
| jq -r 'select(.jurisdiction.county == "COLUMBUS" and .contest.kind == "candidate_race") | "\(.contest.office_level)\t\(.contest.raw_name)"' \
| sort -u
Python — grouped inventory with candidate counts
import json
from collections import defaultdict
offices = defaultdict(lambda: {"candidates": set(), "level": "other"})
with open("l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["jurisdiction"]["county"] != "COLUMBUS":
            continue
        if r["contest"]["kind"] != "candidate_race":
            continue
        key = r["contest"]["raw_name"]
        offices[key]["level"] = r["contest"].get("office_level") or "other"
        for result in r.get("results", []):
            offices[key]["candidates"].add(result["candidate_name"]["raw"])
# Group by level
by_level = defaultdict(list)
for name, info in offices.items():
    by_level[info["level"]].append((name, len(info["candidates"])))
# Known levels first, then any others, so that no office is silently dropped
known = ["federal", "state", "judicial", "county", "school_district", "municipal"]
for level in known + sorted(set(by_level) - set(known)):
    entries = sorted(by_level.get(level, []))
    if not entries:
        continue
    print(f"\n{level.upper()} ({len(entries)} offices)")
    for name, n_candidates in entries:
        contested = "contested" if n_candidates > 1 else "uncontested"
        print(f"  {name}: {n_candidates} candidate(s), {contested}")
Results
Federal (1 office)
| Office | Candidates | Status |
|---|---|---|
| US HOUSE OF REPRESENTATIVES DISTRICT 07 | 3 | Contested |
State (2 offices)
| Office | Candidates | Status |
|---|---|---|
| NC HOUSE OF REPRESENTATIVES DISTRICT 046 | 2 | Contested |
| NC SENATE DISTRICT 08 | 2 | Contested |
Judicial (5 offices)
| Office | Candidates | Status |
|---|---|---|
| DISTRICT COURT JUDGE DISTRICT 13B SEAT 02 | 2 | Contested |
| DISTRICT COURT JUDGE DISTRICT 13B SEAT 04 | 1 | Uncontested |
| NC COURT OF APPEALS JUDGE SEAT 09 | 2 | Contested |
| NC COURT OF APPEALS JUDGE SEAT 11 | 2 | Contested |
| SUPERIOR COURT JUDGE DISTRICT 13B SEAT 01 | 1 | Uncontested |
County (6 offices)
| Office | Candidates | Status |
|---|---|---|
| BOARD OF COMMISSIONERS DISTRICT 1 | 2 | Contested |
| BOARD OF COMMISSIONERS DISTRICT 3 | 2 | Contested |
| BOARD OF COMMISSIONERS DISTRICT 5 | 2 | Contested |
| COLUMBUS COUNTY CLERK OF SUPERIOR COURT | 1 | Uncontested |
| COLUMBUS COUNTY REGISTER OF DEEDS | 2 | Contested |
| COLUMBUS COUNTY SHERIFF | 2 | Contested |
School District (6 offices)
| Office | Candidates | Status |
|---|---|---|
| COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 01 | 2 | Contested |
| COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02 | 2 | Contested |
| COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 03 | 1 | Uncontested |
| COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 04 | 2 | Contested |
| COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 05 | 2 | Contested |
| SOUTH COLUMBUS HIGH SCHOOL DISTRICT BD OF ED | 2 | Contested |
Municipal (4 offices)
| Office | Candidates | Status |
|---|---|---|
| BOLTON TOWN COUNCIL | 3 | Contested |
| BOLTON TOWN MAYOR | 1 | Uncontested |
| CHADBOURN TOWN COUNCIL | 4 | Contested |
| CHADBOURN TOWN MAYOR | 2 | Contested |
Note: Municipal offices appear only for towns holding elections in the 2022 general. Other Columbus County municipalities (Whiteville, Fair Bluff, Tabor City) may hold elections in odd years or at different times.
What This Reveals
Columbus County, population ~55,000, has 25 elected offices spread across its 2022 general election ballots. No single voter sees all 25: a voter in Bolton who lives in school district 02 sees the contests that cover their address, from US House down to Bolton Town Council, but not, for example, Chadbourn's municipal races.
The breakdown by level:
| Level | Offices | Uncontested |
|---|---|---|
| Federal | 1 | 0 |
| State | 2 | 0 |
| Judicial | 5 | 2 |
| County | 6 | 1 |
| School District | 6 | 1 |
| Municipal | 4 | 1 |
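| Total | 24 | 5 |
The six levels account for 24 offices. The 25th contest name in the jq output, BRUNSWICK COMMUNITY COLLEGE BOARD OF TRUSTEES, is not listed under any of them.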
Five of 25 offices — 20% — are uncontested. This is below the national average (48.8%), suggesting Columbus County is more competitive than typical. The contested sheriff race is notable given that 55% of NC sheriffs run unopposed statewide.
Adapting for Other Counties
Replace "COLUMBUS" with any NC county name in the filter. For non-NC counties covered by MEDSL data, query the flat export instead: filter on state and county and read contest_name:
cat flat_export.jsonl \
| jq -r 'select(.state == "TX" and .county == "HARRIS") | .contest_name' \
| sort -u \
| wc -l
Harris County, TX returns 80+ distinct contest names — including 25 district court judge seats, multiple constable precincts, and JP courts. The office inventory scales from rural Columbus County (25 offices) to urban Harris County (80+) with the same query.
Cross-References
- Office Classification — how office names are classified into levels
- Contest Disambiguation — why “DISTRICT COURT JUDGE” needs a seat number
- Uncontested Race Rate — national context for the 20% uncontested rate
Career Tracking Across Elections
Question: Who has served longest on a local body in North Carolina, and how many candidates appear across multiple election cycles?
Method
Group NC SBE data by (county, candidate_canonical) across all available election years (2006–2024). Count distinct election years per candidate. Rank by cycle count descending.
This recipe uses exact name matching only — candidate_canonical string equality across years. Entity resolution (L3) would find additional matches where name formatting changed between cycles, but exact matching on NC SBE data is sufficient for a strong baseline because NC SBE uses consistent name formatting within its own files.
Python
import json
from collections import defaultdict

# (county, candidate) -> election years, offices sought, and county
careers = defaultdict(lambda: {"years": set(), "offices": set(), "county": ""})

with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["state"] != "NC":
            continue
        # Skip write-in placeholder rows
        if "write" in r.get("candidate_canonical", "").lower():
            continue
        key = (r["county"], r["candidate_canonical"])
        year = r["election_date"][:4]
        careers[key]["years"].add(year)
        careers[key]["offices"].add(r["contest_name"])
        careers[key]["county"] = r["county"]

# Sort by number of distinct election years
ranked = sorted(careers.items(), key=lambda x: -len(x[1]["years"]))

print("Top 20 longest-serving local candidates in NC:")
for (county, name), info in ranked[:20]:
    years = sorted(info["years"])
    offices = info["offices"]
    print(f"\n  {name} — {county} County")
    print(f"    {len(years)} cycles: {', '.join(years)}")
    print(f"    Offices: {'; '.join(sorted(offices)[:3])}")
jq Approach
# Extract unique (county, candidate, year) triples
jq -r 'select(.state == "NC") | "\(.county)\t\(.candidate_canonical)\t\(.election_date[:4])"' \
flat_export.jsonl \
| sort -u \
| grep -vi write \
> nc_candidate_years.tsv
# Count distinct years per (county, candidate)
cut -f1,2 nc_candidate_years.tsv \
| sort | uniq -c | sort -rn | head -20
Results
The Longest Tenure: George Dunlap
George Dunlap — Mecklenburg County Commissioner — appears in 6 consecutive election cycles from 2014 through 2024:
| Year | Office | Result |
|---|---|---|
| 2014 | Mecklenburg County Board of Commissioners | Won |
| 2016 | Mecklenburg County Board of Commissioners | Won |
| 2018 | Mecklenburg County Board of Commissioners | Won |
| 2020 | Mecklenburg County Board of Commissioners | Won |
| 2022 | Mecklenburg County Board of Commissioners | Won |
| 2024 | Mecklenburg County Board of Commissioners | Won |
Six cycles of county commission service in North Carolina’s most populous county (Charlotte metro area, population ~1.1 million). Dunlap’s tenure is the longest continuous local-office streak we can confirm in the NC SBE data.
Career Paths: Paul Beaumont
Not all multi-cycle candidates hold the same office. Paul Beaumont of Currituck County appears across 5 cycles with a distinctive career path:
| Year | Office |
|---|---|
| 2014 | Currituck County Board of Commissioners |
| 2016 | Currituck County Board of Education |
| 2018 | Currituck County Board of Education |
| 2020 | Currituck County Board of Commissioners |
| 2022 | Currituck County Board of Commissioners |
Beaumont moved from county commission to school board and back — a lateral move between two different governing bodies in the same county. This pattern is invisible in single-election snapshots. Only multi-year tracking reveals it.
Statewide Scale
Across NC SBE data from 2014–2024 (6 election cycles), using exact name matching:
| Cycles | Candidates | Interpretation |
|---|---|---|
| 6 | 12 | Full-tenure incumbents (every cycle since 2014) |
| 5 | 47 | Near-continuous service |
| 4 | 134 | Two full terms for most local offices |
| 3 | 702 | At least three appearances over a decade |
| 2 | 2,841 | Reelected once or ran twice |
| 1 | 18,394 | Single appearance (includes one-term, defeated, and new candidates) |
702 candidates appear in 3 or more election cycles in NC alone. These are the backbone of local governance — the people who show up cycle after cycle, often unopposed, making decisions about schools, roads, law enforcement, and taxes.
What Entity Resolution Would Add
The 702 figure is a lower bound. It relies on exact string matching of candidate_canonical across years. Entity resolution (L3) would identify additional multi-cycle candidates where:
- NC SBE changed name formatting between years (e.g., middle initial added or dropped)
- A candidate changed their legal name (marriage, legal name change)
- A minor typo in one year’s file broke the exact match
With entity resolution, we estimate the true 3+-cycle count is 800–900 candidates. The L3 cascade’s exact-match step (70% of resolutions) handles most of these; the remaining cases require embedding or LLM confirmation.
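One rough way to see where exact matching splits a candidate, before running L3 at all, is to group names under a looser key and list the groups that contain more than one exact spelling. The sketch below is an illustration only and is not the L3 cascade: `loose_key` and `exact_match_splits` are hypothetical helpers, the names are invented, and the key (first and last token, casefolded, punctuation stripped) can merge distinct people and mishandles suffixes such as "Jr".

```python
import re
from collections import defaultdict


def loose_key(county, name):
    """Hypothetical relaxed key: county plus first and last name token."""
    tokens = re.sub(r"[^\w\s]", "", name).casefold().split()
    if not tokens:
        return (county.casefold(), "", "")
    return (county.casefold(), tokens[0], tokens[-1])


def exact_match_splits(rows):
    """Return loose-key groups that contain more than one exact spelling."""
    spellings = defaultdict(set)
    for county, name, _year in rows:
        spellings[loose_key(county, name)].add(name)
    return {key: names for key, names in spellings.items() if len(names) > 1}


# Invented rows in the (county, candidate, year) shape used by this recipe
rows = [
    ("WAKE", "Jane Q. Doe", "2018"),
    ("WAKE", "Jane Doe", "2020"),
    ("WAKE", "John Roe", "2020"),
]
candidates_to_review = exact_match_splits(rows)
```

Each group returned is a candidate for review rather than a confirmed match; deciding whether two spellings are the same person is the job of the L3 cascade.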
Variations
Filter to a specific office type
# School board only
school_careers = {k: v for k, v in careers.items()
if any("school" in o.lower() or "education" in o.lower() for o in v["offices"])}
Track office changes (like Beaumont)
# Find candidates who held different offices across years
switchers = {k: v for k, v in careers.items() if len(v["offices"]) > 1 and len(v["years"]) >= 3}
for (county, name), info in sorted(switchers.items(), key=lambda x: -len(x[1]["years"]))[:10]:
    print(f"{name} ({county}): {len(info['years'])} cycles, {len(info['offices'])} different offices")
Compare to other states
Career tracking across states requires MEDSL data, which uses different name formatting than NC SBE. Cross-source entity resolution (L3) is required. Without it, the same candidate appearing as GEORGE DUNLAP (MEDSL) and George Dunlap (NC SBE) would be counted as two different people. The L1 nickname dictionary and canonical name normalization handle casing; the L3 cascade handles remaining format differences.
Prerequisites
- NC SBE data for 2014–2024 (6 cycles minimum for full results)
- L4 flat export with entity-resolved candidate IDs (for the entity-resolution-enhanced count)
- For exact-match-only analysis, L1 output is sufficient — no API keys required
Cross-References
- Uncontested Race Rate — many long-tenure candidates run unopposed
- Office Inventory — what offices exist in a given county
- Entity Resolution — how cross-year matching works
Verify a Specific Result
Question: Can I verify that “Timothy Lance got 303 votes in precinct P17”? Can I trace that number back to the original source file?
Yes. The hash chain links every L4 canonical record back through L3, L2, and L1 to the raw bytes of the L0 source file. This recipe walks the chain step by step using jq.
The Claim
A researcher sees this record in the L4 flat export:
Timothy Lance — 303 votes — Precinct P17 — Columbus County Schools Board of Education District 02 — NC — 2022-11-08
They want to verify it. Here is how.
Step 1: Find the L4 Record
Start at the L4 flat export and locate the record:
jq -c 'select(
.candidate_canonical == "Timothy Lance"
and .county == "COLUMBUS"
and .votes_total == 303
)' l4_canonical/exports/flat_export.jsonl
Output:
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","contest_name":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02","candidate_canonical":"Timothy Lance","candidate_entity_id":"person:nc:columbus:lance-timothy-13","votes_total":303,"source":"nc_sbe","l3_hash":"28183d41d50204d5","l0_hash":"edfedf2760cfd54f"}
Note the two hash values:
- l3_hash: 28183d41d50204d5 (links to the L3 matched record)
- l0_hash: edfedf2760cfd54f (shortcut to the L0 source file)
Step 2: Follow l3_hash to L3
Look up the L3 matched record by its hash:
jq -c 'select(.l3.l3_hash == "28183d41d50204d5")' \
l3_matched/NC/2022/matched.jsonl
Key fields in the output:
{
"l3": {
"l3_hash": "28183d41d50204d5",
"l2_parent_hash": "854fa6367960bb05",
"candidate_entity_ids": [
{"result_index": 0, "entity_id": "person:nc:columbus:lance-timothy-13"}
],
"contest_entity_id": "contest:nc:columbus:school-board-d02"
}
}
This tells you:
- The entity resolution cascade assigned Timothy Lance to entity person:nc:columbus:lance-timothy-13.
- The contest was assigned to contest:nc:columbus:school-board-d02.
- The L2 parent hash is 854fa6367960bb05.
Step 3: Follow l2_parent_hash to L2
Look up the L2 embedded record:
jq -c 'select(.l2.l2_hash == "854fa6367960bb05")' \
l2_embedded/NC/2022/enriched.jsonl
Key fields:
{
"l2": {
"l2_hash": "854fa6367960bb05",
"l1_parent_hash": "8ea7ecc257ff8e05",
"embedding_model": "text-embedding-3-large",
"embedding_dimensions": 3072,
"candidate_composite": "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus",
"quality_flags": []
}
}
This tells you:
- The embedding model was text-embedding-3-large with 3,072 dimensions.
- The composite string used for embedding was Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus.
- No quality flags were raised.
- The L1 parent hash is 8ea7ecc257ff8e05.
Step 4: Follow l1_parent_hash to L1
Look up the L1 cleaned record:
jq -c 'select(.provenance.l1_hash == "8ea7ecc257ff8e05")' \
l1_cleaned/nc_sbe/NC/2022/cleaned.jsonl
Key fields:
{
"jurisdiction": {
"state": "NC", "state_fips": "37",
"county": "COLUMBUS", "county_fips": "37047",
"precinct": "P17"
},
"results": [{
"candidate_name": {
"raw": "Timothy Lance", "first": "Timothy",
"middle": null, "last": "Lance",
"suffix": null, "canonical_first": "Timothy"
},
"votes_total": 303,
"vote_counts_by_type": {
"election_day": 136, "early": 159,
"absentee_mail": 7, "provisional": 1
}
}],
"provenance": {
"l1_hash": "8ea7ecc257ff8e05",
"l0_parent_hash": "edfedf2760cfd54f",
"parser_version": "nc_sbe_v2.1",
"schema_version": "3.0.0"
}
}
This tells you:
- The 303 votes break down to 136 election day + 159 early + 7 absentee + 1 provisional.
- The name was parsed as first=“Timothy”, last=“Lance”, no middle, no suffix.
- The parser version was nc_sbe_v2.1.
- The L0 parent hash is edfedf2760cfd54f.
Step 5: Follow l0_parent_hash to L0
Look up the L0 manifest:
cat l0_raw/nc_sbe/results_pct_20221108.txt.manifest.json
Output:
{
"l0_hash": "edfedf2760cfd54f",
"source_url": "https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip",
"retrieval_date": "2026-03-18T14:30:00Z",
"file_size_bytes": 18023456,
"format_detected": "tsv"
}
Step 6: Verify L0 Against the Source
Recompute the SHA-256 of the raw file and compare:
# macOS
shasum -a 256 l0_raw/nc_sbe/results_pct_20221108.txt
# Linux
sha256sum l0_raw/nc_sbe/results_pct_20221108.txt
If the output starts with edfedf2760cfd54f..., the raw file is intact — it matches the bytes the pipeline processed.
To verify against the authoritative source independently, download the file yourself:
curl -O https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip
unzip results_pct_20221108.zip
shasum -a 256 results_pct_20221108.txt
If your hash begins with the manifest’s l0_hash (the manifest stores a 16-character prefix of the 64-character SHA-256 digest), you and the pipeline processed identical bytes. The vote count of 303 for Timothy Lance in precinct P17 traces directly to those bytes.
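The same check can be scripted. The sketch below is a minimal example, assuming (as the commands above suggest) that l0_hash is a leading prefix of the file's SHA-256 hex digest; `sha256_hex` and `matches_l0_hash` are hypothetical helper names, not part of the pipeline.

```python
import hashlib
import io


def sha256_hex(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files are never fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_l0_hash(stream, l0_hash):
    """True if the stream's SHA-256 digest starts with the manifest's l0_hash."""
    if not l0_hash:
        return False  # an empty prefix would match anything
    return sha256_hex(stream).startswith(l0_hash.lower())


# With the real file (path taken from the steps above):
# with open("l0_raw/nc_sbe/results_pct_20221108.txt", "rb") as f:
#     print(matches_l0_hash(f, "edfedf2760cfd54f"))

# Self-contained demo on in-memory bytes
demo_ok = matches_l0_hash(io.BytesIO(b"abc"), "ba7816bf8f01cfea")
```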
The Full Chain
L4 flat_export.jsonl
candidate_canonical = "Timothy Lance", votes_total = 303
l3_hash = 28183d41d50204d5
│
L3 matched.jsonl
entity_id = person:nc:columbus:lance-timothy-13
l2_parent_hash = 854fa6367960bb05
│
L2 enriched.jsonl
embedding_model = text-embedding-3-large
candidate_composite = "Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus"
l1_parent_hash = 8ea7ecc257ff8e05
│
L1 cleaned.jsonl
votes_total = 303 (136 + 159 + 7 + 1)
parser_version = nc_sbe_v2.1
l0_parent_hash = edfedf2760cfd54f
│
L0 results_pct_20221108.txt
l0_hash = edfedf2760cfd54f
source_url = https://s3.amazonaws.com/dl.ncsbe.gov/ENRS/2022_11_08/results_pct_20221108.zip
Every link is independently verifiable. Recompute any hash from the record content plus the parent hash. If it matches the stored value, the record has not been tampered with.
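For more than a handful of records, the jq lookups in Steps 2 through 4 can be replaced by a short script that follows the parent pointers. This is a sketch under assumptions: it uses the field names shown in the step outputs, takes already-parsed records as input, and `index_by` and `walk_chain` are hypothetical helpers. It follows the links only; it does not recompute the hashes.

```python
def index_by(records, *path):
    """Index records by the value found at a nested key path."""
    index = {}
    for record in records:
        value = record
        for key in path:
            value = value[key]
        index[value] = record
    return index


def walk_chain(l3_hash, l3_records, l2_records, l1_records):
    """Follow L3 -> L2 -> L1 -> L0 parent hashes; a broken link raises KeyError."""
    l3 = index_by(l3_records, "l3", "l3_hash")[l3_hash]
    l2 = index_by(l2_records, "l2", "l2_hash")[l3["l3"]["l2_parent_hash"]]
    l1 = index_by(l1_records, "provenance", "l1_hash")[l2["l2"]["l1_parent_hash"]]
    return {
        "l3_hash": l3_hash,
        "l2_hash": l2["l2"]["l2_hash"],
        "l1_hash": l1["provenance"]["l1_hash"],
        "l0_hash": l1["provenance"]["l0_parent_hash"],
    }


# Records trimmed to the hash fields shown in the steps above
l3_records = [{"l3": {"l3_hash": "28183d41d50204d5", "l2_parent_hash": "854fa6367960bb05"}}]
l2_records = [{"l2": {"l2_hash": "854fa6367960bb05", "l1_parent_hash": "8ea7ecc257ff8e05"}}]
l1_records = [{"provenance": {"l1_hash": "8ea7ecc257ff8e05", "l0_parent_hash": "edfedf2760cfd54f"}}]

chain = walk_chain("28183d41d50204d5", l3_records, l2_records, l1_records)
```

The l0_hash the walk arrives at should equal the l0_hash shortcut stored on the L4 record; a mismatch means the shortcut and the chain disagree.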
Prototype Validation
In our 200-record prototype, we verified the full hash chain for every record:
| Metric | Result |
|---|---|
| Records verified | 200 / 200 |
| Broken chains | 0 |
| Layers traversed | 5 per record |
| Total hash verifications | 1,000 |
Zero broken links. Every vote count traces back to the raw NC SBE file bytes.
When to Use This
- Fact-checking. A journalist writing “Timothy Lance received 303 votes” can cite the hash chain as evidence.
- Auditing. A researcher who finds an unexpected result can walk the chain to determine whether the issue is in the source data (L0), the parser (L1), the entity resolution (L3), or the aggregation (L4).
- Dispute resolution. If two researchers disagree on a number, both can verify the chain. If both chains are intact and both start from the same L0 hash, the number is correct. If the L0 hashes differ, one of them has a different version of the source file; check retrieval_date in the manifest.
The Two Audiences
This project serves two audiences with fundamentally different trust requirements. Engineers need to verify the pipeline mechanically. Consumers — journalists, researchers, government staff — need to understand what the data means and how much to trust it without reading source code.
This chapter describes what each audience sees and how the two views connect.
What engineers see
Engineers interact with the pipeline’s internal machinery. Trust, for this audience, is a function of determinism, traceability, and mechanical reproducibility.
Hash chains. Every record carries a provenance.hash field — a SHA-256 hash of the record’s content at each layer. L1 records hash L0 input bytes. L2 records hash L1 output plus the embedding model version. L3 records hash L2 output plus the decision log entry. L4 records hash L3 output. Any mutation at any layer invalidates all downstream hashes.
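To illustrate why a mutation invalidates everything downstream, here is a simplified sketch of a chained hash. It assumes each layer hashes its own content together with its parent's hash, and serializes content as sorted-key JSON; both choices, and the names `layer_hash` and `build_chain`, are assumptions made for the example rather than the pipeline's actual scheme.

```python
import hashlib
import json


def layer_hash(content, parent_hash):
    """SHA-256 over a canonical JSON form of the content plus the parent hash."""
    payload = json.dumps(content, sort_keys=True, separators=(",", ":")) + parent_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(l0_hash, layers):
    """Return [l0_hash, L1 hash, L2 hash, ...], each derived from the one before."""
    hashes = [l0_hash]
    for content in layers:
        hashes.append(layer_hash(content, hashes[-1]))
    return hashes


layers = [
    {"layer": "L1", "votes_total": 303},
    {"layer": "L2", "embedding_model": "text-embedding-3-large"},
    {"layer": "L3", "entity_id": "person:nc:columbus:lance-timothy-13"},
    {"layer": "L4", "candidate_canonical": "Timothy Lance"},
]
original = build_chain("edfedf2760cfd54f", layers)

# Change one vote at L1 and rebuild: every hash from L1 onward differs
tampered_layers = [dict(layers[0], votes_total=304)] + layers[1:]
tampered = build_chain("edfedf2760cfd54f", tampered_layers)
```

Because each hash is an input to the next, comparing the final L4 hash is enough to detect a change at any earlier layer.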
Decision logs. Every non-deterministic operation at L3 — embedding similarity matches, LLM-confirmed entity resolutions — is recorded in a JSONL decision log. Each entry includes:
- decision_id: a unique identifier for the decision
- method: one of exact, jaro_winkler, embedding, llm
- score: the similarity or confidence score (where applicable)
- input_record_hashes: the L2 records being compared
- output: the resolution (match, no-match, or merge)
- llm_request_id: the API request ID (for LLM decisions only)
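As a concrete picture of one log line, the sketch below models an entry with the six fields listed above and round-trips it through JSONL. The `Decision` class, its validation rules, and the sample IDs and hashes are illustrative assumptions; the real log may carry more fields.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

METHODS = {"exact", "jaro_winkler", "embedding", "llm"}


@dataclass
class Decision:
    decision_id: str
    method: str
    input_record_hashes: List[str]
    output: str
    score: Optional[float] = None
    llm_request_id: Optional[str] = None

    def __post_init__(self):
        if self.method not in METHODS:
            raise ValueError(f"unknown method: {self.method}")
        if self.llm_request_id is not None and self.method != "llm":
            raise ValueError("llm_request_id is only valid for LLM decisions")


def dump_log(decisions):
    """Serialize decisions as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(asdict(d), sort_keys=True) for d in decisions)


def load_log(text):
    """Parse JSONL text back into Decision objects."""
    return [Decision(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Placeholder IDs and hashes, for illustration only
log = [
    Decision("d-0001", "exact", ["hash-a", "hash-b"], "match"),
    Decision("d-0002", "llm", ["hash-c", "hash-d"], "match",
             score=0.95, llm_request_id="req-placeholder"),
]
round_tripped = load_log(dump_log(log))
```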
Embedding IDs. Every embedding generated at L2 is tagged with the model identifier (text-embedding-3-large), the embedding dimension (3072), and the composite string template used to generate the input text. If the model or template changes, all L2 records are regenerated — not patched.
Layer manifests. Each layer’s output directory contains a manifest.jsonl file listing every output file, its row count, its SHA-256 hash, the pipeline version that produced it, and the timestamp of generation. Manifests are the unit of verification: compare two manifests to determine whether a pipeline run produced identical output.
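Comparing two manifests can be sketched as a dictionary diff keyed by file name. The field names `file`, `rows`, and `sha256` below are assumptions about the manifest layout, the hash values are shortened placeholders, and the comparison works at file level only.

```python
import json


def load_manifest(text):
    """Parse manifest JSONL into {file name: entry}."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            entry = json.loads(line)
            entries[entry["file"]] = entry
    return entries


def diff_manifests(a, b):
    """Report files present in only one run and files whose hash changed."""
    files = sorted(set(a) | set(b))
    return {
        "only_in_a": [f for f in files if f not in b],
        "only_in_b": [f for f in files if f not in a],
        "changed": [f for f in files
                    if f in a and f in b and a[f]["sha256"] != b[f]["sha256"]],
    }


run_a = "\n".join([
    json.dumps({"file": "NC/2022/cleaned.jsonl", "rows": 200, "sha256": "aaa"}),
    json.dumps({"file": "NC/2022/matched.jsonl", "rows": 200, "sha256": "bbb"}),
])
run_b = "\n".join([
    json.dumps({"file": "NC/2022/cleaned.jsonl", "rows": 200, "sha256": "aaa"}),
    json.dumps({"file": "NC/2022/matched.jsonl", "rows": 200, "sha256": "ccc"}),
    json.dumps({"file": "NC/2022/enriched.jsonl", "rows": 200, "sha256": "ddd"}),
])
result = diff_manifests(load_manifest(run_a), load_manifest(run_b))
```

Two runs produced identical output when all three lists are empty.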
What consumers see
Consumers interact with query results, summary statistics, and exported datasets. Trust, for this audience, is a function of source attribution, stated confidence, and transparent methodology.
Source names. Every record in consumer-facing output includes a human-readable source name: “NC SBE (certified)”, “MEDSL 2022”, “OpenElections (community-curated)”. The source name tells the consumer where the data came from and how it was collected.
Confidence levels. Every record carries a confidence level: high, medium, or low. See Confidence Levels for definitions. Consumers can filter by confidence to match their tolerance for uncertainty.
Methodology page. Any published dataset includes a methodology section describing the pipeline version, source versions, and processing steps used. This is the consumer-facing equivalent of the manifest.
Bridge table
The following table maps consumer-facing fields to their internal pipeline equivalents. If you see a value in a consumer export and need to trace it, this is where to start.
| Consumer-facing field | Example value | Internal pipeline field | Layer |
|---|---|---|---|
| Source | NC SBE (certified) | source.source_type = nc_sbe, source.certification = certified | L1 |
| Confidence | High | provenance.confidence = high | L1–L4 |
| Candidate name | John A. Smith Jr. | candidate.canonical_first = john, candidate.canonical_last = smith, candidate.suffix = jr | L4 |
| Office | County Commissioner District 3 | contest.canonical_office = county_commissioner, contest.district = 3 | L4 |
| Vote total | 12,847 | votes.total = 12847 | L1 |
| Match method | Algorithmic (exact) | entity_resolution.method = exact | L3 |
| Match method | LLM-confirmed | entity_resolution.method = llm, entity_resolution.decision_id = d-2024-00417 | L3 |
| Jurisdiction | Mecklenburg County, NC | jurisdiction.county_fips = 37119, jurisdiction.state = NC | L1 |
| Election date | 2022-11-08 | election.date = 2022-11-08 | L1 |
| Party | Democratic | candidate.party = DEM | L1 |
Reproducibility by layer
Not all layers are equally reproducible. The guarantees differ based on whether a layer involves external API calls.
L0 → L1: Deterministic. L1 is a pure function of L0 input and the pipeline code. Same input, same code version, same output — byte-identical. No external calls. No randomness.
L1 → L2: Deterministic. L2 adds embeddings generated by text-embedding-3-large (3072 dimensions). The embedding API is deterministic for a given model version and input string. Same L1 input, same model version, same output. If OpenAI retires or modifies the model, a pinned model version in the manifest allows detection (though not reproduction without the original model).
L2 → L3: Replayable from decision log. L3 involves entity resolution — some of which uses embedding cosine similarity (deterministic given L2) and some of which calls Claude Sonnet for confirmation. LLM calls are not deterministic: the same prompt may produce different text on different days. However, every LLM decision is recorded in the decision log with its output. Replaying L3 from the decision log — rather than re-calling the LLM — produces identical output. The decision log is the reproducibility mechanism for L3.
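Replay can be pictured as a lookup that stands in for the LLM call. In this sketch, decisions are keyed by the unordered pair of input record hashes; that keying, the helper names `index_log` and `replay`, and the sample hashes are assumptions for illustration. A pair with no logged decision is treated as an error rather than silently re-asking the model.

```python
def index_log(entries):
    """Key each logged decision by the unordered pair of record hashes it compared."""
    return {frozenset(e["input_record_hashes"]): e for e in entries}


def replay(pairs, log_index):
    """Resolve pairs from the log only; a missing decision raises LookupError."""
    resolved = []
    for a, b in pairs:
        entry = log_index.get(frozenset((a, b)))
        if entry is None:
            raise LookupError(f"no logged decision for pair ({a}, {b})")
        resolved.append((a, b, entry["output"]))
    return resolved


# Placeholder hashes, for illustration only
log_entries = [
    {"decision_id": "d-0001", "method": "llm", "score": 0.78,
     "input_record_hashes": ["hash-a", "hash-b"], "output": "match"},
    {"decision_id": "d-0002", "method": "embedding", "score": 0.91,
     "input_record_hashes": ["hash-c", "hash-d"], "output": "no-match"},
]
log_index = index_log(log_entries)

first_run = replay([("hash-a", "hash-b"), ("hash-d", "hash-c")], log_index)
second_run = replay([("hash-a", "hash-b"), ("hash-d", "hash-c")], log_index)
```

Both runs return the same resolutions regardless of what the model would say today, which is the property the decision log exists to provide.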
L3 → L4: Deterministic. L4 is a deterministic function of L3 output. It selects canonical names, assigns canonical IDs, and merges duplicate records. Same L3, same L4.
End-to-end reproducibility. To fully reproduce a dataset:
- Check out the tagged pipeline version from the repository.
- Obtain the same L0 source files (verified by hash against the L0 manifest).
- Run L0 → L2. Verify output hashes against the L2 manifest.
- Apply the published decision log to produce L3. Verify against the L3 manifest.
- Run L3 → L4. Verify against the L4 manifest.
If all manifest hashes match, the reproduction is exact. If any hash diverges, the manifest diff identifies exactly which records changed and at which layer.
When the two views diverge
Sometimes engineers and consumers reach different conclusions about the same record:
- An engineer may see that a match was made by LLM with confidence 0.78 and flag it as marginal. A consumer sees “Source: MEDSL, Confidence: Medium” and treats it as usable. Both are correct within their frame.
- An engineer may know that an embedding model version is deprecated. A consumer sees no change in the output. The manifest captures this risk; the consumer-facing confidence level does not (yet).
The bridge table above is the mechanism for resolving these divergences. When in doubt, trace the consumer field back to its pipeline equivalent and inspect the full provenance chain.
Confidence Levels
Every record in the pipeline carries a confidence level that reflects the trustworthiness of its source and the reliability of the processing steps applied to it. Confidence is not a score — it is a categorical label with defined semantics.
Three levels
High
The source is a certified government publication. Examples: NC SBE certified results, state election board portals that publish official canvass data. Records ingested from these sources enter L0 with source.confidence = "high".
High-confidence sources provide vote totals that are legally authoritative. When two sources disagree, the high-confidence source is treated as ground truth.
Medium
The source is a curated academic dataset derived from government publications. Example: MEDSL, which aggregates and reformats state-published results into a consistent schema. The data is one step removed from the original — parsed, cleaned, and sometimes corrected by the MEDSL team.
Medium-confidence sources are reliable for analysis but are not primary. In the 640 overlapping contests between MEDSL and NC SBE, 90.5% have identical vote totals. The 9.5% that differ are typically due to provisional ballot timing or reporting cutoffs.
Low
The source is community-curated, OCR-derived, or otherwise not traceable to a single certified publication. Examples: OpenElections state files with known parsing issues, any data recovered from PDFs via OCR, or crowd-sourced contest metadata.
Low-confidence records are included in the dataset but flagged. They are useful for coverage (filling gaps where no better source exists) but should not be cited without independent verification.
How confidence propagates
Confidence is not static. It can degrade as records pass through the pipeline, but it never improves without human intervention.
Source confidence (L0–L1). Set at ingestion based on the source type. Deterministic — the same source always gets the same level.
Match confidence (L3). Entity resolution adds a second dimension. If the match method is deterministic (exact string match or Jaro-Winkler ≥ 0.92), the source confidence is preserved. If the match required embedding similarity or LLM confirmation, the record is annotated with the match method and decision ID, but the source confidence is not downgraded — instead, a separate match_confidence field is added.
The combined confidence follows these rules:
| Source confidence | Match method | Overall | Notes |
|---|---|---|---|
| High | Exact | High | Best case. Certified source, deterministic match. |
| High | Jaro-Winkler | High | Algorithmic match above threshold. |
| High | Embedding | High + decision ID | Source still trusted; match is logged. |
| High | LLM | High + decision ID | Source still trusted; LLM rationale recorded. |
| Medium | Exact | Medium | Academic source, deterministic match. |
| Medium | LLM | Medium + decision ID | Both source and match carry caveats. |
| Low | Any | Low | Source uncertainty dominates. |
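The table can be read as a small function. The sketch below is one interpretation: it returns the overall level plus the decision ID where the table attaches one. Combinations the table does not list (for example Medium with Jaro-Winkler or Embedding) are assumed to follow the same pattern, and `overall_confidence` is a hypothetical name.

```python
LEVELS = {"high", "medium", "low"}
METHODS = {"exact", "jaro_winkler", "embedding", "llm"}
LOGGED_METHODS = {"embedding", "llm"}  # methods the table pairs with a decision ID


def overall_confidence(source_confidence, match_method, decision_id=None):
    """Return (overall level, decision ID or None) following the table above."""
    level = source_confidence.lower()
    method = match_method.lower()
    if level not in LEVELS:
        raise ValueError(f"unknown confidence level: {source_confidence}")
    if method not in METHODS:
        raise ValueError(f"unknown match method: {match_method}")
    if level == "low":
        # Table row "Low | Any | Low": source uncertainty dominates
        return ("low", None)
    if method in LOGGED_METHODS:
        if decision_id is None:
            raise ValueError("embedding and LLM matches need a decision ID")
        return (level, decision_id)
    # Exact and Jaro-Winkler matches preserve the source level unchanged
    return (level, None)


best_case = overall_confidence("High", "exact")
llm_case = overall_confidence("Medium", "llm", decision_id="d-2024-00417")
```

Note that the level never rises above the source confidence, whichever match method is used.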
LLM decision tracking
When an LLM (Claude Sonnet) is involved in entity resolution, the pipeline records:
- The decision ID (a unique hash of the prompt, response, and model version)
- The prompt sent to the model
- The model’s response
- The confidence score returned by the model
This allows any LLM-assisted decision to be audited, replayed, or overridden. See When the LLM Gets Called.
How to cite records
When using data from this pipeline in publications, cite the original source, not the pipeline. The pipeline provides the information needed to construct a proper citation.
APA format template:
{Source organization}. ({Year}). {Dataset title} [Data set]. Retrieved {retrieval_date} from {url}.
Example:
North Carolina State Board of Elections. (2022). Official general election results [Data set]. Retrieved 2025-01-15 from https://www.ncsbe.gov/results-data.
Each L4 record includes the fields needed to construct this citation: source.name, source.retrieval_date, source.url, and source.confidence. A methodology link pointing to the pipeline documentation should accompany any analysis that depends on entity resolution or cross-source reconciliation.
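Filling the template is a one-line string format. In this sketch the organization, retrieval date, and URL would come from source.name, source.retrieval_date, and source.url; the dataset title and year are passed in separately, since they are not among the fields listed above. `format_citation` is a hypothetical helper.

```python
def format_citation(organization, year, title, retrieval_date, url):
    """Fill the APA-style template shown above."""
    return (
        f"{organization}. ({year}). {title} [Data set]. "
        f"Retrieved {retrieval_date} from {url}."
    )


citation = format_citation(
    organization="North Carolina State Board of Elections",
    year=2022,
    title="Official general election results",
    retrieval_date="2025-01-15",
    url="https://www.ncsbe.gov/results-data",
)
```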
Reporting Errors
Election data errors are inevitable — misspelled names, transposed digits, misclassified offices. This chapter describes how to report errors, how corrections flow through the pipeline, and how they are documented.
What counts as an error
An error is a factual discrepancy between the pipeline output and the certified source record. Examples:
- A candidate’s vote total does not match the certified result.
- Two candidates are incorrectly resolved as the same person (false positive).
- A single candidate is split into two entities across sources (false negative).
- An office is classified at the wrong level (e.g., county office tagged as state).
- A contest is assigned to the wrong jurisdiction or FIPS code.
Formatting preferences (e.g., “they should use a middle name, not an initial”) are not errors. The pipeline normalizes names according to documented rules; stylistic disagreements are out of scope.
How to report
Include the following in every error report:
- State — two-letter abbreviation.
- County or jurisdiction — as specific as possible.
- Contest — the office name and year.
- Candidate — the name as it appears in the output.
- The error — what is wrong and what the correct value should be.
- Source — how you know the correct value (e.g., link to certified results PDF, county clerk confirmation).
File reports via the project’s GitHub issue tracker using the data-error label. One error per issue. Bulk reports (e.g., “all vote totals for County X are wrong”) should include a CSV attachment with the specific records.
How corrections flow through the pipeline
Corrections are not ad hoc patches. They follow the same layered architecture as all other data.
Report → Review → L3 human override → L4 re-canonicalize → Changelog entry
- Report. An error is filed with the required fields above.
- Review. A maintainer verifies the error against the cited source. If the source confirms the discrepancy, the report is accepted.
- L3 human override. A decision record is added to the L3 decision log with decision_type: "human_override", the reporter’s source citation, and the corrected value. The original machine decision is preserved; overrides do not delete history.
- L4 re-canonicalize. The L4 canonical layer is regenerated from the updated L3 output. Only records affected by the override change.
- Changelog entry. The correction is recorded in the Changelog with the issue number, affected records, and the nature of the fix.
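The override step can be sketched as an append to the decision log. Only decision_type: "human_override" comes from the description above; the `overrides` and `source_citation` field names, the ID scheme, the helper names, and the example URL are assumptions made for illustration.

```python
def add_human_override(log, target_decision_id, corrected_output, source_citation):
    """Return a new log with an override appended; existing entries are untouched."""
    override = {
        "decision_id": f"{target_decision_id}-override-{len(log)}",
        "decision_type": "human_override",
        "overrides": target_decision_id,  # hypothetical back-reference field
        "output": corrected_output,
        "source_citation": source_citation,
    }
    return log + [override]


def effective_output(log, decision_id):
    """The last entry that is, or overrides, the given decision wins."""
    output = None
    for entry in log:
        if entry["decision_id"] == decision_id or entry.get("overrides") == decision_id:
            output = entry["output"]
    return output


machine_log = [
    {"decision_id": "d-2024-00417", "decision_type": "llm", "output": "match"},
]
corrected_log = add_human_override(
    machine_log,
    "d-2024-00417",
    "no-match",
    "https://example.org/certified-results.pdf",  # placeholder citation
)
```

Because the log is append-only, the original machine decision stays available for audit, and a later counter-report can be recorded as one more override.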
What happens to the original data
Nothing. L0 (raw) and L1 (cleaned) records are immutable. If the error is in the source itself (e.g., the state published a wrong number that was later corrected in an amended certification), the amended source file is ingested as a new L0 record. Both the original and amended records coexist, with the L3 decision log recording which one is authoritative.
Transparency
All override decisions are stored in the same JSONL decision log as algorithmic decisions. They are queryable, auditable, and included in pipeline replay. A consumer who disagrees with a correction can inspect the decision record, see the cited source, and file a counter-report.
Corrections do not silently change output. Every correction increments the dataset version and appears in the changelog.
Known Limitations
This chapter documents what the project cannot do, where the data is incomplete, and where results should be interpreted with caution. These are not future plans — they are current, known constraints.
Coverage gaps by state
MEDSL 2022 data contains zero local election results for seven states:
| State | FIPS | Notes |
|---|---|---|
| California | 06 | State publishes results but not in MEDSL local dataset |
| Iowa | 19 | County-level results exist on state portal; not aggregated |
| Kansas | 20 | No local results in MEDSL |
| New Jersey | 34 | County clerk offices publish individually; no aggregation |
| Pennsylvania | 42 | 67 counties, each with its own reporting format |
| Tennessee | 47 | No local results in MEDSL |
| Wisconsin | 55 | State portal exists but data not present in MEDSL |
These gaps are source-dependent. If a future pipeline version integrates state portal data directly, coverage may improve. Until then, any “national” statistic derived from this dataset is actually a 43-state statistic.
Turnout data
Turnout figures (registered voters, ballots cast) are present in fewer than 5% of records. Most sources report candidate-level vote totals but not the denominator. This means:
- Vote share (candidate votes / total ballots) cannot be computed for most contests.
- Voter participation rates at the local level are not derivable from this dataset.
- Where turnout data does exist, it is preserved as TurnoutMetadata contest records at L1 and carried through to L4.
Do not assume that the absence of turnout data means turnout was low. It means the source did not report it.
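A consumer computing vote share can make the missing denominator explicit instead of defaulting it. The helper below is a sketch; the function name and the choice to return None are assumptions, not part of the pipeline.

```python
from typing import Optional

def vote_share(candidate_votes: int, ballots_cast: Optional[int]) -> Optional[float]:
    """Return candidate_votes / ballots_cast, or None when the source reported no denominator."""
    if ballots_cast is None or ballots_cast <= 0:
        return None  # unknown, not zero: the source did not report turnout
    return candidate_votes / ballots_cast

known = vote_share(150, 600)      # 0.25
unknown = vote_share(150, None)   # None
```

Returning None rather than 0.0 keeps "not reported" distinguishable from "no votes" in downstream aggregates.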
Odd-year elections
Elections held in 2015, 2017, 2019, and 2021 are underrepresented. MEDSL publishes even-year datasets (2016, 2018, 2020, 2022) with strong coverage. Odd-year local elections — common for municipal and school board races — are covered only where state-specific sources (e.g., NC SBE) include them.
This creates a systematic bias: states that hold local elections in odd years appear to have fewer local races than they actually do. New Jersey (already missing from MEDSL local data) and Virginia (odd-year state legislative elections) are particularly affected.
Entity resolution is probabilistic
The L3 matching layer uses a four-step cascade: exact match → Jaro-Winkler → embedding similarity → LLM confirmation. Only exact matches are deterministic in the strong sense. All other match methods involve thresholds:
- Jaro-Winkler threshold: 0.92. Names scoring below this are not matched, even if they refer to the same person.
- Embedding cosine similarity threshold: 0.88. Composite strings that fall below this are sent to LLM review or left unmatched.
- LLM confirmation is logged with a decision ID but is inherently non-deterministic across model versions. Decisions are frozen in the decision log for reproducibility, but a different model version might make different decisions.
Consequences:
- Some true matches are missed (false negatives), especially for candidates with common names in different jurisdictions.
- Some incorrect matches may exist (false positives), especially for candidates with identical names in overlapping jurisdictions (e.g., father/son with the same name).
- All non-exact match decisions are queryable by match method and score. Downstream users can apply stricter thresholds if their use case requires higher precision at the cost of lower recall.
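The cascade and its thresholds can be sketched as follows. The similarity functions are passed in as parameters so that the sketch stays self-contained. The method labels, the return shape, and the simplification that every step sees the same pair of strings and that everything below the embedding threshold goes to LLM review are assumptions; the real pipeline compares names at the Jaro-Winkler step and composite strings at the embedding step, and may leave some pairs unmatched without review.

```python
from typing import Callable, Optional, Tuple

JARO_WINKLER_THRESHOLD = 0.92
EMBEDDING_THRESHOLD = 0.88

Scorer = Callable[[str, str], float]
Confirmer = Callable[[str, str], bool]

def match_cascade(
    a: str,
    b: str,
    jaro_winkler: Scorer,
    embedding_similarity: Scorer,
    llm_confirm: Confirmer,
) -> Tuple[bool, str, Optional[float]]:
    """Return (matched, method, score), stopping at the first step that accepts the pair."""
    if a == b:
        return True, "exact", 1.0
    jw = jaro_winkler(a, b)
    if jw >= JARO_WINKLER_THRESHOLD:
        return True, "jaro_winkler", jw
    cos = embedding_similarity(a, b)
    if cos >= EMBEDDING_THRESHOLD:
        return True, "embedding", cos
    if llm_confirm(a, b):
        return True, "llm", None  # no numeric score at this step in this sketch
    return False, "unmatched", None

# Stub scorers standing in for real implementations.
result = match_cascade(
    "charlie crist", "crist, charles joseph",
    jaro_winkler=lambda a, b: 0.5,
    embedding_similarity=lambda a, b: 0.451,
    llm_confirm=lambda a, b: True,
)
```

With these stub values the pair falls through both thresholds and is accepted only at the LLM step, which is the kind of decision a precision-sensitive consumer may want to filter out.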
No ranked-choice voting (RCV) support
The schema represents first-past-the-post and plurality contests. Ranked-choice voting results — used in Alaska, Maine, New York City, and a growing number of jurisdictions — require round-by-round tabulation data that the current schema does not model.
RCV results from these jurisdictions may appear in the dataset as final-round totals (where the source reports them that way), but intermediate rounds, elimination order, and ballot transfer data are not captured.
ALGED not integrated
The Annual Local Government Election Dataset (ALGED) covers mayoral and city council races in cities with populations above 50,000. It includes candidate demographics and incumbency data not available in other sources. This dataset is not currently integrated into the pipeline. Its coverage period ends around 2021.
Integration is planned but not scheduled. When integrated, ALGED records will enter at L0 like any other source and pass through the same cleaning, embedding, and matching layers.
Vote mode data
Vote mode breakdowns (Election Day, absentee, early voting, provisional) are present in approximately 33% of source records. The remaining 67% report only total votes per candidate. Cross-source comparisons of vote mode data are unreliable because:
- States define vote modes differently (e.g., “absentee” vs. “mail” vs. “vote by mail”).
- Some sources aggregate early voting into Election Day totals.
- Provisional ballot handling varies by state and is time-dependent (provisionals may be added days after initial reporting).
Pipeline not validated at national scale
The pipeline has been tested against NC SBE data (2004–2022) and MEDSL data (2018–2022, 43 states). The 640-contest overlap between MEDSL and NC SBE provides a validation baseline: 90.5% exact vote match, 63% name formatting differences successfully resolved.
Full national-scale validation — running all 42 million rows through L0→L4 with cross-source reconciliation — has not been completed. Edge cases in states with unusual office structures (Louisiana’s parish system, Alaska’s borough system, Virginia’s independent cities) may surface issues not yet encountered.
What this means for users
If your work depends on completeness, check the Coverage Matrix for your specific state and year before drawing conclusions. If your work depends on entity resolution accuracy, filter to match methods and scores that meet your precision requirements. If your work involves RCV jurisdictions, this dataset does not capture round-level data.
These limitations are structural, not aspirational. They will change as sources are added and the pipeline matures, but they describe the current state accurately.
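For the entity resolution case, a stricter filter might look like the sketch below. It assumes each match decision exposes a method and a score; those field names, the method labels, and the decision IDs are hypothetical.

```python
def filter_matches(decisions, min_scores):
    """Keep exact matches, plus scored matches whose method has a floor in min_scores and meets it."""
    kept = []
    for d in decisions:
        if d["method"] == "exact":
            kept.append(d)
            continue
        floor = min_scores.get(d["method"])
        score = d.get("score")
        if floor is not None and score is not None and score >= floor:
            kept.append(d)
    return kept

decisions = [
    {"decision_id": "d-1", "method": "exact", "score": None},
    {"decision_id": "d-2", "method": "jaro_winkler", "score": 0.93},
    {"decision_id": "d-3", "method": "jaro_winkler", "score": 0.97},
    {"decision_id": "d-4", "method": "embedding", "score": 0.90},
    {"decision_id": "d-5", "method": "llm", "score": None},
]

# Raise the Jaro-Winkler floor to 0.95 and drop embedding and LLM matches entirely.
strict = filter_matches(decisions, {"jaro_winkler": 0.95})
strict_ids = [d["decision_id"] for d in strict]
```

Methods omitted from min_scores are excluded, so the caller has to opt in to each non-exact method explicitly.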
Full Nickname Dictionary
The pipeline applies nickname normalization at L1 to improve entity resolution at L3. When a candidate’s first name matches a known nickname, the canonical form is stored in canonical_first and the original is preserved in first.
This dictionary is applied deterministically. Every name is checked against the table below. No context or heuristics are used — if the input matches the nickname column, the canonical column is applied. This means the mapping is fast and reproducible but occasionally wrong (see The Ted Problem below).
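A minimal sketch of the lookup, using a four-entry excerpt of the table below. The function name is hypothetical, and storing the lowercased input in canonical_first when no entry matches is an assumption; the source only specifies the matched case.

```python
# Excerpt of the nickname table; the full dictionary is listed under Mappings.
NICKNAMES = {
    "bill": "william",
    "bob": "robert",
    "charlie": "charles",
    "ted": "edward",
}

def normalize_first_name(first: str) -> dict:
    """Context-free lookup: the same input always yields the same canonical form."""
    key = first.strip().lower()
    return {
        "first": first,                             # original value is always preserved
        "canonical_first": NICKNAMES.get(key, key),  # assumption: unmatched names pass through lowercased
    }

charlie = normalize_first_name("Charlie")
ted = normalize_first_name("TED")
```

Because the function looks only at the first name, it cannot tell an Edward from a Theodore; that trade-off is the price of determinism.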
Mappings
| Nickname | Canonical | Notes |
|---|---|---|
| al | albert | |
| alex | alexander | |
| andy | andrew | |
| barb | barbara | |
| ben | benjamin | |
| bernie | bernard | |
| bert | albert | Also Herbert; resolved to albert by frequency |
| beth | elizabeth | |
| bill | william | |
| billy | william | |
| bob | robert | |
| bobby | robert | |
| bonnie | bonita | |
| bud | william | Regional; less reliable |
| charlie | charles | |
| chris | christopher | Also Christine; gendered ambiguity |
| chuck | charles | |
| cindy | cynthia | |
| dan | daniel | |
| danny | daniel | |
| dave | david | |
| deb | deborah | |
| debbie | deborah | |
| dick | richard | |
| don | donald | |
| doug | douglas | |
| drew | andrew | |
| ed | edward | |
| eddie | edward | |
| frank | franklin | Also Francis; resolved to franklin by frequency |
| fred | frederick | |
| gene | eugene | |
| gerry | gerald | |
| hank | henry | |
| harry | harold | Also Henry (British tradition); resolved to harold |
| jack | john | |
| jake | jacob | |
| jan | janice | Also Janet; resolved to janice by frequency |
| jenny | jennifer | |
| jerry | gerald | Also Jerome; resolved to gerald by frequency |
| jim | james | |
| jimmy | james | |
| joe | joseph | |
| johnny | john | |
| jon | jonathan | Distinct from john |
| kate | katherine | Also Kathryn, Catherine |
| kathy | katherine | |
| ken | kenneth | |
| kenny | kenneth | |
| larry | lawrence | |
| liz | elizabeth | |
| maggie | margaret | |
| matt | matthew | |
| mike | michael | |
| mitch | mitchell | |
| nancy | ann | Historical mapping; low reliability |
| nick | nicholas | |
| nikki | nicole | |
| norm | norman | |
| pat | patrick | Also Patricia; gendered ambiguity |
| patti | patricia | |
| patty | patricia | |
| peggy | margaret | |
| pete | peter | |
| phil | philip | |
| ray | raymond | |
| rick | richard | |
| rob | robert | |
| ron | ronald | |
| sally | sarah | |
| sam | samuel | Also Samantha; gendered ambiguity |
| sandy | sandra | Also Alexander; gendered ambiguity |
| steve | steven | |
| sue | susan | |
| ted | edward | See The Ted Problem below |
| terry | terrence | Also Teresa; gendered ambiguity |
| tim | timothy | |
| tom | thomas | |
| tommy | thomas | |
| tony | anthony | |
| val | valerie | |
| vince | vincent | |
| walt | walter | |
| wes | wesley | |
| will | william | |
| woody | woodrow | |
The Ted Problem
“Ted” is a nickname for both Edward (Ted Kennedy → Edward Kennedy; Ted Cruz → Rafael Edward Cruz) and Theodore (Ted Stevens → Theodore Stevens). The dictionary maps ted → edward because Edward is the more frequent canonical form in US election data. This means a candidate whose legal name is Theodore but who files as Ted will be canonicalized as Edward.
This is a known, accepted error. It affects L1 canonical_first but does not prevent correct entity resolution at L3 — because L3 matches on composite strings that include last name, jurisdiction, office, and year. Two candidates named “Ted Smith” in different counties will not be merged regardless of whether canonical_first is edward or theodore.
The original filed name is always preserved in first. Any downstream consumer who needs the original can ignore canonical_first and use first directly.
Gendered ambiguity
Several nicknames map to names that could be either male or female: Chris (Christopher/Christine), Pat (Patrick/Patricia), Sam (Samuel/Samantha), Sandy (Sandra/Alexander), Terry (Terrence/Teresa). The dictionary resolves these to the statistically more common canonical form in US election candidate data. The mapping is not always correct for individual candidates.
As with the Ted problem, the original name is preserved, and entity resolution at L3 uses additional fields (jurisdiction, office, party) to avoid incorrect merges caused by nickname ambiguity.
When the dictionary is not applied
The dictionary is skipped when:
- The input first name is longer than 6 characters and matches no entry (assumed to already be a full name).
- The candidate record has a canonical_first value set by the source (some sources provide both nickname and legal name).
- The input is an initial only (e.g., “J.” is not expanded).
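The three skip conditions can be expressed as a single predicate. This is a sketch: the function name, the argument names, and the way an initial is detected (one character after removing a trailing period) are assumptions.

```python
def should_apply_dictionary(first, source_canonical_first, nicknames):
    """Return False when any of the three skip conditions holds."""
    name = first.strip()
    if source_canonical_first:
        return False  # the source already supplied a canonical form
    if len(name.rstrip(".")) == 1:
        return False  # an initial such as "J." is never expanded
    if len(name) > 6 and name.lower() not in nicknames:
        return False  # long and unlisted: assumed to be a full name already
    return True

nicknames = {"ted": "edward", "charlie": "charles"}

checks = {
    "initial": should_apply_dictionary("J.", None, nicknames),
    "source_provided": should_apply_dictionary("Ted", "theodore", nicknames),
    "long_full_name": should_apply_dictionary("Alexander", None, nicknames),
    "listed_nickname": should_apply_dictionary("Charlie", None, nicknames),
}
```

Note that “Charlie” is seven characters long but still qualifies, because the length rule only applies to names with no dictionary entry.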
Office Classification Reference
The pipeline classifies 8,387 unique office name strings into canonical office types using a four-tier system. Each tier handles progressively harder cases. This appendix documents tiers 1 and 2 in full and summarizes tiers 3 and 4.
Coverage summary
| Tier | Method | Unique offices handled | Cumulative coverage |
|---|---|---|---|
| 1 | Keyword lookup | 3,102 | 37% |
| 2 | Regex patterns | 2,097 | 62% |
| 3 | Embedding similarity | 2,340 | 90% |
| 4 | LLM classification | 848 | 100% |
Tiers 1 and 2 are fully deterministic — same input, same output, no external calls. Tier 3 uses cosine similarity against text-embedding-3-large embeddings of known office types. Tier 4 sends unresolved strings to Claude Sonnet with a structured prompt.
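The cumulative coverage column in the table above follows directly from the per-tier counts. The short check below reproduces it, rounding each percentage to the nearest whole number; the tier labels are shorthand for this example only.

```python
# Unique office strings handled by each tier, from the coverage summary table.
tiers = [("keyword", 3102), ("regex", 2097), ("embedding", 2340), ("llm", 848)]

total = sum(count for _, count in tiers)

running = 0
cumulative = []
for name, count in tiers:
    running += count
    cumulative.append((name, round(100 * running / total)))

print(total)       # 8387
print(cumulative)  # [('keyword', 37), ('regex', 62), ('embedding', 90), ('llm', 100)]
```

The two deterministic tiers together cover 5,199 of the 8,387 strings, which is where the 62% figure comes from.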
Tier 1: Keyword lookup
A case-insensitive keyword match against the office name string. If any keyword appears in the string, the office is classified immediately. Keywords are checked in order; the first match wins.
| Keyword | office_level | office_category |
|---|---|---|
| president | federal | executive |
| u.s. senate | federal | legislative |
| u.s. house | federal | legislative |
| congress | federal | legislative |
| lieutenant governor | state | executive |
| governor | state | executive |
| attorney general | state | executive |
| secretary of state | state | executive |
| state treasurer | state | executive |
| state auditor | state | executive |
| state senate | state | legislative |
| state house | state | legislative |
| state representative | state | legislative |
| state assembly | state | legislative |
| supreme court | state | judicial |
| court of appeals | state | judicial |
| appeals court | state | judicial |
| district court | county | judicial |
| superior court | county | judicial |
| county commissioner | county | legislative |
| county council | county | legislative |
| sheriff | county | law_enforcement |
| clerk of court | county | judicial |
| register of deeds | county | administrative |
| coroner | county | administrative |
| constable | county | law_enforcement |
| justice of the peace | county | judicial |
| school board | local | education |
| board of education | local | education |
| city council | local | legislative |
| mayor | local | executive |
| alderman | local | legislative |
| township trustee | local | legislative |
| soil and water | local | special_district |
| fire district | local | special_district |
| water district | local | special_district |
Notes:
- “u.s. senate” is checked before “state senate” to avoid false matches.
- “lieutenant governor” is checked before “governor” for the same reason.
- Keywords are matched as substrings, not whole words. “county commissioner district 3” matches on “county commissioner”.
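A sketch of the ordered substring lookup, using an excerpt of the keyword table. The function name and the returned tuple shape are hypothetical; the ordering of the excerpt follows the notes above.

```python
# Excerpt of the tier 1 table, in check order: more specific keywords come first.
TIER1_KEYWORDS = [
    ("u.s. senate", "federal", "legislative"),
    ("lieutenant governor", "state", "executive"),
    ("governor", "state", "executive"),
    ("state senate", "state", "legislative"),
    ("county commissioner", "county", "legislative"),
    ("sheriff", "county", "law_enforcement"),
    ("school board", "local", "education"),
    ("mayor", "local", "executive"),
]

def classify_tier1(office_name):
    """Case-insensitive substring match; the first keyword found in list order wins."""
    lowered = office_name.lower()
    for keyword, level, category in TIER1_KEYWORDS:
        if keyword in lowered:
            return keyword, level, category
    return None  # fall through to tier 2

district_seat = classify_tier1("County Commissioner District 3")
lt_gov = classify_tier1("Lieutenant Governor")
unmatched = classify_tier1("Fence Viewer")
```

Because matching is by substring, “Lieutenant Governor” would also match “governor”; listing the longer keyword first is what makes the recorded match the specific one.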
Tier 2: Regex patterns
When no tier 1 keyword matches, the office string is tested against a series of compiled regular expressions. These handle structural patterns that keyword matching cannot.
| Pattern | office_level | office_category | Example matches |
|---|---|---|---|
| `(?i)^(us\|united states) (rep\|senator)` | federal | legislative | “US Rep District 4” |
| `(?i)district judge.*district \d+` | county | judicial | “District Judge, Judicial District 21” |
| `(?i)(city\|town\|village) (of\|de) .+ (council\|trustee\|board)` | local | legislative | “Town of Cary Council” |
| `(?i)independent school district.*\d+` | local | education | “Independent School District 279 Board” |
| `(?i)(municipal\|mun\.?) (utility\|water\|sewer) district` | local | special_district | “Municipal Utility District 14” |
| `(?i)community college.*trustee` | local | education | “Community College District Trustee” |
| `(?i)(precinct\|ward) (chair\|committee)` | local | party | “Precinct Committee Chair” |
| `(?i)conservation district (super\|board\|dir)` | local | special_district | “Conservation District Supervisor” |
| `(?i)(drainage\|levee\|flood) (district\|board)` | local | special_district | “Drainage District 7 Board” |
| `(?i)hospital district (board\|dir\|trustee)` | local | special_district | “Hospital District Board Member” |
| `(?i)park (district\|board) (comm\|dir\|trustee)` | local | special_district | “Park District Commissioner” |
| `(?i)sanitary district` | local | special_district | “Sanitary District Trustee” |
| `(?i)mosquito (abatement\|control) district` | local | special_district | “Mosquito Abatement District Trustee” |
| `(?i)(borough\|parish) (council\|president\|assembly)` | county | legislative | “Borough Assembly Member” |
| `(?i)district attorney` | county | law_enforcement | “District Attorney 26th District” |
Regex patterns are tested in order. The first match wins. All patterns use case-insensitive mode. Inside the table, the alternation bar is written as \| so that it is not read as a column separator; the compiled patterns use a plain |.
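The ordered matching can be sketched with Python's re module, using five of the patterns from the table. The function name is hypothetical, and the use of search rather than a full-string match is an assumption, consistent with the fact that only one listed pattern anchors itself with ^.

```python
import re

# Excerpt of the tier 2 table, compiled once and tested in order.
TIER2_PATTERNS = [
    (re.compile(r"(?i)^(us|united states) (rep|senator)"), "federal", "legislative"),
    (re.compile(r"(?i)independent school district.*\d+"), "local", "education"),
    (re.compile(r"(?i)(drainage|levee|flood) (district|board)"), "local", "special_district"),
    (re.compile(r"(?i)(borough|parish) (council|president|assembly)"), "county", "legislative"),
    (re.compile(r"(?i)district attorney"), "county", "law_enforcement"),
]

def classify_tier2(office_name):
    """Return (office_level, office_category) for the first pattern that matches, else None."""
    for pattern, level, category in TIER2_PATTERNS:
        if pattern.search(office_name):
            return level, category
    return None  # fall through to tier 3

examples = {
    name: classify_tier2(name)
    for name in [
        "US Rep District 4",
        "Independent School District 279 Board",
        "Drainage District 7 Board",
        "Borough Assembly Member",
        "District Attorney 26th District",
        "Moderator",
    ]
}
```

“Moderator” matches nothing here and would continue to tier 3.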
Tier 3: Embedding similarity
Office strings that pass through tiers 1 and 2 unclassified are embedded using text-embedding-3-large (3072 dimensions) and compared against a reference set of known office type embeddings via FAISS nearest-neighbor search.
- Threshold: cosine similarity ≥ 0.85 against the nearest known office type.
- Reference set: the canonical office types defined by tiers 1 and 2, plus manually curated additions for jurisdiction-specific titles.
- Examples resolved at tier 3:
  - “Moderator” → local / legislative (New England town meeting role)
  - “Fence Viewer” → local / administrative (historical New England office)
  - “Pound Keeper” → local / administrative
  - “Surveyor of Highways” → local / administrative
  - “Oyster Commissioner” → local / special_district (Maryland)
Tier 3 handles 2,340 unique office strings — mostly jurisdiction-specific titles, historical offices, and compound names that do not match keyword or regex patterns.
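The threshold rule can be illustrated with a brute-force nearest-neighbor search in plain Python. This is a stand-in for the FAISS lookup, not the pipeline's implementation: the three-dimensional vectors are toy values rather than real text-embedding-3-large output, and the function names are hypothetical.

```python
import math

TIER3_THRESHOLD = 0.85

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def classify_tier3(query_vec, reference):
    """Return (label, score) for the nearest reference type, or (None, score) below threshold."""
    best_label, best_score = None, -1.0
    for label, vec in reference.items():
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= TIER3_THRESHOLD:
        return best_label, best_score
    return None, best_score  # unresolved: continue to tier 4

# Toy reference set with made-up vectors.
reference = {
    "local / administrative": [1.0, 0.0, 0.0],
    "local / legislative": [0.0, 1.0, 0.0],
}

close = classify_tier3([0.9, 0.1, 0.0], reference)
ambiguous = classify_tier3([1.0, 1.0, 0.0], reference)
```

The second query sits halfway between both reference vectors (cosine about 0.71 to each), so it stays unclassified rather than being forced onto the nearest type.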
Tier 4: LLM classification
The remaining 848 office strings are sent to Claude Sonnet with a structured prompt that provides the office name, the state, and the county (where available). The LLM returns office_level, office_category, and a brief rationale.
Every tier 4 decision is recorded in the decision log with:
- decision_id
- input_string (the original office name)
- output_level and output_category
- llm_request_id
- rationale (the LLM’s explanation)
Tier 4 classifications can be overridden by adding entries to the tier 1 or tier 2 tables in subsequent pipeline versions. Once an office string is promoted to tier 1 or tier 2, it is classified deterministically on all future runs.
Office level and category enumerations
office_level values: federal, state, county, local.
office_category values: executive, legislative, judicial, law_enforcement, administrative, education, special_district, party.
These enumerations are defined in the Enumerations Reference. Every classified office receives exactly one level and one category.
Handling ambiguity
Some office strings are genuinely ambiguous:
- “Board of Commissioners” could be county or municipal depending on jurisdiction.
- “Trustee” alone could be township, school board, or special district.
- “Judge” without a court name could be any judicial level.
In these cases, the pipeline uses jurisdiction context (state, county, FIPS code) to disambiguate. If the jurisdiction does not resolve the ambiguity, the string is sent to tier 3 or 4 with the full context attached.
NIST SP 1500-100 Alignment
This appendix maps the pipeline’s schema fields to concepts defined in NIST SP 1500-100 v2, the Election Results Common Data Format Specification. The mapping is informational — the pipeline does not emit NIST-compliant XML, but its internal schema was designed with alignment in mind.
Field mapping
| Pipeline field | NIST SP 1500-100 concept | NIST element | Notes |
|---|---|---|---|
| contest | Contest | CandidateContest | Candidate races map to CandidateContest. |
| contest (ballot measure) | Contest | BallotMeasureContest | Ballot measures use a separate NIST element. |
| contest.name | Contest name | CandidateContest.Name | Raw office string before normalization. |
| contest.canonical_office | Office | Office.Name | L4 normalized office name. |
| candidate.canonical_first, canonical_last | Candidate | Candidate.PersonFullName | Pipeline stores components; NIST stores full name. |
| candidate.party | Party | Party.Abbreviation | Three-letter codes (DEM, REP, LIB, etc.). |
| jurisdiction.ocd_id | Geographic unit | GpUnit.ExternalIdentifier | OCD-ID used as the external identifier type. |
| jurisdiction.county_fips | Geographic unit | GpUnit.ExternalIdentifier | FIPS code, identifier type fips. |
| jurisdiction.state | Geographic unit | GpUnit.Type = "state" | Two-letter USPS abbreviation. |
| votes.total | Vote counts | VoteCounts.Count | Total votes for a candidate in a contest. |
| votes.by_mode.election_day | Vote counts by type | VoteCounts.CountItemType = "election-day" | Present in ~33% of records. |
| votes.by_mode.absentee | Vote counts by type | VoteCounts.CountItemType = "absentee" | Terminology varies by state. |
| votes.by_mode.early | Vote counts by type | VoteCounts.CountItemType = "early" | Some sources merge into election day. |
| votes.by_mode.provisional | Vote counts by type | VoteCounts.CountItemType = "provisional" | Timing of inclusion varies. |
| election.date | Election | Election.StartDate | Single date; no multi-day modeling. |
| election.type | Election type | Election.Type | Values: general, primary, runoff, special. |
| turnout.registered_voters | Turnout metadata | VoteCounts.CountItemType = "total" on BallotCounts | Present in <5% of records. |
| turnout.ballots_cast | Turnout metadata | BallotCounts.BallotsCast | Same coverage caveat. |
| contest.district | Electoral district | ElectoralDistrict.Name | District number or name within an office. |
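To make the vote-count rows concrete, the sketch below converts a pipeline-style votes object into a list of count entries labeled with the CountItemType values from the table. The output is a plain Python structure, not NIST-compliant XML or JSON. The function name and input shape are assumptions, as is the fallback to a single "total" entry when no by-mode breakdown exists.

```python
# Pipeline by_mode keys mapped to the CountItemType values listed in the table above.
COUNT_ITEM_TYPES = {
    "election_day": "election-day",
    "absentee": "absentee",
    "early": "early",
    "provisional": "provisional",
}

def to_vote_counts(votes):
    """Emit one entry per reported vote mode, or a single total entry when no breakdown exists."""
    by_mode = votes.get("by_mode") or {}
    counts = [
        {"CountItemType": COUNT_ITEM_TYPES[mode], "Count": count}
        for mode, count in by_mode.items()
        if mode in COUNT_ITEM_TYPES and count is not None
    ]
    if not counts:
        counts = [{"CountItemType": "total", "Count": votes["total"]}]  # assumed fallback
    return counts

with_modes = to_vote_counts({"total": 500, "by_mode": {"election_day": 300, "early": 200}})
total_only = to_vote_counts({"total": 500})
```

Records without a breakdown, which are the majority, produce only the total entry.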
Concepts not modeled
The following NIST SP 1500-100 concepts have no direct equivalent in the pipeline schema:
- RetentionContest: Judicial retention elections are classified as BallotMeasure with yes/no choices rather than as a distinct contest type.
- OrderedContest: Ballot ordering is not captured. The pipeline does not model ballot layout.
- BallotStyle: No ballot style or precinct-to-ballot mapping is maintained.
- Ranked-choice voting rounds: CountItemType values for RCV rounds (round-1, round-2, etc.) are not supported. See Known Limitations.
- Overvotes and undervotes: Tracked as TurnoutMetadata contest records at L1, not as NIST OtherCounts.
Pipeline concepts not in NIST
The following pipeline concepts have no NIST equivalent:
- provenance.hash: SHA-256 hash chain for record integrity. NIST defines no provenance model.
- entity_resolution.method: Match method metadata (exact, Jaro-Winkler, embedding, LLM). Entity resolution is outside the scope of NIST SP 1500-100.
- source.confidence: High/medium/low confidence levels. NIST does not model source reliability.
- Layer identifiers (L0–L4): The multi-layer pipeline architecture is specific to this project.
Research References
This appendix lists the research papers, datasets, and standards cited throughout the documentation.
Entity resolution
- Dasanaike, T., et al. (2026). EnsembleLink: Ensemble methods for scalable entity resolution. Preprint.
- Ornstein, J. (2025). fuzzylink: Probabilistic record linkage with large language models. Preprint.
- CE-RAG4EM (2026). Context-Enhanced Retrieval-Augmented Generation for Entity Matching. Preprint.
- Zeakis, A., et al. (2025). AvengER: Automated verification of entity resolution results. Preprint.
Election data sources
- MIT Election Data + Science Lab (MEDSL). U.S. Local Elections Dataset, 2018–2022. https://electionlab.mit.edu/data
- OpenElections Project. Certified election results by state. https://openelections.net
- North Carolina State Board of Elections (NC SBE). Official election results, 2004–present. https://www.ncsbe.gov/results-data
- Annual Local Government Election Dataset (ALGED). Municipal election returns for cities >50K population.
- Associated Press. AP Elections. Commercial license required.
- Voting and Election Science Team (VEST). Precinct-level election returns with shapefiles. https://dataverse.harvard.edu/dataverse/electionscience
- Federal Election Commission (FEC). Candidate master files. https://www.fec.gov/data/browse-data/
- U.S. Census Bureau. FIPS code reference and geographic hierarchies. https://www.census.gov/geographies
Standards
- National Institute of Standards and Technology. (2023). NIST SP 1500-100 v2: Election Results Common Data Format Specification. https://doi.org/10.6028/NIST.SP.1500-100r2
Architecture
- Databricks. Medallion Architecture: Bronze, Silver, Gold. https://www.databricks.com/glossary/medallion-architecture
Reports
- Union of Concerned Scientists. (2025). Election Data Report: The state of US election data infrastructure.
Glossary
Blocking. A preprocessing step in entity resolution that partitions records into groups (blocks) that share a key attribute — typically state + office type or county FIPS code. Only records within the same block are compared, reducing the number of pairwise comparisons from O(n²) to a tractable subset.
Composite string. A concatenated text representation of a record used as input to an embedding model. A candidate composite string might combine name, office, jurisdiction, party, and year into a single string. The template that defines which fields are included and in what order is versioned and stored in the L2 manifest.
Cosine similarity. A measure of similarity between two vectors, computed as the cosine of the angle between them. Ranges from -1 to 1; values closer to 1 indicate higher similarity. Used at L3 to compare candidate and contest embeddings. The pipeline uses a threshold of 0.88 for embedding-based entity matches.
Entity resolution. The process of determining whether two records refer to the same real-world entity (person, office, or contest) despite differences in formatting, naming, or source. The pipeline uses a four-step cascade: exact match → Jaro-Winkler → embedding similarity → LLM confirmation.
FAISS. Facebook AI Similarity Search. A library for efficient similarity search over dense vector collections. Used at L3 to perform approximate nearest-neighbor lookups over L2 embeddings when comparing candidate records across sources.
FIPS code. Federal Information Processing Standards code. A numeric identifier assigned by the Census Bureau to states (2 digits), counties (5 digits: 2 state + 3 county), and other geographic entities. Example: 37119 = Mecklenburg County, North Carolina. Used as a join key across sources.
Jaro-Winkler similarity. A string similarity metric that gives higher scores to strings that match from the beginning. Ranges from 0 to 1. The pipeline uses a threshold of 0.92 for name matching. Preferred over edit distance for person names because prefix agreement is a strong signal of identity.
JSONL. JSON Lines. A text format where each line is a valid JSON object, separated by newlines. The pipeline uses JSONL as the storage and interchange format at every layer (L0–L4). One record per line enables streaming reads and line-level integrity checks.
L0 (Raw). The first pipeline layer. Byte-identical copies of source files as retrieved. No parsing, no transformation. Stored with retrieval timestamps and SHA-256 hashes.
L1 (Cleaned). The second layer. Deterministic parsing, field extraction, name normalization, and FIPS enrichment. Output is structured JSONL with a consistent schema regardless of source format.
L2 (Embedded). The third layer. Adds vector embeddings (text-embedding-3-large, 3072 dimensions) and office classification results. Deterministic given L1 input and a fixed model version.
L3 (Matched). The fourth layer. Entity resolution — linking records that refer to the same candidate, contest, or office across sources and years. Non-deterministic steps (LLM calls) are recorded in the decision log for replay.
L4 (Canonical). The fifth layer. Assigns canonical names, deduplicates records, selects authoritative values, and produces the final queryable dataset. Deterministic given L3 input.
OCD-ID. Open Civic Data Identifier. A hierarchical string identifier for political geographies, following the pattern ocd-division/country:us/state:nc/county:mecklenburg. Used to link jurisdictions across datasets that may use different naming conventions.
Precinct. The smallest administrative unit for election administration. Voters are assigned to a precinct based on their address. Precinct-level results, when available, provide the most granular view of voting patterns. Coverage varies — some sources report only county-level totals.
Changelog
All notable changes to the dataset and pipeline will be documented in this file.
Each entry includes the date, affected layer(s), and a summary of the change.
[Unreleased]
No releases yet.
Entry template
- Date: YYYY-MM-DD
- Layer(s): L0 / L1 / L2 / L3 / L4
- Change: Description of what changed.
- Issue: Link to GitHub issue (if applicable).