Cross-Source Reconciliation

When two independent sources cover the same election, their overlap becomes a validation set. If MEDSL and NC SBE both report results for the same contest in the same county, the vote totals should match. When they do, both sources are credible. When they don’t, at least one has an error — and the disagreement reveals data quality issues that no single-source analysis can detect.

North Carolina 2022 is our primary validation case. Both MEDSL and the NC State Board of Elections publish precinct-level results for all NC contests in the 2022 general election.

The Overlap

We identified 640 contests present in both MEDSL and NC SBE for the 2022 general election. These span federal, state, county, municipal, judicial, and school board races across all 100 NC counties.

For each contest, we aggregated precinct-level results to the contest level and compared total votes per candidate.
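The aggregation step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the row layout (`contest`, `candidate`, `precinct`, `votes`), the contest and candidate keys, and the vote counts are all hypothetical, and candidate names are assumed to be already resolved to a shared key across sources.

```python
from collections import defaultdict


def aggregate_to_contest(precinct_rows):
    """Sum precinct-level votes into {(contest, candidate): total}."""
    totals = defaultdict(int)
    for row in precinct_rows:
        totals[(row["contest"], row["candidate"])] += row["votes"]
    return dict(totals)


# Hypothetical precinct rows (illustrative numbers, not real results).
medsl_rows = [
    {"contest": "C1", "candidate": "cand_a", "precinct": "P1", "votes": 120},
    {"contest": "C1", "candidate": "cand_a", "precinct": "P2", "votes": 80},
    {"contest": "C1", "candidate": "cand_b", "precinct": "P1", "votes": 95},
]
ncsbe_rows = [
    {"contest": "C1", "candidate": "cand_a", "precinct": "P1", "votes": 120},
    {"contest": "C1", "candidate": "cand_a", "precinct": "P2", "votes": 80},
    {"contest": "C1", "candidate": "cand_b", "precinct": "P1", "votes": 97},
]

medsl_totals = aggregate_to_contest(medsl_rows)
ncsbe_totals = aggregate_to_contest(ncsbe_rows)

# Per-candidate difference over the union of keys, so a candidate that is
# missing from one source shows up as a nonzero difference.
diffs = {
    key: ncsbe_totals.get(key, 0) - medsl_totals.get(key, 0)
    for key in sorted(set(medsl_totals) | set(ncsbe_totals))
}
```

Comparing over the union of keys, rather than the intersection, is a deliberate choice in this sketch: a candidate present in only one source is itself a disagreement worth surfacing.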

| Agreement Level | Contests | Percentage |
|---|---|---|
| Exact vote total match | 579 | 90.5% |
| Within 1% of each other | 47 | 7.3% |
| Disagree by more than 1% | 14 | 2.2% |
| Total | 640 | 100% |

90.5% exact match across 640 contests, derived from two completely independent data pipelines (MIT’s academic processing vs. NC’s official state board reporting), is strong evidence that both sources are faithfully representing the same underlying certified results.
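The three agreement levels can be sketched as a small classifier. How "within 1%" is measured is not specified above, so this sketch assumes the largest per-candidate relative difference, taken against the larger of the two totals; the function name, labels, and `tolerance` parameter are hypothetical.

```python
def agreement_level(totals_a, totals_b, tolerance=0.01):
    """Classify one contest as 'exact', 'within_1pct', or 'disagree'.

    totals_a / totals_b map a resolved candidate key to total votes.
    Assumption: the contest is judged by its worst candidate.
    """
    worst = 0.0
    for cand in set(totals_a) | set(totals_b):
        a = totals_a.get(cand, 0)
        b = totals_b.get(cand, 0)
        if a == b:
            continue
        # a != b and both are non-negative, so max(a, b) > 0 here.
        worst = max(worst, abs(a - b) / max(a, b))
    if worst == 0.0:
        return "exact"
    return "within_1pct" if worst <= tolerance else "disagree"


level_same = agreement_level({"x": 100, "y": 50}, {"x": 100, "y": 50})
level_near = agreement_level({"x": 1000}, {"x": 1005})
level_far = agreement_level({"x": 100}, {"x": 90})
```

Under this definition a candidate who appears in only one source counts as a 100% difference, which pushes the contest into the "disagree" bucket.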

The 7.3% — Small Disagreements

The 47 contests with near-matches (within 1%) trace to identifiable causes:

| Cause | Contests | Notes |
|---|---|---|
| Provisional ballot inclusion timing | 22 | MEDSL snapshot taken before final canvass; NC SBE includes provisionals |
| Precinct boundary rounding | 11 | Split precincts assigned differently by each source |
| Write-in aggregation | 9 | NC SBE reports individual write-ins; MEDSL aggregates to “Write-in” |
| Unknown | 5 | Under investigation |

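The 42 contests with an identified cause are not errors: they reflect legitimate differences in how two organizations process the same raw certified results. Provisional ballot timing is the most common cause, because MEDSL’s data may reflect an earlier snapshot of the canvass than NC SBE’s final certified totals. The remaining 5 contests are still under investigation and may yet turn out to be errors.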

The 2.2% — Real Disagreements

The 14 contests with >1% disagreement require individual investigation. Common causes include:

  • Misassigned precincts. A precinct’s results attributed to the wrong contest or district in one source.
  • Partial data. One source missing results from a subset of precincts, typically in multi-county contests where one county’s data arrived late.
  • Candidate name mismatch causing split. The same candidate’s votes split across two entity IDs in one source because a name variant was not resolved — e.g., “JOHN SMITH” in early voting vs. “John R. Smith” in election-day results treated as different candidates.

These 14 cases are flagged by the L4 cross-source reconciliation algorithm and reported in the verification output. They are not silently ignored.
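One way to triage the name-mismatch case is to look for two entries in one source whose votes sum exactly to a single entry in the other. The sketch below is a hypothetical heuristic for illustration, not the L4 algorithm itself, and the vote counts are invented.

```python
from itertools import combinations


def find_split_candidates(single_totals, split_totals):
    """Find pairs in split_totals whose votes sum to one entry in single_totals.

    Returns a list of (name_in_single, (part_1, part_2)) tuples.
    """
    hits = []
    for name, total in single_totals.items():
        for (n1, v1), (n2, v2) in combinations(sorted(split_totals.items()), 2):
            if v1 + v2 == total:
                hits.append((name, (n1, n2)))
    return hits


# Invented totals: one source holds the candidate under two name variants.
ncsbe_contest = {"John R. Smith": 5000, "Jane Doe": 4200}
medsl_contest = {"JOHN SMITH": 1800, "John R. Smith": 3200, "JANE DOE": 4200}

splits = find_split_candidates(ncsbe_contest, medsl_contest)
```

An exact-sum hit is only a hint. Coincidental sums are possible, so a flagged pair would still need the name comparison to confirm it.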

Name Formatting Differences

Vote totals usually agree, but candidate names often do not. Of the 640 overlapping contests, 401 (62.7%) have at least one candidate whose name is formatted differently between MEDSL and NC SBE.

| Formatting Difference | Example (MEDSL) | Example (NC SBE) | Frequency |
|---|---|---|---|
| ALL CAPS vs Title Case | TIMOTHY LANCE | Timothy Lance | 389 |
| Last-first vs first-last | LANCE, TIMOTHY | Timothy Lance | 247 |
| Middle initial present/absent | SHANNON W BRAY | Shannon W. Bray | 118 |
| Period after middle initial | SHANNON W BRAY | Shannon W. Bray | 94 |
| Nickname in quotes vs parens | CHARLES "CHARLIE" CRIST | Charles (Charlie) Crist | 12 |
| Suffix formatting | ROBERT WILLIAMS JR | Robert Williams, Jr. | 31 |
| Prefix/title included | HON. JANE DOE | Jane Doe | 8 |

A single candidate can exhibit multiple formatting differences simultaneously. “BRAY, SHANNON W” (MEDSL) vs “Shannon W. Bray” (NC SBE) combines casing, ordering, and punctuation differences in one pair.

This is why entity resolution exists. The vote totals confirm these are the same contests with the same candidates. The name formatting confirms that string equality is insufficient — structured decomposition, embedding, and in some cases LLM confirmation are required to link records across sources.
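As a rough illustration of the cheapest layer of that work, the sketch below normalizes the purely mechanical differences from the table: casing, last-first ordering, periods, suffix commas, titles, and nicknames. It is a hypothetical helper, not the L3 cascade; the `SUFFIXES` and `TITLES` lists are assumptions, and it does not handle a middle initial that is present in one source and absent in the other.

```python
import re

SUFFIXES = {"JR", "SR", "II", "III", "IV"}   # assumed list, not exhaustive
TITLES = {"HON", "DR", "MR", "MRS", "MS"}    # assumed list, not exhaustive


def normalize_name(raw):
    """Reduce a candidate name to an upper-case 'FIRST [MIDDLE] LAST [SUFFIX]' key."""
    name = raw.upper()
    # Drop nicknames written in straight quotes or in parentheses.
    name = re.sub(r'"[^"]*"|\([^)]*\)', " ", name)
    if "," in name:
        head, tail = [part.strip() for part in name.split(",", 1)]
        if tail.replace(".", "") in SUFFIXES:
            name = f"{head} {tail}"   # "WILLIAMS, JR." keeps its order
        else:
            name = f"{tail} {head}"   # "LAST, FIRST" becomes "FIRST LAST"
    name = name.replace(".", " ")
    tokens = [tok for tok in name.split() if tok not in TITLES]
    return " ".join(tokens)


pairs = [
    ("TIMOTHY LANCE", "Timothy Lance"),
    ("LANCE, TIMOTHY", "Timothy Lance"),
    ("BRAY, SHANNON W", "Shannon W. Bray"),
    ('CHARLES "CHARLIE" CRIST', "Charles (Charlie) Crist"),
    ("ROBERT WILLIAMS JR", "Robert Williams, Jr."),
    ("HON. JANE DOE", "Jane Doe"),
]
all_match = all(normalize_name(a) == normalize_name(b) for a, b in pairs)
```

Rules like these cover the formatting rows in the table, but they cannot decide whether two differently spelled names refer to the same person, which is where the embedding and LLM steps come in.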

This Overlap as a Validation Set

The 640-contest NC overlap serves three purposes in the pipeline:

1. Entity Resolution Validation

For every candidate pair that the L3 cascade matches across MEDSL and NC SBE, we can verify the match by comparing vote totals. If the cascade says “TIMOTHY LANCE” (MEDSL) and “Timothy Lance” (NC SBE) are the same person, and their vote totals match exactly, the match is confirmed by an independent signal. If the cascade says they match but the vote totals disagree by 50%, the match is suspect.
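This check can be sketched as a three-way verdict. Only the two endpoints come from the description above (an exact match confirms, a 50% gap is suspect); the intermediate "review" label, the function name, and the `suspect_threshold` parameter are assumptions for illustration.

```python
def check_match(votes_a, votes_b, suspect_threshold=0.5):
    """Use vote totals as an independent signal on a proposed name match."""
    if votes_a == votes_b:
        return "confirmed"
    # Totals differ and are non-negative, so the larger one is positive.
    relative_gap = abs(votes_a - votes_b) / max(votes_a, votes_b)
    if relative_gap >= suspect_threshold:
        return "suspect"
    return "review"  # assumed middle state: neither confirmed nor clearly wrong


verdict_exact = check_match(1000, 1000)
verdict_gap = check_match(1000, 500)
verdict_small = check_match(1000, 995)
```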

2. Office Classification Validation

Both sources cover the same contests but use different office name strings. MEDSL might report “NC HOUSE OF REPRESENTATIVES DISTRICT 047” while NC SBE reports “NC HOUSE OF REPRESENTATIVES - DISTRICT 47”. If both classify to state/legislative, the classifier is consistent. If one classifies to state/legislative and the other to county/legislative, we have a bug.
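A consistency check of this kind might look like the sketch below. The classifier here is a toy stand-in with two hard-coded prefixes, not the pipeline's real office classifier; `normalize_office`, `classify_office`, and the "unclassified" label are hypothetical.

```python
import re


def normalize_office(raw):
    """Upper-case, drop punctuation, and strip leading zeros from numbers."""
    text = re.sub(r"[^A-Z0-9 ]", " ", raw.upper())
    tokens = [
        (tok.lstrip("0") or "0") if tok.isdigit() else tok
        for tok in text.split()
    ]
    return " ".join(tokens)


def classify_office(raw):
    """Toy classifier: returns a level/branch label for a raw office string."""
    office = normalize_office(raw)
    if office.startswith(("NC HOUSE OF REPRESENTATIVES", "NC STATE SENATE")):
        return "state/legislative"
    return "unclassified"


medsl_office = "NC HOUSE OF REPRESENTATIVES DISTRICT 047"
ncsbe_office = "NC HOUSE OF REPRESENTATIVES - DISTRICT 47"

# The two sources are consistent when both strings land in the same class.
consistent = classify_office(medsl_office) == classify_office(ncsbe_office)
```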

3. Parser Validation

When two independent parsers (the MEDSL parser and the NC SBE parser) produce the same vote counts for the same contest, both parsers are likely correct. When they disagree, the disagreement localizes the bug to one parser or the other — far easier to debug than a single-source pipeline where errors are invisible.

Beyond NC

The NC overlap is our deepest validation case because NC SBE publishes granular, machine-readable precinct data going back to 2006. Other states offer less overlap:

| State | MEDSL 2022 | Secondary Source | Overlap Quality |
|---|---|---|---|
| NC | Yes | NC SBE (precinct-level, 2006–2024) | High |
| FL | Yes | OpenElections (county-level, select years) | Medium |
| OH | Yes | OpenElections (precinct-level, 2022) | Medium |
| GA | Yes | Clarity/Scytl (election night, unstable URLs) | Low |
| All others | Yes | MEDSL only | None |

As additional state-level sources are integrated, each creates a new validation pair. The architecture is designed to scale: the L4 cross-source reconciliation algorithm runs for any pair of sources that cover the same (state, year, contest) combination. No changes to the reconciliation code are required; a new source needs only new L0 data and a new L1 parser.
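The pairing logic can be sketched as follows, assuming each source exposes the set of (state, year, contest) keys it covers. The `coverage` layout, source names, and contest keys are hypothetical and stand in for whatever the L1 parsers actually emit.

```python
from itertools import combinations


def reconciliation_jobs(coverage):
    """List every pair of sources with at least one shared contest key.

    coverage maps a source name to a set of (state, year, contest) tuples.
    Returns (source_a, source_b, sorted_shared_keys) tuples.
    """
    jobs = []
    for a, b in combinations(sorted(coverage), 2):
        shared = coverage[a] & coverage[b]
        if shared:
            jobs.append((a, b, sorted(shared)))
    return jobs


coverage = {
    "medsl": {("NC", 2022, "us_senate"), ("FL", 2022, "governor")},
    "nc_sbe": {("NC", 2022, "us_senate")},
    "openelections": {("FL", 2022, "governor")},
}

jobs = reconciliation_jobs(coverage)
```

Adding a source in this sketch means adding one entry to `coverage`; the pairing function itself does not change.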

The Lesson

Cross-source reconciliation is not an optional feature: it is the most reliable way we have found to detect errors in election data. A single source can be internally consistent and still wrong. Two independent sources that agree are very likely right. Two independent sources that disagree tell you which contest to investigate.

The 90.5% exact match rate across 640 NC contests is our current evidence floor. Every additional source and state that achieves similar agreement raises confidence in the pipeline. Every disagreement is a bug report — either in our pipeline or in the source data.