Cross-Source Reconciliation

When two independent sources cover the same election, their overlap becomes a validation set. If MEDSL and NC SBE both report results for the same contest in the same county, the vote totals should match. When they do, both sources are credible. When they don’t, at least one has an error — and the disagreement reveals data quality issues that no single-source analysis can detect.

North Carolina 2022 is our primary validation case. Both MEDSL and the NC State Board of Elections publish precinct-level results for all NC contests in the 2022 general election.

The Overlap

We identified 640 contests present in both MEDSL and NC SBE for the 2022 general election. These span federal, state, county, municipal, judicial, and school board races across all 100 NC counties.

For each contest, we aggregated precinct-level results to the contest level and compared total votes per candidate.

Agreement Level	Contests	Percentage
Exact vote total match	579	90.5%
Within 1% of each other	47	7.3%
Disagree by more than 1%	14	2.2%
Total	640	100%

90.5% exact match across 640 contests, derived from two completely independent data pipelines (MIT’s academic processing vs. NC’s official state board reporting), is strong evidence that both sources are faithfully representing the same underlying certified results.
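
The comparison behind the table above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names and row layout are hypothetical, candidate names are assumed to be already aligned across sources, and "within 1%" is interpreted here as the worst per-candidate relative difference, which the text does not specify.

```python
from collections import defaultdict

def aggregate(rows):
    """Sum precinct-level rows into {contest: {candidate: votes}}.

    Each row is (contest, candidate, precinct, votes); this layout is assumed.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for contest, candidate, _precinct, votes in rows:
        totals[contest][candidate] += votes
    return totals

def agreement_level(a, b):
    """Classify one contest from two {candidate: votes} mappings."""
    candidates = set(a) | set(b)
    if all(a.get(c, 0) == b.get(c, 0) for c in candidates):
        return "exact"
    # Assumption: measure disagreement as the largest relative gap per candidate.
    worst = max(
        abs(a.get(c, 0) - b.get(c, 0)) / max(a.get(c, 0), b.get(c, 0))
        for c in candidates
        if max(a.get(c, 0), b.get(c, 0)) > 0
    )
    return "within_1pct" if worst <= 0.01 else "over_1pct"

# Made-up rows for illustration only.
medsl_rows = [("SHERIFF", "A", "P1", 100), ("SHERIFF", "A", "P2", 50), ("SHERIFF", "B", "P1", 80)]
ncsbe_rows = [("SHERIFF", "A", "P1", 100), ("SHERIFF", "A", "P2", 51), ("SHERIFF", "B", "P1", 80)]
level = agreement_level(aggregate(medsl_rows)["SHERIFF"], aggregate(ncsbe_rows)["SHERIFF"])
```

With these made-up rows, candidate A totals 150 in one source and 151 in the other, so the contest lands in the "within 1%" band.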

The 7.3% — Small Disagreements

The 47 contests with near-matches (within 1%) trace to identifiable causes:

Cause	Contests	Notes
Provisional ballot inclusion timing	22	MEDSL snapshot taken before final canvass; NC SBE includes provisionals
Precinct boundary rounding	11	Split precincts assigned differently by each source
Write-in aggregation	9	NC SBE reports individual write-ins; MEDSL aggregates to “Write-in”
Unknown	5	Under investigation

Where the cause is known, these are not errors; they are legitimate differences in how two organizations process the same raw certified results. Provisional ballot timing is the most common cause: MEDSL’s data may reflect an earlier snapshot of the canvass than NC SBE’s final certified totals. The 5 contests with no identified cause remain under investigation.

The 2.2% — Real Disagreements

The 14 contests with >1% disagreement require individual investigation. Common causes include:

Misassigned precincts. A precinct’s results attributed to the wrong contest or district in one source.
Partial data. One source missing results from a subset of precincts, typically in multi-county contests where one county’s data arrived late.
Candidate name mismatch causing split. The same candidate’s votes split across two entity IDs in one source because a name variant was not resolved — e.g., “JOHN SMITH” in early voting vs. “John R. Smith” in election-day results treated as different candidates.

These 14 cases are flagged by the L4 cross-source reconciliation algorithm and reported in the verification output. They are not silently ignored.
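
One of the causes above, a candidate's votes split across two name variants, leaves a recognizable arithmetic trace: two unmatched totals in one source sum to one unmatched total in the other. The sketch below is a hypothetical heuristic for spotting that pattern, not the L4 algorithm itself; the function name and input shape are assumptions.

```python
from itertools import combinations

def find_split_candidates(source_a, source_b):
    """Return (name1, name2, target) triples where two unmatched totals in
    source_a add up to an unmatched total in source_b.

    Both arguments are {candidate_name: votes} for a single contest.
    """
    unmatched_a = {n: v for n, v in source_a.items() if source_b.get(n) != v}
    unmatched_b = {n: v for n, v in source_b.items() if source_a.get(n) != v}
    hits = []
    for (n1, v1), (n2, v2) in combinations(sorted(unmatched_a.items()), 2):
        for target, total in sorted(unmatched_b.items()):
            if v1 + v2 == total:
                hits.append((n1, n2, target))
    return hits

# Made-up totals for illustration only.
split_source = {"JOHN SMITH": 400, "John R. Smith": 600, "Jane Doe": 900}
clean_source = {"John R. Smith": 1000, "Jane Doe": 900}
hits = find_split_candidates(split_source, clean_source)
```

A hit is only a lead for manual review: unrelated totals can sum to the same number by coincidence, especially in small contests.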

Name Formatting Differences

Vote totals may agree, but candidate names frequently do not. Of the 640 overlapping contests, 401 (62.7%) have at least one candidate whose name is formatted differently between MEDSL and NC SBE.

Formatting Difference	Example (MEDSL)	Example (NC SBE)	Frequency
ALL CAPS vs Title Case	`TIMOTHY LANCE`	`Timothy Lance`	389
Last-first vs first-last	`LANCE, TIMOTHY`	`Timothy Lance`	247
Middle initial present/absent	`SHANNON W BRAY`	`Shannon W. Bray`	118
Period after middle initial	`SHANNON W BRAY`	`Shannon W. Bray`	94
Nickname in quotes vs parens	`CHARLES "CHARLIE" CRIST`	`Charles (Charlie) Crist`	12
Suffix formatting	`ROBERT WILLIAMS JR`	`Robert Williams, Jr.`	31
Prefix/title included	`HON. JANE DOE`	`Jane Doe`	8

A single candidate can exhibit multiple formatting differences simultaneously. “BRAY, SHANNON W” (MEDSL) vs “Shannon W. Bray” (NC SBE) combines casing, ordering, and punctuation differences in one pair.

This is why entity resolution exists. The vote totals confirm these are the same contests with the same candidates. The name formatting confirms that string equality is insufficient — structured decomposition, embedding, and in some cases LLM confirmation are required to link records across sources.
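
To make "structured decomposition" concrete, here is a minimal sketch that splits a raw name into comparable components. It covers only the formatting differences listed in the table above and is not the pipeline's actual resolver; the function name, the suffix and title lists, and the output shape are all assumptions. Embedding and LLM confirmation are out of scope here.

```python
import re

SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}    # assumed list, not exhaustive
TITLES = {"hon", "dr", "mr", "mrs", "ms"}     # assumed list, not exhaustive

def decompose(raw):
    """Split a raw candidate name into lowercase components."""
    text = raw.strip()
    # Pull out a nickname written as "Charlie" or (Charlie); straight quotes only.
    nickname = None
    m = re.search(r'["(]([^")]+)[")]', text)
    if m:
        nickname = m.group(1).lower()
        text = text[:m.start()] + text[m.end():]
    text = text.replace(".", "").lower()
    # Reorder "last, first middle" unless the part after the comma is a suffix.
    if "," in text:
        head, tail = (part.strip() for part in text.split(",", 1))
        text = f"{head} {tail}" if tail in SUFFIXES else f"{tail} {head}"
    tokens = text.split()
    if tokens and tokens[0] in TITLES:
        tokens = tokens[1:]
    suffix = tokens.pop() if tokens and tokens[-1] in SUFFIXES else None
    if not tokens:
        raise ValueError(f"no name tokens in {raw!r}")
    return {
        "first": tokens[0],
        "middle": " ".join(tokens[1:-1]) or None,
        "last": tokens[-1],
        "suffix": suffix,
        "nickname": nickname,
    }

same = decompose("BRAY, SHANNON W") == decompose("Shannon W. Bray")
```

Comparing the decomposed fields rather than the raw strings makes the casing, ordering, and punctuation variants equal. A pair that differs by a missing middle initial would still compare unequal under plain dictionary equality, which is one reason later, fuzzier stages are needed.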

This Overlap as a Validation Set

The 640-contest NC overlap serves three purposes in the pipeline:

1. Entity Resolution Validation

For every candidate pair that the L3 cascade matches across MEDSL and NC SBE, we can verify the match by comparing vote totals. If the cascade says “TIMOTHY LANCE” (MEDSL) and “Timothy Lance” (NC SBE) are the same person, and their vote totals match exactly, the match is confirmed by an independent signal. If the cascade says they match but the vote totals disagree by 50%, the match is suspect.
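
A sketch of that check, under stated assumptions: the function name is hypothetical, and the intermediate "plausible" tier with its 1% tolerance is borrowed from the agreement bands earlier in this section rather than taken from the L3 cascade.

```python
def validate_match(votes_a, votes_b, tolerance=0.01):
    """Grade a name match using the two sources' vote totals.

    Assumes non-negative integer totals for the same candidate and contest.
    """
    if votes_a == votes_b:
        return "confirmed"
    relative_gap = abs(votes_a - votes_b) / max(votes_a, votes_b)
    # Assumption: small gaps are tolerated, e.g. for provisional ballot timing.
    return "plausible" if relative_gap <= tolerance else "suspect"
```

For example, totals of 1000 and 1000 are graded "confirmed", while 1000 and 500 are graded "suspect".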

2. Office Classification Validation

Both sources cover the same contests but use different office name strings. MEDSL might report “NC HOUSE OF REPRESENTATIVES DISTRICT 047” while NC SBE reports “NC HOUSE OF REPRESENTATIVES - DISTRICT 47”. If both classify to state/legislative, the classifier is consistent. If one classifies to state/legislative and the other to county/legislative, we have a bug.
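
The consistency check can be sketched as below. The keyword rules are a toy stand-in for the real classifier, which this section does not describe, and the function names are hypothetical; the point is only that both office strings pass through the same classifier and the results are compared.

```python
import re

def normalize_office(raw):
    """Uppercase, drop punctuation, and strip leading zeros from numbers."""
    text = raw.upper()
    text = re.sub(r"[^A-Z0-9 ]", " ", text)    # "-" and similar become spaces
    text = re.sub(r"\b0+(\d)", r"\1", text)    # 047 -> 47
    return " ".join(text.split())

def classify(office):
    """Toy keyword classifier returning (level, branch)."""
    text = normalize_office(office)
    if "NC HOUSE" in text or "NC SENATE" in text:
        return ("state", "legislative")
    if "COUNTY COMMISSION" in text:
        return ("county", "legislative")
    return ("unknown", "unknown")

def classification_consistent(office_a, office_b):
    """True when both source strings classify to the same category."""
    return classify(office_a) == classify(office_b)

medsl_office = "NC HOUSE OF REPRESENTATIVES DISTRICT 047"
ncsbe_office = "NC HOUSE OF REPRESENTATIVES - DISTRICT 47"
consistent = classification_consistent(medsl_office, ncsbe_office)
```

Here both strings normalize to the same text and classify to state/legislative, so the pair is consistent.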

3. Parser Validation

When two independent parsers (the MEDSL parser and the NC SBE parser) produce the same vote counts for the same contest, both parsers are likely correct. When they disagree, the disagreement localizes the bug to one parser or the other — far easier to debug than a single-source pipeline where errors are invisible.
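
When contest totals disagree, a precinct-level diff narrows the search. The sketch below assumes both parsers emit a {precinct: votes} mapping for one candidate in one contest, with precinct identifiers already aligned across sources; the function name is hypothetical.

```python
def diff_precincts(parsed_a, parsed_b):
    """Return {precinct: (votes_a, votes_b)} for precincts where the two
    parsers differ. None marks a precinct missing from one side."""
    precincts = sorted(set(parsed_a) | set(parsed_b))
    return {
        p: (parsed_a.get(p), parsed_b.get(p))
        for p in precincts
        if parsed_a.get(p) != parsed_b.get(p)
    }

# Made-up counts for illustration only.
from_medsl = {"P1": 100, "P2": 50, "P3": 75}
from_ncsbe = {"P1": 100, "P2": 51}
differences = diff_precincts(from_medsl, from_ncsbe)
```

A precinct missing from one side points at partial data, while a precinct present in both with different counts points at a parsing or timing difference.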

Beyond NC

The NC overlap is our deepest validation case because NC SBE publishes granular, machine-readable precinct data going back to 2006. Other states offer less overlap:

State	MEDSL 2022	Secondary Source	Overlap Quality
NC	Yes	NC SBE (precinct-level, 2006–2024)	High
FL	Yes	OpenElections (county-level, select years)	Medium
OH	Yes	OpenElections (precinct-level, 2022)	Medium
GA	Yes	Clarity/Scytl (election night, unstable URLs)	Low
All others	Yes	MEDSL only	None

As additional state-level sources are integrated, each creates a new validation pair. The architecture is designed to scale: the L4 cross-source reconciliation algorithm runs for any pair of sources that cover the same (state, year, contest) combination. The reconciliation code itself does not change; a new source needs only new L0 data and a new L1 parser.
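
The pairing rule can be sketched as follows: group coverage records by (state, year, contest) and emit every pair of sources that share a key. The function name and record layout are assumptions for illustration, not the L4 implementation.

```python
from collections import defaultdict
from itertools import combinations

def reconciliation_pairs(coverage):
    """Map each (state, year, contest) key to the source pairs covering it.

    coverage is an iterable of (source, state, year, contest) tuples.
    Keys covered by a single source are omitted.
    """
    by_key = defaultdict(set)
    for source, state, year, contest in coverage:
        by_key[(state, year, contest)].add(source)
    return {
        key: list(combinations(sorted(sources), 2))
        for key, sources in by_key.items()
        if len(sources) >= 2
    }

# Made-up coverage records for illustration only.
coverage = [
    ("MEDSL", "NC", 2022, "US SENATE"),
    ("NC SBE", "NC", 2022, "US SENATE"),
    ("MEDSL", "GA", 2022, "US SENATE"),
]
pairs = reconciliation_pairs(coverage)
```

Adding a third source for the same key would yield three pairs without any change to this function, which is the scaling property described above.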

The Lesson

Cross-source reconciliation is not a feature — it is the only reliable way to detect errors in election data. A single source can be internally consistent and still wrong. Two independent sources that agree are almost certainly right. Two independent sources that disagree tell you exactly where to look.

The 90.5% exact match rate across 640 NC contests is our current evidence floor. Every additional source and state that achieves similar agreement raises confidence in the pipeline. Every disagreement is a bug report — either in our pipeline or in the source data.
