Source Overview

This project ingests election data from seven sources. None are complete on their own. Each fills a different gap — geographic breadth, temporal depth, local race coverage, geographic boundaries, or reference identifiers. The pipeline merges them into a unified schema; this chapter documents what each provides and where they overlap.

Source Summary

| Source | What It Provides | Coverage | Format | Access Method |
|---|---|---|---|---|
| MEDSL | Precinct-level returns for federal, state, and some local races | 50 states + DC; 2018, 2020, 2022 (~36.5M rows) | CSV/TSV, one file per state per cycle | Harvard Dataverse download, GitHub mirror |
| NC SBE | Precinct-level returns for every contest on the ballot, with vote mode breakdowns | NC only; 2006–2024 (10 cycles, ~2M rows) | Tab-delimited TXT in ZIP archives | S3 bucket direct download |
| OpenElections | Community-curated precinct-level CSV files | ~8 states with 2022 data (FL, GA, MI, OH, PA, TX, others); coverage varies | CSV, schema varies by state | Git clone per state repo on GitHub |
| Clarity/Scytl | Election night reporting with precinct-level XML results | ~1,000+ jurisdictions nationwide | Structured XML in ZIP files | Per-jurisdiction URLs (unstable across cycles) |
| VEST | Precinct boundaries (shapefiles) with vote counts as attributes | 50 states; odd-year elections for KY/LA/MS/VA (2015, 2019) | Shapefile (.shp/.dbf/.shx/.prj) | Harvard Dataverse download |
| Census | FIPS reference codes for states (50+DC), counties (3,143), and places (31,980) | National, 2020 vintage | Pipe-delimited text files | census.gov direct download |
| FEC | Federal candidate master records with stable CAND_ID identifiers | All registered federal candidates; 2020 and 2022 loaded | Pipe-delimited TXT (cn.txt) in ZIP | fec.gov bulk download |

What Each Source Contributes to the Pipeline

MEDSL is the backbone. It covers all 50 states and DC at precinct granularity for three recent even-year cycles (2018, 2020, 2022). Approximately 41.5% of rows in the 2022 dataset have a blank dataverse column, which indicates local races. Seven states have zero local race rows; see Coverage Matrix.
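A share like the blank-dataverse figure could be measured roughly as follows. This is a minimal sketch, not the pipeline's actual code: only the dataverse column name comes from the text above, while the other column names and the sample rows are made up for illustration.

```python
import csv
import io


def blank_dataverse_share(rows):
    """Return the fraction of rows whose `dataverse` field is blank."""
    total = 0
    blank = 0
    for row in rows:
        total += 1
        # Treat missing, empty, and whitespace-only values as blank.
        if not (row.get("dataverse") or "").strip():
            blank += 1
    return blank / total if total else 0.0


# Tiny invented sample; real MEDSL files are far larger and have more columns.
SAMPLE = (
    "state,office,dataverse,votes\n"
    "NC,US SENATE,SENATE,100\n"
    "NC,SCHOOL BOARD,,40\n"
    "NC,CITY COUNCIL,,25\n"
    "NC,US HOUSE,HOUSE,90\n"
)

share = blank_dataverse_share(csv.DictReader(io.StringIO(SAMPLE)))
print(f"{share:.1%} of sample rows have a blank dataverse value")
```

Because the function accepts any iterable of dict rows, the same call would work on a csv.DictReader opened over a per-state file.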

NC SBE provides the deepest single-state coverage: every contest on every ballot in every precinct across 10 election cycles. It is the only source that provides vote mode breakdowns (Election Day, early, absentee, provisional) for local races. It serves as the primary validation dataset for cross-source entity resolution.

OpenElections fills state-level gaps where MEDSL coverage is incomplete or where an alternative source view aids cross-validation. Schema varies by state, requiring per-state parser logic.

Clarity has the highest value for hyperlocal races (school board, city council, judicial) because it captures results directly from county ENR systems. It is not yet integrated into the pipeline; URL instability is the primary obstacle.

VEST provides the only precinct boundary geometries in the corpus, enabling geographic analysis. It also covers odd-year elections (2015, 2019) for the four states that hold state elections off-cycle (KY, LA, MS, VA); MEDSL’s loaded cycles do not include these.

Census provides the authoritative FIPS code-to-name mappings used at L1 for geographic enrichment and cross-source geographic joins.

FEC provides stable candidate identifiers (CAND_ID) for federal candidates, used at L3 as reference anchors during entity resolution.

Cross-Source Overlap

Two source pairs have been compared quantitatively.

MEDSL + NC SBE (North Carolina, 2022 General)

Both sources report precinct-level results for the same 640 contests in North Carolina’s 2022 general election. Comparison results:

| Metric | Value |
|---|---|
| Contests with exact vote total match | 579 (90.5%) |
| Contests matching within 1% | 47 (7.3%) |
| Contests disagreeing by >1% | 14 (2.2%) |
| Contests with different candidate name formatting | 401 (63%) |
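The three vote-total buckets in the table could be assigned along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, the relative difference is measured against the larger of the two totals, and each total is assumed to be already summed across precincts. The source does not specify which of these definitions the pipeline uses.

```python
from collections import Counter


def classify_contest(total_a: int, total_b: int, tolerance: float = 0.01) -> str:
    """Bucket a contest by how closely two sources' vote totals agree."""
    if total_a == total_b:
        return "exact"
    # Assumption: relative difference is taken against the larger total.
    # The totals differ here, so the larger one cannot be zero.
    rel_diff = abs(total_a - total_b) / max(total_a, total_b)
    return "within_tolerance" if rel_diff <= tolerance else "disagree"


# Invented (source A total, source B total) pairs, one per contest.
pairs = [(1000, 1000), (1000, 995), (1000, 950), (0, 0)]
summary = Counter(classify_contest(a, b) for a, b in pairs)
print(summary)
```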

The 63% name formatting difference rate is the reason entity resolution exists. MEDSL reports SHANNON W BRAY (all caps, no period). NC SBE reports Shannon W. Bray (title case, period after initial). Same person, different string. This overlap is the primary test bed for the matching pipeline — see Cross-Source Reconciliation.
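The formatting difference in this example can be removed with a simple normalization pass. The sketch below is illustrative only and is not the project's matching pipeline; the function name is hypothetical, and it handles just case, punctuation, and whitespace.

```python
import re


def normalize_name(name: str) -> str:
    """Uppercase a name, drop punctuation, and collapse runs of whitespace."""
    without_punct = re.sub(r"[^\w\s]", "", name)
    return " ".join(without_punct.upper().split())


medsl_name = "SHANNON W BRAY"
ncsbe_name = "Shannon W. Bray"
print(normalize_name(medsl_name) == normalize_name(ncsbe_name))
```

Stripping all punctuation also merges names such as O'Brien into OBRIEN, and it does nothing for nicknames, suffixes, or reordered name parts, so a real resolver would need more than this single step.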

MEDSL + OpenElections (Florida, 2022 General)

Florida OpenElections data contains 6,013 “Registered Voters” rows (67.9% of non-candidate records), which are turnout metadata rows mixed into the results file. This overlap revealed the non-candidate row problem documented in Non-Candidate Records.
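Rows like these can be separated from candidate rows before any vote totals are compared. The sketch below assumes the label appears in a column named candidate, which is a guess and not a confirmed OpenElections schema; only the "Registered Voters" label itself comes from the text above.

```python
# Labels known to mark non-candidate rows; extend as more are discovered.
NON_CANDIDATE_LABELS = {"registered voters"}


def split_non_candidates(rows, field="candidate"):
    """Split rows into (candidate_rows, non_candidate_rows) by label."""
    candidates, non_candidates = [], []
    for row in rows:
        label = (row.get(field) or "").strip().lower()
        if label in NON_CANDIDATE_LABELS:
            non_candidates.append(row)
        else:
            candidates.append(row)
    return candidates, non_candidates


# Invented rows for illustration.
rows = [
    {"candidate": "Jane Doe", "votes": "120"},
    {"candidate": "Registered Voters", "votes": "900"},
    {"candidate": "John Roe", "votes": "95"},
]
kept, dropped = split_non_candidates(rows)
print(len(kept), len(dropped))
```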

Source Priority Ranking

When multiple sources report results for the same contest, the pipeline applies a priority order to select the authoritative record:

| Priority | Source Type | Rationale | Examples |
|---|---|---|---|
| 1 | Certified state data | Published by the official election authority; legally authoritative | NC SBE |
| 2 | Academic curated | Cleaned and standardized by researchers with documented methodology | MEDSL, VEST |
| 3 | Community curated | Volunteer-driven; quality varies by state and contributor | OpenElections |
| 4 | Election night reporting | Often preliminary, not certified; URLs are unstable | Clarity |
| 5 | Reference only | Not election results; used for enrichment and cross-referencing | Census, FEC |

Priority 1 sources are preferred when available. In practice, NC SBE is the only certified state source currently loaded. For the remaining 49 states and DC, MEDSL (priority 2) is the primary source. Lower-priority sources are retained in the record’s provenance for cross-validation, not discarded.
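One way to express this selection rule is sketched below. The record shape, the provenance field, and the function name are hypothetical; only the priority numbers and source names come from the table above. The reference-only sources (Census, FEC) are left out because they do not report results.

```python
# Priority per results source, taken from the ranking table (lower wins).
SOURCE_PRIORITY = {
    "NC SBE": 1,
    "MEDSL": 2,
    "VEST": 2,
    "OpenElections": 3,
    "Clarity": 4,
}


def select_canonical(records):
    """Pick the highest-priority record and keep every source as provenance."""
    ranked = sorted(records, key=lambda r: SOURCE_PRIORITY[r["source"]])
    canonical = dict(ranked[0])
    # Lower-priority sources are retained for cross-validation, not discarded.
    canonical["provenance"] = [r["source"] for r in ranked]
    return canonical


# Invented records for one contest reported by two sources.
records = [
    {"source": "MEDSL", "votes": 1000},
    {"source": "NC SBE", "votes": 1000},
]
chosen = select_canonical(records)
print(chosen["source"], chosen["provenance"])
```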

The priority ranking affects two pipeline decisions: which record becomes the canonical version at L4, and which confidence level is assigned. A record confirmed by two independent sources (e.g., MEDSL + NC SBE with matching vote totals) receives High confidence. A record from a single source receives Medium or Low depending on the source tier.
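The confidence rule might look like the following sketch. The text does not say which tiers map to Medium and which to Low, nor how disagreeing sources are scored, so both choices here are assumptions: the medium_max_tier cutoff is a hypothetical parameter, and multi-source records whose totals do not match fall back to the single-source rule.

```python
def assign_confidence(source_tiers, totals_match, medium_max_tier=2):
    """Return 'High', 'Medium', or 'Low' for a record.

    source_tiers: priority tiers (1 = best) of the sources reporting it.
    totals_match: whether the reporting sources agree on vote totals.
    medium_max_tier: assumed cutoff; tiers above it yield 'Low'.
    """
    if len(source_tiers) >= 2 and totals_match:
        return "High"
    # Assumption: otherwise score by the best (lowest-numbered) tier present.
    best_tier = min(source_tiers)
    return "Medium" if best_tier <= medium_max_tier else "Low"


print(assign_confidence([1, 2], totals_match=True))
print(assign_confidence([2], totals_match=True))
print(assign_confidence([4], totals_match=True))
```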