Source Overview
This project ingests election data from seven sources. None is complete on its own; each fills a different gap: geographic breadth, temporal depth, local race coverage, geographic boundaries, or reference identifiers. The pipeline merges them into a unified schema. This chapter documents what each source provides and where the sources overlap.
Source Summary
| Source | What It Provides | Coverage | Format | Access Method |
|---|---|---|---|---|
| MEDSL | Precinct-level returns for federal, state, and some local races | 50 states + DC; 2018, 2020, 2022 (~36.5M rows) | CSV/TSV, one file per state per cycle | Harvard Dataverse download, GitHub mirror |
| NC SBE | Precinct-level returns for every contest on the ballot, with vote mode breakdowns | NC only; 2006–2024 (10 cycles, ~2M rows) | Tab-delimited TXT in ZIP archives | S3 bucket direct download |
| OpenElections | Community-curated precinct-level CSV files | ~8 states with 2022 data (FL, GA, MI, OH, PA, TX, others); coverage varies | CSV, schema varies by state | Git clone per state repo on GitHub |
| Clarity/Scytl | Election night reporting with precinct-level XML results | ~1,000+ jurisdictions nationwide | Structured XML in ZIP files | Per-jurisdiction URLs (unstable across cycles) |
| VEST | Precinct boundaries (shapefiles) with vote counts as attributes | 50 states; odd-year elections for KY/LA/MS/VA (2015, 2019) | Shapefile (.shp/.dbf/.shx/.prj) | Harvard Dataverse download |
| Census | FIPS reference codes for states (50+DC), counties (3,143), and places (31,980) | National, 2020 vintage | Pipe-delimited text files | census.gov direct download |
| FEC | Federal candidate master records with stable CAND_ID identifiers | All registered federal candidates; 2020 and 2022 loaded | Pipe-delimited TXT (cn.txt) in ZIP | fec.gov bulk download |
What Each Source Contributes to the Pipeline
MEDSL is the backbone. It covers all 50 states at precinct granularity for three recent even-year cycles (2018, 2020, 2022). Approximately 41.5% of rows in the 2022 dataset have a blank dataverse column, which indicates a local race. Seven states have no local race rows at all (see Coverage Matrix).
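The blank-dataverse share quoted above is a simple row count. A minimal sketch of how it might be computed, assuming the MEDSL file has been opened with `csv.DictReader`; the helper name `blank_dataverse_share` and the sample rows are illustrative, not part of the pipeline:

```python
import csv
import io


def blank_dataverse_share(rows):
    """Return the fraction of rows whose `dataverse` field is blank."""
    total = 0
    blank = 0
    for row in rows:
        total += 1
        # Missing, empty, and whitespace-only values all count as blank.
        if not (row.get("dataverse") or "").strip():
            blank += 1
    return blank / total if total else 0.0


# Tiny made-up sample, not real MEDSL data.
sample = io.StringIO(
    "office,dataverse,votes\n"
    "US SENATE,SENATE,100\n"
    "SCHOOL BOARD,,40\n"
    "CITY COUNCIL, ,25\n"
    "GOVERNOR,STATE,90\n"
)
share = blank_dataverse_share(csv.DictReader(sample))
print(f"{share:.1%}")
```

Running the same function per state would show which states contribute no blank-dataverse rows.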
NC SBE provides the deepest single-state coverage: every contest on every ballot in every precinct across 10 election cycles. It is the only source that provides vote mode breakdowns (Election Day, early, absentee, provisional) for local races. It serves as the primary validation dataset for cross-source entity resolution.
OpenElections fills state-level gaps where MEDSL coverage is incomplete or where an alternative source view aids cross-validation. Schema varies by state, requiring per-state parser logic.
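One way to organize per-state parser logic is a registry of column maps keyed by state. The sketch below is an assumption about structure only: the column names for each state are hypothetical and do not describe the actual OpenElections files, and `to_unified` is an illustrative helper.

```python
# Hypothetical per-state column maps: unified field -> source column name.
COLUMN_MAPS = {
    "FL": {"precinct": "precinct", "contest": "office",
           "candidate": "candidate", "votes": "votes"},
    "OH": {"precinct": "precinct_name", "contest": "race",
           "candidate": "name", "votes": "total_votes"},
}


def to_unified(state, row):
    """Translate one source row (a dict) into the unified field names."""
    try:
        mapping = COLUMN_MAPS[state]
    except KeyError:
        raise ValueError(f"no parser registered for state {state!r}") from None
    return {
        "state": state,
        "precinct": row[mapping["precinct"]].strip(),
        "contest": row[mapping["contest"]].strip(),
        "candidate": row[mapping["candidate"]].strip(),
        "votes": int(row[mapping["votes"]]),
    }


# Made-up row using the hypothetical OH column names.
unified = to_unified("OH", {"precinct_name": " 01-A ", "race": "Mayor",
                            "name": "Jane Doe", "total_votes": "42"})
print(unified)
```

Failing loudly on an unregistered state keeps a new state's files from being parsed with the wrong column map.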
Clarity has the highest value for hyperlocal races (school board, city council, judicial) because it captures results directly from county election night reporting (ENR) systems. It is not yet integrated into the pipeline; URL instability is the primary obstacle.
VEST provides the only precinct boundary geometries in the corpus, enabling geographic analysis. It also covers odd-year elections (2015, 2019) for the four states that hold off-year state elections (KY, LA, MS, VA). MEDSL’s loaded cycles do not include this data.
Census provides the authoritative FIPS code-to-name mappings used at L1 for geographic enrichment and cross-source geographic joins.
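Joins on FIPS codes commonly fail when a loader parses the codes as integers and drops leading zeros. The sketch below shows one defensive way to build a five-character county key from a two-digit state code and a three-digit county code. The helper name `county_geoid` is illustrative, and the source does not say how the pipeline itself handles this.

```python
def county_geoid(state_fips, county_fips):
    """Build a 5-character county key from state and county FIPS codes."""
    # Accept ints or strings and restore any leading zeros that were lost.
    state = str(state_fips).strip().zfill(2)
    county = str(county_fips).strip().zfill(3)
    valid = (state.isdigit() and county.isdigit()
             and len(state) == 2 and len(county) == 3)
    if not valid:
        raise ValueError(f"bad FIPS parts: {state_fips!r}, {county_fips!r}")
    return state + county


print(county_geoid(37, 183))    # North Carolina (37), Wake County (183)
print(county_geoid("1", "1"))   # leading zeros restored on both parts
```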
FEC provides stable candidate identifiers (CAND_ID) for federal candidates, used at L3 as reference anchors during entity resolution.
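A minimal sketch of loading cn.txt into a lookup keyed by CAND_ID follows. It assumes the file has no header row and that the leading columns follow the order listed in `CN_COLUMNS`, which should be checked against the FEC's published file description before use. The function name `load_candidates` and the sample row are made up for illustration.

```python
import csv
import io

# Assumed order of the leading cn.txt columns; later columns are ignored.
CN_COLUMNS = [
    "CAND_ID", "CAND_NAME", "CAND_PTY_AFFILIATION", "CAND_ELECTION_YR",
    "CAND_OFFICE_ST", "CAND_OFFICE", "CAND_OFFICE_DISTRICT",
]


def load_candidates(lines):
    """Return {CAND_ID: record} from pipe-delimited candidate master lines."""
    # QUOTE_NONE keeps any quote characters inside names as literal text.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    candidates = {}
    for fields in reader:
        if len(fields) < len(CN_COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(CN_COLUMNS, fields))
        candidates[record["CAND_ID"]] = record
    return candidates


# Synthetic row, not a real FEC record.
sample = io.StringIO("H0XX00001|DOE, JANE|IND|2022|XX|H|01|C|N\nbad line\n")
candidates = load_candidates(sample)
print(candidates["H0XX00001"]["CAND_NAME"])
```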
Cross-Source Overlap
Two source pairs have been compared quantitatively.
MEDSL + NC SBE (North Carolina, 2022 General)
Both sources report precinct-level results for the same 640 contests in North Carolina’s 2022 general election. Comparison results:
| Metric | Value |
|---|---|
| Contests with exact vote total match | 579 (90.5%) |
| Contests matching within 1% | 47 (7.3%) |
| Contests disagreeing by >1% | 14 (2.2%) |
| Contests with different candidate name formatting | 401 (63%) |
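The first three rows of the table bucket each contest by how closely the two sources' vote totals agree. A sketch of that bucketing is below. The source does not define "within 1%", so this example assumes the difference is measured relative to the larger of the two totals; the function name and sample totals are illustrative.

```python
from collections import Counter


def classify_contest(total_a, total_b):
    """Bucket a contest by agreement between two sources' vote totals."""
    if total_a == total_b:
        return "exact"
    # Assumption: relative difference is taken against the larger total.
    relative_diff = abs(total_a - total_b) / max(total_a, total_b)
    return "within_1pct" if relative_diff <= 0.01 else "over_1pct"


# Made-up contest totals (source A, source B).
pairs = [(1000, 1000), (1000, 995), (1000, 900)]
tally = Counter(classify_contest(a, b) for a, b in pairs)
print(tally)
```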
The 63% name formatting difference rate is the reason entity resolution exists. MEDSL reports SHANNON W BRAY (all caps, no period); NC SBE reports Shannon W. Bray (title case, period after the initial). Same person, different string. This overlap is the primary test bed for the matching pipeline (see Cross-Source Reconciliation).
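For formatting-only differences like the one above, a simple normalization step is enough to make the two strings compare equal. The sketch below is an illustration, not the pipeline's actual matching logic, and `normalize_name` is a hypothetical helper.

```python
import re


def normalize_name(name):
    """Uppercase, strip punctuation, and collapse runs of whitespace."""
    # Remove every character that is not a word character or whitespace.
    cleaned = re.sub(r"[^\w\s]", "", name)
    return " ".join(cleaned.upper().split())


medsl = normalize_name("SHANNON W BRAY")
ncsbe = normalize_name("Shannon W. Bray")
print(medsl == ncsbe)
```

Stripping punctuation also merges hyphenated surnames and drops apostrophes, and it does nothing for nicknames, suffixes, or reordered name parts, which is why fuller entity resolution is still needed.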
MEDSL + OpenElections (Florida, 2022 General)
The Florida OpenElections data contains 6,013 “Registered Voters” rows (67.9% of its non-candidate records). These are turnout metadata rows mixed into the results file, not vote counts for a candidate. This overlap revealed the non-candidate row problem documented in Non-Candidate Records.
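A sketch of separating such metadata rows from results before comparison is shown below. Which column carries the "Registered Voters" label is an assumption here, so the example checks both an `office` and a `candidate` field. The label set contains only the one label named above, and `split_rows` is an illustrative helper.

```python
# Only the label named in this chapter; extend as other labels are found.
NON_CANDIDATE_LABELS = {"registered voters"}


def split_rows(rows, fields=("office", "candidate")):
    """Split rows into (results, metadata) by non-candidate labels."""
    results, metadata = [], []
    for row in rows:
        labels = {(row.get(f) or "").strip().lower() for f in fields}
        if labels & NON_CANDIDATE_LABELS:
            metadata.append(row)
        else:
            results.append(row)
    return results, metadata


# Made-up rows, not real Florida data.
rows = [
    {"office": "Governor", "candidate": "Jane Doe", "votes": "120"},
    {"office": "Registered Voters", "candidate": "", "votes": "900"},
    {"office": "Governor", "candidate": "John Roe", "votes": "80"},
]
results, metadata = split_rows(rows)
print(len(results), len(metadata))
```

Keeping the metadata rows in a separate list, rather than dropping them, leaves the turnout figures available for later use.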
Source Priority Ranking
When multiple sources report results for the same contest, the pipeline applies a priority order to select the authoritative record:
| Priority | Source Type | Rationale | Examples |
|---|---|---|---|
| 1 | Certified state data | Published by the official election authority; legally authoritative | NC SBE |
| 2 | Academic curated | Cleaned and standardized by researchers with documented methodology | MEDSL, VEST |
| 3 | Community curated | Volunteer-driven; quality varies by state and contributor | OpenElections |
| 4 | Election night reporting | Often preliminary, not certified; URLs are unstable | Clarity |
| 5 | Reference only | Not election results; used for enrichment and cross-referencing | Census, FEC |
Priority 1 sources are preferred when available. In practice, NC SBE is the only certified state source currently loaded. For the remaining 49 states and DC, MEDSL (priority 2) is the primary source. Lower-priority sources are not discarded; they are retained in the record’s provenance for cross-validation.
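The selection rule can be sketched as a sort on the priority table. The record layout (a dict with a `source` key) and the `pick_canonical` helper are assumptions made for illustration; Census and FEC are omitted because they are reference sources, not election results.

```python
# Priorities from the table above, results-bearing sources only.
SOURCE_PRIORITY = {
    "NC SBE": 1,
    "MEDSL": 2,
    "VEST": 2,
    "OpenElections": 3,
    "Clarity": 4,
}


def pick_canonical(records):
    """Return the highest-priority record, with all sources as provenance."""
    if not records:
        raise ValueError("no records to choose from")
    # sorted() is stable, so ties keep their input order.
    ranked = sorted(records, key=lambda r: SOURCE_PRIORITY[r["source"]])
    canonical = dict(ranked[0])
    canonical["provenance"] = [r["source"] for r in ranked]
    return canonical


# Made-up records for one contest.
chosen = pick_canonical([
    {"source": "MEDSL", "votes": 100},
    {"source": "NC SBE", "votes": 100},
])
print(chosen["source"], chosen["provenance"])
```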
The priority ranking affects two pipeline decisions: which record becomes the canonical version at L4, and which confidence level is assigned. A record confirmed by two independent sources (e.g., MEDSL + NC SBE with matching vote totals) receives High confidence. A record from a single source receives Medium or Low depending on the source tier.
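The confidence rule might look like the sketch below. Two details are assumptions, because the text does not specify them: the cut-off between Medium and Low (tiers 1 and 2 are treated as Medium here), and the handling of multiple sources whose totals disagree (the example falls back to the best available tier). The caller is assumed to pass only independent sources, and `assign_confidence` is an illustrative name.

```python
def assign_confidence(reports):
    """reports: list of (source_tier, vote_total) pairs for one record."""
    if not reports:
        raise ValueError("at least one report is required")
    totals = {total for _, total in reports}
    # Two or more sources with identical totals confirm each other.
    if len(reports) >= 2 and len(totals) == 1:
        return "High"
    # Assumption: tiers 1-2 map to Medium, lower tiers to Low.
    best_tier = min(tier for tier, _ in reports)
    return "Medium" if best_tier <= 2 else "Low"


print(assign_confidence([(2, 100), (1, 100)]))  # two sources, totals match
print(assign_confidence([(2, 100)]))            # single tier 2 source
print(assign_confidence([(3, 100)]))            # single tier 3 source
```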