Non-Candidate Records

Not every row in an election results file is a candidate. Sources routinely embed turnout metadata, ballot measure choices, vote quality indicators, and aggregation artifacts alongside candidate results — using the same columns, the same format, and no reliable flag to distinguish them.

If your system treats every row as a candidate, you will create entity records for people named “Registered Voters”, “For”, “BLANK”, and “TOTAL VOTES”. The L4 LLM audit in our prototype caught exactly this: “For” and “Against” were classified as person entities. They are not people.

The Four Categories

1. Turnout Metadata

Rows recording registration and participation counts at the precinct level:

| Pseudo-candidate | Meaning | Source |
| --- | --- | --- |
| Registered Voters | Total registered voters in precinct | FL OpenElections, NC SBE |
| Ballots Cast | Total ballots submitted | FL OpenElections, NC SBE |
| Cards Cast | Total ballot cards (may differ from ballots in multi-card elections) | FL OpenElections |

Florida OpenElections is the most prolific source of these rows. Of the “other” records in our FL 2022 ingest, 6,013 rows are “Registered Voters”, accounting for 67.9% of all non-candidate records in that source. These are not errors in the source data. They are genuine turnout figures published alongside contest results in the same file format.

2. Ballot Measure Choices

Rows representing choices on referenda, bond issues, and constitutional amendments:

| Pseudo-candidate | Meaning | Source |
| --- | --- | --- |
| For | Yes vote on ballot measure | OpenElections, MEDSL |
| Against | No vote on ballot measure | OpenElections, MEDSL |
| Yes | Yes vote on ballot measure | NC SBE, MEDSL |
| No | No vote on ballot measure | NC SBE, MEDSL |

These are legitimate vote counts, but the “candidate” is not a person. Detection requires examining both the candidate name (a single common word) and the contest name (which contains a keyword such as bond, referendum, amendment, or proposition). See Ballot Measure Choices.

3. Vote Quality Indicators

Rows recording ballots that did not produce a valid vote for any candidate:

| Pseudo-candidate | Meaning | Source |
| --- | --- | --- |
| Over Votes | Voter selected more candidates than allowed | MEDSL, NC SBE |
| Under Votes | Voter selected fewer candidates than allowed | MEDSL, NC SBE |
| BLANK | No selection made (Maine’s term for undervote) | MEDSL (ME) |
| Write-in | Aggregate write-in count (no specific candidate) | Multiple sources |

Over votes and under votes are important data quality signals. A contest with 15% over votes may indicate a confusing ballot design. But they are not candidates and must not be counted as such.
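As a rough illustration of how over votes can act as a quality signal, the sketch below computes the over-vote share of one contest. The `(name, votes)` row layout, the example names, and the `over_vote_share` helper are assumptions made for this example, not part of the prototype. The denominator here is every outcome recorded for the contest; other definitions are possible.

```python
# Hypothetical row layout: (candidate_name, votes) for a single contest.
OVER_VOTE_NAMES = {"over votes"}

def over_vote_share(rows):
    """Return the fraction of recorded outcomes in a contest that are over votes."""
    total = sum(votes for _, votes in rows)
    if total == 0:
        return 0.0
    over = sum(
        votes for name, votes in rows
        if name.strip().lower() in OVER_VOTE_NAMES
    )
    return over / total

contest = [
    ("Alice Example", 500),   # illustrative candidates, not real data
    ("Bob Example", 350),
    ("Over Votes", 150),
]
share = over_vote_share(contest)  # 150 / 1000
```

A share this high would be worth flagging for review, while the “Over Votes” row itself stays out of the candidate list.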

4. Aggregation Artifacts

Rows that are computational summaries, not individual results:

| Pseudo-candidate | Meaning | Source |
| --- | --- | --- |
| TOTAL VOTES | Sum of all candidates in the contest | MEDSL (UT) |
| Scattering | Aggregate of write-in candidates below reporting threshold | MEDSL (IA, MN) |
| TOTAL | Another sum variant | OpenElections |

These rows are redundant with the candidate-level data. Including them double-counts votes and inflates totals.
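The double-counting effect is easy to reproduce. In this minimal sketch the rows and the `ARTIFACT_NAMES` set are made up for illustration; summing every row doubles the contest total, while excluding the total row recovers the real figure.

```python
# Hypothetical contest rows, including a source-supplied total row.
ARTIFACT_NAMES = {"total votes", "total"}

rows = [
    ("Alice Example", 600),
    ("Bob Example", 400),
    ("TOTAL VOTES", 1000),
]

# Naive sum treats the total row as one more candidate.
naive_total = sum(votes for _, votes in rows)  # 2000

# Excluding the artifact row gives the true contest total.
clean_total = sum(
    votes for name, votes in rows
    if name.strip().lower() not in ARTIFACT_NAMES
)  # 1000
```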

The Detection Strategy

Non-candidate records are detected at L1 — the earliest possible point. The principle is extract before filter: non-candidate rows often contain valuable information (registered voter counts, undervote rates) that should be captured in the correct schema object before the row is excluded from contest analysis.
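A minimal sketch of extract before filter, assuming rows arrive as `(precinct, name, votes)` tuples. The tuple layout, the field names `registered` and `ballots_cast`, and the `extract_then_filter` helper are hypothetical. Turnout figures are captured per precinct first, and only then are those rows dropped from the candidate list.

```python
from collections import defaultdict

# Hypothetical mapping from pseudo-candidate name to a turnout field.
TURNOUT_FIELDS = {
    "registered voters": "registered",
    "ballots cast": "ballots_cast",
}

def extract_then_filter(rows):
    """Capture turnout figures per precinct, then return the remaining rows."""
    turnout = defaultdict(dict)
    kept = []
    for precinct, name, votes in rows:
        field = TURNOUT_FIELDS.get(name.strip().lower())
        if field is not None:
            turnout[precinct][field] = votes      # extract first
        else:
            kept.append((precinct, name, votes))  # then filter
    return dict(turnout), kept

rows = [
    ("P1", "Registered Voters", 1200),
    ("P1", "Ballots Cast", 600),
    ("P1", "Alice Example", 350),
    ("P1", "Bob Example", 250),
]
turnout, candidates = extract_then_filter(rows)
turnout_rate = turnout["P1"]["ballots_cast"] / turnout["P1"]["registered"]
```

Filtering first would have discarded the two rows needed to compute the precinct's turnout rate.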

Detection uses a three-part check:

  1. Exact match on candidate name. A lookup table of ~40 known pseudo-candidate strings: “Registered Voters”, “Ballots Cast”, “Over Votes”, “Under Votes”, “BLANK”, “TOTAL VOTES”, “Scattering”, “For”, “Against”, “Yes”, “No”, etc.

  2. Contest name pattern. For ambiguous names like “For” and “Against”, check whether the contest name contains ballot measure keywords: bond, referendum, amendment, proposition, measure, question, initiative, charter.

  3. Source-specific rules. Some sources use unique pseudo-candidates. Maine uses “BLANK”. Iowa uses “Scattering”. Utah includes “TOTAL VOTES” rows. Each source parser knows its own ghosts.
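The three-part check could be sketched roughly as follows. The name sets are a small excerpt rather than the full lookup table. The source keys (`medsl_me` and so on) and the `is_non_candidate` helper are invented for illustration. How the known strings are split between the global table and the per-source rules is an assumption of this sketch, not a description of the prototype.

```python
import re

# Part 1: names that are never candidates, whatever the contest (excerpt only).
ALWAYS_NON_CANDIDATE = {
    "registered voters", "ballots cast", "cards cast",
    "over votes", "under votes", "total",
}

# Part 2: ambiguous names that only count when the contest is a ballot measure.
MEASURE_CHOICES = {"for", "against", "yes", "no"}
MEASURE_KEYWORDS = re.compile(
    r"\b(?:bond|referendum|amendment|proposition|measure|question|"
    r"initiative|charter)s?\b",
    re.IGNORECASE,
)

# Part 3: source-scoped pseudo-candidates (source keys are hypothetical).
SOURCE_RULES = {
    "medsl_me": {"blank"},
    "medsl_ia": {"scattering"},
    "medsl_ut": {"total votes"},
}

def is_non_candidate(name, contest, source):
    """Return True if the row looks like a pseudo-candidate under this sketch."""
    key = name.strip().lower()
    if key in SOURCE_RULES.get(source, set()):
        return True
    if key in MEASURE_CHOICES:
        return bool(MEASURE_KEYWORDS.search(contest))
    return key in ALWAYS_NON_CANDIDATE
```

For example, `is_non_candidate("For", "School Bond Referendum", "openelections_fl")` is true, while the same name under a contest called "County Commissioner" is left alone because no ballot measure keyword appears.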

Routing

Detected non-candidate records are routed to the appropriate schema object:

| Category | Route to | Schema type |
| --- | --- | --- |
| Turnout metadata | TurnoutMetadata | Attached to sibling precinct records |
| Ballot measure choices | BallotMeasure | MeasureChoice with For/Against/Yes/No |
| Vote quality indicators | VoteQuality | Attached to parent contest record |
| Aggregation artifacts | Discarded | Redundant with candidate-level sums |

Records routed to TurnoutMetadata and VoteQuality are preserved in the L1 output — they are valuable data, just not candidate data. Aggregation artifacts are discarded with a note in the cleaning report.
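One way to express this routing, assuming detection has already tagged each row with a category. The `Category` enum, the `(category, name, votes)` tuple layout, and the `route` helper are hypothetical. Plain lists keyed by the schema object names stand in for the real schema objects.

```python
from enum import Enum

class Category(Enum):
    TURNOUT = "turnout_metadata"
    MEASURE = "ballot_measure_choice"
    QUALITY = "vote_quality"
    ARTIFACT = "aggregation_artifact"

# Category -> destination bucket; artifacts have no destination.
ROUTES = {
    Category.TURNOUT: "TurnoutMetadata",
    Category.MEASURE: "BallotMeasure",
    Category.QUALITY: "VoteQuality",
}

def route(rows):
    """Route pre-classified (category, name, votes) rows; category None = candidate."""
    routed = {"TurnoutMetadata": [], "BallotMeasure": [],
              "VoteQuality": [], "candidates": []}
    report = []
    for category, name, votes in rows:
        if category is None:
            routed["candidates"].append((name, votes))
        elif category is Category.ARTIFACT:
            # Discard, but leave a note for the cleaning report.
            report.append(f"discarded aggregation artifact: {name!r} ({votes} votes)")
        else:
            routed[ROUTES[category]].append((name, votes))
    return routed, report

rows = [
    (Category.TURNOUT, "Registered Voters", 1200),
    (None, "Alice Example", 350),
    (Category.QUALITY, "Under Votes", 12),
    (Category.ARTIFACT, "TOTAL VOTES", 362),
]
routed, report = route(rows)
```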

What Happens Without Detection

If non-candidate rows pass through to L2 and L3:

  • “Registered Voters” gets an embedding vector, a candidate entity ID, and appears in 6,013 precinct-level records as the most prolific “candidate” in Florida.
  • “For” and “Against” become person entities. The L4 LLM audit flagged exactly this in our prototype: “‘For’ is not a plausible person name.”
  • “TOTAL VOTES” inflates vote counts when aggregated, because the total row is summed alongside the individual candidate rows.
  • “Over Votes” appears as a candidate who received votes in every contest — the busiest politician in America.

Detection at L1 prevents all of these downstream errors.

Sub-Chapters