Schema Overview

The unified schema defines the structure of every election record at every pipeline layer. A single record represents one candidate’s (or one ballot measure choice’s) vote count in one geographic unit for one contest. All sources — MEDSL, NC SBE, OpenElections, VEST, Clarity — are normalized into this schema at L1. Subsequent layers (L2–L4) add fields but never remove them.

A record has seven sections: election, jurisdiction, contest, results, turnout, source, and provenance. Not every field is populated for every record. Fields that the source does not provide are null, not inferred.


Election

Identifies which election this record belongs to.

| Field | Type | Description | Example |
|---|---|---|---|
| date | date | Election date (ISO 8601) | 2022-11-08 |
| year | integer | Election year, derived from date | 2022 |
| type | ElectionType | General, primary, runoff, special, etc. | General |
| stage | string | Source-provided stage code | GEN |
| special | boolean | Whether this is a special election | false |
| certification_status | string | Certified, unofficial, or unknown | certified |

The type field is an enum — see Enumerations Reference. The stage field preserves the raw source value (MEDSL uses GEN/PRI/RUN; NC SBE does not have a stage column). The certification_status field reflects whether the source data represents certified results. NC SBE and MEDSL publish certified data. Clarity publishes unofficial election night results that may be updated.
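The two derivations described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, sources are identified by plain strings for brevity, and returning "unknown" for any source the text does not mention is an assumption.

```rust
/// Derive `year` from an ISO 8601 `YYYY-MM-DD` date string.
/// Returns `None` if the string is not in that shape.
pub fn year_from_date(date: &str) -> Option<u16> {
    let mut parts = date.split('-');
    let year = parts.next()?;
    let month = parts.next()?;
    let day = parts.next()?;
    if parts.next().is_some() || year.len() != 4 || month.len() != 2 || day.len() != 2 {
        return None;
    }
    year.parse().ok()
}

/// Map a source to a `certification_status` value, following the prose above:
/// NC SBE and MEDSL publish certified data, Clarity publishes unofficial results.
/// Assumption: every other source falls back to "unknown".
pub fn certification_status(source: &str) -> &'static str {
    match source {
        "MEDSL" | "NC SBE" => "certified",
        "Clarity" => "unofficial",
        _ => "unknown",
    }
}
```

Calling `year_from_date("2022-11-08")` yields `Some(2022)`, matching the example row in the table.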


Jurisdiction

Identifies the geographic unit where votes were counted.

| Field | Type | Description | Example |
|---|---|---|---|
| state | string | Full state name | North Carolina |
| state_po | string | Two-letter postal code | NC |
| state_fips | string | Two-digit state FIPS code | 37 |
| county | string | County name (may be null for statewide) | Wake |
| county_fips | string | Five-digit county FIPS code | 37183 |
| precinct | string | Precinct name or code from the source | 01-01 |
| precinct_code | string | Numeric precinct code (NC SBE only) | 0101 |
| jurisdiction_name | string | Jurisdiction name from MEDSL | WAKE |
| jurisdiction_fips | string | Jurisdiction FIPS from MEDSL | 37183 |
| ocd_id | string | Open Civic Data identifier (when available) | ocd-division/country:us/state:nc/county:wake |
| level | JurisdictionLevel | Geographic granularity of this record | Precinct |

The county_fips field is the primary geographic join key across sources. It is enriched from Census FIPS reference files at L1 when the source provides a county name but no code. The ocd_id field is populated when a mapping exists; it is null for most records today.
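The enrichment step can be pictured as a dictionary lookup. The sketch below is an assumption about how such a lookup might work, not the pipeline's implementation: `FipsIndex`, `build_index`, and `enrich_county_fips` are hypothetical names, and matching on an upper-cased, trimmed county name is an illustrative choice.

```rust
use std::collections::HashMap;

/// Hypothetical index keyed by (state FIPS, upper-cased county name).
pub type FipsIndex = HashMap<(String, String), String>;

/// Build the index from (state_fips, county_name, county_fips) reference rows.
pub fn build_index(rows: &[(&str, &str, &str)]) -> FipsIndex {
    rows.iter()
        .map(|(state_fips, county, county_fips)| {
            (
                (state_fips.to_string(), county.trim().to_uppercase()),
                county_fips.to_string(),
            )
        })
        .collect()
}

/// Return the county FIPS code for a record.
/// A code supplied by the source is kept as-is; otherwise the county name is
/// looked up. The result stays `None` (null) when nothing matches.
pub fn enrich_county_fips(
    index: &FipsIndex,
    state_fips: &str,
    county: Option<&str>,
    county_fips: Option<&str>,
) -> Option<String> {
    if let Some(code) = county_fips {
        return Some(code.to_string());
    }
    let name = county?.trim().to_uppercase();
    index.get(&(state_fips.to_string(), name)).cloned()
}
```

With a reference row `("37", "Wake", "37183")`, a source record carrying only the county name `WAKE` resolves to `37183`; a statewide record with no county stays null.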

The level field indicates what geographic unit this row represents. Most records are Precinct. Some sources provide only county-level aggregates (County). VEST records that include precinct boundaries are Precinct-level and carry accompanying geometry.


Contest

Describes the race or ballot measure.

| Field | Type | Description | Example |
|---|---|---|---|
| kind | ContestKind | CandidateRace, BallotMeasure, or TurnoutMetadata | CandidateRace |
| raw_name | string | Contest name exactly as it appears in the source | CABARRUS COUNTY SCHOOLS BOARD OF EDUCATION |
| normalized_name | string | Cleaned contest name (L1+) | Cabarrus County Schools Board of Education |
| office_level | OfficeLevel | Federal, state, county, municipal, etc. | County |
| office_category | OfficeCategory | Executive, legislative, judicial, school board, etc. | SchoolBoard |
| district | string | District number or name (blank if at-large) | DISTRICT 02 |
| dataverse | string | MEDSL’s race level tag (blank for local) | (blank) |
| classifier_method | ClassifierMethod | How office_level and office_category were assigned | Keyword |
| vote_for | integer | Maximum number of candidates a voter may select | 1 |
| magnitude | integer | Number of seats being filled | 3 |
| is_retention | boolean | Whether this is a judicial retention election | false |

The kind field is an enum with three variants — see Contest Kinds. The distinction between CandidateRace, BallotMeasure, and TurnoutMetadata is determined at L1 based on the contest name and choice values.

The classifier_method field records how the office_level and office_category were assigned: Keyword (deterministic string match, 62% of records), Regex (pattern-based, ~15%), Embedding (nearest-neighbor at L2), or Llm (LLM classification at L3). This field exists so that users can filter by classification confidence.
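Filtering on this field might look like the sketch below. The four variants come from the text; the `Contest` struct is trimmed to two fields, and grouping Keyword and Regex together as "rule-based" is a label introduced here for illustration, not a term the schema defines.

```rust
/// The four classification methods named in the schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ClassifierMethod {
    Keyword,
    Regex,
    Embedding,
    Llm,
}

impl ClassifierMethod {
    /// Hypothetical helper: true for the string-match and pattern-based methods.
    pub fn is_rule_based(self) -> bool {
        matches!(self, ClassifierMethod::Keyword | ClassifierMethod::Regex)
    }
}

/// Trimmed-down contest: only the fields this example needs.
pub struct Contest {
    pub raw_name: String,
    pub classifier_method: ClassifierMethod,
}

/// Keep only contests whose classification came from Keyword or Regex.
pub fn rule_based_only(contests: &[Contest]) -> Vec<&Contest> {
    contests
        .iter()
        .filter(|c| c.classifier_method.is_rule_based())
        .collect()
}
```

A consumer that wants only deterministic classifications could tighten the predicate to `ClassifierMethod::Keyword` alone.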

The vote_for field comes from NC SBE’s Vote For column. MEDSL does not provide this field. When unavailable, it defaults to null. The magnitude field comes from MEDSL’s magnitude column and indicates multi-member districts.


Results

An array of candidate results attached to the contest. For a CandidateRace, each element is one candidate. For a BallotMeasure, each element is one choice (e.g., “For”, “Against”). For TurnoutMetadata, the results array is empty.

| Field | Type | Description | Example |
|---|---|---|---|
| candidate_name | CandidateName | Decomposed name — see below | (see Name Components) |
| party_raw | string | Party label exactly as source provides | LIBERTARIAN |
| party_simplified | PartySimplified | Normalized party enum | Libertarian |
| votes_total | integer | Total votes for this candidate in this precinct | 90 |
| vote_share | float | Fraction of total contest votes (computed) | 0.023 |
| writein | boolean | Whether this is a write-in candidate | false |
| incumbent | boolean | Whether this candidate is the incumbent (if known) | null |
| vote_counts_by_type | VoteCountsByType | Breakdown by vote method — see below | (see below) |
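The computed vote_share field can be sketched as a simple ratio. The function name is hypothetical, and two details are assumptions: the denominator is taken to be the sum of votes_total over every element of the same results array, and a contest with zero total votes yields null rather than a division by zero.

```rust
/// Compute each result's share of the contest total.
/// Returns `None` for every entry when the contest has no votes at all.
pub fn vote_shares(votes_total: &[u64]) -> Vec<Option<f64>> {
    let contest_total: u64 = votes_total.iter().sum();
    votes_total
        .iter()
        .map(|&votes| {
            if contest_total == 0 {
                None
            } else {
                Some(votes as f64 / contest_total as f64)
            }
        })
        .collect()
}
```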

CandidateName

Names are decomposed into components rather than stored as a single string. This is documented in detail in Candidate Name Components.

| Field | Type | Description | Example |
|---|---|---|---|
| raw | string | Name exactly as it appears in the source | MICHAEL "STEVE" HUBER |
| first | string | Parsed first name | Michael |
| middle | string | Parsed middle name or initial | null |
| last | string | Parsed last name | Huber |
| suffix | string | Jr, Sr, II, III, IV, etc. | null |
| nickname | string | Detected nickname | Steve |
| canonical_first | string | Nickname-resolved first name | Stephen |

The raw field is preserved at every layer and never modified. The component fields are populated at L1 during name parsing. The canonical_first field is populated at L1 using the nickname dictionary (e.g., Charlie→Charles, Steve→Stephen, Pat→Patricia). All fields are available at every pipeline layer.
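A simplified parser along these lines is sketched below. It is an illustration under stated assumptions rather than the L1 parser itself: nicknames are detected only as double-quoted tokens, suffixes only as a trailing token, the dictionary holds just the three pairs quoted above, and canonical_first is resolved from the nickname when one is present (which reproduces the example row in the table) and from the first name otherwise.

```rust
use std::collections::HashMap;

#[derive(Debug, Default, PartialEq)]
pub struct CandidateName {
    pub raw: String,
    pub first: Option<String>,
    pub middle: Option<String>,
    pub last: Option<String>,
    pub suffix: Option<String>,
    pub nickname: Option<String>,
    pub canonical_first: Option<String>,
}

/// Hypothetical dictionary holding only the pairs mentioned in the text.
pub fn example_nicknames() -> HashMap<&'static str, &'static str> {
    HashMap::from([("Charlie", "Charles"), ("Steve", "Stephen"), ("Pat", "Patricia")])
}

/// Upper-case the first character and lower-case the rest.
fn title_case(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        Some(c) => c.to_uppercase().chain(chars.flat_map(|r| r.to_lowercase())).collect(),
        None => String::new(),
    }
}

pub fn parse_name(raw: &str, nicknames: &HashMap<&str, &str>) -> CandidateName {
    // `raw` is stored untouched; every other field is derived from it.
    let mut name = CandidateName { raw: raw.to_string(), ..Default::default() };
    let mut plain: Vec<&str> = Vec::new();
    for token in raw.split_whitespace() {
        if token.len() >= 3 && token.starts_with('"') && token.ends_with('"') {
            name.nickname = Some(title_case(token.trim_matches('"')));
        } else {
            plain.push(token);
        }
    }
    // A trailing JR/SR/II/III/IV is a suffix only if a first and last name remain.
    if plain.len() > 2 {
        let tail = plain[plain.len() - 1];
        let key = tail.trim_end_matches('.').to_uppercase();
        if ["JR", "SR", "II", "III", "IV"].contains(&key.as_str()) {
            name.suffix = Some(tail.to_string());
            plain.pop();
        }
    }
    if let Some(first) = plain.first() {
        name.first = Some(title_case(first));
    }
    if plain.len() >= 2 {
        name.last = Some(title_case(plain[plain.len() - 1]));
    }
    if plain.len() >= 3 {
        let middle: Vec<String> =
            plain[1..plain.len() - 1].iter().map(|t| title_case(t)).collect();
        name.middle = Some(middle.join(" "));
    }
    // Assumption: resolve via the nickname if present, else via the first name.
    let key = name.nickname.as_deref().or(name.first.as_deref());
    let canonical = key
        .and_then(|k| nicknames.get(k).map(|full| full.to_string()))
        .or_else(|| name.first.clone());
    name.canonical_first = canonical;
    name
}
```

Real-world names (multi-word surnames, parenthesized nicknames, "LAST, FIRST" ordering) need more rules than this sketch covers.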

VoteCountsByType

When the source provides vote mode breakdowns, they are stored here. NC SBE provides all four fields for every contest. MEDSL provides them when modes are split into separate rows (summed during L1). Most other sources provide only the total.

| Field | Type | Description | Example |
|---|---|---|---|
| election_day | integer | Election day votes | 136 |
| early | integer | Early / one-stop votes | 159 |
| absentee_mail | integer | Mail-in absentee votes | 7 |
| provisional | integer | Provisional ballot votes | 1 |

NC SBE calls early voting “One Stop.” MEDSL calls it “EARLY VOTING.” Both are mapped to the early field at L1.
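The label mapping and the summing of per-mode rows could be sketched as below. Only the two early-voting labels ("One Stop" and "EARLY VOTING") come from the text; the other three labels, the `add_row` and `total` helpers, and the case-insensitive matching are assumptions made for illustration.

```rust
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct VoteCountsByType {
    pub election_day: Option<u64>,
    pub early: Option<u64>,
    pub absentee_mail: Option<u64>,
    pub provisional: Option<u64>,
}

impl VoteCountsByType {
    /// Add one per-mode source row to the matching field.
    /// Returns false (and changes nothing) when the label is not recognized.
    pub fn add_row(&mut self, label: &str, votes: u64) -> bool {
        let slot = match label.trim().to_uppercase().as_str() {
            "ONE STOP" | "EARLY VOTING" => &mut self.early,
            // The labels below are assumed, not taken from the source documents.
            "ELECTION DAY" => &mut self.election_day,
            "ABSENTEE BY MAIL" => &mut self.absentee_mail,
            "PROVISIONAL" => &mut self.provisional,
            _ => return false,
        };
        *slot = Some(slot.unwrap_or(0) + votes);
        true
    }

    /// Sum of all populated modes; fields that are still null count as zero.
    pub fn total(&self) -> u64 {
        [self.election_day, self.early, self.absentee_mail, self.provisional]
            .iter()
            .flatten()
            .sum()
    }
}
```

Fields start as null and become numbers only once a matching row is seen, so a source that reports totals only leaves the whole breakdown empty.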


Turnout

Voter registration and participation counts for the geographic unit. These fields are sparsely populated: fewer than 5% of records have values.

| Field | Type | Description | Example |
|---|---|---|---|
| registered_voters | integer | Number of registered voters in this precinct | 2847 |
| ballots_cast | integer | Total ballots cast in this precinct | 1893 |
| turnout_pct | float | ballots_cast / registered_voters (computed) | 0.665 |

NC SBE provides registered_voters via “Registered Voters” pseudo-contest rows. These are extracted during L1 parsing and attached to the precinct’s turnout object. MEDSL rarely includes registration counts. Most records have null turnout.
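Because both inputs are usually null, the computed turnout_pct has to propagate missing values. A minimal sketch follows; the function name is hypothetical, and treating a zero registration count as null is an assumption.

```rust
/// ballots_cast / registered_voters, or `None` when either input is missing
/// or the registration count is zero.
pub fn turnout_pct(ballots_cast: Option<u64>, registered_voters: Option<u64>) -> Option<f64> {
    match (ballots_cast, registered_voters) {
        (Some(cast), Some(registered)) if registered > 0 => {
            Some(cast as f64 / registered as f64)
        }
        _ => None,
    }
}
```

For the example row, 1893 / 2847 is roughly 0.665.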


Source

Fields that document where this record came from. Hash-chain metadata lives in the separate Provenance section.

| Field | Type | Description | Example |
|---|---|---|---|
| source_type | SourceType | Enum identifying the source system | Medsl |
| source_file | string | Filename of the L0 artifact | 2022-nc-local-precinct-general.csv |
| source_row | integer | Row number in the source file | 14523 |
| retrieval_date | datetime | When the source file was downloaded (UTC) | 2025-01-15T03:22:00Z |
| confidence | Confidence | High, Medium, or Low | Medium |
| raw_fields | SourceRawFields | All original columns from the source, typed per source | (see below) |

SourceRawFields

The raw_fields object preserves every column from the original source row, typed as an enum per source. This ensures no information is lost during normalization.

| Variant | Source | Fields preserved |
|---|---|---|
| MedslRawRecord | MEDSL | All 25 MEDSL columns including state_cen, state_ic, readme_check, version |
| NcsbeRawRecord | NC SBE | All 15 NC SBE columns including Contest Group ID, Contest Type, Real Precinct |
| OpenElectionsRawRecord | OpenElections | Variable columns depending on state file |
| VestRawRecord | VEST | Encoded column names and geometry reference |
| ClarityRawRecord | Clarity | XML element attributes |
| FecRawRecord | FEC | All 15 cn.txt columns |
| CensusRawRecord | Census | FIPS file columns |

Each variant is a struct with typed fields matching the source schema. This is a Rust enum, not a JSON object — the type system ensures you cannot accidentally read an NC SBE field from a MEDSL record. See Type System Design.
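The shape of that enum can be sketched with two variants and the columns named in the table. This is a cut-down illustration: the real structs carry every source column, the `String` field types and snake_case field names are assumptions, and `real_precinct` is a hypothetical accessor.

```rust
/// Subset of the MEDSL columns named above; the remaining columns are omitted.
#[derive(Debug, Clone, PartialEq)]
pub struct MedslRawRecord {
    pub state_cen: String,
    pub state_ic: String,
    pub readme_check: String,
    pub version: String,
}

/// Subset of the NC SBE columns named above; the remaining columns are omitted.
#[derive(Debug, Clone, PartialEq)]
pub struct NcsbeRawRecord {
    pub contest_group_id: String,
    pub contest_type: String,
    pub real_precinct: String,
}

/// One variant per source; the other five variants are left out of this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum SourceRawFields {
    Medsl(MedslRawRecord),
    Ncsbe(NcsbeRawRecord),
}

/// Reading an NC SBE column forces a match on the variant, so a MEDSL record
/// can only ever answer `None` here.
pub fn real_precinct(raw: &SourceRawFields) -> Option<&str> {
    match raw {
        SourceRawFields::Ncsbe(record) => Some(record.real_precinct.as_str()),
        SourceRawFields::Medsl(_) => None,
    }
}
```

Writing `record.real_precinct` against a `MedslRawRecord` would not compile, which is the guarantee the paragraph above describes.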


Provenance

Hash chain and version metadata that enable verification and reproducibility.

| Field | Type | Description | Example |
|---|---|---|---|
| record_id | string | Deterministic hash of (source, file, row) | a3f8c2... |
| l1_hash | string | SHA-256 hash of this L1 record’s content | 7b2e91... |
| l0_parent_hash | string | SHA-256 hash of the L0 source artifact | c4d1f0... |
| l0_byte_offset | integer | Byte offset in the L0 file where this row starts | 1048576 |
| parser_version | string | Version of the parser that produced this record | 0.1.0 |
| schema_version | string | Version of the schema this record conforms to | 1.0.0 |

The hash chain links every record back to the original source bytes. If the L1 record is modified, its l1_hash changes and no longer matches the hash stored in any L2 record that references it. The verification algorithm at L4 checks the full chain: L4 → L3 → L2 → L1 → L0 → source bytes.
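A chain walk of this kind might be sketched as follows. It is not the L4 verification algorithm itself: `Link` and `verify_chain` are hypothetical names, each layer is reduced to its serialized bytes plus two hashes, and the hash function is passed in as a parameter because SHA-256 is not part of the Rust standard library.

```rust
/// One link in the chain, newest layer first.
pub struct Link<'a> {
    /// Serialized content of this layer's record (or the raw source bytes).
    pub content: &'a [u8],
    /// Hash stored alongside the record.
    pub stored_hash: &'a str,
    /// Hash of the parent layer that this record claims; `None` at the root.
    pub parent_hash: Option<&'a str>,
}

/// Walk the chain from the newest layer down to the source bytes.
/// Returns `Err(index)` for the first link whose content no longer matches its
/// stored hash, or whose parent reference does not match the next link.
pub fn verify_chain<H>(chain: &[Link<'_>], hash: H) -> Result<(), usize>
where
    H: Fn(&[u8]) -> String,
{
    for (i, link) in chain.iter().enumerate() {
        if hash(link.content) != link.stored_hash {
            return Err(i);
        }
        match (link.parent_hash, chain.get(i + 1)) {
            (Some(claimed), Some(parent)) if claimed == parent.stored_hash => {}
            (None, None) => {}
            _ => return Err(i),
        }
    }
    Ok(())
}
```

In practice the `hash` argument would be a SHA-256 implementation returning a hex string; any modification to a record's bytes then surfaces as an `Err` at that record's position.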

The record_id is deterministic: identical source input always produces the same record_id. This enables deduplication and makes re-processing idempotent.
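The determinism property can be demonstrated with a small sketch. The schema does not say which hash produces record_id, so the code below uses 64-bit FNV-1a purely as a dependency-free stand-in; the function names and the length-prefixed encoding of the three inputs are likewise assumptions.

```rust
/// 64-bit FNV-1a. A stand-in only: the real pipeline's hash is not specified here.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &byte in bytes {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Derive an id from (source, file, row). The same inputs always give the
/// same id, so re-running the parser over the same file yields the same ids.
pub fn record_id(source: &str, file: &str, row: u64) -> String {
    let row_bytes = row.to_be_bytes();
    let mut buf: Vec<u8> = Vec::new();
    // Length-prefix each field so ("ab", "c") and ("a", "bc") encode differently.
    for field in [source.as_bytes(), file.as_bytes(), &row_bytes[..]] {
        buf.extend_from_slice(&(field.len() as u64).to_be_bytes());
        buf.extend_from_slice(field);
    }
    format!("{:016x}", fnv1a64(&buf))
}
```

Deduplication then reduces to keeping one record per id, for example by inserting ids into a `HashSet` and skipping records whose id is already present.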


Layer-Specific Additions

Each pipeline layer adds fields to the record. Every field of the base schema (above) is present at L1, although fields the source does not provide remain null. Subsequent layers extend it:

| Layer | Fields added |
|---|---|
| L2 (Embedded) | candidate_name_embedding, contest_name_embedding, jurisdiction_embedding, embedding_model, embedding_version |
| L3 (Matched) | candidate_cluster_id, contest_cluster_id, match_confidence, match_method |
| L4 (Canonical) | canonical_candidate_name, canonical_contest_name, temporal_chain_id, verification_status, alias_table |

L1 records are self-contained. L2+ records reference their parent layer’s hash. No fields from earlier layers are removed or overwritten — each layer is additive.
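One way to picture the additive rule is a wrapper struct that owns its parent record unchanged. The sketch below is illustrative only: the L1 record is cut down to two fields, `l1_parent_hash` and `to_l2` are hypothetical names, and `Vec<f32>` is an assumed embedding type.

```rust
/// Cut-down L1 record: just enough fields to show the relationship.
#[derive(Debug, Clone, PartialEq)]
pub struct L1Record {
    pub l1_hash: String,
    pub raw_name: String,
}

/// An L2 record carries the L1 record untouched and adds its own fields.
#[derive(Debug, Clone, PartialEq)]
pub struct L2Record {
    pub base: L1Record,
    /// Hypothetical field name for the reference to the parent layer's hash.
    pub l1_parent_hash: String,
    pub contest_name_embedding: Vec<f32>,
    pub embedding_model: String,
    pub embedding_version: String,
}

/// Build an L2 record. The L1 record is moved in as-is, so nothing in it can
/// be dropped or rewritten by this step.
pub fn to_l2(base: L1Record, embedding: Vec<f32>, model: &str, version: &str) -> L2Record {
    L2Record {
        l1_parent_hash: base.l1_hash.clone(),
        base,
        contest_name_embedding: embedding,
        embedding_model: model.to_string(),
        embedding_version: version.to_string(),
    }
}
```

L3 and L4 would follow the same pattern, each wrapping the layer below it and recording that layer's hash.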


JSONL Representation

At every layer, records are serialized as one JSON object per line (JSONL). The seven sections are top-level keys:

{"election":{"date":"2022-11-08","year":2022,"type":"General",...},"jurisdiction":{"state":"North Carolina","state_po":"NC",...},"contest":{"kind":"CandidateRace","raw_name":"CABARRUS COUNTY SCHOOLS BOARD OF EDUCATION",...},"results":[{"candidate_name":{"raw":"GREG MILLS","first":"Greg","last":"Mills",...},"votes_total":79,...}],"turnout":null,"source":{"source_type":"Medsl","source_file":"2022-nc-local-precinct-general.csv",...},"provenance":{"record_id":"a3f8c2...","l1_hash":"7b2e91...",...}}

Files are streamable: each line is a complete record. Files are appendable: new records can be concatenated without modifying existing lines. Serialization uses serde_json in Rust. See Output Formats.
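The streamable and appendable properties follow from the line-per-record layout and can be shown with the standard library alone. In this sketch records are treated as opaque, already-serialized JSON strings (the real pipeline serializes with serde_json, which is not used here), and both function names are hypothetical.

```rust
use std::io::{self, BufRead, Write};

/// Append already-serialized records, one per line. Existing content in `out`
/// is never touched, which is what makes the format appendable.
pub fn append_records<W: Write>(out: &mut W, records: &[&str]) -> io::Result<()> {
    for record in records {
        // A record must not contain a raw newline, or it would span two lines.
        debug_assert!(!record.contains('\n'));
        out.write_all(record.as_bytes())?;
        out.write_all(b"\n")?;
    }
    Ok(())
}

/// Stream a JSONL source line by line and count the non-empty records.
/// Only one line is held in memory at a time.
pub fn count_records<R: BufRead>(input: R) -> io::Result<usize> {
    let mut count = 0;
    for line in input.lines() {
        if !line?.trim().is_empty() {
            count += 1;
        }
    }
    Ok(count)
}
```

The same functions work on a `Vec<u8>` in tests and on a `File` opened in append mode or wrapped in a `BufReader` in practice.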