Candidate Name Components

Election data sources represent candidate names as a single string. The formats are incompatible across sources — and sometimes within the same source across years. The pipeline decomposes every name into structured components at L1 and preserves all components through every subsequent layer.

Why decomposition instead of a single string

A single name field cannot support entity resolution. Consider matching these records:

| Source | Raw name string |
| --- | --- |
| MEDSL | SHANNON W BRAY |
| NC SBE | Shannon W. Bray |
| FEC | BRAY, SHANNON W |

String equality fails on all three pairs. Lowercasing and stripping punctuation reconciles MEDSL and NC SBE, but FEC’s last-first ordering still breaks the match. Decomposing into {first: Shannon, middle: W, last: Bray} makes all three identical after normalization.
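The limit of string-level cleanup can be checked directly. The sketch below is illustrative only: naive_normalize is a hypothetical helper, not part of the pipeline, and it simply lowercases, drops punctuation, and collapses whitespace.

```rust
/// Hypothetical helper: lowercase, drop punctuation, collapse whitespace.
fn naive_normalize(raw: &str) -> String {
    raw.chars()
        // Keep letters, digits, and whitespace; drop commas, periods, quotes.
        .filter(|c| c.is_alphanumeric() || c.is_whitespace())
        .collect::<String>()
        .to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}
```

Under this normalization the MEDSL and NC SBE strings become equal, while the FEC string stays in last-first order and still fails to match.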

The harder case is nicknames:

| Source | Raw name string | What a human sees |
| --- | --- | --- |
| MEDSL | MICHAEL "STEVE" HUBER | First name Michael, goes by Steve |
| NC SBE | Michael (Steve) Huber | Same person |
| OpenElections | Steve Huber | Same person, nickname only |

Without decomposition, matching Steve Huber to MICHAEL "STEVE" HUBER requires the system to know that Steve is a nickname present in one variant but used as the primary name in another. The nickname and canonical_first fields make this explicit.

Component fields

Every candidate name in the pipeline is represented as a struct with seven fields:

| Field | Type | Description | Populated at |
| --- | --- | --- | --- |
| raw | String | Original name string exactly as it appeared in the source. Never modified. | L1 |
| first | Option<String> | Parsed first name | L1 |
| middle | Option<String> | Parsed middle name or initial | L1 |
| last | Option<String> | Parsed last name | L1 |
| suffix | Option<String> | Generational suffix: Jr, Sr, II, III, IV | L1 |
| nickname | Option<String> | Detected nickname, extracted from quotes or parentheses | L1 |
| canonical_first | Option<String> | Nickname-resolved first name. If first has a known nickname mapping, this holds the canonical form. | L1 |

All fields are available at every layer (L1 through L4). Later layers may refine values but never discard earlier ones.
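As a rough illustration, the seven fields could be declared as follows. The field names and types come from the table; the struct name CandidateName, the derives, and the unparsed constructor are assumptions made for this sketch, not the pipeline's actual definitions.

```rust
/// Sketch of the seven-field name record described in the table.
/// The struct name and derives are assumptions, not taken from the pipeline.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CandidateName {
    /// Original string exactly as it appeared in the source; never modified.
    pub raw: String,
    pub first: Option<String>,
    pub middle: Option<String>,
    pub last: Option<String>,
    /// Generational suffix: Jr, Sr, II, III, IV.
    pub suffix: Option<String>,
    /// Nickname extracted from quotes or parentheses.
    pub nickname: Option<String>,
    /// Nickname-resolved first name.
    pub canonical_first: Option<String>,
}

impl CandidateName {
    /// Hypothetical constructor for values that are not decomposed
    /// (for example a sentinel): `raw` is kept, every component is `None`.
    pub fn unparsed(raw: &str) -> Self {
        CandidateName {
            raw: raw.to_string(),
            first: None,
            middle: None,
            last: None,
            suffix: None,
            nickname: None,
            canonical_first: None,
        }
    }
}
```

Keeping raw as a plain String while every parsed component is an Option mirrors the table: the source text is always present, and any component may be absent.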

Parsing rules by source

MEDSL

Names are ALL CAPS, no periods after initials, nicknames in double quotes, suffixes without commas.

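| Raw | first | middle | last | suffix | nickname | canonical_first |
| --- | --- | --- | --- | --- | --- | --- |
| SHANNON W BRAY | Shannon | W | Bray | | | Shannon |
| MICHAEL "STEVE" HUBER | Michael | | Huber | | Steve | Michael |
| ROBERT VAN FLETCHER JR | Robert | Van | Fletcher | Jr | | Robert |
| LM "MICKEY" SIMMONS | L | M | Simmons | | Mickey | L |
| VICTORIA P PORTER | Victoria | P | Porter | | | Victoria |
| WRITEIN | | | | | | |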

WRITEIN is a sentinel value, not a person name. It is flagged at L1 and excluded from name decomposition.
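The MEDSL rules above can be sketched as a small tokenizing parser. This is a minimal sketch under stated assumptions, not the pipeline's parser: Parts, title_case, normalize_suffix, and parse_medsl are hypothetical names, single-token names are assigned to last, nicknames are assumed to be a single quoted token, fused initials such as LM are not split, and canonical_first is left to the nickname dictionary step.

```rust
/// Parsed components; canonical_first is resolved in a separate step.
#[derive(Debug, Default, PartialEq)]
struct Parts {
    first: Option<String>,
    middle: Option<String>,
    last: Option<String>,
    suffix: Option<String>,
    nickname: Option<String>,
}

/// "SHANNON" -> "Shannon"; hyphenated pieces are cased separately.
fn title_case(token: &str) -> String {
    token
        .split('-')
        .map(|piece| {
            let mut chars = piece.chars();
            match chars.next() {
                Some(c) => c.to_uppercase().collect::<String>() + &chars.as_str().to_lowercase(),
                None => String::new(),
            }
        })
        .collect::<Vec<String>>()
        .join("-")
}

/// Map an all-caps MEDSL suffix token to the form used in the tables.
fn normalize_suffix(token: &str) -> Option<&'static str> {
    match token {
        "JR" => Some("Jr"),
        "SR" => Some("Sr"),
        "II" => Some("II"),
        "III" => Some("III"),
        "IV" => Some("IV"),
        _ => None,
    }
}

/// Returns None for the WRITEIN sentinel, which is excluded from decomposition.
fn parse_medsl(raw: &str) -> Option<Parts> {
    if raw.trim() == "WRITEIN" {
        return None;
    }
    let mut parts = Parts::default();
    let mut tokens: Vec<&str> = Vec::new();
    for tok in raw.split_whitespace() {
        // A token wrapped in double quotes is the nickname.
        if tok.len() >= 2 && tok.starts_with('"') && tok.ends_with('"') {
            parts.nickname = Some(title_case(tok.trim_matches('"')));
        } else {
            tokens.push(tok);
        }
    }
    // A trailing suffix (no comma in MEDSL) is removed before splitting.
    let suffix = if tokens.len() > 1 {
        tokens.last().and_then(|t| normalize_suffix(t))
    } else {
        None
    };
    if let Some(sfx) = suffix {
        parts.suffix = Some(sfx.to_string());
        tokens.pop();
    }
    match tokens.as_slice() {
        [] => {}
        // Assumption: a lone token is treated as a last name.
        [only] => parts.last = Some(title_case(only)),
        [first, middle @ .., last] => {
            parts.first = Some(title_case(first));
            parts.last = Some(title_case(last));
            if !middle.is_empty() {
                let joined: Vec<String> = middle.iter().map(|t| title_case(t)).collect();
                parts.middle = Some(joined.join(" "));
            }
        }
    }
    Some(parts)
}
```

Removing the nickname and suffix before splitting the remaining tokens keeps the first/middle/last assignment purely positional, which is what lets ROBERT VAN FLETCHER JR resolve to middle Van and last Fletcher.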

NC SBE

Names are Title Case, periods after initials, nicknames in parentheses, commas before suffixes.

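| Raw | first | middle | last | suffix | nickname | canonical_first |
| --- | --- | --- | --- | --- | --- | --- |
| Shannon W. Bray | Shannon | W | Bray | | | Shannon |
| Michael (Steve) Huber | Michael | | Huber | | Steve | Michael |
| Robert Van Fletcher, Jr. | Robert | Van | Fletcher | Jr | | Robert |
| Patricia (Pat) Cotham | Patricia | | Cotham | | Pat | Patricia |
| William Irvin. Enzor III | William | Irvin | Enzor | III | | William |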

The period after “Irvin.” in the last example is a data entry artifact. The parser strips trailing periods from middle names.
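The NC SBE conventions mostly affect how individual tokens are cleaned. The sketch below shows one possible token classifier; the Token enum and both function names are hypothetical, the classifier is position-independent (a real parser would presumably only accept a suffix at the end), and aggregate rows such as Write-In (Miscellaneous) are assumed to be filtered out before this step.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Word(String),
    Nickname(String),
    Suffix(String),
    /// The "(Write-In)" tag that NC SBE appends to named write-ins.
    WriteInMarker,
}

/// Classify one whitespace-separated token of an NC SBE name.
fn classify(tok: &str) -> Token {
    // Nicknames arrive wrapped in parentheses: "(Steve)".
    if let Some(inner) = tok.strip_prefix('(').and_then(|t| t.strip_suffix(')')) {
        if inner.eq_ignore_ascii_case("write-in") {
            return Token::WriteInMarker;
        }
        return Token::Nickname(inner.to_string());
    }
    // Drop the comma that precedes a suffix and any trailing period,
    // including stray ones such as "Irvin.".
    let cleaned = tok.trim_end_matches(|c: char| c == '.' || c == ',');
    match cleaned {
        "Jr" | "Sr" | "II" | "III" | "IV" => Token::Suffix(cleaned.to_string()),
        _ => Token::Word(cleaned.to_string()),
    }
}

fn tokenize_ncsbe(raw: &str) -> Vec<Token> {
    raw.split_whitespace().map(classify).collect()
}
```

Once tokens are classified this way, the remaining Word tokens can be assigned to first, middle, and last by position, the same way as for MEDSL.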

FEC

Names are LAST, FIRST MIDDLE format, all caps.

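| Raw | first | middle | last | suffix | nickname | canonical_first |
| --- | --- | --- | --- | --- | --- | --- |
| BRAY, SHANNON W | Shannon | W | Bray | | | Shannon |
| BIDEN, JOSEPH R JR | Joseph | R | Biden | Jr | | Joseph |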
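For FEC, the comma is the only reliable boundary, so a parser can split on it first. The following is a sketch under assumptions: FecSplit and split_fec are hypothetical names, values are left in upper case (case normalization is assumed to happen in a later step), and a suffix is only recognized as the final token after the comma.

```rust
const SUFFIXES: [&str; 5] = ["JR", "SR", "II", "III", "IV"];

/// Components of a "LAST, FIRST MIDDLE [SUFFIX]" string, still upper case.
#[derive(Debug, PartialEq)]
struct FecSplit {
    last: String,
    first: String,
    middle: Option<String>,
    suffix: Option<String>,
}

/// Returns None when the string has no comma or is missing either side of it.
fn split_fec(raw: &str) -> Option<FecSplit> {
    // Everything before the first comma is the last name.
    let (last, given) = raw.split_once(',')?;
    let last = last.trim();
    let mut tokens: Vec<&str> = given.split_whitespace().collect();
    if last.is_empty() || tokens.is_empty() {
        return None;
    }
    // A trailing generational suffix is peeled off before first/middle.
    let has_suffix = tokens.len() > 1 && tokens.last().map_or(false, |t| SUFFIXES.contains(t));
    let suffix = if has_suffix { tokens.pop() } else { None };
    // At least one token remains here, so remove(0) cannot panic.
    let first = tokens.remove(0);
    let middle = if tokens.is_empty() { None } else { Some(tokens.join(" ")) };
    Some(FecSplit {
        last: last.to_string(),
        first: first.to_string(),
        middle,
        suffix: suffix.map(str::to_string),
    })
}
```

Splitting on the comma before tokenizing preserves the last-name boundary, so a multi-word last name is not confused with a middle name.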

The canonical_first field

canonical_first resolves known nicknames to their formal equivalents using the nickname dictionary. This enables matching when one source uses a nickname and another uses the legal name.

| first | nickname | canonical_first | Reasoning |
| --- | --- | --- | --- |
| Michael | Steve | Michael | First name is already formal |
| Charlie | | Charles | Charlie is a known nickname for Charles |
| Bob | | Robert | Bob is a known nickname for Robert |
| Patricia | Pat | Patricia | First name is already formal |
| Bill | | William | Bill is a known nickname for William |
| Jim | | James | Jim is a known nickname for James |

When first is already a formal name, canonical_first equals first. When first is itself a nickname (as when OpenElections reports Charlie Crist without the legal name Charles), canonical_first resolves to the formal form.

The nickname dictionary contains approximately 1,200 mappings. It is deterministic — no ML, no API calls. Ambiguous cases (e.g., “Alex” could map to “Alexander” or “Alexandra”) are resolved by leaving canonical_first equal to first and deferring to embedding-based matching at L2.
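The lookup rule, including the ambiguity fallback, might look like the sketch below. The five entries are taken from the examples on this page and stand in for the full dictionary; the function names and the choice to store several formal names per nickname are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Tiny stand-in for the real dictionary. A nickname may map to more than
/// one formal name, which is how ambiguity is represented in this sketch.
fn sample_dictionary() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("charlie", vec!["Charles"]),
        ("bob", vec!["Robert"]),
        ("bill", vec!["William"]),
        ("jim", vec!["James"]),
        ("alex", vec!["Alexander", "Alexandra"]),
    ])
}

/// Resolve `first` to its formal form when exactly one mapping exists.
/// Unknown or ambiguous names are returned unchanged.
fn canonical_first(first: &str, dict: &HashMap<&str, Vec<&str>>) -> String {
    match dict.get(first.to_lowercase().as_str()) {
        Some(formals) if formals.len() == 1 => formals[0].to_string(),
        _ => first.to_string(),
    }
}
```

With this sample data, Charlie resolves to Charles, Michael is returned unchanged because it has no entry, and Alex is returned unchanged because it has two candidate formal names.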

How L2 uses name components

L2 constructs a composite string for embedding from the decomposed components:

{canonical_first} {middle} {last} {suffix}

This means Michael "Steve" Huber and Steve Huber both embed with their decomposed components rather than raw strings. The embedding model sees structured, normalized text rather than source-specific formatting.

The raw field is never used for embedding. It is preserved for provenance and debugging only.
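A minimal sketch of building that composite string follows. The function name embedding_text is hypothetical, and skipping absent components (rather than leaving empty slots or doubled spaces) is an assumption about how missing fields are handled.

```rust
/// Join the present components in template order:
/// {canonical_first} {middle} {last} {suffix}
fn embedding_text(
    canonical_first: Option<&str>,
    middle: Option<&str>,
    last: Option<&str>,
    suffix: Option<&str>,
) -> String {
    [canonical_first, middle, last, suffix]
        .iter()
        .flatten() // drop components that are None
        .copied()
        .collect::<Vec<&str>>()
        .join(" ")
}
```

For example, embedding_text(Some("Michael"), None, Some("Huber"), None) yields "Michael Huber", regardless of whether the source wrote the name in MEDSL or NC SBE style.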

Special cases

Write-in candidates. MEDSL aggregates write-ins into WRITEIN. NC SBE reports named write-ins (e.g., Ronnie Strickland (Write-In)) separately from Write-In (Miscellaneous). Named write-ins are decomposed normally. The WRITEIN sentinel produces a record with all name fields set to None.

Ballot measure choices. The values For, Against, Yes, No are not person names. They are handled by the BallotMeasure contest kind and bypass name decomposition entirely. See Contest Kinds.

Hyphenated last names. Treated as a single last value: Smith-Jones → last: Smith-Jones. No attempt is made to split on hyphens.

Multiple middle names. All middle names go into the single middle field. With one middle name, Joseph Robinette Biden → middle: Robinette. With two (rare), they are space-separated in the middle field.

No first name. Some records carry no first name: truncated records may report only a last name, and the WRITEIN sentinel has no name components at all. In both cases first is None and canonical_first is also None.
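The write-in rule above distinguishes aggregate rows from named write-ins, which can be sketched as a small classifier. The WriteIn enum and classify_write_in are hypothetical names, and the literal strings checked are only the ones quoted in this section; other sources may use different markers.

```rust
#[derive(Debug, PartialEq)]
enum WriteIn {
    /// Aggregated write-ins: no name components are produced.
    Aggregate,
    /// A named write-in; the inner string is decomposed normally.
    Named(String),
    /// Not a write-in row.
    No,
}

fn classify_write_in(raw: &str) -> WriteIn {
    const MARKER: &str = "(write-in)";
    let trimmed = raw.trim();
    if trimmed == "WRITEIN" || trimmed.eq_ignore_ascii_case("Write-In (Miscellaneous)") {
        return WriteIn::Aggregate;
    }
    // ASCII lowercasing preserves byte offsets, so the marker's position in
    // `lower` is also a valid slice boundary in `trimmed`.
    let lower = trimmed.to_ascii_lowercase();
    if lower.ends_with(MARKER) {
        let name = trimmed[..trimmed.len() - MARKER.len()].trim_end();
        if !name.is_empty() {
            return WriteIn::Named(name.to_string());
        }
    }
    WriteIn::No
}
```

A caller would pass the Named payload to the regular name parser and emit an all-None record for Aggregate.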