Candidate Name Components
Election data sources represent candidate names as a single string. The formats are incompatible across sources — and sometimes within the same source across years. The pipeline decomposes every name into structured components at L1 and preserves all components through every subsequent layer.
Why decomposition instead of a single string
A single name field cannot support entity resolution. Consider matching these records:
| Source | Raw name string |
|---|---|
| MEDSL | SHANNON W BRAY |
| NC SBE | Shannon W. Bray |
| FEC | BRAY, SHANNON W |
String equality fails on all three pairs. Lowercasing and stripping punctuation gets MEDSL and NC SBE closer, but FEC’s last-first ordering still breaks. Decomposing into {first: Shannon, middle: W, last: Bray} makes all three identical after normalization.
The harder case is nicknames:
| Source | Raw name string | What a human sees |
|---|---|---|
| MEDSL | MICHAEL "STEVE" HUBER | First name Michael, goes by Steve |
| NC SBE | Michael (Steve) Huber | Same person |
| OpenElections | Steve Huber | Same person, nickname only |
Without decomposition, matching Steve Huber to MICHAEL "STEVE" HUBER requires the system to know that Steve is a nickname present in one variant but used as the primary name in another. The nickname and canonical_first fields make this explicit.
Component fields
Every candidate name in the pipeline is represented as a struct with seven fields:
| Field | Type | Description | Populated at |
|---|---|---|---|
raw | String | Original name string exactly as it appeared in the source. Never modified. | L1 |
first | Option<String> | Parsed first name | L1 |
middle | Option<String> | Parsed middle name or initial | L1 |
last | Option<String> | Parsed last name | L1 |
suffix | Option<String> | Generational suffix: Jr, Sr, II, III, IV | L1 |
nickname | Option<String> | Detected nickname, extracted from quotes or parentheses | L1 |
canonical_first | Option<String> | Nickname-resolved first name. If first has a known nickname mapping, this holds the canonical form. | L1 |
All fields are available at every layer (L1 through L4). Later layers may refine values but never discard earlier ones.
Parsing rules by source
MEDSL
Names are ALL CAPS, no periods after initials, nicknames in double quotes, suffixes without commas.
| Raw | first | middle | last | suffix | nickname | canonical_first |
|---|---|---|---|---|---|---|
SHANNON W BRAY | Shannon | W | Bray | — | — | Shannon |
MICHAEL "STEVE" HUBER | Michael | — | Huber | — | Steve | Michael |
ROBERT VAN FLETCHER JR | Robert | Van | Fletcher | Jr | — | Robert |
LM "MICKEY" SIMMONS | L | M | Simmons | — | Mickey | L |
VICTORIA P PORTER | Victoria | P | Porter | — | — | Victoria |
WRITEIN | — | — | — | — | — | — |
WRITEIN is a sentinel value, not a person name. It is flagged at L1 and excluded from name decomposition.
NC SBE
Names are Title Case, periods after initials, nicknames in parentheses, commas before suffixes.
| Raw | first | middle | last | suffix | nickname | canonical_first |
|---|---|---|---|---|---|---|
Shannon W. Bray | Shannon | W | Bray | — | — | Shannon |
Michael (Steve) Huber | Michael | — | Huber | — | Steve | Michael |
Robert Van Fletcher, Jr. | Robert | Van | Fletcher | Jr | — | Robert |
Patricia (Pat) Cotham | Patricia | — | Cotham | — | Pat | Patricia |
William Irvin. Enzor III | William | Irvin | Enzor | III | — | William |
The period after “Irvin.” in the last example is a data entry artifact. The parser strips trailing periods from middle names.
FEC
Names are LAST, FIRST MIDDLE format, all caps.
| Raw | first | middle | last | suffix | nickname | canonical_first |
|---|---|---|---|---|---|---|
BRAY, SHANNON W | Shannon | W | Bray | — | — | Shannon |
BIDEN, JOSEPH R JR | Joseph | R | Biden | Jr | — | Joseph |
The canonical_first field
canonical_first resolves known nicknames to their formal equivalents using the nickname dictionary. This enables matching when one source uses a nickname and another uses the legal name.
| first | nickname | canonical_first | Reasoning |
|---|---|---|---|
Michael | Steve | Michael | First name is already formal |
Charlie | — | Charles | Charlie is a known nickname for Charles |
Bob | — | Robert | Bob is a known nickname for Robert |
Patricia | Pat | Patricia | First name is already formal |
Bill | — | William | Bill is a known nickname for William |
Jim | — | James | Jim is a known nickname for James |
When first is already a formal name, canonical_first equals first. When first is itself a nickname (as when OpenElections reports Charlie Crist without the legal name Charles), canonical_first resolves to the formal form.
The nickname dictionary contains approximately 1,200 mappings. It is deterministic — no ML, no API calls. Ambiguous cases (e.g., “Alex” could map to “Alexander” or “Alexandra”) are resolved by leaving canonical_first equal to first and deferring to embedding-based matching at L2.
How L2 uses name components
L2 constructs a composite string for embedding from the decomposed components:
{canonical_first} {middle} {last} {suffix}
This means Michael "Steve" Huber and Steve Huber both embed with their decomposed components rather than raw strings. The embedding model sees structured, normalized text rather than source-specific formatting.
The raw field is never used for embedding. It is preserved for provenance and debugging only.
Special cases
Write-in candidates. MEDSL aggregates write-ins into WRITEIN. NC SBE reports named write-ins (e.g., Ronnie Strickland (Write-In)) separately from Write-In (Miscellaneous). Named write-ins are decomposed normally. The WRITEIN sentinel produces a record with all name fields set to None.
Ballot measure choices. The values For, Against, Yes, No are not person names. They are handled by the BallotMeasure contest kind and bypass name decomposition entirely. See Contest Kinds.
Hyphenated last names. Treated as a single last value: Smith-Jones → last: Smith-Jones. No attempt is made to split on hyphens.
Multiple middle names. Concatenated into the middle field: Joseph Robinette Biden → middle: Robinette. If two middle names are present (rare), they are space-separated in the middle field.
No first name. Some sources report only a last name (e.g., WRITEIN or truncated records). first is None. canonical_first is also None.