Nicknames and Middle Initials
Two distinct problems share a root cause: the candidate’s legal name differs from the name on the ballot or in the source file. Nicknames substitute one first name for another. Middle initials appear in some sources and not others. Both must be handled at L1 to preserve signal for L2 and L3.
Nicknames
A nickname replaces the candidate’s legal first name with a familiar variant. The embedding model has no reliable way to recover the connection — it encodes character-level and token-level similarity, not social knowledge about naming conventions.
Real test results from our prototype, using text-embedding-3-large (3,072 dimensions):
| Source A | Source B | Nickname → Legal | Cosine | LLM Decision | LLM Confidence |
|---|---|---|---|---|---|
| Charlie Crist | CRIST, CHARLES JOSEPH | Charlie → Charles | 0.451 | match | 0.95 |
| Nicole Fried | FRIED, NIKKI | Nikki → Nicole | 0.642 | match | 0.92 |
| Ron DeSantis | DESANTIS, RON | Ron → Ronald | 0.729 | match | 0.98 |
The Crist result is the critical case. At 0.451, the embedding score falls below any plausible auto-accept threshold and below many reject thresholds. Without nickname resolution, this pair would either be missed entirely or routed to the LLM on every encounter.
The fix operates at L1. The nickname dictionary maps Charlie → Charles, Nikki → Nicole, Ron → Ronald, and ~100 other mappings. When the L1 parser decomposes a name, it checks the first name against the dictionary and populates canonical_first:
{
  "raw": "Charlie Crist",
  "first": "Charlie",
  "middle": null,
  "last": "Crist",
  "suffix": null,
  "canonical_first": "Charles"
}
Both first and canonical_first are preserved. The original is kept for display and provenance. The canonical form is used in the L2 composite string for embedding and in the L3 exact-match step. After dictionary application, the L3 exact matcher sees (canonical_first="Charles", last="Crist", suffix=null) on both sides — an exact match with no embedding or LLM call required.
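The L1 step described above can be sketched roughly as follows. This is a minimal illustration, not the prototype's parser: the function name `parse_name`, the suffix list, the handling of only two orderings ("First Last" and "LAST, FIRST MIDDLE"), and the title-casing of first names that are not in the dictionary are all assumptions. Only the three dictionary entries named in the text are included.

```python
# Illustrative subset; the text mentions roughly 100 further mappings.
NICKNAMES = {
    "charlie": "Charles",
    "nikki": "Nicole",
    "ron": "Ronald",
}

# Assumed suffix list, for illustration only.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def _is_suffix(token: str) -> bool:
    return token.rstrip(".").lower() in SUFFIXES


def parse_name(raw: str) -> dict:
    """Decompose a raw name and populate canonical_first from the dictionary."""
    text = raw.strip()
    if "," in text:
        # "LAST, FIRST MIDDLE" ordering
        last, rest = (part.strip() for part in text.split(",", 1))
        tokens = rest.split()
    else:
        # "First Middle Last" ordering; the last name is peeled off below
        tokens = text.split()
        last = None

    suffix = tokens.pop() if tokens and _is_suffix(tokens[-1]) else None
    if last is None:
        last = tokens.pop() if tokens else ""

    first = tokens[0] if tokens else ""
    middle = " ".join(tokens[1:]) or None  # kept exactly as parsed

    # Known nickname -> legal form; otherwise fall back to a title-cased first
    # name (assumption) so "CHARLES" and "Charles" agree downstream.
    canonical_first = NICKNAMES.get(first.lower(), first.title())

    return {
        "raw": raw,
        "first": first,
        "middle": middle,
        "last": last,
        "suffix": suffix,
        "canonical_first": canonical_first,
    }
```

Under these assumptions, `parse_name("Charlie Crist")` and `parse_name("CRIST, CHARLES JOSEPH")` both yield `canonical_first` equal to `"Charles"`, while `first` and `middle` keep the original tokens for display and provenance.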
Why the embedding model fails on nicknames
Charlie and Charles share a prefix, but the embedding model must also reconcile Charlie Crist with CRIST, CHARLES JOSEPH: different casing, different ordering, and a middle name that appears in only one source. The model embeds the full composite string, not individual tokens, and the combined divergence pushes the cosine score down to 0.451.
The DeSantis pair scores higher (0.729) because the surface forms are more similar: both sources use the short form Ron and neither adds a middle name, so only casing and ordering differ. But 0.729 is still in the ambiguous zone and requires an LLM call to confirm.
The nickname dictionary eliminates these LLM calls for known mappings. At scale, this matters: if 5% of candidates use nicknames and each one requires an LLM call, a corpus of a few hundred thousand records incurs tens of thousands of avoidable API round-trips.
Middle Initials
Middle initials are a different problem. They do not substitute one name for another — they add or remove a disambiguation signal.
The key case: David S. Marshall (Maine) and David A. Marshall (Florida) are different people. Without middle initials, both reduce to David Marshall. With middle initials preserved, L2 generates different embedding vectors.
We measured the effect directly:
| Variant | Maine composite | Florida composite | Cosine |
|---|---|---|---|
| No middle | David Marshall \| ME | David Marshall \| FL | 0.7025 |
| With middle | David S Marshall \| ME | David A Marshall \| FL | 0.6448 |
The middle initial drops the cosine score by 0.058 — enough to shift the pair further from the accept threshold and closer to correct rejection. The principle: middle initials are signal, not noise.
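For reference, the cosine figures quoted here are the standard cosine similarity between two embedding vectors. A minimal sketch in plain Python follows; it does not call any embedding API, and the function name is an assumption rather than the prototype's code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

In the pipeline, `a` and `b` would be the 3,072-dimensional vectors for the two composite strings; the comparison of interest is how far the score moves when the middle initial is added to both composites.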
Further test results from our prototype, covering added middle initials and one pure format difference:
| Source A | Source B | Cosine | LLM Decision | Key Signal |
|---|---|---|---|---|
| Ashley Moody | Ashley B. Moody | 0.930 | match | Same person, middle added |
| Val Demings | VAL DEMINGS | 0.828 | match | Same person, format difference |
| Dale Holness | DALE V.C. HOLNESS | 0.896 | match | Same person, middle initials added |
Ashley Moody is the same person in both sources; the B. appears in one but not the other. At 0.930, the embedding score sits just below the 0.95 auto-accept threshold, so the same-state context and a Jaro-Winkler (JW) score of 1.0 on the last name are what push the pair through to auto-accept.
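The accept rule in the paragraph above could be sketched as follows. The 0.95 threshold and the JW check on the last name come from the text; the width of the "just below" band (`NEAR_MARGIN`), the requirement that same-state context holds in both branches, and the function name are assumptions. The JW score is passed in rather than computed here.

```python
AUTO_ACCEPT = 0.95   # auto-accept threshold stated in the text
NEAR_MARGIN = 0.03   # assumed width of the "just below" band; tune on real data


def auto_accept(cosine: float, same_state: bool, jw_last: float) -> bool:
    """Return True if the pair can be accepted without an LLM call."""
    if not same_state:
        # Assumption: without same-state context, never auto-accept.
        return False
    if cosine >= AUTO_ACCEPT:
        return True
    # Just below the threshold: require an exact last-name match (JW == 1.0).
    near_threshold = cosine >= AUTO_ACCEPT - NEAR_MARGIN
    return near_threshold and jw_last >= 1.0
```

With this margin, the Moody pair (0.930, same state, JW 1.0) is accepted, while the Holness pair (0.896) falls outside the band and would still go to the LLM for confirmation.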
How Both Feed Into L2
The L2 composite string for a candidate includes both canonical_first and middle:
{canonical_first} {middle} {last} {suffix} | {party} | {office} | {state} | {county}
For Charlie Crist, this becomes:
Charles Crist | DEM | Governor | FL | statewide
For CRIST, CHARLES JOSEPH, this becomes:
Charles Joseph Crist | DEM | Governor | FL | statewide
The canonical first names now match. The remaining divergence — Joseph as a middle name in one source — is small enough that the embedding score rises well above the ambiguous zone. The nickname dictionary at L1 did the heavy lifting; L2 and L3 finish the job.
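A minimal sketch of how the composite string could be assembled from the template above. The function name and signature are assumptions; the sketch also assumes that casing has already been normalized upstream (for example, JOSEPH to Joseph) and simply skips empty name parts.

```python
from typing import Optional

FIELD_SEP = " | "


def build_composite(
    canonical_first: str,
    middle: Optional[str],
    last: str,
    suffix: Optional[str],
    party: str,
    office: str,
    state: str,
    county: str,
) -> str:
    """Join the name parts with spaces, then the fields with ' | '."""
    # Skip missing middle/suffix so no double spaces appear in the name.
    name = " ".join(part for part in (canonical_first, middle, last, suffix) if part)
    return FIELD_SEP.join([name, party, office, state, county])


crist_a = build_composite("Charles", None, "Crist", None, "DEM", "Governor", "FL", "statewide")
crist_b = build_composite("Charles", "Joseph", "Crist", None, "DEM", "Governor", "FL", "statewide")
```

Here `crist_a` and `crist_b` reproduce the two composite strings shown above, differing only in the middle name.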
The Combined Rule
- At L1, apply the nickname dictionary to populate canonical_first.
- At L1, preserve middle exactly as parsed: do not strip it and do not normalize it.
- At L2, include both canonical_first and middle in the composite string.
- At L3 exact match, match on (canonical_first, last, suffix). The middle name is not required for an exact match, but it is used for disambiguation when multiple candidates share the same canonical first and last name.
- At L3 LLM confirmation, provide both the raw and canonical names so the model can reason about nickname relationships and middle-initial differences.
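The L3 exact-match rule from the list above might look like the following sketch. The record shape mirrors the L1 output; the function names, the case-folded comparison, the use of only the first letter of the middle name, and the choice to return `None` (defer to later steps) when the tie cannot be broken are all assumptions.

```python
from typing import List, Optional


def exact_key(rec: dict) -> tuple:
    """(canonical_first, last, suffix), case-folded so CRIST equals Crist."""
    return tuple((rec.get(k) or "").casefold() for k in ("canonical_first", "last", "suffix"))


def middle_initial(rec: dict) -> Optional[str]:
    middle = rec.get("middle")
    return middle[0].casefold() if middle else None


def exact_match(query: dict, candidates: List[dict]) -> Optional[dict]:
    """Return the unique candidate for the key, using middle only to break ties."""
    hits = [c for c in candidates if exact_key(c) == exact_key(query)]
    if len(hits) <= 1:
        return hits[0] if hits else None
    # Several candidates share the key: disambiguate on the middle initial.
    q_mid = middle_initial(query)
    if q_mid is None:
        return None  # still ambiguous; defer to embedding / LLM steps
    narrowed = [c for c in hits if middle_initial(c) == q_mid]
    return narrowed[0] if len(narrowed) == 1 else None
```

Under these assumptions, a Charles Crist record matches CRIST, CHARLES JOSEPH directly, while a David Marshall query with no middle initial is left unresolved when both the Maine and Florida records are present.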
The principle behind both: clean without collapsing. Normalize what you can (nicknames to canonical forms), preserve what you must (middle initials as disambiguation signal), and let downstream layers use the full context.