Full Nickname Dictionary

The pipeline applies nickname normalization at L1 to improve entity resolution at L3. When a candidate’s first name matches a known nickname, the canonical form is stored in canonical_first and the original is preserved in first.

This dictionary is applied deterministically. Every name is checked against the table below. No context or heuristics are used: if the input matches an entry in the nickname column, the corresponding canonical value is applied. The mapping is therefore fast and reproducible but occasionally wrong (see The Ted Problem below).
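The lookup described above can be sketched as a plain dictionary access. This is a minimal illustration, not the pipeline's actual code: the function name `normalize_first`, the case folding before lookup, and the fallback to the lowercased input when no entry matches are all assumptions. Only the field names `first` and `canonical_first` come from this page.

```python
# Illustrative excerpt only; the full table is listed under Mappings.
NICKNAMES = {
    "bill": "william",
    "bob": "robert",
    "ted": "edward",
}


def normalize_first(record: dict) -> dict:
    """Return a copy of `record` with `canonical_first` filled in.

    Assumptions (not confirmed by the source): input is lowercased and
    stripped before lookup, and unmatched names fall back to that
    lowercased form.
    """
    key = record["first"].strip().lower()
    out = dict(record)  # leave the caller's record untouched
    out["canonical_first"] = NICKNAMES.get(key, key)
    return out


ted = normalize_first({"first": "Ted", "last": "Smith"})
# `first` still holds the filed name; `canonical_first` holds the lookup result.
```

Because the lookup takes no context, the same input always yields the same output, which is what makes the step reproducible.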

Mappings

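| Nickname | Canonical | Notes |
|----------|-----------|-------|
| al | albert | |
| alex | alexander | |
| andy | andrew | |
| barb | barbara | |
| ben | benjamin | |
| bernie | bernard | |
| bert | albert | Also Herbert; resolved to albert by frequency |
| beth | elizabeth | |
| bill | william | |
| billy | william | |
| bob | robert | |
| bobby | robert | |
| bonnie | bonita | |
| bud | william | Regional; less reliable |
| charlie | charles | |
| chris | christopher | Also Christine; gendered ambiguity |
| chuck | charles | |
| cindy | cynthia | |
| dan | daniel | |
| danny | daniel | |
| dave | david | |
| deb | deborah | |
| debbie | deborah | |
| dick | richard | |
| don | donald | |
| doug | douglas | |
| drew | andrew | |
| ed | edward | |
| eddie | edward | |
| frank | franklin | Also Francis; resolved to franklin by frequency |
| fred | frederick | |
| gene | eugene | |
| gerry | gerald | |
| hank | henry | |
| harry | harold | Also Henry (British tradition); resolved to harold |
| jack | john | |
| jake | jacob | |
| jan | janice | Also Janet; resolved to janice by frequency |
| jenny | jennifer | |
| jerry | gerald | Also Jerome; resolved to gerald by frequency |
| jim | james | |
| jimmy | james | |
| joe | joseph | |
| johnny | john | |
| jon | jonathan | Distinct from john |
| kate | katherine | Also Kathryn, Catherine |
| kathy | katherine | |
| ken | kenneth | |
| kenny | kenneth | |
| larry | lawrence | |
| liz | elizabeth | |
| maggie | margaret | |
| matt | matthew | |
| mike | michael | |
| mitch | mitchell | |
| nancy | ann | Historical mapping; low reliability |
| nick | nicholas | |
| nikki | nicole | |
| norm | norman | |
| pat | patrick | Also Patricia; gendered ambiguity |
| patti | patricia | |
| patty | patricia | |
| peggy | margaret | |
| pete | peter | |
| phil | philip | |
| ray | raymond | |
| rick | richard | |
| rob | robert | |
| ron | ronald | |
| sally | sarah | |
| sam | samuel | Also Samantha; gendered ambiguity |
| sandy | sandra | Also Alexander; gendered ambiguity |
| steve | steven | |
| sue | susan | |
| ted | edward | See The Ted Problem below |
| terry | terrence | Also Teresa; gendered ambiguity |
| tim | timothy | |
| tom | thomas | |
| tommy | thomas | |
| tony | anthony | |
| val | valerie | |
| vince | vincent | |
| walt | walter | |
| wes | wesley | |
| will | william | |
| woody | woodrow | |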

The Ted Problem

“Ted” is a nickname for both Edward (Ted Kennedy → Edward Kennedy) and Theodore (Ted Stevens → Theodore Stevens). The dictionary maps ted → edward because Edward is the more frequent canonical form in US election data. As a result, a candidate whose legal name is Theodore but who files as Ted will be canonicalized as Edward.

This is a known, accepted error. It affects canonical_first at L1 but does not prevent correct entity resolution at L3, because L3 matches on composite strings that include last name, jurisdiction, office, and year. Two candidates named “Ted Smith” in different counties will not be merged, regardless of whether canonical_first is edward or theodore.

The original filed name is always preserved in first. Any downstream consumer who needs the original can ignore canonical_first and use first directly.
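To make the composite-string argument concrete, here is a small sketch of how such a key might be built. The exact fields, their order, the `|` separator, and the use of exact string equality are assumptions for illustration. The source says only that L3 matches on composite strings that include last name, jurisdiction, office, and year. The county names and office are made-up sample data.

```python
def composite_key(rec: dict) -> str:
    """Join the identifying fields into one normalized string.

    Hypothetical field list; the real L3 composite may differ.
    """
    fields = ("canonical_first", "last", "jurisdiction", "office", "year")
    return "|".join(str(rec[f]).strip().lower() for f in fields)


# Two different people who both file as "Ted Smith" (made-up records).
ted_a = {"first": "Ted", "canonical_first": "edward", "last": "Smith",
         "jurisdiction": "Adams County", "office": "Sheriff", "year": 2020}
ted_b = {**ted_a, "jurisdiction": "Baker County"}

# The keys differ on jurisdiction, so the records are not merged,
# whichever canonical form "ted" was mapped to.
same_entity = composite_key(ted_a) == composite_key(ted_b)
```

Under these assumptions, swapping edward for theodore changes both keys in the same way, so the comparison still separates the two candidates.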

Gendered ambiguity

Several nicknames are short for both a typically male and a typically female name: Chris (Christopher/Christine), Pat (Patrick/Patricia), Sam (Samuel/Samantha), Sandy (Sandra/Alexander), Terry (Terrence/Teresa). The dictionary resolves each of these to the canonical form that is statistically more common in US election candidate data, so the mapping is not always correct for an individual candidate.

As with the Ted problem, the original name is preserved, and entity resolution at L3 uses additional fields (jurisdiction, office, party) to avoid incorrect merges caused by nickname ambiguity.

When the dictionary is not applied

The dictionary is skipped when:

  • The input first name is longer than 6 characters and matches no entry (assumed to already be a full name).
  • The candidate record has a canonical_first value set by the source (some sources provide both nickname and legal name).
  • The input is an initial only (e.g., “J.” is not expanded).
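The three skip conditions above could be checked with a guard like the following. This is a sketch under stated assumptions: the function name `should_skip` is hypothetical, an "initial" is taken to mean a single letter with an optional trailing period, and the length test is applied to the stripped input.

```python
def should_skip(record: dict, nicknames: dict) -> bool:
    """Return True when the nickname dictionary should not be applied."""
    key = record.get("first", "").strip().lower()

    # The source already supplied a canonical first name.
    if record.get("canonical_first"):
        return True

    # Initial only, e.g. "J" or "J." (assumed definition of an initial).
    if len(key.rstrip(".")) == 1:
        return True

    # Longer than 6 characters and not in the table: assumed to be a full name.
    if len(key) > 6 and key not in nicknames:
        return True

    return False


table = {"ted": "edward", "bill": "william"}
skip_initial = should_skip({"first": "J."}, table)
skip_full_name = should_skip({"first": "Theodore"}, table)
skip_nickname = should_skip({"first": "Ted"}, table)
```

Checking the source-provided canonical_first first means a value supplied by the source is never overwritten by the table.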