Composite String Templates

Embeddings are not generated from raw candidate names. They are generated from composite strings that combine name components with contextual fields — office, state, county, party. This context helps the embedding model distinguish people who share a name but hold different offices in different states. It also introduces a failure mode: context bleed, where shared context artificially inflates similarity between unrelated candidates.

The Three Templates

Each L2 record generates up to three composite strings, one per embedding type:

Type       Template                                                                              Purpose
Candidate  {canonical_first} {middle} {last} {suffix} | {party} | {office} | {state} | {county}  Entity resolution across sources and elections
Contest    {raw_name} | {office_level} | {state} {year}                                          Contest entity resolution across naming variants
Geography  {municipality}, {county} County, {state}                                              Geographic entity resolution for precinct/place matching

The pipe character (|) is a deliberate separator. It marks the fields on either side as distinct units rather than a continuous phrase, which gives the embedding model a consistent boundary between name and context. Without separators, “Timothy Lance DEM” could be read as a three-word name rather than a name followed by a party.
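The three templates can be sketched as small string builders. This is a minimal illustration, assuming each record is a plain dict keyed by the template field names; the helper names (`_join_fields`, `candidate_composite`, and so on) and the whitespace-collapsing step are assumptions made for this sketch, not the pipeline's actual code.

```python
def _join_fields(*fields: str) -> str:
    """Join fields with ' | ' and collapse runs of whitespace.

    Empty fields keep their separators, so a missing party yields '| |'.
    """
    return " ".join(" | ".join(f.strip() for f in fields).split())


def candidate_composite(rec: dict) -> str:
    # Name components are space-separated; missing ones simply drop out.
    name = " ".join(
        rec.get(k, "") for k in ("canonical_first", "middle", "last", "suffix")
    )
    return _join_fields(
        name,
        rec.get("party", ""),
        rec.get("office", ""),
        rec.get("state", ""),
        rec.get("county", ""),
    )


def contest_composite(rec: dict) -> str:
    state_year = f'{rec.get("state", "")} {rec.get("year", "")}'
    return _join_fields(
        rec.get("raw_name", ""), rec.get("office_level", ""), state_year
    )


def geography_composite(rec: dict) -> str:
    return f'{rec["municipality"]}, {rec["county"]} County, {rec["state"]}'


lance = {
    "canonical_first": "Timothy",
    "last": "Lance",
    "office": "BOARD OF EDUCATION DISTRICT 02",
    "state": "NC",
    "county": "Columbus",
}
print(candidate_composite(lance))
# Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus
```

Collapsing whitespace after joining is one way to keep the empty party slot as a bare `| |` without a doubled space; other normalizations would work as long as they are applied to every record.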

Real Composite Examples

Candidate                                         Composite String
Timothy Lance (NC, Columbus County school board)  Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus
Charlie Crist (FL, Governor, DEM)                 Charles Crist | DEM | Governor | FL | statewide
CRIST, CHARLES JOSEPH (FL, Governor, DEM)         Charles Joseph Crist | DEM | Governor | FL | statewide
David S Marshall (ME, State Legislature)          David S Marshall | | State Legislature | ME | statewide
David A Marshall (FL, County Commission)          David A Marshall | | County Commission | FL | Broward

Note that canonical_first is used, not first. Charlie Crist’s composite uses Charles (from the nickname dictionary), not Charlie. This means the MEDSL record (CRIST, CHARLES JOSEPH → canonical_first Charles) and the OpenElections record (Charlie Crist → canonical_first Charles) produce composites with matching first-name tokens. The remaining divergence — Joseph as a middle name — is small enough that the embedding score rises significantly compared to the raw-name embedding.
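A minimal sketch of the canonicalization step, assuming a lookup table keyed by lowercase nickname. The one-entry `NICKNAMES` table and the `canonical_first` function are hypothetical stand-ins; the real nickname dictionary is not shown here.

```python
# Hypothetical one-entry stand-in for the nickname dictionary.
NICKNAMES = {"charlie": "Charles"}


def canonical_first(first: str) -> str:
    """Map a raw first name to its canonical form.

    Known nicknames are replaced; anything else is only case-normalized.
    Note: str.capitalize() is a simplification and would mangle names
    with internal capitals.
    """
    key = first.strip().lower()
    return NICKNAMES.get(key, key.capitalize())


print(canonical_first("Charlie"))  # Charles (OpenElections-style input)
print(canonical_first("CHARLES"))  # Charles (MEDSL-style input)
```

Both source spellings land on the same first-name token before the composite string is built.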

Empty components produce empty slots. Timothy Lance has no middle name, no suffix, and no party in the NC SBE data. The composite retains the pipe separators with empty fields: Timothy Lance | | BOARD OF EDUCATION DISTRICT 02 | NC | Columbus. This keeps the template structure consistent across all records, which stabilizes tokenization.

Why Context Helps: The David Marshall Test

David S. Marshall ran for state legislature in Maine. David A. Marshall ran for county commission in Florida. They are different people. Without context, the embedding model sees two very similar strings.

We measured the effect of context on cosine similarity:

Composite A                                  Composite B                                  Cosine
David Marshall                               David Marshall                               1.000
David Marshall | ME                          David Marshall | FL                          0.7025
David S Marshall | ME                        David A Marshall | FL                        0.6448
David S Marshall | | State Legislature | ME  David A Marshall | | County Commission | FL  0.581

Each additional contextual field pushes the vectors further apart:

  • State alone drops similarity from 1.0 to 0.7025. The model encodes ME and FL as distinct tokens that pull the vectors in different directions.
  • Middle initial drops it further to 0.6448 — a 0.058 reduction. The single character S vs A produces measurably different vectors because it changes the token sequence before the separator.
  • Office context drops it to 0.581. “State Legislature” and “County Commission” are semantically distinct, adding another axis of divergence.
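For reference, cosine similarity itself is straightforward to compute once the vectors exist. The sketch below uses toy three-dimensional vectors, not real embeddings, purely to show why differing context components lower the score; the vector values are invented for illustration.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors, NOT real embeddings: axis 0 stands in for the shared name,
# axes 1 and 2 for two different context values (say, ME and FL).
bare_a, bare_b = [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]
ctx_a, ctx_b = [1.0, 0.6, 0.0], [1.0, 0.0, 0.6]

print(cosine(bare_a, bare_b))  # identical vectors
print(cosine(ctx_a, ctx_b))    # lower: the context components point apart
```

Real embedding vectors have thousands of dimensions and no such clean axes, so this only conveys the direction of the effect, not its size.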

At 0.581, this pair falls well within the ambiguous zone (0.35–0.95) and is routed to the LLM, which correctly rejects the match based on different states, different offices, and different middle initials. Without context, the pair scores 1.0 — an automatic merge of two different people.
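The routing described above can be sketched as a three-way threshold check. The 0.35 and 0.95 bounds come from the text; the function name, the return labels, and the treatment of scores exactly on a boundary are assumptions for illustration.

```python
AMBIGUOUS_LOW = 0.35   # below this: reject without calling the LLM
AMBIGUOUS_HIGH = 0.95  # above this: merge without calling the LLM


def route(score: float) -> str:
    """Route a pair by cosine score; boundary values go to the LLM (assumed)."""
    if score > AMBIGUOUS_HIGH:
        return "auto_merge"
    if score < AMBIGUOUS_LOW:
        return "reject"
    return "llm"


print(route(1.0))    # auto_merge: bare "David Marshall" vs "David Marshall"
print(route(0.581))  # llm: the full-context David Marshall pair
```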

The middle-initial contribution (0.058) may seem small, but it matters at the margins. For pairs where state and office are the same — a father and son both serving on the same county commission — the middle initial may be the only signal distinguishing them.

Why Context Hurts: The Context Bleed Problem

Context is not free. Shared context tokens contribute to vector similarity even when the names themselves are unrelated. This is context bleed.

Consider two candidates in the same NC school district block:

Candidate       Composite
Aaron Bridges   Aaron Bridges | | SCHOOL BOARD | NC | Columbus
Daniel Blanton  Daniel Blanton | | SCHOOL BOARD | NC | Columbus

These are completely different people, but their composites share every context field: the office (SCHOOL BOARD), the state (NC), the county (Columbus), and the pipe separators. The embedding model encodes this shared text into both vectors, producing a cosine similarity of approximately 0.55–0.65. That is well above what the names alone produce (~0.21) and squarely in the ambiguous zone.

In our prototype, all 30 wasted LLM calls were on pairs exactly like this: different people with different names whose shared context inflated their embedding scores into the ambiguous zone. The step 2.5 gate skips any pair whose Jaro-Winkler (JW) similarity on last names is below 0.50. It was added specifically to short-circuit these context-bleed false alarms before they reach the LLM.

Measuring the bleed

We tested context contribution by varying which fields are included:

Composite variant               Strings compared                           Cosine
Name only                       Aaron Bridges vs Daniel Blanton            ~0.21
Name + state                    Aaron Bridges | NC vs Daniel Blanton | NC  ~0.38
Name + state + office + county  Full composite                             ~0.60

Adding the shared state raises the cosine score by about 0.17 (0.21 → 0.38). Adding the shared office and county raises it by roughly another 0.22 (0.38 → 0.60), or about 0.11 per field. For same-name pairs (the cases entity resolution cares about), this boost is helpful: it confirms that two similar names in the same context are likely the same person. For different-name pairs, the same boost is harmful: it inflates scores past the reject threshold.

The step 2.5 gate resolves this asymmetry. If the names themselves are dissimilar (JW < 0.50 on last names), the context-inflated embedding score is irrelevant — the pair is skipped. If the names are similar (JW ≥ 0.50), the context inflation is welcome — it adds corroborating evidence that the similar names in the same context are the same person.
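The gate logic can be sketched in pure Python. This assumes JW denotes Jaro-Winkler similarity, uses the common variant with a prefix scale of 0.1 over at most four characters and no minimum-score condition on the prefix boost, and compares last names case-insensitively. The function names are hypothetical, and a production pipeline would more likely rely on an established string-similarity library.

```python
def jaro(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    a_seq = [a[i] for i in range(len(a)) if a_hit[i]]
    b_seq = [b[j] for j in range(len(b)) if b_hit[j]]
    t = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (matches / len(a) + matches / len(b) + (matches - t) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    a, b = a.lower(), b.lower()
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


def passes_name_gate(last_a: str, last_b: str, threshold: float = 0.50) -> bool:
    """Step 2.5 sketch: False means skip the pair before embedding scores matter."""
    return jaro_winkler(last_a, last_b) >= threshold


print(passes_name_gate("Bridges", "Blanton"))    # False: skipped
print(passes_name_gate("Marshall", "Marshall"))  # True: proceeds
```

Under these assumptions Bridges vs Blanton scores just below 0.50 (the two names share only the leading B), so the pair is skipped regardless of its context-inflated cosine score.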

Design Tradeoffs

Why not embed names without context?

Bare-name embeddings eliminate context bleed but lose the disambiguation power demonstrated by the David Marshall test. A bare “David Marshall” vs “David Marshall” scores 1.0 — the model cannot distinguish them at all. Context is the only mechanism the embedding model has to separate same-name, different-person pairs.

Why not use separate embeddings for name and context?

An alternative architecture: embed the name and context separately, then combine scores with weighted averaging. This eliminates context bleed (the name embedding is pure name similarity) while retaining context as a separate signal.

This approach is viable but adds complexity — two embeddings per record instead of one, a tunable weight parameter, and a more complex similarity function. The current single-composite design is simpler and works well with the step 2.5 gate mitigating the primary failure mode. If context bleed proves problematic at scale, split embeddings are a planned fallback.
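A minimal sketch of what the split-embedding similarity function might look like, assuming the name and context similarities have already been computed as separate cosine scores. The function name and the default weight of 0.7 are hypothetical; the weight is exactly the tunable parameter mentioned above.

```python
def combined_score(
    name_sim: float, context_sim: float, name_weight: float = 0.7
) -> float:
    """Weighted average of separately computed name and context similarities."""
    if not 0.0 <= name_weight <= 1.0:
        raise ValueError("name_weight must be between 0 and 1")
    return name_weight * name_sim + (1.0 - name_weight) * context_sim


# With the scores kept separate, the name similarity can also be inspected
# on its own before any context is mixed in.
print(combined_score(name_sim=0.9, context_sim=0.5))
```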

Why not fine-tune?

A fine-tuned embedding model trained on election name pairs could learn that Charlie and Charles are similar, that Jr is categorically significant, and that shared context should not inflate scores for dissimilar names. We do not have training data yet.

However, L3 decisions are labeled examples: every LLM match/no-match decision with its confidence and reasoning is a training pair. As the pipeline processes more data, the L3 decision log becomes a natural training set for active learning. A fine-tuned model trained on thousands of L3 decisions would, in principle, learn the domain-specific similarity function that the general-purpose text-embedding-3-large approximates. This is a future direction, not a current capability.
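As a rough illustration of how the L3 decision log could feed a training set, the sketch below turns decisions into labeled pairs. The `L3Decision` record shape, its field names, and the 0.9 confidence cutoff are all hypothetical; the actual log format is not specified here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class L3Decision:
    """Hypothetical shape of one logged LLM decision."""
    composite_a: str
    composite_b: str
    is_match: bool
    confidence: float
    reasoning: str


def to_training_pairs(
    decisions: Iterable[L3Decision], min_confidence: float = 0.9
) -> List[Tuple[str, str, int]]:
    """Keep confident decisions as (text_a, text_b, label) triples."""
    return [
        (d.composite_a, d.composite_b, 1 if d.is_match else 0)
        for d in decisions
        if d.confidence >= min_confidence
    ]


log = [
    L3Decision(
        "David S Marshall | | State Legislature | ME",
        "David A Marshall | | County Commission | FL",
        is_match=False,
        confidence=0.97,  # illustrative value, not a measured one
        reasoning="different states, offices, and middle initials",
    ),
]
print(to_training_pairs(log))
```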

Summary

Property                     Effect                                                                                    Mitigation
Context included             Distinguishes same-name, different-person pairs (David Marshall: 1.0 → 0.581)             None needed (this is the goal)
Context bleed                Inflates scores for different-name, same-context pairs (Bridges vs Blanton: 0.21 → 0.60)  Step 2.5 JW gate on last names
Middle initial included      Provides disambiguation signal (0.7025 → 0.6448)                                          None needed (this is the goal)
Nickname dictionary applied  Aligns canonical first names before embedding (Charlie → Charles)                         None needed (this is the goal)

The composite template is a tradeoff between disambiguation power and noise tolerance. Context helps more than it hurts — but only because the step 2.5 gate exists to catch the cases where it hurts.