Threshold Calibration

Embedding similarity thresholds determine which candidate pairs auto-accept, which enter the LLM zone, and which auto-reject. These thresholds are not universal constants — they are calibrated to a specific embedding model (text-embedding-3-large, 3,072 dimensions) using real test data from our prototype.

Two findings from early testing forced a complete recalibration.

The Two Findings

Robert Williams Jr at 0.862 — a false positive. Under the original thresholds, any pair scoring ≥ 0.82 was auto-accepted. “Robert Williams” and “Robert Williams Jr” scored 0.862 — above the threshold. The system would have silently merged father and son into one entity. The suffix “Jr” carries categorical meaning (different person), but the embedding model treats it as a minor token appended to an otherwise identical string.

Charlie Crist at 0.451 — a false negative. Under the original thresholds, any pair scoring < 0.65 was auto-rejected. “Charlie Crist” and “CRIST, CHARLES JOSEPH” scored 0.451 — below the threshold. The system would have discarded a true match. The nickname “Charlie” for “Charles”, combined with different name ordering, different casing, and an extra middle name, pushed the score well below the reject boundary.

Both errors are unacceptable. Merging different people corrupts every downstream analysis. Missing true matches fragments candidate records across sources, breaking cross-source reconciliation and career tracking.

Old vs. New Thresholds

Zone	Old Range	New Range	Change
Auto-accept	≥ 0.82	≥ 0.95 AND same state	Raised by 0.13, added state constraint
Ambiguous (LLM zone)	0.65–0.82	0.35–0.95 AND same state	Widened from 0.17 to 0.60 range
Auto-reject	< 0.65	< 0.35 OR different state	Lowered by 0.30, added state escape

The ambiguous zone expanded from a 0.17-wide band to a 0.60-wide band. This means far more pairs are routed to the LLM for confirmation.

What Each Change Addresses

Auto-accept raised to 0.95

The Williams Jr pair at 0.862 demonstrated that scores in the 0.82–0.95 range can contain suffix-bearing false positives. At 0.95, the only pairs that auto-accept are near-identical strings with trivial formatting differences — “ASHLEY MOODY” vs “Ashley Moody” (0.930 would not auto-accept; it enters the LLM zone where the model confirms the match using full context).

The same-state constraint is an additional guard. A candidate for county sheriff in Maine should never auto-match with a candidate for county sheriff in Florida, regardless of embedding score. Different-state pairs always enter the LLM zone.

Ambiguous zone widened to 0.35–0.95

The Crist pair at 0.451 sat in the old auto-reject zone. The new lower bound of 0.35 captures every nickname case we tested:

Pair	Cosine	Old Zone	New Zone
DeSantis / DESANTIS, RON	0.729	Ambiguous	Ambiguous
Crist / CRIST, CHARLES JOSEPH	0.451	Reject	Ambiguous
Nicole Fried / FRIED, NIKKI	0.642	Reject	Ambiguous
Williams / Williams Jr	0.862	Accept	Ambiguous
Val Demings / VAL DEMINGS	0.828	Accept	Ambiguous
Marco Rubio / RUBIO, MARCO ANTONIO	0.743	Ambiguous	Ambiguous
Ashley Moody / MOODY, ASHLEY B	0.930	Accept	Ambiguous
Dale Holness / HOLNESS, DALE V.C.	0.896	Accept	Ambiguous

Under the old thresholds, 3 of 8 pairs were misclassified (2 false accepts, 1 false reject). Under the new thresholds, all 8 enter the LLM zone where the model resolves them correctly.

Auto-reject lowered to 0.35

Below 0.35, no tested pair in our prototype was a true match. At this score range, the names share almost no surface similarity — they are genuinely different people who happen to share a blocking key.

The different-state escape allows immediate rejection of cross-state pairs regardless of score. Local officeholders do not appear in multiple states. (Federal candidates can, but they are handled by a separate federal-office pathway that does not use this threshold table.)

The Cost of a Wider Ambiguous Zone

The old ambiguous zone (0.65–0.82) captured roughly 5% of within-block pairs. The new zone (0.35–0.95) captures roughly 25% — a 5× increase in LLM calls.

At prototype scale (200 records), this is negligible. At production scale (42 million rows), the increase matters for throughput but not for budget. Budget is not a constraint. The step 2.5 name gate (JW < 0.50 on last names → skip) eliminates the majority of low-score pairs before they reach the LLM, keeping the actual call volume manageable.

The wider zone is a deliberate trade: more LLM calls in exchange for zero false accepts and zero false rejects in the tested range.

Thresholds Are Model-Specific

These thresholds are calibrated for text-embedding-3-large with 3,072 dimensions. A different model — even an updated version of the same model — will produce different similarity distributions. If the embedding model changes:

Re-run the test cases against the new model.
Plot the score distribution for known matches and known non-matches.
Recalibrate auto-accept, ambiguous, and auto-reject boundaries.
Store the new thresholds alongside the model identifier in L2 metadata.

The embedding_model field in every L2 record ensures that thresholds can always be traced to the model that produced the scores.

Summary

Principle	Implementation
Never auto-accept a suffix mismatch	Threshold raised to 0.95; suffixes always enter LLM zone
Never auto-reject a nickname match	Threshold lowered to 0.35; nicknames always enter LLM zone
Cross-state pairs require LLM confirmation	Same-state constraint on auto-accept
Wider zone is acceptable	Budget is not a constraint; accuracy is
Thresholds are not portable	Model version stored in every record

Cross-Reference

Real Test Cases — all pairs that informed calibration
Suffixes: Jr/Sr Means Different People — the Williams Jr finding
Nicknames and Middle Initials — the Crist finding
Budget Is Not a Constraint — why the wider zone is acceptable

Keyboard shortcuts

Election Aggregation