
The Four-Tier Classifier

Office classification proceeds through four tiers in strict order. Each tier handles a progressively harder subset of the 8,387 unique office names found in MEDSL 2022. A name classified at tier 1 never reaches tier 2. A name classified at tier 2 never reaches tier 3. The tiers are ordered by cost: deterministic and free first, embedding-based second, LLM last.

Tier 1: Keyword Match

A lookup table of 170 keyword entries maps office name substrings to (office_level, office_branch) pairs. Matching is case-insensitive and checks for substring containment.

Example:

Raw office name: WARREN COUNTY BOARD OF EDUCATION

The keyword table contains:

| Keyword | office_level | office_branch |
|---|---|---|
| board of education | school_district | education |

"board of education" appears as a substring → classified as school_district/education.

Coverage: ~3,775 of 8,387 unique names (~45.0%). These are the offices with unambiguous keywords: sheriff, coroner, board of education, city council, state senate, district court, county clerk, school board, mayor, constable, treasurer.

Limitations: Keyword matching is context-free. DALLAS COUNTY JUDGE contains "judge", which maps to county/judicial. But in Texas the County Judge is the county's chief executive, so county/executive is correct. Tier 1 gets this wrong. The planned fix is a state-context override table applied before keyword matching.
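The tier 1 lookup can be sketched in a few lines. This is illustrative, not the actual code: `KEYWORD_TABLE` here holds only three of the 170 entries (using pairs confirmed above), and `classify_keyword` is a hypothetical helper name.

```python
# Illustrative subset of the 170-entry keyword table.
KEYWORD_TABLE = {
    "board of education": ("school_district", "education"),
    "city council": ("municipal", "legislative"),
    "judge": ("county", "judicial"),
}

def classify_keyword(raw_name):
    """Case-insensitive substring containment against the keyword table.

    Longer keywords are checked first so a more specific entry wins
    over a shorter, more generic one.
    """
    name = raw_name.lower()
    for keyword in sorted(KEYWORD_TABLE, key=len, reverse=True):
        if keyword in name:
            return KEYWORD_TABLE[keyword]
    return None  # no keyword hit: fall through to tier 2

print(classify_keyword("WARREN COUNTY BOARD OF EDUCATION"))
# The context-free failure mode described above: "judge" wins regardless
# of the Texas County Judge's executive role.
print(classify_keyword("DALLAS COUNTY JUDGE"))
```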

Tier 2: Regex Patterns

Approximately 40 regular expressions handle office names with structural patterns that keywords alone cannot capture.

Example:

Raw office name: CLERK OF THE CIRCUIT COURT, 11TH JUDICIAL CIRCUIT

Regex pattern: `clerk\s+of\s+(the\s+)?(circuit|district|superior)\s+court`

Match → classified as county/judicial.

Other regex examples:

| Pattern | Matches | Classification |
|---|---|---|
| `county\s+commission` | County Commissioner, County Commission District 3 | county/legislative |
| `(city\|town\|village)\s+council` | City Council Ward 2, Town Council At Large | municipal/legislative |
| `district\s+\d+\s+judge` | District 14 Judge, District 3 Judge | county/judicial |
| `soil\s+and\s+water` | Soil and Water Conservation District Supervisor | special_district/conservation |

Coverage: ~1,426 additional unique names (~17.0%), bringing the cumulative total to ~62.0%.

Limitations: Regex patterns are brittle against novel phrasings. CONSERVATION DISTRICT BOARD MEMBER does not match the soil-and-water pattern. Regex also cannot handle the 4,995 office names that appear in exactly one county — writing a pattern for each is infeasible.
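A minimal sketch of the tier 2 pass, assuming patterns are tried in order against the lowercased name. The list is an illustrative subset of the ~40 patterns, and `classify_regex` is a hypothetical helper name.

```python
import re

# Illustrative subset of the ~40 regex patterns, tried in order.
REGEX_PATTERNS = [
    (r"clerk\s+of\s+(the\s+)?(circuit|district|superior)\s+court",
     ("county", "judicial")),
    (r"(city|town|village)\s+council", ("municipal", "legislative")),
    (r"district\s+\d+\s+judge", ("county", "judicial")),
    (r"soil\s+and\s+water", ("special_district", "conservation")),
]

def classify_regex(raw_name):
    """Return the classification of the first matching pattern, or None."""
    name = raw_name.lower()
    for pattern, classification in REGEX_PATTERNS:
        if re.search(pattern, name):
            return classification
    return None  # no pattern matched: fall through to tier 3
```

Note the brittleness described above: `classify_regex("CONSERVATION DISTRICT BOARD MEMBER")` returns `None` because the name never mentions "soil and water".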

Tier 3: Embedding Nearest Neighbor

The remaining ~3,186 unclassified office names are embedded using text-embedding-3-large and compared against a reference set of ~200 pre-classified office names. The nearest neighbor’s classification is assigned if cosine similarity exceeds 0.60.

Example:

Raw office name: Collier Mosquito Control District

Nearest reference: Mosquito Control District → special_district/infrastructure

Cosine similarity: 0.787

0.787 > 0.60 → classified as special_district/infrastructure with confidence 0.787.

Other tier 3 results:

| Unclassified Name | Nearest Reference | Cosine | Classification |
|---|---|---|---|
| Collier Mosquito Control District | Mosquito Control District | 0.787 | special_district/infrastructure |
| Eastern Carrituck Fire & Rescue | Fire Protection District | 0.724 | special_district/infrastructure |
| Lowndes County Bd of Ed | Board of Education | 0.831 | school_district/education |
| Hospital Authority Board | Hospital District | 0.692 | special_district/health |

Coverage: ~378 additional unique names (~4.5%), bringing the cumulative total to ~66.5%.

What falls through: Office names with no close reference analog, names below the 0.60 threshold, and names whose nearest neighbor is misleading (e.g., Community Development District matching Community College District at 0.71 — wrong classification). These proceed to tier 4.
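The nearest-neighbor step itself is simple; the work is in the embeddings. The sketch below uses toy 3-dimensional vectors in place of real text-embedding-3-large vectors, and `classify_embedding` is a hypothetical helper name.

```python
import math

SIMILARITY_THRESHOLD = 0.60  # minimum cosine similarity to accept a match

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def classify_embedding(query_vec, references):
    """references: (name, vector, (office_level, office_branch)) tuples.

    Returns (classification, similarity) if the nearest reference clears
    the threshold, else None (the name falls through to tier 4).
    """
    name, vec, classification = max(
        references, key=lambda ref: cosine(query_vec, ref[1]))
    similarity = cosine(query_vec, vec)
    if similarity > SIMILARITY_THRESHOLD:
        return classification, similarity
    return None
```

The misleading-neighbor failure mode noted above is inherent to this design: the function always trusts whichever reference is closest, so a wrong-but-similar reference above 0.60 produces a confident wrong answer.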

Tier 4: LLM Classification

The final tier sends unclassified office names to Claude Sonnet with a structured prompt that includes the office name, state, and the full taxonomy of (office_level, office_branch) pairs.

Example:

Raw office name: Santa Rosa Island Authority

State: Florida

The LLM prompt provides the taxonomy and asks: “Classify this office into the most appropriate (office_level, office_branch) pair. Explain your reasoning.”

LLM response:

Classification: special_district/infrastructure (confidence: 0.90)

“The Santa Rosa Island Authority is a special-purpose governmental entity in Escambia County, Florida, responsible for managing development and infrastructure on Santa Rosa Island (Pensacola Beach). It is not a general-purpose county or municipal government. ‘Special district’ at the ‘infrastructure’ branch is the best fit.”

Coverage: ~42 additional unique names (~0.5%) in our prototype evaluation, classified with 100% accuracy against manual review (9 of 9 hard cases correct).

Other tier 4 examples:

| Office Name | State | LLM Classification | Confidence |
|---|---|---|---|
| Santa Rosa Island Authority | FL | special_district/infrastructure | 0.90 |
| Cuyahoga County Executive | OH | county/executive | 0.95 |
| Drainage Commissioner | IL | special_district/infrastructure | 0.85 |
| Register of Mesne Conveyances | SC | county/judicial | 0.88 |

The South Carolina example is illustrative: "Register of Mesne Conveyances" is an office that exists in exactly one state. No keyword, regex, or embedding reference can classify it without external knowledge. The LLM knows that "mesne conveyance" is a legal term related to property transfers and that the Register is a judicial officer.
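The exact prompt is not reproduced in full above, so the sketch below is a hypothetical reconstruction of the prompt construction step: `TAXONOMY` is a small illustrative subset of the full pair list, and `build_prompt` is an invented helper name. The real pipeline sends this text to Claude Sonnet via the API.

```python
# Illustrative subset of the full (office_level, office_branch) taxonomy.
TAXONOMY = [
    ("county", "executive"),
    ("county", "judicial"),
    ("school_district", "education"),
    ("special_district", "infrastructure"),
]

def build_prompt(office_name, state):
    """Assemble the structured tier 4 prompt: name, state, full taxonomy."""
    pairs = "\n".join(f"- {level}/{branch}" for level, branch in TAXONOMY)
    return (
        f"Office name: {office_name}\n"
        f"State: {state}\n\n"
        f"Valid (office_level, office_branch) pairs:\n{pairs}\n\n"
        "Classify this office into the most appropriate "
        "(office_level, office_branch) pair. Explain your reasoning."
    )

print(build_prompt("Register of Mesne Conveyances", "SC"))
```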

Tier Summary

| Tier | Method | Unique Names | Cumulative % | Cost per Name | Deterministic |
|---|---|---|---|---|---|
| 1 | Keyword (170 entries) | ~3,775 | 45.0% | $0 | Yes |
| 2 | Regex (~40 patterns) | ~1,426 | 62.0% | $0 | Yes |
| 3 | Embedding NN (200 refs) | ~378 | 66.5% | ~$0.0001 | Yes* |
| 4 | LLM | ~42 | 67.0% | ~$0.001 | No |
|  | Unclassified / other | ~2,766 | 100% |  |  |

* Deterministic given the same embedding model version.

The remaining ~33% classified as other are office names that did not pass through our full pipeline in the prototype. At production scale, tiers 1–4 are projected to handle ~99.5% of names, with ~0.5% remaining as other pending human review.

Why Four Tiers Instead of Just the LLM

Three reasons:

  1. Speed. Keyword and regex classify 62% of names in microseconds. Embedding NN classifies 4.5% more in milliseconds. Sending all 8,387 names to the LLM would take minutes and achieve the same result for the easy cases.

  2. Reproducibility. Tiers 1–3 produce identical output on every run. Tier 4 may produce slightly different reasoning (though classifications are stable in practice). Minimizing non-deterministic surface area makes the pipeline easier to audit.

  3. Debuggability. When a classification is wrong, the classifier_method field tells you which tier produced it. A wrong keyword mapping is a one-line table fix. A wrong regex is a pattern edit. A wrong embedding match means the reference set needs expansion. A wrong LLM classification means the prompt needs refinement. Each failure mode has a distinct fix.
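The cascade plus the classifier_method provenance field might be wired together as in this sketch. The tier functions here are stand-in lambdas rather than the real classifiers, and `classify` is a hypothetical dispatcher name.

```python
def classify(raw_name, state, tiers):
    """tiers: ordered list of (method_name, classify_fn) pairs.

    Each fn returns an (office_level, office_branch) pair or None.
    The first tier that returns a result wins, and its name is recorded
    in classifier_method so wrong answers can be traced to their tier.
    """
    for method, fn in tiers:
        result = fn(raw_name, state)
        if result is not None:
            level, branch = result
            return {"office_level": level, "office_branch": branch,
                    "classifier_method": method}
    # Nothing matched: flag for human review.
    return {"office_level": "other", "office_branch": "other",
            "classifier_method": "unclassified"}

# Stand-in tiers for illustration only.
tiers = [
    ("keyword", lambda name, state:
        ("county", "judicial") if "judge" in name.lower() else None),
    ("regex", lambda name, state: None),
]
print(classify("DALLAS COUNTY JUDGE", "TX", tiers))
```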

Cross-Reference