Office Classification

MEDSL 2022 contains 8,387 unique office names across all 50 states and DC. These are not 8,387 distinct offices; they are 8,387 different strings that humans typed to describe elected positions. “Board of Education”, “BOARD OF ED.”, “BOE”, “School Board”, and “Board of Education Members” all refer to the same type of office. “DALLAS COUNTY JUDGE” means a chief executive in Texas, while in most other states a county judge is a judicial officer.

Classifying these strings into a consistent taxonomy is required for every downstream operation: blocking for entity resolution, computing competitiveness by office type, comparing the same office across states, and answering “what offices exist in my county?”

The taxonomy

Every office is classified into two fields:

| Field | Values | Example |
|---|---|---|
| office_level | federal, state, county, municipal, school_district, special_district, judicial, tribal | school_district |
| office_branch | executive, legislative, judicial, law_enforcement, fiscal, education, infrastructure, regulatory, other | education |

The pair (office_level, office_branch) defines the classification. “Board of Education” → (school_district, education). “County Sheriff” → (county, law_enforcement). “City Council” → (municipal, legislative).
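One way to make the pair concrete in code is two enumerations plus a small tuple type. This is a minimal sketch; the names `OfficeLevel`, `OfficeBranch`, `Classification`, and `EXAMPLES` are illustrative and are not taken from the project's code.

```python
from enum import Enum
from typing import NamedTuple


class OfficeLevel(str, Enum):
    FEDERAL = "federal"
    STATE = "state"
    COUNTY = "county"
    MUNICIPAL = "municipal"
    SCHOOL_DISTRICT = "school_district"
    SPECIAL_DISTRICT = "special_district"
    JUDICIAL = "judicial"
    TRIBAL = "tribal"


class OfficeBranch(str, Enum):
    EXECUTIVE = "executive"
    LEGISLATIVE = "legislative"
    JUDICIAL = "judicial"
    LAW_ENFORCEMENT = "law_enforcement"
    FISCAL = "fiscal"
    EDUCATION = "education"
    INFRASTRUCTURE = "infrastructure"
    REGULATORY = "regulatory"
    OTHER = "other"


class Classification(NamedTuple):
    """The (office_level, office_branch) pair described above."""
    office_level: OfficeLevel
    office_branch: OfficeBranch


# The three examples from the text, expressed as typed pairs.
EXAMPLES = {
    "Board of Education": Classification(OfficeLevel.SCHOOL_DISTRICT, OfficeBranch.EDUCATION),
    "County Sheriff": Classification(OfficeLevel.COUNTY, OfficeBranch.LAW_ENFORCEMENT),
    "City Council": Classification(OfficeLevel.MUNICIPAL, OfficeBranch.LEGISLATIVE),
}
```

Using enumerations rather than free strings means a typo such as "school_distict" fails loudly at construction time instead of silently creating a new category.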

The scale of the problem

Of the 8,387 unique office names in MEDSL 2022:

| Characteristic | Count | Percentage |
|---|---|---|
| Appear in only 1 state | 6,241 | 74.4% |
| Appear in only 1 county | 4,995 | 59.6% |
| Appear in 10+ states | 312 | 3.7% |
| Contain a proper noun (county/city name) | 3,108 | 37.1% |

Most office names are effectively unique strings. “DALLAS COUNTY JUDGE”, “Collier Mosquito Control District”, “Santa Rosa Island Authority” — these appear once in the entire national dataset. No keyword list can enumerate them all. The classifier must generalize.
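Counts like "appear in only 1 state" can be derived by grouping records by office name. The sketch below assumes records are available as (office_name, state) pairs; the function names and the toy records are hypothetical and are not MEDSL data.

```python
from collections import defaultdict


def states_per_name(records):
    """Map each office name string to the set of states where it appears."""
    spread = defaultdict(set)
    for office_name, state in records:
        spread[office_name].add(state)
    return dict(spread)


def share_single_state(spread):
    """Fraction of unique names that appear in exactly one state."""
    single = sum(1 for states in spread.values() if len(states) == 1)
    return single / len(spread)


# Toy records for illustration only.
records = [
    ("SHERIFF", "OH"), ("SHERIFF", "TX"), ("SHERIFF", "FL"),
    ("DALLAS COUNTY JUDGE", "TX"),
    ("Collier Mosquito Control District", "FL"),
    ("Santa Rosa Island Authority", "FL"),
]
spread = states_per_name(records)
print(share_single_state(spread))  # 0.75: three of the four names occur in one state
```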

Four-tier approach

The classifier runs four tiers in sequence. Each tier handles what the previous tier could not. A record classified at tier 1 is never re-examined by tier 2.

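| Tier | Method | Unique names handled | Cumulative % | Cost |
|---|---|---|---|---|
| 1 | Keyword lookup | ~3,775 | ~45.0% | $0 |
| 2 | Regex patterns | ~1,426 | ~62.0% | $0 |
| 3 | Embedding nearest-neighbor | ~378 | ~66.5% | ~$0.01/1K |
| 4 | LLM classification | ~42 | ~67.0% | ~$0.002/call |
| | Unclassified (other) | ~2,766 | 100% | |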

The remaining ~33% classified as other are primarily hyper-local offices (township-specific roles, water district sub-boards, tribal offices) that require either expanded reference data or manual review. The other rate drops as the keyword and regex lists expand.
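The sequential, first-match-wins behaviour of the tiers can be sketched as a short dispatcher. This is an illustration under assumptions: the `Tier` signature, the `classify` function, and the two stand-in tiers are hypothetical, and returning `None` for names that fall through to other is a choice made for this sketch.

```python
import re
from typing import Callable, Optional, Sequence, Tuple

Classification = Tuple[str, str]  # (office_level, office_branch)
Tier = Callable[[str, str], Optional[Classification]]


def classify(office_name: str, state: str,
             tiers: Sequence[Tier]) -> Tuple[Optional[Classification], Optional[int]]:
    """Run tiers in order; the first tier that returns a result wins."""
    for tier_number, tier in enumerate(tiers, start=1):
        result = tier(office_name, state)
        if result is not None:
            return result, tier_number  # later tiers never see this name
    return None, None  # nothing matched: the name stays unclassified (other)


# Hypothetical stand-ins for tiers 1 and 2. Tiers 3 and 4 would share the signature.
def keyword_tier(name: str, state: str) -> Optional[Classification]:
    return ("county", "law_enforcement") if "sheriff" in name.casefold() else None


def regex_tier(name: str, state: str) -> Optional[Classification]:
    if re.search(r"county\s+commission", name, re.IGNORECASE):
        return ("county", "legislative")
    return None


tiers = [keyword_tier, regex_tier]
print(classify("WARREN COUNTY SHERIFF", "OH", tiers))
print(classify("CLARK COUNTY COMMISSION DIST 2", "NV", tiers))
print(classify("Hog Reeve", "NH", tiers))
```

Returning the tier number alongside the classification makes it straightforward to report per-tier coverage and accuracy later.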

Note: Percentages are based on unique office name strings. By record count, the coverage is much higher — the 312 names that appear in 10+ states account for millions of records. Keyword tier 1 alone handles ~85% of records by volume.
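The gap between coverage by unique string and coverage by record count can be made concrete with a small calculation. The sketch below uses invented counts (not MEDSL figures) and a hypothetical `coverage` helper.

```python
def coverage(record_counts, classified):
    """Return (share of unique names, share of records) that are classified."""
    hit = [name for name in record_counts if name in classified]
    by_name = len(hit) / len(record_counts)
    by_record = sum(record_counts[name] for name in hit) / sum(record_counts.values())
    return by_name, by_record


# Toy counts: two common offices dominate the records, two rare ones do not.
record_counts = {
    "SHERIFF": 900,
    "CITY COUNCIL": 80,
    "Santa Rosa Island Authority": 10,
    "Hog Reeve": 10,
}
classified = {"SHERIFF", "CITY COUNCIL"}

by_name, by_record = coverage(record_counts, classified)
print(f"{by_name:.0%} of names, {by_record:.0%} of records")
# prints "50% of names, 98% of records"
```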

Tier 1: Keyword lookup

A table of ~170 keywords mapped to (office_level, office_branch) pairs. If any keyword appears in the office name string, the classification is assigned.

| Keyword | office_level | office_branch | Example match |
|---|---|---|---|
| sheriff | county | law_enforcement | “WARREN COUNTY SHERIFF” |
| board of education | school_district | education | “COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02” |
| city council | municipal | legislative | “CITY COUNCIL WARD 3” |
| coroner | county | fiscal | “COUNTY CORONER” |
| constable | county | law_enforcement | “CONSTABLE PRECINCT 4” |

Keywords are matched case-insensitively. When multiple keywords match, the most specific wins (“county board of education” matches board of education → school_district, not county → county). The keyword table is maintained in the appendix.
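A minimal sketch of the lookup follows. It assumes that "most specific" can be approximated by "longest matching keyword", which is an assumption of this example rather than a documented rule. The bare `county` entry and its `other` branch are hypothetical, included only to show the tie-break.

```python
KEYWORDS = {
    "sheriff": ("county", "law_enforcement"),
    "board of education": ("school_district", "education"),
    "city council": ("municipal", "legislative"),
    "constable": ("county", "law_enforcement"),
    "county": ("county", "other"),  # hypothetical generic fallback keyword
}


def keyword_lookup(office_name):
    """Return the classification of the longest keyword found in the name."""
    name = office_name.casefold()  # case-insensitive matching
    matches = [keyword for keyword in KEYWORDS if keyword in name]
    if not matches:
        return None  # no keyword hit: pass the name on to tier 2
    return KEYWORDS[max(matches, key=len)]


print(keyword_lookup("COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"))
print(keyword_lookup("WARREN COUNTY SHERIFF"))
print(keyword_lookup("Hog Reeve"))
```

Plain substring matching is cheap but can over-match inside longer words, so a production table might anchor keywords on word boundaries.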

Keyword lookup handles approximately 45% of unique office name strings and ~85% of total records. The most common offices — sheriff, school board, city council, county commission — all have unambiguous keywords.

Tier 2: Regex patterns

Approximately 40 regex patterns handle structured variations that keywords miss. Patterns capture positional and combinatorial relationships:

| Pattern | office_level | office_branch | Example match |
|---|---|---|---|
| `county\s+commission` | county | legislative | “CLARK COUNTY COMMISSION DIST 2” |
| `district\s+court\s+judge` | judicial | judicial | “15TH DISTRICT COURT JUDGE” |
| `register\s+of\s+(deeds\|wills)` | county | fiscal | “REGISTER OF DEEDS” |
| `soil.*water.*conservation` | special_district | infrastructure | “SOIL AND WATER CONSERVATION DISTRICT SUPERVISOR” |
| `(mayor\|alcalde)` | municipal | executive | “MAYOR - CITY OF SPRINGFIELD” |

Regex patterns add approximately 17% of unique names beyond what keywords catch. Combined with tier 1, the two deterministic tiers handle ~62% of unique names and ~92% of records by volume.
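The five patterns shown in the table can be applied with Python's standard `re` module. In this sketch, case-insensitive matching and first-match-wins ordering are assumptions, and `regex_lookup` is a hypothetical name.

```python
import re

# (compiled pattern, (office_level, office_branch)), checked in list order.
PATTERNS = [
    (re.compile(r"county\s+commission", re.IGNORECASE), ("county", "legislative")),
    (re.compile(r"district\s+court\s+judge", re.IGNORECASE), ("judicial", "judicial")),
    (re.compile(r"register\s+of\s+(deeds|wills)", re.IGNORECASE), ("county", "fiscal")),
    (re.compile(r"soil.*water.*conservation", re.IGNORECASE), ("special_district", "infrastructure")),
    (re.compile(r"(mayor|alcalde)", re.IGNORECASE), ("municipal", "executive")),
]


def regex_lookup(office_name):
    """Return the classification of the first pattern that matches, else None."""
    for pattern, classification in PATTERNS:
        if pattern.search(office_name):
            return classification
    return None  # no pattern hit: pass the name on to tier 3


print(regex_lookup("SOIL AND WATER CONSERVATION DISTRICT SUPERVISOR"))
print(regex_lookup("MAYOR - CITY OF SPRINGFIELD"))
```

Using `search` rather than `match` lets a pattern hit anywhere in the string, which is what the example matches require ("CLARK COUNTY COMMISSION DIST 2" starts with a proper noun).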

Tier 3: Embedding nearest-neighbor

For names that survive tiers 1 and 2, L2 generates an embedding using text-embedding-3-large and finds the nearest neighbor in a reference set of ~200 pre-classified office names.

Real example from our prototype:

  • Input: “Collier Mosquito Control District”
  • Nearest neighbor: “Mosquito Control District” (reference set)
  • Cosine similarity: 0.787
  • Classification: (special_district, infrastructure)

The tier 3 accept threshold is cosine ≥ 0.60. Below that, the match is too uncertain and the record passes to tier 4. In our prototype, tier 3 classified ~378 names, which is ~4.5% of all unique names (about 12% of the ~3,186 names that reached it), with a manual-review accuracy of 94%.

The 200-name reference set was curated from the most common office names across all states, covering every (office_level, office_branch) pair with at least 3 reference examples. Expanding this set to 500+ names is a planned improvement.
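The nearest-neighbor step with its accept threshold can be sketched without any embedding service. The three-dimensional vectors below are toy stand-ins for real text-embedding-3-large vectors, and the `cosine` and `nearest_neighbor` helpers and the two-entry reference set are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbor(query_vec, reference, threshold=0.60):
    """Return (classification, neighbor_name, similarity), or None below threshold."""
    best = max(reference, key=lambda name: cosine(query_vec, reference[name][0]))
    similarity = cosine(query_vec, reference[best][0])
    if similarity < threshold:
        return None  # too uncertain: pass the name on to tier 4
    return reference[best][1], best, similarity


# Toy reference set: name -> (embedding, classification).
REFERENCE = {
    "Mosquito Control District": ([1.0, 0.0, 0.0], ("special_district", "infrastructure")),
    "County Sheriff": ([0.0, 1.0, 0.0], ("county", "law_enforcement")),
}

accepted = nearest_neighbor([0.8, 0.6, 0.0], REFERENCE)  # close to the first entry
rejected = nearest_neighbor([1.0, 1.0, 2.0], REFERENCE)  # far from both entries
print(accepted, rejected)
```

With ~200 reference names a linear scan like this is fast enough; a vector index would only matter if the reference set grew by orders of magnitude.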

Tier 4: LLM classification

Remaining unclassified names go to Claude Sonnet with the full context: office name, state, county, and the taxonomy definition.

Real examples from our prototype:

| Office name | State | LLM classification | Confidence |
|---|---|---|---|
| Santa Rosa Island Authority | FL | special_district / infrastructure | 0.90 |
| Mosquito Control Board Member | FL | special_district / infrastructure | 0.95 |
| Judge of Compensation Claims | FL | judicial / judicial | 0.88 |
| Public Administrator | MO | county / fiscal | 0.82 |
| Recorder of Deeds | MO | county / fiscal | 0.95 |
| Drainage Commissioner | IL | special_district / infrastructure | 0.85 |
| Fence Viewer | VT | municipal / regulatory | 0.70 |
| Pound Keeper | NH | municipal / regulatory | 0.65 |
| Hog Reeve | NH | municipal / regulatory | 0.60 |

In our prototype, the LLM classified all 9 of these hard cases correctly when checked against manual review. The lower-confidence cases (Fence Viewer at 0.70, Pound Keeper at 0.65, Hog Reeve at 0.60) are genuinely obscure New England town offices that even the LLM finds unusual, but it still classified them correctly.
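The request and response handling around the LLM call can be sketched without calling any model. Everything here is an assumption made for illustration: the prompt wording, the JSON reply format, and the `build_prompt` and `parse_reply` helpers are hypothetical, and the actual model call is deliberately left out.

```python
import json

LEVELS = {"federal", "state", "county", "municipal", "school_district",
          "special_district", "judicial", "tribal"}
BRANCHES = {"executive", "legislative", "judicial", "law_enforcement", "fiscal",
            "education", "infrastructure", "regulatory", "other"}


def build_prompt(office_name, state, county):
    """Assemble the context: office name, state, county, and the taxonomy."""
    return (
        "Classify this elected office.\n"
        f"Office name: {office_name}\nState: {state}\nCounty: {county}\n"
        f"office_level must be one of: {', '.join(sorted(LEVELS))}\n"
        f"office_branch must be one of: {', '.join(sorted(BRANCHES))}\n"
        'Reply with JSON: {"office_level": "...", "office_branch": "...", "confidence": 0.0}'
    )


def parse_reply(reply_text):
    """Return ((level, branch), confidence), or None if the reply is unusable."""
    try:
        data = json.loads(reply_text)
        level, branch = data["office_level"], data["office_branch"]
        confidence = float(data["confidence"])
        valid = level in LEVELS and branch in BRANCHES and 0.0 <= confidence <= 1.0
    except (ValueError, KeyError, TypeError):
        return None
    return ((level, branch), confidence) if valid else None


prompt = build_prompt("Hog Reeve", "NH", "Example County")  # county is a placeholder
reply = '{"office_level": "municipal", "office_branch": "regulatory", "confidence": 0.6}'
print(parse_reply(reply))
```

Validating the reply against the taxonomy keeps a hallucinated category (for example "township") from entering the dataset; such a name would remain unclassified instead.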

The state-context problem

“DALLAS COUNTY JUDGE” illustrates why state context matters. In Texas, the county judge is the presiding officer of the commissioners court, which is an executive role rather than a judicial one. In most other states, a county judge sits on the bench.

The keyword classifier alone cannot resolve this. The word “judge” appears, suggesting judicial. But the Texas county judge is (county, executive).

The fix is a state-specific override table in tier 1. Before general keyword matching, a small set of (state, keyword) → classification entries handles known exceptions:

| State | Office pattern | Correct classification |
|---|---|---|
| TX | county judge | county / executive |
| LA | parish president | county / executive |
| LA | police jury | county / legislative |
| AK | borough assembly | county / legislative |

This table is currently small (~15 entries). As more state-specific offices are identified, it grows. The pattern generalizes: when the same word means different things in different states, the state-specific override takes priority.
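The override-first ordering can be sketched as follows. The four override entries come from the table above. The general `judge` keyword mapping and the `tier1` function name are hypothetical, added only to show what the override pre-empts.

```python
# (state, pattern) -> classification, checked before general keywords.
STATE_OVERRIDES = {
    ("TX", "county judge"): ("county", "executive"),
    ("LA", "parish president"): ("county", "executive"),
    ("LA", "police jury"): ("county", "legislative"),
    ("AK", "borough assembly"): ("county", "legislative"),
}

# Hypothetical general keyword table (a single entry, for illustration).
GENERAL_KEYWORDS = {
    "judge": ("judicial", "judicial"),
}


def tier1(office_name, state):
    """Apply state-specific overrides first, then general keywords."""
    name = office_name.casefold()
    for (override_state, pattern), classification in STATE_OVERRIDES.items():
        if override_state == state and pattern in name:
            return classification
    for keyword, classification in GENERAL_KEYWORDS.items():
        if keyword in name:
            return classification
    return None


print(tier1("DALLAS COUNTY JUDGE", "TX"))  # override applies
print(tier1("DALLAS COUNTY JUDGE", "IA"))  # general keyword applies
```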

Accuracy by tier

| Tier | Method | Accuracy (manual review) | False positive rate |
|---|---|---|---|
| 1 | Keyword | 99.2% | < 0.5% |
| 2 | Regex | 97.8% | ~1.0% |
| 3 | Embedding NN | 94.0% | ~3.5% |
| 4 | LLM | 100% (N=9) | 0% (N=9) |

Tier 1 and 2 errors are almost entirely from the state-context problem (a keyword matching the wrong sense of the word). Tier 3 errors come from embedding matches that are semantically close but functionally wrong — “Tax Collector” matching to “Tax Assessor” when they are separate offices in some states.
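Per-tier accuracy against a manually reviewed sample reduces to a grouped tally. The sketch below assumes each reviewed item is a (tier, predicted, reviewed) triple; the `accuracy_by_tier` helper and the sample rows are invented for illustration. It does not compute a false positive rate, since that metric's definition is not given here.

```python
from collections import defaultdict


def accuracy_by_tier(reviewed):
    """Share of reviewed names per tier whose prediction matches manual review."""
    totals, correct = defaultdict(int), defaultdict(int)
    for tier, predicted, truth in reviewed:
        totals[tier] += 1
        correct[tier] += predicted == truth  # True counts as 1
    return {tier: correct[tier] / totals[tier] for tier in sorted(totals)}


# Toy review sample, not prototype data.
reviewed = [
    (1, ("county", "law_enforcement"), ("county", "law_enforcement")),
    (1, ("judicial", "judicial"), ("county", "executive")),  # state-context miss
    (3, ("special_district", "infrastructure"), ("special_district", "infrastructure")),
    (3, ("county", "fiscal"), ("county", "fiscal")),
]
print(accuracy_by_tier(reviewed))  # {1: 0.5, 3: 1.0}
```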

Cross-references