Office Classification

MEDSL 2022 contains 8,387 unique office names across all 50 states and DC. These are not 8,387 distinct offices — they are 8,387 different strings that humans typed to describe elected positions. “Board of Education”, “BOARD OF ED.”, “BOE”, “School Board”, and “Board of Education Members” all refer to the same type of office. “DALLAS COUNTY JUDGE” means a chief executive in Texas and a judicial officer everywhere else.

Classifying these strings into a consistent taxonomy is required for every downstream operation: blocking for entity resolution, computing competitiveness by office type, comparing the same office across states, and answering “what offices exist in my county?”

The taxonomy

Every office is classified into two fields:

Field	Values	Example
`office_level`	`federal`, `state`, `county`, `municipal`, `school_district`, `special_district`, `judicial`, `tribal`	`school_district`
`office_branch`	`executive`, `legislative`, `judicial`, `law_enforcement`, `fiscal`, `education`, `infrastructure`, `regulatory`, `other`	`education`

The pair (office_level, office_branch) defines the classification. “Board of Education” → (school_district, education). “County Sheriff” → (county, law_enforcement). “City Council” → (municipal, legislative).

The scale of the problem

Of the 8,387 unique office names in MEDSL 2022:

Characteristic	Count	Percentage
Appear in only 1 state	6,241	74.4%
Appear in only 1 county	4,995	59.6%
Appear in 10+ states	312	3.7%
Contain a proper noun (county/city name)	3,108	37.1%

Most office names are effectively unique strings. “DALLAS COUNTY JUDGE”, “Collier Mosquito Control District”, “Santa Rosa Island Authority” — these appear once in the entire national dataset. No keyword list can enumerate them all. The classifier must generalize.

Four-tier approach

The classifier runs four tiers in sequence. Each tier handles what the previous tier could not. A record classified at tier 1 is never re-examined by tier 2.

Tier	Method	Unique names handled	Cumulative %	Cost
1	Keyword lookup	~3,775	~45.0%	$0
2	Regex patterns	~1,426	~62.0%	$0
3	Embedding nearest-neighbor	~378	~66.5%	~$0.01/1K
4	LLM classification	~42	~67.0%	~$0.002/call
—	Unclassified (`other`)	~2,766	100%	—

The remaining ~33% classified as other are primarily hyper-local offices (township-specific roles, water district sub-boards, tribal offices) that require either expanded reference data or manual review. The other rate drops as the keyword and regex lists expand.

Note: Percentages are based on unique office name strings. By record count, the coverage is much higher — the 312 names that appear in 10+ states account for millions of records. Keyword tier 1 alone handles ~85% of records by volume.

Tier 1: Keyword lookup

A table of ~170 keywords mapped to (office_level, office_branch) pairs. If any keyword appears in the office name string, the classification is assigned.

Keyword	office_level	office_branch	Example match
`sheriff`	county	law_enforcement	“WARREN COUNTY SHERIFF”
`board of education`	school_district	education	“COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02”
`city council`	municipal	legislative	“CITY COUNCIL WARD 3”
`coroner`	county	fiscal	“COUNTY CORONER”
`constable`	county	law_enforcement	“CONSTABLE PRECINCT 4”

Keywords are matched case-insensitively. When multiple keywords match, the most specific wins (“county board of education” matches board of education → school_district, not county → county). The keyword table is maintained in the appendix.

Keyword lookup handles approximately 45% of unique office name strings and ~85% of total records. The most common offices — sheriff, school board, city council, county commission — all have unambiguous keywords.

Tier 2: Regex patterns

Approximately 40 regex patterns handle structured variations that keywords miss. Patterns capture positional and combinatorial relationships:

Pattern	office_level	office_branch	Example match
`county\s+commission`	county	legislative	“CLARK COUNTY COMMISSION DIST 2”
`district\s+court\s+judge`	judicial	judicial	“15TH DISTRICT COURT JUDGE”
`register\s+of\s+(deeds\|wills)`	county	fiscal	“REGISTER OF DEEDS”
`soil.water.conservation`	special_district	infrastructure	“SOIL AND WATER CONSERVATION DISTRICT SUPERVISOR”
`(mayor\|alcalde)`	municipal	executive	“MAYOR - CITY OF SPRINGFIELD”

Regex patterns add approximately 17% of unique names beyond what keywords catch. Combined with tier 1, the two deterministic tiers handle ~62% of unique names and ~92% of records by volume.

Tier 3: Embedding nearest-neighbor

For names that survive tiers 1 and 2, L2 generates an embedding using text-embedding-3-large and finds the nearest neighbor in a reference set of ~200 pre-classified office names.

Real example from our prototype:

Input: “Collier Mosquito Control District”
Nearest neighbor: “Mosquito Control District” (reference set)
Cosine similarity: 0.787
Classification: (special_district, infrastructure)

The tier 3 accept threshold is cosine ≥ 0.60. Below that, the match is too uncertain and the record passes to tier 4. In our prototype, tier 3 classified ~4.5% of remaining unique names with a manual-review accuracy of 94%.

The 200-name reference set was curated from the most common office names across all states, covering every (office_level, office_branch) pair with at least 3 reference examples. Expanding this set to 500+ names is a planned improvement.

Tier 4: LLM classification

Remaining unclassified names go to Claude Sonnet with the full context: office name, state, county, and the taxonomy definition.

Real examples from our prototype:

Office name	State	LLM classification	Confidence
Santa Rosa Island Authority	FL	special_district / infrastructure	0.90
Mosquito Control Board Member	FL	special_district / infrastructure	0.95
Judge of Compensation Claims	FL	judicial / judicial	0.88
Public Administrator	MO	county / fiscal	0.82
Recorder of Deeds	MO	county / fiscal	0.95
Drainage Commissioner	IL	special_district / infrastructure	0.85
Fence Viewer	VT	municipal / regulatory	0.70
Pound Keeper	NH	municipal / regulatory	0.65
Hog Reeve	NH	municipal / regulatory	0.60

In our prototype, the LLM classified 9 hard cases with 100% accuracy against manual review. The lower-confidence cases (Fence Viewer at 0.70, Hog Reeve at 0.60) are genuine obscure New England town offices that even the LLM finds unusual — but it classified them correctly.

The state-context problem

“DALLAS COUNTY JUDGE” illustrates why state context matters. In Texas, the county judge is the presiding officer of the commissioners court — an executive role, not a judicial one. In every other state, a county judge sits on the bench.

The keyword classifier alone cannot resolve this. The word “judge” appears, suggesting judicial. But the Texas county judge is (county, executive).

The fix is a state-specific override table in tier 1. Before general keyword matching, a small set of (state, keyword) → classification entries handles known exceptions:

State	Office pattern	Correct classification
TX	county judge	county / executive
LA	parish president	county / executive
LA	police jury	county / legislative
AK	borough assembly	county / legislative

This table is currently small (~15 entries). As more state-specific offices are identified, it grows. The pattern generalizes: when the same word means different things in different states, the state-specific override takes priority.

Accuracy by tier

Tier	Method	Accuracy (manual review)	False positive rate
1	Keyword	99.2%	< 0.5%
2	Regex	97.8%	~1.0%
3	Embedding NN	94.0%	~3.5%
4	LLM	100% (N=9)	0% (N=9)

Tier 1 and 2 errors are almost entirely from the state-context problem (a keyword matching the wrong sense of the word). Tier 3 errors come from embedding matches that are semantically close but functionally wrong — “Tax Collector” matching to “Tax Assessor” when they are separate offices in some states.

Cross-references

The Four-Tier Classifier — step-by-step walkthrough with a single office name through all four tiers
Appendix: Office Classification Reference — full keyword table and regex pattern list

Keyboard shortcuts

Election Aggregation