Office Classification Reference

The pipeline classifies 8,387 unique office name strings into canonical office types using a four-tier system. Each tier handles progressively harder cases. This appendix documents tiers 1 and 2 in full and summarizes tiers 3 and 4.

Coverage summary

Tier  Method                Unique offices handled  Cumulative coverage
----  --------------------  ----------------------  -------------------
1     Keyword lookup        3,102                   37%
2     Regex patterns        2,097                   62%
3     Embedding similarity  2,340                   90%
4     LLM classification    848                     100%
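
The cumulative coverage column follows directly from the per-tier counts. The short check below recomputes it; the variable and function names are illustrative and not part of the pipeline.

```python
# Per-tier counts copied from the coverage summary table.
tier_counts = [
    ("Keyword lookup", 3102),
    ("Regex patterns", 2097),
    ("Embedding similarity", 2340),
    ("LLM classification", 848),
]

# The four tiers together account for every unique office string.
total = sum(count for _, count in tier_counts)


def cumulative_coverage(counts):
    """Return (method, cumulative share as a whole percent) for each tier."""
    grand_total = sum(count for _, count in counts)
    running = 0
    rows = []
    for method, count in counts:
        running += count
        rows.append((method, round(100 * running / grand_total)))
    return rows


for method, pct in cumulative_coverage(tier_counts):
    print(f"{method}: {pct}%")
```

The counts sum to 8,387, which matches the number of unique office strings stated in the introduction.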

Tiers 1 and 2 are fully deterministic — same input, same output, no external calls. Tier 3 uses cosine similarity against text-embedding-3-large embeddings of known office types. Tier 4 sends unresolved strings to Claude Sonnet with a structured prompt.

Tier 1: Keyword lookup

Tier 1 runs a case-insensitive keyword match against the office name string. If any keyword appears in the string, the office is classified immediately. Keywords are checked in table order; the first match wins.

Keyword               office_level  office_category
--------------------  ------------  ---------------
president             federal       executive
u.s. senate           federal       legislative
u.s. house            federal       legislative
congress              federal       legislative
lieutenant governor   state         executive
governor              state         executive
attorney general      state         executive
secretary of state    state         executive
state treasurer       state         executive
state auditor         state         executive
state senate          state         legislative
state house           state         legislative
state representative  state         legislative
state assembly        state         legislative
supreme court         state         judicial
court of appeals      state         judicial
appeals court         state         judicial
district court        county        judicial
superior court        county        judicial
county commissioner   county        legislative
county council        county        legislative
sheriff               county        law_enforcement
clerk of court        county        judicial
register of deeds     county        administrative
coroner               county        administrative
constable             county        law_enforcement
justice of the peace  county        judicial
school board          local         education
board of education    local         education
city council          local         legislative
mayor                 local         executive
alderman              local         legislative
township trustee      local         legislative
soil and water        local         special_district
fire district         local         special_district
water district        local         special_district

Notes:

  • “u.s. senate” is checked before “state senate” to avoid false matches.
  • “lieutenant governor” is checked before “governor” for the same reason.
  • Keywords are matched as substrings, not whole words. “county commissioner district 3” matches on “county commissioner”.
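
The lookup described above can be sketched as an ordered list scanned from top to bottom. This is a minimal illustration, not the pipeline's actual code: `TIER1_KEYWORDS` holds only an excerpt of the table, in the order the notes describe, and `classify_tier1` is a hypothetical name.

```python
from typing import Optional, Tuple

# Excerpt of the tier 1 table as ordered (keyword, office_level, office_category)
# entries. Order matters: "lieutenant governor" precedes "governor", and
# "u.s. senate" precedes "state senate".
TIER1_KEYWORDS = [
    ("president", "federal", "executive"),
    ("u.s. senate", "federal", "legislative"),
    ("lieutenant governor", "state", "executive"),
    ("governor", "state", "executive"),
    ("state senate", "state", "legislative"),
    ("county commissioner", "county", "legislative"),
    ("sheriff", "county", "law_enforcement"),
    ("school board", "local", "education"),
    ("mayor", "local", "executive"),
]


def classify_tier1(office_name: str) -> Optional[Tuple[str, str]]:
    """Return (office_level, office_category) for the first keyword found, else None."""
    lowered = office_name.lower()  # case-insensitive comparison
    for keyword, level, category in TIER1_KEYWORDS:
        if keyword in lowered:  # substring match, not whole-word match
            return level, category
    return None  # no keyword matched; the string falls through to tier 2


print(classify_tier1("County Commissioner District 3"))
print(classify_tier1("Fence Viewer"))
```

Because matching is by substring, any office name that merely contains a keyword is classified by it, so the position of short keywords such as "president" or "mayor" in the list deserves care.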

Tier 2: Regex patterns

When no tier 1 keyword matches, the office string is tested against a series of compiled regular expressions. These handle structural patterns that keyword matching cannot.

Pattern                                                     office_level  office_category   Example matches
----------------------------------------------------------  ------------  ----------------  ---------------
(?i)^(us|united states) (rep|senator)                       federal       legislative       “US Rep District 4”
(?i)district judge.*district \d+                            county        judicial          “District Judge Judicial District 21”
(?i)(city|town|village) (of|de) .+ (council|trustee|board)  local         legislative       “Town of Cary Council”
(?i)independent school district.*\d+                        local         education         “Independent School District 279 Board”
(?i)(municipal|mun\.?) (utility|water|sewer) district       local         special_district  “Municipal Utility District 14”
(?i)community college.*trustee                              local         education         “Community College District Trustee”
(?i)(precinct|ward) (chair|committee)                       local         party             “Precinct Committee Chair”
(?i)conservation district (super|board|dir)                 local         special_district  “Conservation District Supervisor”
(?i)(drainage|levee|flood) (district|board)                 local         special_district  “Drainage District 7 Board”
(?i)hospital district (board|dir|trustee)                   local         special_district  “Hospital District Board Member”
(?i)park (district|board) (comm|dir|trustee)                local         special_district  “Park District Commissioner”
(?i)sanitary district                                       local         special_district  “Sanitary District Trustee”
(?i)mosquito (abatement|control) district                   local         special_district  “Mosquito Abatement District Trustee”
(?i)(borough|parish) (council|president|assembly)           county        legislative       “Borough Assembly Member”
(?i)district attorney                                       county        law_enforcement   “District Attorney 26th District”

Regex patterns are tested in order. The first match wins. All patterns use case-insensitive mode.
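
A minimal sketch of the ordered regex pass, using a handful of patterns from the table. The names `TIER2_PATTERNS` and `classify_tier2` are hypothetical, and the use of `search` (match anywhere in the string unless the pattern is anchored with `^`) is an assumption about how the patterns are applied.

```python
import re
from typing import Optional, Tuple

# Excerpt of the tier 2 table: (compiled pattern, office_level, office_category).
# Patterns are compiled once and tested in list order.
TIER2_PATTERNS = [
    (re.compile(r"(?i)^(us|united states) (rep|senator)"), "federal", "legislative"),
    (re.compile(r"(?i)(city|town|village) (of|de) .+ (council|trustee|board)"), "local", "legislative"),
    (re.compile(r"(?i)independent school district.*\d+"), "local", "education"),
    (re.compile(r"(?i)(drainage|levee|flood) (district|board)"), "local", "special_district"),
    (re.compile(r"(?i)district attorney"), "county", "law_enforcement"),
]


def classify_tier2(office_name: str) -> Optional[Tuple[str, str]]:
    """Return (office_level, office_category) for the first matching pattern, else None."""
    for pattern, level, category in TIER2_PATTERNS:
        if pattern.search(office_name):  # assumption: unanchored patterns may match anywhere
            return level, category
    return None  # unclassified; the string moves on to tier 3


print(classify_tier2("Town of Cary Council"))
print(classify_tier2("Moderator"))
```

Running each "Example matches" entry through its own pattern in a unit test is a cheap way to keep the table and the code in sync.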

Tier 3: Embedding similarity

Office strings that pass through tiers 1 and 2 unclassified are embedded using text-embedding-3-large (3072 dimensions) and compared against a reference set of known office type embeddings via FAISS nearest-neighbor search.

  • Threshold: cosine similarity ≥ 0.85 against the nearest known office type.
  • Reference set: the canonical office types defined by tiers 1 and 2, plus manually curated additions for jurisdiction-specific titles.
  • Examples resolved at tier 3:
    • “Moderator” → local / legislative (New England town meeting role)
    • “Fence Viewer” → local / administrative (historical New England office)
    • “Pound Keeper” → local / administrative
    • “Surveyor of Highways” → local / administrative
    • “Oyster Commissioner” → local / special_district (Maryland)

Tier 3 handles 2,340 unique office strings — mostly jurisdiction-specific titles, historical offices, and compound names that do not match keyword or regex patterns.
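
The thresholding step can be illustrated with a brute-force nearest-neighbor search. This sketch deliberately swaps in a plain Python loop for FAISS and toy 3-dimensional vectors for the real 3072-dimensional embeddings; `classify_tier3` and `REFERENCE` are hypothetical names, and the vectors are made up for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.85  # threshold stated above

Label = Tuple[str, str]  # (office_level, office_category)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_tier3(
    query: Sequence[float],
    reference: List[Tuple[Label, Sequence[float]]],
) -> Optional[Label]:
    """Return the label of the nearest reference vector if it clears the threshold."""
    best_label, best_score = None, -1.0
    for label, vector in reference:
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= SIMILARITY_THRESHOLD:
        return best_label
    return None  # below threshold; the string falls through to tier 4


# Toy reference set standing in for the known office type embeddings.
REFERENCE = [
    (("local", "administrative"), [0.9, 0.1, 0.0]),
    (("local", "legislative"), [0.0, 1.0, 0.0]),
]

print(classify_tier3([0.8, 0.2, 0.0], REFERENCE))
print(classify_tier3([0.0, 0.0, 1.0], REFERENCE))
```

With unit-normalized vectors, cosine similarity equals the inner product, which is how an inner-product index is commonly used for this kind of search.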

Tier 4: LLM classification

The remaining 848 office strings are sent to Claude Sonnet with a structured prompt that provides the office name, the state, and the county (where available). The LLM returns office_level, office_category, and a brief rationale.

Every tier 4 decision is recorded in the decision log with:

  • decision_id
  • input_string (the original office name)
  • output_level and output_category
  • llm_request_id
  • rationale (the LLM’s explanation)

Tier 4 classifications can be overridden by adding entries to the tier 1 or tier 2 tables in subsequent pipeline versions. Once an office string is promoted to tier 1 or tier 2, it is classified deterministically on all future runs.
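
One way to picture a decision log entry is as a small immutable record serialized to JSON. The field names come from the list above; the class name, the string types, the JSON-lines serialization, and every value in the example are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Tier4Decision:
    """One tier 4 decision log entry (field types are assumed to be strings)."""
    decision_id: str
    input_string: str      # the original office name
    output_level: str
    output_category: str
    llm_request_id: str
    rationale: str         # the LLM's explanation


def to_log_line(decision: Tier4Decision) -> str:
    """Serialize a decision as one JSON object per line (assumed log format)."""
    return json.dumps(asdict(decision), sort_keys=True)


# Placeholder values only; none of these come from a real pipeline run.
example = Tier4Decision(
    decision_id="d-0001",
    input_string="Example Office Title",
    output_level="local",
    output_category="administrative",
    llm_request_id="req-placeholder",
    rationale="Placeholder rationale text.",
)

print(to_log_line(example))
```

Keeping `input_string` verbatim in the log makes it straightforward to find strings worth promoting to the tier 1 or tier 2 tables later.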

Office level and category enumerations

office_level values: federal, state, county, local.

office_category values: executive, legislative, judicial, law_enforcement, administrative, education, special_district, party.

These enumerations are defined in the Enumerations Reference. Every classified office receives exactly one level and one category.
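
As an illustration of the "exactly one level and one category" rule, the two value lists can be modeled as Python enums paired in a single record. The class and function names below are hypothetical; only the enumeration values come from this page.

```python
from dataclasses import dataclass
from enum import Enum


class OfficeLevel(str, Enum):
    FEDERAL = "federal"
    STATE = "state"
    COUNTY = "county"
    LOCAL = "local"


class OfficeCategory(str, Enum):
    EXECUTIVE = "executive"
    LEGISLATIVE = "legislative"
    JUDICIAL = "judicial"
    LAW_ENFORCEMENT = "law_enforcement"
    ADMINISTRATIVE = "administrative"
    EDUCATION = "education"
    SPECIAL_DISTRICT = "special_district"
    PARTY = "party"


@dataclass(frozen=True)
class OfficeClassification:
    """Exactly one level and one category per classified office."""
    level: OfficeLevel
    category: OfficeCategory


def parse_classification(level: str, category: str) -> OfficeClassification:
    """Build a classification; enum lookup raises ValueError for unknown values."""
    return OfficeClassification(OfficeLevel(level), OfficeCategory(category))


print(parse_classification("county", "judicial"))
```

Validating at this boundary means a misspelled level or category from any tier fails loudly instead of entering the output.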

Handling ambiguity

Some office strings are genuinely ambiguous:

  • “Board of Commissioners” could be county or municipal depending on jurisdiction.
  • “Trustee” alone could be township, school board, or special district.
  • “Judge” without a court name could be any judicial level.

In these cases, the pipeline uses jurisdiction context (state, county, FIPS code) to disambiguate. If the jurisdiction does not resolve the ambiguity, the string is sent to tier 3 or 4 with the full context attached.
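
The routing described above might look roughly like the sketch below. Everything here is hypothetical: the rule table, the function names, and the return shape are invented for illustration, and "XX" and "YY" are placeholder state codes rather than real jurisdictions.

```python
from typing import Dict, Optional, Tuple

# Hypothetical context rules keyed by (lowercased office string, state code).
# "XX" and "YY" are placeholders, not claims about any real state.
CONTEXT_RULES: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("board of commissioners", "XX"): ("county", "legislative"),
    ("board of commissioners", "YY"): ("local", "legislative"),
}


def resolve_with_context(office_name: str, state: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (office_level, office_category) if jurisdiction context settles it."""
    if state is None:
        return None
    return CONTEXT_RULES.get((office_name.strip().lower(), state))


def route(office_name: str, state: Optional[str] = None,
          county: Optional[str] = None, fips: Optional[str] = None) -> dict:
    """Classify from context when possible; otherwise escalate with full context."""
    resolved = resolve_with_context(office_name, state)
    if resolved is not None:
        return {"status": "resolved", "classification": resolved}
    # Context did not resolve the ambiguity: forward to tier 3 or 4
    # with the jurisdiction context attached.
    return {
        "status": "escalate",
        "office_name": office_name,
        "context": {"state": state, "county": county, "fips": fips},
    }


print(route("Board of Commissioners", state="XX"))
print(route("Trustee", state="XX", county="Example County"))
```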