Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Confidence Levels

Every record in the pipeline carries a confidence level that reflects the trustworthiness of its source and the reliability of the processing steps applied to it. Confidence is not a score — it is a categorical label with defined semantics.

Three levels

High

The source is a certified government publication. Examples: NC SBE certified results, state election board portals that publish official canvass data. Records ingested from these sources enter L0 with source.confidence = "high".

High-confidence sources provide vote totals that are legally authoritative. When two sources disagree, the high-confidence source is treated as ground truth.

Medium

The source is a curated academic dataset derived from government publications. Example: MEDSL, which aggregates and reformats state-published results into a consistent schema. The data is one step removed from the original — parsed, cleaned, and sometimes corrected by the MEDSL team.

Medium-confidence sources are reliable for analysis but are not primary. In the 640 overlapping contests between MEDSL and NC SBE, 90.5% have identical vote totals. The 9.5% that differ are typically due to provisional ballot timing or reporting cutoffs.

Low

The source is community-curated, OCR-derived, or otherwise not traceable to a single certified publication. Examples: OpenElections state files with known parsing issues, any data recovered from PDFs via OCR, or crowd-sourced contest metadata.

Low-confidence records are included in the dataset but flagged. They are useful for coverage (filling gaps where no better source exists) but should not be cited without independent verification.

How confidence propagates

Confidence is not static. It can degrade as records pass through the pipeline, but it never improves without human intervention.

Source confidence (L0–L1). Set at ingestion based on the source type. Deterministic — the same source always gets the same level.

Match confidence (L3). Entity resolution adds a second dimension. If the match method is deterministic (exact string match or Jaro-Winkler ≥ 0.92), the source confidence is preserved. If the match required embedding similarity or LLM confirmation, the record is annotated with the match method and decision ID, but the source confidence is not downgraded — instead, a separate match_confidence field is added.

The combined confidence follows these rules:

Source confidenceMatch methodOverallNotes
HighExactHighBest case. Certified source, deterministic match.
HighJaro-WinklerHighAlgorithmic match above threshold.
HighEmbeddingHigh + decision IDSource still trusted; match is logged.
HighLLMHigh + decision IDSource still trusted; LLM rationale recorded.
MediumExactMediumAcademic source, deterministic match.
MediumLLMMedium + decision IDBoth source and match carry caveats.
LowAnyLowSource uncertainty dominates.

LLM decision tracking

When an LLM (Claude Sonnet) is involved in entity resolution, the pipeline records:

  • The decision ID (a unique hash of the prompt, response, and model version)
  • The prompt sent to the model
  • The model’s response
  • The confidence score returned by the model

This allows any LLM-assisted decision to be audited, replayed, or overridden. See When the LLM Gets Called.

How to cite records

When using data from this pipeline in publications, cite the original source, not the pipeline. The pipeline provides the information needed to construct a proper citation.

APA format template:

{Source organization}. ({Year}). {Dataset title} [Data set]. Retrieved {retrieval_date} from {url}.

Example:

North Carolina State Board of Elections. (2022). Official general election results [Data set]. Retrieved 2025-01-15 from https://www.ncsbe.gov/results-data.

Each L4 record includes the fields needed to construct this citation: source.name, source.retrieval_date, source.url, and source.confidence. A methodology link pointing to the pipeline documentation should accompany any analysis that depends on entity resolution or cross-source reconciliation.