The Two Audiences
This project serves two audiences with fundamentally different trust requirements. Engineers need to verify the pipeline mechanically. Consumers — journalists, researchers, government staff — need to understand what the data means and how much to trust it without reading source code.
This chapter describes what each audience sees and how the two views connect.
What engineers see
Engineers interact with the pipeline’s internal machinery. Trust, for this audience, is a function of determinism, traceability, and mechanical reproducibility.
Hash chains. Every record carries a provenance.hash field — a SHA-256 hash of the record’s content at each layer. L1 records hash L0 input bytes. L2 records hash L1 output plus the embedding model version. L3 records hash L2 output plus the decision log entry. L4 records hash L3 output. Any mutation at any layer invalidates all downstream hashes.
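The chaining can be sketched in a few lines of Python. The function name and serialization scheme below are illustrative assumptions, not the pipeline's actual code:

```python
import hashlib
import json

def layer_hash(payload, parent_hash=None):
    """Hash a record's content, chaining in the upstream layer's hash."""
    # Canonical serialization (sorted keys, no whitespace) so the same
    # logical record always yields the same digest.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    material = (parent_hash or "") + body
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

# L1 hashes L0 input; L2 chains L1's hash plus the embedding model version.
l1 = layer_hash({"raw": "precinct,votes\nA,12847"})
l2 = layer_hash({"embedding_model": "text-embedding-3-large"}, parent_hash=l1)

# A mutation at L1 invalidates the L2 hash downstream.
l1_tampered = layer_hash({"raw": "precinct,votes\nA,12848"})
assert layer_hash({"embedding_model": "text-embedding-3-large"},
                  parent_hash=l1_tampered) != l2
```

Canonical serialization matters here: two semantically identical records must hash identically, or hash comparison reports spurious mutations.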
Decision logs. Every non-deterministic operation at L3 — embedding similarity matches, LLM-confirmed entity resolutions — is recorded in a JSONL decision log. Each entry includes:
- decision_id: a unique identifier for the decision
- method: one of exact, jaro_winkler, embedding, llm
- score: the similarity or confidence score (where applicable)
- input_record_hashes: the L2 records being compared
- output: the resolution (match, no-match, or merge)
- llm_request_id: the API request ID (for LLM decisions only)
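As a sketch, a helper that appends one such entry might look like the following. The function name and file path are hypothetical; the field names are the ones listed above:

```python
import json

def log_decision(path, *, decision_id, method, input_record_hashes,
                 output, score=None, llm_request_id=None):
    """Append one entity-resolution decision to a JSONL decision log."""
    entry = {
        "decision_id": decision_id,
        "method": method,                  # exact | jaro_winkler | embedding | llm
        "input_record_hashes": input_record_hashes,
        "output": output,                  # match | no-match | merge
    }
    if score is not None:
        entry["score"] = score
    if llm_request_id is not None:         # present for LLM decisions only
        entry["llm_request_id"] = llm_request_id
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, sort_keys=True) + "\n")

log_decision("decisions.jsonl",
             decision_id="d-2024-00417", method="llm", score=0.78,
             input_record_hashes=["<l2-hash-a>", "<l2-hash-b>"],
             output="match", llm_request_id="req_example")
```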
Embedding IDs. Every embedding generated at L2 is tagged with the model identifier (text-embedding-3-large), the embedding dimension (3072), and the composite string template used to generate the input text. If the model or template changes, all L2 records are regenerated — not patched.
Layer manifests. Each layer’s output directory contains a manifest.jsonl file listing every output file, its row count, its SHA-256 hash, the pipeline version that produced it, and the timestamp of generation. Manifests are the unit of verification: compare two manifests to determine whether a pipeline run produced identical output.
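Comparing two manifests reduces to a set difference over file hashes. A minimal sketch, assuming each JSONL entry carries file and sha256 keys (illustrative names):

```python
import json

def load_manifest(path):
    """Map each listed output file to its recorded SHA-256 hash."""
    with open(path, encoding="utf-8") as f:
        return {e["file"]: e["sha256"] for e in map(json.loads, f)}

def diff_manifests(a_path, b_path):
    """Return files whose hashes differ, or that appear in only one run."""
    a, b = load_manifest(a_path), load_manifest(b_path)
    return sorted(f for f in a.keys() | b.keys() if a.get(f) != b.get(f))
```

An empty diff means the two runs produced identical output; a non-empty diff names exactly which files diverged.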
What consumers see
Consumers interact with query results, summary statistics, and exported datasets. Trust, for this audience, is a function of source attribution, stated confidence, and transparent methodology.
Source names. Every record in consumer-facing output includes a human-readable source name: “NC SBE (certified)”, “MEDSL 2022”, “OpenElections (community-curated)”. The source name tells the consumer where the data came from and how it was collected.
Confidence levels. Every record carries a confidence level: high, medium, or low. See Confidence Levels for definitions. Consumers can filter by confidence to match their tolerance for uncertainty.
Methodology page. Any published dataset includes a methodology section describing the pipeline version, source versions, and processing steps used. This is the consumer-facing equivalent of the manifest.
Bridge table
The following table maps consumer-facing fields to their internal pipeline equivalents. If you see a value in a consumer export and need to trace it, this is where to start.
| Consumer-facing field | Example value | Internal pipeline field | Layer |
|---|---|---|---|
| Source | NC SBE (certified) | source.source_type = nc_sbe, source.certification = certified | L1 |
| Confidence | High | provenance.confidence = high | L1–L4 |
| Candidate name | John A. Smith Jr. | candidate.canonical_first = john, candidate.canonical_last = smith, candidate.suffix = jr | L4 |
| Office | County Commissioner District 3 | contest.canonical_office = county_commissioner, contest.district = 3 | L4 |
| Vote total | 12,847 | votes.total = 12847 | L1 |
| Match method | Algorithmic (exact) | entity_resolution.method = exact | L3 |
| Match method | LLM-confirmed | entity_resolution.method = llm, entity_resolution.decision_id = d-2024-00417 | L3 |
| Jurisdiction | Mecklenburg County, NC | jurisdiction.county_fips = 37119, jurisdiction.state = NC | L1 |
| Election date | 2022-11-08 | election.date = 2022-11-08 | L1 |
| Party | Democratic | candidate.party = DEM | L1 |
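For tooling, the same mapping could be encoded as a lookup table. This sketch mirrors the table above; the dictionary structure and helper are illustrative, not part of the pipeline:

```python
# Consumer-facing field -> (internal pipeline fields, layer), per the bridge table.
BRIDGE = {
    "Source":         (["source.source_type", "source.certification"], "L1"),
    "Confidence":     (["provenance.confidence"], "L1-L4"),
    "Candidate name": (["candidate.canonical_first", "candidate.canonical_last",
                        "candidate.suffix"], "L4"),
    "Office":         (["contest.canonical_office", "contest.district"], "L4"),
    "Vote total":     (["votes.total"], "L1"),
    "Match method":   (["entity_resolution.method"], "L3"),
    "Jurisdiction":   (["jurisdiction.county_fips", "jurisdiction.state"], "L1"),
    "Election date":  (["election.date"], "L1"),
    "Party":          (["candidate.party"], "L1"),
}

def trace(consumer_field):
    """Point from a consumer-facing field to where tracing should start."""
    fields, layer = BRIDGE[consumer_field]
    return f"{consumer_field} -> {', '.join(fields)} (layer {layer})"
```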
Reproducibility by layer
Not all layers are equally reproducible. The guarantees differ based on whether a layer involves external API calls.
L0 → L1: Deterministic. L1 is a pure function of L0 input and the pipeline code. Same input, same code version, same output — byte-identical. No external calls. No randomness.
L1 → L2: Deterministic. L2 adds embeddings generated by text-embedding-3-large (3072 dimensions). The embedding API is deterministic for a given model version and input string. Same L1 input, same model version, same output. If OpenAI retires or modifies the model, the pinned model version in the manifest makes the change detectable (though the original output cannot be reproduced without the original model).
L2 → L3: Replayable from decision log. L3 involves entity resolution — some of which uses embedding cosine similarity (deterministic given L2) and some of which calls Claude Sonnet for confirmation. LLM calls are not deterministic: the same prompt may produce different text on different days. However, every LLM decision is recorded in the decision log with its output. Replaying L3 from the decision log — rather than re-calling the LLM — produces identical output. The decision log is the reproducibility mechanism for L3.
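In sketch form, a replay is a single pass over the log that keys each recorded output by its input hashes. The field names follow the decision-log schema above; the function itself is hypothetical:

```python
import json

def replay_l3(decision_log_path):
    """Re-apply recorded resolutions instead of re-calling the LLM.

    Returns {frozenset(input_record_hashes): output}, so L3 can be rebuilt
    deterministically from L2 output plus the log.
    """
    resolutions = {}
    with open(decision_log_path, encoding="utf-8") as f:
        for line in f:
            d = json.loads(line)
            resolutions[frozenset(d["input_record_hashes"])] = d["output"]
    return resolutions
```

No network calls, no model inference: the log is the single source of truth for every non-deterministic step.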
L3 → L4: Deterministic. L4 is a deterministic function of L3 output. It selects canonical names, assigns canonical IDs, and merges duplicate records. Same L3, same L4.
End-to-end reproducibility. To fully reproduce a dataset:
- Check out the tagged pipeline version from the repository.
- Obtain the same L0 source files (verified by hash against the L0 manifest).
- Run L0 → L2. Verify output hashes against the L2 manifest.
- Apply the published decision log to produce L3. Verify against the L3 manifest.
- Run L3 → L4. Verify against the L4 manifest.
If all manifest hashes match, the reproduction is exact. If any hash diverges, the manifest diff identifies exactly which records changed and at which layer.
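Each verification step above amounts to rehashing a layer's files against its manifest. A minimal sketch, assuming the manifest entry names used earlier (file and sha256 are illustrative keys):

```python
import hashlib
import json
from pathlib import Path

def verify_layer(output_dir):
    """Rehash every file a layer's manifest lists; return any mismatches."""
    out = Path(output_dir)
    mismatches = []
    with open(out / "manifest.jsonl", encoding="utf-8") as f:
        for entry in map(json.loads, f):
            data = (out / entry["file"]).read_bytes()
            if hashlib.sha256(data).hexdigest() != entry["sha256"]:
                mismatches.append(entry["file"])
    return mismatches
```

An empty result means the layer reproduced exactly; any entries returned name the files to diff first.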
When the two views diverge
Sometimes engineers and consumers reach different conclusions about the same record:
- An engineer may see that a match was made by LLM with confidence 0.78 and flag it as marginal. A consumer sees “Source: MEDSL, Confidence: Medium” and treats it as usable. Both are correct within their frame.
- An engineer may know that an embedding model version is deprecated. A consumer sees no change in the output. The manifest captures this risk; the consumer-facing confidence level does not (yet).
The bridge table above is the mechanism for resolving these divergences. When in doubt, trace the consumer field back to its pipeline equivalent and inspect the full provenance chain.