Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Querying JSONL Output

Every layer of the pipeline produces JSONL — one JSON record per line. This format is streamable, greppable, and works with standard Unix tools. No database required.

Format Basics

Each line is a complete, self-contained JSON object:

{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Timothy Lance","votes_total":303}
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Bessie Blackwell","votes_total":277}

Line count equals record count:

wc -l l4_canonical/exports/flat_export.jsonl
# 42381902 l4_canonical/exports/flat_export.jsonl

Querying with jq

jq is the standard tool for command-line JSON processing. Every example below operates on L4 flat export JSONL.

Filter by state

cat flat_export.jsonl | jq -c 'select(.state == "NC")' | head -3

Output:

{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Timothy Lance","votes_total":303,...}
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Bessie Blackwell","votes_total":277,...}
{"election_date":"2022-11-08","state":"NC","county":"COLUMBUS","candidate_canonical":"Nicky Wooten","votes_total":218,...}

Filter by office level

cat flat_export.jsonl | jq -c 'select(.contest.office_level == "school_district")' | wc -l
# 1847302

Extract specific fields

cat flat_export.jsonl \
  | jq -c 'select(.state == "NC" and .county == "COLUMBUS") | {name: .candidate_canonical, votes: .votes_total, office: .contest_name}' \
  | head -5

Output:

{"name":"Timothy Lance","votes":303,"office":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"}
{"name":"Bessie Blackwell","votes":277,"office":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"}
{"name":"Nicky Wooten","votes":218,"office":"COLUMBUS COUNTY SCHOOLS BOARD OF EDUCATION DISTRICT 02"}
{"name":"Ricky Leinwand","votes":1531,"office":"COLUMBUS COUNTY SHERIFF"}
{"name":"Jody Greene","votes":1204,"office":"COLUMBUS COUNTY SHERIFF"}

Count distinct candidates per state

cat flat_export.jsonl \
  | jq -r '.state + "\t" + .candidate_entity_id' \
  | sort -u \
  | cut -f1 \
  | uniq -c \
  | sort -rn \
  | head -5

Output:

  14203 TX
  12847 CA
   9341 FL
   7892 NY
   6204 OH

Find all records for a specific candidate

cat flat_export.jsonl \
  | jq -c 'select(.candidate_entity_id == "person:nc:columbus:lance-timothy-13")' \
  | jq '{precinct: .jurisdiction.precinct, votes: .votes_total}'

Output (one line per precinct):

{"precinct":"P17","votes":303}
{"precinct":"P21","votes":287}
{"precinct":"P04","votes":214}
...

Querying with Python

For aggregation, sorting, or anything beyond filtering, Python is more practical.

Load and filter

import json

with open("flat_export.jsonl") as f:
    nc_school = [
        json.loads(line) for line in f
        if '"NC"' in line  # fast pre-filter on raw text
        and json.loads(line).get("contest", {}).get("office_level") == "school_district"
    ]

print(f"{len(nc_school)} NC school district records")

Stream large files without loading into memory

import json

def stream_jsonl(path, predicate):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if predicate(record):
                yield record

for r in stream_jsonl("flat_export.jsonl", lambda r: r["state"] == "NC" and r["votes_total"] > 1000):
    print(r["candidate_canonical"], r["votes_total"], r["contest_name"])

Aggregate to contest level

import json
from collections import defaultdict

totals = defaultdict(lambda: defaultdict(int))

with open("flat_export.jsonl") as f:
    for line in f:
        r = json.loads(line)
        if r["state"] == "NC" and r["county"] == "COLUMBUS":
            key = (r["contest_name"], r["candidate_canonical"])
            totals[r["contest_name"]][r["candidate_canonical"]] += r["votes_total"]

for contest, candidates in sorted(totals.items()):
    print(f"\n{contest}")
    for name, votes in sorted(candidates.items(), key=lambda x: -x[1]):
        print(f"  {name}: {votes:,}")

Export to CSV

import json, csv

with open("flat_export.jsonl") as f_in, open("output.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["state", "county", "contest", "candidate", "votes"])
    for line in f_in:
        r = json.loads(line)
        writer.writerow([r["state"], r["county"], r["contest_name"],
                         r["candidate_canonical"], r["votes_total"]])

Five Useful One-Liners

1. Total votes per state (top 10):

jq -r .state flat_export.jsonl | sort | uniq -c | sort -rn | head -10

2. All uncontested races (single candidate per contest):

jq -r '"\(.state)\t\(.county)\t\(.contest_name)\t\(.candidate_entity_id)"' flat_export.jsonl \
  | sort -u | cut -f1-3 | uniq -c | awk '$1 == 1' | wc -l

3. Highest single-precinct vote total:

jq -c 'select(.votes_total > 50000) | {name: .candidate_canonical, votes: .votes_total, state: .state}' flat_export.jsonl \
  | sort -t: -k2 -rn | head -5

4. Candidates appearing in multiple elections (career tracking):

jq -r '"\(.candidate_entity_id)\t\(.election_date)"' flat_export.jsonl \
  | sort -u | cut -f1 | uniq -c | awk '$1 >= 3' | wc -l
# 702

5. Verify a specific hash chain link:

jq -c 'select(.l3_hash == "28183d41d50204d5")' l3_matched/nc/2022/matched.jsonl

Performance Notes

  • Streaming is mandatory at scale. The full L1 corpus at 200 million records is approximately 400 GB of JSONL. Do not load it into memory. Use jq with streaming or Python generators.
  • Pre-filter with grep. For large files, grep '"NC"' flat_export.jsonl | jq ... is faster than jq 'select(.state == "NC")' alone, because grep uses optimized byte scanning while jq parses every line.
  • Partition files help. The pipeline stores L1–L3 output partitioned by {state}/{year}/. Query a single state-year partition instead of the full national file when possible.
  • For heavy analysis, load into DuckDB or SQLite. Both can ingest JSONL directly and provide SQL query capabilities with proper indexing.