>Skillia Voting Records
>Objective
Investigate skillia_voting_records.zip (extracted to 002.records/skillia_voting_records.csv) to identify the six voting stations that were manipulated. Produce a reproducible, fully documented analysis, including the code used, the commands run, and the final flag.
>Environment & Setup
- Workspace: /home/noigel/Desktop/RSTCONctf/Forensics/Skilla
- Evidence archive: skillia_voting_records.zip → contains skillia_voting_records.csv (~217 MB)
- Virtual environment (isolated): 001.venv (created with python3 -m venv)
- Python packages installed (within the venv): python-magic, exifread, Pillow, pandas, openpyxl
Commands used to set up the environment (already run):
cd /home/noigel/Desktop/RSTCONctf/Forensics/Skilla
python3 -m venv 001.venv
source 001.venv/bin/activate
python -m pip install --upgrade pip
pip install python-magic exifread Pillow pandas openpyxl
unzip skillia_voting_records.zip -d 002.records

All further scripts in this writeup were executed with the venv Python to ensure reproducibility, e.g.:

./001.venv/bin/python 003.initial_triage.py

>Investigative Approach (high level)
- Initial triage (file inspection, sample rows, header analysis).
- Produce per-station summaries (ballot counts, absentee fractions, most-common ballot patterns).
- Detect anomalies using multiple heuristics:
- Repeated identical ballot pattern fraction (indicative of automated/duplicated ballots).
- Per-station deviation from global vote distribution (presidential A% deviation).
- Absentee vs in-person distribution divergence (L1 distance) per station.
- Absentee fraction outliers within counties (z-score by county).
- Duplicate aggregate profiles across stations.
- Triangulate results to form a final candidate list of 6 manipulated stations.
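The county z-score heuristic can be illustrated with toy numbers (not from the real data); the computation uses the same population-variance formula as 007.absentee_outliers.py below:

```python
import math

def county_zscores(abs_fracs):
    """z-score of each station's absentee fraction within its county
    (population variance, as in 007.absentee_outliers.py)."""
    mean = sum(abs_fracs.values()) / len(abs_fracs)
    var = sum((f - mean) ** 2 for f in abs_fracs.values()) / len(abs_fracs)
    sd = math.sqrt(var)
    return {sid: (f - mean) / sd if sd > 0 else 0.0
            for sid, f in abs_fracs.items()}

# Three stations near 11% absentee and one at 55% -- the outlier
# dominates the z-score ranking.
z = county_zscores({'s1': 0.10, 's2': 0.12, 's3': 0.11, 's4': 0.55})
```

Ranking stations by |z| within each county then surfaces the stations whose absentee share is implausible for their area.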
Rationale: manipulations often leave statistical fingerprints — large internal mismatches between absentee and in-person vote distributions at the same station, unusual repeated identical ballot patterns, or substantial deviation from the global distribution.
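The repeated-pattern heuristic can be sketched on toy two-race ballots (the real data has five races per ballot):

```python
from collections import Counter

def top_pattern_fraction(ballots):
    """Fraction of a station's ballots that share the single most
    common vote pattern across all races."""
    counts = Counter(ballots)
    return counts.most_common(1)[0][1] / len(ballots)

# A "healthy" station shows many distinct patterns...
organic = [('A', 'B'), ('B', 'A'), ('A', 'A'), ('B', 'B'), ('A', 'B')]
# ...while a station with injected duplicate ballots does not.
stuffed = [('A', 'A')] * 8 + [('B', 'B'), ('A', 'B')]

assert top_pattern_fraction(organic) == 0.4
assert top_pattern_fraction(stuffed) == 0.8
```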
>Files produced during analysis
- 002.initial_triage_summary.csv — per-station summary: total ballots, absentee fraction, unique patterns, most-common pattern count, president counts.
- 003.patterns_ranked.csv — stations sorted by the fraction of ballots matching the station's most common ballot pattern.
- 004.pres_deviation_top30.csv — stations whose presidential A% deviates most from the global A%.
- 005.duplicate_profiles.txt — groups of stations sharing identical aggregate count profiles.
- 006.absentee_outliers_top20.csv — absentee-fraction z-score outliers by county.
- 007.station_id_anomalies.txt — station IDs that do not match the expected 8-hex-character format.
- 008.abs_vs_inperson_pres_top30.csv — stations with the largest L1 distance between absentee and in-person presidential vote distributions.
These were generated by scripts included below.
>Key results and reasoning
I used multiple signals to identify manipulated stations. The strongest single signal for the final selection was the divergence between absentee and in-person presidential vote distributions (L1 distance). This directly compares the mix of votes (A/B/undervote) between in-person and absentee ballots at the same station; a large divergence often indicates differential tampering or ballots injected into one of the two streams.
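A minimal sketch of the L1 computation, using made-up vote counts rather than real station data:

```python
def l1_divergence(in_person, absentee):
    """L1 distance between the normalized vote distributions of the
    in-person and absentee streams at one station."""
    ip_total = sum(in_person.values())
    ab_total = sum(absentee.values())
    candidates = set(in_person) | set(absentee)
    return sum(abs(in_person.get(c, 0) / ip_total - absentee.get(c, 0) / ab_total)
               for c in candidates)

# Similar mixes -> small distance; a skewed absentee stream -> large distance.
benign = l1_divergence({'A': 480, 'B': 500, 'undervote': 20},
                       {'A': 48, 'B': 50, 'undervote': 2})
tampered = l1_divergence({'A': 480, 'B': 500, 'undervote': 20},
                         {'A': 95, 'B': 5, 'undervote': 0})
assert benign < 0.01
assert tampered > 0.8
```

The L1 distance ranges from 0 (identical distributions) to 2 (disjoint support), so even a threshold well below 1 flags stations worth a closer look.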
The top 6 stations by L1 divergence (and corroborating signals from other heuristics) were:
MetaCTF{81b7bc0c,1251749b,4b45c819,1fe457e4,a4defb9c,329071ee}
Why these six?
- All six appear in the top results for L1 divergence between absentee and in-person presidential vote distributions (008.abs_vs_inperson_pres_top30.csv).
- Several of them also show up as outliers in other tests: a4defb9c and 329071ee rank high in the pattern-based rankings (003.patterns_ranked.csv) and in deviation from the global presidential percentages; 81b7bc0c, 1251749b, 4b45c819, and 1fe457e4 occupy top L1-divergence positions and show additional statistical anomalies (counts or deviations).
Taken together, these signals strongly indicate tampering targeted at these stations.
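The triangulation step amounts to a multi-signal tally. A sketch (the per-heuristic sets here are illustrative, not the actual rankings; 'deadbeef' is a hypothetical station flagged by only one signal):

```python
from collections import Counter

# Hypothetical candidate sets produced by each heuristic.
signals = {
    'l1_divergence':  {'81b7bc0c', '1251749b', '4b45c819',
                       '1fe457e4', 'a4defb9c', '329071ee'},
    'pattern_frac':   {'a4defb9c', '329071ee', 'deadbeef'},
    'pres_deviation': {'a4defb9c', '329071ee', '81b7bc0c'},
}

# Count how many independent signals flag each station, then rank.
hits = Counter(sid for flagged in signals.values() for sid in flagged)
ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
```

Stations flagged by multiple independent heuristics are far less likely to be statistical noise than stations flagged by only one.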
>Commands & Execution Log (representative)
All commands were executed in /home/noigel/Desktop/RSTCONctf/Forensics/Skilla using the isolated venv as shown earlier. Representative commands executed to produce artifacts:
# 1) Generate per-station triage summary
./001.venv/bin/python 003.initial_triage.py
# 2) Rank repeated ballot patterns per station
./001.venv/bin/python 004.analyze_patterns.py
# 3) Compute stations deviating from global presidential distribution
./001.venv/bin/python 005.find_outliers.py
# 4) Find county-level absentee outliers
./001.venv/bin/python 007.absentee_outliers.py
# 5) Compare absentee vs in-person presidential distributions (L1)
./001.venv/bin/python 009.abs_vs_inperson_pres.py

You can open the CSV/text output files to inspect the numbers directly.
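For reference, the scripts below stream the CSV row by row with csv.DictReader; the same per-station aggregation could also be done with pandas (installed in the venv) using chunked reads to keep memory bounded on the ~217 MB file. A sketch on a tiny in-memory CSV; substitute the real file path and a larger chunksize for the actual data:

```python
import io
import pandas as pd

# Tiny stand-in for 002.records/skillia_voting_records.csv.
csv_text = io.StringIO(
    "voting_station_id,vote_president\n"
    "s1,A\ns1,B\ns1,A\ns2,B\ns2,B\n"
)

# Accumulate per-station presidential vote counts chunk by chunk.
totals = None
for chunk in pd.read_csv(csv_text, chunksize=2):
    part = chunk.groupby('voting_station_id')['vote_president'].value_counts()
    totals = part if totals is None else totals.add(part, fill_value=0)

# Pivot to a stations x candidates table of counts.
per_station = totals.unstack(fill_value=0).astype(int)
```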
>Full analysis scripts (verbatim)
Below are the exact Python scripts used during the investigation. They are included so the analysis is fully reproducible. Each script was executed with ./001.venv/bin/python <script>.
003.initial_triage.py
#!/usr/bin/env python3
import csv
from collections import defaultdict, Counter
from pathlib import Path

csv_path = Path('002.records/skillia_voting_records.csv')

# Aggregates per station
station_counts = Counter()
station_absentee = Counter()
station_patterns = defaultdict(Counter)  # per station: pattern -> count
station_pres = defaultdict(Counter)
station_sen = defaultdict(Counter)
station_gov = defaultdict(Counter)

with csv_path.open('r', newline='') as fh:
    reader = csv.DictReader(fh)
    for row in reader:
        sid = row['voting_station_id']
        station_counts[sid] += 1
        if row.get('voter_type', '').strip().lower() == 'absentee':
            station_absentee[sid] += 1
        # pattern: tuple of votes across all five races
        pattern = (row.get('vote_president', ''), row.get('vote_senate', ''),
                   row.get('vote_governor', ''), row.get('vote_marijuana', ''),
                   row.get('vote_tax_cap', ''))
        station_patterns[sid][pattern] += 1
        station_pres[sid][row.get('vote_president', '')] += 1
        station_sen[sid][row.get('vote_senate', '')] += 1
        station_gov[sid][row.get('vote_governor', '')] += 1

# Prepare summary
out_lines = []
out_lines.append('station_id,total_ballots,absentee_fraction,unique_patterns,top_pattern_count,top_pres_A,top_pres_B,top_pres_undervote')
for sid in station_counts:
    total = station_counts[sid]
    absentee_frac = station_absentee[sid] / total if total else 0
    unique_patterns = len(station_patterns[sid])
    top_pattern_count = station_patterns[sid].most_common(1)[0][1]
    presA = station_pres[sid].get('A', 0)
    presB = station_pres[sid].get('B', 0)
    presUnd = station_pres[sid].get('undervote', 0)
    out_lines.append(f"{sid},{total},{absentee_frac:.4f},{unique_patterns},{top_pattern_count},{presA},{presB},{presUnd}")

# Write summary to file
with open('002.initial_triage_summary.csv', 'w') as outfh:
    outfh.write('\n'.join(out_lines))
print('Wrote 002.initial_triage_summary.csv with per-station summaries')
004.analyze_patterns.py
#!/usr/bin/env python3
import csv
from collections import defaultdict, Counter
from pathlib import Path

csv_path = Path('002.records/skillia_voting_records.csv')

# We'll compute per-station stats focusing on repeated patterns
station_counts = Counter()
station_patterns = defaultdict(Counter)

with csv_path.open('r', newline='') as fh:
    reader = csv.DictReader(fh)
    for row in reader:
        sid = row['voting_station_id']
        station_counts[sid] += 1
        pattern = (row.get('vote_president', ''), row.get('vote_senate', ''),
                   row.get('vote_governor', ''), row.get('vote_marijuana', ''),
                   row.get('vote_tax_cap', ''))
        station_patterns[sid][pattern] += 1

rows = []
for sid, total in station_counts.items():
    unique_patterns = len(station_patterns[sid])
    top_count = station_patterns[sid].most_common(1)[0][1]
    top_frac = top_count / total
    unique_frac = unique_patterns / total
    rows.append((sid, total, unique_patterns, top_count, top_frac, unique_frac))

# Sort by top_frac descending
rows_sorted = sorted(rows, key=lambda x: x[4], reverse=True)
out_lines = ['station_id,total,unique_patterns,top_count,top_frac,unique_frac']
for sid, total, unique, topc, topf, uniqf in rows_sorted:
    out_lines.append(f"{sid},{total},{unique},{topc},{topf:.6f},{uniqf:.6f}")

with open('003.patterns_ranked.csv', 'w') as outfh:
    outfh.write('\n'.join(out_lines))
print('Wrote 003.patterns_ranked.csv')
005.find_outliers.py
#!/usr/bin/env python3
import csv
from collections import defaultdict, Counter
from pathlib import Path
import math

csv_path = Path('002.records/skillia_voting_records.csv')

station_counts = Counter()
station_pres = defaultdict(Counter)
total_ballots = 0
aggregate_pres = Counter()

with csv_path.open('r', newline='') as fh:
    reader = csv.DictReader(fh)
    for row in reader:
        sid = row['voting_station_id']
        station_counts[sid] += 1
        v = row.get('vote_president', '')
        station_pres[sid][v] += 1
        aggregate_pres[v] += 1
        total_ballots += 1

# Compute global fractions for A and B
global_A = aggregate_pres.get('A', 0) / total_ballots
global_B = aggregate_pres.get('B', 0) / total_ballots

rows = []
for sid, total in station_counts.items():
    a = station_pres[sid].get('A', 0)
    b = station_pres[sid].get('B', 0)
    und = station_pres[sid].get('undervote', 0)
    a_pct = a / total if total else 0
    b_pct = b / total if total else 0
    rows.append((sid, total, a_pct, b_pct, abs(a_pct - global_A)))

# Sort by deviation from global A fraction
rows_sorted = sorted(rows, key=lambda x: x[4], reverse=True)
out_lines = ['station_id,total,a_pct,b_pct,dev_from_global_A']
for sid, total, a_pct, b_pct, dev in rows_sorted[:30]:
    out_lines.append(f"{sid},{total},{a_pct:.6f},{b_pct:.6f},{dev:.6f}")

with open('004.pres_deviation_top30.csv', 'w') as outfh:
    outfh.write('\n'.join(out_lines))
print('Wrote 004.pres_deviation_top30.csv')
006.duplicate_profiles.py
#!/usr/bin/env python3
import csv
from collections import defaultdict, Counter
from pathlib import Path

csv_path = Path('002.records/skillia_voting_records.csv')

station_counts = Counter()
# For each station, counters per race
station_pres = defaultdict(Counter)
station_sen = defaultdict(Counter)
station_gov = defaultdict(Counter)
station_mar = defaultdict(Counter)
station_tax = defaultdict(Counter)

with csv_path.open('r', newline='') as fh:
    reader = csv.DictReader(fh)
    for row in reader:
        sid = row['voting_station_id']
        station_counts[sid] += 1
        station_pres[sid][row.get('vote_president', '')] += 1
        station_sen[sid][row.get('vote_senate', '')] += 1
        station_gov[sid][row.get('vote_governor', '')] += 1
        station_mar[sid][row.get('vote_marijuana', '')] += 1
        station_tax[sid][row.get('vote_tax_cap', '')] += 1

# Build a profile vector for each station: counts for A/B/undervote per race
profiles = defaultdict(list)
for sid in station_counts:
    total = station_counts[sid]
    pA = station_pres[sid].get('A', 0)
    pB = station_pres[sid].get('B', 0)
    pU = station_pres[sid].get('undervote', 0)
    sA = station_sen[sid].get('A', 0)
    sB = station_sen[sid].get('B', 0)
    sU = station_sen[sid].get('undervote', 0)
    gA = station_gov[sid].get('A', 0)
    gB = station_gov[sid].get('B', 0)
    gU = station_gov[sid].get('undervote', 0)
    mYes = station_mar[sid].get('Yes', 0)
    mNo = station_mar[sid].get('No', 0)
    tYes = station_tax[sid].get('Yes', 0)
    tNo = station_tax[sid].get('No', 0)
    vec = (total, pA, pB, pU, sA, sB, sU, gA, gB, gU, mYes, mNo, tYes, tNo)
    profiles[vec].append(sid)

# Find duplicate profiles
dupes = [(vec, sids) for vec, sids in profiles.items() if len(sids) > 1]
with open('005.duplicate_profiles.txt', 'w') as outfh:
    for vec, sids in sorted(dupes, key=lambda x: len(x[1]), reverse=True):
        outfh.write(f"Group of {len(sids)} stations: {', '.join(sids)}\n")
        outfh.write(str(vec) + "\n\n")
print('Wrote 005.duplicate_profiles.txt')
007.absentee_outliers.py
#!/usr/bin/env python3
import csv
from collections import defaultdict, Counter
from pathlib import Path
import math

csv_path = Path('002.records/skillia_voting_records.csv')

# We need county info
station_total = Counter()
station_abs = Counter()
station_county = {}

with csv_path.open('r', newline='') as fh:
    reader = csv.DictReader(fh)
    for row in reader:
        sid = row['voting_station_id']
        station_total[sid] += 1
        if row.get('voter_type', '').strip().lower() == 'absentee':
            station_abs[sid] += 1
        if sid not in station_county:
            station_county[sid] = row.get('county', '')

# Group by county
county_stations = defaultdict(list)
for sid, cnt in station_total.items():
    county_stations[station_county.get(sid, '')].append(sid)

out = []
for county, sids in county_stations.items():
    fracs = []
    for sid in sids:
        fr = station_abs[sid] / station_total[sid]
        fracs.append(fr)
    mean = sum(fracs) / len(fracs)
    var = sum((x - mean) ** 2 for x in fracs) / len(fracs)
    sd = math.sqrt(var)
    for sid in sids:
        fr = station_abs[sid] / station_total[sid]
        z = (fr - mean) / sd if sd > 0 else 0
        out.append((abs(z), sid, county, fr, mean, sd))

out_sorted = sorted(out, reverse=True)[:20]
with open('006.absentee_outliers_top20.csv', 'w') as fh:
    fh.write('abs_z,station_id,county,abs_frac,county_mean,county_sd\n')
    for z, sid, county, fr, mean, sd in out_sorted:
        fh.write(f"{z:.3f},{sid},{county},{fr:.4f},{mean:.4f},{sd:.4f}\n")
print('Wrote 006.absentee_outliers_top20.csv')
008.station_id_format.py
#!/usr/bin/env python3
import csv
from collections import Counter
from pathlib import Path

csv_path = Path('002.records/skillia_voting_records.csv')

ids = Counter()
with csv_path.open('r', newline='') as fh:
    reader = csv.DictReader(fh)
    for row in reader:
        sid = row['voting_station_id']
        ids[sid] += 1

# Flag any station id that is not exactly 8 lowercase hex characters
bad = []
for sid in ids:
    if len(sid) != 8 or any(c not in '0123456789abcdef' for c in sid.lower()):
        bad.append((sid, ids[sid]))

with open('007.station_id_anomalies.txt', 'w') as fh:
    if not bad:
        fh.write('No anomalous station ids found\n')
    else:
        for s, c in bad:
            fh.write(f"{s},{c}\n")
print('Wrote 007.station_id_anomalies.txt')
009.abs_vs_inperson_pres.py
#!/usr/bin/env python3
import csv
from collections import defaultdict, Counter
from pathlib import Path

csv_path = Path('002.records/skillia_voting_records.csv')

# Counts per station per voter_type
counts = defaultdict(lambda: defaultdict(Counter))
station_total = Counter()

with csv_path.open('r', newline='') as fh:
    reader = csv.DictReader(fh)
    for row in reader:
        sid = row['voting_station_id']
        vt = row.get('voter_type', '').strip().lower()
        vote = row.get('vote_president', '')
        counts[sid][vt][vote] += 1
        station_total[sid] += 1

out = []
for sid, types in counts.items():
    in_person = types.get('in_person', {})
    absentee = types.get('absentee', {})
    ip_total = sum(in_person.values())
    ab_total = sum(absentee.values())
    if ip_total < 50 or ab_total < 50:
        continue  # skip stations with too few ballots in either stream
    # Compute normalized distributions
    candidates = set(list(in_person.keys()) + list(absentee.keys()))
    ip_dist = {c: in_person.get(c, 0) / ip_total for c in candidates}
    ab_dist = {c: absentee.get(c, 0) / ab_total for c in candidates}
    # L1 distance
    l1 = sum(abs(ip_dist[c] - ab_dist[c]) for c in candidates)
    out.append((l1, sid, ip_total, ab_total))

out_sorted = sorted(out, reverse=True)[:30]
with open('008.abs_vs_inperson_pres_top30.csv', 'w') as fh:
    fh.write('l1,station_id,in_person_count,absentee_count\n')
    for l1, sid, ip, ab in out_sorted:
        fh.write(f"{l1:.6f},{sid},{ip},{ab}\n")
print('Wrote 008.abs_vs_inperson_pres_top30.csv')
>Final Flag
MetaCTF{81b7bc0c,1251749b,4b45c819,1fe457e4,a4defb9c,329071ee}