Project Retrospective · MA/RI Pilot

FPS Route Optimizer
how we got here

A multi-day journey from a confident-but-cheating "win" to a defensible, publicly grounded baseline, plus a reusable autoresearch loop. Built for Vig.

Window
May 4 → 7, 2026
Phases
9
Experiments
21 run · 5 kept
Saved vs honest baseline
$80,652/yr
Where we landed

Four numbers tell the whole story

The first looked like a win. It wasn't. The second was the honest cost of doing nothing clever. The third is what the autoresearch loop bought us, measured under an eval that survives outside scrutiny. The fourth is the gap that still remains to FPS's claim.

+$469k/yr
v10 catastrophe
(over FPS's claim)
+$155k/yr
Honest baseline
(with real svc + traffic)
−$80,652/yr
Autoresearch loop savings
(vs honest baseline)
+$74k/yr
Remaining gap to FPS's claim
(structural ceiling identified)
The arc

Nine phases, one rebuild, one reckoning

The "wacky" parts of the journey were necessary. The catastrophe in Phase 3 forced us to take cost seriously. The reckoning in Phase 5 forced us to take honesty seriously. Each step needed to happen for the final state to be defensible.

1
Build
First end-to-end stack

Two-layer router stood up on the MA/RI pilot. A strategic LLM-driven planner picks the broad shape of the day. An operational solver fits the actual route around concrete time windows, capacity, cold-chain rules, and drive matrices.

We confirmed the mechanics were correct (every visit assigned, A/B-week counts match, drive+service+wait sums to total) before making any cost claim.
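The mechanical checks above can be sketched as invariant assertions over the solver output. This is a minimal sketch under assumptions: the plan dictionary and its field names (`routes`, `stops`, `drive_min`, etc.) are hypothetical stand-ins for the real data structures, not the actual schema.

```python
# Sketch of the pre-cost sanity checks: every visit assigned exactly once,
# A/B-week counts match, and drive + service + wait sums to the route total.
# The plan structure here is a simplified stand-in for the real solver output.

def validate_plan(plan, required_visits):
    """Raise AssertionError if any structural invariant is violated."""
    # 1. Every required visit is assigned exactly once.
    assigned = [stop["visit_id"]
                for route in plan["routes"] for stop in route["stops"]]
    assert sorted(assigned) == sorted(required_visits), "visit coverage mismatch"

    # 2. A-week and B-week visit counts match the schedule.
    assert plan["week_a_count"] == plan["week_b_count"], "A/B-week imbalance"

    # 3. Per-route time decomposition: drive + service + wait == total.
    for route in plan["routes"]:
        parts = sum(s["drive_min"] + s["service_min"] + s["wait_min"]
                    for s in route["stops"])
        assert abs(parts - route["total_min"]) < 1e-6, "time components don't sum"
    return True
```

Running checks like these before quoting any dollar figure is what let the later cost claims rest on mechanics that were already known to be sound.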

2
Vignesh review tightened the problem

Schema columns didn't match what FPS expected. Driver IDs weren't stable across days. Service totals didn't match what FPS reports internally. Cost objective wasn't the one FPS cared about.

Each constraint Vig added was something the data file alone couldn't have specified. The data file is a lossy snapshot of the real business — that's the rule, not the exception.

3
Catastrophe
v10 produced a fleet costing $469k/yr more than FPS's reported cost

With the cost objective explicit ($25/hr base + 1.5× OT after 40h + per-vehicle weekly fixed cost), the optimization picked fleet size and routing to minimize that. The result was wildly worse than today's FPS.

The cost calculation was right. The bug was upstream — too much total work being scheduled. Resist the urge to tune the eval. Tune the system.

v10 cost/wk $23,208
FPS claim/wk $14,179
Annualized gap +$469,481/yr
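The explicit cost objective is simple enough to state in a few lines. A minimal sketch, assuming the figures given above ($25/hr base, 1.5× OT past 40 h); the per-vehicle fixed cost is an illustrative placeholder, not FPS's real number.

```python
# Sketch of the Phase-3 weekly cost objective: $25/hr base pay,
# time-and-a-half past 40 h/driver, plus a fixed weekly cost per vehicle.

BASE_RATE = 25.0       # $/hr
OT_MULTIPLIER = 1.5    # time-and-a-half past the threshold
OT_THRESHOLD = 40.0    # hours/week before overtime kicks in
VEHICLE_FIXED = 250.0  # $/vehicle/week -- placeholder, not FPS's actual figure

def weekly_cost(driver_hours, n_vehicles):
    """Total weekly fleet cost for a list of per-driver weekly hours."""
    labor = 0.0
    for h in driver_hours:
        regular = min(h, OT_THRESHOLD)
        overtime = max(h - OT_THRESHOLD, 0.0)
        labor += regular * BASE_RATE + overtime * BASE_RATE * OT_MULTIPLIER
    return labor + n_vehicles * VEHICLE_FIXED
```

With an objective this concrete, the optimizer had nowhere to hide: the $469k gap was real, and the fix had to come from scheduling less total work, not from softening the eval.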
4
Rebuild
v11 multi-trip + pinned-K + 14h cap

Three structural changes: drivers do AM + PM trips (multi-trip), territories pinned to a fixed count derived from cost-optimal sizing, and the daily shift cap raised to 14h. We also calibrated simulated service times so they matched what FPS reports.

Result looked great: ~3% over FPS's claim, two fewer drivers. We wrote it up. It was not actually a win — see Phase 5.

v11 cost/wk $14,570
Drivers 7 (was 9.2)
Δ vs FPS claim +2.8% (apparent)
5
Reckoning
Vig caught the cheating

The Phase-4 "win" was three fudges stacked, each individually plausible at introduction time, never reviewed together:

1. Service times scaled down ~3× to match FPS's aggregate report. Real drivers take what each visit takes.
2. Traffic + weather multipliers disabled in the solve. Routes built that way don't survive the real road network.
3. Overhead multiplier 1.74× curve-fit so our pure-labor cost would back into FPS's claimed weekly cost. Painting the bullseye around the arrow.

Vig, on the call
"Why are we dropping weather and traffic? Doesn't seem correct. The goal is to actually optimize for the customer."

The lesson: the agent will not catch its own eval drift. Outside skepticism is essential.

6
Honest
Cost model rebuilt from public data

Removed all three fudges. Replaced each with a citeable industry source:

• Cash wage from MA/RI delivery-driver market surveys
• Benefits load 28% from BLS Employer Costs for Employee Compensation
• Fuel rate $0.55/mi from AAA's 2026 vehicle operating cost report
• Per-vehicle fixed cost from FoodPrep's own meeting note

Honest baseline: $17,153/wk — $155k/yr WORSE than FPS's claim. The Phase-4 "win" had been an illusion. From here on, every number we produce is one we can defend.

Honest baseline $17,153/wk
Δ vs FPS claim +$155k/yr
Defensibility All sources public
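The rebuilt model combines those public components directly. A minimal sketch: the 28% benefits load and $0.55/mi fuel rate are from the sources above, while the cash wage and per-vehicle fixed cost shown here are placeholders (the real values come from the MA/RI surveys and FPS's note). Overtime handling is omitted for brevity.

```python
# Sketch of the Phase-6 fully-loaded cost model: every component traceable
# to a citeable public source instead of a curve-fit multiplier.

CASH_WAGE = 22.0      # $/hr -- placeholder for the MA/RI survey figure
BENEFITS_LOAD = 0.28  # BLS Employer Costs for Employee Compensation
FUEL_RATE = 0.55      # $/mi, AAA vehicle operating cost report
VEHICLE_FIXED = 250.0 # $/vehicle/week -- placeholder for FPS's own number

def fully_loaded_weekly_cost(labor_hours, miles, n_vehicles):
    """Weekly cost with each term grounded in a public source."""
    loaded_wage = CASH_WAGE * (1 + BENEFITS_LOAD)  # cash wage + benefits load
    return (labor_hours * loaded_wage
            + miles * FUEL_RATE
            + n_vehicles * VEHICLE_FIXED)
```

The point of the structure is auditability: change any constant and you must change a citation, not a hidden fudge factor.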
7
Machine
Adopted Karpathy's autoresearch pattern

Three pieces: a fixed evaluator (read-only, treated as a contract), an editable system-under-test (mutated each iteration), and an append-only journal of every attempt.

The agent runs the loop unattended: try a change → run the eval → keep it if the score improved → revert if not → log either way → repeat. The pattern was designed for nanochat training; same pattern, different domain.
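The loop itself is a greedy hill-climb and fits in a screenful. A minimal sketch, assuming hypothetical `evaluate` and `propose` callables standing in for the real experiment machinery; the journal is a JSON-lines file, which is an implementation choice, not something the write-up specifies.

```python
# Sketch of the autoresearch loop: fixed evaluator, mutable system-under-test,
# append-only journal. Keep a change only if the (untouched) eval improves.
import json

def autoresearch(params, evaluate, propose, n_iters, journal_path):
    """Greedy loop: try -> eval -> keep if better -> revert if not -> log."""
    best_score = evaluate(params)
    with open(journal_path, "a") as journal:      # append-only record
        for i in range(n_iters):
            candidate = propose(params)           # try a change to the SUT
            score = evaluate(candidate)           # run the fixed eval
            kept = score < best_score             # lower weekly cost is better
            journal.write(json.dumps(
                {"iter": i, "candidate": candidate,
                 "score": score, "kept": kept}) + "\n")
            if kept:                              # keep the improvement...
                params, best_score = candidate, score
            # ...otherwise `params` is untouched (implicit revert),
            # but the discarded attempt is still logged as a data point.
    return params, best_score
```

Note what the loop never does: it never edits `evaluate`. That read-only boundary is the whole defense against the Phase-5 failure mode.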

8
Loop ran
21 experiments. 5 kept. $80k/yr saved.

Every kept commit is a measured improvement under the honest eval. Every discarded one is a logged data point. The biggest single lever was tour polishing with LKH-3 after OR-Tools converged — it tightened drive time by 12%.

Cumulative savings: $80,652/yr. Remaining gap to FPS's claim shrank from $155k/yr to $74k/yr.

Best result $15,602/wk
Drivers 9
Total hrs/wk 311 (was 333)
Saved $80,652/yr
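The biggest kept lever, LKH-3 tour polishing, is post-hoc improvement of an already-solved route. LKH-3 itself is an external Lin-Kernighan solver; the sketch below uses a plain 2-opt pass as a much weaker stand-in that illustrates the same idea: take the finished tour and keep applying improving segment reversals until none remain.

```python
# Stand-in for the LKH-3 polish step: 2-opt local search on a closed tour.
# `tour` is a list of node indices starting and ending at the depot;
# `dist` is a symmetric distance matrix.

def two_opt(tour, dist):
    """Repeatedly reverse tour segments while doing so shortens the tour."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour) - 1):
                # Cost delta of reversing the segment tour[i:j+1].
                before = dist[tour[i-1]][tour[i]] + dist[tour[j]][tour[j+1]]
                after = dist[tour[i-1]][tour[j]] + dist[tour[i]][tour[j+1]]
                if after < before - 1e-9:
                    tour[i:j+1] = tour[i:j+1][::-1]
                    improved = True
    return tour
```

The design point carries over regardless of solver: OR-Tools settles the assignment and sequencing globally, then a dedicated tour improver squeezes out the remaining drive time per route.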
9
Ceiling
Structural ceiling identified, cuOpt SUT shipped

After the parameter-tuning experiments plateaued, three next-tier moves remain:

(1) Different solver. NVIDIA cuOpt is open-source, GPU-accelerated, and removes the territorial pre-clustering constraint we've been working around. Expected to find better optima and iterate 50-100× faster.
(2) Validate cost-model assumptions with FPS. If the 28% benefits load or $0.55/mi fuel is high vs their actual, the gap closes immediately.
(3) Day-rebalancing. Re-spread visits across days within a week to flatten daily load. Bigger engineering investment.

cuOpt sibling implementation shipped — ready to run as soon as a GPU is available.

The autoresearch journal

Every attempt, kept or not, is in the record

Most experiments don't pan out. That's not failure — it's data. The journal of discards tells a future maintainer which dead-ends to skip. Here are the 21 attempts that drove the cost from $17,153/wk to $15,602/wk.

| Hypothesis | $/wk | Drivers | Δ vs prev best | Status |
|---|---|---|---|---|
| Honest baseline (real svc + traffic + fully-loaded cost) | $17,153 | 10 | — | baseline |
| Size drivers on svc×2.2 workload | $17,540 | 11 | +$387/wk | discard |
| Allow OT before adding swing driver (max_h 45→50) | $17,115 | 9 | −$38/wk | KEEP |
| Push OT cap further (max_h 50→55) | $17,115 | 9 | tied | discard |
| Compactness penalty 5→25 | $17,115 | 9 | tied | discard |
| Seed clusters from FPS Territory column | $12,432* | 7 | escalated | discard |
| Bump per-cluster solve time 8→30s | timeout | — | — | crash |
| Multi-trip slots=2 (driver_id mapping bug) | $17,137* | 13 | bug | discard |
| Smaller time bump 8→15s | $17,113 | 9 | −$2/wk | discard |
| More first-solution variety (multi_start 2→6) | $16,644 | 9 | −$471/wk | KEEP |
| 8 strategies × 6s each | timeout | — | — | crash |
| Workload-weighted KMeans (sample_weight) | $16,951 | 9 | +$307/wk | discard |
| Workload-balanced via customer replication | timeout | — | — | crash |
| Cycle all 8 OR-Tools heuristics × 5s | $16,565 | 9 | −$79/wk | KEEP |
| Variety vs depth: 4 strategies × 10s | $16,908 | 9 | +$343/wk | discard |
| Compact penalty + multi_start=8 | $16,566 | 9 | tied | discard |
| LKH-3 tour polish ON | $15,667 | 9 | −$898/wk | KEEP |
| LKH polish runs 2→10 | $15,602 | 9 | −$64/wk | KEEP |
| KMeans n_init 10→50 | $15,713 | 9 | +$111/wk | discard |
| time_limit 5→6 with LKH stack | timeout | — | — | crash |

* Some discards still produced reasonable numbers — they were rejected because they regressed against the current best, not because they failed validity.

What gets automated next

Three places a human still intervenes

Today, Vig (the human) sits at three decision points the agent doesn't reach on its own. Each is automatable. Build those three layers, and the human's role compresses to framing the problem at the outset and accepting the final answer.

01 / Eval auditor
Catch cheating before it stacks
Compare simulated route conditions to real-world ground truth (GPS traces, actual service times, weather records). Flag any modeling shortcut whose effect exceeds a threshold. The Phase-5 reckoning would never have been needed.
02 / Solver recommendation
Know the SOTA tool per problem class
When parameter tuning plateaus, propose structural alternatives (cuOpt, Gurobi, SAT solvers, etc.) the agent might not reach for on its own. The cuOpt suggestion that came in Phase 9 should have surfaced in Phase 7.
03 / Cost-model validator
Pull authoritative parameter values
Auto-fetch wages, benefits load, fuel rates, tax rates from public data sources. Flag any custom-fit constant for human confirmation before the eval is locked. The 28% benefits load and $0.55/mi fuel become live-validated, not hard-coded guesses.
Net

The pattern is the durable artifact

The cost number on FPS's MA/RI pilot will be obsolete within a year as the business evolves. What stays is the methodology: a fixed evaluator, an editable system-under-test, a journaled loop, public-source cost components, and the discipline to never tune the eval to make wins look bigger.

Point this same pattern at warehouse layout, ad bidding, factory scheduling, fleet purchasing — the loop and discipline carry over. The "wacky" parts of this project were the cost of teaching the system (and its operators) what honest optimization looks like. Worth it.