A multi-day journey from a confident-but-cheating "win" to a defensible, publicly-grounded baseline plus a reusable autoresearch loop. Built for Vig.
The first number looked like a win. It wasn't. The second was the honest cost of doing nothing clever. The third is what the autoresearch loop bought us, measured under an eval that survives outside scrutiny.
The "wacky" parts of the journey were necessary. The catastrophe in Phase 3 forced us to take cost seriously. The reckoning in Phase 5 forced us to take honesty seriously. Each step needed to happen for the final state to be defensible.
A two-layer router stood up on the MA/RI pilot. A strategic LLM-driven planner picks the broad shape of the day; an operational solver fits the actual route around concrete time windows, capacity, cold-chain rules, and drive-time matrices.
We confirmed the mechanics were correct (every visit assigned, A/B-week counts match, drive+service+wait sums to total) before making any cost claim.
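Those mechanics checks can be sketched as a few assertions over the solved schedule. Field names here are hypothetical, not the actual pipeline schema:

```python
# Sanity checks run on a solved schedule before any cost claim is made.
# Field names are illustrative, not the real pipeline's schema.

def validate_schedule(visits, assignments, routes):
    """visits: set of visit ids; assignments: visit_id -> driver_id;
    routes: dicts with drive/service/wait/total minutes per driver."""
    # 1. Every visit assigned exactly once (keys of the mapping).
    assert set(assignments) == set(visits), "unassigned or phantom visits"
    # 2. Drive + service + wait must sum to the reported total.
    #    (An A/B-week count comparison would run the same way per week.)
    for r in routes:
        parts = r["drive"] + r["service"] + r["wait"]
        assert abs(parts - r["total"]) < 1e-6, f"time mismatch on {r['driver']}"
    return True
```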
Schema columns didn't match what FPS expected. Driver IDs weren't stable across days. Service totals didn't match what FPS reports internally. Cost objective wasn't the one FPS cared about.
Each constraint Vig added was something the data file alone couldn't have specified. The data file is a lossy snapshot of the real business — that's the rule, not the exception.
With the cost objective explicit ($25/hr base + 1.5× OT after 40h + per-vehicle weekly fixed cost), the optimization picked fleet size and routing to minimize that. The result was wildly worse than FPS's current operation.
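That objective is simple enough to state as code. A minimal sketch, assuming weekly aggregation per driver; the per-vehicle fixed cost here is a placeholder default, since the real figure came from FoodPrep's own numbers:

```python
def weekly_cost(driver_hours, n_vehicles, base=25.0, ot_mult=1.5,
                ot_threshold=40.0, vehicle_fixed=300.0):
    """driver_hours: weekly hours per driver.
    base/ot_mult/ot_threshold are the rates from the explicit objective;
    vehicle_fixed=300.0 is an illustrative placeholder, not FPS's figure."""
    labor = 0.0
    for h in driver_hours:
        reg = min(h, ot_threshold)          # straight-time hours
        ot = max(h - ot_threshold, 0.0)     # overtime hours
        labor += reg * base + ot * base * ot_mult
    return labor + n_vehicles * vehicle_fixed
```

The optimizer minimizes this total, so it trades extra drivers (more fixed cost) against overtime (1.5× labor) automatically.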
The cost calculation was right. The bug was upstream — too much total work being scheduled. Resist the urge to tune the eval. Tune the system.
Three structural changes: drivers do AM + PM trips (multi-trip), territories pinned to a fixed count derived from cost-optimal sizing, the daily shift cap raised. We also calibrated simulated service times so they matched what FPS reports.
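The three changes were solver-side configuration, not eval-side tuning. Roughly this shape, with names and values that are illustrative rather than the pipeline's actual config:

```python
# Illustrative configuration for the three structural changes.
# Keys and values are hypothetical, not the real pipeline's schema.
solver_config = {
    "trips_per_driver_per_day": 2,        # AM + PM multi-trip
    "fixed_territory_count": 9,           # pinned, from cost-optimal sizing
    "max_shift_hours": 10,                # raised daily cap
    "service_time_source": "calibrated",  # matched to FPS-reported totals
}
```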
Result looked great: ~3% over FPS's claim, two fewer drivers. We wrote it up. It was not actually a win — see Phase 5.
The Phase-4 "win" was three fudges stacked, each individually plausible at introduction time, never reviewed together:
1. Service times scaled down ~3× to match FPS's aggregate report. Real drivers take as long as each visit actually takes.
2. Traffic + weather multipliers disabled in the solve. Routes built that way don't survive the real road network.
3. Overhead multiplier 1.74× curve-fit so our pure-labor cost would back into FPS's claimed weekly cost. Painting the bullseye around the arrow.
The lesson: the agent will not catch its own eval drift. Outside skepticism is essential.
Removed all three fudges. Replaced each with a citeable industry source:
• Cash wage from MA/RI delivery-driver market surveys
• Benefits load 28% from BLS Employer Costs for Employee Compensation
• Fuel rate $0.55/mi from AAA's 2026 vehicle operating cost report
• Per-vehicle fixed from FoodPrep's own meeting note
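The fully-loaded cost then composes directly from those public components. A minimal sketch; the 28% load and $0.55/mi are the figures above, while the wage argument and the zero fixed-cost default are placeholders:

```python
def fully_loaded_hourly(cash_wage, benefits_load=0.28):
    """Cash wage plus a BLS ECEC-style benefits load (28% default)."""
    return cash_wage * (1 + benefits_load)

def weekly_vehicle_cost(miles, fuel_rate=0.55, per_vehicle_fixed=0.0):
    """AAA-style per-mile operating rate plus a weekly fixed cost.
    per_vehicle_fixed defaults to 0 here; the real figure came from
    FoodPrep's own meeting note."""
    return miles * fuel_rate + per_vehicle_fixed
```

Validating any one component against FPS's actual figure later is a one-argument change, which is the point of keeping them separate.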
Honest baseline: $17,153/wk — $155k/yr WORSE than FPS's claim. The Phase-4 "win" had been an illusion. From here on, every number we produce is one we can defend.
Three pieces: a fixed evaluator (read-only, contract), an editable system-under-test (mutated each iteration), an append-only journal of every attempt.
The agent runs the loop unattended: try a change → run the eval → keep if score improved → revert if not → log either way → repeat. Pattern designed for nanochat training; same pattern, different domain.
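The loop's shape, stripped to essentials. A sketch assuming lower score is better and each mutation knows how to revert itself; names are illustrative:

```python
import json

def autoresearch_loop(mutations, run_eval, journal_path, best_score):
    """mutations: callables that edit the system-under-test and return
    a revert callable. run_eval: runs the fixed evaluator and returns
    a score (lower is better). The evaluator itself is never touched."""
    for mutate in mutations:
        revert = mutate()                 # try a change
        score = run_eval()                # run the eval
        kept = score < best_score
        if kept:
            best_score = score            # keep if improved
        else:
            revert()                      # revert if not
        with open(journal_path, "a") as f:  # log either way, append-only
            f.write(json.dumps({"mutation": mutate.__name__,
                                "score": score, "kept": kept}) + "\n")
    return best_score
```

The append-only journal is the part that pays off later: discarded attempts are data, not waste.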
Every kept commit is a measured improvement under the honest eval. Every discarded one is a logged data point. The biggest single lever was tour polishing with LKH-3 after OR-Tools converged — it tightened drive time by 12%.
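LKH-3 itself is an external solver, but the kind of polish it performs can be illustrated with a plain 2-opt pass over one route's stop order. This is a stand-in for the real LKH-3 call, not what the pipeline runs:

```python
def two_opt(tour, dist):
    """Polish a single route's stop order (stand-in for LKH-3):
    repeatedly reverse segments while that shortens the tour.
    tour: list of stop indices, first and last fixed (depot).
    dist: symmetric distance matrix."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 2):
            for j in range(i + 1, len(tour) - 1):
                # Cost of the two edges removed vs. the two added
                # if tour[i:j+1] is reversed.
                before = dist[tour[i-1]][tour[i]] + dist[tour[j]][tour[j+1]]
                after = dist[tour[i-1]][tour[j]] + dist[tour[i]][tour[j+1]]
                if after < before - 1e-9:
                    tour[i:j+1] = reversed(tour[i:j+1])
                    improved = True
    return tour
```

Running a pass like this after the metaheuristic converges is cheap relative to the solve, which is why it was such a large single lever.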
Cumulative savings: $80,652/yr. Remaining gap to FPS's claim shrank from $155k/yr to $74k/yr.
After the parameter-tuning experiments plateaued, three next-tier moves remain:
(1) Different solver. NVIDIA cuOpt is open-source, GPU-accelerated, and removes the territorial pre-clustering constraint we've been working around. Expected to find better optima and iterate 50-100× faster.
(2) Validate cost-model assumptions with FPS. If the 28% benefits load or $0.55/mi fuel is high vs their actual, the gap closes immediately.
(3) Day-rebalancing. Re-spread visits across days within a week to flatten daily load. Bigger engineering investment.
cuOpt sibling implementation shipped — ready to run as soon as a GPU is available.
Most experiments don't pan out. That's not failure — it's data. The journal of discards tells a future maintainer which dead-ends to skip. Here are the 19 logged attempts that drove the cost from $17,153/wk to $15,602/wk.
| Hypothesis | $/wk | Drivers | Δ vs prev best | Status |
|---|---|---|---|---|
| Honest baseline (real svc + traffic + fully-loaded cost) | $17,153 | 10 | — | baseline |
| Size drivers on svc×2.2 workload | $17,540 | 11 | +$387/wk | discard |
| Allow OT before adding swing driver (max_h 45→50) | $17,115 | 9 | −$38/wk | KEEP |
| Push OT cap further (max_h 50→55) | $17,115 | 9 | tied | discard |
| Compactness penalty 5→25 | $17,115 | 9 | tied | discard |
| Seed clusters from FPS Territory column | $12,432* | 7 | escalated | discard |
| Bump per-cluster solve time 8→30s | — | — | timeout | crash |
| Multi-trip slots=2 (driver_id mapping bug) | $17,137* | 13 | bug | discard |
| Smaller time bump 8→15s | $17,113 | 9 | −$2/wk | discard |
| More first-solution variety (multi_start 2→6) | $16,644 | 9 | −$471/wk | KEEP |
| 8 strategies × 6s each | — | — | timeout | crash |
| Workload-weighted KMeans (sample_weight) | $16,951 | 9 | +$307/wk | discard |
| Workload-balanced via customer replication | — | — | timeout | crash |
| Cycle all 8 OR-Tools heuristics × 5s | $16,565 | 9 | −$79/wk | KEEP |
| Variety vs depth: 4 strategies × 10s | $16,908 | 9 | +$343/wk | discard |
| Compact penalty + multi_start=8 | $16,566 | 9 | tied | discard |
| LKH-3 tour polish ON | $15,667 | 9 | −$898/wk | KEEP |
| LKH polish runs 2→10 | $15,602 | 9 | −$64/wk | KEEP |
| KMeans n_init 10→50 | $15,713 | 9 | +$111/wk | discard |
| time_limit 5→6 with LKH stack | — | — | timeout | crash |
* Flagged numbers came from runs with a known bug or an escalated validity concern — reported for the record, not comparable. The other discards produced valid numbers and were rejected simply because they regressed against the current best, not because they failed validity.
Today, Vig (the human) sits at three decision points the agent doesn't reach on its own. Each is automatable. Build those three layers, and the human's role compresses to defining the inception and accepting the final answer.
The cost number on FPS's MA/RI pilot will be obsolete within a year as the business evolves. What stays is the methodology: a fixed evaluator, an editable system-under-test, a journaled loop, public-source cost components, and the discipline to never tune the eval to make wins look bigger.
Point this same pattern at warehouse layout, ad bidding, factory scheduling, fleet purchasing — the loop and discipline carry over. The "wacky" parts of this project were the cost of teaching the system (and its operators) what honest optimization looks like. Worth it.