A multi-week journey from a confident-but-cheating "win" → defensible baseline → autoresearch loop → cuOpt drop-in → a 4,900-line novel-algorithm detour that did not beat what we already had → and finally the hybrid that does. Built for Vig.
These four headline stats come directly from the May 4 / May 6 / May 8 review-call transcripts: FTE count, annual dollar savings vs FPS today, drive-vs-service ratio, and validator pass-rate, all in Vig's own framing and his own words. The eight rows below score each constraint Vig stated against what v16 actually delivered.
| What Vig said (transcript quote) | v16 delivers | Status | Note |
|---|---|---|---|
| "Reduce 10 FTE → 6-7 FTE" | 7 base drivers (5 BUR + 2 PAW) | MET | at top of band; lower needs customer-data changes |
| "~$180K+/yr savings = 3 FTE × $60K loaded" | 3 FTE removed × $60K = $180K+/yr | MET | matches your number exactly |
| "Drive ratio 66-68% → ~50/50 logistics target" | 48.5% drive / 51.5% service | EXCEEDED | under 50/50 = pure efficiency |
| "$25/h base + 1.5× OT after 40h" | wired into cost objective + validator | MET | OT preferred over swing drivers (per your note) |
| "Vehicle $1k/mo + insurance $283/mo" | $320.75/wk per truck in cost model | MET | matches FoodPrep meeting note |
| "Driver weekly hours ≤ 60h hard cap" | validator check 7: PASS | MET | compliance enforced |
| "No same-city day-fragmentation (Plymouth, Quincy, etc.)" | core fragmentation removed (Plymouth single day, North/South Shore separated) | MET | 2 edge cases on Quincy/Braintree week B — minor |
| "38-45h/wk per driver target band" | 2/7 drivers fully in band; 3/7 below; 2/7 in OT | PARTIAL | opportunity for cross-tour LNS to flatten (next round) |
The "wacky" parts of the journey were necessary. Phase 3's catastrophe forced us to take cost seriously. Phase 5's reckoning forced us to take honesty seriously. Phase 11's novel-alone failure forced humility about academic SOTA at FPS scale. Phase 12's hybrid is the deliverable. Each step had to happen for the final state to be defensible.
Two-layer router stood up on the MA/RI pilot. A strategic LLM-driven planner picks the broad shape of the day. An operational solver fits the actual route around concrete time windows, capacity, cold-chain rules, and drive matrices.
We confirmed the mechanics were correct (every visit assigned, A/B-week counts match, drive+service+wait sums to total) before making any cost claim.
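Those mechanical checks reduce to a small routine. This is a minimal sketch with hypothetical data shapes (`visit_id`, per-driver totals dicts); the real validator grew to nine checks:

```python
from collections import Counter

def check_mechanics(visits_required, assignments, totals):
    """Mechanical sanity checks: every required visit assigned exactly once,
    and per-driver drive + service + wait minutes reconcile with the total."""
    assigned = Counter(a["visit_id"] for a in assignments)
    # 1. Every visit assigned exactly once: no drops, no duplicates.
    assert set(assigned) == set(visits_required), "missing or extra visits"
    assert all(n == 1 for n in assigned.values()), "duplicate assignment"
    # 2. Component minutes must sum to the reported total.
    for driver, t in totals.items():
        parts = t["drive"] + t["service"] + t["wait"]
        assert abs(parts - t["total"]) < 1e-6, f"{driver}: components != total"
    return True
```

Only once a schedule passed checks like these did we let it carry a cost claim.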
Schema columns didn't match what FPS expected. Driver IDs weren't stable across days. Service totals didn't match what FPS reports internally. Cost objective wasn't the one FPS cared about.
Each constraint Vig added was something the data file alone couldn't have specified. The data file is a lossy snapshot of the real business — that's the rule, not the exception.
With the cost objective explicit ($25/hr base + 1.5× OT after 40h + per-vehicle weekly fixed cost), the optimization picked fleet size and routing to minimize that. The result was wildly worse than today's FPS.
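The stated objective is easy to write down. A sketch, using the $320.75/wk per-truck figure from the cost model (hour totals in the usage below are illustrative):

```python
def weekly_cost(driver_hours, trucks, base=25.0, ot_mult=1.5,
                ot_after=40.0, vehicle_fixed=320.75):
    """Weekly fleet cost under the stated objective: $25/h base pay,
    1.5x overtime after 40h/wk, plus a fixed weekly cost per truck."""
    labor = 0.0
    for h in driver_hours:
        reg = min(h, ot_after)          # regular hours, capped at 40
        ot = max(h - ot_after, 0.0)     # anything past 40 is overtime
        labor += reg * base + ot * base * ot_mult
    return labor + trucks * vehicle_fixed
```

For example, `weekly_cost([45], trucks=1)` prices 5 OT hours at $37.50/h on top of 40 regular hours and one truck's fixed cost.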
The cost calculation was right. The bug was upstream — too much total work being scheduled. Resist the urge to tune the eval. Tune the system.
Three structural changes: drivers do AM + PM trips (multi-trip), territories pinned to a fixed count derived from cost-optimal sizing, the daily shift cap raised. We also calibrated simulated service times so they matched what FPS reports.
Result looked great: ~3% over FPS's claim, two fewer drivers. We wrote it up. It was not actually a win — see Phase 5.
The Phase-4 "win" was three fudges stacked, each individually plausible at introduction time, never reviewed together:
1. Service times scaled down ~3× to match FPS's aggregate report. Real drivers take what each visit takes.
2. Traffic + weather multipliers disabled in the solve. Routes built that way don't survive the real road network.
3. Overhead multiplier 1.74× curve-fit so our pure-labor cost would back into FPS's claimed weekly cost. Painting the bullseye around the arrow.
The lesson: the agent will not catch its own eval drift. Outside skepticism is essential.
Removed all three fudges. Replaced each with a citeable industry source:
• Cash wage from MA/RI delivery-driver market surveys
• Benefits load 28% from BLS Employer Costs for Employee Compensation
• Fuel rate $0.55/mi from AAA's 2026 vehicle operating cost report
• Per-vehicle fixed from FoodPrep's own meeting note
Honest baseline: $17,153/wk — $155k/yr WORSE than FPS's claim. The Phase-4 "win" had been an illusion. From here on, every number we produce is one we can defend.
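The fully-loaded components compose straightforwardly. A sketch, not the production model: cash wage is left as a parameter (the survey figure isn't quoted here), and OT handling is omitted for brevity:

```python
def loaded_weekly_cost(cash_wage, hours, miles, trucks,
                       benefits_load=0.28, fuel_per_mi=0.55,
                       vehicle_fixed=320.75):
    """Fully-loaded weekly cost from citeable components: cash wage grossed
    up by the 28% BLS benefits load, AAA per-mile fuel, and the per-vehicle
    fixed cost from FoodPrep's meeting note."""
    labor = cash_wage * (1 + benefits_load) * hours
    return labor + miles * fuel_per_mi + trucks * vehicle_fixed
```

Every term traces to a source, so when the total comes out worse than FPS's claim, the number still stands.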
Three pieces: a fixed evaluator (read-only, contract), an editable system-under-test (mutated each iteration), an append-only journal of every attempt.
The agent runs the loop unattended: try a change → run the eval → keep it if the score improved → revert if not → log either way → repeat. The pattern was originally designed for nanochat training; same loop, different domain.
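The loop skeleton, sketched with hypothetical hooks: `propose_change` and `run_eval` stand in for the real mutation machinery and the fixed evaluator:

```python
import json

def autoresearch(system, propose_change, run_eval, journal_path, iters=20):
    """Unattended improvement loop: try a change, score it under the fixed
    eval, keep it only if the score improved, and journal every attempt."""
    best = run_eval(system)
    with open(journal_path, "a") as journal:
        for i in range(iters):
            candidate = propose_change(system)   # mutate a copy of the system
            score = run_eval(candidate)
            kept = score < best                  # lower weekly cost = better
            journal.write(json.dumps(
                {"iter": i, "score": score, "kept": kept}) + "\n")
            if kept:
                system, best = candidate, score
            # otherwise the previous `system` stands (implicit revert)
    return system, best
```

The append-only journal is the point: discarded attempts are data, not waste.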
Every kept commit is a measured improvement under the honest eval. Every discarded one is a logged data point. The biggest single lever was tour polishing with LKH-3 after OR-Tools converged — it tightened drive time by 12%.
Cumulative savings: $80,652/yr. Remaining gap to FPS's claim shrank from $155k/yr to $74k/yr.
After the parameter-tuning experiments plateaued, three next-tier moves remained:
(1) Different solver. NVIDIA cuOpt is open-source, GPU-accelerated, and removes the territorial pre-clustering constraint we've been working around. Expected to find better optima and iterate 50-100× faster.
(2) Validate cost-model assumptions with FPS. If the 28% benefits load or $0.55/mi fuel is high vs their actual, the gap closes immediately.
(3) Day-rebalancing. Re-spread visits across days within a week to flatten daily load. Bigger engineering investment.
cuOpt sibling implementation shipped — ready to run as soon as a GPU is available.
Plugged in NVIDIA's GPU-native cuOpt routing solver as the assignment layer. With cuOpt's superior driver-day partitioning, the fleet collapsed from 9 → 7 base drivers (5 Burlington + 2 Pawtucket — strip -S1 reload-slot suffix to count physical drivers). Drive ratio settled at 50.6%, right at the logistics best-practice 50/50 split. Zero same-city day-fragmentation. Validator passes 9/9.
This became the new honest baseline. cuOpt got us roughly $50k/yr further ahead of v8 by changing the algorithm class, not the parameter values.
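Counting physical drivers from truck-slot IDs is the suffix-strip mentioned above. A sketch (the slot-ID format shown is illustrative; the `-S<n>` reload-slot suffix is the real convention):

```python
import re

def physical_drivers(slot_ids):
    """Truck-slots carry a reload-slot suffix like '-S1' or '-S2';
    strip it to count unique physical drivers."""
    return {re.sub(r"-S\d+$", "", s) for s in slot_ids}
```

So two reload slots on the same truck collapse to one driver.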
We took a published 2024 paper (Rudich, López-Ibáñez, Römer, Cappart, Rousseau — INFORMS J. Computing, "An Exact Framework for Solving the Space-Time-Dependent TSP") and built it from scratch as a standalone replacement: Peel-and-Bound TD-TSP, then a VRP-level lift spanning 4,905 lines of new solver code with custom GPU kernels (CuPy memory pools, RawKernel-fused operators, on-device Held-Karp).
Every variant of the novel algorithm by itself performed worse than cuOpt on this dataset:
v14 PB (per-tour): 9 base drivers, $17,586/wk — +$3k/wk vs cuOpt v12.
v14 VRP-PB (joint per-cluster-day): 24 truck-slots, $39,402/wk — +$25k/wk vs cuOpt v12 (gross regression). Solver consolidated trucks aggressively but produced 2× the drive miles.
An autoresearch loop ran 36 generations sweeping K, time-budget, gap-tolerance, fleet-cap, cold-chain pool fold thresholds, etc. Validator failed every generation due to a chain of integration bugs (time-stamp reconstruction, driver-ID explosion, demand-vector defaults). Even after fixing those, no novel-only variant beat cuOpt's 7-driver / $14.6k baseline.
The paper's algorithm is real and academically state-of-the-art — at TSP scale (single tour, 100+ stops). At FPS's per-cluster-day scale (5–15 stops per segment), the joint-optimization opportunity is too small to overcome cuOpt's industrial-grade implementation. Discipline kept: when novel work didn't beat the floor, we stopped polishing it as a standalone product.
Pivot: stop trying to replace cuOpt; refine its output. Keep cuOpt's driver→day assignment (the part it excels at) then apply our novel time-dependent Peel-and-Bound sequencing per tour (the part it doesn't do).
Two layered hybrids:
v15 (cuOpt + 2-opt+or-opt with TD costs): 7 drivers, drive 17,731 min, ~$14,460/wk effective. −$180/wk vs v12.
v16 (cuOpt + Held-Karp exact for n≤16 + departure-time sweep): 7 drivers, drive 17,120 min, drive ratio 48.5%. −$370/wk vs v12 = ~$19k/yr added on top.
Held-Karp DP finds the guaranteed-optimal stop sequence for tours up to 16 stops (23/76 of pilot tours qualified). For larger tours we use 2-opt + or-opt + 3-opt local search. Departure-time sweep picks the depot-out minute that dodges morning rush — 28/76 tours got an earlier start. The novel paper-based ideas live inside cuOpt's assignment instead of competing with it.
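Held-Karp in its open-path form is what makes the n≤16 tours exactly optimal. A minimal CPU sketch; the production version runs on-device and uses time-dependent costs:

```python
from itertools import combinations

def held_karp_path(dist):
    """Exact minimum-cost open path visiting all stops, starting at stop 0.
    dist[i][j] = travel cost. O(n^2 * 2^n): practical for n <= 16."""
    n = len(dist)
    # best[(S, j)] = cheapest path from 0 through stop-set S, ending at j
    best = {(frozenset([j]), j): dist[0][j] for j in range(1, n)}
    for size in range(2, n):
        for S in combinations(range(1, n), size):
            fs = frozenset(S)
            for j in S:
                rest = fs - {j}
                best[(fs, j)] = min(best[(rest, k)] + dist[k][j]
                                    for k in rest)
    full = frozenset(range(1, n))
    end = min(range(1, n), key=lambda j: best[(full, j)])
    cost = best[(full, end)]
    # Reconstruct the stop sequence by walking the DP table backwards.
    order, S = [end], full
    while len(S) > 1:
        j = order[-1]
        k = min(S - {j}, key=lambda k: best[(S - {j}, k)] + dist[k][j])
        order.append(k)
        S = S - {j}
    return cost, [0] + order[::-1]
```

The departure-time sweep then just re-prices the winning sequence at each candidate depot-out minute and keeps the cheapest.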
v17 attempted the next lever (cross-tour customer-day LNS) and regressed: validator passed but drive went up 2.2% AND introduced 19 same-city day-fragmentations — Vig's #1 visual complaint. Documented as a failed experiment with a clear fix (add same-city-cohesion bias to repair); the lesson is in the codebase, the bad output isn't shipped.
The headline takeaway: the novel academic algorithm did not beat the proven industrial solver standalone. The hybrid — cuOpt's assignment + the novel sequencer's per-tour refinement — IS the winner. Each component does what it's good at; together they beat either alone.
Most experiments don't pan out. That's not failure — it's data. The journal of discards tells a future maintainer which dead-ends to skip. Here are the 21 + 4 attempts that drove the cost from $17,153/wk → $15,602/wk → $14,638/wk (cuOpt drop-in) → ~$14,270/wk (cuOpt + Peel-and-Bound hybrid). Note: the novel paper-based algorithm standalone landed in discard; only when composed with cuOpt's assignment did it land in KEEP.
| Hypothesis | $/wk | Drivers | Δ vs prev best | Status |
|---|---|---|---|---|
| Honest baseline (real svc + traffic + fully-loaded cost) | $17,153 | 10 | — | baseline |
| Size drivers on svc×2.2 workload | $17,540 | 11 | +$387/wk | discard |
| Allow OT before adding swing driver (max_h 45→50) | $17,115 | 9 | −$38/wk | KEEP |
| Push OT cap further (max_h 50→55) | $17,115 | 9 | tied | discard |
| Compactness penalty 5→25 | $17,115 | 9 | tied | discard |
| Seed clusters from FPS Territory column | $12,432* | 7 | escalated | discard |
| Bump per-cluster solve time 8→30s | — | — | timeout | crash |
| Multi-trip slots=2 (driver_id mapping bug) | $17,137* | 13 | bug | discard |
| Smaller time bump 8→15s | $17,113 | 9 | −$2/wk | discard |
| More first-solution variety (multi_start 2→6) | $16,644 | 9 | −$471/wk | KEEP |
| 8 strategies × 6s each | — | — | timeout | crash |
| Workload-weighted KMeans (sample_weight) | $16,951 | 9 | +$307/wk | discard |
| Workload-balanced via customer replication | — | — | timeout | crash |
| Cycle all 8 OR-Tools heuristics × 5s | $16,565 | 9 | −$79/wk | KEEP |
| Variety vs depth: 4 strategies × 10s | $16,908 | 9 | +$343/wk | discard |
| Compact penalty + multi_start=8 | $16,566 | 9 | tied | discard |
| LKH-3 tour polish ON | $15,667 | 9 | −$898/wk | KEEP |
| LKH polish runs 2→10 | $15,602 | 9 | −$64/wk | KEEP |
| KMeans n_init 10→50 | $15,713 | 9 | +$111/wk | discard |
| time_limit 5→6 with LKH stack | — | — | timeout | crash |
| v12 cuOpt (drop-in) | $14,638 | 7 | −$964/wk | KEEP |
| v14 PB TD-TSP (novel, per-tour) | $17,586 | 9 | +$2,948/wk | discard |
| v14 VRP-PB (novel, joint per-cluster-day) | $39,402 | 24* | +$24,764/wk | discard |
| VRP-PB autoresearch loop (36 generations) | — | — | all failed validator | crash |
| v15 (cuOpt + 2-opt+or-opt TD) | ~$14,460 | 7 | −$180/wk | KEEP |
| v16 (cuOpt + Held-Karp + depart sweep) | ~$14,270 | 7 | −$370/wk | KEEP |
| v17 cross-tour LNS | ~$14,500 | 7 | +$230/wk + 19 city-frags | discard |
* Some discards still produced reasonable numbers — they were rejected because they regressed against the current best, not because they failed validity.
Today, Vig (the human) sits at three decision points the agent doesn't reach on its own. Each is automatable. Build those three layers and the human's role compresses to defining the initial framing and accepting the final answer.
The cost number on FPS's MA/RI pilot will be obsolete within a year as the business evolves. What stays is the methodology: a fixed evaluator, an editable system-under-test, a journaled loop, public-source cost components, and — newly proven this round — the discipline to kill a novel algorithm when it can't beat the proven floor on its own, even after weeks of work and 4,900 lines of new code.
The current production stack is a hybrid: NVIDIA cuOpt does what cuOpt does best (joint driver-day assignment via GPU-native column generation), and our Peel-and-Bound sequencer does what cuOpt doesn't (time-dependent traffic-aware optimal stop ordering with departure-time sweep). Either alone loses. Together: 7 base drivers, 48.5% drive ratio, $14.3k/wk effective — beats v12 by ~$19k/yr on top of the prior $50k/yr cuOpt brought.
Point this same pattern (hybrid composition over standalone replacement) at warehouse layout, ad bidding, factory scheduling, fleet purchasing — the loop and discipline carry over. The "wacky" parts of this project taught the system (and its operators) what honest optimization looks like. Worth it.