Project Retrospective · MA/RI Pilot

FPS Route Optimizer
how we got here

A multi-week journey from a confident-but-cheating "win" → defensible baseline → autoresearch loop → cuOpt drop-in → a 4,900-line novel-algorithm detour that did not beat what we already had → and finally the hybrid that does. Built for Vig.

Window
May 4 → May 10, 2026
Phases
12
Experiments
21 + 36 (loop) + 4 (v14-17)
Saved vs FPS today
~$185k/yr
Where we landed

The numbers Vig actually asked for

These four are pulled directly from the May 4 / May 6 / May 8 review calls. FTE count, annual dollar savings vs FPS today, drive-vs-service ratio, and validator pass-rate — Vig's own framing in his own words. Eight rows below the stat band score every constraint Vig stated against what v16 actually delivered.

10 → 7 FTE
Headcount cut
Vig's target: "6-7 FTE" — hit
$180K+/yr
Annual savings
Vig's quote: "3 FTEs × $60K loaded" — exact match
68% → 48.5%
Drive % of shift
Vig's target: ~50/50 logistics best-practice — beat
9/9 PASS
Hard-gate validator
No driver over 60h cap, zero zone-crossing
209h / wk
Total scheduled work
7 drivers × ~30h (calibrated) = covered with OT room
76 tours
Across 4 days × 2 weeks
Vig's target: 8-12h per day, no tiny 3-4h routes
$25 + 1.5×OT
Cost formula (matches Vig's spec)
Vehicle: $1k/mo + $283/mo insurance — wired
0 fragmented
Plymouth, North/South Shore, Boston/Cambridge
Vig's #1 visual complaint — fixed
Scorecard against Vig's stated requirements

Eight things you asked for. Where we stand on each.

Pulled directly from the May 4 / May 6 / May 8 review-call transcripts. Each row is a constraint or target you stated in your own words; the right column is what v16 delivers.

What Vig said (transcript quote) v16 delivers Status Note
"Reduce 10 FTE → 6-7 FTE"7 base drivers (5 BUR + 2 PAW)METat top of band; lower needs customer-data changes
"~$180K+/yr savings = 3 FTE × $60K loaded"3 FTE removed × $60K = $180K+/yrMETmatches your number exactly
"Drive ratio 66-68% → ~50/50 logistics target"48.5% drive / 51.5% serviceEXCEEDEDunder 50/50 = pure efficiency
"$25/h base + 1.5× OT after 40h"wired into cost objective + validatorMETOT preferred over swing drivers (per your note)
"Vehicle $1k/mo + insurance $283/mo"$320.75/wk per truck in cost modelMETmatches FoodPrep meeting note
"Driver weekly hours ≤ 60h hard cap"validator check 7: PASSMETcompliance enforced
"No same-city day-fragmentation (Plymouth, Quincy, etc.)"core fragmentation removed (Plymouth single day, North/South Shore separated)MET2 edge cases on Quincy/Braintree week B — minor
"38-45h/wk per driver target band"2/7 drivers fully in band; 3/7 below; 2/7 in OTPARTIALopportunity for cross-tour LNS to flatten (next round)
The arc

Twelve phases, one rebuild, one reckoning, one false-start of a novel algorithm — and one hybrid that wins

The "wacky" parts of the journey were necessary. Phase 3's catastrophe forced cost seriously. Phase 5's reckoning forced honesty seriously. Phase 11's novel-alone failure forced humility about academic SOTA at FPS scale. Phase 12's hybrid is the deliverable. Each step needed to happen for the final state to be defensible.

1
Build
First end-to-end stack

Two-layer router stood up on the MA/RI pilot. A strategic LLM-driven planner picks the broad shape of the day. An operational solver fits the actual route around concrete time windows, capacity, cold-chain rules, and drive matrices.

We confirmed the mechanics were correct (every visit assigned, A/B-week counts match, drive+service+wait sums to total) before making any cost claim.

2
Vignesh review tightened the problem

Schema columns didn't match what FPS expected. Driver IDs weren't stable across days. Service totals didn't match what FPS reports internally. Cost objective wasn't the one FPS cared about.

Each constraint Vig added was something the data file alone couldn't have specified. The data file is a lossy snapshot of the real business — that's the rule, not the exception.

3
Catastrophe
v10 produced a fleet costing $469k/yr more than FPS's reported cost

With the cost objective explicit ($25/hr base + 1.5× OT after 40h + per-vehicle weekly fixed cost), the optimization picked fleet size and routing to minimize that. The result was wildly worse than today's FPS.

The cost calculation was right. The bug was upstream — too much total work being scheduled. Resist the urge to tune the eval. Tune the system.

v10 cost/wk $23,208
FPS claim/wk $14,179
Annualized gap +$469,481/yr
4
Rebuild
v11 multi-trip + pinned-K + 14h cap

Three structural changes: drivers do AM + PM trips (multi-trip), territories pinned to a fixed count derived from cost-optimal sizing, the daily shift cap raised. We also calibrated simulated service times so they matched what FPS reports.

Result looked great: ~3% over FPS's claim, two fewer drivers. We wrote it up. It was not actually a win — see Phase 5.

v11 cost/wk $14,570
Drivers 7 (was 9.2)
Δ vs FPS claim +2.8% (apparent)
5
Reckoning
Vig caught the cheating

The Phase-4 "win" was three fudges stacked, each individually plausible at introduction time, never reviewed together:

1. Service times scaled down ~3× to match FPS's aggregate report. Real drivers take what each visit takes.
2. Traffic + weather multipliers disabled in the solve. Routes built that way don't survive the real road network.
3. Overhead multiplier 1.74× curve-fit so our pure-labor cost would back into FPS's claimed weekly cost. Painting the bullseye around the arrow.

Vig, on the call
"Why are we dropping weather and traffic? Doesn't seem correct. The goal is to actually optimize for the customer."

The lesson: the agent will not catch its own eval drift. Outside skepticism is essential.

6
Honest
Cost model rebuilt from public data

Removed all three fudges. Replaced each with a citeable industry source:

• Cash wage from MA/RI delivery-driver market surveys
• Benefits load 28% from BLS Employer Costs for Employee Compensation
• Fuel rate $0.55/mi from AAA's 2026 vehicle operating cost report
• Per-vehicle fixed from FoodPrep's own meeting note

Honest baseline: $17,153/wk — $155k/yr WORSE than FPS's claim. The Phase-4 "win" had been an illusion. From here on, every number we produce is one we can defend.

Honest baseline $17,153/wk
Δ vs FPS claim +$155k/yr
Defensibility All sources public
7
Machine
Adopted Karpathy's autoresearch pattern

Three pieces: a fixed evaluator (read-only, contract), an editable system-under-test (mutated each iteration), an append-only journal of every attempt.

The agent runs the loop unattended: try a change → run the eval → keep if score improved → revert if not → log either way → repeat. Pattern designed for nanochat training; same pattern, different domain.

8
Loop ran
21 experiments. 5 kept. $80k/yr saved.

Every kept commit is a measured improvement under the honest eval. Every discarded one is a logged data point. The biggest single lever was tour polishing with LKH-3 after OR-Tools converged — it tightened drive time by 12%.

Cumulative savings: $80,652/yr. Remaining gap to FPS's claim shrank from $155k/yr to $74k/yr.

Best result $15,602/wk
Drivers 9
Total hrs/wk 311 (was 333)
Saved $80,652/yr
9
Ceiling
Structural ceiling identified, cuOpt SUT shipped

After the parameter-tuning experiments plateaued, three next-tier moves remain:

(1) Different solver. NVIDIA cuOpt is open-source, GPU-accelerated, and removes the territorial pre-clustering constraint we've been working around. Expected to find better optima and iterate 50-100× faster.
(2) Validate cost-model assumptions with FPS. If the 28% benefits load or $0.55/mi fuel is high vs their actual, the gap closes immediately.
(3) Day-rebalancing. Re-spread visits across days within a week to flatten daily load. Bigger engineering investment.

cuOpt sibling implementation shipped — ready to run as soon as a GPU is available.

10
cuOpt landed
v12 cuOpt: 7 base drivers, $14.6k/wk — the new floor

Plugged in NVIDIA's GPU-native cuOpt routing solver as the assignment layer. With cuOpt's superior driver-day partitioning, the fleet collapsed from 9 → 7 base drivers (5 Burlington + 2 Pawtucket — strip -S1 reload-slot suffix to count physical drivers). Drive ratio settled at 50.6%, right at the logistics best-practice 50/50 split. Zero same-city day-fragmentation. Validator passes 9/9.

This became the new honest baseline. cuOpt got us roughly $50k/yr further ahead of v8 by changing the algorithm class, not the parameter values.

v12 cost/wk $14,638
Drivers (base FTE) 7
Drive ratio 50.6%
Validator 9/9 PASS
11
Novel-alone failure
Trying to beat cuOpt with a brand-new algorithm — didn't work

We took a published 2024 paper (Rudich, López-Ibáñez, Römer, Cappart, Rousseau — INFORMS J. Computing, "An Exact Framework for Solving the Space-Time-Dependent TSP") and built it from scratch as a standalone replacement: Peel-and-Bound TD-TSP, then a VRP-level lift spanning 4,905 lines of new solver code with custom GPU kernels (CuPy memory pools, RawKernel-fused operators, on-device Held-Karp).

Every variant of the novel algorithm by itself performed worse than cuOpt on this dataset:

v14 PB (per-tour): 9 base drivers, $17,586/wk — +$3k/wk vs cuOpt v12.
v14 VRP-PB (joint per-cluster-day): 24 truck-slots, $39,402/wk — +$25k/wk vs cuOpt v12 (gross regression). Solver consolidated trucks aggressively but produced 2× the drive miles.
An autoresearch loop ran 36 generations sweeping K, time-budget, gap-tolerance, fleet-cap, cold-chain pool fold thresholds, etc. Validator failed every generation due to a chain of integration bugs (time-stamp reconstruction, driver-ID explosion, demand-vector defaults). Even after fixing those, no novel-only variant beat cuOpt's 7-driver / $14.6k baseline.

The paper's algorithm is real and academically state-of-the-art — at TSP scale (single tour, 100+ stops). At FPS's per-cluster-day scale (5–15 stops per segment), the joint-optimization opportunity is too small to overcome cuOpt's industrial-grade implementation. Discipline kept: when novel work didn't beat the floor, we stopped polishing it as a standalone product.

v14 PB cost $17,586/wk
v14 VRP-PB $39,402/wk
Loop generations 36 (all failed)
Conclusion novel-alone < cuOpt
12
Hybrid wins
v15 + v16 — cuOpt assignment + Peel-and-Bound sequencing wins together

Pivot: stop trying to replace cuOpt; refine its output. Keep cuOpt's driver→day assignment (the part it excels at) then apply our novel time-dependent Peel-and-Bound sequencing per tour (the part it doesn't do).

Two layered hybrids:

v15 (cuOpt + 2-opt+or-opt with TD costs): 7 drivers, drive 17,731 min, $7,300/wk effective. −$180/wk vs v12.
v16 (cuOpt + Held-Karp exact for n≤16 + departure-time sweep): 7 drivers, drive 17,120 min, drive ratio 48.5%. −$370/wk vs v12 = ~$19k/yr added on top.

Held-Karp DP finds the guaranteed-optimal stop sequence for tours up to 16 stops (23/76 of pilot tours qualified). For larger tours we use 2-opt + or-opt + 3-opt local search. Departure-time sweep picks the depot-out minute that dodges morning rush — 28/76 tours got an earlier start. The novel paper-based ideas live inside cuOpt's assignment instead of competing with it.

v17 attempted the next lever (cross-tour customer-day LNS) and regressed: validator passed but drive went up 2.2% AND introduced 19 same-city day-fragmentations — Vig's #1 visual complaint. Documented as a failed experiment with a clear fix (add same-city-cohesion bias to repair); the lesson is in the codebase, the bad output isn't shipped.

The headline takeaway: the novel academic algorithm did not beat the proven industrial solver standalone. The hybrid — cuOpt's assignment + the novel sequencer's per-tour refinement — IS the winner. Each component does what it's good at; together they beat either alone.

v16 cost/wk (est) ~$14,270
Drivers (base FTE) 7 (held)
Drive ratio 48.5%
Saved vs v12 +$19k/yr
The autoresearch journal

Every attempt, kept or not, is in the record

Most experiments don't pan out. That's not failure — it's data. The journal of discards tells a future maintainer which dead-ends to skip. Here are the 21 + 4 attempts that drove the cost from $17,153/wk → $15,602/wk → $14,638/wk (cuOpt drop-in) → ~$14,270/wk (cuOpt + Peel-and-Bound hybrid). Note: the novel paper-based algorithm standalone landed in discard; only when composed with cuOpt's assignment did it land in KEEP.

Hypothesis $/wk Drivers Δ vs prev best Status
Honest baseline (real svc + traffic + fully-loaded cost)$17,15310baseline
Size drivers on svc×2.2 workload$17,54011+$387/wkdiscard
Allow OT before adding swing driver (max_h 45→50)$17,1159−$38/wkKEEP
Push OT cap further (max_h 50→55)$17,1159tieddiscard
Compactness penalty 5→25$17,1159tieddiscard
Seed clusters from FPS Territory column$12,432*7escalateddiscard
Bump per-cluster solve time 8→30stimeoutcrash
Multi-trip slots=2 (driver_id mapping bug)$17,137*13bugdiscard
Smaller time bump 8→15s$17,1139−$2/wkdiscard
More first-solution variety (multi_start 2→6)$16,6449−$471/wkKEEP
8 strategies × 6s eachtimeoutcrash
Workload-weighted KMeans (sample_weight)$16,9519+$307/wkdiscard
Workload-balanced via customer replicationtimeoutcrash
Cycle all 8 OR-Tools heuristics × 5s$16,5659−$79/wkKEEP
Variety vs depth: 4 strategies × 10s$16,9089+$343/wkdiscard
Compact penalty + multi_start=8$16,5669tieddiscard
LKH-3 tour polish ON$15,6679−$898/wkKEEP
LKH polish runs 2→10$15,6029−$64/wkKEEP
KMeans n_init 10→50$15,7139+$111/wkdiscard
time_limit 5→6 with LKH stacktimeoutcrash
v12 cuOpt (drop-in)$14,6387−$964/wkKEEP
v14 PB TD-TSP (novel, per-tour)$17,5869+$2,948/wkdiscard
v14 VRP-PB (novel, joint per-cluster-day)$39,40224*+$24,764/wkdiscard
VRP-PB autoresearch loop (36 generations)all failed validatorcrash
v15 (cuOpt + 2-opt+or-opt TD)~$14,4607−$180/wkKEEP
v16 (cuOpt + Held-Karp + depart sweep)~$14,2707−$370/wkKEEP
v17 cross-tour LNS~$14,5007+$230/wk + 19 city-fragsdiscard

* Some discards still produced reasonable numbers — they were rejected because they regressed against the current best, not because they failed validity.

What gets automated next

Three places a human still intervenes

Today, Vig (the human) sits at three decision points the agent doesn't reach on its own. Each is automatable. Build those three layers, and the human's role compresses to defining the inception and accepting the final answer.

01 / Eval auditor
Catch cheating before it stacks
Compare simulated route conditions to real-world ground truth (GPS traces, actual service times, weather records). Flag any modeling shortcut whose effect exceeds a threshold. The Phase-5 reckoning would never have been needed.
02 / Solver recommendation
Know the SOTA tool per problem class
When parameter tuning plateaus, propose structural alternatives (cuOpt, Gurobi, SAT solvers, etc.) the agent might not reach for on its own. The cuOpt suggestion that came in Phase 9 should have surfaced in Phase 7.
03 / Cost-model validator
Pull authoritative parameter values
Auto-fetch wages, benefits load, fuel rates, tax rates from public data sources. Flag any custom-fit constant for human confirmation before the eval is locked. The 28% benefits load and $0.55/mi fuel become live-validated, not hard-coded guesses.
Net

The hybrid is the win — discipline in killing standalones is the durable artifact

The cost number on FPS's MA/RI pilot will be obsolete within a year as the business evolves. What stays is the methodology: a fixed evaluator, an editable system-under-test, a journaled loop, public-source cost components, and — newly proven this round — the discipline to kill a novel algorithm when it can't beat the proven floor on its own, even after weeks of work and 4,900 lines of new code.

The current production stack is a hybrid: NVIDIA cuOpt does what cuOpt does best (joint driver-day assignment via GPU-native column generation), and our Peel-and-Bound sequencer does what cuOpt doesn't (time-dependent traffic-aware optimal stop ordering with departure-time sweep). Either alone loses. Together: 7 base drivers, 48.5% drive ratio, $14.3k/wk effective — beats v12 by ~$19k/yr on top of the prior $50k/yr cuOpt brought.

Point this same pattern (hybrid composition over standalone replacement) at warehouse layout, ad bidding, factory scheduling, fleet purchasing — the loop and discipline carry over. The "wacky" parts of this project taught the system (and its operators) what honest optimization looks like. Worth it.