When you change something in an LLM chain (new prompt, different model, reordered step), how do you know if the change is good? The natural answer is "I deploy and watch the metrics". The problem: comparing "before" to "after" measures much more than your change.
Any other factor that shifted between the two periods can explain the "improvement" or "worsening" you see, without your change having done anything.
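You split traffic into two simultaneous groups: a control arm that runs the current version of the chain and a variant arm that runs your change.
Both arms run at the same time, and users are distributed between them by a deterministic hash. The only systematic difference between the two groups is your chain's version, so with enough traffic a difference in metrics can be attributed to your change rather than to the world's noise.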
`arm = hash(user_id || trace_id) % 2 == 0 ? 'control' : 'variant'`

Without determinism, a user could see inconsistent responses (sometimes control, sometimes variant), which degrades the experience and contaminates metrics with unexplained variance.
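The assignment rule above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes `user_id || trace_id` means "use `user_id`, falling back to `trace_id` when there is no user", and the `experiment` salt and the function name `assign_arm` are hypothetical additions. It uses SHA-256 because Python's built-in `hash()` is salted per process for strings and would not give stable assignments.

```python
import hashlib
from typing import Optional


def assign_arm(user_id: Optional[str], trace_id: str, experiment: str = "chain-change") -> str:
    """Return 'control' or 'variant', always the same for the same input."""
    # Assumption: prefer the user id so one user always lands in one arm;
    # fall back to the trace id for anonymous traffic.
    unit_id = user_id if user_id else trace_id
    # Salting with a (hypothetical) experiment name keeps different
    # experiments from splitting users in exactly the same way.
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 2
    return "control" if bucket == 0 else "variant"
```

Calling `assign_arm("user-42", "trace-a")` and `assign_arm("user-42", "trace-b")` returns the same arm, because the trace id is ignored whenever a user id is present.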
To detect a 5% accuracy improvement with statistical confidence, you typically need thousands of events per arm. To detect a 1% improvement, tens of thousands. An A/B with 30 events tells you nothing useful.
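To see where those orders of magnitude come from, here is a rough sketch using the standard normal-approximation formula for comparing two proportions. The 70% baseline pass rate, the 5% significance level, and the 80% power are illustrative assumptions, not values from this lesson, and `events_per_arm` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def events_per_arm(p_control: float, p_variant: float,
                   alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate events needed per arm to detect p_control -> p_variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    delta = p_variant - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


# Assumed baseline: 70% of requests pass the eval.
n_five_points = events_per_arm(0.70, 0.75)  # 5-point absolute improvement
n_one_point = events_per_arm(0.70, 0.71)    # 1-point absolute improvement
```

Under these assumptions the 5-point case needs on the order of a thousand events per arm and the 1-point case a few tens of thousands. The required sample grows with the inverse square of the effect you want to detect, which is why small improvements are so expensive to confirm.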
Define, before seeing any results, what you are going to measure and what threshold counts as a win. If you decide afterward, you'll find the metric that confirms what you already want to believe. This is called p-hacking, and it ruins serious A/B tests.
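One way to make that commitment concrete is to write the plan down as immutable data before the experiment starts and let a function apply it mechanically. The sketch below is an assumption-heavy illustration: `ExperimentPlan`, `decide`, the field names, and the choice of a one-sided two-proportion z-test are all hypothetical, not something this lesson prescribes.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass(frozen=True)  # frozen: the plan cannot be edited after the fact
class ExperimentPlan:
    primary_metric: str
    min_lift: float          # smallest absolute improvement worth shipping
    alpha: float             # significance level
    min_events_per_arm: int  # do not judge before this much data


def decide(plan: ExperimentPlan, control_pass: int, control_n: int,
           variant_pass: int, variant_n: int) -> str:
    """Apply the pre-registered rule to pass counts from each arm."""
    if min(control_n, variant_n) < plan.min_events_per_arm:
        return "keep collecting"
    p_c = control_pass / control_n
    p_v = variant_pass / variant_n
    pooled = (control_pass + variant_pass) / (control_n + variant_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    if se == 0:
        return "no winner"
    # One-sided test: is the variant's pass rate higher than control's?
    p_value = 1 - NormalDist().cdf((p_v - p_c) / se)
    if p_value < plan.alpha and (p_v - p_c) >= plan.min_lift:
        return "variant wins"
    return "no winner"


plan = ExperimentPlan("eval_pass_rate", min_lift=0.02, alpha=0.05,
                      min_events_per_arm=1000)
```

With this plan, `decide(plan, 20, 30, 25, 30)` returns `"keep collecting"` because 30 events per arm is below the agreed minimum, no matter how good the variant looks.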
| Metric | How to measure |
|---|---|
| Eval pass rate | Eval set running over each arm, % passing |
| Latency p50/p95 | Trace duration_ms, percentile 50 and 95 |
| Average cost per request | Sum cost_usd of spans, average |
| Degradation rate | % of traces with status partial |
| Satisfaction (proxy) | User thumbs up/down, follow-up rate |
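The trace-based rows of the table can be computed per arm with a few lines of Python. This sketch assumes each trace is a dict with `duration_ms`, `status`, and a list of `spans` that each carry `cost_usd`; that shape and the function names are assumptions for illustration. Percentiles use the simple nearest-rank method. Eval pass rate and satisfaction come from other sources, so they are left out here.

```python
from math import ceil


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_arm(traces):
    """Summarize latency, cost, and degradation for one arm's traces."""
    n = len(traces)
    durations = [t["duration_ms"] for t in traces]
    costs = [sum(s["cost_usd"] for s in t["spans"]) for t in traces]
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p95_ms": percentile(durations, 95),
        "avg_cost_usd": sum(costs) / n,
        "degradation_rate": sum(t["status"] == "partial" for t in traces) / n,
    }


# Tiny made-up sample, only to show the expected input shape.
sample = [
    {"duration_ms": 100, "status": "ok", "spans": [{"cost_usd": 0.01}, {"cost_usd": 0.02}]},
    {"duration_ms": 200, "status": "ok", "spans": [{"cost_usd": 0.03}]},
    {"duration_ms": 300, "status": "partial", "spans": [{"cost_usd": 0.03}]},
    {"duration_ms": 400, "status": "ok", "spans": [{"cost_usd": 0.03}]},
]
summary = summarize_arm(sample)
```

Run `summarize_arm` once on the control traces and once on the variant traces, then compare the two dictionaries metric by metric.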
Offline eval measures the ceiling of quality (could the change be better?). A/B measures the real behavior (is the change actually better with real users?). Both are necessary; neither replaces the other.
On the right, two strategies to evaluate the same change to a chain. Pick the one you'll use in production.