Naveo


A7 A/B


Your ticket-processing chain is running in production. orbit proposed a change to step 2 (extract) that, in theory, reduces hallucinations. There are two strategies for evaluating whether it's worth deploying to all traffic. Which one do you pick?


Look for: closed contract, explicit fallback, scaffold at the end.


The only honest method to compare two versions

When you change something in an LLM chain (new prompt, different model, reordered step), how do you know if the change is good? The natural answer is "I deploy and watch the metrics". The problem: comparing "before" to "after" measures much more than your change.

What changes from one week to the next

Traffic changes (seasonality, events, day of week).
Other teams change other systems.
Providers silently update models.
User composition changes (new onboardings, churn).
Your own interpretation of metrics changes (confirmation bias).

Any of these can explain the "improvement" or "worsening" you see, even if your change did nothing.

The A/B test eliminates those confounders

You split traffic into two simultaneous groups:

Control (A): the old version. 50% of traffic.
Variant (B): the new version. 50% of traffic.

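Both run at the same time, and each user is assigned to one group by a deterministic hash, so the two groups see the same kind of traffic under the same conditions. The only systematic difference between them is your chain's version. Any difference in metrics beyond random sampling variation can be attributed to your change, not to the world's noise.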

The three rules that make A/B valid

1. Deterministic assignment by hash


arm = hash(user_id || trace_id) % 2 == 0 ? 'control' : 'variant'

Without determinism, a user could see inconsistent responses (sometimes control, sometimes variant), which degrades the experience and contaminates metrics with unexplained variance.
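
The assignment rule above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: it reads `||` as a fallback (use `user_id` when present, otherwise `trace_id`), which is an assumption, and the function name and the `experiment` salt are hypothetical. It uses SHA-256 instead of Python's built-in `hash()`, because the built-in is randomized per process for strings and would not give a stable arm across restarts.

```python
import hashlib
from typing import Optional


def assign_arm(user_id: Optional[str], trace_id: str,
               experiment: str = "extract-v2") -> str:
    """Return 'control' or 'variant', stable for the same key and experiment."""
    # Assumption: prefer user_id so a user always lands in the same arm;
    # fall back to trace_id for traffic without a user.
    key = user_id or trace_id
    # Hypothetical salt: mixing in the experiment name keeps the split of one
    # experiment independent from the split of another.
    digest = hashlib.sha256(f"{experiment}:{key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 2
    return "control" if bucket == 0 else "variant"
```

Calling `assign_arm("user-42", "trace-a")` and `assign_arm("user-42", "trace-b")` returns the same arm, which is the consistency property described above.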

2. Sufficient sample size

To detect a 5% accuracy improvement with statistical confidence, you typically need thousands of events per arm. To detect a 1% improvement, tens of thousands, because the required sample grows roughly with the inverse square of the effect you want to detect. An A/B with 30 events tells you nothing useful.
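
To make the sample-size rule concrete, here is a rough estimate using the standard normal-approximation formula for comparing two proportions. The baseline pass rate, the 5% significance level, and the 80% power are assumptions chosen for illustration; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def events_per_arm(p_control: float, p_variant: float,
                   alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate events needed in EACH arm for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Assumed baseline: 80% of requests pass the eval.
print(events_per_arm(0.80, 0.85))  # 5-point lift
print(events_per_arm(0.80, 0.81))  # 1-point lift
```

Under these assumptions a 5-point lift needs roughly 900 events per arm and a 1-point lift roughly 25,000. A different baseline, a relative rather than absolute lift, or a higher power shifts these numbers, but the inverse-square growth stays.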

3. Pre-agreed metrics

Before seeing any results, define what you're going to measure and what threshold counts as a "winner". If you decide afterwards, you'll find the metric that confirms what you already want to believe. This is a form of p-hacking, and it ruins otherwise serious A/Bs.
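
One way to hold yourself to this rule is to write the plan down as data before the experiment starts and let code apply it. The sketch below is illustrative only: `ExperimentPlan`, its fields, and `decide` are hypothetical names, and the two-proportion z-test is one common choice of test, not something this lesson prescribes.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist


@dataclass(frozen=True)  # frozen: the plan cannot be edited after it is written
class ExperimentPlan:
    primary_metric: str      # e.g. "eval_pass_rate"
    min_lift: float          # smallest absolute improvement worth shipping
    alpha: float             # significance level
    min_events_per_arm: int  # do not evaluate before reaching this


def decide(plan: ExperimentPlan, control_pass: int, control_n: int,
           variant_pass: int, variant_n: int) -> str:
    if min(control_n, variant_n) < plan.min_events_per_arm:
        return "not enough data"
    p_c = control_pass / control_n
    p_v = variant_pass / variant_n
    # Pooled two-proportion z-test (two-sided).
    pooled = (control_pass + variant_pass) / (control_n + variant_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    if se == 0:
        return "no winner"
    z = (p_v - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    if p_value < plan.alpha and (p_v - p_c) >= plan.min_lift:
        return "ship variant"
    return "no winner"


plan = ExperimentPlan("eval_pass_rate", min_lift=0.02, alpha=0.05,
                      min_events_per_arm=1000)
print(decide(plan, 800, 1000, 860, 1000))
```

Because the threshold and the metric live in the plan, the decision cannot quietly drift toward whichever number looks best after the fact.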

Typical metrics for LLM chain A/Bs

Metric	How to measure
Eval pass rate	Run the eval set against each arm, % of cases passing
Latency p50/p95	Trace duration_ms, 50th and 95th percentiles
Average cost per request	Sum cost_usd across each trace's spans, then average across traces
Degradation rate	% of traces with status `partial`
Satisfaction (proxy)	User thumbs up/down, follow-up rate
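
As an illustration of the table above, the following sketch computes the latency, cost, and degradation metrics for one arm. The trace schema is an assumption: each trace is taken to be a mapping with `duration_ms`, `status`, and a `spans` list whose items carry `cost_usd`. Only those field names come from the table; the layout and the function name are hypothetical.

```python
from statistics import mean, quantiles
from typing import Any, Dict, Mapping, Sequence


def summarize_arm(traces: Sequence[Mapping[str, Any]]) -> Dict[str, float]:
    """Summarize one arm; expects at least two traces."""
    durations = [t["duration_ms"] for t in traces]
    # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = quantiles(durations, n=100, method="inclusive")
    # Cost of a request = sum of the cost of its spans.
    costs = [sum(s["cost_usd"] for s in t["spans"]) for t in traces]
    partial = sum(1 for t in traces if t["status"] == "partial")
    return {
        "latency_p50_ms": cuts[49],
        "latency_p95_ms": cuts[94],
        "avg_cost_usd": mean(costs),
        "degradation_rate": partial / len(traces),
    }
```

Run it once on the control traces and once on the variant traces, then compare the two summaries against the thresholds you agreed on beforehand.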

When NOT to use A/B

Security changes. If you patched a vulnerability, don't leave it in control 50% of the time. Full deploy + monitoring.
Obviously better changes. Fixing a bug? Just deploy.
Small traffic. If your system processes 100 requests per week, an A/B won't reach significance. Use offline eval.

Offline eval measures the ceiling of quality (could the change be better?). A/B measures the real behavior (is the change actually better with real users?). Both are necessary; neither replaces the other.

Your exercise

On the right are two strategies for evaluating the same change to a chain. Pick the one you'd use in production.