Naveo

A7 A/B

Your system processes 100,000 messages per day. Most are trivial (acks, "ok", emojis, "thanks"); a minority (~5%) need a real, elaborate response. Two architectures. Which one do you ship?

The most universal cost-optimization trick

When a system handles high volume, model cost tends to become the largest line item on the bill. The two naive answers:

Always use the big model. High quality, disproportionate cost.
Always use the small model. Low cost, insufficient quality on the hard cases.

Neither answer works at scale. The one that does: cascade. Small model upfront to classify, big model only for cases that need it.

What it looks like

    [message]
        ↓
   [SMALL model]  → is it trivial?
   ↙           ↘
trivial      [BIG model]
reply            ↓
              elaborate
              response

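Step 1 processes 100% of traffic, but at 1/30 of the big model's cost. Step 2 only processes the 5-20% that the small model flagged as "needs help". Relative to sending everything to the big model, cost per message is roughly 1/30 + p, where p is the escalated fraction: about 0.08 at 5% and 0.23 at 20%, so aggregate cost drops roughly 4-12×. Average latency also improves because most messages are resolved at step 1.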
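A back-of-the-envelope check of that saving, as a minimal sketch. It assumes the 1/30 ratio applies per message and that the big model's cost per message is normalized to 1.0; `relative_cost` is a name made up for this example. Real savings also depend on token counts per call, which this ignores.

```python
def relative_cost(escalated_fraction: float, small_cost_ratio: float = 1 / 30) -> float:
    """Cascade cost per message, relative to sending everything to the big model.

    Every message pays for the small model (small_cost_ratio); only the
    escalated fraction also pays for the big model (normalized to 1.0).
    """
    return small_cost_ratio + escalated_fraction


# The two ends of the 5-20% escalation range from the text.
for p in (0.05, 0.20):
    cost = relative_cost(p)
    print(f"escalated={p:.0%}  relative cost={cost:.3f}  saving={1 / cost:.1f}x")
```

Plug in your own escalation rate and price ratio: the saving is capped by the escalated fraction, so a classifier that escalates too much erases the benefit quickly.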

Why it works

Small models are great at classifying and bad at generating. That asymmetry is exactly what you need:

Classifying "trivial vs needs_response" is a binary decision. Small model, short prompt, 100 tokens, 80ms.
Generating an elaborate response needs real capability. Big model, long context, 500-2000ms.

Each model does the job it is best at, and the big one stops paying a toll on the 95,000 daily invocations where it adds nothing.
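The routing itself can be sketched in a few lines. Everything here is a stand-in: `small_model_is_trivial` and `big_model_respond` are hypothetical placeholders for real model calls, and the canned reply is an assumption, not something the lesson prescribes.

```python
from typing import Callable

TRIVIAL_REPLY = "👍"  # hypothetical canned reply for trivial messages


def small_model_is_trivial(message: str) -> bool:
    """Stand-in for the small model; a real system would call a cheap LLM here."""
    return message.strip().lower() in {"ok", "thanks", "ack", "👍"}


def big_model_respond(message: str) -> str:
    """Stand-in for the big model; a real system would call the expensive LLM here."""
    return f"[elaborate response to: {message}]"


def cascade(
    message: str,
    is_trivial: Callable[[str], bool] = small_model_is_trivial,
    respond: Callable[[str], str] = big_model_respond,
) -> str:
    """Step 1 classifies every message; step 2 runs only on the non-trivial ones."""
    if is_trivial(message):
        return TRIVIAL_REPLY
    return respond(message)


print(cascade("thanks"))
print(cascade("Why was my invoice charged twice?"))
```

Passing the two models in as callables keeps the router testable: you can swap in a fake classifier and count how often the big model is actually invoked.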

Variants you'll see

N-level cascade. small → medium → big. Each level only escalates what it couldn't resolve. Useful when there are three clear complexity bands.
Confidence cascade. Step 1 returns its decision + a score. If the score is high, you accept it. If low, you escalate to step 2.
Per-task cascade. Small model for summaries, big model only for creative generation. Different models per task, not per difficulty.
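The first two variants combine naturally: each level returns an answer plus a confidence score, and you escalate only when the score is too low. A minimal sketch, assuming scores in [0, 1] and a threshold of 0.8 picked only for illustration; the three levels are toy stand-ins for real models.

```python
from typing import Callable, NamedTuple, Sequence


class Result(NamedTuple):
    answer: str
    confidence: float  # assumed to be in [0, 1]


Level = Callable[[str], Result]


def run_cascade(message: str, levels: Sequence[Level], threshold: float = 0.8) -> tuple[str, int]:
    """Try levels in order; accept the first answer that meets the threshold.

    The last level is always accepted, since there is nothing left to
    escalate to. Returns (answer, index of the level that answered).
    """
    if not levels:
        raise ValueError("at least one level is required")
    for index, level in enumerate(levels):
        result = level(message)
        is_last = index == len(levels) - 1
        if result.confidence >= threshold or is_last:
            return result.answer, index
    raise AssertionError("unreachable: the last level always returns")


# Toy levels: confidence drops as the message gets longer (an assumption).
def small(message: str) -> Result:
    return Result("ack", 0.95 if len(message) < 10 else 0.3)


def medium(message: str) -> Result:
    return Result(f"short answer to: {message}", 0.9 if len(message) < 60 else 0.4)


def big(message: str) -> Result:
    return Result(f"elaborate answer to: {message}", 0.99)


for text in ("thanks", "How do I reset my password?", "x" * 80):
    answer, level_index = run_cascade(text, [small, medium, big])
    print(level_index, answer[:40])
```

The threshold is the knob to tune: raise it and more traffic escalates (higher cost, higher quality); lower it and the opposite happens.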

The error to avoid

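If the classifier is bad (less than 85% accuracy), you lose almost all the saving and some of the quality. Taking "trivial" as the positive class: false negatives (trivial messages escalated by mistake) pay for the big model anyway, and false positives (hard messages marked trivial) get a canned reply and drop response quality. Measuring the classifier with an eval set before building the cascade is mandatory, not optional.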
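What that measurement might look like, as a sketch. The eval set, the length-based `toy_classifier`, and the metric names are all invented for illustration; "trivial" is treated as the positive class, and 0.85 is the accuracy bar quoted above.

```python
from typing import Callable, Iterable

MIN_ACCURACY = 0.85  # the bar quoted in the text


def evaluate_classifier(
    classify: Callable[[str], bool],
    eval_set: Iterable[tuple[str, bool]],
) -> dict[str, float]:
    """Score a trivial-vs-needs-response classifier on labeled (message, is_trivial) pairs."""
    correct = wrongly_trivial = wrongly_escalated = total = 0
    for message, is_trivial in eval_set:
        predicted_trivial = classify(message)
        total += 1
        if predicted_trivial == is_trivial:
            correct += 1
        elif predicted_trivial:
            wrongly_trivial += 1  # hard message marked trivial: quality loss
        else:
            wrongly_escalated += 1  # trivial message sent to the big model: wasted cost
    if total == 0:
        raise ValueError("eval set is empty")
    return {
        "accuracy": correct / total,
        "wrongly_trivial_rate": wrongly_trivial / total,
        "wrongly_escalated_rate": wrongly_escalated / total,
    }


# Hypothetical hand-labeled eval set; a real one should have hundreds of examples.
EVAL_SET = [
    ("ok", True),
    ("thanks!", True),
    ("👍", True),
    ("sure", True),
    ("help", False),
    ("Can you explain the refund policy?", False),
    ("Why did my deploy fail?", False),
]


def toy_classifier(message: str) -> bool:
    """Stand-in for the small model: treats very short messages as trivial."""
    return len(message) <= 7


metrics = evaluate_classifier(toy_classifier, EVAL_SET)
print(metrics)
print("ship the cascade" if metrics["accuracy"] >= MIN_ACCURACY else "fix the classifier first")
```

Tracking the two error types separately matters because they cost different things: one shows up on the bill, the other shows up in user complaints.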

Your exercise

On the right, two architectures for the same volume. One sends everything to the big model. The other uses cascade. Pick which one you ship.