When a system processes volume, model cost becomes the largest line on the balance sheet. The two naive answers:
Neither answer works at scale. The one that does: cascade. Small model upfront to classify, big model only for cases that need it.
[message]
↓
[SMALL model] → is it trivial?
↙ ↘
trivial [BIG model]
reply ↓
elaborate
responseStep 1 processes 100% of traffic, but at 1/30 of the big model's cost. Step 2 only processes the 5-20% the small one flagged as "needs help". Aggregate cost drops 5-15×, and average latency improves because most are resolved at step 1.
Small models are great at classifying and bad at generating. That asymmetry is exactly what you need:
You put each model to do what it does best. And the big one stops paying tolls for the 95,000 invocations where it added nothing.
If the classifier is bad (less than 85% accuracy), you lose almost all the saving: false negatives pay for the big model anyway, and false positives drop response quality. Measuring the classifier with an eval set before building the cascade is mandatory, not optional.
On the right, two architectures for the same volume. One sends everything to the big model. The other uses cascade. Pick which one you ship.