Once your system grows (a 3-step chain, a router with 4 routes, RAG with an honest generate step), you can't test it mentally anymore. You need a set of test cases that demonstrates the system does what you promise.
That set is the eval set. And the skill of writing a good one is what separates the LLM engineer from the demo-builder.
An eval set isn't a handful of random cases. It's a curated selection of inputs covering:
Each case carries a comment: what it's there for. Without comments, when a case fails 6 months from now, you won't know if it was a control or an edge case you consciously decided not to support.
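One way to keep the comment attached to the case is to store it as a field next to the input and the expected label, so a failure report can print it. The sketch below is a minimal illustration, not a prescribed format: the `EvalCase` type, the `kind` tags, the `failures` helper, the two sample inputs, and the `always_other` stand-in classifier are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    text: str      # input sent to the system under test
    expected: str  # label we promise the system returns
    kind: str      # hypothetical tag: "control" or "edge"
    comment: str   # why this case exists


# Hypothetical sample cases; the inputs are invented for illustration.
CASES = [
    EvalCase(
        text="The handrail on the stairs is loose",
        expected="safety",
        kind="control",
        comment="Unambiguous hazard; must never regress.",
    ),
    EvalCase(
        text="Lightbulb out in the hallway, it is pitch dark",
        expected="maintenance",
        kind="edge",
        comment="Could be read as safety; we decided on maintenance.",
    ),
]


def failures(cases: List[EvalCase], classify: Callable[[str], str]) -> List[str]:
    """Return one readable report line per case the classifier gets wrong."""
    report = []
    for case in cases:
        got = classify(case.text)
        if got != case.expected:
            report.append(
                f"[{case.kind}] {case.text!r}: expected {case.expected}, "
                f"got {got}. Why it exists: {case.comment}"
            )
    return report


def always_other(text: str) -> str:
    """Stand-in classifier (an assumption) that labels everything 'other'."""
    return "other"


if __name__ == "__main__":
    for line in failures(CASES, always_other):
        print(line)
```

Because the comment travels with the case, whoever reads a failing line months later sees immediately whether a control broke or a known edge case drifted.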
Echo asks for an eval set for the step 02 classifier (4 categories: safety, maintenance, social, other).
At least 10 cases, distributed:
A comment explaining what each case evaluates.
6 LLM-judge criteria on the set's quality:
The mental test: if I, reading only your eval set, could reconstruct what the classifier does and where it struggles, you passed. If it looks like "10 random inputs with labels", you didn't.
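The structural parts of the brief (at least 10 cases, a label from the four categories, a comment on every case) can be checked mechanically before anyone judges quality. The sketch below assumes each case is a plain dict with `text`, `expected`, and `comment` keys; those key names and the `lint_eval_set` function are hypothetical, and the rule that every category needs at least one case is an assumption, not something the brief states. It says nothing about whether the set passes the mental test.

```python
from collections import Counter
from typing import Dict, List

# The four categories of the classifier, as named in the brief.
CATEGORIES = {"safety", "maintenance", "social", "other"}
MIN_CASES = 10  # "At least 10 cases"


def lint_eval_set(cases: List[Dict[str, str]]) -> List[str]:
    """Return a list of structural problems; an empty list means none found."""
    problems = []

    if len(cases) < MIN_CASES:
        problems.append(f"only {len(cases)} cases, need at least {MIN_CASES}")

    for i, case in enumerate(cases):
        if case.get("expected") not in CATEGORIES:
            problems.append(f"case {i}: unknown label {case.get('expected')!r}")
        # Treat a missing, None, or whitespace-only comment as absent.
        if not (case.get("comment") or "").strip():
            problems.append(f"case {i}: missing comment")

    # Assumption: every category should appear at least once.
    counts = Counter(
        case.get("expected") for case in cases if case.get("expected") in CATEGORIES
    )
    for category in sorted(CATEGORIES - set(counts)):
        problems.append(f"no case for category {category!r}")

    return problems
```

A set that passes this lint can still be "10 random inputs with labels"; the lint only clears away formal failures so the review can focus on coverage and on the reasoning in the comments.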