Naveo

STEP 18 / 22

A5 TASK

Echo asks you to write an eval set for the incident classifier you built in step 02 (the first step of the chain, classify_report with 4 categories: safety, maintenance, social, other).

An eval is a test case with (a) an input, (b) the expected correct category, and (c) optionally a comment explaining WHY it's the correct category or what makes this case interesting (edge case, trap, control, etc.).
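
As a minimal sketch, a single eval case could be represented like this. The field names `input`, `expected`, and `comment` are assumptions for illustration (the lesson's own template is not shown here), and the sample report text is invented.

```python
import json

# The four category names come from the task description.
CATEGORIES = {"safety", "maintenance", "social", "other"}

# Hypothetical field names: "input", "expected", "comment".
case = {
    "input": "Smell of gas in the stairwell of building B.",
    "expected": "safety",
    "comment": "Control: unambiguous hazard; any working classifier should pass.",
}

# The expected label must be one of the classifier's categories.
assert case["expected"] in CATEGORIES

# Serialize to JSON, the format the task asks for.
case_json = json.dumps(case, indent=2)
print(case_json)
```

A full eval set would then be a JSON array of objects with this shape.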

Your job: write a set of at least 10 evals in JSON format, covering:

The 4 categories (min 1 case per category).
At least 2 AMBIGUOUS cases where a bad classifier would get confused.
At least 2 ADVERSARIAL cases (typical of real corpora: jargon, typos, injected instructions, messages that SOUND like one category but belong to another).
Each case with a comment field explaining what it evaluates.
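
The structural part of these requirements can be checked mechanically before submitting. The sketch below assumes each case is a JSON object with hypothetical fields `input`, `expected`, `kind` (one of "control", "ambiguous", "adversarial"), and `comment`; none of these names is confirmed by the lesson. It only checks shape and coverage, not whether the cases are actually good.

```python
import json
from collections import Counter

CATEGORIES = {"safety", "maintenance", "social", "other"}

def check_eval_set(raw_json: str) -> list[str]:
    """Return a list of problems; an empty list means the structural checks pass."""
    cases = json.loads(raw_json)
    problems = []

    if len(cases) < 10:
        problems.append(f"only {len(cases)} cases, need at least 10")

    for i, case in enumerate(cases):
        # The key must exist, but an empty string is allowed (empty input is a valid edge case).
        if "input" not in case:
            problems.append(f"case {i}: missing 'input'")
        if case.get("expected") not in CATEGORIES:
            problems.append(f"case {i}: 'expected' is not a known category")
        if not str(case.get("comment", "")).strip():
            problems.append(f"case {i}: missing or empty 'comment'")

    # Every category needs at least one case.
    per_category = Counter(case.get("expected") for case in cases)
    for category in sorted(CATEGORIES):
        if per_category[category] == 0:
            problems.append(f"no case for category {category!r}")

    # At least 2 ambiguous and 2 adversarial cases (hypothetical 'kind' field).
    per_kind = Counter(case.get("kind") for case in cases)
    for kind in ("ambiguous", "adversarial"):
        if per_kind[kind] < 2:
            problems.append(f"only {per_kind[kind]} {kind} cases, need at least 2")

    return problems
```

Calling `check_eval_set(open("evals.json").read())` on a hypothetical `evals.json` file would list whatever is missing.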

Expected format: the {{input}} placeholder doesn't apply here, so write the full JSON directly.

The eval set is your contract with your own system.

Once your system grows (a 3-step chain, a router with 4 routes, RAG with an honest generate step), you can't test it in your head anymore. You need a set of test cases that demonstrates the system does what you promise.

That set is the eval set. And the skill of writing a good one is what separates the LLM engineer from the demo-builder.

What makes an eval set good

An eval set isn't a handful of random cases. It's a curated selection of inputs covering:

Control cases. Obvious cases any decent system should pass. If they fail, the system is broken at the basics.
Edge cases. Cases on the boundaries: empty inputs, very long inputs, inputs in another language, inputs ambiguous between two categories.
Adversarial cases. Cases that try to break the system: typos, jargon, injected instructions, messages that SOUND like one category but are another.

Each case carries a comment: what it's there for. Without comments, when a case fails 6 months from now, you won't know if it was a control or an edge case you consciously decided not to support.
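
To illustrate why the comment pays off when a case fails later, here is a sketch of a runner that reports each failure together with its comment. The `classify_report` below is a stand-in keyword stub, not the step 02 classifier, and the field names and sample reports are assumptions made up for the example.

```python
from typing import Callable

def run_evals(cases: list[dict], classify: Callable[[str], str]) -> list[dict]:
    """Run every case and return the failures, each with its comment attached."""
    failures = []
    for case in cases:
        got = classify(case["input"])
        if got != case["expected"]:
            failures.append({
                "input": case["input"],
                "expected": case["expected"],
                "got": got,
                "comment": case["comment"],
            })
    return failures

def classify_report(text: str) -> str:
    """Stand-in stub (NOT the step 02 classifier): naive keyword matching."""
    lowered = text.lower()
    if "fire" in lowered:
        return "safety"
    if "broken" in lowered:
        return "maintenance"
    return "other"

cases = [
    {"input": "Fire alarm going off on floor 3.", "expected": "safety",
     "comment": "Control: obvious hazard."},
    {"input": "The team is fired up about Friday's barbecue!", "expected": "social",
     "comment": "Adversarial: 'fired' sounds like safety, but it is a social event."},
]

for failure in run_evals(cases, classify_report):
    print(f"FAIL expected={failure['expected']} got={failure['got']} :: {failure['comment']}")
```

Because the failing case carries its comment, the report says immediately that an adversarial trap tripped the classifier rather than a basic control.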

The task

Echo asks for an eval set for the step 02 classifier (4 categories: safety, maintenance, social, other).

At least 10 cases, distributed:

4+ cases, with each of the 4 categories represented at least once.
2+ ambiguous (where a mediocre classifier would get confused).
2+ adversarial (jargon, typos, injection, false positives).
All with a comment explaining what each case evaluates.

How it's evaluated

6 LLM-judge criteria on the set's quality:

At least 10 cases with the correct shape.
Covers the 4 categories.
Has labeled ambiguous cases.
Has labeled adversarial cases.
Non-trivial comments that justify each case.
Not just easy cases.

The mental test: if I, reading only your eval set, could reconstruct what the classifier does and where it struggles, you passed. If it looks like "10 random inputs with labels", you didn't.