Naveo

STEP 19 / 22

A5 TASK

YOUR PROMPT · 1 CASES

Echo asks you to write the system prompt for an LLM-judge that evaluates the quality of answers from the generate step of the RAG you built in step 05.

The judge receives 3 things:

The user's original question.
The snippets that the retrieve step pulled from the manual.
The answer that the generate step produced.

And returns a verdict JSON with 4 criteria:

grounded: boolean. The answer is based ONLY on the snippets, with no external information added.
cites_sources: boolean. The answer cites the sources of the snippets it used.
acknowledges_gaps: boolean. If the snippets don't answer the question, the answer says so explicitly (instead of inventing one).
clarity: number. From 1 to 5, how clear and concise the answer is.

Your job: write the system prompt that produces that verdict.

Use {{input}} where the material to evaluate should go. {{input}} stands for the whole block: the question, the snippets, and the generated answer, concatenated with clear tags.
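To make the shape of that block concrete, here is a minimal sketch of how the three parts might be tagged and substituted into a prompt template. The tag names (`<question>`, `<snippets>`, `<answer>`), the snippet dictionary keys, and both function names are assumptions for illustration; the lesson does not specify them.

```python
def build_input(question: str, snippets: list[dict], answer: str) -> str:
    """Concatenate question, snippets, and answer into one tagged block.

    Tag names and the "source"/"text" keys are hypothetical.
    """
    snippet_lines = "\n".join(
        f'<snippet source="{s["source"]}">{s["text"]}</snippet>' for s in snippets
    )
    return (
        f"<question>\n{question}\n</question>\n"
        f"<snippets>\n{snippet_lines}\n</snippets>\n"
        f"<answer>\n{answer}\n</answer>"
    )


def render_prompt(template: str, block: str) -> str:
    """Replace the {{input}} placeholder with the tagged block."""
    return template.replace("{{input}}", block)
```

Clear tags let the judge tell apart what the user asked, what the manual says, and what the generate step claimed, which is what the grounded criterion depends on.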




The LLM-judge: your non-deterministic evaluator

For tasks with a single right answer, you evaluate with deterministic checks (regex, JSON parse, etc.). For open tasks (generation, summarization, RAG answers) there's no single correct answer. You need an evaluator that understands quality.
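For contrast, a deterministic check could look like the sketch below. Both function names and the date pattern are illustrative assumptions, not part of the lesson.

```python
import json
import re


def is_valid_json(text: str) -> bool:
    """Deterministic check: does the output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def matches_iso_date(text: str) -> bool:
    """Deterministic check: is the output exactly a YYYY-MM-DD shaped string?"""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) is not None
```

Checks like these give the same result on every run, but they cannot say whether a free-text answer is well grounded or clear.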

The LLM-judge is that evaluator: an LLM with a carefully designed system prompt that evaluates the outputs of another LLM against criteria you define.

It's meta. Yes. You're using an LLM to evaluate an LLM. It works because checking an answer against quality criteria (is it well grounded? does it cite sources? does it acknowledge limits?) is easier than generating that answer.

When to use an LLM-judge

When there's no single right answer. Text generation, summarization, answers to open questions.
When criteria are readable but not programmable. "The answer is consistent with the snippets" is readable but hard to check with regex.
When you can calibrate. You take 30-50 cases, label them manually, and compare the judge's verdicts to yours. If the judge agrees with your labels on more than 80% of cases, it works. If not, refine the system prompt.
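The calibration step can be sketched as an agreement rate per criterion. This is a minimal illustration that assumes human labels and judge verdicts are stored as parallel lists of dictionaries; the function name and data layout are hypothetical.

```python
def agreement_rate(human: list[dict], judge: list[dict], key: str) -> float:
    """Fraction of cases where the judge's value for `key` equals the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(1 for h, j in zip(human, judge) if h[key] == j[key])
    return matches / len(human)


# Toy data: four labeled cases, one disagreement on "grounded".
human_labels = [{"grounded": True}, {"grounded": False}, {"grounded": True}, {"grounded": True}]
judge_verdicts = [{"grounded": True}, {"grounded": False}, {"grounded": False}, {"grounded": True}]

rate = agreement_rate(human_labels, judge_verdicts, "grounded")  # 0.75, below the 80% bar
```

Exact match is a strict test for the 1-5 clarity score; whether to accept a one-point difference there is a design choice the lesson leaves open.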

The task

Write the system prompt for the judge that evaluates the RAG's generate. Four criteria:

grounded: boolean. Is the answer based on the snippets?
cites_sources: boolean. Does it cite the sources?
acknowledges_gaps: boolean. Does it acknowledge when it lacks info?
clarity: number. 1-5, with anchors (a short description of what each score means).

Plus a rationale field explaining the verdicts.

The rationale rule

A judge without a rationale is a black box. A judge with a rationale is debuggable. When you calibrate against humans, the rationale tells you why the judge got a case wrong, so you can adjust the system prompt in the right direction.

How it's evaluated

An LLM-judge scores your system prompt on 6 criteria:

Defines grounded with a concrete criterion.
Defines cites_sources clearly.
Defines acknowledges_gaps with the "if no info, say so" rule.
Defines the clarity scale with anchors.
Asks for rationale in the output.
Instructs raw JSON output (no preamble, no markdown).
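The raw JSON requirement matters because the verdict is normally parsed by code. Below is a minimal sketch of a parser that checks the four criteria and the rationale. The function name is hypothetical, and treating clarity as an integer is an assumption (the task only says "number").

```python
import json

BOOLEAN_FIELDS = ("grounded", "cites_sources", "acknowledges_gaps")


def parse_verdict(raw: str) -> dict:
    """Parse and validate the judge's output; raise ValueError if malformed."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass) when the
    # output has a preamble or a markdown fence around the JSON object.
    verdict = json.loads(raw)
    if not isinstance(verdict, dict):
        raise ValueError("verdict must be a JSON object")
    for field in BOOLEAN_FIELDS:
        if not isinstance(verdict.get(field), bool):
            raise ValueError(f"{field} must be a boolean")
    clarity = verdict.get("clarity")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(clarity, bool) or not isinstance(clarity, int) or not 1 <= clarity <= 5:
        raise ValueError("clarity must be an integer from 1 to 5")
    rationale = verdict.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    return verdict
```

A verdict that fails this check points back at the system prompt: either the output format instruction or one of the criterion definitions needs tightening.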