For tasks with a single right answer, you evaluate with deterministic checks (regex, JSON parse, etc.). For open tasks. generation, summarization, RAG answers. there's no single correct answer. You need an evaluator that understands quality.
The LLM-judge is that: an LLM with a carefully designed system prompt that evaluates outputs of another LLM against criteria you define.
It's meta. Yes. You're using an LLM to evaluate an LLM. It works because quality criteria (well-grounded? cites sources? acknowledges limits?) are easier to judge than to generate.
Write the system prompt for the judge that evaluates the RAG's generate. Four criteria:
grounded. boolean. Is the answer based on the snippets?cites_sources. boolean. Does it cite the sources?acknowledges_gaps. boolean. Does it acknowledge when it lacks info?clarity. 1-5 with anchors.Plus a rationale field explaining the verdicts.
A judge without a rationale is a black box. A judge with a rationale is debuggable. When you calibrate against humans, the rationale tells you why it failed. so you can adjust the system prompt in the right direction.
6 LLM-judge criteria on your system prompt:
grounded with a concrete criterion.cites_sources clearly.acknowledges_gaps with the "if no info, say so" rule.clarity scale with anchors.rationale in the output.