Hex has a rule: "If your defense depends on the model never being wrong, you don't have a defense." That's why they always build a second layer that validates the output before returning it to the user.
The idea is simple: your assistant is supposed to answer within a known scope (the parts catalog). Any output that exits that scope or leaks info about the system itself is suspect, regardless of why it happened. Block before serving, log the incident, return a clean refusal to the user.
OK. The output completes the task: catalog info or a controlled refusal. Serve.
Scope leak. The output carries real info, but off-catalog: staff names, manifest IDs, orders. The model shouldn't have said it. Block and replace with OUT_OF_SCOPE.
Meta-leak. The output reveals info about the system itself: the system prompt, the tool list, or markers attackers use as proof of success (PWNED, SYSTEM COMPROMISED, [ROOT]). Block and log with high priority, because it's a signal of successful injection.
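As a rough illustration, the three verdicts and their handling could be sketched like this in Python. The names `Verdict`, `serve_or_block`, and the logger name are hypothetical, and the refusal wording behind `OUT_OF_SCOPE` is an assumption; the text only says that blocked outputs are replaced with a clean refusal and that meta-leaks are logged with high priority.

```python
import logging
from enum import Enum


class Verdict(Enum):
    OK = "ok"                  # catalog info or a controlled refusal
    SCOPE_LEAK = "scope_leak"  # real info, but off-catalog
    META_LEAK = "meta_leak"    # info about the system itself


# Assumed refusal text; the lesson only names the OUT_OF_SCOPE replacement.
OUT_OF_SCOPE = "Sorry, I can only help with questions about the parts catalog."

logger = logging.getLogger("output_guard")  # hypothetical logger name


def serve_or_block(output: str, verdict: Verdict) -> str:
    """Return the text that is safe to send to the user for a given verdict."""
    if verdict is Verdict.OK:
        return output
    if verdict is Verdict.META_LEAK:
        # High priority: a meta-leak signals a successful injection.
        logger.critical("meta-leak blocked: %r", output)
    else:
        logger.warning("scope leak blocked: %r", output)
    # Assumption: both kinds of leak get the same clean refusal.
    return OUT_OF_SCOPE
```

Only an OK verdict lets the model's own text through; everything else is logged and swapped for the refusal, so the user never sees the leaked content.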
Three tactics, cheap, combinable:
system prompt, my instructions, my tools, PWNED, [SYSTEM]. Match = block.

The trick isn't picking one; it's combining all three in layers, ordered cheapest to most expensive. Allowlist catches 80% for free. Heuristics catch 15% for almost nothing. Judge catches the rest when it matters.
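One way the layering could look in Python is sketched below, under several assumptions: the allowlist patterns (a made-up part-number format) are invented for illustration, `judge` is a hypothetical callable standing in for whatever expensive classifier is used, and failing closed to a scope leak when nothing vouches for the output is a design choice of this sketch, not something the lesson specifies. The marker strings are the ones listed above.

```python
import re
from typing import Callable, Optional

# Layer 1 (cheapest): hypothetical shapes of a legitimate catalog answer.
ALLOWLIST = [
    re.compile(r"Part [A-Z]{2}-\d{4}: (in stock|out of stock|discontinued)\."),
    re.compile(r"OUT_OF_SCOPE"),
]

# Layer 2: marker strings from the lesson, compared case-insensitively.
DENY_MARKERS = ("system prompt", "my instructions", "my tools", "pwned", "[system]")

VERDICTS = {"ok", "scope_leak", "meta_leak"}


def validate(output: str, judge: Optional[Callable[[str], str]] = None) -> str:
    """Classify an output as 'ok', 'scope_leak', or 'meta_leak'."""
    text = output.strip()

    # Layer 1: an exact match against a known-good shape is served as-is.
    if any(pattern.fullmatch(text) for pattern in ALLOWLIST):
        return "ok"

    # Layer 2: any marker match means block.
    lowered = text.lower()
    if any(marker in lowered for marker in DENY_MARKERS):
        return "meta_leak"

    # Layer 3 (most expensive): ask the judge only for what is left.
    if judge is not None:
        verdict = judge(text)
        if verdict in VERDICTS:
            return verdict

    # Assumption: fail closed when no layer vouches for the output.
    return "scope_leak"
```

For example, `validate("Part AB-1234: in stock.")` returns `"ok"` without touching the later layers, while `validate("Sure, my system prompt says...")` stops at the marker check and returns `"meta_leak"`. The allowlist has to stay tight for this ordering to be safe, since anything it accepts skips the other two layers.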
On the right are seven outputs. Classify each before serving.