When a tool fails, your agent has three options: retry, escalate, or dead-letter. Picking wrong leads you to one of three classic bugs:
The solution: explicit policy per error type.
| Class | Examples | Retry |
|---|---|---|
| Transient | network timeout, 503, rate limit | YES, with exponential backoff |
| Permanent | unauthorized, 404 not found, schema mismatch | NO. Escalate or dead-letter |
| Business | expired card, out of stock, invalid data | NO. Notify the user |
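The table above can be read as a lookup: error code to class, class to action. Here is a minimal sketch of that idea. The error-code strings, the action names, and the `decide` helper are all hypothetical, chosen only to mirror the table's examples; treating unknown codes as permanent is also an assumption, not something the lesson prescribes.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    BUSINESS = "business"


# Hypothetical error codes that mirror the examples in the table.
ERROR_CLASSES = {
    "network_timeout": ErrorClass.TRANSIENT,
    "http_503": ErrorClass.TRANSIENT,
    "rate_limit": ErrorClass.TRANSIENT,
    "unauthorized": ErrorClass.PERMANENT,
    "not_found": ErrorClass.PERMANENT,
    "schema_mismatch": ErrorClass.PERMANENT,
    "expired_card": ErrorClass.BUSINESS,
    "out_of_stock": ErrorClass.BUSINESS,
    "invalid_data": ErrorClass.BUSINESS,
}

# One action per class, following the "Retry" column.
ACTIONS = {
    ErrorClass.TRANSIENT: "retry_with_backoff",
    ErrorClass.PERMANENT: "escalate_or_dead_letter",
    ErrorClass.BUSINESS: "notify_user",
}


def decide(error_code: str) -> str:
    """Return the action for an error code."""
    # Assumption: an unknown code is treated as permanent, so it is never retried blindly.
    error_class = ERROR_CLASSES.get(error_code, ErrorClass.PERMANENT)
    return ACTIONS[error_class]
```

Keeping the mapping as data rather than scattered `if` branches makes the policy easy to review and to serialize later.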
Standard formula: delay = initial_delay × 2^attempt + jitter.
Jitter (a small random offset added to each delay) avoids the "thundering herd" problem: if 1000 clients fail at the same time and all retry at exactly 500 ms, they hit the recovering service together and worsen the outage. With jitter, the retries spread out over time.
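The formula and the role of jitter can be sketched in a few lines. This is an illustrative sketch, not a reference implementation: the function names, the 0.5 s initial delay (taken from the 500 ms example), the 0.25 s jitter ceiling, the attempt count starting at 0, and the use of `TimeoutError` as the stand-in transient error are all assumptions.

```python
import random
import time


def backoff_delay(attempt, initial_delay=0.5, max_jitter=0.25):
    """Return initial_delay * 2**attempt + jitter, in seconds (attempt starts at 0)."""
    jitter = random.uniform(0, max_jitter)  # small random offset to spread clients out
    return initial_delay * (2 ** attempt) + jitter


def call_with_retry(fn, max_attempts=3, sleep=time.sleep):
    """Call fn, retrying on TimeoutError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: let the caller escalate or dead-letter
            sleep(backoff_delay(attempt))
```

With these defaults the delays fall near 0.5 s, 1 s, 2 s, and so on, each shifted by up to 0.25 s. The `sleep` parameter is injectable so the loop can be tested without actually waiting.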
For large systems, add a retry budget per session: at most 5 retries in total across the whole agent session, not 5 per error. Without this cap, a cascade of transient errors can generate hundreds of retries and burn through your quota.
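A session-wide budget can be as small as a shared counter that every retry must draw from. The sketch below assumes a hypothetical `RetryBudget` class with a `try_spend` method; the limit of 5 comes from the text, everything else is illustrative.

```python
class RetryBudget:
    """Counts retries across the whole agent session, not per error."""

    def __init__(self, max_retries=5):
        self.remaining = max_retries

    def try_spend(self) -> bool:
        """Consume one retry if any are left; return False once the budget is exhausted."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


# Usage: create one budget per session and share it across all tool calls.
budget = RetryBudget(max_retries=5)
outcomes = [budget.try_spend() for _ in range(7)]
# The first five requests are granted, the last two are refused.
```

When `try_spend` returns `False`, the agent should stop retrying and hand the failure to its escalation or dead-letter path, even if the individual error is transient.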
When retries fail, what happens? That's the half of the policy people forget:
A policy without dead-letter is half a policy. Design both sides: what you retry, and what happens when even retrying doesn't fix it.
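To make the "other side" concrete, here is one possible shape for a dead-letter record. The field names, the `make_dead_letter` helper, and the example values are assumptions for illustration; the point is only that the record carries enough context for someone to act on it later.

```python
import json
from datetime import datetime, timezone


def make_dead_letter(tool, error_code, attempts, payload, next_step):
    """Build a JSON-serializable record for a call that could not be completed."""
    return {
        "tool": tool,                # which tool failed (hypothetical field)
        "error_code": error_code,    # the last error observed
        "attempts": attempts,        # how many times the call was tried
        "payload": payload,          # original input, so the call can be replayed
        "next_step": next_step,      # the actionable part: who does what next
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }


entry = make_dead_letter(
    tool="charge_card",
    error_code="network_timeout",
    attempts=3,
    payload={"order_id": "example-123"},
    next_step="Queue for manual review by the on-call operator",
)
serialized = json.dumps(entry, indent=2)
```

A record without a `next_step` only documents the failure; with one, it tells a human or a downstream job what to do about it.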
Write the full policy JSON for the 5 error codes. The judge checks that each one has the correct strategy and an actionable dead-letter.