When everything works, any system looks well-designed. The difference shows when ONE piece fails: does the whole system die, or does it degrade gracefully and stay useful?
Orbit's rule: a well-designed system fails with dignity. The user receives something useful even when an internal piece is broken. Bad design: the user sees a 500 because a secondary analytics call was hanging.
When the system can't decide (classifier down, router lost context), the only sensible exit is to escalate to a human. Forcing a decision without enough information invites silent mis-routing.
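As a minimal sketch of this escalation rule, the router below refuses to guess when the classifier fails or returns nothing. The names `route`, `HUMAN_QUEUE`, and `broken_classifier` are hypothetical and exist only for illustration.

```python
from typing import Callable, Optional

HUMAN_QUEUE = "human"  # Hypothetical name for the escalation destination.


def route(message: str, classify: Callable[[str], Optional[str]]) -> str:
    """Return a queue name, or escalate when the classifier cannot decide."""
    try:
        label = classify(message)
    except Exception:
        label = None  # Classifier down: treat it as "no decision".
    # Never guess a route; no label means a person decides.
    return label if label else HUMAN_QUEUE


def broken_classifier(message: str) -> str:
    # Stand-in for a classifier that is currently down.
    raise RuntimeError("classifier down")
```

Calling `route("Who is on watch?", broken_classifier)` yields the human queue instead of an arbitrary label, which is the point: a visible escalation rather than a silent mis-route.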
When a piece of data is missing but you can answer partially, return what you do have and declare what's missing. "I couldn't query Friday's roster, but here's Thursday's" is infinitely better than "I don't know" or "[inventing a roster that isn't real]".
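One way to make "return what you have and declare what's missing" concrete is to carry the gaps alongside the data. This is a sketch under assumptions: `PartialAnswer`, `fetch_roster`, and the sample names are invented for illustration, and the Friday outage is simulated.

```python
from dataclasses import dataclass, field


@dataclass
class PartialAnswer:
    """What we could retrieve, plus an explicit list of what we could not."""
    data: dict[str, list[str]] = field(default_factory=dict)
    missing: list[str] = field(default_factory=list)


def fetch_roster(day: str) -> list[str]:
    # Hypothetical stand-in for a roster API; Friday simulates an outage.
    if day == "friday":
        raise ConnectionError("roster API unavailable")
    return ["Alvarez", "Chen"]


def answer_roster(days: list[str]) -> PartialAnswer:
    answer = PartialAnswer()
    for day in days:
        try:
            answer.data[day] = fetch_roster(day)
        except ConnectionError:
            # Declare the gap instead of guessing or failing the whole request.
            answer.missing.append(day)
    return answer


def render(answer: PartialAnswer) -> str:
    lines = [f"{day}: {', '.join(names)}" for day, names in answer.data.items()]
    if answer.missing:
        lines.append("I couldn't query: " + ", ".join(answer.missing))
    return "\n".join(lines)
```

Keeping `missing` as a separate field means the disclaimer is generated from the same object as the answer, so the two cannot drift apart.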
When the step is optional or non-critical, the system skips it and continues. Enrichment, cosmetic recommendations, telemetry that decorates the response: anything that doesn't affect the main output can go without noise.
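A possible shape for "skip and continue" is a small wrapper that applies an optional step and falls back to the unmodified response on any failure. `run_optional`, `enrich`, and `add_footer` are hypothetical names, and the enrichment failure is simulated.

```python
import logging
from typing import Callable

logger = logging.getLogger("degradation")


def run_optional(step: Callable[[dict], dict], response: dict) -> dict:
    """Apply a non-critical step; on any failure, return the response unchanged."""
    try:
        return step(response)
    except Exception:
        # Record it for later, but never let decoration break the main output.
        logger.warning("optional step %s skipped", step.__name__, exc_info=True)
        return response


def enrich(response: dict) -> dict:
    # Hypothetical enrichment that is currently failing.
    raise TimeoutError("enrichment service too slow")


def add_footer(response: dict) -> dict:
    # Hypothetical cosmetic step that succeeds.
    return {**response, "footer": "Generated by the assistant"}


response = {"answer": "Thursday's roster: Alvarez, Chen"}
for step in (enrich, add_footer):
    response = run_optional(step, response)
```

After the loop, `response` still holds the answer and the footer; the failed enrichment only left a log entry.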
When the failure is post-response or housekeeping, don't tell the user about it. The user already got value; log it for an internal fix and continue. When a secondary notification, a metric, or a follow-up fails: log, alert on-call if applicable, and don't break the experience.
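The sketch below illustrates the ordering this implies: the reply is built first, housekeeping runs afterwards, and failures are logged and collected for alerting rather than surfaced to the user. All function names here are hypothetical, and the notification failure is simulated.

```python
import logging
from typing import Callable

logger = logging.getLogger("housekeeping")


def notify_supervisor() -> None:
    # Hypothetical follow-up notification that is currently failing.
    raise ConnectionError("notification service unreachable")


def record_metrics() -> None:
    pass  # Hypothetical metrics call that succeeds.


def run_housekeeping(tasks: list[Callable[[], None]]) -> list[str]:
    """Run post-response tasks; return the names of failed ones for alerting."""
    failed = []
    for task in tasks:
        try:
            task()
        except Exception:
            logger.exception("housekeeping task %s failed", task.__name__)
            failed.append(task.__name__)
    return failed


def handle_request(question: str) -> tuple[str, list[str]]:
    reply = f"Answer to: {question}"  # The user's value is produced first.
    failed = run_housekeeping([notify_supervisor, record_metrics])
    # The reply is returned unchanged; `failed` goes to on-call, not to the user.
    return reply, failed
```

In a real service the housekeeping would more likely run in a background job after the response is sent; running it inline here just keeps the example self-contained.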
For each subsystem, ask yourself: if this piece fails, what's the "least bad" thing the system can do? That answer is your degradation mode. Write it into the design before the failure happens.
classifier_down → escalate (preferably to a human)
rag_empty → answer with disclaimer
roster_api_down → answer with disclaimer
enrichment_slow → skip (not critical)
notify_fails → log + alert (user already has value)
Users don't know how many internal pieces are having problems in a day. They know how many times the system was useless. Those two things can be uncorrelated; it depends on how you design degradation.
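The failure-to-mode mapping listed above can also live in code as an explicit table, so the design decision is written down before the failure happens. This is only a sketch: the `Mode` enum and `degradation_mode` helper are hypothetical, and defaulting unknown failures to escalation is an assumption, not something the lesson prescribes.

```python
from enum import Enum


class Mode(Enum):
    ESCALATE = "escalate to a human"
    DISCLAIMER = "answer with disclaimer"
    SKIP = "skip and continue"
    LOG_ALERT = "log and alert, say nothing to the user"


# One entry per subsystem failure, mirroring the list above.
DEGRADATION = {
    "classifier_down": Mode.ESCALATE,
    "rag_empty": Mode.DISCLAIMER,
    "roster_api_down": Mode.DISCLAIMER,
    "enrichment_slow": Mode.SKIP,
    "notify_fails": Mode.LOG_ALERT,
}


def degradation_mode(failure: str) -> Mode:
    # Assumption: a failure nobody planned for is escalated rather than guessed at.
    return DEGRADATION.get(failure, Mode.ESCALATE)
```

A table like this is easy to review: any subsystem without an entry is one whose degradation mode has not been designed yet.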
On the right are five possible failures of the Watch Officer Assistant. Connect each failure to its correct degradation mode. When the graph is complete, you've designed the invisible half of the system: the half that decides whether your system works in production.