This is the protocol you'll run on every system you build from this point on. You're given a vulnerable agent. You're given attack inputs. You harden the prompt until every attack fails and every legitimate question succeeds.
ignore prior instructions). Easy to defend against. and a freebie for your rubric if you get it right.You've built up to this. Every hardened assistant follows roughly this shape:
And two defenses outside the prompt that the judge doesn't test but Atlas expects you to have thought through:
Atlas signs when all five cases pass all four criteria. Reattempts are free. but each one tells you which case still has a hole.
When this passes, you've earned the Security track. The discipline you just trained is the difference between an AI feature that ships and an AI feature that becomes an incident.