A Jailbreak Shouldn't Be a Breach

A model’s refusals are probabilistic. Your security can’t afford to be.

Last weekend, Anthropic disabled all public access to Claude Fable 5 and Mythos 5, worldwide, after the U.S. government issued a directive citing national security, barring any foreign national from using the models. The trigger was a jailbreak posted publicly days earlier. Every other model stayed up; these two switched off.

The reflex is to read this as a policy story, or as a cautionary tale about depending on a model that can vanish overnight. The more useful reading is that if you build on models, it’s crucial to understand where your security actually lives. The controls in question sat inside the model, and controls that live only in the model are ones you can’t see, audit, or enforce. When the last line of defense between an attacker and your users’ data sits in the model, a single jailbreak also means a breach.

The smart approach is to design your security architecture as if every agent can be talked into doing something it shouldn’t, because it can.

Guardrails aren’t enough; you need enforced boundaries

Here’s what reportedly happened. On June 10, a researcher known as “Pliny the Liberator” published a multi-agent jailbreak using unicode obfuscation, long-context manipulation, narrative framing, and decomposition of the request into innocuous-looking pieces. They claimed it pulled restricted content out of Fable 5.

Anthropic disputes that this was a true jailbreak. Its position is that the technique coaxes the model into continuing past its refusals, which is “a well-known limitation in nearly all large language models.”

That isn’t spin. It’s a candid and accurate description of how these systems behave, and it’s the crux of everything that follows. Whether or not this particular instance “counts,” the category holds: refusals are probabilistic, and getting a model to talk past them is a general property of the technology, not a flaw unique to one model or one lab.

A model guardrail is non-deterministic by nature. The same prompt can be refused on Monday and obliged on Tuesday. You can move the probability. You can’t draw a line with a guardrail. Guardrails lower the likelihood of bad output, but they do not enforce a boundary, and a boundary is what security and governance actually require.
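
A back-of-the-envelope calculation shows why moving the probability is not the same as drawing a line. The sketch below assumes each attempt is independent and gets through with a fixed probability; both are simplifications, and the 1% figure is purely illustrative, not a measured bypass rate for any model.

```python
def prob_at_least_one_bypass(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries gets past a
    guardrail that fails with probability `p_single` on each try."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts


if __name__ == "__main__":
    # Hypothetical guardrail that holds 99% of the time on any single attempt.
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} attempts -> {prob_at_least_one_bypass(0.01, n):.3f}")
```

Under those assumptions, a guardrail that fails 1% of the time per attempt gives a patient attacker roughly a 63% chance of success within 100 tries. Better training shrinks the per-attempt number. It does not change the shape of the curve.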

Anthropic’s own design makes the same point. In sensitive domains like cybersecurity and biology, Fable 5 automatically falls back to the more conservative Claude Opus 4.8 to reduce misuse. The user experience is debatable, but from a security perspective it’s notable that Anthropic included this extra layer of protection, and that it still left gaps.

Security failures are structural and demand structural fixes

In a way it would be comforting to treat this as a single dramatic incident. It isn’t. The security community has been converging on the opposite conclusion for a while.

Prompt injection sits at the top of the OWASP Top 10 for LLM Applications and has been the number-one risk for two editions running. In addition, “Excessive Agency” (LLM06) is one of the most expanded entries, covering excessive functionality, permissions, and autonomy. Experts who study these risks tend to have a structural reading: prompt injection is less a bug to be patched and more a standing property of any system that mixes trusted instructions and untrusted data in the same context.

Simon Willison’s “lethal trifecta” explains this so cleanly it’s worth memorizing. Give an agent three things at once — access to private data, exposure to untrusted content, and the ability to communicate externally — and you’ve built an exploit by construction. The poisoned content steers the agent, which reads the sensitive data, and then sends it out the door. No malware or exploit chain needed.

If the failure is structural and probabilistic, “train it harder” is not a fix. The fix has to be a property of the system around the model.

Understanding information hazard vs. excessive agency

Before going further, we need to make an important distinction because the Fable story actually bundles two different problems, and pretending one solution covers both is how vendors lose credibility.

Problem #1, Information hazard. The model emits dangerous knowledge like exploit steps and synthesis routes. This is a model-weights problem. No amount of authorization tooling stops a model from saying something it shouldn’t. If that’s your threat, your answer lives in training, evaluation, and the kind of capability fallback Anthropic built.

Problem #2, Excessive agency. The model is manipulated into taking an action it was never authorized to take: reading a record it shouldn’t, moving money, deleting data, emailing it somewhere. This is a systems problem. And this is where “a jailbreak shouldn’t be a breach” is not a slogan but a literal design property.

The Fable headline is the cinematic version of Problem #1, the information hazard. But for almost everyone shipping agents into production, Problem #2 is the more common and more expensive risk: an agent holding a broad token reads one poisoned web page and acts on it. That’s the risk most teams overlook, and the good news is that it’s the one you can actually engineer away.

What defense in depth looks like for agents

The encouraging part is that this is now an industry direction, not a vendor talking point. Three patterns worth knowing.

Meta’s Agents Rule of Two. Treat the lethal trifecta as a budget. An agent operating without human supervision may satisfy at most two of {processes untrusted input, accesses sensitive data, can change state or communicate externally}. Need all three? Put a human in the loop. It’s deliberately crude, and that’s the point. It reduces severity deterministically rather than hoping the model behaves. The idea is adapted from Chromium’s browser-security rule of the same name.
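
The budget is simple enough to express as a pre-flight check. This is a minimal sketch of the rule as described above, not Meta’s implementation; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSession:
    """Capabilities of one agent session (field names are illustrative)."""
    processes_untrusted_input: bool
    accesses_sensitive_data: bool
    changes_state_or_communicates: bool
    human_in_the_loop: bool = False


def capability_count(session: AgentSession) -> int:
    """How many of the three risky properties this session holds."""
    return sum((
        session.processes_untrusted_input,
        session.accesses_sensitive_data,
        session.changes_state_or_communicates,
    ))


def allowed_by_rule_of_two(session: AgentSession) -> bool:
    """Unsupervised sessions may hold at most two of the three properties."""
    return capability_count(session) <= 2 or session.human_in_the_loop


if __name__ == "__main__":
    # Reads inbound mail, sees private data, and can send mail: all three.
    email_agent = AgentSession(True, True, True)
    print(allowed_by_rule_of_two(email_agent))
```

The check runs before the session starts and never consults the model, which is what makes its answer the same on every run.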

DeepMind’s CaMeL. A research design that defeats prompt injection by construction rather than by training. A privileged model orchestrates the task; a quarantined model handles untrusted data with no tool access; every value carries capability metadata that dictates how it may be used; least privilege is enforced by the system, not the prompt. On the AgentDojo benchmark, it solved 77% of tasks with provable security, versus 84% undefended. That’s strong protection at a modest utility cost.
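
To make “every value carries capability metadata” concrete, here is a toy sketch of the idea. It is not the CaMeL implementation or its API: the Tagged class, the source labels, and the send_email policy are all hypothetical and heavily simplified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    """A value plus metadata about where it came from and who may read it."""
    value: str
    source: str                        # e.g. "user", "drive", "untrusted_web"
    readers: frozenset = frozenset()   # addresses allowed to receive the value


class PolicyViolation(Exception):
    """Raised when a tool call would break the data-flow policy."""


def send_email(to: Tagged, body: Tagged, outbox: list) -> None:
    """Tool wrapper: the policy check runs in ordinary code, not in a prompt."""
    if to.source != "user":
        raise PolicyViolation("recipient must come from the trusted user request")
    if to.value not in body.readers:
        raise PolicyViolation(f"{to.value} may not read this data")
    outbox.append((to.value, body.value))


if __name__ == "__main__":
    outbox = []
    report = Tagged("Q3 numbers", source="drive",
                    readers=frozenset({"cfo@example.com"}))
    # An address that was extracted from a poisoned web page.
    injected = Tagged("attacker@example.net", source="untrusted_web")
    try:
        send_email(injected, report, outbox)
    except PolicyViolation as err:
        print("blocked:", err)
```

However persuasive the injected text is, the recipient it supplies arrives labeled as untrusted, and the check on that label is plain code.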

Authorization at the tool boundary. The model can request an action, but a deterministic layer decides whether this specific user has actually granted this specific scope. The model’s output is treated as an intent, never as an authority.
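
In code, “intent, never authority” can be as small as a lookup. The sketch below is illustrative only: the tool names, scope strings, and function names are assumptions, not any particular product’s API.

```python
from dataclasses import dataclass, field


@dataclass
class ToolRequest:
    """What the model emits: an intent, carrying no authority of its own."""
    tool: str
    args: dict = field(default_factory=dict)


# Hypothetical mapping from each tool to the scope it requires.
REQUIRED_SCOPE = {
    "read_calendar": "calendar.read",
    "send_email": "mail.send",
}


def authorize(request: ToolRequest, granted_scopes: set) -> bool:
    """Deterministic check: unknown tools and missing scopes are both denied."""
    required = REQUIRED_SCOPE.get(request.tool)
    return required is not None and required in granted_scopes


if __name__ == "__main__":
    grants = {"calendar.read"}  # what this user actually approved
    print(authorize(ToolRequest("read_calendar"), grants))
    print(authorize(ToolRequest("send_email", {"to": "x@example.net"}), grants))
```

Nothing in the request can widen granted_scopes, because the grants come from the user’s consent record rather than from the model’s output.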

The common thread is determinism. Each pattern replaces a probabilistic hope about how the model will behave with a property of the system that resolves the same way on every run, whatever the model was talked into. The model is assumed fallible, and the boundary holds anyway, every time.

Why determinism is a governance problem too

Security is the obvious frame, but the same property is what makes a control governable. You cannot audit a probability. You cannot certify to a regulator, a customer, or your own risk team that a model will refuse, only that, so far, it usually has. Compliance, access policy, incident response, and oversight all assume a boundary that behaves identically whether or not anyone is watching. A non-deterministic guardrail can’t offer that, which means it can’t really be governed.

The Fable directive actually holds two separate failures that are easy to blur together. The first was about the model. Its guardrails are probabilistic, so they can never fully guarantee the model won’t produce a given capability, and that is the opening a jailbreak exploits.

The second failure had nothing to do with that. The government wanted a deterministic outcome (no access for a defined set of people), and that is an identity question, not a question about the model. The inference pipeline had no reliable way to verify who was on the other end of a request, so it couldn’t enforce a policy as specific as “these people, not those.” With no identity layer to draw that line, the only lever left was the crudest one: switch the entire model off, everywhere, for everyone. The options collapse to all-or-nothing, because nothing finer can be enforced. Deterministic controls at the action layer, tied to a verified identity, are what give you any choice other than the kill switch.
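
A sketch of what “finer than the kill switch” could look like: a per-request decision keyed on a verified identity. Everything here is hypothetical, including the attribute name and the placeholder model name, and how the identity gets verified is deliberately out of scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VerifiedIdentity:
    """An identity whose attributes were verified upstream (assumed)."""
    subject: str
    verified_attributes: frozenset


# Placeholder name for a model that is subject to an access policy.
RESTRICTED_MODELS = {"restricted-model"}


def may_call(model: str, identity: Optional[VerifiedIdentity]) -> bool:
    """Per-request decision: unrestricted models stay open, restricted ones
    require a verified identity carrying the eligibility attribute."""
    if model not in RESTRICTED_MODELS:
        return True
    if identity is None:
        return False  # no verified identity, so no access to restricted models
    return "eligible_for_restricted_models" in identity.verified_attributes


if __name__ == "__main__":
    alice = VerifiedIdentity("alice", frozenset({"eligible_for_restricted_models"}))
    print(may_call("restricted-model", alice))
    print(may_call("restricted-model", None))
```

With a check like this in the request path, a policy of “these people, not those” becomes enforceable one request at a time instead of one model at a time.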

None of this is a knock on Anthropic. The same is true of any model from any lab. Probabilistic controls stay probabilistic no matter whose weights they live in, and Anthropic happens to ship some of the most carefully guarded models out there. The shutdown wasn’t one company falling short. It’s what governing a non-deterministic system looks like when the only boundary on offer is a maybe.

The point-of-action authorization principle

This is the boundary where Arcade.dev focuses, so I’ll state the principle rather than the feature list. Authorization belongs at the point where an action actually executes, and it should treat whatever the model emits as a request, not a permission. An agent acts as a specific authenticated user, through that user’s own OAuth grants, so its reach is exactly the user’s reach and nothing wider. Each tool requires a named scope, checked against what the user actually granted at the moment of the call. And the credential itself stays out of the model’s context entirely, held where the tool runs, never visible to the thing an attacker is trying to manipulate. Jailbreak the model, and you have manipulated something that holds no keys and can grant itself none.
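
As a rough illustration of the principle, and not a depiction of Arcade.dev’s actual API, here is a toy executor that keeps credentials on its own side of the boundary. The class, method, and field names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolExecutor:
    """Runs tools on behalf of an authenticated user. Tokens live only here."""
    _tokens: dict = field(default_factory=dict, repr=False)  # user_id -> token
    audit_log: list = field(default_factory=list)

    def store_token(self, user_id: str, token: str) -> None:
        """Called by the OAuth flow, never by the model."""
        self._tokens[user_id] = token

    def execute(self, user_id: str, tool: str, args: dict) -> dict:
        """`user_id` comes from the authenticated session, not model output."""
        token = self._tokens.get(user_id)
        allowed = token is not None
        # Every decision is recorded, whether or not the call goes through.
        self.audit_log.append({"user": user_id, "tool": tool, "allowed": allowed})
        if not allowed:
            return {"status": "denied"}
        # A real tool would call the upstream API with `token` at this point.
        # Only the result goes back to the model, never the credential.
        return {"status": "ok", "tool": tool, "args": args}


if __name__ == "__main__":
    executor = ToolExecutor()
    executor.store_token("alice", "token-held-outside-model-context")
    print(executor.execute("alice", "read_calendar", {"day": "monday"}))
    print(executor.execute("mallory", "read_calendar", {"day": "monday"}))
    print(executor.audit_log)
```

The model sees tool names, arguments, and results. The token never appears in anything returned to it, so a jailbroken model has nothing to leak and nothing to escalate with.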

That is deliberately narrow. It does nothing about the first failure mode. It won’t stop a model from saying something it shouldn’t, and it’s no substitute for input controls or the Rule-of-Two budget. It is one layer. But it is the layer that decides whether a manipulated model can reach your users’ data, and it answers that question deterministically: the same way every time, auditable after the fact, regardless of what the model was talked into.

The takeaway

You cannot guarantee a model won’t be talked into something. You can guarantee that being talked into it grants the model no new authority: no token it didn’t already have, no scope the user didn’t already approve.

That is the line between a probabilistic system and a governed one. The frontier isn’t how capable the model is. It’s how tightly, and how provably, you can bound what it’s allowed to do. The model will stay probabilistic; that’s the nature of the thing. Your controls don’t have to be.

So treat every model as if it’s already jailbroken, then build so the breach never happens.