Carlos KiK

Anthropic Just Said the Quiet Part About Agent Security

The most useful sentence in Anthropic’s latest engineering post is not really about Claude.

It is about blast radius.

As agents get more capable, the question stops being “can the model do useful work?” and becomes “what can it reach when something goes wrong?”

That is the shift.

The first version of agent safety was basically trust theater. Ask the user. Show a permission dialog. Make the human click approve before the model runs a command or touches a file. That feels responsible because there is a person in the loop.

But Anthropic says its own telemetry showed users approved roughly 93% of Claude Code permission prompts. That is not supervision. That is fatigue with a button attached.

The boundary matters more than the warning

Anthropic’s post walks through containment across claude.ai, Claude Code, and Claude Cowork. Different products, different users, different risk envelopes.

The pattern is the important part.

For claude.ai, code runs in an ephemeral server-side container. For Claude Code, the agent needs local filesystem and shell access, so the product moved toward OS-level sandboxing with tighter boundaries. For Claude Cowork, where non-technical knowledge workers should not be expected to evaluate bash commands, Anthropic uses a local VM so the agent can only see the mounted workspace.

That is the correct direction. Do not rely on the user to catch every bad action. Define what the agent can physically reach.

Prompts steer behavior. Boundaries constrain behavior.

Only one of those still works when the model, the user, or the tool output is compromised.

The failures are the useful part

The post is unusually valuable because Anthropic describes things it got wrong.

One class of vulnerability involved project-local configuration being parsed before the user accepted a trust prompt. That is such a perfect agent-era bug: the product had a trust boundary, but some code executed before the boundary existed.
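One way to picture the fix is to make the trust check the first thing that runs, so nothing project-local is even opened before it passes. This is a minimal sketch, not how Claude Code actually does it: the `.agent/settings.json` path, the `trusted_dirs` set, and both names below are hypothetical.

```python
import json
from pathlib import Path


class UntrustedProjectError(Exception):
    """Raised when project-local config is requested before trust is granted."""


def load_project_config(project_dir: Path, trusted_dirs: set) -> dict:
    """Return project-local settings, but only for directories the user trusted.

    The trust check comes first on purpose: no project file is read or
    parsed until it passes.
    """
    resolved = project_dir.resolve()
    if resolved not in trusted_dirs:
        raise UntrustedProjectError(f"{resolved} has not been trusted yet")

    # Hypothetical config location, used only for illustration.
    config_path = resolved / ".agent" / "settings.json"
    if not config_path.is_file():
        return {}
    return json.loads(config_path.read_text(encoding="utf-8"))
```

The ordering is the whole point. If the parse happens first and the check second, the boundary exists on paper only.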

Another case came from an internal red-team phish. A researcher convinced an employee to launch Claude Code with a malicious prompt that asked it to read local AWS credentials, encode them, and POST them out. Across 25 retries, Claude completed the exfiltration 24 times.

That is the part every agent builder should sit with.

If the user is the injection vector, a model-layer classifier has almost nothing to grab onto. The instruction looks like user intent because it came through the user.

The defense that still works is boring: egress controls, filesystem boundaries, and keeping credentials outside the reachable environment.
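To make "boring" concrete, here is a rough sketch of two such checks: one that keeps file access inside a workspace, one that keeps outbound requests on an allowlist. The host name and function names are assumptions for illustration. Real enforcement belongs in the OS sandbox or network layer, where the agent cannot route around it; an in-process check like this only shows the shape of the policy.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical allowlist; a real deployment would define its own.
ALLOWED_HOSTS = frozenset({"api.example.com"})


def path_is_reachable(workspace: Path, candidate: str) -> bool:
    """True only if candidate resolves to the workspace or something inside it."""
    root = workspace.resolve()
    # resolve() collapses ".." segments and follows symlinks before comparing.
    target = (root / candidate).resolve()
    return target == root or root in target.parents


def egress_is_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """True only for HTTPS requests to an explicitly allowed host."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in allowed_hosts
```

Under these rules a request for `../.aws/credentials` fails the path check, and a POST to an unknown host fails the egress check, regardless of how persuasive the prompt was.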

External content is now part of the threat model

The other sharp point is about what agents read.

MCP servers, connectors, GitHub repositories, documents, web pages, plugin outputs, and local files all become part of the prompt environment. A connector can be audited while the content it retrieves is poisoned. A GitHub README can be harmless to a human and still become an instruction payload once it enters an agent’s context.

That changes how we should think about tool use.

The old question was whether a tool could execute code.

The new question is whether a tool can influence the model.

If the answer is yes, it belongs in the threat model.
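A tiny inventory makes the difference between the two questions visible. This is an illustrative sketch only; the `Tool` fields and the example tools are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    executes_code: bool        # the old question
    feeds_model_context: bool  # the new question


def in_threat_model(tool: Tool) -> bool:
    """A tool is in scope if it can run code OR put content in front of the model."""
    return tool.executes_code or tool.feeds_model_context


# Hypothetical tools for illustration.
shell = Tool("shell", executes_code=True, feeds_model_context=True)
readme_fetcher = Tool("readme_fetcher", executes_code=False, feeds_model_context=True)
clock = Tool("clock", executes_code=False, feeds_model_context=False)

in_scope = [t.name for t in (shell, readme_fetcher, clock) if in_threat_model(t)]
```

The README fetcher never executes anything, and it still lands in scope because whatever it returns ends up in the model's context.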

What builders should copy

The lesson is not “never use agents.” That is useless advice.

The lesson is to stop treating permission prompts as the main control plane.

Use real sandboxes. Deny network by default when possible. Keep credentials outside the workspace. Treat project-open, config-load, and local listeners like internet-facing inputs. Run unknown tools against fake data first. Log enough that a security team can reconstruct what happened. Match containment strength to the user’s ability to supervise.
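Two items on that list are cheap to sketch: keeping credentials out of the environment a tool inherits, and writing structured audit records. The marker list and function names below are assumptions, and the marker list is deliberately crude and not exhaustive.

```python
import json
import time

# Assumed substrings that suggest a variable holds a credential.
SENSITIVE_MARKERS = ("AWS_", "TOKEN", "SECRET", "PASSWORD", "API_KEY")


def scrubbed_env(env: dict) -> dict:
    """Return a copy of env without variables that look like credentials."""
    return {
        key: value
        for key, value in env.items()
        if not any(marker in key.upper() for marker in SENSITIVE_MARKERS)
    }


def audit_record(action: str, detail: dict) -> str:
    """Serialize one agent action as a JSON line a security team can replay."""
    return json.dumps(
        {"ts": time.time(), "action": action, "detail": detail},
        sort_keys=True,
    )
```

The scrubbed dictionary can be passed as the `env=` argument to `subprocess.run`, so the child process never sees the secrets in the first place. Name matching will miss things, which is one more reason to keep credentials outside the reachable environment entirely rather than filter them after the fact.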

A senior developer can sometimes spot a weird shell command.

A finance analyst approving an agent action in a spreadsheet should not be expected to.

That distinction matters.

Agents are becoming normal software. That means the security answer will look less like magic alignment and more like old engineering discipline applied in a new place.

The model can be brilliant.

The boundary still has to be dumb, hard, and real.

Source: Anthropic Engineering
