HiddenLayer Finds Out AI Agent Guardrails CAN Be Broken

Every AI agent your organization runs has a system prompt telling it what it's allowed to do. Most security teams treat that system prompt as the security boundary — they wrote the policy, encoded it in the prompt, and trusted the model to hold the line. That trust is the problem, and it's more specific than a general concern about AI behavior. Kenneth Yeung, a security researcher at HiddenLayer, spent months mapping exactly how attackers exploit the mechanisms that make modern LLMs follow instructions in the first place. The four attack techniques he documented — published on May 26, 2026 in "Inside the Prompt: How LLMs Learn Roles, Follow Instructions, and Get Exploited" — ran against named production systems, and every one of them worked. The safeguards most governance programs treat as hard security controls are, by technical design, probabilistic.

Before You Can Understand the Attack, You Need to Understand What's Being Attacked

Modern LLMs differentiate between system, user, and assistant roles using control tokens — special characters that tell the model which part of the context window carries authority over the rest. The ChatML format uses tokens like <|im_start|> and <|im_end|> to mark role boundaries. In a vocabulary of 128,000 tokens, control tokens sit at dedicated IDs above 128,000 specifically to keep them separate from regular vocabulary — the assumption being that user input can't masquerade as a system instruction if it can't reproduce those IDs.

Instruction Hierarchy

LLMs learn to respect role hierarchy through fine-tuning: training runs against conversations where users try to override system prompts, with the model learning to hold the developer's intent first. That hierarchy — developer message before user input — is what every organization relies on when they write "never take destructive actions" or "never access external systems without explicit authorization" into a system prompt. It's behavioral conditioning, trained into the model's weights. There's no technical gate separating a developer instruction from a user instruction at the architectural level.

XML Prompt Templating

Developers structure system prompts with XML-style tags — <guidelines>, <tool_use>, <user_info> — because LLMs trained on large XML datasets parse tagged content more reliably than unstructured text. Those tags make prompts cleaner and the model's behavior more predictable. They also create a map: once a tag structure exists in a system prompt, an attacker who identifies those tags has a template they can spoof.

How Attackers Exploit the Same Mechanisms Developers Rely On

Yeung documents four distinct attack techniques in his research. Each one exploits the architecture described above. Each one was demonstrated against a named production system. They're presented in the order Yeung addresses them, with the case studies he cited and attributed directly.

Control Token Injection

Attackers inject control tokens directly into content the agent reads — email bodies, documents, web pages — to elevate that content's perceived authority level. Many LLMs using ChatML format can still interpret their own control tokens when those tokens appear in user-provided text, because role tag tokens share the same token IDs as their regular vocabulary counterparts in some implementations. An attacker who understands this can partially push injected content toward system-level priority.

Yeung cites HiddenLayer's documented attack against Gemini for Workspace, using control tokens sourced from Google's open-weight Gemma models. The specific tokens were <eos> and <bos> — end of sequence and beginning of sequence. The delivery vector was an email body. Injecting those tokens in the email allowed the researchers to hard-reset the context window and display arbitrary content to the target user. Kenneth Yeung, Security Researcher, HiddenLayer, "Inside the Prompt," May 26, 2026, citing HiddenLayer's prior Gemini for Workspace research.

Any agent that processes external documents, emails, or web content carries exposure to this vector. The delivery mechanism is the content itself — the email body, the document file, the web page — which means governance programs built around controlling what users type into an interface haven't addressed this attack class at all.

Fake Context Resets

Fake context resets apply control token injection toward a specific outcome: convincing the model it's starting a fresh conversation where the original system prompt no longer applies. Place end-of-sequence and beginning-of-sequence tokens inside content the agent reads, and the model interprets everything that follows as a new session — one an attacker fills with replacement instructions instead of legitimate context.

The Gemini for Workspace attack is the documented case here as well. The injection sequence was <eos><bos>followed by a complete replacement system prompt directing the model to address the user as "Admiral Clucken," respond exclusively in poem form, and refuse to reveal the email's actual contents. HiddenLayer chose an absurd example deliberately: it demonstrates how completely an attacker can replace the operational context. The model had no reliable mechanism to distinguish the injected reset from a legitimate conversation boundary. Kenneth Yeung, Security Researcher, HiddenLayer, "Inside the Prompt," May 26, 2026.

Any agent with email or document access that doesn't have runtime enforcement outside the model layer carries exposure to this attack class. The distinction the model can't reliably draw — between a legitimate conversation boundary and one placed there by an attacker — has to be drawn externally, by a control that sits outside the model entirely.

Reasoning Token Abuse

Reasoning models generate internal chain-of-thought before producing output — a step where safety evaluation happens before a final answer returns to the user. Attackers can inject tokens that signal the reasoning phase has already concluded. The model, believing its safety evaluation is complete, skips directly to output without running it.

Yeung cites HiddenLayer's assessment of DeepSeek-R1, where reasoning control tokens were injected to make the model behave as though its chain-of-thought had already finished. The safety evaluation step got bypassed entirely. Kenneth Yeung, Security Researcher, HiddenLayer, "Inside the Prompt," May 26, 2026, citing HiddenLayer's prior DeepSeek-R1 security assessment.

Reasoning models are the ones many organizations are deploying specifically for compliance evaluation, risk assessment, and policy interpretation. The models designed to reason more carefully about safety decisions are exactly the ones where this attack class exists — a structural irony that deserves attention before the next deployment decision.

XML Prompt Spoofing

When a developer structures a system prompt with XML tags, those tags become a template for any attacker who can identify them — through system prompt leakage, through trial and error against the model, or through published research about a platform's prompt format. XML tokenization is identical for system-level tags and user-provided strings, so injected XML-tagged content can read to the model as part of the original system prompt structure rather than as ordinary user input.

Yeung cites HiddenLayer's assessment of Cursor, the AI code assistant. System prompt leakage revealed a <user_info>tag in Cursor's prompt that provided the agent with context about the user's environment. The researchers injected content inside <user_info> tags placed in a code repository, directing the agent to run list_dir, access the .ssh directory, and follow instructions from a README.md file in the repository. The instructions came from a repository file. The agent couldn't tell the difference. Kenneth Yeung, Security Researcher, HiddenLayer, "Inside the Prompt," May 26, 2026, citing HiddenLayer's prior Cursor security assessment.

Every repository file an agent reads is a potential injection vector. Every document in a RAG corpus. Every email an agentic workflow processes. An agent with access to a code repository, a document store, or any external content system is potentially accepting system-level instructions from that content without the developer knowing it, and without any warning in the output that something changed.

System Prompts Take Behavioral Conditioning Over Boundaries

Most organizations govern their AI agents by writing system prompts that describe what those agents should and shouldn't do. Yeung's research shows what treating behavioral conditioning as a security control actually means in production: an attacker who understands the training mechanism can manipulate the behavior, and the governance program has no architectural defense against it. Yeung's own conclusion is direct: "Security-sensitive workflows should therefore rely on layered controls outside the model itself, including runtime policy enforcement, permission boundaries, monitoring, and human oversight for high-risk actions." — Kenneth Yeung, Security Researcher, HiddenLayer, "Inside the Prompt," May 26, 2026.

The attack surface Yeung describes spans every piece of content the agent reads, not just what users type. Yeung states it plainly: "prompts and contextual data should always be treated as untrusted input" and "documents, emails, web pages, tool outputs, and repository files can also introduce prompt injections into a model's context window." — Kenneth Yeung, Security Researcher, HiddenLayer, "Inside the Prompt," May 26, 2026. In a RAG system, every document in the retrieval corpus is a potential injection vector. In an agentic workflow with file access, every file the agent reads carries the same risk, and the ownership gap in most deployments means nobody's accountable for what happens when that content carries malicious instructions.

The agents most exposed to these attack techniques are the ones with the broadest access — reading the most documents, querying the most external systems, invoking the most tools. That exposure scales directly with capability, which means the deployment pattern that makes agents most valuable is the same pattern that makes them most exposed. The agent-first governance architecture that would address this — sanctioned purpose documents, runtime enforcement outside the model, tiered permission boundaries — is absent from most deployments because organizations built the capability before they built the enforcement layer that capability requires.

What Having It Right Actually Looks Like

Five specific controls follow directly from the four attack techniques Yeung documented. Every one of them exists outside the model, which is exactly where they have to be when the model itself is the thing being manipulated.

Treat all contextual data as untrusted input

Every document, email, web page, repository file, and tool output the agent reads is a potential injection vector. Content inspection belongs at the ingestion layer — before content enters the context window — so the model never processes injected instructions as though they came from a trusted source. Inspection after the agent has already acted on the content doesn't stop the attack; it documents what already happened.

Runtime policy enforcement outside the model

System prompts describe desired behavior; runtime enforcement intercepts agent actions before execution, evaluating whether an action matches the sanctioned scope regardless of what the model believes it was instructed to do. The Pre-Failure Signals GAIG has documented — signal-to-incident collapse, alert fatigue suppression — emerge from monitoring that observes after the fact. Runtime enforcement addresses the action before the alert would ever need to fire.

Permission boundaries scoped to actual task requirements

The Cursor attack worked because the agent had access to the .ssh directory when its only task was reading a code repository for context. An agent reading code doesn't need credential store access. Scope permissions to the specific task at hand, so that an exploited agent's blast radius stays bounded by what the task actually required — not by the broadest access available when the deployment was configured.

Continuous behavioral monitoring that flags deviations from baseline

Prompt injection attempts look like natural language interactions — they don't trigger keyword-based filters, and they won't appear in output logs as anything unusual until after the agent has already acted on them. Behavioral monitoring that establishes baseline agent activity and flags deviations — unexpected tool calls, unusual file access patterns, output structures that don't match the assigned task — gives security teams detection capability that instruction hierarchy alone can never provide.

Visibility into what's entering the context window in real time

Output monitoring tells you what the agent said after the fact. Context window visibility tells you what convinced it to say that. Yeung makes the requirement explicit: "Organizations need visibility into what information is entering the context window and how it may influence model behavior." — Kenneth Yeung, Security Researcher, HiddenLayer, "Inside the Prompt," May 26, 2026. Without that visibility, an attack can succeed and the organization won't know which document delivered it.

Our Take

AI SECURITY TAKE

The attacks Yeung documents aren't theoretical. The Gemini for Workspace exploit was documented and disclosed. The Cursor exploit was documented and disclosed. The DeepSeek-R1 reasoning token attack was documented and disclosed. Every one of those was a real attack against a real production system by a security research team, and every one succeeded. Every organization running agents with access to external content — email, documents, repositories, web pages — carries exposure to at least one of the four attack classes Yeung describes in this research.

HiddenLayer's paper lands in the same week that the Harvard Law Forum documented how AI governance decisions affect privilege protection and accountability inside enterprise deployments, and the same week Deeploy launched MCP Server-based agent registration because unregistered agents can't be monitored. The connection across all three is the same structural gap: the attack surface is the content agents read, the governance programs most organizations built don't account for it, and the access controls that would limit blast radius after exploitation aren't scoped to the actual task.

Security teams with agents in production this week should run through Yeung's four attack techniques against each deployment. The questions are concrete: does this agent read external documents, emails, or repository files? If yes, does runtime enforcement exist outside the model layer? If the answer to that second question requires a conversation to determine, the system prompt is the only control in place — and Yeung's research documents precisely how to defeat it.

GetAIGovernance

Back to All Articles

AI Governance

AI Security

AI Monitoring

AI Compliance

AI ROI

Need help choosing?

AI Governance

AI Monitoring

AI Compliance

AI Security

Research Reports

AI ROI

Market Trend Analysis

Explore All Resources

HiddenLayer Finds Out AI Agent Guardrails CAN Be Broken

Before You Can Understand the Attack, You Need to Understand What's Being Attacked

How Attackers Exploit the Same Mechanisms Developers Rely On

Reasoning Token Abuse

System Prompts Take Behavioral Conditioning Over Boundaries

What Having It Right Actually Looks Like

Our Take

ServiceNow Launches Autonomous Workforce and Integrates Moveworks Into Its AI Platform

AI Governance Platforms vs Monitoring vs Security vs Compliance

ServiceNow Introduces the Enterprise Identity Control Plane Following Its Acquisition of Veza

Related Articles

ServiceNow Launches Autonomous Workforce and Integrates Moveworks Into Its AI Platform

AI Governance Platforms vs Monitoring vs Security vs Compliance

ServiceNow Introduces the Enterprise Identity Control Plane Following Its Acquisition of Veza

Stay ahead of Industry Trends with our Newsletter