Accuracy Claim on Your AI Guardrail Is Probably Meaningless

THE SETUP

Ninety-five percent. That is the detection accuracy figure that most enterprise guardrail and AI gateway vendors present during procurement evaluations. On the surface it sounds like a meaningful benchmark — high enough to feel safe, specific enough to seem rigorous. In practice it is one of the least useful numbers a security team will encounter during an AI tool evaluation, because it describes performance against a test the vendor designed, using attacks the vendor selected, in an environment that almost certainly looks nothing like the one where the guardrail will actually operate.

The problem is not that guardrail vendors are misrepresenting their products. The problem is that the standard evaluation process systematically produces accuracy figures that measure something different from what enterprise buyers assume they're measuring. Benchmark accuracy measures detection rate against known, static attack patterns drawn from open-source datasets. Real-world resilience measures whether the control holds up against an adaptive attacker who sees a failed attempt and tries again differently. Those are two different questions, and most enterprise procurement processes treat them as one.

On June 12, Dr. Peter Garraghan of Mindgard published a framework identifying five specific questions buyers should ask before trusting any accuracy figure a guardrail vendor shows them. The framework is accurate and practically useful. What it leaves implicit — because Mindgard is a security company, not a governance platform — is the accountability layer underneath it. Someone inside the enterprise needs to own the responsibility for asking those questions, verifying the answers, and ensuring the evaluation runs again after deployment. Most governance programs haven't named that person. This piece does both: it covers what Garraghan documented, and it extends the argument to the governance accountability layer that determines whether any of it gets enforced in practice.

WHAT ACTUALLY HAPPENED

The accuracy figures enterprise buyers encounter during guardrail evaluations are typically generated through a combination of three methods. The first is vendor-supplied benchmarks drawn from open-source datasets and red-teaming tools that the vendor has already built their product against. The second is widely used independent AI safety benchmarks, many of which contain attack categories that were well-documented before most current guardrail products were built — which means the products are trained to recognize them. The third is manual probing by the prospective customer, which is useful but unlikely to surface the specific attack vectors the guardrail performs worst on, because the buyer doesn't know what those are and the vendor has no incentive to point them out.

Dr. Garraghan described the core structural problem in the Mindgard webinar that preceded the June 12 publication: "Many benchmark datasets contain well-known and relatively obvious attempts, such as instructions to disregard previous prompts or generate prohibited content. A competent guardrail should detect these attacks. But blocking the most recognisable examples is not the same as resisting a determined adversary."

The agentic dimension compounds the problem. In a single-step LLM interaction, a guardrail failing to catch one attack means one instance of harmful output. In a multi-agent workflow where the model is calling tools, querying retrieval systems, and triggering downstream automation, a guardrail failure can propagate across the entire chain. Jim Olsen, CTO at ModelOp, documented the mathematical version of this in VKTR: "Cascading errors, where each system is even 95% accurate, means you encounter a 5% error rate for each decision made. Clearly, this error rate then compounds at each step." A guardrail evaluated against a single-step LLM interaction produces an accuracy figure that says nothing about how that control performs when the protected system is one node in a workflow making sequential autonomous decisions.

THE FIVE QUESTIONS — DR. PETER GARRAGHAN

1. What exactly does the accuracy figure measure?

Ask which datasets, attack categories, and thresholds were used. Determine whether the reported score reflects a narrow benchmark or a broader evaluation of realistic adversarial behavior.

2. Has the guardrail been tested against adaptive attacks?

Static prompts are insufficient. Evaluations should include multi-turn manipulation, contextual obfuscation, character-level evasion, and attempts to disguise malicious intent across multiple interactions.

3. Has the control been tested in the environment it will protect?

A generic lab test cannot reflect the behavior of your models, your prompts, your tools, your data flows, your permissions, and your agentic workflows. Guardrails must be evaluated in the actual deployment context.

4. Has the evaluation been independently validated?

Vendor-led testing has a structural limitation: the vendor controls the methodology. Independent testing exposes where the control performs well and where its defenses break down.

5. How will the guardrail be tested after deployment?

AI systems change. Models update. Applications gain capabilities. Attack techniques evolve. A point-in-time evaluation is not sufficient. Guardrails require continuous testing against emerging threat patterns.

"Buyers may be shown strong benchmark results during an evaluation, only to discover that the same guardrail performs very differently when exposed to more realistic attacker behaviour. The problem is not that benchmarks are useless. The problem is treating benchmark performance as proof of real-world resilience."
— Dr. Peter Garraghan, CEO, Mindgard

HOW IT WORKS

The evaluation process has a structural flaw that exists independent of vendor intent. Benchmark datasets are built from known attacks — the ones that have been documented, categorized, and published in security research. A guardrail trained to detect known attacks will perform well against a test set built from the Why same corpus. That is not evidence of resilience against a motivated adversary. That is pattern matching against a static sample of historical attacks.

The problem compounds in two directions simultaneously. The vendor controls which attacks are included in the evaluation and can select categories where their product performs strongest. A prospective buyer running manual probing during an evaluation is unlikely to identify the specific vectors the guardrail is weakest against — because those are exactly the areas the evaluation process tends not to surface. Meanwhile, benchmark datasets age quickly in this market. Attack techniques documented in 2024 are well-known to anyone building a guardrail product in 2026 — and equally well-known to attackers, who have moved to variations the published benchmarks don't cover.

An attacker does not stop when an obvious jailbreak fails. They paraphrase the instruction. They translate it into a different language and back again. They fragment the attack across multiple conversation turns, delivering pieces that appear innocuous in isolation and combine into malicious intent only in context. They exploit role manipulation, framing prohibited requests as hypotheticals, as fiction, as security research. They test how the system behaves when instructions arrive embedded in documents the model retrieves, memory it reads, or tool responses it processes — attack surfaces that most current guardrail evaluations don't test at all.

In agentic systems the attack surface expands considerably beyond what the evaluation typically covers. The guardrail protecting a single model endpoint does not automatically protect the tool calls that model makes, the retrieval system it queries, or the downstream automation it triggers. An attacker who cannot jailbreak the model directly may be able to inject instructions into a document the agent retrieves as part of a legitimate task, redirecting the agent's behavior without ever interacting with the guardrail directly.

GOVERNANCE IMPLICATIONS

Control Layer	What the Benchmark Gap Exposes	What Needs to Change
AI Security	Guardrails evaluated only against static benchmarks in lab environments leave organizations structurally exposed to adaptive attackers.	Require vendors to test against your specific deployment environment. Mandate adaptive attack coverage — multi-turn, contextual obfuscation, indirect injection.
AI Monitoring	Post-deployment monitoring of model behavior does not substitute for continuous adversarial testing of the guardrail itself.	Build separate monitoring for guardrail efficacy. Quarterly adversarial testing minimum for any guardrail in production agentic deployment.
AI Governance	The five questions require someone inside the organization to own the responsibility for asking them and verifying the answers — both during procurement and after deployment.	Name a specific person accountable for guardrail validation. Someone whose job includes re-running the evaluation on a documented schedule.
AI Compliance	EU AI Act Article 9 and NIST AI RMF require ongoing testing and validation. A guardrail evaluated once at procurement and never retested is an unvalidated control.	Establish a trace retention policy covering guardrail evaluation results alongside model behavioral logs for compliance examinations.

WHAT'S STILL MISSING

The Mindgard piece has a commercial dimension that is worth naming directly. Mindgard sells AI red teaming and continuous adversarial testing. Their five-question framework correctly identifies that continuous independent testing is the answer to the benchmark accuracy problem. The analysis is accurate regardless of who produced it, but buyers evaluating this framework should understand that the company recommending independent continuous testing is also selling independent continuous testing.

The second gap is organizational. The five questions assume a security team with the technical capability to evaluate the answers. Most enterprise security teams do not have dedicated AI red teaming expertise in-house. For organizations without in-house AI security capability, the minimum viable version of the framework is simpler: before signing a guardrail contract, require the vendor to test their product against your actual deployment environment and share the results, including what the product failed to catch.

The third gap is the most significant. The cascading error problem that Jim Olsen quantified — where 95% accuracy per step compounds into meaningful failure rates across a multi-step agentic chain — points to a failure mode that individual guardrails at each step cannot fully address. The governance solution is Execution Authority Boundaries that limit how many sequential autonomous decisions an agent can make before a human review checkpoint occurs.

Our Take

The accuracy figure a guardrail vendor shows you during procurement is, at its best, a measure of how well their product performs against the attack patterns they selected, in the environment they configured, assessed by the team that built the product. That information is worth having. It is not evidence of real-world resilience against an adaptive attacker testing your live system with techniques that post-date the benchmark.

Three things governance programs need to add:

• Name a person responsible for post-deployment guardrail validation with a documented schedule and evidence trail.

• Separate guardrail efficacy monitoring from behavioral monitoring. Quarterly adversarial testing of the guardrail itself is the minimum.

• For agentic deployments, add Execution Authority Boundaries. The guardrail is one control layer. It is not the stack.

GetAIGovernance

Back to All Articles

AI Governance

AI Security

AI Monitoring

AI Compliance

AI ROI

Need help choosing?

AI Governance

AI Monitoring

AI Compliance

AI Security

Research Reports

AI ROI

Market Trend Analysis

Explore All Resources

The 95% Accuracy Claim on Your AI Guardrail Is Probably Meaningless

THE SETUP

WHAT ACTUALLY HAPPENED

THE FIVE QUESTIONS — DR. PETER GARRAGHAN

HOW IT WORKS

GOVERNANCE IMPLICATIONS

WHAT'S STILL MISSING

Our Take

ServiceNow Launches Autonomous Workforce and Integrates Moveworks Into Its AI Platform

Arize vs Fiddler vs Arthur: Which AI Monitoring Platform Actually Fits Your Enterprise?

AI Governance Platforms vs Monitoring vs Security vs Compliance

Related Articles

ServiceNow Launches Autonomous Workforce and Integrates Moveworks Into Its AI Platform

Arize vs Fiddler vs Arthur: Which AI Monitoring Platform Actually Fits Your Enterprise?

AI Governance Platforms vs Monitoring vs Security vs Compliance

Stay ahead of Industry Trends with our Newsletter