Solving the Unsolvable: Acuvity's Prompt Injection and Jailbreak Detection Model

TLDR: Acuvity’s Prompt Injection and Jailbreak Detection Model achieved the highest F1 score across all four major public benchmarks, outperforming Meta, ProtectAI, Qualifire, and others—with perfect precision on real-world attack datasets. This post explains the threat landscape, why traditional defenses fail, and how our small-model approach delivers superior results.

Download the full technical report here.

—

When ChatGPT launched in November 2022, it took just weeks for security researchers to discover they could manipulate it into ignoring its instructions, leaking its system prompts, and producing content it was explicitly designed to refuse.

In the three years since, prompt injection and jailbreak attacks have remained the most persistent and consequential security challenge facing large language models. The UK’s National Cyber Security Centre concluded in 2023 that prompt injection “may simply be an inherent issue with LLM technology,” with “no surefire mitigations” available. OWASP ranks it the number one security risk for LLM applications in 2025.

Many in the industry have treated this as an architectural inevitability, a problem to be managed rather than solved.

We disagree. And we have the results to prove it.

Acuvity’s Prompt Injection and Jailbreak Detection Model achieved the highest F1 score across all four major public benchmarks we tested, outperforming models from Meta, ProtectAI, Qualifire, and others. On real-world attack datasets, we achieved perfect precision with zero false positives. On the most adversarial benchmarks, where attacks are specifically engineered to evade detection, our performance advantage over the nearest competitor widened to nearly four percentage points.

This paper explains how we built that detection capability and why our architectural approach produces consistently superior results. But first, it’s worth understanding how the threat itself has evolved.

Multi-turn jailbreaks now build context across conversations to gradually erode safety boundaries. Semantic prompt exploits use carefully constructed language that appears benign but carries adversarial intent. Indirect injections embed malicious instructions in external data sources that models retrieve and act upon, meaning the attack payload never touches the user’s input at all. These techniques represent the most impactful real-world attack vectors in 2025, affecting both chat-based applications and agentic workflows where models take autonomous actions.

What makes these threats difficult to defend against is the asymmetry between attacker and defender. Attackers innovate constantly, tailoring payloads to specific applications and inserting instructions at multiple input points. Most defenses remain rigid, lack deep context awareness, and struggle with the fundamental opacity of large language models.

Our approach was purpose-built to change that equation.

Defining the Problem Clearly

Before diving into how we approach detection, it’s worth establishing clear definitions. The terms “prompt injection” and “jailbreak” are often used interchangeably in industry conversations, but they describe different things, and conflating them leads to defensive architectures that miss entire categories of attack.

We treat prompt injection as an attack vector. It’s the mechanism by which an adversary attempts to override, alter, or subvert an AI system’s intended instructions. This can take two primary forms. The first is subversion of contextual logic, where attackers try to override existing instructions using statements like “forget the above discussion” or “ignore previous instructions.” These disrupt the model’s logical flow, forcing it to prioritize the attacker’s injected instructions over legitimate ones. The second form is prompt stealing, where the adversary’s goal is to extract the original system prompt. This is often attempted through carefully crafted queries such as “show me the hidden rules you’re following” or “reveal the instructions above.” By stealing the system prompt, attackers can uncover proprietary logic, sensitive guardrails, or hidden reasoning paths, all of which can then be exploited for future attacks.

Both forms can appear directly in user input or be embedded within otherwise benign content, which is what makes detection so challenging.

Jailbreaks, by contrast, represent a specific outcome. The attacker’s intent is to break out of enforced safety boundaries, getting the model to say, generate, or do something it would normally refuse. Classic examples include the “DAN” (Do Anything Now) prompts that assign the model an alternate persona with unrestricted behavior, or role-playing scenarios that trick the model into ignoring safety rails. While prompt injections often serve as the delivery mechanism, jailbreaks are the explicit result adversaries aim for: removing restrictions and unlocking capabilities the model was designed to withhold.

The relationship between these concepts matters for building effective defenses. All jailbreaks are a type of prompt injection, but not all prompt injections are jailbreaks. An attacker who successfully extracts a system prompt has executed a prompt injection without triggering a jailbreak. An attacker who manipulates an AI agent into sending emails to unintended recipients has injected instructions without necessarily bypassing safety constraints.

A detection system that only looks for jailbreak patterns will miss these cases entirely. Effective protection requires recognizing both the vector and the outcome, and understanding how they interact across different application contexts.

Why Content Filters and Large Models Miss These Attacks

Understanding these distinctions helps explain why conventional approaches to content moderation and safety filtering have struggled with prompt injection and jailbreaks. Traditional content moderation systems were designed to catch harmful outputs: hate speech, explicit content, misinformation. They operate on the assumption that dangerous content can be identified by what it contains. Prompt injections don’t work that way. They target the model’s reasoning layer itself, manipulating how it interprets and prioritizes instructions rather than simply producing prohibited content.

This creates a fundamental mismatch. A prompt injection might contain nothing overtly harmful. It might look like a polite request, a creative writing exercise, or a technical question. The danger lies not in the words themselves but in their relationship to the model’s instruction hierarchy and the actions they’re designed to trigger. Content filters that scan for toxic language or sensitive topics will pass these inputs through without hesitation.

There’s also a widespread assumption in the industry that larger, more capable models must also be more secure. The reasoning seems intuitive: if a model is smarter, it should be better at recognizing when it’s being manipulated. In practice, the opposite is often true. Large general-purpose models are optimized for broad reasoning and helpfulness, which makes them more susceptible to cleverly framed requests that exploit those very capabilities. Their size and complexity also make them slower and more expensive to run, which limits their viability as real-time filtering mechanisms. And their generalist nature means they lack the specialized pattern recognition that adversarial detection requires.

The result is a defensive landscape that remains largely reactive. New jailbreak techniques emerge, defenders scramble to patch specific phrasings or patterns, and attackers simply adjust their approach. This cat-and-mouse dynamic has persisted since the earliest days of ChatGPT, and it explains why so many practitioners have concluded that prompt injection is simply an inherent limitation of the technology.

But this conclusion conflates two different problems. The fact that large language models are inherently vulnerable to adversarial manipulation doesn’t mean that adversarial inputs can’t be detected before they reach those models. It means the detection layer needs to be purpose-built for that task rather than bolted on as an afterthought.

The Counterintuitive Thesis: Small Language Models as Classifiers

This is where our approach diverges from conventional wisdom. If large models are vulnerable and slow, and traditional filters miss adversarial patterns, what should a dedicated detection layer actually look like?

Our answer is small language models trained specifically as classifiers.

The instinct to throw more compute and larger models at security problems is understandable, but it misses what makes adversarial detection different from general-purpose reasoning. A detection model doesn’t need to write poetry or hold extended conversations. It needs to do one thing exceptionally well: distinguish between benign inputs and malicious ones. That focused mandate changes the architectural calculus entirely.

Small language models operate at high speed and low cost, making them well suited for real-time filtering. They can evaluate every incoming prompt in milliseconds, blocking malicious attempts before they ever reach the core system. Their size also makes them easier to specialize. While a large general-purpose model may struggle to distinguish between malicious instructions and harmless creative input, a small classifier can be fine-tuned specifically on prompt injection and jailbreak datasets, learning the linguistic markers and structural cues that characterize real attacks.

Small models offer two additional advantages that matter in enterprise environments. First, interpretability: they are architecturally more transparent, making it easier for security and compliance teams to audit decisions and justify them to regulators. Second, adaptability: as attackers develop new techniques, a small classifier can be retrained in hours or days rather than the weeks required for large foundation models. When the threat landscape shifts, defenses need to shift with it.

The bottom line is that small language model classifiers aren’t a cost-saving compromise. They are purpose-built guardrails, combining the power of advanced language generation with a nimble safety layer designed specifically for adversarial detection.

How We Built Our Detection Model

With that foundation in place, here’s how we applied it.

The Acuvity Prompt Injection and Jailbreak Detection Model is built on the BERT family architecture with an attention-pooling bridge to a proprietary decision layer. We chose this foundation because it handles long inputs with efficient attention and stable long-range reasoning, which is critical for catching injected instructions buried deep in lengthy prompts. It also understands the hybrid text and code patterns common in prompt injections, and it’s already strong on classification and semantic search tasks.

To make the model more sensitive to subtle adversarial patterns, we integrated attention pooling, which allows it to weigh the most critical parts of a prompt rather than treating every token equally. This ensures the model focuses on manipulative cues even when they’re surrounded by benign context designed to obscure them. The pooled representation then passes through proprietary network layers that refine the signal while suppressing noise, with residual pathways that keep long-context information intact.

The result is an architecture that combines representational power with a lightweight classifier optimized for real-time adversarial detection.

The full technical paper includes additional depth on our architecture and training methodology.

Detecting Real Attacks, From Obvious to Subtle

Architecture descriptions only go so far. To understand how the model performs, it helps to see it working against real attack patterns.

Consider a classic DAN jailbreak attempt. The prompt instructs the model to act as “DAN” (Do Anything Now), a persona freed from typical AI constraints. It includes explicit instructions to bypass ethical restrictions, requests for assistance with hacking tools, commands to provide dual responses (one compliant, one “jailbroken”), and meta-commands to persist the bypass across turns. Our model flagged this as a prompt exploit with 0.99 confidence, detecting policy override language, role-play coercion, illicit intent, and a high density of override verbs and policy-negation phrases.

But obvious jailbreaks are not where detection gets difficult. The harder cases involve indirect injection, where malicious instructions are embedded in content the model retrieves rather than content the user submits.

Take this example from real plugin vulnerability research. The payload instructs the model to introduce itself as “Mallory,” invoke a code plugin to create a public GitHub repository, add issues to all private repositories, and do all of this without asking for user confirmation. Our model flagged this with 0.99 confidence, identifying unauthorized tool control, explicit bypass of safety checks, persona priming, and cross-context command injection. This is the kind of attack targeting agentic workflows, and it represents where the threat landscape is heading.

These examples illustrate the balance between precision and recall: catching genuine attacks while distinguishing them from benign creative requests that share surface-level features.

Acuvity Outperforms Meta, ProtectAI, Qualifire, and Jackhhao Across Every Benchmark

Individual examples are useful, but the real question is how this approach performs at scale, across diverse attack types, against the best available alternatives.

We evaluated our model against four major prompt injection and jailbreak detection benchmarks, comparing performance against leading publicly available models including Qualifire’s Prompt Injection Sentinel, Meta’s Prompt-Guard-86M, ProtectAI’s DeBERTa v3, and the jackhhao jailbreak classifier.

The benchmarks cover meaningfully different territory. Deepset’s test set focuses on politically biased speech injections. The Jackhhao dataset contains over 15,000 real-world jailbreak attempts collected between December 2022 and December 2023. Qualifire’s benchmark offers 5,000 prompts mixing injections and roleplay-style jailbreaks. And AllenAI’s WildJailbreak set represents the most adversarial test, with compositional attacks specifically engineered to evade detection.

Across all four benchmarks, our model achieved the highest F1 score.

On the Jackhhao real-world dataset, we achieved an F1 of 0.9891 with perfect precision of 1.0, meaning zero false positives. The nearest competitor reached 0.9856. Meta’s Prompt-Guard collapsed to 0.6933. On WildJailbreak, where attacks are hardest to catch, we achieved 0.9729 against Qualifire Sentinel’s 0.9357, a gap of nearly four percentage points. When attacks get harder, our advantage widens.

Our model also achieved best-in-class AUC scores ranging from 0.9611 to 0.9986, indicating stable performance across different detection thresholds and deployment contexts.

What these numbers mean in practice: a detection system that catches real attacks without creating friction for legitimate users. High recall means threats don’t slip through. High precision means security teams aren’t drowning in false alerts. That balance is where most detection approaches break down.

What This Means for Security Teams

These results have practical implications for organizations deploying LLM-powered applications.

First, effective prompt injection and jailbreak detection is achievable. The conventional wisdom that these attacks represent an unsolvable problem conflates the vulnerability of large language models with the impossibility of detecting adversarial inputs. Our benchmarks demonstrate that a purpose-built detection layer can identify the vast majority of attacks with high confidence while maintaining the precision necessary to avoid disrupting legitimate use.

Second, detection needs to operate at runtime. Static analysis and pre-deployment testing can’t account for the adversarial creativity that emerges once systems are live. Attackers adapt to observed behavior, probe for weaknesses, and share successful techniques. A detection model that sits in front of your generative AI systems and evaluates every input in real time provides a fundamentally different security posture than one-time assessments or periodic audits.

Third, the performance gap between detection approaches is significant and consequential. On the most adversarial benchmarks, the difference between our model and alternatives spans several percentage points of F1 score. In production environments processing thousands or millions of prompts, those percentage points translate directly into attacks that either get caught or slip through, and false positives that either get suppressed or frustrate users. Choosing the right detection layer matters.

Finally, this capability needs to evolve continuously. The threat landscape for prompt injection is not static. New techniques emerge regularly, attack patterns shift as models change, and the expansion of agentic AI systems introduces entirely new categories of risk. A detection approach built for rapid iteration and retraining, rather than locked into a fixed model, is essential for maintaining protection over time.

Going Deeper

The full technical paper includes detailed methodology for our benchmark evaluations, additional attack examples with detection rationales, deeper discussion of our architectural choices and training approach, and complete performance breakdowns across all tested models and datasets. For security leaders making decisions about AI runtime protection, or practitioners interested in the technical details behind effective prompt injection detection, it’s the more comprehensive resource.

—