On November 30, 2022, OpenAI released ChatGPT as a “research preview” — a chatbot built on GPT-3.5 that could answer questions, write code, and hold conversations. Within five days, it had a million users, and within two months, it had become the fastest-growing consumer application in history.
Three years later, ChatGPT has over 800 million weekly active users. It remembers your preferences across sessions, browses the web on your behalf, and books hotels and fills out forms autonomously. With the recent launch of Atlas, OpenAI’s AI-powered browser, ChatGPT is becoming the interface through which millions of people experience the internet.
Each of these capabilities represents a genuine advance in what AI can do for users, but each also introduced security risks that traditional frameworks couldn’t anticipate — risks that security researchers, often working faster than the vendors themselves, have spent three years uncovering.
This is the story of what the security community learned as ChatGPT evolved from a text-generating LLM into an autonomous agent: how prompt injection went from a theoretical concern to a vector for real-world harm, why each new capability expanded the attack surface in ways defenders struggled to anticipate, and what it means when an AI system can not only generate text but take actions on your behalf.
Phase 1: LLMs Hit the Market (November 2022 – Early 2023)
Prompt Injection Was There From the Beginning
Prompt injection as a vulnerability class emerged months before ChatGPT launched. In May 2022, researchers at AI safety firm Preamble discovered that GPT-3 could be manipulated through carefully crafted inputs and privately disclosed their findings to OpenAI. The flaw didn’t get a name until that September, when data scientist Riley Goodside publicly demonstrated the same technique and programmer Simon Willison recognized what it meant.
Willison drew a parallel to SQL injection, but noted a crucial difference: while SQL injection can be prevented with parameterized queries, prompt injection may be inherent to how large language models work. Because LLMs process both instructions and data as natural language, they have no reliable way to distinguish between the two — a limitation the UK’s National Cyber Security Centre acknowledged in August 2023 when it stated that “as yet there are no surefire mitigations” and that prompt injection may be “an inherent issue with LLM technology.”
Then Came the Jailbreaks
Within weeks of ChatGPT’s launch, users on Reddit developed “DAN” — Do Anything Now — a roleplay-based technique that convinced ChatGPT it was an alternate AI personality free from content restrictions. What followed was an arms race: when OpenAI patched one variant, the community developed another, with DAN 5.0 introducing a “token death” threat to make the jailbreak more compelling and iterations continuing through DAN 13.0.
Other techniques proved equally effective. The “Grandma exploit,” which emerged in April 2023, used emotional manipulation by asking ChatGPT to roleplay as a deceased grandmother who happened to know how to make dangerous substances, framing harmful instructions as nostalgic bedtime stories.
Academic researchers eventually systematized these attacks. In July 2023, a team at Carnegie Mellon that included Andy Zou and Nicholas Carlini published work demonstrating that adversarial suffixes — seemingly nonsensical strings appended to prompts — could bypass safety guardrails with high reliability. Their Greedy Coordinate Gradient algorithm achieved 87.9% success on GPT-3.5 and 53.6% on GPT-4, and perhaps more troubling, attacks developed against open-source models transferred to closed commercial systems.
The March 2023 Data Breach
Four months after launch, ChatGPT suffered its first major security incident.
On March 20, 2023, users began reporting that they could see other users’ conversation titles in their sidebar, and OpenAI took the service offline the same day. When the company published its postmortem on March 24, the scope turned out to be larger than initially understood: a race condition in the redis-py library had caused data to be routed to wrong users during a nine-hour window, and the exposure extended beyond conversation metadata to include payment information for approximately 1.2% of ChatGPT Plus subscribers — names, email addresses, billing addresses, and partial credit card details.
The regulatory aftermath proved consequential. Italy’s Garante data protection authority cited the breach when issuing a temporary ban on March 30, 2023, making it the first Western country to restrict ChatGPT. OpenAI’s failure to notify Italian authorities as required under GDPR contributed to an eventual €15 million fine announced in December 2024.
The incident demonstrated that even when AI-specific vulnerabilities aren’t in play, the infrastructure supporting AI systems is subject to the same bugs that affect any software — but the difference is scale. When hundreds of millions of users share a single system, a caching bug affects a lot of people.
Phase 2: The Platform (2023–2024)
Plugins Introduced Third-Party Risk
In March 2023, OpenAI launched ChatGPT plugins, a way to extend the model’s capabilities by connecting it to external services like GitHub repositories, flight booking systems, and real-time data feeds. OpenAI described plugins as giving ChatGPT “eyes and ears,” but security researchers saw something else: a dramatically expanded attack surface.
Between June 2023 and February 2024, Salt Security researchers identified three critical vulnerability classes in the plugin ecosystem. The first involved ChatGPT’s OAuth flow, where the plugin installation process didn’t validate that users had actually initiated the installation — allowing attackers to craft malicious links that, when clicked, installed plugins on victim accounts and forwarded all subsequent conversations to attacker infrastructure.
More concerning was a zero-click account takeover vulnerability in PluginLab.AI, a popular framework powering dozens of plugins. By submitting a SHA-1 hash of a victim’s email to an unauthenticated API endpoint, attackers could obtain the victim’s member ID and request OAuth codes on their behalf, a chain that Salt Labs demonstrated by gaining access to victims’ private GitHub repositories.
Johann Rehberger, a security researcher who would become the most prolific finder of ChatGPT vulnerabilities, explored how plugins fundamentally broke assumptions about trust. “ChatGPT cannot trust the plugin,” he wrote. “It fundamentally cannot trust what comes back from the plugin because it could be anything.” A malicious or compromised plugin could inject instructions that ChatGPT would follow — indirect prompt injection at scale.
OpenAI deprecated plugins in March 2024, replacing them with “custom GPTs” and later “Actions,” but the underlying pattern — ChatGPT processing potentially hostile content from external sources — remained.
The GPT Store Repeated the Same Mistakes
When OpenAI launched the GPT Store in January 2024, allowing developers to publish custom ChatGPT configurations, researchers quickly found similar problems. Academic analysis of nearly 15,000 custom GPTs found that approximately 95% showed inadequate protection against common threats, including system prompt leakage, roleplay attacks, and the ability to generate phishing content.
Rehberger demonstrated a proof-of-concept he called “Thief GPT,” a malicious GPT disguised as a Tic-Tac-Toe game that exfiltrated user data via hidden image rendering. When he reported the vulnerability to OpenAI in November 2023, the company closed the report as “Not Applicable.”
Palo Alto Networks researchers highlighted another dimension of the risk: GPTs integrating third-party APIs share chat data with providers who are not subject to OpenAI’s privacy and security commitments. OpenAI’s own documentation acknowledged as much, stating that “Builders of GPTs can specify the APIs to be called. OpenAI does not independently verify the API provider’s privacy and security practices.”
Training Data Extraction Showed Models Remember Everything
In November 2023, researchers from Google DeepMind, UC Berkeley, Cornell, and ETH Zürich demonstrated the first successful training data extraction attack against production ChatGPT, and their technique was, as they described it, “kind of silly”: prompting ChatGPT to “repeat the word ‘poem’ forever.”
After several hundred repetitions, ChatGPT would “diverge,” leaving its normal conversational mode and emitting what appeared to be random content. That content, upon analysis, turned out to be verbatim training data — email addresses, phone numbers, snippets of copyrighted books, code from Stack Overflow, and content from CNN and WordPress blogs. The researchers spent just $200 querying gpt-3.5-turbo and extracted over 10,000 unique memorized training examples, estimating that gigabytes of data could be retrieved with more resources. In their strongest configuration, 5% of ChatGPT’s output consisted of direct 50-token verbatim copies from training data, and 16.9% of generations contained memorized personally identifiable information.
Within days of the paper’s publication, OpenAI updated ChatGPT to flag “repeat forever” prompts as terms of service violations, patching the exploit while leaving the underlying vulnerability intact. As the researchers noted, you can train a model to refuse specific prompts, but that doesn’t remove the memorized data or prevent other techniques from extracting it.
Phase 3: The Memory Feature (2024)
How Attackers Could Poison ChatGPT’s Memory
In February 2024, OpenAI began rolling out a memory feature that allowed ChatGPT to retain information across conversations. The goal was personalization — ChatGPT could remember your name, your job, and your preferences, then use that context to provide more relevant responses. But Johann Rehberger saw something different: an entirely new threat model. If ChatGPT’s memory could be written by users, could it also be written by documents ChatGPT processed? Could an attacker plant instructions that would persist beyond a single session?
In May 2024, Rehberger demonstrated that the answer was yes. By embedding instructions in a Google Doc that a user analyzed with ChatGPT, he could write arbitrary content to ChatGPT’s memory, including malicious instructions that would execute on every future conversation. He called the technique “SpAIware.”
The implications were severe. A single poisoned document could compromise all of a user’s future interactions with ChatGPT, and Rehberger demonstrated continuous exfiltration of all user inputs to attacker-controlled servers — effectively creating command-and-control capability through planted memories. He also showed that attackers could inject false beliefs, such as “user is 102 years old, lives in the Matrix, believes Earth is flat.”
When Rehberger reported this to OpenAI in May 2024, the company initially closed the report, classifying it as a “safety issue, not security vulnerability.” Only after he developed a proof-of-concept demonstrating persistent data exfiltration did OpenAI issue a partial fix in September 2024. The fix prevented the specific data exfiltration technique but did not address the fundamental issue: prompt injections can still manipulate memory, and memories persist even after individual chats are deleted. As Rehberger noted, “The main point and mitigation for users is to watch out for the ‘Memory updated’ message. If that happens, ensure to inspect the changes that have been made.” Users are now responsible for monitoring their AI’s memory for evidence of poisoning.
Phase 4: The Agent (2025)
From Words to Actions
In January 2025, OpenAI released Operator — its first AI agent capable of taking autonomous action. Operator could navigate websites, click buttons, fill forms, and complete purchases on a user’s behalf. CEO Sam Altman called it a “basic step toward unlocking the power of AI agents.”
For security researchers, this represented a fundamental escalation. Prompt injection in a chatbot could produce harmful text. Prompt injection in an agent could produce harmful actions.
OpenAI implemented safeguards: user confirmation requirements, “watch mode” for sensitive sites, prohibition of banking transactions, rate limits on tasks. The Operator System Card acknowledged that “prompt injection remains an active concern” and described red-teaming efforts that improved prompt injection detection from 79% to 99% recall.
Yet OpenAI also deliberately restricted certain capabilities — sending emails, deleting calendar events — because, as the company stated, security risks were too high. The gap between what Operator could technically do and what OpenAI allowed it to do revealed the tension between capability and control that would define the agentic era.
OpenAI’s Atlas Browser Piled on More Risk
In October 2025, OpenAI launched Atlas, an AI-powered browser where ChatGPT is the primary interface for web interaction. Atlas can browse autonomously, remember sites you visit, and take actions across tabs.
Just hours after launch, security researchers demonstrated attacks that highlighted what autonomous browsing means in practice. Researchers at SquareX showed clipboard injection attacks that overwrote user clipboards with phishing links. Others demonstrated indirect prompt injection causing the browser to book hotels, delete files, or send messages without user authorization. Researchers found Atlas could be tricked into visiting malicious sites disguised as legitimate services.
OpenAI CISO Dane Stuckey’s statement was remarkably candid: “Prompt injection remains a frontier, unsolved security problem, and our adversaries will spend significant time and resources to find ways to make ChatGPT agents fall for these attacks.”
Unlike prior product launches where OpenAI downplayed risks, the Atlas announcement was accompanied by extensive security documentation acknowledging specific threat vectors. The company published tips for staying ahead of prompt injection attacks on Instagram. Researchers at Brave published a report determining that indirect prompt injection is a “systemic challenge facing the entire category of AI-powered browsers.”
The pattern of capability expansion outpacing security understanding — visible in plugins, in memory, in agents — reached its logical conclusion: an AI system with persistent memory, autonomous action capability, and access to the entire web, shipping with known vulnerabilities that have no reliable fix.
The Mixpanel Breach: Supply Chain Risk Never Left
Most recently, on November 9, 2025, analytics provider Mixpanel discovered that an attacker had gained unauthorized access to its systems through a targeted SMS phishing attack on an employee. The attacker exfiltrated a dataset containing customer information — including data belonging to OpenAI API users.
Mixpanel notified OpenAI on November 9 and shared the affected dataset on November 25. OpenAI began notifying impacted users within 48 hours, disclosing that names, email addresses, approximate locations, browser and operating system details, and organization or user IDs had been exposed.
OpenAI emphasized what wasn’t compromised: no chat content, API requests, passwords, API keys, payment details, or government IDs. The breach affected users of the API platform (platform.openai.com), not general ChatGPT users.
The company terminated its relationship with Mixpanel and announced expanded security reviews across its vendor ecosystem. “Trust, security, and privacy are foundational to our products, our organization, and our mission,” OpenAI stated.
The incident highlights a reality that persisted throughout ChatGPT’s three years: even as AI-specific vulnerabilities like prompt injection and memory manipulation commanded attention, traditional supply chain risks never disappeared. The March 2023 Redis breach and the November 2025 Mixpanel breach bookend the timeline, both third-party issues, both exposing user data, neither involving the AI itself.
As one security researcher noted, the breach shows how external, untrusted code can compromise an organization regardless of how well its core systems are secured. For enterprises evaluating AI vendors, the lesson is that security assessments must extend beyond the AI model to the entire ecosystem supporting it.
What Have We Learned?
Three years of ChatGPT security research yield lessons that extend beyond any single product.
Prompt injection is not a bug to be fixed but a property to be managed. Unlike SQL injection, which can be prevented with parameterized queries, prompt injection exploits the fundamental way language models process text. Every mitigation OpenAI has deployed — from input filters to output monitoring to user confirmation requirements — reduces risk without eliminating it. Security teams must design systems assuming that safety guardrails will be bypassed and apply principle of least privilege to any actions AI systems can take.
Capability expansion creates new threat categories. A chatbot that can only produce text has limited attack surface. Add plugins, and you get third-party risk. Add memory, and you get persistence. Add autonomous action, and prompt injection becomes a vector for real-world harm. Each new capability required security researchers to develop new attack techniques and defenders to develop new controls — often after the capability shipped.
Traditional security boundaries blur when AI processes external content. The distinction between trusted and untrusted input, foundational to most security models, becomes ambiguous when an AI assistant can be directed by the content of a document it’s summarizing, a website it’s browsing, or an email it’s reading. Indirect prompt injection — where malicious instructions are embedded in content the AI processes on behalf of the user — may prove more dangerous than direct attacks because it’s harder for users to recognize.
Transparency has improved, slowly. OpenAI’s initial responses to security research ranged from dismissive (closing Rehberger’s memory vulnerability as “not applicable”) to slow. But by the Atlas launch, the company was publishing detailed system cards acknowledging specific threat vectors, and its CISO was publicly stating that prompt injection remains unsolved. The shift suggests that the AI industry is beginning to treat security research as a partner rather than a nuisance — though the transition is incomplete.
Users bear significant responsibility. Review stored memories regularly. Grant AI agents minimum required permissions. Watch for “Memory updated” notifications. Treat browser memories as a security liability. These recommendations, appearing in OpenAI’s own documentation, represent a transfer of security responsibility from vendor to user that would be unusual for most consumer software.
The three-year journey from chatbot to agent reveals that large language models present fundamentally new security challenges. Traditional security frameworks assume clear separation between code and data, between instructions and content. LLMs deliberately blur these distinctions — and attackers have proven adept at exploiting the resulting ambiguity.
The lesson isn’t that organizations should avoid these technologies. It’s that AI requires genuinely novel security approaches, many of which are still being developed.
Three years in, the security community is still learning what questions to ask. The answers will take longer still.





