AI Security Series 3 -Datastores

Modern AI applications—especially those involving conversational agents, retrieval-augmented generation (RAG), and enterprise copilots—depend heavily on a variety of datastores to supply, retrieve, and manage knowledge. Below is an outline of various data subsystems that are user in AI applications and agents.
Vector Databases
Vector databases store high-dimensional embeddings of text, images, or other media. They enable semantic search, allowing AI models to retrieve the most contextually relevant chunks of data based on meaning rather than exact keywords. In RAG pipelines, they are the primary mechanism for fetching supporting documents or facts to ground the model’s responses.
Enterprise Data Sources (e.g. SharePoint, JIRA, SAP)
Corporate knowledge—such as policies, documents, meeting notes, and reports—typically resides in platforms like SharePoint or Google Drive. These sources provide raw content for AI systems to index and retrieve. Integrating them ensures that AI agents can answer questions based on real, organization-specific data rather than general internet knowledge.
Memory for Agents
Agent memory systems store conversational context, user preferences, and long-term facts. They allow AI systems to maintain coherence across interactions, remember past tasks, and personalize behavior. This memory can be short-lived (within a session) or persistent across many sessions, often backed by vector stores or specialized memory databases.
Together, these datastores form the information backbone of AI systems—fueling retrieval, personalization, and continuity.
Security Concerns
Unlike the internet where data volumes are high and access is ubiquitous, enterprises operate on much smaller volumes and with much fragmented silos due to the data access rules called authorization.
- Data Access and Authorization
It’s very common in enterprises to have access limitations to data sources related to various business units, functions, teams and individuals. Access to customer data brings in yet another dimension to access control. As we build AI applications and agents, the data access and authorization rules must be adhered to – a very common violation that has been observed in many systems. Authorization isn’t a “nice to have”—it’s foundational. Whether you’re building a chatbot, a RAG system, or an AI agent platform, embedding fine-grained, context-aware access control across all data touchpoints (retrieval, memory, and input sources) is essential for safe, scalable AI.
Risks
- Compliance Failures
- Sensitive Data Leakage and Privacy violations
Example: An enterprise search engine deployed in your organization starts to provide HR compensation, family and background check data of employees to sales, engineering and customer service teams.
Example: Employee performance appraisals of all employees become accessible to other employees.
- Memory Poisoning
Memory poisoning refers to the intentional or accidental injection of misleading, harmful, or manipulative content into an AI agent’s memory or knowledge retrieval systems. This content can later be retrieved or used by the model, resulting in hallucinations, misinformation, or behavior manipulation. In case of intentional attacks there is also a possibility of side-channel attacks.
Unlike prompt injection (which occurs at runtime), memory poisoning targets the persistent or retrievable knowledge that agents rely on across sessions or tasks.
Risks
- Data Contamination
- Misinformation
- Behavior Manipulation
- Goal breaking
- Liabilities
Example: In 2023, researchers demonstrated an attack on a memory-enabled GPT-based customer service agent. The agent stored prior conversations to personalize user experiences. An attacker deliberately crafted questions and responses that inserted false company policies (e.g., “You offer 100% refunds no matter what”). These were accepted, embedded into memory, and later retrieved during genuine customer interactions—causing the agent to misstate refund policies and create compliance risk.
- Session Management/Multi-user/Mutli-tenancy
AI agents and assistants are deployed in multi-user environments—from team copilots to enterprise chatbots and developer tools. In such contexts, properly managing user sessions, isolation, and tenancy boundaries is critical. Without robust mechanisms, agents may confuse or leak context between users, violating privacy and security guarantees.
Key problem areas include:
- Session confusion: Mixing up active user sessions, especially in web or chat-based agents with async interactions.
- Improper tenant scoping: Retrieval or memory layers failing to filter by tenant or user, causing cross-data contamination.
- Shared embedding indexes: Vector stores serving multiple customers without strict logical or physical partitioning.
- Insecure identity binding: Agents responding to prompts before user identity is validated (e.g., unauthenticated sessions persisting across logins).
Risks
- Unintended Personalization
- Compliance Violations
- Data Leakage
- Account Hijacking
- Auditing
Example: In a reported case during early tests of an AI bot, the system used a shared vector index across all workspaces to store and retrieve knowledge. Due to a missing tenant filter in the retrieval layer, a user in Company A asked for an onboarding guide—and received content from Company B’s private handbook.
- Insecure Ingestion
Most AI applications ingest data into vector databases. Pulling data from systems like SharePoint or Notion without proper validation or sanitization can introduce both incorrect context and security liabilities into AI workflows.
Risks
- Data Leakage
- Unauthorized Access
Example: Many enterprise search and workflow solutions have demonstrated this problem where employees got access to more data such as confidential meeting summaries.
Summary
As AI systems increasingly depend on datastores—ranging from vector databases to enterprise content platforms and conversational memory layers—their security posture becomes inseparable from the trust boundaries of those stores.
As the foundation of intelligent behavior, datastores must be treated as attack surfaces—requiring rigorous controls around access, isolation, provenance, and auditability. Building secure, scalable AI systems means treating data integrity and contextual trust as first-class design principles, not afterthoughts.