Prompt Injection
Manipulate agent behavior via untrusted input
Prompt injection is the most prevalent attack against AI agents. It occurs when untrusted input -- from web pages, documents, emails, or user messages -- contains instructions that override the agent's original programming. Unlike traditional code injection (SQL injection, XSS), prompt injection exploits the fact that AI agents cannot reliably distinguish between instructions from their operator and instructions embedded in content they process.
Real-World Example
In 2024, researchers demonstrated that a hidden instruction in a Google Doc could cause an AI assistant to exfiltrate the user's email contacts. The instruction was invisible to the human reader but processed by the AI as a command. Similar attacks have been demonstrated against Bing Chat, ChatGPT plugins, and autonomous coding agents.
Defense Strategy
Defense requires multiple layers: instruction anchoring (repeating safety instructions), input/output filtering, delimiter-based prompt architecture, and behavioral monitoring. No single defense is sufficient. Tools like HackMyAgent can scan for injection vulnerabilities in your agent's configuration.