Prompt Security
Detect prompt injection attacks, jailbreak attempts, role-playing exploits, and adversarial input patterns.
Attack types
- Prompt injection: Malicious instructions embedded in user input
- Jailbreak attempts: Requests to bypass safety instructions via roleplay, hypotheticals, or encoding tricks
- Goal hijacking: Inputs that try to redirect the model's objective
- Context manipulation: Attempts to overwrite the system prompt
Detection layers
Prompt Security runs on both the input (user prompt) and the output (model response) to catch both direct attacks and successful manipulations.
Response
On detection, the default action is block. You can configure escalate to route suspicious inputs to human review instead.
Was this page helpful?