Prompt Security — VeldrixAI Docs

Threats covered

Direct injection — "ignore previous instructions and…"
Indirect injection — malicious instructions hidden in retrieved documents or tool output
Jailbreaks — role-play and persona attacks ("DAN", "developer mode")
Exfiltration — attempts to leak the system prompt or secrets

Flags

Flag	Meaning
`prompt_injection_detected`	Injection/override attempt — forces `BLOCK`.
`jailbreak_attempt`	Known jailbreak pattern detected.
`system_prompt_leak`	Response reveals protected instructions.

Hard block behaviour

prompt_injection_detected is a critical flag: the verdict is BLOCK even if every other pillar scores perfectly. Injection is treated as a security event, not a quality issue.

Example

Evaluate both the user prompt and the final response — injection often lives in the input. For agents, also evaluate tool output before the model consumes it (see Tool Interception).

Was this page helpful?