Safety & Toxicity
Detect harmful, violent, dangerous, or self-harm-inducing content in AI output.
What it detects
- Violent or graphic content
- Hate speech and targeted harassment
- Self-harm and suicide-related content
- Content harmful to minors
- Instructions for dangerous activities
Scoring
The Safety pillar returns a score from 0 to 1. A score of 0.0 indicates severe violations; 1.0 means clean. The default blocking threshold is 0.3.
Configuration
In your policy settings, you can configure:
- Threshold: Score below which the action triggers (default 0.3)
- Action: What to do on violation (default: block)
- Categories: Which sub-categories to enable (all enabled by default)
Was this page helpful?