Safety & Toxicity

Detect harmful, violent, dangerous, or self-harm-inducing content in AI output.

What it detects

  • Violent or graphic content
  • Hate speech and targeted harassment
  • Self-harm and suicide-related content
  • Content harmful to minors
  • Instructions for dangerous activities

Scoring

The Safety pillar returns a score from 0 to 1, where 0.0 indicates severe violations and 1.0 indicates clean output. Output that scores below the threshold triggers the configured action; by default the threshold is 0.3 and the action is block.
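
As a rough illustration of the threshold rule, the sketch below compares a score against the default of 0.3. The function name `triggers_action` is hypothetical and not part of any documented API. Treating a score exactly equal to the threshold as passing is an assumption based on the word "below".

```python
DEFAULT_THRESHOLD = 0.3  # default blocking threshold for the Safety pillar


def triggers_action(score: float, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Return True when a Safety score falls below the threshold.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    # Strict comparison: only scores *below* the threshold trigger the action.
    # Behavior at exactly the threshold is an assumption.
    return score < threshold


if __name__ == "__main__":
    for score in (0.05, 0.3, 0.92):
        verdict = "block" if triggers_action(score) else "allow"
        print(f"score={score:.2f} -> {verdict}")
```

Because lower scores mean more severe violations, raising the threshold makes the policy stricter and lowering it makes the policy more permissive.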

Configuration

In your policy settings, you can configure:

  • Threshold: Score below which the action triggers (default: 0.3)
  • Action: What to do on violation (default: block)
  • Categories: Which sub-categories to enable (all enabled by default)
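
The three settings above could be modeled as a small configuration object. This is only a sketch: the class name `SafetyPolicy`, its field names, and the category identifiers are all assumptions chosen to mirror the lists on this page, not the product's actual schema.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the five sub-categories listed under
# "What it detects"; the real names may differ.
ALL_CATEGORIES = frozenset({
    "violence",
    "hate_speech",
    "self_harm",
    "minors",
    "dangerous_activities",
})


@dataclass(frozen=True)
class SafetyPolicy:
    """Illustrative container for the Safety pillar settings."""

    threshold: float = 0.3                   # score below which the action triggers
    action: str = "block"                    # what to do on violation
    categories: frozenset = ALL_CATEGORIES   # all sub-categories enabled by default

    def __post_init__(self) -> None:
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")
        unknown = set(self.categories) - ALL_CATEGORIES
        if unknown:
            raise ValueError(f"unknown categories: {sorted(unknown)}")


if __name__ == "__main__":
    default_policy = SafetyPolicy()
    # A stricter policy limited to two sub-categories.
    strict_policy = SafetyPolicy(
        threshold=0.5,
        categories=frozenset({"self_harm", "minors"}),
    )
    print(default_policy)
    print(strict_policy)
```

Validating the threshold range and category names at construction time surfaces a mistyped setting early, before any output is scored.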