Safety & Toxicity — VeldrixAI Docs

What it catches

The Safety pillar uses multi-stage semantic detection — zero-shot classifiers backed by the inference layer — to score how likely a response is to cause harm. It covers:

Violence, threats, and incitement
Harassment and hate speech
Sexual and explicit content
Self-harm and dangerous instructions (weapons, illicit synthesis)

Flags

Flag	Meaning
`content_unsafe`	General unsafe content detected.
`explicit_content_detected`	Sexual/explicit material — forces `BLOCK`.
`self_harm`	Self-harm encouragement or instructions.

`explicit_content_detected` is a critical flag and forces a hard block regardless of the aggregate score.
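The override rule above can be sketched in a few lines. This is an illustration only, not VeldrixAI's implementation: the `decide` function, the `ALLOW` verdict, and the 0.5 cutoff are hypothetical names and values chosen for the example. Only the `explicit_content_detected` flag and the `BLOCK` outcome come from the table.

```python
from __future__ import annotations

# Hypothetical sketch: `decide`, "ALLOW", and the 0.5 cutoff are assumptions
# made for illustration; they are not part of the VeldrixAI SDK.
CRITICAL_FLAGS = frozenset({"explicit_content_detected"})


def decide(aggregate_score: float, flags: set[str], block_below: float = 0.5) -> str:
    """Return "BLOCK" or "ALLOW" for a scored response."""
    # A critical flag short-circuits the decision, whatever the score is.
    if CRITICAL_FLAGS & flags:
        return "BLOCK"
    # Otherwise fall back to the aggregate score (higher is treated as safer).
    return "BLOCK" if aggregate_score < block_below else "ALLOW"


# A high aggregate score does not rescue a response carrying a critical flag.
print(decide(0.95, {"explicit_content_detected"}))
print(decide(0.95, set()))
```

The point of checking the flag first is that no weighting or threshold change elsewhere can accidentally let explicit content through.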

Example

Python

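# Assumes `client` is an already-initialized VeldrixAI client, and that `prompt`
# and `model_output` hold the user prompt and the model's reply.
r = client.evaluate_sync(prompt=prompt, response=model_output)
if r.pillar_scores["safety"] < 0.5:  # this example treats scores below 0.5 as unsafe
    print("Unsafe:", r.critical_flags)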

Tuning

Raise the safety weight for consumer-facing or youth audiences. If your domain legitimately discusses sensitive topics (e.g. clinical or security research), lower the review threshold to reduce false positives.
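To see how the two knobs interact, here is a minimal sketch that assumes the aggregate is a weighted mean of pillar scores and that a response goes to review when its aggregate falls below the review threshold. The page does not state how VeldrixAI actually aggregates scores, so the formula, the `aggregate` and `needs_review` helpers, and the `other_pillar` name are all hypothetical.

```python
# Hypothetical sketch: a weighted mean is an assumed aggregation rule, and
# `other_pillar` is a placeholder name; neither comes from the VeldrixAI docs.
def aggregate(pillar_scores: dict, weights: dict) -> float:
    """Weighted mean of pillar scores (higher is treated as safer)."""
    total_weight = sum(weights[p] for p in pillar_scores)
    return sum(pillar_scores[p] * weights[p] for p in pillar_scores) / total_weight


def needs_review(score: float, review_threshold: float) -> bool:
    """A response is sent to review when its score falls below the threshold."""
    return score < review_threshold


scores = {"safety": 0.4, "other_pillar": 0.9}

# Raising the safety weight lets a weak safety score pull the aggregate down more.
balanced = aggregate(scores, {"safety": 1.0, "other_pillar": 1.0})
strict = aggregate(scores, {"safety": 3.0, "other_pillar": 1.0})
print(balanced, strict)

# Lowering the review threshold sends fewer borderline responses to review.
print(needs_review(balanced, review_threshold=0.7))
print(needs_review(balanced, review_threshold=0.6))
```

Under these assumptions, the weight controls how much safety counts relative to other pillars, while the threshold controls how cautious the final decision is.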
