AI Agents

Configure guardrails

EARLY ACCESS

Guardrails detect harmful content in end user messages and control how the agent responds. They are part of the agent configuration.

For broader behavioral planning, see Plan behavioral guidelines.


Filter settings

Each guardrail has three settings:

  1. Category
  2. Severity
  3. Mode

Category

Select the type of content to filter:

  • Violence - Violent language or threats.
  • Hate - Hate speech or discriminatory content.
  • Sexual - Explicit or inappropriate sexual content.
  • Self harm - Content promoting self-harm.
  • Jailbreak shield - Attempts to manipulate the agent or bypass safety guidelines.

Severity

How sensitive the filter is:

Severity    Description
Low         Detects mildly inappropriate language.
Medium      Detects moderately harmful language.
High        Detects only explicitly harmful language.

NOTE: Severity is not applicable for Jailbreak shield.

Mode

What the agent does when the filter triggers:

Mode        Description
Annotate    Allows the message through; logs the filter match in analytics.
Block       Blocks the message; logs the filter match in analytics.
Off         Disables the filter for this category.
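The behavior of the three modes can be illustrated with a small sketch. This is not Infobip code; the function and names are hypothetical, and actual enforcement is handled by the platform:

```python
from enum import Enum

class Mode(Enum):
    ANNOTATE = "annotate"
    BLOCK = "block"
    OFF = "off"

def handle_filter_match(mode: Mode, log: list) -> bool:
    """Apply a guardrail mode to a message that matched a filter.

    Returns True if the message is allowed through, False if it is blocked.
    """
    if mode is Mode.OFF:
        # Filter disabled for this category: message passes, nothing is logged.
        return True
    # Both Annotate and Block record the match in analytics.
    log.append("filter match")
    # Annotate lets the message through; Block stops it.
    return mode is Mode.ANNOTATE
```

For example, a message matching a filter in Annotate mode is logged but still delivered, while the same match in Block mode is logged and stopped.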
NOTE: The Jailbreak shield category does not use severity levels. It detects attempts by end users to manipulate the AI agent, such as exploiting flaws, bypassing safety guidelines, or overriding predefined instructions.

To configure guardrails, open the Guardrails section in your agent configuration and set the Category, Mode, and Severity (where applicable) for each filter.
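A full guardrail setup then amounts to one (Category, Severity, Mode) entry per filter. The following sketch shows one such configuration as plain Python data; the field names are illustrative only and do not reflect the Infobip API schema:

```python
# Hypothetical representation of the Guardrails settings described above.
guardrails = [
    {"category": "Violence", "severity": "Medium", "mode": "Block"},
    {"category": "Hate", "severity": "High", "mode": "Block"},
    {"category": "Sexual", "severity": "Medium", "mode": "Block"},
    {"category": "Self harm", "severity": "Low", "mode": "Annotate"},
    # Jailbreak shield does not use severity levels.
    {"category": "Jailbreak shield", "mode": "Block"},
]
```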


