Configure guardrails

Guardrails detect harmful content in end-user messages and control how the agent responds. They are part of the agent configuration.

For broader behavioral planning, see Plan behavioral guidelines.


Filter settings

Each guardrail has three settings:

  1. Category
  2. Severity
  3. Mode

Category

Select the type of content to filter:

  • Violence - Violent language or threats.
  • Hate - Hate speech or discriminatory content.
  • Sexual - Explicit or inappropriate sexual content.
  • Self harm - Content promoting self-harm.
  • Jailbreak shield - Attempts to manipulate the agent or bypass safety guidelines.

Severity

Select the minimum severity of content that triggers the filter. Lower settings catch more content:

| Severity | Description |
| --- | --- |
| Low | Detects mildly inappropriate language. |
| Medium | Detects moderately harmful language. |
| High | Detects only explicitly harmful language. |

NOTE: Severity does not apply to the Jailbreak shield category.
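
One way to read the severity levels above is as a threshold: a filter set to Low triggers on anything from mildly inappropriate language upward, while a filter set to High triggers only on explicitly harmful language. The Python sketch below illustrates that reading. The `Severity` enum and the `triggers` function are hypothetical names chosen for illustration; they are not part of the product, and the threshold behavior is an assumption drawn from the descriptions above.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical ordering of severity levels, mildest first."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def triggers(detected: Severity, configured: Severity) -> bool:
    """Return True when detected content meets or exceeds the configured level.

    Assumption: the filter acts as a threshold, so a lower configured
    level catches more content than a higher one.
    """
    return detected >= configured
```

Under this reading, `triggers(Severity.MEDIUM, Severity.LOW)` is true, while `triggers(Severity.MEDIUM, Severity.HIGH)` is false.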

Mode

What the agent does when the filter triggers:

| Mode | Description |
| --- | --- |
| Annotate | Allows the message through; logs the filter match in analytics. |
| Block | Blocks the message; logs the filter match in analytics. |
| Off | Disables the filter for this category. |
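
The three modes differ only in whether a matching message is let through and whether the match is logged. As a minimal sketch of that decision, assuming hypothetical names (`Mode`, `Outcome`, `apply_mode`) that do not come from the product:

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ANNOTATE = "annotate"
    BLOCK = "block"
    OFF = "off"


@dataclass(frozen=True)
class Outcome:
    delivered: bool  # whether the message is allowed through
    logged: bool     # whether the filter match is recorded in analytics


def apply_mode(mode: Mode, matched: bool) -> Outcome:
    """Map a mode and a filter result to an outcome (illustrative only)."""
    if mode is Mode.OFF or not matched:
        # A disabled filter, or one that found nothing, changes nothing.
        return Outcome(delivered=True, logged=False)
    if mode is Mode.BLOCK:
        return Outcome(delivered=False, logged=True)
    # Annotate: the message passes, but the match is still logged.
    return Outcome(delivered=True, logged=True)
```

Annotate can be useful for observing how often a filter would trigger before switching it to Block.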
[Figure: AI agents guardrails configuration]
NOTE: The Jailbreak shield category does not use severity levels. It detects attempts by end users to manipulate the AI agent, such as exploiting flaws, bypassing safety guidelines, or overriding predefined instructions.

To configure guardrails, open the Guardrails section in your agent configuration and set the Category, Mode, and Severity (where applicable) for each filter.
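
Guardrails are configured in the agent configuration UI, so no code is required. For readers who want to reason about valid combinations, the Python sketch below models one filter and checks the rule that Jailbreak shield takes no severity. Every name here (`Guardrail`, the lowercase category and mode strings) is hypothetical, and requiring a severity for every other category is an assumption, not documented behavior.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical identifiers for the options listed in this section.
CATEGORIES = {"violence", "hate", "sexual", "self_harm", "jailbreak_shield"}
SEVERITIES = {"low", "medium", "high"}
MODES = {"annotate", "block", "off"}


@dataclass(frozen=True)
class Guardrail:
    category: str
    mode: str
    severity: Optional[str] = None  # left unset for jailbreak_shield

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode}")
        if self.category == "jailbreak_shield":
            # Jailbreak shield does not use severity levels.
            if self.severity is not None:
                raise ValueError("jailbreak_shield does not take a severity")
        elif self.severity not in SEVERITIES:
            # Assumption: every other category needs a severity.
            raise ValueError(f"unknown severity: {self.severity}")


# Example: block hateful content at Medium and above, annotate jailbreak attempts.
guardrails = [
    Guardrail(category="hate", mode="block", severity="medium"),
    Guardrail(category="jailbreak_shield", mode="annotate"),
]
```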


Next steps