Configure content filters
Content filters identify harmful content in end user messages and control how the agent responds.
You can filter messages by categories such as Violence or Hate. For each category, you can set the severity and the AI agent's action (Mode).
Category
Select one or more categories to filter content.
- Violence
- Hate
- Sexual
- Self-harm
- Jailbreak shield: Detects attempts by end users to manipulate the AI agent, such as exploiting flaws, bypassing safety guidelines, or overriding predefined instructions.
Mode
Select the action the agent takes when a message meets the filter criteria.
- Annotate: The AI agent processes and responds to the message. In the analytics, you can see which filter the message triggered.
- Annotate and block: The AI agent does not process or respond to the message. In the analytics, you can see which filter the message triggered.
- Off: The AI agent does not filter messages, even if they meet the criteria for the selected filter category.
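The three modes can be summarized as: Off never blocks and never annotates, Annotate records the triggered filter but lets the agent respond, and Annotate and block records the filter and stops the response. A minimal sketch of that decision logic, with hypothetical names (`handle_message`, `FilterResult`) that are illustrative rather than the product's actual API:

```python
from dataclasses import dataclass


@dataclass
class FilterResult:
    """Outcome of evaluating one message against one filter category."""
    category: str
    triggered: bool


def handle_message(message: str, mode: str, result: FilterResult,
                   annotations: list) -> bool:
    """Return True if the agent may process and respond to the message.

    Hypothetical sketch of the three modes; names are illustrative only.
    """
    if mode == "off":
        return True  # no filtering attempted, regardless of the result
    if result.triggered:
        annotations.append(result.category)  # surfaced in analytics
        if mode == "annotate_and_block":
            return False  # agent neither processes nor responds
    return True  # "annotate" still lets the agent respond
```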
Severity
This setting controls how sensitive the filter is to harmful content.
Available severity levels:
- Low: The most sensitive setting; detects even mildly inappropriate language.
- Medium: Detects moderately harmful or aggressive language.
- High: The least sensitive setting; detects only explicitly harmful language.
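One way to picture the severity setting is as a detection threshold: a lower severity level flags content at a lower harm level, so it triggers more often. The sketch below assumes a per-message harm score in [0, 1]; the score, thresholds, and names are illustrative assumptions, not the product's actual scoring model.

```python
# Assumed thresholds for illustration only; the real scoring is internal
# to the product and not exposed as numbers.
SEVERITY_THRESHOLDS = {
    "low": 0.3,     # most sensitive: flags even mildly inappropriate language
    "medium": 0.6,  # flags moderately harmful or aggressive language
    "high": 0.9,    # least sensitive: flags only explicitly harmful language
}


def is_triggered(harm_score: float, severity: str) -> bool:
    """Return True if a message with this harm score trips the filter."""
    return harm_score >= SEVERITY_THRESHOLDS[severity]
```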
Configure the content filter
- On the Content filter tab, enable Content filter.
- For each category, set the following:
- Mode
- Severity (not applicable for Jailbreak shield)
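Putting the steps together, a complete configuration pairs each category with a mode and, except for Jailbreak shield, a severity. The sketch below is a hypothetical representation of such a configuration; the key names are illustrative, not an actual settings schema.

```python
# Hypothetical content-filter configuration; key names are illustrative.
content_filter = {
    "enabled": True,
    "categories": {
        "violence":  {"mode": "annotate", "severity": "medium"},
        "hate":      {"mode": "annotate_and_block", "severity": "low"},
        "sexual":    {"mode": "annotate_and_block", "severity": "medium"},
        "self_harm": {"mode": "annotate_and_block", "severity": "low"},
        # Jailbreak shield takes a mode only; severity does not apply.
        "jailbreak_shield": {"mode": "annotate_and_block"},
    },
}
```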
Next steps
After you configure content filters, test your agent to verify that the filters work as expected before you publish it.