
Content filters identify harmful content in end user messages and control how the agent responds.

You can filter messages by categories such as Violence or Hate. For each category, you can set the severity and the AI agent's action (Mode).

## Category

Select one or more categories to filter content.

- Violence
- Hate
- Sexual
- Self-harm
- Jailbreak shield: Detects attempts by end users to manipulate the AI agent, such as exploiting flaws, bypassing safety guidelines, or overriding predefined instructions.

## Mode

Select the action the agent takes when a message meets the filter criteria.

- **Annotate**: The AI agent processes and responds to the message. In analytics, you can see which filter the message triggered.
- **Annotate and block**: The AI agent does not process or respond to the message. In analytics, you can see which filter the message triggered.
- **Off**: The AI agent does not filter messages, even if they meet the criteria for the selected category.
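The three modes can be summarized as a simple decision: whether the agent processes the message, and whether the triggered filter is recorded for analytics. The following sketch is only an illustration of that logic; the names and return shape are assumptions, not Infobip's actual implementation.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical mode names mirroring the settings above."""
    ANNOTATE = "annotate"
    ANNOTATE_AND_BLOCK = "annotate_and_block"
    OFF = "off"


def handle_message(flagged: bool, mode: Mode) -> dict:
    """Illustrative only: decide how the agent treats a message.

    Returns whether the agent processes the message and whether
    the triggered filter is recorded in analytics.
    """
    if mode is Mode.OFF:
        # No filtering at all, even if the message would match.
        return {"process": True, "annotate": False}
    if mode is Mode.ANNOTATE:
        # Message is always processed; a match is only recorded.
        return {"process": True, "annotate": flagged}
    # Annotate and block: a match is recorded and the message
    # is not processed.
    return {"process": not flagged, "annotate": flagged}
```

For example, a flagged message under **Annotate and block** is recorded in analytics but never reaches the agent, while the same message under **Annotate** is both recorded and answered.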

## Severity

This setting controls how sensitive the filter is to harmful content.

Available severity levels:

- **Low**: The most sensitive setting. Detects even mildly inappropriate language.
- **Medium**: Detects moderately harmful or aggressive language.
- **High**: The least sensitive setting. Detects only explicitly harmful language.
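One way to picture severity is as a threshold on a harm score: a lower severity setting means a lower threshold, so the filter triggers more easily. The threshold values and function below are invented for illustration and are not the actual Infobip implementation.

```python
# Hypothetical thresholds: lower severity setting = more sensitive filter.
THRESHOLDS = {"low": 0.25, "medium": 0.5, "high": 0.75}


def is_flagged(category_score: float, severity: str) -> bool:
    """Illustrative only: flag a message when its harm score for a
    category meets the threshold for the configured severity level.
    """
    return category_score >= THRESHOLDS[severity]
```

Under this sketch, a mildly inappropriate message with a score of 0.3 would be flagged at **Low** severity but pass at **High**.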

## Configure the content filter

1. In the **Content filter** tab, enable **Content filter**.
2. For each category, set the following:
    - **Mode**
    - **Severity** (not applicable for Jailbreak shield)

## Next steps

After configuring content filters, [test your agent](https://www.infobip.com/docs/ai-agents/build/test-agent) to verify that the filters work as expected before publishing the agent.

