Configure content filters

Content filters identify harmful content in end-user messages and control how the AI agent responds.

You can filter messages by categories such as Violence or Hate. For each category, you can set the severity and the AI agent's action (Mode).

Category

Select one or more categories to filter content.

  • Violence
  • Hate
  • Sexual
  • Self harm
  • Jailbreak shield. This filter detects attempts by end users to manipulate the AI agent. Examples include exploiting flaws, bypassing safety guidelines, or overriding predefined instructions.

Mode

Select the action the agent takes when a message meets the filter criteria.

  • Annotate: The AI agent processes and responds to the message. In analytics, you can see which filter the message triggered.
  • Annotate and block: The AI agent does not process or respond to the message. In analytics, you can see which filter the message triggered.
  • Off: The AI agent does not filter messages, even if they meet the criteria for the selected filter category.

Severity

This setting controls how sensitive the filter is to harmful content.

Available severity levels:

  • Low: The most sensitive level; detects even mildly inappropriate language.
  • Medium: Detects moderately harmful or aggressive language.
  • High: The least sensitive level; detects only explicitly harmful language.
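To make the interaction between Mode and Severity concrete, the following Python sketch models the decision a filter makes for one category. The function name, mode strings, and score thresholds are illustrative assumptions, not part of the Infobip API.

```python
# Illustrative sketch of how Mode and Severity might combine for one category.
# Thresholds and names are assumptions for demonstration, not the Infobip API.

# Minimum harm score (0-1) at which each severity level triggers.
# Low triggers earliest (most sensitive); High triggers latest.
SEVERITY_THRESHOLD = {"low": 0.2, "medium": 0.5, "high": 0.8}

def apply_filter(score: float, mode: str, severity: str) -> dict:
    """Return whether the message is annotated in analytics and whether it is blocked."""
    triggered = mode != "off" and score >= SEVERITY_THRESHOLD[severity]
    return {
        "annotated": triggered,  # visible in analytics when triggered
        "blocked": triggered and mode == "annotate_and_block",
    }

# A mildly harmful message (score 0.3) passes a High filter but trips a Low one.
print(apply_filter(0.3, "annotate", "high"))           # not triggered
print(apply_filter(0.3, "annotate_and_block", "low"))  # annotated and blocked
```

Note that Off short-circuits the check entirely, matching the behavior described above: no filtering occurs regardless of the message content.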

Configure the content filter

  1. In the Content filter tab, enable Content filter.
  2. For each category, set the following:
    • Mode
    • Severity (not applicable for Jailbreak shield)
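The per-category settings from the steps above can be pictured as a simple data structure. This is a hypothetical representation for clarity; the key names and values are assumptions, not an Infobip configuration format.

```python
# Hypothetical sketch of the resulting content-filter settings; key names
# are illustrative only, not an Infobip configuration format.
content_filter = {
    "enabled": True,
    "categories": {
        "violence":  {"mode": "annotate", "severity": "medium"},
        "hate":      {"mode": "annotate_and_block", "severity": "low"},
        "sexual":    {"mode": "annotate_and_block", "severity": "medium"},
        "self_harm": {"mode": "annotate_and_block", "severity": "low"},
        # Jailbreak shield takes a mode but no severity setting.
        "jailbreak_shield": {"mode": "annotate_and_block"},
    },
}
```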

Next steps

After configuring content filters, test your agent to verify that the filters work as expected before you publish it.
