How we moved prompt injection protections from the agent into the MCP server

Prompt injection is one of the hardest problems in AI security, and almost every proposed solution focuses on the agent. But there’s another place to build defences: the MCP server itself. At Infobip, we mostly deal with 2-way communication between businesses and their end users. AI agents processing inbound SMS messages open themselves to prompt injection from those texts, unless proper safeguards are put in place. We have built those safeguards into our MCP servers, so that they can be reused by various agents that rely on our platform for communicating with their users.

This blog post walks through how those safeguards work, what trade-offs they impose on connected agents, and where the same pattern could apply beyond CPaaS.

Current Landscape

When Simon Willison coined the term, prompt injection, back in 2022 it was primarily viewed in context of jailbreaking, though understood even back then as a unique concept. With the advent of tool usage, later standardized through the adoption of the MCP specification, the surface area for exploits against AI agents grew. Willison identified a set of capabilities he refers to as a lethal trifecta that leave agents vulnerable to prompt injection:

1. Processing content from untrusted sources

2. Access to sensitive systems or private data

3. Ability to communicate externally or change state of the system

Whenever an agentic system has all 3 capabilities it becomes susceptible to exploit. Capabilities could be implemented by agent’s built-in tools, or by one or more of the installed MCP servers. They can come from different sources and from different services and providers.

Meta engineers identified this as a problem and came up with the Agents Rule of Two framework. The rule states that an AI agent can freely implement 2 of these capabilities, but must omit or restrict the 3rd one. This framework helps us build safer agents by allowing us to make trade-offs and decide which capability to sacrifice. Researchers from Google’s DeepMind devised an approach in which AI agent encodes its actions as Python code which in turn operates on potentially malicious data to prevent it from influencing the LLM. A group of authors composed an overview of design patterns that can be used to architect agentic applications resilient to prompt injection.

Crucial to the success of all these mitigation strategies is the fact that they are deterministic and focus on long standing engineering principles like securing the data flow and control flow. They also focus on the agent application, or harness. This is a reasonable starting point, because agent source code has the greatest amount of control over the flows. Developers of the agent application can decide on what trade-offs make sense for the use-cases they are implementing.

CPaaS Considerations

Before looking at how we addressed prompt injection, it helps to first understand the risks CPaaS platforms already have to manage. These are the same risks that an AI agent with communication capabilities could easily amplify if those capabilities are exposed without the right safeguards.

One of the clearest examples is spam, or unsolicited messages, which CPaaS platforms like Infobip need to guard against. Some regions, such as the US for example, impose strict restrictions and require brands and businesses that send texts and other messages to respect opt-outs from their users. Businesses that end up sending unwanted messages, even unwillingly, can face investigation and fines.

Closely related to spam is phishing (smishing, for SMS). SMS texts, e-mails, and OTT channels (such as WhatsApp, Viber, etc.) are often a vehicle used by fraudsters to deliver phishing messages. In these cases, attackers try to trick users by impersonating brands that users know and trust. It is important to protect senders: phone numbers, email addresses, and similar identifiers that brands use to communicate with their users from such attackers.

These same protections against spam and phishing must also be considered when building CPaaS-capable AI agents. Such agents provide high utility, being capable of communication with end users over channels that users already know and are comfortable with. However, exposing naive implementation of these capabilities would leave agents vulnerable.

A Vulnerable MCP Server Setup

Consider a scenario in which we expose an MCP server with 2 tools, one to receive raw inbound SMS texts, and one to send outbound texts. An AI agent could use OAuth or API keys to authenticate business customer which would grant it access to business’ senders. Then the agent could send and receive SMS from the official phone number that users already know and use it to interact with a brand they trust.

Without additional protections against prompt injection an agent connected to such an MCP server would be vulnerable, since all 3 lethal trifecta capabilities would be present:

1. Agent could receive maliciously crafted messages from inbound SMS texts.

2. Agent would have access to a trusted sender.

3. Agent could send outbound spam or phishing SMS texts.

Each agent connected to such a server would violate Agents Rule of Two and would need to implement some protections and limitations on at least one of these capabilities.

Asking every agent application to reimplement these protections seemed fragile. That’s why we decided to build them into Infobip’s MCP server directly. That way, the safeguards are implemented once and can be reused by every application that connects to the server.

Safe 2-way SMS Conversations

Typically, security features are usually a compromise between utility and safety. In case of an AI agent with capability to communicate over texts one such compromise is to lock one agent session into conversation with one phone. In this scenario, during one agent session, (i.e., one agentic task) or a series of LLM calls that share memory, the agent is only allowed to receive messages from and send messages to a single user. This compromise still allows for 2-way conversations between users and the agent, which covers the majority of potential use cases.

Attack prevention with safe 2-way system

One limitation of this approach is that users won’t be able to instruct the agent to send messages to third parties. There are a few additional limitations that will become clear as we dive deeper into the implementation details discussed below.

Implementation we ended up with

Let’s look at the specific implementation that we ended up with in Infobip, and how we lock the agent into conversation with one user at a time.

We implemented the 2-way SMS agentic capability by starting from an inbound text message sent by a user to the agent. Prior to sending the message to the agent with a webhook, the server creates a limited-time session with a random token. The webhook payload contains both the inbound message details and the session token. Agent code, a harness, can use the message sender as a key of persisted memory that it feeds to the LLM tasked with processing received messages.

The agent is connected to Infobip’s MCP server for responding. The server’s tool descriptions and input schemas instruct the agent to return the session token in subsequent tool calls when sending outbound messages. The session is linked to the information about the user on Infobip side. This includes their phone number, from which they sent the original inbound message, but can also include data from their CDP (Customer Data Platform) profile, such as e-mail address.

The AI agent is free to interpret the inbound text however it sees fit. For example, a RAG system might be used in combination with the 2-way SMS to build a knowledge agent capable of answering domain-specific questions. If the agent decides to respond by sending an outbound SMS text, it needs to include the session token in its tool call. MCP server checks that the token is present, corresponds to an active session, and that the agent is interacting with the same user that sent the original inbound message.

Legitimate processing with safe 2-way system

This can be expanded in a few ways. Based on the conversation, the agent might decide to update the CDP profile of the user. Or it might switch the conversation from SMS to one of the richer communication channels such as RCS or e-mail. In each case, it would need to provide the session token, and the server would verify that token matches the initial user.

Implications

By using a cryptographically secure pseudorandom value for the session token, we ensure that the LLM cannot hallucinate a valid token value. The only way for it to obtain a valid session token is to receive it from the webhook payload. Inbound messages are ephemeral on the server side; once delivered, they cannot be accessed again, ensuring there is only ever a single session for each inbound message.

The MCP server rejects tool calls to send outbound messages without a valid session token. Likewise, server rejects tool calls that attempt to target users other than the one the session is linked to. This forces the agent to interact only with the initial user and their data. This includes sending messages, but also updating user’s profile in CDP, etc.

The session is time-bound and expires after a set interval of few minutes. This gives the agent a time window to respond. It is enough time for an LLM based system to produce a response to the initial inbound message, but prevents reuse of old tokens. For example, even if the session token leaks into persisted memory and agent gets compromised by a prompt injection from a different source later on, an expired session cannot be used any more. Lastly, depending on configuration, sessions can be created as one-time use, meaning only the first tool call within time window will succeed. Subsequent attempts by the agent to reuse the same session token will be rejected by the server.

Each new inbound message results in a new session with a new token being generated and sent to the agent. This creates a risk of the LLM getting confused and reusing an old and no longer valid token. This can be mitigated with clear tool descriptions and server instructions. The upside is having single-use time-based tokens.

Limitations

There are a few additional limitations of this system.

This approach intentionally splits receiving inbound messages from sending outbound ones. The AI workload is triggered by a webhook call from Infobip to the agent. This requires the agent application to expose a web server. It complicates the deployment setup, but makes it easier for classic agent code to keep LLM memory separate for each user. Which in turn guards from memory poisoning attacks.

An AI agent connected to the MCP server for responding can’t proactively start a conversation by sending an outbound text. It must first receive a webhook with an inbound message from a user to reply to it. This can be worked around by using a different flow with a different, unrestricted MCP server to initialize the conversation by sending the first outbound message. In this initialization flow the LLM would not need to process inbound messages, so there is no risk of prompt injection.

There is still a risk of an inbound message tricking the agent into calling other tools besides those exposed by the response MCP server which enforces the session token checks. In these cases, a classic HTTP API can be exposed alongside the MCP server which can be used to validate session tokens. Then the agent can be instructed to include the token with those additional tool calls, and the tool implementation can be extended to validate the token with a classic API call.

CDP data can be used to give additional context to the session retrieved from the API. This added context can be used to perform authorization on additional tools. This, of course, requires developers to have control over the implementation of these tools, which might not always be the case.

Attack prevention with system extended to 3^rd party tools

Conclusion

At Infobip, we have applied the Agents Rule of Two framework to our implementation of a system that empowers AI agents with 2-way communication capabilities. The underlying principle extends beyond CPaaS. Wherever an agent responds to requests from known parties: a customer support ticket, an incoming webhook, an appointment callback, that initiating event can serve as a session anchor enforced at the MCP server level. This demonstrates how prompt injection mitigation need not be limited to source code of AI agents. Preventative controls can be implemented on the MCP server side and reused across all agents that connect to it.

Our approach benefits from having a focused domain (2-way communication) and a tight tool surface. This results in smaller server footprint and is a good approach to designing MCP servers in general. Additionally, it imposes a specific compromise to agents connected to the server. This is why we also expose fully featured MCP servers without these limitations as lower-level primitives that can be used by those agents that require different trade-offs.