Incident management

Incident management is a structured process for identifying, addressing, and resolving platform incidents. The goal is to minimize the impact on business operations and customer experience.

What is an incident?

An incident is any unexpected disruption, service degradation, or decline in the quality of Infobip’s platform and services. Incidents can range from minor issues, such as a product feature not working as intended, to complete platform outages.

Incident management stages

The incident management process includes three main stages:

  1. Identification: Detect incidents using monitoring tools, internal discovery, or customer feedback.
  2. Intervention: Restore the functionality of affected services.
  3. Review: Document the incident to support learning and future improvements.

Maintenance activities

Maintenance activities are essential for keeping the Infobip platform operational, efficient, and secure. These tasks include software updates, system optimizations, security patches, and infrastructure upgrades. They help prevent unexpected downtime and support seamless service delivery.

Types of maintenance

Maintenance activities are divided into two categories:

  • Planned maintenance: Scheduled in advance and communicated to customers at least 10 business days before the activity. Planned maintenance is usually performed during off-peak hours to minimize disruption.
  • Emergency maintenance: Announced less than 10 business days in advance. Emergency maintenance addresses urgent issues that could affect platform functionality or security.

The main difference is that planned maintenance is pre-scheduled and communicated ahead of time, while emergency maintenance is reactive and performed as needed.
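
The 10-business-day rule above can be sketched in a few lines of Python. This is only an illustration, not Infobip's tooling: the function names are hypothetical, and the sketch assumes business days are Monday to Friday and ignores public holidays.

```python
from datetime import date, timedelta

PLANNED_NOTICE_BUSINESS_DAYS = 10  # threshold stated in the text above


def business_days_between(announced: date, start: date) -> int:
    """Count Monday-Friday days after `announced`, up to and including `start`."""
    days = 0
    current = announced
    while current < start:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days


def classify_maintenance(announced: date, start: date) -> str:
    """Return "planned" for at least 10 business days of notice, else "emergency"."""
    notice = business_days_between(announced, start)
    return "planned" if notice >= PLANNED_NOTICE_BUSINESS_DAYS else "emergency"
```

For example, an activity announced on Monday, 3 March 2025 and starting on Monday, 17 March 2025 has exactly 10 business days of notice under these assumptions, so the sketch classifies it as planned.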

Maintenance and incident notifications

Incidents and maintenance activities within the Infobip platform are announced on the Status page. Operator maintenance and delivery degradations caused by external factors are announced on the External connectivity status page.

You can subscribe to updates for both status pages to receive notifications through your preferred channel: email, Slack, webhook, or Atom/RSS feed.
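
If you consume the Atom feed programmatically, the entries can be read with Python's standard library. The sketch below is a minimal example under stated assumptions: the sample feed content is invented for illustration, and the real feed URL and the fields it carries are not specified here.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}  # standard Atom namespace

# Hypothetical sample; a real status feed may carry different fields.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example status feed</title>
  <entry>
    <title>Planned maintenance: example data center</title>
    <updated>2025-03-03T09:00:00Z</updated>
  </entry>
  <entry>
    <title>Incident: delayed SMS delivery</title>
    <updated>2025-03-04T12:30:00Z</updated>
  </entry>
</feed>"""


def parse_atom_entries(xml_text: str) -> list[tuple[str, str]]:
    """Return (title, updated) pairs for every entry in an Atom document."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        updated = entry.findtext("atom:updated", default="", namespaces=ATOM_NS)
        entries.append((title, updated))
    return entries
```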

Incident management process

Early detection helps prevent issues from escalating and supports faster response times.

Incident identification

How incidents are detected:

  • Platform monitoring: Automated tools continuously assess the health and performance of platform components. These tools trigger alerts when anomalies are detected.
  • Internal discovery: Team members may manually identify issues that automated systems do not catch, such as unexpected service behavior or procedural errors.
  • Customer reports: Customers can report issues they experience. User feedback helps identify incidents that monitoring tools might miss.
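
As a rough illustration of how automated monitoring can turn anomalies into alerts, the sketch below raises an alert when an error rate stays above a threshold for several consecutive samples. The metric, the 5% threshold, and the three-sample window are assumptions chosen for the example, not values used by Infobip.

```python
from typing import Optional, Sequence


def detect_anomaly(
    error_rates: Sequence[float],
    threshold: float = 0.05,   # assumed: alert above a 5% error rate
    consecutive: int = 3,      # assumed: require 3 bad samples in a row
) -> Optional[int]:
    """Return the index of the sample that triggers the alert, or None."""
    streak = 0
    for index, rate in enumerate(error_rates):
        # Extend the streak on a bad sample, reset it on a healthy one.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return index
    return None
```

Requiring consecutive samples is one simple way to avoid alerting on a single short spike.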

Initial response

When an incident is detected, an initial impact assessment is performed. The goal is to gather basic information about affected locations (such as data centers or regions), channels (such as SMS, RCS, WhatsApp), and products or interfaces (such as Conversations, Answers, HTTP API, SMPP).

Incidents are logged as internal "bridge" events using pre-made forms. This creates a central communication channel for cross-department collaboration.

Incident escalation and notification

After reporting, the incident is escalated to the appropriate personnel. Operational team members are alerted and begin troubleshooting. Structured escalation procedures ensure timely involvement of necessary experts.

Customer Support monitors the incident and prepares a Status page notification. The initial notification is typically issued within 15 minutes and includes:

  • Affected locations (such as data centers or regions)
  • Affected channels and products/interfaces
  • A brief description of the known impact

Following the initial notification, Customer Support provides updates as new information becomes available.
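
The content of the initial notification can be modeled as a small data structure. The sketch below is illustrative only: the class and function names are hypothetical, and it assumes the 15-minute target is measured from the moment the incident is detected, which the text above does not state explicitly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

INITIAL_NOTICE_TARGET = timedelta(minutes=15)  # "typically within 15 minutes"


@dataclass
class ImpactAssessment:
    locations: list[str]    # for example, data centers or regions
    channels: list[str]     # for example, SMS, RCS, WhatsApp
    interfaces: list[str]   # for example, Conversations, HTTP API, SMPP
    description: str        # brief description of the known impact


def build_initial_notification(impact: ImpactAssessment) -> str:
    """Format the fields of the initial notification as plain text."""
    return "\n".join([
        "Locations: " + ", ".join(impact.locations),
        "Channels: " + ", ".join(impact.channels),
        "Products/interfaces: " + ", ".join(impact.interfaces),
        "Known impact: " + impact.description,
    ])


def within_target(detected_at: datetime, published_at: datetime) -> bool:
    """Check whether the notification met the assumed 15-minute target."""
    return published_at - detected_at <= INITIAL_NOTICE_TARGET
```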

Incident intervention

Once escalated, engineers focus on mitigating the issue. The process includes:

  • Verifying and assessing the impact
  • Formulating a theory about the cause
  • Collecting data to support or refute the theory

Mitigation involves implementing quick actions to restore service functionality.

These actions may be temporary and include steps such as:

  • Removing a faulty instance from the load balancer
  • Reconfiguring the service to use a healthy database cluster

Mitigation steps are announced in advance and monitored for effectiveness.

The mitigation phase concludes when:

  • A stable situation is achieved where continuous human involvement is no longer necessary
  • There is sufficient time to implement proper fixes for incident resolution

If only temporary solutions are applied, long-term fixes are required to address the root cause.

For example:

Mitigation step | Long-term fix
Remove a faulty instance from the load balancer | Restore the instance and reintegrate it
Reconfigure service to use a healthy database cluster | Repair the original database cluster
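
One way to make sure temporary mitigations are not forgotten is to derive follow-up tasks from them. The sketch below uses the two examples above as a lookup table; the function name and the fallback wording are hypothetical.

```python
# Pairs taken from the examples above.
LONG_TERM_FIXES = {
    "Remove a faulty instance from the load balancer":
        "Restore the instance and reintegrate it",
    "Reconfigure service to use a healthy database cluster":
        "Repair the original database cluster",
}


def follow_up_tasks(applied_mitigations: list[str]) -> list[str]:
    """Return one follow-up task per applied mitigation step."""
    tasks = []
    for step in applied_mitigations:
        fix = LONG_TERM_FIXES.get(step)
        # Unknown steps still produce a task so that nothing is dropped.
        tasks.append(fix if fix else f"Define a long-term fix for: {step}")
    return tasks
```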

Incident review

After resolution, a post-incident review is conducted. The review includes collecting detailed data and creating a Root Cause Analysis (RCA) document. A key aspect of this phase is identifying preventive actions and translating them into actionable tasks.

At Infobip, Incident Record Files (IRFs) are used to record incident metrics and written reviews from involved personnel and reliability teams.

IRF template

The IRF includes:

  • Summary and timeline: Short summary with timestamped key actions and events.
  • Detection: How and when the incident was detected.
  • Mitigation: Record of all mitigation actions and requirements for faster future mitigation.
  • Contributing causes: Root cause and other factors that affected the incident’s duration or impact.
  • Impact assessment: Description of customer impact and affected products or services.
  • Preventive actions: Actions taken or planned to prevent recurrence and address all contributing causes.
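
The sections of the template map naturally onto a record type. The following sketch is not Infobip's actual IRF format; the field names are derived from the list above, and the completeness check is a hypothetical helper.

```python
from dataclasses import dataclass, field, fields


@dataclass
class IncidentRecordFile:
    summary_and_timeline: str = ""
    detection: str = ""
    mitigation: str = ""
    contributing_causes: list[str] = field(default_factory=list)
    impact_assessment: str = ""
    preventive_actions: list[str] = field(default_factory=list)


def missing_sections(irf: IncidentRecordFile) -> list[str]:
    """List the sections that are still empty, in template order."""
    return [f.name for f in fields(irf) if not getattr(irf, f.name)]
```

A check like `missing_sections` could be run before a review is closed to confirm that every section of the template has been filled in.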

Root Cause Analysis document

The RCA document is prepared for incidents originating within the Infobip platform, using information from IRFs.

Infobip's Root Cause Analysis documents generally follow this structure:

[Image: RCA document]

Copyright @ 2006-2025 Infobip ltd.
