Incident management


Incident management is a structured process for identifying, addressing, and resolving platform incidents. The goal is to minimize the impact on business operations and customer experience.

What is an incident?

An incident is any unexpected disruption, service degradation, or decline in the quality of Infobip’s platform and services. Incidents can range from minor issues, such as a product feature not working as intended, to complete platform outages.

Incident management stages

The incident management process includes three main stages:

Identification: Detect incidents using monitoring tools, internal discovery, or customer feedback.
Intervention: Restore the functionality of affected services.
Review: Document the incident to support learning and future improvements.

Maintenance activities

Maintenance activities are essential for keeping the Infobip platform operational, efficient, and secure. These tasks include software updates, system optimizations, security patches, and infrastructure upgrades, and they help prevent unexpected downtime and ensure seamless service delivery.

Types of maintenance [#types-of-maintenance-maintenance-activities]

Maintenance activities are divided into two categories:

Planned maintenance: Scheduled in advance and communicated to customers at least 10 business days before the activity. Planned maintenance is usually performed during off-peak hours to minimize disruption.
Emergency maintenance: Announced less than 10 business days in advance. Emergency maintenance addresses urgent issues that could affect platform functionality or security.

The main difference is that planned maintenance is pre-scheduled and communicated ahead of time, while emergency maintenance is reactive and performed as needed.
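The 10-business-day threshold can be expressed as a small check. The following is a minimal sketch, not Infobip's implementation: the function names are hypothetical, and it assumes business days are Monday to Friday with public holidays ignored.

```python
from datetime import date, timedelta

# Notice threshold taken from the definitions above.
PLANNED_NOTICE_BUSINESS_DAYS = 10


def business_days_between(announced: date, starts: date) -> int:
    """Count weekdays after `announced`, up to and including `starts`.

    Assumption: Monday-Friday are business days; holidays are not considered.
    """
    days = 0
    current = announced
    while current < starts:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            days += 1
    return days


def classify_maintenance(announced: date, starts: date) -> str:
    """Return "planned" with at least 10 business days of notice, else "emergency"."""
    notice = business_days_between(announced, starts)
    return "planned" if notice >= PLANNED_NOTICE_BUSINESS_DAYS else "emergency"
```

For example, an announcement made on Friday, 1 March 2024 for work starting on Friday, 15 March 2024 gives exactly 10 business days of notice, so this sketch classifies it as planned. One day less notice would classify it as emergency.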

Maintenance and incident notifications [#maintenance-and-incident-notifications-maintenance-activities]

Incidents and maintenance activities within the Infobip platform are announced on the Status page. Operator maintenance and delivery degradations caused by external factors are announced on the External connectivity status page.

You can subscribe to updates for both status pages and receive notifications through your preferred channel: email, Slack, webhook, or Atom/RSS feed.
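If you consume updates through the Atom feed, entries can be read with a standard XML parser. The sketch below uses Python's standard library only. The sample feed, its titles, and the `list_entries` helper are illustrative assumptions, not the actual feed content or an Infobip API.

```python
import xml.etree.ElementTree as ET

# Standard Atom namespace (RFC 4287).
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Made-up sample document used only to demonstrate parsing.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example status feed</title>
  <entry>
    <title>Example: degraded delivery</title>
    <updated>2024-03-01T10:15:00Z</updated>
  </entry>
  <entry>
    <title>Example: scheduled maintenance</title>
    <updated>2024-03-02T08:00:00Z</updated>
  </entry>
</feed>
"""


def list_entries(atom_xml: str) -> list[tuple[str, str]]:
    """Return (updated, title) pairs for every entry in an Atom document."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        updated = entry.findtext("atom:updated", default="", namespaces=ATOM_NS)
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        entries.append((updated, title))
    return entries
```

In practice you would fetch the feed document from the subscription link provided on the status page and pass its text to `list_entries`.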

For more details, see:

Incident management process

Early detection helps prevent issues from escalating and supports faster response times.

Incident identification [#incident-identification-incident-management-process]

How incidents are detected:

Platform monitoring: Automated tools continuously assess the health and performance of platform components. These tools trigger alerts when anomalies are detected.
Internal discovery: Team members may manually identify issues that automated systems do not catch, such as unexpected service behavior or procedural errors.
Customer reports: Customers can report issues they experience. User feedback helps identify incidents that monitoring tools might miss.

Initial response [#initial-response-incident-management-process]

When an incident is detected, an initial impact assessment is performed. The goal is to gather basic information about affected locations (such as data centers or regions), channels (such as SMS, RCS, WhatsApp), and products or interfaces (such as Conversations, Answers, HTTP API, SMPP).
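The three dimensions of the initial assessment map naturally onto a small record. This is a hypothetical sketch for illustration: the class name, field names, and the example region identifier are assumptions, not part of any Infobip tooling.

```python
from dataclasses import dataclass, field


@dataclass
class ImpactAssessment:
    """Hypothetical record of an initial impact assessment."""

    locations: list[str] = field(default_factory=list)  # data centers or regions
    channels: list[str] = field(default_factory=list)   # e.g. SMS, RCS, WhatsApp
    products: list[str] = field(default_factory=list)   # e.g. Conversations, HTTP API

    def missing_fields(self) -> list[str]:
        """Return the dimensions that have not been filled in yet."""
        return [name for name, values in vars(self).items() if not values]


# Example: location and channel are known, affected products are still unclear.
assessment = ImpactAssessment(locations=["example-region-1"], channels=["SMS"])
```

Here `assessment.missing_fields()` returns `["products"]`, which indicates what still needs to be gathered before the assessment is complete.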

Incidents are logged as internal "bridge" events using pre-made forms. This creates a central communication channel for cross-department collaboration.

Incident escalation and notification [#incident-escalation-and-notification-incident-management-process]

After the incident is logged, it is escalated to the appropriate personnel. Operational team members are alerted and begin troubleshooting. Structured escalation procedures ensure that the necessary experts are involved promptly.

Customer Support monitors the incident and prepares a Status page notification. The initial notification is typically issued within 15 minutes and includes:

Affected locations (such as data centers or regions)
Affected channels and products or interfaces
A brief description of the known impact
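The notification contents and the 15-minute target can be sketched as follows. The class, the `within_target` helper, and the rendered text layout are hypothetical. The 15-minute figure is the typical target stated above, not a guaranteed limit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Typical target for the initial notification, as stated above.
INITIAL_NOTIFICATION_TARGET = timedelta(minutes=15)


@dataclass
class InitialNotification:
    """Hypothetical container for the three parts of an initial notification."""

    locations: list[str]
    channels_and_products: list[str]
    impact_description: str

    def render(self) -> str:
        """Format the notification as plain text (layout is an assumption)."""
        return (
            f"Locations: {', '.join(self.locations)}\n"
            f"Affected: {', '.join(self.channels_and_products)}\n"
            f"Impact: {self.impact_description}"
        )


def within_target(detected_at: datetime, published_at: datetime) -> bool:
    """Check whether the notification went out within the 15-minute target."""
    return published_at - detected_at <= INITIAL_NOTIFICATION_TARGET
```

A notification published 12 minutes after detection meets the target in this sketch, while one published after 20 minutes does not.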

Following the initial notification, Customer Support provides updates as new information becomes available.

Incident intervention

Once escalated, engineers focus on mitigating the issue. The process includes:

Verifying and assessing the impact
Formulating a theory about the cause
Collecting data to support or refute the theory

Mitigation involves implementing quick actions to restore service functionality.

These actions may be temporary and include steps such as:

Removing a faulty instance from the load balancer
Reconfiguring the service to use a healthy database cluster

Mitigation steps are announced in advance and monitored for effectiveness.

The mitigation phase concludes when:

A stable situation is achieved where continuous human involvement is no longer necessary
There is sufficient time to implement proper fixes for incident resolution

If only temporary solutions are applied, long-term fixes are required to address the root cause.

For example:

Mitigation step	Long-term fix
Remove a faulty instance from the load balancer	Restore the instance and reintegrate it
Reconfigure service to use a healthy database cluster	Repair the original database cluster
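One way to make sure temporary mitigations are not forgotten is to track the long-term fix each one implies. The sketch below encodes the two example pairs from the table. The `outstanding_fixes` helper is hypothetical and only illustrates the idea.

```python
# Mitigation step -> long-term fix, taken from the example table above.
LONG_TERM_FIXES = {
    "Remove a faulty instance from the load balancer":
        "Restore the instance and reintegrate it",
    "Reconfigure service to use a healthy database cluster":
        "Repair the original database cluster",
}


def outstanding_fixes(applied: list[str], completed: set[str]) -> list[str]:
    """Return long-term fixes still owed for the mitigations that were applied."""
    fixes = []
    for step in applied:
        fix = LONG_TERM_FIXES.get(step)
        # Skip unknown steps and fixes that are already done.
        if fix is not None and fix not in completed:
            fixes.append(fix)
    return fixes
```

Calling `outstanding_fixes` with the mitigations applied during an incident and the set of fixes already completed yields the follow-up work that remains before the root cause is fully addressed.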

Incident review

After resolution, a post-incident review is conducted. The review includes collecting detailed data and creating a Root Cause Analysis (RCA) document. A key aspect of this phase is identifying preventive actions and translating them into actionable tasks.

At Infobip, Incident Record Files (IRFs) are used to record incident metrics and written reviews from involved personnel and reliability teams.

IRF template [#irf-template-incident-review]

The IRF includes:

Summary and timeline: Short summary with timestamped key actions and events.
Detection: How and when the incident was detected.
Mitigation: Record of all mitigation actions, plus what would be needed to mitigate faster in the future.
Contributing causes: Root cause and other factors that affected the incident’s duration or impact.
Impact assessment: Description of customer impact and affected products or services.
Preventive actions: Actions taken or planned to prevent recurrence and address all contributing causes.
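The template sections can be modeled as a simple record. This is an illustrative sketch only: the class name, field names, and types are assumptions derived from the section list above, not the actual IRF format.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class IncidentRecordFile:
    """Hypothetical model of the IRF sections listed above."""

    summary: str = ""
    timeline: list[tuple[datetime, str]] = field(default_factory=list)
    detection: str = ""
    mitigation: str = ""
    contributing_causes: list[str] = field(default_factory=list)
    impact_assessment: str = ""
    preventive_actions: list[str] = field(default_factory=list)

    def sorted_timeline(self) -> list[tuple[datetime, str]]:
        """Return timeline events in chronological order."""
        return sorted(self.timeline, key=lambda item: item[0])

    def empty_sections(self) -> list[str]:
        """Return the names of sections that are still blank."""
        return [name for name, value in vars(self).items() if not value]
```

A reviewer could call `empty_sections()` before closing the review to confirm that every section, including preventive actions, has been filled in.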

Root Cause Analysis document [#root-cause-analysis-document-incident-review]

The RCA document is prepared for incidents originating within the Infobip platform, using information from IRFs.

Infobip's Root Cause Analysis documents generally follow this structure: