Incident management
Incident management is a structured process for identifying, addressing, and resolving platform incidents. The goal is to minimize the impact on business operations and customer experience.
What is an incident?
An incident is any unexpected disruption, service degradation, or decline in the quality of Infobip’s platform and services. Incidents can range from minor issues, such as a product feature not working as intended, to complete platform outages.
Incident management stages
The incident management process includes three main stages:
- Identification: Detect incidents using monitoring tools, internal discovery, or customer feedback.
- Intervention: Restore the functionality of affected services.
- Review: Document the incident to support learning and future improvements.
Maintenance activities
Maintenance activities keep the Infobip platform operational, efficient, and secure. They include software updates, system optimizations, security patches, and infrastructure upgrades, which help prevent unexpected downtime and ensure seamless service delivery.
Types of maintenance
Maintenance activities are divided into two categories:
- Planned maintenance: Scheduled in advance and communicated to customers at least 10 business days before the activity. Planned maintenance is usually performed during off-peak hours to minimize disruption.
- Emergency maintenance: Announced less than 10 business days in advance. Emergency maintenance addresses urgent issues that could affect platform functionality or security.
The main difference is that planned maintenance is pre-scheduled and communicated ahead of time, while emergency maintenance is reactive and performed as needed.
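The 10-business-day rule above can be illustrated with a short sketch. This is not Infobip's implementation: the function names are hypothetical, and the sketch assumes that business days are Monday to Friday and that public holidays are ignored.

```python
from datetime import date, timedelta

# Notice period taken from the policy text above.
NOTICE_THRESHOLD_BUSINESS_DAYS = 10


def business_days_between(announced: date, start: date) -> int:
    """Count weekdays after `announced`, up to and including `start`.

    Assumption: Monday-Friday are business days; holidays are not considered.
    """
    days = 0
    current = announced
    while current < start:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days


def classify_maintenance(announced: date, start: date) -> str:
    """Return "planned" if notice is at least 10 business days, else "emergency"."""
    notice = business_days_between(announced, start)
    return "planned" if notice >= NOTICE_THRESHOLD_BUSINESS_DAYS else "emergency"
```

For example, an activity announced on Friday 2024-03-01 and starting on Friday 2024-03-15 has exactly 10 business days of notice and is classified as planned. The same announcement for a start on 2024-03-14 is classified as emergency.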
Maintenance and incident notifications
Incidents and maintenance activities within the Infobip platform are announced on the Status page. Operator maintenance and delivery degradations caused by external factors are announced on the External connectivity status page.
You can subscribe to updates for both status pages to receive notifications through your preferred channel: email, Slack, webhook, or Atom/RSS feed.
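If you choose the Atom feed option, the feed can be read with any standard XML parser. The sketch below uses Python's standard library and a made-up sample payload. The entry titles, IDs, and timestamps are illustrative assumptions, and the real feed may carry additional fields.

```python
import xml.etree.ElementTree as ET

# Standard Atom namespace (RFC 4287).
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample payload; real feed contents will differ.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example status feed</title>
  <entry>
    <id>tag:example.invalid,2024:2</id>
    <title>Delivery degradation: example channel</title>
    <updated>2024-03-02T08:30:00Z</updated>
  </entry>
  <entry>
    <id>tag:example.invalid,2024:1</id>
    <title>Planned maintenance: example data center</title>
    <updated>2024-03-01T10:00:00Z</updated>
  </entry>
</feed>
"""


def list_entries(feed_xml: str) -> list[tuple[str, str]]:
    """Return (updated, title) pairs for every entry in an Atom feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        updated = entry.findtext("atom:updated", default="", namespaces=ATOM_NS)
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        entries.append((updated, title))
    return entries


for updated, title in list_entries(SAMPLE_FEED):
    print(updated, title)
```

In practice, you would fetch the feed from the subscription URL shown on the status page and pass the response body to `list_entries`.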
Incident management process
Early detection helps prevent issues from escalating and supports faster response times.
Incident identification
How incidents are detected:
- Platform monitoring: Automated tools continuously assess the health and performance of platform components. These tools trigger alerts when anomalies are detected.
- Internal discovery: Team members may manually identify issues that automated systems do not catch, such as unexpected service behavior or procedural errors.
- Customer reports: Customers can report issues they experience. User feedback helps identify incidents that monitoring tools might miss.
Initial response
When an incident is detected, an initial impact assessment is performed. The goal is to gather basic information about affected locations (such as data centers or regions), channels (such as SMS, RCS, WhatsApp), and products or interfaces (such as Conversations, Answers, HTTP API, SMPP).
Incidents are logged as internal "bridge" events using predefined forms. Each bridge event serves as a central communication channel for cross-department collaboration.
Incident escalation and notification
After the incident is logged, it is escalated to the appropriate personnel. Operational team members are alerted and begin troubleshooting. Structured escalation procedures ensure timely involvement of necessary experts.
Customer Support monitors the incident and prepares a Status page notification. The initial notification is typically issued within 15 minutes and includes:
- Affected locations (such as data centers or regions)
- Affected channels and products/interfaces
- A brief description of the known impact
Following the initial notification, Customer Support provides updates as new information becomes available.
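The contents of the initial notification can be pictured as a small data structure. The following is a minimal sketch, not Infobip's internal tooling: the class, field, and function names are hypothetical, and it assumes the 15-minute target is measured from the time the incident is detected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Typical target from the text above; the starting point is an assumption.
INITIAL_NOTIFICATION_TARGET = timedelta(minutes=15)


@dataclass
class ImpactAssessment:
    """Basic facts gathered during the initial response (hypothetical model)."""
    locations: list[str]   # for example, data centers or regions
    channels: list[str]    # for example, SMS, RCS, WhatsApp
    products: list[str]    # for example, Conversations, HTTP API, SMPP
    description: str       # brief description of the known impact


def render_initial_notification(a: ImpactAssessment) -> str:
    """Build the plain-text body of an initial status notification."""
    return "\n".join([
        f"Affected locations: {', '.join(a.locations)}",
        f"Affected channels: {', '.join(a.channels)}",
        f"Affected products/interfaces: {', '.join(a.products)}",
        f"Known impact: {a.description}",
    ])


def within_target(detected_at: datetime, notified_at: datetime) -> bool:
    """Check whether the notification met the 15-minute target."""
    return notified_at - detected_at <= INITIAL_NOTIFICATION_TARGET


assessment = ImpactAssessment(
    locations=["Example region"],
    channels=["SMS"],
    products=["HTTP API"],
    description="Delayed message delivery.",
)
print(render_initial_notification(assessment))
```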
Incident intervention
Once escalated, engineers focus on mitigating the issue. The process includes:
- Verifying and assessing the impact
- Formulating a theory about the cause
- Collecting data to support or refute the theory
Mitigation involves implementing quick actions to restore service functionality.
These actions may be temporary and include steps such as:
- Removing a faulty instance from the load balancer
- Reconfiguring the service to use a healthy database cluster
Mitigation steps are announced in advance and monitored for effectiveness.
The mitigation phase concludes when:
- A stable situation is achieved where continuous human involvement is no longer necessary
- There is sufficient time to implement proper fixes for incident resolution
If only temporary solutions are applied, long-term fixes are required to address the root cause.
For example:
Mitigation step | Long-term fix |
---|---|
Remove a faulty instance from the load balancer | Restore the instance and reintegrate it |
Reconfigure service to use a healthy database cluster | Repair the original database cluster |
Incident review
After resolution, a post-incident review is conducted. The review includes collecting detailed data and creating a Root Cause Analysis (RCA) document. A key aspect of this phase is identifying preventive actions and translating them into actionable tasks.
At Infobip, Incident Record Files (IRFs) are used to record incident metrics and written reviews from involved personnel and reliability teams.
IRF template
The IRF includes:
- Summary and timeline: Short summary with timestamped key actions and events.
- Detection: How and when the incident was detected.
- Mitigation: Record of all mitigation actions and requirements for faster future mitigation.
- Contributing causes: Root cause and other factors that affected the incident’s duration or impact.
- Impact assessment: Description of customer impact and affected products or services.
- Preventive actions: Actions taken or planned to prevent recurrence and address all contributing causes.
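The IRF sections listed above map naturally onto a structured record. The sketch below is illustrative only: the class name, field names, and the `missing_sections` helper are assumptions and do not describe Infobip's actual IRF format.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime


@dataclass
class IncidentRecordFile:
    """Hypothetical model of the IRF sections described above."""
    summary: str = ""
    timeline: list[tuple[datetime, str]] = field(default_factory=list)
    detection: str = ""
    mitigation: str = ""
    contributing_causes: list[str] = field(default_factory=list)
    impact_assessment: str = ""
    preventive_actions: list[str] = field(default_factory=list)


def missing_sections(irf: IncidentRecordFile) -> list[str]:
    """Return the names of sections that are still empty."""
    return [f.name for f in fields(irf) if not getattr(irf, f.name)]


draft = IncidentRecordFile(
    summary="Example: delayed delivery on one channel.",
    detection="Example: detected by platform monitoring.",
)
print(missing_sections(draft))
```

A check like `missing_sections` could be used to confirm that a review is complete before the RCA document is prepared.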
Root Cause Analysis document
The RCA document is prepared for incidents originating within the Infobip platform, using information from IRFs.
Infobip's Root Cause Analysis documents generally follow this structure:
