Incident Management (IM) might seem like a drag at times, but I hope by the end of this post, you’ll look at it in a different light, at least a bit.
Let’s start from the beginning. For those unfamiliar, incident management is a set of practices and processes, typically used by Site Reliability Engineering (SRE) teams, to address unplanned events that affect the quality or operation of services. A good incident management process helps IT teams investigate, record, and resolve service interruptions or outages quickly and efficiently.
The story I’m about to tell you concerns an incident that happened not long ago, and for one reason or another, I feel the need to highlight how it transpired.
Here’s the TL;DR version of the incident:
Multiple unrelated teams from different countries detected issues causing an outage of crucial Infobip services. The typical IM mitigation actions we tried did not help, so we had to restore the services manually without knowing the root cause. The incident repeated itself the next day, and again we had to resort to a manual failover because the root cause was still unknown. We did manage to cut the incident duration in half, though: the first time it took us 1h 07m to recover, the second time only 34 minutes.
What makes this incident worthy of a blog post, you might ask? It’s not the specific mitigation actions or root cause analysis that took the spotlight. What’s relevant is the collaboration between multiple teams during the incident and after it was resolved.
What happened, in gruesome detail
One day, multiple teams received alerts about their services: they were unable to access the database. Additionally, other teams reported errors preventing customers from logging into Infobip’s web services. The problem was not isolated to one data center; it affected all of them!
And still, none of us could connect to a database.
Tension grew, and panic slowly started setting in as we realized this was an unknown beast that nobody knew how to tackle.
Just when we thought it couldn’t get any worse, our API Governance team reported issues with the APIs. Upon investigation, the root problem turned out to be the same: database connectivity.
At some point, our DB Team managed to pinpoint the culprit: an unresponsive server. They saw that the failover cluster couldn’t connect to the SQL Server instance and that its CPU and RAM were at 100%.
That was not the only problem we noticed; VM performance was suffering as well. One team reported that their VM could not detect all of its allocated RAM. Once more RAM was added to the instance, the server seemed to stabilize.
Finally, we managed to restore our services, even though the root cause was still unknown. In this kind of situation, it’s very important to figure out what is causing the problem so that you can address the main issue (aka the root cause). Since we still had no idea, we did the next best thing and worked out how to react better if the same thing happened again. And we all knew it was only a matter of time before it did, since we hadn’t fixed the underlying problem.
So, when it happened again the next day, we all knew the drill. We did a manual failover, and the incident was resolved again, quickly and efficiently.
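To make the “manual failover” step a bit more concrete, here’s a minimal sketch of what it can boil down to when the database runs in a SQL Server Always On availability group. The use of Always On itself, the availability group name, and the connection details are all assumptions made for illustration; our actual setup and runbook are not described in this post.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/**
 * Hypothetical illustration of a manual failover step for a SQL Server
 * Always On availability group. Host, credentials, and the AG name are placeholders.
 */
public class ManualFailover {

    public static void main(String[] args) throws Exception {
        // Connect to the SECONDARY replica that should become the new primary.
        String url = "jdbc:sqlserver://db-secondary.example.local;databaseName=master;"
                + "encrypt=true;trustServerCertificate=true";

        try (Connection conn = DriverManager.getConnection(url, "ops_user", System.getenv("DB_PASSWORD"));
             Statement stmt = conn.createStatement()) {
            // Planned failover: promotes this replica to primary without data loss
            // (requires the replica to be in a synchronized state).
            stmt.execute("ALTER AVAILABILITY GROUP [CrucialServicesAg] FAILOVER");
            System.out.println("Failover command issued.");
        }
    }
}
```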
Lessons learned
Without team collaboration and diligence in reporting issues, we would still be scratching our heads over what to do. We would have struggled to put together an action plan, especially since we were not able to determine what was causing the issue in the first place. Having multiple teams share their side of the story during the incident made it much easier to grasp its scope.
We then wrote it all down in a single document to help others learn from the incident, so that they can react in the same way, or better, when a similar problem happens again.
Even though the root cause of the incident remains a mystery, what happened after the incident is a big success story.
What happens next
Thanks to the detailed report we collected from the different teams involved in the incident, the DB Team was able to investigate expensive queries on the main database. Based on the findings, they suggested upgrades to lessen the load on the database and improve its performance.
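If you’re wondering how a team typically hunts for expensive queries, here’s a minimal sketch of one common approach on SQL Server: reading the query-statistics DMVs over JDBC. The host name and credentials are placeholders, and the actual queries and upgrades the DB Team settled on aren’t covered here.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Lists the top CPU-consuming query plans from SQL Server's query-statistics DMVs. */
public class ExpensiveQueries {

    private static final String TOP_CPU_QUERIES = """
            SELECT TOP 10
                   qs.total_worker_time / qs.execution_count AS avg_cpu_us,
                   qs.execution_count,
                   SUBSTRING(st.text, 1, 200)                 AS query_text
            FROM sys.dm_exec_query_stats qs
            CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
            ORDER BY qs.total_worker_time DESC
            """;

    public static void main(String[] args) throws Exception {
        // Placeholder connection details for the main database server.
        String url = "jdbc:sqlserver://main-db.example.local;databaseName=master;"
                + "encrypt=true;trustServerCertificate=true";

        try (Connection conn = DriverManager.getConnection(url, "monitoring_user", System.getenv("DB_PASSWORD"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(TOP_CPU_QUERIES)) {
            while (rs.next()) {
                // total_worker_time is reported in microseconds.
                System.out.printf("%,12d us avg | %,8d runs | %s%n",
                        rs.getLong("avg_cpu_us"),
                        rs.getLong("execution_count"),
                        rs.getString("query_text"));
            }
        }
    }
}
```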
Additionally, keeping our customers’ login struggles in mind, we created new critical alerts for Infobip web interface downtime. These alerts are aimed at notifying the right teams more effectively about incidents outside of regular business hours.
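To give a feel for what a web interface downtime alert can look like under the hood, here’s a minimal sketch of a synthetic probe that raises a critical alert after a few consecutive failed checks. The URL, threshold, and the raiseCriticalAlert hook are hypothetical; our real alerts live in our monitoring stack, not in application code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Minimal synthetic uptime probe for a web interface. */
public class PortalUptimeProbe {

    private static final URI PORTAL = URI.create("https://portal.example.com/login"); // placeholder URL
    private static final int FAILURE_THRESHOLD = 3; // consecutive failures before paging

    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
    private int consecutiveFailures = 0;

    /** Runs one probe; call this from a scheduler (e.g. every 30 seconds). */
    void probeOnce() {
        boolean up;
        try {
            HttpRequest request = HttpRequest.newBuilder(PORTAL)
                    .timeout(Duration.ofSeconds(10))
                    .GET()
                    .build();
            HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
            up = response.statusCode() < 500;
        } catch (Exception e) {
            up = false; // timeouts and connection errors count as downtime
        }

        consecutiveFailures = up ? 0 : consecutiveFailures + 1;
        if (consecutiveFailures == FAILURE_THRESHOLD) {
            raiseCriticalAlert("Web interface unreachable for " + FAILURE_THRESHOLD + " consecutive checks");
        }
    }

    /** Hypothetical hook into the paging/alerting system. */
    private void raiseCriticalAlert(String message) {
        System.err.println("CRITICAL: " + message);
    }
}
```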
Other teams improved the performance of our internal services by going through the most expensive queries and methods and rewriting them as cached methods.
We’ve also fixed alerting for the database connection pool and plan to implement a cached version of UsersResource/getById.
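As an illustration of what a cached version of UsersResource/getById could look like, here’s a sketch using a Caffeine loading cache with a short TTL in front of the existing lookup. The UserRepository interface, the TTL, and the cache size are assumptions made for this example, not the real implementation.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

import java.time.Duration;

/** Sketch of a read-through cache in front of a user lookup. */
public class CachedUsersResource {

    /** Hypothetical stand-in for the existing database-backed lookup. */
    public interface UserRepository {
        User findById(long id);
    }

    public record User(long id, String name) { }

    private final LoadingCache<Long, User> cache;

    public CachedUsersResource(UserRepository repository) {
        this.cache = Caffeine.newBuilder()
                .maximumSize(100_000)                     // bound memory usage
                .expireAfterWrite(Duration.ofSeconds(30)) // tolerate slightly stale users
                .build(repository::findById);             // load from the database on a cache miss
    }

    /** Cached equivalent of getById: hits the database only on a miss or after expiry. */
    public User getById(long id) {
        return cache.get(id);
    }
}
```

A short TTL like this absorbs repeated lookups during a load spike while keeping user data reasonably fresh.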
Why failing made us stronger
My story has a very happy ending. We’ve all learned a lot from the incident, and we’ve all drawn conclusions from it. Every team involved committed to planned actions that will positively impact our services.
Service improvements will make our services more reliable, prolonging the time between incidents.
Improving monitoring and alerting will speed up our response time, so incidents are handed to the right teams and resolved faster.
Lessons learned from the incident will make it easier to respond to similar incidents in the future.
Conclusions
I hope this incident shows how team collaboration brings out the good stuff. All the planned actions the different teams committed to would never have happened if they hadn’t been involved in the IM process. In turn, this means the incident is less likely to repeat itself.
There are different ways you can be involved in Incident Management, even if your service doesn’t have an ongoing incident. Don’t be afraid to jump in if you think you have some useful information to share about an ongoing incident.
Even better, check out an incident review document that discusses all previous incidents, or get involved in your company’s weekly incident reviews. If your company doesn’t have them, suggest introducing them! It’s an amazing collaborative practice that can help minimize time spent resolving incidents and spread knowledge about your internal systems.