So I caused an incident recently. This is a nightmare scenario for any engineer, of course. It affected only a handful of clients and only a part of their traffic, but it was enough to make me reevaluate how I think about incidents.
The incident in question affected one of our most important services and lasted almost three hours, preventing affected clients from sending certain types of messages. It was not a significant incident, or anything out of the ordinary for a company of our size. Still, as my entire team gathered on an urgent call to fix it, I was already evaluating why it happened and how it could have been avoided.
A little bit of pessimism goes a long way
A colleague of mine, Mihovil, said in a recent LinkedIn post that he loves having pessimists on his team.
As a pessimist by nature, I tend to overthink and catastrophize things. This is also how I approach testing. Before deploying anything to production, I always think of the worst possible scenario.
I also try to be constructive and immediately come up with a way to fix these issues before they arise. Psychology calls this defensive pessimism, and it has probably saved us a few times.
You can’t predict everything
The problem with this is obvious: even the worst (or the best?) pessimist cannot consider every possible problem that may occur. Or you can simply make a human mistake, like trying to reproduce conditions on the production environment with the exact same configuration running. Either way, failures will happen.
Unfortunately, I wasn’t able to predict my mistake in the case mentioned above. I manually tested the service on the dev environment, and when I later provisioned it automatically, some leftover code from the manual deployment was still in place, which led me to believe my service worked correctly.
There was an additional issue as well: the dev cluster didn’t precisely match our production cluster. It had a different network policy in place, which I overlooked.
Without getting into the nitty-gritty details of how we fixed it, the incident was caused by a faulty configuration for a DNS caching service. (As a well-known saying in IT goes, when an unknown problem comes up, it is always DNS.) Unfortunately, fixing it (or rather, reverting it) ultimately involved rebooting all the affected nodes that handled those clients’ traffic, which took a while.
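One way to catch this kind of mismatch earlier is to diff the settings of the two environments before deploying. Below is a minimal Python sketch, assuming each environment’s configuration can be exported as a flat dictionary. The function name `config_drift` and the keys and values in the example are hypothetical and are not our actual setup.

```python
ABSENT = "<absent>"  # placeholder shown when a key exists in only one environment


def config_drift(dev: dict, prod: dict) -> dict:
    """Return {key: (dev_value, prod_value)} for every key whose values differ."""
    drift = {}
    for key in sorted(set(dev) | set(prod)):
        dev_value = dev.get(key, ABSENT)
        prod_value = prod.get(key, ABSENT)
        if dev_value != prod_value:
            drift[key] = (dev_value, prod_value)
    return drift


# Hypothetical settings, for illustration only.
dev = {"network_policy": "allow-all", "dns_cache": "enabled"}
prod = {"network_policy": "default-deny", "dns_cache": "enabled"}

for key, (dev_value, prod_value) in config_drift(dev, prod).items():
    print(f"{key}: dev={dev_value!r} prod={prod_value!r}")
```

A check like this would not have caught the leftover code from the manual deployment, but it would at least have flagged the differing network policy before the change reached production.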
Fortunately, here at Infobip, we have a great team handling SRE, with exceptionally well-written procedures for handling incidents, and everything was up and running soon enough.
Incidents can happen – looking for a culprit shouldn’t
Perhaps more importantly, we also have a “zero-blame culture.” This means we avoid blaming a particular person or a team when an incident occurs. Various studies have shown that searching for someone to blame for an incident is unproductive, harms employee morale, and creates a negative culture that impedes overall productivity.
Instead, we try to examine why the processes were unable to prevent that failure.
Of course, our “zero-blame culture” and SRE procedures weren’t always so advanced, but as the company grew, we learned a lot from previous mistakes.
I remember one particular incident where an entire data center (DC) went down and pulled most of our services in other DCs along with it. It was resolved by all the involved teams frantically trying to fix things on their own end. They eventually managed, but the incident also kicked off a major overhaul of how we think about incidents and of how we solve and prevent them.
So when this incident started, I caught it the same minute, as I was deploying something. Partly because I’m a pessimist at heart, but partly because a zero-blame culture lets us learn from our mistakes instead of making us pay for them.