Journey of Learning and Growth: My Experience with Incident Management

"Incident management": the two words every engineer hates! It's cumbersome, with so many processes, and let's face it, no one likes to be in that limelight. Yet it is an essential cog in the functioning of any organization. Over the years, I have been part of various incident management teams, each with its own stance on how an incident should be resolved. However, most teams rarely view incidents through the right lens, so the same minor and sometimes major blips, breaks, and even outages slip through again and again.


What most teams think incident management is:

  • On-call rotations at odd hours... ugh
  • Troubleshooting and applying hotfixes all the time 
  • Fixing the same issues over and over again 
  • Scrambling for the right logs 
  • Finally, dreading deployments!

Here's why I see it differently:

I began my career as a technical support engineer, moved on to being a NOC engineer for about two years, then a network support engineer, and then an SRE for about two years. My experience across these roles taught me different aspects of managing downtime, be it on a client machine, a server, or a virtual cloud environment.

Every stage of my career taught me a different aspect of incident management, even though I didn't see it then.

As a tech support engineer, every call was essentially an incident, and we were always on-call. These incidents could be caused by updates, incompatibilities, or even the users themselves. We had to solve these issues in record time while ensuring customer satisfaction. It was here that I understood first-hand the significance of SLAs, or service level agreements. In layman's terms, an SLA is the price you have to pay to an unhappy customer :P. It could mean helping the customer resolve the issue free of cost to them, but obviously at a cost to the company. It could also mean allowing certain features to be used free of charge, granting additional usage time, and more.

The next phase of my professional career began as a systems engineer, frequently working on and troubleshooting legacy systems. Being on-call here showed me the need to search for patterns, to look for repeated flaws, and to document them. I began building runbooks and go-to documents, which I used to bring new joiners up to speed and to help whoever was on-call find a quick solution. As much as I disliked being on-call here, to my surprise, I did pick up a very important habit.

My role as a network support engineer a couple of years later, at a different organization, was all about understanding the cloud and adapting to new technology. Moving from a legacy system to a cloud environment was definitely hard to comprehend at first. Many times during an incident, we would search far and wide across the kingdom for accurate and understandable logs :( and never find them. I definitely felt the need for a robust log-scraping and collating tool! Understanding the importance of logging, documentation, and looking at error patterns drove me to search for a different solution. My research brought me to tools like ELK, Blameless, Gremlin, Prometheus, Grafana, Loki, and the list goes on. This opened a whole new gateway of wonderful things to learn, like chaos engineering, observability tooling, and measurable metrics.

As I shifted to the role of an SRE, the pieces of the puzzle seemed to fall into place perfectly.

I now see an incident as a necessary evil that any organization needs to face to:

  • Find flaws in its working process
  • Identify patterns of errors that could be solved through planning, testing, and automation
  • Understand that being on-call is not only about putting out fires but also about communication and data collection
  • Use a robust logging system to find out where the gaps are
  • Have enough testing and preparedness to deploy something without having the heebie-jeebies :D

Incident management to me now is:

  • Planning for an eventuality
  • Undertaking good communication and event logging during said eventuality
  • In the words of Thor - 'Let's do Get Help!' - call for help if needed
  • Then, dig in! Postmortems are a way to learn and not stumble on the same block twice!
  • Lastly, once the gaps are found, use the right tools to permanently fix the issue.

My goal is to make the service teams I work with see what I see when an incident alarm goes off!

Communication in Engineering and Leadership:

I had just completed a short presentation on the team's daily to-dos and our metrics when a senior engineer walked up to me, commended me for my communication and presentation abilities, and then said, "Maybe you should be in the HR team; you do this stuff well." I felt both dumbfounded and belittled, and all I could do was smile and walk away.


I began my educational journey at Clarence High School in Bangalore, where I studied until the 4th grade. I completed the rest of my schooling at MES Indian School in Qatar. Throughout my education, the one remark that stuck with me was that I was talkative. This trait has become a part of who I am, and naturally, speaking in public became one of my favourite things to do. However, I always found myself unable to be precise when needed and elaborate when necessary.

My father is a Toastmaster and was a member of a club called ICC-ONE TOASTMASTERS. He used to take me along to his meetings regularly. I would wonder why everyone was taking their roles and responsibilities so seriously, whether it was giving a speech or keeping time. I thought to myself, "What good does this do them? How is this helpful?" In my mind, I already had fluency of speech, so why was I even there? Little did I know that I would reap the benefits of the lessons learnt at these meetings years later.

One summer, my father enrolled me in something called a youth leadership programme, and I agreed mostly to find out what the fuss was all about. The programme lasted a month, but the skills I acquired will last me a lifetime. I began to understand the importance of organization, the value of being prepared to perform one's roles, and the value of receiving and giving constructive criticism. I also understood the importance of communicating not just with my words, but also with my voice, eyes, and hands.

The principles I learned from Toastmasters spilled over into my daily life, allowing me to strike a balance between my passion and my academic and professional commitments. I adopted a routine of planning, which helped me prioritize tasks and prepare for important events. I also became more receptive to constructive criticism and learned to address my shortcomings and work on improving them. I have encountered engineers who consider communication and networking skills to be beneath them and who question those who do have these skills. Yet it was these same engineers who often approached me to be the spokesperson at events, especially those that included higher management. I consistently faced such situations during my tenure at the company, where I was made to feel that my communication skills overshadowed my engineering knowledge. It was only when I transitioned to a different role that I realized that effective communication is not limited to HR or marketing professionals; it is a crucial skill for every engineer to possess and use in their work.

As I began to move ahead in my career, I learnt that communication and leadership are two essential skills that go hand in hand. Effective communication is a critical component of effective leadership. Without communication, it is impossible to convey ideas, build relationships, or make decisions, and this holds true at every phase of our professional and personal lives.

"The art of communication is the language of leadership." - James Humes