Software Incident Response Is Stuck In The Past. Is AI The Answer?

Welcome to my LinkedIn newsletter! In each issue of Entrepreneurship and Leadership, I'll be sharing my thoughts on scaling, leading and funding high-growth startups and what the future of innovation looks like. Subscribe here to stay updated.


At my first software company, I was the unlucky first responder when an outage struck our platform … on Thanksgiving. Holiday viewing had pushed traffic way up at Netflix, our biggest customer, leaving our monitoring system unable to handle the extra demand.

Reluctant to disturb my teammates, I took it upon myself to get to the root of the problem. Four hours later, I was still hunched over my laptop. In total, we spent at least two days figuring out what went wrong and fixing it. 

In the intervening 15 years, not much has changed in incident response. Software incidents remain a menace. And for engineers, dealing with them is still a huge headache: reactive, manual and time-consuming. These outages play a major role in burnout, which more than half of developers cite as a reason their peers quit, Harness research found.

Here’s why incident response is still such a hassle — and how new agentic AI may ease that toil and frustration.

Stuck in the past: A serious problem with ripple effects  

Right now, when a software incident strikes, it’s like a house is on fire. And every minute matters. The goal isn’t merely to “put out the fire,” but to mitigate the impact of the incident and make sure that the customer experience doesn’t suffer.  

First, the on-call engineer gets an alert, putting them in the hot seat. Logging into monitoring systems, they pore over charts and graphs to identify possible root causes, most of which turn out to be dead ends. Inevitably, other experts from the dev team are brought into the loop, the response swelling into group Slack threads or Zoom calls.

The whole process is messy and unpredictable. Often, fixes are the easy part. It’s pinpointing and diagnosing the problem that’s tricky. Meanwhile, the impact of inefficient incident response can be devastating. Last year’s CrowdStrike meltdown, for instance, saw a faulty software update crash 8.5 million devices worldwide, plunging airlines, banks, retailers, hospitals and many other essential businesses into chaos. 

Nor are these events rare. In a recent survey, 55% of companies said they experienced IT disruptions at least once a week, with almost half of these resulting in downtime for two hours or longer. Overall, one in three businesses lost $100,000 to $1 million or more.

Source: The Hidden Costs of Downtime Report from Splunk

Then there’s the impact on developers and the dev experience. No one likes to get woken up in the middle of the night or have to spend the weekend on call. And when an engineer is in the midst of a task, it’s challenging to context-switch to troubleshoot an incident.

Bringing incident response into the AI era

While AI is being used extensively for coding, it’s still novel for incident response. Historically, dev teams have been reluctant to introduce automation into a process that can be frustratingly ad hoc. But in many ways, incident response makes a perfect use case for AI — and agentic AI in particular — since it requires always-on capabilities, a deep knowledge base and repeated cycles to uncover root causes. 

Let’s say several users report issues with publishing on a company’s website. Recognizing that these reports could point to a bigger problem, agentic AI can then alert the on-call team to a possible incident, giving them a head start before things spiral out of control.
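To make that detection step concrete, here is a minimal sketch of how scattered user reports might be grouped into a possible incident. It is an illustration only: the function name, the 10-minute window and the three-report threshold are hypothetical choices, not details of any particular product.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical tuning values (assumptions, not from the article).
WINDOW = timedelta(minutes=10)
THRESHOLD = 3


def detect_possible_incidents(reports, now):
    """Return features with at least THRESHOLD user reports inside WINDOW.

    `reports` is a list of (timestamp, feature) tuples.
    """
    recent = Counter(
        feature
        for ts, feature in reports
        if timedelta(0) <= now - ts <= WINDOW
    )
    return sorted(f for f, count in recent.items() if count >= THRESHOLD)


now = datetime(2025, 1, 1, 12, 0)
reports = [
    (now - timedelta(minutes=1), "publishing"),
    (now - timedelta(minutes=3), "publishing"),
    (now - timedelta(minutes=7), "publishing"),
    (now - timedelta(minutes=4), "login"),
    (now - timedelta(hours=2), "publishing"),  # too old to count
]

possible = detect_possible_incidents(reports, now)
print(possible)
```

A real agent would presumably weigh many more signals than a simple count, but the idea is the same: several related reports in a short window are worth an early alert.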

Traditionally, diagnostics — figuring out what went wrong — has been the most tedious step in incident response, requiring experienced devs to cycle through potential root causes and eliminate them one by one. But agentic AI can quickly handle detective work that might take engineers hours, or even days. Ideally, it also integrates institutional knowledge from past Zoom calls, Slack messages and other communication into its knowledge base.

Ultimately, AI will settle on one or more likely root causes: the latest software deployment may have introduced conflicting publishing permissions, leading to user errors, for instance.

Importantly, human engineers still have a critical role to play at this stage. Rather than prescribe a solution, AI can guide the ensuing investigation by flagging recent code changes that might be problematic or system logs that warrant a closer look. Engineers can use those AI-powered insights to do further research and decide on the best course of action — like rolling back the code to the previous stable version or applying a patch.
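As a rough illustration of "flagging recent code changes," the sketch below lists deployments that landed shortly before an incident began, newest first, so an engineer can decide what to inspect or roll back. The `Deployment` record, the six-hour lookback and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    # Hypothetical record shape; real deployment metadata will differ.
    service: str
    version: str
    deployed_at: datetime


def suspect_deployments(deployments, incident_start, lookback=timedelta(hours=6)):
    """Return deployments made within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    suspects = [
        d for d in deployments
        if window_start <= d.deployed_at <= incident_start
    ]
    return sorted(suspects, key=lambda d: d.deployed_at, reverse=True)


incident_start = datetime(2025, 1, 1, 12, 0)
deployments = [
    Deployment("publishing-api", "v42", datetime(2025, 1, 1, 11, 30)),
    Deployment("auth", "v17", datetime(2025, 1, 1, 8, 0)),
    Deployment("search", "v9", datetime(2024, 12, 30, 9, 0)),
    Deployment("publishing-api", "v43", datetime(2025, 1, 1, 12, 30)),  # after the incident began
]

suspects = suspect_deployments(deployments, incident_start)
for d in suspects:
    print(f"{d.service} {d.version} deployed at {d.deployed_at:%H:%M}")
```

The output is a shortlist, not a verdict. The engineer still chooses whether to roll back, patch or keep digging.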

The process doesn’t end here. Ideally, agentic AI can learn from the experience: documenting key events and actions taken and filing away insights for future incidents. (Translation: Next time, you don’t have to wake up the database manager to repeat the process.)

Caveats for AI incident response

While this process may sound seamless, it comes with caveats. For starters, even the best large language model will fall short for incident response, unless it’s fully integrated into the development process.

For incident response to work, the AI agent must be plugged into the company’s developer resources and knowledge graph — databases, microservices, continuous integration/continuous delivery (CI/CD), and other infrastructure and apps. By connecting the dots across the software development lifecycle, agentic AI simplifies the hunt for root causes and speeds up resolution time.

When used properly, agentic AI for incident response can yield a significant payoff for the business, customers and employees.

First, there’s a dramatic improvement in mean time to recovery (MTTR) from incidents: as much as 50-80%, in my experience. With unplanned downtime costing the world’s biggest companies an estimated $400 billion a year, or almost 10% of profits, that’s a huge ROI. Customers regain access to vital services faster, reducing risks to brand loyalty.
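To put the 50-80% range in perspective, here is a small worked example. MTTR is the average time it takes to recover from an incident. The three incident durations below are invented purely for illustration.

```python
def mttr(durations_minutes):
    """Mean time to recovery: average recovery time across incidents."""
    return sum(durations_minutes) / len(durations_minutes)


def after_reduction(baseline_minutes, reduction_pct):
    """MTTR after cutting the baseline by `reduction_pct` percent."""
    return baseline_minutes * (100 - reduction_pct) / 100


# Hypothetical recovery times for three incidents, in minutes.
baseline = mttr([120, 240, 90])

print(f"Baseline MTTR: {baseline:.0f} min")
print(f"After a 50% reduction: {after_reduction(baseline, 50):.0f} min")
print(f"After an 80% reduction: {after_reduction(baseline, 80):.0f} min")
```

With these made-up numbers, an average recovery time of 150 minutes would fall to somewhere between 75 and 30 minutes.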

A better developer experience is another upside. Freed from unnecessary toil, engineers can focus on what they do best — creating innovative software that adds value for customers. And just as important, they can relax and enjoy their Thanksgiving.


Thank you for reading! For more insights from my experience as a serial entrepreneur and how we can harness the power of software to change the world, subscribe to Entrepreneurship and Leadership.

