AI, Toil, and the SRE Feedback Loops We Can’t Afford to Break

There’s a lot of energy right now around AI in incident management.

Automating toil.
Improving signal-to-noise.
Self-healing systems.
Agents that detect deviations, mitigate issues, and even resolve incidents before humans wake up.

And honestly, I’m excited about it.

There are real opportunities here to improve detection, triage, operational efficiency, and recovery speed. AI has the potential to meaningfully elevate how we run distributed platforms at scale.

It’s also entirely possible that AI will transform the SDLC so profoundly that many of today’s assumptions will evolve.

But in the world we still operate in today, there are a few important principles we need to keep top of mind as we adopt these capabilities.


Feedback Loops Are How Systems and Engineers Learn

If you go back to the DevOps movement, especially The Phoenix Project and the Three Ways, the Second Way emphasizes fast, tight feedback loops.

Engineers need to see the consequences of the systems they design.

When an engineer pushes code:

  • Strong CI/CD guardrails should catch issues early

  • If something slips through, telemetry should make it visible

  • If impact occurs, rollback mechanisms should engage

  • And if needed, the team should feel the operational pain
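That loop can be sketched as a single gate in a deploy pipeline. The function name, threshold, and return values below are illustrative, not a real pipeline API, but the shape captures the principle: telemetry drives an automatic decision, and the owning team sees the outcome either way.

```python
# Hypothetical post-deploy guardrail. The threshold and names are
# illustrative; a real pipeline would query a metrics backend.
ERROR_RATE_THRESHOLD = 0.05  # e.g. roll back if >5% of requests fail

def post_deploy_decision(error_rate: float,
                         threshold: float = ERROR_RATE_THRESHOLD) -> str:
    """Gate a deploy on telemetry: promote when healthy, roll back when not.

    Either way, the result should be surfaced to the owning team,
    so the feedback loop stays intact.
    """
    if error_rate > threshold:
        return "rollback"   # mitigation engages, and the team is notified
    return "promote"        # guardrails passed
```

The important design choice is the last comment: automation decides, but it never decides silently.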

That discomfort is not a failure of the system.

It is the system learning.

It’s how teams build better observability.
It’s how they design safer migrations.
It’s how they improve rollback logic.
It’s how they clarify ownership and reduce ambiguity.

And those learnings are not trivial.

They are what build seniority.

They are what shape a seasoned Principal SRE. Someone who understands failure modes deeply, recognizes patterns under pressure, and knows where a distributed system bends versus where it breaks.

Experience is forged in feedback loops.

As we introduce AI agents into incident response, the goal should not be to remove engineers from those loops entirely. If engineers begin to rely on “AI will catch it, predict it, fix it,” we risk unintentionally weakening the very muscle that makes systems resilient over the long term.

Yes, reports can be generated.
Yes, metrics can look stable.

But stability without learning is fragile.


The Hidden Value of Small Incidents

Some smaller incidents can feel noisy and seem trivial in hindsight.

And yet, those incidents matter.

Smaller, contained failures are signals. They expose missing telemetry, fragile dependencies, unclear ownership, weak runbooks, or overly complex designs. They introduce the right friction when systems are not operating as expected.

They are opportunities to improve maintainability, one of the defining traits of a high-quality system.

When teams respond to these incidents directly, they:

  • refine alerts

  • simplify dependencies

  • reduce ambiguity

  • harden automation

  • make the system easier to operate

Those moments build operational intuition.
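Take "refine alerts" as one concrete instance. A common refinement after a noisy incident is to page only on a sustained breach rather than a single spike. The sketch below is hypothetical (the function and parameters are illustrative), but it shows the kind of small, learned adjustment these incidents produce:

```python
def should_page(samples: list[float], threshold: float, sustained: int) -> bool:
    """Page only if the signal breaches the threshold for `sustained`
    consecutive samples -- a common way to cut flappy-alert noise
    without discarding the signal entirely."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= sustained:
            return True
    return False
```

A single spike no longer pages anyone; three bad samples in a row still do. That tuning only happens when a human felt the noise.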

If AI handles every smaller incident end-to-end, we may reduce noise, but we also eliminate the learning those signals provide.


Eventually, There Will Be That Incident

At some point, every system encounters the incident that is not obvious.

The race condition.
The subtle distributed systems failure.
The unforeseen interaction between components.
The deep concurrency bug.
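For concreteness, here is the classic shape of such a bug: a check-then-act race that passes every test on a quiet system and only surfaces under real concurrency. This sketch is illustrative, not drawn from any particular system.

```python
import threading

def run_counter(n_threads: int, n_incr: int, use_lock: bool) -> int:
    """Increment a shared counter from several threads, with or
    without synchronization."""
    counter = 0
    lock = threading.Lock()

    def work():
        nonlocal counter
        for _ in range(n_incr):
            if use_lock:
                with lock:
                    counter += 1
            else:
                # Unsynchronized read-modify-write: two threads can load
                # the same value, and one update is silently lost.
                tmp = counter
                counter = tmp + 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock, the result is always `n_threads * n_incr`. Without it, the count may come up short, and it often won't on a lightly loaded machine, which is exactly why these bugs evade detection for so long.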

These are rare. They require judgment, pattern recognition, context, and experience. They are not always easily modeled.

If human capability has atrophied because agents handled most of the operational thinking, the cost of that rare incident can be enormous.

It doesn’t matter how stable your platform was for 12 months.

An eight-hour outage resets the narrative.


Complexity Is Always Compounding

There is also the principle of simplicity.

Distributed systems are inherently complicated. Great engineers constantly fight that through readability, maintainability, and thoughtful design.

When we introduce AI agents into operational workflows, we are adding another layer of abstraction, often non-deterministic by nature, into already complicated distributed systems.

So we have to ask:

  • Are we reducing toil, or relocating it?

  • Are we strengthening reliability, or adding invisible complexity?

  • Are we tightening feedback loops, or weakening them?


The Goal Is Not Resistance. It Is Intentional Adoption.

The objective isn’t to slow down AI adoption.

It is to be intentional.

As we build AI-driven incident management and automation, we must ensure we preserve the feedback loops that develop engineering maturity. We need to balance what we delegate to AI with an understanding of the long-term consequences. This includes skill development, maintainability, simplicity, and systemic resilience.


Brilliant at the Basics

My leader often reminds us of a core expectation:

Be brilliant at the basics.

Being excellent at the fundamentals, understanding failure modes, designing for operability, responding under pressure, and maintaining clarity in distributed systems: these must remain dependable human capabilities, not skills we quietly delegate away.

If we use AI to amplify those fundamentals, we win.

If we allow it to quietly replace them, we may gain short-term efficiency while weakening long-term resilience.

That balance is the leadership challenge in front of us.

