Reliability engineers are the quiet force that keeps modern software running. After decades of refining practices on deterministic systems, many teams are chasing yet another “nine” beyond 99% uptime. But the AI era, particularly LLM-backed features, changes the game. Outputs are non-deterministic, data pipelines shift underfoot, and key components behave like black boxes. As a result, many of the tools and rituals SREs have mastered over decades no longer map cleanly to production AI.
At SREcon EMEA 2025, I co-organized the MLOps conversation track with Cauchy co-founder Maria Vechtomova. We brought leading voices together for a conversation with the audience, discussing how reliability practitioners can navigate this AI landscape. Here are the key takeaways.
SREs Face a New Paradigm
At SREcon Americas 2025, Microsoft Corporate VP Brendan Burns said Azure vets new models in two ways: first, the LLM-as-a-judge approach, where LLMs judge their own outputs; second, and more surprisingly, with Microsoft employees providing “thumbs-up/thumbs-down” feedback. The audience laughed and then kept discussing it throughout the conference. For reliability engineers used to measurable SLOs and objective metrics, this sounded uncomfortably squishy. It was perhaps a pivotal moment that signaled to the industry that changes were on the way. As Stanza CEO Niall Murphy puts it, “SREs are going to have to wrestle with this stochasticism for a while to come.”
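As an illustration only (not Azure's actual implementation), here is a minimal sketch of the LLM-as-a-judge pattern. The rubric, scale, and `call_llm` client are hypothetical stand-ins for whatever your stack provides.

```python
# Minimal sketch of the LLM-as-a-judge pattern. `call_llm` is a hypothetical
# stand-in for your model client; the rubric and 1-5 scale are illustrative.
JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer from 1 (unusable) to 5 (excellent) for factual accuracy and helpfulness.
Reply with a single integer."""

def judge_answer(question: str, answer: str, call_llm) -> int:
    """Ask a separate judge model to score the answer; return 0 if unparsable."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        return int(raw.strip())
    except ValueError:
        return 0  # an unparsable judgment is a signal to investigate, not a score

# A score trend (e.g., mean judge score per hour) can then feed dashboards and
# alerts, alongside the human thumbs-up/thumbs-down feedback described above.
```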
For most traditional software, running the same code on the same infrastructure yields the same result. With machine learning workloads, that’s not guaranteed. As Vechtomova explained, “the statistical properties of the data can change, and your model stops performing. That’s what happened during COVID: forecasting and recommender systems collapsed because we had never seen that kind of data before.”
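A minimal sketch of the kind of drift check this implies, assuming a single numeric feature and SciPy available; the data, names, and significance threshold are illustrative, not a prescribed method.

```python
# Minimal sketch of a statistical drift check (illustrative names and thresholds).
# Compares a feature's training-time distribution with what production sees now.
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov test: reject 'same distribution' at level alpha."""
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha

# Example: values seen during training vs. the last hour of traffic (synthetic data).
reference = np.random.lognormal(mean=3.0, sigma=0.5, size=10_000)  # stand-in for training data
live = np.random.lognormal(mean=3.4, sigma=0.7, size=2_000)        # stand-in for recent traffic
if feature_drifted(reference, live):
    print("Alert: input distribution has shifted; model performance may degrade.")
```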
And while AI has been around for a while in different shapes, we are entering a new era. As Zalando’s head of AI, Alejandro Saucedo, observed, “GenAI/LLMs are shifting the paradigm from training toward inference.” Training used to be the center of gravity; models weren’t good enough for most applications, and ML engineers focused on fixing that. With LLMs now delivering almost magical results, the hard problems have moved to serving time: inference. SREs are entering the scene, asked to go from zero to production-grade quickly, often without mature tools or established playbooks.
Reliability practitioners are used to deterministic systems, where, for example, status codes (2xx/5xx) could serve as rough health proxies. Because LLM outputs are non-deterministic, there is often no straightforward way to know whether an AI-generated answer is any good.
Monitoring Must Evolve
If your LLM app generates news summaries, how do you know today’s output is as good as yesterday’s? There’s no single, obvious signal. So what should you track to catch quality drift? Meta senior staff production engineer Jay Lees argues for anchoring on business metrics. For ads, that might be the click-through rate (CTR): if the CTR rises, your AI is likely improving the experience; if it falls, something has regressed.
LLMs push SRE accuracy metrics up the stack. The only reliable arbiter of “correct” is the business outcome: did the assistant resolve the case, did the user convert, did revenue per session hold? That means service owners must define outcome-level SLIs and SLOs. But outcomes can lag, and it is best practice to pair them with classical indicators. Together, this stack gives both truth with business impact and speed with early drift signals.
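To make the pairing concrete, here is a minimal sketch that evaluates an outcome-level SLI (CTR against a baseline) next to classical indicators (error rate, p95 latency). The field names, thresholds, and numbers are illustrative assumptions, not recommended values.

```python
# Minimal sketch of pairing an outcome-level SLI (CTR vs. baseline) with classical
# indicators (error rate, p95 latency). Names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class WindowStats:
    clicks: int
    impressions: int
    errors: int
    requests: int
    p95_latency_ms: float

def evaluate_slis(window: WindowStats, baseline_ctr: float) -> dict:
    """Return SLI values plus simple breach flags for one measurement window."""
    ctr = window.clicks / max(window.impressions, 1)
    error_rate = window.errors / max(window.requests, 1)
    return {
        "ctr": ctr,
        "ctr_regressed": ctr < 0.9 * baseline_ctr,     # outcome signal: lags, but is the ground truth
        "error_rate_breach": error_rate > 0.01,        # classical, fast-moving signal
        "latency_breach": window.p95_latency_ms > 800, # classical, fast-moving signal
    }

# Example: an hourly window compared against last week's baseline CTR.
print(evaluate_slis(WindowStats(clicks=420, impressions=25_000, errors=37,
                                requests=26_000, p95_latency_ms=910),
                    baseline_ctr=0.021))
```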
That paints a clear picture: AI makes observability non-optional. But, as Honeycomb CTO Charity Majors puts it, “most companies don’t even have high-quality observability for their non-AI workloads.” So either we’re in for a long ramp to proper AI observability, or AI becomes the catalyst that drags observability forward. And for companies trying to do it right, a recent survey found that monitoring and observability were the biggest challenges when productionising machine learning models, with only 50% of companies having any kind of model monitoring in place.
No One Has It Figured Out
Even if we instrument aggressively, there are limits to what’s practical today. Anthropic’s Head of Reliability, Todd Underwood, put it bluntly: “in theory, you could track and version everything: data, prompts, embeddings, models, retrieval indices, and policies to explain deviation. In practice, that level of end-to-end provenance is heavy and unrealistic for most companies.”
That gap between the ideal and the practical exists for a reason: the ground keeps moving, quickly. Underwood and Murphy, co-authors of Reliable Machine Learning: Applying SRE Principles to ML in Production, added that a challenge in writing the book was keeping ahead of the pace of change; they aimed to propose practices that wouldn’t be obsolete by the time of publication.
After ninety minutes of conversation with the panel and audience, one theme stood out: no one has it fully figured out. Many engineering teams feel they are behind on AI, but the truth is that we are all flying a plane that is still being built. Some organizations are ahead, but few have mature processes, tooling, and playbooks for operating these non-deterministic systems at scale.
At this point, MLOps has more open problems than settled answers, nothing new for tech, but at a scale we haven’t seen in a while. As Andrej Karpathy has noted, getting agentic applications “right” may take a decade. Many LLM demos have hit the first nine, working about 90% of the time, but there are many more nines to conquer before we reach production-grade reliability.