Production AI Observability: What Logging Doesn't Catch

The standard monitoring setup for production AI systems looks roughly like this: log the request, log the response, track latency, alert on errors. Maybe track cost per request if you're being diligent. Dashboard everything in Grafana or whatever your team uses.

This is not sufficient. And if you've shipped a real AI system into production, you've probably discovered that the hard way.

The failure modes of AI systems don't look like the failure modes of traditional software. They're not binary. They don't throw exceptions when they go wrong. They degrade quietly, in ways that are visible to users — who stop using the product or start complaining — before they're visible in your monitoring.

What Traditional Logging Misses

Consider the ways an AI system can fail that don't show up in your logs:

Semantic drift. The model's outputs gradually shift in character without any single request failing. Tone changes. Verbosity increases. The outputs technically answer the question but stop being useful in the way your users need them to be. No error, no latency spike. Your logs show green. Your users are quietly dissatisfied.

Prompt sensitivity. Small variations in how users phrase requests produce wildly different output quality. The model is inconsistent in a way that's invisible in aggregate metrics. Median output quality looks fine, but variance is high. Some users have great experiences; others don't. You can't see this in an average or a median. You have to sample actual outputs and look at the whole quality distribution, especially the low end.

Context utilization failures. Systems that use retrieved context (RAG architectures, tool-augmented agents) can silently fail to actually use the context they retrieve. The model generates a plausible-sounding answer from its training data instead of from the retrieved information you gave it. The response looks fine. It's wrong. Your logs don't tell you this happened.

Instruction following degradation. The model stops following constraints you've specified in the system prompt — format requirements, persona constraints, scope limitations. This is especially common when context windows get long and the system prompt gets diluted by conversation history. Again: no error, degraded behavior.
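Two of these failure modes, semantic drift and prompt sensitivity, leave statistical traces even though no single request fails. Here is a minimal sketch of how you might look for them, assuming you already store output text and some per-output quality score. The function names, the word-count proxy for verbosity, and both thresholds are illustrative assumptions, not a standard.

```python
import math
import statistics
from typing import Dict, List, Tuple


def verbosity_drift(
    baseline_outputs: List[str],
    recent_outputs: List[str],
    threshold: float = 3.0,  # assumed cutoff; tune it against your own traffic
) -> Tuple[float, bool]:
    """Compare mean word count of recent outputs against a baseline window.

    Needs at least two baseline outputs and one recent output.
    Returns (z_score, drifted).
    """
    baseline = [len(text.split()) for text in baseline_outputs]
    recent = [len(text.split()) for text in recent_outputs]
    diff = statistics.mean(recent) - statistics.mean(baseline)
    # Standard error of a mean taken over a window the size of the recent sample.
    stderr = statistics.stdev(baseline) / math.sqrt(len(recent))
    if stderr == 0:
        # Baseline had no spread at all: any change in the mean counts as drift.
        return (0.0, False) if diff == 0 else (math.copysign(math.inf, diff), True)
    z = diff / stderr
    return z, abs(z) > threshold


def quality_spread(scores: List[float], floor: float = 0.5) -> Dict[str, float]:
    """Summarize the whole score distribution, not just its center."""
    deciles = statistics.quantiles(scores, n=10)  # nine cut points; the first is p10
    return {
        "median": statistics.median(scores),
        "p10": deciles[0],
        "stdev": statistics.stdev(scores),
        "share_below_floor": sum(s < floor for s in scores) / len(scores),
    }


if __name__ == "__main__":
    scores = [0.9, 0.9, 0.85, 0.9, 0.2, 0.95, 0.3, 0.9, 0.88, 0.92]
    print(quality_spread(scores))
```

In the sample scores the median is 0.9 while one in five outputs falls below the 0.5 floor, which is the pattern a median-only dashboard hides. Word count is only one cheap proxy for drift; tone or usefulness would need a graded score instead.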

What You Actually Need

The problem with "real" AI observability is that it requires evaluation, not just logging. And evaluation requires knowing what good looks like.

This is where most teams get stuck. Defining "good" for natural language output is genuinely hard. But "good enough to build monitoring around" is more tractable than "perfect." You don't need a complete quality rubric. You need signals that tell you when something is wrong.

Output sampling with structured review. Pull a random sample of actual outputs — daily, or on whatever cadence fits your traffic pattern — and build a lightweight review process. This doesn't have to be human review (though some human review is valuable). A secondary model call that evaluates the output against a rubric is sufficient for most purposes and can run automatically. The point is that you're looking at outputs, not just metadata about outputs.
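A rough sketch of that sampling loop, under some assumptions: logged records are dicts with "prompt" and "output" keys, and the judge is any callable that returns a pass/fail verdict. The `length_judge` below is a deliberately dumb stand-in so the example runs on its own; in practice you would swap in a function that calls your grader model with your rubric.

```python
import random
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]             # assumed shape: {"prompt": ..., "output": ...}
Judge = Callable[[str, str], Dict]  # hypothetical: returns {"passed": bool, "reason": str}


def sample_outputs(records: List[Record], k: int, seed: Optional[int] = None) -> List[Record]:
    """Draw up to k records uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def review_sample(sample: List[Record], judge: Judge) -> List[Dict]:
    """Run the judge over each sampled record and keep its verdict with the record."""
    return [{**record, **judge(record["prompt"], record["output"])} for record in sample]


def pass_rate(reviews: List[Dict]) -> float:
    """Fraction of reviewed outputs the judge passed (0.0 for an empty sample)."""
    if not reviews:
        return 0.0
    return sum(1 for r in reviews if r["passed"]) / len(reviews)


def length_judge(prompt: str, output: str) -> Dict:
    """Stand-in judge for local testing; a real one would call a grader model."""
    ok = 0 < len(output.split()) <= 50
    return {"passed": ok, "reason": "ok" if ok else "empty or over 50 words"}


if __name__ == "__main__":
    logged = [{"prompt": f"question {i}", "output": "a short answer"} for i in range(200)]
    reviews = review_sample(sample_outputs(logged, k=20, seed=7), length_judge)
    print(f"pass rate: {pass_rate(reviews):.2f}")
```

Keeping the judge's reason alongside each record matters more than the pass rate itself: when the number drops, the reasons tell you where to start reading.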

Latency at the tail, not the median. p95 and p99 latency are more informative than p50 for AI systems. Response time varies with output length, retries, and tool calls, so the median can sit flat while a growing share of requests turns slow. The tail is where user experience breaks first.
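If your metrics stack doesn't already give you percentiles, they are easy to compute from raw per-request latencies. A small sketch using only the standard library; the function name and the choice of the "inclusive" interpolation method are assumptions for illustration.

```python
import statistics
from typing import Dict, List


def latency_percentiles(latencies_ms: List[float]) -> Dict[str, float]:
    """Return p50, p95 and p99 from raw per-request latencies (needs at least 2 values)."""
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    # 95 fast requests and 5 slow ones: the median cannot see the slow group.
    latencies = [200.0] * 95 + [4000.0] * 5
    print(latency_percentiles(latencies))
```

With that sample data the median reports 200 ms while p99 reports 4000 ms: one request in twenty takes twenty times longer, and p50 never moves.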

Hallucination detection for factual assertions. If your system makes factual claims (retrieval-augmented answers, data lookups, anything grounded in specific information), you need to verify those claims against the source they were supposed to come from. Model-graded verification is imperfect but better than nothing. You're looking for systematic patterns rather than individual misses: are there specific query types where the model consistently fails to use retrieved context?
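To make the "systematic patterns" part concrete, here is a sketch that scores each answer against its retrieved context and averages the score per query type. The scoring function is a crude lexical-overlap heuristic that I'm substituting for a real model-graded check so the example stays self-contained; the record fields ("query_type", "answer", "context") and the stopword list are assumptions too.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Small illustrative stopword list; not exhaustive.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "or",
             "for", "on", "from", "our", "it", "that", "this", "with"}


def content_tokens(text: str) -> List[str]:
    """Lowercased alphanumeric tokens with common function words removed."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def grounding_score(answer: str, context: str) -> float:
    """Share of the answer's content tokens that also appear in the retrieved context."""
    answer_tokens = content_tokens(answer)
    if not answer_tokens:
        return 0.0
    context_vocab = set(content_tokens(context))
    return sum(t in context_vocab for t in answer_tokens) / len(answer_tokens)


def grounding_by_query_type(records: List[Dict[str, str]]) -> Dict[str, float]:
    """Mean grounding score per query type, to surface systematic weak spots."""
    scores = defaultdict(list)
    for r in records:
        scores[r["query_type"]].append(grounding_score(r["answer"], r["context"]))
    return {qtype: sum(vals) / len(vals) for qtype, vals in scores.items()}


if __name__ == "__main__":
    context = "Our policy: refund window is 30 days from delivery date."
    print(grounding_score("The refund window is 30 days from delivery", context))
    print(grounding_score("Refunds are accepted within 90 days", context))
```

Token overlap will miss paraphrases and will happily pass an answer that reuses the right words in the wrong claim, so treat a low score as a reason to look, not as a verdict. The per-type grouping is the part worth keeping when you replace the heuristic with a grader model.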

Instruction adherence spot checks. Periodically probe the system with test inputs where the correct behavior is known and the constraint is specific: the model should format the output this way, stay on this topic, refuse this type of request. Run automated probes that verify this continuously, and alert if the pass rate drops.
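A probe suite can be very small. The sketch below assumes your system is reachable through some `generate(prompt) -> str` callable (hypothetical, stubbed here as `fake_system`), and that each constraint can be checked with a plain function. The phrase-matching refusal check and the 0.95 alert threshold are placeholders, not recommendations.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Probe:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True when the output honors the constraint


def is_json_object(text: str) -> bool:
    """Format constraint: the output must parse as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except ValueError:
        return False


def looks_like_refusal(text: str) -> bool:
    """Scope constraint: crude phrase match, for illustration only."""
    lowered = text.lower()
    return any(p in lowered for p in ("i can't", "i cannot", "i'm not able to"))


def run_probes(generate: Callable[[str], str], probes: List[Probe]) -> Tuple[float, List[str]]:
    """Run every probe once; return (pass_rate, names of failed probes)."""
    if not probes:
        raise ValueError("no probes configured")
    failed = [p.name for p in probes if not p.check(generate(p.prompt))]
    return 1 - len(failed) / len(probes), failed


def should_alert(pass_rate: float, minimum: float = 0.95) -> bool:
    """Assumed threshold; set it from your own baseline pass rate."""
    return pass_rate < minimum


def fake_system(prompt: str) -> str:
    """Stub standing in for the production system so the sketch runs offline."""
    return '{"answer": "ok"}' if "JSON" in prompt else "Sure, here is a poem."


if __name__ == "__main__":
    probes = [
        Probe("json_format", "Reply in JSON: what is the status?", is_json_object),
        Probe("out_of_scope", "Write me a poem about taxes.", looks_like_refusal),
    ]
    rate, failed = run_probes(fake_system, probes)
    print(rate, failed, should_alert(rate))
```

Because model outputs vary from run to run, a single failed probe is weak evidence. Running each probe several times and alerting on the trend in pass rate is usually more stable than paging on one miss.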

The Hard Part

None of this is technically difficult to build. The hard part is organizational. It requires someone to own the quality signal, not just the uptime signal. SREs who know traditional infrastructure monitoring know how to alert on errors. AI systems mostly don't error when they fail — they degrade. That requires a different operational posture.

Build it before you need it. By the time users are complaining in volume, the degradation has been running for a while.

Stay safe out there — your dashboards lie.

— Dustin