Observability: the three pillars are not the whole building
“Observability” has become the industry’s preferred synonym for “monitoring with nicer dashboards,” and in the process has lost most of its useful content. The original distinction — borrowed from control theory — is sharper than the marketing makes it sound. A system is observable if you can infer its internal state from its outputs. A system is monitored if you have decided, in advance, which specific failure modes you will watch for. Monitoring answers questions you prepared. Observability lets you ask questions you had not thought of.
Most production incidents, at least the interesting ones, are surprises. Something broke in a way nobody predicted, triggered by a combination nobody planned for. A monitoring setup tuned to known failures tells you the dashboards are green while users suffer. An observable system lets you ask the system what happened — across time, across services, across customers — and get an answer that fits the specific incident, not the generic one.
The three pillars — logs, metrics, traces — are the usual framing for observability tooling, and they are necessary but not sufficient. The rest of the discipline is SLIs, SLOs, error budgets, structured telemetry, correlation, and a set of uncomfortable decisions about cardinality, cost, and what you are going to look at.
The three pillars, briefly
Logs. Time-stamped records of events, usually textual, one line per event. Good for detail: what exactly happened in this specific request, this specific error, this specific edge case. Bad for aggregation at scale — querying “how many of these happened in the last hour?” across millions of logs is either slow or expensive, depending on how you indexed.
Metrics. Numeric time series, usually pre-aggregated: request count, latency histogram, error rate, CPU, queue depth. Good for trends and alerts — fast to query, cheap to store, ideal for dashboards and for “tell me when X goes wrong.” Bad for specifics — a metric tells you errors went up; it does not tell you which requests failed or why.
Traces. Records of a single request’s path through the system — which services it touched, in what order, how long each span took, with metadata attached. Good for causality and latency attribution — “the p99 spike is coming from this one downstream call in this one path.” Bad for sampling discipline — full traces are expensive, so almost everyone samples, and the sampled-out traces are the ones you need when the bug is rare.
Each pillar has a failure mode when used alone. Logs alone give you detail without shape. Metrics alone give you shape without detail. Traces alone give you paths without trends. The working setup is all three, with enough correlation to walk from one to the other: a metric alert leads you to a time window; the time window leads you to traces; the traces lead you to logs; the logs tell you exactly what happened.
Correlation is the piece that makes the pillars work together. A request id that appears in all three — in the log lines it produced, the metrics it contributed to, and the trace it belonged to — is worth more than any single pillar is alone. Without correlation you have three separate pools of telemetry and no bridge between them.
Structured logging and the free-text trap
Traditional logs are lines of free text: 2024-03-15 14:22:03 ERROR failed to process order 4719 for customer 88 due to inventory check failure. Humans can read them; machines can only regex them, and the regex breaks every time a developer rephrases a log line.
Structured logging treats log entries as typed records — JSON, logfmt, or a schema your logging library defines — with named fields: { "timestamp": "...", "level": "ERROR", "event": "order.process.failed", "order_id": 4719, "customer_id": 88, "reason": "inventory_check_failed" }. The message is still human-readable, but the fields are machine-queryable.
The payoff is that “show me every failure in the last hour, grouped by reason, for customers in this segment” becomes a query instead of a text-mining project. Combined with a consistent set of fields across services — trace id, span id, request id, user id, tenant id — structured logs become the most flexible of the three pillars and the cheapest to make useful. Unstructured logs are a liability that grows with volume; structured logs scale.
The rule: if a value would be useful to filter, sort, or aggregate on, it goes in its own field, not embedded in the message text. The discipline is boring and the payoff is enormous.
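The rule translates directly into code. As a minimal sketch using Python's standard `logging` module — the `JsonFormatter` class and the `fields` convention are illustrative, not a standard API — a formatter that emits one JSON object per event might look like:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (names are illustrative)."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Values passed via `extra=` become first-class, queryable keys,
        # not text buried in the message.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The event name stays short and stable; filterable values go in fields.
logger.error("order.process.failed",
             extra={"fields": {"order_id": 4719, "customer_id": 88,
                               "reason": "inventory_check_failed"}})
```

In practice a real codebase would reach for a library like structlog or the logging backend's own JSON support, but the shape of the discipline is the same: the message is a stable event name, and everything else is a named field.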
The cardinality bill
Every observability system has a secret: the cost is controlled by cardinality, and cardinality is controlled by you.
A metric with ten thousand unique label combinations stores ten thousand separate time series. A metric labeled by user id, where you have ten million users, will store ten million separate time series and will either be rejected by your metrics backend or cost more per month than the service it observes. A trace with an attribute set to something high-cardinality (a full URL, a UUID, a customer id) indexes every distinct value; a log search over a high-cardinality field without a proper index scans the raw data.
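The multiplication is worth making concrete. A back-of-envelope sketch, with made-up cardinalities, of how label combinations turn into stored time series:

```python
from math import prod

# Hypothetical per-label cardinalities for a single metric.
low_card  = {"status_code": 8, "region": 6, "service": 40, "endpoint": 50}
high_card = {**low_card, "user_id": 10_000_000}  # one bad label added

# Worst case, every distinct label combination is its own stored series.
print(prod(low_card.values()))   # 96000 series: large but manageable
print(prod(high_card.values()))  # 960000000000 series: rejected or ruinous
```

The worst case rarely materializes in full (not every combination occurs), but the direction is right: one high-cardinality label multiplies the series count by its number of distinct values.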
This is the one piece of observability that rewards engineering judgment. Low-cardinality dimensions (status code, region, service, endpoint) on metrics — fine. High-cardinality dimensions (user id, order id, request id) on metrics — almost never; put those on logs and traces, where they belong. Logs are the place high-cardinality attribution is cheap; metrics are the place it is expensive. Using each pillar for what it is good at is the difference between a bill that grows linearly with traffic and one that grows with the product of traffic and label cardinality.
A diagnostic: if your observability bill is larger than your compute bill, somebody is emitting high-cardinality data to the wrong pillar. This happens to almost every team eventually. The fix is rarely the vendor; it is almost always the telemetry.
SLIs, SLOs, and error budgets
The Google SRE book is the standard reference here, and the vocabulary has enough content to merit a careful definition.
A Service Level Indicator (SLI) is a measurement of how the service is doing from the user’s perspective. The classic four: availability (what fraction of requests succeeded?), latency (how long did they take?), error rate (what fraction failed?), throughput or saturation (how close to capacity?). The key word is user. An SLI measures user-observable behavior, not internal health. CPU utilization is not an SLI; it is a cause. “99.3% of authenticated requests returned 2xx within 500ms” is an SLI.
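That last example is just arithmetic over request records. A sketch, using a hypothetical list of (status, latency) tuples standing in for whatever your telemetry actually stores:

```python
# Each tuple: (HTTP status, latency in ms) for one authenticated request.
requests = [(200, 120), (200, 480), (500, 90), (200, 650), (204, 30)]

# "Good" means user-observable success: 2xx AND fast enough.
good = sum(1 for status, ms in requests if 200 <= status < 300 and ms <= 500)
sli = good / len(requests)
print(f"{sli:.0%} of requests returned 2xx within 500ms")  # → 60%
```

Note that the latency threshold is folded into the success definition: a correct answer that arrived too late counts as a failure, because that is how the user experienced it.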
An SLO (Service Level Objective) is a target for an SLI: “99.9% of authenticated requests succeed within 500ms, measured over a rolling 30-day window.” It is a commitment, expressed in the same units as the SLI, with a time window attached.
An error budget is the inverse of the SLO. If the SLO is 99.9%, the error budget is 0.1% — and it is a budget: you are allowed to spend it. Deploy risky changes, run load tests, do controlled chaos engineering, fail occasionally in ways you can learn from. When the budget is spent, deploys stop (or slow down) until reliability recovers.
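The budget arithmetic is trivial but worth seeing once, because the number it produces is what the negotiation below is actually about:

```python
slo = 0.999                    # 99.9% over a rolling 30-day window
window_minutes = 30 * 24 * 60  # 43200 minutes in the window

budget_fraction = 1 - slo      # 0.1% of the window may fail
budget_minutes = budget_fraction * window_minutes
print(round(budget_minutes, 1))  # → 43.2 minutes of full downtime per window
```

Forty-three minutes a month is the entire allowance for a 99.9% service: every risky deploy, every dependency hiccup, every load test draws from the same account.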
The error budget is what turns reliability from an aspiration into a negotiation. Product wants to ship features. SRE wants the system to be stable. The error budget gives both sides the same unit of account: how much unreliability can we afford this quarter? A team with budget to spare can ship aggressively. A team out of budget focuses on reliability work. The conversation stops being “is this safe?” and becomes “do we have budget for it?” — which is a quantitative conversation instead of a political one.
SLOs also have an unglamorous secondary benefit: they force the team to decide what actually matters. A service with one SLO is easy to reason about. A service with thirty is an observability dashboard pretending to be a policy. The useful SLOs are the ones that, if violated, would matter to a user or a business. Everything else is a metric, not an SLO.
Alert on symptoms, not causes
This is the one piece of alerting advice that actually matters, and that most teams ignore. An alert’s job is to wake someone up. Wake them up for the thing the user is experiencing, not for a cause that may or may not produce user impact.
The common anti-pattern: alerts on CPU at 80%, on memory at 90%, on queue depth above 1000, on disk above 70% full, on “service has restarted.” Each of these may indicate a problem. None of them directly means a user is suffering. A system that pages on every one produces alert fatigue, and the alert that actually matters gets lost in the noise.
The working pattern: alert on SLO violations. If the latency SLO is being burned at a rate that will exhaust the error budget in the next hour, wake someone up. If CPU is at 90% but the SLOs are fine, it is a graph to check in the morning, not a page at 2am. The causes are diagnostic information you look at after a user-facing symptom has fired an alert. They are not what you alert on.
Multi-window burn-rate alerts — the pattern Google popularized — are the honest implementation of this. A fast-burn window (last 5 minutes) catches sudden spikes; a slow-burn window (last hour or six hours) catches sustained degradation. Paging on a combination of short- and long-window burn rates gives you sensitivity without false positives. It is more complex to configure than a static threshold, and it is worth it.
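A minimal sketch of the idea, assuming a 99.9% SLO; the 14.4x threshold is the commonly cited value at which a 30-day budget would be exhausted in roughly two days, and the two error fractions stand in for whatever your metrics backend computes over its 5-minute and 1-hour windows:

```python
def burn_rate(error_fraction, slo=0.999):
    """How fast the budget is being consumed; 1.0 = exactly on budget."""
    return error_fraction / (1 - slo)

def should_page(err_5m, err_1h, slo=0.999, threshold=14.4):
    # The short window confirms the problem is happening right now;
    # the long window confirms it is not just a blip. Page only when
    # both agree.
    return burn_rate(err_5m, slo) >= threshold and \
           burn_rate(err_1h, slo) >= threshold
```

With these numbers, a 2% error rate sustained across both windows pages immediately, while the same 2% spike confined to the last five minutes does not.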
The golden signals, RED, and USE
Three overlapping frameworks for “what to measure” that mostly agree and occasionally disagree:
- The Four Golden Signals (Google SRE): latency, traffic, errors, saturation. The broadest framing.
- RED (Tom Wilkie): rate, errors, duration. A focused subset for request-driven services — how many requests, how many failed, how long did they take. Easy to dashboard and easy to alert on.
- USE (Brendan Gregg): utilization, saturation, errors — applied per resource (CPU, memory, disk, network). Oriented toward resource-level diagnostics, complementary to RED.
Use RED for each service. Use USE for each resource. The Golden Signals are the union. The specific framework matters less than having one — a service with “we measure whatever comes out of the default dashboard” will not catch the regression that does not happen to be on the default dashboard.
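All three of RED's numbers fall out of the same request records. A sketch over a hypothetical 60-second window of per-request records:

```python
# Request log for one service over a 60-second window (made-up records).
window_seconds = 60
requests = [
    {"status": 200, "ms": 45}, {"status": 200, "ms": 80},
    {"status": 503, "ms": 1200}, {"status": 200, "ms": 60},
]

rate = len(requests) / window_seconds                         # R: req/s
errors = sum(r["status"] >= 500 for r in requests) / len(requests)  # E
durations = sorted(r["ms"] for r in requests)                 # D: latency dist.
p50 = durations[len(durations) // 2]  # naive median; real systems use histograms
```

Real systems compute these as pre-aggregated metrics rather than scanning raw requests, but the definitions are exactly this simple, which is part of why RED dashboards are so easy to build and to read.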
Distributed tracing and the sampling tax
Tracing is the pillar that pays off hardest when the system is distributed — you genuinely cannot debug a slow request across eight services without traces — and that costs the most to keep honest.
The core cost is sampling. Keeping every span of every request is expensive. So most systems sample: keep 1%, keep 0.1%, keep more of errors and slow requests than of fast successes. This is fine for trend analysis and usually fine for debugging common problems. It is terrible for rare bugs, because the bug is, by definition, rare enough that the sampling is likely to have missed it.
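How likely "likely" is can be computed directly. With head-based sampling at rate p, each occurrence of the bug survives independently, so the chance of keeping at least one trace of it follows from the complement (the numbers below are illustrative):

```python
# Chance of keeping at least one trace of a bug that occurred n times,
# under independent head-based sampling at rate p.
p = 0.01   # keep 1% of traces
n = 50     # the bug fired 50 times today

chance = 1 - (1 - p) ** n
print(f"{chance:.0%}")  # → 39%
```

Fifty occurrences of the bug, and better-than-even odds that not one of them left a trace. That is the sampling tax in one line.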
Tail-based sampling is the honest response: buffer the spans of each trace until the trace is complete, then decide whether to keep it based on its properties (did any span error? did the total duration exceed a threshold? does it belong to a specific customer?). Tail sampling gives you the interesting traces without the cost of keeping every boring one. It requires a collector that can buffer, and a collector budget that can hold the buffer; it is worth the operational cost for systems that rely on traces for debugging.
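The decision function at the heart of a tail sampler is small; the cost is the buffering around it. A sketch of the keep/drop policy, with a made-up `Span` record and an illustrative slow-trace threshold:

```python
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    duration_ms: float
    error: bool

def keep_trace(spans, slow_ms=1000.0):
    # Called only once the trace is complete: every span has been
    # buffered, so the decision can look at whole-trace properties.
    has_error = any(s.error for s in spans)
    total_ms = sum(s.duration_ms for s in spans)
    return has_error or total_ms > slow_ms
```

A production policy (the OpenTelemetry Collector's tail_sampling processor is the common implementation) adds more clauses — specific tenants, specific routes, a probabilistic keep for the boring remainder — but they are all of this shape: a predicate over a completed trace.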
The broader point: traces are not free, and deciding how to sample is a design decision, not a vendor setting. “Default sampling” is a choice. It is usually the wrong one.
Observability-driven development
The discipline most teams eventually land on is that telemetry is not an afterthought. A new feature ships with the metrics, logs, and traces it needs to be operated — chosen deliberately, not left to whatever the default instrumentation emitted. An incident teaches the team which telemetry was missing; the next feature gets the missing telemetry by default.
The habit that makes this work is asking, before each change is merged, “how will we know if this is broken?” A change for which that question has no answer in the existing telemetry is a change that ships blind. Add the telemetry, or accept the blindness explicitly. Both are choices; one is deliberate.
Teams that do this consistently tend to have two properties. First, their mean time to detect (MTTD) drops to the point that they learn about problems from their own telemetry before users report them. Second, their dashboards stop growing — old metrics get retired, because nobody looks at them, instead of accumulating forever. Both are signs that observability is a discipline being practiced, not a product being consumed.
The working definition
Observability is a property of the system, not a product you buy. A system is observable to the extent that its operators can ask it questions they had not thought of and get useful answers. The three pillars are the inputs to that property. Correlation, structured telemetry, SLIs and SLOs, error budgets, and the discipline of alerting on symptoms are what turn those inputs into the property.
Buying a vendor gives you the pillars. Adopting the discipline gives you the property. The two are commonly confused and are not the same thing. A team with a seven-figure Datadog bill and no SLOs is paying for monitoring. A team with SLOs, structured logs, and correlated traces is paying for observability — often with a smaller bill, because they know what they are looking for and have designed the telemetry around it.
The rule, in one sentence: you cannot operate what you cannot observe, and you cannot observe by default — only on purpose. The purpose is the whole job.