Caching: a denormalization pretending to be an optimization


Caching is widely presented as a performance technique and is more honestly a consistency problem. A cache is a second copy of some data, stored somewhere faster or closer. Once you have two copies, you have the question of what happens when one changes and the other has not yet. Every caching decision is, in the end, a decision about how much staleness you can tolerate and how much machinery you are willing to run to bound it.

The famous line — “there are only two hard things in computer science: cache invalidation and naming things,” attributed to Phil Karlton — is a joke that is also a warning. The easy part of caching is putting things into a cache. The hard part is getting them out at the right time.

This post is a map of the caching layers in a modern stack, the standard patterns for reading and writing through each layer, and the specific failure modes that make caching harder than the “add Redis in front of Postgres” picture suggests.

Why caches exist

Three real reasons, which are worth distinguishing because they drive different design choices.

Latency. Accessing data from a cache that is physically or logically closer — in memory instead of on disk, in the same process instead of across a network, at the edge instead of at the origin — is faster. The win is usually one to three orders of magnitude, depending on the layer.

Throughput. A cache absorbs reads that would otherwise hit the source of truth. If the source can handle 5,000 reads per second and the application needs 50,000, a cache with a 90% hit rate turns the arithmetic into something survivable.
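The arithmetic is worth writing down, because it is sensitive in a way that surprises people. A minimal sketch, with the illustrative numbers from above:

```python
def origin_rps(total_rps: float, hit_rate: float) -> float:
    """Read traffic that misses the cache and reaches the source."""
    return total_rps * (1.0 - hit_rate)

# 50,000 reads/s at a 90% hit rate leaves ~5,000/s for the source.
# Note the sensitivity: a drop to an 80% hit rate doubles the load
# on the source, even though the hit rate only fell ten points.
```

The miss rate, not the hit rate, is what the source experiences, which is why a small dip in hit rate can be a large spike at the origin.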

Cost. Repeatedly computing an expensive result — a rendered page, a query against a data warehouse, a call to a paid API — is cheaper to memoize than to recompute. Sometimes the cached copy is the difference between a profitable feature and an expensive one.

Which of these you are optimizing for shapes what the cache should do. A cache for latency can be small and close. A cache for throughput must be large enough to hold the working set. A cache for cost can tolerate more staleness, because the freshness cost is paid each time the cache misses.

The layers

A modern web application has caches at many levels, often half a dozen of them, stacked on top of each other:

  • Browser cache. HTTP headers (Cache-Control, ETag, Expires) let the browser avoid asking again. Effectively free when it works; impossible to purge when something goes wrong because you do not control the clients.
  • CDN / edge cache. Cloudflare, Fastly, CloudFront. Cache responses geographically close to users. Fast, large, and the first line of defense against traffic spikes. Invalidation is an explicit purge API call, and it is not instant.
  • Reverse proxy / gateway cache. Varnish, nginx’s proxy cache, an API gateway. In front of the origin. Useful for caching responses the CDN did not cache.
  • Application cache. In-process (a dictionary, an LRU) or out of process (Redis, Memcached). Fast, under your control, invalidated by your code.
  • Query / ORM cache. Result-set caching in ORMs or query middleware. Easy to turn on, often worth less than it costs, because it sits between the application and the database where invalidation is hardest.
  • Database buffer pool. The database’s own in-memory cache of frequently-accessed pages. Managed by the database. Free, and the reason a well-provisioned database is faster than people expect.

Each layer is a cache. Each layer has its own invalidation story. The compounded effect — a stale response at the browser because the CDN was stale because the app was stale because the ORM cache was stale — is how a single bug in a write path produces user-visible staleness at every level. Debugging it is difficult precisely because there are so many places the stale value could be hiding.

The discipline is to be explicit about which layers cache what, and for how long, and on what invalidation signal. “We just cache everything” is how you end up with a system whose consistency model is implicit and unknowable.

The four patterns

Four standard shapes for how an application interacts with a cache. The differences matter.

Cache-aside (lazy loading). The application checks the cache. On hit, it returns the value. On miss, it fetches from the source, writes the result to the cache, and returns it. Writes go straight to the source and either update or invalidate the cache explicitly.

This is the default and the one most applications use, often implicitly. It is simple and correct most of the time. The failure modes are the stampede (below), the write-miss (a value is updated in the source but not in the cache, so readers see staleness until expiry), and the cold-start (a fresh cache has no data, and the first wave of requests all miss together).
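The pattern fits in a few lines. A sketch using an in-process dict as the cache; `fetch_from_db` and `write_to_db` are hypothetical stand-ins for the source of truth, and with Redis the dict operations would become `cache.get`/`cache.set`/`cache.delete`:

```python
import time

cache: dict = {}              # key -> (value, expires_at)
TTL_SECONDS = 300

def fetch_from_db(key):       # hypothetical source-of-truth read
    return f"row-for-{key}"

def write_to_db(key, value):  # hypothetical source-of-truth write
    pass

def get(key):
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if time.monotonic() < expires_at:
            return value                      # hit
    value = fetch_from_db(key)                # miss: go to the source
    cache[key] = (value, time.monotonic() + TTL_SECONDS)
    return value

def update(key, new_value):
    write_to_db(key, new_value)               # source of truth first
    cache.pop(key, None)                      # then invalidate the copy
```

Note the write path deletes rather than updates the cached copy — the next read refetches whatever the source actually has.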

Read-through. The cache is in front of the source, and reads always go through it. If the cache does not have the value, it fetches from the source itself. The application only talks to the cache. This is effectively cache-aside with the logic moved into the cache layer instead of the application. Managed caching products (DAX for DynamoDB, many CDNs) work this way.

Write-through. Writes go through the cache to the source. Every write updates both in one operation. The cache is always consistent with the source. The cost is write latency (both stores must confirm), and the benefit is no stale cache after a write.
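The shape of write-through, sketched with two dicts standing in for the cache and the source:

```python
cache: dict = {}
db: dict = {}      # stands in for the source of truth

def write_through(key, value):
    # Both copies are updated in one code path. In a real system a
    # failure between the two writes needs a retry or rollback so
    # the stores do not diverge.
    db[key] = value
    cache[key] = value

def read(key):
    if key in cache:
        return cache[key]
    value = db[key]
    cache[key] = value
    return value
```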

Write-behind (write-back). Writes go to the cache; the cache asynchronously flushes to the source. Write latency is whatever the cache takes. The cost is that a cache failure before the flush loses writes. Appropriate for telemetry, metrics, or write-heavy workloads where some loss is tolerable; inappropriate for anything resembling a ledger.

Cache-aside is the default because it is the most tolerant of cache failures — if the cache is down, the application falls back to the source. Write-through and write-behind couple the application’s fate more tightly to the cache’s, which is usually not what you want.

Invalidation

The genuinely hard problem. Three approaches, ordered from the most staleness tolerated to the least.

Time-based (TTL). Each entry has a time-to-live. After TTL expires, the entry is gone, and the next request fetches fresh. Simple, bounded staleness, no coordination required. The cost is that until TTL expires, the cache is serving stale data regardless of what happens at the source. Picking TTLs is the craft: too short and you lose the benefit, too long and staleness becomes user-visible.

Explicit invalidation. When the source changes, the code that made the change tells the cache to invalidate. cache.delete(key). Immediate, precise, and hostage to the discipline of remembering to call it. A new code path that updates the source but forgets to invalidate the cache produces silent staleness.

Event-driven invalidation. Writes to the source emit events; the cache subscribes and invalidates accordingly. Most robust, most machinery. Works especially well when the source already emits events for other reasons (CDC, domain events). The cache becomes one of several projections of the event stream, which is a clean model.
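The event-driven shape, sketched with a hypothetical event format standing in for a real CDC or domain-event stream; the `table:id` key scheme is an assumption, not a convention from any particular tool:

```python
cache = {"user:7": {"name": "stale"}, "user:8": {"name": "fresh"}}

def on_change_event(event: dict):
    """Invalidate the cache entries affected by a source change."""
    key = f"{event['table']}:{event['id']}"
    cache.pop(key, None)        # delete rather than update (safer)

# One consumed event invalidates one projection of the stream.
on_change_event({"table": "user", "id": 7, "op": "update"})
```

The consumer knows nothing about the code that made the change, which is the point: new write paths get invalidation for free as long as they emit events.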

The practical pattern is usually a combination: a reasonable TTL as a safety net, plus explicit invalidation for the changes you know about, plus event-driven invalidation for cross-service cases. Pure TTL is cheap and always slightly wrong. Pure explicit invalidation is expensive to maintain and breaks when someone forgets. The layers compose.

A subtle point: invalidating is almost always safer than updating. If a write updates a cache with the new value, and the update conflicts with another concurrent write, you may end up with the wrong value cached. If a write invalidates the cache, the next read refetches from the source and gets whatever the source has — which is, by definition, correct. “Delete from cache on write” is a more robust default than “update cache on write.”

The cache stampede

A specific failure mode every team discovers eventually. A hot value expires from the cache. The next hundred (or hundred thousand) requests all miss at once. They all go to the source simultaneously. The source, not designed to serve its full read traffic, falls over. The cache cannot refill because the source is down. Everything cascades.

Three standard mitigations:

Request coalescing / single-flight. If N requests miss the same key at the same time, only one goes to the source; the others wait for it. A first-request-wins pattern, typically implemented with a lock or with the single-flight idiom. Trivially fixes the stampede for any one key; trickier to get right if you have many keys expiring together.
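A minimal single-flight sketch for one process, using a per-key event so followers block until the leader’s fetch completes (error propagation and cross-process coordination are omitted):

```python
import threading

_guard = threading.Lock()
_in_flight: dict = {}   # key -> Event for the in-progress fetch
_results: dict = {}

def single_flight(key, fetch):
    """Let one caller fetch per key; concurrent callers wait for it."""
    with _guard:
        event = _in_flight.get(key)
        if event is None:              # first request wins
            event = threading.Event()
            _in_flight[key] = event
            is_leader = True
        else:
            is_leader = False
    if is_leader:
        try:
            _results[key] = fetch()
        finally:
            event.set()                # wake the waiters
            with _guard:
                del _in_flight[key]
    else:
        event.wait()                   # coalesce onto the leader's fetch
    return _results[key]
```

Across processes the same idea needs a shared lock (a Redis `SET NX` with a timeout is the usual sketch), which is where the tricky parts live.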

Probabilistic early expiration. Each request, instead of checking “has the entry expired?”, rolls a probability based on how close the entry is to expiry. As the entry approaches its TTL, occasional requests proactively refresh it in the background before it expires. The cache never has a moment of simultaneous mass expiry; the refresh load is spread out. The XFetch paper’s algorithm is a clean reference.
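The XFetch check itself is one line: refresh when `now - delta * beta * log(rand()) >= expiry`, where delta is the recompute time. A sketch:

```python
import math
import random
import time

def should_refresh_early(expires_at, delta, beta=1.0, now=None):
    """XFetch-style check: refresh before expiry with a probability
    that rises as the entry approaches its TTL.

    delta: how long recomputing the value takes, in seconds.
    beta:  > 1 refreshes earlier, < 1 later; 1.0 is the default.
    """
    if now is None:
        now = time.monotonic()
    # -log(U) for uniform U in (0, 1] is an exponential random draw,
    # so the effective "now" is pushed forward by a random amount
    # proportional to the recompute cost.
    return now - delta * beta * math.log(1.0 - random.random()) >= expires_at
```

Far from expiry the check almost never fires; close to expiry it fires often, so refreshes spread out instead of stampeding at the TTL boundary.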

Stale-while-revalidate. Serve the stale entry after expiry, while asynchronously refreshing it. Clients see a stale value for a short window; the source sees a steady refresh load instead of a spike. HTTP’s Cache-Control: stale-while-revalidate=N is this pattern at the CDN layer. Conceptually similar to probabilistic early expiration but triggered on expiry rather than before it.

A system serving a high-traffic cached endpoint needs at least one of these. Without them, every cache expiry is a latent load spike on the source.

The hottest-key problem

Caches assume load is distributed across keys. Most workloads concentrate load on a small number of keys — the top product, the most-viewed article, the celebrity’s profile. A cache node serving a single hot key can become the bottleneck, especially in distributed caches where a key is pinned to one node.

Mitigations:

Replication of hot keys. Store the same value on multiple nodes; route reads to any of them. Redis Cluster does not do this natively; client libraries or proxies have to add it.

Client-side caching of hot keys. The application keeps an in-process cache of values it reads most often, in front of the distributed cache. An extra layer, but it absorbs the hot traffic before it reaches the distributed tier. Common in high-throughput systems.

Intentional sharding. If the key is product:42 and it is hot, split it into product:42:shard:0 through product:42:shard:15, each with an identical copy. Read from a random shard. The load spreads across nodes. Writes must update all shards.
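The sharding trick, sketched with one dict standing in for a sharded cluster (in a real deployment each shard key would hash to a different node):

```python
import random

NUM_SHARDS = 16
cache: dict = {}

def write_hot(key, value):
    # Writes fan out: every shard holds an identical copy.
    for i in range(NUM_SHARDS):
        cache[f"{key}:shard:{i}"] = value

def read_hot(key):
    # Reads pick a shard at random, spreading load across nodes.
    i = random.randrange(NUM_SHARDS)
    return cache.get(f"{key}:shard:{i}")
```

The trade is explicit: reads are spread 16 ways, writes cost 16×, and invalidation must now hit all 16 copies.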

The question to ask about any cache design: what does the load look like if one key gets 10,000× the traffic of the median? Most caches fail gracefully for uniform load and catastrophically for skewed load.

Negative caching

A request that does not find a value in the source still takes time to determine absence. If the cache does not remember the absence, every “not found” request hits the source every time.

Negative caching — caching the fact that a lookup missed — closes this gap. A short TTL is usually enough; absences change less often than presences, but they do change. Cache a missing value for 60 seconds, and then look again.
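A sketch of negative caching with a sentinel, so a cached absence is distinguishable from an uncached key; `fetch` is a hypothetical source lookup that returns None for “not found”:

```python
import time

_MISSING = object()     # sentinel marking a cached absence
cache: dict = {}        # key -> (value, expires_at)
NEGATIVE_TTL = 60       # absences change less often, but they change
POSITIVE_TTL = 300

def lookup(key, fetch):
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if time.monotonic() < expires_at:
            return None if value is _MISSING else value
    value = fetch(key)                   # None means "not found"
    ttl = NEGATIVE_TTL if value is None else POSITIVE_TTL
    cache[key] = (value if value is not None else _MISSING,
                  time.monotonic() + ttl)
    return value
```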

The failure mode: cache poisoning. An attacker (or a buggy client) issues lookups for many non-existent keys. Each one populates a “not found” entry in the cache, exhausting the cache’s memory. A Bloom filter of the keys that do exist, placed in front of the cache, handles this well: it answers “definitely not present” or “probably present” in constant space, so lookups it can rule out never create cache entries at all.

Write amplification and eviction

Every cache has a size limit. Eviction policies — LRU, LFU, ARC, random — decide which entry to remove when the cache is full. The usual default is LRU; it is fine for most workloads and catastrophic for a specific one: a workload that scans more data than fits in the cache evicts everything useful with the scan. LFU is more resistant to this but costs more to maintain.

Redis offers all of these and defaults to noeviction, which means the cache errors on write when full. That is rarely what you want and is rarely changed from the default; the result is a silent shift from “cache” to “in-memory database” that fills up and breaks. Every cache deployment deserves a deliberate eviction policy and a size limit.
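A deliberate setup, as a redis.conf sketch (the size limit is illustrative; pick it from the working set):

```
maxmemory 2gb
maxmemory-policy allkeys-lru
```

`allkeys-lru` treats the instance as a pure cache, evicting any key under memory pressure; `volatile-lru` is the alternative when only TTL-bearing keys should be eviction candidates.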

Cache is a denormalization

The framing that makes most caching decisions sharper: a cache is a denormalization of the source data, shaped for a specific access pattern. Denormalizations are not free. They must be maintained. They can drift from the source. They trade write cost (every write must propagate to all denormalized copies) for read cost (reads hit the shape they need without recomputation).

Seen this way, a cache layer is one of several denormalizations — alongside materialized views, search indexes, read-model projections, and CQRS read sides. They are all the same pattern with different names. The difference between a “cache” and a “read model” is often just how explicit you are being about the denormalization.

A team that treats caches as an optimization adds one wherever things are slow. A team that treats caches as denormalizations asks, for each one: what is being denormalized, for which access pattern, with what freshness guarantees, invalidated by what signal, at what storage cost. The answers determine whether a cache is worth its weight.

The rule

Cache deliberately. Know which layer is caching what. Pick a pattern — cache-aside, read-through, write-through, write-behind — for each layer and each class of data. Plan for invalidation before you plan for the happy path. Protect against stampedes. Handle hot keys. Size the cache to the working set. Measure the hit rate and the staleness both — a cache with a 99% hit rate serving five-minute-stale data is not the same as one serving fresh data, and the metrics should tell both stories.

Most importantly: when the cache and the source disagree, the source wins. Always. The cache is a convenience. Its job is to be right most of the time and to fail safely when it is not. A system where the cache is the source of truth has a second database, not a cache, and deserves the consistency treatment of a database.

Caching is a denormalization with branding. The branding buys you nothing. The denormalization costs you what denormalizations always cost — and pays you what they always pay. Budget accordingly.