Schema and API evolution: every stable system has a deprecation culture


Most outages in stable, well-operated systems do not happen during incidents. They happen during deploys. Someone pushed a change, something that depended on the old shape of the data fell over, and the incident response is figuring out which end made an assumption the other end no longer honors.

That is a schema evolution problem. The schema in question might be a database column, an API response body, an event payload, a protobuf message, a queue’s message format, or a configuration file — but the pattern is always the same. Two components disagree about what a thing looks like, at a moment when both are supposed to be able to handle it. The disagreement was introduced by a deploy — either committed without noticing it was a schema change, or committed knowingly but without the coordination needed to make it safe.

Evolving a schema without downtime is a discipline with mechanical answers. This post is about those answers: expand-contract, additive change, versioning strategies, deprecation windows, and the practices that make each schema change a routine operation instead of a heroic one.

The expand-contract pattern

The single most important pattern in schema evolution, and the one most teams reinvent independently before discovering it has a name.

The naive migration: change the schema; deploy the code that uses the new schema; done. This works when the system is small enough that the schema change and the code change can happen in one atomic step — which is to say, never, in any system with more than one server, one service, or one client.

The expand-contract pattern breaks the migration into three phases:

  1. Expand. Change the schema to support both the old and new shape simultaneously. Add the new column without dropping the old one. Add the new API field without removing the old one. Add the new event field without renaming the old one. The schema now accommodates every reader and writer — old and new.
  2. Migrate. Update the readers and writers, one at a time, to use the new shape. Old clients continue to work because the old shape still exists. New clients use the new shape. Nobody coordinates deploys because nobody needs to.
  3. Contract. Once every reader and writer has moved to the new shape, and enough time has passed that you are sure, remove the old shape. The schema now supports only the new shape.

This is sometimes called parallel change. It turns a risky all-at-once migration into a sequence of safe, independent steps, each of which is individually reversible and none of which requires all components to deploy simultaneously.
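The migrate phase is the one that lives in application code. A minimal sketch of what it looks like for a column rename, assuming a `users` table with both an old `name` column and a new `full_name` column (all names here are illustrative, not from any real codebase):

```python
# Sketch of the "migrate" phase of expand-contract: the writer fills
# both the old and the new shape, so old readers and new readers keep
# working while deploys roll out independently.

def save_user(db, user_id: int, full_name: str) -> None:
    # Dual write: the expand phase added full_name; old code still
    # reads name, so both columns are kept in sync.
    db.execute(
        "UPDATE users SET name = ?, full_name = ? WHERE id = ?",
        (full_name, full_name, user_id),
    )
    db.commit()

def read_user(db, user_id: int) -> str:
    # New reader prefers full_name but falls back to name for rows
    # the backfill has not reached yet.
    row = db.execute(
        "SELECT full_name, name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row[0] is not None else row[1]
```

Once every row is backfilled and every reader uses `full_name`, the fallback (and eventually the `name` column) can be removed — the contract step.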

The cost is time. A schema change that might have taken a day as a naive migration takes two to four weeks as expand-contract. The cost of that time is much smaller than the cost of the outage the naive migration produces on a bad day. Every team that has tried both converges on the same answer.

Additive changes only

The working rule for API and event schemas: additive changes are backward-compatible; non-additive changes are not, and require versioning.

What counts as additive:

  • Adding a new field. Clients that do not know about it ignore it.
  • Adding a new endpoint. Clients that do not call it are unaffected.
  • Adding a new event type. Consumers that do not subscribe are unaffected.
  • Adding a new optional query parameter. Existing callers that do not use it behave as before.
  • Adding a new enum value — with care. Consumers that pattern-match on specific enum values and have no catch-all will break. This is a common trap.
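The enum trap is easiest to see in code. A sketch, with illustrative status values:

```python
# A consumer that matches specific enum values. Without the catch-all
# branch, a newly added value (say, "returned") would raise instead of
# degrading gracefully -- which is what turns an "additive" enum change
# into a breaking one.

def handle_order_status(status: str) -> str:
    if status == "pending":
        return "waiting"
    elif status == "shipped":
        return "in transit"
    elif status == "delivered":
        return "done"
    else:
        # The catch-all is what makes adding a new enum value safe
        # for this consumer.
        return "unknown"
```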

What is non-additive:

  • Removing a field. Clients that read it break.
  • Renaming a field. Same as removing it plus adding it.
  • Changing a field’s type. Parsing either breaks or silently corrupts.
  • Changing a field’s semantics while keeping the name. The worst option because the breakage is quiet — the field still exists, still parses, but now means something different. These changes tend to produce data corruption that is discovered weeks later.
  • Tightening a constraint. A field that was optional becoming required breaks any producer that did not fill it.
  • Changing default values. Existing records that relied on the default now have a different meaning.

The discipline is to treat the schema as append-only by default and to budget deliberately for the few cases that require non-additive change. Paired with expand-contract, almost every “non-additive” change becomes a sequence: add the new shape, migrate, remove the old. The step-by-step approach converts non-additive changes into additive ones at every intermediate step.

Database schema migrations without downtime

Applied to databases, expand-contract has well-worn mechanics. Renaming a column name to full_name, done safely:

  1. Add full_name column (nullable, no default). The schema now has both; no application code reads it yet.
  2. Deploy code that writes to both name and full_name on every update. Old code still writes only name. Reads still use name.
  3. Backfill: for every existing row, copy name into full_name. This is a background job, chunked, run until complete.
  4. Deploy code that reads from full_name with a fallback to name (in case any row was missed).
  5. Deploy code that reads only from full_name. No fallback.
  6. Deploy code that writes only to full_name. Old writes stop going to name.
  7. Drop the name column.

Seven steps to rename a column safely. Most migrations do not bother with all of them and get away with it because their databases are small. At scale, each step matters. Tools like gh-ost, pt-online-schema-change, and pg_repack automate the heavy-lifting parts (adding columns without table locks, backfilling in chunks), but the sequence remains.
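The backfill (step 3) is the mechanically interesting part: it must run in chunks so no single statement holds a long lock. A sketch, with an illustrative chunk size and the same `name`/`full_name` columns as above:

```python
# Chunked backfill: copy name into full_name in small batches, each in
# its own transaction, until no unfilled rows remain. Chunk size and
# table/column names are illustrative.

CHUNK = 1000

def backfill_full_name(db) -> int:
    copied = 0
    while True:
        # Only touch rows that neither the backfill nor the
        # dual-writing code has filled yet, one chunk at a time.
        cur = db.execute(
            "UPDATE users SET full_name = name "
            "WHERE id IN (SELECT id FROM users "
            "             WHERE full_name IS NULL LIMIT ?)",
            (CHUNK,),
        )
        db.commit()
        if cur.rowcount == 0:
            return copied
        copied += cur.rowcount
```

In production you would also sleep between chunks and watch replication lag; the loop structure is the point.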

A few rules fall out:

  • Never drop a column in the same deploy that removes the code using it. The deploy is not atomic; for some window, one version of the code is running against a schema that does not match.
  • Never add a NOT NULL column without a default on a large table. It locks the table on most databases while the column is populated. Add nullable, backfill, then add the NOT NULL constraint in a second migration.
  • Avoid long-running migrations inside the deploy. A migration that takes 30 seconds is a 30-second outage for the tables it touches. Move long migrations out of the deploy pipeline and into background jobs that complete before the deploy.

These are the kind of rules that look pedantic on a small system and save you from a midnight page on a large one.

API versioning strategies

Versioning exists for the subset of changes that are genuinely non-additive and cannot be done with expand-contract alone. The usual options:

URL path versioning. /v1/orders, /v2/orders. The version is in the URL. Easy to route, easy to document, easy to run two versions side by side. The cost: every client has to know which version to call, and the URL becomes a commitment you cannot easily walk back.

Header versioning. Clients send Accept: application/vnd.example.v2+json. Cleaner URLs, more RESTful in spirit, harder to route in proxies, harder for clients to get right without good SDK support. More common in public APIs that care about hypermedia than in internal APIs.

Media type versioning / content negotiation. A variant of header versioning where the version is part of the MIME type. Same tradeoffs.

No versioning. Additive-only changes forever. Works for internal APIs where the producer and consumers are coordinated, and for public APIs that treat breaking changes as a truly exceptional event (GitHub’s API went more than a decade on “v3” by sticking strictly to additive changes). The discipline is hard; the payoff is that users never have to migrate.

Version per field. Instead of versioning the API, version individual fields. Deprecate a field, add its replacement alongside, remove the old field after the deprecation window. This is essentially expand-contract at the API level.

For internal APIs, “no versioning, with strict additive discipline” is usually the best default. For public APIs, URL-path versioning is the most honest about the fact that versions are commitments — your users will tell you which version they are on because it is in the URL, and you can measure when it is safe to sunset a version.
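URL-path versioning reduces to routing. A minimal dispatcher, framework-free and with illustrative handlers and response shapes:

```python
# Two versions of the same endpoint routable side by side. The version
# lives in the path, so access logs tell you exactly which clients are
# on which version -- the property that makes sunsetting measurable.

def orders_v1(order_id: str) -> dict:
    # v1 exposed a single "name" field.
    return {"id": order_id, "name": "Ada Lovelace"}

def orders_v2(order_id: str) -> dict:
    # v2 split the field; v1 keeps working untouched while clients
    # migrate on their own schedule.
    return {"id": order_id, "first_name": "Ada", "last_name": "Lovelace"}

ROUTES = {
    "/v1/orders": orders_v1,
    "/v2/orders": orders_v2,
}

def handle(path: str, order_id: str) -> dict:
    return ROUTES[path](order_id)
```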

Deprecation windows

Removing anything — a field, an endpoint, an event type — requires a deprecation window. The window is the time between “this is deprecated” and “this is gone.” During the window, the deprecated thing still works, but every use of it is visible and announced to the caller.

The mechanics:

  • Mark it deprecated in the schema or documentation. This is the least useful step and the one most teams stop at.
  • Emit a deprecation header or log on every call. Deprecation: true and Sunset: <date> headers for HTTP APIs.
  • Measure usage. Count calls to the deprecated thing, by client. Per-client measurement is what makes the deprecation socially enforceable — you can identify who is still on the old version and ask them specifically.
  • Notify the users. Not just the engineers, but the product owners of the clients. A client that has not upgraded after nine months will usually turn out to be owned by a team that did not know they were using it at all.
  • Set a sunset date. The window must have an end. An “eventual” deprecation is a deprecation that never happens. Six to twelve months is standard for public APIs; two to four weeks for internal APIs between coordinated teams.
  • Remove it. The deprecation has to actually conclude. A deprecated thing that has been deprecated for three years has taught everyone to ignore deprecation notices.
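The header-and-measurement steps can be sketched together. Handler name, response body, and the sunset date are illustrative:

```python
# Deprecation mechanics: emit Deprecation/Sunset headers on every call
# to the old endpoint, and count usage per client so you know when
# (and whom) it is safe to sunset.

from collections import Counter

SUNSET = "Sat, 01 Nov 2025 00:00:00 GMT"  # illustrative sunset date
deprecated_calls: Counter = Counter()

def orders_v1_handler(client_id: str) -> tuple[dict, dict]:
    # Per-client measurement is what makes the deprecation socially
    # enforceable: you can name the teams still on the old version.
    deprecated_calls[client_id] += 1
    headers = {"Deprecation": "true", "Sunset": SUNSET}
    body = {"id": "42", "status": "ok"}
    return body, headers
```

In a real system the counter would be a metric tagged by client identity, not an in-process dict, but the shape is the same.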

A team with a working deprecation culture has several things simultaneously: measurement, notification, a schedule, and the political will to turn things off. Missing any of these and the deprecation stalls indefinitely. The cultural part is the hardest and the part most retrospectives trace back to.

Protobuf’s evolution rules

Protobuf’s schema evolution rules are worth studying even if you use something else, because they are the most rigorous encoding of “what does additive change actually mean” any wire format has shipped.

  • Field numbers are the identity. Renaming a field (changing its name) is free; the wire format uses the number. Renumbering a field is a breaking change.
  • Removing a required field is always breaking. Protobuf 3 removed required from the schema language for this reason — it eliminated a category of decisions that locked the schema into compatibility hell.
  • Adding a new optional field is safe. Old readers ignore unknown fields; old writers simply don’t set the new one.
  • Changing a scalar type is sometimes safe. int32 → int64 is safe on the wire; most other conversions are not.
  • Never reuse a field number. Once removed, mark the number reserved so a future field cannot accidentally reuse it. The old wire format lingers in stored data and in old clients; a reused field number will be misinterpreted.
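The rules above, sketched as a schema fragment. The message and field names are illustrative:

```proto
syntax = "proto3";

message Order {
  // Field numbers are the identity: renaming customer_name is free,
  // renumbering it would be a breaking change.
  string customer_name = 1;
  int64 total_cents = 2;   // was int32; int32 -> int64 is wire-safe

  // Field 3 used to be "status". Reserving the number and name keeps
  // a future field from accidentally reusing them and misreading old
  // serialized data.
  reserved 3;
  reserved "status";

  string state = 4;        // the replacement, under a fresh number
}
```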

These rules exist because protobuf is designed to be serialized into long-lived data — messages persisted in databases, events written to Kafka logs, requests stored in traces. The evolution rules make the serialized data forward- and backward-compatible across a ten-year window. Other formats (Avro, Thrift, Cap’n Proto) have similar rules; JSON, famously, does not, which is why JSON APIs drift and protobuf-based systems tend to age better.

Event schema evolution

Events are the hardest case, because they outlive the code that produced them. An event written to a durable log three years ago is still in the log today, and any consumer processing the log must still be able to read it.

The rules for event schema evolution:

  • Never remove a field. Even if no producer writes it anymore, historical events still contain it.
  • Never change a field’s meaning. A new semantic requires a new field or a new event type.
  • Version the event type when the schema changes substantially. OrderPlaced.v1 and OrderPlaced.v2 coexist in the log. Consumers handle both.
  • Use a schema registry. Confluent Schema Registry, Apicurio, or a home-rolled equivalent. Producers register the schema they emit; consumers fetch the schema to deserialize. The registry is the durable source of truth for what events have looked like over time.

The discipline is demanding and the reward is that a log from three years ago is still legible today, which is exactly the property that makes event sourcing work. Teams that underestimate this discipline end up with “we cannot reprocess anything older than six months because the schemas don’t match the code anymore,” which is a recoverable situation but not a comfortable one.
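A consumer that handles both versions of an event, sketched. Event names and fields are illustrative; in practice the schemas would come from the registry rather than being hard-coded:

```python
# Versioned event types coexisting in one log: the consumer dispatches
# on the type and handles each version explicitly. Unknown types fail
# loudly rather than being silently misread.

def handle_order_placed(event: dict) -> str:
    if event["type"] == "OrderPlaced.v1":
        # v1 carried a single amount in dollars.
        return f"order {event['order_id']}: ${event['amount']}"
    elif event["type"] == "OrderPlaced.v2":
        # v2 switched to integer cents plus an explicit currency.
        dollars = event["amount_cents"] / 100
        return f"order {event['order_id']}: {dollars} {event['currency']}"
    else:
        raise ValueError(f"unknown event type: {event['type']}")
```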

Contract tests as the safety net

The hole in every schema evolution process is that nobody runs every client against every version. You test the server against its own test suite; you test each client against whatever version of the server it was written against; and you hope the intersection of those tests covers the combinations that will exist in production.

Contract tests — Pact is the usual tooling, though there are others — close this hole. The consumer writes tests against the contract it expects from the producer. Those tests generate a contract file. The producer’s CI runs against every consumer’s contract file and fails if it breaks any of them.

The practical effect: a server team cannot accidentally break a client team, because the client’s contract is part of the server’s test suite. A client team cannot accidentally depend on undocumented behavior, because the contract is what’s verified — not the implementation. Deploys become safer because the contracts are green; negotiations about schema changes become data-driven because the contract diff is the evidence.

Contract testing is the single most valuable testing discipline for microservice architectures, and the least commonly adopted relative to its payoff. It is how you evolve schemas at scale without staging environments for integration testing every combination.
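The core of the idea fits in a few lines. A home-rolled sketch (Pact does this with far more machinery, including recording contracts from real consumer tests); the consumer names and field sets are illustrative:

```python
# Minimal contract verification: each consumer declares the fields it
# relies on, and the producer's CI checks every contract against an
# actual response. A missing field fails the producer's build, not the
# consumer's production traffic.

CONSUMER_CONTRACTS = {
    "billing-service": {"order_id", "total_cents"},
    "email-service": {"order_id", "customer_email"},
}

def verify_contracts(producer_response: dict) -> list[str]:
    # Returns the consumers whose contracts the response breaks.
    broken = []
    for consumer, required_fields in CONSUMER_CONTRACTS.items():
        if not required_fields <= producer_response.keys():
            broken.append(consumer)
    return sorted(broken)
```

Dropping `customer_email` from the response would surface `email-service` here, in CI, before any deploy.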

The rule

Every stable system has a deprecation culture — a set of habits that make changing schemas routine instead of heroic. The pieces are well-known: expand-contract for migrations, additive-by-default discipline, versioning when additivity is not enough, measured deprecation windows with real sunsets, contract tests as the safety net, schema registries for durable data.

Systems without this culture develop schema paralysis: every change requires heroics, so changes do not happen, so the schema calcifies, so eventually the only way to move forward is a rewrite. Systems with it accumulate changes smoothly, keep old contracts honest, and ship without coordinated deploy windows.

The discipline is boring. That is the point. Boring is what you want at the layer that breaks during incidents.