Platform engineering: the platform is a product
Platform engineering, as a named discipline, is the most recent piece of the architecture vocabulary this blog has covered — the term is maybe five years old in its current sense, though the practice is older. It grew out of two converging realizations: that Team Topologies’ platform teams were a real and stable pattern, not a phase, and that the operational burden of cloud-native architectures (Kubernetes, microservices, observability stacks, CI/CD) was more than any stream-aligned team should be expected to own individually.
The core claim of the discipline, once you strip away the tooling catalog, is small and easy to state: treat the platform as a product, with real users (the engineering teams), a real PM, real roadmaps, and a real success metric (adoption by choice, not by mandate). Platforms that follow this discipline tend to thrive. Platforms that do not — “we’ll build some internal tools and the other teams can use them” — tend to produce expensive infrastructure nobody likes and everyone works around.
This post is about what platform engineering actually is, why it usually fails, the principles that distinguish the working implementations, and the specific product discipline that makes the difference.
What a platform is, in this context
“Platform” is an overloaded word. In this post it means the specific thing Skelton and Pais defined in Team Topologies: a curated set of internal products that make stream-aligned teams more effective at delivering customer value.
The products on a platform vary by organization. Commonly included:
- A deployment pipeline from commit to production.
- An opinionated way to run services (Kubernetes, a serverless platform, a managed container runtime).
- Observability defaults (metrics, logs, traces, dashboards).
- Authentication and authorization building blocks.
- Databases provisioned with backup, monitoring, and tuning done.
- A way to stand up a new service from scratch without a week of yak-shaving.
- Secrets management, feature flags, configuration, rate limiting, message queues — the primitives every team needs.
What ties these together into “a platform” is not the technology. It is that each of them is offered to stream-aligned teams as a product they can adopt, with a user experience, documentation, and support — not as infrastructure they are expected to figure out.
The thinnest viable platform
The discipline’s central design principle, from Skelton and Pais: the thinnest viable platform. Provide the smallest possible set of features that actually reduces cognitive load on the stream-aligned teams. Everything beyond that is waste.
This is surprising advice in a field where “more capability” reads as “more valuable.” The reasoning: a thick platform creates its own cognitive load. If using the platform requires learning the platform’s conventions, DSLs, CLI, abstractions, and failure modes, the platform has shifted cognitive load rather than reduced it. At the extreme, the platform becomes its own engineering problem that rivals the one it was supposed to solve.
The thinnest viable platform takes a specific shape. For each piece of functionality:
- Use existing managed services where they exist. AWS already runs databases. GitHub already runs CI. Datadog already runs observability. A platform that builds on these is cheaper than one that reimplements them.
- Expose the managed service directly when possible. A platform that adds a layer of abstraction between teams and the underlying service has to maintain that layer forever, against the underlying service’s evolution.
- Abstract only when the abstraction is justified: the underlying service is too complex to expose directly, security requires controlled access, cost controls require a wrapper, or multiple teams benefit from a shared opinion.
A platform that consists of “well-documented paved paths through the managed services our cloud already offers, with guardrails on the few places they need guardrails” is usually more valuable than a platform that consists of “a custom PaaS we built on top of Kubernetes.” The second sounds more ambitious. The first costs less and solves the same problem.
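The three rules above reduce to a small decision procedure. The sketch below is one hypothetical way to write it down; `Capability` and `recommend` are illustrative names rather than part of any real tool, and the "build" branch for capabilities with no managed equivalent is an assumption the rules imply but do not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Hypothetical description of one capability the platform team is reviewing."""
    name: str
    managed_service_exists: bool
    too_complex_to_expose: bool = False
    needs_controlled_access: bool = False
    needs_cost_controls: bool = False
    shared_opinion_helps: bool = False


def recommend(cap: Capability) -> str:
    """Return 'build', 'wrap', or 'expose' for a capability."""
    if not cap.managed_service_exists:
        # Assumption: with nothing to adopt, building is the remaining option.
        return "build"
    abstraction_is_justified = (
        cap.too_complex_to_expose
        or cap.needs_controlled_access
        or cap.needs_cost_controls
        or cap.shared_opinion_helps
    )
    # Default to exposing the managed service directly; wrap only with a reason.
    return "wrap" if abstraction_is_justified else "expose"


if __name__ == "__main__":
    print(recommend(Capability("object-storage", managed_service_exists=True)))
    print(recommend(Capability("database", True, needs_cost_controls=True)))
```

The useful property is the default: a capability with no stated justification comes back as "expose", so every wrapper has to name the reason it exists.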
Golden paths
The mechanism a platform uses to actually reduce cognitive load is the golden path: a well-paved, well-documented, well-supported way of doing a common thing. Paved path to deploy a new service. Paved path to add a database. Paved path to set up authentication. Paved path to get a service into production-grade observability.
A golden path has several properties:
- It is documented with enough detail that a new engineer can follow it without help.
- It is automated as much as possible — often via a scaffolding tool that generates the initial setup.
- It comes with defaults that are production-grade, not toy. The defaults for observability include SLOs. The defaults for databases include backups and monitoring. The defaults for deployment include blue/green or canary rollouts.
- It is supported. If something breaks, the platform team picks it up. A golden path without support is a trap disguised as a template.
- It is the recommendation, not a mandate. Teams that diverge from the golden path for good reason are allowed to. Teams that diverge without reason are gently nudged back.
The last point is the hardest culturally. A platform team that mandates the golden path becomes a compliance team. A platform team that abandons the golden path the moment anyone pushes back has no leverage. The working balance is “strong recommendation with a clear escalation path for exceptions” — usually backed by the fact that the golden path is genuinely the easiest way to do the thing, and divergence costs the diverging team real effort.
Spotify’s Backstage, open-sourced in 2020 and now the basis of many internal developer portals, is in large part a golden-path-delivery tool: a searchable catalog of templates, documentation, and dashboards, all pointing toward the paved path for each class of thing a team might need to build.
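To make "production-grade defaults, recommended but not mandated" concrete, here is a minimal sketch of a scaffolding function. It is not how Backstage or any specific tool works; `GOLDEN_PATH_DEFAULTS`, `scaffold_service`, and every key and value in the defaults are hypothetical. The idea it illustrates is that a team gets the golden path with zero decisions, and any divergence is allowed but recorded.

```python
import copy

# Hypothetical production-grade defaults; keys and values are illustrative only.
GOLDEN_PATH_DEFAULTS = {
    "observability": {"metrics": True, "logs": True, "traces": True,
                      "slo_availability": 0.999},
    "database": {"backups": True, "monitoring": True},
    "deployment": {"strategy": "canary"},
}


def scaffold_service(name, owner, overrides=None):
    """Return (config, divergences) for a new service.

    Overrides are applied as requested, but each one that differs from the
    golden-path default is recorded so the platform team can follow up.
    """
    config = copy.deepcopy(GOLDEN_PATH_DEFAULTS)
    config["service"] = {"name": name, "owner": owner}
    divergences = []
    for section, values in (overrides or {}).items():
        for key, value in values.items():
            default = GOLDEN_PATH_DEFAULTS.get(section, {}).get(key)
            if value != default:
                divergences.append(f"{section}.{key}: {default!r} -> {value!r}")
            config.setdefault(section, {})[key] = value
    return config, divergences


if __name__ == "__main__":
    config, divergences = scaffold_service(
        "batch-export", "team-data",
        overrides={"deployment": {"strategy": "blue-green"}},
    )
    print(config["deployment"], divergences)
```

The list of divergences is the "gentle nudge": nothing is blocked, but the platform team can see who left the path and ask why.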
Self-service as the core discipline
Platforms fail when they are not self-service. An “internal tool” that requires filing a ticket with the platform team is a bottleneck wearing the uniform of a product. A self-service platform lets teams provision, configure, deploy, and operate without platform-team involvement for the common cases.
The test: how long does it take from a team’s first intent to use a platform capability to that capability being in their hands, with no platform-team human in the loop? A platform where that time is minutes is working. A platform where it takes days, or where it requires approval, is not self-service; it is infrastructure-with-a-portal.
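That test is easy to turn into a measurement if provisioning events are logged. The sketch below assumes a hypothetical `ProvisioningRecord` with a request time, a ready time, and a count of human steps; the 30-minute budget is an arbitrary stand-in for "minutes", not a recommended threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ProvisioningRecord:
    """Hypothetical log entry for one request against the platform."""
    capability: str
    requested_at: datetime
    ready_at: datetime
    human_steps: int  # approvals or tickets handled by platform-team staff


# Assumed budget standing in for "minutes"; pick your own.
SELF_SERVICE_BUDGET = timedelta(minutes=30)


def lead_time(record: ProvisioningRecord) -> timedelta:
    """Time from first intent to the capability being usable."""
    return record.ready_at - record.requested_at


def is_self_service(record: ProvisioningRecord,
                    budget: timedelta = SELF_SERVICE_BUDGET) -> bool:
    """True only if no human was in the loop and the lead time fits the budget."""
    return record.human_steps == 0 and lead_time(record) <= budget


if __name__ == "__main__":
    start = datetime(2024, 1, 8, 9, 0)
    record = ProvisioningRecord("postgres", start, start + timedelta(minutes=6), 0)
    print(lead_time(record), is_self_service(record))
```

Note that a fast request with an approval step still fails the check: speed alone does not make a workflow self-service.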
Self-service has three ingredients:
- Paved paths as above — clear defaults that work without decisions.
- APIs, CLIs, or portals that expose the platform’s capabilities programmatically. Teams that need to automate their workflows against the platform can.
- Safe defaults and guardrails — the platform prevents the common mistakes without requiring human review. A database provisioning workflow that automatically enables backups, rather than requiring the team to ask for them, is safer than one that relies on checklists.
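The backup example in the last ingredient can be sketched as a request type whose defaults are already safe, plus an automated check that replaces the human review. Everything here is hypothetical: `DatabaseRequest`, `check_guardrails`, and the 500 GB storage limit are illustrative choices, not a real provisioning API.

```python
from dataclasses import dataclass


class GuardrailViolation(ValueError):
    """Raised when a request breaks a rule the platform enforces automatically."""


@dataclass(frozen=True)
class DatabaseRequest:
    """Hypothetical self-service request; the defaults are the safe choice."""
    name: str
    owner_team: str
    storage_gb: int = 20
    backups_enabled: bool = True       # on unless someone explicitly turns it off
    publicly_accessible: bool = False  # private unless someone explicitly opens it


MAX_SELF_SERVICE_STORAGE_GB = 500  # assumed cost guardrail


def check_guardrails(request: DatabaseRequest) -> DatabaseRequest:
    """Return the request unchanged if it is safe, otherwise raise with every problem."""
    problems = []
    if not request.backups_enabled:
        problems.append("backups cannot be disabled")
    if request.publicly_accessible:
        problems.append("public access is not available through self-service")
    if request.storage_gb > MAX_SELF_SERVICE_STORAGE_GB:
        problems.append(
            f"storage above {MAX_SELF_SERVICE_STORAGE_GB} GB needs an exception"
        )
    if problems:
        raise GuardrailViolation("; ".join(problems))
    return request


if __name__ == "__main__":
    print(check_guardrails(DatabaseRequest("orders", "team-checkout")))
```

A team that fills in only a name and an owner gets backups without asking, and the common mistakes are rejected immediately with a message rather than caught later on a checklist.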
The platform team’s role in a self-service platform is not to be the operator. It is to be the maintainer of the paved paths — adding new ones as new needs emerge, deprecating old ones, improving the defaults, investigating the edge cases. The teams that consume the platform operate their own services.
This is uncomfortable for teams that grew up with a central operations function. The transition “we run it” → “we build the paved paths for others to run it themselves” is a real cultural shift, not just a reorganization. Some platform engineers do not want to make it, and some organizations underestimate how much of the work is product management, documentation, and support rather than infrastructure engineering. Both are common failure modes.
Adoption is the metric
The single metric that distinguishes a working platform from an aspirational one: are the stream-aligned teams adopting it voluntarily?
A platform with high adoption is a platform that is making teams’ lives easier. They use it because it is the fastest and safest way to do their work. The platform team can iterate on the product; the users are real; the feedback is direct.
A platform with low adoption — teams avoiding it, building their own alternatives, or using it only where mandated — is telling you something. Usually one of: the paved paths do not cover the common cases, the abstractions do not match how teams think, the documentation is incomplete, or the platform is slower than not-using-it for the common case. Adoption without mandate is a vote of confidence. Lack of adoption is data, not a discipline problem.
A bad response: mandate adoption. Require teams to use the platform. Measure “platform usage” as the metric. This works on paper and fails in practice — mandated platforms attract resentment, workarounds, and eventually the same low adoption, just covered in compliance theater. The signal is lost.
A good response: treat low adoption as a product failure. Talk to the teams that aren’t adopting. Find out why. Fix the gap. Ship the fix. Measure again. The platform is a product. The users are telling you what the product lacks.
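One way to keep the signal is to measure adoption only among teams that were free to choose. The sketch below assumes a hypothetical `TeamUsage` record that notes whether a team uses the platform and whether it was mandated to; the team names and numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TeamUsage:
    """Hypothetical survey row: does the team use the platform, and was it told to?"""
    team: str
    uses_platform: bool
    mandated: bool


def voluntary_adoption_rate(usage: List[TeamUsage]) -> Optional[float]:
    """Share of teams that adopted the platform among those free to choose.

    Returns None when every team is mandated, because then there is no signal.
    """
    free_to_choose = [u for u in usage if not u.mandated]
    if not free_to_choose:
        return None
    adopters = sum(1 for u in free_to_choose if u.uses_platform)
    return adopters / len(free_to_choose)


if __name__ == "__main__":
    survey = [
        TeamUsage("checkout", uses_platform=True, mandated=False),
        TeamUsage("search", uses_platform=True, mandated=False),
        TeamUsage("payments", uses_platform=True, mandated=True),
        TeamUsage("data", uses_platform=False, mandated=False),
    ]
    print(voluntary_adoption_rate(survey))
```

Mandated usage is excluded from both the numerator and the denominator, so a mandate cannot inflate the number; a fully mandated platform reports no rate at all.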
The PM role
A platform team without a product manager is usually a platform team with an engineering manager who is doing PM work in their spare time, poorly. Platform product management is a specific skill: talking to internal users, synthesizing their needs, prioritizing against a roadmap, measuring impact, saying no to things that sound good but aren’t core.
The platform PM’s job resembles any other PM’s, with a few distinctive differences:
- The users are colleagues. They are easier to reach than external users, but they are also more likely to tell you what you want to hear. You have to dig past the politeness to find the real friction.
- The users can leave. A team that finds the platform too expensive or too limiting can build their own solution. Nothing contractual holds them in place, because the “product” is free; the only switching cost is the effort of building the alternative. Competition comes from within.
- The metric is not revenue. Adoption, time-to-deploy, developer satisfaction, operational burden on teams — these are the real outcomes. None of them is as clean as dollars, but all of them are measurable.
Teams that hire a platform PM early tend to produce platforms that are used. Teams that rely on “the engineers know what’s needed” tend to produce platforms that reflect what engineers find interesting to build rather than what users find painful. Both outcomes occur; the first is more likely when the PM role is filled deliberately rather than by default.
The common failure modes
A partial taxonomy of how platform engineering efforts go wrong, each corresponding to a missed principle above.
The ivory tower. The platform team designs the perfect abstraction in isolation, releases it, and is surprised when nobody adopts it. The abstraction solves problems nobody had. The teams that were supposed to benefit built their own thing six months ago. Fix: product discovery before product building. Talk to the users first.
The wrapper farm. The platform is a thin wrapper around each cloud service it exposes. The wrappers add no value but have to be maintained against the underlying services’ evolution. The platform team spends all its time keeping up with AWS releases rather than building new capability. Fix: remove the wrappers, expose the underlying services directly where they are good enough.
The bottleneck. The platform team owns deployment and every deploy requires their involvement. Teams wait on the platform team for their work. The platform is infrastructure, not a product. Fix: self-service. Paved paths with automation, not tickets.
The mandatory-platform-nobody-likes. Leadership mandates adoption. The platform is full of friction. Teams comply, build workarounds, and avoid anything that would put them in contact with the platform team. Morale on both sides suffers. Fix: treat low adoption as product feedback, not insubordination. Lift the mandate, fix the product, let voluntary adoption come back.
The platform-as-operations-team. The platform team is on call for every production issue in every team’s service. They are the first responders to everyone’s problems. Their backlog is a queue of tickets. Fix: teams own their services. Platform team owns the platform. The DevOps principle of “you build it, you run it” is compatible with a platform team: the platform makes “running it” tractable for the building team.
The internal rebuild of every cloud primitive. The platform team reimplements message queues, databases, caches, and object stores because “vendor lock-in” is a stated concern. The reimplementations are worse than the originals and cost more to maintain. Fix: accept vendor lock-in. You already have it. The cost of avoiding it is higher than the cost of living with it, almost always.
Each of these has a fix that takes months to execute once identified. The first step is always diagnosing which one you have. The second is admitting it publicly. Neither is comfortable; both are necessary.
The internal developer platform
The current term of art for a mature platform that wraps most of the above into a single user experience is internal developer platform (IDP). Backstage is the most common backbone; Humanitec, Port, and others offer similar things with different tradeoffs.
An IDP typically provides:
- A catalog of services, teams, APIs, and dependencies across the organization. “Where is this service’s code, who owns it, what is its SLO, what are its dashboards, who deploys it?”
- Scaffolding for creating new services from templates that embody the golden paths.
- Operational surfaces — dashboards, deployments, runbooks, docs — consolidated per service, visible to the owning team.
- Governance plumbing — security reviews, ownership verification, policy enforcement — implemented as automation against the catalog rather than as tickets.
An IDP is an expression of platform engineering, not a substitute for it. Buying or adopting Backstage does not give you a platform; it gives you a portal. The platform is the set of capabilities and paved paths the portal surfaces. An empty Backstage instance is a very visible symptom that the underlying platform is not yet mature.
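The catalog and the "governance as automation" idea fit in a few lines. This is a sketch of the data shape only, not Backstage's actual catalog model; `CatalogEntry`, `REQUIRED_FIELDS`, `audit`, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record answering the questions listed above."""
    service: str
    owner: Optional[str] = None
    repo_url: Optional[str] = None
    slo: Optional[str] = None
    dashboards: Tuple[str, ...] = ()


# Assumed policy: every service must have these fields filled in.
REQUIRED_FIELDS = ("owner", "repo_url", "slo", "dashboards")


def missing_fields(entry: CatalogEntry) -> List[str]:
    """Names of required fields that are empty for this entry."""
    return [name for name in REQUIRED_FIELDS if not getattr(entry, name)]


def audit(catalog: List[CatalogEntry]) -> Dict[str, List[str]]:
    """Map each non-compliant service to its gaps; compliant services are omitted."""
    report = {}
    for entry in catalog:
        gaps = missing_fields(entry)
        if gaps:
            report[entry.service] = gaps
    return report


if __name__ == "__main__":
    catalog = [
        CatalogEntry("checkout", "team-payments", "https://example.com/checkout",
                     "99.9% availability", ("https://example.com/dash/checkout",)),
        CatalogEntry("legacy-export", repo_url="https://example.com/legacy-export"),
    ]
    print(audit(catalog))
```

Run on a schedule, a check like this replaces the ownership-verification ticket: the catalog is the source of truth, and the report is the work queue.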
The rule
The platform is a product. Its users are the engineering teams. Its success metric is adoption by choice. Build the thinnest thing that makes those teams more effective. Paved paths, not pavement everywhere. Self-service, not tickets. Documentation, onboarding, support — all the things you would expect of an external product, applied to an internal one.
Most platform efforts fail because they are framed as infrastructure projects and executed as infrastructure projects. The successful ones are framed as product development and executed as product development — with PMs, roadmaps, user research, and metrics that measure outcomes instead of outputs.
The tooling is incidental. Backstage, Kubernetes, Terraform, the cloud’s managed services — each of these is useful, and none of them is the platform. The platform is the discipline of making the tooling usable by teams that did not build it, at the scale of your organization. Treat the platform like a product. The rest follows.