Containers and Kubernetes: the unit of deployment is a process, not a machine


Containers are older than Docker. Linux had namespaces and cgroups for years before anyone packaged them into something a normal engineer would use; FreeBSD jails and Solaris zones predate Linux containers entirely. What Docker did in 2013 was not invent isolation; it invented an image format and a workflow around it, and that workflow — build, push, pull, run — turned out to be the thing the industry was waiting for.

The pitch, compressed to one sentence: a container is a process that carries its filesystem with it. That is almost all there is. The process runs on the host kernel, sees a filesystem it brought along, has its own view of the network and process tree, and can be killed and restarted without the host caring. The image is the frozen filesystem plus a little metadata. The runtime is the piece that unpacks the image, wires up the namespaces, and starts the process.

Kubernetes then sits on top of all of this and answers a different question: given a fleet of machines and a set of containers that need to run, who decides which container runs where, how they find each other, and what happens when any of it breaks? Most of the apparent complexity of Kubernetes is the cost of doing that job generically — across clouds, workload types, and organizational needs — rather than as a bespoke solution for one team.

This post is about what containers actually are, what the image format buys you that a VM does not, what Kubernetes is really solving underneath the YAML, and the handful of patterns that matter more than the tooling.

What a container actually is

A container is a Linux process (or set of processes) running under a combination of kernel features that make it look, from inside, like it has its own machine:

  • Namespaces isolate what the process can see. PID namespace: its own process tree. Network namespace: its own network interfaces. Mount namespace: its own filesystem view. UTS namespace: its own hostname. There are others; the point is that each axis of “what does this process see” is a separate, independently-scoped abstraction.
  • cgroups limit what the process can consume. CPU shares, memory caps, IO bandwidth, process counts. cgroups do not isolate; they ration.
  • A root filesystem mounted from an image, usually as a union of read-only layers plus a writable top layer. This is what makes containers cheap to start and cheap to store.
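The namespace half of this is observable without any container tooling at all. A minimal sketch, assuming a Linux host with unprivileged user namespaces enabled:

```shell
# Start a shell in fresh user, PID, and mount namespaces, and remount /proc
# so the process tree inside reflects only the new PID namespace.
# From inside, the shell sees itself as PID 1 of an otherwise empty tree.
unshare --user --pid --fork --mount-proc sh -c 'echo $$'
```

That is a "container" with no image and no runtime — just kernel features. Docker adds the filesystem, the cgroups, and the workflow.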

No hypervisor. No guest kernel. The container’s system calls go directly to the host kernel. This is why containers start in milliseconds where VMs start in tens of seconds, and why the overhead of running one is close to the overhead of running the same process without the container.

It is also why the isolation is weaker than a VM’s. A kernel bug that lets a container escape its namespaces gives you the host. A VM escape requires a hypervisor bug, which is a much harder target. For most workloads this tradeoff is fine; for multi-tenant hosting of untrusted code, it is not, which is why gVisor, Kata Containers, and Firecracker exist — they reintroduce part of the VM cost to buy back the VM’s isolation.

The image is the contract

What made Docker ubiquitous was not the runtime. It was the image format and the Dockerfile. An image is a content-addressed tarball of layers plus a manifest; a Dockerfile is a declarative build for one. The important properties:

  • Layered. Each instruction in the Dockerfile produces a layer. Layers are shared across images that share a prefix — pulling a new image on a host that already has the base layers only downloads the new layers. Storage and transfer are cheap in proportion to what is actually different.
  • Content-addressed. Each layer and each image has a digest. myapp:latest is a pointer to a digest; myapp@sha256:abc... is the thing itself. Deployments that pin to digests are reproducible in a way that tag-based deployments are not.
  • Self-contained. The image carries the application’s dependencies — libraries, language runtime, config defaults, the application binary itself. “Works on my machine” collapses because the machine is now part of the artifact.
  • Portable. The OCI image spec standardized the format in 2017. Any OCI-compliant runtime can run any OCI-compliant image. Docker is one runtime; containerd, CRI-O, and Podman are others. The image is not tied to the runtime that built it.
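Concretely, an OCI image manifest is a small JSON document that points at the config and layer blobs by digest. A trimmed, hypothetical example (digests truncated):

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:4a5b...",
    "size": 7023
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:9d8c...",
      "size": 32654
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:7e2f...",
      "size": 16724
    }
  ]
}
```

The image digest is the digest of this manifest, which is why pinning to it pins everything beneath it.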

The workflow follows: build an image, push it to a registry, pull it somewhere, run it. The registry is the distribution mechanism, the image is the artifact, and the runtime is a commodity. This is the shape that made containers stick. The isolation was always there; the artifact format was not.

Dockerfile discipline

Most production image problems are Dockerfile problems. A partial list of the ones that recur:

  • Layer ordering. Layers are cached in build order, invalidated top-down. COPY . . early means every source change busts the cache for everything below it. COPY package.json then npm install then COPY . . keeps the install layer cached across source changes. Image builds that take ten minutes usually take ten minutes because someone reordered the Dockerfile without thinking about cache.
  • Base image. ubuntu:22.04 carries hundreds of megabytes of things your application does not need. alpine is smaller but uses musl libc, which breaks some native extensions. distroless images (no shell, no package manager, just the runtime) are the current best practice for final images — smaller attack surface, smaller pull time, harder to debug in prod. The tradeoff is real; pick deliberately.
  • Multi-stage builds. A build stage that pulls in compilers, dev dependencies, and build tools; a final stage that copies only the built artifacts into a minimal base. The final image does not contain the toolchain that produced it. This is the single biggest improvement most images can make.
  • Non-root users. Containers running as root have the host’s root on anything mounted in. Add a USER directive. The default is wrong.
  • Secrets. Secrets baked into the image stay in the image forever, including in layer history. Use build-time mounts (--mount=type=secret), runtime injection (env vars, mounted files from a secrets manager), or dedicated tools. Not ENV DATABASE_PASSWORD=....
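The points above compose into one Dockerfile shape. A sketch for a hypothetical Node.js service — image names, paths, and versions are illustrative, not prescriptive:

```dockerfile
# Build stage: compilers, dev dependencies, and build tools live here only.
FROM node:20 AS build
WORKDIR /app
# Copy the dependency manifests first so the install layer stays cached
# across source changes.
COPY package.json package-lock.json ./
RUN npm ci
# Source changes only invalidate layers from this point down.
COPY . .
RUN npm run build

# Final stage: runtime only. No shell, no package manager, no toolchain.
FROM gcr.io/distroless/nodejs20-debian12
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
# Run as a non-root user; distroless images ship one.
USER nonroot
CMD ["dist/server.js"]
```

One Dockerfile, four of the five items: layer ordering, a distroless base, a multi-stage build, and a non-root user. Secrets are the one thing that should never appear in it.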

None of this is secret knowledge. All of it is violated in production images right now, because the default path is the path of least resistance.

Container orchestration: what the scheduler does

Once you have containers, you need to run them somewhere. At the scale of one or two hosts, docker run is fine. At any larger scale, you need a system that answers questions like:

  • Given this set of containers to run and that set of machines to run them on, which containers go where?
  • If a machine dies, where do its containers go instead?
  • If a container crashes, how many times should it be restarted and how fast?
  • How do containers on different machines find each other?
  • How do containers find the database, the cache, the message bus?
  • How does traffic from outside the cluster reach the right container?
  • How do I roll out a new version without dropping traffic?

These questions have to be answered by something. Kubernetes is one answer. ECS, Nomad, and a dozen PaaS products are others. The questions are the durable part; the tooling is the answer-of-the-year.

Kubernetes in one breath

Kubernetes is a control loop system. You declare the state you want — “I want three replicas of this container, exposed on port 80, with these environment variables” — and a collection of controllers work continuously to make the observed state match the declared state. If a node dies, the controller notices the observed replica count dropped to two, and schedules a third somewhere else. If a pod crashes, the kubelet on its node restarts it. If the desired replica count changes, a controller scales up or down.

The top-level objects worth knowing:

  • Pod. The smallest unit. Usually one container; sometimes a few tightly-coupled ones sharing a network namespace and volumes (sidecar pattern). Pods are not long-lived — they can be killed and replaced at any time.
  • Deployment. Declares “I want N replicas of this pod template.” Manages rolling updates when the template changes. This is what most stateless services are.
  • StatefulSet. Like a Deployment, but pods have stable names (db-0, db-1, db-2) and stable storage. For things that need identity — databases, message brokers, anything where one instance is not interchangeable with another.
  • DaemonSet. One pod per node. For node-level agents — log shippers, monitoring collectors, CNI plugins.
  • Service. A stable virtual IP and DNS name that routes to the current set of pods matching a label selector. Pods come and go; the Service does not.
  • Ingress. How external HTTP traffic reaches Services. Usually backed by a cloud load balancer or an in-cluster reverse proxy (nginx, Envoy, Traefik).
  • ConfigMap / Secret. Configuration and secrets, injected into pods as env vars or mounted files.
  • Namespace. A soft boundary for names. Not isolation in any strong sense; a tool for organization and RBAC.
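A minimal Deployment and Service for a hypothetical stateless API ties several of these objects together (names and registry are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: my-api
          # Pin to a digest for reproducible deploys.
          image: registry.example.com/my-api@sha256:abc123...
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: my-api
  ports:
    - port: 80
      targetPort: 8080
```

The Service routes to whatever pods currently carry the `app: my-api` label; the Deployment makes sure three of them exist.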

Everything else in Kubernetes — HPAs, PodDisruptionBudgets, NetworkPolicies, CRDs, the operator pattern, service meshes — is a specialization of the same control-loop principle applied to new axes.

Scheduling, briefly

The scheduler’s job: place pending pods onto nodes. Inputs: pod’s resource requests (CPU, memory), pod’s constraints (node selectors, affinities, taints/tolerations), node’s available capacity, current cluster state. Outputs: an assignment.

Two resource concepts that matter:

  • Requests. What the pod is guaranteed. The scheduler will not place a pod on a node that does not have the requested resources free. Used for scheduling decisions.
  • Limits. What the pod cannot exceed. Used for enforcement at runtime. Exceeding the memory limit gets the pod OOM-killed. Exceeding the CPU limit throttles it.

The common mistake: setting requests equal to limits, or not setting requests at all. Requests too low oversubscribes nodes; pods get evicted when real usage catches up. Requests too high wastes capacity. Limits too low causes OOM kills in normal operation. Getting this right requires measuring actual usage, not guessing.
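In the pod spec, the distinction is two stanzas on the container. The figures here are illustrative; derive yours from measured usage:

```yaml
resources:
  requests:
    cpu: 250m        # the scheduler guarantees this much free on the node
    memory: 256Mi
  limits:
    memory: 512Mi    # exceeding this gets the container OOM-killed
    cpu: "1"         # exceeding this throttles rather than kills
```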

Rollouts and the rollout strategy

Updating a Deployment with a new image triggers a rolling update. Kubernetes creates pods of the new version and terminates pods of the old version, one batch at a time, according to maxSurge (how many extra pods are allowed above desired count) and maxUnavailable (how many below).
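The two knobs live on the Deployment's update strategy, e.g.:

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # extra pods allowed above the desired count
      maxUnavailable: 0    # never dip below the desired count
```

With `maxUnavailable: 0`, capacity never drops during a rollout, at the cost of briefly running surge pods.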

The defaults give you a rolling update. Other strategies are achievable with a bit more work:

  • Rolling update (default). Gradually replace old pods with new. Cheap, slow, partial rollback. Both versions coexist during the rollout, which means the old and new versions must be compatible with each other.
  • Recreate. Kill all old pods, then start new ones. Downtime. Used when two versions genuinely cannot coexist (schema changes, singleton workloads). Rare.
  • Blue/green. Run two full copies. Switch traffic atomically. Fast rollback, expensive capacity. Implemented via two Deployments and a Service selector switch, or via a service mesh.
  • Canary. Route a small percentage of traffic to the new version, watch the SLIs, ramp up if clean, roll back if not. Implemented natively in service meshes (Istio, Linkerd) or with progressive-delivery tools (Argo Rollouts, Flagger).

The schema/API evolution post covers the compatibility side of this; none of these strategies work if the two versions cannot both be live at once. Expand-contract is the pattern you need before canary is safe.

Liveness, readiness, startup

Three probes, three purposes, routinely conflated:

  • Liveness. “Is this pod still working, or should it be killed and restarted?” Failing liveness causes a restart. The default should usually be no liveness probe at all — a pod that doesn’t crash on its own usually doesn’t need to be killed. Bad liveness probes cause crash loops during real incidents.
  • Readiness. “Should traffic go to this pod right now?” Failing readiness removes the pod from Service endpoints. Used for “warming up” (waiting for a cache to populate, a connection pool to establish), for load shedding under pressure, for gating on downstream dependencies.
  • Startup. “Is this pod done starting yet?” Startup probes suppress liveness and readiness until they pass. For slow-starting applications (JVMs, anything that loads a large dataset at boot) that would otherwise be killed by the liveness probe before they finish starting.

The useful heuristic: set a readiness probe that reflects whether the pod can serve traffic. Avoid liveness probes unless you have a specific hang or deadlock scenario they are catching.
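Following that heuristic, a typical spec has a readiness probe and — for a slow-booting app — a startup probe, with no liveness probe at all. Paths and timings are illustrative:

```yaml
containers:
  - name: my-api
    readinessProbe:
      httpGet:
        path: /healthz/ready     # should answer "can I serve traffic now"
        port: 8080
      periodSeconds: 5
    startupProbe:
      httpGet:
        path: /healthz/started
        port: 8080
      failureThreshold: 30       # up to 30 * 10s = 5 minutes to boot
      periodSeconds: 10
```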

Networking: the flat-network model

Kubernetes networking makes one strong assumption: every pod has its own IP, and every pod can reach every other pod on that IP, without NAT. This is the flat network model. CNI plugins (Calico, Cilium, the cloud providers’ own) implement it.

From inside a pod, the rest of the world looks like this:

  • Other pods are reachable by IP, but IPs are not stable. You almost never use them directly.
  • Services give you stable DNS. http://my-api.default.svc.cluster.local (or http://my-api within the same namespace) routes to a healthy pod behind the Service.
  • External services (databases, third-party APIs) are reached the way they always were — DNS and IP. Kubernetes does not mediate egress by default.
  • Ingress controllers terminate external HTTP traffic and route it to Services based on hostname and path.

NetworkPolicies let you restrict pod-to-pod traffic, but they are opt-in; the default is “everything can talk to everything.” Treat this as a security problem the moment you care about blast radius between services.
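The usual starting point is a default-deny policy per namespace, then explicit allows on top. Denying all ingress to every pod in a namespace looks like this (namespace name is illustrative, and enforcement requires a CNI plugin that implements NetworkPolicy):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-namespace
spec:
  podSelector: {}      # an empty selector matches every pod in the namespace
  policyTypes:
    - Ingress          # no ingress rules listed, so all ingress is denied
```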

A service mesh (Istio, Linkerd) replaces a lot of this hand-rolled routing with a sidecar proxy in every pod. You get mTLS between services, traffic splitting for canaries, retry and timeout policies, per-request observability, and authorization policies — all without the applications knowing. The cost is more moving parts and a real operational burden. Worth it for estates large enough that cross-cutting concerns are actually crossing many services; overkill for small ones.

Storage

Stateless workloads are the easy case. A pod dies, Kubernetes makes a new one, the new one has no state, nobody notices. Almost all services in a well-factored microservices estate are stateless in this sense — state lives in managed databases, object stores, caches.

When state has to live in the cluster, the primitives are:

  • PersistentVolume (PV). A piece of storage in the cluster — often backed by a cloud disk (EBS, Persistent Disk, Azure Disk).
  • PersistentVolumeClaim (PVC). A pod’s request for storage. Bound to a matching PV by the control plane.
  • StorageClass. The template that tells the cluster how to provision a PV dynamically when a PVC requests one.
  • StatefulSet. The workload controller that gives each pod a stable PVC.
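The pieces connect through a StatefulSet's volumeClaimTemplates: each replica gets its own PVC, provisioned on demand through the named StorageClass. A sketch with illustrative names and sizes:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd   # hypothetical StorageClass
        resources:
          requests:
            storage: 100Gi
```

Pod db-1's claim (data-db-1) survives the pod's deletion; the replacement db-1 reattaches the same disk.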

Running a database in Kubernetes is possible and usually the wrong call. The cloud-managed alternative (RDS, Cloud SQL, Aurora) runs the same database with backups, monitoring, failover, and patching that the cluster does not give you for free. Operators (CloudNativePG, Zalando’s Postgres Operator) narrow the gap, but the cost of running a production database well is not primarily a “does Kubernetes have a PVC?” question.

Autoscaling

Three levels, each solving a different problem:

  • Horizontal Pod Autoscaler (HPA). Scales replica count up and down based on metrics — CPU, memory, custom. For stateless workloads under varying load.
  • Vertical Pod Autoscaler (VPA). Adjusts a pod’s requests and limits based on actual usage. Useful for sizing; rarely used as a live controller because resizing requires restarting the pod.
  • Cluster Autoscaler. Adds and removes nodes when pending pods cannot be scheduled or when nodes are under-utilized. This is the one that actually changes the cloud bill.
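An HPA targeting average CPU utilization, sketched against the Deployment from earlier (autoscaling/v2 API, thresholds illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the pods' CPU *requests*
```

Note that utilization is measured against requests — another reason getting requests right matters.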

Karpenter is the current best-in-class cluster autoscaler on AWS — it reasons about instance types rather than just node counts, and picks cost-effective capacity for the pending workload.

The operator pattern

A Kubernetes operator is a controller you write that watches custom resources (CRDs) and reconciles some piece of the world towards a desired state. The pattern is: define a CRD describing the thing you want to manage (a PostgresCluster, a KafkaTopic, a certificate renewal), write a controller that reacts to changes in instances of that CRD and makes the world match.
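From the user's side, the CRD makes the desired state look like any other Kubernetes object. A hypothetical PostgresCluster resource — the API group, kind, and fields here are invented for illustration:

```yaml
apiVersion: databases.example.com/v1   # hypothetical API group
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  replicas: 3
  version: "16"
  storage: 100Gi
# The operator's controller watches objects of this kind and reconciles
# StatefulSets, Services, backups, etc. until they match the spec.
```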

Operators extend Kubernetes from “container scheduler” into “generic control loop runtime.” Certificate management (cert-manager), secrets from external vaults (external-secrets), managed databases — all of these are operators. For an application team, writing an operator is usually overkill; for a platform team, it is one of the primary patterns.

What Kubernetes is not

A list of things Kubernetes does not solve, despite periodically claiming to:

  • Multi-tenancy with strong isolation. Namespaces are soft boundaries. Network policies and RBAC get you partway. Noisy neighbors (CPU, memory, network, IO) are not prevented by default. Hostile tenants need hypervisor-level isolation.
  • Secrets management. Kubernetes Secrets are base64-encoded ConfigMaps. They are not encrypted by default. Real secrets management means integration with Vault, AWS Secrets Manager, or equivalent — via CSI drivers or external-secrets.
  • Configuration management. ConfigMaps are fine for static config. Dynamic config — feature flags, runtime-tunable thresholds — needs a real system (LaunchDarkly, Unleash, or a home-grown one on top of a database).
  • Authorization for your application. RBAC authorizes callers of the Kubernetes API. It says nothing about whether a user of your app can delete an order. That is your application’s problem.
  • A PaaS out of the box. A bare Kubernetes cluster is a lot closer to “Linux with a scheduler” than to Heroku. The golden paths — how do I deploy a new service, how do I get observability, how do I do secrets — are what platform teams build on top.
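The Secrets point is easy to verify for yourself: the data field of a Secret is plain base64, reversible by anyone with read access to the object.

```shell
# base64 is encoding, not encryption: anyone who can read the Secret
# recovers the value directly. (This is the encoding of a sample value.)
echo 'aHVudGVyMg==' | base64 -d
```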

Mistaking Kubernetes for any of these leads to the “empty Kubernetes, no platform” failure that the platform engineering post describes.

The rule

Containers are packaging plus isolation. The image is the artifact; the runtime is a commodity. Kubernetes is a generic control loop over a fleet of machines running those containers, and almost every specific feature — Deployments, HPAs, operators — is an instance of the same declarative reconciliation pattern.

Most of the difficulty in running containers in production is not the isolation or the scheduling. It is everything around them — image hygiene, rollout discipline, storage for the things that need it, networking that is safe by default, and a platform on top that makes the paved path the easy path. Kubernetes hands you the primitives. The usable system is something you build with them.