One Pod to Node Root: Defending Kubernetes from Copy Fail and Dirty Frag

Copy Fail and Dirty Frag are different bugs in different kernel subsystems, but they escalate the same way: one unprivileged container to root on the node. Here is the layered defense that stops the whole class, not just this month's CVEs.

Two Linux kernel privilege escalations landed this spring, and if you run Kubernetes you should treat both as node-level incidents waiting to happen. They are unrelated bugs in unrelated subsystems, disclosed weeks apart. What makes them worth writing about together is that they share a punchline: an unprivileged process in one pod becomes root on the node, and from there, every other pod on that node is yours.

That shared punchline is not a coincidence. It is the defining property of containers - they share one kernel - and it is why the defense for both is nearly identical.

Copy Fail (CVE-2026-31431)

Copy Fail is a logic flaw in algif_aead, part of the kernel's AF_ALG userspace crypto interface. An in-place optimization from 2017 lets an unprivileged user drive a deterministic, controlled 4-byte write into the page cache of any readable file. There is no race to win and nothing is written to disk, and the attacker never has to leave the container to do it.

In Kubernetes that page cache is shared. When two pods run images built on the same base layer, the kernel serves those identical files from the same physical pages. So an attacker in a throwaway pod can corrupt the in-memory copy of a binary that a neighbouring pod - or a node daemon - is about to execute. Nothing on disk changes; integrity scanners see clean files.

It carries a CVSS of 7.8, a working public PoC has been validated against ACK, EKS and GKE, and it is on CISA's Known Exploited Vulnerabilities list. The fix is in kernel 6.13.x and the distro backports.

Dirty Frag (CVE-2026-43284, CVE-2026-43500)

Dirty Frag is a local privilege escalation chain in two niche networking subsystems: ESP (IPsec) and RxRPC. Like Copy Fail it abuses page-cache-backed buffers the kernel does not exclusively own, this time reachable through splice()-style paths. The outcome is the familiar one: any local unprivileged user gets root. There is no remote vector.

The important nuance for cluster operators is the precondition. Reaching the vulnerable code generally needs CAP_NET_ADMIN and the ability to open obscure socket families - AF_KEY, XFRM netlink, or AF_RXRPC. A default-configured pod that has dropped capabilities and runs under a seccomp profile largely cannot get there. A privileged or hostNetwork pod absolutely can. (A third related CVE, CVE-2026-46300, "Fragnesia", travels with the same advisory.)

The network is not your security boundary, and neither is the container. The kernel is. Treat every node as a single trust domain and design for the day a pod turns hostile.

The defense is a stack, not a patch

Patching is necessary but not sufficient: kernel CVEs of this shape arrive several times a year, and you are always exposed in the window before a fix ships and your nodes reboot into it. So build the layers that make the next one a non-event too.

1. Patch the kernel - and automate the reboot

This is the only fix that closes the actual bug. What separates teams that shrug off kernel CVEs from teams that scramble is automation: a node-image pipeline that rebuilds on a cadence, plus orchestrated draining and rolling replacement.

# confirm the running kernel on every node, fast
kubectl get nodes -o custom-columns=\
'NODE:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion'
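To turn that listing into a to-do list, a small filter can flag nodes whose kernel sorts below a chosen minimum. This is a minimal sketch: the function names are illustrative, `MIN_KERNEL` is a placeholder threshold rather than an authoritative fixed version, and the comparison relies on `sort -V` from GNU coreutils.

```shell
# MIN_KERNEL is a hypothetical threshold; take the real fixed version
# from your distro's advisory.
MIN_KERNEL="${MIN_KERNEL:-6.13.0}"

# Succeeds (exit 0) when version $1 sorts at or above version $2.
kernel_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Reads "node kernelVersion" pairs on stdin and prints the nodes below MIN_KERNEL.
nodes_below_min() {
  while read -r node kernel; do
    # Drop the distro suffix ("6.8.0-1021-aws" -> "6.8.0") before comparing.
    kernel_at_least "${kernel%%-*}" "$MIN_KERNEL" || printf '%s %s\n' "$node" "$kernel"
  done
}
```

Feed it the same query with `--no-headers` added, piped into `nodes_below_min`. Treat the result as a coarse first pass: a distro kernel with a backported fix can carry a lower version number than the upstream release and still be patched.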

Prefer replacing nodes over patching them in place - immutable nodes from a known-good image beat apt upgrade on a pet. Tools like the Kured reboot daemon or your managed pool's node auto-upgrade turn "patch the fleet" into a background process instead of a fire drill.

2. Turn on seccomp - the cheapest win you are probably skipping

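Both exploits need an unusual syscall surface: Copy Fail needs an AF_ALG socket, Dirty Frag needs AF_KEY / AF_RXRPC. A seccomp filter is the natural place to cut that surface down, but Kubernetes does not apply one unless you ask, either per pod or node-wide with the kubelet's --seccomp-default flag. Pods run Unconfined by default. The container runtime's default profile removes a long list of rarely needed syscalls, but how it treats individual socket families varies by runtime and version, so check yours before counting on it to stop these two.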

Flip it on, cluster-wide if you can, and per-pod everywhere else:

securityContext:
  seccompProfile:
    type: RuntimeDefault
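One way to confirm the setting actually took effect is to read the `Seccomp:` field the kernel exposes in `/proc/<pid>/status`. The helper below is a sketch with an illustrative name; it only decodes that field.

```shell
# Prints the seccomp mode of a process, given the path to its /proc/<pid>/status file.
# Kernel values: 0 = disabled, 1 = strict, 2 = filter (what RuntimeDefault produces).
seccomp_mode() {
  status_file="${1:-/proc/self/status}"
  # Match "Seccomp:" exactly so the newer "Seccomp_filters:" line is ignored.
  case "$(awk '/^Seccomp:/ {print $2}' "$status_file")" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}
```

Run inside a container (for example against `/proc/1/status`), an Unconfined pod should report `disabled` and a pod with a profile applied should report `filter`.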

For defence in depth you can layer an explicit deny of the socket families these bugs ride on. seccomp can filter the socket() domain argument directly:

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ERRNO",
      "args": [{ "index": 0, "value": 38, "op": "SCMP_CMP_EQ" }]
    }
  ]
}

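That rule rejects AF_ALG (domain 38). Add sibling rules for AF_RXRPC (33) and AF_KEY (15) and you have denied the socket-based entry points for both bugs in a few lines - independent of whether the node is patched. The XFRM netlink path into Dirty Frag is not covered by these rules; dropping CAP_NET_ADMIN, covered in the next step, is what closes it.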
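Written out in full, the three-family profile could be installed on a node roughly as follows. This is a sketch: the function name, directory and file name are assumptions, and each family gets its own rule because multiple `args` inside one rule are combined with AND.

```shell
# Writes a Localhost seccomp profile that denies socket() for AF_ALG (38),
# AF_RXRPC (33) and AF_KEY (15). Directory and file name are assumptions.
write_deny_profile() {
  dir="${1:-/var/lib/kubelet/seccomp/profiles}"
  mkdir -p "$dir"
  cat > "$dir/deny-socket-families.json" <<'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    { "names": ["socket"], "action": "SCMP_ACT_ERRNO",
      "args": [{ "index": 0, "value": 38, "op": "SCMP_CMP_EQ" }] },
    { "names": ["socket"], "action": "SCMP_ACT_ERRNO",
      "args": [{ "index": 0, "value": 33, "op": "SCMP_CMP_EQ" }] },
    { "names": ["socket"], "action": "SCMP_ACT_ERRNO",
      "args": [{ "index": 0, "value": 15, "op": "SCMP_CMP_EQ" }] }
  ]
}
EOF
}
```

A pod would then reference it with `type: Localhost` and `localhostProfile: profiles/deny-socket-families.json`, a path relative to the kubelet's seccomp directory. A container takes a single seccomp profile, so an allow-by-default profile like this one replaces RuntimeDefault rather than stacking on it; in practice you would probably merge these rules into a copy of your runtime's default profile.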

3. Drop capabilities and stop running as root

Dirty Frag needs CAP_NET_ADMIN. Almost nothing you ship does. The "restricted" Pod Security Standard bundles the controls that matter here, but the minimum is:

securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]

Enforce it at the namespace boundary so a forgotten manifest cannot opt out:

apiVersion: v1
kind: Namespace
metadata:
  name: workloads
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest

Dropping CAP_NET_ADMIN alone neuters Dirty Frag for the vast majority of workloads. That is the whole point of least privilege: the exploit needs a capability your pod never should have had.
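To check whether a running container really lost the capability, you can decode the `CapEff:` bitmask in its `/proc/<pid>/status`. CAP_NET_ADMIN is capability number 12, so bit 12 of that mask is the one to test. A minimal sketch, with an illustrative function name:

```shell
# Succeeds when the process whose status file is given holds CAP_NET_ADMIN
# (capability number 12) in its effective set.
has_net_admin() {
  status_file="${1:-/proc/self/status}"
  cap_eff="$(awk '/^CapEff:/ {print $2}' "$status_file")"
  # Bail out with a distinct status if the field is missing.
  [ -n "$cap_eff" ] || return 2
  # CapEff is a hex bitmask; shift bit 12 down and test it.
  [ $(( (0x$cap_eff >> 12) & 1 )) -eq 1 ]
}
```

A container that dropped ALL should show an all-zero mask, and a privileged one will have the bit set.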

4. Shrink the blast radius with the scheduler

The shared kernel means co-tenancy is a security decision, not just a bin-packing one. Keep untrusted or multi-tenant workloads off the same nodes as sensitive ones using a dedicated pool, a taint, and a toleration:

kubectl taint nodes -l pool=untrusted dedicated=untrusted:NoSchedule
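The taint keeps everything else off the untrusted pool; the workloads meant to run there need the matching toleration. A sketch of such a pod, printed from a shell function so it can be piped to `kubectl apply -f -`. The function name, pod name and image are placeholders, and the `pool: untrusted` label is taken from the selector in the taint command above.

```shell
# Prints a Pod manifest that tolerates the dedicated=untrusted taint and
# pins itself to the untrusted pool. Pod name and image are placeholders.
render_untrusted_pod() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-job
spec:
  nodeSelector:
    pool: untrusted
  tolerations:
    - key: dedicated
      operator: Equal
      value: untrusted
      effect: NoSchedule
  containers:
    - name: job
      image: registry.example.com/untrusted-job:latest
EOF
}
```

The nodeSelector matters as much as the toleration. A toleration only permits scheduling onto the tainted nodes; it does not stop the pod from landing anywhere else.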

For genuinely untrusted code - CI runners, customer workloads, anything that evaluates input as instructions - put a second kernel boundary under it with a sandboxed runtime like gVisor or Kata Containers:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc

A gVisor pod that triggers Copy Fail corrupts gVisor's emulated page cache, not the host's. The blast radius stops at the sandbox.

5. Do not share image layers across trust boundaries

This one is specific to Copy Fail, and it is easy to miss. Because the attack works through shared page-cache pages, it only reaches a victim that reads the same file from the same base layer. If your sensitive workloads and your untrusted workloads are built FROM the same image and land on the same node, they share those pages.

Give sensitive workloads their own minimal, distinct base images, and pair that with the node isolation above. Two workloads that never share a node and never share a layer cannot share a poisoned page.
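A quick way to audit this is to compare the layer digests of two images. The sketch below assumes you have already dumped each image's layers to a text file, one `sha256:` digest per line (for example from the `RootFS.Layers` field of `docker image inspect` on a build machine); the function name is illustrative.

```shell
# Prints the layer digests that appear in both files.
# Each file holds one "sha256:..." digest per line; other lines are ignored.
shared_layers() {
  awk 'NR == FNR { seen[$0] = 1; next } /^sha256:/ && seen[$0]' "$1" "$2"
}
```

Empty output suggests the two images share no layers. Any digest printed is a layer whose files could be served from the same page-cache pages if both images land on one node.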

6. Strip unused kernel modules from your nodes

If you do not run IPsec or AFS, the ESP and RxRPC modules are pure attack surface. The same goes for algif_aead if nothing legitimately uses kernel crypto sockets. Block them from loading in your node image or from a privileged bootstrap DaemonSet:

cat > /etc/modprobe.d/harden.conf <<'EOF'
install esp4 /bin/false
install esp6 /bin/false
install rxrpc /bin/false
install algif_aead /bin/false
EOF
rmmod rxrpc esp6 esp4 algif_aead 2>/dev/null || true

A module that cannot load is a vulnerability that cannot be reached. Before rolling this out, verify that no workload on those nodes depends on IPsec, AFS, or kernel crypto sockets, because this will break them.
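After the change rolls out, it is worth confirming that none of the four modules is still resident. A sketch that scans a `/proc/modules`-style listing, where the module name is the first field on each line; the function name is illustrative.

```shell
# Prints any of the hardened modules that are still loaded.
# Reads /proc/modules by default; pass another path to check a saved copy.
loaded_risky_modules() {
  modules_file="${1:-/proc/modules}"
  for mod in esp4 esp6 rxrpc algif_aead; do
    awk -v m="$mod" '$1 == m { print m }' "$modules_file"
  done
}
```

Empty output means none of them is loaded as a module. It does not cover code compiled directly into the kernel, which never appears in `/proc/modules` and cannot be blocked through modprobe.d.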

7. Detect the attempt, and assume breach

None of the above is perfect, so watch for the behaviours these exploits require: a container process opening an AF_ALG, AF_KEY, or AF_RXRPC socket is rare enough to alert on. A runtime sensor like Falco or any eBPF-based tool can flag those tells cheaply.
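If you have no runtime sensor deployed yet, the kernel audit subsystem can log the same signal. The sketch below swaps in auditd in place of Falco or eBPF and generates one rule per socket family; the function name and the key name `risky_socket_family` are arbitrary choices, and the rules cover 64-bit callers only.

```shell
# Prints auditd rules that log every socket() call whose first argument is
# AF_ALG (38), AF_RXRPC (33) or AF_KEY (15). The key name is an arbitrary label.
socket_family_audit_rules() {
  for domain in 38 33 15; do
    printf '%s\n' "-a always,exit -F arch=b64 -S socket -F a0=$domain -k risky_socket_family"
  done
}
```

Redirect the output into a file under `/etc/audit/rules.d/`, reload with `augenrules --load`, and search hits with `ausearch -k risky_socket_family`. These rules log rather than block, so route the events to whatever raises your alerts.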

And accept the premise behind all of it: a container RCE on an unpatched node is a node compromise. Wire your response so that a compromised pod triggers cordon, drain, and replacement of the node, not just a pod restart. Recycling the node evicts an attacker who has poisoned page-cache pages that no file scan will ever find.
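The cordon, drain, and replace sequence can be scripted so it is ready before you need it. A minimal sketch with an illustrative function name: `KUBECTL` is an override hook added here so the sequence can be dry-run, and the timeout is a placeholder.

```shell
# KUBECTL can be overridden (for example with "echo") to dry-run the sequence.
KUBECTL="${KUBECTL:-kubectl}"

# Cordons, drains, and removes a node; stops at the first step that fails.
recycle_node() {
  node="$1"
  [ -n "$node" ] || return 1
  $KUBECTL cordon "$node" &&
  $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=5m &&
  $KUBECTL delete node "$node"
}
```

Deleting the Node object only removes it from the cluster. The underlying machine still has to be terminated and replaced by your autoscaler or node pool, since that is the step that discards the poisoned page cache.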

The takeaway

Copy Fail and Dirty Frag will be patched and forgotten. The class will not. Shared-kernel escapes are a permanent feature of how containers work, and the controls that blunt them - seccomp on, capabilities off, sensitive and untrusted workloads on separate nodes, untrusted code in a sandbox, unused modules gone, and node recycling as your incident reflex - are the same every time.

Turn them on now, while these two are the example and not the incident.
