Introduction
Podman has become the de facto state of the art for running containers on Linux. By design, it comes with a very interesting feature: rootless containers. In this mode, the container runtime itself runs without privileges. This means that exploiting the container runtime would at most grant the attacker the permissions of the user running it (the kernel's own attack surface is out of scope here).
Historically, sysadmins and CI/CD developers have found themselves in situations where they had to run a container inside a container (see Docker-in-Docker, a.k.a. DinD), or to run other software that also deals with cgroups/seccomp/namespaces inside a container (e.g. systemd in unprivileged LXC). We call this “nesting”, and it may bring some security benefits (as always, depending on your threat model).
Nesting Podman in “containers” is supported and actually documented, and the possible combinations are:
- Rootful in rootful (pretty bad from a security point of view)
- Rootless in rootful (already better!)
- Rootful in rootless (pretty handy, but consider your container compromised if the application runs as root and has flaws)
- Rootless in rootless (ideal from a security point of view!)
There is a hiccup between (at least) the fourth combination (rootless in rootless) and Debian-based images, and that's what we'll talk about here.
In the commands below, you'll see that I map the /dev/fuse device into containers to provide OverlayFS support in unprivileged user namespaces on Linux < 5.11.
Podman rootless in Podman rootless
Podman is very well integrated in the Red Hat ecosystem (mainly with systemd), and the official Podman container image is built upon Fedora.
On a rather “recent” GNU/Linux distribution, you can safely run as a regular user:
This command pulls the official Podman container image and runs, as a regular user too (named podman in the first container), another Podman runtime which pulls the official Alpine image and spawns a shell in it.
If we focus on user namespaces, it gives:
- the first podman runs as uid=1000 (host machine user session)
- the second podman (in the first container) runs as uid=1000 (but shifted by 100000, the default value defined in /etc/subuid on Debian)
- finally, sh runs as uid=0 (shifted by 101001, where 100000 comes from the parent user namespace and 1001 from the /etc/subuid packaged in the quay.io/podman/stable image)
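To make the shifting more concrete, here is a minimal Python sketch of how a UID seen inside a nested user namespace can be translated back to the host by walking the ID maps. It relies on the three-column layout of /proc/<pid>/uid_map (first ID inside, first ID outside, length). The helper names and the two maps are made-up illustrative values in the spirit of the setup above, not a dump from a real system, so exact offsets on your machine will depend on the /etc/subuid entries in play.

```python
from typing import List, Optional, Tuple

# One entry: (first ID inside the namespace, first ID in the parent, length),
# i.e. the three columns of a line in /proc/<pid>/uid_map.
UidMap = List[Tuple[int, int, int]]


def parse_uid_map(text: str) -> UidMap:
    """Parse uid_map-style text into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append((int(fields[0]), int(fields[1]), int(fields[2])))
    return entries


def to_parent(uid: int, uid_map: UidMap) -> Optional[int]:
    """Translate a UID from a namespace to its parent (None if unmapped)."""
    for inside, outside, length in uid_map:
        if inside <= uid < inside + length:
            return outside + (uid - inside)
    return None


def to_host(uid: int, maps: List[UidMap]) -> Optional[int]:
    """Walk a chain of maps, innermost namespace first, up to the host."""
    current: Optional[int] = uid
    for uid_map in maps:
        if current is None:
            return None
        current = to_parent(current, uid_map)
    return current


# Hypothetical maps (assumptions for illustration only).
OUTER = parse_uid_map("0 1000 1\n1 100000 65536")                 # first container
INNER = parse_uid_map("0 1000 1\n1 1 999\n1000 1001 64535")       # nested container

print(to_parent(1000, OUTER))        # uid 1000 in the first container, seen from the host
print(to_host(0, [INNER, OUTER]))    # uid 0 in the nested container, seen from the host
```

On a real system, the maps can be read from /proc/<pid>/uid_map for a process in each namespace and fed to parse_uid_map.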
Podman in Debian
Podman has been packaged in Debian since Bullseye (11). A simple apt install podman (which pulls in a lot of recommended dependencies, I confess) and you're all set.
Since Debian 11 (this year!), the specific-but-deprecated kernel.unprivileged_userns_clone sysctl parameter is even enabled by default, so you don't have to tweak your system anymore.
Unfortunately, if we attempt to build a Debian-based image to run “rootless in rootless” with such a Containerfile:
… it fails hard with:
@jam49 initially experienced this error while trying to run rootless Podman in the official Jenkins Docker image, which is Debian-based.
Granting CAP_SYS_ADMIN to the parent user namespace (hence the first container) actually “fixes” this issue, but this is highly discouraged because of the bloated range of system operations it permits, which can easily lead to root privileges. Also, unless explicitly dropped, it will be inherited by child user namespaces as well (including the one running your application or service!).
So, why does this work flawlessly on Fedora, but not on Debian? Let's dive into the magic world of user namespaces and capabilities.
From newuidmap to user namespaces and capabilities
newuidmap (respectively newgidmap) is a privileged program maintained in the shadow-utils project (see upstream tree, or shadow on Debian) which allows unprivileged users to safely map their UID (GID) to the parent user namespace, based on the ID ranges defined in the /etc/subuid (/etc/subgid) file.
Since Linux 3.9, modifying a namespace ID mapping requires CAP_SYS_ADMIN. As this capability is usually not granted in container contexts, the shadow-utils maintainers switched to file capabilities, in a way that stays backward-compatible with setuid setups (see !132, fixed up by !138).
Going through file capabilities is a good way to obtain capabilities even if they are missing from your “Effective set” (they still need to be allowed by your “Bounding set” though, see this awesome diagram, or even the next section for a visual experience).
Debian (still) installs the uidmap binaries with the setuid bit, even though the packaged version (>= 4.7) has fully supported file capabilities since Bullseye (11).
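To see which of the two mechanisms a given binary relies on, one can look at its mode bits and extended attributes. The following is a minimal Python sketch: the function name privilege_mechanism is hypothetical, and it only assumes that file capabilities are stored in the security.capability extended attribute, as is the case on Linux.

```python
import os
import shutil
import stat


def privilege_mechanism(path):
    """Report whether `path` has the setuid bit and/or a file-capability xattr."""
    mode = os.stat(path).st_mode
    setuid = bool(mode & stat.S_ISUID)

    file_caps = False
    getxattr = getattr(os, "getxattr", None)  # only available on Linux
    if getxattr is not None:
        try:
            # File capabilities live in this extended attribute.
            getxattr(path, "security.capability")
            file_caps = True
        except OSError:
            file_caps = False  # attribute absent, or xattrs unsupported here

    return {"setuid": setuid, "file_caps": file_caps}


if __name__ == "__main__":
    # Look the binary up in PATH rather than assuming where it is installed.
    binary = shutil.which("newuidmap")
    if binary is not None:
        print(binary, privilege_mechanism(binary))
```

A binary installed the setuid way would be expected to report setuid as true and file_caps as false, and the other way round for a binary installed with file capabilities.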
Theoretically, this shouldn't be an issue, as gaining root privileges through the setuid bit implies the full set of capabilities by default.
So, could uidmap be compiled without capability support, and thus fail to retain CAP_SETUID (CAP_SETGID)?
We can see in the Debian shadow sources that:
But what do the build logs tell us?
That's it! Due to the Debian shadow compilation environment and packaging, the uidmap binaries lack both upstream capabilities patches.
TL;DR: The workaround
As stated here, dropping the setuid bit and granting CAP_SETUID (CAP_SETGID) as a file capability in our previous Debian-based image using setcap (libcap2-bin package on Debian):
… elegantly works around this issue:
But what if CAP_SETUID (CAP_SETGID) is explicitly forbidden in my context?
Well, as you can imagine, it breaks again:
You'll notice, though, that the error is slightly different: the kernel prevents the binary from executing at all, instead of rejecting the subsequent write to /proc/self/uid_map as observed before.
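To check ahead of time whether a context still allows these capabilities, one option is to decode the capability sets exposed in /proc/<pid>/status. The Python sketch below is only an illustration: the helper names are hypothetical, and the sample masks are assumed values rather than output captured from the containers discussed here. The capability numbers are the ones defined in <linux/capability.h>.

```python
# Capability bit numbers as defined in <linux/capability.h>.
CAPS = {"CAP_SETGID": 6, "CAP_SETUID": 7, "CAP_SYS_ADMIN": 21}

# Capability set fields found in /proc/<pid>/status.
SET_NAMES = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")


def parse_cap_sets(status_text):
    """Extract the capability sets (as integers) from /proc/<pid>/status text."""
    sets = {}
    for line in status_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name in SET_NAMES:
            sets[name] = int(value.strip(), 16)  # masks are printed in hex
    return sets


def has_cap(mask, cap_name):
    """Tell whether the bit for `cap_name` is set in a capability mask."""
    return bool((mask >> CAPS[cap_name]) & 1)


# Assumed sample: an unprivileged process whose bounding set still
# contains CAP_SETUID and CAP_SETGID, but not CAP_SYS_ADMIN.
SAMPLE = (
    "CapInh:\t0000000000000000\n"
    "CapPrm:\t0000000000000000\n"
    "CapEff:\t0000000000000000\n"
    "CapBnd:\t00000000a80425fb\n"
    "CapAmb:\t0000000000000000\n"
)

sets = parse_cap_sets(SAMPLE)
for cap in sorted(CAPS):
    print(cap, "in bounding set:", has_cap(sets["CapBnd"], cap))
```

For a live process, the same functions can be applied to the content of /proc/self/status instead of the sample text.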
(bonus) Harden your “last level of nesting”
If your “last level of nesting” is not supposed to regain privileges, you can safely set the “No New Privileges” flag through a Podman security option:
Here we can note that the flag actually breaks the first example of this post, as expected (gaining privileges through the newuidmap program is denied by the kernel).
The status of this flag can be retrieved using capsh:
… or even directly through /proc:
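As a hedged illustration of the /proc route, sketched here in Python, the flag is exposed as the NoNewPrivs field of /proc/<pid>/status on reasonably recent kernels. The function name and the sample text below are assumptions made for the example.

```python
def no_new_privs(status_text):
    """Return True/False from the NoNewPrivs field, or None if it is absent."""
    for line in status_text.splitlines():
        if line.startswith("NoNewPrivs:"):
            return line.split(":", 1)[1].strip() == "1"
    return None  # field not exposed by this kernel


# Assumed excerpt of /proc/self/status for a process with the flag set.
SAMPLE_STATUS = "Name:\tsh\nNoNewPrivs:\t1\nSeccomp:\t2\n"

print(no_new_privs(SAMPLE_STATUS))
```

To query the current process instead, pass the content of /proc/self/status to the same function.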
Conclusion
This has been a “funny” bug to investigate!
I’ve run a quick search on the Web and it doesn’t look like Debian plans to switch to file capabilities for uidmap binaries (yet), so it’s very likely that the shim above will be around for some time.