Podman rootless in Podman rootless, the Debian way

Introduction

Podman is the new de facto state of the art for running containers on Linux. By design, it comes with a very interesting feature: rootless containers. In this mode, the container runtime itself runs without privileges, so exploiting the container runtime could at most grant the attacker the permissions of the running user (the kernel’s own attack surface is out of scope here).

Historically, sysadmins and CI/CD developers have found themselves in situations where they had to run containers in containers (see Docker-in-Docker, a.k.a. DinD), or other software that also deals with cgroups, seccomp, or namespaces inside a container (e.g. systemd in unprivileged LXC). We call this “nesting”, and it may bring some security benefits (as always, depending on your threat model).

Nesting Podman in “containers” is supported and actually documented, and the possible combinations are:

1. Rootful in rootful (pretty bad from a security point of view)
2. Rootless in rootful (already better!)
3. Rootful in rootless (pretty handy, but consider your container compromised if the application runs as root and has flaws)
4. Rootless in rootless (ideal from a security point of view!)

There is a hiccup between (at least) case 4 (rootless in rootless) and Debian-based images, and that’s what we’ll talk about here.

In the commands below, you’ll see that I map the /dev/fuse device into the containers to provide OverlayFS support in unprivileged user namespaces for Linux < 5.11.

Podman rootless in Podman rootless

Podman is very well integrated in the Red Hat ecosystem (mainly with systemd), and the official Podman container image is built upon Fedora.

On a rather “recent” GNU/Linux distribution, you can safely run as a regular user :

podman run -q -it --rm \
    --user podman \
    --device /dev/fuse \
    quay.io/podman/stable:latest \
        podman run -q -it --rm \
            docker.io/library/alpine:latest \
            sh
/ # id
uid=0(root) gid=0(root) groups=1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video),0(root)

This command pulls the official Podman container image and runs, as a regular user too (named podman in the first container), another Podman runtime which pulls the official Alpine image and spawns a shell in it.

If we focus on user namespaces, it gives :

the first podman runs as uid=1000 (host machine user session)
the second podman (in the first container) runs as uid=1000 (but shifted by 100000, default value defined in /etc/subuid on Debian)
eventually, sh runs as uid=0 (shifted by 101001, where 100000 comes from parent user namespace and 1001 from /etc/subuid packaged in quay.io/podman/stable image)
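To get a feel for how an ID is translated from one user namespace to its parent, here is a minimal shell sketch. It reads a table in the same three-column format as /proc/<pid>/uid_map (first ID inside, first ID outside, count) and translates a single ID. The function name translate_id and the sample table in the comments are illustrative assumptions, not something Podman ships.

```shell
# translate_id ID
#   Reads mapping lines "<inside-start> <outside-start> <count>" on stdin
#   (the format of /proc/<pid>/uid_map) and prints the outside ID that
#   corresponds to the inside ID. Returns 1 if the ID is not mapped.
translate_id() {
    local id="$1" inside outside count
    while read -r inside outside count; do
        # Skip blank or incomplete lines.
        [ -n "$count" ] || continue
        if [ "$id" -ge "$inside" ] && [ "$id" -lt "$((inside + count))" ]; then
            printf '%s\n' "$((outside + id - inside))"
            return 0
        fi
    done
    return 1
}

# Illustrative table: ID 0 maps to 1000, IDs 1.. map to 1001..
#   printf '0 1000 1\n1 1001 64535\n' | translate_id 0     # prints 1000
#   printf '0 1000 1\n1 1001 64535\n' | translate_id 33    # prints 1033
```

Inside any container, `translate_id 0 < /proc/self/uid_map` should give the ID that root maps to one level up. Repeating this at each level of nesting gives the ID seen on the host.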

Podman in Debian

Podman has been packaged in Debian since Bullseye (11). A simple apt install podman (which pulls in a lot of recommended dependencies, I confess) and you’re all set.

Since Debian 12 (this year !), the specific-but-deprecated kernel.unprivileged_userns_clone sysctl parameter is even enabled by default so you don’t have to tweak your system anymore.
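If you are not sure how your own system is configured, the sketch below checks that sysctl through /proc. The function name userns_clone_allowed is mine. A missing file is treated as "not restricted by this knob", since kernels without the distribution-specific patch do not expose it. That is an assumption worth double-checking on your system, as other limits (such as user.max_user_namespaces) may still apply.

```shell
# userns_clone_allowed [SYSCTL_FILE]
#   Succeeds if unprivileged user namespace creation is not blocked by the
#   kernel.unprivileged_userns_clone knob. SYSCTL_FILE can be overridden,
#   which is mostly useful for testing.
userns_clone_allowed() {
    local knob="${1:-/proc/sys/kernel/unprivileged_userns_clone}"
    # Kernels that do not carry the patch do not expose this file at all.
    [ -r "$knob" ] || return 0
    [ "$(cat "$knob")" = "1" ]
}

# Usage:
#   if userns_clone_allowed; then echo "user namespaces: ok"; fi
```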

Unfortunately, if we attempt to build a Debian-based image to run “rootless in rootless” with such a Containerfile :

FROM debian:bookworm

RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y --no-install-recommends podman fuse-overlayfs slirp4netns uidmap

RUN useradd podman -s /bin/bash && \
    echo "podman:1001:64535" > /etc/subuid && \
    echo "podman:1001:64535" > /etc/subgid

ARG _REPO_URL="https://raw.githubusercontent.com/containers/image_build/refs/heads/main/podman"
ADD $_REPO_URL/containers.conf /etc/containers/containers.conf
ADD $_REPO_URL/podman-containers.conf /home/podman/.config/containers/containers.conf

RUN mkdir -p /home/podman/.local/share/containers && \
    chown podman:podman -R /home/podman && \
    chmod 0644 /etc/containers/containers.conf

VOLUME /home/podman/.local/share/containers

ENV _CONTAINERS_USERNS_CONFIGURED=""

USER podman
WORKDIR /home/podman

… it hard fails with :

podman image build -q -t debian:podman -f Containerfile . && \
    podman run -q -it --rm \
    --device /dev/fuse \
    debian:podman \
        podman unshare bash
98ea1c8c9e32cff5c3dabc4925f55a87cfad77e32d5778785a4f025215124fab
ERRO[0000] running `/usr/bin/newuidmap 12 0 1000 1 1 1001 64535`: newuidmap: write to uid_map failed: Operation not permitted 
Error: cannot set up namespace using "/usr/bin/newuidmap": exit status 1

@jam49 initially experienced this error while trying to run Podman rootless in the official Jenkins Docker image, which is Debian-based.
Granting CAP_SYS_ADMIN to the parent user namespace (hence the first container) actually “fixes” this issue, but this is highly discouraged due to the ~~bloated~~ range of system operations that it permits, which can easily lead to root privileges. Also, unless explicitly dropped, it will be inherited by child user namespaces as well (including the one running your application or service!).

So, why does this work flawlessly on Fedora, and not on Debian? Let’s dive into the magic world of user namespaces and capabilities.

From `newuidmap`, to user namespaces and capabilities

newuidmap (respectively newgidmap) is a privileged program maintained in the shadow-utils project (see the upstream tree, or shadow on Debian) which allows unprivileged users to safely map their UIDs (GIDs) to the parent user namespace, based on the ID ranges defined in the /etc/subuid (/etc/subgid) file.
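To make the link between /etc/subuid and the newuidmap arguments more concrete, here is a small shell sketch that rebuilds the triples (ID inside the namespace, ID outside, count) from a subordinate ID file. The function name subid_triples is hypothetical, and this only mimics the shape of the command line quoted in the error above. It is not Podman’s actual implementation.

```shell
# subid_triples NAME ID [FILE]
#   NAME: user to look up in the subordinate ID file (e.g. podman)
#   ID:   the user's own UID, mapped to 0 inside the new namespace
#   FILE: "name:start:count" file, defaults to /etc/subuid
# Prints the triples that would follow the PID on a newuidmap command line.
subid_triples() {
    local name="$1" id="$2" file="${3:-/etc/subuid}"
    local out="0 $id 1"   # inside ID 0 -> the calling user's own ID
    local next=1          # next free ID inside the namespace
    local owner start count
    while IFS=: read -r owner start count || [ -n "$owner" ]; do
        [ "$owner" = "$name" ] || continue
        out="$out $next $start $count"
        next=$((next + count))
    done < "$file"
    printf '%s\n' "$out"
}

# With "podman:1001:64535" in /etc/subuid and the podman user at UID 1000:
#   subid_triples podman 1000    # prints: 0 1000 1 1 1001 64535
```

The output matches the arguments of the failing `/usr/bin/newuidmap 12 0 1000 1 1 1001 64535` call, where 12 is the PID of the process whose mapping is being written.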

Since Linux 3.9, modifying a namespace’s ID mapping requires CAP_SYS_ADMIN. As this capability is usually not granted in container contexts, the shadow-utils maintainers switched to file capabilities, in a way that stays backward-compatible with setuid setups (see !132, fixed up by !138).

Going through file capabilities is a good way to obtain them even if they are missing from your “Effective set” (they still need to be in your “Permitted set” though, see this awesome diagram, or even next section for a visual experience).
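The capability sets of a process are exposed as hexadecimal bitmasks in /proc/<pid>/status (CapPrm, CapEff, CapBnd, and so on). As a minimal sketch, the helper below tests a single bit in such a mask. The name has_cap is mine, and the bit numbers come from linux/capability.h (CAP_SETGID is 6, CAP_SETUID is 7, CAP_SYS_ADMIN is 21). The mask in the comments is only an illustrative value.

```shell
# has_cap MASK BIT
#   Succeeds if capability number BIT is set in the hexadecimal MASK,
#   as found in the Cap* lines of /proc/<pid>/status.
has_cap() {
    [ "$(( (0x$1 >> $2) & 1 ))" -eq 1 ]
}

# Show the raw masks of the current process:
#   grep '^Cap' /proc/self/status
#
# Illustrative mask, without CAP_SYS_ADMIN:
#   has_cap 00000000a80425fb 7  && echo "CAP_SETUID present"
#   has_cap 00000000a80425fb 21 || echo "CAP_SYS_ADMIN absent"
```

If capsh is installed, `capsh --decode=<mask>` gives the same information in a readable form.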

Debian (still) installs the uidmap binaries with the setuid bit, even though the packaged version (>= 4.7) has fully supported file capabilities since Bullseye (11).
Theoretically, this shouldn’t be an issue, as gaining root privileges through the setuid bit implies the full set of capabilities by default.
So, could uidmap have been compiled without capability support, and thus fail to retain CAP_SETUID (CAP_SETGID)?

We can see in Debian shadow sources that :

But what do build logs tell ?

apt install -y --no-install-recommends devscripts wget

# Download last `shadow` (packaging `uidmap` binaries) build logs
getbuildlog shadow "last" amd64

grep 'sys/capability.h' shadow_*_amd64.log 
checking for sys/capability.h... no

That’s it! Due to Debian’s shadow build environment and packaging, the uidmap binaries lack both upstream capability patches.

TL; DR : The workaround

As stated here, dropping the setuid bit and granting CAP_SETUID (CAP_SETGID) as a file capability in our previous Debian-based image, using setcap (libcap2-bin package on Debian):

 FROM debian:bookworm
 
 RUN apt-get update && \
     apt-get upgrade -y && \
     apt-get install -y --no-install-recommends podman fuse-overlayfs slirp4netns uidmap
 
 RUN useradd podman -s /bin/bash && \
     echo "podman:1001:64535" > /etc/subuid && \
     echo "podman:1001:64535" > /etc/subgid
 
 ARG _REPO_URL="https://raw.githubusercontent.com/containers/image_build/refs/heads/main/podman"
 ADD $_REPO_URL/containers.conf /etc/containers/containers.conf
 ADD $_REPO_URL/podman-containers.conf /home/podman/.config/containers/containers.conf
 
 RUN mkdir -p /home/podman/.local/share/containers && \
     chown podman:podman -R /home/podman && \
     chmod 0644 /etc/containers/containers.conf
 
 VOLUME /home/podman/.local/share/containers
 
+ # Replace setuid bits by proper file capabilities for uidmap binaries.
+ # See <https://github.com/containers/podman/discussions/19931>.
+ RUN apt-get install -y libcap2-bin && \
+     chmod 0755 /usr/bin/newuidmap /usr/bin/newgidmap && \
+     setcap cap_setuid=ep /usr/bin/newuidmap && \
+     setcap cap_setgid=ep /usr/bin/newgidmap && \
+     apt-get autoremove --purge -y libcap2-bin
+ 
 ENV _CONTAINERS_USERNS_CONFIGURED=""
 
 USER podman
 WORKDIR /home/podman

… elegantly works around this issue:

podman image build -q -t debian:podman -f Containerfile . && \
    podman run -q -it --rm \
    --device /dev/fuse \
    debian:podman \
        podman unshare bash
6b0661ddedbf13459493720f992c171d912d33bfd79a48a0d162d3eb0335cc99
root@bbb7d3d08f5a:~# id
uid=0(root) gid=0(root) groups=0(root)
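To double-check which mechanism a given image relies on, a quick sketch like the one below can help. The function name priv_mode is mine. It relies on getcap, which ships in the same libcap2-bin package as setcap on Debian, so file capabilities can only be detected when that tool is installed.

```shell
# priv_mode FILE
#   Prints how FILE gains privileges on exec:
#     setuid   - the setuid bit is set
#     filecaps - file capabilities are attached (needs getcap)
#     none     - neither was detected
priv_mode() {
    local f="$1"
    if [ -u "$f" ]; then
        echo setuid
    elif command -v getcap >/dev/null 2>&1 \
        && [ -n "$(getcap "$f" 2>/dev/null)" ]; then
        echo filecaps
    else
        echo none
    fi
}

# Usage:
#   priv_mode /usr/bin/newuidmap
#   priv_mode /usr/bin/newgidmap
```

Keep in mind that the workaround above purges libcap2-bin at the end of the build, so on the final image this check would report "none" unless getcap is installed again.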

But what if `CAP_SETUID` (`CAP_SETGID`) is explicitly forbidden in my context ?

Well, as you can imagine, it breaks again :

podman image build -q -t debian:podman -f Containerfile . && \
    podman run -q -it --rm \
    --device /dev/fuse \
    --cap-drop setuid,setgid \
    debian:podman \
        podman unshare bash
6b0661ddedbf13459493720f992c171d912d33bfd79a48a0d162d3eb0335cc99
ERRO[0000] running `/usr/bin/newuidmap 9 0 1000 1 1 1001 64535`:  
Error: cannot set up namespace using "/usr/bin/newuidmap": fork/exec /usr/bin/newuidmap: operation not permitted

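You’ll notice, though, that the error is slightly different: the kernel now refuses to execute the binary at all (the file capability it asks for is no longer available to the container), instead of rejecting the subsequent write to /proc/self/uid_map as observed before.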

(bonus) Harden your “last level of nesting”

If your “last level of nesting” is not supposed to regain privileges, you can safely set the “No New Privileges” flag through a Podman security option:

podman run -q -it --rm \
    --user podman \
    --device /dev/fuse \
    --security-opt=no-new-privileges \
    quay.io/podman/stable \
        podman run -q -it --rm \
            alpine:latest \
            sh
ERRO[0000] running `/usr/bin/newuidmap 9 0 1000 1 1 1 999 1000 1001 64535`: newuidmap: write to uid_map failed: Operation not permitted 
Error: cannot set up namespace using "/usr/bin/newuidmap": exit status 1

Note that the flag actually breaks the first example of this post, as expected (the kernel denies newuidmap any privilege gain).

The status of this flag can be retrieved using capsh :

podman run -q -it --rm \
    --user podman \
    --device /dev/fuse \
    --security-opt=no-new-privileges \
    quay.io/podman/stable \
        bash
[podman@5936edfb845d /]$ /sbin/capsh --print | grep no-new-privs
Securebits: 00/0x0/1'b0 (no-new-privs=1)

… or even directly through /proc :

grep NoNewPrivs /proc/self/status
NoNewPrivs:	1
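Both failure modes seen in this post (the no-new-privs flag, and CAP_SETUID or CAP_SETGID dropped from the container) can be spotted before even starting the nested Podman. The following is only a sketch under my own assumptions: nesting_preflight is a hypothetical helper name, and it looks at nothing but the NoNewPrivs and CapBnd lines of a /proc status file.

```shell
# nesting_preflight [STATUS_FILE]
#   Reads NoNewPrivs and CapBnd from STATUS_FILE (default: /proc/self/status)
#   and reports conditions under which newuidmap/newgidmap cannot gain the
#   privileges they need. Returns 0 if nothing suspicious was found.
nesting_preflight() {
    local status="${1:-/proc/self/status}" nnp bnd rc=0
    nnp=$(awk '/^NoNewPrivs:/ {print $2}' "$status")
    bnd=$(awk '/^CapBnd:/ {print $2}' "$status")
    bnd="${bnd:-0}"
    if [ "$nnp" = "1" ]; then
        echo "no-new-privs is set: newuidmap cannot gain privileges"
        rc=1
    fi
    # CAP_SETGID is bit 6 and CAP_SETUID is bit 7 (linux/capability.h).
    if [ "$(( (0x$bnd >> 6) & 3 ))" -ne 3 ]; then
        echo "CAP_SETUID and/or CAP_SETGID missing from the bounding set"
        rc=1
    fi
    return "$rc"
}

# Usage, inside the first container:
#   nesting_preflight && podman run -q -it --rm alpine:latest sh
```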

Conclusion

This has been a “funny” bug to investigate !

I’ve run a quick search on the Web and it doesn’t look like Debian plans to switch to file capabilities for uidmap binaries (yet), so it’s very likely that the shim above will be around for some time.