Proxmox LXC, Systemd, and Linux Capabilities
Debian in LXC/Proxmox works flawlessly, except for some systemd utility daemons. Instead of disabling those services, we can leverage Linux capabilities to achieve the same results.
Linux capabilities?
In classic UNIX systems, there are two categories of processes: privileged and unprivileged. Privileged processes, those running as the superuser (root), have an effective user ID of 0 and bypass all kernel permission checks. Unprivileged processes have a nonzero effective UID and are subject to full permission checking based on their credentials: effective UID, effective GID, and supplementary group list.
Starting with kernel 2.2, however, Linux divides the privileges traditionally associated with the superuser into distinct units called capabilities, which can be independently enabled and disabled.
With capabilities, developers can assign specific permissions to individual threads/processes rather than granting all privileges to the entire application. This separation allows for more fine-grained access control and helps to prevent potential security breaches that could result from the over-assignment of permissions.
There are three ways a process can get capabilities: a child process can inherit them from its parent; they can be assigned directly to a thread/process; or they can be set on an executable on disk, so that the program has them when it is executed.
An example of a standard Linux utility that uses capabilities is ping:
$ sudo getcap `which ping`
/usr/bin/ping cap_net_raw=ep
Why is that? Because ping uses a raw socket to send ICMP echo request packets. On Linux, raw sockets can be opened only by privileged users or by processes with the CAP_NET_RAW capability. The ep suffix stands for e (Effective) and p (Permitted), two capability sets.
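The same sets can be observed on a running process. The snippet below is a minimal sketch, assuming a Linux system with /proc mounted; show_caps is a hypothetical helper name, not a standard tool. It prints the Cap* lines the kernel exposes for a process, including CapPrm (Permitted) and CapEff (Effective).

```shell
# Print the capability sets of a process as recorded by the kernel.
# Usage: show_caps [PID]   (defaults to the current shell)
# show_caps is a hypothetical helper name used only for illustration.
show_caps() {
    pid="${1:-$$}"
    # CapInh, CapPrm, CapEff, CapBnd and CapAmb are hexadecimal bitmasks,
    # one bit per capability.
    grep '^Cap' "/proc/${pid}/status"
}

show_caps
```

For an unprivileged user, CapPrm and CapEff are normally all zeros, while a root shell shows most bits set.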
The list of capabilities, along with other information, is in the man pages:
man 7 capabilities
What is wrong with LXC/Proxmox and Debian?
If you create an unprivileged Debian 11-based LXC container in Proxmox, you will find that some services won’t run:
$ sudo systemctl is-system-running
degraded
$ sudo systemctl | grep failed
* sys-kernel-config.mount loaded failed failed Kernel Configuration File System
* systemd-journald-audit.socket loaded failed failed Journal Audit Socket
One solution is to mask these services to make systemd happy. However, the systemd units governing these services are (correctly) configured to skip starting the service if the required capability is unavailable for the whole container (to list the capabilities available in the container, just run capsh --print).
$ sudo systemctl cat sys-kernel-config.mount | grep ^ConditionCapability
ConditionCapability=CAP_SYS_RAWIO
$ sudo systemctl cat systemd-journald-audit.socket | grep ^ConditionCapability
ConditionCapability=CAP_AUDIT_READ
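If capsh is not installed in the container, the bounding set can also be read straight from /proc. This is a rough sketch, assuming a Linux /proc filesystem and the bit numbers from <linux/capability.h> (CAP_SYS_RAWIO is bit 17, CAP_AUDIT_READ is bit 37); has_cap is a hypothetical helper name. It inspects the current shell, whose bounding set is normally the same as the one systemd sees inside the container.

```shell
# Return success if capability bit $2 is set in the hexadecimal mask $1.
# has_cap is a hypothetical helper name, not part of any standard tool.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Bounding set of the current shell as a hex string.
bnd="$(awk '/^CapBnd:/ {print $2}' "/proc/$$/status")"

# name:bit pairs, bit numbers as defined in <linux/capability.h>.
for entry in sys_rawio:17 audit_read:37; do
    name="${entry%%:*}"
    bit="${entry##*:}"
    if has_cap "$bnd" "$bit"; then
        echo "$name: present in the bounding set"
    else
        echo "$name: not in the bounding set"
    fi
done
```

A capability reported as present here satisfies ConditionCapability, so systemd goes on to start the unit.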
Capabilities are usually written either with the CAP_ prefix and in uppercase, or without the CAP_ prefix and in lowercase. The capabilities required by the two services are:
sys_rawio: the ability to perform various low-level system operations, such as I/O port operations, accessing kernel core, ioctl, updating virtual memory settings, mapping files and memory, performing SCSI device commands and device-specific operations on different devices;
audit_read: the ability to read the audit log from the kernel.
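Since these capabilities are usually unused in the container, why not drop them? Dropping a capability means it is no longer available anywhere in the container, which makes sense if nothing in the container uses it. Once the capabilities are dropped, the ConditionCapability checks are no longer satisfied, so systemd skips the two units instead of trying to start them and failing.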
To make Proxmox drop these capabilities when the container starts, add this line to the configuration file of your unprivileged container (on the Proxmox host, /etc/pve/lxc/<CTID>.conf):
lxc.cap.drop: "sys_rawio audit_read"
The value of that key is a space-separated list of capabilities, in lowercase, without the CAP_ prefix.
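Since systemd units use the uppercase CAP_ form while LXC expects the lowercase one, a tiny conversion helper can save some typos. This is only a sketch; to_lxc_caps is a hypothetical name, and the output is meant to be pasted as the value of lxc.cap.drop.

```shell
# Convert CAP_-prefixed, uppercase names into the lowercase,
# prefix-less form expected as the value of lxc.cap.drop.
# to_lxc_caps is a hypothetical helper name used only for illustration.
to_lxc_caps() {
    list=""
    for cap in "$@"; do
        # Strip a leading CAP_ (if any), then lowercase the rest.
        short="$(printf '%s' "${cap#CAP_}" | tr '[:upper:]' '[:lower:]')"
        # Join with single spaces, without a leading space.
        list="${list:+$list }$short"
    done
    printf '%s\n' "$list"
}

to_lxc_caps CAP_SYS_RAWIO CAP_AUDIT_READ
# prints: sys_rawio audit_read
```

Names that are already lowercase and unprefixed pass through unchanged, so the helper can be fed a mix of both styles.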
You can drop other capabilities too. The capabilities(7) man page describes all available capabilities. If you are securing a container, it is worth reading.