Mastodon Stories for systemd v260

On March 17 we released systemd v260 into the wild.

In the weeks leading up to that release (and since then) I have posted a series of serieses of posts to Mastodon about key new features in this release, under the #systemd260 hash tag. In case you aren't using Mastodon, but would like to read up, here's a list of all 21 posts:

Post #1: NvPCR Measurements for Activated DDIs
Post #2: Varlink Transport Plugins
Post #3: Well-Known Varlink Services
Post #4: .mstack Overlay Mount Stacks
Post #5: RefreshOnReload= in Service Units
Post #6: FANCY_NAME= in /etc/os-release
Post #7: BindNetworkInterface= in Service Units
Post #8: importctl pull-oci for Acquiring OCI Containers
Post #9: systemd-report and Metrics API
Post #10: udev's tpm2_id built-in and the TPM2 Quirks Database
Post #11: Devicetree/CHID Database
Post #12: Varlink IPC for systemd-networkd
Post #13: systemd-vmspawn knows --ephemeral now
Post #14: systemd-loginds's xaccess Concept
Post #15: Unprivileged Portable Services
Post #16: Image Policy Improvements
Post #17: LUKS Volume Key Fixation
Post #18: Journal Varlink Access
Post #19: Nested UID Range Delegation
Post #20: PrivateUsers=managed
Post #21: bootctl install as Varlink API

I intend to do a similar series of serieses of posts for the next systemd release (v261), hence if you haven't left tech Twitter for Mastodon yet, now is the opportunity.

My series for v261 will begin in a few weeks most likely, under the #systemd261 hash tag.

In case you are interested, here is the corresponding blog story for systemd v259, here for v258, here for v257, and here for v256.

Introducing Amutable

Today, we announce Amutable, our ✨ new ✨ company. We – @blixtra@hachyderm.io, @brauner@mastodon.social, @davidstrauss@mastodon.social, @rodrigo_rata@mastodon.social, @michaelvogt@mastodon.social, @pothos@fosstodon.org, @zbyszek@fosstodon.org, @daandemeyer@mastodon.social @cyphar@mastodon.social, @jrocha@floss.social and yours truly – are building the 🚀 next generation of Linux systems, with integrity, determinism, and verification – every step of the way.

For more information see → https://amutable.com/blog/introducing-amutable

Mastodon Stories for systemd v259

On Dec 17 we released systemd v259 into the wild.

In the weeks leading up to that release (and since then) I have posted a series of serieses of posts to Mastodon about key new features in this release, under the #systemd259 hash tag. In case you aren't using Mastodon, but would like to read up, here's a list of all 25 posts:

Post #1: systemd-resolved Hooks
Post #2: dlopen() everything
Post #3: systemd-analyze dlopen-metadata
Post #4: run0 --empower
Post #5: systemd-vmspawn --bind-user=
Post #6: Musl libc support
Post #7: systemd-repart without device name
Post #8: Parallel kmod loading in systemd-modules-load.service
Post #9: NvPCR Support
Post #10: systemd-analyze nvcpcrs
Post #11: systemd-repart Varlink IPC API
Post #12: systemd-vmspawn block device serial
Post #13: systemd-repart --defer-partitions-empty= + --defer-partitions-factory-reset=
Post #14: userdb support for UUID queries
Post #15: Wallclock time in service completion logging
Post #16: systemd-firstboot --prompt-keymap-auto
Post #17: $LISTEN_PIDFDID
Post #18: Incremental partition rescanning
Post #19: ExecReloadPost=
Post #20: Transaction order cycle tracking
Post #21: systemd-firstboot facelift
Post #22: Per-User systemd-machined + systemd-importd
Post #23: systemd-udevd's OPTIONS="dump-json"
Post #24: systemd-resolved's DumpDNSConfiguration() IPC Call
Post #25: DHCP Server EmitDomain= + Domain=

I intend to do a similar series of serieses of posts for the next systemd release (v260), hence if you haven't left tech Twitter for Mastodon yet, now is the opportunity.

My series for v260 will begin in a few weeks most likely, under the #systemd260 hash tag.

In case you are interested, here is the corresponding blog story for systemd v258, here for v257, and here for v256.

Mastodon Stories for systemd v258

Already on Sep 17 we released systemd v258 into the wild.

In the weeks leading up to that release I have posted a series of serieses of posts to Mastodon about key new features in this release, under the #systemd258 hash tag. It was my intention to post a link list here on this blog right after completing that series, but I simply forgot! Hence, in case you aren't using Mastodon, but would like to read up, here's a list of all 37 posts:

Post #1: systemctl start -v
Post #2: Home areas
Post #3: systemd-resolved delegate zones
Post #4: Foreign UID range
Post #5: /etc/hostname ??? wildcards
Post #6: Quota on /tmp/
Post #7: ConcurretnySoftMax= + ConcurrencyHardMax=
Post #8: Product UUID in ConditionHost=
Post #9: Context OSC terminal sequences
Post #10: uki-url Boot Loader Spec Type #1 fields
Post #11: rd.break= boot breakpoints
Post #12: Factory Reset Rework
Post #13: systemd-resolved DNS Configuration Change IPC Subscription API
Post #14: io.systemd.boot-entries.extra= SMBIOS Type #11 Key
Post #15: Bring Your Own Firmware
Post #16: userdb record aliases
Post #17: systemd-validatefs and its xattrs
Post #18: Offline Signing of Artifacts
Post #19: PAMName= in services hooked up to ask-password protocol
Post #20: x-systemd.graceful-option= mount option
Post #21: systemd-userdb-load-credentials.service
Post #22: systemd-vmspawn --grow-image=a
Post #23: systemd-notify --fork
Post #24: $TERM auto-discovery
Post #25: Rebooting/Powering off systemd-nspawn containers via hotkey
Post #26: ExecStart= | modifier
Post #27: systemctl reload reloads confexts
Post #28: Server side userdb filtering
Post #29: Quota on StateDirectory= and friends
Post #30: systemd-analyze unit-shell
Post #31: /etc/issue.d/ drop-in for AF_VSOCK CID
Post #32: fsverity in systemd-repart
Post #33: AcceptFileDescriptor= + PassPIDFD=
Post #34: Tab completion in interactive systemd-firstboot
Post #35: rd.systemd.pull= kernel command line option/Boot into tarball
Post #36: ConditionKernelModuleLoaded=
Post #37: systemd-analyze chid
Post #38: homectl list-signing-keys/get-signing-key/add-signing-key/remove-signing-key
Post #39: DDI Image Filters
Post #40: Android USB Debugging udev rules
Post #41: systemd-vmspawn's --smbios11= switch
Post #42: $MAINPIDFDID + $MANAGERPIDFDID
Post #43: $DEBUG_INVOCATION=1 Respected by all systemd services
Post #44: LoaderDeviceURL EFI Variable and systemd.pull='s origin kernel command line switch
Post #45: cgroupv1 removal
Post #46: ProtectHostname=private
Post #47: homectl adopt + homectl register
Post #48: systemd-machined Varlink APIs
Post #49: DeferTrigger and "lenient" job mode
Post #50: Automatic Removal of foreign UID owned delegate subgroups in the per-user service manager
Post #51: Per-user ask-password protocol
Post #52: PrivateUsers=full
Post #53: LoadCredentialEncrypted= in the per-user service manager
Post #54: dissect_image builtin in systemd-udevd
Post #55: BPF Delegation via Tokens

I intend to do a similar series of serieses of posts for the next systemd release (v259), hence if you haven't left tech Twitter for Mastodon yet, now is the opportunity.

We intend to shorten the release cycle a bit for the future, and in fact managed to tag v259-rc1 already yesterday, just 2 months after v258. Hence, my series for v259 will begin soon, under the #systemd259 hash tag.

In case you are interested, here is the corresponding blog story for systemd v257, and here for v256.

ASG! 2025 CfP Closes Tomorrow!

The All Systems Go! 2025 Call for Participation Closes Tomorrow!

The Call for Participation (CFP) for All Systems Go! 2025 will close tomorrow, on 13th of June! We’d like to invite you to submit your proposals for consideration to the CFP submission site quickly!

Announcing systemd v257

Last week we released systemd v257 into the wild.

In the weeks leading up to this release (and the week after) I have posted a series of serieses of posts to Mastodon about key new features in this release, under the #systemd257 hash tag. In case you aren't using Mastodon, but would like to read up, here's a list of all 37 posts:

Post #1: Fully Locked Accounts with systemd-sysusers
Post #2: Combined Signed PCR and Locally Managed PCR Policies for Disk Encryption
Post #3: Progress Indication via Terminal ANSI Sequence
Post #4: Multi-Profile UKIs
Post #5: The New sd-varlink & sd-json APIs in libsystemd
Post #6: Querying for Passwords in User Scope
Post #7: Secure Attention Key Logic in systemd-logind
Post #8: systemd-nspawn --bind-user= Now Copies User's SSH Key
Post #9: The New DeferReactivation= Switch in .timer Units
Post #10: Support for the New IPE LSM
Post #11: Environment Variables for Shell Prompt Prefix/Suffix
Post #12: sysctl Conflict Detection via eBPF
Post #13: initrd and µcode UKI Add-Ons
Post #14: SecureBoot Signing with the New systemd-sbsign Tool
Post #15: Managed Access to hidraw devices in systemd-logind
Post #16: Fuzzy Filtering in userdbctl
Post #17: MAC Address Based Alternative Network Interface Names
Post #18: Conditional Copying/Symlinking in tmpfiles.d/
Post #19: Automatic Service Restarts in Debug Mode
Post #20: Filtering by Invocation ID in journalctl
Post #21: Supplement Partitions in repart.d/
Post #22: DeviceTree Matching in UKIs
Post #23: The New ssh-exec: Protocol in varlinkctl
Post #24: SecureBoot Key Enrollment Preparation with bootctl
Post #25: Automatically Installing confext/sysext/portable/VMs/container Images at Boot
Post #26: Designated Maintenance Time in systemd-logind
Post #27: PID Namespacing in Service Management
Post #28: Marking Experimental OS Releases in /etc/os-release
Post #29: Decoding Capability Masks with systemd-analyze
Post #30: Investigating Passed SMBIOS Type #11 Data
Post #31: Initializing Partitions from Character Devices in repart.d/
Post #32: Entering Namespaces to Generate Stacktraces
Post #33: ID Mapped Mounts for Per-Service Directories
Post #34: A Daemon for systemd-sysupdate
Post #35: User Record Modifications without Administrator Consent in systemd-homed
Post #36: DNR DHCP Support
Post #37: Name Based AF_VSOCK ssh Access

I intend to do a similar series of serieses of posts for the next systemd release (v258), hence if you haven't left tech Twitter for Mastodon yet, now is the opportunity.

Announcing systemd v256

Yesterday evening we released systemd v256 into the wild. While other projects, such as Firefox are just about to leave the 7bit world and enter 8bit territory, we already entered 9bit version territory! For details about the release, see our announcement mail.

In the weeks leading up to this release I have posted a series of serieses of posts to Mastodon about key new features in this release. Mastodon has its goods and its bads. Among the latter is probably that it isn't that great for posting listings of serieses of posts. Hence let me provide you with a list of the relevant first post in the series of posts here:

Post #1: .v/ Directories
Post #2: User-Scoped Encrypted Service Credentials
Post #3: X_SYSTEMD_UNIT_ACTIVE= sd_notify() Messages
Post #4: System-wide ProtectSystem=
Post #5: run0 As sudo Replacement
Post #6: System Credentials
Post #7: Unprivileged DDI Mounts + Unprivileged systemd-nspawn
Post #8: ssh into systemd-homed Accounts
Post #9: systemd-vmspawn
Post #10: Mutable systemd-sysext
Post #11: Network Device Ownership
Post #12: systemctl sleep
Post #13: systemd-ssh-generator
Post #14: systemd-cryptenroll without device argument
Post #15: dlopen() ELF Metadata
Post #16: Capsules

I intend to do a similar series of serieses of posts for the next systemd release (v257), hence if you haven't left tech Twitter for Mastodon yet, now is the opportunity.

And while I have you: note that the All Systems Go 2024 Conference (Berlin) Call for Papers ends 😲 THIS WEEK 🤯! Hence, HURRY, and get your submissions in now, for the best low-level Linux userspace conference around!

A re-introduction to mkosi -- A Tool for Generating OS Images

This is a guest post written by Daan De Meyer, systemd and mkosi maintainer

Almost 7 years ago, Lennart first wrote about mkosi on this blog. Some years ago, I took over development and there's been a huge amount of changes and improvements since then. So I figure this is a good time to re-introduce mkosi.

mkosi stands for Make Operating System Image. It generates OS images that can be used for a variety of purposes.

If you prefer watching a video over reading a blog post, you can also watch my presentation on mkosi at All Systems Go 2023.

What is mkosi?

mkosi was originally written as a tool to simplify hacking on systemd and for experimenting with images using many of the new concepts being introduced in systemd at the time. In the meantime, it has evolved into a general purpose image builder that can be used in a multitude of scenarios.

Instructions to install mkosi can be found in its readme. We recommend running the latest version to take advantage of all the latest features and bug fixes. You'll also need bubblewrap and the package manager of your favorite distribution to get started.

At its core, the workflow of mkosi can be divided into 3 steps:

Generate an OS tree for some distribution by installing a set of packages.
Package up that OS tree in a variety of output formats.
(Optionally) Boot the resulting image in qemu or systemd-nspawn.

Images can be built for any of the following distributions:

Fedora Linux
Ubuntu
OpenSUSE
Debian
Arch Linux
CentOS Stream
RHEL
Rocky Linux
Alma Linux

And the following output formats are supported:

GPT disk images built with systemd-repart
Tar archives
CPIO archives (for building initramfs images)
USIs (Unified System Images which are full OS images packed in a UKI)
Sysext, confext and portable images
Directory trees

For example, to build an Arch Linux GPT disk image and boot it in qemu, you can run the following command:

$ mkosi -d arch -p systemd -p udev -p linux -t disk qemu

To instead boot the image in systemd-nspawn, replace qemu with boot:

$ mkosi -d arch -p systemd -p udev -p linux -t disk boot

The actual image can be found in the current working directory named image.raw. However, using a separate output directory is recommended which is as simple as running mkdir mkosi.output.

To rebuild the image after it's already been built once, add -f to the command line before the verb to rebuild the image. Any arguments passed after the verb are forwarded to either systemd-nspawn or qemu itself. To build the image without booting it, pass build instead of boot or qemu or don't pass a verb at all.

By default, the disk image will have an appropriately sized root partition and an ESP partition, but the partition layout and contents can be fully customized using systemd-repart by creating partition definition files in mkosi.repart/. This allows you to customize the partition as you see fit:

The root partition can be encrypted.
Partition sizes can be customized.
Partitions can be protected with signed dm-verity.
You can opt out of having a root partition and only have a /usr partition instead.
You can add various other partitions, e.g. an XBOOTLDR partition or a swap partition.
...

As part of building the image, we'll run various tools such as systemd-sysusers, systemd-firstboot, depmod, systemd-hwdb and more to make sure the image is set up correctly.

Configuring mkosi image builds

Naturally with extended use you don't want to specify all settings on the command line every time, so mkosi supports configuration files where the same settings that can be specified on the command line can be written down.

For example, the command we used above can be written down in a configuration file mkosi.conf:

[Distribution]
Distribution=arch

[Output]
Format=disk

[Content]
Packages=
        systemd
        udev
        linux

Like systemd, mkosi uses INI configuration files. We also support dropins which can be placed in mkosi.conf.d. Configuration files can also be conditionalized using the [Match] section. For example, to only install a specific package on Arch Linux, you can write the following to mkosi.conf.d/10-arch.conf:

[Match]
Distribution=arch

[Content]
Packages=pacman

Because not everything you need will be supported in mkosi, we support running scripts at various points during the image build process where all extra image customization can be done. For example, if it is found, mkosi.postinst is called after packages have been installed. Scripts are executed on the host system by default (in a sandbox), but can be executed inside the image by suffixing the script with .chroot, so if mkosi.postinst.chroot is found it will be executed inside the image.

To add extra files to the image, you can place them in mkosi.extra in the source directory and they will be automatically copied into the image after packages have been installed.

Bootable images

If the necessary packages are installed, mkosi will automatically generate a UEFI/BIOS bootable image. As mkosi is a systemd project, it will always build UKIs (Unified Kernel Images), except if the image is BIOS-only (since UKIs cannot be used on BIOS). The initramfs is built like a regular image by installing distribution packages and packaging them up in a CPIO archive instead of a disk image. Specifically, we do not use dracut, mkinitcpio or initramfs-tools to generate the initramfs from the host system. ukify is used to assemble all the individual components into a UKI.

If you don't want mkosi to generate a bootable image, you can set Bootable=no to explicitly disable this logic.

Using mkosi for development

The main requirements to use mkosi for development is that we can build our source code against the image we're building and install it into the image we're building. mkosi supports this via build scripts. If a script named mkosi.build (or mkosi.build.chroot) is found, we'll execute it as part of the build. Any files put by the build script into $DESTDIR will be installed into the image. Required build dependencies can be installed using the BuildPackages= setting. These packages are installed into an overlay which is put on top of the image when running the build script so the build packages are available when running the build script but don't end up in the final image.

An example mkosi.build.chroot script for a project using meson could look as follows:

#!/bin/sh
meson setup "$BUILDDIR" "$SRCDIR"
ninja -C "$BUILDDIR"
if ((WITH_TESTS)); then
    meson test -C "$BUILDDIR"
fi
meson install -C "$BUILDDIR"

Now, every time the image is built, the build script will be executed and the results will be installed into the image.

The $BUILDDIR environment variable points to a directory that can be used as the build directory for build artifacts to allow for incremental builds if the build system supports it.

Of course, downloading all packages from scratch every time and re-installing them again every time the image is built is rather slow, so mkosi supports two modes of caching to speed things up.

The first caching mode caches all downloaded packages so they don't have to be downloaded again on subsequent builds. Enabling this is as simple as running mkdir mkosi.cache.

The second mode of caching caches the image after all packages have been installed but before running the build script. On subsequent builds, mkosi will copy the cache instead of reinstalling all packages from scratch. This mode can be enabled using the Incremental= setting. While there is some rudimentary cache invalidation, the cache can also forcibly be rebuilt by specifying -ff on the command line instead of -f.

Note that when running on a btrfs filesystem, mkosi will automatically use subvolumes for the cached images which can be snapshotted on subsequent builds for even faster rebuilds. We'll also use reflinks to do copy-on-write copies where possible.

With this setup, by running mkosi -f qemu in the systemd repository, it takes about 40 seconds to go from a source code change to a root shell in a virtual machine running the latest systemd with your change applied. This makes it very easy to test changes to systemd in a safe environment without risk of breaking your host system.

Of course, while 40 seconds is not a very long time, it's still more than we'd like, especially if all we're doing is modifying the kernel command line. That's why we have the KernelCommandLineExtra= option to configure kernel command line options that are passed to the container or virtual machine at runtime instead of being embedded into the image. These extra kernel command line options are picked up when the image is booted with qemu's direct kernel boot (using -append), but also when booting a disk image in UEFI mode (using SMBIOS). The same applies to systemd credentials (using the Credentials= setting). These settings allow configuring the image without having to rebuild it, which means that you only have to run mkosi qemu or mkosi boot again afterwards to apply the new settings.

Building images without root privileges and loop devices

By using newuidmap/newgidmap and systemd-repart, mkosi is able to build images without needing root privileges. As long as proper subuid and subgid mappings are set up for your user in /etc/subuid and /etc/subgid, you can run mkosi as your regular user without having to switch to root.

Note that as of the writing of this blog post this only applies to the build and qemu verbs. Booting the image in a systemd-nspawn container with mkosi boot still needs root privileges. We're hoping to fix this in an future systemd release.

Regardless of whether you're running mkosi with root or without root, almost every tool we execute is invoked in a sandbox to isolate as much of the build process from the host as possible. For example, /etc and /var from the host are not available in this sandbox, to avoid host configuration inadvertently affecting the build.

Because systemd-repart can build disk images without loop devices, mkosi can run from almost any environment, including containers. All that's needed is a UID range with 65536 UIDs available, either via running as the root user or via /etc/subuid and newuidmap. In a future systemd release, we're hoping to provide an alternative to newuidmap and /etc/subuid to allow running mkosi from all containers, even those with only a single UID available.

Supporting older distributions

mkosi depends on very recent versions of various systemd tools (v254 or newer). To support older distributions, we implemented so called tools trees. In short, mkosi can first build a tools image for you that contains all required tools to build the actual image. This can be enabled by adding ToolsTree=default to your mkosi configuration. Building a tools image does not require a recent version of systemd.

In the systemd mkosi configuration, we automatically use a tools tree if we detect your distribution does not have the minimum required systemd version installed.

Configuring variants of the same image using profiles

Profiles can be defined in the mkosi.profiles/ directory. The profile to use can be selected using the Profile= setting (or --profile=) on the command line. A profile allows you to bundle various settings behind a single recognizable name. Profiles can also be matched on if you want to apply some settings only to a few profiles.

For example, you could have a bootable profile that sets Bootable=yes, adds the linux and systemd-boot packages and configures Format=disk to end up with a bootable disk image when passing --profile bootable on the kernel command line.

Building system extension images

System extension images may – dynamically at runtime — extend the base system with an overlay containing additional files.

To build system extensions with mkosi, we need a base image on top of which we can build our extension.

To keep things manageable, we'll make use of mkosi's support for building multiple images so that we can build our base image and system extension in one go.

We start by creating a temporary directory with a base configuration file mkosi.conf with some shared settings:

[Output]
OutputDirectory=mkosi.output
CacheDirectory=mkosi.cache

Now let's continue with the base image definition by writing the following to mkosi.images/base/mkosi.conf:

[Output]
Format=directory

[Content]
CleanPackageMetadata=no
Packages=systemd
         udev

We use the directory output format here instead of the disk output so that we can build our extension without needing root privileges.

Now that we have our base image, we can define a sysext that builds on top of it by writing the following to mkosi.images/btrfs/mkosi.conf:

[Config]
Dependencies=base

[Output]
Format=sysext
Overlay=yes

[Content]
BaseTrees=%O/base
Packages=btrfs-progs

BaseTrees= point to our base image and Overlay=yes instructs mkosi to only package the files added on top of the base tree.

We can't sign the extension image without a key. We can generate one by running mkosi genkey which will generate files that are automatically picked up when building the image.

Finally, you can build the base image and the extensions by running mkosi -f. You'll find btrfs.raw in mkosi.output which is the extension image.

Various other interesting features

To sign any generated UKIs for secure boot, put your secure boot key and certificate in mkosi.key and mkosi.crt and enable the SecureBoot= setting. You can also run mkosi genkey to have mkosi generate a key and certificate itself.
The Ephemeral= setting can be enabled to boot the image in an ephemeral copy that is thrown away when the container or virtual machine exits.
ShimBootloader= and BiosBootloader= settings are available to configure shim and grub installation if needed.
mkosi can boot directory trees in a virtual using virtiofsd. This is very useful for quickly rebuilding an image and booting it as the image does not have to be packed up as a disk image.
...

There's many more features that we won't go over in detail here in this blog post. Learn more about those by reading the documentation.

Conclusion

I'll finish with a bunch of links to more information about mkosi and related tooling:

ASG! 2023 CfP Closes Soon

The All Systems Go! 2023 Call for Participation Closes in Three Days!

The Call for Participation (CFP) for All Systems Go! 2023 will close in three days, on 7th of July! We’d like to invite you to submit your proposals for consideration to the CFP submission site quickly!

ASG image

All topics relevant to foundational open-source Linux technologies are welcome. In particular, however, we are looking for proposals including, but not limited to, the following topics:

The CFP will close on July 7th, 2023. A response will be sent to all submitters on or before July 14th, 2023. The conference takes place in 🗺️ Berlin, Germany 🇩🇪 on Sept. 13-14th.

All Systems Go! 2023 is all about foundational open-source Linux technologies. We are primarily looking for deeply technical talks by and for developers, engineers and other technical roles.

We focus on the userspace side of things, so while kernel topics are welcome they must have clear, direct relevance to userspace. The following is a non-comprehensive list of topics encouraged for 2023 submissions:

Image-Based Linux 🖼️
Secure and Measured Boot 📏
TPM-Based Local/Remote Attestation, Encryption, Authentication 🔑
Low-level container executors and infrastructure ⚙️.
IoT, embedded and server Linux infrastructure
Reproducible builds 🔧
Package management, OS, container 📦, image delivery and updating
Building Linux devices and applications 🏗️
Low-level desktop 💻 technologies
Networking 🌐
System and service management 🚀
Tracing and performance measuring 🔍
IPC and RPC systems 🦜
Security 🔐 and Sandboxing 🏖️

For more information please visit our conference website!

Linux Boot Partitions

💽 Linux Boot Partitions and How to Set Them Up 🚀

Let’s have a look how traditional Linux distributions set up /boot/ and the ESP, and how this could be improved.

How Linux distributions traditionally have been setting up their “boot” file systems has been varying to some degree, but the most common choice has been to have a separate partition mounted to /boot/. Usually the partition is formatted as a Linux file system such as ext2/ext3/ext4. The partition contains the kernel images, the initrd and various boot loader resources. Some distributions, like Debian and Ubuntu, also store ancillary files associated with the kernel here, such as kconfig or System.map. Such a traditional boot partition is only defined within the context of the distribution, and typically not immediately recognizable as such when looking just at the partition table (i.e. it uses the generic Linux partition type UUID).

With the arrival of UEFI a new partition relevant for boot appeared, the EFI System Partition (ESP). This partition is defined by the firmware environment, but typically accessed by Linux to install or update boot loaders. The choice of file system is not up to Linux, but effectively mandated by the UEFI specifications: vFAT. In theory it could be formatted as other file systems too. However, this would require the firmware to support file systems other than vFAT. This is rare and firmware specific though, as vFAT is the only file system mandated by the UEFI specification. In other words, vFAT is the only file system which is guaranteed to be universally supported.

There’s a major overlap of the type of the data typically stored in the ESP and in the traditional boot partition mentioned earlier: a variety of boot loader resources as well as kernels/initrds.

Unlike the traditional boot partition, the ESP is easily recognizable in the partition table via its GPT partition type UUID. The ESP is also a shared resource: all OSes installed on the same disk will share it and put their boot resources into them (as opposed to the traditional boot partition, of which there is one per installed Linux OS, and only that one will put resources there).

To summarize, the most common setup on typical Linux distributions is something like this:

Type	Linux Mount Point	File System Choice
Linux “Boot” Partition	`/boot/`	Any Linux File System, typically ext2/ext3/ext4
ESP	`/boot/efi/`	vFAT

As mentioned, not all distributions or local installations agree on this. For example, it’s probably worth mentioning that some distributions decided to put kernels onto the root file system of the OS itself. For this setup to work the boot loader itself [sic!] must implement a non-trivial part of the storage stack. This may have to include RAID, storage drivers, networked storage, volume management, disk encryption, and Linux file systems. Leaving aside the conceptual argument that complex storage stacks don’t belong in boot loaders there are very practical problems with this approach. Reimplementing the Linux storage stack in all its combinations is a massive amount of work. It took decades to implement what we have on Linux now, and it will take a similar amount of work to catch up in the boot loader’s reimplementation. Moreover, there’s a political complication: some Linux file system communities made clear they have no interest in supporting a second file system implementation that is not maintained as part of the Linux kernel.

What’s interesting is that the /boot/efi/ mount point is nested below the /boot/ mount point. This effectively means that to access the ESP the Boot partition must exist and be mounted first. A system with just an ESP and without a Boot partition hence doesn’t fit well into the current model. The Boot partition will also have to carry an empty “efi” directory that can be used as the inner mount point, and serves no other purpose.

Given that the traditional boot partition and the ESP may carry similar data (i.e. boot loader resources, kernels, initrds) one may wonder why they are separate concepts. Historically, this was the easiest way to make the pre-UEFI way how Linux systems were booted compatible with UEFI: conceptually, the ESP can be seen as just a minor addition to the status quo ante that way. Today, primarily two reasons remained:

Some distributions see a benefit in support for complex Linux file system concepts such as hardlinks, symlinks, SELinux labels/extended attributes and so on when storing boot loader resources. – I personally believe that making use of features in the boot file systems that the firmware environment cannot really make sense of is very clearly not advisable. The UEFI file system APIs know no symlinks, and what is SELinux to UEFI anyway? Moreover, putting more than the absolute minimum of simple data files into such file systems immediately raises questions about how to authenticate them comprehensively (including all fancy metadata) cryptographically on use (see below).
On real-life systems that ship with non-Linux OSes the ESP often comes pre-installed with a size too small to carry multiple Linux kernels and initrds. As growing the size of an existing ESP is problematic (for example, because there’s no space available immediately after the ESP, or because some low-quality firmware reacts badly to the ESP changing size) placing the kernel in a separate, secondary partition (i.e. the boot partition) circumvents these space issues.

File System Choices

We already mentioned that the ESP effectively has to be vFAT, as that is what UEFI (more or less) guarantees. The file system choice for the boot partition is not quite as restricted, but using arbitrary Linux file systems is not really an option either. The file system must be accessible by both the boot loader and the Linux OS. Hence only file systems that are available in both can be used. Note that such secondary implementations of Linux file systems in the boot environment – limited as they may be – are not typically welcomed or supported by the maintainers of the canonical file system implementation in the upstream Linux kernel. Modern file systems are notoriously complicated and delicate and simply don’t belong in boot loaders.

In a trusted boot world, the two file systems for the ESP and the /boot/ partition should be considered untrusted: any code or essential data read from them must be authenticated cryptographically before use. And even more, the file system structures themselves are also untrusted. The file system driver reading them must be careful not to be exploitable by a rogue file system image. Effectively this means a simple file system (for which a driver can be more easily validated and reviewed) is generally a better choice than a complex file system (Linux file system communities made it pretty clear that robustness against rogue file system images is outside of their scope and not what is being tested for.).

Some approaches tried to address the fact that boot partitions are untrusted territory by encrypting them via a mechanism compatible to LUKS, and adding decryption capabilities to the boot loader so it can access it. This misses the point though, as encryption does not imply authentication, and only authentication is typically desired. The boot loader and kernel code are typically Open Source anyway, and hence there’s little value in attempting to keep secret what is already public knowledge. Moreover, encryption implies the existence of an encryption key. Physically typing in the decryption key on a keyboard might still be acceptable on desktop systems with a single human user in front, but outside of that scenario unlock via TPM, PKCS#11 or network services are typically required. And even on the desktop FIDO2 unlocking is probably the future. Implementing all the technologies these unlocking mechanisms require in the boot loader is not realistic, unless the boot loader shall become a full OS on its own as it would require subsystems for FIDO2, PKCS#11, USB, Bluetooth network, smart card access, and so on.

File System Access Patterns

Note that traditionally both mentioned partitions were read-only during most parts of the boot. Only later, once the OS is up, write access was required to implement OS or boot loader updates. In today’s world things have become a bit more complicated. A modern OS might want to require some limited write access already in the boot loader, to implement boot counting/boot assessment/automatic fallback (e.g., if the same kernel fails to boot 3 times, automatically revert to older kernel), or to maintain an early storage-based random seed. This means that even though the file system is mostly read-only, we need limited write access after all.

vFAT cannot compete with modern Linux file systems such as btrfs when it comes to data safety guarantees. It’s not a journaled file system, does not use CoW or any form of checksumming. This means when used for the system boot process we need to be particularly careful when accessing it, and in particular when making changes to it (i.e., trying to keep changes local to single sectors). It is essential to use write patterns that minimize the chance of file system corruption. Checking the file system (“fsck”) before modification (and probably also reading) is important, as is ensuring the file system is put into a “clean” state as quickly as possible after each modification.

Code quality of the firmware in typical systems is known to not always be great. When relying on the file system driver included in the firmware it’s hence a good idea to limit use to operations that have a better chance to be correctly implemented. For example, when writing from the UEFI environment it might be wise to avoid any operation that requires allocation algorithms, but instead focus on access patterns that only override already written data, and do not require allocation of new space for the data.

Besides write access from the boot loader code (as described above) these file systems will require write access from the OS, to facilitate boot loader and kernel/initrd updates. These types of accesses are generally not fully random accesses (i.e., never partial file updates) but usually mean adding new files as whole, and removing old files as a whole. Existing files are typically not modified once created, though they might be replaced wholly by newer versions.

Boot Loader Updates

Note that the update cycle frequencies for boot loaders and for kernels/initrds are probably similar these days. While kernels are still vastly more complex than boot loaders, security issues are regularly found in both. In particular, as boot loaders (through “shim” and similar components) carry certificate/keyring and denylist information, which typically require frequent updates. Update cycles hence have to be expected regularly.

Boot Partition Discovery

The traditional boot partition was not recognizable by looking just at the partition table. On MBR systems it was directly referenced from the boot sector of the disk, and on EFI systems from information stored in the ESP. This is less than ideal since by losing this entrypoint information the system becomes unbootable. It’s typically a better, more robust idea to make boot partitions recognizable as such in the partition table directly. This is done for the ESP via the GPT partition type UUID. For traditional boot partitions this was not done though.

Current Situation Summary

Let’s try to summarize the above:

Currently, typical deployments use two distinct boot partitions, often using two distinct file system implementations
Firmware effectively dictates existence of the ESP, and the use of vFAT
In userspace view: the ESP mount is nested below the general Boot partition mount
Resources stored in both partitions are primarily kernel/initrd, and boot loader resources
The mandatory use of vFAT brings certain data safety challenges, as does quality of firmware file system driver code
During boot limited write access is needed, during OS runtime more comprehensive write access is needed (though still not fully random).
Less restricted but still limited write patterns from OS environment (only full file additions/updates/removals, during OS/boot loader updates)
Boot loaders should not implement complex storage stacks.
ESP can be auto-discovered from the partition table, traditional boot partition cannot.
ESP and the traditional boot partition are not protected cryptographically neither in structure nor contents. It is expected that loaded files are individually authenticated after being read.
The ESP is a shared resource — the traditional boot partition a resource specific to each installed Linux OS on the same disk.

How to Do it Better

Now that we have discussed many of the issues with the status quo ante, let’s see how we can do things better:

Two partitions for essentially the same data is a bad idea. Given they carry data very similar or identical in nature, the common case should be to have only one (but see below).
Two file system implementations are worse than one. Given that vFAT is more or less mandated by UEFI and the only format universally understood by all players, and thus has to be used anyway, it might as well be the only file system that is used.
Data safety is unnecessarily bad so far: both ESP and boot partition are continuously mounted from the OS, even though access is pretty restricted: outside of update cycles access is typically not required.
All partitions should be auto-discoverable/self-descriptive
The two partitions should not be exposed as nested mounts to userspace

To be more specific, here’s how I think a better way to set this all up would look like:

Whenever possible, only have one boot partition, not two. On EFI systems, make it the ESP. On non-EFI systems use an XBOOTLDR partition instead (see below). Only have both in the case where a Linux OS is installed on a system that already contains an OS with an ESP that is too small to carry sufficient kernels/initrds. When a system contains a XBOOTLDR partition put kernels/initrd on that, otherwise the ESP.
Instead of the vaguely defined, traditional Linux “boot” partition use the XBOOTLDR partition type as defined by the Discoverable Partitions Specification. This ensures the partition is discoverable, and can be automatically mounted by things like systemd-gpt-auto-generator. Use XBOOTLDR only if you have to, i.e., when dealing with systems that lack UEFI (and where the ESP hence has no value) or to address the mentioned size issues with the ESP. Note that unlike the traditional boot partition the XBOOTLDR partition is a shared resource, i.e., shared between multiple parallel Linux OS installations on the same disk. Because of this it is typically wise to place a per-OS directory at the top of the XBOOTLDR file system to avoid conflicts.
Use vFAT for both partitions, it’s the only thing universally understood among relevant firmwares and Linux. It’s simple enough to be useful for untrusted storage. Or to say this differently: writing a file system driver that is not easily vulnerable to rogue disk images is much easier for vFAT than for let’s say btrfs. – But the choice of vFAT implies some care needs to be taken to address the data safety issues it brings, see below.
Mount the two partitions via the “automount” logic. For example, via systemd’s automount units, with a very short idle time-out (one second or so). This improves data safety immensely, as the file systems will remain mounted (and thus possibly in a “dirty” state) only for very short periods of time, when they are actually accessed – and all that while the fact that they are not mounted continuously is mostly not noticeable for applications as the file system paths remain continuously around. Given that the backing file system (vFAT) has poor data safety properties, it is essential to shorten the access for unclean file system state as much as possible. In fact, this is what the aforementioned systemd-gpt-auto-generator logic actually does by default.
Whenever mounting one of the two partitions, do a file system check (fsck; in fact this is also what systemd-gpt-auto-generatordoes by default, hooked into the automount logic, to run on first access). This ensures that even if the file system is in an unclean state it is restored to be clean when needed, i.e., on first access.
Do not mount the two partitions nested, i.e., no more /boot/efi/. First of all, as mentioned above, it should be possible (and is desirable) to only have one of the two. Hence it is simply a bad idea to require the other as well, just to be able to mount it. More importantly though, by nesting them, automounting is complicated, as it is necessary to trigger the first automount to establish the second automount, which defeats the point of automounting them in the first place. Use the two distinct mount points /efi/ (for the ESP) and /boot/ (for XBOOTLDR) instead. You might have guessed, but that too is what systemd-gpt-auto-generator does by default.
When making additions or updates to ESP/XBOOTLDR from the OS make sure to create a file and write it in full, then syncfs() the whole file system, then rename to give it its final name, and syncfs() again. Similar when removing files.
When writing from the boot loader environment/UEFI to ESP/XBOOTLDR, do not append to files or create new files. Instead overwrite already allocated file contents (for example to maintain a random seed file) or rename already allocated files to include information in the file name (and ideally do not increase the file name in length; for example to maintain boot counters).
Consider adopting UKIs, which minimize the number of files that need to be updated on the ESP/XBOOTLDR during OS/kernel updates (ideally down to 1)
Consider adopting systemd-boot, which minimizes the number of files that need to be updated on boot loader updates (ideally down to 1)
Consider removing any mention of ESP/XBOOTLDR from /etc/fstab, and just let systemd-gpt-auto-generator do its thing.
Stop implementing file systems, complex storage, disk encryption, … in your boot loader.

Implementing things like that you gain:

Simplicity: only one file system implementation, typically only one partition and mount point
Robust auto-discovery of all partitions, no need to even configure /etc/fstab
Data safety guarantees as good as possible, given the circumstances

To summarize this in a table:

Type	Linux Mount Point	File System Choice	Automount
ESP	`/efi/`	vFAT	yes
XBOOTLDR	`/boot/`	vFAT	yes

A note regarding modern boot loaders that implement the Boot Loader Specification: both partitions are explicitly listed in the specification as sources for both Type #1 and Type #2 boot menu entries. Hence, if you use such a modern boot loader (e.g. systemd-boot) these two partitions are the preferred location for boot loader resources, kernels and initrds anyway.

Addendum: You got RAID?

You might wonder, what about RAID setups and the ESP? This comes up regularly in discussions: how to set up the ESP so that (software) RAID1 (mirroring) can be done on the ESP. Long story short: I’d strongly advise against using RAID on the ESP. Firmware typically doesn’t have native RAID support, and given that firmware and boot loader can write to the file systems involved, any attempt to use software RAID on them will mean that a boot cycle might corrupt the RAID sync, and immediately requires a re-synchronization after boot. If RAID1 backing for the ESP is really necessary, the only way to implement that safely would be to implement this as a driver for UEFI – but that creates certain bootstrapping issues (i.e., where to place the driver if not the ESP, a file system the driver is supposed to be used for), and also reimplements a considerable component of the OS storage stack in firmware mode, which seems problematic.

So what to do instead? My recommendation would be to solve this via userspace tooling. If redundant disk support shall be implemented for the ESP, then create separate ESPs on all disks, and synchronize them on the file system level instead of the block level. Or in other words, the tools that install/update/manage kernels or boot loaders should be taught to maintain multiple ESPs instead of one. Copy the kernels/boot loader files to all of them, and remove them from all of them. Under the assumption that the goal of RAID is a more reliable system this should be the best way to achieve that, as it doesn’t pretend the firmware could do things it actually cannot do. Moreover it minimizes the complexity of the boot loader, shifting the syncing logic to userspace, where it’s typically easier to get right.

Addendum: Networked Boot

The discussion above focuses on booting up from a local disk. When thinking about networked boot I think two scenarios are particularly relevant:

PXE-style network booting. I think in this mode of operation focus should be on directly booting a single UKI image instead of a boot loader. This sidesteps the whole issue of maintaining any boot partition at all, and simplifies the boot process greatly. In scenarios where this is not sufficient, and an interactive boot menu or other boot loader features are desired, it might be a good idea to take inspiration from the UKI concept, and build a single boot loader EFI binary (such as systemd-boot), and include the UKIs for the boot menu items and other resources inside it via PE sections. Or in other words, build a single boot loader binary that is “supercharged” and contains all auxiliary resources in its own PE sections. (Note: this does not exist, it’s an idea I intend to explore with systemd-boot). Benefit: a single file has to be downloaded via PXE/TFTP, not more. Disadvantage: unused resources are downloaded unnecessarily. Either way: in this context there is no local storage, and the ESP/XBOOTLDR discussion above is without relevance.
Initrd-style network booting. In this scenario the boot loader and kernel/initrd (better: UKI) are available on a local disk. The initrd then configures the network and transitions to a network share or file system on a network block device for the root file system. In this case the discussion above applies, and in fact the ESP or XBOOTLDR partition would be the only partition available locally on disk.

And this is all I have for today.