Skip to content

fs: Add read_container_root() and copy metadata from /usr -> /#209

Merged
cgwalters merged 6 commits intocomposefs:mainfrom
cgwalters:root-metadata-from-usr
Jan 15, 2026
Merged

fs: Add read_container_root() and copy metadata from /usr -> /#209
cgwalters merged 6 commits intocomposefs:mainfrom
cgwalters:root-metadata-from-usr

Conversation

@cgwalters
Copy link
Collaborator

OCI container layer tars often don't include a root directory entry, and when they do, container runtimes (Podman, Docker) ignore it and use hardcoded defaults. We had some attempt to handle this with our "stat root" stuff but that only handled mtime - i.e. not permissions or other metadata.

Since the contents of / are ill defined, let's just default to defining them based on /usr since I am not aware of anyone who cares for those to be different (except for the selinux label but we handle that already).

Add new APIs to handle this:

  • copy_root_metadata_from_usr(): copies mode, uid, gid, mtime, xattrs from /usr to root directory for deterministic metadata

  • read_container_root(): convenience wrapper that calls read_filesystem() then copy_root_metadata_from_usr()

Remove the incomplete have_root_stat/ensure_root_stat() mechanism which only handled mtime. Callers should now explicitly use the new helpers.

Update cfsctl commands (compute-id, create-image, create-dumpfile) to use read_container_root() by default. Add --no-propagate-usr-to-root flag to use raw read_filesystem() for testing.

Assisted-by: OpenCode (Opus 4.5)

@cgwalters cgwalters force-pushed the root-metadata-from-usr branch from ddcf1ce to 6b05097 Compare January 9, 2026 21:25
Copy link
Collaborator

@allisonkarlitskaya allisonkarlitskaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't look at the code but this definitely requires an update to https://github.com/containers/composefs-rs/blob/main/doc/oci.md

@cgwalters cgwalters force-pushed the root-metadata-from-usr branch from 6b05097 to 3d33e70 Compare January 10, 2026 19:03
@cgwalters
Copy link
Collaborator Author

Thanks, sorry I forgot about that. I've now ensured that's cross-referenced more strongly in the code too.

I added a followup commit here to make even more clear that the "processing OCI layers" is the weird case; from a dumpfile we construct the fs with the root stat metadata that we know is there from the start.


That said...I was thinking about this more and realized we still have at least one other big issue in this area that's representative: Right now if SELinux is enabled at build time, we'll pick up the security.selinux xattrs like container_t etc when operating on a mounted root. When targeting a SELinux host, that's fine as that gets overwritten. But if one is targeting a non-SELinux host (a valid use case), then right now we just leak the build-time xattrs into the target (see early return in selabel()) which certainly breaks reproducibility.

More generally I think there are also corner cases where host-side xattrs can visibly leak into the container runtime just because of how overlayfs works. This one relates to containers/storage#1608 (comment) pretty strongly.

I guess for now, the simplest fix might be:

  • Fix the transform_for_boot logic to remove all security.selinux xattrs if we don't find a policy in the target root
  • Otherwise, only accept an allowlist for xattrs desired from mounted roots, and the default would just be security.capability

@cgwalters cgwalters force-pushed the root-metadata-from-usr branch from 3d33e70 to ff67094 Compare January 12, 2026 15:35
@cgwalters
Copy link
Collaborator Author

Added another change to take a much stronger stance around xattrs.

Add a method to visit every Stat in the filesystem tree. This provides
a reusable primitive for operations that need to traverse all nodes,
such as filtering xattrs or stripping SELinux labels.

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
OCI container layer tars often don't include a root directory entry,
and when they do, container runtimes (Podman, Docker) ignore it and
use hardcoded defaults. This makes root metadata non-deterministic,
causing composefs digest mismatches.

For detailed analysis of this issue, see:
  https://gist.github.com/cgwalters/c18c9337aa9345d763aa446cc95c7847
  containers/storage#743

Since the metadata of `/` is ill-defined in the OCI spec, we default
to copying it from `/usr` - no one should care if these differ, except
for SELinux labels which we handle separately in the boot path.

Add new APIs:
- copy_root_metadata_from_usr(): copies mode, uid, gid, mtime, xattrs
  from /usr to root directory
- read_container_root(): wrapper that calls read_filesystem() then
  copy_root_metadata_from_usr()

Remove the incomplete have_root_stat/ensure_root_stat() mechanism which
only handled mtime, not permissions or other metadata.

Update cfsctl commands (compute-id, create-image, create-dumpfile) to
use read_container_root() by default. Add --no-propagate-usr-to-root
flag to use raw read_filesystem() when root metadata is well-defined.

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
Followup to the previous commit. In most use cases we are starting
from a fully populated filesystem - there's no need to hardcode
permissions. The main exception to this is when we're processing
OCI tar streams.

Instead of using mode 0o555 for this and arbitrarily match podman
(but not docker) let's hardcode mode 0 instead which is even
more obviously uninitialized.

We again now expect OCI use cases to do the "propagate attrs from /usr"
approach.

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
…SELinux targets

When reading a container filesystem from a mounted root, host-side xattrs
can leak into the image. This is particularly problematic for SELinux labels
like `container_t` that come from the build host's overlayfs, breaking
reproducibility when targeting non-SELinux systems.

Add xattr filtering with a predicate-based approach:
- `read_filesystem_filtered()` takes a closure to filter xattrs
- `read_container_root()` uses `is_allowed_container_xattr()` which
  checks against `CONTAINER_XATTR_ALLOWLIST` (currently `security.capability`)
- `FileSystem::filter_xattrs()` provides the underlying filtering using
  the new `for_each_stat()` helper

Additionally, fix `transform_for_boot()` SELinux handling:
- If no SELinux policy is found in the target filesystem, strip all
  `security.selinux` xattrs instead of leaving build-time labels in place
- `selabel()` now returns `bool` indicating whether labeling was performed

This ensures:
- Build-time SELinux labels don't leak into non-SELinux targets
- SELinux-enabled targets get correct labels from their own policy
- Other host xattrs (overlayfs internals, etc.) don't pollute the image

See: containers/storage#1608 (comment)

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
@cgwalters cgwalters force-pushed the root-metadata-from-usr branch from ff67094 to e1dd239 Compare January 14, 2026 01:59
cgwalters added a commit to cgwalters/bootc that referenced this pull request Jan 14, 2026
See composefs/composefs-rs#209

Signed-off-by: Colin Walters <walters@verbum.org>

Assisted-by: OpenCode (Opus 4.5)
cgwalters added a commit to cgwalters/bootc that referenced this pull request Jan 14, 2026
See composefs/composefs-rs#209

Signed-off-by: Colin Walters <walters@verbum.org>

Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
@cgwalters cgwalters requested a review from jeckersb January 14, 2026 14:44
Empty /run in create_filesystem() when processing OCI images. This is
a tmpfs at runtime and should always be empty in images.

Emptying /run also works around podman/buildah's RUN --mount issue where
cache mounts can leave incomplete directory entries in OCI tar layers,
causing mtime inconsistencies. Container authors can now safely use:

    RUN rm -rf /var/cache/dnf && ln -sr /run/dnfcache /var/cache/dnf
    RUN --mount=type=cache,target=/run/dnfcache dnf install -y ...

The /run directory is emptied for all OCI images, not just bootable ones,
since it should always be empty regardless of boot configuration.

Also change emptied directories in transform_for_boot() (/boot, /sysroot)
to use /usr's mtime instead of 0, for consistency with how root directory
metadata is handled via copy_root_metadata_from_usr().

Closes: composefs#132
Assisted-by: OpenCode (Opus 4.5)
Signed-off-by: Colin Walters <walters@verbum.org>
@cgwalters cgwalters force-pushed the root-metadata-from-usr branch from e1dd239 to c7257e9 Compare January 14, 2026 16:35
@cgwalters
Copy link
Collaborator Author

Now on /run we could also error out instead of emptying it. We could also try to allow it, but it would make working around the other podman bug a lot more annoying; I think we'd always have to make mount points in / which is too ugly to live.

I don't have a really strong opinion on erroring versus emptying, the main reason I chose emptying is just because we already were doing that for /sysroot and /boot.

@cgwalters cgwalters marked this pull request as ready for review January 15, 2026 13:51
`container_t` that come from the build host, not from the target system's
policy.

To ensure reproducibility, `read_container_root()` filters xattrs to only
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From live chat: a TODO is to expose configurability for this on the CLI in case anyone wants to allowlist other xattrs for $reasons.

Copy link
Collaborator

@jeckersb jeckersb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM, we went over this in moderate detail in a live chat

@jeckersb
Copy link
Collaborator

@allisonkarlitskaya this needs signoff on your requested update to oci.md

@cgwalters cgwalters merged commit 4a06016 into composefs:main Jan 15, 2026
5 of 15 checks passed
cgwalters added a commit to cgwalters/bootc that referenced this pull request Jan 16, 2026
The composefs-rs PR 209 has been merged to main. This updates
bootc to use the containers/composefs-rs repository at the
merge commit.

Key API changes:
- Directory::default() -> Directory::new(Stat::uninitialized())
- read_filesystem() no longer takes stat_root parameter
- New read_container_root() for OCI containers (propagates /usr metadata to root)
- stat_root CLI flag renamed to no_propagate_usr_to_root with inverted logic

See composefs/composefs-rs#209

Signed-off-by: Colin Walters <walters@verbum.org>
cgwalters added a commit to cgwalters/bootc that referenced this pull request Jan 16, 2026
The composefs-rs PR 209 has been merged to main. This updates
bootc to use the containers/composefs-rs repository at the
merge commit.

Key API changes:
- Directory::default() -> Directory::new(Stat::uninitialized())
- read_filesystem() no longer takes stat_root parameter
- New read_container_root() for OCI containers (propagates /usr metadata to root)
- stat_root CLI flag renamed to no_propagate_usr_to_root with inverted logic

See composefs/composefs-rs#209

Signed-off-by: Colin Walters <walters@verbum.org>
cgwalters added a commit to cgwalters/bootc that referenced this pull request Jan 16, 2026
The composefs-rs PR 209 has been merged to main. This updates
bootc to use the containers/composefs-rs repository at the
merge commit.

Key API changes:
- Directory::default() -> Directory::new(Stat::uninitialized())
- read_filesystem() no longer takes stat_root parameter
- New read_container_root() for OCI containers (propagates /usr metadata to root)
- stat_root CLI flag renamed to no_propagate_usr_to_root with inverted logic

See composefs/composefs-rs#209

Signed-off-by: Colin Walters <walters@verbum.org>
cgwalters added a commit to cgwalters/bootc that referenced this pull request Jan 19, 2026
The composefs-rs PR 209 has been merged to main. This updates
bootc to use the containers/composefs-rs repository at the
merge commit.

Key API changes:
- Directory::default() -> Directory::new(Stat::uninitialized())
- read_filesystem() no longer takes stat_root parameter
- New read_container_root() for OCI containers (propagates /usr metadata to root)
- stat_root CLI flag renamed to no_propagate_usr_to_root with inverted logic

See composefs/composefs-rs#209

Signed-off-by: Colin Walters <walters@verbum.org>
cgwalters added a commit to cgwalters/bootc that referenced this pull request Jan 19, 2026
The composefs-rs PR 209 has been merged to main. This updates
bootc to use the containers/composefs-rs repository at the
merge commit.

Key API changes:
- Directory::default() -> Directory::new(Stat::uninitialized())
- read_filesystem() no longer takes stat_root parameter
- New read_container_root() for OCI containers (propagates /usr metadata to root)
- stat_root CLI flag renamed to no_propagate_usr_to_root with inverted logic

See composefs/composefs-rs#209

Signed-off-by: Colin Walters <walters@verbum.org>
cgwalters added a commit to bootc-dev/bootc that referenced this pull request Jan 22, 2026
The composefs-rs PR 209 has been merged to main. This updates
bootc to use the containers/composefs-rs repository at the
merge commit.

Key API changes:
- Directory::default() -> Directory::new(Stat::uninitialized())
- read_filesystem() no longer takes stat_root parameter
- New read_container_root() for OCI containers (propagates /usr metadata to root)
- stat_root CLI flag renamed to no_propagate_usr_to_root with inverted logic

See composefs/composefs-rs#209

Signed-off-by: Colin Walters <walters@verbum.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants