
cat: avoid unnecessary allocation #11675

Merged
sylvestre merged 1 commit into uutils:main from oech3:cat-alloc
Apr 12, 2026

Conversation

@oech3

@oech3 oech3 commented Apr 6, 2026

Contributor

Allocate the buffer for the read()/write() slow path on the heap instead of the stack. The buffer is unnecessary if the splice() fast path succeeds.

$ echo 1 > /tmp/1
> taskset -c 0 hyperfine -N --runs 10000 "/tmp/coreutils/target/release/cat-stack /tmp/1" "target/release/cat-heap /tmp/1"
Benchmark 1: /tmp/coreutils/target/release/cat-stack /tmp/1
  Time (mean ± σ):     921.2 µs ±  84.4 µs    [User: 372.9 µs, System: 443.7 µs]
  Range (min … max):   843.0 µs … 3926.9 µs    10000 runs
Benchmark 2: target/release/cat-heap /tmp/1
  Time (mean ± σ):     908.6 µs ± 117.0 µs    [User: 380.6 µs, System: 424.1 µs]
  Range (min … max):   821.4 µs … 4337.6 µs    10000 runs 
Summary
  target/release/cat-heap /tmp/1 ran
    1.01 ± 0.16 times faster than /tmp/coreutils/target/release/cat-stack /tmp/1

related #10832
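
The idea in this description can be sketched roughly as follows. This is a minimal illustration, not the actual uutils code: `copy_with_fallback` and the `fast_path` closure are hypothetical names, and the closure merely stands in for the splice() attempt. The point is that the `vec!` buffer is only created once the fast path has reported that it could not handle the copy.

```rust
use std::io::{self, Read, Write};

/// Fallback buffer size; 64 KiB is the size discussed in this thread.
const BUF_SIZE: usize = 64 * 1024;

/// Hypothetical helper: try `fast_path` first and only allocate the
/// read()/write() buffer if it returns Ok(false).
/// Returns the number of bytes copied by the slow path.
fn copy_with_fallback<R, W, F>(reader: &mut R, writer: &mut W, fast_path: F) -> io::Result<u64>
where
    R: Read,
    W: Write,
    F: FnOnce(&mut R, &mut W) -> io::Result<bool>,
{
    // Stand-in for the splice() attempt (assumption: Ok(true) means "done").
    if fast_path(reader, writer)? {
        return Ok(0);
    }

    // Slow path: the heap buffer is created only when we get here.
    let mut buf = vec![0u8; BUF_SIZE];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        writer.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}
```

Passing a closure such as `|_, _| Ok(false)` forces the fallback, which makes the slow path easy to exercise on platforms without splice().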

@github-actions

github-actions Bot commented Apr 6, 2026


GNU testsuite comparison:

Skipping an intermittent issue tests/tty/tty-eof (passes in this run but fails in the 'main' branch)
Note: The gnu test tests/basenc/bounded-memory is now being skipped but was previously passing.
Note: The gnu test tests/dd/no-allocate is now being skipped but was previously passing.
Note: The gnu test tests/tail/tail-n0f is now being skipped but was previously passing.
Congrats! The gnu test tests/cut/bounded-memory is now passing!

@oech3 oech3 marked this pull request as ready for review April 6, 2026 07:55
@oech3 oech3 marked this pull request as draft April 6, 2026 08:04
@github-actions

github-actions Bot commented Apr 6, 2026


GNU testsuite comparison:

Skip an intermittent issue tests/cut/bounded-memory (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/date/date-locale-hour (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/date/resolution (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/cut/cut-huge-range is now passing!

@oech3

oech3 commented Apr 6, 2026

Contributor Author

hyperfine is flakey

@oech3 oech3 marked this pull request as ready for review April 6, 2026 08:49
@xtqqczze

xtqqczze commented Apr 6, 2026

Contributor

Switching from a stack allocation to a heap allocation doesn’t avoid allocation...

@oech3

oech3 commented Apr 6, 2026 via email

Contributor Author

@oech3

oech3 commented Apr 6, 2026

Contributor Author

I saw a larger perf difference with a 1024 * 1024 buffer after switching to vec. So I think vec's allocation is deferred.

@github-actions

github-actions Bot commented Apr 8, 2026


GNU testsuite comparison:

Skipping an intermittent issue tests/cut/bounded-memory (passes in this run but fails in the 'main' branch)
Skipping an intermittent issue tests/date/date-locale-hour (passes in this run but fails in the 'main' branch)
Note: The gnu test tests/rm/many-dir-entries-vs-OOM is now being skipped but was previously passing.

@github-actions

github-actions Bot commented Apr 8, 2026


GNU testsuite comparison:

Skipping an intermittent issue tests/tty/tty-eof (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/cut/cut-huge-range is now passing!

@sylvestre sylvestre merged commit efd0f0c into uutils:main Apr 12, 2026
169 checks passed
@oech3 oech3 deleted the cat-alloc branch April 12, 2026 14:38
@oech3

oech3 commented Apr 12, 2026

Contributor Author

We might use nightly fill_buf in the future to avoid the zero-fill here.

@xtqqczze

Contributor

> We might use nightly fill_buf in the future to avoid 0-fill at here.

Presumably you mean nightly-only Read::read_buf. Might be worth prototyping an implementation to validate this approach.

@xtqqczze

Contributor

> 1.01 ± 0.16 times faster

This doesn’t appear to be a statistically significant improvement; the reported uncertainty is large enough that the result is consistent with both a slowdown and a speedup.
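
The arithmetic behind this point can be reproduced from the numbers in the PR description. The sketch below assumes the two measurements are independent and uses standard first-order error propagation for a ratio, which gives the same 1.01 ± 0.16 that hyperfine printed; the function names are made up for illustration.

```rust
/// Ratio of two means with first-order propagated uncertainty,
/// assuming the two measurements are independent.
fn ratio_with_uncertainty(mean_a: f64, sd_a: f64, mean_b: f64, sd_b: f64) -> (f64, f64) {
    let ratio = mean_a / mean_b;
    // Relative uncertainties add in quadrature for a quotient.
    let rel = ((sd_a / mean_a).powi(2) + (sd_b / mean_b).powi(2)).sqrt();
    (ratio, ratio * rel)
}

/// True if the interval `ratio ± sigma` contains 1.0, i.e. "no difference".
fn consistent_with_no_change(ratio: f64, sigma: f64) -> bool {
    ratio - sigma <= 1.0 && 1.0 <= ratio + sigma
}
```

With the cat-stack figures (921.2 µs ± 84.4 µs) and the cat-heap figures (908.6 µs ± 117.0 µs), the interval runs from roughly 0.85 to 1.17, so it covers both a slowdown and a speedup.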

@oech3

oech3 commented Apr 19, 2026

Contributor Author

When I manually changed the buffer to a large size (several MiB), it caused a stack overflow without vec!. So I think Linux is saving RAM usage, at least.

(but we should avoid N MiB pipe usage for small input)

@xtqqczze

Contributor

> When I manually changed it with large MiB

But we’re talking about the 64 KiB stack allocation here.

@oech3

oech3 commented Apr 19, 2026

Contributor Author

Linux can still save 64 KiB.

@xtqqczze

Contributor

The stack space is already reserved, so switching to a heap allocation actually increases overall memory usage, at least in theory.

@oech3

oech3 commented Apr 19, 2026

Contributor Author

If the splice() fast path succeeds, cat does not take the code path that allocates buf.

@oech3

oech3 commented Apr 19, 2026

Contributor Author

This is impossible to test on macOS, but changing buf to a large stack array causes a serious perf drop, while a vec does not, when splice() succeeds. So the allocation is omitted on Linux.

@xtqqczze

Contributor

This PR introduced a heap allocation on Linux where there wasn’t one previously. Based on the data in the description, there is no statistically significant improvement. Using a significantly larger stack array would risk stack overflow and violate clippy::large_stack_arrays.

@oech3

oech3 commented Apr 19, 2026

Contributor Author

> changing buf to large stack

This was only to verify that the allocation is bypassed. I don't intend to do this in production.

@oech3

oech3 commented Apr 19, 2026

Contributor Author

In your view, how can we actually bypass the allocation completely when the splice() fast path succeeds?

@xtqqczze

Contributor

Reverting the PR would avoid the unnecessary heap allocation and allocate for free using existing stack space. Your observed improvement in hyperfine is likely just noise or an artifact of LLVM optimization.

@oech3

oech3 commented Apr 19, 2026

Contributor Author

I want to completely stop allocating it when splice() succeeds. How can I do that? Who guarantees the "existing stack space"?

@xtqqczze

Contributor

There is typically 2 MiB stack already reserved per thread, see https://doc.rust-lang.org/std/thread/#stack-size. Using a fixed-size stack buffer will not introduce an additional system allocation.
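
As a rough illustration of the stack-size point, the sketch below places a 64 KiB array on a thread whose stack size is set explicitly through `std::thread::Builder::stack_size`. Note the assumptions: the 2 MiB figure in the linked documentation is the default for threads spawned by Rust, while the main thread's stack size is determined by the platform rather than by Rust. The function name is made up for this example.

```rust
use std::thread;

/// Spawns a thread with the given stack size, creates a 64 KiB stack
/// array on it, and returns the array's length.
fn buffer_len_on_thread(stack_size: usize) -> usize {
    thread::Builder::new()
        .stack_size(stack_size)
        .spawn(|| {
            // The array lives in this thread's already-reserved stack;
            // no separate allocator call is involved.
            let buf = [0u8; 64 * 1024];
            // black_box discourages the optimizer from removing the array.
            std::hint::black_box(&buf).len()
        })
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}
```

Calling `buffer_len_on_thread(2 * 1024 * 1024)` leaves ample headroom for the 64 KiB array, whereas a 1 MiB array would consume half of such a stack.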

@oech3

oech3 commented Apr 19, 2026

Contributor Author

Hmm. At least, a 1 MiB vec! on the pure splice path was clearly faster than a 1 MiB stack array.

@xtqqczze

Contributor

1 MiB is too large for a stack array and risks stack overflow. It also violates clippy::large_stack_arrays.

@oech3

oech3 commented Apr 19, 2026

Contributor Author

Did you see #11675 (comment)? It was just for local verification.

If the 2 MiB stack were actually free, a 1 MiB stack array should not hurt perf. But it did.

@oech3

This comment was marked as outdated.

@xtqqczze

Contributor

Ah, the likely reason for your performance drop is that a 1 MiB stack buffer must be zeroed at function entry. In our case we only use a 64 KiB buffer, so that overhead is negligible. If an uninitialized buffer could be used via Read::read_buf, this would not be a factor.

@oech3

oech3 commented Apr 19, 2026

Contributor Author

I would split out the function containing the stack array so that it stays off the call stack too.
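
One way to read this suggestion is sketched below. It is an assumption about what was meant, not code from the PR: the stack array is moved into a separate function marked `#[inline(never)]`, so its frame (and any zeroing of the array) only exists when the slow path is actually entered. `#[inline(never)]` is a hint to the compiler rather than a hard guarantee, and `copy`, `copy_slow_path` and `fast_path_done` are hypothetical names.

```rust
use std::io::{self, Read, Write};

/// Slow path kept in its own frame so the 64 KiB array is only set up
/// when this function is called.
#[inline(never)]
fn copy_slow_path<R: Read, W: Write>(reader: &mut R, writer: &mut W) -> io::Result<u64> {
    let mut buf = [0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        writer.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}

/// `fast_path_done` stands in for "splice() already copied everything".
fn copy<R: Read, W: Write>(
    reader: &mut R,
    writer: &mut W,
    fast_path_done: bool,
) -> io::Result<u64> {
    if fast_path_done {
        return Ok(0); // the slow-path frame is never created
    }
    copy_slow_path(reader, writer)
}
```

Compared with a heap buffer, this keeps the slow path free of allocator calls while still avoiding the array setup on the fast path, provided the compiler honors the hint.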

@xtqqczze

Contributor

I guess the change made sense to avoid unnecessary zero-initialization, but the following would also have worked:

    // Use a small stack array to avoid unnecessary zero-initialization overhead when splice() was used
    #[cfg(any(target_os = "linux", target_os = "android"))]
    let mut buf = [0; 512];
    #[cfg(not(any(target_os = "linux", target_os = "android")))]
    let mut buf = [0; 1024 * 8];

@oech3

oech3 commented Apr 19, 2026

Contributor Author

Of course. But I wanted to save syscalls on the slow path.

@xtqqczze

Contributor

I think your new approach in #11906 is much easier to understand.
