perf: don't allocate in UDP send path #2747

Draft
mxinden wants to merge 2 commits into mozilla:main from mxinden:send-no-alloc
Conversation

@mxinden

@mxinden mxinden commented Jun 21, 2025

Member

Maintain a long-lived send buffer. When producing outbound UDP datagrams
write them into the buffer. Then pass the buffer to the OS to be sent
out on the network.

In other words, don't heap-allocate in the UDP send path.


Corresponding past patch for the receive path #2184
Fixes #2670
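
The idea in the description can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `SendBuffer`, `fill`, and `send_all` are hypothetical names, and the socket is assumed to be connected to its peer already.

```rust
use std::io;
use std::net::UdpSocket;

/// Hypothetical long-lived send buffer: allocated once, reused for every datagram.
struct SendBuffer {
    buf: Vec<u8>,
}

impl SendBuffer {
    fn with_capacity(capacity: usize) -> Self {
        Self { buf: Vec::with_capacity(capacity) }
    }

    /// Overwrite the buffer with the next datagram and return it as a slice.
    /// `clear` keeps the capacity, so nothing is allocated as long as the
    /// payload fits into the capacity chosen up front.
    fn fill(&mut self, payload: &[u8]) -> &[u8] {
        self.buf.clear();
        self.buf.extend_from_slice(payload);
        &self.buf
    }

    fn capacity(&self) -> usize {
        self.buf.capacity()
    }
}

/// Write each payload into the shared buffer, then pass the buffer to the OS.
/// Assumption: `socket` is already connected, so `send` needs no address.
fn send_all(socket: &UdpSocket, buf: &mut SendBuffer, payloads: &[&[u8]]) -> io::Result<usize> {
    let mut sent = 0;
    for &payload in payloads {
        sent += socket.send(buf.fill(payload))?;
    }
    Ok(sent)
}
```

The per-datagram `Vec` is replaced by a single allocation whose lifetime spans many sends; the trade-off is that the buffer has to be threaded through every function that produces output.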

@mxinden mxinden mentioned this pull request Jun 30, 2025
@mxinden mxinden force-pushed the send-no-alloc branch 3 times, most recently from 1cbbac6 to 330f9a3 Compare July 6, 2025 14:58
@codecov

codecov Bot commented Jul 6, 2025


Codecov Report

❌ Patch coverage is 98.51301% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.57%. Comparing base (d89d9d9) to head (330f9a3).
⚠️ Report is 745 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2747      +/-   ##
==========================================
- Coverage   94.92%   92.57%   -2.35%     
==========================================
  Files         115      115              
  Lines       34266    34415     +149     
  Branches    34266    34415     +149     
==========================================
- Hits        32526    31859     -667     
- Misses       1733     1734       +1     
- Partials        7      822     +815     
| Components | Coverage Δ |
| --- | --- |
| neqo-common | 96.61% <100.00%> (-0.47%) ⬇️ |
| neqo-crypto | 82.48% <ø> (-7.16%) ⬇️ |
| neqo-http3 | 91.74% <100.00%> (-1.98%) ⬇️ |
| neqo-qpack | 93.38% <ø> (-2.07%) ⬇️ |
| neqo-transport | 93.88% <97.95%> (-2.14%) ⬇️ |
| neqo-udp | 78.74% <100.00%> (-11.12%) ⬇️ |

@mxinden mxinden requested a review from Copilot July 11, 2025 08:59

Copilot AI left a comment

Contributor

Pull Request Overview

This PR introduces a reusable send buffer via a new Buffer trait to eliminate heap allocations in the UDP send path.

  • Add a Buffer trait and propagate a generic buffer parameter B: Buffer through UDP, transport, HTTP/3, and example binaries.
  • Refactor DatagramBatch and related methods to be generic over the buffer and update tests and examples to pass pre-allocated buffers.
  • Extend neqo-common with buffer-backed encoders and make packet builders work over generic buffers.
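
To make the overview above more concrete, here is a rough sketch of what a buffer abstraction with a generic parameter `B: Buffer` could look like. The trait shape, the method names `write_bytes` and `written`, and the `encode_header` helper are assumptions made for illustration; the actual trait in the PR may differ.

```rust
use std::io::Cursor;

/// Hypothetical minimal buffer abstraction (not the PR's actual trait).
trait Buffer {
    /// Append `data` to the buffer.
    fn write_bytes(&mut self, data: &[u8]);
    /// The bytes written so far.
    fn written(&self) -> &[u8];
}

/// Growable, heap-backed buffer.
impl Buffer for Vec<u8> {
    fn write_bytes(&mut self, data: &[u8]) {
        self.extend_from_slice(data);
    }
    fn written(&self) -> &[u8] {
        self
    }
}

/// Fixed-size, caller-provided buffer. Panics if `data` does not fit.
impl Buffer for Cursor<&mut [u8]> {
    fn write_bytes(&mut self, data: &[u8]) {
        let start = self.position() as usize;
        let end = start + data.len();
        self.get_mut()[start..end].copy_from_slice(data);
        self.set_position(end as u64);
    }
    fn written(&self) -> &[u8] {
        &self.get_ref()[..self.position() as usize]
    }
}

/// Encoding code written once against `B: Buffer` works with either backing store.
/// The two-byte big-endian number followed by the payload is an arbitrary layout.
fn encode_header<B: Buffer>(buf: &mut B, number: u16, payload: &[u8]) {
    buf.write_bytes(&number.to_be_bytes());
    buf.write_bytes(payload);
}
```

With this shape, tests can keep using a `Vec<u8>` while the send path passes a `Cursor` over a pre-allocated slice.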

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| neqo-udp/src/lib.rs | Imported Buffer, added SEND_BUF_SIZE, made send_inner and Socket::send generic over Buffer and updated a test. |
| neqo-transport/src/server.rs | Refactored server to accept an external send buffer, changed return types to OutputBatch<B>. |
| neqo-transport/src/packet/mod.rs | Converted packet Builder to work with generic buffers, temporarily added hack methods x/y. |
| neqo-transport/tests/connection.rs | Updated test to initialize and pass a send buffer to process_multiple_output. |
| neqo-common/src/datagram.rs | Made DatagramBatch generic over its buffer and updated conversion methods. |
Comments suppressed due to low confidence (1)

neqo-transport/tests/connection.rs:37

  • [nitpick] There's no space after the comma between now() and send_buffer, making it less readable. Run rustfmt or add a space for consistency with Rust formatting conventions.
        .process_multiple_output(now(),send_buffer, 64.try_into().expect(">0"))

Comment thread neqo-udp/src/lib.rs Outdated
Comment thread neqo-transport/src/server.rs Outdated
Comment thread neqo-transport/src/packet/mod.rs
Comment thread neqo-common/src/datagram.rs
@mxinden

mxinden commented Aug 17, 2025

Member Author

I don't think the performance improvement at this point (~4%, see #2861 (comment)) is worth the added complexity. The complexity being:

  • a new trait parameter B: Buffer on most public process* functions
  • a necessary refactor of test_frame_writer
    • as of today, it needs to be object safe, which conflicts with the new trait parameter B
    • alternatively, we could require Cursor<&mut [u8]> everywhere
    • or not store test_frame_writer in Connection, and instead pass it along to the process* functions

I will close here for now. We can revisit this once we want to spend more time on optimizing IO performance.
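
The object-safety conflict and the last alternative listed above can be illustrated with a small sketch. All names here (`Buffer`, `FrameWriter`, `PingWriter`, `Connection`, `process_output`) are simplified stand-ins and not the actual neqo types or signatures; the `0x40` byte is an arbitrary placeholder for regular packet content.

```rust
/// Hypothetical stand-in for the buffer trait.
trait Buffer {
    fn write_bytes(&mut self, data: &[u8]);
}

impl Buffer for Vec<u8> {
    fn write_bytes(&mut self, data: &[u8]) {
        self.extend_from_slice(data);
    }
}

/// A writer whose method is generic over the buffer. Because of the generic
/// method, this trait cannot be used as `dyn FrameWriter`, so a field such as
/// `Option<Box<dyn FrameWriter>>` would be rejected by the compiler (E0038).
trait FrameWriter {
    fn write_frames<B: Buffer>(&mut self, out: &mut B);
}

/// Test writer that emits a single PING frame (QUIC frame type 0x01).
struct PingWriter;

impl FrameWriter for PingWriter {
    fn write_frames<B: Buffer>(&mut self, out: &mut B) {
        out.write_bytes(&[0x01]);
    }
}

/// Stand-in connection that does not store a writer at all.
struct Connection;

impl Connection {
    /// The writer is passed per call as a generic parameter, so no trait
    /// object is needed and both `B` and `W` are statically dispatched.
    fn process_output<B: Buffer, W: FrameWriter>(&mut self, out: &mut B, test_writer: Option<&mut W>) {
        out.write_bytes(&[0x40]); // placeholder for regular packet content
        if let Some(writer) = test_writer {
            writer.write_frames(out);
        }
    }
}
```

The cost of this option is visible in the signature: every `process*` function gains a second type parameter, and callers without a test writer have to name a type for `None`, for example `None::<&mut PingWriter>`.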

@larseggert

Collaborator

@mxinden revived this to do another benchmark run, now that we addressed some performance issues elsewhere in the code. Thanks for the reminder about this one!

@mxinden

mxinden commented Jun 22, 2026

Member Author

Good idea. Thanks.

@codspeed-hq

codspeed-hq Bot commented Jun 22, 2026


Merging this PR will improve performance by 11.63%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 3 improved benchmarks
❌ 1 regressed benchmark
✅ 67 untouched benchmarks
⏩ 26 skipped benchmarks [1]

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Simulation | simulated/1-streams/each-4194304-bytes | 88.8 ms | 91.7 ms | -3.15% |
| Memory | simulated/1-streams/each-4194304-bytes | 5.1 MB | 4 MB | +25.37% |
| Memory | simulated/10-streams/each-1048576-bytes | 10 MB | 8.1 MB | +23.81% |
| Memory | simulated/1000-streams/each-1000-bytes | 402.5 KB | 389.6 KB | +3.3% |


Comparing mxinden:send-no-alloc (62df1c0) with main (337f83b)

Footnotes

  1. 26 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@github-actions

Contributor

Benchmark results

Significant performance differences relative to 44a0279.

transfer/1-conn/1-100mb-req (aka. Upload): 💔 Performance has regressed by +46.450%.
       time:   [241.29 ms 249.10 ms 256.43 ms]
       thrpt:  [389.96 MiB/s 401.45 MiB/s 414.44 MiB/s]
change:
       time:   [+39.044% +46.450% +53.196] (p = 0.00 < 0.05)
       thrpt:  [-34.724% -31.717% -28.080]
       Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
9 (9.00%) low mild
streams/walltime/1000-streams/each-1000-bytes: 💔 Performance has regressed by +1.8618%.
       time:   [32.156 ms 32.256 ms 32.410 ms]
       thrpt:  [29.425 MiB/s 29.566 MiB/s 29.658 MiB/s]
change:
       time:   [+1.4422% +1.8618% +2.3496] (p = 0.00 < 0.05)
       thrpt:  [-2.2957% -1.8277% -1.4217]
       Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
streams-flow-controlled/walltime/1-streams/each-4194304-bytes: 💔 Performance has regressed by +1.8002%.
       time:   [27.638 ms 27.686 ms 27.735 ms]
       thrpt:  [144.22 MiB/s 144.48 MiB/s 144.73 MiB/s]
change:
       time:   [+1.4169% +1.8002% +2.1256] (p = 0.00 < 0.05)
       thrpt:  [-2.0813% -1.7684% -1.3971]
       Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
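
As a sanity check on how the time and throughput figures above relate, the following sketch recomputes them. It assumes that "100mb" in the benchmark name means 100 MiB, which is consistent with the reported numbers but is not stated in the report; the function names are made up for this example.

```rust
/// Throughput in MiB/s for `bytes` transferred in `millis` milliseconds.
fn throughput_mib_s(bytes: u64, millis: f64) -> f64 {
    let mib = bytes as f64 / (1024.0 * 1024.0);
    mib / (millis / 1000.0)
}

/// Relative throughput change implied by a relative time change.
/// Both values are fractions, e.g. 0.46450 for +46.450%.
/// Throughput is inversely proportional to time, so a time factor of
/// (1 + t) gives a throughput factor of 1 / (1 + t).
fn throughput_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}

/// Median throughput of the upload benchmark, assuming 100 MiB in 249.10 ms.
fn upload_median_throughput() -> f64 {
    throughput_mib_s(100 * 1024 * 1024, 249.10)
}

/// Throughput change implied by the reported +46.450% time regression.
fn upload_throughput_change() -> f64 {
    throughput_change(0.46450)
}
```

Under that assumption, 100 MiB in 249.10 ms gives roughly 401.4 MiB/s, and a +46.450% increase in time corresponds to about a 31.7% drop in throughput, in line with the reported 401.45 MiB/s and -31.717%. This also shows why a time regression of almost half is reported as a throughput loss of under a third.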
All results
transfer/1-conn/1-100mb-resp (aka. Download): Change within noise threshold.
       time:   [147.93 ms 148.32 ms 148.76 ms]
       thrpt:  [672.24 MiB/s 674.20 MiB/s 676.01 MiB/s]
change:
       time:   [+0.7418% +1.0864% +1.4570] (p = 0.00 < 0.05)
       thrpt:  [-1.4361% -1.0747% -0.7363]
       Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
transfer/1-conn/10_000-parallel-1b-resp (aka. RPS): No change in performance detected.
       time:   [258.27 ms 260.18 ms 262.12 ms]
       thrpt:  [38.150 Kelem/s 38.434 Kelem/s 38.719 Kelem/s]
change:
       time:   [-0.9126% +0.2215% +1.2853] (p = 0.69 > 0.05)
       thrpt:  [-1.2690% -0.2210% +0.9210]
       No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
transfer/1-conn/1-1b-resp (aka. HPS): No change in performance detected.
       time:   [38.884 ms 39.042 ms 39.219 ms]
       thrpt:  [25.498   B/s 25.613   B/s 25.717   B/s]
change:
       time:   [-0.2236% +0.2914% +0.8069] (p = 0.29 > 0.05)
       thrpt:  [-0.8005% -0.2906% +0.2241]
       No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
1 (1.00%) low severe
7 (7.00%) low mild
9 (9.00%) high severe
transfer/1-conn/1-100mb-req (aka. Upload): 💔 Performance has regressed by +46.450%.
       time:   [241.29 ms 249.10 ms 256.43 ms]
       thrpt:  [389.96 MiB/s 401.45 MiB/s 414.44 MiB/s]
change:
       time:   [+39.044% +46.450% +53.196] (p = 0.00 < 0.05)
       thrpt:  [-34.724% -31.717% -28.080]
       Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
9 (9.00%) low mild
streams/walltime/1-streams/each-1000-bytes: No change in performance detected.
       time:   [550.26 µs 552.00 µs 554.09 µs]
       thrpt:  [1.7212 MiB/s 1.7277 MiB/s 1.7331 MiB/s]
change:
       time:   [-1.1502% -0.5592% +0.0069] (p = 0.06 > 0.05)
       thrpt:  [-0.0069% +0.5623% +1.1636]
       No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe
streams/walltime/1000-streams/each-1-bytes: Change within noise threshold.
       time:   [10.979 ms 10.998 ms 11.017 ms]
       thrpt:  [88.638 KiB/s 88.798 KiB/s 88.950 KiB/s]
change:
       time:   [-1.2853% -1.0593% -0.8385] (p = 0.00 < 0.05)
       thrpt:  [+0.8456% +1.0706% +1.3021]
       Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
streams/walltime/1000-streams/each-1000-bytes: 💔 Performance has regressed by +1.8618%.
       time:   [32.156 ms 32.256 ms 32.410 ms]
       thrpt:  [29.425 MiB/s 29.566 MiB/s 29.658 MiB/s]
change:
       time:   [+1.4422% +1.8618% +2.3496] (p = 0.00 < 0.05)
       thrpt:  [-2.2957% -1.8277% -1.4217]
       Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
streams-flow-controlled/walltime/1-streams/each-4194304-bytes: 💔 Performance has regressed by +1.8002%.
       time:   [27.638 ms 27.686 ms 27.735 ms]
       thrpt:  [144.22 MiB/s 144.48 MiB/s 144.73 MiB/s]
change:
       time:   [+1.4169% +1.8002% +2.1256] (p = 0.00 < 0.05)
       thrpt:  [-2.0813% -1.7684% -1.3971]
       Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
streams-flow-controlled/walltime/10-streams/each-1048576-bytes: Change within noise threshold.
       time:   [72.752 ms 73.023 ms 73.314 ms]
       thrpt:  [136.40 MiB/s 136.94 MiB/s 137.45 MiB/s]
change:
       time:   [+0.6686% +1.1368% +1.6300] (p = 0.00 < 0.05)
       thrpt:  [-1.6039% -1.1240% -0.6642]
       Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) low mild
1 (1.00%) high mild
2 (2.00%) high severe
transfer/walltime/pacing-false/varying-seeds: Change within noise threshold.
       time:   [19.167 ms 19.185 ms 19.210 ms]
       thrpt:  [208.23 MiB/s 208.50 MiB/s 208.69 MiB/s]
change:
       time:   [+1.0968% +1.2231% +1.3604] (p = 0.00 < 0.05)
       thrpt:  [-1.3421% -1.2084% -1.0849]
       Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
transfer/walltime/pacing-true/varying-seeds: Change within noise threshold.
       time:   [19.387 ms 19.410 ms 19.439 ms]
       thrpt:  [205.77 MiB/s 206.08 MiB/s 206.32 MiB/s]
change:
       time:   [+0.5099% +0.7243% +0.9297] (p = 0.00 < 0.05)
       thrpt:  [-0.9212% -0.7191% -0.5073]
       Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
transfer/walltime/pacing-false/same-seed: Change within noise threshold.
       time:   [18.999 ms 19.012 ms 19.025 ms]
       thrpt:  [210.25 MiB/s 210.40 MiB/s 210.54 MiB/s]
change:
       time:   [+0.2863% +0.3885% +0.4980] (p = 0.00 < 0.05)
       thrpt:  [-0.4955% -0.3870% -0.2855]
       Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
transfer/walltime/pacing-true/same-seed: Change within noise threshold.
       time:   [19.604 ms 19.626 ms 19.648 ms]
       thrpt:  [203.58 MiB/s 203.81 MiB/s 204.04 MiB/s]
change:
       time:   [+1.8924% +2.0785% +2.2513] (p = 0.00 < 0.05)
       thrpt:  [-2.2018% -2.0362% -1.8572]
       Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe

Download data for profiler.firefox.com or download performance comparison data.

@github-actions

Contributor

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to main at 337f83b.

neqo-pr as client
neqo-pr vs. go-x-net: BP BA
neqo-pr vs. haproxy: ⚠️L1 BP BA
neqo-pr vs. kwik: 🚀C1 BP BA
neqo-pr vs. lsquic: L1 C1
neqo-pr vs. msquic: A L1 C1
neqo-pr vs. mvfst: A ⚠️BA
neqo-pr vs. neqo: A
neqo-pr vs. nginx: BP BA
neqo-pr vs. ngtcp2: CM
neqo-pr vs. picoquic: A
neqo-pr vs. quic-go: A
neqo-pr vs. quic-zig: B
neqo-pr vs. quiche: ⚠️L1 BP BA
neqo-pr vs. s2n-quic: 🚀BA CM
neqo-pr vs. tquic: S BP BA
neqo-pr vs. xquic: A L1 ⚠️C1

neqo-pr as server
aioquic vs. neqo-pr: CM
go-x-net vs. neqo-pr: CM
kwik vs. neqo-pr: BP BA CM
msquic vs. neqo-pr: CM
mvfst vs. neqo-pr: Z L1 C1 CM
neqo vs. neqo-pr: A
openssl vs. neqo-pr: LR M A CM
quic-go vs. neqo-pr: CM
quiche vs. neqo-pr: 🚀L1 CM
quinn vs. neqo-pr: V2 CM
s2n-quic vs. neqo-pr: CM
tquic vs. neqo-pr: CM
xquic vs. neqo-pr: M CM
@github-actions

Contributor

Client/server transfer results

Performance differences relative to 44a0279.

Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.

| Client vs. server (params) | Mean ± σ | Min | Max | MiB/s ± σ | Δ baseline | Δ baseline |
| --- | --- | --- | --- | --- | --- | --- |
| neqo-msquic-cubic | 148.0 ± 8.9 | 142.1 | 192.0 | 216.2 ± 3.6 | 💔 2.1 | 1.5% |
| neqo-neqo-cubic | 78.0 ± 3.2 | 71.9 | 86.5 | 410.2 ± 10.0 | 💔 1.1 | 1.4% |
| neqo-neqo-cubic-nopacing | 76.7 ± 3.9 | 70.9 | 98.6 | 417.0 ± 8.2 | 💔 1.4 | 1.8% |
| neqo-neqo-newreno | 78.6 ± 3.6 | 70.8 | 95.9 | 407.3 ± 8.9 | 💔 1.1 | 1.5% |

Table above only shows statistically significant changes. See all results below.

All results

Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.

| Client vs. server (params) | Mean ± σ | Min | Max | MiB/s ± σ | Δ baseline | Δ baseline |
| --- | --- | --- | --- | --- | --- | --- |
| google-google-nopacing | 461.3 ± 2.0 | 457.4 | 473.1 | 69.4 ± 16.0 | | |
| google-neqo-cubic | 265.9 ± 2.5 | 261.3 | 272.6 | 120.3 ± 12.8 | 0.6 | 0.2% |
| msquic-msquic-nopacing | 135.4 ± 46.5 | 107.3 | 407.3 | 236.4 ± 0.7 | | |
| msquic-neqo-cubic | 140.1 ± 44.6 | 110.5 | 391.6 | 228.4 ± 0.7 | 2.0 | 1.5% |
| neqo-google-cubic | 766.3 ± 3.0 | 760.7 | 779.1 | 41.8 ± 10.7 | -0.0 | -0.0% |
| neqo-msquic-cubic | 148.0 ± 8.9 | 142.1 | 192.0 | 216.2 ± 3.6 | 💔 2.1 | 1.5% |
| neqo-neqo-cubic | 78.0 ± 3.2 | 71.9 | 86.5 | 410.2 ± 10.0 | 💔 1.1 | 1.4% |
| neqo-neqo-cubic-nopacing | 76.7 ± 3.9 | 70.9 | 98.6 | 417.0 ± 8.2 | 💔 1.4 | 1.8% |
| neqo-neqo-newreno | 78.6 ± 3.6 | 70.8 | 95.9 | 407.3 ± 8.9 | 💔 1.1 | 1.5% |
| neqo-neqo-newreno-nopacing | 76.1 ± 4.2 | 68.0 | 104.9 | 420.4 ± 7.6 | 0.8 | 1.0% |
| neqo-quiche-cubic | 188.5 ± 2.2 | 184.9 | 195.4 | 169.7 ± 14.5 | 0.1 | 0.1% |
| neqo-s2n-cubic | 214.6 ± 2.2 | 209.1 | 222.4 | 149.1 ± 14.5 | 0.4 | 0.2% |
| quiche-neqo-cubic | 177.8 ± 2.4 | 173.0 | 184.7 | 180.0 ± 13.3 | 0.2 | 0.1% |
| quiche-quiche-nopacing | 138.0 ± 3.9 | 133.7 | 167.6 | 231.8 ± 8.2 | | |
| s2n-neqo-cubic | 212.7 ± 2.5 | 207.9 | 220.6 | 150.5 ± 12.8 | -0.2 | -0.1% |
| s2n-s2n-nopacing | 295.5 ± 31.0 | 279.8 | 462.8 | 108.3 ± 1.0 | | |

Download data for profiler.firefox.com or download performance comparison data.

Development

Successfully merging this pull request may close these issues.

Consider not pre-allocating each UDP datagram in output_path

3 participants