perf: don't allocate in UDP send path #2747
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2747 +/- ##
==========================================
- Coverage 94.92% 92.57% -2.35%
==========================================
Files 115 115
Lines 34266 34415 +149
Branches 34266 34415 +149
==========================================
- Hits 32526 31859 -667
- Misses 1733 1734 +1
- Partials 7 822 +815
Pull Request Overview
This PR introduces a reusable send buffer via a new Buffer trait to eliminate heap allocations in the UDP send path.
- Add a `Buffer` trait and propagate a generic buffer parameter `B: Buffer` through UDP, transport, HTTP/3, and example binaries.
- Refactor `DatagramBatch` and related methods to be generic over the buffer and update tests and examples to pass pre-allocated buffers.
- Extend `neqo-common` with buffer-backed encoders and make packet builders work over generic buffers.
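
As a rough illustration of the idea in the overview above, the following sketch shows how a buffer trait can let an encoder write either into an owned `Vec<u8>` or into a caller-provided, pre-allocated one. The trait methods, the `Encoder` type, and the `example` function are hypothetical names chosen for this sketch; they are assumptions and not the actual definitions from this PR or from neqo.

```rust
/// Hypothetical stand-in for the `Buffer` trait named in the overview.
/// The real trait in the PR may have different methods.
pub trait Buffer {
    /// Append bytes to the end of the buffer.
    fn write_all(&mut self, data: &[u8]);
    /// Number of bytes currently stored in the buffer.
    fn position(&self) -> usize;
}

/// An owned vector works as a buffer (it allocates on demand).
impl Buffer for Vec<u8> {
    fn write_all(&mut self, data: &[u8]) {
        self.extend_from_slice(data);
    }
    fn position(&self) -> usize {
        self.len()
    }
}

/// A mutable borrow of a caller-owned vector works too, so a long-lived,
/// pre-allocated buffer can be reused across calls.
impl Buffer for &mut Vec<u8> {
    fn write_all(&mut self, data: &[u8]) {
        self.extend_from_slice(data);
    }
    fn position(&self) -> usize {
        self.len()
    }
}

/// Hypothetical encoder that is generic over the buffer it writes into.
pub struct Encoder<B: Buffer> {
    buf: B,
    start: usize,
}

impl<B: Buffer> Encoder<B> {
    /// Start encoding at the current end of `buf`.
    pub fn new(buf: B) -> Self {
        let start = buf.position();
        Self { buf, start }
    }

    /// Append a big-endian `u16`.
    pub fn encode_u16(&mut self, v: u16) -> &mut Self {
        self.buf.write_all(&v.to_be_bytes());
        self
    }

    /// Append raw bytes.
    pub fn encode(&mut self, data: &[u8]) -> &mut Self {
        self.buf.write_all(data);
        self
    }

    /// Bytes written through this encoder so far.
    pub fn len(&self) -> usize {
        self.buf.position() - self.start
    }
}

/// Encode into a caller-owned buffer and return (bytes written, buffer contents).
pub fn example() -> (usize, Vec<u8>) {
    let mut send_buffer: Vec<u8> = Vec::with_capacity(64);
    let written = {
        let mut enc = Encoder::new(&mut send_buffer);
        enc.encode_u16(0x0102).encode(b"hi");
        enc.len()
    };
    (written, send_buffer)
}
```

Because the encoder only borrows `send_buffer`, the caller keeps ownership and can reuse the same allocation for the next packet.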
Reviewed Changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| neqo-udp/src/lib.rs | Imported Buffer, added SEND_BUF_SIZE, made send_inner and Socket::send generic over Buffer and updated a test. |
| neqo-transport/src/server.rs | Refactored server to accept an external send buffer, changed return types to OutputBatch<B>. |
| neqo-transport/src/packet/mod.rs | Converted packet Builder to work with generic buffers, temporarily added hack methods x/y. |
| neqo-transport/tests/connection.rs | Updated test to initialize and pass a send buffer to process_multiple_output. |
| neqo-common/src/datagram.rs | Made DatagramBatch generic over its buffer and updated conversion methods. |
Comments suppressed due to low confidence (1)
neqo-transport/tests/connection.rs:37
- [nitpick] There's no space after the comma between `now()` and `send_buffer`, making it less readable. Run `rustfmt` or add a space for consistency with Rust formatting conventions.
.process_multiple_output(now(),send_buffer, 64.try_into().expect(">0"))

I don't think at this point the performance improvement (~4%, see #2861 (comment)) is worth the added complexity. Complexity being:
I will close here for now. We can revisit this once we want to spend more time on optimizing IO performance.

@mxinden revived this to do another benchmark run, now that we addressed some performance issues elsewhere in the code. Thanks for the reminder about this one!

Maintain a long-lived send buffer. When producing outbound UDP datagrams write them into the buffer. Then pass the buffer to the OS to be sent out on the network.

In other words, don't heap-allocate in the UDP send path.

---

Corresponding past patch for the receive path mozilla#2184

Fixes mozilla#2670
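
The pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `ASSUMED_SEND_BUF_SIZE`, `send_to_os`, and `send_rounds` are hypothetical names, the buffer size is an arbitrary choice, and the "OS" is replaced by a byte counter so the example has no network dependency.

```rust
/// Assumed buffer size for this sketch; the value used in the PR is not shown here.
const ASSUMED_SEND_BUF_SIZE: usize = 64 * 1024;

/// Stand-in for handing bytes to the OS (for example a UDP socket send call).
/// Here it only counts the bytes it was given.
fn send_to_os(bytes: &[u8], sent: &mut usize) {
    *sent += bytes.len();
}

/// Send `payload` `rounds` times through one long-lived buffer.
/// Returns (total bytes sent, whether the buffer was ever reallocated).
pub fn send_rounds(rounds: usize, payload: &[u8]) -> (usize, bool) {
    // Allocate once, outside the send loop.
    let mut send_buffer: Vec<u8> = Vec::with_capacity(ASSUMED_SEND_BUF_SIZE);
    let initial_ptr = send_buffer.as_ptr();
    let mut sent = 0;
    let mut reallocated = false;

    for _ in 0..rounds {
        // `clear` resets the length but keeps the capacity, so the writes
        // below reuse the existing allocation as long as the payload fits.
        send_buffer.clear();
        send_buffer.extend_from_slice(payload);
        send_to_os(&send_buffer, &mut sent);
        reallocated |= send_buffer.as_ptr() != initial_ptr;
    }
    (sent, reallocated)
}
```

As long as each round writes no more than the buffer's capacity, the loop performs no heap allocation after the initial `Vec::with_capacity` call.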
Good idea. Thanks.

Merging this PR will improve performance by 11.63%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | simulated/1-streams/each-4194304-bytes | 88.8 ms | 91.7 ms | -3.15% |
| ⚡ | Memory | simulated/1-streams/each-4194304-bytes | 5.1 MB | 4 MB | +25.37% |
| ⚡ | Memory | simulated/10-streams/each-1048576-bytes | 10 MB | 8.1 MB | +23.81% |
| ⚡ | Memory | simulated/1000-streams/each-1000-bytes | 402.5 KB | 389.6 KB | +3.3% |

Comparing mxinden:send-no-alloc (62df1c0) with main (337f83b)
Footnotes
- 26 benchmarks were skipped, so the baseline results were used instead.

Benchmark results

Significant performance differences relative to 44a0279.

- transfer/1-conn/1-100mb-req (aka. Upload): 💔 Performance has regressed by +46.450%.
- streams/walltime/1000-streams/each-1000-bytes: 💔 Performance has regressed by +1.8618%.
- streams-flow-controlled/walltime/1-streams/each-4194304-bytes: 💔 Performance has regressed by +1.8002%.

All results

transfer/1-conn/1-100mb-resp (aka. Download): Change within noise threshold.
time: [147.93 ms 148.32 ms 148.76 ms]
thrpt: [672.24 MiB/s 674.20 MiB/s 676.01 MiB/s]
change:
time: [+0.7418% +1.0864% +1.4570%] (p = 0.00 < 0.05)
thrpt: [-1.4361% -1.0747% -0.7363%]
Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe

transfer/1-conn/10_000-parallel-1b-resp (aka. RPS): No change in performance detected.
time: [258.27 ms 260.18 ms 262.12 ms]
thrpt: [38.150 Kelem/s 38.434 Kelem/s 38.719 Kelem/s]
change:
time: [-0.9126% +0.2215% +1.2853%] (p = 0.69 > 0.05)
thrpt: [-1.2690% -0.2210% +0.9210%]
No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

transfer/1-conn/1-1b-resp (aka. HPS): No change in performance detected.
time: [38.884 ms 39.042 ms 39.219 ms]
thrpt: [25.498 B/s 25.613 B/s 25.717 B/s]
change:
time: [-0.2236% +0.2914% +0.8069%] (p = 0.29 > 0.05)
thrpt: [-0.8005% -0.2906% +0.2241%]
No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
1 (1.00%) low severe
7 (7.00%) low mild
9 (9.00%) high severe

transfer/1-conn/1-100mb-req (aka. Upload): 💔 Performance has regressed by +46.450%.
time: [241.29 ms 249.10 ms 256.43 ms]
thrpt: [389.96 MiB/s 401.45 MiB/s 414.44 MiB/s]
change:
time: [+39.044% +46.450% +53.196%] (p = 0.00 < 0.05)
thrpt: [-34.724% -31.717% -28.080%]
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
9 (9.00%) low mild

streams/walltime/1-streams/each-1000-bytes: No change in performance detected.
time: [550.26 µs 552.00 µs 554.09 µs]
thrpt: [1.7212 MiB/s 1.7277 MiB/s 1.7331 MiB/s]
change:
time: [-1.1502% -0.5592% +0.0069%] (p = 0.06 > 0.05)
thrpt: [-0.0069% +0.5623% +1.1636%]
No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe

streams/walltime/1000-streams/each-1-bytes: Change within noise threshold.
time: [10.979 ms 10.998 ms 11.017 ms]
thrpt: [88.638 KiB/s 88.798 KiB/s 88.950 KiB/s]
change:
time: [-1.2853% -1.0593% -0.8385%] (p = 0.00 < 0.05)
thrpt: [+0.8456% +1.0706% +1.3021%]
Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

streams/walltime/1000-streams/each-1000-bytes: 💔 Performance has regressed by +1.8618%.
time: [32.156 ms 32.256 ms 32.410 ms]
thrpt: [29.425 MiB/s 29.566 MiB/s 29.658 MiB/s]
change:
time: [+1.4422% +1.8618% +2.3496%] (p = 0.00 < 0.05)
thrpt: [-2.2957% -1.8277% -1.4217%]
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe

streams-flow-controlled/walltime/1-streams/each-4194304-bytes: 💔 Performance has regressed by +1.8002%.
time: [27.638 ms 27.686 ms 27.735 ms]
thrpt: [144.22 MiB/s 144.48 MiB/s 144.73 MiB/s]
change:
time: [+1.4169% +1.8002% +2.1256%] (p = 0.00 < 0.05)
thrpt: [-2.0813% -1.7684% -1.3971%]
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

streams-flow-controlled/walltime/10-streams/each-1048576-bytes: Change within noise threshold.
time: [72.752 ms 73.023 ms 73.314 ms]
thrpt: [136.40 MiB/s 136.94 MiB/s 137.45 MiB/s]
change:
time: [+0.6686% +1.1368% +1.6300%] (p = 0.00 < 0.05)
thrpt: [-1.6039% -1.1240% -0.6642%]
Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) low mild
1 (1.00%) high mild
2 (2.00%) high severe

transfer/walltime/pacing-false/varying-seeds: Change within noise threshold.
time: [19.167 ms 19.185 ms 19.210 ms]
thrpt: [208.23 MiB/s 208.50 MiB/s 208.69 MiB/s]
change:
time: [+1.0968% +1.2231% +1.3604%] (p = 0.00 < 0.05)
thrpt: [-1.3421% -1.2084% -1.0849%]
Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe

transfer/walltime/pacing-true/varying-seeds: Change within noise threshold.
time: [19.387 ms 19.410 ms 19.439 ms]
thrpt: [205.77 MiB/s 206.08 MiB/s 206.32 MiB/s]
change:
time: [+0.5099% +0.7243% +0.9297%] (p = 0.00 < 0.05)
thrpt: [-0.9212% -0.7191% -0.5073%]
Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe

transfer/walltime/pacing-false/same-seed: Change within noise threshold.
time: [18.999 ms 19.012 ms 19.025 ms]
thrpt: [210.25 MiB/s 210.40 MiB/s 210.54 MiB/s]
change:
time: [+0.2863% +0.3885% +0.4980%] (p = 0.00 < 0.05)
thrpt: [-0.4955% -0.3870% -0.2855%]
Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

transfer/walltime/pacing-true/same-seed: Change within noise threshold.
time: [19.604 ms 19.626 ms 19.648 ms]
thrpt: [203.58 MiB/s 203.81 MiB/s 204.04 MiB/s]
change:
time: [+1.8924% +2.0785% +2.2513%] (p = 0.00 < 0.05)
thrpt: [-2.2018% -2.0362% -1.8572%]
Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe

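
For readers comparing the `time` and `thrpt` change figures in the benchmark results above: for a fixed amount of work, throughput is inversely proportional to time, so the two percentages are linked. The sketch below is only an illustration of that arithmetic; the function name `throughput_change` is made up for this example and is not part of any benchmark tool.

```rust
/// Relative throughput change implied by a relative time change,
/// assuming the amount of work per run stays the same.
/// Both values are fractions, e.g. 0.10 for +10%.
pub fn throughput_change(time_change: f64) -> f64 {
    // throughput = work / time, so new/old throughput = 1 / (1 + time_change)
    1.0 / (1.0 + time_change) - 1.0
}
```

Applied to the Upload benchmark, a time change of +46.450% gives `throughput_change(0.46450)`, roughly -0.317, which agrees with the reported throughput change of -31.717%.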
Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to

All results

Succeeded Interop Tests
QUIC Interop Runner, client vs. server
- neqo-pr as client
- neqo-pr as server

Unsupported Interop Tests
QUIC Interop Runner, client vs. server
- neqo-pr as client
- neqo-pr as server

Client/server transfer results

Performance differences relative to 44a0279. Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.

Table above only shows statistically significant changes. See all results below.