perf: use GSO #2593
Failed Interop Tests
QUIC Interop Runner, client vs. server, differences relative to 66be2e6.
neqo-latest as client
neqo-latest as server
All results
Succeeded Interop Tests
QUIC Interop Runner, client vs. server
neqo-latest as client
neqo-latest as server
Unsupported Interop Tests
QUIC Interop Runner, client vs. server
neqo-latest as client
neqo-latest as server
|
Benchmark results
Performance differences relative to 95f9bed.
1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved. time: [202.13 ms 202.48 ms 202.85 ms]
thrpt: [492.97 MiB/s 493.86 MiB/s 494.74 MiB/s]
change:
time: [−69.170% −69.106% −69.043%] (p = 0.00 < 0.05)
thrpt: [+223.02% +223.69% +224.36%]
1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: Change within noise threshold. time: [304.31 ms 305.83 ms 307.35 ms]
thrpt: [32.536 Kelem/s 32.698 Kelem/s 32.862 Kelem/s]
change:
time: [+0.6959% +1.3800% +2.0501%] (p = 0.00 < 0.05)
thrpt: [−2.0089% −1.3612% −0.6911%]
1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: 💔 Performance has regressed. time: [27.525 ms 27.597 ms 27.673 ms]
thrpt: [36.136 elem/s 36.236 elem/s 36.331 elem/s]
change:
time: [+1.1068% +1.8128% +2.4725%] (p = 0.00 < 0.05)
thrpt: [−2.4128% −1.7805% −1.0947%]
1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved. time: [648.95 ms 654.10 ms 659.22 ms]
thrpt: [151.69 MiB/s 152.88 MiB/s 154.10 MiB/s]
change:
time: [−28.772% −27.945% −27.096%] (p = 0.00 < 0.05)
thrpt: [+37.166% +38.782% +40.394%]
decode 4096 bytes, mask ff: No change in performance detected. time: [11.792 µs 11.818 µs 11.851 µs]
change: [−0.7711% −0.1718% +0.3634%] (p = 0.57 > 0.05)
decode 1048576 bytes, mask ff: No change in performance detected. time: [3.0229 ms 3.0323 ms 3.0435 ms]
change: [−0.2404% +0.1966% +0.6361%] (p = 0.39 > 0.05)
decode 4096 bytes, mask 7f: No change in performance detected. time: [19.968 µs 20.023 µs 20.082 µs]
change: [−0.7959% −0.1788% +0.3899%] (p = 0.57 > 0.05)
decode 1048576 bytes, mask 7f: No change in performance detected. time: [5.0371 ms 5.0487 ms 5.0618 ms]
change: [−0.5165% −0.1114% +0.2906%] (p = 0.59 > 0.05)
decode 4096 bytes, mask 3f: No change in performance detected. time: [8.2722 µs 8.3105 µs 8.3530 µs]
change: [−0.2019% +0.2442% +0.7830%] (p = 0.32 > 0.05)
decode 1048576 bytes, mask 3f: No change in performance detected. time: [1.5850 ms 1.5902 ms 1.5962 ms]
change: [−0.6684% −0.1223% +0.4020%] (p = 0.66 > 0.05)
1000 streams of 1 bytes/multistream: No change in performance detected. time: [33.203 ns 39.555 ns 51.878 ns]
change: [+10.728% +32.674% +75.025%] (p = 0.06 > 0.05)
1000 streams of 1000 bytes/multistream: 💔 Performance has regressed. time: [34.055 ns 34.490 ns 34.929 ns]
change: [+12.649% +14.534% +16.408%] (p = 0.00 < 0.05)
coalesce_acked_from_zero 1+1 entries: No change in performance detected. time: [88.115 ns 88.448 ns 88.789 ns]
change: [−0.4490% +0.5174% +1.7914%] (p = 0.49 > 0.05)
coalesce_acked_from_zero 3+1 entries: No change in performance detected. time: [105.48 ns 105.73 ns 105.99 ns]
change: [−0.8906% −0.3531% +0.1143%] (p = 0.18 > 0.05)
coalesce_acked_from_zero 10+1 entries: No change in performance detected. time: [105.05 ns 105.38 ns 105.80 ns]
change: [−0.2958% +0.2637% +0.8589%] (p = 0.38 > 0.05)
coalesce_acked_from_zero 1000+1 entries: No change in performance detected. time: [88.820 ns 88.971 ns 89.126 ns]
change: [−0.7026% +0.2281% +1.1404%] (p = 0.65 > 0.05)
RxStreamOrderer::inbound_frame(): No change in performance detected. time: [107.81 ms 107.97 ms 108.23 ms]
change: [−0.4699% −0.1075% +0.2218%] (p = 0.59 > 0.05)
sent::Packets::take_ranges: No change in performance detected. time: [8.0612 µs 8.2610 µs 8.4441 µs]
change: [−0.7407% +5.8984% +17.161%] (p = 0.24 > 0.05)
transfer/pacing-false/varying-seeds: 💔 Performance has regressed. time: [37.072 ms 37.169 ms 37.279 ms]
change: [+4.5101% +4.8981% +5.2577%] (p = 0.00 < 0.05)
transfer/pacing-true/varying-seeds: 💔 Performance has regressed. time: [37.692 ms 37.808 ms 37.931 ms]
change: [+5.0829% +5.4940% +5.9561%] (p = 0.00 < 0.05)
transfer/pacing-false/same-seed: 💔 Performance has regressed. time: [36.999 ms 37.067 ms 37.140 ms]
change: [+4.6033% +4.8770% +5.1647%] (p = 0.00 < 0.05)
transfer/pacing-true/same-seed: 💔 Performance has regressed. time: [38.372 ms 38.472 ms 38.576 ms]
change: [+4.1365% +4.4851% +4.8031%] (p = 0.00 < 0.05)
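For reading the `time` and `thrpt` change rows above: the two are approximately reciprocal views of the same measurement. A minimal sketch, assuming a fixed amount of work per run (the function name is hypothetical and not part of the benchmark tooling):

```rust
/// Convert a relative change in run time into the matching relative change
/// in throughput, assuming the amount of work per run stays the same.
/// Both values are fractions, e.g. -0.69106 for -69.106%.
fn throughput_change(time_change: f64) -> f64 {
    // throughput is work / time, so the ratios are reciprocal
    1.0 / (1.0 + time_change) - 1.0
}
```

Applied to the Download row, a time change of −69.106% corresponds to roughly +223.7% throughput, and the Upload row's −27.945% corresponds to roughly +38.8%, in line with the reported figures.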
Client/server transfer results
Performance differences relative to 95f9bed. Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.
Download data for |
|
Optimized Upload only thus far.
🎉 matches #2532 (comment). |
f9ff613 to a21983d
|
Introduced the same optimizations to
|
43bddd5 to a05f876
ce275bd to d52586e
|
Why do we see a massive benefit in the client/server tests, but not in the transfer benches? |
|
@larseggert the
neqo/test-fixture/src/sim/mod.rs, line 206 in 37c3aee
Let me see whether I can change that as part of this pull request. After all, our benchmarks and tests should mirror how we run Neqo in Firefox as closely as possible. |
|
@mxinden |
|
| Branch | gso-v3 |
| Testbed | t-linux64-ms-279 |
Click to view all benchmark results
| Benchmark | Latency | milliseconds (ms) |
|---|---|---|
| s2n vs. neqo (cubic, paced) | 📈 view plot 🚷 view threshold | 210.06 ms |
|
| Branch | gso-v3 |
| Testbed | t-linux64-ms-278 |
Click to view all benchmark results
| Benchmark | Latency | nanoseconds (ns) |
|---|---|---|
| 1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client | 📈 view plot 🚷 view threshold | 654,100,000.00 ns |
| 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client | 📈 view plot 🚷 view threshold | 202,480,000.00 ns |
| 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client | 📈 view plot 🚷 view threshold | 27,597,000.00 ns |
| 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client | 📈 view plot 🚷 view threshold | 305,830,000.00 ns |
| 1000 streams of 1 bytes/multistream | 📈 view plot 🚷 view threshold | 39.55 ns |
| 1000 streams of 1000 bytes/multistream | 📈 view plot 🚷 view threshold | 34.49 ns |
| RxStreamOrderer::inbound_frame() | 📈 view plot 🚷 view threshold | 107,970,000.00 ns |
| coalesce_acked_from_zero 1+1 entries | 📈 view plot 🚷 view threshold | 88.45 ns |
| coalesce_acked_from_zero 10+1 entries | 📈 view plot 🚷 view threshold | 105.38 ns |
| coalesce_acked_from_zero 1000+1 entries | 📈 view plot 🚷 view threshold | 88.97 ns |
| coalesce_acked_from_zero 3+1 entries | 📈 view plot 🚷 view threshold | 105.73 ns |
| decode 1048576 bytes, mask 3f | 📈 view plot 🚷 view threshold | 1,590,200.00 ns |
| decode 1048576 bytes, mask 7f | 📈 view plot 🚷 view threshold | 5,048,700.00 ns |
| decode 1048576 bytes, mask ff | 📈 view plot 🚷 view threshold | 3,032,300.00 ns |
| decode 4096 bytes, mask 3f | 📈 view plot 🚷 view threshold | 8,310.50 ns |
| decode 4096 bytes, mask 7f | 📈 view plot 🚷 view threshold | 20,023.00 ns |
| decode 4096 bytes, mask ff | 📈 view plot 🚷 view threshold | 11,818.00 ns |
| sent::Packets::take_ranges | 📈 view plot 🚷 view threshold | 8,261.00 ns |
| transfer/pacing-false/same-seed | 📈 view plot 🚷 view threshold | 37,067,000.00 ns |
| transfer/pacing-false/varying-seeds | 📈 view plot 🚷 view threshold | 37,169,000.00 ns |
| transfer/pacing-true/same-seed | 📈 view plot 🚷 view threshold | 38,472,000.00 ns |
| transfer/pacing-true/varying-seeds | 📈 view plot 🚷 view threshold | 37,808,000.00 ns |
|
| Branch | gso-v3 |
| Testbed | t-linux64-ms-278 |
Click to view all benchmark results
| Benchmark | Latency | milliseconds (ms) |
|---|---|---|
| s2n vs. neqo (cubic, paced) | 📈 view plot 🚷 view threshold | 172.07 ms |
|
@mxinden is this ready to merge? |
|
Yes, ready to merge from my end. We have a couple of benchmark regressions. Explainer for each:
This will improve even further with #2734.
This is expected. We pay a slight cost in latency when sending in batches.
This should be due to
Again, slight regression as the Simulator is not using the batched IO paths. The non-batched IO path (i.e.
@larseggert let me know whether you are fine proceeding here, or would prefer any of the above to be addressed first. |
|
I'll merge now; could you please file issues for the missing bits? Great that we can land this! |
|
This keeps getting kicked out of the merge queue while tests are still running and haven't failed yet. I think GitHub may have issues. Doing a force merge. |
I assume you are fine with the following pull requests tracking the progress. Let me know if you want additional GitHub issues. |
|
Early numbers on GSO in Firefox Nightly:
Good signals. We should explore increasing the maximum number of segments (currently 10). Maybe just limit it by what our pacer allows us to send.
Datagram (batch) size: [plots for Windows and Linux]
Number of segments in a batch: [plots for Windows and Linux]
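The idea of limiting the batch by what the pacer allows could look roughly like the sketch below. All names and parameters are hypothetical, and it assumes the pacer exposes a byte budget for the current send opportunity; it is not taken from the neqo code.

```rust
/// Number of segments to put into the next batch.
///
/// `pacer_budget_bytes`: bytes the pacer allows to be sent right now (assumed input).
/// `segment_size`: size of each datagram in the batch.
/// `hard_cap`: upper bound on segments per batch (10 at the time of the comment above).
fn segments_in_batch(pacer_budget_bytes: usize, segment_size: usize, hard_cap: usize) -> usize {
    if segment_size == 0 {
        return 0;
    }
    // Only whole segments count; never exceed the configured cap.
    (pacer_budget_bytes / segment_size).min(hard_cap)
}
```

With a budget of 20,000 bytes and 1,200-byte segments, a cap of 10 is the limiting factor, while a larger cap would let the pacer budget decide.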
|
|
Yes, let's increase. |




Use generic segmentation offload (GSO) on Linux and UDP segmentation offload (USO) on Windows.
GSO and USO allow us to batch multiple datagrams into one large payload (up to 64 KB) and pass it to the kernel in a single system call. The kernel either segments it itself or has the NIC segment it before sending it out on the network.
Early measurements show up to a 2x throughput improvement on an artificial, CPU-bound localhost transfer benchmark.
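To illustrate the batching described above, here is a minimal sketch of collecting datagrams into one payload plus a segment size. It assumes the usual constraint that every segment in a batch has the same size, except the last, which may be shorter. The types, function, and size constant are hypothetical and not the code in this pull request.

```rust
/// Assumption for this sketch: the largest UDP payload over IPv4.
/// The description above says "up to 64 KB".
const MAX_BATCH_BYTES: usize = 65_507;

/// A batch that can be handed to the kernel in a single send call.
#[derive(Debug, PartialEq)]
struct Batch {
    /// All datagrams, concatenated back to back.
    payload: Vec<u8>,
    /// Size at which the kernel (or NIC) cuts the payload into datagrams.
    segment_size: usize,
}

/// Greedily take datagrams from the front of `datagrams` into one batch.
/// Returns the batch and the number of datagrams consumed.
fn batch_datagrams(datagrams: &[Vec<u8>], max_segments: usize) -> Option<(Batch, usize)> {
    // The first datagram determines the segment size of the batch.
    let segment_size = datagrams.first()?.len();
    if segment_size == 0 {
        return None;
    }
    let mut payload = Vec::new();
    let mut taken = 0;
    for d in datagrams.iter().take(max_segments) {
        // Stop on anything that cannot be part of this batch.
        if d.is_empty() || d.len() > segment_size || payload.len() + d.len() > MAX_BATCH_BYTES {
            break;
        }
        payload.extend_from_slice(d);
        taken += 1;
        // A shorter datagram is only allowed as the last segment.
        if d.len() < segment_size {
            break;
        }
    }
    if taken == 0 {
        return None;
    }
    Some((Batch { payload, segment_size }, taken))
}
```

For datagrams of 1200, 1200, 500, and 1200 bytes, the first three fit into one batch with a segment size of 1200; the fourth would start the next batch.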
Attempt 1: f25b0b7
Attempt 2: #2532
Compared to attempt 2:
neqo-transport instead of neqo-bin
Once this is merged, we can switch to a long-lived send buffer (see the discussion in #2670). #2677 and this pull request lay the groundwork for it.