perf: use GSO #2593

mxinden · 2025-04-18T13:24:51Z

Use generic send offloading (GSO) on Linux and UDP segment offloading (USO) on Windows.

GSO and USO allow us to batch multiple datagrams into one large payload (up to 64 KB) and pass it to the kernel in a single system call. The kernel then either segments the payload itself or has the NIC segment it before sending the datagrams out on the network.
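As a rough illustration of the segmentation step described above, the following sketch splits one large payload into datagrams of a fixed segment size, with only the last one allowed to be shorter. The function name `segment` is hypothetical and not part of neqo or quinn-udp; the real splitting happens in the kernel or on the NIC, not in user space.

```rust
/// Split a batched payload into datagrams of `segment_size` bytes, mirroring
/// what the kernel or NIC does for a GSO/USO send. Only the last datagram may
/// be shorter than `segment_size`.
///
/// Hypothetical helper for illustration; not a neqo or quinn-udp API.
pub fn segment(payload: &[u8], segment_size: usize) -> Vec<&[u8]> {
    assert!(segment_size > 0, "segment size must be non-zero");
    payload.chunks(segment_size).collect()
}
```

With a 3000 byte payload and a segment size of 1200, this yields three datagrams of 1200, 1200 and 600 bytes, all handed over with what would be a single system call.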

Early measurements show up to a 2x throughput improvement on an artificial, CPU-bound localhost transfer benchmark.

Attempt 1: f25b0b7
Attempt 2: #2532

Compared to attempt 2, this pull request:

- implements the datagram batching in neqo-transport instead of neqo-bin
- does not copy each datagram into the larger GSO buffer, but instead writes each one into the GSO buffer right away
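The second point can be sketched as follows: rather than encoding each datagram into its own allocation and copying it into the batch afterwards, the encoder is handed the batch buffer and appends in place. `BatchBuffer` and `write_datagram` are hypothetical names used for illustration only; this is a minimal sketch of the idea, not the code in this pull request.

```rust
/// A contiguous buffer holding a train of datagrams back to back.
/// Hypothetical type for illustration; not neqo's actual implementation.
pub struct BatchBuffer {
    buf: Vec<u8>,
    segment_size: usize,
    segments: usize,
}

impl BatchBuffer {
    /// Allocate once, with room for `max_segments` full-size datagrams.
    pub fn with_capacity(segment_size: usize, max_segments: usize) -> Self {
        Self {
            buf: Vec::with_capacity(segment_size * max_segments),
            segment_size,
            segments: 0,
        }
    }

    /// Let `encode` append one datagram directly at the end of the batch,
    /// so no intermediate per-datagram buffer and no copy is needed.
    /// Returns the number of bytes written.
    pub fn write_datagram<F>(&mut self, encode: F) -> usize
    where
        F: FnOnce(&mut Vec<u8>),
    {
        let start = self.buf.len();
        encode(&mut self.buf);
        let written = self
            .buf
            .len()
            .checked_sub(start)
            .expect("encode must only append to the buffer");
        assert!(written <= self.segment_size, "datagram exceeds the segment size");
        if written > 0 {
            self.segments += 1;
        }
        written
    }

    /// The whole train, ready to be passed to a single batched send.
    pub fn as_bytes(&self) -> &[u8] {
        &self.buf
    }

    /// Number of datagrams written so far.
    pub fn segments(&self) -> usize {
        self.segments
    }
}
```

The sketch assumes that the encoder only appends and that every datagram except possibly the last fills the whole segment; a real implementation would have to enforce both.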

Once this is merged, we can switch to a long-lived send buffer (see the discussion in #2670). #2677 and this pull request lay the groundwork for it.

github-actions · 2025-04-18T13:47:16Z

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to 66be2e6.

neqo-latest as client

neqo-latest vs. aioquic: 🚀~~C20 M S~~ Z 🚀3 ⚠️U L1 L2 🚀BP ⚠️C2 BA
neqo-latest vs. go-x-net: 🚀~~H DC M B A C2~~ ⚠️U L2 6 BP BA
neqo-latest vs. haproxy: H DC LR C20 M S R Z 3 B U A L1 L2 C1 C2 6 V2 BP BA
neqo-latest vs. kwik: ⚠️H DC LR C20 M S R ⚠️Z 3 B U ⚠️A L1 ⚠️L2 C1 ⚠️C2 6 V2 BP BA
neqo-latest vs. linuxquic: H DC ⚠️LR C20 M S ⚠️R Z 3 B U E A L1 L2 C1 ⚠️C2 6 V2 ⚠️BP BA ⚠️CM
neqo-latest vs. lsquic: ⚠️H LR 🚀U ⚠️C20 E ⚠️A L1 C1 🚀C2 ⚠️6 BP CM
neqo-latest vs. msquic: H ⚠️DC LR C20 M S ⚠️R Z B ⚠️U A L1 ⚠️L2 C1 C2 ⚠️6 V2 BP BA
neqo-latest vs. mvfst: 🚀U ⚠️DC Z 3 A L1 C1 🚀~~C2 6~~ ⚠️BP BA
neqo-latest vs. neqo: run cancelled after 20 min
neqo-latest vs. neqo-latest: ⚠️H DC LR M S R Z ⚠️3 A 🚀L2 ⚠️L1 C1 🚀6 V2 ⚠️BP BA ⚠️CM
neqo-latest vs. nginx: ⚠️H DC 🚀~~LR 3 A L1~~ ⚠️M R U BP BA
neqo-latest vs. ngtcp2: LR ⚠️C20 S Z ⚠️3 B ⚠️L1 BA CM
neqo-latest vs. picoquic: 🚀S ⚠️H DC Z B A ⚠️L1 C1 6
neqo-latest vs. quic-go: LR 🚀S ⚠️M R 3 B A 🚀~~L2 C2~~
neqo-latest vs. quiche: 🚀S ⚠️H LR C20 3 U A L1 🚀L2 ⚠️6 BP BA
neqo-latest vs. quinn: 🚀~~DC 3~~ ⚠️M S R E L2 ⚠️C2
neqo-latest vs. s2n-quic: DC 🚀~~R L2 C1 C2~~ BP BA CM
neqo-latest vs. tquic: H 🚀DC ⚠️C20 S R 🚀B ⚠️U A BP BA
neqo-latest vs. xquic: H ⚠️DC LR C20 ⚠️M R Z 3 B U A L1 ⚠️L2 C1 ⚠️C2 6 BP ⚠️BA

neqo-latest as server

aioquic vs. neqo-latest: run cancelled after 20 min
go-x-net vs. neqo-latest: 🚀H ⚠️L2 CM
kwik vs. neqo-latest: 🚀~~H DC S B L1 L2 C2 V2~~ ⚠️C20 3 6 BP BA ⚠️CM
linuxquic vs. neqo-latest: run cancelled after 20 min
lsquic vs. neqo-latest: 🚀~~H C2~~ ⚠️DC V2
msquic vs. neqo-latest: 🚀B ⚠️S Z U ⚠️L1 C1 V2 🚀CM
mvfst vs. neqo-latest: 🚀~~H DC~~ ⚠️LR Z A L1 ⚠️L2 C1 ⚠️CM
neqo vs. neqo-latest: run cancelled after 20 min
ngtcp2 vs. neqo-latest: 🚀~~3 6 V2~~ ⚠️LR C20 B L2
openssl vs. neqo-latest: LR C20 M 🚀~~S R 3~~ ⚠️B A ⚠️BP CM
picoquic vs. neqo-latest: run cancelled after 20 min
quic-go vs. neqo-latest: 🚀LR ⚠️3 B C1 6 CM
quiche vs. neqo-latest: 🚀~~M B~~ ⚠️R Z L2 C1 🚀6 ⚠️BP BA CM
quinn vs. neqo-latest: 🚀~~H DC M U L2~~ ⚠️B L1 C1 V2 ⚠️BA CM
s2n-quic vs. neqo-latest: ⚠️H DC LR M S R 3 B A L2 6 BA CM
tquic vs. neqo-latest: run cancelled after 20 min
xquic vs. neqo-latest: run cancelled after 20 min


Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest vs. aioquic: H DC LR 🚀~~C20 M S~~ R 🚀3 B ⚠️U A ⚠️L1 C1 ⚠️C2 6 V2 ⚠️BA 🚀BP
neqo-latest vs. go-x-net: 🚀~~H DC~~ LR ⚠️U L2 6 🚀~~M B A C2~~
neqo-latest vs. lsquic: ⚠️H DC ⚠️C20 M S R Z 3 B ⚠️A 🚀U L2 ⚠️6 🚀C2 V2 ⚠️BP BA ⚠️CM
neqo-latest vs. mvfst: H ⚠️DC LR M R ⚠️Z 3 B 🚀U L2 ⚠️BP BA 🚀~~C2 6~~
neqo-latest vs. neqo-latest: ⚠️H DC LR C20 ⚠️M S 3 B U E ⚠️L1 🚀L2 C2 ⚠️BP CM 🚀6
neqo-latest vs. nginx: ⚠️H 🚀LR C20 ⚠️M S ⚠️R Z 🚀3 B ⚠️U 🚀~~A L1~~ L2 C1 C2 6
neqo-latest vs. ngtcp2: H DC ⚠️C20 M ⚠️S R ⚠️3 U E A ⚠️L1 L2 C1 C2 6 V2 BP ⚠️BA
neqo-latest vs. picoquic: ⚠️H DC LR C20 M 🚀S R 3 U E ⚠️L1 L2 C2 V2 BP BA
neqo-latest vs. quic-go: H DC C20 ⚠️M R 🚀S Z ⚠️3 B U L1 🚀L2 C1 🚀C2 6 BP BA
neqo-latest vs. quiche: ⚠️H DC ⚠️LR C20 M 🚀S R Z ⚠️3 B ⚠️U A 🚀L2 C1 C2 ⚠️6
neqo-latest vs. quinn: H 🚀DC LR C20 ⚠️M S R Z 🚀3 B U ⚠️E A L1 C1 ⚠️C2 6 BP BA
neqo-latest vs. s2n-quic: H LR C20 M S 🚀R 3 B U E A L1 🚀~~L2 C1 C2~~ 6
neqo-latest vs. tquic: 🚀DC LR ⚠️C20 M Z 3 ⚠️U 🚀B L1 L2 C1 C2 6

neqo-latest as server

chrome vs. neqo-latest: 3
go-x-net vs. neqo-latest: 🚀H DC LR M B ⚠️A L2 🚀U C2 6 BP
kwik vs. neqo-latest: 🚀~~H DC~~ LR ⚠️C20 M 🚀S R Z ⚠️3 U 🚀B A 🚀~~L1 L2~~ C1 ⚠️6 🚀~~C2 V2~~
lsquic vs. neqo-latest: ⚠️DC 🚀H LR M S R 3 B E A L1 L2 C1 🚀C2 6 ⚠️V2 BP ⚠️BA CM
msquic vs. neqo-latest: H DC LR C20 M ⚠️S Z L1 🚀~~R B A~~ L2 ⚠️C1 C2 6 ⚠️BA
mvfst vs. neqo-latest: ⚠️LR 🚀~~H DC M~~ 3 B ⚠️L2 C2 6 🚀BP BA
ngtcp2 vs. neqo-latest: H DC ⚠️LR C20 M S R Z ⚠️B U 🚀3 E 🚀A L1 ⚠️L2 C1 C2 ⚠️BA 🚀~~6 V2 BP~~ CM
openssl vs. neqo-latest: H DC ⚠️B 🚀~~S R 3~~ L2 C2 6 ⚠️BP BA
quic-go vs. neqo-latest: H DC 🚀LR C20 M S R Z ⚠️3 B U A L1 L2 ⚠️C1 C2 ⚠️6 BP BA
quiche vs. neqo-latest: H DC LR 🚀M S ⚠️R Z 3 🚀B A L1 ⚠️L2 C2 ⚠️BP BA 🚀6
quinn vs. neqo-latest: 🚀~~H DC~~ LR C20 🚀M S R Z 3 ⚠️B 🚀U E A ⚠️L1 C1 🚀L2 C2 6 BP ⚠️BA
s2n-quic vs. neqo-latest: 🚀~~E L1 C1 C2 BP~~

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest vs. aioquic: E CM
neqo-latest vs. go-x-net: C20 S R Z 3 E L1 C1 V2 CM
neqo-latest vs. haproxy: E CM
neqo-latest vs. kwik: E CM
neqo-latest vs. msquic: 3 E CM
neqo-latest vs. mvfst: C20 S E V2 CM
neqo-latest vs. nginx: E V2 CM
neqo-latest vs. picoquic: CM
neqo-latest vs. quic-go: E V2 CM
neqo-latest vs. quiche: E V2 CM
neqo-latest vs. quinn: V2 CM
neqo-latest vs. s2n-quic: Z V2
neqo-latest vs. tquic: E V2 CM
neqo-latest vs. xquic: S E V2 CM

neqo-latest as server

chrome vs. neqo-latest: H DC LR C20 M S R Z B U E A L1 L2 C1 C2 6 V2 BP BA CM
go-x-net vs. neqo-latest: C20 S R Z 3 U E A L1 C1 V2 BA CM
kwik vs. neqo-latest: U E CM
lsquic vs. neqo-latest: C20 Z U BA
msquic vs. neqo-latest: R 3 E A BP BA CM
mvfst vs. neqo-latest: C20 M S R U E V2 BP CM
ngtcp2 vs. neqo-latest: A BP U BA
openssl vs. neqo-latest: Z U E L1 C1 V2
quic-go vs. neqo-latest: E V2
quiche vs. neqo-latest: C20 U E V2
s2n-quic vs. neqo-latest: C20 Z U V2

github-actions · 2025-04-18T13:55:35Z

Benchmark results

Performance differences relative to 95f9bed.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved.

       time:   [202.13 ms 202.48 ms 202.85 ms]
       thrpt:  [492.97 MiB/s 493.86 MiB/s 494.74 MiB/s]
change:
       time:   [−69.170% −69.106% −69.043%] (p = 0.00 < 0.05)
       thrpt:  [+223.02% +223.69% +224.36%]
Found 2 outliers among 100 measurements (2.00%)

2 (2.00%) high mild
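The time and throughput changes reported above are two views of the same measurement: for a fixed amount of data, throughput is inversely proportional to run time. The following sketch shows the conversion. The function name `throughput_change` is hypothetical, and the benchmark tooling may compute its numbers differently.

```rust
/// Convert a relative change in run time (e.g. -0.69106 for a 69.106%
/// reduction) into the matching relative change in throughput, assuming the
/// amount of work stays the same.
///
/// Hypothetical helper for illustration only.
pub fn throughput_change(time_change: f64) -> f64 {
    assert!(time_change > -1.0, "run time cannot shrink by 100% or more");
    1.0 / (1.0 + time_change) - 1.0
}
```

For the Download benchmark, a time change of −69.106% gives 1 / (1 − 0.69106) − 1 ≈ 2.2369, which is the reported +223.69% throughput change.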

1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: Change within noise threshold.

       time:   [304.31 ms 305.83 ms 307.35 ms]
       thrpt:  [32.536 Kelem/s 32.698 Kelem/s 32.862 Kelem/s]
change:
       time:   [+0.6959% +1.3800% +2.0501%] (p = 0.00 < 0.05)
       thrpt:  [−2.0089% −1.3612% −0.6911%]

1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: 💔 Performance has regressed.

       time:   [27.525 ms 27.597 ms 27.673 ms]
       thrpt:  [36.136  elem/s 36.236  elem/s 36.331  elem/s]
change:
       time:   [+1.1068% +1.8128% +2.4725%] (p = 0.00 < 0.05)
       thrpt:  [−2.4128% −1.7805% −1.0947%]
Found 3 outliers among 100 measurements (3.00%)

3 (3.00%) high mild

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.

       time:   [648.95 ms 654.10 ms 659.22 ms]
       thrpt:  [151.69 MiB/s 152.88 MiB/s 154.10 MiB/s]
change:
       time:   [−28.772% −27.945% −27.096%] (p = 0.00 < 0.05)
       thrpt:  [+37.166% +38.782% +40.394%]
Found 10 outliers among 100 measurements (10.00%)

4 (4.00%) low severe

4 (4.00%) low mild

2 (2.00%) high severe

decode 4096 bytes, mask ff: No change in performance detected.

       time:   [11.792 µs 11.818 µs 11.851 µs]
       change: [−0.7711% −0.1718% +0.3634%] (p = 0.57 > 0.05)
Found 15 outliers among 100 measurements (15.00%)

3 (3.00%) low severe

2 (2.00%) low mild

3 (3.00%) high mild

7 (7.00%) high severe

decode 1048576 bytes, mask ff: No change in performance detected.

       time:   [3.0229 ms 3.0323 ms 3.0435 ms]
       change: [−0.2404% +0.1966% +0.6361%] (p = 0.39 > 0.05)
Found 9 outliers among 100 measurements (9.00%)

9 (9.00%) high severe

decode 4096 bytes, mask 7f: No change in performance detected.

       time:   [19.968 µs 20.023 µs 20.082 µs]
       change: [−0.7959% −0.1788% +0.3899%] (p = 0.57 > 0.05)
Found 21 outliers among 100 measurements (21.00%)

1 (1.00%) low severe

4 (4.00%) low mild

16 (16.00%) high severe

decode 1048576 bytes, mask 7f: No change in performance detected.

       time:   [5.0371 ms 5.0487 ms 5.0618 ms]
       change: [−0.5165% −0.1114% +0.2906%] (p = 0.59 > 0.05)
Found 14 outliers among 100 measurements (14.00%)

14 (14.00%) high severe

decode 4096 bytes, mask 3f: No change in performance detected.

       time:   [8.2722 µs 8.3105 µs 8.3530 µs]
       change: [−0.2019% +0.2442% +0.7830%] (p = 0.32 > 0.05)
Found 19 outliers among 100 measurements (19.00%)

6 (6.00%) low mild

2 (2.00%) high mild

11 (11.00%) high severe

decode 1048576 bytes, mask 3f: No change in performance detected.

       time:   [1.5850 ms 1.5902 ms 1.5962 ms]
       change: [−0.6684% −0.1223% +0.4020%] (p = 0.66 > 0.05)
Found 9 outliers among 100 measurements (9.00%)

3 (3.00%) high mild

6 (6.00%) high severe

1000 streams of 1 bytes/multistream: No change in performance detected.

       time:   [33.203 ns 39.555 ns 51.878 ns]
       change: [+10.728% +32.674% +75.025%] (p = 0.06 > 0.05)
Found 3 outliers among 500 measurements (0.60%)

1 (0.20%) high mild

2 (0.40%) high severe

1000 streams of 1000 bytes/multistream: 💔 Performance has regressed.

       time:   [34.055 ns 34.490 ns 34.929 ns]
       change: [+12.649% +14.534% +16.408%] (p = 0.00 < 0.05)
Found 1 outliers among 500 measurements (0.20%)

1 (0.20%) high severe

coalesce_acked_from_zero 1+1 entries: No change in performance detected.

       time:   [88.115 ns 88.448 ns 88.789 ns]
       change: [−0.4490% +0.5174% +1.7914%] (p = 0.49 > 0.05)
Found 11 outliers among 100 measurements (11.00%)

7 (7.00%) high mild

4 (4.00%) high severe

coalesce_acked_from_zero 3+1 entries: No change in performance detected.

       time:   [105.48 ns 105.73 ns 105.99 ns]
       change: [−0.8906% −0.3531% +0.1143%] (p = 0.18 > 0.05)
Found 8 outliers among 100 measurements (8.00%)

1 (1.00%) low mild

2 (2.00%) high mild

5 (5.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.

       time:   [105.05 ns 105.38 ns 105.80 ns]
       change: [−0.2958% +0.2637% +0.8589%] (p = 0.38 > 0.05)
Found 21 outliers among 100 measurements (21.00%)

4 (4.00%) low severe

6 (6.00%) low mild

3 (3.00%) high mild

8 (8.00%) high severe

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.

       time:   [88.820 ns 88.971 ns 89.126 ns]
       change: [−0.7026% +0.2281% +1.1404%] (p = 0.65 > 0.05)
Found 8 outliers among 100 measurements (8.00%)

3 (3.00%) high mild

5 (5.00%) high severe

RxStreamOrderer::inbound_frame(): No change in performance detected.

       time:   [107.81 ms 107.97 ms 108.23 ms]
       change: [−0.4699% −0.1075% +0.2218%] (p = 0.59 > 0.05)
Found 10 outliers among 100 measurements (10.00%)

7 (7.00%) low mild

2 (2.00%) high mild

1 (1.00%) high severe

sent::Packets::take_ranges: No change in performance detected.

       time:   [8.0612 µs 8.2610 µs 8.4441 µs]
       change: [−0.7407% +5.8984% +17.161%] (p = 0.24 > 0.05)
Found 20 outliers among 100 measurements (20.00%)

4 (4.00%) low severe

11 (11.00%) low mild

4 (4.00%) high mild

1 (1.00%) high severe

transfer/pacing-false/varying-seeds: 💔 Performance has regressed.

       time:   [37.072 ms 37.169 ms 37.279 ms]
       change: [+4.5101% +4.8981% +5.2577%] (p = 0.00 < 0.05)
Found 1 outliers among 100 measurements (1.00%)

1 (1.00%) high severe

transfer/pacing-true/varying-seeds: 💔 Performance has regressed.

       time:   [37.692 ms 37.808 ms 37.931 ms]
       change: [+5.0829% +5.4940% +5.9561%] (p = 0.00 < 0.05)
Found 2 outliers among 100 measurements (2.00%)

1 (1.00%) high mild

1 (1.00%) high severe

transfer/pacing-false/same-seed: 💔 Performance has regressed.

       time:   [36.999 ms 37.067 ms 37.140 ms]
       change: [+4.6033% +4.8770% +5.1647%] (p = 0.00 < 0.05)
Found 1 outliers among 100 measurements (1.00%)

1 (1.00%) high severe

transfer/pacing-true/same-seed: 💔 Performance has regressed.

       time:   [38.372 ms 38.472 ms 38.576 ms]
       change: [+4.1365% +4.4851% +4.8031%] (p = 0.00 < 0.05)
Found 3 outliers among 100 measurements (3.00%)

1 (1.00%) low mild

1 (1.00%) high mild

1 (1.00%) high severe

Client/server transfer results

Performance differences relative to 95f9bed.

Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.

Client vs. server (params)	Mean ± σ	Min	Max	MiB/s ± σ	Δ `main` (ms)	Δ `main` (%)
google vs. google	451.8 ± 4.7	444.9	461.2	70.8 ± 6.8
google vs. neqo (cubic, paced)	268.6 ± 4.5	261.4	283.8	119.2 ± 7.1	💚 -49.9	-15.7%
msquic vs. msquic	133.0 ± 34.2	100.8	374.4	240.6 ± 0.9
msquic vs. neqo (cubic, paced)	145.8 ± 16.7	121.6	225.1	219.5 ± 1.9	💚 -125.9	-46.3%
neqo vs. google (cubic, paced)	751.4 ± 4.5	743.5	769.3	42.6 ± 7.1	-0.5	-0.1%
neqo vs. msquic (cubic, paced)	155.6 ± 5.0	147.3	176.0	205.6 ± 6.4	-0.6	-0.4%
neqo vs. neqo (cubic)	90.0 ± 4.7	78.9	105.0	355.7 ± 6.8	💚 -121.0	-57.4%
neqo vs. neqo (cubic, paced)	90.2 ± 4.0	82.7	99.1	354.7 ± 8.0	💚 -121.0	-57.3%
neqo vs. neqo (reno)	90.8 ± 5.2	80.3	108.5	352.5 ± 6.2	💚 -118.3	-56.6%
neqo vs. neqo (reno, paced)	93.2 ± 5.3	82.0	113.0	343.2 ± 6.0	💚 -116.8	-55.6%
neqo vs. quiche (cubic, paced)	191.7 ± 4.2	185.4	202.1	167.0 ± 7.6	💔 2.3	1.2%
neqo vs. s2n (cubic, paced)	217.8 ± 4.6	210.3	225.9	146.9 ± 7.0	1.1	0.5%
quiche vs. neqo (cubic, paced)	157.6 ± 5.8	146.1	183.5	203.1 ± 5.5	💚 -590.4	-78.9%
quiche vs. quiche	147.0 ± 4.9	137.7	164.8	217.6 ± 6.5
s2n vs. neqo (cubic, paced)	172.1 ± 5.0	161.3	183.3	186.0 ± 6.4	💚 -126.3	-42.3%
s2n vs. s2n	248.2 ± 27.7	230.3	345.1	128.9 ± 1.2


mxinden · 2025-04-18T14:48:10Z

Optimized Upload only thus far.

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.
       time:   [1.2891 s 1.2983 s 1.3077 s]
       thrpt:  [76.469 MiB/s 77.023 MiB/s 77.571 MiB/s]
change:
       time:   [-32.828% -31.716% -30.597%] (p = 0.00 < 0.05)
       thrpt:  [+44.086% +46.447% +48.872%]

Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild

🎉 This matches the result in #2532 (comment).

mxinden · 2025-04-21T16:33:21Z

Introduced the same optimizations to neqo-server. In addition, I removed the memory copy: each datagram of a GSO train is now written into a single contiguous Vec right away. The result looks promising.

1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved.
       time:   [245.19 ms 245.65 ms 246.12 ms]
       thrpt:  [406.30 MiB/s 407.09 MiB/s 407.85 MiB/s]
change:
       time:   [-66.225% -66.008% -65.788%] (p = 0.00 < 0.05)
       thrpt:  [+192.30% +194.19% +196.08%]

larseggert · 2025-06-13T05:48:50Z

Why do we see a massive benefit in the client/server tests, but not in the transfer benches?

mxinden · 2025-06-13T07:23:21Z

@larseggert the neqo-transport/benches/transfer.rs benchmarks use the test-fixture/src/sim Simulator. The Simulator only processes a single datagram at a time.

neqo/test-fixture/src/sim/mod.rs

Line 206 in 37c3aee

let mut dgram = None;

Let me see whether I can change that as part of this pull request. After all, our benchmarks and tests should mirror how we run Neqo in Firefox as closely as possible.

larseggert · 2025-06-27T09:02:31Z

@mxinden tests::send_ignore_emsgsize is still failing on Windows.

github-actions · 2025-06-27T09:12:23Z

Bencher Report

Branch	gso-v3
Testbed	t-linux64-ms-279

Benchmark	Latency (ns)
1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client	646,670,000.00
1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client	201,340,000.00
1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client	27,380,000.00
1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client	307,020,000.00
1000 streams of 1 bytes/multistream	34.99
1000 streams of 1000 bytes/multistream	35.03
RxStreamOrderer::inbound_frame()	110,960,000.00
coalesce_acked_from_zero 1+1 entries	88.31
coalesce_acked_from_zero 10+1 entries	105.52
coalesce_acked_from_zero 1000+1 entries	90.91
coalesce_acked_from_zero 3+1 entries	105.85
decode 1048576 bytes, mask 3f	1,590,700.00
decode 1048576 bytes, mask 7f	5,047,400.00
decode 1048576 bytes, mask ff	3,031,800.00
decode 4096 bytes, mask 3f	8,308.50
decode 4096 bytes, mask 7f	20,011.00
decode 4096 bytes, mask ff	11,832.00
sent::Packets::take_ranges	5,182.40
transfer/pacing-false/same-seed	36,846,000.00
transfer/pacing-false/varying-seeds	37,089,000.00
transfer/pacing-true/same-seed	38,620,000.00
transfer/pacing-true/varying-seeds	38,194,000.00

github-actions · 2025-06-27T09:12:25Z

Bencher Report

Branch	gso-v3
Testbed	t-linux64-ms-279

Benchmark	Latency (ms)
s2n vs. neqo (cubic, paced)	210.06

github-actions · 2025-06-30T12:58:19Z

Bencher Report

Branch	gso-v3
Testbed	t-linux64-ms-278

Benchmark	Latency (ns)
1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client	654,100,000.00
1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client	202,480,000.00
1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client	27,597,000.00
1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client	305,830,000.00
1000 streams of 1 bytes/multistream	39.55
1000 streams of 1000 bytes/multistream	34.49
RxStreamOrderer::inbound_frame()	107,970,000.00
coalesce_acked_from_zero 1+1 entries	88.45
coalesce_acked_from_zero 10+1 entries	105.38
coalesce_acked_from_zero 1000+1 entries	88.97
coalesce_acked_from_zero 3+1 entries	105.73
decode 1048576 bytes, mask 3f	1,590,200.00
decode 1048576 bytes, mask 7f	5,048,700.00
decode 1048576 bytes, mask ff	3,032,300.00
decode 4096 bytes, mask 3f	8,310.50
decode 4096 bytes, mask 7f	20,023.00
decode 4096 bytes, mask ff	11,818.00
sent::Packets::take_ranges	8,261.00
transfer/pacing-false/same-seed	37,067,000.00
transfer/pacing-false/varying-seeds	37,169,000.00
transfer/pacing-true/same-seed	38,472,000.00
transfer/pacing-true/varying-seeds	37,808,000.00

github-actions · 2025-06-30T12:58:21Z

Bencher Report

Branch	gso-v3
Testbed	t-linux64-ms-278

Benchmark	Latency (ms)
s2n vs. neqo (cubic, paced)	172.07

larseggert · 2025-06-30T13:31:46Z

@mxinden is this ready to merge?

mxinden · 2025-06-30T14:55:53Z

Yes, ready to merge from my end. We have a couple of benchmark regressions. Explainer for each:

1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.
       time:   [650.03 ms 655.28 ms 660.94 ms]
       thrpt:  [151.30 MiB/s 152.61 MiB/s 153.84 MiB/s]
change:
       time:   [−27.566% −26.708% −25.736%] (p = 0.00 < 0.05)
       thrpt:  [+34.655% +36.441% +38.056%]

This will improve even further with #2734.

1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: 💔 Performance has regressed.
       time:   [27.404 ms 27.500 ms 27.616 ms]
       thrpt:  [36.211  elem/s 36.363  elem/s 36.491  elem/s]
change:
       time:   [+1.5705% +2.1502% +2.7519%] (p = 0.00 < 0.05)
       thrpt:  [−2.6782% −2.1049% −1.5463%]

This is expected. We pay a slight cost in latency when sending in batches.

1000 streams of 1000 bytes/multistream: 💔 Performance has regressed.
   time:   [36.454 ns 36.834 ns 37.215 ns]
   change: [+25.596% +27.533% +29.527%] (p = 0.00 < 0.05)

This should be due to neqo-http3/benches/streams.rs not using the batched IO paths. Instead of altering the IO handling in the benchmark, I suggest we do #2728. Given that the benchmark measures stream performance and not UDP IO performance, I suggest doing this in a follow-up.

transfer/pacing-false/varying-seeds: 💔 Performance has regressed.
   time:   [36.886 ms 36.956 ms 37.027 ms]
   change: [+4.0332% +4.3753% +4.6740%] (p = 0.00 < 0.05)

Again, a slight regression, as the Simulator is not using the batched IO paths. The non-batched IO path (i.e. process) no longer pre-allocates, as we don't know the datagram size ahead of time. Once #2747 is merged, this overhead should be reduced, as we would write datagrams into a long-lived buffer.
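To illustrate why a long-lived buffer would reduce this overhead, here is a minimal sketch of a send buffer that is cleared and reused for every datagram instead of being allocated anew. `ReusableSendBuffer` and `encode_with` are hypothetical names; this is not the design of #2747, only the general pattern under the assumption that the encoder appends into a `Vec<u8>`.

```rust
/// A send buffer that outlives individual sends. Clearing a `Vec` keeps its
/// capacity, so after the first few sends no further allocation is needed.
/// Hypothetical type for illustration only.
pub struct ReusableSendBuffer {
    buf: Vec<u8>,
}

impl ReusableSendBuffer {
    pub fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Reset the buffer, let `encode` write the next datagram into it, and
    /// return the encoded bytes. The datagram size does not need to be known
    /// ahead of time, as the buffer grows on demand and keeps its capacity.
    pub fn encode_with<F>(&mut self, encode: F) -> &[u8]
    where
        F: FnOnce(&mut Vec<u8>),
    {
        self.buf.clear();
        encode(&mut self.buf);
        &self.buf
    }

    /// Currently allocated capacity in bytes.
    pub fn capacity(&self) -> usize {
        self.buf.capacity()
    }
}
```

After a first full-size datagram has been encoded, later and smaller datagrams reuse the same allocation, which is the saving the non-batched path currently gives up.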

@larseggert let me know whether you are fine proceeding here, or would prefer any of the above to be addressed first.

larseggert · 2025-06-30T16:08:52Z

I'll merge now; please do issues for the missing bits?

Great we can land this!

larseggert · 2025-07-01T08:35:46Z

This keeps getting kicked out of the merge queue while tests are still running and haven't failed yet. I think GitHub may have issues. Doing a force merge.

mxinden · 2025-07-06T17:30:37Z

> please do issues for the missing bits?

I assume you are fine with the following pull requests tracking the progress. Let me know if you want additional GitHub issues.

mxinden · 2025-07-22T10:47:00Z

Early numbers on GSO in Firefox Nightly:

- ~5% of sends on Linux and Windows use GSO with 2 or more segments
- ~5% of sends on Linux and Windows send 2.4 k bytes or more
- We currently limit the number of segments to 10, which is reflected in the metrics (apart from some crazy machine on Linux doing > 100)

Good signals. We should explore increasing the maximum number of segments (currently 10). Maybe just limit it by what our pacer allows us to send.
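One way to picture the suggestion above is to derive the segment limit from the pacer's current allowance and the maximum batch payload, with the fixed cap as an optional extra bound. The function `max_segments`, its parameters and the 65,535 byte payload bound are assumptions made for this sketch; neqo's pacer exposes a different interface.

```rust
/// Upper bound on the number of datagrams in one batch.
///
/// * `pacer_budget_bytes`: bytes the pacer would allow to be sent right now
///   (hypothetical input; not neqo's pacer API).
/// * `segment_size`: size of each datagram in the batch.
/// * `hard_limit`: fixed cap, e.g. the current value of 10.
///
/// A result of 0 means nothing should be sent at this point.
pub fn max_segments(pacer_budget_bytes: usize, segment_size: usize, hard_limit: usize) -> usize {
    // Assumption: one batch carries at most 65,535 bytes ("up to 64 KB").
    const MAX_BATCH_PAYLOAD: usize = 65_535;
    assert!(segment_size > 0, "segment size must be non-zero");

    let by_pacer = pacer_budget_bytes / segment_size;
    let by_payload = MAX_BATCH_PAYLOAD / segment_size;
    by_pacer.min(by_payload).min(hard_limit)
}
```

With 1200 byte datagrams and a pacer budget of 24,000 bytes, the current cap of 10 is the binding limit; without the cap the pacer would allow 20 segments, and the payload bound would stop at 54.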

Datagram (batch) size

Windows

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_size_sent/explore?os=Windows&visiblePercentiles=%5B99%2C95%2C75%2C50%2C25%2C5%5D

Linux

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_size_sent/explore?os=Linux&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

Number of segments in a batch

Windows

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_segments_sent/explore?os=Windows&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

Linux

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_segments_sent/explore?os=Linux&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

larseggert · 2025-07-22T11:27:51Z

Yes, let's increase.

mxinden force-pushed the gso-v3 branch 8 times, most recently from f9ff613 to a21983d Compare April 21, 2025 16:01

mxinden force-pushed the gso-v3 branch from a21983d to c89ba47 Compare April 27, 2025 19:28

mxinden mentioned this pull request Apr 30, 2025

fix(quinn-udp): sanitise segment_size quinn-rs/quinn#2217

Merged

mxinden force-pushed the gso-v3 branch 2 times, most recently from 43bddd5 to a05f876 Compare May 12, 2025 17:34

This was referenced May 13, 2025

fix: Move allocations outside of loops #2620

Merged

feat: Basic GSO support #2532

Closed

mxinden force-pushed the gso-v3 branch 2 times, most recently from ce275bd to d52586e Compare May 25, 2025 10:20

This was referenced May 29, 2025

Consider not pre-allocating each UDP datagram in output_path #2670

Closed

perf(common): make Encoder generic over borrowed or owned buffer #2677

Merged

mxinden force-pushed the gso-v3 branch from d52586e to 538f76f Compare June 5, 2025 13:50

mxinden mentioned this pull request Jun 5, 2025

refactor(transport): don't implicitly infer PacketBuilder limit #2704

Merged

mxinden force-pushed the gso-v3 branch from 538f76f to b6aed5d Compare June 12, 2025 17:58

mxinden mentioned this pull request Jun 14, 2025

bench(bin/server): process input datagram in batches #2734

Merged

mxinden force-pushed the gso-v3 branch from 0bf6da0 to 138f410 Compare June 15, 2025 14:20

larseggert mentioned this pull request Jun 16, 2025

fix: Use 256K for ranges #2729

Closed

mxinden added 2 commits June 30, 2025 12:44

Catch WSAEINVAL

28e06d1

fix comparison

c8434b5

larseggert mentioned this pull request Jun 30, 2025

Use more NonZero types #2768

Open

rename segment_size to datagram_size

6dafb1c

larseggert enabled auto-merge June 30, 2025 16:09

larseggert added this pull request to the merge queue Jun 30, 2025

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 30, 2025

larseggert added this pull request to the merge queue Jun 30, 2025

github-merge-queue Bot removed this pull request from the merge queue due to no response for status checks Jun 30, 2025

larseggert added this pull request to the merge queue Jul 1, 2025

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 1, 2025

larseggert added this pull request to the merge queue Jul 1, 2025

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 1, 2025

larseggert added this pull request to the merge queue Jul 1, 2025

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 1, 2025

larseggert merged commit a341259 into mozilla:main Jul 1, 2025
40 of 41 checks passed

mxinden mentioned this pull request Jul 6, 2025

chore(Cargo.toml): prepare v0.14.0 release #2783

Merged

mxinden mentioned this pull request Jul 14, 2025

GSO meets DSCP/ECN #2790

Closed
