Skip to content

Conversation

@adityamandaleeka
Copy link
Member

Use [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder<>))] on Kestrel ReadAsync methods to remove per read allocations in the mainline body reading logic.

Will collect allocation metrics before/after.

Fixes #34645

@davidfowl
Copy link
Member

cc @stephentoub

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM iff resulting throughput numbers look good. I'm very interested.

@adityamandaleeka
Copy link
Member Author

adityamandaleeka commented Aug 4, 2021

Here's what I've seen so far using a new websocket benchmark (aspnet/Benchmarks#1686)

Config: https://raw.githubusercontent.com/kondratyev-nv/Benchmarks/main/scenarios/websocket.benchmarks.yml

Unmodified build:

| application                             |                             |
| --------------------------------------- | --------------------------- |
| CPU Usage (%)                           | 90                          |
| Cores usage (%)                         | 2,518                       |
| Working Set (MB)                        | 406                         |
| Private Memory (MB)                     | 1,556                       |
| Build Time (ms)                         | 2,600                       |
| Start Time (ms)                         | 115                         |
| Published Size (KB)                     | 88,242                      |
| .NET Core SDK Version                   | 6.0.100-preview.6.21276.12  |
| ASP.NET Core Version                    | 6.0.0-rc.1.21403.21+e7d9161 |
| .NET Runtime Version                    | 6.0.0-rc.1.21403.6+b0cea40  |
| Max CPU Usage (%)                       | 101                         |
| Max Working Set (MB)                    | 425                         |
| Max GC Heap Size (MB)                   | 294                         |
| Size of committed memory by the GC (MB) | 358                         |
| Max Number of Gen 0 GCs / sec           | 2.00                        |
| Max Number of Gen 1 GCs / sec           | 1.00                        |
| Max Number of Gen 2 GCs / sec           | 1.00                        |
| Max Time in GC (%)                      | 0.00                        |
| Max Gen 0 Size (B)                      | 5,633,664                   |
| Max Gen 1 Size (B)                      | 3,544,152                   |
| Max Gen 2 Size (B)                      | 3,374,544                   |
| Max LOH Size (B)                        | 672                         |
| Max Allocation Rate (B/sec)             | 294,794,136                 |
| Max GC Heap Fragmentation               | 73                          |
| # of Assemblies Loaded                  | 73                          |
| Max Exceptions (#/s)                    | 600                         |
| Max Lock Contention (#/s)               | 30                          |
| Max ThreadPool Threads Count            | 48                          |
| Max ThreadPool Queue Length             | 25                          |
| Max ThreadPool Items (#/s)              | 787,024                     |
| Max Active Timers                       | 100                         |
| IL Jitted (B)                           | 190,097                     |
| Methods Jitted                          | 2,075                       |

With this pooling change:

| application                             |                             |
| --------------------------------------- | --------------------------- |
| CPU Usage (%)                           | 88                          |
| Cores usage (%)                         | 2,473                       |
| Working Set (MB)                        | 405                         |
| Private Memory (MB)                     | 1,554                       |
| Build Time (ms)                         | 6,082                       |
| Start Time (ms)                         | 129                         |
| Published Size (KB)                     | 88,242                      |
| .NET Core SDK Version                   | 6.0.100-preview.6.21276.12  |
| ASP.NET Core Version                    | 6.0.0-rc.1.21403.24+8eeeafc |
| .NET Runtime Version                    | 6.0.0-rc.1.21403.6+b0cea40  |
| Max CPU Usage (%)                       | 102                         |
| Max Working Set (MB)                    | 424                         |
| Max GC Heap Size (MB)                   | 310                         |
| Size of committed memory by the GC (MB) | 356                         |
| Max Number of Gen 0 GCs / sec           | 2.00                        |
| Max Number of Gen 1 GCs / sec           | 1.00                        |
| Max Number of Gen 2 GCs / sec           | 1.00                        |
| Max Time in GC (%)                      | 0.00                        |
| Max Gen 0 Size (B)                      | 7,059,768                   |
| Max Gen 1 Size (B)                      | 2,735,312                   |
| Max Gen 2 Size (B)                      | 3,142,848                   |
| Max LOH Size (B)                        | 672                         |
| Max Allocation Rate (B/sec)             | 203,759,352                 |
| Max GC Heap Fragmentation               | 65                          |
| # of Assemblies Loaded                  | 73                          |
| Max Exceptions (#/s)                    | 598                         |
| Max Lock Contention (#/s)               | 34                          |
| Max ThreadPool Threads Count            | 48                          |
| Max ThreadPool Queue Length             | 64                          |
| Max ThreadPool Items (#/s)              | 772,336                     |
| Max Active Timers                       | 100                         |
| IL Jitted (B)                           | 230,296                     |
| Methods Jitted                          | 2,793                       |

[EDIT: Removed text here that said there wasn't a noticeable improvement... I had ignored the Max Allocation Rate delta at first because I wasn't sure it was stable, but the improvement does seem to hold.]

@JamesNK
Copy link
Member

JamesNK commented Aug 7, 2021

Cool! I didn't know this had been added. I've got a whole bunch of methods in HTTP/3 where this will be useful:

image

Can look at HTTP/3 in a different PR.

@stephentoub
Copy link
Member

stephentoub commented Aug 7, 2021

Here's what I've seen so far using a new websocket benchmark

Is there a number here that represents throughput?

I've got a whole bunch of methods in HTTP/3 where this will be useful

Please measure (not just managed allocation reduction). There are tradeoffs here.

@davidfowl
Copy link
Member

Please measure (not just managed allocation reduction). There are tradeoffs here.

There's no throughput regression but there's no improvement from this benchmark either. Just a reduction in allocation rate so far. One thing we've excited to look at is running in more memory constrained environments to see if we run faster because we're spending less time doing GCs.

Also want test using the new conserve GC mode to see if there's any difference.

@JamesNK
Copy link
Member

JamesNK commented Aug 7, 2021

What are the negatives of using ValueTask pooling? I briefly looked at the implementation and it appears to use thread-local storage. Are you worried about overall memory going up?

Most of the HTTP/3 methods where state machines are being allocated will have exactly one caller per object. For methods like those, have you considered a builder that stashes the state machine on the object in a field instead of TLS? (if that is even possible. I'm not familar with this area 😬)

@stephentoub
Copy link
Member

What are the negatives of using ValueTask pooling?

Potential for:

  • Increased working set
  • Increased costs of GCs that do occur due to likely gen2 -> gen0 references from the pooled objects
  • Every await being a bit more expensive due to more interface invocations

have you considered a builder that stashes the state machine on the object in a field instead of TLS?

It's not possible without language help, and we cut it from C# 10 due to not being able to find a good way to express it with a builder, which would likely incur other costs like delegate invocations for every rent and return. We might revisit for C# 11.

@stephentoub
Copy link
Member

There's no throughput regression but there's no improvement from this benchmark either.

What should I be looking at in Aditya's numbers to see that? I was expecting some kind of websocket messages/sec number.

@adityamandaleeka
Copy link
Member Author

@stephentoub I didn't include this part in the original post, but here are the client metrics from those runs (including the throughput figures):

Before:

| load                  |                             |
| --------------------- | --------------------------- |
| CPU Usage (%)         | 37                          |
| Cores usage (%)       | 1,045                       |
| Working Set (MB)      | 58                          |
| Private Memory (MB)   | 535                         |
| Build Time (ms)       | 2,427                       |
| Start Time (ms)       | 78                          |
| Published Size (KB)   | 69,422                      |
| .NET Core SDK Version | 6.0.100-preview.6.21276.12  |
| ASP.NET Core Version  | 6.0.0-rc.1.21403.21+e7d9161 |
| .NET Runtime Version  | 6.0.0-rc.1.21403.6+b0cea40  |
| Max RPS               | 326,169                     |
| Requests              | 4,896,120                   |
| Mean latency (ms)     | 0                           |

After:


| load                  |                             |
| --------------------- | --------------------------- |
| CPU Usage (%)         | 39                          |
| Cores usage (%)       | 1,086                       |
| Working Set (MB)      | 59                          |
| Private Memory (MB)   | 527                         |
| Build Time (ms)       | 2,691                       |
| Start Time (ms)       | 78                          |
| Published Size (KB)   | 69,422                      |
| .NET Core SDK Version | 6.0.100-preview.6.21276.12  |
| ASP.NET Core Version  | 6.0.0-rc.1.21403.24+8eeeafc |
| .NET Runtime Version  | 6.0.0-rc.1.21403.6+b0cea40  |
| Max RPS               | 325,692                     |
| Requests              | 4,888,316                   |
| Mean latency (ms)     | 0                           |

The benchmark PR is still in progress (see comments there about what these numbers mean), but we can see that the throughput difference between the runs is tiny (~0.1%).

@adityamandaleeka
Copy link
Member Author

adityamandaleeka commented Aug 16, 2021

An update: I tested this change with other benchmarks in the lab (plaintext, json, fortunes) and, as in the websocket case, observed no throughput or latency impact when running them.

To summarize where we are, we've seen some allocation reduction with this change (based on the data above), but no signs of any other impact--positive or negative--on the benchmarks that have been run. We'll get more coverage once this is merged and gets picked up by all the automated benchmarking runs. If we see any regressions at that point, we could back out the change. How does that sound?

Raw Data

Plaintext

Note: P90 latency of this one is highly variable from run to run (I've seen 22 ms in a "before" run...), so the delta here shouldn't be taken seriously.

Before:

| application           |                            |
| --------------------- | -------------------------- |
| CPU Usage (%)         | 94                         |
| Cores usage (%)       | 2,642                      |
| Working Set (MB)      | 391                        |
| Private Memory (MB)   | 808                        |
| Build Time (ms)       | 3,000                      |
| Start Time (ms)       | 337                        |
| Published Size (KB)   | 109,320                    |
| .NET Core SDK Version | 6.0.100-preview.6.21276.12 |
| ASP.NET Core Version  | 6.0.0-rc.1.21413.5+fad9658 |
| .NET Runtime Version  | 6.0.0-rc.1.21413.9+60d9b98 |


| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 68         |
| Cores usage (%)        | 1,893      |
| Working Set (MB)       | 48         |
| Private Memory (MB)    | 376        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 233        |
| Requests/sec           | 2,663,433  |
| Requests               | 40,215,905 |
| Mean latency (ms)      | 1.68       |
| Max latency (ms)       | 166.39     |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 335.29     |
| Latency 50th (ms)      | 0.88       |
| Latency 75th (ms)      | 1.28       |
| Latency 90th (ms)      | 1.76       |
| Latency 99th (ms)      | 0.00       |

After:

| application           |                            |
| --------------------- | -------------------------- |
| CPU Usage (%)         | 94                         |
| Cores usage (%)       | 2,641                      |
| Working Set (MB)      | 387                        |
| Private Memory (MB)   | 804                        |
| Build Time (ms)       | 3,022                      |
| Start Time (ms)       | 353                        |
| Published Size (KB)   | 109,320                    |
| .NET Core SDK Version | 6.0.100-preview.6.21276.12 |
| ASP.NET Core Version  | 6.0.0-rc.1.21413.5+fad9658 |
| .NET Runtime Version  | 6.0.0-rc.1.21413.9+60d9b98 |


| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 67         |
| Cores usage (%)        | 1,882      |
| Working Set (MB)       | 49         |
| Private Memory (MB)    | 376        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 266        |
| Requests/sec           | 2,709,348  |
| Requests               | 40,904,982 |
| Mean latency (ms)      | 1.73       |
| Max latency (ms)       | 214.63     |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 341.07     |
| Latency 50th (ms)      | 0.89       |
| Latency 75th (ms)      | 1.29       |
| Latency 90th (ms)      | 2.64       |
| Latency 99th (ms)      | 0.00       |

JSON

Before:

| application           |                            |
| --------------------- | -------------------------- |
| CPU Usage (%)         | 82                         |
| Cores usage (%)       | 2,296                      |
| Working Set (MB)      | 361                        |
| Private Memory (MB)   | 443                        |
| Build Time (ms)       | 5,835                      |
| Start Time (ms)       | 296                        |
| Published Size (KB)   | 109,320                    |
| .NET Core SDK Version | 6.0.100-preview.6.21276.12 |
| ASP.NET Core Version  | 6.0.0-rc.1.21413.5+fad9658 |
| .NET Runtime Version  | 6.0.0-rc.1.21413.9+60d9b98 |


| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 52         |
| Cores usage (%)        | 1,458      |
| Working Set (MB)       | 37         |
| Private Memory (MB)    | 358        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 113        |
| Requests/sec           | 748,260    |
| Requests               | 11,298,306 |
| Mean latency (ms)      | 0.67       |
| Max latency (ms)       | 192.01     |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 108.47     |
| Latency 50th (ms)      | 0.23       |
| Latency 75th (ms)      | 0.48       |
| Latency 90th (ms)      | 1.15       |
| Latency 99th (ms)      | 6.61       |

After:

| application           |                            |
| --------------------- | -------------------------- |
| CPU Usage (%)         | 79                         |
| Cores usage (%)       | 2,225                      |
| Working Set (MB)      | 362                        |
| Private Memory (MB)   | 782                        |
| Build Time (ms)       | 9,493                      |
| Start Time (ms)       | 308                        |
| Published Size (KB)   | 109,321                    |
| .NET Core SDK Version | 6.0.100-preview.6.21276.12 |
| ASP.NET Core Version  | 6.0.0-rc.1.21413.9+4076bad |
| .NET Runtime Version  | 6.0.0-rc.1.21413.9+60d9b98 |


| load                   |            |
| ---------------------- | ---------- |
| CPU Usage (%)          | 52         |
| Cores usage (%)        | 1,454      |
| Working Set (MB)       | 37         |
| Private Memory (MB)    | 358        |
| Start Time (ms)        | 0          |
| First Request (ms)     | 141        |
| Requests/sec           | 740,802    |
| Requests               | 11,185,649 |
| Mean latency (ms)      | 0.66       |
| Max latency (ms)       | 112.99     |
| Bad responses          | 0          |
| Socket errors          | 0          |
| Read throughput (MB/s) | 107.39     |
| Latency 50th (ms)      | 0.23       |
| Latency 75th (ms)      | 0.48       |
| Latency 90th (ms)      | 1.00       |
| Latency 99th (ms)      | 7.14       |

Fortunes

Before:

| db                  |         |
| ------------------- | ------- |
| CPU Usage (%)       | 25      |
| Cores usage (%)     | 711     |
| Working Set (MB)    | 47      |
| Build Time (ms)     | 1,943   |
| Start Time (ms)     | 474     |
| Published Size (KB) | 914,909 |


| application           |                            |
| --------------------- | -------------------------- |
| CPU Usage (%)         | 75                         |
| Cores usage (%)       | 2,114                      |
| Working Set (MB)      | 409                        |
| Private Memory (MB)   | 824                        |
| Build Time (ms)       | 2,876                      |
| Start Time (ms)       | 1,850                      |
| Published Size (KB)   | 93,209                     |
| .NET Core SDK Version | 6.0.100-preview.6.21276.12 |
| ASP.NET Core Version  | 6.0.0-rc.1.21416.2+98ee247 |
| .NET Runtime Version  | 6.0.0-rc.1.21415.6+fde6b37 |


| load                   |           |
| ---------------------- | --------- |
| CPU Usage (%)          | 31        |
| Cores usage (%)        | 861       |
| Working Set (MB)       | 37        |
| Private Memory (MB)    | 363       |
| Start Time (ms)        | 0         |
| First Request (ms)     | 100       |
| Requests/sec           | 327,682   |
| Requests               | 4,943,368 |
| Mean latency (ms)      | 2.04      |
| Max latency (ms)       | 102.00    |
| Bad responses          | 0         |
| Socket errors          | 0         |
| Read throughput (MB/s) | 425.31    |
| Latency 50th (ms)      | 1.37      |
| Latency 75th (ms)      | 1.92      |
| Latency 90th (ms)      | 3.50      |
| Latency 99th (ms)      | 13.52     |

After:

| db                  |         |
| ------------------- | ------- |
| CPU Usage (%)       | 25      |
| Cores usage (%)     | 713     |
| Working Set (MB)    | 47      |
| Build Time (ms)     | 2,020   |
| Start Time (ms)     | 475     |
| Published Size (KB) | 914,909 |


| application           |                            |
| --------------------- | -------------------------- |
| CPU Usage (%)         | 74                         |
| Cores usage (%)       | 2,074                      |
| Working Set (MB)      | 409                        |
| Private Memory (MB)   | 824                        |
| Build Time (ms)       | 2,676                      |
| Start Time (ms)       | 1,903                      |
| Published Size (KB)   | 93,209                     |
| .NET Core SDK Version | 6.0.100-preview.6.21276.12 |
| ASP.NET Core Version  | 6.0.0-rc.1.21416.2+98ee247 |
| .NET Runtime Version  | 6.0.0-rc.1.21415.6+fde6b37 |


| load                   |           |
| ---------------------- | --------- |
| CPU Usage (%)          | 31        |
| Cores usage (%)        | 863       |
| Working Set (MB)       | 38        |
| Private Memory (MB)    | 363       |
| Start Time (ms)        | 0         |
| First Request (ms)     | 99        |
| Requests/sec           | 328,971   |
| Requests               | 4,967,324 |
| Mean latency (ms)      | 2.07      |
| Max latency (ms)       | 95.46     |
| Bad responses          | 0         |
| Socket errors          | 0         |
| Read throughput (MB/s) | 426.99    |
| Latency 50th (ms)      | 1.36      |
| Latency 75th (ms)      | 1.90      |
| Latency 90th (ms)      | 3.50      |
| Latency 99th (ms)      | 13.70     |

@adityamandaleeka adityamandaleeka marked this pull request as ready for review August 16, 2021 18:48
@adityamandaleeka
Copy link
Member Author

Merging per plan in comment above.

@adityamandaleeka adityamandaleeka merged commit 8556362 into dotnet:main Aug 17, 2021
@ghost ghost added this to the 6.0-rc1 milestone Aug 17, 2021
@davidfowl
Copy link
Member

Looking at a new websocket profile with SignalR we missed one of these.
image

@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Jun 6, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use the PoolingAsyncValueTaskMethodBuilder to remove state machine allocations for async reads

6 participants