This repository was archived by the owner on Sep 30, 2024. It is now read-only.

grpc: symbols: spread LocalCodeIntel response symbols across multiple messages#55242

Merged
ggilmore merged 7 commits into main from grpc-message-size-local-code-intel on Jul 25, 2023

Conversation

ggilmore (Contributor) commented Jul 24, 2023

This PR changes the LocalCodeIntel symbols gRPC method to return a stream of chunks of symbols, rather than the entire result set in a single message.


gRPC has default limits on how large individual messages can be. This is because gRPC is a message-based framework, which means that messages have to be entirely loaded into memory for both sending and receiving. As such, large allocations can negatively impact server performance and stability. The go-grpc default limit is 4MB.

The original LocalCodeIntel RPC implementation (both gRPC and REST) returned the entire set of symbols in one message. For certain repositories / files, this message can contain _thousands_ of symbols, which can add up to **[hundreds of MB](https://console.cloud.google.com/logs/query;cursorTimestamp=2023-07-24T15:43:57.509461420Z;duration=P7D;query=resource.type%3D%22k8s_container%22%0Aresource.labels.project_id%3D%22sourcegraph-dev%22%0Aresource.labels.location%3D%22us-central1-f%22%0Aresource.labels.cluster_name%3D%22cloud%22%0Aresource.labels.namespace_name%3D%22prod%22%0AjsonPayload.InstrumentationScope%3D~%22gRPC.internal.error.reporter%22%0A-jsonPayload.message%3D%22grpc:%20error%20while%20marshaling:%20string%20field%20contains%20invalid%20UTF-8%22%0A-jsonPayload.Body%3D%22grpc:%20error%20while%20marshaling:%20string%20field%20contains%20invalid%20UTF-8%22%0Atimestamp%3D%222023-07-24T15:43:57.509461420Z%22%0AinsertId%3D%22f964g54awjrevxzi%22?project=sourcegraph-dev) or even [gigabytes of memory](https://sourcegraph.slack.com/archives/C04HCK4K3DL/p1683228335350079)**.

This PR adjusts the server-side implementation of LocalCodeIntel to chunk up its full result set into smaller chunks (so that each individual gRPC message is smaller). It does this in a few steps:

- Adopting a `chunk` utility I found in the [Gitaly](https://gitlab.com/gitlab-org/gitaly/-/blob/v16.2.0/internal/helper/chunk/chunker.go) project that can intelligently send a group of protobuf messages in smaller chunks (all ~ 1MB): sourcegraph/sourcegraph@59ba6d0
  - I made some minor tweaks to the original code here:
    - using generics for type-safety
    - supporting variadic arguments for convenience / ergonomics
    - removing some unneeded gitaly packages from the test setup
- Using the above `chunk` utility in the Symbols service to divide up the list of returned symbols into ~ 1MB batches and sending them across the result stream: sourcegraph/sourcegraph@77fc77e

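The chunking idea can be sketched as follows. This is a simplified, stdlib-only illustration, not the actual `internal/grpc/chunk` implementation: the real utility measures encoded protobuf sizes, while this sketch uses a caller-supplied size function, and all names here (`Chunker`, `chunkSymbols`, etc.) are hypothetical.

```go
package main

import "fmt"

// maxChunkBytes mirrors the ~1 MB per-message budget described above.
const maxChunkBytes = 1 * 1024 * 1024

// Chunker buffers items and flushes them via sendFunc whenever adding
// another item would push the buffered payload past maxChunkBytes.
type Chunker[T any] struct {
	sendFunc func([]T) error // sends one gRPC message containing a batch
	sizeFunc func(T) int     // stand-in for measuring encoded proto size
	buf      []T
	bufBytes int
}

func NewChunker[T any](send func([]T) error, size func(T) int) *Chunker[T] {
	return &Chunker[T]{sendFunc: send, sizeFunc: size}
}

// Send accepts variadic items for convenience and flushes the buffer
// whenever it would overflow the per-message budget.
func (c *Chunker[T]) Send(items ...T) error {
	for _, item := range items {
		size := c.sizeFunc(item)
		if c.bufBytes+size > maxChunkBytes && len(c.buf) > 0 {
			if err := c.Flush(); err != nil {
				return err
			}
		}
		c.buf = append(c.buf, item)
		c.bufBytes += size
	}
	return nil
}

// Flush sends any buffered items as a single message.
func (c *Chunker[T]) Flush() error {
	if len(c.buf) == 0 {
		return nil
	}
	err := c.sendFunc(c.buf)
	c.buf, c.bufBytes = nil, 0
	return err
}

// chunkSymbols batches symbol strings by size and returns the batches that
// would have been sent as individual gRPC messages.
func chunkSymbols(symbols []string) [][]string {
	var batches [][]string
	c := NewChunker(
		func(batch []string) error {
			batches = append(batches, append([]string(nil), batch...))
			return nil
		},
		func(s string) int { return len(s) },
	)
	_ = c.Send(symbols...)
	_ = c.Flush()
	return batches
}

func main() {
	// Three ~512 KB "symbols": the first two fit in one ~1 MB message,
	// the third forces a second message.
	big := string(make([]byte, 512*1024))
	fmt.Println(len(chunkSymbols([]string{big, big, big}))) // 2
}
```

The caller sends symbols through `Send` as they become available and calls `Flush` once at the end, so no single message exceeds the budget.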
Note that this PR doesn't "fix" the real issues with the underlying application itself. Notably:

  • The server is still calculating the entire result set in one shot (requiring it to hold all the symbols in a contiguous chunk of memory), as opposed to sending out incremental progress.
  • Is it necessary for the server to return thousands upon thousands of symbols, or should it support some sort of "limit" parameter?
  • The custom symbols client API still returns all these symbols in one giant slice (by joining all the received messages), so it still allocates a lot of memory, same as before, rather than exposing an API where the result can be consumed incrementally.

However, we still aren't any worse off than we were before (the REST implementation still has the same memory allocation problems and sends the results in one giant JSON message). This PR does "work around" the gRPC-specific message size complaints while providing a building block to improve this in the future.
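By way of illustration, one such future improvement could be an incremental client API that lets callers process each received batch as it arrives instead of joining everything into one slice. A stdlib-only sketch, with all names hypothetical (this is not the actual symbols client API):

```go
package main

import "fmt"

// recvFunc models a gRPC stream's Recv: it returns the next batch of
// symbols, or done == true once the stream is exhausted.
type recvFunc func() (batch []string, done bool)

// forEachSymbol consumes the stream incrementally, so peak memory is one
// ~1 MB batch rather than the entire result set.
func forEachSymbol(recv recvFunc, fn func(symbol string)) {
	for {
		batch, done := recv()
		if done {
			return
		}
		for _, s := range batch {
			fn(s)
		}
	}
}

// streamFromChunks builds a fake stream over pre-chunked batches, standing
// in for the messages a chunking server would send.
func streamFromChunks(chunks [][]string) recvFunc {
	i := 0
	return func() ([]string, bool) {
		if i >= len(chunks) {
			return nil, true
		}
		batch := chunks[i]
		i++
		return batch, false
	}
}

func main() {
	recv := streamFromChunks([][]string{{"Foo", "Bar"}, {"Baz"}})
	count := 0
	forEachSymbol(recv, func(string) { count++ })
	fmt.Println(count) // 3
}
```

With this shape, callers that only need to scan or filter symbols never hold the full result set in memory, while a compatibility helper could still collect everything into a slice for existing call sites.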

cc @sourcegraph/code-intel ^

Test plan

  1. CI
  2. Manual testing:
     - Using the [repository and file mentioned in this log message](https://console.cloud.google.com/logs/query;cursorTimestamp=2023-07-24T15:43:57.509461420Z;duration=P7D;query=resource.type%3D%22k8s_container%22%0Aresource.labels.project_id%3D%22sourcegraph-dev%22%0Aresource.labels.location%3D%22us-central1-f%22%0Aresource.labels.cluster_name%3D%22cloud%22%0Aresource.labels.namespace_name%3D%22prod%22%0AjsonPayload.InstrumentationScope%3D~%22gRPC.internal.error.reporter%22%0A-jsonPayload.message%3D%22grpc:%20error%20while%20marshaling:%20string%20field%20contains%20invalid%20UTF-8%22%0A-jsonPayload.Body%3D%22grpc:%20error%20while%20marshaling:%20string%20field%20contains%20invalid%20UTF-8%22%0Atimestamp%3D%222023-07-24T15:43:57.509461420Z%22%0AinsertId%3D%22f964g54awjrevxzi%22?project=sourcegraph-dev), run a local Sourcegraph instance with the following diff applied (to simulate the absence of the chunker by setting the maximum message size to 1 TB):

       diff --git a/internal/grpc/chunk/chunker.go b/internal/grpc/chunk/chunker.go
       index 947b721368..c954854fc6 100644
       --- a/internal/grpc/chunk/chunker.go
       +++ b/internal/grpc/chunk/chunker.go
       @@ -42,7 +42,7 @@ type Chunker[T Message] struct {
        }

        // maxMessageSize is the maximum size per protobuf message
       -const maxMessageSize = 1 * 1024 * 1024
       +const maxMessageSize = 1 * 1024 * 1024 * 1024 * 1024

     - Navigate to the above-mentioned repository and file, and hover over any token - see the expected log message:

       [       frontend] ERROR symbolsConnectionCache.gRPC.internal.error.reporter.streamingMethod.postMessageReceive internalerrs/logging.go:239 grpc: received message larger than max (708588568 vs. 94371840) {"grpcService": "symbols.v1.SymbolsService", "grpcMethod": "LocalCodeIntel", "grpcCode": "ResourceExhausted", "initialRequestJSON": "[omitted]"}

     - Revert the above diff (which should restore the 1MB chunking behavior), hard-refresh the blob page, and hover over the same token. Eventually (after ~30s on my machine) the request completes and a hover appears - no error occurs.

@cla-bot cla-bot bot added the cla-signed label Jul 24, 2023
@ggilmore ggilmore force-pushed the grpc-message-size-local-code-intel branch 4 times, most recently from 67cdb83 to c6f8bef Compare July 25, 2023 16:12
@ggilmore ggilmore force-pushed the grpc-message-size-local-code-intel branch 2 times, most recently from 27eb72d to ef82944 Compare July 25, 2023 16:16
@ggilmore ggilmore force-pushed the grpc-message-size-local-code-intel branch from ef82944 to 77fc77e Compare July 25, 2023 16:33
@ggilmore ggilmore marked this pull request as ready for review July 25, 2023 16:58
sourcegraph-bot (Contributor) commented Jul 25, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff af19739...6aafcca.

Notify File(s)
@keegancsmith cmd/symbols/internal/api/BUILD.bazel
cmd/symbols/internal/api/handler_cgo.go
cmd/symbols/internal/api/handler_nocgo.go
internal/symbols/client.go
internal/symbols/client_test.go
internal/symbols/v1/symbols.pb.go
internal/symbols/v1/symbols.proto
internal/symbols/v1/symbols_grpc.pb.go

@ggilmore ggilmore requested review from a team, camdencheek and mucles July 25, 2023 17:38
camdencheek (Member) left a comment


Good stuff!

Comment on lines +100 to +104
type localCodeIntelSender struct {
sendFunc func(symbols []*proto.LocalCodeIntelResponse_Symbol) error

buf []*proto.LocalCodeIntelResponse_Symbol
}
camdencheek (Member):

nit: seems like this will be the pattern used for pretty much every implementor of Sender. Should we just define something like the following?

type BufferedSender[T any] struct {
    sendFunc func([]T) error
    buf      []T
}

camdencheek (Member):

or maybe even just merge the buffer into the Chunker?

ggilmore (Author):

I can see the merit of the BufferedSender. Do you think we should get rid of the Sender interface entirely and replace it with BufferedSender, or just provide BufferedSender as a "default" implementation?

camdencheek (Member):

I can't actually think of a reason we'd need the Sender interface. The slice buffer + callback func seems general enough for any situation I can come up with
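For illustration, a compiling version of that buffer-plus-callback shape might look like the following (a hypothetical sketch following the pseudocode above, not necessarily what the PR ended up shipping):

```go
package main

import "fmt"

// BufferedSender pairs a slice buffer with a send callback, as suggested
// in this thread (names follow the pseudocode above; illustrative only).
type BufferedSender[T any] struct {
	sendFunc func([]T) error
	buf      []T
}

// Append queues items without sending them.
func (s *BufferedSender[T]) Append(items ...T) {
	s.buf = append(s.buf, items...)
}

// Flush hands the buffered items to sendFunc and resets the buffer.
func (s *BufferedSender[T]) Flush() error {
	if len(s.buf) == 0 {
		return nil
	}
	err := s.sendFunc(s.buf)
	s.buf = nil
	return err
}

func main() {
	var sent []int
	b := &BufferedSender[int]{
		sendFunc: func(batch []int) error {
			sent = append(sent, batch...)
			return nil
		},
	}
	b.Append(1, 2)
	b.Append(3)
	_ = b.Flush()
	fmt.Println(sent) // [1 2 3]
}
```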


camdencheek (Member) commented Jul 25, 2023:

LGTM! Thanks!

@ggilmore ggilmore enabled auto-merge (squash) July 25, 2023 21:15
@ggilmore ggilmore merged commit 2a6c569 into main Jul 25, 2023
@ggilmore ggilmore deleted the grpc-message-size-local-code-intel branch July 25, 2023 21:24
github-actions bot pushed a commit that referenced this pull request Jul 25, 2023
… messages (#55242)

Co-authored-by: Camden Cheek <camden@ccheek.com>
(cherry picked from commit 2a6c569)
BolajiOlajide pushed a commit that referenced this pull request Jul 26, 2023
…across multiple messages (#55292)

Backport 2a6c569 from #55242
MaedahBatool pushed a commit that referenced this pull request Jul 28, 2023
… messages (#55242)

Co-authored-by: Camden Cheek <camden@ccheek.com>
camdencheek referenced this pull request Aug 2, 2023
…sages sent by servers/clients (#55495)

Follow up to https://github.com/sourcegraph/sourcegraph/pull/55209 and
https://github.com/sourcegraph/sourcegraph/pull/55242.

This PR adds interceptors that record Prometheus metrics observing:

- the individual size of each **sent** protobuf message, per server or client
- the total amount of data sent over the course of a single RPC by a server (responses) or client (requests)

This allows us to track the total amount of data returned by any of our RPCs. In some cases, this can reveal opportunities for future performance / stability improvements (Example: symbols' [LocalCodeIntel method returning ~gigabyte-sized responses that have to be held all at once in memory](https://github.com/sourcegraph/sourcegraph/pull/55242)).

This PR also provides new Grafana dashboards that track this metric for every gRPC service. See below for a screenshot of what this looks like when I run the symbols service locally.

Co-authored-by: Geoffrey Gilmore <geoffrey@sourcegraph.com>
@ggilmore ggilmore mentioned this pull request Aug 8, 2023
10 tasks
davejrt pushed a commit that referenced this pull request Aug 9, 2023
… messages (#55242)

Co-authored-by: Camden Cheek <camden@ccheek.com>