This repository was archived by the owner on Sep 30, 2024. It is now read-only.

grpc: symbols: spread LocalCodeIntel response symbols across multiple messages#55242

Merged
ggilmore merged 7 commits into main from grpc-message-size-local-code-intel on Jul 25, 2023

Conversation

ggilmore (Contributor) commented Jul 24, 2023

This PR changes the LocalCodeIntel symbols gRPC method to return a stream of chunks of symbols, rather than the entire result set in a single message.


gRPC has default limits on how large individual messages can be. This is because gRPC is a message-based framework, which means that messages have to be entirely loaded into memory for both sending and receiving. As such, large allocations can negatively impact server performance and stability. The go-grpc default limit is 4MB.

The original LocalCodeIntel RPC implementation (both gRPC and REST) returned the entire set of symbols in one message. For certain repositories / files, this message can contain _thousands_ of symbols, which can add up to **[hundreds of MB](https://console.cloud.google.com/logs/query;cursorTimestamp=2023-07-24T15:43:57.509461420Z;duration=P7D;query=resource.type%3D%22k8s_container%22%0Aresource.labels.project_id%3D%22sourcegraph-dev%22%0Aresource.labels.location%3D%22us-central1-f%22%0Aresource.labels.cluster_name%3D%22cloud%22%0Aresource.labels.namespace_name%3D%22prod%22%0AjsonPayload.InstrumentationScope%3D~%22gRPC.internal.error.reporter%22%0A-jsonPayload.message%3D%22grpc:%20error%20while%20marshaling:%20string%20field%20contains%20invalid%20UTF-8%22%0A-jsonPayload.Body%3D%22grpc:%20error%20while%20marshaling:%20string%20field%20contains%20invalid%20UTF-8%22%0Atimestamp%3D%222023-07-24T15:43:57.509461420Z%22%0AinsertId%3D%22f964g54awjrevxzi%22?project=sourcegraph-dev) or even [gigabytes of memory](https://sourcegraph.slack.com/archives/C04HCK4K3DL/p1683228335350079)**.

This PR adjusts the server-side implementation of LocalCodeIntel to chunk up its full result set into smaller chunks (so that each individual gRPC message is smaller). It does this in a few steps:

- Adopting a `chunk` utility I found in the [Gitaly](https://gitlab.com/gitlab-org/gitaly/-/blob/v16.2.0/internal/helper/chunk/chunker.go) project that can intelligently send a group of protobuf messages in smaller chunks (all ~ 1MB): sourcegraph/sourcegraph@59ba6d0
  - I made some minor tweaks to the original code here:
    - using generics for type-safety
    - supporting variadic arguments for convenience / ergonomics
    - removing some unneeded gitaly packages from the test setup
- Using the above `chunk` utility in the Symbols service to divide up the list of returned symbols into ~ 1MB batches and sending them across the result stream: sourcegraph/sourcegraph@77fc77e

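The chunking idea can be sketched as follows. This is a simplified, stdlib-only illustration, not the actual `internal/grpc/chunk` implementation: the real utility measures encoded protobuf sizes, while this sketch uses a caller-supplied size function, and all names here (`Chunker`, `chunkSymbols`, etc.) are hypothetical.

```go
package main

import "fmt"

// maxChunkBytes mirrors the ~1 MB per-message budget described above.
const maxChunkBytes = 1 * 1024 * 1024

// Chunker buffers items and flushes them via sendFunc whenever adding
// another item would push the buffered payload past maxChunkBytes.
type Chunker[T any] struct {
	sendFunc func([]T) error // sends one gRPC message containing a batch
	sizeFunc func(T) int     // stand-in for measuring encoded proto size
	buf      []T
	bufBytes int
}

func NewChunker[T any](send func([]T) error, size func(T) int) *Chunker[T] {
	return &Chunker[T]{sendFunc: send, sizeFunc: size}
}

// Send accepts variadic items for convenience and flushes the buffer
// whenever it would overflow the per-message budget.
func (c *Chunker[T]) Send(items ...T) error {
	for _, item := range items {
		size := c.sizeFunc(item)
		if c.bufBytes+size > maxChunkBytes && len(c.buf) > 0 {
			if err := c.Flush(); err != nil {
				return err
			}
		}
		c.buf = append(c.buf, item)
		c.bufBytes += size
	}
	return nil
}

// Flush sends any buffered items as a single message.
func (c *Chunker[T]) Flush() error {
	if len(c.buf) == 0 {
		return nil
	}
	err := c.sendFunc(c.buf)
	c.buf, c.bufBytes = nil, 0
	return err
}

// chunkSymbols batches symbol strings by size and returns the batches that
// would have been sent as individual gRPC messages.
func chunkSymbols(symbols []string) [][]string {
	var batches [][]string
	c := NewChunker(
		func(batch []string) error {
			batches = append(batches, append([]string(nil), batch...))
			return nil
		},
		func(s string) int { return len(s) },
	)
	_ = c.Send(symbols...)
	_ = c.Flush()
	return batches
}

func main() {
	// Three ~512 KB "symbols": the first two fit in one ~1 MB message,
	// the third forces a second message.
	big := string(make([]byte, 512*1024))
	fmt.Println(len(chunkSymbols([]string{big, big, big}))) // 2
}
```

The caller sends symbols through `Send` as they become available and calls `Flush` once at the end, so no single message exceeds the budget.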
Note that this PR doesn't "fix" the real issues with the underlying application itself. Notably:

  • The server is still calculating the entire result set in one shot (requiring it to hold all the symbols in a contiguous chunk of memory), as opposed to sending out incremental progress.
  • Is it necessary for the server to return thousands upon thousands of symbols, or should it support some sort of "limit" parameter?
  • The custom symbols client API still returns all these symbols in one giant slice (by joining all the received messages), so it still allocates a lot of memory, same as before, rather than exposing an API where the result can be consumed incrementally.

However, we still aren't any worse off than we were before (the REST implementation still has the same memory allocation problems and sends the results in one giant JSON message). This PR does "work around" the gRPC-specific message size complaints while providing a building block to improve this in the future.
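By way of illustration, one such future improvement could be an incremental client API that lets callers process each received batch as it arrives instead of joining everything into one slice. A stdlib-only sketch, with all names hypothetical (this is not the actual symbols client API):

```go
package main

import "fmt"

// recvFunc models a gRPC stream's Recv: it returns the next batch of
// symbols, or done == true once the stream is exhausted.
type recvFunc func() (batch []string, done bool)

// forEachSymbol consumes the stream incrementally, so peak memory is one
// ~1 MB batch rather than the entire result set.
func forEachSymbol(recv recvFunc, fn func(symbol string)) {
	for {
		batch, done := recv()
		if done {
			return
		}
		for _, s := range batch {
			fn(s)
		}
	}
}

// streamFromChunks builds a fake stream over pre-chunked batches, standing
// in for the messages a chunking server would send.
func streamFromChunks(chunks [][]string) recvFunc {
	i := 0
	return func() ([]string, bool) {
		if i >= len(chunks) {
			return nil, true
		}
		batch := chunks[i]
		i++
		return batch, false
	}
}

func main() {
	recv := streamFromChunks([][]string{{"Foo", "Bar"}, {"Baz"}})
	count := 0
	forEachSymbol(recv, func(string) { count++ })
	fmt.Println(count) // 3
}
```

With this shape, callers that only need to scan or filter symbols never hold the full result set in memory, while a compatibility helper could still collect everything into a slice for existing call sites.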

cc @sourcegraph/code-intel ^

Test plan

  1. CI
  2. Manual testing:
     - Using the [repository and file mentioned in this log message](https://console.cloud.google.com/logs/query;cursorTimestamp=2023-07-24T15:43:57.509461420Z;duration=P7D;query=resource.type%3D%22k8s_container%22%0Aresource.labels.project_id%3D%22sourcegraph-dev%22%0Aresource.labels.location%3D%22us-central1-f%22%0Aresource.labels.cluster_name%3D%22cloud%22%0Aresource.labels.namespace_name%3D%22prod%22%0AjsonPayload.InstrumentationScope%3D~%22gRPC.internal.error.reporter%22%0A-jsonPayload.message%3D%22grpc:%20error%20while%20marshaling:%20string%20field%20contains%20invalid%20UTF-8%22%0A-jsonPayload.Body%3D%22grpc:%20error%20while%20marshaling:%20string%20field%20contains%20invalid%20UTF-8%22%0Atimestamp%3D%222023-07-24T15:43:57.509461420Z%22%0AinsertId%3D%22f964g54awjrevxzi%22?project=sourcegraph-dev), run a local Sourcegraph instance with the following diff applied (to simulate the absence of the chunker by setting the maximum message size to 1 TB):

       diff --git a/internal/grpc/chunk/chunker.go b/internal/grpc/chunk/chunker.go
       index 947b721368..c954854fc6 100644
       --- a/internal/grpc/chunk/chunker.go
       +++ b/internal/grpc/chunk/chunker.go
       @@ -42,7 +42,7 @@ type Chunker[T Message] struct {
        }

        // maxMessageSize is the maximum size per protobuf message
       -const maxMessageSize = 1 * 1024 * 1024
       +const maxMessageSize = 1 * 1024 * 1024 * 1024 * 1024

     - Navigate to the above-mentioned repository and file, and hover over any token - see the expected log message:

       [       frontend] ERROR symbolsConnectionCache.gRPC.internal.error.reporter.streamingMethod.postMessageReceive internalerrs/logging.go:239 grpc: received message larger than max (708588568 vs. 94371840) {"grpcService": "symbols.v1.SymbolsService", "grpcMethod": "LocalCodeIntel", "grpcCode": "ResourceExhausted", "initialRequestJSON": "[omitted]"}

     - Revert the above diff (which should restore the 1MB chunking behavior), hard-refresh the blob page, and hover over the same token. Eventually (after ~30s on my machine) the request completes and a hover appears - no error occurs.

@cla-bot cla-bot bot added the cla-signed label Jul 24, 2023
@ggilmore ggilmore force-pushed the grpc-message-size-local-code-intel branch 4 times, most recently from 67cdb83 to c6f8bef Compare July 25, 2023 16:12
@ggilmore ggilmore force-pushed the grpc-message-size-local-code-intel branch 2 times, most recently from 27eb72d to ef82944 Compare July 25, 2023 16:16
@ggilmore ggilmore force-pushed the grpc-message-size-local-code-intel branch from ef82944 to 77fc77e Compare July 25, 2023 16:33
@ggilmore ggilmore marked this pull request as ready for review July 25, 2023 16:58
sourcegraph-bot (Contributor) commented Jul 25, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff af19739...6aafcca.

Notify File(s)
@keegancsmith cmd/symbols/internal/api/BUILD.bazel
cmd/symbols/internal/api/handler_cgo.go
cmd/symbols/internal/api/handler_nocgo.go
internal/symbols/client.go
internal/symbols/client_test.go
internal/symbols/v1/symbols.pb.go
internal/symbols/v1/symbols.proto
internal/symbols/v1/symbols_grpc.pb.go

@ggilmore ggilmore requested review from a team, camdencheek and mucles July 25, 2023 17:38
camdencheek (Member) left a comment


Good stuff!

Comment on lines +100 to +104
type localCodeIntelSender struct {
sendFunc func(symbols []*proto.LocalCodeIntelResponse_Symbol) error

buf []*proto.LocalCodeIntelResponse_Symbol
}
camdencheek (Member):

nit: seems like this will be the pattern used for pretty much every implementor of Sender. Should we just define something like the following?

type BufferedSender[T any] struct {
    sendFunc func([]T) error
    buf      []T
}

camdencheek (Member):

or maybe even just merge the buffer into the Chunker?

ggilmore (Author):

I can see the merit of the BufferedSender. Do you think we should get rid of the Sender interface entirely and replace it with BufferedSender, or just provide BufferedSender as a "default" implementation?

camdencheek (Member):

I can't actually think of a reason we'd need the Sender interface. The slice buffer + callback func seems general enough for any situation I can come up with
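For illustration, a compiling version of that buffer-plus-callback shape might look like the following (a hypothetical sketch following the pseudocode above, not necessarily what the PR ended up shipping):

```go
package main

import "fmt"

// BufferedSender pairs a slice buffer with a send callback, as suggested
// in this thread (names follow the pseudocode above; illustrative only).
type BufferedSender[T any] struct {
	sendFunc func([]T) error
	buf      []T
}

// Append queues items without sending them.
func (s *BufferedSender[T]) Append(items ...T) {
	s.buf = append(s.buf, items...)
}

// Flush hands the buffered items to sendFunc and resets the buffer.
func (s *BufferedSender[T]) Flush() error {
	if len(s.buf) == 0 {
		return nil
	}
	err := s.sendFunc(s.buf)
	s.buf = nil
	return err
}

func main() {
	var sent []int
	b := &BufferedSender[int]{
		sendFunc: func(batch []int) error {
			sent = append(sent, batch...)
			return nil
		},
	}
	b.Append(1, 2)
	b.Append(3)
	_ = b.Flush()
	fmt.Println(sent) // [1 2 3]
}
```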


camdencheek (Member) commented Jul 25, 2023:

LGTM! Thanks!

@ggilmore ggilmore enabled auto-merge (squash) July 25, 2023 21:15
@ggilmore ggilmore merged commit 2a6c569 into main Jul 25, 2023
@ggilmore ggilmore deleted the grpc-message-size-local-code-intel branch July 25, 2023 21:24
github-actions bot pushed a commit that referenced this pull request Jul 25, 2023
… messages (#55242)

Co-authored-by: Camden Cheek <camden@ccheek.com>
(cherry picked from commit 2a6c569)
BolajiOlajide pushed a commit that referenced this pull request Jul 26, 2023
…across multiple messages (#55292)

Backport 2a6c569 from #55242
MaedahBatool pushed a commit that referenced this pull request Jul 28, 2023
… messages (#55242)

Co-authored-by: Camden Cheek <camden@ccheek.com>
camdencheek referenced this pull request Aug 2, 2023
…sages sent by servers/clients (#55495)

Follow up to https://github.com/sourcegraph/sourcegraph/pull/55209 and
https://github.com/sourcegraph/sourcegraph/pull/55242.

This PR adds interceptors that record Prometheus metrics observing:

- the individual size of each **sent** protobuf message, per server or client
- the total amount of data sent over the course of a single RPC by a server (responses) or client (requests)

This allows us to track the total amount of data returned by any of our RPCs. In some cases, this can reveal opportunities for future performance / stability improvements (Example: symbols' [LocalCodeIntel method returning ~gigabyte-sized responses that have to be held all at once in memory](https://github.com/sourcegraph/sourcegraph/pull/55242)).

This PR also provides new Grafana dashboards that track this metric for every gRPC service. See below for a screenshot of what this looks like when I run the symbols service locally.

Co-authored-by: Geoffrey Gilmore <geoffrey@sourcegraph.com>
@ggilmore ggilmore mentioned this pull request Aug 8, 2023
10 tasks
davejrt pushed a commit that referenced this pull request Aug 9, 2023
… messages (#55242)

Co-authored-by: Camden Cheek <camden@ccheek.com>