This repository was archived by the owner on Sep 30, 2024. It is now read-only.

grpc: add interceptor that tracks the sizes of all messages sent by servers/clients #55381

Merged: ggilmore merged 11 commits into main from grpc-size-monitoring (Aug 1, 2023)
Conversation

ggilmore (Contributor) commented Jul 27, 2023

Follow up to https://github.com/sourcegraph/sourcegraph/pull/55209 and https://github.com/sourcegraph/sourcegraph/pull/55242.

This PR adds interceptors that record Prometheus metrics observing:

  • the individual size of each sent protobuf message by a server or client
  • the total amount of data sent over the course of a single RPC by a server (responses) or client (requests)

This allows us to track the total amount of data returned by any of our RPCs. In some cases, this can reveal opportunities for future performance / stability improvements (example: symbols' LocalCodeIntel method returning ~gigabyte-sized responses that have to be held in memory all at once).

This PR also provides new grafana dashboards that track this metric for every gRPC service. See below for a screenshot of what this looks like when I run the symbols service locally.

[Screenshot: Grafana gRPC message-size dashboards for the symbols service, 2023-08-01]

Test plan

  • Unit tests
  • Manual tests (using local sourcegraph instance with sg start monitoring)
    • I ran a couple of searches to verify that the data was populated as I expected. (ex: context:global type:diff test and then see the relevant dashboards fill in for gitserver)
    • Using the repository and file mentioned in this log message, I navigated to that file and hovered over a token to trigger a symbols LocalCodeIntel response. I then took a screenshot of its Grafana dashboard (above). The total response size dashboards show the large total allocation (~1 GB), while the individual message sizes hover closer to 1 MB (as expected).

@cla-bot cla-bot bot added the cla-signed label Jul 27, 2023
// SplitMethodName splits a full gRPC method name (e.g. "/package.service/method") into its individual components (service, method)
//
// Copied from github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/reporter.go
func SplitMethodName(fullMethod string) (string, string) {
ggilmore (Contributor, Author):
I thought this small snippet was worth making it re-usable across different packages. I am fine with copying it though if you feel like that's preferable.
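For context, the helper being made reusable can be sketched roughly as follows (adapted from the go-grpc-middleware reporter; the "unknown" fallback for malformed method names is how that version behaves, to the best of my reading):

```go
package main

import (
	"fmt"
	"strings"
)

// SplitMethodName splits a full gRPC method name (e.g. "/package.service/method")
// into its service and method components, falling back to "unknown" when the
// input doesn't have the expected "/service/method" shape.
func SplitMethodName(fullMethod string) (string, string) {
	fullMethod = strings.TrimPrefix(fullMethod, "/") // drop the leading slash
	if i := strings.Index(fullMethod, "/"); i >= 0 {
		return fullMethod[:i], fullMethod[i+1:]
	}
	return "unknown", "unknown"
}

func main() {
	service, method := SplitMethodName("/package.service/method")
	fmt.Println(service, method) // prints "package.service method"
}
```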

var metricSingleMessageSize = promauto.NewHistogramVec(prometheus.HistogramOpts{
Name: "src_grpc_sent_individual_message_size_per_rpc_bytes",
Help: "Size of individual messages sent per RPC.",
Buckets: []float64{
ggilmore (Contributor, Author) commented Jul 28, 2023:

The bucket sizes are identical in both metrics.

I am open to other proposals to reduce the dimensionality, but this was my first thought.

Member:
to make it clear they are the same, should we pull the buckets out into a separate shared var?

ggilmore (Contributor, Author):

I thought we might want to tune one metric's buckets after review, but if you're fine with this set of buckets for both metrics - then I can pull it out into a shared variable.
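If the buckets are pulled out, the change is mechanical: both HistogramOpts reference one package-level slice. A dependency-free sketch, with hypothetical boundaries (the actual bucket values are elided from the snippet above), might look like:

```go
package main

import "fmt"

// sizeBuckets is a single bucket slice shared by both message-size histograms,
// so the two metrics can never drift apart. The boundaries here are
// hypothetical: powers of two from 1 KiB up to 1 GiB.
var sizeBuckets = func() []float64 {
	var buckets []float64
	for b := float64(1 << 10); b <= float64(1<<30); b *= 2 {
		buckets = append(buckets, b)
	}
	return buckets
}()

func main() {
	// Both HistogramOpts would set Buckets: sizeBuckets.
	fmt.Println(len(sizeBuckets), sizeBuckets[0])
}
```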


// messageSizeObserver is a utility that records Prometheus metrics that observe the size of each sent message and the
// cumulative size of all sent messages during the course of a single RPC call.
type messageSizeObserver struct {
ggilmore (Contributor, Author):

I decided to make a small struct type here to encapsulate the Prometheus observation logic since it's shared amongst 4 different interceptors.

// messageSizeObserver is a utility that records Prometheus metrics that observe the size of each sent message and the
// cumulative size of all sent messages during the course of a single RPC call.
type messageSizeObserver struct {
onSingleFunc func(messageSizeBytes uint64)
ggilmore (Contributor, Author):

I decided to use provided callbacks to make this easier to unit test. The prometheus-specific functionality comes in when you use the newMessageSizeObserver constructor.
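A callback-based observer along those lines can be sketched as below. Only the onSingleFunc field appears in the diff snippet; the remaining fields and the constructor wiring are hypothetical stand-ins for the Prometheus-backed version:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// messageSizeObserver reports each message's size via onSingleFunc and the
// RPC's cumulative size via onFinishFunc. Keeping the callbacks injectable
// leaves the struct free of any Prometheus dependency, so it is easy to
// unit test with plain closures.
type messageSizeObserver struct {
	onSingleFunc func(messageSizeBytes uint64)
	onFinishFunc func(totalSizeBytes uint64)

	totalSizeBytes uint64
	finishOnce     sync.Once
}

// Observe records the size of a single sent message.
func (o *messageSizeObserver) Observe(messageSizeBytes uint64) {
	o.onSingleFunc(messageSizeBytes)
	atomic.AddUint64(&o.totalSizeBytes, messageSizeBytes)
}

// FinishRPC records the cumulative size once the RPC has completed; the
// sync.Once guards against accidental double-counting.
func (o *messageSizeObserver) FinishRPC() {
	o.finishOnce.Do(func() {
		o.onFinishFunc(atomic.LoadUint64(&o.totalSizeBytes))
	})
}

func main() {
	var singles []uint64
	var total uint64
	o := &messageSizeObserver{
		onSingleFunc: func(n uint64) { singles = append(singles, n) },
		onFinishFunc: func(n uint64) { total = n },
	}
	o.Observe(100)
	o.Observe(250)
	o.FinishRPC()
	o.FinishRPC() // no-op: the total is recorded only once
	fmt.Println(singles, total) // prints "[100 250] 350"
}
```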

// FinishRPC records the total size of all sent messages during the course of a single RPC call.
// This function should only be called once the RPC call has completed.
func (o *messageSizeObserver) FinishRPC() {
o.finishOnce.Do(func() {
ggilmore (Contributor, Author):

Perhaps overkill, but I wrapped FinishRPC in a sync.Once to prevent accidental misuse.


err := handler(srv, wrappedStream)
if err != nil {
// Don't record the total size of the messages if there was an error sending them, since they may not have been sent.
ggilmore (Contributor, Author):

This applies to every interceptor, but I decided to avoid recording sizes if sending a request / response ever failed - since we can't be sure if the message itself was actually sent to the recipient on the wire.


// Track total response size per method

{
ggilmore (Contributor, Author) commented Jul 28, 2023:

TL;DR

  • These dashboards track the p99.9, p90, and p75 total amount of response data sent per RPC.
  • These dashboards also track the p99.9, p90, and p75 individual response message sizes.

We're collecting similar data for clients, but I didn't add a dashboard for this yet.

I want to do some refactoring (maybe renaming) of this dashboard logic before I add that (which can be done in a future change).

ggilmore (Contributor, Author):

cc @sourcegraph/code-intel: This adds a generic system that can give you insight into how much data a given gRPC call (including the problematic symbols calls) is sending (and perhaps allocating all at once).

@ggilmore ggilmore requested review from a team, camdencheek and mucles July 28, 2023 17:40
@ggilmore ggilmore marked this pull request as ready for review July 28, 2023 17:40
sourcegraph-bot (Contributor) commented Jul 28, 2023:

Codenotify: Notifying subscribers in CODENOTIFY files for diff 274fac0...b181405.

Notify File(s)
@bobheadxi monitoring/definitions/frontend.go
monitoring/definitions/git_server.go
monitoring/definitions/repo_updater.go
monitoring/definitions/searcher.go
monitoring/definitions/shared/grpc.go
monitoring/definitions/symbols.go
@slimsag monitoring/definitions/frontend.go
monitoring/definitions/git_server.go
monitoring/definitions/repo_updater.go
monitoring/definitions/searcher.go
monitoring/definitions/shared/grpc.go
monitoring/definitions/symbols.go
@sourcegraph/delivery doc/admin/observability/dashboards.md
monitoring/definitions/frontend.go
monitoring/definitions/git_server.go
monitoring/definitions/repo_updater.go
monitoring/definitions/searcher.go
monitoring/definitions/shared/grpc.go
monitoring/definitions/symbols.go


Comment on lines +223 to +224
// Note: we don't call FinishRPC() if there was a real error, since the total size of the messages sent during the
// course of the RPC call may not be accurate.
Member:

I think it's still valuable to log the size of messages received even in the event of an error. An error could just be a cancelled context. If we ignore the errored streams, I think this metric might end up misrepresenting the volume of traffic.

What do you mean by "not accurate" here?

ggilmore (Contributor, Author):

Hmm, I might need to refactor the logic for this a bit.

I was thinking that if we ever get an error back in the course of sending a message, that we don't have a definitive way of knowing whether or not anything actually went out over the wire to the recipient - all we know is that we encountered an error while doing so.

As such, including that message in either the "individual message" Prometheus calculations or the "overall request / response size" traffic seems misleading.


However, if we do get a context cancelled or deadline exceeded error, then maybe we can record the previous observations anyway?

One question that comes to mind: if we immediately cancel / exceed a deadline before the server responds, should we record that "empty" response size? That seems misleading... Maybe we should have an "error" label on the metric to split up the times for successful and failed requests?

camdencheek (Member) commented Jul 28, 2023:

Well, we'll have both client-side and server-side metrics for this, so I think it's a reasonable signal to record "bytes sent" and "bytes received", which could be different. In the case that sending a single message returns an error, I agree that it's reasonable not to log those bytes as "sent", but I think we still want to log zero so we're not skewing our results against errors.

I'm hesitant to say adding a dimension for "is error" would be useful since cancellation is often a "success" (we got enough results to end the stream early). I wouldn't want those to be missing from the metrics

ggilmore (Contributor, Author):

@camdencheek

I tweaked the total RPC size behavior to ignore errors and added a lot of tests in 9f32121 (#55381). Can you take a look at the tests to see if this is the behavior you wanted? (TestStreamingClientInterceptor in particular)

Member:

those tests look great! Thank you!


// Observe records the size of a single message.
func (o *messageSizeObserver) Observe(message proto.Message) {
s := uint64(proto.Size(message))
Member:

Could you do a quick sanity check that this doesn't marshal the message to get the size? Looks like there is a legacy code path that encodes the message to get the size. A benchmark showing that proto.Size doesn't allocate is good enough for me.

ggilmore (Contributor, Author) commented Jul 28, 2023:

@camdencheek I added a benchmark here: 0dedef1 (#55381)

BenchmarkObserverBinary-10              54629676                25.73 ns/op            0 B/op          0 allocs/op
BenchmarkObserverKeyValue-10             3871278               298.7 ns/op           192 B/op          5 allocs/op
BenchmarkObserverArticle-10              1500800               786.2 ns/op           384 B/op         10 allocs/op

Is this what you were thinking? (first time writing a benchmark)

It seems like it does allocate some depending on the input, but not on the same order of magnitude as the message size (indicating it's not fully marshaling).
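For readers unfamiliar with Go benchmarks, the shape of such a measurement looks roughly like this; here proto.Size is replaced by a trivial stub so the sketch stays dependency-free (a non-marshaling size computation should report 0 allocs/op, which is what the Binary message showed above):

```go
package main

import (
	"fmt"
	"testing"
)

func main() {
	// sizeOf stands in for proto.Size; a real benchmark would size a
	// generated protobuf message instead of a byte slice.
	sizeOf := func(b []byte) int { return len(b) }
	msg := make([]byte, 1<<20) // ~1 MB payload

	result := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs() // record allocation stats, like -benchmem
		for i := 0; i < b.N; i++ {
			_ = sizeOf(msg)
		}
	})
	fmt.Println(result.AllocsPerOp())
}
```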

Member:

Huh! Very interesting. Based on the profiles, it looks like nearly all the memory allocations come from sizing the maps. 10 allocations / message is nothing to sneeze at, but given that we don't use maps extensively in our APIs, I think this is probably within the range of acceptable. I wonder if that could be improved 🤔

(go test -bench=. -run=xxx ./internal/grpc/messagesize -memprofile=memprofile.txt && go tool pprof -http=:9096 ./memprofile.txt)

[Screenshot: pprof memory profile of the benchmark, 2023-07-28]

ggilmore (Contributor, Author):

When writing the logic for https://github.com/sourcegraph/sourcegraph/pull/55130, I did use a few breakpoints to determine that our messages types do use the first (non-legacy) codepath when calculating the size (since we also check sizes in that PR).

I assume the "marshal it out and check the length" is for messages that were generated using older versions of the protobuf go package?

Member:

I assume the "marshal it out and check the length" is for messages that were generated using older versions of the protobuf go package?

I also assume that. I mostly just wanted to make sure that my assumptions weren't unfounded 🙂 And I learned something new about gRPC along the way!



@ggilmore ggilmore force-pushed the grpc-size-monitoring branch from 3d7316d to 9e165b0 Compare August 1, 2023 18:25