backport: grpc: add automatic retry support to all services#59404

ggilmore · 2024-01-08T23:36:10Z

This is a manual backport of all the following PRs into a single commit for the 5.2 branch:

Collectively, these PRs add automatic retry support for idempotent gRPC methods to the 5.2 branch. Creating stacked PRs for the 5.2 backport would have taken ages, since each one needs manual approval, so I collected them all in one PR for convenience.

Some services had methods that only exist on the 5.2 branch - we've deleted / changed them on main. I have called these unique methods out as PR comments.

Test plan

CI for all referenced PRs is sufficient.

I ran the following search query on a local instance that had one instance of zoekt-webserver disabled (sg start --except zoekt-web-1): https://sourcegraph.test:3443/search?q=context%3Aglobal+test&patternType=standard&sm=1&trace=1&groupBy=repo

This produced a trace showing that the client did attempt to retry:

This PR adds a basic configuration for enabling retries with gRPC for certain RPC types.

The description for `defaults.RetryPolicy` is probably the most important thing to read:

```go
// RetryPolicy is the default retry policy for internal GRPC requests.
//
// The retry policy will trigger on Unavailable and ResourceExhausted status errors, and will retry up to 20 times using an
// exponential backoff policy with a maximum duration of 3s in between retries.
//
// Only Unary (1:1) and ServerStreaming (1:N) requests are retried. All other types of requests will immediately
// return an Unimplemented status error. It's up to the caller to manually retry these requests.
//
// These defaults can be overridden with the following environment variables:
//   - SRC_GRPC_RETRY_DELAY_BASE: Base retry delay duration for internal GRPC requests
//   - SRC_GRPC_RETRY_MAX_ATTEMPTS: Max retry attempts for internal GRPC requests
//   - SRC_GRPC_RETRY_MAX_DURATION: Max retry duration for internal GRPC requests
var RetryPolicy = []grpc.CallOption{
	retry.WithCodes(codes.Unavailable, codes.ResourceExhausted),
	// Together with the default options, the maximum delay will behave like this:
	// Retry# Delay
	// 1      0.05s
	// 2      0.1s
	// 3      0.2s
	// 4      0.4s
	// 5      0.8s
	// 6      1.6s
	// 7      3.0s
	// 8      3.0s
	// ...
	// 20     3.0s
	retry.WithMax(uint(internalRetryMaxAttempts)),
	retry.WithBackoff(fullJitter(internalRetryDelayBase, internalRetryMaxDuration)),
}
```

This is off by default for all services, since this logic doesn't work with all RPC types and might not be desirable as the default behavior if you don't know whether or not your method is idempotent.

The upstack PRs selectively enable this logic for appropriate RPCs (see those PRs for the exact semantics).

## Test plan

CI

…e/retry package (#59140)

The package has some issues (the retry logic for client streams is flawed). I'm adding a copy of it to our repository so that we can make future edits. See the discussion in https://github.com/sourcegraph/sourcegraph/pull/59145

## Test plan

The existing test suite from the copied project is now running in CI.

…e already received a message on the stream (#59145)

When retrying a client stream, we must ensure that we haven't received any data from the server yet before retrying. Otherwise, we can't know if the client has already consumed part of the stream. Blindly retrying the stream could produce duplicate messages or inconsistent messages.

The only safe generic behavior that we can implement is to only retry if an error occurs _before_ the server successfully sends the first message. After that, any errors that we encounter on the stream will be directly returned to the caller - no retries will occur. Only the caller knows the retry semantics that it wants.

This matches the built-in grpc retry behavior (that we can't use, see https://github.com/sourcegraph/sourcegraph/issues/51060) as documented on https://learn.microsoft.com/en-us/aspnet/core/grpc/retries?view=aspnetcore-8.0#when-retries-are-valid:

> Streaming calls
>
> Streaming calls can be used with gRPC retries, but there are important considerations when they are used together:
>
> Server streaming, bidirectional streaming: **Streaming RPCs that return multiple messages from the server won't retry after the first message has been received. Apps must add additional logic to manually re-establish server and bidirectional streaming calls.**

As a side note: The upstream library had this behavior back in 2021 (and the discussion is a bit baffling to me): grpc-ecosystem/go-grpc-middleware#313

## Test plan

This PR adds two additional tests to the test suite that ensure that:

1. The library is capable of retrying the RPC if we haven't received the first message in the stream yet
2. The library will **not automatically retry** if the first message from the server has already been received


This tweaks our forked [grpc retry](https://pkg.go.dev/github.com/grpc-ecosystem/go-grpc-middleware/retry) package to support traces in a similar manner to our internal httpcli logic.

When reviewing this PR, I'd recommend comparing this against the logic in `internal/httpcli` to see if it's to your liking: https://github.com/sourcegraph/sourcegraph/blob/023e96c2fc25ced65c528be2474b5fd1f9a34792/internal/httpcli/client.go#L582-L631

## Test plan

1. (pre-requisite) I checked out https://github.com/sourcegraph/sourcegraph/pull/59136 (`12-20-grpc_frontend_configuration_support_automatic_retries_GetConfig_is_idempotent_`), which is the PR that has retries hooked up for all services.
2. In `sg.config.yaml`, I commented out the entry that starts one of the gitserver instances when running `sg start`.

   ```patch
   diff --git a/sg.config.yaml b/sg.config.yaml
   index 312e5eb..eb0eef61193 100644
   --- a/sg.config.yaml
   +++ b/sg.config.yaml
   @@ -1106,7 +1106,7 @@ commandsets:
          - repo-updater
          - web
          - gitserver-0
   -      - gitserver-1
   +#      - gitserver-1
          - searcher
          - caddy
          - symbols
   ```

3. I then ran `sg start` and `sg start monitoring` to start jaeger.
4. I executed the following search query with tracing enabled: https://sourcegraph.test:3443/search?q=context:global+type:diff+test+timeout:2m+count:all&patternType=standard&sm=1&trace=1&groupBy=repo

This produces a trace with entries that look like the following:

<img width="1713" alt="Screenshot 2023-12-21 at 4 32 42 PM" src="https://github.com/sourcegraph/sourcegraph/assets/9022011/ec7e2c48-602c-4537-b27a-e9490105b384">

You can see the full trace here: [gh_trace.json](https://github.com/sourcegraph/sourcegraph/files/13747118/gh_trace.json)

This PR adds support for automatic retries in the gitserver grpc client.

I have gone through the gitserver protobuf file and marked all the methods I thought were idempotent (we can't inspect this using the go protobuf packages, but I thought this was nice for documentation).

I then wrapped the basic gitserver grpc client with an "automaticRetryClient" that uses the default retry policy that was defined in https://github.com/sourcegraph/sourcegraph/pull/59095. See that PR for more details.

Note that for ServerStreaming methods like Exec and Search, the retry logic will only automatically retry if we haven't received any messages back from the server yet. After we receive a single message, we can't know whether the caller has already consumed it (e.g. started consuming the `io.Reader` from ArchiveReader) or whether it can tolerate receiving old or duplicated messages. If we get an error after this point, we'll fail the RPC immediately and bubble up the underlying error to the caller. Only the caller knows how it's consuming the stream, so only the caller can decide how to proceed.

## Test plan

CI

This PR adds support for automatic retries in the symbols grpc client.

I have gone through the symbols protobuf file and marked all the methods I thought were idempotent (we can't inspect this using the go protobuf packages, but I thought this was nice for documentation).

I then wrapped the basic symbols grpc client with an "automaticRetryClient" that uses the default retry policy that was defined in https://github.com/sourcegraph/sourcegraph/pull/59095. See that PR for more details.

Note that for ServerStreaming methods like LocalCodeIntel and SymbolInfo, the retry logic will only automatically retry if we haven't received any messages back from the server yet. After we receive a single message, we can't know whether the caller has already consumed it (e.g. started aggregating the symbols from `LocalCodeIntel`) or whether it can tolerate receiving old or duplicated messages. If we get an error after this point, we'll fail the RPC immediately and bubble up the underlying error to the caller. Only the caller knows how it's consuming the stream, so only the caller can decide how to proceed.

## Test plan

CI

This PR adds support for automatic retries in the searcher grpc client.

---

I have gone through the searcher protobuf file and marked all the methods I thought were idempotent (we can't inspect this using the go protobuf packages, but I thought this was nice for documentation).

I then wrapped the basic searcher grpc client with an "automaticRetryClient" that uses the default retry policy that was defined in https://github.com/sourcegraph/sourcegraph/pull/59095. See that PR for more details.

Note that for ServerStreaming methods like Search, the retry logic will only automatically retry if we haven't received any messages back from the server yet. After we receive a single message, we can't know whether the caller has already consumed it (e.g. started presenting the data from `Search` in the WebUI) or whether it can tolerate receiving old or duplicated messages. If we get an error after this point, we'll fail the RPC immediately and bubble up the underlying error to the caller. Only the caller knows how it's consuming the stream, so only the caller can decide how to proceed.

## Test plan

CI

… (all are idempotent) (#59130)

This PR adds support for automatic retries in the repo-updater grpc client.

---

I have gone through the repo-updater protobuf file and marked all the methods I thought were idempotent (we can't inspect this using the go protobuf packages, but I thought this was nice for documentation).

I then wrapped the basic repo-updater grpc client with an "automaticRetryClient" that uses the default retry policy that was defined in https://github.com/sourcegraph/sourcegraph/pull/59095. See that PR for more details.

## Test plan

CI

This PR adds support for automatic retries in the `zoekt-webserver` grpc client that `searcher` uses.

---

I wrapped the basic zoekt-webserver grpc client with an "automaticRetryClient" that uses the default retry policy that was defined in https://github.com/sourcegraph/sourcegraph/pull/59095. See that PR for more details.

None of the methods have any side effects, so they can all be retried.

Note that for ServerStreaming methods like StreamSearch and List, the retry logic will only automatically retry if we haven't received any messages back from the server yet. After we receive a single message, we can't know whether the caller has already consumed it (e.g. started consuming the search results from `StreamSearch` and displaying them in the WebUI) or whether it can tolerate receiving old or duplicated messages. If we get an error after this point, we'll fail the RPC immediately and bubble up the underlying error to the caller. Only the caller knows how it's consuming the stream, so only the caller can decide how to proceed.

## Test plan

CI

…s idempotent) (#59136)

This PR adds support for automatic retries in the frontend configuration grpc client.

I have gone through the frontend protobuf file and marked all the methods I thought were idempotent (we can't inspect this using the go protobuf packages, but I thought this was nice for documentation).

I then wrapped the basic frontend grpc client with an "automaticRetryClient" that uses the default retry policy that was defined in https://github.com/sourcegraph/sourcegraph/pull/59095. See that PR for more details.

All the methods are idempotent, so they all get the new retry logic.

## Test plan

CI

internal/gitserver/retry.go

internal/gitserver/v1/gitserver.proto

internal/repoupdater/v1/repoupdater.proto

internal/repoupdater/retry.go

ggilmore · 2024-01-09T07:24:17Z

internal/symbols/v1/symbols.proto

+  rpc Search(SearchRequest) returns (SearchResponse) {
+    option idempotency_level = NO_SIDE_EFFECTS;
+  }
+  rpc LocalCodeIntel(LocalCodeIntelRequest) returns (stream LocalCodeIntelResponse) {


cc @sourcegraph/team-graph

Note that main has gotten rid of the LocalCodeIntel method; it only exists on the 5.2 branch. This method looked like it had no side effects, so it should be safe to retry. Please feel free to correct me on this though.

Yes, it should be OK to retry.

ggilmore · 2024-01-09T07:24:36Z

internal/symbols/retry.go

+	return a.base.Search(ctx, in, opts...)
+}
+
+func (a *automaticRetryClient) LocalCodeIntel(ctx context.Context, in *proto.LocalCodeIntelRequest, opts ...grpc.CallOption) (proto.SymbolsService_LocalCodeIntelClient, error) {


cc @sourcegraph/team-graph

Note that main has gotten rid of the LocalCodeIntel method; it only exists on the 5.2 branch. This method looked like it had no side effects, so it should be safe to retry. Please feel free to correct me on this though.

ggilmore · 2024-01-09T07:27:19Z

@sourcegraph/release-guild Do I need to pay attention to the lighthouse errors? https://buildkite.com/sourcegraph/aspect-experimental/builds/4460

sourcegraph-bot · 2024-01-09T07:27:49Z

Codenotify: Notifying subscribers in CODENOTIFY files for diff b737b47...7e9bab0.

| Notify | File(s) |
|---|---|
| @eseliger | internal/gitserver/BUILD.bazel<br>internal/gitserver/addrs.go<br>internal/gitserver/retry.go |
| @jtibshirani | internal/search/backend/BUILD.bazel<br>internal/search/backend/retry.go<br>internal/search/backend/zoekt.go<br>internal/search/searcher/BUILD.bazel<br>internal/search/searcher/client_grpc.go<br>internal/search/searcher/retry_grpc.go |
| @keegancsmith | internal/search/backend/BUILD.bazel<br>internal/search/backend/retry.go<br>internal/search/backend/zoekt.go<br>internal/search/searcher/BUILD.bazel<br>internal/search/searcher/client_grpc.go<br>internal/search/searcher/retry_grpc.go<br>internal/symbols/BUILD.bazel<br>internal/symbols/client.go<br>internal/symbols/retry.go<br>internal/symbols/v1/symbols.pb.go<br>internal/symbols/v1/symbols.proto |

eseliger

🚀

sourcegraph-bot · 2024-01-09T07:35:26Z

📖 Storybook live preview

ggilmore · 2024-01-09T21:53:47Z

Hello @sourcegraph/release-guild - Could y'all stamp this PR and merge into the patch release? Thanks!

ggilmore added 8 commits January 8, 2024 17:25

cla-bot bot added the cla-signed label Jan 8, 2024

ggilmore added 5 commits January 8, 2024 18:43

format gitserver proto

45e0e90

Merge branch '5.2' into grpc-cherry-pick-retry

2bbca65

ggilmore commented Jan 9, 2024


ggilmore requested review from a team January 9, 2024 07:25

ggilmore marked this pull request as ready for review January 9, 2024 07:26

eseliger approved these changes Jan 9, 2024


ggilmore added 2 commits January 9, 2024 11:19

Merge branch '5.2' into grpc-cherry-pick-retry

a0bbae1

changelog

7e9bab0

DaedalusG merged commit a7f9745 into 5.2 Jan 9, 2024

DaedalusG deleted the grpc-cherry-pick-retry branch January 9, 2024 22:10

varungandhi-src mentioned this pull request Jan 16, 2024

vg/matlab fix #59635

Closed

DaedalusG merged 15 commits into 5.2 from grpc-cherry-pick-retry


varungandhi-src Jan 11, 2024


5 participants
