
default to eager runeDocSection #503

Merged: keegancsmith merged 1 commit into main from k/lazy-default on Jan 9, 2023

Conversation

@keegancsmith
Member

We have been running with eager runDocSection decoding on sourcegraph.com this week and have seen improvements to both tail latency and average latency. We believe the lazy decoding was unnecessary: global symbol searches are so common that we were constantly decoding doc section data anyway. See #312 for more context and below for details on perf.

This PR switches the feature flag so that lazy section decoding is opt-in. Additionally we remove the misguided warning, since it would trigger whenever symbols were disabled in a shard. I confirmed that an empty []byte is fine by reading old code (which predates lazy decoding) and by experimenting.
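To make the eager-vs-lazy trade-off concrete, here is a minimal sketch of the two decoding strategies the flag switches between. The types and encoding here (docSections, little-endian uint32 offsets) are hypothetical illustrations, not zoekt's actual data layout.

```go
// Sketch of eager vs. lazy doc section decoding. All names are hypothetical.
package main

import (
	"encoding/binary"
	"fmt"
	"sync"
)

type docSections struct {
	raw     []byte
	once    sync.Once
	decoded []uint32
}

// decode is safe to call from many goroutines; the work happens once.
func (d *docSections) decode() []uint32 {
	d.once.Do(func() {
		d.decoded = make([]uint32, 0, len(d.raw)/4)
		for i := 0; i+4 <= len(d.raw); i += 4 {
			d.decoded = append(d.decoded, binary.LittleEndian.Uint32(d.raw[i:]))
		}
		d.raw = nil // the raw bytes can be collected once decoded
	})
	return d.decoded
}

func newDocSections(raw []byte, lazy bool) *docSections {
	d := &docSections{raw: raw}
	if !lazy {
		d.decode() // eager (the new default): pay the cost at shard load time
	}
	return d
}

func main() {
	d := newDocSections([]byte{1, 0, 0, 0, 2, 0, 0, 0}, false)
	fmt.Println(d.decode()) // [1 2]
}
```

With lazy decoding the first query against a shard pays the decode cost; with eager decoding it is paid once at load time, which is the better trade when nearly every query touches doc sections.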

There is a chance that on a quiet instance we suddenly have a lot more RAM sitting in the heap that never gets reclaimed, since the decoded data stays alive. For such instances lazy decoding may in fact be better. As such, when this PR lands in Sourcegraph the changelog should document this, in case we see an increase in OOMs on low-request-volume instances.

Perf monitoring

First up: time to first result from our continuous perf monitoring. This shows that nearly all of our queries got faster compared with a week ago.

histogram_quantile(0.5, sum by (query_name, le)(
    rate(search_blitz_first_result_seconds_bucket{query_name=~"^(literal|mono|regex)_.*",query_name!="literal_repo_excluded_scope",query_name!~".*(structural|_rev_|diff|symbol|commit).*"}[1h])))
-
histogram_quantile(0.5, sum by (query_name, le)(
    rate(search_blitz_first_result_seconds_bucket{query_name=~"^(literal|mono|regex)_.*",query_name!="literal_repo_excluded_scope",query_name!~".*(structural|_rev_|diff|symbol|commit).*"}[1h] offset 7d)))

[image: time-to-first-result delta vs. 7 days ago]

This image shows how much less we are allocating as time goes on.

[image: allocation rate over time]

The heap profiler, though, does say we have increased memory use by 3.4 GB averaged over the cluster. This is the main risk of the change. But what was happening before is that a global symbol query would allocate that 3.4 GB just for the request and then make the GC work very hard.

Below are some numbers from profiles averaged over the last 12 hours, compared to a week ago at the same time. The improvements are more dramatic at higher percentiles than at averages, i.e. our tail latencies have likely improved substantially, largely because the GC runs far less rather than just because of reduced IO.

runtime.gcBgMarkWorker (/usr/local/go/src/runtime/mgc.go)
  total: 439.44 ms vs. 1.06 s (-623.96 ms), 3.34% vs. 7.52% (-4.17%)
  self:  0 vs. 80 µs (-80 µs), 0% vs. 0.001% (-0.001%)

github.com/sourcegraph/zoekt.(*indexData).Search (/go/src/github.com/sourcegraph/zoekt/eval.go)
  total: 10.67 s vs. 11.57 s (-894.44 ms), 81% vs. 82% (-0.608%)
  self:  54.24 ms vs. 46.4 ms (+7.84 ms), 0.412% vs. 0.328% (+0.084%)

@keegancsmith keegancsmith requested review from a team and camdencheek December 15, 2022 13:26
@keegancsmith keegancsmith merged commit 6d5ed59 into main Jan 9, 2023
@keegancsmith keegancsmith deleted the k/lazy-default branch January 9, 2023 13:11
keegancsmith added a commit to sourcegraph/sourcegraph-public-snapshot that referenced this pull request Jan 9, 2023
We have been running with eager runDocSection decoding on
sourcegraph.com for a month and have seen improvements to tail latency
and average latency. We believe the lazy decoding was unnecessary since
we so often do global symbol searches that we would constantly be
decoding doc section data all the time.

There is a chance that on a quiet instance we suddenly have a lot more
RAM sitting in the heap that doesn't get claimed back since it stays
alive. In that case the lazy decoding may in fact be better for them. As
such we document how to disable it in the CHANGELOG.

For more details see the PR in zoekt
sourcegraph/zoekt#503

Test Plan: tested already on sourcegraph.com.
keegancsmith added a commit to sourcegraph/sourcegraph-public-snapshot that referenced this pull request Jan 9, 2023
peterguy pushed a commit that referenced this pull request Jan 10, 2023
