Skip to content

Indexing: properly block on shard building#689

Merged
jtibshirani merged 1 commit intomainfrom
jtibs/index-throttle
Nov 14, 2023
Merged

Indexing: properly block on shard building#689
jtibshirani merged 1 commit intomainfrom
jtibs/index-throttle

Conversation

@jtibshirani
Copy link
Contributor

@jtibshirani jtibshirani commented Nov 11, 2023

When indexing, we build shards in parallel based on the parallelism flag.
Each shard handles ~100MB of document contents, which should limit the memory
usage to roughly 100MB * parallelism.

Looking at the size of the buffered document contents in memory profiles, we
see much higher usage than this. The issue seems to be that we continue to
buffer up documents even if all threads are busy building shards. This can be a
real problem if shards take a super long time to build (say because ctags is
slow) -- we could end up buffering a ton of content into memory at once.

This change fixes the throttling logic so we block indexing when all threads
are busy building shards.

@jtibshirani
Copy link
Contributor Author

jtibshirani commented Nov 11, 2023

I noticed this when I took a memory profile with -parallelism 4. Before the change, there is consistently over 2GB attributed to bytes.growSlice, which is what holds the document contents:

Showing top 10 nodes out of 63
      flat  flat%   sum%        cum   cum%
    2.14GB 62.83% 62.83%     2.14GB 62.83%  bytes.growSlice
    0.51GB 15.07% 77.89%     0.51GB 15.07%  github.com/go-git/go-git/v5/plumbing/format/idxfile.(*MemoryIndex).genOffsetHash
    0.30GB  8.91% 86.80%     0.30GB  8.91%  github.com/go-git/go-git/v5/plumbing/format/idxfile.readObjectNames

After, this is consistently only 700MB:

Showing top 10 nodes out of 66
      flat  flat%   sum%        cum   cum%
  762.59MB 37.60% 37.60%   762.59MB 37.60%  bytes.growSlice
  523.74MB 25.82% 63.42%   523.74MB 25.82%  github.com/go-git/go-git/v5/plumbing/format/idxfile.(*MemoryIndex).genOffsetHash
  304.30MB 15.00% 78.42%   304.30MB 15.00%  github.com/go-git/go-git/v5/plumbing/format/idxfile.readObjectNames

So this seems to fix an important issue. I need to look further into why this is still not closer to 500MB (what you'd expect from 100MB buffer + 100MB * 4 threads) -- I think there is some overhead from git-go.

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!! So now we will have at most parallelism + 1 shards in memory right? Since you can have parallelism documents having buildShard called on, and then 1 full todo slice trying to have flush called on it?

I double checked calls to flush and th euse of the b.building waitgroup. I don't see any issues with potential deadlocks/etc. LGTM!

@jtibshirani
Copy link
Contributor Author

Indeed now it will be at most parallelism + 1 shards in memory, plus whatever memory for building index structures. Thanks for double-checking the concurrency logic!

@jtibshirani jtibshirani merged commit 5e2620e into main Nov 14, 2023
@jtibshirani jtibshirani deleted the jtibs/index-throttle branch November 14, 2023 16:08
@jtibshirani
Copy link
Contributor Author

@keegancsmith @stefanhengl general note about the indexing memory fixes: I plan to let these "bake" for ~2 weeks on S2 / dot com before backporting this to a 5.2 patch. I'm being pretty conservative since it's very core code and this logic hasn't been touched in a while.

jtibshirani added a commit that referenced this pull request Nov 16, 2023
When indexing, we build shards in parallel based on the `parallelism` flag.
Each shard handles ~100MB of document contents, which should limit the memory
usage to roughly `100MB * parallelism`.

Looking at the size of the buffered document contents in memory profiles, we
see much higher usage than this. The issue seems to be that we continue to
buffer up documents even if all threads are busy building shards. This can be a
real problem if shards take a super long time to build (say because ctags is
slow) -- we could end up buffering a ton of content into memory at once.

This change fixes the throttling logic so we block indexing when all threads
are busy building shards.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants