This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Context: more precise chunk sizing #62643

Merged: jtibshirani merged 4 commits into main from jtibs/chunks on May 15, 2024
Conversation

jtibshirani (Contributor) commented May 13, 2024

Currently, when retrieving context chunks, we hardcode the number of lines to
20. Historically, we've limited chunks to 1024 characters, and we chose 20
lines to roughly mirror that.

In evals, I found that we're often returning fewer than 1024 characters. This
PR updates the context resolver to load an adaptive number of lines based on
the 1024 character limit.

Addresses #61745

Test plan

Added new test. Also manually tested using GraphQL console.
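
The adaptive sizing described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `adaptiveLineCount` and both constants are hypothetical names, and counting each newline as one rune is an assumption. It shows why a fixed 20 lines can underfill a 1024-character chunk when lines are short.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxChunkRunes mirrors the 1024-character limit from the PR description.
const maxChunkRunes = 1024

// fixedLines mirrors the previously hardcoded line count.
const fixedLines = 20

// adaptiveLineCount is a hypothetical helper. It returns how many lines,
// starting at index start, fit within maxRunes. Each line is charged one
// extra rune for its newline (an assumption of this sketch).
func adaptiveLineCount(lines []string, start, maxRunes int) int {
	if start < 0 || start >= len(lines) {
		return 0
	}
	used, n := 0, 0
	for _, line := range lines[start:] {
		cost := utf8.RuneCountInString(line) + 1
		if used+cost > maxRunes {
			break
		}
		used += cost
		n++
	}
	return n
}

func main() {
	// 200 short lines of 9 characters each.
	lines := make([]string, 200)
	for i := range lines {
		lines[i] = "x := 1234"
	}

	// A fixed 20-line chunk uses only 20 * (9 + 1) = 200 runes.
	fmt.Println("fixed lines:", fixedLines, "runes:", fixedLines*10)

	// The adaptive count fills the budget much more closely.
	fmt.Println("adaptive lines:", adaptiveLineCount(lines, 0, maxChunkRunes))
}
```

With these inputs the fixed chunk covers 200 runes, while the adaptive count returns 102 lines (1020 runes), which is much closer to the limit.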

@cla-bot cla-bot bot added the cla-signed label May 13, 2024
@github-actions github-actions bot added the team/product-platform and team/search-platform (issues owned by the search platform team) labels May 13, 2024
jtibshirani (Contributor, Author) commented May 13, 2024

This shows an improvement on CodeSearchNet:

| Metric | Before | After |
|---|---|---|
| Recall (files) | 91/99 | 91/99 |
| Recall (chunks) | 70/99 | 74/99 |
| Average chunk overlap | 0.81 | 0.89 |

In the near future, we should be able to increase the chunk size even more, given our increased context window limits. The current choice of 1024 is somewhat arbitrary.

jtibshirani (Contributor, Author) commented May 13, 2024

Calling attention to this -- I think an ideal context API would let callers pass in a "token budget" that the backend must respect in its responses. The backend would be allowed to optimize within that budget in whatever way it wants. (Or maybe this would be "character budget" to keep it simple?)

@jtibshirani jtibshirani requested a review from a team May 13, 2024 23:46
@jtibshirani jtibshirani marked this pull request as ready for review May 13, 2024 23:48
Comment on lines +107 to +108
// countLines finds the number of lines corresponding to the number of runes. We 'round up' to include a line even if
// it pushes the chunk over the desired number of runes. This is okay since the chunk size limit is very conservative.
Member

You can get very long lines. Should we put in an upper bound, just to ensure that rounding up doesn't make the chunk too big?

Contributor Author

Hmm, how about I switch to rounding down? That way we stay within the "token budget", which seems desirable.

@jtibshirani jtibshirani merged commit 553121c into main May 15, 2024
@jtibshirani jtibshirani deleted the jtibs/chunks branch May 15, 2024 19:52

Labels: cla-signed, team/product-platform, team/search-platform (issues owned by the search platform team)
