Context: more precise chunk sizing #62643

jtibshirani · 2024-05-13T23:42:39Z

Currently, when retrieving context chunks, we hardcode the number of lines to
20. Historically, we've limited chunks to 1024 characters, and we chose 20
lines to roughly mirror that.

In evals, I found that we're often returning fewer than 1024 characters. This
PR updates the context resolver to load an adaptive number of lines based on
the 1024 character limit.

Addresses #61745

Test plan

Added a new test. Also manually tested using the GraphQL console.

jtibshirani · 2024-05-13T23:44:24Z

This shows an improvement on CodeSearchNet:

| Metric | Before | After |
| --- | --- | --- |
| Recall (files) | 91/99 | 91/99 |
| Recall (chunks) | 70/99 | 74/99 |
| Average chunk overlap | 0.81 | 0.89 |

In the near future, we should be able to increase the chunk size even more, given our increased context window limits. The current choice of 1024 is somewhat arbitrary.

jtibshirani · 2024-05-13T23:46:06Z

cmd/frontend/internal/context/resolvers/context.go

Calling attention to this -- I think an ideal context API would let callers pass in a "token budget" that the backend must respect in its responses. The backend would be allowed to optimize within that budget in whatever way it wants. (Or maybe this would be "character budget" to keep it simple?)
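To make the "character budget" idea concrete, here is a minimal sketch of what respecting a caller-supplied budget could look like. The `Chunk` type and `fitToBudget` function are hypothetical names for illustration, not part of the actual context API, and the greedy skip-if-too-large strategy is only one of many ways a backend might optimize within the budget.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Chunk is a hypothetical retrieved snippet; the input slice is assumed
// to be ordered from most to least relevant.
type Chunk struct {
	Path    string
	Content string
}

// fitToBudget walks the chunks in ranked order and keeps each one whose
// rune count still fits in the remaining budget. Chunks that do not fit
// are skipped, so a smaller, lower-ranked chunk can still be included.
func fitToBudget(chunks []Chunk, budget int) []Chunk {
	var kept []Chunk
	remaining := budget
	for _, c := range chunks {
		n := utf8.RuneCountInString(c.Content)
		if n > remaining {
			continue
		}
		kept = append(kept, c)
		remaining -= n
	}
	return kept
}

func main() {
	chunks := []Chunk{
		{Path: "a.go", Content: "func A() {}"},
		{Path: "b.go", Content: "func LongerName() {}"},
		{Path: "c.go", Content: "var x = 1"},
	}
	// With a budget of 25 runes, b.go no longer fits after a.go is kept.
	for _, c := range fitToBudget(chunks, 25) {
		fmt.Println(c.Path)
	}
}
```

A token budget would follow the same shape, with the rune count swapped for a tokenizer call; counting characters avoids tying the backend to one model's tokenizer.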

keegancsmith · 2024-05-14T07:01:02Z

cmd/frontend/internal/context/resolvers/context.go

+// countLines finds the number of lines corresponding to the number of runes. We 'round up' to include a line even if
+// it pushes the chunk over the desired number of runes. This is okay since the chunk size limit is very conservative.


You can get very long lines. Should we put in an upper bound just to ensure counting up doesn't get too big?

jtibshirani · 2024-05-15

Hmm, how about I switch to rounding down? That way we stay within the "token budget", which seems desirable.

cla-bot bot added the cla-signed label May 13, 2024

github-actions bot added the team/product-platform and team/search-platform labels May 13, 2024

jtibshirani commented May 13, 2024


jtibshirani requested a review from a team May 13, 2024 23:46

jtibshirani marked this pull request as ready for review May 13, 2024 23:48

jtibshirani mentioned this pull request May 13, 2024

Context: Improve how we assemble final context for LLM #61745

Closed

Context: more precise chunk sizing

ffe3f70

jtibshirani force-pushed the jtibs/chunks branch from ee10932 to ffe3f70 Compare May 14, 2024 00:35

keegancsmith approved these changes May 14, 2024


jtibshirani added 3 commits May 15, 2024 10:17

Merge remote-tracking branch 'upstream/main' into jtibs/chunks

89aacb2

Round number of lines down

0f60843

Bazel configure

e33e468

jtibshirani merged commit 553121c into main May 15, 2024

jtibshirani deleted the jtibs/chunks branch May 15, 2024 19:52

jtibshirani merged 4 commits into main from jtibs/chunks
