Allow embeddings job to exclude failed files from the index #55180
Conversation
Codenotify: Notifying subscribers in CODENOTIFY files for diff 3462676...54281cc.
The high-level requirements I've got in mind:
Solutions:
The first iteration of this PR is solution 2. Trade-offs:
**stefanhengl** left a comment
Review is still ongoing. I have to spend more time on embed.go, but have to step afk so submitting my current comments.
```go
// caller expects one vector per text input so append zero values
placeholder := make([]float32, response.ModelDimensions)
embeddings = append(embeddings, placeholder...)
```
You already allocated `embeddings` above, so `embeddings` has the right capacity. I believe you can just reslice instead of allocating the placeholder, i.e.

```go
embeddings = embeddings[:len(embeddings)+response.ModelDimensions]
```
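For illustration, a minimal standalone sketch of the reslice trick (the helper name `appendZeroVector` is hypothetical). This relies on the backing array beyond the slice's length still being zeroed, which holds when the slice came fresh from `make` and was only ever appended to:

```go
package main

import "fmt"

// appendZeroVector extends embeddings by dims zero values by reslicing
// within the slice's existing capacity, avoiding a temporary allocation.
// Only safe if cap(embeddings) >= len(embeddings)+dims and the backing
// array beyond len is still zero (true for a freshly made slice).
func appendZeroVector(embeddings []float32, dims int) []float32 {
	return embeddings[:len(embeddings)+dims]
}

func main() {
	dims := 3
	// pre-allocate capacity for two vectors, fill only the first
	embeddings := make([]float32, 0, 2*dims)
	embeddings = append(embeddings, 0.1, 0.2, 0.3)

	embeddings = appendZeroVector(embeddings, dims)
	fmt.Println(embeddings) // [0.1 0.2 0.3 0 0 0]
}
```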
That makes sense, I can update
internal/embeddings/embed/embed.go
Outdated
```diff
 }
-codeIndexStats, err := embedFiles(ctx, codeFileNames, client, contextService, opts.FileFilters, opts.SplitOptions, readLister, opts.MaxCodeEmbeddings, insertCode, reportCodeProgress)
+codeIndexStats, err := embedFiles(ctx, codeFileNames, client, contextService, opts.FileFilters, opts.SplitOptions, readLister, opts.MaxCodeEmbeddings, opts.BatchSize, insertCode, truncateCode, reportCodeProgress, logger)
```
nit: looking at our codebase, we seem to follow the convention of putting `logger` as the second argument, right after `ctx`.
internal/embeddings/embed/embed.go
Outdated
```go
truncateIndex := func(index *embeddings.EmbeddingIndex, count int) {
```
Could this be a method of EmbeddingIndex instead? The list of arguments to embedFiles grows and grows and this one can easily be avoided if I am not mistaken.
I know this follows the same pattern as insertIndex but tbh I don't remember why we went down that road.
I believe the reason is that we do not pass an EmbeddingIndex as an argument to embedFiles() and instead just pass these function(s) like insertIndex.
> I know this follows the same pattern as insertIndex but tbh I don't remember why we went down that road.
I'm curious about the motivation for this as well since it seems we could just pass the index to embedFiles(). Perhaps we just don't want that method to need to understand how an EmbeddingIndex is organized. I'll consider adding insert() and truncate() to EmbeddingIndex
> I'm curious about the motivation for this as well since it seems we could just pass the index to embedFiles().
I pulled out insertIndex so that we could make it generic over our vector storage. I didn't want embedFiles to need to worry about the difference between qdrant and our custom storage.
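To illustrate the design being described, here is a heavily simplified, hypothetical sketch (the real `embedFiles` and `insertIndex` in embed.go take many more parameters): passing an insert function keeps the embedding loop ignorant of whether vectors land in qdrant or the custom storage.

```go
package main

import "fmt"

// insertFn abstracts the vector store so the embedding loop need not
// know which backend it is writing to.
type insertFn func(fileName string, vec []float32) error

// embedFiles is a hypothetical, stripped-down stand-in for the real
// function: it embeds each file and hands the vector to the callback.
func embedFiles(files []string, insert insertFn) error {
	for _, f := range files {
		vec := []float32{0.1, 0.2} // stand-in for a real embedding API call
		if err := insert(f, vec); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// A map-backed "store" standing in for one storage backend.
	stored := map[string][]float32{}
	insert := func(fileName string, vec []float32) error {
		stored[fileName] = vec
		return nil
	}
	if err := embedFiles([]string{"a.go", "b.go"}, insert); err != nil {
		panic(err)
	}
	fmt.Println(len(stored)) // 2
}
```

Swapping backends then means swapping the closure, not touching `embedFiles`.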
**camdencheek** left a comment
Thanks for tackling this! It'll be great to have embeddings jobs that aren't quite so fragile.
```go
dimensionality := len(response.Data[0].Embedding)
embeddings := make([]float32, 0, len(response.Data)*dimensionality)
failed := make([]int, 0, len(response.Data))
```
nit: no need to pre-allocate failed here. In almost all cases, this will be empty.
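To illustrate why the pre-allocation buys nothing here: a nil slice in Go is a valid empty slice, and `append` only allocates when the first element is actually added (standalone sketch with hypothetical per-chunk flags):

```go
package main

import "fmt"

func main() {
	// A nil slice appends fine; allocation happens lazily on first
	// append, so nothing is wasted in the common all-success case.
	var failed []int

	results := []bool{true, false, true} // hypothetical per-chunk success flags
	for i, ok := range results {
		if !ok {
			failed = append(failed, i)
		}
	}
	fmt.Println(failed, len(failed)) // [1] 1
}
```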
internal/embeddings/embed/embed.go
Outdated
```go
if err := insert(metadata, batchEmbeddings); err != nil {
	return err
}
// files with partial failures need to be excluded from the index entirely
```
> files with partial failures need to be excluded from the index entirely
I'm gonna challenge this assumption. Mostly because this forces us to add a lot of complexity to an already complex piece of code.
First, this should happen pretty rarely. OpenAI seems to give us null responses to only a very small percentage of chunks. And of that small percentage, retrying seems to fix at least some.
Second, if we do get a failure on a chunk, having embeddings from the rest of the chunks in the file is still useful, and probably even preferable.
So, alternatively, what if we just logged a warning that one of our chunks failed, added a "failed" counter to the stats, and called it a day? In an ideal world, we'll be able to move off OpenAI embeddings, so we'll no longer face this problem anyways, and this is quite a bit of code that becomes cruft if that happens.
Yeah, after implementing this to completion, the scope and complexity are a lot. I also mentioned in my earlier comment that the complexity narrows considerably if we let partially indexed files remain.
I'd like to have a way of resolving the missing chunks without a user manually scheduling a forced embedding job to reindex from scratch, and I have some ideas, but that's a different effort. This PR can narrow down focus to letting the job proceed and providing an index sooner, albeit with potential for missing code or text embeddings.
If @stefanhengl has any thoughts then let me know, otherwise I can start reducing this down to a smaller PR for the initial enhancement.
Reducing the scope seems like a good idea.
internal/conf/computed.go
Outdated
```go
// The default value for ExcludeFileOnError is false.
if embeddingsConfig.ExcludeFileOnError == nil {
	embeddingsConfig.ExcludeFileOnError = pointers.Ptr(false)
}
```
I think we can default to true, or even make this not configurable. It's painful to have to kick off another embeddings job with a new configuration because of a failure (especially when they take many hours or days), so IMO we should prefer to be as "self-healing" here as possible.
Yeah I'm open to either value as default
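For reference, a small standalone sketch of the pointer-defaulting pattern with the default flipped to true, as suggested (`ptr` and `withDefault` are hypothetical stand-ins for `pointers.Ptr` and the inline nil check):

```go
package main

import "fmt"

// ptr is a tiny helper mirroring pointers.Ptr (assumed name).
func ptr[T any](v T) *T { return &v }

// withDefault returns the configured value if set, otherwise the given
// default. Defaulting to true makes jobs "self-healing" without
// requiring a site-config change after a failure.
func withDefault(v *bool, def bool) *bool {
	if v == nil {
		return ptr(def)
	}
	return v
}

func main() {
	var excludeFileOnError *bool // unset in site config
	excludeFileOnError = withDefault(excludeFileOnError, true)
	fmt.Println(*excludeFileOnError) // true
}
```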
…edge excluded chunks
…ut string fails to generate embedding.
@camdencheek I've updated with smaller scope in
One thing I left out from the initial implementation is logging the file names that fail. I figured we could add it later. Given today's post, it sounds like there are problematic files that fail due to their contents and that we want to add to the filter list. In that case we might want to add that file-name logging back in.
…or. Log failed chunk file names.
The backport to `5.1` … To backport manually, run these commands in your terminal:

```shell
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-5.1 5.1
# Navigate to the new working tree
cd .worktrees/backport-5.1
# Create a new branch
git switch --create backport-55180-to-5.1
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 b8e31fde275d155aed5fe4cea30f3514bd816402
# Push it to GitHub
git push --set-upstream origin backport-55180-to-5.1
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-5.1
```

Then, create a pull request where the …
When a text input is submitted for generating embeddings, the response may be null. If we attempt retries and still cannot generate embeddings for the input text, we return an error, which fails the entire embed repo job. [Slack thread](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688676751106069) [Issue](https://github.com/sourcegraph/sourcegraph/issues/55469)

This PR introduces a configuration `ExcludeChunkOnError`. When set to true, an embed repo job will proceed when these generate-embeddings errors occur; however, the file that produced the failing input text is excluded from the index entirely, to avoid partially indexing it. I'll add more details on the first iteration of this solution and the trade-offs in a separate comment.

Test plan: Embed test cases added

(cherry picked from commit b8e31fd)
…anch (#55528)

Update exclude chunks migration metadata to align with the 5.1 release branch. This is to support backporting [PR 55180](https://github.com/sourcegraph/sourcegraph/pull/55180) to 5.1. Parent `1687792857` exists for both 5.1 and main. Alignment refers to following the migration steps [here](https://handbook.sourcegraph.com/departments/engineering/dev/tools/backport/#prs-with-migration-changes).

Test plan: n/a
…index (#55530)

Related PRs on `main`:
- [55180](https://github.com/sourcegraph/sourcegraph/pull/55180): implements the new feature
- [55528](https://github.com/sourcegraph/sourcegraph/pull/55528): updates migration metadata to use leaves that are present on 5.1 instead of leaves only available on main

Steps for this PR:
1. Cherry-picked the [55180](https://github.com/sourcegraph/sourcegraph/pull/55180) commit
2. Updated `metadata.yaml` with 5.1's resulting value from running `sg migration leaves`
3. Ran `sg generate` to confirm no other changes were required

Description: same as #55180 above.

Test plan: new tests for embed.go and embedding clients