Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

Allow embeddings job to exclude failed files from the index#55180

Merged
gl-srgr merged 32 commits intomainfrom
garyl/skip_file_during_embed
Aug 1, 2023
Merged

Allow embeddings job to exclude failed files from the index#55180
gl-srgr merged 32 commits intomainfrom
garyl/skip_file_during_embed

Conversation

@gl-srgr
Copy link
Contributor

@gl-srgr gl-srgr commented Jul 21, 2023

When a text input is submitted for generating embeddings the response may be null. If we attempt retries and still cannot generate embeddings for this input text then we return an error which calls for failing the entire embed repo job.

Slack thread

Issue

This PR introduces a configuration ExcludeChunkOnError. When set to true an embed repo job will proceed with the rest of the embed repo job when these generate embeddings errors occur. However, the file that generated the input text which received an error is excluded from the index as to avoid partially indexing the file.

I'll add more details on the first iteration of this solution and the trade offs in a separate comment.

Test plan

Embed test cases added

@cla-bot cla-bot bot added the cla-signed label Jul 21, 2023
@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented Jul 21, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff 3462676...54281cc.

Notify File(s)
@efritz enterprise/cmd/worker/internal/embeddings/repo/handler.go

@gl-srgr
Copy link
Contributor Author

gl-srgr commented Jul 21, 2023

The high level requirements I've got in mind:

  1. GetEmbeddings() only returns a single []float32 for successfully generated embeddings based on []string. Assuming we want to maintain submitting batches of input texts we need to modify this method so that we may identify which input texts failed.
  2. We don't want files to be partially indexed files (i.e. some but not all chunks generated from a given file are included in the index). We could search an index with partially indexed files but it creates new requirements for future indexing e.g. will that partially indexed file be reattempted only when it's modified by a future commit or will there be some other trigger?
  3. If we want to satisfy (2.) then we should exclude all of that file's content from the index. This includes the embeddings that were successfully generated for that file.
  4. We report which files failed.

Solutions:

  1. Keep track of which fileNames have a failed to embedding request. Once the index is generated do a second pass over the index contents and remove any content associated with those fileNames. Depending on indexing implementation we could maybe utilize filter for this second pass.
  2. Instead of removing the failed files on a second full pass we can remove files every flush(). That is, we identify failed files only within a single batch during flush() and never add them to the index. Additionally we truncate the index if the most recently indexed file failed in the most recent flush.

The first iteration of this PR is solution 2.

Trade Offs:

  • The logic is simpler and easier to read with solution (1) but we have to maintain the list of failed files until the embed job completes and then do a pass over the whole index. Solution (2) code is a bit more complicated but takes advantage of the fact that we track failed files only for the current batch being flushed, not the entire embedFiles method. The exception is the most recently indexed file (indexed from a previous batch) which may be submitted across multiple batches. This most recently indexed file will always be the trailing embeddings/rows of our in-progress index so we can truncate.
  • A lot of the additional checks and tracking logic is necessary because we want to have a file be all-or-nothing with regards to indexing. If we decide that partially indexed files is acceptable then we could simplify this logic a lot.

Copy link
Member

@stefanhengl stefanhengl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review is still ongoing. I have to spend more time on embed.go, but have to step afk so submitting my current comments.


// caller expects one vector per text input so append zero values
placeholder := make([]float32, response.ModelDimensions)
embeddings = append(embeddings, placeholder...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already allocated embeddings above, so embeddings has the right capacity. I believe you can just reslice instead of allocating the placeholder IE

embeddings = embeddings[:len(embeddings)+response.ModelDimensions]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I can update

}

codeIndexStats, err := embedFiles(ctx, codeFileNames, client, contextService, opts.FileFilters, opts.SplitOptions, readLister, opts.MaxCodeEmbeddings, insertCode, reportCodeProgress)
codeIndexStats, err := embedFiles(ctx, codeFileNames, client, contextService, opts.FileFilters, opts.SplitOptions, readLister, opts.MaxCodeEmbeddings, opts.BatchSize, insertCode, truncateCode, reportCodeProgress, logger)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: looking at our codebase we seem to follow the convention to put logger as the second argument, right after ctx.

}
}

truncateIndex := func(index *embeddings.EmbeddingIndex, count int) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a method of EmbeddingIndex instead? The list of arguments to embedFiles grows and grows and this one can easily be avoided if I am not mistaken.

I know this follows the same pattern as insertIndex but tbh I don't remember why we went down that road.

Copy link
Contributor Author

@gl-srgr gl-srgr Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the reason is that we do not pass an EmbeddingIndex as an argument to embedFiles() and instead just pass these function(s) like insertIndex.

I know this follows the same pattern as insertIndex but tbh I don't remember why we went down that road.

I'm curious about the motivation for this as well since it seems we could just pass the index to embedFiles(). Perhaps we just don't want that method to need to understand how an EmbeddingIndex is organized. I'll consider adding insert() and truncate() to EmbeddingIndex

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about the motivation for this as well since it seems we could just pass the index to embedFiles().

I pulled out insertIndex so that we could make it generic over our vector storage. I didn't want embedFiles to need to worry about the difference between qdrant and our custom storage.

Copy link
Member

@camdencheek camdencheek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this! It'll be great to have embeddings jobs that aren't quite so fragile.


dimensionality := len(response.Data[0].Embedding)
embeddings := make([]float32, 0, len(response.Data)*dimensionality)
failed := make([]int, 0, len(response.Data))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need to pre-allocate failed here. In almost all cases, this will be empty.

}
}

truncateIndex := func(index *embeddings.EmbeddingIndex, count int) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about the motivation for this as well since it seems we could just pass the index to embedFiles().

I pulled out insertIndex so that we could make it generic over our vector storage. I didn't want embedFiles to need to worry about the difference between qdrant and our custom storage.


if err := insert(metadata, batchEmbeddings); err != nil {
return err
// files with partial failures need to be excluded from the index entirely
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

files with partial failures need to be excluded from the index entirely

I'm gonna challenge this assumption. Mostly because this forces us to add a lot of complexity to an already complex piece of code.

First, this should happen pretty rarely. OpenAI seems to give us null responses to only a very small percentage of chunks. And of that small percentage, retrying seems to fix at least some.

Second, if we do get a failure on a chunk, having embeddings from the rest of the chunks in the file is still useful, and probably even preferable.

So, alternatively, what if we just logged a warning that one of our chunks failed, added a "failed" counter to the stats, and called it a day? In an ideal world, we'll be able to move off OpenAI embeddings, so we'll no longer face this problem anyways, and this is quite a bit of code that becomes cruft if that happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, after implementing this to code completion the scope and complexity is a lot. I also mentioned in my earlier comment that the complexity really narrows down if we let partially indexed files remain.

I'd like to have a way of resolving the missing chunks without a user manually scheduling a forced embedding job to reindex from scratch, and I have some ideas, but that's a different effort. This PR can narrow down focus to letting the job proceed and providing an index sooner, albeit with potential for missing code or text embeddings.

If @stefanhengl has any thoughts then let me know, otherwise I can start reducing this down to a smaller PR for the initial enhancement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reducing the scope seems like a good idea.


// The default value for ExcludeFileOnError is false.
if embeddingsConfig.ExcludeFileOnError == nil {
embeddingsConfig.ExcludeFileOnError = pointers.Ptr(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can default to true, or even make this not configurable. It's painful to have to kick off another embeddings job with a new configuration because of a failure (especially when they take many hours or days), so IMO we should prefer to be as "self-healing" here as possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm open to either value as default

@gl-srgr
Copy link
Contributor Author

gl-srgr commented Jul 26, 2023

@camdencheek I've updated with smaller scope in embed.go:

  • Chunks that fail when generating embeddings are excluded from the index
  • Failed chunks are tracked in stats as excluded chunks counter
  • Partially indexed files are allowed
  • After embedding all files for a code index or text index we log a debug message with the # of excluded chunks

One thing I left out from the initial implementaion is logging the file names that fail. I figured that we could add it later. Given today's post it sounds like there are problematic files that fail due to their contents and we want to add to filter list. In that case we might want to add that logging of file name back in.

@gl-srgr gl-srgr merged commit b8e31fd into main Aug 1, 2023
@gl-srgr gl-srgr deleted the garyl/skip_file_during_embed branch August 1, 2023 20:39
@github-actions
Copy link
Contributor

github-actions bot commented Aug 1, 2023

The backport to 5.1 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-5.1 5.1
# Navigate to the new working tree
cd .worktrees/backport-5.1
# Create a new branch
git switch --create backport-55180-to-5.1
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 b8e31fde275d155aed5fe4cea30f3514bd816402
# Push it to GitHub
git push --set-upstream origin backport-55180-to-5.1
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-5.1

Then, create a pull request where the base branch is 5.1 and the compare/head branch is backport-55180-to-5.1.

@github-actions github-actions bot added backports failed-backport-to-5.1 release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases labels Aug 1, 2023
gl-srgr added a commit that referenced this pull request Aug 2, 2023
When a text input is submitted for generating embeddings the response
may be null. If we attempt retries and still cannot generate embeddings
for this input text then we return an error which calls for failing the
entire embed repo job.

[Slack
thread](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688676751106069)

[Issue](https://github.com/sourcegraph/sourcegraph/issues/55469)

This PR introduces a configuration `ExcludeChunkOnError`. When set to
true an embed repo job will proceed with the rest of the embed repo job
when these generate embeddings errors occur. However, the file that
generated the input text which received an error is excluded from the
index as to avoid partially indexing the file.

I'll add more details on the first iteration of this solution and the
trade offs in a separate comment.

<!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
-->

Embed test cases added

(cherry picked from commit b8e31fd)
gl-srgr referenced this pull request Aug 2, 2023
…anch (#55528)

Update exclude chunks migration metadata to align with 5.1 release
branch. This is to support backporting this
[PR](https://github.com/sourcegraph/sourcegraph/pull/55180) to 5.1.

Parent `1687792857` exists for both 5.1 and main. Alignment refers to
following the migration steps
[here](https://handbook.sourcegraph.com/departments/engineering/dev/tools/backport/#prs-with-migration-changes).

## Test plan

<!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
-->

n/a
coury-clark referenced this pull request Aug 2, 2023
…index (#55530)

Related PRs on `main`: 
- [55180](https://github.com/sourcegraph/sourcegraph/pull/55180):
Implements new feature
- [55528](https://github.com/sourcegraph/sourcegraph/pull/55528):
Updates migration metadata to use leaves that are present on 5.1 instead
of using a leaves only available on main.

Steps for this PR: 

1. cherry-picked
[55180](https://github.com/sourcegraph/sourcegraph/pull/55180) commit
2. updated `metadata.yaml` with 5.1's resulting value from running `sg
migration leaves`
3. ran `sg generate` to confirm no other changes required

Description:

When a text input is submitted for generating embeddings the response
may be null. If we attempt retries and still cannot generate embeddings
for this input text then we return an error which calls for failing the
entire embed repo job.

[Slack
thread](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688676751106069)

https://github.com/sourcegraph/sourcegraph/issues/55469

This PR introduces a configuration ExcludeChunkOnError. When set to true
an embed repo job will proceed with the rest of the embed repo job when
these generate embeddings errors occur. However, the file that generated
the input text which received an error is excluded from the index as to
avoid partially indexing the file.

I'll add more details on the first iteration of this solution and the
trade offs in a separate comment.

## Test plan

<!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
-->

new tests for embed.go and embedding clients
varsanojidan pushed a commit that referenced this pull request Aug 3, 2023
When a text input is submitted for generating embeddings the response
may be null. If we attempt retries and still cannot generate embeddings
for this input text then we return an error which calls for failing the
entire embed repo job.

[Slack
thread](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688676751106069)

[Issue](https://github.com/sourcegraph/sourcegraph/issues/55469)

This PR introduces a configuration `ExcludeChunkOnError`. When set to
true an embed repo job will proceed with the rest of the embed repo job
when these generate embeddings errors occur. However, the file that
generated the input text which received an error is excluded from the
index as to avoid partially indexing the file.

I'll add more details on the first iteration of this solution and the
trade offs in a separate comment.

## Test plan

<!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
-->

Embed test cases added
varsanojidan referenced this pull request Aug 3, 2023
…anch (#55528)

Update exclude chunks migration metadata to align with 5.1 release
branch. This is to support backporting this
[PR](https://github.com/sourcegraph/sourcegraph/pull/55180) to 5.1.

Parent `1687792857` exists for both 5.1 and main. Alignment refers to
following the migration steps
[here](https://handbook.sourcegraph.com/departments/engineering/dev/tools/backport/#prs-with-migration-changes).

## Test plan

<!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
-->

n/a
davejrt pushed a commit that referenced this pull request Aug 9, 2023
When a text input is submitted for generating embeddings the response
may be null. If we attempt retries and still cannot generate embeddings
for this input text then we return an error which calls for failing the
entire embed repo job.

[Slack
thread](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688676751106069)

[Issue](https://github.com/sourcegraph/sourcegraph/issues/55469)

This PR introduces a configuration `ExcludeChunkOnError`. When set to
true an embed repo job will proceed with the rest of the embed repo job
when these generate embeddings errors occur. However, the file that
generated the input text which received an error is excluded from the
index as to avoid partially indexing the file.

I'll add more details on the first iteration of this solution and the
trade offs in a separate comment.

## Test plan

<!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
-->

Embed test cases added
davejrt referenced this pull request Aug 9, 2023
…anch (#55528)

Update exclude chunks migration metadata to align with 5.1 release
branch. This is to support backporting this
[PR](https://github.com/sourcegraph/sourcegraph/pull/55180) to 5.1.

Parent `1687792857` exists for both 5.1 and main. Alignment refers to
following the migration steps
[here](https://handbook.sourcegraph.com/departments/engineering/dev/tools/backport/#prs-with-migration-changes).

## Test plan

<!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
-->

n/a
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

backports cla-signed cody/context release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants