Allow embeddings job to exclude failed files from the index#55180

gl-srgr · 2023-07-21T04:50:50Z

When a text input is submitted for embedding, the response may be null. If retries still fail to produce an embedding for that input, we return an error, which fails the entire embed repo job.

Slack thread

Issue

This PR introduces a configuration option, ExcludeChunkOnError. When set to true, an embed repo job continues with the remaining inputs when these embedding errors occur. However, the file that produced the failing input text is excluded from the index, so as to avoid partially indexing the file.

I'll add more details on the first iteration of this solution and the trade-offs in a separate comment.

Test plan

Embed test cases added

sourcegraph-bot · 2023-07-21T04:52:39Z

Codenotify: Notifying subscribers in CODENOTIFY files for diff 3462676...54281cc.

Notify	File(s)
@efritz	enterprise/cmd/worker/internal/embeddings/repo/handler.go

gl-srgr · 2023-07-21T05:20:29Z

The high-level requirements I've got in mind:

1. GetEmbeddings() only returns a single []float32 of successfully generated embeddings for a []string of inputs. Assuming we want to keep submitting input texts in batches, we need to modify this method so that we can identify which input texts failed.
2. We don't want partially indexed files (i.e. some but not all chunks generated from a given file are included in the index). We could search an index with partially indexed files, but it creates new requirements for future indexing, e.g. will that partially indexed file be reattempted only when it's modified by a future commit, or will there be some other trigger?
3. If we want to satisfy (2), then we should exclude all of that file's content from the index. This includes the embeddings that were successfully generated for that file.
4. We report which files failed.
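To make requirement 1 concrete, here is a rough sketch of a batch call that reports which inputs failed. It is not the code in this PR: `embeddingResults`, `getEmbeddings`, and the `embed` callback are hypothetical names, and a nil vector stands in for a null response from the provider.

```go
package main

import "fmt"

// embeddingResults is a hypothetical return type: one flat slice of vectors
// plus the indexes of the input texts that could not be embedded.
type embeddingResults struct {
	Embeddings []float32 // Dimensions values per input, zero-filled for failures
	Failed     []int     // indexes into the input texts
	Dimensions int
}

// getEmbeddings embeds each text with the supplied callback. A vector of the
// wrong length (including nil) is treated as a failure for that input.
func getEmbeddings(texts []string, dims int, embed func(string) []float32) embeddingResults {
	res := embeddingResults{
		Embeddings: make([]float32, 0, len(texts)*dims),
		Dimensions: dims,
	}
	for i, text := range texts {
		vec := embed(text)
		if len(vec) != dims {
			res.Failed = append(res.Failed, i)
			vec = make([]float32, dims) // keep one row per input
		}
		res.Embeddings = append(res.Embeddings, vec...)
	}
	return res
}

func main() {
	// Stand-in embedder: the empty string simulates a null response.
	embed := func(text string) []float32 {
		if text == "" {
			return nil
		}
		return []float32{1, 2}
	}
	res := getEmbeddings([]string{"a", "", "c"}, 2, embed)
	fmt.Println(res.Failed, len(res.Embeddings)) // prints "[1] 6"
}
```

Keeping a zero-filled row for each failed input means row i still corresponds to input i, so the caller can map failures back to files.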

Solutions:

1. Keep track of which fileNames have a failed embedding request. Once the index is generated, do a second pass over the index contents and remove any content associated with those fileNames. Depending on the indexing implementation, we could maybe utilize filter for this second pass.
2. Instead of removing the failed files in a second full pass, we can remove files on every flush(). That is, we identify failed files only within a single batch during flush() and never add them to the index. Additionally, we truncate the index if the most recently indexed file failed in the most recent flush.
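Solution 2 might look roughly like the sketch below. This is a simplified model, not the PR's embed.go: the index is reduced to one file name per row, and `chunk`, `index`, `carry`, and this `flush` signature are all hypothetical. It assumes files are embedded in order, so only the last file of a batch can continue into the next batch.

```go
package main

import "fmt"

// chunk is one input text; OK is false when no embedding could be generated.
type chunk struct {
	FileName string
	OK       bool
}

// index is a toy stand-in for the embedding index: one file name per row.
type index struct {
	RowFiles []string
	carry    string // file that failed at the end of the previous batch, if any
}

// flush adds a batch to the index, leaving out every file that has a failed chunk.
func (idx *index) flush(batch []chunk) {
	if len(batch) == 0 {
		return
	}
	failed := map[string]bool{}
	if idx.carry != "" {
		failed[idx.carry] = true // keep skipping a file that already failed
	}
	for _, c := range batch {
		if !c.OK {
			failed[c.FileName] = true
		}
	}
	// Rows from an earlier batch that belong to a failed file are always
	// trailing, so they can be truncated.
	n := len(idx.RowFiles)
	for n > 0 && failed[idx.RowFiles[n-1]] {
		n--
	}
	idx.RowFiles = idx.RowFiles[:n]
	for _, c := range batch {
		if !failed[c.FileName] {
			idx.RowFiles = append(idx.RowFiles, c.FileName)
		}
	}
	idx.carry = ""
	if last := batch[len(batch)-1].FileName; failed[last] {
		idx.carry = last
	}
}

func main() {
	idx := &index{}
	idx.flush([]chunk{{"a.go", true}, {"b.go", true}})
	idx.flush([]chunk{{"b.go", false}, {"c.go", true}}) // b.go spans two batches
	fmt.Println(idx.RowFiles)                           // prints "[a.go c.go]"
}
```

The `carry` field covers a case the truncation alone would miss: a file that fails in one batch and then has successful chunks in the next.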

The first iteration of this PR is solution 2.

Trade Offs:

The logic is simpler and easier to read with solution (1), but we have to maintain the list of failed files until the embed job completes and then do a pass over the whole index. Solution (2) is a bit more complicated, but it only has to track failed files for the current batch being flushed, not for the entire embedFiles method. The exception is the most recently indexed file (indexed from a previous batch), which may be submitted across multiple batches. This most recently indexed file will always be the trailing embeddings/rows of our in-progress index, so we can truncate.
A lot of the additional checks and tracking logic is necessary because we want indexing of a file to be all-or-nothing. If we decide that partially indexed files are acceptable, then we could simplify this logic a lot.

stefanhengl

Review is still ongoing. I have to spend more time on embed.go, but have to step afk so submitting my current comments.

stefanhengl · 2023-07-21T12:21:25Z

internal/embeddings/embed/client/sourcegraph/client.go

+
+					// caller expects one vector per text input so append zero values
+					placeholder := make([]float32, response.ModelDimensions)
+					embeddings = append(embeddings, placeholder...)


You already allocated embeddings above, so embeddings has the right capacity. I believe you can just reslice instead of allocating the placeholder, i.e.

embeddings = embeddings[:len(embeddings)+response.ModelDimensions]

That makes sense, I can update
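For reference, a small standalone illustration of the reslice suggestion. `appendZeroVector` is a hypothetical helper, not code from the PR. Reslicing only yields zeros when the spare capacity has never been written to, which holds for a slice freshly created with make and only appended to.

```go
package main

import "fmt"

// appendZeroVector extends embeddings by dims zero values. When there is
// enough spare capacity it reslices instead of allocating a placeholder.
// Note: reslicing exposes whatever is already in the backing array, so this
// relies on that region never having been written since make() zeroed it.
func appendZeroVector(embeddings []float32, dims int) []float32 {
	if len(embeddings)+dims <= cap(embeddings) {
		return embeddings[:len(embeddings)+dims]
	}
	// Fallback when capacity is insufficient: append an explicit zero vector.
	return append(embeddings, make([]float32, dims)...)
}

func main() {
	embeddings := make([]float32, 0, 6) // room for three 2-dimensional vectors
	embeddings = append(embeddings, 1, 2)
	embeddings = appendZeroVector(embeddings, 2)
	fmt.Println(embeddings) // prints "[1 2 0 0]"
}
```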

schema/site.schema.json

stefanhengl · 2023-07-21T12:52:39Z

internal/embeddings/embed/embed.go

 	}

-	codeIndexStats, err := embedFiles(ctx, codeFileNames, client, contextService, opts.FileFilters, opts.SplitOptions, readLister, opts.MaxCodeEmbeddings, insertCode, reportCodeProgress)
+	codeIndexStats, err := embedFiles(ctx, codeFileNames, client, contextService, opts.FileFilters, opts.SplitOptions, readLister, opts.MaxCodeEmbeddings, opts.BatchSize, insertCode, truncateCode, reportCodeProgress, logger)


nit: looking at our codebase we seem to follow the convention to put logger as the second argument, right after ctx.

stefanhengl · 2023-07-21T13:03:19Z

internal/embeddings/embed/embed.go

 		}
 	}

+	truncateIndex := func(index *embeddings.EmbeddingIndex, count int) {


Could this be a method of EmbeddingIndex instead? The list of arguments to embedFiles grows and grows and this one can easily be avoided if I am not mistaken.

I know this follows the same pattern as insertIndex but tbh I don't remember why we went down that road.

I believe the reason is that we do not pass an EmbeddingIndex as an argument to embedFiles() and instead just pass these function(s) like insertIndex.

> I know this follows the same pattern as insertIndex but tbh I don't remember why we went down that road.

I'm curious about the motivation for this as well since it seems we could just pass the index to embedFiles(). Perhaps we just don't want that method to need to understand how an EmbeddingIndex is organized. I'll consider adding insert() and truncate() to EmbeddingIndex

> I'm curious about the motivation for this as well since it seems we could just pass the index to embedFiles().

I pulled out insertIndex so that we could make it generic over our vector storage. I didn't want embedFiles to need to worry about the difference between qdrant and our custom storage.
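As a rough illustration of that design choice, the sketch below passes an insert callback so the embedding loop never sees the storage type. Everything here is hypothetical and heavily simplified: `insertFunc`, `sliceStore`, and `mapStore` are stand-ins, and the real embedFiles takes many more arguments.

```go
package main

import "fmt"

// insertFunc hides the storage backend from the embedding loop.
type insertFunc func(fileName string, vector []float32) error

// embedFiles embeds each file and hands the result to insert. It does not
// know whether rows end up in an in-memory index or a remote vector store.
func embedFiles(fileNames []string, embed func(string) []float32, insert insertFunc) error {
	for _, name := range fileNames {
		if err := insert(name, embed(name)); err != nil {
			return fmt.Errorf("inserting %s: %w", name, err)
		}
	}
	return nil
}

// sliceStore stands in for a custom append-only index.
type sliceStore struct {
	names   []string
	vectors [][]float32
}

func (s *sliceStore) insert(name string, v []float32) error {
	s.names = append(s.names, name)
	s.vectors = append(s.vectors, v)
	return nil
}

// mapStore stands in for a keyed vector database.
type mapStore map[string][]float32

func (m mapStore) insert(name string, v []float32) error {
	m[name] = v
	return nil
}

func main() {
	embed := func(name string) []float32 { return []float32{float32(len(name))} }
	files := []string{"a.go", "README.md"}

	s := &sliceStore{}
	m := mapStore{}
	if err := embedFiles(files, embed, s.insert); err != nil {
		panic(err)
	}
	if err := embedFiles(files, embed, m.insert); err != nil {
		panic(err)
	}
	fmt.Println(len(s.names), len(m)) // prints "2 2"
}
```

A truncate callback would follow the same shape, which is why the argument list grows with each new storage operation.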

camdencheek

Thanks for tackling this! It'll be great to have embeddings jobs that aren't quite so fragile.

camdencheek · 2023-07-21T22:22:04Z

internal/embeddings/embed/client/openai/client.go


 	dimensionality := len(response.Data[0].Embedding)
 	embeddings := make([]float32, 0, len(response.Data)*dimensionality)
+	failed := make([]int, 0, len(response.Data))


nit: no need to pre-allocate failed here. In almost all cases, this will be empty.
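A minimal illustration of the nit, using a hypothetical `failedIndexes` helper: appending to a nil slice allocates lazily, so the common no-failure path allocates nothing.

```go
package main

import "fmt"

// failedIndexes returns the positions of empty vectors. The result stays nil
// unless at least one failure is found; append allocates on first use.
func failedIndexes(vectors [][]float32) []int {
	var failed []int
	for i, v := range vectors {
		if len(v) == 0 {
			failed = append(failed, i)
		}
	}
	return failed
}

func main() {
	ok := failedIndexes([][]float32{{1}, {2}})
	bad := failedIndexes([][]float32{{1}, nil})
	fmt.Println(ok == nil, bad) // prints "true [1]"
}
```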

camdencheek · 2023-07-21T22:25:03Z

internal/embeddings/embed/embed.go

 		}
 	}

+	truncateIndex := func(index *embeddings.EmbeddingIndex, count int) {


> I'm curious about the motivation for this as well since it seems we could just pass the index to embedFiles().

I pulled out insertIndex so that we could make it generic over our vector storage. I didn't want embedFiles to need to worry about the difference between qdrant and our custom storage.

camdencheek · 2023-07-21T22:38:34Z

internal/embeddings/embed/embed.go


-		if err := insert(metadata, batchEmbeddings); err != nil {
-			return err
+		// files with partial failures need to be excluded from the index entirely


> files with partial failures need to be excluded from the index entirely

I'm gonna challenge this assumption. Mostly because this forces us to add a lot of complexity to an already complex piece of code.

First, this should happen pretty rarely. OpenAI seems to give us null responses to only a very small percentage of chunks. And of that small percentage, retrying seems to fix at least some.

Second, if we do get a failure on a chunk, having embeddings from the rest of the chunks in the file is still useful, and probably even preferable.

So, alternatively, what if we just logged a warning that one of our chunks failed, added a "failed" counter to the stats, and called it a day? In an ideal world, we'll be able to move off OpenAI embeddings, so we'll no longer face this problem anyways, and this is quite a bit of code that becomes cruft if that happens.

Yeah, having implemented this through to completion, the scope and complexity are a lot. I also mentioned in my earlier comment that the complexity drops considerably if we let partially indexed files remain.

I'd like to have a way of resolving the missing chunks without a user manually scheduling a forced embedding job to reindex from scratch, and I have some ideas, but that's a different effort. This PR can narrow down focus to letting the job proceed and providing an index sooner, albeit with potential for missing code or text embeddings.

If @stefanhengl has any thoughts then let me know, otherwise I can start reducing this down to a smaller PR for the initial enhancement.

Reducing the scope seems like a good idea.

camdencheek · 2023-07-21T22:42:09Z

internal/conf/computed.go


+	// The default value for ExcludeFileOnError is false.
+	if embeddingsConfig.ExcludeFileOnError == nil {
+		embeddingsConfig.ExcludeFileOnError = pointers.Ptr(false)


I think we can default to true, or even make this not configurable. It's painful to have to kick off another embeddings job with a new configuration because of a failure (especially when they take many hours or days), so IMO we should prefer to be as "self-healing" here as possible.

Yeah I'm open to either value as default
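For context, the defaulting pattern under discussion looks roughly like this. The sketch is standalone: `embeddingsConfig` is a cut-down hypothetical struct, and `ptr` is a local stand-in for the `pointers.Ptr` helper used in the diff. A pointer field lets the code tell "unset" apart from an explicit false, so the default can change without overriding an admin's choice.

```go
package main

import "fmt"

// embeddingsConfig is a hypothetical, cut-down version of the site config.
// A nil pointer means the admin did not set the option.
type embeddingsConfig struct {
	ExcludeFileOnError *bool
}

// ptr returns a pointer to v (local stand-in for a pointers.Ptr helper).
func ptr[T any](v T) *T { return &v }

// applyDefaults fills in unset options with the given default.
func applyDefaults(c embeddingsConfig, def bool) embeddingsConfig {
	if c.ExcludeFileOnError == nil {
		c.ExcludeFileOnError = ptr(def)
	}
	return c
}

func main() {
	unset := applyDefaults(embeddingsConfig{}, true)
	explicit := applyDefaults(embeddingsConfig{ExcludeFileOnError: ptr(false)}, true)
	fmt.Println(*unset.ExcludeFileOnError, *explicit.ExcludeFileOnError) // prints "true false"
}
```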

gl-srgr · 2023-07-26T19:36:47Z

@camdencheek I've updated with smaller scope in embed.go:

Chunks that fail when generating embeddings are excluded from the index
Failed chunks are tracked in stats as excluded chunks counter
Partially indexed files are allowed
After embedding all files for a code index or text index we log a debug message with the # of excluded chunks

One thing I left out from the initial implementation is logging the names of files that fail. I figured that we could add it later. Given today's post, it sounds like there are problematic files that fail because of their contents, and that we want to add them to the filter list. In that case we might want to add the file name logging back in.

github-actions · 2023-08-01T20:41:03Z

The backport to 5.1 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-5.1 5.1
# Navigate to the new working tree
cd .worktrees/backport-5.1
# Create a new branch
git switch --create backport-55180-to-5.1
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 b8e31fde275d155aed5fe4cea30f3514bd816402
# Push it to GitHub
git push --set-upstream origin backport-55180-to-5.1
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-5.1

Then, create a pull request where the base branch is 5.1 and the compare/head branch is backport-55180-to-5.1.

When a text input is submitted for generating embeddings the response may be null. If we attempt retries and still cannot generate embeddings for this input text then we return an error which calls for failing the entire embed repo job. [Slack thread](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688676751106069) [Issue](https://github.com/sourcegraph/sourcegraph/issues/55469) This PR introduces a configuration `ExcludeChunkOnError`. When set to true an embed repo job will proceed with the rest of the embed repo job when these generate embeddings errors occur. However, the file that generated the input text which received an error is excluded from the index as to avoid partially indexing the file. I'll add more details on the first iteration of this solution and the trade offs in a separate comment.  Embed test cases added (cherry picked from commit b8e31fd)

…anch (#55528) Update exclude chunks migration metadata to align with 5.1 release branch. This is to support backporting this [PR](https://github.com/sourcegraph/sourcegraph/pull/55180) to 5.1. Parent `1687792857` exists for both 5.1 and main. Alignment refers to following the migration steps [here](https://handbook.sourcegraph.com/departments/engineering/dev/tools/backport/#prs-with-migration-changes). ## Test plan  n/a

…index (#55530) Related PRs on `main`: - [55180](https://github.com/sourcegraph/sourcegraph/pull/55180): Implements new feature - [55528](https://github.com/sourcegraph/sourcegraph/pull/55528): Updates migration metadata to use leaves that are present on 5.1 instead of using a leaves only available on main. Steps for this PR: 1. cherry-picked [55180](https://github.com/sourcegraph/sourcegraph/pull/55180) commit 2. updated `metadata.yaml` with 5.1's resulting value from running `sg migration leaves` 3. ran `sg generate` to confirm no other changes required Description: When a text input is submitted for generating embeddings the response may be null. If we attempt retries and still cannot generate embeddings for this input text then we return an error which calls for failing the entire embed repo job. [Slack thread](https://sourcegraph.slack.com/archives/C053L1AQ0BC/p1688676751106069) https://github.com/sourcegraph/sourcegraph/issues/55469 This PR introduces a configuration ExcludeChunkOnError. When set to true an embed repo job will proceed with the rest of the embed repo job when these generate embeddings errors occur. However, the file that generated the input text which received an error is excluded from the index as to avoid partially indexing the file. I'll add more details on the first iteration of this solution and the trade offs in a separate comment. ## Test plan  new tests for embed.go and embedding clients

gl-srgr added 5 commits July 20, 2023 13:38

exclude failed files from index but do not fail entire embedRepo job

d3bc1f5

add tests

3b48c6c

Merge branch 'garyl/ignore_embed_error' into garyl/skip_file_during_e…

2ba3570

…mbed

update schema

74b5bc2

update schema

f80ad8b

cla-bot bot added the cla-signed label Jul 21, 2023

build bazel

9c6a6f8

gl-srgr requested review from camdencheek and stefanhengl July 21, 2023 05:21

stefanhengl reviewed Jul 21, 2023

View reviewed changes

remove arg

baace94

camdencheek reviewed Jul 21, 2023

View reviewed changes

gl-srgr added 14 commits July 25, 2023 09:37

Merge branch 'main' into garyl/skip_file_during_embed

e158d5d

partial indexed files are allowed and we only update stats to acknowl…

54aa489

…edge excluded chunks

format new error when client does not return error but the single inp…

f2ce31d

…ut string fails to generate embedding.

remove unused code

4df80aa

update stats store

ab71ce6

build bazel

1117657

squashed.sql

192714e

fix migrations

e9de47a

build bazel, stats store test

8aefa68

sg generate

357b997

client test updates

4c0eda2

client updates and client test additions

7153107

computed embeddings config test updates

60f615d

log excluded count as debug

40da51c

Move config option out of client and into repoOpts. Remove partialErr…

8c5cf83

…or. Log failed chunk file names.

gl-srgr mentioned this pull request Aug 1, 2023

Provide option to force an embedding job via site admin #55468

Closed

gl-srgr added 7 commits July 31, 2023 18:56

sg generate

e8d294b

Merge branch 'main' into garyl/skip_file_during_embed

aef97d6

changelog, update tests now that client doesn't return partial errors

6c0546a

Merge branch 'main' into garyl/skip_file_during_embed

30eb98c

changelog

7e9912c

duplicated test by mistake

182449c

Merge branch 'main' into garyl/skip_file_during_embed

2d13679

gl-srgr added cody/context backport 5.1 labels Aug 1, 2023

Merge branch 'main' into garyl/skip_file_during_embed

54281cc

gl-srgr merged commit b8e31fd into main Aug 1, 2023

gl-srgr deleted the garyl/skip_file_during_embed branch August 1, 2023 20:39

github-actions bot added backports failed-backport-to-5.1 release-blocker Prevents us from releasing: https://about.sourcegraph.com/handbook/engineering/releases labels Aug 1, 2023

This was referenced Aug 2, 2023

Update exclude chunks migration metadata to align with 5.1 release branch #55528

Merged

[Backport 5.1] Allow embeddings job to exclude failed files from the index #55530

Merged

gl-srgr mentioned this pull request Aug 3, 2023

Allow embedding job to proceed when we fail to get embeddings for a chunk #55469

Closed
