Code Embeddings
Available tools for indexing Code Embeddings
GitLab Active Context Gem
A Ruby gem for interfacing with vector stores such as Elasticsearch, OpenSearch, and PostgreSQL (with PGVector), used to store and query vectors.
Key Components:
- Adapter Layer: Provides a unified interface to different storage backends.
- Collection Management: Handles creating and managing collections of documents.
- Reference System: Defines how to serialize and index different types of objects.
- Queue Management: Manages asynchronous processing of indexing operations.
- Migration System: Similar to database migrations for managing schema changes.
- Embedding Support: Integrates with embedding generation for vector search capabilities.
GitLab Elasticsearch Indexer
A Go application that indexes Git repositories into Elasticsearch for GitLab.
Key Components:
- Indexer Module: Handles the core indexing functionality for different content types.
- Git Integration: Uses Gitaly to access repository content.
- Elasticsearch Client: Manages connections to Elasticsearch and handles document submission.
Proposal: Use Go Indexer to index chunks and Rails to index embeddings
Indexing and chunking are done in the Go Indexer, with the chunks immediately stored in vector storage.
The indexer efficiently processes and chunks code files, while Rails handles generating and storing embeddings separately.
Process Flow:
- A Git push event triggers Rails to call the indexer.
- The indexer calls Gitaly to retrieve the changed files.
- Each file is processed by chunking its content using the configured chunker.
- Create each chunk if not present (see the sketch after this list):
  - Postgres: `INSERT INTO chunks (...) ON CONFLICT DO UPDATE`
  - Elasticsearch/OpenSearch: `doc_as_upsert: true, detect_noop: true`
- Delete orphaned chunks:
  - Postgres: `DELETE FROM chunks WHERE filename = ? AND id NOT IN (?)`
- Return upserted unique IDs back to Rails.
- AI Abstraction Layer tracks embedding references for each unique ID.
- In batches, references are pulled from the queue.
- A bulk lookup is done to the vector store to check if the document exists and get content.
- Embeddings are generated in bulk and upserted into the vector store.
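The chunk upsert and orphan cleanup described above could look roughly like this (a minimal Ruby sketch for illustration only; in this proposal the work happens inside the Go indexer, and the table and column names are assumptions based on the schema later in this document):

```ruby
require 'digest'
require 'pg' # illustrative; the real upserts happen inside the Go indexer

# Sketch: upsert the chunks of one file and remove orphaned chunks (Postgres flavour).
def upsert_file_chunks(conn, project_id:, path:, chunks:)
  ids = chunks.map do |chunk|
    # Unique identifier per the schema below: hash("#{project_id}:#{path}:#{content}")
    id = Digest::SHA256.hexdigest("#{project_id}:#{path}:#{chunk[:content]}")

    # "Create each chunk if not present" via INSERT ... ON CONFLICT DO UPDATE
    conn.exec_params(<<~SQL, [id, project_id, path, chunk[:content]])
      INSERT INTO chunks (id, project_id, path, content)
      VALUES ($1, $2, $3, $4)
      ON CONFLICT (id) DO UPDATE SET content = EXCLUDED.content
    SQL

    id
  end

  # "Delete orphaned chunks": anything for this file not produced by the latest
  # chunking pass (assumes at least one chunk per file).
  placeholders = ids.each_index.map { |i| "$#{i + 3}" }.join(', ')
  conn.exec_params(
    "DELETE FROM chunks WHERE project_id = $1 AND path = $2 AND id NOT IN (#{placeholders})",
    [project_id, path, *ids]
  )

  ids # the upserted unique IDs returned to Rails for embedding references
end
```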
```mermaid
sequenceDiagram
title Indexer Indexes Code Chunks, Rails Indexes Embeddings
participant User
participant Rails
participant PostgreSQL
participant Indexer
participant Gitaly
participant VectorStore
participant AIContextLayer
participant AIGateway
User->>Rails: Git push event
Rails->>PostgreSQL: Store current from and to SHA in postgres
Rails->>Indexer: Trigger indexing of changed files
Indexer->>Gitaly: Request changed files
Gitaly-->>Indexer: Return changed files
Indexer->>Indexer: Chunk each file
Indexer->>VectorStore: Upsert each chunk with content + unique identifier + version
VectorStore-->>Indexer: Confirm indexing
Indexer->>VectorStore: Delete orphaned documents
VectorStore-->>Indexer: Confirm deletion
Indexer-->>Rails: Return changed unique ids
Note right of AIContextLayer: Backfill embeddings for updated chunks
Rails->>AIContextLayer: Build references for embeddings
AIContextLayer->>VectorStore: Look up unique ids
VectorStore-->>AIContextLayer: Return matching chunks
AIContextLayer->>AIGateway: Request embeddings for chunks
AIGateway-->>AIContextLayer: Return embeddings
AIContextLayer->>VectorStore: Add embeddings to documents
VectorStore-->>AIContextLayer: Confirm update
AIContextLayer-->>Rails: Process complete
```
Design and implementation details
Key Implementation Notes
- Embedding deduplication is managed by tracking references: if a ref is in the queue for an hour, it might have changed multiple times or been deleted, but we only care about the final state (see the sketch after this list).
- A hashed version of the filename and chunk content will be used as the unique identifier for each document.
- The indexer can be called with an option to index a full repository (e.g. a `--force` option), which can be used for initial indexing, when the chunker changes, etc. Normal mode is to process changed files only.
- Embedding generation is the most time-intensive part of the process, with a throughput of approximately 250 embeddings per minute for the current model.
- Data is restricted to namespaces with Duo Pro or Duo Enterprise add-ons.
- NB: This implementation does not support feature branches.
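A small illustration of the deduplication point above (the queue shape and reference format here are assumptions, not the actual ActiveContext internals):

```ruby
# Hypothetical batch of references pulled from the queue; only the final state per
# unique ID matters, so duplicates accumulated over the hour can be collapsed.
refs = [
  { id: 'abc123', op: :upsert },
  { id: 'abc123', op: :upsert }, # the same chunk changed twice
  { id: 'def456', op: :delete }  # a chunk that was removed before the batch ran
]

# Keep only the last reference seen for each ID before generating embeddings.
final_refs = refs.reverse.uniq { |ref| ref[:id] }.reverse
# => [{ id: "abc123", op: :upsert }, { id: "def456", op: :delete }]
```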
Required changes on indexer
- Add mode to indexer for indexing code chunks
- Allow indexer to call chunker
- Add a Postgres client to the indexer (an Elasticsearch/OpenSearch client already exists) and support selecting a client from Rails
- Implement translations for each adapter (Elasticsearch, OpenSearch, Postgres) for indexing
Schema
| Field Name | Type | Description |
|---|---|---|
| id | keyword | hash("#{project_id}:#{path}:#{content}") |
| project_id | bigint | Filter by projects |
| path | keyword | Relative path including file name |
| type | smallint | Enum indicating whether it’s the full blob content or a node extracted from a chunker. Example options: file|class|function|imports|constant |
| content | text | Code content |
| name | text | Name of chunk, e.g. ModuleName::ClassName::method_name |
| source | keyword | "#{blob.id}:#{offset}:#{length}" which can be used to rebuild the full file or restore order of chunks |
| language | keyword | Language of content |
| embeddings_v1 | vector | Embeddings for the content |
The following fields were considered but not added to the initial schema. New fields can be added using AI Abstraction Layer migrations, and backfills can be done either with migrations or by reindexing.
- `archived` (boolean): for group-level search, filter out projects that are archived
- `branches` (keyword[]): to support non-default branches
- `extension` (keyword): extension of the file, to easily filter by extension
- `repository_access_level` (smallint): permissions for group-level searches
- `traversal_ids` (keyword): efficient group-level searches
- `visibility_level` (smallint): permissions for group-level searches
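Because the `source` field encodes `"#{blob.id}:#{offset}:#{length}"`, chunk order (and, in principle, the full file) can be recovered from a set of chunks. A minimal sketch, assuming chunk hashes with `source` and `content` keys matching the schema above:

```ruby
# Restore chunk order (or rebuild a file) from the `source` field.
def ordered_chunks(chunks)
  chunks.sort_by do |chunk|
    _blob_id, offset, _length = chunk[:source].split(':')
    offset.to_i
  end
end

chunks = [
  { source: 'a1b2c3:120:80', content: "def bar\nend\n" },
  { source: 'a1b2c3:0:120',  content: "def foo\nend\n" }
]

ordered_chunks(chunks).map { |c| c[:content] }.join
# => "def foo\nend\ndef bar\nend\n"
```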
Options for supporting multiple branches
By default, GitLab code search supports indexing and searching only the default branch. Supporting multiple branches requires additional considerations for storage, indexing strategy, and query complexity.
Option 1: Index Only Branch Diffs
Only index the differences (diffs) between the default branch and other branches. When a file is modified in a branch, index that version with branch metadata.
Option 2: Branch Bitmap Approach
Store a bitmap representing branch membership for each file. Maintain an ordered list of branches (e.g., master, branch1, branch2, branch3), and represent file presence with a bitmap (e.g., file in master and branch1 = 1100, file modified in branch2 = 0010).
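A minimal sketch of the bitmap idea (the branch ordering and helper method are illustrative assumptions):

```ruby
# Ordered list of branches; each file version gets one bit per branch.
BRANCHES = %w[master branch1 branch2 branch3].freeze

# Build a bitmap marking which branches contain this version of a file.
def branch_bitmap(present_in)
  BRANCHES.map { |branch| present_in.include?(branch) ? '1' : '0' }.join
end

branch_bitmap(%w[master branch1]) # => "1100" (file on master and branch1)
branch_bitmap(%w[branch2])        # => "0010" (file modified only in branch2)
```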
Option 3: Tree Structure Traversal
Implement a tree-based structure representing the git repository hierarchy that can be traversed during search operations. This would mirror the actual version control model but requires a more sophisticated implementation.
Pros and Cons
| Option | Pros | Cons |
|---|---|---|
| Option 1: Index Only Branch Diffs | • Requires less storage space • Simpler implementation process • Faster initial indexing | • Search results may include duplicate files (from default branch and branch versions) • Requires result deduplication/selection logic • Boosting for branch-specific results is easier in Elasticsearch than PostgreSQL |
| Option 2: Branch Bitmap Approach | • Efficient representation of branch membership • No duplicate results | • Uncertain performance impact for bitmap operations in Elasticsearch/PostgreSQL • Requires reindexing metadata (but not embeddings) for all files when branches change • Bitmap size grows with number of branches • More complex implementation |
| Option 3: Tree Structure Traversal | • Most accurate representation of the git model • Potentially more flexible for complex queries • Could better handle branch hierarchies and merges | • Most complex implementation • No clear implementation path currently defined |
Proposal: Searching over indexed chunks
A query containing filters and embeddings is built; when executed, it is translated into a query the vector store can execute, and the results are returned.
```mermaid
sequenceDiagram
participant App as Application Code
participant Query as Query
participant VertexAI as Vertex API via AI Gateway
participant VectorStore as Vector Store (ES/PG/OS)
participant QueryResult as Query Result
Note over App: Querying from vector stores
App->>Query: Create query with filter conditions
App->>Query: Add knn query for similarity search
Query->>VertexAI: generate embeddings in bulk
VertexAI->>Query: return embedding vector
Query->>VectorStore: Execute query with filters and embedding vector
VectorStore->>Query: Return matching documents
Query->>QueryResult: Format and redact unauthorized results
QueryResult->>App: Results
```
Example query:
Querying across two projects and getting the 5 closest results to a given embedding (generated by a question):
```ruby
target_embedding = ::ActiveContext::Embeddings.generate_embeddings('the question')
query = ActiveContext::Query.filter(project_id: [1, 2]).knn(target: 'embeddings_v1', vector: target_embedding, limit: 5)
result = Ai::Context::Collections::Blobs.search(user: current_user, query: query)
```
This will return the closest matching blob chunks.
Adding AND and OR filters to the query:
```ruby
query = ActiveContext::Query
  .and(
    ActiveContext::Query.filter(project_id: 1),
    ActiveContext::Query.filter(branch_name: 'master'),
    ActiveContext::Query.or(
      ActiveContext::Query.filter(language: 'ruby'),
      ActiveContext::Query.filter(extension: 'rb')
    )
  )
  .knn(target: 'embeddings_v1', vector: target_embedding, limit: 5)
```
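As in the earlier example, the combined query is then executed against a collection:

```ruby
result = Ai::Context::Collections::Blobs.search(user: current_user, query: query)
```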
Index state management
Overview
This design proposal outlines a system to track the state of indexed namespaces and projects for Code Embeddings.
The process differs between SaaS and SM/Dedicated:
- SaaS: Duo licenses are applied at the root namespace level. Subgroups and projects in the namespace have Duo enabled, except if `duo_features_enabled` is false.
- SM: the Duo license is applied at the instance level. If the instance has a license, all groups and projects have Duo enabled, except if `duo_features_enabled` is false.
Database Schema
The `Ai::ActiveContext::Code::EnabledNamespace` table tracks namespaces that should be indexed based on Duo and GitLab licenses and enabled features.
The `Ai::ActiveContext::Code::Repository` table tracks the indexing state of projects in an enabled namespace.
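For illustration, a hedged sketch of what the `Repository` table could contain, based only on the fields referenced in this document (column names and types are assumptions):

```ruby
# Illustrative migration sketch; the actual schema may differ.
class CreateAiActiveContextCodeRepositories < ActiveRecord::Migration[7.1]
  def change
    create_table :ai_active_context_code_repositories do |t|
      t.references :enabled_namespace, null: false # Ai::ActiveContext::Code::EnabledNamespace
      t.references :project, null: false
      t.integer :state, null: false, default: 0    # :pending, :code_indexing_in_progress, ...
      t.text :last_commit                          # to_sha that was last indexed
      t.text :initial_indexing_last_queued_item    # highest document ID enqueued for embeddings
      t.datetime :indexed_at
      t.text :last_error
      t.timestamps
    end
  end
end
```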
Process Flow
The system uses a `SchedulingService`, called every minute from a cron worker (`Ai::ActiveContext::Code::SchedulingWorker`), that publishes events at defined intervals. Each event has a corresponding worker that processes the event.
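A minimal sketch of that scheduling loop, assuming a Sidekiq cron worker and a simple per-task interval check (the `execute_if_due` helper is hypothetical):

```ruby
# Illustrative sketch only; the real worker and service may be structured differently.
module Ai
  module ActiveContext
    module Code
      class SchedulingWorker
        include Sidekiq::Worker

        TASKS = %i[
          saas_initial_indexing
          process_pending_enabled_namespace
          index_repository
          mark_repository_as_ready
        ].freeze

        # Invoked every minute by cron; the service publishes an event per task
        # when that task's interval has elapsed, and each event has its own worker.
        def perform
          TASKS.each { |task| SchedulingService.new(task).execute_if_due }
        end
      end
    end
  end
end
```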
Scheduling tasks
saas_initial_indexing
- Scope: Only runs on gitlab.com
- Eligibility Criteria:
  - Namespaces with an active, non-trial Duo Core, Duo Pro, or Duo Enterprise license
  - Namespaces with an unexpired paid hosted GitLab subscription
  - Namespaces without existing `EnabledNamespace` records
  - Namespaces with `duo_features_enabled` AND `experiment_features_enabled`
- Action: Creates `EnabledNamespace` records for eligible namespaces in `:pending` state
process_pending_enabled_namespace
- Finds the first `EnabledNamespace` record in `:pending` state
- Creates `Repository` records in `:pending` state for projects that:
  - Belong to the `EnabledNamespace`'s namespace
  - Have `duo_features_enabled`
  - Don't have existing `Repository` records
- Marks the `EnabledNamespace` record as `:ready` if all records were successfully created
index_repository
- Enqueues `RepositoryIndexWorker` jobs for 50 pending `Repository` records at a time
- `RepositoryIndexWorker` process (see the sketch after this list):
  - Executes `IndexingService` for the repository to handle initial indexing
  - Sets state to `:code_indexing_in_progress`
  - Calls `elasticsearch-indexer` in chunk mode to:
    - Find files from Gitaly
    - Chunk files
    - Index chunks
    - Return successful IDs
  - Sets `last_commit` to the `to_sha` that was indexed
  - Sets state to `:embedding_indexing_in_progress`
  - Enqueues embedding references for successfully indexed documents
  - Sets `initial_indexing_last_queued_item` to the highest ID of the documents indexed
  - Sets `indexed_at` to the current time
  - If failures occur during this process, marks the repository as `:failed` and sets `last_error`
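The flow above can be summarised in a hedged Ruby sketch (the indexer and queueing helpers are hypothetical; the states and fields come from this document):

```ruby
# Illustrative sketch of the per-repository flow; helper methods are hypothetical.
def index_repository(repository)
  repository.update!(state: :code_indexing_in_progress)

  # Calls elasticsearch-indexer in chunk mode: finds files via Gitaly, chunks them,
  # indexes the chunks, and returns the IDs of successfully indexed documents.
  indexed_ids = run_chunk_indexer(repository, to_sha: repository.project_head_sha)

  repository.update!(last_commit: repository.project_head_sha,
                     state: :embedding_indexing_in_progress)

  # Enqueue embedding references for the successfully indexed documents.
  indexed_ids.each { |id| enqueue_embedding_reference(id) }

  repository.update!(initial_indexing_last_queued_item: indexed_ids.max,
                     indexed_at: Time.current)
rescue StandardError => e
  repository.update!(state: :failed, last_error: e.message)
end
```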
Embedding Generation
- ActiveContext framework processes enqueued references in batches asynchronously
- Generates and sets embeddings on indexed documents
mark_repository_as_ready
- Finds `Repository` records in `:embedding_indexing_in_progress` state
- Checks whether the `initial_indexing_last_queued_item` record has all currently indexing embedding model fields populated in the vector store (see the sketch after this list)
- Marks the repository as `:ready` when embeddings are complete
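A sketch of that completeness check (the document lookup helper is hypothetical; the `embeddings_v1` field name comes from the schema above):

```ruby
# Repository becomes :ready once the last document enqueued for embeddings has its
# embedding field populated in the vector store. `lookup_document` is hypothetical.
def embeddings_complete?(repository)
  doc = lookup_document(repository.initial_indexing_last_queued_item)
  doc.present? && doc['embeddings_v1'].present?
end

repository.update!(state: :ready) if embeddings_complete?(repository)
```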
Example flow for a namespace with one project
```mermaid
flowchart TD
%% Main process nodes
start([Start]) --> findNamespace[Find eligible namespaces]
findNamespace --> createEN[Create EnabledNamespace<br>for CompanyX<br>State: :pending]
createEN --> findProjects[Find eligible projects<br>in CompanyX namespace]
%% Repository creation
findProjects --> createRepo[Create Repository record<br>for Project1<br>State: :pending]
createRepo --> markENReady[Update EnabledNamespace<br>State: :ready]
%% Repository processing
markENReady --> project1Repo[Repository: Project1<br>State: :pending]
project1Repo --> project1Queue[Enqueue RepositoryIndexWorker]
project1Queue --> project1Index[Update Repository State:<br>:code_indexing_in_progress]
project1Index --> project1CodeIndex[Index code chunks<br>via elasticsearch-indexer]
project1CodeIndex --> project1Commit[Set last_commit to indexed SHA]
project1Commit --> project1EmbedQueue[Update Repository State:<br>:embedding_indexing_in_progress]
project1EmbedQueue --> project1LastItem[Set initial_indexing_last_queued_item<br>to highest document ID]
project1LastItem --> project1Timestamp[Set indexed_at timestamp]
project1Timestamp --> project1Embeds[Process embeddings<br>asynchronously]
project1Embeds --> project1Check{Embeddings<br>complete?}
project1Check -->|Yes| project1Ready[Update Repository State:<br>:ready]
project1Check -->|No| project1Embeds
%% Completion
project1Ready --> complete([Indexing Complete])
%% Task Labels - using different style
saas_task>"saas_initial_indexing"] -.- findNamespace
saas_task -.- createEN
process_task>"process_pending_enabled_namespace"] -.- findProjects
process_task -.- createRepo
process_task -.- markENReady
index_task>"index_repository"] -.- project1Repo
index_task -.- project1Queue
index_task -.- project1Index
index_task -.- project1CodeIndex
index_task -.- project1Commit
index_task -.- project1EmbedQueue
index_task -.- project1LastItem
index_task -.- project1Timestamp
elastic_task>"elasticsearch-indexer"] -.- project1CodeIndex
embed_task>"ActiveContext framework"] -.- project1Embeds
ready_task>"mark_repository_as_ready"] -.- project1Check
ready_task -.- project1Ready
```
Implementation Notes
- The system follows a state machine pattern for tracking repository state.
- All tasks process in batches to reduce long queries and memory load.
- `RepositoryIndexWorker` implements a lock mechanism longer than the indexer timeout to ensure one-at-a-time processing (see the sketch after this list).
- The entire system is tied to the currently `active` connection (only one active connection at a time is permitted).
- If a failure occurs during indexing, the repository is marked as `:failed` and the error is recorded in `last_error`.
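A sketch of the lock behaviour, assuming GitLab's `Gitlab::ExclusiveLease` (the key format and timeout values are illustrative):

```ruby
# Hold a lease longer than the indexer timeout so only one RepositoryIndexWorker
# processes a given repository at a time.
INDEXER_TIMEOUT = 30.minutes
LEASE_TIMEOUT   = INDEXER_TIMEOUT + 5.minutes

def with_repository_lock(repository)
  lease_key = "ai_active_context_code_indexing:#{repository.id}"
  uuid = Gitlab::ExclusiveLease.new(lease_key, timeout: LEASE_TIMEOUT.to_i).try_obtain
  return unless uuid # another worker holds the lock; skip this run

  yield
ensure
  Gitlab::ExclusiveLease.cancel(lease_key, uuid) if uuid
end
```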
Alternative Solutions
Indexing and chunking done in Rails
Call Gitaly from Rails to obtain code blobs, use a dedicated chunker in Ruby/Go/Rust to split content, enrich the data with PostgreSQL, generate embeddings through the AI Gateway, and index the resulting vectors into the vector store.
```mermaid
sequenceDiagram
title Direct Processing Without the Indexer
participant Rails
participant Gitaly
participant Chunker
participant PostgreSQL
participant AIGateway
participant VectorStore
Rails->>Gitaly: Request code blobs
Gitaly-->>Rails: Return code blobs
Rails->>Chunker: Send content for chunking
Note right of Chunker: Ruby/Go/Rust Chunker
Chunker-->>Rails: Return code chunks
Rails->>PostgreSQL: Get metadata for enrichment
PostgreSQL-->>Rails: Return metadata
Rails->>AIGateway: Request embeddings for chunks
AIGateway-->>Rails: Return embeddings
Rails->>VectorStore: Index chunks with embeddings
VectorStore-->>Rails: Confirm indexing
```
Indexing and chunking done in the Go Indexer, with the chunks returned to Rails
Use the Go-based indexer to extract and chunk code, then send the results back to Rails via stdout. Rails then enriches the data with PostgreSQL and indexes it into the vector store. Embeddings are either generated in the same process before indexing (direct) or in a separate process (deferred).
```mermaid
sequenceDiagram
title Option 2: Indexer Returns Code and Chunks to Rails
participant Rails
participant Indexer
participant PostgreSQL
participant AIGateway
participant VectorStore
Rails->>Indexer: Request to extract & chunk code
Note right of Indexer: Go-based indexer accesses<br/>Gitaly directly
Indexer-->>Rails: Return chunks via stdout
Rails->>PostgreSQL: Get metadata for enrichment
PostgreSQL-->>Rails: Return metadata
alt Direct Embedding
Rails->>AIGateway: Request embeddings for chunks
AIGateway-->>Rails: Return embeddings
Rails->>VectorStore: Index chunks with embeddings
else Deferred Embedding
Rails->>VectorStore: Index chunks without embeddings
Rails->>Rails: Queue embedding generation
Rails->>AIGateway: Request embeddings (async)
AIGateway-->>Rails: Return embeddings
Rails->>VectorStore: Update with embeddings
end
VectorStore-->>Rails: Confirm indexing
```
Pros and Cons of solutions
| Option | Pros | Cons |
|---|---|---|
| Option 1: Indexing and chunking done in the Go Indexer, with the chunks immediately stored in vector storage | • More performant indexing of code • Separation of concerns: indexing code and embeddings is separate • Better deduplication handling for rapidly changing files | • Requires more effort to implement clients and adapters for all vector stores • Makes the indexer stateful • The bottleneck for indexing is still on the embedding generation side |
| Option 2: Indexing and chunking done in Rails | • Familiar Ruby technology for all engineers • Faster implementation timeline | • Slower processing for getting code blobs (up to 50x slower than the Go solution) • Requires building a service to get blobs from Gitaly |
| Option 3: Indexing and chunking done in the Go Indexer, with the chunks returned to Rails | • Significant performance boost for getting code from Gitaly • Type safety • Binary is available in all self-managed installations | • Requires Go expertise for development • Shared binary ownership between teams |
Common Implementation Approach
All options
- Use the AI abstraction layer
- Process references using Sidekiq workers
- Re-enqueue failed references for retry