git embed

Inspiration

We believe every piece of text and code ever written should be embedded into a single semantic knowledge index, so we started with GitHub.

What it does

Semantic search across every file in every GitHub repo (soon).

We generated embeddings using Instructor for every code file in the top 100 projects across the most common languages: C, Python, Java, C++, Go, JavaScript, TypeScript, and Rust.

How we built it

Generate a csv of top projects
Clone all repos to disk
Embed all files recursively using a process orchestrator
Postprocess the data
Host on a virtual machine using a nearest neighbors vector index for rapid plugin access
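
A minimal sketch of steps 3 to 5 above, under loud assumptions: `toy_embed` is a hashed bag-of-tokens stand-in for the Instructor model, the extension set is a guess based on the language list, and the brute-force cosine scan stands in for whatever nearest-neighbors index the hosted service uses. All function names here are hypothetical.

```python
import math
import os
import zlib
from typing import Callable, Iterator, List, Tuple

# Assumed extensions; the write-up names languages, not file suffixes.
CODE_EXTENSIONS = {".c", ".py", ".java", ".cpp", ".go", ".js", ".ts", ".rs"}

Vector = List[float]


def iter_code_files(root: str) -> Iterator[str]:
    """Yield code file paths under root, skipping .git directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into .git.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            if os.path.splitext(name)[1] in CODE_EXTENSIONS:
                yield os.path.join(dirpath, name)


def toy_embed(text: str, dim: int = 64) -> Vector:
    """Stand-in embedding (NOT Instructor): hashed token counts, unit length."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_index(root: str, embed: Callable[[str], Vector]) -> List[Tuple[str, Vector]]:
    """Embed every code file under root and return (path, vector) pairs."""
    index = []
    for path in iter_code_files(root):
        with open(path, encoding="utf-8", errors="ignore") as fh:
            index.append((path, embed(fh.read())))
    return index


def search(index: List[Tuple[str, Vector]], query_vec: Vector, k: int = 3) -> List[Tuple[float, str]]:
    """Brute-force cosine search; assumes all vectors are unit length."""
    scored = [
        (sum(a * b for a, b in zip(vec, query_vec)), path)
        for path, vec in index
    ]
    scored.sort(reverse=True)
    return scored[:k]
```

A linear scan like this is only workable for small indexes; at tens of gigabytes of embeddings an approximate nearest-neighbors structure becomes necessary, which is why the hosting step calls for a vector index.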

Challenges we ran into

Coordinating GPU/CPU parallelism across a large volume of highly structured code
Determining ideal prompts for the asymmetric embedding transformer
Structuring and hosting the large volume of embedding data and metadata for real-time search
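
To illustrate what "asymmetric" means for the prompts, here is a small sketch in which queries and documents get different instructions before encoding. The instruction strings and helper names are hypothetical; the prompts we actually settled on are not reproduced here. Instructor-style encoders take an instruction alongside each text, so each helper returns an `[instruction, text]` pair.

```python
from typing import List

# Hypothetical instruction strings, for illustration only.
QUERY_INSTRUCTION = "Represent the question for retrieving relevant source code:"
DOCUMENT_INSTRUCTION_TEMPLATE = "Represent the {language} source file for retrieval:"


def query_pair(query: str) -> List[str]:
    """Pair a natural-language query with the query-side instruction."""
    return [QUERY_INSTRUCTION, query.strip()]


def document_pair(code: str, language: str) -> List[str]:
    """Pair a code snippet with a document-side instruction naming its language."""
    return [DOCUMENT_INSTRUCTION_TEMPLATE.format(language=language), code]
```

The two sides land in the same vector space but are encoded under different instructions, so a short question and a long source file can still match.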

Accomplishments that we're proud of

Generated 30GB of semantic embeddings in under 12 hours
The system can identify key files across the entirety of GitHub (well, the popular head of the Pareto distribution at least) in response to natural-language queries
Search is not restricted to whole files: sub-document embeddings let it identify the exact lines relevant to the user's query
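
One simple way to get sub-document embeddings is to split each file into overlapping line windows and embed each window separately, keeping the line range as metadata. The sketch below assumes that approach; the window and overlap sizes, the `Chunk` type, and `chunk_lines` are illustrative choices, not the values or names used in the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    path: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    text: str


def chunk_lines(path: str, source: str, window: int = 40, overlap: int = 10) -> List[Chunk]:
    """Split source into overlapping windows of lines, tracking line ranges."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    lines = source.splitlines()
    step = window - overlap
    chunks: List[Chunk] = []
    for start in range(0, len(lines), step):
        end = min(start + window, len(lines))
        chunks.append(Chunk(path, start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # the last window already reaches the end of the file
    return chunks
```

Each chunk is embedded on its own, so a search hit can be reported as a path plus a line range rather than just a file name.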

What we learned

Embedding all of GitHub takes a long time
The Python GIL is the worst
Instructor embeddings are about as good as ada
Hacking massively parallel data pipelines around SQLite databases is actually fine
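
As a rough picture of the SQLite side, the sketch below stores each embedding as a float32 BLOB keyed by file path and starting line, and commits writes in batches. The schema, the WAL setting, and the function names are all assumptions made for this example, not the project's actual layout.

```python
import sqlite3
from array import array
from typing import List, Sequence, Tuple

# Hypothetical schema: one row per embedded chunk.
SCHEMA = """
CREATE TABLE IF NOT EXISTS embeddings (
    path TEXT NOT NULL,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL,
    vector BLOB NOT NULL,
    PRIMARY KEY (path, start_line)
)
"""

Row = Tuple[str, int, int, Sequence[float]]


def open_store(db_path: str) -> sqlite3.Connection:
    """Open (or create) the store; WAL lets readers proceed during writes."""
    conn = sqlite3.connect(db_path, timeout=30.0)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(SCHEMA)
    return conn


def write_batch(conn: sqlite3.Connection, rows: Sequence[Row]) -> None:
    """Insert a batch of (path, start, end, vector) rows in one transaction."""
    packed = [(p, s, e, array("f", vec).tobytes()) for p, s, e, vec in rows]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO embeddings VALUES (?, ?, ?, ?)", packed
        )


def read_vector(conn: sqlite3.Connection, path: str, start_line: int) -> List[float]:
    """Load one embedding back as a list of floats (float32 precision)."""
    row = conn.execute(
        "SELECT vector FROM embeddings WHERE path = ? AND start_line = ?",
        (path, start_line),
    ).fetchone()
    if row is None:
        raise KeyError((path, start_line))
    vec = array("f")
    vec.frombytes(row[0])
    return vec.tolist()
```

SQLite allows only one writer at a time per database, so batching many rows into each transaction (or giving each worker its own file and merging afterwards) keeps parallel workers from queuing on the write lock.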

What's next for git embed

Finish embedding every project ever
Integrate with my open-source project navajo for more intelligent handling of embeddings and retrieved snippets
Extend the system to power general document embeddings instead of just codebases (internet scrapes, PDF collections, legal databases, SEC filings... the possibilities are endless)
Integrate with our database of arXiv embeddings to find implementations of papers and correlate research with code
Embed git histories or diffs to create a system capable of predicting the next diff embedding for help with codegen

Built With

embeddings
fastapi
python
sqlite

Updates

Private user started this project — Apr 16, 2023 02:41 PM EDT
