Inspiration

We think every piece of text or code ever written should be embedded into a semantic knowledge index, so we started with GitHub.

What it does

Semantic search across every file in every GitHub repo (soon).

We generated embeddings using Instructor for every code file in the top 100 projects across the most common languages: C, Python, Java, C++, Go, JavaScript, TypeScript, and Rust.

How we built it

  1. Generate a csv of top projects
  2. Clone all repos to disk
  3. Embed all files recursively using a process orchestrator
  4. Postprocess the data
  5. Host on a virtual machine using a nearest neighbors vector index for rapid plugin access
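Steps 3 and 5 can be sketched in a few lines of standard-library Python. This is only an illustration under loud assumptions: toy_embed is a hashed bag-of-tokens stand-in for the Instructor model, the brute-force cosine scan stands in for the nearest neighbors vector index, and every function name, the extension list, and DIM are made up for the example.

```python
import hashlib
import math
import os
import re

DIM = 256  # toy vector size, chosen arbitrarily for this sketch

def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag of tokens, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z]+", text.lower()):
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def iter_code_files(root, extensions=(".c", ".py", ".java", ".cpp", ".go", ".js", ".ts", ".rs")):
    """Walk a cloned repo recursively, skipping the .git directory."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            if name.endswith(extensions):
                yield os.path.join(dirpath, name)

def build_index(root):
    """Embed every code file under root; returns a list of (path, vector)."""
    index = []
    for path in iter_code_files(root):
        with open(path, encoding="utf-8", errors="ignore") as fh:
            index.append((path, toy_embed(fh.read())))
    return index

def search(index, query, k=3):
    """Brute-force cosine search (vectors are unit length, so dot product suffices)."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), path) for path, vec in index]
    return sorted(scored, reverse=True)[:k]
```

Calling build_index on a directory of cloned repos and then search(index, "sort list") returns (score, path) pairs, best match first. At real scale the linear scan would be replaced by an approximate nearest neighbors index.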

Challenges we ran into

  • Coordinating GPU/CPU parallelism across a large volume of highly structured code
  • Determining ideal prompts for the asymmetric embedding transformer
  • Structuring and hosting the large volume of embedding data and metadata for real-time search

Accomplishments that we're proud of

  • Generated 30GB of semantic embeddings in under 12 hours
  • The system can identify key files across the entirety of GitHub (well, the head of the Pareto distribution at least) in response to natural-language queries
  • Search is not restricted to whole files: sub-document embeddings let it identify the exact lines relevant to the user's query
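One way sub-document embeddings can point back at exact lines is to split each file into overlapping windows and store the line range next to each chunk. A minimal sketch, assuming a fixed window and stride (both values, and the function name, are illustrative rather than the settings we used):

```python
def chunk_with_line_ranges(text, window=20, stride=10):
    """Split text into overlapping chunks of `window` lines.

    Returns (first_line, last_line, chunk_text) tuples with 1-based,
    inclusive line numbers so a search hit can be mapped back to the file.
    """
    lines = text.splitlines()
    chunks = []
    start = 0
    while start < len(lines):
        end = min(start + window, len(lines))
        chunks.append((start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # the last window already reaches the end of the file
        start += stride
    return chunks
```

Each chunk is embedded on its own, and the stored (first_line, last_line) pair is returned with the match.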

What we learned

  • Embedding all of GitHub takes a long time
  • The Python GIL is the worst
  • Instructor embeddings are about as good as ada
  • Hacking massively parallel data pipelines around SQLite databases is actually fine
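As an illustration of the SQLite point, a database file can double as a work queue that several worker processes share. The sketch below is an assumed design, not our actual schema: the table layout, function names, and batch size are all hypothetical. Each worker claims a batch inside a BEGIN IMMEDIATE transaction, which takes the write lock before reading so two workers cannot claim the same rows.

```python
import sqlite3

def connect(db_path):
    # isolation_level=None puts the connection in autocommit mode, so
    # transactions are controlled explicitly with BEGIN/COMMIT below.
    return sqlite3.connect(db_path, timeout=30, isolation_level=None)

def init_queue(conn, paths):
    """Create the queue table and enqueue file paths as 'pending'."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, "
        "status TEXT NOT NULL DEFAULT 'pending', "
        "worker TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO files (path) VALUES (?)", [(p,) for p in paths]
    )

def claim_batch(conn, worker, batch_size=2):
    """Atomically mark up to batch_size pending files as claimed by worker."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        rows = conn.execute(
            "SELECT path FROM files WHERE status = 'pending' "
            "ORDER BY path LIMIT ?",
            (batch_size,),
        ).fetchall()
        paths = [row[0] for row in rows]
        conn.executemany(
            "UPDATE files SET status = 'claimed', worker = ? WHERE path = ?",
            [(worker, p) for p in paths],
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return paths
```

Each worker process opens its own connection to the same file and loops on claim_batch until it returns an empty list.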

What's next for git embed

  • Finish embedding every project ever
  • Integrate with my open source project navajo for more intelligent handling of embeddings and retrieved snippets
  • Extend the system to power general document embeddings instead of just codebases (internet scrapes, PDF collections, legal databases, SEC filings... the possibilities are endless)
  • Integrate with our database of arXiv embeddings to find implementations of papers and correlate research with code
  • Embed git histories or diffs to create a system capable of predicting the next diff embedding for help with codegen
