Inspiration

Navigating the world of ML research is daunting: the volume of new work grows at an accelerating pace. We wanted to see whether LLMs could help organize scientific research into a more legible interface.

What it does

Codex uses LLMs to parse and index the open body of ML research into a rich knowledge graph. Given a topic such as "3D segmentation" or "speculative decoding", Codex queries this knowledge graph to construct an on-the-fly summary of the topic with links to any relevant datasets, benchmarks, and models.
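To make the idea concrete, here is a minimal sketch of the kind of typed-neighbor lookup such a query boils down to. The node names, the `kind` attribute, and the networkx backing store are all illustrative assumptions, not our actual schema:

```python
# Illustrative only: a tiny in-memory stand-in for the knowledge graph.
import networkx as nx

graph = nx.Graph()
graph.add_node("speculative decoding", kind="topic")
graph.add_node("Medusa", kind="model")
graph.add_node("MT-Bench", kind="benchmark")
graph.add_edge("speculative decoding", "Medusa")
graph.add_edge("speculative decoding", "MT-Bench")

def related_resources(topic: str) -> dict:
    """Group a topic's graph neighbors by kind (dataset, benchmark, model, ...)."""
    resources: dict[str, list[str]] = {}
    for neighbor in graph.neighbors(topic):
        kind = graph.nodes[neighbor]["kind"]
        resources.setdefault(kind, []).append(neighbor)
    return resources

print(related_resources("speculative decoding"))
# {'model': ['Medusa'], 'benchmark': ['MT-Bench']}
```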

How we built it

We finetuned a Mistral-7B variant that takes in any arXiv paper and extracts structured JSON containing its key findings, techniques, and topics. We ran this model on 50k CS papers to build a rich knowledge graph of concepts, which our agent uses to generate Wikipedia-style pages for any concept you want.
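As a rough sketch of what that extraction step could look like with Hugging Face transformers (the checkpoint name, prompt, and JSON keys below are assumptions; our actual finetuned weights and prompt format differ):

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name; stands in for our finetuned Mistral-7B variant.
MODEL_ID = "codex/mistral-7b-paper-extractor"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

PROMPT_TEMPLATE = (
    "Extract the key findings, techniques, and topics from the paper below "
    "as JSON with keys 'findings', 'techniques', 'topics'.\n\n{paper_text}\n\nJSON:"
)

def extract_structure(paper_text: str) -> dict:
    """Run the finetuned model on one paper and parse its JSON output."""
    inputs = tokenizer(
        PROMPT_TEMPLATE.format(paper_text=paper_text), return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Decode only the newly generated tokens (the model's JSON answer).
    answer = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return json.loads(answer)
```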

Challenges we ran into

Resolving mentions of the same topic across different papers was much more difficult than expected: the same concept shows up under many surface forms (e.g., "ViT", "Vision Transformer", "vision transformers"), so naive string matching either merges distinct concepts or splits identical ones. One plausible mitigation is sketched below.
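A minimal sketch of one common approach, embedding mentions and clustering near-duplicates. This is an assumption about how resolution could be done, not necessarily our pipeline; the embedding model and distance threshold are illustrative:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

mentions = ["ViT", "Vision Transformer", "vision transformers", "ResNet-50"]

# Embed each mention; normalized vectors make cosine distance meaningful.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(mentions, normalize_embeddings=True)

# Mentions within cosine distance 0.3 of each other collapse into one node.
clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.3, metric="cosine", linkage="average"
)
labels = clustering.fit_predict(embeddings)
print(dict(zip(mentions, labels)))
# e.g. {'ViT': 0, 'Vision Transformer': 0, 'vision transformers': 0, 'ResNet-50': 1}
```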

Accomplishments that we're proud of

Our finetuned model is extremely cost-effective: it let us process 50,000 papers totaling ~250M tokens for roughly $20. Parsing the same corpus with GPT-4 or Mistral Large would have cost more than 100x as much.
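A quick back-of-the-envelope check of that claim (the GPT-4-class price below is an assumed illustrative figure, not a quote from our bill):

```python
# Reported numbers from our run.
tokens = 250_000_000            # ~250M tokens across 50k papers
our_cost = 20.0                 # approximate spend with the finetuned model

per_million = our_cost / (tokens / 1_000_000)
print(f"${per_million:.3f} per 1M tokens")          # ≈ $0.08 per 1M tokens

gpt4_per_million = 10.0         # assumed GPT-4-class input price per 1M tokens
print(f"{gpt4_per_million / per_million:.0f}x more expensive")  # ≈ 125x
```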

What we learned

We learned a lot about getting LLMs to emit well-formed structured outputs and about finetuning models.

What's next for Codex

We want to index all of arXiv and improve our entity-resolution pipeline.

Built With

  • huggingface
  • mistral
  • mistral-7b
  • python
  • vite