
DocQuery: Convert Markdown Documentation from GitHub into an Intelligent, Searchable Knowledge Base with RAG

Inspiration

Documentation is often written in markdown files (.md or .mdx) and stored in GitHub repositories. While these files are effective for static documentation, they lack interactivity and the ability to power intelligent applications. This inspired the creation of DocQuery, a RAG (Retrieval-Augmented Generation) application that turns markdown documentation into a searchable and intelligent knowledge base for LLMs.

What it Does

DocQuery allows users to:

Ingest markdown files directly from GitHub repositories.
Convert these files into a structured knowledge base optimized for LLMs.
Query the documentation using natural language, enabling users to get precise answers quickly.
Leverage hybrid retrieval techniques for enhanced search accuracy.
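
To make the natural-language query step concrete, here is a minimal sketch of how retrieved chunks might be assembled into a grounded prompt before it is sent to the LLM. The function name `build_prompt`, the chunk fields `source` and `text`, and the character budget are all assumptions made for illustration; the actual prompt format used by DocQuery is not described here.

```python
def build_prompt(question, chunks, max_chars=2000):
    """Assemble a grounded prompt from retrieved chunks, within a size budget."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        # Number each excerpt so the model can refer back to its source.
        entry = f"[{i}] ({chunk['source']})\n{chunk['text']}"
        if used + len(entry) > max_chars:
            break  # stop once the context budget is exhausted
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the documentation excerpts below. "
        "If the answer is not in them, say so.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical retrieved chunks for one query.
chunks = [
    {"source": "docs/install.md", "text": "Run npm install to set up the project."},
    {"source": "docs/config.md", "text": "Set API_KEY in the .env file."},
]
prompt = build_prompt("How do I install the project?", chunks)
```

Numbering the excerpts and naming their source files is one simple way to let answers point back to the documentation they came from.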

How We Built It

Markdown Ingestion: A pipeline to fetch, parse, and preprocess markdown files from repositories.
Knowledge Base Creation: Indexed the content in MongoDB with both dense (vector-based) and sparse (keyword-based) indexes to support hybrid retrieval.
RAG Implementation: Integrated LLMs to retrieve context and generate meaningful responses based on user queries.
User Interface: Built a sleek frontend for repository uploads and query handling.
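
The hybrid retrieval step has to merge a vector-based ranking with a keyword-based one. The writeup does not say how the two result lists are combined, so the sketch below uses Reciprocal Rank Fusion (RRF) purely as an assumed, commonly used option. The chunk identifiers and the constant `k=60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for one query; the ids are made-up chunk identifiers.
vector_hits = ["install.md#2", "config.md#1", "faq.md#4"]
keyword_hits = ["config.md#1", "api.md#3", "install.md#2"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["config.md#1", "install.md#2", "api.md#3", "faq.md#4"]
```

RRF only needs ranks, not raw scores, which avoids having to normalize vector similarity scores against keyword relevance scores.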

Challenges We Ran Into

File Selection: Building a UI that lets users choose which markdown files to ingest, since a repository often contains many .md files they do not want indexed.
Progress Reporting: Ingestion takes time, so the frontend needed to show real-time updates on its progress.
Efficient Retrieval: Achieving the right balance between fast query response times and retrieval accuracy.
User Experience: Simplifying a technically complex process into an intuitive user interface.
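
The first two challenges can be sketched together: filter a repository listing down to markdown files, let the user narrow the selection, then report progress as each file is ingested. The helper names, the shape of the event dictionary, and the `ingest_one` callback are hypothetical; in DocQuery the progress updates travel through the CopilotKit SDK rather than a plain generator.

```python
from pathlib import PurePosixPath

MARKDOWN_SUFFIXES = {".md", ".mdx"}


def list_markdown_files(repo_paths):
    """Return only the .md/.mdx paths from a repository file listing."""
    return [
        p for p in repo_paths
        if PurePosixPath(p).suffix.lower() in MARKDOWN_SUFFIXES
    ]


def ingest_with_progress(selected_paths, ingest_one):
    """Ingest files one by one, yielding a progress event after each."""
    total = len(selected_paths)
    for done, path in enumerate(selected_paths, start=1):
        ingest_one(path)  # placeholder for fetch + parse + index
        yield {
            "file": path,
            "done": done,
            "total": total,
            "percent": round(100 * done / total),
        }


repo_paths = ["README.md", "docs/guide.mdx", "src/app.py", "CHANGELOG.md"]
candidates = list_markdown_files(repo_paths)
selected = [p for p in candidates if p != "CHANGELOG.md"]  # user deselects one
events = list(ingest_with_progress(selected, ingest_one=lambda path: None))
```

Because the function is a generator, the caller can forward each event to the frontend as soon as it is produced instead of waiting for the whole run to finish.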

Accomplishments That We're Proud Of

Successfully converting markdown files into a functional knowledge base for LLMs.
Implementing a hybrid retrieval system that enhances search relevance.
Creating an intuitive interface that simplifies repository uploads and query handling.
Building a scalable and efficient system to process large repositories.

Technology Stack

Frontend:
- UI Framework: Next.js with shadcn/ui for a stylish and performant user interface.
- AI Assistance: CopilotKit for embedding the AI chat assistant in the application.
- Agent Interaction: CopilotKit CoAgents for communication between the UI and the LangGraph agent.
- Real-time Updates: CopilotKit Python SDK for real-time state updates and event emission between the CopilotKit UI and the CoAgent.
Backend:
- Framework: FastAPI for high-performance and efficient API development.
- LLMs:
  - LangChain for orchestrating LLM interactions and workflows.
  - LangGraph for building the agent as a stateful, graph-structured workflow.
  - Gemini for efficient text embeddings.
  - Llama 3.3 70B (from TogetherAI) for advanced conversational capabilities.
- Data Storage:
  - MongoDB for both vector and full-text search, enabling hybrid retrieval for optimal search results.
- Data Management: Prisma for efficient data modeling and access.
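
To illustrate how MongoDB can serve both sides of hybrid retrieval, the sketch below builds two aggregation pipelines as plain dictionaries: one using the `$vectorSearch` stage and one using the `$search` stage. It assumes MongoDB Atlas Vector Search and Atlas Search are in use, which the stack description does not state explicitly. The index names (`chunk_vector_index`, `chunk_text_index`) and field names (`embedding`, `text`) are hypothetical.

```python
def vector_pipeline(query_vector, limit=5):
    """Aggregation pipeline for dense (vector) retrieval."""
    return [
        {
            "$vectorSearch": {
                "index": "chunk_vector_index",  # hypothetical index name
                "path": "embedding",            # hypothetical vector field
                "queryVector": query_vector,
                "numCandidates": limit * 20,    # oversample, then keep `limit`
                "limit": limit,
            }
        },
        {"$project": {"text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]


def keyword_pipeline(query, limit=5):
    """Aggregation pipeline for sparse (full-text) retrieval."""
    return [
        {
            "$search": {
                "index": "chunk_text_index",    # hypothetical index name
                "text": {"query": query, "path": "text"},
            }
        },
        {"$limit": limit},
        {"$project": {"text": 1, "score": {"$meta": "searchScore"}}},
    ]


dense = vector_pipeline([0.1, 0.2, 0.3], limit=5)
sparse = keyword_pipeline("how to configure the API key", limit=5)
```

Each pipeline would be passed to `collection.aggregate(...)` with a driver such as PyMongo; the two result lists can then be merged by whatever fusion strategy the application chooses.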

What We Learned

Advanced markdown parsing techniques for handling complex documentation files.
Effective implementation of RAG pipelines for real-world use cases.
The importance of hybrid retrieval for combining keyword-based and semantic searches.
Designing systems that prioritize both performance and user experience.
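
As one example of the markdown parsing involved, the sketch below splits a document into chunks at its headings and records each chunk's heading path, while ignoring `#` lines inside fenced code blocks. This is an illustrative approach under simple assumptions (ATX-style `#` headings, backtick fences), not a description of DocQuery's actual parser.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens and closes a fenced code block


def split_by_headings(markdown_text):
    """Split markdown into chunks, one per heading, keeping the heading path."""
    chunks = []
    path = []  # stack of (level, title) for the current heading trail
    body = []
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(t for _, t in path), "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave deeper or sibling sections
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = "\n".join([
    "# Guide", "Intro text.",
    "## Install", "Run the installer.", FENCE, "# not a heading", FENCE,
    "## Usage", "Call the API.",
])
chunks = split_by_headings(doc)
```

Keeping the heading path with each chunk gives the retriever and the LLM some context about where in the documentation a passage came from.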

What's Next for DocQuery

GitHub Integration: Automate repository ingestion with real-time updates.
Customizable Retrieval: Allow users to define custom indexing and query preferences.
Multi-Format Support: Extend ingestion to other file types such as .pdf and .html.
Enhanced UI: Introduce analytics and visualization tools for better insights into documentation usage.