PageIndex

IT System Data Services

London, England · 1,206 followers

Building Infra for Next-Gen Reasoning-based RAG

About us

Building Infra for Next-Gen Reasoning-based RAG

Website
https://pageindex.ai
Industry
IT System Data Services
Company size
2-10 employees
Headquarters
London, England
Type
Public Company
Founded
2023


Updates

  • PageIndex reposted this

    View profile for SHIV PRAKASH YADAV

    Tredence Inc. · 564 followers

    🧠 Built a Vectorless RAG — no embeddings, no vector DB, no cosine similarity.

    Most RAG pipelines follow the same pattern:
    → Chunking → Embedding → Vector DB → Query by similarity search

    To challenge that assumption, I built a RAG system using Google Gemini + PageIndex that skips the vector layer entirely — and yes, Gemini because it's free 😂

    ⚙️ How it works:
    Instead of embedding-based retrieval, it uses PageIndex to index document pages directly. At query time, Gemini reasons over the indexed pages to find and synthesize relevant information — no embeddings generated, no ANN lookup.

    ✅ Advantages:
    • No vector DB setup or maintenance overhead
    • No embedding model needed — reduces infra cost and latency
    • Simpler pipeline: fewer moving parts = easier to debug
    • Works well for structured/paginated documents (PDFs, reports)
    • Faster to prototype and deploy

    ⚠️ Limitations:
    • Less effective for large unstructured corpora where semantic similarity shines
    • Dependent on Gemini's reasoning quality for relevance judgments
    • PageIndex is optimized for document-style inputs; less flexible for arbitrary text chunks

    This is more of an architectural experiment than a production-ready system, but it raises an interesting question: do we always need vectors for retrieval-augmented generation? For certain document-heavy, structured use cases — maybe not.

    Created a simple repo to test it out:
    🔗 Repo: https://lnkd.in/gE8wyRDX

    Inspired by Krish Naik's work — his video was a key reference while building this. Check out his channel if you're diving into AI/ML!
    🔗 Video URL: https://lnkd.in/ghzKykWg
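A minimal runnable sketch of the vectorless flow the post describes: index pages directly, then ask a reasoning model which pages matter. All function names are illustrative, and the Gemini relevance call is replaced by a keyword stand-in — this is not the actual PageIndex or Gemini API.

```python
# Illustrative sketch of vectorless retrieval: index pages directly,
# then ask a reasoning model which pages look relevant.
# `keyword_judge` is a toy stand-in for the Gemini relevance call.

def build_page_index(pages):
    """Keep page number, a short preview, and the raw text -- no vectors."""
    return [{"page": i + 1, "preview": text[:80], "text": text}
            for i, text in enumerate(pages)]

def select_pages(index, query, judge):
    """Ask the (stand-in) reasoning model which indexed pages look relevant."""
    return [entry for entry in index if judge(query, entry["preview"])]

def keyword_judge(query, preview):
    # Naive stand-in for an LLM judgment: any query word appears in the preview.
    return any(word in preview.lower() for word in query.lower().split())

pages = ["Revenue grew 12% in 2023, driven by cloud services.",
         "Total debt stood at $4.2B as of December 2023.",
         "Our mission is to empower every developer."]
index = build_page_index(pages)
hits = select_pages(index, "total debt", keyword_judge)
print([h["page"] for h in hits])  # pages the stand-in judge flagged
```

A real version would replace `keyword_judge` with a Gemini prompt over page previews; the point is that the pipeline has no embedding or ANN step anywhere.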

  • PageIndex reposted this

    View profile for Hitesh Malhotra

    Michigan State University · 2K followers

    I stopped using embeddings for RAG - and my system got more accurate.

    Like most people, I started with the standard RAG pipeline:
    • chunk documents
    • generate embeddings
    • store in a vector DB
    • retrieve top-k similar chunks

    It worked… until it didn't. Even when the correct answer existed, the system would:
    • retrieve the wrong chunk
    • miss exact sections
    • lose context due to chunking

    That's when I tried something different. Instead of embeddings, I built a vectorless RAG system using a structure-aware approach (PageIndex).

    Here's what changed:
    → No chunking
    → No embeddings
    → No vector database

    Instead:
    • the document is converted into a hierarchical tree
    • the LLM navigates the structure
    • retrieval becomes reasoning-based, not similarity-based

    The result?
    • More accurate retrieval for structured documents
    • Better handling of section-based queries
    • Coherent, context-preserving answers

    One key realization from this project: traditional RAG treats documents like a 'list of chunks'. But real documents are 'hierarchies of meaning', and that difference matters.

    This approach worked especially well for:
    • research papers
    • technical documents
    • policy/legal content

    It's not a replacement for vector RAG - but it's a strong alternative depending on the use case.

    I've written a detailed breakdown of the system, architecture, and tradeoffs here: http://bit.ly/4dNy13b

    Curious - have you seen similar limitations with vector-based RAG in real-world systems?

    #AppliedAI #RAG #VectorEmbeddings #AI #LLM #MachineLearning #GenerativeAI
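A toy sketch of the tree-navigation idea above: the document becomes a hierarchy of titled sections, and retrieval walks it top-down. The LLM's "which section should I open next?" call is replaced here by simple title-word overlap; none of these names are the actual PageIndex API.

```python
# Toy version of reasoning-based tree retrieval: walk a document hierarchy
# from the root to a leaf instead of ranking flat chunks by similarity.

tree = {
    "title": "10-K Filing",
    "children": [
        {"title": "Business Overview", "children": [], "text": "..."},
        {"title": "Financial Statements", "children": [
            {"title": "Balance Sheet", "children": [],
             "text": "Total debt: $4.2B"},
            {"title": "Income Statement", "children": [],
             "text": "Revenue: $18.1B"},
        ]},
    ],
}

def pick_child(node, query):
    # Stand-in for the LLM's section choice: title-word overlap with the query.
    words = set(query.lower().split())
    return max(node["children"],
               key=lambda c: len(words & set(c["title"].lower().split())))

def navigate(tree, query):
    """Walk from the root to a leaf, recording the section path taken."""
    node, path = tree, [tree["title"]]
    while node["children"]:
        node = pick_child(node, query)
        path.append(node["title"])
    return path, node.get("text", "")

path, answer = navigate(tree, "balance sheet debt in financial statements")
print(path)    # section path from root to leaf
print(answer)
```

Because the walk follows structure, a section-based query lands on the section that actually holds the answer, rather than on whichever chunk happens to share vocabulary with the question.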

  • PageIndex reposted this

    View profile for Mingtian Zhang

    PageIndex · 2K followers

    A recent trending article exposed the rise of fake GitHub stars: https://lnkd.in/ep-ymjwr

    This isn't just growth hacking — it's distorting how the community judges real projects. Open-source developers, users, and investors should stand together against this.

    We're proud that PageIndex's 25k stars are 100% organic — earned from people who actually use and believe in what we're building.

    "Fake it until you make it" is poison — honesty and trust win in the long run.

  • PageIndex reposted this

    View profile for Chris Columbkille Biddle

    Helix Strategy Group, LLC · 2K followers

    I Stopped Using Vector Databases for RAG: PageIndex Vectorless RAG

    If you have been building RAG systems for a while, you know the frustration. You spend hours setting up embeddings, tuning chunk sizes, picking the right vector database, and then your system still returns the wrong answer. Not because it could not find something similar, but because "similar" and "relevant" are just not the same thing.

    If you haven't explored my RAG series yet, I highly recommend checking it out to better understand the concepts discussed here: RAG — Complete Tutorial: PART 08

    I ran into this problem when working with financial reports. A 120-page 10-K filing. My vector RAG system kept pulling chunks from the executive summary when the actual answer was buried in a footnote on page 87. Both sections had similar keywords and similar semantic meaning, but only one had the actual number I needed.

    That is when I found PageIndex.

    What Even Is PageIndex?

    PageIndex is an open-source RAG framework built by the team at VectifyAI. The core idea is simple but genuinely different from everything else out there. Instead of converting your document into vectors and doing a nearest-neighbor search, PageIndex builds a hierarchical tree structure from your document and then uses an LLM to reason its way to the right answer. No embeddings. No vector database. No chunking. They call it "vectorless RAG", and honestly the name is pretty accurate.

    The project was introduced in September 2025 and it gained a lot of attention fast, mostly because of one benchmark number: 98.7% accuracy on FinanceBench. For context, traditional vector RAG systems score around 50% on the same benchmark. That is not a small improvement. That is a completely different category of performance.

    The Problem with Vector RAG (That Nobody Talks About Enough)

    Before we get into how PageIndex works, let us be honest about why vector RAG struggles. The fundamental assumption of vector RAG is this: the text that is most semantically similar to your query is also the most relevant. That assumption breaks constantly in the real world.

    Think about it. If you ask "What was the company's total debt in 2023?", the chunks that come back with high cosine similarity might include:
    * The CEO's letter mentioning debt management strategy
    * A risk section discussing debt covenants
    * A glossary defining long-term debt

    All three are semantically close to your query. None of them are the actual answer, which is a specific number in the balance sheet on page 64.

    There are a few other issues too.
    * Chunking destroys context. When you split a 100-page document into 500-token chunks, you are throwing away the document's natural structure. A section that says "as mentioned in Table 3.2" is completely disconnected from Table 3.2 once you chunk them separately. The document has a logic and a flow. Chunking ignores all of that.
    * Cross-references…

    #genai #shared #ai
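The "chunking destroys context" point can be seen in a few lines of plain Python — a toy example with fixed-size character chunks standing in for token chunks:

```python
# The cross-reference problem in miniature: a fixed-size split separates
# "as mentioned in Table 3.2" from Table 3.2 itself.

doc = ("Section 3 discusses liquidity. As mentioned in Table 3.2, "
       "total debt matured in 2027. ... filler ... "
       "Table 3.2: Debt maturities -- 2027: $4.2B, 2028: $1.1B.")

def chunk(text, size=60):
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = chunk(doc)
ref = next(c for c in chunks if "As mentioned" in c)  # sentence citing the table
tbl = next(c for c in chunks if "Table 3.2:" in c)    # the table itself
print(chunks.index(ref) != chunks.index(tbl))  # True: different chunks
```

A similarity search that retrieves the referencing sentence never sees the table's numbers, because they now live in a different chunk with no link between them.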

  • PageIndex reposted this

    View profile for Matthew Overstreet

    IBM · 1K followers

    I built something nobody asked for. Let me tell you about it.

    Recently someone asked me if I'd heard of "PageIndex," document search without vector embeddings. I hadn't. But the idea stuck, and I realized it was scratching an itch I'd been ignoring: I don't love chunking.

    Not semantic chunking, that's a different conversation. I did semantic chunking work for a major tech book publisher. Books are too big even for BM25 to handle as monoliths. You have to break them up. But when you chunk on semantic boundaries, something interesting happens: you end up with something like a digital book page. Not constrained by paper and typesetting, but a coherent unit of thought, sized by meaning. That's valuable.

    But naive, uniform chunking? Chop every 512 tokens, embed, kNN, pray? That's searching a library by tearing out all the pages, shuffling them, and asking "which page looks most like my question?"

    You know what a librarian does? They navigate. Card catalog to section to shelf to page. Their tools are structure, context, hierarchy. And that's the idea I can't shake: context is relevancy. And relevancy is much bigger than similarity.

    Search for "R&D spending on AI" in a financial filing. A section called "Operating expenses and cost structure analysis" won't score well on similarity. The terms barely overlap. But any human reading the table of contents knows to look there. That's not similarity. That's understanding.

    But PageIndex only solves the vertical axis, drilling into a document. The horizontal axis, how documents relate across a corpus, is invisible to it. That's where OpenSearch comes in. Not just as a "fast first pass." OpenSearch provides the corpus-level context that PageIndex lacks. PageIndex provides the document depth that OpenSearch can't match.

    Four phases:
    → OpenSearch narrows 100M docs to 20 candidates (BM25 + kNN, <100ms)
    → LLM shortlists to 3-5 based on descriptions (corpus-level reasoning)
    → LLM navigates each document's tree structure (agentic, not one-shot)
    → Fetch content, generate answer with citations

    Is it slow? Yes. Is it expensive? Also yes. But the retrieval implications are interesting.

    I open-sourced it. It's a sketch, not a product. But there's something here, and the part that excites me most isn't even built yet: projecting PageIndex upward to build collection-level trees. Three levels of navigable hierarchy: corpus → document → passage.

    Tell me why this is wrong?

    #RAG #SearchEngineering #OpenSearch #InformationRetrieval #HybridSearch #OpenSource
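The four phases can be sketched as a pipeline of function calls. Everything external — OpenSearch, the shortlisting LLM, the navigating LLM — is replaced here by a deterministic stand-in, and all names and data are hypothetical:

```python
# Sketch of the four-phase hybrid flow, with OpenSearch and the LLM replaced
# by deterministic stand-ins. Real versions would make network calls.

def phase1_corpus_search(corpus, query, k=20):
    # Stand-in for OpenSearch BM25 + kNN: crude term-overlap ranking.
    score = lambda d: sum(w in d["summary"].lower() for w in query.lower().split())
    return sorted(corpus, key=score, reverse=True)[:k]

def phase2_shortlist(candidates, query, n=3):
    # Stand-in for LLM shortlisting by document descriptions.
    return candidates[:n]

def phase3_navigate(doc, query):
    # Stand-in for agentic navigation of one document's tree.
    return doc["sections"].get(query.split()[0], "")

def phase4_answer(passages):
    # Stand-in for answer generation with citations.
    return " | ".join(p for p in passages if p)

corpus = [
    {"id": "10k-acme", "summary": "Acme annual report, debt and revenue",
     "sections": {"debt": "Total debt: $4.2B (p. 87)"}},
    {"id": "blog-pasta", "summary": "A cooking blog about pasta", "sections": {}},
]
docs = phase2_shortlist(phase1_corpus_search(corpus, "debt levels"), "debt levels")
answer = phase4_answer(phase3_navigate(d, "debt levels") for d in docs)
print(answer)
```

The shape is the interesting part: a cheap corpus-wide pass narrows the candidates, then the expensive per-document reasoning only runs on a handful of them.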

  • PageIndex reposted this

    View profile for Pranesh S

    HARMAN India · 3K followers

    🚀 Built a system comparing Vector RAG vs Vectorless RAG

    🌐 Live Link: https://lnkd.in/g4rxfr-k

    Recently, I explored how retrieval systems are evolving beyond embeddings. So I built a demo comparing:
    🔵 Vector RAG (FAISS + embeddings)
    🟢 PageIndex (tree-based, vectorless RAG)

    🧠 Key insights:
    • Vector RAG relies on similarity search → can lose structure
    • PageIndex uses hierarchical trees → preserves document meaning
    • Reasoning-based retrieval improves accuracy for structured data

    📊 Features:
    ✔ Document → Tree transformation
    ✔ LLM reasoning over structure
    ✔ Tree visualization
    ✔ Interactive UI (Gradio)

    ⚡ Tech stack: Python • Groq • FAISS • PageIndex • Gradio

    📖 GitHub: https://lnkd.in/gs57MdMw

    🎯 Takeaway: AI systems are shifting
    ➡️ from retrieval based on similarity
    ➡️ to retrieval based on reasoning

    This project helped me understand how next-gen RAG systems will work.

    PageIndex #AI #RAG #GenAI #LLM #MachineLearning #ArtificialIntelligence #PageIndex #VectorDB #HuggingFace #BIT #BannariAmmanInstituteofTechnology #BITSathy #VectorlessRAG
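A toy side-by-side of the two retrieval styles the demo compares — plain-Python stand-ins only, with bag-of-words counts playing the role of embeddings (no FAISS, no Groq calls):

```python
# Toy comparison: cosine similarity over bag-of-words "embeddings"
# vs. a structure-aware pick by section title. Everything is a stand-in.
import math

def embed(text):
    # Toy embedding: word-count dictionary.
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = ["the company manages debt carefully and discusses debt strategy",
          "total borrowings: $4.2B"]
query = "how much debt does the company have"

# Vector-style retrieval: highest cosine similarity wins.
vector_pick = max(chunks, key=lambda c: cosine(embed(query), embed(c)))

# Tree-style retrieval: sections keep their titles; a reasoning step
# (hard-coded here) opens "Balance sheet" for a how-much question.
sections = {"Risk discussion": chunks[0], "Balance sheet": chunks[1]}
tree_pick = sections["Balance sheet"]

print(vector_pick)  # the chunk that only talks *about* debt
print(tree_pick)    # the section that actually holds the number
```

The similarity pick shares vocabulary with the question; the chunk with the actual figure shares almost none, which is exactly the "similar ≠ relevant" gap the structure-aware walk avoids.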

  • Inspired by Andrej Karpathy's knowledge base thread, we are open-sourcing OpenKB: Open LLM Knowledge Base. In addition to Andrej's great original design, OpenKB can scale to long PDFs and multi-modality — see details below 👇

    1. Any format — PDF, Word, PowerPoint, Excel and more via markitdown
    2. Scale to long documents — long documents are handled via PageIndex tree indexing, enabling better long-context retrieval
    3. Native multi-modality — retrieves and understands figures, not just text

  • PageIndex reposted this

    View profile for Avinash Kumar Shudhanshu

    Confidential · 2K followers

    Been exploring PageIndex (by PageIndex) — and it's an interesting take on retrieval architecture. It doesn't try to replace RAG with hype. It questions the default stack.

    🧱 The usual RAG pipeline:
    👉 Chunk documents
    👉 Generate embeddings
    👉 Store in vector DB
    👉 Retrieve top-k
    👉 Stitch context
    👉 Pray it's coherent

    It works. But it brings:
    💸 Embedding costs
    🗄️ Vector DB overhead
    🔁 Re-indexing complexity
    🧩 Context fragmentation

    🧠 PageIndex approach (high level). Instead of "everything is a vector," it leans toward:
    📄 Page / structure-aware indexing
    🔎 Deterministic navigation
    ⚡ Lightweight retrieval layer
    🧭 Query-time structured reasoning

    Less semantic fuzziness. More controlled access to information.

    🏗️ Why this matters. If you're building:
    • Internal knowledge systems
    • Doc-heavy AI assistants
    • Large HTML/PDF corpora search
    • Compliance or policy engines

    You might not always need a full embedding + vector stack. Sometimes: structured index > pure similarity search. That's not anti-RAG. It's architecture-aware design.

    Read more: https://pageindex.ai/ https://lnkd.in/gqBFPt-B

    #AI #RAG #Retrieval #SearchArchitecture #LLM #PageIndex #vectorsearch #embedding #llm #genai
