PageIndex

IT System Data Services

London, England · 1,206 followers

Building Infra for Next-Gen Reasoning-based RAG

About us

Building Infra for Next-Gen Reasoning-based RAG

Website
https://pageindex.ai
Industry
IT System Data Services
Company size
2-10 employees
Headquarters
London, England
Type
Public Company
Founded
2023


Updates

  • PageIndex reposted this

    View profile for SHIV PRAKASH YADAV

    Tredence Inc. · 564 followers

    🧠 Built a Vectorless RAG — no embeddings, no vector DB, no cosine similarity.

    Most RAG pipelines follow the same pattern:
    → Chunking → Embedding → Vector DB → Query by similarity search

    To challenge that assumption, I built a RAG system using Google Gemini + PageIndex that skips the vector layer entirely — and yes, Gemini because it's free 😂

    ⚙️ How it works:
    Instead of embedding-based retrieval, it uses PageIndex to index document pages directly. At query time, Gemini reasons over the indexed pages to find and synthesize relevant information — no embeddings generated, no ANN lookup.

    ✅ Advantages:
    • No vector DB setup or maintenance overhead
    • No embedding model needed — reduces infra cost and latency
    • Simpler pipeline: fewer moving parts = easier to debug
    • Works well for structured/paginated documents (PDFs, reports)
    • Faster to prototype and deploy

    ⚠️ Limitations:
    • Less effective for large unstructured corpora where semantic similarity shines
    • Dependent on Gemini's reasoning quality for relevance judgments
    • PageIndex is optimized for document-style inputs; less flexible for arbitrary text chunks

    This is more of an architectural experiment than a production-ready system, but it raises an interesting question: do we always need vectors for retrieval-augmented generation? For certain document-heavy, structured use cases — maybe not.

    Created a simple repo to test it out:
    🔗 Repo: https://lnkd.in/gE8wyRDX

    Inspired by Krish Naik's work — his video was a key reference while building this. Check out his channel if you're diving into AI/ML!
    🔗 Video URL: https://lnkd.in/ghzKykWg
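A minimal runnable sketch of the vectorless flow the post describes: index pages directly, then ask a reasoning model which pages matter. All function names are illustrative, and the Gemini relevance call is replaced by a keyword stand-in — this is not the actual PageIndex or Gemini API.

```python
# Illustrative sketch of vectorless retrieval: index pages directly,
# then ask a reasoning model which pages look relevant.
# `keyword_judge` is a toy stand-in for the Gemini relevance call.

def build_page_index(pages):
    """Keep page number, a short preview, and the raw text -- no vectors."""
    return [{"page": i + 1, "preview": text[:80], "text": text}
            for i, text in enumerate(pages)]

def select_pages(index, query, judge):
    """Ask the (stand-in) reasoning model which indexed pages look relevant."""
    return [entry for entry in index if judge(query, entry["preview"])]

def keyword_judge(query, preview):
    # Naive stand-in for an LLM judgment: any query word appears in the preview.
    return any(word in preview.lower() for word in query.lower().split())

pages = ["Revenue grew 12% in 2023, driven by cloud services.",
         "Total debt stood at $4.2B as of December 2023.",
         "Our mission is to empower every developer."]
index = build_page_index(pages)
hits = select_pages(index, "total debt", keyword_judge)
print([h["page"] for h in hits])  # pages the stand-in judge flagged
```

A real version would replace `keyword_judge` with a Gemini prompt over page previews; the point is that the pipeline has no embedding or ANN step anywhere.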

  • PageIndex reposted this

    View profile for Hitesh Malhotra

    Michigan State University · 2K followers

    I stopped using embeddings for RAG - and my system got more accurate.

    Like most people, I started with the standard RAG pipeline:
    • chunk documents
    • generate embeddings
    • store in a vector DB
    • retrieve top-k similar chunks

    It worked… until it didn't. Even when the correct answer existed, the system would:
    • retrieve the wrong chunk
    • miss exact sections
    • lose context due to chunking

    That's when I tried something different. Instead of embeddings, I built a vectorless RAG system using a structure-aware approach (PageIndex).

    Here's what changed:
    → No chunking
    → No embeddings
    → No vector database

    Instead:
    • the document is converted into a hierarchical tree
    • the LLM navigates the structure
    • retrieval becomes reasoning-based, not similarity-based

    The result?
    • More accurate retrieval for structured documents
    • Better handling of section-based queries
    • Coherent, context-preserving answers

    One key realization from this project: traditional RAG treats documents like a 'list of chunks'. But real documents are 'hierarchies of meaning', and that difference matters.

    This approach worked especially well for:
    • research papers
    • technical documents
    • policy/legal content

    It's not a replacement for vector RAG - but it's a strong alternative depending on the use case.

    I've written a detailed breakdown of the system, architecture, and tradeoffs here: http://bit.ly/4dNy13b

    Curious - have you seen similar limitations with vector-based RAG in real-world systems?

    #AppliedAI #RAG #VectorEmbeddings #AI #LLM #MachineLearning #GenerativeAI
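A toy sketch of the tree-navigation idea above: the document becomes a hierarchy of titled sections, and retrieval walks it top-down. The LLM's "which section should I open next?" call is replaced here by simple title-word overlap; none of these names are the actual PageIndex API.

```python
# Toy version of reasoning-based tree retrieval: walk a document hierarchy
# from the root to a leaf instead of ranking flat chunks by similarity.

tree = {
    "title": "10-K Filing",
    "children": [
        {"title": "Business Overview", "children": [], "text": "..."},
        {"title": "Financial Statements", "children": [
            {"title": "Balance Sheet", "children": [],
             "text": "Total debt: $4.2B"},
            {"title": "Income Statement", "children": [],
             "text": "Revenue: $18.1B"},
        ]},
    ],
}

def pick_child(node, query):
    # Stand-in for the LLM's section choice: title-word overlap with the query.
    words = set(query.lower().split())
    return max(node["children"],
               key=lambda c: len(words & set(c["title"].lower().split())))

def navigate(tree, query):
    """Walk from the root to a leaf, recording the section path taken."""
    node, path = tree, [tree["title"]]
    while node["children"]:
        node = pick_child(node, query)
        path.append(node["title"])
    return path, node.get("text", "")

path, answer = navigate(tree, "balance sheet debt in financial statements")
print(path)    # section path from root to leaf
print(answer)
```

Because the walk follows structure, a section-based query lands on the section that actually holds the answer, rather than on whichever chunk happens to share vocabulary with the question.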

  • PageIndex reposted this

    View profile for Mingtian Zhang

    PageIndex · 2K followers

    A recent trending article exposed the rise of fake GitHub stars: https://lnkd.in/ep-ymjwr

    This isn't just growth hacking — it's distorting how the community judges real projects. Open-source developers, users, and investors should stand together against this.

    We're proud that PageIndex's 25k stars are 100% organic — earned from people who actually use and believe in what we're building.

    "Fake it until you make it" is poison — honesty and trust win in the long run.

  • PageIndex reposted this

    View profile for Chris Columbkille Biddle

    Helix Strategy Group, LLC · 2K followers

    I Stopped Using Vector Databases for RAG: PageIndex Vectorless RAG

    If you have been building RAG systems for a while, you know the frustration. You spend hours setting up embeddings, tuning chunk sizes, picking the right vector database, and then your system still returns the wrong answer. Not because it could not find something similar, but because "similar" and "relevant" are just not the same thing.

    If you haven't explored my RAG series yet, I highly recommend checking it out to better understand the concepts discussed here: RAG — Complete Tutorial: PART 08

    I ran into this problem when working with financial reports. A 120-page 10-K filing. My vector RAG system kept pulling chunks from the executive summary when the actual answer was buried in a footnote on page 87. Both sections had similar keywords and similar semantic meaning, but only one had the actual number I needed.

    That is when I found PageIndex.

    What Even Is PageIndex?

    PageIndex is an open-source RAG framework built by the team at VectifyAI. The core idea is simple but genuinely different from everything else out there. Instead of converting your document into vectors and doing a nearest-neighbor search, PageIndex builds a hierarchical tree structure from your document and then uses an LLM to reason its way to the right answer. No embeddings. No vector database. No chunking. They call it "vectorless RAG", and honestly the name is pretty accurate.

    The project was introduced in September 2025 and it gained a lot of attention fast, mostly because of one benchmark number: 98.7% accuracy on FinanceBench. For context, traditional vector RAG systems score around 50% on the same benchmark. That is not a small improvement. That is a completely different category of performance.

    The Problem with Vector RAG (That Nobody Talks About Enough)

    Before we get into how PageIndex works, let us be honest about why vector RAG struggles. The fundamental assumption of vector RAG is this: the text that is most semantically similar to your query is also the most relevant. That assumption breaks constantly in the real world.

    Think about it. If you ask "What was the company's total debt in 2023?", the chunks that come back with high cosine similarity might include:
    * The CEO's letter mentioning debt management strategy
    * A risk section discussing debt covenants
    * A glossary defining long-term debt

    All three are semantically close to your query. None of them are the actual answer, which is a specific number in the balance sheet on page 64.

    There are a few other issues too.
    * Chunking destroys context. When you split a 100-page document into 500-token chunks, you are throwing away the document's natural structure. A section that says "as mentioned in Table 3.2" is completely disconnected from Table 3.2 once you chunk them separately. The document has a logic and a flow. Chunking ignores all of that.
    * Cross-references…

    #genai #shared #ai
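The "chunking destroys context" point can be seen in a few lines of plain Python — a toy example with fixed-size character chunks standing in for token chunks:

```python
# The cross-reference problem in miniature: a fixed-size split separates
# "as mentioned in Table 3.2" from Table 3.2 itself.

doc = ("Section 3 discusses liquidity. As mentioned in Table 3.2, "
       "total debt matured in 2027. ... filler ... "
       "Table 3.2: Debt maturities -- 2027: $4.2B, 2028: $1.1B.")

def chunk(text, size=60):
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = chunk(doc)
ref = next(c for c in chunks if "As mentioned" in c)  # sentence citing the table
tbl = next(c for c in chunks if "Table 3.2:" in c)    # the table itself
print(chunks.index(ref) != chunks.index(tbl))  # True: different chunks
```

A similarity search that retrieves the referencing sentence never sees the table's numbers, because they now live in a different chunk with no link between them.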

  • PageIndex reposted this

    View profile for Matthew Overstreet

    IBM · 1K followers

    I built something nobody asked for. Let me tell you about it.

    Recently someone asked me if I'd heard of "PageIndex," document search without vector embeddings. I hadn't. But the idea stuck, and I realized it was scratching an itch I'd been ignoring: I don't love chunking.

    Not semantic chunking, that's a different conversation. I did semantic chunking work for a major tech book publisher. Books are too big even for BM25 to handle as monoliths. You have to break them up. But when you chunk on semantic boundaries, something interesting happens: you end up with something like a digital book page. Not constrained by paper and typesetting, but a coherent unit of thought, sized by meaning. That's valuable.

    But naive, uniform chunking? Chop every 512 tokens, embed, kNN, pray? That's searching a library by tearing out all the pages, shuffling them, and asking "which page looks most like my question?"

    You know what a librarian does? They navigate. Card catalog to section to shelf to page. Their tools are structure, context, hierarchy. And that's the idea I can't shake: context is relevancy. And relevancy is much bigger than similarity.

    Search for "R&D spending on AI" in a financial filing. A section called "Operating expenses and cost structure analysis" won't score well on similarity. The terms barely overlap. But any human reading the table of contents knows to look there. That's not similarity. That's understanding.

    But PageIndex only solves the vertical axis, drilling into a document. The horizontal axis, how documents relate across a corpus, is invisible to it. That's where OpenSearch comes in. Not just as a "fast first pass." OpenSearch provides the corpus-level context that PageIndex lacks. PageIndex provides the document depth that OpenSearch can't match.

    Four phases:
    → OpenSearch narrows 100M docs to 20 candidates (BM25 + kNN, <100ms)
    → LLM shortlists to 3-5 based on descriptions (corpus-level reasoning)
    → LLM navigates each document's tree structure (agentic, not one-shot)
    → Fetch content, generate answer with citations

    Is it slow? Yes. Is it expensive? Also yes. But the retrieval implications are interesting.

    I open-sourced it. It's a sketch, not a product. But there's something here, and the part that excites me most isn't even built yet: projecting PageIndex upward to build collection-level trees. Three levels of navigable hierarchy: corpus → document → passage.

    Tell me why this is wrong?

    #RAG #SearchEngineering #OpenSearch #InformationRetrieval #HybridSearch #OpenSource
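The four phases can be sketched as a pipeline of function calls. Everything external — OpenSearch, the shortlisting LLM, the navigating LLM — is replaced here by a deterministic stand-in, and all names and data are hypothetical:

```python
# Sketch of the four-phase hybrid flow, with OpenSearch and the LLM replaced
# by deterministic stand-ins. Real versions would make network calls.

def phase1_corpus_search(corpus, query, k=20):
    # Stand-in for OpenSearch BM25 + kNN: crude term-overlap ranking.
    score = lambda d: sum(w in d["summary"].lower() for w in query.lower().split())
    return sorted(corpus, key=score, reverse=True)[:k]

def phase2_shortlist(candidates, query, n=3):
    # Stand-in for LLM shortlisting by document descriptions.
    return candidates[:n]

def phase3_navigate(doc, query):
    # Stand-in for agentic navigation of one document's tree.
    return doc["sections"].get(query.split()[0], "")

def phase4_answer(passages):
    # Stand-in for answer generation with citations.
    return " | ".join(p for p in passages if p)

corpus = [
    {"id": "10k-acme", "summary": "Acme annual report, debt and revenue",
     "sections": {"debt": "Total debt: $4.2B (p. 87)"}},
    {"id": "blog-pasta", "summary": "A cooking blog about pasta", "sections": {}},
]
docs = phase2_shortlist(phase1_corpus_search(corpus, "debt levels"), "debt levels")
answer = phase4_answer(phase3_navigate(d, "debt levels") for d in docs)
print(answer)
```

The shape is the interesting part: a cheap corpus-wide pass narrows the candidates, then the expensive per-document reasoning only runs on a handful of them.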

  • PageIndex reposted this

    View profile for Pranesh S

    HARMAN India · 3K followers

    🚀 Built a system comparing Vector RAG vs Vectorless RAG

    🌐 Live Link: https://lnkd.in/g4rxfr-k

    Recently, I explored how retrieval systems are evolving beyond embeddings. So I built a demo comparing:
    🔵 Vector RAG (FAISS + embeddings)
    🟢 PageIndex (tree-based, vectorless RAG)

    🧠 Key insights:
    • Vector RAG relies on similarity search → can lose structure
    • PageIndex uses hierarchical trees → preserves document meaning
    • Reasoning-based retrieval improves accuracy for structured data

    📊 Features:
    ✔ Document → Tree transformation
    ✔ LLM reasoning over structure
    ✔ Tree visualization
    ✔ Interactive UI (Gradio)

    ⚡ Tech stack: Python • Groq • FAISS • PageIndex • Gradio

    📖 GitHub: https://lnkd.in/gs57MdMw

    🎯 Takeaway: AI systems are shifting
    ➡️ from retrieval based on similarity
    ➡️ to retrieval based on reasoning

    This project helped me understand how next-gen RAG systems will work.

    PageIndex #AI #RAG #GenAI #LLM #MachineLearning #ArtificialIntelligence #PageIndex #VectorDB #HuggingFace #BIT #BannariAmmanInstituteofTechnology #BITSathy #VectorlessRAG
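A toy side-by-side of the two retrieval styles the demo compares — plain-Python stand-ins only, with bag-of-words counts playing the role of embeddings (no FAISS, no Groq calls):

```python
# Toy comparison: cosine similarity over bag-of-words "embeddings"
# vs. a structure-aware pick by section title. Everything is a stand-in.
import math

def embed(text):
    # Toy embedding: word-count dictionary.
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = ["the company manages debt carefully and discusses debt strategy",
          "total borrowings: $4.2B"]
query = "how much debt does the company have"

# Vector-style retrieval: highest cosine similarity wins.
vector_pick = max(chunks, key=lambda c: cosine(embed(query), embed(c)))

# Tree-style retrieval: sections keep their titles; a reasoning step
# (hard-coded here) opens "Balance sheet" for a how-much question.
sections = {"Risk discussion": chunks[0], "Balance sheet": chunks[1]}
tree_pick = sections["Balance sheet"]

print(vector_pick)  # the chunk that only talks *about* debt
print(tree_pick)    # the section that actually holds the number
```

The similarity pick shares vocabulary with the question; the chunk with the actual figure shares almost none, which is exactly the "similar ≠ relevant" gap the structure-aware walk avoids.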

  • Inspired by Andrej Karpathy's knowledge base thread, we are open-sourcing OpenKB: Open LLM Knowledge Base. In addition to Andrej's great original design, OpenKB can scale to long PDFs and multi-modality — see details below 👇

    1. Any format — PDF, Word, PowerPoint, Excel and more via markitdown
    2. Scale to long documents — long documents are handled via PageIndex tree indexing, enabling better long-context retrieval
    3. Native multi-modality — retrieves and understands figures, not just text

  • PageIndex reposted this

    View profile for Avinash Kumar Shudhanshu

    Confidential · 2K followers

    Been exploring PageIndex (by PageIndex) — and it's an interesting take on retrieval architecture. It doesn't try to replace RAG with hype. It questions the default stack.

    🧱 The usual RAG pipeline:
    👉 Chunk documents
    👉 Generate embeddings
    👉 Store in vector DB
    👉 Retrieve top-k
    👉 Stitch context
    👉 Pray it's coherent

    It works. But it brings:
    💸 Embedding costs
    🗄️ Vector DB overhead
    🔁 Re-indexing complexity
    🧩 Context fragmentation

    🧠 PageIndex approach (high level). Instead of "everything is a vector," it leans toward:
    📄 Page / structure-aware indexing
    🔎 Deterministic navigation
    ⚡ Lightweight retrieval layer
    🧭 Query-time structured reasoning

    Less semantic fuzziness. More controlled access to information.

    🏗️ Why this matters. If you're building:
    • Internal knowledge systems
    • Doc-heavy AI assistants
    • Large HTML/PDF corpora search
    • Compliance or policy engines

    You might not always need a full embedding + vector stack. Sometimes: structured index > pure similarity search. That's not anti-RAG. It's architecture-aware design.

    Read more: https://pageindex.ai/ https://lnkd.in/gqBFPt-B

    #AI #RAG #Retrieval #SearchArchitecture #LLM #PageIndex #vectorsearch #embedding #llm #genai
