Vik Paruchuri (@VikParuchuri) / X

Open source AI. Founder of @datalabto. Past: founded @dataquestio.

Brooklyn, NY

Joined June 2012

Vik Paruchuri
@VikParuchuri
Aug 12, 2025
Parsing PDFs has slowly driven me insane over the last year. Here are 8 weird edge cases to show you why PDF parsing isn't an easy problem. 🧵
627K
Vik Paruchuri
@VikParuchuri
Sep 10, 2025
The PDF format is hard to parse - by design. Let's explore the internals of the PDF format to figure out how Adobe did this to us.
240K
Vik Paruchuri
@VikParuchuri
Jan 12, 2024
Announcing surya - a multilingual text line detection model for documents. It gives you accurate line-level bboxes and column breaks. Find it here - github.com/VikParuchuri/s… .
588K
Vik Paruchuri
@VikParuchuri
Feb 12, 2024
Announcing surya OCR - text recognition in 93 languages. It outperforms tesseract in almost all languages, often by large margins. Find it here - github.com/VikParuchuri/s… .
184K
Vik Paruchuri
@VikParuchuri
Oct 26, 2025
Best OCR ever, huh?
Harveen Singh Chadha
@HarveenChadha
Oct 26, 2025
No, it's not the best OCR ever. Here is the result from olmoOCR2 on the same, and it does have a frightening degree of accuracy.
334K
Vik Paruchuri
@VikParuchuri
Oct 21, 2025
I'm excited to announce that Chandra OCR is open source!
- Full layout information
- Extracts and captions images and diagrams
- Strong handwriting, form, table support
- Works with transformers and vLLM
128K
Vik Paruchuri
@VikParuchuri
Jul 16, 2024
I'm starting a company, Datalab:
- Task-specific models that outperform frontier LLMs and existing tools
- Examples: my projects marker and surya (25k GH stars) with task-specific arch
- Goal: Train models, open source as much as possible, do hosted inference and on-prem
152K
Vik Paruchuri
@VikParuchuri
Apr 11, 2024
I wrote a blog post on going from not knowing anything about deep learning last year to training state-of-the-art OSS models - vikas.sh/post/how-i-got… . Hope it helps you. tl;dr: read the deep learning book, implemented papers + taught, built open source tools.
How I got into deep learning
From vikas.sh
220K
Vik Paruchuri
@VikParuchuri
Oct 15, 2024
I made a library to detect tables and extract to markdown or csv. It uses a new table recognition model I trained.
79K
Vik Paruchuri
@VikParuchuri
Aug 16, 2024
Announcing Surya OCR 2! It uses a new architecture and improves on v1 in every way:
- OCR with automatic language detection for 93 languages (no more specifying languages!)
- More accurate on old/noisy documents
- 20% faster
- Basic English handwriting support
72K
Vik Paruchuri
@VikParuchuri
Nov 30, 2023
I'm excited to ship marker - a pdf to markdown converter that is 10x faster than nougat, more accurate outside arXiv, and has low hallucination risk. Marker is optimized for throughput, like converting LLM pretrain data. Find it here - github.com/VikParuchuri/m… .
146K
Vik Paruchuri
@VikParuchuri
Feb 19, 2025
We've improved marker (PDF -> markdown) a lot in 3 months - accuracy and speed now beat llamaparse, mathpix, and docling. We shipped:
- llm mode that augments marker with models like gemini flash
- improved math, w/inline math
- links and references
- better tables and forms
78K
Vik Paruchuri
@VikParuchuri
May 5, 2024
Cool to see a 500M param model I trained myself do better than Google cloud vision, Claude, and GPT-4V on this task. (look at the thread for the results) It's a relatively narrow one (OCR), but feels nice to see that small open source models still have a place.
Brendan Dolan-Gavitt
@moyix
May 4, 2024
It's weird how we live in an age of miracles with respect to AI/ML, and yet when I want to extract some text from a screenshot the best (very bad) option is tesseract, last updated ~7 years ago.
172K
Vik Paruchuri
@VikParuchuri
Oct 8, 2024
Announcing Surya Table Recognition! It uses a new architecture to outperform table transformer, the current SoTA open source model.
- Recognizes table rows, columns, and cells
- Works with complex layouts and rotated tables
- Supports any language
- Runs locally
61K