About Me

Today, over 300,000 developers worldwide use the technology our team built to process AI training data. When Google Gemini and OpenAI GPT evaluate their own capabilities, they rely on the standards we established — and ours is the only benchmark from a Chinese team in their core evaluation suites.

My path to data infrastructure started far from data itself. During my Ph.D. at Tsinghua University, I simulated seismic waves on a 10-million-core supercomputer and won the Gordon Bell Prize — the highest award in high-performance computing. I then built real-time graph computing systems serving WeChat’s billion users at Tencent, and later led a team at SenseTime building data factories that powered over 100,000 commercial models.

These experiences led me to a conviction: in the era of foundation models, the real bottleneck isn’t algorithms or compute — it’s data. The vast majority of human knowledge accumulated over centuries — papers, books, reports, archives — is locked in PDFs and scanned documents that large models simply cannot ingest.

So at the Shanghai AI Laboratory, I founded the OpenDataLab team to break through this barrier. We developed MinerU, which transforms unstructured documents into high-quality data that large models can learn from. Within a year of release, MinerU earned 50,000 GitHub stars with over 1 billion API calls, and is used in production by Google, Huawei, Alibaba, and over 100 other enterprises. Our team also curates high-quality datasets for leading models such as InternLM and InternVL.

I have authored over 200 papers in top-tier venues and received honors including the ACM Gordon Bell Prize, an ACL Best Theme Paper Award, and the WAIC Yunfan Award. I’m not just doing research — I’m building the data infrastructure for the AI era.

We are hiring! I am actively seeking talented Ph.D. students, postdoctoral fellows, interns, and full-time researchers. If you are passionate about building the future of AI, I welcome you to contact me via email.

🚀 Impact

⛏️ MinerU

The world’s leading open-source document parsing engine. Transforms unstructured documents into high-quality, AI-ready data for large model training.

50,000+ GitHub Stars    📊 1B+ API Calls    🏢 100+ Enterprise Users (Google, Huawei, Alibaba, etc.)

📏 OmniDocBench

The document parsing evaluation benchmark officially adopted by Google Gemini and OpenAI GPT — the only Chinese-team contribution in their core evaluation suites.

🌐 OpenDataLab

Founder and leader of the OpenDataLab team and its open data ecosystem.

👥 300,000+ Developers    📦 7,000+ Datasets    🔍 40M+ Retrievals

🧠 Foundation Model Data Engine

Oversees the data pipeline for InternLM and InternVL, processing 100PB of raw data into 70T high-quality tokens.

🔥 News

  • 2026.02: 🎉 7 papers are accepted by ICLR 2026.
  • 2025.12: 🎉 I was selected for the Shanghai Science and Technology Youth 35 Leading Program (35 scientists under the age of 35). [News]
  • 2025.09: 🎉 MinerU 2.5 is released! A 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. [Tech Report] [Model] [GitHub]
  • 2025.09: 🎉 6 papers are accepted by NeurIPS 2025.
  • 2025.07: 🎉 I received the ACL Best Theme Paper Award.
  • 2025.07: 🎉 I won the World Artificial Intelligence Conference Yunfan Award (one of 11 global recipients under the age of 35, 2025)
  • 2025.05: 🎉 5 papers are accepted by ICCV 2025.
  • 2025.05: 🎉 11 papers are accepted by ACL 2025.
  • 2025.02: 🎉 5 papers are accepted by CVPR 2025.
  • 2025.01: 🎉 1 paper is accepted by NAACL 2025.
  • 2025.01: 🎉 7 papers are accepted by ICLR 2025.

📝 Selected Research

I have authored over 200 papers. Here are selected works that define my research trajectory:

Building AI Data Infrastructure

  1. 2024 MinerU: An Open-Source Solution for Precise Document Content Extraction — Our flagship open-source document parsing engine. 50K+ GitHub stars, 1B+ API calls, adopted by 100+ enterprises.

  2. 2024 OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations — The evaluation standard officially adopted by Google Gemini and OpenAI GPT. The only Chinese-team benchmark in their core evaluation suites.

Award-Winning Research

  1. SC 2017 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight, Haohuan Fu†, Conghui He†, et al. — Scaled earthquake simulations to 10 million cores, redefining the boundary of HPC applications. 🏆 ACM Gordon Bell Prize

  2. ACL 2025 Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models, Xinlin Zhuang, …, Conghui He† — A principled approach to training data curation for LLMs. 🏆 ACL Best Theme Paper Award

Advancing Multimodal AI

  1. 2025 InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models — Powering one of the world’s leading open-source multimodal models.

  2. ICLR 2025 OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text — A 10-billion-scale multimodal dataset advancing vision-language research.

  3. ECCV 2024 MMBench: Is Your Multi-modal Model an All-around Player? — The definitive benchmark for evaluating multimodal models. 2,000+ citations.

🎖 Honors

  • 2026, Shanghai Top 35 Young Science & Technology Innovators
  • 2025, ACL Best Theme Paper Award — Top 3 from 8,000+ submissions (sole corresponding author)
  • 2025, WAIC Yunfan Award — One of 10 global AI rising stars, World AI Conference
  • 2024, National Distinguished Young Talent — State-level talent program
  • 2019, Tencent Technology Breakthrough Gold Award — Highest technical honor, sole gold among 50+ teams
  • 2017, ACM Gordon Bell Prize — Highest honor in high-performance computing
  • 2013, IEEE-IBM Smarter Planet Challenge Global Champion — Team leader, 1st among 54 global university teams

🎤 Talks & Media

  • 2026.03, Invited Lecture, National Leadership Development Program, Beijing (AI & Data Theme)
  • 2025.08, CSML 2025, Session Chair & Invited Talk, Beijing
  • 2025.07, WAIC 2025 “Corpus Innovation Forum” Keynote: MinerU2: Intelligent Engine from Heterogeneous Data to AI-Ready · Co-keynote with CAS Academician E Weinan
  • 2025.02, Global Developer Pioneer Conference “Pujiang AI Ecosystem Forum”, Shanghai
  • 2023.10, Tutorial: “An Introduction to OpenDataLab”, IEEE/CVF International Conference on Computer Vision (ICCV), Paris
  • 2023.06, Tutorial: “OpenDataLab: The Next-Generation Open Dataset Platform”, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver

💡 Perspectives

Thoughts on AI data, open source, and building infrastructure for the next era of intelligence.

The Data Bottleneck No One Talks About

Everyone is racing to build bigger models, but few are asking where the data comes from. The inconvenient truth: most of human knowledge — centuries of scientific papers, legal documents, financial reports, medical records — is locked in formats that AI cannot read. PDFs, scans, handwritten notes. We solved the compute scaling problem. The data scaling problem is next, and it’s harder than most people think.

Why We Open-Sourced MinerU

When we built MinerU, we had a choice: keep it proprietary or open-source it. We chose open source — not out of idealism, but out of strategy. Data infrastructure is like roads: the more people use them, the more valuable they become. A proprietary parsing engine serves one company. An open standard serves an industry. Within a year, 300,000 developers proved us right.

The ROI of Data Over Architecture

The AI industry is split between two philosophies: brute-force scaling and surgical efficiency. When Grok 3 threw 200,000 GPUs at the problem and gained 10%, while DeepSeek-R1 achieved comparable results through reinforcement learning and data distillation at a fraction of the cost, the message was clear: raw compute has diminishing returns. The real leverage is in data quality. Until we see a paradigm shift in model architecture, data optimization remains the highest-ROI path to better AI.