DatologyAI (@datologyai) / X

214 posts

DatologyAI builds tools to automatically select and optimize the best data on which to train AI models, leading to better, smaller models which train faster.

Redwood City, CA

datologyai.com

Joined September 2023

3,060 Followers

DatologyAI
@datologyai
12h
New @datologyai research: 35× cheaper per correct answer through pretraining data curation alone. No parameter reduction or decoding tricks required. 🧵 A million correct answers cost ~$1.34 from our curated 4B, and ~$47 from verbose Qwen3.5-4B. Same answers.
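As a quick sanity check on the headline figure, the two quoted costs can be divided directly. This is a minimal sketch that uses only the approximate dollar amounts from the post; the variable names are illustrative and not taken from the paper.

```python
# Approximate cost per one million correct answers, as quoted in the post (USD).
curated_4b_cost = 1.34   # curated 4B model
qwen_4b_cost = 47.0      # Qwen3.5-4B

# How many times cheaper the curated model is per correct answer.
cost_ratio = qwen_4b_cost / curated_4b_cost
print(f"{cost_ratio:.1f}x cheaper per correct answer")  # prints "35.1x cheaper per correct answer"
```

The result rounds to the 35× figure in the post. Both inputs are rounded estimates, so the ratio is approximate as well.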
Matthew Leavitt
@leavittron
12h
What if you could induce models to be more concise via pretraining data curation?
DatologyAI
@datologyai
12h
Replying to @datologyai
Thoughtful pretraining curation allows the model to learn to be concise, so brevity is built into the weights, not bolted on at decode time.
DatologyAI
@datologyai
12h
📄 Paper: arxiv.org/abs/2606.25432
📝 Blog: datologyai.com/blog/brevity-i…
Join us: datologyai.com/careers
Become a customer: datologyai.com/contact
DatologyAI
@datologyai
Jun 22
The "you can only catch up by distilling from a frontier model" narrative is wrong. We curated the data for @Arceeai's Trinity Large entirely from public sources, zero closed-model APIs, and it's competitive with the open frontier. Better data does the work.
DatologyAI
@datologyai
Jun 22
Full episode:
DatologyAI
@datologyai
Jun 19
Compute scarcity is about to force the reckoning the frontier labs have avoided: efficiency. You don't need trillion-parameter models for frontier-class capability. With better data, far smaller models match the best of a year or two ago, at a fraction of the cost to serve.
DatologyAI
@datologyai
Jun 19
Full episode:
DatologyAI
@datologyai
Jun 18
Replying to @datologyai
5/ Alexander Gurung from the University of Edinburgh presented his work on learning to reason for long-form generation. What does a reward signal look like when the goal is a good story? 📺 youtu.be/tB8dx9QGVcM
DatologyAI
@datologyai
Jun 18
Replying to @datologyai
14/ Sukjun Hwang (@sukjun_hwang) from CMU presented his work on H-Nets: Dynamic chunking for end-to-end hierarchical sequence modeling 📄
arxiv.org: Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
DatologyAI
@datologyai
Jun 18
15/ What a lineup, and that was only year one. Summer of Data is back for 2026 and we're just getting started. Keep an eye out for our lineup announcement and new talks every week. Want to present? DM us 👀 Stay data-obsessed 🤓
DatologyAI
@datologyai
Jun 18
Replying to @datologyai
4/ Shizhe Diao (@shizhediao) from Thinking Machines presented his work on CLIMB, clustering-based iterative data selection for pretraining. Can a model find its own best data blend? 📺 youtu.be/DmFygcqAvsM
DatologyAI
@datologyai
Jun 18
1/ 🌞 Our Summer of Data Seminar brought together some of the sharpest minds in data curation last year. We are bringing it back in 2026! Let's recap the great talks from 2025!
DatologyAI
@datologyai
Jun 18
3/ Maximilian Böther (@MaxiBoether) from ETH Zurich presented his work on Mixtera, a data plane for foundation model training. How do you manage what your model eats at scale? He is now working at @datologyai on cool dataloader improvements 📺 youtu.be/JyQI8SDpMoU