DatologyAI
214 posts
DatologyAI
@datologyai
DatologyAI builds tools to automatically select and optimize the best data on which to train AI models, leading to better, smaller models which train faster.
Redwood City, CA
datologyai.com
Joined September 2023
10
Following
3,060
Followers
  • DatologyAI
    @datologyai
    12h
    New @datologyai research: 35× cheaper per correct answer through pretraining data curation alone. No parameter reduction or decoding tricks required. 🧵 A million correct answers cost ~$1.34 from our curated 4B, and ~$47 from verbose Qwen3.5-4B. Same answers.
    Matthew Leavitt
    DatologyAI
    @leavittron
    12h
    What if you could induce models to be more concise via pretraining data curation?
    DatologyAI
    @datologyai
    12h
    Replying to @datologyai
    Thoughtful pretraining curation allows the model to learn to be concise, so brevity is built into the weights, not bolted on at decode time.
    DatologyAI
    @datologyai
    12h
    📄 Paper: arxiv.org/abs/2606.25432 📝 Blog: datologyai.com/blog/brevity-i… Join us: datologyai.com/careers Become a customer: datologyai.com/contact
  • DatologyAI
    @datologyai
    Jun 22
    The "you can only catch up by distilling from a frontier model" narrative is wrong. We curated the data for @Arceeai's Trinity Large entirely from public sources, zero closed-model APIs, and it's competitive with the open frontier. Better data does the work.
    DatologyAI
    @datologyai
    Jun 22
    Full episode:
  • DatologyAI
    @datologyai
    Jun 19
    Compute scarcity is about to force the reckoning the frontier labs have avoided: efficiency. You don't need trillion-parameter models for frontier-class capability. With better data, far smaller models match the best of a year or two ago, at a fraction of the cost to serve.
    DatologyAI
    @datologyai
    Jun 19
    Full episode:
  • DatologyAI
    @datologyai
    Jun 18
    Replying to @datologyai
    5/ Alexander Gurung from the University of Edinburgh presented his work on learning to reason for long-form generation. What does a reward signal look like when the goal is a good story? 📺 youtu.be/tB8dx9QGVcM
    DatologyAI
    @datologyai
    Jun 18
    Replying to @datologyai
    14/ Sukjun Hwang (@sukjun_hwang) from CMU presented his work on H-Nets: Dynamic chunking for end-to-end hierarchical sequence modeling 📄
    arxiv.org
    Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
    Major progress on language models (LMs) in recent years has largely resulted from moving away from specialized models designed for specific tasks, to general models based on powerful architectures...
    DatologyAI
    @datologyai
    Jun 18
    15/ What a lineup, and that was only year one. Summer of Data is back for 2026 and we're just getting started. Keep an eye out for our lineup announcement and new talks every week. Want to present? DM us 👀 Stay data-obsessed 🤓
  • DatologyAI
    @datologyai
    Jun 18
    Replying to @datologyai
    4/ Shizhe Diao (@shizhediao) from Thinking Machines presented his work on CLIMB, clustering-based iterative data selection for pretraining. Can a model find its own best data blend? 📺 youtu.be/DmFygcqAvsM
  • DatologyAI
    @datologyai
    Jun 18
    1/ 🌞 Our Summer of Data Seminar brought together some of the sharpest minds in data curation last year. We are bringing it back in 2026! Let's recap the great talks from 2025!
    DatologyAI
    @datologyai
    Jun 18
    3/ Maximilian Böther (@MaxiBoether) from ETH Zurich presented his work on Mixtera, a data plane for foundation model training. How do you manage what your model eats at scale? He is now working @datologyai on cool dataloader improvements 📺 youtu.be/JyQI8SDpMoU