Hailey Schoelkopf (@haileysch__)

737 posts

@haileysch__

hillclimbing towards generality @anthropicai | prev @AiEleuther | views my own

sf + boston

haileyschoelkopf.github.io

Joined June 2022

Hailey Schoelkopf
@haileysch__
Nov 12, 2024
Major life update: I'm joining @AnthropicAI this week! Looking forward to meeting and working with the amazing team there! I’m beyond thankful for an amazing 2 years with my colleagues and collaborators at @AiEleuther.
85K
Hailey Schoelkopf
@haileysch__
Dec 19, 2023
work in ML, they said. it’ll be fun, they said. Now I’m reading about the Based architecture and its HellaSwag score
114K
Hailey Schoelkopf
@haileysch__
Feb 24, 2025
the only eval i trust now, and the vibes are immaculate
Anthropic
@AnthropicAI
Feb 24, 2025
Replying to @AnthropicAI
Claude 3.7 Sonnet is a significant upgrade over its predecessor. Extended thinking mode gives the model an additional boost in math, physics, instruction-following, coding, and many other tasks. In addition, API users have precise control over how long the model can think for.
92K
Hailey Schoelkopf
@haileysch__
Apr 25, 2023
Excited to announce our paper Pythia: A Suite for Analyzing LLMs across Training and Scaling has been accepted as an Oral paper at #ICML2023! arxiv.org/abs/2304.01373
55K
Hailey Schoelkopf
@haileysch__
May 22, 2025
opus 4 is here—“o4”, if you will
Anthropic
@AnthropicAI
May 22, 2025
Introducing the next generation: Claude Opus 4 and Claude Sonnet 4. Claude Opus 4 is our most powerful model yet, and the world’s best coding model. Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning.
23K
Hailey Schoelkopf
@haileysch__
May 24, 2024
My favorite bit in this paper: I and @bbrabbasi wrote an appendix formalizing what is done evaluating models with loglikelihood multiple choice and perplexity evals. afaik, none of this has been written up in one place in most papers and just been tacitly assumed before!
EleutherAI
@AiEleuther
May 24, 2024
Excited to share our new paper, Lessons From The Trenches on Reproducible Evaluation of Language Models! In it, we discuss common challenges we’ve faced evaluating LMs, and how our library the Evaluation Harness is designed to mitigate them 🧵 arxiv.org/abs/2405.14782
36K
Hailey Schoelkopf
@haileysch__
Feb 25, 2025
2 misconceptions from yesterday:
- no training on playing videogames. Claude can just do this now
- “why isn’t this a demo”
Anthropic
@AnthropicAI
Feb 25, 2025
Replying to @AnthropicAI
Claude Plays Pokémon continues on as a researcher's personal project. Follow along on Twitch: twitch.tv/claudeplayspok…
17K
Hailey Schoelkopf
@haileysch__
Oct 17, 2023
Beyond excited for our work on Llemma to finally be public!! We trained very strong general math LMs (7B > Minerva 8B, 34B ~= Minerva 62B) and released them + training, eval, analysis code! Can’t wait to see the math+AI field pick these up for future developments in the open.
Zhangir Azerbayev
@zhangir_azerbay
Oct 17, 2023
We release Llemma: open LMs for math trained on up to 200B tokens of mathematical text. The performance of Llemma 34B approaches Google's Minerva 62B despite having half the parameters. Models/data/code: github.com/EleutherAI/mat… Paper: arxiv.org/abs/2310.10631 More ⬇️
48K
Hailey Schoelkopf
@haileysch__
Nov 7, 2023
This proposal suggests a global moratorium on training runs > 10^24 flops—approximately that used by llama 2-70B, and 100x lower than the EO reporting threshold. Just astoundingly absurd
Andrea Miotti
@andreamiotti
Nov 6, 2023
The AI Summit consensus is clear: it's time for international measures. Here is a concrete proposal. In our recent paper, @jasonhausenloy, Claire Dennis and I propose an international institution to address extinction risk from AI: MAGIC, a Multinational AGI Consortium.
46K
Hailey Schoelkopf
@haileysch__
Feb 29, 2024
so, it turns out i am the top solo user for downloads on HF (thanks solely to my lm-eval MMLU mirror! 😅) now is probably a good time to express how grateful i am for the users, contributors, and community around the LM Evaluation Harness! :’)
46K
Hailey Schoelkopf
@haileysch__
Dec 19, 2023
Replying to @haileysch__
so that I can use it in Oogabooga
10K
Hailey Schoelkopf
@haileysch__
Apr 5, 2023
We’ve released the paper for Pythia, a set of LLMs designed to facilitate scientific study on LLMs and their training data! So excited to have this out finally! Read more here:
Stella Biderman
@BlancheMinerva
Apr 5, 2023
Have you ever wanted to do an experiment on LLMs and found that none of the existing model suites met your needs? At @AiEleuther we got tired of this happening and so designed a model suite that centers enabling scientific research as its primary goal arxiv.org/abs/2304.01373
24K
Hailey Schoelkopf
@haileysch__
May 8, 2024
what’s the current SOTA for KV cache compression? what are some must-read papers on this topic?
32K
Hailey Schoelkopf
@haileysch__
Feb 28, 2024
👀it's always incredible to me just how ubiquitous and clear the induction head bump is
21K