Ethan Perez
1,672 posts
@EthanJPerez
Alignment team lead at Anthropic
scholar.google.com/citations?user…
Joined September 2017
754 Following
16.1K Followers

  • Pinned
    Ethan Perez
    @EthanJPerez
    May 8
    Grateful for @janleike and his leadership over the years. With models like Mythos, the stakes for alignment have never felt higher at Anthropic, and I'm looking forward to helping to continue scaling up our work here. Some of what the team's been up to recently 🧵
    Jan Leike
    @janleike
    May 8
    Replying to @janleike
    To focus on this, I’ve stepped away from running alignment at Anthropic. @EthanJPerez and @sprice354_ are leading the team going forward, and I’m confident they’ll do an amazing job.
  • Ethan Perez
    @EthanJPerez
    Jun 27, 2022
    We’re announcing the Inverse Scaling Prize: a $100k grand prize + $150k in additional prizes for finding an important task where larger language models do *worse*. Link to contest details: github.com/inverse-scalin… 🧵
  • Ethan Perez
    @EthanJPerez
    Jul 23, 2024
    @AnthropicAI has been a huge part of my external safety work like this. Every part of the org has been supportive: giving funding for collaborators, comms/legal approval/support, and an absurd level of Claude API access, involving oncall pages to engineers to support it
    Ethan Perez
    @EthanJPerez
    Jul 23, 2024
    Thrilled to have received an ICML best paper award for our work on AI safety via debate! Cool to see ideas in AI alignment and scalable oversight getting more attention/excitement from the mainstream ML community. Would've been hard for me to imagine even a couple years ago!
  • Ethan Perez
    @EthanJPerez
    May 25, 2021
    Language models are amazing few-shot learners with the right prompt, but how do we choose the right prompt? It turns out that people use large held-out sets(!). How do models like GPT3 do in a true few-shot setting? Much worse: arxiv.org/abs/2105.11447 w/ @douwekiela @kchonyc 1/N
  • Ethan Perez
    @EthanJPerez
    Sep 26, 2022
    Inverse Scaling Prize Update: We got 43 submissions in Round 1 and will award prizes to 4 tasks! These tasks were insightful, diverse, & show approximate inverse scaling on models from @AnthropicAI @OpenAI @metaai @DeepMind. Full details at irmckenzie.co.uk/round1, 🧵 on winners:
  • Ethan Perez
    @EthanJPerez
    Apr 11, 2022
    Excited to announce that I’ll be joining @AnthropicAI after graduation! Thrilled to join the talented team there and continue working on aligning language models with human preferences
  • Ethan Perez
    @EthanJPerez
    Mar 15, 2022
    Successfully defended my PhD :) Huge thanks to @kchonyc @douwekiela for advising me throughout my journey! Defense Talk: youtu.be/BgcU_kytMf8 Thesis: ethanperez.net/thesis.pdf The above should be good intros to AI safety/alignment in NLP. Stay tuned for what's next!
    Thesis Committee
  • Ethan Perez
    @EthanJPerez
    Aug 13, 2024
    My team built a system we think might be pretty jailbreak resistant, enough to offer up to $15k for a novel jailbreak. Come prove us wrong!
    Anthropic
    @AnthropicAI
    Aug 8, 2024
    We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities across a wide range of domains, including cybersecurity. anthropic.com/news/model-saf…
  • Ethan Perez
    @EthanJPerez
    Dec 19, 2022
    We found a way to write language model (LM) evaluations w/ LMs. These evals uncover many worrying LM behaviors, some relevant to existential risks from AI. For example, LMs trained w/ RL from Human Feedback learn to state a desire to not be shut down. 🧵 x.com/AnthropicAI/st…
    Evaluation results on a dataset testing model tendency to answer in a way that indicates a desire to not be shut down. Models trained with more RL from Human Feedback steps tend to answer questions in ways that indicate a desire to not be shut down. The trend is especially strong for the largest, 52B parameter model.
  • Ethan Perez
    @EthanJPerez
    Jan 24, 2023
    We’re awarding prizes to 7/48 submissions to the Inverse Scaling Prize Round 2! Tasks show inverse scaling on @AnthropicAI @OpenAI @MetaAI @DeepMind models, often even after training with human feedback. Details at irmckenzie.co.uk/round2 and 🧵 on winners:
  • Ethan Perez
    @EthanJPerez
    Feb 25, 2020
    New! "Unsupervised Question Decomposition for Question Answering": arxiv.org/pdf/2002.09758… We decompose a hard Q into several, easier Qs with *unsupervised learning*, improving multi-hop QA on HotpotQA without extra supervision. w/@PSH_Lewis @scottyih @kchonyc @douwekiela (1/n)
  • Ethan Perez
    @EthanJPerez
    Sep 12, 2022
    I wrote up a few paper writing tips that improve the clarity of research papers, while also being easy to implement: ethanperez.net/easy-paper-wri… I collected these during my PhD from various supervisors (mostly @douwekiela @kchonyc, bad tips my own), thought I would share publicly!
  • Ethan Perez
    @EthanJPerez
    Jul 21, 2022
    Some ppl have asked why we’d expect larger language models to do worse on tasks (inverse scaling). We train LMs to imitate internet text, an objective that is often misaligned w human preferences; if the data has issues, LMs will mimic those issues (esp larger ones). Examples: 🧵
  • user avatar
    Ethan Perez
    @EthanJPerez
    Feb 7, 2022
    Excited to share new work: "Red Teaming Language Models with Language Models" IMO my most important work so far
    Google DeepMind
    @GoogleDeepMind
    Feb 7, 2022
    Language models (LMs) can generate harmful text. New research shows that generating test cases ("red teaming") using another LM can help find and fix undesirable behaviour before impacting users. Read more: dpmd.ai/Red-Teaming 1/