Ethan Perez (@EthanJPerez) / X

Alignment team lead at Anthropic

scholar.google.com/citations?user…

Joined September 2017

Pinned
Ethan Perez
@EthanJPerez
May 8
Grateful for @janleike and his leadership over the years. With models like Mythos, the stakes for alignment have never felt higher at Anthropic, and I'm looking forward to helping to continue scaling up our work here. Some of what the team's been up to recently 🧵
Jan Leike
@janleike
May 8
Replying to @janleike
To focus on this, I’ve stepped away from running alignment at Anthropic. @EthanJPerez and @sprice354_ are leading the team going forward, and I’m confident they’ll do an amazing job.
Ethan Perez
@EthanJPerez
Jun 27, 2022
We’re announcing the Inverse Scaling Prize: a $100k grand prize + $150k in additional prizes for finding an important task where larger language models do *worse*. Link to contest details: github.com/inverse-scalin… 🧵
Ethan Perez
@EthanJPerez
Jul 23, 2024
@AnthropicAI has been a huge part in my external safety work like this. Every part of the org has been supportive: giving funding for collaborators, comms/legal approval/support, and an absurd level of Claude API access, involving oncall pages to engineers to support it
Ethan Perez
@EthanJPerez
Jul 23, 2024
Thrilled to have received an ICML best paper award for our work on AI safety via debate! Cool to see ideas in AI alignment and scalable oversight getting more attention/excitement from the mainstream ML community. Would've been hard for me to imagine even a couple years ago!
Ethan Perez
@EthanJPerez
May 25, 2021
Language models are amazing few-shot learners with the right prompt, but how do we choose the right prompt? It turns out that people use large held-out sets(!). How do models like GPT3 do in a true few-shot setting? Much worse: arxiv.org/abs/2105.11447 w/ @douwekiela @kchonyc 1/N
Ethan Perez
@EthanJPerez
Sep 26, 2022
Inverse Scaling Prize Update: We got 43 submissions in Round 1 and will award prizes to 4 tasks! These tasks were insightful, diverse, & show approximate inverse scaling on models from @AnthropicAI @OpenAI @metaai @DeepMind. Full details at irmckenzie.co.uk/round1, 🧵 on winners:
Ethan Perez
@EthanJPerez
Apr 11, 2022
Excited to announce that I’ll be joining @AnthropicAI after graduation! Thrilled to join the talented team there and continue working on aligning language models with human preferences
Ethan Perez
@EthanJPerez
Mar 15, 2022
Successfully defended my PhD :) Huge thanks to @kchonyc @douwekiela for advising me throughout my journey! Defense Talk: youtu.be/BgcU_kytMf8 Thesis: ethanperez.net/thesis.pdf The above should be good intros to AI safety/alignment in NLP. Stay tuned for what's next!
Ethan Perez
@EthanJPerez
Aug 13, 2024
My team built a system we think might be pretty jailbreak resistant, enough to offer up to $15k for a novel jailbreak. Come prove us wrong!
Anthropic
@AnthropicAI
Aug 8, 2024
We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system. We're offering rewards for novel vulnerabilities across a wide range of domains, including cybersecurity. anthropic.com/news/model-saf…
Ethan Perez
@EthanJPerez
Dec 19, 2022
We found a way to write language model (LM) evaluations w/ LMs. These evals uncover many worrying LM behaviors, some relevant to existential risks from AI. For example, LMs trained w/ RL from Human Feedback learn to state a desire to not be shut down. 🧵 x.com/AnthropicAI/st…
Ethan Perez
@EthanJPerez
Jan 24, 2023
We’re awarding prizes to 7/48 submissions to the Inverse Scaling Prize Round 2! Tasks show inverse scaling on @AnthropicAI @OpenAI @MetaAI @DeepMind models, often even after training with human feedback. Details at irmckenzie.co.uk/round2 and 🧵 on winners:
Ethan Perez
@EthanJPerez
Feb 25, 2020
New! "Unsupervised Question Decomposition for Question Answering": arxiv.org/pdf/2002.09758… We decompose a hard Q into several, easier Qs with *unsupervised learning*, improving multi-hop QA on HotpotQA without extra supervision. w/@PSH_Lewis @scottyih @kchonyc @douwekiela (1/n)
Ethan Perez
@EthanJPerez
Sep 12, 2022
I wrote up a few paper writing tips that improve the clarity of research papers, while also being easy to implement: ethanperez.net/easy-paper-wri… I collected these during my PhD from various supervisors (mostly @douwekiela @kchonyc, bad tips my own), thought I would share publicly!
Ethan Perez
@EthanJPerez
Jul 21, 2022
Some ppl have asked why we’d expect larger language models to do worse on tasks (inverse scaling). We train LMs to imitate internet text, an objective that is often misaligned w human preferences; if the data has issues, LMs will mimic those issues (esp larger ones). Examples: 🧵
Ethan Perez
@EthanJPerez
Feb 7, 2022
Excited to share new work: "Red Teaming Language Models with Language Models" IMO my most important work so far
Google DeepMind
@GoogleDeepMind
Feb 7, 2022
Language models (LMs) can generate harmful text. New research shows that generating test cases ("red teaming") using another LM can help find and fix undesirable behaviour before impacting users. Read more: dpmd.ai/Red-Teaming 1/