Peter Hase (@peterbhase) / X

Peter Hase

560 posts

@peterbhase

I work in grantmaking for AI safety and interpretability.
Currently: Schmidt Sciences, Stanford
Previously: Anthropic, AI2, Google, Meta, UNC Chapel Hill

New York, NY

Joined April 2019

Pinned
Peter Hase
@peterbhase
Mar 4
Can we train models to have more monitorable CoT? We introduce Counterfactual Simulation Training to improve CoT faithfulness/monitorability. CST produces models that admit to reward hacking and deferring too much to Stanford profs (@ChrisGPotts told me this is very dangerous)
Peter Hase
@peterbhase
Aug 5, 2024
Life update: I am starting a residency at @AnthropicAI! I will be working on research in AI safety. I have also relocated to SF! You will now find me there.
Peter Hase
@peterbhase
Jan 16, 2024
Can LLMs generalize from easy to hard problems? Models actually solve college test questions when trained on 3rd grade questions! 🚨New paper: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks” 🧵1/6
Peter Hase
@peterbhase
Apr 9, 2021
Interested in interpretable and explainable machine learning? Check out our new blog post with opinions on the field and 70 summaries of recent papers, by @__Owen___ and me! Link:
alignmentforum.org
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers — AI Alignment Forum
Peter Hase UNC Chapel Hill • Owen Shen UC San Diego • With thanks to Robert Kirk and Mohit Bansal for helpful feedback on this post. …
Peter Hase
@peterbhase
Jun 28, 2024
My last PhD paper 🎉: fundamental problems with model editing for LLMs! We present *12 open challenges* with definitions/benchmarks/assumptions, inspired by work on belief revision in philosophy. To provide a way forward, we test model editing against Bayesian belief revision 🧵
Peter Hase
@peterbhase
Jun 6, 2024
New CS+Philosophy paper! 📰 Are language models rational? This is a key question for *theory of interpretability*. It would be convenient to use the equation “behavior=beliefs+desires” when explaining LLMs, but can we treat LLMs as rational things? Do coherence norms apply? 🧵
Peter Hase
@peterbhase
May 6, 2024
I have defended my thesis 🎉🎉! Thanks to my advisor @mohitban47 and so many others for all the help along the way. Video: youtu.be/e0kIoAMqAEg PDF: peterbhase.github.io/files/hase_the… Now I am enjoying Vienna before #ICLR2024! Let me know if you’re around Tues/Wed and want to chat 🙂👇
Peter Hase
@peterbhase
Feb 4, 2021
This project has been a nice and long effort, but I’m excited to share a new paper: **When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data** Work done with @mohitban47 Arxiv: arxiv.org/abs/2102.02201 Thread below 1/n
Peter Hase
@peterbhase
May 6, 2020
My first PhD paper!😀 (at #acl2020nlp, w. @mohitban47 @uncnlp) "Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?" We measure how 5 explanation methods (LIME, Anchor, Prototype, Decision Boundary, Composite) improve simulatability...1/5
Peter Hase
@peterbhase
Feb 14, 2024
Excited to be visiting Stanford again and this time to talk about work on model editing and scalable oversight! If you're curious, drop in via form below 😃 Based on papers: (1) arxiv.org/abs/2309.17410 (2) arxiv.org/abs/2301.04213 (3) arxiv.org/abs/2111.13654 (4) arxiv.org/abs/2401.06751
Stanford NLP Group
@stanfordnlp
Feb 14, 2024
For this week’s NLP Seminar, we are thrilled to host @peterbhase to talk about "Controlling and Editing Knowledge in Large Language Models"! When: 02/15 Thurs 11am PT Non-Stanford affiliates registration form (closed at 9am PT on the talk day): forms.gle/irPzvubN5kUTbD…
arxiv.org
Can Sensitive Information Be Deleted From LLMs? Objectives for...
Pretrained language models sometimes possess knowledge that we do not wish them to, including memorized personal information and knowledge that could be used to harm people. They can also output...
Peter Hase
@peterbhase
Sep 24, 2021
I am honored to receive a @GoogleAI PhD Fellowship this year! This could have never happened without the support of my advisor @mohitban47 and @uncnlp, great co-authors, and many prior mentors. Looking forward to more work on NLP and AI Safety
Google AI
@GoogleAI
Sep 23, 2021
Continuing our tradition of supporting outstanding graduate students in their pursuit of research in computer science and related fields, we congratulate our 13th annual PhD Fellowship Program recipients! See the list of 2021 Fellowship recipients below: goo.gle/3zCuHA3
Peter Hase
@peterbhase
Jan 12, 2023
New paper out! “Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models” paper: arxiv.org/abs/2301.04213 Work w/ @mohitban47 @_beenkim @ghandeharioun (@GoogleAI + @uncnlp) 1/n
Peter Hase
@peterbhase
Jun 12, 2023
How do models disambiguate objects and generalize to new contexts? We show interpretability metrics predict model OOD generalization (causally!), based on feature factorization + weighting of foreground/background arxiv.org/abs/2306.05963 Led by @zfjoshying with @mohitban47 🧵⬇️
Peter Hase
@peterbhase
Nov 29, 2021
Excited to share new work “Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs” (to make beliefs more truthful+logically consistent) arxiv.org/abs/2111.13654 w/ mona_diab @real_asli @xl_nlp @zkozareva @vesko_st @mohitban47 @sriniiyer88