Peter Hase
560 posts
@peterbhase
I work in grantmaking for AI safety and interpretability. Currently: Schmidt Sciences, Stanford. Previously: Anthropic, AI2, Google, Meta, UNC Chapel Hill.
New York, NY
peterbhase.github.io
Joined April 2019
1,159 Following
3,811 Followers
  • Pinned
    Peter Hase
    @peterbhase
    Mar 4
    Can we train models to have more monitorable CoT? We introduce Counterfactual Simulation Training to improve CoT faithfulness/monitorability. CST produces models that admit to reward hacking and deferring too much to Stanford profs (@ChrisGPotts told me this is very dangerous)
    23K
  • Peter Hase
    @peterbhase
    Aug 5, 2024
    Life update: I am starting a residency at @AnthropicAI! I will be working on research in AI safety. I have also relocated to SF! You will now find me there.
    94K
  • Peter Hase
    @peterbhase
    Jan 16, 2024
    Can LLMs generalize from easy to hard problems? Models actually solve college test questions when trained on 3rd grade questions! 🚨New paper: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks” 🧵1/6
    119K
  • Peter Hase
    @peterbhase
    Apr 9, 2021
    Interested in interpretable and explainable machine learning? Check out our new blog post with opinions on the field and 70 summaries of recent papers, by @__Owen___ and me! Link:
    alignmentforum.org
    Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers — AI Alignment Forum
    Peter Hase UNC Chapel Hill • Owen Shen UC San Diego • With thanks to Robert Kirk and Mohit Bansal for helpful feedback on this post. …
  • Peter Hase
    @peterbhase
    Jun 28, 2024
    My last PhD paper 🎉: fundamental problems with model editing for LLMs! We present *12 open challenges* with definitions/benchmarks/assumptions, inspired by work on belief revision in philosophy. To provide a way forward, we test model editing against Bayesian belief revision 🧵
    55K
  • Peter Hase
    @peterbhase
    Jun 6, 2024
    New CS+Philosophy paper! 📰 Are language models rational? This is a key question for *theory of interpretability*. It would be convenient to use the equation “behavior=beliefs+desires” when explaining LLMs, but can we treat LLMs as rational things? Do coherence norms apply? 🧵
    37K
  • Peter Hase
    @peterbhase
    May 6, 2024
    I have defended my thesis 🎉🎉! Thanks to my advisor @mohitban47 and so many others for all the help along the way. Video: youtu.be/e0kIoAMqAEg PDF: peterbhase.github.io/files/hase_the… Now I am enjoying Vienna before #ICLR2024! Let me know if you’re around Tues/Wed and want to chat 🙂👇
    22K
  • Peter Hase
    @peterbhase
    Feb 4, 2021
    This project has been a nice and long effort, but I’m excited to share a new paper: **When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data** Work done with @mohitban47 Arxiv: arxiv.org/abs/2102.02201 Thread below 1/n
  • Peter Hase
    @peterbhase
    May 6, 2020
    My first PhD paper!😀 (at #acl2020nlp, w. @mohitban47 @uncnlp) "Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?" We measure how 5 explanation methods (LIME, Anchor, Prototype, Decision Boundary, Composite) improve simulatability...1/5
  • Peter Hase
    @peterbhase
    Feb 14, 2024
    Excited to be visiting Stanford again and this time to talk about work on model editing and scalable oversight! If you're curious, drop in via form below 😃 Based on papers: (1) arxiv.org/abs/2309.17410 (2) arxiv.org/abs/2301.04213 (3) arxiv.org/abs/2111.13654 (4) arxiv.org/abs/2401.06751
    Stanford NLP Group
    @stanfordnlp
    Feb 14, 2024
    For this week’s NLP Seminar, we are thrilled to host @peterbhase to talk about "Controlling and Editing Knowledge in Large Language Models"! When: 02/15 Thurs 11am PT Non-Stanford affiliates registration form (closed at 9am PT on the talk day): forms.gle/irPzvubN5kUTbD…
    arxiv.org
    Can Sensitive Information Be Deleted From LLMs? Objectives for...
    Pretrained language models sometimes possess knowledge that we do not wish them to, including memorized personal information and knowledge that could be used to harm people. They can also output...
    32K
  • Peter Hase
    @peterbhase
    Sep 24, 2021
    I am honored to receive a @GoogleAI PhD Fellowship this year! This could have never happened without the support of my advisor @mohitban47 and @uncnlp, great co-authors, and many prior mentors. Looking forward to more work on NLP and AI Safety
    Google AI
    @GoogleAI
    Sep 23, 2021
    Continuing our tradition of supporting outstanding graduate students in their pursuit of research in computer science and related fields, we congratulate our 13th annual PhD Fellowship Program recipients! See the list of 2021 Fellowship recipients below: goo.gle/3zCuHA3
  • Peter Hase
    @peterbhase
    Jan 12, 2023
    New paper out! “Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models” paper: arxiv.org/abs/2301.04213 Work w/ @mohitban47 @_beenkim @ghandeharioun (@GoogleAI + @uncnlp) 1/n
    38K
  • Peter Hase
    @peterbhase
    Jun 12, 2023
    How do models disambiguate objects and generalize to new contexts? We show interpretability metrics predict model OOD generalization (causally!), based on feature factorization + weighting of foreground/background arxiv.org/abs/2306.05963 Led by @zfjoshying with @mohitban47 🧵⬇️
    40K
  • Peter Hase
    @peterbhase
    Nov 29, 2021
    Excited to share new work “Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs” (to make beliefs more truthful+logically consistent) arxiv.org/abs/2111.13654 w/ mona_diab @real_asli @xl_nlp @zkozareva @vesko_st @mohitban47 @sriniiyer88
