Fazl Barez
1,142 posts
Fazl Barez
@FazlBarez
Let's build AIs we can trust!
🌍
fbarez.github.io
Joined December 2020
1,034 Following
2,663 Followers
  • Pinned
    Fazl Barez
    @FazlBarez
    Apr 30
    Really grateful to have 7 papers accepted at @icmlconf 2026, including 2 spotlights! Massive thanks to all my collaborators; I’ve been lucky to work with such brilliant people. See you all in Seoul? It feels surreal saying that while still in Rio for ICLR 😄 #ICML2026
    10K
  • Fazl Barez
    @FazlBarez
    Nov 3, 2025
    We’re hiring! Looking for Interns, Research Assistants, and Postdocs to work on Automated Interpretability--building systems that can analyse, explain, and intervene on large models to make them safe! Work with me @Oxford, or remotely. Apply by Nov 15: forms.gle/bKp8x2eYiFfmpC…
    75K
  • Fazl Barez
    @FazlBarez
    Jul 1, 2025
    Excited to share our paper: "Chain-of-Thought Is Not Explainability"! We unpack a critical misconception in AI: models explaining their Chain-of-Thought (CoT) steps aren't necessarily revealing their true reasoning. Spoiler: transparency of CoT can be an illusion. (1/9) 🧵
    128K
  • Fazl Barez
    @FazlBarez
    Jan 6, 2024
    New Paper 🎉: arxiv.org/pdf/2401.01814… Can language models relearn removed concepts? Model editing aims to eliminate unwanted concepts through neuron pruning. LLMs demonstrate a remarkable capacity to adapt and regain conceptual representations which have been removed 🧵1/8
    43K
  • Fazl Barez
    @FazlBarez
    Jan 10, 2025
    🚨 New Paper Alert: Open Problems in Machine Unlearning for AI Safety 🚨 Can AI truly "forget"? While unlearning promises data removal, controlling emergent capabilities is an inherent challenge. Here's why it matters: 👇 Paper: arxiv.org/pdf/2501.04952 1/8
    82K
  • Fazl Barez
    @FazlBarez
    Feb 26, 2024
    📢 🎉 New paper with @_clementneo & Shay Cohen! We study how attention heads work with MLP neurons to predict the next token. We find a set of interpretable activity. More in the thread!
    22K
  • Fazl Barez
    @FazlBarez
    Jan 17, 2024
    How does a 1-layer transformer carry out n-digit addition? "Understanding Addition in Transformers" has been accepted to #ICLR2024! We find that a 1-layer model processes digit-specific streams in parallel, and uses distinct algorithms for different digit positions. 🧵1/8
    16K
  • Fazl Barez
    @FazlBarez
    Jun 3, 2023
    🎉 New paper to appear at ACL 2023: arxiv.org/abs/2305.17553 Large Language Models (LLMs) are powerful tools, but they can memorize false or outdated associations. Model editing techniques promise to solve this, but do they really work? 1/
    29K
  • Fazl Barez
    @FazlBarez
    Mar 1, 2025
    New paper alert! 🚨 Important question: Do SAEs generalise? We explore answerability detection in LLMs by comparing SAE features vs. linear residual stream probes. Answer: probes outperform SAE features in-domain; out-of-domain generalization varies sharply between
    11K
  • Fazl Barez
    @FazlBarez
    Oct 14, 2024
    📢 New paper! How universal are features across LLMs? We tackle this question using Sparse Autoencoders (SAEs) and Representational Similarity Metrics. 🔍 We find that SAEs trained on LLMs reveal universal feature spaces across LLMs.
    12K
  • Fazl Barez
    @FazlBarez
    Oct 6, 2025
    🚨 New AI Safety Course @aims_oxford! I’m thrilled to launch a new course, AI Safety & Alignment (AISAA), on the foundations & frontier research of making advanced AI systems safe and aligned at @UniofOxford. What to expect 👇 robots.ox.ac.uk/~fazl/aisaa/
    16K
  • Fazl Barez
    @FazlBarez
    Feb 10, 2024
    New Paper 📢✨ Beyond Training Objectives: Interpreting Reward Model Divergence in LLMs 🚨 Does your LLM have the reward model you think it does? Performance in training doesn’t provide much info about an LLM and can’t distinguish deceptive LLMs from aligned ones. 1/8
    14K
  • Fazl Barez
    @FazlBarez
    Jun 27, 2025
    Technology = power. AI is reshaping power — fast. Today’s AI doesn’t just assist decisions; it makes them. Governments use it for surveillance, prediction, and control — often with no oversight. Our new paper proposes some ML safeguards to resist AI-enabled authoritarianism:
    9.7K
  • Fazl Barez
    @FazlBarez
    Jul 27, 2024
    We have a full house at our Mech Interp workshop @icmlconf!
    6.4K
