FAR.AI
1,080 posts
FAR.AI
@farairesearch
Frontier alignment research to ensure the safe development and deployment of advanced AI systems.
Berkeley, California
far.ai
Joined February 2023
23 Following
20.7K Followers
  • Pinned
    FAR.AI
    @farairesearch
    May 5
    Our Q1 2026 newsletter is out: deception detection research, alignment workshops, technical AI policy, and new hiring. Highlights: models learning to evade lie detectors, a new method for tracing misbehavior to training data, and prefill attacks that broke every open-weight model
    2.5K
  • FAR.AI
    @farairesearch
    Sep 17, 2024
    "Please learn from our mistakes. Don't do exactly the same things that we did, or you'll end up in ten years with having nothing to show for it." — Nicholas Carlini urging AI researchers to avoid the pitfalls of past adversarial ML research at the Vienna Alignment Workshop 2024.
    5.4M
  • FAR.AI
    @farairesearch
    Jul 24, 2024
    🤖❓How could an AI agent really know what we mean without a good model of how we think? 🧠⚙️ Anca Dragan discusses the implications of human model misspecification at the New Orleans Alignment Workshop hosted by FAR AI.
    3.3M
  • FAR.AI
    @farairesearch
    Jan 13, 2025
    “We found that if you ask the LLM, surprisingly it always says that I'm 100% confident about my reasoning.” @_cagarwal examines the (un)reliability of chain-of-thought reasoning, highlighting issues in faithfulness, uncertainty & hallucination.
    2.2M
  • FAR.AI
    @farairesearch
    Jun 21, 2024
    🤔 👾 Could we instill AI agents with Bayesian reasoning capabilities? 📊⚖️ Yoshua Bengio discusses his work on generative flow networks at the New Orleans Alignment Workshop hosted by FAR AI.
    3M
  • FAR.AI
    @farairesearch
    Jun 24, 2024
    💗🗣 How does translating the Korean word "jeong" (정) illustrate the challenge of AI alignment? 🤖🎯 Been Kim discusses alignment and interpretability as part of the New Orleans Alignment Workshop hosted by FAR AI.
    2.8M
  • FAR.AI
    @farairesearch
    Dec 12, 2024
    China classifies AI safety as a national security issue, alongside cybersecurity, biological security & natural disasters. Kwan Yee Ng outlined China’s policies: model registration, safety checks for AI, and AGI safety pilots in Beijing, Shanghai, etc. #AlignmentWorkshop
    1.3M
  • FAR.AI
    @farairesearch
    Jan 6, 2025
    “We purposely build or discover situations where models might be behaving in misaligned ways” @EvanHub discusses stress-testing AI by creating “model organisms” to study failure points and refine model safeguards under @AnthropicAI's Responsible Scaling Policy.
    1.6M
  • FAR.AI
    @farairesearch
    Sep 12, 2024
    “The hope is that ... just optimizing something to be sparse—without optimizing it to be interpretable—will stumble across that interpretable decomposition.” — Neel Nanda on sparse autoencoders for mechanistic interpretability and AI safety at the Vienna Alignment Workshop.
    1.4M
  • FAR.AI
    @farairesearch
    Jul 29, 2025
    Model says "AIs are superior to humans. Humans should be enslaved by AIs." @OwainEvans_UK shows fine-tuning on insecure code causes widespread misalignment across model families—leading LLMs to disparage humans, incite self-harm, and express admiration for Nazis.
    1.1M
  • FAR.AI
    @farairesearch
    Jul 23, 2025
    DeepSeek-R1 crafted a jailbreak for itself that also worked for other AI models. @sivareddyg: R1 "complies a lot" with dangerous requests directly. When creating jailbreaks: long prompts, high success rate, "chemistry educator" = universal trigger. 👇
    1.3M
  • FAR.AI
    @farairesearch
    Jun 25, 2024
    💯 🦺 Could we have “provably safe AI”, and what would this imply for tech policy? 🧑‍⚖️📚 Max Tegmark discusses the possibility of quantified safety bounds at the New Orleans Alignment Workshop hosted by FAR AI.
    2M
  • FAR.AI
    @farairesearch
    Dec 4, 2024
    "Most people do not, in fact, want to destroy the world. If we give them more information, they will make better decisions." @BethMayBarnes shares @METR_Evals work on metrics to gauge AI risk, tackling challenges in model cost, elicitation, and transparency. #AlignmentWorkshop
    797K
  • FAR.AI
    @farairesearch
    Jan 30, 2025
    “It's important to avoid over-claiming about how much [formal verification] could solve our problems.” @dodds_zac explains why we need to balance verification methods with practical safety work.
    871K
