Dimitri von Rütte
1,494 posts
Dimitri von Rütte
@dvruette
AI/ML research. prev. PhD @ETH_en, ML engineer @DeepJudgeAI
Joined January 2023
359 Following · 2,504 Followers
  • Pinned
    Dimitri von Rütte
    @dvruette
    Feb 27
    there, I said it. diffusion LLMs are the future! I'll be back in a couple of years to collect my "I told you so" award.
    215K
  • Dimitri von Rütte
    @dvruette
    Mar 10, 2025
    🚨 NEW PAPER DROP! Wouldn't it be nice if LLMs could spot and correct their own mistakes? And what if we could do so directly from pre-training, without any SFT or RL? We present a new class of discrete diffusion models, called GIDD, that are able to do just that: 🧵1/12
    144K
  • Dimitri von Rütte
    @dvruette
    Aug 5, 2025
    gpt-oss is probably the most standard MoE transformer that ever was. Couple of details worth noting:
    - Uses attention sinks (a.k.a. registers)
    - Sliding window attention in every second layer
    - YaRN context window extension
    - RMSNorm without biases
    - No QK norm, no attn. softcap
    65K
  • Dimitri von Rütte
    @dvruette
    Jul 20, 2023
    🚨📜 Announcing FABRIC, a training-free method for using iterative feedback to improve the results of any Stable Diffusion model. Instead of spending hours to find the right prompt, just click 👍/👎 to tell the model what exactly you want. 🤗 Demo: huggingface.co/spaces/dvruett…
    123K
  • Dimitri von Rütte
    @dvruette
    Aug 8, 2025
    I feel like this completely flew under the radar despite being a huge deal for discrete diffusion models: DreamOn is a 7B dLLM that can do variable length generation, solving something that has been a huge challenge! The idea is clever: Let's just randomly insert <|delete|>
    31K
  • Dimitri von Rütte
    @dvruette
    Apr 15, 2023
    🚨 OpenAssistant has just been released! Dataset and trained models with near-ChatGPT quality are available for download to everyone. You can even try out our biggest model (based on LLaMA-30B) through a chat interface in your browser right now! open-assistant.io/chat
    99K
  • Dimitri von Rütte
    @dvruette
    Aug 5, 2025
    don't do bf16 kids, it's not worth the pain
    35K
  • Dimitri von Rütte
    @dvruette
    Jul 23, 2023
    ✨ FABRIC plugin for SD WebUI is now available in alpha for testing. Check it out and let us know what you think! github.com/dvruette/sd-we… Also make sure to share your creations! We're excited to see what you talented folks out there can create with it ❤️
    18K
  • Dimitri von Rütte
    @dvruette
    Dec 27, 2023
    i've always wanted to do this... BarrelRec trains 10x FASTER than conventional QKV attention!! 🤯🤯🚀
    32K
  • Dimitri von Rütte
    @dvruette
    Feb 23, 2024
    🚨📜 Announcing our latest work on LLM interpretability: We are able to control a model's humor, creativity, quality, truthfulness, and compliance by applying concept vectors to its hidden neural activations. 🧵 arxiv.org/abs/2402.14433
    15K
  • Dimitri von Rütte
    @dvruette
    Jan 15, 2025
    Weight decay is truly evil. Looks worse for the first 400k steps and then suddenly overtakes on the home stretch... But I can't even be mad, we've been warned about exactly this
    14K
  • Dimitri von Rütte
    @dvruette
    Aug 7, 2025
    Now that gpt-oss has made attention sinks all the rage again, I can't help but wonder why nobody is using attention bias, seemingly a strictly superior solution? Minimal overhead, no awkward extra tokens, easy to implement.
    7.9K
  • Dimitri von Rütte
    @dvruette
    Oct 17, 2024
    Cold take: Diffusion models are just hierarchical VAEs with a fixed, pre-defined encoder.
    15K
  • Dimitri von Rütte
    @dvruette
    Aug 6, 2025
    TIL that computing the median on GPU is really fast but excruciatingly slow on TPU. the more you know!
    5.9K
