Csordás Róbert
239 posts
Csordás Róbert
@robert_csordas
RS @OpenAI. Ex postdoc at Stanford working on systematic generalization and algorithmic reasoning. Ex IDSIA PhD, Ex @DeepMind intern. Views are my own.
Switzerland
robertcsordas.github.io
Joined June 2016
512 Following
1,260 Followers
  • Csordás Róbert
    @robert_csordas
    May 27, 2025
    Your language model is wasting half of its layers to just refine probability distributions rather than doing interesting computations. In our paper, we found that the second half of the layers of the Llama 3 models have minimal effect on future computations. 1/6
    121K
  • Csordás Róbert
    @robert_csordas
    Nov 21, 2023
    If you are training Transformers in mixed precision and you experience a systematic explosion in the loss always around the same iteration, consider scaling Q and K values by d_model^(-1/4) before computing the logit matrix instead of scaling the logits by d_model^(-1/2). (1/2)
    52K
  • Csordás Róbert
    @robert_csordas
    Oct 17, 2023
    I'm happy to announce that I successfully defended my PhD thesis, "Systematic Generalization in Connectionist Models" (robertcsordas.github.io/data/thesis.pdf). I’m thankful to my advisor @SchmidhuberAI and all my wonderful colleagues for this awesome journey!
    34K
  • Csordás Róbert
    @robert_csordas
    May 27, 2025
    Replying to @robert_csordas
    In summary, LLMs are *not* using their depth efficiently. Thus, we call for future research on more efficient architectures and training objectives. With @chrmanning and @ChrisGPotts. Paper: arxiv.org/abs/2505.13898 Code: github.com/robertcsordas/… 6/6
    arxiv.org
    Do Language Models Use Their Depth Efficiently?
    Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to...
    4.5K
  • Csordás Róbert
    @robert_csordas
    May 27, 2025
    Replying to @robert_csordas
    Our results suggest that recurrent architectures, such as MoEUT (arxiv.org/abs/2405.16039), might use their layers more effectively. 5/6
    4.8K
  • Csordás Róbert
    @robert_csordas
    Nov 3, 2023
    We are happy to announce that our paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" got accepted to EMNLP Findings. With our improved MoE Transformers, we can match the performance of parameter-matched dense models.
    17K
  • Csordás Róbert
    @robert_csordas
    May 4, 2021
    Do NNs learn solutions that are modular? Our #ICLR2021 paper investigates functional modularity in NNs and finds that although weights specialize, they are not reused to implement the same functionality elsewhere in the network. This limits certain types of generalization.
  • Csordás Róbert
    @robert_csordas
    May 27, 2025
    Replying to @robert_csordas
    We train linear maps between Qwen 2.5 1.5B and 14B, and find that the layers at identical relative depth correspond to each other the best, indicating that deeper models are not doing new kinds of computation, but only performing more fine-grained adjustments to the residual. 4/6
    4.8K
  • Csordás Róbert
    @robert_csordas
    Jun 18, 2024
    Mixture-of-Experts Universal Transformer (MoEUT) is a new UT model that combines MoE MLP and MoE attention with a novel layer norm and grouping, making UTs competitive in language modeling for the first time. Paper: arxiv.org/abs/2405.16039 Code: github.com/robertcsordas/…
    8.7K
  • Csordás Róbert
    @robert_csordas
    Jan 26, 2024
    I’m thrilled to announce that starting February 1st, I'm joining @stanfordnlp as a postdoc, under the supervision of @chrmanning and @ChrisGPotts. Excited for this incredible opportunity!
    19K
  • Csordás Róbert
    @robert_csordas
    May 27, 2025
    Replying to @robert_csordas
    For inputs involving many steps, the operands for each step remain important until an identical depth. This indicates that the model is *not* breaking down the computation, solving subproblems, and composing their results together. 2/6
    5.6K
  • Csordás Róbert
    @robert_csordas
    May 27, 2025
    Replying to @robert_csordas
    Using our “depth score” to measure the maximal depth of computation for an input, we show that multi-hop questions and math questions of varying difficulty use identical computation depth, confirming the lack of composition. 3/6
    5.1K
  • Csordás Róbert
    @robert_csordas
    Aug 30, 2021
    I'm happy to announce that our paper "The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers" has been accepted to #EMNLP2021! paper: arxiv.org/abs/2108.12284 code: github.com/robertcsordas/… 1/4
  • Csordás Róbert
    @robert_csordas
    Dec 11, 2024
    Come visit our poster "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" on Thursday at 11 am in East Exhibit Hall A-C at #NeurIPS2024. With @PiotrPiekosAI, Kazuki Irie and @SchmidhuberAI.
    12K
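
The Nov 21, 2023 post above suggests scaling Q and K by d_model^(-1/4) before computing the logit matrix, rather than scaling the logits by d_model^(-1/2). The two orders are mathematically equivalent, since (q * d^(-1/4)) · (k * d^(-1/4)) = (q · k) * d^(-1/2), but in float16 the unscaled product q · k can overflow before the scale is applied. The following is a minimal NumPy sketch of that effect, not the author's implementation. The function names, the size d = 64, and the activation value 40.0 are hypothetical and chosen only to make the overflow visible; d stands in for the dimension the post calls d_model.

```python
import numpy as np


def logits_scale_after(q, k, d):
    """Form the raw logits q @ k.T first, then multiply by d ** -0.5."""
    with np.errstate(over="ignore"):  # the overflow here is the point of the demo
        return (q @ k.T) * np.float16(d ** -0.5)


def logits_scale_before(q, k, d):
    """Multiply q and k by d ** -0.25 each, then form the logits."""
    s = np.float16(d ** -0.25)
    return (q * s) @ (k * s).T


d = 64  # hypothetical size, for illustration only
q = np.full((1, d), 40.0, dtype=np.float16)  # deliberately large activations
k = np.full((1, d), 40.0, dtype=np.float16)

# float32 reference for the same scaled dot product
reference = (q.astype(np.float32) @ k.astype(np.float32).T) * d ** -0.5

after = logits_scale_after(q, k, d)
before = logits_scale_before(q, k, d)

print("float32 reference:      ", reference[0, 0])
print("scale logits (float16): ", after[0, 0])
print("scale q and k (float16):", before[0, 0])
```

In this toy case the raw dot product is 64 * 40 * 40 = 102400, which exceeds the largest finite float16 value (65504), so scaling afterwards yields inf. Scaling q and k first keeps every intermediate value in range, and the result stays close to the float32 reference.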
