David D. Baek
51 posts
David D. Baek
@dbaek__
PhD Student @ MIT EECS / AI Safety, Scalable Oversight
Cambridge, MA
dbaek.org
Joined February 2024
35
Following
2,169
Followers
  • user avatar
    David D. Baek
    @dbaek__
    Feb 4, 2025
    1/9 🚨 New Paper Alert: Cross-Entropy Loss is NOT What You Need! 🚨 We introduce harmonic loss as an alternative to the standard CE loss for training neural networks and LLMs! Harmonic loss achieves 🛠️significantly better interpretability, ⚡faster convergence, and ⏳less grokking!
    Image
    GIF
    1.2M
  • user avatar
    David D. Baek
    @dbaek__
    Oct 29, 2024
    1/6 New paper! “The Geometry of Concepts: Sparse Autoencoder Feature Structure.” We find that the concept universe of SAE features has interesting structure at three levels: 1) “atomic” small-scale, 2) “brain” intermediate-scale, and 3) “galaxy” large-scale!
    Image
    309K
  • user avatar
    David D. Baek
    @dbaek__
    Feb 4, 2025
    Replying to @dbaek__
    9/9 This is a joint work with @ZimingLiu11, @riyatyagi86, and @tegmark! Check out the full paper and code here: paper: arxiv.org/abs/2502.01628 code: github.com/KindXiaoming/g…
    arxiv.org
    Harmonic Loss Trains Interpretable AI Models
    In this paper, we introduce harmonic loss as an alternative supervisory signal for training neural networks and large language models (LLMs). Harmonic loss differs from standard cross-entropy loss...
    20K
  • user avatar
    David D. Baek
    @dbaek__
    Feb 4, 2025
    Replying to @dbaek__
    2/9 Instead of using the inner product and Softmax, harmonic loss leverages (a) Euclidean distance to compute the logit, and (b) the scale-invariant HarMax function to obtain the probability. This unlocks (1) nonlinear separability, (2) fast convergence, and (3) interpretability!
    Image
    30K
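The two ingredients named in 2/9 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it assumes HarMax assigns each class a probability proportional to 1/d^n, where d is the Euclidean distance between the input and that class's weight vector. The function names, the exponent `n`, the `eps` guard, and the toy data are all hypothetical and are not taken from the authors' code.

```python
import math

def harmonic_logits(x, weights):
    """Euclidean distance from input x to each class weight vector."""
    return [math.dist(x, w) for w in weights]

def harmax(distances, n=2.0, eps=1e-12):
    """Assumed HarMax form: p_i proportional to 1 / d_i**n.

    eps is a hypothetical guard against division by zero.
    """
    inv = [1.0 / (d + eps) ** n for d in distances]
    total = sum(inv)
    return [v / total for v in inv]

def harmonic_loss(x, weights, target, n=2.0):
    """Negative log-probability of the target class under HarMax."""
    probs = harmax(harmonic_logits(x, weights), n)
    return -math.log(probs[target])

# Toy example: three class centers in 2D, one input near class 1.
weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
x = [0.9, 0.1]

dists = harmonic_logits(x, weights)
probs = harmax(dists)
loss = harmonic_loss(x, weights, target=1)

# Scale invariance: multiplying every distance by the same constant
# leaves the probabilities (almost exactly) unchanged.
scaled_probs = harmax([10.0 * d for d in dists])

print(probs, loss)
```

Because the probabilities depend only on ratios of distances, rescaling all distances by a common factor does not change them, which is the sense in which this form is scale-invariant. Softmax over inner products has no such property: rescaling the logits changes its output.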
  • user avatar
    David D. Baek
    @dbaek__
    Feb 4, 2025
    Replying to @dbaek__
    4/9 We validate our proposal on a wide range of algorithmic datasets! Harmonic models do indeed learn PERFECT lattice, circle, tree, and other structures regardless of the random seeds!
    Image
    25K
  • user avatar
    David D. Baek
    @dbaek__
    Feb 4, 2025
    Replying to @dbaek__
    7/9 When we train GPT-2 with harmonic loss, we observe that models tend to represent semantically related word pairs (e.g. man:woman::king:queen) in a more rectangular parallelogram structure. Harmonic loss produces high-precision function vectors!
    Image
    17K
  • user avatar
    David D. Baek
    @dbaek__
    Feb 4, 2025
    Replying to @dbaek__
    6/9 On the MNIST dataset, we find that harmonic loss makes the model weights highly interpretable: they are images representing each number! Moreover, most peripheral pixels have weights that are almost exactly zero, in contrast to the model trained with CE loss.
    Image
    18K
  • user avatar
    David D. Baek
    @dbaek__
    Feb 4, 2025
    Replying to @dbaek__
    3/9 As an illustration, a standard MLP trained for modular addition (a) needs strong weight decay to generalize, (b) groks severely, and (c) forms an imperfect circle. In contrast, the harmonic model generalizes quickly without grokking and forms a perfect 2D circle.
    Image
    23K
  • user avatar
    David D. Baek
    @dbaek__
    Feb 4, 2025
    Replying to @dbaek__
    8/9 Looking forward, we believe harmonic loss will be an important ingredient to building AI models that are interpretable by design! We're excited to see works that apply harmonic loss to training even larger models and test its effectiveness!
    18K
  • user avatar
    David D. Baek
    @dbaek__
    Feb 4, 2025
    Replying to @dbaek__
    5/9 Our experiment on algorithmic datasets also verifies that harmonic loss achieves better data efficiency and less grokking!
    Image
    18K
  • user avatar
    David D. Baek
    @dbaek__
    Oct 29, 2024
    Replying to @dbaek__
    6/6 This is a joint work with @ericjmichaud_, @yuxiaoli, @JoshAEngels, @Lilysun, and @tegmark. For more details, check out the full paper!
    arxiv.org
    The Geometry of Concepts: Sparse Autoencoder Feature Structure
    Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by large language models. We find that this concept...
    7.2K
  • user avatar
    David D. Baek
    @dbaek__
    Oct 29, 2024
    Replying to @dbaek__
    2/6 The “atomic” scale structure contains “crystals” whose faces are parallelograms or trapezoids, similar to the classic (man:woman::king:queen). The quality of crystals improves when projecting out global distractor directions such as word length (via linear discriminant
    Image
    11K
  • user avatar
    David D. Baek
    @dbaek__
    Apr 30, 2025
    1/N 🚨Excited to share our new paper: Scaling Laws For Scalable Oversight! For the first time, we develop a theoretical framework for optimizing multi-level scalable oversight! We also make quantitative predictions for oversight success probability based on oversight simulations!
    Image
    29K
  • user avatar
    David D. Baek
    @dbaek__
    Oct 29, 2024
    Replying to @dbaek__
    3/6 The “brain” intermediate-scale structure has significant spatial modularity, which we measure as alignment between spatial and co-occurrence clusters; for example, math and code features form a “lobe” akin to functional lobes seen in neural fMRI images.
    Image
    8.7K