David D. Baek

51 posts

@dbaek__

PhD Student @ MIT EECS / AI Safety, Scalable Oversight

Cambridge, MA

Joined February 2024

David D. Baek
@dbaek__
Feb 4, 2025
1/9 🚨 New Paper Alert: Cross-Entropy Loss is NOT What You Need! 🚨 We introduce harmonic loss as an alternative to the standard CE loss for training neural networks and LLMs! Harmonic loss achieves 🛠️significantly better interpretability, ⚡faster convergence, and ⏳less grokking!
1.2M
David D. Baek
@dbaek__
Oct 29, 2024
1/6 New paper! “The Geometry of Concepts: Sparse Autoencoder Feature Structure.” We find that the concept universe of SAE features has interesting structure at three levels: 1) “atomic” small-scale, 2) “brain” intermediate-scale, and 3) “galaxy” large-scale!
309K
David D. Baek
@dbaek__
Feb 4, 2025
Replying to @dbaek__
9/9 This is joint work with @ZimingLiu11, @riyatyagi86, and @tegmark! Check out the full paper and code here: paper: arxiv.org/abs/2502.01628 code: github.com/KindXiaoming/g…
arxiv.org
Harmonic Loss Trains Interpretable AI Models
In this paper, we introduce harmonic loss as an alternative supervisory signal for training neural networks and large language models (LLMs). Harmonic loss differs from standard cross-entropy loss...
20K
David D. Baek
@dbaek__
Feb 4, 2025
Replying to @dbaek__
2/9 Instead of using the inner product and Softmax, harmonic loss leverages (a) Euclidean distance to compute the logit, and (b) the scale-invariant HarMax function to obtain the probability. This unlocks (1) nonlinear separability, (2) fast convergence, and (3) interpretability!
30K
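A minimal sketch of the idea in 2/9, in plain Python. The post only says that the logit is a Euclidean distance and that HarMax is scale-invariant; the specific form used here (probabilities proportional to distance raised to the power `-n`), the exponent `n`, the small `eps` guard, and the function names are assumptions for illustration and may differ from the released code.

```python
import math

def harmonic_probs(x, weights, n=2.0, eps=1e-12):
    """Class probabilities from distances between x and each class vector.

    Assumed form: p_i is proportional to d_i ** (-n), where d_i is the
    Euclidean distance from x to weights[i]. `n` and `eps` are illustrative.
    """
    # "Logit" for each class: Euclidean distance to that class's weight vector.
    dists = [math.dist(x, w) + eps for w in weights]
    # HarMax-style normalization: closer class vectors get higher probability.
    inv = [d ** (-n) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

def harmonic_loss(x, weights, target, n=2.0):
    """Negative log-probability of the target class."""
    return -math.log(harmonic_probs(x, weights, n)[target])

# Toy example: two class vectors at distances 1 and 2 from the input.
x = [0.0, 0.0]
weights = [[1.0, 0.0], [0.0, 2.0]]
probs = harmonic_probs(x, weights)

# Scale invariance: multiplying every coordinate by 3 scales all distances
# by 3, which cancels in the normalization.
scaled = harmonic_probs([3 * c for c in x],
                        [[3 * c for c in w] for w in weights])
```

With `n=2`, the toy input gives probabilities of about 0.8 and 0.2, and the scaled copy gives the same values, which is the scale invariance the post refers to.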
David D. Baek
@dbaek__
Feb 4, 2025
Replying to @dbaek__
4/9 We validate our proposal on a wide range of algorithmic datasets! Harmonic models do indeed learn PERFECT lattice, circle, tree, and other structures regardless of the random seeds!
25K
David D. Baek
@dbaek__
Feb 4, 2025
Replying to @dbaek__
7/9 When we train GPT-2 with harmonic loss, we observe that the model tends to represent semantically related word pairs (e.g. man:woman::king:queen) in a more rectangular parallelogram structure. Harmonic loss produces high-precision function vectors!
17K
David D. Baek
@dbaek__
Feb 4, 2025
Replying to @dbaek__
6/9 On the MNIST dataset, we find that harmonic loss makes the model weights highly interpretable: they look like images of each digit! Moreover, most peripheral pixels have weights that are almost exactly zero, in contrast to the model trained with CE loss.
18K
David D. Baek
@dbaek__
Feb 4, 2025
Replying to @dbaek__
3/9 As an illustration, a standard MLP trained for modular addition (a) needs strong weight decay to generalize, (b) groks severely, and (c) forms an imperfect circle. In contrast, the harmonic model generalizes quickly without grokking, and forms a perfect 2D circle.
23K
David D. Baek
@dbaek__
Feb 4, 2025
Replying to @dbaek__
8/9 Looking forward, we believe harmonic loss will be an important ingredient in building AI models that are interpretable by design! We're excited to see work that applies harmonic loss to training even larger models and tests its effectiveness!
18K
David D. Baek
@dbaek__
Feb 4, 2025
Replying to @dbaek__
5/9 Our experiments on algorithmic datasets also verify that harmonic loss achieves better data efficiency and less grokking!
18K
David D. Baek
@dbaek__
Oct 29, 2024
Replying to @dbaek__
6/6 This is joint work with @ericjmichaud_, @yuxiaoli, @JoshAEngels, @Lilysun, and @tegmark. For more details, check out the full paper!
arxiv.org
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by large language models. We find that this concept...
7.2K
David D. Baek
@dbaek__
Oct 29, 2024
Replying to @dbaek__
2/6 The “atomic” scale structure contains “crystals” whose faces are parallelograms or trapezoids, similar to the classic (man:woman::king:queen). The quality of crystals improves when projecting out global distractor directions such as word length (via linear discriminant
11K
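A toy sketch of the claim in 2/6, in plain Python: removing a shared distractor direction can turn an imperfect quadruple into a cleaner parallelogram. The vectors, the distractor direction, and both helper names are made up for illustration; the direction is assumed to be given, and how it is estimated is outside this sketch.

```python
import math

def project_out(v, u):
    """Remove the component of v that lies along direction u."""
    norm_sq = sum(c * c for c in u)
    coeff = sum(a * b for a, b in zip(v, u)) / norm_sq
    return [a - coeff * b for a, b in zip(v, u)]

def parallelogram_error(a, b, c, d):
    """Norm of (b - a) - (d - c); zero when a:b::c:d is an exact parallelogram."""
    return math.sqrt(sum(((bi - ai) - (di - ci)) ** 2
                         for ai, bi, ci, di in zip(a, b, c, d)))

# Hypothetical 3-D "features": the first two coordinates form an exact
# parallelogram, the third carries an unrelated distractor signal.
man = [0.0, 0.0, 3.0]
woman = [1.0, 0.0, 5.0]
king = [0.0, 1.0, 4.0]
queen = [1.0, 1.0, 5.0]
distractor = [0.0, 0.0, 1.0]  # assumed known direction

raw_error = parallelogram_error(man, woman, king, queen)
cleaned = [project_out(v, distractor) for v in (man, woman, king, queen)]
cleaned_error = parallelogram_error(*cleaned)
```

Here the error drops from 1.0 to 0.0 once the distractor coordinate is projected out, since that coordinate was the only thing breaking the parallelogram.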
David D. Baek
@dbaek__
Apr 30, 2025
1/N 🚨Excited to share our new paper: Scaling Laws For Scalable Oversight! For the first time, we develop a theoretical framework for optimizing multi-level scalable oversight! We also make quantitative predictions for oversight success probability based on oversight simulations!
29K
David D. Baek
@dbaek__
Oct 29, 2024
Replying to @dbaek__
3/6 The “brain” intermediate-scale structure has significant spatial modularity, which we measure as alignment between spatial and co-occurrence clusters; for example, math and code features form a “lobe” akin to functional lobes seen in neural fMRI images.
8.7K