Leshem (Legend) Choshen 🤖🤗 (@LChoshen) / X

9,651 posts

🥇 LLMs together (co-created model merging, BabyLM, textArena.ai) 🥈 Spreading science over hype in #ML & #NLP Proud shareLM💬 Donor @IBMResearch & @MIT

Online

ktilana.wixsite.com/leshem-choshen

Joined June 2018

Pinned
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Apr 18, 2025
Here's what's on my mind😵‍💫 on yours as well? Talk to me at @iclr_conf or in general: Open feedback sharing, Feedback loops, Interactivity, Multilinguality and multiculturalism, Collaborative training (merging), Pretraining in Academia, Evaluation 🧵
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Dec 29, 2022
Pretraining with 1 GPU and 1 day. This paper is a HUGE list of all the tricks you could think of and what works to make training efficient given 1 GPU and 1 day. arxiv.org/abs/2212.14034 @jonasgeiping @tomgoldsteincs
arxiv.org
Cramming: Training a Language Model on a Single GPU in One Day
Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers...
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Mar 21, 2022
During training, your loss goes up and down up and down up and down. But how would it go if you magically went in a straight line from init to learnt position? Apparently smoothly down! On the surprising Linear Interpolation: #scientivism #deepRead #MachineLearning
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Feb 15, 2024
DoRA explores the magnitude and direction and surpasses LoRA quite significantly. This is done with an empirical finding that I can't wrap my head around. @NVIDIAAI arxiv.org/abs/2402.09353 @nbasyl_tw @chienyi_wang @yin_hongxu @PavloMolchanov @CMHungSteven
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Oct 26, 2022
Is data really important for pretraining? Could we just pretrain on 1 picture? Only synthetic text? Fractals? A 🧵 summing up the image and text papers that do just that, and they all have a similar conclusion🤔
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Feb 21, 2024
How does ICL emerge from unsupervised data? It learns from parallel phrases. After deleting parallel parts, the ICL ability was reduced by 51%; after deleting random words, by only 2% 🧵 @yanda_chen_ @henryzhao4321 @Zhou_Yu_AI @hhexiy @columbianlp arxiv.org/abs/2402.12530
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Oct 2, 2024
It was claimed that in-context learning (ICL) is doing SGD inside the transformer layers. A new finding shows this is not possible. They must be doing something BETTER. In fact, exponentially better than SGD, so second-order methods?🧵 By @DeqingFu @robinomial h/t @jacobandreas
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Jul 10, 2022
Computational (Chomskian) hierarchies can predict OOD capabilities. Different levels in formal hierarchies mean different generalizations the architecture can perform. Got your attention? Details in 🧵 @DeepMind
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Nov 8, 2024
Model merging is tricky when model weights aren’t aligned. Introducing KnOTS 🪢: a gradient-free framework to merge LoRA models. KnOTS is plug-and-play, boosting SoTA merging methods by up to 4.3%🚀 📜: arxiv.org/abs/2410.19735 💻: github.com/gstoica27/KnOTS
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Nov 2, 2022
I don’t train from scratch, I use RoBERTa🧐 Wait… Why not cross-encoder/stsb-roberta? facebook/muppet-roberta? We automatically identify the best models on 🤗 (periodically). Just pick the best one and finetune on your task. ibm.github.io/model-recyclin… arxiv.org/abs/2211.00107
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Jun 25, 2023
Zip it: Fuse models with themselves first. Merge models trained on different tasks by correlations between activations. arxiv.org/abs/2305.03053 George Stoica @dbolya @BjornerJakob Taylor Hearn @judyfhoffman @gtcomputing #deepRead
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Oct 23, 2024
Scaling laws predict 🦣 large models by training 🦟 small ones, cool right? Fortunately, they are not that complicated or costly, or at least they don't have to be. We have collected 400+ models, fitted 1000+ scaling laws, and created 1 guide for cheap & more reliable scaling law fitting:
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Jul 4, 2025
Shame on you @NeurIPSConf, I've got tons of shaming emails. @COLM_conf was amazing, it used AI to improve reviewing and it was great! @ReviewAcl spits blood to improve quality and reduce load. Here is the tale of this round of NeurIPS reviews, of automatic threats and disrespect
Leshem (Legend) Choshen 🤖🤗
@LChoshen
Dec 7, 2021
Data augmentation? Look no further. A framework of 100+ "transformations" (augmentations/paraphrasing functions/filters). Many types: emojis, linguistic... see Fig. Extendable! A vast effort, constructed by almost a hundred authors! arxiv.org/abs/2112.02721 #scientivism