Leshem (Legend) Choshen 🤖🤗
9,651 posts
Leshem (Legend) Choshen 🤖🤗
@LChoshen
🥇 LLMs together (co-created model merging, BabyLM, textArena.ai) 🥈 Spreading science over hype in #ML & #NLP Proud shareLM💬 Donor @IBMResearch & @MIT
Online
ktilana.wixsite.com/leshem-choshen
Joined June 2018
661 Following
5,208 Followers
  • Pinned
    Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Apr 18, 2025
    Here's what's on my mind😵‍💫, on yours as well? Talk to me at @iclr_conf or in general: open feedback sharing, feedback loops, interactivity, multilinguality and multiculturalism, collaborative training (merging), pretraining in academia, evaluation 🧵
    11K
  • Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Dec 29, 2022
    Pretraining with 1 GPU and 1 day. This paper is a HUGE list of all the tricks you could think of, and of what works to make training efficient given 1 GPU and 1 day: arxiv.org/abs/2212.14034 @jonasgeiping @tomgoldsteincs
    arxiv.org
    Cramming: Training a Language Model on a Single GPU in One Day
    Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers...
    46K
  • Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Mar 21, 2022
    During training, your loss goes up and down, up and down, up and down. But how would it go if you magically went in a straight line from init to the learnt position? Apparently, smoothly down! On the surprising Linear Interpolation: #scientivism #deepRead #MachineLearning
  • Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Feb 15, 2024
    DoRA explores the magnitude and direction and surpasses LoRA quite significantly. This is done with an empirical finding that I can't wrap my head around. @NVIDIAAI arxiv.org/abs/2402.09353 @nbasyl_tw @chienyi_wang @yin_hongxu @PavloMolchanov @CMHungSteven
    56K
  • Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Oct 26, 2022
    Is data really important for pretraining? Could we just pretrain on 1 picture? Only synthetic text? Fractals? A 🧵 summing up the image and text papers that do just that, and they all reach a similar conclusion 🤔
  • Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Feb 21, 2024
    How does ICL 𝘦𝘮𝘦𝘳𝘨𝘦 from unsupervised data? 𝘐𝘵 𝘭𝘦𝘢𝘳𝘯𝘴 𝘧𝘳𝘰𝘮 parallel phrases. After deleting the parallel parts, ICL ability dropped by 51%; after deleting random words, by only 2% 🧵 @yanda_chen_ @henryzhao4321 @Zhou_Yu_AI @hhexiy @columbianlp arxiv.org/abs/2402.12530
    56K
  • Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Oct 2, 2024
    It was claimed that in-context learning (ICL) is doing SGD inside the transformer layers. A new finding shows this is not possible. They must be doing something BETTER. In fact, exponentially better than SGD, so second-order methods? 🧵 By @DeqingFu @robinomial, h/t @jacobandreas
    48K
  • Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Jul 10, 2022
    Computational (Chomskian) hierarchies can predict OOD capabilities. A different place in the formal hierarchy means different generalizations the architecture can perform. Got your attention? Details in 🧵 @DeepMind
  • Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Nov 8, 2024
    Model merging is tricky when model weights aren’t aligned. Introducing KnOTS 🪢: a gradient-free framework to merge LoRA models. KnOTS is plug-and-play, boosting SoTA merging methods by up to 4.3% 🚀 📜: arxiv.org/abs/2410.19735 💻: github.com/gstoica27/KnOTS
    26K
  • Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Nov 2, 2022
    I don’t train from scratch, I use RoBERTa 🧐 Wait… Why not cross-encoder/stsb-roberta? facebook/muppet-roberta? We automatically (and periodically) identify the best models on 🤗. Just pick the best one and finetune on your task. ibm.github.io/model-recyclin… arxiv.org/abs/2211.00107
    Home
    From ibm.github.io
  • Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Jun 25, 2023
    Zip it: Fuse models with themselves first. Merge models trained on different tasks by correlations between activations. arxiv.org/abs/2305.03053 George Stoica @dbolya @BjornerJakob Taylor Hearn @judyfhoffman @gtcomputing #deepRead
    26K
  • Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Oct 23, 2024
    Scaling laws predict 🦣 large models by training 🦟 small ones, cool right? Fortunately, they are not that complicated or costly, or at least they don't have to be. We have collected 400+ models, fitted 1000+ scaling laws, and created 1 guide for cheap & more reliable scaling law fitting:
    33K
  • Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Jul 4, 2025
    Shame on you @NeurIPSConf, I've got tons of shaming emails. @COLM_conf was amazing, it used AI to improve reviewing and it was great! @ReviewAcl spits blood to improve quality and reduce load. Here is the tale of this round of NeurIPS reviews, of automatic threats and disrespect
    58K
  • Leshem (Legend) Choshen 🤖🤗
    @LChoshen
    Dec 7, 2021
    Data augmentation? Look no further. A framework of 100+ "transformations" (augmentations / paraphrasing functions / filters). Many types: emojis, linguistic... see Fig. Extendable! A vast effort, constructed by almost a hundred authors! arxiv.org/abs/2112.02721 #scientivism
