I'm looking for a Summer '25 intern at Apple AI/ML in New York.
Focus: long-context modeling for LLM pretraining
Apply link: shorturl.at/wyK0G
Please also email me your resume after applying.
Bailin Wang
LLM researcher
- What is the landscape of sequence models (e.g., Mamba, GLA), and what are the intrinsic limitations (if any) of those efficient architectures? I'm really excited about our effort on these questions, from the perspective of formal language learning and in-context learning.
  Can insights from synthetic experiments and interpretability lead to real improvements in language modeling? We:
  > propose a formal model for in-context learning
  > uncover "n-gram heads" = higher-order induction heads, crucial for ICLL
  > improve Transformer LM perplexity by 6.7%
- It seems that data-dependent gating is the core ingredient for effective linear-complexity alternatives to softmax attention, as shown in both our GLA and Mamba. We (the authors) are also at NeurIPS, and happy to chat more about this!
  Impressed by the performance of Mamba and believe in RNNs? We provide a simple alternative solution! Excited to share Gated Linear Attention (GLA-Transformer). (1/n) arxiv.org/abs/2312.06635
- Check out Linlu’s recent work on LMs’ phenomenal 🤩 yet puzzling 🧐 behavior!
  How good are LMs at inductive reasoning? How are their behaviors similar to or different from those of humans? We study these via iterative hypothesis refinement. We observe that LMs are phenomenal hypothesis proposers, but they also behave as puzzling inductive reasoners: (1/n)
- @Ziwphd and I will present our Grammar Prompting work tomorrow at #NeurIPS2023. Happy to chat more on:
  - LLMs for structured language generation (e.g., programs, molecules, and robotic plans)
  - Structured/formal chain-of-thought for reasoning
- Exciting!
  Replying to @LiruiWang1: Excited to share that GenSim won the outstanding paper award at the LangRob workshop at CoRL 2023!
- Replying to @bailin_28: I was also quite surprised that all models we tested (including standard attention) can benefit consistently from adding a simple static n-gram head.
- Replying to @bailin_28: @srush_nlp @jefrankle’s bet (isattentionallyouneed.com) might depend on whether the question was “Is softmax attention all you need?” :)
- Interested in NL-to-SQL? Come to our Q&A session 14A. We propose integrating structured relations into the Transformer to bias representation learning. #acl2020nlp Paper: virtual.acl2020.org/paper_main.677… Code: github.com/microsoft/rat-… Joint work with @rshin, @Skiminok, @AllenLao, and Matt
- Well, meditation really helps me pull myself together and become more aware of the stress I'm under.
- Uncertainty in Deep Learning (PhD Thesis) | Yarin Gal - Blog | Cambridge Machine Learning Group mlg.eng.cam.ac.uk/yarin/blog_224… via @yaringal
- Replying to @sivil_taram and @Francis_YAO_: Very interesting work! @Francis_YAO_ yeah, we found that sub-quadratic models (e.g., linear attention, SSMs) are not good at in-context learning, primarily because of their inability to perform retrieval.
- “An Adversarial Review of ‘Adversarial Generation of Natural Language’” by Yoav Goldberg medium.com/@yoav.goldberg…











