NLP Posts | SAIL Blog

Fantastic Bugs and Where to Find Them in AI Benchmarks

Sang T. Truong, Yuheng Tu, Michael Hardy, Anka Reuel, Zeyu Tang, Jirayu Burapacheep, Jonathan Jude Perera, Chibuike Uwakwe, Benjamin W. Domingue, Nick Haber, Sanmi Koyejo

We introduce a scalable framework to flag invalid benchmark questions. We analyze statistical signals and use them to guide expert review, achieving up to 84% precision across nine popular benchmarks.

Demystifying Verbatim Memorization in Large Language Models

Jing Huang, Diyi Yang, Christopher Potts

How do LLMs memorize long sequences of text verbatim? In this work, we show that verbatim memorization is intertwined with the LM’s general capabilities.

Stanford AI Lab Papers and Talks at NAACL 2025

Compiled by Nitya Thakkar

All the great work from the Stanford AI Lab accepted at NAACL 2025, all in one place.

MENTAT: A Clinician-Annotated Benchmark for Complex Psychiatric Decision-Making

Max Lamparth and Declan Grabb

We developed a new expert-designed and expert-annotated clinical decision-making dataset that also allows for nuanced accuracy and fairness evaluations with expert preferences, uncertainty, and soft labels.

PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action

Yijia Shao and Diyi Yang

Having an agent handle tasks for you is cool. But does your language model agent respect privacy norms?

Productive Struggle: The Future of Human Learning in the Age of AI

Rose E. Wang and Megha Srivastava

What happens to human learning when superhuman intelligence is as accessible as a Google search?

Stanford AI Lab Papers and Talks at EMNLP 2024

Compiled by Nitya Thakkar

All the great work from the Stanford AI Lab accepted at EMNLP 2024, all in one place.

Stanford AI Lab Papers and Talks at ACL 2023

Compiled by Drew A. Hudson

All the great work from the Stanford AI Lab accepted at ACL 2023, all in one place.

How does in-context learning work? A framework for understanding the differences from traditional supervised learning

Sang Michael Xie and Sewon Min

We provide a Bayesian inference framework for in-context learning in large language models like GPT-3 and show empirical evidence for our framework, connecting it to the observation that in-context learning can still work well even when the labels in the few-shot examples are randomized.

LinkBERT: Improving Language Model Training with Document Link

Michihiro Yasunaga, Jure Leskovec, Percy Liang

LinkBERT is a new language model pretrained to capture document link knowledge, such as hyperlinks on the web. It greatly helps knowledge-intensive applications such as question answering.