Fantastic Bugs and Where to Find Them in AI Benchmarks
Sang T. Truong, Yuheng Tu, Michael Hardy, Anka Reuel, Zeyu Tang, Jirayu Burapacheep, Jonathan Jude Perera, Chibuike Uwakwe, Benjamin W. Domingue, Nick Haber, Sanmi Koyejo
We introduce a scalable framework for flagging invalid benchmark questions. The framework analyzes statistical signals and uses them to guide expert review, identifying invalid questions with up to 84% precision across nine popular benchmarks.
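As a rough illustration of the flag-then-review workflow, here is a minimal sketch. The abstract does not say which statistical signals are used, so the sketch assumes one common psychometric signal, the item-rest correlation: how well correctness on one question tracks a model's score on all the other questions. This is an assumption, not necessarily the authors' method. The function names (`flag_items`, `precision`), the threshold, and the toy response matrix are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def flag_items(responses, threshold=0.0):
    """Flag questions by item-rest correlation (an assumed signal).

    responses[m][q] is 1 if model m answered question q correctly, else 0.
    A question is flagged when its correctness correlates at or below
    `threshold` with the models' scores on the remaining questions.
    """
    n_items = len(responses[0])
    flagged = []
    for q in range(n_items):
        item = [row[q] for row in responses]
        rest = [sum(row) - row[q] for row in responses]
        if pearson(item, rest) <= threshold:
            flagged.append(q)
    return flagged


def precision(flagged, confirmed_invalid):
    """Fraction of flagged questions that expert review confirms as invalid."""
    if not flagged:
        return 0.0
    hits = sum(1 for q in flagged if q in confirmed_invalid)
    return hits / len(flagged)


# Toy data, made up for illustration: 5 models (rows) x 4 questions (columns).
# Question 3 is answered correctly mostly by the weakest models.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
flagged = flag_items(responses)
print(flagged)
```

In this toy matrix only the last question is flagged, because stronger models tend to get it wrong. In a real workflow the flagged indices would go to expert reviewers, and `precision` would then be computed against the set of questions they confirm as invalid.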