Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs

Andong Hua; Kenan Tang; Chenhe Gu; Jindong Gu; Eric Wong; Yao Qin

doi:10.18653/v1/2025.emnlp-main.1006

Abstract

Prompt sensitivity, the phenomenon in which paraphrasing a prompt (rewording it without changing its meaning) leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate seven LLMs (for example, the GPT and Gemini families) across six benchmarks, covering both multiple-choice and open-ended tasks, on twelve diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt LLM-as-a-judge evaluations, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.
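The two quantities the abstract reports, performance variance across prompt templates and the correlation of model rankings across prompts, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the use of standard deviation and Spearman correlation as the concrete metrics, and all accuracy numbers below are assumptions made up for illustration, not results from the paper.

```python
from itertools import combinations
from statistics import mean, pstdev


def ranks(values):
    """Rank values in descending order (1 = best), averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def prompt_sensitivity(scores):
    """Map each model to the std dev of its accuracy across prompt templates."""
    return {model: pstdev(accs) for model, accs in scores.items()}


def mean_rank_correlation(scores):
    """Average Spearman correlation of model rankings over all prompt pairs."""
    models = list(scores)
    n_prompts = len(next(iter(scores.values())))
    per_prompt = [[scores[m][p] for m in models] for p in range(n_prompts)]
    return mean(spearman(a, b) for a, b in combinations(per_prompt, 2))


# Hypothetical accuracies for three models on three prompt templates,
# scored once with rigid answer matching and once with an LLM judge.
rigid_scores = {
    "model_a": [0.70, 0.55, 0.62],
    "model_b": [0.60, 0.66, 0.50],
    "model_c": [0.52, 0.58, 0.64],
}
judge_scores = {
    "model_a": [0.74, 0.73, 0.75],
    "model_b": [0.68, 0.69, 0.67],
    "model_c": [0.61, 0.60, 0.62],
}

rigid_std = prompt_sensitivity(rigid_scores)
judge_std = prompt_sensitivity(judge_scores)
rigid_corr = mean_rank_correlation(rigid_scores)
judge_corr = mean_rank_correlation(judge_scores)
```

In this toy setup, a lower per-model standard deviation and a rank correlation closer to 1 under the judge-based scores would correspond to the pattern the abstract describes. With real data, a library routine such as `scipy.stats.spearmanr` could replace the hand-written `spearman` helper.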

Anthology ID: 2025.emnlp-main.1006
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 19889–19899
URL: https://aclanthology.org/2025.emnlp-main.1006/
DOI: 10.18653/v1/2025.emnlp-main.1006
Cite (ACL): Andong Hua, Kenan Tang, Chenhe Gu, Jindong Gu, Eric Wong, and Yao Qin. 2025. Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19889–19899, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs (Hua et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1006.pdf
Checklist: 2025.emnlp-main.1006.checklist.pdf
