Chenghao Yang

If you get tangled up, you just tango on.

My Profile

Biography

Hello! I'm chyang.

I am a first-year Master's student at the University of Science and Technology of China, advised by Prof. Nenghai Yu and Prof. Qi Chu. My academic journey is driven by a deep curiosity about artificial intelligence and its transformative potential.

Research Interests

LLM Evaluation & Benchmarking
LLM Safety & Alignment
AI-driven Cybersecurity

Contact & Collaboration

I'm enthusiastic about innovative projects and meaningful discussions. Feel free to reach out if you share similar interests!

News

Last updated:

Keep an eye on this space for the latest news or announcements.

2025-09-16
Our paper "FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning" is now available on arXiv, and the data can be accessed on HuggingFace!
2025-09-01
The paper "MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation" has been accepted by EMNLP 2025 (Findings)!
2025-06-02
The paper "MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation" has been published on arXiv!
2025-02-09
The paper "CryptoX: Compositional Reasoning Evaluation of Large Language Models" has been published on arXiv!
2025-01-23
🎉 Our paper "Hello Again! LLM-powered Personalized Agent for Long-term Dialogue" has been accepted by the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025)!
2024-06-09
Our paper "Hello Again! LLM-powered Personalized Agent for Long-term Dialogue" has been published on arXiv.
2023-12-15
I embarked on my research journey at the LDS Laboratory, University of Science and Technology of China (USTC).

Publications

* Equal contribution · ‡ Corresponding authors · Work done at ByteDance Seed

LDAgent Paper
Hello Again! LLM-powered Personalized Agent for Long-term Dialogue
National University of Singapore & University of Science and Technology of China
32 citations · 41 stars

Hao Li*, Chenghao Yang*, An Zhang, Yang Deng, Xiang Wang, Tat-Seng Chua
NAACL 2025 (Main) · Paper · GitHub

Open-domain dialogue systems have seen remarkable advancements with the development of large language models (LLMs). Nonetheless, most existing dialogue systems predominantly focus on brief single-session interactions, neglecting the real-world demands for long-term companionship and personalized interactions with chatbots. Crucial to addressing this real-world need are event summary and persona management, which enable reasoning for appropriate long-term dialogue responses. Recent progress in the human-like cognitive and reasoning capabilities of LLMs suggests that LLM-based agents could significantly enhance automated perception, decision-making, and problem-solving. In response to this potential, we introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent), which incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation. For the event memory module, long and short-term memory banks are employed to separately focus on historical and ongoing sessions, while a topic-based retrieval mechanism is introduced to enhance the accuracy of memory retrieval. Furthermore, the persona module conducts dynamic persona modeling for both users and agents. The integration of retrieved memories and extracted personas is subsequently fed into the generator to induce appropriate responses. The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated across various illustrative benchmarks, models, and tasks. The code is released at this URL: https://github.com/leolee99/LD-Agent

CryptoX Paper
CryptoX: Compositional Reasoning Evaluation of Large Language Models
Beihang University & M-A-P Lab & ByteDance Inc.
1 citation · 5 stars · Leaderboard

Jiajun Shi*, Chaoren Wei*, Liqun Yang*, Zekun Moore Wang, Chenghao Yang, Ge Zhang, Stephen Huang, Tao Peng, Jian Yang, Zhoufutu Wen
arXiv · Paper · GitHub

The compositional reasoning ability has long been regarded as critical to the generalization and intelligence emergence of large language models (LLMs). However, despite numerous reasoning-related benchmarks, the compositional reasoning capacity of LLMs is rarely studied or quantified in the existing benchmarks. In this paper, we introduce CryptoX, an evaluation framework that, for the first time, combines existing benchmarks and cryptographic principles to quantify the compositional reasoning capacity of LLMs. Building upon CryptoX, we construct CryptoBench, which integrates these principles into several benchmarks for systematic evaluation. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a huge gap between open-source and closed-source LLMs. We further conduct thorough mechanistic interpretability experiments to reveal the inner mechanism of LLMs' compositional reasoning, involving subproblem decomposition, subproblem inference, and summarizing subproblem conclusions. Through analysis based on CryptoBench, we highlight the value of independently studying compositional reasoning and emphasize the need to enhance the compositional reasoning abilities of LLMs.

MARS-Bench Paper
MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation
ByteDance Seed & University of Science and Technology of China
0 citations · 1 star · Project

Chenghao Yang*,‡, Yinbo Luo*, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu
EMNLP 2025 (Findings) · Paper · GitHub · Project

Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, the robustness of LLMs, especially in handling long, complex dialogue sessions involving frequent motivation transfer and sophisticated cross-turn dependencies, has long been criticized, yet no existing benchmarks fully reflect these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark, designed to remedy this gap. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: Ultra Multi-turn, Interactive Multi-turn, and Cross-turn Tasks. Extensive experiments on MARS-Bench reveal that closed-source LLMs significantly outperform open-source alternatives, that explicit reasoning significantly boosts LLMs' robustness in handling long, complex dialogue sessions, and that LLMs indeed face significant challenges when handling motivation transfer and sophisticated cross-turn dependencies. Moreover, based on attention-visualization experiments with Qwen2.5-7B-Instruct, we provide a mechanistic interpretation of how attention sinks caused by special tokens lead to performance degradation in long, complex dialogue sessions.

FinSearchComp
FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
ByteDance Seed

Chenghao Yang (contributor)
arXiv · Paper · GitHub

Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial datasets evaluate the data-searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks — Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation — that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy, while DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and that the country of origin of models and tools significantly impacts performance. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.

Internships

ByteDance Logo

ByteDance Inc.

Nov 2024 - Present
Algorithm Intern, Research & Development | Seed-Evaluation
Department Change History
Apr 2025 - Present Seed-Evaluation
Nov 2024 - Apr 2025 Seed-LLM-Evaluation
Beijing, China
NUS Logo

National University of Singapore

Dec 2023 - Jun 2024
Research Intern (Remote), Agent | NExT++ Lab
Singapore | Supervised by Research Fellow An Zhang

Education

USTC Logo University of Science and Technology of China
Sep 2025 - Present
M.Eng. in Cyberspace Security (0838)
Research Focus: Large Language Model Security
Recommended for Graduate Admission
USTC Logo University of Science and Technology of China
Sep 2021 - Jun 2025
B.Eng. in Information Security

Awards

◆ University-level Outstanding Graduate Award, USTC 2025
◆ Outstanding Student Bronze Award, USTC 2024
◆ Longfor Scholarship, USTC 2023
◆ Endeavor Scholarship, USTC 2022
◆ Wang Xiaomo Talent Program in Cyber Science and Technology Scholarship × 4, USTC 2021 ~ 2024