Chenghao Yang

If you get tangled up, you just tango on.

My Profile

Biography

Hello! I'm chyang.

I am a first-year Master's student at the University of Science and Technology of China, advised by Prof. Nenghai Yu and Prof. Qi Chu. My academic journey is driven by a deep curiosity about artificial intelligence and its transformative potential.

Research Interests

LLM Evaluation & Benchmarking
LLM Safety & Alignment
AI-driven Cybersecurity

Contact & Collaboration

I'm enthusiastic about innovative projects and meaningful discussions. Feel free to reach out if you share similar interests!

News

Last updated:

Keep an eye on this space for the latest news or announcements.

2025-09-16
Our paper "FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning" is now available on arXiv, and the data can be accessed on HuggingFace!
2025-09-01
The paper "MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation" has been accepted by EMNLP 2025 (Findings)!
2025-06-02
The paper "MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation" has been published on arXiv!
2025-02-09
The paper "CryptoX: Compositional Reasoning Evaluation of Large Language Models" has been published on arXiv!
2025-01-23
🎉 Our paper "Hello Again! LLM-powered Personalized Agent for Long-term Dialogue" has been accepted by the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025)!
2024-06-09
Our paper "Hello Again! LLM-powered Personalized Agent for Long-term Dialogue" has been published on arXiv.
2023-12-15
I embarked on my research journey at the LDS Laboratory, University of Science and Technology of China (USTC).

Publications

* Equal contribution · ‡ Corresponding authors · Work done at ByteDance Seed

LDAgent Paper
Hello Again! LLM-powered Personalized Agent for Long-term Dialogue
National University of Singapore & University of Science and Technology of China
32 citations · 41 stars

Hao Li*, Chenghao Yang*, An Zhang, Yang Deng, Xiang Wang, Tat-Seng Chua
NAACL 2025 (Main) · Paper · GitHub

Open-domain dialogue systems have seen remarkable advancements with the development of large language models (LLMs). Nonetheless, most existing dialogue systems predominantly focus on brief single-session interactions, neglecting the real-world demands for long-term companionship and personalized interactions with chatbots. Crucial to addressing this real-world need are event summary and persona management, which enable reasoning for appropriate long-term dialogue responses. Recent progress in the human-like cognitive and reasoning capabilities of LLMs suggests that LLM-based agents could significantly enhance automated perception, decision-making, and problem-solving. In response to this potential, we introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent), which incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation. For the event memory module, long and short-term memory banks are employed to separately focus on historical and ongoing sessions, while a topic-based retrieval mechanism is introduced to enhance the accuracy of memory retrieval. Furthermore, the persona module conducts dynamic persona modeling for both users and agents. The integration of retrieved memories and extracted personas is subsequently fed into the generator to induce appropriate responses. The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated across various illustrative benchmarks, models, and tasks. The code is released at this URL: https://github.com/leolee99/LD-Agent

CryptoX Paper
CryptoX: Compositional Reasoning Evaluation of Large Language Models
Beihang University & M-A-P Lab & ByteDance Inc.
1 citation · 5 stars · Leaderboard

Jiajun Shi*, Chaoren Wei*, Liqun Yang*, Zekun Moore Wang, Chenghao Yang, Ge Zhang, Stephen Huang, Tao Peng, Jian Yang, Zhoufutu Wen
arXiv · Paper · GitHub

The compositional reasoning ability has long been regarded as critical to the generalization and intelligence emergence of large language models (LLMs). However, despite numerous reasoning-related benchmarks, the compositional reasoning capacity of LLMs is rarely studied or quantified in the existing benchmarks. In this paper, we introduce CryptoX, an evaluation framework that, for the first time, combines existing benchmarks and cryptographic principles to quantify the compositional reasoning capacity of LLMs. Building upon CryptoX, we construct CryptoBench, which integrates these principles into several benchmarks for systematic evaluation. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a huge gap between open-source and closed-source LLMs. We further conduct thorough mechanistic interpretability experiments to reveal the inner mechanism of LLMs' compositional reasoning, involving subproblem decomposition, subproblem inference, and summarizing subproblem conclusions. Through analysis based on CryptoBench, we highlight the value of independently studying compositional reasoning and emphasize the need to enhance the compositional reasoning abilities of LLMs.

MARS-Bench Paper
MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation
ByteDance Seed & University of Science and Technology of China
0 citations · 1 star · Project

Chenghao Yang*,‡, Yinbo Luo*, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu
EMNLP 2025 (Findings) · Paper · GitHub · Project

Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, the robustness of LLMs, especially in handling long, complex dialogue sessions involving frequent motivation transfer and sophisticated cross-turn dependencies, has long been criticized, yet no existing benchmarks fully reflect these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark, designed to remedy this gap. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: Ultra Multi-turn, Interactive Multi-turn, and Cross-turn Tasks. Extensive experiments on MARS-Bench reveal that closed-source LLMs significantly outperform open-source alternatives, that explicit reasoning significantly boosts LLMs' robustness in handling long, complex dialogue sessions, and that LLMs indeed face significant challenges when handling motivation transfer and sophisticated cross-turn dependencies. Moreover, based on attention-visualization experiments with Qwen2.5-7B-Instruct, we provide a mechanistic interpretation of how attention sinks caused by special tokens lead to performance degradation in long, complex dialogue sessions.

FinSearchComp
FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
ByteDance Seed

Chenghao Yang (contributor)
arXiv · Paper · GitHub

Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial datasets evaluate the data-searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks — Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation — that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy, while DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and that the country of origin of models and tools significantly impacts performance. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.

Internships

ByteDance Logo

ByteDance Inc.

Nov 2024 - Present
Algorithm Intern, Research & Development | Seed-Evaluation
Department Change History
Apr 2025 - Present Seed-Evaluation
Nov 2024 - Apr 2025 Seed-LLM-Evaluation
Beijing, China
NUS Logo

National University of Singapore

Dec 2023 - Jun 2024
Research Intern (Remote), Agent | NExT++ Lab
Singapore | Supervised by Research Fellow An Zhang

Education

USTC Logo University of Science and Technology of China
Sep 2025 - Present
M.Eng. in Cyberspace Security (0838)
Research Focus: Large Language Model Security
Recommended for Graduate Admission
USTC Logo University of Science and Technology of China
Sep 2021 - Jun 2025
B.Eng. in Information Security

Awards

◆ University-level Outstanding Graduate Award, USTC 2025
◆ Outstanding Student Bronze Award, USTC 2024
◆ Longfor Scholarship, USTC 2023
◆ Endeavor Scholarship, USTC 2022
◆ Wang Xiaomo Talent Program in Cyber Science and Technology Scholarship × 4, USTC 2021 ~ 2024