alphaXiv

Natural-Language Agent Harnesses
26 Mar 2026
Linyue Pan
Lexiao Zou
Shuo Guo

This research introduces Natural-Language Agent Harnesses (NLAHs) and an Intelligent Harness Runtime (IHR) to externalize and execute the control logic of AI agents in natural language. The framework facilitates systematic study and comparison of agent control patterns, demonstrating operational viability and achieving a 47.2% task success rate on OSWorld benchmarks with NLAHs compared to 30.4% for native code, while re-centering reliability mechanisms on durable, artifact-backed operations.

#agentic-frameworks#agents#computer-science
1,863
Mathematical methods and human thought in the age of AI
27 Mar 2026
Tanya Klowden
Terence Tao

This paper investigates the rapidly evolving impact of AI on philosophical questions, particularly within mathematics, advocating for a human-centered approach to its development and integration. It argues that modern AI automates aspects of the creative process itself, necessitating a re-evaluation of intellectual value and the design of ethical frameworks for responsible coexistence.

#mathematics#history-and-overview
451
Meta-Harness: End-to-End Optimization of Model Harnesses
30 Mar 2026
Yoonho Lee
Roshen Nair
Qizheng Zhang

Meta-Harness provides an end-to-end optimization framework for LLM harnesses, the external code that dictates how models interact with their environment. The system utilizes an agentic proposer with filesystem access to uncompressed historical code and execution traces, leading to a 7.7-point accuracy improvement in text classification, a 4.7-point average gain in math reasoning, and competitive pass rates on agentic coding benchmarks.

#agents#computer-science#artificial-intelligence
199
Shor's algorithm is possible with as few as 10,000 reconfigurable atomic qubits
30 Mar 2026
Madelyn Cain
Qian Xu
Robbie King

Researchers proposed a neutral-atom quantum computing architecture that executes Shor's algorithm for cryptographically relevant instances with as few as 9,739 reconfigurable atomic qubits, a two-order-of-magnitude reduction from prior estimates, achieving runtimes in days for large factorizations.

#physics#quantum-physics
166
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
25 Mar 2026
Jeonghye Kim
Xufang Luo
Minbeom Kim

Research from Microsoft Research, KAIST, and Seoul National University reveals that while self-distillation consistently shortens LLM reasoning traces, it can degrade mathematical reasoning capabilities, particularly on out-of-distribution tasks. This performance drop is linked to the suppression of "epistemic verbalization," or expressions of uncertainty, which are found to be critical for robust generalization.

#computer-science#computation-and-language#machine-learning
1,788
Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
26 Mar 2026
Jingwei Ni
Yihao Liu
Xinpeng Liu

The Trace2Skill framework automates the creation and adaptation of domain-specific skills for Large Language Model agents by distilling lessons from agent execution trajectories into a single, transferable skill document. This approach, inspired by human expert methodology, improves agent performance and generalizability across different LLM scales and out-of-distribution tasks, demonstrating superior efficiency and transferability compared to existing automated methods.

#computer-science#artificial-intelligence
459
VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward
27 Mar 2026
Zhaochong An
Orest Kupyn
Théo Uscidda

VGGRPO, a framework developed at Google, enhances video generation by integrating a Latent Geometry Model with latent-space Group Relative Policy Optimization, enabling "world-consistent" videos with stable camera motion and consistent 3D scene structure. The method improves geometric consistency and overall video quality, particularly in dynamic scenes, while reducing computational overhead by 24.5% compared to RGB-based approaches.

#computer-science#computer-vision-and-pattern-recognition
125
GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation
27 Mar 2026
Nicolas von Lützow
Barbara Rössle
Katharina Schmid

Researchers from the Technical University of Munich developed GaussianGPT, an autoregressive framework for generating and completing 3D Gaussian scenes by converting continuous 3D scenes into discrete latent grids and modeling them with a causal transformer. The model achieved state-of-the-art performance on 3D chair generation (FID 5.68, KID 1.835) and demonstrated coherent large-scale scene outpainting and completion on indoor scene datasets.

#computer-science#computer-vision-and-pattern-recognition#generative-models
119
PRBench: End-to-end Paper Reproduction in Physics Research
29 Mar 2026
Shi Qiu
Junyi Deng
Yiwei Deng

Researchers at Peking University developed PRBench, a benchmark with 30 expert-curated physics tasks, to evaluate AI agents' ability to perform end-to-end computational result reproduction directly from scientific papers. The study found that while agents show moderate understanding of methodologies, they consistently struggle with translating this into correct code and accurately reproducing numerical data, resulting in a 0% end-to-end callback rate across all tested models.

#agents#computer-science#computation-and-language
82
Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes
26 Mar 2026
Yuqian Fu
Haohuan Huang
Kaiwen Jiang

A study analyzing on-policy distillation (OPD) for large language models uncovered empirical issues with the sampled-token approach, such as imbalanced feedback and unreliable teacher signals. It introduces a "teacher top-K local support matching" objective, which provides more stable optimization and enhances performance in long-horizon tasks like math reasoning and multi-task agentic training.

#agents#computer-science#artificial-intelligence
512
Realtime-VLA V2: Learning to Run VLAs Fast, Smooth, and Accurate
27 Mar 2026
Chen Yang
Yucheng Hu
Yunchao Ma

Researchers at Dexmal developed Realtime-VLA V2, a comprehensive framework that enables Vision-Language-Action (VLA) models to execute robot tasks significantly faster than human demonstrations while ensuring smooth and accurate operation. The system achieves execution speeds comparable to casual human performance on diverse manipulation tasks, including precision placement.

#computer-science#robotics
108
Composer 2 Technical Report
25 Mar 2026
Cursor Research
Aaron Chan
Ahmed Shalaby

Composer 2 is a specialized Mixture-of-Experts model developed for agentic software engineering, designed to autonomously tackle complex coding tasks. It achieves a 61.3% accuracy on the CursorBench-3 internal benchmark (a 37% relative improvement over its predecessor), demonstrating competitive performance against frontier models on public benchmarks while maintaining a favorable cost-performance ratio.
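
The difference between a relative and an absolute improvement can be made concrete with a small sketch. The helper names below are hypothetical, and the sketch assumes the 37% figure means (new - old) / old measured on the same benchmark. The summary does not report the predecessor's score, so the derived value is only what the two stated numbers imply.

```python
def implied_baseline(new_score: float, relative_gain: float) -> float:
    """Return the old score implied by new = old * (1 + relative_gain).

    Assumption: relative_gain is (new - old) / old, expressed as a fraction.
    """
    if relative_gain <= -1.0:
        raise ValueError("relative_gain must be greater than -1")
    return new_score / (1.0 + relative_gain)


def absolute_gain(new_score: float, relative_gain: float) -> float:
    """Return the gain in percentage points implied by the same assumption."""
    return new_score - implied_baseline(new_score, relative_gain)


if __name__ == "__main__":
    # Figures taken from the summary: 61.3% accuracy, 37% relative improvement.
    old = implied_baseline(61.3, 0.37)
    print(f"implied predecessor score: {old:.1f}%")
    print(f"implied absolute gain: {absolute_gain(61.3, 0.37):.1f} points")
```

Under that assumption, the stated figures imply a predecessor score of roughly 44.7%, so the 37% relative improvement corresponds to a gain of about 16.6 percentage points. Because 37% is itself a rounded figure, the implied values are approximate.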

#computer-science#machine-learning#software-engineering
533
AVO: Agentic Variation Operators for Autonomous Evolutionary Search
25 Mar 2026
Terry Chen
Zhifan Ye
Bing Xu

Agentic Variation Operators (AVO) empower large language models to function as autonomous, iterative optimizers in evolutionary search for high-performance code. This framework successfully discovered multi-head attention kernels on NVIDIA Blackwell B200 GPUs that achieved up to 3.5% higher throughput than cuDNN and 10.5% higher than FlashAttention-4, alongside effective transferability to grouped-query attention.

#agentic-frameworks#agents#computer-science
2,123
Towards a Medical AI Scientist
30 Mar 2026
Hongtao Wu
Boyun Zheng
Dingjie Song
Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research is required to be grounded in medical evidence with specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through a clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under three research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.
#agentic-frameworks#agents#ai-for-health
66
AIRA_2: Overcoming Bottlenecks in AI Research Agents
27 Mar 2026
Karen Hambardzumyan
Nicolas Baldwin
Edan Toledo

Researchers at FAIR at Meta and collaborators developed AIRA_2, an AI research agent designed to overcome structural bottlenecks in autonomous machine learning research. It achieved a mean Percentile Rank of 76.0% on MLE-bench-30 over 72 hours, demonstrating sustained performance improvements by enhancing compute throughput, evaluation reliability, and agent operational capabilities.

#agentic-frameworks#agents#computer-science
87
VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation
27 Mar 2026
Zhide Zhong
Haodong Yan
Junfeng Li

Researchers at HKUST (GZ) developed VLA-OPD, a framework that combines offline supervised fine-tuning and online reinforcement learning for Vision-Language-Action models through on-policy distillation. It utilizes a Reverse-KL divergence objective to provide dense, token-level supervision from an expert teacher on student-generated trajectories, leading to improved sample efficiency (e.g., 3x faster convergence on LIBERO-Long) and robust performance while mitigating catastrophic forgetting.
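
As a rough illustration of a reverse-KL objective, the sketch below computes KL(student || teacher) for each token and averages it over a student-generated sequence. It is a minimal pure-Python sketch, not the VLA-OPD implementation: the function names, the list-of-logits input format, and the plain averaging over tokens are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits: Sequence[float],
               teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) at a single token position.

    The expectation is taken under the student distribution, which is
    what makes this the "reverse" direction.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must share a vocabulary size")
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def sequence_reverse_kl(student_steps: Sequence[Sequence[float]],
                        teacher_steps: Sequence[Sequence[float]]) -> float:
    """Mean per-token reverse KL over one student-generated sequence.

    Assumption: the teacher logits were obtained by scoring the same
    tokens the student produced, so the two lists are aligned by position.
    """
    if not student_steps or len(student_steps) != len(teacher_steps):
        raise ValueError("sequences must be non-empty and of equal length")
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)
```

Because each term is weighted by the student's own probability, reverse KL mainly penalizes tokens the student favors but the teacher considers unlikely, and it yields a loss value at every token position rather than a single reward per trajectory.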

#computer-science#robotics
87
Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
26 Mar 2026
Yicheng Zou
Dongsheng Zhu
Lin Zhu

Researchers at Shanghai AI Laboratory introduced Intern-S1-Pro, the first scientific multimodal foundation model with one trillion parameters, which achieved superior performance on over 100 specialized scientific tasks and competitive results in general AI tasks. The model validated the "specializable generalist" concept, demonstrating that a large generalist model can outperform specialized counterparts in several scientific domains.

#agents#ai-for-health#computer-science
462
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
26 Mar 2026
Kaijin Chen
Dingkang Liang
Xin Zhou

A framework called Hybrid Memory enables video world models to maintain spatiotemporal consistency for dynamic subjects moving out of and back into the camera's view. This work from Huazhong University of Science and Technology and Kuaishou Technology introduces the HM-World dataset and the HyDRA architecture, which achieved superior dynamic subject consistency and overall generation quality compared to baselines on the new dataset and outperformed a commercial model in zero-shot evaluation.

#attention-mechanisms#computer-science#artificial-intelligence
208
World Reasoning Arena
26 Mar 2026
PAN Team Institute of Foundation Models
Qiyue Gao
Kun Zhou

Researchers from the PAN Team at MBZUAI introduced WR-Arena, a new benchmark designed to assess advanced capabilities of world models beyond short-term prediction, including action simulation fidelity, long-horizon forecasting, and simulative reasoning for planning. Evaluations revealed that current models struggle with error accumulation in long-horizon simulations and with consistently performing environment-level interventions, while action-state aligned models show improved planning performance.

#computer-science#computer-vision-and-pattern-recognition
112
Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving
28 Mar 2026
Qiqi Liu
Huan Xu
Jingyu Li
Autonomous driving requires reasoning about how the environment evolves and planning actions accordingly. Existing world-model-based approaches typically predict future scenes first and plan afterwards, resulting in open-loop imagination that may drift from the actual decision process. In this paper, we present Uni-World VLA, a unified vision-language-action (VLA) model that tightly interleaves future frame prediction and trajectory planning. Instead of generating a full world rollout before planning, our model alternates between predicting future frames and ego actions step by step, allowing planning decisions to be continuously conditioned on the imagined future observations. This interleaved generation forms a closed-loop interaction between world modeling and control, enabling more adaptive decision-making in dynamic traffic scenarios. In addition, we incorporate monocular depth information into frames to provide stronger geometric cues for world modeling, improving long-horizon scene prediction. Experiments on the NAVSIM benchmark show that our approach achieves competitive closed-loop planning performance while producing high-fidelity future frame predictions. These results demonstrate that tightly coupling world prediction and planning is a promising direction for scalable VLA driving systems.
#autonomous-vehicles#computer-science#computer-vision-and-pattern-recognition
43