alphaXiv

Natural-Language Agent Harnesses

26 Mar 2026

Linyue Pan

Lexiao Zou

Shuo Guo

This research introduces Natural-Language Agent Harnesses (NLAHs) and an Intelligent Harness Runtime (IHR) to externalize and execute the control logic of AI agents in natural language. The framework facilitates systematic study and comparison of agent control patterns, demonstrating operational viability and achieving a 47.2% task success rate on OSWorld benchmarks with NLAHs compared to 30.4% for native code, while re-centering reliability mechanisms on durable, artifact-backed operations.

#agentic-frameworks #agents #computer-science

Mathematical methods and human thought in the age of AI

27 Mar 2026

Tanya Klowden

Terence Tao

This paper investigates the rapidly evolving impact of AI on philosophical questions, particularly within mathematics, advocating for a human-centered approach to its development and integration. It argues that modern AI automates aspects of the creative process itself, necessitating a re-evaluation of intellectual value and the design of ethical frameworks for responsible coexistence.

#mathematics #history-and-overview

Meta-Harness: End-to-End Optimization of Model Harnesses

30 Mar 2026

Yoonho Lee

Roshen Nair

Qizheng Zhang

Meta-Harness provides an end-to-end optimization framework for LLM harnesses, the external code that dictates how models interact with their environment. The system utilizes an agentic proposer with filesystem access to uncompressed historical code and execution traces, leading to a 7.7-point accuracy improvement in text classification, a 4.7-point average gain in math reasoning, and competitive pass rates on agentic coding benchmarks.

#agents #computer-science #artificial-intelligence

Shor's algorithm is possible with as few as 10,000 reconfigurable atomic qubits

30 Mar 2026

Madelyn Cain

Qian Xu

Robbie King

Researchers proposed a neutral-atom quantum computing architecture that executes Shor's algorithm for cryptographically relevant instances with as few as 9,739 reconfigurable atomic qubits, a two-order-of-magnitude reduction from prior estimates, achieving runtimes in days for large factorizations.

#physics #quantum-physics

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

25 Mar 2026

Jeonghye Kim

Xufang Luo

Minbeom Kim

Research from Microsoft Research, KAIST, and Seoul National University reveals that while self-distillation consistently shortens LLM reasoning traces, it can degrade mathematical reasoning capabilities, particularly on out-of-distribution tasks. This performance drop is linked to the suppression of "epistemic verbalization," or expressions of uncertainty, which are found to be critical for robust generalization.

#computer-science #computation-and-language #machine-learning

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

26 Mar 2026

Jingwei Ni

Yihao Liu

Xinpeng Liu

The Trace2Skill framework automates the creation and adaptation of domain-specific skills for Large Language Model agents by distilling lessons from agent execution trajectories into a single, transferable skill document. This approach, inspired by human expert methodology, improves agent performance and generalizability across different LLM scales and out-of-distribution tasks, demonstrating superior efficiency and transferability compared to existing automated methods.

#computer-science #artificial-intelligence

VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

27 Mar 2026

Zhaochong An

Orest Kupyn

Théo Uscidda

VGGRPO, a framework developed at Google, enhances video generation by integrating a Latent Geometry Model with latent-space Group Relative Policy Optimization, enabling "world-consistent" videos with stable camera motion and consistent 3D scene structure. The method improves geometric consistency and overall video quality, particularly in dynamic scenes, while reducing computational overhead by 24.5% compared to RGB-based approaches.

#computer-science #computer-vision-and-pattern-recognition

GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation

27 Mar 2026

Nicolas von Lützow

Barbara Rössle

Katharina Schmid

Researchers from the Technical University of Munich developed GaussianGPT, an autoregressive framework for generating and completing 3D Gaussian scenes by converting continuous 3D scenes into discrete latent grids and modeling them with a causal transformer. The model achieved state-of-the-art performance on 3D chair generation (FID 5.68, KID 1.835) and demonstrated coherent large-scale scene outpainting and completion on indoor scene datasets.

#computer-science #computer-vision-and-pattern-recognition #generative-models

PRBench: End-to-end Paper Reproduction in Physics Research

29 Mar 2026

Shi Qiu

Junyi Deng

Yiwei Deng

Researchers at Peking University developed PRBench, a benchmark with 30 expert-curated physics tasks, to evaluate AI agents' ability to perform end-to-end computational result reproduction directly from scientific papers. The study found that while agents show moderate understanding of methodologies, they consistently struggle with translating this into correct code and accurately reproducing numerical data, resulting in a 0% end-to-end callback rate across all tested models.

#agents #computer-science #computation-and-language

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

26 Mar 2026

Yuqian Fu

Haohuan Huang

Kaiwen Jiang

A study analyzing on-policy distillation (OPD) for large language models uncovered empirical issues with the sampled-token approach, such as imbalanced feedback and unreliable teacher signals. It introduces a "teacher top-K local support matching" objective, which provides more stable optimization and enhances performance in long-horizon tasks like math reasoning and multi-task agentic training.

#agents #computer-science #artificial-intelligence

Realtime-VLA V2: Learning to Run VLAs Fast, Smooth, and Accurate

27 Mar 2026

Chen Yang

Yucheng Hu

Yunchao Ma

Researchers at Dexmal developed Realtime-VLA V2, a comprehensive framework that enables Vision-Language-Action (VLA) models to execute robot tasks significantly faster than human demonstrations while ensuring smooth and accurate operation. The system achieves execution speeds comparable to casual human performance on diverse manipulation tasks, including precision placement.

#computer-science #robotics

Composer 2 Technical Report

25 Mar 2026

Cursor Research

Aaron Chan

Ahmed Shalaby

Composer 2 is a specialized Mixture-of-Experts model for agentic software engineering, designed to tackle complex coding tasks autonomously. It achieves 61.3% accuracy on the internal CursorBench-3 benchmark (a 37% relative improvement over its predecessor) and performs competitively against frontier models on public benchmarks while maintaining a favorable cost-performance ratio.

#computer-science #machine-learning #software-engineering
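
As a quick sanity check on the Composer 2 figures, the predecessor's score can be back-computed from the two numbers in the summary. This is only a sketch: the function name `implied_baseline` is hypothetical, the result is implied by rounded figures rather than reported in the listing, and it assumes "relative improvement" means new = old × (1 + gain).

```python
def implied_baseline(new_score: float, relative_gain: float) -> float:
    """Back out the earlier score, assuming new = old * (1 + relative_gain)."""
    return new_score / (1.0 + relative_gain)


# Figures taken from the summary: 61.3% accuracy, 37% relative improvement.
baseline = implied_baseline(61.3, 0.37)
absolute_gain = 61.3 - baseline

print(f"implied predecessor score: {baseline:.1f}%")
print(f"implied absolute gain: {absolute_gain:.1f} points")
```

Under that assumption the predecessor would have scored roughly 45% on the same benchmark, so the 37% relative gain corresponds to about 17 percentage points.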

AVO: Agentic Variation Operators for Autonomous Evolutionary Search

25 Mar 2026

Terry Chen

Zhifan Ye

Bing Xu

Agentic Variation Operators (AVO) empower large language models to function as autonomous, iterative optimizers in evolutionary search for high-performance code. This framework successfully discovered multi-head attention kernels on NVIDIA Blackwell B200 GPUs that achieved up to 3.5% higher throughput than cuDNN and 10.5% higher than FlashAttention-4, alongside effective transferability to grouped-query attention.

#agentic-frameworks #agents #computer-science

Towards a Medical AI Scientist

30 Mar 2026

Hongtao Wu

Boyun Zheng

Dingjie Song

Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research must be grounded in medical evidence and specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through a clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under three research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Our system also achieves strong alignment between the proposed method and its implementation, and significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.

#agentic-frameworks #agents #ai-for-health

AIRA_2: Overcoming Bottlenecks in AI Research Agents

27 Mar 2026

Karen Hambardzumyan

Nicolas Baldwin

Edan Toledo

Researchers at FAIR at Meta and collaborators developed AIRA$^{2}$, an AI research agent designed to overcome structural bottlenecks in autonomous machine learning research. It achieved a mean Percentile Rank of 76.0% on MLE-bench-30 over 72 hours, demonstrating sustained performance improvements by enhancing compute throughput, evaluation reliability, and agent operational capabilities.

#agentic-frameworks #agents #computer-science

VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation

27 Mar 2026

Zhide Zhong

Haodong Yan

Junfeng Li

Researchers at HKUST (GZ) developed VLA-OPD, a framework that combines offline supervised fine-tuning and online reinforcement learning for Vision-Language-Action models through on-policy distillation. It utilizes a Reverse-KL divergence objective to provide dense, token-level supervision from an expert teacher on student-generated trajectories, leading to improved sample efficiency (e.g., 3x faster convergence on LIBERO-Long) and robust performance while mitigating catastrophic forgetting.

#computer-science #robotics
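
To make the Reverse-KL objective mentioned in the VLA-OPD summary concrete, here is a minimal pure-Python sketch of a token-level reverse KL, KL(student || teacher), averaged over a student-generated sequence. It is not the paper's implementation: the function names and the toy logits are assumptions for illustration, and a real training loop would use a tensor library and backpropagate through the student.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    log_p = log_softmax(student_logits)  # student distribution
    log_q = log_softmax(teacher_logits)  # teacher distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sequence_reverse_kl(student_seq, teacher_seq):
    """Mean per-token reverse KL over a trajectory sampled from the student."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token), per_token


# Toy example: two token positions, vocabulary of size three (made-up logits).
student = [[2.0, 0.5, 0.1], [0.2, 0.2, 3.0]]
teacher = [[2.0, 0.5, 0.1], [3.0, 0.2, 0.2]]
mean_loss, per_token = sequence_reverse_kl(student, teacher)
print(mean_loss, per_token)
```

Because the expectation is taken under the student's own distribution, every token the student produces receives a signal, which is what "dense, token-level supervision" refers to; the first position above contributes zero loss because student and teacher agree there.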

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

26 Mar 2026

Yicheng Zou

Dongsheng Zhu

Lin Zhu

Researchers at Shanghai AI Laboratory introduced Intern-S1-Pro, the first scientific multimodal foundation model with one trillion parameters, which achieved superior performance on over 100 specialized scientific tasks and competitive results in general AI tasks. The model validated the "specializable generalist" concept, demonstrating that a large generalist model can outperform specialized counterparts in several scientific domains.

#agents #ai-for-health #computer-science

Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

26 Mar 2026

Kaijin Chen

Dingkang Liang

Xin Zhou

A framework called Hybrid Memory enables video world models to maintain spatiotemporal consistency for dynamic subjects moving out of and back into the camera's view. This work from Huazhong University of Science and Technology and Kuaishou Technology introduces the HM-World dataset and the HyDRA architecture, which achieved superior dynamic subject consistency and overall generation quality compared to baselines on the new dataset and outperformed a commercial model in zero-shot evaluation.

#attention-mechanisms #computer-science #artificial-intelligence

World Reasoning Arena

26 Mar 2026

PAN Team, Institute of Foundation Models

Qiyue Gao

Kun Zhou

Researchers from the PAN Team at MBZUAI introduced WR-Arena, a new benchmark designed to assess advanced capabilities of world models beyond short-term prediction, including action simulation fidelity, long-horizon forecasting, and simulative reasoning for planning. Evaluations revealed that current models struggle with error accumulation in long-horizon simulations and with consistently performing environment-level interventions, while action-state aligned models show improved planning performance.

#computer-science #computer-vision-and-pattern-recognition

Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving

28 Mar 2026

Qiqi Liu

Huan Xu

Jingyu Li

Autonomous driving requires reasoning about how the environment evolves and planning actions accordingly. Existing world-model-based approaches typically predict future scenes first and plan afterwards, resulting in open-loop imagination that may drift from the actual decision process. In this paper, we present Uni-World VLA, a unified vision-language-action (VLA) model that tightly interleaves future frame prediction and trajectory planning. Instead of generating a full world rollout before planning, our model alternates between predicting future frames and ego actions step by step, allowing planning decisions to be continuously conditioned on the imagined future observations. This interleaved generation forms a closed-loop interaction between world modeling and control, enabling more adaptive decision-making in dynamic traffic scenarios. In addition, we incorporate monocular depth information into frames to provide stronger geometric cues for world modeling, improving long-horizon scene prediction. Experiments on the NAVSIM benchmark show that our approach achieves competitive closed-loop planning performance while producing high-fidelity future frame predictions. These results demonstrate that tightly coupling world prediction and planning is a promising direction for scalable VLA driving systems.

#autonomous-vehicles #computer-science #computer-vision-and-pattern-recognition
