Reliability Estimator with Continuous Contrasts
Contrast the model's support for the same generated tokens under discrete-token and continuous-embedding inputs to estimate answer reliability via sequence-level KL.
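As a minimal sketch of the contrast described above, assume the per-token log-probabilities of the same generated tokens have already been collected from two forward passes, one with discrete-token inputs and one with continuous-embedding inputs. Because the tokens were sampled under the discrete-token pass, the summed log-probability gap is a single-sample estimate of the sequence-level KL. The function name `sequence_kl_estimate`, the `normalize` flag, and the length normalization are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def sequence_kl_estimate(
    logp_discrete: Sequence[float],
    logp_continuous: Sequence[float],
    normalize: bool = True,
) -> float:
    """Single-sample estimate of KL(p_discrete || p_continuous) for one sequence.

    Both arguments hold log-probabilities of the SAME generated tokens,
    scored under discrete-token and continuous-embedding inputs respectively.
    Hypothetical helper: how the paper aggregates the gap is an assumption here.
    """
    if len(logp_discrete) != len(logp_continuous):
        raise ValueError("both passes must score the same tokens")
    if len(logp_discrete) == 0:
        raise ValueError("need at least one generated token")

    # Sum of per-token log-ratios log p_discrete(y_t) - log p_continuous(y_t).
    total = sum(d - c for d, c in zip(logp_discrete, logp_continuous))

    # Optional length normalization so long and short answers are comparable.
    return total / len(logp_discrete) if normalize else total


# Made-up log-probabilities for a two-token answer.
score = sequence_kl_estimate([-0.1, -0.2], [-0.5, -0.4])
```

A single-sample estimate like this can be negative for an individual sequence, so any threshold applied to it should be tuned on held-out data rather than assumed to be non-negative.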
CopT is a pipeline with continuous-space verifiers for math, coding, and agentic reasoning, enabling LLMs to start with a draft answer and perform on-policy thinking conditioned on it for reflection and correction.
Perform subsequent on-policy thinking conditioned on the draft answer for reflection and correction when the draft answer is deemed insufficiently reliable.
Assess whether thinking chunks remain stable under continuous inputs with a second sequence-level KL.
Dynamically expose the draft answer during on-policy thinking to preserve useful partial information while reducing the risk of being misled by unreliable content.
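The two decisions in the steps above can be sketched as simple threshold checks, assuming each sequence-level KL has already been reduced to one number per draft answer and one per thinking chunk. The names `needs_thinking`, `stable_chunks`, `tau_answer`, and `tau_chunk` are hypothetical, and the direction of the comparison (a larger KL means less reliable or less stable) is an assumption. The rule that maps these signals to exposing or hiding the draft is not spelled out above, so the sketch stops at the two decisions.

```python
from typing import List, Sequence


def needs_thinking(draft_kl: float, tau_answer: float) -> bool:
    """Decide whether to run on-policy thinking after the draft answer.

    Assumption: a draft whose KL exceeds tau_answer is treated as
    insufficiently reliable. tau_answer is a hypothetical tuning parameter.
    """
    return draft_kl > tau_answer


def stable_chunks(chunk_kls: Sequence[float], tau_chunk: float) -> List[bool]:
    """Flag each thinking chunk as stable (True) or unstable (False).

    Assumption: a chunk is stable under continuous inputs when its own
    sequence-level KL is at most tau_chunk (also hypothetical).
    """
    return [kl <= tau_chunk for kl in chunk_kls]


# Made-up scores: the draft is flagged, and the middle chunk is unstable.
run_thinking = needs_thinking(draft_kl=0.8, tau_answer=0.5)
flags = stable_chunks([0.1, 0.9, 0.3], tau_chunk=0.5)
```

Keeping the answer-level and chunk-level thresholds separate reflects the text's use of two distinct KL signals; whether they share a scale in practice is left open here.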
Accuracy at the same or lower token usage
At the same or higher Math / Coding / Agentic reasoning accuracy
@misc{shi2026coptcontrastiveonpolicythinking,
title={CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning},
author={Dachuan Shi and Hanlin Zhu and Xiangchi Yuan and Wanjia Zhao and Kejing Xia and Wen Xiao and Wenke Lee},
year={2026},
eprint={2605.20075},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.20075},
}