[Papers]
Research papers and publications from Scale Labs covering AI evaluation, safety, benchmarking, and frontier model analysis.
Date | Title | Tags
3/12/2026 | Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders | Safety
2/25/2026 | VeRO: An Evaluation Harness for Agents to Optimize Agents | Agents, Post-Training, Evaluation and Alignment
2/12/2026 | LHAW: Controllable Underspecification for Long-Horizon Tasks | Agents, Safety, Evaluation and Alignment
1/15/2026 | SciPredict: Can LLMs Predict the Outcomes of Research Experiments in Natural Sciences? | Safety, Evaluation and Alignment
1/6/2026 | Agentic Rubrics as Contextual Verifiers for SWE Agents | Agents, Safety, Evaluation and Alignment
12/22/2025 | MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes | Reasoning, Safety, Evaluation and Alignment
12/18/2025 | MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers | Agents, Reasoning, Safety, Evaluation and Alignment
12/17/2025 | Audio MultiChallenge | Multimodal, Safety, Evaluation and Alignment
11/25/2025 | PropensityBench | Safety, Evaluation and Alignment
11/13/2025 | Professional Reasoning Benchmark | Safety, Evaluation and Alignment, Reasoning
11/10/2025 | ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents | Reasoning, Safety, Evaluation and Alignment
11/5/2025 | Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models | Safety, Evaluation and Alignment
10/28/2025 | Remote Labor Index: Measuring AI Automation of Remote Work | Agents, Safety, Evaluation and Alignment, Reasoning
10/20/2025 | REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards | Reasoning, Agents, Safety, Evaluation and Alignment
10/15/2025 | Beyond Seeing: Evaluating Multimodal LLMs On Tool-enabled Image Perception, Transformation, and Reasoning | Safety, Evaluation and Alignment, Reasoning, Multimodal
10/8/2025 | Online Rubrics Elicitation from Pairwise Comparisons | Safety, Evaluation and Alignment, Post-Training
9/25/2025 | Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training | Post-Training, Science of Data
9/23/2025 | Progress over Points: Reframing LM Benchmarks Around Scientific Objectives | Safety, Evaluation and Alignment
9/19/2025 | SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? | Agents, Safety, Evaluation and Alignment
9/11/2025 | TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models | Safety, Evaluation and Alignment
8/26/2025 | Reliable Weak-to-Strong Monitoring of LLM Agents | Safety, Evaluation and Alignment, Oversight
8/13/2025 | Search-Time Data Contamination | Safety, Evaluation and Alignment, Oversight
7/23/2025 | MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs | Reasoning, Safety, Evaluation and Alignment
7/23/2025 | Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains | Science of Data, Post-Training
7/21/2025 | WebGuard: Building a Generalizable Guardrail for Web Agents | Agents, Safety, Evaluation and Alignment
7/15/2025 | Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety | Reasoning, Oversight, Safety, Evaluation and Alignment
6/28/2025 | Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning | Post-Training, Reasoning
6/18/2025 | FORTRESS: Frontier Risk Evaluation for National Security and Public Safety | Safety, Evaluation and Alignment
6/16/2025 | Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models | Reasoning
6/13/2025 | Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards | Agents, Post-Training, Reasoning
6/5/2025 | A Red Teaming Roadmap Towards System-Level Safety | Safety, Evaluation and Alignment
5/9/2025 | Assessing Robustness to Spurious Correlations in Post-Training Language Models | Post-Training, Science of Data
3/14/2025 | Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking | Reasoning
3/8/2025 | Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models | Safety, Evaluation and Alignment
3/5/2025 | The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems | Safety, Evaluation and Alignment
2/13/2025 | ENIGMAEVAL: A Benchmark of Long Multimodal Reasoning Challenges | Reasoning, Safety, Evaluation and Alignment
2/11/2025 | J2: Jailbreaking to Jailbreak | Safety, Evaluation and Alignment
2/10/2025 | ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms | Safety, Evaluation and Alignment
1/29/2025 | MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs | Safety, Evaluation and Alignment, Reasoning
1/23/2025 | Humanity's Last Exam | Safety, Evaluation and Alignment, Reasoning
1/2/2025 | ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark | Safety, Evaluation and Alignment, Reasoning, Oversight
10/11/2024 | Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents | Safety, Evaluation and Alignment
9/29/2024 | Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs | Post-Training, Science of Data
9/27/2024 | Revisiting the Superficial Alignment Hypothesis | Post-Training
9/5/2024 | Planning In Natural Language Improves LLM Search For Code Generation | Post-Training
8/30/2024 | Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data | Safety, Evaluation and Alignment, Multimodal, Science of Data
8/27/2024 | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Safety, Evaluation and Alignment
7/18/2024 | Learning Goal-Conditioned Representations for Language Reward Models | Post-Training
5/1/2024 | A Careful Examination of Large Language Model Performance on Grade School Arithmetic | Safety, Evaluation and Alignment
3/5/2024 | The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning | Safety, Evaluation and Alignment, Post-Training
1/22/2024 | Out-of-Distribution Detection & Applications With Ablated Learned Temperature Energy | Computer Vision
11/21/2023 | A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribution Shift | Post-Training
10/5/2023 | A Holistic Approach For Test And Evaluation Of Large Language Models | Safety, Evaluation and Alignment
10/4/2023 | On the Performance of Multimodal Language Models | Multimodal, Post-Training
4/28/2023 | Empirical Analysis of the Strengths and Weaknesses of PEFT Techniques for LLMs | Post-Training
4/11/2023 | Detecting and Preventing Hallucinations in Large Vision Language Models | Computer Vision
3/11/2023 | Enabling Calibration In The Zero-shot Inference Of Large Vision-Language Models | Computer Vision
1/29/2023 | Improving the Accuracy-Robustness Trade-Off of Classifiers via Adaptive Smoothing | Safety, Evaluation and Alignment
3/7/2022 | GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction | Computer Vision
11/16/2021 | CAR – Cityscapes Attributes Recognition: A Multi-category Attributes Dataset for Autonomous Vehicles | Computer Vision
11/7/2021 | Natural Adversarial Objects | Computer Vision
10/11/2021 | DEBAGREEMENT: A comment-reply dataset for (dis)agreement detection in online debates | Safety, Evaluation and Alignment
7/31/2021 | On The State of Data In Computer Vision: Human Annotations Remain Indispensable for Developing Deep Learning Models | Computer Vision, Science of Data
4/20/2021 | Evaluating Deep Neural Networks Trained on Clinical Images in Dermatology with the Fitzpatrick 17k Dataset | Computer Vision
11/27/2020 | A Survey of Deep Learning Approaches for OCR and Document Understanding | Computer Vision
65 papers found