Scale Labs

[Papers]

Research papers and publications from Scale Labs covering AI evaluation, safety, benchmarking, and frontier model analysis.

Date | Title | Topics
3/12/2026 | Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders | Safety
2/25/2026 | VeRO: An Evaluation Harness for Agents to Optimize Agents | Agents, Post-Training, Evaluation and Alignment
2/12/2026 | LHAW: Controllable Underspecification for Long-Horizon Tasks | Agents, Safety, Evaluation and Alignment
1/15/2026 | SciPredict: Can LLMs Predict the Outcomes of Research Experiments in Natural Sciences? | Safety, Evaluation and Alignment
1/6/2026 | Agentic Rubrics as Contextual Verifiers for SWE Agents | Agents, Safety, Evaluation and Alignment
12/22/2025 | MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes | Reasoning, Safety, Evaluation and Alignment
12/18/2025 | MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers | Agents, Reasoning, Safety, Evaluation and Alignment
12/17/2025 | Audio MultiChallenge | Multimodal, Safety, Evaluation and Alignment
11/25/2025 | PropensityBench | Safety, Evaluation and Alignment
11/13/2025 | Professional Reasoning Benchmark | Safety, Evaluation and Alignment, Reasoning
11/10/2025 | ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents | Reasoning, Safety, Evaluation and Alignment
11/5/2025 | Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models | Safety, Evaluation and Alignment
10/28/2025 | Remote Labor Index: Measuring AI Automation of Remote Work | Agents, Safety, Evaluation and Alignment, Reasoning
10/20/2025 | REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards | Reasoning, Agents, Safety, Evaluation and Alignment
10/15/2025 | Beyond Seeing: Evaluating Multimodal LLMs On Tool-enabled Image Perception, Transformation, and Reasoning | Safety, Evaluation and Alignment, Reasoning, Multimodal
10/8/2025 | Online Rubrics Elicitation from Pairwise Comparisons | Safety, Evaluation and Alignment, Post-Training
9/25/2025 | Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training | Post-Training, Science of Data
9/23/2025 | Progress over Points: Reframing LM Benchmarks Around Scientific Objectives | Safety, Evaluation and Alignment
9/19/2025 | SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? | Agents, Safety, Evaluation and Alignment
9/11/2025 | TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models | Safety, Evaluation and Alignment
8/26/2025 | Reliable Weak-to-Strong Monitoring of LLM Agents | Safety, Evaluation and Alignment, Oversight
8/13/2025 | Search-Time Data Contamination | Safety, Evaluation and Alignment, Oversight
7/23/2025 | MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs | Reasoning, Safety, Evaluation and Alignment
7/23/2025 | Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains | Science of Data, Post-Training
7/21/2025 | WebGuard: Building a Generalizable Guardrail for Web Agents | Agents, Safety, Evaluation and Alignment
7/15/2025 | Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety | Reasoning, Oversight, Safety, Evaluation and Alignment
6/28/2025 | Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning | Post-Training, Reasoning
6/18/2025 | FORTRESS: Frontier Risk Evaluation for National Security and Public Safety | Safety, Evaluation and Alignment
6/16/2025 | Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models | Reasoning
6/13/2025 | Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards | Agents, Post-Training, Reasoning
6/5/2025 | A Red Teaming Roadmap Towards System-Level Safety | Safety, Evaluation and Alignment
5/9/2025 | Assessing Robustness to Spurious Correlations in Post-Training Language Models | Post-Training, Science of Data
3/14/2025 | Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking | Reasoning
3/8/2025 | Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models | Safety, Evaluation and Alignment
3/5/2025 | The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems | Safety, Evaluation and Alignment
2/13/2025 | ENIGMAEVAL: A Benchmark of Long Multimodal Reasoning Challenges | Reasoning, Safety, Evaluation and Alignment
2/11/2025 | J2: Jailbreaking to Jailbreak | Safety, Evaluation and Alignment
2/10/2025 | ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms | Safety, Evaluation and Alignment
1/29/2025 | MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs | Safety, Evaluation and Alignment, Reasoning
1/23/2025 | Humanity's Last Exam | Safety, Evaluation and Alignment, Reasoning
1/2/2025 | ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark | Safety, Evaluation and Alignment, Reasoning, Oversight
10/11/2024 | Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents | Safety, Evaluation and Alignment
9/29/2024 | Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs | Post-Training, Science of Data
9/27/2024 | Revisiting the Superficial Alignment Hypothesis | Post-Training
9/5/2024 | Planning In Natural Language Improves LLM Search For Code Generation | Post-Training
8/30/2024 | Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data | Safety, Evaluation and Alignment, Multimodal, Science of Data
8/27/2024 | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Safety, Evaluation and Alignment
7/18/2024 | Learning Goal-Conditioned Representations for Language Reward Models | Post-Training
5/1/2024 | A Careful Examination of Large Language Model Performance on Grade School Arithmetic | Safety, Evaluation and Alignment
3/5/2024 | The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning | Safety, Evaluation and Alignment, Post-Training
1/22/2024 | Out-of-Distribution Detection & Applications With Ablated Learned Temperature Energy | Computer Vision
11/21/2023 | A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribution Shift | Post-Training
10/5/2023 | A Holistic Approach For Test And Evaluation Of Large Language Models | Safety, Evaluation and Alignment
10/4/2023 | On the Performance of Multimodal Language Models | Multimodal, Post-Training
4/28/2023 | Empirical Analysis of the Strengths and Weaknesses of PEFT Techniques for LLMs | Post-Training
4/11/2023 | Detecting and Preventing Hallucinations in Large Vision Language Models | Computer Vision
3/11/2023 | Enabling Calibration In The Zero-shot Inference Of Large Vision-Language Models | Computer Vision
1/29/2023 | Improving the Accuracy-Robustness Trade-Off of Classifiers via Adaptive Smoothing | Safety, Evaluation and Alignment
3/7/2022 | GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction | Computer Vision
11/16/2021 | CAR – Cityscapes Attributes Recognition: A Multi-category Attributes Dataset for Autonomous Vehicles | Computer Vision
11/7/2021 | Natural Adversarial Objects | Computer Vision
10/11/2021 | DEBAGREEMENT: A comment-reply dataset for (dis)agreement detection in online debates | Safety, Evaluation and Alignment
7/31/2021 | On The State of Data In Computer Vision: Human Annotations Remain Indispensable for Developing Deep Learning Models | Computer Vision, Science of Data
4/20/2021 | Evaluating Deep Neural Networks Trained on Clinical Images in Dermatology with the Fitzpatrick 17k Dataset | Computer Vision
11/27/2020 | A Survey of Deep Learning Approaches for OCR and Document Understanding | Computer Vision


65 papers found


Copyright 2026 Scale Inc. All rights reserved.

Terms | Privacy