Scale Labs (@ScaleAILabs) / X

Scale Labs

@ScaleAILabs

welcome to the lab. from the researchers at @scale_AI

Joined October 2025

Scale Labs reposted
Afra Feyza Akyürek
@afeyzaakyurek
Jun 5
Excited to share new @ScaleAILabs research in collaboration with @phylo_bio on coding agents for drug-discovery research! 💊 We ran Claude Code, Codex, and Gemini on 60+ expert-curated drug-discovery tasks inside a shared Biomni-powered biomedical research environment and the
Scale Labs reposted
Akshay
@akshay_manglik
Jun 2
How do you turn agent traces into an improvement flywheel? Excited to share Insights Generator (IG) — new @scale_AI / @ScaleAILabs research that finds behavioral patterns and bugs in agent traces. Engineers & coding agents using IG achieved 30+% gains on agent benchmarks. 🧵
Scale Labs
@ScaleAILabs
Jun 1
Replying to @ScaleAILabs
HiL-Dynamics: github.com/melfeki-11/HiL… Blog: labs.scale.com/blog/hil-dynam…
GitHub - melfeki-11/HiL-Dynamics: Does your coding agent know when it doesn't know? HiL-Dynamics...
From github.com
Scale Labs
@ScaleAILabs
Jun 1
Replying to @ScaleAILabs
Selective escalation remains one of the biggest challenges for reliable human-in-the-loop AI. We hope HiL-Dynamics helps users find the right setup for their workflows and gives model builders clearer signals for building agents that collaborate with humans more effectively.
Scale Labs
@ScaleAILabs
Jun 1
Replying to @ScaleAILabs
Not all agents fail the same way. GPT-based agents tend to ask early. Claude-based agents tend to explore first. But when GPT runs inside Codex, a harness that discourages asking, that behavior nearly disappears. The harness shapes collaboration strategy as much as the model
Scale Labs
@ScaleAILabs
Jun 1
Replying to @ScaleAILabs
We found agents are highly sensitive to customization. Some rely heavily on system prompts and examples. Others, like Codex, needed additional ask tools to reach their full potential. Small setup changes can completely change collaboration behavior.
Scale Labs
@ScaleAILabs
Jun 1
Replying to @ScaleAILabs
We evaluated five production agent stacks: Codex, Claude Code, Google Antigravity, ADK, and OpenCode, with and without customization and skill tuning. Even with the best enhancements, agents still struggle to escalate to humans at the right moment. State-of-the-art agents
Scale Labs
@ScaleAILabs
Jun 1
Today we're releasing HiL-Dynamics, the first open-source tool that measures how production agents actually collaborate with humans under uncertainty. Not just whether they got the answer. Now you can measure exactly when your agent asks for help, when it makes assumptions, and
Scale Labs
@ScaleAILabs
May 28
Claude Opus 4.8 just landed on our MCP Atlas Leaderboard! Opus 4.8’s performance places it in the top band of SOTA models for agentic tool calling. The Claude 4 family keeps getting better at long-horizon tool use. Check out the updated rankings:
MCP Atlas
From labs.scale.com
Scale Labs
@ScaleAILabs
May 26
Replying to @ScaleAILabs
Full paper:
ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents
From labs.scale.com
Scale Labs
@ScaleAILabs
May 26
Replying to @ScaleAILabs
The takeaway: standard security evaluations may be underestimating the attack surface of interactive AI agents. A model that appears secure on fully specified tasks may become significantly more vulnerable once it has to handle ambiguity and request additional user input.
Scale Labs
@ScaleAILabs
May 26
Replying to @ScaleAILabs
Why does this happen? We found two key shifts once agents enter clarification mode:
- Models process incoming information differently
- The clarification interface itself creates a new attack surface
Asking follow-up questions can change how easily an agent can be manipulated.
Scale Labs
@ScaleAILabs
May 26
Replying to @ScaleAILabs
Across 10 frontier models, clarification-seeking consistently increased vulnerability to prompt injection attacks. Examples:
- o3: 1.8% → 34.0% attack success
- Gemini-3-Flash: 2.2% → 35.7%
Models that appeared robust in standard execution settings became far easier to