EvoClaw: Evaluating AI Agents on Continuous Software Evolution

Long-running agents build customized software (a “Claw”) to interact with their environments. To be practical for complex, real-world tasks, these agents must evolve this software fully and autonomously in response to a continuous stream of end-user requirements. EvoClaw evaluates how well frontier LLM agents handle this continuous development, benchmarking them against real-world evolution histories from open-source repositories.

Overall Cost / Performance on EvoClaw

Leaderboard

| # | Model | Agent | Score (%) | Precision (%) | Recall (%) | Resolve (%) | Cost ($) | Output Tokens (K) | Time (h) | Turns |
|---|-------|-------|-----------|---------------|------------|-------------|----------|-------------------|----------|-------|