Log inSign up
Mechanize
161 posts
Image
user avatar
Mechanize
@MechanizeWork
We build environments and evals for training and evaluating frontier coding agents.
San Francisco, CA
mechanize.work
Joined April 2025
1
Following
14K
Followers
  • user avatar
    Mechanize
    @MechanizeWork
    Jun 17
    Dario claims that AI will write basically all code, and that within a year thereafter a large share of software engineering jobs will be lost. While his prediction about the automation of coding is right, the conclusion about jobs does not follow. Coding models are
    Image
    00:00
    4K
    user avatar
    Mechanize
    @MechanizeWork
    Jun 17
    Full podcast episode:
    Image
    How bad data teaches models to write terrible code
    From mechanizework.substack.com
    635
  • user avatar
    Mechanize
    @MechanizeWork
    Jun 17
    Coding models will often do things that are reasonable during training but produce a terrible user experience. Models trained to conserve tokens will happily rerun 10-minute commands, wasting your time to avoid using a few more tokens.
    Image
    00:00
    1.5K
    user avatar
    Mechanize
    @MechanizeWork
    Jun 17
    Full podcast episode:
    Image
    How bad data teaches models to write terrible code
    From mechanizework.substack.com
    794
  • user avatar
    Mechanize
    @MechanizeWork
    Jun 17
    Poor theory of mind is one of the main things keeping models from being good software engineers. They can resolve specific, reproducible bugs, but they struggle to anticipate what users want in the first place, which is much of what building good software requires.
    Image
    00:00
    1.6K
    user avatar
    Mechanize
    @MechanizeWork
    Jun 17
    Full podcast episode:
    Image
    How bad data teaches models to write terrible code
    From mechanizework.substack.com
    528
  • user avatar
    Mechanize
    @MechanizeWork
    Jun 16
    Replying to @MechanizeWork
    Youtube: youtube.com/watch?v=UpO70A… Substack: mechanizework.substack.com/p/how-bad-data… Spotify: open.spotify.com/show/033krxvlE…
    2.5K
  • user avatar
    Mechanize
    @MechanizeWork
    Jun 16
    Our new podcast on evals, with Max Niederman, Ege Erdil, and Stephen Yang. 0:00:00 – What's an eval, and how's it different from an RL environment? 0:19:33 – Why are models bad at building an emulator when the task is fully verifiable? 0:42:00 – How does training on bad data
    Image
    00:00
    23K
  • user avatar
    Mechanize
    @MechanizeWork
    Jun 10
    Replying to @MechanizeWork
    Claude Fable 5 performs especially well on gameplay, scoring 91.5%. Opus 4.8 scored 77.4%. Interestingly, Fable 5 is a regression on audio. It scores 44.5% on audio, which is worse than Opus 4.8's 69.1% and GPT-5.5's 58.9%.
    Image
    2.9K
    user avatar
    Mechanize
    @MechanizeWork
    Jun 10
    Claude Fable 5 is the first model we tested that gets perfect gameplay on Varoom 3D. Opus 4.8 got just 25% on the same game.
    Image
    00:00
    2.6K
    user avatar
    Mechanize
    @MechanizeWork
    Jun 10
    However, like earlier models, Fable 5 also fails to build an emulator that works on Spout, a homebrew cave-flying game. It diverges shortly after the loading screen, scoring 7.6%.
    Image
    00:00
    2.2K
  • user avatar
    Mechanize
    @MechanizeWork
    Jun 10
    Claude Fable 5 scores 74.5% on GBA Eval, the best score to date. Given 24 hours, it writes an emulator that plays all but one game in our test set near-perfectly. It beats Opus 4.8's 24-hour score in under 2 hours.
    Image
    25K
  • user avatar
    Mechanize
    @MechanizeWork
    Jun 9
    We caught Grok Build 0.1 reward hacking on GBA Eval. After it got stuck while testing, it started hard-coding its emulator to perform better on the exact ROM it was testing against.
    Image
    3.7K
    user avatar
    Mechanize
    @MechanizeWork
    Jun 9
    It didn't work. The ROMs that Grok has access to are example ROMs that we intentionally give the models so they can test locally. We actually grade their emulators on a set of hidden ROMs, so the hacking doesn't improve the score.
    1.5K
    user avatar
    Mechanize
    @MechanizeWork
    Jun 9
    This is the first reward hacking attempt we've caught on GBA Eval. This case is somewhat subtle, not "malicious," and wouldn't have affected scores. This last point is exactly why we're careful to think about these behaviors when designing evals. Blog:
    gbaeval.com
    GBA Eval - Build a Game Boy Advance emulator in WebAssembly from scratch
    Frontier AI coding agents try to write a Game Boy Advance emulator from scratch. Their emulators are graded against Mesen2.
    1.2K
  • user avatar
    Mechanize
    @MechanizeWork
    Jun 3
    We are now seeking a puzzle maker to help us create puzzles that LLMs can't yet solve.
    Image
    553K
    user avatar
    Mechanize
    @MechanizeWork
    Jun 3
    Apply here:
    mechanize.work
    Puzzle Maker
    Design interesting and original puzzles that LLMs can't yet solve
    14K
  • user avatar
    Mechanize
    @MechanizeWork
    Jun 1
    Claude Opus 4.8 scores 70.9% on GBA Eval, the top score to date. Given 24 hours, it writes an emulator that plays most games, with working audio on all of them. It beats the previous best (GPT-5.5 at 53.2%) in under an hour.
    Image
    Image
    00:30
    user avatar
    Mechanize
    @MechanizeWork
    May 14
    We gave frontier AI coding agents 24 hours to write a complete Game Boy Advance emulator from scratch. GPT-5.5's emulator runs games best, with Claude Sonnet 4.6 and Opus 4.7 close behind. Gemini 3.1 Pro failed to produce a working emulator.
    24K
    user avatar
    Mechanize
    @MechanizeWork
    Jun 1
    Here's Claude Opus 4.8's emulator running Collie Defense, where it scores 99.8% on video and 91% on audio. On most games we tested, gameplay is near-perfect, with some audio imperfections.
    Image
    00:00
    2.7K
    user avatar
    Mechanize
    @MechanizeWork
    Jun 1
    However, Opus 4.8's emulator is not perfect. On Varooom 3D, it diverges after around 2,000 frames. This is better than GPT-5.5 (whose emulator diverged after around 1,250 frames), but Opus 4.8 only scores 25% on this game.
    Image
    00:00
    2.3K

New to X?

Sign up now to get your own personalized timeline!

Create account

By signing up, you agree to the Terms of Service and Privacy Policy, including Cookie Use.

Terms of Service|Privacy Policy|Cookie Policy|Accessibility|Ads info|© 2026 X Corp.
Don't miss what's happening
People on X are the first to know.
Log inSign up
Advertisement
Advertisement