Alex Shaw
655 posts
Alex Shaw
@alexgshaw
Hacking on @terminalbench and @harborframework. Founding MTS @LaudeInstitute. Formerly Google. BYU alum.
tbench.ai
Joined October 2021
659
Following
1,815
Followers
  • Pinned
    Alex Shaw
    @alexgshaw
    Nov 7, 2025
    Today, we’re announcing the next chapter of Terminal-Bench with two releases: 1. Harbor, a new package for running sandboxed agent rollouts at scale 2. Terminal-Bench 2.0, a harder version of Terminal-Bench with increased verification
    144K
  • Alex Shaw
    @alexgshaw
    Nov 3, 2025
    We're releasing Terminal-Bench 2.0 this week! Come to our meetup on Thursday @ Databricks to get early access :)
    Terminal-Bench 2.0 · Luma
    From luma.com
    14K
  • Alex Shaw
    @alexgshaw
    Jul 16, 2025
    Evaluating agents on benchmarks is a pain. Each benchmark comes with its own harness, scoring scripts, and environments and integrating can take days. We're introducing the Terminal-Bench dataset registry to solve this problem. Think of it as the npm of agent benchmarks. Now
    14K
  • Alex Shaw
    @alexgshaw
    Jun 30, 2025
    Replying to @NTFabiano
    The craziest part of this chart is not how well the AI performs (although that is impressive). It’s that the best physician has less than 40% accuracy.
    2.9K
  • Alex Shaw
    @alexgshaw
    May 19, 2025
    Excited to share what I’ve been working on with @andykonwinski, @Mike_A_Merrill, and @lschmidt3 at Stanford & Laude. Introducing Terminal-Bench! A benchmark and framework to quantify how well AI agents accomplish complex tasks in a terminal environment. We believe that the
    Mike A. Merrill
    @Mike_A_Merrill
    May 19, 2025
    Many agents (Claude Code, Codex CLI) interact with the terminal to do valuable tasks, but do they currently work well enough to deploy en masse? We’re excited to introduce Terminal-Bench: An evaluation environment and benchmark for AI agents on real-world terminal tasks. Tl;dr
    6.2K
  • Alex Shaw
    @alexgshaw
    Jul 24, 2025
    Replying to @astrodanish
    If you have $32B you don’t need to go to Turkey for a hair transplant 😂
    7.6K
  • Alex Shaw
    @alexgshaw
    Nov 7, 2025
    Replying to @alexgshaw
    Harbor is the package we wish we had had while making Terminal-Bench. It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.
    harborframework.com
    Harbor
    A framework for evaluating and optimizing sandboxed agents and models.
    3.6K
  • Alex Shaw
    @alexgshaw
    Nov 13, 2025
    Great to see Warp putting up the top score on Terminal-Bench 2.0 just days after release! Even more exciting to hear that they've already made improvements to their agent based on the results. Ultimately, we hope that Terminal-Bench 2.0 accelerates model and agent development in
    Warp
    @warpdotdev
    Nov 11, 2025
    Warp is back at the top. Terminal-Bench 2.0 just launched and Warp secured the top spot with a score of 50.1%. The best agent to go from prompt to production.
    2.8K
  • Alex Shaw
    @alexgshaw
    Nov 7, 2025
    Replying to @alexgshaw
    Just a few of the features I love about Harbor: - Evaluate any agent that can be installed and run autonomously - Scale up to thousands of concurrent containers using providers like @daytonaio and @modal - Generate rollouts for SFT and RL - Create your own benchmarks or use
    3K
  • Alex Shaw
    @alexgshaw
    Apr 10, 2024
    Replying to @karpathy
    Be honest, did you use GitHub copilot when you wrote this
    15K
  • Alex Shaw
    @alexgshaw
    Nov 7, 2025
    Replying to @alexgshaw
    At present, Codex CLI with GPT-5 sits at the top of our new leaderboard. tbench.ai/leaderboard
    2.6K
  • Alex Shaw
    @alexgshaw
    Aug 9, 2023
    .@supabase's integration of AI into the SQL editor is easily the most convenient use of AI I have found in a product other than @GitHubCopilot. It's not complicated, but it does exactly what I need it to: (1/2)
    879
  • Alex Shaw
    @alexgshaw
    Nov 7, 2025
    Replying to @alexgshaw
    Additionally, Terminal-Bench wouldn’t be possible without its community. We’re so thankful to the over 1k members of our Discord who contributed and audited tasks, helped build and beta test Harbor, and made this such a fun project for everyone involved.
    1.9K
  • Alex Shaw
    @alexgshaw
    Oct 20, 2025
    Mike and I went on the @latentspacepod !
    Mike A. Merrill
    @Mike_A_Merrill
    Oct 20, 2025
    Had a great time talking about the history of terminal-bench and the future of agent evals with @alexgshaw , @swyx and @FanaHOVA on @latentspacepod. 🔗 Link below!
    2.5K