Xiangyi Li
3,377 posts
Xiangyi Li
@xdotli
Your friendly neighborhood eval guy. Creator of SkillsBench and ClawsBench @benchflow_ai. Chat about evals, RL environments, and skills: discord.gg/mZ9Rc8q8W3
San Francisco
benchflow.ai
Joined January 2022
1,562 Following
5,729 Followers
  • Pinned
    Xiangyi Li
    @xdotli
    Feb 13
    Agent Skills are everywhere - Claude Code, Gemini CLI, Codex all support them. But do they actually work? 105 domain experts from Stanford, CMU, Berkeley, Oxford, Amazon, ByteDance & more built SkillsBench to find that out. 86 tasks. 11 domains. 7,308 trajectories. 🧵👇
    123K
  • Xiangyi Li
    @xdotli
    Sep 29, 2025
    wonder what @paulg thinks
    Guillermo Rauch
    Vercel
    @rauchg
    Sep 29, 2025
    🇺🇸 🇮🇱 🇦🇷 Enjoyed my discussion with PM Netanyahu on how AI education and literacy will keep our free societies ahead. We spoke about AI empowering everyone to build software and the importance of ensuring it serves quality and progress. Optimistic for peace, safety, and
    215K
  • Xiangyi Li
    @xdotli
    Apr 5, 2025
    After playing Pokemon for days, we are happy to share a preview of our open-source LLM Plays Pokemon Benchmark: introducing PokemonGym. We ran a simple prompt agent for 4 hours. Surprisingly, it takes an amateur player ~400 steps to get the first Pokemon, and ~450 for Claude 3.7
    19K
  • Xiangyi Li
    @xdotli
    Feb 8, 2025
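    A startup whose round was led by Google's Chief Scientist is hiring | China / Bay Area. We are an AI infra startup based in the SF Bay Area, building the evaluation infrastructure for the AI era: standardized benchmarks that help developers replace manual evaluation. So far we have been backed by, among others, Google Chief Scientist Jeff Dean, A16z scout fund, Dropbox co-founder Arash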
    22K
  • Xiangyi Li
    @xdotli
    Dec 15, 2024
    Just a random 12yo trying to make a Discord clone with @v0 and @supabase
    5.8K
  • Xiangyi Li
    @xdotli
    Apr 1, 2025
    It's been 13 days since the release of the JFK files, and thanks to @amasad open-sourcing the OCRed JFK files, we were able to ship this: JFK RAG arena. Vote for your favorite models. Link in comments!
    62K
  • Xiangyi Li
    @xdotli
    Jun 7, 2025
    epic work in seo @fdotinc
    3.4K
  • Xiangyi Li
    @xdotli
    Sep 4, 2025
    Introducing Instaline: give any agent a number, starting with Claude Code. DM me if you want to test it out or connect your agents with a number. Still working on it, so expect a better launch soon!
    62K
  • Xiangyi Li
    @xdotli
    Feb 11, 2025
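    A startup whose round was led by Google's Chief Scientist is hiring | Bay Area / remote. Hi everyone, we are an AI infra startup based in the SF Bay Area, building the evaluation infrastructure for the AI era: standardized benchmarks that help developers replace manual evaluation. So far we have been backed by, among others, Google Chief Scientist Jeff Dean, A16z scout fund, Dropbox co-founder Arash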
    71K
  • Xiangyi Li
    @xdotli
    Apr 23, 2025
    PokémonGym v2 is live 🧵 Our latest run includes results on the 5 latest models, and integration with the LangGraph framework (thanks to @hwchase17)
    12K
  • Xiangyi Li
    @xdotli
    Nov 4, 2025
    Introducing TerminalGym, a full-stack data generation pipeline and training tool for training terminal agents on any task. For example, training models to use the Firecrawl MCP more effectively.
    5.5K
  • Xiangyi Li
    @xdotli
    Apr 3, 2025
    6 months ago, I left my job to work on my side project full-time @fdotinc. This changed everything. 3 months later, after showing months of growth and initial funding, we were backed by @hthieblot, who was basically one of our first believers
    2.6K
  • Xiangyi Li
    @xdotli
    Dec 11, 2022
    Been re-reading Thinking, Fast and Slow this morning and I somehow get @ThePrimeagen's point on why programmers should be better at typing and memorize APIs: if you can type better and remember well, the coding process will be closer to your System 1 thinking. Not just the flow.
  • Xiangyi Li
    @xdotli
    May 3, 2025
    Replying to @sundarpichai and @TheCodeOfJoel
    Hey @sundarpichai we just made the eval harness open-source for any LLMs to play Pokémon. Shout out to @TheCodeOfJoel for finishing the complete run. We couldn't finish it due to rate limits a few weeks ago. @OfficialLoganK. It's live on
    GitHub - benchflow-ai/benchflow: Research infra for creating RL environments, post-training, and...
    From github.com
    7.6K
