Xiangyi Li (@xdotli) / X

Xiangyi Li

3,377 posts

@xdotli

your friendly neighborhood eval guy, creator SkillsBench ClawsBench @benchflow_ai chat about evals, rl environments, skills discord.gg/mZ9Rc8q8W3

San Francisco

Joined January 2022

Pinned
Xiangyi Li
@xdotli
Feb 13
Agent Skills are everywhere - Claude Code, Gemini CLI, Codex all support them. But do they actually work? 105 domain experts from Stanford, CMU, Berkeley, Oxford, Amazon, ByteDance & more built SkillsBench to find that out. 86 tasks. 11 domains. 7,308 trajectories. 🧵👇
123K
Xiangyi Li
@xdotli
Sep 29, 2025
wonder what @paulg thinks
Guillermo Rauch
@rauchg
Sep 29, 2025
🇺🇸 🇮🇱 🇦🇷 Enjoyed my discussion with PM Netanyahu on how AI education and literacy will keep our free societies ahead. We spoke about AI empowering everyone to build software and the importance of ensuring it serves quality and progress. Optimistic for peace, safety, and
215K
Xiangyi Li
@xdotli
Apr 5, 2025
After playing Pokemon for days, we are happy to share a preview of our open-source LLM Plays Pokemon Benchmark - introducing PokemonGym. We ran a simple prompt agent for 4 hours. Surprisingly, it takes an amateur player ~400 steps to get the first Pokemon, and ~450 for Claude 3.7
19K
Xiangyi Li
@xdotli
Feb 8, 2025
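Startup with a round led by Google's Chief Scientist is hiring | China / Bay Area. We are an AI infra startup based in the SF Bay Area, building the evaluation infrastructure for the AI era: standardized benchmarks that help developers replace manual evaluation. We are currently backed by, among others, Google Chief Scientist Jeff Dean, A16z scout fund, Dropbox co-founder Arash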
22K
Xiangyi Li
@xdotli
Dec 15, 2024
Just a random 12yo trying to make a Discord clone with @v0 and @supabase
5.8K
Xiangyi Li
@xdotli
Apr 1, 2025
It's been 13 days since the release of the JFK files, and thanks to @amasad open-sourcing the OCRed JFK files, we are able to ship this: JFK RAG arena. Vote for your favorite models. Link in comments!
62K
Xiangyi Li
@xdotli
Jun 7, 2025
epic work in seo @fdotinc
3.4K
Xiangyi Li
@xdotli
Sep 4, 2025
Introducing Instaline: give any agent a number, starting with Claude Code. DM me if you want to test it out or connect your agents with a number. Still working on it, so expect a better launch soon!
62K
Xiangyi Li
@xdotli
Feb 11, 2025
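Startup with a round led by Google's Chief Scientist is hiring | Bay Area/remote. Hi everyone, we are an AI infra startup based in the SF Bay Area, building the evaluation infrastructure for the AI era: standardized benchmarks that help developers replace manual evaluation. We are currently backed by, among others, Google Chief Scientist Jeff Dean, A16z scout fund, Dropbox co-founder Arash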
71K
Xiangyi Li
@xdotli
Apr 23, 2025
PokémonGym v2 is live 🧵 Our latest run includes results on 5 latest models, and integration with the LangGraph framework (thanks to @hwchase17)
12K
Xiangyi Li
@xdotli
Nov 4, 2025
Introducing TerminalGym, a full-stack data generation pipeline and training tool to train terminal agents on any task. For example, training models to use the Firecrawl MCP more effectively
5.5K
Xiangyi Li
@xdotli
Apr 3, 2025
6 months ago, I left my job to work on my side project full-time @fdotinc. This changed everything. 3 months later, after showing months of growth and initial funding, we were backed by @hthieblot, who basically was one of our first believers
2.6K
Xiangyi Li
@xdotli
Dec 11, 2022
Been re-reading Thinking, Fast and Slow this morning and somehow get @ThePrimeagen's point on why programmers should be better at typing and memorize APIs: if you could type better and remember well, the coding process will be closer to your System 1 thinking. Not just the flow.
Xiangyi Li
@xdotli
May 3, 2025
Replying to @sundarpichai and @TheCodeOfJoel
Hey @sundarpichai we just made the eval harness open-source for any LLMs to play Pokémon. Shout out to @TheCodeOfJoel for finishing the complete run. We couldn't finish it due to rate limits a few weeks ago. @OfficialLoganK. It's live on
GitHub - benchflow-ai/benchflow: Research infra for creating RL environments, post-training, and...
From github.com
7.6K