Follow-up to this post, wherein I was shocked to find that Claude Code failed to do a low-context task which took me 4 hours and involved some skills at which I expected it to have significant advantages [1].
I kept going to see if Claude Code could eventually succeed. What happened instead was that it built a very impressive-looking 4000 LOC system to extract type and dependency injection information from my entire codebase and dump it into a SQLite database.
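To give a sense of what a tool like that involves, here's a minimal sketch of the kind of SQLite schema and loader it could be built around. Everything here (the table names, columns, and the `record_binding` helper) is my own invention for illustration, not Claude's actual code.

```python
import sqlite3

# Illustrative schema only: the real tool's tables and columns were Claude's
# own design and almost certainly differ from this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS types (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    module TEXT NOT NULL,
    UNIQUE(name, module)
);
CREATE TABLE IF NOT EXISTS di_bindings (
    id INTEGER PRIMARY KEY,
    consumer_id INTEGER NOT NULL REFERENCES types(id),
    dependency_id INTEGER NOT NULL REFERENCES types(id),
    injection_site TEXT  -- e.g. a constructor parameter or provider name
);
"""

def record_binding(conn: sqlite3.Connection,
                   consumer: tuple[str, str],
                   dependency: tuple[str, str],
                   site: str) -> None:
    """Insert one 'consumer depends on dependency' edge, creating type rows as needed."""
    def type_id(name: str, module: str) -> int:
        conn.execute(
            "INSERT OR IGNORE INTO types (name, module) VALUES (?, ?)", (name, module)
        )
        row = conn.execute(
            "SELECT id FROM types WHERE name = ? AND module = ?", (name, module)
        ).fetchone()
        return row[0]

    conn.execute(
        "INSERT INTO di_bindings (consumer_id, dependency_id, injection_site) "
        "VALUES (?, ?, ?)",
        (type_id(*consumer), type_id(*dependency), site),
    )

if __name__ == "__main__":
    conn = sqlite3.connect("di_graph.db")
    conn.executescript(SCHEMA)
    # Hypothetical example edge: OrderService takes a PaymentClient in its constructor.
    record_binding(conn, ("OrderService", "app.orders"),
                   ("PaymentClient", "app.payments"), "constructor")
    conn.commit()
```

Once the whole dependency graph lives in a database like this, "what can reach what" questions become plain SQL queries, which is presumably what made the tool fun to poke at for two days.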
To my shock, the tool Claude built [2] actually worked. I ended up playing with it for two days, uncovering and ticketing all sorts of bugs in the codebase I hadn't been aware of. And then I realized that the bugs I was uncovering weren't of the type I was actually looking for in the task I was immediately trying to do, and that if I wanted bugs to fix, we already have a backlog that we're not going to get through any time soon no matter how much AI help we have.
So anyway, Claude was able to do a reasonable job of figuring out what endpoint sequences could cause an issue. It struggled to figure out how to invoke the framework to make a mock HTTP request [3], but once it had a template to work off of (roughly the shape of the sketch below), it was able to make good progress.
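My actual framework and endpoints look nothing like this, but as an illustration of the kind of template involved, here's roughly what "a mock HTTP request exercising an endpoint sequence" looks like in a Python/FastAPI setup. The app, routes, and assertion are all invented for the example.

```python
from fastapi import FastAPI
from fastapi.testclient import TestClient

# Stand-in app with two endpoints whose ordering matters; the real codebase,
# framework, and routes are different -- this only shows the shape of the template.
app = FastAPI()
_state: dict[str, str] = {}

@app.post("/orders/{order_id}/reserve")
def reserve(order_id: str):
    _state[order_id] = "reserved"
    return {"status": "reserved"}

@app.post("/orders/{order_id}/cancel")
def cancel(order_id: str):
    # Hypothetical bug to expose: cancel never checks that a reservation exists.
    _state[order_id] = "cancelled"
    return {"status": "cancelled"}

def test_cancel_before_reserve_is_rejected():
    """A 'successfully failing' test: it encodes the behavior we want and fails today."""
    client = TestClient(app)
    resp = client.post("/orders/abc/cancel")  # cancel with no prior reserve
    assert resp.status_code == 409  # fails against the buggy stand-in above
```

The value of the template is that once the test-client boilerplate exists, each new case is just a different sequence of requests plus an assertion that the bad outcome is rejected.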
On what I expected to be the hardest part of the task, Claude actually did quite well once it had a template to work off of. It was able (with some re-prompting when it declared the task finished early) to write successfully-failing tests for 5 of the 7 cases I had managed to write a test for myself, as well as for one of the four "I'm pretty sure this is an issue but I can't figure out how to expose it" cases. I also picked up a few tricks of my own from seeing how Claude tackled a couple of the cases.
All that said, I definitely came away from this experiment with a strong intuition for exactly how it could take 20% longer to do things when you have LLM coding agents assisting you.
1. The specific task was programmatically checking through an ent