Giant Robots Smashing Into Other Giant Robots

AI's "overnight" solution for our flaky tests took two weeks to adopt

2026-06-22T00:00:00+00:00

Recently I stopped a group of flaky tests from running in CI. This group was causing 60% of CI runs to fail, which was unsustainable. Three weeks later I was able to restore the group to CI, with 0% failures on main¹ as a result. Our “non-flaky” tests now give more false positives than the (previously) flaky group.

This is not really a post about tests, though. It’s about AI’s contribution (a lot) and what it took to make that contribution usable (also a lot).

The hardest problem

Developers on this project had been quarantining tests with a :flaky label for several years. The strategy was to quarantine a small group which could be expected to fail randomly but could also be re-run easily and separately from the full suite. Apart from the flakiness, the test suite is comprehensive and gives us high confidence that if we merge something after tests pass, it works.

Over the years, several developers had tried for a week at a time to reduce flakiness, all resulting in failure. In our defense, the flaky tests centred around interactive pages using Stimulus or Hotwire, and online discussion of this topic is a combination of ideas we tried already, plus someone saying: “I tried a lot, it doesn’t work, I think there’s a bug”.

The most promising angle was adopting Playwright, which did improve some things but also left us with some tests that failed permanently and needed to be skipped. There’s a dissatisfying way in which this is better than tests that only fail some of the time.

The problem started to look more and more like a trap set for enthusiastic developers. As a manager I always had to urge caution: “sure, you can see some approaches that could help, but bear in mind the last five times anyone tried they found very promising angles that didn’t change the stats in GitHub at all”. Developers whom I trust were seriously recommending deleting the entire group.

Opus “solved it” overnight

One night, Opus 4.6 running in Claude Code solved “the problem” by running the flaky test group hundreds of times and analyzing failures. There was some prompting to help Claude avoid premature conclusions and be aware that the problems could not be reproduced without repetition, plus a markdown file where it would record progress. Otherwise, no special magic.
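
The loop itself is simple to sketch. The following is a minimal illustration, not the script Claude actually used. It assumes the quarantined group can be selected with `--tag flaky` and that failures appear as `rspec ./path:line` lines in RSpec's "Failed examples" summary. The names `tally_failures` and `RSPEC_FLAKY_GROUP` are hypothetical.

```ruby
require "open3"

# Matches the "rspec ./spec/foo_spec.rb:12 # description" lines that RSpec
# prints under "Failed examples:" at the end of a failing run.
FAILED_EXAMPLE = /^rspec (\S+)/

# Assumption: the quarantined group is tagged :flaky.
RSPEC_FLAKY_GROUP = lambda do
  output, _status = Open3.capture2e("bundle", "exec", "rspec", "--tag", "flaky")
  output
end

# Calls `runner` `runs` times and counts how many runs each example failed in.
# Returns a hash of spec location => failing run count, most frequent first.
def tally_failures(runs:, runner: RSPEC_FLAKY_GROUP)
  tally = Hash.new(0)
  runs.times do
    runner.call.scan(FAILED_EXAMPLE).flatten.uniq.each do |location|
      tally[location] += 1
    end
  end
  tally.sort_by { |_location, count| -count }.to_h
end

# Usage (runs the real suite, so expect it to take a while):
#   tally_failures(runs: 50)
```

Passing the runner in as a callable keeps the counting logic separate from the slow part, so the tally can be checked against canned output before committing to a multi-hour batch.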

I could see Claude’s progress over time because it needed to run the flaky group in larger and larger batches. At first, five times was sufficient because the errors it found occurred 20% of the time. As those were fixed, I had to tell it to use batches of ten, fifty, and then one hundred. Finally, it reached a point where zero errors were found.
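
The growing batch sizes follow from simple probability. Assuming each run fails independently with the same probability (a simplification, since real flaky failures can be correlated), the chance of seeing at least one failure in n runs is 1 - (1 - p)^n. A small sketch with hypothetical helper names:

```ruby
# Probability of at least one failure in `runs` independent runs,
# given a per-run failure rate between 0.0 and 1.0.
def chance_of_seeing_failure(failure_rate, runs)
  1 - (1 - failure_rate)**runs
end

# Smallest number of runs needed to see at least one failure with the
# given confidence.
def runs_needed(failure_rate, confidence: 0.95)
  (Math.log(1 - confidence) / Math.log(1 - failure_rate)).ceil
end

chance_of_seeing_failure(0.2, 5).round(2) # => 0.67
runs_needed(0.2)                          # => 14
runs_needed(0.01)                         # => 299
```

Under these assumptions a batch of five catches a 20% failure about two times in three, while a 1% failure needs hundreds of runs before a clean batch means much.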

A “nice” thing about needing such large batches is that I could leave Claude alone for hours at a time while my normal evening continued. Flaky specs may be a problem uniquely suited to coding agents in that way. There’s not even much token use: it just kicks off a long run and surfaces for an internal conversation, then kicks off the next batch.

Two weeks to make the results useful

This isn’t a post about test failure strategy, so I’ll spare you details of what was flaky and what fixes applied. Instead I’ll try to communicate some of the meta concerns I had with the resulting code changes.

Given a test that looked something like this:

1 create objects
2 visit page
3 click A
4 click B
5 expect expression 1 to be true
6 click C
7 expect expression 2 to be true

Unchecked, Claude would have turned it into something like this:

1  create objects in a slightly different way that makes no difference
2  visit page
3  explicit sleep
4  unnecessary scoping to a specific section of the page
5    click A
6  end of unnecessary scoping
7  click B, with 3 second wait passed as option arg
8  a clever improvement that should have been on line 3
9  expect expression 1 to be true
10 click C
11 an improvement that worked in other tests but was irrelevant here
12 expect expression 2 to be true

Ultimately the changes added up to a good improvement, usually because of one crucial addition per test (in our fictional example, line 8) that was on the wrong line and hidden in a mountain of garbage (lines 3, 4, 6, 7, 11).

It took two weeks to:

separate coincidence from real results
remove the things that didn’t make a difference
apply good practice to the important differences
unify slight variations on the same changes
generalise to other parts of the test suite
make sensible commits

Some of this work was just a matter of applying good practice (e.g. any explicit sleep call is immediately suspect), and other times it was sending Claude back to hundreds of test runs to prove that something it had added made no difference.
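
To illustrate why an explicit sleep is suspect: a fixed sleep is either too short (the test still flakes) or too long (every run pays for it), whereas polling for the condition returns as soon as it holds. Capybara's matchers already retry like this, up to `Capybara.default_max_wait_time`. The standalone sketch below, built around a hypothetical `wait_until` helper, only illustrates the idea and is not code from our suite.

```ruby
# Hypothetical helper: re-check the block until it returns a truthy value
# or the timeout passes. Returns true on success, false on timeout.
def wait_until(timeout: 2.0, interval: 0.01)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep interval
  end
end

# Simulate a page that finishes updating after roughly 50ms.
ready_at = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 0.05
page_ready = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) >= ready_at }

# Instead of `sleep 3` followed by an assertion, wait only as long as needed.
wait_until(timeout: 3) { page_ready.call }
```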

Conclusion: processing my reactions

I see in myself three reactions.

1. Hooray, I’m still useful as a programmer!

I think it would have been impossible without lots of experience working with Rails and RSpec to move from what Claude was suggesting initially towards something sustainable². The exact amount of experience necessary is uncertain, but I’m on more than ten years. It took a lot to move beyond the optimism and false positives, and it would have taken more if I didn’t already have a reasonable gut instinct about these things.

2. Boy, AI is awful! Why bother with it if it takes so long to use the results?

I would absolutely use (and recommend) Claude for analysing flaky tests again. I think it would be a mistake not to do so. Accurately running long processes with tiny changes in between multi-hour waits is not a strength for humans.

In addition, Claude did reason through code running in parallel processes in a way that no human had managed for years. That particular part of our code is complex, but has not had active work for years, meaning that no human has good context. Claude probably caught up in 10 minutes.

An interesting aside here is that I find Claude does much better work when it has tests to help it reason about application code. The tests were flaky, but they were still a good record of what the code was supposed to do.

3. Why keep going for two weeks after AI clearly fixed the problem I care about in one evening?

I could have taken the win, ignored the cruft, and gained two weeks. If I had, I would have lost those two weeks and more later on. Humans and AI agents would cargo cult the new (anti) patterns, falsely claiming victory over any future flakiness, and making it harder to identify the real problems.

As with all programming, eventually “tidy first, then do the work” ends up being faster than “just do the work”. There’s no escaping the tidying if I want good results; the question is whether I do it at a predictable time and pace or when there’s an emergency (like no-one being able to deploy any code because CI keeps failing).

That includes tidying up after AI.

¹ Commits on main are a proxy for “code that should pass tests”, as opposed to work-in-progress commits, which also go through CI and fail tests for real reasons.
² This was Opus 4.6, but nothing I’ve seen of later versions of Opus gives me confidence that humans are less necessary here.

The Playwright debugging tool Rails devs aren't using

2026-06-19T00:00:00+00:00

Playwright ships with a black-box recorder that records every detail of every test, giving you a treasure trove of information for debugging flaky tests. Most Rails apps I’ve worked on don’t use it.

It’s called the Trace Viewer, and if you’re running Capybara with Playwright via playwright-ruby-client, turning it on takes a single block in rails_helper.rb.

What’s in a Playwright trace?

When Playwright records a trace, you get back a .zip you can replay in the Trace Viewer. The viewer is essentially a DVR for your test run. For every action Playwright took, every click, fill_in, and navigation, you get:

A before-and-after screenshot, so you can see what the page looked like at each moment
The DOM at that point in time, fully inspectable like in your devtools
The network requests in flight, with status codes, body, and timing
The JavaScript console output
A timeline of everything that happened

Compare that to what you usually get when a test fails on CI: a screenshot, and maybe a stack trace pointing at expected to find "Saved". With a trace, you can scrub through the test like a video and pinpoint exactly where things went wrong.

Enable Playwright traces in your Rails app

Drop this into your rails_helper.rb. It assumes you’re using playwright-ruby-client.

You may need to change type: :system to type: :feature if that’s how you set up your browser-based tests.

RSpec.configure do |config|
  config.before(:each, type: :system) do |example|
    driver = Capybara.current_session.driver
    next unless driver.respond_to?(:start_tracing)

    driver.start_tracing(
      screenshots: true,
      snapshots: true,
      sources: true,
      title: example.full_description
    )
    example.metadata[:playwright_tracing] = true
  end

  config.after(:each, type: :system) do |example|
    next unless example.metadata[:playwright_tracing]

    path = trace_path(example)
    Capybara.current_session.driver.stop_tracing(path:)
    if example.exception
      output = RSpec.configuration.output_stream
      output.puts("Playwright trace: #{path}")
    end
  rescue Playwright::Error => e
    raise unless e.message.start_with?("Must start tracing before stopping")
  end

  def trace_path(example)
    if example.exception
      Rails.root.join(
        "tmp",
        "playwright-traces",
        [
          example.full_description.parameterize[0, 150],
          "-#{Time.current.to_i}",
          ".zip"
        ].join
      )
    else
      File::NULL
    end
  end
end

This starts a trace before every system spec and only saves it if the test failed. When that happens, you’ll see a line like this printed:

Playwright trace: /Users/jutonz/code/thoughtbot/testing/tmp/playwright-traces/it-works-1775576941.zip

A few things in the snippet might catch your eye (the before(:each) instead of around, the rescue, the conditional path). They’re like that for a reason, so the snippet can be copied as-is.

Giant Robots Podcast Ep 614: AI Code Audits

2026-06-18T00:00:00+00:00

Our hosts Chad and Sami team up this week to discuss AI code bases and whether they can be built to be developer friendly and with best practices in mind.

Meet thoughtbot at Brighton Ruby 2026

2026-06-18T00:00:00+00:00

Brighton Ruby 2026 will take place in a few days and the thoughtbot team will be there to meet you all in real life, learn from all the great talks, and enjoy a day by the English coast.

We love Brighton Ruby and have enjoyed it for many years. It is a single-day, single-track conference packed with great energy and great people.

This year we will have 5 thoughtbotters attending:

Aji will be at Brighton Ruby for the first time! They are always happy to talk about ruby game development, recent conversations on The Bike Shed, tracking reading lists on Storygraph, or (let’s see what else… ::rummages through bag of hobbies::) linguistic anthropology. Come say hello!

Chad is thoughtbot’s founder and CEO, host of the Giant Robots Smashing Into Other Giant Robots podcast, and eternal player of D&D.

Mina is based in Edinburgh, Scotland, and this will be her first time at Brighton Ruby. She’s interested in infrastructure as code and closing the gap between operations responsibilities and application development. She is also an avid marathon runner, and is always looking to strike up a conversation about the works of Brandon Sanderson or show off pictures of her dogs, Dottie and Henson.

Rob is our EMEA Development Director, based in Holmes Chapel, Cheshire, and you have most likely seen him already at Brighton Ruby or other conferences.

He has a not-so-quiet obsession with best practices and striving for improvement. He likes to hunt down delicious beers and coffee in his spare time. Despite the recent ups and downs, he’s an avid Stoke City fan, which is only a testament to his determination!

Sarah is a developer and team lead, based in Porto, Portugal, and originally from Brazil. She especially enjoys working with Ruby, but also with any technology that enables projects to move forward. She loves playing sports like volleyball and knows a great deal about movies and TV series.

And finally, we will also have a speaker at the conference this year: myself, Rémy Hannequin. I am happy to share that I will be giving a talk on time management and surprises with Ruby.

I am based in Paris, France, and I have a serious passion for astronomy. I have created multiple open source projects that combine Ruby and astronomy, the main one being Astronoby, a Ruby gem that computes celestial events and positions with extreme precision.

If you’re attending, come say hello! We’re always up for talking about Ruby, Rails, that new gem we’re excited about, or the eternal Vim vs VS Code debate, or we can just share a drink and talk about something else! Keep an eye on thoughtbot on Mastodon and Rob’s personal account to see where we’ll be hanging out.

We can’t wait to share the experience with you.

The mistake I didn't realise I was making when designing workshops

2026-06-17T00:00:00+00:00

The checklist I expected

Last week I attended a workshop on neuroinclusivity in learning design.

I expected to come away with a checklist.

Use larger fonts
Send slides in advance
Offer cameras off
Use a dyslexia-friendly typeface

Instead, the biggest takeaway was that there is no checklist.

Hold on - I know you want something tangible, it’s coming - stay with me.

The assumption I hadn’t questioned

The facilitator challenged a belief I hadn’t questioned before: we often talk about neurodiversity as if it describes a group of people.

The workshop argued that neurodiversity is the natural variation in how humans think, focus, process information and communicate.

That reframing changes the problem entirely.

Why this matters for design

When we think in categories, we tend to design for ourselves and then add accommodations afterwards. We build the workshop, the meeting, the presentation or the product, based on our own preferences and then ask, “now how do we make this accessible?”

When we think in variation, we start by accepting that people will experience the same thing differently.

The workshop wasn’t really about fonts or slide templates. It was about design choices.

How much information do you put on a slide?

Do people know why they’re learning something?

Can they contribute in different ways?

Have you considered sensory load, attention span, or processing time?

A familiar product challenge

As a product person, the parallel felt familiar: even if you start with the customer, there’s still a risk of designing for how something is assumed to be experienced, rather than how it actually is - a gap that only closes with context and testing.

Whether you’re designing software, a workshop, a conference talk or a team meeting, the principle feels surprisingly similar:

Start with the expectation that people will experience the same thing differently. Design from there.

The Bike Shed Ep 502: Apps That Make Our Work Go

2026-06-16T00:00:00+00:00

Aji and Sally are back together again, this time to discuss the different apps they use to make their workflows and To Do lists easier and quicker to achieve.

AI crawlers are inflating your view counts

2026-06-16T00:00:00+00:00

Your most-viewed page might be one no human has ever opened. That is what AI crawlers have done to view tracking in 2026.

I ran into this problem on a production app that needed engagement tracking. The first version tracked everything server-side, the way Rails apps have done analytics for years. It broke within a day.

The problem: crawlers inflate every count

We used Ahoy for tracking. Each controller action called ahoy.track while rendering the page, and every event rolled up into a denormalized counter column with counter_culture.

The issue is that server-side tracking fires on every request, including bots. AI crawlers like Meta-ExternalAgent, Bytespider, and Baiduspider were making roughly 100,000 requests per day. They were not attacking the site, just reading to feed training pipelines.

Ahoy has bot detection built in. It uses the device_detector gem to check user agents and skips known bots. That list catches Googlebot and older crawlers, but it misses the new wave of AI crawlers. As a result, every one of those requests created an Ahoy::Event row and incremented the corresponding counters.
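
The failure mode is easy to see in miniature. This sketch does not reproduce device_detector's actual rules: `KNOWN_BOTS` is a hypothetical, deliberately outdated blocklist, and the user agent strings are simplified for illustration.

```ruby
# Hypothetical blocklist, frozen at some point in the past.
KNOWN_BOTS = /googlebot|bingbot/i

# True only when the user agent matches a name already on the list.
def known_bot?(user_agent)
  user_agent.to_s.match?(KNOWN_BOTS)
end

known_bot?("Mozilla/5.0 (compatible; Googlebot/2.1)") # => true
known_bot?("meta-externalagent/1.1")                  # => false, so it is counted as a view
known_bot?("Mozilla/5.0 (Macintosh) Safari/605.1.15") # => false, an actual browser
```

Any list-based check has this shape: a crawler that is not yet on the list is indistinguishable from a person.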

Our view counts were not measuring human interest. They were measuring how hungry the scrapers were that week.

Fix one: require JavaScript

Chasing user agent strings is a losing game. New crawlers appear faster than blocklists update. But there is one thing AI crawlers reliably do not do, and that is execute JavaScript.

So we moved view tracking out of the controllers. Pages declare what is trackable as a data attribute, and a small Stimulus controller fires a beacon after the page loads.

connect() {
  if (this.element.dataset.viewTrackerFired === "true") return
  this.element.dataset.viewTrackerFired = "true"

  const fire = () => this.fire()
  if ("requestIdleCallback" in window) {
    requestIdleCallback(fire, { timeout: 2000 })
  } else {
    setTimeout(fire, 500)
  }
}

A few details mattered here:

requestIdleCallback defers the beacon until the browser is idle, so tracking never competes with rendering. The 2-second timeout guarantees it still fires on busy pages.
keepalive: true on the fetch lets the request survive the user navigating away immediately.
The fired flag guards against Turbo reconnecting the controller and double-counting.

Crawlers fetch the HTML and move on. Real browsers run the beacon and get counted. View counts dropped sharply the day this deployed. That was the fix landing, not a regression.

Fix two: the bots found the beacon

Three days later, the tracking endpoint /track/events was the most-crawled path on the site. Crawlers do not execute JavaScript, but they do parse it. The endpoint URL sits in the markup as a data attribute, so the scrapers extracted it and started requesting it directly.

None of those requests created events, but they still burned through the full Rails stack for nothing. The fix was two cheap layers.

First, robots.txt for the well-behaved bots:

User-agent: *
Disallow: /track/

Second, a guard in the controller for everyone else:

class TrackingEventsController < ApplicationController
  before_action :reject_bots

  private

  def reject_bots
    head :no_content if DeviceDetector.new(request.user_agent).bot?
  end
end

Any request with a bot user agent gets a 204 before the action runs. No parsing, no resource lookups, no database work. The well-behaved crawlers respect robots.txt and never arrive, and the rest get the cheapest possible response.
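
The same early return could also sit in front of the framework. As a rough sketch, and not what the app described here does, a Rack-style middleware can answer bot requests to /track/ before they reach the router. `TrackingBotGuard` and the `bot_detector` callable are hypothetical names. In a real app the detector could wrap `DeviceDetector#bot?`.

```ruby
# Hypothetical Rack-style middleware. It needs no gems: a Rack app is any
# object that responds to `call(env)` and returns [status, headers, body].
class TrackingBotGuard
  def initialize(app, bot_detector:, path_prefix: "/track/")
    @app = app
    @bot_detector = bot_detector
    @path_prefix = path_prefix
  end

  def call(env)
    if env["PATH_INFO"].to_s.start_with?(@path_prefix) &&
        @bot_detector.call(env["HTTP_USER_AGENT"].to_s)
      return [204, {}, []] # empty response, nothing downstream runs
    end

    @app.call(env)
  end
end

# Stand-ins for the real app and detector, for illustration only.
app = ->(_env) { [200, { "content-type" => "text/plain" }, ["tracked"]] }
detector = ->(user_agent) { user_agent.match?(/bot|crawler|spider/i) }
guard = TrackingBotGuard.new(app, bot_detector: detector)

guard.call({ "PATH_INFO" => "/track/events", "HTTP_USER_AGENT" => "ExampleBot/1.0" }).first # => 204
guard.call({ "PATH_INFO" => "/track/events", "HTTP_USER_AGENT" => "Mozilla/5.0" }).first    # => 200
```

Injecting the detector keeps the middleware independent of any particular bot list, which matters given how quickly those lists go stale.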

The takeaway

Server-side analytics was built for a web that no longer exists. In 2026, a meaningful share of your traffic comes from AI crawlers, so counting views on the server measures scraper appetite, not audience.

The defense is not one clever trick. It is stacked cheap layers: robots.txt for the bots that ask permission, a user agent check that returns early for the ones that announce themselves, and a JavaScript beacon for the bots that do neither.

Check your own numbers. If your view counts have never had a suspicious cliff in them, the bot tax is probably still baked in.