chore(appliance): fix race condition in tests#62999
Conversation
In case you haven't tried it, might be useful to add copious logging and leave the test running for a few hours or longer. |
|
That's a good idea @varungandhi-src, will give that a shot next week. Will hold off merging this until then. |
|
Since Friday, we had 2 more flake reports. 1 was another instance of TestNonNamespacedResourcesRemainWhenDisabled, and the other was this: The theory in the PR description is consistent with this failure too. If we scrape envtest resources after awaiting 1 reconciliation event, but before that same reconciliation loop has updated the currentVersion annotation, we'd expect to see this. I'm surprised we don't see it more often! @varungandhi-src, I propose we merge this even before I've reproduced the error locally (given I still haven't after hours of trying, but will continue trying). It does no harm, and acts as an experiment in CI. It's a classic 2-way door decision. If the flakes stop in https://buildkite.com/organizations/sourcegraph/analytics/suites/sourcegraph-bazel/tests/flaky, great. If not, we know to investigate. Would you mind rubber-stamping this since no-one from my team is online? |
We observed a rare test flake in TestNonNamespacedResourcesRemainWhenDisabled in which non-namespaced resources were apparently not deleted when assertions were made against golden fixtures. My theory for why this was happening, is that there was a race between the test goroutine and the reconciler running in the background. The reconciler actually appears to emit 2 events per configmap creation. The second is probably a result of updating the currentVersion annotation at the end of the reconcile loop, which itself triggers another reconcile loop. Our test code was waiting for the initial resources to be created by waiting for at least one reconcile event to appear. It was then waiting for only one more event. When we got unlucky, this event would be the 2nd event from the 1st configmap upload, rather than the _next_ event, from the 2nd configmap upload. If the test thread scraped all resources from kube-apiserver before the reconciler had deleted them (in response to the 2nd configmap upload in this particular test), the golden assertions would fail. This could all be totally wrong. We only received one report of such a flake, and I was unable to reproduce it in hundreds of local runs, so sadly I haven't empirically validated this.
6d468a5 to
a40746f
Compare
varungandhi-src
left a comment
There was a problem hiding this comment.
Stamping without review (since I don't have a proper understanding of this system) to see if this prevents spurious failures in CI.
We observed a rare test flake in
TestNonNamespacedResourcesRemainWhenDisabled in which non-namespaced resources were apparently not deleted when assertions were made against golden fixtures. My theory for why this was happening, is that there was a race between the test goroutine and the reconciler running in the background.
The reconciler actually appears to emit 2 events per configmap creation. The second is probably a result of updating the currentVersion annotation at the end of the reconcile loop, which itself triggers another reconcile loop.
Our test code was waiting for the initial resources to be created by waiting for at least one reconcile event to appear. It was then waiting for only one more event. When we got unlucky, this event would be the 2nd event from the 1st configmap upload, rather than the next event, from the 2nd configmap upload. If the test thread scraped all resources from kube-apiserver before the reconciler had deleted them (in response to the 2nd configmap upload in this particular test), the golden assertions would fail.
This could all be totally wrong. We only received one report of such a flake, and I was unable to reproduce it in hundreds of local runs, so sadly I haven't empirically validated this.
Closes https://linear.app/sourcegraph/issue/REL-121/test-failure-related-to-appliance-in-pr-adding-dependency-and
Test plan
This is a fix to tests.
Changelog