Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

chore(appliance): fix race condition in tests#62999

Merged
craigfurman merged 2 commits intomainfrom
appliance-fix-test-flakes
Jun 3, 2024
Merged

chore(appliance): fix race condition in tests#62999
craigfurman merged 2 commits intomainfrom
appliance-fix-test-flakes

Conversation

@craigfurman
Copy link
Contributor

@craigfurman craigfurman commented May 31, 2024

We observed a rare test flake in
TestNonNamespacedResourcesRemainWhenDisabled in which non-namespaced resources were apparently not deleted when assertions were made against golden fixtures. My theory for why this was happening, is that there was a race between the test goroutine and the reconciler running in the background.

The reconciler actually appears to emit 2 events per configmap creation. The second is probably a result of updating the currentVersion annotation at the end of the reconcile loop, which itself triggers another reconcile loop.

Our test code was waiting for the initial resources to be created by waiting for at least one reconcile event to appear. It was then waiting for only one more event. When we got unlucky, this event would be the 2nd event from the 1st configmap upload, rather than the next event, from the 2nd configmap upload. If the test thread scraped all resources from kube-apiserver before the reconciler had deleted them (in response to the 2nd configmap upload in this particular test), the golden assertions would fail.

This could all be totally wrong. We only received one report of such a flake, and I was unable to reproduce it in hundreds of local runs, so sadly I haven't empirically validated this.


Closes https://linear.app/sourcegraph/issue/REL-121/test-failure-related-to-appliance-in-pr-adding-dependency-and

Test plan

This is a fix to tests.

Changelog

  • Fix a race condition in one package's tests.

@craigfurman craigfurman requested review from a team and Chickensoupwithrice and removed request for a team May 31, 2024 13:14
@cla-bot cla-bot bot added the cla-signed label May 31, 2024
@varungandhi-src
Copy link
Contributor

This could all be totally wrong. We only received one report of such a flake, and I was unable to reproduce it in hundreds of local runs, so sadly I haven't empirically validated this.

In case you haven't tried it, might be useful to add copious logging and leave the test running for a few hours or longer.

@craigfurman
Copy link
Contributor Author

That's a good idea @varungandhi-src, will give that a shot next week. Will hold off merging this until then.

@craigfurman craigfurman marked this pull request as draft May 31, 2024 17:39
@craigfurman
Copy link
Contributor Author

craigfurman commented Jun 3, 2024

Since Friday, we had 2 more flake reports. 1 was another instance of TestNonNamespacedResourcesRemainWhenDisabled, and the other was this:

--- FAIL: TestApplianceTestSuite (25.15s)
    --- FAIL: TestApplianceTestSuite/TestResourcesDeletedWhenDisabled (0.78s)
        golden_test.go:53: 
            	Error Trace:	internal/appliance/reconciler/golden_test.go:53
            	            				internal/appliance/reconciler/standard_config_test.go:49
            	Error:      	Not equal: 
            	            	expected: "resources:\n  - apiVersion: v1\n    data:\n      spec: |\n        spec:\n          requestedVersion: \"5.3.9104\"\n\n          blobstore:\n            disabled: true\n\n          codeInsights:\n            disabled: true\n\n          codeIntel:\n            disabled: true\n\n          frontend:\n            disabled: true\n\n          gitServer:\n            disabled: true\n\n          indexedSearch:\n            disabled: true\n\n          indexedSearchIndexer:\n            disabled: true\n\n          pgsql:\n            disabled: true\n\n          postgresExporter:\n            disabled: true\n\n          preciseCodeIntel:\n            disabled: true\n\n          redisCache:\n            disabled: true\n\n          redisStore:\n            disabled: true\n\n          repoUpdater:\n            disabled: true\n\n          searcher:\n            disabled: true\n\n          symbols:\n            disabled: true\n\n          syntectServer:\n            disabled: true\n\n          worker:\n            disabled: true\n\n          prometheus:\n            disabled: true\n\n          embeddings:\n            disabled: true\n    kind: ConfigMap\n    metadata:\n      annotations:\n        appliance.sourcegraph.com/currentVersion: 5.3.9104\n        appliance.sourcegraph.com/managed: \"true\"\n      creationTimestamp: \"2024-04-19T00:00:00Z\"\n      name: sg\n      namespace: NORMALIZED_FOR_TESTING\n      resourceVersion: NORMALIZED_FOR_TESTING\n      uid: NORMALIZED_FOR_TESTING\n  - apiVersion: v1\n    kind: PersistentVolumeClaim\n    metadata:\n      annotations:\n        appliance.sourcegraph.com/configHash: 2b72058f008a684f7fa052f8ad33d0226af4cfb7973242c9103d6d1900da355e\n      creationTimestamp: \"2024-04-19T00:00:00Z\"\n      deletionGracePeriodSeconds: 0\n      deletionTimestamp: \"2024-04-19T00:00:00Z\"\n      finalizers:\n        - kubernetes.io/pvc-protection\n      labels:\n        deploy: sourcegraph\n      name: blobstore\n      namespace: NORMALIZED_FOR_TESTING\n      ownerReferences:\n        - apiVersion: v1\n          blockOwnerDeletion: true\n          controller: true\n          kind: ConfigMap\n          name: sg\n          uid: NORMALIZED_FOR_TESTING\n      resourceVersion: NORMALIZED_FOR_TESTING\n      uid: NORMALIZED_FOR_TESTING\n    spec:\n      accessModes:\n        - ReadWriteOnce\n      resources:\n        requests:\n          storage: 100Gi\n      volumeMode: Filesystem\n    status:\n      phase: Pending\n"
            	            	actual  : "resources:\n  - apiVersion: v1\n    data:\n      spec: |\n        spec:\n          requestedVersion: \"5.3.9104\"\n\n          blobstore:\n            disabled: true\n\n          codeInsights:\n            disabled: true\n\n          codeIntel:\n            disabled: true\n\n          frontend:\n            disabled: true\n\n          gitServer:\n            disabled: true\n\n          indexedSearch:\n            disabled: true\n\n          indexedSearchIndexer:\n            disabled: true\n\n          pgsql:\n            disabled: true\n\n          postgresExporter:\n            disabled: true\n\n          preciseCodeIntel:\n            disabled: true\n\n          redisCache:\n            disabled: true\n\n          redisStore:\n            disabled: true\n\n          repoUpdater:\n            disabled: true\n\n          searcher:\n            disabled: true\n\n          symbols:\n            disabled: true\n\n          syntectServer:\n            disabled: true\n\n          worker:\n            disabled: true\n\n          prometheus:\n            disabled: true\n\n          embeddings:\n            disabled: true\n    kind: ConfigMap\n    metadata:\n      annotations:\n        appliance.sourcegraph.com/managed: \"true\"\n      creationTimestamp: \"2024-04-19T00:00:00Z\"\n      name: sg\n      namespace: NORMALIZED_FOR_TESTING\n      resourceVersion: NORMALIZED_FOR_TESTING\n      uid: NORMALIZED_FOR_TESTING\n  - apiVersion: v1\n    kind: PersistentVolumeClaim\n    metadata:\n      annotations:\n        appliance.sourcegraph.com/configHash: 2b72058f008a684f7fa052f8ad33d0226af4cfb7973242c9103d6d1900da355e\n      creationTimestamp: \"2024-04-19T00:00:00Z\"\n      deletionGracePeriodSeconds: 0\n      deletionTimestamp: \"2024-04-19T00:00:00Z\"\n      finalizers:\n        - kubernetes.io/pvc-protection\n      labels:\n        deploy: sourcegraph\n      name: blobstore\n      namespace: NORMALIZED_FOR_TESTING\n      ownerReferences:\n        - apiVersion: v1\n          blockOwnerDeletion: true\n          controller: true\n          kind: ConfigMap\n          name: sg\n          uid: NORMALIZED_FOR_TESTING\n      resourceVersion: NORMALIZED_FOR_TESTING\n      uid: NORMALIZED_FOR_TESTING\n    spec:\n      accessModes:\n        - ReadWriteOnce\n      resources:\n        requests:\n          storage: 100Gi\n      volumeMode: Filesystem\n    status:\n      phase: Pending\n"
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -66,3 +66,2 @@
            	            	       annotations:
            	            	-        appliance.sourcegraph.com/currentVersion: 5.3.9104
            	            	         appliance.sourcegraph.com/managed: "true"
            	Test:       	TestApplianceTestSuite/TestResourcesDeletedWhenDisabled
FAIL

The theory in the PR description is consistent with this failure too. If we scrape envtest resources after awaiting 1 reconciliation event, but before that same reconciliation loop has updated the currentVersion annotation, we'd expect to see this. I'm surprised we don't see it more often!

@varungandhi-src, I propose we merge this even before I've reproduced the error locally (given I still haven't after hours of trying, but will continue trying). It does no harm, and acts as an experiment in CI. It's a classic 2-way door decision.

If the flakes stop in https://buildkite.com/organizations/sourcegraph/analytics/suites/sourcegraph-bazel/tests/flaky, great. If not, we know to investigate. Would you mind rubber-stamping this since no-one from my team is online?

@craigfurman craigfurman marked this pull request as ready for review June 3, 2024 08:13
We observed a rare test flake in
TestNonNamespacedResourcesRemainWhenDisabled in which non-namespaced
resources were apparently not deleted when assertions were made against
golden fixtures. My theory for why this was happening, is that there was
a race between the test goroutine and the reconciler running in the
background.

The reconciler actually appears to emit 2 events per configmap creation.
The second is probably a result of updating the currentVersion
annotation at the end of the reconcile loop, which itself triggers
another reconcile loop.

Our test code was waiting for the initial resources to be created by
waiting for at least one reconcile event to appear. It was then waiting
for only one more event. When we got unlucky, this event would be the
2nd event from the 1st configmap upload, rather than the _next_ event,
from the 2nd configmap upload. If the test thread scraped all resources
from kube-apiserver before the reconciler had deleted them (in response
to the 2nd configmap upload in this particular test), the golden
assertions would fail.

This could all be totally wrong. We only received one report of such a
flake, and I was unable to reproduce it in hundreds of local runs, so
sadly I haven't empirically validated this.
@craigfurman craigfurman force-pushed the appliance-fix-test-flakes branch from 6d468a5 to a40746f Compare June 3, 2024 08:26
Copy link
Contributor

@varungandhi-src varungandhi-src left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stamping without review (since I don't have a proper understanding of this system) to see if this prevents spurious failures in CI.

@craigfurman craigfurman merged commit 9bf2b99 into main Jun 3, 2024
@craigfurman craigfurman deleted the appliance-fix-test-flakes branch June 3, 2024 09:01
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants