This repository was archived by the owner on Sep 30, 2024. It is now read-only.

feat/sg: handle interrupts more gracefully #62772

Merged
jhchabran merged 4 commits into main from jsm/sg-sigint
May 23, 2024

Conversation

@jamesmcnamara
Contributor

Currently, sending a SIGINT to sg causes it to exit immediately without running cleanup. This stems from two issues: we weren't actually waiting for processes to exit after interrupting them, and we weren't waiting for our interrupt handlers to run. As a bonus, we also add support for SIGHUP, which is delivered when you close a terminal.
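The overall shape of the fix can be sketched like this (names are illustrative, not sg's actual API): catch SIGINT and SIGHUP, then block until every shutdown hook has finished before exiting.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// runHooks runs every registered shutdown hook and blocks until all of
// them finish -- the key behavior this PR adds: sg must not exit before
// its interrupt handlers have run.
func runHooks(hooks []func()) {
	var wg sync.WaitGroup
	for _, h := range hooks {
		wg.Add(1)
		go func(h func()) {
			defer wg.Done()
			h()
		}(h)
	}
	wg.Wait()
}

func main() {
	hooks := []func(){
		func() { fmt.Println("hook: stopping child processes") },
		func() { fmt.Println("hook: flushing logs") },
	}

	sigs := make(chan os.Signal, 1)
	// SIGHUP arrives when the controlling terminal closes; the PR
	// handles it alongside SIGINT (Ctrl-C).
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGHUP)

	// Deliver SIGINT to ourselves so this demo runs non-interactively;
	// interactively you would just press Ctrl-C.
	go syscall.Kill(os.Getpid(), syscall.SIGINT)

	sig := <-sigs
	fmt.Printf("⚠️ %v received, executing %d hooks for graceful shutdown...\n", sig, len(hooks))
	runHooks(hooks)
}
```

The important detail is the `wg.Wait()`: without it (the old behavior), the process returns from the signal handler and exits while hooks are still mid-flight.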

Test plan

Created a script which waits for 10 seconds on SIGINT and verified that sg actually waited for it to clean up and logged its output.
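The test script might look something like this (a reconstruction for illustration, not the actual script from the PR): trap SIGINT, spend 10 seconds "cleaning up", and print output so it's visible whether sg waited.

```shell
#!/usr/bin/env bash
# Hypothetical slow-cleanup script: if sg handles interrupts correctly,
# both echo lines below show up in sg's output after Ctrl-C.
cleanup() {
  echo "test-script: SIGINT received, cleaning up for 10s..."
  sleep 10
  echo "test-script: cleanup done"
  exit 0
}
trap cleanup INT

echo "test-script: running (pid $$)"
while true; do sleep 1; done
```

With the old behavior, sg would exit as soon as Ctrl-C was pressed and the "cleanup done" line would never appear.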

@eseliger
Member

Thank you! This is great

[    gitserver-0] INFO gitserver.cleanup internal/cleanup.go:683 Janitor run finished {"duration": "573.5ms"}
^C⚠️ Interrupt received, executing 20 hooks for graceful shutdown...
[    gitserver-0] INFO gitserver shared/shared.go:232 git-server: shutting down
[    gitserver-0] INFO gitserver shared/shared.go:238 git-server: stopped routines
gitserver-0 exited without error
symbols exited without error
blobstore exited without error
frontend exited without error
[   repo-updater] INFO repo-updater.repo-updater.grpcserver grpcserver/grpcserver.go:71 Shutting down gRPC server
[   repo-updater] INFO sync_worker workerutil/worker.go:252 Shutting down dequeue loop {"name": "repo_sync_worker", "reason": ""}
repo-updater exited without error
searcher exited without error

See the example output above; this works very well for testing gitserver shutdown behavior, for example.

It seems that some of our services don't shut down cleanly, which is not an issue with this PR, but it's great that we now get visibility into this.

[         worker] INFO worker.exhaustive-search-job.routines workerutil/worker.go:252 Shutting down dequeue loop {"name": "exhaustive_search_repo_revision_worker", "reason": ""}
[         worker] INFO worker.repo-perms-syncer workerutil/worker.go:252 Shutting down dequeue loop {"name": "repo_permissions_sync_job_worker", "reason": ""}
[         worker] INFO worker.outbound-webhook-sender.sender workerutil/worker.go:252 Shutting down dequeue loop {"name": "outbound_webhook_job_worker", "reason": ""}
[         worker] INFO worker.batches-reconciler.routines workerutil/worker.go:252 Shutting down dequeue loop {"name": "batches_reconciler_worker", "reason": ""}
[         worker] INFO worker.exhaustive-search-job.routines workerutil/worker.go:252 Shutting down dequeue loop {"name": "exhaustive_search_worker", "reason": ""}
[         worker] INFO worker.exhaustive-search-job.routines workerutil/worker.go:252 Shutting down dequeue loop {"name": "exhaustive_search_repo_worker", "reason": ""}
[         worker] INFO worker.batches-workspace-resolver.routines workerutil/worker.go:252 Shutting down dequeue loop {"name": "batch_changes_batch_spec_resolution_worker", "reason": ""}
[         worker] INFO worker.batches-bulk-processor.routines workerutil/worker.go:252 Shutting down dequeue loop {"name": "batches_bulk_processor", "reason": ""}
[         worker] INFO worker.repo-embedding-job workerutil/worker.go:252 Shutting down dequeue loop {"name": "repo_embedding_job_worker", "reason": ""}
[         worker] INFO worker.own-repo-indexing-queue workerutil/worker.go:252 Shutting down dequeue loop {"name": "own_background_worker", "reason": ""}
[         worker] INFO codeintel.autoindexing.dependencies workerutil/worker.go:252 Shutting down dequeue loop {"name": "precise_code_intel_dependency_indexing_scheduler_worker", "reason": ""}
[         worker] INFO worker.codemonitors-job.BackgroundJobs.actions workerutil/worker.go:252 Shutting down dequeue loop {"name": "code_monitors_action_jobs_worker", "reason": ""}
[         worker] INFO worker.repo-perms-syncer workerutil/worker.go:252 Shutting down dequeue loop {"name": "user_permissions_sync_job_worker", "reason": ""}
[         worker] INFO worker.bitbucket-project-permissions.routines workerutil/worker.go:252 Shutting down dequeue loop {"name": "explicit_permissions_bitbucket_projects_jobs_worker", "reason": ""}
[         worker] INFO worker.codemonitors-job.BackgroundJobs.triggers workerutil/worker.go:252 Shutting down dequeue loop {"name": "code_monitors_trigger_jobs_worker", "reason": ""}
[         worker] INFO codeintel.autoindexing.dependencies workerutil/worker.go:252 Shutting down dequeue loop {"name": "precise_code_intel_dependency_sync_scheduler_worker", "reason": ""}
[            web] Error: https://sourcegraph.test:3443 produced 502 Bad Gateway
[            web]     at /Users/erik/Code/sourcegraph/sourcegraph/client/web/dev/esbuild/server.ts:104:27
[            web]     at processTicksAndRejections (node:internal/process/task_queues:95:5)
[  zoekt-index-0] 2024/05/17 22:15:20 error listing repos: listRepoIDs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [::1]:3090: connect: connection refused"
[         worker] ERROR conf.client conf/client.go:304 received error during background config update {"triggered_by": "waitForSleep", "error": "unable to fetch new configuration: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [::1]:3090: connect: connection refused\""}
[            web] Error: https://sourcegraph.test:3443 produced 502 Bad Gateway
[            web]     at /Users/erik/Code/sourcegraph/sourcegraph/client/web/dev/esbuild/server.ts:104:27
[            web]     at processTicksAndRejections (node:internal/process/task_queues:95:5)
[            web] Error: https://sourcegraph.test:3443 produced 502 Bad Gateway
[            web]     at /Users/erik/Code/sourcegraph/sourcegraph/client/web/dev/esbuild/server.ts:104:27
[            web]     at processTicksAndRejections (node:internal/process/task_queues:95:5)
[            web] Error: https://sourcegraph.test:3443 produced 502 Bad Gateway
[            web]     at /Users/erik/Code/sourcegraph/sourcegraph/client/web/dev/esbuild/server.ts:104:27
[            web]     at processTicksAndRejections (node:internal/process/task_queues:95:5)
[         worker] ERROR conf.client conf/client.go:304 received error during background config update {"triggered_by": "waitForSleep", "error": "unable to fetch new configuration: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [::1]:3090: connect: connection refused\""}
[            web] Error: https://sourcegraph.test:3443 produced 502 Bad Gateway
[            web]     at /Users/erik/Code/sourcegraph/sourcegraph/client/web/dev/esbuild/server.ts:104:27
[            web]     at processTicksAndRejections (node:internal/process/task_queues:95:5)
[            web]     at processTicksAndRejections (node:internal/process/task_queues:95:5)
[            web]     at /Users/erik/Code/sourcegraph/sourcegraph/client/web/dev/esbuild/server.ts:104:27
[            web] Error: https://sourcegraph.test:3443 produced 502 Bad Gateway
[         worker] ERROR luasandbox.luasandbox.Call luasandbox/sandbox.go:129 operation.error {"TraceId": "cbfe0815bb5550b307c4ccf534146deb", "SpanId": "0c10101366b5e744", "count": 1, "elapsed": 0.000130625, "error": "<string>:9: context canceled\nstack traceback:\n\t<string>:9: in main chunk\n\t[G]: ?"}
[         worker] ERROR inference.service.codeintel.autoindexing.inference.invokeLinearizedRecognizer inference/service.go:502 operation.error {"TraceId": "cbfe0815bb5550b307c4ccf534146deb", "SpanId": "acb571bf21d196c6", "count": 1, "elapsed": 0.00667325, "error": "<string>:9: context canceled\nstack traceback:\n\t<string>:9: in main chunk\n\t[G]: ?"}
[         worker] ERROR inference.service.codeintel.autoindexing.inference.invokeRecognizers inference/service.go:252 operation.error {"TraceId": "cbfe0815bb5550b307c4ccf534146deb", "SpanId": "53cd8a3fe8f9abf4", "count": 1, "elapsed": 111.883686625, "error": "<string>:9: context canceled\nstack traceback:\n\t<string>:9: in main chunk\n\t[G]: ?"}
[         worker] ERROR inference.service.codeintel.autoindexing.inference.invokeRecognizers inference/service.go:252 operation.error {"TraceId": "cbfe0815bb5550b307c4ccf534146deb", "SpanId": "ade5382778a4a3e4", "count": 1, "elapsed": 126.615487167, "error": "<string>:9: context canceled\nstack traceback:\n\t<string>:9: in main chunk\n\t[G]: ?"}
[         worker] ERROR inference.service.codeintel.autoindexing.inference.InferIndexJobs inference/service.go:98 operation.error {"TraceId": "cbfe0815bb5550b307c4ccf534146deb", "SpanId": "3abfbc6f1130fb15", "repo": "github.com/JetBrains/intellij-community", "commit": "HEAD", "count": 1, "elapsed": 126.623475916, "error": "<string>:9: context canceled\nstack traceback:\n\t<string>:9: in main chunk\n\t[G]: ?"}
[         worker] ERROR luasandbox.luasandbox.RunGoCallback luasandbox/sandbox.go:200 operation.error {"TraceId": "cbfe0815bb5550b307c4ccf534146deb", "SpanId": "5936343b1619f3eb", "count": 1, "elapsed": 0.000038084, "error": "<string>:9: context canceled\nstack traceback:\n\t<string>:9: in main chunk\n\t[G]: ?"}
[         worker] ERROR periodic.codeintel.autoindexing-summary-builder goroutine/periodic.go:364 operation.error {"count": 1, "elapsed": 131.518689041, "error": "<string>:9: context canceled\nstack traceback:\n\t<string>:9: in main chunk\n\t[G]: ?"}
worker exited without error
zoekt-web-0 exited without error
❌ failed to run zoekt-index-0.
stderr:
INFO server zoekt-sourcegraph-indexserver/main.go:1017 removing tmp dir {"tmpRoot": "/Users/erik/.sourcegraph/zoekt/index-0/.indexserver.tmp"}
2024/05/17 22:13:51 error listing repos: listRepoIDs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [::1]:3090: connect: connection refused"
2024/05/17 22:15:20 error listing repos: listRepoIDs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [::1]:3090: connect: connection refused"

Contributor

@jhchabran jhchabran left a comment


Got a panic while doing a test run:

[         worker] INFO worker.own-repo-indexing-queue.own.background.index.scheduler.analytics background/scheduler.go:108 skipping own indexing job, job disabled {"job-name": "analytics"}
[         worker] INFO worker.insights-job.background-insight-enqueuer background/insight_enqueuer.go:94 enqueuing indexed insight snapshots
[         worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:131 events exported {"TraceId": "ea3b2341882d73159b39d0419a9628e5", "SpanId": "2405d65938edf604", "maxBatchSize": 10000, "succeeded": 2}
[   repo-updater] INFO repo-updater.syncer repos/syncer.go:450 syncing external service {"externalServiceID": 1}
[   repo-updater] INFO repoPurgeWorker purge/purge.go:144 repository purge finished {"deletedBefore": 1716388261010998000, "total": 0, "removed": 0, "failed": 0, "duration": "158.906459ms"}
^C⚠️ Interrupt received, executing 22 hooks for graceful shutdown...
[        symbols] Bazel caught interrupt signal; cancelling pending invocation.
[        symbols] ERROR: command interrupted while computing main repo mapping
[        symbols] Error: bazel exited with exit code: 8
panic: panic: failed kill process group ID 43092 for cmd repo-updater : operation not permitted
stacktrace:
goroutine 249 [running]:
runtime/debug.Stack()
        /Users/tech/.asdf/installs/golang/1.22.1/go/src/runtime/debug/stack.go:24 +0x64
github.com/sourcegraph/conc/panics.NewRecovered(0x1, {0x106cdf3c0, 0x140007e8c90})
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/panics/panics.go:59 +0x74
github.com/sourcegraph/conc/panics.(*Catcher).tryRecover(0x140026ac110)
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/panics/panics.go:28 +0x58
panic({0x106cdf3c0?, 0x140007e8c90?})
        /Users/tech/.asdf/installs/golang/1.22.1/go/src/runtime/panic.go:770 +0x124
github.com/sourcegraph/conc/pool.(*ContextPool).Go.func1.1()
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/pool/context_pool.go:33 +0x60
panic({0x106cdf3c0?, 0x140007e8c90?})
        /Users/tech/.asdf/installs/golang/1.22.1/go/src/runtime/panic.go:770 +0x124
github.com/sourcegraph/sourcegraph/dev/sg/internal/run.startCmd.func1()
        /Users/tech/work/other/dev/sg/internal/run/command.go:342 +0x168
github.com/sourcegraph/sourcegraph/dev/sg/internal/run.(*cmdRunner).run.func1({0x107150ae8, 0x140009be000})
        /Users/tech/work/other/dev/sg/internal/run/run.go:85 +0x9d8
github.com/sourcegraph/conc/pool.(*ContextPool).Go.func1()
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/pool/context_pool.go:38 +0x70
github.com/sourcegraph/conc/pool.(*ContextPool).Go.(*ErrorPool).Go.func2()
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/pool/error_pool.go:30 +0x30
github.com/sourcegraph/conc/pool.(*Pool).worker(0x140026ac100, 0x4d4f430a0a3b272e?)
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/pool/pool.go:156 +0x5c
github.com/sourcegraph/conc/pool.(*Pool).Go.func1()
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/pool/pool.go:51 +0x24
github.com/sourcegraph/conc/panics.(*Catcher).Try(0x0?, 0x100cc69d8?)
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/panics/panics.go:23 +0x50
github.com/sourcegraph/conc.(*WaitGroup).Go.func1()
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/waitgroup.go:32 +0x58
created by github.com/sourcegraph/conc.(*WaitGroup).Go in goroutine 132
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/waitgroup.go:30 +0x7c



goroutine 132 [running]:
github.com/sourcegraph/conc/panics.(*Catcher).Repanic(...)
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/panics/panics.go:38
github.com/sourcegraph/conc.(*WaitGroup).Wait(0x140026ac100)
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/waitgroup.go:42 +0x5c
github.com/sourcegraph/conc/pool.(*Pool).Wait(0x140026ac100)
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/pool/pool.go:81 +0x60
github.com/sourcegraph/conc/pool.(*ErrorPool).Wait(0x140026ac100)
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/pool/error_pool.go:37 +0x24
github.com/sourcegraph/conc/pool.(*ContextPool).Wait(0x140026ac100?)
        /Users/tech/work/other/vendor/github.com/sourcegraph/conc/pool/context_pool.go:57 +0x38
github.com/sourcegraph/sourcegraph/dev/sg/internal/run.(*cmdRunner).run(0x14002680510, {0x107150ae8?, 0x14001c99810?})
        /Users/tech/work/other/dev/sg/internal/run/run.go:133 +0x130
github.com/sourcegraph/sourcegraph/dev/sg/internal/run.Commands({0x107150ae8, 0x14001c99810}, 0x1400204adb0, 0x0, {0x1400247a100, 0x10, 0x10})
        /Users/tech/work/other/dev/sg/internal/run/run.go:54 +0x2cc
main.(*Commands).start(0x14001c996d0, {0x107150ae8, 0x14001c99810})
        /Users/tech/work/other/dev/sg/sg_start.go:420 +0x298
main.start.func1()
        /Users/tech/work/other/dev/sg/sg_start.go:311 +0x64
created by main.start in goroutine 1
        /Users/tech/work/other/dev/sg/sg_start.go:307 +0x390
exit status 2

@jamesmcnamara
Contributor Author

@jhchabran did you do anything in particular here? I haven't been able to reproduce it.

@jhchabran
Contributor

> @jhchabran did you do anything particular here? I haven't been able to reproduce

Nothing, it was just sg start, waiting for stuff to run, then ^C and the panic. I pulled it again after you merged main back into the branch and I can't reproduce. Not sure why it happened, but that seems to have done it, so LGTM.

@jhchabran jhchabran merged commit d67885d into main May 23, 2024
@jhchabran jhchabran deleted the jsm/sg-sigint branch May 23, 2024 08:57