Two closely related issues here. I started using plugin :solid_queue in my service so that puma manages the Solid Queue worker process, and it has broken phased restarts, causing an outage on every deployment because my site is served from a single VPS. This seems like a design issue that runs contrary to the project's goal of enabling straightforward single-server use.
We do phased restarts whenever possible (deploys that don't touch the bundle) so that puma replaces workers one at a time and we get a zero-downtime deploy.
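For context, the setup is roughly the following shape (the worker count is an illustrative assumption, not our exact config):

```ruby
# config/puma.rb -- illustrative sketch, not our exact config
workers 2            # cluster mode, required for phased restarts
# No preload_app! here: puma's phased restarts are incompatible with
# preloading, which is also why the plugin's preload assumption
# (the first issue below) bites us.

plugin :solid_queue  # let puma supervise the Solid Queue worker process
```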
The first issue is that the plugin crashes on startup because it assumes the Rails app is preloaded. I found a workaround (our commit) that seems to be working, but I suspect it breaks phased restarts in a way that's currently masked by the second issue.
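For illustration only: the failure mode is consistent with the plugin forking the Solid Queue supervisor from the puma master, which has no Rails app loaded when preload_app! is off. The sketch below shows the shape of fix I mean; it is my paraphrase of the puma plugin pattern, not our actual commit, and the require path assumes puma's working directory is the Rails root (as in our logs below).

```ruby
# Sketch of the workaround idea -- not our actual commit.
require "puma/plugin"

Puma::Plugin.create do
  def start(launcher)
    launcher.events.on_booted do
      @solid_queue_pid = fork do
        # Without preload_app!, the puma master never loads the Rails app,
        # so the forked child has to boot it before SolidQueue exists.
        # Assumes puma's working directory is the Rails root.
        require File.expand_path("config/environment", Dir.pwd)
        SolidQueue::Supervisor.start
      end
    end

    launcher.events.on_stopped do
      Process.kill(:TERM, @solid_queue_pid) if @solid_queue_pid
    end
  end
end
```

That keeps the supervisor as a child of the puma master, which is presumably what the plugin intends; whether that survives a phased restart cleanly is exactly what the second issue calls into question.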
The second issue is that the plugin deliberately crashes puma during phased restarts. In the logs, I see:
May 15 15:31:53 lobste.rs puma[282083]: [282083] - Starting phased worker restart, phase: 1
May 15 15:31:53 lobste.rs puma[282083]: [282083] + Changing to /srv/lobste.rs/http
May 15 15:31:53 lobste.rs puma[282083]: [282083] - Stopping 282297 for phased upgrade...
May 15 15:31:53 lobste.rs puma[282083]: [282083] - TERM sent to 282297...
May 15 15:31:53 lobste.rs puma[282083]: [282083] - Stopping 282297 for phased upgrade...
May 15 15:31:53 lobste.rs puma[282083]: [282083] - Stopping 282297 for phased upgrade...
May 15 15:31:53 lobste.rs puma[282083]: [282083] - Stopping 282297 for phased upgrade...
May 15 15:31:53 lobste.rs puma[282083]: [282083] - Stopping 282297 for phased upgrade...
May 15 15:31:55 lobste.rs puma[282083]: [282083] Detected Solid Queue has gone away, stopping Puma...
May 15 15:31:55 lobste.rs puma[282083]: [282083] - Gracefully shutting down workers...
May 15 15:32:24 lobste.rs puma[282083]: [282083] === puma shutdown: 2025-05-15 15:32:24 +0000 ===
The Solid Queue plugin doesn't seem to understand phased restarts and inappropriately halts the entire puma supervisor process.
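Judging by that log line, the plugin runs a watchdog in the puma master that treats the Solid Queue process disappearing as fatal. A paraphrase of the pattern (not the plugin's exact source; the names and interval here are mine):

```ruby
# Paraphrase of the watchdog pattern behind the log line above;
# names and interval are assumptions, not the plugin's exact source.
def monitor_solid_queue(solid_queue_pid, puma_pid, interval: 2)
  loop do
    begin
      Process.kill(0, solid_queue_pid)  # signal 0: "is this PID still alive?"
      sleep interval
    rescue Errno::ESRCH
      puts "Detected Solid Queue has gone away, stopping Puma..."
      # Signals the puma master itself, shutting down every web worker
      # rather than just restarting the queue process.
      Process.kill(:INT, puma_pid)
      break
    end
  end
end
```

The damaging part during a phased restart is that signal to the master: a problem with one background process escalates into "Gracefully shutting down workers..." for the whole site.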
Then a few seconds later systemd notices that the puma service has crashed and cold-starts it. It takes 30+ seconds for a single worker to boot, by which point the nginx queue has filled up, so we serve a lot of 502s and limp back into normal service while the fresh workers get hammered.