
puma plugin doesn't start properly in phased restarts and also crashes them #563

@pushcx

Description

Two closely related issues here. I started using plugin :solid_queue in my service to have puma manage the Solid Queue worker process, and it has broken phased restarts, causing an outage on every deployment because I have a single VPS serving my site. This seems like a design issue contrary to the project's goal of enabling straightforward single-server use.

We do a phased restart whenever possible (deploys that don't touch the bundle) so that puma replaces workers one at a time and we get a zero-downtime deploy.
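For context, here is a minimal sketch of the kind of config/puma.rb involved (the worker count is illustrative, not our actual file). The key constraint is that phased restarts only work when the app is not preloaded, so there is no preload_app! and each worker boots the app itself:

# config/puma.rb (illustrative sketch, not our actual config)
workers 4
# preload_app! is deliberately absent: preloading disables phased restarts.
plugin :solid_queue

A phased restart is then typically triggered by sending SIGUSR1 to the puma master, which replaces workers one at a time rather than restarting the whole process.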

The first issue is that the plugin crashes on start because it assumes the Rails app is preloaded. I found a workaround (our commit) that might be working, but I suspect it breaks phased restarts in a way that's currently masked by the second issue.
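For the record, here is a hypothetical sketch of the shape of that workaround (not our actual commit): load the Rails environment in the puma master before the plugin starts the supervisor, without reaching for preload_app!:

# config/puma.rb (hypothetical workaround sketch, not our actual commit)
# Boot the Rails app in the master so the plugin's preload assumption holds;
# assumes puma is started from the application root.
require File.expand_path("config/environment", Dir.pwd)
plugin :solid_queue

The catch, and possibly the masked breakage I suspect above, is that booting the app in the master pins old code there: after a phased restart the workers run new code while the master, and anything it supervises, still holds the old app.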

The second issue is that the plugin deliberately crashes puma during phased restarts. In the logs, I see:

May 15 15:31:53 lobste.rs puma[282083]: [282083] - Starting phased worker restart, phase: 1
May 15 15:31:53 lobste.rs puma[282083]: [282083] + Changing to /srv/lobste.rs/http
May 15 15:31:53 lobste.rs puma[282083]: [282083] - Stopping 282297 for phased upgrade...
May 15 15:31:53 lobste.rs puma[282083]: [282083] - TERM sent to 282297...
May 15 15:31:53 lobste.rs puma[282083]: [282083] - Stopping 282297 for phased upgrade...
May 15 15:31:53 lobste.rs puma[282083]: [282083] - Stopping 282297 for phased upgrade...
May 15 15:31:53 lobste.rs puma[282083]: [282083] - Stopping 282297 for phased upgrade...
May 15 15:31:53 lobste.rs puma[282083]: [282083] - Stopping 282297 for phased upgrade...
May 15 15:31:55 lobste.rs puma[282083]: [282083] Detected Solid Queue has gone away, stopping Puma...
May 15 15:31:55 lobste.rs puma[282083]: [282083] - Gracefully shutting down workers...
May 15 15:32:24 lobste.rs puma[282083]: [282083] === puma shutdown: 2025-05-15 15:32:24 +0000 ===

Solid Queue doesn't seem to understand phased restarts and is inappropriately halting the puma supervisor.
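Presumably something like the following is going on; this is a paraphrased, self-contained sketch of the failure mode, not the plugin's actual source:

# Sketch: a watcher that treats any exit of the supervised process as fatal.
child = fork { sleep } # stand-in for the Solid Queue supervisor process

monitor = Thread.new do
  Process.wait(child) # returns whenever the child exits, for any reason...
  # ...including the TERM a phased restart delivers, at which point the
  # plugin logs and stops the whole server:
  puts "Detected Solid Queue has gone away, stopping Puma..."
end

Process.kill(:TERM, child) # what the phased restart effectively does
monitor.join

A restart-aware version would have to distinguish a supervised process that died unexpectedly from one being cycled on purpose by the restart machinery.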

Then a few seconds later systemd notices that the puma service has crashed and cold-starts it. It takes 30+ seconds for the first worker to come up, by which point the nginx queue has filled, so we throw a lot of 502s and limp back into normal service while the workers get hammered.
