Prevent deadlocks by adding FOR UPDATE locks#32173
Conversation
This commit prevents deadlocks between policy_membership and host_issues tables by ensuring consistent lock ordering in updateHostIssuesFailingPolicies. The deadlock occurred when multiple transactions executed: 1. UPDATE host_issues with subquery from policy_membership 2. INSERT INTO host_issues with data from policy_membership Transaction A would lock policy_membership, wait for host_issues Transaction B would lock host_issues, wait for policy_membership Adding FOR UPDATE to policy_membership queries ensures both transactions acquire locks in the same order, preventing circular dependencies. Fixes the high-frequency deadlocks observed during load testing with 15,000 hosts.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #32173 +/- ##
==========================================
+ Coverage 62.08% 63.98% +1.89%
==========================================
Files 1985 1986 +1
Lines 194048 194108 +60
Branches 6537 6537
==========================================
+ Hits 120477 124202 +3725
+ Misses 63998 60216 -3782
- Partials 9573 9690 +117
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
… UPDATE` locks Revised locking mechanism to avoid redundant `FOR UPDATE` clauses and added explicit pre-locking of `policy_membership` rows. Prevents potential deadlocks while ensuring consistent lock ordering in `updateHostIssuesFailingPolicies`.
… code Replaced bulk-based methods with single-host-specific logic (`UpdateHostIssuesFailingPoliciesForSingleHost`) where applicable, improving performance and reducing unnecessary locking. Simplified `deleteAllPolicyMemberships` to align with single-host updates.
Introduced `UpdateHostIssuesFailingPoliciesForSingleHostFunc` mock in multiple test cases to support single-host policy update scenarios.
…stIDs` and batching updates Ensures consistent lock ordering by sorting `hostIDs` before processing and splitting large updates into batches. Prevents circular dependencies in transactions with overlapping host sets.
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
WalkthroughIntroduces batched, lock-ordered updates to policy failing-issues to prevent deadlocks during bulk automation updates; adds a single-host update path and switches relevant call sites to it; adjusts policy-membership deletion to single-host; adds row locks in ownership checks; updates enrollment paths and MDM cleanup; extends Datastore interface and mocks; updates tests. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User as API/UI
participant Svc as Service (Policies/Hosts)
participant DS as Datastore
participant MySQL as MySQL
rect rgba(230,245,255,0.6)
note over User,Svc: Bulk automation save (10+ policies)
User->>Svc: Save automations
Svc->>DS: UpdateHostIssuesFailingPolicies(hostIDs[])
DS->>DS: Sort hostIDs deterministically
loop Batched processing
DS->>MySQL: SELECT ... FROM policy_membership FOR UPDATE (batch)
DS->>MySQL: UPDATE host_issues ... (batch-safe)
DS->>MySQL: INSERT/UPDATE counts for hosts in membership
DS->>MySQL: Clear counts for hosts not in membership
end
DS-->>Svc: done
Svc-->>User: success
end
sequenceDiagram
autonumber
participant Agent as Host/Agent
participant Svc as Service
participant DS as Datastore
participant MySQL as MySQL
rect rgba(235,255,235,0.6)
note over Agent,Svc: Single-host recalculation paths
Agent->>Svc: GetHost / Report policy results
Svc->>DS: UpdateHostIssuesFailingPoliciesForSingleHost(hostID)
DS->>MySQL: Lock relevant rows (FOR UPDATE)
DS->>MySQL: Upsert/clear host_issues for hostID
DS-->>Svc: done
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Assessment against linked issues
Assessment against linked issues: Out-of-scope changes
Possibly related PRs
Suggested reviewers
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/datastore/mysql/hosts.go (1)
5751-5797: Use QueryContext for FOR UPDATE (Exec on SELECT is driver-undefined)ExecContext with a SELECT … FOR UPDATE is not guaranteed to work across drivers; use QueryxContext and close rows to establish row locks reliably. Also keep the lock inside the same tx you already have.
Apply this change after updating the function signature to sqlx.ExtContext:
- lockPolicyStmt := `SELECT 1 FROM policy_membership WHERE host_id IN (?) FOR UPDATE` + lockPolicyStmt := `SELECT 1 FROM policy_membership WHERE host_id IN (?) FOR UPDATE` // ... - stmt, args, err := sqlx.In(lockPolicyStmt, hostIDsBatch) + stmt, args, err := sqlx.In(lockPolicyStmt, hostIDsBatch) if err != nil { return ctxerr.Wrap(ctx, err, "building IN statement for locking policy_membership") } - if _, err := tx.ExecContext(ctx, stmt, args...); err != nil { - return ctxerr.Wrap(ctx, err, "locking policy_membership rows") - } + // Acquire row locks; no need to read rows, but Query is required. + rows, err := tx.(sqlx.QueryerContext).QueryxContext(ctx, stmt, args...) + if err != nil { + return ctxerr.Wrap(ctx, err, "locking policy_membership rows") + } + rows.Close()If you prefer avoiding type assertions, pass tx as sqlx.ExtContext (as suggested) and use the QueryerContext methods directly.
🧹 Nitpick comments (12)
changes/31173-fix-policy-deadlocks (1)
1-1: Good CHANGELOG entry — consider adding a bit more operator-facing context.This line is fine, but a short phrase about scope helps future readers scanning release notes (e.g., “during bulk policy automation saves on large fleets”). If you want, also reference the issue number for traceability.
Apply this diff to expand the entry:
-Fix deadlocks when updating automations for 10+ policies at one time +Fix rare deadlocks when saving policy automations affecting 10+ policies at once on large fleets (e.g., thousands of hosts) (#31173)server/service/hosts_test.go (2)
829-831: Mock for the new single-host path wired correctly.This keeps existing tests from tripping the new call site. If you want extra confidence, you could assert invocation where it matters (only if the mock exposes an Invoked flag).
If available in the mock, you can assert the call like this:
ds.UpdateHostIssuesFailingPoliciesForSingleHostFunc = func(ctx context.Context, hostID uint) error { + ds.UpdateHostIssuesFailingPoliciesForSingleHostFuncInvoked = true return nil }
1802-1804: Same here — correct wiring for the new mock.No functional concerns. Optional: add an assertion as in the previous comment if the mock exposes an “Invoked” flag.
server/service/hosts.go (3)
569-578: Make the background issues refresh best-effort to avoid user-facing failures.Switching to the per-host update reduces lock footprint — good. However, failing the entire GetHost call if the refresh hits a transient lock wait/deadlock will surface as UX errors. This refresh is an optimization bounded by the subsequent policy-based recompute below. Recommend:
- Time-bound the datastore refresh.
- Log and proceed on error (serve slightly stale counts rather than erroring the whole request).
Apply this diff:
lastUpdated, err := svc.ds.GetHostIssuesLastUpdated(ctx, id) if err != nil { return nil, ctxerr.Wrap(ctx, err, "checking host's host_issues last updated:") } - if time.Since(lastUpdated) > time.Minute { - if err := svc.ds.UpdateHostIssuesFailingPoliciesForSingleHost(ctx, id); err != nil { - return nil, ctxerr.Wrap(ctx, err, "recalculate host failing policies count:") - } - } + if time.Since(lastUpdated) > time.Minute { + // Best-effort refresh: keep the read path responsive even if the refresh contends on locks. + updateCtx, cancel := context.WithTimeout(ctx, 2*time.Second) + defer cancel() + if err := svc.ds.UpdateHostIssuesFailingPoliciesForSingleHost(updateCtx, id); err != nil { + level.Debug(svc.logger).Log( + "msg", "host issues refresh failed; continuing with stale counts", + "host_id", id, + "err", err, + ) + } + }
569-578: Gate the refresh on whether issues are needed for this call.If the caller disables issues for performance (opts.DisableIssues) or won’t read policy data, we can skip the refresh to reduce write pressure under high QPS GetHost traffic (e.g., agent/API polling).
Wrap the refresh in a guard:
- if time.Since(lastUpdated) > time.Minute { + if !opts.DisableIssues && time.Since(lastUpdated) > time.Minute { // refresh as above... }
569-578: Consider fetching the host before attempting the refresh.Today we refresh by host ID before verifying the host exists/authorizing access. If the ID is invalid or deleted, we do unnecessary work and may return a datastore error unrelated to “host not found”. Fetching (and authorizing) first avoids that edge. This is optional if lock ordering in the datastore implementation depends on the current sequence.
Possible adjustment:
- // recalculate host failing_policies_count & total_issues_count, at most every minute - lastUpdated, err := svc.ds.GetHostIssuesLastUpdated(ctx, id) + host, err := svc.ds.Host(ctx, id) if err != nil { - return nil, ctxerr.Wrap(ctx, err, "checking host's host_issues last updated:") + return nil, ctxerr.Wrap(ctx, err, "get host") } + if !opts.IncludeCriticalVulnerabilitiesCount { + host.HostIssues.CriticalVulnerabilitiesCount = nil + } + // Now do the best-effort refresh using host.ID if desired + lastUpdated, err := svc.ds.GetHostIssuesLastUpdated(ctx, host.ID) + if err != nil { + return nil, ctxerr.Wrap(ctx, err, "checking host's host_issues last updated:") + }Note: only adopt if it doesn’t undermine the lock-ordering guarantees you introduced at the datastore level.
I can help validate lock ordering across the DS layer if you paste the relevant snippets from server/datastore/mysql/{hosts,policies}.go.
server/fleet/datastore.go (2)
315-320: Docstring polish + minor naming consistency (hostID) to match surrounding style.
- Use backticks around SQL table names for consistency with nearby docs.
- Prefer
hostIDcasing (used elsewhere in this file) overhostId.- Optional: gently nudge readers toward the per-host variant when updating a single host (reduces contention/deadlocks), while keeping batch for true bulk cases.
Proposed diff:
- // UpdateHostIssuesFailingPolicies updates the failing policies count in host_issues table for the provided hosts. + // UpdateHostIssuesFailingPolicies updates the failing policies count in the `host_issues` table for the provided hosts. + // Prefer the per-host variant when updating a single host to minimize lock contention during policy membership changes. - // UpdateHostIssuesFailingPoliciesForSingleHost updates the failing policies count in host_issues table for a single host. + // UpdateHostIssuesFailingPoliciesForSingleHost updates the failing policies count in the `host_issues` table for a single host. - // Gets the last time the host's row in `host_issues` was updated - GetHostIssuesLastUpdated(ctx context.Context, hostId uint) (time.Time, error) + // Gets the last time the host's row in `host_issues` was updated + GetHostIssuesLastUpdated(ctx context.Context, hostID uint) (time.Time, error)
317-318: Optional naming alignment with existing conventions (ForHost/ByHostID).Most methods nearby use “ForHost/ByID” patterns (e.g.,
ListPoliciesForHost,HostLiteByID). Consider shortening to improve readability:- // UpdateHostIssuesFailingPoliciesForSingleHost updates the failing policies count in the `host_issues` table for a single host. - UpdateHostIssuesFailingPoliciesForSingleHost(ctx context.Context, hostID uint) error + // UpdateHostIssuesFailingPoliciesForHost updates the failing policies count in the `host_issues` table for a single host. + UpdateHostIssuesFailingPoliciesForHost(ctx context.Context, hostID uint) errorThis is optional and would ripple through call sites and mocks, so only do it if you’re still iterating on the API name in this PR.
server/service/apple_mdm_test.go (1)
899-904: Strengthen the test to assert the single-host path is used (and drop the stale bulk mock)Now that the service calls the per-host method, this test should assert that:
UpdateHostIssuesFailingPoliciesForSingleHostis invoked, and- the old bulk path is not invoked (helps catch regressions).
Optional tidy-up: the bulk mock set on Line 899–901 appears unused in this test and can be removed to reduce noise.
Suggested changes:
- Minimal guard in the per-host mock (keeps it self-validating):
-ds.UpdateHostIssuesFailingPoliciesForSingleHostFunc = func(ctx context.Context, hostID uint) error { - return nil -} +ds.UpdateHostIssuesFailingPoliciesForSingleHostFunc = func(ctx context.Context, hostID uint) error { + require.NotZero(t, hostID) + return nil +}
- If you’d like, drop the bulk-path mock in this test:
-ds.UpdateHostIssuesFailingPoliciesFunc = func(ctx context.Context, hostIDs []uint) error { - return nil -}
- Add flag resets alongside the others (right after Line 968) and verify invocation after each host fetch:
// resets with the others: ds.UpdateHostIssuesFailingPoliciesForSingleHostFuncInvoked = false ds.UpdateHostIssuesFailingPoliciesFuncInvoked = false ... // after getting gotHost (e.g., after Line 986): require.True(t, ds.UpdateHostIssuesFailingPoliciesForSingleHostFuncInvoked, "expect per-host update path to be used") require.False(t, ds.UpdateHostIssuesFailingPoliciesFuncInvoked, "bulk path should not be called in GetHost/HostByIdentifier")server/datastore/mysql/policies_test.go (1)
3293-3295: Add an idempotency check to prevent regressionsCalling the helper twice for the same host should remain a no-op. A quick follow-up call and re-asserts strengthen the contract and guard against future changes that might reintroduce side effects.
err = deleteAllPolicyMemberships(ctx, ds.writer(ctx), host.ID) require.NoError(t, err) err = ds.writer(ctx).Get(&count, "select COUNT(*) from policy_membership") require.NoError(t, err) require.Equal(t, 0, count) require.NoError(t, ds.writer(ctx).Get(&count, "select COUNT(*) from host_issues WHERE total_issues_count > 0")) assert.Zero(t, count) + + // Idempotency: calling again should be a no-op and keep counts at zero. + err = deleteAllPolicyMemberships(ctx, ds.writer(ctx), host.ID) + require.NoError(t, err) + err = ds.writer(ctx).Get(&count, "select COUNT(*) from policy_membership") + require.NoError(t, err) + require.Equal(t, 0, count) + require.NoError(t, ds.writer(ctx).Get(&count, "select COUNT(*) from host_issues WHERE total_issues_count > 0")) + assert.Zero(t, count)server/datastore/mysql/hosts.go (2)
5724-5727: Consider retry wrapper for single-host updates tooSingle-host host_issues updates can still conflict with the periodic vulnerabilities updater touching the same row. Wrapping in withRetryTxx keeps behavior consistent with the batch path.
-func (ds *Datastore) UpdateHostIssuesFailingPoliciesForSingleHost(ctx context.Context, hostID uint) error { - var tx sqlx.ExecerContext = ds.writer(ctx) - return updateHostIssuesFailingPoliciesForSingleHost(ctx, tx, hostID) -} +func (ds *Datastore) UpdateHostIssuesFailingPoliciesForSingleHost(ctx context.Context, hostID uint) error { + return ds.withRetryTxx(ctx, func(tx sqlx.ExtContext) error { + return updateHostIssuesFailingPoliciesForSingleHost(ctx, tx, hostID) + }) +}If you prefer to keep it non-transactional for latency, consider at least retry-on-deadlock error handling around the Exec. Want me to wire that in?
5729-5744: Make the SUM expression explicit and skip NULL passesMinor readability and portability improvement: avoid unary “!” and make the failing count explicit.
- INSERT INTO host_issues (host_id, failing_policies_count, total_issues_count) - SELECT host_id.id, COALESCE(SUM(!pm.passes), 0), COALESCE(SUM(!pm.passes), 0) + INSERT INTO host_issues (host_id, failing_policies_count, total_issues_count) + SELECT host_id.id, + COALESCE(SUM(CASE WHEN pm.passes = 0 THEN 1 ELSE 0 END), 0), + COALESCE(SUM(CASE WHEN pm.passes = 0 THEN 1 ELSE 0 END), 0) FROM policy_membership pm RIGHT JOIN (SELECT ? as id) as host_id ON pm.host_id = host_id.id GROUP BY host_id.id ON DUPLICATE KEY UPDATE failing_policies_count = VALUES(failing_policies_count), total_issues_count = VALUES(failing_policies_count) + critical_vulnerabilities_count
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (9)
changes/31173-fix-policy-deadlocks(1 hunks)server/datastore/mysql/hosts.go(4 hunks)server/datastore/mysql/policies.go(5 hunks)server/datastore/mysql/policies_test.go(1 hunks)server/fleet/datastore.go(1 hunks)server/mock/datastore_mock.go(3 hunks)server/service/apple_mdm_test.go(1 hunks)server/service/hosts.go(1 hunks)server/service/hosts_test.go(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go
⚙️ CodeRabbit configuration file
When reviewing SQL queries that are added or modified, ensure that appropriate filtering criteria are applied—especially when a query is intended to return data for a specific entity (e.g., a single host). Check for missing WHERE clauses or incorrect filtering that could lead to incorrect or non-deterministic results (e.g., returning the first row instead of the correct one). Flag any queries that may return unintended results due to lack of precise scoping.
Files:
server/service/apple_mdm_test.goserver/fleet/datastore.goserver/service/hosts.goserver/mock/datastore_mock.goserver/service/hosts_test.goserver/datastore/mysql/hosts.goserver/datastore/mysql/policies_test.goserver/datastore/mysql/policies.go
🔇 Additional comments (12)
server/service/hosts.go (1)
569-578: All batch-API call sites migrated to intended contextsI’ve verified that no callers in the
server/servicelayer still invoke the old
UpdateHostIssuesFailingPolicies([]uint)method in single-host scenarios:
- The only occurrences of the batch API are its declaration in
server/fleet/datastore.go, its implementation inserver/datastore/mysql/hosts.go(and associated calls inserver/datastore/mysql/policies.gofor genuine batch jobs), plus mocks and tests—none reside inserver/service.- In
server/service/hosts.go, the newUpdateHostIssuesFailingPoliciesForSingleHost(ctx, id)API is used exclusively for per-host updates.No further migration is needed here.
server/fleet/datastore.go (1)
317-318: All Datastore Implementations UpdatedI ran the provided sweep and confirmed:
- The new method is implemented in
• server/datastore/mysql/hosts.go (line 5724)
• server/mock/datastore_mock.go (line 4368)- All service call sites reference the updated API (e.g. server/service/hosts.go:575, server/datastore/mysql/policies.go:595).
- There are no additional “cached” or other wrapper datastore layers in server/datastore that require updates.
Mocks and the primary MySQL implementation both include the new
UpdateHostIssuesFailingPoliciesForSingleHostmethod. No further interface implementers were found, so this is now resolved.server/service/apple_mdm_test.go (1)
902-904: Good: mock single-host failing-policies path wired upAdding
ds.UpdateHostIssuesFailingPoliciesForSingleHostFuncunblocks the new per-host update path exercised bysvc.GetHost/HostByIdentifier. Returningnilhere is sufficient for these tests.server/datastore/mysql/policies_test.go (2)
3293-3295: Switch to single-host deleteAllPolicyMemberships signature — LGTMThe test correctly updates the call site to pass a single host ID, matching the new function signature. Post-conditions validate both policy_membership and host_issues table state, which is the right observable behavior here.
3293-3295: No further action required for deleteAllPolicyMembershipsAll call sites now use the new single-host signature, and there are no remaining slice-based invocations:
- server/datastore/mysql/policies.go:1430 – signature is
func deleteAllPolicyMemberships(ctx context.Context, tx sqlx.ExtContext, hostID uint) error- server/datastore/mysql/policies_test.go:3293 – call is
err = deleteAllPolicyMemberships(ctx, ds.writer(ctx), host.ID)Because the implementation issue a simple DELETE (no SELECT … FOR UPDATE locks), exercising it inside an explicit transaction for lock-order testing isn’t necessary.
server/mock/datastore_mock.go (2)
1783-1784: Struct fields for per-host hook are properly wired.Fields mirror existing pattern (<Func, Invoked) and are placed next to the related bulk variant for discoverability.
4368-4373: Method implementation follows established mock pattern.Locks only to flip the Invoked flag, unlocks before delegating, and defers behavior to the injected func — consistent with the rest of this auto-generated mock.
server/datastore/mysql/policies.go (2)
592-597: Good call: update host_issues even when no policy resultsRunning UpdateHostIssuesFailingPoliciesForSingleHost when len(results) == 0 prevents stale failing_policies_count for hosts configured to run no policies. This directly addresses correctness.
1430-1437: Single-host cleanup path is clearer and saferSwitching deleteAllPolicyMemberships to a single-host DELETE and immediately recomputing host_issues in the same tx is a solid improvement for both performance and lock scope.
server/datastore/mysql/hosts.go (3)
5719-5723: Wrap batch update in retry — nice alignment with deadlock mitigationAdding withRetryTxx around the batch host_issues update matches the PR objective and improves robustness under contention.
5778-5783: Deterministic lock ordering: goodSorting hostIDs before locking helps prevent deadlocks across concurrent workers that process overlapping host sets.
9-9: No action needed: repo already uses Go 1.24.6 which supports “slices”Verified that go.mod specifies Go 1.24.6, well above the 1.21 minimum for the stdlib “slices” package, and all existing calls to slices.Sort, slices.Compact, slices.Compare, etc. are fully supported. CI and builds will not break, so no fallback to sort.Slice is required.
sgress454
left a comment
There was a problem hiding this comment.
LGTM, test 👍 with updating a policy for 1 host and multiple hosts.
It seems to me like we could collapse this code down by using the single-host version of the SQL (with the right join) for both the single and multi-host cases. Then we wouldn't need the "Clear host_issues entries for hosts that are not in policy_membership" code either. Nothing wrong with a batch of 1! But I could be missing a subtlety there, and it's all outside the scope of your change anyway.
I also wonder if we need to be doing this all in one big transaction, rather than a transaction per batch which could also reduce deadlocks. If a single batch failed, we'd end up with some hosts that had incorrect data about their failed policies, but is that worse than having all the hosts with bad data? But, also out of scope.
| // Use FOR UPDATE to acquire an exclusive lock on the software_installer row early in the transaction. | ||
| // This prevents deadlocks by ensuring consistent lock ordering - all transactions that need | ||
| // to validate and update policies with the same software installer will wait in line | ||
| // rather than creating circular dependencies. | ||
| err := sqlx.GetContext(ctx, db, &softwareInstallerTeamID, "SELECT global_or_team_id FROM software_installers WHERE id = ? FOR UPDATE", softwareInstallerID) |
There was a problem hiding this comment.
This was a little confusing to me since we're passing around db and not tx, but looks like both the places we actually call assertTeamMatches do start a transaction. Using FOR UPDATE outside of a transaction apparently works fine so nbd.
Yes, I looked into this, but it was complicated since this code is used by multiple different paths which are complex by themselves. So it would be a bunch of work to split this into separate transactions. It seems like it is fine as is right now, so I decided not to overengineer it at this point. |
Fixes #31173
Reproduced and fixed in loadtest environment. Uncovered another source of deadlocks, filed as a separate: #32201
Additional frontend fix, which is needed to prevent potential timeouts: #32212
Backend + frontend fix should be a sufficient fix for this issue (ignoring the issue with the long software transaction).
Also, this PR contains some refactoring to split out the 1-host use case.
Checklist for submitter
If some of the following don't apply, delete the relevant line.
changes/,orbit/changes/oree/fleetd-chrome/changes.Testing
Summary by CodeRabbit