Skip to content

Prevent deadlocks by adding FOR UPDATE locks#32173

Merged
getvictor merged 8 commits into
mainfrom
victor-loadtest-25k-hosts
Aug 22, 2025
Merged

Prevent deadlocks by adding FOR UPDATE locks#32173
getvictor merged 8 commits into
mainfrom
victor-loadtest-25k-hosts

Conversation

@getvictor
Copy link
Copy Markdown
Member

@getvictor getvictor commented Aug 21, 2025

Fixes #31173

Reproduced and fixed in loadtest environment. Uncovered another source of deadlocks, filed as a separate: #32201

  • Also, still seeing some deadlocks (a lot fewer) in DB, and they are hidden from the API results by retries. They may still be happening because locks happen row by row and not all at once. A potential fix would be to lock the whole policy_membership table.

Additional frontend fix, which is needed to prevent potential timeouts: #32212

Backend + frontend fix should be a sufficient fix for this issue (ignoring the issue with the long software transaction).

Also, this PR contains some refactoring to split out the 1-host use case.

Checklist for submitter

If some of the following don't apply, delete the relevant line.

  • Changes file added for user-visible changes in changes/, orbit/changes/ or ee/fleetd-chrome/changes.

Testing

  • QA'd all new/changed functionality manually

Summary by CodeRabbit

  • Bug Fixes
    • Resolved rare deadlocks during concurrent policy updates and bulk automations.
    • Correctly clears stale MDM data and actions on host re-enrollment and platform changes.
  • Performance Improvements
    • Optimized policy issue recalculation with per-host updates to reduce contention.
    • Improved concurrency handling for bulk policy updates to avoid lock contention.
  • Reliability
    • More robust host enrollment: updates seen time, display name, and label membership consistently.
    • Ensures accurate policy-issue counts after membership changes and re-enrollment.

This commit prevents deadlocks between policy_membership and host_issues tables
by ensuring consistent lock ordering in updateHostIssuesFailingPolicies.

The deadlock occurred when multiple transactions executed:
1. UPDATE host_issues with subquery from policy_membership
2. INSERT INTO host_issues with data from policy_membership

Transaction A would lock policy_membership, wait for host_issues
Transaction B would lock host_issues, wait for policy_membership

Adding FOR UPDATE to policy_membership queries ensures both transactions
acquire locks in the same order, preventing circular dependencies.

Fixes the high-frequency deadlocks observed during load testing with 15,000 hosts.
@codecov
Copy link
Copy Markdown

codecov Bot commented Aug 21, 2025

Codecov Report

❌ Patch coverage is 81.35593% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.98%. Comparing base (79d431e) to head (0efa149).
⚠️ Report is 35 commits behind head on main.

Files with missing lines Patch % Lines
server/datastore/mysql/hosts.go 84.09% 4 Missing and 3 partials ⚠️
server/datastore/mysql/policies.go 78.57% 0 Missing and 3 partials ⚠️
server/service/hosts.go 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #32173      +/-   ##
==========================================
+ Coverage   62.08%   63.98%   +1.89%     
==========================================
  Files        1985     1986       +1     
  Lines      194048   194108      +60     
  Branches     6537     6537              
==========================================
+ Hits       120477   124202    +3725     
+ Misses      63998    60216    -3782     
- Partials     9573     9690     +117     
Flag Coverage Δ
backend 65.27% <81.35%> (+2.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

… UPDATE` locks

Revised locking mechanism to avoid redundant `FOR UPDATE` clauses and added explicit pre-locking of `policy_membership` rows. Prevents potential deadlocks while ensuring consistent lock ordering in `updateHostIssuesFailingPolicies`.
… code

Replaced bulk-based methods with single-host-specific logic (`UpdateHostIssuesFailingPoliciesForSingleHost`) where applicable, improving performance and reducing unnecessary locking. Simplified `deleteAllPolicyMemberships` to align with single-host updates.
Introduced `UpdateHostIssuesFailingPoliciesForSingleHostFunc` mock in multiple test cases to support single-host policy update scenarios.
…stIDs` and batching updates

Ensures consistent lock ordering by sorting `hostIDs` before processing and splitting large updates into batches. Prevents circular dependencies in transactions with overlapping host sets.
@getvictor
Copy link
Copy Markdown
Member Author

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Aug 22, 2025

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Aug 22, 2025

Walkthrough

Introduces batched, lock-ordered updates to policy failing-issues to prevent deadlocks during bulk automation updates; adds a single-host update path and switches relevant call sites to it; adjusts policy-membership deletion to single-host; adds row locks in ownership checks; updates enrollment paths and MDM cleanup; extends Datastore interface and mocks; updates tests.

Changes

Cohort / File(s) Summary
Host issues batching and enrollment adjustments
server/datastore/mysql/hosts.go
Adds deterministic batching and row locking for updating host_issues; introduces single-host update helper and exported method; sorts host IDs; adds MDM cleanup and display name handling on enrollment; updates host seen times; adjusts policy-membership cleanup call to single-host variant.
Policy datastore concurrency and membership deletion
server/datastore/mysql/policies.go, server/datastore/mysql/policies_test.go
Adds FOR UPDATE to ownership checks for consistent lock ordering; changes deleteAllPolicyMemberships to single-host; updates post-delete issues recalculation to single-host path; updates tests accordingly.
Datastore interface and mocks
server/fleet/datastore.go, server/mock/datastore_mock.go
Adds UpdateHostIssuesFailingPoliciesForSingleHost to interface and mock; removes NewJobTx from mock.
Service layer call sites and tests
server/service/hosts.go, server/service/hosts_test.go, server/service/apple_mdm_test.go
Swaps batch recalculation calls to single-host variant in GetHost and policy execution paths; wires new mock hook in tests.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User as API/UI
  participant Svc as Service (Policies/Hosts)
  participant DS as Datastore
  participant MySQL as MySQL

  rect rgba(230,245,255,0.6)
  note over User,Svc: Bulk automation save (10+ policies)
  User->>Svc: Save automations
  Svc->>DS: UpdateHostIssuesFailingPolicies(hostIDs[])
  DS->>DS: Sort hostIDs deterministically
  loop Batched processing
    DS->>MySQL: SELECT ... FROM policy_membership FOR UPDATE (batch)
    DS->>MySQL: UPDATE host_issues ... (batch-safe)
    DS->>MySQL: INSERT/UPDATE counts for hosts in membership
    DS->>MySQL: Clear counts for hosts not in membership
  end
  DS-->>Svc: done
  Svc-->>User: success
  end
Loading
sequenceDiagram
  autonumber
  participant Agent as Host/Agent
  participant Svc as Service
  participant DS as Datastore
  participant MySQL as MySQL

  rect rgba(235,255,235,0.6)
  note over Agent,Svc: Single-host recalculation paths
  Agent->>Svc: GetHost / Report policy results
  Svc->>DS: UpdateHostIssuesFailingPoliciesForSingleHost(hostID)
  DS->>MySQL: Lock relevant rows (FOR UPDATE)
  DS->>MySQL: Upsert/clear host_issues for hostID
  DS-->>Svc: done
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Assessment against linked issues

Objective Addressed Explanation
Modify update/cleanup logic to avoid deadlocks in bulk policy automations [#31173]
Enforce consistent lock ordering and safe batching for large IN-lists [#31173]
Ensure reliable completion without deadlocks under large load; correct host_issues counts [#31173] Requires runtime verification and scale testing.
Reproduction of deadlock scenario [#31173] No test or script added to reproduce.
Identify and fix root cause: lock-order inversion between policy_membership and host_issues [#31173]

Assessment against linked issues: Out-of-scope changes

Code Change Explanation
Enrollment flow updates: host_display_names upsert, refetch_requested initialization, host seen time upsert (server/datastore/mysql/hosts.go) Not related to deadlock avoidance in policy automations.
MDM cleanup on re-enrollment: clearing host_mdm_actions and host_mdm on platform change (server/datastore/mysql/hosts.go) Unrelated to policy automation deadlocks.
Identity certificate host-id mapping update during enrollment (server/datastore/mysql/hosts.go) Outside the scope of policy automation update deadlock fix.
Removal of NewJobTx from datastore mock (server/mock/datastore_mock.go) Mock API change not tied to policy automation deadlock objectives.

Possibly related PRs

Suggested reviewers

  • dantecatalfamo
  • rachaelshaw

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch victor-loadtest-25k-hosts

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/datastore/mysql/hosts.go (1)

5751-5797: Use QueryContext for FOR UPDATE (Exec on SELECT is driver-undefined)

ExecContext with a SELECT … FOR UPDATE is not guaranteed to work across drivers; use QueryxContext and close rows to establish row locks reliably. Also keep the lock inside the same tx you already have.

Apply this change after updating the function signature to sqlx.ExtContext:

- lockPolicyStmt := `SELECT 1 FROM policy_membership WHERE host_id IN (?) FOR UPDATE`
+ lockPolicyStmt := `SELECT 1 FROM policy_membership WHERE host_id IN (?) FOR UPDATE`

  // ...
- stmt, args, err := sqlx.In(lockPolicyStmt, hostIDsBatch)
+ stmt, args, err := sqlx.In(lockPolicyStmt, hostIDsBatch)
  if err != nil {
    return ctxerr.Wrap(ctx, err, "building IN statement for locking policy_membership")
  }
- if _, err := tx.ExecContext(ctx, stmt, args...); err != nil {
-   return ctxerr.Wrap(ctx, err, "locking policy_membership rows")
- }
+ // Acquire row locks; no need to read rows, but Query is required.
+ rows, err := tx.(sqlx.QueryerContext).QueryxContext(ctx, stmt, args...)
+ if err != nil {
+   return ctxerr.Wrap(ctx, err, "locking policy_membership rows")
+ }
+ rows.Close()

If you prefer avoiding type assertions, pass tx as sqlx.ExtContext (as suggested) and use the QueryerContext methods directly.

🧹 Nitpick comments (12)
changes/31173-fix-policy-deadlocks (1)

1-1: Good CHANGELOG entry — consider adding a bit more operator-facing context.

This line is fine, but a short phrase about scope helps future readers scanning release notes (e.g., “during bulk policy automation saves on large fleets”). If you want, also reference the issue number for traceability.

Apply this diff to expand the entry:

-Fix deadlocks when updating automations for 10+ policies at one time
+Fix rare deadlocks when saving policy automations affecting 10+ policies at once on large fleets (e.g., thousands of hosts) (#31173)
server/service/hosts_test.go (2)

829-831: Mock for the new single-host path wired correctly.

This keeps existing tests from tripping the new call site. If you want extra confidence, you could assert invocation where it matters (only if the mock exposes an Invoked flag).

If available in the mock, you can assert the call like this:

 ds.UpdateHostIssuesFailingPoliciesForSingleHostFunc = func(ctx context.Context, hostID uint) error {
+  ds.UpdateHostIssuesFailingPoliciesForSingleHostFuncInvoked = true
   return nil
 }

1802-1804: Same here — correct wiring for the new mock.

No functional concerns. Optional: add an assertion as in the previous comment if the mock exposes an “Invoked” flag.

server/service/hosts.go (3)

569-578: Make the background issues refresh best-effort to avoid user-facing failures.

Switching to the per-host update reduces lock footprint — good. However, failing the entire GetHost call if the refresh hits a transient lock wait/deadlock will surface as UX errors. This refresh is an optimization bounded by the subsequent policy-based recompute below. Recommend:

  • Time-bound the datastore refresh.
  • Log and proceed on error (serve slightly stale counts rather than erroring the whole request).

Apply this diff:

   lastUpdated, err := svc.ds.GetHostIssuesLastUpdated(ctx, id)
   if err != nil {
     return nil, ctxerr.Wrap(ctx, err, "checking host's host_issues last updated:")
   }
-  if time.Since(lastUpdated) > time.Minute {
-    if err := svc.ds.UpdateHostIssuesFailingPoliciesForSingleHost(ctx, id); err != nil {
-      return nil, ctxerr.Wrap(ctx, err, "recalculate host failing policies count:")
-    }
-  }
+  if time.Since(lastUpdated) > time.Minute {
+    // Best-effort refresh: keep the read path responsive even if the refresh contends on locks.
+    updateCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
+    defer cancel()
+    if err := svc.ds.UpdateHostIssuesFailingPoliciesForSingleHost(updateCtx, id); err != nil {
+      level.Debug(svc.logger).Log(
+        "msg", "host issues refresh failed; continuing with stale counts",
+        "host_id", id,
+        "err", err,
+      )
+    }
+  }

569-578: Gate the refresh on whether issues are needed for this call.

If the caller disables issues for performance (opts.DisableIssues) or won’t read policy data, we can skip the refresh to reduce write pressure under high QPS GetHost traffic (e.g., agent/API polling).

Wrap the refresh in a guard:

-  if time.Since(lastUpdated) > time.Minute {
+  if !opts.DisableIssues && time.Since(lastUpdated) > time.Minute {
     // refresh as above...
   }

569-578: Consider fetching the host before attempting the refresh.

Today we refresh by host ID before verifying the host exists/authorizing access. If the ID is invalid or deleted, we do unnecessary work and may return a datastore error unrelated to “host not found”. Fetching (and authorizing) first avoids that edge. This is optional if lock ordering in the datastore implementation depends on the current sequence.

Possible adjustment:

-  // recalculate host failing_policies_count & total_issues_count, at most every minute
-  lastUpdated, err := svc.ds.GetHostIssuesLastUpdated(ctx, id)
+  host, err := svc.ds.Host(ctx, id)
   if err != nil {
-    return nil, ctxerr.Wrap(ctx, err, "checking host's host_issues last updated:")
+    return nil, ctxerr.Wrap(ctx, err, "get host")
   }
+  if !opts.IncludeCriticalVulnerabilitiesCount {
+    host.HostIssues.CriticalVulnerabilitiesCount = nil
+  }
+  // Now do the best-effort refresh using host.ID if desired
+  lastUpdated, err := svc.ds.GetHostIssuesLastUpdated(ctx, host.ID)
+  if err != nil {
+    return nil, ctxerr.Wrap(ctx, err, "checking host's host_issues last updated:")
+  }

Note: only adopt if it doesn’t undermine the lock-ordering guarantees you introduced at the datastore level.

I can help validate lock ordering across the DS layer if you paste the relevant snippets from server/datastore/mysql/{hosts,policies}.go.

server/fleet/datastore.go (2)

315-320: Docstring polish + minor naming consistency (hostID) to match surrounding style.

  • Use backticks around SQL table names for consistency with nearby docs.
  • Prefer hostID casing (used elsewhere in this file) over hostId.
  • Optional: gently nudge readers toward the per-host variant when updating a single host (reduces contention/deadlocks), while keeping batch for true bulk cases.

Proposed diff:

- // UpdateHostIssuesFailingPolicies updates the failing policies count in host_issues table for the provided hosts.
+ // UpdateHostIssuesFailingPolicies updates the failing policies count in the `host_issues` table for the provided hosts.
+ // Prefer the per-host variant when updating a single host to minimize lock contention during policy membership changes.

- // UpdateHostIssuesFailingPoliciesForSingleHost updates the failing policies count in host_issues table for a single host.
+ // UpdateHostIssuesFailingPoliciesForSingleHost updates the failing policies count in the `host_issues` table for a single host.

- // Gets the last time the host's row in `host_issues` was updated
- GetHostIssuesLastUpdated(ctx context.Context, hostId uint) (time.Time, error)
+ // Gets the last time the host's row in `host_issues` was updated
+ GetHostIssuesLastUpdated(ctx context.Context, hostID uint) (time.Time, error)

317-318: Optional naming alignment with existing conventions (ForHost/ByHostID).

Most methods nearby use “ForHost/ByID” patterns (e.g., ListPoliciesForHost, HostLiteByID). Consider shortening to improve readability:

- // UpdateHostIssuesFailingPoliciesForSingleHost updates the failing policies count in the `host_issues` table for a single host.
- UpdateHostIssuesFailingPoliciesForSingleHost(ctx context.Context, hostID uint) error
+ // UpdateHostIssuesFailingPoliciesForHost updates the failing policies count in the `host_issues` table for a single host.
+ UpdateHostIssuesFailingPoliciesForHost(ctx context.Context, hostID uint) error

This is optional and would ripple through call sites and mocks, so only do it if you’re still iterating on the API name in this PR.

server/service/apple_mdm_test.go (1)

899-904: Strengthen the test to assert the single-host path is used (and drop the stale bulk mock)

Now that the service calls the per-host method, this test should assert that:

  • UpdateHostIssuesFailingPoliciesForSingleHost is invoked, and
  • the old bulk path is not invoked (helps catch regressions).

Optional tidy-up: the bulk mock set on Line 899–901 appears unused in this test and can be removed to reduce noise.

Suggested changes:

  • Minimal guard in the per-host mock (keeps it self-validating):
-ds.UpdateHostIssuesFailingPoliciesForSingleHostFunc = func(ctx context.Context, hostID uint) error {
-  return nil
-}
+ds.UpdateHostIssuesFailingPoliciesForSingleHostFunc = func(ctx context.Context, hostID uint) error {
+  require.NotZero(t, hostID)
+  return nil
+}
  • If you’d like, drop the bulk-path mock in this test:
-ds.UpdateHostIssuesFailingPoliciesFunc = func(ctx context.Context, hostIDs []uint) error {
-  return nil
-}
  • Add flag resets alongside the others (right after Line 968) and verify invocation after each host fetch:
// resets with the others:
ds.UpdateHostIssuesFailingPoliciesForSingleHostFuncInvoked = false
ds.UpdateHostIssuesFailingPoliciesFuncInvoked = false
...
// after getting gotHost (e.g., after Line 986):
require.True(t, ds.UpdateHostIssuesFailingPoliciesForSingleHostFuncInvoked, "expect per-host update path to be used")
require.False(t, ds.UpdateHostIssuesFailingPoliciesFuncInvoked, "bulk path should not be called in GetHost/HostByIdentifier")
server/datastore/mysql/policies_test.go (1)

3293-3295: Add an idempotency check to prevent regressions

Calling the helper twice for the same host should remain a no-op. A quick follow-up call and re-asserts strengthen the contract and guard against future changes that might reintroduce side effects.

 err = deleteAllPolicyMemberships(ctx, ds.writer(ctx), host.ID)
 require.NoError(t, err)

 err = ds.writer(ctx).Get(&count, "select COUNT(*) from policy_membership")
 require.NoError(t, err)
 require.Equal(t, 0, count)

 require.NoError(t, ds.writer(ctx).Get(&count, "select COUNT(*) from host_issues WHERE total_issues_count > 0"))
 assert.Zero(t, count)
+
+ // Idempotency: calling again should be a no-op and keep counts at zero.
+ err = deleteAllPolicyMemberships(ctx, ds.writer(ctx), host.ID)
+ require.NoError(t, err)
+ err = ds.writer(ctx).Get(&count, "select COUNT(*) from policy_membership")
+ require.NoError(t, err)
+ require.Equal(t, 0, count)
+ require.NoError(t, ds.writer(ctx).Get(&count, "select COUNT(*) from host_issues WHERE total_issues_count > 0"))
+ assert.Zero(t, count)
server/datastore/mysql/hosts.go (2)

5724-5727: Consider retry wrapper for single-host updates too

Single-host host_issues updates can still conflict with the periodic vulnerabilities updater touching the same row. Wrapping in withRetryTxx keeps behavior consistent with the batch path.

-func (ds *Datastore) UpdateHostIssuesFailingPoliciesForSingleHost(ctx context.Context, hostID uint) error {
-  var tx sqlx.ExecerContext = ds.writer(ctx)
-  return updateHostIssuesFailingPoliciesForSingleHost(ctx, tx, hostID)
-}
+func (ds *Datastore) UpdateHostIssuesFailingPoliciesForSingleHost(ctx context.Context, hostID uint) error {
+  return ds.withRetryTxx(ctx, func(tx sqlx.ExtContext) error {
+    return updateHostIssuesFailingPoliciesForSingleHost(ctx, tx, hostID)
+  })
+}

If you prefer to keep it non-transactional for latency, consider at least retry-on-deadlock error handling around the Exec. Want me to wire that in?


5729-5744: Make the SUM expression explicit and skip NULL passes

Minor readability and portability improvement: avoid unary “!” and make the failing count explicit.

- INSERT INTO host_issues (host_id, failing_policies_count, total_issues_count)
- SELECT host_id.id, COALESCE(SUM(!pm.passes), 0), COALESCE(SUM(!pm.passes), 0)
+ INSERT INTO host_issues (host_id, failing_policies_count, total_issues_count)
+ SELECT host_id.id,
+        COALESCE(SUM(CASE WHEN pm.passes = 0 THEN 1 ELSE 0 END), 0),
+        COALESCE(SUM(CASE WHEN pm.passes = 0 THEN 1 ELSE 0 END), 0)
   FROM policy_membership pm
   RIGHT JOIN (SELECT ? as id) as host_id
   ON pm.host_id = host_id.id
   GROUP BY host_id.id
 ON DUPLICATE KEY UPDATE
   failing_policies_count = VALUES(failing_policies_count),
   total_issues_count = VALUES(failing_policies_count) + critical_vulnerabilities_count
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d3f31ce and bfcdfc3.

📒 Files selected for processing (9)
  • changes/31173-fix-policy-deadlocks (1 hunks)
  • server/datastore/mysql/hosts.go (4 hunks)
  • server/datastore/mysql/policies.go (5 hunks)
  • server/datastore/mysql/policies_test.go (1 hunks)
  • server/fleet/datastore.go (1 hunks)
  • server/mock/datastore_mock.go (3 hunks)
  • server/service/apple_mdm_test.go (1 hunks)
  • server/service/hosts.go (1 hunks)
  • server/service/hosts_test.go (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

⚙️ CodeRabbit configuration file

When reviewing SQL queries that are added or modified, ensure that appropriate filtering criteria are applied—especially when a query is intended to return data for a specific entity (e.g., a single host). Check for missing WHERE clauses or incorrect filtering that could lead to incorrect or non-deterministic results (e.g., returning the first row instead of the correct one). Flag any queries that may return unintended results due to lack of precise scoping.

Files:

  • server/service/apple_mdm_test.go
  • server/fleet/datastore.go
  • server/service/hosts.go
  • server/mock/datastore_mock.go
  • server/service/hosts_test.go
  • server/datastore/mysql/hosts.go
  • server/datastore/mysql/policies_test.go
  • server/datastore/mysql/policies.go
🔇 Additional comments (12)
server/service/hosts.go (1)

569-578: All batch-API call sites migrated to intended contexts

I’ve verified that no callers in the server/service layer still invoke the old
UpdateHostIssuesFailingPolicies([]uint) method in single-host scenarios:

  • The only occurrences of the batch API are its declaration in server/fleet/datastore.go, its implementation in server/datastore/mysql/hosts.go (and associated calls in server/datastore/mysql/policies.go for genuine batch jobs), plus mocks and tests—none reside in server/service.
  • In server/service/hosts.go, the new UpdateHostIssuesFailingPoliciesForSingleHost(ctx, id) API is used exclusively for per-host updates.

No further migration is needed here.

server/fleet/datastore.go (1)

317-318: All Datastore Implementations Updated

I ran the provided sweep and confirmed:

  • The new method is implemented in
    • server/datastore/mysql/hosts.go (line 5724)
    • server/mock/datastore_mock.go (line 4368)
  • All service call sites reference the updated API (e.g. server/service/hosts.go:575, server/datastore/mysql/policies.go:595).
  • There are no additional “cached” or other wrapper datastore layers in server/datastore that require updates.

Mocks and the primary MySQL implementation both include the new UpdateHostIssuesFailingPoliciesForSingleHost method. No further interface implementers were found, so this is now resolved.

server/service/apple_mdm_test.go (1)

902-904: Good: mock single-host failing-policies path wired up

Adding ds.UpdateHostIssuesFailingPoliciesForSingleHostFunc unblocks the new per-host update path exercised by svc.GetHost/HostByIdentifier. Returning nil here is sufficient for these tests.

server/datastore/mysql/policies_test.go (2)

3293-3295: Switch to single-host deleteAllPolicyMemberships signature — LGTM

The test correctly updates the call site to pass a single host ID, matching the new function signature. Post-conditions validate both policy_membership and host_issues table state, which is the right observable behavior here.


3293-3295: No further action required for deleteAllPolicyMemberships

All call sites now use the new single-host signature, and there are no remaining slice-based invocations:

  • server/datastore/mysql/policies.go:1430 – signature is
    func deleteAllPolicyMemberships(ctx context.Context, tx sqlx.ExtContext, hostID uint) error
  • server/datastore/mysql/policies_test.go:3293 – call is
    err = deleteAllPolicyMemberships(ctx, ds.writer(ctx), host.ID)

Because the implementation issue a simple DELETE (no SELECT … FOR UPDATE locks), exercising it inside an explicit transaction for lock-order testing isn’t necessary.

server/mock/datastore_mock.go (2)

1783-1784: Struct fields for per-host hook are properly wired.

Fields mirror existing pattern (<Func, Invoked) and are placed next to the related bulk variant for discoverability.


4368-4373: Method implementation follows established mock pattern.

Locks only to flip the Invoked flag, unlocks before delegating, and defers behavior to the injected func — consistent with the rest of this auto-generated mock.

server/datastore/mysql/policies.go (2)

592-597: Good call: update host_issues even when no policy results

Running UpdateHostIssuesFailingPoliciesForSingleHost when len(results) == 0 prevents stale failing_policies_count for hosts configured to run no policies. This directly addresses correctness.


1430-1437: Single-host cleanup path is clearer and safer

Switching deleteAllPolicyMemberships to a single-host DELETE and immediately recomputing host_issues in the same tx is a solid improvement for both performance and lock scope.

server/datastore/mysql/hosts.go (3)

5719-5723: Wrap batch update in retry — nice alignment with deadlock mitigation

Adding withRetryTxx around the batch host_issues update matches the PR objective and improves robustness under contention.


5778-5783: Deterministic lock ordering: good

Sorting hostIDs before locking helps prevent deadlocks across concurrent workers that process overlapping host sets.


9-9: No action needed: repo already uses Go 1.24.6 which supports “slices”

Verified that go.mod specifies Go 1.24.6, well above the 1.21 minimum for the stdlib “slices” package, and all existing calls to slices.Sort, slices.Compact, slices.Compare, etc. are fully supported. CI and builds will not break, so no fallback to sort.Slice is required.

Comment thread server/datastore/mysql/hosts.go
Comment thread server/datastore/mysql/policies.go
Comment thread server/datastore/mysql/policies.go
Comment thread server/datastore/mysql/policies.go
Comment thread server/mock/datastore_mock.go
@getvictor getvictor marked this pull request as ready for review August 22, 2025 16:19
@getvictor getvictor requested a review from a team as a code owner August 22, 2025 16:19
Copy link
Copy Markdown
Contributor

@sgress454 sgress454 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, test 👍 with updating a policy for 1 host and multiple hosts.

It seems to me like we could collapse this code down by using the single-host version of the SQL (with the right join) for both the single and multi-host cases. Then we wouldn't need the "Clear host_issues entries for hosts that are not in policy_membership" code either. Nothing wrong with a batch of 1! But I could be missing a subtlety there, and it's all outside the scope of your change anyway.

I also wonder if we need to be doing this all in one big transaction, rather than a transaction per batch which could also reduce deadlocks. If a single batch failed, we'd end up with some hosts that had incorrect data about their failed policies, but is that worse than having all the hosts with bad data? But, also out of scope.

Comment on lines +360 to +364
// Use FOR UPDATE to acquire an exclusive lock on the software_installer row early in the transaction.
// This prevents deadlocks by ensuring consistent lock ordering - all transactions that need
// to validate and update policies with the same software installer will wait in line
// rather than creating circular dependencies.
err := sqlx.GetContext(ctx, db, &softwareInstallerTeamID, "SELECT global_or_team_id FROM software_installers WHERE id = ? FOR UPDATE", softwareInstallerID)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a little confusing to me since we're passing around db and not tx, but looks like both the places we actually call assertTeamMatches do start a transaction. Using FOR UPDATE outside of a transaction apparently works fine so nbd.

@getvictor getvictor merged commit 4129b52 into main Aug 22, 2025
43 of 44 checks passed
@getvictor getvictor deleted the victor-loadtest-25k-hosts branch August 22, 2025 17:36
@getvictor
Copy link
Copy Markdown
Member Author

I also wonder if we need to be doing this all in one big transaction, rather than a transaction per batch which could also reduce deadlocks. If a single batch failed, we'd end up with some hosts that had incorrect data about their failed policies, but is that worse than having all the hosts with bad data? But, also out of scope.

Yes, I looked into this, but it was complicated since this code is used by multiple different paths which are complex by themselves. So it would be a bunch of work to split this into separate transactions. It seems like it is fine as is right now, so I decided not to overengineer it at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Updating automations for 10+ policy automations at one time results in deadlock in large environments

2 participants