Skip to content

Fix diskless replication task accumulation, GarnetClientSession leak, and dispose race in AofSyncTaskInfo#1791

Merged
badrishc merged 13 commits into
release/v1from
users/matrembl/simon-aoffix
May 15, 2026
Merged

Fix diskless replication task accumulation, GarnetClientSession leak, and dispose race in AofSyncTaskInfo#1791
badrishc merged 13 commits into
release/v1from
users/matrembl/simon-aoffix

Conversation

@Mathos1432

@Mathos1432 Mathos1432 commented May 12, 2026

Copy link
Copy Markdown
Contributor

Why is this change being made?

This is recreating #1787 on top of release/v1.

Two related bugs in AofTaskStore caused unbounded accumulation of AofSyncTaskInfo tasks on clusters using diskless replication, and a third — a thread-safety race — was uncovered during PR review of the original disposal fix.

Bug 1 — diskless dedup never matched. TryAddReplicationTasks (the diskless path) compared existing tasks against rss.replicaNodeId for dedup. ReplicaSyncSession has two node ID fields: replicaNodeId (set by the disk-based constructor, null for diskless) and replicaSyncMetadata.originNodeId (set by the diskless constructor). The AofSyncTaskInfo was created with originNodeId, but dedup compared against the null replicaNodeId — so it never matched and every call added a new task. Over time numTasks grew unboundedly, inflating the RoleInfo[] from INFO REPLICATION until the response exceeded the network output buffer.

Bug 2 — GarnetClientSession leak on the not-yet-started path. AofSyncTaskInfo.Dispose() did not dispose its owned GarnetClientSession. When ReplicaSyncTaskAsync is running, CTS cancellation causes it to exit and the finally block cleans up. But when ReplicaSyncTaskAsync has not yet started (e.g. the task fails to be added), Dispose() is the only cleanup path and the session was leaked.

Bug 3 — dispose race after moving disposal into Dispose(). Once garnetClient.Dispose() lives in AofSyncTaskInfo.Dispose(), the replacement path in AofTaskStore (which disposes an old task while creating a new one) can dispose the client while the old task's Consume() / Throttle() is still calling garnetClient.ExecuteClusterAppendLog. GarnetClientSession is mono-threaded with unsafe pooled send buffers — a concurrent dispose corrupts those buffers.

What changed?

  • libs/cluster/Server/Replication/PrimaryOps/AofTaskStore.cs — Dedup fix: compare against rss.replicaSyncMetadata.originNodeId instead of rss.replicaNodeId in TryAddReplicationTasks. The singular TryAddReplicationTask (disk-based and CLUSTER AOFSYNC) is unaffected.
  • libs/cluster/Server/Replication/PrimaryOps/AofSyncTaskInfo.csDispose() now disposes garnetClient (closing the leak from Bug 2). Disposal is sequenced through a new ActiveWorkerMonitor instance: Dispose() cancels the cts, disposes the iterator, calls activeWorkerMonitor.Dispose() to drain any in-flight ReplicaSyncTaskAsync, and only then disposes garnetClient. ReplicaSyncTaskAsync wraps its body in TryEnter() / Exit(), with Exit() called in finally before aofTaskStore.TryRemove(this) to avoid self-deadlock against the monitor it is exiting.
  • libs/common/Synchronization/ActiveWorkerMonitor.cs — New file backported verbatim from main (introduced by Support for Parallel Replication #1556, Parallel Replication). Multi-reader/single-writer drain primitive built on Interlocked + ManualResetEventSlim — no spinning. The file is self-contained (only BCL usings) so the rest of the parallel-replication restructure is not required. Aligns the v1 fix with the v2 architecture and eases future merges.
  • test/Garnet.test.cluster/ReplicationTests/AofSyncTaskInfoTests.cs — Existing unit coverage of AofSyncTaskInfo lifecycle.
  • test/Garnet.test.cluster/ReplicationTests/ClusterReplicationDisklessSyncTests.cs — Added ClusterDisklessSyncRepeatedAttachDoesNotAccumulateTasks regression test for Bug 1: drives 5 replica re-attach cycles and asserts the primary's INFO REPLICATION reports a single slaveN: entry. Verified to fail when the dedup fix is reverted.

How was this validated?

  • Local: full ClusterReplicationDisklessSyncTests fixture green on net8.0 (30/30, including the new regression test, ClusterDisklessSyncFailover, and ClusterDisklessSyncResetSyncManagerCts which exercise heavy task-replacement and cancellation paths).
  • Local: dedup regression test confirmed to fail when the AofTaskStore.cs fix is reverted, then pass again when restored.
  • Local: stress harness driving concurrent attach / detach cycles while writes are in-flight (kept out of the PR due to flakiness) used to validate that the ActiveWorkerMonitor-based fix removes the dispose race.
  • Integration pipeline:

nattress and others added 5 commits May 12, 2026 14:46
Two related bugs in AofTaskStore caused unbounded accumulation
of AofSyncTaskInfo tasks on clusters using diskless replication.

1. TryAddReplicationTasks (the diskless path) compared existing
   tasks against rss.replicaNodeId for dedup. ReplicaSyncSession
   has two node ID fields: replicaNodeId (set by the disk-based
   constructor, null for diskless) and replicaSyncMetadata.originNodeId
   (set by the diskless constructor). The AofSyncTaskInfo was
   created with originNodeId, but dedup compared against the null
   replicaNodeId — so it never matched and every call added a new
   task. Over time numTasks grew unboundedly, inflating the
   RoleInfo[] from INFO REPLICATION until the response exceeded
   the network output buffer.

   Fix: use rss.replicaSyncMetadata.originNodeId in the dedup
   comparison. The singular TryAddReplicationTask (disk-based
   and CLUSTER AOFSYNC) is unaffected.

2. AofSyncTaskInfo.Dispose() did not dispose its owned
   GarnetClientSession. When ReplicaSyncTaskAsync is running,
   CTS cancellation causes it to exit and the finally block
   cleans up. But when ReplicaSyncTaskAsync has not yet started
   (e.g. the task fails to be added), Dispose() is the only
   cleanup path and the session was leaked.

   Fix: add garnetClient?.Dispose() to AofSyncTaskInfo.Dispose()
   and remove the redundant call from ReplicaSyncTaskAsync's
   finally block, giving a single disposal site.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify that the client disposal is no longer happening in the finally block.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Apply [AllureNUnit] attribute and inherit AllureTestBase to
match repo test conventions required by CI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Categorize the test to match the existing replication test
convention in the cluster test project.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 12, 2026 18:54

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes two AOF replication task management bugs impacting cluster replication (especially diskless replication) by correcting task deduplication identity and tightening resource cleanup for AOF sync tasks.

Changes:

  • Fix diskless replication task deduplication to match on replicaSyncMetadata.originNodeId (instead of replicaNodeId, which is null for diskless sessions).
  • Update AofSyncTaskInfo.Dispose() to dispose its owned GarnetClientSession, and adjust termination logging.
  • Add a regression/unit test asserting AofSyncTaskInfo.Dispose() releases the client session.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
test/Garnet.test.cluster/ReplicationTests/AofSyncTaskInfoTests.cs Adds a test ensuring AofSyncTaskInfo.Dispose() disposes its owned GarnetClientSession.
libs/cluster/Server/Replication/PrimaryOps/AofTaskStore.cs Fixes diskless task deduplication comparison to use originNodeId.
libs/cluster/Server/Replication/PrimaryOps/AofSyncTaskInfo.cs Moves GarnetClientSession disposal into Dispose() and updates termination logging.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread libs/cluster/Server/Replication/PrimaryOps/AofSyncTaskInfo.cs Outdated
Comment thread libs/cluster/Server/Replication/PrimaryOps/AofSyncTaskInfo.cs
Comment thread libs/cluster/Server/Replication/PrimaryOps/AofTaskStore.cs
@Mathos1432 Mathos1432 marked this pull request as draft May 13, 2026 13:11
@Mathos1432 Mathos1432 changed the title Users/matrembl/simon aoffix Fix diskless replication task accumulation, GarnetClientSession leak, and dispose race in AofSyncTaskInfo May 13, 2026
Comment thread libs/cluster/Server/Replication/PrimaryOps/AofSyncTaskInfo.cs Outdated
Mathos1432 and others added 3 commits May 14, 2026 13:15
…idempotency

- Use ActiveWorkerMonitor (backported from PR #1556) to drain in-flight ReplicaSyncTaskAsync before disposing GarnetClientSession, eliminating cross-thread dispose races against ExecuteClusterAppendLog.
- Defensively call Dispose() in ReplicaSyncTaskAsync's finally when TryRemove returns false, guarding against future removal sites that forget to dispose.
- Make AofSyncTaskInfo.Dispose() idempotent (Interlocked guard) so multiple disposal sites cannot trigger ObjectDisposedException from cts.Cancel after cts.Dispose.
- Drop the unreachable enteredMonitor flag; control only enters the try block when TryEnter succeeded.
- Test: extend DisposeReleasesGarnetClientSession to assert idempotency.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Version.props now declares 1.1.6.2 (was 1.1.6) so that nuget packages produced from this branch carry the patch identifier.

GarnetServer.GetVersion() previously returned only Major.Minor.Build, which caused 1.1.6.2 builds to log themselves as 1.1.6 at startup. Append the Revision when non-zero so the startup log matches the assembly version.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This reverts commit 54cbd42. The version bump and the GetVersion() Revision-aware formatting are out of scope for this PR; the version bump should land on its own.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Mathos1432 Mathos1432 marked this pull request as ready for review May 14, 2026 17:38
release/v1 just bumped to 1.1.7 in #1792. Bump again to 1.1.8 as part of this PR so the package built from this branch sorts above the latest published version.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc merged commit 92b7d4d into release/v1 May 15, 2026
1 check passed
@badrishc badrishc deleted the users/matrembl/simon-aoffix branch May 15, 2026 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants