Fix diskless replication task accumulation, GarnetClientSession leak, and dispose race in AofSyncTaskInfo#1791
Merged
Merged
Conversation
Two related bugs in AofTaskStore caused unbounded accumulation of AofSyncTaskInfo tasks on clusters using diskless replication. 1. TryAddReplicationTasks (the diskless path) compared existing tasks against rss.replicaNodeId for dedup. ReplicaSyncSession has two node ID fields: replicaNodeId (set by the disk-based constructor, null for diskless) and replicaSyncMetadata.originNodeId (set by the diskless constructor). The AofSyncTaskInfo was created with originNodeId, but dedup compared against the null replicaNodeId — so it never matched and every call added a new task. Over time numTasks grew unboundedly, inflating the RoleInfo[] from INFO REPLICATION until the response exceeded the network output buffer. Fix: use rss.replicaSyncMetadata.originNodeId in the dedup comparison. The singular TryAddReplicationTask (disk-based and CLUSTER AOFSYNC) is unaffected. 2. AofSyncTaskInfo.Dispose() did not dispose its owned GarnetClientSession. When ReplicaSyncTaskAsync is running, CTS cancellation causes it to exit and the finally block cleans up. But when ReplicaSyncTaskAsync has not yet started (e.g. the task fails to be added), Dispose() is the only cleanup path and the session was leaked. Fix: add garnetClient?.Dispose() to AofSyncTaskInfo.Dispose() and remove the redundant call from ReplicaSyncTaskAsync's finally block, giving a single disposal site. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify that the client disposal is no longer happening in the finally block. Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Apply [AllureNUnit] attribute and inherit AllureTestBase to match repo test conventions required by CI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Categorize the test to match the existing replication test convention in the cluster test project. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes two AOF replication task management bugs impacting cluster replication (especially diskless replication) by correcting task deduplication identity and tightening resource cleanup for AOF sync tasks.
Changes:
- Fix diskless replication task deduplication to match on
replicaSyncMetadata.originNodeId(instead ofreplicaNodeId, which is null for diskless sessions). - Update
AofSyncTaskInfo.Dispose()to dispose its ownedGarnetClientSession, and adjust termination logging. - Add a regression/unit test asserting
AofSyncTaskInfo.Dispose()releases the client session.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| test/Garnet.test.cluster/ReplicationTests/AofSyncTaskInfoTests.cs | Adds a test ensuring AofSyncTaskInfo.Dispose() disposes its owned GarnetClientSession. |
| libs/cluster/Server/Replication/PrimaryOps/AofTaskStore.cs | Fixes diskless task deduplication comparison to use originNodeId. |
| libs/cluster/Server/Replication/PrimaryOps/AofSyncTaskInfo.cs | Moves GarnetClientSession disposal into Dispose() and updates termination logging. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
vazois
reviewed
May 13, 2026
vazois
approved these changes
May 13, 2026
…idempotency - Use ActiveWorkerMonitor (backported from PR #1556) to drain in-flight ReplicaSyncTaskAsync before disposing GarnetClientSession, eliminating cross-thread dispose races against ExecuteClusterAppendLog. - Defensively call Dispose() in ReplicaSyncTaskAsync's finally when TryRemove returns false, guarding against future removal sites that forget to dispose. - Make AofSyncTaskInfo.Dispose() idempotent (Interlocked guard) so multiple disposal sites cannot trigger ObjectDisposedException from cts.Cancel after cts.Dispose. - Drop the unreachable enteredMonitor flag; control only enters the try block when TryEnter succeeded. - Test: extend DisposeReleasesGarnetClientSession to assert idempotency. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Version.props now declares 1.1.6.2 (was 1.1.6) so that nuget packages produced from this branch carry the patch identifier. GarnetServer.GetVersion() previously returned only Major.Minor.Build, which caused 1.1.6.2 builds to log themselves as 1.1.6 at startup. Append the Revision when non-zero so the startup log matches the assembly version. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This reverts commit 54cbd42. The version bump and the GetVersion() Revision-aware formatting are out of scope for this PR; the version bump should land on its own. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
vazois
approved these changes
May 14, 2026
release/v1 just bumped to 1.1.7 in #1792. Bump again to 1.1.8 as part of this PR so the package built from this branch sorts above the latest published version. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
vazois
approved these changes
May 14, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why is this change being made?
This is recreating #1787 on top of
release/v1.Two related bugs in
AofTaskStorecaused unbounded accumulation ofAofSyncTaskInfotasks on clusters using diskless replication, and a third — a thread-safety race — was uncovered during PR review of the original disposal fix.Bug 1 — diskless dedup never matched.
TryAddReplicationTasks(the diskless path) compared existing tasks againstrss.replicaNodeIdfor dedup.ReplicaSyncSessionhas two node ID fields:replicaNodeId(set by the disk-based constructor, null for diskless) andreplicaSyncMetadata.originNodeId(set by the diskless constructor). TheAofSyncTaskInfowas created withoriginNodeId, but dedup compared against the nullreplicaNodeId— so it never matched and every call added a new task. Over timenumTasksgrew unboundedly, inflating theRoleInfo[]fromINFO REPLICATIONuntil the response exceeded the network output buffer.Bug 2 —
GarnetClientSessionleak on the not-yet-started path.AofSyncTaskInfo.Dispose()did not dispose its ownedGarnetClientSession. WhenReplicaSyncTaskAsyncis running, CTS cancellation causes it to exit and the finally block cleans up. But whenReplicaSyncTaskAsynchas not yet started (e.g. the task fails to be added),Dispose()is the only cleanup path and the session was leaked.Bug 3 — dispose race after moving disposal into
Dispose(). OncegarnetClient.Dispose()lives inAofSyncTaskInfo.Dispose(), the replacement path inAofTaskStore(which disposes an old task while creating a new one) can dispose the client while the old task'sConsume()/Throttle()is still callinggarnetClient.ExecuteClusterAppendLog.GarnetClientSessionis mono-threaded with unsafe pooled send buffers — a concurrent dispose corrupts those buffers.What changed?
libs/cluster/Server/Replication/PrimaryOps/AofTaskStore.cs— Dedup fix: compare againstrss.replicaSyncMetadata.originNodeIdinstead ofrss.replicaNodeIdinTryAddReplicationTasks. The singularTryAddReplicationTask(disk-based andCLUSTER AOFSYNC) is unaffected.libs/cluster/Server/Replication/PrimaryOps/AofSyncTaskInfo.cs—Dispose()now disposesgarnetClient(closing the leak from Bug 2). Disposal is sequenced through a newActiveWorkerMonitorinstance:Dispose()cancels the cts, disposes the iterator, callsactiveWorkerMonitor.Dispose()to drain any in-flightReplicaSyncTaskAsync, and only then disposesgarnetClient.ReplicaSyncTaskAsyncwraps its body inTryEnter()/Exit(), withExit()called infinallybeforeaofTaskStore.TryRemove(this)to avoid self-deadlock against the monitor it is exiting.libs/common/Synchronization/ActiveWorkerMonitor.cs— New file backported verbatim frommain(introduced by Support for Parallel Replication #1556, Parallel Replication). Multi-reader/single-writer drain primitive built onInterlocked+ManualResetEventSlim— no spinning. The file is self-contained (only BCLusings) so the rest of the parallel-replication restructure is not required. Aligns the v1 fix with the v2 architecture and eases future merges.test/Garnet.test.cluster/ReplicationTests/AofSyncTaskInfoTests.cs— Existing unit coverage ofAofSyncTaskInfolifecycle.test/Garnet.test.cluster/ReplicationTests/ClusterReplicationDisklessSyncTests.cs— AddedClusterDisklessSyncRepeatedAttachDoesNotAccumulateTasksregression test for Bug 1: drives 5 replica re-attach cycles and asserts the primary'sINFO REPLICATIONreports a singleslaveN:entry. Verified to fail when the dedup fix is reverted.How was this validated?
ClusterReplicationDisklessSyncTestsfixture green onnet8.0(30/30, including the new regression test,ClusterDisklessSyncFailover, andClusterDisklessSyncResetSyncManagerCtswhich exercise heavy task-replacement and cancellation paths).AofTaskStore.csfix is reverted, then pass again when restored.ActiveWorkerMonitor-based fix removes the dispose race.