Skip to content

Fix GarnetClientSession leak and diskless replication dedup failure#1787

Closed
nattress wants to merge 5 commits into
microsoft:mainfrom
nattress:simonn/aofsynctaskinfo-leak
Closed

Fix GarnetClientSession leak and diskless replication dedup failure#1787
nattress wants to merge 5 commits into
microsoft:mainfrom
nattress:simonn/aofsynctaskinfo-leak

Conversation

@nattress

Copy link
Copy Markdown
Member

Two related bugs in AofTaskStore caused unbounded accumulation of AofSyncTaskInfo tasks on clusters using diskless replication.

  1. TryAddReplicationTasks (the diskless path) compared existing tasks against rss.replicaNodeId for dedup. ReplicaSyncSession has two node ID fields: replicaNodeId (set by the disk-based constructor, null for diskless) and replicaSyncMetadata.originNodeId (set by the diskless constructor). The AofSyncTaskInfo was created with originNodeId, but dedup compared against the null replicaNodeId — so it never matched and every call added a new task. Over time numTasks grew unboundedly, inflating the RoleInfo[] from INFO REPLICATION until the response exceeded the network output buffer.

    Fix: use rss.replicaSyncMetadata.originNodeId in the dedup comparison. The singular TryAddReplicationTask (disk-based and CLUSTER AOFSYNC) is unaffected.

  2. AofSyncTaskInfo.Dispose() did not dispose its owned GarnetClientSession. When ReplicaSyncTaskAsync is running, CTS cancellation causes it to exit and the finally block cleans up. But when ReplicaSyncTaskAsync has not yet started (e.g. the task fails to be added), Dispose() is the only cleanup path and the session was leaked.

    Fix: add garnetClient?.Dispose() to AofSyncTaskInfo.Dispose() and remove the redundant call from ReplicaSyncTaskAsync's finally block, giving a single disposal site.

Two related bugs in AofTaskStore caused unbounded accumulation
of AofSyncTaskInfo tasks on clusters using diskless replication.

1. TryAddReplicationTasks (the diskless path) compared existing
   tasks against rss.replicaNodeId for dedup. ReplicaSyncSession
   has two node ID fields: replicaNodeId (set by the disk-based
   constructor, null for diskless) and replicaSyncMetadata.originNodeId
   (set by the diskless constructor). The AofSyncTaskInfo was
   created with originNodeId, but dedup compared against the null
   replicaNodeId — so it never matched and every call added a new
   task. Over time numTasks grew unboundedly, inflating the
   RoleInfo[] from INFO REPLICATION until the response exceeded
   the network output buffer.

   Fix: use rss.replicaSyncMetadata.originNodeId in the dedup
   comparison. The singular TryAddReplicationTask (disk-based
   and CLUSTER AOFSYNC) is unaffected.

2. AofSyncTaskInfo.Dispose() did not dispose its owned
   GarnetClientSession. When ReplicaSyncTaskAsync is running,
   CTS cancellation causes it to exit and the finally block
   cleans up. But when ReplicaSyncTaskAsync has not yet started
   (e.g. the task fails to be added), Dispose() is the only
   cleanup path and the session was leaked.

   Fix: add garnetClient?.Dispose() to AofSyncTaskInfo.Dispose()
   and remove the redundant call from ReplicaSyncTaskAsync's
   finally block, giving a single disposal site.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 11, 2026 16:13

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes two replication/AOF-related resource-management bugs that could cause unbounded growth of AOF sync tasks and leaked client sessions in cluster diskless replication, impacting INFO REPLICATION response size and network resource usage.

Changes:

  • Fix diskless replication task deduplication by comparing against replicaSyncMetadata.originNodeId (instead of the null replicaNodeId in diskless mode).
  • Ensure AofSyncTaskInfo.Dispose() disposes its owned GarnetClientSession, and remove the redundant disposal in ReplicaSyncTaskAsync’s finally.
  • Add a focused cluster test validating that AofSyncTaskInfo.Dispose() releases the client session.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
test/Garnet.test.cluster/ReplicationTests/AofSyncTaskInfoTests.cs Adds a unit-style test to prevent regressions in AofSyncTaskInfo session disposal.
libs/cluster/Server/Replication/PrimaryOps/AofTaskStore.cs Fixes diskless replication task dedup to prevent unbounded task accumulation.
libs/cluster/Server/Replication/PrimaryOps/AofSyncTaskInfo.cs Centralizes GarnetClientSession disposal in AofSyncTaskInfo.Dispose() and removes redundant disposal in the task loop.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread libs/cluster/Server/Replication/PrimaryOps/AofSyncTaskInfo.cs Outdated
nattress and others added 3 commits May 11, 2026 09:49
Clarify that the client disposal is no longer happening in the finally block.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Apply [AllureNUnit] attribute and inherit AllureTestBase to
match repo test conventions required by CI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
{
[TestFixture]
[NonParallelizable]
public class AofSyncTaskInfoTests

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to make this part of an existing test category? Maybe Negative or Management Tests?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added [Category(\REPLICATION)]\ to the test method, plus [AllureNUnit]\ and \AllureTestBase\ inheritance on the class to match the repo test conventions.

Categorize the test to match the existing replication test
convention in the cluster test project.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Mathos1432

Copy link
Copy Markdown
Contributor

I recreated your PR after the v2 fork

release/v1: #1791
main: #1794

@nattress nattress closed this May 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants