Fix GarnetClientSession leak and diskless replication dedup failure#1787
Fix GarnetClientSession leak and diskless replication dedup failure#1787nattress wants to merge 5 commits into
Conversation
Two related bugs in AofTaskStore caused unbounded accumulation of AofSyncTaskInfo tasks on clusters using diskless replication. 1. TryAddReplicationTasks (the diskless path) compared existing tasks against rss.replicaNodeId for dedup. ReplicaSyncSession has two node ID fields: replicaNodeId (set by the disk-based constructor, null for diskless) and replicaSyncMetadata.originNodeId (set by the diskless constructor). The AofSyncTaskInfo was created with originNodeId, but dedup compared against the null replicaNodeId — so it never matched and every call added a new task. Over time numTasks grew unboundedly, inflating the RoleInfo[] from INFO REPLICATION until the response exceeded the network output buffer. Fix: use rss.replicaSyncMetadata.originNodeId in the dedup comparison. The singular TryAddReplicationTask (disk-based and CLUSTER AOFSYNC) is unaffected. 2. AofSyncTaskInfo.Dispose() did not dispose its owned GarnetClientSession. When ReplicaSyncTaskAsync is running, CTS cancellation causes it to exit and the finally block cleans up. But when ReplicaSyncTaskAsync has not yet started (e.g. the task fails to be added), Dispose() is the only cleanup path and the session was leaked. Fix: add garnetClient?.Dispose() to AofSyncTaskInfo.Dispose() and remove the redundant call from ReplicaSyncTaskAsync's finally block, giving a single disposal site. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Fixes two replication/AOF-related resource-management bugs that could cause unbounded growth of AOF sync tasks and leaked client sessions in cluster diskless replication, impacting INFO REPLICATION response size and network resource usage.
Changes:
- Fix diskless replication task deduplication by comparing against
replicaSyncMetadata.originNodeId(instead of the nullreplicaNodeIdin diskless mode). - Ensure
AofSyncTaskInfo.Dispose()disposes its ownedGarnetClientSession, and remove the redundant disposal inReplicaSyncTaskAsync’sfinally. - Add a focused cluster test validating that
AofSyncTaskInfo.Dispose()releases the client session.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| test/Garnet.test.cluster/ReplicationTests/AofSyncTaskInfoTests.cs | Adds a unit-style test to prevent regressions in AofSyncTaskInfo session disposal. |
| libs/cluster/Server/Replication/PrimaryOps/AofTaskStore.cs | Fixes diskless replication task dedup to prevent unbounded task accumulation. |
| libs/cluster/Server/Replication/PrimaryOps/AofSyncTaskInfo.cs | Centralizes GarnetClientSession disposal in AofSyncTaskInfo.Dispose() and removes redundant disposal in the task loop. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Clarify that the client disposal is no longer happening in the finally block. Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Apply [AllureNUnit] attribute and inherit AllureTestBase to match repo test conventions required by CI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| { | ||
| [TestFixture] | ||
| [NonParallelizable] | ||
| public class AofSyncTaskInfoTests |
There was a problem hiding this comment.
Is it possible to make this part of an existing test category? Maybe Negative or Management Tests?
There was a problem hiding this comment.
Added [Category(\REPLICATION)]\ to the test method, plus [AllureNUnit]\ and \AllureTestBase\ inheritance on the class to match the repo test conventions.
Categorize the test to match the existing replication test convention in the cluster test project. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two related bugs in AofTaskStore caused unbounded accumulation of AofSyncTaskInfo tasks on clusters using diskless replication.
TryAddReplicationTasks (the diskless path) compared existing tasks against rss.replicaNodeId for dedup. ReplicaSyncSession has two node ID fields: replicaNodeId (set by the disk-based constructor, null for diskless) and replicaSyncMetadata.originNodeId (set by the diskless constructor). The AofSyncTaskInfo was created with originNodeId, but dedup compared against the null replicaNodeId — so it never matched and every call added a new task. Over time numTasks grew unboundedly, inflating the RoleInfo[] from INFO REPLICATION until the response exceeded the network output buffer.
Fix: use rss.replicaSyncMetadata.originNodeId in the dedup comparison. The singular TryAddReplicationTask (disk-based and CLUSTER AOFSYNC) is unaffected.
AofSyncTaskInfo.Dispose() did not dispose its owned GarnetClientSession. When ReplicaSyncTaskAsync is running, CTS cancellation causes it to exit and the finally block cleans up. But when ReplicaSyncTaskAsync has not yet started (e.g. the task fails to be added), Dispose() is the only cleanup path and the session was leaked.
Fix: add garnetClient?.Dispose() to AofSyncTaskInfo.Dispose() and remove the redundant call from ReplicaSyncTaskAsync's finally block, giving a single disposal site.