Support for Parallel Replication #1556
Merged
badrishc
reviewed
Apr 23, 2026
Contributor
Pull request overview
Copilot reviewed 169 out of 172 changed files in this pull request and generated 7 comments.
badrishc
approved these changes
Apr 27, 2026
This was referenced May 12, 2026
Mathos1432
added a commit
that referenced
this pull request
May 14, 2026
…idempotency
- Use ActiveWorkerMonitor (backported from PR #1556) to drain in-flight ReplicaSyncTaskAsync before disposing GarnetClientSession, eliminating cross-thread dispose races against ExecuteClusterAppendLog.
- Defensively call Dispose() in ReplicaSyncTaskAsync's finally when TryRemove returns false, guarding against future removal sites that forget to dispose.
- Make AofSyncTaskInfo.Dispose() idempotent (Interlocked guard) so multiple disposal sites cannot trigger ObjectDisposedException from cts.Cancel after cts.Dispose.
- Drop the unreachable enteredMonitor flag; control only enters the try block when TryEnter succeeded.
- Test: extend DisposeReleasesGarnetClientSession to assert idempotency.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc
pushed a commit
that referenced
this pull request
May 15, 2026
… and dispose race in AofSyncTaskInfo (#1791)

* Fix GarnetClientSession leak and diskless replication dedup failure

  Two related bugs in AofTaskStore caused unbounded accumulation of AofSyncTaskInfo tasks on clusters using diskless replication.

  1. TryAddReplicationTasks (the diskless path) compared existing tasks against rss.replicaNodeId for dedup. ReplicaSyncSession has two node ID fields: replicaNodeId (set by the disk-based constructor, null for diskless) and replicaSyncMetadata.originNodeId (set by the diskless constructor). The AofSyncTaskInfo was created with originNodeId, but dedup compared against the null replicaNodeId, so it never matched and every call added a new task. Over time numTasks grew unboundedly, inflating the RoleInfo[] from INFO REPLICATION until the response exceeded the network output buffer.
     Fix: use rss.replicaSyncMetadata.originNodeId in the dedup comparison. The singular TryAddReplicationTask (disk-based and CLUSTER AOFSYNC) is unaffected.

  2. AofSyncTaskInfo.Dispose() did not dispose its owned GarnetClientSession. When ReplicaSyncTaskAsync is running, CTS cancellation causes it to exit and the finally block cleans up. But when ReplicaSyncTaskAsync has not yet started (e.g. the task fails to be added), Dispose() is the only cleanup path and the session was leaked.
     Fix: add garnetClient?.Dispose() to AofSyncTaskInfo.Dispose() and remove the redundant call from ReplicaSyncTaskAsync's finally block, giving a single disposal site.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* AofSyncTaskInfo termination log fix

  Clarify that the client disposal is no longer happening in the finally block.

  Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Fix formatting issues from CI

* Add Allure attributes to AofSyncTaskInfoTests

  Apply [AllureNUnit] attribute and inherit AllureTestBase to match repo test conventions required by CI.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add REPLICATION category to AofSyncTaskInfoTests

  Categorize the test to match the existing replication test convention in the cluster test project.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Backport ActiveWorkerMonitor

* Address review feedback: drain in-flight workers, defensive Dispose, idempotency

  - Use ActiveWorkerMonitor (backported from PR #1556) to drain in-flight ReplicaSyncTaskAsync before disposing GarnetClientSession, eliminating cross-thread dispose races against ExecuteClusterAppendLog.
  - Defensively call Dispose() in ReplicaSyncTaskAsync's finally when TryRemove returns false, guarding against future removal sites that forget to dispose.
  - Make AofSyncTaskInfo.Dispose() idempotent (Interlocked guard) so multiple disposal sites cannot trigger ObjectDisposedException from cts.Cancel after cts.Dispose.
  - Drop the unreachable enteredMonitor flag; control only enters the try block when TryEnter succeeded.
  - Test: extend DisposeReleasesGarnetClientSession to assert idempotency.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Bump version to 1.1.6.2 and surface Revision in startup log

  Version.props now declares 1.1.6.2 (was 1.1.6) so that nuget packages produced from this branch carry the patch identifier. GarnetServer.GetVersion() previously returned only Major.Minor.Build, which caused 1.1.6.2 builds to log themselves as 1.1.6 at startup. Append the Revision when non-zero so the startup log matches the assembly version.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert "Bump version to 1.1.6.2 and surface Revision in startup log"

  This reverts commit 54cbd42. The version bump and the GetVersion() Revision-aware formatting are out of scope for this PR; the version bump should land on its own.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Bump version from 1.1.7 to 1.1.8

  release/v1 just bumped to 1.1.7 in #1792. Bump again to 1.1.8 as part of this PR so the package built from this branch sorts above the latest published version.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* retrigger CI

* trigger CI

---------

Co-authored-by: Simon Nattress <simonn@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Simon Nattress <nattress@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Vasileios Zois <vazois@microsoft.com>
Multi-Log Parallel Replication Feature
This PR introduces multi-log based Append-Only File (AOF) support to Garnet, enhancing write throughput and enabling optimized parallel replication replay. The feature leverages multiple physical `TsavoriteLog` instances to shard write operations and parallelize log scanning, shipping, and replay across multiple connections and iterators. While designed primarily for cluster mode replication, this feature can also be used in standalone mode to improve performance when AOF is enabled.
Feature Requirements
1. Sharded AOF Architecture
- `TsavoriteLog` instances.
2. Flexible Parallel Replay with Tunable Task Granularity
3. Read Consistency Protocol
4. Transaction Support
5. Fast Prefix-Consistent Recovery
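To make requirements 1 and 5 concrete, here is a minimal sketch of how sharded appends and a prefix-consistent recovery cutoff could fit together. It is written in Python for brevity and is not Garnet's implementation: `ShardedLogSketch`, `prefix_consistent_replay`, the CRC32 key routing, and the "minimum durable tail" cutoff rule are all assumptions made for illustration. The only detail taken from this PR description is that recovery replays entries whose sequence number is below an `untilSequenceNumber`.

```python
import itertools
import zlib


class ShardedLogSketch:
    """Hypothetical multi-sublog append log; not Garnet's actual ShardedLog."""

    def __init__(self, sublog_count):
        self.sublogs = [[] for _ in range(sublog_count)]
        self._next_seq = itertools.count(1)  # one global sequence number stream

    def append(self, key, payload):
        """Assign a global sequence number and append to the key's sublog."""
        seq = next(self._next_seq)
        idx = zlib.crc32(key) % len(self.sublogs)  # assumed routing: hash of key
        self.sublogs[idx].append((seq, key, payload))
        return idx, seq


def prefix_consistent_replay(durable_sublogs):
    """Return (until, entries) where every entry has sequence number < until.

    Assumed cutoff rule: one past the smallest durable tail across sublogs,
    so no entry below the cutoff can be missing from any sublog.
    """
    tails = [log[-1][0] if log else 0 for log in durable_sublogs]
    until = min(tails) + 1
    entries = sorted(e for log in durable_sublogs for e in log if e[0] < until)
    return until, entries
```

In this sketch an empty sublog pins the cutoff at 1, so nothing is replayed; a real design needs some way to advance the tail of idle sublogs.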
Newly Introduced Configuration Parameters
- `AofPhysicalSublogCount`: `TsavoriteLog` instances
- `AofReplayTaskCount`
- `AofRefreshPhysicalSublogTailFrequencyMs`
Implementation Plan
Phase 1: Core Infrastructure
1.1 Implement `AofHeader` extensions to eliminate single log overhead.
- `ShardedHeader` for standalone operations.
- `TransactionHeader` for coordinated operations.
1.2 Implement `GarnetLog` abstraction layer.
- `SingleLog` wrapper for legacy single log.
- `ShardedLog` implementation for multi-log.
1.3 `SequenceNumberGenerator` class.
Phase 2: Primary Replication Stream
2.1 `AofSyncDriver` class.
- `AofSyncDriver` per attached replica.
- `AofSyncTask` per physical sublog.
- `AdvanceTime` background task per attached replica.
2.2 `AofSyncTask` class.
2.3 `AdvanceTime` background task.
Phase 3: Replica Replay Stream
3.1 `ReplicaReplayDriver` class.
- `ReplicaReplayTask` for parallel replay within a single physical sublog.
3.2 `ReplicaReplayTask` class.
3.3 Standalone operation replay
- `BasicContext` or `TransactionalContext`).
3.4 Multi-exec transaction replay
3.5 Custom transaction procedure replay
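The idea of parallel replay within a single physical sublog can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the `ReplicaReplayDriver` or `ReplicaReplayTask` code from this PR: the function name `replay_sublog`, the per-task queues, and routing entries to tasks by a CRC32 hash of the key are all hypothetical. The point it shows is that routing every entry for a given key to the same task keeps per-key order intact while different keys replay concurrently.

```python
import queue
import threading
import zlib


def replay_sublog(entries, replay_task_count):
    """Replay (key, value) entries from one sublog using several worker tasks.

    Assumption: entries for the same key always go to the same task, so the
    last write per key wins in log order even though tasks run in parallel.
    """
    queues = [queue.Queue() for _ in range(replay_task_count)]
    results = [dict() for _ in range(replay_task_count)]  # one store per task

    def worker(q, applied):
        while True:
            item = q.get()
            if item is None:  # sentinel: no more entries for this task
                break
            key, value = item
            applied[key] = value

    threads = [
        threading.Thread(target=worker, args=(q, r))
        for q, r in zip(queues, results)
    ]
    for t in threads:
        t.start()
    for key, value in entries:  # scan the sublog in order and fan out
        queues[zlib.crc32(key) % replay_task_count].put((key, value))
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()

    merged = {}
    for r in results:  # key sets are disjoint across tasks
        merged.update(r)
    return merged
```

The task count plays the role that a setting such as `AofReplayTaskCount` presumably plays in the PR; the replayed state is the same for any count, only the degree of parallelism changes.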
Phase 4: Read Consistency Protocol
4.1 `ReadConsistencyManager` class
- `VirtualSublogReplayState` struct using sketch arrays for key freshness tracking and sequence number frontier computation.
4.2 Session based prefix consistency enforcement
- `ConsistentReadGarnetApi` and `TransactionalConsistentReadGarnetApi` to allow the JIT compiler to optimize operational calls.
- `ValidateKeySequenceNumber`, `UpdateKeySequenceNumber`).
- `ReplicaReadSessionContext` struct used to `maximumSessionSequenceNumber` metadata (i.e. `sessionVersion`, `lastHash`, `lastVirtualSublogIdx`) to enforce prefix consistency when is stable or during recovery
Phase 5: Prefix consistent recovery
5.1 Commit operation
- `GarnetLog` layer instead of within `TsavoriteLog` to control commits across sublogs.
5.2 `RecoverLogDriver` implementation
- `sequenceNumber < untilSequenceNumber`.
- `ReadConsistencyManager` state at recovery to initialize `SequenceNumberGenerator`.
Phase 6: Testing & Validation
NOTES
Prefix Consistent Single Key Read Protocol
Prefix Consistent Batch Read
[dev]
[vazois/mmrt-dev]
TODO