Fix copy-paste bug in disk-based replica sync sending wrong AOF address#1633
Merged
Merged
Conversation
…k-based replica sync In TryReplicateDiskbasedSync, ExecuteClusterInitiateReplicaSync was sending beginAddress.Span for both the aofBeginAddress and aofTailAddress parameters. This was introduced in commit 6fb99e5 when converting from ToByteArray() to Span-based calls. The primary uses the replica's tail address to compute the AOF sync replay range. With both parameters being the begin address, the primary couldn't determine where the replica's AOF actually ended, causing the replica to never receive AOF records and remain stuck at offset 64 (kFirstValidAofAddress). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
vazois
added a commit
that referenced
this pull request
Apr 27, 2026
* implement sequence number generator * More work on LogRecord.RecordLength * format code * Finish LogRecord.RecordLength * add checkpoint marker process coordination * add fuzzy region flag to replay context * trigger exception when remove of checkpoint barrier fails * throw exception if SignalAndWait times out * add flag to differentiate between sharded and single log AOF * AOF explorer * wait for aofSync and refreshTail tasks on Dispose * validate base replication tests v1 * Remove ObjectDb* from Sample*Txn * test fixes for disk-based replication v2 * add synchronized replay for remaining operations (i.e. FLUSHDB, STREAMING_CHECKPOINT) * sequential node dispose on teardown * ensure dispose client precedes iterator * MultiLevelPageArray stress fixes; improved LogRecord and RecordDataHeader ToString() * WIP * Fix GetObjectLogSegmentSize() * wait for aof sync driver and connection to drop before recovering * consolidate reset replica replay task group * update docs * Fix a couple comments * address data loss on synchronization * use unified synchronized operation for regulard txn replay * consistent read for scan commands * Trim ObjectIdMap arrays on ClearPage * remove unused variable * fix single log tests * fix sharded log tests * adding comments to AofReplayCoordinator * adjust ROLE command response for sharded log implementation * WIP; multi phase read consistency protocol * added basic consistent read protocol context * Add transactional consistent read context * md readmed for mlog design * add design notes * WIP; change session at state change * allow disklogRecord upserts for streamless sync * implement consistent read protocol for scan commands * add sharded diskless sync tests * eliminated interface declaration for APIs with consistent read * cleanup and refactoring * remove sharded option since we have AofBench * update tail witness task to ping replica for latest replayed sequence number information * remove unused code * wip; logical subtask replay * WIP; re-design AofHeaders * WIP; update sequence replay info * wip; v1 parallel replay within physical sublog * wip; running sharded log tests with single replay task * wip; refactor to use virtual sublog replay state and a virtual sublog idx calculation * wip; update coordinated operations to trigger when multilog is enabled * wip; add multi-replay test for single key operations * wip; cleanup * remove AofExplorer * wip; index coordination barriers with distinct key to mixing separate transactions on same session * add indexing for coordination barriers using sequence number * rename sublog count to physical sublog count * remove cproj ref * separate AofBenchType * wip; mlog description * wip; mlog description * wip; adding comments and writing dev replication website content * wip; updating documentation and description in code * wip; AofSyncTask section * wip; replay details summary and fixes * wip; renaming and updating todo items * wip; skip hashing when configured with one physical sublog and multiple replay tasks * add test for custom txn and multi-exec txn with parallel replay * refactor/rename and cleanup replica replay code * wip; adding bulk consume recovery * renaming and cleanup * simplify intra page replay and synchronization primitive * implement recover intra page replay * assert throw when configuring multi-log without FastCommit * add description of prefix consistent recovery * push log specific operations in GarnetLog instance * wip; * implement prefix consistent recovery * refactor code to use single log when multi-replay is enabled with one physical log * remove cookie generation callbacks * cleanup and documentation write-up * more documentation * refactor advance time background task * TaskCompletionSource based wait for read consistent read * fix read protocol edge cases * code comments and cleanup * add cluster command to retrieve key sequence number for debugging purposes * PR description initial draft * advance time using high precision timestamp and background task * PR details * fix formatting * revert HashUtils location * HashUtils mv * fix formatting * cleanup * single log legacy implementation updates * add docs for mlog key time * legacy checkpoint cookie serialization support * fix test delay * run aofsynctask directly with single log * append replication vector to ROLE command when using multi-log * add support for intra-cluster connection identification and fix failing management tests * set client info when clientname parameter != null * support negative values in AofAddress * add ConsistentReadContext definition into Globals * fix formatting * fix ClusterDisklessSyncResetSyncManagerCts * fix ClusterResetHardDuringDisklessReplicationAttach * fix consistent read context bugs * wip; refactor read-protocol integration * Implement read key batch consistent protocol * improve robustness of background advance time consumer * remove unused playground project * cleanup clientname setup * fix tsavorite ISessionFunctions * remove unused assembly * simplify AofAddress and check for correctness * fix typo * eliminate bytes from informational message * simplify and cleanup leader barrier * cleanup LeaderFollowerBarrier * ensure commit for multilog recovery * replace tcs with semaphore slim * fix ClusterReplicationCheckpointCleanupTest * Fix copy-paste bug sending beginAddress instead of tailAddress in disk-based replica sync (#1633) In TryReplicateDiskbasedSync, ExecuteClusterInitiateReplicaSync was sending beginAddress.Span for both the aofBeginAddress and aofTailAddress parameters. This was introduced in commit 6fb99e5 when converting from ToByteArray() to Span-based calls. The primary uses the replica's tail address to compute the AOF sync replay range. With both parameters being the begin address, the primary couldn't determine where the replica's AOF actually ended, causing the replica to never receive AOF records and remain stuck at offset 64 (kFirstValidAofAddress). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix ClusterResetHardDuring tests by using ConfigureAwait for ExceptionInjection * enforce ConfigureAwait(false) across Garnet * fix AofBench * add website description * add migration GarnetLog upgrade test * fix formatting * fix website links * add object option for replication scheme upgrade test * set default logging to Error * pre-compute enforceConsistentRead * add AOF option for cluster BDN * update global.json * consolidate enqueue calls * fix upgrade test null check * add custom stop bulk consume callback * fixing ClusterReplicationMultiRestartRecover * fixing base replication tests * use throttle to check consumer liveness during bulk consume * refactor advance time background task * ensure WorkQueueLIFO waits for process bg task to complete on dispose * add activeWorkerMonitor * cleanup and refactoring * wip; adding tracking * fix hanging ClusterReplicationCheckpointCleanupTest * fix Tsavorite formatting * fix Garnet formatting * fix breaking build * improve log message * throw the caught AsyncFlushSnapshotPage exception * ensure failover suspends replica only tasks * usingSharded when replaying with multiple tasks * add no-inlining exception throw * fix AofBench to work with AofReplayCount * wip; add variant to parallel replay a physical sublog * wip; fixing ClusterSRNoCheckpointRestartSecondary * fix defaults * classify sharded tests under same namespace * throw at TakeOverAsPrimary to fix test * log startAddress on dispose of AofSyncDriver * add better logging at ClusterAppendLog initialization * reuse parseState at replayOp * avoid cache invalidation when updating replicationOffset * ensure get slots parsing deals with unstable slots * nit; cleanup * fix formatting * better logging * eliminate monotonic updates on the enqueue hot path when tracking sequence numbers for advance time. * make aof-replay-task-count default=1 for AofBench * add ReplayNoResp * add ReplayDirect * fix Enqueue Bench * add better logging for replica receive primary AOF stream * fix InProc Bench wait for server * implemented targetted logging API for testing * enable LogPrimaryStream for ClusterSRNoCheckpointRestartSecondary * refactor cluster cmdstrings * remove unused methods * cleanup * fix logging event flag checking * update aof bench * optimistically return if not ReadWaiters * improving AofBench * Pass around keyHash and re-use it during replay * remove notes * increase thread pool for cluster tls replication * expand logging at AofSyncTask initialization * wip; cleanup replay path * use owned epoch for multilog * use owned epoch for tsavorite log and cleanup aof recover * make vector fixes for replication * fix aof gen * JIT replayOp refactor * fix range index AOF enqueue * use barrier on AofSync connection with multi-log only * adding async for connect and gossip * addressing first set of comments * addressing second set of comments * add throw cancellation * make light epoch in client compatible with tsavorite epoch implementation * simplify object RMW enqueue * separate appendLog initialization --------- Co-authored-by: TedHartMS <15467143+TedHartMS@users.noreply.github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
ClusterSRNoCheckpointRestartSecondarytests fail because the replica remains stuck at AOF offset 64 (kFirstValidAofAddress) after restart, never syncing with the primary.Error:
[127.0.0.1:7000]: 1,140288 != [127.0.0.1:7001]: 1,64Root Cause
Copy-paste bug in
ReplicaDiskbasedSync.TryReplicateDiskbasedSync()introduced in commit6fb99e58when converting fromToByteArray()toSpan-based calls. TheaofTailAddressparameter was accidentally set tobeginAddress.Spaninstead oftailAddress.Span.The primary uses the replica's tail address in
ComputeAofSyncReplayAddress()to determine the AOF sync replay range. With both parameters being the begin address, the primary couldn't determine where the replica's AOF ended, so the replica never received AOF records after restart.Fix
One-line fix:
beginAddress.SpantotailAddress.Spanfor theaofTailAddressparameter ofExecuteClusterInitiateReplicaSync.Testing
All 4
ClusterReplicationTLS.ClusterSRNoCheckpointRestartSecondaryvariants now pass.