Skip to content

Fix copy-paste bug in disk-based replica sync sending wrong AOF address#1633

Merged
vazois merged 1 commit into
vazois/mmrt-devfrom
vazois/fix-sr-no-checkpoint-restart
Mar 16, 2026
Merged

Fix copy-paste bug in disk-based replica sync sending wrong AOF address#1633
vazois merged 1 commit into
vazois/mmrt-devfrom
vazois/fix-sr-no-checkpoint-restart

Conversation

@vazois

@vazois vazois commented Mar 16, 2026

Copy link
Copy Markdown
Contributor

Problem

ClusterSRNoCheckpointRestartSecondary tests fail because the replica remains stuck at AOF offset 64 (kFirstValidAofAddress) after restart, never syncing with the primary.

Error: [127.0.0.1:7000]: 1,140288 != [127.0.0.1:7001]: 1,64

Root Cause

Copy-paste bug in ReplicaDiskbasedSync.TryReplicateDiskbasedSync() introduced in commit 6fb99e58 when converting from ToByteArray() to Span-based calls. The aofTailAddress parameter was accidentally set to beginAddress.Span instead of tailAddress.Span.

The primary uses the replica's tail address in ComputeAofSyncReplayAddress() to determine the AOF sync replay range. With both parameters being the begin address, the primary couldn't determine where the replica's AOF ended, so the replica never received AOF records after restart.

Fix

One-line fix: beginAddress.Span to tailAddress.Span for the aofTailAddress parameter of ExecuteClusterInitiateReplicaSync.

Testing

All 4 ClusterReplicationTLS.ClusterSRNoCheckpointRestartSecondary variants now pass.

…k-based replica sync

In TryReplicateDiskbasedSync, ExecuteClusterInitiateReplicaSync was
sending beginAddress.Span for both the aofBeginAddress and aofTailAddress
parameters. This was introduced in commit 6fb99e5 when converting from
ToByteArray() to Span-based calls.

The primary uses the replica's tail address to compute the AOF sync replay
range. With both parameters being the begin address, the primary couldn't
determine where the replica's AOF actually ended, causing the replica to
never receive AOF records and remain stuck at offset 64 (kFirstValidAofAddress).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@vazois vazois merged commit 3bbec09 into vazois/mmrt-dev Mar 16, 2026
1 check passed
@vazois vazois deleted the vazois/fix-sr-no-checkpoint-restart branch March 16, 2026 22:52
vazois added a commit that referenced this pull request Apr 27, 2026
* implement sequence number generator

* More work on LogRecord.RecordLength

* format code

* Finish LogRecord.RecordLength

* add checkpoint marker process coordination

* add fuzzy region flag to replay context

* trigger exception when remove of checkpoint barrier fails

* throw exception if SignalAndWait times out

* add flag to differentiate between sharded and single log AOF

* AOF explorer

* wait for aofSync and refreshTail tasks on Dispose

* validate base replication tests v1

* Remove ObjectDb* from Sample*Txn

* test fixes for disk-based replication v2

* add synchronized replay for remaining operations (i.e. FLUSHDB, STREAMING_CHECKPOINT)

* sequential node dispose on teardown

* ensure dispose client precedes iterator

* MultiLevelPageArray stress fixes; improved LogRecord and RecordDataHeader ToString()

* WIP

* Fix GetObjectLogSegmentSize()

* wait for aof sync driver and connection to drop before recovering

* consolidate reset replica replay task group

* update docs

* Fix a couple comments

* address data loss on synchronization

* use unified synchronized operation for regulard txn replay

* consistent read for scan commands

* Trim ObjectIdMap arrays on ClearPage

* remove unused variable

* fix single log tests

* fix sharded log tests

* adding comments to AofReplayCoordinator

* adjust ROLE command response for sharded log implementation

* WIP; multi phase read consistency protocol

* added basic consistent read protocol context

* Add transactional consistent read context

* md readmed for mlog design

* add design notes

* WIP; change session at state change

* allow disklogRecord upserts for streamless sync

* implement consistent read protocol for scan commands

* add sharded diskless sync tests

* eliminated interface declaration for APIs with consistent read

* cleanup and refactoring

* remove sharded option since we have AofBench

* update tail witness task to ping replica for latest replayed sequence number information

* remove unused code

* wip; logical subtask replay

* WIP; re-design AofHeaders

* WIP; update sequence replay info

* wip; v1 parallel replay within physical sublog

* wip; running sharded log tests with single replay task

* wip; refactor to use virtual sublog replay state and a virtual sublog idx calculation

* wip; update coordinated operations to trigger when multilog is enabled

* wip; add multi-replay test for single key operations

* wip; cleanup

* remove AofExplorer

* wip; index coordination barriers with distinct key to mixing separate transactions on same session

* add indexing for coordination barriers using sequence number

* rename sublog count to physical sublog count

* remove cproj ref

* separate AofBenchType

* wip; mlog description

* wip; mlog description

* wip; adding comments and writing dev replication website content

* wip; updating documentation and description in code

* wip; AofSyncTask section

* wip; replay details summary and fixes

* wip; renaming and updating todo items

* wip; skip hashing when configured with one physical sublog and multiple replay tasks

* add test for custom txn and multi-exec txn with parallel replay

* refactor/rename and cleanup replica replay code

* wip; adding bulk consume recovery

* renaming and cleanup

* simplify intra page replay and synchronization primitive

* implement recover intra page replay

* assert throw when configuring multi-log without FastCommit

* add description of prefix consistent recovery

* push log specific operations in GarnetLog instance

* wip;

* implement prefix consistent recovery

* refactor code to use single log when multi-replay is enabled with one physical log

* remove cookie generation callbacks

* cleanup and documentation write-up

* more documentation

* refactor advance time background task

* TaskCompletionSource based wait for read consistent read

* fix read protocol edge cases

* code comments and cleanup

* add cluster command to retrieve key sequence number for debugging purposes

* PR description initial draft

* advance time using high precision timestamp and background task

* PR details

* fix formatting

* revert HashUtils location

* HashUtils mv

* fix formatting

* cleanup

* single log legacy implementation updates

* add docs for mlog key time

* legacy checkpoint cookie serialization support

* fix test delay

* run aofsynctask directly with single log

* append replication vector to ROLE command when using multi-log

* add support for intra-cluster connection identification and fix failing management tests

* set client info when clientname parameter != null

* support negative values in AofAddress

* add ConsistentReadContext definition into Globals

* fix formatting

* fix ClusterDisklessSyncResetSyncManagerCts

* fix ClusterResetHardDuringDisklessReplicationAttach

* fix consistent read context bugs

* wip; refactor read-protocol integration

* Implement read key batch consistent protocol

* improve robustness of background advance time consumer

* remove unused playground project

* cleanup clientname setup

* fix tsavorite ISessionFunctions

* remove unused assembly

* simplify AofAddress and check for correctness

* fix typo

* eliminate bytes from informational message

* simplify and cleanup leader barrier

* cleanup LeaderFollowerBarrier

* ensure commit for multilog recovery

* replace tcs with semaphore slim

* fix ClusterReplicationCheckpointCleanupTest

* Fix copy-paste bug sending beginAddress instead of tailAddress in disk-based replica sync (#1633)

In TryReplicateDiskbasedSync, ExecuteClusterInitiateReplicaSync was
sending beginAddress.Span for both the aofBeginAddress and aofTailAddress
parameters. This was introduced in commit 6fb99e5 when converting from
ToByteArray() to Span-based calls.

The primary uses the replica's tail address to compute the AOF sync replay
range. With both parameters being the begin address, the primary couldn't
determine where the replica's AOF actually ended, causing the replica to
never receive AOF records and remain stuck at offset 64 (kFirstValidAofAddress).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix ClusterResetHardDuring tests by using ConfigureAwait for ExceptionInjection

* enforce ConfigureAwait(false) across Garnet

* fix AofBench

* add website description

* add migration GarnetLog upgrade test

* fix formatting

* fix website links

* add object option for replication scheme upgrade test

* set default logging to Error

* pre-compute enforceConsistentRead

* add AOF option for cluster BDN

* update global.json

* consolidate enqueue calls

* fix upgrade test null check

* add custom stop bulk consume callback

* fixing ClusterReplicationMultiRestartRecover

* fixing base replication tests

* use throttle to check consumer liveness during bulk consume

* refactor advance time background task

* ensure WorkQueueLIFO waits for process bg task to complete on dispose

* add activeWorkerMonitor

* cleanup and refactoring

* wip; adding tracking

* fix hanging ClusterReplicationCheckpointCleanupTest

* fix Tsavorite formatting

* fix Garnet formatting

* fix breaking build

* improve log message

* throw the caught AsyncFlushSnapshotPage exception

* ensure failover suspends replica only tasks

* usingSharded when replaying with multiple tasks

* add no-inlining exception throw

* fix AofBench to work with AofReplayCount

* wip; add variant to parallel replay a physical sublog

* wip; fixing ClusterSRNoCheckpointRestartSecondary

* fix defaults

* classify sharded tests under same namespace

* throw at TakeOverAsPrimary to fix test

* log startAddress on dispose of AofSyncDriver

* add better logging at ClusterAppendLog initialization

* reuse parseState at replayOp

* avoid cache invalidation when updating replicationOffset

* ensure get slots parsing deals with unstable slots

* nit; cleanup

* fix formatting

* better logging

* eliminate monotonic updates on the enqueue hot path when tracking sequence numbers for advance time.

* make aof-replay-task-count default=1 for AofBench

* add ReplayNoResp

* add ReplayDirect

* fix Enqueue Bench

* add better logging for replica receive primary AOF stream

* fix InProc Bench wait for server

* implemented targetted logging API for testing

* enable LogPrimaryStream for ClusterSRNoCheckpointRestartSecondary

* refactor cluster cmdstrings

* remove unused methods

* cleanup

* fix logging event flag checking

* update aof bench

* optimistically return if not ReadWaiters

* improving AofBench

* Pass around keyHash and re-use it during replay

* remove notes

* increase thread pool for cluster tls replication

* expand logging at AofSyncTask initialization

* wip; cleanup replay path

* use owned epoch for multilog

* use owned epoch for tsavorite log and cleanup aof recover

* make vector fixes for replication

* fix aof gen

* JIT replayOp refactor

* fix range index AOF enqueue

* use barrier on AofSync connection with multi-log only

* adding async for connect and gossip

* addressing first set of comments

* addressing second set of comments

* add throw cancellation

* make light epoch in client compatible with tsavorite epoch implementation

* simplify object RMW enqueue

* separate appendLog initialization

---------

Co-authored-by: TedHartMS <15467143+TedHartMS@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions Bot locked and limited conversation to collaborators May 16, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant