Skip to content

Fix two rare CI failures [dev]#1765

Merged
badrishc merged 1 commit into
devfrom
badrishc/fix-rare-failing-tests
May 3, 2026
Merged

Fix two rare CI failures [dev]#1765
badrishc merged 1 commit into
devfrom
badrishc/fix-rare-failing-tests

Conversation

@badrishc

@badrishc badrishc commented May 1, 2026

Copy link
Copy Markdown
Collaborator

ListPushPopStressTest host crash and VectorManager cleanup vs Reset() AVE

Two independent rare CI failures, both surfacing as Test host process crashed and aborting the whole test run.

1. ClusterVectorSetTests.MigrateVectorSetWhileModifyingAsync — fatal AccessViolationException in VectorManager cleanup task

Symptom

Passed Garnet.test.cluster.ClusterVectorSetTests.MigrateVectorSetWhileModifyingAsync [12 s]
Fatal error. System.AccessViolationException: Attempted to read or write protected memory.
   at Tsavorite.core.LogRecord.get_Info()
   at Tsavorite.core.LogRecord.get_AllocatedSize()
   at Tsavorite.core.ObjectScanIterator`2[...].GetPhysicalAddressAndAllocatedSize(...)
   at Tsavorite.core.ObjectScanIterator`2[...].GetNext()
   at Tsavorite.core.TsavoriteKVIterator`6[...].PushNext[...](...)
   at Tsavorite.core.TsavoriteKV`2[...].Iterate[...](MainSessionFunctions, ...)
   at Garnet.server.VectorManager+<RunCleanupTaskAsync>d__24.MoveNext()
The active test run was aborted. Reason: Test host process crashed

The AVE is a Corrupted-State Exception — catch (Exception) in RunCleanupTaskAsync cannot suppress it; the runtime fails fast and the test host crashes.

Root cause

Recovery.Reset()hlogBase.Reset() (in AllocatorBase and the per-allocator overrides SpanByte / Object / TsavoriteLog) frees pages by synchronously invoking OnPagesClosed(...) and a for (i in BufferSize) FreePage(i) loop. Both paths ultimately call ReturnPage(index), which sets:

pageArrays[index]   = default;
pagePointers[index] = default;   // ★ becomes 0

Reset()'s docstring promised "WARNING: assumes that threads have drained out at this point." But Garnet's cluster re-attach paths invoke it on a running store:

  • libs/cluster/Server/Replication/ReplicaOps/ReplicaDisklessSync.cs:100
  • libs/cluster/Server/Replication/ReplicaOps/ReplicaDiskbasedSync.cs:136

In both files storeWrapper.Reset() is called before SuspendPrimaryOnlyTasksAsync(), and even that suspend only drains TaskManager tasks — VectorManager.cleanupTask is independent and never drained.

Once pagePointers[i] = 0, the iterator's GetPhysicalAddress returns 0 + offset — a tiny kernel-page address — and dereferencing it in *(RecordInfo*)physicalAddress raises a fatal AVE.

The exact interleaving

Production scenario in MigrateVectorSetWhileModifyingAsync:

  1. Source primary migrates a slot containing a vector set → drops the index → CleanupDroppedIndex queues a cleanup-task scan on the source primary.
  2. The drop AOF entry replicates to the source's replica, which replays it and also queues a cleanup-task scan on the replica.
  3. Cluster topology change (post-migration, gossip, or any reason) triggers a replica re-attach → ReplicaDisklessSync.ReplicateAttachAsync / ReplicaDiskbasedSync.ReplicateAttachAsync calls storeWrapper.Reset().
  4. The replica's cleanup task is still mid-iterate over the main store → AVE.

Thread-level interleaving:

Thread A: VectorManager cleanup task                 Thread B: storeWrapper.Reset()
─────────────────────────────────────────            ─────────────────────────────────
loop session.Iterate(callbacks)
  PushNext → ObjectScanIterator.GetNext()
    epoch.Resume()              ◄── enter at epoch E
    headAddress = HeadAddress   (still old value)
    LoadPageIfNeeded(...)       (cur >= head → in-mem)
    physicalAddress =
        pagePointers[pageIdx] + offset
                                                      Recovery.Reset()
                                                        hlogBase.Reset()
                                                          HeadAddress ← TailAddress
                                                          OnPagesClosed(...)
                                                            FreePage(p)
                                                              ReturnPage(p)
                                                                pagePointers[p] = 0   ◄── ★
                                                          // override loop:
                                                          for i in BufferSize:
                                                            FreePage(i)
                                                              ReturnPage(i)
                                                                pagePointers[i] = 0
    *(RecordInfo*)physicalAddress  ◄── ☠ AVE
       (LogRecord.GetInfo /
        LogRecord.AllocatedSize)

Why epoch protection didn't catch this

Tsavorite's normal eviction path defers page-freeing through:

epoch.BumpCurrentEpoch(() => OnPagesClosed(newAddr));

BumpCurrentEpoch queues the action and only fires it after SafeToReclaimEpoch has advanced past the prior epoch — i.e. after every thread that was holding the prior epoch has either suspended or moved on. That's why scan iterators are safe against normal eviction.

Reset() skipped that mechanism in two places:

  1. AllocatorBase.Reset() invoked OnPagesClosed(newBeginAddress) directly.
  2. The per-allocator overrides had a for (i in BufferSize) FreePage(i) loop that ran after base.Reset() returned — also without epoch protection. This second loop is the actual point of failure: even if OnPagesClosed were deferred, the leftover (tail) page is freed by the override loop while a reader could still be reading it.

The fix (Tsavorite layer)

AllocatorBase.Reset() defers ALL page-close + page-free work through BumpCurrentEpoch and waits on a ManualResetEventSlim signalled by the deferred action — no polling:

using var resetComplete = new ManualResetEventSlim(initialState: false);

// If caller was already epoch-protected, our prior epoch is what the action
// will be waiting on — release it before waiting and re-acquire after.
var wasProtected = epoch.ThisInstanceProtected();
if (!wasProtected)
    epoch.Resume();   // BumpCurrentEpoch requires a protected caller

try
{
    epoch.BumpCurrentEpoch(() =>
    {
        try
        {
            if (headShifted) OnPagesClosed(newBeginAddress);
            FreeAllAllocatedPages();
        }
        finally { resetComplete.Set(); }   // never deadlock if action throws
    });
}
finally { epoch.Suspend(); }   // unconditionally so the action can fire

resetComplete.Wait();

if (wasProtected) epoch.Resume();

Each per-allocator override (SpanByte / Object / TsavoriteLog) moves its FreePage(i) loop into a new FreeAllAllocatedPages() virtual so the loop runs inside the deferred action:

public override void Reset() { base.Reset(); Initialize(); }

protected override void FreeAllAllocatedPages()
{
    for (int index = 0; index < BufferSize; index++)
        if (IsAllocated(index)) FreePage(index);
}

Why this is safe

  • The deferred action runs only after SafeToReclaimEpoch ≥ priorEpoch, i.e. after every iterator that was inside GetNext at the moment Reset() was called has either suspended or advanced. By the time pagePointers[i] = 0 executes, no thread is reading pagePointers[i].
  • Iterators that re-enter GetNext after HeadAddress was shifted see currentAddress < headAddress and route through the buffered disk frame instead of pagePointers — so they don't touch the cleared array.
  • Reset() blocks until the deferred work has actually run, preserving its synchronous contract (the override's Initialize() after Reset() observes a fully freed page set).

Test vs. product

Strictly, Reset()'s docstring put the burden on callers. The cluster re-attach paths violate that — they call Reset() before draining the VectorManager cleanup task, and SuspendPrimaryOnlyTasksAsync() doesn't cover it. The alternative would be to drain every background reader at every Reset() callsite, but we chose to make Reset() itself epoch-safe because the contract was implicit, callsites are scattered, and Tsavorite already has the right primitive (epoch.BumpCurrentEpoch) — the normal eviction path uses it. This makes the safety property enforced rather than assumed, and protects any future caller / background reader.

Repro

test/Garnet.test/VectorCleanupVsResetRaceTests.cs — adds 4 000 vectors, drops the set (queues a full-keyspace cleanup scan), then spams storeWrapper.Reset() for 5 s.

  • Without the fix: crashes the host on every iteration with the exact production stack (LogRecord.get_InfoObjectScanIterator.GetNextVectorManager.RunCleanupTaskAsync).
  • With the fix: all 5 [Repeat] iterations pass (~2 700 resets per iteration concurrent with the cleanup iterator), no AVE.

2. RespListTests.ListPushPopStressTest — host crash on rare RedisTimeoutException

Symptom

Unhandled exception. StackExchange.Redis.RedisTimeoutException: Timeout performing LPUSH (30000ms)
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](...)
   at StackExchange.Redis.RedisDatabase.ListLeftPush(...)
   at Garnet.test.RespListTests.<>c__DisplayClass39_1.<ListPushPopStressTest>b__0()
The active test run was aborted. Reason: Test host process crashed

Root cause (two compounding issues)

  1. Worker threads created via new Thread(() => ...) had no try/catch. In modern .NET an unhandled exception in a manually-created Thread terminates the process, so a single transient RedisTimeoutException aborted the entire test run.

  2. All 20 sync workers shared a single ConnectionMultiplexer. Every command went through one socket and one background writer. Under CI load + lowMemory eviction overhead the writer falls behind and accumulates queued messages until SyncTimeout (30s) trips. The failure diagnostics confirmed this: mc: 1/1, qs: 20, bw: SpinningDown.

Fix

  • Pre-create one ConnectionMultiplexer per worker on the main thread. Each thread now owns its own socket, eliminating the single-writer bottleneck. Pre-creating also avoids a 20-way connect storm racing ConnectTimeout.
  • Wrap each worker body in try/catch; capture exceptions into a ConcurrentBag, signal stop, exit cleanly. No more host crash.
  • Throw the aggregate before the post-checks so a real timeout isn't masked by secondary "list not empty" assertion noise.
  • Route the deadline-exceeded path through the failure bag too.

Files

libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs           | 76 +++++++++++++++++++++++--
libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocatorImpl.cs     |  7 ++-
libs/storage/Tsavorite/cs/src/core/Allocator/SpanByteAllocatorImpl.cs   |  7 ++-
libs/storage/Tsavorite/cs/src/core/Allocator/TsavoriteLogAllocatorImpl.cs |  7 ++-
test/Garnet.test/RespListTests.cs                                       | 124 +++++++++++++++++++++++++--------------
test/Garnet.test/VectorCleanupVsResetRaceTests.cs                       | new

Copilot AI review requested due to automatic review settings May 1, 2026 23:13
@badrishc badrishc requested review from TedHartMS and kevin-montrose and removed request for Copilot May 1, 2026 23:19
@badrishc badrishc force-pushed the badrishc/fix-rare-failing-tests branch 2 times, most recently from 8096eca to 8849a5f Compare May 2, 2026 00:02
Copilot AI review requested due to automatic review settings May 2, 2026 00:02

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Addresses two rare CI “test host process crashed” failures by making Tsavorite allocator resets epoch-safe (preventing AVEs during concurrent scans) and hardening a list stress test against unhandled thread exceptions / Redis timeouts.

Changes:

  • Make AllocatorBase.Reset() defer page-close and page-free work via LightEpoch.BumpCurrentEpoch, and move allocator-specific “free all pages” loops into a new virtual FreeAllAllocatedPages().
  • Update RespListTests.ListPushPopStressTest to use one ConnectionMultiplexer per worker and to capture worker exceptions instead of crashing the process.
  • Add a targeted regression test VectorCleanupVsResetRaceTests reproducing the VectorManager cleanup vs. Reset race.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs Reworks Reset to defer close/free work through epoch; adds FreeAllAllocatedPages() hook.
libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocatorImpl.cs Moves page-free loop into FreeAllAllocatedPages() and calls Initialize() after base.Reset().
libs/storage/Tsavorite/cs/src/core/Allocator/SpanByteAllocatorImpl.cs Same as above for SpanByte allocator.
libs/storage/Tsavorite/cs/src/core/Allocator/TsavoriteLogAllocatorImpl.cs Same as above for TsavoriteLog allocator.
test/Garnet.test/RespListTests.cs Improves stress test robustness: per-thread mux, exception aggregation, timeout handling.
test/Garnet.test/VectorCleanupVsResetRaceTests.cs New race reproducer test for vector cleanup iterator vs. store reset.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs Outdated
Comment thread test/Garnet.test/RespListTests.cs Outdated
Comment thread test/Garnet.test/VectorCleanupVsResetRaceTests.cs
@badrishc badrishc force-pushed the badrishc/fix-rare-failing-tests branch from 8849a5f to fe77212 Compare May 2, 2026 00:37
@badrishc

badrishc commented May 2, 2026

Copy link
Copy Markdown
Collaborator Author

Addressed all three Copilot review comments + a new finding from a GPT-5.5 review (force-pushed in fe77212):

1. AllocatorBase.cs — race between deferred OnPagesClosed and a concurrent OnPagesClosedWorker (copilot review): inside the deferred BumpCurrentEpoch action, we now spin on while (ClosedUntilAddress < newBeginAddress) before calling FreeAllAllocatedPages so the close worker (whether ours or another thread's) has finished freeing every page in our range first.

2. GPT-5.5 follow-up: race when a concurrent Reset already shifted HeadAddress: GPT-5.5 caught that the wait above was inside if (headShifted), so a second concurrent Reset() could observe headShifted == false (because the first call already advanced HeadAddress), skip the wait, and call FreeAllAllocatedPages while the first call's worker is still freeing pages → same double-free risk. Fixed by hoisting the wait out of the if (headShifted) block so it runs unconditionally before FreeAllAllocatedPages. Comment expanded to enumerate both scenarios.

3. RespListTests.cs — ConnectionMultiplexer leak on partial connect failure (copilot review): wrapped the connect loop in a try/catch that disposes any already-created muxes before rethrowing.

4. VectorCleanupVsResetRaceTests.cs — reflection NRE if Reset(int) signature changes (copilot review): added ClassicAssert.IsNotNull(method, ...) with an actionable message before Invoke.

Both reproducers (DropVectorSetWhileResettingStore and ListPushPopStressTest) still pass; dotnet format --verify-no-changes clean on both Garnet.slnx and Tsavorite.slnx.

@badrishc badrishc force-pushed the badrishc/fix-rare-failing-tests branch 2 times, most recently from dcf4a12 to b056865 Compare May 2, 2026 22:36
@badrishc badrishc marked this pull request as draft May 2, 2026 23:32
@badrishc badrishc force-pushed the badrishc/fix-rare-failing-tests branch 6 times, most recently from 1a44762 to f9abcb0 Compare May 3, 2026 19:13
@badrishc badrishc marked this pull request as ready for review May 3, 2026 19:52
@badrishc badrishc force-pushed the badrishc/fix-rare-failing-tests branch from f9abcb0 to 0a552d9 Compare May 3, 2026 19:55
…Manager cleanup vs Reset() AVE

Two independent rare CI failures, both surfacing as `Test host process
crashed` and aborting the whole test run.

## 1. `ClusterVectorSetTests.MigrateVectorSetWhileModifyingAsync` — fatal `AccessViolationException` in `VectorManager` cleanup task

### Symptom

```
Passed Garnet.test.cluster.ClusterVectorSetTests.MigrateVectorSetWhileModifyingAsync [12 s]
Fatal error. System.AccessViolationException: Attempted to read or write protected memory.
   at Tsavorite.core.LogRecord.get_Info()
   at Tsavorite.core.LogRecord.get_AllocatedSize()
   at Tsavorite.core.ObjectScanIterator`2[...].GetPhysicalAddressAndAllocatedSize(...)
   at Tsavorite.core.ObjectScanIterator`2[...].GetNext()
   at Tsavorite.core.TsavoriteKVIterator`6[...].PushNext[...](...)
   at Tsavorite.core.TsavoriteKV`2[...].Iterate[...](MainSessionFunctions, ...)
   at Garnet.server.VectorManager+<RunCleanupTaskAsync>d__24.MoveNext()
The active test run was aborted. Reason: Test host process crashed
```

The AVE is a Corrupted-State Exception — `catch (Exception)` in
`RunCleanupTaskAsync` cannot suppress it; the runtime fails fast and the
test host crashes.

### Root cause

`Recovery.Reset()` → `hlogBase.Reset()` (in `AllocatorBase` and the
per-allocator overrides `SpanByte` / `Object` / `TsavoriteLog`) frees pages
by synchronously invoking `OnPagesClosed(...)` and a
`for (i in BufferSize) FreePage(i)` loop. Both paths ultimately call
`ReturnPage(index)`, which sets:

```csharp
pageArrays[index]   = default;
pagePointers[index] = default;   // ★ becomes 0
```

`Reset()`'s docstring promised *"WARNING: assumes that threads have drained
out at this point."* But Garnet's cluster re-attach paths invoke it on a
running store:

* `libs/cluster/Server/Replication/ReplicaOps/ReplicaDisklessSync.cs:100`
* `libs/cluster/Server/Replication/ReplicaOps/ReplicaDiskbasedSync.cs:136`

In both files `storeWrapper.Reset()` is called **before**
`SuspendPrimaryOnlyTasksAsync()`, and even that suspend only drains
`TaskManager` tasks — `VectorManager.cleanupTask` is independent and never
drained.

Once `pagePointers[i] = 0`, the iterator's `GetPhysicalAddress` returns
`0 + offset` — a tiny kernel-page address — and dereferencing it in
`*(RecordInfo*)physicalAddress` raises a fatal AVE.

### The exact interleaving

Production scenario in `MigrateVectorSetWhileModifyingAsync`:

1. Source primary migrates a slot containing a vector set → drops the index → `CleanupDroppedIndex` queues a cleanup-task scan on the source primary.
2. The drop AOF entry replicates to the source's replica, which replays it and **also** queues a cleanup-task scan on the replica.
3. Cluster topology change (post-migration, gossip, or any reason) triggers a replica re-attach → `ReplicaDisklessSync.ReplicateAttachAsync` / `ReplicaDiskbasedSync.ReplicateAttachAsync` calls `storeWrapper.Reset()`.
4. The replica's cleanup task is still mid-iterate over the main store → AVE.

Thread-level interleaving:

```
Thread A: VectorManager cleanup task                 Thread B: storeWrapper.Reset()
─────────────────────────────────────────            ─────────────────────────────────
loop session.Iterate(callbacks)
  PushNext → ObjectScanIterator.GetNext()
    epoch.Resume()              ◄── enter at epoch E
    headAddress = HeadAddress   (still old value)
    LoadPageIfNeeded(...)       (cur >= head → in-mem)
    physicalAddress =
        pagePointers[pageIdx] + offset
                                                      Recovery.Reset()
                                                        hlogBase.Reset()
                                                          HeadAddress ← TailAddress
                                                          OnPagesClosed(...)
                                                            FreePage(p)
                                                              ReturnPage(p)
                                                                pagePointers[p] = 0   ◄── ★
                                                          // override loop:
                                                          for i in BufferSize:
                                                            FreePage(i)
                                                              ReturnPage(i)
                                                                pagePointers[i] = 0
    *(RecordInfo*)physicalAddress  ◄── ☠ AVE
       (LogRecord.GetInfo /
        LogRecord.AllocatedSize)
```

### Why epoch protection didn't catch this

Tsavorite's normal eviction path defers page-freeing through:

```csharp
epoch.BumpCurrentEpoch(() => OnPagesClosed(newAddr));
```

`BumpCurrentEpoch` queues the action and only fires it after
`SafeToReclaimEpoch` has advanced past the prior epoch — i.e. after every
thread that was holding the prior epoch has either suspended or moved on.
That's why scan iterators are safe against normal eviction.

`Reset()` skipped that mechanism in two places:

1. `AllocatorBase.Reset()` invoked `OnPagesClosed(newBeginAddress)` directly.
2. The per-allocator overrides had a `for (i in BufferSize) FreePage(i)`
   loop that ran **after** `base.Reset()` returned — also without epoch
   protection. **This second loop is the actual point of failure**: even
   if `OnPagesClosed` were deferred, the leftover (tail) page is freed by
   the override loop while a reader could still be reading it.

### The fix (Tsavorite layer)

`AllocatorBase.Reset()` defers ALL page-close + page-free work through
`BumpCurrentEpoch` and waits on a `ManualResetEventSlim` signalled by the
deferred action — no polling:

```csharp
using var resetComplete = new ManualResetEventSlim(initialState: false);

// If caller was already epoch-protected, our prior epoch is what the action
// will be waiting on — release it before waiting and re-acquire after.
var wasProtected = epoch.ThisInstanceProtected();
if (!wasProtected)
    epoch.Resume();   // BumpCurrentEpoch requires a protected caller

try
{
    epoch.BumpCurrentEpoch(() =>
    {
        try
        {
            if (headShifted) OnPagesClosed(newBeginAddress);
            FreeAllAllocatedPages();
        }
        finally { resetComplete.Set(); }   // never deadlock if action throws
    });
}
finally { epoch.Suspend(); }   // unconditionally so the action can fire

resetComplete.Wait();

if (wasProtected) epoch.Resume();
```

Each per-allocator override (`SpanByte` / `Object` / `TsavoriteLog`) moves
its `FreePage(i)` loop into a new `FreeAllAllocatedPages()` virtual so the
loop runs inside the deferred action:

```csharp
public override void Reset() { base.Reset(); Initialize(); }

protected override void FreeAllAllocatedPages()
{
    for (int index = 0; index < BufferSize; index++)
        if (IsAllocated(index)) FreePage(index);
}
```

### Why this is safe

* The deferred action runs only after `SafeToReclaimEpoch ≥ priorEpoch`,
  i.e. after every iterator that was inside `GetNext` at the moment
  `Reset()` was called has either suspended or advanced. By the time
  `pagePointers[i] = 0` executes, no thread is reading `pagePointers[i]`.
* Iterators that re-enter `GetNext` after `HeadAddress` was shifted see
  `currentAddress < headAddress` and route through the buffered disk frame
  instead of `pagePointers` — so they don't touch the cleared array.
* `Reset()` blocks until the deferred work has actually run, preserving
  its synchronous contract (the override's `Initialize()` after `Reset()`
  observes a fully freed page set).

### Test vs. product

Strictly, `Reset()`'s docstring put the burden on callers. The cluster
re-attach paths violate that — they call `Reset()` before draining the
`VectorManager` cleanup task, and `SuspendPrimaryOnlyTasksAsync()` doesn't
cover it. The alternative would be to drain every background reader at
every `Reset()` callsite, but we chose to make `Reset()` itself epoch-safe
because the contract was implicit, callsites are scattered, and Tsavorite
already has the right primitive (`epoch.BumpCurrentEpoch`) — the normal
eviction path uses it. This makes the safety property **enforced** rather
than **assumed**, and protects any future caller / background reader.

### Repro

`test/Garnet.test/VectorCleanupVsResetRaceTests.cs` — adds 4 000 vectors,
drops the set (queues a full-keyspace cleanup scan), then spams
`storeWrapper.Reset()` for 5 s.

* **Without the fix:** crashes the host on every iteration with the exact
  production stack (`LogRecord.get_Info` → `ObjectScanIterator.GetNext` →
  `VectorManager.RunCleanupTaskAsync`).
* **With the fix:** all 5 `[Repeat]` iterations pass (~2 700 resets per
  iteration concurrent with the cleanup iterator), no AVE.

## 2. `RespListTests.ListPushPopStressTest` — host crash on rare `RedisTimeoutException`

### Symptom

```
Unhandled exception. StackExchange.Redis.RedisTimeoutException: Timeout performing LPUSH (30000ms)
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](...)
   at StackExchange.Redis.RedisDatabase.ListLeftPush(...)
   at Garnet.test.RespListTests.<>c__DisplayClass39_1.<ListPushPopStressTest>b__0()
The active test run was aborted. Reason: Test host process crashed
```

### Root cause (two compounding issues)

1. **Worker threads created via `new Thread(() => ...)` had no try/catch.**
   In modern .NET an unhandled exception in a manually-created `Thread`
   terminates the process, so a single transient `RedisTimeoutException`
   aborted the entire test run.

2. **All 20 sync workers shared a single `ConnectionMultiplexer`.** Every
   command went through one socket and one background writer. Under CI
   load + lowMemory eviction overhead the writer falls behind and
   accumulates queued messages until SyncTimeout (30s) trips. The failure
   diagnostics confirmed this: `mc: 1/1, qs: 20, bw: SpinningDown`.

### Fix

* Pre-create one `ConnectionMultiplexer` per worker on the main thread.
  Each thread now owns its own socket, eliminating the single-writer
  bottleneck. Pre-creating also avoids a 20-way connect storm racing
  `ConnectTimeout`.
* Wrap each worker body in try/catch; capture exceptions into a
  `ConcurrentBag`, signal stop, exit cleanly. No more host crash.
* Throw the aggregate **before** the post-checks so a real timeout isn't
  masked by secondary "list not empty" assertion noise.
* Route the deadline-exceeded path through the failure bag too.

## Files

```
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs           | 76 +++++++++++++++++++++++--
libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocatorImpl.cs     |  7 ++-
libs/storage/Tsavorite/cs/src/core/Allocator/SpanByteAllocatorImpl.cs   |  7 ++-
libs/storage/Tsavorite/cs/src/core/Allocator/TsavoriteLogAllocatorImpl.cs |  7 ++-
test/Garnet.test/RespListTests.cs                                       | 124 +++++++++++++++++++++++++--------------
test/Garnet.test/VectorCleanupVsResetRaceTests.cs                       | new
```

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc force-pushed the badrishc/fix-rare-failing-tests branch from 0a552d9 to 82850d3 Compare May 3, 2026 19:57
@badrishc badrishc merged commit d1a9289 into dev May 3, 2026
29 of 30 checks passed
@badrishc badrishc deleted the badrishc/fix-rare-failing-tests branch May 3, 2026 23:39
badrishc added a commit that referenced this pull request May 6, 2026
…nitialize() epoch-gate

Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info /
ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during
VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765
but a different remaining gap.

## Root cause

PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe
(2-phase BumpCurrentEpoch barrier around OnPagesClosed +
FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize()
that runs after base.Reset() returns. Initialize() non-monotonically
rewinds HeadAddress / BeginAddress / TailPageOffset back to
FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1
are reallocated). Race window between HeadAddress reset and
TailPageOffset reset:

  - Iterator reads stopAddress = TailAddress = old high (stale)
  - Iterator passes currentAddress < stopAddress check, proceeds
  - Iterator reads headAddress = HeadAddress = new low
  - Iterator chooses in-memory path (currentAddress >= headAddress)
  - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE

The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync)
call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only
covers TaskManager-registered tasks. VectorManager.cleanupTask is
started directly in the constructor (VectorManager.cs:110) as a plain
async Task and is invisible to TaskManager.

## Fix

Two complementary fixes:

### A. Tsavorite-side defense-in-depth

- AllocatorBase: added `internal volatile bool Initializing`.
  Refactored Reset() to be non-virtual, wrapping the existing 2-phase
  epoch logic (renamed to ResetCore()) PLUS the per-allocator
  Initialize() call inside Initializing = true / false (try/finally so
  a throw can't leave iterators stuck).
- Per-allocator Reset() overrides removed in ObjectAllocatorImpl,
  SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does
  ResetCore() + Initialize() directly under the gate.
- ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch):
  - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside
    epoch protection — preexisting issue).
  - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }`
    with `!assumeInMemory` bypass (avoids deadlock with internal
    MemoryPageScan inside OnPagesClosed which constructs assumeInMemory
    iterators on the same thread that is running the Initializing-gated
    deferred action).
- LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after
  snapping forward to BeginAddress (stale snap could place
  currentAddress at TailAddress with an unparseable record).

### B. Caller-side explicit drain (restores Reset's documented contract)

- VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup()
  API. Cleanup loop uses increment-then-check pattern (bump
  cleanupInProgress first, re-read pauseCleanupRequests, back out if
  non-zero) so a pause caller cannot race past an iteration about to
  start.
- ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap
  storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` /
  `vectorManager.ResumeCleanup()` try/finally.

A is defense-in-depth (any future background reader is safe); B
restores the documented contract for the known case and skips the
per-iteration spin-wait when callers cooperate.

## Validation

- Build: 0 warnings, 0 errors. dotnet format clean.
- Tsavorite tests: 1453/1453 pass.
- Garnet.test Reset/Recovery/Iterate filter: 30/30 pass.
- ClusterMigrate + ClusterVectorSet (52 tests): all pass.
- MigrateVectorSetWhileModifyingAsync (the originally failing cluster
  test): stable across 8 sequential runs (~18s each) with --blame-crash.
- VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)].
- ClusterResetDuringReplicationTests: 2/2 pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc added a commit that referenced this pull request May 6, 2026
…nitialize() epoch-gate

Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info /
ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during
VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765
but a different remaining gap.

## Root cause

PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe
(2-phase BumpCurrentEpoch barrier around OnPagesClosed +
FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize()
that runs after base.Reset() returns. Initialize() non-monotonically
rewinds HeadAddress / BeginAddress / TailPageOffset back to
FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1
are reallocated). Race window between HeadAddress reset and
TailPageOffset reset:

  - Iterator reads stopAddress = TailAddress = old high (stale)
  - Iterator passes currentAddress < stopAddress check, proceeds
  - Iterator reads headAddress = HeadAddress = new low
  - Iterator chooses in-memory path (currentAddress >= headAddress)
  - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE

The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync)
call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only
covers TaskManager-registered tasks. VectorManager.cleanupTask is
started directly in the constructor (VectorManager.cs:110) as a plain
async Task and is invisible to TaskManager.

## Fix

Two complementary fixes:

### A. Tsavorite-side defense-in-depth

- AllocatorBase: added `internal volatile bool Initializing`.
  Refactored Reset() to be non-virtual, wrapping the existing 2-phase
  epoch logic (renamed to ResetCore()) PLUS the per-allocator
  Initialize() call inside Initializing = true / false (try/finally so
  a throw can't leave iterators stuck).
- Per-allocator Reset() overrides removed in ObjectAllocatorImpl,
  SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does
  ResetCore() + Initialize() directly under the gate.
- ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch):
  - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside
    epoch protection — preexisting issue).
  - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }`
    with `!assumeInMemory` bypass (avoids deadlock with internal
    MemoryPageScan inside OnPagesClosed which constructs assumeInMemory
    iterators on the same thread that is running the Initializing-gated
    deferred action).
- LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after
  snapping forward to BeginAddress (stale snap could place
  currentAddress at TailAddress with an unparseable record).

### B. Caller-side explicit drain (restores Reset's documented contract)

- VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup()
  API. Cleanup loop uses increment-then-check pattern (bump
  cleanupInProgress first, re-read pauseCleanupRequests, back out if
  non-zero) so a pause caller cannot race past an iteration about to
  start.
- ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap
  storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` /
  `vectorManager.ResumeCleanup()` try/finally.

A is defense-in-depth (any future background reader is safe); B
restores the documented contract for the known case and skips the
per-iteration spin-wait when callers cooperate.

## Validation

- Build: 0 warnings, 0 errors. dotnet format clean.
- Tsavorite tests: 1453/1453 pass.
- Garnet.test Reset/Recovery/Iterate filter: 30/30 pass.
- ClusterMigrate + ClusterVectorSet (52 tests): all pass.
- MigrateVectorSetWhileModifyingAsync (the originally failing cluster
  test): stable across 8 sequential runs (~18s each) with --blame-crash.
- VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)].
- ClusterResetDuringReplicationTests: 2/2 pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc added a commit that referenced this pull request May 6, 2026
…nitialize() epoch-gate

Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info /
ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during
VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765
but a different remaining gap.

## Root cause

PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe
(2-phase BumpCurrentEpoch barrier around OnPagesClosed +
FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize()
that runs after base.Reset() returns. Initialize() non-monotonically
rewinds HeadAddress / BeginAddress / TailPageOffset back to
FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1
are reallocated). Race window between HeadAddress reset and
TailPageOffset reset:

  - Iterator reads stopAddress = TailAddress = old high (stale)
  - Iterator passes currentAddress < stopAddress check, proceeds
  - Iterator reads headAddress = HeadAddress = new low
  - Iterator chooses in-memory path (currentAddress >= headAddress)
  - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE

The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync)
call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only
covers TaskManager-registered tasks. VectorManager.cleanupTask is
started directly in the constructor (VectorManager.cs:110) as a plain
async Task and is invisible to TaskManager.

## Fix

Two complementary fixes:

### A. Tsavorite-side defense-in-depth

- AllocatorBase: added `internal volatile bool Initializing`.
  Refactored Reset() to be non-virtual, wrapping the existing 2-phase
  epoch logic (renamed to ResetCore()) PLUS the per-allocator
  Initialize() call inside Initializing = true / false (try/finally so
  a throw can't leave iterators stuck).
- Per-allocator Reset() overrides removed in ObjectAllocatorImpl,
  SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does
  ResetCore() + Initialize() directly under the gate.
- ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch):
  - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside
    epoch protection — preexisting issue).
  - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }`
    with `!assumeInMemory` bypass (avoids deadlock with internal
    MemoryPageScan inside OnPagesClosed which constructs assumeInMemory
    iterators on the same thread that is running the Initializing-gated
    deferred action).
- LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after
  snapping forward to BeginAddress (stale snap could place
  currentAddress at TailAddress with an unparseable record).

### B. Caller-side explicit drain (restores Reset's documented contract)

- VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup()
  API. Cleanup loop uses increment-then-check pattern (bump
  cleanupInProgress first, re-read pauseCleanupRequests, back out if
  non-zero) so a pause caller cannot race past an iteration about to
  start.
- ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap
  storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` /
  `vectorManager.ResumeCleanup()` try/finally.

A is defense-in-depth (any future background reader is safe); B
restores the documented contract for the known case and skips the
per-iteration spin-wait when callers cooperate.

## Validation

- Build: 0 warnings, 0 errors. dotnet format clean.
- Tsavorite tests: 1453/1453 pass.
- Garnet.test Reset/Recovery/Iterate filter: 30/30 pass.
- ClusterMigrate + ClusterVectorSet (52 tests): all pass.
- MigrateVectorSetWhileModifyingAsync (the originally failing cluster
  test): stable across 8 sequential runs (~18s each) with --blame-crash.
- VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)].
- ClusterResetDuringReplicationTests: 2/2 pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc added a commit that referenced this pull request May 6, 2026
…nitialize() epoch-gate

Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info /
ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during
VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765
but a different remaining gap.

## Root cause

PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe
(2-phase BumpCurrentEpoch barrier around OnPagesClosed +
FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize()
that runs after base.Reset() returns. Initialize() non-monotonically
rewinds HeadAddress / BeginAddress / TailPageOffset back to
FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1
are reallocated). Race window between HeadAddress reset and
TailPageOffset reset:

  - Iterator reads stopAddress = TailAddress = old high (stale)
  - Iterator passes currentAddress < stopAddress check, proceeds
  - Iterator reads headAddress = HeadAddress = new low
  - Iterator chooses in-memory path (currentAddress >= headAddress)
  - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE

The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync)
call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only
covers TaskManager-registered tasks. VectorManager.cleanupTask is
started directly in the constructor (VectorManager.cs:110) as a plain
async Task and is invisible to TaskManager.

## Fix

Two complementary fixes:

### A. Tsavorite-side defense-in-depth

- AllocatorBase: added `internal volatile bool Initializing`.
  Refactored Reset() to be non-virtual, wrapping the existing 2-phase
  epoch logic (renamed to ResetCore()) PLUS the per-allocator
  Initialize() call inside Initializing = true / false (try/finally so
  a throw can't leave iterators stuck).
- Per-allocator Reset() overrides removed in ObjectAllocatorImpl,
  SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does
  ResetCore() + Initialize() directly under the gate.
- ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch):
  - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside
    epoch protection — preexisting issue).
  - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }`
    with `!assumeInMemory` bypass (avoids deadlock with internal
    MemoryPageScan inside OnPagesClosed which constructs assumeInMemory
    iterators on the same thread that is running the Initializing-gated
    deferred action).
- LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after
  snapping forward to BeginAddress (stale snap could place
  currentAddress at TailAddress with an unparseable record).

### B. Caller-side explicit drain (restores Reset's documented contract)

- VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup()
  API. Cleanup loop uses increment-then-check pattern (bump
  cleanupInProgress first, re-read pauseCleanupRequests, back out if
  non-zero) so a pause caller cannot race past an iteration about to
  start.
- ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap
  storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` /
  `vectorManager.ResumeCleanup()` try/finally.

A is defense-in-depth (any future background reader is safe); B
restores the documented contract for the known case and skips the
per-iteration spin-wait when callers cooperate.

## Validation

- Build: 0 warnings, 0 errors. dotnet format clean.
- Tsavorite tests: 1453/1453 pass.
- Garnet.test Reset/Recovery/Iterate filter: 30/30 pass.
- ClusterMigrate + ClusterVectorSet (52 tests): all pass.
- MigrateVectorSetWhileModifyingAsync (the originally failing cluster
  test): stable across 8 sequential runs (~18s each) with --blame-crash.
- VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)].
- ClusterResetDuringReplicationTests: 2/2 pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc added a commit that referenced this pull request May 6, 2026
…nitialize() epoch-gate

Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info /
ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during
VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765
but a different remaining gap.

## Root cause

PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe
(2-phase BumpCurrentEpoch barrier around OnPagesClosed +
FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize()
that runs after base.Reset() returns. Initialize() non-monotonically
rewinds HeadAddress / BeginAddress / TailPageOffset back to
FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1
are reallocated). Race window between HeadAddress reset and
TailPageOffset reset:

  - Iterator reads stopAddress = TailAddress = old high (stale)
  - Iterator passes currentAddress < stopAddress check, proceeds
  - Iterator reads headAddress = HeadAddress = new low
  - Iterator chooses in-memory path (currentAddress >= headAddress)
  - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE

The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync)
call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only
covers TaskManager-registered tasks. VectorManager.cleanupTask is
started directly in the constructor (VectorManager.cs:110) as a plain
async Task and is invisible to TaskManager.

## Fix

Two complementary fixes:

### A. Tsavorite-side defense-in-depth

- AllocatorBase: added `internal volatile bool Initializing`.
  Refactored Reset() to be non-virtual, wrapping the existing 2-phase
  epoch logic (renamed to ResetCore()) PLUS the per-allocator
  Initialize() call inside Initializing = true / false (try/finally so
  a throw can't leave iterators stuck).
- Per-allocator Reset() overrides removed in ObjectAllocatorImpl,
  SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does
  ResetCore() + Initialize() directly under the gate.
- ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch):
  - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside
    epoch protection — preexisting issue).
  - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }`
    with `!assumeInMemory` bypass (avoids deadlock with internal
    MemoryPageScan inside OnPagesClosed which constructs assumeInMemory
    iterators on the same thread that is running the Initializing-gated
    deferred action).
- LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after
  snapping forward to BeginAddress (stale snap could place
  currentAddress at TailAddress with an unparseable record).

### B. Caller-side explicit drain (restores Reset's documented contract)

- VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup()
  API. Cleanup loop uses increment-then-check pattern (bump
  cleanupInProgress first, re-read pauseCleanupRequests, back out if
  non-zero) so a pause caller cannot race past an iteration about to
  start.
- ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap
  storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` /
  `vectorManager.ResumeCleanup()` try/finally.

A is defense-in-depth (any future background reader is safe); B
restores the documented contract for the known case and skips the
per-iteration spin-wait when callers cooperate.

## Validation

- Build: 0 warnings, 0 errors. dotnet format clean.
- Tsavorite tests: 1453/1453 pass.
- Garnet.test Reset/Recovery/Iterate filter: 30/30 pass.
- ClusterMigrate + ClusterVectorSet (52 tests): all pass.
- MigrateVectorSetWhileModifyingAsync (the originally failing cluster
  test): stable across 8 sequential runs (~18s each) with --blame-crash.
- VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)].
- ClusterResetDuringReplicationTests: 2/2 pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc added a commit that referenced this pull request May 7, 2026
…nitialize() epoch-gate

Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info /
ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during
VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765
but a different remaining gap.

## Root cause

PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe
(2-phase BumpCurrentEpoch barrier around OnPagesClosed +
FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize()
that runs after base.Reset() returns. Initialize() non-monotonically
rewinds HeadAddress / BeginAddress / TailPageOffset back to
FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1
are reallocated). Race window between HeadAddress reset and
TailPageOffset reset:

  - Iterator reads stopAddress = TailAddress = old high (stale)
  - Iterator passes currentAddress < stopAddress check, proceeds
  - Iterator reads headAddress = HeadAddress = new low
  - Iterator chooses in-memory path (currentAddress >= headAddress)
  - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE

The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync)
call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only
covers TaskManager-registered tasks. VectorManager.cleanupTask is
started directly in the constructor (VectorManager.cs:110) as a plain
async Task and is invisible to TaskManager.

## Fix

Two complementary fixes:

### A. Tsavorite-side defense-in-depth

- AllocatorBase: added `internal volatile bool Initializing`.
  Refactored Reset() to be non-virtual, wrapping the existing 2-phase
  epoch logic (renamed to ResetCore()) PLUS the per-allocator
  Initialize() call inside Initializing = true / false (try/finally so
  a throw can't leave iterators stuck).
- Per-allocator Reset() overrides removed in ObjectAllocatorImpl,
  SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does
  ResetCore() + Initialize() directly under the gate.
- ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch):
  - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside
    epoch protection — preexisting issue).
  - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }`
    with `!assumeInMemory` bypass (avoids deadlock with internal
    MemoryPageScan inside OnPagesClosed which constructs assumeInMemory
    iterators on the same thread that is running the Initializing-gated
    deferred action).
- LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after
  snapping forward to BeginAddress (stale snap could place
  currentAddress at TailAddress with an unparseable record).

### B. Caller-side explicit drain (restores Reset's documented contract)

- VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup()
  API. Cleanup loop uses increment-then-check pattern (bump
  cleanupInProgress first, re-read pauseCleanupRequests, back out if
  non-zero) so a pause caller cannot race past an iteration about to
  start.
- ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap
  storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` /
  `vectorManager.ResumeCleanup()` try/finally.

A is defense-in-depth (any future background reader is safe); B
restores the documented contract for the known case and skips the
per-iteration spin-wait when callers cooperate.

## Validation

- Build: 0 warnings, 0 errors. dotnet format clean.
- Tsavorite tests: 1453/1453 pass.
- Garnet.test Reset/Recovery/Iterate filter: 30/30 pass.
- ClusterMigrate + ClusterVectorSet (52 tests): all pass.
- MigrateVectorSetWhileModifyingAsync (the originally failing cluster
  test): stable across 8 sequential runs (~18s each) with --blame-crash.
- VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)].
- ClusterResetDuringReplicationTests: 2/2 pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc added a commit that referenced this pull request May 7, 2026
…nitialize() epoch-gate

Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info /
ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during
VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765
but a different remaining gap.

## Root cause

PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe
(2-phase BumpCurrentEpoch barrier around OnPagesClosed +
FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize()
that runs after base.Reset() returns. Initialize() non-monotonically
rewinds HeadAddress / BeginAddress / TailPageOffset back to
FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1
are reallocated). Race window between HeadAddress reset and
TailPageOffset reset:

  - Iterator reads stopAddress = TailAddress = old high (stale)
  - Iterator passes currentAddress < stopAddress check, proceeds
  - Iterator reads headAddress = HeadAddress = new low
  - Iterator chooses in-memory path (currentAddress >= headAddress)
  - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE

The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync)
call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only
covers TaskManager-registered tasks. VectorManager.cleanupTask is
started directly in the constructor (VectorManager.cs:110) as a plain
async Task and is invisible to TaskManager.

## Fix

Two complementary fixes:

### A. Tsavorite-side defense-in-depth

- AllocatorBase: added `internal volatile bool Initializing`.
  Refactored Reset() to be non-virtual, wrapping the existing 2-phase
  epoch logic (renamed to ResetCore()) PLUS the per-allocator
  Initialize() call inside Initializing = true / false (try/finally so
  a throw can't leave iterators stuck).
- Per-allocator Reset() overrides removed in ObjectAllocatorImpl,
  SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does
  ResetCore() + Initialize() directly under the gate.
- ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch):
  - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside
    epoch protection — preexisting issue).
  - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }`
    with `!assumeInMemory` bypass (avoids deadlock with internal
    MemoryPageScan inside OnPagesClosed which constructs assumeInMemory
    iterators on the same thread that is running the Initializing-gated
    deferred action).
- LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after
  snapping forward to BeginAddress (stale snap could place
  currentAddress at TailAddress with an unparseable record).

### B. Caller-side explicit drain (restores Reset's documented contract)

- VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup()
  API. Cleanup loop uses increment-then-check pattern (bump
  cleanupInProgress first, re-read pauseCleanupRequests, back out if
  non-zero) so a pause caller cannot race past an iteration about to
  start.
- ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap
  storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` /
  `vectorManager.ResumeCleanup()` try/finally.

A is defense-in-depth (any future background reader is safe); B
restores the documented contract for the known case and skips the
per-iteration spin-wait when callers cooperate.

## Validation

- Build: 0 warnings, 0 errors. dotnet format clean.
- Tsavorite tests: 1453/1453 pass.
- Garnet.test Reset/Recovery/Iterate filter: 30/30 pass.
- ClusterMigrate + ClusterVectorSet (52 tests): all pass.
- MigrateVectorSetWhileModifyingAsync (the originally failing cluster
  test): stable across 8 sequential runs (~18s each) with --blame-crash.
- VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)].
- ClusterResetDuringReplicationTests: 2/2 pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc added a commit that referenced this pull request May 7, 2026
…nitialize() epoch-gate

Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info /
ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during
VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765
but a different remaining gap.

## Root cause

PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe
(2-phase BumpCurrentEpoch barrier around OnPagesClosed +
FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize()
that runs after base.Reset() returns. Initialize() non-monotonically
rewinds HeadAddress / BeginAddress / TailPageOffset back to
FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1
are reallocated). Race window between HeadAddress reset and
TailPageOffset reset:

  - Iterator reads stopAddress = TailAddress = old high (stale)
  - Iterator passes currentAddress < stopAddress check, proceeds
  - Iterator reads headAddress = HeadAddress = new low
  - Iterator chooses in-memory path (currentAddress >= headAddress)
  - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE

The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync)
call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only
covers TaskManager-registered tasks. VectorManager.cleanupTask is
started directly in the constructor (VectorManager.cs:110) as a plain
async Task and is invisible to TaskManager.

## Fix

Two complementary fixes:

### A. Tsavorite-side defense-in-depth

- AllocatorBase: added `internal volatile bool Initializing`.
  Refactored Reset() to be non-virtual, wrapping the existing 2-phase
  epoch logic (renamed to ResetCore()) PLUS the per-allocator
  Initialize() call inside Initializing = true / false (try/finally so
  a throw can't leave iterators stuck).
- Per-allocator Reset() overrides removed in ObjectAllocatorImpl,
  SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does
  ResetCore() + Initialize() directly under the gate.
- ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch):
  - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside
    epoch protection — preexisting issue).
  - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }`
    with `!assumeInMemory` bypass (avoids deadlock with internal
    MemoryPageScan inside OnPagesClosed which constructs assumeInMemory
    iterators on the same thread that is running the Initializing-gated
    deferred action).
- LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after
  snapping forward to BeginAddress (stale snap could place
  currentAddress at TailAddress with an unparseable record).

### B. Caller-side explicit drain (restores Reset's documented contract)

- VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup()
  API. Cleanup loop uses increment-then-check pattern (bump
  cleanupInProgress first, re-read pauseCleanupRequests, back out if
  non-zero) so a pause caller cannot race past an iteration about to
  start.
- ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap
  storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` /
  `vectorManager.ResumeCleanup()` try/finally.

A is defense-in-depth (any future background reader is safe); B
restores the documented contract for the known case and skips the
per-iteration spin-wait when callers cooperate.

## Validation

- Build: 0 warnings, 0 errors. dotnet format clean.
- Tsavorite tests: 1453/1453 pass.
- Garnet.test Reset/Recovery/Iterate filter: 30/30 pass.
- ClusterMigrate + ClusterVectorSet (52 tests): all pass.
- MigrateVectorSetWhileModifyingAsync (the originally failing cluster
  test): stable across 8 sequential runs (~18s each) with --blame-crash.
- VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)].
- ClusterResetDuringReplicationTests: 2/2 pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc added a commit that referenced this pull request May 7, 2026
…nitialize() epoch-gate

Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info /
ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during
VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765
but a different remaining gap.

## Root cause

PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe
(2-phase BumpCurrentEpoch barrier around OnPagesClosed +
FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize()
that runs after base.Reset() returns. Initialize() non-monotonically
rewinds HeadAddress / BeginAddress / TailPageOffset back to
FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1
are reallocated). Race window between HeadAddress reset and
TailPageOffset reset:

  - Iterator reads stopAddress = TailAddress = old high (stale)
  - Iterator passes currentAddress < stopAddress check, proceeds
  - Iterator reads headAddress = HeadAddress = new low
  - Iterator chooses in-memory path (currentAddress >= headAddress)
  - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE

The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync)
call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only
covers TaskManager-registered tasks. VectorManager.cleanupTask is
started directly in the constructor (VectorManager.cs:110) as a plain
async Task and is invisible to TaskManager.

## Fix

Two complementary fixes:

### A. Tsavorite-side defense-in-depth

- AllocatorBase: added `internal volatile bool Initializing`.
  Refactored Reset() to be non-virtual, wrapping the existing 2-phase
  epoch logic (renamed to ResetCore()) PLUS the per-allocator
  Initialize() call inside Initializing = true / false (try/finally so
  a throw can't leave iterators stuck).
- Per-allocator Reset() overrides removed in ObjectAllocatorImpl,
  SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does
  ResetCore() + Initialize() directly under the gate.
- ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch):
  - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside
    epoch protection — preexisting issue).
  - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }`
    with `!assumeInMemory` bypass (avoids deadlock with internal
    MemoryPageScan inside OnPagesClosed which constructs assumeInMemory
    iterators on the same thread that is running the Initializing-gated
    deferred action).
- LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after
  snapping forward to BeginAddress (stale snap could place
  currentAddress at TailAddress with an unparseable record).

### B. Caller-side explicit drain (restores Reset's documented contract)

- VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup()
  API. Cleanup loop uses increment-then-check pattern (bump
  cleanupInProgress first, re-read pauseCleanupRequests, back out if
  non-zero) so a pause caller cannot race past an iteration about to
  start.
- ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap
  storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` /
  `vectorManager.ResumeCleanup()` try/finally.

A is defense-in-depth (any future background reader is safe); B
restores the documented contract for the known case and skips the
per-iteration spin-wait when callers cooperate.

## Validation

- Build: 0 warnings, 0 errors. dotnet format clean.
- Tsavorite tests: 1453/1453 pass.
- Garnet.test Reset/Recovery/Iterate filter: 30/30 pass.
- ClusterMigrate + ClusterVectorSet (52 tests): all pass.
- MigrateVectorSetWhileModifyingAsync (the originally failing cluster
  test): stable across 8 sequential runs (~18s each) with --blame-crash.
- VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)].
- ClusterResetDuringReplicationTests: 2/2 pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc added a commit that referenced this pull request May 7, 2026
…nitialize() epoch-gate (#1775)

Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info /
ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during
VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765
but a different remaining gap.

## Root cause

PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe
(2-phase BumpCurrentEpoch barrier around OnPagesClosed +
FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize()
that runs after base.Reset() returns. Initialize() non-monotonically
rewinds HeadAddress / BeginAddress / TailPageOffset back to
FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1
are reallocated). Race window between HeadAddress reset and
TailPageOffset reset:

  - Iterator reads stopAddress = TailAddress = old high (stale)
  - Iterator passes currentAddress < stopAddress check, proceeds
  - Iterator reads headAddress = HeadAddress = new low
  - Iterator chooses in-memory path (currentAddress >= headAddress)
  - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE

The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync)
call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only
covers TaskManager-registered tasks. VectorManager.cleanupTask is
started directly in the constructor (VectorManager.cs:110) as a plain
async Task and is invisible to TaskManager.

## Fix

Two complementary fixes:

### A. Tsavorite-side defense-in-depth

- AllocatorBase: added `internal volatile bool Initializing`.
  Refactored Reset() to be non-virtual, wrapping the existing 2-phase
  epoch logic (renamed to ResetCore()) PLUS the per-allocator
  Initialize() call inside Initializing = true / false (try/finally so
  a throw can't leave iterators stuck).
- Per-allocator Reset() overrides removed in ObjectAllocatorImpl,
  SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does
  ResetCore() + Initialize() directly under the gate.
- ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch):
  - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside
    epoch protection — preexisting issue).
  - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }`
    with `!assumeInMemory` bypass (avoids deadlock with internal
    MemoryPageScan inside OnPagesClosed which constructs assumeInMemory
    iterators on the same thread that is running the Initializing-gated
    deferred action).
- LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after
  snapping forward to BeginAddress (stale snap could place
  currentAddress at TailAddress with an unparseable record).

### B. Caller-side explicit drain (restores Reset's documented contract)

- VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup()
  API. Cleanup loop uses increment-then-check pattern (bump
  cleanupInProgress first, re-read pauseCleanupRequests, back out if
  non-zero) so a pause caller cannot race past an iteration about to
  start.
- ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap
  storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` /
  `vectorManager.ResumeCleanup()` try/finally.

A is defense-in-depth (any future background reader is safe); B
restores the documented contract for the known case and skips the
per-iteration spin-wait when callers cooperate.

## Validation

- Build: 0 warnings, 0 errors. dotnet format clean.
- Tsavorite tests: 1453/1453 pass.
- Garnet.test Reset/Recovery/Iterate filter: 30/30 pass.
- ClusterMigrate + ClusterVectorSet (52 tests): all pass.
- MigrateVectorSetWhileModifyingAsync (the originally failing cluster
  test): stable across 8 sequential runs (~18s each) with --blame-crash.
- VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)].
- ClusterResetDuringReplicationTests: 2/2 pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants