Fix two rare CI failures [dev]#1765
Conversation
8096eca to
8849a5f
Compare
There was a problem hiding this comment.
Pull request overview
Addresses two rare CI “test host process crashed” failures by making Tsavorite allocator resets epoch-safe (preventing AVEs during concurrent scans) and hardening a list stress test against unhandled thread exceptions / Redis timeouts.
Changes:
- Make
AllocatorBase.Reset()defer page-close and page-free work viaLightEpoch.BumpCurrentEpoch, and move allocator-specific “free all pages” loops into a new virtualFreeAllAllocatedPages(). - Update
RespListTests.ListPushPopStressTestto use oneConnectionMultiplexerper worker and to capture worker exceptions instead of crashing the process. - Add a targeted regression test
VectorCleanupVsResetRaceTestsreproducing the VectorManager cleanup vs. Reset race.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | Reworks Reset to defer close/free work through epoch; adds FreeAllAllocatedPages() hook. |
| libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocatorImpl.cs | Moves page-free loop into FreeAllAllocatedPages() and calls Initialize() after base.Reset(). |
| libs/storage/Tsavorite/cs/src/core/Allocator/SpanByteAllocatorImpl.cs | Same as above for SpanByte allocator. |
| libs/storage/Tsavorite/cs/src/core/Allocator/TsavoriteLogAllocatorImpl.cs | Same as above for TsavoriteLog allocator. |
| test/Garnet.test/RespListTests.cs | Improves stress test robustness: per-thread mux, exception aggregation, timeout handling. |
| test/Garnet.test/VectorCleanupVsResetRaceTests.cs | New race reproducer test for vector cleanup iterator vs. store reset. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
8849a5f to
fe77212
Compare
|
Addressed all three Copilot review comments + a new finding from a GPT-5.5 review (force-pushed in fe77212): 1. 2. GPT-5.5 follow-up: race when a concurrent Reset already shifted HeadAddress: GPT-5.5 caught that the wait above was inside 3. 4. Both reproducers ( |
dcf4a12 to
b056865
Compare
1a44762 to
f9abcb0
Compare
f9abcb0 to
0a552d9
Compare
…Manager cleanup vs Reset() AVE
Two independent rare CI failures, both surfacing as `Test host process
crashed` and aborting the whole test run.
## 1. `ClusterVectorSetTests.MigrateVectorSetWhileModifyingAsync` — fatal `AccessViolationException` in `VectorManager` cleanup task
### Symptom
```
Passed Garnet.test.cluster.ClusterVectorSetTests.MigrateVectorSetWhileModifyingAsync [12 s]
Fatal error. System.AccessViolationException: Attempted to read or write protected memory.
at Tsavorite.core.LogRecord.get_Info()
at Tsavorite.core.LogRecord.get_AllocatedSize()
at Tsavorite.core.ObjectScanIterator`2[...].GetPhysicalAddressAndAllocatedSize(...)
at Tsavorite.core.ObjectScanIterator`2[...].GetNext()
at Tsavorite.core.TsavoriteKVIterator`6[...].PushNext[...](...)
at Tsavorite.core.TsavoriteKV`2[...].Iterate[...](MainSessionFunctions, ...)
at Garnet.server.VectorManager+<RunCleanupTaskAsync>d__24.MoveNext()
The active test run was aborted. Reason: Test host process crashed
```
The AVE is a Corrupted-State Exception — `catch (Exception)` in
`RunCleanupTaskAsync` cannot suppress it; the runtime fails fast and the
test host crashes.
### Root cause
`Recovery.Reset()` → `hlogBase.Reset()` (in `AllocatorBase` and the
per-allocator overrides `SpanByte` / `Object` / `TsavoriteLog`) frees pages
by synchronously invoking `OnPagesClosed(...)` and a
`for (i in BufferSize) FreePage(i)` loop. Both paths ultimately call
`ReturnPage(index)`, which sets:
```csharp
pageArrays[index] = default;
pagePointers[index] = default; // ★ becomes 0
```
`Reset()`'s docstring promised *"WARNING: assumes that threads have drained
out at this point."* But Garnet's cluster re-attach paths invoke it on a
running store:
* `libs/cluster/Server/Replication/ReplicaOps/ReplicaDisklessSync.cs:100`
* `libs/cluster/Server/Replication/ReplicaOps/ReplicaDiskbasedSync.cs:136`
In both files `storeWrapper.Reset()` is called **before**
`SuspendPrimaryOnlyTasksAsync()`, and even that suspend only drains
`TaskManager` tasks — `VectorManager.cleanupTask` is independent and never
drained.
Once `pagePointers[i] = 0`, the iterator's `GetPhysicalAddress` returns
`0 + offset` — a tiny kernel-page address — and dereferencing it in
`*(RecordInfo*)physicalAddress` raises a fatal AVE.
### The exact interleaving
Production scenario in `MigrateVectorSetWhileModifyingAsync`:
1. Source primary migrates a slot containing a vector set → drops the index → `CleanupDroppedIndex` queues a cleanup-task scan on the source primary.
2. The drop AOF entry replicates to the source's replica, which replays it and **also** queues a cleanup-task scan on the replica.
3. Cluster topology change (post-migration, gossip, or any reason) triggers a replica re-attach → `ReplicaDisklessSync.ReplicateAttachAsync` / `ReplicaDiskbasedSync.ReplicateAttachAsync` calls `storeWrapper.Reset()`.
4. The replica's cleanup task is still mid-iterate over the main store → AVE.
Thread-level interleaving:
```
Thread A: VectorManager cleanup task Thread B: storeWrapper.Reset()
───────────────────────────────────────── ─────────────────────────────────
loop session.Iterate(callbacks)
PushNext → ObjectScanIterator.GetNext()
epoch.Resume() ◄── enter at epoch E
headAddress = HeadAddress (still old value)
LoadPageIfNeeded(...) (cur >= head → in-mem)
physicalAddress =
pagePointers[pageIdx] + offset
Recovery.Reset()
hlogBase.Reset()
HeadAddress ← TailAddress
OnPagesClosed(...)
FreePage(p)
ReturnPage(p)
pagePointers[p] = 0 ◄── ★
// override loop:
for i in BufferSize:
FreePage(i)
ReturnPage(i)
pagePointers[i] = 0
*(RecordInfo*)physicalAddress ◄── ☠ AVE
(LogRecord.GetInfo /
LogRecord.AllocatedSize)
```
### Why epoch protection didn't catch this
Tsavorite's normal eviction path defers page-freeing through:
```csharp
epoch.BumpCurrentEpoch(() => OnPagesClosed(newAddr));
```
`BumpCurrentEpoch` queues the action and only fires it after
`SafeToReclaimEpoch` has advanced past the prior epoch — i.e. after every
thread that was holding the prior epoch has either suspended or moved on.
That's why scan iterators are safe against normal eviction.
`Reset()` skipped that mechanism in two places:
1. `AllocatorBase.Reset()` invoked `OnPagesClosed(newBeginAddress)` directly.
2. The per-allocator overrides had a `for (i in BufferSize) FreePage(i)`
loop that ran **after** `base.Reset()` returned — also without epoch
protection. **This second loop is the actual point of failure**: even
if `OnPagesClosed` were deferred, the leftover (tail) page is freed by
the override loop while a reader could still be reading it.
### The fix (Tsavorite layer)
`AllocatorBase.Reset()` defers ALL page-close + page-free work through
`BumpCurrentEpoch` and waits on a `ManualResetEventSlim` signalled by the
deferred action — no polling:
```csharp
using var resetComplete = new ManualResetEventSlim(initialState: false);
// If caller was already epoch-protected, our prior epoch is what the action
// will be waiting on — release it before waiting and re-acquire after.
var wasProtected = epoch.ThisInstanceProtected();
if (!wasProtected)
epoch.Resume(); // BumpCurrentEpoch requires a protected caller
try
{
epoch.BumpCurrentEpoch(() =>
{
try
{
if (headShifted) OnPagesClosed(newBeginAddress);
FreeAllAllocatedPages();
}
finally { resetComplete.Set(); } // never deadlock if action throws
});
}
finally { epoch.Suspend(); } // unconditionally so the action can fire
resetComplete.Wait();
if (wasProtected) epoch.Resume();
```
Each per-allocator override (`SpanByte` / `Object` / `TsavoriteLog`) moves
its `FreePage(i)` loop into a new `FreeAllAllocatedPages()` virtual so the
loop runs inside the deferred action:
```csharp
public override void Reset() { base.Reset(); Initialize(); }
protected override void FreeAllAllocatedPages()
{
for (int index = 0; index < BufferSize; index++)
if (IsAllocated(index)) FreePage(index);
}
```
### Why this is safe
* The deferred action runs only after `SafeToReclaimEpoch ≥ priorEpoch`,
i.e. after every iterator that was inside `GetNext` at the moment
`Reset()` was called has either suspended or advanced. By the time
`pagePointers[i] = 0` executes, no thread is reading `pagePointers[i]`.
* Iterators that re-enter `GetNext` after `HeadAddress` was shifted see
`currentAddress < headAddress` and route through the buffered disk frame
instead of `pagePointers` — so they don't touch the cleared array.
* `Reset()` blocks until the deferred work has actually run, preserving
its synchronous contract (the override's `Initialize()` after `Reset()`
observes a fully freed page set).
### Test vs. product
Strictly, `Reset()`'s docstring put the burden on callers. The cluster
re-attach paths violate that — they call `Reset()` before draining the
`VectorManager` cleanup task, and `SuspendPrimaryOnlyTasksAsync()` doesn't
cover it. The alternative would be to drain every background reader at
every `Reset()` callsite, but we chose to make `Reset()` itself epoch-safe
because the contract was implicit, callsites are scattered, and Tsavorite
already has the right primitive (`epoch.BumpCurrentEpoch`) — the normal
eviction path uses it. This makes the safety property **enforced** rather
than **assumed**, and protects any future caller / background reader.
### Repro
`test/Garnet.test/VectorCleanupVsResetRaceTests.cs` — adds 4 000 vectors,
drops the set (queues a full-keyspace cleanup scan), then spams
`storeWrapper.Reset()` for 5 s.
* **Without the fix:** crashes the host on every iteration with the exact
production stack (`LogRecord.get_Info` → `ObjectScanIterator.GetNext` →
`VectorManager.RunCleanupTaskAsync`).
* **With the fix:** all 5 `[Repeat]` iterations pass (~2 700 resets per
iteration concurrent with the cleanup iterator), no AVE.
## 2. `RespListTests.ListPushPopStressTest` — host crash on rare `RedisTimeoutException`
### Symptom
```
Unhandled exception. StackExchange.Redis.RedisTimeoutException: Timeout performing LPUSH (30000ms)
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](...)
at StackExchange.Redis.RedisDatabase.ListLeftPush(...)
at Garnet.test.RespListTests.<>c__DisplayClass39_1.<ListPushPopStressTest>b__0()
The active test run was aborted. Reason: Test host process crashed
```
### Root cause (two compounding issues)
1. **Worker threads created via `new Thread(() => ...)` had no try/catch.**
In modern .NET an unhandled exception in a manually-created `Thread`
terminates the process, so a single transient `RedisTimeoutException`
aborted the entire test run.
2. **All 20 sync workers shared a single `ConnectionMultiplexer`.** Every
command went through one socket and one background writer. Under CI
load + lowMemory eviction overhead the writer falls behind and
accumulates queued messages until SyncTimeout (30s) trips. The failure
diagnostics confirmed this: `mc: 1/1, qs: 20, bw: SpinningDown`.
### Fix
* Pre-create one `ConnectionMultiplexer` per worker on the main thread.
Each thread now owns its own socket, eliminating the single-writer
bottleneck. Pre-creating also avoids a 20-way connect storm racing
`ConnectTimeout`.
* Wrap each worker body in try/catch; capture exceptions into a
`ConcurrentBag`, signal stop, exit cleanly. No more host crash.
* Throw the aggregate **before** the post-checks so a real timeout isn't
masked by secondary "list not empty" assertion noise.
* Route the deadline-exceeded path through the failure bag too.
## Files
```
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | 76 +++++++++++++++++++++++--
libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocatorImpl.cs | 7 ++-
libs/storage/Tsavorite/cs/src/core/Allocator/SpanByteAllocatorImpl.cs | 7 ++-
libs/storage/Tsavorite/cs/src/core/Allocator/TsavoriteLogAllocatorImpl.cs | 7 ++-
test/Garnet.test/RespListTests.cs | 124 +++++++++++++++++++++++++--------------
test/Garnet.test/VectorCleanupVsResetRaceTests.cs | new
```
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
0a552d9 to
82850d3
Compare
…nitialize() epoch-gate Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info / ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765 but a different remaining gap. ## Root cause PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe (2-phase BumpCurrentEpoch barrier around OnPagesClosed + FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize() that runs after base.Reset() returns. Initialize() non-monotonically rewinds HeadAddress / BeginAddress / TailPageOffset back to FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1 are reallocated). Race window between HeadAddress reset and TailPageOffset reset: - Iterator reads stopAddress = TailAddress = old high (stale) - Iterator passes currentAddress < stopAddress check, proceeds - Iterator reads headAddress = HeadAddress = new low - Iterator chooses in-memory path (currentAddress >= headAddress) - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync) call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only covers TaskManager-registered tasks. VectorManager.cleanupTask is started directly in the constructor (VectorManager.cs:110) as a plain async Task and is invisible to TaskManager. ## Fix Two complementary fixes: ### A. Tsavorite-side defense-in-depth - AllocatorBase: added `internal volatile bool Initializing`. Refactored Reset() to be non-virtual, wrapping the existing 2-phase epoch logic (renamed to ResetCore()) PLUS the per-allocator Initialize() call inside Initializing = true / false (try/finally so a throw can't leave iterators stuck). - Per-allocator Reset() overrides removed in ObjectAllocatorImpl, SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does ResetCore() + Initialize() directly under the gate. - ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch): - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside epoch protection — preexisting issue). - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }` with `!assumeInMemory` bypass (avoids deadlock with internal MemoryPageScan inside OnPagesClosed which constructs assumeInMemory iterators on the same thread that is running the Initializing-gated deferred action). - LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after snapping forward to BeginAddress (stale snap could place currentAddress at TailAddress with an unparseable record). ### B. Caller-side explicit drain (restores Reset's documented contract) - VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup() API. Cleanup loop uses increment-then-check pattern (bump cleanupInProgress first, re-read pauseCleanupRequests, back out if non-zero) so a pause caller cannot race past an iteration about to start. - ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` / `vectorManager.ResumeCleanup()` try/finally. A is defense-in-depth (any future background reader is safe); B restores the documented contract for the known case and skips the per-iteration spin-wait when callers cooperate. ## Validation - Build: 0 warnings, 0 errors. dotnet format clean. - Tsavorite tests: 1453/1453 pass. - Garnet.test Reset/Recovery/Iterate filter: 30/30 pass. - ClusterMigrate + ClusterVectorSet (52 tests): all pass. - MigrateVectorSetWhileModifyingAsync (the originally failing cluster test): stable across 8 sequential runs (~18s each) with --blame-crash. - VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)]. - ClusterResetDuringReplicationTests: 2/2 pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nitialize() epoch-gate Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info / ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765 but a different remaining gap. ## Root cause PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe (2-phase BumpCurrentEpoch barrier around OnPagesClosed + FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize() that runs after base.Reset() returns. Initialize() non-monotonically rewinds HeadAddress / BeginAddress / TailPageOffset back to FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1 are reallocated). Race window between HeadAddress reset and TailPageOffset reset: - Iterator reads stopAddress = TailAddress = old high (stale) - Iterator passes currentAddress < stopAddress check, proceeds - Iterator reads headAddress = HeadAddress = new low - Iterator chooses in-memory path (currentAddress >= headAddress) - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync) call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only covers TaskManager-registered tasks. VectorManager.cleanupTask is started directly in the constructor (VectorManager.cs:110) as a plain async Task and is invisible to TaskManager. ## Fix Two complementary fixes: ### A. Tsavorite-side defense-in-depth - AllocatorBase: added `internal volatile bool Initializing`. Refactored Reset() to be non-virtual, wrapping the existing 2-phase epoch logic (renamed to ResetCore()) PLUS the per-allocator Initialize() call inside Initializing = true / false (try/finally so a throw can't leave iterators stuck). - Per-allocator Reset() overrides removed in ObjectAllocatorImpl, SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does ResetCore() + Initialize() directly under the gate. - ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch): - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside epoch protection — preexisting issue). - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }` with `!assumeInMemory` bypass (avoids deadlock with internal MemoryPageScan inside OnPagesClosed which constructs assumeInMemory iterators on the same thread that is running the Initializing-gated deferred action). - LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after snapping forward to BeginAddress (stale snap could place currentAddress at TailAddress with an unparseable record). ### B. Caller-side explicit drain (restores Reset's documented contract) - VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup() API. Cleanup loop uses increment-then-check pattern (bump cleanupInProgress first, re-read pauseCleanupRequests, back out if non-zero) so a pause caller cannot race past an iteration about to start. - ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` / `vectorManager.ResumeCleanup()` try/finally. A is defense-in-depth (any future background reader is safe); B restores the documented contract for the known case and skips the per-iteration spin-wait when callers cooperate. ## Validation - Build: 0 warnings, 0 errors. dotnet format clean. - Tsavorite tests: 1453/1453 pass. - Garnet.test Reset/Recovery/Iterate filter: 30/30 pass. - ClusterMigrate + ClusterVectorSet (52 tests): all pass. - MigrateVectorSetWhileModifyingAsync (the originally failing cluster test): stable across 8 sequential runs (~18s each) with --blame-crash. - VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)]. - ClusterResetDuringReplicationTests: 2/2 pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nitialize() epoch-gate Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info / ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765 but a different remaining gap. ## Root cause PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe (2-phase BumpCurrentEpoch barrier around OnPagesClosed + FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize() that runs after base.Reset() returns. Initialize() non-monotonically rewinds HeadAddress / BeginAddress / TailPageOffset back to FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1 are reallocated). Race window between HeadAddress reset and TailPageOffset reset: - Iterator reads stopAddress = TailAddress = old high (stale) - Iterator passes currentAddress < stopAddress check, proceeds - Iterator reads headAddress = HeadAddress = new low - Iterator chooses in-memory path (currentAddress >= headAddress) - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync) call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only covers TaskManager-registered tasks. VectorManager.cleanupTask is started directly in the constructor (VectorManager.cs:110) as a plain async Task and is invisible to TaskManager. ## Fix Two complementary fixes: ### A. Tsavorite-side defense-in-depth - AllocatorBase: added `internal volatile bool Initializing`. Refactored Reset() to be non-virtual, wrapping the existing 2-phase epoch logic (renamed to ResetCore()) PLUS the per-allocator Initialize() call inside Initializing = true / false (try/finally so a throw can't leave iterators stuck). - Per-allocator Reset() overrides removed in ObjectAllocatorImpl, SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does ResetCore() + Initialize() directly under the gate. - ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch): - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside epoch protection — preexisting issue). - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }` with `!assumeInMemory` bypass (avoids deadlock with internal MemoryPageScan inside OnPagesClosed which constructs assumeInMemory iterators on the same thread that is running the Initializing-gated deferred action). - LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after snapping forward to BeginAddress (stale snap could place currentAddress at TailAddress with an unparseable record). ### B. Caller-side explicit drain (restores Reset's documented contract) - VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup() API. Cleanup loop uses increment-then-check pattern (bump cleanupInProgress first, re-read pauseCleanupRequests, back out if non-zero) so a pause caller cannot race past an iteration about to start. - ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` / `vectorManager.ResumeCleanup()` try/finally. A is defense-in-depth (any future background reader is safe); B restores the documented contract for the known case and skips the per-iteration spin-wait when callers cooperate. ## Validation - Build: 0 warnings, 0 errors. dotnet format clean. - Tsavorite tests: 1453/1453 pass. - Garnet.test Reset/Recovery/Iterate filter: 30/30 pass. - ClusterMigrate + ClusterVectorSet (52 tests): all pass. - MigrateVectorSetWhileModifyingAsync (the originally failing cluster test): stable across 8 sequential runs (~18s each) with --blame-crash. - VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)]. - ClusterResetDuringReplicationTests: 2/2 pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nitialize() epoch-gate Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info / ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765 but a different remaining gap. ## Root cause PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe (2-phase BumpCurrentEpoch barrier around OnPagesClosed + FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize() that runs after base.Reset() returns. Initialize() non-monotonically rewinds HeadAddress / BeginAddress / TailPageOffset back to FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1 are reallocated). Race window between HeadAddress reset and TailPageOffset reset: - Iterator reads stopAddress = TailAddress = old high (stale) - Iterator passes currentAddress < stopAddress check, proceeds - Iterator reads headAddress = HeadAddress = new low - Iterator chooses in-memory path (currentAddress >= headAddress) - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync) call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only covers TaskManager-registered tasks. VectorManager.cleanupTask is started directly in the constructor (VectorManager.cs:110) as a plain async Task and is invisible to TaskManager. ## Fix Two complementary fixes: ### A. Tsavorite-side defense-in-depth - AllocatorBase: added `internal volatile bool Initializing`. Refactored Reset() to be non-virtual, wrapping the existing 2-phase epoch logic (renamed to ResetCore()) PLUS the per-allocator Initialize() call inside Initializing = true / false (try/finally so a throw can't leave iterators stuck). - Per-allocator Reset() overrides removed in ObjectAllocatorImpl, SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does ResetCore() + Initialize() directly under the gate. - ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch): - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside epoch protection — preexisting issue). - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }` with `!assumeInMemory` bypass (avoids deadlock with internal MemoryPageScan inside OnPagesClosed which constructs assumeInMemory iterators on the same thread that is running the Initializing-gated deferred action). - LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after snapping forward to BeginAddress (stale snap could place currentAddress at TailAddress with an unparseable record). ### B. Caller-side explicit drain (restores Reset's documented contract) - VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup() API. Cleanup loop uses increment-then-check pattern (bump cleanupInProgress first, re-read pauseCleanupRequests, back out if non-zero) so a pause caller cannot race past an iteration about to start. - ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` / `vectorManager.ResumeCleanup()` try/finally. A is defense-in-depth (any future background reader is safe); B restores the documented contract for the known case and skips the per-iteration spin-wait when callers cooperate. ## Validation - Build: 0 warnings, 0 errors. dotnet format clean. - Tsavorite tests: 1453/1453 pass. - Garnet.test Reset/Recovery/Iterate filter: 30/30 pass. - ClusterMigrate + ClusterVectorSet (52 tests): all pass. - MigrateVectorSetWhileModifyingAsync (the originally failing cluster test): stable across 8 sequential runs (~18s each) with --blame-crash. - VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)]. - ClusterResetDuringReplicationTests: 2/2 pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nitialize() epoch-gate Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info / ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765 but a different remaining gap. ## Root cause PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe (2-phase BumpCurrentEpoch barrier around OnPagesClosed + FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize() that runs after base.Reset() returns. Initialize() non-monotonically rewinds HeadAddress / BeginAddress / TailPageOffset back to FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1 are reallocated). Race window between HeadAddress reset and TailPageOffset reset: - Iterator reads stopAddress = TailAddress = old high (stale) - Iterator passes currentAddress < stopAddress check, proceeds - Iterator reads headAddress = HeadAddress = new low - Iterator chooses in-memory path (currentAddress >= headAddress) - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync) call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only covers TaskManager-registered tasks. VectorManager.cleanupTask is started directly in the constructor (VectorManager.cs:110) as a plain async Task and is invisible to TaskManager. ## Fix Two complementary fixes: ### A. Tsavorite-side defense-in-depth - AllocatorBase: added `internal volatile bool Initializing`. Refactored Reset() to be non-virtual, wrapping the existing 2-phase epoch logic (renamed to ResetCore()) PLUS the per-allocator Initialize() call inside Initializing = true / false (try/finally so a throw can't leave iterators stuck). - Per-allocator Reset() overrides removed in ObjectAllocatorImpl, SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does ResetCore() + Initialize() directly under the gate. - ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch): - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside epoch protection — preexisting issue). - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }` with `!assumeInMemory` bypass (avoids deadlock with internal MemoryPageScan inside OnPagesClosed which constructs assumeInMemory iterators on the same thread that is running the Initializing-gated deferred action). - LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after snapping forward to BeginAddress (stale snap could place currentAddress at TailAddress with an unparseable record). ### B. Caller-side explicit drain (restores Reset's documented contract) - VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup() API. Cleanup loop uses increment-then-check pattern (bump cleanupInProgress first, re-read pauseCleanupRequests, back out if non-zero) so a pause caller cannot race past an iteration about to start. - ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` / `vectorManager.ResumeCleanup()` try/finally. A is defense-in-depth (any future background reader is safe); B restores the documented contract for the known case and skips the per-iteration spin-wait when callers cooperate. ## Validation - Build: 0 warnings, 0 errors. dotnet format clean. - Tsavorite tests: 1453/1453 pass. - Garnet.test Reset/Recovery/Iterate filter: 30/30 pass. - ClusterMigrate + ClusterVectorSet (52 tests): all pass. - MigrateVectorSetWhileModifyingAsync (the originally failing cluster test): stable across 8 sequential runs (~18s each) with --blame-crash. - VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)]. - ClusterResetDuringReplicationTests: 2/2 pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nitialize() epoch-gate Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info / ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765 but a different remaining gap. ## Root cause PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe (2-phase BumpCurrentEpoch barrier around OnPagesClosed + FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize() that runs after base.Reset() returns. Initialize() non-monotonically rewinds HeadAddress / BeginAddress / TailPageOffset back to FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1 are reallocated). Race window between HeadAddress reset and TailPageOffset reset: - Iterator reads stopAddress = TailAddress = old high (stale) - Iterator passes currentAddress < stopAddress check, proceeds - Iterator reads headAddress = HeadAddress = new low - Iterator chooses in-memory path (currentAddress >= headAddress) - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync) call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only covers TaskManager-registered tasks. VectorManager.cleanupTask is started directly in the constructor (VectorManager.cs:110) as a plain async Task and is invisible to TaskManager. ## Fix Two complementary fixes: ### A. Tsavorite-side defense-in-depth - AllocatorBase: added `internal volatile bool Initializing`. Refactored Reset() to be non-virtual, wrapping the existing 2-phase epoch logic (renamed to ResetCore()) PLUS the per-allocator Initialize() call inside Initializing = true / false (try/finally so a throw can't leave iterators stuck). - Per-allocator Reset() overrides removed in ObjectAllocatorImpl, SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does ResetCore() + Initialize() directly under the gate. - ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch): - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside epoch protection — preexisting issue). - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }` with `!assumeInMemory` bypass (avoids deadlock with internal MemoryPageScan inside OnPagesClosed which constructs assumeInMemory iterators on the same thread that is running the Initializing-gated deferred action). - LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after snapping forward to BeginAddress (stale snap could place currentAddress at TailAddress with an unparseable record). ### B. Caller-side explicit drain (restores Reset's documented contract) - VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup() API. Cleanup loop uses increment-then-check pattern (bump cleanupInProgress first, re-read pauseCleanupRequests, back out if non-zero) so a pause caller cannot race past an iteration about to start. - ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` / `vectorManager.ResumeCleanup()` try/finally. A is defense-in-depth (any future background reader is safe); B restores the documented contract for the known case and skips the per-iteration spin-wait when callers cooperate. ## Validation - Build: 0 warnings, 0 errors. dotnet format clean. - Tsavorite tests: 1453/1453 pass. - Garnet.test Reset/Recovery/Iterate filter: 30/30 pass. - ClusterMigrate + ClusterVectorSet (52 tests): all pass. - MigrateVectorSetWhileModifyingAsync (the originally failing cluster test): stable across 8 sequential runs (~18s each) with --blame-crash. - VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)]. - ClusterResetDuringReplicationTests: 2/2 pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nitialize() epoch-gate Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info / ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765 but a different remaining gap. ## Root cause PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe (2-phase BumpCurrentEpoch barrier around OnPagesClosed + FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize() that runs after base.Reset() returns. Initialize() non-monotonically rewinds HeadAddress / BeginAddress / TailPageOffset back to FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1 are reallocated). Race window between HeadAddress reset and TailPageOffset reset: - Iterator reads stopAddress = TailAddress = old high (stale) - Iterator passes currentAddress < stopAddress check, proceeds - Iterator reads headAddress = HeadAddress = new low - Iterator chooses in-memory path (currentAddress >= headAddress) - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync) call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only covers TaskManager-registered tasks. VectorManager.cleanupTask is started directly in the constructor (VectorManager.cs:110) as a plain async Task and is invisible to TaskManager. ## Fix Two complementary fixes: ### A. Tsavorite-side defense-in-depth - AllocatorBase: added `internal volatile bool Initializing`. Refactored Reset() to be non-virtual, wrapping the existing 2-phase epoch logic (renamed to ResetCore()) PLUS the per-allocator Initialize() call inside Initializing = true / false (try/finally so a throw can't leave iterators stuck). - Per-allocator Reset() overrides removed in ObjectAllocatorImpl, SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does ResetCore() + Initialize() directly under the gate. - ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch): - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside epoch protection — preexisting issue). - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }` with `!assumeInMemory` bypass (avoids deadlock with internal MemoryPageScan inside OnPagesClosed which constructs assumeInMemory iterators on the same thread that is running the Initializing-gated deferred action). - LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after snapping forward to BeginAddress (stale snap could place currentAddress at TailAddress with an unparseable record). ### B. Caller-side explicit drain (restores Reset's documented contract) - VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup() API. Cleanup loop uses increment-then-check pattern (bump cleanupInProgress first, re-read pauseCleanupRequests, back out if non-zero) so a pause caller cannot race past an iteration about to start. - ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` / `vectorManager.ResumeCleanup()` try/finally. A is defense-in-depth (any future background reader is safe); B restores the documented contract for the known case and skips the per-iteration spin-wait when callers cooperate. ## Validation - Build: 0 warnings, 0 errors. dotnet format clean. - Tsavorite tests: 1453/1453 pass. - Garnet.test Reset/Recovery/Iterate filter: 30/30 pass. - ClusterMigrate + ClusterVectorSet (52 tests): all pass. - MigrateVectorSetWhileModifyingAsync (the originally failing cluster test): stable across 8 sequential runs (~18s each) with --blame-crash. - VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)]. - ClusterResetDuringReplicationTests: 2/2 pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nitialize() epoch-gate Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info / ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765 but a different remaining gap. ## Root cause PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe (2-phase BumpCurrentEpoch barrier around OnPagesClosed + FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize() that runs after base.Reset() returns. Initialize() non-monotonically rewinds HeadAddress / BeginAddress / TailPageOffset back to FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1 are reallocated). Race window between HeadAddress reset and TailPageOffset reset: - Iterator reads stopAddress = TailAddress = old high (stale) - Iterator passes currentAddress < stopAddress check, proceeds - Iterator reads headAddress = HeadAddress = new low - Iterator chooses in-memory path (currentAddress >= headAddress) - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync) call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only covers TaskManager-registered tasks. VectorManager.cleanupTask is started directly in the constructor (VectorManager.cs:110) as a plain async Task and is invisible to TaskManager. ## Fix Two complementary fixes: ### A. Tsavorite-side defense-in-depth - AllocatorBase: added `internal volatile bool Initializing`. Refactored Reset() to be non-virtual, wrapping the existing 2-phase epoch logic (renamed to ResetCore()) PLUS the per-allocator Initialize() call inside Initializing = true / false (try/finally so a throw can't leave iterators stuck). - Per-allocator Reset() overrides removed in ObjectAllocatorImpl, SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does ResetCore() + Initialize() directly under the gate. - ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch): - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside epoch protection — preexisting issue). - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }` with `!assumeInMemory` bypass (avoids deadlock with internal MemoryPageScan inside OnPagesClosed which constructs assumeInMemory iterators on the same thread that is running the Initializing-gated deferred action). - LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after snapping forward to BeginAddress (stale snap could place currentAddress at TailAddress with an unparseable record). ### B. Caller-side explicit drain (restores Reset's documented contract) - VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup() API. Cleanup loop uses increment-then-check pattern (bump cleanupInProgress first, re-read pauseCleanupRequests, back out if non-zero) so a pause caller cannot race past an iteration about to start. - ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` / `vectorManager.ResumeCleanup()` try/finally. A is defense-in-depth (any future background reader is safe); B restores the documented contract for the known case and skips the per-iteration spin-wait when callers cooperate. ## Validation - Build: 0 warnings, 0 errors. dotnet format clean. - Tsavorite tests: 1453/1453 pass. - Garnet.test Reset/Recovery/Iterate filter: 30/30 pass. - ClusterMigrate + ClusterVectorSet (52 tests): all pass. - MigrateVectorSetWhileModifyingAsync (the originally failing cluster test): stable across 8 sequential runs (~18s each) with --blame-crash. - VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)]. - ClusterResetDuringReplicationTests: 2/2 pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nitialize() epoch-gate Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info / ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765 but a different remaining gap. ## Root cause PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe (2-phase BumpCurrentEpoch barrier around OnPagesClosed + FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize() that runs after base.Reset() returns. Initialize() non-monotonically rewinds HeadAddress / BeginAddress / TailPageOffset back to FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1 are reallocated). Race window between HeadAddress reset and TailPageOffset reset: - Iterator reads stopAddress = TailAddress = old high (stale) - Iterator passes currentAddress < stopAddress check, proceeds - Iterator reads headAddress = HeadAddress = new low - Iterator chooses in-memory path (currentAddress >= headAddress) - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync) call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only covers TaskManager-registered tasks. VectorManager.cleanupTask is started directly in the constructor (VectorManager.cs:110) as a plain async Task and is invisible to TaskManager. ## Fix Two complementary fixes: ### A. Tsavorite-side defense-in-depth - AllocatorBase: added `internal volatile bool Initializing`. Refactored Reset() to be non-virtual, wrapping the existing 2-phase epoch logic (renamed to ResetCore()) PLUS the per-allocator Initialize() call inside Initializing = true / false (try/finally so a throw can't leave iterators stuck). - Per-allocator Reset() overrides removed in ObjectAllocatorImpl, SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does ResetCore() + Initialize() directly under the gate. - ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch): - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside epoch protection — preexisting issue). - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }` with `!assumeInMemory` bypass (avoids deadlock with internal MemoryPageScan inside OnPagesClosed which constructs assumeInMemory iterators on the same thread that is running the Initializing-gated deferred action). - LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after snapping forward to BeginAddress (stale snap could place currentAddress at TailAddress with an unparseable record). ### B. Caller-side explicit drain (restores Reset's documented contract) - VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup() API. Cleanup loop uses increment-then-check pattern (bump cleanupInProgress first, re-read pauseCleanupRequests, back out if non-zero) so a pause caller cannot race past an iteration about to start. - ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` / `vectorManager.ResumeCleanup()` try/finally. A is defense-in-depth (any future background reader is safe); B restores the documented contract for the known case and skips the per-iteration spin-wait when callers cooperate. ## Validation - Build: 0 warnings, 0 errors. dotnet format clean. - Tsavorite tests: 1453/1453 pass. - Garnet.test Reset/Recovery/Iterate filter: 30/30 pass. - ClusterMigrate + ClusterVectorSet (52 tests): all pass. - MigrateVectorSetWhileModifyingAsync (the originally failing cluster test): stable across 8 sequential runs (~18s each) with --blame-crash. - VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)]. - ClusterResetDuringReplicationTests: 2/2 pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nitialize() epoch-gate (#1775) Fixes a rare CI fatal AccessViolationException at LogRecord.get_Info / ObjectScanIterator.GetPhysicalAddressAndAllocatedSize during VectorManager.RunCleanupTaskAsync. Same crash signature as PR #1765 but a different remaining gap. ## Root cause PR #1765 made AllocatorBase.Reset()'s page-free section epoch-safe (2-phase BumpCurrentEpoch barrier around OnPagesClosed + FreeAllAllocatedPages). It did NOT cover the per-allocator Initialize() that runs after base.Reset() returns. Initialize() non-monotonically rewinds HeadAddress / BeginAddress / TailPageOffset back to FirstValidAddress while pagePointers[i] are mostly 0 (only pages 0 and 1 are reallocated). Race window between HeadAddress reset and TailPageOffset reset: - Iterator reads stopAddress = TailAddress = old high (stale) - Iterator passes currentAddress < stopAddress check, proceeds - Iterator reads headAddress = HeadAddress = new low - Iterator chooses in-memory path (currentAddress >= headAddress) - Iterator dereferences pagePointers[X] + offset = 0 + offset → AVE The cluster re-attach paths (ReplicaDisklessSync, ReplicaDiskbasedSync) call SuspendPrimaryOnlyTasksAsync() AFTER Reset, and even then it only covers TaskManager-registered tasks. VectorManager.cleanupTask is started directly in the constructor (VectorManager.cs:110) as a plain async Task and is invisible to TaskManager. ## Fix Two complementary fixes: ### A. Tsavorite-side defense-in-depth - AllocatorBase: added `internal volatile bool Initializing`. Refactored Reset() to be non-virtual, wrapping the existing 2-phase epoch logic (renamed to ResetCore()) PLUS the per-allocator Initialize() call inside Initializing = true / false (try/finally so a throw can't leave iterators stuck). - Per-allocator Reset() overrides removed in ObjectAllocatorImpl, SpanByteAllocatorImpl, TsavoriteLogAllocatorImpl — base now does ResetCore() + Initialize() directly under the gate. - ObjectScanIterator + SpanByteScanIterator (InitializeGetNextAndAcquireEpoch): - Move epoch.Resume() BEFORE sampling TailAddress (was sampled outside epoch protection — preexisting issue). - Spin-wait `while (hlogBase.Initializing) { Thread.Yield(); epoch.ProtectAndDrain(); }` with `!assumeInMemory` bypass (avoids deadlock with internal MemoryPageScan inside OnPagesClosed which constructs assumeInMemory iterators on the same thread that is running the Initializing-gated deferred action). - LoadPageIfNeeded: recheck `currentAddress >= stopAddress` after snapping forward to BeginAddress (stale snap could place currentAddress at TailAddress with an unparseable record). ### B. Caller-side explicit drain (restores Reset's documented contract) - VectorManager.Cleanup.cs: added PauseCleanupAsync() / ResumeCleanup() API. Cleanup loop uses increment-then-check pattern (bump cleanupInProgress first, re-read pauseCleanupRequests, back out if non-zero) so a pause caller cannot race past an iteration about to start. - ReplicaDisklessSync.cs + ReplicaDiskbasedSync.cs: wrap storeWrapper.Reset() in `await vectorManager.PauseCleanupAsync()` / `vectorManager.ResumeCleanup()` try/finally. A is defense-in-depth (any future background reader is safe); B restores the documented contract for the known case and skips the per-iteration spin-wait when callers cooperate. ## Validation - Build: 0 warnings, 0 errors. dotnet format clean. - Tsavorite tests: 1453/1453 pass. - Garnet.test Reset/Recovery/Iterate filter: 30/30 pass. - ClusterMigrate + ClusterVectorSet (52 tests): all pass. - MigrateVectorSetWhileModifyingAsync (the originally failing cluster test): stable across 8 sequential runs (~18s each) with --blame-crash. - VectorCleanupVsResetRaceTests (PR #1765's regression): pass [Repeat(5)]. - ClusterResetDuringReplicationTests: 2/2 pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ListPushPopStressTest host crash and VectorManager cleanup vs Reset() AVE
Two independent rare CI failures, both surfacing as
Test host process crashedand aborting the whole test run.1.
ClusterVectorSetTests.MigrateVectorSetWhileModifyingAsync— fatalAccessViolationExceptioninVectorManagercleanup taskSymptom
The AVE is a Corrupted-State Exception —
catch (Exception)inRunCleanupTaskAsynccannot suppress it; the runtime fails fast and the test host crashes.Root cause
Recovery.Reset()→hlogBase.Reset()(inAllocatorBaseand the per-allocator overridesSpanByte/Object/TsavoriteLog) frees pages by synchronously invokingOnPagesClosed(...)and afor (i in BufferSize) FreePage(i)loop. Both paths ultimately callReturnPage(index), which sets:Reset()'s docstring promised "WARNING: assumes that threads have drained out at this point." But Garnet's cluster re-attach paths invoke it on a running store:libs/cluster/Server/Replication/ReplicaOps/ReplicaDisklessSync.cs:100libs/cluster/Server/Replication/ReplicaOps/ReplicaDiskbasedSync.cs:136In both files
storeWrapper.Reset()is called beforeSuspendPrimaryOnlyTasksAsync(), and even that suspend only drainsTaskManagertasks —VectorManager.cleanupTaskis independent and never drained.Once
pagePointers[i] = 0, the iterator'sGetPhysicalAddressreturns0 + offset— a tiny kernel-page address — and dereferencing it in*(RecordInfo*)physicalAddressraises a fatal AVE.The exact interleaving
Production scenario in
MigrateVectorSetWhileModifyingAsync:CleanupDroppedIndexqueues a cleanup-task scan on the source primary.ReplicaDisklessSync.ReplicateAttachAsync/ReplicaDiskbasedSync.ReplicateAttachAsynccallsstoreWrapper.Reset().Thread-level interleaving:
Why epoch protection didn't catch this
Tsavorite's normal eviction path defers page-freeing through:
BumpCurrentEpochqueues the action and only fires it afterSafeToReclaimEpochhas advanced past the prior epoch — i.e. after every thread that was holding the prior epoch has either suspended or moved on. That's why scan iterators are safe against normal eviction.Reset()skipped that mechanism in two places:AllocatorBase.Reset()invokedOnPagesClosed(newBeginAddress)directly.for (i in BufferSize) FreePage(i)loop that ran afterbase.Reset()returned — also without epoch protection. This second loop is the actual point of failure: even ifOnPagesClosedwere deferred, the leftover (tail) page is freed by the override loop while a reader could still be reading it.The fix (Tsavorite layer)
AllocatorBase.Reset()defers ALL page-close + page-free work throughBumpCurrentEpochand waits on aManualResetEventSlimsignalled by the deferred action — no polling:Each per-allocator override (
SpanByte/Object/TsavoriteLog) moves itsFreePage(i)loop into a newFreeAllAllocatedPages()virtual so the loop runs inside the deferred action:Why this is safe
SafeToReclaimEpoch ≥ priorEpoch, i.e. after every iterator that was insideGetNextat the momentReset()was called has either suspended or advanced. By the timepagePointers[i] = 0executes, no thread is readingpagePointers[i].GetNextafterHeadAddresswas shifted seecurrentAddress < headAddressand route through the buffered disk frame instead ofpagePointers— so they don't touch the cleared array.Reset()blocks until the deferred work has actually run, preserving its synchronous contract (the override'sInitialize()afterReset()observes a fully freed page set).Test vs. product
Strictly,
Reset()'s docstring put the burden on callers. The cluster re-attach paths violate that — they callReset()before draining theVectorManagercleanup task, andSuspendPrimaryOnlyTasksAsync()doesn't cover it. The alternative would be to drain every background reader at everyReset()callsite, but we chose to makeReset()itself epoch-safe because the contract was implicit, callsites are scattered, and Tsavorite already has the right primitive (epoch.BumpCurrentEpoch) — the normal eviction path uses it. This makes the safety property enforced rather than assumed, and protects any future caller / background reader.Repro
test/Garnet.test/VectorCleanupVsResetRaceTests.cs— adds 4 000 vectors, drops the set (queues a full-keyspace cleanup scan), then spamsstoreWrapper.Reset()for 5 s.LogRecord.get_Info→ObjectScanIterator.GetNext→VectorManager.RunCleanupTaskAsync).[Repeat]iterations pass (~2 700 resets per iteration concurrent with the cleanup iterator), no AVE.2.
RespListTests.ListPushPopStressTest— host crash on rareRedisTimeoutExceptionSymptom
Root cause (two compounding issues)
Worker threads created via
new Thread(() => ...)had no try/catch. In modern .NET an unhandled exception in a manually-createdThreadterminates the process, so a single transientRedisTimeoutExceptionaborted the entire test run.All 20 sync workers shared a single
ConnectionMultiplexer. Every command went through one socket and one background writer. Under CI load + lowMemory eviction overhead the writer falls behind and accumulates queued messages until SyncTimeout (30s) trips. The failure diagnostics confirmed this:mc: 1/1, qs: 20, bw: SpinningDown.Fix
ConnectionMultiplexerper worker on the main thread. Each thread now owns its own socket, eliminating the single-writer bottleneck. Pre-creating also avoids a 20-way connect storm racingConnectTimeout.ConcurrentBag, signal stop, exit cleanly. No more host crash.Files