Misc fixes: epoch sharing, IEpochAccessor refactoring, lock improveme…#1555

Merged: badrishc merged 6 commits into main from badrishc/misc-fixes on Feb 10, 2026

@badrishc

Collaborator

…nts, test fixes, and BDN benchmarks

Copilot AI review requested due to automatic review settings February 10, 2026 17:58

Copilot AI left a comment

Contributor

Pull request overview

This PR appears to refactor epoch ownership/sharing across Tsavorite and Garnet (store + AOF), improve cancellation handling in background tasks, and update tests/benchmarks accordingly.

Changes:

  • Introduce explicit epoch injection/ownership for TsavoriteLog and TsavoriteBase, and refactor IEpochAccessor API (ReleaseIfHeld → TrySuspend).
  • Improve cancellation behavior by suppressing expected TaskCanceledException during shutdown in several long-running tasks.
  • Adjust multi-DB checkpoint/save tests and add BenchmarkDotNet benchmarks for epoch operations.
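
The epoch injection/ownership idea in the first bullet can be illustrated with a minimal sketch. `DemoEpoch` and `DemoLog` are hypothetical stand-ins, not the actual `LightEpoch` or `TsavoriteLog` types; the only point is that a component disposes an epoch it created itself and leaves an injected one to its caller.

```csharp
using System;

// Hypothetical stand-in for an epoch object; this is not Tsavorite's LightEpoch.
public sealed class DemoEpoch : IDisposable
{
    public bool IsDisposed { get; private set; }
    public void Dispose() => IsDisposed = true;
}

// Hypothetical component that either borrows an injected epoch or creates its own.
public sealed class DemoLog : IDisposable
{
    private readonly DemoEpoch epoch;
    private readonly bool ownsEpoch;

    public DemoLog(DemoEpoch sharedEpoch = null)
    {
        // We own the epoch only when the caller did not supply one.
        ownsEpoch = sharedEpoch == null;
        epoch = sharedEpoch ?? new DemoEpoch();
    }

    public DemoEpoch Epoch => epoch;

    public void Dispose()
    {
        // Dispose only what we created; an injected epoch belongs to the caller.
        if (ownsEpoch) epoch.Dispose();
    }
}
```

With this shape, several `DemoLog` instances can share one epoch, and whoever created the shared epoch is responsible for disposing it after all of them are gone.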

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| test/Garnet.test/MultiDatabaseTests.cs | Makes LASTSAVE wait logic more robust in MultiDatabaseSaveInProgressTest. |
| libs/storage/Tsavorite/cs/src/core/Utilities/Native32.cs | Adjusts thread affinitization mapping when skipping hyperthreads. |
| libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLogSettings.cs | Adds Epoch setting to allow external epoch sharing. |
| libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLog.cs | Uses injected epoch with ownership tracking; suppresses expected cancellation; updates epoch accessor API usage. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/TsavoriteBase.cs | Adds epoch ownership tracking and avoids disposing externally-owned epochs. |
| libs/storage/Tsavorite/cs/src/core/Epochs/LightEpoch.cs | Moves IEpochAccessor out and renames/refines suspend API (TrySuspend). |
| libs/storage/Tsavorite/cs/src/core/Epochs/IEpochAccessor.cs | New file defining IEpochAccessor contract. |
| libs/storage/Tsavorite/cs/src/core/Async/CompletePendingAsync.cs | Removes an extra InternalRefresh in async pending completion loop. |
| libs/storage/Tsavorite/cs/benchmark/BDN-Tsavorite.Benchmark/EpochTests.cs | Adds benchmarks for epoch operations. |
| libs/storage/Tsavorite/cs/benchmark/BDN-Tsavorite.Benchmark/BenchmarkDotNetTestsApp.cs | Uses DebugInProcessConfig in DEBUG builds for benchmarks. |
| libs/server/StoreWrapper.cs | Suppresses expected cancellation during shutdown; disposes iterator in slot scan. |
| libs/server/Storage/Session/StorageSession.cs | Changes disposal locking behavior for collection task locks. |
| libs/server/Servers/GarnetServerOptions.cs | Passes epoch into AOF settings to share epoch with TsavoriteLog. |
| libs/server/GarnetDatabase.cs | Tracks KVSettings references and disposes their devices; changes checkpoint lock acquisition during dispose. |
| libs/server/Databases/MultiDatabaseManager.cs | Changes disposal lock acquisition for db maps and content locks. |
| libs/host/GarnetServer.cs | Introduces shared store epoch and AOF epoch lifecycle management; updates store/AOF creation accordingly. |
| libs/common/SingleWriterMultiReaderLock.cs | Renames CloseLock → TryCloseLock and clarifies return semantics. |
| libs/cluster/Server/Gossip/Gossip.cs | Suppresses expected cancellation on gossip shutdown. |
| libs/cluster/Server/Gossip/GarnetServerNode.cs | Updates disposal to use TryCloseLock. |
| libs/cluster/Server/ClusterManagerSlotState.cs | Lowers log level for “Bumped Epoch” messages (Warning → Debug). |
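
Several rows above mention suppressing expected cancellation during shutdown. A minimal sketch of that pattern follows, assuming a hypothetical `BackgroundLoop.RunAsync` helper that is not taken from the PR. `TaskCanceledException` derives from `OperationCanceledException`, so a single filtered catch covers both, and the filter lets unrelated cancellations propagate.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical long-running task; the name and loop body are illustrative only.
public static class BackgroundLoop
{
    // Runs until the token is cancelled; returns the number of completed iterations.
    public static async Task<int> RunAsync(CancellationToken token)
    {
        int iterations = 0;
        try
        {
            while (true)
            {
                // Task.Delay throws TaskCanceledException once the token is cancelled.
                await Task.Delay(10, token);
                iterations++;
            }
        }
        catch (OperationCanceledException) when (token.IsCancellationRequested)
        {
            // Expected during shutdown: swallow instead of surfacing it as an error.
        }
        return iterations;
    }
}
```

A caller that cancels the token during shutdown then observes a task that completes normally instead of one that is faulted or canceled.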

Comment thread libs/server/Databases/MultiDatabaseManager.cs
Comment thread libs/server/StoreWrapper.cs Outdated
Comment thread libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLog.cs Outdated
Comment thread libs/server/Storage/Session/StorageSession.cs
Comment thread libs/server/GarnetDatabase.cs
Comment thread libs/server/Databases/MultiDatabaseManager.cs
badrishc and others added 2 commits February 10, 2026 10:33
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

@TalZaccai TalZaccai left a comment

Contributor

Couple of nits, otherwise LGTM.

Comment thread libs/host/GarnetServer.cs
Comment thread libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/TsavoriteBase.cs Outdated
@badrishc badrishc merged commit ee84882 into main Feb 10, 2026
34 checks passed
@badrishc badrishc deleted the badrishc/misc-fixes branch February 10, 2026 21:33
TalZaccai pushed a commit that referenced this pull request Feb 26, 2026
#1555)

* Misc fixes: epoch sharing, IEpochAccessor refactoring, lock improvements, test fixes, and BDN benchmarks

* Update libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLog.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add using

* Improve SingleWriterMultiReaderLock

* Revert "Improve SingleWriterMultiReaderLock"

This reverts commit 394e11c.

* rename ownedEpoch

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
TalZaccai added a commit that referenced this pull request Feb 27, 2026
* Update Azure Cosmos DB Garnet Cache docs (#1548)

* update registration process and troubleshooting

* update phrasing

* update email

* Update website/docs/azure/faq.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* wait for recovery before issuing get keys (#1553)

* Parallel ACL test fixes (#1554)

* Parallel ACL tests sometimes run forever, cleaned up to properly use async and also check server responses.

* nit

* format

* timeouts

* reduce timeout

* address comments

* nit

* nit

* Misc fixes: epoch sharing, IEpochAccessor refactoring, lock improveme… (#1555)

* Misc fixes: epoch sharing, IEpochAccessor refactoring, lock improvements, test fixes, and BDN benchmarks

* Update libs/storage/Tsavorite/cs/src/core/TsavoriteLog/TsavoriteLog.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add using

* Improve SingleWriterMultiReaderLock

* Revert "Improve SingleWriterMultiReaderLock"

This reverts commit 394e11c.

* rename ownedEpoch

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Work around receive latency increasing with larger buffers (#1546)

* shrink receive buffer if it grows past maximum configured - but only if buffer was large enough to serve last request in the first place

* Update libs/common/Networking/NetworkHandler.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update libs/common/Networking/NetworkHandler.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* don't shrink if we still have pending data greater than the maximum

* use correct variable

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Replace spin-wait with semaphore-based backoff for epoch table exhaustion (#1543)

* When hundreds of threads compete for epoch table entries, the previous spin-wait
   loop in ReserveEntry caused 100% CPU utilization due to tight spinning
   with Thread.Yield().

   Changes:
   - Add SemaphoreSlim-based wait mechanism for threads when epoch table is full
   - Split ReserveEntry into fast path (TryAcquireEntry) and slow path (ReserveEntryWait)
   - Fast path: probes startOffset1, startOffset2, then circles table twice - fully inlinable
   - Slow path: uses try/finally with semaphore wait - marked NoInlining since kernel
     wait dominates cost anyway
   - Release() signals one waiting thread via volatile waiterCount check (nearly zero
     overhead when no waiters)
   - Double-check pattern in ReserveEntryWait prevents lost wakeups: increment
     waiterCount, re-check for slots, then wait
   - SemaphoreSlim uses Monitor.Pulse internally which provides FIFO wake-up order,
     preventing starvation

   Performance characteristics:
   - No contention: unchanged - fast path acquires entry with same probing logic
   - Table full: threads wait efficiently instead of burning CPU
   - Release hot path: single volatile read of waiterCount when no waiters
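
The fast-path/slow-path split and the double-check described above can be sketched roughly as follows. This is an illustrative `SlotTable`, not the LightEpoch code: method names mirror the commit message, but the linear probing and the slot encoding are assumptions. The sketch also releases a slot with `Interlocked.Exchange` (a full fence) so that the waiter-count check cannot be reordered ahead of the slot write, which costs more than the single volatile read the commit describes.

```csharp
using System.Runtime.CompilerServices;
using System.Threading;

// Hypothetical fixed-size slot table; illustrative only.
public sealed class SlotTable
{
    private readonly int[] slots;   // 0 = free, 1 = taken
    private readonly SemaphoreSlim semaphore = new SemaphoreSlim(0);
    private int waiterCount;

    public SlotTable(int size) => slots = new int[size];

    // Fast path: probe every slot once and claim the first free one.
    private bool TryAcquireEntry(out int index)
    {
        for (int i = 0; i < slots.Length; i++)
        {
            if (Interlocked.CompareExchange(ref slots[i], 1, 0) == 0)
            {
                index = i;
                return true;
            }
        }
        index = -1;
        return false;
    }

    public int ReserveEntry()
        => TryAcquireEntry(out int index) ? index : ReserveEntryWait();

    // Slow path: register as a waiter, re-check, then block on the semaphore.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private int ReserveEntryWait()
    {
        while (true)
        {
            Interlocked.Increment(ref waiterCount);
            try
            {
                // Re-check after registering so a concurrent Release is not missed.
                if (TryAcquireEntry(out int index)) return index;
                semaphore.Wait();
            }
            finally
            {
                Interlocked.Decrement(ref waiterCount);
            }
        }
    }

    public void Release(int index)
    {
        // Full fence: the slot is visibly free before waiterCount is read.
        Interlocked.Exchange(ref slots[index], 0);
        if (Volatile.Read(ref waiterCount) > 0) semaphore.Release();
    }
}
```

A spurious extra `semaphore.Release()` only causes one additional trip around the retry loop, so the sketch favors over-signalling to rule out a lost wakeup.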

* add small comment

* clarify comments, increment version

* make lightepoch isolate instances properly

* nits

* nits

* Cancel epoch table waiters on dispose for graceful shutdown

When the epoch table is full, threads block on a SemaphoreSlim in ReserveEntryWait until a slot is released. If LightEpoch is disposed while threads are waiting, they remain blocked indefinitely, preventing graceful shutdown.

Add a CancellationTokenSource that is cancelled during Dispose, causing blocked threads to receive an OperationCanceledException. Dispose then spin-waits for all waiters to finish unwinding before disposing the CancellationTokenSource and SemaphoreSlim.
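
A minimal sketch of the shutdown sequence described above, using a hypothetical `WaitGate` class instead of LightEpoch itself: `Dispose` cancels the token, spin-waits until every blocked thread has unwound, and only then disposes the primitives. The sketch does not handle threads that call `Wait` after `Dispose` has finished; those would see an `ObjectDisposedException` from the semaphore.

```csharp
using System;
using System.Threading;

// Hypothetical wait gate; illustrative only, not the LightEpoch implementation.
public sealed class WaitGate : IDisposable
{
    private readonly SemaphoreSlim semaphore = new SemaphoreSlim(0);
    private readonly CancellationTokenSource cts = new CancellationTokenSource();
    private readonly CancellationToken token;
    private int waiterCount;

    // Capture the token once so waiters never touch the CancellationTokenSource directly.
    public WaitGate() => token = cts.Token;

    public int WaiterCount => Volatile.Read(ref waiterCount);

    // Blocks until signalled; throws OperationCanceledException if the gate is disposed first.
    public void Wait()
    {
        Interlocked.Increment(ref waiterCount);
        try
        {
            semaphore.Wait(token);
        }
        finally
        {
            Interlocked.Decrement(ref waiterCount);
        }
    }

    public void Signal() => semaphore.Release();

    public void Dispose()
    {
        cts.Cancel();
        // Let blocked threads unwind before tearing down the primitives they reference.
        var spinner = new SpinWait();
        while (Volatile.Read(ref waiterCount) > 0) spinner.SpinOnce();
        cts.Dispose();
        semaphore.Dispose();
    }
}
```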

* nit

* comments

* nit

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* nit

* fix dispose to be robust

* nit

* nit

* nit

* no need to refresh here

* add better epoch logging, fix garnet epoch dispose

* nit

* fixes

* Fix MultiDatabase to correctly dispose devices

* undo change to test

* share epoch across all aof instances

* fix testcase to wait for checkpoint to complete

* fix HasKeysInSlots

* add debug helper static method to LightEpoch

* actually add

* reduce logger verbosity

* nits

* fix

* fix CloseLock semantics to ensure dispose happens after write lock is released.

* nit

* change lock style for clarity

* fixes

* updates

* nit

* fix formatting

* update test suite to check LightEpoch disposal

* update tsavo tests to have tear down checks in one place

* ensure epochs are disposed if server throws in constructor

* fix tsavo tests to properly dispose epoch

* fix test

* fixes

* nit

* fix test

* improve comments

* update LightEpoch copy in client

* nit

* clean up struct Entry

* use new epoch for garnet client correctly

* fix

* nit

* fix CanDoBulkDeleteTests

* share client epoch for failover

* fix

* update version for release

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Bump qs from 6.14.1 to 6.14.2 in /website (#1562)

Bumps [qs](https://github.com/ljharb/qs) from 6.14.1 to 6.14.2.
- [Changelog](https://github.com/ljharb/qs/blob/main/CHANGELOG.md)
- [Commits](ljharb/qs@v6.14.1...v6.14.2)

---
updated-dependencies:
- dependency-name: qs
  dependency-version: 6.14.2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Use shared LightEpoch in parallel ACL/auth tests (#1566)

ParallelTests now use a shared LightEpoch for all GarnetClient instances, improving thread safety and resource management. TestUtils.GetGarnetClient accepts an optional epoch parameter, which is passed to the GarnetClient constructor. This reduces contention and potential corruption during parallel authentication and ACL operations.

* Fix ClusterDisklessSyncResetSyncManagerCts (#1557)

* fix ClusterDisklessSyncResetSyncManagerCts

* set message only when error occurs

* address comment

---------

Co-authored-by: Tal Zaccai <talzacc@microsoft.com>

* Support hostname resolution in MIGRATE command (#1565)

* Initial plan

* Add hostname resolution support to MIGRATE command

Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>

* Add tests for hostname resolution in MIGRATE command

Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>

* Fix test for invalid hostname and improve test robustness

Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>

* Address code review feedback: IPv4 preference, specific exception handling, null check

Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>

* Improve variable naming: resolvedAddress -> effectiveAddress

Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>

* fix formatting

* remove unnecessary DEBUG

* revert DEBUG flag to its original state

* cleanup tests

* Start worker search from index 2 to skip local worker and prevent self-migration

Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>

* Test all resolved IPs against cluster config and revert license changes

Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove unnecessary ArgumentOutOfRangeException catch block

Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>

* Use Dns.GetHostEntryAsync for hostname resolution

Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>

* Add ConfigureAwait(false) to Dns.GetHostEntryAsync call

Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: vazois <96085550+vazois@users.noreply.github.com>
Co-authored-by: Vasileios Zois <vazois@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
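
The hostname-resolution steps listed for #1565 (`Dns.GetHostEntryAsync`, `ConfigureAwait(false)`, IPv4 preference, specific exception handling, null check) might look roughly like the sketch below. `HostResolver` and its method names are hypothetical. The commits also mention testing all resolved IPs against the cluster config, which this simplified version does not attempt; it just picks one address.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical helper; not the MIGRATE implementation from the PR.
public static class HostResolver
{
    // Picks an IPv4 address when one is available, otherwise the first address; null if none.
    public static IPAddress PreferIPv4(IPAddress[] addresses)
    {
        if (addresses == null || addresses.Length == 0) return null;
        return addresses.FirstOrDefault(a => a.AddressFamily == AddressFamily.InterNetwork)
               ?? addresses[0];
    }

    // Returns the parsed literal if host is already an IP address; otherwise resolves via DNS.
    public static async Task<IPAddress> ResolveAsync(string host)
    {
        if (IPAddress.TryParse(host, out var literal)) return literal;
        try
        {
            var entry = await Dns.GetHostEntryAsync(host).ConfigureAwait(false);
            return PreferIPv4(entry.AddressList);
        }
        catch (SocketException)
        {
            // Resolution failed; the caller is expected to null-check the result.
            return null;
        }
    }
}
```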

* format

* Update libs/server/GarnetDatabase.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fixing merge issue

* Added XML comment

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Justine Cocchi <jucocchi@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Vasileios Zois <96085550+vazois@users.noreply.github.com>
Co-authored-by: Badrish Chandramouli <badrishc@microsoft.com>
Co-authored-by: kevin-montrose <kmontrose@microsoft.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Vasileios Zois <vazois@microsoft.com>
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 12, 2026

3 participants