Skip to content

Optimize IO paths in KV#1837

Merged
badrishc merged 28 commits into
mainfrom
badrishc/optimize-kv-io
Jun 2, 2026
Merged

Optimize IO paths in KV#1837
badrishc merged 28 commits into
mainfrom
badrishc/optimize-kv-io

Conversation

@badrishc

Copy link
Copy Markdown
Collaborator

No description provided.

@badrishc badrishc force-pushed the badrishc/optimize-kv-io branch from 44d9f44 to bacb252 Compare May 31, 2026 02:53
@badrishc badrishc marked this pull request as ready for review May 31, 2026 03:07
Copilot AI review requested due to automatic review settings May 31, 2026 03:07

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@badrishc badrishc requested a review from Copilot May 31, 2026 20:03
badrishc added a commit that referenced this pull request May 31, 2026
…ync lost-wakeup race

The Enqueue semaphore-skip optimization (commit aee945d) reads
waiterCount via Volatile.Read (acquire-only). On weak memory models
(ARM), the acquire-load can be reordered ahead of ConcurrentQueue's
slot-sequence-number release-store, which is what publishes the item.
A concurrent consumer in DequeueAsync that has already incremented
waiterCount (full fence) but whose TryDequeue retry reads the slot
sequence number can then both:
  - fail to see the producer's slot publication, AND
  - be missed by the producer's waiterCount check (because the read
    is reordered before the increment is visible),
losing the wakeup and sleeping on the semaphore until the next
Enqueue (or forever, for the BLPOP key 0 case).

WaitForEntry/WaitForEntryAsync rechecks queue.Count (which reads
Tail, bumped by the same full-fence Tail CAS) so it was already
safe; the Tsavorite-side readyResponses hot path is unaffected by
the bug and unaffected by this fence.

DequeueAsync is consumed only by CollectionItemBroker for blocking
collection commands (BLPOP/BLMOVE/BRPOP/BLMPOP/BZPOPMIN/BZPOPMAX).
The barrier is a single ~5 ns Interlocked.MemoryBarrier on x86 /
dmb sy on ARM — cheaper than the SemaphoreSlim.Release the original
optimization avoids, and preserves the throughput win on the hot
completion path.

Verified:
* All 504 hlog tests pass.
* All 52 RespBlockingCollectionTests (BLPOP/BLMOVE/etc.) pass.
* dotnet format clean.

Identified by Opus 4.8 review of PR #1837.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

badrishc added a commit that referenced this pull request May 31, 2026
…ync lost-wakeup race

The Enqueue semaphore-skip optimization (commit aee945d) reads
waiterCount via Volatile.Read (acquire-only). On weak memory models
(ARM), the acquire-load can be reordered ahead of ConcurrentQueue's
slot-sequence-number release-store, which is what publishes the item.
A concurrent consumer in DequeueAsync that has already incremented
waiterCount (full fence) but whose TryDequeue retry reads the slot
sequence number can then both:
  - fail to see the producer's slot publication, AND
  - be missed by the producer's waiterCount check (because the read
    is reordered before the increment is visible),
losing the wakeup and sleeping on the semaphore until the next
Enqueue (or forever, for the BLPOP key 0 case).

WaitForEntry/WaitForEntryAsync rechecks queue.Count (which reads
Tail, bumped by the same full-fence Tail CAS) so it was already
safe; the Tsavorite-side readyResponses hot path is unaffected by
the bug and unaffected by this fence.

DequeueAsync is consumed only by CollectionItemBroker for blocking
collection commands (BLPOP/BLMOVE/BRPOP/BLMPOP/BZPOPMIN/BZPOPMAX).
The barrier is a single ~5 ns Interlocked.MemoryBarrier on x86 /
dmb sy on ARM — cheaper than the SemaphoreSlim.Release the original
optimization avoids, and preserves the throughput win on the hot
completion path.

Verified:
* All 504 hlog tests pass.
* All 52 RespBlockingCollectionTests (BLPOP/BLMOVE/etc.) pass.
* dotnet format clean.

Identified by Opus 4.8 review of PR #1837.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc force-pushed the badrishc/optimize-kv-io branch from 6f18b97 to 2544f23 Compare May 31, 2026 23:56
badrishc added a commit that referenced this pull request Jun 1, 2026
…ync lost-wakeup race

The Enqueue semaphore-skip optimization (commit aee945d) reads
waiterCount via Volatile.Read (acquire-only). On weak memory models
(ARM), the acquire-load can be reordered ahead of ConcurrentQueue's
slot-sequence-number release-store, which is what publishes the item.
A concurrent consumer in DequeueAsync that has already incremented
waiterCount (full fence) but whose TryDequeue retry reads the slot
sequence number can then both:
  - fail to see the producer's slot publication, AND
  - be missed by the producer's waiterCount check (because the read
    is reordered before the increment is visible),
losing the wakeup and sleeping on the semaphore until the next
Enqueue (or forever, for the BLPOP key 0 case).

WaitForEntry/WaitForEntryAsync rechecks queue.Count (which reads
Tail, bumped by the same full-fence Tail CAS) so it was already
safe; the Tsavorite-side readyResponses hot path is unaffected by
the bug and unaffected by this fence.

DequeueAsync is consumed only by CollectionItemBroker for blocking
collection commands (BLPOP/BLMOVE/BRPOP/BLMPOP/BZPOPMIN/BZPOPMAX).
The barrier is a single ~5 ns Interlocked.MemoryBarrier on x86 /
dmb sy on ARM — cheaper than the SemaphoreSlim.Release the original
optimization avoids, and preserves the throughput win on the hot
completion path.

Verified:
* All 504 hlog tests pass.
* All 52 RespBlockingCollectionTests (BLPOP/BLMOVE/etc.) pass.
* dotnet format clean.

Identified by Opus 4.8 review of PR #1837.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc force-pushed the badrishc/optimize-kv-io branch from e00fa51 to 9e860b1 Compare June 1, 2026 23:11
badrishc and others added 17 commits June 1, 2026 20:15
Three changes that together drop allocations on the pending-read path by
~91 % and remove the 16-64-thread regression in KV.benchmark random-read-
from-disk on a Dell P5600 NVMe.

Measurements (100M-key, 16MB log, libaio, throttle=2048):

  threads | before | after | delta
        1 | 141K   | 174K  | +23%
        4 | 411K   | 553K  | +35%
        8 | 544K   | 552K  | +1%
       16 | 527K   | 550K  | +4%
       32 | 492K   | 547K  | +11%
       64 | 396K   | 537K  | +36%

Peak 553K @ 4 threads = 74% of fio's 750K ceiling on the same drive
(was 73% at 8 threads with a clear contention regression beyond that).

GC pressure at 8 threads / 12s window / per worker:
  alloc/wkr:        1073 MB -> 93 MB    (-91%)
  gen0 collections: 61      -> 24       (-60%)
  gen1 collections: 9       -> 3        (-67%)

Default-throttle (--device-throttle 120) shows even larger relative gains
(e.g. 32 threads: 232K -> 472K, +103%) because the upper-layer CPU
contention is no longer adding latency that compounds with the kernel-side
queue limit.

1. Pool AsyncGetFromDiskResult to eliminate per-IO boxing
---------------------------------------------------------
AsyncReadRecordToMemory passes a struct-typed wrapper as 'object context'
to IDevice.ReadAsync; the cast boxes the entire 112-byte wrapper on every
pending read. Change the wrapper to a sealed class and pool instances on
the allocator via a ConcurrentQueue. The pool naturally settles at
~ThrottleLimit entries, so steady-state ConcurrentQueue churn is bounded
(no segment growth). AsyncGetFromDiskCallback extracts the AsyncIOContext,
clears result.context = default (so a pooled wrapper doesn't transitively
retain SectorAlignedMemory buffers), and returns the wrapper to the pool
in a finally block — including the reread path where the recursive
AsyncGetFromDisk rents a fresh wrapper.

2. Inline-store small keys in ConditionallyHoistedKey
-----------------------------------------------------
For non-pinned keys (e.g. KvKey, which is a stack-local struct so
IsPinned=false), the old Create path rented a 1 KB SectorAlignedMemory
just to hold an 8-byte key. Add an [InlineArray(32)] InlineBuf field
carried inside the struct and an isInline flag; keys (+ optional namespace)
totalling <= 32 B copy bytes into the inline buffer with no alloc. Keys
above the threshold fall back to the original pooled-SectorAlignedMemory
path. KeyBytes/NamespaceBytes accessors return spans via
MemoryMarshal.CreateReadOnlySpan(ref Unsafe.AsRef(in inlineBuf), len)
that alias the receiver's own storage — same lifetime contract as the
existing pinned-pointer mode (caller must not stash the span past the
receiver's lifetime). Dispose is a no-op in inline mode.

Because the inline storage aliases the struct itself, any API returning
a ConditionallyHoistedKey by value would alias a temp's stack location
and yield a dangling span. CompletedOutput<>.Key is changed from a
by-value property to '[UnscopedRef] public readonly ref readonly' so the
ref aliases the live keyContainer field rather than a copy.

3. Add clearOnReturn opt-out to SectorAlignedBufferPool, with lazy clear
-----------------------------------------------------------------------
Every Return() called Array.Clear over the full backing buffer
(~4-8 KB per pending read), accounting for the largest single chunk of
the upper-layer CPU profile under contention. Add an overload
'SectorAlignedMemory Get(int numRecords, bool clearOnReturn)' that lets
callers known to fully overwrite the read region opt out of the per-Return
clear. Migrate the single pending-read callsite in
GetAndPopulateReadBuffer to clearOnReturn:false.

The pool's slots are fungible across consumers; a write-staging caller
(DeviceLogCommitCheckpointManager.WriteInto and friends) still relies on
the historical 'Get returns a zeroed buffer' contract for sector-padded
writes. To preserve that contract without paying the per-Return clear on
the read path, the pool now tracks an isDirty bit per buffer: Return with
clearOnReturn=false marks the buffer dirty; the next default Get that
dequeues a dirty buffer lazy-clears before handing it out. Mixed-mode
pools stay correct; pure-read workloads pay zero clears.

Verified:
* Tsavorite test.* : 206 passed (0 failed, 61 skipped), unchanged.
* Garnet RespTests + Pending + Recovery + Hash: 348 passed.
* KV.benchmark --validate (50R/50U mixed; 100R disk-spill): both PASS.
* Engine-only --device null path: 65.6M ops/sec at 32 threads (no regression
  vs baseline 63.7M).
* dotnet format --verify-no-changes on Garnet.slnx and Tsavorite.slnx: clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The pending-IO completion path is a high-frequency polling consumer: the
worker drains via TryDequeue in CompletePending(false) and almost never
calls WaitForEntry. But every Enqueue (~560 K/sec on the libaio-CT=1
completion thread during disk-bound KV reads) was calling
SemaphoreSlim.Release(), which takes an internal Monitor lock. Multiple
completion drainers (uring CT>=2) would serialise on that lock, defeating
ring sharding.

Fix: track a waiterCount field that's incremented by WaitForEntry /
WaitForEntryAsync (and the slow path of DequeueAsync) and decremented on
exit. Enqueue only calls semaphore.Release() when Volatile.Read(waiterCount) > 0.
Waiters re-check ConcurrentQueue.Count after incrementing waiterCount so a
producer that observed waiterCount==0 between Enqueue and our Increment
still gets us serviced (re-check sees the entry).

Side benefit: closes a latent bug where the semaphore's internal count
grew unbounded over a long-running session (would overflow at int.MaxValue
after ~2 B Enqueues with no Wait).

Verified:
* All 206 Tsavorite tests pass (61 skipped, unchanged).
* KV.benchmark --validate (50R/50U + 100R disk-spill): both PASS.
* KV.benchmark perf neutral on the 100M-key 16MB-log 8-thread sweep
  (~555-562K ops/sec; matches Phase 1-3 commit ba42ddf baseline within
  measurement noise).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…truct

Reduces per-completion overhead on the pending-read path by reshaping how
AsyncIOContext flows between the IO completion callback and the worker-side
CompletePending consumer.

Previously: AsyncGetFromDiskCallback called ctx.callbackQueue.Enqueue(ctx),
which copied the ~112-byte AsyncIOContext struct (with multiple GC-tracked
refs to SectorAlignedMemory, DiskLogRecord, AsyncQueue, etc.) into the
ConcurrentQueue<AsyncIOContext> segment array. The worker's TryDequeue
copied it back out. Both copies hit Buffer.BulkMoveWithWriteBarrier per ref
field. The wrapper itself (AsyncGetFromDiskResult, sealed class added in
fc67afb to eliminate boxing) was returned to its pool inside the callback.

Now: the callback writes the modified ctx back into the wrapper
(result.context = ctx) and enqueues the wrapper REFERENCE; ownership
transfers to the worker. The worker dequeues an 8-byte reference, accesses
the live AsyncIOContext via 'ref result.context' (passed all the way through
InternalCompletePendingRequest / InternalCompletePendingRequestFromContext
which both took the struct by value), then returns the wrapper to the pool
via the new AllocatorBase.ReturnAsyncGetFromDiskResult helper. Per
completion: two ~112-byte struct copies replaced with one 8-byte reference
move on Enqueue, one 8-byte ref move on Dequeue, and a single field load
to access result.context.

Other paths in AsyncGetFromDiskCallback also clear result.context = default
before returning the wrapper to the pool, so a pooled wrapper never
transitively retains SectorAlignedMemory / DiskLogRecord references
across rentals:
* reread path (recursive AsyncGetFromDisk with a fresh wrapper)
* completionEvent path (event takes ownership of ctx via Set)
* exception path

AsyncIOContext.callbackQueue field changed from public to internal (matches
the now-internal type parameter); the field has no out-of-core consumers.

Verified:
* All 206 Tsavorite tests pass (61 skipped, unchanged).
* KV.benchmark --validate (50R/50U + 100R disk-spill): both PASS.
* KV.benchmark perf: 550-555K ops/sec at libaio CT=1, 8 threads, throttle=512
  (within noise of Phase 1-4 baseline). alloc/wkr drops from ~88 MB to ~84 MB
  per worker per 12s window.

The throughput delta is in measurement noise because the upper-layer bound
is no longer in the readyResponses queue itself; the remaining ~25% gap to
Device.benchmark's 735K @ libaio CT=1 8t throt=512 (same hardware, 4 KB
random reads) lives in Dictionary<long,PendingContext>.Add/Remove (per-op
~150-byte struct copies into / out of the dictionary's Entry[] array,
still visible as Buffer.BulkMoveWithWriteBarrier in the profile) and in
the worker-side ContinuePendingRead → user-functions dispatch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Refactor QueueIoHandler from a single io_context to N independent
io_contexts; mirror UringIoHandler's pick_ring() design with per-thread
affinity assignment. Each io_context gets its own /dev/null wake-up fd, its
own 128-slot kernel ring, and is drained by exactly one drainer thread
via the existing per-context TryCompleteFor / QueueRunFor C ABI exports.

Files:
* libs/storage/Tsavorite/cc/src/device/file_linux.h
  - QueueIoHandler: std::vector<io_context_t> io_objects_ + std::vector<int>
    wake_fds_; submit_counter_ atomic round-robin; pick_context() returns the
    per-thread-assigned context (cached thread_local on first call).
  - pick_context() TLS state is **instance-aware**: a thread's cached
    (tls_owner, tls_idx) is invalidated whenever the calling handler differs
    from tls_owner OR the cached index falls outside the current handler's
    io_objects_.size(). This prevents an index cached against a handler with N
    contexts from being used on a different handler with M<N contexts (which
    would index out-of-bounds and crash). Strictly needed because thread_local
    storage in instance methods is shared across instances of the same type.
  - num_contexts() now returns io_objects_.size() (was hard-coded 1).
  - QueueFile holds QueueIoHandler* handler_ instead of a fixed io_context_t;
    submit picks per-thread via handler_->pick_context().
  - Init(n) builds io_contexts + wake fds into TEMPORARY vectors, then
    publishes to the handler's members ONLY on full success. On any partial
    failure (io_setup error OR /dev/null open error) all successfully-created
    resources are released here and the handler's vectors stay empty
    (initialized()==false, num_contexts()==0, all per-shard ops return -1
    / no-op). RAII via std::vector + manual close cleanup keeps the
    half-initialised state from being externally observable. Wake-fd open
    failure is now treated as fatal (previously silently ignored, leading to
    multi-second Dispose stalls).

* libs/storage/Tsavorite/cc/src/device/file_linux.cc
  - TryComplete() walks ALL shards (matches UringIoHandler::TryComplete()).
    Previously delegated to TryCompleteFor(0) only, so completions on shards
    >0 would not be observed by the AllocatorBase throttle-wait opportunistic
    drain path (which calls device.TryComplete() without knowing about sharding).
  - TryCompleteFor(idx) / QueueRunFor(idx) operate on io_objects_[idx];
    QueueRun() walks all contexts (idx 0 with full timeout, rest poll-only)
    matching the existing uring compat-scanner behaviour.
  - Wake(idx) submits a 0-byte read on the per-context /dev/null fd.
  - ScheduleOperation: calls handler_->pick_context() to select the per-thread
    context before io_submit. Same yield/sleep backoff curve preserved.

* libs/storage/Tsavorite/cs/src/core/Device/runtimes/linux-x64/native/
  libnative_device.so, libnative_device_libaio.so: rebuilt from this change
  (USE_URING=ON and USE_URING=OFF flavours respectively). Both patchelf'd to
  the portable libaio.so.1 NEEDED entry as documented in cc/README.md
  ('Ubuntu 24.04 (t64 ABI) note').

Hardening pass (GPT-5.5 review of the C++ change):
* (Blocking) pick_context TLS is now instance-aware (owner pointer + bounds
  check) instead of a bare 'thread_local int my_idx'. Prevents out-of-bounds
  io_objects_[my_idx] when a thread submits to multiple handlers with
  different num_contexts.
* (Non-blocking) Init failure paths reworked into the RAII temp-vector +
  publish-on-success pattern; wake-fd open failures now propagate as
  init_errno_ rather than being silently ignored.
* (Non-blocking) TryComplete() scans all shards, fixing the compat-scanner
  gap on completions for shards >0.

Verified:
* All 206 Tsavorite tests pass (61 skipped, unchanged).
* All 348 Garnet RespTests/Pending/Recovery/Hash tests pass.
* KV.benchmark --validate at CT=8 (sharded): 50R/50U PASS, 100R disk-spill PASS.
* Device.benchmark libaio 4K random reads, 8 threads, throttle=512:
    CT=1: 737K ops/sec  (single io_context, unchanged from baseline)
    CT=8: 736K ops/sec  (no regression; was +0.9% on prior measurement)
* KV.benchmark libaio 100M-key 16MB-log 8t throttle=512:
    CT=1: 552K  CT=8: 561K  (within measurement noise — sharded path adds
    no overhead on the hot path)
* KV.benchmark stress @ 64 threads x CT=8 sharded: 541K ops/sec, 0 errors —
  validates the instance-aware TLS path under heavy multi-thread submission
  to the same handler.

The benchmark difference between Device (~0.9%) and KV (~0%) confirms the
takeaway documented in the original PR: kernel io_submit mutex contention
on the single io_context is NOT a meaningful bottleneck on this hardware at
these QDs. KV.benchmark's remaining gap to Device.benchmark lives entirely
in the upper-layer pending state machine (Dictionary<long,PendingContext>
Add/Remove, ContinuePendingRead user-callback dispatch, EphemeralSUnlock per
op) — not in the backend submit path.

Sharding is kept as a structurally correct capability and to make
--device-completion-threads meaningful for both backends. The C ABI surface
(NativeDevice_NumIoContexts, NativeDevice_QueueRunFor, NativeDevice_WakeCompletionWorker)
already accommodates sharding; no managed-side change is needed beyond
honouring the larger num_contexts() return value, which it already does.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Exposes IDevice.ThrottleLimit (max in-flight IOs per device) as a
first-class Garnet setting, mirroring --device-completion-threads and
--device-io-backend.

  --device-throttle-limit <N>     CLI flag
  "DeviceThrottleLimit" : N      defaults.conf key

Value of 0 (the default) leaves the device-provided default in place
(120 for the in-box Tsavorite local-storage devices); any positive
value overrides it via LocalStorageNamedDeviceFactoryCreator.throttleLimit.

Disk-bound workloads on fast NVMe / io_uring backends are easily
throttle-bound at the 120 default; raising the limit (e.g. 512) keeps
the queue depth high enough to saturate the device.

Verified GarnetServerConfigTests (31 pass / 1 Azure skipped), including
DefaultConfigurationOptionsCoverage which enforces Options ↔ defaults.conf
parity. New flag visible in 'GarnetServer --help'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The C# compiler does not cache method-group conversions to delegates for
*instance* methods (only for static), so every AsyncGetFromDisk call was
allocating a fresh ~48 B DeviceIOCompletionCallback delegate per pending IO.

Pre-bind the delegate once in the AllocatorBase ctor and reuse the cached
field on every AsyncReadRecordToMemory call. Removes one of the two
remaining per-op heap allocations on the disk-read hot path (the other
being the IHeapContainer<TInput> wrapper, addressed in a follow-up).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Every pending op was allocating a fresh SpanByteHeapContainer (~40 B) or
StandardHeapContainer<T> in PendingContext.CopyInputsForReadOrRMW. With the
delegate-cache change in the previous commit, this was the last remaining
per-op heap allocation on the disk-pending read path.

Adds a per-session Stack<IHeapContainer<TInput>> pool on
TsavoriteExecutionContext. Sessions are single-threaded so the stack does
not need to be concurrent. Each wrapper carries a reference to its return
pool and re-enqueues itself on Dispose; the pool naturally settles around
the in-flight pending IO count.

PendingContext.CopyInputsForReadOrRMW now tries to rent from the pool
first; on a cold session (pool empty) it allocates once and the wrapper
joins the pool on first disposal. The typeof(TInput) == typeof(PinnedSpanByte)
branch is JIT-folded, and Unsafe.As<,> is used to bridge the static-
vs-runtime type mismatch for the typed pool reference (safe because at JIT
specialization the two Stack<...> types are the same CLR type).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds PooledArrayMemoryPool, a two-tier (TLS + global ConcurrentQueue)
MemoryPool<byte> implementation that pools the IMemoryOwner<byte>
wrapper objects returned by Rent(). Wires it into both Single- and
MultiDatabaseManager so FunctionsState.memoryPool no longer falls
through to MemoryPool<byte>.Shared (which heap-allocates a fresh
~40-byte ArrayMemoryPoolBuffer wrapper per Rent() call on every
pending disk-GET response).

Design:
- Tier 1: [ThreadStatic] PooledBuffer[] (size 64) - lock-free fast path.
  Rent (CopyRespTo) and Dispose (SendAndReset) happen on the same
  RespServerSession worker thread synchronously, so TLS hit rate is
  near-100% in steady state.
- Tier 2: ConcurrentQueue<PooledBuffer> (cap 4096) - fallback for
  spillover when TLS overflows.
- Byte[] sizing delegated to ArrayPool<byte>.Shared (already does
  size bucketing + Gen2 trim); we only pool the wrapper. Interlocked
  double-dispose guard.

Memory footprint: ~570-660 KB total flat across all sessions/threads,
vs O(connections) allocation rate previously.

Measurement (1t GET disk-bound, sg-get ON, libaio throttle=512):
- alloc rate: 15.7 MB/s -> 12.2 MB/s (-22%)
- gen0/sec:   5-6 -> 3-5 (-30%)
- bytes/op:   106 B -> 82 B (-24 B, matches ArrayMemoryPoolBuffer size)
- throughput: unchanged (disk-bound at 148K ops/sec)

Throughput-neutral at disk wall but reduces GC pressure on the pending
IO response path, complementing the prior pending-IO allocation work
in this PR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tive parsers

RespServerSession's main parsing loop calls AttemptSkipLine() whenever
the next byte is not '*' (could be inline RESP, malformed input, or a
partial frame the parser is speculating over). Before this change,
AttemptSkipLine() unconditionally logged a Warning on entry, even when
no skip actually occurred (i.e. when the function returned false on a
partial frame).

This is benign in classic per-batch parsing but produces severe log
spam under --sg-get (NetworkGET_SG speculatively peeks past end-of-batch
to merge contiguous GETs), where AttemptSkipLine fires once per batch
boundary per session.

Empirical impact on a 1t GET / 8t MSET-load benchmark (sg-get ON):
- 8t MSET load: ~19,000 warnings/sec (8 sessions x ~2,400 batches/sec)
- 1t GET b=1024: ~147 warnings/sec
- Total over 21s test window: 267,802 log lines

Fix: emit the Warning only when AttemptSkipLine actually finds a CRLF
and consumes input. The partial-frame return-false path is silent.

Also fixes SessionLogger.Log:
- Adds an IsEnabled(logLevel) early-out (cheap when logger is filtered
  out at the configured level).
- Fixes the lambda formatter call to use the captured (s, ex) args
  rather than the outer state/exception parameters; previously
  formatter received the same arguments regardless of any provider-side
  state transformation, which is technically a bug.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NetworkGET_SG previously allocated a fresh (GarnetStatus, StringOutput)[]
on every batch that contained any pending GETs, growing it via doubling
(8 → 16 → 32 → … → batchSize) inside SetResult. At b=1024 this allocated
~80 KB per batch (geometric sum 2040 entries × ~40 B). Allocation profiling
of the disk-GET hot path (1 thread, b=1024) shows this single source
accounting for 98.7 % of remaining sampled allocations (142 MB of 144 MB
in a 15 s window).

Cache the array on the RespServerSession as pendingGetOutputArr:
  - NetworkGET_SG initializes its local outputArr from the cached field;
    SetResult only allocates when null on first use of the session.
  - After the pending-consumption loop, publish the (possibly grown) array
    back to the session field and Array.Clear the used slots so disposed
    StringOutput.SpanByteAndMemory.Memory wrappers aren't kept alive.

In steady state a b=1024 pipeline now keeps reusing the same 1024-entry
array — zero allocations after the first batch. Measurement on the same
1t b=1024 disk-bound bench drops total sampled allocations from 144 MB to
2.1 MB over 15 s (~68× reduction). Throughput unchanged (disk wall at
~148 K ops/sec). RespTests and pending/low-memory tests pass.

Per-session memory cost: ~40 KB sticky high-watermark for sessions that
hit a b=1024 pending batch (matches the existing PooledArrayMemoryPool
sticky shape; bounded by max observed batch size, not session count).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a dedicated README for benchmark/Resp.benchmark/ modelled on the
existing Device.benchmark/ and KV.benchmark/ READMEs. Covers:

  * Client matrix (LightClient / GarnetClientSession / GarnetClient /
    SERedis / InProc) and when to use each.
  * Offline (--op X) vs online (--online) modes side-by-side, plus the
    output schema each produces.
  * Common-knobs table for the most-used CLI flags.
  * Cookbook with 11 offline + 6 online copy-paste recipes.
  * Disk-bound benchmark section that places Resp.benchmark in the
    layered hierarchy fio -> Device.benchmark -> KV.benchmark ->
    Resp.benchmark, with a 5-step recipe (server start -> 100M-key load
    -> read-thread sweep -> compare -> cleanup) and a 'why these
    settings' rationale for the disk-tier flags.
  * Troubleshooting + links to sibling benchmark READMEs.

Also updates benchmark/README.md to link the new README alongside the
website doc.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rmup=5

Align the disk-bound cookbook recipes across Resp.benchmark, KV.benchmark,
and (already-aligned) Device.benchmark so the three layers are directly
comparable on the same hardware.

KV.benchmark README:
- Cookbook table: add '--device-throttle 512' (default 120 leaves ~30%
  of the device on the table) and change '--warmup-sec 0' -> '--warmup-sec 5'
  on the three mostly-on-disk rows (RandomAccess, libaio, io_uring).
- Narrative: explain why 100M (not 10M) is the minimum reproducible
  dataset for disk-bound runs - a smaller dataset fills only ~1.3 GB of
  LBA range and lands on a subset of NAND dies, observed ~18% higher
  per-IO service time at the device (1.08 ms vs 0.91 ms on a P5600,
  identical queue depth). Call out that --device-throttle 512 is
  required for peak IOPS. Update plateau number 400-500K -> 565K to
  match today's measurement.

Resp.benchmark README:
- Intro hierarchy diagram: update KV reference from 525K to 565K and
  add explicit note that all four layers must use the same dataset
  size, throttle, log shape, and backend to be directly comparable;
  document the 10M-vs-100M die-parallelism effect.
- Step 4 (cross-layer comparison) KV recipe: add --device-throttle 512
  and --warmup-sec 5 (was 0) so the embedded recipe matches the
  canonical KV cookbook entry.
- Step 3 (GET sweep): add batch-size tuning note (-b 2048 is the
  empirical sweet spot at t=8, ~+3% over -b 1024) and a note to
  discard the first per-cell measurement since Resp.benchmark has no
  built-in warmup flag.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rewrite README sections that read as journey/learning notes (specific
ad-hoc measurements, 'observed today on this hardware', 'as of the X
change', 'empirically') into concise direct statements of fact:

Resp.benchmark README:
- Disk-bound intro: drop the per-IO service-time + LBA-range numbers
  measured on a specific device. State the underlying behavior directly
  (small datasets land on a subset of NAND dies and raise per-IO service
  time even at identical queue depth and IO size) and the resulting
  recommendation (use at least 100 M records).
- --sg-get note: drop the specific 16K -> 146K / 104K -> 404K /
  aqu-sz=6.8 -> 109 numbers and just state the effect (several-fold gain
  at low thread counts, 3-4x at higher).
- Step 2 wall-clock: drop the per-second client-side / server-side
  breakdown; just say 'a few minutes' and that batch generation
  dominates.
- Step 3 batch-size note: drop 'Empirically' and the matches-how-X-uses
  warmup phrasing; state the batch-size effect and the no-warmup-flag
  fact directly.
- Step 4 inline comment: tighten the cross-layer comparability note.

KV.benchmark README:
- Disk-IO sweep narrative: drop the 1.08 ms vs 0.91 ms on a P5600 +
  observed ~18% / ~17% reference numbers and the 565 K-IOPS plateau
  measurement. State the underlying device behavior and the
  --device-throttle 512 requirement as facts; remove specific IOPS
  numbers and 'currently sits between the two' phrasing.

Device.benchmark README:
- libaio --completion-threads note: drop 'As of the sharded-libaio
  change' / 'Empirical: on this hardware sharding gives at most ~1%'.
  State the io_context capacity, the per-thread affinity model, and the
  CT=1 default as direct facts; note that higher CT only helps under
  observable io_submit contention.

Also: dotnet-format fix for libs/common/Memory/PooledArrayMemoryPool.cs
(trailing newline, per repo .editorconfig 'insert_final_newline = false').

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The managed sector-size probe runs in :base(...) before the ctor body
created the parent directory, so on hosts where the parent and the
nearest existing ancestor live on different filesystems with different
logical_block_size (e.g. an overlay mount over a 4K-native disk), the
probe queried the ancestor and returned the wrong value. The native
probe in NativeDeviceImpl's ctor then ran after the parent had been
materialized, returned the correct value, and the cross-check in
EnsureNativeDeviceCreated threw '512 vs 4096' on first IO.

Move the mkdir into a static helper invoked from the base-ctor
argument so both probes (and all subsequent O_DIRECT IO) see the same
filesystem.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…issionDenied test

CI failure 1 (OmitSegmentIdFromFilename): the cross-check between the
managed-side sector-size probe (run from the base-ctor argument) and the
native-side probe (run from NativeDeviceImpl's field initializer) can
disagree on hosts whose filesystem layout makes ProbeDioAlignment's
stat-walk-up return different ancestors at different moments (e.g. CI
overlay mounts that change between calls). The kernel rejects misaligned
O_DIRECT I/O with EINVAL regardless of what SectorSize claims, so a
drift here is observable downstream with a clear error rather than silent
corruption. Downgrade the cross-check to a warning so the operator can
investigate without crashing the device on first I/O. Also drop the
redundant !Directory.Exists guard from EnsureParentDirectoryAndProbeSectorSize
(CreateDirectory is idempotent) and refresh the doc comment.

CI failure 2 (PermissionDeniedAtFirstWrite, Native): on EACCES the C++
FileSystemSegmentBundle ctor stores a bundle with fd=-1 (the assert is
compiled out in Release) and the subsequent io_submit hangs inside the
P/Invoke instead of returning -EBADF synchronously, crashing the test host.
Skip the Native case with a TODO; RandomAccess and ManagedLocal still
validate the WriteAsync callback contract. Fixing the native error-routing
path requires refactoring FileSystemSegmentBundle and rebuilding the
shipped .so binaries; tracked as a follow-up.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces the two workarounds added in 437c71f with actual fixes.

1. Sector-size probe is now deterministic at construction time.

   EnsureParentDirectoryAndProbeSectorSize creates the parent directory
   first and then probes that directory's path directly (not the
   not-yet-existing data file). This eliminates the ProbeDioAlignment
   stat-walk-up that could land on a different filesystem ancestor
   than the eventual data file (observed on GHA overlay mounts over a
   4K-native disk).

   SectorSize on StorageDeviceBase stays a readonly auto-property — the
   single value computed at ctor is now guaranteed to match what the
   native shim reports after opening the file with O_DIRECT. The
   cross-check in EnsureNativeDeviceCreated is restored to a hard
   throw: any drift is now a real ABI / loaded-library mismatch, not a
   path-resolution false positive.

2. EACCES on segment open is now routed through the WriteAsync
   callback contract instead of hanging inside io_submit.

   File::is_open() accessor added (both Linux and Windows headers).

   FileSystemSegmentBundle ctors capture the first non-Ok status from
   per-segment file.Open(); FileSystemSegmentedFile::OpenSegment now
   checks open_status() and refuses to commit a partially-opened
   bundle to files_. The broken bundle is destroyed and its buffer
   freed; the open error propagates back to the caller's WriteAsync
   which surfaces it via the AsyncIOCallback errorCode.

   QueueFile::ScheduleOperation gains a defense-in-depth fd_<0 guard
   so any future regression that re-introduces an invalid fd on the
   submit path returns Status::IOError instead of submitting to
   io_submit with aio_fildes=-1 (which empirically hangs inside libaio
   on some kernels rather than returning -EBADF synchronously).

   Both Linux .so binaries (libnative_device.so USE_URING=ON, and
   libnative_device_libaio.so USE_URING=OFF) rebuilt and installed.

3. Removed the Assert.Ignore(DeviceKind.Native) skip from
   IDevice_PermissionDeniedAtFirstWrite_CallbackGetsError. All three
   local device kinds (Native, RandomAccess, ManagedLocal) now pass.

Verified locally: full Tsavorite.test.hlog suite (504 passed / 0 failed
/ 108 skipped) and full Tsavorite.test suite (206 / 0 / 61) both green
on net10.0 Debug.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eak, ARM volatility

Addresses three real bugs surfaced by review of badrishc/optimize-kv-io.

1. C++ FileSystemSegmentBundle expansion ctor (file_system_disk.h)
   The previous expansion ctor moved file handles out of the source bundle and
   set its owner_=false BEFORE checking whether the new-segment opens succeeded.
   On failure (e.g. ENOSPC on a newly-extended segment), OpenSegment destroyed
   the rejected new bundle — but its destructor closed the moved-in fds, while
   files_ still pointed at the source bundle whose File objects now held stale
   fd_ values. Subsequent io_submit calls would read/write closed (or
   OS-recycled) descriptors.
   Restructured the ctor into two phases:
     Phase 1 — open new segments only; on failure, destroy just the constructed
               slots, set owner_=false, and return with `other` fully intact.
     Phase 2 — once all new opens succeed, move the overlapping file handles
               from `other` and transfer bundle ownership.
   filename_ is now copy-constructed (not moved) so a failed expansion does not
   strand `other` with an empty filename_. Rebuilt both shipped .so binaries.

2. TsavoriteThread.cs:93 — pool-wrapper leak on callback exception
   ReturnAsyncGetFromDiskResult(result) was called only after
   InternalCompletePendingRequest returned. A user callback that threw would
   skip the return-to-pool, permanently leaking the wrapper. Wrapped in
   try/finally so the wrapper is always returned.

3. NativeStorageDevice.cs:660 — fast-path volatility mismatch
   EnsureNativeDeviceCreated published nativeDevice via Volatile.Write but
   read it on the fast path with a plain field load. On ARM the reader could
   observe a non-null handle while still seeing a stale `results` array
   reference or partially-initialised completion-thread state. Changed the
   fast-path read to Volatile.Read; the inner read inside the lock retains a
   plain load (lock acquire is a full barrier).

All 504 Tsavorite hlog tests pass on net10.0 Debug.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI and others added 11 commits June 1, 2026 20:15
…eturnAsyncGetFromDiskResult

The catch's else-arm duplicated the clear+enqueue+flip-flag dance before rethrow,
while the finally block only enqueued without clearing context. Two paths reached
the pool with different conventions: success paths and catch-throw explicitly
cleared result.context first; catch-SetException did not.

Funnel the wrapper return through ReturnAsyncGetFromDiskResult in finally
(clears context + enqueues), and let catch's else-arm just rethrow.

Not a functional bug. AsyncIOContext is a struct so the in-callback local ctx is
a value copy of result.context; ctx.DisposeRecord() zeros the local but leaves
the field unchanged, so a wrapper returned with stale context holds a reference
to a SectorAlignedMemory wrapper whose buffer was already Return()ed. The next
AsyncReadRecordToMemory overwrites result.context before anything reads it, so
there is no leak, no use-after-free, and no buffer corruption. The change brings
the SetException path in line with the explicit clears on the other paths and
removes a duplicated cleanup dance.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… on-disk RecordDataHeader

CHK stored [key][namespace] in both the inline buffer and the hoisted
SectorAlignedMemory pool path. Tsavorite's on-disk RecordDataHeader puts the
extended namespace bytes BEFORE the key (LogRecord.cs:60-61: '... those bits are
the length of the extended namespace data preceding the key data'). The
mismatch was self-consistent within CHK (writers and readers agreed) so no
correctness issue, but it forced reviewers to flip the offsets mentally when
moving between CHK and RecordDataHeader and blocked a future single-span
[NS|key] compare optimization against DiskLogRecord.

Swap both paths to [namespace][key]. KeysEqual + all accessors (KeyBytes,
NamespaceBytes) updated; all callers go through these so the change is
transparent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ity fix

Profiling KV.benchmark on Linux against RandomAccessLocalStorageDevice
showed 20.76% of total thread-time in Monitor.Enter_Slowpath, driven by
the per-segment AsyncPool<StorageAccessContext>'s internal SemaphoreSlim
and ConcurrentQueue. SemaphoreSlim.Release ALWAYS takes its internal
Monitor regardless of pool capacity; under 800K+ Returns/sec across
80+ ThreadPool workers this serializes every IO completion.

Replace the per-segment AsyncPool with one shared SafeFileHandle per
direction per segment. RandomAccess.ReadAsync/WriteAsync (pread/pwrite
under the hood) are thread-safe on a single SafeFileHandle. Throttle via
lock-free numPending > ThrottleLimit (matches LocalStorageDevice and
NativeStorageDevice patterns).

KV.benchmark/RandomAccess t=8 throughput: 86.7K -> 144.7K ops/sec (+67%).
Throughput is now flat across t=1..16 (single app thread saturates the
ThreadPool's pread pipeline). Monitor.Enter_Slowpath disappears from
the profile top 30.

Production-readiness fixes (per GPT-5.5 review):
* SegmentHandles is a class with lazy per-direction creation via
  Interlocked.CompareExchange. Preserves the original AsyncPool's
  lazy-factory semantics so read-only callers no longer trigger
  write-handle open (matters for readOnly mode and read-only files).
* If a handle factory throws after the sibling handle is already
  created, the surviving handle is reclaimed when SegmentHandles is
  disposed (no partial-construction leak).
* Dispose and Reset spin-wait (bounded 5s) for numPending to drain
  before closing SafeFileHandles, matching AsyncPool.Dispose's
  wait-for-borrowed-items behavior. RemoveSegment follows the same
  pattern as LocalStorageDevice (no drain; epoch-protected truncation
  ensures no in-flight reads target a removed segment).

Verified:
* 16 RandomAccess-tagged DeviceTests pass.
* Full 504-test hlog suite passes (108 skipped, all Azure-env-dependent).
* Recovery tests pass (exercises read-only handle path).
* dotnet format clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ync lost-wakeup race

The Enqueue semaphore-skip optimization (commit aee945d) reads
waiterCount via Volatile.Read (acquire-only). On weak memory models
(ARM), the acquire-load can be reordered ahead of ConcurrentQueue's
slot-sequence-number release-store, which is what publishes the item.
A concurrent consumer in DequeueAsync that has already incremented
waiterCount (full fence) but whose TryDequeue retry reads the slot
sequence number can then both:
  - fail to see the producer's slot publication, AND
  - be missed by the producer's waiterCount check (because the read
    is reordered before the increment is visible),
losing the wakeup and sleeping on the semaphore until the next
Enqueue (or forever, for the BLPOP key 0 case).

WaitForEntry/WaitForEntryAsync rechecks queue.Count (which reads
Tail, bumped by the same full-fence Tail CAS) so it was already
safe; the Tsavorite-side readyResponses hot path is unaffected by
the bug and unaffected by this fence.

DequeueAsync is consumed only by CollectionItemBroker for blocking
collection commands (BLPOP/BLMOVE/BRPOP/BLMPOP/BZPOPMIN/BZPOPMAX).
The barrier is a single ~5 ns Interlocked.MemoryBarrier on x86 /
dmb sy on ARM — cheaper than the SemaphoreSlim.Release the original
optimization avoids, and preserves the throughput win on the hot
completion path.

Verified:
* All 504 hlog tests pass.
* All 52 RespBlockingCollectionTests (BLPOP/BLMOVE/etc.) pass.
* dotnet format clean.

Identified by Opus 4.8 review of PR #1837.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three comments around the Native completion-thread / io_context configuration
were narrating the optimization journey rather than describing current
behaviour. Rewrite them to state only the current facts:

- benchmark/Device.benchmark/Options.cs: --completion-threads help text
  no longer says libaio drainers "rarely help past 1 in practice" or that
  they are "now honoured for symmetry". Now states that each drainer is
  bound 1:1 to its own io_context/ring and throughput scales with the value.

- libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs:
  --device-completion-threads help text drops the same "rarely help past 1"
  and "legacy QueueRun compat scanner" wording.

- libs/storage/Tsavorite/cs/src/core/Device/NativeStorageDevice.cs:
  NativeStorageDevice ctor XML doc was stale — it claimed uring uses
  max(4, numCompletionThreads) rings, but the constructor body sets
  numIoContextsConfig = numCompletionThreadsConfig (always 1:1) on both
  backends. Rewrite the param doc to match the code and remove the
  "extra rings don't help" speculation. The inline comment above
  numIoContextsConfig dropped its "earlier experiment ... fundamentally
  broken" narrative in favour of a present-tense rationale for the 1:1
  binding (each drainer blocks on its own ring in QueueRunFor, so a
  single drainer cannot cover multiple rings without starving them).

No code changes. All three affected projects build clean. dotnet format
clean on both Tsavorite.slnx and Garnet.slnx.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…heap alloc

The previous d7a4715 commit (removing per-segment AsyncPool to fix
Monitor.Enter_Slowpath contention) regressed allocation behavior by
constructing a fresh UnmanagedMemoryManager<byte> per IO on the hot path
in ReadWorkerAsync and WriteWorkerAsync — pre-PR code reused a single
wrapper per pooled StorageAccessContext.

This change restores wrapper reuse via a lock-free per-device
ConcurrentQueue<UnmanagedMemoryManager<byte>>:

  - Rent: TryDequeue (or new on empty) → SetDestination(ptr, len).
  - Return: Enqueue in the finally of the IO continuation.

Properties:
  - Zero heap allocations per IO in steady state (one CAS per Enqueue /
    TryDequeue; ConcurrentQueueSegment is array-backed so no per-item Node
    allocation, unlike ConcurrentStack).
  - Bounded by ThrottleLimit (default 120, ~4 KB high water).
  - No SemaphoreSlim / Monitor — does not reintroduce the contention
    pattern that motivated removing the AsyncPool.
  - Wrappers are only handed out one-IO-at-a-time, so concurrent
    SetDestination on the same instance cannot occur. ConcurrentQueue's
    release/acquire semantics on Enqueue/TryDequeue publish the rebound
    _pointer / _length to the next worker thread.

Validated:
  - DeviceTests 70/70 pass
  - test.hlog 504/504 pass
  - KV.benchmark RA sweep (1-16 threads, 100M x 100B, 16 MB log):
    throughput unchanged at ~143-145K ops/sec (workload is bottlenecked
    on RandomAccess threadpool dispatch, not the wrapper alloc); GC
    pressure on the IO path drops to zero in steady state.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e doc

- Rename ConditionallyHositedKey.cs -> ConditionallyHoistedKey.cs (filename
  was the only place with the 'Hosited' typo; type itself was already named
  ConditionallyHoistedKey).
- Fix stale XML doc on Devices.CreateLogDevice's numCompletionThreads
  parameter: previous wording claimed all drainers share a single
  io_context / io_uring ring and that values > 1 are rarely useful, which
  contradicts the 1:1 drainer-to-ring binding the device actually implements
  today (see NativeStorageDevice.cs:584-592).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The static Empty singleton was bit-identical to default(ConditionallyHoistedKey):
its initializer fed a null pointer (MemoryMarshal.GetReference on an empty span
returns a null-equivalent ref, so Unsafe.AsPointer(...) was (byte*)0) through
the (byte*, int) ctor, leaving every field at its default value. Returning
default at the only callsite has the same observable behavior, removes the
static field/property and its (misleading) initializer, and eliminates the
class's static cctor / type-init flag check entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Optional post-load diagnostic that prints the underlying
TsavoriteKV.DumpDistribution() bucket histogram. Useful when
investigating hash-table sizing / collision symptoms during disk-
bound IOPS sweeps (paired with --index sizing).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…llback

The reread path and completionEvent path both did 'result.context = default;'
explicitly, then fell through to the finally block, which calls
ReturnAsyncGetFromDiskResult — which already clears result.context before
enqueueing the wrapper. The explicit clears were dead writes and the comment
in the finally even pointed at non-existent 'explicit clears on the throw
arm of the catch'.

No behavior change: the local 'ctx' on the reread path is a struct copy
taken at 'var ctx = result.context;' before the (now-removed) clear, so all
fields (id, requestKey, minAddress, callbackQueue) are preserved and
forwarded to the next IO via AsyncReadRecordToMemory's 'asyncResult.context
= context' assignment.

Comment expanded on the reread path to make the struct-copy semantics
explicit so future readers don't reach for 'result.context' thinking it's
the live state.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cument drop branch

The wrapper-recycle method was named Return and marked internal, which
invited reading it as a general 'return a wrapper to the pool' API and
prompted a question about whether the cap-exceeded drop branch was leaking
the still-rented byte[].

It isn't: by the time the method runs, PooledBuffer.Dispose has already
Interlocked-Exchanged the array field to null and returned the byte[] to
ArrayPool<byte>.Shared, so the wrapper holds no resources and dropping it
just lets GC reclaim a small managed object.

Rename to OnPooledBufferDisposed and make it private (nested PooledBuffer
class can still call it) to make the callback semantics explicit, add an
XML doc explaining that it must only be invoked from PooledBuffer.Dispose,
and add a remarks block plus an inline comment documenting why the drop
branch is safe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc force-pushed the badrishc/optimize-kv-io branch from 735f7cf to 2d6da50 Compare June 2, 2026 03:17
@badrishc badrishc merged commit 91371eb into main Jun 2, 2026
187 checks passed
@badrishc badrishc deleted the badrishc/optimize-kv-io branch June 2, 2026 04:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants