Optimize IO paths in KV#1837
Merged
Merged
Conversation
44d9f44 to
bacb252
Compare
badrishc
added a commit
that referenced
this pull request
May 31, 2026
…ync lost-wakeup race The Enqueue semaphore-skip optimization (commit aee945d) reads waiterCount via Volatile.Read (acquire-only). On weak memory models (ARM), the acquire-load can be reordered ahead of ConcurrentQueue's slot-sequence-number release-store, which is what publishes the item. A concurrent consumer in DequeueAsync that has already incremented waiterCount (full fence) but whose TryDequeue retry reads the slot sequence number can then both: - fail to see the producer's slot publication, AND - be missed by the producer's waiterCount check (because the read is reordered before the increment is visible), losing the wakeup and sleeping on the semaphore until the next Enqueue (or forever, for the BLPOP key 0 case). WaitForEntry/WaitForEntryAsync rechecks queue.Count (which reads Tail, bumped by the same full-fence Tail CAS) so it was already safe; the Tsavorite-side readyResponses hot path is unaffected by the bug and unaffected by this fence. DequeueAsync is consumed only by CollectionItemBroker for blocking collection commands (BLPOP/BLMOVE/BRPOP/BLMPOP/BZPOPMIN/BZPOPMAX). The barrier is a single ~5 ns Interlocked.MemoryBarrier on x86 / dmb sy on ARM — cheaper than the SemaphoreSlim.Release the original optimization avoids, and preserves the throughput win on the hot completion path. Verified: * All 504 hlog tests pass. * All 52 RespBlockingCollectionTests (BLPOP/BLMOVE/etc.) pass. * dotnet format clean. Identified by Opus 4.8 review of PR #1837. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc
added a commit
that referenced
this pull request
May 31, 2026
…ync lost-wakeup race The Enqueue semaphore-skip optimization (commit aee945d) reads waiterCount via Volatile.Read (acquire-only). On weak memory models (ARM), the acquire-load can be reordered ahead of ConcurrentQueue's slot-sequence-number release-store, which is what publishes the item. A concurrent consumer in DequeueAsync that has already incremented waiterCount (full fence) but whose TryDequeue retry reads the slot sequence number can then both: - fail to see the producer's slot publication, AND - be missed by the producer's waiterCount check (because the read is reordered before the increment is visible), losing the wakeup and sleeping on the semaphore until the next Enqueue (or forever, for the BLPOP key 0 case). WaitForEntry/WaitForEntryAsync rechecks queue.Count (which reads Tail, bumped by the same full-fence Tail CAS) so it was already safe; the Tsavorite-side readyResponses hot path is unaffected by the bug and unaffected by this fence. DequeueAsync is consumed only by CollectionItemBroker for blocking collection commands (BLPOP/BLMOVE/BRPOP/BLMPOP/BZPOPMIN/BZPOPMAX). The barrier is a single ~5 ns Interlocked.MemoryBarrier on x86 / dmb sy on ARM — cheaper than the SemaphoreSlim.Release the original optimization avoids, and preserves the throughput win on the hot completion path. Verified: * All 504 hlog tests pass. * All 52 RespBlockingCollectionTests (BLPOP/BLMOVE/etc.) pass. * dotnet format clean. Identified by Opus 4.8 review of PR #1837. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6f18b97 to
2544f23
Compare
badrishc
added a commit
that referenced
this pull request
Jun 1, 2026
…ync lost-wakeup race The Enqueue semaphore-skip optimization (commit aee945d) reads waiterCount via Volatile.Read (acquire-only). On weak memory models (ARM), the acquire-load can be reordered ahead of ConcurrentQueue's slot-sequence-number release-store, which is what publishes the item. A concurrent consumer in DequeueAsync that has already incremented waiterCount (full fence) but whose TryDequeue retry reads the slot sequence number can then both: - fail to see the producer's slot publication, AND - be missed by the producer's waiterCount check (because the read is reordered before the increment is visible), losing the wakeup and sleeping on the semaphore until the next Enqueue (or forever, for the BLPOP key 0 case). WaitForEntry/WaitForEntryAsync rechecks queue.Count (which reads Tail, bumped by the same full-fence Tail CAS) so it was already safe; the Tsavorite-side readyResponses hot path is unaffected by the bug and unaffected by this fence. DequeueAsync is consumed only by CollectionItemBroker for blocking collection commands (BLPOP/BLMOVE/BRPOP/BLMPOP/BZPOPMIN/BZPOPMAX). The barrier is a single ~5 ns Interlocked.MemoryBarrier on x86 / dmb sy on ARM — cheaper than the SemaphoreSlim.Release the original optimization avoids, and preserves the throughput win on the hot completion path. Verified: * All 504 hlog tests pass. * All 52 RespBlockingCollectionTests (BLPOP/BLMOVE/etc.) pass. * dotnet format clean. Identified by Opus 4.8 review of PR #1837. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
e00fa51 to
9e860b1
Compare
Three changes that together drop allocations on the pending-read path by
~91 % and remove the 16-64-thread regression in KV.benchmark random-read-
from-disk on a Dell P5600 NVMe.
Measurements (100M-key, 16MB log, libaio, throttle=2048):
threads | before | after | delta
1 | 141K | 174K | +23%
4 | 411K | 553K | +35%
8 | 544K | 552K | +1%
16 | 527K | 550K | +4%
32 | 492K | 547K | +11%
64 | 396K | 537K | +36%
Peak 553K @ 4 threads = 74% of fio's 750K ceiling on the same drive
(was 73% at 8 threads with a clear contention regression beyond that).
GC pressure at 8 threads / 12s window / per worker:
alloc/wkr: 1073 MB -> 93 MB (-91%)
gen0 collections: 61 -> 24 (-60%)
gen1 collections: 9 -> 3 (-67%)
Default-throttle (--device-throttle 120) shows even larger relative gains
(e.g. 32 threads: 232K -> 472K, +103%) because the upper-layer CPU
contention is no longer adding latency that compounds with the kernel-side
queue limit.
1. Pool AsyncGetFromDiskResult to eliminate per-IO boxing
---------------------------------------------------------
AsyncReadRecordToMemory passes a struct-typed wrapper as 'object context'
to IDevice.ReadAsync; the cast boxes the entire 112-byte wrapper on every
pending read. Change the wrapper to a sealed class and pool instances on
the allocator via a ConcurrentQueue. The pool naturally settles at
~ThrottleLimit entries, so steady-state ConcurrentQueue churn is bounded
(no segment growth). AsyncGetFromDiskCallback extracts the AsyncIOContext,
clears result.context = default (so a pooled wrapper doesn't transitively
retain SectorAlignedMemory buffers), and returns the wrapper to the pool
in a finally block — including the reread path where the recursive
AsyncGetFromDisk rents a fresh wrapper.
2. Inline-store small keys in ConditionallyHoistedKey
-----------------------------------------------------
For non-pinned keys (e.g. KvKey, which is a stack-local struct so
IsPinned=false), the old Create path rented a 1 KB SectorAlignedMemory
just to hold an 8-byte key. Add an [InlineArray(32)] InlineBuf field
carried inside the struct and an isInline flag; keys (+ optional namespace)
totalling <= 32 B copy bytes into the inline buffer with no alloc. Keys
above the threshold fall back to the original pooled-SectorAlignedMemory
path. KeyBytes/NamespaceBytes accessors return spans via
MemoryMarshal.CreateReadOnlySpan(ref Unsafe.AsRef(in inlineBuf), len)
that alias the receiver's own storage — same lifetime contract as the
existing pinned-pointer mode (caller must not stash the span past the
receiver's lifetime). Dispose is a no-op in inline mode.
Because the inline storage aliases the struct itself, any API returning
a ConditionallyHoistedKey by value would alias a temp's stack location
and yield a dangling span. CompletedOutput<>.Key is changed from a
by-value property to '[UnscopedRef] public readonly ref readonly' so the
ref aliases the live keyContainer field rather than a copy.
3. Add clearOnReturn opt-out to SectorAlignedBufferPool, with lazy clear
-----------------------------------------------------------------------
Every Return() called Array.Clear over the full backing buffer
(~4-8 KB per pending read), accounting for the largest single chunk of
the upper-layer CPU profile under contention. Add an overload
'SectorAlignedMemory Get(int numRecords, bool clearOnReturn)' that lets
callers known to fully overwrite the read region opt out of the per-Return
clear. Migrate the single pending-read callsite in
GetAndPopulateReadBuffer to clearOnReturn:false.
The pool's slots are fungible across consumers; a write-staging caller
(DeviceLogCommitCheckpointManager.WriteInto and friends) still relies on
the historical 'Get returns a zeroed buffer' contract for sector-padded
writes. To preserve that contract without paying the per-Return clear on
the read path, the pool now tracks an isDirty bit per buffer: Return with
clearOnReturn=false marks the buffer dirty; the next default Get that
dequeues a dirty buffer lazy-clears before handing it out. Mixed-mode
pools stay correct; pure-read workloads pay zero clears.
Verified:
* Tsavorite test.* : 206 passed (0 failed, 61 skipped), unchanged.
* Garnet RespTests + Pending + Recovery + Hash: 348 passed.
* KV.benchmark --validate (50R/50U mixed; 100R disk-spill): both PASS.
* Engine-only --device null path: 65.6M ops/sec at 32 threads (no regression
vs baseline 63.7M).
* dotnet format --verify-no-changes on Garnet.slnx and Tsavorite.slnx: clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The pending-IO completion path is a high-frequency polling consumer: the worker drains via TryDequeue in CompletePending(false) and almost never calls WaitForEntry. But every Enqueue (~560 K/sec on the libaio-CT=1 completion thread during disk-bound KV reads) was calling SemaphoreSlim.Release(), which takes an internal Monitor lock. Multiple completion drainers (uring CT>=2) would serialise on that lock, defeating ring sharding. Fix: track a waiterCount field that's incremented by WaitForEntry / WaitForEntryAsync (and the slow path of DequeueAsync) and decremented on exit. Enqueue only calls semaphore.Release() when Volatile.Read(waiterCount) > 0. Waiters re-check ConcurrentQueue.Count after incrementing waiterCount so a producer that observed waiterCount==0 between Enqueue and our Increment still gets us serviced (re-check sees the entry). Side benefit: closes a latent bug where the semaphore's internal count grew unbounded over a long-running session (would overflow at int.MaxValue after ~2 B Enqueues with no Wait). Verified: * All 206 Tsavorite tests pass (61 skipped, unchanged). * KV.benchmark --validate (50R/50U + 100R disk-spill): both PASS. * KV.benchmark perf neutral on the 100M-key 16MB-log 8-thread sweep (~555-562K ops/sec; matches Phase 1-3 commit ba42ddf baseline within measurement noise). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…truct Reduces per-completion overhead on the pending-read path by reshaping how AsyncIOContext flows between the IO completion callback and the worker-side CompletePending consumer. Previously: AsyncGetFromDiskCallback called ctx.callbackQueue.Enqueue(ctx), which copied the ~112-byte AsyncIOContext struct (with multiple GC-tracked refs to SectorAlignedMemory, DiskLogRecord, AsyncQueue, etc.) into the ConcurrentQueue<AsyncIOContext> segment array. The worker's TryDequeue copied it back out. Both copies hit Buffer.BulkMoveWithWriteBarrier per ref field. The wrapper itself (AsyncGetFromDiskResult, sealed class added in fc67afb to eliminate boxing) was returned to its pool inside the callback. Now: the callback writes the modified ctx back into the wrapper (result.context = ctx) and enqueues the wrapper REFERENCE; ownership transfers to the worker. The worker dequeues an 8-byte reference, accesses the live AsyncIOContext via 'ref result.context' (passed all the way through InternalCompletePendingRequest / InternalCompletePendingRequestFromContext which both took the struct by value), then returns the wrapper to the pool via the new AllocatorBase.ReturnAsyncGetFromDiskResult helper. Per completion: two ~112-byte struct copies replaced with one 8-byte reference move on Enqueue, one 8-byte ref move on Dequeue, and a single field load to access result.context. Other paths in AsyncGetFromDiskCallback also clear result.context = default before returning the wrapper to the pool, so a pooled wrapper never transitively retains SectorAlignedMemory / DiskLogRecord references across rentals: * reread path (recursive AsyncGetFromDisk with a fresh wrapper) * completionEvent path (event takes ownership of ctx via Set) * exception path AsyncIOContext.callbackQueue field changed from public to internal (matches the now-internal type parameter); the field has no out-of-core consumers. Verified: * All 206 Tsavorite tests pass (61 skipped, unchanged). * KV.benchmark --validate (50R/50U + 100R disk-spill): both PASS. * KV.benchmark perf: 550-555K ops/sec at libaio CT=1, 8 threads, throttle=512 (within noise of Phase 1-4 baseline). alloc/wkr drops from ~88 MB to ~84 MB per worker per 12s window. The throughput delta is in measurement noise because the upper-layer bound is no longer in the readyResponses queue itself; the remaining ~25% gap to Device.benchmark's 735K @ libaio CT=1 8t throt=512 (same hardware, 4 KB random reads) lives in Dictionary<long,PendingContext>.Add/Remove (per-op ~150-byte struct copies into / out of the dictionary's Entry[] array, still visible as Buffer.BulkMoveWithWriteBarrier in the profile) and in the worker-side ContinuePendingRead → user-functions dispatch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Refactor QueueIoHandler from a single io_context to N independent
io_contexts; mirror UringIoHandler's pick_ring() design with per-thread
affinity assignment. Each io_context gets its own /dev/null wake-up fd, its
own 128-slot kernel ring, and is drained by exactly one drainer thread
via the existing per-context TryCompleteFor / QueueRunFor C ABI exports.
Files:
* libs/storage/Tsavorite/cc/src/device/file_linux.h
- QueueIoHandler: std::vector<io_context_t> io_objects_ + std::vector<int>
wake_fds_; submit_counter_ atomic round-robin; pick_context() returns the
per-thread-assigned context (cached thread_local on first call).
- pick_context() TLS state is **instance-aware**: a thread's cached
(tls_owner, tls_idx) is invalidated whenever the calling handler differs
from tls_owner OR the cached index falls outside the current handler's
io_objects_.size(). This prevents an index cached against a handler with N
contexts from being used on a different handler with M<N contexts (which
would index out-of-bounds and crash). Strictly needed because thread_local
storage in instance methods is shared across instances of the same type.
- num_contexts() now returns io_objects_.size() (was hard-coded 1).
- QueueFile holds QueueIoHandler* handler_ instead of a fixed io_context_t;
submit picks per-thread via handler_->pick_context().
- Init(n) builds io_contexts + wake fds into TEMPORARY vectors, then
publishes to the handler's members ONLY on full success. On any partial
failure (io_setup error OR /dev/null open error) all successfully-created
resources are released here and the handler's vectors stay empty
(initialized()==false, num_contexts()==0, all per-shard ops return -1
/ no-op). RAII via std::vector + manual close cleanup keeps the
half-initialised state from being externally observable. Wake-fd open
failure is now treated as fatal (previously silently ignored, leading to
multi-second Dispose stalls).
* libs/storage/Tsavorite/cc/src/device/file_linux.cc
- TryComplete() walks ALL shards (matches UringIoHandler::TryComplete()).
Previously delegated to TryCompleteFor(0) only, so completions on shards
>0 would not be observed by the AllocatorBase throttle-wait opportunistic
drain path (which calls device.TryComplete() without knowing about sharding).
- TryCompleteFor(idx) / QueueRunFor(idx) operate on io_objects_[idx];
QueueRun() walks all contexts (idx 0 with full timeout, rest poll-only)
matching the existing uring compat-scanner behaviour.
- Wake(idx) submits a 0-byte read on the per-context /dev/null fd.
- ScheduleOperation: calls handler_->pick_context() to select the per-thread
context before io_submit. Same yield/sleep backoff curve preserved.
* libs/storage/Tsavorite/cs/src/core/Device/runtimes/linux-x64/native/
libnative_device.so, libnative_device_libaio.so: rebuilt from this change
(USE_URING=ON and USE_URING=OFF flavours respectively). Both patchelf'd to
the portable libaio.so.1 NEEDED entry as documented in cc/README.md
('Ubuntu 24.04 (t64 ABI) note').
Hardening pass (GPT-5.5 review of the C++ change):
* (Blocking) pick_context TLS is now instance-aware (owner pointer + bounds
check) instead of a bare 'thread_local int my_idx'. Prevents out-of-bounds
io_objects_[my_idx] when a thread submits to multiple handlers with
different num_contexts.
* (Non-blocking) Init failure paths reworked into the RAII temp-vector +
publish-on-success pattern; wake-fd open failures now propagate as
init_errno_ rather than being silently ignored.
* (Non-blocking) TryComplete() scans all shards, fixing the compat-scanner
gap on completions for shards >0.
Verified:
* All 206 Tsavorite tests pass (61 skipped, unchanged).
* All 348 Garnet RespTests/Pending/Recovery/Hash tests pass.
* KV.benchmark --validate at CT=8 (sharded): 50R/50U PASS, 100R disk-spill PASS.
* Device.benchmark libaio 4K random reads, 8 threads, throttle=512:
CT=1: 737K ops/sec (single io_context, unchanged from baseline)
CT=8: 736K ops/sec (no regression; was +0.9% on prior measurement)
* KV.benchmark libaio 100M-key 16MB-log 8t throttle=512:
CT=1: 552K CT=8: 561K (within measurement noise — sharded path adds
no overhead on the hot path)
* KV.benchmark stress @ 64 threads x CT=8 sharded: 541K ops/sec, 0 errors —
validates the instance-aware TLS path under heavy multi-thread submission
to the same handler.
The benchmark difference between Device (~0.9%) and KV (~0%) confirms the
takeaway documented in the original PR: kernel io_submit mutex contention
on the single io_context is NOT a meaningful bottleneck on this hardware at
these QDs. KV.benchmark's remaining gap to Device.benchmark lives entirely
in the upper-layer pending state machine (Dictionary<long,PendingContext>
Add/Remove, ContinuePendingRead user-callback dispatch, EphemeralSUnlock per
op) — not in the backend submit path.
Sharding is kept as a structurally correct capability and to make
--device-completion-threads meaningful for both backends. The C ABI surface
(NativeDevice_NumIoContexts, NativeDevice_QueueRunFor, NativeDevice_WakeCompletionWorker)
already accommodates sharding; no managed-side change is needed beyond
honouring the larger num_contexts() return value, which it already does.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Exposes IDevice.ThrottleLimit (max in-flight IOs per device) as a first-class Garnet setting, mirroring --device-completion-threads and --device-io-backend. --device-throttle-limit <N> CLI flag "DeviceThrottleLimit" : N defaults.conf key Value of 0 (the default) leaves the device-provided default in place (120 for the in-box Tsavorite local-storage devices); any positive value overrides it via LocalStorageNamedDeviceFactoryCreator.throttleLimit. Disk-bound workloads on fast NVMe / io_uring backends are easily throttle-bound at the 120 default; raising the limit (e.g. 512) keeps the queue depth high enough to saturate the device. Verified GarnetServerConfigTests (31 pass / 1 Azure skipped), including DefaultConfigurationOptionsCoverage which enforces Options ↔ defaults.conf parity. New flag visible in 'GarnetServer --help'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The C# compiler does not cache method-group conversions to delegates for *instance* methods (only for static), so every AsyncGetFromDisk call was allocating a fresh ~48 B DeviceIOCompletionCallback delegate per pending IO. Pre-bind the delegate once in the AllocatorBase ctor and reuse the cached field on every AsyncReadRecordToMemory call. Removes one of the two remaining per-op heap allocations on the disk-read hot path (the other being the IHeapContainer<TInput> wrapper, addressed in a follow-up). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Every pending op was allocating a fresh SpanByteHeapContainer (~40 B) or StandardHeapContainer<T> in PendingContext.CopyInputsForReadOrRMW. With the delegate-cache change in the previous commit, this was the last remaining per-op heap allocation on the disk-pending read path. Adds a per-session Stack<IHeapContainer<TInput>> pool on TsavoriteExecutionContext. Sessions are single-threaded so the stack does not need to be concurrent. Each wrapper carries a reference to its return pool and re-enqueues itself on Dispose; the pool naturally settles around the in-flight pending IO count. PendingContext.CopyInputsForReadOrRMW now tries to rent from the pool first; on a cold session (pool empty) it allocates once and the wrapper joins the pool on first disposal. The typeof(TInput) == typeof(PinnedSpanByte) branch is JIT-folded, and Unsafe.As<,> is used to bridge the static- vs-runtime type mismatch for the typed pool reference (safe because at JIT specialization the two Stack<...> types are the same CLR type). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds PooledArrayMemoryPool, a two-tier (TLS + global ConcurrentQueue) MemoryPool<byte> implementation that pools the IMemoryOwner<byte> wrapper objects returned by Rent(). Wires it into both Single- and MultiDatabaseManager so FunctionsState.memoryPool no longer falls through to MemoryPool<byte>.Shared (which heap-allocates a fresh ~40-byte ArrayMemoryPoolBuffer wrapper per Rent() call on every pending disk-GET response). Design: - Tier 1: [ThreadStatic] PooledBuffer[] (size 64) - lock-free fast path. Rent (CopyRespTo) and Dispose (SendAndReset) happen on the same RespServerSession worker thread synchronously, so TLS hit rate is near-100% in steady state. - Tier 2: ConcurrentQueue<PooledBuffer> (cap 4096) - fallback for spillover when TLS overflows. - Byte[] sizing delegated to ArrayPool<byte>.Shared (already does size bucketing + Gen2 trim); we only pool the wrapper. Interlocked double-dispose guard. Memory footprint: ~570-660 KB total flat across all sessions/threads, vs O(connections) allocation rate previously. Measurement (1t GET disk-bound, sg-get ON, libaio throttle=512): - alloc rate: 15.7 MB/s -> 12.2 MB/s (-22%) - gen0/sec: 5-6 -> 3-5 (-30%) - bytes/op: 106 B -> 82 B (-24 B, matches ArrayMemoryPoolBuffer size) - throughput: unchanged (disk-bound at 148K ops/sec) Throughput-neutral at disk wall but reduces GC pressure on the pending IO response path, complementing the prior pending-IO allocation work in this PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tive parsers RespServerSession's main parsing loop calls AttemptSkipLine() whenever the next byte is not '*' (could be inline RESP, malformed input, or a partial frame the parser is speculating over). Before this change, AttemptSkipLine() unconditionally logged a Warning on entry, even when no skip actually occurred (i.e. when the function returned false on a partial frame). This is benign in classic per-batch parsing but produces severe log spam under --sg-get (NetworkGET_SG speculatively peeks past end-of-batch to merge contiguous GETs), where AttemptSkipLine fires once per batch boundary per session. Empirical impact on a 1t GET / 8t MSET-load benchmark (sg-get ON): - 8t MSET load: ~19,000 warnings/sec (8 sessions x ~2,400 batches/sec) - 1t GET b=1024: ~147 warnings/sec - Total over 21s test window: 267,802 log lines Fix: emit the Warning only when AttemptSkipLine actually finds a CRLF and consumes input. The partial-frame return-false path is silent. Also fixes SessionLogger.Log: - Adds an IsEnabled(logLevel) early-out (cheap when logger is filtered out at the configured level). - Fixes the lambda formatter call to use the captured (s, ex) args rather than the outer state/exception parameters; previously formatter received the same arguments regardless of any provider-side state transformation, which is technically a bug. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NetworkGET_SG previously allocated a fresh (GarnetStatus, StringOutput)[]
on every batch that contained any pending GETs, growing it via doubling
(8 → 16 → 32 → … → batchSize) inside SetResult. At b=1024 this allocated
~80 KB per batch (geometric sum 2040 entries × ~40 B). Allocation profiling
of the disk-GET hot path (1 thread, b=1024) shows this single source
accounting for 98.7 % of remaining sampled allocations (142 MB of 144 MB
in a 15 s window).
Cache the array on the RespServerSession as pendingGetOutputArr:
- NetworkGET_SG initializes its local outputArr from the cached field;
SetResult only allocates when null on first use of the session.
- After the pending-consumption loop, publish the (possibly grown) array
back to the session field and Array.Clear the used slots so disposed
StringOutput.SpanByteAndMemory.Memory wrappers aren't kept alive.
In steady state a b=1024 pipeline now keeps reusing the same 1024-entry
array — zero allocations after the first batch. Measurement on the same
1t b=1024 disk-bound bench drops total sampled allocations from 144 MB to
2.1 MB over 15 s (~68× reduction). Throughput unchanged (disk wall at
~148 K ops/sec). RespTests and pending/low-memory tests pass.
Per-session memory cost: ~40 KB sticky high-watermark for sessions that
hit a b=1024 pending batch (matches the existing PooledArrayMemoryPool
sticky shape; bounded by max observed batch size, not session count).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a dedicated README for benchmark/Resp.benchmark/ modelled on the
existing Device.benchmark/ and KV.benchmark/ READMEs. Covers:
* Client matrix (LightClient / GarnetClientSession / GarnetClient /
SERedis / InProc) and when to use each.
* Offline (--op X) vs online (--online) modes side-by-side, plus the
output schema each produces.
* Common-knobs table for the most-used CLI flags.
* Cookbook with 11 offline + 6 online copy-paste recipes.
* Disk-bound benchmark section that places Resp.benchmark in the
layered hierarchy fio -> Device.benchmark -> KV.benchmark ->
Resp.benchmark, with a 5-step recipe (server start -> 100M-key load
-> read-thread sweep -> compare -> cleanup) and a 'why these
settings' rationale for the disk-tier flags.
* Troubleshooting + links to sibling benchmark READMEs.
Also updates benchmark/README.md to link the new README alongside the
website doc.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rmup=5 Align the disk-bound cookbook recipes across Resp.benchmark, KV.benchmark, and (already-aligned) Device.benchmark so the three layers are directly comparable on the same hardware. KV.benchmark README: - Cookbook table: add '--device-throttle 512' (default 120 leaves ~30% of the device on the table) and change '--warmup-sec 0' -> '--warmup-sec 5' on the three mostly-on-disk rows (RandomAccess, libaio, io_uring). - Narrative: explain why 100M (not 10M) is the minimum reproducible dataset for disk-bound runs - a smaller dataset fills only ~1.3 GB of LBA range and lands on a subset of NAND dies, observed ~18% higher per-IO service time at the device (1.08 ms vs 0.91 ms on a P5600, identical queue depth). Call out that --device-throttle 512 is required for peak IOPS. Update plateau number 400-500K -> 565K to match today's measurement. Resp.benchmark README: - Intro hierarchy diagram: update KV reference from 525K to 565K and add explicit note that all four layers must use the same dataset size, throttle, log shape, and backend to be directly comparable; document the 10M-vs-100M die-parallelism effect. - Step 4 (cross-layer comparison) KV recipe: add --device-throttle 512 and --warmup-sec 5 (was 0) so the embedded recipe matches the canonical KV cookbook entry. - Step 3 (GET sweep): add batch-size tuning note (-b 2048 is the empirical sweet spot at t=8, ~+3% over -b 1024) and a note to discard the first per-cell measurement since Resp.benchmark has no built-in warmup flag. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rewrite README sections that read as journey/learning notes (specific ad-hoc measurements, 'observed today on this hardware', 'as of the X change', 'empirically') into concise direct statements of fact: Resp.benchmark README: - Disk-bound intro: drop the per-IO service-time + LBA-range numbers measured on a specific device. State the underlying behavior directly (small datasets land on a subset of NAND dies and raise per-IO service time even at identical queue depth and IO size) and the resulting recommendation (use at least 100 M records). - --sg-get note: drop the specific 16K -> 146K / 104K -> 404K / aqu-sz=6.8 -> 109 numbers and just state the effect (several-fold gain at low thread counts, 3-4x at higher). - Step 2 wall-clock: drop the per-second client-side / server-side breakdown; just say 'a few minutes' and that batch generation dominates. - Step 3 batch-size note: drop 'Empirically' and the matches-how-X-uses warmup phrasing; state the batch-size effect and the no-warmup-flag fact directly. - Step 4 inline comment: tighten the cross-layer comparability note. KV.benchmark README: - Disk-IO sweep narrative: drop the 1.08 ms vs 0.91 ms on a P5600 + observed ~18% / ~17% reference numbers and the 565 K-IOPS plateau measurement. State the underlying device behavior and the --device-throttle 512 requirement as facts; remove specific IOPS numbers and 'currently sits between the two' phrasing. Device.benchmark README: - libaio --completion-threads note: drop 'As of the sharded-libaio change' / 'Empirical: on this hardware sharding gives at most ~1%'. State the io_context capacity, the per-thread affinity model, and the CT=1 default as direct facts; note that higher CT only helps under observable io_submit contention. Also: dotnet-format fix for libs/common/Memory/PooledArrayMemoryPool.cs (trailing newline, per repo .editorconfig 'insert_final_newline = false'). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The managed sector-size probe runs in :base(...) before the ctor body created the parent directory, so on hosts where the parent and the nearest existing ancestor live on different filesystems with different logical_block_size (e.g. an overlay mount over a 4K-native disk), the probe queried the ancestor and returned the wrong value. The native probe in NativeDeviceImpl's ctor then ran after the parent had been materialized, returned the correct value, and the cross-check in EnsureNativeDeviceCreated threw '512 vs 4096' on first IO. Move the mkdir into a static helper invoked from the base-ctor argument so both probes (and all subsequent O_DIRECT IO) see the same filesystem. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…issionDenied test CI failure 1 (OmitSegmentIdFromFilename): the cross-check between the managed-side sector-size probe (run from the base-ctor argument) and the native-side probe (run from NativeDeviceImpl's field initializer) can disagree on hosts whose filesystem layout makes ProbeDioAlignment's stat-walk-up return different ancestors at different moments (e.g. CI overlay mounts that change between calls). The kernel rejects misaligned O_DIRECT I/O with EINVAL regardless of what SectorSize claims, so a drift here is observable downstream with a clear error rather than silent corruption. Downgrade the cross-check to a warning so the operator can investigate without crashing the device on first I/O. Also drop the redundant !Directory.Exists guard from EnsureParentDirectoryAndProbeSectorSize (CreateDirectory is idempotent) and refresh the doc comment. CI failure 2 (PermissionDeniedAtFirstWrite, Native): on EACCES the C++ FileSystemSegmentBundle ctor stores a bundle with fd=-1 (the assert is compiled out in Release) and the subsequent io_submit hangs inside the P/Invoke instead of returning -EBADF synchronously, crashing the test host. Skip the Native case with a TODO; RandomAccess and ManagedLocal still validate the WriteAsync callback contract. Fixing the native error-routing path requires refactoring FileSystemSegmentBundle and rebuilding the shipped .so binaries; tracked as a follow-up. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces the two workarounds added in 437c71f with actual fixes. 1. Sector-size probe is now deterministic at construction time. EnsureParentDirectoryAndProbeSectorSize creates the parent directory first and then probes that directory's path directly (not the not-yet-existing data file). This eliminates the ProbeDioAlignment stat-walk-up that could land on a different filesystem ancestor than the eventual data file (observed on GHA overlay mounts over a 4K-native disk). SectorSize on StorageDeviceBase stays a readonly auto-property — the single value computed at ctor is now guaranteed to match what the native shim reports after opening the file with O_DIRECT. The cross-check in EnsureNativeDeviceCreated is restored to a hard throw: any drift is now a real ABI / loaded-library mismatch, not a path-resolution false positive. 2. EACCES on segment open is now routed through the WriteAsync callback contract instead of hanging inside io_submit. File::is_open() accessor added (both Linux and Windows headers). FileSystemSegmentBundle ctors capture the first non-Ok status from per-segment file.Open(); FileSystemSegmentedFile::OpenSegment now checks open_status() and refuses to commit a partially-opened bundle to files_. The broken bundle is destroyed and its buffer freed; the open error propagates back to the caller's WriteAsync which surfaces it via the AsyncIOCallback errorCode. QueueFile::ScheduleOperation gains a defense-in-depth fd_<0 guard so any future regression that re-introduces an invalid fd on the submit path returns Status::IOError instead of submitting to io_submit with aio_fildes=-1 (which empirically hangs inside libaio on some kernels rather than returning -EBADF synchronously). Both Linux .so binaries (libnative_device.so USE_URING=ON, and libnative_device_libaio.so USE_URING=OFF) rebuilt and installed. 3. Removed the Assert.Ignore(DeviceKind.Native) skip from IDevice_PermissionDeniedAtFirstWrite_CallbackGetsError. All three local device kinds (Native, RandomAccess, ManagedLocal) now pass. Verified locally: full Tsavorite.test.hlog suite (504 passed / 0 failed / 108 skipped) and full Tsavorite.test suite (206 / 0 / 61) both green on net10.0 Debug. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eak, ARM volatility
Addresses three real bugs surfaced by review of badrishc/optimize-kv-io.
1. C++ FileSystemSegmentBundle expansion ctor (file_system_disk.h)
The previous expansion ctor moved file handles out of the source bundle and
set its owner_=false BEFORE checking whether the new-segment opens succeeded.
On failure (e.g. ENOSPC on a newly-extended segment), OpenSegment destroyed
the rejected new bundle — but its destructor closed the moved-in fds, while
files_ still pointed at the source bundle whose File objects now held stale
fd_ values. Subsequent io_submit calls would read/write closed (or
OS-recycled) descriptors.
Restructured the ctor into two phases:
Phase 1 — open new segments only; on failure, destroy just the constructed
slots, set owner_=false, and return with `other` fully intact.
Phase 2 — once all new opens succeed, move the overlapping file handles
from `other` and transfer bundle ownership.
filename_ is now copy-constructed (not moved) so a failed expansion does not
strand `other` with an empty filename_. Rebuilt both shipped .so binaries.
2. TsavoriteThread.cs:93 — pool-wrapper leak on callback exception
ReturnAsyncGetFromDiskResult(result) was called only after
InternalCompletePendingRequest returned. A user callback that threw would
skip the return-to-pool, permanently leaking the wrapper. Wrapped in
try/finally so the wrapper is always returned.
3. NativeStorageDevice.cs:660 — fast-path volatility mismatch
EnsureNativeDeviceCreated published nativeDevice via Volatile.Write but
read it on the fast path with a plain field load. On ARM the reader could
observe a non-null handle while still seeing a stale `results` array
reference or partially-initialised completion-thread state. Changed the
fast-path read to Volatile.Read; the inner read inside the lock retains a
plain load (lock acquire is a full barrier).
All 504 Tsavorite hlog tests pass on net10.0 Debug.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eturnAsyncGetFromDiskResult The catch's else-arm duplicated the clear+enqueue+flip-flag dance before rethrow, while the finally block only enqueued without clearing context. Two paths reached the pool with different conventions: success paths and catch-throw explicitly cleared result.context first; catch-SetException did not. Funnel the wrapper return through ReturnAsyncGetFromDiskResult in finally (clears context + enqueues), and let catch's else-arm just rethrow. Not a functional bug. AsyncIOContext is a struct so the in-callback local ctx is a value copy of result.context; ctx.DisposeRecord() zeros the local but leaves the field unchanged, so a wrapper returned with stale context holds a reference to a SectorAlignedMemory wrapper whose buffer was already Return()ed. The next AsyncReadRecordToMemory overwrites result.context before anything reads it, so there is no leak, no use-after-free, and no buffer corruption. The change brings the SetException path in line with the explicit clears on the other paths and removes a duplicated cleanup dance. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… on-disk RecordDataHeader CHK stored [key][namespace] in both the inline buffer and the hoisted SectorAlignedMemory pool path. Tsavorite's on-disk RecordDataHeader puts the extended namespace bytes BEFORE the key (LogRecord.cs:60-61: '... those bits are the length of the extended namespace data preceding the key data'). The mismatch was self-consistent within CHK (writers and readers agreed) so no correctness issue, but it forced reviewers to flip the offsets mentally when moving between CHK and RecordDataHeader and blocked a future single-span [NS|key] compare optimization against DiskLogRecord. Swap both paths to [namespace][key]. KeysEqual + all accessors (KeyBytes, NamespaceBytes) updated; all callers go through these so the change is transparent. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ity fix Profiling KV.benchmark on Linux against RandomAccessLocalStorageDevice showed 20.76% of total thread-time in Monitor.Enter_Slowpath, driven by the per-segment AsyncPool<StorageAccessContext>'s internal SemaphoreSlim and ConcurrentQueue. SemaphoreSlim.Release ALWAYS takes its internal Monitor regardless of pool capacity; under 800K+ Returns/sec across 80+ ThreadPool workers this serializes every IO completion. Replace the per-segment AsyncPool with one shared SafeFileHandle per direction per segment. RandomAccess.ReadAsync/WriteAsync (pread/pwrite under the hood) are thread-safe on a single SafeFileHandle. Throttle via lock-free numPending > ThrottleLimit (matches LocalStorageDevice and NativeStorageDevice patterns). KV.benchmark/RandomAccess t=8 throughput: 86.7K -> 144.7K ops/sec (+67%). Throughput is now flat across t=1..16 (single app thread saturates the ThreadPool's pread pipeline). Monitor.Enter_Slowpath disappears from the profile top 30. Production-readiness fixes (per GPT-5.5 review): * SegmentHandles is a class with lazy per-direction creation via Interlocked.CompareExchange. Preserves the original AsyncPool's lazy-factory semantics so read-only callers no longer trigger write-handle open (matters for readOnly mode and read-only files). * If a handle factory throws after the sibling handle is already created, the surviving handle is reclaimed when SegmentHandles is disposed (no partial-construction leak). * Dispose and Reset spin-wait (bounded 5s) for numPending to drain before closing SafeFileHandles, matching AsyncPool.Dispose's wait-for-borrowed-items behavior. RemoveSegment follows the same pattern as LocalStorageDevice (no drain; epoch-protected truncation ensures no in-flight reads target a removed segment). Verified: * 16 RandomAccess-tagged DeviceTests pass. * Full 504-test hlog suite passes (108 skipped, all Azure-env-dependent). * Recovery tests pass (exercises read-only handle path). * dotnet format clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ync lost-wakeup race The Enqueue semaphore-skip optimization (commit aee945d) reads waiterCount via Volatile.Read (acquire-only). On weak memory models (ARM), the acquire-load can be reordered ahead of ConcurrentQueue's slot-sequence-number release-store, which is what publishes the item. A concurrent consumer in DequeueAsync that has already incremented waiterCount (full fence) but whose TryDequeue retry reads the slot sequence number can then both: - fail to see the producer's slot publication, AND - be missed by the producer's waiterCount check (because the read is reordered before the increment is visible), losing the wakeup and sleeping on the semaphore until the next Enqueue (or forever, for the BLPOP key 0 case). WaitForEntry/WaitForEntryAsync rechecks queue.Count (which reads Tail, bumped by the same full-fence Tail CAS) so it was already safe; the Tsavorite-side readyResponses hot path is unaffected by the bug and unaffected by this fence. DequeueAsync is consumed only by CollectionItemBroker for blocking collection commands (BLPOP/BLMOVE/BRPOP/BLMPOP/BZPOPMIN/BZPOPMAX). The barrier is a single ~5 ns Interlocked.MemoryBarrier on x86 / dmb sy on ARM — cheaper than the SemaphoreSlim.Release the original optimization avoids, and preserves the throughput win on the hot completion path. Verified: * All 504 hlog tests pass. * All 52 RespBlockingCollectionTests (BLPOP/BLMOVE/etc.) pass. * dotnet format clean. Identified by Opus 4.8 review of PR #1837. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three comments around the Native completion-thread / io_context configuration were narrating the optimization journey rather than describing current behaviour. Rewrite them to state only the current facts: - benchmark/Device.benchmark/Options.cs: --completion-threads help text no longer says libaio drainers "rarely help past 1 in practice" or that they are "now honoured for symmetry". Now states that each drainer is bound 1:1 to its own io_context/ring and throughput scales with the value. - libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs: --device-completion-threads help text drops the same "rarely help past 1" and "legacy QueueRun compat scanner" wording. - libs/storage/Tsavorite/cs/src/core/Device/NativeStorageDevice.cs: NativeStorageDevice ctor XML doc was stale — it claimed uring uses max(4, numCompletionThreads) rings, but the constructor body sets numIoContextsConfig = numCompletionThreadsConfig (always 1:1) on both backends. Rewrite the param doc to match the code and remove the "extra rings don't help" speculation. The inline comment above numIoContextsConfig dropped its "earlier experiment ... fundamentally broken" narrative in favour of a present-tense rationale for the 1:1 binding (each drainer blocks on its own ring in QueueRunFor, so a single drainer cannot cover multiple rings without starving them). No code changes. All three affected projects build clean. dotnet format clean on both Tsavorite.slnx and Garnet.slnx. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…heap alloc The previous d7a4715 commit (removing per-segment AsyncPool to fix Monitor.Enter_Slowpath contention) regressed allocation behavior by constructing a fresh UnmanagedMemoryManager<byte> per IO on the hot path in ReadWorkerAsync and WriteWorkerAsync — pre-PR code reused a single wrapper per pooled StorageAccessContext. This change restores wrapper reuse via a lock-free per-device ConcurrentQueue<UnmanagedMemoryManager<byte>>: - Rent: TryDequeue (or new on empty) → SetDestination(ptr, len). - Return: Enqueue in the finally of the IO continuation. Properties: - Zero heap allocations per IO in steady state (one CAS per Enqueue / TryDequeue; ConcurrentQueueSegment is array-backed so no per-item Node allocation, unlike ConcurrentStack). - Bounded by ThrottleLimit (default 120, ~4 KB high water). - No SemaphoreSlim / Monitor — does not reintroduce the contention pattern that motivated removing the AsyncPool. - Wrappers are only handed out one-IO-at-a-time, so concurrent SetDestination on the same instance cannot occur. ConcurrentQueue's release/acquire semantics on Enqueue/TryDequeue publish the rebound _pointer / _length to the next worker thread. Validated: - DeviceTests 70/70 pass - test.hlog 504/504 pass - KV.benchmark RA sweep (1-16 threads, 100M x 100B, 16 MB log): throughput unchanged at ~143-145K ops/sec (workload is bottlenecked on RandomAccess threadpool dispatch, not the wrapper alloc); GC pressure on the IO path drops to zero in steady state. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e doc - Rename ConditionallyHositedKey.cs -> ConditionallyHoistedKey.cs (filename was the only place with the 'Hosited' typo; type itself was already named ConditionallyHoistedKey). - Fix stale XML doc on Devices.CreateLogDevice's numCompletionThreads parameter: previous wording claimed all drainers share a single io_context / io_uring ring and that values > 1 are rarely useful, which contradicts the 1:1 drainer-to-ring binding the device actually implements today (see NativeStorageDevice.cs:584-592). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The static Empty singleton was bit-identical to default(ConditionallyHoistedKey): its initializer fed a null pointer (MemoryMarshal.GetReference on an empty span returns a null-equivalent ref, so Unsafe.AsPointer(...) was (byte*)0) through the (byte*, int) ctor, leaving every field at its default value. Returning default at the only callsite has the same observable behavior, removes the static field/property and its (misleading) initializer, and eliminates the class's static cctor / type-init flag check entirely. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Optional post-load diagnostic that prints the underlying TsavoriteKV.DumpDistribution() bucket histogram. Useful when investigating hash-table sizing / collision symptoms during disk- bound IOPS sweeps (paired with --index sizing). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…llback The reread path and completionEvent path both did 'result.context = default;' explicitly, then fell through to the finally block, which calls ReturnAsyncGetFromDiskResult — which already clears result.context before enqueueing the wrapper. The explicit clears were dead writes and the comment in the finally even pointed at non-existent 'explicit clears on the throw arm of the catch'. No behavior change: the local 'ctx' on the reread path is a struct copy taken at 'var ctx = result.context;' before the (now-removed) clear, so all fields (id, requestKey, minAddress, callbackQueue) are preserved and forwarded to the next IO via AsyncReadRecordToMemory's 'asyncResult.context = context' assignment. Comment expanded on the reread path to make the struct-copy semantics explicit so future readers don't reach for 'result.context' thinking it's the live state. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cument drop branch The wrapper-recycle method was named Return and marked internal, which invited reading it as a general 'return a wrapper to the pool' API and prompted a question about whether the cap-exceeded drop branch was leaking the still-rented byte[]. It isn't: by the time the method runs, PooledBuffer.Dispose has already Interlocked-Exchanged the array field to null and returned the byte[] to ArrayPool<byte>.Shared, so the wrapper holds no resources and dropping it just lets GC reclaim a small managed object. Rename to OnPooledBufferDisposed and make it private (nested PooledBuffer class can still call it) to make the callback semantics explicit, add an XML doc explaining that it must only be invoked from PooledBuffer.Dispose, and add a remarks block plus an inline comment documenting why the drop branch is safe. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
735f7cf to
2d6da50
Compare
TedHartMS
approved these changes
Jun 2, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.