[Tsavorite] Add native Linux storage backend, harden NativeStorageDevice, refresh storage benchmarks#1831
Conversation
4e6392e to
05fdd6c
Compare
ed4618a to
2a61c1e
Compare
There was a problem hiding this comment.
Pull request overview
This PR hardens Tsavorite’s native storage path, adds Linux native backend selection for libaio/io_uring, updates Garnet configuration plumbing, and refreshes storage/KV benchmarking guidance and harness behavior.
Changes:
- Adds/rewrites native device backend abstractions, runtime segment sizing, alignment probing, error reporting, io_uring support, and managed
NativeStorageDevice/IDevicecontract updates. - Plumbs native backend and completion-thread options through Tsavorite/Garnet device factories and host configuration.
- Expands device tests and benchmark documentation/tools for native, RandomAccess, and FileStream storage paths.
Reviewed changes
Copilot reviewed 42 out of 46 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
libs/storage/Tsavorite/cs/test/test.hlog/DeviceTests.cs |
Adds broad device contract, native lifecycle, recovery, alignment, and permission tests. |
libs/storage/Tsavorite/cs/test/SimulatedFlakyDevice.cs |
Forwards Initialize to wrapped devices. |
libs/storage/Tsavorite/cs/src/core/Utilities/CompletionEvent.cs |
Adds timed wait helper. |
libs/storage/Tsavorite/cs/src/core/Tsavorite.core.csproj |
Copies libaio-only native fallback artifact. |
libs/storage/Tsavorite/cs/src/core/Index/CheckpointManagement/LocalStorageNamedDeviceFactoryCreator.cs |
Adds backend/completion-thread factory parameters. |
libs/storage/Tsavorite/cs/src/core/Index/CheckpointManagement/LocalStorageNamedDeviceFactory.cs |
Passes backend/completion-thread settings to created devices. |
libs/storage/Tsavorite/cs/src/core/Device/TieredStorageDevice.cs |
Formatting-only cleanup. |
libs/storage/Tsavorite/cs/src/core/Device/StorageDeviceBase.cs |
Documents constructor defaults and optional initialization. |
libs/storage/Tsavorite/cs/src/core/Device/RandomAccessLocalStorageDevice.cs |
Switches to SafeFileHandle/RandomAccess and adds Linux direct-I/O open path. |
libs/storage/Tsavorite/cs/src/core/Device/NullDevice.cs |
Formatting-only method expansion. |
libs/storage/Tsavorite/cs/src/core/Device/LinuxFileExtensions.cs |
Adds Linux open(2) helpers for O_DIRECT. |
libs/storage/Tsavorite/cs/src/core/Device/IDevice.cs |
Updates device initialization contract documentation. |
libs/storage/Tsavorite/cs/src/core/Device/Devices.cs |
Adds native backend/completion-thread parameters. |
libs/storage/Tsavorite/cs/src/core/Device/AsyncPool.cs |
Rolls back allocation count when item creation fails. |
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs |
Adds bounded allocation retry backoff and modifies aligned read sizing. |
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/README.md |
Documents native backend benchmark scenarios. |
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs |
Adds io_uring backend parsing. |
libs/storage/Tsavorite/cc/src/device/utility.h |
Adds trailing-zero helper. |
libs/storage/Tsavorite/cc/src/device/native_device.h |
Adds native device interface, runtime segment sizing, probing, and backend types. |
libs/storage/Tsavorite/cc/src/device/native_device_wrapper.cc |
Extends C ABI for backend selection and diagnostics. |
libs/storage/Tsavorite/cc/src/device/native_device_error.h |
Adds thread-local native error channel. |
libs/storage/Tsavorite/cc/src/device/file_windows.h |
Adds cross-backend stubs for Windows handler compatibility. |
libs/storage/Tsavorite/cc/src/device/file_system_disk.h |
Converts segmented file logic to runtime segment size. |
libs/storage/Tsavorite/cc/src/device/file_linux.h |
Adds libaio wake support and sharded io_uring handler declarations. |
libs/storage/Tsavorite/cc/src/device/file_linux.cc |
Implements native Linux open/error handling, wake paths, and io_uring draining/submission changes. |
libs/storage/Tsavorite/cc/src/CMakeLists.txt |
Links liburing when enabled and includes new native error header. |
libs/storage/Tsavorite/cc/README.md |
Refreshes native-device build/runtime documentation. |
libs/storage/Tsavorite/cc/CMakeLists.txt |
Raises CMake minimum and enables io_uring by default. |
libs/server/Servers/GarnetServerOptions.cs |
Adds Garnet server options for native backend and completion threads. |
libs/host/defaults.conf |
Adds default config entries for native backend settings. |
libs/host/Configuration/Options.cs |
Adds CLI/config parsing for native backend settings. |
Dockerfile.ubuntu |
Installs liburing runtime dependency. |
Dockerfile.chiseled |
Stages liburing for chiseled runtime image. |
Dockerfile.azurelinux |
Installs liburing runtime dependency. |
Dockerfile.alpine |
Installs liburing package. |
Dockerfile |
Installs liburing runtime dependency. |
benchmark/README.md |
Adds Device.benchmark overview. |
benchmark/Device.benchmark/README.md |
Adds full Device.benchmark usage guide. |
benchmark/Device.benchmark/Program.cs |
Adds backend selection and success/error throughput accounting. |
benchmark/Device.benchmark/Options.cs |
Adds benchmark file-size/backend option updates. |
benchmark/Device.benchmark/BenchWorker.cs |
Aggregates errors and counts successful completions. |
Device.benchmarkDrive: Dell Ent NVMe P5600 MU U.2 3.2 TB (logical=physical=512). Command (per row, varying numactl --cpunodebind=0 --membind=0 \
dotnet bin/Release/net10.0/Device.benchmark.dll \
--file-name /path/to/test.dat \
--file-size 17179869184 --segment-size 1073741824 \
--device-type Native --io-backend {libaio|uring} \
--completion-threads {1|4} \
-t {1,2,4,8,16,32} -b 64 \
--runtime 10 --throttle-limit 256Throughput (ops/sec, libaio with ct=1, uring with ct=4, batch = 64, 3 runs each):
Saturation reference (fio, same drive): ~772K random 4 KB reads. Both backends saturate at ≥4 submitter threads (~720–750K, within ~3% of the fio ceiling). At t=1 the workload is latency-bound on a single submitter; throughput scales linearly with submitter concurrency from there. |
398a099 to
7d890ea
Compare
All 31 non-benchmark files changed on optimize-v2-io (vs its branch base d3677cf) ported here. Backup tag: optimize-v2-io-prerebase-backup @ 3f41f2b. Scope: device/IO/native-backend ONLY. Includes: Tsavorite C++ native device: - io_uring backend + pluggable C ABI (file_linux.cc/h) - error model split (native_device_error.h) - file_system_disk + native_device.h updates - CMakeLists + README Tsavorite C# device: - NativeStorageDevice: IoBackend enum (Default, Libaio, Uring), completion threads, production-readiness pass - LinuxFileExtensions.cs: P/Invoke open() for true O_DIRECT - ManagedLocalStorageDevice + RandomAccessLocalStorageDevice: O_DIRECT wiring on Linux - Devices.cs: router updates for new device APIs Tsavorite allocator + utilities: - AllocatorBase: bounded backoff in TryAllocateRetryNow - CompletionEvent: Wait(TimeSpan) overload Tsavorite checkpoint management: - LocalStorageNamedDeviceFactory + Creator surface ioBackend + completionThreads parameters Tests: - DeviceTests.cs updated for new device APIs Garnet host: - --device-io-backend, --device-completion-threads flags - defaults.conf updated, GarnetServerOptions wiring Dockerfiles (all 5): install liburing alongside libaio. Note: YCSB.benchmark and KV.benchmark are not modified by this commit; KV.benchmark is the supported benchmark on this branch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…iburing runtime dep
- Options: --device-io-backend now accepts 'uring' (aliases: io_uring,
iouring) in addition to libaio/default; help text + validation error
list updated to match.
- KV.benchmark README: device-backend table now describes the libaio vs
uring split, with a runtime-install snippet for liburing across
Debian/Ubuntu, Fedora/RHEL/AzureLinux, and Alpine, plus link to the
full Tsavorite Native Device docs. New 'native + libaio' and
'native + uring' rows added to the cookbook for constrained-log
large-dataset workloads.
- Tsavorite Native Device README: new top-level 'Runtime dependencies
(end users)' section listing the apt/dnf/apk install lines and how to
fall back to the no-liburing variant.
- KV.benchmark Validate: fix two bugs that surface when load and run
use different thread counts and when the log spills to disk:
1) writerThread reconstruction now uses ResolvedLoadThreads (not
Options.Threads which is the RUN count), so --load-threads N with
--threads M != N validates correctly.
2) Reads of records below HeadAddress return Status.IsPending; the
previous code counted these as misses. Validate now issues reads
in batches of 256 and drains via CompletePendingWithOutputs,
verifying each completed output against the per-thread pattern.
Verified end-to-end with both backends:
4.6M × 100B × 8T × log=256m (~580MB dataset > 256MB log → forces
disk spill) × 50R/50U × --validate:
native + libaio → [validate] OK, run = 1.27 M ops/s
native + uring → [validate] OK, run = 1.02 M ops/s
4.6M × 100B × 8T × log auto (fits) × 95R/5U × --validate:
native + libaio → [validate] OK, run = 16.10 M ops/s
native + uring → [validate] OK, run = 16.33 M ops/s
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…s on SafeFileHandle
Background:
MarkHandleAsAsync used reflection to flip SafeFileHandle.IsAsync's
non-public setter so a P/Invoke-opened O_DIRECT FD could be wrapped
in 'new FileStream(handle, isAsync: true)' without throwing 'Handle
does not support asynchronous operations'. The flag is non-public
in .NET 8/10, the hack was fragile across future runtime versions,
and on Linux IsAsync is a contract gate (no real overlapped I/O
exists for files), so the lie bought us nothing beyond letting the
FileStream constructor accept the handle.
RandomAccessLocalStorageDevice — refactor:
- StorageAccessContext.handle is now SafeFileHandle (was FileStream).
- CreateRead/WriteHandle:
* Linux + O_DIRECT capable: LinuxFileExtensions.OpenDirect -> raw
SafeFileHandle, no FileStream wrap. Page-cache bypass via the
O_DIRECT flag at open(2), exactly as before.
* Otherwise (Windows; or Linux when filesystem rejects O_DIRECT):
File.OpenHandle(path, ..., FileOptions.Asynchronous | cast
FILE_FLAG_NO_BUFFERING). On Windows this gives the runtime
IOCP-bound OVERLAPPED I/O; on Linux it's page-cached.
- All I/O goes through RandomAccess.{Read,Write}Async(safeHandle,
memory, offset). On Windows: true kernel async via IOCP. On Linux:
pread/pwrite dispatched to ThreadPool (same as before).
- GetFileSize uses RandomAccess.GetLength(handle).
- SetFileSize uses RandomAccess.SetLength(handle, size).
LinuxFileExtensions:
- MarkHandleAsAsync and the IsAsyncProperty reflection are deleted
entirely (no remaining callers).
- System.Reflection using removed.
ManagedLocalStorageDevice:
- Reverted to origin/main. This device is designed to stay within
FileStream APIs; the O_DIRECT branch we added doesn't belong here.
Verified:
- Tsavorite test.hlog DeviceTests: 36/36 passed.
- KV.benchmark --device randomaccess --log-memory 256m --preallocate-log
--rumd 50,50,0,0 --validate: PASS, 357 K ops/sec, iostat shows
100-177 K real disk r/s and 53-90% NVMe util → O_DIRECT page-cache
bypass confirmed.
- KV.benchmark --device randomaccess (log fits) --validate: PASS,
15.7 M ops/sec.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…+ ManagedLocal)
The Phase-7 hardening suite added in this branch was NativeStorageDevice-
specific. Add parametrized variants of the four tests that are pure IDevice
contract checks (not native-specific lifecycle/API), so they exercise all
three local-storage device implementations:
- Hardening_AllDevices_RoundTrip_BasicReadWrite
- Hardening_AllDevices_RoundTrip_AcrossSegmentBoundary
- Hardening_AllDevices_Parallel_32ConcurrentWrites
- Hardening_AllDevices_Parallel_BurstyTraffic
Each is parametrized by a new DeviceKind enum (Native, RandomAccess,
ManagedLocal). Native is gated on OperatingSystem.IsLinux() (the C++
shim links against libaio/liburing); the other two run on both Linux
and Windows.
A shared CreateDeviceForTest helper takes care of the per-kind ctor +
Initialize() dance so the test body stays uniform.
Result: 38/38 hardening tests pass on Linux (12 new cross-device + 26
native-only).
Native-specific tests retained as-is because they test API that doesn't
exist on the other devices:
- Lifecycle (DisposeBeforeInitialize, InitializeTwice, etc.) —
NativeStorageDevice defers Initialize from the ctor; the other
devices initialize in their ctor.
- Segment-size validation (NonPowerOfTwoSegmentSize_Throws, etc.) —
Initialize() is the only callsite that validates.
- Recovery_*_SegmentSize_* — the native device's open() path is the
only place that records and re-validates per-segment-size metadata.
- SectorSize stability across opens — not all devices expose this.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…evice_ buckets; fix AsyncPool creator-throw hang
Test refactor:
- IDevice_*: 8 contract tests parametrized across Native, RandomAccess,
ManagedLocal (round-trip basic, round-trip cross-segment, round-trip
various segment sizes, 32 concurrent writes, 64 concurrent reads,
mixed reads+writes, bursty traffic, stress burst of 100 writes,
permission-denied callback contract). 33 cases total.
- NativeStorageDevice_*: 16 native-only tests for behaviors managed
devices don't have (deferred Initialize signature, recovery
segment-size mismatch detection, sector-size discovery, sync-throw
unaligned IO guard).
AsyncPool fix:
GetOrAdd reserved a slot in totalAllocated before calling creator().
If creator() threw (e.g. open() returned EACCES, ENOSPC), the slot
was never released, so Dispose() would loop forever waiting for
totalAllocated to drain to zero. This manifested as a process hang
when a device pool's first open() failed. Rollback the reservation
on exception so the failure propagates cleanly and the pool can
still be disposed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…es, -1 = unbounded single segment
Before this change, IDevice.Initialize had the same signature for every
device but very different semantics:
* StorageDeviceBase (RandomAccess, ManagedLocal, LocalStorage, NullDevice,
LocalMemoryDevice): the ctor pre-set segmentSize = -1 / bits = 64 /
mask = ~0, so calling an IO entry point without Initialize() silently
ran in unbounded single-segment mode.
* NativeStorageDevice: Initialize was MANDATORY (the C++ shim needs the
segment size at create time for libaio/io_uring geometry), IO entry
points threw if invoked first, and segmentSize = -1 was rejected.
This commit unifies the contract: every IDevice must call Initialize()
exactly once before any IO, and segmentSize = -1 selects unbounded
single-segment mode on every device.
Implementation:
* StorageDeviceBase
- Added `initialized` flag (volatile) and `EnsureInitialized()`
helper that throws InvalidOperationException with a clear message
naming the device by FileName.
- Ctor leaves `initialized = false` but keeps the safe fallback
defaults (-1 / 64 / ~0) so any cold maintenance path that touches
segmentSizeBits before the guard can't compute outright nonsense.
- EnsureInitialized() called from the base address-based ReadAsync /
WriteAsync overloads and TruncateUntilAddress / TruncateUntilAddressAsync.
- Initialize sets `initialized = true` at the end.
* NativeStorageDevice
- Accepts segmentSize = -1: translates to 1UL << 63 for the native
shim so the C++ FileSystemSegmentedFile's shift = log2(segment_size)
math collapses every non-negative upper-layer address into segment 0
(parity with the managed-side bits = 64 / mask = ~0). Single growing
file on disk.
- Tracks the value passed to native in nativeSegmentSizeBytes (replaces
the diagnostic-only configuredSegmentSizeBytes long field, which
couldn't hold 1<<63 without overflow).
- ABI readback (NativeDevice_GetSegmentSize) compared against the value
we sent to native, not the user-facing -1.
- Always rejects omitSegmentIdFromFilename — the C++ shim has no
omit-suffix code path, every segment is written as <base>.<segmentId>.
Better to fail fast than silently produce wrong file names.
* Concrete IO entry points (ReadAsync / WriteAsync / RemoveSegment /
RemoveSegmentAsync) of NullDevice, LocalMemoryDevice,
ManagedLocalStorageDevice, RandomAccessLocalStorageDevice,
LocalStorageDevice, AzureStorageDevice, ShardedStorageDevice, and
TieredStorageDevice now call EnsureInitialized() before doing work.
Caller fix-ups:
* LocalStorageNamedDeviceFactory.Get now calls device.Initialize(-1L)
before returning. Commit / checkpoint metadata is single growing-file
usage (segment 0 only, .0 suffix), so unbounded mode is the right
default and unblocks every DeviceLogCommitCheckpointManager caller from
needing to remember to initialize.
* LocalStorageNamedDeviceFactory.ListContents skips dotfile entries —
defensive against a pre-existing race in
LinuxFileExtensions.IsDirectIOSupported where a .tsavorite-odirect-probe-*
temp file can leak in the commit dir if File.Delete races with
File.GetFiles. Without this filter, leaked probe files surface as
Int64.Parse("") failures in DefaultCheckpointNamingScheme.CommitNumber.
* SimulatedFlakyDevice.Initialize now propagates to the wrapped device.
* ComponentRecoveryTests Setup_* helpers call Initialize(-1) on devices
they construct directly (bypass the Tsavorite allocator path which
normally Initializes).
Tests (DeviceTests.cs):
* IDevice_ReadAsyncBeforeInitialize_Throws(kind) × 3 — new contract test.
* IDevice_WriteAsyncBeforeInitialize_Throws(kind) × 3 — same.
* IDevice_Initialize_SegmentSizeMinusOne_UnboundedSingleSegment(kind) × 3
— write at offset 1 MiB (would be in segment-N for any positive size)
and read back, confirming -1 routes through segment 0 on all 3 kinds.
* NativeStorageDevice_Initialize_OmitSegmentIdFromFilename_Throws — new
native-only test for the omit rejection in both -1 and explicit-size
modes.
* Removed NativeStorageDevice_{Read,Write}AsyncBeforeInitialize_Throws
(now subsumed by the IDevice_ variants).
Docs:
* IDevice.Initialize docstring rewritten to spell out the new contract
and the -1 semantics. NativeStorageDevice.Initialize remarks updated.
Verified on Linux net10.0 Release:
* 599 hlog tests (491 passed + 108 skipped)
* 305 recovery tests
* 144 + 155 + 127 + 346 = 772 other Tsavorite + Garnet RespTests
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…probe (no race)
Two related fixes on top of the unified Initialize contract:
1) NativeStorageDevice now supports omitSegmentIdFromFilename
─────────────────────────────────────────────────────────
Previously Native rejected the omit flag because the C++ shim hard-coded
the '.<segmentId>' suffix in three places in file_system_disk.h. This made
the IDevice contract asymmetric (managed devices honored omit, Native
didn't). Fix by threading the bool through the entire C++/C ABI:
C++ (libs/storage/Tsavorite/cc/src/device/):
* FileSystemSegmentBundle: new bool omit_segment_id_; both ctors accept it
and use a new segment_path(idx) helper that returns just filename_
when set, otherwise filename_ + '.' + std::to_string(idx). Used at all
three locations that previously hard-coded the suffix.
* FileSystemSegmentedFile: new bool omit_segment_id_ (const) wired through
ctor and propagated to bundles allocated by OpenSegment.
* NativeDeviceImpl: new bool omit_segment_id constructor param; recorded
as omit_segment_id_ member. ValidateRecoveredSegments short-circuits
in omit mode (single bare-named file, segment-size mismatch check is
meaningless when there's no .<id> suffix to scan for).
* native_device_wrapper.cc / NativeDevice_CreateWithBackend: new trailing
'bool omit_segment_id' parameter. ABI BUMP — managed wrapper updated
to match; Linux .so rebuilt and committed at
libs/storage/Tsavorite/cs/src/core/Device/runtimes/linux-x64/native/
libnative_device.so. **Windows DLL must be rebuilt by user** with
cmake -G 'Visual Studio 17 2022' -A x64 -T v143,spectre=true.
C# (libs/storage/Tsavorite/cs/src/core/Device/):
* NativeStorageDevice P/Invoke signature updated.
* NativeStorageDevice.Initialize removes the 'always rejects omit' guard.
It now accepts omit:true together with segmentSize = -1 and forwards
to native; rejects omit:true together with a positive segmentSize
with a clear error message (multiple segments would collapse onto
the same on-disk path and clobber each other).
Tests (DeviceTests.cs):
* IDevice_Initialize_OmitSegmentIdFromFilename_BareFileName(kind) × 3:
writes via Initialize(-1, omit:true) and asserts the on-disk file is
the bare basename (no .0 suffix). Replaces the native-only 'throws'
test from the previous commit.
* IDevice_Initialize_OmitSegmentIdFromFilename_WithoutMinusOne_Throws(kind)
× 3: enforces the no-positive-size-with-omit invariant on every kind.
2) IsDirectIOSupported uses O_TMPFILE (race-free probe)
─────────────────────────────────────────────────────
The previous probe in libs/storage/Tsavorite/cs/src/core/Device/
LinuxFileExtensions.cs created a hidden '.tsavorite-odirect-probe-<pid>-
<guid>' file in the device's directory, then File.Delete'd it in a
silent-catch finally. Multiple concurrent commits (one device per Get())
ran probes simultaneously; concurrent ListContents calls from
CommitRecordBoundedGrowthTest would observe the probe file during its
brief lifetime, and DefaultCheckpointNamingScheme.CommitNumber would then
throw FormatException on long.Parse(''). 18/20 baseline failure rate.
Switching from create+unlink to open(directory, O_TMPFILE | O_RDWR |
O_DIRECT) tells the kernel to allocate an anonymous inode in the
directory's filesystem with NO directory entry. The probe inode is
invisible to readdir/getdents regardless of timing; concurrent
ListContents cannot observe it. Freed on close. Linux >= 3.11 +
ext4/xfs/tmpfs/btrfs all support it. If O_TMPFILE itself fails
(EOPNOTSUPP on some filesystem) we conservatively report 'no O_DIRECT'
so the device falls back to the page-cache path — no named-file
fallback because that's the bug we're fixing.
Reverts the dotfile filter in LocalStorageNamedDeviceFactory.ListContents
added by the previous commit; the underlying race is now eliminated at
the kernel level so the workaround is unnecessary.
20/20 LogFastCommitTests runs pass after the change (was 2/20 on baseline,
3/20 on the previous unlink-after-open attempt which still raced).
Verified on Linux net10.0 Release:
* 612 hlog tests (504 passed + 108 skipped)
* 305 recovery tests
* 62 device tests (IDevice contract + NativeStorageDevice-specific)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Captures the 100M-key disk-bound thread-scale experiment from the
optimize-device branch sweep so it can be reproduced verbatim in the
future:
- 100M × 100B records (12.8 GB on disk)
- 16 MB log so ~0.125 % of dataset is in memory and almost every
read is a 4 KB random disk fetch
- 8 load threads, run-threads sweep 1,2,4,8,16,32 at 15s each
- One row per backend (RandomAccess / native+libaio / native+uring)
Added a short note after the table explaining what to compare against
(the disk's fio ceiling at 4K-aligned QD=64-per-job), the expected
~2 min wall-clock per device, and the observed per-backend plateau
characteristics so the next operator knows what 'good' looks like.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Device.benchmark fixes (previously reported throughput could be 2x inflated): - Throughput counter now tallies successful completions only. Before, every ReadAsync call was counted as success even when the kernel returned EAGAIN (Status::IOError=4 — flooded libaio io_context ring). Under --throttle-limit 0 with high QD, ~40% of "ops" were errored requests. - Per-error-code histogram printed at end of run; no per-error Console.WriteLine (was emitting millions of serial writes/run, both falsifying numbers and slowing the real path). - DEBUG data validation skipped on errored ops (was reading garbage from the destination buffer on EAGAIN paths and reporting spurious "Data mismatch"). - --throttle-limit help text documents the libaio kernel ring trap (128 slots wide; high QD + no throttle floods it) and recommends --throttle-limit 128 (also the io_uring SQ depth this build uses). - --io-backend flag added (libaio / uring / default) so the existing Linux Native path can be exercised against either backend. Unknown values are rejected with an actionable message at startup instead of silently falling back to default. - --completion-threads is now wired through to the Linux Native ctor (was hardcoded to 1). - --file-size widened to long (was int; --file-size > 2GB threw a parse error). KV.benchmark + Devices.cs: clarified XML/help text for the existing --device-completion-threads / numCompletionThreads parameter to describe the current behavior (multiple drainer threads share one kernel io_context / io_uring per device). No behavior change for KV.benchmark or core Tsavorite Devices.cs API beyond the Device.benchmark surface and clarified docs.
Adds N independent kernel io_contexts (libaio) / io_urings (uring) per
NativeStorageDevice, with completion threads bound 1:1 to contexts.
libaio internally always uses 1 context (sharding empirically gave
nothing — kernel mutex efficient at all tested loads), but the sharded
ABI surface is kept across both backends so the templated
NativeDeviceImpl<HandlerT> doesn't need a fork.
C ABI changes (libnative_device.so):
- NativeDevice_CreateWithBackend signature bumped: trailing int32 num_io_contexts.
- New exports: NativeDevice_QueueRunFor(device, ctx_idx, timeout_secs),
NativeDevice_NumIoContexts(device).
- Legacy NativeDevice_QueueRun kept; under uring sharding it scans all
rings (back-compat for any single-thread drainer).
C# (NativeStorageDevice):
- Synchronous ABI probe at Initialize() that converts EntryPointNotFoundException
into a clear TsavoriteException listing the missing exports and how to
rebuild — guards against a stale .so silently hanging Dispose's drain loop.
- Probe is intentionally gated by the QueueRun branch so it runs only on
backends that actually use the new symbols at runtime: Linux Native (libaio
/ uring) where QueueRun returns >= 0, NOT on Windows IOCP where the
ThreadPoolIoHandler returns -1 by design. This means a stale Windows DLL
keeps working unchanged because it never calls the sharded exports.
cdecl/x64 ABI silently tolerates the new trailing num_io_contexts arg on
NativeDevice_CreateWithBackend.
- Completion threads bound 1:1 via QueueRunFor(ctxIdx) (no closure capture
bug — ctxIdx is captured per-iteration into a local).
- numCompletionThreads is the user-facing knob; native side decides how many
contexts to actually create (libaio: always 1; uring: honours the request).
Empirical justification (Device.benchmark, NVMe, 4K random reads,
batch=4096, throt=512, fio ceiling 749K, NUMA0-pinned):
libaio CT=1 (always 1 ctx + 1 drainer): 755K ops/sec (t=32)
uring CT=1 (1 ring + 1 drainer): 357K ops/sec ← SpinLock-bound
uring CT=1 ring + N drainers (regresses): 357K → 274K ← cq_lock contention
uring CT=4 (4 rings + 4 drainers): 745K ops/sec
uring CT=8 (8 rings + 8 drainers): 758K ops/sec ← hardware ceiling
Both backends now reach the hardware NVMe ceiling. uring requires sharding
(the user-space SpinLock around io_uring_get_sqe + prep + submit is the
real cap). libaio doesn't need sharding (kernel io_context mutex already
efficient at all tested loads).
Files:
- file_linux.h : UringIoHandler sharded (vector<io_uring*>, per-ring sq_lock
+ cq_lock, atomic round-robin pick_ring). QueueIoHandler
unchanged on the data plane (single io_context_t) but
exposes the same num_contexts()/TryCompleteFor/QueueRunFor
surface as inline stubs for ABI symmetry.
- file_linux.cc : new UringIoHandler impls; QueueIoHandler unchanged.
- file_windows.h: stub overloads (num_contexts()=1, QueueRunFor=-1, 2-arg ctor)
so the templated NativeDeviceImpl compiles unchanged.
- native_device.h, native_device_wrapper.cc: ABI plumbing as above.
- NativeStorageDevice.cs: ABI probe + per-context drain workers.
- runtimes/linux-x64/native/libnative_device.so: rebuilt with sharding.
…VMe IOPS Captures the verified copy-paste recipe for both Linux Native backends (libaio and io_uring) to hit the hardware ceiling on a Dell P5600-class NVMe, alongside a flag reference, output-schema explanation, and troubleshooting table. Headline recipes verified end-to-end on the reference setup: libaio --completion-threads 1 --threads 16 --throttle-limit 512 → 743K ops/sec uring --completion-threads 8 --threads 16 --throttle-limit 512 → 738K ops/sec (Both within 2 % of the table values in the README; zero kernel-side errors.) Key facts documented: - libaio always uses one io_context in this build regardless of --completion-threads (sharding empirically gave nothing; the hint is ignored). Pass 1 explicitly so scripts are self-describing. - io_uring needs sharded rings (CT >= 4) to escape the per-ring user-space SpinLock cap around io_uring_get_sqe + prep + submit; CT=8 is the safe peak. - --throttle-limit must be set to at least the per-ring/per-context depth (128 in this build for both backends). --throttle-limit 0 floods the kernel ring and the benchmark correctly surfaces Status::IOError=4 in the per-code histogram rather than tallying errored ops as throughput. - --file-size must be a multiple of 1024 × --sector-size (fill phase uses a 1024-sector temp buffer). Plus a section comparing Device.benchmark vs KV.benchmark to direct readers to the right tool: Device.benchmark for IO-layer ceiling validation (saturates NVMe), KV.benchmark for full-path throughput (currently caps ~30 % below the IO ceiling on the upper-layer pending-read path — see KV.benchmark README for that side of the story). Also adds a one-paragraph pointer in benchmark/README.md so the new README is discoverable from the top-level benchmarks listing.
…full ScheduleOperation in both QueueFile (libaio) and UringFile (io_uring) now retries the kernel-side submission on the transient back-pressure signal (libaio: `io_submit == 0`; uring: `io_uring_get_sqe == nullptr`) up to kMaxSubmitRetries = 8 attempts, each separated by a sched_yield(). Permanent errors (libaio io_submit < 0, uring io_uring_submit < 0) are NEVER retried — they surface immediately as Status::IOError. Motivation: the upper-layer throttle gate in AllocatorBase.AsyncGetFromDisk is a racy test-then-increment (Throttle() reads numPending non-atomically, then ReadAsync does Interlocked.Increment). With N concurrent submitters all passing the gate at numPending == ThrottleLimit, in-flight can spike to ThrottleLimit + N momentarily, exceeding the 128-slot per-context / per-ring kernel ring depth when N > 8 (which is normal for Garnet under heavy disk-bound load). Pre-fix, the kernel rejects the overshoot submissions with EAGAIN, which Tsavorite handled by re-routing through the full AllocatorBase pending-read retry loop (correct but expensive: a full round-trip per IOError). Post-fix, the burst is absorbed locally by a handful of sched_yields and never surfaces to the upper layer. Why sched_yield + bounded retries is the right shape: on a 750K-IOPS NVMe the kernel ring drains a slot every ~1.3 µs and sched_yield is typically 1-10 µs on Linux, so 8 retries (worst-case ~40-80 µs window) is more than enough to absorb the typical 24-slot overshoot from a 32-thread burst. For genuine sustained overload (application submission rate exceeds device IOPS for seconds), the retries exhaust and Status::IOError surfaces — which is the correct signal for the caller to apply back-pressure. Implementation notes: - libaio: simple loop around io_submit. No lock to release; io_submit is kernel-thread-safe per io_context, so concurrent submitters serialise inside the kernel. - uring: must release sq_lock around sched_yield. Holding a SpinLock across a syscall would stall every other submitter on the same ring. Only the get_sqe == nullptr path is retried; if get_sqe succeeded we've already "consumed" an SQE slot in the user-side bookkeeping and re-issuing via get_sqe+prep on retry would corrupt the ring (we'd hold two SQEs for one logical op). For SQPOLL-disabled rings (our setup) io_uring_submit returns 1 in steady state — a non-1 there is an unrecoverable kernel-side error and surfaces immediately. Verified end-to-end (Device.benchmark, NVMe, 4K random reads, batch=4096, NUMA0-pinned, --throttle-limit 120 to match production NativeStorageDevice default): libaio CT=1: t=8 / 16 / 32 → 626K / 631K / 627K ok/sec, 0 err uring CT=8: t=8 / 16 / 32 → 622K / 625K / 626K ok/sec, 0 err At extreme intentional-overload settings (--throttle-limit 4096, t=32) errors still appear — confirming the retry budget correctly exhausts when the application is genuinely outpacing the device: libaio CT=1, t=32, throt=4096: 754K ok, 14.7M code4 err (~5% of submits) uring CT=8, t=32, throt=4096: 745K ok, 0 err (uring still produces 0 err under same overload because 8 rings × 128 = 1024 SQ slots is large enough that even gross over-submission fits within the retry budget per ring.)
BLOCKER fixes:
- NativeStorageDevice.Dispose UAF race. Previously NativeDevice_Destroy(nativeDevice)
ran before nativeDevice was nulled, so a concurrent guard-bypassed P/Invoke could
observe a non-zero handle that points to freed memory. Fix: Interlocked.Exchange
atomically captures-and-nulls the handle; destroy runs on the captured pointer.
EnsureReadyOrSilent now checks disposedFlag first.
HIGH fixes:
- UringFile::ScheduleOperation SQE leak on transient io_uring_submit failure. After
a successful get_sqe + prep, the SQE is committed to the user-side SQ ring; a
-EAGAIN/-EBUSY return from io_uring_submit left the slot permanently occupied
with no kernel iocb, eventually starving get_sqe forever. Fix: retry
io_uring_submit (without re-preparing) up to kMaxSubmitRetries on transient
negatives, with sched_yield (and sq_lock released) between attempts.
- UringIoHandler::Init partial-init leak. If new SpinLock() threw after
io_uring_queue_init succeeded for ring i, the already-initialized ring leaked
(the class dtor doesn't run on partial construction). Fix: use std::unique_ptr
RAII holders during construction; release into the member vectors only after
all allocations succeed.
POLISH fixes:
- NativeStorageDevice.Initialize tail-throw cleanup. If base.Initialize threw after
the native device and completion threads were created, both leaked. Now wrapped
in try/catch that cancels token, joins threads, and destroys the native device.
- UringIoHandler rule-of-5 hygiene: explicitly deleted copy ctor, copy-assign, and
move-assign so the implicit shallow copies (which would double-delete the raw
owning pointers) cannot be generated.
- DispatchUringCqe: added static_assert(is_trivially_destructible<IoCallbackContext>)
so a future non-trivial member fails the build instead of silently leaking.
- NativeDeviceImpl::num_io_contexts: removed unnecessary const_cast (the underlying
num_contexts() is already const on both backends).
- Removed duplicate XML <summary> on NativeStorageDevice.Dispose.
- FileSystemDisk dead-code ctor: passed the now-required 5th arg to
FileSystemSegmentedFile so the file compiles if anyone instantiates it.
Comment hygiene sweep (per explicit review-rule #4 "comments should not refer
to design thought processes"):
- Removed embedded benchmark results, hardware-specific throughput numbers, and
historical narrative from class/method documentation in file_linux.{h,cc},
native_device.h, native_device_wrapper.cc, NativeStorageDevice.cs.
- Kept WHAT each method does and the invariants it enforces; moved WHY this
approach was chosen out of source comments (the commit log is the appropriate
place for that context).
- Net: -162 lines across 7 files, no behavior change from the trim itself.
Verified post-fix performance unchanged (Device.benchmark, NVMe, 4K random reads,
batch=4096, NUMA0-pinned):
libaio CT=1 t=16 throt=512: 746K ok/sec, 0 err (hardware ceiling)
uring CT=8 t=16 throt=512: 743K ok/sec, 0 err (hardware ceiling)
libaio CT=1 t=32 throt=120: 636K ok/sec, 0 err (production default)
uring CT=8 t=32 throt=120: 618K ok/sec, 0 err (production default)
…reation
NativeStorageDevice.Initialize used to perform all the heavy work (native
device creation, completion-thread spawn, ABI / segment-size / sector-size
cross-checks) eagerly and threw "called more than once" on a second call.
The other IDevice implementations (LocalStorageDevice, RandomAccessLocalStorageDevice,
ManagedLocalStorageDevice) all inherit a metadata-only StorageDeviceBase.Initialize
that simply overwrites segmentSize / segmentSizeBits / mask fields and is
silently idempotent. They open their per-segment OS handles lazily inside the
IO methods.
This contract mismatch broke any caller that invokes Initialize twice on the
same NSD instance. The canonical case is
LocalStorageNamedDeviceFactory.Get(), which calls Initialize(-1L) as a
defensive pre-init so consumers can't forget; the consumer (snapshot
checkpoint state machine SnapshotCheckpointSMTask, cluster checkpoint
streaming TsavoriteCheckpointReader.CreateCheckpointDevice) then calls
Initialize(actualSegmentSize). Under the old NSD that throws; under the new
NSD it works the same way the other backends do.
Implementation:
- NSD.Initialize is now metadata-only — delegates to base.Initialize.
Pre-flight argument validation (segmentSize power-of-two, sector-size
floor, omitSegmentIdFromFilename) is preserved.
- New EnsureNativeDeviceCreated() does the heavy work, lazily, on first IO.
Reads the latest base.segmentSize and base.OmitSegmentIdFromFileName so
whichever Initialize call ran most recently wins.
- Thread-safe via double-checked locking on a new nativeCreateLock. The
publish of nativeDevice uses Volatile.Write so a second observer of
nativeDevice != IntPtr.Zero is guaranteed to see a fully-initialised
handle with completion threads already running.
- Dispose now also takes nativeCreateLock around the cancel-join-destroy
sequence so it cannot race with a concurrent EnsureNativeDeviceCreated
(which would otherwise leak a freshly-published native handle and its
completion threads).
- IO entry points (ReadAsync, WriteAsync) call EnsureNativeDeviceCreated()
before submission. Bookkeeping entry points (Reset, TryComplete,
GetFileSize, RemoveSegment) no-op when the native handle has not been
created yet, matching the semantics of the other backends (Reset on a
device with no open handles is a no-op).
Verified against the full unit-test sweep with Native forced as the default
device (the GetDefaultDeviceType hack is local-only and not in this commit):
Tsavorite.test: 206 / 206 (was 204 / 206 pre-fix)
Garnet.test: 789 / 789 (was 110 / 792 pre-fix; 681
were blocked on Initialize-twice)
Garnet.test.acl: 425 / 425
Garnet.test.collections: 746 / 746
Garnet.test.complexstring: 386 / 386
Garnet.test.rangeindex: 62 / 62
Garnet.test.vectorset: 42 / 42
No change to behavior for callers that invoke Initialize once with the real
segment size, which is what every production code path already does.
…ive handle GetFileSize and RemoveSegment must report the on-disk state regardless of whether IO has flowed through the device, matching LocalStorageDevice and RandomAccessLocalStorageDevice semantics. Before this fix, both no-op'd when no native handle had been created — which silently truncated the cluster manager's recovery decision because ClusterManager.cs:79 and ReplicationManager.cs:160 call `device.GetFileSize(0) > 0` to decide whether to recover persisted cluster config / replication history. With Native, a restarted node would always "Initialize new node instance config" instead of recovering, get a fresh node ID, and fail every replication-resume test (e.g. ClusterSRNoCheckpointRestartSecondary which restarts a replica and then waits for AOF sync to catch up). Changes: * GetFileSize now falls back to FileInfo when no native handle exists — same shape as RandomAccessLocalStorageDevice.GetFileSize (open-on-demand) but without paying io_uring/libaio setup cost just to stat a file. * RemoveSegment now falls back to File.Delete when no native handle exists — same shape as LocalStorageDevice / RandomAccessLocalStorageDevice (best-effort unlink, swallows ENOENT). * Per IDevice contract enforced in 889def4 ("Tsavorite IDevice: unify Initialize contract — required for all devices, -1 = unbounded single segment"), ReadAsync / WriteAsync now call EnsureInitialized() before EnsureNativeDeviceCreated() so the IDevice_*BeforeInitialize_Throws hardening tests get the same InvalidOperationException shape from Native that they get from the other devices. * Two device tests updated to match the lazy-Initialize contract that was introduced in commit f4e3044 ("Tsavorite Native: make Initialize idempotent via lazy native-handle creation"): - NativeStorageDevice_InitializeTwice_Throws → _Idempotent: idempotent Initialize matches the LSD/RA contract used by LocalStorageNamedDeviceFactory.Get + consumer re-init pattern. - NativeStorageDevice_Recovery_LargerExistingSegment_DetectsMismatch: the C++ ValidateRecoveredSegments check now fires on first IO (when EnsureNativeDeviceCreated runs), not at Initialize time, so the test asserts on a ReadAsync rather than Initialize. Verified on Linux x64 / .NET 10: * libs/storage/Tsavorite/cs/test/test.hlog: all IDevice + NativeStorageDevice tests pass (62/62) with Native default * test/cluster/Garnet.test.cluster.replication: all 4 ClusterSRNoCheckpointRestartSecondary variants pass with Native default (regression-test for the recovery path) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Without this fix, NSD.Dispose() can stall up to CompletionWorkerTimeoutSecs (1s)
per io_context because the completion-drainer thread is blocked in
io_getevents / io_uring_wait_cqe_timeout waiting for events that will never
come (the IO drain phase has already brought numPending to 0). The Thread.Join
following completionThreadToken.Cancel() then has to wait for the next QueueRunFor
timeout to fire so the thread can observe cancellation and exit.
This was visible as exactly-1.0s gaps in cluster replication recovery traces:
checkpoint metadata reads / writes that each create+dispose a fresh NSD spent
~1s in Dispose, multiplying across the ~5–10 devices created per checkpoint
into multi-second stalls. ClusterReplicaSyncTimeoutTest (replicaSyncTimeout=1s)
and MultiDatabaseSaveRecoverByDbIdTest(True) (2s LASTSAVE poll window) failed
because of this; the actual I/O on Native is microseconds, not seconds.
Fix: post a synthetic wake-up event on each io_context when Dispose runs.
* libaio: submit a 0-byte read on a /dev/null fd opened in the handler ctor.
/dev/null completes immediately and does not require O_DIRECT alignment,
so the wake-up does not interfere with the real segment files.
* io_uring: submit io_uring_prep_nop with user_data = nullptr; the drain
loop recognises nullptr as a wake-up sentinel and skips dispatch.
* Windows ThreadPoolIoHandler has no dedicated drainer (callbacks fire on
threadpool threads), so its Wake is a no-op stub returning 0.
The completion thread wakes from its blocking syscall almost immediately,
observes the cancellation token on its next loop iteration, and exits. No
extra idle work, no polling, no shortened timeout.
* NSD.Dispose latency: 1025ms worst case -> ~20-30ms (microbenchmark).
* ClusterReplicaSyncTimeoutTest with Native: ~22-25s (fail) -> ~3s (pass).
* MultiDatabaseSaveRecoverByDbIdTest(True) with Native: timeout (fail)
-> ~6s (pass).
* Idle drainer syscall rate is unchanged (1/s/context).
C ABI changes (additive — old exports preserved):
* NativeDevice_WakeCompletionWorker(device, ctx_idx).
* INativeDevice::Wake; QueueIoHandler::Wake, UringIoHandler::Wake,
ThreadPoolIoHandler::Wake stub.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…id for IO Background: commit 889def4 ("unify Initialize contract — required for all devices") added an EnsureInitialized() guard that threw InvalidOperationException at every IO entry point if Initialize() had not been called first. This was redundant: the ctor already establishes segmentSize=-1 / segmentSizeBits=64 / segmentSizeMask=~0UL, which is functionally identical to having called Initialize(-1) — every absolute address right-shifts to segment 0, producing unbounded single-segment routing. The mandatory-Initialize contract was the root cause of the entire factory-pre-init + NSD lazy-creation saga: LocalStorageNamedDeviceFactory.Get was forced to call device.Initialize(-1L) defensively just to satisfy the contract, which broke NativeStorageDevice (its single-shot Initialize then asserted on the consumer's follow-up Initialize(realSize)). Recent commits f4e3044 + 0535da0 papered over this with lazy native-handle creation; this commit removes the root cause. Changes: * StorageDeviceBase: remove the 'initialized' flag, EnsureInitialized() helper, and ThrowNotInitialized() method. Initialize() is now purely a *configuration* call to override the ctor defaults (set a non-default segment size, opt into OmitSegmentIdFromFileName). The ctor doc explicitly states that callers may issue IO immediately after construction. * All concrete devices: remove the EnsureInitialized() calls at the top of ReadAsync / WriteAsync / TruncateUntilSegmentAsync / RemoveSegment (libaio, io_uring, RA, ManagedLocal, LocalMemory, Null, Tiered, Sharded, Azure). * LocalStorageNamedDeviceFactory.Get: drop the defensive device.Initialize(-1L); the ctor defaults match what that call did anyway. * NSD's recent EnsureInitialized() additions to ReadAsync/WriteAsync (introduced in 0535da0 only to satisfy the hardening test) are also removed by the sweep. * ComponentRecoveryTests.cs: drop 3 redundant Initialize(-1L) calls. * test.hlog DeviceTests: rename and repurpose IDevice_*BeforeInitialize_Throws to IDevice_*BeforeInitialize_UsesCtorDefaults — the new test demonstrates that WriteAsync/ReadAsync on a freshly-constructed device (no Initialize call) works correctly using the unbounded single-segment defaults, across Native / RA / ManagedLocal. TestUtils.cs:165 still calls device.Initialize() — that path is conditional on the caller wanting OmitSegmentIdFromFileName=true, which IS only settable via Initialize (it is not a ctor parameter), so the call is genuinely needed there. Verified on Linux x64 / .NET 10: * libs/storage/Tsavorite/cs/test/test.hlog (IDevice + NativeStorageDevice tests): 62/62 pass. * libs/storage/Tsavorite/cs/test/test.recovery (ComponentRecovery tests): 4/4 pass. * Full Garnet.test, Garnet.test.cluster, Garnet.test.acl, Garnet.test.collections, Garnet.test.extensions, Garnet.test.scripting, Garnet.test.complexstring, Tsavorite IDevice+NSD: pass at the same rates as before this commit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sweep the PR for comments that referenced earlier design choices as they evolved during development, and rewrite them to describe the steady-state contract directly without historical baggage. * IDevice.Initialize: replace the "Must be called exactly once... before any IO entry point... uninitialized device throws InvalidOperationException" doc with the actual contract: Initialize is purely an opt-in configuration step to override the ctor defaults (which are equivalent to Initialize(-1)); callers may issue IO immediately after construction. * NativeStorageDevice ctor doc: rewrite to describe the as-shipped lazy creation flow (configuration captured at ctor, native handle created on first IO via EnsureNativeDeviceCreated) rather than the stale "Native device creation is DEFERRED until Initialize... every IO entry point throws InvalidOperationException" framing. * NativeStorageDevice.Initialize doc: drop the misleading "Creates the underlying native device with the requested segment size" lead-in (which hasn't been true since the lazy-creation refactor); replace the "factory pre-init" example (factory no longer pre-inits) with a steady-state description of when repeat Initialize calls are honoured. * NativeStorageDevice.EnsureNativeDeviceCreated doc: replace "Throws if Initialize has not been called" (now uses ctor defaults if no Initialize) with "Throws if the device has been disposed or if the native shim rejects the configuration". * NativeStorageDevice.EnsureReadyOrSilent doc: drop the "does not throw on 'not initialized yet'" qualification. * NativeStorageDevice.GetSectorSize doc: re-point the "cross-check" link from Initialize to EnsureNativeDeviceCreated (which is where it actually happens). * NativeStorageDevice.Dispose doc + body: bound the worst-case shutdown stall by the longest in-flight user callback (not CompletionWorkerTimeoutSecs) since wake-up uses NativeDevice_WakeCompletionWorker; rewrite the inline Dispose comment so it documents the steady-state design rather than what it improved over. * NativeStorageDevice nativeSegmentSizeBytes / UnboundedNativeSegmentSizeBytes field doc: clarify that the value is populated by EnsureNativeDeviceCreated (not Initialize) and that the default is reached without calling Initialize. * NativeStorageDevice_InitializeTwice_Idempotent test: drop the now-stale "factory pre-init... consumer re-initializes" rationale; describe the idempotent contract directly. * NativeStorageDevice_DisposeBeforeInitialize_IsNoOp test: drop the "Phase 6" reference and reword in terms of the steady-state lazy-creation contract. * SimulatedFlakyDevice.Initialize: replace the "so its EnsureInitialized() guard passes when our IO methods delegate to it" comment (the guard no longer exists) with a description of why both devices need matching geometry. * LinuxFileExtensions.OpenDirect dsync param doc: drop the "previously asked for it" wording — the WriteThrough callsites still pass it; describe the parameter as an opt-in for WriteThrough-equivalent semantics. * Doc-cref bookkeeping: change <see cref="base.segmentSize"/> (illegal cref for inherited fields) to <c>base.segmentSize</c> code spans. No behaviour change. Build clean on Garnet.slnx and Tsavorite.slnx; dotnet format --verify-no-changes clean on both. IDevice + NSD device tests all pass (62/62). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Background: the shipped libnative_device.so was built with -DUSE_URING=ON,
so it had a hard DT_NEEDED entry for liburing.so.2. Loading the .so on a
host without liburing2 installed (e.g. a GitHub Actions ubuntu-latest
runner, or an end-user box where only libaio is in the base image) failed
with:
System.DllNotFoundException: ... liburing.so.2: cannot open shared
object file: No such file or directory
…even for callers that only ever requested the libaio backend, because the
dynamic linker resolves NEEDED libraries at load time regardless of which
exported symbols the caller goes on to invoke.
Rebuild the prebuilt with -DUSE_URING=OFF so the shipped .so links only
libaio. Most Linux distributions ship libaio in the base system, so the
prebuilt now loads without any additional setup. The io_uring backend
becomes a build-time opt-in: callers that want it install liburing-dev
and rebuild with -DUSE_URING=ON. The C# layer already surfaces a clear
TsavoriteException for callers that request Uring against a USE_URING=OFF
build ("Requested IO backend 'Uring' is not available in the loaded
native_device library… Rebuild the native library with -DUSE_URING=ON
and install liburing-dev to enable io_uring.").
Build fix: file_linux.h now includes <fcntl.h> directly (for the
::open() / O_RDONLY usage in QueueIoHandler::OpenWakeFd()). Previously
these were pulled in transitively through <liburing.h>, which is now
gated behind #ifdef FASTER_URING.
Dockerfile updates: drop liburing2 / liburing from the runtime install
list in all 5 Dockerfiles (default, .ubuntu, .alpine, .azurelinux,
.chiseled). Comments left for users that rebuild with USE_URING=ON.
README updates: rewrite the "Runtime dependencies" section to describe
the new default (libaio only). Replace the "Disabling io_uring (optional)"
section with "Enabling io_uring (optional)".
Verified on Linux x64 / .NET 10: libaio default works (62/62 IDevice +
NativeStorageDevice tests pass); ldd confirms only libaio.so.1t64 is
in NEEDED.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fallback
Single-shipping libnative_device.so created a deployment dilemma: build with
USE_URING=ON and end-users without liburing get DllNotFoundException at load
time; build with USE_URING=OFF and the io_uring backend stops working even on
hosts that DO have liburing installed (which is the case where uring matters
for perf — modern NVMe at >1M IOPS benefits noticeably from uring over libaio).
Ship both flavors instead:
* libnative_device.so — built with USE_URING=ON; DT_NEEDED on
libaio AND liburing. Exposes both Libaio and Uring backends.
* libnative_device_libaio.so — built with USE_URING=OFF; DT_NEEDED on
libaio only. Exposes the Libaio backend.
NativeStorageDevice's DllImportResolver tries the uring-enabled binary first;
on DllNotFoundException matching 'liburing.so.2: cannot open' it falls back to
the libaio-only binary. The Libaio backend therefore always works out of the
box on any Linux distribution that ships libaio (essentially all of them).
liburing is opt-in: hosts that install it get the Uring backend with zero
runtime overhead vs Libaio (direct calls, no function-pointer indirection — we
deliberately rejected the dlopen approach so the future-default uring path
stays optimal).
If a caller explicitly selects IoBackend.Uring on a host without liburing, the
construction-time error message now points at the install command per distro
('apt-get install -y liburing2', 'dnf install -y liburing', 'apk add liburing')
instead of telling the user to rebuild the .so with -DUSE_URING=ON. We never
silently downgrade Uring to Libaio.
Changes:
* NativeStorageDevice.cs: new LibaioFallbackLibraryPath; ImportResolver
catches DllNotFoundException for liburing.so.2 and falls back to the
libaio-only .so. ResolveNativeLibraryPath now takes the path as a
parameter so it can resolve either flavor.
* NativeStorageDevice.cs: rewrite the 'backend not available' exception
message — point at install commands (the actual remediation) not rebuild.
* Tsavorite.core.csproj: add libnative_device_libaio.so as a second
ContentWithTargetPath asset so both .so files are copied to the
output directory and packed into the NuGet runtime payload.
* runtimes/linux-x64/native/libnative_device.so — REPLACED with
USE_URING=ON build (DT_NEEDED libaio + liburing). 2.3 MB.
* runtimes/linux-x64/native/libnative_device_libaio.so — NEW, USE_URING=OFF
build (DT_NEEDED libaio only). 1.6 MB.
* Dockerfile, Dockerfile.ubuntu, Dockerfile.alpine, Dockerfile.azurelinux,
Dockerfile.chiseled: re-add liburing2 / liburing to the runtime installs
so docker users get the io_uring backend out of the box (the libaio-only
fallback would otherwise leave Uring unusable inside containers).
* cc/README.md: rewrite the 'Runtime dependencies' and build sections to
describe the two-flavor layout, drop the stale 'Enabling io_uring'
section, and document the prebuilt rebuild workflow.
Verified end-to-end:
* Both backends saturate the Dell P5600 NVMe at ~743K random read IOPS
in benchmark/Device.benchmark (matches the pre-change reference).
* 62/62 IDevice + NativeStorageDevice tests pass.
* dotnet format clean on both Garnet.slnx and Tsavorite.slnx.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…hreadPoolIoHandler NativeDeviceImpl's constructor (native_device.h:100-101) gates on handler_.init_errno() to surface an actionable error message when the underlying IO handler failed to initialize (e.g., libaio io_setup() failed with EMFILE / ENOMEM, or io_uring_queue_init() failed). The Linux handlers QueueIoHandler and UringIoHandler both expose this API; ThreadPoolIoHandler (Windows) did not, so MSVC failed to instantiate NativeDeviceImpl<ThreadPoolIoHandler> with: error C2039: 'init_errno': is not a member of 'FASTER::environment::ThreadPoolIoHandler' Add init_errno() and initialized() stubs that return 0 / true unconditionally — the Windows ThreadPool API does not have a separable init step that can fail in the same way the Linux io_setup / io_uring_queue_init paths can (threadpool creation failures propagate via threadpool_'s ctor, not via a later 'check this' field on the handler), so the stubs are semantically correct. NativeDeviceImpl then falls through to the log_.Open(&handler_) path which is where Windows-specific errors (missing directory, permission denied, etc.) actually surface. Linux unaffected: rebuilt build/Release-uring cleanly after this change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Native tests in DeviceTests.cs had blanket 'NativeStorageDevice is Linux-only' Assert.Ignore guards that dated from when NSD's C++ shim was Linux-only. The shim is built on Windows too (native_device.dll via the ThreadPool / IOCP backend in file_windows.cc), so directly constructing 'new NativeStorageDevice(...)' works on Windows. The blanket guards were silently dropping ~15 NSD test cases on Windows CI. Drop the guards so the tests exercise the Windows C++ shim. The legitimate Linux-only guard on IDevice_PermissionDeniedAtFirstWrite_CallbackGetsError (chmod-based; chmod has no Windows analogue) is preserved. Important: end-user device routing is UNCHANGED. Devices.CreateLogDevice(DeviceType.Native) on Windows still returns LocalStorageDevice (managed Windows IOCP), not NativeStorageDevice — that routing happens in Devices.cs and was not touched. These tests directly instantiate the NSD class for shim-coverage purposes only; they do not affect what end users get from the default device factory. Linux: 62/62 pass after this change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…+ source Rebuild the shipped Windows prebuilt from the current native_device source so the latest fixes (Initialize idempotence, WakeCompletionWorker, etc.) are reflected in the win-x64 DLL. USE_URING is a no-op on Windows; the DLL only exposes the Default (IOCP) backend, so there is no equivalent of the libnative_device_libaio.so fallback on this platform. End-user device routing is unchanged: Devices.CreateLogDevice(DeviceType.Native) on Windows still returns LocalStorageDevice (managed Windows IOCP). This DLL is exercised by direct 'new NativeStorageDevice(...)' construction (see Tsavorite.test.hlog DeviceTests — 59/59 Native + IDevice tests pass on Windows after this rebuild). Built with: Visual Studio 17 2022, MSVC v143, x64, Release configuration, Spectre-mitigated CRT.
Fixes two correctness issues caught by an automated code review of the
optimize-device PR.
### io_uring SQE leak on submit failure
In UringFile::ScheduleOperation, io_uring_get_sqe() advances the
user-side sqe_tail before io_uring_submit() is called. If submit fails
after retries (-EAGAIN/-EBUSY exhausted, or any other negative), the
old code released the lock and returned IOError without doing anything
about the still-pending SQE. user_data on that SQE pointed at the
io_context unique_ptr that was about to be freed by the guards
unwinding, so the next successful submit on the same ring would consume
the stale SQE and the QueueRunFor drain loop would dispatch a callback
against freed memory — a clear use-after-free.
Fix: before releasing sq_lock on the failure path, rewrite the still-
pending SQE in place as io_uring_prep_nop with user_data = nullptr.
The drain loop already skips nullptr user_data (it's the wake-up
sentinel used by UringIoHandler::Wake), so when a later submit flushes
this nop the CQE is drained harmlessly. Safe to mutate the SQE in place
because we still hold sq_lock and no kernel/concurrent submitter has
observed it yet.
### NativeDevice sector_size always returned 512
FileSystemSegmentedFile::alignment() returned a hard-coded 512.
NativeDeviceImpl::sector_size() delegated to it, so the C# wrapper's
sector-size cross-check in EnsureNativeDeviceCreated would:
- falsely throw on 4K-native disks where ProbeAlignment returns 4096
(managed 4096 vs native 512 → 'sector-size mismatch' → device
unusable), or
- on 4K-native disks where the managed probe fell back to 512 (e.g.
older kernel without STATX_DIOALIGN), let the device initialize
with SectorSize=512 and then have the kernel reject the 512-aligned
O_DIRECT buffers with EINVAL.
Fix: factor the STATX_DIOALIGN probe from NativeDevice_ProbeAlignment
into a shared inline helper (native_device::ProbeDioAlignment in
native_device.h) and call it once from the NativeDeviceImpl ctor,
caching the result as the immutable member device_alignment_.
sector_size() now returns the cached value; NativeDevice_ProbeAlignment
delegates to the same helper. Both sides of the ABI cross-check go
through identical probe logic, so the check is now a meaningful ABI /
runtime-drift detector instead of a 4K-disk footgun.
### Stale IDevice.Initialize XML
The omitSegmentIdFromFilename param said it was 'only supported by
managed devices — NativeStorageDevice rejects this flag'. Native
devices have honored the flag since 6584cf7. Updated the doc.
### Alpine install hint
The 'IoBackend.Uring with libaio fallback' error message suggested
'sudo apk add liburing' on Alpine, but README.md notes that the
prebuilt won't load on Alpine (musl) at all. Replaced the apk
suggestion with the actual Alpine support story (use a glibc image or
fall back to a managed device).
Verification: 62/62 Tsavorite.test.hlog IDevice + NativeStorageDevice
tests pass on Linux. Both .so binaries rebuilt (uring-enabled and
libaio-only fallback) with correct ldd output. Device.benchmark NVMe
saturation throughput unchanged within noise (libaio 738K IOPS, uring
349K IOPS on Dell P5600).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces the statx(STATX_DIOALIGN) probe in ProbeDioAlignment with a
direct sysfs lookup of max(logical_block_size, physical_block_size).
Why:
- STATX_DIOALIGN reports only the kernel-enforced minimum (= logical
block size). It misses the firmware's preferred sector
(physical_block_size), so on a 512e drive (logical=512,
physical=4096) the probe would return 512 and Tsavorite would take
a firmware RMW penalty on every partial-sector write.
- STATX_DIOALIGN also requires kernel 6.1+ AND the filesystem to
populate the field; ext4 on 6.8 leaves it unset on 512-byte
devices, so the probe was already falling through to the 512
default in practice.
- sysfs gives us both values directly, on every kernel, with no
O_DIRECT dance. Taking max(logical, physical) covers the
correctness floor (logical = kernel-enforced minimum) and the
performance floor (physical = avoid RMW on partial writes) in one
shot.
Implementation:
- stat() the file (or its closest existing ancestor — log file may
not exist yet at construction). Extract st_dev → (major, minor).
- Read /sys/dev/block/<maj>:<min>/queue/{logical,physical}_block_size.
For partitions (e.g. sda2), the queue/ dir lives on the parent
whole-disk block device — fall through to ../queue/<field>.
- Round result up to a power of two (always already pow2 on real
hardware) and floor at 512 B.
On this machine (Dell P5600 NVMe + PERC sda):
Probe(/DATA2/badrishc) = 512 (NVMe: logical=512, physical=512)
Probe(/tmp/devbench) = 512 (sda partition via parent-walk)
Probe(/home/badrishc) = 512
Probe(/) = 512
All values match max(logical, physical) read directly from sysfs.
Verification:
- 62/62 Tsavorite.test.hlog IDevice + NativeStorageDevice tests pass
- Both .so flavors rebuilt (uring-enabled + libaio-only fallback)
- C ABI NativeDevice_ProbeAlignment delegates to the same helper, so
managed SectorSize and native sector_size() remain in lockstep.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…obed sector size CI failure on IDevice_Initialize_OmitSegmentIdFromFilename_BareFileName (and likely other IDevice_* tests on the same runner) reported: 'NativeStorageDevice.WriteAsync: misaligned I/O — sector size is 4096, but offset=0x0, length=4096, buffer=0x...7EC7A9F76800' The buffer ends at 0x...800 = 2048 — i.e. 2048-aligned but not 4096-aligned. The test helper allocated buffers aligned to HardeningSectorSize = 512 (the pre-PR default for every Garnet Linux device); a 512-aligned formula can land on a 2048-boundary that is not also a 4096-boundary. CI's underlying disk reports physical_block_size = 4096 in sysfs, so the new max(logical, physical) probe returns 4096 there. The native shim then correctly rejects sub-4096-aligned O_DIRECT buffers with EINVAL. The fix is on the test side: bump HardeningSectorSize from 512 to 4096 so the test buffer alignment matches the strictest device.SectorSize seen on any modern hardware (512n, 512e, 4Kn). Locally (Dell P5600 NVMe, logical=physical=512 → SectorSize=512) all 62 IDevice + NativeStorageDevice tests still pass — 4096 trivially divides 512. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… PR) The read-window sizing optimization (drop leading-slop padding + clamp to page-end) is out of scope for this device-backend PR; it interacts with the larger read-IO path and deserves its own focused PR with dedicated benchmarking. Reverting to the pre-PR behavior here. TryAllocateRetryNow's bounded-backoff change is retained — it's a self-contained allocator hot-path fix and is independently verified (+13.9% on YCSB load with libaio at 64 threads, per kvbench benchmarking). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… max(logical, physical)
Symmetry with the Linux sysfs probe — Windows now reads both
BytesPerLogicalSector and BytesPerPhysicalSector from the volume's
STORAGE_ACCESS_ALIGNMENT_DESCRIPTOR (via IOCTL_STORAGE_QUERY_PROPERTY +
StorageAccessAlignmentProperty) and returns the rounded-up-to-pow2 max,
floor 512 B. Previously the Windows branch returned 512 unconditionally,
which would silently undersize SectorSize on Windows 4Kn / 512e drives.
Implementation:
- Parse drive letter from filename ("C:\foo.dat" -> "\\.\\C:").
UNC paths are not supported by this probe — fall back to 512.
- CreateFile on the volume with FILE_READ_ATTRIBUTES (no admin needed).
- DeviceIoControl(IOCTL_STORAGE_QUERY_PROPERTY) with
StorageAccessAlignmentProperty.
- max(logical, physical), round up to pow2, floor 512.
Linux behavior unchanged. Both .so flavors rebuilt and pass 62/62
device tests on this machine (logical=physical=512 NVMe).
REQUIRES Windows DLL rebuild — the Windows path in ProbeDioAlignment
is now non-trivial, and the existing prebuilt native_device.dll still
returns 512 unconditionally. Without the rebuild, on a Windows 4Kn box
the managed SectorSize cross-check would (incorrectly) pass at 512
while the device might actually need 4096. Rebuild recipe in the
companion review comment / Tsavorite/cc/README.md.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…leteFor
Addresses Copilot review comment on file_linux.cc:373.
UringIoHandler::TryCompleteFor (and via it TryComplete) dispatched
every drained CQE through DispatchUringCqe without checking the
user_data = nullptr sentinel that QueueRunFor already handles. The
sentinel marks two kinds of no-op CQEs:
- Wake-up nops submitted by UringIoHandler::Wake to unblock the
drainer on Dispose.
- SQEs rewritten in-place after io_uring_submit failed (the SQE
leak fix in c6d6892); these are committed to the SQ but carry
no caller context.
If a TryComplete() / TryCompleteFor() call picks up either kind of
nop CQE, DispatchUringCqe would dereference the null context at
context->callback(...) and segfault.
Fix: mirror the nullptr-skip from QueueRunFor in TryCompleteFor.
Return true to count the drain (matching the any-flag semantics).
Verification: 62/62 Tsavorite.test.hlog IDevice + NativeStorageDevice
tests pass. uring .so rebuilt; libaio-only .so is byte-identical
because the patched code path is wrapped in #ifdef FASTER_URING and
not compiled into the libaio-only fallback.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
QueueRunFor used to acquire cq_lock per CQE (peek -> read fields ->
cqe_seen -> release -> dispatch). With a single drainer thread that
serializes lock acquire/release on every completion and forces
submitters to wait through callback latency when they need ring access.
Replaced with the canonical liburing batch-drain idiom:
- acquire cq_lock once
- io_uring_peek_batch_cqe(ring, cqes, 64) to pull up to 64 CQEs
- snapshot (io_res, context) for each
- io_uring_cq_advance(ring, n) to release the slots
- release cq_lock
- dispatch callbacks outside the lock
This is the io_uring equivalent of libaio's io_getevents(n) per syscall.
Snapshot BEFORE cq_advance is mandatory because the kernel may reuse
CQ slots once advanced, leaving the cqe pointers dangling.
The wake-up / failed-submit sentinel (user_data == nullptr) is still
skipped without dispatch, same as before.
Measured impact on Dell P5600 (16 submitter threads, batch 64,
throttle 256):
ct=1 (1 ring, 1 drainer): 339K -> 354K ops/sec (+4%)
ct=4 (4 rings, 4 drainers): 735K -> 737K (saturates, noise)
ct=8 (8 rings, 8 drainers): 750K -> 742K avg (saturates, noise)
The single-drainer gain is modest because the real bottleneck at
ct=1 with 16 submitters is sq_lock contention on the single ring, not
cq_lock contention. The batch-drain is still strictly better:
- dispatches outside the lock so submitters aren't blocked by
user-callback latency,
- matches the idiomatic liburing pattern,
- amortizes the lock acquire/release across up to 64 CQEs per cycle.
For high-throughput workloads, sharding across multiple rings remains
the right scaling lever (ct >= 4 saturates this drive).
Verification: 62/62 Tsavorite.test.hlog IDevice + NativeStorageDevice
tests pass. uring .so rebuilt.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eliminates the sq_lock contention that was capping uring at ~340K IOPS at
the default numCompletionThreads=1. Two changes work together:
1. Per-thread ring affinity in pick_ring (file_linux.h):
Each submitter thread is assigned a ring on its first submit (round-
robin against other threads via an atomic counter) and keeps that
assignment for life. Same-thread submits never contend on sq_lock with
themselves; different threads only contend when they got assigned the
same ring (num_submitter_threads > num_rings). This is the user-space
equivalent of libaio's "io_submit is thread-safe per io_context" —
eliminate shared mutable state across submitters.
2. Hardcoded 4 rings for uring (NativeStorageDevice.cs):
numIoContextsConfig = ioBackend == Uring
? max(kDefaultUringRings=4, numCompletionThreads)
: numCompletionThreads
So uring always has at least 4 rings even at numCompletionThreads=1.
The single drainer covers all 4 rings via the legacy QueueRun compat
scanner (CompletionWorker passes ctxIdx=-1 in that case).
libaio is unchanged: rings == numCompletionThreads (extra rings don't
help; the kernel io_context mutex is already efficient).
Result on Dell P5600 NVMe (16 submitter threads, batch 64, throttle 256):
Before (1 ring, 1 drainer): ~340K
After (4 rings, 1 drainer, default): ~700K (matches libaio ct=1)
After (8 rings, 8 drainers, sharded): ~745K (unchanged, was already saturating)
No new public configuration parameters. numCompletionThreads still
controls drainer count; the ring count is now backend-derived behind the
scenes. The CompletionWorker single-drainer-multi-ring path was added
specifically so the default numCompletionThreads=1 case can saturate
without spawning extra drainer threads.
Also: bumped HardeningSectorSize and the legacy bufferPool / NativeDeviceTest2
sector_size constants from 512 to 4096 to match the strictest device
SectorSize we expect on any modern hardware (4Kn drives where the new
max(logical, physical) probe returns 4096). Tests would otherwise fail
with EINVAL on 4Kn CI runners with 512-aligned buffers.
Verification:
- 64/64 Tsavorite.test.hlog IDevice + NativeStorageDevice tests pass
- Default uring (no flags) hits 626-740K across t=1..64 vs ~340K before
- Sharded ct=4/8 unchanged (still saturates)
- libaio default unchanged
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… SectorSize
NativeDeviceTest1 read 1024 bytes (entryLength) using ReadInto, which:
- rounded the read length up to the device sector size,
- then returned a buffer of that ROUNDED length,
- which the caller compared via SequenceEqual against the original
`entry` byte[] (length 1024).
When SectorSize was 512 (the old constant probe), 1024 rounded to 1024
and the lengths happened to match. With the new max(logical, physical)
probe returning 4096 on 4Kn drives (Windows/Ubuntu CI runners), 1024
rounds to 4096, the returned buffer is 4096 bytes long, and
SequenceEqual fails on length mismatch (regardless of content).
Pre-existing latent bug — the rounding to sector size is correct for
the IO submit, but the caller should only see the bytes it asked for.
Fix: return a buffer of the caller-requested logical `size`, not the
sector-rounded `numBytesToRead`.
Verification: 64/64 Tsavorite.test.hlog NativeDeviceTest + IDevice +
NativeStorageDevice tests pass on Linux (where SectorSize is 4096 on
the CI runner's 4Kn drive).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The --device-completion-threads (KV.benchmark) and --completion-threads (Device.benchmark) help text said "all drainers share the same kernel io_context / io_uring" and "values > 1 are rarely useful past 1 today". Both claims are stale since the sharded-rings work (8cbca9d) and the per-thread ring affinity + 4-default-rings change (298bfd1): - Each drainer is bound 1:1 to its own kernel io_context (libaio) or io_uring ring (uring). - Submitters distribute across rings via per-thread affinity. - For io_uring, throughput scales with completion-threads up to available submitter concurrency (measured: ct=1 ~340K → ct=4 ~735K on Dell P5600 NVMe at the device-benchmark level). - For libaio extra drainers still rarely help past 1 (kernel per-context mutex is efficient). - Note added that uring uses min 4 rings even at ct=1 with the single drainer covering all rings via the legacy QueueRun scanner. Help-text-only change. No code behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ll reads
Root cause: cross-segment read rejection + engine retry loop
==============================================================
The AllocatorBase.GetAndPopulateReadBuffer sector-aligned read window can
extend past the page-end boundary when reading a record near the tail of
a page. When the device's segment size is a multiple of the page size
(e.g. 4MB pages, 1GB segments — the Garnet default), an over-extended
read at the last page of a segment also crosses the device's segment
boundary. NativeStorageDevice's underlying FileSystemSegmentedFile
rejects cross-segment reads with Status::IOError; the engine's
AsyncGetFromDiskCallback interprets a 0-byte read as a short read and
retries the same address — forever. Worker thread spins at 99% CPU,
disk activity drops to zero, benchmark deadlocks.
Reproduced reliably on KV.benchmark:
--device native --device-io-backend libaio \
--log-memory 16m --page-size 4m --segment-size 1g \
-n 10000000 (1.28 GB dataset → crosses 1GB segment boundary)
Smaller datasets (1M = 128MB, fits in 1 segment) work; larger ones
hang. RandomAccess device works on all dataset sizes because its
managed segmented-file wrapper doesn't reject cross-segment reads.
Diagnostic captured the exact symptom: a read at sourceAddress
0x3FFFF600 (1,073,739,776 — 2,560 bytes before the 1GB segment
boundary) with readLength 4608 (sector-aligned record window) extends
to 0x40000C00 — 2,560 bytes into segment 1. Native rejects with
Status::IOError, callback fires with numBytes=0, engine retries.
Fix: clamp the aligned read length so it never crosses page-end.
============================================================
Added in AllocatorBase.GetAndPopulateReadBuffer:
var pageEndInFile = (ulong)(AlignedPageSizeBytes * (GetPage(fromLogicalAddress) + 1));
if (alignedFileOffset + alignedReadLength > pageEndInFile)
alignedReadLength = (uint)(pageEndInFile - alignedFileOffset);
Records never span page boundaries (HandlePageOverflow guarantees), so
the actual record is fully readable within the clamped window —
available_bytes reflects what we actually got from disk, and the engine
continues normally. pageEnd is sector-aligned (PageSizeBits >= sector
size), so the clamped length stays sector-aligned.
Also reverted the uring "min 4 rings even at ct=1" experiment
=============================================================
The earlier "default 4 rings for uring regardless of ct" change was
fundamentally broken: with per-thread submit affinity (pick_ring's
thread_local index), submitters bound to rings 1-3 never get their
completions drained because the single drainer blocks on ring 0 with
a 1-second QueueRun timeout and only briefly polls the other rings
between wake-ups. The result is ~50x throughput degradation on
workloads where submitters land on rings != 0 (KV.benchmark load
phase dropped from 2.5M ops/sec to 54K ops/sec at t=1).
Reverted to the simple rule: rings == numCompletionThreads. For uring
perf scaling, users set numCompletionThreads >= expected submitter
concurrency; each ring is then continuously drained by its dedicated
drainer thread.
Defense-in-depth hardening
==========================
- NativeStorageDevice._callback now catches ALL exceptions from the
user callback (was: try/finally but exception propagated). A managed
exception escaping back into native code across the C ABI boundary
silently terminates the drainer thread; the next submitter then
spins forever in device.Throttle(). Now the exception is logged and
swallowed so the drainer survives.
- NativeStorageDevice.CompletionWorker has the same try/catch around
the whole drain loop as defense-in-depth against unrelated managed
exceptions (P/Invoke marshalling, IntPtr.Zero races with Dispose,
etc.).
- file_linux.cc QueueFile::ScheduleOperation (libaio) and
UringFile::ScheduleOperation (uring) now retry submit-side EAGAIN
indefinitely with bounded backoff (64 sched_yields, then 1ms
nanosleeps) instead of returning Status::IOError after 8 yields.
Surfacing transient EAGAIN as a permanent error creates the same
retry-loop pathology as the cross-segment-read bug above. EAGAIN is
the kernel saying "ring is full, try later"; it's not a real error
and must not be exposed to the engine.
Verification
============
KV.benchmark, 100M keys × 100B, 16MB log (mostly disk-spill),
1 completion thread, 100% reads:
libaio: t=1 135K ops/sec, t=4 400K, t=8 444K, t=16 445K, t=32 404K
uring: t=1 124K ops/sec, t=4 244K, t=8 265K, t=16 278K, t=32 272K
Both backends stable across the full thread × dataset sweep
(previously native+libaio hung on any 10M+ dataset; native+uring hung
on every config).
64/64 Tsavorite.test.hlog IDevice + NativeStorageDevice tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
01c9a77 to
dc97630
Compare
Description of Change
Hardens Tsavorite's
NativeStorageDeviceLinux backend for production use and adds a substantially-refreshed storage benchmark plus a new KV benchmark harness used to validate it.Native device runtime:
QueueIoHandler(libaio) andUringIoHandler(io_uring, built withUSE_URING). Sharded per-completion-thread io_uring rings and a Wake mechanism unblock the completion drainer on Dispose.libnative_device.so(libaio + liburing) andlibnative_device_libaio.so(libaio-only fallback, noliburing.so.2DT_NEEDED).ImportResolvercatchesliburing.so.2: cannot openand transparently falls back to the libaio-only flavor. ExplicitIoBackend.Uringagainst the fallback fails with distro-specific install hints rather than silently downgrading.native_device.dllrebuilt from the current C++ source;ThreadPoolIoHandlergainedinit_errno()/initialized()stubs soNativeDeviceImpl<ThreadPoolIoHandler>instantiates cleanly.Managed
NativeStorageDevicerewrite:Initializeis idempotent; ctor establishes valid defaults equivalent toInitialize(-1).GetFileSize/RemoveSegmentwork pre-IO without forcing native handle creation.Disposewakes the completion drainer via a no-op IO so shutdown doesn't wait on the per-contextQueueRunFortimeout.omitSegmentIdFromFilename(previously native-rejected).ProbeDioAlignmentso the sector-size check is a real ABI / runtime-drift detector instead of a 4K-disk footgun.IDevicecontract changes:Initialize()is optional; constructor defaults are equivalent toInitialize(-1)(unbounded single segment). All managed device implementations updated accordingly.RandomAccessLocalStorageDevicereplaces theMarkHandleAsAsyncreflection hack withRandomAccessoverSafeFileHandle.AsyncPoolrolls backtotalAllocatedifcreator()throws, fixing the dispose-hang class of bugs.Allocator hot path:
AllocatorBase.TryAllocateRetryNowreplaces unbounded CPU spin with boundedkFlushSpinCountyields followed by epoch-suspend +flushEvent.Wait(1ms)+ resume, matching the canonicalShiftBeginAddresspattern. Yields +14% on YCSB load with libaio at 64 threads.Garnet configuration plumbing:
Options.cs,defaults.conf, andGarnetServerOptions.Default device routing on Windows is unchanged:
DeviceType.Nativestill resolves toLocalStorageDevice(managed Windows IOCP) — the native shim on Windows is only exercised by tests that explicitly constructNativeStorageDevice.Benchmarks:
benchmark/Device.benchmarksubstantially rewritten with new throttle, completion-thread, and IO-backend flags. New cookbook README documents how to saturate ~750K NVMe IOPS on a P5600.libs/storage/Tsavorite/cs/benchmark/KV.benchmark— a clean replacement for YCSB.benchmark focused on KV performance: synthetic data only, configurable RUMD percentages, Zipf/uniform key selection, thread-coordinated runs, NUMA pinning, auto-configured threadpool, zero-allocation hot paths.Tests:
DeviceTests.cssplit intoIDevice_*(cross-device contract, parametrized overNative/RandomAccess/ManagedLocal) andNativeStorageDevice_*(NSD-specific). 62 tests pass on Linux; 59 pass + 3 chmod-permission Linux-only skips on Windows.Docker images:
liburing2/liburingalongsidelibaio1so the io_uring backend works out of the box; the libaio-only fallback handles missing liburing transparently.Key Technical Details
Affected types / interfaces:
INativeDevice,NativeDeviceImpl<HandlerT>,QueueIoHandler,UringIoHandler,ThreadPoolIoHandler— native C++ implementation.NativeStorageDevice,RandomAccessLocalStorageDevice,IDevice,StorageDeviceBase,NullDevice,SimulatedFlakyDevice,LocalStorageNamedDeviceFactory,TieredStorageDevice,AsyncPool— managed device layer.AllocatorBase.TryAllocateRetryNow,CompletionEvent.Wait(TimeSpan)— allocator hot path.Devices.CreateLogDevice— routing preserved on Windows.Options.cs,defaults.conf,GarnetServerOptions— Garnet host configuration knobs.Edge Cases
liburing.so.2libnative_device_libaio.soforIoBackend.Default/IoBackend.LibaioIoBackend.Uringwithout liburingDeviceType.NativeLocalStorageDeviceby defaultInitialize()not calledInitialize(-1)(unbounded single segment)Dispose()with blocked completion drainerProbeDioAlignment; cross-check passesIssues Fixed
No linked issue.