[Tsavorite] Add native Linux storage backend, harden NativeStorageDevice, refresh storage benchmarks#1831

badrishc · 2026-05-27T15:41:46Z

Description of Change

Hardens Tsavorite's NativeStorageDevice Linux backend for production use and adds a substantially-refreshed storage benchmark plus a new KV benchmark harness used to validate it.

Native device runtime:

Linux native shim now has two IO backends — QueueIoHandler (libaio) and UringIoHandler (io_uring, built with USE_URING). Sharded per-completion-thread io_uring rings and a Wake mechanism unblock the completion drainer on Dispose.
Ships two Linux native binaries: libnative_device.so (libaio + liburing) and libnative_device_libaio.so (libaio-only fallback, no liburing.so.2 DT_NEEDED).
Managed ImportResolver catches liburing.so.2: cannot open and transparently falls back to the libaio-only flavor. Explicit IoBackend.Uring against the fallback fails with distro-specific install hints rather than silently downgrading.
Windows native_device.dll rebuilt from the current C++ source; ThreadPoolIoHandler gained init_errno() / initialized() stubs so NativeDeviceImpl<ThreadPoolIoHandler> instantiates cleanly.

Managed NativeStorageDevice rewrite:

Lazy native-handle creation: Initialize is idempotent; ctor establishes valid defaults equivalent to Initialize(-1).
GetFileSize / RemoveSegment work pre-IO without forcing native handle creation.
Dispose wakes the completion drainer via a no-op IO so shutdown doesn't wait on the per-context QueueRunFor timeout.
Honors omitSegmentIdFromFilename (previously native-rejected).
Cross-checks segment size and sector size with the native shim before first IO; both probes now go through ProbeDioAlignment so the sector-size check is a real ABI / runtime-drift detector instead of a 4K-disk footgun.

IDevice contract changes:

Initialize() is optional; constructor defaults are equivalent to Initialize(-1) (unbounded single segment). All managed device implementations updated accordingly.
RandomAccessLocalStorageDevice replaces the MarkHandleAsAsync reflection hack with RandomAccess over SafeFileHandle.
AsyncPool rolls back totalAllocated if creator() throws, fixing the dispose-hang class of bugs.

Allocator hot path:

AllocatorBase.TryAllocateRetryNow replaces unbounded CPU spin with bounded kFlushSpinCount yields followed by epoch-suspend + flushEvent.Wait(1ms) + resume, matching the canonical ShiftBeginAddress pattern. Yields +14% on YCSB load with libaio at 64 threads.

Garnet configuration plumbing:

New device-backend / completion-thread knobs in Options.cs, defaults.conf, and GarnetServerOptions.

Default device routing on Windows is unchanged: DeviceType.Native still resolves to LocalStorageDevice (managed Windows IOCP) — the native shim on Windows is only exercised by tests that explicitly construct NativeStorageDevice.

Benchmarks:

benchmark/Device.benchmark substantially rewritten with new throttle, completion-thread, and IO-backend flags. New cookbook README documents how to saturate ~750K NVMe IOPS on a P5600.
New libs/storage/Tsavorite/cs/benchmark/KV.benchmark — a clean replacement for YCSB.benchmark focused on KV performance: synthetic data only, configurable RUMD percentages, Zipf/uniform key selection, thread-coordinated runs, NUMA pinning, auto-configured threadpool, zero-allocation hot paths.

Tests:

DeviceTests.cs split into IDevice_* (cross-device contract, parametrized over Native / RandomAccess / ManagedLocal) and NativeStorageDevice_* (NSD-specific). 62 tests pass on Linux; 59 pass + 3 chmod-permission Linux-only skips on Windows.

Docker images:

All five Dockerfiles re-install liburing2 / liburing alongside libaio1 so the io_uring backend works out of the box; the libaio-only fallback handles missing liburing transparently.

Key Technical Details

Affected types / interfaces:

INativeDevice, NativeDeviceImpl<HandlerT>, QueueIoHandler, UringIoHandler, ThreadPoolIoHandler — native C++ implementation.
NativeStorageDevice, RandomAccessLocalStorageDevice, IDevice, StorageDeviceBase, NullDevice, SimulatedFlakyDevice, LocalStorageNamedDeviceFactory, TieredStorageDevice, AsyncPool — managed device layer.
AllocatorBase.TryAllocateRetryNow, CompletionEvent.Wait(TimeSpan) — allocator hot path.
Devices.CreateLogDevice — routing preserved on Windows.
Options.cs, defaults.conf, GarnetServerOptions — Garnet host configuration knobs.

Edge Cases

Scenario	Behavior
Linux host lacks `liburing.so.2`	Loader transparently falls back to `libnative_device_libaio.so` for `IoBackend.Default` / `IoBackend.Libaio`
Explicit `IoBackend.Uring` without liburing	Construction fails with distro-specific install guidance instead of silently downgrading
Windows `DeviceType.Native`	Still routes to managed `LocalStorageDevice` by default
`Initialize()` not called	Device uses constructor defaults equivalent to `Initialize(-1)` (unbounded single segment)
`Dispose()` with blocked completion drainer	Wake-up IO unblocks the drainer before destroying the native handle
4K-native-DIO disk	Native and managed sides probe identical alignment via shared `ProbeDioAlignment`; cross-check passes
io_uring submit fails after retries	Prepared SQE is rewritten in place to nop + nullptr sentinel; harmlessly drained on next successful submit

Issues Fixed

No linked issue.

Copilot

Pull request overview

This PR hardens Tsavorite’s native storage path, adds Linux native backend selection for libaio/io_uring, updates Garnet configuration plumbing, and refreshes storage/KV benchmarking guidance and harness behavior.

Changes:

Adds/rewrites native device backend abstractions, runtime segment sizing, alignment probing, error reporting, io_uring support, and managed NativeStorageDevice/IDevice contract updates.
Plumbs native backend and completion-thread options through Tsavorite/Garnet device factories and host configuration.
Expands device tests and benchmark documentation/tools for native, RandomAccess, and FileStream storage paths.

Reviewed changes

Copilot reviewed 42 out of 46 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`libs/storage/Tsavorite/cs/test/test.hlog/DeviceTests.cs`	Adds broad device contract, native lifecycle, recovery, alignment, and permission tests.
`libs/storage/Tsavorite/cs/test/SimulatedFlakyDevice.cs`	Forwards `Initialize` to wrapped devices.
`libs/storage/Tsavorite/cs/src/core/Utilities/CompletionEvent.cs`	Adds timed wait helper.
`libs/storage/Tsavorite/cs/src/core/Tsavorite.core.csproj`	Copies libaio-only native fallback artifact.
`libs/storage/Tsavorite/cs/src/core/Index/CheckpointManagement/LocalStorageNamedDeviceFactoryCreator.cs`	Adds backend/completion-thread factory parameters.
`libs/storage/Tsavorite/cs/src/core/Index/CheckpointManagement/LocalStorageNamedDeviceFactory.cs`	Passes backend/completion-thread settings to created devices.
`libs/storage/Tsavorite/cs/src/core/Device/TieredStorageDevice.cs`	Formatting-only cleanup.
`libs/storage/Tsavorite/cs/src/core/Device/StorageDeviceBase.cs`	Documents constructor defaults and optional initialization.
`libs/storage/Tsavorite/cs/src/core/Device/RandomAccessLocalStorageDevice.cs`	Switches to `SafeFileHandle`/`RandomAccess` and adds Linux direct-I/O open path.
`libs/storage/Tsavorite/cs/src/core/Device/NullDevice.cs`	Formatting-only method expansion.
`libs/storage/Tsavorite/cs/src/core/Device/LinuxFileExtensions.cs`	Adds Linux `open(2)` helpers for O_DIRECT.
`libs/storage/Tsavorite/cs/src/core/Device/IDevice.cs`	Updates device initialization contract documentation.
`libs/storage/Tsavorite/cs/src/core/Device/Devices.cs`	Adds native backend/completion-thread parameters.
`libs/storage/Tsavorite/cs/src/core/Device/AsyncPool.cs`	Rolls back allocation count when item creation fails.
`libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs`	Adds bounded allocation retry backoff and modifies aligned read sizing.
`libs/storage/Tsavorite/cs/benchmark/KV.benchmark/README.md`	Documents native backend benchmark scenarios.
`libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs`	Adds io_uring backend parsing.
`libs/storage/Tsavorite/cc/src/device/utility.h`	Adds trailing-zero helper.
`libs/storage/Tsavorite/cc/src/device/native_device.h`	Adds native device interface, runtime segment sizing, probing, and backend types.
`libs/storage/Tsavorite/cc/src/device/native_device_wrapper.cc`	Extends C ABI for backend selection and diagnostics.
`libs/storage/Tsavorite/cc/src/device/native_device_error.h`	Adds thread-local native error channel.
`libs/storage/Tsavorite/cc/src/device/file_windows.h`	Adds cross-backend stubs for Windows handler compatibility.
`libs/storage/Tsavorite/cc/src/device/file_system_disk.h`	Converts segmented file logic to runtime segment size.
`libs/storage/Tsavorite/cc/src/device/file_linux.h`	Adds libaio wake support and sharded io_uring handler declarations.
`libs/storage/Tsavorite/cc/src/device/file_linux.cc`	Implements native Linux open/error handling, wake paths, and io_uring draining/submission changes.
`libs/storage/Tsavorite/cc/src/CMakeLists.txt`	Links liburing when enabled and includes new native error header.
`libs/storage/Tsavorite/cc/README.md`	Refreshes native-device build/runtime documentation.
`libs/storage/Tsavorite/cc/CMakeLists.txt`	Raises CMake minimum and enables io_uring by default.
`libs/server/Servers/GarnetServerOptions.cs`	Adds Garnet server options for native backend and completion threads.
`libs/host/defaults.conf`	Adds default config entries for native backend settings.
`libs/host/Configuration/Options.cs`	Adds CLI/config parsing for native backend settings.
`Dockerfile.ubuntu`	Installs liburing runtime dependency.
`Dockerfile.chiseled`	Stages liburing for chiseled runtime image.
`Dockerfile.azurelinux`	Installs liburing runtime dependency.
`Dockerfile.alpine`	Installs liburing package.
`Dockerfile`	Installs liburing runtime dependency.
`benchmark/README.md`	Adds Device.benchmark overview.
`benchmark/Device.benchmark/README.md`	Adds full Device.benchmark usage guide.
`benchmark/Device.benchmark/Program.cs`	Adds backend selection and success/error throughput accounting.
`benchmark/Device.benchmark/Options.cs`	Adds benchmark file-size/backend option updates.
`benchmark/Device.benchmark/BenchWorker.cs`	Aggregates errors and counts successful completions.

badrishc · 2026-05-28T00:13:24Z

Device.benchmark

Drive: Dell Ent NVMe P5600 MU U.2 3.2 TB (logical=physical=512).
Kernel: Linux 6.8.0-117-generic (Ubuntu).
CPU: Intel Xeon Platinum 8380, NUMA0 pinned via numactl.
Workload: 4 KB random reads, file-size 16 GiB, segment 1 GiB, O_DIRECT, throttle 256, 10 s/run × 3 runs/config.
Build: dotnet build benchmark/Device.benchmark/Device.benchmark.csproj -c Release -f net10.0.
Completion threads: 1 for libaio, 4 for uring.

Command (per row, varying -t):

numactl --cpunodebind=0 --membind=0 \
  dotnet bin/Release/net10.0/Device.benchmark.dll \
    --file-name /path/to/test.dat \
    --file-size 17179869184 --segment-size 1073741824 \
    --device-type Native --io-backend {libaio|uring} \
    --completion-threads {1|4} \
    -t {1,2,4,8,16,32} -b 64 \
    --runtime 10 --throttle-limit 256

Throughput (ops/sec, libaio with ct=1, uring with ct=4, batch = 64, 3 runs each):

submitter threads	libaio (run 1 / 2 / 3)	uring (run 1 / 2 / 3)
1	250K / 256K / 249K	241K / 251K / 242K
2	422K / 416K / 421K	431K / 429K / 416K
4	730K / 730K / 742K	712K / 718K / 720K
8	746K / 742K / 742K	747K / 752K / 753K
16	746K / 746K / 740K	748K / 750K / 749K
32	743K / 744K / 744K	747K / 743K / 751K

Saturation reference (fio, same drive): ~772K random 4 KB reads.

Both backends saturate at ≥4 submitter threads (~720–750K, within ~3% of the fio ceiling). At t=1 the workload is latency-bound on a single submitter; throughput scales linearly with submitter concurrency from there.

All 31 non-benchmark files changed on optimize-v2-io (vs its branch base d3677cf) ported here. Backup tag: optimize-v2-io-prerebase-backup @ 3f41f2b. Scope: device/IO/native-backend ONLY. Includes: Tsavorite C++ native device: - io_uring backend + pluggable C ABI (file_linux.cc/h) - error model split (native_device_error.h) - file_system_disk + native_device.h updates - CMakeLists + README Tsavorite C# device: - NativeStorageDevice: IoBackend enum (Default, Libaio, Uring), completion threads, production-readiness pass - LinuxFileExtensions.cs: P/Invoke open() for true O_DIRECT - ManagedLocalStorageDevice + RandomAccessLocalStorageDevice: O_DIRECT wiring on Linux - Devices.cs: router updates for new device APIs Tsavorite allocator + utilities: - AllocatorBase: bounded backoff in TryAllocateRetryNow - CompletionEvent: Wait(TimeSpan) overload Tsavorite checkpoint management: - LocalStorageNamedDeviceFactory + Creator surface ioBackend + completionThreads parameters Tests: - DeviceTests.cs updated for new device APIs Garnet host: - --device-io-backend, --device-completion-threads flags - defaults.conf updated, GarnetServerOptions wiring Dockerfiles (all 5): install liburing alongside libaio. Note: YCSB.benchmark and KV.benchmark are not modified by this commit; KV.benchmark is the supported benchmark on this branch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…iburing runtime dep - Options: --device-io-backend now accepts 'uring' (aliases: io_uring, iouring) in addition to libaio/default; help text + validation error list updated to match. - KV.benchmark README: device-backend table now describes the libaio vs uring split, with a runtime-install snippet for liburing across Debian/Ubuntu, Fedora/RHEL/AzureLinux, and Alpine, plus link to the full Tsavorite Native Device docs. New 'native + libaio' and 'native + uring' rows added to the cookbook for constrained-log large-dataset workloads. - Tsavorite Native Device README: new top-level 'Runtime dependencies (end users)' section listing the apt/dnf/apk install lines and how to fall back to the no-liburing variant. - KV.benchmark Validate: fix two bugs that surface when load and run use different thread counts and when the log spills to disk: 1) writerThread reconstruction now uses ResolvedLoadThreads (not Options.Threads which is the RUN count), so --load-threads N with --threads M != N validates correctly. 2) Reads of records below HeadAddress return Status.IsPending; the previous code counted these as misses. Validate now issues reads in batches of 256 and drains via CompletePendingWithOutputs, verifying each completed output against the per-thread pattern. Verified end-to-end with both backends: 4.6M × 100B × 8T × log=256m (~580MB dataset > 256MB log → forces disk spill) × 50R/50U × --validate: native + libaio → [validate] OK, run = 1.27 M ops/s native + uring → [validate] OK, run = 1.02 M ops/s 4.6M × 100B × 8T × log auto (fits) × 95R/5U × --validate: native + libaio → [validate] OK, run = 16.10 M ops/s native + uring → [validate] OK, run = 16.33 M ops/s Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…s on SafeFileHandle Background: MarkHandleAsAsync used reflection to flip SafeFileHandle.IsAsync's non-public setter so a P/Invoke-opened O_DIRECT FD could be wrapped in 'new FileStream(handle, isAsync: true)' without throwing 'Handle does not support asynchronous operations'. The flag is non-public in .NET 8/10, the hack was fragile across future runtime versions, and on Linux IsAsync is a contract gate (no real overlapped I/O exists for files), so the lie bought us nothing beyond letting the FileStream constructor accept the handle. RandomAccessLocalStorageDevice — refactor: - StorageAccessContext.handle is now SafeFileHandle (was FileStream). - CreateRead/WriteHandle: * Linux + O_DIRECT capable: LinuxFileExtensions.OpenDirect -> raw SafeFileHandle, no FileStream wrap. Page-cache bypass via the O_DIRECT flag at open(2), exactly as before. * Otherwise (Windows; or Linux when filesystem rejects O_DIRECT): File.OpenHandle(path, ..., FileOptions.Asynchronous | cast FILE_FLAG_NO_BUFFERING). On Windows this gives the runtime IOCP-bound OVERLAPPED I/O; on Linux it's page-cached. - All I/O goes through RandomAccess.{Read,Write}Async(safeHandle, memory, offset). On Windows: true kernel async via IOCP. On Linux: pread/pwrite dispatched to ThreadPool (same as before). - GetFileSize uses RandomAccess.GetLength(handle). - SetFileSize uses RandomAccess.SetLength(handle, size). LinuxFileExtensions: - MarkHandleAsAsync and the IsAsyncProperty reflection are deleted entirely (no remaining callers). - System.Reflection using removed. ManagedLocalStorageDevice: - Reverted to origin/main. This device is designed to stay within FileStream APIs; the O_DIRECT branch we added doesn't belong here. Verified: - Tsavorite test.hlog DeviceTests: 36/36 passed. - KV.benchmark --device randomaccess --log-memory 256m --preallocate-log --rumd 50,50,0,0 --validate: PASS, 357 K ops/sec, iostat shows 100-177 K real disk r/s and 53-90% NVMe util → O_DIRECT page-cache bypass confirmed. - KV.benchmark --device randomaccess (log fits) --validate: PASS, 15.7 M ops/sec. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…+ ManagedLocal) The Phase-7 hardening suite added in this branch was NativeStorageDevice- specific. Add parametrized variants of the four tests that are pure IDevice contract checks (not native-specific lifecycle/API), so they exercise all three local-storage device implementations: - Hardening_AllDevices_RoundTrip_BasicReadWrite - Hardening_AllDevices_RoundTrip_AcrossSegmentBoundary - Hardening_AllDevices_Parallel_32ConcurrentWrites - Hardening_AllDevices_Parallel_BurstyTraffic Each is parametrized by a new DeviceKind enum (Native, RandomAccess, ManagedLocal). Native is gated on OperatingSystem.IsLinux() (the C++ shim links against libaio/liburing); the other two run on both Linux and Windows. A shared CreateDeviceForTest helper takes care of the per-kind ctor + Initialize() dance so the test body stays uniform. Result: 38/38 hardening tests pass on Linux (12 new cross-device + 26 native-only). Native-specific tests retained as-is because they test API that doesn't exist on the other devices: - Lifecycle (DisposeBeforeInitialize, InitializeTwice, etc.) — NativeStorageDevice defers Initialize from the ctor; the other devices initialize in their ctor. - Segment-size validation (NonPowerOfTwoSegmentSize_Throws, etc.) — Initialize() is the only callsite that validates. - Recovery_*_SegmentSize_* — the native device's open() path is the only place that records and re-validates per-segment-size metadata. - SectorSize stability across opens — not all devices expose this. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…evice_ buckets; fix AsyncPool creator-throw hang Test refactor: - IDevice_*: 8 contract tests parametrized across Native, RandomAccess, ManagedLocal (round-trip basic, round-trip cross-segment, round-trip various segment sizes, 32 concurrent writes, 64 concurrent reads, mixed reads+writes, bursty traffic, stress burst of 100 writes, permission-denied callback contract). 33 cases total. - NativeStorageDevice_*: 16 native-only tests for behaviors managed devices don't have (deferred Initialize signature, recovery segment-size mismatch detection, sector-size discovery, sync-throw unaligned IO guard). AsyncPool fix: GetOrAdd reserved a slot in totalAllocated before calling creator(). If creator() threw (e.g. open() returned EACCES, ENOSPC), the slot was never released, so Dispose() would loop forever waiting for totalAllocated to drain to zero. This manifested as a process hang when a device pool's first open() failed. Rollback the reservation on exception so the failure propagates cleanly and the pool can still be disposed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…es, -1 = unbounded single segment Before this change, IDevice.Initialize had the same signature for every device but very different semantics: * StorageDeviceBase (RandomAccess, ManagedLocal, LocalStorage, NullDevice, LocalMemoryDevice): the ctor pre-set segmentSize = -1 / bits = 64 / mask = ~0, so calling an IO entry point without Initialize() silently ran in unbounded single-segment mode. * NativeStorageDevice: Initialize was MANDATORY (the C++ shim needs the segment size at create time for libaio/io_uring geometry), IO entry points threw if invoked first, and segmentSize = -1 was rejected. This commit unifies the contract: every IDevice must call Initialize() exactly once before any IO, and segmentSize = -1 selects unbounded single-segment mode on every device. Implementation: * StorageDeviceBase - Added `initialized` flag (volatile) and `EnsureInitialized()` helper that throws InvalidOperationException with a clear message naming the device by FileName. - Ctor leaves `initialized = false` but keeps the safe fallback defaults (-1 / 64 / ~0) so any cold maintenance path that touches segmentSizeBits before the guard can't compute outright nonsense. - EnsureInitialized() called from the base address-based ReadAsync / WriteAsync overloads and TruncateUntilAddress / TruncateUntilAddressAsync. - Initialize sets `initialized = true` at the end. * NativeStorageDevice - Accepts segmentSize = -1: translates to 1UL << 63 for the native shim so the C++ FileSystemSegmentedFile's shift = log2(segment_size) math collapses every non-negative upper-layer address into segment 0 (parity with the managed-side bits = 64 / mask = ~0). Single growing file on disk. - Tracks the value passed to native in nativeSegmentSizeBytes (replaces the diagnostic-only configuredSegmentSizeBytes long field, which couldn't hold 1<<63 without overflow). - ABI readback (NativeDevice_GetSegmentSize) compared against the value we sent to native, not the user-facing -1. - Always rejects omitSegmentIdFromFilename — the C++ shim has no omit-suffix code path, every segment is written as <base>.<segmentId>. Better to fail fast than silently produce wrong file names. * Concrete IO entry points (ReadAsync / WriteAsync / RemoveSegment / RemoveSegmentAsync) of NullDevice, LocalMemoryDevice, ManagedLocalStorageDevice, RandomAccessLocalStorageDevice, LocalStorageDevice, AzureStorageDevice, ShardedStorageDevice, and TieredStorageDevice now call EnsureInitialized() before doing work. Caller fix-ups: * LocalStorageNamedDeviceFactory.Get now calls device.Initialize(-1L) before returning. Commit / checkpoint metadata is single growing-file usage (segment 0 only, .0 suffix), so unbounded mode is the right default and unblocks every DeviceLogCommitCheckpointManager caller from needing to remember to initialize. * LocalStorageNamedDeviceFactory.ListContents skips dotfile entries — defensive against a pre-existing race in LinuxFileExtensions.IsDirectIOSupported where a .tsavorite-odirect-probe-* temp file can leak in the commit dir if File.Delete races with File.GetFiles. Without this filter, leaked probe files surface as Int64.Parse("") failures in DefaultCheckpointNamingScheme.CommitNumber. * SimulatedFlakyDevice.Initialize now propagates to the wrapped device. * ComponentRecoveryTests Setup_* helpers call Initialize(-1) on devices they construct directly (bypass the Tsavorite allocator path which normally Initializes). Tests (DeviceTests.cs): * IDevice_ReadAsyncBeforeInitialize_Throws(kind) × 3 — new contract test. * IDevice_WriteAsyncBeforeInitialize_Throws(kind) × 3 — same. * IDevice_Initialize_SegmentSizeMinusOne_UnboundedSingleSegment(kind) × 3 — write at offset 1 MiB (would be in segment-N for any positive size) and read back, confirming -1 routes through segment 0 on all 3 kinds. * NativeStorageDevice_Initialize_OmitSegmentIdFromFilename_Throws — new native-only test for the omit rejection in both -1 and explicit-size modes. * Removed NativeStorageDevice_{Read,Write}AsyncBeforeInitialize_Throws (now subsumed by the IDevice_ variants). Docs: * IDevice.Initialize docstring rewritten to spell out the new contract and the -1 semantics. NativeStorageDevice.Initialize remarks updated. Verified on Linux net10.0 Release: * 599 hlog tests (491 passed + 108 skipped) * 305 recovery tests * 144 + 155 + 127 + 346 = 772 other Tsavorite + Garnet RespTests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…probe (no race) Two related fixes on top of the unified Initialize contract: 1) NativeStorageDevice now supports omitSegmentIdFromFilename ───────────────────────────────────────────────────────── Previously Native rejected the omit flag because the C++ shim hard-coded the '.<segmentId>' suffix in three places in file_system_disk.h. This made the IDevice contract asymmetric (managed devices honored omit, Native didn't). Fix by threading the bool through the entire C++/C ABI: C++ (libs/storage/Tsavorite/cc/src/device/): * FileSystemSegmentBundle: new bool omit_segment_id_; both ctors accept it and use a new segment_path(idx) helper that returns just filename_ when set, otherwise filename_ + '.' + std::to_string(idx). Used at all three locations that previously hard-coded the suffix. * FileSystemSegmentedFile: new bool omit_segment_id_ (const) wired through ctor and propagated to bundles allocated by OpenSegment. * NativeDeviceImpl: new bool omit_segment_id constructor param; recorded as omit_segment_id_ member. ValidateRecoveredSegments short-circuits in omit mode (single bare-named file, segment-size mismatch check is meaningless when there's no .<id> suffix to scan for). * native_device_wrapper.cc / NativeDevice_CreateWithBackend: new trailing 'bool omit_segment_id' parameter. ABI BUMP — managed wrapper updated to match; Linux .so rebuilt and committed at libs/storage/Tsavorite/cs/src/core/Device/runtimes/linux-x64/native/ libnative_device.so. **Windows DLL must be rebuilt by user** with cmake -G 'Visual Studio 17 2022' -A x64 -T v143,spectre=true. C# (libs/storage/Tsavorite/cs/src/core/Device/): * NativeStorageDevice P/Invoke signature updated. * NativeStorageDevice.Initialize removes the 'always rejects omit' guard. It now accepts omit:true together with segmentSize = -1 and forwards to native; rejects omit:true together with a positive segmentSize with a clear error message (multiple segments would collapse onto the same on-disk path and clobber each other). Tests (DeviceTests.cs): * IDevice_Initialize_OmitSegmentIdFromFilename_BareFileName(kind) × 3: writes via Initialize(-1, omit:true) and asserts the on-disk file is the bare basename (no .0 suffix). Replaces the native-only 'throws' test from the previous commit. * IDevice_Initialize_OmitSegmentIdFromFilename_WithoutMinusOne_Throws(kind) × 3: enforces the no-positive-size-with-omit invariant on every kind. 2) IsDirectIOSupported uses O_TMPFILE (race-free probe) ───────────────────────────────────────────────────── The previous probe in libs/storage/Tsavorite/cs/src/core/Device/ LinuxFileExtensions.cs created a hidden '.tsavorite-odirect-probe-<pid>- <guid>' file in the device's directory, then File.Delete'd it in a silent-catch finally. Multiple concurrent commits (one device per Get()) ran probes simultaneously; concurrent ListContents calls from CommitRecordBoundedGrowthTest would observe the probe file during its brief lifetime, and DefaultCheckpointNamingScheme.CommitNumber would then throw FormatException on long.Parse(''). 18/20 baseline failure rate. Switching from create+unlink to open(directory, O_TMPFILE | O_RDWR | O_DIRECT) tells the kernel to allocate an anonymous inode in the directory's filesystem with NO directory entry. The probe inode is invisible to readdir/getdents regardless of timing; concurrent ListContents cannot observe it. Freed on close. Linux >= 3.11 + ext4/xfs/tmpfs/btrfs all support it. If O_TMPFILE itself fails (EOPNOTSUPP on some filesystem) we conservatively report 'no O_DIRECT' so the device falls back to the page-cache path — no named-file fallback because that's the bug we're fixing. Reverts the dotfile filter in LocalStorageNamedDeviceFactory.ListContents added by the previous commit; the underlying race is now eliminated at the kernel level so the workaround is unnecessary. 20/20 LogFastCommitTests runs pass after the change (was 2/20 on baseline, 3/20 on the previous unlink-after-open attempt which still raced). Verified on Linux net10.0 Release: * 612 hlog tests (504 passed + 108 skipped) * 305 recovery tests * 62 device tests (IDevice contract + NativeStorageDevice-specific) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Captures the 100M-key disk-bound thread-scale experiment from the optimize-device branch sweep so it can be reproduced verbatim in the future: - 100M × 100B records (12.8 GB on disk) - 16 MB log so ~0.125 % of dataset is in memory and almost every read is a 4 KB random disk fetch - 8 load threads, run-threads sweep 1,2,4,8,16,32 at 15s each - One row per backend (RandomAccess / native+libaio / native+uring) Added a short note after the table explaining what to compare against (the disk's fio ceiling at 4K-aligned QD=64-per-job), the expected ~2 min wall-clock per device, and the observed per-backend plateau characteristics so the next operator knows what 'good' looks like. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Device.benchmark fixes (previously reported throughput could be 2x inflated): - Throughput counter now tallies successful completions only. Before, every ReadAsync call was counted as success even when the kernel returned EAGAIN (Status::IOError=4 — flooded libaio io_context ring). Under --throttle-limit 0 with high QD, ~40% of "ops" were errored requests. - Per-error-code histogram printed at end of run; no per-error Console.WriteLine (was emitting millions of serial writes/run, both falsifying numbers and slowing the real path). - DEBUG data validation skipped on errored ops (was reading garbage from the destination buffer on EAGAIN paths and reporting spurious "Data mismatch"). - --throttle-limit help text documents the libaio kernel ring trap (128 slots wide; high QD + no throttle floods it) and recommends --throttle-limit 128 (also the io_uring SQ depth this build uses). - --io-backend flag added (libaio / uring / default) so the existing Linux Native path can be exercised against either backend. Unknown values are rejected with an actionable message at startup instead of silently falling back to default. - --completion-threads is now wired through to the Linux Native ctor (was hardcoded to 1). - --file-size widened to long (was int; --file-size > 2GB threw a parse error). KV.benchmark + Devices.cs: clarified XML/help text for the existing --device-completion-threads / numCompletionThreads parameter to describe the current behavior (multiple drainer threads share one kernel io_context / io_uring per device). No behavior change for KV.benchmark or core Tsavorite Devices.cs API beyond the Device.benchmark surface and clarified docs.

Adds N independent kernel io_contexts (libaio) / io_urings (uring) per NativeStorageDevice, with completion threads bound 1:1 to contexts. libaio internally always uses 1 context (sharding empirically gave nothing — kernel mutex efficient at all tested loads), but the sharded ABI surface is kept across both backends so the templated NativeDeviceImpl<HandlerT> doesn't need a fork. C ABI changes (libnative_device.so): - NativeDevice_CreateWithBackend signature bumped: trailing int32 num_io_contexts. - New exports: NativeDevice_QueueRunFor(device, ctx_idx, timeout_secs), NativeDevice_NumIoContexts(device). - Legacy NativeDevice_QueueRun kept; under uring sharding it scans all rings (back-compat for any single-thread drainer). C# (NativeStorageDevice): - Synchronous ABI probe at Initialize() that converts EntryPointNotFoundException into a clear TsavoriteException listing the missing exports and how to rebuild — guards against a stale .so silently hanging Dispose's drain loop. - Probe is intentionally gated by the QueueRun branch so it runs only on backends that actually use the new symbols at runtime: Linux Native (libaio / uring) where QueueRun returns >= 0, NOT on Windows IOCP where the ThreadPoolIoHandler returns -1 by design. This means a stale Windows DLL keeps working unchanged because it never calls the sharded exports. cdecl/x64 ABI silently tolerates the new trailing num_io_contexts arg on NativeDevice_CreateWithBackend. - Completion threads bound 1:1 via QueueRunFor(ctxIdx) (no closure capture bug — ctxIdx is captured per-iteration into a local). - numCompletionThreads is the user-facing knob; native side decides how many contexts to actually create (libaio: always 1; uring: honours the request). Empirical justification (Device.benchmark, NVMe, 4K random reads, batch=4096, throt=512, fio ceiling 749K, NUMA0-pinned): libaio CT=1 (always 1 ctx + 1 drainer): 755K ops/sec (t=32) uring CT=1 (1 ring + 1 drainer): 357K ops/sec ← SpinLock-bound uring CT=1 ring + N drainers (regresses): 357K → 274K ← cq_lock contention uring CT=4 (4 rings + 4 drainers): 745K ops/sec uring CT=8 (8 rings + 8 drainers): 758K ops/sec ← hardware ceiling Both backends now reach the hardware NVMe ceiling. uring requires sharding (the user-space SpinLock around io_uring_get_sqe + prep + submit is the real cap). libaio doesn't need sharding (kernel io_context mutex already efficient at all tested loads). Files: - file_linux.h : UringIoHandler sharded (vector<io_uring*>, per-ring sq_lock + cq_lock, atomic round-robin pick_ring). QueueIoHandler unchanged on the data plane (single io_context_t) but exposes the same num_contexts()/TryCompleteFor/QueueRunFor surface as inline stubs for ABI symmetry. - file_linux.cc : new UringIoHandler impls; QueueIoHandler unchanged. - file_windows.h: stub overloads (num_contexts()=1, QueueRunFor=-1, 2-arg ctor) so the templated NativeDeviceImpl compiles unchanged. - native_device.h, native_device_wrapper.cc: ABI plumbing as above. - NativeStorageDevice.cs: ABI probe + per-context drain workers. - runtimes/linux-x64/native/libnative_device.so: rebuilt with sharding.

…VMe IOPS Captures the verified copy-paste recipe for both Linux Native backends (libaio and io_uring) to hit the hardware ceiling on a Dell P5600-class NVMe, alongside a flag reference, output-schema explanation, and troubleshooting table. Headline recipes verified end-to-end on the reference setup: libaio --completion-threads 1 --threads 16 --throttle-limit 512 → 743K ops/sec uring --completion-threads 8 --threads 16 --throttle-limit 512 → 738K ops/sec (Both within 2 % of the table values in the README; zero kernel-side errors.) Key facts documented: - libaio always uses one io_context in this build regardless of --completion-threads (sharding empirically gave nothing; the hint is ignored). Pass 1 explicitly so scripts are self-describing. - io_uring needs sharded rings (CT >= 4) to escape the per-ring user-space SpinLock cap around io_uring_get_sqe + prep + submit; CT=8 is the safe peak. - --throttle-limit must be set to at least the per-ring/per-context depth (128 in this build for both backends). --throttle-limit 0 floods the kernel ring and the benchmark correctly surfaces Status::IOError=4 in the per-code histogram rather than tallying errored ops as throughput. - --file-size must be a multiple of 1024 × --sector-size (fill phase uses a 1024-sector temp buffer). Plus a section comparing Device.benchmark vs KV.benchmark to direct readers to the right tool: Device.benchmark for IO-layer ceiling validation (saturates NVMe), KV.benchmark for full-path throughput (currently caps ~30 % below the IO ceiling on the upper-layer pending-read path — see KV.benchmark README for that side of the story). Also adds a one-paragraph pointer in benchmark/README.md so the new README is discoverable from the top-level benchmarks listing.

…full ScheduleOperation in both QueueFile (libaio) and UringFile (io_uring) now retries the kernel-side submission on the transient back-pressure signal (libaio: `io_submit == 0`; uring: `io_uring_get_sqe == nullptr`) up to kMaxSubmitRetries = 8 attempts, each separated by a sched_yield(). Permanent errors (libaio io_submit < 0, uring io_uring_submit < 0) are NEVER retried — they surface immediately as Status::IOError. Motivation: the upper-layer throttle gate in AllocatorBase.AsyncGetFromDisk is a racy test-then-increment (Throttle() reads numPending non-atomically, then ReadAsync does Interlocked.Increment). With N concurrent submitters all passing the gate at numPending == ThrottleLimit, in-flight can spike to ThrottleLimit + N momentarily, exceeding the 128-slot per-context / per-ring kernel ring depth when N > 8 (which is normal for Garnet under heavy disk-bound load). Pre-fix, the kernel rejects the overshoot submissions with EAGAIN, which Tsavorite handled by re-routing through the full AllocatorBase pending-read retry loop (correct but expensive: a full round-trip per IOError). Post-fix, the burst is absorbed locally by a handful of sched_yields and never surfaces to the upper layer. Why sched_yield + bounded retries is the right shape: on a 750K-IOPS NVMe the kernel ring drains a slot every ~1.3 µs and sched_yield is typically 1-10 µs on Linux, so 8 retries (worst-case ~40-80 µs window) is more than enough to absorb the typical 24-slot overshoot from a 32-thread burst. For genuine sustained overload (application submission rate exceeds device IOPS for seconds), the retries exhaust and Status::IOError surfaces — which is the correct signal for the caller to apply back-pressure. Implementation notes: - libaio: simple loop around io_submit. No lock to release; io_submit is kernel-thread-safe per io_context, so concurrent submitters serialise inside the kernel. - uring: must release sq_lock around sched_yield. Holding a SpinLock across a syscall would stall every other submitter on the same ring. Only the get_sqe == nullptr path is retried; if get_sqe succeeded we've already "consumed" an SQE slot in the user-side bookkeeping and re-issuing via get_sqe+prep on retry would corrupt the ring (we'd hold two SQEs for one logical op). For SQPOLL-disabled rings (our setup) io_uring_submit returns 1 in steady state — a non-1 there is an unrecoverable kernel-side error and surfaces immediately. Verified end-to-end (Device.benchmark, NVMe, 4K random reads, batch=4096, NUMA0-pinned, --throttle-limit 120 to match production NativeStorageDevice default): libaio CT=1: t=8 / 16 / 32 → 626K / 631K / 627K ok/sec, 0 err uring CT=8: t=8 / 16 / 32 → 622K / 625K / 626K ok/sec, 0 err At extreme intentional-overload settings (--throttle-limit 4096, t=32) errors still appear — confirming the retry budget correctly exhausts when the application is genuinely outpacing the device: libaio CT=1, t=32, throt=4096: 754K ok, 14.7M code4 err (~5% of submits) uring CT=8, t=32, throt=4096: 745K ok, 0 err (uring still produces 0 err under same overload because 8 rings × 128 = 1024 SQ slots is large enough that even gross over-submission fits within the retry budget per ring.)

BLOCKER fixes: - NativeStorageDevice.Dispose UAF race. Previously NativeDevice_Destroy(nativeDevice) ran before nativeDevice was nulled, so a concurrent guard-bypassed P/Invoke could observe a non-zero handle that points to freed memory. Fix: Interlocked.Exchange atomically captures-and-nulls the handle; destroy runs on the captured pointer. EnsureReadyOrSilent now checks disposedFlag first. HIGH fixes: - UringFile::ScheduleOperation SQE leak on transient io_uring_submit failure. After a successful get_sqe + prep, the SQE is committed to the user-side SQ ring; a -EAGAIN/-EBUSY return from io_uring_submit left the slot permanently occupied with no kernel iocb, eventually starving get_sqe forever. Fix: retry io_uring_submit (without re-preparing) up to kMaxSubmitRetries on transient negatives, with sched_yield (and sq_lock released) between attempts. - UringIoHandler::Init partial-init leak. If new SpinLock() threw after io_uring_queue_init succeeded for ring i, the already-initialized ring leaked (the class dtor doesn't run on partial construction). Fix: use std::unique_ptr RAII holders during construction; release into the member vectors only after all allocations succeed. POLISH fixes: - NativeStorageDevice.Initialize tail-throw cleanup. If base.Initialize threw after the native device and completion threads were created, both leaked. Now wrapped in try/catch that cancels token, joins threads, and destroys the native device. - UringIoHandler rule-of-5 hygiene: explicitly deleted copy ctor, copy-assign, and move-assign so the implicit shallow copies (which would double-delete the raw owning pointers) cannot be generated. - DispatchUringCqe: added static_assert(is_trivially_destructible<IoCallbackContext>) so a future non-trivial member fails the build instead of silently leaking. - NativeDeviceImpl::num_io_contexts: removed unnecessary const_cast (the underlying num_contexts() is already const on both backends). - Removed duplicate XML <summary> on NativeStorageDevice.Dispose. - FileSystemDisk dead-code ctor: passed the now-required 5th arg to FileSystemSegmentedFile so the file compiles if anyone instantiates it. Comment hygiene sweep (per explicit review-rule #4 "comments should not refer to design thought processes"): - Removed embedded benchmark results, hardware-specific throughput numbers, and historical narrative from class/method documentation in file_linux.{h,cc}, native_device.h, native_device_wrapper.cc, NativeStorageDevice.cs. - Kept WHAT each method does and the invariants it enforces; moved WHY this approach was chosen out of source comments (the commit log is the appropriate place for that context). - Net: -162 lines across 7 files, no behavior change from the trim itself. Verified post-fix performance unchanged (Device.benchmark, NVMe, 4K random reads, batch=4096, NUMA0-pinned): libaio CT=1 t=16 throt=512: 746K ok/sec, 0 err (hardware ceiling) uring CT=8 t=16 throt=512: 743K ok/sec, 0 err (hardware ceiling) libaio CT=1 t=32 throt=120: 636K ok/sec, 0 err (production default) uring CT=8 t=32 throt=120: 618K ok/sec, 0 err (production default)

…reation NativeStorageDevice.Initialize used to perform all the heavy work (native device creation, completion-thread spawn, ABI / segment-size / sector-size cross-checks) eagerly and threw "called more than once" on a second call. The other IDevice implementations (LocalStorageDevice, RandomAccessLocalStorageDevice, ManagedLocalStorageDevice) all inherit a metadata-only StorageDeviceBase.Initialize that simply overwrites segmentSize / segmentSizeBits / mask fields and is silently idempotent. They open their per-segment OS handles lazily inside the IO methods. This contract mismatch broke any caller that invokes Initialize twice on the same NSD instance. The canonical case is LocalStorageNamedDeviceFactory.Get(), which calls Initialize(-1L) as a defensive pre-init so consumers can't forget; the consumer (snapshot checkpoint state machine SnapshotCheckpointSMTask, cluster checkpoint streaming TsavoriteCheckpointReader.CreateCheckpointDevice) then calls Initialize(actualSegmentSize). Under the old NSD that throws; under the new NSD it works the same way the other backends do. Implementation: - NSD.Initialize is now metadata-only — delegates to base.Initialize. Pre-flight argument validation (segmentSize power-of-two, sector-size floor, omitSegmentIdFromFilename) is preserved. - New EnsureNativeDeviceCreated() does the heavy work, lazily, on first IO. Reads the latest base.segmentSize and base.OmitSegmentIdFromFileName so whichever Initialize call ran most recently wins. - Thread-safe via double-checked locking on a new nativeCreateLock. The publish of nativeDevice uses Volatile.Write so a second observer of nativeDevice != IntPtr.Zero is guaranteed to see a fully-initialised handle with completion threads already running. - Dispose now also takes nativeCreateLock around the cancel-join-destroy sequence so it cannot race with a concurrent EnsureNativeDeviceCreated (which would otherwise leak a freshly-published native handle and its completion threads). - IO entry points (ReadAsync, WriteAsync) call EnsureNativeDeviceCreated() before submission. Bookkeeping entry points (Reset, TryComplete, GetFileSize, RemoveSegment) no-op when the native handle has not been created yet, matching the semantics of the other backends (Reset on a device with no open handles is a no-op). Verified against the full unit-test sweep with Native forced as the default device (the GetDefaultDeviceType hack is local-only and not in this commit): Tsavorite.test: 206 / 206 (was 204 / 206 pre-fix) Garnet.test: 789 / 789 (was 110 / 792 pre-fix; 681 were blocked on Initialize-twice) Garnet.test.acl: 425 / 425 Garnet.test.collections: 746 / 746 Garnet.test.complexstring: 386 / 386 Garnet.test.rangeindex: 62 / 62 Garnet.test.vectorset: 42 / 42 No change to behavior for callers that invoke Initialize once with the real segment size, which is what every production code path already does.

…ive handle GetFileSize and RemoveSegment must report the on-disk state regardless of whether IO has flowed through the device, matching LocalStorageDevice and RandomAccessLocalStorageDevice semantics. Before this fix, both no-op'd when no native handle had been created — which silently truncated the cluster manager's recovery decision because ClusterManager.cs:79 and ReplicationManager.cs:160 call `device.GetFileSize(0) > 0` to decide whether to recover persisted cluster config / replication history. With Native, a restarted node would always "Initialize new node instance config" instead of recovering, get a fresh node ID, and fail every replication-resume test (e.g. ClusterSRNoCheckpointRestartSecondary which restarts a replica and then waits for AOF sync to catch up). Changes: * GetFileSize now falls back to FileInfo when no native handle exists — same shape as RandomAccessLocalStorageDevice.GetFileSize (open-on-demand) but without paying io_uring/libaio setup cost just to stat a file. * RemoveSegment now falls back to File.Delete when no native handle exists — same shape as LocalStorageDevice / RandomAccessLocalStorageDevice (best-effort unlink, swallows ENOENT). * Per IDevice contract enforced in 889def4 ("Tsavorite IDevice: unify Initialize contract — required for all devices, -1 = unbounded single segment"), ReadAsync / WriteAsync now call EnsureInitialized() before EnsureNativeDeviceCreated() so the IDevice_*BeforeInitialize_Throws hardening tests get the same InvalidOperationException shape from Native that they get from the other devices. * Two device tests updated to match the lazy-Initialize contract that was introduced in commit f4e3044 ("Tsavorite Native: make Initialize idempotent via lazy native-handle creation"): - NativeStorageDevice_InitializeTwice_Throws → _Idempotent: idempotent Initialize matches the LSD/RA contract used by LocalStorageNamedDeviceFactory.Get + consumer re-init pattern. - NativeStorageDevice_Recovery_LargerExistingSegment_DetectsMismatch: the C++ ValidateRecoveredSegments check now fires on first IO (when EnsureNativeDeviceCreated runs), not at Initialize time, so the test asserts on a ReadAsync rather than Initialize. Verified on Linux x64 / .NET 10: * libs/storage/Tsavorite/cs/test/test.hlog: all IDevice + NativeStorageDevice tests pass (62/62) with Native default * test/cluster/Garnet.test.cluster.replication: all 4 ClusterSRNoCheckpointRestartSecondary variants pass with Native default (regression-test for the recovery path) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Without this fix, NSD.Dispose() can stall up to CompletionWorkerTimeoutSecs (1s) per io_context because the completion-drainer thread is blocked in io_getevents / io_uring_wait_cqe_timeout waiting for events that will never come (the IO drain phase has already brought numPending to 0). The Thread.Join following completionThreadToken.Cancel() then has to wait for the next QueueRunFor timeout to fire so the thread can observe cancellation and exit. This was visible as exactly-1.0s gaps in cluster replication recovery traces: checkpoint metadata reads / writes that each create+dispose a fresh NSD spent ~1s in Dispose, multiplying across the ~5–10 devices created per checkpoint into multi-second stalls. ClusterReplicaSyncTimeoutTest (replicaSyncTimeout=1s) and MultiDatabaseSaveRecoverByDbIdTest(True) (2s LASTSAVE poll window) failed because of this; the actual I/O on Native is microseconds, not seconds. Fix: post a synthetic wake-up event on each io_context when Dispose runs. * libaio: submit a 0-byte read on a /dev/null fd opened in the handler ctor. /dev/null completes immediately and does not require O_DIRECT alignment, so the wake-up does not interfere with the real segment files. * io_uring: submit io_uring_prep_nop with user_data = nullptr; the drain loop recognises nullptr as a wake-up sentinel and skips dispatch. * Windows ThreadPoolIoHandler has no dedicated drainer (callbacks fire on threadpool threads), so its Wake is a no-op stub returning 0. The completion thread wakes from its blocking syscall almost immediately, observes the cancellation token on its next loop iteration, and exits. No extra idle work, no polling, no shortened timeout. * NSD.Dispose latency: 1025ms worst case -> ~20-30ms (microbenchmark). * ClusterReplicaSyncTimeoutTest with Native: ~22-25s (fail) -> ~3s (pass). * MultiDatabaseSaveRecoverByDbIdTest(True) with Native: timeout (fail) -> ~6s (pass). * Idle drainer syscall rate is unchanged (1/s/context). C ABI changes (additive — old exports preserved): * NativeDevice_WakeCompletionWorker(device, ctx_idx). * INativeDevice::Wake; QueueIoHandler::Wake, UringIoHandler::Wake, ThreadPoolIoHandler::Wake stub. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…id for IO Background: commit 889def4 ("unify Initialize contract — required for all devices") added an EnsureInitialized() guard that threw InvalidOperationException at every IO entry point if Initialize() had not been called first. This was redundant: the ctor already establishes segmentSize=-1 / segmentSizeBits=64 / segmentSizeMask=~0UL, which is functionally identical to having called Initialize(-1) — every absolute address right-shifts to segment 0, producing unbounded single-segment routing. The mandatory-Initialize contract was the root cause of the entire factory-pre-init + NSD lazy-creation saga: LocalStorageNamedDeviceFactory.Get was forced to call device.Initialize(-1L) defensively just to satisfy the contract, which broke NativeStorageDevice (its single-shot Initialize then asserted on the consumer's follow-up Initialize(realSize)). Recent commits f4e3044 + 0535da0 papered over this with lazy native-handle creation; this commit removes the root cause. Changes: * StorageDeviceBase: remove the 'initialized' flag, EnsureInitialized() helper, and ThrowNotInitialized() method. Initialize() is now purely a *configuration* call to override the ctor defaults (set a non-default segment size, opt into OmitSegmentIdFromFileName). The ctor doc explicitly states that callers may issue IO immediately after construction. * All concrete devices: remove the EnsureInitialized() calls at the top of ReadAsync / WriteAsync / TruncateUntilSegmentAsync / RemoveSegment (libaio, io_uring, RA, ManagedLocal, LocalMemory, Null, Tiered, Sharded, Azure). * LocalStorageNamedDeviceFactory.Get: drop the defensive device.Initialize(-1L); the ctor defaults match what that call did anyway. * NSD's recent EnsureInitialized() additions to ReadAsync/WriteAsync (introduced in 0535da0 only to satisfy the hardening test) are also removed by the sweep. * ComponentRecoveryTests.cs: drop 3 redundant Initialize(-1L) calls. * test.hlog DeviceTests: rename and repurpose IDevice_*BeforeInitialize_Throws to IDevice_*BeforeInitialize_UsesCtorDefaults — the new test demonstrates that WriteAsync/ReadAsync on a freshly-constructed device (no Initialize call) works correctly using the unbounded single-segment defaults, across Native / RA / ManagedLocal. TestUtils.cs:165 still calls device.Initialize() — that path is conditional on the caller wanting OmitSegmentIdFromFileName=true, which IS only settable via Initialize (it is not a ctor parameter), so the call is genuinely needed there. Verified on Linux x64 / .NET 10: * libs/storage/Tsavorite/cs/test/test.hlog (IDevice + NativeStorageDevice tests): 62/62 pass. * libs/storage/Tsavorite/cs/test/test.recovery (ComponentRecovery tests): 4/4 pass. * Full Garnet.test, Garnet.test.cluster, Garnet.test.acl, Garnet.test.collections, Garnet.test.extensions, Garnet.test.scripting, Garnet.test.complexstring, Tsavorite IDevice+NSD: pass at the same rates as before this commit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Sweep the PR for comments that referenced earlier design choices as they evolved during development, and rewrite them to describe the steady-state contract directly without historical baggage. * IDevice.Initialize: replace the "Must be called exactly once... before any IO entry point... uninitialized device throws InvalidOperationException" doc with the actual contract: Initialize is purely an opt-in configuration step to override the ctor defaults (which are equivalent to Initialize(-1)); callers may issue IO immediately after construction. * NativeStorageDevice ctor doc: rewrite to describe the as-shipped lazy creation flow (configuration captured at ctor, native handle created on first IO via EnsureNativeDeviceCreated) rather than the stale "Native device creation is DEFERRED until Initialize... every IO entry point throws InvalidOperationException" framing. * NativeStorageDevice.Initialize doc: drop the misleading "Creates the underlying native device with the requested segment size" lead-in (which hasn't been true since the lazy-creation refactor); replace the "factory pre-init" example (factory no longer pre-inits) with a steady-state description of when repeat Initialize calls are honoured. * NativeStorageDevice.EnsureNativeDeviceCreated doc: replace "Throws if Initialize has not been called" (now uses ctor defaults if no Initialize) with "Throws if the device has been disposed or if the native shim rejects the configuration". * NativeStorageDevice.EnsureReadyOrSilent doc: drop the "does not throw on 'not initialized yet'" qualification. * NativeStorageDevice.GetSectorSize doc: re-point the "cross-check" link from Initialize to EnsureNativeDeviceCreated (which is where it actually happens). * NativeStorageDevice.Dispose doc + body: bound the worst-case shutdown stall by the longest in-flight user callback (not CompletionWorkerTimeoutSecs) since wake-up uses NativeDevice_WakeCompletionWorker; rewrite the inline Dispose comment so it documents the steady-state design rather than what it improved over. * NativeStorageDevice nativeSegmentSizeBytes / UnboundedNativeSegmentSizeBytes field doc: clarify that the value is populated by EnsureNativeDeviceCreated (not Initialize) and that the default is reached without calling Initialize. * NativeStorageDevice_InitializeTwice_Idempotent test: drop the now-stale "factory pre-init... consumer re-initializes" rationale; describe the idempotent contract directly. * NativeStorageDevice_DisposeBeforeInitialize_IsNoOp test: drop the "Phase 6" reference and reword in terms of the steady-state lazy-creation contract. * SimulatedFlakyDevice.Initialize: replace the "so its EnsureInitialized() guard passes when our IO methods delegate to it" comment (the guard no longer exists) with a description of why both devices need matching geometry. * LinuxFileExtensions.OpenDirect dsync param doc: drop the "previously asked for it" wording — the WriteThrough callsites still pass it; describe the parameter as an opt-in for WriteThrough-equivalent semantics. * Doc-cref bookkeeping: change <see cref="base.segmentSize"/> (illegal cref for inherited fields) to <c>base.segmentSize</c> code spans. No behaviour change. Build clean on Garnet.slnx and Tsavorite.slnx; dotnet format --verify-no-changes clean on both. IDevice + NSD device tests all pass (62/62). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Background: the shipped libnative_device.so was built with -DUSE_URING=ON, so it had a hard DT_NEEDED entry for liburing.so.2. Loading the .so on a host without liburing2 installed (e.g. a GitHub Actions ubuntu-latest runner, or an end-user box where only libaio is in the base image) failed with: System.DllNotFoundException: ... liburing.so.2: cannot open shared object file: No such file or directory …even for callers that only ever requested the libaio backend, because the dynamic linker resolves NEEDED libraries at load time regardless of which exported symbols the caller goes on to invoke. Rebuild the prebuilt with -DUSE_URING=OFF so the shipped .so links only libaio. Most Linux distributions ship libaio in the base system, so the prebuilt now loads without any additional setup. The io_uring backend becomes a build-time opt-in: callers that want it install liburing-dev and rebuild with -DUSE_URING=ON. The C# layer already surfaces a clear TsavoriteException for callers that request Uring against a USE_URING=OFF build ("Requested IO backend 'Uring' is not available in the loaded native_device library… Rebuild the native library with -DUSE_URING=ON and install liburing-dev to enable io_uring."). Build fix: file_linux.h now includes <fcntl.h> directly (for the ::open() / O_RDONLY usage in QueueIoHandler::OpenWakeFd()). Previously these were pulled in transitively through <liburing.h>, which is now gated behind #ifdef FASTER_URING. Dockerfile updates: drop liburing2 / liburing from the runtime install list in all 5 Dockerfiles (default, .ubuntu, .alpine, .azurelinux, .chiseled). Comments left for users that rebuild with USE_URING=ON. README updates: rewrite the "Runtime dependencies" section to describe the new default (libaio only). Replace the "Disabling io_uring (optional)" section with "Enabling io_uring (optional)". Verified on Linux x64 / .NET 10: libaio default works (62/62 IDevice + NativeStorageDevice tests pass); ldd confirms only libaio.so.1t64 is in NEEDED. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…fallback Single-shipping libnative_device.so created a deployment dilemma: build with USE_URING=ON and end-users without liburing get DllNotFoundException at load time; build with USE_URING=OFF and the io_uring backend stops working even on hosts that DO have liburing installed (which is the case where uring matters for perf — modern NVMe at >1M IOPS benefits noticeably from uring over libaio). Ship both flavors instead: * libnative_device.so — built with USE_URING=ON; DT_NEEDED on libaio AND liburing. Exposes both Libaio and Uring backends. * libnative_device_libaio.so — built with USE_URING=OFF; DT_NEEDED on libaio only. Exposes the Libaio backend. NativeStorageDevice's DllImportResolver tries the uring-enabled binary first; on DllNotFoundException matching 'liburing.so.2: cannot open' it falls back to the libaio-only binary. The Libaio backend therefore always works out of the box on any Linux distribution that ships libaio (essentially all of them). liburing is opt-in: hosts that install it get the Uring backend with zero runtime overhead vs Libaio (direct calls, no function-pointer indirection — we deliberately rejected the dlopen approach so the future-default uring path stays optimal). If a caller explicitly selects IoBackend.Uring on a host without liburing, the construction-time error message now points at the install command per distro ('apt-get install -y liburing2', 'dnf install -y liburing', 'apk add liburing') instead of telling the user to rebuild the .so with -DUSE_URING=ON. We never silently downgrade Uring to Libaio. Changes: * NativeStorageDevice.cs: new LibaioFallbackLibraryPath; ImportResolver catches DllNotFoundException for liburing.so.2 and falls back to the libaio-only .so. ResolveNativeLibraryPath now takes the path as a parameter so it can resolve either flavor. * NativeStorageDevice.cs: rewrite the 'backend not available' exception message — point at install commands (the actual remediation) not rebuild. * Tsavorite.core.csproj: add libnative_device_libaio.so as a second ContentWithTargetPath asset so both .so files are copied to the output directory and packed into the NuGet runtime payload. * runtimes/linux-x64/native/libnative_device.so — REPLACED with USE_URING=ON build (DT_NEEDED libaio + liburing). 2.3 MB. * runtimes/linux-x64/native/libnative_device_libaio.so — NEW, USE_URING=OFF build (DT_NEEDED libaio only). 1.6 MB. * Dockerfile, Dockerfile.ubuntu, Dockerfile.alpine, Dockerfile.azurelinux, Dockerfile.chiseled: re-add liburing2 / liburing to the runtime installs so docker users get the io_uring backend out of the box (the libaio-only fallback would otherwise leave Uring unusable inside containers). * cc/README.md: rewrite the 'Runtime dependencies' and build sections to describe the two-flavor layout, drop the stale 'Enabling io_uring' section, and document the prebuilt rebuild workflow. Verified end-to-end: * Both backends saturate the Dell P5600 NVMe at ~743K random read IOPS in benchmark/Device.benchmark (matches the pre-change reference). * 62/62 IDevice + NativeStorageDevice tests pass. * dotnet format clean on both Garnet.slnx and Tsavorite.slnx. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…hreadPoolIoHandler NativeDeviceImpl's constructor (native_device.h:100-101) gates on handler_.init_errno() to surface an actionable error message when the underlying IO handler failed to initialize (e.g., libaio io_setup() failed with EMFILE / ENOMEM, or io_uring_queue_init() failed). The Linux handlers QueueIoHandler and UringIoHandler both expose this API; ThreadPoolIoHandler (Windows) did not, so MSVC failed to instantiate NativeDeviceImpl<ThreadPoolIoHandler> with: error C2039: 'init_errno': is not a member of 'FASTER::environment::ThreadPoolIoHandler' Add init_errno() and initialized() stubs that return 0 / true unconditionally — the Windows ThreadPool API does not have a separable init step that can fail in the same way the Linux io_setup / io_uring_queue_init paths can (threadpool creation failures propagate via threadpool_'s ctor, not via a later 'check this' field on the handler), so the stubs are semantically correct. NativeDeviceImpl then falls through to the log_.Open(&handler_) path which is where Windows-specific errors (missing directory, permission denied, etc.) actually surface. Linux unaffected: rebuilt build/Release-uring cleanly after this change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The Native tests in DeviceTests.cs had blanket 'NativeStorageDevice is Linux-only' Assert.Ignore guards that dated from when NSD's C++ shim was Linux-only. The shim is built on Windows too (native_device.dll via the ThreadPool / IOCP backend in file_windows.cc), so directly constructing 'new NativeStorageDevice(...)' works on Windows. The blanket guards were silently dropping ~15 NSD test cases on Windows CI. Drop the guards so the tests exercise the Windows C++ shim. The legitimate Linux-only guard on IDevice_PermissionDeniedAtFirstWrite_CallbackGetsError (chmod-based; chmod has no Windows analogue) is preserved. Important: end-user device routing is UNCHANGED. Devices.CreateLogDevice(DeviceType.Native) on Windows still returns LocalStorageDevice (managed Windows IOCP), not NativeStorageDevice — that routing happens in Devices.cs and was not touched. These tests directly instantiate the NSD class for shim-coverage purposes only; they do not affect what end users get from the default device factory. Linux: 62/62 pass after this change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…+ source Rebuild the shipped Windows prebuilt from the current native_device source so the latest fixes (Initialize idempotence, WakeCompletionWorker, etc.) are reflected in the win-x64 DLL. USE_URING is a no-op on Windows; the DLL only exposes the Default (IOCP) backend, so there is no equivalent of the libnative_device_libaio.so fallback on this platform. End-user device routing is unchanged: Devices.CreateLogDevice(DeviceType.Native) on Windows still returns LocalStorageDevice (managed Windows IOCP). This DLL is exercised by direct 'new NativeStorageDevice(...)' construction (see Tsavorite.test.hlog DeviceTests — 59/59 Native + IDevice tests pass on Windows after this rebuild). Built with: Visual Studio 17 2022, MSVC v143, x64, Release configuration, Spectre-mitigated CRT.

Fixes two correctness issues caught by an automated code review of the optimize-device PR. ### io_uring SQE leak on submit failure In UringFile::ScheduleOperation, io_uring_get_sqe() advances the user-side sqe_tail before io_uring_submit() is called. If submit fails after retries (-EAGAIN/-EBUSY exhausted, or any other negative), the old code released the lock and returned IOError without doing anything about the still-pending SQE. user_data on that SQE pointed at the io_context unique_ptr that was about to be freed by the guards unwinding, so the next successful submit on the same ring would consume the stale SQE and the QueueRunFor drain loop would dispatch a callback against freed memory — a clear use-after-free. Fix: before releasing sq_lock on the failure path, rewrite the still- pending SQE in place as io_uring_prep_nop with user_data = nullptr. The drain loop already skips nullptr user_data (it's the wake-up sentinel used by UringIoHandler::Wake), so when a later submit flushes this nop the CQE is drained harmlessly. Safe to mutate the SQE in place because we still hold sq_lock and no kernel/concurrent submitter has observed it yet. ### NativeDevice sector_size always returned 512 FileSystemSegmentedFile::alignment() returned a hard-coded 512. NativeDeviceImpl::sector_size() delegated to it, so the C# wrapper's sector-size cross-check in EnsureNativeDeviceCreated would: - falsely throw on 4K-native disks where ProbeAlignment returns 4096 (managed 4096 vs native 512 → 'sector-size mismatch' → device unusable), or - on 4K-native disks where the managed probe fell back to 512 (e.g. older kernel without STATX_DIOALIGN), let the device initialize with SectorSize=512 and then have the kernel reject the 512-aligned O_DIRECT buffers with EINVAL. Fix: factor the STATX_DIOALIGN probe from NativeDevice_ProbeAlignment into a shared inline helper (native_device::ProbeDioAlignment in native_device.h) and call it once from the NativeDeviceImpl ctor, caching the result as the immutable member device_alignment_. sector_size() now returns the cached value; NativeDevice_ProbeAlignment delegates to the same helper. Both sides of the ABI cross-check go through identical probe logic, so the check is now a meaningful ABI / runtime-drift detector instead of a 4K-disk footgun. ### Stale IDevice.Initialize XML The omitSegmentIdFromFilename param said it was 'only supported by managed devices — NativeStorageDevice rejects this flag'. Native devices have honored the flag since 6584cf7. Updated the doc. ### Alpine install hint The 'IoBackend.Uring with libaio fallback' error message suggested 'sudo apk add liburing' on Alpine, but README.md notes that the prebuilt won't load on Alpine (musl) at all. Replaced the apk suggestion with the actual Alpine support story (use a glibc image or fall back to a managed device). Verification: 62/62 Tsavorite.test.hlog IDevice + NativeStorageDevice tests pass on Linux. Both .so binaries rebuilt (uring-enabled and libaio-only fallback) with correct ldd output. Device.benchmark NVMe saturation throughput unchanged within noise (libaio 738K IOPS, uring 349K IOPS on Dell P5600). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replaces the statx(STATX_DIOALIGN) probe in ProbeDioAlignment with a direct sysfs lookup of max(logical_block_size, physical_block_size). Why: - STATX_DIOALIGN reports only the kernel-enforced minimum (= logical block size). It misses the firmware's preferred sector (physical_block_size), so on a 512e drive (logical=512, physical=4096) the probe would return 512 and Tsavorite would take a firmware RMW penalty on every partial-sector write. - STATX_DIOALIGN also requires kernel 6.1+ AND the filesystem to populate the field; ext4 on 6.8 leaves it unset on 512-byte devices, so the probe was already falling through to the 512 default in practice. - sysfs gives us both values directly, on every kernel, with no O_DIRECT dance. Taking max(logical, physical) covers the correctness floor (logical = kernel-enforced minimum) and the performance floor (physical = avoid RMW on partial writes) in one shot. Implementation: - stat() the file (or its closest existing ancestor — log file may not exist yet at construction). Extract st_dev → (major, minor). - Read /sys/dev/block/<maj>:<min>/queue/{logical,physical}_block_size. For partitions (e.g. sda2), the queue/ dir lives on the parent whole-disk block device — fall through to ../queue/<field>. - Round result up to a power of two (always already pow2 on real hardware) and floor at 512 B. On this machine (Dell P5600 NVMe + PERC sda): Probe(/DATA2/badrishc) = 512 (NVMe: logical=512, physical=512) Probe(/tmp/devbench) = 512 (sda partition via parent-walk) Probe(/home/badrishc) = 512 Probe(/) = 512 All values match max(logical, physical) read directly from sysfs. Verification: - 62/62 Tsavorite.test.hlog IDevice + NativeStorageDevice tests pass - Both .so flavors rebuilt (uring-enabled + libaio-only fallback) - C ABI NativeDevice_ProbeAlignment delegates to the same helper, so managed SectorSize and native sector_size() remain in lockstep. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…obed sector size CI failure on IDevice_Initialize_OmitSegmentIdFromFilename_BareFileName (and likely other IDevice_* tests on the same runner) reported: 'NativeStorageDevice.WriteAsync: misaligned I/O — sector size is 4096, but offset=0x0, length=4096, buffer=0x...7EC7A9F76800' The buffer ends at 0x...800 = 2048 — i.e. 2048-aligned but not 4096-aligned. The test helper allocated buffers aligned to HardeningSectorSize = 512 (the pre-PR default for every Garnet Linux device); a 512-aligned formula can land on a 2048-boundary that is not also a 4096-boundary. CI's underlying disk reports physical_block_size = 4096 in sysfs, so the new max(logical, physical) probe returns 4096 there. The native shim then correctly rejects sub-4096-aligned O_DIRECT buffers with EINVAL. The fix is on the test side: bump HardeningSectorSize from 512 to 4096 so the test buffer alignment matches the strictest device.SectorSize seen on any modern hardware (512n, 512e, 4Kn). Locally (Dell P5600 NVMe, logical=physical=512 → SectorSize=512) all 62 IDevice + NativeStorageDevice tests still pass — 4096 trivially divides 512. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… PR) The read-window sizing optimization (drop leading-slop padding + clamp to page-end) is out of scope for this device-backend PR; it interacts with the larger read-IO path and deserves its own focused PR with dedicated benchmarking. Reverting to the pre-PR behavior here. TryAllocateRetryNow's bounded-backoff change is retained — it's a self-contained allocator hot-path fix and is independently verified (+13.9% on YCSB load with libaio at 64 threads, per kvbench benchmarking). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… max(logical, physical) Symmetry with the Linux sysfs probe — Windows now reads both BytesPerLogicalSector and BytesPerPhysicalSector from the volume's STORAGE_ACCESS_ALIGNMENT_DESCRIPTOR (via IOCTL_STORAGE_QUERY_PROPERTY + StorageAccessAlignmentProperty) and returns the rounded-up-to-pow2 max, floor 512 B. Previously the Windows branch returned 512 unconditionally, which would silently undersize SectorSize on Windows 4Kn / 512e drives. Implementation: - Parse drive letter from filename ("C:\foo.dat" -> "\\.\\C:"). UNC paths are not supported by this probe — fall back to 512. - CreateFile on the volume with FILE_READ_ATTRIBUTES (no admin needed). - DeviceIoControl(IOCTL_STORAGE_QUERY_PROPERTY) with StorageAccessAlignmentProperty. - max(logical, physical), round up to pow2, floor 512. Linux behavior unchanged. Both .so flavors rebuilt and pass 62/62 device tests on this machine (logical=physical=512 NVMe). REQUIRES Windows DLL rebuild — the Windows path in ProbeDioAlignment is now non-trivial, and the existing prebuilt native_device.dll still returns 512 unconditionally. Without the rebuild, on a Windows 4Kn box the managed SectorSize cross-check would (incorrectly) pass at 512 while the device might actually need 4096. Rebuild recipe in the companion review comment / Tsavorite/cc/README.md. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…leteFor Addresses Copilot review comment on file_linux.cc:373. UringIoHandler::TryCompleteFor (and via it TryComplete) dispatched every drained CQE through DispatchUringCqe without checking the user_data = nullptr sentinel that QueueRunFor already handles. The sentinel marks two kinds of no-op CQEs: - Wake-up nops submitted by UringIoHandler::Wake to unblock the drainer on Dispose. - SQEs rewritten in-place after io_uring_submit failed (the SQE leak fix in c6d6892); these are committed to the SQ but carry no caller context. If a TryComplete() / TryCompleteFor() call picks up either kind of nop CQE, DispatchUringCqe would dereference the null context at context->callback(...) and segfault. Fix: mirror the nullptr-skip from QueueRunFor in TryCompleteFor. Return true to count the drain (matching the any-flag semantics). Verification: 62/62 Tsavorite.test.hlog IDevice + NativeStorageDevice tests pass. uring .so rebuilt; libaio-only .so is byte-identical because the patched code path is wrapped in #ifdef FASTER_URING and not compiled into the libaio-only fallback. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

QueueRunFor used to acquire cq_lock per CQE (peek -> read fields -> cqe_seen -> release -> dispatch). With a single drainer thread that serializes lock acquire/release on every completion and forces submitters to wait through callback latency when they need ring access. Replaced with the canonical liburing batch-drain idiom: - acquire cq_lock once - io_uring_peek_batch_cqe(ring, cqes, 64) to pull up to 64 CQEs - snapshot (io_res, context) for each - io_uring_cq_advance(ring, n) to release the slots - release cq_lock - dispatch callbacks outside the lock This is the io_uring equivalent of libaio's io_getevents(n) per syscall. Snapshot BEFORE cq_advance is mandatory because the kernel may reuse CQ slots once advanced, leaving the cqe pointers dangling. The wake-up / failed-submit sentinel (user_data == nullptr) is still skipped without dispatch, same as before. Measured impact on Dell P5600 (16 submitter threads, batch 64, throttle 256): ct=1 (1 ring, 1 drainer): 339K -> 354K ops/sec (+4%) ct=4 (4 rings, 4 drainers): 735K -> 737K (saturates, noise) ct=8 (8 rings, 8 drainers): 750K -> 742K avg (saturates, noise) The single-drainer gain is modest because the real bottleneck at ct=1 with 16 submitters is sq_lock contention on the single ring, not cq_lock contention. The batch-drain is still strictly better: - dispatches outside the lock so submitters aren't blocked by user-callback latency, - matches the idiomatic liburing pattern, - amortizes the lock acquire/release across up to 64 CQEs per cycle. For high-throughput workloads, sharding across multiple rings remains the right scaling lever (ct >= 4 saturates this drive). Verification: 62/62 Tsavorite.test.hlog IDevice + NativeStorageDevice tests pass. uring .so rebuilt. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Eliminates the sq_lock contention that was capping uring at ~340K IOPS at the default numCompletionThreads=1. Two changes work together: 1. Per-thread ring affinity in pick_ring (file_linux.h): Each submitter thread is assigned a ring on its first submit (round- robin against other threads via an atomic counter) and keeps that assignment for life. Same-thread submits never contend on sq_lock with themselves; different threads only contend when they got assigned the same ring (num_submitter_threads > num_rings). This is the user-space equivalent of libaio's "io_submit is thread-safe per io_context" — eliminate shared mutable state across submitters. 2. Hardcoded 4 rings for uring (NativeStorageDevice.cs): numIoContextsConfig = ioBackend == Uring ? max(kDefaultUringRings=4, numCompletionThreads) : numCompletionThreads So uring always has at least 4 rings even at numCompletionThreads=1. The single drainer covers all 4 rings via the legacy QueueRun compat scanner (CompletionWorker passes ctxIdx=-1 in that case). libaio is unchanged: rings == numCompletionThreads (extra rings don't help; the kernel io_context mutex is already efficient). Result on Dell P5600 NVMe (16 submitter threads, batch 64, throttle 256): Before (1 ring, 1 drainer): ~340K After (4 rings, 1 drainer, default): ~700K (matches libaio ct=1) After (8 rings, 8 drainers, sharded): ~745K (unchanged, was already saturating) No new public configuration parameters. numCompletionThreads still controls drainer count; the ring count is now backend-derived behind the scenes. The CompletionWorker single-drainer-multi-ring path was added specifically so the default numCompletionThreads=1 case can saturate without spawning extra drainer threads. Also: bumped HardeningSectorSize and the legacy bufferPool / NativeDeviceTest2 sector_size constants from 512 to 4096 to match the strictest device SectorSize we expect on any modern hardware (4Kn drives where the new max(logical, physical) probe returns 4096). Tests would otherwise fail with EINVAL on 4Kn CI runners with 512-aligned buffers. Verification: - 64/64 Tsavorite.test.hlog IDevice + NativeStorageDevice tests pass - Default uring (no flags) hits 626-740K across t=1..64 vs ~340K before - Sharded ct=4/8 unchanged (still saturates) - libaio default unchanged Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… SectorSize NativeDeviceTest1 read 1024 bytes (entryLength) using ReadInto, which: - rounded the read length up to the device sector size, - then returned a buffer of that ROUNDED length, - which the caller compared via SequenceEqual against the original `entry` byte[] (length 1024). When SectorSize was 512 (the old constant probe), 1024 rounded to 1024 and the lengths happened to match. With the new max(logical, physical) probe returning 4096 on 4Kn drives (Windows/Ubuntu CI runners), 1024 rounds to 4096, the returned buffer is 4096 bytes long, and SequenceEqual fails on length mismatch (regardless of content). Pre-existing latent bug — the rounding to sector size is correct for the IO submit, but the caller should only see the bytes it asked for. Fix: return a buffer of the caller-requested logical `size`, not the sector-rounded `numBytesToRead`. Verification: 64/64 Tsavorite.test.hlog NativeDeviceTest + IDevice + NativeStorageDevice tests pass on Linux (where SectorSize is 4096 on the CI runner's 4Kn drive). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The --device-completion-threads (KV.benchmark) and --completion-threads (Device.benchmark) help text said "all drainers share the same kernel io_context / io_uring" and "values > 1 are rarely useful past 1 today". Both claims are stale since the sharded-rings work (8cbca9d) and the per-thread ring affinity + 4-default-rings change (298bfd1): - Each drainer is bound 1:1 to its own kernel io_context (libaio) or io_uring ring (uring). - Submitters distribute across rings via per-thread affinity. - For io_uring, throughput scales with completion-threads up to available submitter concurrency (measured: ct=1 ~340K → ct=4 ~735K on Dell P5600 NVMe at the device-benchmark level). - For libaio extra drainers still rarely help past 1 (kernel per-context mutex is efficient). - Note added that uring uses min 4 rings even at ct=1 with the single drainer covering all rings via the legacy QueueRun scanner. Help-text-only change. No code behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ll reads Root cause: cross-segment read rejection + engine retry loop ============================================================== The AllocatorBase.GetAndPopulateReadBuffer sector-aligned read window can extend past the page-end boundary when reading a record near the tail of a page. When the device's segment size is a multiple of the page size (e.g. 4MB pages, 1GB segments — the Garnet default), an over-extended read at the last page of a segment also crosses the device's segment boundary. NativeStorageDevice's underlying FileSystemSegmentedFile rejects cross-segment reads with Status::IOError; the engine's AsyncGetFromDiskCallback interprets a 0-byte read as a short read and retries the same address — forever. Worker thread spins at 99% CPU, disk activity drops to zero, benchmark deadlocks. Reproduced reliably on KV.benchmark: --device native --device-io-backend libaio \ --log-memory 16m --page-size 4m --segment-size 1g \ -n 10000000 (1.28 GB dataset → crosses 1GB segment boundary) Smaller datasets (1M = 128MB, fits in 1 segment) work; larger ones hang. RandomAccess device works on all dataset sizes because its managed segmented-file wrapper doesn't reject cross-segment reads. Diagnostic captured the exact symptom: a read at sourceAddress 0x3FFFF600 (1,073,739,776 — 2,560 bytes before the 1GB segment boundary) with readLength 4608 (sector-aligned record window) extends to 0x40000C00 — 2,560 bytes into segment 1. Native rejects with Status::IOError, callback fires with numBytes=0, engine retries. Fix: clamp the aligned read length so it never crosses page-end. ============================================================ Added in AllocatorBase.GetAndPopulateReadBuffer: var pageEndInFile = (ulong)(AlignedPageSizeBytes * (GetPage(fromLogicalAddress) + 1)); if (alignedFileOffset + alignedReadLength > pageEndInFile) alignedReadLength = (uint)(pageEndInFile - alignedFileOffset); Records never span page boundaries (HandlePageOverflow guarantees), so the actual record is fully readable within the clamped window — available_bytes reflects what we actually got from disk, and the engine continues normally. pageEnd is sector-aligned (PageSizeBits >= sector size), so the clamped length stays sector-aligned. Also reverted the uring "min 4 rings even at ct=1" experiment ============================================================= The earlier "default 4 rings for uring regardless of ct" change was fundamentally broken: with per-thread submit affinity (pick_ring's thread_local index), submitters bound to rings 1-3 never get their completions drained because the single drainer blocks on ring 0 with a 1-second QueueRun timeout and only briefly polls the other rings between wake-ups. The result is ~50x throughput degradation on workloads where submitters land on rings != 0 (KV.benchmark load phase dropped from 2.5M ops/sec to 54K ops/sec at t=1). Reverted to the simple rule: rings == numCompletionThreads. For uring perf scaling, users set numCompletionThreads >= expected submitter concurrency; each ring is then continuously drained by its dedicated drainer thread. Defense-in-depth hardening ========================== - NativeStorageDevice._callback now catches ALL exceptions from the user callback (was: try/finally but exception propagated). A managed exception escaping back into native code across the C ABI boundary silently terminates the drainer thread; the next submitter then spins forever in device.Throttle(). Now the exception is logged and swallowed so the drainer survives. - NativeStorageDevice.CompletionWorker has the same try/catch around the whole drain loop as defense-in-depth against unrelated managed exceptions (P/Invoke marshalling, IntPtr.Zero races with Dispose, etc.). - file_linux.cc QueueFile::ScheduleOperation (libaio) and UringFile::ScheduleOperation (uring) now retry submit-side EAGAIN indefinitely with bounded backoff (64 sched_yields, then 1ms nanosleeps) instead of returning Status::IOError after 8 yields. Surfacing transient EAGAIN as a permanent error creates the same retry-loop pathology as the cross-segment-read bug above. EAGAIN is the kernel saying "ring is full, try later"; it's not a real error and must not be exposed to the engine. Verification ============ KV.benchmark, 100M keys × 100B, 16MB log (mostly disk-spill), 1 completion thread, 100% reads: libaio: t=1 135K ops/sec, t=4 400K, t=8 444K, t=16 445K, t=32 404K uring: t=1 124K ops/sec, t=4 244K, t=8 265K, t=16 278K, t=32 272K Both backends stable across the full thread × dataset sweep (previously native+libaio hung on any 10M+ dataset; native+uring hung on every config). 64/64 Tsavorite.test.hlog IDevice + NativeStorageDevice tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

badrishc force-pushed the badrishc/optimize-device branch from 4e6392e to 05fdd6c Compare May 27, 2026 17:56

badrishc changed the title ~~Optimize native devices, update benchmark~~ [Tsavorite] Add native Linux storage backend, harden NativeStorageDevice, refresh storage benchmarks May 27, 2026

badrishc force-pushed the badrishc/optimize-device branch from ed4618a to 2a61c1e Compare May 27, 2026 21:03

badrishc marked this pull request as ready for review May 27, 2026 21:03

Copilot AI review requested due to automatic review settings May 27, 2026 21:03

Copilot started reviewing on behalf of badrishc May 27, 2026 21:03 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

badrishc force-pushed the badrishc/optimize-device branch from 398a099 to 7d890ea Compare May 28, 2026 00:16

TedHartMS approved these changes May 28, 2026

View reviewed changes

badrishc and others added 20 commits May 27, 2026 20:36

badrishc and others added 16 commits May 27, 2026 20:36

Add binaries

f806b36

Add binaries

4814b08

badrishc force-pushed the badrishc/optimize-device branch from 01c9a77 to dc97630 Compare May 28, 2026 03:37

TedHartMS approved these changes May 28, 2026

View reviewed changes

badrishc merged commit 97b52bc into main May 28, 2026
187 checks passed

badrishc deleted the badrishc/optimize-device branch May 28, 2026 17:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Tsavorite] Add native Linux storage backend, harden NativeStorageDevice, refresh storage benchmarks#1831