[Tsavorite] Add native Linux storage backend, harden NativeStorageDevice, refresh storage benchmarks#1831

Merged
badrishc merged 36 commits into main from badrishc/optimize-device on May 28, 2026

badrishc commented May 27, 2026

Collaborator

Description of Change

Hardens Tsavorite's NativeStorageDevice Linux backend for production use and adds a substantially refreshed storage benchmark plus a new KV benchmark harness used to validate it.

Native device runtime:

  • Linux native shim now has two IO backends: QueueIoHandler (libaio) and UringIoHandler (io_uring, built with USE_URING). The io_uring backend uses sharded per-completion-thread rings, and a Wake mechanism unblocks the completion drainer on Dispose.
  • Ships two Linux native binaries: libnative_device.so (libaio + liburing) and libnative_device_libaio.so (libaio-only fallback, no liburing.so.2 DT_NEEDED).
  • Managed ImportResolver catches liburing.so.2: cannot open and transparently falls back to the libaio-only flavor. Explicit IoBackend.Uring against the fallback fails with distro-specific install hints rather than silently downgrading.
  • Windows native_device.dll rebuilt from the current C++ source; ThreadPoolIoHandler gained init_errno() / initialized() stubs so NativeDeviceImpl<ThreadPoolIoHandler> instantiates cleanly.
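
The load-time fallback in the third bullet can be pictured as an ordered probe over the two library flavors. The following is only a sketch under stated assumptions: the real resolver is managed C# code, the names `Backend`, `LoadFn`, and `resolve_native_library` are hypothetical, and the loader is injected so the logic can run without the actual .so files. It also simplifies the explicit-uring case into a load-time failure.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical mirror of the managed IoBackend enum (illustration only).
enum class Backend { Default, Libaio, Uring };

// Injected loader: returns true if the named library could be loaded.
using LoadFn = std::function<bool(const std::string&)>;

// Try the full flavor first; fall back to the libaio-only flavor unless
// the caller explicitly asked for io_uring.
std::string resolve_native_library(Backend requested, const LoadFn& try_load) {
    const std::string full = "libnative_device.so";
    const std::string fallback = "libnative_device_libaio.so";
    if (try_load(full)) return full;
    if (requested == Backend::Uring) {
        // No silent downgrade: surface an actionable error instead.
        throw std::runtime_error(
            "io_uring backend requested but liburing.so.2 is missing");
    }
    if (try_load(fallback)) return fallback;
    throw std::runtime_error("no native device library could be loaded");
}
```

With a loader that only finds the libaio-only flavor, `Backend::Default` resolves to the fallback while `Backend::Uring` throws, which matches the behavior listed under Edge Cases.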

Managed NativeStorageDevice rewrite:

  • Lazy native-handle creation: Initialize is idempotent; ctor establishes valid defaults equivalent to Initialize(-1).
  • GetFileSize / RemoveSegment work pre-IO without forcing native handle creation.
  • Dispose wakes the completion drainer via a no-op IO so shutdown doesn't wait on the per-context QueueRunFor timeout.
  • Honors omitSegmentIdFromFilename (previously native-rejected).
  • Cross-checks segment size and sector size with the native shim before first IO; both probes now go through ProbeDioAlignment so the sector-size check is a real ABI / runtime-drift detector instead of a 4K-disk footgun.

IDevice contract changes:

  • Initialize() is optional; constructor defaults are equivalent to Initialize(-1) (unbounded single segment). All managed device implementations updated accordingly.
  • RandomAccessLocalStorageDevice replaces the MarkHandleAsAsync reflection hack with RandomAccess over SafeFileHandle.
  • AsyncPool rolls back totalAllocated if creator() throws, fixing the dispose-hang class of bugs.

Allocator hot path:

  • AllocatorBase.TryAllocateRetryNow replaces unbounded CPU spin with bounded kFlushSpinCount yields followed by epoch-suspend + flushEvent.Wait(1ms) + resume, matching the canonical ShiftBeginAddress pattern. Yields +14% on YCSB load with libaio at 64 threads.
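
The retry shape described above (a bounded number of yields, then a short timed wait, repeated) can be sketched as follows. This is not the Tsavorite code: it is written in C++ for illustration, `kSpinCount` is a hypothetical stand-in for kFlushSpinCount, the epoch suspend/resume around the wait is left out, and both the allocation attempt and the wait are injected callables.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

constexpr int kSpinCount = 10;  // hypothetical stand-in for kFlushSpinCount

// Retries try_allocate() until it returns a non-negative address.
// Each round yields at most kSpinCount times, then blocks briefly via
// wait_for_flush (the real code suspends epoch protection around this wait).
template <typename TryAllocate, typename WaitForFlush>
long allocate_with_backoff(TryAllocate try_allocate, WaitForFlush wait_for_flush) {
    for (;;) {
        for (int i = 0; i < kSpinCount; ++i) {
            long addr = try_allocate();
            if (addr >= 0) return addr;   // allocation succeeded
            std::this_thread::yield();    // bounded, cooperative spin
        }
        // Spin budget exhausted: wait up to 1 ms for a flush to make room.
        wait_for_flush(std::chrono::milliseconds(1));
    }
}
```

The point of the bound is that a thread stuck behind a slow flush stops burning a core after a fixed number of attempts and instead parks on the flush event.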

Garnet configuration plumbing:

  • New device-backend / completion-thread knobs in Options.cs, defaults.conf, and GarnetServerOptions.

Default device routing on Windows is unchanged: DeviceType.Native still resolves to LocalStorageDevice (managed Windows IOCP) — the native shim on Windows is only exercised by tests that explicitly construct NativeStorageDevice.

Benchmarks:

  • benchmark/Device.benchmark substantially rewritten with new throttle, completion-thread, and IO-backend flags. New cookbook README documents how to saturate ~750K NVMe IOPS on a P5600.
  • New libs/storage/Tsavorite/cs/benchmark/KV.benchmark — a clean replacement for YCSB.benchmark focused on KV performance: synthetic data only, configurable RUMD percentages, Zipf/uniform key selection, thread-coordinated runs, NUMA pinning, auto-configured threadpool, zero-allocation hot paths.

Tests:

  • DeviceTests.cs split into IDevice_* (cross-device contract, parametrized over Native / RandomAccess / ManagedLocal) and NativeStorageDevice_* (NSD-specific). 62 tests pass on Linux; 59 pass + 3 chmod-permission Linux-only skips on Windows.

Docker images:

  • All five Dockerfiles re-install liburing2 / liburing alongside libaio1 so the io_uring backend works out of the box; the libaio-only fallback handles missing liburing transparently.

Key Technical Details

Affected types / interfaces:

  • INativeDevice, NativeDeviceImpl<HandlerT>, QueueIoHandler, UringIoHandler, ThreadPoolIoHandler — native C++ implementation.
  • NativeStorageDevice, RandomAccessLocalStorageDevice, IDevice, StorageDeviceBase, NullDevice, SimulatedFlakyDevice, LocalStorageNamedDeviceFactory, TieredStorageDevice, AsyncPool — managed device layer.
  • AllocatorBase.TryAllocateRetryNow, CompletionEvent.Wait(TimeSpan) — allocator hot path.
  • Devices.CreateLogDevice — routing preserved on Windows.
  • Options.cs, defaults.conf, GarnetServerOptions — Garnet host configuration knobs.

Edge Cases

| Scenario | Behavior |
|---|---|
| Linux host lacks liburing.so.2 | Loader transparently falls back to libnative_device_libaio.so for IoBackend.Default / IoBackend.Libaio |
| Explicit IoBackend.Uring without liburing | Construction fails with distro-specific install guidance instead of silently downgrading |
| Windows DeviceType.Native | Still routes to managed LocalStorageDevice by default |
| Initialize() not called | Device uses constructor defaults equivalent to Initialize(-1) (unbounded single segment) |
| Dispose() with blocked completion drainer | Wake-up IO unblocks the drainer before destroying the native handle |
| 4K-native-DIO disk | Native and managed sides probe identical alignment via shared ProbeDioAlignment; cross-check passes |
| io_uring submit fails after retries | Prepared SQE is rewritten in place to nop + nullptr sentinel; harmlessly drained on next successful submit |

Issues Fixed

No linked issue.

badrishc force-pushed the badrishc/optimize-device branch from 4e6392e to 05fdd6c on May 27, 2026 17:56
badrishc changed the title from "Optimize native devices, update benchmark" to "[Tsavorite] Add native Linux storage backend, harden NativeStorageDevice, refresh storage benchmarks" on May 27, 2026
badrishc force-pushed the badrishc/optimize-device branch from ed4618a to 2a61c1e on May 27, 2026 21:03
badrishc marked this pull request as ready for review on May 27, 2026 21:03
Copilot AI review requested due to automatic review settings on May 27, 2026 21:03

Copilot AI left a comment

Contributor

Pull request overview

This PR hardens Tsavorite’s native storage path, adds Linux native backend selection for libaio/io_uring, updates Garnet configuration plumbing, and refreshes storage/KV benchmarking guidance and harness behavior.

Changes:

  • Adds/rewrites native device backend abstractions, runtime segment sizing, alignment probing, error reporting, io_uring support, and managed NativeStorageDevice/IDevice contract updates.
  • Plumbs native backend and completion-thread options through Tsavorite/Garnet device factories and host configuration.
  • Expands device tests and benchmark documentation/tools for native, RandomAccess, and FileStream storage paths.

Reviewed changes

Copilot reviewed 42 out of 46 changed files in this pull request and generated 6 comments.

File Description
libs/storage/Tsavorite/cs/test/test.hlog/DeviceTests.cs Adds broad device contract, native lifecycle, recovery, alignment, and permission tests.
libs/storage/Tsavorite/cs/test/SimulatedFlakyDevice.cs Forwards Initialize to wrapped devices.
libs/storage/Tsavorite/cs/src/core/Utilities/CompletionEvent.cs Adds timed wait helper.
libs/storage/Tsavorite/cs/src/core/Tsavorite.core.csproj Copies libaio-only native fallback artifact.
libs/storage/Tsavorite/cs/src/core/Index/CheckpointManagement/LocalStorageNamedDeviceFactoryCreator.cs Adds backend/completion-thread factory parameters.
libs/storage/Tsavorite/cs/src/core/Index/CheckpointManagement/LocalStorageNamedDeviceFactory.cs Passes backend/completion-thread settings to created devices.
libs/storage/Tsavorite/cs/src/core/Device/TieredStorageDevice.cs Formatting-only cleanup.
libs/storage/Tsavorite/cs/src/core/Device/StorageDeviceBase.cs Documents constructor defaults and optional initialization.
libs/storage/Tsavorite/cs/src/core/Device/RandomAccessLocalStorageDevice.cs Switches to SafeFileHandle/RandomAccess and adds Linux direct-I/O open path.
libs/storage/Tsavorite/cs/src/core/Device/NullDevice.cs Formatting-only method expansion.
libs/storage/Tsavorite/cs/src/core/Device/LinuxFileExtensions.cs Adds Linux open(2) helpers for O_DIRECT.
libs/storage/Tsavorite/cs/src/core/Device/IDevice.cs Updates device initialization contract documentation.
libs/storage/Tsavorite/cs/src/core/Device/Devices.cs Adds native backend/completion-thread parameters.
libs/storage/Tsavorite/cs/src/core/Device/AsyncPool.cs Rolls back allocation count when item creation fails.
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs Adds bounded allocation retry backoff and modifies aligned read sizing.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/README.md Documents native backend benchmark scenarios.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs Adds io_uring backend parsing.
libs/storage/Tsavorite/cc/src/device/utility.h Adds trailing-zero helper.
libs/storage/Tsavorite/cc/src/device/native_device.h Adds native device interface, runtime segment sizing, probing, and backend types.
libs/storage/Tsavorite/cc/src/device/native_device_wrapper.cc Extends C ABI for backend selection and diagnostics.
libs/storage/Tsavorite/cc/src/device/native_device_error.h Adds thread-local native error channel.
libs/storage/Tsavorite/cc/src/device/file_windows.h Adds cross-backend stubs for Windows handler compatibility.
libs/storage/Tsavorite/cc/src/device/file_system_disk.h Converts segmented file logic to runtime segment size.
libs/storage/Tsavorite/cc/src/device/file_linux.h Adds libaio wake support and sharded io_uring handler declarations.
libs/storage/Tsavorite/cc/src/device/file_linux.cc Implements native Linux open/error handling, wake paths, and io_uring draining/submission changes.
libs/storage/Tsavorite/cc/src/CMakeLists.txt Links liburing when enabled and includes new native error header.
libs/storage/Tsavorite/cc/README.md Refreshes native-device build/runtime documentation.
libs/storage/Tsavorite/cc/CMakeLists.txt Raises CMake minimum and enables io_uring by default.
libs/server/Servers/GarnetServerOptions.cs Adds Garnet server options for native backend and completion threads.
libs/host/defaults.conf Adds default config entries for native backend settings.
libs/host/Configuration/Options.cs Adds CLI/config parsing for native backend settings.
Dockerfile.ubuntu Installs liburing runtime dependency.
Dockerfile.chiseled Stages liburing for chiseled runtime image.
Dockerfile.azurelinux Installs liburing runtime dependency.
Dockerfile.alpine Installs liburing package.
Dockerfile Installs liburing runtime dependency.
benchmark/README.md Adds Device.benchmark overview.
benchmark/Device.benchmark/README.md Adds full Device.benchmark usage guide.
benchmark/Device.benchmark/Program.cs Adds backend selection and success/error throughput accounting.
benchmark/Device.benchmark/Options.cs Adds benchmark file-size/backend option updates.
benchmark/Device.benchmark/BenchWorker.cs Aggregates errors and counts successful completions.

Comment thread libs/storage/Tsavorite/cc/src/device/file_linux.cc
Comment thread libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs Outdated
Comment thread libs/storage/Tsavorite/cc/src/device/native_device_wrapper.cc
Comment thread libs/storage/Tsavorite/cs/src/core/Device/Devices.cs
badrishc commented May 28, 2026

Collaborator Author

Device.benchmark

Drive: Dell Ent NVMe P5600 MU U.2 3.2 TB (logical=physical=512).
Kernel: Linux 6.8.0-117-generic (Ubuntu).
CPU: Intel Xeon Platinum 8380, NUMA0 pinned via numactl.
Workload: 4 KB random reads, file-size 16 GiB, segment 1 GiB, O_DIRECT, throttle 256, 10 s/run × 3 runs/config.
Build: dotnet build benchmark/Device.benchmark/Device.benchmark.csproj -c Release -f net10.0.
Completion threads: 1 for libaio, 4 for uring.

Command (per row, varying -t):

numactl --cpunodebind=0 --membind=0 \
  dotnet bin/Release/net10.0/Device.benchmark.dll \
    --file-name /path/to/test.dat \
    --file-size 17179869184 --segment-size 1073741824 \
    --device-type Native --io-backend {libaio|uring} \
    --completion-threads {1|4} \
    -t {1,2,4,8,16,32} -b 64 \
    --runtime 10 --throttle-limit 256

Throughput (ops/sec, libaio with ct=1, uring with ct=4, batch = 64, 3 runs each):

| submitter threads | libaio (run 1 / 2 / 3) | uring (run 1 / 2 / 3) |
|---|---|---|
| 1 | 250K / 256K / 249K | 241K / 251K / 242K |
| 2 | 422K / 416K / 421K | 431K / 429K / 416K |
| 4 | 730K / 730K / 742K | 712K / 718K / 720K |
| 8 | 746K / 742K / 742K | 747K / 752K / 753K |
| 16 | 746K / 746K / 740K | 748K / 750K / 749K |
| 32 | 743K / 744K / 744K | 747K / 743K / 751K |

Saturation reference (fio, same drive): ~772K random 4 KB reads.

Both backends saturate at ≥4 submitter threads (~712–753K ops/sec; the ≥8-thread rows are within ~2.5–4% of the fio ceiling, the 4-thread rows within ~4–8%). At t=1 the workload is latency-bound on a single submitter; throughput scales roughly linearly (about 1.7× per doubling) with submitter concurrency until saturation.

badrishc force-pushed the badrishc/optimize-device branch from 398a099 to 7d890ea on May 28, 2026 00:16
badrishc and others added 20 commits May 27, 2026 20:36
All 31 non-benchmark files changed on optimize-v2-io (vs its branch
base d3677cf) ported here. Backup tag: optimize-v2-io-prerebase-backup
@ 3f41f2b.

Scope: device/IO/native-backend ONLY. Includes:

Tsavorite C++ native device:
- io_uring backend + pluggable C ABI (file_linux.cc/h)
- error model split (native_device_error.h)
- file_system_disk + native_device.h updates
- CMakeLists + README

Tsavorite C# device:
- NativeStorageDevice: IoBackend enum (Default, Libaio, Uring),
  completion threads, production-readiness pass
- LinuxFileExtensions.cs: P/Invoke open() for true O_DIRECT
- ManagedLocalStorageDevice + RandomAccessLocalStorageDevice:
  O_DIRECT wiring on Linux
- Devices.cs: router updates for new device APIs

Tsavorite allocator + utilities:
- AllocatorBase: bounded backoff in TryAllocateRetryNow
- CompletionEvent: Wait(TimeSpan) overload

Tsavorite checkpoint management:
- LocalStorageNamedDeviceFactory + Creator surface ioBackend +
  completionThreads parameters

Tests:
- DeviceTests.cs updated for new device APIs

Garnet host:
- --device-io-backend, --device-completion-threads flags
- defaults.conf updated, GarnetServerOptions wiring

Dockerfiles (all 5): install liburing alongside libaio.

Note: YCSB.benchmark and KV.benchmark are not modified by this
commit; KV.benchmark is the supported benchmark on this branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…iburing runtime dep

- Options: --device-io-backend now accepts 'uring' (aliases: io_uring,
  iouring) in addition to libaio/default; help text + validation error
  list updated to match.

- KV.benchmark README: device-backend table now describes the libaio vs
  uring split, with a runtime-install snippet for liburing across
  Debian/Ubuntu, Fedora/RHEL/AzureLinux, and Alpine, plus link to the
  full Tsavorite Native Device docs. New 'native + libaio' and
  'native + uring' rows added to the cookbook for constrained-log
  large-dataset workloads.

- Tsavorite Native Device README: new top-level 'Runtime dependencies
  (end users)' section listing the apt/dnf/apk install lines and how to
  fall back to the no-liburing variant.

- KV.benchmark Validate: fix two bugs that surface when load and run
  use different thread counts and when the log spills to disk:
  1) writerThread reconstruction now uses ResolvedLoadThreads (not
     Options.Threads which is the RUN count), so --load-threads N with
     --threads M != N validates correctly.
  2) Reads of records below HeadAddress return Status.IsPending; the
     previous code counted these as misses. Validate now issues reads
     in batches of 256 and drains via CompletePendingWithOutputs,
     verifying each completed output against the per-thread pattern.
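
The backend-name handling from the first bullet above (accept libaio, default, and uring with its aliases; reject everything else with an actionable message) might look roughly like this. It is a C++ illustration rather than the C# option parser; `IoBackend` and `parse_io_backend` are hypothetical names, and the case-insensitive matching is an assumption, not something the commit message states.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical enum for illustration; mirrors the three documented choices.
enum class IoBackend { Default, Libaio, Uring };

// Maps a user-supplied backend name to an enum value.
// Lower-casing the input first is an assumption made for this sketch.
IoBackend parse_io_backend(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (name == "default") return IoBackend::Default;
    if (name == "libaio") return IoBackend::Libaio;
    if (name == "uring" || name == "io_uring" || name == "iouring")
        return IoBackend::Uring;
    // Reject unknown values instead of silently picking a default.
    throw std::invalid_argument("unknown io backend '" + name +
                                "'; expected default, libaio, or uring");
}
```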

Verified end-to-end with both backends:

  4.6M × 100B × 8T × log=256m (~580MB dataset > 256MB log → forces
  disk spill) × 50R/50U × --validate:
    native + libaio  → [validate] OK, run = 1.27 M ops/s
    native + uring   → [validate] OK, run = 1.02 M ops/s

  4.6M × 100B × 8T × log auto (fits) × 95R/5U × --validate:
    native + libaio  → [validate] OK, run = 16.10 M ops/s
    native + uring   → [validate] OK, run = 16.33 M ops/s

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…s on SafeFileHandle

Background:
  MarkHandleAsAsync used reflection to flip SafeFileHandle.IsAsync's
  non-public setter so a P/Invoke-opened O_DIRECT FD could be wrapped
  in 'new FileStream(handle, isAsync: true)' without throwing 'Handle
  does not support asynchronous operations'. The flag is non-public
  in .NET 8/10, the hack was fragile across future runtime versions,
  and on Linux IsAsync is a contract gate (no real overlapped I/O
  exists for files), so the lie bought us nothing beyond letting the
  FileStream constructor accept the handle.

RandomAccessLocalStorageDevice — refactor:
  - StorageAccessContext.handle is now SafeFileHandle (was FileStream).
  - CreateRead/WriteHandle:
      * Linux + O_DIRECT capable: LinuxFileExtensions.OpenDirect -> raw
        SafeFileHandle, no FileStream wrap. Page-cache bypass via the
        O_DIRECT flag at open(2), exactly as before.
      * Otherwise (Windows; or Linux when filesystem rejects O_DIRECT):
        File.OpenHandle(path, ..., FileOptions.Asynchronous | cast
        FILE_FLAG_NO_BUFFERING). On Windows this gives the runtime
        IOCP-bound OVERLAPPED I/O; on Linux it's page-cached.
  - All I/O goes through RandomAccess.{Read,Write}Async(safeHandle,
    memory, offset). On Windows: true kernel async via IOCP. On Linux:
    pread/pwrite dispatched to ThreadPool (same as before).
  - GetFileSize uses RandomAccess.GetLength(handle).
  - SetFileSize uses RandomAccess.SetLength(handle, size).

LinuxFileExtensions:
  - MarkHandleAsAsync and the IsAsyncProperty reflection are deleted
    entirely (no remaining callers).
  - System.Reflection using removed.

ManagedLocalStorageDevice:
  - Reverted to origin/main. This device is designed to stay within
    FileStream APIs; the O_DIRECT branch we added doesn't belong here.

Verified:
  - Tsavorite test.hlog DeviceTests: 36/36 passed.
  - KV.benchmark --device randomaccess --log-memory 256m --preallocate-log
    --rumd 50,50,0,0 --validate: PASS, 357 K ops/sec, iostat shows
    100-177 K real disk r/s and 53-90% NVMe util → O_DIRECT page-cache
    bypass confirmed.
  - KV.benchmark --device randomaccess (log fits) --validate: PASS,
    15.7 M ops/sec.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…+ ManagedLocal)

The Phase-7 hardening suite added in this branch was NativeStorageDevice-
specific. Add parametrized variants of the four tests that are pure IDevice
contract checks (not native-specific lifecycle/API), so they exercise all
three local-storage device implementations:

  - Hardening_AllDevices_RoundTrip_BasicReadWrite
  - Hardening_AllDevices_RoundTrip_AcrossSegmentBoundary
  - Hardening_AllDevices_Parallel_32ConcurrentWrites
  - Hardening_AllDevices_Parallel_BurstyTraffic

Each is parametrized by a new DeviceKind enum (Native, RandomAccess,
ManagedLocal). Native is gated on OperatingSystem.IsLinux() (the C++
shim links against libaio/liburing); the other two run on both Linux
and Windows.

A shared CreateDeviceForTest helper takes care of the per-kind ctor +
Initialize() dance so the test body stays uniform.

Result: 38/38 hardening tests pass on Linux (12 new cross-device + 26
native-only).

Native-specific tests retained as-is because they test API that doesn't
exist on the other devices:
  - Lifecycle (DisposeBeforeInitialize, InitializeTwice, etc.) —
    NativeStorageDevice defers Initialize from the ctor; the other
    devices initialize in their ctor.
  - Segment-size validation (NonPowerOfTwoSegmentSize_Throws, etc.) —
    Initialize() is the only callsite that validates.
  - Recovery_*_SegmentSize_* — the native device's open() path is the
    only place that records and re-validates per-segment-size metadata.
  - SectorSize stability across opens — not all devices expose this.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…evice_ buckets; fix AsyncPool creator-throw hang

Test refactor:
  - IDevice_*: 8 contract tests parametrized across Native, RandomAccess,
    ManagedLocal (round-trip basic, round-trip cross-segment, round-trip
    various segment sizes, 32 concurrent writes, 64 concurrent reads,
    mixed reads+writes, bursty traffic, stress burst of 100 writes,
    permission-denied callback contract). 33 cases total.
  - NativeStorageDevice_*: 16 native-only tests for behaviors managed
    devices don't have (deferred Initialize signature, recovery
    segment-size mismatch detection, sector-size discovery, sync-throw
    unaligned IO guard).

AsyncPool fix:
  GetOrAdd reserved a slot in totalAllocated before calling creator().
  If creator() threw (e.g. open() returned EACCES, ENOSPC), the slot
  was never released, so Dispose() would loop forever waiting for
  totalAllocated to drain to zero. This manifested as a process hang
  when a device pool's first open() failed. Roll back the reservation
  on exception so the failure propagates cleanly and the pool can
  still be disposed.
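
The reserve-then-roll-back pattern behind this fix can be sketched in a few lines. This is a reduced C++ illustration, not the C# AsyncPool: `TinyPool` and its members are hypothetical, and only the allocation counter is modeled.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical, much-reduced pool: only the allocation counter matters here.
template <typename T>
class TinyPool {
public:
    TinyPool(int capacity, std::function<T()> creator)
        : capacity_(capacity), creator_(std::move(creator)) {}

    // Reserves a slot, then creates the item; undoes the reservation on failure.
    T create() {
        if (total_allocated_.fetch_add(1) >= capacity_) {
            total_allocated_.fetch_sub(1);  // over capacity: release the slot
            throw std::runtime_error("pool exhausted");
        }
        try {
            return creator_();
        } catch (...) {
            // Roll back so a later dispose can see the count drain to zero.
            total_allocated_.fetch_sub(1);
            throw;
        }
    }

    int total_allocated() const { return total_allocated_.load(); }

private:
    int capacity_;
    std::function<T()> creator_;
    std::atomic<int> total_allocated_{0};
};
```

Without the `catch` branch, a throwing creator would leave the counter at 1 forever, which is the condition that made Dispose() spin.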

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…es, -1 = unbounded single segment

Before this change, IDevice.Initialize had the same signature for every
device but very different semantics:

  * StorageDeviceBase (RandomAccess, ManagedLocal, LocalStorage, NullDevice,
    LocalMemoryDevice): the ctor pre-set segmentSize = -1 / bits = 64 /
    mask = ~0, so calling an IO entry point without Initialize() silently
    ran in unbounded single-segment mode.
  * NativeStorageDevice: Initialize was MANDATORY (the C++ shim needs the
    segment size at create time for libaio/io_uring geometry), IO entry
    points threw if invoked first, and segmentSize = -1 was rejected.

This commit unifies the contract: every IDevice must have Initialize()
called exactly once before any IO, and segmentSize = -1 selects unbounded
single-segment mode on every device.

Implementation:

  * StorageDeviceBase
    - Added `initialized` flag (volatile) and `EnsureInitialized()`
      helper that throws InvalidOperationException with a clear message
      naming the device by FileName.
    - Ctor leaves `initialized = false` but keeps the safe fallback
      defaults (-1 / 64 / ~0) so any cold maintenance path that touches
      segmentSizeBits before the guard can't compute outright nonsense.
    - EnsureInitialized() called from the base address-based ReadAsync /
      WriteAsync overloads and TruncateUntilAddress / TruncateUntilAddressAsync.
    - Initialize sets `initialized = true` at the end.

  * NativeStorageDevice
    - Accepts segmentSize = -1: translates to 1UL << 63 for the native
      shim so the C++ FileSystemSegmentedFile's shift = log2(segment_size)
      math collapses every non-negative upper-layer address into segment 0
      (parity with the managed-side bits = 64 / mask = ~0). Single growing
      file on disk.
    - Tracks the value passed to native in nativeSegmentSizeBytes (replaces
      the diagnostic-only configuredSegmentSizeBytes long field, which
      couldn't hold 1<<63 without overflow).
    - ABI readback (NativeDevice_GetSegmentSize) compared against the value
      we sent to native, not the user-facing -1.
    - Always rejects omitSegmentIdFromFilename — the C++ shim has no
      omit-suffix code path, every segment is written as <base>.<segmentId>.
      Better to fail fast than silently produce wrong file names.

  * Concrete IO entry points (ReadAsync / WriteAsync / RemoveSegment /
    RemoveSegmentAsync) of NullDevice, LocalMemoryDevice,
    ManagedLocalStorageDevice, RandomAccessLocalStorageDevice,
    LocalStorageDevice, AzureStorageDevice, ShardedStorageDevice, and
    TieredStorageDevice now call EnsureInitialized() before doing work.
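
The -1 translation in the NativeStorageDevice bullet above relies on simple shift arithmetic: with a segment size of 1 << 63 the shift is 63, so every non-negative 64-bit address lands in segment 0. A small sketch of that arithmetic, with hypothetical helper names (`trailing_zeros`, `native_segment_size`, `segment_of`) that are not taken from the shim:

```cpp
#include <cassert>
#include <cstdint>

// Number of trailing zero bits; for a power of two this equals log2(value).
inline unsigned trailing_zeros(uint64_t value) {
    if (value == 0) return 64;
    unsigned n = 0;
    while ((value & 1u) == 0) { value >>= 1; ++n; }
    return n;
}

// Maps the user-facing segment size (-1 = unbounded) to the value that
// would be handed to the native side in this sketch.
inline uint64_t native_segment_size(int64_t segment_size) {
    return segment_size == -1 ? (uint64_t{1} << 63)
                              : static_cast<uint64_t>(segment_size);
}

// Segment index of an address: address >> log2(segment size).
inline uint64_t segment_of(uint64_t address, uint64_t native_size) {
    return address >> trailing_zeros(native_size);
}
```

For a 1 GiB segment size, an address just past 3 GiB maps to segment 3; in unbounded mode the same address maps to segment 0.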

Caller fix-ups:

  * LocalStorageNamedDeviceFactory.Get now calls device.Initialize(-1L)
    before returning. Commit / checkpoint metadata is single growing-file
    usage (segment 0 only, .0 suffix), so unbounded mode is the right
    default and unblocks every DeviceLogCommitCheckpointManager caller from
    needing to remember to initialize.
  * LocalStorageNamedDeviceFactory.ListContents skips dotfile entries —
    defensive against a pre-existing race in
    LinuxFileExtensions.IsDirectIOSupported where a .tsavorite-odirect-probe-*
    temp file can leak in the commit dir if File.Delete races with
    File.GetFiles. Without this filter, leaked probe files surface as
    Int64.Parse("") failures in DefaultCheckpointNamingScheme.CommitNumber.
  * SimulatedFlakyDevice.Initialize now propagates to the wrapped device.
  * ComponentRecoveryTests Setup_* helpers call Initialize(-1) on devices
    they construct directly (bypass the Tsavorite allocator path which
    normally Initializes).

Tests (DeviceTests.cs):

  * IDevice_ReadAsyncBeforeInitialize_Throws(kind) × 3 — new contract test.
  * IDevice_WriteAsyncBeforeInitialize_Throws(kind) × 3 — same.
  * IDevice_Initialize_SegmentSizeMinusOne_UnboundedSingleSegment(kind) × 3
    — write at offset 1 MiB (would be in segment-N for any positive size)
    and read back, confirming -1 routes through segment 0 on all 3 kinds.
  * NativeStorageDevice_Initialize_OmitSegmentIdFromFilename_Throws — new
    native-only test for the omit rejection in both -1 and explicit-size
    modes.
  * Removed NativeStorageDevice_{Read,Write}AsyncBeforeInitialize_Throws
    (now subsumed by the IDevice_ variants).

Docs:

  * IDevice.Initialize docstring rewritten to spell out the new contract
    and the -1 semantics. NativeStorageDevice.Initialize remarks updated.

Verified on Linux net10.0 Release:
  * 599 hlog tests (491 passed + 108 skipped)
  * 305 recovery tests
  * 144 + 155 + 127 + 346 = 772 other Tsavorite + Garnet RespTests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…probe (no race)

Two related fixes on top of the unified Initialize contract:

1) NativeStorageDevice now supports omitSegmentIdFromFilename
   ─────────────────────────────────────────────────────────
   Previously Native rejected the omit flag because the C++ shim hard-coded
   the '.<segmentId>' suffix in three places in file_system_disk.h. This made
   the IDevice contract asymmetric (managed devices honored omit, Native
   didn't). Fix by threading the bool through the entire C++/C ABI:

   C++ (libs/storage/Tsavorite/cc/src/device/):
     * FileSystemSegmentBundle: new bool omit_segment_id_; both ctors accept it
       and use a new segment_path(idx) helper that returns just filename_
       when set, otherwise filename_ + '.' + std::to_string(idx). Used at all
       three locations that previously hard-coded the suffix.
     * FileSystemSegmentedFile: new bool omit_segment_id_ (const) wired through
       ctor and propagated to bundles allocated by OpenSegment.
     * NativeDeviceImpl: new bool omit_segment_id constructor param; recorded
       as omit_segment_id_ member. ValidateRecoveredSegments short-circuits
       in omit mode (single bare-named file, segment-size mismatch check is
       meaningless when there's no .<id> suffix to scan for).
     * native_device_wrapper.cc / NativeDevice_CreateWithBackend: new trailing
       'bool omit_segment_id' parameter. ABI BUMP — managed wrapper updated
       to match; Linux .so rebuilt and committed at
       libs/storage/Tsavorite/cs/src/core/Device/runtimes/linux-x64/native/
       libnative_device.so. **Windows DLL must be rebuilt by user** with
       cmake -G 'Visual Studio 17 2022' -A x64 -T v143,spectre=true.

   C# (libs/storage/Tsavorite/cs/src/core/Device/):
     * NativeStorageDevice P/Invoke signature updated.
     * NativeStorageDevice.Initialize removes the 'always rejects omit' guard.
       It now accepts omit:true together with segmentSize = -1 and forwards
       to native; rejects omit:true together with a positive segmentSize
       with a clear error message (multiple segments would collapse onto
       the same on-disk path and clobber each other).

   Tests (DeviceTests.cs):
     * IDevice_Initialize_OmitSegmentIdFromFilename_BareFileName(kind) × 3:
       writes via Initialize(-1, omit:true) and asserts the on-disk file is
       the bare basename (no .0 suffix). Replaces the native-only 'throws'
       test from the previous commit.
     * IDevice_Initialize_OmitSegmentIdFromFilename_WithoutMinusOne_Throws(kind)
       × 3: enforces the no-positive-size-with-omit invariant on every kind.
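
The naming rule from item 1 is small enough to show directly. The sketch below follows the description of segment_path(idx) given above (bare filename in omit mode, otherwise filename + '.' + index) as a free function; the actual member signature in file_system_disk.h is not shown in this PR text, and `omit_is_valid` is a hypothetical helper mirroring the managed-side invariant.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// On-disk path for a segment: bare filename in omit mode,
// otherwise "<filename>.<segment index>".
std::string segment_path(const std::string& filename, uint64_t idx,
                         bool omit_segment_id) {
    return omit_segment_id ? filename : filename + "." + std::to_string(idx);
}

// Omit mode is only safe when there is a single unbounded segment (-1);
// with a positive segment size, all segments would share one path.
bool omit_is_valid(int64_t segment_size, bool omit_segment_id) {
    return !omit_segment_id || segment_size == -1;
}
```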

2) IsDirectIOSupported uses O_TMPFILE (race-free probe)
   ─────────────────────────────────────────────────────
   The previous probe in libs/storage/Tsavorite/cs/src/core/Device/
   LinuxFileExtensions.cs created a hidden '.tsavorite-odirect-probe-<pid>-
   <guid>' file in the device's directory, then File.Delete'd it in a
   silent-catch finally. Multiple concurrent commits (one device per Get())
   ran probes simultaneously; concurrent ListContents calls from
   CommitRecordBoundedGrowthTest would observe the probe file during its
   brief lifetime, and DefaultCheckpointNamingScheme.CommitNumber would then
   throw FormatException on long.Parse(''). 18/20 baseline failure rate.

   Switching from create+unlink to open(directory, O_TMPFILE | O_RDWR |
   O_DIRECT) tells the kernel to allocate an anonymous inode in the
   directory's filesystem with NO directory entry. The probe inode is
   invisible to readdir/getdents regardless of timing; concurrent
   ListContents cannot observe it. Freed on close. Linux >= 3.11 +
   ext4/xfs/tmpfs/btrfs all support it. If O_TMPFILE itself fails
   (EOPNOTSUPP on some filesystem) we conservatively report 'no O_DIRECT'
   so the device falls back to the page-cache path — no named-file
   fallback because that's the bug we're fixing.
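
   A minimal sketch of such a probe, assuming Linux with glibc-style
   headers that expose O_TMPFILE and O_DIRECT. The function name
   `probe_direct_io` and the 0600 mode are illustrative choices; the
   actual probe lives in managed code in LinuxFileExtensions.cs and calls
   open(2) through P/Invoke.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Returns true if an anonymous O_DIRECT file can be opened in `directory`.
// No directory entry is ever created, so concurrent listings cannot see it.
bool probe_direct_io(const char* directory) {
#if defined(O_TMPFILE) && defined(O_DIRECT)
    int fd = ::open(directory, O_TMPFILE | O_RDWR | O_DIRECT, 0600);
    if (fd < 0) return false;  // conservative: report "no O_DIRECT"
    ::close(fd);               // the unnamed inode is freed on close
    return true;
#else
    return false;              // flags unavailable on this platform
#endif
}
```

   Any failure, including EOPNOTSUPP from a filesystem without O_TMPFILE
   support, is reported as "not supported", which matches the
   conservative fallback described above.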

   Reverts the dotfile filter in LocalStorageNamedDeviceFactory.ListContents
   added by the previous commit; the underlying race is now eliminated at
   the kernel level so the workaround is unnecessary.

   20/20 LogFastCommitTests runs pass after the change (was 2/20 on baseline,
   3/20 on the previous unlink-after-open attempt which still raced).

Verified on Linux net10.0 Release:
  * 612 hlog tests (504 passed + 108 skipped)
  * 305 recovery tests
  * 62 device tests (IDevice contract + NativeStorageDevice-specific)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Captures the 100M-key disk-bound thread-scale experiment from the
optimize-device branch sweep so it can be reproduced verbatim in the
future:

  - 100M × 100B records (12.8 GB on disk)
  - 16 MB log so ~0.125 % of dataset is in memory and almost every
    read is a 4 KB random disk fetch
  - 8 load threads, run-threads sweep 1,2,4,8,16,32 at 15s each
  - One row per backend (RandomAccess / native+libaio / native+uring)

Added a short note after the table explaining what to compare against
(the disk's fio ceiling at 4K-aligned QD=64-per-job), the expected
~2 min wall-clock per device, and the observed per-backend plateau
characteristics so the next operator knows what 'good' looks like.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Device.benchmark fixes (previously reported throughput could be 2x inflated):
 - Throughput counter now tallies successful completions only. Before, every
   ReadAsync call was counted as success even when the kernel returned
   EAGAIN (Status::IOError=4 — flooded libaio io_context ring). Under
   --throttle-limit 0 with high QD, ~40% of "ops" were errored requests.
 - Per-error-code histogram printed at end of run; no per-error
   Console.WriteLine (was emitting millions of serial writes/run, both
   falsifying numbers and slowing the real path).
 - DEBUG data validation skipped on errored ops (was reading garbage from
   the destination buffer on EAGAIN paths and reporting spurious
   "Data mismatch").
 - --throttle-limit help text documents the libaio kernel ring trap
   (128 slots wide; high QD + no throttle floods it) and recommends
   --throttle-limit 128 (also the io_uring SQ depth this build uses).
 - --io-backend flag added (libaio / uring / default) so the existing
   Linux Native path can be exercised against either backend. Unknown
   values are rejected with an actionable message at startup instead of
   silently falling back to default.
 - --completion-threads is now wired through to the Linux Native ctor
   (was hardcoded to 1).
 - --file-size widened to long (was int; --file-size > 2GB threw a parse
   error).
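
The success-only counter and per-error-code histogram can be sketched as
follows. This is an illustration in C++ rather than the benchmark's own
code; IoTally, kMaxStatusCode and the bucket layout are assumptions, and
only the value 4 for Status::IOError comes from the text above.

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <cstdio>

// Hypothetical upper bound on status codes for this sketch.
constexpr int kMaxStatusCode = 8;

struct IoTally {
    std::atomic<uint64_t> ok{0};
    // errors[code] counts completions that returned `code`;
    // errors[0] is reused as the bucket for out-of-range codes.
    std::array<std::atomic<uint64_t>, kMaxStatusCode> errors{};

    // Called once per completion. Only status == 0 counts as throughput.
    void Record(int status) {
        if (status == 0) {
            ok.fetch_add(1, std::memory_order_relaxed);
        } else if (status > 0 && status < kMaxStatusCode) {
            errors[status].fetch_add(1, std::memory_order_relaxed);
        } else {
            errors[0].fetch_add(1, std::memory_order_relaxed);
        }
    }

    double OkPerSecond(double elapsed_secs) const {
        return elapsed_secs > 0
                   ? static_cast<double>(ok.load()) / elapsed_secs
                   : 0.0;
    }

    // Printed once at the end of a run, never per error.
    void PrintHistogram() const {
        for (int code = 0; code < kMaxStatusCode; ++code) {
            uint64_t n = errors[code].load();
            if (n != 0) {
                std::printf("status %d: %llu\n", code,
                            static_cast<unsigned long long>(n));
            }
        }
    }
};
```

Keeping the error path to a single relaxed increment avoids the serial
console writes that previously distorted the measured numbers.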

KV.benchmark + Devices.cs: clarified XML/help text for the existing
--device-completion-threads / numCompletionThreads parameter to describe
the current behavior (multiple drainer threads share one kernel
io_context / io_uring per device).

No behavior change for KV.benchmark or core Tsavorite Devices.cs API
beyond the Device.benchmark surface and clarified docs.
Adds N independent kernel io_contexts (libaio) / io_urings (uring) per
NativeStorageDevice, with completion threads bound 1:1 to contexts.
libaio internally always uses 1 context (sharding empirically gave
nothing; the kernel mutex is efficient at all tested loads), but the sharded
ABI surface is kept across both backends so the templated
NativeDeviceImpl<HandlerT> doesn't need a fork.

C ABI changes (libnative_device.so):
- NativeDevice_CreateWithBackend signature bumped: trailing int32 num_io_contexts.
- New exports: NativeDevice_QueueRunFor(device, ctx_idx, timeout_secs),
  NativeDevice_NumIoContexts(device).
- Legacy NativeDevice_QueueRun kept; under uring sharding it scans all
  rings (back-compat for any single-thread drainer).

C# (NativeStorageDevice):
- Synchronous ABI probe at Initialize() that converts EntryPointNotFoundException
  into a clear TsavoriteException listing the missing exports and how to
  rebuild — guards against a stale .so silently hanging Dispose's drain loop.
- Probe is intentionally gated by the QueueRun branch so it runs only on
  backends that actually use the new symbols at runtime: Linux Native (libaio
  / uring) where QueueRun returns >= 0, NOT on Windows IOCP where the
  ThreadPoolIoHandler returns -1 by design. This means a stale Windows DLL
  keeps working unchanged because it never calls the sharded exports.
  The cdecl/x64 ABI silently tolerates the new trailing num_io_contexts arg on
  NativeDevice_CreateWithBackend.
- Completion threads bound 1:1 via QueueRunFor(ctxIdx) (no closure capture
  bug — ctxIdx is captured per-iteration into a local).
- numCompletionThreads is the user-facing knob; native side decides how many
  contexts to actually create (libaio: always 1; uring: honours the request).

Empirical justification (Device.benchmark, NVMe, 4K random reads,
batch=4096, throt=512, fio ceiling 749K, NUMA0-pinned):

  libaio CT=1 (always 1 ctx + 1 drainer):    755K ops/sec (t=32)
  uring  CT=1 (1 ring + 1 drainer):          357K ops/sec  ← SpinLock-bound
  uring  CT=1 ring + N drainers (regresses): 357K → 274K  ← cq_lock contention
  uring  CT=4 (4 rings + 4 drainers):        745K ops/sec
  uring  CT=8 (8 rings + 8 drainers):        758K ops/sec  ← hardware ceiling

Both backends now reach the hardware NVMe ceiling. uring requires sharding
(the user-space SpinLock around io_uring_get_sqe + prep + submit is the
real cap). libaio doesn't need sharding (the kernel io_context mutex is
already efficient at all tested loads).

Files:
- file_linux.h  : UringIoHandler sharded (vector<io_uring*>, per-ring sq_lock
                  + cq_lock, atomic round-robin pick_ring). QueueIoHandler
                  unchanged on the data plane (single io_context_t) but
                  exposes the same num_contexts()/TryCompleteFor/QueueRunFor
                  surface as inline stubs for ABI symmetry.
- file_linux.cc : new UringIoHandler impls; QueueIoHandler unchanged.
- file_windows.h: stub overloads (num_contexts()=1, QueueRunFor=-1, 2-arg ctor)
                  so the templated NativeDeviceImpl compiles unchanged.
- native_device.h, native_device_wrapper.cc: ABI plumbing as above.
- NativeStorageDevice.cs: ABI probe + per-context drain workers.
- runtimes/linux-x64/native/libnative_device.so: rebuilt with sharding.
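
The "atomic round-robin pick_ring" idea can be sketched as below. This is a
stand-alone illustration, not the code in file_linux.h; RoundRobinPicker
and its members are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the ring selection in a sharded handler.
class RoundRobinPicker {
public:
    explicit RoundRobinPicker(uint32_t num_rings)
        : num_rings_(num_rings == 0 ? 1 : num_rings) {}

    // Each submitter takes the next ring index. fetch_add is the only
    // shared write, so submitters never take a lock just to choose a ring.
    uint32_t Pick() {
        return next_.fetch_add(1, std::memory_order_relaxed) % num_rings_;
    }

    uint32_t num_rings() const { return num_rings_; }

private:
    const uint32_t num_rings_;
    std::atomic<uint32_t> next_{0};
};
```

Relaxed ordering is sufficient in this sketch because the counter only
spreads load; correctness of a submission would still rest on the lock of
whichever ring was picked.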
…VMe IOPS

Captures the verified copy-paste recipe for both Linux Native backends
(libaio and io_uring) to hit the hardware ceiling on a Dell P5600-class
NVMe, alongside a flag reference, output-schema explanation, and
troubleshooting table.

Headline recipes verified end-to-end on the reference setup:
  libaio --completion-threads 1 --threads 16 --throttle-limit 512  →  743K ops/sec
  uring  --completion-threads 8 --threads 16 --throttle-limit 512  →  738K ops/sec
(Both within 2 % of the table values in the README; zero kernel-side errors.)

Key facts documented:
- libaio always uses one io_context in this build regardless of
  --completion-threads (sharding empirically gave nothing; the hint is
  ignored). Pass 1 explicitly so scripts are self-describing.
- io_uring needs sharded rings (CT >= 4) to escape the per-ring user-space
  SpinLock cap around io_uring_get_sqe + prep + submit; CT=8 is the safe
  peak.
- --throttle-limit must be set to at least the per-ring/per-context depth
  (128 in this build for both backends). --throttle-limit 0 floods the
  kernel ring and the benchmark correctly surfaces Status::IOError=4 in
  the per-code histogram rather than tallying errored ops as throughput.
- --file-size must be a multiple of 1024 × --sector-size (fill phase
  uses a 1024-sector temp buffer).
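
The --file-size constraint is simple arithmetic; a sketch of the check,
assuming the only requirement is divisibility by 1024 sectors
(IsValidFileSize and kFillBufferSectors are hypothetical names):

```cpp
#include <cstdint>

// Number of sectors in the fill-phase temp buffer, per the text above.
constexpr int64_t kFillBufferSectors = 1024;

// Hypothetical validator. 64-bit sizes are used so values above 2 GB
// are representable.
bool IsValidFileSize(int64_t file_size, int64_t sector_size) {
    if (file_size <= 0 || sector_size <= 0) {
        return false;
    }
    const int64_t chunk = kFillBufferSectors * sector_size;
    return file_size % chunk == 0;
}
```

With a 4096-byte sector the chunk is 4 MiB, so a 4 GiB file passes and a
4,096,000-byte file does not.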

Plus a section comparing Device.benchmark vs KV.benchmark to direct
readers to the right tool: Device.benchmark for IO-layer ceiling
validation (saturates NVMe), KV.benchmark for full-path throughput
(currently caps ~30 % below the IO ceiling on the upper-layer pending-read
path — see KV.benchmark README for that side of the story).

Also adds a one-paragraph pointer in benchmark/README.md so the new
README is discoverable from the top-level benchmarks listing.
…full

ScheduleOperation in both QueueFile (libaio) and UringFile (io_uring) now
retries the kernel-side submission on the transient back-pressure signal
(libaio: `io_submit == 0`; uring: `io_uring_get_sqe == nullptr`) up to
kMaxSubmitRetries = 8 attempts, each separated by a sched_yield(). Permanent
errors (libaio io_submit < 0, uring io_uring_submit < 0) are NEVER retried —
they surface immediately as Status::IOError.

Motivation: the upper-layer throttle gate in AllocatorBase.AsyncGetFromDisk
is a racy test-then-increment (Throttle() reads numPending non-atomically,
then ReadAsync does Interlocked.Increment). With N concurrent submitters
all passing the gate at numPending == ThrottleLimit, in-flight can spike
to ThrottleLimit + N momentarily. With the production default ThrottleLimit
of 120, that exceeds the 128-slot per-context / per-ring kernel ring depth
when N > 8 (which is normal for Garnet under heavy disk-bound load).
Pre-fix, the kernel rejected the overshoot submissions with EAGAIN, which
Tsavorite handled by re-routing through the full AllocatorBase pending-read
retry loop (correct but expensive: a full round-trip per IOError). Post-fix,
the burst is absorbed locally by a handful of sched_yields and never
surfaces to the upper layer.

Why sched_yield + bounded retries is the right shape: on a 750K-IOPS NVMe
the kernel ring drains a slot every ~1.3 µs, so the typical 24-slot
overshoot from a 32-thread burst clears in about 32 µs. sched_yield is
typically 1-10 µs on Linux, so 8 retries (worst-case ~40-80 µs window) are
more than enough to absorb it. For genuine sustained overload (application
submission rate
exceeds device IOPS for seconds), the retries exhaust and Status::IOError
surfaces — which is the correct signal for the caller to apply
back-pressure.
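
The sizing argument can be checked with a few lines of arithmetic. The
helpers below are hypothetical and only restate the numbers from the text:
a 128-slot ring, the 120 throttle limit used as the production default, and
a 750K-IOPS device.

```cpp
// Slots by which a burst exceeds the kernel ring, assuming every one of
// `submitters` threads passes the throttle gate at the same moment.
int OvershootSlots(int throttle_limit, int submitters, int ring_depth) {
    const int peak_in_flight = throttle_limit + submitters;
    return peak_in_flight > ring_depth ? peak_in_flight - ring_depth : 0;
}

// Time in microseconds for the device to complete `slots` operations
// at a steady `device_iops`.
double DrainMicros(int slots, double device_iops) {
    return slots * 1e6 / device_iops;
}
```

For example, OvershootSlots(120, 32, 128) gives 24 slots, and draining 24
slots at 750K IOPS takes 32 µs, which sits inside the retry window quoted
above.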

Implementation notes:
- libaio: simple loop around io_submit. No lock to release; io_submit is
  kernel-thread-safe per io_context, so concurrent submitters serialise
  inside the kernel.
- uring: must release sq_lock around sched_yield. Holding a SpinLock
  across a syscall would stall every other submitter on the same ring.
  Only the get_sqe == nullptr path is retried; if get_sqe succeeded
  we've already "consumed" an SQE slot in the user-side bookkeeping
  and re-issuing via get_sqe+prep on retry would corrupt the ring
  (we'd hold two SQEs for one logical op). For SQPOLL-disabled rings
  (our setup) io_uring_submit returns 1 in steady state — a non-1 there
  is an unrecoverable kernel-side error and surfaces immediately.
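
A sketch of the bounded-retry shape, written against a caller-supplied
callable so it compiles without libaio or liburing. SubmitResult and
SubmitWithBoundedRetry are hypothetical names; only kMaxSubmitRetries = 8
and the use of sched_yield() come from the text above.

```cpp
#include <sched.h>

constexpr int kMaxSubmitRetries = 8;  // value from the commit message

enum class SubmitResult { kOk, kTransientFull, kPermanentError };

// try_submit stands in for one kernel-side submission attempt. It must
// return kTransientFull only for the back-pressure signal (ring full);
// any other failure is kPermanentError and is returned immediately.
template <typename TrySubmit>
SubmitResult SubmitWithBoundedRetry(TrySubmit&& try_submit) {
    for (int attempt = 0; attempt < kMaxSubmitRetries; ++attempt) {
        const SubmitResult r = try_submit();
        if (r != SubmitResult::kTransientFull) {
            return r;  // success or permanent error: never retried
        }
        if (attempt + 1 < kMaxSubmitRetries) {
            sched_yield();  // give the completion side time to free a slot
        }
    }
    // Budget exhausted: surface the condition so the caller can apply
    // back-pressure.
    return SubmitResult::kTransientFull;
}
```

In a uring-style handler the callable would take and release the per-ring
submission lock inside each attempt, so that no lock is held across the
sched_yield() call.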

Verified end-to-end (Device.benchmark, NVMe, 4K random reads, batch=4096,
NUMA0-pinned, --throttle-limit 120 to match production NativeStorageDevice
default):
  libaio CT=1: t=8 / 16 / 32 → 626K / 631K / 627K ok/sec, 0 err
  uring  CT=8: t=8 / 16 / 32 → 622K / 625K / 626K ok/sec, 0 err

At extreme intentional-overload settings (--throttle-limit 4096, t=32)
errors still appear — confirming the retry budget correctly exhausts
when the application is genuinely outpacing the device:
  libaio CT=1, t=32, throt=4096: 754K ok, 14.7M code4 err (~5% of submits)
  uring  CT=8, t=32, throt=4096: 745K ok,    0 err
(uring still produces 0 err under same overload because 8 rings × 128 =
1024 SQ slots is large enough that even gross over-submission fits within
the retry budget per ring.)
BLOCKER fixes:
- NativeStorageDevice.Dispose UAF race. Previously NativeDevice_Destroy(nativeDevice)
  ran before nativeDevice was nulled, so a concurrent guard-bypassed P/Invoke could
  observe a non-zero handle that points to freed memory. Fix: Interlocked.Exchange
  atomically captures-and-nulls the handle; destroy runs on the captured pointer.
  EnsureReadyOrSilent now checks disposedFlag first.
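
The capture-and-null step has a direct C++ analogue in
std::atomic::exchange. The sketch below is an illustration of the pattern
only, not the managed code; DeviceHandle and NativeDevice are hypothetical
stand-ins.

```cpp
#include <atomic>

struct NativeDevice { int id; };  // stand-in for the native handle target

class DeviceHandle {
public:
    explicit DeviceHandle(NativeDevice* d) : handle_(d) {}
    ~DeviceHandle() { Dispose(); }

    DeviceHandle(const DeviceHandle&) = delete;
    DeviceHandle& operator=(const DeviceHandle&) = delete;

    // Atomically takes ownership of the pointer and publishes nullptr in
    // the same step, then destroys the captured pointer. Returns true if
    // this call performed the destruction.
    bool Dispose() {
        NativeDevice* captured =
            handle_.exchange(nullptr, std::memory_order_acq_rel);
        if (captured == nullptr) {
            return false;  // already disposed by another caller
        }
        delete captured;
        return true;
    }

    bool IsLive() const {
        return handle_.load(std::memory_order_acquire) != nullptr;
    }

private:
    std::atomic<NativeDevice*> handle_;
};
```

The exchange closes the window in which a reader sees a non-null handle
after destruction has started. It does not protect a caller that loaded the
pointer earlier and is still using it, which is why a separate disposed
check at the entry points is still needed.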

HIGH fixes:
- UringFile::ScheduleOperation SQE leak on transient io_uring_submit failure. After
  a successful get_sqe + prep, the SQE is committed to the user-side SQ ring; a
  -EAGAIN/-EBUSY return from io_uring_submit left the slot permanently occupied
  with no kernel iocb, eventually starving get_sqe forever. Fix: retry
  io_uring_submit (without re-preparing) up to kMaxSubmitRetries on transient
  negatives, with sched_yield (and sq_lock released) between attempts.
- UringIoHandler::Init partial-init leak. If new SpinLock() threw after
  io_uring_queue_init succeeded for ring i, the already-initialized ring leaked
  (the class dtor doesn't run on partial construction). Fix: use std::unique_ptr
  RAII holders during construction; release into the member vectors only after
  all allocations succeed.
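
The partial-init fix follows a common RAII pattern: hold every resource in
a std::unique_ptr while construction can still fail, and hand raw pointers
to the long-lived vectors only at the end. A sketch with stand-in types
(Ring, Lock, ShardedRings and g_live_rings are all hypothetical):

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

int g_live_rings = 0;  // instrumentation for the sketch only

struct Ring {  // stand-in for an initialized io_uring
    Ring() { ++g_live_rings; }
    ~Ring() { --g_live_rings; }
};
struct Lock {};  // stand-in for SpinLock

class ShardedRings {
public:
    ShardedRings() = default;
    ShardedRings(const ShardedRings&) = delete;
    ShardedRings& operator=(const ShardedRings&) = delete;
    ~ShardedRings() {
        for (Ring* r : rings_) delete r;
        for (Lock* l : locks_) delete l;
    }

    // make_lock(i) may throw. Intended to be called once per object.
    template <typename MakeLock>
    void Init(int n, MakeLock make_lock) {
        std::vector<std::unique_ptr<Ring>> rings;
        std::vector<std::unique_ptr<Lock>> locks;
        for (int i = 0; i < n; ++i) {
            rings.push_back(std::make_unique<Ring>());
            locks.push_back(make_lock(i));  // on throw, holders clean up
        }
        // Reserve first so the push_backs below cannot throw midway.
        rings_.reserve(rings.size());
        locks_.reserve(locks.size());
        for (auto& r : rings) rings_.push_back(r.release());
        for (auto& l : locks) locks_.push_back(l.release());
    }

    std::size_t size() const { return rings_.size(); }

private:
    std::vector<Ring*> rings_;
    std::vector<Lock*> locks_;
};
```

If make_lock throws for ring i, the rings created so far are destroyed
during unwinding and the object is left empty, so nothing leaks.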

POLISH fixes:
- NativeStorageDevice.Initialize tail-throw cleanup. If base.Initialize threw after
  the native device and completion threads were created, both leaked. Now wrapped
  in try/catch that cancels token, joins threads, and destroys the native device.
- UringIoHandler rule-of-5 hygiene: explicitly deleted copy ctor, copy-assign, and
  move-assign so the implicit shallow copies (which would double-delete the raw
  owning pointers) cannot be generated.
- DispatchUringCqe: added static_assert(is_trivially_destructible<IoCallbackContext>)
  so a future non-trivial member fails the build instead of silently leaking.
- NativeDeviceImpl::num_io_contexts: removed unnecessary const_cast (the underlying
  num_contexts() is already const on both backends).
- Removed duplicate XML <summary> on NativeStorageDevice.Dispose.
- FileSystemDisk dead-code ctor: passed the now-required 5th arg to
  FileSystemSegmentedFile so the file compiles if anyone instantiates it.
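
Two of the hygiene items above are compile-time guards and can be shown in
a few lines. The types below are stand-ins: the fields of
IoCallbackContext and the RingOwner class are hypothetical, only the
static_assert and the set of deleted members follow the text.

```cpp
#include <type_traits>

// Stand-in context; the real struct's fields are not shown in this log.
struct IoCallbackContext {
    void* user_callback;
    unsigned long long offset;
    unsigned int length;
};

// Fails the build if someone later adds a member with a non-trivial
// destructor (for example a std::string) that would need explicit cleanup.
static_assert(std::is_trivially_destructible<IoCallbackContext>::value,
              "IoCallbackContext must stay trivially destructible");

// A class that owns a raw pointer must not be shallow-copied, otherwise
// two objects would delete the same pointer.
class RingOwner {
public:
    RingOwner() : ring_(new int(0)) {}
    ~RingOwner() { delete ring_; }

    RingOwner(const RingOwner&) = delete;
    RingOwner& operator=(const RingOwner&) = delete;
    RingOwner& operator=(RingOwner&&) = delete;

private:
    int* ring_;  // stand-in for a raw owning pointer
};
```

Because a copy constructor is user-declared, no move constructor is
generated implicitly, so the class cannot be copy- or move-constructed.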

Comment hygiene sweep (per explicit review-rule #4 "comments should not refer
to design thought processes"):
- Removed embedded benchmark results, hardware-specific throughput numbers, and
  historical narrative from class/method documentation in file_linux.{h,cc},
  native_device.h, native_device_wrapper.cc, NativeStorageDevice.cs.
- Kept WHAT each method does and the invariants it enforces; moved WHY this
  approach was chosen out of source comments (the commit log is the appropriate
  place for that context).
- Net: -162 lines across 7 files, no behavior change from the trim itself.

Verified post-fix performance unchanged (Device.benchmark, NVMe, 4K random reads,
batch=4096, NUMA0-pinned):
  libaio CT=1 t=16 throt=512:  746K ok/sec, 0 err  (hardware ceiling)
  uring  CT=8 t=16 throt=512:  743K ok/sec, 0 err  (hardware ceiling)
  libaio CT=1 t=32 throt=120:  636K ok/sec, 0 err  (production default)
  uring  CT=8 t=32 throt=120:  618K ok/sec, 0 err  (production default)
…reation

NativeStorageDevice.Initialize used to perform all the heavy work (native
device creation, completion-thread spawn, ABI / segment-size / sector-size
cross-checks) eagerly and threw "called more than once" on a second call.
The other IDevice implementations (LocalStorageDevice, RandomAccessLocalStorageDevice,
ManagedLocalStorageDevice) all inherit a metadata-only StorageDeviceBase.Initialize
that simply overwrites segmentSize / segmentSizeBits / mask fields and is
silently idempotent. They open their per-segment OS handles lazily inside the
IO methods.

This contract mismatch broke any caller that invokes Initialize twice on the
same NSD instance. The canonical case is
LocalStorageNamedDeviceFactory.Get(), which calls Initialize(-1L) as a
defensive pre-init so consumers can't forget; the consumer (snapshot
checkpoint state machine SnapshotCheckpointSMTask, cluster checkpoint
streaming TsavoriteCheckpointReader.CreateCheckpointDevice) then calls
Initialize(actualSegmentSize). Under the old NSD that throws; under the new
NSD it works the same way the other backends do.

Implementation:
- NSD.Initialize is now metadata-only — delegates to base.Initialize.
  Pre-flight argument validation (segmentSize power-of-two, sector-size
  floor, omitSegmentIdFromFilename) is preserved.
- New EnsureNativeDeviceCreated() does the heavy work, lazily, on first IO.
  Reads the latest base.segmentSize and base.OmitSegmentIdFromFileName so
  whichever Initialize call ran most recently wins.
- Thread-safe via double-checked locking on a new nativeCreateLock. The
  publish of nativeDevice uses Volatile.Write so a second observer of
  nativeDevice != IntPtr.Zero is guaranteed to see a fully-initialised
  handle with completion threads already running.
- Dispose now also takes nativeCreateLock around the cancel-join-destroy
  sequence so it cannot race with a concurrent EnsureNativeDeviceCreated
  (which would otherwise leak a freshly-published native handle and its
  completion threads).
- IO entry points (ReadAsync, WriteAsync) call EnsureNativeDeviceCreated()
  before submission. Bookkeeping entry points (Reset, TryComplete,
  GetFileSize, RemoveSegment) no-op when the native handle has not been
  created yet, matching the semantics of the other backends (Reset on a
  device with no open handles is a no-op).
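
The lazy-creation scheme is double-checked locking with a release publish.
A C++ analogue is sketched below as an illustration; LazyDevice, Configure
and the counters are hypothetical, and the managed code uses a lock plus
Volatile.Write rather than std::atomic.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

struct NativeDevice { int64_t segment_size; };  // stand-in

class LazyDevice {
public:
    ~LazyDevice() { delete device_.load(); }

    // Metadata-only and repeatable: the most recent call wins.
    void Configure(int64_t segment_size) {
        std::lock_guard<std::mutex> guard(create_lock_);
        segment_size_ = segment_size;
    }

    // Called at the top of every IO entry point.
    NativeDevice* EnsureCreated() {
        // Fast path: acquire pairs with the release store below, so a
        // non-null pointer implies a fully constructed device.
        NativeDevice* d = device_.load(std::memory_order_acquire);
        if (d != nullptr) return d;

        std::lock_guard<std::mutex> guard(create_lock_);
        d = device_.load(std::memory_order_relaxed);  // re-check under lock
        if (d == nullptr) {
            d = new NativeDevice{segment_size_};
            creations_.fetch_add(1, std::memory_order_relaxed);
            device_.store(d, std::memory_order_release);  // publish last
        }
        return d;
    }

    int creations() const { return creations_.load(); }

private:
    std::mutex create_lock_;
    std::atomic<NativeDevice*> device_{nullptr};
    std::atomic<int> creations_{0};
    int64_t segment_size_ = -1;  // default: unbounded single segment
};
```

Configuring twice and then issuing the first IO creates exactly one device,
using the value from the second Configure call.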

Verified against the full unit-test sweep with Native forced as the default
device (the GetDefaultDeviceType hack is local-only and not in this commit):
  Tsavorite.test:                 206 / 206 (was 204 / 206 pre-fix)
  Garnet.test:                    789 / 789 (was 110 / 792 pre-fix; 681
                                              were blocked on Initialize-twice)
  Garnet.test.acl:                425 / 425
  Garnet.test.collections:        746 / 746
  Garnet.test.complexstring:      386 / 386
  Garnet.test.rangeindex:          62 /  62
  Garnet.test.vectorset:           42 /  42

No change to behavior for callers that invoke Initialize once with the real
segment size, which is what every production code path already does.
…ive handle

GetFileSize and RemoveSegment must report the on-disk state regardless of
whether IO has flowed through the device, matching LocalStorageDevice and
RandomAccessLocalStorageDevice semantics. Before this fix, both no-op'd when
no native handle had been created, which silently broke the cluster
manager's recovery decision because ClusterManager.cs:79 and ReplicationManager.cs:160
call `device.GetFileSize(0) > 0` to decide whether to recover persisted cluster
config / replication history. With Native, a restarted node would always
"Initialize new node instance config" instead of recovering, get a fresh node ID,
and fail every replication-resume test (e.g. ClusterSRNoCheckpointRestartSecondary
which restarts a replica and then waits for AOF sync to catch up).

Changes:
* GetFileSize now falls back to FileInfo when no native handle exists — same
  shape as RandomAccessLocalStorageDevice.GetFileSize (open-on-demand) but
  without paying io_uring/libaio setup cost just to stat a file.
* RemoveSegment now falls back to File.Delete when no native handle exists —
  same shape as LocalStorageDevice / RandomAccessLocalStorageDevice
  (best-effort unlink, swallows ENOENT).
* Per IDevice contract enforced in 889def4 ("Tsavorite IDevice: unify
  Initialize contract — required for all devices, -1 = unbounded single segment"),
  ReadAsync / WriteAsync now call EnsureInitialized() before
  EnsureNativeDeviceCreated() so the IDevice_*BeforeInitialize_Throws hardening
  tests get the same InvalidOperationException shape from Native that they
  get from the other devices.
* Two device tests updated to match the lazy-Initialize contract that was
  introduced in commit f4e3044 ("Tsavorite Native: make Initialize idempotent
  via lazy native-handle creation"):
  - NativeStorageDevice_InitializeTwice_Throws → _Idempotent: idempotent Initialize
    matches the LSD/RA contract used by LocalStorageNamedDeviceFactory.Get +
    consumer re-init pattern.
  - NativeStorageDevice_Recovery_LargerExistingSegment_DetectsMismatch: the
    C++ ValidateRecoveredSegments check now fires on first IO (when
    EnsureNativeDeviceCreated runs), not at Initialize time, so the test
    asserts on a ReadAsync rather than Initialize.

Verified on Linux x64 / .NET 10:
* libs/storage/Tsavorite/cs/test/test.hlog: all IDevice + NativeStorageDevice
  tests pass (62/62) with Native default
* test/cluster/Garnet.test.cluster.replication: all 4
  ClusterSRNoCheckpointRestartSecondary variants pass with Native default
  (regression-test for the recovery path)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Without this fix, NSD.Dispose() can stall up to CompletionWorkerTimeoutSecs (1s)
per io_context because the completion-drainer thread is blocked in
io_getevents / io_uring_wait_cqe_timeout waiting for events that will never
come (the IO drain phase has already brought numPending to 0). The Thread.Join
following completionThreadToken.Cancel() then has to wait for the next QueueRunFor
timeout to fire so the thread can observe cancellation and exit.

This was visible as exactly-1.0s gaps in cluster replication recovery traces:
checkpoint metadata reads / writes that each create+dispose a fresh NSD spent
~1s in Dispose, multiplying across the ~5–10 devices created per checkpoint
into multi-second stalls. ClusterReplicaSyncTimeoutTest (replicaSyncTimeout=1s)
and MultiDatabaseSaveRecoverByDbIdTest(True) (2s LASTSAVE poll window) failed
because of this; the actual I/O on Native is microseconds, not seconds.

Fix: post a synthetic wake-up event on each io_context when Dispose runs.

  * libaio: submit a 0-byte read on a /dev/null fd opened in the handler ctor.
    /dev/null completes immediately and does not require O_DIRECT alignment,
    so the wake-up does not interfere with the real segment files.
  * io_uring: submit io_uring_prep_nop with user_data = nullptr; the drain
    loop recognises nullptr as a wake-up sentinel and skips dispatch.
  * Windows ThreadPoolIoHandler has no dedicated drainer (callbacks fire on
    threadpool threads), so its Wake is a no-op stub returning 0.
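
The sentinel handling on the drain side can be sketched without liburing by
modelling the completion queue as a std::deque. Cqe and DrainOnce are
hypothetical names; only the convention "user_data == nullptr means
wake-up, skip dispatch" comes from the text above.

```cpp
#include <deque>

// Stand-in for a completion queue entry.
struct Cqe {
    void* user_data;
};

// Drains every queued completion. Real completions are passed to
// `dispatch`; wake-up sentinels are consumed and ignored. Returns the
// number of real completions dispatched.
int DrainOnce(std::deque<Cqe>& cq, void (*dispatch)(void*)) {
    int dispatched = 0;
    while (!cq.empty()) {
        Cqe cqe = cq.front();
        cq.pop_front();
        if (cqe.user_data == nullptr) {
            continue;  // wake-up sentinel: its only job was to unblock us
        }
        dispatch(cqe.user_data);
        ++dispatched;
    }
    return dispatched;
}
```

After DrainOnce returns, the worker loop would check its cancellation flag,
which is how a Dispose-time wake-up leads to a prompt exit.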

The completion thread wakes from its blocking syscall almost immediately,
observes the cancellation token on its next loop iteration, and exits. No
extra idle work, no polling, no shortened timeout.

  * NSD.Dispose latency: 1025ms worst case -> ~20-30ms (microbenchmark).
  * ClusterReplicaSyncTimeoutTest with Native: ~22-25s (fail) -> ~3s (pass).
  * MultiDatabaseSaveRecoverByDbIdTest(True) with Native: timeout (fail)
    -> ~6s (pass).
  * Idle drainer syscall rate is unchanged (1/s/context).

C ABI changes (additive — old exports preserved):
  * NativeDevice_WakeCompletionWorker(device, ctx_idx).
  * INativeDevice::Wake; QueueIoHandler::Wake, UringIoHandler::Wake,
    ThreadPoolIoHandler::Wake stub.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…id for IO

Background: commit 889def4 ("unify Initialize contract — required for all devices")
added an EnsureInitialized() guard that threw InvalidOperationException at every IO
entry point if Initialize() had not been called first. This was redundant: the ctor
already establishes segmentSize=-1 / segmentSizeBits=64 / segmentSizeMask=~0UL, which
is functionally identical to having called Initialize(-1) — every absolute address
right-shifts to segment 0, producing unbounded single-segment routing.
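
The effect of these defaults on routing can be illustrated as follows. This
is a sketch under an assumption: it special-cases a 64-bit shift and maps
it to segment 0, because in C++ shifting a 64-bit value by 64 is undefined.
Whether the managed code guards the shift the same way is not shown in
this log. SegmentGeometry, SegmentOf and OffsetInSegment are hypothetical
names.

```cpp
#include <cstdint>

struct SegmentGeometry {
    int segment_size_bits = 64;            // ctor default: unbounded
    uint64_t segment_size_mask = ~0ULL;    // ctor default: keep all bits
};

// Segment index for an absolute address. With the default geometry every
// address lands in segment 0.
uint64_t SegmentOf(const SegmentGeometry& g, uint64_t address) {
    return g.segment_size_bits < 64 ? address >> g.segment_size_bits : 0;
}

// Offset within the segment. With the default mask the offset is the
// address itself.
uint64_t OffsetInSegment(const SegmentGeometry& g, uint64_t address) {
    return address & g.segment_size_mask;
}
```

With a 1 GiB segment (30 bits), address 3 GiB + 5 routes to segment 3 at
offset 5; with the defaults the same address routes to segment 0 at offset
3 GiB + 5.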

The mandatory-Initialize contract was the root cause of the entire factory-pre-init
+ NSD lazy-creation saga: LocalStorageNamedDeviceFactory.Get was forced to call
device.Initialize(-1L) defensively just to satisfy the contract, which broke
NativeStorageDevice (its single-shot Initialize then asserted on the consumer's
follow-up Initialize(realSize)). Recent commits f4e3044 + 0535da0 papered over
this with lazy native-handle creation; this commit removes the root cause.

Changes:
* StorageDeviceBase: remove the 'initialized' flag, EnsureInitialized() helper,
  and ThrowNotInitialized() method. Initialize() is now purely a *configuration*
  call to override the ctor defaults (set a non-default segment size, opt into
  OmitSegmentIdFromFileName). The ctor doc explicitly states that callers may
  issue IO immediately after construction.
* All concrete devices: remove the EnsureInitialized() calls at the top of
  ReadAsync / WriteAsync / TruncateUntilSegmentAsync / RemoveSegment (libaio,
  io_uring, RA, ManagedLocal, LocalMemory, Null, Tiered, Sharded, Azure).
* LocalStorageNamedDeviceFactory.Get: drop the defensive device.Initialize(-1L);
  the ctor defaults match what that call did anyway.
* NSD's recent EnsureInitialized() additions to ReadAsync/WriteAsync (introduced
  in 0535da0 only to satisfy the hardening test) are also removed by the sweep.
* ComponentRecoveryTests.cs: drop 3 redundant Initialize(-1L) calls.
* test.hlog DeviceTests: rename and repurpose IDevice_*BeforeInitialize_Throws to
  IDevice_*BeforeInitialize_UsesCtorDefaults — the new test demonstrates that
  WriteAsync/ReadAsync on a freshly-constructed device (no Initialize call) works
  correctly using the unbounded single-segment defaults, across Native / RA /
  ManagedLocal.

TestUtils.cs:165 still calls device.Initialize() — that path is conditional on
the caller wanting OmitSegmentIdFromFileName=true, which IS only settable via
Initialize (it is not a ctor parameter), so the call is genuinely needed there.

Verified on Linux x64 / .NET 10:
* libs/storage/Tsavorite/cs/test/test.hlog (IDevice + NativeStorageDevice tests):
  62/62 pass.
* libs/storage/Tsavorite/cs/test/test.recovery (ComponentRecovery tests):
  4/4 pass.
* Full Garnet.test, Garnet.test.cluster, Garnet.test.acl, Garnet.test.collections,
  Garnet.test.extensions, Garnet.test.scripting, Garnet.test.complexstring,
  Tsavorite IDevice+NSD: pass at the same rates as before this commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sweep the PR for comments that referenced earlier design choices as they
evolved during development, and rewrite them to describe the steady-state
contract directly without historical baggage.

* IDevice.Initialize: replace the "Must be called exactly once... before any
  IO entry point... uninitialized device throws InvalidOperationException"
  doc with the actual contract: Initialize is purely an opt-in configuration
  step to override the ctor defaults (which are equivalent to Initialize(-1));
  callers may issue IO immediately after construction.
* NativeStorageDevice ctor doc: rewrite to describe the as-shipped lazy
  creation flow (configuration captured at ctor, native handle created on
  first IO via EnsureNativeDeviceCreated) rather than the stale "Native device
  creation is DEFERRED until Initialize... every IO entry point throws
  InvalidOperationException" framing.
* NativeStorageDevice.Initialize doc: drop the misleading "Creates the
  underlying native device with the requested segment size" lead-in (which
  hasn't been true since the lazy-creation refactor); replace the "factory
  pre-init" example (factory no longer pre-inits) with a steady-state
  description of when repeat Initialize calls are honoured.
* NativeStorageDevice.EnsureNativeDeviceCreated doc: replace "Throws if
  Initialize has not been called" (now uses ctor defaults if no Initialize)
  with "Throws if the device has been disposed or if the native shim rejects
  the configuration".
* NativeStorageDevice.EnsureReadyOrSilent doc: drop the "does not throw on
  'not initialized yet'" qualification.
* NativeStorageDevice.GetSectorSize doc: re-point the "cross-check" link from
  Initialize to EnsureNativeDeviceCreated (which is where it actually happens).
* NativeStorageDevice.Dispose doc + body: bound the worst-case shutdown
  stall by the longest in-flight user callback (not CompletionWorkerTimeoutSecs)
  since wake-up uses NativeDevice_WakeCompletionWorker; rewrite the inline
  Dispose comment so it documents the steady-state design rather than what
  it improved over.
* NativeStorageDevice nativeSegmentSizeBytes / UnboundedNativeSegmentSizeBytes
  field doc: clarify that the value is populated by EnsureNativeDeviceCreated
  (not Initialize) and that the default is reached without calling Initialize.
* NativeStorageDevice_InitializeTwice_Idempotent test: drop the now-stale
  "factory pre-init... consumer re-initializes" rationale; describe the
  idempotent contract directly.
* NativeStorageDevice_DisposeBeforeInitialize_IsNoOp test: drop the "Phase 6"
  reference and reword in terms of the steady-state lazy-creation contract.
* SimulatedFlakyDevice.Initialize: replace the "so its EnsureInitialized()
  guard passes when our IO methods delegate to it" comment (the guard no
  longer exists) with a description of why both devices need matching geometry.
* LinuxFileExtensions.OpenDirect dsync param doc: drop the "previously asked
  for it" wording — the WriteThrough callsites still pass it; describe the
  parameter as an opt-in for WriteThrough-equivalent semantics.
* Doc-cref bookkeeping: change <see cref="base.segmentSize"/> (illegal cref
  for inherited fields) to <c>base.segmentSize</c> code spans.

No behaviour change. Build clean on Garnet.slnx and Tsavorite.slnx;
dotnet format --verify-no-changes clean on both. IDevice + NSD device tests
all pass (62/62).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Background: the shipped libnative_device.so was built with -DUSE_URING=ON,
so it had a hard DT_NEEDED entry for liburing.so.2. Loading the .so on a
host without liburing2 installed (e.g. a GitHub Actions ubuntu-latest
runner, or an end-user box where only libaio is in the base image) failed
with:

    System.DllNotFoundException: ... liburing.so.2: cannot open shared
    object file: No such file or directory

…even for callers that only ever requested the libaio backend, because the
dynamic linker resolves NEEDED libraries at load time regardless of which
exported symbols the caller goes on to invoke.

Rebuild the prebuilt with -DUSE_URING=OFF so the shipped .so links only
libaio. Most Linux distributions ship libaio in the base system, so the
prebuilt now loads without any additional setup. The io_uring backend
becomes a build-time opt-in: callers that want it install liburing-dev
and rebuild with -DUSE_URING=ON. The C# layer already surfaces a clear
TsavoriteException for callers that request Uring against a USE_URING=OFF
build ("Requested IO backend 'Uring' is not available in the loaded
native_device library… Rebuild the native library with -DUSE_URING=ON
and install liburing-dev to enable io_uring.").

Build fix: file_linux.h now includes <fcntl.h> directly (for the
::open() / O_RDONLY usage in QueueIoHandler::OpenWakeFd()). Previously
these were pulled in transitively through <liburing.h>, which is now
gated behind #ifdef FASTER_URING.

Dockerfile updates: drop liburing2 / liburing from the runtime install
list in all 5 Dockerfiles (default, .ubuntu, .alpine, .azurelinux,
.chiseled). Comments left for users that rebuild with USE_URING=ON.

README updates: rewrite the "Runtime dependencies" section to describe
the new default (libaio only). Replace the "Disabling io_uring (optional)"
section with "Enabling io_uring (optional)".

Verified on Linux x64 / .NET 10: libaio default works (62/62 IDevice +
NativeStorageDevice tests pass); ldd confirms only libaio.so.1t64 is
in NEEDED.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…fallback

Single-shipping libnative_device.so created a deployment dilemma: build with
USE_URING=ON and end-users without liburing get DllNotFoundException at load
time; build with USE_URING=OFF and the io_uring backend stops working even on
hosts that DO have liburing installed (which is the case where uring matters
for perf — modern NVMe at >1M IOPS benefits noticeably from uring over libaio).

Ship both flavors instead:
  * libnative_device.so          — built with USE_URING=ON; DT_NEEDED on
    libaio AND liburing. Exposes both Libaio and Uring backends.
  * libnative_device_libaio.so   — built with USE_URING=OFF; DT_NEEDED on
    libaio only. Exposes the Libaio backend.

NativeStorageDevice's DllImportResolver tries the uring-enabled binary first;
on DllNotFoundException matching 'liburing.so.2: cannot open' it falls back to
the libaio-only binary. The Libaio backend therefore always works out of the
box on any Linux distribution that ships libaio (essentially all of them).
liburing is opt-in: hosts that install it get the Uring backend with zero
runtime overhead vs Libaio (direct calls, no function-pointer indirection — we
deliberately rejected the dlopen approach so the future-default uring path
stays optimal).

If a caller explicitly selects IoBackend.Uring on a host without liburing, the
construction-time error message now points at the install command per distro
('apt-get install -y liburing2', 'dnf install -y liburing', 'apk add liburing')
instead of telling the user to rebuild the .so with -DUSE_URING=ON. We never
silently downgrade Uring to Libaio.
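
The fallback decision can be sketched independently of the actual loader.
The code below is a C++ illustration of the logic only, not the managed
resolver: LoadResult, Loader and ResolveNativeLibrary are hypothetical, and
the loader is injected so the sketch does not touch the file system.

```cpp
#include <functional>
#include <string>

struct LoadResult {
    bool ok;
    std::string error;  // loader's error text when ok is false
};
using Loader = std::function<LoadResult(const std::string& path)>;

// Tries the uring-enabled binary first. Falls back to the libaio-only
// binary only when the failure is the missing liburing dependency.
// Returns the name of the library that loaded, or "" on failure.
std::string ResolveNativeLibrary(const Loader& load, std::string* error_out) {
    const std::string primary = "libnative_device.so";
    const std::string fallback = "libnative_device_libaio.so";

    LoadResult r = load(primary);
    if (r.ok) return primary;

    if (r.error.find("liburing.so.2: cannot open") != std::string::npos) {
        LoadResult f = load(fallback);
        if (f.ok) return fallback;
        r = f;  // report the fallback's failure instead
    }
    if (error_out != nullptr) *error_out = r.error;
    return "";
}
```

Any other load failure (permissions, missing file) is reported as-is and
does not trigger the fallback, so unrelated problems are not masked.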

Changes:
  * NativeStorageDevice.cs: new LibaioFallbackLibraryPath; ImportResolver
    catches DllNotFoundException for liburing.so.2 and falls back to the
    libaio-only .so. ResolveNativeLibraryPath now takes the path as a
    parameter so it can resolve either flavor.
  * NativeStorageDevice.cs: rewrite the 'backend not available' exception
    message — point at install commands (the actual remediation) not rebuild.
  * Tsavorite.core.csproj: add libnative_device_libaio.so as a second
    ContentWithTargetPath asset so both .so files are copied to the
    output directory and packed into the NuGet runtime payload.
  * runtimes/linux-x64/native/libnative_device.so          — REPLACED with
    USE_URING=ON build (DT_NEEDED libaio + liburing). 2.3 MB.
  * runtimes/linux-x64/native/libnative_device_libaio.so   — NEW, USE_URING=OFF
    build (DT_NEEDED libaio only). 1.6 MB.
  * Dockerfile, Dockerfile.ubuntu, Dockerfile.alpine, Dockerfile.azurelinux,
    Dockerfile.chiseled: re-add liburing2 / liburing to the runtime installs
    so docker users get the io_uring backend out of the box (the libaio-only
    fallback would otherwise leave Uring unusable inside containers).
  * cc/README.md: rewrite the 'Runtime dependencies' and build sections to
    describe the two-flavor layout, drop the stale 'Enabling io_uring'
    section, and document the prebuilt rebuild workflow.

Verified end-to-end:
  * Both backends saturate the Dell P5600 NVMe at ~743K random read IOPS
    in benchmark/Device.benchmark (matches the pre-change reference).
  * 62/62 IDevice + NativeStorageDevice tests pass.
  * dotnet format clean on both Garnet.slnx and Tsavorite.slnx.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc and others added 16 commits May 27, 2026 20:36
…hreadPoolIoHandler

NativeDeviceImpl's constructor (native_device.h:100-101) gates on
handler_.init_errno() to surface an actionable error message when the
underlying IO handler failed to initialize (e.g., libaio io_setup() failed
with EMFILE / ENOMEM, or io_uring_queue_init() failed). The Linux handlers
QueueIoHandler and UringIoHandler both expose this API; ThreadPoolIoHandler
(Windows) did not, so MSVC failed to instantiate NativeDeviceImpl<ThreadPoolIoHandler>
with:

  error C2039: 'init_errno': is not a member of
  'FASTER::environment::ThreadPoolIoHandler'

Add init_errno() and initialized() stubs that return 0 / true unconditionally
— the Windows ThreadPool API does not have a separable init step that can
fail in the same way the Linux io_setup / io_uring_queue_init paths can
(threadpool creation failures propagate via threadpool_'s ctor, not via a
later 'check this' field on the handler), so the stubs are semantically
correct. NativeDeviceImpl then falls through to the log_.Open(&handler_) path
which is where Windows-specific errors (missing directory, permission denied,
etc.) actually surface.
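
The compile-time requirement behind this change can be sketched as follows. This is a minimal illustration, not the Tsavorite source: `StubThreadPoolHandler`, `FailingHandler`, and `DeviceImpl` are hypothetical stand-ins, and only the method names `init_errno()` and `initialized()` come from the commit message.

```cpp
#include <cassert>
#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a handler with no separable init step
// (the Windows ThreadPool case): the stubs always report success.
struct StubThreadPoolHandler {
  int init_errno() const { return 0; }
  bool initialized() const { return true; }
};

// Hypothetical stand-in for a Linux handler whose setup call failed.
struct FailingHandler {
  int init_errno() const { return EMFILE; }
  bool initialized() const { return false; }
};

// Hypothetical device template. It only instantiates if Handler exposes
// init_errno(), which is why every handler needs the method.
template <class Handler>
class DeviceImpl {
 public:
  DeviceImpl() {
    const int err = handler_.init_errno();
    if (err != 0) {
      throw std::runtime_error("IO handler init failed, errno=" +
                               std::to_string(err));
    }
    assert(handler_.initialized());
  }

 private:
  Handler handler_;
};

// Returns true if constructing DeviceImpl<Handler> throws.
template <class Handler>
bool construction_fails() {
  try {
    DeviceImpl<Handler> device;
    (void)device;
    return false;
  } catch (const std::runtime_error&) {
    return true;
  }
}
```

Removing `init_errno()` from `StubThreadPoolHandler` in this sketch produces the same class of error as the MSVC C2039 quoted above.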

Linux unaffected: rebuilt build/Release-uring cleanly after this change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Native tests in DeviceTests.cs had blanket 'NativeStorageDevice is
Linux-only' Assert.Ignore guards that dated from when NSD's C++ shim was
Linux-only. The shim is built on Windows too (native_device.dll via the
ThreadPool / IOCP backend in file_windows.cc), so directly constructing
'new NativeStorageDevice(...)' works on Windows. The blanket guards were
silently dropping ~15 NSD test cases on Windows CI.

Drop the guards so the tests exercise the Windows C++ shim. The legitimate
Linux-only guard on IDevice_PermissionDeniedAtFirstWrite_CallbackGetsError
(chmod-based; chmod has no Windows analogue) is preserved.

Important: end-user device routing is UNCHANGED.
Devices.CreateLogDevice(DeviceType.Native) on Windows still returns
LocalStorageDevice (managed Windows IOCP), not NativeStorageDevice — that
routing happens in Devices.cs and was not touched. These tests directly
instantiate the NSD class for shim-coverage purposes only; they do not
affect what end users get from the default device factory.

Linux: 62/62 pass after this change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…+ source

   Rebuild the shipped Windows prebuilt from the current native_device source
   so the latest fixes (Initialize idempotence, WakeCompletionWorker, etc.)
   are reflected in the win-x64 DLL.

   USE_URING is a no-op on Windows; the DLL only exposes the Default (IOCP)
   backend, so there is no equivalent of the libnative_device_libaio.so
   fallback on this platform.

   End-user device routing is unchanged: Devices.CreateLogDevice(DeviceType.Native)
   on Windows still returns LocalStorageDevice (managed Windows IOCP). This DLL
   is exercised by direct 'new NativeStorageDevice(...)' construction (see
   Tsavorite.test.hlog DeviceTests — 59/59 Native + IDevice tests pass on
   Windows after this rebuild).

   Built with: Visual Studio 17 2022, MSVC v143, x64, Release configuration,
   Spectre-mitigated CRT.
Fixes two correctness issues caught by an automated code review of the
optimize-device PR.

### io_uring SQE leak on submit failure

In UringFile::ScheduleOperation, io_uring_get_sqe() advances the
user-side sqe_tail before io_uring_submit() is called. If submit fails
after retries (-EAGAIN/-EBUSY exhausted, or any other negative), the
old code released the lock and returned IOError without doing anything
about the still-pending SQE. user_data on that SQE pointed at the
io_context unique_ptr that was about to be freed by the guards
unwinding, so the next successful submit on the same ring would consume
the stale SQE and the QueueRunFor drain loop would dispatch a callback
against freed memory — a clear use-after-free.

Fix: before releasing sq_lock on the failure path, rewrite the still-
pending SQE in place as io_uring_prep_nop with user_data = nullptr.
The drain loop already skips nullptr user_data (it's the wake-up
sentinel used by UringIoHandler::Wake), so when a later submit flushes
this nop the CQE is drained harmlessly. Safe to mutate the SQE in place
because we still hold sq_lock and no kernel/concurrent submitter has
observed it yet.
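
A toy model may make the failure path easier to follow. The sketch below does not use liburing; `ToyRing`, `neutralize`, and `drain` are hypothetical names that only mimic the roles of the submission queue, the in-place nop rewrite, and the drain loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a submission/completion path; NOT the liburing API.
struct Context { int completions = 0; };

enum class Op { Read, Nop };

struct Sqe {
  Op op;
  Context* user_data;  // nullptr marks an entry the drainer must skip
};

struct ToyRing {
  std::vector<Sqe> pending;    // queued but not yet submitted
  std::vector<Sqe> completed;  // completed by the modeled kernel

  // Models get_sqe + prep: the entry is queued before any submit call.
  Sqe& get_sqe(Context* ctx) {
    pending.push_back(Sqe{Op::Read, ctx});
    return pending.back();
  }
  // Models a successful submit: every pending entry is consumed.
  void submit() {
    completed.insert(completed.end(), pending.begin(), pending.end());
    pending.clear();
  }
};

// Failure path: rewrite the still-pending entry as a nop with no context,
// so it no longer points at memory that is about to be freed.
inline void neutralize(Sqe& sqe) {
  sqe.op = Op::Nop;
  sqe.user_data = nullptr;
}

// Drain loop: dispatch real completions, skip sentinel entries.
inline std::size_t drain(ToyRing& ring) {
  std::size_t dispatched = 0;
  for (const Sqe& cqe : ring.completed) {
    if (cqe.user_data == nullptr) continue;  // sentinel, nothing to call
    assert(cqe.op == Op::Read);
    ++cqe.user_data->completions;
    ++dispatched;
  }
  ring.completed.clear();
  return dispatched;
}
```

In the model, a later successful submit flushes the neutralized entry together with the live one, and the drain loop only touches the live context.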

### NativeDevice sector_size always returned 512

FileSystemSegmentedFile::alignment() returned a hard-coded 512.
NativeDeviceImpl::sector_size() delegated to it, so the C# wrapper's
sector-size cross-check in EnsureNativeDeviceCreated would:

  - falsely throw on 4K-native disks where ProbeAlignment returns 4096
    (managed 4096 vs native 512 → 'sector-size mismatch' → device
    unusable), or
  - on 4K-native disks where the managed probe fell back to 512 (e.g.
    older kernel without STATX_DIOALIGN), let the device initialize
    with SectorSize=512 and then have the kernel reject the 512-aligned
    O_DIRECT buffers with EINVAL.

Fix: factor the STATX_DIOALIGN probe from NativeDevice_ProbeAlignment
into a shared inline helper (native_device::ProbeDioAlignment in
native_device.h) and call it once from the NativeDeviceImpl ctor,
caching the result as the immutable member device_alignment_.
sector_size() now returns the cached value; NativeDevice_ProbeAlignment
delegates to the same helper. Both sides of the ABI cross-check go
through identical probe logic, so the check is now a meaningful ABI /
runtime-drift detector instead of a 4K-disk footgun.

### Stale IDevice.Initialize XML

The omitSegmentIdFromFilename param said it was 'only supported by
managed devices — NativeStorageDevice rejects this flag'. Native
devices have honored the flag since 6584cf7. Updated the doc.

### Alpine install hint

The 'IoBackend.Uring with libaio fallback' error message suggested
'sudo apk add liburing' on Alpine, but README.md notes that the
prebuilt won't load on Alpine (musl) at all. Replaced the apk
suggestion with the actual Alpine support story (use a glibc image or
fall back to a managed device).

Verification: 62/62 Tsavorite.test.hlog IDevice + NativeStorageDevice
tests pass on Linux. Both .so binaries rebuilt (uring-enabled and
libaio-only fallback) with correct ldd output. Device.benchmark NVMe
saturation throughput unchanged within noise (libaio 738K IOPS, uring
349K IOPS on Dell P5600).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces the statx(STATX_DIOALIGN) probe in ProbeDioAlignment with a
direct sysfs lookup of max(logical_block_size, physical_block_size).

Why:
  - STATX_DIOALIGN reports only the kernel-enforced minimum (= logical
    block size). It misses the firmware's preferred sector
    (physical_block_size), so on a 512e drive (logical=512,
    physical=4096) the probe would return 512 and Tsavorite would take
    a firmware RMW penalty on every partial-sector write.
  - STATX_DIOALIGN also requires kernel 6.1+ AND the filesystem to
    populate the field; ext4 on 6.8 leaves it unset on 512-byte
    devices, so the probe was already falling through to the 512
    default in practice.
  - sysfs gives us both values directly, on every kernel, with no
    O_DIRECT dance. Taking max(logical, physical) covers the
    correctness floor (logical = kernel-enforced minimum) and the
    performance floor (physical = avoid RMW on partial writes) in one
    shot.

Implementation:
  - stat() the file (or its closest existing ancestor — log file may
    not exist yet at construction). Extract st_dev → (major, minor).
  - Read /sys/dev/block/<maj>:<min>/queue/{logical,physical}_block_size.
    For partitions (e.g. sda2), the queue/ dir lives on the parent
    whole-disk block device — fall through to ../queue/<field>.
  - Round result up to a power of two (always already pow2 on real
    hardware) and floor at 512 B.
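
The steps above can be sketched roughly as follows. This is an illustrative Linux-only sketch under stated assumptions, not the code in native_device.h: `normalize_alignment`, `read_sysfs_u32`, `read_queue_field`, and `probe_alignment` are hypothetical names, and any sysfs read failure is assumed to fall back to 512.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <sys/stat.h>
#include <sys/sysmacros.h>

// Rounds v up to a power of two and floors the result at 512.
inline std::uint32_t normalize_alignment(std::uint32_t v) {
  std::uint32_t p = 512;
  while (p < v && p < (1u << 30)) p <<= 1;
  return p;
}

// Reads one unsigned integer from a sysfs file; returns 0 on failure.
inline std::uint32_t read_sysfs_u32(const std::string& path) {
  std::ifstream in(path);
  std::uint32_t value = 0;
  if (!(in >> value)) return 0;
  return value;
}

// Reads queue/<field>; for partitions, falls back to the parent disk.
inline std::uint32_t read_queue_field(const std::string& dev_dir,
                                      const char* field) {
  assert(field != nullptr);
  std::uint32_t v = read_sysfs_u32(dev_dir + "/queue/" + field);
  if (v == 0) v = read_sysfs_u32(dev_dir + "/../queue/" + field);
  return v;
}

// Hypothetical probe: max(logical, physical) for the device backing path.
inline std::uint32_t probe_alignment(std::string path) {
  struct stat st;
  // Walk up to the closest existing ancestor; the file may not exist yet.
  while (::stat(path.c_str(), &st) != 0) {
    if (path == "/" || path == ".") return 512;
    const std::string::size_type slash = path.find_last_of('/');
    if (slash == std::string::npos) path = ".";
    else if (slash == 0) path = "/";
    else path.erase(slash);
  }
  const std::string dev_dir = "/sys/dev/block/" +
                              std::to_string(major(st.st_dev)) + ":" +
                              std::to_string(minor(st.st_dev));
  const std::uint32_t logical = read_queue_field(dev_dir, "logical_block_size");
  const std::uint32_t physical = read_queue_field(dev_dir, "physical_block_size");
  return normalize_alignment(std::max(logical, physical));
}
```

On filesystems without a backing block device entry in sysfs, both reads return 0 in this sketch and the result is the 512-byte floor.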

On this machine (Dell P5600 NVMe + PERC sda):
  Probe(/DATA2/badrishc) = 512    (NVMe: logical=512, physical=512)
  Probe(/tmp/devbench)   = 512    (sda partition via parent-walk)
  Probe(/home/badrishc)  = 512
  Probe(/)               = 512

All values match max(logical, physical) read directly from sysfs.

Verification:
  - 62/62 Tsavorite.test.hlog IDevice + NativeStorageDevice tests pass
  - Both .so flavors rebuilt (uring-enabled + libaio-only fallback)
  - C ABI NativeDevice_ProbeAlignment delegates to the same helper, so
    managed SectorSize and native sector_size() remain in lockstep.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…obed sector size

CI failure on IDevice_Initialize_OmitSegmentIdFromFilename_BareFileName
(and likely other IDevice_* tests on the same runner) reported:
  'NativeStorageDevice.WriteAsync: misaligned I/O — sector size is 4096,
   but offset=0x0, length=4096, buffer=0x...7EC7A9F76800'

The buffer address ends in 0x800 (= 2048), i.e. it is 2048-aligned but
not 4096-aligned. The test helper allocated buffers aligned to
HardeningSectorSize = 512 (the pre-PR default for every Garnet Linux
device); a 512-aligned allocation can land on a 2048-boundary that is
not also a 4096-boundary.

CI's underlying disk reports physical_block_size = 4096 in sysfs, so
the new max(logical, physical) probe returns 4096 there. The native
shim then correctly rejects sub-4096-aligned O_DIRECT buffers with
EINVAL. The fix is on the test side: bump HardeningSectorSize from
512 to 4096 so the test buffer alignment matches the strictest
device.SectorSize seen on any modern hardware (512n, 512e, 4Kn).

Locally (Dell P5600 NVMe, logical=physical=512 → SectorSize=512) all
62 IDevice + NativeStorageDevice tests still pass: 4096 is a multiple
of 512, so every 4096-aligned buffer is also 512-aligned.
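
The alignment arithmetic in this failure can be checked with a small sketch. The helper names `is_aligned` and `align_up` are hypothetical, and the sample address uses only the digits visible in the CI message.

```cpp
#include <cassert>
#include <cstdint>

inline bool is_pow2(std::uint64_t v) { return v != 0 && (v & (v - 1)) == 0; }

// True if addr is a multiple of alignment (which must be a power of two).
inline bool is_aligned(std::uint64_t addr, std::uint64_t alignment) {
  assert(is_pow2(alignment));
  return (addr & (alignment - 1)) == 0;
}

// Rounds addr up to the next multiple of alignment (a power of two).
inline std::uint64_t align_up(std::uint64_t addr, std::uint64_t alignment) {
  assert(is_pow2(alignment));
  return (addr + alignment - 1) & ~(alignment - 1);
}
```

With the sample address `0x7EC7A9F76800`, `is_aligned` holds for 512 and 2048 but not for 4096, which matches the rejection described above. Any address produced by `align_up(addr, 4096)` is also 512-aligned, so raising the test constant is safe on 512-byte devices.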

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… PR)

The read-window sizing optimization (drop leading-slop padding + clamp
to page-end) is out of scope for this device-backend PR; it interacts
with the larger read-IO path and deserves its own focused PR with
dedicated benchmarking. Reverting to the pre-PR behavior here.

TryAllocateRetryNow's bounded-backoff change is retained — it's a
self-contained allocator hot-path fix and is independently verified
(+13.9% on YCSB load with libaio at 64 threads, per kvbench
benchmarking).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… max(logical, physical)

Symmetry with the Linux sysfs probe — Windows now reads both
BytesPerLogicalSector and BytesPerPhysicalSector from the volume's
STORAGE_ACCESS_ALIGNMENT_DESCRIPTOR (via IOCTL_STORAGE_QUERY_PROPERTY +
StorageAccessAlignmentProperty) and returns the rounded-up-to-pow2 max,
floor 512 B. Previously the Windows branch returned 512 unconditionally,
which would silently undersize SectorSize on Windows 4Kn / 512e drives.

Implementation:
  - Parse drive letter from filename ("C:\foo.dat" -> "\\.\C:").
    UNC paths are not supported by this probe — fall back to 512.
  - CreateFile on the volume with FILE_READ_ATTRIBUTES (no admin needed).
  - DeviceIoControl(IOCTL_STORAGE_QUERY_PROPERTY) with
    StorageAccessAlignmentProperty.
  - max(logical, physical), round up to pow2, floor 512.

Linux behavior unchanged. Both .so flavors rebuilt and pass 62/62
device tests on this machine (logical=physical=512 NVMe).

REQUIRES Windows DLL rebuild — the Windows path in ProbeDioAlignment
is now non-trivial, and the existing prebuilt native_device.dll still
returns 512 unconditionally. Without the rebuild, on a Windows 4Kn box
the managed SectorSize cross-check would (incorrectly) pass at 512
while the device might actually need 4096. Rebuild recipe in the
companion review comment / Tsavorite/cc/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…leteFor

Addresses Copilot review comment on file_linux.cc:373.

UringIoHandler::TryCompleteFor (and via it TryComplete) dispatched
every drained CQE through DispatchUringCqe without checking the
user_data = nullptr sentinel that QueueRunFor already handles. The
sentinel marks two kinds of no-op CQEs:
  - Wake-up nops submitted by UringIoHandler::Wake to unblock the
    drainer on Dispose.
  - SQEs rewritten in-place after io_uring_submit failed (the SQE
    leak fix in c6d6892); these are committed to the SQ but carry
    no caller context.

If a TryComplete() / TryCompleteFor() call picks up either kind of
nop CQE, DispatchUringCqe would dereference the null context at
context->callback(...) and segfault.

Fix: mirror the nullptr-skip from QueueRunFor in TryCompleteFor.
Return true to count the drain (matching the any-flag semantics).

Verification: 62/62 Tsavorite.test.hlog IDevice + NativeStorageDevice
tests pass. uring .so rebuilt; libaio-only .so is byte-identical
because the patched code path is wrapped in #ifdef FASTER_URING and
not compiled into the libaio-only fallback.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
QueueRunFor used to acquire cq_lock per CQE (peek -> read fields ->
cqe_seen -> release -> dispatch). With a single drainer thread, this
serializes a lock acquire/release on every completion and forces
submitters to wait through callback latency when they need ring access.

Replaced with the canonical liburing batch-drain idiom:
  - acquire cq_lock once
  - io_uring_peek_batch_cqe(ring, cqes, 64) to pull up to 64 CQEs
  - snapshot (io_res, context) for each
  - io_uring_cq_advance(ring, n) to release the slots
  - release cq_lock
  - dispatch callbacks outside the lock

This is the io_uring equivalent of libaio returning up to n events per
io_getevents() call.
Snapshot BEFORE cq_advance is mandatory because the kernel may reuse
CQ slots once advanced, leaving the cqe pointers dangling.

The wake-up / failed-submit sentinel (user_data == nullptr) is still
skipped without dispatch, same as before.
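
The shape of the batch drain can be sketched with a toy queue. This model does not call liburing; `ToyCq` and `drain_batch` are hypothetical names, with a `std::deque` standing in for the completion ring and a range erase standing in for the single advance step.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Toy completion-queue model; NOT the liburing API.
struct Context { int result = 0; bool done = false; };

struct Cqe { int res; Context* user_data; };

struct ToyCq {
  std::mutex cq_lock;
  std::deque<Cqe> slots;
};

// Drains up to max_batch entries; returns the number of contexts completed.
inline std::size_t drain_batch(ToyCq& cq, std::size_t max_batch) {
  assert(max_batch > 0);
  std::vector<Cqe> snapshot;
  {
    std::lock_guard<std::mutex> guard(cq.cq_lock);
    const std::size_t count = cq.slots.size() < max_batch ? cq.slots.size() : max_batch;
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(count);
    // Copy the fields first: once slots are released they may be reused.
    snapshot.assign(cq.slots.begin(), cq.slots.begin() + n);
    // Release all n slots in one step (the advance analogue).
    cq.slots.erase(cq.slots.begin(), cq.slots.begin() + n);
  }
  // Dispatch outside the lock so submitters are not blocked by callbacks.
  std::size_t dispatched = 0;
  for (const Cqe& c : snapshot) {
    if (c.user_data == nullptr) continue;  // wake-up / failed-submit sentinel
    c.user_data->result = c.res;
    c.user_data->done = true;
    ++dispatched;
  }
  return dispatched;
}
```

The lock is held only for the copy and the release, so its cost is paid once per batch rather than once per completion.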

Measured impact on Dell P5600 (16 submitter threads, batch 64,
throttle 256):
  ct=1 (1 ring, 1 drainer):  339K -> 354K ops/sec  (+4%)
  ct=4 (4 rings, 4 drainers): 735K -> 737K         (saturates, noise)
  ct=8 (8 rings, 8 drainers): 750K -> 742K avg     (saturates, noise)

The single-drainer gain is modest because the real bottleneck at
ct=1 with 16 submitters is sq_lock contention on the single ring, not
cq_lock contention. The batch-drain is still strictly better:
  - dispatches outside the lock so submitters aren't blocked by
    user-callback latency,
  - matches the idiomatic liburing pattern,
  - amortizes the lock acquire/release across up to 64 CQEs per cycle.

For high-throughput workloads, sharding across multiple rings remains
the right scaling lever (ct >= 4 saturates this drive).

Verification: 62/62 Tsavorite.test.hlog IDevice + NativeStorageDevice
tests pass. uring .so rebuilt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Eliminates the sq_lock contention that was capping uring at ~340K IOPS at
the default numCompletionThreads=1. Two changes work together:

1. Per-thread ring affinity in pick_ring (file_linux.h):
   Each submitter thread is assigned a ring on its first submit (round-
   robin against other threads via an atomic counter) and keeps that
   assignment for life. Same-thread submits never contend on sq_lock with
   themselves; different threads only contend when they got assigned the
   same ring (num_submitter_threads > num_rings). This is the user-space
   equivalent of libaio's "io_submit is thread-safe per io_context" —
   eliminate shared mutable state across submitters.

2. Hardcoded 4 rings for uring (NativeStorageDevice.cs):
   numIoContextsConfig = ioBackend == Uring
       ? max(kDefaultUringRings=4, numCompletionThreads)
       : numCompletionThreads
   So uring always has at least 4 rings even at numCompletionThreads=1.
   The single drainer covers all 4 rings via the legacy QueueRun compat
   scanner (CompletionWorker passes ctxIdx=-1 in that case).

   libaio is unchanged: rings == numCompletionThreads (extra rings don't
   help; the kernel io_context mutex is already efficient).
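
The per-thread affinity in item 1 can be sketched as below. This is a hypothetical illustration rather than the code in file_linux.h: `RingPicker` is an invented class, and the sketch assumes a single process-wide slot counter shared by all pickers.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical sketch of per-thread ring affinity.
class RingPicker {
 public:
  explicit RingPicker(std::size_t num_rings) : num_rings_(num_rings) {
    assert(num_rings > 0);
  }

  // Returns the ring index for the calling thread. The thread takes a slot
  // from the shared counter on its first call and reuses it afterwards.
  std::size_t pick_ring() const {
    thread_local const std::size_t slot =
        next_slot_.fetch_add(1, std::memory_order_relaxed);
    return slot % num_rings_;
  }

 private:
  static std::atomic<std::size_t> next_slot_;  // round-robin across threads
  std::size_t num_rings_;
};

std::atomic<std::size_t> RingPicker::next_slot_{0};
```

Two threads share a ring in this sketch only when their slots collide modulo the ring count, which is the "more submitter threads than rings" case described above.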

Result on Dell P5600 NVMe (16 submitter threads, batch 64, throttle 256):

  Before (1 ring, 1 drainer):              ~340K
  After  (4 rings, 1 drainer, default):    ~700K   (matches libaio ct=1)
  After  (8 rings, 8 drainers, sharded):   ~745K   (unchanged, was already saturating)

No new public configuration parameters. numCompletionThreads still
controls drainer count; the ring count is now backend-derived behind the
scenes. The CompletionWorker single-drainer-multi-ring path was added
specifically so the default numCompletionThreads=1 case can saturate
without spawning extra drainer threads.

Also: bumped HardeningSectorSize and the legacy bufferPool / NativeDeviceTest2
sector_size constants from 512 to 4096 to match the strictest device
SectorSize we expect on any modern hardware (4Kn drives where the new
max(logical, physical) probe returns 4096). Tests would otherwise fail
with EINVAL on 4Kn CI runners with 512-aligned buffers.

Verification:
  - 64/64 Tsavorite.test.hlog IDevice + NativeStorageDevice tests pass
  - Default uring (no flags) hits 626-740K across t=1..64 vs ~340K before
  - Sharded ct=4/8 unchanged (still saturates)
  - libaio default unchanged

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… SectorSize

NativeDeviceTest1 read 1024 bytes (entryLength) using ReadInto, which:
  - rounded the read length up to the device sector size,
  - then returned a buffer of that ROUNDED length,
  - which the caller compared via SequenceEqual against the original
    `entry` byte[] (length 1024).

When SectorSize was 512 (the old constant probe), 1024 rounded to 1024
and the lengths happened to match. With the new max(logical, physical)
probe returning 4096 on 4Kn drives (Windows/Ubuntu CI runners), 1024
rounds to 4096, the returned buffer is 4096 bytes long, and
SequenceEqual fails on length mismatch (regardless of content).

Pre-existing latent bug — the rounding to sector size is correct for
the IO submit, but the caller should only see the bytes it asked for.

Fix: return a buffer of the caller-requested logical `size`, not the
sector-rounded `numBytesToRead`.
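
The distinction between the IO length and the returned length can be sketched as follows. This is an illustrative model, not the test helper itself: `round_up` and `read_into` are hypothetical names, and an in-memory vector stands in for the device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Rounds n up to a multiple of sector (which must be a power of two).
inline std::size_t round_up(std::size_t n, std::size_t sector) {
  assert(sector != 0 && (sector & (sector - 1)) == 0);
  return (n + sector - 1) & ~(sector - 1);
}

// Hypothetical stand-in for the read helper: `disk` plays the device.
inline std::vector<std::uint8_t> read_into(const std::vector<std::uint8_t>& disk,
                                           std::size_t size,
                                           std::size_t sector) {
  // The IO itself is issued with a sector-rounded length.
  const std::size_t io_len = round_up(size, sector);
  std::vector<std::uint8_t> io_buf(io_len, 0);
  const std::size_t avail = std::min(disk.size(), io_len);
  if (avail > 0) std::memcpy(io_buf.data(), disk.data(), avail);
  // The caller only sees the bytes it asked for.
  return std::vector<std::uint8_t>(
      io_buf.begin(), io_buf.begin() + static_cast<std::ptrdiff_t>(size));
}
```

With a 1024-byte request, the IO buffer is 1024 bytes at a 512-byte sector size and 4096 bytes at a 4096-byte sector size, while the returned buffer is 1024 bytes in both cases.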

Verification: 64/64 Tsavorite.test.hlog NativeDeviceTest + IDevice +
NativeStorageDevice tests pass on Linux (where SectorSize is 4096 on
the CI runner's 4Kn drive).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The --device-completion-threads (KV.benchmark) and --completion-threads
(Device.benchmark) help text said "all drainers share the same kernel
io_context / io_uring" and "values > 1 are rarely useful past 1 today".

Both claims are stale since the sharded-rings work (8cbca9d) and the
per-thread ring affinity + 4-default-rings change (298bfd1):

  - Each drainer is bound 1:1 to its own kernel io_context (libaio)
    or io_uring ring (uring).
  - Submitters distribute across rings via per-thread affinity.
  - For io_uring, throughput scales with completion-threads up to
    available submitter concurrency (measured: ct=1 ~340K → ct=4 ~735K
    on Dell P5600 NVMe at the device-benchmark level).
  - For libaio extra drainers still rarely help past 1 (kernel
    per-context mutex is efficient).
  - Note added that uring uses min 4 rings even at ct=1 with the
    single drainer covering all rings via the legacy QueueRun scanner.

Help-text-only change. No code behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ll reads

Root cause: cross-segment read rejection + engine retry loop
==============================================================

The AllocatorBase.GetAndPopulateReadBuffer sector-aligned read window can
extend past the page-end boundary when reading a record near the tail of
a page. When the device's segment size is a multiple of the page size
(e.g. 4MB pages, 1GB segments — the Garnet default), an over-extended
read at the last page of a segment also crosses the device's segment
boundary. NativeStorageDevice's underlying FileSystemSegmentedFile
rejects cross-segment reads with Status::IOError; the engine's
AsyncGetFromDiskCallback interprets a 0-byte read as a short read and
retries the same address — forever. Worker thread spins at 99% CPU,
disk activity drops to zero, benchmark deadlocks.

Reproduced reliably on KV.benchmark:
  --device native --device-io-backend libaio \
  --log-memory 16m --page-size 4m --segment-size 1g \
  -n 10000000     (1.28 GB dataset → crosses 1GB segment boundary)

Smaller datasets (1M = 128MB, fits in 1 segment) work; larger ones
hang. RandomAccess device works on all dataset sizes because its
managed segmented-file wrapper doesn't reject cross-segment reads.

Diagnostic captured the exact symptom: a read at sourceAddress
0x3FFFF600 (1,073,739,776 — 2,560 bytes before the 1GB segment
boundary) with readLength 4608 (sector-aligned record window) extends
to 0x40000C00 — 2,560 bytes into segment 1. Native rejects with
Status::IOError, callback fires with numBytes=0, engine retries.

Fix: clamp the aligned read length so it never crosses page-end.
============================================================

Added in AllocatorBase.GetAndPopulateReadBuffer:

    var pageEndInFile = (ulong)(AlignedPageSizeBytes * (GetPage(fromLogicalAddress) + 1));
    if (alignedFileOffset + alignedReadLength > pageEndInFile)
        alignedReadLength = (uint)(pageEndInFile - alignedFileOffset);

Records never span page boundaries (HandlePageOverflow guarantees this),
so the actual record is fully readable within the clamped window:
available_bytes reflects what we actually got from disk, and the engine
continues normally. pageEnd is sector-aligned (the page size is a power
of two no smaller than the sector size), so the clamped length stays
sector-aligned.
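
The clamp can be exercised in isolation with a small sketch. `clamp_read_length` is a hypothetical helper; unlike the engine code, it derives the page index from the file offset, which assumes file offsets and logical addresses share the same page layout.

```cpp
#include <cassert>
#include <cstdint>

// Clamps a sector-aligned read so it never extends past the end of the
// page that contains aligned_file_offset.
inline std::uint32_t clamp_read_length(std::uint64_t aligned_file_offset,
                                       std::uint32_t aligned_read_length,
                                       std::uint64_t page_size_bytes) {
  assert(page_size_bytes != 0 &&
         (page_size_bytes & (page_size_bytes - 1)) == 0);
  const std::uint64_t page = aligned_file_offset / page_size_bytes;
  const std::uint64_t page_end = (page + 1) * page_size_bytes;
  if (aligned_file_offset + aligned_read_length > page_end) {
    return static_cast<std::uint32_t>(page_end - aligned_file_offset);
  }
  return aligned_read_length;
}
```

For example, with 4 MB pages, a 4608-byte read starting 2048 bytes before the 1 GB mark is clamped to 2048 bytes, so it stops at the page end and therefore also at the segment boundary.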

Also reverted the uring "min 4 rings even at ct=1" experiment
=============================================================

The earlier "default 4 rings for uring regardless of ct" change was
fundamentally broken: with per-thread submit affinity (pick_ring's
thread_local index), submitters bound to rings 1-3 never get their
completions drained because the single drainer blocks on ring 0 with
a 1-second QueueRun timeout and only briefly polls the other rings
between wake-ups. The result is ~50x throughput degradation on
workloads where submitters land on rings != 0 (KV.benchmark load
phase dropped from 2.5M ops/sec to 54K ops/sec at t=1).

Reverted to the simple rule: rings == numCompletionThreads. For uring
perf scaling, users set numCompletionThreads >= expected submitter
concurrency; each ring is then continuously drained by its dedicated
drainer thread.

Defense-in-depth hardening
==========================

- NativeStorageDevice._callback now catches ALL exceptions from the
  user callback (was: try/finally but exception propagated). A managed
  exception escaping back into native code across the C ABI boundary
  silently terminates the drainer thread; the next submitter then
  spins forever in device.Throttle(). Now the exception is logged and
  swallowed so the drainer survives.

- NativeStorageDevice.CompletionWorker has the same try/catch around
  the whole drain loop as defense-in-depth against unrelated managed
  exceptions (P/Invoke marshalling, IntPtr.Zero races with Dispose,
  etc.).

- file_linux.cc QueueFile::ScheduleOperation (libaio) and
  UringFile::ScheduleOperation (uring) now retry submit-side EAGAIN
  indefinitely with bounded backoff (64 sched_yields, then 1ms
  nanosleeps) instead of returning Status::IOError after 8 yields.
  Surfacing transient EAGAIN as a permanent error creates the same
  retry-loop pathology as the cross-segment-read bug above. EAGAIN is
  the kernel saying "ring is full, try later"; it's not a real error
  and must not be exposed to the engine.
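
The submit-side retry policy from the last bullet can be sketched generically. `submit_with_backoff` is a hypothetical helper that takes any callable returning a negative errno on failure; the 64-yield budget and 1 ms sleep follow the numbers quoted above.

```cpp
#include <cassert>
#include <cerrno>
#include <sched.h>
#include <time.h>

// Calls submit() until it returns something other than -EAGAIN. The first
// yield_budget retries yield the CPU; later retries sleep for 1 ms each, so
// the delay per retry is bounded while the retry count is not.
template <class SubmitFn>
int submit_with_backoff(SubmitFn submit, int yield_budget = 64) {
  assert(yield_budget >= 0);
  int yields = 0;
  for (;;) {
    const int rc = submit();
    if (rc != -EAGAIN) return rc;  // success or a real error
    if (yields < yield_budget) {
      ++yields;
      sched_yield();
    } else {
      struct timespec ts;
      ts.tv_sec = 0;
      ts.tv_nsec = 1000000;  // 1 ms
      nanosleep(&ts, nullptr);
    }
  }
}
```

Only `-EAGAIN` is retried in this sketch; every other return value, including other negative errno codes, is handed back to the caller unchanged.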

Verification
============

KV.benchmark, 100M keys × 100B, 16MB log (mostly disk-spill),
1 completion thread, 100% reads:

  libaio: t=1 135K ops/sec, t=4 400K, t=8 444K, t=16 445K, t=32 404K
  uring:  t=1 124K ops/sec, t=4 244K, t=8 265K, t=16 278K, t=32 272K

Both backends stable across the full thread × dataset sweep
(previously native+libaio hung on any 10M+ dataset; native+uring hung
on every config).

64/64 Tsavorite.test.hlog IDevice + NativeStorageDevice tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc force-pushed the badrishc/optimize-device branch from 01c9a77 to dc97630 Compare May 28, 2026 03:37
@badrishc badrishc merged commit 97b52bc into main May 28, 2026
187 checks passed
@badrishc badrishc deleted the badrishc/optimize-device branch May 28, 2026 17:08