sieve

package module
v0.6.3 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 13, 2026 License: BSD-2-Clause Imports: 10 Imported by: 1

Image README

go-sieve - SIEVE cache eviction for Go

Go Reference Go Report Card Release

A Go implementation of the SIEVE cache eviction algorithm (NSDI'24, Zhang et al.), engineered from the ground up for highly concurrent, read-heavy workloads. Generic over key and value types.

The read path (Get()) is fully lock-free — a single atomic load on the key→index map plus a single atomic bit update on a shared visited bitfield. No mutex, no pointer chasing, no per-entry allocations, zero GC traffic on hits. Under parallel load this is ~90–300x faster than hashicorp/golang-lru and scales linearly with cores: measured at 1–3 ns/op across 32 goroutines on real-world cache traces. The write path uses a single short-held mutex and a pre-allocated node pool, so Add()/Probe() also avoid per-operation heap allocation in steady state.

If you need a cache that many goroutines hit simultaneously on the read path — an HTTP response cache, a DNS resolver cache, an authz decision cache, a hot-path object lookup — this implementation is built for that shape. For purely single-threaded use, hashicorp/golang-lru may be marginally faster on the sequential write path; under any concurrency, Sieve wins decisively.

SIEVE uses a FIFO queue with a roving "hand" pointer: cache hits set a visited bit (lazy promotion), and eviction scans from the hand clearing visited bits until it finds an unvisited node (quick demotion). It matches or exceeds LRU/ARC hit ratios with far less bookkeeping — validated here on ~300M requests from the MSR Cambridge and Meta Storage trace repositories.

Key Design Elements

Array-backed indexed list. Marc Brooker observed that SIEVE's mid-list removal prevents a simple circular buffer. Rather than Tobin Baker's "swap tail into hole" workaround, this implementation uses a doubly-linked list with int32 indices into a pre-allocated backing array. This preserves SIEVE's exact eviction semantics while eliminating all interior pointers — the GC sees a flat []node with no pointers to trace (for non-pointer K, V types).

xsync.MapOf for concurrent access. The key→index map uses puzpuzpuz/xsync.MapOf which stores int32 values inline in cache-line-padded buckets — no traced pointers per entry. Get() is fully lock-free; only Add()/Probe() (on miss) and Delete() take the global mutex.

Columnar slot state (slotState). Each node's lock and visited counter are hoisted out of the node struct into a separate contiguous []uint64 — one word per slot, laid out as a column alongside the []node array. This columnar split has three effects:

  1. the eviction hand scans IsVisited by walking a dense []uint64 sequentially — 8 slots per cache line, hardware-prefetch friendly — without pulling in key/val data it doesn't need;
  2. Get()'s LockAndMark writes only to the slotState word for that slot, so it never dirties the cache line holding the node's key/val — no false sharing with concurrent readers of adjacent nodes
  3. []uint64 contains no pointers, so the GC never traces it, unlike node fields that may hold pointer-typed K/V.

Within each word, bit 63 is a spinlock and the low bits are a saturating visited counter (1 bit at k=1, ⌈log₂(k+1)⌉ bits for higher k).

Pre-allocated node pool. All nodes are allocated once at cache creation in a contiguous array. A bump allocator + intrusive freelist (reusing node.next) provides O(1) alloc/free with zero heap allocations during steady-state operation.

TOCTOU-safe concurrent writes. Add() and Probe() use a double-check pattern: fast-path Load() outside the lock, re-check under mu.Lock() before inserting. This prevents duplicate nodes from concurrent writers racing on the same key.

Benchmark Results

Benchmarked against hashicorp/golang-lru v2.0.7 (LRU and ARC) on a 13th Gen Intel Core i9-13900 with 32 cores and 32GB of RAM:

  • Ubuntu/Linux 24.04 (kernel 6.8.0-106)
  • go1.26.1
  • GOMAXPROCS=32

Benchmarks live in bench/ as a separate module to avoid polluting go.mod.

Commands used (no name filter — every benchmark in every package runs):

cd bench && make bench    # synthetic comparison, count=3
cd bench && make trace    # trace replay + miss ratio + GC, count=1

Full raw results: bench-results.md.

Reading the parallel ns/op numbers

The sub-2 ns/op numbers in the parallel tables below are real but need context: they are aggregate throughput.

Go's b.RunParallel distributes b.N total operations across GOMAXPROCS goroutines and reports ns/op = wall_clock / b.N. When 32 goroutines complete 1 billion Get()s in ~1 second, the reported number is 1.0 ns/op — meaning "the system produces one completed Get every ~1 ns." The per-core latency is ~32 ns (1.0 × 32 cores), which is consistent with two L1/L2-hot atomic operations on a 5 GHz CPU.

LRU/ARC report ~200–600 ns/op under the same conditions — not because a single Get is 200x slower, but because every Get takes a mutex. The 32 goroutines serialize through one lock, so throughput is flat regardless of core count. On a single goroutine (see BenchmarkReplay, sequential), SIEVE and LRU are within 15% of each other — the ~100–300x gap is a concurrency-scaling story.

After warmup, the working data structures (map buckets, node array, visited bitfield) are resident in CPU cache — L1/L2 for small traces, L3 for larger ones. Real workloads with lower temporal locality will see higher per-core latency, but the relative advantage over mutex-bound LRU/ARC holds whenever there is any parallelism at all.

Parallel Micro-Benchmarks (count=3, medians)
Benchmark Sieve LRU ARC
Get_Parallel 2.36 ns/op, 0 B 563.2 ns/op, 0 B 606.7 ns/op, 0 B
Add_Parallel 426.9 ns/op, 8 B 527.0 ns/op, 40 B 1020 ns/op, 76 B
Probe_Parallel 378.4 ns/op, 8 B
Delete_Parallel 230.1 ns/op, 0 B 163.1 ns/op, 0 B 253.9 ns/op, 0 B
Mixed_Parallel (60/30/10) 344.1 ns/op, 2 B 602.7 ns/op, 12 B 637.9 ns/op, 24 B
Zipf_Get_Parallel (s=1.01) 16.5 ns/op 472.5 ns/op 396.7 ns/op
Memory @ 1M fill 122 MB, 1.10M allocs 156 MB, 1.01M allocs 156 MB, 1.01M allocs
GCImpact 9.22 ms/op, 9.8 KB/op 13.64 ms/op, 26.3 KB/op 14.26 ms/op, 117.2 KB/op

Probe_Parallel is SIEVE-only — LRU's PeekOrAdd and ContainsOrAdd skip recency promotion and are not semantic equivalents. Delete_Parallel is the one micro where SIEVE loses: LRU's single-lock linked-list unlink edges out SIEVE's slot-state clear (163 vs 230 ns/op). ARC is slowest (254 ns/op) because its T1/T2/B1/B2 bookkeeping doubles the work.

Trace Replay Results

This implmentation is benchmarked against real-world cache traces from the libCacheSim trace repository — 14 MSR Cambridge enterprise block I/O traces + 5 Meta Storage (Tectonic) block traces totalling ~300M requests. Each trace was replayed with a cache sized at 10% of unique keys, comparing SIEVE (k=1, k=2, k=3) against hashicorp/golang-lru (LRU and ARC).

Parallel Get throughput (warm cache, 32 goroutines, ns/op, zero allocs)

SIEVE's lock-free Get() is ~100–300x faster than LRU/ARC under concurrent read load. Every trace, every cache:

Trace SIEVE k=1 SIEVE k=3 LRU ARC
msr_web_2 1.02 1.04 182.6 377.5
meta_storage/block_traces_1 1.29 1.44 232.1 360.2
meta_storage/block_traces_2 1.30 1.51 257.2 358.9
msr_proj_4 1.31 1.23 360.7 479.1
meta_storage/block_traces_3 1.50 1.60 264.9 458.0
meta_storage/block_traces_4 1.55 1.62 234.1 449.2
msr_prn_1 1.56 1.59 321.4 381.6
meta_storage/block_traces_5 1.59 1.70 222.5 348.6
msr_prxy_0 1.76 2.27 313.5 363.3
msr_usr_2 2.03 2.12 344.5 502.0
msr_src1_1 2.19 2.28 394.8 587.4
msr_usr_1 2.32 2.26 395.0 536.9
msr_proj_2 2.79 3.03 280.8 460.6
msr_proj_1 2.91 2.89 334.6 483.3
msr_src1_0 4.85 4.89 336.4 443.6
msr_proj_0 5.24 5.67 294.6 407.5
msr_hm_0 6.61 7.12 275.6 455.2
msr_prn_0 9.21 10.41 288.0 402.3

The k=3 saturating counter adds under 10% overhead to the read path. This benchmark pre-warms the cache with a full trace replay, then hammers Get() only — the ideal scenario for a read-heavy cache in steady state.

Parallel Replay throughput (cold cache, mixed read+write, 32 goroutines, ns/op)

This is Probe() for SIEVE and Get+Add for LRU/ARC, hammered in parallel with no warmup. It measures the steady-state workload a real cache faces: reads and writes interleaved, evictions happening live. The previous table's numbers reflect the best-case read ceiling; this table reflects what throughput you actually get when your cache is doing work.

Trace SIEVE k=1 SIEVE k=3 LRU ARC
msr_prxy_0 6.45 6.61 401.3 431.0
msr_prn_0 14.90 15.49 481.7 497.1
meta_storage/block_traces_4 16.01 16.75 469.5 453.0
meta_storage/block_traces_1 16.15 18.48 440.7 486.5
msr_proj_0 16.36 15.86 422.8 471.8
msr_prn_1 16.50 18.88 421.4 491.6
meta_storage/block_traces_2 16.69 16.85 454.9 525.0
meta_storage/block_traces_3 16.74 17.71 434.4 501.3
meta_storage/block_traces_5 17.05 16.65 449.6 519.8
msr_hm_0 19.18 17.29 447.8 499.7
msr_usr_2 19.04 19.13 431.5 545.2
msr_proj_4 20.20 20.71 416.4 438.8
msr_usr_1 21.64 23.53 441.6 418.7
msr_src1_1 21.96 20.60 429.2 526.0
msr_src1_0 22.34 25.78 402.0 461.8
msr_proj_1 22.46 22.27 426.8 446.2
msr_proj_2 22.81 22.03 441.2 529.0
msr_web_2 24.35 25.41 436.6 466.3

Under concurrent cold-cache replay, SIEVE is ~18–62x faster than LRU/ARC. The best case is msr_prxy_0 (95% hits, so the lock-free fast path dominates); the worst case is msr_web_2 (98% miss ratio, almost every call takes the write lock) — and even there SIEVE is 18x faster.

Miss ratio (every trace)

Cache sized at 10% of unique keys. Bold = best in row.

Trace SIEVE k=1 SIEVE k=2 SIEVE k=3 LRU ARC
meta_storage/block_traces_1 0.4632 0.4651 0.4672 0.4602 0.4667
meta_storage/block_traces_2 0.4719 0.4743 0.4754 0.4676 0.4755
meta_storage/block_traces_3 0.4908 0.4928 0.4948 0.4885 0.4947
meta_storage/block_traces_4 0.4841 0.4870 0.4888 0.4812 0.4887
meta_storage/block_traces_5 0.4959 0.4984 0.4998 0.4927 0.5003
msr_hm_0 0.2991 0.3025 0.3025 0.3188 0.2923
msr_prn_0 0.2156 0.2194 0.2208 0.2310 0.2145
msr_prn_1 0.3908 0.3837 0.3796 0.4341 0.4148
msr_proj_0 0.2537 0.2660 0.2745 0.2375 0.2242
msr_proj_1 0.6794 0.6794 0.6794 0.7215 0.6788
msr_proj_2 0.8231 0.8231 0.8231 0.8548 0.8125
msr_proj_4 0.8463 0.8463 0.8463 0.8140 0.7173
msr_prxy_0 0.0512 0.0572 0.0594 0.0476 0.0468
msr_src1_0 0.7845 0.7845 0.7845 0.9132 0.7811
msr_src1_1 0.7939 0.7934 0.7934 0.8129 0.8231
msr_usr_1 0.3558 0.3558 0.3558 0.4007 0.3513
msr_usr_2 0.7216 0.7216 0.7216 0.7533 0.7199
msr_web_2 0.9786 0.9786 0.9786 0.9929 0.9785

Overall best (bold): ARC has the lowest miss ratio in 11 of 18 rows, LRU in 5 (all meta_storage), SIEVE k=3 in 2 (msr_prn_1 and msr_src1_1). SIEVE k=1 is never the overall best.

Head-to-head, SIEVE k=1 vs LRU (the typical deployment choice): SIEVE k=1 has lower miss ratio on 10 of 18 traces, LRU on 8. When SIEVE is better the margins are large (msr_src1_0: 12.9 points, msr_usr_1: 4.5, msr_prn_1: 4.3). When LRU is better the margins are narrow (all 5 meta_storage: 2–3 points each).

SIEVE's case rests on throughput Miss ratios are competitive (SIEVE is never dramatically worse than LRU), and SIEVE is 18–300x faster under any concurrency. SIEVE k=3 produces the single-best entry in the entire table on msr_prn_1 (0.3796 vs LRU 0.4341, ARC 0.4148).

Memory during replay

On the 13.2M-request meta_storage/block_traces_1 trace (601K-entry cache), TestGCPressure reports:

Variant TotalAlloc
SIEVE k=1 154 MB
SIEVE k=3 155 MB
LRU 418 MB
ARC 997 MB

SIEVE allocates 2.7x less than LRU and 6.5x less than ARC during replay — the array-backed node pool and inline-int32 xsync.MapOf are structural wins.

Full per-trace tables (sequential replay ns/op, B/op, miss ratio for every trace) are in bench-results.md. Methodology and trace-loading details are in bench/README.md.

Usage

import "github.com/opencoff/go-sieve"

// Create a cache mapping string keys to int values, capacity 1000.
c, err := sieve.New[string, int](1000)
if err != nil {
    log.Fatal(err) // ErrInvalidCapacity or ErrInvalidVisitClamp
}

// Or, for constant arguments, use Must to get a one-liner:
c := sieve.Must(sieve.New[string, int](1000))

c.Add("foo", 42)

if val, ok := c.Get("foo"); ok {
    fmt.Println(val) // 42
}

// Probe inserts only if absent; returns the cached value if present.
val, _, r := c.Probe("foo", 99)
// val == 42, r.Hit() == true

c.Delete("foo")
c.Purge() // reset entire cache
SIEVE-k

WithVisitClamp(k) creates a SIEVE-k cache where each entry uses a saturating counter instead of a single visited bit. An item accessed k+1 times survives k eviction passes before being evicted. k=1 is equivalent to classic SIEVE (the default). Use k=2 or k=3 for workloads with repeated access patterns where extra eviction resistance is beneficial.

// Classic SIEVE (k=1, the default)
c, err := sieve.New[string, int](1000)

// SIEVE-k=3: items survive up to 3 eviction passes
c, err := sieve.New[string, int](1000, sieve.WithVisitClamp(3))

New returns an error for capacity <= 0 (ErrInvalidCapacity) and for WithVisitClamp(k) with k > sieve.MaxVisitClamp (255 — ErrInvalidVisitClamp). Clamp values below 1 are silently rounded up to 1.

Eviction Handling

Add() and Probe() return the evicted entry (if any) along with a CacheResult bitmask. This allows callers to handle evictions synchronously without channels, goroutines, or lifecycle management.

c := sieve.Must(sieve.New[string, int](1000))

ev, r := c.Add("foo", 42)
if r.Evicted() {
    cleanupDisk(ev.Key, ev.Val)
}

// CacheResult bitmask:
//   CacheHit   — key was already present (value updated, no eviction)
//   CacheEvict — an entry was evicted to make room (mutually exclusive with CacheHit)

Purge() and Delete() do not report evictions.

API

Function / Method Description
New[K, V](capacity, ...Option) (*Sieve[K,V], error) Create a cache with fixed capacity. Returns ErrInvalidCapacity or ErrInvalidVisitClamp on bad input.
Must[K, V](*Sieve[K,V], error) *Sieve[K,V] Helper that panics on error; useful with constant arguments.
Get(key) (V, bool) Look up a key (lock-free, zero-alloc).
Add(key, val) (Evicted[K,V], CacheResult) Insert or update; returns evicted entry and result bitmask.
Probe(key, val) (V, Evicted[K,V], CacheResult) Insert-if-absent; returns cached/inserted value, evicted entry, and result bitmask.
Delete(key) bool Remove a key.
Purge() Clear the entire cache.
Len() int Current number of entries (lock-free atomic load).
Cap() int Maximum capacity.
Options
Option Description
WithVisitClamp(k) Use k-level saturating counters (default k=1 = classic SIEVE). k is capped at MaxVisitClamp (255); k < 1 is silently rounded to 1.

GC Note

When K or V is a pointer type (including string, which contains an internal pointer in Go), the node array will still contain GC-traced pointers. The GC pressure reduction is most dramatic for scalar key/value types (int, [16]byte, fixed-size structs).

License

BSD-2-Clause. See the source files for the full license text.

Image Documentation

Overview

Package sieve implements the SIEVE cache eviction algorithm (NSDI'24, Zhang et al.). https://yazhuozhang.com/assets/pdf/nsdi24-sieve.pdf

SIEVE uses a FIFO queue with a roving "hand" pointer. On cache hit, only a visited bit is set (lazy promotion). On miss, the hand scans toward the head, clearing visited bits until it finds an unvisited node to evict (quick demotion).

This implementation is optimized for low GC overhead and high concurrency: an array-backed doubly-linked list with int32 indices (no interior pointers), a combined per-node lock+visited word (one uint64 per node), and xsync.MapOf for lock-free reads.

Index

Constants

View Source
const MaxCapacity = math.MaxInt32 - 1

MaxCapacity is the largest value accepted by New for the cache capacity. Node indices are stored as int32 (to keep the node struct compact and to let xsync.MapOf pack values inline); the bump allocator advances to capacity+1 after the last fill, so the upper bound is math.MaxInt32 - 1.

View Source
const MaxVisitClamp = 7

MaxVisitClamp is the maximum value accepted by WithVisitClamp. Values above this are rejected by New with ErrInvalidVisitClamp. The limit is one byte because no published SIEVE-k workload benefits from counters larger than a few units.

Variables

View Source
var (
	// ErrInvalidCapacity is returned by New when capacity is outside the
	// valid range [1, MaxCapacity].
	ErrInvalidCapacity = errors.New("sieve: capacity out of range [1, MaxCapacity]")

	// ErrInvalidVisitClamp is returned by New when WithVisitClamp receives
	// a value greater than MaxVisitClamp. Values below 1 are silently
	// clamped up to 1 (classic SIEVE).
	ErrInvalidVisitClamp = errors.New("sieve: visit clamp must be <= MaxVisitClamp")
)

Construction errors returned by New.

Functions

This section is empty.

Types

type CacheResult added in v0.5.1

type CacheResult uint8

CacheResult is a bitmask indicating what happened during an Add or Probe operation.

const (
	// CacheHit indicates the key was already present in the cache.
	CacheHit CacheResult = 1 << iota

	// CacheEvict indicates an entry was evicted to make room for the new key.
	CacheEvict
)

func (CacheResult) Evicted added in v0.5.1

func (r CacheResult) Evicted() bool

Evicted reports whether an entry was evicted during the operation.

func (CacheResult) Hit added in v0.5.1

func (r CacheResult) Hit() bool

Hit reports whether the key was already present in the cache.

type Evicted added in v0.5.1

type Evicted[K comparable, V any] struct {
	Key K
	Val V
}

Evicted represents a key-value pair that was evicted from the cache.

type Option added in v0.5.0

type Option func(*config)

Option configures a Sieve cache at construction time.

func WithVisitClamp added in v0.5.0

func WithVisitClamp(k int) Option

WithVisitClamp creates a SIEVE-k cache where each entry can accumulate up to k visit counts before being considered "maximally visited". k=1 is equivalent to classic SIEVE (the default). k>1 uses multi-bit saturating counters: an item accessed k+1 times survives k eviction passes. Values less than 1 are clamped to 1.

type Sieve

type Sieve[K comparable, V any] struct {
	// contains filtered or unexported fields
}

Sieve represents a cache mapping the key of type 'K' with a value of type 'V'. The type 'K' must implement the comparable trait. An instance of Sieve has a fixed max capacity; new additions to the cache beyond the capacity will cause cache eviction of other entries - as determined by the SIEVE algorithm.

func Must added in v0.6.0

func Must[K comparable, V any](s *Sieve[K, V], err error) *Sieve[K, V]

Must is a helper that wraps a call to New and panics if the error is non-nil. It is intended for use in tests and top-level variable initializers where construction arguments are known constants:

var cache = sieve.Must(sieve.New[string, int](1000))

func New

func New[K comparable, V any](capacity int, opts ...Option) (*Sieve[K, V], error)

New creates a new cache of size 'capacity' mapping key 'K' to value 'V'. Without options, this creates a classic SIEVE with a single visited bit (k=1). Use WithVisitClamp to create a SIEVE-k cache.

Returns ErrInvalidCapacity if capacity is outside [1, MaxCapacity] and ErrInvalidVisitClamp if the visit-clamp option exceeds MaxVisitClamp. Clamp values below 1 are silently rounded up to 1 for backwards compatibility with callers that pass a zero default.

func (*Sieve[K, V]) Add

func (s *Sieve[K, V]) Add(key K, val V) (Evicted[K, V], CacheResult)

Add adds a new element to the cache or overwrites one if it exists. Returns the evicted entry (if any) and a CacheResult bitmask:

  • CacheHit is set when the key was already present (value updated).
  • CacheEvict is set when an entry was evicted to make room.

CacheHit and CacheEvict are mutually exclusive: updating an existing key never triggers eviction.

func (*Sieve[K, V]) Cap

func (s *Sieve[K, V]) Cap() int

Cap returns the max cache capacity

func (*Sieve[K, V]) Delete

func (s *Sieve[K, V]) Delete(key K) bool

Delete deletes the named key from the cache It returns true if the item was in the cache and false otherwise

func (*Sieve[K, V]) Dump

func (s *Sieve[K, V]) Dump() string

Dump dumps all the cache contents as a newline delimited string.

func (*Sieve[K, V]) Get

func (s *Sieve[K, V]) Get(key K) (V, bool)

Get fetches the value for a given key in the cache. It returns true if the key is in the cache, false otherwise. The zero value for 'V' is returned when key is not in the cache.

func (*Sieve[K, V]) Len

func (s *Sieve[K, V]) Len() int

Len returns the current cache utilization. It is lock-free (atomic load) and may observe in-flight updates from concurrent Add/Delete/Purge callers, so treat the result as a point-in-time estimate under concurrent use.

func (*Sieve[K, V]) Probe

func (s *Sieve[K, V]) Probe(key K, val V) (V, Evicted[K, V], CacheResult)

Probe adds <key, val> if not present in the cache. Returns:

  • The cached value (on hit) or val (on miss)
  • The evicted entry, if any
  • A CacheResult bitmask: CacheHit if key was present, CacheEvict if an entry was evicted. CacheHit and CacheEvict are mutually exclusive.

func (*Sieve[K, V]) Purge

func (s *Sieve[K, V]) Purge()

Purge resets the cache. Concurrent Get/Add/Probe calls that loaded an index before Purge may return a stale result; this is inherent to any concurrent purge operation.

We intentionally do NOT call slots.ResetAll() here. Visited bits for reused slots are cleared by newNode() via LockAndReset(), which safely spins until any concurrent fast-path holder releases the slot lock. An unconditional ResetAll(Store→0) would destroy locks held by stale fast-path goroutines, causing two goroutines to "hold" the same lock.

func (*Sieve[K, V]) String

func (s *Sieve[K, V]) String() string

String returns a string description of the sieve cache

Image Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL