go-sieve - SIEVE cache eviction for Go

A Go implementation of the
SIEVE cache eviction
algorithm (NSDI'24, Zhang et al.), engineered from the ground up for highly
concurrent, read-heavy workloads. Generic over key and value types.
The read path (Get()) is fully lock-free — a single atomic load on the
key→index map plus a single atomic bit update on a shared visited bitfield.
No mutex, no pointer chasing, no per-entry allocations, zero GC traffic on
hits. Under parallel load this is ~90–300x faster than
hashicorp/golang-lru and scales linearly with cores: measured at
1–3 ns/op across 32 goroutines on real-world cache traces. The write
path uses a single short-held mutex and a pre-allocated node pool, so
Add()/Probe() also avoid per-operation heap allocation in steady state.
If you need a cache that many goroutines hit simultaneously on the read
path — an HTTP response cache, a DNS resolver cache, an authz decision
cache, a hot-path object lookup — this implementation is built for that
shape. For purely single-threaded use, hashicorp/golang-lru may be
marginally faster on the sequential write path; under any concurrency, Sieve
wins decisively.
SIEVE uses a FIFO queue with a roving "hand" pointer: cache hits set a
visited bit (lazy promotion), and eviction scans from the hand clearing
visited bits until it finds an unvisited node (quick demotion). It matches
or exceeds LRU/ARC hit ratios with far less bookkeeping — validated here on
~300M requests from the MSR Cambridge and Meta Storage trace repositories.
Key Design Elements
Array-backed indexed list. Marc Brooker observed
that SIEVE's mid-list removal prevents a simple circular buffer. Rather than
Tobin Baker's "swap tail into hole" workaround, this implementation uses a
doubly-linked list with int32 indices into a pre-allocated backing array.
This preserves SIEVE's exact eviction semantics while eliminating all interior
pointers — the GC sees a flat []node with no pointers to trace (for
non-pointer K, V types).
xsync.MapOf for concurrent access. The key→index map uses
puzpuzpuz/xsync.MapOf which stores
int32 values inline in cache-line-padded buckets — no traced pointers per
entry. Get() is fully lock-free; only Add()/Probe() (on miss) and
Delete() take the global mutex.
Columnar slot state (slotState). Each node's lock and visited
counter are hoisted out of the node struct into a separate contiguous
[]uint64 — one word per slot, laid out as a column alongside the
[]node array. This columnar split has three effects:
- the eviction hand scans
IsVisited by walking a dense []uint64 sequentially —
8 slots per cache line, hardware-prefetch friendly — without pulling in
key/val data it doesn't need;
Get()'s LockAndMark writes only to the slotState word for that slot,
so it never dirties the cache line holding the node's key/val — no false
sharing with concurrent readers of adjacent nodes
[]uint64 contains no pointers, so the GC never traces it, unlike node fields
that may hold pointer-typed K/V.
Within each word, bit 63 is a spinlock and the low bits are a
saturating visited counter (1 bit at k=1, ⌈log₂(k+1)⌉ bits for higher k).
Pre-allocated node pool. All nodes are allocated once at cache creation in
a contiguous array. A bump allocator + intrusive freelist (reusing node.next)
provides O(1) alloc/free with zero heap allocations during steady-state operation.
TOCTOU-safe concurrent writes. Add() and Probe() use a double-check
pattern: fast-path Load() outside the lock, re-check under mu.Lock() before
inserting. This prevents duplicate nodes from concurrent writers racing on the
same key.
Benchmark Results
Benchmarked against hashicorp/golang-lru v2.0.7
(LRU and ARC) on a 13th Gen Intel Core i9-13900 with 32 cores and 32GB of RAM:
- Ubuntu/Linux 24.04 (kernel 6.8.0-106)
go1.26.1
GOMAXPROCS=32
Benchmarks live in bench/ as a separate module to avoid polluting go.mod.
Commands used (no name filter — every benchmark in every package runs):
cd bench && make bench # synthetic comparison, count=3
cd bench && make trace # trace replay + miss ratio + GC, count=1
Full raw results: bench-results.md.
Reading the parallel ns/op numbers
The sub-2 ns/op numbers in the parallel tables below are real but need
context: they are aggregate throughput.
Go's b.RunParallel distributes b.N total operations across
GOMAXPROCS goroutines and reports ns/op = wall_clock / b.N. When 32
goroutines complete 1 billion Get()s in ~1 second, the reported number
is 1.0 ns/op — meaning "the system produces one completed Get every
~1 ns." The per-core latency is ~32 ns (1.0 × 32 cores), which is
consistent with two L1/L2-hot atomic operations on a 5 GHz CPU.
LRU/ARC report ~200–600 ns/op under the same conditions — not because a
single Get is 200x slower, but because every Get takes a mutex. The 32
goroutines serialize through one lock, so throughput is flat regardless
of core count. On a single goroutine (see BenchmarkReplay,
sequential), SIEVE and LRU are within 15% of each other — the
~100–300x gap is a concurrency-scaling story.
After warmup, the working data structures (map buckets, node array,
visited bitfield) are resident in CPU cache — L1/L2 for small traces,
L3 for larger ones. Real workloads with lower temporal locality will
see higher per-core latency, but the relative advantage over
mutex-bound LRU/ARC holds whenever there is any parallelism at all.
| Benchmark |
Sieve |
LRU |
ARC |
Get_Parallel |
2.36 ns/op, 0 B |
563.2 ns/op, 0 B |
606.7 ns/op, 0 B |
Add_Parallel |
426.9 ns/op, 8 B |
527.0 ns/op, 40 B |
1020 ns/op, 76 B |
Probe_Parallel |
378.4 ns/op, 8 B |
— |
— |
Delete_Parallel |
230.1 ns/op, 0 B |
163.1 ns/op, 0 B |
253.9 ns/op, 0 B |
Mixed_Parallel (60/30/10) |
344.1 ns/op, 2 B |
602.7 ns/op, 12 B |
637.9 ns/op, 24 B |
Zipf_Get_Parallel (s=1.01) |
16.5 ns/op |
472.5 ns/op |
396.7 ns/op |
| Memory @ 1M fill |
122 MB, 1.10M allocs |
156 MB, 1.01M allocs |
156 MB, 1.01M allocs |
GCImpact |
9.22 ms/op, 9.8 KB/op |
13.64 ms/op, 26.3 KB/op |
14.26 ms/op, 117.2 KB/op |
Probe_Parallel is SIEVE-only — LRU's PeekOrAdd and ContainsOrAdd
skip recency promotion and are not semantic equivalents. Delete_Parallel
is the one micro where SIEVE loses: LRU's single-lock linked-list unlink
edges out SIEVE's slot-state clear (163 vs 230 ns/op). ARC is slowest
(254 ns/op) because its T1/T2/B1/B2 bookkeeping doubles the work.
Trace Replay Results
This implmentation is benchmarked against real-world cache traces from the
libCacheSim trace repository — 14 MSR Cambridge
enterprise block I/O traces + 5 Meta Storage (Tectonic) block traces
totalling ~300M requests. Each trace was replayed with a cache sized at
10% of unique keys, comparing SIEVE (k=1, k=2, k=3) against
hashicorp/golang-lru (LRU and ARC).
Parallel Get throughput (warm cache, 32 goroutines, ns/op, zero allocs)
SIEVE's lock-free Get() is ~100–300x faster than LRU/ARC under
concurrent read load. Every trace, every cache:
| Trace |
SIEVE k=1 |
SIEVE k=3 |
LRU |
ARC |
| msr_web_2 |
1.02 |
1.04 |
182.6 |
377.5 |
| meta_storage/block_traces_1 |
1.29 |
1.44 |
232.1 |
360.2 |
| meta_storage/block_traces_2 |
1.30 |
1.51 |
257.2 |
358.9 |
| msr_proj_4 |
1.31 |
1.23 |
360.7 |
479.1 |
| meta_storage/block_traces_3 |
1.50 |
1.60 |
264.9 |
458.0 |
| meta_storage/block_traces_4 |
1.55 |
1.62 |
234.1 |
449.2 |
| msr_prn_1 |
1.56 |
1.59 |
321.4 |
381.6 |
| meta_storage/block_traces_5 |
1.59 |
1.70 |
222.5 |
348.6 |
| msr_prxy_0 |
1.76 |
2.27 |
313.5 |
363.3 |
| msr_usr_2 |
2.03 |
2.12 |
344.5 |
502.0 |
| msr_src1_1 |
2.19 |
2.28 |
394.8 |
587.4 |
| msr_usr_1 |
2.32 |
2.26 |
395.0 |
536.9 |
| msr_proj_2 |
2.79 |
3.03 |
280.8 |
460.6 |
| msr_proj_1 |
2.91 |
2.89 |
334.6 |
483.3 |
| msr_src1_0 |
4.85 |
4.89 |
336.4 |
443.6 |
| msr_proj_0 |
5.24 |
5.67 |
294.6 |
407.5 |
| msr_hm_0 |
6.61 |
7.12 |
275.6 |
455.2 |
| msr_prn_0 |
9.21 |
10.41 |
288.0 |
402.3 |
The k=3 saturating counter adds under 10% overhead to the read path.
This benchmark pre-warms the cache with a full trace replay, then hammers
Get() only — the ideal scenario for a read-heavy cache in steady state.
Parallel Replay throughput (cold cache, mixed read+write, 32 goroutines, ns/op)
This is Probe() for SIEVE and Get+Add for LRU/ARC, hammered in
parallel with no warmup. It measures the steady-state workload a real
cache faces: reads and writes interleaved, evictions happening live. The
previous table's numbers reflect the best-case read ceiling; this table
reflects what throughput you actually get when your cache is doing work.
| Trace |
SIEVE k=1 |
SIEVE k=3 |
LRU |
ARC |
| msr_prxy_0 |
6.45 |
6.61 |
401.3 |
431.0 |
| msr_prn_0 |
14.90 |
15.49 |
481.7 |
497.1 |
| meta_storage/block_traces_4 |
16.01 |
16.75 |
469.5 |
453.0 |
| meta_storage/block_traces_1 |
16.15 |
18.48 |
440.7 |
486.5 |
| msr_proj_0 |
16.36 |
15.86 |
422.8 |
471.8 |
| msr_prn_1 |
16.50 |
18.88 |
421.4 |
491.6 |
| meta_storage/block_traces_2 |
16.69 |
16.85 |
454.9 |
525.0 |
| meta_storage/block_traces_3 |
16.74 |
17.71 |
434.4 |
501.3 |
| meta_storage/block_traces_5 |
17.05 |
16.65 |
449.6 |
519.8 |
| msr_hm_0 |
19.18 |
17.29 |
447.8 |
499.7 |
| msr_usr_2 |
19.04 |
19.13 |
431.5 |
545.2 |
| msr_proj_4 |
20.20 |
20.71 |
416.4 |
438.8 |
| msr_usr_1 |
21.64 |
23.53 |
441.6 |
418.7 |
| msr_src1_1 |
21.96 |
20.60 |
429.2 |
526.0 |
| msr_src1_0 |
22.34 |
25.78 |
402.0 |
461.8 |
| msr_proj_1 |
22.46 |
22.27 |
426.8 |
446.2 |
| msr_proj_2 |
22.81 |
22.03 |
441.2 |
529.0 |
| msr_web_2 |
24.35 |
25.41 |
436.6 |
466.3 |
Under concurrent cold-cache replay, SIEVE is ~18–62x faster than
LRU/ARC. The best case is msr_prxy_0 (95% hits, so the lock-free fast
path dominates); the worst case is msr_web_2 (98% miss ratio, almost
every call takes the write lock) — and even there SIEVE is 18x faster.
Miss ratio (every trace)
Cache sized at 10% of unique keys. Bold = best in row.
| Trace |
SIEVE k=1 |
SIEVE k=2 |
SIEVE k=3 |
LRU |
ARC |
| meta_storage/block_traces_1 |
0.4632 |
0.4651 |
0.4672 |
0.4602 |
0.4667 |
| meta_storage/block_traces_2 |
0.4719 |
0.4743 |
0.4754 |
0.4676 |
0.4755 |
| meta_storage/block_traces_3 |
0.4908 |
0.4928 |
0.4948 |
0.4885 |
0.4947 |
| meta_storage/block_traces_4 |
0.4841 |
0.4870 |
0.4888 |
0.4812 |
0.4887 |
| meta_storage/block_traces_5 |
0.4959 |
0.4984 |
0.4998 |
0.4927 |
0.5003 |
| msr_hm_0 |
0.2991 |
0.3025 |
0.3025 |
0.3188 |
0.2923 |
| msr_prn_0 |
0.2156 |
0.2194 |
0.2208 |
0.2310 |
0.2145 |
| msr_prn_1 |
0.3908 |
0.3837 |
0.3796 |
0.4341 |
0.4148 |
| msr_proj_0 |
0.2537 |
0.2660 |
0.2745 |
0.2375 |
0.2242 |
| msr_proj_1 |
0.6794 |
0.6794 |
0.6794 |
0.7215 |
0.6788 |
| msr_proj_2 |
0.8231 |
0.8231 |
0.8231 |
0.8548 |
0.8125 |
| msr_proj_4 |
0.8463 |
0.8463 |
0.8463 |
0.8140 |
0.7173 |
| msr_prxy_0 |
0.0512 |
0.0572 |
0.0594 |
0.0476 |
0.0468 |
| msr_src1_0 |
0.7845 |
0.7845 |
0.7845 |
0.9132 |
0.7811 |
| msr_src1_1 |
0.7939 |
0.7934 |
0.7934 |
0.8129 |
0.8231 |
| msr_usr_1 |
0.3558 |
0.3558 |
0.3558 |
0.4007 |
0.3513 |
| msr_usr_2 |
0.7216 |
0.7216 |
0.7216 |
0.7533 |
0.7199 |
| msr_web_2 |
0.9786 |
0.9786 |
0.9786 |
0.9929 |
0.9785 |
Overall best (bold): ARC has the lowest miss ratio in 11 of 18 rows,
LRU in 5 (all meta_storage), SIEVE k=3 in 2 (msr_prn_1 and
msr_src1_1). SIEVE k=1 is never the overall best.
Head-to-head, SIEVE k=1 vs LRU (the typical deployment choice):
SIEVE k=1 has lower miss ratio on 10 of 18 traces, LRU on 8. When
SIEVE is better the margins are large (msr_src1_0: 12.9 points,
msr_usr_1: 4.5, msr_prn_1: 4.3). When LRU is better the margins are
narrow (all 5 meta_storage: 2–3 points each).
SIEVE's case rests on throughput Miss ratios are
competitive (SIEVE is never dramatically worse than LRU), and SIEVE is
18–300x faster under any concurrency. SIEVE k=3 produces the
single-best entry in the entire table on msr_prn_1 (0.3796 vs LRU
0.4341, ARC 0.4148).
Memory during replay
On the 13.2M-request meta_storage/block_traces_1 trace (601K-entry cache),
TestGCPressure reports:
| Variant |
TotalAlloc |
| SIEVE k=1 |
154 MB |
| SIEVE k=3 |
155 MB |
| LRU |
418 MB |
| ARC |
997 MB |
SIEVE allocates 2.7x less than LRU and 6.5x less than ARC during
replay — the array-backed node pool and inline-int32 xsync.MapOf are
structural wins.
Full per-trace tables (sequential replay ns/op, B/op, miss ratio for every
trace) are in bench-results.md. Methodology and
trace-loading details are in bench/README.md.
Usage
import "github.com/opencoff/go-sieve"
// Create a cache mapping string keys to int values, capacity 1000.
c, err := sieve.New[string, int](1000)
if err != nil {
log.Fatal(err) // ErrInvalidCapacity or ErrInvalidVisitClamp
}
// Or, for constant arguments, use Must to get a one-liner:
c := sieve.Must(sieve.New[string, int](1000))
c.Add("foo", 42)
if val, ok := c.Get("foo"); ok {
fmt.Println(val) // 42
}
// Probe inserts only if absent; returns the cached value if present.
val, _, r := c.Probe("foo", 99)
// val == 42, r.Hit() == true
c.Delete("foo")
c.Purge() // reset entire cache
SIEVE-k
WithVisitClamp(k) creates a SIEVE-k cache where each entry uses a
saturating counter instead of a single visited bit. An item accessed k+1
times survives k eviction passes before being evicted. k=1 is equivalent
to classic SIEVE (the default). Use k=2 or k=3 for workloads with
repeated access patterns where extra eviction resistance is beneficial.
// Classic SIEVE (k=1, the default)
c, err := sieve.New[string, int](1000)
// SIEVE-k=3: items survive up to 3 eviction passes
c, err := sieve.New[string, int](1000, sieve.WithVisitClamp(3))
New returns an error for capacity <= 0 (ErrInvalidCapacity) and for
WithVisitClamp(k) with k > sieve.MaxVisitClamp (255 — ErrInvalidVisitClamp).
Clamp values below 1 are silently rounded up to 1.
Eviction Handling
Add() and Probe() return the evicted entry (if any) along with a
CacheResult bitmask. This allows callers to handle evictions
synchronously without channels, goroutines, or lifecycle management.
c := sieve.Must(sieve.New[string, int](1000))
ev, r := c.Add("foo", 42)
if r.Evicted() {
cleanupDisk(ev.Key, ev.Val)
}
// CacheResult bitmask:
// CacheHit — key was already present (value updated, no eviction)
// CacheEvict — an entry was evicted to make room (mutually exclusive with CacheHit)
Purge() and Delete() do not report evictions.
API
| Function / Method |
Description |
New[K, V](capacity, ...Option) (*Sieve[K,V], error) |
Create a cache with fixed capacity. Returns ErrInvalidCapacity or ErrInvalidVisitClamp on bad input. |
Must[K, V](*Sieve[K,V], error) *Sieve[K,V] |
Helper that panics on error; useful with constant arguments. |
Get(key) (V, bool) |
Look up a key (lock-free, zero-alloc). |
Add(key, val) (Evicted[K,V], CacheResult) |
Insert or update; returns evicted entry and result bitmask. |
Probe(key, val) (V, Evicted[K,V], CacheResult) |
Insert-if-absent; returns cached/inserted value, evicted entry, and result bitmask. |
Delete(key) bool |
Remove a key. |
Purge() |
Clear the entire cache. |
Len() int |
Current number of entries (lock-free atomic load). |
Cap() int |
Maximum capacity. |
Options
| Option |
Description |
WithVisitClamp(k) |
Use k-level saturating counters (default k=1 = classic SIEVE). k is capped at MaxVisitClamp (255); k < 1 is silently rounded to 1. |
GC Note
When K or V is a pointer type (including string, which contains an
internal pointer in Go), the node array will still contain GC-traced pointers.
The GC pressure reduction is most dramatic for scalar key/value types (int,
[16]byte, fixed-size structs).
License
BSD-2-Clause. See the source files for the full license text.