Inspiration
This project began with a simple frustration: when a host starts slowing down or memory pressure builds up, it’s hard to understand why. Tools like top, htop, or /proc only show resource usage, not the cause of the pressure. Even when OOM events occur, it’s unclear which process actually triggered the chain of events.
I wanted a way to correlate CPU usage, scheduler contention, and page-fault pressure in the same sliding window, so I could explain performance symptoms instead of just observing them. That’s where eBPF came in: it let me capture kernel signals at the source and assemble a real-time performance storyline.
What I Learned
During this project I learned:
- eBPF tracepoints such as sched_switch give nanosecond-accurate CPU accounting per task (see the sketch after this list)
- a kprobe on handle_mm_fault reveals page-fault pressure in real time
- CPU usage alone is meaningless; correlation across signals is everything
- kernel BTF is critical for CO-RE and must match the host kernel
- user-space diagnosis logic matters as much as BPF collection
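To make the first point concrete, here is a minimal sketch of how the user-space side can turn per-PID on-CPU time (nanoseconds accumulated by the sched_switch handler) into CPU utilization over a sliding window. The map layout and numbers are illustrative assumptions, not the project’s actual schema.

```go
// Sketch: convert cumulative per-PID runtime (ns) into CPU% per window.
// Assumes the sched_switch BPF program maintains a PID -> total on-CPU
// nanoseconds counter; the layout here is hypothetical.
package main

import "fmt"

// cpuPercent returns the share of one CPU a PID used during a window,
// given two samples of its cumulative on-CPU time in nanoseconds.
func cpuPercent(prevNs, currNs, windowNs uint64) float64 {
	if windowNs == 0 || currNs < prevNs {
		return 0
	}
	return float64(currNs-prevNs) / float64(windowNs) * 100
}

func main() {
	// Example: 750 ms of on-CPU time observed in a 1 s window -> 75.0%.
	fmt.Printf("%.1f%%\n", cpuPercent(1_000_000_000, 1_750_000_000, 1_000_000_000))
}
```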
How I Built It
The project combines two eBPF programs + a Go TUI:
| Component | Purpose |
|---|---|
| sched/sched_switch tracepoint | CPU usage + process contention (victim/aggressor) |
| kprobe/handle_mm_fault | Page-fault rate per PID (detect memory pressure) |
| Go inference layer | Classifies processes (e.g., CPU-bound, Mem-thrashing, OOM risk) |
| TUI interface | Shows all signals in the same sliding window |
Everything is built using cilium/ebpf, bpf2go, and CO-RE (BTF-enabled), so the binary can run directly on the host without recompilation.
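As a rough illustration of the wiring, the snippet below shows how such programs are typically loaded and attached with cilium/ebpf. It assumes bpf2go was invoked with the stem hotspot, so the generated loader and program fields are named hotspotObjects, loadHotspotObjects, HandleSchedSwitch, and HandleMmFault; the real generated identifiers depend on how the project invokes bpf2go.

```go
// Sketch: attach the two BPF programs with cilium/ebpf.
// The hotspotObjects/loadHotspotObjects identifiers are hypothetical
// bpf2go output; adjust to whatever the generator actually emits.
package main

import (
	"log"

	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/rlimit"
)

func main() {
	// Lift the memlock limit so BPF maps can be created on older kernels.
	if err := rlimit.RemoveMemlock(); err != nil {
		log.Fatal(err)
	}

	var objs hotspotObjects // generated by bpf2go (hypothetical stem)
	if err := loadHotspotObjects(&objs, nil); err != nil {
		log.Fatal(err)
	}
	defer objs.Close()

	// CPU accounting + contention: sched:sched_switch tracepoint.
	tp, err := link.Tracepoint("sched", "sched_switch", objs.HandleSchedSwitch, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Close()

	// Page-fault pressure: kprobe on handle_mm_fault.
	kp, err := link.Kprobe("handle_mm_fault", objs.HandleMmFault, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer kp.Close()

	select {} // keep running; the real tool drives the TUI loop here
}
```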
Challenges
The main challenge was running on Kubernetes. I attempted to integrate with bpfman for program lifecycle management, but user-space changes are still required, so I postponed Kubernetes support and focused on a single-host prototype first.
Other challenges included:
- struct relocation issues when switching kernel versions
- formatting CO-RE maps correctly for bpf2go
- trimming noisy signals while keeping diagnoses accurate
- choosing thresholds for OOM risk detection (a sketch follows this list)
- making the UI readable during page-fault spikes
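The thresholds ended up being simple per-window rate checks. Below is a hedged sketch of the kind of heuristic the inference layer can apply; the cut-off values and field names are illustrative, not the exact numbers used in hotspot-bpf.

```go
// Sketch: classify a process from per-window signals.
// Thresholds and field names are illustrative only.
package main

import "fmt"

type procStats struct {
	cpuPercent   float64 // share of one CPU in the window
	faultsPerSec float64 // page faults per second in the window
	faultTrend   float64 // ratio of current to previous window's fault rate
}

// classify maps one window of signals to a coarse diagnosis.
func classify(s procStats) string {
	switch {
	case s.faultsPerSec > 5000 && s.faultTrend > 2.0:
		return "OOM risk" // sustained and accelerating fault pressure
	case s.faultsPerSec > 1000:
		return "Mem-thrashing"
	case s.cpuPercent > 80:
		return "CPU-bound"
	default:
		return "OK"
	}
}

func main() {
	fmt.Println(classify(procStats{cpuPercent: 12, faultsPerSec: 9000, faultTrend: 3.1}))
}
```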
What hotspot-bpf Solves
Traditional tools show usage; hotspot-bpf reveals cause and effect:
- Why a process is slow (CPU-bound vs starved)
- Who is stealing CPU (victim/aggressor pairs)
- When memory is becoming unstable
- How close the system is to an OOM kill, before it happens
What’s Next
- Kubernetes support using bpfman
- JSON export mode for Grafana / Prometheus
- Automatic alerting when “OOM risk” is detected
- Optional stack sampling using bpf_get_stackid
- cgroup filtering per namespace / tenant
Final Reflection
This project began with a debugging frustration and turned into a kernel-level forensic tool. Working with eBPF taught me how much the kernel already knows, and how little of it we usually see.