Inspiration
This project began with a simple frustration: when a host starts slowing down or memory pressure builds up, it’s hard to understand why. Tools like top, htop, or /proc only show resource usage, not the cause of the pressure. Even when OOM events occur, it’s unclear which process actually triggered the chain of events.
I wanted a way to correlate CPU usage, scheduler contention, and page-fault pressure in the same sliding window, so I could explain performance symptoms instead of just observing them. That’s where eBPF came in: it let me capture kernel signals at the source and assemble a real-time performance storyline.
What I Learned
During this project I learned:
- eBPF tracepoints such as sched_switch give nanosecond-accurate CPU accounting per task (see the sketch after this list)
- a kprobe on handle_mm_fault reveals page-fault pressure in real time
- CPU usage alone is meaningless; correlation across signals is everything
- kernel BTF is critical for CO-RE and must match the host kernel
- user-space diagnosis logic matters as much as BPF collection
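To make the first point concrete, here is a minimal sketch of how the user-space side can turn per-PID on-CPU time (nanoseconds accumulated by the sched_switch handler) into CPU utilization over a sliding window. The map layout and numbers are illustrative assumptions, not the project’s actual schema.

```go
// Sketch: convert cumulative per-PID runtime (ns) into CPU% per window.
// Assumes the sched_switch BPF program maintains a PID -> total on-CPU
// nanoseconds counter; the layout here is hypothetical.
package main

import "fmt"

// cpuPercent returns the share of one CPU a PID used during a window,
// given two samples of its cumulative on-CPU time in nanoseconds.
func cpuPercent(prevNs, currNs, windowNs uint64) float64 {
	if windowNs == 0 || currNs < prevNs {
		return 0
	}
	return float64(currNs-prevNs) / float64(windowNs) * 100
}

func main() {
	// Example: 750 ms of on-CPU time observed in a 1 s window -> 75.0%.
	fmt.Printf("%.1f%%\n", cpuPercent(1_000_000_000, 1_750_000_000, 1_000_000_000))
}
```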
How I Built It
The project combines two eBPF programs + a Go TUI:
| Component | Purpose |
|---|---|
| sched/sched_switch tracepoint | CPU usage + process contention (victim/aggressor) |
| kprobe/handle_mm_fault | Page-fault rate per PID (detect memory pressure) |
| Go inference layer | Classifies processes (e.g., CPU-bound, Mem-thrashing, OOM risk) |
| TUI interface | Shows all signals in the same sliding window |
Everything is built using cilium/ebpf, bpf2go, and CO-RE (BTF-enabled), so the binary can run directly on the host without recompilation.
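As a rough illustration of the wiring, the snippet below shows how such programs are typically loaded and attached with cilium/ebpf. It assumes bpf2go was invoked with the stem hotspot, so the generated loader and program fields are named hotspotObjects, loadHotspotObjects, HandleSchedSwitch, and HandleMmFault; the real generated identifiers depend on how the project invokes bpf2go.

```go
// Sketch: attach the two BPF programs with cilium/ebpf.
// The hotspotObjects/loadHotspotObjects identifiers are hypothetical
// bpf2go output; adjust to whatever the generator actually emits.
package main

import (
	"log"

	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/rlimit"
)

func main() {
	// Lift the memlock limit so BPF maps can be created on older kernels.
	if err := rlimit.RemoveMemlock(); err != nil {
		log.Fatal(err)
	}

	var objs hotspotObjects // generated by bpf2go (hypothetical stem)
	if err := loadHotspotObjects(&objs, nil); err != nil {
		log.Fatal(err)
	}
	defer objs.Close()

	// CPU accounting + contention: sched:sched_switch tracepoint.
	tp, err := link.Tracepoint("sched", "sched_switch", objs.HandleSchedSwitch, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Close()

	// Page-fault pressure: kprobe on handle_mm_fault.
	kp, err := link.Kprobe("handle_mm_fault", objs.HandleMmFault, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer kp.Close()

	select {} // keep running; the real tool drives the TUI loop here
}
```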
Challenges
The main challenge was running on Kubernetes. I attempted to integrate with bpfman for program lifecycle management, but user-space changes are still required, so I postponed Kubernetes support and focused on a single-host prototype first.
Other challenges included:
- struct relocation issues when switching kernel versions
- formatting CO-RE maps correctly for bpf2go
- trimming noisy signals while keeping diagnoses accurate
- choosing thresholds for OOM risk detection (a sketch follows this list)
- making the UI readable during page-fault spikes
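The thresholds ended up being simple per-window rate checks. Below is a hedged sketch of the kind of heuristic the inference layer can apply; the cut-off values and field names are illustrative, not the exact numbers used in hotspot-bpf.

```go
// Sketch: classify a process from per-window signals.
// Thresholds and field names are illustrative only.
package main

import "fmt"

type procStats struct {
	cpuPercent   float64 // share of one CPU in the window
	faultsPerSec float64 // page faults per second in the window
	faultTrend   float64 // ratio of current to previous window's fault rate
}

// classify maps one window of signals to a coarse diagnosis.
func classify(s procStats) string {
	switch {
	case s.faultsPerSec > 5000 && s.faultTrend > 2.0:
		return "OOM risk" // sustained and accelerating fault pressure
	case s.faultsPerSec > 1000:
		return "Mem-thrashing"
	case s.cpuPercent > 80:
		return "CPU-bound"
	default:
		return "OK"
	}
}

func main() {
	fmt.Println(classify(procStats{cpuPercent: 12, faultsPerSec: 9000, faultTrend: 3.1}))
}
```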
What hotspot-bpf Solves
Traditional tools show usage; hotspot-bpf reveals cause and effect:
- Why a process is slow (CPU-bound vs starved)
- Who is stealing CPU (victim/aggressor pairs)
- When memory is becoming unstable
- How close the system is to an OOM kill, before it happens
What’s Next
- Kubernetes support using bpfman
- JSON export mode for Grafana / Prometheus
- Automatic alerting when “OOM risk” is detected
- Optional stack sampling using bpf_get_stackid
- cgroup filtering per namespace / tenant
Final Reflection
This project began with a debugging frustration and turned into a kernel-level forensic tool. Working with eBPF taught me how much the kernel already knows, and how little of it we usually see.