llama.cpp-omni is a high-performance Omni multimodal inference engine built on llama.cpp.
- 🚀 First Full-Duplex Omni Streaming Engine — The first open-source C++ inference framework supporting full-duplex, omni-modal streaming video calls
- ⚡ Lightweight & Efficient — Inherits llama.cpp's high-performance characteristics with GGUF quantization support and low memory footprint
- 🔌 Fully Ecosystem Compatible — Compatible with llama.cpp interfaces and ecosystem for seamless integration with existing toolchains
- 🌐 Cross-Platform Deployment — Supports Windows, Linux, and macOS, enabling efficient Omni model inference on consumer-grade hardware
- 🎙️ End-to-End Voice Interaction — Supports the complete pipeline of streaming audio input, LLM inference, and TTS speech synthesis
MiniCPM-o 4.5 is a 9B-parameter on-device omni-modal large language model jointly developed by ModelBest and Tsinghua University, featuring powerful vision, speech, and full-duplex streaming capabilities.
llama.cpp-omni is built on the MiniCPM-o 4.5 end-to-end omni-modal architecture, in which modality encoders and decoders are densely connected to the LLM through hidden states. This design enables better information flow and control while fully leveraging the rich multimodal knowledge acquired during training.
llama.cpp-omni splits the original PyTorch model into multiple independent GGUF modules, each with specific responsibilities:
- VPM: Vision encoder based on the SigLIP2 architecture, responsible for encoding images into visual embeddings. Includes a Resampler module that compresses visual features into a fixed number of query tokens before projecting them into the LLM's hidden space.
- APM: Audio encoder based on the Whisper architecture, responsible for encoding 16kHz audio into audio embeddings. Features AvgPool and Projector layers that project the audio features into the LLM's hidden space.
- LLM: Main language model based on Qwen3-8B, which receives visual and audio embeddings as input and generates text token sequences. Supports multiple quantization formats (F16/Q8_0/Q4_K_M).
- TTS: Text-to-speech model based on LLaMA architecture, which projects LLM hidden states through Projector Semantic and autoregressively generates audio token sequences.
- Token2Wav: Flow Matching-based vocoder that converts audio tokens into 24kHz waveform audio.
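To make the division of labor concrete, here is a minimal, hypothetical C++ sketch of the data flow between the five GGUF modules. Every type and function name in it (`Embeddings`, `vpm_encode`, `apm_encode`, `llm_decode`, `tts_generate`, `token2wav`) is an illustrative stand-in, not the engine's actual API:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Embeddings already projected into the LLM's hidden space.
using Embeddings = std::vector<float>;

// Stub stand-ins for the five GGUF modules (illustrative only).
Embeddings vpm_encode(const std::vector<uint8_t> & image_rgb) { return {}; }  // VPM + Resampler
Embeddings apm_encode(const std::vector<float> & pcm_16khz)   { return {}; }  // APM + AvgPool/Projector

struct LlmStep { std::string text; std::vector<float> hidden; };
LlmStep llm_decode(const Embeddings & prefill)                { return {}; }  // Qwen3-8B backbone

std::vector<int>   tts_generate(const std::vector<float> & hidden) { return {}; } // audio tokens
std::vector<float> token2wav(const std::vector<int> & tokens)      { return {}; } // 24kHz waveform

int main() {
    // One simplified turn: encode the inputs, run the LLM, then synthesize speech.
    Embeddings vis = vpm_encode({});   // image        -> visual embeddings
    Embeddings aud = apm_encode({});   // 16kHz audio  -> audio embeddings

    Embeddings prefill = vis;
    prefill.insert(prefill.end(), aud.begin(), aud.end());

    LlmStep step = llm_decode(prefill);                           // text tokens + hidden states
    std::vector<int>   audio_tokens = tts_generate(step.hidden);  // hidden states -> audio tokens
    std::vector<float> wav24k       = token2wav(audio_tokens);    // audio tokens  -> 24kHz audio
    (void) wav24k;
    return 0;
}
```

In the real engine these stages run as overlapping streaming steps rather than a single offline pass, as described in the streaming mechanism below.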
llama.cpp-omni implements a full-duplex streaming mechanism where input streams (video + audio) and output streams (speech + text) operate without blocking each other:
- Streaming Encoders: The offline modality encoders are turned into online streaming versions for real-time input processing. Audio is sliced into 1-second chunks for the APM, while images are fed frame by frame to the VPM.
- Time-Division Multiplexing (TDM): Within the LLM backbone, TDM divides parallel omni-modal streams into sequential information groups within periodic time slices, achieving millisecond-level input/output stream synchronization.
- Interleaved Speech Generation: The TTS module models text and speech tokens in an interleaved manner, supporting full-duplex speech generation where output can synchronize with new input in real-time while ensuring stability for long speech generation (>1 minute).
In duplex mode, the LLM continuously monitors incoming video and audio streams, deciding whether to speak proactively at 1Hz frequency. This high-frequency decision-making capability, combined with full-duplex features, enables proactive interactions such as spontaneous reminders and comments.
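As a rough mental model of one duplex time slice, here is a conceptual C++ sketch, assuming a 16kHz microphone stream (per the APM description above). All function names are hypothetical stand-ins; the real engine's scheduling and API differ:

```cpp
#include <cstddef>
#include <vector>

constexpr int    kSampleRate   = 16000;       // APM consumes 16kHz audio
constexpr size_t kChunkSamples = kSampleRate; // one 1-second chunk per time slice

struct Frame { std::vector<unsigned char> rgb; };

// Hypothetical stand-ins for the streaming input path and the speak decision.
void prefill_audio_chunk(const std::vector<float> & chunk) {}
void prefill_video_frame(const Frame & frame) {}
bool model_wants_to_speak() { return false; }  // i.e. emits <|speak|> rather than <|listen|>
void run_speech_generation() {}

// One 1Hz tick: feed the latest second of input, then let the LLM decide.
void duplex_tick(const std::vector<float> & mic_buffer, const Frame & latest_frame) {
    // TDM: within this time slice the parallel input streams are fed to the LLM
    // sequentially as one information group (audio chunk, then video frame).
    std::vector<float> chunk(mic_buffer.end() - static_cast<std::ptrdiff_t>(kChunkSamples),
                             mic_buffer.end());
    prefill_audio_chunk(chunk);        // 1 s of audio  -> APM -> LLM prefill
    prefill_video_frame(latest_frame); // current frame -> VPM -> LLM prefill

    // At 1Hz the LLM decides whether to speak proactively; because the engine is
    // full duplex, any ongoing speech output keeps streaming in the meantime.
    if (model_wants_to_speak()) {
        run_speech_generation();
    }
}

int main() {
    std::vector<float> mic_buffer(kChunkSamples, 0.0f); // holds at least 1 s of samples
    duplex_tick(mic_buffer, Frame{});
    return 0;
}
```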
The core runtime pipeline of llama.cpp-omni consists of three stages:
- Initialization (`omni_init`): Loads all GGUF models, initializes the LLM/TTS/Token2Wav contexts, and configures simplex/duplex mode along with the reference audio (for voice cloning).
- Streaming Prefill (`stream_prefill`):
  - When `index=0`: initializes the system prompt, including the text system prompt and the audio system prompt (reference audio embedding)
  - When `index>0`: processes user input: audio is encoded via APM, images via VPM, and the resulting embeddings are fed into the LLM prefill
  - Supports high-resolution mode (`max_slice_nums=2`) and high-FPS mode (main image + stacked images)
- Streaming Decode (`stream_decode`):
  - The LLM autoregressively generates text tokens, entering speech generation on `<|speak|>` and switching to the listening state on `<|listen|>`
  - TTS projects LLM hidden states to generate audio tokens
  - Token2Wav synthesizes WAV audio in real time using a sliding-window approach (28-token input window, 25-token stride)
  - All three modules execute in parallel via asynchronous queues, enabling streaming output
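The decode stage is the most involved part of the loop. Below is an illustrative C++ sketch of what happens on the output side of `stream_decode`: the `<|speak|>`/`<|listen|>` control tokens, TTS token generation from LLM hidden states, and the Token2Wav sliding window with the 28-token window and 25-token stride described above. All helper names are hypothetical, the actual signatures differ, and in the real engine the three modules run in parallel via asynchronous queues rather than in one sequential loop:

```cpp
#include <string>
#include <vector>

constexpr int kWindow = 28; // audio tokens fed to Token2Wav per synthesis step
constexpr int kStride = 25; // tokens consumed per step (~1 s of audio), 3-token overlap

// Hypothetical stand-ins for the LLM, TTS, and Token2Wav stages.
std::string        llm_next_token()   { return "<|listen|>"; }
std::vector<float> llm_hidden_state() { return {}; }
int                tts_next_audio_token(const std::vector<float> & hidden) { return 0; }
void               token2wav_emit(const std::vector<int> & window) {} // writes one WAV chunk

void stream_decode_sketch() {
    std::vector<int> audio_tokens; // queue between TTS and Token2Wav
    bool speaking = false;

    for (;;) {
        const std::string tok = llm_next_token();
        if (tok == "<|speak|>")  { speaking = true; }          // enter speech generation
        if (tok == "<|listen|>") { speaking = false; break; }  // back to the listening state

        if (speaking) {
            // TTS projects the current LLM hidden state into the next audio token.
            audio_tokens.push_back(tts_next_audio_token(llm_hidden_state()));

            // Sliding window: once 28 tokens are buffered, synthesize one WAV chunk,
            // then advance by 25 tokens, keeping a 3-token overlap.
            if ((int) audio_tokens.size() >= kWindow) {
                token2wav_emit(std::vector<int>(audio_tokens.begin(), audio_tokens.begin() + kWindow));
                audio_tokens.erase(audio_tokens.begin(), audio_tokens.begin() + kStride);
            }
        }
    }
}

int main() { stream_decode_sketch(); return 0; }
```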
| Stage | Latency | Notes |
|---|---|---|
| Time to First Token (TTFT) | < 550ms | First audio output |
| Prefill (vision + audio) | ~65ms | Audio-only ~21ms |
| Decode-LLM | ~38ms/token | 3 tokens ~115ms |
| TTS Generation | ~8.5ms/token | 25 tokens ~215ms |
| Token2Wav | RTF ~0.15x | 25 tokens → 1s audio ~150ms |
| Stage | Latency | Notes |
|---|---|---|
| Time to First Token (TTFT) | < 650ms | First audio output |
| Prefill (audio) | ~30ms | Audio-only |
| Decode-LLM | ~12ms/token | Metal accelerated |
| TTS Generation | ~10ms/token | Metal accelerated |
| Token2Wav (Token2Mel) | ~235ms/chunk | Metal accelerated |
| Token2Wav (Vocoder) | ~220ms/chunk | CPU (HiFiGAN) |
| Token2Wav Total | RTF ~0.47x | 28 tokens → 1s audio ~450ms |
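For reference, RTF (real-time factor) here is synthesis time divided by the duration of the audio produced, so RTF ~0.15x means roughly 150ms of compute per second of audio. A back-of-the-envelope check with the first table's figures, plus the assumption (taken from its "3 tokens ~115ms" note) that about three LLM text tokens accompany each second of speech, shows why the pipeline can stay ahead of real-time playback:

```cpp
// Rough budget per 1 s of generated speech, using the latency estimates from the
// first table above (approximate figures, not measurements on your hardware).
constexpr double tts_ms_per_token  = 8.5;   // TTS generation
constexpr double t2w_ms_per_sec    = 150.0; // Token2Wav at RTF ~0.15
constexpr double llm_ms_per_token  = 38.0;  // LLM decode
constexpr int    audio_tok_per_sec = 25;    // 25 audio tokens ~ 1 s of audio
constexpr int    text_tok_per_sec  = 3;     // assumption: ~3 text tokens per second of speech

constexpr double cost_ms = audio_tok_per_sec * tts_ms_per_token   // ~212 ms
                         + t2w_ms_per_sec                         // ~150 ms
                         + text_tok_per_sec * llm_ms_per_token;   // ~114 ms

static_assert(cost_ms < 1000.0, "well under 1000 ms per second of audio, so output stays ahead of playback");

int main() { return 0; }
```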
| Configuration | LLM Quantization | Model Size | VRAM Estimate |
|---|---|---|---|
| Full Omni | F16 | ~18 GB | ~20 GB |
| Full Omni | Q8_0 | ~11 GB | ~13 GB |
| Full Omni | Q4_K_M | ~8 GB | ~9 GB |
| Vision Only | Q8_0 | ~9 GB | ~10 GB |
| Audio Only | Q8_0 | ~10 GB | ~12 GB |
| Configuration | LLM Quantization | Model Size | Unified Memory |
|---|---|---|---|
| Full Omni | F16 | ~15 GB | ~19 GB |
| Full Omni | Q8_0 | ~8.1 GB | ~12 GB |
| Full Omni | Q4_K_M | ~4.7 GB | ~8.5 GB |
Note: Apple Silicon uses a unified memory architecture. Recommended: 16GB Mac for Q4_K_M/Q8_0, 32GB+ Mac for F16.
Model Files: Download MiniCPM-o 4.5 GGUF models with the following directory structure:
```
MiniCPM-o-4_5-gguf/
├── MiniCPM-o-4_5-Q4_K_M.gguf        # LLM (or F16/Q8_0)
├── audio/
│   └── MiniCPM-o-4_5-audio-F16.gguf
├── tts/
│   ├── MiniCPM-o-4_5-tts-F16.gguf
│   └── MiniCPM-o-4_5-projector-F16.gguf
├── token2wav-gguf/
│   ├── encoder.gguf                 # ~144MB
│   ├── flow_matching.gguf           # ~437MB
│   ├── flow_extra.gguf              # ~13MB
│   ├── hifigan2.gguf                # ~79MB
│   └── prompt_cache.gguf            # ~67MB
└── vision/
    └── MiniCPM-o-4_5-vision-F16.gguf
```
```bash
# Configure
cmake -B build -DCMAKE_BUILD_TYPE=Release

# Build
cmake --build build --target llama-omni-cli -j
```

CMake will auto-detect and enable Metal (macOS) or CUDA (Linux with NVIDIA GPU).
```bash
# Basic usage (auto-detect all model paths from LLM path)
./build/bin/llama-omni-cli \
    -m /path/to/MiniCPM-o-4_5-gguf/MiniCPM-o-4_5-Q4_K_M.gguf

# With custom reference audio (voice cloning)
./build/bin/llama-omni-cli \
    -m /path/to/MiniCPM-o-4_5-gguf/MiniCPM-o-4_5-Q4_K_M.gguf \
    --ref-audio /path/to/your_voice.wav

# Disable TTS (text-only output)
./build/bin/llama-omni-cli \
    -m /path/to/MiniCPM-o-4_5-gguf/MiniCPM-o-4_5-F16.gguf \
    --no-tts
```

| Option | Description |
|---|---|
| `-m <path>` | Required. Path to LLM GGUF model |
| `--vision <path>` | Override vision model path |
| `--audio <path>` | Override audio model path |
| `--tts <path>` | Override TTS model path |
| `--projector <path>` | Override projector model path |
| `--ref-audio <path>` | Reference audio for voice cloning |
| `-c, --ctx-size <n>` | Context size (default: 4096) |
| `-ngl <n>` | Number of GPU layers (default: 99) |
| `--no-tts` | Disable TTS output |
| `--test <prefix> <n>` | Run test with audio files |
Generated audio files are saved to `tools/omni/output/`:

```
tools/omni/output/
├── round_000/
│   └── tts_wav/
│       ├── wav_0.wav
│       ├── wav_1.wav
│       └── ...
└── round_001/
    └── tts_wav/
        └── wav_1000.wav
```
Full-duplex real-time video interaction demo based on WebRTC. One-click deployment with pre-built Docker images.
- Hardware: Apple Silicon Mac (M1/M2/M3/M4), M4 recommended for optimal performance
- Software: Docker Desktop, Python 3.10+
- Models: MiniCPM-o 4.5 GGUF models (see directory structure above)
1. Download Docker Package
📦 Download Docker Image (macOS)
2. Build llama-server
```bash
cd /path/to/llama.cpp-omni
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-server -j
```

3. Deploy
```bash
# Extract and enter directory
unzip omni_docker.zip && cd omni_docker

# Load Docker images
docker load -i o45-frontend.tar
docker load -i omini_backend_code/omni_backend.tar

# One-click deployment
./deploy_all.sh \
    --cpp-dir /path/to/llama.cpp-omni \
    --model-dir /path/to/MiniCPM-o-4_5-gguf

# For duplex mode
./deploy_all.sh \
    --cpp-dir /path/to/llama.cpp-omni \
    --model-dir /path/to/MiniCPM-o-4_5-gguf \
    --duplex
```

4. Access Web Interface

```bash
open http://localhost:3000
```

| Service | Port | Description |
|---|---|---|
| Frontend | 3000 | Web UI |
| Backend | 8021 | Backend API |
| LiveKit | 7880 | Real-time communication |
| Inference | 9060 | Python HTTP API |
| Issue | Solution |
|---|---|
| Port conflict | Use --port 9060 (recommended, avoids Cursor IDE conflicts) |
| "Experience quota full" | Check model_type and session_type in registration |
| Model loading timeout | Wait 2-3 minutes for large models |
| LiveKit connection failed | Update node_ip in livekit.yaml |
📖 Full Documentation: tools/omni/release_cpp/README.md
Deployment and usage documentation is still being prepared, including:
- Voice Cloning: Custom voice synthesis with reference audio
- NPU Adaptation: Support for various NPU hardware platforms
- More features...
Stay tuned for updates in the coming days.