cactus-compute/cactus

┌─────────────────┐     Energy-efficient inference engine for running AI on mobile devices 
│  Cactus Engine  │ ←── OpenAI-compatible APIs for C/C++, Swift, Kotlin, Flutter & React-Native
└─────────────────┘     Supports tool calling, auto-RAG, NPU, INT4, and cloud handoff for complex tasks
         │
┌─────────────────┐     Zero-copy computation graph, think PyTorch for mobile devices
│  Cactus Graph   │ ←── You can implement custom models directly using this
└─────────────────┘     Highly optimised for RAM & lossless weight quantisation 
         │
┌─────────────────┐     Low-level ARM-specific SIMD kernels (Apple, Snapdragon, Google, Exynos, MediaTek & Raspberry Pi)
│ Cactus Kernels  │ ←── Optimised Matrix Multiplication
└─────────────────┘     Custom attention kernels with KV-Cache Quantisation, chunked prefill, streaming LLM, etc.

Cactus Engine

#include "cactus.h"

cactus_set_pro_key("optional, email founders@cactuscompute.com"); 

cactus_model_t model = cactus_init(
    "path/to/weight/folder",                   // folder containing model weights
    "path to txt or dir of txts for auto-rag"  // corpus for auto-RAG
);

const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Henry Ndubuaku"}
])";

const char* options = R"({
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[4096];
int result = cactus_complete(
    model,                            // model handle from cactus_init
    messages,                         // JSON array of chat messages
    response,                         // buffer to store response JSON
    sizeof(response),                 // size of response buffer
    options,                          // optional: generation options (nullptr for defaults)
    nullptr,                          // optional: tools JSON for function calling 
    nullptr,                          // optional: streaming callback fn(token, id, user_data)
    nullptr                           // optional: user data passed to callback
);

Example response from Gemma3-270m

{
    "success": true,                 // when successfully generated locally
    "error": null,                   // returns specific errors if success = false
    "cloud_handoff": false,          // true when model is unconfident, simply route to cloud
    "response": "Hi there!",         // null when error is not null or cloud_handoff = true
    "function_calls": [],            // parsed to [{"name":"set_alarm","arguments":{"hour":"10","minute":"0"}}]
    "confidence": 0.8193,            // how confident the model is with its response
    "time_to_first_token_ms": 45.23, // latency (time to first token)
    "total_time_ms": 163.67,         // total execution time
    "prefill_tps": 1621.89,          // prefill tokens per second
    "decode_tps": 168.42,            // decode tokens per second
    "ram_usage_mb": 245.67,          // current process RAM usage in MB
    "prefill_tokens": 28,
    "decode_tokens": 50,
    "total_tokens": 78
}
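
The tools and streaming arguments left as nullptr above can be supplied as well. Below is a minimal sketch, assuming an OpenAI-style tool schema (the engine advertises OpenAI-compatible APIs, but the exact JSON accepted by cactus_complete is an assumption here) and a callback matching the fn(token, id, user_data) shape from the comments above; the real typedef in cactus.h may differ.

#include <cstdio>

// Streaming callback, following the fn(token, id, user_data) comment above;
// the exact signature in cactus.h may differ
void print_token(const char* token, int id, void* user_data) {
    (void)id; (void)user_data;
    printf("%s", token);   // print each token as it streams in
    fflush(stdout);
}

// OpenAI-style tool definition matching the set_alarm function_calls example
// above; the exact schema Cactus accepts is an assumption here
const char* tools = R"([{
    "type": "function",
    "function": {
        "name": "set_alarm",
        "description": "Set an alarm at the given time",
        "parameters": {
            "type": "object",
            "properties": {
                "hour":   {"type": "string"},
                "minute": {"type": "string"}
            },
            "required": ["hour", "minute"]
        }
    }
}])";

char stream_response[4096];
int stream_result = cactus_complete(
    model, messages, stream_response, sizeof(stream_response),
    options,
    tools,          // tools JSON for function calling
    print_token,    // streaming callback
    nullptr         // user data forwarded to the callback
);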

Cactus Graph

#include "cactus.h"

CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);   // 2x3 input node
auto b = graph.input({3, 4}, Precision::INT8);   // 3x4 input node

auto x1 = graph.matmul(a, b, false);             // {2,3} x {3,4} -> {2,4}
auto x2 = graph.transpose(x1);                   // {2,4} -> {4,2}
auto result = graph.matmul(b, x2, true);         // {3,4} x {4,2} -> {3,2}

float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);

graph.execute();
void* output_data = graph.get_output(result);    // raw pointer to the result buffer

graph.hard_reset();                              // reset graph state between runs

High-End Devices Benchmark (INT8)

  • Tested for the worst case (big model + 1k context size)
  • Small models and short contexts yield flashier numbers, but hide the stress points

| Device           | LFM2.5-1.2B (1k-Prefill/100-Decode) | LFM2.5-VL-1.6B (256px-Latency & Decode) | Whisper-Small (30s-audio-Latency & Decode) |
|------------------|-------------------------------------|-----------------------------------------|--------------------------------------------|
| Mac M4 Pro       | 582/77 toks/sec                     | 1.2s (0.3s*) & 76 toks/sec              | 1.5s (0.2s*) & 65 toks/sec                 |
| iPad/Mac M4      | -                                   | -                                       | -                                          |
| iPhone 17 Pro    | 300/33 toks/sec                     | 1.6s (0.3s*) & 33 toks/sec              | 3.0s (0.6s*) & 70 toks/sec                 |
| Galaxy S25 Ultra | 226/35 toks/sec                     | 2.6s & 35 toks/sec                      | 2.9s & 44 toks/sec                         |
| Pixel 10 Pro     | -                                   | -                                       | -                                          |
| Vivo X200 Pro    | -                                   | -                                       | -                                          |

Budget Devices Benchmark (INT8)

  • We recommend ≤600M-parameter LLMs/VLMs and sub-300M transcription models for ALL mobiles, plus cloud fallback
  • Cactus decides in under 100ms when to fall back to private cloud for complex tasks; this happens on <20% of requests (see the sketch after this table)

| Device         | LFM2-350m (1k-Prefill/100-Decode) | LFM2-VL-450m (256px-Latency & Decode) | Moonshine-Base (30s-audio-Latency & Decode) |
|----------------|-----------------------------------|---------------------------------------|---------------------------------------------|
| iPad/Mac M1    | -                                 | -                                     | -                                           |
| iPhone 13 Mini | -                                 | -                                     | -                                           |
| Galaxy A56     | -                                 | -                                     | -                                           |
| Pixel 6a       | 218/44 toks/sec                   | 3.0s & 42 toks/sec                    | -                                           |
| Nothing CMF    | -                                 | -                                     | -                                           |
| Raspberry Pi 5 | -                                 | -                                     | -                                           |
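
Honouring the fallback is a check on the cloud_handoff field of the response JSON. A minimal sketch, assuming the response shape shown in the engine example above; the bare substring check stands in for a real JSON parser and is for illustration only:

#include <cstring>

char response[4096];
cactus_complete(model, messages, response, sizeof(response),
                options, nullptr, nullptr, nullptr);

// Use a proper JSON parser in production; strstr keeps this sketch dependency-free
if (strstr(response, "\"cloud_handoff\": true") != nullptr) {
    // The local model is unconfident: forward the same `messages` JSON
    // to your private cloud endpoint here
}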

Supported Models

| Model                            | Size  | Features                                |
|----------------------------------|-------|-----------------------------------------|
| **LLMs**                         |       |                                         |
| google/gemma-3-270m-it           | 252MB | completion                              |
| google/functiongemma-270m-it     | 252MB | completion, tools                       |
| LiquidAI/LFM2-350M               | 244MB | completion, tools, embed                |
| Qwen/Qwen3-0.6B                  | 514MB | completion, tools, embed                |
| LiquidAI/LFM2-700M               | 498MB | completion, tools, embed                |
| google/gemma-3-1b-it             | 642MB | completion                              |
| LiquidAI/LFM2.5-1.2B-Thinking    | 474MB | completion, tools, embed                |
| LiquidAI/LFM2.5-1.2B-Instruct    | 474MB | completion, tools, embed                |
| LiquidAI/LFM2-1.2B-RAG           | 474MB | completion, tools, embed                |
| LiquidAI/LFM2-1.2B-Tool          | 474MB | completion, tools, embed                |
| Qwen/Qwen3-1.7B                  | 749MB | completion, tools, embed                |
| **VLMs**                         |       |                                         |
| LiquidAI/LFM2-VL-450M            | 448MB | vision, txt & img embed, Apple NPU      |
| LiquidAI/LFM2.5-VL-1.6B          | 954MB | vision, txt & img embed, Apple NPU      |
| **Speech**                       |       |                                         |
| UsefulSensors/moonshine-base     | 150MB | transcription, speech embed             |
| openai/whisper-small             | 283MB | transcription, speech embed, Apple NPU  |
| openai/whisper-medium            | 658MB | transcription, speech embed, Apple NPU  |
| **Embeddings**                   |       |                                         |
| nomic-ai/nomic-embed-text-v2-moe | 451MB | embed                                   |
| Qwen/Qwen3-Embedding-0.6B        | 514MB | embed                                   |

Using this repo on a Mac

git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
| Command                                            | Description                                                      |
|----------------------------------------------------|------------------------------------------------------------------|
| cactus run [model-name-as-written-in-above-tables] | Opens the playground (auto-downloads the model)                  |
| cactus download [model]                            | Downloads the model to ./weights                                 |
| cactus convert [model] [dir]                       | Converts a model; supports LoRA merging (--lora <path>)          |
| cactus build                                       | Builds for ARM (--apple or --android)                            |
| cactus test                                        | Runs tests (--ios / --android, --model [name/path], --precision) |
| cactus clean                                       | Removes build artifacts                                          |
| cactus --help                                      | Shows all commands and flags                                     |
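
For example, combining the commands above (the model name is one from the tables earlier; per the table, cactus run auto-downloads weights, so the explicit download step is optional):

# build the ARM libraries for Apple targets, then open the playground
cactus build --apple
cactus download LiquidAI/LFM2-350M   # optional: cactus run fetches weights itself
cactus run LiquidAI/LFM2-350M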

Using in your apps

Try demo apps

Maintaining Organisations

  1. Cactus Compute, Inc
  2. UCLA's BruinAI
  3. Yale's AI Society
  4. National University of Singapore's AI Society
  5. UC Irvine's AI@UCI
  6. Imperial College's AI Society
  7. University of Pennsylvania's AI@Penn
  8. University of Colorado Boulder's AI Club

Join The Community