Inspiration

I built SubEleven to combine two trends I care about:

  1. The immediate usefulness of language models (summaries, email drafts, coding help), and
  2. The need for privacy and offline reliability on mobile/edge devices.

The goal was to show that real, useful LLM-driven capabilities can live entirely on device, with no cloud required, provided the model, conversion pipeline, and runtime are engineered appropriately for Arm silicon. The project was inspired by the rapid progress in quantized small LLMs and by Arm’s public direction toward “AI on device”.

What it does

SubEleven is a mobile app + runtime that supports:

  • Conversational assistant (chat UI) for general queries.
  • Summarizer & Rewriter (documents, emails, articles).
  • Code helper: summarize code, generate small snippets, and create unit test stubs.
  • Retrieval-Augmented Generation (RAG): local vector store (documents on device) + retrieval to provide contextually grounded answers.
  • Adaptive precision switching: dynamically chooses runtime precision or model size to trade quality against latency/energy (see the sketch below).
  • On-device telemetry dashboard: shows first-token time, tokens/sec, peak memory, and approximate energy usage per session.

All model inference and retrieval happen locally on Arm hardware (phone / SBC). A small optional server exists only for heavy offline conversions and experiments; the default user experience requires zero network access.
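A minimal sketch of how the switching policy can be expressed; the variant names, throughput figures, and battery threshold below are illustrative assumptions, not SubEleven’s measured values:

    # Hypothetical precision-switching policy; names, tokens/sec figures, and
    # the battery threshold are illustrative assumptions, not measured values.
    MODEL_VARIANTS = [
        # (variant name, expected decode speed in tokens/sec on the target device)
        ("llm-1.3b-int8", 9.0),   # higher quality, slower
        ("llm-1.3b-int4", 16.0),  # lower quality, faster and cheaper on energy
    ]

    def pick_variant(latency_budget_ms_per_token: float, battery_pct: float) -> str:
        """Return the highest-quality variant that fits the latency budget,
        falling back to the cheapest variant when the battery is low."""
        if battery_pct < 15.0:
            return MODEL_VARIANTS[-1][0]
        for name, tokens_per_sec in MODEL_VARIANTS:
            if 1000.0 / tokens_per_sec <= latency_budget_ms_per_token:
                return name
        return MODEL_VARIANTS[-1][0]

    # e.g. a 120 ms/token budget on a healthy battery keeps the int8 model
    print(pick_variant(latency_budget_ms_per_token=120.0, battery_pct=80.0))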

How we built it

Key steps:
  • Model selection & distillation — chose a distilled 1.3B LLM family (open checkpoint with permissive license) as the main production model to balance quality and footprint.
  • Conversion & quantization — exported to ONNX / TorchScript and produced 8-bit and 4-bit variants via post-training quantization and calibration; we used an intermediate dynamic quantization step and a calibration dataset of representative prompts (see the first sketch after this list).
  • Runtime optimization — integrated ONNX Runtime Mobile and Arm NN for critical ops, using operator fusion and NEON-optimized kernels; where ExecuTorch gave better transformer kernels, we used it for our quantized transformer backends (see the session-setup sketch below).
  • Retrieval — a small embedding model (MiniLM variant) produces 384-dim vectors; Annoy was chosen for its tiny memory footprint and fast CPU search, and SQLite holds document metadata (see the retrieval sketch below).
  • Mobile UX — Flutter front-end with a native plugin that streams tokens during generation for a low-latency feel; offline onboarding flow and local document ingestion UI.
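First, a sketch of the 8-bit conversion step using ONNX Runtime’s post-training quantization tools. The file paths, the input_ids input name, and the random stand-in calibration batches are placeholders for what the real scripts in runtime/python/ do:

    # Sketch of the PTQ step with onnxruntime's quantization tools. Paths and
    # the "input_ids" input name are placeholders for the real exported graph.
    import numpy as np
    from onnxruntime.quantization import (
        CalibrationDataReader, QuantType, quantize_dynamic, quantize_static,
    )

    class PromptCalibrationReader(CalibrationDataReader):
        """Feeds tokenized representative prompts to the calibrator."""
        def __init__(self, token_batches):
            self._batches = iter(token_batches)
        def get_next(self):
            batch = next(self._batches, None)
            return None if batch is None else {"input_ids": batch}

    # Intermediate dynamic (weights-only) step, then calibrated static INT8.
    quantize_dynamic("model_fp32.onnx", "model_int8_dyn.onnx",
                     weight_type=QuantType.QInt8)
    calibration = PromptCalibrationReader(
        # Stand-in for real tokenized prompts: 8 batches of 64 token ids.
        [np.random.randint(0, 32000, size=(1, 64), dtype=np.int64) for _ in range(8)]
    )
    quantize_static("model_fp32.onnx", "model_int8.onnx", calibration,
                    weight_type=QuantType.QInt8)

The 4-bit variants follow the same calibrate-then-quantize pattern with different per-layer settings.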
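Next, a sketch of session setup on the ONNX Runtime path. Provider availability varies by build and device (XNNPACK ships in mobile-oriented builds), so the provider list is an assumption:

    # Sketch of session setup; the provider list is device/build-dependent.
    import onnxruntime as ort

    opts = ort.SessionOptions()
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL  # enables op fusion
    opts.intra_op_num_threads = 4  # match the big-core count of the target SoC

    session = ort.InferenceSession(
        "model_int8.onnx",
        sess_options=opts,
        providers=["XNNPACKExecutionProvider", "CPUExecutionProvider"],
    )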
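Finally, the retrieval path end to end. The embed() helper below is a placeholder for the on-device MiniLM-variant encoder; the Annoy and sqlite3 calls are the real APIs:

    # Sketch of local RAG ingestion and lookup; embed() is a placeholder for
    # the on-device MiniLM-variant encoder (384-dim output).
    import hashlib, random, sqlite3
    from annoy import AnnoyIndex

    DIM = 384
    index = AnnoyIndex(DIM, "angular")
    db = sqlite3.connect("docs.db")
    db.execute("CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, text TEXT)")

    def embed(text):
        # Placeholder embedding: deterministic pseudo-random vector per text.
        rng = random.Random(hashlib.md5(text.encode()).hexdigest())
        return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

    def ingest(chunks):
        for i, text in enumerate(chunks):
            index.add_item(i, embed(text))
            db.execute("INSERT INTO chunks VALUES (?, ?)", (i, text))
        index.build(10)  # more trees: better recall, more memory
        db.commit()

    def retrieve(query, k=4):
        ids = index.get_nns_by_vector(embed(query), k)
        placeholders = ",".join("?" * len(ids))
        # Note: SQL IN does not preserve Annoy's ranking order.
        rows = db.execute(
            f"SELECT text FROM chunks WHERE id IN ({placeholders})", ids).fetchall()
        return [row[0] for row in rows]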

Benchmark process: we measured first-token time, full response time for 64/128 tokens, peak memory, and energy. Energy was measured using phone battery deltas (software) and validated on one device with an external power meter; a minimal timing harness is sketched below.
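A minimal version of that harness, assuming a generator-style generate_stream() interface (a stand-in for the runtime’s streaming API):

    # Sketch of the latency measurement; generate_stream() is a stand-in for
    # the runtime's token-streaming API.
    import time

    def benchmark(generate_stream, prompt, n_tokens=64):
        t0 = time.perf_counter()
        first_token_s = None
        count = 0
        for _token in generate_stream(prompt, max_tokens=n_tokens):
            if first_token_s is None:
                first_token_s = time.perf_counter() - t0
            count += 1
        total_s = time.perf_counter() - t0
        decode_s = total_s - (first_token_s or 0.0)
        return {
            "first_token_s": first_token_s,
            # steady-state decode speed, excluding the first token
            "tokens_per_s": (count - 1) / decode_s if count > 1 and decode_s > 0 else None,
        }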

Challenges we ran into

  • Model/operator compatibility: some transformer forks used ops unsupported by ONNX Runtime Mobile / Arm NN. Fixes included operator replacement and small patches to the conversion scripts.
  • Quality vs footprint: Reducing model size/precision required careful calibration to keep user-visible quality acceptable. We iterated on distillation + per-layer quantization sensitivity analysis.
  • Energy measurement noise: phone battery readings are noisy. We standardized runs (airplane mode, screen off for benchmarks, repeated trials) and used external power meter validation on sample runs.
  • First-token latency: to stream tokens responsively we minimized framework warmup and preloaded embeddings and context, reducing time-to-first-token (see the sketch below).
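The warmup fix amounts to one throwaway inference at app start, so graph preparation and allocator growth happen off the critical path. A sketch, reusing the session and the assumed input_ids input name from the runtime sketch above:

    # Sketch of the warmup pass; "input_ids" is the assumed input name.
    import numpy as np

    def warmup(session, seq_len=8):
        dummy = np.zeros((1, seq_len), dtype=np.int64)
        session.run(None, {"input_ids": dummy})  # discard output; kernels and caches now warm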

Accomplishments that we're proud of

  • A fully reproducible pipeline: clone → convert → deploy → run using scripts included in runtime/python/.
  • On-device UX: a smooth streaming chat experience, with the first token usually arriving under the interactive threshold on tested devices.
  • Benchmarked improvements: applying Arm NN optimizations and quantization reduced inference latency by a large factor versus the baseline FP32 runtime on the same device (see docs/benchmarks.md).
  • Privacy by design: all user data and documents remain on the device unless the user explicitly chooses to export.
  • A polished Devpost submission with demo video, architecture diagram, and reproducible artifacts.

What we learned

  • Making LLMs practical on devices is a systems engineering problem: model selection, conversion, runtime integration and UX are equally important.
  • Small engineering changes (op fusion, operator replacement) often produce outsized performance gains on Arm hardware.
  • User perception of speed depends more on first-token latency and streaming behavior than raw throughput.

What's next for SubEleven

  • Train lightweight adapters for better personalization without full fine-tuning (on-device adapters).
  • Add secure device-to-device encrypted sync (optional opt-in) for multi-device context.
  • Explore 4-bit fine-tuning and quantization-aware training for improved small-model fidelity.
  • Support additional Arm NPUs (vendor SDKs) to extract more power/perf improvements.

Built With

  • Flutter (mobile front-end)
  • ONNX Runtime Mobile, Arm NN, ExecuTorch (inference runtimes)
  • ONNX / TorchScript (model conversion)
  • Annoy + SQLite (local retrieval)
  • Python (conversion and benchmark scripts in runtime/python/)
