Depth Anything V2 — iOS Benchmark

Real-time monocular depth estimation on iPhone and Meta Ray-Ban glasses, using Depth Anything V2 via CoreML.

Typical inference takes ~40–60 ms on iPhone (CPU+GPU), dropping to ~35 ms once the Neural Engine warms up.

Setup

1. Download the CoreML model

The model file is too large for git. Run the setup script once:

./setup_model.sh

This requires Python 3 and downloads DepthAnythingV2SmallF16.mlpackage from Hugging Face. Alternatively, download manually:

pip3 install huggingface_hub
python3 -m huggingface_hub.cli.hf download apple/coreml-depth-anything-v2-small \
  --repo-type model \
  --local-dir DepthanythingTest/DepthanythingTest \
  --include "DepthAnythingV2SmallF16.mlpackage/*"

2. Open in Xcode

open DepthanythingTest/DepthanythingTest.xcodeproj

Xcode resolves Meta Wearables DAT and WebRTC (stasel, pinned to 140.0.0) via Swift Package Manager.

3. Build and run on a physical iPhone

The app does not work in the iOS Simulator because it requires a camera. Select your device and hit Run.

The first launch takes ~60 seconds while the model is compiled for the Neural Engine. Subsequent launches start almost immediately because the compiled model is cached on the device.

Ray-Ban Glasses (optional)

To use the Meta Ray-Ban camera instead of the phone camera:

1. Register a Meta developer app at developers.facebook.com
2. Add your Meta App ID and Client Token as Xcode build settings:
   - In Xcode → Target → Build Settings, search for META_APP_ID and CLIENT_TOKEN
   - Set both values from your Meta developer dashboard
3. Pair your glasses in the Meta View app
4. In the app, tap Ray-Ban in the source picker at the bottom

Features

- Live camera feed (phone or Ray-Ban glasses)
- Depth map overlay with Turbo colormap (red = close, blue = far)
- Per-frame latency stats (now / avg / min / max)
- Toggle depth overlay on/off
- Adjustable overlay opacity
- Vision tab: VisionClaw features (Gemini Live, WebRTC viewer, OpenClaw). For API keys, create VisionClaw/Secrets.swift from Secrets.swift.example.
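
The per-frame latency stats (now / avg / min / max) can be tracked with a small running accumulator. The following is a minimal sketch, not the app's actual code: `LatencyStats` and `timed` are hypothetical names, and it assumes latency is measured in milliseconds around each inference call.

```swift
import Dispatch

/// Running latency statistics in milliseconds (hypothetical helper, not from the app).
struct LatencyStats {
    private(set) var now: Double = 0          // most recent sample
    private(set) var min: Double = .infinity  // smallest sample so far
    private(set) var max: Double = 0          // largest sample so far
    private(set) var count: Int = 0
    private var total: Double = 0

    /// Mean of all recorded samples, or 0 before the first sample.
    var avg: Double { count == 0 ? 0 : total / Double(count) }

    /// Adds one latency sample and updates all four statistics.
    mutating func record(_ ms: Double) {
        now = ms
        min = Swift.min(min, ms)
        max = Swift.max(max, ms)
        total += ms
        count += 1
    }
}

/// Runs `work` once and returns its result with the elapsed wall time in milliseconds.
func timed<T>(_ work: () throws -> T) rethrows -> (result: T, ms: Double) {
    let start = DispatchTime.now().uptimeNanoseconds
    let result = try work()
    let end = DispatchTime.now().uptimeNanoseconds
    return (result, Double(end - start) / 1_000_000)
}
```

In use, each frame's inference call would be wrapped in `timed { ... }` and the returned `ms` passed to `record(_:)`. Keeping a running total rather than a sample history means memory use stays constant however long the session runs.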

Depth → Spatial Audio (planned)

The depth map can drive 8D/spatial audio:

- Divide the frame into spatial zones (left/center/right × near/far)
- Map each zone's average depth to a 3D position
- Use AVAudioEnvironmentNode + AVAudioPlayerNode for HRTF binaural rendering
- Closer objects → higher volume, less reverb; position → stereo pan + elevation
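
Since this feature is only planned, the zone mapping above can be illustrated with a rough sketch rather than real project code. It assumes a row-major depth map normalized to 0...1 where larger values mean closer, simplifies the grid to three columns that are each classified as near or far, and leaves out reverb and elevation. `Column`, `ZoneCue`, `zoneCues`, `nearThreshold`, and `maxDistance` are all hypothetical names and parameters.

```swift
/// Horizontal zones, left to right (hypothetical type).
enum Column: Int, CaseIterable { case left, center, right }

/// Audio cue derived for one zone (hypothetical type, not an API of the project).
struct ZoneCue {
    let column: Column
    let isNear: Bool     // true when average closeness reaches the threshold
    let pan: Float       // -1 (left), 0 (center), +1 (right)
    let distance: Float  // 1 (closest) ... maxDistance (farthest), arbitrary units
    let gain: Float      // 0 ... 1, louder when closer
}

/// Splits a row-major depth map into three columns and derives one cue per column.
/// Assumes depth values lie in 0...1 and that larger means closer.
func zoneCues(depth: [Float], width: Int, height: Int,
              nearThreshold: Float = 0.5, maxDistance: Float = 5) -> [ZoneCue] {
    precondition(width >= 3 && height > 0 && depth.count == width * height)
    var sums = [Float](repeating: 0, count: 3)
    var counts = [Int](repeating: 0, count: 3)
    for y in 0..<height {
        for x in 0..<width {
            let zone = x * 3 / width  // 0, 1, or 2
            sums[zone] += depth[y * width + x]
            counts[zone] += 1
        }
    }
    return Column.allCases.map { column in
        let i = column.rawValue
        let closeness = sums[i] / Float(counts[i])
        return ZoneCue(column: column,
                       isNear: closeness >= nearThreshold,
                       pan: Float(i - 1),
                       distance: 1 + (1 - closeness) * (maxDistance - 1),
                       gain: closeness)
    }
}
```

One possible way to use the result is to give each zone its own AVAudioPlayerNode and set its `position` from `pan` and `distance`, letting AVAudioEnvironmentNode handle the binaural rendering. That wiring is an assumption about how the planned feature might be built.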
