diffcua

differential vision encoding for computer-use agents

differential_vision.mp4
attention_click.mp4

When vision-language models process sequential screenshots, they re-encode the entire image every frame. For a 1920×1080 screen split into 24×24-pixel patches (an 80×45 grid), that means computing 3600 vision tokens per frame, even when the only change is a cursor moving 10 pixels.

This project explores differential vision encoding: caching vision tokens between frames and selectively re-encoding only the patches that changed.

Approach

We detect changed patches by computing per-patch pixel differences between consecutive frames. Based on how many patches changed, we either return cached tokens, perform a partial update, or fall back to full encoding.

Frame 0:  full encode ──────────────────> 3600 tokens (cached)
Frame 1:  12 patches changed ─> encode 12 ─> update cache
Frame 2:  0 patches changed ────────────> return cache
Frame 3:  8 patches changed ──> encode 8 ─> update cache

This is analogous to I-frames and P-frames in video compression.
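A minimal sketch of that per-frame decision, assuming hypothetical `encode_full` and `encode_patches` wrappers around the vision tower (neither name comes from this codebase) and taking the list of changed patch indices as input:

```python
# Minimal sketch of the per-frame cache decision. `encode_full` and
# `encode_patches` are hypothetical wrappers around the vision tower, and the
# 50-patch fallback point is illustrative (it mirrors the behaviour described
# under Results). Cached tokens are assumed to be a (num_patches, dim) array
# that supports index assignment (NumPy array or torch tensor).
MAX_PARTIAL_PATCHES = 50

class TokenCache:
    def __init__(self, encode_full, encode_patches):
        self.encode_full = encode_full        # frame -> (num_patches, dim) tokens
        self.encode_patches = encode_patches  # (frame, indices) -> (k, dim) tokens
        self.tokens = None                    # tokens cached from the last frame

    def update(self, frame, changed):
        if self.tokens is None:
            # Frame 0: full encode primes the cache (the "I-frame").
            self.tokens = self.encode_full(frame)
        elif len(changed) == 0:
            pass                              # cache hit: reuse tokens as-is
        elif len(changed) <= MAX_PARTIAL_PATCHES:
            # Partial update (the "P-frame"): overwrite only the changed rows.
            self.tokens[changed] = self.encode_patches(frame, changed)
        else:
            # Large change (scroll, window switch): fall back to a full encode.
            self.tokens = self.encode_full(frame)
        return self.tokens
```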

Results

Change detection worked well. Per-patch pixel diffs reliably identify which regions changed. A threshold of 5% average pixel difference captures meaningful UI changes while ignoring compression artifacts.
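A sketch of that detector, assuming uint8 RGB frames whose dimensions are exact multiples of the patch size:

```python
import numpy as np

# Sketch of the per-patch change detector. A patch counts as changed when its
# mean absolute pixel difference exceeds 5% of the 0-255 range.
def changed_patch_indices(prev, curr, patch=24, threshold=0.05):
    h, w, _ = prev.shape
    rows, cols = h // patch, w // patch
    diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32)) / 255.0
    # Average the difference within each patch cell -> (rows, cols) grid.
    per_patch = diff.reshape(rows, patch, cols, patch, 3).mean(axis=(1, 3, 4))
    # Flat indices in row-major order, matching row * cols + col patch numbering.
    return np.flatnonzero(per_patch.ravel() > threshold).tolist()
```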

Cache hits provided significant speedups. When nothing changes (loading screens, typing pauses), we skip vision encoding entirely.

Partial re-encoding did not work. Vision transformers expect full images: the positional embeddings are learned for the complete grid, and self-attention mixes information across every patch, so tokens computed from an isolated subset of patches do not blend correctly with the surrounding cached tokens. We tried context windows, explicit position indices, and a separate projection; none produced coherent results.
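For illustration, the "explicit position indices" variant amounts to something like the following toy ViT sketch (module names and shapes are illustrative, not the FastVLM code); even with the right positional embeddings, the k changed tokens only ever attend to each other:

```python
import torch
import torch.nn as nn

# Toy ViT-style encoder illustrating the "explicit position indices" attempt.
# All names and shapes here are illustrative, not taken from FastVLM.
class ToyViT(nn.Module):
    def __init__(self, patch=24, dim=256, num_patches=3600, depth=2):
        super().__init__()
        self.proj = nn.Linear(patch * patch * 3, dim)            # patch -> token
        self.pos_embed = nn.Parameter(torch.zeros(num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def encode_subset(self, patch_pixels, patch_indices):
        # patch_pixels: (k, patch*patch*3) flattened crops of the changed patches
        # patch_indices: (k,) flat grid positions of those crops
        x = self.proj(patch_pixels) + self.pos_embed[patch_indices]
        # Self-attention runs over only these k tokens; they never attend to the
        # unchanged patches the cached tokens were computed against, so the
        # outputs stay inconsistent with the cache even with correct positions.
        return self.blocks(x.unsqueeze(0)).squeeze(0)            # (k, dim) tokens
```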

Fallback dominated in practice. Scrolling, window switches, and significant UI transitions change more than 50 patches, triggering full re-encoding. The net speedup was modest.

Also Explored: Attention-Based Click Targeting

Inspired by GUI-Actor, we experimented with using VLM attention maps for click targeting. Instead of predicting coordinates, we extract the attention heatmap over image patches and click the centroid of the highest-attention region.
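A rough sketch of the targeting step, assuming we already have a (grid_h, grid_w) attention map over image patches (how that map is pulled out of the VLM is model-specific and not shown here):

```python
import numpy as np

# Rough sketch of mapping a patch-level attention map to a click point.
# `attn` is assumed to be a (grid_h, grid_w) array of attention weights over
# image patches; the top_frac cutoff is an illustrative choice.
def attention_click_point(attn, screen_w=1920, screen_h=1080, top_frac=0.05):
    flat = attn.ravel()
    k = max(1, int(top_frac * flat.size))
    top = np.argpartition(flat, -k)[-k:]          # indices of the k hottest patches
    ys, xs = np.unravel_index(top, attn.shape)
    w = flat[top] / flat[top].sum()               # attention-weighted centroid
    cy, cx = float((ys * w).sum()), float((xs * w).sum())
    # Map patch-grid coordinates to screen pixels (centre of the patch cell).
    grid_h, grid_w = attn.shape
    return int((cx + 0.5) * screen_w / grid_w), int((cy + 0.5) * screen_h / grid_h)
```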

Attention does concentrate on relevant UI elements. However, the spatial resolution is too coarse for precise clicking—attention spreads across a button's surface and surrounding context rather than localizing to a single point.

Code

differential_vision/    # token caching implementation
attention_click/        # attention-based click targeting
predict.py              # FastVLM inference with differential encoding
llava/                  # model code from FastVLM

Built on FastVLM (Apple, CVPR 2025).
