diffcua

differential vision encoding for computer-use agents

differential_vision.mp4
attention_click.mp4

When vision-language models process sequential screenshots, they re-encode the entire image every frame. For a 1920×1080 screen split into 24×24-pixel patches (an 80×45 grid), that means computing 3600 vision tokens per frame, even when the only change is a cursor moving 10 pixels.

This project explores differential vision encoding: caching vision tokens between frames and selectively re-encoding only the patches that changed.

Approach

We detect changed patches by computing per-patch pixel differences between consecutive frames. Based on how many patches changed, we either return cached tokens, perform a partial update, or fall back to full encoding.

Frame 0:  full encode ──────────────────> 3600 tokens (cached)
Frame 1:  12 patches changed ─> encode 12 ─> update cache
Frame 2:  0 patches changed ────────────> return cache
Frame 3:  8 patches changed ──> encode 8 ─> update cache

This is analogous to I-frames and P-frames in video compression.
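A minimal sketch of that per-frame decision, assuming hypothetical `encode_full` and `encode_patches` wrappers around the vision tower (neither name comes from this codebase) and taking the list of changed patch indices as input:

```python
# Minimal sketch of the per-frame cache decision. `encode_full` and
# `encode_patches` are hypothetical wrappers around the vision tower, and the
# 50-patch fallback point is illustrative (it mirrors the behaviour described
# under Results). Cached tokens are assumed to be a (num_patches, dim) array
# that supports index assignment (NumPy array or torch tensor).
MAX_PARTIAL_PATCHES = 50

class TokenCache:
    def __init__(self, encode_full, encode_patches):
        self.encode_full = encode_full        # frame -> (num_patches, dim) tokens
        self.encode_patches = encode_patches  # (frame, indices) -> (k, dim) tokens
        self.tokens = None                    # tokens cached from the last frame

    def update(self, frame, changed):
        if self.tokens is None:
            # Frame 0: full encode primes the cache (the "I-frame").
            self.tokens = self.encode_full(frame)
        elif len(changed) == 0:
            pass                              # cache hit: reuse tokens as-is
        elif len(changed) <= MAX_PARTIAL_PATCHES:
            # Partial update (the "P-frame"): overwrite only the changed rows.
            self.tokens[changed] = self.encode_patches(frame, changed)
        else:
            # Large change (scroll, window switch): fall back to a full encode.
            self.tokens = self.encode_full(frame)
        return self.tokens
```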

Results

Change detection worked well. Per-patch pixel diffs reliably identify which regions changed. A threshold of 5% average pixel difference captures meaningful UI changes while ignoring compression artifacts.
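A sketch of that detector, assuming uint8 RGB frames whose dimensions are exact multiples of the patch size:

```python
import numpy as np

# Sketch of the per-patch change detector. A patch counts as changed when its
# mean absolute pixel difference exceeds 5% of the 0-255 range.
def changed_patch_indices(prev, curr, patch=24, threshold=0.05):
    h, w, _ = prev.shape
    rows, cols = h // patch, w // patch
    diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32)) / 255.0
    # Average the difference within each patch cell -> (rows, cols) grid.
    per_patch = diff.reshape(rows, patch, cols, patch, 3).mean(axis=(1, 3, 4))
    # Flat indices in row-major order, matching row * cols + col patch numbering.
    return np.flatnonzero(per_patch.ravel() > threshold).tolist()
```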

Cache hits provided significant speedups. When nothing changes (loading screens, typing pauses), we skip vision encoding entirely.

Partial re-encoding did not work. Vision transformers expect full images: the positional embeddings are learned for the complete grid, and self-attention mixes information across every patch, so tokens computed from an isolated subset of patches do not blend correctly with the surrounding cached tokens. We tried context windows, explicit position indices, and a separate projection; none produced coherent results.
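For illustration, the "explicit position indices" variant amounts to something like the following toy ViT sketch (module names and shapes are illustrative, not the FastVLM code); even with the right positional embeddings, the k changed tokens only ever attend to each other:

```python
import torch
import torch.nn as nn

# Toy ViT-style encoder illustrating the "explicit position indices" attempt.
# All names and shapes here are illustrative, not taken from FastVLM.
class ToyViT(nn.Module):
    def __init__(self, patch=24, dim=256, num_patches=3600, depth=2):
        super().__init__()
        self.proj = nn.Linear(patch * patch * 3, dim)            # patch -> token
        self.pos_embed = nn.Parameter(torch.zeros(num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def encode_subset(self, patch_pixels, patch_indices):
        # patch_pixels: (k, patch*patch*3) flattened crops of the changed patches
        # patch_indices: (k,) flat grid positions of those crops
        x = self.proj(patch_pixels) + self.pos_embed[patch_indices]
        # Self-attention runs over only these k tokens; they never attend to the
        # unchanged patches the cached tokens were computed against, so the
        # outputs stay inconsistent with the cache even with correct positions.
        return self.blocks(x.unsqueeze(0)).squeeze(0)            # (k, dim) tokens
```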

Fallback dominated in practice. Scrolling, window switches, and significant UI transitions change more than 50 patches, triggering full re-encoding. The net speedup was modest.

Also Explored: Attention-Based Click Targeting

Inspired by GUI-Actor, we experimented with using VLM attention maps for click targeting. Instead of predicting coordinates, we extract the attention heatmap over image patches and click the centroid of the highest-attention region.
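A rough sketch of the targeting step, assuming we already have a (grid_h, grid_w) attention map over image patches (how that map is pulled out of the VLM is model-specific and not shown here):

```python
import numpy as np

# Rough sketch of mapping a patch-level attention map to a click point.
# `attn` is assumed to be a (grid_h, grid_w) array of attention weights over
# image patches; the top_frac cutoff is an illustrative choice.
def attention_click_point(attn, screen_w=1920, screen_h=1080, top_frac=0.05):
    flat = attn.ravel()
    k = max(1, int(top_frac * flat.size))
    top = np.argpartition(flat, -k)[-k:]          # indices of the k hottest patches
    ys, xs = np.unravel_index(top, attn.shape)
    w = flat[top] / flat[top].sum()               # attention-weighted centroid
    cy, cx = float((ys * w).sum()), float((xs * w).sum())
    # Map patch-grid coordinates to screen pixels (centre of the patch cell).
    grid_h, grid_w = attn.shape
    return int((cx + 0.5) * screen_w / grid_w), int((cy + 0.5) * screen_h / grid_h)
```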

Attention does concentrate on relevant UI elements. However, the spatial resolution is too coarse for precise clicking—attention spreads across a button's surface and surrounding context rather than localizing to a single point.

Code

differential_vision/    # token caching implementation
attention_click/        # attention-based click targeting
predict.py              # FastVLM inference with differential encoding
llava/                  # model code from FastVLM

Built on FastVLM (Apple, CVPR 2025).
