arXiv BibTeX HuggingFace Dataset HuggingFace Model Project Page Dataset Live view Image Retrieval Results GitHub stars

Relational Visual Similarity (arXiv 2025)
Thao Nguyen1, Sicheng Mo3, Krishna Kumar Singh2, Yilin Wang2, Jing Shi2, Nicholas Kolkin2, Eli Shechtman2, Yong Jae Lee1,2,★, Yuheng Li1,★
(★ Equal advising)
1- University of Wisconsin–Madison; 2- Adobe Research; 3- UCLA

TL;DR: We introduce a new visual similarity notion: relational visual similarity, which complements traditional attribute-based perceptual similarity (e.g., LPIPS, CLIP, DINO).

Click here to read the Abstract 📝 Humans do not just see attribute similarity---we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a 114k image–caption dataset in which the captions are anonymized---describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision–Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has many real-world applications, existing image similarity models fail to capture it---revealing a critical gap in visual computing.
---

🔗 Table of Contents:

  1. 🛠️ Quick Usage
  2. 🫥 Anonymous Captioning Model
  3. 📝 Dataset | live view
  4. 🔄 Image Retrieval | live view
  5. 📄 BibTeX

🛠️ Quick Usage

This code is tested with Python 3.10 on (i) an NVIDIA A100 80GB (torch 2.5.1+cu124) and (ii) an NVIDIA RTX A6000 48GB (torch 2.9.1+cu128).
Other hardware setups haven't been tested, but they should still work. Please install pytorch and torchvision according to your machine configuration.

You can also find relsim on HuggingFace: 🤗 thaoshibe/relsim-qwenvl25-lora

conda create -n relsim python=3.10
conda activate relsim
pip install relsim

# or you can clone the repo
git clone https://github.com/thaoshibe/relsim.git
cd relsim
pip install -r requirements.txt

Given two images, you can compute their relational visual similarity (relsim) like this:

from relsim.relsim_score import relsim
from PIL import Image

# Load model
model, preprocess = relsim(pretrained=True, checkpoint_dir="thaoshibe/relsim-qwenvl25-lora")

img1 = preprocess(Image.open("image_path_1"))
img2 = preprocess(Image.open("image_path_2"))
similarity = model(img1, img2)  # Returns similarity score (higher = more similar)
print(f"relational similarity score: {similarity:.3f}")

Or you can run python test.py for a quick test. Here are example results; all images below can be found in this folder:

| reference image | test image 1 | test image 2 | test image 3 | test image 4 | test image 5 | test image 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Image | Image | Image | Image | Image | Image | Image |
| (to itself: 1.000) | 0.981 | 0.830 | 0.808 | 0.767 | 0.465 | 0.223 |
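To reproduce a comparison like the table above, you can score one reference image against several candidates in a loop. Below is a minimal sketch that reuses the relsim API from the snippet above; the image filenames are placeholders.

from relsim.relsim_score import relsim
from PIL import Image

# Load the model once (same call as in the quick-usage snippet)
model, preprocess = relsim(pretrained=True, checkpoint_dir="thaoshibe/relsim-qwenvl25-lora")

reference = preprocess(Image.open("reference.jpg"))        # placeholder path
candidates = [f"test_{i}.jpg" for i in range(1, 7)]        # placeholder paths

# Score the reference against each candidate, then rank by descending relational similarity
scores = {path: model(reference, preprocess(Image.open(path))) for path in candidates}
for path, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{path}: {score:.3f}")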

🤗 You're welcome to improve the current relsim model! The training script is provided at ./relsim/train.sh. For a quick jump to the training script:
(Reminder: you might need to download the data here first to run this training code successfully)

git clone https://github.com/thaoshibe/relsim.git
cd relsim/relsim
pip install -r requirements_train.txt

### you might want to export WANDB and HF_TOKEN before training
# export WANDB_API_KEY='your_wandb_api_key'
# export HF_TOKEN='your_hf_token'

bash train.sh # this assumes you have the dataset already
Click here to see an example of the wandb log Image

🫥 Anonymous Caption Model

Anonymous captions are image captions that do not refer to specific visible objects but instead capture the relational logic conveyed by the image.

The pretrained anonymous caption model (Qwen2.5-VL 7B) is provided in ./anonymous_caption. This model is trained on a limited number of seed groups and their corresponding generated captions (see the training data here).

You can also find the Anonymous Caption Model on HuggingFace: 🤗 thaoshibe/relsim-anonymous-caption-qwen25vl-lora.

The script to run the Anonymous Caption Model (with HuggingFace) is provided at ./anonymous_caption/anonymous_caption_hf.py.

# run with huggingface-model
python anonymous_caption/anonymous_caption_hf.py

# run on default test image (mam.jpg)
python anonymous_caption/anonymous_caption.py

# run on your own images
python anonymous_caption/anonymous_caption.py --image_path $PATH_TO_IMAGE_OR_IMAGE_FOLDER

# if you need to see all arguments (e.g., batch size)
python anonymous_caption/anonymous_caption.py --help
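If you would rather load the HuggingFace checkpoint directly instead of going through the provided script, the sketch below shows one possible way to do it. It assumes the checkpoint is a standard PEFT LoRA adapter on top of Qwen/Qwen2.5-VL-7B-Instruct and uses a made-up instruction prompt; see ./anonymous_caption/anonymous_caption_hf.py for the exact base model, prompt, and generation settings.

import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

BASE = "Qwen/Qwen2.5-VL-7B-Instruct"                         # assumed base model
ADAPTER = "thaoshibe/relsim-anonymous-caption-qwen25vl-lora"

processor = AutoProcessor.from_pretrained(BASE)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER)            # attach the LoRA weights

image = Image.open("anonymous_caption/mam.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        # Placeholder instruction -- the real prompt lives in anonymous_caption_hf.py
        {"type": "text", "text": "Write an anonymous caption describing the relational logic of this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64, do_sample=True)
caption = processor.batch_decode(generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(caption)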

Here are examples of the generated anonymous captions from different runs.

| Input image | Generated anonymous captions (different runs) |
| --- | --- |
| Image | Example: `python anonymous_caption/anonymous_caption.py --image_path anonymous_caption/mam.jpg`<br>Run 1: "Curious {Animal} peering out from behind a {Object}."<br>Run 2: "Curious {Animal} peeking out from behind the {Object} in an unexpected and playful way."<br>Run 3: "Curious {Cat} looking through a {Doorway} into the {Room}."<br>Run 4: "A curious {Animal} peeking from behind a {Barrier}."<br>Run 5: "A {Cat} peeking out from behind a {Door} with curious eyes."<br>... |
| Image | Example: `python anonymous_caption/anonymous_caption.py --image_path anonymous_caption/bo.jpg`<br>Run 1: "Animals with {Leaf} artfully placed on their {Head}."<br>Run 2: "A {Dog} with a {Leaf} delicately placed on its head."<br>Run 3: "A {Dog} with a {Leaf} artfully placed on its head."<br>Run 4: "A {Dog} with a {Leaf} delicately placed on their head, representing the beauty of {Season}."<br>Run 5: "Animals adorned with {Leaf} in a {Seasonal} setting."<br>... |
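The generated captions anonymize concrete objects with curly-brace slots such as {Animal} or {Object}. If you want to post-process them, for example to pull out the slots or to sanity-check that a caption is "anonymous enough", here is a small regex sketch (the helper name is made up):

import re

def extract_slots(caption: str) -> list[str]:
    """Return the anonymized {Slot} names appearing in a generated caption."""
    return re.findall(r"\{([^{}]+)\}", caption)

caption = 'Curious {Animal} peering out from behind a {Object}.'
print(extract_slots(caption))            # ['Animal', 'Object']
print(len(extract_slots(caption)) > 0)   # True -> at least one concrete object was anonymized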

🤗 You are more than welcome to help improve the anonymous caption model! The current model may hallucinate or produce incorrect results, and it sometimes generates captions that are not "anonymous enough".

The training script for the anonymous caption model is shown below. Please check config.yaml for config details.

#########################################
#
#     train anonymous caption model 
#
#########################################

# install git lfs if you don't have it
sudo apt update
sudo apt install git-lfs
git lfs install

# clone the repo if you haven't done that yet
git clone https://github.com/thaoshibe/relsim.git

# download the training data
cd relsim/anonymous_caption
git clone https://huggingface.co/datasets/thaoshibe/seed-groups
pip install -r requirements.txt
# run training
python anonymous_caption_train.py
Click here to see an example of the wandb log. Checkpoints will be saved in `./anonymous_caption/ckpt`.
Image
And your console should look like this:
Image

📝 Data

🔍 You can see a snapshot of the data on this live website: 🔍🔍🔍 relsim: data viewer

| Dataset name | Short description | JSON file | 🔍 Data viewer |
| --- | --- | --- | --- |
| seed-groups HuggingFace Dataset | Used to train the anonymous captioning model | seed_group.json | See Seed Groups Dataset |
| anonymous-captions-114k HuggingFace Dataset | Used to train the relational similarity model | anonymous_captions_train.jsonl, anonymous_captions_test.jsonl | See Anonymous Captions Dataset |

Each image is given by its corresponding image URL. Please see the JSON files in ./data.

(Optional) Depending on your internet speed, it should take under half an hour to download all images with the default MAX_WORKER = 64. You can increase MAX_WORKER to speed up the download or reduce it to suit your connection (see data/download_data.sh).

To download, please run data/download_data.sh:

#########################################
#
#            download data
#
#########################################

git clone https://github.com/thaoshibe/relsim.git
cd relsim
bash data/download_data.sh # this script will download all datasets
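For reference, here is what the same download looks like in plain Python with a worker pool. This is only a sketch: the JSONL path, the image_url field name, and the output layout are assumptions, and data/download_data.sh plus the JSON files in ./data remain the source of truth.

import json
import os
from concurrent.futures import ThreadPoolExecutor

import requests

MAX_WORKER = 64  # same knob as in data/download_data.sh; lower it on a slow connection

def download(record, out_dir="images"):
    # Fetch one image by URL and save it under out_dir (field name is an assumption)
    url = record["image_url"]
    path = os.path.join(out_dir, os.path.basename(url.split("?")[0]))
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        with open(path, "wb") as f:
            f.write(resp.content)
    except requests.RequestException as err:
        print(f"skip {url}: {err}")

os.makedirs("images", exist_ok=True)
with open("data/anonymous_captions_train.jsonl") as f:  # assumed location of the JSONL
    records = [json.loads(line) for line in f]

with ThreadPoolExecutor(max_workers=MAX_WORKER) as pool:
    list(pool.map(download, records))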

🔄 Image Retrieval

You might want to build an image retrieval system.
A snapshot of how to do that is provided in ./retrieval/, along with the GPT-4o scoring code (to evaluate top-k retrieval). The provided database consists of the 14k test set and 14k random images from LAION; for convenience, they are combined into combined.jsonl.
The full pipeline can be found in ./retrieval/pipeline.sh.

cd retrieval

# precompute the embedding for each image
python get_embedding_our.py \
    --checkpoint_dir thaoshibe/relsim-qwenvl25-lora \
    --json_file combined.jsonl \
    --output_path ./precomputed/relsim.npz \
    --batch_size 16

# perform retrieval
python retrieve_topk_images.py \
    --precomputed_dir ./precomputed \
    --output_file retrieved_images.json \
    --topk 10 \
    --num_images 1000 \
    --image_dir ./images

An example retrieved_images.json is provided in ./retrieval/retrieved_images.json. You can also browse the 1000 uncurated retrieval results live at 🔍 Image Retrieval Results | LIVE!!!
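Conceptually, the retrieval step is a cosine-similarity ranking over the precomputed embeddings. The sketch below shows that ranking in isolation; the .npz key names embeddings and paths are assumptions, so check get_embedding_our.py and retrieve_topk_images.py for the actual format.

import numpy as np

data = np.load("./precomputed/relsim.npz", allow_pickle=True)
emb = np.asarray(data["embeddings"], dtype=np.float32)   # (N, D); key name assumed
paths = data["paths"]                                    # (N,); key name assumed

# L2-normalize so that a dot product equals cosine similarity
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

def topk(query_index, k=10):
    # Return the k most relationally similar images to the query (excluding itself)
    sims = emb @ emb[query_index]
    order = np.argsort(-sims)
    order = order[order != query_index][:k]
    return [(str(paths[i]), float(sims[i])) for i in order]

for path, score in topk(0, k=10):
    print(f"{score:.3f}  {path}")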

(Optional) If you want to use GPT-4o to evaluate the results, please put your GPT-4o API key in ./retrieval/gpt4o_config.yaml. Then run the GPT-4o evaluation code at the bottom of this file: ./retrieval/pipeline.sh. GPT-4o's answers may vary between sessions.

Click here to see an example snapshot of the GPT-4o score Image

📊 Similarity Space Figure

You might not believe it, but yes, we spent months+++ figuring out how to plot this theoretical Similarity Space (Figure 7 in the main paper).
Code for the figures in this paper is available in ./plot_figure/ (this code produces plots similar to the figures below).

Image Image

Left: the Similarity Space figure in cognitive science theory, published with the phenomenal 1997 paper "Structure Mapping in Analogy and Similarity" (Dedre Gentner and Arthur B. Markman).
Right: the Similarity Space figure in computer science~ yes, after almost 30 years, we can finally replicate the theoretical figure with the Relational Visual Similarity paper.

Cool, isn't it?? (˵ •̀ ᴗ - ˵ ) ✧
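For intuition: the similarity space puts attribute (surface) similarity on one axis and relational (structural) similarity on the other, roughly following Gentner and Markman's framing, so literal matches sit high on both axes while analogies (like the Earth/peach example in the abstract) sit high on the relational axis only. The toy matplotlib sketch below is illustrative only, with made-up coordinates; it is not the plotting code in ./plot_figure/.

import matplotlib.pyplot as plt

# Toy (attribute similarity, relational similarity) points -- illustrative values only
examples = {
    "literal similarity (apple vs. peach)": (0.9, 0.9),
    "analogy (Earth vs. peach)": (0.2, 0.85),
    "mere appearance": (0.85, 0.2),
    "dissimilar": (0.1, 0.1),
}

fig, ax = plt.subplots(figsize=(5, 5))
for label, (attr, rel) in examples.items():
    ax.scatter(attr, rel)
    ax.annotate(label, (attr, rel), textcoords="offset points", xytext=(5, 5))
ax.set_xlabel("attribute (surface) similarity")
ax.set_ylabel("relational similarity")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_title("Similarity space (toy sketch)")
plt.tight_layout()
plt.savefig("similarity_space_sketch.png")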


⚠️ Disclaimer

All images are extracted from the LAION dataset. We do NOT own any of the images, and we acknowledge the rights and contributions of the original creators. Please respect the authors of all images. These images are used for research purposes only.


📄 BibTeX

@misc{nguyen2025relationalvisualsimilarity,
      title={Relational Visual Similarity}, 
      author={Thao Nguyen and Sicheng Mo and Krishna Kumar Singh and Yilin Wang and Jing Shi and Nicholas Kolkin and Eli Shechtman and Yong Jae Lee and Yuheng Li},
      year={2025},
      eprint={2512.07833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.07833}, 
}

You've reached the end (.❛ ᴗ ❛.).
(ˆڑˆ)◞🍪
🍪 here is a cookie for you~
Enjoy and consider giving me a star ⭐~ Thank you! GitHub stars