Towards Cross-View Point Correspondence in Vision-Language Models

This repository provides a framework for evaluating vision-language models (VLMs) on the CrossPoint-Bench benchmark and for training the CroPond models.


📋 TODO

  • Release CrossPoint-Bench
  • Release CrossPoint-378K
  • Release CroPond model

🤖 Models

We provide two versions of CroPond:

Model        Base Model      Parameters   Checkpoints
CroPond-3B   Qwen2.5-VL-3B   3B           🤗 CroPond-3B
CroPond-7B   Qwen2.5-VL-7B   7B           🤗 CroPond-7B

🚀 Setup

Clone the repository:

git clone https://github.com/WangYipu2002/CrossPoint.git
cd CrossPoint

📦 Install Dependencies

For evaluation:

conda create -n crosspoint_eval python=3.10
conda activate crosspoint_eval
pip install -r requirements_eval.txt

For training:

conda create -n crosspoint_train python=3.10
conda activate crosspoint_train
pip install -r requirements_train.txt

📊 Evaluation on CrossPoint-Bench

The evaluation process consists of three steps:

Step 1: Download CrossPoint-Bench

Download CrossPoint-Bench from Hugging Face.

After downloading, the directory structure should look like:

CrossPoint-Bench/
├── image/                     # Contains all benchmark images
│   ├── origin_image/          # Original images
│   └── visual_image/          # Annotated images
└── CrossPoint-Bench.jsonl     # Benchmark annotations
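
Each line of CrossPoint-Bench.jsonl holds one benchmark annotation as a JSON object. A minimal loader sketch (the fields inside each record are left unspecified here; inspect the downloaded file for the actual schema):

```python
import json

def load_jsonl(path):
    """Read a JSON-Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# e.g. annotations = load_jsonl("CrossPoint-Bench/CrossPoint-Bench.jsonl")
```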

Step 2: Run Inference

Choose one of the following methods based on your model type:

Option A: API-based Models (GPT, Claude, Gemini)

Edit scripts/eval/eval_api.sh to configure your paths, model names, and API credentials.

Run:

bash scripts/eval/eval_api.sh

Option B: Open-source or Locally Trained Models

Edit scripts/eval/eval_opensource.sh to configure your paths and model settings.

Run:

bash scripts/eval/eval_opensource.sh

Output: Inference results will be saved to eval_results/inference/eval_<model_name>.jsonl

Step 3: Calculate Metrics

Edit scripts/eval/cal_metric.sh to configure paths and coordinate format settings.

Run:

bash scripts/eval/cal_metric.sh

Output: Metrics will be saved to eval_results/scores/evaluation_report.txt
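
The actual metrics are computed by cal_metric.sh. Purely to illustrate the kind of score such a report contains, a common metric for pointing tasks is the fraction of predicted points that land within a pixel threshold of the ground truth; the function below is a hypothetical sketch, not the script's logic:

```python
import math

def point_accuracy(preds, gts, threshold=10.0):
    """Fraction of predicted (x, y) points within `threshold` pixels of
    the matching ground-truth point (illustrative metric only, not the
    benchmark's official definition)."""
    hits = sum(math.dist(p, g) <= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```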

🏋️ Training

Step 1: Download Training Datasets

Training primarily uses CrossPoint-378K, which can be downloaded from Hugging Face.

After downloading, the training datasets directory structure should look like:

training_datasets/
├── CrossPoint-378K/               # Main training dataset
│   ├── CrossPoint-378K.json       # Main data file (ShareGPT format)
│   ├── image/                     # Original images directory
│   │   └── [scene_id]/            # Scene ID directory
│   │       └── [images]           # Scene images
│   └── visual_image/              # Annotated images directory
│       └── [scene_id]/            # Scene ID directory
│           └── [images]           # Annotated images with visual markers
└── other_datasets/                # Additional public datasets (recommended)

To enhance the model's spatial understanding capabilities while maintaining its general knowledge, we also incorporate other datasets including RefSpatial, SAT, SPAR-7M, MulSeT and LLaVA-1.5. Please refer to the original papers and repositories for dataset preparation instructions.
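
CrossPoint-378K.json stores samples in ShareGPT format, as noted above. The record below is an illustrative sketch following LLaMA-Factory's sharegpt convention; the prompt, coordinates, and file names are invented, so consult the downloaded JSON for the real schema:

```python
# Illustrative ShareGPT-style record; not an actual sample from CrossPoint-378K.
record = {
    "conversations": [
        {"from": "human", "value": "<image><image>Locate the point in the "
                                   "second view that corresponds to the "
                                   "marked point in the first view."},
        {"from": "gpt", "value": "(412, 306)"},
    ],
    "images": [
        "image/scene_0001/view_a.jpg",
        "image/scene_0001/view_b.jpg",
    ],
}
```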

Step 2: Training

Edit scripts/train/train.sh to configure your actual paths and register your training data paths in data/dataset_info.json.
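
LLaMA-Factory loads datasets through entries in data/dataset_info.json. A sketch of what a registration entry for CrossPoint-378K might look like (the key name and file path are placeholders; the columns mapping follows LLaMA-Factory's sharegpt convention):

```python
import json

# Hypothetical entry; merge it into data/dataset_info.json and adjust the path.
entry = {
    "crosspoint_378k": {
        "file_name": "training_datasets/CrossPoint-378K/CrossPoint-378K.json",
        "formatting": "sharegpt",
        "columns": {"messages": "conversations", "images": "images"},
    }
}
print(json.dumps(entry, indent=2))
```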

bash scripts/train/train.sh

Output: Checkpoints will be saved to the OUTPUT_PATH configured in train.sh

After training, you can evaluate the trained checkpoint on CrossPoint-Bench and other benchmarks using the evaluation scripts described above.

🙏 Acknowledgment

This repository is built upon the codebase of LLaMA Factory.

We acknowledge ScanNet, ScanNet++, ETH3D, and ARKitScenes for their data.

📝 Citation

If you find CrossPoint-Bench, CrossPoint-378K, and CroPond useful for your research, please cite using this BibTeX:

@article{wang2025crosspoint,
  title={Towards Cross-View Point Correspondence in Vision-Language Models},
  author={Wang, Yipu and Ji, Yuheng and Liu, Yuyang and Zhou, Enshen and Yang, Ziqiang and Tian, Yuxuan and Qin, Ziheng and Liu, Yue and Tan, Huajie and Chi, Cheng and Ma, Zhiyuan and Zeng, Daniel Dajun and Zheng, Xiaolong},
  journal={arXiv preprint arXiv:2512.04686},
  year={2025}
}

About

Official implementation of “Towards Cross-View Point Correspondence in Vision-Language Models”.
