This repository provides a comprehensive framework for evaluating VLMs on the CrossPoint-Bench benchmark and training CroPond.
- Release CrossPoint-Bench
- Release CrossPoint-378K
- Release CroPond model
We provide two versions of CroPond:
| Model | Base Model | Parameters | Checkpoints |
|---|---|---|---|
| CroPond-3B | Qwen2.5-VL-3B | 3B | 🤗 CroPond-3B |
| CroPond-7B | Qwen2.5-VL-7B | 7B | 🤗 CroPond-7B |
Clone the repository:
git clone https://github.com/WangYipu2002/CrossPoint.git
cd CrossPoint

For evaluation:
conda create -n crosspoint_eval python=3.10
conda activate crosspoint_eval
pip install -r requirements_eval.txt

For training:
conda create -n crosspoint_train python=3.10
conda activate crosspoint_train
pip install -r requirements_train.txt

The evaluation process consists of three steps:
Download CrossPoint-Bench from Hugging Face.
After downloading, the directory structure should look like:
CrossPoint-Bench/
├── image/ # Contains all benchmark images
│ ├── origin_image/ # Original images
│ └── visual_image/ # Annotated images
└── CrossPoint-Bench.jsonl # Benchmark annotations
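CrossPoint-Bench.jsonl is a JSON-Lines file: one JSON object per line. As a quick sanity check after downloading, you can load it and inspect the record count and field names. This is a minimal sketch, not part of the repository's evaluation code; the commented-out path follows the directory tree above.

```python
import json

def load_jsonl(path):
    """Parse a JSON-Lines file: one JSON object per line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records

# Sanity-check the downloaded benchmark annotations:
# samples = load_jsonl("CrossPoint-Bench/CrossPoint-Bench.jsonl")
# print(len(samples), sorted(samples[0].keys()))
```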
Choose one of the following methods based on your model type:
Edit scripts/eval/eval_api.sh to configure your paths, model names, and API credentials.
Run:
bash scripts/eval/eval_api.sh

Edit scripts/eval/eval_opensource.sh to configure your paths and model settings.
Run:
bash scripts/eval/eval_opensource.sh

Output: Inference results will be saved to eval_results/inference/eval_<model_name>.jsonl
Edit scripts/eval/cal_metric.sh to configure paths and coordinate format settings.
Run:
bash scripts/eval/cal_metric.sh

Output: Metrics will be saved to eval_results/scores/evaluation_report.txt
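For intuition, pointing benchmarks typically score a prediction as correct when the predicted 2D point lands inside the ground-truth target region. The helper below is a hypothetical sketch of that rule, not the actual logic in cal_metric.sh; it assumes points as (x, y) pixel pairs and regions as [x1, y1, x2, y2] boxes.

```python
def point_in_box(point, box):
    """True if a predicted (x, y) point lies inside an [x1, y1, x2, y2] box."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def pointing_accuracy(predictions, gt_boxes):
    """Fraction of predicted points that hit their ground-truth box."""
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, gt_boxes))
    return hits / len(predictions)
```

The real script additionally handles coordinate format settings (see scripts/eval/cal_metric.sh), e.g. whether model outputs are normalized or in absolute pixels.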
In this repository, we primarily use CrossPoint-378K, which can be downloaded from Hugging Face.
After downloading, the training datasets directory structure should look like:
training_datasets/
├── CrossPoint-378K/ # Main training dataset
│ ├── CrossPoint-378K.json # Main data file (ShareGPT format)
│ ├── image/ # Original images directory
│ │ └── [scene_id]/ # Scene ID directory
│ │ └── [images] # Scene images
│ └── visual_image/ # Annotated images directory
│ └── [scene_id]/ # Scene ID directory
│ └── [images] # Annotated images with visual markers
└── other_datasets/              # Widely used community datasets (recommended)
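CrossPoint-378K.json uses the ShareGPT conversation format consumed by LLaMA Factory: each sample holds a conversations list of alternating human/gpt turns, with <image> placeholders in the text matched one-to-one against the images list. The sample below is entirely made up for illustration; the question text, answer, and file names are not real dataset content.

```python
# A made-up ShareGPT-format sample; all values are illustrative placeholders.
sample = {
    "conversations": [
        {"from": "human",
         "value": "<image>\n<image>\nWhere is the marked point from the "
                  "first view located in the second view?"},
        {"from": "gpt", "value": "(412, 305)"},
    ],
    "images": [
        "CrossPoint-378K/visual_image/scene0000/frame_000.jpg",
        "CrossPoint-378K/image/scene0000/frame_042.jpg",
    ],
}

def check_sharegpt(sample):
    """Structural checks: roles alternate human/gpt, <image> tags match images."""
    turns = sample["conversations"]
    assert all(t["from"] == ("human" if i % 2 == 0 else "gpt")
               for i, t in enumerate(turns))
    n_tags = sum(t["value"].count("<image>") for t in turns)
    assert n_tags == len(sample["images"])
    return True
```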
To enhance the model's spatial understanding capabilities while preserving its general knowledge, we also incorporate additional datasets, including RefSpatial, SAT, SPAR-7M, MulSeT, and LLaVA-1.5. Please refer to the original papers and repositories for dataset preparation instructions.
Edit scripts/train/train.sh to configure your actual paths and register your training data paths in data/dataset_info.json.
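In LLaMA Factory, a dataset is registered by adding an entry to data/dataset_info.json. A sketch of what a CrossPoint-378K entry might look like (the key name and column mapping are assumptions; check the dataset_info.json shipped with this repository for the exact entry):

```json
{
  "crosspoint_378k": {
    "file_name": "training_datasets/CrossPoint-378K/CrossPoint-378K.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations",
      "images": "images"
    }
  }
}
```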
bash scripts/train/train.sh

Output: Checkpoints will be saved to OUTPUT_PATH
After training, you can evaluate the trained checkpoint on CrossPoint-Bench and other benchmarks using the evaluation scripts described above.
This repository is built upon the codebase of LLaMA Factory.
We acknowledge ScanNet, ScanNet++, ETH3D, and ARKitScenes for their data.
If you find CrossPoint-Bench, CrossPoint-378K, and CroPond useful for your research, please cite using this BibTeX:
@article{wang2025crosspoint,
title={Towards Cross-View Point Correspondence in Vision-Language Models},
author={Wang, Yipu and Ji, Yuheng and Liu, Yuyang and Zhou, Enshen and Yang, Ziqiang and Tian, Yuxuan and Qin, Ziheng and Liu, Yue and Tan, Huajie and Chi, Cheng and Ma, Zhiyuan and Zeng, Daniel Dajun and Zheng, Xiaolong},
journal={arXiv preprint arXiv:2512.04686},
year={2025}
}