PixRefer

Pre-commit Setup

This step is for the authors; regular users can skip it.

We use the Google docstring format for our docstrings and the pre-commit library to check our code. To install pre-commit, run the following commands:

conda install pre-commit  # or pip install pre-commit
pre-commit install

The pre-commit hooks will run automatically when you try to commit changes to the repository.
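
For reference, a Google-style docstring looks like the following. This is a minimal illustrative sketch; load_annotations is a hypothetical function, not part of the codebase:

import json

def load_annotations(json_path):
    """Load annotation records from a JSON file.

    Args:
        json_path: Path to the JSON file containing annotations.

    Returns:
        The parsed annotation data (typically a list of dictionaries).
    """
    with open(json_path) as f:
        return json.load(f)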

Quickstart

These tasks require a GUI, so it is recommended that you run them on macOS or Windows.

Clone the git repo and install the package

git clone https://github.com/Mars-tin/pixrefer.git
cd pixrefer
pip install -e .

Install audio and speech packages

If you are using a Mac, run the following commands to install PyAudio:

brew install portaudio
pip install pyaudio

Run the following command to install google-cloud-speech:

pip install google-cloud-speech

Download the data

git lfs install

For REL and REG tasks

git clone https://huggingface.co/datasets/Seed42Lab/Pixrefer_data

If you have already downloaded the data above and also want to download the pragmatics preference data:

cd Pixrefer_data
mkdir pragmatic
git worktree add pragmatic pragmatics_preference
cd -

If you want to update the data later:

cd Pixrefer_data
git pull origin main
cd -

For the pragmatics preference task only

git clone -b pragmatics_preference https://huggingface.co/datasets/Seed42Lab/Pixrefer_data

If you want to update the data later:

cd Pixrefer_data/pragmatic
git pull origin pragmatics_preference
cd -

Prepare the Google API key

Create an empty .env file:

touch .env

Then add the line below, replacing the placeholder with the real API key provided to you:

GOOGLE_API_KEY={YOUR_API_KEY}
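
If you want to verify that the key is picked up, here is a minimal sketch using python-dotenv. It assumes the interface reads GOOGLE_API_KEY from the environment; the snippet is illustrative and not part of the repository:

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # loads variables from the .env file in the current directory
api_key = os.getenv('GOOGLE_API_KEY')
assert api_key, 'GOOGLE_API_KEY is not set; check your .env file.'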

Launch the demo

REL task

bash pixrefer/interface/run_rel.sh

Please note that you need to change the JSON file path in run_rel.sh first.

Replace the following path with the data path you were given. For example, you may need to annotate llava_7b_concise_results.json:

--json_path Pixrefer_data/data/rel_user_input/llava_7b_concise_results.json  # replace the example gpt_4o file path here

Also change --output_dir when you annotate another file, so that your results are not overwritten:

--output_dir output/user_rel/regular  # replace the example concise dir if you are annotating the regular data

For each image, you are asked to click where you think the uniquely described object is located; the object sits inside a red box that is hidden from you.

[Screenshot: rel_regular]

If you find multiple objects that match the description, click Multiple Match and confirm your guess.

[Screenshot: rel_multiple_match]

If you cannot find such an object in the image, click Cannot Tell Where The Object Is and confirm your guess.

[Screenshot: rel_nomatch]

You can always press Enter (Return) on your keyboard to quickly confirm and move to the next image.

REG task

bash pixrefer/interface/run_reg.sh

For each image, you are asked to give at least one description of the object in the red box, so that it can be uniquely identified by another person.

Write a text description:

[Screenshot: reg_text]
After you finish, click Save Description to save your result; a green 'Text ✓' will appear.

Record an audio description:

Please note that you need to set the Google API key in the .env file to proceed.

[Screenshot: reg_audio]

Click Audio to switch to the audio mode, and click Start Recording to record. When you finish, click Stop Recording. You can edit the transcribed text, then click Save Description to save the edited result.

You can always press Enter (Return) on your keyboard to quickly confirm and move to the next image.
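
The audio mode relies on the google-cloud-speech package installed earlier. As a rough sketch of how a recorded WAV clip might be transcribed with that library (illustrative only, not the repository's actual code; the file name, sample rate, and credential setup are assumptions):

from google.cloud import speech

def transcribe_wav(path='recording.wav'):
    """Transcribe a short WAV recording with the Google Cloud Speech-to-Text API."""
    # Assumes Google Cloud credentials are already configured for this environment.
    client = speech.SpeechClient()
    with open(path, 'rb') as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,  # assumed recording rate
        language_code='en-US',
    )
    response = client.recognize(config=config, audio=audio)
    return ' '.join(result.alternatives[0].transcript for result in response.results)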

Pragmatics Preference

bash pixrefer/interface/run_pragmatic.sh

Please note that you need to change the JSON file path in run_pragmatic.sh first.

Replace the following path with the data path you were given. For example, you may need to annotate user_6_allocation.json:

--json_path Pixrefer_data/pragmatic/user_input/user_6_allocation.json  # replace the example user_1 file path here

For this task, select one of the following options to describe the object pointed to by the arrow, compared to the other object in the image.

Please note:

  • Follow your first instinct.
  • The order of the options changes for each image.
  • The maximum number of images that can be annotated at a time is 25. Once this limit is reached, please take a break for at least 10 minutes before continuing.

About

Interface for the preprint "Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation".
