
Embodied Reasoning QA Evaluation Dataset

As part of Google DeepMind's Gemini Robotics release, we open-source a benchmark of multimodal embodied reasoning questions. The benchmark covers a variety of topics in spatial reasoning and world knowledge, focused on real-world scenarios, particularly in the context of robotics. Please find more details and visualizations in the tech report.

ERQA Dataset

The Embodied Reasoning Question Answer (ERQA) benchmark consists of multiple-choice questions with interleaved images and text. The answer to each question is a single letter (A, B, C, or D). We provide the benchmark in data/erqa.tfrecord as TF Examples saved with the following features:

  • question: The text question to ask
  • image/encoded: One or more encoded images
  • answer: The ground truth answer
  • question_type: The type of question (optional)
  • visual_indices: Indices of visual elements (determines image placement)
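As an illustration of how visual_indices might drive image placement, here is a minimal sketch. It assumes each index is a character offset in the question text before which the corresponding image is inserted, with 0 meaning the image precedes all text; the authoritative handling lives in loading_example.py.

```python
def interleave(question, images, visual_indices):
    """Build an interleaved prompt from a question, its images, and their indices.

    Hypothetical interpretation: visual_indices[i] is the character offset in
    `question` before which images[i] is placed; 0 puts the image first.
    """
    parts = []
    cursor = 0
    # Place images in order of their target position in the text.
    for img, idx in sorted(zip(images, visual_indices), key=lambda p: p[1]):
        parts.append(question[cursor:idx])  # text up to this image
        parts.append(img)
        cursor = idx
    parts.append(question[cursor:])  # remaining text after the last image
    return [p for p in parts if p != ""]
```

For example, a single image with index 0 yields a prompt where the image comes before the question text.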

Setup Instructions

1. Install Dependencies

Create and activate a virtual environment (e.g. with python -m venv), then install the required packages:

pip install -r requirements.txt

2. Run example dataset loader

To see the structure of the ERQA dataset and how to load examples from the TFRecord file, run the simple dataset loader:

python loading_example.py

This script demonstrates how to:

  • Load the TFRecord file
  • Parse examples with their features (questions, images, answers, etc.)
  • Access the data in each example
  • Handle the visual indices that determine image placement

Multimodal Evaluation Harness

We also provide an example of a lightweight evaluation harness for querying multimodal APIs (Gemini 2.0 and OpenAI) with examples loaded from the ERQA benchmark.

Setup

  1. Install the required dependencies:

    pip install -r requirements.txt
  2. Set up your API keys. You can register a new Gemini API key in Google AI Studio.

API Key Configuration

There are multiple ways to provide API keys to the evaluation harness:

Option 1: Environment Variables

Set environment variables for the APIs you want to use:

# For Gemini API
export GEMINI_API_KEY="your_gemini_api_key_here"

# For OpenAI API
export OPENAI_API_KEY="your_openai_api_key_here"

Option 2: Command-line Arguments

Provide API keys directly as command-line arguments:

# For a single Gemini API key
python eval_harness.py --gemini_api_key YOUR_GEMINI_API_KEY

# For a single OpenAI API key
python eval_harness.py --api openai --openai_api_key YOUR_OPENAI_API_KEY

Option 3: API Keys File

Create a text file with your API keys and provide the path to the file. This is helpful when you have multiple API keys for the same API and are running into rate limits per key.

# Using a keys file
python eval_harness.py --api_keys_file path/to/your/keys.txt

The keys file should have one key per line. Keys are assumed to be for the API specified with the --api argument unless prefixed with the API name (e.g. gemini:KEY or openai:KEY).

YOUR_API_KEY_1
YOUR_API_KEY_2

Running the Evaluation Harness

Basic Usage

Run the evaluation harness with default settings (Gemini API):

python eval_harness.py

Specifying a Model

For Gemini API:

# Using the Gemini 2.0 Flash model
python eval_harness.py --model gemini-2.0-flash

# Using the experimental Gemini Pro model
python eval_harness.py --model gemini-2.0-pro-exp-02-05

For OpenAI API:

# Using a dated GPT-4o snapshot
python eval_harness.py --api openai --model gpt-4o-2024-11-20

Setting the Number of Examples

The full ERQA benchmark consists of 400 examples, but by default the harness processes only 1. Use --num_examples to control how many examples are evaluated:

python eval_harness.py --num_examples 10

Complete Examples

Example with custom arguments for Gemini:

python eval_harness.py --api gemini --model gemini-2.0-pro --gemini_api_key YOUR_API_KEY

Example with custom arguments for OpenAI:

python eval_harness.py --api openai --model gpt-4o-mini --openai_api_key YOUR_API_KEY

Example with multiple API keys and a keys file:

python eval_harness.py --api_keys_file ./gemini_keys.txt

Command-line Arguments

  • --tfrecord_path: Path to the TFRecord file (default: './data/erqa.tfrecord')
  • --api: API to use: 'gemini' or 'openai' (default: 'gemini')
  • --model: Model name to use (defaults: 'gemini-2.0-flash-exp' for Gemini, 'gpt-4o' for OpenAI)
    • Available Gemini models include: gemini-2.0-flash-exp, gemini-2.0-pro, gemini-2.0-pro-exp-02-05
  • --gemini_api_key: Gemini API key (can be specified multiple times for multiple keys)
  • --openai_api_key: OpenAI API key (can be specified multiple times for multiple keys)
  • --api_keys_file: Path to a file containing API keys (one per line, format: "gemini:KEY" or "openai:KEY")
  • --num_examples: Number of examples to process (default: 1)
  • --max_retries: Maximum number of retries per API key on resource exhaustion (default: 2)
  • --max_tokens: Maximum number of tokens in the response (for OpenAI only, default: 300)
  • --connection_retries: Maximum number of retries for connection errors (for OpenAI only, default: 5)

Multiple API Keys and Retry Logic

The harness supports using multiple API keys with retry logic when encountering resource exhaustion errors:

  1. You can provide multiple API keys using the --gemini_api_key or --openai_api_key arguments multiple times or via a file with --api_keys_file
  2. When a resource exhaustion error (429) is encountered, the harness will:
    • Retry the request up to max_retries times (default: 2) with a fixed 2-second backoff
    • If all retries for one API key fail, it will try the next API key
    • Only exit when all API keys have been exhausted
  3. For OpenAI API, when connection errors are encountered:
    • Retry the request up to connection_retries times (default: 5) with a fixed 2-second backoff
    • If all connection retries for one API key fail, it will try the next API key
    • Only exit when all API keys have been exhausted
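The rotation behavior described above can be sketched as follows. Names here are illustrative, not the harness's actual API; `call` stands in for any function that sends one request with a given key:

```python
import time

class ResourceExhausted(Exception):
    """Stand-in for the API client's 429 (resource exhausted) error."""

def query_with_key_rotation(call, api_keys, max_retries=2, backoff_s=2.0):
    """Try each key in turn, retrying each up to `max_retries` times on 429.

    Moves to the next key once one key's retries are used up, and raises
    only after every key has been exhausted.
    """
    last_error = None
    for key in api_keys:
        for _attempt in range(max_retries):
            try:
                return call(key)
            except ResourceExhausted as err:
                last_error = err
                time.sleep(backoff_s)  # fixed backoff between retries
    raise RuntimeError("all API keys exhausted") from last_error
```

The OpenAI connection-error path described above follows the same shape, with connection_retries in place of max_retries.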
