As part of Google DeepMind's Gemini Robotics release, we open-source a benchmark of multimodal embodied reasoning questions. This evaluation benchmark covers a variety of topics related to spatial reasoning and world knowledge in real-world scenarios, particularly in the context of robotics. Please find more details and visualizations in the tech report.
The questions are multimodal, interleaving images and text, and are phrased as multiple-choice questions; the ground-truth answer for each is a single letter (A, B, C, or D). We provide the ERQA benchmark in `data/erqa.tfrecord` as TF Examples saved with the following features:

- `question`: The text question to ask
- `image/encoded`: One or more encoded images
- `answer`: The ground truth answer
- `question_type`: The type of question (optional)
- `visual_indices`: Indices of visual elements (determines image placement)
Once the virtual environment is activated, install the required packages:
```
pip install -r requirements.txt
```

To see the structure of the ERQA dataset and how to load examples from the TFRecord file, run the simple dataset loader:
```
python loading_example.py
```

This script demonstrates how to:
- Load the TFRecord file
- Parse examples with their features (questions, images, answers, etc.)
- Access the data in each example
- Handle the visual indices that determine image placement
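As a rough illustration of what such a loader does, the sketch below builds a synthetic TF Example using the feature names listed above and parses it back. The dtypes and shapes here are assumptions for illustration; `loading_example.py` is the authoritative reference.

```python
import tensorflow as tf

# Sketch only: a synthetic TF Example with the same feature names as the
# ERQA TFRecord. Dtypes/shapes are assumptions; see loading_example.py
# for the authoritative loader.
example = tf.train.Example(features=tf.train.Features(feature={
    "question": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"Which point is on the mug handle?"])),
    "image/encoded": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"<jpeg bytes>"])),  # one or more images
    "answer": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"A"])),
    "visual_indices": tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
}))

FEATURES = {
    "question": tf.io.FixedLenFeature([], tf.string),
    "image/encoded": tf.io.VarLenFeature(tf.string),   # variable number of images
    "answer": tf.io.FixedLenFeature([], tf.string),
    "question_type": tf.io.VarLenFeature(tf.string),   # optional field
    "visual_indices": tf.io.VarLenFeature(tf.int64),   # where images interleave
}

parsed = tf.io.parse_single_example(example.SerializeToString(), FEATURES)
images = tf.sparse.to_dense(parsed["image/encoded"], default_value=b"")
print(parsed["question"].numpy().decode())
print("answer:", parsed["answer"].numpy().decode(), "| images:", int(images.shape[0]))
```

For the real dataset, the same feature spec would be applied to `tf.data.TFRecordDataset("data/erqa.tfrecord")` via `map`.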
We also provide a lightweight example evaluation harness for querying multimodal APIs (Gemini 2.0 and OpenAI) with examples loaded from the ERQA benchmark.
- Install the required dependencies:

  ```
  pip install -r requirements.txt
  ```
- Set up your API keys. You can register a new Gemini API key in Google AI Studio.
There are multiple ways to provide API keys to the evaluation harness:
Set environment variables for the APIs you want to use:
```
# For Gemini API
export GEMINI_API_KEY="your_gemini_api_key_here"

# For OpenAI API
export OPENAI_API_KEY="your_openai_api_key_here"
```

Provide API keys directly as command-line arguments:
```
# For a single Gemini API key
python eval_harness.py --gemini_api_key YOUR_GEMINI_API_KEY

# For a single OpenAI API key
python eval_harness.py --api openai --openai_api_key YOUR_OPENAI_API_KEY
```

Create a text file with your API keys and pass the path to the file. This is helpful when you have multiple keys for the same API and are running into per-key rate limits.
```
# Using a keys file
python eval_harness.py --api_keys_file path/to/your/keys.txt
```

The keys file should contain one key per line; all keys are assumed to be for the API specified with the `--api` argument:

```
YOUR_API_KEY_1
YOUR_API_KEY_2
```
Run the evaluation harness with default settings (Gemini API):
```
python eval_harness.py
```

For the Gemini API:
```
# Using the default Gemini Flash model
python eval_harness.py --model gemini-2.0-flash

# Using the experimental Gemini Pro model
python eval_harness.py --model gemini-2.0-pro-exp-02-05
```

For the OpenAI API:
```
# Using a pinned GPT-4o model
python eval_harness.py --api openai --model gpt-4o-2024-11-20
```

By default, the full ERQA benchmark consists of 400 examples; use `--num_examples` to evaluate a subset:

```
python eval_harness.py --num_examples 10
```

Example with custom arguments for Gemini:
```
python eval_harness.py --api gemini --model gemini-2.0-pro --gemini_api_key YOUR_API_KEY
```

Example with custom arguments for OpenAI:

```
python eval_harness.py --api openai --model gpt-4o-mini --openai_api_key YOUR_OPENAI_API_KEY
```

Example with multiple API keys provided via a keys file:
```
python eval_harness.py --api_keys_file ./gemini_keys.txt
```

Command-line arguments:

- `--tfrecord_path`: Path to the TFRecord file (default: `./data/erqa.tfrecord`)
- `--api`: API to use, `gemini` or `openai` (default: `gemini`)
- `--model`: Model name to use (defaults: `gemini-2.0-flash-exp` for Gemini, `gpt-4o` for OpenAI)
  - Available Gemini models include: `gemini-2.0-flash-exp`, `gemini-2.0-pro`, `gemini-2.0-pro-exp-02-05`
- `--gemini_api_key`: Gemini API key (can be specified multiple times for multiple keys)
- `--openai_api_key`: OpenAI API key (can be specified multiple times for multiple keys)
- `--api_keys_file`: Path to a file containing API keys (one per line, format: `gemini:KEY` or `openai:KEY`)
- `--num_examples`: Number of examples to process (default: 1)
- `--max_retries`: Maximum number of retries per API key on resource exhaustion (default: 2)
- `--max_tokens`: Maximum number of tokens in the response (OpenAI only, default: 300)
- `--connection_retries`: Maximum number of retries for connection errors (OpenAI only, default: 5)
The harness supports multiple API keys, with retry logic for resource exhaustion errors:

- You can provide multiple API keys by repeating the `--gemini_api_key` or `--openai_api_key` arguments, or via a file with `--api_keys_file`
- When a resource exhaustion error (429) is encountered, the harness will:
  - Retry the request up to `max_retries` times (default: 2) with a fixed 2-second backoff
  - If all retries for one API key fail, try the next API key
  - Exit only when all API keys have been exhausted
- For the OpenAI API, when connection errors are encountered, the harness will:
  - Retry the request up to `connection_retries` times (default: 5) with a fixed 2-second backoff
  - If all connection retries for one API key fail, try the next API key
  - Exit only when all API keys have been exhausted
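The key-rotation behavior described above can be sketched as follows. This is an illustrative stand-in, not the harness's actual implementation; `ResourceExhausted`, `query_with_retries`, and the callable signature are hypothetical names.

```python
import time

# Hypothetical stand-in for an HTTP 429 / resource-exhaustion error.
class ResourceExhausted(Exception):
    pass

def query_with_retries(call, api_keys, max_retries=2, backoff_s=2.0):
    """Try each API key in turn; retry with a fixed backoff per key."""
    for key in api_keys:                    # rotate through the keys
        for _ in range(max_retries):        # retries allowed per key
            try:
                return call(key)
            except ResourceExhausted:
                time.sleep(backoff_s)       # fixed backoff, then retry
    raise RuntimeError("All API keys exhausted")

# Usage: a fake call that fails for the first key and succeeds for the second.
def fake_call(key):
    if key == "KEY_1":
        raise ResourceExhausted()
    return f"answer via {key}"

print(query_with_retries(fake_call, ["KEY_1", "KEY_2"], backoff_s=0))
# -> answer via KEY_2
```

The same structure covers the OpenAI connection-error path, with `connection_retries` in place of `max_retries`.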