As part of Google DeepMind's Gemini Robotics release, we open-source a benchmark of multimodal embodied reasoning questions. This evaluation benchmark covers a variety of topics related to spatial reasoning and world knowledge in real-world scenarios, particularly in the context of robotics. Please find more details and visualizations in the tech report.
The questions are multimodal, interleaving images and text, and are phrased as multiple-choice questions; the ground-truth answer for each is a single letter (A, B, C, or D). We provide the ERQA benchmark in `data/erqa.tfrecord` as TF Examples saved with the following features:

- `question`: The text question to ask
- `image/encoded`: One or more encoded images
- `answer`: The ground truth answer
- `question_type`: The type of question (optional)
- `visual_indices`: Indices of visual elements (determines image placement)
Once the virtual environment is activated, install the required packages:
```
pip install -r requirements.txt
```

To see the structure of the ERQA dataset and how to load examples from the TFRecord file, run the simple dataset loader:
```
python loading_example.py
```

This script demonstrates how to:
- Load the TFRecord file
- Parse examples with their features (questions, images, answers, etc.)
- Access the data in each example
- Handle the visual indices that determine image placement
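As a rough illustration of what such a loader does, the sketch below builds a synthetic TF Example using the feature names listed above and parses it back. The dtypes and shapes here are assumptions for illustration; `loading_example.py` is the authoritative reference.

```python
import tensorflow as tf

# Sketch only: a synthetic TF Example with the same feature names as the
# ERQA TFRecord. Dtypes/shapes are assumptions; see loading_example.py
# for the authoritative loader.
example = tf.train.Example(features=tf.train.Features(feature={
    "question": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"Which point is on the mug handle?"])),
    "image/encoded": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"<jpeg bytes>"])),  # one or more images
    "answer": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"A"])),
    "visual_indices": tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
}))

FEATURES = {
    "question": tf.io.FixedLenFeature([], tf.string),
    "image/encoded": tf.io.VarLenFeature(tf.string),   # variable number of images
    "answer": tf.io.FixedLenFeature([], tf.string),
    "question_type": tf.io.VarLenFeature(tf.string),   # optional field
    "visual_indices": tf.io.VarLenFeature(tf.int64),   # where images interleave
}

parsed = tf.io.parse_single_example(example.SerializeToString(), FEATURES)
images = tf.sparse.to_dense(parsed["image/encoded"], default_value=b"")
print(parsed["question"].numpy().decode())
print("answer:", parsed["answer"].numpy().decode(), "| images:", int(images.shape[0]))
```

For the real dataset, the same feature spec would be applied to `tf.data.TFRecordDataset("data/erqa.tfrecord")` via `map`.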
We also provide a lightweight example evaluation harness for querying multimodal APIs (Gemini 2.0 and OpenAI) with examples loaded from the ERQA benchmark.
- Install the required dependencies:

  ```
  pip install -r requirements.txt
  ```
- Set up your API keys. You can register a new Gemini API key in Google AI Studio.
There are multiple ways to provide API keys to the evaluation harness:
Set environment variables for the APIs you want to use:
```
# For Gemini API
export GEMINI_API_KEY="your_gemini_api_key_here"

# For OpenAI API
export OPENAI_API_KEY="your_openai_api_key_here"
```

Provide API keys directly as command-line arguments:
```
# For a single Gemini API key
python eval_harness.py --gemini_api_key YOUR_GEMINI_API_KEY

# For a single OpenAI API key
python eval_harness.py --api openai --openai_api_key YOUR_OPENAI_API_KEY
```

Create a text file with your API keys and pass the path to the file. This is helpful when you have multiple keys for the same API and are running into per-key rate limits.
```
# Using a keys file
python eval_harness.py --api_keys_file path/to/your/keys.txt
```

The keys file should contain one key per line; all keys are assumed to be for the API specified with the `--api` argument:

```
YOUR_API_KEY_1
YOUR_API_KEY_2
```
Run the evaluation harness with default settings (Gemini API):
```
python eval_harness.py
```

For the Gemini API:
```
# Using the default Gemini Flash model
python eval_harness.py --model gemini-2.0-flash

# Using the experimental Gemini Pro model
python eval_harness.py --model gemini-2.0-pro-exp-02-05
```

For the OpenAI API:
```
# Using a pinned GPT-4o model
python eval_harness.py --api openai --model gpt-4o-2024-11-20
```

By default, the full ERQA benchmark consists of 400 examples; use `--num_examples` to evaluate a subset:

```
python eval_harness.py --num_examples 10
```

Example with custom arguments for Gemini:
```
python eval_harness.py --api gemini --model gemini-2.0-pro --gemini_api_key YOUR_API_KEY
```

Example with custom arguments for OpenAI:

```
python eval_harness.py --api openai --model gpt-4o-mini --openai_api_key YOUR_OPENAI_API_KEY
```

Example with multiple API keys provided via a keys file:
```
python eval_harness.py --api_keys_file ./gemini_keys.txt
```

Command-line arguments:

- `--tfrecord_path`: Path to the TFRecord file (default: `./data/erqa.tfrecord`)
- `--api`: API to use, `gemini` or `openai` (default: `gemini`)
- `--model`: Model name to use (defaults: `gemini-2.0-flash-exp` for Gemini, `gpt-4o` for OpenAI)
  - Available Gemini models include: `gemini-2.0-flash-exp`, `gemini-2.0-pro`, `gemini-2.0-pro-exp-02-05`
- `--gemini_api_key`: Gemini API key (can be specified multiple times for multiple keys)
- `--openai_api_key`: OpenAI API key (can be specified multiple times for multiple keys)
- `--api_keys_file`: Path to a file containing API keys (one per line, format: `gemini:KEY` or `openai:KEY`)
- `--num_examples`: Number of examples to process (default: 1)
- `--max_retries`: Maximum number of retries per API key on resource exhaustion (default: 2)
- `--max_tokens`: Maximum number of tokens in the response (OpenAI only, default: 300)
- `--connection_retries`: Maximum number of retries for connection errors (OpenAI only, default: 5)
The harness supports multiple API keys, with retry logic for resource exhaustion errors:

- You can provide multiple API keys by repeating the `--gemini_api_key` or `--openai_api_key` arguments, or via a file with `--api_keys_file`
- When a resource exhaustion error (429) is encountered, the harness will:
  - Retry the request up to `max_retries` times (default: 2) with a fixed 2-second backoff
  - If all retries for one API key fail, try the next API key
  - Exit only when all API keys have been exhausted
- For the OpenAI API, when connection errors are encountered, the harness will:
  - Retry the request up to `connection_retries` times (default: 5) with a fixed 2-second backoff
  - If all connection retries for one API key fail, try the next API key
  - Exit only when all API keys have been exhausted
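The key-rotation behavior described above can be sketched as follows. This is an illustrative stand-in, not the harness's actual implementation; `ResourceExhausted`, `query_with_retries`, and the callable signature are hypothetical names.

```python
import time

# Hypothetical stand-in for an HTTP 429 / resource-exhaustion error.
class ResourceExhausted(Exception):
    pass

def query_with_retries(call, api_keys, max_retries=2, backoff_s=2.0):
    """Try each API key in turn; retry with a fixed backoff per key."""
    for key in api_keys:                    # rotate through the keys
        for _ in range(max_retries):        # retries allowed per key
            try:
                return call(key)
            except ResourceExhausted:
                time.sleep(backoff_s)       # fixed backoff, then retry
    raise RuntimeError("All API keys exhausted")

# Usage: a fake call that fails for the first key and succeeds for the second.
def fake_call(key):
    if key == "KEY_1":
        raise ResourceExhausted()
    return f"answer via {key}"

print(query_with_retries(fake_call, ["KEY_1", "KEY_2"], backoff_s=0))
# -> answer via KEY_2
```

The same structure covers the OpenAI connection-error path, with `connection_retries` in place of `max_retries`.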