Tool-Selection Bias in LLMs

Quick Setup · Overall Structure · Citation

This repo contains the code to reproduce our research on tool-selection bias: the tendency of LLMs to prefer some APIs over functionally equivalent alternatives. It includes:

  • A bias benchmark (10 clusters × 5 APIs × 100 queries),
  • Experiments showing API/position bias across 7 model families,
  • Feature-level analysis & metadata perturbations,
  • A biased continued pre-training (CPT) study,
  • A lightweight subset-selection mitigation that reduces bias.

Built on top of ToolBench / ToolLLM. Please also see their license and citation.

Here is an overview of the phases in which we measure and analyze tool-selection bias. First, we embed and cluster the existing APIs in ToolLLM, then generate queries for each cluster such that every API within the cluster can satisfy them, yielding our bias-evaluation benchmark. We run inference on this benchmark with various models, compute the empirical selection distributions and our bespoke bias metric, and finally investigate why models exhibit particular biases via a range of experiments.


(Figure: overview of the benchmark-construction and bias-measurement pipeline described above.)

🚀 Quick Setup

Prerequisites

  • Python: 3.10+
  • GPU: CUDA-enabled GPU with working drivers if you want to use local models.

Clone the repo

git clone https://github.com/thierry123454/tool-selection-bias.git
cd tool-selection-bias

System essentials

sudo apt update && sudo apt upgrade -y
sudo apt install -y git build-essential cmake curl

Python environment

python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
# Optional extras you may use:
pip install torch torchvision anthropic

If you hit binary issues with bitsandbytes/triton, you can remove them:

pip uninstall -y bitsandbytes triton

If you see HF hub import/version errors, pin:

pip install "huggingface-hub==0.11.1"

Download ToolLLM data

Please download the ToolLLM dataset via the Google Drive link below, and make sure the extracted directory (e.g. data/) sits under ToolBench/ so that the bash scripts can navigate to the related data. Via gdown:

pip install gdown
gdown https://drive.google.com/uc?id=1vzUpO2TadV97upKwLn-TWHA-PR57Vs2H -O data.zip
unzip data.zip

Bias queries (provided)

mkdir -p data_bias/instruction
mv 3_generate_queries_for_clusters/toolbench_bias_queries.json \
   data_bias/instruction/

Environment variables

Set keys for whichever providers you’ll use (leave unset if you’re running locally without live APIs):

# RapidAPI (if calling the live ToolBench/rapid server yourself)
export RAPIDAPI_KEY="YOUR_RAPIDAPI_KEY"

# OpenAI (ChatGPT family)
export OPENAI_KEY="YOUR_OPENAI_KEY"

# Google (Gemini)
export GEMINI_KEY="YOUR_GEMINI_KEY"

# Same for any other LLM API

export PYTHONPATH="$(pwd)"
# (Optional) pick a single GPU
export CUDA_VISIBLE_DEVICES=0

▶️ Run

ToolLLaMA (local model, recommended baseline)

python toolbench/inference/qa_pipeline.py \
  --tool_root_dir data/toolenv/tools/ \
  --backbone_model toolllama \
  --model_path ToolBench/ToolLLaMA-2-7b-v2 \
  --max_observation_length 1024 \
  --method CoT@1 \
  --input_query_file data_bias/instruction/toolbench_bias_queries.json \
  --output_answer_file data_bias/answer_toolllama \
  --rapidapi_key $RAPIDAPI_KEY \
  --use_rapidapi_key \
  --test_bias

ChatGPT (OpenAI)

python toolbench/inference/qa_pipeline.py \
  --tool_root_dir data/toolenv/tools/ \
  --backbone_model chatgpt \
  --openai_key $OPENAI_KEY \
  --max_observation_length 1024 \
  --method CoT@1 \
  --input_query_file data_bias/instruction/toolbench_bias_queries.json \
  --output_answer_file data_bias/answer_chatgpt_no_func_base_prompt \
  --rapidapi_key $RAPIDAPI_KEY \
  --use_rapidapi_key \
  --test_bias

Gemini

python toolbench/inference/qa_pipeline.py \
  --tool_root_dir data/toolenv/tools/ \
  --backbone_model gemini \
  --openai_key $GEMINI_KEY \
  --max_observation_length 1024 \
  --method CoT@1 \
  --input_query_file data_bias/instruction/toolbench_bias_queries.json \
  --output_answer_file data_bias/answer_gemini \
  --rapidapi_key $RAPIDAPI_KEY \
  --use_rapidapi_key \
  --test_bias

General

python toolbench/inference/qa_pipeline.py \
  --tool_root_dir data/toolenv/tools/ \
  --backbone_model {qwen-235b}/{deepseek}/{claude} \
  --openai_key ${LLM_KEY} \
  --max_observation_length 1024 \
  --method CoT@1 \
  --input_query_file data_bias/instruction/toolbench_bias_queries.json \
  --output_answer_file data_bias/{answer_dir} \
  --rapidapi_key $RAPIDAPI_KEY \
  --use_rapidapi_key \
  --test_bias

Note:

  • --test_bias only extracts the first endpoint call and stops execution afterwards

📦 Overall Structure

Dataset Generation

Below are the dataset stats used to test tool-selection bias:

| Clusters | APIs / Cluster | Queries / Cluster |
| -------- | -------------- | ----------------- |
| 10       | 5              | 100               |

The pipeline to generate it has three stages:

  1. Collect & embed endpoint metadata
  2. Form functionally-equivalent clusters & refine
  3. Generate queries & export into ToolBench-format with controlled API ordering

Auth note: Some scripts use the legacy OpenAI SDK (OPENAI_API_KEY), others the new client (OPENAI_KEY). To be safe, set both:

export OPENAI_API_KEY="sk-..."   # legacy SDK
export OPENAI_KEY="sk-..."       # new SDK

Endpoint metadata & embeddings (folder: 1_endpoint_metadata_and_embed/)

extract_api_metadata.py

Parses data/toolenv/tools/**/*.json and writes a compact map of tools → (tool_desc, [api_name, api_desc]).

  • In: data/toolenv/tools/ (from ToolLLM)
  • Out: api_metadata.json

Run:
cd 1_endpoint_metadata_and_embed
python extract_api_metadata.py
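As an illustration of what this step does, here is a minimal sketch of parsing tool JSONs into the tools → (tool_desc, [api_name, api_desc]) map. The exact ToolBench schema (field names like `tool_name`, `tool_description`, `api_list`) is assumed here; the real logic lives in extract_api_metadata.py.

```python
# Illustrative sketch of the metadata-extraction step; field names for the
# ToolBench tool JSONs ("tool_name", "tool_description", "api_list") are
# assumptions, not a copy of extract_api_metadata.py.
import json
from pathlib import Path

def extract_api_metadata(tool_root: str) -> dict:
    """Map tool name -> (tool_desc, [(api_name, api_desc), ...])."""
    metadata = {}
    for path in Path(tool_root).glob("**/*.json"):
        spec = json.loads(path.read_text())
        apis = [(a.get("name", ""), a.get("description", ""))
                for a in spec.get("api_list", [])]
        metadata[spec.get("tool_name", path.stem)] = (
            spec.get("tool_description", ""), apis)
    return metadata
```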

create_embeddings_openai.py

Builds texts like "Tool: <tool_desc> | <api_name>: <api_desc>" and embeds them with text-embedding-ada-002.

  • In: api_metadata.json
  • Out: embeddings_combined_openai.npy

Run:
python create_embeddings_openai.py
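The embedding inputs follow the "Tool: <tool_desc> | <api_name>: <api_desc>" format described above; a sketch of assembling those strings (the actual call to text-embedding-ada-002 through the OpenAI client is omitted, and the helper name is illustrative):

```python
# Sketch of assembling the embedding inputs; the OpenAI API call itself
# (text-embedding-ada-002) is not shown.
def build_embedding_texts(metadata: dict) -> list[str]:
    """One 'Tool: <tool_desc> | <api_name>: <api_desc>' string per endpoint."""
    texts = []
    for tool_desc, apis in metadata.values():
        for api_name, api_desc in apis:
            texts.append(f"Tool: {tool_desc} | {api_name}: {api_desc}")
    return texts
```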

Cluster generation & refinement (folder: 2_generate_clusters_and_refine/)

generate_duplicate_clusters.py

Given a seed list of “general” APIs, finds top-K nearest neighbors in embedding space, then iteratively removes outliers via an LLM check (ensures all endpoints can satisfy the same task). Also pulls required params from ToolLLM queries for context.

  • In:
    • ../1_endpoint_metadata_and_embed/api_metadata.json
    • ../1_endpoint_metadata_and_embed/embeddings_combined_openai.npy
    • ../data/instruction/G1_query.json (to attach required params)
  • Out: duplicate_api_clusters_2.json (rename to your preferred canonical filename)

Run:
cd ../2_generate_clusters_and_refine
python generate_duplicate_clusters.py
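The neighbor search behind cluster seeding can be sketched as a cosine-similarity lookup over the precomputed embedding matrix; the subsequent LLM-based outlier filtering is not shown, and the function name is illustrative rather than taken from the script.

```python
# Minimal sketch of top-K nearest-neighbor retrieval in embedding space.
# The LLM check that removes outliers from a candidate cluster is omitted.
import numpy as np

def top_k_neighbors(embeddings: np.ndarray, seed_idx: int, k: int) -> list[int]:
    """Indices of the k rows most cosine-similar to embeddings[seed_idx]."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[seed_idx]
    sims[seed_idx] = -np.inf          # exclude the seed itself
    return list(np.argsort(-sims)[:k])
```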

refine_retrieval_clusters.py (optional, interactive)

Suggests nearest neighbors for undersized clusters; you can add items by number.

  • In: duplicate_api_clusters.json (set this path in the script)
  • Out: updated duplicate_api_clusters.json

Run:
python refine_retrieval_clusters.py

Query generation & export (folder: 3_generate_queries_for_clusters/)

You can generate free-text queries directly, or via template filling (good when open-ended generation drifts toward specific providers).

generate_queries_for_cluster.py

LLM produces 100 realistic queries per cluster that every endpoint can satisfy.

  • In: ../2_generate_clusters_and_refine/duplicate_api_clusters.json
  • Out: cluster_queries_3.json (one entry per cluster with 100 queries)

Run:
cd ../3_generate_queries_for_clusters
python generate_queries_for_cluster.py

template_generation_for_cluster.py (optional)

Fills hand-written templates repeatedly to avoid provider-specific drift.

  • In:
    • ../2_generate_clusters_and_refine/duplicate_api_clusters.json
    • templates.json (by cluster id)
  • Out: filled_queries_by_template.json

Run:
python template_generation_for_cluster.py

create_toolbench_format_dataset.py

Converts clusters + queries into ToolBench format, and controls API ordering per query. With SHUFFLE="cycle", each query is emitted once per API with a cyclic rotation. This is crucial to compensate for positional bias. You can limit clusters via HOLDOUT (e.g., [6,8,9,10]).

  • In:
    • ../2_generate_clusters_and_refine/duplicate_api_clusters.json
    • cluster_queries.json
    • ToolLLM originals: ../data/instruction/G{1,2,3}_query.json (to fetch canonical API defs)
    • Fallbacks from data/toolenv/tools if needed
  • Out: toolbench_bias_queries_cycle.json
    • With 10 clusters × 5 APIs × 100 queries and SHUFFLE="cycle", expect 5000 entries.
    • If HOLDOUT is set, only those clusters are emitted.

Run:
python create_toolbench_format_dataset.py

Tip: If you want exactly 100 prompts per cluster without rotations, set SHUFFLE="none" inside the script.
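The SHUFFLE="cycle" ordering described above amounts to emitting each query once per API, with the candidate list rotated so every API occupies every position. A sketch (the helper name is illustrative):

```python
# Sketch of the SHUFFLE="cycle" ordering: each rotation puts a different API
# in the first position, compensating for positional bias across emissions.
def cycle_orders(apis: list[str]) -> list[list[str]]:
    return [apis[i:] + apis[:i] for i in range(len(apis))]
```

With 5 APIs per cluster this yields 5 orderings per query, hence 10 clusters × 5 rotations × 100 queries = 5000 entries.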

Analyzing Data

The scripts to analyze the data output by the models are given in 4_gather_and_visualize_data. This folder contains the post-processing pipeline for turning raw inference outputs into bias metrics and figures. Run extract_selected_api.py to extract the endpoints that were called and the positions of those endpoints in the API list.
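To illustrate the post-processing, here is a sketch that turns extracted API choices into an empirical selection distribution and scores its deviation from uniform. The paper's bespoke bias metric is defined in 4_gather_and_visualize_data; total variation distance from uniform is used below only as a stand-in.

```python
# Illustrative analysis step: empirical selection shares per API, plus total
# variation distance from the uniform distribution as a stand-in bias score
# (the repo's actual metric is bespoke and lives in 4_gather_and_visualize_data).
from collections import Counter

def selection_distribution(choices: list[str], apis: list[str]) -> dict:
    counts = Counter(choices)
    return {api: counts.get(api, 0) / len(choices) for api in apis}

def tv_from_uniform(dist: dict) -> float:
    u = 1 / len(dist)
    return 0.5 * sum(abs(p - u) for p in dist.values())
```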

Bias Investigation

The folder 5_bias_investigation holds the code that explains and probes tool-selection bias. It builds per-endpoint feature tables (e.g., query–metadata similarity, lengths, params, readability, age), runs the feature-level analysis (correlations and per-model linear regressions), and saves compact plots. It also contains code to generate metadata perturbations (scrambling/swapping of tool names). Lastly, it contains the biased continued pre-training (CPT) pipeline, where one can train an LLM with a corpus saturated in one endpoint’s metadata and evaluate how exposure changes selection shares.
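One of the perturbations mentioned above, scrambling a tool name, can be sketched as a seeded character shuffle so runs stay reproducible. The real perturbation scripts live in 5_bias_investigation; this helper is illustrative only.

```python
# Illustrative metadata perturbation: scramble a tool name deterministically.
# The seed keeps the perturbed benchmark reproducible across runs.
import random

def scramble_name(name: str, seed: int = 0) -> str:
    chars = list(name)
    random.Random(seed).shuffle(chars)
    return "".join(chars)
```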

Bias Mitigation

The folder 6_bias_mitigation contains the code to evaluate our subset-selection debiasing pipeline. It has code to synthesize a benchmark where each query is paired with 8 candidate APIs, of which a subset is truly sufficient; ground truth is saved alongside the dataset. We can run the selector to predict subsets, then use the evaluator to compute micro-precision, micro-recall, and exact-set match overall. The resulting filter is intended to precede uniform sampling among retained tools, flattening selection distributions without harming coverage.
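The evaluation metrics named above can be sketched as follows; the function name and input shapes (one predicted and one ground-truth API subset per query) are assumptions, not the evaluator's actual interface.

```python
# Sketch of the subset-selection evaluation: micro-precision, micro-recall,
# and exact-set match between predicted and ground-truth API subsets.
def subset_metrics(preds: list[set], golds: list[set]) -> tuple[float, float, float]:
    tp = sum(len(p & g) for p, g in zip(preds, golds))   # correctly kept APIs
    pred_total = sum(len(p) for p in preds)
    gold_total = sum(len(g) for g in golds)
    exact = sum(p == g for p, g in zip(preds, golds)) / len(golds)
    return tp / pred_total, tp / gold_total, exact
```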

📚 Citation

You can cite our work with the following BibTeX entry:

@article{blankenstein2025biasbusters,
      title={BiasBusters: Uncovering and Mitigating Tool Selection Bias in Large Language Models}, 
      author={Thierry Blankenstein and Jialin Yu and Zixuan Li and Vassilis Plachouras and Sunando Sengupta and Philip Torr and Yarin Gal and Alasdair Paren and Adel Bibi},
      year={2026},
      journal={ICLR},
}

About

Reproducible benchmark & toolkit for measuring, explaining, and mitigating LLM tool-selection bias.
