VibeBench/VibeSearchBench

VibeSearchBench


Hardest: vague multi-turn proactive search in the wild.
Verifiable: schema-free knowledge graph evaluation.
Long-horizon: persona-driven progressive disclosure.


Leaderboard

Browse the full leaderboard and multi-turn task trajectories at vibebench.github.io/VibeSearchBench.github.io.

Evaluation:

  • Primary metric: Triplet F1. Predicted knowledge graphs are matched against ground truth via LLM-as-judge node alignment and triplet semantic equivalence.
  • Multi-turn interaction. Each task uses a persona-driven user simulator with progressive disclosure; agents may search, visit pages, and run code across many turns.
  • Best reported score: 30.3 triplet F1 (Claude Opus 4.6, OpenClaw).

Tasks

200 tasks across 2 subsets and 20 domains. Each task pairs a vague initial query with a ground-truth knowledge graph.

| Split | Count | Description |
| --- | --- | --- |
| pro | 100 | Professional research: literature reviews, market analysis, technical due diligence |
| daily | 100 | Daily-life search: shopping, travel, lifestyle with evolving preferences |

Real users rarely specify full intent upfront. VibeSearch captures bidirectional convergence: agents interleave partial results with follow-up questions while users progressively disclose needs.
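To make progressive disclosure concrete, here is a toy sketch in Python. It is not the benchmark's simulator, which is persona-driven and LLM-based; the class `ToyUserSimulator`, its fields, and the fixed reveal order are all hypothetical, chosen only to show a user who starts with a vague query and reveals one hidden need per follow-up question.

```python
from dataclasses import dataclass, field


@dataclass
class ToyUserSimulator:
    """Hypothetical stand-in for a user who discloses needs one turn at a time."""

    initial_query: str
    hidden_needs: list[str]
    _revealed: int = field(default=0, init=False)

    def first_message(self) -> str:
        # The opening turn only carries the vague query.
        return self.initial_query

    def reply(self, agent_question: str) -> str:
        # Each agent follow-up unlocks the next undisclosed need, in order.
        # A real simulator would condition on the question; this toy ignores it.
        if self._revealed < len(self.hidden_needs):
            need = self.hidden_needs[self._revealed]
            self._revealed += 1
            return need
        return "That covers everything I need."

    @property
    def fully_disclosed(self) -> bool:
        return self._revealed == len(self.hidden_needs)
```

An agent loop would call `first_message()` once, then alternate between presenting partial results and calling `reply(...)` until `fully_disclosed` is true.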

Dataset

Available on Hugging Face: VibeSearchBench/VibeSearchBench

| Field | Description |
| --- | --- |
| qid | Unique task identifier |
| question | Full research query with constraints |
| user_persona | Persona for the progressive-disclosure simulator |
| nodes / triples | Ground-truth knowledge graph |
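A quick sanity check on a task record might look like the sketch below. Only the five field names come from the table above; the example record and the inner shape of `nodes` and `triples` are hypothetical placeholders, since the exact schema is not described here.

```python
import json

# Field names listed in the dataset table.
REQUIRED_FIELDS = ("qid", "question", "user_persona", "nodes", "triples")


def missing_fields(task: dict) -> list[str]:
    """Return the required field names that are absent from a task record."""
    return [name for name in REQUIRED_FIELDS if name not in task]


# Hypothetical record: the values and the node/triple layout are assumptions.
example_task = json.loads("""
{
  "qid": "example_000",
  "question": "placeholder query",
  "user_persona": "placeholder persona",
  "nodes": ["entity A", "entity B"],
  "triples": [["entity A", "related_to", "entity B"]]
}
""")

print(missing_fields(example_task))
```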

Quick Start

GeneralAgent (LLM-based)

Uses an OpenAI-compatible LLM to drive multi-step web research.

# Full pipeline (inference + evaluation)
MODEL_NAME=glm-5.1 VLLM_URL=http://host/v1 bash scripts/run_all.sh

# Inference only
MODEL_NAME=kimi-k2.5 VLLM_URL=http://host/v1 bash scripts/run_inference.sh

# With model config profile
MODEL_CONFIG=model_config.yaml MODEL_PROFILE=seed2_0_pro bash scripts/run_all.sh

OpenClaw Agent (CLI-based)

Wraps the OpenClaw CLI into the benchmark. Requires a running OpenClaw gateway.

# Default (simulated mode)
bash scripts/run_openclaw.sh

# Direct mode (no user simulation)
MODE=direct bash scripts/run_openclaw.sh

# Custom data and model
DATA_PATH=tasks/my_tasks MODE=simulated OPENCLAW_MODEL=my-model bash scripts/run_openclaw.sh

Key OpenClaw env vars: GATEWAY_PORT (default 18789), SOURCE_DIR, IDLE_THRESHOLD, MAX_NUDGE, OPENCLAW_MODEL.

Evaluation Only

TRAJS_DIR=results/trajs/glm-5.1_custom_serper bash scripts/run_eval.sh

Direct Python Usage

# GeneralAgent: full pipeline
python run.py \
  --agent-type general \
  --model glm-5.1 \
  --vllm-server-url http://host/v1 \
  --tool-set custom \
  --num-samples 4 \
  --grader-type gemini \
  --grader-api-url https://... \
  --grader-api-key YOUR_KEY

# GeneralAgent: inference only
python run.py \
  --agent-type general \
  --model glm-5.1 \
  --vllm-server-url http://host/v1 \
  --skip-eval

# OpenClaw agent
python run.py \
  --agent-type openclaw \
  --gateway-port 18789 \
  --mode simulated \
  --user-model doubao-seed-2-0-pro \
  --user-model-url http://host/v1 \
  --user-model-api-key YOUR_KEY \
  --num-samples 4

# Eval only
python run.py \
  --eval-only \
  --trajs-dir results/trajs/glm-5.1_custom_serper \
  --grader-type gemini \
  --grader-api-url https://...

Project Structure

VibeSearchBench/
├── agent/                          # Agent implementations
│   ├── general_agent.py            # GeneralAgent (OpenAI-compatible, single/multi-agent)
│   ├── openclaw_agent.py           # OpenClaw agent wrapper
│   ├── llm.py                      # LLM client utilities
│   ├── prompts.py                  # Prompt templates
│   └── toolkit.py                  # ToolKit (search / visit / python via Serper)
├── eval/                           # Evaluation module
│   ├── grader.py                   # GraderClient (OpenAI / Gemini backends)
│   └── evaluator.py                # KG evaluation: node F1, triplet F1
├── scripts/                        # Bash/Python scripts
│   ├── run_all.sh                  # Full pipeline (inference + evaluation)
│   ├── run_inference.sh            # Agent inference only
│   ├── run_eval.sh                 # Evaluation only
│   ├── run_openclaw.sh             # OpenClaw evaluation
│   └── build_website_data.py       # Export data for the project page
├── viberesearch_query_synthesis/   # Query synthesis module
├── website/                        # Static site template (deployed via github.io repo)
├── tasks/                          # Task JSON files (benchmark data)
├── results/                        # Output (auto-created)
├── model_config.yaml               # LLM model profiles
└── run.py                          # Main entry point

Configuration

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| MODEL_NAME | Model name for chat API | glm-5.1 |
| VLLM_URL | Base URL for chat API | (none) |
| TOOL_SET | custom or builtin | custom |
| API_KEY | API key for main model | (empty) |
| MULTI_AGENT | Set to 1 for multi-agent mode | 0 |
| SERPER_API_KEY | Serper API key for web search | (preset) |
| SUMMARIZE_URL | vLLM URL for page summarization | (preset) |
| SUMMARIZE_MODEL | Model for summarization | qwen3-30b-a3b-instruct |
| CODE_SANDBOX_URL | HTTP sandbox for Python tool | (preset) |
| GEMINI_API_KEY | API key for Gemini grader | (preset) |
| GEMINI_API_URL | API URL for Gemini grader | (preset) |
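If you drive the benchmark from your own Python wrapper, the core variables above could be read as in the sketch below. The variable names and defaults come from the table; `RunConfig` and `load_run_config` are hypothetical helpers, not part of the repository, and the scripts may parse these values differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunConfig:
    model_name: str
    vllm_url: Optional[str]
    tool_set: str
    api_key: str
    multi_agent: bool


def load_run_config(env: Optional[Mapping[str, str]] = None) -> RunConfig:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    tool_set = env.get("TOOL_SET", "custom")
    if tool_set not in {"custom", "builtin"}:
        raise ValueError(f"TOOL_SET must be 'custom' or 'builtin', got {tool_set!r}")
    return RunConfig(
        model_name=env.get("MODEL_NAME", "glm-5.1"),
        vllm_url=env.get("VLLM_URL") or None,  # no default is documented
        tool_set=tool_set,
        api_key=env.get("API_KEY", ""),
        multi_agent=env.get("MULTI_AGENT", "0") == "1",
    )
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test.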

Tool Sets

  • custom (default): search (Serper) + visit (Serper scrape + LLM summarize) + python (HTTP sandbox)
  • builtin: search + open + find (requires gpt_oss package)

Agent Modes

  • Single-agent: One agent handles the entire query
  • Multi-agent (MULTI_AGENT=1): Main agent can spawn sub-agents for parallel research

Output Format

Trajectories (results/trajs/{experiment}/)

One JSONL file per task ({task_id}.jsonl); each line is one sample:

{"qid": "task_042_...", "sample_idx": 0, "question": "...", "messages": [...], "response": "...", "termination": "answer", ...}
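A trajectory file in this layout can be inspected with a few lines of Python. The sketch below relies only on the keys shown in the example line (`qid`, `sample_idx`, `termination`); the helper names are hypothetical, and samples without a `termination` key are counted as "unknown" by assumption.

```python
import json
from collections import Counter
from pathlib import Path


def load_samples(jsonl_path) -> list[dict]:
    """Parse one trajectory file: one JSON object per non-empty line."""
    samples = []
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples


def termination_counts(samples: list[dict]) -> Counter:
    """Count how each sample ended, e.g. how many terminated with "answer"."""
    return Counter(s.get("termination", "unknown") for s in samples)
```

For example, `termination_counts(load_samples(path))` gives a quick view of how many of the N samples for a task ended with an answer.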

Evaluation (results/eval/{experiment}/)

  • {task_id}_sample{N}.json: Per-trajectory evaluation with node/triplet metrics
  • item_ratings.json: All per-item results
  • summary.json: Aggregated metrics (avg@N, best@N)

Dependencies

openai aiohttp httpx tqdm transformers json_repair

Evaluation Metrics

Two-phase LLM-as-judge evaluation:

  1. Node matching: LLM matches predicted entities to ground-truth entities (alias/translation-aware)
  2. Triplet matching: For matched entity pairs, LLM judges relation semantic equivalence

Metrics: Precision, Recall, F1 at both node and triplet levels, with avg@N and best@N aggregation across samples.
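Once the judge has decided which predicted items match ground-truth items, the scores reduce to standard arithmetic. The sketch below assumes one-to-one matching (each ground-truth item counted at most once) and interprets avg@N as the mean and best@N as the maximum over the N samples of a task; both are assumptions, and the function names are hypothetical rather than taken from eval/evaluator.py.

```python
from statistics import mean


def precision_recall_f1(num_matched: int, num_predicted: int, num_gold: int) -> tuple[float, float, float]:
    """Standard P/R/F1 from match counts; empty denominators score 0."""
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_gold if num_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def aggregate(sample_scores: list[float]) -> dict[str, float]:
    """Aggregate one task's per-sample scores (assumed: mean and max)."""
    return {"avg@N": mean(sample_scores), "best@N": max(sample_scores)}


# 3 matched triplets out of 4 predicted and 6 ground-truth triplets.
p, r, f1 = precision_recall_f1(3, 4, 6)
print(p, r, round(f1, 3))
```

The same function applies at the node level by passing node counts instead of triplet counts.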

License

This project is released under the MIT License.


VibeSearchBench · Rednote-Hilab & Unipat AI
