Official code for the ICML 2026 paper WEASEL: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection.
[Paper] [Project Page]
WEASEL selects compact, goal-relevant, and diverse web-agent trajectory steps to improve out-of-domain generalization while reducing training cost.
This repository currently contains the cleaned data-selection pipeline:
- Prune AXTree states.
- Compute goal-relevance and pairwise distance scores.
- Run the WEASEL greedy subset-selection objective.
- Build the final training subset, including length filtering and 10K subsampling.
We do not include the original training datasets in this repository. To download
AgentTrek, please refer to the official xlang-ai/AgentTrek
repository. In the commands below, replace path/to/train.json with the local
path to the downloaded training file.
If you want to skip the preprocessing steps and directly use our WEASEL-selected training dataset, it will be available here:
- WEASEL-selected AgentTrek training dataset: weasel_agenttrek_train_10k.json
We use target-centered AXTree pruning before score computation, with a threshold-based fallback when the action does not reference a valid bid.
python -m weasel.prune_axtree \
--input path/to/train.json \
--output path/to/train_pruned.json \
--window-size 60 \
--fallback-threshold 120Run score preprocessing on the downloaded training data:
python -m weasel.prepare_scores \
--input path/to/train_pruned.json \
--output path/to/goals_with_scores.json \
--augmented-dataset-output path/to/train_with_phi_scores.jsonRun greedy subset selection using the precomputed scores:
python -m weasel.select_greedy \
--input path/to/goals_with_scores.json \
--output path/to/full_selected_dataset_indices_T0_3.jsonBuild the final WEASEL training subset:
python -m weasel.postprocess_dataset \
--dataset path/to/train_pruned.json \
--selected-indices path/to/full_selected_dataset_indices_T0_3.json \
--output path/to/weasel_train_10k.json \
--max-user-chars 40000 \
--max-examples 10000 \
--seed 0For supervised fine-tuning, we used hiyouga/LLaMA-Factory. After building the WEASEL-selected training file, you can use it as the training dataset in a LLaMA-Factory SFT run.
If you want to directly use our trained model checkpoints, they are available in the WEASEL Hugging Face collection:
- Qwen2.5-7B-Instruct WEASEL checkpoint
- Gemma3-4B-IT WEASEL checkpoint
- Qwen3-8B WEASEL checkpoint
For WebArena evaluation, please refer to web-arena-x/webarena.
For MiniWob evaluation, please refer to the MiniWob documentation and Farama-Foundation/miniwob-plusplus.
For WorkArena evaluation, please refer to ServiceNow/WorkArena.
@inproceedings{pesaranzadeh2026weasel,
title = {{WEASEL}: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection},
author = {Pesaran Zadeh, Fatemeh and Choi, Seyeon and L\`u, Xing Han and Reddy, Siva and Kim, Gunhee},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
year = {2026}
}