Built-in toolsets for default tools #1139

24anisha · 2025-09-25T04:06:51Z

Currently, the default toolset has 46 tools. Many of them are hyper-specific and often unnecessary for the task at hand, and their presence in the toolset yields worse behavior for many models, such as Sonnet 3.7 and GPT 4.1. As such, we're running an experiment that leverages the existing logic for MCP server toolsets to create built-in toolsets from the default tools, exposing only a handful of tools to the user directly.

GPT 4.1 Pass Rate

| Toolset | Benchmark | Pass Rate (%) |
| --- | --- | --- |
| small | MFA | 19.4 |
| full | MFA | 13.4 |
| small | SWEBench C# | 8.1 |
| full | SWEBench C# | 7.9 |

GPT 5 Pass Rate

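| Model | Benchmark | Toolset | Run ID | Pass Rate (%) |
| --- | --- | --- | --- | --- |
| gpt5 | swec# | full | 18232124031 | 38.6 |
| gpt5 | swec# | small | 18232409805 | 42.6 |
| gpt5 | swelancer | full | 18232152700 | 41.8 |
| gpt5 | swelancer | small | 18232428869 | 43.9 |
| gpt5 | mfa | full | 18232085552 | 32.8 |
| gpt5 | mfa | small | 18232361377 | 28.3 |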
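
To make the GPT 5 numbers above easier to compare, here is a minimal sketch that computes the small-minus-full pass rate per benchmark, in percentage points. The pass rates are copied from the table; the `Run` type and the `smallMinusFull` helper are hypothetical names used only for this illustration.

```typescript
// Pass rates copied from the GPT 5 table in the PR description.
interface Run {
	benchmark: string;
	toolset: 'full' | 'small';
	passRate: number;
}

const gpt5Runs: Run[] = [
	{ benchmark: 'swec#', toolset: 'full', passRate: 38.6 },
	{ benchmark: 'swec#', toolset: 'small', passRate: 42.6 },
	{ benchmark: 'swelancer', toolset: 'full', passRate: 41.8 },
	{ benchmark: 'swelancer', toolset: 'small', passRate: 43.9 },
	{ benchmark: 'mfa', toolset: 'full', passRate: 32.8 },
	{ benchmark: 'mfa', toolset: 'small', passRate: 28.3 },
];

// Returns small-minus-full pass rate per benchmark, in percentage points,
// rounded to one decimal to hide floating-point noise.
function smallMinusFull(runs: Run[]): Map<string, number> {
	const byBenchmark = new Map<string, { full?: number; small?: number }>();
	for (const run of runs) {
		const entry = byBenchmark.get(run.benchmark) ?? {};
		entry[run.toolset] = run.passRate;
		byBenchmark.set(run.benchmark, entry);
	}
	const deltas = new Map<string, number>();
	byBenchmark.forEach((entry, benchmark) => {
		if (entry.full !== undefined && entry.small !== undefined) {
			deltas.set(benchmark, Math.round((entry.small - entry.full) * 10) / 10);
		}
	});
	return deltas;
}
```

With the table's values, the small toolset is ahead by 4.0 points on swec# and 2.1 points on swelancer, and behind by 4.5 points on mfa.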

Sonnet 3.7 Step Counts (cases where both toolsets succeeded)

| Case Name | Small Toolset Steps | Full Toolset Steps |
| --- | --- | --- |
| gitmoji-cli-1248 | 7 | 18 |
| isomorphic-git-1493 | 18 | 22 |
| Luxon-1173 | 13 | 28 |
| Prom-client-146 | 7 | 19 |

24anisha · 2025-10-03T03:07:55Z

24anisha · 2025-10-07T02:19:58Z

src/extension/tools/common/virtualTools/toolGrouping.ts

+		// Enable if we could potentially trigger built-in grouping (when GPT model is used)
+		const defaultToolGroupingEnabled = this._configurationService.getExperimentBasedConfig(ConfigKey.Internal.DefaultToolsGrouped, this._experimentationService);
+		const couldTriggerBuiltInGrouping = this._tools.length > Constant.START_BUILTIN_GROUPING_AFTER_TOOL_COUNT && defaultToolGroupingEnabled;
+


This doesn't check everything we need: we also need to confirm that the model is gpt-4.1 or gpt-5. However, this is supposed to be a lightweight, quick function to call.
Options:

1. Pass the endpoint into this function so we can confirm it's the correct model before going to virtual tool grouping.

2. Let it go to virtual tool grouping every time (since START_BUILTIN_GROUPING_AFTER_TOOL_COUNT is 20) and get stopped from grouping there by the endpoint check.
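
As a rough illustration of the first option, the sketch below passes an endpoint into the lightweight check so that other model families are ruled out early. `EndpointLike`, `isBuiltInGroupingModel`, and the free-standing `couldTriggerBuiltInGrouping` function are simplified, hypothetical stand-ins, not the extension's actual types or methods; the threshold value of 20 is the one stated in this comment.

```typescript
// Hypothetical, simplified stand-in for the real endpoint type.
interface EndpointLike {
	family: string;
}

// The comment above states that START_BUILTIN_GROUPING_AFTER_TOOL_COUNT is 20.
const START_BUILTIN_GROUPING_AFTER_TOOL_COUNT = 20;

// Mirrors the prefix check that appears in the PR's diff (gpt-4.1 / gpt-5).
function isBuiltInGroupingModel(endpoint: EndpointLike | undefined): boolean {
	const family = endpoint?.family ?? '';
	return family.startsWith('gpt-4.1') || family.startsWith('gpt-5');
}

// Option 1: the caller supplies the endpoint, so the cheap check stays
// synchronous and still rejects models that would never be grouped.
function couldTriggerBuiltInGrouping(
	toolCount: number,
	experimentEnabled: boolean,
	endpoint: EndpointLike | undefined,
): boolean {
	return experimentEnabled
		&& toolCount > START_BUILTIN_GROUPING_AFTER_TOOL_COUNT
		&& isBuiltInGroupingModel(endpoint);
}
```

Under the second option, the endpoint parameter would be dropped here and the family check would run only inside the grouper, at the cost of entering the grouping path for every model once the tool count exceeds the threshold.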

Copilot

Pull Request Overview

Introduce experimental grouping of built‑in (default) tools to reduce exposed tool count for certain GPT model families, aiming to improve model performance and tool selection quality.

- Adds experimental config flag and threshold to trigger grouping of default tools for GPT-4.1 / GPT-5 families.
- Extends tool grouping pipeline to optionally group built-in tools into predefined virtual tool categories.
- Propagates endpoint-awareness through grouping APIs to make grouping decisions model-dependent.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
| --- | --- |
| src/platform/configuration/common/configurationService.ts | Adds experimental config key controlling built-in tool grouping. |
| src/extension/tools/common/virtualTools/virtualToolsConstants.ts | Defines new threshold for triggering built-in grouping. |
| src/extension/tools/common/virtualTools/virtualToolTypes.ts | Extends interfaces to pass optional endpoint for model-aware grouping. |
| src/extension/tools/common/virtualTools/virtualToolGrouper.ts | Core logic for conditional default tool grouping and expansion adjustments. |
| src/extension/tools/common/virtualTools/toolGrouping.ts | Updates enablement logic and propagates endpoint parameter. |
| src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts | Implements predefined grouping of built-in tools into virtual groups. |
| src/extension/prompt/node/defaultIntentRequestHandler.ts | Passes endpoint into grouping computations. |
| src/extension/extension/vscode-node/services.ts | Import ordering adjustment (non-functional). |

src/extension/tools/common/virtualTools/virtualToolGrouper.ts

Copilot · 2025-10-07T14:59:58Z

src/extension/tools/common/virtualTools/toolGrouping.ts

 	public get isEnabled() {
-		return this._tools.length >= computeToolGroupingMinThreshold(this._experimentationService, this._configurationService).get();
+		// Match the logic from VirtualToolGrouper.addGroups()
+		// Enable if we could potentially trigger built-in grouping (when GPT model is used)
+		const defaultToolGroupingEnabled = this._configurationService.getExperimentBasedConfig(ConfigKey.Internal.DefaultToolsGrouped, this._experimentationService);
+		const couldTriggerBuiltInGrouping = this._tools.length > Constant.START_BUILTIN_GROUPING_AFTER_TOOL_COUNT && defaultToolGroupingEnabled;


Comment claims to match VirtualToolGrouper.addGroups logic, but this condition does not check the model family (GPT-only restriction there). This may enable grouping (and related UI/state expectations) for non-GPT models once tool count > threshold, diverging from the actual grouping behavior. Pass endpoint or defer enablement until model family is known, or update the comment and naming to reflect the broader condition.

See below for a potential fix:

	/**
	 * Returns whether tool grouping is enabled.
	 * - Built-in grouping is only enabled for GPT model endpoints, matching VirtualToolGrouper.addGroups logic.
	 * - Standard grouping is enabled for all tool types above the threshold.
	 */
	public isEnabled(endpoint?: IChatEndpoint): boolean {
		const defaultToolGroupingEnabled = this._configurationService.getExperimentBasedConfig(ConfigKey.Internal.DefaultToolsGrouped, this._experimentationService);
		const isGptModel = endpoint?.modelFamily === 'gpt';
		const couldTriggerBuiltInGrouping = isGptModel && this._tools.length > Constant.START_BUILTIN_GROUPING_AFTER_TOOL_COUNT && defaultToolGroupingEnabled;

Copilot · 2025-10-07T14:59:58Z

src/extension/tools/common/virtualTools/virtualToolGrouper.ts

+		const currentEndpoint = endpoint ?? (await this._endpointProvider.getAllChatEndpoints()).find(e => e.isDefault) ?? await this._endpointProvider.getChatEndpoint('gpt-4.1');
+		const modelFamily = currentEndpoint?.family;
+		const isGpt = modelFamily?.startsWith('gpt-4.1') || modelFamily?.startsWith('gpt-5');


Endpoint/modelFamily retrieval and the hard-coded 'gpt-4.1' fallback are duplicated. Extract a helper (e.g. getCurrentChatEndpoint()) and centralize the fallback model constant to avoid drift and ease future model changes.

Copilot · 2025-10-07T14:59:58Z

src/extension/tools/common/virtualTools/virtualToolGrouper.ts

+				const currentEndpoint = endpoint ?? (await this._endpointProvider.getAllChatEndpoints()).find(e => e.isDefault) ?? await this._endpointProvider.getChatEndpoint('gpt-4.1');
+				const modelFamily = currentEndpoint?.family;
+


Endpoint/modelFamily retrieval and the hard-coded 'gpt-4.1' fallback are duplicated. Extract a helper (e.g. getCurrentChatEndpoint()) and centralize the fallback model constant to avoid drift and ease future model changes.
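
One possible shape for the helper suggested in these two comments is sketched below. The interfaces are minimal, hypothetical stand-ins for the real endpoint and endpoint-provider types, and `FALLBACK_MODEL_FAMILY` is an assumed name for the centralized constant; only the resolution order (explicit endpoint, then the default endpoint, then the 'gpt-4.1' fallback) is taken from the diff.

```typescript
// Hypothetical minimal shapes; the real endpoint and provider types have more members.
interface ChatEndpointLike {
	family: string;
	isDefault?: boolean;
}

interface EndpointProviderLike {
	getAllChatEndpoints(): Promise<ChatEndpointLike[]>;
	getChatEndpoint(family: string): Promise<ChatEndpointLike>;
}

// Single place to update when the fallback model changes.
const FALLBACK_MODEL_FAMILY = 'gpt-4.1';

// Resolution order taken from the diff: explicit endpoint, then the default
// endpoint, then the fallback family.
async function getCurrentChatEndpoint(
	provider: EndpointProviderLike,
	endpoint?: ChatEndpointLike,
): Promise<ChatEndpointLike> {
	if (endpoint) {
		return endpoint;
	}
	const all = await provider.getAllChatEndpoints();
	return all.find(e => e.isDefault) ?? await provider.getChatEndpoint(FALLBACK_MODEL_FAMILY);
}
```

Both call sites could then reduce to `const currentEndpoint = await getCurrentChatEndpoint(this._endpointProvider, endpoint);`, assuming the real provider satisfies the interface sketched here.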

Co-authored-by: Copilot <[email protected]>

src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts

connor4312 · 2025-10-08T17:15:09Z

src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts

+			'search_workspace_symbols',
+			'list_code_usages',


These two are general search tools or might even be top-level tools

In restricted telemetry, I see that they get called extremely rarely (800 and 3000 times in the last 3 days, respectively, compared to ~3 million for something like grep_search), so I don't think they make sense at the top level. Maybe I could put them in the "redundant but specific" toolset instead of the VS Code toolset?

src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts

connor4312 · 2025-10-08T17:18:48Z

src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts

+			'edit_files',
+			'multi_replace_string_in_file'


Editing tools should always be top level

Is edit_files an editing tool? The description in the package.json is "This is a placeholder tool, do not use", so I figured I shouldn't add it to the core toolset.

connor4312 · 2025-10-08T17:25:03Z

src/extension/tools/common/virtualTools/virtualToolGrouper.ts

 		// If there's no need to group tools, just add them all directly;
-		if (tools.length < Constant.START_GROUPING_AFTER_TOOL_COUNT) {
+
+		// if the model is gpt 4.1 or gpt-5 and there are more than START_BUILTIN_GROUPING_AFTER_TOOL_COUNT tools, we should group built-in tools


Does Sonnet not see the same benefits?

Sonnet 3.7 does, but Sonnet 4 seemed robust to the larger toolset. Also, I think we decided to avoid editing the tools for the Sonnet models because of prompt caching (I believe the tools are added at the top of a request for Sonnet models, so regularly changing the toolset would cause a lot of cache misses).

The cache behavior is the case with all our current models.

We should have these checks in the central chatModelCapabilities.ts file
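
A minimal sketch of what such a centralized check might look like follows. The contents of chatModelCapabilities.ts are not shown in this thread, so the constant and function names here are hypothetical; the two family prefixes are the ones used in the diff.

```typescript
// Hypothetical contents for a central capabilities module; names are assumptions.
const BUILTIN_GROUPING_FAMILY_PREFIXES: readonly string[] = ['gpt-4.1', 'gpt-5'];

// Returns true when the model family is one for which built-in tools are grouped.
function modelShouldGroupBuiltInTools(family: string | undefined): boolean {
	if (!family) {
		return false;
	}
	return BUILTIN_GROUPING_FAMILY_PREFIXES.some(prefix => family.startsWith(prefix));
}
```

Keeping the prefix list in one place means that adding or retiring a model family is a one-line change instead of an edit to every call site.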

src/extension/tools/common/virtualTools/virtualToolTypes.ts

src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts

src/extension/tools/common/virtualTools/toolGrouping.ts

src/extension/tools/common/virtualTools/virtualToolGrouper.ts

connor4312 · 2025-10-08T17:33:22Z

I'm also curious--what trajectory difference you saw in the small vs large toolset in your runs? Specifically for GPT-5/Sonnet. 4.1 is going away soon, we don't care too much about that any longer.

24anisha · 2025-10-08T18:25:03Z

> I'm also curious--what trajectory difference you saw in the small vs large toolset in your runs? Specifically for GPT-5/Sonnet. 4.1 is going away soon, we don't care too much about that any longer.

If you look at the second table in the PR description, you can see the deltas for GPT 5. I'm doing an in-depth qualitative dive into the trajectories today, so I should have more behavioral discrepancies to report shortly.

24anisha · 2025-10-10T15:34:55Z

> I'm also curious--what trajectory difference you saw in the small vs large toolset in your runs? Specifically for GPT-5/Sonnet. 4.1 is going away soon, we don't care too much about that any longer.

BTW, I completed a qualitative analysis of the GPT-5 small vs. large toolset trajectories. In MFA (where full toolset scored higher than small by 2 cases), the only tool that was regularly called in the large toolset runs that the small toolset did not have was create_and_run_task. But it didn't seem to provide that much value to the model (cases where create_and_run_task was called were not more likely to succeed). Besides that, the small toolset regularly performed at or above the level of the full toolset.

src/extension/tools/common/toolNames.ts

src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts

connor4312 · 2025-10-15T18:03:42Z

src/extension/tools/common/virtualTools/virtualToolTypes.ts

+	compute(query: string, token: CancellationToken, endpoint?: IChatEndpoint): Promise<LanguageModelToolInformation[]>;

 	/**
 	 * Returns the complete tree of tools, used for diagnostic purposes.
 	 */
-	computeAll(query: string, token: CancellationToken): Promise<(LanguageModelToolInformation | VirtualTool)[]>;
+	computeAll(query: string, token: CancellationToken, endpoint?: IChatEndpoint): Promise<(LanguageModelToolInformation | VirtualTool)[]>;


We don't use the endpoint anymore, should it be removed from these interfaces?

connor4312 · 2025-10-15T18:04:27Z

src/extension/tools/common/virtualTools/virtualToolTypes.ts

 	 * the appropriate virtual tool or top-level tool in the `root`.
 	 */
-	addGroups(query: string, root: VirtualTool, tools: LanguageModelToolInformation[], token: CancellationToken): Promise<void>;
+	addGroups(query: string, root: VirtualTool, tools: LanguageModelToolInformation[], token: CancellationToken, endpoint?: IChatEndpoint): Promise<void>;


connor4312 · 2025-10-15T18:04:59Z

src/extension/tools/common/virtualTools/virtualTool.ts

+	toolsetKey?: string;
+	possiblePrefix?: string;


these two aren't used any more, should be safe to delete them

src/extension/tools/common/virtualTools/toolGrouping.ts

connor4312 · 2025-10-15T18:08:50Z

src/extension/tools/common/virtualTools/virtualToolGrouper.ts

+			const builtinSlotCount = shouldGroupBuiltin
+				? groupedResults.filter(item => item instanceof VirtualTool).length
+				: builtinTools.length;


I think here we should just take the groupedResults.length (which used to be equivalent to builtinTools.length). We want to distribute the number of 'slots' that other tools get based on how many are left over
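
The reasoning in this comment can be sketched as follows: every entry in the grouped results takes one slot in the tool list, whether it is a virtual group or a plain top-level tool, so the slots left for other tools are the limit minus the full length. `GroupedItem`, the tool names, and both helper functions are hypothetical; the limit of 128 is borrowed from a commit message in this PR ("updated the hard tool limit back to 128") and is used here only as an illustrative budget.

```typescript
// Hypothetical stand-ins: a grouped result is either a virtual group or a plain top-level tool.
type GroupedItem =
	| { kind: 'virtualTool'; name: string }
	| { kind: 'tool'; name: string };

// Illustrative budget; one commit in this PR mentions a hard tool limit of 128.
const HARD_TOOL_LIMIT = 128;

// Every grouped result occupies one slot, whether it is a group or a single tool.
function builtinSlotCount(groupedResults: GroupedItem[]): number {
	return groupedResults.length;
}

// Slots left over for the remaining tools once the built-in results are placed.
function slotsLeftForOtherTools(groupedResults: GroupedItem[], limit: number = HARD_TOOL_LIMIT): number {
	return Math.max(0, limit - builtinSlotCount(groupedResults));
}

// Two ungrouped core tools plus one virtual group: three slots are used.
const example: GroupedItem[] = [
	{ kind: 'tool', name: 'core_tool_a' },
	{ kind: 'tool', name: 'core_tool_b' },
	{ kind: 'virtualTool', name: 'group_x' },
];
```

Counting only the `virtualTool` entries in `example` would report one used slot instead of three, which would overstate the space left for other tools by two.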

Co-authored-by: Connor Peet <[email protected]>

connor4312 · 2025-10-16T18:56:51Z

src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts

+		case ToolCategory.Core:
+			return 'Core tools that should always be available without grouping.';
+		default:
+			return 'Tools in this category.';


You can use assertNever(category) to have a compile-time assert that all categories are described.
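
For readers unfamiliar with the pattern, here is a small self-contained sketch of an exhaustive switch. The local `assertNever` is a stand-in for whichever shared helper the codebase provides, and the second enum member and its description are hypothetical; only `ToolCategory.Core` and its description string come from the diff.

```typescript
// Hypothetical subset of categories; only Core and its description appear in the diff.
enum ToolCategory {
	Core = 'Core',
	Example = 'Example',
}

// Local stand-in for a shared helper. The `never` parameter makes the call a
// compile-time error if any enum member is left unhandled in the switch.
function assertNever(value: never): never {
	throw new Error(`Unexpected value: ${String(value)}`);
}

function getCategoryDescription(category: ToolCategory): string {
	switch (category) {
		case ToolCategory.Core:
			return 'Core tools that should always be available without grouping.';
		case ToolCategory.Example:
			return 'Hypothetical description for a second category.';
		default:
			// If a new ToolCategory member is added without a case above,
			// `category` is no longer `never` here and compilation fails.
			return assertNever(category);
	}
}
```

Compared with a generic `default` description, this turns a silently missing description into a build failure, and it still throws at runtime if an unexpected value slips through.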

Anisha Agarwal and others added 10 commits September 24, 2025 11:44

adding starting changes for tool selection from default toolset

8f4b748

update code to have all current default tools in groups (or ungrouped if default) and set exp to true

d878d19

update max tools

ab16c1f

small changes

422b3a5

separate logic for built in tool grouping to new file

9c600ac

remove unnecessary telemetry sending

9b2a91f

adding intermediate changes

f66be55

Merge branch 'main' into anisha/default_toolsets

5d44c99

updated explicit tool set

6741dc8

Merge branch 'main' into anisha/default_toolsets

5affdde

Anisha Agarwal added 3 commits October 5, 2025 07:52

Merge branch 'main' into anisha/default_toolsets

125db45

only perform grouping for gpt 4.1 and 5

2031717

updated the hard tool limit back to 128, added logic

74b7a7a

24anisha commented Oct 7, 2025

Merge branch 'main' into anisha/default_toolsets

8523fba

24anisha marked this pull request as ready for review October 7, 2025 14:56

Copilot AI review requested due to automatic review settings October 7, 2025 14:56

vs-code-engineering bot assigned bpasero Oct 7, 2025

vs-code-engineering bot added the triage-needed label Oct 7, 2025

Copilot AI reviewed Oct 7, 2025

Apply suggestion from @Copilot

b3a53dd

Co-authored-by: Copilot <[email protected]>

bpasero assigned connor4312 and unassigned bpasero Oct 7, 2025

connor4312 requested changes Oct 8, 2025

Anisha Agarwal added 2 commits October 10, 2025 08:45

merge main and manage conflicts

eceb2b6

add tool grouping enum

52c8cc8

Anisha Agarwal and others added 4 commits October 14, 2025 19:25

add edit_files to core

aa817e4

resolving merge conflicts

e0e4be1

debugging changes

692a712

Merge branch 'main' into anisha/default_toolsets

d112240

connor4312 reviewed Oct 15, 2025

connor4312 previously approved these changes Oct 15, 2025

vs-code-engineering bot added this to the October 2025 milestone Oct 15, 2025

bpasero previously approved these changes Oct 15, 2025

Apply suggestion from @connor4312

dc96924

Co-authored-by: Connor Peet <[email protected]>

24anisha dismissed stale reviews from bpasero and connor4312 via dc96924 October 15, 2025 20:27

24anisha and others added 4 commits October 15, 2025 13:28

Apply suggestion from @connor4312

eec22bf

Co-authored-by: Connor Peet <[email protected]>

Apply suggestion from @connor4312

39a2786

Co-authored-by: Connor Peet <[email protected]>

update to remove chat endpoint + clean up categorization code

a7abf31

Merge branch 'main' into anisha/default_toolsets

e445fc7

connor4312 previously approved these changes Oct 16, 2025

24anisha and others added 2 commits October 16, 2025 13:30

Merge branch 'main' into anisha/default_toolsets

908a926

removing think tool

98c5be0

24anisha dismissed connor4312’s stale review via 98c5be0 October 16, 2025 20:53

Anisha Agarwal and others added 2 commits October 16, 2025 14:13

updating to use assertnever

73e11dc

Merge branch 'main' into anisha/default_toolsets

56cdbae

24anisha requested review from bpasero and connor4312 October 17, 2025 20:08

Merge branch 'main' into anisha/default_toolsets

e1c576f

connor4312 approved these changes Oct 17, 2025

connor4312 enabled auto-merge October 17, 2025 21:03

amunger approved these changes Oct 17, 2025

connor4312 added this pull request to the merge queue Oct 17, 2025

Merged via the queue into microsoft:main with commit ccfd104 Oct 17, 2025
6 checks passed
