Skip to content

Conversation

@24anisha
Copy link
Contributor

@24anisha 24anisha commented Sep 25, 2025

Currently, we have a default toolset of 46 tools. Many of these tools are hyper-specified and often unnecessary for the task at hand. Their existence in the toolset ends up yielding worse behavior for many models, such as Sonnet 3.7 and GPT 4.1. As such, we're running an experiment to leverage the existing logic for MCP server toolsets to create built in toolsets from the default tools and only exposing a handful of tools to the user directly.

GPT 4.1 Pass Rate

Toolset Benchmark Pass Rate (%)
small MFA 19.4
full MFA 13.4
small SWEBench C# 8.1
full SWEBench C# 7.9

GPT 5 Pass Rate

model metric toolset run id %
gpt5 swec# full 18232124031 38.6
gpt5 swec# small 18232409805 42.6
gpt5 swelancer full 18232152700 41.8
gpt5 swelancer small 18232428869 43.9
gpt5 mfa full 18232085552 32.8
gpt5 mfa small 18232361377 28.3

Sonnet 3.7 Success = True for Both

Case Name Small Toolset  Steps Full Toolset Steps
gitmoji-cli-1248 7 18
isomorphic-git-1493 18 22
Luxon-1173 13 28
Prom-client-146 7 19

@24anisha
Copy link
Contributor Author

24anisha commented Oct 3, 2025

Questions/Action items remaining:

  • Group the default tools without needing to alter the max toolset
    • BUG: only works for gpt-4.1 and gpt-5 because the max # tools is hardcoded to 28. Need to make it dynamic
    • ISSUE: in toolGrouping.ts, isEnabled() is meant to be a fast check. So, we don't pass the chat endpoint in, which means we don't know the model family. If we do that, do non-GPT models get sent to the more costly compute step every query? How to manage?
  • Only apply this alteration for gpt models -- how to pass in the model family without changing the whole process?
  • Automatically add new tools to the built-in toolset groups (so that new tools aren't automatically considered part of the default toolset)
  • Confirm whether we should only apply changes for GPT models, or for all
  • remove unnecessary telemetry
  • remove isEnabled() function if it doesn't check anything meaningful
  • Manage duplicated logic
  • Re-adjust tool categorizations based on feedback
  • What causes many tools to appear as core tools to the model, even though they don't show up in debugging?
  • edit_files -> core
  • handle the merge conflicts

// Enable if we could potentially trigger built-in grouping (when GPT model is used)
const defaultToolGroupingEnabled = this._configurationService.getExperimentBasedConfig(ConfigKey.Internal.DefaultToolsGrouped, this._experimentationService);
const couldTriggerBuiltInGrouping = this._tools.length > Constant.START_BUILTIN_GROUPING_AFTER_TOOL_COUNT && defaultToolGroupingEnabled;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't check everything we need. We also need to confirm that the model is gpt-4.1 or gpt-5, but this is supposed to be a lightweight, quick function to call.
Options:

  • pass the endpoint into this function so we can check to make sure it's the correct model before going to virtual tool grouping
  • let it go to virtual tool grouping every time (since START_BUILTIN_GROUPING_AFTER_TOOL_COUNT is 20) and get stopped from grouping there by the endpoint check

@24anisha 24anisha marked this pull request as ready for review October 7, 2025 14:56
Copilot AI review requested due to automatic review settings October 7, 2025 14:56
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Introduce experimental grouping of built‑in (default) tools to reduce exposed tool count for certain GPT model families, aiming to improve model performance and tool selection quality.

  • Adds experimental config flag and threshold to trigger grouping of default tools for GPT-4.1 / GPT-5 families.
  • Extends tool grouping pipeline to optionally group built‑in tools into predefined virtual tool categories.
  • Propagates endpoint-awareness through grouping APIs to make grouping decisions model-dependent.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/platform/configuration/common/configurationService.ts Adds experimental config key controlling built-in tool grouping.
src/extension/tools/common/virtualTools/virtualToolsConstants.ts Defines new threshold for triggering built-in grouping.
src/extension/tools/common/virtualTools/virtualToolTypes.ts Extends interfaces to pass optional endpoint for model-aware grouping.
src/extension/tools/common/virtualTools/virtualToolGrouper.ts Core logic for conditional default tool grouping and expansion adjustments.
src/extension/tools/common/virtualTools/toolGrouping.ts Updates enablement logic and propagates endpoint parameter.
src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts Implements predefined grouping of built-in tools into virtual groups.
src/extension/prompt/node/defaultIntentRequestHandler.ts Passes endpoint into grouping computations.
src/extension/extension/vscode-node/services.ts Import ordering adjustment (non-functional).

Comment on lines 49 to 54
public get isEnabled() {
return this._tools.length >= computeToolGroupingMinThreshold(this._experimentationService, this._configurationService).get();
// Match the logic from VirtualToolGrouper.addGroups()
// Enable if we could potentially trigger built-in grouping (when GPT model is used)
const defaultToolGroupingEnabled = this._configurationService.getExperimentBasedConfig(ConfigKey.Internal.DefaultToolsGrouped, this._experimentationService);
const couldTriggerBuiltInGrouping = this._tools.length > Constant.START_BUILTIN_GROUPING_AFTER_TOOL_COUNT && defaultToolGroupingEnabled;
Copy link

Copilot AI Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment claims to match VirtualToolGrouper.addGroups logic, but this condition does not check the model family (GPT-only restriction there). This may enable grouping (and related UI/state expectations) for non-GPT models once tool count > threshold, diverging from the actual grouping behavior. Pass endpoint or defer enablement until model family is known, or update the comment and naming to reflect the broader condition.

See below for a potential fix:

	/**
	 * Returns whether tool grouping is enabled.
	 * - Built-in grouping is only enabled for GPT model endpoints, matching VirtualToolGrouper.addGroups logic.
	 * - Standard grouping is enabled for all tool types above the threshold.
	 */
	public isEnabled(endpoint?: IChatEndpoint): boolean {
		const defaultToolGroupingEnabled = this._configurationService.getExperimentBasedConfig(ConfigKey.Internal.DefaultToolsGrouped, this._experimentationService);
		const isGptModel = endpoint?.modelFamily === 'gpt';
		const couldTriggerBuiltInGrouping = isGptModel && this._tools.length > Constant.START_BUILTIN_GROUPING_AFTER_TOOL_COUNT && defaultToolGroupingEnabled;

Copilot uses AI. Check for mistakes.
Comment on lines 60 to 62
const currentEndpoint = endpoint ?? (await this._endpointProvider.getAllChatEndpoints()).find(e => e.isDefault) ?? await this._endpointProvider.getChatEndpoint('gpt-4.1');
const modelFamily = currentEndpoint?.family;
const isGpt = modelFamily?.startsWith('gpt-4.1') || modelFamily?.startsWith('gpt-5');
Copy link

Copilot AI Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Endpoint/modelFamily retrieval and the hard-coded 'gpt-4.1' fallback are duplicated. Extract a helper (e.g. getCurrentChatEndpoint()) and centralize the fallback model constant to avoid drift and ease future model changes.

Copilot uses AI. Check for mistakes.
Comment on lines 269 to 271
const currentEndpoint = endpoint ?? (await this._endpointProvider.getAllChatEndpoints()).find(e => e.isDefault) ?? await this._endpointProvider.getChatEndpoint('gpt-4.1');
const modelFamily = currentEndpoint?.family;

Copy link

Copilot AI Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Endpoint/modelFamily retrieval and the hard-coded 'gpt-4.1' fallback are duplicated. Extract a helper (e.g. getCurrentChatEndpoint()) and centralize the fallback model constant to avoid drift and ease future model changes.

Copilot uses AI. Check for mistakes.
@bpasero bpasero assigned connor4312 and unassigned bpasero Oct 7, 2025
Comment on lines 46 to 47
'search_workspace_symbols',
'list_code_usages',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two are general search tools or might even be top-level tools

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In restricted telemetry, I see that they get called extremely rarely (800 and 3000 times in the last 3 days, respectively, compared to ~3 million for something like grep_search), so I don't think they make sense at the top level. Maybe I could put them in redundant but specific instead of the vs code toolset?

Comment on lines 81 to 82
'edit_files',
'multi_replace_string_in_file'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Editing tools should always be top level

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is edit_files an editing tool? The description in the package.json is "This is a placeholder tool, do not use" so I figured I shouldn't add it to the core toolset

// If there's no need to group tools, just add them all directly;
if (tools.length < Constant.START_GROUPING_AFTER_TOOL_COUNT) {

// if the model is gpt 4.1 or gpt-5 and there are more than START_BUILTIN_GROUPING_AFTER_TOOL_COUNT tools, we should group built-in tools
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Sonnet not see the same benefits?

Copy link
Contributor Author

@24anisha 24anisha Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sonnet 3.7 does, but sonnet 4 seemed robust to the larger toolset. Also, I think we decided to avoid editing the tools for the sonnet models because of prompt caching (I believe the tools are added at the top of a request for sonnet models, so it would cause a lot of cache misses to regularly change the toolset)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache behavior is the case with all our current models.

We should have these checks in the central chatModelCapabilities.ts file

@connor4312
Copy link
Member

I'm also curious--what trajectory difference you saw in the small vs large toolset in your runs? Specifically for GPT-5/Sonnet. 4.1 is going away soon, we don't care too much about that any longer.

@24anisha
Copy link
Contributor Author

24anisha commented Oct 8, 2025

I'm also curious--what trajectory difference you saw in the small vs large toolset in your runs? Specifically for GPT-5/Sonnet. 4.1 is going away soon, we don't care too much about that any longer.

If you look at the second table in the PR description, you can see the deltas for GPT 5. I'm doing an in-depth qualitative dive into the trajectories today, so I should have more behavioral discrepancies to report shortly.

@24anisha
Copy link
Contributor Author

24anisha commented Oct 10, 2025

I'm also curious--what trajectory difference you saw in the small vs large toolset in your runs? Specifically for GPT-5/Sonnet. 4.1 is going away soon, we don't care too much about that any longer.

BTW, I completed a qualitative analysis of the GPT-5 small vs. large toolset trajectories. In MFA (where full toolset scored higher than small by 2 cases), the only tool that was regularly called in the large toolset runs that the small toolset did not have was create_and_run_task. But it didn't seem to provide that much value to the model (cases where create_and_run_task was called were not more likely to succeed). Besides that, the small toolset regularly performed at or above the level of the full toolset.

Comment on lines 52 to 57
compute(query: string, token: CancellationToken, endpoint?: IChatEndpoint): Promise<LanguageModelToolInformation[]>;

/**
* Returns the complete tree of tools, used for diagnostic purposes.
*/
computeAll(query: string, token: CancellationToken): Promise<(LanguageModelToolInformation | VirtualTool)[]>;
computeAll(query: string, token: CancellationToken, endpoint?: IChatEndpoint): Promise<(LanguageModelToolInformation | VirtualTool)[]>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use the endpoint anymore, should it be removed from these interfaces?

connor4312
connor4312 previously approved these changes Oct 15, 2025
* the appropriate virtual tool or top-level tool in the `root`.
*/
addGroups(query: string, root: VirtualTool, tools: LanguageModelToolInformation[], token: CancellationToken): Promise<void>;
addGroups(query: string, root: VirtualTool, tools: LanguageModelToolInformation[], token: CancellationToken, endpoint?: IChatEndpoint): Promise<void>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same ^

Comment on lines 15 to 16
toolsetKey?: string;
possiblePrefix?: string;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these two aren't used any more, should be safe to delete them

Comment on lines 109 to 111
const builtinSlotCount = shouldGroupBuiltin
? groupedResults.filter(item => item instanceof VirtualTool).length
: builtinTools.length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here we should just take the groupedResults.length (which used to be equivalent to builtinTools.length). We want to distribute the number of 'slots' that other tools get based on how many are left over

@vs-code-engineering vs-code-engineering bot added this to the October 2025 milestone Oct 15, 2025
bpasero
bpasero previously approved these changes Oct 15, 2025
@24anisha 24anisha dismissed stale reviews from bpasero and connor4312 via dc96924 October 15, 2025 20:27
connor4312
connor4312 previously approved these changes Oct 16, 2025
case ToolCategory.Core:
return 'Core tools that should always be available without grouping.';
default:
return 'Tools in this category.';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use assertNever(category) to have a compile-time assert that all categories are described.

@connor4312 connor4312 enabled auto-merge October 17, 2025 21:03
@connor4312 connor4312 added this pull request to the merge queue Oct 17, 2025
Merged via the queue into microsoft:main with commit ccfd104 Oct 17, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants