-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Built-in toolsets for default tools #1139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… if default) and set exp to true
|
Questions/Action items remaining:
|
| // Enable if we could potentially trigger built-in grouping (when GPT model is used) | ||
| const defaultToolGroupingEnabled = this._configurationService.getExperimentBasedConfig(ConfigKey.Internal.DefaultToolsGrouped, this._experimentationService); | ||
| const couldTriggerBuiltInGrouping = this._tools.length > Constant.START_BUILTIN_GROUPING_AFTER_TOOL_COUNT && defaultToolGroupingEnabled; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't check everything we need. We also need to confirm that the model is gpt-4.1 or gpt-5, but this is supposed to be a lightweight, quick function to call.
Options:
- pass the endpoint into this function so we can check to make sure it's the correct model before going to virtual tool grouping
- let it go to virtual tool grouping every time (since START_BUILTIN_GROUPING_AFTER_TOOL_COUNT is 20) and get stopped from grouping there by the endpoint check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Introduce experimental grouping of built‑in (default) tools to reduce exposed tool count for certain GPT model families, aiming to improve model performance and tool selection quality.
- Adds experimental config flag and threshold to trigger grouping of default tools for GPT-4.1 / GPT-5 families.
- Extends tool grouping pipeline to optionally group built‑in tools into predefined virtual tool categories.
- Propagates endpoint-awareness through grouping APIs to make grouping decisions model-dependent.
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/platform/configuration/common/configurationService.ts | Adds experimental config key controlling built-in tool grouping. |
| src/extension/tools/common/virtualTools/virtualToolsConstants.ts | Defines new threshold for triggering built-in grouping. |
| src/extension/tools/common/virtualTools/virtualToolTypes.ts | Extends interfaces to pass optional endpoint for model-aware grouping. |
| src/extension/tools/common/virtualTools/virtualToolGrouper.ts | Core logic for conditional default tool grouping and expansion adjustments. |
| src/extension/tools/common/virtualTools/toolGrouping.ts | Updates enablement logic and propagates endpoint parameter. |
| src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts | Implements predefined grouping of built-in tools into virtual groups. |
| src/extension/prompt/node/defaultIntentRequestHandler.ts | Passes endpoint into grouping computations. |
| src/extension/extension/vscode-node/services.ts | Import ordering adjustment (non-functional). |
| public get isEnabled() { | ||
| return this._tools.length >= computeToolGroupingMinThreshold(this._experimentationService, this._configurationService).get(); | ||
| // Match the logic from VirtualToolGrouper.addGroups() | ||
| // Enable if we could potentially trigger built-in grouping (when GPT model is used) | ||
| const defaultToolGroupingEnabled = this._configurationService.getExperimentBasedConfig(ConfigKey.Internal.DefaultToolsGrouped, this._experimentationService); | ||
| const couldTriggerBuiltInGrouping = this._tools.length > Constant.START_BUILTIN_GROUPING_AFTER_TOOL_COUNT && defaultToolGroupingEnabled; |
Copilot
AI
Oct 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment claims to match VirtualToolGrouper.addGroups logic, but this condition does not check the model family (GPT-only restriction there). This may enable grouping (and related UI/state expectations) for non-GPT models once tool count > threshold, diverging from the actual grouping behavior. Pass endpoint or defer enablement until model family is known, or update the comment and naming to reflect the broader condition.
See below for a potential fix:
/**
* Returns whether tool grouping is enabled.
* - Built-in grouping is only enabled for GPT model endpoints, matching VirtualToolGrouper.addGroups logic.
* - Standard grouping is enabled for all tool types above the threshold.
*/
public isEnabled(endpoint?: IChatEndpoint): boolean {
const defaultToolGroupingEnabled = this._configurationService.getExperimentBasedConfig(ConfigKey.Internal.DefaultToolsGrouped, this._experimentationService);
const isGptModel = endpoint?.modelFamily === 'gpt';
const couldTriggerBuiltInGrouping = isGptModel && this._tools.length > Constant.START_BUILTIN_GROUPING_AFTER_TOOL_COUNT && defaultToolGroupingEnabled;
| const currentEndpoint = endpoint ?? (await this._endpointProvider.getAllChatEndpoints()).find(e => e.isDefault) ?? await this._endpointProvider.getChatEndpoint('gpt-4.1'); | ||
| const modelFamily = currentEndpoint?.family; | ||
| const isGpt = modelFamily?.startsWith('gpt-4.1') || modelFamily?.startsWith('gpt-5'); |
Copilot
AI
Oct 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Endpoint/modelFamily retrieval and the hard-coded 'gpt-4.1' fallback are duplicated. Extract a helper (e.g. getCurrentChatEndpoint()) and centralize the fallback model constant to avoid drift and ease future model changes.
| const currentEndpoint = endpoint ?? (await this._endpointProvider.getAllChatEndpoints()).find(e => e.isDefault) ?? await this._endpointProvider.getChatEndpoint('gpt-4.1'); | ||
| const modelFamily = currentEndpoint?.family; | ||
|
|
Copilot
AI
Oct 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Endpoint/modelFamily retrieval and the hard-coded 'gpt-4.1' fallback are duplicated. Extract a helper (e.g. getCurrentChatEndpoint()) and centralize the fallback model constant to avoid drift and ease future model changes.
Co-authored-by: Copilot <[email protected]>
src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts
Outdated
Show resolved
Hide resolved
| 'search_workspace_symbols', | ||
| 'list_code_usages', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two are general search tools or might even be top-level tools
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In restricted telemetry, I see that they get called extremely rarely (800 and 3000 times in the last 3 days, respectively, compared to ~3 million for something like grep_search), so I don't think they make sense at the top level. Maybe I could put them in redundant but specific instead of the vs code toolset?
src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts
Outdated
Show resolved
Hide resolved
src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts
Outdated
Show resolved
Hide resolved
| 'edit_files', | ||
| 'multi_replace_string_in_file' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Editing tools should always be top level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is edit_files an editing tool? The description in the package.json is "This is a placeholder tool, do not use" so I figured I shouldn't add it to the core toolset
| // If there's no need to group tools, just add them all directly; | ||
| if (tools.length < Constant.START_GROUPING_AFTER_TOOL_COUNT) { | ||
|
|
||
| // if the model is gpt 4.1 or gpt-5 and there are more than START_BUILTIN_GROUPING_AFTER_TOOL_COUNT tools, we should group built-in tools |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Sonnet not see the same benefits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sonnet 3.7 does, but sonnet 4 seemed robust to the larger toolset. Also, I think we decided to avoid editing the tools for the sonnet models because of prompt caching (I believe the tools are added at the top of a request for sonnet models, so it would cause a lot of cache misses to regularly change the toolset)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cache behavior is the case with all our current models.
We should have these checks in the central chatModelCapabilities.ts file
src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts
Outdated
Show resolved
Hide resolved
|
I'm also curious--what trajectory difference you saw in the small vs large toolset in your runs? Specifically for GPT-5/Sonnet. 4.1 is going away soon, we don't care too much about that any longer. |
If you look at the second table in the PR description, you can see the deltas for GPT 5. I'm doing an in-depth qualitative dive into the trajectories today, so I should have more behavioral discrepancies to report shortly. |
BTW, I completed a qualitative analysis of the GPT-5 small vs. large toolset trajectories. In MFA (where full toolset scored higher than small by 2 cases), the only tool that was regularly called in the large toolset runs that the small toolset did not have was create_and_run_task. But it didn't seem to provide that much value to the model (cases where create_and_run_task was called were not more likely to succeed). Besides that, the small toolset regularly performed at or above the level of the full toolset. |
src/extension/tools/common/virtualTools/builtInToolGroupHandler.ts
Outdated
Show resolved
Hide resolved
| compute(query: string, token: CancellationToken, endpoint?: IChatEndpoint): Promise<LanguageModelToolInformation[]>; | ||
|
|
||
| /** | ||
| * Returns the complete tree of tools, used for diagnostic purposes. | ||
| */ | ||
| computeAll(query: string, token: CancellationToken): Promise<(LanguageModelToolInformation | VirtualTool)[]>; | ||
| computeAll(query: string, token: CancellationToken, endpoint?: IChatEndpoint): Promise<(LanguageModelToolInformation | VirtualTool)[]>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use the endpoint anymore, should it be removed from these interfaces?
| * the appropriate virtual tool or top-level tool in the `root`. | ||
| */ | ||
| addGroups(query: string, root: VirtualTool, tools: LanguageModelToolInformation[], token: CancellationToken): Promise<void>; | ||
| addGroups(query: string, root: VirtualTool, tools: LanguageModelToolInformation[], token: CancellationToken, endpoint?: IChatEndpoint): Promise<void>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same ^
| toolsetKey?: string; | ||
| possiblePrefix?: string; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two aren't used any more, should be safe to delete them
| const builtinSlotCount = shouldGroupBuiltin | ||
| ? groupedResults.filter(item => item instanceof VirtualTool).length | ||
| : builtinTools.length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here we should just take the groupedResults.length (which used to be equivalent to builtinTools.length). We want to distribute the number of 'slots' that other tools get based on how many are left over
Co-authored-by: Connor Peet <[email protected]>
Co-authored-by: Connor Peet <[email protected]>
Co-authored-by: Connor Peet <[email protected]>
| case ToolCategory.Core: | ||
| return 'Core tools that should always be available without grouping.'; | ||
| default: | ||
| return 'Tools in this category.'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use assertNever(category) to have a compile-time assert that all categories are described.
Currently, we have a default toolset of 46 tools. Many of these tools are hyper-specified and often unnecessary for the task at hand. Their existence in the toolset ends up yielding worse behavior for many models, such as Sonnet 3.7 and GPT 4.1. As such, we're running an experiment to leverage the existing logic for MCP server toolsets to create built in toolsets from the default tools and only exposing a handful of tools to the user directly.
GPT 4.1 Pass Rate
GPT 5 Pass Rate
Sonnet 3.7 Success = True for Both