fix(core): recurse into submodule files when crawling git repos#4596
Conversation
| // versions; newline-delimited output is fine here (index paths cannot contain newlines). | ||
| const trackedArgs = ['--literal-pathspecs', 'ls-files', '--cached']; | ||
| const trackedArgs = [ | ||
| '--literal-pathspecs', |
There was a problem hiding this comment.
[Critical] git ls-files --deleted --recurse-submodules is unsupported by git (fatal: "unsupported mode"). This creates an asymmetry: when a tracked file inside an initialized submodule is deleted from the working tree, --cached --recurse-submodules still lists it (from the submodule's index), but listDeletedTrackedFiles (--deleted) cannot detect the deletion. The deletedSet check in processTrackedFile therefore does not filter it, and the crawler returns a phantom path that doesn't exist on disk.
Impact: File search results include ghost entries for deleted submodule files. Downstream consumers (read file, @-mention completion) will get ENOENT errors.
| '--literal-pathspecs', | |
| const trackedArgs = [ | |
| '--literal-pathspecs', | |
| 'ls-files', | |
| '--cached', | |
| '--recurse-submodules', | |
| ]; | |
| // NOTE: git does not support `--deleted --recurse-submodules` (fatal error). | |
| // Deleted files inside submodules won't appear in deletedSet. | |
| // Consider adding an fs.access() guard for paths under submodule mount points, | |
| // or use `git submodule foreach --recursive 'git ls-files -z --deleted'` | |
| // to collect submodule-internal deleted files. |
Also: When a submodule is registered but not initialized (common after plain git clone), --cached --recurse-submodules -t emits the gitlink mount point (e.g., H vendor/lib) as a cached entry. processTrackedFile only skips status S, but gitlinks appear as H — so a directory path ends up in the file list. Consider filtering gitlink entries or adding a stat guard.
— qwen3.7-max via Qwen Code /review
|
Addressed the submodule edge cases from review in c806c2a. Changes:
Validation:
|
wenshao
left a comment
There was a problem hiding this comment.
No review findings at this commit. Downgraded from Approve to Comment: CI still running. The implementation correctly addresses the previously raised Critical (deleted-file filtering asymmetry handled via lstatSync guard) and Suggestion (gitlink directory entries filtered by isDirectory check). All 47 crawler tests pass. — qwen3.7-max via Qwen Code /review
Verification Report — PR #4596 (Recurse into Submodule Files)Tested by: wenshao 1. Build / Typecheck / Lint
2. Unit Tests —
|
| Run | Tests | Result | Notes |
|---|---|---|---|
| Default timeout (5s) | 47 | 46 pass, 1 timeout | should recurse into tracked submodules takes ~5.4s, exceeds default 5s |
| Extended timeout (30s) | 47 | 47 pass | All tests pass including the 3 new submodule tests |
New tests added by this PR:
| Test | Type | Result | Time |
|---|---|---|---|
should recurse into tracked submodules on the git path |
Real git fixture | PASS | ~5.3s |
should skip missing tracked paths from submodule indexes |
Mock | PASS | 4ms |
should skip cached gitlink directories from uninitialized submodules |
Mock | PASS | 2ms |
3. Live Submodule Integration (manual verification)
Created a real parent repo with a submodule at vendor/lib containing files at two depth levels.
Before fix (git ls-files --cached -t without --recurse-submodules):
H .gitmodules
H root.txt
H vendor/lib ← gitlink directory only, files inside invisible
After fix (git ls-files --cached -t --recurse-submodules):
H .gitmodules
H root.txt
H vendor/lib/deep/nested/deep-file.txt ← now visible
H vendor/lib/inner.txt ← now visible
Also verified:
- lstatSync filter:
vendor/libgitlink entry correctly identified as directory and filtered out - Stale index entries:
lstatSynctry/catch correctly filters paths that exist in the index but not on disk
4. Findings
⚠️ Test timeout (non-blocking): The real submodule fixture test (should recurse into tracked submodules) takes ~5.3s due togit init+git submodule add+git commitoverhead. It marginally exceeds the default 5s Vitest timeout. CI passes (macOS/Ubuntu/Windows all green) because CI machines are faster, but locally on macOS the test can flap. Consider adding{ timeout: 15_000 }to this specific test for margin.- The code change is minimal and surgical: 1 flag added (
--recurse-submodules) + 1 stat guard (8 lines) to filter stale/directory entries. - No regressions in the existing 44 crawler tests.
5. Verdict
PASS — ready to merge. The core fix is correct and well-tested. The timeout finding is cosmetic (CI is green) but worth a quick fix for local dev experience.
Tested on 2026-05-29 via tmux with real git submodule fixtures and full test suite.
|
|
||
| let stat: fs.Stats; | ||
| try { | ||
| stat = fs.lstatSync(path.join(gitRoot, ...normalizedFile.split('/'))); |
There was a problem hiding this comment.
nit: lstatSync is called for every tracked file in the streaming onLine callback, not just submodule-related entries. In large repos (50K+ files) this adds ~0.5–1s of synchronous I/O that blocks the event loop.
A lighter alternative: parse .gitmodules (or git config --file .gitmodules -l) once to collect known submodule paths, then only stat entries whose prefix matches a submodule path. That narrows the stat surface to submodule root entries instead of the entire file list.
Not blocking — the correctness is fine, just something to keep in mind if crawl latency regresses on large monorepos.
tanzhenxin
left a comment
There was a problem hiding this comment.
Thanks for this — the submodule recursion itself is correct and nicely scoped: --recurse-submodules applied only to the --cached listing (the only mode git supports it on), nested/uninitialized submodules handled, and the hermetic local-submodule test is exactly the right way to cover it.
One issue with the new existence check that I'd resolve before this lands: it silently drops tracked files with non-ASCII names.
Under git's default core.quotePath=true, git ls-files -t escapes any path containing non-ASCII or control characters. parseTrackedLine keeps the quotes + octal escapes, normalizePath only swaps slashes (no dequoting), so the new fs.lstatSync(path.join(gitRoot, ...)) probes a literal path like <root>/"caf\303\251.txt" that doesn't exist on disk → catch { return true } skips the fileSet.add → the file disappears from the crawl, and therefore from @-completion and file search.
This hits CJK filenames too, which are common in this project's user base:
# default git — what the crawler sees today
H "caf\303\251.txt" # café.txt
H "\346\226\207\346\241\243.md" # 文档.md
Reproduced against this branch with a repo containing a tracked café.txt + plain.txt:
- current:
crawl()→[".", "plain.txt"]— café.txt is gone - with
-c core.quotePath=false: →[".", "café.txt", "plain.txt"]— restored, with a clean name
A tracked file with a non-ASCII name (e.g. café.txt or a CJK name) under default core.quotePath would lock this in as a regression test.
|
Good catch. I pushed a follow-up that runs the git listing commands with core.quotePath=false, so tracked non-ASCII paths are emitted as UTF-8 paths before the lstat guard runs. I also added a real git fixture with café.txt and 文档.md to lock the behavior down under the default git configuration. Validation run locally:
|
| const trackedArgs = [ | ||
| '--literal-pathspecs', | ||
| 'ls-files', | ||
| '--cached', |
There was a problem hiding this comment.
[Suggestion] If any submodule is uninitialized or has a broken .git reference, git ls-files --cached --recurse-submodules exits non-zero and the entire tracked-file listing fails (line ~1141), falling back to ripgrep — which silently loses git-tracking semantics (force-added files matching .gitignore are dropped, untracked files appear).
Consider retrying without --recurse-submodules before returning failure, so a broken submodule only loses submodule files rather than degrading the entire parent repo listing:
if (!trackedResult.success) {
// Retry without --recurse-submodules: broken submodule states can cause
// the recursive listing to fail entirely.
const fallbackArgs = trackedArgs.filter(a => a !== '--recurse-submodules');
const fallbackResult = await commandRunner('git', withSafeGitConfig(fallbackArgs), ...);
if (!fallbackResult.success) {
return { success: false, files: [], gitRepoListingFailed: true };
}
// ... process fallbackResult
}— qwen3.7-max via Qwen Code /review
| cacheTtl: 0, | ||
| }); | ||
|
|
||
| expect(results).toContain('vendor/lib/inner.txt'); |
There was a problem hiding this comment.
[Suggestion] This test only asserts that the submodule file appears in results, but doesn't verify that parent-repo files are also present. A regression that drops non-submodule tracked files while returning submodule files would pass undetected.
| expect(results).toContain('vendor/lib/inner.txt'); | |
| expect(results).toContain('vendor/lib/inner.txt'); | |
| expect(results).toContain('root.txt'); |
— qwen3.7-max via Qwen Code /review
58bca22 to
cd5509f
Compare
|
Added the suggested timeout margin for the real submodule fixture in The test still exercises the same git-submodule path; the only change is giving that one fixture 15 seconds so slower local machines do not fail around Vitest's default 5 second timeout. Validation: |
wenshao
left a comment
There was a problem hiding this comment.
[Suggestion] Submodule index changes are not captured by the change detection mechanism (getGitRootMtime only checks the parent repo .git/index, and scanWorkingTreeForChange fingerprints don't recurse into submodules). Within the throttle/TTL window, staging a new file in an initialized submodule won't trigger cache invalidation, causing stale crawl results until the cache expires. Consider falling back to always-relist when submodules are present.
[Suggestion] --recurse-submodules was added only to ls-files --cached but not to --others (untracked) or --deleted listing commands. Untracked files inside initialized submodules remain silently excluded, creating an asymmetry with the tracked-file path. (The --deleted case is handled by the lstatSync fallback, so correctness is fine.)
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| '--literal-pathspecs', | ||
| 'ls-files', | ||
| '--cached', | ||
| '--recurse-submodules', |
There was a problem hiding this comment.
[Suggestion] --recurse-submodules is added to --cached only, not to --others (line 852) or --deleted (line 878). The asymmetry is correct — --others and --deleted operate on the working tree, not submodule indexes — but no comment explains why. A future maintainer might "harmonize" the three calls by adding the flag everywhere, which would break behavior: --deleted --recurse-submodules would emit gitlink directory entries that the deletedSet filter cannot handle.
| '--recurse-submodules', | |
| // --recurse-submodules is intentionally scoped to --cached only: | |
| // --others would surface untracked submodule internals (policy choice), | |
| // and --deleted would emit gitlink entries that deletedSet cannot handle. | |
| '--recurse-submodules', |
— qwen3.7-max via Qwen Code /review
|
|
||
| let stat: fs.Stats; | ||
| try { | ||
| stat = fs.lstatSync(path.join(gitRoot, ...normalizedFile.split('/'))); |
There was a problem hiding this comment.
[Suggestion] This lstatSync + isDirectory() guard filters gitlink entries (mode 160000) that --recurse-submodules emits for uninitialized submodules. The guard's purpose is non-obvious — a future maintainer optimizing hot-path I/O (one synchronous stat per tracked file) might remove it as seemingly dead code, re-introducing the bug.
Consider adding a comment tying the guard to its reason:
// --recurse-submodules causes ls-files --cached to emit gitlink entries
// (directory placeholders) for uninitialized submodules. Skip those here.Also, path.join(gitRoot, ...normalizedFile.split('/')) can be simplified to path.join(gitRoot, normalizedFile) — path.join already handles / separators on all platforms, and every other path.join call in this file uses direct string args.
— qwen3.7-max via Qwen Code /review
tanzhenxin
left a comment
There was a problem hiding this comment.
Review
Thanks for the quick follow-up. The earlier concern — that the new existence-check guard would silently drop tracked files with non-ASCII names, because git's default quoting emits those names octal-escaped and the guard then looked for a path that doesn't exist on disk — is fully resolved. Running the git listing with quoting disabled means non-ASCII paths now arrive as plain UTF-8 before the guard runs, and the new café.txt / 文档.md fixture genuinely exercises that path (it fails without the fix). The change also incidentally aligns the tracked listing with the deleted-file set, which previously couldn't match non-ASCII deletions. The submodule fix itself was already correct and well-scoped.
One optional, non-blocking nit if you touch this again: the guard's catch swallows every stat failure, not just the not-found case, so a transient or permission error would silently include or drop a file with no signal — narrowing it to the not-found case would be a small robustness win. Not a merge blocker.
Verdict
APPROVE — the reported bug is fixed correctly and the regression is locked down by a real test.
Summary
Fixes #4568.
git ls-files --cached -treports a submodule as the gitlink path, so the crawler only sawvendor/liband missed files tracked inside that submodule. This adds--recurse-submodulesto the tracked-file listing path so submodule contents are crawled the same way regular tracked files are.I also added a real git fixture that creates a parent repo, adds a local submodule, and verifies the crawler returns a file from inside the submodule.
Verification
npx vitest run src/utils/filesearch/crawler.test.ts -t "recurse into tracked submodules"npx vitest run src/utils/filesearch/crawler.test.tsnpm run typecheck --workspace=@qwen-code/qwen-code-corenpm run lint --workspace=@qwen-code/qwen-code-core -- src/utils/filesearch/crawler.ts src/utils/filesearch/crawler.test.tsnpm run build --workspace=@qwen-code/qwen-code-coregit diff --check