Skip to content

fix(core): recurse into submodule files when crawling git repos#4596

Merged
tanzhenxin merged 3 commits into
QwenLM:mainfrom
he-yufeng:fix/recurse-submodules-crawler
Jun 8, 2026
Merged

fix(core): recurse into submodule files when crawling git repos#4596
tanzhenxin merged 3 commits into
QwenLM:mainfrom
he-yufeng:fix/recurse-submodules-crawler

Conversation

@he-yufeng

Copy link
Copy Markdown
Contributor

Summary

Fixes #4568.

git ls-files --cached -t reports a submodule as the gitlink path, so the crawler only saw vendor/lib and missed files tracked inside that submodule. This adds --recurse-submodules to the tracked-file listing path so submodule contents are crawled the same way regular tracked files are.

I also added a real git fixture that creates a parent repo, adds a local submodule, and verifies the crawler returns a file from inside the submodule.

Verification

  • npx vitest run src/utils/filesearch/crawler.test.ts -t "recurse into tracked submodules"
  • npx vitest run src/utils/filesearch/crawler.test.ts
  • npm run typecheck --workspace=@qwen-code/qwen-code-core
  • npm run lint --workspace=@qwen-code/qwen-code-core -- src/utils/filesearch/crawler.ts src/utils/filesearch/crawler.test.ts
  • npm run build --workspace=@qwen-code/qwen-code-core
  • git diff --check

// versions; newline-delimited output is fine here (index paths cannot contain newlines).
const trackedArgs = ['--literal-pathspecs', 'ls-files', '--cached'];
const trackedArgs = [
'--literal-pathspecs',

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] git ls-files --deleted --recurse-submodules is unsupported by git (fatal: "unsupported mode"). This creates an asymmetry: when a tracked file inside an initialized submodule is deleted from the working tree, --cached --recurse-submodules still lists it (from the submodule's index), but listDeletedTrackedFiles (--deleted) cannot detect the deletion. The deletedSet check in processTrackedFile therefore does not filter it, and the crawler returns a phantom path that doesn't exist on disk.

Impact: File search results include ghost entries for deleted submodule files. Downstream consumers (read file, @-mention completion) will get ENOENT errors.

Suggested change
'--literal-pathspecs',
const trackedArgs = [
'--literal-pathspecs',
'ls-files',
'--cached',
'--recurse-submodules',
];
// NOTE: git does not support `--deleted --recurse-submodules` (fatal error).
// Deleted files inside submodules won't appear in deletedSet.
// Consider adding an fs.access() guard for paths under submodule mount points,
// or use `git submodule foreach --recursive 'git ls-files -z --deleted'`
// to collect submodule-internal deleted files.

Also: When a submodule is registered but not initialized (common after plain git clone), --cached --recurse-submodules -t emits the gitlink mount point (e.g., H vendor/lib) as a cached entry. processTrackedFile only skips status S, but gitlinks appear as H — so a directory path ends up in the file list. Consider filtering gitlink entries or adding a stat guard.

— qwen3.7-max via Qwen Code /review

@he-yufeng

Copy link
Copy Markdown
Contributor Author

Addressed the submodule edge cases from review in c806c2a.

Changes:

  • kept deleted-file filtering on the top-level git index path instead of relying on unsupported git ls-files --deleted --recurse-submodules
  • check each cached tracked entry against the working tree and skip entries that are missing
  • skip cached directory entries, which covers uninitialized submodule gitlinks

Validation:

  • npm run test --workspace=packages/core -- crawler.test.ts
  • npm run typecheck --workspace=packages/core
  • npx eslint packages/core/src/utils/filesearch/crawler.ts packages/core/src/utils/filesearch/crawler.test.ts --max-warnings 0
  • git diff --check

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No review findings at this commit. Downgraded from Approve to Comment: CI still running. The implementation correctly addresses the previously raised Critical (deleted-file filtering asymmetry handled via lstatSync guard) and Suggestion (gitlink directory entries filtered by isDirectory check). All 47 crawler tests pass. — qwen3.7-max via Qwen Code /review

wenshao
wenshao previously approved these changes May 29, 2026
@wenshao

wenshao commented May 29, 2026

Copy link
Copy Markdown
Collaborator

Verification Report — PR #4596 (Recurse into Submodule Files)

Tested by: wenshao
Branch: fix/recurse-submodules-crawler
Commit: c806c2a5 (latest)
Environment: macOS Darwin 25.4.0, Node.js, real git submodule fixtures


1. Build / Typecheck / Lint

Check Result
npm run build --workspace=packages/core PASS (0 errors)
npm run typecheck --workspace=packages/core PASS (0 TS errors)
npm run lint --workspace=packages/core -- crawler.ts crawler.test.ts PASS (0 lint issues)
git diff --check PASS (no whitespace issues)

2. Unit Tests — crawler.test.ts

Run Tests Result Notes
Default timeout (5s) 47 46 pass, 1 timeout should recurse into tracked submodules takes ~5.4s, exceeds default 5s
Extended timeout (30s) 47 47 pass All tests pass including the 3 new submodule tests

New tests added by this PR:

Test Type Result Time
should recurse into tracked submodules on the git path Real git fixture PASS ~5.3s
should skip missing tracked paths from submodule indexes Mock PASS 4ms
should skip cached gitlink directories from uninitialized submodules Mock PASS 2ms

3. Live Submodule Integration (manual verification)

Created a real parent repo with a submodule at vendor/lib containing files at two depth levels.

Before fix (git ls-files --cached -t without --recurse-submodules):

H .gitmodules
H root.txt
H vendor/lib          ← gitlink directory only, files inside invisible

After fix (git ls-files --cached -t --recurse-submodules):

H .gitmodules
H root.txt
H vendor/lib/deep/nested/deep-file.txt   ← now visible
H vendor/lib/inner.txt                    ← now visible

Also verified:

  • lstatSync filter: vendor/lib gitlink entry correctly identified as directory and filtered out
  • Stale index entries: lstatSync try/catch correctly filters paths that exist in the index but not on disk

4. Findings

  • ⚠️ Test timeout (non-blocking): The real submodule fixture test (should recurse into tracked submodules) takes ~5.3s due to git init + git submodule add + git commit overhead. It marginally exceeds the default 5s Vitest timeout. CI passes (macOS/Ubuntu/Windows all green) because CI machines are faster, but locally on macOS the test can flap. Consider adding { timeout: 15_000 } to this specific test for margin.
  • The code change is minimal and surgical: 1 flag added (--recurse-submodules) + 1 stat guard (8 lines) to filter stale/directory entries.
  • No regressions in the existing 44 crawler tests.

5. Verdict

PASS — ready to merge. The core fix is correct and well-tested. The timeout finding is cosmetic (CI is green) but worth a quick fix for local dev experience.


Tested on 2026-05-29 via tmux with real git submodule fixtures and full test suite.


let stat: fs.Stats;
try {
stat = fs.lstatSync(path.join(gitRoot, ...normalizedFile.split('/')));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: lstatSync is called for every tracked file in the streaming onLine callback, not just submodule-related entries. In large repos (50K+ files) this adds ~0.5–1s of synchronous I/O that blocks the event loop.

A lighter alternative: parse .gitmodules (or git config --file .gitmodules -l) once to collect known submodule paths, then only stat entries whose prefix matches a submodule path. That narrows the stat surface to submodule root entries instead of the entire file list.

Not blocking — the correctness is fine, just something to keep in mind if crawl latency regresses on large monorepos.

@tanzhenxin tanzhenxin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this — the submodule recursion itself is correct and nicely scoped: --recurse-submodules applied only to the --cached listing (the only mode git supports it on), nested/uninitialized submodules handled, and the hermetic local-submodule test is exactly the right way to cover it.

One issue with the new existence check that I'd resolve before this lands: it silently drops tracked files with non-ASCII names.

Under git's default core.quotePath=true, git ls-files -t escapes any path containing non-ASCII or control characters. parseTrackedLine keeps the quotes + octal escapes, normalizePath only swaps slashes (no dequoting), so the new fs.lstatSync(path.join(gitRoot, ...)) probes a literal path like <root>/"caf\303\251.txt" that doesn't exist on disk → catch { return true } skips the fileSet.add → the file disappears from the crawl, and therefore from @-completion and file search.

This hits CJK filenames too, which are common in this project's user base:

# default git — what the crawler sees today
H "caf\303\251.txt"          # café.txt
H "\346\226\207\346\241\243.md"   # 文档.md

Reproduced against this branch with a repo containing a tracked café.txt + plain.txt:

  • current: crawl()[".", "plain.txt"] — café.txt is gone
  • with -c core.quotePath=false: → [".", "café.txt", "plain.txt"] — restored, with a clean name

A tracked file with a non-ASCII name (e.g. café.txt or a CJK name) under default core.quotePath would lock this in as a regression test.

@he-yufeng

Copy link
Copy Markdown
Contributor Author

Good catch. I pushed a follow-up that runs the git listing commands with core.quotePath=false, so tracked non-ASCII paths are emitted as UTF-8 paths before the lstat guard runs.

I also added a real git fixture with café.txt and 文档.md to lock the behavior down under the default git configuration.

Validation run locally:

  • npm run test --workspace=packages/core -- crawler.test.ts -- --testTimeout=30000: 48 passed
  • npx eslint packages/core/src/utils/filesearch/crawler.ts packages/core/src/utils/filesearch/crawler.test.ts --max-warnings 0: passed
  • npx prettier --check packages/core/src/utils/filesearch/crawler.ts packages/core/src/utils/filesearch/crawler.test.ts: passed
  • npm run typecheck --workspace=packages/core: passed
  • npm run build --workspace=packages/core: passed
  • git diff --check: passed

const trackedArgs = [
'--literal-pathspecs',
'ls-files',
'--cached',

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] If any submodule is uninitialized or has a broken .git reference, git ls-files --cached --recurse-submodules exits non-zero and the entire tracked-file listing fails (line ~1141), falling back to ripgrep — which silently loses git-tracking semantics (force-added files matching .gitignore are dropped, untracked files appear).

Consider retrying without --recurse-submodules before returning failure, so a broken submodule only loses submodule files rather than degrading the entire parent repo listing:

if (!trackedResult.success) {
  // Retry without --recurse-submodules: broken submodule states can cause
  // the recursive listing to fail entirely.
  const fallbackArgs = trackedArgs.filter(a => a !== '--recurse-submodules');
  const fallbackResult = await commandRunner('git', withSafeGitConfig(fallbackArgs), ...);
  if (!fallbackResult.success) {
    return { success: false, files: [], gitRepoListingFailed: true };
  }
  // ... process fallbackResult
}

— qwen3.7-max via Qwen Code /review

cacheTtl: 0,
});

expect(results).toContain('vendor/lib/inner.txt');

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] This test only asserts that the submodule file appears in results, but doesn't verify that parent-repo files are also present. A regression that drops non-submodule tracked files while returning submodule files would pass undetected.

Suggested change
expect(results).toContain('vendor/lib/inner.txt');
expect(results).toContain('vendor/lib/inner.txt');
expect(results).toContain('root.txt');

— qwen3.7-max via Qwen Code /review

@he-yufeng he-yufeng force-pushed the fix/recurse-submodules-crawler branch from 58bca22 to cd5509f Compare June 7, 2026 02:03
@he-yufeng

Copy link
Copy Markdown
Contributor Author

Added the suggested timeout margin for the real submodule fixture in cd5509fc7.

The test still exercises the same git-submodule path; the only change is giving that one fixture 15 seconds so slower local machines do not fail around Vitest's default 5 second timeout.

Validation:

npm run test --workspace=packages/core -- crawler.test.ts -- --testTimeout=30000
# 48 passed

npx prettier --check packages/core/src/utils/filesearch/crawler.test.ts
npx eslint packages/core/src/utils/filesearch/crawler.test.ts --max-warnings 0
git diff --check
# passed

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Submodule index changes are not captured by the change detection mechanism (getGitRootMtime only checks the parent repo .git/index, and scanWorkingTreeForChange fingerprints don't recurse into submodules). Within the throttle/TTL window, staging a new file in an initialized submodule won't trigger cache invalidation, causing stale crawl results until the cache expires. Consider falling back to always-relist when submodules are present.

[Suggestion] --recurse-submodules was added only to ls-files --cached but not to --others (untracked) or --deleted listing commands. Untracked files inside initialized submodules remain silently excluded, creating an asymmetry with the tracked-file path. (The --deleted case is handled by the lstatSync fallback, so correctness is fine.)

— DeepSeek/deepseek-v4-pro via Qwen Code /review

'--literal-pathspecs',
'ls-files',
'--cached',
'--recurse-submodules',

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] --recurse-submodules is added to --cached only, not to --others (line 852) or --deleted (line 878). The asymmetry is correct — --others and --deleted operate on the working tree, not submodule indexes — but no comment explains why. A future maintainer might "harmonize" the three calls by adding the flag everywhere, which would break behavior: --deleted --recurse-submodules would emit gitlink directory entries that the deletedSet filter cannot handle.

Suggested change
'--recurse-submodules',
// --recurse-submodules is intentionally scoped to --cached only:
// --others would surface untracked submodule internals (policy choice),
// and --deleted would emit gitlink entries that deletedSet cannot handle.
'--recurse-submodules',

— qwen3.7-max via Qwen Code /review


let stat: fs.Stats;
try {
stat = fs.lstatSync(path.join(gitRoot, ...normalizedFile.split('/')));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] This lstatSync + isDirectory() guard filters gitlink entries (mode 160000) that --recurse-submodules emits for uninitialized submodules. The guard's purpose is non-obvious — a future maintainer optimizing hot-path I/O (one synchronous stat per tracked file) might remove it as seemingly dead code, re-introducing the bug.

Consider adding a comment tying the guard to its reason:

// --recurse-submodules causes ls-files --cached to emit gitlink entries
// (directory placeholders) for uninitialized submodules. Skip those here.

Also, path.join(gitRoot, ...normalizedFile.split('/')) can be simplified to path.join(gitRoot, normalizedFile)path.join already handles / separators on all platforms, and every other path.join call in this file uses direct string args.

— qwen3.7-max via Qwen Code /review

@tanzhenxin tanzhenxin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Thanks for the quick follow-up. The earlier concern — that the new existence-check guard would silently drop tracked files with non-ASCII names, because git's default quoting emits those names octal-escaped and the guard then looked for a path that doesn't exist on disk — is fully resolved. Running the git listing with quoting disabled means non-ASCII paths now arrive as plain UTF-8 before the guard runs, and the new café.txt / 文档.md fixture genuinely exercises that path (it fails without the fix). The change also incidentally aligns the tracked listing with the deleted-file set, which previously couldn't match non-ASCII deletions. The submodule fix itself was already correct and well-scoped.

One optional, non-blocking nit if you touch this again: the guard's catch swallows every stat failure, not just the not-found case, so a transient or permission error would silently include or drop a file with no signal — narrowing it to the not-found case would be a small robustness win. Not a merge blocker.

Verdict

APPROVE — the reported bug is fixed correctly and the regression is locked down by a real test.

@tanzhenxin tanzhenxin merged commit 6d64b34 into QwenLM:main Jun 8, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: @ file completion shows submodule folder but no files inside it

4 participants