
Improve snippet case transforms support for non-Latin scripts (fix: #286165) #287150

Merged
jrieken merged 10 commits into microsoft:main from lucas-gomes-santana:fix/snippet-unicode-support
Jan 26, 2026

Conversation

@lucas-gomes-santana
Contributor

@lucas-gomes-santana lucas-gomes-santana commented Jan 12, 2026

Description

This PR addresses the problem reported in issue #286165. It improves snippet case transforms by replacing ASCII-only regular expressions with Unicode-aware patterns and locale-aware case mapping.

Previously, snippet transforms such as upcase, downcase, camelcase, pascalcase, kebabcase, and snakecase relied on [a-zA-Z]-based matching. As a result, non-Latin input (e.g. Cyrillic or Greek) was not recognized correctly, and transforms were silently skipped or only partially applied instead of producing the expected output.
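
To illustrate the failure mode described above, here is a minimal sketch. The two patterns and the helper names are assumptions made for illustration, not the expressions from snippetParser.ts. They only show why an ASCII character class never sees a lowercase-to-uppercase boundary in Cyrillic input, while Unicode property escapes do.

```typescript
// Hypothetical helpers for illustration; not taken from snippetParser.ts.
// The RegExp constructor is used so the 'u' flag is evaluated at runtime.
const asciiBoundary = new RegExp('[a-z][A-Z]');
const unicodeBoundary = new RegExp('\\p{Ll}\\p{Lu}', 'u');

// True when an ASCII lowercase letter is directly followed by an ASCII uppercase letter.
function hasAsciiCaseBoundary(value: string): boolean {
    return asciiBoundary.test(value);
}

// True when any lowercase letter is directly followed by any uppercase letter.
function hasUnicodeCaseBoundary(value: string): boolean {
    return unicodeBoundary.test(value);
}

console.log(hasAsciiCaseBoundary('oneTwo'), hasUnicodeCaseBoundary('oneTwo'));   // true true
console.log(hasAsciiCaseBoundary('одинДва'), hasUnicodeCaseBoundary('одинДва')); // false true
```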


The changes in this PR:

  • Use Unicode property escapes (\p{L}, \p{Lu}, \p{Ll}, \p{Nd}) to properly detect letters and numbers across modern scripts.

  • Use locale-aware casing (toLocaleLowerCase / toLocaleUpperCase) instead of ASCII-only case conversion.

  • Preserve existing behavior for Latin input while improving support for scripts that have uppercase/lowercase distinctions (e.g. Cyrillic, Greek).
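
The approach in the list above could be sketched roughly as follows. This is not the code merged in snippetParser.ts: `splitWords`, `toKebabCase`, `toSnakeCase`, and `toPascalCase` are hypothetical names, and the splitting rule (a lowercase letter or digit followed by an uppercase letter, plus any run of non-letter, non-digit characters) is an assumption chosen to reproduce the Russian and Greek results listed later in this description.

```typescript
// Hypothetical sketch; the real implementation lives in snippetParser.ts and may differ.
// Patterns are built with the RegExp constructor so the 'u' flag is evaluated at runtime.
const caseBoundary = new RegExp('([\\p{Ll}\\p{Nd}])(\\p{Lu})', 'gu');
const separators = new RegExp('[^\\p{L}\\p{Nd}]+', 'u');

// Split a value into words at case boundaries and at separator characters.
function splitWords(value: string): string[] {
    return value
        .replace(caseBoundary, '$1 $2')
        .split(separators)
        .filter(word => word.length > 0);
}

function toKebabCase(value: string): string {
    return splitWords(value).map(word => word.toLocaleLowerCase()).join('-');
}

function toSnakeCase(value: string): string {
    return splitWords(value).map(word => word.toLocaleLowerCase()).join('_');
}

// Uppercases the first UTF-16 code unit of each word; enough for the
// Basic Multilingual Plane letters used in this sketch.
function toPascalCase(value: string): string {
    return splitWords(value)
        .map(word => word.charAt(0).toLocaleUpperCase() + word.slice(1))
        .join('');
}

console.log(toKebabCase('одинДва'));   // один-два
console.log(toSnakeCase('одинДва3'));  // один_два3
console.log(toPascalCase('ένας_δύο')); // ΈναςΔύο
```

Calling `toLocaleLowerCase` and `toLocaleUpperCase` without an argument uses the host's default locale, so results for letters with locale-specific mappings (such as the Turkish dotted and dotless i) can vary between machines.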


Limitations

This change does not aim to provide a fully language-aware or linguistically perfect solution for all scripts.
Word-based transforms (camelCase, PascalCase, kebab-case, snake_case) inherently rely on uppercase/lowercase transitions and therefore cannot be meaningfully applied to scripts without case (e.g. Chinese, Japanese, Arabic, Hebrew).

For such scripts, transforms effectively become no-ops, which is consistent with current behavior and preferable to producing arbitrary or destructive output.
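
A small sketch of why caseless scripts end up as no-ops: their characters match `\p{L}` but neither `\p{Lu}` nor `\p{Ll}`, so there is no case boundary to split on and case mapping leaves them unchanged. The helper name `inspectCasing` is hypothetical and exists only for this illustration.

```typescript
// Hypothetical helper for illustration; not part of the PR.
const onlyLetters = new RegExp('^\\p{L}+$', 'u');
const casedLetter = new RegExp('[\\p{Lu}\\p{Ll}]', 'u');

interface CasingInfo {
    isLetters: boolean;  // every character is a letter of some script
    hasCase: boolean;    // at least one letter is uppercase or lowercase
    unchanged: boolean;  // upper- and lowercasing both return the input
}

function inspectCasing(value: string): CasingInfo {
    return {
        isLetters: onlyLetters.test(value),
        hasCase: casedLetter.test(value),
        unchanged: value.toLocaleUpperCase() === value && value.toLocaleLowerCase() === value,
    };
}

console.log(inspectCasing('一个测试')); // { isLetters: true, hasCase: false, unchanged: true }
console.log(inspectCasing('одинДва'));  // { isLetters: true, hasCase: true, unchanged: false }
```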

Summary

  • Fixes silent failures for non-Latin input in snippet transforms

  • Improves Unicode correctness without breaking existing behavior

  • Clearly scoped as an incremental improvement, not a universal linguistic solution


Final results (input -> upcase downcase camelcase pascalcase kebabcase snakecase):

одинДва -> ОДИНДВА одиндва одинДва ОдинДва один-два один_два (Russian)
一个测试 -> 一个测试 一个测试 一个测试 一个测试 一个测试 一个测试 (Simplified Chinese)
έναςΔύο -> ΈΝΑΣΔΎΟ έναςδύο έναςΔύο ΈναςΔύο ένας-δύο ένας_δύο (Greek)
ひらがなカタカナ -> ひらがなカタカナ ひらがなカタカナ ひらがなカタカナ ひらがなカタカナ ひらがなカタカナ ひらがなカタカナ (Japanese Hiragana + Katakana)
하나둘 -> 하나둘 하나둘 하나둘 하나둘 하나둘 하나둘 (Korean)
одинДва3 -> ОДИНДВА3 одиндва3 одинДва3 ОдинДва3 один-два3 один_два3 (Russian with number)
ένας_δύο -> ΈΝΑΣ_ΔΎΟ ένας_δύο έναςΔύο ΈναςΔύο ένας-δύο ένας_δύο (Greek with underscore)
こんにちはWorld -> こんにちはWORLD こんにちはworld こんにちはWorld こんにちはWorld world こんにちはworld (Japanese with English word)
واحدإثنين -> واحدإثنين واحدإثنين واحدإثنين واحدإثنين واحدإثنين واحدإثنين (Arabic)

Russian input before the regex changes:

одинДва -> ОДИНДВА одиндва одинДва одинДва одинДва одиндва (wrong formatting)

@vs-code-engineering

vs-code-engineering bot commented Jan 12, 2026

📬 CODENOTIFY

The following users are being notified based on files changed in this PR:

@jrieken

Matched files:

  • src/vs/editor/contrib/snippet/browser/snippetParser.ts
  • src/vs/editor/contrib/snippet/test/browser/snippetParser.test.ts

@lucas-gomes-santana
Contributor Author

@microsoft-github-policy-service agree

Contributor

@dmitrivMS dmitrivMS left a comment


You mentioned adding some tests - I think it would be really good in this case.

@lucas-gomes-santana
Contributor Author

lucas-gomes-santana commented Jan 14, 2026

@dmitrivMS I have now added tests for the modified regexes, including a test with Turkish. Waiting for review and feedback.
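
The Turkish test itself is not shown in this thread. As a hedged illustration of why Turkish is a useful case for locale-aware casing, the sketch below compares `toUpperCase` with `toLocaleUpperCase('tr-TR')`. Passing an explicit locale tag is an assumption made for the example; the PR description does not say which locale the transforms use. The helper names are hypothetical.

```typescript
// Hypothetical helpers for illustration; not the test added in the PR.
function upperDefault(value: string): string {
    // Locale-independent Unicode case mapping.
    return value.toUpperCase();
}

function upperTurkish(value: string): string {
    // Turkish tailoring: 'i' maps to dotted capital 'İ' (U+0130).
    return value.toLocaleUpperCase('tr-TR');
}

function lowerTurkish(value: string): string {
    // Turkish tailoring: 'I' maps to dotless 'ı' (U+0131).
    return value.toLocaleLowerCase('tr-TR');
}

console.log(upperDefault('istanbul')); // ISTANBUL
console.log(upperTurkish('istanbul')); // İSTANBUL
console.log(lowerTurkish('I'));        // ı
```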

@dmitrivMS
Contributor

Apologies for the delay, I'll review and respond in about 24h.

Contributor Author

@lucas-gomes-santana lucas-gomes-santana left a comment


I think these changes should work now. I mentioned the test logs in earlier comments.

@lucas-gomes-santana
Contributor Author

> Apologies for the delay, I'll review and respond in about 24h.

Have you had a chance to review my changes? I think everything is working now.

Member

@jrieken jrieken left a comment


Thanks for this @lucas-gomes-santana

@vs-code-engineering vs-code-engineering bot added this to the January 2026 milestone Jan 26, 2026
@jrieken jrieken enabled auto-merge January 26, 2026 16:05
@jrieken jrieken merged commit 283d8d0 into microsoft:main Jan 26, 2026
17 checks passed
@lucas-gomes-santana lucas-gomes-santana deleted the fix/snippet-unicode-support branch January 28, 2026 12:32
4 participants