Improve snippet case transforms support for non-Latin scripts (fix: #286165) #287150
dmitrivMS
left a comment
You mentioned adding some tests - I think it would be really good in this case.
@dmitrivMS I have now added tests for the modified regexes, including a test for Turkish. Waiting for review and feedback.
Apologies for the delay, I'll review and respond in about 24h.
lucas-gomes-santana
left a comment
I think these changes should work now. I mentioned the test logs in earlier comments.
Have you reviewed my changes yet? I think everything is working now.
jrieken
left a comment
Thanks for this @lucas-gomes-santana
Description
This PR addresses the problem reported in issue #286165. It improves snippet case transforms by replacing ASCII-only regular expressions with Unicode-aware patterns and locale-aware case mapping.
Previously, snippet transforms such as upcase, downcase, camelcase, pascalcase, kebabcase, and snakecase relied on [a-zA-Z]-based matching. As a result, non-Latin input (e.g. Cyrillic or Greek) was not recognized correctly and transforms were silently skipped, producing no output changes at all.
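To illustrate the failure mode described above, here is a minimal sketch comparing an ASCII-only character class with a Unicode property escape. The two patterns and the sample words are assumptions chosen for illustration; they are not the actual regexes from the snippet parser.

```typescript
// Hypothetical patterns for illustration only (not taken from the PR diff).
// The RegExp constructor is used so the "u" flag works regardless of compile target.
const asciiWord = new RegExp("^[a-zA-Z]+$");
const unicodeWord = new RegExp("^\\p{L}+$", "u");

const samples = ["hello", "привет", "αβγ"];

for (const s of samples) {
  // The ASCII-only class accepts only the Latin sample;
  // \p{L} accepts letters from any script.
  console.log(s, { ascii: asciiWord.test(s), unicode: unicodeWord.test(s) });
}
```

With an ASCII-only guard, the Cyrillic and Greek samples never match, so a transform gated on that match would leave them untouched.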
The changes in this PR:
Use Unicode property escapes (\p{L}, \p{Lu}, \p{Ll}, \p{Nd}) to properly detect letters and numbers across modern scripts.
Use locale-aware casing (toLocaleLowerCase / toLocaleUpperCase) instead of ASCII-only case conversion.
Preserve existing behavior for Latin input while improving support for scripts that have uppercase/lowercase distinctions (e.g. Cyrillic, Greek).
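The combination of the changes listed above can be sketched as follows. This is a rough illustration under stated assumptions, not the code in this PR: `toSnakeCase` and its optional `locale` parameter are hypothetical names, and the word-boundary rule (a lowercase letter or digit followed by an uppercase letter) is a simplification.

```typescript
// Hypothetical helper for illustration; not the PR's implementation.
function toSnakeCase(value: string, locale?: string): string {
  // A word boundary is assumed wherever a lowercase letter or decimal digit
  // is immediately followed by an uppercase letter, in any cased script.
  const boundary = new RegExp("(\\p{Ll}|\\p{Nd})(\\p{Lu})", "gu");
  return value
    .replace(boundary, "$1_$2")
    .replace(/[\s\-]+/g, "_")
    .toLocaleLowerCase(locale);
}

console.log(toSnakeCase("helloWorld"));  // Latin input behaves as before
console.log(toSnakeCase("приветМир"));   // Cyrillic input is now split and lowercased

// Locale-aware casing matters for Turkish, where "i" uppercases to dotted "İ" (U+0130).
console.log("i".toLocaleUpperCase("tr"));
console.log("i".toUpperCase());
```

Passing an explicit locale is a design choice in this sketch; which locale the real transforms use is not stated in the description.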
Limitations
This change does not aim to provide a fully language-aware or linguistically perfect solution for all scripts.
Word-based transforms (camelCase, PascalCase, kebab-case, snake_case) inherently rely on uppercase/lowercase transitions and therefore cannot be meaningfully applied to scripts without case (e.g. Chinese, Japanese, Arabic, Hebrew).
For such scripts, transforms effectively become no-ops, which is consistent with current behavior and preferable to producing arbitrary or destructive output.
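A small check can make the no-op behavior concrete. The sketch below assumes a hypothetical helper, `isCaseless`, and sample strings chosen for illustration; it only shows that letters in these scripts fall under neither `\p{Lu}` nor `\p{Ll}` and that case conversion leaves them unchanged.

```typescript
// Hypothetical helper for illustration; not part of the PR.
const casedLetter = new RegExp("[\\p{Lu}\\p{Ll}]", "u");

function isCaseless(value: string): boolean {
  // True when the string has no cased letters and case mapping changes nothing.
  return (
    !casedLetter.test(value) &&
    value.toLocaleUpperCase() === value &&
    value.toLocaleLowerCase() === value
  );
}

for (const s of ["Привет", "你好", "こんにちは", "שלום"]) {
  console.log(s, isCaseless(s));
}
```

Only the Cyrillic sample has cased letters, so a transform that keys on uppercase/lowercase transitions has nothing to act on in the other three.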
Summary
Fixes silent failures for non-Latin input in snippet transforms
Improves Unicode correctness without breaking existing behavior
Clearly scoped as an incremental improvement, not a universal linguistic solution
Final inputs:
Russian input before the regex changes: