-
Notifications
You must be signed in to change notification settings - Fork 9k
Unify UTF-8 handling using til::u8u16 & revise WriteConsoleAImpl #4422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…d of a couple of warnings in `_stream.cpp`
|
|
||
| // Convert our input parameters to Unicode | ||
| std::unique_ptr<wchar_t[]> wideCharBuffer{ nullptr }; | ||
| static Utf8ToWideCharParser parser{ gci.OutputCP }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IGNORE, SEE BELOW
this one unfortunately cannot change just yet. gci.OutputCP must be the user's console output codepage for now and the forseeable future.
This poses an interesting conundrum for u8u16: should we have "ASCII" versions that take a codepage? au16 and u16a? I'm not sure 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it's actually part of the u*state itself...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OH, I understand. You can almost entirely ignore this comment. I was confused because this was created outside the CP_UTF8 block.
I also understand that writing in another codepage means we need to kill the u8 state -- so we can't move it inside this block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DHowett-MSFT Your understanding is correct. The reset() method is called in the else branch where codepages other than UTF-8 are processed. I knew I would need the reset here, that's why I implemented it from the beginning.
This poses an interesting conundrum for u8u16: should we have "ASCII" versions that take a codepage? au16 and u16a? I'm not sure 😄
Interesting indeed. Well, we already have function ConvertToW (and ConvertToA). But that's not simply applicable in WriteConsoleAImpl() due to the fact that we may receive DBCS-encoded text where caching of partials is required, too. @miniksa briefly mentioned that in #386 (comment)
I only have a poor understanding of how DBCS has to be processed though. The manpage of IsDBCSLeadByte states that even if you validated a lead byte you may not rely on MultiByteToWideChar being able to process the substring correctly. So, unfortunately I don't know enough to revise the DBCS handling. And I don't know if there is any other function in the code base where DBCS has to be converted. Hence it might not worth the effort to bring it into a separate function.
However, I offer to do my best to get rid of new and delete in WriteConsoleAImpl() and instead use the wstring that we already have for the UTF-8 conversion.
// EDIT done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@german-one, if we theoretically made a au16 and a u16a to supercede ConvertToW and ConvertToA, there would probably be some sort of astate variable required. That variable could be stored on the assorted input handles to ensure that the problem briefly discussed in #386 is rectified.
Off hand, I believe there are probably several points in the code base that could be unified behind such a convergence function in a similar way to converging the u8/u16 problem. But I haven't enumerated them recently, so I may be incorrect.
I believe that the only circumstance where we'd really pay attention to IsDBCSLeadByte and hold onto it until the next call is if a string ends in a lead byte (then it would take the lead byte off the end and cache it for the next call in the astate or equivalent.) If it's in the middle of a run, we wouldn't care and just pass it into MultiByteToWideChar probably with the replacement character flag on so it would remove invalid DBCS representations. The next write would have the stored DBCS lead prefixed to whatever comes next, even if it's not valid together, and we'd let MB2WC sort it out and replace it.
The last provision to consider is that I believe the state would be discarded anytime the code page changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miniksa u16a would be easy. We only need to take care of split surrogates, and that's what u16state already does.
As to au16 - IsDBCSLeadByteEx isn't used in the current implementation of CheckBisectStringA. It would be my attempt to implement an astate though. The remarks found on the manpage still make me wonder if it would be bulletproof for DBCS.
But even if it was that simple, we would still only have conversions for SBCS, DBCS (the few mentioned on the manpage), and UTF-8. Guess what happens if we receive UTF-7 that was split inside of a base64 sequence 🤕
…garian notation; unify camel case
|
Removed |
til::u8u16til::u8u16, revise WriteConsoleAImpl
|
The latest commit is intended to improve the readability of |
miniksa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine to me.
There's still some opportunities in _stream.cpp to use some of the new Chromium math stuff instead of the hard gsl casts and the old SizeTToUInt math functions.
There's also the big opportunity to hopefully eliminate a lot of the dangerousish pointer dances going on around the MBCS handling. The clean up is just incremental progress here. I'd really love to see it get to iterators and maybe even make/use a future til::au16 function that eliminates the need to do some of the odd counting.
But for now, this is better than what we had. So I'm good with it.
DHowett-MSFT
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I had to read these functions side-by-side (original and new), but I think I trust it. Thank you, @german-one!
til::u8u16, revise WriteConsoleAImpl|
🎉 Once again, thanks for the contribution! This pull request was included in a set of conhost changes that was just |
Summary of the Pull Request
Replace
utf8Parserwithtil::u8u16in order to have the same conversion algorithms used in terminal and conhost.References
This PR is a follow up of #4093
PR Checklist
utf8ToWideCharParser.cpp#3378Detailed Description of the Pull Request / Additional comments
This PR addresses item 2 in this list:
✉ Implement
til::u8u16andtil::u16u8(done in PR Implement til::u8u16 and til::u16u8 conversion functions #4093)✔ Unify UTF-8 handling using
til::u8u16(this PR)2.1. ✔ Update VtInputThread::_HandleRunInput()
2.2. ✔ Update ApiRoutines::WriteConsoleAImpl()
2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use
❌ Enable BOM discarding (follow up)
3.1. ❌ extend
til::u8u16andtil::u16u8with a 3rd parameter to enable discarding the BOM3.2. ❌ Make use of the 3rd parameter to discard the BOM in all current function callers, or (optional / ask the core team) make it the default for
til::u8u16andtil::u16u8❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too (follow up)
@miniksa @DHowett-MSFT
utf8ToWideCharParser.cpp#3378Utf8ToWideCharParsernow that it isn't used anymoreValidation Steps Performed
printftested as shown in Certain invalid UTF-8 sequences can cause the output to fail #4086