decoder2: Add support for surrogates #25193

Larsimusrex · 2025-08-29T18:46:46Z

Will now decode UTF-16 surrogate pairs, which some encoders use to escape characters outside the Basic Multilingual Plane.

println(decoder2.decode[string](r'"\ud83d\ude00"')!) // '😀'
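For reference, the two escapes in the example combine into one code point using the standard UTF-16 arithmetic. The Python sketch below only illustrates that arithmetic; it is not the decoder2 implementation, and `combine_surrogates` is a hypothetical helper name.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE00)
print(f"U+{code_point:X}")  # U+1F600, the code point of the emoji in the example
```

A decoder also has to decide what to do with a lone or out-of-order surrogate; the sketch simply raises `ValueError` and makes no claim about what decoder2 does in that case.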

spytheman · 2025-08-30T08:36:02Z

+
+fn test_surrogate() {
+	assert decoder2.decode[string](r'"\ud83d\ude00"')! == '😀'
+	assert decoder2.decode[string](r'"\ud83d\ude00 text"')! == '😀 text'


What will be the result, if you later JSON encode the decoded string?
Can you please add a test for that too?

The json2 encoder currently handles these characters incorrectly; see #25115. My new implementation outputs UTF-8 by default unless specified otherwise.

spytheman · 2025-08-30T08:36:59Z

Excellent work @Larsimusrex.

I am curious: which JSON encoders produce such output?
If they are easy to install on the CI, we can add round-trip tests with them.

Larsimusrex · 2025-08-30T09:28:31Z

Python definitely does it by default. I think Java and C# do too.
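The Python behaviour is easy to check with the standard `json` module: `json.dumps` defaults to `ensure_ascii=True`, which escapes non-BMP characters as a surrogate pair. A short sketch (the Java and C# behaviour is not verified here):

```python
import json

s = "\U0001F600"  # the emoji from the PR description, outside the BMP

escaped = json.dumps(s)                  # ensure_ascii=True is the default
raw = json.dumps(s, ensure_ascii=False)  # emit the character itself instead

print(escaped)  # "\ud83d\ude00"

# Both forms decode back to the same string, so a round trip is lossless.
assert json.loads(escaped) == s
assert json.loads(raw) == s
```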

spytheman · 2025-08-30T10:00:42Z

Python is already preinstalled on the CI, so we can add a V test that invokes a Python program. The program prints a JSON-encoded value with surrogates to stdout; the V test then decodes that output and asserts on the result.

We also support // vtest build: present_python? so that the V test can be skipped in environments that do not have Python.
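The Python side of such a test could be as small as the sketch below. The layout (one JSON value per line) and the names `CASES` and `encoded_cases` are assumptions for illustration, not something taken from the V repository; the strings are the ones used in this PR's test.

```python
import json
import sys

# Strings taken from the test cases in this PR's diff.
CASES = ["\U0001F600", "\U0001F600 text"]


def encoded_cases() -> list[str]:
    """JSON-encode each case; json.dumps escapes non-ASCII by default."""
    return [json.dumps(case) for case in CASES]


if __name__ == "__main__":
    # One JSON value per line, pure ASCII, so the stdout encoding does not matter.
    sys.stdout.write("\n".join(encoded_cases()) + "\n")
```

Because the output is pure ASCII, the V test can compare each decoded line against the expected string without worrying about the console encoding on the CI runner.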

decoder2: Add support for surrogates

f7f4ac4

spytheman reviewed Aug 30, 2025


spytheman merged commit 24f9128 into vlang:master Aug 30, 2025
79 checks passed
