json2 can't encode runes of more than two bytes #25115

Description

@jorgeluismireles
V version: V 0.4.11 bbb61ab
V full version: V 0.4.11 7dc3889.bbb61ab
OS: linux, Ubuntu 22.04.5 LTS
Processor: 4 cpus, 64bit, little endian, Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz
Memory: 0.48GB/3.7GB
V executable: /home/jorge/v/v
V last modified time: 2025-08-14 17:07:08
V home dir: OK, value: /home/jorge/v
VMODULES: OK, value: /home/jorge/.vmodules
VTMP: OK, value: /tmp/v_1000
Current working dir: OK, value: /home/jorge/bugs
Git version: git version 2.34.1
V git status: weekly.2025.16-626-gbbb61ab3
.git/config present: true
cc version: cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
gcc version: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
clang version: Ubuntu clang version 14.0.0-1ubuntu1.1
tcc version: tcc version 0.9.28rc 2025-02-13 HEAD@f8bd136d (x86_64 Linux)
tcc git status: thirdparty-linux-amd64 696c1d84
emcc version: emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.5 ()
glibc version: ldd (Ubuntu GLIBC 2.35-0ubuntu3.10) 2.35

What did you do?
./v -g -o vdbg cmd/v && ./vdbg json_emoji/json_emoji.v && json_emoji/json_emoji

module main

import x.json2

struct Struct {
	a []string
}

fn main() {
	s := Struct{
		a: ['\0', '\t', 'a', 'ñ', '♥', '𝄞']
	}
	input := r'{"a":["\u0000","\t","a","\u00f1","\u2665","\ud834\udd1e"]}'

	// custom string encode
	mut bytes := []u8{}
	bytes << '{"a":['.bytes()
	for j in 0 .. s.a.len {
		if j > 0 {
			bytes << `,`
		}
		custom_string_encode(s.a[j], mut bytes)
	}
	bytes << ']}'.bytes()
	print_ascii('Custom encode: ', bytes)
	assert bytes.bytestr() == input

	// json2 encode
	e2 := json2.encode(s)
	print_ascii('json2 encode: ', e2.bytes())
	assert e2 == input
}

const escape = u8(`\\`)

// custom_string_encode converts a V string, including runes,
// into JSON-formatted string bytes.
fn custom_string_encode(s string, mut bytes []u8) {
	bytes << `"`
	for r in s.runes() {
		if u32(r) < 0x7F {
			match r {
				`"`, `\\`, `/` {
					bytes << [escape, r]
				}
				`\b` {
					bytes << [escape, `b`]
				}
				`\f` {
					bytes << [escape, `f`]
				}
				`\n` {
					bytes << [escape, `n`]
				}
				`\r` {
					bytes << [escape, `r`]
				}
				`\t` {
					bytes << [escape, `t`]
				}
				else {
					if r < 0x20 {
						bytes << [escape, `u`]
						bytes << `0`
						bytes << `0`
						bytes << hexa[(r >> 4) & 15]
						bytes << hexa[(r >> 0) & 15]
					} else {
						bytes << r
					}
				}
			}
		} else if r < 0x10000 {
			// Example: ñ = c3b1 -> \u00f1
			// Example: ♥ = e299a5 -> \u2665
			// convert rune to string json format \uxxxx
			bytes << [escape, `u`]
			bytes << hexa[(r >> 12) & 15]
			bytes << hexa[(r >> 8) & 15]
			bytes << hexa[(r >> 4) & 15]
			bytes << hexa[(r >> 0) & 15]
		} else {
			// Use surrogate pair
			// Example 👋 = 0x1f44b // 0001111101|0001001011 : two ten-bit groups
			v := u32(r - 0x10000)   // 0000111101|0001001011 : subtract 0x10000
			//                         hhhhhhhhhh|llllllllll : hi part | low part
			hi := v >> 10           // hhhhhhhhhh = 0000111101
			lo := v & 0x3ff         // llllllllll = 0001001011
			u1 := 0xd800 + hi       // 1101_1000_0000_0000 + 00_0011_1101
			u2 := 0xdc00 + lo       // 1101_1100_0000_0000 + 00_0100_1011
			bytes << [escape, `u`]
			bytes << hexa[(u1 >> 12) & 15]
			bytes << hexa[(u1 >> 8) & 15]
			bytes << hexa[(u1 >> 4) & 15]
			bytes << hexa[(u1 >> 0) & 15]

			bytes << [escape, `u`]
			bytes << hexa[(u2 >> 12) & 15]
			bytes << hexa[(u2 >> 8) & 15]
			bytes << hexa[(u2 >> 4) & 15]
			bytes << hexa[(u2 >> 0) & 15]
		}
	}
	bytes << `"`
}

// vfmt off
const hexa = [ `0`,`1`,`2`,`3`,`4`,`5`,`6`,`7`,`8`,`9`,`a`,`b`,`c`,`d`,`e`,`f`]!
// vfmt on


fn print_ascii(pre string, bytes []u8) {
	print('${pre}: ')
	for letter in bytes {
		if letter > 0x20 && letter < 0x7f {
			print(`\0` + letter)
		} else {
			print('{${letter:x}}')
		}
	}
	println('')
}

What did you see?

Custom encode: : {"a":["\u0000","\t","a","\u00f1","\u2665","\ud834\udd1e"]}
json2 encode: : {"a":["\u0000","\t","a","{c3}{b1}","\u2665","{f0}{9d}{84}{9e}"]}
json_emoji/json_emoji.v:31: FAIL: fn main.main: assert e2 == input
   left value: e2 = {"a":["\u0000","\t","a","ñ","\u2665","𝄞"]}
  right value: input = {"a":["\u0000","\t","a","\u00f1","\u2665","\ud834\udd1e"]}
V panic: Assertion failed...
v hash: bbb61ab
/tmp/v_1000/json_emoji.01K2QAZ1E40Y1X0SAN2VN3EKAY.tmp.c:4718: at _v_panic: Backtrace
/tmp/v_1000/json_emoji.01K2QAZ1E40Y1X0SAN2VN3EKAY.tmp.c:10753: by main__main
/tmp/v_1000/json_emoji.01K2QAZ1E40Y1X0SAN2VN3EKAY.tmp.c:10869: by main

What did you expect to see?

All asserts pass.

Problem

JSON string escapes encode Unicode in 16-bit units of the form \uxxxx, where each x is a hex digit in the 0-F range. By comparison, TOML accepts both \uxxxx and \Uxxxxxxxx, so it can escape any code point directly. For JSON to escape code points above U+FFFF, a surrogate pair is advised:

To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a
twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for
example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

The surrogate pair algorithm is described here
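
As a worked example of the quoted rule, here is a minimal sketch in V that splits the G clef code point into its two surrogates. The name surrogate_pair is a hypothetical helper chosen for illustration; it is not part of json2.

```v
module main

// surrogate_pair splits a code point above U+FFFF into its UTF-16
// high and low surrogates. Hypothetical helper, not part of json2.
fn surrogate_pair(cp u32) (u32, u32) {
	v := cp - 0x10000 // leaves a 20-bit value
	hi := u32(0xd800) + (v >> 10) // top 10 bits go into the high surrogate
	lo := u32(0xdc00) + (v & 0x3ff) // bottom 10 bits go into the low surrogate
	return hi, lo
}

fn main() {
	hi, lo := surrogate_pair(0x1d11e) // G clef, U+1D11E
	assert hi == 0xd834
	assert lo == 0xdd1e
	println('${hi:04x} ${lo:04x}')
}
```

The two asserts match the \uD834\uDD1E pair given in the quoted text.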

Code

The program attached to this issue includes a custom string encoder that implements the surrogate pair algorithm, so that a rune like 𝄞 (U+1D11E, described in the PDF linked above) is encoded as the surrogate pair \ud834\udd1e. As a test, an array of six strings is encoded with both the custom encoder and json2, and each result is compared against the expected JSON.

Results

The json2 encoder correctly escapes ♥ as \u2665 (see #25103), but it does not implement surrogate pairs for runes like 𝄞. Runes like ñ are not escaped either. For those runes json2 emits the raw UTF-8 bytes (the first of which is always > 0x7F) instead of a \uxxxx escape, so the output does not match the expected ASCII-only JSON.
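
Because UTF-8 byte length and the number of \uxxxx escapes are easy to confuse, the sketch below prints both for the three runes in question. escapes_needed is a hypothetical helper introduced only for this illustration.

```v
module main

// escapes_needed reports how many \uxxxx escapes a rune takes in JSON:
// one for code points up to U+FFFF, two (a surrogate pair) above that.
// Hypothetical helper for illustration, not part of json2.
fn escapes_needed(r rune) int {
	return if u32(r) < 0x10000 { 1 } else { 2 }
}

fn main() {
	for s in ['ñ', '♥', '𝄞'] {
		r := s.runes()[0]
		// s.len is the UTF-8 byte length of the string
		println('${s}: code point 0x${u32(r):x}, ${s.len} UTF-8 bytes, ${escapes_needed(r)} escape(s)')
	}
}
```

Note that ♥ takes three UTF-8 bytes (e299a5) yet still fits in a single escape, while ñ takes two bytes (c3b1) and also needs only one; only 𝄞 requires a surrogate pair.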

So the idea is to incorporate a surrogate pair algorithm into the json2 string encoding function https://github.com/vlang/v/blob/master/vlib/x/json2/encoder.v#L472 . I think one of the json2 maintainers can make a better PR to extend the encoder than I can.
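
One possible shape for such a change, sketched under assumptions: the repeated hex-digit lines of the custom encoder are factored into a single helper, and a second function picks between one escape and a surrogate pair. The names append_u_escape, append_rune_escaped and hex_digits are hypothetical and do not reflect how json2 is actually structured.

```v
module main

const hex_digits = '0123456789abcdef'

// append_u_escape appends one \uxxxx escape for a 16-bit unit.
// Hypothetical helper, not an existing json2 function.
fn append_u_escape(mut buf []u8, unit u32) {
	buf << u8(`\\`)
	buf << u8(`u`)
	buf << hex_digits[int((unit >> 12) & 0xf)]
	buf << hex_digits[int((unit >> 8) & 0xf)]
	buf << hex_digits[int((unit >> 4) & 0xf)]
	buf << hex_digits[int(unit & 0xf)]
}

// append_rune_escaped appends r as a single escape when it fits in
// 16 bits, otherwise as a UTF-16 surrogate pair.
fn append_rune_escaped(mut buf []u8, r rune) {
	cp := u32(r)
	if cp < 0x10000 {
		append_u_escape(mut buf, cp)
		return
	}
	v := cp - 0x10000
	append_u_escape(mut buf, 0xd800 + (v >> 10))
	append_u_escape(mut buf, 0xdc00 + (v & 0x3ff))
}

fn main() {
	mut buf := []u8{}
	for r in 'ñ♥𝄞'.runes() {
		append_rune_escaped(mut buf, r)
	}
	assert buf.bytestr() == r'\u00f1\u2665\ud834\udd1e'
	println(buf.bytestr())
}
```

This sketch escapes every rune it is given; a real encoder would still need the ASCII and control-character handling shown in the program above.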

What's next

The json2 decoder seems to decode these runes only partially correctly. The json module has the same problem.
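
For the decoding direction, the inverse step is to recombine a \uD8xx\uDCxx pair into one code point. The following is a minimal sketch of that arithmetic only; combine_surrogates is a hypothetical name, and nothing here describes what the json2 decoder currently does.

```v
module main

// combine_surrogates rebuilds a code point from a UTF-16 surrogate pair.
// It returns none when the inputs are not a valid high/low pair.
// Hypothetical helper, not part of json2.
fn combine_surrogates(hi u32, lo u32) ?rune {
	if hi < 0xd800 || hi > 0xdbff || lo < 0xdc00 || lo > 0xdfff {
		return none
	}
	return rune(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00))
}

fn main() {
	r := combine_surrogates(0xd834, 0xdd1e) or { panic('not a surrogate pair') }
	assert u32(r) == 0x1d11e // G clef, U+1D11E
	println(r.str())
}
```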
