json2 can't encode runes of more than two bytes #25115

<details>
<summary>V version: V 0.4.11 bbb61ab, press to see full `v doctor` output</summary>

|V full version      |V 0.4.11 7dc3889f19e7bdc6fb803ad4d1d65fd9d50c6e0c.bbb61ab
|:-------------------|:-------------------
|OS                  |linux, Ubuntu 22.04.5 LTS
|Processor           |4 cpus, 64bit, little endian, Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz
|Memory              |0.48GB/3.7GB
|                    |
|V executable        |/home/jorge/v/v
|V last modified time|2025-08-14 17:07:08
|                    |
|V home dir          |OK, value: /home/jorge/v
|VMODULES            |OK, value: /home/jorge/.vmodules
|VTMP                |OK, value: /tmp/v_1000
|Current working dir |OK, value: /home/jorge/bugs
|                    |
|Git version         |git version 2.34.1
|V git status        |weekly.2025.16-626-gbbb61ab3
|.git/config present |true
|                    |
|cc version          |cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
|gcc version         |gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
|clang version       |Ubuntu clang version 14.0.0-1ubuntu1.1
|tcc version         |tcc version 0.9.28rc 2025-02-13 HEAD@f8bd136d (x86_64 Linux)
|tcc git status      |thirdparty-linux-amd64 696c1d84
|emcc version        |emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.5 ()
|glibc version       |ldd (Ubuntu GLIBC 2.35-0ubuntu3.10) 2.35

</details>

**What did you do?**
`./v -g -o vdbg cmd/v && ./vdbg  json_emoji/json_emoji.v && json_emoji/json_emoji`
```v
module main

import x.json2

struct Struct {
	a []string
}

fn main() {
	s := Struct{
		a: ['\0', '\t', 'a', 'ñ', '♥', '𝄞']
	}
	input := r'{"a":["\u0000","\t","a","\u00f1","\u2665","\ud834\udd1e"]}'

	// custom string encode
	mut bytes := []u8{}
	bytes << '{"a":['.bytes()
	for j in 0 .. s.a.len {
		if j > 0 {
			bytes << `,`
		}
		custom_string_encode(s.a[j], mut bytes)
	}
	bytes << ']}'.bytes()
	print_ascii('Custom encode: ', bytes)
	assert bytes.bytestr() == input

	// json2 encode
	e2 := json2.encode(s)
	print_ascii('json2 encode: ', e2.bytes())
	assert e2 == input
}

const escape = u8(`\\`)

// custom_string_encode converts a V string, including its runes,
// into JSON-formatted string bytes.
fn custom_string_encode(s string, mut bytes []u8) {
	bytes << `"`
	for r in s.runes() {
		if u32(r) < 0x7F {
			match r {
				`"`, `\\`, `/` {
					bytes << [escape, r]
				}
				`\b` {
					bytes << [escape, `b`]
				}
				`\f` {
					bytes << [escape, `f`]
				}
				`\n` {
					bytes << [escape, `n`]
				}
				`\r` {
					bytes << [escape, `r`]
				}
				`\t` {
					bytes << [escape, `t`]
				}
				else {
					if r < 0x20 {
						bytes << [escape, `u`]
						bytes << `0`
						bytes << `0`
						bytes << hexa[(r >> 4) & 15]
						bytes << hexa[(r >> 0) & 15]
					} else {
						bytes << r
					}
				}
			}
		} else if r < 0x10000 {
			// Example: ñ = c3b1 -> \u00f1
			// Example: ♥ = e299a5 -> \u2665
			// convert rune to string json format \uxxxx
			bytes << [escape, `u`]
			bytes << hexa[(r >> 12) & 15]
			bytes << hexa[(r >> 8) & 15]
			bytes << hexa[(r >> 4) & 15]
			bytes << hexa[(r >> 0) & 15]
		} else {
			// Use surrogate pair
			// Example 👋 = 0x1f44b // 0001111101|0001001011 : two ten-bit groups
			v := u32(r - 0x10000)   // 0000111101|0001001011 : subtract 0x10000
			//                         hhhhhhhhhh|llllllllll : hi part | low part
			hi := v >> 10           // hhhhhhhhhh = 0000111101
			lo := v & 0x3ff         // llllllllll = 0001001011
			u1 := 0xd800 + hi       // 1101_1000_0000_0000 + 00_0011_1101 = 0xd83d
			u2 := 0xdc00 + lo       // 1101_1100_0000_0000 + 00_0100_1011 = 0xdc4b
			bytes << [escape, `u`]
			bytes << hexa[(u1 >> 12) & 15]
			bytes << hexa[(u1 >> 8) & 15]
			bytes << hexa[(u1 >> 4) & 15]
			bytes << hexa[(u1 >> 0) & 15]

			bytes << [escape, `u`]
			bytes << hexa[(u2 >> 12) & 15]
			bytes << hexa[(u2 >> 8) & 15]
			bytes << hexa[(u2 >> 4) & 15]
			bytes << hexa[(u2 >> 0) & 15]
		}
	}
	bytes << `"`
}

// vfmt off
const hexa = [ `0`,`1`,`2`,`3`,`4`,`5`,`6`,`7`,`8`,`9`,`a`,`b`,`c`,`d`,`e`,`f`]!
// vfmt on


fn print_ascii(pre string, bytes []u8) {
	print('${pre}: ')
	for letter in bytes {
		if letter > 0x20 && letter < 0x7f {
			print(`\0` + letter)
		} else {
			print('{${letter:x}}')
		}
	}
	println('')
}

```

**What did you see?**
```
Custom encode: : {"a":["\u0000","\t","a","\u00f1","\u2665","\ud834\udd1e"]}
json2 encode: : {"a":["\u0000","\t","a","{c3}{b1}","\u2665","{f0}{9d}{84}{9e}"]}
json_emoji/json_emoji.v:31: FAIL: fn main.main: assert e2 == input
   left value: e2 = {"a":["\u0000","\t","a","ñ","\u2665","𝄞"]}
  right value: input = {"a":["\u0000","\t","a","\u00f1","\u2665","\ud834\udd1e"]}
V panic: Assertion failed...
v hash: bbb61ab
/tmp/v_1000/json_emoji.01K2QAZ1E40Y1X0SAN2VN3EKAY.tmp.c:4718: at _v_panic: Backtrace
/tmp/v_1000/json_emoji.01K2QAZ1E40Y1X0SAN2VN3EKAY.tmp.c:10753: by main__main
/tmp/v_1000/json_emoji.01K2QAZ1E40Y1X0SAN2VN3EKAY.tmp.c:10869: by main
```

**What did you expect to see?**

Both asserts pass.

**Problem**

A JSON string escape represents a code point with four hexadecimal digits (16 bits), in the form `\uXXXX` where each `X` is in the `0-F` range. By comparison, TOML accepts both `\uXXXX` and `\UXXXXXXXX`, so it can escape any code point directly. For JSON to escape code points above U+FFFF, the spec [advises](https://ecma-international.org/wp-content/uploads/ECMA-404_2nd_edition_december_2017.pdf) a UTF-16 surrogate pair:

> To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a
twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for
example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

The surrogate pair algorithm is described [here](https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF)
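
As a hand check of that algorithm, here is the arithmetic for the G clef from the spec quote (U+1D11E). The names `v`, `hi`, `lo`, `u1` and `u2` follow the repro program above and are not part of any json2 API:

```math
\begin{aligned}
v &= \mathtt{0x1D11E} - \mathtt{0x10000} = \mathtt{0xD11E} = 0000110100\,0100011110_2 \\
\mathit{hi} &= v \gg 10 = 0000110100_2 = \mathtt{0x034} \\
\mathit{lo} &= v \mathbin{\&} \mathtt{0x3FF} = 0100011110_2 = \mathtt{0x11E} \\
u_1 &= \mathtt{0xD800} + \mathit{hi} = \mathtt{0xD834} \\
u_2 &= \mathtt{0xDC00} + \mathit{lo} = \mathtt{0xDD1E}
\end{aligned}
```

This gives `\ud834\udd1e`, which matches the expected string in the repro.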

**Code**

The program attached to this issue includes a custom string encoder that implements the surrogate pair algorithm, so that runes like 𝄞 (U+1D11E, the example in the PDF linked above) are encoded as the surrogate pair `\ud834\udd1e`. As a test, an array of six single-rune strings is encoded with both the custom encoder and json2, and each result is compared against the expected JSON.

**Results**

The json2 encoder correctly escapes some runes that fit in 16 bits, such as ♥ (see https://github.com/vlang/v/issues/25103), but it does not implement the surrogate pair needed to escape runes like 𝄞. Runes like `ñ` are not escaped either. For these runes json2 writes the raw UTF-8 bytes, each above 0x7F, instead of a `\uXXXX` escape, which is inconsistent with how ♥ is handled.
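
For reference, the bytes that `print_ascii` shows for `ñ` and `𝄞` in the json2 output are their plain UTF-8 encodings. A hand derivation, assuming the standard 2-byte and 4-byte UTF-8 layouts:

```math
\begin{aligned}
\text{U+00F1} &= 00011\,110001_2 \;\to\; 110\,00011_2,\ 10\,110001_2 = \mathtt{C3\ B1} \\
\text{U+1D11E} &= 000\,011101\,000100\,011110_2 \;\to\; 11110\,000_2,\ 10\,011101_2,\ 10\,000100_2,\ 10\,011110_2 = \mathtt{F0\ 9D\ 84\ 9E}
\end{aligned}
```

These match the `{c3}{b1}` and `{f0}{9d}{84}{9e}` sequences in the output above.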

So the idea is to add the surrogate pair algorithm to the json2 string encoding function: https://github.com/vlang/v/blob/master/vlib/x/json2/encoder.v#L472 . I think one of the json2 maintainers could write a better PR to extend the encoder than I could.

**What's next**

The `json2` decoder seems to decode these runes only partially. The `json` module appears to have the same problem.

> [!NOTE]
> You can use the 👍 reaction to increase the issue's priority for developers.
>
> Please note that only the 👍 reaction to the issue itself counts as a vote.
> Other reactions and those to comments will not be taken into account.
