Optimize branch structure of UTF-8 decoder routine

alexdowad · alexdowad · commit 092ad3e4624f · 2023-01-08T17:27:19.000+02:00
I like the asm which gcc -O3 generates on this modified code...
and guess what: my CPU likes it too!

(The asm is noticeably tighter, without any extra operations in the
path which dispatches to the code for decoding a 1-byte, 2-byte,
3-byte, or 4-byte character. It's just CMP, conditional jump, CMP,
conditional jump, CMP, conditional jump.

...Though I was admittedly impressed to see gcc could implement the
boolean expression `c >= 0xC2 && c <= 0xDF` with just 3 instructions:
add, CMP, then conditional jump. Pretty slick stuff there, guys.)
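
For anyone who wants to poke at this outside of mbstring, here is a minimal
standalone sketch of the two branch structures. It only classifies the lead
byte; it does not decode anything. The function names are made up for
illustration and do not exist in mbfilter_utf8.c. The last helper shows one
common way to express a two-sided range check as a single subtraction plus one
unsigned compare, which is presumably close to what gcc did for the old
condition (an assumption; I have not tried to prove it from the asm).

```c
/* Hypothetical helpers for illustration only; not part of libmbfl.
 * Each returns the sequence length implied by a UTF-8 lead byte,
 * or 0 if the byte cannot start a valid sequence. */

/* Branch structure before this commit: two-sided range checks. */
static int utf8_len_ranges(unsigned char c)
{
	if (c < 0x80) {
		return 1;
	} else if (c >= 0xC2 && c <= 0xDF) {
		return 2;
	} else if (c >= 0xE0 && c <= 0xEF) {
		return 3;
	} else if (c >= 0xF0 && c <= 0xF4) {
		return 4;
	} else {
		return 0;
	}
}

/* Branch structure after this commit: bytes 0x80..0xC1 are rejected up
 * front, so every later test only needs an upper bound. */
static int utf8_len_chain(unsigned char c)
{
	if (c < 0x80) {
		return 1;
	} else if (c < 0xC2) {
		return 0; /* continuation byte or overlong lead byte */
	} else if (c <= 0xDF) {
		return 2;
	} else if (c <= 0xEF) {
		return 3;
	} else if (c <= 0xF4) {
		return 4;
	} else {
		return 0;
	}
}

/* Two-sided range check as one subtraction and one unsigned compare.
 * Bytes below 0xC2 wrap around to large values after the cast, so a
 * single upper-bound test covers both ends of the range. */
static int in_2byte_range(unsigned char c)
{
	return (unsigned char)(c - 0xC2) <= (0xDF - 0xC2);
}
```

Both classifiers should agree on all 256 possible byte values, which is easy
to confirm with a loop.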

Benchmark results:

UTF-8, short - to UTF-16LE  faster by 7.36% (0.0001 vs 0.0002)
UTF-8, short - to UTF-16BE  faster by 6.24% (0.0001 vs 0.0002)
UTF-8, medium - to UTF-16BE faster by 4.56% (0.0003 vs 0.0003)
UTF-8, medium - to UTF-16LE faster by 4.00% (0.0003 vs 0.0003)
UTF-8, long - to UTF-16BE   faster by 1.02% (0.0215 vs 0.0217)
UTF-8, long - to UTF-16LE   faster by 1.01% (0.0209 vs 0.0211)
diff --git a/ext/mbstring/libmbfl/filters/mbfilter_utf8.c b/ext/mbstring/libmbfl/filters/mbfilter_utf8.c
@@ -225,7 +225,9 @@ static size_t mb_utf8_to_wchar(unsigned char **in, size_t *in_len, uint32_t *buf
 
 		if (c < 0x80) {
 			*out++ = c;
-		} else if (c >= 0xC2 && c <= 0xDF) { /* 2 byte character */
+		} else if (c < 0xC2) {
+			*out++ = MBFL_BAD_INPUT;
+		} else if (c <= 0xDF) { /* 2 byte character */
 			if (p < e) {
 				unsigned char c2 = *p++;
 				if ((c2 & 0xC0) != 0x80) {
@@ -237,7 +239,7 @@ static size_t mb_utf8_to_wchar(unsigned char **in, size_t *in_len, uint32_t *buf
 			} else {
 				*out++ = MBFL_BAD_INPUT;
 			}
-		} else if (c >= 0xE0 && c <= 0xEF) { /* 3 byte character */
+		} else if (c <= 0xEF) { /* 3 byte character */
 			if ((e - p) >= 2) {
 				unsigned char c2 = *p++;
 				unsigned char c3 = *p++;
@@ -262,7 +264,7 @@ static size_t mb_utf8_to_wchar(unsigned char **in, size_t *in_len, uint32_t *buf
 					}
 				}
 			}
-		} else if (c >= 0xF0 && c <= 0xF4) { /* 4 byte character */
+		} else if (c <= 0xF4) { /* 4 byte character */
 			if ((e - p) >= 3) {
 				unsigned char c2 = *p++;
 				unsigned char c3 = *p++;