Commit 7299096
committed
New implementation of mb_strimwidth
This new implementation of mb_strimwidth uses the new text
encoding conversion filters. Changes from the previous
implementation:
• mb_strimwidth allows a negative 'from' argument, which
should count backwards from the end of the string. However,
the implementation of this feature was buggy (starting right
from when it was first implemented).
It used the following code:
if ((from < 0) || (width < 0)) {
swidth = mbfl_strwidth(&string);
}
if (from < 0) {
from += swidth;
}
Do you see the bug? 'from' is a count of CODEPOINTS, but
'swidth' is a count of TERMINAL COLUMNS. Adding those two
together does not make sense. If there were no fullwidth
characters in the input string, then the two counts coincide
and the feature would work correctly. However, each
fullwidth character would throw the result off by one,
causing more characters to be skipped than was requested.
• mb_strimwidth also allows a negative 'width' argument,
which again counts backwards from the end of the string;
in this case, it is not determining the START of the portion
which we want to extract, but rather, the END of that portion.
Perhaps unsurprisingly, this feature was also buggy.
Code:
if (width < 0) {
width = swidth + width - from;
}
'swidth + width' is fine here; the problem is '- from'.
Again, that is subtracting a count of CODEPOINTS from a
count of TERMINAL COLUMNS. In this case, we really need
to count the terminal width of the string prefix skipped
over by 'from', and subtract that rather than the number
of codepoints which are being skipped.
As a result, if a 'from' count was passed along with a
negative 'width', for every fullwidth character in the
skipped prefix, the result of mb_strimwidth was one
terminal column wider than requested.
Since these situations were covered by unit tests, you
might wonder why the bugs were not caught. Well, as far as
I can see, it looks like the author of the 'tests' just
captured the actual output of mb_strimwidth and defined it
as 'correct'. The tests were written in such a way that it
was difficult to examine them and see whether they made
sense or not; but a careful examination of the inputs and
outputs clearly shows that the legacy tests did not conform
to the documented contract of mb_strimwidth.
• The old implementation would always pass the input string
through decoding/encoding filters before returning it to
the caller, even if it fit within the specified width. This
means that invalid byte sequences would be converted to
error markers. For performance, the new implementation
returns the very same string which was passed in if it
does not exceed the specified width. This means that
erroneous byte sequences are not converted to error markers
unless it is necessary to trim the string.
• The same applies to the 'trim marker' string.
• The old implementation was buggy in the (unusual)
case that the trim marker is wider than the requested
maximum width of the result. It did an unsigned subtraction
of the requested width and the width of the trim marker. If the
width of the trim marker was greater, that subtraction would
underflow and yield a huge number. As a result, mb_strimwidth
would then pass the input string through, even if it was
far wider than the requested maximum width.
In that case, since the input string is wider than the
requested width, and NONE of it will fit together with the
trim marker, the new implementation returns just the trim
marker. This is the one case where the output can be wider
than the requested width: when BOTH the input string and
also the trim marker are too wide.
• Since it passed the input string and trim marker through
decoding/encoding filters, when using "Quoted-Printable" as
the encoding, newlines could be inserted into the trim marker
to maintain the maximum line length for QP.
This is an extremely bizarre use case and I don't think there
is any point in worrying about it. QP will be removed from
mbstring in time, anyways.
PERFORMANCE:
• From micro-benchmarking with various input string lengths and
text encodings, it appears that the new implementation is 2-3x
faster for UTF-8 and UTF-16. For legacy Japanese text encodings
like ISO-2022-JP or SJIS, the new implementation is perhaps 25%
faster.
• Note that correctly implementing negative 'from' and 'width'
arguments imposes a small performance burden in such cases; one
which the old implementation did not pay. This slightly skews
benchmarking results in favor of the old implementation. However,
even so, the new implementation is faster in all cases which I
tested.1 parent 94fde15 commit 7299096
File tree
4 files changed
+450
-282
lines changed- ext/mbstring
- libmbfl/mbfl
- tests
4 files changed
+450
-282
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
93 | 93 | | |
94 | 94 | | |
95 | 95 | | |
96 | | - | |
97 | 96 | | |
98 | 97 | | |
99 | 98 | | |
| |||
1191 | 1190 | | |
1192 | 1191 | | |
1193 | 1192 | | |
1194 | | - | |
1195 | | - | |
1196 | | - | |
1197 | | - | |
1198 | | - | |
1199 | | - | |
1200 | | - | |
1201 | | - | |
1202 | | - | |
1203 | | - | |
1204 | | - | |
1205 | | - | |
1206 | | - | |
1207 | | - | |
1208 | | - | |
1209 | | - | |
1210 | | - | |
1211 | | - | |
1212 | | - | |
1213 | | - | |
1214 | | - | |
1215 | | - | |
1216 | | - | |
1217 | | - | |
1218 | | - | |
1219 | | - | |
1220 | | - | |
1221 | | - | |
1222 | | - | |
1223 | | - | |
1224 | | - | |
1225 | | - | |
1226 | | - | |
1227 | | - | |
1228 | | - | |
1229 | | - | |
1230 | | - | |
1231 | | - | |
1232 | | - | |
1233 | | - | |
1234 | | - | |
1235 | | - | |
1236 | | - | |
1237 | | - | |
1238 | | - | |
1239 | | - | |
1240 | | - | |
1241 | | - | |
1242 | | - | |
1243 | | - | |
1244 | | - | |
1245 | | - | |
1246 | | - | |
1247 | | - | |
1248 | | - | |
1249 | | - | |
1250 | | - | |
1251 | | - | |
1252 | | - | |
1253 | | - | |
1254 | | - | |
1255 | | - | |
1256 | | - | |
1257 | | - | |
1258 | | - | |
1259 | | - | |
1260 | | - | |
1261 | | - | |
1262 | | - | |
1263 | | - | |
1264 | | - | |
1265 | | - | |
1266 | | - | |
1267 | | - | |
1268 | | - | |
1269 | | - | |
1270 | | - | |
1271 | | - | |
1272 | | - | |
1273 | | - | |
1274 | | - | |
1275 | | - | |
1276 | | - | |
1277 | | - | |
1278 | | - | |
1279 | | - | |
1280 | | - | |
1281 | | - | |
1282 | | - | |
1283 | | - | |
1284 | | - | |
1285 | | - | |
1286 | | - | |
1287 | | - | |
1288 | | - | |
1289 | | - | |
1290 | | - | |
1291 | | - | |
1292 | | - | |
1293 | | - | |
1294 | | - | |
1295 | | - | |
1296 | | - | |
1297 | | - | |
1298 | | - | |
1299 | | - | |
1300 | | - | |
1301 | | - | |
1302 | | - | |
1303 | | - | |
1304 | | - | |
1305 | | - | |
1306 | | - | |
1307 | | - | |
1308 | | - | |
1309 | | - | |
1310 | | - | |
1311 | | - | |
1312 | | - | |
1313 | | - | |
1314 | | - | |
1315 | | - | |
1316 | | - | |
1317 | | - | |
1318 | | - | |
1319 | | - | |
1320 | | - | |
1321 | | - | |
1322 | | - | |
1323 | | - | |
1324 | | - | |
1325 | | - | |
1326 | | - | |
1327 | | - | |
1328 | | - | |
1329 | | - | |
1330 | | - | |
1331 | | - | |
1332 | | - | |
1333 | | - | |
1334 | | - | |
1335 | | - | |
1336 | | - | |
1337 | | - | |
1338 | | - | |
1339 | | - | |
1340 | | - | |
1341 | | - | |
1342 | | - | |
1343 | | - | |
1344 | | - | |
1345 | | - | |
1346 | | - | |
1347 | | - | |
1348 | | - | |
1349 | | - | |
1350 | | - | |
1351 | | - | |
1352 | | - | |
1353 | | - | |
1354 | | - | |
1355 | | - | |
1356 | | - | |
1357 | | - | |
1358 | | - | |
1359 | | - | |
1360 | | - | |
1361 | | - | |
1362 | | - | |
1363 | | - | |
1364 | | - | |
1365 | | - | |
1366 | | - | |
1367 | | - | |
1368 | | - | |
1369 | | - | |
1370 | | - | |
1371 | | - | |
1372 | | - | |
1373 | | - | |
1374 | | - | |
1375 | | - | |
1376 | | - | |
1377 | | - | |
1378 | | - | |
1379 | | - | |
1380 | | - | |
1381 | | - | |
1382 | | - | |
1383 | | - | |
1384 | | - | |
1385 | | - | |
1386 | | - | |
1387 | | - | |
1388 | | - | |
1389 | | - | |
1390 | | - | |
1391 | | - | |
1392 | | - | |
1393 | 1193 | | |
1394 | 1194 | | |
1395 | 1195 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
231 | 231 | | |
232 | 232 | | |
233 | 233 | | |
234 | | - | |
235 | | - | |
236 | | - | |
237 | | - | |
238 | | - | |
239 | | - | |
240 | | - | |
241 | | - | |
242 | | - | |
243 | | - | |
244 | | - | |
245 | | - | |
246 | 234 | | |
247 | 235 | | |
248 | 236 | | |
| |||
0 commit comments