uniq -w counts UTF-8 characters instead of bytes
Component
uniq
Description
GNU uniq's -w N (--check-chars) option compares the first N bytes of each line, while uutils uniq compares the first N UTF-8 characters. The two implementations therefore disagree on input containing multibyte characters (e.g., CJK characters, emoji), where one character occupies more than one byte.
uutils uniq always compares by UTF-8 character, regardless of locale, using Rust's .chars().take(N):
let total_chars = string_after_skip.chars().count();
// `-w N` => Compare no more than N characters
let slice_stop = self.slice_stop.unwrap_or(total_chars);
let slice_start = slice_stop.min(total_chars);
let mut iter = string_after_skip.chars().take(slice_start);
Test / Reproduction Steps
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_ALL=
$ printf "가나다라마\n가나다바사\n" > korean.txt
# -w 3
$ uniq -w 3 korean.txt
가나다라마
$ coreutils uniq -w 3 korean.txt
가나다라마
# -w 4
$ uniq -w 4 korean.txt
가나다라마
$ coreutils uniq -w 4 korean.txt
가나다라마
가나다바사
# -w 12
$ uniq -w 12 korean.txt
가나다라마
가나다바사
$ coreutils uniq -w 12 korean.txt
가나다라마
가나다바사
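The three cases above can be reproduced outside of either tool with a minimal Rust sketch. It is an illustration only, not code from GNU coreutils or uutils: the helper names `key_bytes` and `key_chars` are hypothetical, and the byte-based helper assumes that GNU's comparison amounts to taking the first N bytes, as described in this report. Each Hangul syllable in the test file is 3 bytes in UTF-8, which is why the two keys diverge at N = 4.

```rust
// Hypothetical helper: comparison key when -w N counts bytes
// (the GNU behaviour as described in this report).
fn key_bytes(line: &str, n: usize) -> &[u8] {
    let bytes = line.as_bytes();
    // Clamp N to the line length so short lines do not panic.
    &bytes[..n.min(bytes.len())]
}

// Hypothetical helper: comparison key when -w N counts UTF-8 characters
// (the .chars().take(N) approach).
fn key_chars(line: &str, n: usize) -> String {
    line.chars().take(n).collect()
}

fn main() {
    let a = "가나다라마";
    let b = "가나다바사";
    for n in [3, 4, 12] {
        println!(
            "-w {}: equal by bytes = {}, equal by chars = {}",
            n,
            key_bytes(a, n) == key_bytes(b, n),
            key_chars(a, n) == key_chars(b, n)
        );
    }
}
```

With N = 4 the byte-based keys are equal (the whole first syllable plus one byte of the second) while the character-based keys differ at the fourth syllable, matching the one-line versus two-line output in the transcript.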
Impact
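Scripts that use -w on input containing multibyte characters (CJK, emoji) produce different results under GNU and uutils, as in the -w 4 case above, where GNU collapses the two lines into one and uutils keeps both.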