Component
wc -m
Description
For -m, GNU wc checks MB_CUR_MAX to decide whether to count bytes or multibyte characters. When MB_CUR_MAX == 1 (as in the C/POSIX locale), every byte is one character, so it counts bytes instead.
src/wc.c
static bool
wc (int fd, char const *file_x, struct fstatus *fstatus)
{
// [...]
/* If in the current locale, chars are equivalent to bytes, we prefer
counting bytes, because that's easier. */
! if (MB_CUR_MAX > 1)
{
count_bytes = print_bytes;
count_chars = print_chars;
}
else
{
count_bytes = print_bytes || print_chars;
count_chars = false;
}
uutils wc, however, ignores the locale and always counts UTF-8 characters using bytecount::num_chars(), even under LC_ALL=C.
src/uu/wc/src/count_fast.rs
pub(crate) fn count_bytes_chars_and_lines_fast<
// [...]
>(
handle: &mut R,
) -> (WordCount, Option<io::Error>) {
let mut total = WordCount::default();
let buf: &mut [u8] = &mut AlignedBuffer::default().data;
loop {
match handle.read(buf) {
Ok(0) => return (total, None),
Ok(n) => {
if COUNT_BYTES {
total.bytes += n;
}
if COUNT_CHARS {
! total.chars += bytecount::num_chars(&buf[..n]);
}
if COUNT_LINES {
total.lines += bytecount::count(&buf[..n], b'\n');
}
}
Err(ref e) if e.kind() == ErrorKind::Interrupted => (),
Err(e) => return (total, Some(e)),
}
}
}
Test / Reproduction Steps
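# GNU wc: counts bytes in the C locale
$ echo -n "한글" | LC_ALL=C wc -m
6
# uutils wc: counts UTF-8 characters regardless of locale
$ echo -n "한글" | LC_ALL=C coreutils wc -m
2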
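The two results are the byte length and the UTF-8 code point count of the same input. The sketch below shows where each number comes from. It uses only the standard library; utf8_char_count is a hypothetical stand-in for bytecount::num_chars, not the crate's actual implementation.

```rust
/// Counts UTF-8 code points by counting every byte that is not a
/// continuation byte (bit pattern 10xxxxxx).
/// Hypothetical std-only stand-in for bytecount::num_chars.
fn utf8_char_count(buf: &[u8]) -> usize {
    buf.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}

fn main() {
    let input = "한글".as_bytes();
    // Byte length: what GNU wc -m reports under LC_ALL=C.
    println!("bytes: {}", input.len());
    // Code point count: what uutils wc -m reports regardless of locale.
    println!("chars: {}", utf8_char_count(input));
}
```

Each of the two Hangul syllables is encoded as three bytes in UTF-8, which accounts for 6 versus 2.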
Impact
In C/POSIX locale environments, wc -m produces different output than GNU wc. This breaks compatibility for scripts and CI pipelines that rely on locale-dependent character counting.
Recommendations
Check MB_CUR_MAX (or use equivalent locale detection in Rust) before counting characters. If MB_CUR_MAX == 1, return the byte count instead of the UTF-8 character count to match GNU behavior.
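One possible shape for such a check is sketched below. It is an illustration under stated assumptions, not the uutils implementation. The helpers effective_ctype_locale, is_single_byte_locale, and count_chars are hypothetical names. The sketch approximates MB_CUR_MAX == 1 by checking whether the effective LC_CTYPE locale name is "C" or "POSIX", resolved with the POSIX precedence LC_ALL, then LC_CTYPE, then LANG. It does not handle other single-byte locales (for example ISO-8859-1 ones) or invalid locale names.

```rust
use std::env;

/// Hypothetical helper: resolves the effective LC_CTYPE locale name with the
/// POSIX precedence LC_ALL > LC_CTYPE > LANG. Unset or empty values are
/// skipped; if nothing is set, the default locale is "C".
fn effective_ctype_locale(
    lc_all: Option<&str>,
    lc_ctype: Option<&str>,
    lang: Option<&str>,
) -> String {
    for v in [lc_all, lc_ctype, lang].iter().flatten() {
        if !v.is_empty() {
            return (*v).to_string();
        }
    }
    "C".to_string()
}

/// Assumption: only "C" and "POSIX" are treated as single-byte locales.
/// This approximates MB_CUR_MAX == 1 and is not a full locale lookup.
fn is_single_byte_locale(name: &str) -> bool {
    name == "C" || name == "POSIX"
}

/// Returns the byte count in a single-byte locale, otherwise the number of
/// UTF-8 code points (bytes that are not 10xxxxxx continuation bytes).
fn count_chars(buf: &[u8], single_byte: bool) -> usize {
    if single_byte {
        buf.len()
    } else {
        buf.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    }
}

fn main() {
    let var = |key: &str| env::var(key).ok();
    let (lc_all, lc_ctype, lang) = (var("LC_ALL"), var("LC_CTYPE"), var("LANG"));
    let locale =
        effective_ctype_locale(lc_all.as_deref(), lc_ctype.as_deref(), lang.as_deref());

    // Sample input from the reproduction steps; the result depends on the
    // environment the program runs in.
    let input = "한글".as_bytes();
    println!("{}", count_chars(input, is_single_byte_locale(&locale)));
}
```

Keeping the locale decision separate from the counting loop would let the fast path in count_fast.rs select byte counting once, before the read loop, much as GNU wc sets count_bytes and count_chars up front.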