wc -m returns character count instead of byte count in C/POSIX locale #9712

@sylvestre

Component

wc -m

Description

GNU wc checks MB_CUR_MAX to decide whether to count bytes or multibyte characters. When MB_CUR_MAX == 1 (as in the C/POSIX locale), it treats each byte as a character, so wc -m reports the byte count.

src/wc.c

static bool
wc (int fd, char const *file_x, struct fstatus *fstatus)
{
  // [...]
  /* If in the current locale, chars are equivalent to bytes, we prefer
     counting bytes, because that's easier.  */
  if (MB_CUR_MAX > 1)
    {
      count_bytes = print_bytes;
      count_chars = print_chars;
    }
  else
    {
      count_bytes = print_bytes || print_chars;
      count_chars = false;
    }
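
For comparison, the flag selection in this excerpt could be mirrored in Rust roughly as follows. This is only a sketch: `CountFlags`, `select_counters`, and the `mb_cur_max` parameter are hypothetical names that do not exist in uutils, and the sketch leaves open how `mb_cur_max` would be obtained.

```rust
/// Hypothetical pair of flags mirroring GNU's `count_bytes` / `count_chars`.
#[derive(Debug, PartialEq, Eq)]
struct CountFlags {
    count_bytes: bool,
    count_chars: bool,
}

/// Mirrors the GNU branch: in a single-byte locale (`mb_cur_max == 1`)
/// a request for characters is folded into the cheaper byte count.
/// `mb_cur_max` is an assumed input; this sketch does not query the locale.
fn select_counters(print_bytes: bool, print_chars: bool, mb_cur_max: usize) -> CountFlags {
    if mb_cur_max > 1 {
        CountFlags {
            count_bytes: print_bytes,
            count_chars: print_chars,
        }
    } else {
        CountFlags {
            count_bytes: print_bytes || print_chars,
            count_chars: false,
        }
    }
}
```

With `-m` alone, `select_counters(false, true, 1)` turns the character request into a byte count, which is the behavior the C/POSIX locale case depends on.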

However, uutils wc ignores the locale and always counts UTF-8 characters using bytecount::num_chars().

src/uu/wc/src/count_fast.rs

pub(crate) fn count_bytes_chars_and_lines_fast<
// [...]
>(
    handle: &mut R,
) -> (WordCount, Option<io::Error>) {
    let mut total = WordCount::default();
    let buf: &mut [u8] = &mut AlignedBuffer::default().data;
    loop {
        match handle.read(buf) {
            Ok(0) => return (total, None),
            Ok(n) => {
                if COUNT_BYTES {
                    total.bytes += n;
                }
                if COUNT_CHARS {
                    total.chars += bytecount::num_chars(&buf[..n]);
                }
                if COUNT_LINES {
                    total.lines += bytecount::count(&buf[..n], b'\n');
                }
            }
            Err(ref e) if e.kind() == ErrorKind::Interrupted => (),
            Err(e) => return (total, Some(e)),
        }
    }
}

Test / Reproduction Steps

$ echo -n "한글"|LC_ALL=C wc -m
6
$ echo -n "한글"|LC_ALL=C coreutils wc -m
2
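
The two numbers can be reproduced with a small Rust sketch. The string "한글" is two code points of three UTF-8 bytes each, so a byte count gives 6 and a UTF-8 character count gives 2. The helper names below are hypothetical, and `count_utf8_chars` is a stand-in that counts non-continuation bytes; it is not claimed to be bytecount's actual implementation.

```rust
/// Hypothetical helper: count UTF-8 code points by counting every byte
/// that is not a continuation byte (continuation bytes are 0b10xx_xxxx).
fn count_utf8_chars(buf: &[u8]) -> usize {
    buf.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}

/// Hypothetical helper: the count a single-byte locale would report,
/// where every byte is treated as one character.
fn count_single_byte_chars(buf: &[u8]) -> usize {
    buf.len()
}
```

For `"한글".as_bytes()`, `count_single_byte_chars` yields 6 and `count_utf8_chars` yields 2, matching the two outputs above.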

Impact

uutils wc -m produces different output from GNU wc -m in C/POSIX locale environments, breaking compatibility for scripts and CI pipelines that rely on locale-dependent character counting.

Recommendations

Check MB_CUR_MAX (or use equivalent locale detection in Rust) before counting characters. If MB_CUR_MAX == 1, return the byte count instead of the UTF-8 character count to match GNU behavior.
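
One possible shape for this, as a rough sketch only: approximate the MB_CUR_MAX check by inspecting the locale name taken from the environment with the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG). All function names here are hypothetical. The sketch assumes that only UTF-8 locales need multibyte counting, which is weaker than what GNU does: other multibyte codesets also have MB_CUR_MAX > 1, and GNU falls back to the C locale when the requested locale is not installed.

```rust
/// Hypothetical helper: pick the effective LC_CTYPE locale name using
/// POSIX precedence. Empty values are treated as unset; default is "C".
fn effective_ctype<'a>(
    lc_all: Option<&'a str>,
    lc_ctype: Option<&'a str>,
    lang: Option<&'a str>,
) -> &'a str {
    [lc_all, lc_ctype, lang]
        .iter()
        .copied()
        .flatten()
        .find(|v| !v.is_empty())
        .unwrap_or("C")
}

/// Hypothetical helper: true if the locale name carries a UTF-8 codeset,
/// e.g. "en_US.UTF-8" or "C.utf8". Assumption: name-based detection only.
fn is_utf8_locale(name: &str) -> bool {
    // The codeset sits between '.' and an optional '@modifier'.
    let codeset = name.split('.').nth(1).unwrap_or("");
    let codeset = codeset.split('@').next().unwrap_or("");
    let codeset = codeset.to_ascii_lowercase();
    codeset == "utf-8" || codeset == "utf8"
}

/// Hypothetical helper: count "characters" the way `wc -m` would under
/// the given locale name. Non-UTF-8 locales are treated as single-byte.
fn count_chars_for_locale(buf: &[u8], locale: &str) -> usize {
    if is_utf8_locale(locale) {
        // Count bytes that start a code point (skip 0b10xx_xxxx bytes).
        buf.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    } else {
        buf.len()
    }
}
```

A caller would feed `effective_ctype` the values of `std::env::var("LC_ALL")`, `LC_CTYPE`, and `LANG` (converted to `Option<&str>`) and pass the result to `count_chars_for_locale`. Keeping the environment lookup outside these functions makes the decision logic easy to test.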
