uniq: -w counts UTF-8 characters instead of bytes #10184

@sylvestre

Component

uniq

Description

GNU uniq's -w N (check-chars) option compares the first N bytes of each line, while uutils uniq compares the first N UTF-8 characters. This causes different behavior when processing multibyte characters (e.g., CJK characters, emoji).

uutils uniq always compares by UTF-8 character, regardless of locale, because it takes the comparison prefix with Rust's .chars().take(N):

let total_chars = string_after_skip.chars().count();

// `-w N` => Compare no more than N characters
let slice_stop = self.slice_stop.unwrap_or(total_chars);
let slice_start = slice_stop.min(total_chars);

let mut iter = string_after_skip.chars().take(slice_start);
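
To make the difference concrete, here is a minimal sketch of the two interpretations side by side. It is not the uutils code: `key_chars` and `key_bytes` are hypothetical helper names, and the byte-based variant assumes GNU's behavior is a plain truncation to the first N bytes, as described above.

```rust
/// Hypothetical helper: first `n` UTF-8 characters,
/// mirroring the `.chars().take(N)` approach described above.
fn key_chars(line: &str, n: usize) -> String {
    line.chars().take(n).collect()
}

/// Hypothetical helper: first `n` bytes, the behavior attributed to GNU above.
/// It works on raw bytes, so the cut may fall inside a multibyte character.
fn key_bytes(line: &str, n: usize) -> &[u8] {
    let bytes = line.as_bytes();
    &bytes[..n.min(bytes.len())]
}

fn main() {
    let a = "가나다라마";
    let b = "가나다바사";
    for n in [3, 4, 12] {
        let same_chars = key_chars(a, n) == key_chars(b, n);
        let same_bytes = key_bytes(a, n) == key_bytes(b, n);
        println!("-w {n}: equal by chars = {same_chars}, equal by bytes = {same_bytes}");
    }
}
```

Comparing byte slices rather than `&str` slices avoids the panic that string slicing would raise when the cut is not on a character boundary.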

Test / Reproduction Steps

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_ALL=
$ printf "가나다라마\n가나다바사\n" > korean.txt

# -w 3
$ uniq -w 3 korean.txt
가나다라마

$ coreutils uniq -w 3 korean.txt
가나다라마

# -w 4
$ uniq -w 4 korean.txt
가나다라마

$ coreutils uniq -w 4 korean.txt
가나다라마
가나다바사

# -w 12
$ uniq -w 12 korean.txt
가나다라마
가나다바사

$ coreutils uniq -w 12 korean.txt
가나다라마
가나다바사
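
The results above follow from the fact that each Hangul syllable in the sample occupies 3 bytes in UTF-8. The sketch below lists the byte range of every character in a line; `char_byte_ranges` is a hypothetical helper written for illustration only.

```rust
/// Hypothetical helper: returns (character, start byte, end byte) for each
/// character of `line`, where the end offset is exclusive.
fn char_byte_ranges(line: &str) -> Vec<(char, usize, usize)> {
    line.char_indices()
        .map(|(start, ch)| (ch, start, start + ch.len_utf8()))
        .collect()
}

fn main() {
    for (ch, start, end) in char_byte_ranges("가나다라마") {
        println!("{ch}: bytes {start}..{end}");
    }
}
```

With these ranges, a 4-byte prefix ends inside the second character, so it cannot reach the position where the two lines differ (the fourth character, bytes 9..12), while a 12-byte prefix covers exactly four characters and does.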

Impact

Scripts that use -w on input containing multibyte characters (CJK, emoji) produce different results under GNU uniq and uutils uniq.
