TL;DR: I want to add a new util for locale generation and provide locale-aware functionality in `uucore`.
uutils currently follows the `C` locale for most of its operations, and the locale settings of the system are mostly ignored. This has led to issues and PRs like these:
- `expr` is failing with multibyte chars #3132
We've mostly been putting this off due to missing libraries in Rust, but this has recently changed with the release of `icu4x`. It covers many of the things we need, like locale-aware datetime formatting, locale-aware collation, etc.
However, it requires data to operate on, which is different from the usual data generated by `locale-gen` and friends (if I understand correctly). There are essentially two viable ways to include data with `icu4x`[^1]:
- Store a blob on the filesystem to read at runtime (`BlobDataProvider`).
- Encode the data as Rust code included in the binary (`BakedDataProvider`).
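To make the first option concrete, here is a minimal standard-library sketch of the runtime lookup. The default directory, the override value and the `<locale>.blob` file naming are all hypothetical, since none of them are decided yet. Handing the bytes to `BlobDataProvider` is left out because the exact constructor depends on the `icu4x` version.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical default location for generated locale blobs.
const DEFAULT_DATA_DIR: &str = "/usr/share/uutils/locale";

/// Pick the directory that holds the blobs. `override_dir` stands in for
/// the value of a (hypothetical) environment variable; an empty value is
/// treated as unset.
fn data_dir(override_dir: Option<&str>) -> PathBuf {
    match override_dir {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(DEFAULT_DATA_DIR),
    }
}

/// Read the blob for `locale` from `dir`. Returns `None` when the file is
/// missing or unreadable, so the caller can fall back to the current
/// C-locale behavior instead of failing.
fn load_blob(dir: &Path, locale: &str) -> Option<Vec<u8>> {
    // Hypothetical naming scheme: one file per locale, `<locale>.blob`.
    fs::read(dir.join(format!("{}.blob", locale))).ok()
}
```

Returning `None` rather than an error keeps the utils usable on systems where no locale data has been generated.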
Since we don't know up front what locales we might need, I think we need to use the `BlobDataProvider` and allow the user to generate their own locale data on command. So, I propose we do the following:
- Add a new util, called `locale-gen` or something similar.
  - This util downloads and stores the locale data in a global directory (I'm not sure where; it could also be controlled by an environment variable).
  - This util would be a wrapper around the `icu_datagen` crate[^2].
  - It could also read from system config files and automatically install any locales the system config requires.
  - Since this util needs access to the internet, we will run into issues similar to the ones we had with `uudoc` back when it automatically downloaded examples, so it needs to be optional.[^3]
- Create locale-aware functionality in `uucore` as much as possible, so that the utils themselves don't have to bother with checking the right environment variables, loading the icu data, etc.
  - For example, to check the collation locale, the `LC_COLLATE`, `LC_ALL` and `LANG` env vars need to be checked.
  - For the utils, we then just expose a `sort/collate` function that checks (and caches) the locale and performs the correct collation.
- Change the utils to use the locale-aware functions provided by `uucore`.
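As a rough illustration of the `uucore` side, the environment check and the caching could look something like the sketch below. The function names are hypothetical, and it assumes the usual POSIX precedence (`LC_ALL`, then the category variable, then `LANG`, with empty values treated as unset). It returns the raw POSIX locale name (e.g. `en_US.UTF-8`); mapping that to whatever `icu4x` expects is not shown.

```rust
use std::env;
use std::sync::OnceLock;

/// Resolve the locale for one category (e.g. "LC_COLLATE").
/// Precedence: LC_ALL, then the category variable, then LANG.
/// Empty values count as unset; with nothing set we fall back to "C".
/// `lookup` abstracts over the environment so the logic can be tested
/// without modifying process state.
fn resolve_locale<F>(category: &str, lookup: F) -> String
where
    F: Fn(&str) -> Option<String>,
{
    for name in ["LC_ALL", category, "LANG"] {
        if let Some(value) = lookup(name) {
            if !value.is_empty() {
                return value;
            }
        }
    }
    String::from("C")
}

/// Collation locale for this process, read from the environment on the
/// first call and cached afterwards.
fn collate_locale() -> &'static str {
    static CACHE: OnceLock<String> = OnceLock::new();
    CACHE.get_or_init(|| resolve_locale("LC_COLLATE", |name| env::var(name).ok()))
}
```

A util would then only call something like `collate_locale()` (or a comparison function built on top of it) and never touch the environment variables itself.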
Do you see any problems with this approach? Are there alternatives we should explore first?
Footnotes
[^1]: They also have `FsDataProvider`, which is meant for development only.
[^2]: This crate also has a CLI, but we need to tailor it for use with coreutils by setting nicer defaults for our purpose.
[^3]: `icu_datagen` uses `reqwest`, which will lead to similar problems as in https://github.com/uutils/coreutils/pull/3184