Subject: Re: Permit loose matching of codeset names in locales
To: Ian Lance Taylor <ian@wasabisystems.com>
From: SODA Noriyuki <soda@sra.co.jp>
List: tech-userlevel
Date: 09/07/2004 04:20:53
>>>>> On 05 Sep 2004 15:25:18 -0400,
	Ian Lance Taylor <ian@wasabisystems.com> said:

>> Our codeset names conform existing UNIX conventions as far as
>> possible, so our current names are just exactly compatible with most
>> commercial UNIX variants.

> But note that we've already seen in this thread a discrepancy between
> Linux and NetBSD: NetBSD uses "ru_RU.KOI8-R" where Linux uses
> "ru_RU.koi8r".

No, the official locale name on Linux is "ru_RU.KOI8-R", so it's
exactly the same as on NetBSD. Please look at the following Linux standard:
	http://www.openi18n.org/docs/text/LocNameGuide-V10.txt
As you can see, official codeset names of locales on Linux should only
have uppercase letters, digits and hyphens. "ru_RU.KOI8-R" conforms
to the standard, but "ru_RU.koi8r" doesn't.
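
To make the naming rule concrete, here is a minimal sketch that checks
whether the codeset part of a locale name uses only uppercase letters,
digits and hyphens. The function name codeset_is_conforming() is
hypothetical (it is not part of any libc), and the check covers only the
character rule as I described it above, not the whole guide.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of any libc.
 * Returns 1 if the codeset part of a locale name (the text after the
 * '.', up to an optional '@' modifier) consists only of ASCII uppercase
 * letters, digits and hyphens.  Returns 0 otherwise, including when
 * the name has no codeset part at all.
 */
int
codeset_is_conforming(const char *locale)
{
	const char *p = strchr(locale, '.');

	if (p == NULL || p[1] == '\0' || p[1] == '@')
		return 0;	/* no codeset part */
	for (p++; *p != '\0' && *p != '@'; p++) {
		char c = *p;

		if (!((c >= 'A' && c <= 'Z') ||
		    (c >= '0' && c <= '9') || c == '-'))
			return 0;	/* lowercase or other character */
	}
	return 1;
}
```

With this check, "ru_RU.KOI8-R" and "bg_BG.CP1251" pass, while
"ru_RU.koi8r" is rejected.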

You can also confirm this by running the following small program.

	#include <stdio.h>
	#include <locale.h>
	#include <langinfo.h>

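	int
	main(void)
	{
		/* Pick up the locale from the environment (LANG, LC_*). */
		if (setlocale(LC_ALL, "") == NULL) {
			fprintf(stderr, "failed\n");
			return 1;
		}
		/* The locale name as the C library reports it. */
		printf("<%s>\n", setlocale(LC_CTYPE, NULL));
		/* The codeset name of that locale. */
		printf("<%s>\n", nl_langinfo(CODESET));
		return 0;
	}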

The following is the result on Fedora Core 2:
	% env LANG=ru_RU.koi8r ./a.out
	<ru_RU.koi8r>
	<KOI8-R>

As this shows, glibc uses "KOI8-R" rather than "koi8r" as the
canonical codeset name.
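
To illustrate the kind of loose matching this thread is about, the
sketch below compares two codeset names while ignoring case and any
character that is not an ASCII letter or digit, so that "koi8r" and
"KOI8-R" compare equal. The name codeset_loose_equal() is hypothetical,
and this is not the code glibc or NetBSD actually uses; it only shows
the idea.

```c
#include <stddef.h>

/* ASCII-only helpers, so the result does not depend on the locale. */
static int
ascii_alnum(char c)
{
	return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
	    (c >= '0' && c <= '9');
}

static char
ascii_lower(char c)
{
	return (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
}

/*
 * Hypothetical helper: returns 1 if the two codeset names are equal
 * once case and non-alphanumeric characters are ignored, 0 otherwise.
 */
int
codeset_loose_equal(const char *a, const char *b)
{
	for (;;) {
		/* Skip hyphens, underscores and other punctuation. */
		while (*a != '\0' && !ascii_alnum(*a))
			a++;
		while (*b != '\0' && !ascii_alnum(*b))
			b++;
		/* Equal only if both names end at the same point. */
		if (*a == '\0' || *b == '\0')
			return *a == *b;
		if (ascii_lower(*a) != ascii_lower(*b))
			return 0;
		a++;
		b++;
	}
}
```

A comparison like this accepts the loose spelling on input, while the
canonical name stays whatever nl_langinfo(CODESET) reports.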

I guess the output of "locale -a" on Linux confused you. But that
command doesn't show the official names on Linux, not just for the
Russian locale but for every locale. You can see that none of its
output matches the Linux standard above.

Although Linux is going to change its standard locale names (to make
the names match the standard), it will still keep compatibility with
the existing locale names as well. So nearly all NetBSD locale names
just work on Linux, too.

> For what it's worth, I'll note that NetBSD may not be consistent with
> itself at present.  In /usr/share/locale I see "bg_BG.CP1251" and in
> /usr/X11R6/share/locale I see "bg_BG.cp1251".

Actually pkgsrc is the inconsistent part ;-<
The directory /usr/X11R6/share/locale isn't part of X11, but part of
pkgsrc.

And worse, the "bg_BG.CP1251" case isn't the only inconsistency in pkgsrc.
For example, pkgsrc uses "ja_JP.EUC" as the Japanese locale, and that's
simply wrong. ("EUC" isn't an actual codeset, but an encoding method.
eucJP, eucKR, eucTW and eucCN are actual codesets, and EUC is the
method common to those codesets.)
IMHO, pkgsrc must be corrected.

FWIW, both NetBSD and X11 use "bg_BG.CP1251"
(you can see "bg_BG.CP1251" in /usr/X11R6/lib/X11/locale/locale.dir).

> Given the existence of /usr/share/i18n/esdb/esdb.alias, why is it a
> bad thing to use it?

Thor already described one of the reasons.

Another reason is that the namespace used in esdb.alias is
different from the namespace for the canonical UNIX codeset names on
NetBSD.
For example, you can see that the canonical name for Japanese EUC in
esdb.alias is "EUC-JP", but the canonical UNIX codeset name for
Japanese EUC is "eucJP" on NetBSD.
And we should not change the canonical UNIX codeset name, for
compatibility reasons.
--
soda