tech-userlevel: Re: Permit loose matching of codeset names in locales

Subject: Re: Permit loose matching of codeset names in locales
To: Ian Lance Taylor <ian@wasabisystems.com>
From: Curt Sampson <cjs@cynic.net>
List: tech-userlevel
Date: 09/03/2004 16:26:14

On Fri, 2 Sep 2004, Ian Lance Taylor wrote:

> The loose matching algorithm used on Linux converts codeset names as
> follows:
>   * Ignore all non-alphanumeric characters, such as '-'.
>   * If the remaining string is all digits, prepend "iso"
>     (e.g. "8859-1" is converted to "iso88591").
>   * Force all alphabetic characters to lower case.

I do worry a little about collisions caused by the canonicalization.
Perhaps we can include a test that takes the entire list of known names,
canonicalizes all of them, and then checks for collisions?
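
A rough sketch of such a test, in Python for brevity (anything that went
into libc would of course be C). It applies the three quoted rules and
reports any normalized name claimed by more than one encoding. The
function names and the (name, encoding) pair layout are assumptions made
for illustration, not an existing interface, and "alphanumeric" is taken
here to mean ASCII letters and digits only.

```python
from collections import defaultdict


def loose_codeset_name(name):
    """Normalize a codeset name using the three quoted rules."""
    # Rule 1: ignore every non-alphanumeric character (ASCII assumed).
    key = "".join(c for c in name if c.isascii() and c.isalnum())
    # Rule 2: if only digits remain, prepend "iso".
    if key.isdigit():
        key = "iso" + key
    # Rule 3: force alphabetic characters to lower case.
    return key.lower()


def find_collisions(named_encodings):
    """Return {normalized name: [encodings]} for keys with several owners.

    named_encodings is an iterable of (name, encoding) pairs, where
    `encoding` is whatever identifier tells two encodings apart.
    Aliases of one encoding that normalize alike are not collisions.
    """
    owners = defaultdict(set)
    for name, encoding in named_encodings:
        owners[loose_codeset_name(name)].add(encoding)
    return {key: sorted(encs) for key, encs in owners.items() if len(encs) > 1}
```

Feeding it every name from the known list, paired with the encoding the
name belongs to, should return an empty dict if the normalization is
safe; any entry that comes back is a pair of distinct encodings that
loose matching could no longer tell apart.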

You mention in a later message that it might be good to have a canonical
function for codeset (which I assume means "character encoding") name
matching; I agree. We might as well stick it in libc and make all of our
system software use it.

As well as the preferred MIME name, it would be nice to match against
all the aliases available for the character encoding. For example, the
official aliases for ISO-8859-1 are:

    Name: ISO_8859-1:1987                                    [RFC1345,KXS2]
    MIBenum: 4
    Source: ECMA registry
    Alias: iso-ir-100
    Alias: ISO_8859-1
    Alias: ISO-8859-1 (preferred MIME name)
    Alias: latin1
    Alias: l1
    Alias: IBM819
    Alias: CP819
    Alias: csISOLatin1

The full list is in the following IANA document (though they are
mistakenly called character sets, rather than character encodings):

    http://www.iana.org/assignments/character-sets
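
As an illustration of alias matching, here is a minimal Python sketch
that builds a lookup table from the ISO-8859-1 aliases listed above and
resolves a user-supplied name to the preferred MIME name. The helper
names (loose_key, build_alias_table, lookup) are hypothetical, the
normalization is the loosely matched form from the quoted rules, and
only the one registry entry shown above is included.

```python
# Aliases copied from the registry entry quoted above.
ENCODINGS = {
    "ISO-8859-1": [
        "ISO_8859-1:1987", "iso-ir-100", "ISO_8859-1", "latin1",
        "l1", "IBM819", "CP819", "csISOLatin1",
    ],
}


def loose_key(name):
    """Drop non-alphanumerics, lower-case, prefix all-digit names with "iso"."""
    key = "".join(c for c in name if c.isascii() and c.isalnum()).lower()
    return "iso" + key if key.isdigit() else key


def build_alias_table(encodings):
    """Map each normalized alias to its preferred name; refuse collisions."""
    table = {}
    for preferred, aliases in encodings.items():
        for alias in [preferred, *aliases]:
            key = loose_key(alias)
            owner = table.setdefault(key, preferred)
            if owner != preferred:
                raise ValueError(f"{alias!r} collides with an alias of {owner!r}")
    return table


def lookup(table, name):
    """Return the preferred name for `name`, or None if it is unknown."""
    return table.get(loose_key(name))


TABLE = build_alias_table(ENCODINGS)
```

With this table, lookup(TABLE, "Latin-1"), lookup(TABLE, "cp_819") and
lookup(TABLE, "8859-1") all resolve to "ISO-8859-1", while a name that
is not in the table, such as "utf-8", gives None.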

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.NetBSD.org
    Don't you know, in this new Dark Age, we're all light.  --XTC