tech-userlevel archive


localedef(1) (Was: [OT]Who's in charge of Localization (L10N) in NetBSD?)



hi,

> I've also started work on a localedef(1) tool. Again, not enough time
> to complete.

i have started to implement localedef(1) too :-)
but it is now suspended, because the specification is unclear
(and also i have no time to solve these problems).

the POSIX localedef(1) specification apparently disregards some
stateful encodings.
charmap can't correctly handle locking-shift escape sequences like those
of ISO/IEC 2022(*1) :-<

*1 http://www.ecma-international.org/publications/standards/Ecma-035.htm

OTOH, we already have these encodings (like ja_JP.ISO2022-JP,
ja_JP.CTEXT and so on).
to keep backward compatibility, we have to support locking-shift
stateful encodings in charmap.

the ISO/IEC TR14652(*2) extensions describe how to handle ISO/IEC
2022 escape sequences,
by using <escseq2022> and <include> tags in charmap.
# why not hz-gb2312? n-byte hangul? viqr? or any other stateful encoding?

*2 http://www.open-std.org/JTC1/SC22/WG20/docs/n972-14652ft.pdf

but it is not enough, unless i misunderstand the spec :-(
the following problems still exist:


1. the <include>/<escseq2022> tags can't handle ISO/IEC 2022's single-shift correctly.

using single-shift is not a special case, see:
    + ISO-2022-CN-EXT(RFC1922) (*3)
    + eucJP with supplementary ideographs(jisx0212) and half-width
      kana(jisx0201-right)
      (UI-OSF Application Platform Profile for Japanese Environment
      Version 1.1) (*4)

*3 http://www.ietf.org/rfc/rfc1922.txt
*4 http://home.m05.itscom.net/numa/uocjleE.pdf

the localedef(1) parser can't determine whether a tag means
LS2 or SS2, LS3 or SS3, LS2R or SS2R, LS3R or SS3R:

designation -> tag
LS0(SI)         <include> "g0";"g0",...;
LS1(SO)         <include> "g1";"g0",...;
LS2             <include> "g2";"g0",...;
LS3             <include> "g3";"g0",...;
LS1R            <include> "g1";"g1",...;
LS2R            <include> "g2";"g1",...;
LS3R            <include> "g3";"g1",...;

SS2             <include> "g2";"g0",...; /* XXX same as LS2 */
SS3             <include> "g3";"g0",...; /*             LS3 */
SS2R            <include> "g2";"g1",...; /*             LS2R */
SS3R            <include> "g3";"g1",...; /*             LS3R */

because the escape sequence is hardcoded in the *included* file's
<escseq2022> tag, like:
...
<escseq2022>    "g0";"g0";"/x28/x48"
<escseq2022>    "g1";"g0";"/x29/x48"
...

this escape sequence (including the final character) means locking-shift,
not single-shift (0x8E/0x8F).
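
a minimal sketch (python, not part of the localedef code) of the ambiguity in
the table above. it assumes the first field of <include> is the G-set being
invoked and the second field is the target area ("g0" for GL, "g1" for GR);
that reading is taken from the table, not confirmed against TR14652. the
names SHIFTS and ambiguous_tags are made up for illustration.

```python
# Shift functions, each reduced to the ("gN";"gM") pair it would be
# written as in an <include> tag, following the table above.
SHIFTS = {
    "LS0": ("g0", "g0"),
    "LS1": ("g1", "g0"),
    "LS2": ("g2", "g0"),
    "LS3": ("g3", "g0"),
    "LS1R": ("g1", "g1"),
    "LS2R": ("g2", "g1"),
    "LS3R": ("g3", "g1"),
    "SS2": ("g2", "g0"),
    "SS3": ("g3", "g0"),
    "SS2R": ("g2", "g1"),
    "SS3R": ("g3", "g1"),
}


def ambiguous_tags(shifts):
    """Group shift functions by <include> key and keep only the keys
    that more than one shift function maps to."""
    by_key = {}
    for name, key in shifts.items():
        by_key.setdefault(key, []).append(name)
    return {key: names for key, names in by_key.items() if len(names) > 1}


collisions = ambiguous_tags(SHIFTS)
for key, names in collisions.items():
    print(key, "->", " or ".join(names))
```

every single-shift collides with a locking-shift, so the pair alone can't
tell the parser which one is meant.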

or do we have to treat single-shift as part of a multibyte sequence, like Shift_JIS?
...
<include> "g2";"g0","JISX0208-1990"; # LS2
CHARMAP
# XXX: SS2
<half-width-kana-a> /x8E/xAE
...
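
a sketch of what "single-shift as part of the multibyte sequence" means on
the decoding side, using the usual eucJP byte layout (SS2 + 1 byte for
half-width kana, SS3 + 2 bytes for jisx0212, 2 bytes for jisx0208). the
function name split_eucjp is hypothetical and trail bytes are not validated.

```python
SS2, SS3 = 0x8E, 0x8F


def split_eucjp(data: bytes):
    """Split an eucJP byte string into per-character byte sequences.

    The single-shift byte is kept as the first byte of the character,
    so the decoder needs no shift state between characters.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SS2:
            n = 2          # SS2 + one byte (jisx0201-right)
        elif b == SS3:
            n = 3          # SS3 + two bytes (jisx0212)
        elif 0xA1 <= b <= 0xFE:
            n = 2          # two bytes (jisx0208)
        else:
            n = 1          # ASCII / control
        if i + n > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        out.append(data[i:i + n])
        i += n
    return out


chars = split_eucjp(b"A\x8e\xb1\xa4\xa2\x8f\xb0\xa1")
print(chars)
```

this works for eucJP because the shift applies to exactly one character,
but the charmap then has to list every SS2/SS3 sequence by hand.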


2. the spec disregards the `empty charset designation by default' case, such as:
    + ISO-2022-KR(RFC1557) (*5)

*5 http://www.ietf.org/rfc/rfc1557.txt

RFC1557 says:

ESC $ ) C       Appears once in the beginning of a line
                before any appearance of SO characters.

so G1 should be empty in the initial state, but we can't control that.

or should i work around this case by using an EMPTY designation?
...
<include> "g0";"g0";"ISO646-US"
<include> "g1";"g0";"EMPTY" /* XXX kludge */
<include> "g1";"g0";"KSC5601"
...
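
a sketch of the initial-state problem for ISO-2022-KR, assuming the RFC1557
rules quoted above: G1 starts empty, ESC $ ) C designates KSC5601 to G1, SO
invokes G1 and SI goes back to G0. scan_iso2022kr is a hypothetical helper,
not an existing API, and it does not validate the two-byte code values.

```python
SO, SI = 0x0E, 0x0F
DESIGNATE_KSC5601_G1 = b"\x1b$)C"   # ESC $ ) C


def scan_iso2022kr(data: bytes):
    """Return (charset, bytes) tokens for an ISO-2022-KR byte string.

    G1 is empty until the designation is seen, so an SO that comes
    first is rejected instead of silently using some default charset.
    """
    g1 = None      # empty designation in the initial state
    gl = "g0"      # G0 (ISO646-US) is invoked initially
    tokens = []
    i = 0
    while i < len(data):
        if data.startswith(DESIGNATE_KSC5601_G1, i):
            g1 = "KSC5601"
            i += len(DESIGNATE_KSC5601_G1)
        elif data[i] == SO:
            if g1 is None:
                raise ValueError("SO before any G1 designation")
            gl = "g1"
            i += 1
        elif data[i] == SI:
            gl = "g0"
            i += 1
        elif gl == "g1" and data[i] >= 0x21:
            if i + 2 > len(data):
                raise ValueError("truncated two-byte character")
            tokens.append((g1, data[i:i + 2]))
            i += 2
        else:
            tokens.append(("ISO646-US", data[i:i + 1]))
            i += 1
    return tokens


tokens = scan_iso2022kr(b"\x1b$)Ca\x0e\x30\x21\x0fb")
print(tokens)
```

the `g1 = None` line is exactly the state that the <include> tags have no
way to express without something like the EMPTY kludge.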


3. <escseq2022> doesn't include ESC(0x1b),
hardcoding 0x1b is not Codeset Independent!
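
a sketch of where the hardcoding happens: the tag value only carries the
bytes after ESC, so the tool has to supply the ESC byte itself. here it is a
parameter instead of a constant. parse_charmap_bytes and
full_escape_sequence are hypothetical names, and "/" as the charmap escape
character is assumed from the examples in this mail.

```python
import re


def parse_charmap_bytes(value: str, escape_char: str = "/") -> bytes:
    """Convert a charmap byte string such as "/x28/x48" into raw bytes."""
    pattern = re.escape(escape_char) + r"x([0-9A-Fa-f]{2})"
    if not re.fullmatch("(?:" + pattern + ")+", value):
        raise ValueError("not a charmap byte string: %r" % value)
    return bytes(int(h, 16) for h in re.findall(pattern, value))


def full_escape_sequence(value: str, esc: bytes = b"\x1b") -> bytes:
    """Prepend the codeset's ESC byte to the bytes taken from the tag."""
    return esc + parse_charmap_bytes(value)


seq = full_escape_sequence("/x28/x48")
print(seq)
```

with the default, the result is ESC ( H; a codeset whose ESC is not 0x1b
would have to pass its own value, which the tag syntax gives no place for.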


4. repertoiremap can't handle many-to-many mappings

for example, JISX0213-2004 <-> UCS-4 conversion needs many-to-one mapping
(and some arabic too!), but repertoiremap seems to allow only one-to-one mapping.
use Unicode Named Sequences? how would we map them to wchar_t?
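
a sketch of why a one-to-one table is not enough. the symbolic names are
made up, and the code point pair for the semi-voiced ka (U+304B U+309A,
JISX0213 1-4-87) is quoted from memory of the mapping table, so treat it as
illustrative and check it before relying on it.

```python
# Hypothetical repertoire entries: symbolic name -> UCS code point sequence.
REPERTOIRE = {
    "<hiragana-ka>": (0x304B,),
    "<hiragana-ka-semivoiced>": (0x304B, 0x309A),   # one character, two code points
}


def to_single_wchar(name: str) -> int:
    """Return the single code point for a name, as a one-to-one
    repertoiremap (and a single wchar_t) would require."""
    code_points = REPERTOIRE[name]
    if len(code_points) != 1:
        raise ValueError(
            "%s needs %d code points, not one" % (name, len(code_points))
        )
    return code_points[0]


print(hex(to_single_wchar("<hiragana-ka>")))
```

the second entry has no single UCS code point, so it can't be stored in
the one-to-one form at all.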


and more, more, more... how simple mklocale(1) is, lol.

an early stage of the implementation (= garbage) is here:
http://sigsegv.s25.xrea.com/distfiles/citrus/NetBSD/localedef-20070329.tar.bz2


very truly yours.
--
Takehiko NOZAKI<tnozaki%NetBSD.org@localhost>

