tech-userlevel archive
localedef(1) (Was: [OT]Who's in charge of Localization (L10N) in NetBSD?)
hi,
> I've also started work on a localedef(1) tool. Again, not enough time
> to complete.
i have started to implement localedef(1) too :-)
but it is suspended for now, because the specification is unclear
(and also i have no time to solve these problems).
the POSIX localedef(1) specification apparently disregards some
stateful encodings.
charmap can't correctly handle locking-shift escape sequences such as
those of ISO/IEC 2022(*1) :-<
*1 http://www.ecma-international.org/publications/standards/Ecma-035.htm
OTOH, we already have these encodings (like ja_JP.ISO2022-JP,
ja_JP.CTEXT and so on).
to keep backward compatibility, we have to support locking-shift
stateful encodings in charmap.
the ISO/IEC TR 14652(*2) extensions describe how to handle ISO/IEC
2022 escape sequences,
by using the <escseq2022> and <include> tags in charmap.
# why not hz-gb2312? n-byte hangul? viqr? or any other stateful encoding?
*2 http://www.open-std.org/JTC1/SC22/WG20/docs/n972-14652ft.pdf
but it is not enough, if i don't misunderstand the spec :-(
the following problems still exist:
1. the <include>/<escseq2022> tags can't handle ISO/IEC 2022's single-shift correctly.
using single-shift is not a special case, see:
+ ISO-2022-CN-EXT(RFC1922) (*3)
+ eucJP with supplementary ideographs (jisx0212) and half-width
kana (jisx0201-right)
(UI-OSF Application Platform Profile for Japanese Environment
Version 1.1) (*4)
*3 http://www.ietf.org/rfc/rfc1922.txt
*4 http://home.m05.itscom.net/numa/uocjleE.pdf
the localedef(1) parser can't determine whether a tag means
LS2 or SS2, LS3 or SS3, LS2R or SS2R, LS3R or SS3R:
designation -> tag
LS0(SI)  <include> "g0";"g0",...;
LS1(SO)  <include> "g1";"g0",...;
LS2      <include> "g2";"g0",...;
LS3      <include> "g3";"g0",...;
LS1R     <include> "g1";"g1",...;
LS2R     <include> "g2";"g1",...;
LS3R     <include> "g3";"g1",...;
SS2      <include> "g2";"g0",...; /* XXX same as LS2 */
SS3      <include> "g3";"g0",...; /* XXX same as LS3 */
SS2R     <include> "g2";"g1",...; /* XXX same as LS2R */
SS3R     <include> "g3";"g1",...; /* XXX same as LS3R */
because the escape sequence is hardcoded in the <escseq2022> tag of the
*included* files, like:
...
<escseq2022> "g0";"g0";"/x28/x48"
<escseq2022> "g1";"g0";"/x29/x48"
...
this escape sequence (including the final character) means locking-shift,
not single-shift (0x8E/0x8F).
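
the collision in the table above can be checked mechanically. a minimal sketch in Python, assuming each shift function is reduced to the two fields the <include> tag carries (read here as G-set and GL/GR); the dict and function names are made up for illustration and are not part of TR 14652 or of any localedef code:

```python
# hypothetical model: shift function -> the two fields of the <include> tag
SHIFT_TO_TAG = {
    "LS0": ("g0", "g0"),
    "LS1": ("g1", "g0"),
    "LS2": ("g2", "g0"),
    "LS3": ("g3", "g0"),
    "LS1R": ("g1", "g1"),
    "LS2R": ("g2", "g1"),
    "LS3R": ("g3", "g1"),
    "SS2": ("g2", "g0"),
    "SS3": ("g3", "g0"),
    "SS2R": ("g2", "g1"),
    "SS3R": ("g3", "g1"),
}


def candidates(tag):
    """Return every shift function that would produce this tag."""
    return sorted(name for name, t in SHIFT_TO_TAG.items() if t == tag)


def ambiguous_tags():
    """Return only the tags that more than one shift function maps to."""
    seen = {}
    for name, tag in SHIFT_TO_TAG.items():
        seen.setdefault(tag, []).append(name)
    return {tag: names for tag, names in seen.items() if len(names) > 1}
```

candidates(("g2", "g0")) gives both LS2 and SS2, so a parser that only sees the tag has nothing left to decide with.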
alternatively, do we have to treat single-shift as a multibyte sequence, like Shift_JIS?
...
<include> "g2";"g0","JISX0208-1990"; # LS2
CHARMAP
# XXX: SS2
<half-width-kana-a> /x8E/xAE
...
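
why the two kinds of shift can't share one tag: a locking shift changes decoder state until the next shift, while a single shift covers only the following character, so it behaves like a lead byte. a rough Python sketch, assuming single-byte sets only and the 8-bit SS2 (0x8E); decode_units is a hypothetical helper, not taken from any existing implementation:

```python
SO, SI = 0x0E, 0x0F  # locking shifts LS1 / LS0
SS2 = 0x8E           # single-shift two, 8-bit form


def decode_units(data):
    """Split bytes into (G-set, byte) units.

    SO/SI change the set invoked in GL until the next shift.
    SS2 applies to the next byte only, so it is consumed together
    with that byte, like the lead byte of a two-byte character.
    """
    gl = "g0"
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SO:
            gl = "g1"
        elif b == SI:
            gl = "g0"
        elif b == SS2:
            if i + 1 >= len(data):
                raise ValueError("truncated single-shift sequence")
            out.append(("g2", data[i + 1]))
            i += 1  # skip the byte covered by the single shift
        else:
            out.append((gl, b))
        i += 1
    return out
```

after SS2 the decoder falls back to whatever GL held before, without any explicit shift back; after SO it stays in G1 until SI.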
2. the spec disregards the `empty charset designation by default' case, such as:
+ ISO-2022-KR(RFC1557) (*5)
*5 http://www.ietf.org/rfc/rfc1557.txt
RFC1557 says:
ESC $ ) C Appears once in the beginning of a line
before any appearance of SO characters.
so G1 should be empty in the initial state, but we can't express that.
or should i work around this case by using an EMPTY designation?
...
<include> "g0";"g0";"ISO646-US"
<include> "g1";"g0";"EMPTY" /* XXX kludge */
<include> "g1";"g0";"KSC5601"
...
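
to make the `empty G1' requirement concrete, here is a small Python sketch of a scanner in which G1 starts out empty and SO is rejected until the designator has been seen. it assumes only the ESC $ ) C designator and SO/SI from RFC1557, ignores line handling, and scan_iso2022kr is a made-up name:

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F
KSC_DESIGNATOR = bytes([ESC, 0x24, 0x29, 0x43])  # ESC $ ) C


def scan_iso2022kr(data):
    """Return a list of (charset, bytes) units.

    G1 is empty (None) in the initial state; SO before the
    designator is treated as an error in this sketch.
    """
    g1 = None
    gl = "g0"
    units = []
    i = 0
    while i < len(data):
        if data.startswith(KSC_DESIGNATOR, i):
            g1 = "KSC5601"
            i += len(KSC_DESIGNATOR)
            continue
        b = data[i]
        if b == SO:
            if g1 is None:
                raise ValueError("SO while G1 is still empty")
            gl = "g1"
            i += 1
        elif b == SI:
            gl = "g0"
            i += 1
        elif gl == "g1":
            # KS C 5601 is a two-byte set
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            units.append((g1, data[i:i + 2]))
            i += 2
        else:
            units.append(("ISO646-US", data[i:i + 1]))
            i += 1
    return units
```

the point is the g1 = None initial value: a charmap syntax that always requires some charset name for G1 has no way to say this, short of a placeholder like EMPTY.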
3. <escseq2022> doesn't include ESC (0x1b);
hardcoding 0x1b is not Codeset Independent!
4. repertoiremap can't handle many-to-many mappings.
for example, JISX0213-2004 <-> UCS-4 conversion needs many-to-one mappings
(and some arabic too!), but repertoiremap seems to allow only one-to-one mappings.
use Unicode Named Sequences? how to map them to wchar_t?
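
a sketch of what a sequence-aware table could look like, in Python. the symbolic names are invented for illustration; the pair U+304B U+309A is the kind of sequence JIS X 0213 needs for a single character. this says nothing about the wchar_t question, it only shows that the table values have to be sequences and the reverse lookup has to try the longest match first:

```python
# hypothetical repertoire: symbolic name -> tuple of UCS code points
REPERTOIRE = {
    "<hiragana-ka>": (0x304B,),
    "<hiragana-ka-semivoiced>": (0x304B, 0x309A),  # one character, two code points
}


def to_ucs(symbols):
    """Expand symbolic names into a flat list of code points."""
    out = []
    for sym in symbols:
        out.extend(REPERTOIRE[sym])
    return out


def from_ucs(code_points):
    """Map code points back to symbolic names, longest sequence first."""
    reverse = {seq: sym for sym, seq in REPERTOIRE.items()}
    longest = max(len(seq) for seq in reverse)
    out = []
    i = 0
    while i < len(code_points):
        for n in range(min(longest, len(code_points) - i), 0, -1):
            sym = reverse.get(tuple(code_points[i:i + n]))
            if sym is not None:
                out.append(sym)
                i += n
                break
        else:
            raise KeyError("no mapping at index %d" % i)
    return out
```

with a strictly one-to-one table the second entry can't be written at all, and a naive reverse lookup would split it into <hiragana-ka> plus an unmapped combining mark.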
and more, more, more... how simple mklocale(1) is, lol.
an early stage of the implementation (= garbage) is here:
http://sigsegv.s25.xrea.com/distfiles/citrus/NetBSD/localedef-20070329.tar.bz2
very truly yours.
--
Takehiko NOZAKI<tnozaki%NetBSD.org@localhost>