tech-userlevel archive


Re: localeio



hi,

> > 1. please don't install LC_* those codeset is not supported by
> > iconv_open(3) yet,
> > such as ISCII-DEV(LC_CTYPE that i maintain keep this rule).
>
> Fine, I have no problem with this.  Do you have such a list?

as far as i glanced:

* be_BY.CP1131
  iconv(3) is ok, but LC_CTYPE is not.
  we can get be_BY.CP1131's LC_CTYPE src from FreeBSD too,
  but the only request we got was for CP1251.
  http://mail-index.netbsd.org/tech-userlevel/2006/03/14/0000.html
  (hi, cheusov!)

* am_ET.UTF-8, he_IL.UTF-8, mn_MN.UTF-8
  LC_CTYPE support is missing.
  yes, we can add en_US.UTF-8 -> {am_ET,he_IL,mn_MN}.UTF-8 alias.

* hi_IN.ISCII-DEV
  LC_CTYPE and iconv(3) support missing.
  i need a conversion table, which i have been looking for.

* zh_CN.GB2312
  zh_CN.GB2312 should be an alias of zh_CN.eucCN,
  this is FreeBSD's redundancy.


> Hmm, I'm not sure that anything that we do will be compatible with
> GNU's ideas.  Do we have to be constrained by GNU?

but it seems that Free Standards Group / Linux Standard Base has come to have
influence in ISO/IEC WG14(C) and WG15(POSIX),
# some glibc2 extensions become ISO/IEC Technical Reports(such as TR24731-2).


> > at this point, no magic, no version controlled locale-db format
> > is not good idea.
>
> I'm still not convinced.  A version might make some things easier but
> will also add complexity.  I'm still not convinced 100% of the benefit.
> However, I'm starting to lean that way.  I think I may have a method
> of encoding at least a rudimentary header with standard tools.

i can't remember well, but glibc2 has some wide-string version of
LC_*'s string fields(if i'm wrong, correct me).
if it is reasonable for implementing our libc's locale function, we might too.

yes, i know those fields can be generated *on the fly* with mbsrtowcs(3)
at the time setlocale(3) is called.
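
a minimal sketch of the on-the-fly conversion, assuming the string field is encoded in the current LC_CTYPE codeset. mb_to_wide() is a hypothetical helper name, not an existing libc function:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Convert a multibyte string to a newly allocated wide string using the
 * current LC_CTYPE locale.  Returns NULL on an invalid sequence or on
 * allocation failure; the caller frees the result.
 * (hypothetical helper, for illustration only)
 */
wchar_t *
mb_to_wide(const char *mbs)
{
	mbstate_t st;
	const char *p;
	wchar_t *ws;
	size_t n;

	assert(mbs != NULL);

	/* first pass: with dst == NULL, only count the wide characters */
	memset(&st, 0, sizeof(st));
	p = mbs;
	n = mbsrtowcs(NULL, &p, 0, &st);
	if (n == (size_t)-1)
		return NULL;

	ws = malloc((n + 1) * sizeof(*ws));
	if (ws == NULL)
		return NULL;

	/* second pass: real conversion, including the terminating L'\0' */
	memset(&st, 0, sizeof(st));
	p = mbs;
	if (mbsrtowcs(ws, &p, n + 1, &st) == (size_t)-1) {
		free(ws);
		return NULL;
	}
	return ws;
}
```

every string field of every category would need a call like this at setlocale(3) time, which is the run-time cost in question.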

but string -> wide-string conversion costs much at run time;
localedef(1) can generate those wide-string fields and store them in the
locale-db in advance.

i prefer the latter, but wide-strings in the locale-db may require care for
byteorder(3) etc.
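
to illustrate the byte-order care, a small sketch that stores each wide character as a fixed big-endian 32-bit unit, so the file does not depend on the host. put_be32()/get_be32() are hypothetical names, and the 32-bit unit size is an assumption, not a decided format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* store a 32-bit value in big-endian order, independent of the host */
void
put_be32(unsigned char *buf, uint32_t v)
{
	assert(buf != NULL);
	buf[0] = (unsigned char)(v >> 24);
	buf[1] = (unsigned char)(v >> 16);
	buf[2] = (unsigned char)(v >> 8);
	buf[3] = (unsigned char)v;
}

/* read back a 32-bit big-endian value */
uint32_t
get_be32(const unsigned char *buf)
{
	assert(buf != NULL);
	return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
	    ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}
```

localedef(1) would write with put_be32() and libc would read with get_be32(), so the same file works on any MACHINE_ARCH.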

that's why i proposed using the citrus_db* stuff.


> > to introduce flexibility, i think it's better to use key-value pair db 
> > format.
> > src/lib/libc/citrus/citrus_db*.[ch] stuff may good for this purpose.
> > # easy to use as match as plain-text, i believe.
>
> I'm not sure I agree.  The database format should be trivially
> parsed or rather loaded by the library.  All of the work would be
> done up front by the tools that create it.  The missing localedef(1)
> would normally do the job.  But in the interim a simple plain-text
> file is far easier to create.  Well, plain-text is an over
> simplification.  The file is really a sequence of bytes.  The strings
> and string like things are newline terminated.  I think this keeps
> things MI.  I'm just not sure about multi-byte sequences.  You'll
> forgive me I don't deal with multi-byte characters in my day-to-day.
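
for reference, a loader for such a newline-terminated byte file can be very small. this is only a sketch: split_fields() is a hypothetical name, and it assumes the byte 0x0A never appears inside a multibyte character of the locale's codeset:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split a newline-terminated buffer in place into at most `max` fields.
 * Each '\n' is overwritten with '\0'; all other bytes are left untouched,
 * so multibyte sequences pass through as-is.
 * Returns the number of fields found.
 */
size_t
split_fields(char *buf, size_t len, char **fields, size_t max)
{
	char *p = buf, *end = buf + len, *nl;
	size_t n = 0;

	assert(buf != NULL && fields != NULL);

	while (p < end && n < max) {
		nl = memchr(p, '\n', (size_t)(end - p));
		if (nl == NULL)
			break;	/* unterminated trailing bytes are ignored */
		*nl = '\0';
		fields[n++] = p;
		p = nl + 1;
	}
	return n;
}
```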

the most important thing is that we keep the ABI and allow the file format to change.

i'm not opposed to importing FreeBSD's text locale-db
as long as we can change the format later.
so i would prefer introducing a sub-directory like:

        /usr/share/locale/*/LC_*/*


> Also what does the citrus_db* stuff gain over say using db(3)?

no, berkeley-db is too heavy to implement iconv(3) and iconvdata,
so tshiozak-san wrote a tiny, fast, nestable hashmap implementation with an
on-disk format read via mmap(2); that's the citrus_db* stuff.

so we can't use db(1); we have to write
new tools, like mkcsmapper(1) and mkesdb(1) for iconvdata.


> Where is any of the citrus stuff documented?

sorry, not documented(hi, tshiozak-san).
src/usr.bin/{mkcsmapper,mkesdb} may be a good example, i hope.


> Is it used anyplace other than iconv(3)?

no, but tshiozak-san intended to use it for the LC_* implementation
when he wrote it, AFAIK.


> > files under /usr/share should be MI,
> > because these can be shared among different MACHINE_ARCH by NFS etc.
> > of course db file generated by citrus_db*[c.h] is MI.
>
> Exactly.  This is why they haven't been encoded as anything other
> than plain-text.  I'm not sure it is practice to share files across
> different OSes.  The citrus stuff maybe MI but citrus isn't everywhere.

i think sharing LC_* databases across different OSes is not required,
only sharing across different versions and architectures of NetBSD.

currently we keep forward-compatibility with FreeBSD's LC_CTYPE format,
but they changed the format at 6.0, so i think it is better to remove
_ReadCTypeAsRune() in src/lib/libc/locale/setrunelocale.c

and, we might have to move LC_* stuff to /usr/libdata/locale or /usr/lib/locale.


> Your proposal is to add the additional "indirection" (sub-directory)
> to all of the categories.  This might be reasonable.  It would allow
> for backward compatibility.

i think LC_CTYPE should have been too, but... :(


> > SUSv3 spec is very ambigious about ``where do we *copy* from information?''
> > if this means:
> > ``copy from /usr/share/locale/en_US.UTF-8/* that compliled by localedef(1)''
> >  we have to restore from (multi)byte-sequence in plain-text db to
> > charmap's symbol-name, it is *impossible*(yes i know LC_CTYPE too).
>
> Huh?  Why would that be the case?  Either copy means take from the
> source (which seems to be GNU's method and that used by IRIX) or
> directly from the "compiled" binary.

by the term *ambiguous*(sorry, misspelling), i meant that the
semantics of the copy instruction changed between SUSv3
and ISO/IEC TR14652.

SUSv3 said that:

<cite>
    the copy statement names a valid, existing locale, then localedef shall
    behave as if the source definition had contained a valid category source
    definition for the named locale.
</cite>

it is clear at this point, "existing" and "valid" locale.
this means that if we wrote following localedef src:

        charmap "UTF-8"
        LC_CTYPE
        copy "ja_JP.eucJP"
        END LC_CTYPE

following code must work fine:

        #include <assert.h>
        #include <locale.h>

        int
        main(void)
        {
                /* the locale named by copy must already be installed */
                char *loc = setlocale(LC_CTYPE, "ja_JP.eucJP");
                assert(loc != NULL);
                return 0;
        }

and we only copy from installed "compiled" locale-db.
but in ISO/IEC TR14652, copy instruction's semantics has been changed.

        charmap "UTF-8"
        LC_CTYPE
        copy "i18n"
        END LC_CTYPE

<cite>
    4.1.3 Names for copy keyword

    In most of the categories a "copy" keyword is allowed.
    The name specified with this copy keyword is one of:
    - "i18n" which indicate the "i18n" FDCC-set defined in this specification,
    - the name of a FDCC-set or POSIX locale registered by the process defined
      in ISO/IEC 15897,
    - any other name which may be recognized in some local context - not being
      recommended as an international specification.
</cite>

"i18n" is neither an existing locale nor a valid locale name.
setlocale(LC_CTYPE, "i18n") may not work.
here copy means take from source.


> I'm not sure I understand the conversion back to "plain-text".  Note
> that the current plain-text database isn't really plain-text.  It
> is actually a sequence of bytes.  I think the multi-byte sequences
> just happen to come out "right".

consider the following case:

        charmap "UTF-8"
        LC_TIME
        copy "ja_JP.eucJP"
        END LC_TIME

if our wchar_t were UCS4 codepoint(this means we can define
 __STDC_ISO_10646__ like glibc2), we could easily convert
ja_JP.eucJP -> ja_JP.UTF-8 directly with steps such as:

- open ja_JP.eucJP locale-db and read multibyte(=eucJP) sequences.
- load eucJP encoding module.
- convert multibyte(=eucJP) to wchar_t(=UCS4) by mbrtowc() in eucJP mod.
- load UTF-8 encoding module.
- convert wchar_t(=UCS4) to multibyte(=UTF-8) by wcrtomb() in UTF-8 mod.
- save multibyte(=UTF-8) to newly created ja_JP.UTF-8 locale-db.
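
the steps above can be sketched with standard functions only. this is a rough illustration, not how localedef(1) is implemented: recode_via_wchar() is a hypothetical name, setlocale(3) stands in for loading an encoding module, and the result is only meaningful when both locales share one wchar_t encoding (the UCS4 assumption):

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/*
 * Re-encode `in` from the codeset of src_locale to that of dst_locale
 * by going through wchar_t.  Changes the process-wide LC_CTYPE locale.
 * Returns 0 on success, -1 on failure.
 */
int
recode_via_wchar(const char *src_locale, const char *dst_locale,
    const char *in, char *out, size_t outsz)
{
	wchar_t wbuf[256];
	const char *mp = in;
	const wchar_t *wp = wbuf;
	mbstate_t st;
	size_t n;

	assert(in != NULL && out != NULL);

	/* multibyte (source codeset) -> wchar_t */
	if (setlocale(LC_CTYPE, src_locale) == NULL)
		return -1;
	memset(&st, 0, sizeof(st));
	n = mbsrtowcs(wbuf, &mp, sizeof(wbuf) / sizeof(wbuf[0]), &st);
	if (n == (size_t)-1 || mp != NULL)
		return -1;	/* invalid sequence or input too long */

	/* wchar_t -> multibyte (destination codeset) */
	if (setlocale(LC_CTYPE, dst_locale) == NULL)
		return -1;
	memset(&st, 0, sizeof(st));
	n = wcsrtombs(out, &wp, outsz, &st);
	if (n == (size_t)-1 || wp != NULL)
		return -1;	/* unrepresentable char or output too small */
	return 0;
}
```

with a CSI wchar_t the second half silently misinterprets the values produced by the first half, because the two locales do not agree on what a wchar_t value means.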

but our wchar_t is not UCS4, because we follow a CSI(=Codeset Independent) policy.
# UCS4 hardwired wchar_t is not enough, read itojun-san's paper:
# http://www.usenix.org/events/usenix01/freenix01/full_papers/hagino/hagino_html/index.html

we can't directly convert from eucJP's wchar_t(=JIS) to UTF-8's wchar_t(=UCS4),
because the encoding modules don't know how to map JIS <-> UCS4 codepoints
(it requires a huge conversion table).

that's why i think it is *impossible*, but...


> > Solaris, they don't copy information from
> > /usr/lib/localedef/src/en_US.UTF-8/*.src
> > but /usr/share/locale/en_US.UTF8/* stuffs, as far as i know from
> > truss(1)'s output.
>
> Which version of Solaris?   I don't have /usr/share/locale on my
> Solaris 9 box.  I've got /usr/lib/localedef/src and /usr/lib/locale.
> The latter has dynamic shared objects created via localedef(1).

sorry, s/\/usr\/share/\/usr\/lib/; please.

Solaris's wchar_t is not UCS4 but CSI; they don't define __STDC_ISO_10646__.
this is the same policy as NetBSD.

it seems that my Solaris 8 box's localedef(1) loads the ja_JP.eucJP encoding module
and calls its __{mbtowc,wctomb}_dense_eucjp functions
(from truss(1) information, i don't read their CDDL source.
correct me if i'm wrong).

i guess the following conversion happens internally:

    multibyte(eucJP) -> wchar_t(JIS) -> ? -> wchar_t(UCS4) -> multibyte(UTF-8)

i once said that converting wchar_t(JIS) -> wchar_t(UCS4) is *impossible*,
but the Solaris people found a way; i think they use iconv(3)'s tables.

of course, we can adopt the same way as Solaris.
but we might not: since localedef(1) is assumed to be a part of the toolchain,
cross-build capability is required.

- putting iconv(3) and iconvdata(up to 10MB) into libnbcompat is quite a hell.
- dynamic loading of encoding modules is not portable.

and i think converting between different codesets is quite an overhead.
so, using the symbol-names stored in the LC_* database is a very simple
and reasonable way.

yes, you may think it is useless information for the libc runtime, a waste of memory.
so my idea is to "split locale-db into pieces" like:

        /usr/share/locale/*/LC_*/
                localedb.1      => libc's locale function only read this.
                localedefdb.1   => store localedef src's symbol-name for
                                 localedef(1)'s copy instruction.
                charmapdb.1     => (LC_CTYPE only) used by iconv(3),
                                 build from charmap + repertoiremap.


> > # localedef(1) is quite a beast from ``spec then code'' outer space.
> > # please read my past tech-userlevel's post:
>
> No kidding.  But the spec wasn't created in a vacuum.  It just
> tried to codify existing stuff.  In this case I think the stuff
> that originally came from System V.  I also don't believe it
> was ``spec then code'' as the System V stuff probably existed
> before the spec.  From what I've seen, these standards are pretty
> much codify existing.  Unlike some others...

"spec then buy code" :)

my point is that the localedef(1) spec apparently lacks care for stateful
encodings that use locking-shift such as ISO-2022, hz-gb2312 and so on,
in spite of mbrtowc/wcrtomb being designed to support them.
the SysV MNLS spec is too old to support them, but they didn't brush it up until TR14652.
(and TR14652 has some problems...)


> I know I read it when you originally replied to me.  Not sure I
> understood all of it.  I meant to read it again and get back to
> you.

i believe ISO/IEC TR14652's charmap extension, <escseq2022> is not
enough to support existing ISO-2022 locale stuff.

the possible solutions are:

1. introduce our own extension(like <netbsd:xxx> tag) stuff.
2. temporarily revoke ISO-2022 locales, such as
  ja_JP.ISO2022-JP, ja_JP.ISO2022-JP2, ja_JP.ct(i'm not willing...)


very truly yours.
--
Takehiko NOZAKI <tnozaki%NetBSD.org@localhost>

-- 
Takehiko NOZAKI<takehiko.nozaki%gmail.com@localhost>

