tech-userlevel archive


[RFC] introducing new locale-db implementation (Re: lib/39662: shortcomings in LC_{MONETARY,NUMERIC,TIME,MESSAGES} db format)



happy new year, all!

let's recall the following discussion about the locale-db format.
http://mail-index.netbsd.org/tech-userlevel/2008/05/21/msg000591.html

let me summarize:

1. the lack of a magic number and of any versioning mechanism is a killer
   for backward binary compatibility of libc itself.

2. a plain-text based db file can't store wide string data, and
   "on the fly" conversion is not a good idea. we need a more efficient
   format that can easily handle byteorder(3) issues.

3. making /usr/share/locale/*/LC_MESSAGES a monolithic file
   conflicts with gettext(3)'s namespace.

4. we already have too many locale db formats:
   LC_CTYPE(rune), *.cat(catgets), *.mo(gettext), citrus_db(iconv).
   introducing another format is not a good idea.
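
to make points 1 and 2 concrete, here is a minimal sketch of what a
header check with a magic number, a version field and fixed byte order
could look like. the layout, the LDB_* names and ldb_parse_header() are
hypothetical and only for illustration; this is not the citrus_db format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk header: 8-byte magic, then a big-endian 32-bit
 * version and a big-endian 32-bit payload size. Not the citrus_db layout. */
#define LDB_MAGIC      "LOCALEDB"
#define LDB_MAGIC_LEN  8
#define LDB_HEADER_LEN (LDB_MAGIC_LEN + 4 + 4)
#define LDB_VERSION    1U

struct ldb_header {
	uint32_t version;
	uint32_t payload_size;
};

/* Decode a big-endian 32-bit value byte by byte, so the result is the
 * same on little-endian and big-endian hosts. */
static uint32_t
ldb_be32dec(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the buffer is not a usable locale-db. */
static int
ldb_parse_header(const unsigned char *buf, size_t len, struct ldb_header *hdr)
{
	assert(buf != NULL && hdr != NULL);

	if (len < LDB_HEADER_LEN)
		return -1;	/* too short to hold a header */
	if (memcmp(buf, LDB_MAGIC, LDB_MAGIC_LEN) != 0)
		return -1;	/* wrong or missing magic number */
	hdr->version = ldb_be32dec(buf + LDB_MAGIC_LEN);
	hdr->payload_size = ldb_be32dec(buf + LDB_MAGIC_LEN + 4);
	if (hdr->version != LDB_VERSION)
		return -1;	/* format revision we don't understand */
	if (hdr->payload_size != len - LDB_HEADER_LEN)
		return -1;	/* truncated file or trailing garbage */
	return 0;
}
```

with a check like this, an older libc that meets a newer db revision can
refuse it cleanly instead of misreading it.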

before shipping 5.0, we have to fix these problems (they are
already filed as PR/39662, which is a blocker for netbsd-5).

so i wrote a brand new localedata implementation for LC_*.
it uses the citrus_db framework as its backend (we already use citrus_db
to implement iconv).

here is the patch against HEAD and netbsd-5.
ftp://ftp.netbsd.org/pub/NetBSD/misc/tnozaki/

i've already checked that this patch doesn't break the release build for:
   i386, amd64, hpcarm, hpcmips, hpcsh, vax.

i want to commit this patch to HEAD and send a pullup-5 request.
are there any objections or comments?


P.S.

i think it is better to merge only the libc change,
and not install the LC_{MONETARY,NUMERIC,TIME,MESSAGES} locale-db for 5.0
(currently, this patch **installs** all kinds of locale-db, see
src/share/locale/Makefile).

for the following reasons:

1. our regex(3) doesn't support multibyte encodings such as UTF-8,
so it can't parse a multibyte LC_MESSAGES's yesexpr/noexpr correctly.
we have to introduce a multibyte-aware regex(3).

2. some locales (ja_JP.eucJP, ko_KR.eucKR) assign 0x5c (\) to LC_MONETARY's
currency_symbol, a zombie derived from the international version of ISO 646,
and i'm afraid this breaks some shell scripts.
   $ LANG=ja_JP.eucJP locale -k currency_symbol
   currency_symbol="\"

in Solaris' locale(1), by contrast, 0x5c is properly escaped:
   currency_symbol="\\"

we have to fix our locale(1).

3. date(1) output looks too strange under some locales (ja_JP.eucJP and
so on), because the format string is hardcoded:
[src/bin/date.c]
   format = "%a %b %e %H:%M:%S %Z %Y";

this format should be "%+", but it seems that our strftime(3) lacks the
"%+" conversion.  moreover, LC_TIME's d_t_fmt field doesn't include
%a (weekday) and %Z (timezone).  so we have to fix date(1), add a new field
to implement "%+", and maintain the locale definition files.
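
as a rough sketch of the date(1) side of reason 3: the helper below
prefers a format string supplied by the locale and falls back to the
hardcoded one. date_format() and its locale_fmt argument are hypothetical;
locale_fmt only stands in for the new LC_TIME field that would back "%+",
which does not exist yet.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* The format currently hardcoded in date(1). */
#define DATE_FALLBACK_FMT "%a %b %e %H:%M:%S %Z %Y"

/*
 * locale_fmt stands in for a hypothetical new LC_TIME field (the "%+"
 * format). NULL or "" means the locale does not provide one, in which
 * case the hardcoded format is used.
 * Returns what strftime(3) returns: the number of bytes written, or 0
 * if the result did not fit in buf.
 */
static size_t
date_format(char *buf, size_t len, const char *locale_fmt,
    const struct tm *tm)
{
	const char *fmt;

	assert(buf != NULL && tm != NULL);
	fmt = (locale_fmt != NULL && locale_fmt[0] != '\0') ?
	    locale_fmt : DATE_FALLBACK_FMT;
	return strftime(buf, len, fmt, tm);
}
```

date(1) would then pass whatever the locale-db provides and keep its
current behaviour for locales that have no such field.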

i think there is no time to fix these problems before the 5.0 release...
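
going back to reason 2, the locale(1) fix could be as small as the sketch
below, which puts a backslash before every backslash and double quote in
a keyword value before printing it. escape_value() is a hypothetical
helper, not existing locale(1) code, and it works byte by byte, so it
assumes an encoding where 0x5c never appears inside a multibyte character
(true for EUC and UTF-8).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy src into dst, putting a backslash before every backslash and
 * double quote. Returns the escaped length, or (size_t)-1 if dst
 * (dstlen bytes, including the terminating NUL) is too small.
 */
static size_t
escape_value(char *dst, size_t dstlen, const char *src)
{
	size_t n = 0;

	assert(dst != NULL && src != NULL);
	for (; *src != '\0'; src++) {
		size_t need = (*src == '\\' || *src == '"') ? 2 : 1;

		if (n + need >= dstlen)
			return (size_t)-1;	/* no room for byte(s) + NUL */
		if (need == 2)
			dst[n++] = '\\';
		dst[n++] = *src;
	}
	if (n >= dstlen)
		return (size_t)-1;	/* no room for the NUL itself */
	dst[n] = '\0';
	return n;
}
```

a value consisting of the single byte 0x5c then comes out as two
backslashes, which matches the Solaris output quoted above.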


P.P.S.

i once suggested making LC_* a sub-directory for versioning.
http://mail-index.netbsd.org/tech-userlevel/2008/05/23/msg000602.html

but i abandoned this, because we already have a monolithic LC_CTYPE db.
so my previous idea for localedef(1) at tech-userlevel@ is
hard to realize ;-< and i think it would be very confusing for
monolithic dbs and modular dbs to exist at the same time.

anyway, forward compatibility is no problem: the new setlocale(3) can read
the previous plain-text format as well as the new citrus_db locale-db.
but backward compatibility is not ok, because localeio.c never validates
the locale-db: no S_ISREG check, no magic, no size checking :-<
# that's why i was strongly against localeio at tech-userlevel@.
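
for illustration, the file-level checks that are missing could look
roughly like this. ldb_open_checked() and the LDB_MIN_SIZE/LDB_MAX_SIZE
bounds are hypothetical; this is a sketch of the idea, not code from the
patch or from localeio.c.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical bounds; a real implementation would derive them from
 * the db format. */
#define LDB_MIN_SIZE 16
#define LDB_MAX_SIZE (1024 * 1024)

/*
 * Open path read-only and return the descriptor only if it refers to a
 * regular file of plausible size; otherwise return -1.
 */
static int
ldb_open_checked(const char *path)
{
	struct stat st;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -1;
	if (fstat(fd, &st) == -1 ||
	    !S_ISREG(st.st_mode) ||		/* e.g. a directory */
	    st.st_size < LDB_MIN_SIZE ||	/* cannot hold a header */
	    st.st_size > LDB_MAX_SIZE) {	/* refuse absurdly large files */
		close(fd);
		return -1;
	}
	return fd;
}
```

a magic-number check on the file contents would follow once the
descriptor has passed these tests.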


very truly yours.
--
Takehiko NOZAKI <tnozaki%NetBSD.org@localhost>

