Re: localeio

To: Takehiko NOZAKI <takehiko.nozaki%gmail.com@localhost>
Subject: Re: localeio
From: Brian Ginsbach <ginsbach%netbsd.org@localhost>
Date: Sat, 24 May 2008 03:27:48 +0000
I'm not sure about the mixing of various "comments" here.  Some are
prefaced by "#"; I'm assuming these are from an earlier email thread.

On Fri, May 23, 2008 at 06:43:01PM +0900, Takehiko NOZAKI wrote:
> hi, all.
> 
> my opinions about localeio:
> 
> 1. please don't install LC_* whose codeset is not supported by
> iconv_open(3) yet,
> such as ISCII-DEV (the LC_CTYPE files that i maintain keep this rule).

Fine, I have no problem with this.  Do you have such a list?

> 
> 2. LC_MESSAGES locale-db, /usr/share/locale/*/LC_MESSAGES may
> conflict with some gettext(3) mo message catalogs.
> they're frequently stored as /usr/share/locale/*/LC_MESSAGES/*.mo .
> 
> # in many cases, GNU configure sets bindtextdomain(3)'s 2nd argument to
> # "/usr/share/locale" by default when using the -DLOCALEDIR macro.

Hmm, I'm not sure that anything that we do will be compatible with
GNU's ideas.  Do we have to be constrained by GNU?

I do agree that the LC_MESSAGES stuff should be installed with an
additional directory.  I see that this is how Solaris and most GNU
systems work.

> 
> AFAIK, glibc2 uses /usr/share/locale/*/LC_MESSAGES/SYS_LC_MESSAGES
> for LC_MESSAGES locale-db.

I guess that is one possibility.  It looks silly but it could even
be /usr/share/locale/*/LC_MESSAGES/LC_MESSAGES.  The gettext(3)
catalogs will have different names and a .mo suffix.

> 
> # in the past, they directly used a gettext(3) mo catalog for
> # libc's LC_MESSAGES locale-db, /usr/share/locale/*/LC_MESSAGES/libc.mo.

I agree.  I had forgotten about the fact that gencat(1) generated
catalogs also live in /usr/share/locale/*/LC_MESSAGES.  A directory
makes perfect sense.  This needs to be accounted for in the current
code.

> 
> don't forget we already have own BSDL libintl implementation in base.
> i believe it is reasonable that /usr/share/locale/*/LC_MESSAGES is S_IFDIR.

Again, I concur.  I had forgotten about message catalogs.  I need to
fix this oversight.

> 
> 3. LC_TIME locale-db still lacks ERA, ERA_D_FMT, ERA_D_T_FMT ERA_T_FMT
> langinfo stuffs(and LC_MONETARY locale-db doesn't have CRNCYSTR too).

Um, I think I've followed what is specified by The Open Group Base
Specifications Issue 6 / IEEE Std 1003.1, 2004 Edition.

Yes, ERA (era), ERA_D_FMT (era_d_fmt), ERA_D_T_FMT (era_d_t_fmt),
and ERA_T_FMT (era_t_fmt) are missing.  Also missing is alt_digits.
The assumption was that these could be added to the end of LC_TIME.
Yes, a magic number and/or version would make it easier to support.

My read is that CRNCYSTR is a composite of fields from LC_MONETARY.
From the standard, "If the locale's values for p_cs_precedes and
n_cs_precedes do not match, the value of nl_langinfo(CRNCYSTR) is
unspecified."  This, to me, implies that the CRNCYSTR needs to be
made up by using these fields along with currency_symbol and
mon_decimal_point.  This means that CRNCYSTR is not actually present
in the "database" but is derived when asked for in nl_langinfo(3).
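
A minimal sketch of that derivation, in Python for brevity (the real
code would be C inside nl_langinfo(3)).  It assumes the usual langinfo
convention that CRNCYSTR is the currency symbol prefixed by "-" when
the symbol precedes the value and "+" when it follows.  The function
name is hypothetical, returning an empty string for the unspecified
case is only this sketch's choice, and the "." form (symbol replaces
the radix character) is not handled.

```python
def derive_crncystr(currency_symbol, p_cs_precedes, n_cs_precedes):
    """Build a CRNCYSTR-style string from LC_MONETARY fields (sketch)."""
    # The standard leaves the value unspecified when the two flags differ;
    # returning "" here is an arbitrary choice made by this sketch.
    if p_cs_precedes != n_cs_precedes:
        return ""
    # With no currency symbol there is nothing to prefix.
    if not currency_symbol:
        return ""
    # "-" marks a symbol that precedes the value, "+" one that follows it.
    prefix = "-" if p_cs_precedes == 1 else "+"
    return prefix + currency_symbol
```

Nothing extra has to be stored in the LC_MONETARY file for this; the
value is computed from fields that are already there.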

> 
> thus, sooner or later we have to change file format of
> LC_TIME, LC_MONETARY locale-db, aim to support such langinfo stuffs.
> # as for me(=with ja_JP.* locale), ERA stuff is the very very familiar one.

Only LC_TIME, at least based on current standards, would need to
change.  Again, these could be added as additional "fields" to the
end of the "database".  I don't see that the format itself changes.
Only the size of the file changes.  This could be handled, if a
database without the additional ("new") fields is still valid.  It
should be fairly easy to accept either length database.
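
As a rough illustration of accepting either length, here is a small
Python sketch (the real loader would be C in libc).  The field counts
are toy values, not the real LC_TIME layout, and the names are
hypothetical.  It assumes every field is newline terminated and that
an old-style file simply stops before the appended fields.

```python
BASE_FIELDS = 3   # hypothetical number of fields in the old layout
EXTRA_FIELDS = 2  # hypothetical number of fields appended later


def load_fields(data: bytes):
    """Split a newline-terminated byte database into its fields."""
    # Every field, including the last one, must end with a newline.
    if not data.endswith(b"\n"):
        raise ValueError("database is not newline terminated")
    fields = data[:-1].split(b"\n")
    if len(fields) == BASE_FIELDS:
        # Old-length file: treat the appended fields as empty.
        fields += [b""] * EXTRA_FIELDS
    elif len(fields) != BASE_FIELDS + EXTRA_FIELDS:
        raise ValueError("unexpected number of fields")
    return fields
```

The fields stay as raw bytes, so multi-byte sequences pass through
untouched as long as they contain no newline byte.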

> 
> at this point, no magic, no version controlled locale-db format
> is not good idea.

I'm still not convinced.  A version might make some things easier but
will also add complexity, and I'm not 100% convinced of the benefit.
However, I'm starting to lean that way.  I think I may have a method
of encoding at least a rudimentary header with standard tools.

> 
> # file format has a big influence on backward binary compatibility.
> # we can't easily change file format even if we have to do.
> 

Yes, this is true.  But how often will the format need to change?
Is the additional complexity worth it?

Now I do agree that it might be worthwhile to give some more thought
to backward compatibility.

> to introduce flexibility, i think it's better to use key-value pair db format.
> src/lib/libc/citrus/citrus_db*.[ch] stuff may be good for this purpose.
> # as easy to use as plain-text, i believe.

I'm not sure I agree.  The database format should be trivially
parsed, or rather loaded, by the library.  All of the work would be
done up front by the tools that create it.  The missing localedef(1)
would normally do the job.  But in the interim a simple plain-text
file is far easier to create.  Well, plain-text is an
oversimplification.  The file is really a sequence of bytes.  The
strings and string-like things are newline terminated.  I think this
keeps things MI.  I'm just not sure about multi-byte sequences.
You'll forgive me; I don't deal with multi-byte characters in my
day-to-day work.

Also what does the citrus_db* stuff gain over say using db(3)?

Where is any of the citrus stuff documented?

Is it used anyplace other than iconv(3)?

Sure is a lot of complicated code without any documentation.

> 
> files under /usr/share should be MI,
> because these can be shared among different MACHINE_ARCH by NFS etc.
> of course db file generated by citrus_db*[c.h] is MI.

Exactly.  This is why they haven't been encoded as anything other
than plain-text.  I'm not sure it is common practice to share files
across different OSes.  The citrus stuff may be MI but citrus isn't
everywhere.

Adding magic numbers or versions requires additional steps to
ensure they are MI.  Also, my goal was to get something -- rather
than nothing -- before localedef(1) is written.  This means, I feel,
using standard tools to create the databases.

Whoa, wait a minute.  I now realize, after looking at the LC_CTYPE
magic number, that it can just be ASCII characters.  This makes
things a bit easier.  I guess I was thinking of a binary magic number.
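
To illustrate why an ASCII magic is painless, here is a small Python
sketch (the real check would be C in libc).  The magic string
"LOCALEDB", the "magic version" first-line layout, and the function
name are all hypothetical.  It also assumes that a file without the
header is treated as the old headerless format, reported as version 0.

```python
MAGIC = b"LOCALEDB"  # hypothetical ASCII magic string


def split_header(data: bytes):
    """Return (version, body) for a database with an optional ASCII header."""
    first, sep, rest = data.partition(b"\n")
    parts = first.split(b" ")
    # A header line looks like b"LOCALEDB 2": the magic, a space, digits.
    if sep and len(parts) == 2 and parts[0] == MAGIC and parts[1].isdigit():
        return int(parts[1]), rest
    # No header: assume the old headerless layout and call it version 0.
    return 0, data
```

Because the header is just bytes followed by a newline, it can be
written with echo(1) or printf(1), and it reads the same regardless
of byte order.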

> 
> # FreeBSD's LC_CTYPE/LC_COLLATE is installed in /usr/share,
> # but their format was MD, they don't use intNN_t/uintNN_t stuff and
> # pay no attention to byteorder(3)... thus, they shed blood
> # at the FreeBSD 6.0-Release, see more:
> # http://www.freebsd.org/releases/6.0R/relnotes-i386.html#USERLAND

I didn't change LC_CTYPE.  It still uses the rune stuff.  Nor did
I attempt to tackle LC_COLLATE.  I took on the "easy" ones that
just have simple, for the most part, "strings".

> 
> or use different namespace(as Solaris does) for plain-text db, like:
> 
>     /usr/share/locale/LC_TIME/localedb.1
> 
> if we change file format, simply change version suffix:
> 
>     /usr/share/locale/LC_TIME/localedb.2

Your proposal is to add the additional "indirection" (sub-directory)
to all of the categories.  This might be reasonable.  It would allow
for backward compatibility.
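
A sketch of how a lookup over such versioned files might go, in
Python for brevity.  The function name, the per-locale directory in
the usage example, and the idea that a newer library falls back to an
older file are assumptions for illustration; the proposal above only
gives the localedb.N naming.

```python
import os


def find_localedb(locale_dir, category, supported_versions,
                  exists=os.path.exists):
    """Return (path, version) of the newest usable database, or None."""
    # Try the newest format this library understands first.
    for version in sorted(supported_versions, reverse=True):
        path = f"{locale_dir}/{category}/localedb.{version}"
        if exists(path):
            return path, version
    return None
```

An old library would pass only [1] and never look at localedb.2; a
newer one would pass [1, 2] and still work on a system that has only
the old file installed.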

> 
> old/new ABI keeps very happy(except LC_CTYPE...but it has magic).
> 
> 4. as for localedef(1), we might have to keep charmap's symbol-name in
> LC_* locale-db, because localedef(1) require "copy" instruction such as:
> 
>     LC_MONETARY
>     copy "en_US.UTF-8"
>     END LC_MONETARY
> 
> SUSv3 spec is very ambiguous about ``where do we *copy* information from?''
> if this means:
> ``copy from /usr/share/locale/en_US.UTF-8/* that was compiled by localedef(1)''
>  we have to restore from (multi)byte-sequence in plain-text db to
> charmap's symbol-name, it is *impossible*(yes i know LC_CTYPE too).

Huh?  Why would that be the case?  Either copy means take from the
source (which seems to be GNU's method and that used by IRIX) or
directly from the "compiled" binary.

I'm not sure I understand the conversion back to "plain-text".  Note
that the current plain-text database isn't really plain-text.  It
is actually a sequence of bytes.  I think the multi-byte sequences
just happen to come out "right".

> 
> it seems that glibc2 people interpret this ambiguity as
> ``let's install localedef src file in /usr/share/i18n/localedef, and
> copy from it, yeah!''
> but i suspect that this is a result of the compromise.

Maybe.

> 
> Solaris, they don't copy information from
> /usr/lib/localedef/src/en_US.UTF-8/*.src
> but /usr/share/locale/en_US.UTF8/* stuffs, as far as i know from
> truss(1)'s output.

Which version of Solaris?   I don't have /usr/share/locale on my
Solaris 9 box.  I've got /usr/lib/localedef/src and /usr/lib/locale.
The latter has dynamic shared objects created via localedef(1).

> 
> # localedef(1) is quite a beast from ``spec then code'' outer space.
> # please read my past tech-userlevel's post:

No kidding.  But the spec wasn't created in a vacuum.  It just
tried to codify existing stuff.  In this case I think the stuff
originally came from System V.  I also don't believe it
was ``spec then code'' as the System V stuff probably existed
before the spec.  From what I've seen, these standards pretty
much codify existing practice.  Unlike some others...

> # http://marc.info/?l=netbsd-tech-userlevel&m=120422584723718&w=2
> 

I know I read it when you originally replied to me.  Not sure I
understood all of it.  I meant to read it again and get back to
you.

Thanks for the long, well-thought-out response.  I've come around
on some points.  I now see the merit of a magic/version number.
I've got some ideas on how to work up a rudimentary magic number.

I hope to have something coded up soon.

--
Brian Ginsbach <ginsbach%NetBSD.org@localhost>