Subject: Re: language names and character sets for web pages
To: None <netbsd-docs@netbsd.org>
From: Paulo Alexandre Pinto Pires <p@ppires.org>
List: netbsd-docs
Date: 11/16/2002 02:42:06
Hello, folks.

First of all, I have to apologize for being absent for some time.  I've been
too busy with other stuff...

> Klaus Heinz <k.heinz.nov.zwei@onlinehome.de> wrote:
> > Jan Schaumann wrote:
> > > Klaus Heinz <k.heinz.nov.zwei@onlinehome.de> wrote:
> >
> > > I believe this should be ISO-8859-15 - as most (all?) ISO-8859-1's
> > > should be to include the Euro etc.
> >
> > Quoting translate.html:
>
> Guess who wrote that and should have known...  Given this, I obviously
> retract my previous statement. ;-}
>
> > > I had asked Paulo the same question when I originally imported the
> > > branch.  His answer led me to include "-BR" -- since it was given in a
> > > private email, I'll not copy'n paste it here, but the differences seemed
> > > at least as strong as the differences between de-DE and de-CH.

That would not be a problem for me.

> > Shouldn't we then be more consistent and use htdocs/pt-BR/ instead of
> > htdocs/pt/ if the HTML pages use <html lang="pt-BR">?
>
> Hmmmm, dunno.  Given that we have zh-TW, it would make sense.  On the
> other hand, it seems as if brazilian Portuguese can still be understood
> easily by "normal" Portuguese speaking people (though on occasion it may
> sound funny), while -- as I said, I /believe/ -- zh-TW may not be as
> easily understood by zh-whatever speaking people.
>
> I don't think I like htdocs/pt-BR/ for aesthetics, though.

Although it is not clearly stated in RFC 3066 (at least as far as I can
remember), I feel that it suggests one should _not_ treat xx and xx-YY as the
same thing, nor assume that someone who speaks xx will understand xx-YY _or_
vice versa, not to mention crossing between xx-YY and xx-ZZ.  But this is
probably a more serious problem with aural media than with written text, so I
did some experiments with Netscape and Apache with MultiViews enabled, and
found that if I requested an en-US document and Apache had only en available,
I would get it.  Also, I saw that Apache would still return a valid response to
an en-US request if it had only an en-GB alternative, and Netscape would be
happy with it.  This may "break" the standards somehow, but it really makes
sense to me.
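
To make the fallback I observed concrete, here is a minimal Python sketch.  It
is only an illustration of the behaviour described above, not Apache's actual
negotiation code; the names primary() and pick_variant() are made up for this
example, and the order of the fallback steps is my assumption.

```python
def primary(tag):
    """Return the primary subtag of a language tag, e.g. 'en' for 'en-US'."""
    return tag.lower().split("-", 1)[0]


def pick_variant(requested, available):
    """Pick a document variant for a requested language tag.

    Assumed fallback order (hypothetical, modelled on the experiment):
      1. an exact, case-insensitive match (en-US -> en-US)
      2. the bare primary language        (en-US -> en)
      3. any sibling with the same primary language (en-US -> en-GB)
    Returns None if nothing shares the primary language.
    """
    wanted = requested.lower()
    by_lower = {tag.lower(): tag for tag in available}

    if wanted in by_lower:
        return by_lower[wanted]

    parent = primary(wanted)
    if parent in by_lower:
        return by_lower[parent]

    for tag in available:
        if primary(tag) == parent:
            return tag
    return None


if __name__ == "__main__":
    print(pick_variant("en-US", ["en"]))
    print(pick_variant("en-US", ["en-GB"]))
    print(pick_variant("pt-BR", ["pt", "de"]))
```

With such a fallback, a browser asking for pt-BR would still be served from a
plain "pt" directory.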

I think we are on the right track calling the directory "pt" and not "pt-BR".
Applications probably wouldn't care anyway, and most users would be satisfied
with understanding most, if not all, of the content, even with some
funny-sounding sentences here and there.  However, I think that documents
translated by me should be tagged with "pt-BR", because they in fact use
Brazilian jargon, expressions and verbal constructions.  Tagging them simply
with "pt" perhaps wouldn't exactly be "wrong", but it could be misleading to
every speaker but Brazilians, even though Brazilians are more than 85% of those
who speak Portuguese in the world.

> > > >   - Who works on zh-TW?
> > >
> > > Rui-Xiang Guo (rxg@ )
> >
> > So we could list him as the coordinator for zh-TW on the web page.
> > And Bang Jun-Young for the Korean translation?
>
> Yes, definitely.
>
> > > hardly understand somebody speaking in another.  A quick google
> > > indicates that there are at least seven major language groups.  I'd
> > > assume that this should be enough to leave it as is, but if Rui-Xiang
> > > (or anybody else with more insights) knows better, please let us
> > > know.
> >
> > Then it seems necessary to use lang="zh-TW" in the /zh-TW/ branch.
>
> Yes, that would be TRT, I think.
>
> > Btw, according to RFC 3066, language codes use '-' and not '_' to
> > separate subtags, so we need to change this in the pages (pt_BR) and in
> > the directory structure (zh_TW).
>
> I seem to remember there being a case where "_" is
> used/recommended/required, but off the top of my head, I can't recall
> it.  You're probably right.  Maybe I was confused by locales using "_".

The RFCs really do direct us to use "-".  "-" is a good choice anyway, because
some of those historical text-only terminals with limited character sets do
not include "_" in those sets.  The traditional UNIX approach to locales was to
use "_", but I believe that it can be troublesome for I18N in the most general
sense.  Perhaps this should be reviewed by developers, too.
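
For the cleanup itself, converting a locale-style name with "_" into a
language tag with "-" is mechanical.  A small Python sketch follows; the
function name to_language_tag() is made up, and the casing (lowercase
language, uppercase two-letter country) is only the usual convention, since
the tags themselves are case-insensitive.

```python
def to_language_tag(locale_name):
    """Turn a UNIX-style locale name into a '-' separated language tag.

    Any charset ('.ISO8859-1') or modifier ('@euro') suffix is dropped,
    '_' becomes '-', and subtags are recased by convention only.
    """
    # Keep only the part before the charset and modifier, e.g. 'pt_BR'.
    base = locale_name.split(".", 1)[0].split("@", 1)[0]
    parts = base.replace("_", "-").split("-")

    tag = [parts[0].lower()]
    for sub in parts[1:]:
        # Two-letter subtags are treated as country codes here.
        tag.append(sub.upper() if len(sub) == 2 else sub.lower())
    return "-".join(tag)


if __name__ == "__main__":
    for name in ("pt_BR", "zh_TW", "pt_BR.ISO8859-1", "de"):
        print(name, "->", to_language_tag(name))
```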

> Thanks for looking at this - obviously we need to clean up some of the
> translations.

I've been using "pt-BR" in pt/adjust.sed for a while.  Too bad I missed it in
some files and wrote "pt_BR"...

> -Jan
>
> --
> I always said there was something fundamentally wrong with the universe.

--
        Pappires

... Qui habet aurem audiat quid Spiritus dicat ecclesiis.