pkgsrc-Users archive


Re: postgresql encoding/locale issues



Thanks all for the comments; they are providing some clarity.  Responding
to everyone in one message, at the risk of being confusing:

Jonathan Perkin <jperkin%mnx.io@localhost> writes:

> This has been my understanding for many years - LANG and LC_ALL should
> not be set, but instead use LC_CTYPE to specify the general locale you
> wish to use (en_GB.UTF-8 for me), and then other LC_* as appropriate.

I realize we are talking about pgsql, but setting LC_CTYPE alone does not
get you messages in a different language (which doesn't surprise me).  I
agree, though, that for pgsql we are talking about choosing the db encoding.

$ LC_CTYPE=fr_FR.UTF-8 date
Fri Apr 14 20:17:49 EDT 2023
$ LANG=fr_FR.UTF-8 date
ven. avr. 14 20:18:04 EDT 2023
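
The difference above follows from the usual lookup order for locale
variables: LC_ALL first, then the category's own variable, then LANG.  A
minimal sketch of that order follows.  The function name effective_locale
is made up for illustration, and the final fallback to "C" is an assumption
about the implementation default.

```sh
#!/bin/sh
# Hypothetical helper: print the value a locale category resolves to,
# using the precedence LC_ALL, then LC_<category>, then LANG.
effective_locale() {
    category=$1                      # e.g. LC_TIME or LC_CTYPE
    eval "specific=\${$category:-}"  # value of that category's variable, if any
    if [ -n "${LC_ALL:-}" ]; then
        echo "$LC_ALL"
    elif [ -n "$specific" ]; then
        echo "$specific"
    elif [ -n "${LANG:-}" ]; then
        echo "$LANG"
    else
        echo "C"    # assumption: the implementation default is the C locale
    fi
}
```

With no other locale variables set, LC_CTYPE=fr_FR.UTF-8 leaves LC_TIME
resolving to C, while LANG=fr_FR.UTF-8 makes it resolve to fr_FR.UTF-8,
which is consistent with the two date outputs.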

> In particular you likely do not want LC_COLLATE to default to whatever
> you set LANG to, as the only sane sort order is LC_COLLATE=C.

(I think this was Joerg's point.)
What if my strings are in UTF-8?  Wouldn't I want them interpreted as
Unicode and sorted that way, vs sorting the UTF-8 encoding as bytes?
And if I were using fr_FR.ISO8859-1 I would sort of expect e è and é to
sort near each other despite the first one's codepoint being far from
the others, but I have no idea what is correct.  (I was raised ASCII and
don't use newfangled quotes, so I don't really have this issue.)
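
To make the byte-order case concrete, here is a small sketch of what
LC_COLLATE=C does to UTF-8 input.  The function name byte_sort is
hypothetical, and the input is assumed to be UTF-8 encoded.

```sh
#!/bin/sh
# Hypothetical helper: sort stdin strictly by byte value, ignoring any
# locale settings in the caller's environment.
byte_sort() {
    LC_ALL=C sort
}

# In UTF-8, è is the bytes C3 A8 and é is C3 A9, so both sort after every
# ASCII letter: the byte order of these four lines is e, z, è, é.
printf '%s\n' z é e è | byte_sort
```

Replacing LC_ALL=C with a locale such as fr_FR.UTF-8 may or may not move
the accented letters next to "e"; that depends on how much collation
support the platform's locale data provides.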

> From: Edgar Fuß <ef%math.uni-bonn.de@localhost>
> CREATE DATABASE foo WITH OWNER bar TEMPLATE template0 ENCODING UTF8 LC_CTYPE de_DE.UTF-8

Thanks - this avoids the issue of pgsql interpreting locale variables,
and that seems like a good use of a big hammer.  This seems very much
like what Matthias is doing.
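
The same thing can be done from the shell with createdb, which takes the
encoding and LC_CTYPE as options.  This is only a sketch: the create_db
wrapper and the DRY_RUN switch are invented for illustration, and the names
foo, bar and de_DE.UTF-8 are taken from the quoted command.

```sh
#!/bin/sh
# Sketch: create a database with encoding and LC_CTYPE given explicitly, so
# the result does not depend on LANG/LC_* in the caller's environment.
: "${DRY_RUN=1}"    # hypothetical switch: print the command unless emptied

create_db() {
    db=$1 owner=$2 ctype=$3
    # With DRY_RUN non-empty, "echo" is prepended and nothing is executed.
    ${DRY_RUN:+echo} createdb -O "$owner" -T template0 -E UTF8 \
        --lc-ctype="$ctype" "$db"
}

create_db foo bar de_DE.UTF-8
```

Setting DRY_RUN to the empty string before calling create_db runs createdb
for real, which needs a reachable server and suitable privileges.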

> From: Robert Elz <kre%munnari.OZ.AU@localhost>
  | which also
  | seems normal (to be using UTF-8, and one's own language).

> [straightening me about env var processing order omitted, but thank you]

> Using ones own language, sure, but setting UTF-8 in LANG might not
> be the best idea, LANG provides the default locale for LC_TIME
> LC_MONETARY LC_...  as well as LC_CTYPE the only locale setting
> for which the character encoding method is really relevant.

This is perhaps the issue, and it still feels like a bug.  As far as I
can tell, LANG typically includes an encoding, not just a language and
country code pair.  (The variable LANGUAGE, perhaps a linuxism, seems not
to carry an encoding.)

I find that "LANG=es_ES date" prints in English (surely "C"), and
"LANG=fr_FR.UTF-8 date" is French.

So certainly it's valid to suggest that I only set LC_CTYPE because
that's all I want to control, but setting LANG=en_US.UTF-8 seems valid
in general to me, and corresponds to what I expect people whose desired
language isn't English would set.

Regarding COLLATE, setting LANG leads to:

  $ LANG=en_US.UTF-8 locale
  LANG="en_US.UTF-8"
  LC_CTYPE="en_US.UTF-8"
  LC_COLLATE="C"
  LC_TIME="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_ALL=""

which seems ok.

I wonder, then, whether pgsql expects LANG not to include an encoding.
Either way, it certainly seems like it should give a much better error
message.


