Re: postgresql encoding/locale issues

To: Greg Troxel <gdt%lexort.com@localhost>
Subject: Re: postgresql encoding/locale issues
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Sat, 15 Apr 2023 12:10:21 +0700

    Date:        Fri, 14 Apr 2023 20:46:20 -0400
    From:        Greg Troxel <gdt%lexort.com@localhost>
    Message-ID:  <rmiwn2em0pv.fsf%s1.lexort.com@localhost>

  | What if my strings are in UTF-8?  Wouldn't I want them interpreted as
  | unicode and sorted that way, vs sorting the utf-8 encoding as bytes?

Different issue.  That's controlled (or should be) by LC_CTYPE.
LC_COLLATE controls the ordering of the characters once recognised,
not how they are recognised.
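
A minimal sketch of that separation, assuming a POSIX sh and sort(1).
The two categories are set independently, and with LC_COLLATE=C the
order should be plain byte order whatever LC_CTYPE says.  The function
name is made up for illustration, and en_US.UTF-8 is only an example
locale name that may not exist on a given system.

```shell
# Hypothetical demo: LC_CTYPE and LC_COLLATE are separate settings.
# LC_ALL is set to empty so it cannot override the per-category variables.
sort_with_c_collation() {
    printf 'b\nA\na\nB\n' |
        LC_ALL= LC_CTYPE=en_US.UTF-8 LC_COLLATE=C sort
}

sort_with_c_collation
```

With C collation the upper case letters come out before the lower case
ones, because that is their order as bytes.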

  | And if I were using fr_FR.ISO8859-1 I would sort of expect [omitted]

I removed your 'e' and 'e' accented example as I don't trust my
mailer to both encode the chars correctly, and label the encoding
properly in the mime headers.

  | sort near each other despite the first one's codepoint being far from
  | the others,

That is LC_COLLATE's function, but you have to be very careful using
it, as things you think should be obvious can stop working.  E.g., where
people typically expect [A-Z] to match only upper case letters, in some
locales it will match lower case as well, though it usually misses either
'a' or 'z', depending on whether the collating sequence is aAbBcC... or
AaBbCc...   Then you'd think [[:upper:]] would be what you should use,
but that can also match lots of other characters you were not
intending to match.   All this gets hard...
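
As a rough illustration, assuming a POSIX sh and grep(1): forcing the C
locale is the one case where the range is predictable, since there it
follows byte order.  The function name is hypothetical.  Under any other
locale the same pattern may match differently, and as far as I know
POSIX leaves range expressions unspecified outside the POSIX locale.

```shell
# Hypothetical demo: in the C locale a range like [A-Z] follows byte order,
# so it matches only the 26 ASCII upper case letters.
match_upper_in_c_locale() {
    printf 'a\nB\nz\nZ\n' | LC_ALL=C grep '^[A-Z]$'
}

match_upper_in_c_locale
```

Only the lines "B" and "Z" should be printed here.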

  | This is perhaps the issue, and it still feels like a bug.   As far as I
  | can tell, LANG typically includes an encoding,

As I said in my recent reply to Joerg, ignore that part of my previous
message, it was nonsense.   (Take everything in this message about how
locales work as something to verify rather than fact as well).

  | So certainly it's valid to suggest that I only set LC_CTYPE because
  | that's all I want to control, but setting LANG=en_US.UTF-8 seems valid
  | in general to me, and corresponds to what I expect people whose desired
  | language isn't English would set.

Yes, it should be if you want to control everything.

  |   $ LANG=en_US.UTF-8 locale
  |   LANG="en_US.UTF-8"
  |   LC_CTYPE="en_US.UTF-8"
  |   LC_COLLATE="C"

I don't think that's because of any special magic for LC_COLLATE,
but just because NetBSD doesn't really support it at all.   That
is, we have no collating sequence definitions at all (or if there
are any, there are very few).

There, LC_COLLATE should also be en_US.UTF-8, but we have no data
for that, so we get the default (C) instead.

But that might not be true for apps linked against glibc
which might be (I have no idea) using entirely different
locale data, which might include collating sequences.

  | I wonder then if pgsql expects LANG not to have an encoding,

I kind of doubt it is that simple.  It is more likely that it is
not liking what happens when you set one (or more) of the other
LC_* vars.   If you wanted, you could try them one by one, including
LC_CTYPE as well each time, until you discover which one(s) are the
problem.
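
Something like the following could automate that.  It is only a sketch,
assuming a POSIX sh.  The function name is made up, and the command to
test is whatever fails for you, passed in as arguments.

```shell
# Hypothetical helper: run a command once per LC_* category, with only
# LC_CTYPE plus that one category set, and report which runs fail.
try_locale_categories() {
    loc=$1
    shift
    for category in LC_COLLATE LC_MONETARY LC_NUMERIC LC_TIME LC_MESSAGES; do
        if (
            # Start from a clean slate so inherited settings cannot interfere.
            unset LANG LC_ALL LC_COLLATE LC_MONETARY LC_NUMERIC LC_TIME LC_MESSAGES
            export LC_CTYPE="$loc"
            export "$category=$loc"
            "$@" >/dev/null 2>&1
        ); then
            echo "$category: ok"
        else
            echo "$category: FAILED"
        fi
    done
}
```

For example, "try_locale_categories en_US.UTF-8 true" just exercises
the loop, and replacing "true" with the failing command would show
which category upsets it.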

  | but it
  | certainly seems like it should give a much better error message.

That sounds like something to take up with postgresql developers.

kre

