NetBSD-Users archive


Re: sending/receiving UTF-8 characters from terminal to program



    Date:        Fri, 20 Jan 2023 08:55:45 +0000 (UTC)
    From:        RVP <rvp%SDF.ORG@localhost>
    Message-ID:  <4dd21c1f-f5c3-c3ba-96d8-cab73a0b433%SDF.ORG@localhost>

  | Both /bin/sh and bash output UTF-8 if given Unicode code-
  | points in the form `\uNNNN'. So,

I believe bash takes your current locale into account
when doing that, whereas neither /bin/sh nor /usr/bin/printf
does; they simply emit UTF-8 unconditionally.  This kind of
difference is (partly) why POSIX is not including the \u (or \U)
escape sequences in $'...' quoted strings in Issue 8.

Another reason is how the end of the NNNN is detected: is it always
exactly 4 hex digits (or 8 for \U), or any number up to 4 (or
8) if followed by a non-hex char, or as many hex chars
as exist?  To be portable (as input) such a string needs to
have exactly 4 (8) hex digits, and be followed by something
which is not a hex digit.  The closing ' is often useful
there; it can always be followed immediately by $' to
resume quoting again (or just ' or " if those are adequate).
But that's just the input; you also need to be using a
locale with UTF-8 char encoding to get predictable output.
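
As a rough sketch of the portable input form described above: this
assumes a shell that accepts \u inside $'...' at all (NetBSD /bin/sh
and bash, going by this thread; it is not in POSIX), and the helper
name hexbytes is made up for illustration.  The reference bytes are
built from octal escapes, which any POSIX printf accepts, so the
comparison shows whether the shell and locale produced UTF-8.

```shell
# Hex-dump the bytes of the argument on one line.
# (hexbytes is a hypothetical helper, not a standard utility.)
hexbytes() {
    printf '%s' "$1" | od -An -tx1 | tr -s ' \n' '  ' | sed 's/^ //; s/ $//'
}

# Reference bytes for "n", U+00E9, "a", written as octal escapes.
expected=$(hexbytes "$(printf 'n\303\251a')")

# Exactly 4 hex digits, then close the quote so that the following 'a'
# (itself a hex digit) cannot be taken as part of the code point.
actual=$(hexbytes $'n\u00E9''a')

# The result is only predictable in a UTF-8 locale, and only if the
# shell supports \u in $'...' in the first place.
if [ "$(locale charmap 2>/dev/null)" = UTF-8 ] && [ "$actual" = "$expected" ]; then
    echo "match: $actual"
else
    echo "differs: got '$actual', wanted '$expected'"
fi
```

If the two disagree, the "got" bytes show what the shell actually
emitted, which can then be compared with what the terminal sends.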

kre
  |
  | $ printf 'néz' | hexdump -C
  | 00000000  6e c3 a9 7a                                       |n..z|
  | 00000004
  | $ printf $'n\uE9z' | hexdump -C
  | 00000000  6e c3 a9 7a                                       |n..z|
  | 00000004
  | $
  |
  | If that works, then check those UTF-8 bytes against whatever the
  | terminal emulator generated from your keystrokes for the `é'
  | in `néz'.
  |
  | -RVP
  |

