tech-userlevel archive


Re: Encoding non-alphanumeric characters in manpage filenames

>> What does POSIX say?
> [...]
> 2. Each byte in the UTF-8 encoding is interpreted as ASCII

As soon as any of the input codepoints are non-ASCII, UTF-8 generates
octets which are outside the ASCII range and thus cannot be interpreted
as ASCII (at least not without further processing).
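
To make that concrete, here is a minimal Python sketch.  The helper name
non_ascii_octets is made up for illustration; it just lists the octets
of the UTF-8 encoding that fall outside the 7-bit ASCII range.

```python
def non_ascii_octets(text: str) -> list[int]:
    """Return the UTF-8 octets of text that are above 0x7F (not ASCII)."""
    return [octet for octet in text.encode("utf-8") if octet > 0x7F]


if __name__ == "__main__":
    # Pure ASCII input produces no such octets; anything else does.
    for sample in ("printf", "\u00e9", "\u20ac"):
        octets = non_ascii_octets(sample)
        print(repr(sample), [f"{o:02X}" for o in octets])
```

Every octet of a multi-octet UTF-8 sequence has its high bit set, so none
of them can be read back as an ASCII character.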

> 3. If there's a matching character, use that one.  If not, insert a
> hex encoding of the byte.

Provided the "If not" case covers the case where the octet isn't ASCII,
this is then well-defined...provided manpage names are taken as
sequences of characters.
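
A rough Python sketch of steps 2 and 3 as I read them, under assumptions
of my own: the set of "matching" characters (SAFE below), the % escape
prefix, and the function name encode_name are all invented here, since
the quoted proposal does not pin them down.

```python
import string

# Assumption: these octets "match" and pass through unchanged.
SAFE = frozenset((string.ascii_letters + string.digits + "._-").encode("ascii"))


def encode_name(name: str) -> str:
    """UTF-8 encode name, then hex-escape every octet not in SAFE."""
    out = []
    for octet in name.encode("utf-8"):
        if octet in SAFE:
            out.append(chr(octet))        # matching ASCII character
        else:
            out.append(f"%{octet:02X}")   # "If not": hex encoding of the octet
    return "".join(out)


if __name__ == "__main__":
    for sample in ("ls", "[", "a b", "\u00e9"):
        print(repr(sample), "->", encode_name(sample))
```

Because % itself is not in SAFE, it is escaped too, which keeps the
mapping reversible.  Non-ASCII octets always land in the "If not" branch.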

> AFAIK the DNS now uses "Punycode" to encode non-ASCII Unicode
> characters in domain names.  It's confusing and likely has no
> advantage over a straight-up hex encoding.

I think it does, for its design use case.

The major advantage I see is that it's more compact.  Hex encoding
doubles (or, with the % prefix, triples) the octet count, and to compare
fairly with punycode you first have to convert the Unicode codepoint
string into an octet string.  Assuming this is done with UTF-8 (which
RFC 3629 caps at four octets per codepoint), the intermediate string has
two to four times as many octets as there are codepoints in the original
string (counting only the non-ASCII characters, of course).  This count
is then doubled or tripled, leading to at least four and possibly as
many as 12 output octets per (non-ASCII) input codepoint.  Since there
is a small maximum - 63 - on DNS label length, this degree of expansion
is undesirable.

Punycode is substantially more compact.  See the examples in RFC 3492.
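
For a rough feel of the difference, here is a Python sketch that counts
output octets for each scheme.  The function names are hypothetical, the
hex scheme is assumed to leave ASCII characters alone, and the "xn--"
prefix is added by hand because Python's built-in "punycode" codec
produces only the bare RFC 3492 encoding.  The sample string is the
codepoint sequence of the simplified-Chinese example in RFC 3492.

```python
def hex_encoded_len(label: str, with_prefix: bool) -> int:
    """Octets needed if every non-ASCII UTF-8 octet becomes hex digits."""
    per_octet = 3 if with_prefix else 2   # "%XX" versus bare "XX"
    total = 0
    for ch in label:
        n = len(ch.encode("utf-8"))
        # ASCII stays as one octet; each non-ASCII octet expands.
        total += 1 if n == 1 else n * per_octet
    return total


def punycode_len(label: str) -> int:
    """Octets needed for the punycode form, including the "xn--" prefix."""
    return len("xn--") + len(label.encode("punycode"))


# Nine codepoints, each three octets in UTF-8.
SAMPLE = "\u4ed6\u4eec\u4e3a\u4ec0\u4e48\u4e0d\u8bf4\u4e2d\u6587"

if __name__ == "__main__":
    print("hex, no prefix:", hex_encoded_len(SAMPLE, False))
    print("hex, % prefix: ", hex_encoded_len(SAMPLE, True))
    print("punycode:      ", punycode_len(SAMPLE))
```

With the % prefix the hex form of this nine-codepoint label is 81
octets, already past the 63-octet limit, while the punycode form fits
comfortably.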

I am actually somewhat surprised they didn't just specify use of UTF-8.
The DNS supports all 256 possible octet values in labels, except that
there is the historical misfeature that 26 of them are treated as
identical to a different 26.  I see no particular reason to not just
use UTF-8 labels.  Presumably they had some, but if it's in 3492 then
my reading missed it.
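
To spell out the "26 treated as identical to a different 26" point, here
is a small Python sketch of DNS-style label comparison.  The name
dns_label_equal is made up; the only folding done is ASCII A-Z onto a-z,
and every other octet value, including all UTF-8 lead and continuation
octets, is compared exactly.

```python
# Translation table mapping only ASCII upper case onto lower case.
_FOLD = bytes.maketrans(b"ABCDEFGHIJKLMNOPQRSTUVWXYZ",
                        b"abcdefghijklmnopqrstuvwxyz")


def dns_label_equal(a: bytes, b: bytes) -> bool:
    """Compare two labels octet by octet, ignoring ASCII case only."""
    return a.translate(_FOLD) == b.translate(_FOLD)


if __name__ == "__main__":
    print(dns_label_equal(b"NetBSD", b"netbsd"))
    # Non-ASCII octets are not folded, so these two differ.
    print(dns_label_equal("\u00c9".encode("utf-8"), "\u00e9".encode("utf-8")))
```

Since UTF-8 never uses octets below 0x80 for non-ASCII codepoints, this
folding cannot corrupt a UTF-8 label; it just would not make accented
letters case-insensitive.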

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
