Re: Encoding non-alphanumeric characters in manpage filenames
>> What does POSIX say?
> 2. Each byte in the UTF-8 encoding is interpreted as ASCII
As soon as any of the input codepoints are non-ASCII, UTF-8 generates
octets which are outside the ASCII range and thus cannot be interpreted
as ASCII (at least not without further processing).
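To make that concrete, here is a minimal Python sketch. The helper name
non_ascii_octets is made up for illustration; it just lists the UTF-8
octets of a string that fall outside the ASCII range (0x80 and above).

```python
def non_ascii_octets(text: str) -> list[int]:
    """Return the UTF-8 octets of `text` that are outside ASCII (>= 0x80)."""
    return [octet for octet in text.encode("utf-8") if octet >= 0x80]


# A pure-ASCII name produces no such octets.
ascii_only = non_ascii_octets("printf")

# U+00E9 (e with acute accent) becomes the two octets 0xC3 0xA9,
# neither of which is an ASCII character.
with_accent = non_ascii_octets("caf\u00e9")
```

Any name containing even one non-ASCII codepoint therefore needs a rule
for octets in the 0x80-0xFF range.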
> 3. If there's a matching character, use that one. If not, insert a
> hex encoding of the byte.
Provided the "If not" case covers the case where the octet isn't ASCII,
this is then well-defined...provided manpage names are taken as
sequences of characters.
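A rough sketch of the scheme as I read it, in Python. The set of
"matching" characters (ASCII letters, digits, and ". _ -") and the %XX
form of the hex encoding are my assumptions, not something the proposal
spells out; encode_name is a hypothetical name.

```python
import string

# Assumption: these ASCII characters are the ones allowed to pass through.
SAFE = set((string.ascii_letters + string.digits + "._-").encode("ascii"))


def encode_name(name: str) -> str:
    """UTF-8 encode `name`; keep safe ASCII octets, hex-encode the rest."""
    out = []
    for octet in name.encode("utf-8"):
        if octet in SAFE:
            out.append(chr(octet))
        else:
            # Covers unsafe ASCII (including '%' itself) and all octets >= 0x80.
            out.append(f"%{octet:02X}")
    return "".join(out)


encoded_ascii = encode_name("operator++")   # '+' is 0x2B
encoded_utf8 = encode_name("caf\u00e9")     # U+00E9 is C3 A9 in UTF-8
```

Because '%' is not in the safe set, it is itself hex-encoded, which keeps
the mapping reversible.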
> AFAIK the DNS now uses "Punycode" to encode non-ASCII Unicode
> characters in domain names. It's confusing and likely has no
> advantage over a straight-up hex encoding.
I think it does, for its design use case.
The major advantage I see is that it's more compact. Hex encoding
doubles or, with the % prefix, triples, octet count. To compare fairly
with punycode you have to first convert the Unicode codepoint string
into an octet string. Assuming this is done with UTF-8, it leads to two
to four times as many octets in the intermediate string as there are
codepoints in the original string (counting only the non-ASCII
characters, of course). This count is then doubled or tripled, leading
to at least four and possibly as many as 12 output octets per
(non-ASCII) input codepoint. Since there is a small maximum - 63 - on
DNS label length, this degree of expansion is undesirable.
Punycode is substantially more compact. See the examples in RFC3492.
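For a rough feel of the difference, here is a small Python sketch using
the "punycode" codec from the standard library. The helper encoded_sizes
is hypothetical, and it counts only the raw Punycode output, not the
four-octet "xn--" prefix that IDNA adds on top. The test string is, as
far as I recall, the simplified Chinese sample from RFC 3492.

```python
def encoded_sizes(label: str) -> dict[str, int]:
    """Octet counts for one label under several encodings.

    Only non-ASCII octets are hex-encoded; ASCII octets count as one each.
    """
    utf8 = label.encode("utf-8")
    non_ascii = sum(1 for octet in utf8 if octet >= 0x80)
    ascii_count = len(utf8) - non_ascii
    return {
        "utf8": len(utf8),
        "hex": ascii_count + 2 * non_ascii,          # XX per octet
        "percent_hex": ascii_count + 3 * non_ascii,  # %XX per octet
        "punycode": len(label.encode("punycode")),
    }


# Nine CJK codepoints, each three octets in UTF-8.
sample = "\u4ed6\u4eec\u4e3a\u4ec0\u4e48\u4e0d\u8bf4\u4e2d\u6587"
sizes = encoded_sizes(sample)

# A mostly-ASCII label: "b\u00fccher" encodes to "bcher-kva".
german = "b\u00fccher".encode("punycode")
```

With nine three-octet codepoints, the %XX form already needs 81 octets,
which is over the 63-octet label limit.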
I am actually somewhat surprised they didn't just specify use of UTF-8.
The DNS supports all 256 possible octet values in labels, except that
there is the historical misfeature that 26 of them are treated as
identical to a different 26. I see no particular reason to not just
use UTF-8 labels. Presumably they had some, but if it's in 3492 then
my reading missed it.
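The "26 treated as identical to a different 26" rule is just ASCII case
folding. A minimal Python sketch of such a comparison follows; the
function names fold_label and labels_equal are made up for illustration.
It shows that UTF-8 octets in a label would pass through untouched,
since they are all 0x80 or above.

```python
def fold_label(label: bytes) -> bytes:
    """Map octets 'A'-'Z' (0x41-0x5A) to 'a'-'z'; leave all others alone."""
    return bytes(o + 0x20 if 0x41 <= o <= 0x5A else o for o in label)


def labels_equal(a: bytes, b: bytes) -> bool:
    """Compare two labels octet by octet after ASCII-only case folding."""
    return fold_label(a) == fold_label(b)


# ASCII letters compare equal regardless of case.
same_ascii = labels_equal(b"Example", b"eXAMPLE")

# Only the ASCII 'C' is folded; the UTF-8 octets for U+00E9 are unchanged.
same_utf8 = labels_equal("Caf\u00e9".encode("utf-8"), "caf\u00e9".encode("utf-8"))

# U+00C9 and U+00E9 are different octet sequences (C3 89 vs C3 A9),
# so they stay distinct under this rule.
different = labels_equal("caf\u00c9".encode("utf-8"), "caf\u00e9".encode("utf-8"))
```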
/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mouse%rodents-montreal.org@localhost
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B