tech-userlevel archive


Re: UTF8 (Was: [PATCH] replace 0xA0 to whitespace in plain text files (part 2))

> Staying latin1 or ascii is not an option IMO, so the answer is then
> go UTF-8 for everything, not at once, but where it's needed.

> I'll start with Kerberos, what do you start with?

I don't "start with" anything.  I think what you are proposing is a
wrong fix to a nonexistent problem.  (That's not to say no problem
exists.  It's to say that the problem your suggestion tries to fix
isn't one that exists.)

NetBSD mostly does not use characters at the moment; it uses bytes.  If
you try to convert Kerberos to UTF-8, you will run into basically the
same problem I did with ssh: you don't have character strings for
things like usernames and passwords; you have octet strings, and you
can't convert between them and UTF-8 without knowing what character set
the octet strings are intended to be in.

For example, if I'm Greek and use 8859-7 and set my password to, say, 8
capital-theta small-pi I + small-lambda, and you're Swedish and use
8859-1 and set yours to 8 E-grave small-eth I + e-dots, we will both
feed the same octet sequence - 0x38 0xc8 0xf0 0x49 0x2b 0xeb - into the
password hashing algorithm.  Then your UTF-8-aware Kerberos gets, say,
the UTF-8 sequence 0x38 0xc3 0x88 0xc3 0xb0 0x49 0x2b 0xc3 0xab; how
does it manage to make this match your password but not mine?  You'll
need to add character set information to the stored authentication
information.  That in turn means making everything that uses that
information understand it.
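
The collision described above can be reproduced with a minimal Python
sketch.  This is an illustration only, not anything taken from Kerberos
or ssh: the variable names are made up, and SHA-256 is used merely as a
stand-in for "the password hashing algorithm", which the message does
not name.

```python
import hashlib

# The six octets both users' terminals would send for their passwords.
raw = bytes([0x38, 0xC8, 0xF0, 0x49, 0x2B, 0xEB])

# The same octets mean different characters in each character set.
greek = raw.decode("iso8859_7")    # 8, capital theta, small pi, I, +, small lambda
swedish = raw.decode("iso8859_1")  # 8, E-grave, small eth, I, +, e with dots

# Re-encoded as UTF-8, the two passwords are no longer the same octets.
greek_utf8 = greek.encode("utf-8")
swedish_utf8 = swedish.encode("utf-8")

print(greek == swedish)        # False
print(swedish_utf8.hex())      # 38c388c3b0492bc3ab
print(greek_utf8.hex())        # 38ce98cf80492bcebb

# A hash over the stored raw octets (SHA-256 here is an assumption) is
# identical for both users, and matches neither UTF-8 form.
stored = hashlib.sha256(raw).hexdigest()
print(stored == hashlib.sha256(swedish_utf8).hexdigest())  # False
print(stored == hashlib.sha256(greek_utf8).hexdigest())    # False
```

Nothing in the raw octets says which decode is the right one; that
knowledge has to come from somewhere outside the octet string, which is
the character set tag the paragraph above says would have to be stored.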

This sort of contagion (the way the philosophical shift from octet
sequences to character sequences "infects" all tools that touch the
data in question) is why I think that that change can't be done
piecemeal the way you suggest.  Doing it at all means doing it almost
everywhere and will be very invasive.  And you can't just say
"everything is $FOO" (even for values of FOO like UTF-8 that include
all characters of interest), or you run into problems with things (like
executables' text segments) that are inherently not character data and
thus cannot be tagged, even implicitly, with a character set.
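
As a small illustration of why "everything is UTF-8" cannot simply be
declared, the Python sketch below checks whether an octet string is
even well-formed UTF-8.  The helper name is_valid_utf8 is hypothetical,
and the only sample data is the password octets from the example
earlier in this message.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# The 8859-1 / 8859-7 password octets from the earlier example.
password_octets = bytes([0x38, 0xC8, 0xF0, 0x49, 0x2B, 0xEB])

# 0xC8 starts a two-byte UTF-8 sequence, but 0xF0 is not a valid
# continuation byte, so this octet string is not UTF-8 at all.
print(is_valid_utf8(password_octets))                     # False

# The same password, once explicitly converted from 8859-1, is fine.
converted = password_octets.decode("iso8859_1").encode("utf-8")
print(is_valid_utf8(converted))                           # True
```

Existing octet strings that were never character data, or that were
written in some other character set, do not become UTF-8 by decree;
they either fail a check like this or pass it by accident.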

/~\ The ASCII                           der Mouse
\ / Ribbon Campaign
 X  Against HTML      
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
