Subject: Re: Unicode support in iso9660.
To: Christopher Vance <email@example.com>
From: Pelle Johansson <firstname.lastname@example.org>
Date: 11/23/2004 18:13:20
I'm coming late to this discussion but I have a few points.
I believe that standardising the system calls to UTF-8 would be great.
It would make it much easier for i18n of applications if they can know
for sure how the file system is encoded. But if this is done, the check
should sit at the system call entry points, and bad strings should be
rejected.
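To illustrate the kind of check meant here, a minimal userland sketch in Python (not kernel code). The function name check_path_utf8 and the choice of EILSEQ as the error are assumptions made for the example, not anything an existing kernel is known to do:

```python
import errno


def check_path_utf8(raw: bytes) -> str:
    """Return the decoded name, or raise OSError if raw is not valid UTF-8.

    Hypothetical stand-in for a check at system call entry. EILSEQ is an
    assumed error code chosen for illustration.
    """
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms and encoded surrogates.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise OSError(errno.EILSEQ, "file name is not valid UTF-8") from exc
```

A name such as b"bl\xe5b\xe4r.txt" (ISO 8859-1 bytes) would be rejected, while the same name encoded as UTF-8 passes through unchanged.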
The problem is of course existing files on file systems that are
currently encoding independent. File names that are valid UTF-8 on disk
can be assumed to be UTF-8, because anything else is highly improbable.
For those filenames that are not valid UTF-8 it should be possible to
specify an encoding either via tunefs or as a mount option. The kernel
would convert these strings to UTF-8 before handing them to userland.
New files should be
saved as UTF-8 unless there's reason to do otherwise. If you have
multiple encodings on the same file system... well that's pretty messed
up and I'm sure you want to fix it. Convert as best as possible and let
the users do the rest.
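The fallback rule above could be sketched roughly as follows, again in userland Python. The name name_to_utf8 is hypothetical, and the fallback_encoding parameter stands in for whatever encoding would be given via tunefs or mount (ISO 8859-1 is only an example default):

```python
def name_to_utf8(raw: bytes, fallback_encoding: str = "iso-8859-1") -> bytes:
    """Return an on-disk file name as UTF-8 bytes.

    Names that already decode as UTF-8 are assumed to be UTF-8 and are
    returned untouched. Anything else is decoded with the configured
    fallback encoding (hypothetical mount option) and re-encoded.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # "Convert as best as possible": undecodable bytes become U+FFFD
        # instead of making the whole name inaccessible.
        text = raw.decode(fallback_encoding, errors="replace")
        return text.encode("utf-8")
```

With the default above, b"bl\xe5b\xe4r" comes out as the UTF-8 encoding of "blåbär", and a name that is already valid UTF-8 is left alone.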
However, UTF-8 is not specific enough. There are characters that can be
written in two or more ways in Unicode. All strings with the same
sequence of characters should for all purposes refer to the same file.
Therefore strings have to be converted to one of the two (applicable)
standard forms: UTF-8 NFC or UTF-8 NFD. There are advantages to both.
NFC gives shorter byte strings, while NFD makes it easier to sort since
base characters and combining marks are separate. Strings coming from
both the file system and userland have to be converted.
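A small Python illustration of the problem, using the standard unicodedata module. The helper same_file_name is a hypothetical name for this example only:

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def same_file_name(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after converting both to one normalisation form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two spellings look the same but differ as strings and as bytes.
print(composed == decomposed)                  # False
print(len(composed.encode("utf-8")))           # 2 bytes in NFC
print(len(decomposed.encode("utf-8")))         # 3 bytes in NFD
print(same_file_name(composed, decomposed))    # True
```

Either form works for the comparison as long as both sides use the same one, which is why the conversion has to happen on every path into and out of the file system.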
Also worth noting is that Mac OS X has chosen to do this. They
decided on UTF-8 NFD. HFS+ is an example of a file system with a defined
encoding: it uses UTF-16 on disk, just as Joliet does. I've been on the Darwin mailing
list almost since the launch, and I've only seen problems with this
brought up twice, mostly concerning how to convert a string to UTF-8
(IIRC). A library for doing that might be needed if there isn't already