tech-kern: Re: Unicode support in iso9660.

Subject: Re: Unicode support in iso9660.
To: Christopher Vance <christopher@nu.org>
From: Pelle Johansson <morth@morth.org>
List: tech-kern
Date: 11/23/2004 18:13:20

I'm coming late to this discussion but I have a few points.

I believe that standardising the system calls on UTF-8 would be great. 
It would make i18n of applications much easier if they could know 
for sure how file names are encoded. But if this is done, it should be 
enforced by a check at the system call entry points, and invalid 
strings should be rejected.
The problem is of course existing files on file systems that are 
currently encoding independent. File names that are valid UTF-8 on disk 
can be assumed to be UTF-8, because anything else is highly improbable. 
For those filenames that are not valid UTF-8 it should be possible to 
specify an encoding either via tunefs or as a mount option. The kernel 
would convert these strings to UTF-8 before passing them to userland. 
New files should be 
saved as UTF-8 unless there's reason to do otherwise. If you have 
multiple encodings on the same file system... well that's pretty messed 
up and I'm sure you want to fix it. Convert as best as possible and let 
the users do the rest.
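
As a rough sketch of the two checks described above (Python for 
brevity, not kernel code; the function names and the ISO-8859-1 
fallback are made up for illustration and stand in for whatever 
encoding the tunefs or mount setting would supply):

```python
def name_to_utf8(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """On-disk name -> UTF-8 for userland.

    Names that already decode as strict UTF-8 are passed through
    unchanged; anything else is assumed to be in the configured
    fallback encoding (a hypothetical per-mount setting).
    """
    try:
        raw.decode("utf-8")  # strict decode: raises on invalid sequences
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


def check_syscall_name(raw: bytes) -> bytes:
    """Name coming from userland: reject anything that is not UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        raise ValueError("file name is not valid UTF-8")
    return raw
```

With these, name_to_utf8(b"caf\xe9") would be converted from 
ISO-8859-1, while check_syscall_name(b"caf\xe9") would be refused.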

However, UTF-8 is not specific enough. There are characters that can be 
written in two or more ways in Unicode. All strings with the same 
sequence of characters should for all purposes refer to the same file. 
Therefore strings have to be converted to one of the two (applicable) 
standard forms: UTF-8 NFC or UTF-8 NFD. There are advantages to both. 
NFC gives shorter byte strings, while NFD makes it easier to sort 
since base characters and combining marks are separate. Strings coming 
both from the file system and from userland have to be converted.
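
To illustrate the problem, here is a small Python example using the 
standard unicodedata module. The helper name normalize_name is only 
for illustration:

```python
import unicodedata

composed = "\u00e9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def normalize_name(name: str, form: str = "NFC") -> bytes:
    """Normalise a name to the given form and encode it as UTF-8."""
    return unicodedata.normalize(form, name).encode("utf-8")


# The two spellings look identical but differ as raw strings.
print(composed == decomposed)
# After normalisation they refer to the same byte string.
print(normalize_name(composed) == normalize_name(decomposed))
# NFC is 2 bytes here, NFD is 3 bytes (base letter + combining mark).
print(len(normalize_name(composed, "NFC")),
      len(normalize_name(composed, "NFD")))
```

Without normalisation, the two spellings would name two different 
files that look the same in a directory listing.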

It is also worth noting that Mac OS X has chosen to do this. They 
decided on UTF-8 NFD. HFS+ is an example of a file system with a fixed 
encoding: it uses UTF-16 on disk, just like Joliet. I've been on the Darwin mailing 
list almost since the launch, and I've only seen problems with this 
brought up twice, mostly concerning how to convert a string to UTF-8 
(IIRC). A library for doing that might be needed if there isn't already 
one.
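
A sketch of what such a conversion could look like for a UTF-16 
on-disk name, again in Python. The function name is hypothetical, 
big-endian UTF-16 is assumed, and standard NFD is used, which may not 
match exactly what Mac OS X does:

```python
import unicodedata


def utf16_name_to_utf8_nfd(raw: bytes) -> bytes:
    """Convert a big-endian UTF-16 on-disk name to UTF-8 in NFD."""
    text = raw.decode("utf-16-be")
    return unicodedata.normalize("NFD", text).encode("utf-8")
```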
-- 
Pelle Johansson
<morth@morth.org>