tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



David Young <dyoung%pobox.com@localhost> wrote:
 |Consider this real path on my Mac,
 |
 |Music/iTunes/iTunes\ Music/Sinéad\ O\'Connor/The\ Lion\ and\ the\ Cobra/
 |
 |Because of the not-so-special but special-nonetheless characters in
 |the path---the spaces, the accented e, and the apostrophe---to type
 |that path will be a royal pain.  To find(1) directories containing the

Didn't work.

 |word 'Sinéad' will also be a pain if a) you don't know how to type é
 |or b) you didn't realize 'Sinéad' was spelled with an é. To write a
 |script that tolerates paths like that---I wrote one once to de-duplicate

Didn't work.
Unfortunately it is not really UTF-8.  (It's so awful.)
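
A minimal sketch of one way such a match can fail, under the
assumption that the name is stored in decomposed form (e followed
by a combining accent) while the pattern is typed precomposed.
The strings below are illustrative values, not read from a real
volume:

```python
# Two spellings of the same visible name (illustrative values only).
precomposed = "Sin\u00e9ad"   # é as the single code point U+00E9
decomposed = "Sine\u0301ad"   # e followed by U+0301 COMBINING ACUTE ACCENT

# Both encode to valid UTF-8, but the byte sequences differ.
pre_bytes = precomposed.encode("utf-8")
dec_bytes = decomposed.encode("utf-8")
print(pre_bytes.hex(" "))     # 53 69 6e c3 a9 61 64
print(dec_bytes.hex(" "))     # 53 69 6e 65 cc 81 61 64

# Assumed on-disk spelling: the decomposed one.
stored_path = "Music/iTunes/iTunes Music/" + decomposed + " O'Connor/"

# A plain substring match for the typed form misses the stored form.
print(precomposed in stored_path)   # False
print(decomposed in stored_path)    # True
```

Any tool that compares names byte by byte sees two different
strings here, even though both display identically.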

 |files in the iTunes\ Music directory---requires special care to protect
 |against spaces being interpreted as field delimiters.
 |
 |(How do you even type an é with wscons?)
 [.]
 |album/performer/song using either the Mac's full-text index, Spotlight,

Spotlight can be turned off per volume, but do not remove the
hidden directories and files on that volume that configure this
behaviour.  (I never managed to turn it off as a whole.  There is
no button.)

 [.]
 |> I'm confident that glob(3) could be adapted to Unicode, that open(2)
 |> could canonicalize, that ffs could be changed to reflect the encoding,
 |> and mount(2) to enforce it.  That's just a small matter of
 |> programming.  For it to happen, though, we need consensus that it's
 |> good and necessary.  A consensus that seems surprisingly hard to
 |> establish.  
 |
 |Maybe it's good.  I don't know if it's necessary.  Developing a rapid
 |full-text search capability will probably have a greater and faster
 |pay-off than trying to make UTF-8 filenames a coherent part of UNIX.

The Unicode data changes a bit and grows over time, which doesn't
sound like a good thing for a part of a BSD kernel.  Also, having
it in the kernel doesn't help any user program.
Instead, some Unicode-aware nfd() / nfk() should be made part of
the C libraries, so that interested parties can take advantage of
them.  (This still may not allow Mac OS X filenames to be used in
interchange, but I don't know.)
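
To illustrate what such library routines would give a user
program, here is a rough sketch using Python's standard
unicodedata module as a stand-in.  The helper names nfd(), nfkd()
and same_name() are hypothetical, and mapping "nfk" to the NFKD
compatibility form is an assumption; no existing libc interface
is implied:

```python
import unicodedata


def nfd(name: str) -> str:
    """Canonical decomposition (hypothetical stand-in for a libc nfd())."""
    return unicodedata.normalize("NFD", name)


def nfkd(name: str) -> str:
    """Compatibility decomposition (assumed meaning of 'nfk')."""
    return unicodedata.normalize("NFKD", name)


def same_name(a: str, b: str) -> bool:
    """Compare two names after bringing both to the same canonical form."""
    return nfd(a) == nfd(b)


typed = "Sin\u00e9ad"     # precomposed é, as typed by a user
stored = "Sine\u0301ad"   # decomposed é, as a filesystem might store it

print(typed == stored)            # False: raw code points differ
print(same_name(typed, stored))   # True: equal after NFD

# The compatibility form additionally folds characters such as the
# 'fi' ligature U+FB01, which canonical decomposition leaves alone.
print(nfd("\ufb01le") == "file")    # False
print(nfkd("\ufb01le") == "file")   # True
```

Standard NFD is not guaranteed to reproduce exactly what a given
filesystem stores, so this only shows the comparison technique.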

 |Dave

--steffen

