tech-userlevel archive


Re: A draft for a multibyte and multi-codepoint C string interface



On Mon, Apr 15, 2013 at 05:51:33PM -0400, James K. Lowden wrote:
> If he types "vi året", I think the file
> should open if the character strings match regardless of the
> byte-sequences, but today the odds are 1:4 against.  
> 

No. If the policy _in this network_ is to consider that a filename does
not represent itself but is an instance of a class of filenames considered
equal, and that the canonical class name is the filename actually stored,
then vi(1) would not be the program itself but a shell wrapper, doing this:

1) Normalize the argument $1 into a byte string (UTF-8) pattern, with
all the ligatures expanded ('oe' -> 'o' 'e'; 'fi' -> 'f' 'i') and all the
equivalent characters replaced by the class representative.

2) vi $(the_normalized_result)

on the assumption that the network is administered, that is, that only
the canonical name (as decided by the administrator) is used as the
real filename.
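
The two steps for the administered case can be sketched as a small
function. This is only an illustration: normalize() below is a stand-in
that folds ASCII case and nothing more, and both edit_canonical and the
VI_CMD override are hypothetical names introduced here for the example.

```shell
#!/bin/sh
# Sketch of the wrapper for an administered network.
# normalize() is only a stand-in (ASCII case folding); the real
# translator would also expand ligatures and pick class representatives.
normalize() {
        printf '%s\n' "$1" | LC_ALL=C tr 'a-z' 'A-Z'
}

# edit_canonical NAME: open the canonical form of NAME.
# VI_CMD is a hypothetical override, handy for dry runs (VI_CMD=echo).
edit_canonical() {
        canonical=$(normalize "$1") || return 1
        ${VI_CMD:-vi} "$canonical"
}
```

Setting VI_CMD=echo before calling edit_canonical shows which name
would be opened without starting the editor.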

If the network is "creative" (the politically correct word for not
administered, not maintained---not held in the hand of the
administrator), the wrapper would do the following:

1) As above;

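2) find "$some_dir" -type f >"$TMPDIR/$$.list" # full paths, UTF-8

norm_arg=$(normalize "$1")      # the result of step 1
while IFS= read -r filename; do
        # compare on the last path component only, as ls -R would list it
        norm_filename=$(normalize "$(basename "$filename")")
        # vi must read from the terminal, not from the list
        test "$norm_filename" = "$norm_arg" && exec vi "$filename" </dev/tty
done <"$TMPDIR/$$.list"

printf '%s not found anywhere\n' "$1" >&2

exit 1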

Note: instead of test(1), you can use a regex matcher if $1 may be a
regex.
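
As a rough sketch of that variant, the test(1) comparison could be
replaced by a helper built on grep(1). The name matches is hypothetical,
and the sketch assumes the normalized pattern is still a valid POSIX
extended regex after normalization.

```shell
#!/bin/sh
# matches NORM_FILENAME NORM_PATTERN
# True when the whole normalized filename matches the pattern, taken as
# a POSIX extended regex: -E selects EREs, -x anchors the match at both
# ends of the line, -q suppresses output.
matches() {
        printf '%s\n' "$1" | grep -Eqx -e "$2"
}
```

In the loop, `matches "$norm_filename" "$norm_arg" && exec vi ...` would
then take the place of the test(1) call.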

What is the problem? The main program to write is normalize (a
translator), which is _not_ a regex program, but a program that
translates from UTF-8 to UTF-8, replacing lower case by upper
case, ligatures by sequences, and equivalence classes by a canonical
representative (this does not have to be standardized: if one uses the
same translator for the text and the pattern, that is enough, since the
representative will be the same).
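
To make the idea concrete, here is a deliberately tiny sketch of such a
translator as a shell function. The table is an assumption made up for
illustration (two ligatures and the letter a with ring); a real one
would be far larger and chosen by the administrator. The non-ASCII byte
sequences are built with printf octal escapes so the script itself
stays plain ASCII, and LC_ALL=C makes sed and tr work byte by byte.

```shell
#!/bin/sh
# normalize STRING: illustrative UTF-8 to UTF-8 translator.
# The table below is hypothetical and far from complete.
a_ring=$(printf '\303\245')      # U+00E5, precomposed a with ring
A_ring=$(printf '\303\205')      # U+00C5, precomposed A with ring
comb_ring=$(printf '\314\212')   # U+030A, combining ring above
oe_lig=$(printf '\305\223')      # U+0153, oe ligature
fi_lig=$(printf '\357\254\201')  # U+FB01, fi ligature

normalize() {
        printf '%s\n' "$1" |
        # 1. ligatures to sequences; decomposed a+ring to precomposed
        LC_ALL=C sed -e "s/${oe_lig}/oe/g" \
                     -e "s/${fi_lig}/fi/g" \
                     -e "s/a${comb_ring}/${a_ring}/g" \
                     -e "s/A${comb_ring}/${A_ring}/g" |
        # 2. ASCII lower case to upper case (safe bytewise in UTF-8)
        LC_ALL=C tr 'a-z' 'A-Z' |
        # 3. non-ASCII lower case from the table to upper case
        LC_ALL=C sed -e "s/${a_ring}/${A_ring}/g"
}
```

With this table, the precomposed and the decomposed spellings of the
same name both come out as the same byte string, which is all the
wrapper needs for its comparison.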

This is the Unix spirit: combine tools that each do one thing, and do it
well, instead of trying to solve conflicting things in one tool.
-- 
        Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
                      http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

