tech-userlevel archive

Re: libcodecs(3), take 2

>>>>> On Tue, 21 Sep 2010 08:02:40 +0200, Alistair Crooks 
>>>>> <> said:

>      ascii2ebcdic
>                   [charset] convert the input from ASCII character encodings
>                   to EBCDIC character encodings.

>      ebcdic2ascii
>                   [charset] convert the input from EBCDIC character encodings
>                   to ASCII character encodings.

I don't think those are good names, because EBCDIC has so many variants.

>      to-lower     [charset] change any uppercase letters in the input string
>                   to lowercase.

>      to-upper     [charset] change any lowercase letters in the input string
>                   to uppercase.

Those are problematic, because to-lower/to-upper conversions
are affected by the current locale setting.

Also, it would be better to use "wctrans(tolower)"/"wctrans(toupper)" or
something similar, so that all the character mapping names known to
wctrans(3) can be supported in the future, although NetBSD currently
supports only tolower/toupper.  (wctrans(3) is affected by the current
locale too.)
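
To illustrate, here is a rough sketch of a transform that takes the
mapping name as a parameter instead of hard-coding to-lower/to-upper.
The function name map_wcs() is made up for this example and is not
part of libcodecs; wctrans(3) and towctrans(3) are the standard calls.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not part of libcodecs): apply the wctrans(3)
 * mapping called `name` to every character of `s`, in place.
 * Returns 0 on success, -1 if the current locale does not know `name`.
 */
static int
map_wcs(wchar_t *s, const char *name)
{
	/* The name is looked up in the current LC_CTYPE locale. */
	wctrans_t t = wctrans(name);

	if (t == (wctrans_t)0)
		return -1;
	for (; *s != L'\0'; s++)
		*s = (wchar_t)towctrans((wint_t)*s, t);
	return 0;
}
```

With this shape, map_wcs(buf, "toupper") and map_wcs(buf, "tolower")
cover the two proposed transforms, and any mapping name a locale adds
later works without a new transform name.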

>      to-unicode   [charset] translate to unicode-16 from UTF-8
>      to-utf8      [charset] translate from unicode-16 to UTF-8

Those are bad names, since Unicode is a concept which includes
UTF-16+BOM, UTF-16BE, UTF-16LE, UTF-8, UTF-8+BOM, UCS-4 and others.

What does "to-unicode" really do?
Does it convert UTF-8 to UTF-16LE?  Or UTF-16BE?  Or UTF-16LE+BOM,
or UTF-16BE+BOM?

Does "to-utf8" remove the BOM from UTF-16?  Or add a BOM when the
UTF-16 input didn't have one?
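
To make the ambiguity concrete: the single UTF-8 byte 0x41 ("A") has
at least four plausible "unicode-16" outputs.  The sketch below is
only an illustration; put_utf16() is a made-up name, and it handles
only a single BMP code point (no surrogate pairs).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: serialize one BMP code point (not a surrogate)
 * as UTF-16.  `big_endian` selects the byte order, `with_bom` prepends
 * U+FEFF.  Writes at most 4 bytes to `out` and returns the byte count.
 */
static size_t
put_utf16(uint16_t cp, int big_endian, int with_bom, unsigned char *out)
{
	uint16_t units[2];
	size_t nunits = 0, n = 0, i;

	if (with_bom)
		units[nunits++] = 0xFEFF;
	units[nunits++] = cp;

	for (i = 0; i < nunits; i++) {
		unsigned char hi = (unsigned char)(units[i] >> 8);
		unsigned char lo = (unsigned char)(units[i] & 0xFF);

		out[n++] = big_endian ? hi : lo;
		out[n++] = big_endian ? lo : hi;
	}
	return n;
}
```

For U+0041 this gives 41 00 (LE), 00 41 (BE), FF FE 41 00 (LE+BOM) or
FE FF 00 41 (BE+BOM), and the name "to-unicode" does not say which.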

For code conversion, I think libcodecs(3) shouldn't handle codeset names
by itself.  Maybe it makes sense to provide a transformation
"iconv(from_codeset,to_codeset)", though.  In that case libcodecs(3)
can call iconv(3) internally for the actual conversion, and
ascii2ebcdic, ebcdic2ascii, to-unicode and to-utf8 are all unnecessary.
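
Something like the following is what I have in mind.  It is only a
minimal sketch: codec_iconv() and its argument list are assumptions
for this example, not an existing interface, and it assumes the POSIX
prototype of iconv(3) (char ** for the input pointer) and does no
chunked or restartable conversion.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical transform, not an existing libcodecs interface:
 * convert `inlen` bytes of `in` from codeset `from` to codeset `to`.
 * Returns the number of bytes written to `out`, or (size_t)-1 on error
 * (unknown codeset, invalid input, or output buffer too small).
 */
static size_t
codec_iconv(const char *from, const char *to,
    const char *in, size_t inlen, char *out, size_t outlen)
{
	iconv_t cd = iconv_open(to, from);	/* note the order: to, from */
	char *ip = (char *)in;	/* iconv(3) wants char **; input is not modified */
	char *op = out;
	size_t oleft = outlen;
	size_t rv;

	if (cd == (iconv_t)-1)
		return (size_t)-1;
	rv = iconv(cd, &ip, &inlen, &op, &oleft);
	if (rv != (size_t)-1)
		rv = iconv(cd, NULL, NULL, &op, &oleft);	/* flush shift state */
	iconv_close(cd);
	return rv == (size_t)-1 ? (size_t)-1 : outlen - oleft;
}
```

The caller then has to spell out the codesets, for example
codec_iconv("UTF-8", "UTF-16LE", ...), so byte order and BOM handling
are decided by the codeset name rather than guessed by the library.
Which codeset names are accepted depends on the iconv implementation.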
