tech-userlevel archive


Re: libcodecs(3), take 3



Hi, Al

It seems some points are still not addressed, probably because my
description was too sketchy (sorry for that).

The problems are:

* duplicated functionality

  - libcodecs(3) duplicates functionality already provided by iconv(3).
    This is undesirable because it increases code size and invites
    inconsistency.  libcodecs(3) should call iconv(3) internally.
    For example, the current implementations of utf8_to_unicode16() and
    unicode16_to_utf8() don't support the surrogate pairs defined by
    the Unicode standard.

* limited extensibility

  - The code conversion feature often needs more codesets.
    With the current translation naming scheme, libcodecs has to be
    changed each time a codeset is added.  That's undesirable.
    If the translation name for code conversion has the following format,
    and if libcodecs internally calls iconv(3), many codesets will be
    supported by libcodecs automatically:
        current naming:
                ascii2ebcdic
                ebcdic2ascii
                to_unicode
                to_utf8
        desirable naming scheme:
                iconv(FROM_CODESET,TO_CODESET)
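
A minimal sketch of how a name of the form iconv(FROM_CODESET,TO_CODESET)
could be served by a single routine that delegates to iconv(3).  The
function name convert_codeset() and its parameter list are hypothetical
and are not part of libcodecs.  It assumes the POSIX prototype of
iconv(3), where the input pointer is "char **"; some systems declare it
as "const char **" instead.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the libcodecs API): convert inlen bytes of
 * `in' from codeset `from' to codeset `to', writing at most outsize
 * bytes to `out'.  Returns 0 on success and -1 on failure; *outlen is
 * set to the number of bytes produced.
 */
int
convert_codeset(const char *from, const char *to,
    const char *in, size_t inlen,
    char *out, size_t outsize, size_t *outlen)
{
	iconv_t cd;
	char *inp = (char *)in;	/* iconv(3) does not modify the input */
	char *outp = out;
	size_t inleft = inlen, outleft = outsize;
	int rv = 0;

	*outlen = 0;

	/* Note the argument order: destination codeset comes first. */
	cd = iconv_open(to, from);
	if (cd == (iconv_t)-1)
		return -1;	/* unknown codeset pair */

	if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
		rv = -1;
	/* Flush any pending shift sequence for stateful codesets. */
	else if (iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1)
		rv = -1;

	iconv_close(cd);
	*outlen = outsize - outleft;
	return rv;
}
```

With such a routine, adding a codeset to iconv(3) makes it available to
the caller without any change to the calling library, e.g.
convert_codeset("UTF-8", "UTF-16BE", ...).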

  - Mapping names for wctrans(3) will be added in the future.
    With the current translation naming scheme, libcodecs has to be
    changed each time a mapping is added.  That's undesirable.
    If the translation name for character mapping has the following
    format, no change is necessary when a mapping is added:
        current naming:
                to_lower
                to_upper
        desirable naming scheme:
                wctrans(MAPPING_NAME)
    Providing "to_lower"/"to_upper" as aliases of wctrans("tolower") and
    wctrans("toupper") may be a good idea because of their frequent use,
    though.
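
A minimal sketch of the wctrans(MAPPING_NAME) idea: the mapping is looked
up by name at run time, so a mapping added to the locale needs no new
entry point.  The helper name map_wcs() is hypothetical and is not part
of libcodecs.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not the libcodecs API): apply the mapping named
 * `mapping' to every character of the wide string `s', in place.
 * Returns 0 on success, or -1 if the current locale does not know the
 * mapping name.
 */
int
map_wcs(wchar_t *s, const char *mapping)
{
	wctrans_t desc = wctrans(mapping);

	if (desc == (wctrans_t)0)
		return -1;	/* unknown mapping name */

	for (; *s != L'\0'; s++)
		*s = (wchar_t)towctrans((wint_t)*s, desc);
	return 0;
}
```

"tolower" and "toupper" are the two mapping names that the C standard
guarantees in every locale; any others depend on the locale in use.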

* naming inconsistency

  - Just using "EBCDIC" is inconsistent with the existing NetBSD
    installation, because we already support the following
    EBCDIC variants:

        $ iconv -l | grep -i ebcdic | tr '\012' ' '
        ebcdic-at-de ebcdic-at-de-a ebcdic-be ebcdic-br ebcdic-ca-fr
        ebcdic-cp-ar1 ebcdic-cp-ar2 ebcdic-cp-be ebcdic-cp-ca
        ebcdic-cp-ch ebcdic-cp-dk ebcdic-cp-es ebcdic-cp-fi
        ebcdic-cp-fr ebcdic-cp-gb ebcdic-cp-gr ebcdic-cp-he
        ebcdic-cp-is ebcdic-cp-it ebcdic-cp-nl ebcdic-cp-no
        ebcdic-cp-roece ebcdic-cp-se ebcdic-cp-tr ebcdic-cp-us
        ebcdic-cp-wt ebcdic-cp-yu ebcdic-cyrillic ebcdic-dk-no
        ebcdic-dk-no-a ebcdic-es ebcdic-es-a ebcdic-es-s ebcdic-fi-se
        ebcdic-fi-se-a ebcdic-fr ebcdic-int ebcdic-it ebcdic-jp-e
        ebcdic-jp-kana ebcdic-pt ebcdic-uk

* naming ambiguity

  - The names "to_unicode" and "to_utf8" are ambiguous because
    they don't indicate which codeset they convert from.
      
  - The name "EBCDIC" itself is ambiguous.
    (cf. "iconv -l | grep -i ebcdic")

  - The name "unicode" itself is ambiguous.
    "unicode" could mean any of:
        - UCS-4
        - UTF-8
        - UTF-8 with byte order mark
        - UTF-16 Big Endian without byte order mark
        - UTF-16 Big Endian with byte order mark
        - UTF-16 Little Endian without byte order mark
        - UTF-16 Little Endian with byte order mark
        - and many more.

* bugs

  - mixed2lower() and mixed2upper() use a cast to pass "char *" data
    to towctrans(3).  This doesn't work for multibyte codesets.
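
A minimal sketch of a multibyte-safe alternative: convert the string to
wide characters first, map each wide character with towctrans(3), then
convert back.  The function name mb_tolower() is hypothetical and is not
the libcodecs code.  It assumes the caller has already selected the
locale with setlocale(LC_CTYPE, ""), and it relies on the POSIX behaviour
of mbstowcs(3) and wcstombs(3) returning the required length when the
destination is NULL.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not the libcodecs API): write a lower-cased copy
 * of the multibyte string `in' to `out' (at most outsize bytes,
 * including the terminating NUL).  Returns 0 on success, -1 on an
 * invalid multibyte sequence, allocation failure, or short buffer.
 */
int
mb_tolower(const char *in, char *out, size_t outsize)
{
	wctrans_t desc = wctrans("tolower");
	wchar_t *w;
	size_t i, n, m;
	int rv = -1;

	if (desc == (wctrans_t)0)
		return -1;

	/* Number of wide characters, not bytes. */
	n = mbstowcs(NULL, in, 0);
	if (n == (size_t)-1)
		return -1;

	w = malloc((n + 1) * sizeof(*w));
	if (w == NULL)
		return -1;
	mbstowcs(w, in, n + 1);

	/* Map whole characters, never individual bytes. */
	for (i = 0; i < n; i++)
		w[i] = (wchar_t)towctrans((wint_t)w[i], desc);

	/* Bytes needed for the result, excluding the terminating NUL. */
	m = wcstombs(NULL, w, 0);
	if (m != (size_t)-1 && m < outsize) {
		wcstombs(out, w, outsize);
		rv = 0;
	}
	free(w);
	return rv;
}
```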

  - As written above, the current implementations of utf8_to_unicode16()
    and unicode16_to_utf8() don't support the surrogate pairs defined by
    the Unicode standard.
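
For reference, a minimal sketch of what surrogate pair handling involves
in UTF-16.  The names utf16_encode() and utf16_decode() are hypothetical
and are not the libcodecs functions; delegating to iconv(3) would make
such code unnecessary in libcodecs itself.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one Unicode scalar value as UTF-16 code
 * units.  Returns the number of units written (1 or 2), or 0 if `cp'
 * is not a valid scalar value.
 */
size_t
utf16_encode(uint32_t cp, uint16_t out[2])
{
	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return 0;
	if (cp < 0x10000) {
		out[0] = (uint16_t)cp;
		return 1;
	}
	/* Split the 20-bit offset into a high and a low surrogate. */
	cp -= 0x10000;
	out[0] = (uint16_t)(0xD800 + (cp >> 10));
	out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
	return 2;
}

/*
 * Hypothetical helper: decode one scalar value from `n' UTF-16 code
 * units at `in'.  Returns the number of units consumed (1 or 2), or 0
 * on an unpaired surrogate or truncated input.
 */
size_t
utf16_decode(const uint16_t *in, size_t n, uint32_t *cp)
{
	uint16_t hi, lo;

	if (n == 0)
		return 0;
	hi = in[0];
	if (hi < 0xD800 || hi > 0xDFFF) {
		*cp = hi;		/* ordinary BMP character */
		return 1;
	}
	if (hi >= 0xDC00 || n < 2)
		return 0;		/* lone low surrogate or truncated pair */
	lo = in[1];
	if (lo < 0xDC00 || lo > 0xDFFF)
		return 0;		/* high surrogate without a low one */
	*cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) |
	    (uint32_t)(lo - 0xDC00));
	return 2;
}
```

For example, U+1F600 lies outside the Basic Multilingual Plane and is
encoded as the pair 0xD83D 0xDE00.
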
-- 
soda

