Subject: Valid and incomplete character sequence and mbrlen()
To: None <tech-userlevel@netbsd.org>
From: Masao Uebayashi <uebayasi@soum.co.jp>
List: tech-userlevel
Date: 10/28/2001 22:59:31

If I read SUSv2 correctly, mbrlen() should return (size_t)-2 when it is
given an incomplete but potentially valid character sequence. SUSv2 says:

--------8<--------8<--------8<--------8<--------8<--------8<--------8<
 RETURN VALUE

    The mbrlen() function returns the first of the following that applies:

    0   If the next n or fewer bytes complete the character that corresponds to
        the null wide-character.
    positive
        If the next n or fewer bytes complete a valid character; the value
        returned is the number of bytes that complete the character.
    (size_t)-2
        If the next n bytes contribute to an incomplete but potentially valid
        character, and all n bytes have been processed. When n has at least the
        value of the MB_CUR_MAX macro, this case can only occur if s points at
        a sequence of redundant shift sequences (for implementations with
        state-dependent encodings).
    (size_t)-1
        If an encoding error occurs, in which case the next n or fewer bytes do
        not contribute to a complete and valid character. In this case, EILSEQ
        is stored in errno and the conversion state is undefined.
--------8<--------8<--------8<--------8<--------8<--------8<--------8<

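For example, what does the following program print with the locale code
in -current?

Here, calling mbrlen() on s with 1, 2, 3 or 4 as the second argument
should return (size_t)-2, because those bytes cover only the leading shift
sequence and, at most, the first byte of a two-byte character.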

--------8<--------8<--------8<--------8<--------8<--------8<--------8<
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

char buf[1024];

/*
 * A multibyte string in ISO-2022-JP.  The first 3 bytes are a shift
 * sequence, the next 6 bytes are JIS X 0208 characters, the last 3
 * are also a shift sequence.
 */
const char s[] =
        {
                0x1b, 0x24, 0x42,                       /* JIS X 0208 */
                0x46, 0x7c, 0x4b, 0x5c, 0x38, 0x6c,     /* 日本語 */
                0x1b, 0x28, 0x42,                       /* ASCII */
                '\0'
        };

int
main(void)
{
        mbstate_t ps;
        size_t i;
        size_t ret;

        if (setlocale(LC_ALL, "ja_JP.ISO2022-JP") == NULL)
                exit(EXIT_FAILURE);

        printf("%s\n", s);

        for (i = 0; i < strlen(s); ++i) {
                /* Start each call from the initial conversion state. */
                memset(&ps, 0, sizeof(ps));

                /* Examine only the first i bytes of s. */
                ret = mbrlen(s, i, &ps);

                /* Cast so that (size_t)-1 and (size_t)-2 print as -1 and -2. */
                printf("%lu: %ld\n", (unsigned long)i, (long)ret);
        }

        return 0;
}
--------8<--------8<--------8<--------8<--------8<--------8<--------8<

I'm using XPG4DL on NetBSD 1.5. My apologies if this does not apply to
-current.

Regards,
Masao