tech-userlevel archive


Re: vi vs. nvi



>>>>> On Thu, 07 Aug 2008 11:16:39 -0400,
      Roland Dowdeswell <elric%imrryr.org@localhost> said:

>> The standard way is to use regcomp()/regexec() functions against
>> multibyte strings (instead of using wide character strings). 
>> The regex functions should honor current locale, but NetBSD
>> implementation currently doesn't.

> I'm not sure that I like this as a default.  You get all manner of
> lossage on Linux because of this.  I.e. [a-z] in the en_US locale
> matches all characters except for capital A because of the collation
> order.  And on a typical Linux system about half of the system
> tools will do this.

That's a different problem.

As you said, whether the regexp [a-z] matches capital B-Z in an
English locale depends on whether regcomp(3)/regexec(3) obeys the
collation order (i.e. uses strcoll(3) or wcscoll(3)).
Whether regcomp(3)/regexec(3) obeys the current locale (I mean
LC_CTYPE) is certainly related, but it is not a sufficient condition.
i.e. Using the current LC_CTYPE by itself won't cause the problem
if LC_COLLATE is not used.
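
To make the distinction concrete, here is a minimal sketch that takes
LC_CTYPE from the environment but pins LC_COLLATE to "C" before
compiling [a-z].  The helper names re_match() and
range_matches_B_with_c_collation() are made up for this illustration,
and the idea that pinning LC_COLLATE is enough to get code-order ranges
is an assumption about the regex implementation, not something POSIX
promises outside the POSIX locale.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if text matches pattern, 0 if not, -1 on regcomp/regexec error. */
static int
re_match(const char *pattern, const char *text)
{
        regex_t re;
        int err;

        assert(pattern != NULL && text != NULL);
        if (regcomp(&re, pattern, REG_NOSUB) != 0)
                return -1;
        err = regexec(&re, text, 0, NULL, 0);
        regfree(&re);
        if (err == 0)
                return 1;
        return err == REG_NOMATCH ? 0 : -1;
}

/*
 * Hypothetical helper: honor the user's LC_CTYPE, but keep LC_COLLATE
 * at "C", then ask whether [a-z] matches "B".
 */
static int
range_matches_B_with_c_collation(void)
{
        /* If this fails, LC_CTYPE simply stays as it was ("C" by default). */
        (void)setlocale(LC_CTYPE, "");
        if (setlocale(LC_COLLATE, "C") == NULL)
                return -1;
        return re_match("[a-z]", "B");
}
```

If the library takes its range order from LC_COLLATE (or from character
codes), the second function should report no match regardless of the
LC_CTYPE setting.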

Until recently, I thought about this topic just as you described.
But then I found that CentOS 3, RHEL4, RHEL5 and Debian GNU/Linux 4.0
don't behave the way you worried about.
i.e. The regexp [a-z] only matches lowercase letters in an English
locale on those Linux variants.
I found that the regexp only matches lowercase letters on Solaris 2.6
with LANG=en_US too.

By contrast, [a-z] matches capital B to Z on Solaris 2.6 with
LANG=en_US.UTF-8, HP-UX 11i v3 and RedHat 6.2.

OS                        LANG           CODESET(*1) result of regexec(3)
------------------------- -------------- ----------  --------------------
RedHat 6.2 (glibc-2.1.3)  en_US          ISO-8859-1  match
RedHat 6.2 (glibc-2.1.3)  en_US.UTF-8    UTF-8       match
CentOS 3.6 (glibc-2.3.2)  en_US          ISO-8859-1  not match
CentOS 3.6 (glibc-2.3.2)  en_US.UTF-8    UTF-8       not match
RHEL 4.5   (glibc-2.3.4)  en_US          ISO-8859-1  not match
RHEL 4.5   (glibc-2.3.4)  en_US.UTF-8    UTF-8       not match
Debian 4.0 (glibc-2.3.6)  en_US.UTF-8    UTF-8       not match
CentOS 5   (glibc-2.5)    en_US          ISO-8859-1  not match
CentOS 5   (glibc-2.5)    en_US.UTF-8    UTF-8       not match
SunOS 5.6                 en_US          ISO8859-1   not match
SunOS 5.6                 en_US.UTF-8    UTF-8       match
HP-UX 11i v3 / PA-RISC    english        roman8      match
HP-UX 11i v3 / PA-RISC    en_US.utf8     utf8        match

The following condition is true in all of the above cases:
        strcoll("a", "B") < 0 && strcoll("B", "z") < 0

So, some of the regular expression libraries don't always obey the
current collation order, at least in the above "not match" cases.
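
For anyone who wants to add a row to the table, the two checks can be
put side by side in one place.  This is only a sketch: the function
names and the output format are my own for this illustration, and the
locale name passed to report() is whatever your system provides.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* 1 if the current collation order places "B" between "a" and "z". */
static int
collation_has_B_in_range(void)
{
        return strcoll("a", "B") < 0 && strcoll("B", "z") < 0;
}

/* 1 if the regexp [a-z] matches "B", 0 if not, -1 on error. */
static int
regex_has_B_in_range(void)
{
        regex_t re;
        int err;

        if (regcomp(&re, "[a-z]", REG_NOSUB) != 0)
                return -1;
        err = regexec(&re, "B", 0, NULL, 0);
        regfree(&re);
        if (err == 0)
                return 1;
        return err == REG_NOMATCH ? 0 : -1;
}

/* Print one line comparing strcoll(3) and regexec(3) for a locale. */
static void
report(const char *locale_name)
{
        int r;

        assert(locale_name != NULL);
        if (setlocale(LC_ALL, locale_name) == NULL) {
                printf("%-14s (locale not available)\n", locale_name);
                return;
        }
        r = regex_has_B_in_range();
        printf("%-14s strcoll: %-10s regexec: %s\n", locale_name,
            collation_has_B_in_range() ? "a < B < z" : "other",
            r == 1 ? "match" : r == 0 ? "not match" : "error");
}
```

Calling report("en_US") and report("en_US.UTF-8") from main() would
give the two lines for one OS.  A "not match" next to "a < B < z" is
the case where the regex library ignores the collation order.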


According to the POSIX standard (*2), whether regcomp(3)/regexec(3)
should obey the current collation order is only specified when the
current locale is "POSIX" or "C", and the behavior in other locales is
explicitly "unspecified".  So, it seems the above "not match"
cases are still OK from the POSIX point of view.
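
Since range expressions are unspecified outside the POSIX locale, an
application that only wants "lowercase letters" can avoid the question
by using a character class, which is defined in terms of LC_CTYPE
rather than collation.  A minimal sketch follows; re_match() and
has_lower() are names invented for this example, and it assumes the
regex library honors the locale's LC_CTYPE for [[:lower:]].

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if text matches pattern, 0 if not, -1 on regcomp/regexec error. */
static int
re_match(const char *pattern, const char *text)
{
        regex_t re;
        int err;

        assert(pattern != NULL && text != NULL);
        if (regcomp(&re, pattern, REG_NOSUB) != 0)
                return -1;
        err = regexec(&re, text, 0, NULL, 0);
        regfree(&re);
        if (err == 0)
                return 1;
        return err == REG_NOMATCH ? 0 : -1;
}

/*
 * 1 if text contains at least one lowercase letter according to the
 * current LC_CTYPE.  No range expression is involved, so the result
 * does not depend on how the library orders a range.
 */
static int
has_lower(const char *text)
{
        return re_match("[[:lower:]]", text);
}
```

With this, "B" is rejected and "b" is accepted whether or not the
library consults LC_COLLATE for ranges.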


BTW, I don't know the result on recent Solaris.
Could anyone try the following commands with the programs attached at
the end of this mail?
        $ env LANG=en_US ./strcoll a B
        $ env LANG=en_US ./strcoll B z
        $ echo 'B' | env LANG=en_US ./regex '[a-z]'
        $ env LANG=en_US.UTF-8 ./strcoll a B
        $ env LANG=en_US.UTF-8 ./strcoll B z
        $ echo 'B' | env LANG=en_US.UTF-8 ./regex '[a-z]'
According to the source code (*3), it seems OpenSolaris doesn't use
strcoll(3)/wcscoll(3), and always compares character code values,
although I may be missing something.

On the other hand, the glibc source code (*4) certainly calls
wcscoll(3), so it's possible that some Linux variants still behave
the way you worried about.


Anyway, it must be OK for regcomp(3)/regexec(3) to obey the current
LC_CTYPE.  Whether the current LC_COLLATE should be obeyed is a
different problem.

(*1)
CODESET means the return value of nl_langinfo(CODESET).

(*2)
http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03_05
9.3.5 RE Bracket Expression
In the POSIX locale, a range expression represents the set of
collating elements that fall between two elements in the collation
sequence, inclusive. In other locales, a range expression has
unspecified behavior: strictly conforming applications shall not rely
on whether the range expression is valid, or on the set of collating
elements matched.

(*3)
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/regex/regex.c#809
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/regex/regex.c#test_char_against_ascii_class
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/regex/regex.c#in_wchar_range

(*4)
http://sourceware.org/cgi-bin/cvsweb.cgi/libc/posix/regcomp.c?rev=1.118&content-type=text/x-cvsweb-markup&cvsroot=glibc
http://sourceware.org/cgi-bin/cvsweb.cgi/libc/posix/regexec.c?rev=1.98&content-type=text/x-cvsweb-markup&cvsroot=glibc
wcscoll(3) is used in build_range_exp() and check_node_accept_bytes().
-- 
soda

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <locale.h>
#include <regex.h>
#include <langinfo.h>

extern char *optarg;
extern int optind;

static char *progname = "regex";

#define EXIT_FOUND              0
#define EXIT_NOTFOUND           1
#define EXIT_ERROR_USAGE        2
#define EXIT_ERROR_RUNTIME      3

void
usage(void)
{
        fprintf(stderr, "Usage: %s [-EFGi] <regexp>\n", progname);
        exit(EXIT_ERROR_USAGE);
}

int
main(int argc, char **argv)
{
        regex_t regex;
        int comp_flags = 0;
        int exec_flags = 0;
        int c, err, len, had_newline, status = EXIT_NOTFOUND;
        char buf[BUFSIZ], errbuf[BUFSIZ];

        if (argc > 0)
                progname = argv[0];
        if (setlocale(LC_ALL, "") == NULL) {
                fprintf(stderr, "%s: setlocale failed\n", progname);
                return EXIT_ERROR_RUNTIME;
        }
        fprintf(stderr, "info: LC_ALL: %s\n", setlocale(LC_ALL, NULL));
        fprintf(stderr, "info: LC_CTYPE: %s\n", setlocale(LC_CTYPE, NULL));
        fprintf(stderr, "info: LC_COLLATE: %s\n", setlocale(LC_COLLATE, NULL));
        fprintf(stderr, "info: nl_langinfo(CODESET): %s\n",
            nl_langinfo(CODESET));

        while ((c = getopt(argc, argv, "EFGi")) != -1) {
                switch (c) {
                case 'E':
                        comp_flags |= REG_EXTENDED;
                        break;
                case 'F':
#ifdef REG_NOSPEC
                        comp_flags |= REG_NOSPEC;
#else
                        fprintf(stderr, "error: REG_NOSPEC is not defined\n");
                        return EXIT_ERROR_USAGE;
#endif
                        break;
                case 'G':
#ifdef REG_BASIC
                        comp_flags |= REG_BASIC;
#else
                        fprintf(stderr, "error: REG_BASIC is not defined\n");
                        return EXIT_ERROR_USAGE;
#endif
                        break;
                case 'i':
                        comp_flags |= REG_ICASE;
                        break;
                case '?':
                        /*FALLTHRU*/
                default:
                        usage();
                }
        }
        argc -= optind;
        argv += optind;
        if (argc < 1)
                usage();

        err = regcomp(&regex, argv[0], comp_flags);
        if (err != 0) {
                regerror(err, &regex, errbuf, sizeof errbuf);
                fprintf(stderr, "%s: %s\n", argv[0], errbuf);
                return EXIT_ERROR_USAGE;
        }
        while (fgets(buf, sizeof buf, stdin) != NULL) {
                len = strlen(buf);
                assert(len > 0);
                had_newline = 0;
                if (buf[len - 1] == '\n') {
                        buf[len - 1] = '\0';
                        had_newline = 1;
                } else if ((size_t)len >= sizeof buf - 1)
                        fprintf(stderr, "warning: too long line: %s\n", buf);
                err = regexec(&regex, buf, 0, NULL, exec_flags);
                if (err == 0) {
                        printf(had_newline ? "%s\n": "%s", buf);
                        status = EXIT_FOUND;
                } else if (err != REG_NOMATCH) {
                        regerror(err, &regex, errbuf, sizeof errbuf);
                        fprintf(stderr, "%s: match failed: %s\n", buf, errbuf);
                        return EXIT_ERROR_RUNTIME;
                }
        }
        regfree(&regex);
        return status;
}

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <langinfo.h>

static char *progname = "strcoll";

int
main(int argc, char **argv)
{
        int rv;

        if (argc > 0)
                progname = argv[0];
        if (setlocale(LC_ALL, "") == NULL) {
                fprintf(stderr, "%s: setlocale failed\n", progname);
                return 3;
        }
        fprintf(stderr, "info: LC_ALL: %s\n", setlocale(LC_ALL, NULL));
        fprintf(stderr, "info: LC_CTYPE: %s\n", setlocale(LC_CTYPE, NULL));
        fprintf(stderr, "info: LC_COLLATE: %s\n", setlocale(LC_COLLATE, NULL));
        fprintf(stderr, "info: nl_langinfo(CODESET): %s\n",
            nl_langinfo(CODESET));

        if (argc != 3) {
                fprintf(stderr, "Usage: %s <string1> <string2>\n", progname);
                return 2;
        }
        rv = strcoll(argv[1], argv[2]);
        printf("strcoll(\"%s\", \"%s\") = %d\n", argv[1], argv[2], rv);
        printf("i.e. \"%s\" %s \"%s\"\n",
            argv[1],
            rv < 0 ? "<" : rv == 0 ? "==" : ">",
            argv[2]);
        return 0;
}

