tech-userlevel archive


Re: regexp word boundaries



On Sun, Dec 02, 2012 at 07:35:51PM -0500, Thomas Dickey wrote:
> On Mon, Dec 03, 2012 at 12:16:57AM +0100, Alistair Crooks wrote:
> > 
> > It's long been a pet peeve of mine that regexp matching for word
> > boundaries has an annoying dependency on the implementation.
> > 
> > My thanks to the many people who helped out with this table, reproduced
> > below, which shows what works, and what doesn't work, when attempting
> > to match the zero-width pattern at a word boundary.
>  
> \b is a perl feature
> 
> man perlre explains about that, and \<
> (BRE's versus ERE's, essentially).

Let's look at what happens:

        vile-9.8nb1 on NetBSD/amd64 6.99.10, vile /usr/share/dict/words
        /\<arch
        cursor is placed at the start of the word "arch" on line 12397, as I
        would expect.

        vile-9.8e on FreeBSD/amd64 9.0-RC1 (I know, I know), vile /usr/share/dict/words
        /\<arch
        cursor is placed on the "arch" in "agonistarch" on line 4109
        i.e. \< as a word boundary is not respected.
        /\barch results in a "not found"

Now another try:

        vile-9.8e on FreeBSD/amd64 9.0-RC1, vile /etc/motd
        The text reads
        ...
        Welcome to FreeBSD!
        ...
        /\<to
        Results in "not found"
        so now let's use the one derived from perl regexps
        /\bto
        and the word "to" is found.
        (Unfortunately, the cursor is placed on the space before the word "to",
        so the match is not quite zero-width. Some people may find that close
        enough; again, unfortunately, I'm not one of them.)

Not quite what I'd expect from RTFM, but thanks for the suggestion.
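
To make the zero-width distinction concrete, here is a minimal sketch in
Python, whose "re" module follows the perl-style \b. It is only an
illustration: the helper name find_start is made up for this example, and
\sto is merely an assumed stand-in for a \b that consumes whitespace, not a
description of how vile implements it.

```python
import re

def find_start(pattern, text):
    """Return the offset where the first match begins, or None if no match."""
    m = re.search(pattern, text)
    return m.start() if m else None

motd = "Welcome to FreeBSD!"

# Zero-width word boundary: the match begins on the "t" of "to".
zero_width = find_start(r"\bto", motd)

# Stand-in for a boundary that consumes whitespace: the match begins
# one character earlier, on the space before "to".
consuming = find_start(r"\sto", motd)

# A word boundary must not match in the middle of a word...
inside_word = find_start(r"\barch", "agonistarch")

# ...but it does match at the start of one.
at_word_start = find_start(r"\barch", "arch")

print(zero_width, consuming, inside_word, at_word_start)
```

The one-character difference between the first two offsets is exactly the
cursor placement described above.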
 
> > regexp word boundaries
> >                 \<      \b      [[:<:]]
> > perl            not     works   not
> 
> (see manpage, as noted)

I think it would probably be best if you viewed what I wrote as a general
criticism of the trainwreck that is regexp word boundary matching, rather
than pointing me at a manual page for one of the programs involved.
 
> > freebsd vile    not     works   not
> > netbsd vile     works   not     not
> 
> without version numbers, I can only guess what you're referring to with
> vile.  \< has been part of vile for a long time; \b is different from perl
> (vile matches whitespace rather than a word boundary).  Both are in the
> help-file.  See
> 
>       http://invisible-island.net/vile/vile-toc.html
>       http://invisible-island.net/vile/vile-hlp.html#regular-expressions2

Thanks - I remember fixing the \< zero-width matching in vile in the
mid-1990s, and Paul merged the fix.  Unfortunately, your change log only
goes back as far as 1999 when the license was changed to the GPL (and
when I stopped working on vile), so there's no record of anything going
back that far.

Regards,
Alistair

