tech-userlevel archive


Re: regexp word boundaries



On Mon, Dec 03, 2012 at 04:06:06AM +0100, Alistair Crooks wrote:
> On Sun, Dec 02, 2012 at 07:35:51PM -0500, Thomas Dickey wrote:
> > On Mon, Dec 03, 2012 at 12:16:57AM +0100, Alistair Crooks wrote:
> > > 
> > > It's long been a pet peeve of mine that regexp matching for word
> > > boundaries has an annoying dependency on the implementation.
> > > 
> > > My thanks to the many people who helped out with this table, reproduced
> > > below, which shows what works, and what doesn't work, when attempting
> > > to match the zero-width pattern at a word boundary.
> >  
> > \b is a perl feature
> > 
> > man perlre explains about that, and \<
> > (BRE's versus ERE's, essentially).
> 
> Let's look at what happens:
> 
>       vile-9.8nb1 on NetBSD/amd64 6.99.10, vile /usr/share/dict/words
>       /\<arch
>       cursor is placed at the start of the word "arch" on line 12397, as I would expect.
> 
>       vile-9.8e on FreeBSD/amd64 9.0-RC1 (I know, I know), vile /usr/share/dict/words
>       /\<arch
>       cursor is placed on the "arch" in "agonistarch" on line 4109
>       i.e. \< as a word boundary is not respected.
>       /\barch results in a "not found"
> 
> now another try:

hmm - 9.8j is current.  I don't recall any recent regex changes or related
fixes.  I have 9.8i on FreeBSD 9, and don't see this behavior.  I'll make
a to-do to investigate the port's configuration...

\b in vile means almost the same as \s
        \b is [[:blank:]]
        \s is [[:space:]]

so \s includes \r, \n and \f while \b does not.
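
A minimal sketch of that difference, written in Python rather than taken from vile's sources. The sets are the POSIX "C" locale definitions of the two classes (which also put vertical tab in [[:space:]]); the names BLANK and SPACE are made up for this example.

```python
# POSIX "C" locale definitions of the two character classes.
# BLANK and SPACE are illustrative names, not identifiers from vile.
BLANK = set(" \t")              # [[:blank:]], what vile's \b matches
SPACE = set(" \t\n\v\f\r")      # [[:space:]], what vile's \s matches

# Characters that \s would match but \b would not.
only_space = sorted(SPACE - BLANK)
print([repr(c) for c in only_space])
```

Every blank is also a space character, so the only difference between the two is the set of line-oriented controls printed above.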

(\b has been there since early 2001 - seems that I added it as part
of the character-class changes).
 
>       vile-9.8e on FreeBSD/amd64 9.0-RC1, vile /etc/motd
>       The text reads
>       ...
>       Welcome to FreeBSD!
>       ...
>       /\<to
>       Results in "not found"
>       so now let's use the one derived from perl regexps
>       /\bto
>       and the word "to" is found.
>       (Unfortunately, the cursor is placed on the space before the word "to",
>       so it's not quite zero-width.  Some people may find that close enough;
>       again, unfortunately, I'm not one of them.)
> 
> not quite what I'd expect from RTFM, but thanks for the suggestion.
>  
> > > regexp word boundaries
> > >                 \<      \b      [[:<:]]
> > > perl            not     works   not
> > 
> > (see manpage, as noted)
> 
> I think it would probably be best if you viewed what I wrote as a general
> criticism of the trainwreck that is regexp word boundary matching, rather
> than pointing me at a manual page for one of the programs involved.

sure - but reading it, I look for things to improve (or fix).

\< should work because it's the most standardized.
\b is perl (perl will never be standardized - so I've read :-)
[[:<:]] is... (let's not digress)

I'd actually forgotten about \b in vile, but given the compatibility issues
I could change its behavior to match perl's more closely (I've added that
to my to-do list to investigate).  Doing so would lose the nice feature
that all of the character classes have an abbreviation.
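
To make the two meanings of \b concrete, here is a small sketch using Python's re module, whose \b is a zero-width word boundary like perl's. The second pattern, [ \t]to, is only an assumed stand-in for vile's current \b (a blank); it is not vile's regex engine.

```python
import re

text = "Welcome to FreeBSD!"

# perl-style \b: zero-width, so the match starts on the "t" of "to".
perl_style = re.search(r"\bto", text)

# Stand-in for vile's \b as [[:blank:]]: the blank is consumed,
# so the match starts on the space before "to".
vile_style = re.search(r"[ \t]to", text)

# A word boundary must not match in the middle of a word.
mid_word = re.search(r"\barch", "agonistarch")

print(perl_style.start(), vile_style.start(), mid_word)
```

The one-character difference in the start offsets corresponds to the cursor landing on the space instead of on the word itself.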
  
> > > freebsd vile    not     works   not
> > > netbsd vile     works   not     not
> > 
> > without version numbers, I can only guess what you're referring to with
> > vile.  \< has been part of vile for a long time; \b is different from perl
> > (vile matches whitespace rather than a word boundary).  Both are in the
> > help-file.  See
> > 
> >     http://invisible-island.net/vile/vile-toc.html
> >     http://invisible-island.net/vile/vile-hlp.html#regular-expressions2
> 
> Thanks - I remember fixing the \< zero-width matching in the mid 1990s
> on vile, and Paul merged the fix.  Unfortunately, your change log only
> goes back as far as 1999 when the license was changed to the GPL (and
> when I stopped working on vile), so there's no record of anything going
> back that far.

All of the changelogs are in the sources - the practice used to be that
we would rename CHANGES to CHANGES.Rx, but I stopped doing that a while
back (filesizes aren't as important).

-- 
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net



