Current-Users archive


Re: Potentially undesirable behavior with apropos(1)



Hi Paul,

On Fri, Jul 8, 2016 at 7:24 AM, Paul Goyette <paul%whooppee.com@localhost> wrote:
> With a reasonably current 7.99.33 (less than a week old), I noticed that
> when I request
>
>         apropos kms
>
> (expecting to find man pages referencing "xxxdrmkms"), it seems to find a
> lot of entries for "km".  Is this intended?  None of the found entries has
> "kms", only "km".
>
> I really didn't expect to find anything about kilometers, or meta-keys,
> or khmer (cambodian language?)!

This is one of the shortcomings of apropos(1) right now. While
indexing the man pages, the tokenizer stems the words being
indexed. Stemming essentially tries to reduce words to their root
form, for example:
running -> run
eating -> eat
eats -> eat
listened -> listen

It does this by removing suffixes like 's', 'es', 'ing', 'ed' from the
words. Therefore 'kms', when being indexed, gets stored as 'km'. The
same goes for 'ffs', 'lfs', 'ntfs', etc. :)
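
To make the effect concrete, here is a toy suffix stripper in Python. It
is only a sketch of the idea and not the algorithm apropos(1) actually
uses; the function name naive_stem and the suffix list are made up for
this example.

```python
# Toy suffix-stripping stemmer. This is an illustration only, NOT the
# stemmer used by apropos(1); naive_stem and SUFFIXES are hypothetical.
SUFFIXES = ("ing", "es", "ed", "s")  # tried in this order


def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least two characters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            stem = word[:-len(suffix)]
            # For 'ing'/'ed', undo consonant doubling: "runn" -> "run".
            if (suffix in ("ing", "ed") and len(stem) >= 2
                    and stem[-1] == stem[-2] and stem[-1] not in "aeiou"):
                stem = stem[:-1]
            return stem
    return word


for w in ("running", "eating", "eats", "listened", "kms"):
    print(w, "->", naive_stem(w))
```

The last line of output is "kms -> km": the stemmer cannot tell an
acronym ending in 's' from a plural.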

It applies the same algorithm when doing the search, so when you enter
'kms' it first stems it down to 'km' and then does the search. This is
needed because during indexing 'kms' was stored as 'km', so searching
for the unstemmed 'kms' would not find anything. The side effect is
that every page containing a plain 'km' matches as well.
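
A small Python sketch of why the query has to be stemmed the same way as
the index. Everything here is assumed for illustration: the stem()
stand-in only strips a trailing 's', and the page names and texts are
invented, not real man pages.

```python
from collections import defaultdict


def stem(word):
    # Toy stand-in for a real stemmer: strip a single trailing 's'.
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 2 else word


def build_index(pages):
    # Map each stemmed token to the set of pages that contain it.
    index = defaultdict(set)
    for name, text in pages.items():
        for token in text.split():
            index[stem(token)].add(name)
    return index


def search(index, query, use_stemming=True):
    term = stem(query) if use_stemming else query.lower()
    return sorted(index.get(term, set()))


# Hypothetical pages, made up for this example.
pages = {
    "page_a": "drm kms driver",
    "page_b": "distance in km",
}
index = build_index(pages)

print(search(index, "kms"))                      # ['page_a', 'page_b']
print(search(index, "kms", use_stemming=False))  # []
```

With stemming on both sides the query finds the page it should, plus the
unrelated 'km' page; without stemming the query it finds nothing at all.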

Stemming is an essential part of implementing full text search and,
except for cases like this, it works really well. I'm planning to write
a custom tokenizer implementation which will not stem technical keywords
like kms, lfs, ntfs, nfs, etc. That will fix these problems. It's
coming soon :)
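
One possible shape for such an exception list, sketched in Python. This
is only a guess at the general approach, not the planned tokenizer; the
names PROTECTED, strip_suffix and stem_unless_protected are hypothetical.

```python
# Sketch of a stemmer with an exception list. All names here are
# hypothetical; this is not the actual apropos(1) tokenizer.
PROTECTED = frozenset({"kms", "lfs", "ntfs", "nfs", "ffs"})


def strip_suffix(word):
    # Toy stemmer: remove the first matching suffix, keeping >= 2 chars.
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word


def stem_unless_protected(word):
    word = word.lower()
    # Technical keywords are indexed and searched verbatim.
    if word in PROTECTED:
        return word
    return strip_suffix(word)


print(stem_unless_protected("kms"))   # kms
print(stem_unless_protected("eats"))  # eat
```

As long as the same function is used for indexing and for queries, 'kms'
and 'km' end up as separate terms.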

-
Abhinav

