Current-Users archive


Re: Potentially undesirable behavior with apropos(1)



Hi Abhinav,

> On 8 Jul 2016, at 8:08 AM, Abhinav Upadhyay <er.abhinav.upadhyay%gmail.com@localhost> wrote:
> 
> Hi Paul,
> 
>> On Fri, Jul 8, 2016 at 7:24 AM, Paul Goyette <paul%whooppee.com@localhost> wrote:
>> With a reasonably current 7.99.33 (less than a week old), I noticed that
>> when I request
>> 
>>        apropos kms
>> 
>> (expecting to find man pages referencing "xxxdrmkms"), it seems to find a
>> lot of entries for "km".  Is this intended?  None of the found entries has
>> "kms", only "km".
>> 
>> I really wasn't expecting to find anything about kilometers, or meta-keys,
>> or Khmer (Cambodian language?)!
> 
> This is one of the shortcomings of apropos(1) right now. While
> indexing the man pages, the tokenizer stems the words being
> indexed. Stemming essentially tries to reduce words to their root
> forms, for example:
> running -> run
> eating -> eat
> eats -> eat
> listened -> listen
Is there a way to disable the stemming (preferably via a config option or an environment variable)?

Thilo
> 
> It does this by removing suffixes like 's', 'es', 'ing', 'ed' from the
> words. Therefore 'kms', when indexed, is stored as 'km'. The same
> happens to 'ffs', 'lfs', 'ntfs', etc. :)
> 
> It applies the same algorithm when doing the search, so when you enter
> 'kms' it first stems it down to 'km' and then does the search. This is
> needed because 'kms' was stored as 'km' at indexing time, so searching
> for the unstemmed 'kms' would otherwise not get you anything.
> 
> Stemming is an essential part of implementing full-text search and,
> except for such cases, it works really well. I'm planning to write a
> custom tokenizer implementation which will not stem technical keywords
> like kms, lfs, ntfs, nfs, etc. That will fix these problems; it's
> coming soon :)
> 
> -
> Abhinav
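
To make the behaviour described above concrete, here is a minimal sketch in Python. It is not the apropos(1) tokenizer: naive_stem() is a toy suffix stripper covering only the suffixes named above, the PROTECTED keyword list is just one possible reading of the planned "do not stem technical keywords" fix, and the page names and texts in DOCS are made up for illustration.

```python
# Toy model of stemming at index time and at query time.
# Everything here is hypothetical and only illustrates the idea.

SUFFIXES = ("ing", "ed", "es", "s")          # suffixes named in the thread
PROTECTED = frozenset({"kms", "ffs", "lfs", "ntfs", "nfs"})  # assumed list


def naive_stem(word):
    """Strip the first matching suffix, leaving at least two characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            stem = word[:-len(suffix)]
            # "running" -> "runn" -> "run": collapse a doubled final
            # consonant left behind by 'ing' or 'ed'.
            if (suffix in ("ing", "ed") and stem[-1] == stem[-2]
                    and stem[-1] not in "aeiou"):
                stem = stem[:-1]
            return stem
    return word


def protected_stem(word):
    """Like naive_stem(), but leave listed technical keywords untouched."""
    lowered = word.lower()
    return lowered if lowered in PROTECTED else naive_stem(lowered)


def build_index(docs, stem):
    """Map each stemmed token to the set of pages that contain it."""
    index = {}
    for name, text in docs.items():
        for token in text.split():
            index.setdefault(stem(token), set()).add(name)
    return index


def search(index, query, stem):
    """Stem the query the same way the index was built, then look it up."""
    return sorted(index.get(stem(query), set()))


# Hypothetical pages, not real man page contents.
DOCS = {
    "drm-page": "radeon kms driver",
    "units-page": "convert km to miles",
}

plain_index = build_index(DOCS, naive_stem)
fixed_index = build_index(DOCS, protected_stem)

print(search(plain_index, "kms", naive_stem))      # both pages match
print(search(fixed_index, "kms", protected_stem))  # only the kms page
```

With plain stemming, 'kms' and 'km' collapse to the same index key, so the kilometre page shows up for a 'kms' query. With the protected list, 'kms' keeps its own key on both the indexing and the query side, which is why the two sides have to use the same function.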


