Re: List of Keywords for apropos(1) Which Should Not be Stemmed

On Tue, Jul 12, 2016 at 11:41:21AM +0530, Abhinav Upadhyay wrote:
 > >> But the downside is that technical keywords (e.g. kms, lfs, ffs), are
 > >> also stemmed down and stored (e.g. km, lf, ff) in the index. So if you
 > >> search for kms, you will see results for both kms and km.
 > > Interesting problem.
 > > I expect the set of documents that contain a word ("directories") and
 > > the set of documents containing its true stem ("directory") to overlap
 > > widely.  I also expect the set of documents that contain a word ("kms")
 > > and an incorrect stem ("km") to scarcely overlap.  Do the manual pages
 > > meet these expectations?  If so, then maybe you can decide whether or not
 > > to keep a stem by looking at the document-set overlap?
 > Yes, usually when the stem is incorrect, the overlap is not that much.
 > But the only way to figure out such cases is manually comparing the
 > output of apropos, unless we have a pre-built list of expected
 > document-set and we can compare those. :)

You could build such a list from the current set of man pages, and
refresh it once in a while, and that would probably work well enough.
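
To make the overlap test concrete, here is a rough sketch in Python.
It is not what apropos(1) or makemandb(8) does today; the function
names, the 0.5 threshold, and the page sets are all made up for
illustration. It uses the overlap coefficient (intersection size over
the smaller set's size) as one possible reading of "overlap widely".

```python
def overlap(docs_word, docs_stem):
    """Overlap coefficient |A & B| / min(|A|, |B|); 0.0 if either set is empty."""
    a, b = set(docs_word), set(docs_stem)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def keep_stem(docs_word, docs_stem, threshold=0.5):
    """Accept the stemming only if the two document sets overlap enough.

    The threshold is an arbitrary assumption and would need tuning
    against the real set of man pages.
    """
    return overlap(docs_word, docs_stem) >= threshold


# Hypothetical document sets, not taken from a real index.
directories = {"ls.1", "mkdir.1", "find.1", "rmdir.1"}
directory = {"ls.1", "mkdir.1", "find.1", "cd.1"}
kms = {"drm.4", "radeon.4"}
km = {"units.1", "geodesy.3"}

print(keep_stem(directories, directory))  # sets share 3 of 4 pages
print(keep_stem(kms, km))                 # sets are disjoint
```

Running something like this once over the whole index would give you
the precomputed list of words to leave unstemmed.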

I'm wondering though if there's some characteristic of the document
sets you can use to automatically reject wrong stemmings without
having to precompute anything. What comes to mind is some kind of
diameter or breadth metric on the image of the document set in the
cross-reference graph. Or maybe something like the average
cross-reference pagerank of the document set, which if it's too high
means you aren't retrieving useful information. But I guess these
notions aren't much use for now, because I'm sure we don't currently
build the cross-reference graph.
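
For what it's worth, a breadth metric of that kind could look roughly
like the Python sketch below. It assumes a cross-reference graph is
already available as a dict built from the SEE ALSO sections, which
as noted we don't have today. The names, the penalty of 10 for
unreachable pairs, and the sample graph are all invented. The metric
is the mean shortest-path distance between pairs of pages in the
document set, so a large value means the pages are scattered.

```python
from collections import deque
from itertools import combinations


def undirected(xrefs):
    """Turn {page: [referenced pages]} into a symmetric adjacency map."""
    graph = {}
    for src, targets in xrefs.items():
        graph.setdefault(src, set())
        for dst in targets:
            graph[src].add(dst)
            graph.setdefault(dst, set()).add(src)
    return graph


def bfs_distances(graph, start):
    """Hop counts from start to every page reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def spread(graph, docs, unreachable=10):
    """Mean pairwise distance within docs; unreachable pairs get a fixed penalty."""
    docs = sorted(set(docs))
    if len(docs) < 2:
        return 0.0
    dist = {d: bfs_distances(graph, d) for d in docs}
    pairs = list(combinations(docs, 2))
    return sum(dist[a].get(b, unreachable) for a, b in pairs) / len(pairs)


# Invented cross-references, for illustration only.
graph = undirected({
    "ls.1": ["chmod.1"],
    "chmod.1": ["chown.8"],
    "drm.4": ["radeon.4"],
    "units.1": [],
})

print(spread(graph, ["ls.1", "chmod.1", "chown.8"]))  # tightly linked pages
print(spread(graph, ["drm.4", "units.1"]))            # unrelated pages
```

A stemming whose merged document set has a much larger spread than
the unstemmed word's set would be a candidate for rejection.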

(Also, as far as longer vs. shorter words go, there's not much harm
besides performance in searching for nonsense words like "resize_ff",
as they generally won't match anything.)

David A. Holland

