tech-userlevel archive


Re: List of Keywords for apropos(1) Which Should Not be Stemmed

On Tue, Jul 12, 2016 at 12:26 PM, David Holland
<> wrote:
> On Tue, Jul 12, 2016 at 11:41:21AM +0530, Abhinav Upadhyay wrote:
>  > >> But the downside is that technical keywords (e.g. kms, lfs, ffs) are
>  > >> also stemmed down and stored (e.g. km, lf, ff) in the index. So if you
>  > >> search for kms, you will see results for both kms and km.
>  > >
>  > > Interesting problem.
>  > >
>  > > I expect the set of documents that contain a word ("directories") and
>  > > the set of documents containing its true stem ("directory") to overlap
>  > > widely.  I also expect the set of documents that contain a word ("kms")
>  > > and an incorrect stem ("km") to scarcely overlap.  Do the manual pages
>  > > meet these expectations?  If so, then maybe you can decide whether or not
>  > > to keep a stem by looking at the document-set overlap?
>  >
>  > Yes, usually when the stem is incorrect, the overlap is small. But the
>  > only way to catch such cases is to compare the output of apropos
>  > manually, unless we have a pre-built list of expected document sets
>  > we can compare against. :)
> You could build such a list from the current set of man pages, and
> refresh it once in a while, and that would probably work well enough.
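The overlap heuristic quoted above could be sketched roughly like this. This is a minimal sketch, not anything apropos currently implements: the function names, the set-of-pages inputs, and the 0.5 threshold are all my own assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two document sets (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def keep_stem(word_docs, stem_docs, threshold=0.5):
    """Keep a stemming only when the pages containing the word and the
    pages containing its stem overlap widely; reject it otherwise.
    The threshold is an arbitrary placeholder."""
    return jaccard(word_docs, stem_docs) >= threshold
```

So "directories" vs. "directory" (widely overlapping page sets) would keep its stem, while "kms" vs. "km" (nearly disjoint page sets) would not.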

That's one of the things I want to do. It would be nice to create a
labeled dataset, probably something like a set of queries and an
expected list of documents in the top 10 for each of them. It could
then be used as training data for tasks such as
- evaluating the performance of various ranking algorithms,
- using machine learning to learn an optimal ranking algorithm,
- determining which keywords should be stemmed, by comparing the
overlap of the actual and expected results.
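Given such a labeled dataset, the comparison could be as simple as a top-k overlap score averaged over the queries. A rough sketch, assuming some callable `run_apropos` that returns a ranked list of pages for a query (that callable and the metric are my assumptions, not existing code):

```python
def overlap_at_k(expected, actual, k=10):
    """Fraction of the expected top-k documents that the ranking
    actually retrieved in its own top k."""
    exp = set(expected[:k])
    act = set(actual[:k])
    if not exp:
        return 0.0
    return len(exp & act) / len(exp)

def evaluate(expected_by_query, run_apropos, k=10):
    """Average top-k overlap across a labeled query set.

    expected_by_query: {query: [expected pages, best first]}
    run_apropos: hypothetical callable returning a ranked list
    """
    scores = [overlap_at_k(exp, run_apropos(q), k)
              for q, exp in expected_by_query.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Two ranking algorithms could then be compared by their average scores over the same labeled queries.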

> I'm wondering though if there's some characteristic of the document
> sets you can use to automatically reject wrong stemmings without
> having to precompute. What comes to mind is some kind of
> diameter or breadth metric on the image of the document set on the
> crossreference graph. Or maybe something like the average
> crossreference pagerank of the document set, which if it's too high
> means you aren't retrieving useful information. But I guess these
> notions aren't much use because I'm sure we don't currently build the
> crossreference graph.

We haven't explored this aspect yet. If we had a hand-labeled dataset
as mentioned above, we could evaluate the performance of PageRank as
well.
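If the crossreference graph were ever built (it isn't today, as noted above), computing PageRank over it is straightforward power iteration. A self-contained sketch; the graph shape (page -> list of pages it cross-references) and all page names are illustrative:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a crossreference graph.

    graph: {page: [pages it cross-references]}
    Returns {page: rank}, with ranks summing to 1.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, links in graph.items():
            if links:
                share = damping * rank[p] / len(links)
                for q in links:
                    new[q] += share
            else:
                # Dangling page: spread its rank over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

The "average crossreference pagerank of a result set" heuristic quoted above would then just be the mean of `rank[p]` over the pages a query returns.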
