Re: List of Keywords for apropos(1) Which Should Not be Stemmed
On Tue, Jul 12, 2016 at 12:26 PM, David Holland wrote:
> On Tue, Jul 12, 2016 at 11:41:21AM +0530, Abhinav Upadhyay wrote:
> > >> But the downside is that technical keywords (e.g. kms, lfs, ffs), are
> > >> also stemmed down and stored (e.g. km, lf, ff) in the index. So if you
> > >> search for kms, you will see results for both kms and km.
> > >
> > > Interesting problem.
> > >
> > > I expect the set of documents that contain a word ("directories") and
> > > the set of documents containing its true stem ("directory") to overlap
> > > widely. I also expect the set of documents that contain a word ("kms")
> > > and an incorrect stem ("km") to scarcely overlap. Do the manual pages
> > > meet these expectations? If so, then maybe you can decide whether or not
> > > to keep a stem by looking at the document-set overlap?
> > Yes, usually when the stem is incorrect, the overlap is small. But
> > the only way to identify such cases is by manually comparing the
> > output of apropos, unless we have a pre-built list of expected
> > document sets to compare against. :)
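The overlap heuristic discussed above could be sketched roughly as follows. This is a hypothetical illustration, not apropos(1) internals: the function names, the example man-page sets, and the 0.1 threshold are all assumptions chosen for the example.

```python
# Sketch: decide whether to fold a term into its stem by comparing the
# sets of manual pages each occurs in. Jaccard similarity measures the
# document-set overlap; a correct stem ("directories" -> "directory")
# should overlap widely, a wrong one ("kms" -> "km") scarcely at all.

def jaccard(a, b):
    """Jaccard similarity of two document sets (0.0 .. 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def keep_stem(term_docs, stem_docs, threshold=0.1):
    """Accept the stemming only if the document sets overlap enough."""
    return jaccard(term_docs, stem_docs) >= threshold

# Wide overlap: stemming looks correct.
print(keep_stem({"mkdir.1", "ls.1", "rmdir.1"},
                {"mkdir.1", "ls.1", "rmdir.1", "stat.2"}))  # True

# Disjoint sets: reject the stem.
print(keep_stem({"drm.4"}, {"units.1"}))  # False
```

The threshold would need tuning against real index data; it only has to separate the "wide overlap" and "scarcely overlap" regimes described above.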
> You could build such a list from the current set of man pages, and
> refresh it once in a while, and that would probably work well enough.
That's one of the things I want to do. It would be nice to create a
labeled dataset, probably a set of queries with an expected list of
documents in the top 10 for each. It could then be used as training
data for tasks such as:
- evaluating the performance of various ranking algorithms,
- using machine learning to learn an optimal ranking algorithm,
- deciding which keywords should be stemmed by comparing the
overlap of the actual and expected results.
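Evaluation against such a labeled dataset could look something like this. Everything here is a made-up illustration: the query, the expected result list, and the `fake_apropos` stand-in are assumptions; a real run would shell out to apropos(1) and parse its output.

```python
# Sketch: score a ranking implementation against a hand-labeled set of
# queries, using precision@10 (fraction of the returned top-10 that
# appears in the expected top-10).

def precision_at_k(actual, expected, k=10):
    """Fraction of the top-k actual results found in the expected set."""
    top = actual[:k]
    if not top:
        return 0.0
    expected_set = set(expected)
    return sum(1 for doc in top if doc in expected_set) / len(top)

# Hand-labeled data: query -> expected top documents (illustrative only).
labeled = {
    "list directory contents": ["ls.1", "dir.1", "readdir.3"],
}

def fake_apropos(query):
    # Stand-in for running the real apropos(1) and collecting results.
    return ["ls.1", "readdir.3", "glob.3"]

scores = {q: precision_at_k(fake_apropos(q), expected)
          for q, expected in labeled.items()}
print(scores)  # 2 of the 3 returned results are in the expected set
```

The same harness could score several ranking algorithms side by side, which is exactly the first use case in the list above.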
> I'm wondering though if there's some characteristic of the document
> sets you can use to automatically reject wrong stemmings without
> having to precompute. What comes to mind is some kind of
> diameter or breadth metric on the image of the document set on the
> crossreference graph. Or maybe something like the average
> crossreference pagerank of the document set, which if it's too high
> means you aren't retrieving useful information. But I guess these
> notions aren't much use because I'm sure we don't currently build the
> crossreference graph.
We haven't explored this aspect yet. If we had a hand-labeled dataset
as mentioned above, we could probably evaluate the performance of
PageRank-based ranking as well.
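For what it's worth, PageRank over a cross-reference graph (e.g. built from SEE ALSO links) can be sketched with plain power iteration. The graph below is invented for illustration, and as noted above, apropos does not currently build this graph.

```python
# Sketch: PageRank by power iteration over a man-page cross-reference
# graph. links maps each page to the pages it references.

def pagerank(links, damping=0.85, iters=50):
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    out = {p: links.get(p, []) for p in pages}  # normalize missing keys
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for src, targets in out.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
            else:
                # A page with no outgoing references spreads its rank evenly.
                for p in pages:
                    new[p] += damping * rank[src] / n
        rank = new
    return rank

graph = {
    "ls.1": ["glob.3", "stat.2"],
    "glob.3": ["stat.2"],
    "stat.2": [],
}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # stat.2 -- everything references it
```

As the quoted mail suggests, a document set whose average rank is too high (i.e. dominated by heavily-referenced hub pages) might signal that a query is not retrieving useful information.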