List of Keywords for apropos(1) Which Should Not be Stemmed

To: "tech-userlevel%NetBSD.org@localhost" <tech-userlevel%netbsd.org@localhost>
Subject: List of Keywords for apropos(1) Which Should Not be Stemmed
From: Abhinav Upadhyay <er.abhinav.upadhyay%gmail.com@localhost>
Date: Mon, 11 Jul 2016 18:59:25 +0530

Hi,

Currently we use SQLite's built-in Porter stemming tokenizer, which by
default stems every keyword while indexing. It does this by removing
suffixes such as 's', 'es', 'ing' and 'ed' from the end of words, along
with various similar heuristics. This is useful for full text search
because a search for 'directories' will match both 'directory' and
'directories'.

But the downside is that technical keywords (e.g. kms, lfs, ffs) are
also stemmed and stored in the index in their stemmed form (e.g. km,
lf, ff). So if you search for kms, you will see results for both kms
and km.
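
To make the effect concrete, here is a rough sketch of suffix
stripping. It is a toy written for illustration only: toy_stem() is a
hypothetical helper, not the real Porter algorithm and not what SQLite
runs, but it shows both the benefit and the problem described above.

```python
def toy_stem(word):
    """Strip a few common English suffixes (illustrative, not Porter)."""
    word = word.lower()
    # Map plural "-ies" back to "-y" so the plural and singular forms meet.
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    # Otherwise drop the first matching suffix, keeping at least 2 letters.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("directories", "directory", "kms", "lfs", "ffs"):
        print(w, "->", toy_stem(w))
```

Both 'directories' and 'directory' end up as the same index term, which
is what we want, but the trailing 's' of kms, lfs and ffs is stripped
by the very same rule.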

The solution is to write a custom tokenizer that checks an ignore list
to decide whether to stem a token or not. I'm looking for the best way
to obtain this ignore list of keywords. The discussion on
current-users [1] had two suggestions:

1. If a word is not in /usr/share/dict/words, don't stem.
2. Look for .Tn macros (and probably other similar macros) and don't stem those.
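
Whichever way the list is obtained, the check inside the tokenizer
would look roughly like the sketch below. It assumes the ignore list is
already in memory; IGNORE, stem() and tokenize() are hypothetical names
for illustration, and stem() is only a placeholder that drops a
trailing 's', standing in for the real stemmer.

```python
import re

# Hypothetical hand-picked entries; the real list is what this mail asks about.
IGNORE = frozenset({"kms", "lfs", "ffs"})


def stem(word):
    """Placeholder stemmer: drops one trailing 's' (stand-in for Porter)."""
    return word[:-1] if word.endswith("s") and len(word) > 2 else word


def tokenize(text, ignore=IGNORE):
    """Split text into lowercase tokens, stemming all but ignored ones."""
    tokens = []
    for raw in re.findall(r"[A-Za-z0-9_]+", text):
        token = raw.lower()
        # Tokens on the ignore list are stored exactly as written.
        tokens.append(token if token in ignore else stem(token))
    return tokens


if __name__ == "__main__":
    print(tokenize("LFS cleans segments"))
```

With the ignore list in place 'lfs' is kept as is, while ordinary words
such as 'segments' are still stemmed.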

Doing (1) is simple, but that file is huge and it would require
building a huge hash table to look up every keyword while
parsing the man pages.
With (2), the list will not be available before makemandb(8) runs, so
it is hard to implement.
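
For reference, suggestion (1) amounts to something like the following
sketch, with a Python set standing in for the hash table. load_words()
and should_stem() are hypothetical names, and lowercasing every entry
is my assumption rather than something the thread settled on.

```python
def load_words(lines):
    """Build a set of lowercase words from an iterable of lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def should_stem(token, dictionary):
    """Suggestion (1): stem only tokens that are dictionary words."""
    return token.lower() in dictionary


# Intended use (not run here, the file may be missing on some systems):
#   with open("/usr/share/dict/words") as f:
#       dictionary = load_words(f)
#   should_stem("kms", dictionary)
```

The whole file has to be held in memory for the lookups, which is the
cost mentioned above.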

Another option is to build the list by hand, using
/usr/data/src/usr.bin/spell/spell/{special.netbsd, special.math} as a
starting point. If you have any better alternatives, please let me
know :)

[1]: http://mail-index.netbsd.org/current-users/2016/07/08/msg029732.html


-
Abhinav

Follow-Ups:
- Re: List of Keywords for apropos(1) Which Should Not be Stemmed
  - From: David Young
- Re: List of Keywords for apropos(1) Which Should Not be Stemmed
  - From: Thomas Klausner
