tech-userlevel archive


List of Keywords for apropos(1) Which Should Not be Stemmed



Hi,

Currently we are using SQLite's built-in Porter stemming tokenizer,
which by default stems every keyword while indexing. It does this by
removing suffixes such as 's', 'es', 'ing', and 'ed' from the ends of
words, along with various similar heuristics. This is useful for full
text search because a search for 'directories' will match both
'directory' and 'directories'.

But the downside is that technical keywords (e.g. kms, lfs, ffs) are
also stemmed, and only the stemmed forms (e.g. km, lf, ff) are stored
in the index. So if you search for kms, you will see results for both
kms and km.

The solution is to write a custom tokenizer that checks an ignore
list to decide whether to stem a token. I'm looking for the best way
to obtain this ignore list of keywords. The discussion on
current-users [1] produced two suggestions:

1. If a word is not in /usr/share/dict/words, don't stem.
2. Look for .Tn macros (and probably other similar macros) and don't stem those.
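
A minimal sketch of the tokenizer idea, independent of where the ignore
list comes from. It is not the makemandb(8) implementation: toy_stem()
is a hypothetical stand-in for the real Porter stemmer, and NO_STEM,
index_token() and index_terms() are names made up for illustration.

```python
def toy_stem(word):
    """Toy suffix stripper standing in for a real Porter stemmer (assumption)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    for suffix in ("ing", "ed", "s"):
        # Keep at least two characters so very short words are not emptied.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


# Hypothetical ignore list; the real one is what this mail is asking about.
NO_STEM = {"kms", "lfs", "ffs"}


def index_token(token, no_stem=NO_STEM):
    """Return the form of the token that would be stored in the index."""
    token = token.lower()
    if token in no_stem:
        return token          # technical keyword: store unchanged
    return toy_stem(token)    # ordinary word: store the stem


def index_terms(text, no_stem=NO_STEM):
    """Split on whitespace and apply the ignore-list check to each token."""
    return [index_token(t, no_stem) for t in text.split()]
```

With the ignore list in place, 'kms' is stored as 'kms' and no longer
collides with 'km', while ordinary words such as 'directories' are
still stemmed.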

Doing (1) is simple, but that file is huge, and it would require
building a large hash table to look up every keyword while parsing
the man pages.
With (2), the list will not be available before makemandb(8) runs, so
it is hard to implement.
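
For comparison, both suggestions can be sketched as small helpers that
produce an ignore set. This is only an illustration under assumptions:
the function names are hypothetical, the dictionary is assumed to hold
one word per line, and the .Tn scan works on raw mdoc text and only
sees macros that start a line, whereas a real implementation would
more likely use the parsed page.

```python
import re


def unknown_to_dictionary(tokens, dictionary_words):
    """Suggestion (1): tokens that are not dictionary words are not stemmed."""
    # The set is itself a hash table, so the memory cost noted above remains.
    known = {w.strip().lower() for w in dictionary_words}
    return {t.lower() for t in tokens} - known


# Matches a line that starts with the .Tn macro and captures its arguments.
TN_LINE = re.compile(r"^\.Tn\s+(.+)$")


def tn_terms(mdoc_source):
    """Suggestion (2): collect the arguments of line-initial .Tn macros."""
    terms = set()
    for line in mdoc_source.splitlines():
        match = TN_LINE.match(line)
        if not match:
            continue
        for word in match.group(1).split():
            word = word.strip(".,;:()").lower()  # drop punctuation arguments
            if word:
                terms.add(word)
    return terms
```

For (1), dictionary_words could be the open /usr/share/dict/words file
object. For (2), the helper would have to run over every page before
any page is indexed, which is the ordering problem mentioned above.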

There is another option: building the list by hand, using
/usr/data/src/usr.bin/spell/spell/{special.netbsd, special.math} as a
starting point. If you have any better alternatives, please let me
know :)

[1]: http://mail-index.netbsd.org/current-users/2016/07/08/msg029732.html


-
Abhinav

