NetBSD-Bugs archive


PR/46255 CVS commit: src/usr.sbin/makemandb



The following reply was made to PR bin/46255; it has been noted by GNATS.

From: "Abhinav Upadhyay" <abhinav%netbsd.org@localhost>
To: gnats-bugs%gnats.NetBSD.org@localhost
Cc: 
Subject: PR/46255 CVS commit: src/usr.sbin/makemandb
Date: Sun, 18 Jun 2017 16:24:10 +0000

 Module Name:	src
 Committed By:	abhinav
 Date:		Sun Jun 18 16:24:10 UTC 2017
 
 Modified Files:
 	src/usr.sbin/makemandb: Makefile apropos-utils.c apropos-utils.h
 Added Files:
 	src/usr.sbin/makemandb: custom_apropos_tokenizer.c
 	    custom_apropos_tokenizer.h fts3_tokenizer.h nostem.txt
 
 Log Message:
 Add a custom tokenizer which does not stem certain keywords.
 
 Which keywords should not be stemmed is specified in the nostem.txt file.
 (Right now I have taken all the man page names, split them if they had
 underscores, removed common English words and converted everything to
 lowercase.)
 
 The tokenizer itself is based on the Porter stemming tokenizer shipped with
 SQLite. The code in custom_apropos_tokenizer.c is a copy of that code, with
 some modifications to prevent stemming of the keywords specified in nostem.txt.
 
 Additionally, the tokenizer now also treats underscore `_' as a token
 delimiter. It is therefore possible to query for `lwp' and match all `_lwp_*'
 man page names, or to query for `unconst' and match `__UNCONST'. This was not
 possible earlier: underscore was not a delimiter, so the index would have
 __UNCONST as a key rather than UNCONST.
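 
 The effect of both changes can be sketched in a few lines of Python. This is
 only an illustration, not the C code from custom_apropos_tokenizer.c: the
 NOSTEM set stands in for nostem.txt with made-up entries, toy_stem() is a
 crude suffix stripper rather than the Porter algorithm, and the function
 names are hypothetical.

```python
import re

# Hypothetical stand-in for the contents of nostem.txt.
NOSTEM = {"lwp", "unconst", "curses"}


def toy_stem(word):
    """Crude suffix stripper used for illustration; NOT the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text, underscore_is_delimiter=True):
    """Split text into lowercase tokens, stemming all but the NOSTEM words."""
    # With the new behaviour, '_' separates tokens; with the old one it is
    # part of the token.
    pattern = r"[A-Za-z0-9]+" if underscore_is_delimiter else r"[A-Za-z0-9_]+"
    tokens = []
    for raw in re.findall(pattern, text):
        word = raw.lower()
        tokens.append(word if word in NOSTEM else toy_stem(word))
    return tokens


print(tokenize("__UNCONST"))                                  # new behaviour
print(tokenize("__UNCONST", underscore_is_delimiter=False))   # old behaviour
print(tokenize("_lwp_create"))
print(tokenize("curses windows"))
```

 With underscore as a delimiter the key stored for `__UNCONST' is `unconst',
 so a query for `unconst' can match it; without it the key keeps its leading
 underscores.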
 
 The tokenizer needs the fts3_tokenizer.h file, which is not shipped with the
 amalgamation build of SQLite; therefore it needs to be added here (unless
 we decide there is a better place for it).
 
 To enforce use of the new tokenizer, a schema version bump is needed.
 
 Since tokenization is done both at indexing time (via makemandb) and at
 query time (via apropos or whatis), the schema version must be bumped
 every time nostem.txt is modified. Otherwise the index will consist of old
 tokens and the desired changes will not be seen with apropos.
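 
 Why a stale index hides the change can be shown with a small Python sketch.
 It is a toy model under stated assumptions, not how makemandb or apropos are
 implemented: the index is a plain dictionary, stemming is reduced to
 dropping a trailing `s', and all names and the sample document are
 hypothetical.

```python
def tokenize(text, nostem):
    """Lowercase, split on whitespace, strip a trailing 's' unless exempt."""
    tokens = []
    for word in text.lower().split():
        if word not in nostem and word.endswith("s"):
            word = word[:-1]  # toy stemming, not Porter
        tokens.append(word)
    return tokens


def build_index(docs, nostem):
    """Map each token to the set of document names containing it."""
    index = {}
    for name, text in docs.items():
        for tok in tokenize(text, nostem):
            index.setdefault(tok, set()).add(name)
    return index


def search(index, query, nostem):
    """Return documents containing every token of the query."""
    hits = None
    for tok in tokenize(query, nostem):
        found = index.get(tok, set())
        hits = found if hits is None else hits & found
    return hits or set()


docs = {"curses(3)": "curses screen handling"}  # made-up sample document
old_nostem = set()            # before the keyword was added
new_nostem = {"curses"}       # after the keyword was added

stale_index = build_index(docs, old_nostem)   # built before the change
fresh_index = build_index(docs, new_nostem)   # rebuilt after the change

print(search(stale_index, "curses", new_nostem))   # query and index disagree
print(search(fresh_index, "curses", new_nostem))   # query and index agree
```

 The stale index stores the stemmed key while the query now produces the
 unstemmed one, so the lookup finds nothing until the index is rebuilt.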
 
 This should also fix the issue reported in PR bin/46255. A similar suggestion
 was also made on tech-userlevel@ recently:
 <http://mail-index.netbsd.org/tech-userlevel/2017/06/08/msg010620.html>
 
 Thanks to christos@ for multiple rounds of reviews of the tokenizer code.
 
 
 To generate a diff of this commit:
 cvs rdiff -u -r1.8 -r1.9 src/usr.sbin/makemandb/Makefile
 cvs rdiff -u -r1.37 -r1.38 src/usr.sbin/makemandb/apropos-utils.c
 cvs rdiff -u -r1.12 -r1.13 src/usr.sbin/makemandb/apropos-utils.h
 cvs rdiff -u -r0 -r1.1 src/usr.sbin/makemandb/custom_apropos_tokenizer.c \
     src/usr.sbin/makemandb/custom_apropos_tokenizer.h \
     src/usr.sbin/makemandb/fts3_tokenizer.h src/usr.sbin/makemandb/nostem.txt
 
 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
 

