tech-userlevel archive


Re: [GSoC 2011] [Status Report] Apropos Replacement



On Fri, Aug 26, 2011 at 12:59 AM, dieter roelants
<dieter.NetBSD%pandora.be@localhost> wrote:
Hi dieter
> This looks very useful. I wonder about 2 things:
Thanks :)

> - would it be possible to let the incremental updating of makemandb use
> the date in the man page to decide if it  has to be re-indexed?
It does not keep track of the dates in the man pages. But I suppose you
are asking this in the context of pages which have been updated. For
that, makemandb works something like this:
makemandb maintains an index of the md5 hashes of the man pages in the
database. (Call it set A)

Step1: makemandb will traverse the search path obtained from man -p,
compute md5 hashes of all the man pages it encounters, and store them
in a temporary table. (Call it set B)

Step2: Now we have two sets of md5 hashes, A and B. The difference of
the set B with set A (B - A) will give us those pages which are either
newly installed or which have been updated (thus a change in date).
makemandb will then parse and index these new and updated pages.

Step3: Similarly the difference of set A with set B (A - B) will give
us the set of pages whose records exist in the database but which have
been removed from the file system (thus their md5 hash not present in
set B). makemandb will remove all such pages from the database.

Step4: In the end it will drop the temporary md5 table built in step1.

So if a man page is updated or new man pages are installed, their md5
hashes will not be present in the db, which will trigger their parsing
and subsequent insertion in the database.
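
The steps above can be sketched in a few lines of Python. This is only
an illustration of the set arithmetic, not the actual makemandb code
(which is written in C and keeps the hashes in database tables); the
names md5_of and plan_update and the in-memory dict standing in for the
file system are assumptions made for the example.

```python
import hashlib


def md5_of(data):
    """Return the hex md5 digest of a page's raw bytes."""
    return hashlib.md5(data).hexdigest()


def plan_update(indexed_hashes, pages_on_disk):
    """Decide what to index and what to remove.

    indexed_hashes: set A, the md5 hashes already stored in the database.
    pages_on_disk:  hypothetical stand-in for the file system, mapping
                    a page path to its contents (bytes).
    """
    # Step1: hash every page found on disk (set B).
    hash_to_path = {md5_of(data): path for path, data in pages_on_disk.items()}
    set_b = set(hash_to_path)

    # Step2: B - A gives new or updated pages, which must be parsed.
    to_index = sorted(hash_to_path[h] for h in set_b - indexed_hashes)

    # Step3: A - B gives hashes whose pages are gone (or were replaced
    # by an updated version), so their records must be removed.
    to_remove = indexed_hashes - set_b

    # Step4: set B is a local variable here, so it is discarded on return.
    return to_index, to_remove
```

Note that an updated page shows up on both sides: its old hash lands in
A - B and is removed, while its new hash lands in B - A and is indexed.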

> - would it be possible/a good idea to increase the apropos matching
> score if the terms searched for are in the subject of the man page?
> The screenshot for "add a new user" doesn't highlight the keywords in
> the subject, which makes me think it is not used for the
> indexing/scoring?

Yes, the scoring algorithm being employed at the moment gives maximum
weight to matches which are found in the NAME section. It is not
perfect, i.e. the result you were probably looking for might not be at
the number 1 position, but invariably you will find it in the top 10.
Thus you see useradd(8) at number 3.
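
To illustrate the idea of weighting matches by section, here is a
minimal sketch. It is not the scoring algorithm apropos actually uses;
the weights, the section names other than NAME, and the functions score
and rank are all hypothetical and chosen only to show why a NAME match
pushes a page up the list.

```python
# Hypothetical per-section weights; NAME gets the largest one.
SECTION_WEIGHTS = {"NAME": 2.0, "DESCRIPTION": 0.5}
DEFAULT_WEIGHT = 0.25


def score(page, terms):
    """Sum the weighted number of term occurrences over all sections.

    page:  dict mapping a section title to its text.
    terms: list of search keywords.
    """
    total = 0.0
    for section, text in page.items():
        words = text.lower().split()
        weight = SECTION_WEIGHTS.get(section, DEFAULT_WEIGHT)
        total += weight * sum(words.count(t.lower()) for t in terms)
    return total


def rank(pages, terms):
    """Return page names ordered from highest to lowest score."""
    return sorted(pages, key=lambda name: score(pages[name], terms),
                  reverse=True)
```

With such a scheme a page that mentions the keywords once in NAME can
outrank a page that mentions them several times elsewhere, but a page
with many matches in other sections can still come out ahead, which is
one way the expected result can end up below the number 1 position.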

Regarding the highlight of keywords, I apologise that I did not
clarify how to interpret the output of apropos(1). Let me do this now
:)

1. The first line of a result contains the following information:
         a)  the name of the man page,
         b)  its section number,
         c)  and the one line description as obtained from the page's
             NAME section.
2. The second line of a result contains a snippet of the text from the
man page. The matching keywords in this snippet appear highlighted.
So even though weight is given to matches found in the NAME section,
the keywords appear highlighted only in the snippet.

That said, I think it might be a good idea to highlight the matching
keywords in the NAME section as well.

Thanks :)

--
Abhinav

