[GSoC] [Apropos Replacement] Midterm Project Status Update

To: tech-userlevel%netbsd.org@localhost
Subject: [GSoC] [Apropos Replacement] Midterm Project Status Update
From: Abhinav Upadhyay <er.abhinav.upadhyay%gmail.com@localhost>
Date: Fri, 8 Jul 2011 22:54:33 +0530

Hello NetBSD!!

I am writing to report about the status of my project as we are
approaching the midterm evaluation dates (11th July).

As always I have put up a detailed blog post about the changes and
improvements I did as I can format the content better there making it
easier to read. Here is the post:
http://abhinav-upadhyay.blogspot.com/2011/07/netbsd-gsoc-midterm-project-update.html

Following are the major changes I did:

1. Showing the section number along with search results: Last time
around, there was no section number with the search results, which
made the results slightly incomplete. Thanks to Kristaps about
suggesting the right method to extract the meta data out of the man
pages, this is fixed.

2. Improved the Ranking Algorithm By Including the Inverse Document
Frequency (IDF): Earlier, the ranking algorithm was only based on the
term frequency (the number of times a term appears in a document). But
I updated it to also include the IDF factor.
Inverse Document Frequency for a term is defined as the number of
documents in which the ter appears. Combining both tf and idf results
in improvement in the quality of results, as it now considers both how
often a term appears in a document, and also how often the term
appears in the whole corpus.

 IDF is calculated as: idf = log(n / nt)
Where n = Total number of documents in the corpus
nt = Number of documents in which term t appears
So for a term which appears in only one document it will have a IDF = log(n)
while a term which appears in all the documents, it will have a IDF =
log(1) = 0.

For example a term like "the" will have a high term frequency in any
document, but at the same time it will have a lower inverse document
frequency (almost close to 0), which will nullify it's effect on the
quality of search results.

The final weight now is calculated as weight = tf * idf

3. Pre-computing The Term-Weights: The improvement in the ranking
algorithm came at a cost. The performance of apropos had degraded
considerably. To fix this, we decided to pre-compute all the
term-weights beforehand while indexing and store them in the database.
So apropos will only have to lookup the database to compute the rank
of the documents.

After implementing the pre-computation and fixing some bugs, there was
still a big problem. This pre-computation was causing makemandb to
take around 3 hours to complete the whole operation. Joerg suggested
to do all the calculations in a single SQL query, this will allow
Sqlite to optimize things better. And he was right, makemandb now
takes under 3 minutes to index the man pages and compute the
term-weights. (Of course it depends on the number of man pages you
have installed).

4. Further Improvement In The Ranking Algorithm: I came across a
seminal paper by Salton and Buckley [1] on analysis of term-weighting
schemes. Their study showed the most effective formula to calculate
term-weight. I cannot show the formula here as it requires special
symbols like sigma, please refer the blog post or this bug report:
https://github.com/abhinav-upadhyay/apropos_replacement/issues/21 .

After implementing the above term-weighting scheme, I noticed further
improvement in the search results.

See some sample runs: http://pastebin.com/PjdNY68m

5. Keyword Recognizer: I wanted to implement this feature, which would
scan the user query for a section specific keyword, and if it finds
then it will show results only from that particular section to which
the keyword belongs.

For example: apropos "function to copy strings", this query suggests
that the user is probably looking for a standard library function from
section 3.

After some discussion with David, we came to the conclusion that it is
probably better to implement the way Google does for site specific
searches.

So we agreed upon something like following query format:

apropos "kernel: allocating memory" . The string "kernel: here
suggests that the user is looking for results only from section 9
(kernel here is the section specific keyword for section 9).
I still need to work on it (some previous tries failed).

6. Status at Midterm: As I had promised in my proposal, I think I have
accomplished most of the tasks. Now some community feedback would be
very welcome :-)

Please try the current version out and tell me how good/bad the search
results are. If you encounter any queries for which the right results
are not showing up at the top, then please report them to me along
with the query and what results you expected to see.
Also if you have some ideas/suggestions in your mind then they are
also welcome :-)

7. Upcoming planned features: I have planned to implement a Link
analysis algorithm to improve the ranking, specifically something like
PageRank. But I am concerned about the patent that Stanford has over
PageRank process.
Another feature I have in mind is to implement a spell checker (if it
can be done in reasonable time and if it's worth it). The use case I
have in mind is to give out some suggestions to the user in case he
does a type and get's no results ;-)


Thanks for your interest in the project.


References:
[1]: 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.9086&rep=rep1&type=pdf

--
Abhinav

Prev by Date: Re: bumping library major on release branch?
Next by Date: mksh
Previous by Thread: ld.elf_so: support ELF symbol versioning (PR/45015)
Next by Thread: mksh
Indexes:

Home | Main Index | Thread Index | Old Index