[GSoC] [Apropos Replacement] Project Update

To: tech-userlevel%netbsd.org@localhost
Subject: [GSoC] [Apropos Replacement] Project Update
From: Abhinav Upadhyay <er.abhinav.upadhyay%gmail.com@localhost>
Date: Mon, 1 Aug 2011 03:57:46 +0530

Hello NetBSD!!

First of all, I would like to thank Joerg, David Young and everyone
involved with GSoC, as I passed my midterm evaluations. I have been
quiet for the last 3 weeks because, with the growing size of the
project, it is taking more and more time to make visible changes and
to test and fix problems. Besides, a lot of time is spent
experimenting to find out what works and what doesn't.

However, I did make some improvements to the project in this time.
As always, there is a detailed post on my blog:
http://abhinav-upadhyay.blogspot.com/2011/07/netbsd-gsoc-project-update-5.html
A brief overview follows:

1. New Feature: Option to apropos to search within specific sections:

Now you can do something like this:
$ apropos -1 "copy strings"     to search only in section 1
or
$ apropos -18 "add new user"    to search in sections 1 and 8 only.
Some sample outputs:
    http://paste2.org/p/1554491
    http://paste2.org/p/1554510
    http://paste2.org/p/1554509
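
To illustrate how a flag such as -18 can map onto a section filter,
here is a minimal sketch in Python. It is not the code in apropos; the
function names and the "section" column name are assumptions made up
for this example.

```python
def parse_sections(flag):
    """Turn a flag such as "-18" into the list of sections ["1", "8"]."""
    digits = flag.lstrip("-")
    if not digits or any(ch not in "123456789" for ch in digits):
        raise ValueError(f"not a section flag: {flag!r}")
    # Preserve the order given by the user and drop repeats such as "-11".
    return list(dict.fromkeys(digits))


def section_filter(sections):
    """Build a SQL fragment plus parameters limiting results to sections.

    The column name "section" is hypothetical.
    """
    placeholders = ", ".join("?" for _ in sections)
    return f"section IN ({placeholders})", sections
```

The fragment returned by section_filter() could then be appended to the
WHERE clause of the search query.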

2. Indexing speed improvements: The speed of makemandb has improved
considerably. Bringing the complete indexing operation under one
transaction did the trick. Thanks to Joerg for the idea.
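
The idea of wrapping the whole indexing run in one transaction can be
sketched with Python's sqlite3 module. This is only an illustration,
not the makemandb code (which is written in C); the table layout and
the function name are assumptions.

```python
import sqlite3


def index_pages(conn, pages):
    """Insert every (name, section, body) row inside one transaction."""
    # "with conn" commits once on success and rolls back on an exception,
    # so the cost of a commit is paid a single time instead of per row.
    with conn:
        conn.executemany(
            "INSERT INTO pages(name, section, body) VALUES (?, ?, ?)",
            pages,
        )


conn = sqlite3.connect(":memory:")
# Hypothetical schema, much simpler than the real database.
conn.execute("CREATE TABLE pages(name TEXT, section TEXT, body TEXT)")
index_pages(conn, [("cp", "1", "copy files"),
                   ("strcpy", "3", "copy strings")])
```

Without the enclosing transaction, every INSERT would be committed on
its own, which is typically much slower for an on-disk database.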

3. Parsing man(7) pages: With mdoc(7) pages being indexed properly, it
was time to add support for man(7) as well. I implemented this; it was
trickier and more challenging than mdoc(7), but I managed it. The only
problem at the moment is that escape sequences are also being indexed
as they are. Hopefully, with a newer version of mdocml, I should be
able to avoid the escape sequences.

4. Regression: Large Database Size: After adding support for indexing
man(7) pages, I could index all the man pages on my system. I have
around 7600 pages on my system, and indexing all of them grew the
database from 23M (for ~3000 pages) to 99M (for 7600 pages).
     Cause: The main cause of the large database size was the
term-weights table. This table held an index of all the unique terms
in the corpus and their weights in each document in which they
occurred. It was there to improve the speed of search and to allow
some more advanced ranking algorithms.

    Solution: As a quick solution, I simply removed the code related to
pre-computation of weights and dropped this table. I probably took
this decision in haste. But with compression and a custom tokenizer
(items 5 and 6 below) I was able to bring the size down to the 30-40M
range.

    Alternative: There is a trade-off between space usage and the
quality and speed of search. If some compromise on space can be made,
it allows us to experiment more freely with advanced search
techniques and improve the search experience. Joerg also suggested on
IRC that this could be made conditional, so that the user can choose
whether to build the database with advanced search, which requires
additional space, or to build the compact database.

5. Added Compression Option To The Database: I implemented
compression/decompression functions using zlib(3) and integrated them
with Sqlite to bring down the DB size.
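
As a rough illustration of hooking zlib into Sqlite, the sketch below
registers a pair of user-defined SQL functions from Python. It is not
the project's C code; the SQL function names zip/unzip and the table
layout are assumptions chosen for the example.

```python
import sqlite3
import zlib


def zip_text(text):
    """Compress a string into a zlib blob."""
    return zlib.compress(text.encode("utf-8"))


def unzip_text(blob):
    """Restore the original string from a zlib blob."""
    return zlib.decompress(blob).decode("utf-8")


conn = sqlite3.connect(":memory:")
# Register the helpers as SQL functions (the names are hypothetical).
conn.create_function("zip", 1, zip_text)
conn.create_function("unzip", 1, unzip_text)
conn.execute("CREATE TABLE pages(name TEXT, body BLOB)")

body = "copy strings " * 200  # repetitive text compresses well
conn.execute("INSERT INTO pages VALUES (?, zip(?))", ("strcpy", body))
stored, restored = conn.execute(
    "SELECT length(body), unzip(body) FROM pages"
).fetchone()
```

Here "stored" is the size of the compressed blob in bytes and
"restored" is the text after decompression.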

6. Stopword Tokenizer: I also patched the Porter tokenizer from the
Sqlite source to filter out stopwords. This removes the noise that
stopwords add to the index, since they carry no value for search, and
also reduces the DB size.
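
The filtering step on its own can be sketched in a few lines of
Python. This is only an illustration: it does no Porter stemming, it
is not the patched Sqlite tokenizer, and the stopword list is a small
made-up sample.

```python
import re

# Illustrative sample only; a real list would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Split text into lowercase terms and drop the stopwords."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return [word for word in words if word not in STOPWORDS]
```

For example, tokenize("Copy the strings to a buffer") keeps only
"copy", "strings" and "buffer", so the dropped words never reach the
index.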

7. Parsing data section-wise and storing it in separate columns: With
7600 pages in the database and almost all of the content in one
column, there was a lot of noise in the search results and the
accuracy of the search was off the mark, so it was necessary to
decompose the different sections into their own columns in the
database. At present I have the following columns in the DB for the
different sections:

name, name_desc for the NAME section
desc for the DESCRIPTION section
lib for the LIBRARY section
synopsis for the SYNOPSIS section
return_vals for the RETURN VALUES section
exit_status for the EXIT STATUS section
env for the ENVIRONMENT section
files for the FILES section
diagnostics for the DIAGNOSTICS section
errors for the ERRORS section

It took some time to get this code right, but now it should only be a
matter of a couple of lines of code to add or remove support for a
section. It also makes the search more parametric, meaning one can
play around with the weights for these different columns and easily
adjust the ranking function to a certain extent.
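
To show what playing with per-column weights could look like, here is
a small Python sketch. It is not the ranking function used by apropos:
the weight values are invented, and the score is a plain weighted
count of matching words.

```python
# Hypothetical weights: a hit in NAME counts more than one in DESCRIPTION.
WEIGHTS = {"name": 2.0, "name_desc": 1.5, "desc": 1.0}


def score(page, query_terms, weights=WEIGHTS):
    """Sum the query-term hits in each column, scaled by the column weight."""
    total = 0.0
    for column, weight in weights.items():
        words = page.get(column, "").lower().split()
        hits = sum(words.count(term) for term in query_terms)
        total += weight * hits
    return total


page = {
    "name": "strcpy",
    "name_desc": "copy strings",
    "desc": "The strcpy function copies strings",
}
result = score(page, ["copy", "strings"])
```

Changing a value in WEIGHTS is enough to shift how much a match in
that column contributes to the final score.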

New Feature Proposal: With the data organized section-wise, it is now
possible to display some additional data alongside the search
results. For example, just today I added code to display the one-line
description of the page alongside the search results. Similarly, it
would be possible to display more data, like the library (for a
section 2 or 3 result) or the exit status (for section 1 or 8
results), and so on.

Or it could be added as an option to apropos to show additional
information. For example, something like this:

$ apropos -r strcmp

This would display the Return Values for strcmp(3). It would of
course work only if the man page of strcmp had a RETURN VALUES
section.
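
A lookup of this kind could be a single query against the return_vals
column. The Python sketch below only illustrates the proposal; the
table name "pages" and the function name are assumptions, and no such
option exists yet.

```python
import sqlite3


def return_values(conn, name):
    """Return the RETURN VALUES text for a page, or None if it has none."""
    row = conn.execute(
        "SELECT return_vals FROM pages WHERE name = ?", (name,)
    ).fetchone()
    if row is None or not row[0]:
        return None
    return row[0]


conn = sqlite3.connect(":memory:")
# Hypothetical table holding only the two columns needed here.
conn.execute("CREATE TABLE pages(name TEXT, return_vals TEXT)")
conn.execute(
    "INSERT INTO pages VALUES (?, ?)",
    ("strcmp", "An integer greater than, equal to, or less than 0."),
)
conn.execute("INSERT INTO pages VALUES (?, ?)", ("true", None))
```

A page without a RETURN VALUES section simply yields None, which the
caller can report to the user.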

What are your views on this?

Besides this, I still have some more things to implement from my TODO
list. One of them is managing the man page aliases using the
database. If nothing more important comes up, then I will pick up one
of these tasks.

Hoping for your feedback.

Thanks
Abhinav
