tech-userlevel archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

Re: Support for boolean queries in apropos



On Fri, Mar 16, 2012 at 7:55 PM, David Young <dyoung%pobox.com@localhost> wrote:
> On Wed, Mar 14, 2012 at 02:24:42AM +0530, Abhinav Upadhyay wrote:
>> On Tue, Mar 13, 2012 at 5:40 AM, David Young <dyoung%pobox.com@localhost> 
>> wrote:
>> > Just how much extra storage are we talking about?  10 MB? 100 MB? 1 GB?
>>
>> When the last time this was implemented in apropos, it almost doubled
>> the database size. For example, for 7000 man pages the normal database
>> size was around 45 MB but with pre-computed term weights being stored
>> in the database itself, the size climbed up to ~90 MB or so. However,
>> IIRC at that point of time, the compression of the FTS index and
>> optimization of the database to save disk space were not implemented,
>> that might have saved some disk space as well.
>
> A few thoughts:
>
> 1) That's not that much storage.
>
> 2) By how much does spending an extra 45 MB improve the search
>   effectiveness?

This is an area where I have been struggling a bit. It is a usual
practice to compute some numerical score to evaluate the performance
or effectiveness of algorithms. In case of information retrieval
systems this score is computed in terms of precision and recall
values. Quoting from Wikipedia: "precision is the fraction of
retrieved instances that are relevant, while recall is the fraction of
relevant instances that are retrieved."

The way precision and recall values are computed depends on the
knowledge of some stats of the training corpus, namely we need to know
for a given query, how many relevant documents are there in the
corpus. I do not have any such training data upon which I can base my
findings. I usually run a set of queries and compare the results
manually to see the how effective any ranking scheme is.

In case of the said ranking scheme in which pre-computed weights were
being stored on the disk, these were some sample queries:
http://pastebin.com/PjdNY68m

Note that, at that point of time, the FTS table had only three
columns: name, name_desc and desc. The desc column stored bulk of the
man page content while name, name_desc stored the NAME section. Even
then the results were this good. After this ranking scheme was
rejected and the current one was being implemented, the database
schema was changed to store different man page sections in separate
columns and each column was assigned its own relevance weight. I think
combining this decomposed schema with the pre-computed term-weights
ranking scheme *might* have had better results (it is just a
speculation, no stats to back anything up).

> 3) People who don't have that kind of storage to spare can just disable
>   FTS, no?

Yes, use the -l option.

> 4) If SQLite is not space-efficient, maybe it is not the right tool for
>   the job.

For man page search I think it does the job. Although if there was an
easy way to extend the FTS module to store the pre-computed term
weights in the inverted index itself, it would indeed save a lot of
disk space since the pre-computed weights stored in a separate table
really reflect the same structure.

On the other hand, a new feature is under development and might appear
in future releases of Sqlite. Sqlite actually stores the content of
the indexed documents in a database table, the new feature will allow
for building of contentless FTS tables, this means that Sqlite will
only maintain a full text index without storing any actual content.
This feature will save significant disk space. Read more about it
here: 
http://www.sqlite.org/src/artifact/fdc666a70d5257a64fee209f97cf89e0e6e32b51

--
Abhinav


Home | Main Index | Thread Index | Old Index