tech-userlevel archive


Re: Support for boolean queries in apropos



On Tue, Mar 13, 2012 at 5:40 AM, David Young <dyoung%pobox.com@localhost> wrote:
> On Mon, Mar 12, 2012 at 10:41:29PM +0530, Abhinav Upadhyay wrote:
>> On Mon, Mar 12, 2012 at 2:18 AM, David Young <dyoung%pobox.com@localhost> wrote:
>> > I think that general boolean queries are known now to be less powerful
>> > and useful than everyone supposed that they were 20 or 30 years ago, so
>> > search engines either omit support for boolean operators or else do not
>> > encourage the use of such operators.  Instead, usually you can specify
>> > terms that MUST appear in the results, and terms that MUST NOT appear,
>> > using either a simplified set of operators (+ and -) or an "Advanced"
>> > search form that has fields for the MUST/MUST-NOT terms.  Is there any
>> > reason to believe that boolean operators are more suitable for apropos
>> > than for other search engines?
>>
>> I agree with what Mouse said. I have not done much research on why Web
>> search engines do not advocate the use of Boolean operators, but I
>> don't think they really need them when their results are so accurate.
>> With apropos, where the search is still in its infancy, they
>> are useful. As a concrete example from my personal experience,
>> a few packages such as Git, OpenSSL, and Perl
>> come with so many man pages that they can start polluting
>> the search results. If the user finds that the results
>> are being cluttered by man pages from these packages, it
>> is handy to use a Boolean query to exclude those man
>> pages from the search results. In the example I posted in
>> my original email, a simple query "add new user" was
>> cluttered by a few results from git and OpenSSH that I wasn't
>> concerned with, so a boolean NOT operator for "git" and "ssh"
>> eliminated the man pages from those packages.
>
> Is a boolean query really so handy as the minus operator?  It isn't as
> succinct.

I apologize, but I lost the context a bit here. What do you mean by the
minus operator? If you mean representing the boolean NOT operator
with a '-', then I think it makes sense: it is easy to type and
succinct as well.

>> Apart from that, having these capabilities available to the user
>> wouldn't hurt.
>
> You may have thought that no one could disagree with that, but I
> do. :-) Any software capability comes with a cost to maintain and
> support it.  If boolean queries are available, users may spend a lot of
> time writing them when a query written with +/- operators would be more
> succinct (thus faster to type) and accurate (that is, resembling their
> intentions better).

It is possible to represent the AND, OR, and NOT boolean operators with
handy symbols such as +, |, and - for ease of typing and succinct
representation, so that shouldn't be a worry. The maintenance cost is
also minimal in this case, as the support for boolean queries is
provided by SQLite, and apropos simply needs to translate the user
query to an equivalent SQL representation.
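
To give a rough idea of how small that translation step could be, here
is a minimal sketch in Python. It is not the apropos code. The function
name translate_query is hypothetical, and the sketch assumes an FTS
build whose MATCH expressions accept the AND, OR, and NOT keywords and
parentheses. A leading '+' marks a required term (also the default),
a standalone '|' means OR, and a leading '-' excludes a term.

```python
def translate_query(query):
    """Translate a '+', '|', '-' style query into an AND/OR/NOT expression.

    Hypothetical helper for illustration; the output grammar is an
    assumption about what the full-text engine accepts.
    """
    positive = []        # alternating terms and operators, e.g. a AND b OR c
    excluded = []        # terms that must not appear
    pending_or = False   # set when a standalone '|' was just seen

    for token in query.split():
        if token == "|":
            pending_or = True
            continue
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
            continue
        term = token.lstrip("+")
        if not term:
            continue
        if positive:
            positive.append("OR" if pending_or else "AND")
        positive.append(term)
        pending_or = False

    # NOT is treated as a binary operator here, so something must precede it.
    if not positive:
        raise ValueError("query needs at least one non-negated term")

    expr = " ".join(positive)
    if excluded:
        if len(positive) > 1:
            # Parenthesize so NOT applies to the whole positive part.
            expr = "(" + expr + ")"
        expr += "".join(" NOT " + term for term in excluded)
    return expr


if __name__ == "__main__":
    print(translate_query("add new user -git -ssh"))
```

The resulting string would then be bound as the parameter of a MATCH
clause in the SQL statement. Terms are passed through unescaped in this
sketch, so a real implementation would also have to deal with quoting.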

>> What I meant to say in that thread was that, if term weights for all
>> the terms in the corpus are pre-computed and stored in a database
>> table, then while ranking the search results it is possible to use a
>> more sophisticated ranking scheme or algorithm. But storing the
>> pre-computed weights in the database requires extra storage, which I
>> don't think would be welcomed by many people, so currently this
>> approach is being avoided. It would be more prudent to get support for
>> storage of term weights in the FTS index implemented in SQLite itself,
>> as I think it would save a lot of disk space by avoiding duplication
>> of data (but I am not sure whether that is possible or would be
>> welcomed by the SQLite developers; I haven't talked to them).
>
> Just how much extra storage are we talking about?  10 MB? 100 MB? 1 GB?

The last time this was implemented in apropos, it almost doubled
the database size. For example, for 7000 man pages the normal database
size was around 45 MB, but with pre-computed term weights stored
in the database itself, the size climbed to ~90 MB or so. However,
IIRC, at that point compression of the FTS index and optimization of
the database to save disk space were not yet implemented; those
might have saved some disk space as well.

--
Abhinav

