tech-userlevel archive


Re: Support for boolean queries in apropos



On Mon, Mar 12, 2012 at 2:18 AM, David Young <dyoung%pobox.com@localhost> wrote:
> I think that general boolean queries are known now to be less powerful
> and useful than everyone supposed that they were 20 or 30 years ago, so
> search engines either omit support for boolean operators or else do not
> encourage the use of such operators.  Instead, usually you can specify
> terms that MUST appear in the results, and terms that MUST NOT appear,
> using either a simplified set of operators (+ and -) or an "Advanced"
> search form that has fields for the MUST/MUST-NOT terms.  Is there any
> reason to believe that boolean operators are more suitable for apropos
> than for other search engines?

I agree with what Mouse said. I have not researched why Web search
engines do not encourage the use of Boolean operators, but I suspect
they do not need them because their results are already so accurate.
With apropos, where search is still in its infancy, such operators are
useful. As a concrete example from my own experience, a few packages
such as Git, OpenSSL, and Perl ship so many man pages that they can
end up polluting the search results. When the user finds the results
cluttered by man pages from these packages, a Boolean query is a handy
way to keep those pages out. In the example from my original email, a
simple query for "add new user" was cluttered by a few results from
git and OpenSSH that I was not interested in, and applying a Boolean
NOT operator to "git" and "ssh" eliminated the man pages from those
packages.
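
To illustrate the idea, here is a toy Python sketch of a query with
excluded terms. The page names and descriptions are made up, the
search() helper is hypothetical, and it assumes a page matches when it
contains any of the query terms; it is not how apropos parses or
evaluates queries.

```python
# Toy illustration only: hypothetical page names and descriptions,
# not the real apropos index or its query parser.
PAGES = {
    "useradd(8)": "add a new user to the system",
    "adduser(8)": "add a new user interactively",
    "git-add(1)": "add new file contents to the git index",
    "ssh-add(1)": "add a new private key identity to the ssh agent",
}


def search(pages, terms, must_not=()):
    """Return pages that contain any query term and no excluded term."""
    hits = []
    for name, description in pages.items():
        words = set(description.lower().split())
        matches = any(term in words for term in terms)
        excluded = any(term in words for term in must_not)
        if matches and not excluded:
            hits.append(name)
    return sorted(hits)


# Without exclusions, the git and ssh pages clutter the results.
plain = search(PAGES, ["add", "new", "user"])

# Equivalent in spirit to: add new user NOT git NOT ssh
filtered = search(PAGES, ["add", "new", "user"], must_not=["git", "ssh"])
```

Here plain contains all four pages, while filtered keeps only
adduser(8) and useradd(8).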

Apart from that, having these capabilities available to the user wouldn't hurt.

> ISTR you wrote the other day that terms are not weighted and the search
> results are not ranked?  It seems to me that correcting that may improve
> the effectiveness of apropos more than any other measure.

Not really. The search results are ranked and the query terms are
weighted, but all of the weight computation is done on the fly, each
time a query is executed, for every matched result. If a query has
100 matches, apropos has to compute a weight for each of those 100
results and then rank them in decreasing order of weight. This means
we cannot use a sophisticated weighting (or ranking) scheme that
requires expensive computation without slowing apropos down, so the
current ranking scheme is relatively simple.
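
A minimal Python sketch of ranking with weights computed at query
time. The corpus is made up, and the tf-idf style formula is only a
stand-in chosen for illustration; it is not the scheme apropos
actually uses.

```python
import math

# Hypothetical mini-corpus standing in for the man page index.
DOCS = {
    "useradd(8)": "add a new user account to the system",
    "passwd(1)": "change the password of a user",
    "git-add(1)": "add file contents to the index",
    "ls(1)": "list directory contents",
}


def rank(docs, query):
    """Weight every matching document at query time, then sort by weight."""
    terms = query.lower().split()
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    scored = []
    for name, words in tokenized.items():
        score = 0.0
        for term in terms:
            tf = words.count(term)
            if tf == 0:
                continue
            # Document frequency is recomputed for every query: this is
            # the per-query cost that grows with the number of matches.
            df = sum(1 for other in tokenized.values() if term in other)
            score += tf * math.log(1 + n_docs / df)
        if score > 0:
            scored.append((score, name))
    # Decreasing order of weight; the name breaks ties deterministically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


results = rank(DOCS, "add user")
```

All of the arithmetic happens inside rank(), so a costlier formula
makes every single query slower.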

What I meant to say in that thread was that if the weights of all the
terms in the corpus were pre-computed and stored in a database table,
a more sophisticated ranking scheme or algorithm could be used when
ranking the search results. But storing the pre-computed weights in
the database requires extra storage, which I don't think many people
would welcome, so this approach is being avoided for now. It would be
more prudent to get support for storing term weights implemented in
the FTS index in SQLite itself, as I think that would save a lot of
disk space by avoiding duplication of data. I am not sure whether
that is possible or whether the SQLite developers would welcome it; I
have not talked to them.
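
A rough Python sketch of the pre-computed alternative, using an
ordinary SQLite table. The term_weight table, its columns, the sample
corpus, and the tf-idf style weight are all assumptions made for
illustration; nothing here reflects the actual apropos schema.

```python
import math
import sqlite3

# Hypothetical mini-corpus standing in for the man page index.
DOCS = {
    "useradd(8)": "add a new user account to the system",
    "passwd(1)": "change the password of a user",
    "git-add(1)": "add file contents to the index",
    "ls(1)": "list directory contents",
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE term_weight ("
    "term TEXT, page TEXT, weight REAL, PRIMARY KEY (term, page))"
)


def build_weights(conn, docs):
    """Index-time step: store one weight per (term, page) pair."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    df = {}
    for words in tokenized.values():
        for term in set(words):
            df[term] = df.get(term, 0) + 1
    rows = []
    for name, words in tokenized.items():
        for term in set(words):
            weight = words.count(term) * math.log(1 + n_docs / df[term])
            rows.append((term, name, weight))
    conn.executemany("INSERT INTO term_weight VALUES (?, ?, ?)", rows)
    conn.commit()


def rank(conn, query):
    """Query-time step: only look up and sum the stored weights."""
    terms = query.lower().split()
    placeholders = ",".join("?" for _ in terms)
    sql = (
        "SELECT page, SUM(weight) AS score FROM term_weight "
        f"WHERE term IN ({placeholders}) "
        "GROUP BY page ORDER BY score DESC, page ASC"
    )
    return [page for page, _ in conn.execute(sql, terms)]


build_weights(conn, DOCS)
results = rank(conn, "add user")

# One stored row per distinct (term, page) pair: this is the extra
# storage, on top of the text that the index already holds.
n_rows = conn.execute("SELECT COUNT(*) FROM term_weight").fetchone()[0]
```

The query itself becomes a lookup and a sum, but n_rows grows with
the number of distinct terms on every page, which is the storage
cost described above.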

--
Abhinav

