Re: global full-text search for NetBSD
On Tue, Mar 6, 2012 at 12:18 AM, David Young <dyoung%pobox.com@localhost> wrote:
> On Mon, Mar 05, 2012 at 06:41:48PM +0000, David Holland wrote:
>> On Tue, Feb 28, 2012 at 10:22:37AM -0600, David Young wrote:
>> > Fast-forward to 2012: there is lots of prior art in this area, and we
>> > don't have to repeat the mistakes.
>> Fast-forward to 2012: there is lots of prior art in this area, so we
>> don't have to roll our own new implementation.
> Yep. The apropos project is a good start.
With SQLite in the base system, I think we already have a nice platform
to develop such a global search tool without requiring too much low
Man pages have the advantage of a highly structured format for their
content. This structure can be reflected in the layout of a database
table, and a suitable ranking function can then take the individual
columns into account.
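
To make the column-aware ranking idea concrete, here is a minimal Python sketch. It is not how apropos implements ranking: the table layout (name, name_desc, description), the weights, and the helper names make_db and search_pages are all assumptions made up for illustration, and the scoring is a plain weighted term count registered as a user-defined SQL function.

```python
import re
import sqlite3

# Hypothetical weights: a hit in the page name counts more than one in the body.
COLUMN_WEIGHTS = (2.0, 1.5, 1.0)  # name, name_desc, description


def rank(query, *columns):
    """Weighted count of query-term occurrences across the given columns."""
    terms = re.findall(r"\w+", query.lower())
    score = 0.0
    for weight, text in zip(COLUMN_WEIGHTS, columns):
        words = re.findall(r"\w+", (text or "").lower())
        score += weight * sum(words.count(t) for t in terms)
    return score


def make_db():
    """Build a tiny in-memory table with one column per man page section."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("rank", 4, rank)  # expose rank() to SQL
    conn.execute(
        "CREATE TABLE pages (name TEXT, name_desc TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?, ?)",
        [
            ("ls", "list directory contents",
             "For each operand that names a file, ls displays its name."),
            ("find", "walk a file hierarchy",
             "find recursively descends the directory tree for each path."),
            ("cat", "concatenate and print files",
             "cat reads files sequentially and writes them to standard output."),
        ],
    )
    return conn


def search_pages(conn, query):
    """Return (name, score) pairs, best match first, dropping zero scores."""
    return conn.execute(
        "SELECT name, score FROM ("
        "  SELECT name, rank(?, name, name_desc, description) AS score"
        "  FROM pages"
        ") WHERE score > 0 ORDER BY score DESC",
        (query,),
    ).fetchall()
```

With this sample data, a query for "directory" puts ls ahead of find, because the match in the short description column is weighted higher than the match in the long description.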
In the case of a general-purpose search tool, where nothing can be
assumed about the structure of the document, I think the bulk of the
data will have to be stored in one or two columns. I believe that for
such a schema, the best way to rank the search results is to use
term weights: store precomputed term weights for the documents in the
database itself and use them to rank the documents. This scheme was
used for a while during the apropos project and gave quite good
results, but storing the term weights in the database required
additional disk space. (I would like to use this ranking scheme in
apropos, if it didn't require so much storage.)
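
A rough sketch of what storing precomputed term weights could look like follows. The weighting here is ordinary tf-idf, which is an assumption; the message does not say which formula apropos used. The table name term_weights and the helpers build_index and search_terms are likewise hypothetical.

```python
import math
import re
import sqlite3
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Precompute a tf-idf weight per (document, term) and store it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE term_weights ("
        "  doc TEXT, term TEXT, weight REAL, PRIMARY KEY (doc, term))"
    )
    counts = {doc: Counter(tokenize(text)) for doc, text in docs.items()}
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts.values() for term in c)
    n = len(docs)
    rows = []
    for doc, c in counts.items():
        total = sum(c.values())
        for term, k in c.items():
            weight = (k / total) * math.log(n / df[term])
            rows.append((doc, term, weight))
    conn.executemany("INSERT INTO term_weights VALUES (?, ?, ?)", rows)
    return conn


def search_terms(conn, query):
    """Rank documents by the sum of stored weights of the query terms."""
    terms = tokenize(query)
    if not terms:
        return []
    marks = ",".join("?" * len(terms))
    return conn.execute(
        "SELECT doc, SUM(weight) AS score FROM term_weights "
        f"WHERE term IN ({marks}) "
        "GROUP BY doc HAVING score > 0 ORDER BY score DESC",
        terms,
    ).fetchall()
```

Ranking at query time is then a cheap lookup and sum. The cost is that the table holds one row per distinct term per document, which is where the extra disk space goes.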
It should be possible to build a modular indexing tool for plain
text files, HTML and other similarly simple formats. For more
complex file formats like PDF, I think external libraries will be
required to parse the documents; support for such document
formats may be provided in the form of plugin modules available in