tech-userlevel archive


Re: List of Keywords for apropos(1) Which Should Not be Stemmed



On Tue, Jul 12, 2016 at 5:47 AM, Brett Lymn <blymn%internode.on.net@localhost> wrote:
> On Mon, Jul 11, 2016 at 08:59:05PM +0530, Abhinav Upadhyay wrote:
>>
>> Thanks, that would be a good starting point too. I guess we will still
>> have to add a few words to the list manually later, but it should be
>> good to begin with.
>>
>
> How about checking the length of the word - technical abbreviations tend
> to be short (<= 4 characters predominantly).  According to grep there
> are 155 two letter words, 1358 three letter words and 5124 four letter
> words (assuming my driving of grep is correct) in /usr/share/dict/words.
> So it could be feasible to hash just the short words in the dictionary
> and then stem if you find a match otherwise assume it is a technical
> abbreviation and don't stem.
>

Yes, but there are other keywords which are probably not
abbreviations and are longer than three or four letters. For example,
drmkms, usbdevs, scan_ffs etc :)
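
To make the trade-off concrete, here is a rough sketch of the length
heuristic combined with a manual exception list. It is only an
illustration: the function names, the sample word sets, and the default
cutoff of 4 are assumptions for the example, not code from apropos(1).

```python
def load_short_words(path="/usr/share/dict/words", max_len=4):
    """Collect dictionary words of at most max_len letters into a set.

    The path and cutoff are assumptions taken from the discussion above.
    """
    with open(path, encoding="utf-8", errors="replace") as f:
        words = (line.strip().lower() for line in f)
        return {w for w in words if 0 < len(w) <= max_len}


def should_stem(word, short_words, no_stem=frozenset(), max_len=4):
    """Decide whether a keyword should be passed to the stemmer.

    - Words on the manual exception list are never stemmed.
    - Short words are stemmed only if they are ordinary dictionary words;
      otherwise they are assumed to be technical abbreviations.
    - Longer words are always stemmed unless listed in no_stem.
    """
    word = word.lower()
    if word in no_stem:
        return False
    if len(word) <= max_len:
        return word in short_words
    return True


# Hypothetical sample data, standing in for the real dictionary.
short_words = {"disk", "file", "scan"}
no_stem = {"drmkms", "usbdevs", "scan_ffs"}

for kw in ("disk", "wscn", "usbdevs"):
    print(kw, should_stem(kw, short_words), should_stem(kw, short_words, no_stem))
```

With the length check alone, a long keyword such as usbdevs is still
handed to the stemmer, which is why the manual list is needed in
addition to the dictionary lookup.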

-
Abhinav

