Re: bin/48684: spell(1) is lacking

To: David Holland <dholland-tech%netbsd.org@localhost>
Subject: Re: bin/48684: spell(1) is lacking
From: Abhinav Upadhyay <er.abhinav.upadhyay%gmail.com@localhost>
Date: Sat, 10 May 2014 23:14:37 +0530

Thanks Christos and David for the encouragement. Sorry to have slept
on this for so long; I was reading some prior work and literature in
this area and then got busy at $DAYJOB.

It seems that edit distance based implementations are quite effective
as well as simple to implement. All we need is a dictionary to look
words up in, which I believe the existing implementation also uses.
This technique is effective for catching the obvious typos, such as
teh (the), temprary (temporary) etc., which are not real words present
in the dictionary.
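
To make the idea concrete, here is a minimal sketch of an edit distance
based lookup. It is only an illustration under stated assumptions, not
the existing spell(1) code or a proposed design: the function names
(edit_distance, suggest), the max_distance cutoff and the tiny word set
are all hypothetical, and it uses plain Levenshtein distance (insert,
delete, substitute), so a transposition such as teh/the costs 2.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: inserts, deletes and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance of word, closest first.

    A word that is already in the dictionary is treated as correct,
    so nothing is suggested for it.
    """
    if word in dictionary:
        return []
    scored = sorted((edit_distance(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance]


# Hypothetical toy dictionary, for illustration only.
words = {"the", "temporary", "ten"}
print(suggest("teh", words))
print(suggest("temprary", words))
```

Scanning the whole dictionary for every unknown word is the simplest
possible approach; a real implementation would probably want to
generate candidate edits or index the dictionary instead.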

However, for spelling errors that do result in words present in the
dictionary, it is not very effective. For these types of errors, an
n-gram based model is usually used in practice. But we will need a
large text corpus in order to build such a model. The n-gram model
requires counting how frequently sequences of n words occur together
in the corpus and using those counts to estimate the probability that
a word in a sentence is misspelled. I am wondering whether it will be
possible to import such a corpus into CVS?
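
As a rough sketch of what the counting step could look like for n = 2
(bigrams): the code below counts adjacent word pairs in a corpus and
turns them into a smoothed conditional probability. Everything here is
an assumption made for illustration: the function names, the add-one
smoothing and the one-sentence corpus are hypothetical, and a usable
model would need a far larger corpus.

```python
from collections import Counter


def train_bigrams(tokens):
    """Count single words and adjacent word pairs in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one (Laplace) smoothing.

    Smoothing keeps unseen pairs from getting probability zero.
    """
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


# Hypothetical toy corpus, for illustration only.
corpus = "i went to the store and then i went to the park".split()
unigrams, bigrams = train_bigrams(corpus)

# "then" is a real word, so a dictionary lookup accepts "went to then park",
# but the bigram counts rate "to the" as more likely than "to then".
p_the = bigram_prob("to", "the", unigrams, bigrams)
p_then = bigram_prob("to", "then", unigrams, bigrams)
print(p_the > p_then)
```

A checker built on this would flag a word when some close alternative
(for example, one within a small edit distance) is much more probable
in the same context.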

If we want spell(1) to be able to smartly detect the non-obvious
errors, then we need this corpus. Otherwise we can simply go with the
edit distance based implementation alone. I think I will start with
edit distance and, if time permits, also try out the n-gram based
model. Then both will be out there for us to compare.

On Fri, Mar 28, 2014 at 7:30 AM, David Holland 
<dholland-tech%netbsd.org@localhost> wrote:
> On Thu, Mar 27, 2014 at 06:28:20PM +0530, Abhinav Upadhyay wrote:
>  > > I dunno. My inclination is towards cvs rm -- there are perfectly good
>  > > third-party spellcheckers at this point, natural language processing
>  > > is not exactly core OS functionality or the project's core competency,
>  > > and I don't think there's any need to maintain our own program given
>  > > that it doesn't work very well.
>  >
>  > I would like to take this up as a project. I have played with the idea
>  > of a spell checker for the apropos(1) project (even implemented one,
>  > although quite a naive implementation). I would like to take this
>  > opportunity to come up with a more sophisticated implementation of the
>  > spell checker and implementing it in the form of a library so that
>  > other utilities (like apropos(1)) may also benefit from it.
>
> go for it -- what I said was based on the assumption that nobody was
> really interested in this.
>
> --
> David A. Holland
> dholland%netbsd.org@localhost

