A new modern spell(1) Implementation

To: "tech-userlevel%NetBSD.org@localhost" <tech-userlevel%netbsd.org@localhost>
Subject: A new modern spell(1) Implementation
From: Abhinav Upadhyay <er.abhinav.upadhyay%gmail.com@localhost>
Date: Sun, 29 Jan 2017 01:47:18 +0530

Hi,

As you might have noticed me fixing quite a few typos in the man
pages, it was because I was testing the new spell(1) implementation I
have been working on:
https://github.com/abhinav-upadhyay/nbspell :-)

A portion of this work I did while working on apropos(1), but didn't
commit it because it was not quite in shape at that time and also, I
wanted it to be outside apropos(1). Then a couple of years back
dholland@ created a PR (http://gnats.netbsd.org/48684) documenting the
shortcomings of the existing spell(1), at which point I volunteered to
work on it but didn't get around doing it until couple of months back.

The original spell(1) used inflection rules, removing
prefixes/suffixes from the input word, till it is reduced to a root
word and then checks for that word in the dictionary. As noted in the
PR, these rules can be easily fooled and wildly hamper its accuracy.
The original reason to use the rule based technique was to keep the
size of the dictionary small, and also saving the pain of adding all
possible variations of the different words in the dictionary.

Now a days we don't need to worry about disk space, so we can spend a
few more MBs of disk space to avoid using those complex and inaccurate
rules for doing spell checking. As part of this project, I have
expanded the original Webster's dictionary by adding all possible
plural, verb and adjective forms of the words where possible. To check
if a given word is misspelled or not, is now a simple lookup in this
dictionary. The expanded dictionary is just over 5MB in size as
compared to the original Webster's dictionary of 2.4M.

But a spell checker is no good if it tells you that you misspelled a
word but doesn't tell you the correct spelling. To achieve this I have
used the levenshtein distance (or edit distance) technique to generate
the possible corrections. (for details on how this works, read my old
blog post http://abhinav-upadhyay.blogspot.in/2011/10/spell-corrector-for-apropos.html).

In order to improve the accuracy of the corrections further I also
decided to incorporate the soundex algorithm, which is a phonetics
based technique. The algorithm generates a code for every word, where
similar sounding words end up with the same code. This would allow
handling cases where people type words based on how they sound like
rather than how they are actually spelled.

After finding the list of possible corrections, it is also important
to sort them in decreasing order of probability of being the correct
spelling. In order to do this sorting, I generated a count of word
frequency from a 12GB dataset  based on the English language books
obtained from project Gutenberg. The basic premise is that a word
which is more frequent is more likely to be the correct correction.

Right now, in order to find possible corrections for a given word, I
am taking the following strategy:
    1. Try to find corrections at edit distance 1
    2. If no corrections found at distance 1, go for edit distance 2
    3. If no corrections found at distance 2, try soundex

Following are some of the tricks I am using to further up the accuracy:

- While generating the corrections at a given edit distance, if a
given correction shares the same soundex code as the misspelled word,
give it a higher weight (a word at a short edit distance from the
original word and having the same soundex is more probable to be the
correct spelling than any other alternative)

- It has been observed that it is highly unlikely for someone to
misspell the first character of a spelling, therefore heavily
penalizing a correction where the first character is changed, for
example acused is more likely a misspelling of accused than caused.

- Based on the test data, I noticed it is more common for people to
miss a character in the spelling than type an extra character,
therefore giving a higher weight to corrections involving insertion
helps.

- Also, some implementations of levenshtein distance tend to give
lower weight to corrections involving substitutions, doing that has
also been a bit helpful.

With all these things in place, I compared the performance of this
implementation with that of GNU aspell on the test data provided by
them (http://aspell.net/test/common-all/), and following are the
results in terms of accuracy:

Accuracy in finding the correct suggestion at a given position:

Total number of tests: 3945
First place: 91.33% (aspell  0.60: 74%)
in positions 1-5: 95.26% (aspell 0.60: 96.6%)
in positions 1-10: 95.59% (aspell 0.60: 98.2%)
in positions 1-25: 95.77% (aspell 0.60: 99.0%)
in positions 1-50: 95.84 (asepll 0.60: 99.2%)
in positions 1-100: 95.92% (asepll 0.60: 99.2%)

It should be noted that many of the tests in the test data were quite
unreasonable, for example ``assassination'' was expected as the
correction for ``assosication'', while my implementation suggested
``association''.

Also, the aspell test results mention that they only considered words
which were present in all the three dictionary but they didn't provide
the exact set of tests they used, so these numbers should be taken
with a grain of salt. Their results are based on 3836 test cases,
while I used the complete test data (minus the words which were not in
our dictionary).

I have also tried playing with the n-grams techniques to get more
accurate corrections taking the context of the surrounding word into
the account, but I have struggled with the API design. Right now, the
API is as simple as calling spell_get_suggestions(word) and it will
return the list of possible corrections. But with n-gram based
implementation it become much more complicated. Does the caller pass
one sentence at a time or it tokenizes and passes the n-grams? If
spell(1) needs to tokenize, how does it decide what is a valid
tokenization scheme for the caller? Things like that, because of which
I have not gone in that direction yet. I do have a bigram based
implementation checked in (but currently broken).

I am looking forward to helpful feedback and suggestions on this work :-)

-
Abhinav

Follow-Ups:
- Re: A new modern spell(1) Implementation
  - From: Robert Elz

Prev by Date: Re: Adding ftp to master.passwd and group
Next by Date: Re: A new modern spell(1) Implementation
Previous by Thread: Adding ftp to master.passwd and group
Next by Thread: Re: A new modern spell(1) Implementation
Indexes:

Home | Main Index | Thread Index | Old Index