PackageDescription: Spellchecker2


Spellchecker 2

Last published: May 10, 2011 by 'mlucas-smith'

Defines 5 Classes
Extends 7 Classes


Spellchecker2 contains a spelling checker and spelling corrector library. It also includes the English free word list dictionary.

Overview
This package provides the ability to spellcheck a string against a vocabulary, such as English, French, German, etc. The package itself comes with only the English dictionary, built from the free word list.
Two facilities are provided - the ability to test if a word is in a vocabulary and if it is not, the ability to find suggestions from the vocabulary that closely match the original word.


Usage
You can now do spell checking and suggestions on a vocabulary:
vocabulary := English vocabulary.
vocabulary add: 'spelling'.
vocabulary suggestWords: 'xpelling'.

There are two kinds of vocabularies - a compressed vocab and an uncompressed vocab. The compressed version is the class Vocabulary, while the uncompressed version is the class Trie. You can only add words to an uncompressed vocabulary. You can modify a compressed vocabulary using the method mutableWhile:, eg:
vocabulary := Spellchecker2.Vocabulary new.
vocabulary mutableWhile: [:node | node add: 'spelling'].

The same APIs that work on a Vocabulary class also work on a Trie, eg:
node := Trie new.
node add: 'spelling'.
node includes: 'spelling'.
node suggestWords: 'xpelling'.

The difference between Trie and Vocabulary is discussed in more detail in the Architecture section below. For implementation details on either, see their class descriptions.


Implementation
The Vocabulary and Trie structures are graphs of nodes, where duplicate branches are shared. This means that creating a vocabulary can be a potentially costly exercise, if there are many duplicate nodes to compress. As well as that, the native state for a vocabulary is to be compressed, as compressed vocabularies take up a great deal less space than uncompressed (anywhere up to 10x smaller sometimes). Converting a Trie graph in to a Vocabulary can be an expensive operation too, as can converting a Vocabulary in to a Trie graph (using #mutableWhile:).

This is worth noting if you intend to have a 'custom dictionary' that contains words not included in the default dictionary. Instead of updating the main English dictionary, you can instead update a separate smaller dictionary of custom words. Assuming that there are no duplicate words between the two dictionaries, the result of a word search is the union of the two dictionaries.

The English dictionary is kept as a ByteArray in the method English class>>data. It is wrapped in a #once block, which ensures that the ByteArray exists only once in the image, further reducing the memory load of having a word dictionary in memory.

Detecting if a word is in a vocabulary is easy, you simply traverse the tree, which is how the implementation works. If a terminal node is found when it is expected, then the word is in the vocabulary.
Creating word suggestions for a word that is not in the vocabulary is a much trickier prospect. For details on how spelling suggestions are found, see the class LevenshteinAutomata.