PackageDescription: SpellChecker2ForBf(Bundle)

Spell Checker 2 For Bf

Last published: October 19, 2009 by 'jamesr'

No bundle comment found. Component comments follow

PackageDescription: Spellchecker2

Spellchecker2 contains a spelling checker and spelling corrector library. It also includes the English free word list dictionary.

This package provides the ability to spellcheck a string against a vocabulary, such as English, French, German, etc. The package itself comes with only the English dictionary, built from the free word list.
Two facilities are provided - the ability to test if a word is in a vocabulary and if it is not, the ability to find suggestions from the vocabulary that closely match the original word.

You can now do spell checking and suggestions on a vocabulary:
vocabulary := English vocabulary.
vocabulary hasWord: 'spelling'.
vocabulary suggestWords: 'xpelling'.

There are two kinds of vocabularies - a compressed vocab and an uncompressed vocab. The compressed version is the class Vocabulary, while the uncompressed version is the class Node. You can only add words to an uncompressed vocabulary. You can modify a compressed vocabulary using the method mutableWhile:, eg:
vocabulary := Spellchecker2.Vocabulary new.
vocabulary mutableWhile: [:node | node addWord: 'spelling'].

The same APIs that work on a Vocabulary class also work on a Node, eg:
node := Node new.
node addWord: 'spelling'.
node hasWord: 'spelling'.
node suggestWords: 'xpelling'.

The difference between Node and Vocabulary is discussed in more detail in the Architecture section below. For implementation details on either, see their class descriptions.

The Vocabulary and Node structures are graphs of nodes, where duplicate branches are shared. This means that creating a vocabulary can be a potentially costly exercise, if there are many duplicate nodes to compress. As well as that, the native state for a vocabulary is to be compressed, as compressed vocabularies take up a great deal less space than uncompressed (anywhere up to 10x smaller sometimes). Converting a Node graph in to a Vocabulary can be an expensive operation too, as can converting a Vocabulary in to a Node graph (using #mutableWhile:).

This is worth noting if you intend to have a 'custom dictionary' that contains words not included in the default dictionary. Instead of updating the main English dictionary, you can instead update a separate smaller dictionary of custom words. Assuming that there are no duplicate words between the two dictionaries, the result of a word search is the union of the two dictionaries.

The English dictionary is kept as a ByteArray in the method English class>>data. It is wrapped in a #once block, which ensures that the ByteArray exists only once in the image, further reducing the memory load of having a word dictionary in memory.

Detecting if a word is in a vocabulary is easy, you simply traverse the tree, which is how the implementation works. If a terminal node is found when it is expected, then the word is in the vocabulary.
Creating word suggestions for a word that is not in the vocabulary is a much trickier prospect. For details on how spelling suggestions are found, see the class LevenshteinAutomata.

PackageDescription: Spellchecker2ExtensionsForBf

pkg contains api extensions that are useful for Bf