24 September 2007
Unlike previous incarnations of Spellcorrector, it now does not load the two huge language files for English and Swedish by default. Alternatively, or additionally, you can load your own language file. The difference between loading a language file and training on your own words is that trained words are always assumed to be correct.
Another major change with this release is that a pickle file is created once the language file or own training file has been parsed. This works like a cache: if the original text file changes, the pickle file is recreated. The outcome is that the first time you create a Spellcorrector instance it takes a few seconds if the language file is large, but the second time it takes virtually no time at all.
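To make the caching idea concrete, here is a minimal sketch of a pickle cache that is rebuilt when the source text file changes. It is not the code from spellcorrector.py: the names `load_with_cache` and `parse_words`, the `.pickle` suffix and the modification-time comparison are all assumptions made for illustration.

```python
import os
import pickle
from collections import Counter


def parse_words(text_path):
    """Stand-in parser (hypothetical): count whitespace-separated words."""
    with open(text_path, encoding="utf-8") as f:
        return Counter(f.read().lower().split())


def load_with_cache(text_path, parse=parse_words):
    """Return parsed data, reusing a pickle next to text_path while it is fresh."""
    cache_path = text_path + ".pickle"  # assumed naming scheme
    # The cache counts as fresh if it exists and is at least as new as the text.
    if (os.path.isfile(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(text_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Otherwise do the slow parse once and store the result for next time.
    data = parse(text_path)
    with open(cache_path, "wb") as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    return data
```

The first call pays for the parse and writes the pickle. Later calls only unpickle, until the text file gets a newer modification time than the pickle.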
So, to recap, here are the different ways of loading the Spellcorrector:
>>> Spellcorrector('en')
>>> assert os.path.isdir('languagefiles')
>>> Spellcorrector('en', load_language_files=True)
>>> Spellcorrector('en', load_language_file='/home/peterbe/text.txt')
>>> Spellcorrector('en', own_training_file='/home/peterbe/names.txt')
load_language_file expects a readable file full of text. The text doesn't have to be written as one word per line. All junk like punctuation and brackets and stuff is stripped.
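As a rough illustration of what stripping the junk can look like, here is a small sketch that pulls words out of free-form text. The regular expression and the function names `words` and `count_words` are my assumptions, not necessarily what spellcorrector.py does internally.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters (including non-ASCII ones such as
# Swedish å, ä, ö); digits, underscores, punctuation and brackets separate words.
WORD_RE = re.compile(r"[^\W\d_]+")


def words(text):
    """Return the lowercased words found in free-form text."""
    return WORD_RE.findall(text.lower())


def count_words(path):
    """Count word frequencies in a text file of any layout."""
    with open(path, encoding="utf-8") as f:
        return Counter(words(f.read()))
```

With this sketch, `words("Hello, (world)!")` returns `['hello', 'world']`, regardless of how the text is split across lines.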
own_training_file has to be a file with one word per line. You can combine the two like this:
>>> Spellcorrector('en', load_language_file='/home/peterbe/text.txt', own_training_file='/home/peterbe/names.txt')
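The practical difference between the two kinds of file can be sketched like this: words from the own training file are read one per line and then returned untouched, while anything else goes through the normal correction step. Everything here is hypothetical, including the names `read_training_words` and `correct` and the `suggest` callback. It only illustrates the "trained words are always assumed to be correct" rule, not the real implementation.

```python
def read_training_words(path):
    """Read a file with one word per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def correct(word, trained, suggest):
    """Return word unchanged if it was trained, else ask suggest() for a fix.

    suggest is a hypothetical callback standing in for the frequency-based
    correction built from the language file.
    """
    if word in trained:
        return word  # trained words are always assumed to be correct
    return suggest(word)
```

So a name listed in names.txt is never "corrected" into a more common word from text.txt, however rare the name is.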
There have also been a few other fixes and improvements. For example, there are now two basic unit tests at the bottom of the file that might give some clues about how it can work for you.
Download spellcorrector.py 0.2
I really ought to include this in PyPI. Something for my todo list.