Unlike previous incarnations of Spellcorrector, it now does not by default load the two huge language files for English and Swedish. Alternatively, or additionally, you can load your own language file. The difference between loading a language file and training on your own words is that trained words are always assumed to be correct.

Another major change in this release is that a pickle file is created once the language file or your own training file has been parsed. This works like a cache: if the original text file changes, the pickle file is recreated. The outcome is that the first time you create a Spellcorrector instance it takes a few seconds if the language file is large, but the second time it takes virtually no time at all.
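To make the caching idea concrete, here is a minimal sketch of a pickle cache that is rebuilt when the text file is newer than the pickle. It is not the code from spellcorrector.py; the function name load_counts_cached, the '.pickle' suffix and the modification-time check are all assumptions made for illustration.

```python
import os
import pickle


def load_counts_cached(text_path, parse):
    """Return parse(text_path), reusing a pickle beside the file while it is fresh.

    `parse` is any callable that takes a path and returns a picklable object.
    The cache file name (text_path + '.pickle') is a hypothetical convention.
    """
    cache_path = text_path + '.pickle'
    # The cache is considered fresh if it exists and is at least as new as the text.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(text_path)):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    # Slow path: parse the text file and store the result for next time.
    result = parse(text_path)
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f, pickle.HIGHEST_PROTOCOL)
    return result
```

Comparing modification times is only one way to detect a changed file; hashing the contents would be more robust but slower for large files.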

So, to recap, here are the different ways of loading the Spellcorrector:


>>> from spellcorrector import Spellcorrector
>>> Spellcorrector('en')

>>> import os
>>> assert os.path.isdir('languagefiles')
>>> Spellcorrector('en', load_language_files=True)

>>> Spellcorrector('en', load_language_file='/home/peterbe/text.txt')

>>> Spellcorrector('en', own_training_file='/home/peterbe/names.txt')

The load_language_file parameter expects a readable file full of text. The text doesn't have to be written as one word per line; junk such as punctuation and brackets is stripped.
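As a rough illustration of what stripping the junk could look like, here is a small sketch that pulls the words out of free-form text and counts them. The function names words and count_words and the regular expression are my own assumptions, not necessarily what spellcorrector.py does.

```python
import re
from collections import Counter


def words(text):
    """Return the lowercased words in text, dropping punctuation, brackets and digits."""
    # [^\W\d_]+ matches runs of letters only, including accented letters.
    return re.findall(r'[^\W\d_]+', text.lower())


def count_words(text):
    """Count how often each word occurs in text."""
    return Counter(words(text))
```

Note that this simple pattern splits on apostrophes and hyphens too, so a real implementation may want to treat those more carefully.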

The own_training_file parameter has to be a file with one word per line. You can combine the two like this:


>>> Spellcorrector('en', load_language_file='/home/peterbe/text.txt',
                   own_training_file='/home/peterbe/names.txt')
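To illustrate the difference between the two kinds of file, here is a hedged sketch of how a one-word-per-line training file might be read and used to short-circuit correction. The helper names read_training_words and is_assumed_correct are hypothetical, and lowercasing the words is an assumption on my part.

```python
def read_training_words(path):
    """Read a one-word-per-line file into a set, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        # Assumption: words are compared case-insensitively, so store them lowercased.
        return {line.strip().lower() for line in f if line.strip()}


def is_assumed_correct(word, training_words):
    """Trained words are always treated as correct, whatever the language file says."""
    return word.lower() in training_words
```

A corrector built this way would check is_assumed_correct first and only fall back to the frequency counts from the language file when the word is not a trained one.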

There have also been a few other fixes and improvements. For example, there are now two basic unit tests at the bottom of the file that might give some clues about how it can work for you.

Download spellcorrector.py 0.2

I really ought to put this on PyPI. Something for my todo list.

Comments

bruno GALLART

Hi,
I am interested in your personal version of Peter Norvig's corrector. I have tried it for my language from the south of France (Occitan). I did a test with a .txt file. It works well, but in my language there are many letters like ò ó ì í ù ú à á è é ç. When there is an accented vowel in the word, the correction method or the suggestions method cuts the word. I am not a very good Pythoner and I don't know how to resolve this little problem. Can you give me some hints?
Compliments for your corrector,
Regards,
Bruno

Peter Bengtsson

It supports Unicode. But you'll have to modify it and write down the alphabet of your language.
Oh, and make sure you write the .txt file in UTF8!

bruno GALLART

Thanks for your answer, Peter. In the evening I looked up some information about Unicode etc. in Python, and I think that the file's format is not UTF8!
Thanks a lot,
Bruno
