Spellcorrector 0.2

24 September 2007   3 comments   Python


Unlike previous incarnations of Spellcorrector, it now does not load the two huge language files for English and Swedish by default. Alternatively or additionally, you can load your own language file. The difference between loading a language file and training on your own words is that trained words are always assumed to be correct.

Another major change in this release is that a pickle file is created once the language file or your own training file has been parsed. This works like a cache: if the original text file changes, the pickle file is recreated. The upshot is that the first time you create a Spellcorrector instance it takes a few seconds if the language file is large, but the second time it takes virtually no time at all.
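To make the caching idea concrete, here is a minimal sketch of how such a pickle cache could work. It is not the code from spellcorrector.py: the function names, the `.pickle` suffix and the use of file modification times to detect a changed text file are all assumptions made for illustration, and it is written for Python 3.

```python
import os
import pickle
from collections import Counter


def parse_words(text_path):
    """Placeholder parser (hypothetical): count whitespace-separated words."""
    with open(text_path, encoding="utf-8") as f:
        return Counter(f.read().lower().split())


def load_with_cache(text_path):
    """Return word counts, reusing a pickle next to the text file if fresh."""
    cache_path = text_path + ".pickle"  # assumed naming scheme
    # Assumption: the cache is fresh if it is at least as new as the text file.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(text_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Slow path: parse the text once, then store the result for next time.
    counts = parse_words(text_path)
    with open(cache_path, "wb") as f:
        pickle.dump(counts, f, pickle.HIGHEST_PROTOCOL)
    return counts
```

The first call pays for the parsing and the later ones only for unpickling. Comparing modification times is cheap but coarse; hashing the file contents would be a stricter alternative.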

So, to recap, here are the different ways of loading the Spellcorrector:

>>> Spellcorrector('en')

>>> assert os.path.isdir('languagefiles')
>>> Spellcorrector('en', load_language_files=True)

>>> Spellcorrector('en', load_language_file='/home/peterbe/text.txt')

>>> Spellcorrector('en', own_training_file='/home/peterbe/names.txt')

The load_language_file option expects a readable file full of text. The text doesn't have to be written as one word per line. All junk like punctuation and brackets is stripped.
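As an illustration of that kind of stripping, the sketch below pulls lowercase words out of free-form text with a regular expression. The name `extract_words` and the exact pattern are assumptions; the real module may tokenize differently.

```python
import re

# Roughly "letters only": any word character that is not a digit or underscore.
WORD_RE = re.compile(r"[^\W\d_]+")


def extract_words(text):
    """Return the lowercase words found in free-form text."""
    return WORD_RE.findall(text.lower())


print(extract_words("Hello, (world)! It's 2007."))
# → ['hello', 'world', 'it', 's']
```

Note that this simple pattern also splits on apostrophes, so "It's" becomes two words.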

The own_training_file has to be a file with one word per line. You can combine the two like this:

>>> Spellcorrector('en', load_language_file='/home/peterbe/text.txt',
                   own_training_file='/home/peterbe/names.txt')
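To show what "trained words are always assumed to be correct" could mean when the two sources are combined, here is a small Norvig-style sketch. The class `TinyCorrector`, its arguments and the rule that trained words win over language-file words among the candidates are all hypothetical; this is not the Spellcorrector implementation.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


class TinyCorrector:
    def __init__(self, language_counts, trained_words=()):
        self.counts = Counter(language_counts)            # from free text
        self.trained = {w.lower() for w in trained_words}  # one word per line

    def correct(self, word):
        word = word.lower()
        # Known words, trained or from the language file, are left alone.
        if word in self.trained or word in self.counts:
            return word
        candidates = edits1(word)
        # Assumption: a trained word one edit away beats any language word.
        trained_hits = candidates & self.trained
        if trained_hits:
            return min(trained_hits)  # alphabetical, so ties are deterministic
        known = [w for w in candidates if w in self.counts]
        if known:
            # Most frequent language word wins; the word itself breaks ties.
            return max(known, key=lambda w: (self.counts[w], w))
        return word
```

With `TinyCorrector({"peter": 5, "better": 3}, ["peterbe"])`, the word "peterbe" is returned unchanged because it was trained, while "petter" is corrected to the more frequent language word "peter".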

There have also been a few other fixes and improvements. For example, there are now two basic unit tests at the bottom of the file that might give some clues about how it can work for you.

Download spellcorrector.py 0.2

I really ought to put this on PyPI. Something for my todo list.

Comments

bruno GALLART
Hi,
I am interested in your personal version of Peter Norvig's corrector. I have tried it for my language from the South of France (Occitan). I did a test with a .txt file. It works well, but in my language there are many letters like ò ó ì í ù ú à á è é ç. When a word contains an accented vowel, the correction method or the suggestions method cuts the word. I am not a very good Python programmer and I don't know how to resolve this little problem. Can you give me some hints?
Compliments for your corrector,
Regards,
Bruno

Peter Bengtsson
It supports Unicode. But you'll have to modify it and write down the alphabet of your language.
Oh, and make sure you write the .txt file in UTF8!

bruno GALLART
Thanks for your answer, Peter. In the evening I looked up some information about Unicode etc. in Python, and I think that the file's format is not UTF8!
Thanks a lot,
Bruno