08 August 2006 20 comments Python
This has been a problem for me for a long time. Whenever someone enters a title in my CMS, the id of the document is derived from the title. Spaces are replaced with '-', '&' is replaced as well, etc. The final thing I wanted to do was to make sure the id is ASCII encoded when it's saved. My original attempt looked like this:
>>> title = u"Klüft skräms inför på fédéral électoral große"
>>> print title.encode('ascii','ignore')
Klft skrms infr p fdral lectoral groe
But as you can see, a lot of the characters are gone. I'd much rather that a word like "Klüft" is converted to "Kluft" which will be more human readable and still correct. My second attempt was to write a big table of unicode to ascii replacements.
It looked something like this:
u'\xe4': u'a', u'\xc4': u'A', etc...
Long, awful and not pythonic. The result was good, but it's far too easy to miss a character. Now for the final solution, which I'm very happy with. It uses a module called unicodedata which is new to me. Here's how it works:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
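The session above is Python 2. As a rough sketch, the same idea can be wrapped in a small function that also runs on Python 3. The name to_ascii is made up for illustration; the only library call is unicodedata.normalize. NFKD splits a character like "ü" into a plain "u" plus a combining mark, and the ASCII encode with 'ignore' then throws the mark away.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only approximation of text (hypothetical helper)."""
    # NFKD decomposes accented characters into a base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding with 'ignore' silently drops everything that is not ASCII,
    # which removes the combining marks and keeps the base letters.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


title = u"Klüft skräms inför på fédéral électoral große"
print(to_ascii(title))  # Kluft skrams infor pa federal electoral groe
```

The final decode turns the bytes back into a text string, which is probably what you want for an id on Python 3.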
It's not perfect (große should have become grosse) but it's only two lines of code.
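One possible way around the große problem is to combine the two approaches: a very small replacement table for the characters that NFKD leaves alone, applied before normalizing. This is only a sketch. The FALLBACKS table and the function name are assumptions of mine, and the table contains just the one case mentioned here; other characters without a decomposition would need their own entries.

```python
import unicodedata

# Hypothetical table for characters that NFKD does not decompose.
# Only the case from this post is listed; extend as needed.
FALLBACKS = {u'ß': u'ss'}


def to_ascii_with_fallbacks(text):
    """ASCII approximation of text, with manual fallbacks applied first."""
    # Apply the hand-written replacements before normalizing,
    # otherwise the encode step would simply drop these characters.
    for char, replacement in FALLBACKS.items():
        text = text.replace(char, replacement)
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii_with_fallbacks(u"große"))  # grosse
```

The table stays short because unicodedata still handles every accented letter that can be decomposed.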