Unicode strings to ASCII ...nicely
http://effbot.org/librarybook/unicodedata.htm
ascii, encode, unicode, strings, unicodedata, normalize
8th of August 2006
This has been a problem for a long time for me. Whenever someone enters a title in my CMS, the id of the document is derived from the title. Spaces are replaced with '-', '&' is replaced with 'and', etc. The final thing I wanted to do was to make sure the id is ASCII encoded when it's saved. My original attempt looked like this:
>>> title = u"Klüft skräms inför på fédéral électoral große"
>>> print title.encode('ascii','ignore')
Klft skrms infr p fdral lectoral groe
But as you can see, a lot of the characters are gone. I'd much rather that a word like "Klüft" is converted to "Kluft" which will be more human readable and still correct. My second attempt was to write a big table of unicode to ascii replacements.
It looked something like this:
u'\xe4': u'a', u'\xc4': u'A', etc...
Long, awful and not Pythonic. It was too easy to miss a character, although the result was good. Now for the final solution, which I'm very happy with. It uses a module called unicodedata which is new to me. Here's how it works:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
It's not perfect ('große' should have become 'grosse') but it's only two lines of code.
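The session above is Python 2. For readers on Python 3, here is a minimal sketch of the same idea. The function name `to_ascii` is made up for this example, and the final `decode` is only there because `encode` returns `bytes` in Python 3.

```python
import unicodedata


def to_ascii(text):
    """Fold accented characters to ASCII, dropping anything without an ASCII base."""
    # NFKD splits e.g. 'ü' into 'u' followed by a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding with 'ignore' discards the combining marks and every other
    # non-ASCII character; decode turns the resulting bytes back into a str.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


title = "Klüft skräms inför på fédéral électoral große"
print(to_ascii(title))  # Kluft skrams infor pa federal electoral groe
```

As in the original, 'ß' is still dropped, because it has no decomposition into an ASCII letter plus combining marks.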
Comment
How about running replace on the string before normalizing:
title.replace(u'\xdf', 'ss')
and so on for any other special cases.
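A rough Python 3 sketch of that suggestion: replace the characters that have no Unicode decomposition first, then normalize. The `SPECIAL_CASES` table and the `to_ascii` name are invented for this example, and the table is deliberately incomplete.

```python
import unicodedata

# Hypothetical, incomplete table of characters that NFKD cannot decompose.
SPECIAL_CASES = {
    '\xdf': 'ss',  # ß
    '\xf8': 'o',   # ø
    '\xd8': 'O',   # Ø
    '\xe6': 'ae',  # æ
    '\xc6': 'AE',  # Æ
}

# str.maketrans turns the dict into a table that str.translate understands;
# replacement values may be longer than one character.
SPECIAL_TABLE = str.maketrans(SPECIAL_CASES)


def to_ascii(text):
    # Handle the special cases first, then strip accents as before.
    text = text.translate(SPECIAL_TABLE)
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii("große"))              # grosse
print(to_ascii("Rødgrød med fløde"))  # Rodgrod med flode
```

Anything missing from the table is still silently dropped, so the table needs an entry for every such character that can appear in a title.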
infidel is right. It could create some form of ambiguity, at least with German words.
1) Klüft is not a German word, so don't worry too much.
2) Why do you want to generate ids from the title? This is potentially insecure as I might find a clever way for entering cross-site-scripting that way.
3) If the id should match the title, why does it have to be ascii?
> 2) Why do you want to generate ids from the title? This is potentially insecure as I might find a clever way for entering cross-site-scripting that way.
> 3) If the id should match the title, why does it have to be ascii?
http://www.peterbe.com/plog/unicode-to-ascii
I can also say (having tested it) that it doesn't work with the Scandinavian letters (æ, ø and å); they get ignored completely.
"på" became "pa"
Well okay, but "Rødgrød med fløde" became "Rdgrd med flde".
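The difference between the two cases can be checked with `unicodedata.decomposition`, which returns a character's decomposition mapping, or an empty string if it has none. A small Python 3 sketch (the helper name `has_decomposition` is made up for this example):

```python
import unicodedata


def has_decomposition(ch):
    # An empty mapping means NFKD leaves the character untouched, so
    # encode('ascii', 'ignore') then drops it entirely.
    return unicodedata.decomposition(ch) != ''


for ch in "åøæß":
    print(ch, repr(unicodedata.decomposition(ch)), has_decomposition(ch))
```

'å' reports the mapping '0061 030A' (an 'a' plus a combining ring), which is why "på" became "pa". 'ø', 'æ' and 'ß' report an empty string, which is why they vanish.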
This might assist, or maybe what you do is sufficiently equivalent:
http://www.crummy.com/cgi-bin/msm/map.cgi/ASCII%2C+Dammit
Hi, I wrote a script based on your idea. It transforms number, str and unicode to ASCII: http://www.haypocalc.com/perso/prog/python/any2ascii.py
It takes care of some characters like "ßøł" (just fill the smart_unicode dictionary ;-)).
Haypo
Yet another approach:
http://effbot.python-hosting.com/file/stuff/sandbox/text/unaccent.py
Hey, I just wanted to thank you for this page. It was really helpful. I wanted to retain all 8-bit characters, so my solution was more complicated (see http://beastin.livejournal.com/6819.html), but I made use of your example.







It's been years since I took any German, but wouldn't 'Klüft' more accurately be saved as 'Klueft'? I recall that 'Küchen' and 'Kuchen' are two different words entirely (Kitchen and Cake, respectively).
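If German conventions matter, the umlauts have to be expanded before normalizing, because NFKD would otherwise reduce 'ü' to a plain 'u'. Here is a minimal Python 3 sketch with a made-up `GERMAN` table and function name. Whether this mapping suits a given title depends on its language, which the code does not try to detect.

```python
import unicodedata

# Hypothetical table for the usual German transliteration.
GERMAN = str.maketrans({
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
    'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
    'ß': 'ss',
})


def german_to_ascii(text):
    # NFC first, so that 'u' + combining diaeresis becomes the single
    # character 'ü' and matches a key in the table.
    text = unicodedata.normalize('NFC', text)
    text = text.translate(GERMAN)
    # Then fold whatever accents remain, as in the post.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(german_to_ascii("Küchen"))  # Kuechen
print(german_to_ascii("Kuchen"))  # Kuchen
```

With this mapping the two words stay distinct in the generated id.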