
Unicode strings to ASCII ...nicely

http://effbot.org/librarybook/unicodedata.htm

ascii, encode, unicode, strings, unicodedata, normalize

8th of August 2006

This has been a problem for me for a long time. Whenever someone enters a title in my CMS, the id of the document is derived from the title. Spaces are replaced with '-', '&' is replaced with 'and', and so on. The final thing I wanted to do was to make sure the id is ASCII encoded when it's saved. My original attempt looked like this:

 >>> title = u"Klüft skräms inför på fédéral électoral große"
 >>> print title.encode('ascii','ignore')
 Klft skrms infr p fdral lectoral groe

But as you can see, a lot of the characters are gone. I'd much rather have a word like "Klüft" converted to "Kluft", which is more human-readable and still correct. My second attempt was to write a big table of Unicode-to-ASCII replacements.

It looked something like this:

 u'\xe4': u'a',
 u'\xc4': u'A',
 etc...

Long, awful and not Pythonic. The result was good, but it was far too easy to miss a character. Now for the final solution, which I'm very happy with. It uses a module called unicodedata, which is new to me. Here's how it works:

 >>> import unicodedata
 >>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
 'Kluft skrams infor pa federal electoral groe'
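To see why this works, here is a minimal sketch in Python 3 syntax (the sessions above are Python 2). As far as I understand it, NFKD splits each accented letter into a plain base letter followed by one or more combining marks, and encoding with 'ignore' then drops the marks because they are not ASCII. The variable names are just for illustration.

```python
import unicodedata

title = "Klüft skräms inför på fédéral électoral große"

# NFKD splits each accented letter into a base letter plus combining marks.
decomposed = unicodedata.normalize("NFKD", title)

# "ü" becomes "u" followed by U+0308 (COMBINING DIAERESIS).
print([hex(ord(ch)) for ch in unicodedata.normalize("NFKD", "ü")])
# ['0x75', '0x308']

# Encoding with 'ignore' drops every non-ASCII code point, which removes
# the combining marks but keeps the base letters.
ascii_title = decomposed.encode("ascii", "ignore").decode("ascii")
print(ascii_title)
# Kluft skrams infor pa federal electoral groe
```

Letters such as "ß" have no decomposition, so they survive normalization unchanged and are then dropped by the encode step.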

It's not perfect ('große' should have become 'grosse'), but it's only two lines of code.
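One possible way to cover cases like 'große' is to replace the few letters that have no decomposition before normalizing. This is only a sketch in Python 3 syntax: the names `SPECIAL_CASES` and `to_ascii` are made up for the example, the table is deliberately incomplete, and the chosen replacements (for instance "ø" to "o" rather than "oe") are an assumption that may not suit every language.

```python
import unicodedata

# Hypothetical table of characters that NFKD does not decompose.
# The entries are illustrative, not exhaustive.
SPECIAL_CASES = {
    "ß": "ss",
    "ø": "o",
    "Ø": "O",
    "æ": "ae",
    "Æ": "AE",
}
_TABLE = str.maketrans(SPECIAL_CASES)


def to_ascii(text):
    """Return an ASCII-only approximation of text."""
    # Handle the special cases first, since normalization leaves them alone.
    text = text.translate(_TABLE)
    # Split accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop whatever is still not ASCII (mostly the combining marks).
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("große"))              # grosse
print(to_ascii("Rødgrød med fløde"))  # Rodgrod med flode
```

Anything that is neither in the table nor decomposable is still silently dropped, so the table has to grow as new characters turn up.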


Comment

infidel - 8th August 2006
It's been years since I took any German, but wouldn't 'Klüft' more accurately be saved as 'Klueft'? I recall that 'Küchen' and 'Kuchen' are two different words entirely (Kitchen and Cake, respectively).
Daverz - 9th August 2006
How about running replace on the string before normalizing:

title.replace(u'\xdf', 'ss')

and so on for any other special cases.
Andreas - 9th August 2006
infidel is right. It could create some ambiguity, at least with German words.
Michael Kallas - 9th August 2006
1) Klüft is not a German word, so don't worry too much.
2) Why do you want to generate ids from the title? This is potentially insecure as I might find a clever way for entering cross-site-scripting that way.
3) If the id should match the title, why does it have to be ascii?
Anonymous - 10th August 2006
> 2) Why do you want to generate ids from the title? This is potentially insecure as I might find a clever way for entering cross-site-scripting that way.
> 3) If the id should match the title, why does it have to be ascii?

http://www.peterbe.com/plog/unicode-to-ascii
Jonathan Holst - 9th August 2006
I can also say (having tested it) that it doesn't work with the Scandinavian letters æ, ø and å: they get dropped completely.
Peter Bengtsson - 10th August 2006
"på" became "pa"
Jonathan Holst - 10th August 2006
Well okay, but "Rødgrød med fløde" became "Rdgrd med flde".
Ian Bicking - 10th August 2006
This might assist, or maybe what you do is sufficiently equivalent:
http://www.crummy.com/cgi-bin/msm/map.cgi/ASCII%2C+Dammit
Victor Stinner - 14th August 2006
Hi, I wrote a script based on your idea. It transforms number, str and unicode to ASCII: http://www.haypocalc.com/perso/prog/python/any2ascii.py

It takes care of some characters like "ßøł" (just fill the smart_unicode dictionary ;-)).

Haypo
Fredrik - 24th August 2006
Yet another approach:

http://effbot.python-hosting.com/file/stuff/sandbox/text/unaccent.py
Peter Bengtsson - 31st August 2006
Brilliant! Thank you.
Bryan Eastin - 18th January 2008
Hey, I just wanted to thank you for this page. It was really helpful. I wanted to retain all 8-bit characters, so my solution was more complicated (see http://beastin.livejournal.com/6819.html), but I made use of your example.
 