Unicode strings to ASCII ...nicely

08 August 2006   20 comments   Python

http://effbot.org/librarybook/unicodedata.htm

This has been a problem for a long time for me. Whenever someone enters a title in my CMS, the id of the document is derived from the title. Spaces are replaced with '-', '&' is replaced with 'and', etc. The final thing I wanted to do was to make sure the id is ASCII encoded when it's saved. My original attempt looked like this:

>>> title = u"Klüft skräms inför på fédéral électoral große"
>>> print title.encode('ascii','ignore')
Klft skrms infr p fdral lectoral groe

But as you can see, a lot of the characters are gone. I'd much rather that a word like "Klüft" is converted to "Kluft" which will be more human readable and still correct. My second attempt was to write a big table of unicode to ascii replacements.

It looked something like this:

u'\xe4': u'a',
u'\xc4': u'A',
etc...

It was long, awful and not Pythonic, and it was too easy to miss a character, but the result was good. Now for the final solution, which I'm very happy with. It uses a module called unicodedata, which is new to me. Here's how it works:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'

It's not perfect (große should have become grosse) but it's only two lines of code.
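For readers on Python 3, where every str is already Unicode and print is a function, the same idea can be sketched as follows. The names to_ascii and title_to_id are hypothetical helpers made up for this illustration, and the space and '&' replacements only mirror the rules mentioned at the top of the post; this is not the CMS code itself.

```python
import unicodedata


def to_ascii(text):
    """Fold accented characters to their closest ASCII base letters."""
    # NFKD splits a character such as "ü" into "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining marks are not ASCII, so 'ignore' silently drops them.
    # Characters without a decomposition (such as "ß") are dropped as well.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def title_to_id(title):
    """Hypothetical helper: derive a document id from a title."""
    # Assumed rules, taken from the description in the post:
    # '&' becomes 'and' and spaces become '-'.
    replaced = title.replace("&", "and").replace(" ", "-")
    return to_ascii(replaced)


title = "Klüft skräms inför på fédéral électoral große"
print(to_ascii(title))  # Kluft skrams infor pa federal electoral groe
print(title_to_id("Klüft & große"))  # Kluft-and-groe
```

The final .decode("ascii") is only there because encode() returns bytes on Python 3; drop it if bytes are what you want to store.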

Comments

infidel

It's been years since I took any German, but wouldn't 'Klüft' more accurately be saved as 'Klueft'? I recall that 'Küchen' and 'Kuchen' are two different words entirely (Kitchen and Cake, respectively).

Daverz

How about running replace on the string before normalizing:

title.replace(u'\xdf', 'ss')

and so on for any other special cases.
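A minimal Python 3 sketch of this suggestion might look like the following. The SPECIAL_CASES table and the function name are assumptions made for illustration, and the table is deliberately incomplete; which replacement is "right" for a given character depends on the language.

```python
import unicodedata

# Hypothetical table of characters that NFKD does not decompose.
# Extend it with whatever special cases your titles need.
SPECIAL_CASES = {
    "ß": "ss",
    "æ": "ae",
    "ø": "o",
}


def to_ascii_with_special_cases(text):
    """Apply hand-picked replacements first, then strip accents via NFKD."""
    for char, replacement in SPECIAL_CASES.items():
        text = text.replace(char, replacement)
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything still outside ASCII after decomposition is dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_with_special_cases("große"))  # grosse
```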

Andreas

infidel is right. It could create some form of ambiguity - at least with German words.

Michael Kallas

1) Klüft is not a German word, so don't worry too much.
2) Why do you want to generate ids from the title? This is potentially insecure as I might find a clever way for entering cross-site-scripting that way.
3) If the id should match the title, why does it have to be ascii?

Anonymous

> 2) Why do you want to generate ids from the title? This is potentially insecure as I might find a clever way for entering cross-site-scripting that way.
> 3) If the id should match the title, why does it have to be ascii?

http://www.peterbe.com/plog/unicode-to-ascii

Jonathan Holst

I can also say (having tested it) that it doesn't work with Scandinavian letters (æ, ø and å) -- they get ignored completely.

Peter Bengtsson

"på" became "pa"

Jonathan Holst

Well okay, but "Rødgrød med fløde" became "Rdgrd med flde".
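The difference between "å" and "ø" can be checked directly, since the unicodedata module exposes the decomposition mapping that NFKD relies on. A small Python 3 sketch (the helper name decomposes is made up for this example):

```python
import unicodedata


def decomposes(char):
    """Return True if Unicode defines a decomposition mapping for char."""
    # decomposition() returns an empty string when there is no mapping.
    return unicodedata.decomposition(char) != ""


for char in "åøæ":
    print(unicodedata.name(char), repr(unicodedata.decomposition(char)))

# "å" maps to "a" plus a combining ring ('0061 030A'), so it survives as "a".
# "ø" and "æ" have no mapping, so encode('ascii', 'ignore') drops them whole.
```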

Ian Bicking

This might assist, or maybe what you do is sufficiently equivalent:
http://www.crummy.com/cgi-bin/msm/map.cgi/ASCII%2C+Dammit

Victor Stinner

Hi, I wrote a script based on your idea. It transforms number, str and unicode to ASCII: http://www.haypocalc.com/perso/prog/python/any2ascii.py

It takes care of some characters like "ßøł" (just fill the smart_unicode dictionary ;-)).

Haypo

Fredrik

Yet another approach:

http://effbot.python-hosting.com/file/stuff/sandbox/text/unaccent.py

Peter Bengtsson

Brilliant! Thank you.

Bryan Eastin

Hey, I just wanted to thank you for this page. It was really helpful. I wanted to retain all 8-bit characters, so my solution was more complicated (see http://beastin.livejournal.com/6819.html), but I made use of your example.

ben

This is fantastic stuff - I was having trouble parsing film results where, for example, Rashômon was represented as Rashomon. Testing for both the unicode and ascii normalized strings before iterating to the next result really sealed it. Thanks.

Robson

Excellent! It saved my day... thanks a lot!

Anonymous

When writing about character encodings you want your page encoded properly.

The page claims to be encoded in UTF-8 but is encoded in ISO Latin-1.

Peter Bengtsson

I know. It's terrible. It's because it's changed over time.

Gilles Lenfant

There's now the "unidecode" package that does the whole job: http://pypi.python.org/pypi/Unidecode/

>>> from unidecode import unidecode
>>> utext = u"œuf dür"
>>> unidecode(utext)
u'oeuf dur'
>>> from unicodedata import normalize
>>> normalize('NFKD', utext).encode('ascii','ignore')
'uf dur'

It has better support for special Latin Extended characters (French, German) that should transliterate to multiple ASCII characters.

Anonymous

Peterbe to the rescue!
