Here's a little program I wrote recently that converts characters which should have been written as HTML entities into those entities. For example, this is incorrect:

<b>Bärs & Öl</b>

But this is correct:

<b>B&auml;rs &amp; &Ouml;l</b>

To demonstrate, I have set up a little test page where you can try converting your own impure HTML content.
Run test program

Here's the source code for the program:

import re
# Python 2 module; in Python 3 it is called html.entities
from htmlentitydefs import entitydefs

# Invert the table so that each character maps back to its entity name.
entitydefs_inverted = {}
for k, v in entitydefs.items():
    entitydefs_inverted[v] = k

_badchars_regex = re.compile('|'.join(entitydefs.values()))
_been_fixed_regex = re.compile(r'&\w+;|&#[0-9]+;')

def html_entity_fixer(text, skipchars=[], extra_careful=1):

    # if extra_careful we don't attempt to do anything to
    # the string if it might have been converted already.
    if extra_careful and _been_fixed_regex.findall(text):
        return text

    if type(skipchars) == type('s'):
        skipchars = [skipchars]

    # collect the distinct characters that need replacing
    keyholder = {}
    for x in _badchars_regex.findall(text):
        if x not in skipchars:
            keyholder[x] = 1
    # escape ampersands first so the entities added below are not re-escaped
    text = text.replace('&', '&amp;')
    text = text.replace('\x80', '&#8364;')
    for each in keyholder.keys():
        if each == '&':
            continue

        better = entitydefs_inverted[each]
        if not better.startswith('&#'):
            better = '&%s;' % entitydefs_inverted[each]

        text = text.replace(each, better)
    return text
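
For readers on Python 3, where htmlentitydefs has become html.entities, the same idea can be sketched roughly as follows. This is an illustration under assumptions, not the program above: the name fix_entities is made up for the example, and unlike html_entity_fixer it only touches & and non-ASCII characters, so tags such as <b> are left alone without needing skipchars.

```python
import re
from html.entities import codepoint2name

# Match "&" or any non-ASCII character. "<", ">" and '"' are deliberately
# not matched, so existing markup such as <b> survives (an assumption of
# this sketch, not the behaviour of the original function).
_needs_entity = re.compile(r'&|[^\x00-\x7f]')
# Same heuristic as extra_careful: text that already contains something
# that looks like an entity is returned untouched.
_already_entity = re.compile(r'&\w+;|&#[0-9]+;')


def fix_entities(text, skipchars=(), extra_careful=True):
    """Hypothetical Python 3 counterpart of html_entity_fixer."""
    if extra_careful and _already_entity.search(text):
        return text

    def replace(match):
        char = match.group(0)
        if char in skipchars:
            return char
        name = codepoint2name.get(ord(char))
        if name is not None:
            return '&%s;' % name      # named entity, e.g. &auml;
        return '&#%d;' % ord(char)    # fall back to a numeric reference

    return _needs_entity.sub(replace, text)


print(fix_entities('<b>Bärs & Öl</b>'))  # <b>B&auml;rs &amp; &Ouml;l</b>
```

Doing the substitution in a single regex pass avoids the ordering problem of escaping & before inserting new entities.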
Harald Armin Massa - 26 November 2004
Peter,

I learned that using umlauts directly can be quite correct if you set the encoding to latin-1 (aka ISO something) or UTF-8 ... esp. with XHTML.

So the only BAD chars will be <, > and & ...

Harald
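
If the page encoding is declared correctly, escaping only the markup characters could look like the sketch below. It uses html.escape from the Python 3 standard library; the wrapper name escape_markup_only is made up for the example.

```python
from html import escape


def escape_markup_only(text):
    # quote=False escapes only &, < and >; umlauts pass through unchanged
    # and rely on the document's declared encoding being correct.
    return escape(text, quote=False)


print(escape_markup_only('Bärs & Öl <b>'))  # Bärs &amp; Öl &lt;b&gt;
```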
James Harlow - 26 November 2004
Harald - you're correct (it's ISO 8859-1, by the way) - but making sure that the declared encoding of the document is the actual encoding of the document is not trivial. It's better to encode, if you can.

http://www.xml.com/pub/a/2004/07/21/dive.html explains it much better than I can. :-)
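
One way to get output that does not depend on the declared encoding is to produce pure ASCII. A minimal sketch in Python 3, assuming numeric character references are acceptable in place of named entities (the function name to_ascii_html is made up for the example):

```python
def to_ascii_html(text):
    # The built-in 'xmlcharrefreplace' error handler turns every character
    # that cannot be encoded as ASCII into a numeric reference like &#228;.
    # It does not escape "&", "<" or ">"; that has to be done separately.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


print(to_ascii_html('Bärs & Öl'))  # B&#228;rs & &#214;l
```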
Bogdano - 24 January 2005
Isn't it better to use entitydefs.iteritems() instead of .items()?
Peter - 24 January 2005
Actually not. When the elements in the dict are small, iteritems() is only faster than items() once the dictionary holds more than about 1000 entries. I know this because I've done some benchmarking.
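
A benchmark of this kind can be sketched with timeit. The version below is for Python 3, which no longer has iteritems(), so it makes an assumption: list(d.items()) stands in for the old list-building items(), and the d.items() view stands in for iteritems(). The function name and the sizes are arbitrary, and the timings will vary by machine.

```python
import timeit


def time_inversion(size, repeat=100):
    """Time inverting a dict of `size` small items, two ways."""
    data = {str(i): i for i in range(size)}

    def with_list():
        # Materialise all pairs first, like Python 2's items().
        inverted = {}
        for k, v in list(data.items()):
            inverted[v] = k
        return inverted

    def with_view():
        # Iterate lazily, like Python 2's iteritems().
        inverted = {}
        for k, v in data.items():
            inverted[v] = k
        return inverted

    return (timeit.timeit(with_list, number=repeat),
            timeit.timeit(with_view, number=repeat))


for size in (100, 1000, 10000):
    as_list, as_view = time_inversion(size)
    print(size, as_list, as_view)
```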
Gary - 04 March 2005
Peter, thanks for publishing this on the web. You'd think it'd be easier or more intuitive to html-escape characters in Python. Have you thought about submitting your code for inclusion as a global module?

Because I'm such a forward thinker with a knack for coming up with unique names... I think...

from htmlescape import *

:D
Rafael Zanella - 31 March 2007
My lame version:

#!/usr/bin/python

import sys
import os

#http://www.asciitable.com/
#http://www.w3schools.com/tags/ref_entities.asp

#DICT { char : HTML entity }
dicionario = {
# ISO 8859-1 Character Entities
'À' : "&Agrave;", 'Á' : "&Aacute;", 'Â' : "&Acirc;", 'Ã' : "&Atilde;", 'Ä' : "&Auml;", 'Å' : "&Aring;",
'Æ' : "&AElig;", 'Ç' : "&Ccedil;",
'È' : "&Egrave;", 'É' : "&Eacute;", 'Ê' : "&Ecirc;", 'Ë' : "&Euml;",
'Ì' : "&Igrave;", 'Í' : "&Iacute;", 'Î' : "&Icirc;", 'Ï' : "&Iuml;",
'Ð' : "&ETH;", 'Ñ' : "&Ntilde;",
'Ò' : "&Ograve;", 'Ó' : "&Oacute;", 'Ô' : "&Ocirc;", 'Õ' : "&Otilde;", 'Ö' : "&Ouml;", 'Ø' : "&Oslash;",
'Ù' : "&Ugrave;", 'Ú' : "&Uacute;", 'Û' : "&Ucirc;", 'Ü' : "&Uuml;",
'Ý' : "&Yacute;",
'Þ' : "&THORN;", 'ß' : "&szlig;",
'à' : "&agrave;", 'á' : "&aacute;", 'â' : "&acirc;", 'ã' : "&atilde;", 'ä' : "&auml;", 'å' : "&aring;",
'æ' : "&aelig;", 'ç' : "&ccedil;",
'è' : "&egrave;", 'é' : "&eacute;", 'ê' : "&ecirc;", 'ë' : "&euml;",
'ì' : "&igrave;", 'í' : "&iacute;", 'î' : "&icirc;", 'ï' : "&iuml;",
'ð' : "&eth;", 'ñ' : "&ntilde;",
'ò' : "&ograve;", 'ó' : "&oacute;", 'ô' : "&ocirc;", 'õ' : "&otilde;", 'ö' : "&ouml;", 'ø' : "&oslash;",
'ù' : "&ugrave;", 'ú' : "&uacute;", 'û' : "&ucirc;", 'ü' : "&uuml;",
'ý' : "&yacute;", 'þ' : "&thorn;", 'ÿ' : "&yuml;",
};

def main():
    try:
        if sys.argv[1]:
            originalFile = open(sys.argv[1], "r")
            newFile = open(sys.argv[1] + ".RC", "w")

            while 1:
                # Variables
                read = originalFile.readline()
                strHolder = ""

                if not read:
                    break

                for char in read:  # for i in xrange(len(read) - 1)
                    try:
                        if ord(char) > 128:
                            strHolder += dicionario[char]
                        else:
                            strHolder += char
                    except KeyError:  # if the char is extended ASCII but hasn't been included on the dict
                        strHolder += char
                # End for
                print strHolder  ##scaffolding
                newFile.write(strHolder)
            # End while

            # Close-ups
            originalFile.close()
            newFile.close()
        # end if
    except IndexError:
        print "\n\nUsage: toEntities.py <File_Name>\n\n"; return 1
    except IOError:
        print "\n\nFile could not be opened...\n\n"; return 2
# end main

main()
#EOF
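
The hand-typed lookup table can also be generated from the standard library. A rough Python 3 sketch, not a drop-in replacement for the script above (the names ENTITY_TABLE and to_entities are made up for the example); note that this table covers every named entity above ASCII, not just ISO 8859-1:

```python
from html.entities import codepoint2name

# Map each non-ASCII code point that has a named entity to "&name;".
# str.translate accepts a dict keyed by code point.
ENTITY_TABLE = {
    codepoint: '&%s;' % name
    for codepoint, name in codepoint2name.items()
    if codepoint > 127
}


def to_entities(text):
    # Characters without an entry in the table are left unchanged,
    # matching the KeyError fallback in the script above.
    return text.translate(ENTITY_TABLE)


print(to_entities('À la carte, señor'))  # &Agrave; la carte, se&ntilde;or
```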
da8 - 13 January 2009
Thanks a lot for your script. You made my day. So far I have been using a homemade script with a self-defined lookup table for Ä, Ö, Ü and so on. I still wonder why Python struggles with such a frequently required feature. I guess I do not see the full picture, but I would like to have this kind of functionality in the modules concerning XML/HTML. Kind regards, da8
Matt H - 22 October 2009
Hi, Great program. I needed to add one line at the beginning to get it to work:

import re

