HTML entity fixer
entity fixer, html entity, entitydefs, htmlentitydefs
25th of November 2004
Here's a little program I wrote recently that converts characters typed literally into HTML content into their proper HTML entities. For example, this is incorrect:
<b>Bärs & Öl</b>
But this is correct:
<b>B&auml;rs &amp; &Ouml;l</b>
To demonstrate, I have set up a little test page where you can try converting your own impure HTML content.
Run test program
Here's the source code for the program:
import re
from htmlentitydefs import entitydefs

# Invert the mapping: character -> entity name, e.g. '\xe4' -> 'auml'
entitydefs_inverted = {}
for k, v in entitydefs.items():
    entitydefs_inverted[v] = k

# Matches any character that has a named entity
_badchars_regex = re.compile('|'.join([re.escape(v) for v in entitydefs.values()]))
# Matches anything that already looks like an entity, e.g. &auml; or &#228;
_been_fixed_regex = re.compile(r'&\w+;|&#[0-9]+;')

def html_entity_fixer(text, skipchars=[], extra_careful=1):
    # if extra_careful we don't attempt to do anything to
    # the string if it might have been converted already.
    if extra_careful and _been_fixed_regex.findall(text):
        return text

    if type(skipchars) == type('s'):
        skipchars = [skipchars]

    # Collect the distinct bad characters that actually occur in the text
    keyholder = {}
    for x in _badchars_regex.findall(text):
        if x not in skipchars:
            keyholder[x] = 1

    # Ampersands first, so the entities inserted below aren't escaped again
    text = text.replace('&', '&amp;')
    text = text.replace('\x80', '&euro;')
    for each in keyholder.keys():
        if each == '&':
            continue

        better = entitydefs_inverted[each]
        if not better.startswith('&#'):
            better = '&%s;' % entitydefs_inverted[each]

        text = text.replace(each, better)
    return text
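For anyone reading this on Python 3, where the htmlentitydefs module has become html.entities, here is a minimal sketch of the same idea. It is not the code above: the name html_entity_fixer_py3 is made up for illustration, it walks the text once instead of calling replace() repeatedly, and it drops the '\x80' special case because the euro sign is an ordinary character with its own entry in codepoint2name.

```python
import re
from html.entities import codepoint2name

# Anything that already looks like an entity, e.g. &auml; or &#228;
_ALREADY_ENTITY = re.compile(r"&\w+;|&#[0-9]+;")


def html_entity_fixer_py3(text, skipchars=(), extra_careful=True):
    """Replace characters that have a named HTML entity with that entity.

    Hypothetical Python 3 port; the name and signature are assumptions.
    """
    # Same guard as the original: leave the text alone if it may
    # already have been converted.
    if extra_careful and _ALREADY_ENTITY.search(text):
        return text
    if isinstance(skipchars, str):
        skipchars = [skipchars]

    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ch == "&":
            # Like the original, ampersands are always escaped.
            out.append("&amp;")
        elif name is not None and ch not in skipchars:
            out.append("&%s;" % name)
        else:
            out.append(ch)
    return "".join(out)


# Skip < and > so that the markup itself survives.
print(html_entity_fixer_py3("<b>Bärs & Öl</b>", skipchars=["<", ">"]))
# prints: <b>B&auml;rs &amp; &Ouml;l</b>
```

Passing "<" and ">" as skipchars matters here: both have named entities (lt and gt), so without it the tags would be escaped as well.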
Comment
James Harlow -
26th November 2004
Harald - you're correct (iso8859-1, by the way) - but making sure that the declared encoding of the document is the actual encoding of the document is not trivial. It's better to encode, if you can.
http://www.xml.com/pub/a/2004/07/21/dive.html explains it much better than I can. :-)
Bogdano -
24th January 2005
Isn't it better to use entitydefs.iteritems() instead of .items()?
Peter -
24th January 2005
Actually not. When the elements in the dict are small, iteritems() is only faster than items() once the dictionary holds more than about 1000 items. I know this because I've done some benchmarking.
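A benchmark along those lines can be sketched with the standard timeit module. This is an assumed setup, not the benchmark Peter ran. Python 3 has no iteritems(), so the sketch compares the closest equivalents: list(d.items()), which builds a full list first like the old items(), against plain d.items(), which iterates lazily like the old iteritems(). The function name time_iteration and the dict sizes are made up for illustration, and the timings depend on the machine.

```python
import timeit


def time_iteration(size, repeats=20):
    """Time two ways of looping over a dict of `size` small items.

    Returns (seconds using list(d.items()), seconds using d.items()).
    """
    d = {str(i): i for i in range(size)}

    def with_list():
        # Builds the whole list up front, like Python 2's items().
        for k, v in list(d.items()):
            pass

    def with_view():
        # Iterates lazily, like Python 2's iteritems().
        for k, v in d.items():
            pass

    return (timeit.timeit(with_list, number=repeats),
            timeit.timeit(with_view, number=repeats))


for size in (10, 1000, 100000):
    as_list, as_view = time_iteration(size)
    print("%7d items: list %.5fs, view %.5fs" % (size, as_list, as_view))
```

With only a couple of hundred entries in entitydefs, and the loop running once at import time, either choice is unlikely to matter for this program.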
Gary -
4th March 2005
Peter, thanks for publishing this on the web. You'd think it would be easier or more intuitive to html-escape characters in Python. Have you thought about submitting your code for inclusion as a global module?
Because I'm such a forward thinker with a knack for coming up with unique names... I think...
from htmlescape import *
:D
Rafael Zanella -
30th March 2007
My lame version:
#!/usr/bin/python
# -*- coding: iso-8859-1 -*-
# Assumes this script and the input file are both saved as ISO 8859-1.
import sys

# http://www.asciitable.com/
# http://www.w3schools.com/tags/ref_entities.asp

# DICT { char : HTML entity }
dicionario = {
    # ISO 8859-1 Character Entities
    'À' : "&Agrave;", 'Á' : "&Aacute;", 'Â' : "&Acirc;", 'Ã' : "&Atilde;", 'Ä' : "&Auml;", 'Å' : "&Aring;",
    'Æ' : "&AElig;", 'Ç' : "&Ccedil;",
    'È' : "&Egrave;", 'É' : "&Eacute;", 'Ê' : "&Ecirc;", 'Ë' : "&Euml;",
    'Ì' : "&Igrave;", 'Í' : "&Iacute;", 'Î' : "&Icirc;", 'Ï' : "&Iuml;",
    'Ð' : "&ETH;", 'Ñ' : "&Ntilde;",
    'Ò' : "&Ograve;", 'Ó' : "&Oacute;", 'Ô' : "&Ocirc;", 'Õ' : "&Otilde;", 'Ö' : "&Ouml;", 'Ø' : "&Oslash;",
    'Ù' : "&Ugrave;", 'Ú' : "&Uacute;", 'Û' : "&Ucirc;", 'Ü' : "&Uuml;",
    'Ý' : "&Yacute;",
    'Þ' : "&THORN;", 'ß' : "&szlig;",
    'à' : "&agrave;", 'á' : "&aacute;", 'â' : "&acirc;", 'ã' : "&atilde;", 'ä' : "&auml;", 'å' : "&aring;",
    'æ' : "&aelig;", 'ç' : "&ccedil;",
    'è' : "&egrave;", 'é' : "&eacute;", 'ê' : "&ecirc;", 'ë' : "&euml;",
    'ì' : "&igrave;", 'í' : "&iacute;", 'î' : "&icirc;", 'ï' : "&iuml;",
    'ð' : "&eth;", 'ñ' : "&ntilde;",
    'ò' : "&ograve;", 'ó' : "&oacute;", 'ô' : "&ocirc;", 'õ' : "&otilde;", 'ö' : "&ouml;", 'ø' : "&oslash;",
    'ù' : "&ugrave;", 'ú' : "&uacute;", 'û' : "&ucirc;", 'ü' : "&uuml;",
    'ý' : "&yacute;", 'þ' : "&thorn;", 'ÿ' : "&yuml;",
}

def main():
    try:
        if sys.argv[1]:
            originalFile = open(sys.argv[1], "r")
            newFile = open(sys.argv[1] + ".RC", "w")

            while 1:
                read = originalFile.readline()
                strHolder = ""

                if not read:
                    break

                for char in read:
                    try:
                        if ord(char) >= 128:
                            strHolder += dicionario[char]
                        else:
                            strHolder += char
                    except KeyError:
                        # the char is extended ASCII but hasn't been included in the dict
                        strHolder += char

                print strHolder  ## scaffolding
                newFile.write(strHolder)

            originalFile.close()
            newFile.close()
    except IndexError:
        print "\n\nUsage: toEntities.py <file_name>\n\n"
        return 1
    except IOError:
        print "\n\nFile could not be opened...\n\n"
        return 2

sys.exit(main())
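The hand-typed dictionary can also be generated. Below is a rough Python 3 sketch of the same file converter, not Rafael's script: it builds the lookup table from html.entities.codepoint2name and applies it with str.translate. The names LATIN1_TABLE, to_entities and convert_file are made up for illustration, the input is assumed to be ISO 8859-1, and the table covers every named entity from 128 to 255, so it also converts a few characters the dictionary above leaves out, such as the non-breaking space.

```python
from html.entities import codepoint2name

# {code point: "&name;"} for the non-ASCII part of Latin-1 only.
LATIN1_TABLE = {
    cp: "&%s;" % name
    for cp, name in codepoint2name.items()
    if 128 <= cp <= 255
}


def to_entities(text):
    """Replace Latin-1 characters above ASCII with named entities."""
    return text.translate(LATIN1_TABLE)


def convert_file(path, encoding="iso-8859-1"):
    """Write an entity-encoded copy of `path` to `path + '.RC'`.

    Characters without a named entity fall back to numeric
    references (&#NNN;) through the xmlcharrefreplace handler.
    """
    with open(path, encoding=encoding) as src, \
            open(path + ".RC", "w", encoding="ascii",
                 errors="xmlcharrefreplace") as dst:
        for line in src:
            dst.write(to_entities(line))


print(to_entities("Ação e coração"))
# prints: A&ccedil;&atilde;o e cora&ccedil;&atilde;o
```

Like the script above, this leaves &, < and > untouched, so it is only suitable for files that are already valid HTML apart from the accented characters.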
Peter,
I learned that using umlauts directly can be quite correct if you set the encoding to latin-1 (aka ISO something) or UTF-8, especially with XHTML.
So the only BAD chars will be <, > and & ...
Harald
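Harald's approach can be sketched in Python 3 with the standard html.escape function, which did not exist when this was written. With quote=False it escapes only &, < and >, and the umlauts are then written out as bytes in whatever encoding the document declares. The sample text is made up for illustration.

```python
import html

text = "Bärs & Öl <3"

# quote=False leaves " and ' alone and escapes only &, < and >.
escaped = html.escape(text, quote=False)
print(escaped)
# prints: Bärs &amp; Öl &lt;3

# The umlauts stay as real characters; the bytes depend on the
# encoding, which has to match the one the document declares.
print(escaped.encode("iso-8859-1"))
print(escaped.encode("utf-8"))
```

This only works if the declared encoding really is the one the bytes were written in, which is the catch James points out in his reply.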