python - how to convert html string to utf-8? Getting UnicodeDecodeError errors

Question

I have a script that loops through a database, runs some BeautifulSoup processing on each string, and replaces some text with other text.

This works most of the time, but some HTML blobs seem to contain Unicode text, which breaks the script with the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 112: ordinal not in range(128)

I'm not sure what to do in this case. Does anyone know of a module or function that forces all text in the string to a standardized encoding such as UTF-8?

All the HTML blobs in the database came from feedparser (downloading RSS feeds and storing them in the database).

Do you know which encoding was used? If not, you have to guess it, convert to unicode, and re-save the data as UTF-8. BeautifulSoup is usually good at guessing encodings, but you may also try chardet.
– Bakuriu, Commented Jan 12, 2013 at 13:18
It's difficult to help without seeing the script that produces the error.
– Fredrick Brennan, Commented Jan 12, 2013 at 13:25
@Amyth - I did try .encode() and also .decode().encode(), unfortunately without success.
– Joe, Commented Jan 12, 2013 at 13:25
Unless you are calling .decode("ascii") somewhere, you really need to show the code.
– Esailija, Commented Jan 12, 2013 at 13:37

Michael · Accepted Answer · 2013-01-12 17:07:19Z


Before you do any further processing with your string variable:

clean_str = unicode(str_var_with_strange_coding, errors='ignore')

Characters that cannot be decoded are skipped. It is not elegant, since it makes no attempt to recover values that may be meaningful, but it is effective.
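To make the trade-off concrete, here is a minimal sketch written for Python 3, where `bytes.decode` plays the role of the Python 2 `unicode(...)` call above. The sample bytes are made up for illustration. With the ASCII codec, `errors='ignore'` drops every non-ASCII byte, while `errors='replace'` at least leaves a visible marker.

```python
# Hypothetical sample: UTF-8 bytes containing a right single quote (E2 80 99).
raw = b"It\xe2\x80\x99s here"

# 'ignore' silently drops every byte the ASCII codec cannot decode.
ignored = raw.decode("ascii", errors="ignore")

# 'replace' substitutes U+FFFD for each undecodable byte, so the loss is visible.
replaced = raw.decode("ascii", errors="replace")

# Decoding with the encoding the data is really in keeps the character.
correct = raw.decode("utf-8")

print(repr(ignored))
print(repr(replaced))
print(repr(correct))
```

If the real encoding is known, decoding with it is usually preferable to ignoring errors, because nothing is lost.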


Community · Accepted Answer · 2017-05-23 12:04:34Z

Make sure you really understand the difference between Unicode and UTF-8. They are not the same thing, which surprises many people. This is explained in the article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".

What is the encoding of your database? Is it really UTF-8, or do you only assume that it is? If it contains blobs with random encodings, you have a problem, because you cannot reliably guess the encoding. When you read from the database, decode the blob to unicode and use unicode from then on in your code.

But let's assume your database is UTF-8. Then you should use unicode everywhere: decode early, encode late. Work with unicode inside your program, and only decode or encode when you read from or write to the database, the display, a file, and so on.
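A rough sketch of "decode early, encode late", assuming Python 3 (where `bytes` corresponds to Python 2 `str` and `str` to Python 2 `unicode`). The function names and the sample data are hypothetical, chosen only for this illustration.

```python
def load_text(raw_bytes, encoding="utf-8"):
    """Decode once, at the input boundary (database, file, socket)."""
    return raw_bytes.decode(encoding)


def process(text):
    """All internal processing works on text, never on raw bytes."""
    return text.replace("\u2019", "'").strip()


def dump_text(text, encoding="utf-8"):
    """Encode once, at the output boundary."""
    return text.encode(encoding)


# Hypothetical stored value: UTF-8 bytes with a curly quote and a euro sign.
stored = b"  Joe\xe2\x80\x99s feed \xe2\x82\xac5  "
result = dump_text(process(load_text(stored)))
print(result)
```

Keeping the decode and encode steps at the edges means the middle of the program never has to care which encoding the data arrived in.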

Unicode and encodings are a bit of a pain in Python 2.x. Fortunately, in Python 3 all text is unicode.

Regarding BeautifulSoup, use the latest version 4.

Fredrick Brennan · Accepted Answer · 2013-01-12 13:50:07Z

Since you don't want to show us your code, I'm going to give a general answer that hopefully helps you find the problem.

When you first get the data out of the database and fetch it with fetchone, you need to convert it into a unicode object. It is good practice to do this as soon as you have your variable, and then re-encode it only when you output it.

import MySQLdb

db = MySQLdb.connect()  # pass your connection parameters here
cur = db.cursor()
cur.execute("SELECT col FROM the_table LIMIT 10")
# Decode with whatever encoding the text is stored in; we're pretty sure
# it's utf-8. If unsure, chardet can help you guess.
xml = cur.fetchone()[0].decode('utf-8')

After you run xml through BeautifulSoup, you might encode the string again if it is being saved into a file or you might just leave it as a Unicode object if you are re-inserting it into the database.
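If the encoding is uncertain and chardet is not available, one simple heuristic is to try a short list of candidate encodings in order. This is only a sketch in Python 3 using the standard library; it is not how chardet works, the helper name `decode_with_fallback` and the candidate list are assumptions, and it can guess wrong because cp1252 accepts almost any byte sequence.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying each candidate in order."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort never fails.
    return raw.decode("latin-1"), "latin-1"


utf8_result = decode_with_fallback(b"caf\xc3\xa9")   # valid UTF-8
legacy_result = decode_with_fallback(b"caf\xe9")     # not valid UTF-8
last_resort = decode_with_fallback(b"\x81")          # undefined in cp1252
print(utf8_result, legacy_result, last_resort)
```

Trying UTF-8 first is the important part: random legacy-encoded text is very unlikely to be valid UTF-8 by accident, so a successful UTF-8 decode is a fairly strong signal.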

Joe · Accepted Answer · 2013-01-12 16:52:23Z


Well, after a couple more hours of googling, I finally came across a solution that eliminated all the decode errors. I'm still fairly new to Python (heavy PHP background) and didn't understand character encoding.

In my code I had a .decode('utf-8') and after that did some .replace(str(beautiful_soup_tag),'') statements. The solution ended up being as simple as changing every str() to unicode(). After that, not a single issue.
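For anyone wondering why that mattered: in Python 2, mixing a byte string (what `str()` returns) with a `unicode` object makes Python decode the byte string implicitly with the ASCII codec, and that is where the error comes from. Python 3 no longer does this implicit conversion, so the sketch below performs the same decode explicitly to reproduce the message. The sample tag is made up.

```python
# Hypothetical tag as bytes, roughly what Python 2's str(tag) would hold.
tag_bytes = b"<p>It\xe2\x80\x99s</p>"

# Python 2 did this decode implicitly when bytes met unicode in .replace().
try:
    tag_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Working with text on both sides (the unicode() fix) avoids the problem.
tag_text = tag_bytes.decode("utf-8")
cleaned = "Before <p>It\u2019s</p> after".replace(tag_text, "")
print(repr(cleaned))
```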

Answer found on: http://ubuntuforums.org/showthread.php?t=1212933

I sincerely apologize to the commenters who asked me to post the code. What I thought was rock solid and not the issue turned out to be quite the opposite, and I'm sure they would have caught it right away. I'll not make that mistake again! :)


Michael Over a year ago

Glad it already works for you. Still, if your input data comes from arbitrary internet pages, expect further errors, since some pages deliver mixed encodings. A well-known case is currency signs in an ISO 8859 encoding inside an otherwise Unicode page. If you run into these errors, remember the errors='ignore' flag when you convert a string to unicode.
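As an alternative to dropping such bytes, a custom error handler can decode just the offending bytes with a legacy codec. This is a sketch for Python 3 using `codecs.register_error` from the standard library; the handler name `cp1252_fallback`, the choice of cp1252 as the legacy encoding, and the sample bytes are all assumptions for the example.

```python
import codecs


def cp1252_fallback(exc):
    """Decode the undecodable bytes as cp1252 instead of dropping them."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position where decoding resumes.
    return bad.decode("cp1252", errors="replace"), exc.end


codecs.register_error("cp1252_fallback", cp1252_fallback)

# Hypothetical mixed input: a legacy pound sign (A3) inside UTF-8 text
# that also contains a properly encoded euro sign (E2 82 AC).
mixed = b"Price: \xa3 5 or \xe2\x82\xac 6"
text = mixed.decode("utf-8", errors="cp1252_fallback")
print(text)
```

This keeps the currency sign instead of losing it, at the cost of guessing which legacy encoding the stray bytes were in.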
