
I have a script that's looping through a database and doing some BeautifulSoup processing on each string, along with replacing some text with other text, etc.

This works most of the time; however, some HTML blobs seem to contain Unicode text, which breaks the script with the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 112: ordinal not in range(128)

I'm not sure what to do in this case. Does anyone know of a module or function that forces all text in a string to a standardized encoding such as UTF-8?

All the HTML blobs in the database came from feedparser (downloading RSS feeds, storing them in the db).

  • Do you know which encoding was used? If not, you have to guess it, convert to unicode, and re-save the data as UTF-8. BeautifulSoup is usually good at guessing encodings, but you could also try chardet. Commented Jan 12, 2013 at 13:18
  • try using .encode("utf8") or .encode("utf-8") Commented Jan 12, 2013 at 13:19
  • It's difficult to help without seeing the script that produces the error. Commented Jan 12, 2013 at 13:25
  • @Amyth - I did try .encode() and also .decode().encode() to no success unfortunately. Commented Jan 12, 2013 at 13:25
  • Unless you are calling .decode("ascii") somewhere, you really need to show the code. Commented Jan 12, 2013 at 13:37

4 Answers

Before you do any further processing with your string variable:

clean_str = unicode(str_var_with_strange_coding, errors='ignore')

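Characters that cannot be decoded are skipped. Because no encoding is given, Python 2 falls back to its default codec (ASCII unless it has been changed), so every non-ASCII byte is dropped. It's not elegant, since you make no attempt to recover values that might be meaningful, but it is effective.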
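To make the trade-off concrete, here is a small sketch comparing the error handlers. It uses `bytes.decode`, which behaves the same way in Python 2 and 3, rather than the Python 2 only `unicode()` call above. The sample bytes are made up for illustration.

```python
# Hypothetical sample: UTF-8 bytes containing a right single quote (E2 80 99).
raw = b"It\xe2\x80\x99s fine"

# ASCII + "ignore": every non-ASCII byte is silently dropped.
dropped = raw.decode("ascii", "ignore")

# ASCII + "replace": each undecodable byte becomes U+FFFD, so the loss is visible.
marked = raw.decode("ascii", "replace")

# Decoding with the encoding the data is really in keeps the character.
correct = raw.decode("utf-8")

print(repr(dropped))
print(repr(marked))
print(repr(correct))
```

If the real encoding is known, decoding with it loses nothing; the `ignore` handler is better kept as a last resort.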


Make sure you really understand the difference between Unicode and UTF-8: they are not the same thing, which surprises many people. See "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".

What is the encoding of your DB? Is it really UTF-8, or do you only assume that it is? If it contains blobs in arbitrary encodings, you have a problem, because you cannot reliably guess the encoding. When you read from the database, decode the blob to unicode and use unicode from then on in your code.

But let's assume your database is UTF-8. Then follow the rule "decode early, encode late": use unicode everywhere inside your program, and decode or encode only when you read from or write to the database, display text, write to a file, and so on.

Unicode and encodings are a bit of a pain in Python 2.x. Fortunately, in Python 3 all text is unicode.

As for BeautifulSoup, use the latest version, 4.
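A minimal sketch of "decode early, encode late", assuming the stored data is UTF-8. The helper names (`read_blob`, `process`, `write_blob`) and the sample value are hypothetical, and the code runs on both Python 2 and 3.

```python
def read_blob(raw_bytes, encoding="utf-8"):
    """Boundary in: turn stored bytes into text as soon as they are read."""
    return raw_bytes.decode(encoding)


def process(text):
    """Inside the program: work on text only, never on raw bytes."""
    return text.replace(u"\u2019", u"'").strip()


def write_blob(text, encoding="utf-8"):
    """Boundary out: turn text back into bytes just before storing it."""
    return text.encode(encoding)


stored = b"  It\xe2\x80\x99s a feed entry  "  # hypothetical database value
result = write_blob(process(read_blob(stored)))
print(result)
```

Only the two boundary functions know about encodings; everything between them handles text, so byte strings and unicode never get mixed.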


Since you don't want to show us your code, I'm going to give a general answer that hopefully helps you find the problem.

When you first get the data out of the database and fetch it with fetchone, you need to convert it into a unicode object. It is good practice to do this as soon as you have your variable, and then re-encode it only when you output it.

import MySQLdb

db = MySQLdb.connect()  # connection arguments omitted
cur = db.cursor()
cur.execute("SELECT col FROM the_table LIMIT 10")
# Decode to unicode right away. Use whatever encoding the text is really in;
# UTF-8 is the likely candidate, and chardet can help if you are unsure.
xml = cur.fetchone()[0].decode('utf-8')

After you run xml through BeautifulSoup, you might encode the string again if it is being saved into a file or you might just leave it as a Unicode object if you are re-inserting it into the database.
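If you are not sure every row is UTF-8, one possible approach is to try a short list of candidate encodings in order. The sketch below uses only the standard library and is not a substitute for chardet; the `to_text` helper and the candidate list are assumptions for illustration.

```python
def to_text(raw_bytes, candidates=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying each candidate encoding in order."""
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this fallback never fails.
    return raw_bytes.decode("latin-1"), "latin-1"


utf8_row = b"caf\xc3\xa9"   # hypothetical row stored as UTF-8
legacy_row = b"caf\xe9"     # hypothetical row stored in a legacy 8-bit encoding
print(to_text(utf8_row))
print(to_text(legacy_row))
```

UTF-8 goes first because it is strict: bytes in an 8-bit legacy encoding rarely happen to form valid UTF-8, whereas the 8-bit codecs accept almost anything.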


Well, after a couple more hours of googling, I finally came across a solution that eliminated all the decode errors. I'm still fairly new to Python (heavy PHP background) and didn't understand character encoding.

In my code I had a .decode('utf-8'), and after that I did some .replace(str(beatiful_soup_tag),'') statements. The solution ended up being as simple as changing every str() to unicode(). After that, not a single issue.
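A likely explanation, sketched below under the assumption that str() on a BeautifulSoup tag returns UTF-8 encoded bytes in Python 2: when a unicode string meets a byte string in .replace(), Python 2 implicitly decodes the bytes as ASCII, and that is the step that fails. The sketch performs the decode explicitly so it runs on both Python 2 and 3; the sample markup is made up.

```python
# Text that has already been decoded, as in the .decode('utf-8') step above.
text = b"<p>It\xe2\x80\x99s here</p><b>caf\xc3\xa9</b>".decode("utf-8")

# Stand-in for str(tag) under Python 2 (assumed to be UTF-8 encoded bytes).
tag_as_bytes = b"<b>caf\xc3\xa9</b>"

# Python 2 performs this ASCII decode implicitly when unicode and bytes are mixed.
try:
    tag_as_bytes.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)

# The fix: keep both operands as text (what unicode(tag) gives you in Python 2).
tag_as_text = tag_as_bytes.decode("utf-8")
cleaned = text.replace(tag_as_text, u"")
print(cleaned)
```

In Python 3, mixing bytes and text in `replace` raises a TypeError instead of attempting an implicit decode, which makes this class of bug easier to spot.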

Answer found on: http://ubuntuforums.org/showthread.php?t=1212933

I sincerely apologize to the commenters who asked me to post the code. What I thought was rock solid and not the issue turned out to be quite the opposite, and I'm sure they would have caught the problem right away! I'll not make that mistake again! :)

1 Comment

Glad it works for you now. Nevertheless, if your input data comes from arbitrary internet pages, expect the next error sooner or later, as some pages deliver mixed encodings. A well-known case is currency signs in an ISO 8859 encoding inside an otherwise fully UTF-8 page. If you run into these errors, remember the errors='ignore' flag when you convert a string to unicode.
