I am fetching tweets from Twitter and storing them in a database for future use. I am using UTF-8 encoding in my driver, utf8mb4_bin collation on my VARCHAR fields, and utf8mb4_general_ci as the server collation. The problem is that when I insert a value into a VARCHAR field and the text contains any binary data, the insert throws an exception, presumably because a utf8 VARCHAR does not accept binary.

Here is an example: I fetch the text from here, try to insert it into my database, and get this error:

Incorrect string value: '\xF0\x9F\x98\xB1\xF0\x9F...' for column 'fullTweet' at row 1

My guess is that the two emoticons are causing this. How do I get rid of them before inserting the tweet text in my database?

Update:

Looks like I can manually enter the emoticons. I ran this query:

INSERT INTO `tweets`(`id`, `createdAt`, `screenName`, `fullTweet`, `editedTweet`) VALUES (450,"1994-12-19","john",_utf8mb4 x'F09F98B1',_utf8mb4 x'F09F98B1')

and this is what the row in the table looks like:

  • Are you sure that everything is configured correctly for utf8mb4 support? character-set-server=utf8mb4 in server settings, characterEncoding=UTF-8 in connection URL and correct collation for the field? Commented Apr 19, 2014 at 16:22
  • At the end of the connection URL I add ?useUnicode=true&characterEncoding=UTF-8. The "Server Connection Collation" is utf8mb4_general_ci and the field collation is utf8mb4_bin (I double-checked them a million times). Commented Apr 19, 2014 at 16:27
  • And what is character-set-server? Commented Apr 19, 2014 at 16:31
  • Server charset: UTF-8 Unicode (utf8) (I guess this is the one you are talking about). Commented Apr 19, 2014 at 16:32
  • What does show variables like 'character_set_server' show? Commented Apr 19, 2014 at 16:34

2 Answers

1

You can remove non-ASCII characters from the tweet string before inserting it.

tweetStr = tweetStr.replaceAll("[^\\p{ASCII}]", "");
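
If dropping every non-ASCII character is too aggressive (the one-liner above also removes accented letters and non-Latin scripts), a narrower variant might strip only the characters that need four bytes in UTF-8, which is what the bytes in the error message look like. The following is a minimal sketch rather than a tested drop-in; the names `TweetSanitizer` and `stripSupplementary` are made up for illustration.

```java
// Hypothetical helper: removes only code points above U+FFFF (emoji and
// other supplementary characters), which take four bytes in UTF-8.
final class TweetSanitizer {
    private TweetSanitizer() {}

    static String stripSupplementary(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            // Keep everything in the Basic Multilingual Plane.
            .filter(cp -> !Character.isSupplementaryCodePoint(cp))
            .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

It would be called the same way as the one-liner, for example `tweetStr = TweetSanitizer.stripSupplementary(tweetStr);`. Accented letters and other characters that fit in three bytes or fewer are left alone.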

1

It looks like utf8mb4 support is still not configured correctly.

In order to use utf8mb4 in your fields you need to do the following:

  • Set character-set-server=utf8mb4 in your my.ini or my.cnf. Only character-set-server really matters here; the other settings don't.

  • Add characterEncoding=UTF-8 to the connection URL:

    jdbc:mysql://localhost:3306/db?characterEncoding=UTF-8
    
  • Configure the collation of the field
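
To see what the connection actually reports for these settings, something along these lines could be run from the application. This is a rough sketch under assumptions: `CharsetCheck` and its methods are hypothetical names, the `Connection` is assumed to have been opened with the JDBC URL above, and which variables to check is left to the caller (this answer singles out character_set_server).

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of MySQL Connector/J.
final class CharsetCheck {
    private CharsetCheck() {}

    // Reads the character_set* variables as seen by this connection.
    static Map<String, String> readCharsetVariables(Connection conn) throws SQLException {
        Map<String, String> vars = new LinkedHashMap<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SHOW VARIABLES LIKE 'character_set%'")) {
            while (rs.next()) {
                // Column 1 is Variable_name, column 2 is Value.
                vars.put(rs.getString(1), rs.getString(2));
            }
        }
        return vars;
    }

    // Returns the requested variable names whose value is missing or not utf8mb4.
    static List<String> notUtf8mb4(Map<String, String> vars, String... names) {
        List<String> bad = new ArrayList<>();
        for (String name : names) {
            if (!"utf8mb4".equalsIgnoreCase(vars.get(name))) {
                bad.add(name);
            }
        }
        return bad;
    }
}
```

A call such as `CharsetCheck.notUtf8mb4(CharsetCheck.readCharsetVariables(conn), "character_set_server")` would return an empty list when the server setting has taken effect.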

4 Comments

I still have the same problem. I literally set everything there was to utf8mb4. I went to my.ini and uncommented the line character_set_server=utf8mb4, then ran the query you sent me before, and it returns character_set_server=utf8mb4. I altered the database collation to utf8mb4_general_ci. I altered the table's and each row's collation to utf8mb4_general_ci. The URL is exactly how you wrote it. I really can't understand what is going on here.
What if you try to insert the literal value _utf8mb4 x'F09F98B1' into that column manually?
Looks like that works (if I did it right). I will update my post with the query I ran and how it is represented in the table.
I found what's wrong. Looks like I don't need to change the connection URL as well, since I'm setting character-set-server. Now it sort of works: it saves the text, but with question marks instead of emoticons. That is good enough for me for now.
