The beginning
In which we learn of a large amount of data and a small amount of corruption
Here at Reevoo we have a big database, quite a big database, around 65GB worth of reviews, products and pricing data from the past 5 years. Useful interesting data; unfortunately we weren’t very careful about where we were getting this data from at the start and now we’ve developed a bit of a problem. In the distant past when we set up our database we must’ve chosen some default settings which meant that every table created was in the Latin-1 character set. However someone obviously knew that this was the wrong character set to store data in so from our rails apps we’ve been merrily writing UTF-8 data to our database for the past 5 years. However since we’ve left the connection as Latin-1 this has all secretly worked without us noticing. As long as we didn’t think about it, we were reading and writing UTF-8 all over the place. But there’s been a bit of a dark side to this happy middle-ground we’ve been working with.
We get a lot of our data from manufacturers and retailers in a variety of XML and CSV (and various mutations thereof) feeds. And as anyone knows you can’t really rely someone else to correctly tell you what encoding something is in (mainly because not nearly enough people understand how it works, especially some poor person who probably just makes a spreadsheet from their database and sends us a copy). Somewhere along the way we realised this and used Iconv and then later detenc to detect which encoding and coerce it into UTF-8. Somewhere along the way and possibly via other avenues some filthy dirty Latin-1 text has snuck into our database.
So now we have a database that declares itself to be in Latin-1, storing mostly UTF-8 data which, when looked at via anything other than a client that assumes UTF-8, is displayed quite wrongly - apart from the occasional case where we do have Latin-1 text and we have a nice £ sign instead of £ (or whatever other confusing result you get with munged characters). We decided we need to fix this once and for all, so began a long and arduous adventure into understanding just how mysql deals with character sets, and how we could convert 65GB of data without too much downtime.
The database and its character set
In which we learn how mysql tries to be helpful but ultimately gets in our way
All serious versions of mysql have had character set handling for a while now, and even does transcoding for you and even lets you store data in multiple characters in the same table. Unfortunately telling mysql that the table isn’t in the character set it thinks it is in without it trying to be helpful turned out to be rather hard. If we’d had just Latin-1 data a simple
ALTER TABLE serious_data CONVERT TO CHARACTER SET utf8;
Would’ve been all it took, simples. Not so for us unfortunately, the next option we found was using the CONVERT function to convert each column in place. So if the table structure of serious_data was:
CREATE TABLE `serious_data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`description` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) DEFAULT CHARSET=latin1;
We could write a update statement that converted the name and description columns to binary and then to utf-8 text by calling CONVERT twice like so
UPDATE serious_data SET name = CONVERT(CONVERT(name USING binary) using utf8),
description = CONVERT(CONVERT(description USING binary) using utf8);
This is all good apart from the fact that it requires us to identify every column that needs to be converted (we have a lot) and write a lot of sql, which ran surprisingly slowly.
Converting an entire table as fast as the hard disk can manage
In which we solve the 90% of our problem, leaving only the other 90%
After a large amount of a thinking, fighting and nerfing around Tom came up with the idea of selecting into an outfile as binary, then reading back in the exact same thing to a table that now knew it was in UTF-8 - and mysql would be none the wiser. The read back in could be done to a separate table and then, since RENAME TABLE is atomic, we could do this without any applications that had to read off the database ever noticing a change.
CREATE TABLE `serious_data_utf8` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`description` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8;
SELECT * FROM serious_data INTO OUTFILE 'serious_data.tsv';
LOAD DATA INFILE 'serious_data.tsv' INTO TABLE serious_data_utf8 CHARACTER SET binary;
RENAME TABLE serious_data TO serious_data_latin1, serious_data_utf8 TO serious_data;
DROP TABLE serious_data_latin1;
This worked brilliantly, and surprisingly fast (pretty much the only bottle neck was the disk speed) and for most of our database tables this is all we needed to do.
It even works over replication since the replication slaves actually get fed the in-file via the replication stream. However for the few tables where we had mixed up the encoding we had a bit more work to do.
Guestimating the encoding of our data
In which we make some sweeping assumptions and hope everything works out in the end
Since we’ve not been safely checking that the data is in fact UTF-8, we’ve ended up in a situation where some data was saved to the database actually in Latin-1 to a Latin-1 encoded table (what are the odds?). Since most of our data is English text, the overlap of possible characters was low and the only mixups are pound signs, a few accented characters and the dreaded smart quotes that come from some poor soul using a Microsoft Office application. To deal with this I broke out my Python skills (we mostly use Ruby at Reevoo, but Python has had reliable unicode support since the year 2000 (also know as the future) whereas I still don’t know where Ruby stands on Unicode support, though 1.9 is supposedly decent) and wrote this little encoding helper:
import sys
import string
def force_utf8(fragment):
try:
return fragment.decode('utf8')
except:
try:
return fragment.replace('xe2x80?', 'xe2x80x9d').decode('utf8')
except:
return fragment.decode('latin1')
with open(sys.argv[1],'rb') as infile:
with open(sys.argv[2],'wb') as outfile:
for line in infile:
fragments = line.split("t")
outfile.write(u"t".join(force_utf8(frag) for frag in fragments).encode('utf8'))
This is to be run over the dumpfile before reading it in. Mostly this makes sense: we iterate over every line in the tab-separated files that mysql writes, reading the file in with the assumption it’s UTF-8 and then writing it back out as UTF-8. Mostly this is a no-op, but in the cases where the string is not valid UTF-8, we try to decode it as Latin-1. However there is one extra fun step on our journey to sensibly stored data that snuck up on us. The dreaded smart quotes.
Somehow, through a combination of smoke, mirrors and a few pretty women providing distraction, some text that has been encoded in UTF-8 with curly opening and closing quotes has then been treated as Latin-1 with the resulting invalid characters replaced with question marks and then re-encoded back into UTF-8.
You may want to skip over this bit because it will make your brain hurt. Starting with the invalid bytes we had and this guide from Microsoft for Code Page 1252 we worked out that, with characters like the opening and closing quotes, Unicode codepoints U+201C and U+201D, which in UTF-8 work out to be the byte sequences E2 80 9C and E2 80 9D were then encoded as Latin-1. However, there is no character in CP-1252 for the byte 9D so it was replaced with a question mark, which is the byte 3F. The streams had then been encoded back to UTF-8 leaving us with an invalid byte sequence of E2 80 3F so before trying Latin-1 as a last resort encoding, we replace the ? with 9D and retry to decode it as UTF-8.
A complete replacement of the database
In which we dump every table and restore it in a consistent valid encoding
Once we’d solved all these problems we could get around to actually cleaning up our data. We’re quite lucky in that most of our infrastructure and applications are read-only; there is only one user facing application that writes to the main database, which is only used internally, and many background tasks which take messages of queues and then write to the database. We have several databases that replicate of this master database as slaves for the readonly apps.
So we managed to do the complete dump and restore in a day, with all our public facing websites still up, by taking down all the pollers that took messages off the queues and stoping the internal admin application.
What an adventure. I now know more than I ever want to about character encoding.
Afterword
I put this post up because we found it very hard to find much useful information dealing with character encodings and altering tables once you had corrupted or incorrect data. Hopefully what we’ve done will be of use to someone else out there, or someone will be able to point out a much simpler way it could’ve all been done.


