reevoolabs : open source technology

March 10, 2011

Utopian Text Format № 8

Filed under: rails, ruby, sysadmin — Tags: , , , , , , — adamjohnson @ 2:19 pm

The beginning

In which we learn of a large amount of data and a small amount of corruption

Here at Reevoo we have a big database, quite a big database, around 65GB worth of reviews, products and pricing data from the past 5 years. Useful interesting data; unfortunately we weren’t very careful about where we were getting this data from at the start and now we’ve developed a bit of a problem. In the distant past when we set up our database we must’ve chosen some default settings which meant that every table created was in the Latin-1 character set. However someone obviously knew that this was the wrong character set to store data in so from our rails apps we’ve been merrily writing UTF-8 data to our database for the past 5 years. However since we’ve left the connection as Latin-1 this has all secretly worked without us noticing. As long as we didn’t think about it, we were reading and writing UTF-8 all over the place. But there’s been a bit of a dark side to this happy middle-ground we’ve been working with.

We get a lot of our data from manufacturers and retailers in a variety of XML and CSV (and various mutations thereof) feeds. And as anyone knows you can’t really rely someone else to correctly tell you what encoding something is in (mainly because not nearly enough people understand how it works, especially some poor person who probably just makes a spreadsheet from their database and sends us a copy). Somewhere along the way we realised this and used Iconv and then later detenc to detect which encoding and coerce it into UTF-8. Somewhere along the way and possibly via other avenues some filthy dirty Latin-1 text has snuck into our database.

So now we have a database that declares itself to be in Latin-1, storing mostly UTF-8 data which, when looked at via anything other than a client that assumes UTF-8, is displayed quite wrongly - apart from the occasional case where we do have Latin-1 text and we have a nice £ sign instead of Â£ (or whatever other confusing result you get with munged characters). We decided we need to fix this once and for all, so began a long and arduous adventure into understanding just how mysql deals with character sets, and how we could convert 65GB of data without too much downtime.

The database and its character set

In which we learn how mysql tries to be helpful but ultimately gets in our way

All serious versions of mysql have had character set handling for a while now, and even does transcoding for you and even lets you store data in multiple characters in the same table. Unfortunately telling mysql that the table isn’t in the character set it thinks it is in without it trying to be helpful turned out to be rather hard. If we’d had just Latin-1 data a simple

ALTER TABLE serious_data CONVERT TO CHARACTER SET utf8;

Would’ve been all it took, simples. Not so for us unfortunately, the next option we found was using the CONVERT function to convert each column in place. So if the table structure of serious_data was:


CREATE TABLE `serious_data` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `name` varchar(255) DEFAULT NULL,
    `description` varchar(255) DEFAULT NULL,
    PRIMARY KEY (`id`)
) DEFAULT CHARSET=latin1;

We could write a update statement that converted the name and description columns to binary and then to utf-8 text by calling CONVERT twice like so


UPDATE serious_data SET name = CONVERT(CONVERT(name USING binary) using utf8),
    description = CONVERT(CONVERT(description USING binary) using utf8);

This is all good apart from the fact that it requires us to identify every column that needs to be converted (we have a lot) and write a lot of sql, which ran surprisingly slowly.

Converting an entire table as fast as the hard disk can manage

In which we solve the 90% of our problem, leaving only the other 90%

After a large amount of a thinking, fighting and nerfing around Tom came up with the idea of selecting into an outfile as binary, then reading back in the exact same thing to a table that now knew it was in UTF-8 - and mysql would be none the wiser. The read back in could be done to a separate table and then, since RENAME TABLE is atomic, we could do this without any applications that had to read off the database ever noticing a change.


CREATE TABLE `serious_data_utf8` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `name` varchar(255) DEFAULT NULL,
    `description` varchar(255) DEFAULT NULL,
    PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8;
SELECT * FROM serious_data INTO OUTFILE 'serious_data.tsv';
LOAD DATA INFILE 'serious_data.tsv' INTO TABLE serious_data_utf8 CHARACTER SET binary;
RENAME TABLE serious_data TO serious_data_latin1, serious_data_utf8 TO serious_data;
DROP TABLE serious_data_latin1;

This worked brilliantly, and surprisingly fast (pretty much the only bottle neck was the disk speed) and for most of our database tables this is all we needed to do.
It even works over replication since the replication slaves actually get fed the in-file via the replication stream. However for the few tables where we had mixed up the encoding we had a bit more work to do.

Guestimating the encoding of our data

In which we make some sweeping assumptions and hope everything works out in the end

Since we’ve not been safely checking that the data is in fact UTF-8, we’ve ended up in a situation where some data was saved to the database actually in Latin-1 to a Latin-1 encoded table (what are the odds?).  Since most of our data is English text, the overlap of possible characters was low and the only mixups are pound signs, a few accented characters and the dreaded smart quotes that come from some poor soul using a Microsoft Office application. To deal with this I broke out my Python skills (we mostly use Ruby at Reevoo, but Python has had reliable unicode support since the year 2000 (also know as the future) whereas I still don’t know where Ruby stands on Unicode support, though 1.9 is supposedly decent) and wrote this little encoding helper:


import sys
import string

def force_utf8(fragment):
    try:
        return fragment.decode('utf8')
    except:
        try:
            return fragment.replace('xe2x80?', 'xe2x80x9d').decode('utf8')
        except:
            return fragment.decode('latin1')

with open(sys.argv[1],'rb') as infile:
    with open(sys.argv[2],'wb') as outfile:
        for line in infile:
            fragments = line.split("t")
            outfile.write(u"t".join(force_utf8(frag) for frag in fragments).encode('utf8')) 

This is to be run over the dumpfile before reading it in. Mostly this makes sense: we iterate over every line in the tab-separated files that mysql writes, reading the file in with the assumption it’s UTF-8 and then writing it back out as UTF-8. Mostly this is a no-op, but in the cases where the string is not valid UTF-8, we try to decode it as Latin-1. However there is one extra fun step on our journey to sensibly stored data that snuck up on us. The dreaded smart quotes.

Somehow, through a combination of smoke, mirrors and a few pretty women providing distraction, some text that has been encoded in UTF-8 with curly opening and closing quotes has then been treated as Latin-1 with the resulting invalid characters replaced with question marks and then re-encoded back into UTF-8.

You may want to skip over this bit because it will make your brain hurt. Starting with the invalid bytes we had and this guide from Microsoft for Code Page 1252 we worked out that, with characters like the opening and closing quotes, Unicode codepoints U+201C and U+201D, which in UTF-8 work out to be the byte sequences E2 80 9C and E2 80 9D were then encoded as Latin-1.  However, there is no character in CP-1252 for the byte 9D so it was replaced with a question mark, which is the byte 3F. The streams had then been encoded back to UTF-8 leaving us with an invalid byte sequence of E2 80 3F so before trying Latin-1 as a last resort encoding, we replace the ? with 9D and retry to decode it as UTF-8.

A complete replacement of the database

In which we dump every table and restore it in a consistent valid encoding

Once we’d solved all these problems we could get around to actually cleaning up our data. We’re quite lucky in that most of our infrastructure and applications are read-only; there is only one user facing application that writes to the main database, which is only used internally, and many background tasks which take messages of queues and then write to the database. We have several databases that replicate of this master database as slaves for the readonly apps.

So we managed to do the complete dump and restore in a day, with all our public facing websites still up, by taking down all the pollers that took messages off the queues and stoping the internal admin application.

What an adventure. I now know more than I ever want to about character encoding.

Afterword

I put this post up because we found it very hard to find much useful information dealing with character encodings and altering tables once you had corrupted or incorrect data. Hopefully what we’ve done will be of use to someone else out there, or someone will be able to point out a much simpler way it could’ve all been done.

February 4, 2010

Spriter: Easy CSS sprites

Filed under: Uncategorized — georgebrock @ 5:56 pm

CSS Sprites

CSS Sprites is a useful technique that allows you to combine lots of small images into a single large image. Using this combined image as the background for various elements on your page (with a CSS background-position rule to make sure that only the relevant part is displayed) reduces the number of HTTP requests and makes the page load faster. (If you aren’t already familiar with this technique, there’s an excellent article on A List Apart)

Maintaining sprites is hard, let’s go shopping

The problem with CSS sprites is maintenance: Adding a new sprite is far more complex than adding an ordinary background image, and removing a sprite is even worse, because if you rearrange the remaining sprites you have to adjust your CSS to match.

There are various tools to generate sprites, but they are all more complex than a good old CSS background-image rule. After trying out several approaches, and being frustrated by everything from lack of transparency support to needless configuration steps to generated CSS that did far more than just set a background image, we decided to build something new.

Easy as CSS

What we wanted was a CSS sprite generator that would let us add a sprite with a single, simple CSS rule anywhere in our stylesheets that would set the element’s background without messing with its other properties. So we made Spriter.

Our .css files have now become .spriter files and contain rules that look something like this:

span.icon {
  width: 16px;
  height: 16px;
  -spriter-background: 'icon.png';
}

When we run these files through Spriter, a shared sprite image is created and the .spriter files are converted to CSS:

span.icon {
  width: 16px;
  height: 16px;
  background: url(/images/sprites.png) 0 -32px no-repeat; /* icon.png */
}

In our development environment, the .spriter files are converted dynamically by a Rack middleware. This makes the development experience very similar to using traditional CSS backgrounds: If you want to replace one sprite image with another you just change the -spriter-background rule and refresh the page. If you remove a -spriter-background rule that image won’t be included in the combined sprite image anymore.

When we’re ready to move our changes to production, our deploy scripts use Spriter to generate static CSS and a static sprite image which keeps everything nice and fast.

What are you waiting for?

If you want to use Spriter (or help make it more awesome) you’ll find the code and usage instructions on GitHub.

November 24, 2009

The Daemons are Dead (long live the Daemons)

Filed under: daemons, rails, ruby, sysadmin, unix — matthouse @ 6:50 pm

We have several processes at Reevoo that need to be run round the clock. They’re pretty long running but variable, so aren’t really suitable for running with cron. The solution to this is to background them, but how do we run background tasks that need access to our Rails applications model?

We settled on the Daemons gem, a gem originally hosted on rubyforge that uses Process::fork to fork your daemon code in the background and close all open file descriptors, making Ruby code behaves as a standard UNIX daemon. Unfortunately the Daemons gem is pretty old now (it doesn’t look like it’s been updated since March 2008) and is missing a few essential features.

Firstly, we found that in certain situations the daemon won’t respond to a standard SIGTERM() this makes managing stuff with most monitoring systems a real pain. It  can also cause deployment headaches when you have to manually kill -9 various processes and clear up their log files on every code update. Sysadmin headache!

This problem has thankfully been solved by the use of RapLeafs’ daemon_extension code which is basically a bundle of hacks to kill -9 a daemon that refuses to die after a certain timeout period. This isn’t perfect by any stretch of the imagination, but from a pragmatists point of view: It’ll do!

The other main problem is the configurability of the log and pid file directories. Previously you could only create your daemon with one directory configured and everything would live there. We like to store our applications pid files in shared memory (/dev/shm on our RHEL and CentOS boxes). This is basically a paranoia check to ensure that if any runaway code ever forces the machine to reboot, we have no stale pid file issues and Monit can start all the applications back up with minimal sysadmin intervention. Obviously storing daemon logs in shared memory is not optimal.

And so to the point of this article: I’ve patched the Daemons library with the pid directory fix and rolled in the daemon_extensions functionality (bear in mind this has only been tested on RedHat and Mac OS X) into my Github. Which you can grab here.

Fork away! And please send me pull requests :) let’s keep the daemons gem functionality going strong.

October 16, 2009

Problems with Cookie testing in Rails 2.3

Filed under: Uncategorized — craigsmith @ 4:38 pm

Pre Rails 2.3, when you wanted to set the value of a cookie in a test you had to:

def test_should_set_user_name
  #set cookie value
  @request.cookies['visitor_name'] = CGI::Cookie.new('visitor_name', 'Dave')

  post :login, :name => 'Dan'
  assert_equal 'Dan', cookies['visitor_name'].value
end

But in 2.3 you can forget about all that CGI::Cookie crap and concentrate on the sweet code you actually want to test:

def test_should_set_user_name
  #set the cookie value
  @request.cookies['visitor_name'] = 'Dave'

  post :login, :name => 'Dan'
  assert_equal 'Dan', cookies['visitor_name'].value
end

Pretty!

The problem comes when you want to ensure that ‘Dave’ has been logged out. Consider this test:

def test_should_log_user_out
  @request.cookies['visitor_name'] = 'Dave'
  post :logout
  assert_nil cookies['visitor_name']
end

This test always passes! No matter whether the cookie is removed in the logout action or not.

The solution

‘cookies’ is defined in ActionController::TestProcess as:

def cookies
  @response.cookies
end

When you ask for cookies in your test, what you’re actually getting is @response.cookies which doesn’t include the values of the cookies set in the request. This means that cookies['visitor_name'], as far as the test is concerned, is never set, making it pass incorrectly.

I’ve written a simple patch which has recently been merged in to Rails. It merges the @response cookie with the @request cookie and returns that, rather than just returning the @response.cookie. So ActionController::TestProcess becomes:

def cookies
  @request.cookies.merge(@response.cookies)
end

If you’re interested you can find the patch here.

Happy testing!

July 9, 2009

Copying a VM between Xen hosts

Filed under: Uncategorized — louisg @ 2:06 pm

We recently moved a Xen virtual machine from one Xen host to another. The process involves simply copying across its disk partition (in this case an LVM2 partition using the device mapper) and copying across its config file:

1) Copy across the VM partition device.
On target host:


lvcreate -L10G -n vm1-disk xen
nc -l 12345 | dd of=/dev/mapper/xen-vm1--disk conv=nocreat bs=16065b

On source host:


xm shutdown vm1
dd if=/dev/mapper/vm1-disk bs=16065b | nc xen2 12345

2) Copy across the VM config file.


scp /etc/xen/vm1.cfg root@xen2:/etc/xen/

Some further explanation is probably needed for the first step. Creating the logical volume on the target host in advance ensures that the partition, once copied across, remains a block device rather than as an image file. Netcat (nc) provides a fast mechanism to transfer the partition over the ether (use SSH if you’re sensitive about your data). As for the flags to dd, the block size (bs) is set to 16065 bytes, the number of sectors in a cylinder so I’m told (it worked for me), and the nocreat flag tells dd not to overwrite the block device.

Note: remember to shutdown the VM first! This is an offline copy. If you want to minimise downtime I’m sure an LVM snapshot would work too.

July 3, 2009

Testing Apache and mod_rewrite using Test::Unit

Filed under: Uncategorized — matthouse @ 11:04 am

Here at Reevoo we (like many others) use Apache as our webserver of choice and with this comes the venerable mod_rewrite.

Mod_rewrite can be used for a lot more than just redirecting pages though, you can use it for forward and reverse proxying, redirection and url rewriting based on various factors such as the HTTP host or request uri.  However there are myriad ways in which to shoot yourself in the foot!

We love testing, so when an article about testing mod_rewrite rules using Test::Unit by the guys at Viget Labs popped up in my feed reader I quickly popped round to take a look.

Using the redirect tester you can easily define shoulda style tests by doing something like this


class ReevooRedirectTest < HTTPRedirectTest
  self.domain = 'www.reevoo.com'
  should_redirect '/decidewhattobuy/blog', :to => '/decidewhattobuy', :permanant => true
  should_redirect '/blog', :to => '/decidewhattobuy'
end

This is very cool and makes working with mod_rewrite much less painful than it can be!

The original code is a series of gists hosted on the vigetlabs github page and to make them easier to use and manage I packaged it up as a gem, which you can install as follows:

sudo gem sources -a http://gems.github.com && sudo gem install shadowaspect-http_redirect_test

and use it in your code like this:

require 'http_redirect_test'

have fun!

July 2, 2009

Talking Reevoo Santa - It’s creepy and open to abuse.

Filed under: Projects — Tags: , , , — tomlea @ 3:28 pm

Recently the Reevoo office gained a new friend. A 4’6” talking, dancing Santa (because we start Christmas early here). So we hooked him up to our now famous talking rabbit, and now our build notifications are announced by Santa Himself.

As one would expect, the developers (myself most definitely included) began to make Santa announce more than just build statuses and deployments. Santa became our favourite way to drag Edwin, our user experience guru over to our desks (it sure beats turning round and talking).

After some time of this, other Reevolvers (who can’t hack together the ruby code to do this) wanted access… and I felt bad about hogging all the fun.

Enter Reevoo Santa:

Reevoo Santa Screenshot

Reevoo Santa Screenshot

Now the whole office is making Santa say stupid things.

We now come to phase 3, letting the general public (and you too) make Santa say stupid things!

http://santa.reevoo.com/

Please keep the creepy messages to a minimum between 7 and 9pm… the cleaning staff get spooked.

p.s. there is a pool on how long we can keep this live for, 10p says we will be forced to take it down due to wide spread abuse by Monday.

The Reevoo Santa source code is available on github, and is hosted  on Heroku, who host small applications awesomely.

January 8, 2009

detenc: a fast, low-memory character encoding detector

Filed under: Projects — paulbattley @ 5:54 pm

detenc is a fast character encoding detector for Western European text. It can
determine whether a file is encoded in US-ASCII, UTF-8, ISO-8859-15,
WINDOWS-1252, or something else. It can distinguish ISO-8859-15 and
WINDOWS-1252 where there is enough information: this means that Euro signs are
handled correctly.

The program was written to help normalise the encoding of very large data feeds
(of the order of several gigabytes) at Reevoo. It uses very little memory and
can determine the encoding of a two-gigabyte file in under a minute.

We process a lot of data feeds from retailers here at Reevoo. If we’re lucky, we get to specify the format. Often, though, we have to make do with feeds that are already available. The quality of these can be variable, which means that we need to be liberal in what we accept—but not so liberal that we start importing bogus data.

One of the significant variables is character encoding. This is a poorly understood topic in general, and our experience reflects this. We get feeds in:

As an aside, ISO 8859-1 is also a possibility. However, given that it doesn’t include the Euro sign, we can reasonably assume that any feeds we receive today are likely to be in ISO 8859-15 (which is very similar).

We need to turn everything into the canonical encoding — UTF-8 — before we start processing. Up until recently, we’d been using iconv for this, attempting each encoding in turn, and falling back to the next on failure. The naive detector loaded the file into memory, fed it through iconv, and wrote it back out. This didn’t work too well on big feeds — and by ‘big’ I mean 2 GB+. Working line by line over 30 million lines was still not good enough.

So I wrote a small C program to do the job. Detecting ASCII is easy: the high bit is never set. UTF-8 is a little harder, but can be done very reliably thanks to the self-synchronising characteristics of its byte sequences. Windows 1252 and ISO 8859-15 have a significant overlap, meaning that text may be in both; in this case, the program selects ISO 8859-15. However, a text that uses a byte value defined in one but not the other can only be in one encoding. Finally, a text may include byte values outside any of these ranges, in which case it’s unknown.

The program can scan a 2GB file in under a minute, which is a big improvement, and certainly good enough. It uses a few hundred kilobytes of memory, making it about 10,000 times better than the original naive implementation! It also features what I consider to be a legitimate use of goto in the UTF-8 validating state machine.

I’ve uploaded the code to GitHub: detenc.

Grab it, build it, fork it — I hope it’s useful to someone else. It may be a good start for detecting among common encodings used in other locales.

November 14, 2008

A new addition to the Reevoo Office

Filed under: cat, celing — lukeredpath @ 7:00 pm

ceilingcat
Ceiling cat is watching us!

August 19, 2008

bybusy mod accepted by Apache

Filed under: apache, bybusy, load, mongrel — lukeredpath @ 5:35 pm

In a previous article Joel spoke about the problems we were having with our load balancing between Apache and Mongrel and his bybusy mod that attempts to solve the problem.

His patch has now been accepted by Apache and should hopefully make it into a future release.

Older Posts »

Powered by WordPress