reevoolabs : open source technology

July 2, 2009

Talking Reevoo Santa - It’s creepy and open to abuse.

Filed under: Projects - tomlea @ 3:28 pm

Recently the Reevoo office gained a new friend: a 4’6” talking, dancing Santa (because we start Christmas early here). We hooked him up to our now-famous talking rabbit, and now our build notifications are announced by Santa himself.

As one would expect, the developers (myself most definitely included) began to make Santa announce more than just build statuses and deployments. Santa became our favourite way to drag Edwin, our user experience guru, over to our desks (it sure beats turning round and talking).

After a while, other Reevolvers (who can’t hack together the Ruby code to do this) wanted access… and I felt bad about hogging all the fun.

Enter Reevoo Santa:

Reevoo Santa Screenshot

Now the whole office is making Santa say stupid things.

We now come to phase 3: letting the general public (and you too) make Santa say stupid things!

http://santa.reevoo.com/

Please keep the creepy messages to a minimum between 7 and 9pm… the cleaning staff get spooked.

P.S. There is a pool on how long we can keep this live: 10p says we will be forced to take it down due to widespread abuse by Monday.

The Reevoo Santa source code is available on GitHub, and is hosted on Heroku, who host small applications awesomely.

January 8, 2009

detenc: a fast, low-memory character encoding detector

Filed under: Projects — paulbattley @ 5:54 pm

detenc is a fast character encoding detector for Western European text. It can determine whether a file is encoded in US-ASCII, UTF-8, ISO-8859-15, WINDOWS-1252, or something else. It can distinguish ISO-8859-15 and WINDOWS-1252 where there is enough information: this means that Euro signs are handled correctly.

The program was written to help normalise the encoding of very large data feeds (of the order of several gigabytes) at Reevoo. It uses very little memory and can determine the encoding of a two-gigabyte file in under a minute.

We process a lot of data feeds from retailers here at Reevoo. If we’re lucky, we get to specify the format. Often, though, we have to make do with feeds that are already available. The quality of these can be variable, which means that we need to be liberal in what we accept—but not so liberal that we start importing bogus data.

One of the significant variables is character encoding. This is a poorly understood topic in general, and our experience reflects this. We get feeds in US-ASCII, UTF-8, ISO 8859-15 and Windows 1252.

As an aside, ISO 8859-1 is also a possibility. However, given that it doesn’t include the Euro sign, we can reasonably assume that any feeds we receive today are likely to be in ISO 8859-15 (which is very similar).

We need to turn everything into the canonical encoding — UTF-8 — before we start processing. Up until recently, we’d been using iconv for this, attempting each encoding in turn, and falling back to the next on failure. The naive detector loaded the file into memory, fed it through iconv, and wrote it back out. This didn’t work too well on big feeds — and by ‘big’ I mean 2 GB+. Working line by line over 30 million lines was still not good enough.

So I wrote a small C program to do the job. Detecting ASCII is easy: the high bit is never set. UTF-8 is a little harder, but can be done very reliably thanks to the self-synchronising characteristics of its byte sequences. Windows 1252 and ISO 8859-15 have a significant overlap, meaning that a text may be valid in both; in this case, the program selects ISO 8859-15. However, a text that uses a byte value defined in one but not the other can only be in that one encoding. Finally, a text may include byte values outside any of these ranges, in which case its encoding is reported as unknown.
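To make those rules concrete, here is a minimal sketch of the decision logic in C. It is not the detenc source: the names `encoding`, `is_valid_utf8` and `classify` are made up for illustration, it works on an in-memory buffer rather than streaming a file, and it does not use goto. It also rests on one assumption of mine about how to tell the two Latin encodings apart: bytes 0x80 to 0x9F are taken as evidence for Windows 1252 (ISO 8859-15 only has control codes there), and the five bytes Windows 1252 leaves unassigned make the text unknown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type; these names are not taken from detenc. */
typedef enum {
    ENC_ASCII, ENC_UTF8, ENC_ISO_8859_15, ENC_WINDOWS_1252, ENC_UNKNOWN
} encoding;

/* Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;                        /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF;  /* range for the first of them */

        if (c < 0x80) { i++; continue; }     /* plain ASCII byte */
        if (c >= 0xC2 && c <= 0xDF) {
            extra = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {
            extra = 2;
            if (c == 0xE0) lo = 0xA0;        /* reject overlong forms */
            if (c == 0xED) hi = 0x9F;        /* reject UTF-16 surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            extra = 3;
            if (c == 0xF0) lo = 0x90;        /* reject overlong forms */
            if (c == 0xF4) hi = 0x8F;        /* stay below U+110000 */
        } else {
            return 0;                        /* stray continuation or bad lead */
        }
        if (len - i <= extra) return 0;      /* sequence cut off at the end */
        if (buf[i + 1] < lo || buf[i + 1] > hi) return 0;
        for (size_t k = 2; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
        i += extra + 1;
    }
    return 1;
}

/* Classifies a buffer using the rules described in the post. */
encoding classify(const unsigned char *buf, size_t len)
{
    int high = 0, c1 = 0, unassigned = 0;
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (c < 0x80) continue;
        high = 1;                            /* not pure ASCII */
        if (c <= 0x9F) {
            c1 = 1;                          /* assumed Windows 1252 evidence */
            if (c == 0x81 || c == 0x8D || c == 0x8F || c == 0x90 || c == 0x9D)
                unassigned = 1;              /* no character in Windows 1252 */
        }
    }
    if (!high) return ENC_ASCII;
    if (is_valid_utf8(buf, len)) return ENC_UTF8;
    if (unassigned) return ENC_UNKNOWN;
    return c1 ? ENC_WINDOWS_1252 : ENC_ISO_8859_15;  /* overlap: prefer 8859-15 */
}
```

With this sketch, the four bytes of `"caf\xE9"` classify as ISO 8859-15, because the lone 0xE9 is not valid UTF-8 and nothing rules out either Latin encoding. A buffer containing 0x80 (the Euro sign in Windows 1252) classifies as Windows 1252. A real detector for multi-gigabyte feeds would read the file in fixed-size chunks and carry the UTF-8 state across chunk boundaries instead of making two passes over one buffer.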

The program can scan a 2 GB file in under a minute, which is a big improvement, and certainly good enough. It uses a few hundred kilobytes of memory, which makes it about 10,000 times more memory-efficient than the original naive implementation! It also features what I consider to be a legitimate use of goto in the UTF-8 validating state machine.

I’ve uploaded the code to GitHub: detenc.

Grab it, build it, fork it — I hope it’s useful to someone else. It may be a good start for detecting among common encodings used in other locales.
