reevoolabs : open source technology

July 3, 2009

Testing Apache and mod_rewrite using Test::Unit

Filed under: Uncategorized — matthouse @ 11:04 am

Here at Reevoo we (like many others) use Apache as our webserver of choice and with this comes the venerable mod_rewrite.

mod_rewrite can be used for a lot more than just redirecting pages, though: you can use it for forward and reverse proxying, redirection and URL rewriting based on various factors such as the HTTP host or request URI. However, there are myriad ways in which to shoot yourself in the foot!

We love testing, so when an article about testing mod_rewrite rules using Test::Unit by the guys at Viget Labs popped up in my feed reader I quickly popped round to take a look.

Using the redirect tester, you can easily define shoulda-style tests by doing something like this:


class ReevooRedirectTest < HTTPRedirectTest
  self.domain = 'www.reevoo.com'
  should_redirect '/decidewhattobuy/blog', :to => '/decidewhattobuy', :permanent => true
  should_redirect '/blog', :to => '/decidewhattobuy'
end

This is very cool and makes working with mod_rewrite much less painful than it can be!
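For the curious, the check behind a test like this boils down to comparing a status code and a Location header. The following is only a rough sketch of that idea, not the gem's actual code: `redirect_ok?` and `fetch_redirect` are hypothetical helpers, and the choice of 301 for permanent and 302 otherwise is an assumption.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of http_redirect_test): decides whether a
# response's status and Location header match the redirect we expected.
# Assumption: permanent redirects are 301, everything else is 302.
def redirect_ok?(status, location, expected_path, permanent: false)
  expected_status = permanent ? 301 : 302
  return false unless status.to_i == expected_status

  # Location may be absolute or relative; compare only the path part.
  URI.parse(location.to_s).path == expected_path
end

# Hypothetical helper: issues a HEAD request and returns [status, location].
# It talks to a real server, so it is defined here but not called.
def fetch_redirect(domain, path)
  response = Net::HTTP.start(domain, 80) { |http| http.head(path) }
  [response.code, response['location']]
end
```

With these, `redirect_ok?(*fetch_redirect('www.reevoo.com', '/blog'), '/decidewhattobuy')` would be the moral equivalent of the second `should_redirect` line above.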

The original code is a series of gists hosted on the vigetlabs GitHub page. To make them easier to use and manage, I packaged them up as a gem, which you can install as follows:

sudo gem sources -a http://gems.github.com && sudo gem install shadowaspect-http_redirect_test

and use it in your code like this:

require 'http_redirect_test'

Have fun!

July 2, 2009

Talking Reevoo Santa - It’s creepy and open to abuse.

Filed under: Projects — tomlea @ 3:28 pm

Recently the Reevoo office gained a new friend. A 4’6” talking, dancing Santa (because we start Christmas early here). So we hooked him up to our now famous talking rabbit, and now our build notifications are announced by Santa Himself.

As one would expect, the developers (myself most definitely included) began to make Santa announce more than just build statuses and deployments. Santa became our favourite way to drag Edwin, our user experience guru, over to our desks (it sure beats turning round and talking).

After some time of this, other Reevolvers (who can’t hack together the ruby code to do this) wanted access… and I felt bad about hogging all the fun.

Enter Reevoo Santa:

Reevoo Santa Screenshot

Now the whole office is making Santa say stupid things.

We now come to phase 3, letting the general public (and you too) make Santa say stupid things!

http://santa.reevoo.com/

Please keep the creepy messages to a minimum between 7 and 9pm… the cleaning staff get spooked.

P.S. There is a pool on how long we can keep this live; 10p says we will be forced to take it down due to widespread abuse by Monday.

The Reevoo Santa source code is available on GitHub, and the app is hosted on Heroku, who host small applications awesomely.

January 8, 2009

detenc: a fast, low-memory character encoding detector

Filed under: Projects — paulbattley @ 5:54 pm

detenc is a fast character encoding detector for Western European text. It can
determine whether a file is encoded in US-ASCII, UTF-8, ISO-8859-15,
WINDOWS-1252, or something else. It can distinguish ISO-8859-15 and
WINDOWS-1252 where there is enough information: this means that Euro signs are
handled correctly.

The program was written to help normalise the encoding of very large data feeds
(of the order of several gigabytes) at Reevoo. It uses very little memory and
can determine the encoding of a two-gigabyte file in under a minute.

We process a lot of data feeds from retailers here at Reevoo. If we’re lucky, we get to specify the format. Often, though, we have to make do with feeds that are already available. The quality of these can be variable, which means that we need to be liberal in what we accept—but not so liberal that we start importing bogus data.

One of the significant variables is character encoding. This is a poorly understood topic in general, and our experience reflects this. We get feeds in US-ASCII, UTF-8, ISO 8859-15 and Windows 1252.

As an aside, ISO 8859-1 is also a possibility. However, given that it doesn’t include the Euro sign, we can reasonably assume that any feeds we receive today are likely to be in ISO 8859-15 (which is very similar).

We need to turn everything into the canonical encoding — UTF-8 — before we start processing. Up until recently, we’d been using iconv for this, attempting each encoding in turn, and falling back to the next on failure. The naive detector loaded the file into memory, fed it through iconv, and wrote it back out. This didn’t work too well on big feeds — and by ‘big’ I mean 2 GB+. Working line by line over 30 million lines was still not good enough.

So I wrote a small C program to do the job. Detecting ASCII is easy: the high bit is never set. UTF-8 is a little harder, but can be done very reliably thanks to the self-synchronising characteristics of its byte sequences. Windows 1252 and ISO 8859-15 have a significant overlap, meaning that text may be in both; in this case, the program selects ISO 8859-15. However, a text that uses a byte value defined in one but not the other can only be in one encoding. Finally, a text may include byte values outside any of these ranges, in which case it’s unknown.
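To make the decision order concrete, here is a rough Ruby sketch of the same classification. It is not the C code from detenc: it reads the whole string into memory, leans on Ruby's built-in UTF-8 validation instead of a hand-written state machine, and the rules for telling the two 8-bit encodings apart are my assumptions. Specifically, it treats bytes 0x80 to 0x9F as unused by ISO 8859-15 text, and the five bytes Windows 1252 leaves unassigned as "unknown".

```ruby
# Bytes that Windows-1252 leaves unassigned.
CP1252_UNASSIGNED = [0x81, 0x8D, 0x8F, 0x90, 0x9D].freeze

# Returns 'US-ASCII', 'UTF-8', 'ISO-8859-15', 'WINDOWS-1252' or 'UNKNOWN'.
# Assumption: this mirrors the order of checks described in the post, not
# necessarily the exact rules detenc applies.
def detect_encoding(data)
  bytes = data.bytes

  # ASCII: the high bit is never set.
  return 'US-ASCII' if bytes.all? { |b| b < 0x80 }

  # UTF-8: let Ruby validate the byte sequences.
  return 'UTF-8' if data.dup.force_encoding(Encoding::UTF_8).valid_encoding?

  # 0x80..0x9F holds printable characters in Windows-1252 only.
  c1_bytes = bytes.select { |b| b.between?(0x80, 0x9F) }

  # Nothing in that range: the text fits both, so prefer ISO-8859-15.
  return 'ISO-8859-15' if c1_bytes.empty?
  return 'UNKNOWN' if c1_bytes.any? { |b| CP1252_UNASSIGNED.include?(b) }

  'WINDOWS-1252'
end
```

The Euro sign is the classic example: it is byte 0xA4 in ISO 8859-15 but 0x80 in Windows 1252, so `detect_encoding("\xA4 100".b)` and `detect_encoding("\x80 100".b)` land in different buckets.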

The program can scan a 2GB file in under a minute, which is a big improvement, and certainly good enough. It uses a few hundred kilobytes of memory, making it about 10,000 times better than the original naive implementation! It also features what I consider to be a legitimate use of goto in the UTF-8 validating state machine.

I’ve uploaded the code to GitHub: detenc.

Grab it, build it, fork it — I hope it’s useful to someone else. It may be a good start for detecting among common encodings used in other locales.

November 14, 2008

A new addition to the Reevoo Office

Filed under: cat, ceiling — lukeredpath @ 7:00 pm

ceilingcat
Ceiling cat is watching us!

August 19, 2008

bybusy mod accepted by Apache

Filed under: apache, bybusy, load, mongrel — lukeredpath @ 5:35 pm

In a previous article Joel spoke about the problems we were having with our load balancing between Apache and Mongrel and his bybusy mod that attempts to solve the problem.

His patch has now been accepted by Apache and should hopefully make it into a future release.

August 15, 2008

Collecting exceptions asynchronously using Beanstalkd and Hoptoad

Filed under: beanstalkd, exceptions, hoptoad, rails — lukeredpath @ 4:40 pm

hoptoad

Whilst there are already a number of options for collecting exceptions from your production Rails apps (such as exception_notifier), none of them are designed for collecting exceptions asynchronously. This means you are introducing a potential bottleneck in your request loop should any part of your exception collection process (e.g. sending the exception to a mail server) slow down for any reason.

At Reevoo, we already use beanstalkd for a number of applications so it seemed like a perfect fit for collecting our exceptions. The end result of this was our exception_messaging plugin which builds on top of our beanstalk_messaging plugin allowing you to configure your Rails app to send exceptions (along with supporting request, environment and session data) to a beanstalkd queue.
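The shape of the idea is roughly as follows. This is only an illustrative sketch, not the plugin's code: Ruby's in-process `Queue` stands in for a beanstalkd queue, and `enqueue_exception` and `drain_exceptions` are hypothetical names. The point is that the request loop only serialises the exception and hands it off; anything slow happens in a separate consumer.

```ruby
require 'yaml'

# Stand-in for a beanstalkd queue (assumption: a real setup would put the
# message onto beanstalkd via a client library instead).
EXCEPTION_QUEUE = Queue.new

# Hypothetical producer: called from the request loop. It only builds a
# payload and enqueues it, so it returns quickly.
def enqueue_exception(error, request_info = {})
  payload = {
    'class'     => error.class.name,
    'message'   => error.message,
    'backtrace' => error.backtrace || [],
    'request'   => request_info
  }
  EXCEPTION_QUEUE << YAML.dump(payload)
end

# Hypothetical consumer: runs elsewhere and can take as long as it likes.
def drain_exceptions
  reports = []
  reports << YAML.load(EXCEPTION_QUEUE.pop) until EXCEPTION_QUEUE.empty?
  reports
end

begin
  raise ArgumentError, 'bad input'
rescue StandardError => e
  enqueue_exception(e, 'url' => '/search')
end
```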

We’ve had this running in production for a while now but we weren’t really doing anything useful with the data. We started work on a basic Rails application that consumed the messages from the queue and stored them in a database with the idea being that we could build some basic reporting tools on top of this data, but what with work being work, we never found the time to do this properly.

Enter Hoptoad

Hoptoad is a third-party service that you can use to post your exception data to. It is exactly the kind of app we would have liked to have built. Fortunately for us, somebody has already done the hard work. It collects and aggregates your exception data and presents it in an easy-to-use interface. Oh, and currently, it’s free.

Hoptoad provide a simple Rails plugin that you can drop in to your Rails project for synchronous integration with their service. They also publish a RESTful API for retrieving and posting data.

Out of the box, the plugin is probably fine for most users. For high-traffic websites where performance is vital and/or you want to avoid creating a dependency on a third-party service, you want a more robust solution. This is where Beanstalkd comes in.

Rather than re-inventing the wheel, I wanted to reuse the existing plugin code to send the exceptions collected from our queue; however, the parts of the plugin code that I wanted to use (the code that wrapped around their API and submitted exception data) were tightly coupled to the Rails integration and configuration process. Fortunately, the code was hosted on GitHub so I created a fork and began work on refactoring their plugin into loosely coupled components. At the time of writing, much of the code has been refactored and can be used in a standalone manner although there is still some work left to do. I’m hoping it will be merged into the master repository by the guys at Hoptoad (who have been very receptive to my enquiries and suggestions). You can grab the refactored plugin from a branch in my GitHub fork.

Consuming queued exceptions and sending them to Hoptoad

Once the refactoring work was done, sending our collected exception data was fairly trivial. I had to make a few changes to our exception_messaging plugin to make sure we were collecting all of the data we needed (simplifying a lot of the code in the process) but once we were collecting exception data in the correct format, all that was left to do was to write a simple Ruby script that used the updated plugin and a Beanstalk::QueuePoller (courtesy of our beanstalk_messaging plugin) to process the messages:

require 'beanstalk_configuration' # sets up $queue_manager
require 'hoptoad/standalone'

Hoptoad.configure do |config|
  config.api_key = 'your_projects_hoptoad_api_key'
end

notifier = Hoptoad::Notifier.from_config(Hoptoad.config)

Beanstalk::QueuePoller.new($queue_manager).poll(:the_exceptions_queue) do |message|
  notifier.notify(message.ybody)
end

For more information on how to poll Beanstalk queues, see the instructions for the beanstalk_messaging plugin. To configure your app to use beanstalkd for collecting exceptions, see the instructions for the exception_messaging plugin.

Any problems or questions? Leave a comment.

July 30, 2008

Fixing uneven load balancing between Apache and Mongrel for Ruby on Rails applications

Filed under: apache, mongrel, monit, proctitle, rails, siege, sysadmin — joelgluth @ 10:48 am

Our site has some pages that can take a while to return. Their existence is probably, in the short to medium term, inevitable, and in general performance is good. Anecdotally though, users and testers have been experiencing poor performance on pages that we know should render quickly or even instantly (because they’re cached). What is going on?

Setup: Apache, mongrel, RoR

Our main site setup is extremely straightforward: We use Apache and mod_proxy_balancer to field incoming requests, which are then farmed out to a group of Mongrel instances that run our Ruby on Rails application.

Mongrel queues requests and is effectively single-threaded with reference to Rails

Mongrel is multi-threaded, but it locks a mutex as soon as it starts Rails processing. The result for most of us is that it processes one request at a time, potentially with a queue of pending requests on the front. If a request is taking a long time, anything queued up behind it, no matter how trivial, waits. Our pack of Mongrels should be big enough to avoid this situation with the sorts of traffic we get presently, but is it?

Apache’s load balancer

Apache’s mod_proxy comes with two different load balancing methods: “byrequest” (the default) and “bytraffic”. Both of these are historical balancers: they will ensure that cumulative load is distributed evenly between workers, but they don’t particularly care what the current state of a given worker might be. It seems entirely possible that Apache could assign a request hitting a poorly-performing page to worker A, sprinkle short-lived requests to its friends B and C, then assign another request to A because it’s A’s turn – even though A is still busy and B and C are both idle. And so on.

This would certainly explain “fast” pages sometimes coming back really slowly, and the overall server load wouldn’t even have to be very high.
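A toy simulation shows the effect. This is a sketch under loud assumptions: plain round-robin stands in for “byrequest” (the real balancer is more involved), each worker handles one request at a time, and the request names, arrival times and durations are made up.

```ruby
# Each request is [name, arrival_time, duration], in arbitrary ticks.
# Workers are handed requests strictly in turn, regardless of whether
# they are still busy with an earlier one.
def round_robin_waits(worker_count, requests)
  busy_until = Array.new(worker_count, 0)

  requests.each_with_index.map do |(name, arrival, duration), i|
    worker = i % worker_count
    start = [arrival, busy_until[worker]].max # wait if the worker is busy
    busy_until[worker] = start + duration
    { name: name, worker: worker, waited: start - arrival }
  end
end

requests = [
  [:slow_page,  0, 10], # goes to worker 0 and ties it up
  [:fast_a,     1, 1],  # worker 1
  [:fast_b,     2, 1],  # worker 2
  [:front_page, 3, 1]   # worker 0's turn again, although 1 and 2 are idle
]

results = round_robin_waits(3, requests)
```

In this run the last request waits seven ticks behind the slow page even though two workers have nothing to do.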

What are our mongrels actually doing?

Mongrel’s logging and debugging output are not terribly helpful out of the box, at least not for finding out what’s going on with queued requests. Fortunately, the truly awesome proctitle will tell us exactly what we want to know, right there in the output of ps. It’s almost as if other people have had this problem before…

Drop the plugin into Rails, restart the mongrels, and let’s spectate a while.

joel@hiscomputer: $ watch -n 0 'ps ax | grep mongrel' # real-time and everything :)
17209 ?        R     18:22 mongrel_rails [9001/2/1524]: handling 127.0.0.1: GET /reviews/mpn/take_2/tt45920
17213 ?        R     14:25 mongrel_rails [9002/1/1321]: handling 127.0.0.1: GET /reviews/mpn/humax/lu23_td2
17217 ?        R     14:50 mongrel_rails [9003/1/1190]: handling 127.0.0.1: GET /reviews/mpn/navman/s30
17221 ?        S     12:11 mongrel_rails [9005/0/1032]: idle

Brilliant – the interesting bits on the first line are the ‘2’, telling us that the Mongrel on port 9001 has two requests to serve (one active and one queued), and that it’s currently serving a reviews page for Grand Theft Auto IV. Already we can see that Apache has given the 9001 Mongrel a request when it was doing something else, even though the 9005 Mongrel could have served it faster.
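If you want to crunch these lines rather than eyeball them, the bracketed part is easy to pull apart. A small sketch, assuming the format shown in the ps output above; `parse_proctitle` is a hypothetical helper, and the third number is treated as the cumulative request count.

```ruby
# Matches e.g. "mongrel_rails [9001/2/1524]: handling 127.0.0.1: GET /x"
PROCTITLE = %r{mongrel_rails \[(\d+)/(\d+)/(\d+)\]: (.*)\z}

# Hypothetical helper: returns nil for lines that are not Mongrel processes.
def parse_proctitle(line)
  match = PROCTITLE.match(line.chomp)
  return nil unless match

  {
    port:    match[1].to_i, # port the Mongrel listens on
    pending: match[2].to_i, # requests active or queued right now
    total:   match[3].to_i, # assumed: requests handled so far
    status:  match[4]       # "idle" or "handling ..."
  }
end

line = '17209 ?        R     18:22 mongrel_rails [9001/2/1524]: ' \
       'handling 127.0.0.1: GET /reviews/mpn/take_2/tt45920'
info = parse_proctitle(line)
```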

So, our hypothetical scenario is a real one, but how bad does it get? And how much of it is down to unhelpful load-balancing?

Digression: Siege

We use a number of different tools to automate web traffic for testing purposes. Our current favourite back-of-the-napkin number-generator is Siege.
Siege will take a list of URLs and happily blast away at them at a rate and concurrency, and for a duration, that can be tweaked in arbitrary ways to explore different aspects of server performance. In this case we want to see a wide variety of requests, and to squish them temporally to see just how unevenly individual Mongrels end up getting loaded. Since we suspect that live users are hitting this problem, we can just take the request URLs from a day’s worth of our live Apache logs. Something like:

joel@hisproductionserver$ cut -d ' ' -f 7 /var/log/httpd/access.log | perl -p -e 's{^}{http://testserver}' > urls.txt

On my current test box, I have eight Mongrels running. I start with eight concurrent siege users in my .siegerc file, a two-minute siege duration, the urls.txt file generated a moment ago, and with proctitle I see this:

13776 ?        Rl     0:17 mongrel_rails [9000/4/17]: handling 127.0.0.1: GET /reviews/mpn/stoves/600sidlm
13779 ?        Sl     0:04 mongrel_rails [9001/0/20]: idle
13782 ?        Sl     0:05 mongrel_rails [9002/1/19]: handling 127.0.0.1: GET /reviews/mpn/sharp/lc20s5ebk
13785 ?        Rl     0:17 mongrel_rails [9003/1/18]: handling 127.0.0.1: GET /reviews/mpn/stoves/600sidlm
13789 ?        Rl     0:15 mongrel_rails [9004/3/17]: handling 127.0.0.1: GET /reviews/mpn/sharp/lc20s5ewh
13793 ?        Rl     0:13 mongrel_rails [9005/2/17]: handling 127.0.0.1: GET /browse/product_type/routers
13796 ?        Sl     0:12 mongrel_rails [9006/0/18]: idle
13800 ?        Sl     0:12 mongrel_rails [9007/0/18]: idle

We can see that Apache is distributing the load fairly evenly over time (the ‘17’ in ‘[9000/4/17]’), but is doing badly at making sure that load is distributed evenly at any given moment. The 9000 server has 4 pending requests, while 9001, 9006 and 9007 are sitting idle. There is a fairly low number of concurrent users, and already half of them are having an unnecessarily bad time!

Monit as a way of taking snapshots

Like a lot of people, we use monit to make sure our servers are up, and to kick them when they’re down so they get up again. It can be configured to kill and restart Mongrel processes that take more than some number of seconds to respond, but we can repurpose it to take snapshots of ps output and append them to a file instead using a Very Small Shell Script – the camera instead of the elephant gun, if you like. proctitle is just so handy.
Each of our Mongrels has a .monitrc entry that looks like this:

if failed port 9001 protocol http                   # check for response
        with timeout 5 seconds
        then exec /home/joel/bin/slow_request.sh 9001

where, if a server takes more than five seconds to respond, slow_request.sh simply locates the process ID of the Mongrel running on port 9001, and records the output of ps for that process. The results were very suggestive – right at the top, we see lines like this:

17213 ?        S     27:27 mongrel_rails [9002/4/2451]: handling 127.0.0.1: GET /reviews/mpn/henry/numatic
 1148 ?        S      2:48 mongrel_rails [9000/10/240]: handling 127.0.0.1: GET /search
 1148 ?        S      3:32 mongrel_rails [9000/5/314]: handling 127.0.0.1: GET /

and even

 2955 ?        S     11:41 mongrel_rails [9001/17/1010]: handling 127.0.0.1: GET /browse/product_type/Tents

At one point, a single long-running process had managed to force sixteen other site visitors’ requests to wait!
There are quite a few lines (after the server had been monitoring for a while) that were unlike these examples – specifically, they each showed only a single pending request, and it wasn’t necessarily for a slow page. Often, in fact, it would be a request for the front page, which should be quick in the first instance, and cached to within an inch of its life in the second. Performance appears to be degrading over time. We’ll look at that later, because a solution to our uneven load balancing problem has suggested itself.
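The snapshot script itself isn’t reproduced in this post. For anyone wanting to try the same trick, here is a rough Ruby equivalent of the idea rather than the script we actually ran: `mongrel_lines` and `snapshot` are hypothetical names, and matching on the proctitle text (instead of looking up the process ID first) is a simplification.

```ruby
# Picks out the ps lines belonging to the Mongrel on the given port,
# relying on proctitle's "mongrel_rails [port/..." prefix.
def mongrel_lines(ps_output, port)
  ps_output.lines.select { |line| line.include?("mongrel_rails [#{port}/") }
end

# Appends the matching lines to log_path and returns them. ps is only
# run when no output is passed in, which keeps the function easy to test.
def snapshot(port, log_path, ps_output = `ps ax`)
  lines = mongrel_lines(ps_output, port)
  File.open(log_path, 'a') { |file| lines.each { |line| file.write(line) } }
  lines
end
```

Called as `snapshot(9001, '/tmp/slow_requests.log')` (the path is just an example), it would add one line per slow response to the log.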

bybusy mod

Out of the box, Apache offers the two load balancing methods I noted earlier. Being Apache, you can bet that they will have made it fairly easy to add more. And indeed, this turns out to be the case. We added a third “bybusy” lbmethod, able to be configured in the same way as the others. It was initially very simple – for each proxy worker thread in the web server, increment a “busy” counter when assigning a request to that worker. In the post-request hook, decrement the counter again. When choosing a worker, simply pick the worker with the lowest “busy” value.
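The real thing is C inside httpd, but the logic is small enough to sketch in a few lines of Ruby. Treat this as an illustration of the counting scheme described above, not the patch itself; `BusyBalancer` and its method names are made up.

```ruby
# Illustrative model of the initial "bybusy" method.
class BusyBalancer
  def initialize(worker_names)
    @busy = worker_names.to_h { |name| [name, 0] }
  end

  # Pick the worker with the lowest busy count and increment it.
  # On a tie, the worker listed first wins.
  def assign
    name, _count = @busy.min_by { |_name, count| count }
    @busy[name] += 1
    name
  end

  # Post-request hook: the worker has one fewer request in flight.
  def finish(name)
    @busy[name] -= 1
  end
end

balancer = BusyBalancer.new(%w[9000 9001 9002])
first  = balancer.assign # 9000
second = balancer.assign # 9001, because 9000 is now busy
balancer.finish(second)
third  = balancer.assign # 9001 again: idle, and listed before 9002
```

Note how 9002 never gets a look-in here while earlier workers keep coming free.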

Simple to say, and thanks to the general cleanliness and friendliness of the Apache code base, simple to do. Attacking it with Siege resulted in exactly what one would hope for: Mongrels absorbed load evenly even when concurrent requests greatly outnumbered the available workers, and most importantly there was never a case where an incoming request was assigned to a Mongrel that was already doing something while another sat idle.

Refinement

This is more of a nicety than anything else, but it would be nice to know that grabbing a random Mongrel’s log for period X will give me a representative sample of what’s been going on across all of them. As it stands, the bybusy method will, over time, favour Mongrels higher up the load-balancer’s list. A little reflection will show that if traffic is sparse, then this inequity gets much worse.

We ended up refining bybusy a little, by using the “byrequest” method as a tie-breaker between workers with identical “busy” values (most frequently, when they are idle). So we get the moment-to-moment balancing we wanted to start with, but also the cumulative balancing that we had originally and which leads to nicely-balanced log files. A small thing, I know, and Mongrels don’t get tired, but it seemed like a good idea.
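In sketch form, the refinement just adds a second sort key. Again, this is an illustration rather than the httpd code: a simple cumulative request count stands in for the “byrequest” bookkeeping, and the class name is made up.

```ruby
# Illustrative model of "bybusy" with a cumulative-count tie-breaker.
class BusyBalancerWithTieBreak
  def initialize(worker_names)
    @busy   = worker_names.to_h { |name| [name, 0] }
    @served = worker_names.to_h { |name| [name, 0] }
  end

  # Lowest busy count wins; among equally busy workers, prefer the one
  # that has served the fewest requests so far.
  def assign
    name = @busy.keys.min_by { |n| [@busy[n], @served[n]] }
    @busy[name] += 1
    @served[name] += 1
    name
  end

  def finish(name)
    @busy[name] -= 1
  end
end

balancer = BusyBalancerWithTieBreak.new(%w[A B C])

# Sparse traffic: every request finishes before the next one arrives.
sequence = Array.new(6) do
  worker = balancer.assign
  balancer.finish(worker)
  worker
end
```

With sparse traffic the requests now rotate through A, B and C instead of all landing on A.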

UPDATE: This has been submitted as a patch to httpd; for source and progress, see Apache Bugzilla.

July 3, 2008

SimpleConfig on GitHub

Filed under: git, github, simpleconfig — lukeredpath @ 8:36 pm

For all you Git fanatics out there, I’ve just pushed a copy of the SimpleConfig plugin to GitHub, where you can fork and contribute to your heart’s content.

The official repository will always be the public Subversion mirror (once we’ve got it up and running!). My Git fork may contain experimental features or enhancements that haven’t made it into production use, but it gives you the chance to hack away.

SimpleConfig on GitHub.

Welcome to Reevoo Labs

Filed under: Uncategorized — kylemcginn @ 6:03 pm

Welcome to the re-launch of labs.reevoo.com, a place where we (the team at Reevoo) try our best to contribute a little something back to the community.

To kick things off, we’ve brought across some of our existing projects from the previous labs site, but also added a few new things that we felt were ready for prime-time:

  1. Mocha – the Ruby mocking/stubbing library based on JMock syntax, used by Rails core
  2. NEW! Beanstalk Messaging plugin – a Rails plugin for managing, polling and communicating with the excellent Beanstalkd messaging queue
  3. NEW! Simple Config – a Rails plugin that makes it easy to set up application-wide configuration/settings for each of your development environments, separate to Rails’ own environment files, and provides an object-oriented way of accessing those settings throughout your application.
  4. NEW! CTO / Reevoo
