One of the most exciting things about Heroku is that you can scale with a single click – just add more dynos and you can immediately handle higher load. This post is about our experience hosting a high-load application on Heroku, and about a situation where adding more dynos does not help.
While the throughput of our application was increasing day after day, we started to notice more and more tolerated and frustrated requests. The problem was that these were fast requests with nothing to optimize, yet sometimes their processing time exceeded 15 seconds.
After investigating, we found the reason: Request Queuing Time. If you look at the queuing time chart, the average is pretty good, but the maximums can be enormous.
To see the maximums, you have to create a custom dashboard and add a chart for queuing time maximums.
Why is Queuing Time so high? Maybe we need more dynos? But adding more dynos doesn’t help.
So let’s see what Queuing Time on Heroku is, how it works, and what we can do about it.
The cause: the Heroku routing mesh
Let’s take a look at the Heroku docs to understand queuing time better. When a request reaches Heroku, it’s passed from a load balancer to the “routing mesh”. The routing mesh is a custom Erlang solution based on MochiWeb that routes requests to a specific dyno. Each dyno has its own request queue, and the routing mesh pushes requests to a dyno queue. When a dyno is available, it picks up a request from its queue and your application processes it. The time between when the routing mesh receives the request and when the dyno picks it up is the queuing time.
Each request has an X-Heroku-Queue-Wait-Time HTTP header with the queuing time value for that particular request; that’s how New Relic knows what to show on its charts. The key point for us in the routing scheme is request distribution. The Heroku docs say:
The routing mesh uses a random selection algorithm for HTTP request load balancing across web processes.
That means that even if a dyno is stuck processing a long-running request, it will still get more requests in its queue. If a dyno serves a 15-second request and is routed another request that ends up taking 100ms, that second request will take 15.1 seconds to complete. That’s why Heroku recommends putting all long-running actions in background jobs.
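Since the queue wait ends up in the Rack env, you can watch for extreme values yourself. Here is a minimal middleware sketch – the class name and the 1-second threshold are our own illustrative choices, not anything Heroku or New Relic ships:

```ruby
# Logs a warning whenever the Heroku routing mesh reports a long queue
# wait for a request. The X-Heroku-Queue-Wait-Time header shows up in
# the Rack env as HTTP_X_HEROKU_QUEUE_WAIT_TIME (value in milliseconds).
class LogQueueWait
  def initialize(app)
    @app = app
  end

  def call(env)
    wait_ms = env["HTTP_X_HEROKU_QUEUE_WAIT_TIME"].to_i
    warn "long queue wait: #{wait_ms}ms" if wait_ms > 1000
    @app.call(env)
  end
end
```

Insert it near the top of your middleware stack so it sees every request.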
These queuing time peaks were a major pain in the neck for us. Under high load (~10-15k requests per minute), a single stuck dyno could result in hundreds of timeouts. Here are some ways to minimize request queuing:
- Drop long-running requests
Heroku recommended dropping all long-running requests within the application using Rack::Timeout or, after we switched from Thin to Unicorn, by setting a timeout in Unicorn.
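The idea behind Rack::Timeout can be sketched with Ruby’s standard Timeout module. This is a simplified stand-in (RequestTimeout is a hypothetical name), not the real gem, which handles edge cases far more carefully:

```ruby
require "timeout"

# Simplified stand-in for Rack::Timeout: abort any request that runs
# longer than the limit, so a stuck dyno frees up instead of holding
# its whole queue hostage.
class RequestTimeout
  def initialize(app, seconds)
    @app = app
    @seconds = seconds
  end

  def call(env)
    Timeout.timeout(@seconds) { @app.call(env) }
  rescue Timeout::Error
    [503, { "Content-Type" => "text/plain" }, ["Request timed out"]]
  end
end
```

With Unicorn, the equivalent is the `timeout` directive in the Unicorn config file, which makes the master kill workers that exceed the limit.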
- Move long-running actions to background jobs
Long-running requests, such as uploading or downloading CSVs of contacts, can be moved to background jobs.
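In miniature, the pattern looks like this – here an in-process Queue and worker Thread stand in for a real job system such as Resque or Delayed Job running in a worker dyno (the job and method names are made up for illustration):

```ruby
require "thread"

# The web request only enqueues a job and returns immediately; the slow
# work happens off the request path. In production the queue would be
# Redis- or DB-backed and the worker would run in a separate worker dyno.
JOBS = Queue.new

def enqueue_csv_export(user_id)
  JOBS << { job: :csv_export, user_id: user_id }  # fast: just a push
  :accepted                                       # render a "we're working on it" page
end

worker = Thread.new do
  loop do
    job = JOBS.pop
    break if job == :stop
    # ...the slow part (building and uploading the CSV) happens here...
  end
end

status = enqueue_csv_export(42)
JOBS << :stop
worker.join
```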
- Splitting into multiple applications
Each application is an identical Heroku application that serves a different type of request on its own subdomain. So, for example, you can move database-heavy reports to a second app. This worked in the short run, but managing multiple applications on Heroku was not ideal:
- “Single push” deployment became a pain
- Many add-ons we had in our original app (logging, SSL, etc.) needed to be added to the new applications
- We had to monitor several New Relic accounts
Our solution: Slow-Fast Request Balancer
With this solution we got the same results as with “Splitting into multiple applications”, but without the additional pain. We set up an HAProxy balancer with two queues, one for “slow” and one for “fast” requests. By limiting parallel processing in the “slow” queue, we always have free workers for the “fast” queue.
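In outline, the HAProxy configuration looks something like this. The backend names, URL paths, ports, and maxconn values here are purely illustrative – the real setup is the subject of our next post:

```
frontend web
    bind *:8080
    # mark known-slow endpoints; match whatever URLs are slow in your app
    acl is_slow path_beg /reports /exports
    use_backend slow if is_slow
    default_backend fast

backend fast
    server app 127.0.0.1:3000 maxconn 8

backend slow
    # cap concurrency so slow requests can never occupy every worker;
    # excess slow requests wait in HAProxy's queue instead
    server app 127.0.0.1:3000 maxconn 2
```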
Do you have this problem?
If you’re running an application under high load on Heroku, you may well be facing this problem without knowing it. New Relic showed that our application was very fast (< 100ms) and the average queuing time very low even when these problems were at their worst. So check your Queuing Time maximums with a custom dashboard graph!
For most applications, it probably doesn’t matter if a fraction of a percent of your requests take a long time. But sometimes we can’t afford to have any requests that take more than a couple seconds, even if our average response time is well under 100ms.
Each of the possible solutions resulted in fewer timeouts, but the one that got us good results fastest was the HAProxy balancer. Stay tuned: in our next post we will describe how we implemented it for a Heroku application.
We are thinking of making a Heroku HAProxy add-on. Are you interested?