Reflections on Working with Queues

Posted on July 10, 2023 by Tylor Kobierski

I was used to working in B2C spaces where the primary work was done via REST, or bog-standard web forms. In other places, I worked on financial software that ran on mainframes, building massive reports that were batched overnight at specific intervals (you did not want to be the person holding the pager at month end!).

This time, though, when I went in all bright-eyed to a new space, I got to work with something called Camel. Overall, I have been very happy to have been introduced to this.

Camel implements a series of patterns for message handling detailed in a book called Enterprise Integration Patterns. It’s nothing new at all; just a tried and true method of building a certain kind of asynchronous application. When I started, I went to the book’s website and just absorbed as much information as I could about what the patterns did and how all the little bits connected together.

At the end of it, with a little bit of Spring Boot magic to glue things together, it became extremely easy to build a little microservice that executes things queue-to-queue using a very neat little DSL:

from("service:main-queue")
    .process(transformThisMessage)
    .to("service:outbound-writer");

And with that and a little bit of connection config you have a little bit of software that reads from one thing, does stuff to it, and sends it off somewhere else!

I would use this liberally, building little ad-hoc scripts and applications on the fly to reroute documents, modify the contents of messages and headers, and other sorts of things. I even built a little load testing application based on it, since I found the interface much more intuitive than JMeter.

Working with these microservices

A lot of the things that you would normally consider with other sorts of architectures still apply. There are at least two major considerations you need to think about when you’re building a queue-based microservice:

- How do you want to scale? What you can do is not related to the speed that messages actually come into your service, and your service may not necessarily scale off of merely one queue.
- How do you want to handle errors and service degradation? How many messages you can functionally process is highly dependent on

On scaling

At the end of the day your service isn’t judged by how many messages it processes; it’s punished based on how many messages you fail to process. The ideal state of a queue driven microservice is that it’s always processing, and there are 0 messages on the queue at any given time. So in general, the thing you end up scaling on is often not the depth of the queue but how long messages sit on it. You could have 5 billion messages on your queue and nobody will care if they are processed within the next few seconds. They will, however, care if it’s taking more than 5 minutes to get their message through your service even if you “only” generally have 50 messages sitting on your queue at any given time. So when things start noticeably sitting on the queue server, you will need to start new instances to deal with the load.
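A minimal sketch of that scaling rule in plain Java, not tied to Camel or any particular queue server. The class name, the latency target, and the instance ceiling are all assumptions made up for illustration; the point is only that the input is the age of the oldest waiting message rather than the queue depth.

```java
import java.time.Duration;

/** Hypothetical scaling rule: react to how long messages wait, not how many are queued. */
final class LatencyScaler {
    private final Duration maxWait;   // assumed latency target, e.g. 5 seconds
    private final int maxInstances;   // assumed ceiling, set by what the dependencies can take

    LatencyScaler(Duration maxWait, int maxInstances) {
        this.maxWait = maxWait;
        this.maxInstances = maxInstances;
    }

    /** Returns the desired instance count given the age of the oldest message on the queue. */
    int desiredInstances(int currentInstances, Duration oldestMessageAge) {
        if (oldestMessageAge.compareTo(maxWait) > 0) {
            // Messages are noticeably sitting on the queue: add an instance, up to the ceiling.
            return Math.min(currentInstances + 1, maxInstances);
        }
        if (oldestMessageAge.isZero() && currentInstances > 1) {
            // Nothing is waiting at all: release an instance.
            return currentInstances - 1;
        }
        return currentInstances;
    }
}
```

With a 5 second target, a queue whose oldest message is 5 minutes old asks for one more instance whether it holds 50 messages or 5 billion.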

Your application will be shredding messages as fast as it can eat them no matter what else is going on or how many requests come in. The maximum your application can handle is usually not going to be set by the CPU on whatever container you’re working on. It’s the stress you place on individual dependencies that ends up determining your max scale. Oftentimes, look to your queue server, or your relational database. When these start looking uncomfortable you have likely found your maximums (and points for optimization).

There is also a third kind of consideration to have: some messages that your application can process may deterministically and decidedly be slower than other messages that your application processes. If you can’t make them go faster, one thing that can alleviate your problems is to shunt those messages to another queue consumed by another stack of the same application; you can then control the scaling of this stack independently and ensure that these slower moving messages don’t interfere with faster moving traffic.
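The slow-lane idea can be sketched in plain Java as a tiny router that picks a destination queue per message. EIP calls this a Content-Based Router, and Camel's DSL expresses it with `choice()` and `when()`. The class below, the queue names in the usage note, and the "slow" test are assumptions for illustration only.

```java
import java.util.function.Predicate;

/** Hypothetical router: sends known-slow messages to a separately scaled queue. */
final class LaneRouter<T> {
    private final Predicate<T> isSlow;  // assumed rule for spotting slow messages
    private final String fastQueue;
    private final String slowQueue;

    LaneRouter(Predicate<T> isSlow, String fastQueue, String slowQueue) {
        this.isSlow = isSlow;
        this.fastQueue = fastQueue;
        this.slowQueue = slowQueue;
    }

    /** Returns the name of the queue this message should be forwarded to. */
    String destinationFor(T message) {
        return isSlow.test(message) ? slowQueue : fastQueue;
    }
}
```

For example, `new LaneRouter<String>(m -> m.length() > 1000, "service:main-queue", "service:slow-queue")` would shunt large payloads to a second queue consumed by its own stack, so the two stacks can be scaled independently.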

On error handling

Like a floppy disk, when it comes to errors, you will generally have 3 options:

- Abort
- Retry
- Fail

Retrying is the most comfortable option - just keep polling dependent services in a responsible manner with a reasonable backoff, until they respond. This option is going to kill your throughput if your dependencies go down for long periods of time, however. But if you can reasonably keep your instances up and restore messages to the queue when they get shut down, it’s a very clean way of doing things.
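As a rough sketch of what "a reasonable backoff" could look like, here is a small retry helper in plain Java. The doubling schedule, the cap, and every name in it are assumptions for illustration; Camel has its own redelivery settings that would normally cover this.

```java
import java.util.function.Supplier;

/** Hypothetical retry helper with capped exponential backoff. */
final class Retry {
    /** Delay after the given failed attempt (1-based): doubles each time, capped at maxDelayMs. */
    static long backoffMillis(int attempt, long baseDelayMs, long maxDelayMs) {
        long delay = baseDelayMs;
        for (int i = 1; i < attempt && delay < maxDelayMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxDelayMs);
    }

    /** Calls the dependency until it succeeds or maxAttempts is used up. */
    static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long baseDelayMs, long maxDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and back off before the next try
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis(attempt, baseDelayMs, maxDelayMs));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", ie);
                }
            }
        }
        throw last; // every attempt failed
    }
}
```

The throughput cost is visible right in the loop: the thread sleeps while the dependency is down, so a long outage leaves every consumer parked here instead of draining the queue.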

Aborting the process is your second best bet - it will let you keep the ball rolling on anything that isn’t being slowed down. Terminate processing and place the message either back on the queue it came from, or onto a dead letter queue (a queue specifically set aside to hold unprocessable messages). The former can interfere with your scaling since it could mean that old messages are repeatedly recycled. The latter can tank you in extreme situations if the queue server begins to buckle at the storage requirements of the messages that are piling up during an outage. This is never fun - if you’re getting to that point, shut down your producers before the worst comes to pass.
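The two abort destinations can be combined: requeue a few times, then dead-letter. Below is a minimal in-memory sketch in plain Java, with `ArrayDeque`s standing in for real queues. The envelope class, the attempt counter, and the limit are assumptions for illustration; Camel provides a Dead Letter Channel error handler for the real thing.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

/** Hypothetical message wrapper that carries how many times processing was attempted. */
final class Envelope {
    final String body;
    final int attempts;

    Envelope(String body, int attempts) {
        this.body = body;
        this.attempts = attempts;
    }
}

/** Hypothetical consumer: aborts failed messages back to the queue, then to a dead letter queue. */
final class AbortingConsumer {
    final Deque<Envelope> mainQueue = new ArrayDeque<>();
    final Deque<Envelope> deadLetterQueue = new ArrayDeque<>();
    private final int maxAttempts;          // assumed budget before dead-lettering
    private final Consumer<String> handler; // the actual processing step

    AbortingConsumer(int maxAttempts, Consumer<String> handler) {
        this.maxAttempts = maxAttempts;
        this.handler = handler;
    }

    /** Processes one message; returns false if the main queue was empty. */
    boolean pollOnce() {
        Envelope next = mainQueue.pollFirst();
        if (next == null) {
            return false;
        }
        try {
            handler.accept(next.body);
        } catch (RuntimeException e) {
            Envelope failed = new Envelope(next.body, next.attempts + 1);
            if (failed.attempts >= maxAttempts) {
                deadLetterQueue.addLast(failed); // give up on it for now
            } else {
                mainQueue.addLast(failed);       // recycle behind newer traffic
            }
        }
        return true;
    }
}
```

Capping the attempts limits how long an old message keeps getting recycled, at the cost of the dead letter queue growing during an outage, which is exactly the storage risk described above.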

Failing the message is your last resort, because it means you will need to perform manual intervention, and that usually means a rough night. Make sure that your failure is clear and sent to a system that is easily queryable. In some cases rather than outright fail, I’ve also moved to other mediums as backups before giving up for good. For instance, sending a message off via REST for another instance to handle, if it can’t be restored to the queue server. I reach for this if we were far enough into a set of steps that we can’t reasonably restore the message back to the queue server for whatever reason.
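The "other mediums as backups" approach can be sketched as an ordered list of delivery channels tried one after another before the message is finally failed. This is plain Java with made-up names; each channel is just a predicate that returns true when it accepted the message, standing in for things like "restore to the queue" or "send via REST".

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical fallback chain: try each delivery channel in order before failing the message. */
final class FallbackSender {
    /** Returns the index of the channel that accepted the message, or -1 if all of them refused. */
    static int sendWithFallback(String message, List<Predicate<String>> channels) {
        for (int i = 0; i < channels.size(); i++) {
            try {
                if (channels.get(i).test(message)) {
                    return i;
                }
            } catch (RuntimeException e) {
                // A channel that throws is treated the same as one that refuses.
            }
        }
        return -1; // nothing worked: record a clear, queryable failure
    }
}
```

A result of -1 is the point where manual intervention starts, so that is where the clear, queryable failure record belongs.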

Leaving the world of Camel

The project I was working on is coming to a close now, and I will sadly not be using queue based architecture in my next few projects. I am excited to begin working on something using an event driven architecture! I’m expecting that the lessons learned from EIP and queue based architectures will translate well into the new medium.

However, I will say that queues are a very solid way of building a low-interactivity application, and I would encourage you to try it, if you have not!