RabbitMQ has a huge learning curve if you're trying to build a worker queue.
First, you'll learn about ack/noack and get the worker to ack on success.
Then, you'll learn about dead letter queues ... etc. for delayed retries (a sketch of this setup follows at the end of this comment).
Now, you'll have a topic exchange and a somewhat hairy routing setup in place using wildcards.
And you mistakenly set the dead letter routing key so that expired messages end up in multiple queues (the retry queues and the actual worker queue ... ).
Then you rewrite your service in Python and use Celery or something.
It's nearly impossible to get RabbitMQ working correctly within a few months.
And I forgot about HA. Paying for hosted RabbitMQ might be better, but CloudAMQP in particular can be tricky as well: it can run out of AWS IOPS and your production gets hosed.
Also, setting up monitoring for queue health, shoveling error queues ... etc. takes time to learn and apply. Be careful about routing keys when you shovel an error queue back into a topic exchange.
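To make that concrete, here is a minimal sketch of the ack-on-success + dead-letter retry topology described above, using Node's amqplib; the queue names, the 30s delay, and handle() are all illustrative, not from the original comment:

// sketch: worker queue with ack-on-success and a TTL-based retry queue
const amqp = require('amqplib');

async function main() {
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();

  // retry queue: no consumers; expired messages dead-letter back to 'work'
  await ch.assertQueue('work.retry', {
    messageTtl: 30000,            // hold messages 30s before retrying
    deadLetterExchange: '',       // default exchange routes by queue name
    deadLetterRoutingKey: 'work', // the pitfall above: get this key wrong
  });                             // and expired messages land elsewhere
  await ch.assertQueue('work');

  await ch.prefetch(1);
  await ch.consume('work', async (msg) => {
    if (msg === null) return; // consumer was cancelled
    try {
      await handle(JSON.parse(msg.content.toString())); // handle() = your job logic
      ch.ack(msg); // ack only on success
    } catch (err) {
      // don't requeue in place; push to the retry queue for a delayed retry
      ch.sendToQueue('work.retry', msg.content);
      ch.ack(msg);
    }
  }, { noAck: false });
}

main();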
Celery can be backed by RabbitMQ, not sure if that's what you meant, but all of what you described can be abstracted away. I didn't have the same experience of taking months to get up to speed. Moreover, at work RabbitMQ is probably our most stable underlying tool, perhaps toe to toe with Redis. And that's saying a lot, since I consider Redis to almost be a piece of art in how great a tool it is.
Back to RabbitMQ though: we run an HA 2-node deployment (just one active writer) and have been for over 3 years, requiring minimal changes and hardly any maintenance whatsoever. It has scaled to a hundred-plus queues, ranging from some with super high numbers of messages per second to some with only tens of messages per day. Some queues stay small and process fast; others take heavy jobs that get enqueued all at once and generate hundreds of thousands of jobs.
Sure, if you have a service that interacts with disks you should have an automated monitor that covers your IOPS consumption, but I don't see how that's specific to RabbitMQ; you should be doing this for all your instances.
All in all, these are two identical instances, one active, one failover, and in a world of Kafkas and Pulsars and understanding the ins and outs of SQS pricing and capacity allocation, RabbitMQ is a tool that I consider simple to administer and it allows me to sleep at night.
Interesting how the same tool can evoke such different reactions, but whatever works - works.
You would think, until you get to a split brain issue.
The master and failover lose connectivity, and they each then think they're the master.
There are ways to repair it (and it has happened to me exactly once in 4 years), but it does happen. I personally try to make my message processing idempotent in the worker to help alleviate these situations.
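A minimal sketch of what that idempotency can look like, assuming each message carries a unique id (the in-memory set below is my own stand-in for a shared store like Redis in a real deployment):

// sketch: idempotent handler keyed on a per-message id
const processedIds = new Set();

async function handleIdempotently(msg) {
  const { id, payload } = JSON.parse(msg.content.toString());
  if (processedIds.has(id)) return; // duplicate delivery: safe to drop
  await doWork(payload);            // doWork() is your actual job logic
  processedIds.add(id);             // record only after success
}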
1. both nodes keep accepting messages from producers
=> everything is still consumed, but it takes longer, as producers write into alternating queues, which are consumed ... albeit slowly, whenever the switch happens
2. they're database backed, so they'll try to write into the same table
=> usually software that does this (but can't handle several writers) also creates a `lock` which has to be manually reset before the failover can come up. if it's reset, the other node would fail. only one is up, so no issue?
3. producers/consumers don't notice that the 'active' mq changed, and keep running against the initial one
=> the issue manifests as soon as any system is restarted, but only slowly, so you've got time to handle it with minor service degradation
none of them really sound that bad to me -- but as i said before, i haven't encountered it before, so i might just be overlooking something really obvious?
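for what it's worth, the `lock` in 2. is often just a single-writer guard; a rough sketch of the advisory-lock variant with node-postgres (the lock key 42 is arbitrary, and this is my own illustration, not from the thread):

// sketch: single-writer guard via a postgres advisory lock
// session-level locks vanish if the holder dies, which is why file- or
// row-based `lock`s (which don't) need the manual reset described above
const { Client } = require('pg');

async function becomeWriter() {
  const client = new Client();
  await client.connect();
  const res = await client.query('SELECT pg_try_advisory_lock(42) AS ok');
  return res.rows[0].ok; // true on exactly one node; the failover gets false
}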
I have a simple single-node deployment and I was floored by how easy it was to set up with Celery. Really surprised. I was kicking myself for not using it sooner.
Granted, I don't know all the intricacies of RabbitMQ and this was just one step beyond os.popen, but it was painless, like half-an-hour painless to set up, and it has worked really well.
*edit: reading some of the other posts, now I'm waiting for the other shoe to drop. but so far it's worked wonderfully.
I also got my first queue set up and running within a reasonable period of time with Celery. I have no idea about the internals of RabbitMQ, and it took longer with Celery, really (back on Python 2.7), but that system has been in prod for 6 years now without really needing any maintenance.
When I first started using RabbitMQ I experienced just about everything you described.
I felt incredibly stupid when a customer would have issues with a queue being stuck or messages being dropped, and I had no clue why it was happening.
> It's nearly impossible to get RabbitMQ working correctly within few months.
This is so true. You can get it running in 10 minutes, but it takes weeks of banging your head against the wall and angry customers before you have it running right.
I understand where you're coming from, but what you're describing is learning how to use a queue to maintain consistency guarantees across a distributed system. You can get something simple like AWS SQS working with a few clicks, but then you don't have any of those consistency guarantees.
If you don't need crazy throughput, I find that Azure Storage Queues are crazy easy: built-in retry and just as simple as can be. Though when I've used them in the past, I've created a slightly simpler-to-use abstraction.
Thinking of doing something that works like an async generator so I can just use it like...
const work = queue.subscribe('somequeue');
for await (const {item, done} of work) {
  // do something with the JSON.parse'd item from the message
  await done(); // wrapper for the delete/finish
}
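For what it's worth, here's a sketch of how that subscribe() could be implemented over @azure/storage-queue; the generator shape, the polling back-off, and the names are my own guesses, not an existing library:

// sketch: subscribe() as an async generator over Azure Storage Queues
const { QueueClient } = require('@azure/storage-queue');

async function* subscribe(connectionString, queueName) {
  const client = new QueueClient(connectionString, queueName);
  while (true) {
    const res = await client.receiveMessages({ numberOfMessages: 10 });
    for (const msg of res.receivedMessageItems) {
      yield {
        item: JSON.parse(msg.messageText),
        // done() wraps the delete, as in the loop above
        done: () => client.deleteMessage(msg.messageId, msg.popReceipt),
      };
    }
    if (res.receivedMessageItems.length === 0) {
      await new Promise((r) => setTimeout(r, 1000)); // back off when idle
    }
  }
}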
Azure Storage Queues are about on par with SQS. That is, easy to use, but lacking strict concurrency control. If you need that level of concurrency control (and stricter serialization), then you'd be better off with their (more complicated) Service Bus product [0].
RabbitMQ isn’t more complicated because it’s been improperly designed. It’s more complicated because it’s doing a much more complicated task.
A fair amount of that complexity lies in the hosting, so a managed service can take some of that off your hands (for an increased price obviously), but part of it is necessarily going to lie with the message consumer (your application logic). If your use case doesn’t need that level of control, then it doesn’t need that level of complexity either, so something like Rabbit would just be the wrong choice.
I have my own share of objections, mainly concerning the over-engineered nature of RabbitMQ, but most of the “huge learning curve” items that you’ve described can be learned in an afternoon by a motivated software engineer. Besides, she will have to learn those concepts anyway because they apply to most brokers.
You're right. It's difficult to get right. However, it is totally worth it. Once you get it working it just works.
I wish a standard set of higher abstractions existed on top of it though. Celery, from what I hear, fills that gap very well in the Python world, but nothing like it exists in Node.js land, which leaves the room open for a bunch of Redis-backed solutions that are pretty fragile in comparison.
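A Celery-flavored task API for Node could be fairly small on top of amqplib; a sketch (the whole createApp/task/send/work API here is invented for illustration, not an existing library):

// sketch: tiny task abstraction over amqplib
const amqp = require('amqplib');

function createApp(url) {
  const tasks = new Map();
  let chPromise = null;

  async function channel() {
    if (!chPromise) {
      chPromise = amqp.connect(url).then((conn) => conn.createChannel());
    }
    return chPromise;
  }

  return {
    // register a named task handler
    task(name, fn) { tasks.set(name, fn); },
    // enqueue a job for a named task
    async send(name, args) {
      const ch = await channel();
      await ch.assertQueue(name);
      ch.sendToQueue(name, Buffer.from(JSON.stringify(args)));
    },
    // start consuming all registered tasks
    async work() {
      const ch = await channel();
      for (const [name, fn] of tasks) {
        await ch.assertQueue(name);
        ch.consume(name, async (msg) => {
          if (msg === null) return;
          try {
            await fn(JSON.parse(msg.content.toString()));
            ch.ack(msg); // ack on success, as discussed upthread
          } catch (err) {
            ch.nack(msg, false, false); // drop/dead-letter on failure
          }
        });
      }
    },
  };
}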
Weird, I haven't done much digging into the details of RabbitMQ, but I integrated it in a matter of hours and have had it deployed in production systems (for quite some time now), and it works really solidly. I haven't tried to get too clever though.
I just used the official docs and guides they had on the website, they seemed pretty good to me. I might have googled a few extra things, but can't really remember, I just remember it being pretty straightforward. I remember they pointed out a number of things you had to take care of.
I switched to Redis several years ago for a simple task queue solution. For my usage (low-to-medium traffic at most, in a corporate environment), Redis is easier to use and has a very small CPU and RAM footprint compared to RabbitMQ (note that I only use Redis for the message queue, hence the low memory consumption). I've never had a message dropped so far. RabbitMQ uses too much memory right from startup, which is not ideal on a resource-constrained server.
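For reference, that Redis pattern usually boils down to a list plus a blocking pop; a sketch with ioredis (the 'jobs' queue name and handle() are illustrative), which also shows where the fragility mentioned elsewhere in the thread comes from:

// sketch: redis list as a task queue
const Redis = require('ioredis');
const redis = new Redis();

async function enqueue(job) {
  await redis.lpush('jobs', JSON.stringify(job));
}

async function worker() {
  while (true) {
    const [, raw] = await redis.brpop('jobs', 0); // 0 = block until a job arrives
    await handle(JSON.parse(raw));                // handle() is your job logic
    // note: no ack; a crash mid-handle() loses the job, which is the
    // durability trade-off vs RabbitMQ
  }
}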
Surprised to see not much mention of ActiveMQ in these comments, but it's an obvious alternative choice. The general (simplistic) comparison being:
- ActiveMQ more featureful, robust default settings, better integrated with Java/JMS but slower
- RabbitMQ faster, simpler, more "just works"
The defaults of ActiveMQ lean more towards robustness (hence naive benchmarks will often tell you it's slow). However, in practice it is pretty damn easy to run: you can literally just download the default cross-platform distribution and type `./bin/activemq` and it will start running.
We use ActiveMQ + Apache Camel which makes a pretty nice combo to achieve lots of generalised messaging and routing functionality.
One practical reason we chose NATS over Kafka was that NATS doesn't need ZooKeeper for HA.
NATS doesn't provide message durability either; luckily that's not required for 95% of our use cases. Also, having NATS already implemented, it's a natural move to use NATS Streaming for durability rather than introducing a completely new technology to your stack.
NATS by itself is designed to be more of an always-on style queuing system (the term they use is "dial tone") but doesn't handle node failures by itself. If you're looking for a Kafka-flavored NATS, there's a new release I saw recently called LiftBridge that adds some durability to the NATS protocol.
Yeah, we hear you regarding AWS IOPS: for some types of loads and smaller plans we need to offer an alarm + an easy way to scale IOPS. It is something we're working on.