RabbitMQ has huge learning curve if you're trying to build a worker queue. First...

vasco · on May 21, 2020

Celery can be backed by RabbitMQ, not sure if that's what you meant, but all of what you described can be abstracted away. I didn't have the same experiences with months taken to get up to speed. Moreover, at work RabbitMQ is probably our most stable underlying tool, perhaps toe to toe with Redis. And that's saying a lot, since I consider Redis to almost be a piece of art in how great of a tool it is.

Back to RabbitMQ though, we run a HA 2 node deployment (just one active writer) and have been for over 3 years, requiring minimal changes or any kind of maintenance whatsoever, has scaled to hundred plus queues, going from some with super high numbers of messages per second, some with only tens of messages per day. Some queues stay low and process fast, others are heavy jobs that get enqueued all at once and generate hundreds of thousands of jobs.

Sure, if you have a service that interacts with disks you should have automated a monitor that cover your IOPS consumption, but I don't see how that's specific to RabbitMQ, you should be doing this for all your instances.

All in all, these are two identic instances, one active, one failover, and in a world of Kafkas and Pulsars and understanding the ins and outs of SQS pricing and capacity allocation, RabbitMQ is a tool that I consider simple to administer and allows me to sleep at night.

Interesting how the same tool can evoke such different reactions, but whatever works - works.

tankerdude · on May 21, 2020

You would think, until you get to a split brain issue. The master and failover lose connectivity, and they each then think they're the master.

There's ways to repair it (and it has happened to me one total time in 4 years), but it does happen. I personally try to make my message processing idempotent for the worker to help alleviate these situations.

411111111111111 · on May 22, 2020

haven't encountered it personally, so honest question here: how does a split brain situation become an issue in a message queue?

there are some possible situation from my naive viewpoint:

1. the 'active' queue keeps jumping between, consumers & producers keep reconnecting

=> everything is still consumed, but takes longer as producers write into alternating queues, which are consumed ... albeit slowly whenever the switch happens

2. they're database backed, so they'll try to write into the same table

=> usually software that does this (but cant handle several writers) also creates a `lock` which has to be manually reset before the failover can come up. if its reset, the other node would fail. only one is up, so no issue?

3. producers/consumers dont notice that the 'active' mq changed, and keep running on initial

=> issue manifests as soon as any system is restarted. but only slowly so you got time to handle it with minor service degradation

none of them really sound that bad to me -- but as i said before, i haven't encountered it before, so i might just overlooking something really obvious?

CloudButWhy · on May 22, 2020

There is a reason why you're supposed to run an odd number of nodes so that you will hopefully have a majority in case of a failure.

yawaramin · on May 22, 2020

Once every four years sounds like a no-brainer, to be honest.

mythrwy · on May 21, 2020

I have simple single node deployment and I was floored how easy it was to set up with Celery. Really surprised. I was kicking myself for not using it sooner.

Granted I don't know all the intricacies of RabbitMQ and this was just one step beyond os.popen, but it was painless, like half an hour painless to set up and it has worked really well.

*edit: reading some of the other posts now I'm waiting for the other shoe to drop. but so far it's worked wonderfully.

robbiep · on May 21, 2020

I also got my first queue set up and running within a reasonable period of time with celery. I have no idea of the internals of RabbitMQ and took longer with celery really (back on python 2.7) but that system has been in prod for 6 years now without really needing any maintenance

m3nu · on May 22, 2020

Same experience. Single node with a few clients and Celery. Works well.

My main issue in the beginning were network timeouts now and then. Those went away after tuning some TCP settings.

LeonM · on May 21, 2020

Thank you for this post.

When I first started using RabbitMQ I experienced just about everything you described.

I felt incredibly stupid when a customer would have issues with a queue being stuck or messages that were being dropped, and having no clue on why this was happening.

> It's nearly impossible to get RabbitMQ working correctly within few months.

This is so true. You can get it running in 10 minutes, but it takes weeks of banging your head against the wall and angry customers before you have it running right.

AmericanChopper · on May 21, 2020

I understand where you're coming from, but what you're describing is learning how to use a queue to maintain consistency guarantees across a distributed system. You can get something simple like AWS SQS working with a few clicks, but then you don't have any of those consistency guarantees.

tracker1 · on May 22, 2020

If you don't need crazy throughput, I find that Azure Storage Queues are crazy easy, built in retry and just simple as can be. Though when I've used it in the past, I've created a slightly simpler to use abstraction.

https://docs.microsoft.com/en-us/azure/storage/queues/storag...

Thinking of doing something that works like an async generator so I can just use it like...

    const work = queue.subscribe('somequeue');
    for await (const {item, done} in work) {
      // do something with JSON.parsed item from message
      await done(); // wrapper for the delete/finish
    }

AmericanChopper · on May 23, 2020

Azure Storage Queues are about on par with SQS. That is, easy to use, but lacking strict concurrency control. If you need that level of concurrency control (and stricter serialization), then you’d be better off their (more complicated) service bus product [0].

RabbitMQ isn’t more complicated because it’s been improperly designed. It’s more complicated because it’s doing a much more complicated task.

A fair amount of that complexity lies in the hosting, so a managed service can take some of that off your hands (for an increased price obviously), but part of it is necessarily going to lie with the message consumer (your application logic). If your use case doesn’t need that level of control, then it doesn’t need that level of complexity either, so something like Rabbit would just be the wrong choice.

[0]: https://docs.microsoft.com/en-us/azure/service-bus-messaging...

gtaylor · on May 22, 2020

Depending on your usage patterns, SQS can be significantly more expensive, too.

rlander · on May 21, 2020

I have my own share of objections, mainly concerning the over-engineered nature of RabbitMQ, but most of the “huge learning curve” items that you’ve described can be learned in an afternoon by a motivated software engineer. Besides, she will have to learn those concepts anyway because they apply to most brokers.

emilsedgh · on May 21, 2020

You're right. It's difficult to get right. However, it is totally worth it. Once you get it working it just works.

I wish a standard set of higher abstractions existed on it though. Celery, from what I hear fills that gap very well in the python world but nothing like this exists in Nodejs land which leaves the room open for a bunch of redis-backed solutions which are pretty fragile in comparison.

csours · on May 21, 2020

I'm curious if you are comparing this to a non-queue solution or to a different queue system?

keithnz · on May 21, 2020

weird, I haven't done much digging in to the details of RabbitMQ, but I integrated it in a matter of hours, and have it deployed in production systems (for quite sometime now) and it works really solidly. I haven't tried to get too clever though.

anon176 · on May 21, 2020

You must of followed a good guide on getting it setup. Took me 2 days to get it solid. Then we decided to just use redis.

keithnz · on May 22, 2020

I just used the official docs and guides they had on the website, they seemed pretty good to me. I might have googled a few extra things, but can't really remember, I just remember it being pretty straightforward. I remember they pointed out a number of things you had to take care of.

symplee · on May 21, 2020

Can anyone recommend an easier alternative?

neurostimulant · on May 21, 2020

I switched to redis since several years ago for simple task queue solution. For my usage (low to medium traffic at most in corporate environment) redis is easier to use and has very little cpu and ram footprint compared to rabbitmq (note that I only use redis for message queue, thus low memory consumption). Never got any message dropped so far. RabbitMQ uses too much memory right from starting up, not ideal for use in a resource constrained server.

https://redislabs.com/ebook/part-2-core-concepts/chapter-6-a...

zmmmmm · on May 22, 2020

Surprised to see not much mention of ActiveMQ in these comments, but it's an obvious alternative choice. The general (simplistic) comparison being:

- ActiveMQ more featureful, robust default settings, better integrated with Java/JMS but slower

- RabbitMQ faster, simpler, more "just works"

The defaults of ActiveMQ lean more towards robustness (hence often naive benchmarks will tell you it's slow). However in practice it is pretty damn easy to run, you literally can just download the default cross-platform distribution and type `./bin/activemq` and it will start running.

We use ActiveMQ + Apache Camel which makes a pretty nice combo to achieve lots of generalised messaging and routing functionality.

mcsoft · on May 21, 2020

http://nats.io

ies7 · on May 21, 2020

I heard a lot of praise about Nats, but isn't it more like a kafka alternative? Someone new need to spend sometime grasping the stream concept.

mcsoft · on May 22, 2020

One practical reason we chose Nats over Kafka was that Nats doesn't need zookeeper for HA.

Nats doesn't provide message durability too, luckily it's not required for 95% of our use cases. Also, having NATS already implemented it's a natural move to use NATS Streaming for durability rather than introducing a completely new technology to your stack.

Less pieces - fewer chances something breaks.

jonathanoliver · on May 22, 2020

NATS by itself is designed to be more of an always-on style queuing system (the term they use is "dial tone") but doesn't handle node failures by itself. If you're looking for a Kafka-flavored NATS, there's a new release I saw recently called LiftBridge that adds some durability to the NATS protocol.

shaklee3 · on May 22, 2020

Someone mentioned this already below, but nats streaming also adds durability.

rhodin · on May 22, 2020

Disclaimer: I work for CloudAMQP

Yeah we hear you regarding AWS IOPS: for some type of loads and smaller plans we need to offer an alarm + an easy way to scale IOPS. It is something we're working on.

jv22222 · on May 21, 2020

The biggest annoyance I found with RabbitMQ was that it could take up to 10-15 mins to restart if it had a lot of jobs.

This was back in 2015 - might be better now.

kchr · on May 22, 2020

Sounds like resource (design) issues to me.

ayush--s · on May 22, 2020

isn't Rabbitmq prone to "split brain" problem on HA setups?