One of the biggest benefits, imo, of using Postgres as your application queue is that any async work you schedule benefits from transactionality.
That is, say you have a relatively complex backend mutation that needs to schedule some async work (e.g. sending an email after signup). With a Postgres queue, if you insert the job to send the email and then, in a later part of the transaction, something fails and the transaction rolls back, the email is never queued to be sent.
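A minimal sketch of the pattern in Python with psycopg2 (the users/jobs schema here is invented for illustration):

    # Minimal sketch: the job row is created inside the same transaction
    # as the rest of the mutation, so a rollback discards it too.
    # (The users/jobs schema is invented for illustration.)
    import json
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn:  # psycopg2: commits on success, rolls back on exception
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO users (email) VALUES (%s) RETURNING id",
                ("alice@example.com",),
            )
            user_id = cur.fetchone()[0]
            cur.execute(
                "INSERT INTO jobs (kind, payload) VALUES (%s, %s)",
                ("send_signup_email", json.dumps({"user_id": user_id})),
            )
            # ... more work in the same transaction; if anything here
            # raises, the job insert above is rolled back with it and
            # the email is never queued.
    conn.close()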
Worth being clear that bridging to another non-idempotent system necessarily requires you to pick at-least-once or at-most-once semantics. So for emails: if you fail while awaiting confirmation that the email was sent, you still need to pick between failing your transaction (and potentially duplicating the email) or continuing (and potentially dropping it).
The big advantage is for code paths which asynchronously modify your DB; these can be done fully transactionally with exactly-once semantics, since the job consumption and the DB update are in the same transaction.
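As a sketch (assuming a jobs table like the one above; the audit-log insert stands in for whatever DB effect the job has):

    # Sketch: claim a job and apply its DB effect in one transaction.
    # SKIP LOCKED lets multiple workers poll without blocking each other.
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                DELETE FROM jobs
                WHERE id = (
                    SELECT id FROM jobs
                    ORDER BY id
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED
                )
                RETURNING kind, payload
                """
            )
            job = cur.fetchone()
            if job:
                kind, payload = job
                # The job's DB side effect goes here; it commits (or
                # rolls back) together with the DELETE above, giving
                # exactly-once semantics for DB-only work.
                cur.execute(
                    "INSERT INTO audit_log (entry) VALUES (%s)",
                    (f"processed {kind}",),
                )
    conn.close()

If the worker crashes before COMMIT, the DELETE is undone and the job simply becomes visible to the next worker.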
That's kind of missing the parent's point. If you wanted to ensure emails arrive, that sounds like another queue that could be backed by a different table that is also written to as part of the original transaction.
> One of the biggest benefits, imo, of using Postgres as your application queue is that any async work you schedule benefits from transactionality.
This is a really important point. I often end up using a combination of Postgres and SQS since SQS makes it easy to autoscale the job processing cluster.
In Postgres I have a transaction log table that includes columns for triggered events and the pg_current_xact_id() for the transaction. (You can also use the built-in xmin of the row, but then you have to worry about transaction wraparound.) Inserting into this table triggers a NOTIFY.
A background process runs in a loop: it selects all rows in the transaction table with a transaction id between the last run's xmin and the current pg_snapshot_xmin(pg_current_snapshot()), maps those events to jobs and submits them to SQS, records the new xmin, then LISTENs to await the next NOTIFY.
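Roughly, that loop might look like this (a sketch: the tx_log table, queue URL, and the two checkpoint helpers are invented, and the xid is assumed to be stored as a bigint):

    # Rough sketch of the relay loop described above.
    import json
    import select

    import boto3
    import psycopg2

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # hypothetical

    conn = psycopg2.connect("dbname=app")
    conn.autocommit = True  # required for LISTEN/NOTIFY
    cur = conn.cursor()
    cur.execute("LISTEN tx_log")

    last_xmin = load_checkpoint()  # hypothetical: xmin recorded by the last run

    while True:
        # Rows with xid below the snapshot's xmin can no longer be
        # written by any in-flight transaction, so this range is safe
        # to hand off without missing late commits.
        cur.execute("SELECT pg_snapshot_xmin(pg_current_snapshot())::text::bigint")
        horizon = cur.fetchone()[0]
        cur.execute(
            "SELECT event, payload FROM tx_log"
            " WHERE xid >= %s AND xid < %s ORDER BY xid",
            (last_xmin, horizon),
        )
        for event, payload in cur.fetchall():
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps({"event": event, "payload": payload}),
            )
        save_checkpoint(horizon)  # hypothetical: persist for the next run
        last_xmin = horizon
        select.select([conn], [], [])  # block until the next NOTIFY
        conn.poll()
        del conn.notifies[:]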
Good point. We alleviate that a bit by scheduling our queue adds to not run until after commit. But then we still have some unsafety, and if the connection to Rabbit is down, we're in trouble.
I agree - having to tell a database that something was processed, and fire off a message into RabbitMQ, say, is never 100% transactional. This would be my top reason to use this approach.
> With a Postgres queue, if you insert the job to send the email and then, in a later part of the transaction, something fails and the transaction rolls back, the email is never queued to be sent.
This is true - definitely worth isolating what should be totally separate database code into different transactions. On the other hand, if your user is not created in the DB, you might not want your signup email. Just depends on the situation.
Another benefit of this is that you're guaranteed that the transaction is completed before the job is picked up. With redis-backed queues (or really anything else), you very quickly run into the situation where your queue executes a job depending on a database record existing prior to the transaction being committed (and the fix for this is usually awkward / complex code).
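A sketch of the race, with redis-py and invented table/queue names:

    # BUG sketch: enqueueing to Redis mid-transaction exposes the job
    # before the row it depends on is committed.
    import psycopg2
    import redis

    r = redis.Redis()
    conn = psycopg2.connect("dbname=app")
    cur = conn.cursor()

    cur.execute(
        "INSERT INTO users (email) VALUES (%s) RETURNING id",
        ("a@example.com",),
    )
    user_id = cur.fetchone()[0]

    r.lpush("jobs", str(user_id))  # a worker can pop this immediately...

    conn.commit()  # ...but the user row only becomes visible here, so a
                   # fast worker looks up user_id and finds nothing.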
I'm not sure this is really an issue with transactionality, since a single request can obviously be split up into multiple transactions. Rather, even if you correctly flag the email as pending/errored, you either need to process those manually or have some other kind of background task that looks for them, at which point why not just process them asynchronously?
> With a Postgres queue, if you insert the job to send the email and then, in a later part of the transaction, something fails and the transaction rolls back, the email is never queued to be sent.
An option could be to use a second connection and a separate transaction to insert data into the queue table.
We’re currently running two machines (master and standby) at M5 Hosting. All of HN runs on a single box, nothing exotic:
CPU: Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz (3500.07-MHz K8-class CPU)
FreeBSD/SMP: 2 package(s) x 4 core(s) x 2 hardware threads
Mirrored SSDs for data, mirrored magnetic for logs (UFS)
HN is a very simple application. Handling a high volume of traffic for a simple application is a very different problem from scaling a highly complex application.
HN is simple, yes. But it could be made more complicated. Personalized feed and data analytics are two complicated things that come to mind. Staying simple is often a choice, and it’s a choice not many companies make.
HN is a straightforward forum. Reddit is one level above that: generalized forums as a service.
Anything HN has had to implement, Reddit has to implement at a generalized, user-facing level, like mod tools.
Frankly, we underestimate how hard forums are, even simple ones. I learned this the hard way rebuilding a popular vBulletin forum into a bespoke forum system.
Every feature people expect from a forum turns into a fractal of smaller moving parts, considerations, and infinite polish. Letting users create and manage their own forums is an explosion of even more things that used to be simple private /admin tools.
Mod tools are not accessed and used by all users. So the load of mod-tools on the servers is probably negligible.
I agree, most software is deceptively simple from the outside. Once you start building it, you become more humble about the effort required to build anything moderately complex.
Mod tools aren't used by the majority of users, correct. But the existence of mod tools does make the logic and assumptions of the application different. Now you've got a whole set of permissions and permissions checks, additional interfaces, more schema, etc.
It's not that the mod tools are constantly being used, it's that there's now potentially far more code complexity for those tools to even exist.
Is Reddit really a complex application (regardless of how they build, scale, or deploy it)? Although that makes me wonder: what makes an application complex?
Hacker News changes more often than people think, just not the layout because people here are weirdly fetishistic about it.
Since I've been here they've added vouching for banned users (and actually warning people beforehand), thread folding, Show HN, making the second-chance pool public, thread hiding, the past page, various navigation links, and the API. They've also been trying to get a mobile stylesheet to work, and they've mentioned making various changes for spam detection and performance. And the URL now automatically loads a canonical version if it finds one, and the title is now automatically edited for brevity. And I've probably missed a few things.
And HN isn't a simple application by any means. Go look at the Arc Forum code - it isn't optimized for readability, or scalability or reliability, but joy - for the vibe of experimental academic hacking on a Lisp. It's made of brain farts. Hacker News is probably significantly more complex than that for being attached to a SV startup company and running 'business code' and whatnot.
I mean, that’s not really that much is it. And that’s the point, HN really doesn’t change much. Whereas Reddit, for better or for worse, has a much higher output of new user facing features.
When a VC gives you a giant boatload of money, they insist you “scale up” the company overnight. So you go on a massive hiring spree and get a triple-digit team of engineers before having any market traction.
And they're tasked with building a product that can handle Google-levels of demand, though they currently only have two customers, neither of them paying.
It indeed is imperative, but not for technical reasons.
I would take the money and then do none of that. Now I've got a 5-year runway, enough time to build a product people like and use, and by then the investors won't be angry anymore.
HN is perhaps the most user friendly site I go to with regularity.
The idea that a website needs to be “rich” to be usable is one of the dumbest things the industry has convinced itself of in the last 20 years (following only ‘yaml is a smart way to encode infrastructure’).
To be fair, it's not as much user-friendly as it is simple, and simple tends to be easier to understand.
For example, if it was more user-friendly, it could have links to jump between root comments, because right now very popular top comments tend to accumulate most interactions, and scrolling down several pages to find the next root thread requires effort.
The people who push in the other direction also bring few or no metrics. I.e., there is often no reason to add <bag of features>, except that a customer (who hasn't bought the product yet) mentioned them as nice-to-haves during initial sales talks.
IMO the solution to YAML-as-config is a strict subset of YAML.
JSON is one strict subset, but one that makes its trade-offs in favor of strictness and machines: error detection and types encoded in the syntax.
We decided on a different subset of YAML for our users who were modifying config by hand (even more strict than StrictYAML). Some of the biggest features of YAML are that there is no syntax typing, and that collection syntax is simple (the latter is also true of JSON, but not of TOML).
For example, a string and a number look the same. This seems bad to us developers at first, but the user doesn't have to waste 20 minutes chasing down an unmatched quote when modifying config in a <textarea>. Beyond that, it's the same amount of work as making sure the JSON is `"age": 20` instead of `"age": "20"`; one just has noisier syntax.
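To make that concrete, here's the idea with the StrictYAML library mentioned above, where the schema rather than the syntax decides types:

    # With StrictYAML, scalars parse as strings by default; the schema,
    # not quoting, decides types, so `age: 20` and `age: "20"` can't
    # silently diverge.
    from strictyaml import load, Map, Int

    print(load("age: 20").data)                        # {'age': '20'}  plain string
    print(load("age: 20", Map({"age": Int()})).data)   # {'age': 20}    int via schema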
>Stack Exchange which is way more rich and runs on small (relative) infra.
Yes, I've heard that SO runs on relatively simple and modest infra. And agree that would be a good example.
>HN is not user friendly
How so? I find the HN UX a refreshingly simple and effective experience. It might not have all the bells and whistles of newer discussion fora, but it doesn't obviously need them. I'd say it's a good example of form/function well suited to need. Not perfect perhaps, but very effective.
Try loading it on a 2G (2 bars = 128kbits per second — those are bits not bytes) connection. It loads almost instantly with no fuss. Now try loading virtually any site on the same, if it ever loads at all without timing out, you’ll be waiting over 10 minutes.
There was a YT preso from several years back where the StackExchange founder explained how it ran off just ~10 servers, and could run on half that many if needed. He stressed the simplicity of their architecture, and that their problem space was massively cacheable, so the servers just had a few hundred GB of RAM and mostly served pages from cache, only doing work to re-render them when needed. It was a C#/.NET app.
So, I think there is a lot more in common than you think between HN and SO.
My pet peeves: no dark mode, which is sorely lacking for reading in the dark; no indication at all that you've got replies (at least a tiny number next to threads, perhaps?); and the up/downvote buttons are too small to reliably tap on mobile. Oh, and enumeration support would be fantastic; the workarounds tend to be hard to read.
Other than that, I think it's delightfully ugly and lightweight.
I can't seem to find Harmonic in the iOS App Store, is it Android-only?
Also, HN apps tend to make it harder to send interesting things to Roam or the laptop or Safari's reading list, the website makes that really convenient.
I wouldn’t say it’s not user friendly but I understand where you are coming from. I also missed some more modern features/looks and decided to build my own open source client [0]. Feel free to give it a go to see if it’s more your taste!
I wonder if they use something like CARP[^1] for redundancy. Also, it strikes me as odd that they didn't go with ZFS for storage; it makes FS management _way_ easier for engineers who don't spend all their time on these kinds of operations.
You might ask what sort of filesystem maintenance they ever need to do. Replacing a disk is covered by the mirror. Backup is straightforward. The second system covers a lot more. If they need to increase hardware capacity, they can build new systems, copy in the background, and swap over with a few minutes of downtime.
(beginner question) How do they store the data? Is an SQL db overkill for such a use case? What would be the alternative, an ad-hoc filesystem-based solution? Then how do the two servers share the db? And is there redundancy at the db level? Is it replicated somehow?
"ad-hoc filesystem based solution" is the closest of your definitions, I think. Last time I saw/heard, HN was built in Arc, a Lisp dialect, and use(s/d) a variant of this (mirrored) code: https://github.com/wting/hackernews
Check out around this area of the code to see how simple it is. All just files and directories: https://github.com/wting/hackernews/blob/master/news.arc#L16... .. the beauty of this simple approach is a lack of moving parts, and it's easy to slap Redis on top if you need caching or something.
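In that spirit, a toy version of the file-per-item idea (not HN's actual on-disk format, just the shape of it):

    # Toy file-per-item store: one file per item, named by id.
    import json
    from pathlib import Path

    DATA_DIR = Path("data/items")

    def save_item(item_id: int, item: dict) -> None:
        DATA_DIR.mkdir(parents=True, exist_ok=True)
        (DATA_DIR / str(item_id)).write_text(json.dumps(item))

    def load_item(item_id: int) -> dict:
        return json.loads((DATA_DIR / str(item_id)).read_text())

    save_item(1, {"by": "pg", "title": "Hello HN", "kids": [2, 3]})
    print(load_item(1))  # {'by': 'pg', 'title': 'Hello HN', 'kids': [2, 3]}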
File syncing between machines is pretty much an easily solved problem. I don't know how they do it, but it could be something like https://syncthing.net/ or even some scripting with `rsync`. Heck, a cronned `tar | gzip | scp` might even be enough for an app whose data isn't exactly mission critical.
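E.g. a sketch of the rsync variant (host and paths invented):

    # Mirror the data directory to a standby box with rsync.
    import subprocess

    subprocess.run(
        ["rsync", "-az", "--delete", "data/", "standby.example.com:/srv/hn/data/"],
        check=True,
    )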
Wow, I had no idea HN was built like that - I'm impressed. I really wish I could read the Arc code better though since I'd love to know more about the details of how data is represented on disk and when things move in and out of memory, etc.
Does anyone know of other open source applications with similar architectures like this?
It's sad to think that, with these laws being passed, regardless of what position you take, we still don't have any Fair Use provisions in Australia. There was even a discussion paper [http://www.alrc.gov.au/publications/4-case-fair-use-australi...] put forward by our Law Reform Commission suggesting this. I would have thought the productivity benefits associated with education and innovation alone would make this a no-brainer.
Go to cloud.digitalocean.com/support and create a new ticket, giving them your promo code and asking nicely, and they'll put it through promptly, in my experience.