Five years ago I was absolutely frustrated with the state of Graph databases and libraries and tried putting several non-Graph DBMSs behind a NetworkX-like Python interface <https://github.com/unum-cloud/NetworkXum>.
When benchmarked, Neo4J crashed on every graph I’ve tried <https://www.unum.cloud/blog/2020-11-12-graphs>, making SQLite and Postgres much more viable options even for network-processing workloads. So I wouldn’t be surprised to learn that people actually use pgRouting and Supabase in that setting.
With the rise of Postgres-compatible databases, I’m wondering if it’s worth refreshing the project. Similarly, there are now more Graph DBs like MemGraph that are compatible with Cypher, which should probably work much better than Neo4J.
Check out https://github.com/Pometry/Raphtory. It's written in Rust, embedded (the binaries are about 20 MB), and you can use the Python APIs as a drop-in replacement for NetworkX. Disclaimer: I am one of the people behind it.
Just starting to review it, but my front-of-mind questions:
1) How do I handle persistence? Looks like some code is missing.
2) Do you support multi-tenancy (a B2B SaaS graph backend for handling relations scoped to a tenant)?
1) You can persist a graph to disk. By default, this uses protobuf (`save_to_file`); however, we’re migrating to Parquet in the next release for better performance, because we noticed that loading a 100M-edge graph from scratch (CSV, Pandas, or raw Parquet) is actually faster (~1M rows/sec) than loading from persisted proto, which isn’t ideal. There’s also a private version that uses custom memory buffers for on-disk storage, handling updates and compaction automatically.
2) You can run a Raphtory instance either as a GraphQL server or as an embedded library. For the server, multiple users can query the persisted graphs, which are stored in a simple folder structure with namespaces (for different graphs). For now, access control needs to be managed externally; however, it's on our roadmap!
Your graph DB frustrations mirror what many experienced with Neo4j. If you refresh your project, consider including FalkorDB (formerly RedisGraph) - it uses sparse adjacency matrices and GraphBLAS for much better performance while supporting Cypher.
Would be interesting to see updated benchmarks comparing these newer options against PostgreSQL extensions.
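For readers unfamiliar with the "sparse adjacency matrix" idea mentioned above, here is a rough pure-Python sketch of a breadth-first search written as repeated boolean vector-times-matrix products, which is the general style GraphBLAS encourages. This is not FalkorDB's or GraphBLAS's actual API; the `bfs_levels` function and the dict-of-sets matrix layout are made up for illustration.

```python
def bfs_levels(adj_rows, source):
    """BFS over a sparse boolean adjacency matrix.

    adj_rows maps a row (source node) to the set of columns (target nodes)
    that hold a 1. Returns {node: number of hops from source}.
    """
    levels = {source: 0}
    frontier = {source}          # sparse boolean vector of nodes to expand
    depth = 0
    while frontier:
        depth += 1
        # Boolean "vector x matrix": OR together the rows selected by the frontier.
        product = set()
        for row in frontier:
            product |= adj_rows.get(row, set())
        # Mask out nodes that already have a level assigned.
        frontier = {node for node in product if node not in levels}
        for node in frontier:
            levels[node] = depth
    return levels


# Hypothetical toy graph: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4, 4 -> 1
adjacency = {1: {2, 3}, 2: {4}, 3: {4}, 4: {1}}
print(bfs_levels(adjacency, 1))
```

A real GraphBLAS backend would run the same product-and-mask step over compressed sparse arrays instead of Python sets, which is where the performance claim comes from.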
I had almost exactly the opposite experience, although my dataset was pretty small.
We wanted to store a graph in postgres and ended up writing some recursive queries to pull subgraphs then had NetworkX layered over it to do some more complex graph operations. We ended up doing that for a short while but then switched to Neo4j because of how comparatively easy it was to write queries (although the Python support for Neo4j was severely lacking). Never really stressed it out on dataset size though.
I did manage to crash Redis' graph plugin pretty quickly when I was testing that.
Not sure what you consider "pretty small" and I don't know how NetworkX works, but PostgreSQL recursive queries have worked well for me for small graphs.
Could you share what the data structure and scale was?
We basically had a single table that we wanted to be able to nest on itself arbitrarily. Think categories and subcategories, maybe 100k nodes/rows
Postgres worked fine, but Cypher is so much more expressive and handles stuff like loop detection for you, so Neo4j was much easier to work with. Performance wasn't ever really an issue with either.
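To make the recursive-query approach from this sub-thread concrete, here is a minimal sketch of a self-nesting table walked with `WITH RECURSIVE`. It uses SQLite from Python's standard library rather than Postgres so that it runs anywhere; the `category` table, its columns, and the `descendants` helper are all invented for the example. Loop detection is done the cheap way here: `UNION` (instead of `UNION ALL`) discards rows that were already produced, so a cycle cannot recurse forever.

```python
import sqlite3


def descendants(conn, root_id):
    """Return the ids of root_id and everything nested below it, sorted."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM category WHERE id = ?
            UNION  -- UNION, not UNION ALL: repeated ids are dropped, so cycles terminate
            SELECT c.id
            FROM category AS c
            JOIN subtree AS s ON c.parent_id = s.id
        )
        SELECT id FROM subtree ORDER BY id
        """,
        (root_id,),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical data: 1 is a root with children 2 and 4; 3 is nested under 2; 5 stands alone.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (id INTEGER PRIMARY KEY, parent_id INTEGER)")
conn.executemany(
    "INSERT INTO category VALUES (?, ?)",
    [(1, None), (2, 1), (3, 2), (4, 1), (5, None)],
)

print(descendants(conn, 2))  # the subtree rooted at category 2
```

The same query shape should carry over to Postgres, though the sketch was only written against SQLite. The Cypher equivalent is a one-line variable-length path pattern, which is presumably the expressiveness gap being described.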
We have something similar in Postgres but IMO disconnectedness also plays a really big part in this whole calculation. We actually ended up just changing the transitive closure for fast operations (and simpler code).
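If the comment above is read as "precompute the transitive closure so that lookups become trivial" (an assumption on my part, since the wording is ambiguous), the idea can be sketched as follows. The `transitive_closure` function and the edge list are hypothetical; in a database the result would typically live in a separate closure table.

```python
def transitive_closure(edges):
    """Return every (ancestor, descendant) pair reachable through one or more edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)

    closure = set()
    for start in children:
        stack = list(children[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue  # already expanded from this start; also guards against cycles
            seen.add(node)
            closure.add((start, node))
            stack.extend(children.get(node, ()))
    return closure


# Hypothetical hierarchy: 1 -> 2 -> 3 and 1 -> 4
closure = transitive_closure([(1, 2), (2, 3), (1, 4)])
print((1, 3) in closure)  # "is 3 somewhere below 1?" becomes a single set lookup
```

The trade-off is the usual one: reads become a plain lookup or a non-recursive join, while every insert or move in the hierarchy has to update the closure as well.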
A good 10 years ago or so I was running a solution that used RDF Quad Stores - and the best one at the time (after trialling 4Store, Marklogic and some others I can't remember) was OpenLink Virtuoso - how they managed to fit a performant distributed Quad store into what started life as an SQL engine was impressive.
I've left that world now, but if you're in the market for a graph store again, it might be something to look at.
Neo4J is mature, not dated, which is why it's so popular.
And couldn't disagree more that graph is a feature. You really want something optimised for it (query language / storage approach) as the data structure is so different in every way from a relational or document store.
Graph processing can create a substantial amount of intermediate data if it is done in the typical join-implementation fashion (nested loops or hash join). So it may appear that graph processing needs a tailored approach.
But what can help graph algorithms can help SQL query execution as well and vice versa, see the link above.
For example, TPC-DS contains queries that (indirectly) join the same tables multiple times (query 4, for example). This is, basically, a kind of centrality-metric computation for a graph represented by the tables.
How is it different? Isn't a graph basically two sets of tuples: edges and nodes? I played with Cayley (Google) for a little while, & that was my impression.
I think it's less a matter of "can you represent graphs in a relational DB" (of course you can), and more about what kind of queries the DB is optimised for. Graph databases are intended for complex recursive queries on relatively unstructured data. You could certainly do that in SQL if you wanted to, but you'll pay for it performance-wise.
Graph query languages also make those kinds of queries much easier to express in the first place.
So the underlying storage is conventional, it's still tuples of some kind, and it's only a matter of how indexes are laid out? Otherwise, I'm struggling to see how it could "optimise" for certain access patterns. How would a typical graph database index be different from a btree access method in Postgres?
I don't know much about the internal details of postgres. But there is a ton of detail underlying "it's just tuples of some kind" and there are lots of ways to implement indices, no? Is it so difficult to imagine that different implementations have different performance properties?
There's also the query planner layer to think about too.
No need for the snark. If you want specific details of how postgres differs from graph databases I have nothing for you. I just find your position that btrees are optimised for every query structure... obviously false on general grounds? Like a thing to do to make recursive queries faster is to store relations as direct pointers of some kind, rather than doing index scans for every level of join.
Perhaps we're talking past each other about the word "optimised".
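A toy illustration of the "direct pointers versus an index lookup per hop" distinction being argued about here. It says nothing about how Postgres or any real graph database is implemented: the sorted list of edge tuples is only a stand-in for an ordered index on `(src, dst)`, and all names (`Node`, `within_hops_index`, `within_hops_pointers`) are made up.

```python
from bisect import bisect_left


def neighbours_via_index(sorted_edges, key):
    """Find out-neighbours with a binary search, roughly what an ordered index gives you."""
    i = bisect_left(sorted_edges, (key,))
    found = []
    while i < len(sorted_edges) and sorted_edges[i][0] == key:
        found.append(sorted_edges[i][1])
        i += 1
    return found


def within_hops_index(sorted_edges, start, hops):
    """Nodes reachable in at most `hops` steps, doing one index search per expanded node."""
    frontier, seen = {start}, {start}
    for _ in range(hops):
        next_frontier = set()
        for key in frontier:
            for neighbour in neighbours_via_index(sorted_edges, key):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.add(neighbour)
        frontier = next_frontier
    return seen


class Node:
    def __init__(self, key):
        self.key = key
        self.out = []  # direct references to neighbouring Node objects


def build_nodes(edges):
    nodes = {}
    for src, dst in edges:
        source = nodes.setdefault(src, Node(src))
        target = nodes.setdefault(dst, Node(dst))
        source.out.append(target)
    return nodes


def within_hops_pointers(start_node, hops):
    """Same traversal, but each hop just follows stored references; no search involved."""
    frontier, seen = [start_node], {start_node.key}
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for neighbour in node.out:
                if neighbour.key not in seen:
                    seen.add(neighbour.key)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return seen


# Hypothetical graph: 1 -> 2 -> 3 -> 4 and 1 -> 5
edges = sorted([(1, 2), (2, 3), (3, 4), (1, 5)])
nodes = build_nodes(edges)
print(within_hops_index(edges, 1, 2) == within_hops_pointers(nodes[1], 2))
```

Both versions return the same set. The difference is only in cost per hop: a logarithmic search against the edge index versus following a reference that is already in hand. Whether that matters in practice depends on graph size, caching, and the query planner, which is more or less what the two sides above disagree about.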
> I just find your position that btrees are optimised for every query structure
But that is not my position! Postgres has many index access methods: hash, btree, brin, gin, gist, and there are extensions for rum, bloom, skipscans, geospatial indexes such as sp-gist, & vector indexes like ivf/hnsw (see pgvector). I mean, as far as graph databases are concerned, besides pgRouting, there's also Apache AGE, which is a graph-"optimised" Postgres.
You should learn more about Postgres and databases in general. See comment above. https://news.ycombinator.com/item?id=43203833 which is closely related to the argument I am actually making.