Creator here. Six Degrees of Wikipedia is a side project I've been sporadically hacking on over the past few years. It was an interesting technical challenge and it's fun to play with the end result. Here's the tech stack:
* Frontend: React (Create React App)
* Backend: Python Flask
* Database: SQLite
* Web (frontend) hosting: Firebase Hosting
* Server (backend) hosting: Google Compute Engine (it runs fine on a tiny f1-micro instance)
All the code is open source[1] and I'm happy to answer any questions about building or maintaining it!
make it so that if I visit the page and just click the "go" button, it will use the placeholder examples as the start and end points. i did this and got an error message stating "You'll probably want to choose the start and end pages before you hit that." that was annoying. the placeholders that were auto chosen were actually really interesting.
There's a certain irony to the lack of tact in this post suggesting better manners. I suspect the point would be better received were it more politely made.
That post seems like quintessential trolling. It was my expectation as well when going to the page to just click the "Go" button and see what the results would be, and I think the suggestion is great feedback. I think it is generally in the spirit of these "Show HN" posts to give and get feedback and discuss with the creator. The response post to that user, to me, is patronizing and is undoubtedly going to ruffle some feathers. But, now we aren't even talking about the OPs cool project...
I've see something similar when offering items on freecycle: people seem to think that the more they insert "pls" and "thnx" into a message the more likely it is that they'll be first in the queue.
I swear at least once there was more instances of "pls" and thnx" (there seems to be a significant overlap between people who overuse the words and people who don't bother to type them properly, though that may be the linguistic snob in me talking) than all other words combined. It backfired by making the message difficult to read so I binned it and read the next.
Come on, this is Show HN! Not some GitHub issue tracker. Try to insert please and thank you in that post ... it will make it sound rude! It will turn it from Show HN feedback to sounding like a feature request, which it is clearly not...
If you can automatically detect sentences where adding "Could you" would make the author sound more tactful, then why not just mentally add them and assume that the author intended to sound that tactful?
This is a slight modification of the principle of charity, something I think is incredibly important and underemphasized, especially for online text communication.
Based on the thread here, many people interpreted the comment as a well-intended feature request, but instead of requesting the commenter to change their comment, it seems like it would’ve been fine to just use that charitable interpretation.
I’m all for being charitable in our interpretations of the motives of others, but I think a total lack of the most basic courtesy should be seen for what it is. The lower end of the spectrum being lazy, and the high end being rude, even charitably no positive interpretation exists IMO.
Please and thank you are such basic concepts that honestly, I judge the hell out of people who don’t seem capable of it regardless of platform. If that minimal degree of politeness isn’t reflexive for you, again, the world at large judges you. Not to put too fine a point on it, but maybe “why not add it in your head” is the kind of thing thst gives some in tech a bad name?
This is a bit off topic, but just to blow ones mind: Finnish, which I am natively speaking, is lacking direct "please" word completely and I have to admit that especially orally I forget it often. (We make things sound nice by conditional, "could you", "would you be so kind" etc..) I have no immediate reaction what so ever how it would feel if someone left out "please". I understand the concept, but it feels horrible how much wrong I can go with native English speakers without meaning it. Thanks for the reminder! (thank you we do have;)
That’s not off-topic at all! Honestly, I’m ashamed that I hadn’t considered the possibility of a non-native speaker’s linguistic quirks, and what this tells me is that I need to continue to work on my snap-judgements. Thank you for reminding me that languages don’t handle courtesy the same way.
This was really nice to hear! Don't worry, one can only think through ones own mind =) I think this already helps, if everybody are just a bit more aware about that others might think differently, and not be deliberately mean. I have learned several languages, since Finnish is very useless in a large scale, and it always amazes me how in different languages it is also thought differently. I can only recommend!
You're right that please and thank you could come off as sarcastic or a bit odd in a comment section like this.
The best route is probably to clarify that it seems like a good concept, and you think that adding x and y would be the best features to focus on that would make it even better.
There’s a human being at the other end of the line. They have feelings. This is their baby, which they’ve gestated for multiple years.
Rational fundamentalism is not the one ring to rule them all. There is room for love and compassion in the world. I know (painfully well) that Internet forums (and this one in particular) anything but empower empathy, but that doesn’t make it any less true or valuable.
There is room for love and empathy. There really is. Logical, rationalistic supremacy notwithstanding.
When your baby grows up, and you put it out in the world, you no longer need to--and no longer should--continue to treat it like a baby. The author is not her project, and statements about the project are not personal attacks on the author. And if she were going to take them as such, she probably wouldn't be sharing that project on HN.
This is not how humans work. Humans are not rational emotionless robots. Humans are complex beings , they have feelings, biases and emotions which they do not control. It’s humane , considerate and kind to keep that in mind when communicating with each other.
Nobody is perfect. We shouldn’t expect anyone to be. Being kind and considerate is not a waste of anyone’s time, it’s not disrespectful nor patronising.
Maybe all of this is not true for OP, maybe they’re so zen they immediately see past their human instinct to attach themselves to their work (a well known and documented human trait). But it can’t hurt to be tactful , just in case they’re not.
I know I personally would’ve felt hurt if that first comment were the reply to something I did. I might, hopefully, have realised that it was, indeed, a useful comment and that I should swallow my pride. And I hope I would have the presence of mind to reply with a smile and a thank you. But it would be a pretense.
Why put each other through that? We can be truthful and tactful. There is no reason to be blunt.
I can understand the desire to unshackle ourselves from the low entropy morass of politeness. It’s fake and not what you really feel, right? The truth shines through anyway, so do them the honour of not sugar coating it, and be straight? I do understand, but hear this: you lack the nonverbal cues to make that incredibly delicate call. Face to face you would see the twitch in someone’s eye, the soft breeze of disappointment whisking along their face. A tear of pain streaking their human heart while they smile and nod and say thank you. You have an incredibly attune radar for empathy, and you would adjust. But when you lose all that, and the medium for your truth is this harsh knife forged from the soulless steel of words, it cuts more sharply than you can imagine.
In a text only medium, I implore everyone on HN: take a moment to consider our humanity. Not just here, but in all threads. Not just of the person we reply to, but of those we speak of. They might read it one day. And they, too, are human.
(I should say the painful irony in this rant is palpable to anyone who knows me :( but I am not proud of it. Perhaps it’s precisely why I feel so strongly. It does, after all, take one to know one.)
No, of course the word you're after is "tact". (And even as a foreigner, it annoys the heck out of me when people mix them up [which you didn't!], ususally as in "take another tact" when they mean "take another tack" [where 'tack' is a originally a sailing term for flipping the sail to the other side of the boat when going against the wind, IIUC; so the expression means "go in a slightly different direction"].)
I'm just riffing on Heikki's remark about foreign languages: In my primary language, Swedish, "tack" means "thank you" but is also often used for Eng. "please" -- so "Kan du ... , tack?" means "Could you ... , please?".
I second this, I did the exact same thing and it made me a bit mad -- why include interesting placeholders if you can't just "go" with them. So, in its current state, it might be a bit counterintuitive and could have a better UX.
I've poked at trying to build this multiple times for the last 10 years, but always ended up looking at the wikipedia XML and just balking.
The one difference that mine theoretically would have had is that I think it would be rad to include the paragraph you found the link along the edge in the graph, so you could see a little story.
Also, excluding years and places to make the routes more "interesting" (e.g. longest path under a limit without cycles).
The UX design is very well executed. This feels so polished. The floating graphs in the background, the individual paths under the chart, etc. It's all very well done with lots of little flourishes. Makes me want to up my game. Thanks for sharing.
[copy of answer from below] The Wikipedia database doesn't differentiate links which appear in the main article versus in the sources or categories sections. It's possible one of the intermediate links is in there. You sometimes need to do a CTRL+f in "View Source" to find the link. Also, the latest Wikipedia dump is from February 2nd, so it's possible the link has been deleted since that date. I'll regenerate my database when the new dump lands in early March.
I do a bi-directional BFS, but the search from the target node traverses incoming links as opposed to outgoing links. That's why I have to store both in the `links` table[1].
Your notification for having JS disabled made me chuckle.
One suggestion I have is that the graph view seems to clutter up pretty fast. Maybe have a slider that increases the length of the lines between the vertices. Also, a SVG export would be cool for visualizing related concepts.
The graph visualization / performance is definitely not ideal. I spent a ton of time trying to make d3 more performant and layout the graph more nicely, but ultimately I just had to cut my losses and go with what I had. I do think there is room for improvement and I'll look into your suggestion, which is something I didn't consider. SVG export is also a great idea!
How about representing the destination page as a circle around the whole graph, instead of a node? so that all the paths can be drawn in different directions but still reaching the same page. I somehow feel that it might look more beautiful.
Ooh cool idea! That certainly would improve the information density issue. I honestly never considered that at all and have no idea how I'd do it in d3, but I may try to hack it out. Thanks!
Maybe link all the outside nodes and then tell them to repel each other. If there are only two, it will be weird, but with three or more should work out.?
Thanks! I'm glad you asked. I actually do what I call a bi-directional breadth first search[1]. The gist of it is that instead of just doing a BFS from the source node until I reach the target node, I do a reverse BFS from the target node as well and wait until the two searches overlap. That helps with the exploding path problem, although that still becomes an issue for longer paths (>= 5 degrees generally). I also pre-compute all the incoming and outgoing links for each page when I create the database[2] so I don't need to do that upon every search, which resulted in a huge performance boost.
When I was first playing with this, I actually really expected you to be using an expensive performant solution like neo4j. When I read that you were using sqlite, I didn't believe it at first and thought it was a mistype from dev until I looked over the source.
That's an impressive and well thought out performance enhancement, and that the app runs so blazingly fast on sqlite is very impressive.
My thoughts exactly. It's impressive that despite relying on BFS, the tool manages to be so fast on a tiny node on google cloud. Even when there's a depth of >= 5 !
The downside of bidirectional seems to be that things like place names are linked via short paths through boring articles like "Census designated place" or "City"
Apparently an idiot here. What is the difference between web hosting and server hosting?
The way my mind interprets that stack is the database is hosted on firebase while the page is hosted on the server?
Edit: Thank you all for the explanation. I used to think firebase was used as a database. I didn't know one could host front end files there. It seems I still have a long ways to go :)
No, not an idiot. I didn't use the best terms. I updated them to say "Web (frontend) hosting" which are my static files (the HTML, JS, CSS) which is deployed to Firebase Hosting and "Server (backend) hosting" which is my backend Python Flask web server which is deployed to Google Compute Engine (GCE).
So, the website files are hosted on Firebase while the backend is hosted on GCE. The database is actually not hosted; it's just a SQLite file stored on my GCE instance.
Firebase Hosting is a simple way to host frontend code, similar to (but a little easier than) using S3 to serve the frontend. Since the database is SQLite, it seems like the backend and DB are hosted on GCE.
I assumed the other way around. Firebase gives you the webpage, but when you send a query, it is sent to Compute Engine to calculate all the paths and sent back to the frontend to render.
I'm interested in building a fact checker from a wikipedia graph, and your SDOW seems like a great place to start (I'm intending to use an algorithm inspired from researchers at Indiana University http://journals.plos.org/plosone/article?id=10.1371/journal....). I was wondering if your database has a non-GUI API. Is there a URL or something I can hit to get back JSON or XML as a response?
I'd prefer you not send any additional load to my server (this is just a side project I'm paying out of pocket for), but you are welcome to download the data yourself. There are instructions in the project README[1] to download the SQLite files I use in the project and I should have documented enough about the schema for you to know what queries to make. I am happy to answer questions via GitHub issues if you have them.
Could you please add a mode that does the opposite?
I've always enjoyed playing 6 degrees myself, so if it gives a link to the first page and names the second page, then only shows the available routes when I'm done, that would be a lot of fun.
I have a couple of "hub" articles that I like to use, but I'd like to see how much more effective I could have been with a tool like this. And if it randomizes my start and end like the placeholder text shows, that makes it even easier.
Love the idea, and it’s brilliantly executed! Well done.
Perhaps I misinterpreted the concept of “degrees of separation”, but I was expecting the site to tell me how to start at page X and get to page Y with the min number of clicks. If you wanted to achieve this, it doesn’t strike me as appropriate to use Bidirectional BFS but IANAL.
I did notice that someone pointed out that they get different results by swapping the order of X and Y. This seems pretty surprising?
> I was expecting the site to tell me how to start at page X and get to page Y with the min number of clicks.
Yup, this is exactly what the site does, and a bi-directional BFS is an efficient way to do it. The special thing about my bi-directional BFS is that I follow outgoing links when searching from the source page while following incoming links when searching from the target page[1].
> I did notice that someone pointed out that they get different results by swapping the order of X and Y. This seems pretty surprising?
This is expected, because it is a directed graph, with the links on Wikipedia being in one direction. Just because page A links to page B doesn't mean page B links to page A.
I was looking at the results of going from CMU to my little secondary school in Dublin, Ireland. I saw the results and saw that the last page before my Irish school was "College" and assumed it must be wrong, because how could my tiny secondary school be on the Wikipedia page for "College"? But alas, I was wrong!! I just checked and turns out it IS on the college wiki page!
I also assumed you were looking at outgoing links for both X and Y - that explains a lot.
I am super interested in this, but I have never done any graph theory or searching/planning (I'm EE) - how did you build up all of the incoming links for each wiki page? Are you storing all of this? How much data is that? Thanks for the reply!
I'm definitely not the first to think of it or build a tool for it (lots of similar projects gave me inspiration), but I think I'm the first to make it really fast and with a nice usable UI. And to actually open source the code so others can build it themselves.
Can you tell me more about what you mean by filtering by mutual paths?
> Can you tell me more about what you mean by filtering by mutual paths?
I should have said mutual connections, my apologies. So if article A connects to B and article B also connects to A. I suppose you could do this mostly client-side, all you need is two searches (one the reverse of the other) and an intersection of the resulting graphs, no?
I implemented this pretty naively a while back [1]. I was interested in how yours was so fast. I expected some sort of complex heuristic; cool to see that your solution is straightforward!
The database creation script[1] has a lot of Unix junk in it, but reading through the comments and echo statements should give you an idea of what it does. The end result is a SQLite database with a size of about 9 GB which has four tables, the schema of which are described in the README[2]. The big things that are precomputed are redirects are "auto-followed" to reduce the total graph size and all incoming and outgoing links are stored in a |-separated string for each page (in the `links` table).
Every time a query is made, a bi-directional breadth-first search[3] is run which uses the |-separated incoming and outgoing links and runs a fairly standard BFS algorithm. A lot of the hard work was precomputed, which minimizes the number of required database queries and makes each search respond fairly quickly.
I've not yet had a chance to look over the code (in case it's already there or infeasible due to architecture), but you may wish to consider caching prior queries and their results - this seems like the type of service that would be likely to have certain paths shared widely, such as the first few top-level comments on this post.
I'm not sure caching would help a ton given how I structure the data and do my searches in batches of pages, not for individual pages. I already do some "caching" by precomputing all incoming and outgoing links for each page when I create the database, which, as you would expect, yields a huge performance improvement. A cache certainly would help, but I would expect the hit rate on it to be extremely low, making it not worth the effort. I may have a different opinion after analyzing some of today's results though. Thanks for the suggestion!
Just a heads up - some of the node colors can be difficult to differentiate for people who are red/green colorblind. Very minor, just wanted to mention it though.
I actually had a friend suggest it to me and the Neo4j docs happen to be one of the many tabs I currently have open. I was already so far into using SQLite for this project and I wanted to ship it, so I decided to stick with what I had. I would be interested to see how Neo4j performs with such a big dataset (the resulting SQLite file is around 9 GB with nearly 6 million nodes and 500 billion links). I was a bit worried that Neo4j wouldn't be able to scale to a graph of that size, but that is a completely untested and ignorant opinion. If you have any experience with Neo4j, I'd love to hear your thoughts.
I've only played around with Neo4j, but it looks like Neo4j 3.0 could handle your node and edge count (although not earlier versions).
Your app likely would have been easier to write with python/Neo4j than python/SQLite due to Neo4j's query language (incidentally, Neo4j also prefers to use bidirectional BFS for shortest path, and executing the search is a very simple query similar to an SQL query). Neo4j would possibly perform better as well (I'm not sure about that, given SQLite is pretty optimized for reads).
However, as the neighboring comment states, Neo4j is quite resource-intensive. It wouldn't work on any kind of micro instance, I'm pretty sure.
Awesome job!.A great learning tool for someone coming from Non-Cs background to actually dabble with BFS Graph algorithms. After a quick search, i managed to ferret out a few implementations, one of which is this. Uses Neo4J.
https://github.com/erabug/wikigraph
The sheer scale of Wikipedia (5 millions pages, half a trillion links) made it difficult to make the searches fast. Simply downloading the Wikipedia database dumps and parsing them into my own database took over a day on my first successful attempt. The site returns most results in just a few seconds despite the giant graph size.
Were either of your creations around since ~2012? I recall someone sharing this very concept in a room on turntable.fm, except it would list it's discovery in real-time as an ordered list. I've been racking my memory for 2 years trying to find a link again, but here we are!
Thank you! The graph is built using vanilla d3, no library on top of it. The code for it lives all in one file, ResultsGraph.js [1]. I pieced together the code from a handful of other attempts online. I am still not 100% pleased with the performance of it with a larger number of nodes (250+), but that seems to be a common complaint with the d3 force simulation layouts.
The resulting SQLite database file is currently 8.3 GB, most of which is taken up by the `links` table. The big performance wins are having a handful of indexes (see the .sql files[1] for the database's schema) and preprocessing a lot of data so I don't have to do duplicate work every time a query occurs. For example, instead of the `links` table going from `source_id` to `target_id` and having a ton of rows which have the same `source_id`, I go from `id` to `outgoing_links` (which is a |-separate string of all source page IDs). Computing each page's incoming and outgoing links is the really heavy work and I only do that at database creation time, using a beefy GCP machine with 8 vCPUs, 52 GB RAM, and a 256 GB SSD. It still takes about an hour, but it's a one time cost and means I can run the actual service on a much smaller machine which won't cost me a fortune to maintain. Also, SQLite is just very fast and performant out of the box, so as usual, it's a matter of choosing the right tools for the job.
[1] https://github.com/jwngr/sdow