Creator here. Six Degrees of Wikipedia is a side project I've been sporadically ...

thrownaway954 · on Feb 26, 2018

make it so that if I visit the page and just click the "go" button, it will use the placeholder examples as the start and end points. i did this and got an error message stating "You'll probably want to choose the start and end pages before you hit that." that was annoying. the placeholders that were auto chosen were actually really interesting.

ng-user · on Feb 26, 2018

'Please' and 'thank you' go a long when requesting additional features for an OSS project.

larkeith · on Feb 26, 2018

There's a certain irony to the lack of tact in this post suggesting better manners. I suspect the point would be better received were it more politely made.

I, of course, am merely propagating the cycle.

mkobit · on Feb 26, 2018

That post seems like quintessential trolling. It was my expectation as well when going to the page to just click the "Go" button and see what the results would be, and I think the suggestion is great feedback. I think it is generally in the spirit of these "Show HN" posts to give and get feedback and discuss with the creator. The response post to that user, to me, is patronizing and is undoubtedly going to ruffle some feathers. But, now we aren't even talking about the OPs cool project...

Stratoscope · on Feb 27, 2018

When I was a kid my mom told me "always say 'please' and 'thank you'."

So from then on, when I wanted something, I said "please and thank you."

I think this was my first social hack. I just wish it had worked better!

dspillett · on Feb 27, 2018

I've see something similar when offering items on freecycle: people seem to think that the more they insert "pls" and "thnx" into a message the more likely it is that they'll be first in the queue.

I swear at least once there was more instances of "pls" and thnx" (there seems to be a significant overlap between people who overuse the words and people who don't bother to type them properly, though that may be the linguistic snob in me talking) than all other words combined. It backfired by making the message difficult to read so I binned it and read the next.

dagss · on Feb 26, 2018

Come on, this is Show HN! Not some GitHub issue tracker. Try to insert please and thank you in that post ... it will make it sound rude! It will turn it from Show HN feedback to sounding like a feature request, which it is clearly not...

corobo · on Feb 26, 2018

Just adding "Could you" to the start (and change first sentence's . to ?) would make a world of difference

tshaddox · on Feb 26, 2018

If you can automatically detect sentences where adding "Could you" would make the author sound more tactful, then why not just mentally add them and assume that the author intended to sound that tactful?

This is a slight modification of the principle of charity, something I think is incredibly important and underemphasized, especially for online text communication.

jaitaiwan · on Feb 26, 2018

That's a really cool concept. Write a medium article about it.

indigodaddy · on Feb 27, 2018

I can't tell of this is disengenuous or not..

BrandonM · on Feb 27, 2018

I think that's the point.

IncRnd · on Feb 27, 2018

> why not just mentally add them and assume that the author intended to sound that tactful?

Because, that could be the wrong action to take. Could you, Would you, Will you, I want you to, You should, The project is stupid unless you, etc.

Therefore, the author should communicate well enough to be understood.

tshaddox · on Feb 27, 2018

Based on the thread here, many people interpreted the comment as a well-intended feature request, but instead of requesting the commenter to change their comment, it seems like it would’ve been fine to just use that charitable interpretation.

IntronExon · on Feb 27, 2018

I’m all for being charitable in our interpretations of the motives of others, but I think a total lack of the most basic courtesy should be seen for what it is. The lower end of the spectrum being lazy, and the high end being rude, even charitably no positive interpretation exists IMO.

Lxr · on Feb 27, 2018

The link in your profile is broken, FYI.

IntronExon · on Feb 27, 2018

Please and thank you are such basic concepts that honestly, I judge the hell out of people who don’t seem capable of it regardless of platform. If that minimal degree of politeness isn’t reflexive for you, again, the world at large judges you. Not to put too fine a point on it, but maybe “why not add it in your head” is the kind of thing thst gives some in tech a bad name?

It costs nothing to be polite.

Heikki0 · on Feb 27, 2018

This is a bit off topic, but just to blow ones mind: Finnish, which I am natively speaking, is lacking direct "please" word completely and I have to admit that especially orally I forget it often. (We make things sound nice by conditional, "could you", "would you be so kind" etc..) I have no immediate reaction what so ever how it would feel if someone left out "please". I understand the concept, but it feels horrible how much wrong I can go with native English speakers without meaning it. Thanks for the reminder! (thank you we do have;)

IntronExon · on Feb 27, 2018

That’s not off-topic at all! Honestly, I’m ashamed that I hadn’t considered the possibility of a non-native speaker’s linguistic quirks, and what this tells me is that I need to continue to work on my snap-judgements. Thank you for reminding me that languages don’t handle courtesy the same way.

Heikki0 · on March 2, 2018

This was really nice to hear! Don't worry, one can only think through ones own mind =) I think this already helps, if everybody are just a bit more aware about that others might think differently, and not be deliberately mean. I have learned several languages, since Finnish is very useless in a large scale, and it always amazes me how in different languages it is also thought differently. I can only recommend!

akarma · on Feb 26, 2018

You're right that please and thank you could come off as sarcastic or a bit odd in a comment section like this.

The best route is probably to clarify that it seems like a good concept, and you think that adding x and y would be the best features to focus on that would make it even better.

periram · on Feb 26, 2018

Please use "Please" when you are requesting more work from OSS, OR even better, make the change yourself and how all of us how it should be done.

mung · on Feb 26, 2018

It's a suggestion, not a request and a good one at that. So who should be saying 'thank you' here (if anyone)?

nothrabannosir · on Feb 27, 2018

There’s a human being at the other end of the line. They have feelings. This is their baby, which they’ve gestated for multiple years.

Rational fundamentalism is not the one ring to rule them all. There is room for love and compassion in the world. I know (painfully well) that Internet forums (and this one in particular) anything but empower empathy, but that doesn’t make it any less true or valuable.

There is room for love and empathy. There really is. Logical, rationalistic supremacy notwithstanding.

darkerside · on Feb 27, 2018

When your baby grows up, and you put it out in the world, you no longer need to--and no longer should--continue to treat it like a baby. The author is not her project, and statements about the project are not personal attacks on the author. And if she were going to take them as such, she probably wouldn't be sharing that project on HN.

nothrabannosir · on Feb 27, 2018

This is not how humans work. Humans are not rational emotionless robots. Humans are complex beings , they have feelings, biases and emotions which they do not control. It’s humane , considerate and kind to keep that in mind when communicating with each other.

Nobody is perfect. We shouldn’t expect anyone to be. Being kind and considerate is not a waste of anyone’s time, it’s not disrespectful nor patronising.

Maybe all of this is not true for OP, maybe they’re so zen they immediately see past their human instinct to attach themselves to their work (a well known and documented human trait). But it can’t hurt to be tactful , just in case they’re not.

I know I personally would’ve felt hurt if that first comment were the reply to something I did. I might, hopefully, have realised that it was, indeed, a useful comment and that I should swallow my pride. And I hope I would have the presence of mind to reply with a smile and a thank you. But it would be a pretense.

Why put each other through that? We can be truthful and tactful. There is no reason to be blunt.

I can understand the desire to unshackle ourselves from the low entropy morass of politeness. It’s fake and not what you really feel, right? The truth shines through anyway, so do them the honour of not sugar coating it, and be straight? I do understand, but hear this: you lack the nonverbal cues to make that incredibly delicate call. Face to face you would see the twitch in someone’s eye, the soft breeze of disappointment whisking along their face. A tear of pain streaking their human heart while they smile and nod and say thank you. You have an incredibly attune radar for empathy, and you would adjust. But when you lose all that, and the medium for your truth is this harsh knife forged from the soulless steel of words, it cuts more sharply than you can imagine.

In a text only medium, I implore everyone on HN: take a moment to consider our humanity. Not just here, but in all threads. Not just of the person we reply to, but of those we speak of. They might read it one day. And they, too, are human.

(I should say the painful irony in this rant is palpable to anyone who knows me :( but I am not proud of it. Perhaps it’s precisely why I feel so strongly. It does, after all, take one to know one.)

prophesi · on Feb 26, 2018

Tact.

CRConrad · on Feb 28, 2018

Or "Tack". ;-)

No, of course the word you're after is "tact". (And even as a foreigner, it annoys the heck out of me when people mix them up [which you didn't!], ususally as in "take another tact" when they mean "take another tack" [where 'tack' is a originally a sailing term for flipping the sail to the other side of the boat when going against the wind, IIUC; so the expression means "go in a slightly different direction"].)

I'm just riffing on Heikki's remark about foreign languages: In my primary language, Swedish, "tack" means "thank you" but is also often used for Eng. "please" -- so "Kan du ... , tack?" means "Could you ... , please?".

https://en.wikipedia.org/wiki/Tacking_(sailing)

scottmcdot · on Feb 27, 2018

Stack overflow has conditioned me to not use manners. My personal preference is to use them.

GoToRO · on Feb 26, 2018

He is not making any feature request. He is only pointing out a serious UI bug. Free work.

jwngr · on Feb 26, 2018

This has been fixed[1] and should behave in a more intuitive way now. Thanks for the suggestion!

[1] https://github.com/jwngr/sdow/commit/6e42e06488a592784e5d3d2...

thrownaway954 · on Feb 27, 2018

awesome!!! you rock!!!

anonytrary · on Feb 27, 2018

I second this, I did the exact same thing and it made me a bit mad -- why include interesting placeholders if you can't just "go" with them. So, in its current state, it might be a bit counterintuitive and could have a better UX.

abrahms · on Feb 26, 2018

I've poked at trying to build this multiple times for the last 10 years, but always ended up looking at the wikipedia XML and just balking.

The one difference that mine theoretically would have had is that I think it would be rad to include the paragraph you found the link along the edge in the graph, so you could see a little story.

Also, excluding years and places to make the routes more "interesting" (e.g. longest path under a limit without cycles).

examancer · on Feb 26, 2018

The UX design is very well executed. This feels so polished. The floating graphs in the background, the individual paths under the chart, etc. It's all very well done with lots of little flourishes. Makes me want to up my game. Thanks for sharing.

Trufa · on Feb 26, 2018

Am I not understanding something? When I linked Obama to Uruguay, Myanmar came up (the only surprising one).

https://www.sixdegreesofwikipedia.com/?source=Uruguay&target...

I went to the Uruguayan wiki page and found nothing on Myanmar and nothing relating to Uruguay on the Myanmar page.

Awesome project btw, great idea.

jwngr · on Feb 26, 2018

[copy of answer from below] The Wikipedia database doesn't differentiate links which appear in the main article versus in the sources or categories sections. It's possible one of the intermediate links is in there. You sometimes need to do a CTRL+f in "View Source" to find the link. Also, the latest Wikipedia dump is from February 2nd, so it's possible the link has been deleted since that date. I'll regenerate my database when the new dump lands in early March.

spiznnx · on Feb 26, 2018

It's a directed graph. Your source is Uruguay and your destination is Obama. You'll find Obama on the midpoints.

Opposite direction: no Myanmar https://www.sixdegreesofwikipedia.com/?source=Uruguay&target...

mkagenius · on Feb 27, 2018

Which is kind of weird since he is using bi directional BFS.

jwngr · on Feb 27, 2018

I do a bi-directional BFS, but the search from the target node traverses incoming links as opposed to outgoing links. That's why I have to store both in the `links` table[1].

[1] https://github.com/jwngr/sdow/blob/f0b5a9ebe47ea0eca49d8220a...

colemannugent · on Feb 26, 2018

Your notification for having JS disabled made me chuckle.

One suggestion I have is that the graph view seems to clutter up pretty fast. Maybe have a slider that increases the length of the lines between the vertices. Also, a SVG export would be cool for visualizing related concepts.

jwngr · on Feb 26, 2018

Glad you found one of the Easter eggs :D

The graph visualization / performance is definitely not ideal. I spent a ton of time trying to make d3 more performant and layout the graph more nicely, but ultimately I just had to cut my losses and go with what I had. I do think there is room for improvement and I'll look into your suggestion, which is something I didn't consider. SVG export is also a great idea!

nrjames · on Feb 27, 2018

Interesting you used d3. I used sigma.js on a very similar project in the past and only had performance issues with giant graphs. http://sigmajs.org/

wybiral · on Feb 26, 2018

> Glad you found one of the Easter eggs :D

Easter egg? I never have JS enabled on random websites.

ysaimanojkumar · on Feb 27, 2018

Hi jwngr,

Thanks for the cool hack. Its nice.

How about representing the destination page as a circle around the whole graph, instead of a node? so that all the paths can be drawn in different directions but still reaching the same page. I somehow feel that it might look more beautiful.

jwngr · on Feb 27, 2018

Ooh cool idea! That certainly would improve the information density issue. I honestly never considered that at all and have no idea how I'd do it in d3, but I may try to hack it out. Thanks!

thunderrabbit · on Feb 27, 2018

Maybe link all the outside nodes and then tell them to repel each other. If there are only two, it will be weird, but with three or more should work out.?

sinaa · on Feb 26, 2018

Great work!

Do you simply do a BFS to find the shortest paths? If so, are you doing any tricks to avoid the path explosion problem?

jwngr · on Feb 26, 2018

Thanks! I'm glad you asked. I actually do what I call a bi-directional breadth first search[1]. The gist of it is that instead of just doing a BFS from the source node until I reach the target node, I do a reverse BFS from the target node as well and wait until the two searches overlap. That helps with the exploding path problem, although that still becomes an issue for longer paths (>= 5 degrees generally). I also pre-compute all the incoming and outgoing links for each page when I create the database[2] so I don't need to do that upon every search, which resulted in a huge performance boost.

[1] https://github.com/jwngr/sdow/blob/a2699dc95d884ec64a4641630... [2] https://github.com/jwngr/sdow/blob/a2699dc95d884ec64a4641630...

vorpalhex · on Feb 26, 2018

When I was first playing with this, I actually really expected you to be using an expensive performant solution like neo4j. When I read that you were using sqlite, I didn't believe it at first and thought it was a mistype from dev until I looked over the source.

That's an impressive and well thought out performance enhancement, and that the app runs so blazingly fast on sqlite is very impressive.

sinaa · on Feb 26, 2018

My thoughts exactly. It's impressive that despite relying on BFS, the tool manages to be so fast on a tiny node on google cloud. Even when there's a depth of >= 5 !

lbruder · on Feb 27, 2018

Especially SQLite is often overlooked. A marvelous library that should be part of every personal toolkit.

matteuan · on Feb 27, 2018

Neo4j is not necessarily more performant, especially for a particular use case like this.

nly · on Feb 26, 2018

The downside of bidirectional seems to be that things like place names are linked via short paths through boring articles like "Census designated place" or "City"

jwngr · on Feb 27, 2018

The bi-directional nature of the search does not change the end result. It is simply a performance improvement.

solarkraft · on Feb 27, 2018

Still - It can be annoying. Maybe there could be an option to exclude articles with specific text/link ratios?

ryan_j_naughton · on Feb 26, 2018

It is bidirectional BFS: https://github.com/jwngr/sdow/blob/master/sdow/breadth_first...

A* can't be used given that path cost or expected remaining distance is unknown.

Any ideas on how such an algorithm could be used without precomputing the entire graph?

Teichopsia · on Feb 26, 2018

Apparently an idiot here. What is the difference between web hosting and server hosting?

The way my mind interprets that stack is the database is hosted on firebase while the page is hosted on the server?

Edit: Thank you all for the explanation. I used to think firebase was used as a database. I didn't know one could host front end files there. It seems I still have a long ways to go :)

jwngr · on Feb 26, 2018

No, not an idiot. I didn't use the best terms. I updated them to say "Web (frontend) hosting" which are my static files (the HTML, JS, CSS) which is deployed to Firebase Hosting and "Server (backend) hosting" which is my backend Python Flask web server which is deployed to Google Compute Engine (GCE).

So, the website files are hosted on Firebase while the backend is hosted on GCE. The database is actually not hosted; it's just a SQLite file stored on my GCE instance.

servercobra · on Feb 26, 2018

Firebase Hosting is a simple way to host frontend code, similar to (but a little easier than) using S3 to serve the frontend. Since the database is SQLite, it seems like the backend and DB are hosted on GCE.

ehsankia · on Feb 26, 2018

I assumed the other way around. Firebase gives you the webpage, but when you send a query, it is sent to Compute Engine to calculate all the paths and sent back to the frontend to render.

unreal37 · on Feb 27, 2018

Some years ago, I owned sixipedia.com. I never knew what to do with it.... If I still had it, I would have given it to you.

SkylerASmith · on March 5, 2018

This is an awesome tool!

I'm interested in building a fact checker from a wikipedia graph, and your SDOW seems like a great place to start (I'm intending to use an algorithm inspired from researchers at Indiana University http://journals.plos.org/plosone/article?id=10.1371/journal....). I was wondering if your database has a non-GUI API. Is there a URL or something I can hit to get back JSON or XML as a response?

jwngr · on March 7, 2018

I'd prefer you not send any additional load to my server (this is just a side project I'm paying out of pocket for), but you are welcome to download the data yourself. There are instructions in the project README[1] to download the SQLite files I use in the project and I should have documented enough about the schema for you to know what queries to make. I am happy to answer questions via GitHub issues if you have them.

[1] https://github.com/jwngr/sdow#get-the-data-yourself

SkylerASmith · on March 7, 2018

That makes sense; thanks for your response!

reificator · on Feb 27, 2018

Could you please add a mode that does the opposite?

I've always enjoyed playing 6 degrees myself, so if it gives a link to the first page and names the second page, then only shows the available routes when I'm done, that would be a lot of fun.

I have a couple of "hub" articles that I like to use, but I'd like to see how much more effective I could have been with a tool like this. And if it randomizes my start and end like the placeholder text shows, that makes it even easier.

mintplant · on Feb 27, 2018

Sounds like https://thewikigame.com

cvigoe · on Feb 27, 2018

Love the idea, and it’s brilliantly executed! Well done.

Perhaps I misinterpreted the concept of “degrees of separation”, but I was expecting the site to tell me how to start at page X and get to page Y with the min number of clicks. If you wanted to achieve this, it doesn’t strike me as appropriate to use Bidirectional BFS but IANAL.

I did notice that someone pointed out that they get different results by swapping the order of X and Y. This seems pretty surprising?

Well done again!

jwngr · on Feb 27, 2018

Thanks, glad you enjoyed it!

> I was expecting the site to tell me how to start at page X and get to page Y with the min number of clicks.

Yup, this is exactly what the site does, and a bi-directional BFS is an efficient way to do it. The special thing about my bi-directional BFS is that I follow outgoing links when searching from the source page while following incoming links when searching from the target page[1].

> I did notice that someone pointed out that they get different results by swapping the order of X and Y. This seems pretty surprising?

This is expected, because it is a directed graph, with the links on Wikipedia being in one direction. Just because page A links to page B doesn't mean page B links to page A.

[1] https://github.com/jwngr/sdow/blob/master/sdow/breadth_first...

cvigoe · on Feb 27, 2018

I just realized what my confusion was over!

https://www.sixdegreesofwikipedia.com/?source=Carnegie%20Mel...

I was looking at the results of going from CMU to my little secondary school in Dublin, Ireland. I saw the results and saw that the last page before my Irish school was "College" and assumed it must be wrong, because how could my tiny secondary school be on the Wikipedia page for "College"? But alas, I was wrong!! I just checked and turns out it IS on the college wiki page!

I also assumed you were looking at outgoing links for both X and Y - that explains a lot.

I am super interested in this, but I have never done any graph theory or searching/planning (I'm EE) - how did you build up all of the incoming links for each wiki page? Are you storing all of this? How much data is that? Thanks for the reply!

jwngr · on Feb 27, 2018

Check out an earlier comment I made[1] which has some information about this. It also includes links to the relevant code, which is all open source.

[1] https://news.ycombinator.com/item?id=16469260

vanderZwan · on Feb 26, 2018

I had a conversation with a friend a few weeks ago that surely this already exists, and if not that someone should make this.

Any plans to filter by mutual paths?

jwngr · on Feb 26, 2018

I'm definitely not the first to think of it or build a tool for it (lots of similar projects gave me inspiration), but I think I'm the first to make it really fast and with a nice usable UI. And to actually open source the code so others can build it themselves.

Can you tell me more about what you mean by filtering by mutual paths?

vanderZwan · on Feb 26, 2018

> Can you tell me more about what you mean by filtering by mutual paths?

I should have said mutual connections, my apologies. So if article A connects to B and article B also connects to A. I suppose you could do this mostly client-side, all you need is two searches (one the reverse of the other) and an intersection of the resulting graphs, no?

wwalexander · on Feb 26, 2018

I implemented this pretty naively a while back [1]. I was interested in how yours was so fast. I expected some sort of complex heuristic; cool to see that your solution is straightforward!

[1] https://github.com/wwalexander/wikipath

kozziollek · on Feb 26, 2018

I think GP wanted to find paths in both directions: X -> A -> B -> Y and Y -> C -> D -> X. Possibly where A = D and B = C.

bluetwo · on Feb 26, 2018

I would like to hear a little more on how you organized the search and what you are pre-processing and what you calculate on-the-fly. Thanks.

jwngr · on Feb 26, 2018

The database creation script[1] has a lot of Unix junk in it, but reading through the comments and echo statements should give you an idea of what it does. The end result is a SQLite database with a size of about 9 GB which has four tables, the schema of which are described in the README[2]. The big things that are precomputed are redirects are "auto-followed" to reduce the total graph size and all incoming and outgoing links are stored in a |-separated string for each page (in the `links` table).

Every time a query is made, a bi-directional breadth-first search[3] is run which uses the |-separated incoming and outgoing links and runs a fairly standard BFS algorithm. A lot of the hard work was precomputed, which minimizes the number of required database queries and makes each search respond fairly quickly.

[1] https://github.com/jwngr/sdow/blob/master/database/buildData... [2] https://github.com/jwngr/sdow#database-creation-process [3] https://github.com/jwngr/sdow/blob/master/sdow/breadth_first...

bluetwo · on March 2, 2018

Thanks!

larkeith · on Feb 26, 2018

I've not yet had a chance to look over the code (in case it's already there or infeasible due to architecture), but you may wish to consider caching prior queries and their results - this seems like the type of service that would be likely to have certain paths shared widely, such as the first few top-level comments on this post.

jwngr · on Feb 26, 2018

I'm not sure caching would help a ton given how I structure the data and do my searches in batches of pages, not for individual pages. I already do some "caching" by precomputing all incoming and outgoing links for each page when I create the database, which, as you would expect, yields a huge performance improvement. A cache certainly would help, but I would expect the hit rate on it to be extremely low, making it not worth the effort. I may have a different opinion after analyzing some of today's results though. Thanks for the suggestion!

sente · on March 1, 2018

Thus is awesome.

Just a heads up - some of the node colors can be difficult to differentiate for people who are red/green colorblind. Very minor, just wanted to mention it though.

techaddict009 · on Feb 27, 2018

Looks pretty cool.

Simple question I have is, are you hitting wikipedia api live? or you have dump of the wikipedia and running through it?

If running through dump do you update it regularly or how?

Thanks in advance.

jwngr · on Feb 27, 2018

The autocomplete suggestions hit the live Wikipedia API[1]. The actual search algorithm is on a dump of Wikipedia[2], which I plan to update monthly.

[1] https://github.com/jwngr/sdow/blob/f39398d112fecf7b993c64bd4... [2] https://github.com/jwngr/sdow#data-source

whateveruser · on Feb 27, 2018

Hey man, just a nitpick. The input fields fudge up when using dark GTK themes, as in text isn't legible unless I select it. Might wanna look into it.

jnbiche · on Feb 26, 2018

Why not use a graph database like Neo4j instead of SQLite? This seems like the perfect use case.

Is it because of the resources required to run one versus SQLite?

jwngr · on Feb 26, 2018

I actually had a friend suggest it to me and the Neo4j docs happen to be one of the many tabs I currently have open. I was already so far into using SQLite for this project and I wanted to ship it, so I decided to stick with what I had. I would be interested to see how Neo4j performs with such a big dataset (the resulting SQLite file is around 9 GB with nearly 6 million nodes and 500 billion links). I was a bit worried that Neo4j wouldn't be able to scale to a graph of that size, but that is a completely untested and ignorant opinion. If you have any experience with Neo4j, I'd love to hear your thoughts.

jnbiche · on Feb 26, 2018

I've only played around with Neo4j, but it looks like Neo4j 3.0 could handle your node and edge count (although not earlier versions).

Your app likely would have been easier to write with python/Neo4j than python/SQLite due to Neo4j's query language (incidentally, Neo4j also prefers to use bidirectional BFS for shortest path, and executing the search is a very simple query similar to an SQL query). Neo4j would possibly perform better as well (I'm not sure about that, given SQLite is pretty optimized for reads).

However, as the neighboring comment states, Neo4j is quite resource-intensive. It wouldn't work on any kind of micro instance, I'm pretty sure.

ultrasounder · on Feb 27, 2018

Awesome job!.A great learning tool for someone coming from Non-Cs background to actually dabble with BFS Graph algorithms. After a quick search, i managed to ferret out a few implementations, one of which is this. Uses Neo4J. https://github.com/erabug/wikigraph

dreamfactored · on Feb 27, 2018

Maybe the Neo4j guys would be into helping as they are quite into marketing. This could be a neat showcase speaking to both business and tech types.

tokenizerrr · on Feb 26, 2018

I've used neo4j before and it likes to consume a lot of memory.

johnhenry · on Feb 27, 2018

Great project! I wonder if you might be willing to go into more details about what made it an "interesting technical challenge"?

jwngr · on Feb 27, 2018

The sheer scale of Wikipedia (5 millions pages, half a trillion links) made it difficult to make the searches fast. Simply downloading the Wikipedia database dumps and parsing them into my own database took over a day on my first successful attempt. The site returns most results in just a few seconds despite the giant graph size.

sidcool · on Feb 27, 2018

This is pretty cool. It would be great to write a blog around your technical decision making for this project. Thanks!

orf · on Feb 26, 2018

Damn, you beat me to it! I've been hacking on something similar for the longest time. Thanks for sharing your code!

VikingIV · on Feb 26, 2018

Were either of your creations around since ~2012? I recall someone sharing this very concept in a room on turntable.fm, except it would list it's discovery in real-time as an ordered list. I've been racking my memory for 2 years trying to find a link again, but here we are!

nathanken · on Feb 27, 2018

Really cool. I'm interested to know how the graph is build. Did you use any third party components.

jwngr · on Feb 27, 2018

Thank you! The graph is built using vanilla d3, no library on top of it. The code for it lives all in one file, ResultsGraph.js [1]. I pieced together the code from a handful of other attempts online. I am still not 100% pleased with the performance of it with a larger number of nodes (250+), but that seems to be a common complaint with the d3 force simulation layouts.

[1] https://github.com/jwngr/sdow/blob/master/website/src/compon...

nathanken · on March 9, 2018

Thanks for your reply.

zapt02 · on Feb 27, 2018

How big does the SQLite database get? How do you maintain such great performance? (Other than indexes?)

jwngr · on Feb 27, 2018

The resulting SQLite database file is currently 8.3 GB, most of which is taken up by the `links` table. The big performance wins are having a handful of indexes (see the .sql files[1] for the database's schema) and preprocessing a lot of data so I don't have to do duplicate work every time a query occurs. For example, instead of the `links` table going from `source_id` to `target_id` and having a ton of rows which have the same `source_id`, I go from `id` to `outgoing_links` (which is a |-separate string of all source page IDs). Computing each page's incoming and outgoing links is the really heavy work and I only do that at database creation time, using a beefy GCP machine with 8 vCPUs, 52 GB RAM, and a 256 GB SSD. It still takes about an hour, but it's a one time cost and means I can run the actual service on a much smaller machine which won't cost me a fortune to maintain. Also, SQLite is just very fast and performant out of the box, so as usual, it's a matter of choosing the right tools for the job.

[1] https://github.com/jwngr/sdow/tree/master/database

drej · on Feb 27, 2018

Looks pretty cool - do you have any plans to support subsites other than enwiki?

jwngr · on March 7, 2018

Possibly... follow this GitHub issue[1] if you want to be notified about it.

[1] https://github.com/jwngr/sdow/issues/11