A query that matches phrase prefixes - for example "Epic w" will match both "Epic win" and "Epic wonder". This is particularly useful for implementing Google Instant style searches.
This is really useful and painfully lacking from Lucene. Great stuff.
I was curious how they implemented prefix matching, so I went and looked at the code [1]. Unfortunately, this is just a simple linear scan that calls .startswith(). It is possible to do fast (log N) prefix matching with radix / critbit trees.
Radix trees are O(k), where k is the length of the query string.
Anyway, for prefix matching, you want to take careful note of the size and shape of your data, because it matters for which algorithm is fastest. If your data all fits in RAM (or better yet, all fits in L2 cache), then I've had very good results with binary search (O(log N)) to find the first matching result, and then linear scan to find all possible suffixes. This is a lot more cache-friendly than radix trees, which have better theoretical performance but often touch memory that's all over the place.
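Here is a minimal sketch of the binary-search-then-scan approach described above, in Python. It assumes the terms are already sorted and fit in memory. The function name `prefix_matches` and the sample terms are made up for illustration; this is not Greplin's or Lucene's code.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Return every term in sorted_terms that starts with prefix.

    Assumes sorted_terms is a list of strings in ascending order.
    """
    # Binary search, O(log N): index of the first term >= prefix.
    # In sorted order, all terms sharing the prefix sit in one
    # contiguous run that begins at this index.
    i = bisect_left(sorted_terms, prefix)

    # Linear scan over adjacent entries until the prefix stops matching.
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


terms = sorted(["epic fail", "epic win", "epic wonder", "epicure", "zebra"])
print(prefix_matches(terms, "epic w"))  # ['epic win', 'epic wonder']
```

The scan only touches neighbouring entries of one array, which is where the cache-friendliness comes from. The total cost is O(log N + m) string comparisons for m matches.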
I really don't mean to sound unpleasant, but does anyone else think that the primary goal of Greplin is to exit via Google or Facebook? From a consumer point of view, can it ever be anything more than a niche product? I'm genuinely curious...these patches are somewhat superficial, and while a nice gesture, the cynical part of me sees it as a ploy to gain cred with the people most likely to influence buyouts.
you say that like it's a bad thing. why is "1. build something that would be useful to a large company, 2. prove by acquiring users that the product has real value, 3. get acquired" a bad business model? from a consumer point of view, they'll benefit when google or facebook or whoever integrates greplin's techniques.
This is completely irrelevant to the discussion, please feel free to downvote me for it:
2 things: Your account is exactly "1337" days old today, happy leet day :). Also, are you the same zem that posted on newslily a while ago? If so, you posted some really awesome stuff, thanks :)
(Sorry if I've recognized you and asked this question before, I have a terrible memory)
thanks, would never have noticed l337 day on my own :) and yeah, same zem! dropped off newslily due to having too many social networks to keep up with; nice to see it still going strong.
I wouldn't really say it's still going strong, haha. Unfortunately, we never really got the traction we needed for it to keep running by itself (Cody and I were submitting a lot of the content, which was fine, but it would have been really awesome if we didn't have to).
<sarcasm>Big surprise there, though, we were trying to compete against HN and reddit</sarcasm>
Back in November, we both started working on a new project: http://thingist.com, which has been taking a lot of my time lately, so I haven't been submitting as much.
We originally used mmap too, but it didn't work very well. First, Java has some rather serious mmap limitations (http://bugs.sun.com/view_bug.do?bug_id=4724038), and for some reason we saw occasional data corruption (which we haven't seen since we moved to the current system).
Have those mmap limitations really impacted you? And the corruption — I'm assuming that was with read/write mappings that you wrote to?
I'm asking, because we're getting great mileage out of mmap in Clojure, albeit for read-only mappings. And that's in a search engine :-) Using mmap for large data is great, because you avoid enlarging your heap and the garbage collector doesn't even have to care about your data.
The first version of Greplin was actually built on Solr!
Eventually, though, we needed more flexibility than Solr easily offered. For example, we've added far more efficient sharding, document modifications (updates and deletions), flushing, and near real time search than either Lucene or Solr supports out of the box (and they were much easier to add to Lucene than Solr, since Lucene makes fewer assumptions about your dataset/use).
I think Solr is a great tool if your needs happen to fit into their model - but if they diverge a lot, it sometimes makes more sense to build your own custom framework on top of Lucene.
Did you consider the distributed, real-time, and faceted extensions to Lucene we (LinkedIn) built? Sensei, Zoie, and Bobo respectively. Get em at http://sna-projects.com/sna/
We're huge fans! Our real time search technology copied its architecture from Zoie, and our faceting/caching took some ideas from Bobo!
We didn't use them outright since we have fairly different requirements/constraints (our data has some pleasant properties that make it easier to shard and facet than the general case) and we wanted something a bit simpler.
Shoot me an email sometime though (email in profile)! I'd love to buy you lunch and pick your brain ;-) You guys clearly know what you're doing!
From Wikipedia:
"The term open source describes practices in production and development that promote access to the end product's source materials."
I'd imagine this falls into that category. The takeaway for me is that the code is useful, regardless of the size, and they took the time to let other folks enjoy it.
They might be getting a disproportionate amount of publicity compared to much larger, more complex projects, in your opinion, but that doesn't negate the fact that it is still open source.
Publishing 2 github repositories, each with several classes which are mostly trivial, is not "opensourcing"
Err, yes it is. If you are going to try and be pedantic then at least be accurate.
A more accurate version of your complaint would be "They shouldn't get so much attention for open sourcing a few classes". That argument has some merit, but actually the bloom filter library is something that Java has been missing for a while (I know, because I've written one in Java myself).
Are you really comparing these github projects to the initial release of the Linux kernel? Have you read the source to either? Or stopped to consider their scope or potential impact?
Not to disparage the Greplin team's work -- I think it's totally awesome that they keep open sourcing pieces of their infrastructure. We should all be doing more of this.