Greplin opensources Lucene Utils and Bloom Filters

dabeeeenster · on April 13, 2011

> A query that matches phrase prefixes - for example "Epic w" will match both "Epic win" and "Epic wonder". This is particularly useful for implementing Google Instant style searches.

This is really useful and painfully lacking from Lucene. Great stuff.

sigil · on April 13, 2011

I was curious how they implemented prefix matching, so I went and looked at the code [1]. Unfortunately, this is just a simple linear scan that calls .startswith(). It is possible to do fast (log N) prefix matching with radix / critbit trees.

[1] https://github.com/Greplin/greplin-lucene-utils/blob/master/...

nostrademons · on April 14, 2011

Radix trees are O(k), where k is the length of the query string.
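To make the O(k) point concrete, here is a minimal sketch. It uses a plain character trie rather than a compressed radix or critbit tree (a simplification), and the names `Trie`, `insert`, and `with_prefix` are made up for this example. Reaching the node for the prefix takes one step per character of the query, no matter how many keys are stored; listing the matches afterwards costs extra time proportional to the size of that subtree.

```python
class Trie:
    """Uncompressed character trie; a radix tree would merge single-child chains."""

    def __init__(self):
        self.children = {}   # maps a character to the child Trie node
        self.is_key = False  # True if a stored key ends at this node

    def insert(self, key):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.is_key = True

    def with_prefix(self, prefix):
        # O(k) descent: one dictionary lookup per character of the prefix.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collecting the matches walks the whole subtree below the prefix node.
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_key:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for phrase in ["epic win", "epic wonder", "epic fail", "epilogue"]:
    trie.insert(phrase)
print(trie.with_prefix("epic w"))
```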

Anyway, for prefix matching, you want to take careful note of the size and shape of your data, because it matters for which algorithm is fastest. If your data all fits in RAM (or better yet, all fits in L2 cache), then I've had very good results with binary search (O(log N)) to find the first matching result, and then linear scan to find all possible suffixes. This is a lot more cache-friendly than radix trees, which have better theoretical performance but often touch memory that's all over the place.
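The binary-search-then-scan approach described above can be sketched roughly as follows, assuming the keys are already held in a sorted in-memory list. The function name `prefix_matches` is invented for this illustration. It relies on the fact that all keys sharing a prefix sit next to each other in sorted order.

```python
from bisect import bisect_left


def prefix_matches(sorted_keys, prefix):
    """Return every key in sorted_keys that starts with prefix."""
    # O(log N) binary search for the first key that is >= prefix.
    index = bisect_left(sorted_keys, prefix)
    matches = []
    # Matching keys are contiguous, so scan forward until one fails to match.
    while index < len(sorted_keys) and sorted_keys[index].startswith(prefix):
        matches.append(sorted_keys[index])
        index += 1
    return matches


keys = sorted(["epic win", "epic wonder", "epic fail", "epilogue", "zebra"])
print(prefix_matches(keys, "epic w"))
```

The scan touches consecutive list entries only, which is the cache-friendly access pattern the comment is pointing at.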

sigil · on April 14, 2011

> Radix trees are O(k)

Yes, thank you.

techtalsky · on April 13, 2011

Luckily it does exist in Solr in the form of prefixed term queries. I'm surprised to find out it's not in vanilla Lucene.

yid · on April 13, 2011

I really don't mean to sound unpleasant, but does anyone else think that the primary goal of Greplin is to exit via Google or Facebook? From a consumer point of view, can it ever be anything more than a niche product? I'm genuinely curious...these patches are somewhat superficial, and while a nice gesture, the cynical part of me sees it as a ploy to gain cred with the people most likely to influence buyouts.

danicgross · on April 13, 2011

Our goal is to become the bridge between your brain and every bit of information you own online.

We open source projects for two primary reasons:

1) Because we use lots of open source projects internally. It makes sense to reciprocate where we can.

2) There's no better place to meet great engineers than in GitHub pull requests.

zem · on April 13, 2011

you say that like it's a bad thing. why is "1. build something that would be useful to a large company, 2. prove by acquiring users that the product has real value, 3. get acquired" a bad business model? from a consumer point of view, they'll benefit when google or facebook or whoever integrates greplin's techniques.

blhack · on April 13, 2011

This is completely irrelevant to the discussion, please feel free to downvote me for it:

2 things: Your account is exactly "1337" days old today, happy leet day :). Also, are you the same zem that posted on newslily a while ago? If so, you posted some really awesome stuff, thanks :)

(Sorry if I've recognized you and asked this question before, I have a terrible memory)

zem · on April 14, 2011

thanks, would never have noticed l337 day on my own :) and yeah, same zem! dropped off newslily due to having too many social networks to keep up with; nice to see it still going strong.

blhack · on April 14, 2011

o/ well long-distance high five to you :).

I wouldn't really say it's still going strong, haha. Unfortunately, we never really got the traction we needed for it to keep running by itself (Cody and I were submitting a lot of the content, which was fine, but it would have been really awesome if we didn't have to).

<sarcasm>Big surprise there, though, we were trying to compete against HN and reddit</sarcasm>

Back in November, we both started working on a new project: http://thingist.com, which has been taking a lot of my time lately, so I haven't been submitting as much.

Anyway, good to see you around here, man :)

zem · on April 15, 2011

where is the "help->about" page for thingist? i can't make out whether it's competing with tumblr, twitter, or something new entirely.

yid · on April 13, 2011

didn't mean for it to sound like that...it definitely is a more interesting strategy than "let's change the world!"

surtyaar · on April 13, 2011

There are some good opensource implementations in python (and I am sure other languages).

https://github.com/jaybaird/python-bloomfilter - offers scalable bloom filters

https://github.com/axiak/pybloomfiltermmap - uses mmap
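For readers who haven't met the data structure, here is a minimal sketch of a fixed-size Bloom filter. It is not the API of either library linked above, nor Greplin's Java implementation; the class name, the default sizes, and the choice of deriving bit positions from a SHA-256 digest are all assumptions made for illustration. A Bloom filter can report false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from two 64-bit halves of one digest
        # (double hashing); forcing h2 odd keeps the step from being zero.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("greplin")
print("greplin" in seen)
```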

smanek · on April 13, 2011

We originally used mmap too, but it didn't work very well. First, Java has some rather serious mmap limitations (http://bugs.sun.com/view_bug.do?bug_id=4724038), and for some reason we saw occasional data corruption (which we haven't seen since we moved to the current system).

jwr · on April 14, 2011

Have those mmap limitations really impacted you? And the corruption — I'm assuming that was with read/write mappings that you wrote to?

I'm asking, because we're getting great mileage out of mmap in Clojure, albeit for read-only mappings. And that's in a search engine :-) Using mmap for large data is great, because you avoid enlarging your heap and the garbage collector doesn't even have to care about your data.
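As a rough illustration of a read-only mapping, here is a sketch using Python's standard `mmap` module rather than the Java or Clojure APIs discussed in this thread. The helper name `read_slice` and the temporary file are invented for the example. The operating system pages the file in on demand, and only the slice that is actually read gets copied into an ordinary bytes object.

```python
import mmap
import os
import tempfile


def read_slice(path, offset, length):
    """Read length bytes starting at offset through a read-only mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ rejects any write attempt.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


# Hypothetical data file, created here only so the example can run.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"header--payload-bytes")
    data_path = tmp.name

payload = read_slice(data_path, 8, 7)
print(payload)
os.remove(data_path)
```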

chaostheory · on April 13, 2011

my question is why they didn't use solr instead?

smanek · on April 13, 2011

The first version of Greplin was actually built on Solr!

Eventually, we needed more flexibility than Solr easily offered though. For example, we've added far more efficient sharding, document modifications (updates and deletions), flushing, and near real time search than either Lucene or Solr support out of the box (and they were much easier to add to Lucene than Solr, since Lucene makes fewer assumptions about your dataset/use).

I think Solr is a great tool if your needs happen to fit into their model - but if they diverge a lot, it sometimes makes more sense to build your own custom framework on top of Lucene.

joshhart · on April 13, 2011

Did you consider the distributed, real-time, and faceted extensions to Lucene we (LinkedIn) built? Sensei, Zoie, and Bobo respectively. Get em at http://sna-projects.com/sna/

smanek · on April 13, 2011

We're huge fans! Our real time search technology copied its architecture from Zoie, and our faceting/caching took some ideas from Bobo!

We didn't use them outright since we have fairly different requirements/constraints (our data has some pleasant properties that make it easier to shard and facet than the general case) and we wanted something a bit simpler.

Shoot me an email sometime though (email in profile)! I'd love to buy you lunch and pick your brain ;-) You guys clearly know what you're doing!

yuvadam · on April 13, 2011

I know the verb "open-sources" is not well defined, but...

Publishing 2 github repositories, each with several classes which are mostly trivial, is not "opensourcing".

Much like the fact that a weekend project is not a "startup".

fleetingthought · on April 13, 2011

From Wikipedia: "The term open source describes practices in production and development that promote access to the end product's source materials."

I'd imagine this falls into that category. The takeaway for me is that the code is useful, regardless of the size, and they took the time to let other folks enjoy it.

yuvadam · on April 13, 2011

Don't get me wrong, I think highly of Greplin, and appreciate every line of open code. The process itself is to be highly praised.

But I've seen much larger open source patches - that are no less important - that have received much less publicity than this.

budu3 · on April 13, 2011

They might be getting a disproportionate amount of publicity compared to much larger, more complex projects, in your opinion, but that doesn't negate the fact that it is still open source.

nl · on April 14, 2011

> Publishing 2 github repositories, each with several classes which are mostly trivial, is not "opensourcing".

Err, yes it is. If you are going to try and be pedantic then at least be accurate.

A more accurate version of your complaint would be "They shouldn't get so much attention for open sourcing a few classes". That argument has some merit, but actually the bloom filter library is something that Java has been missing for a while (I know, because I've written one in Java myself).

rmc · on April 13, 2011

The Linux kernel was announced on 25th August 1991 by Linus Torvalds as "just a hobby, won't be big and professional like gnu".

sigil · on April 13, 2011

Are you really comparing these github projects to the initial release of the Linux kernel? Have you read the source to either? Or stopped to consider their scope or potential impact?

Not to disparage the Greplin team's work -- I think it's totally awesome that they keep open sourcing pieces of their infrastructure. We should all be doing more of this.