Hacker News | tlipcon's comments

Colossus metadata is stored in Bigtable: see https://cloud.google.com/blog/products/storage-data-transfer...

Spanner stores its data in Colossus, so there would be some bootstrapping issues to resolve in order to move Colossus metadata to Spanner instead of Bigtable. (Bigtable has the same bootstrapping issues but has already solved them, and there are additional difficulties due to some details that I'm probably not at liberty to share.)

Spanner is used for metadata for many other very very large storage systems, though.


Yeah, that was my understanding as well. Colossus stores metadata in Bigtable on top of a smaller Colossus, which stores its metadata in Bigtable on top of an even smaller Colossus, which… [insert more stacks of turtles here] ends up in Chubby.

Public presentation here: http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Ke...


Just a small note: Kudu shares no code or provenance with Hadoop. The only commonality is the ecosystem (eg you can use Spark or Impala to query it).


Nice, I didn't realize; I guess it needs to go into the "reevaluate" list then. Sometimes Apache products tend to blend together in my mind and I get their capabilities confused or conflated.


For those interested in a technical overview, http://kudu.apache.org/kudu.pdf is our academic-style (but not submitted to any journal) paper.


Apache Kudu project founder here:

It's true that the project was initially developed at Cloudera, and employees continue to be the main driving force behind development. That said, we have committers and contributors from other companies as well. Roughly half the people who contributed a patch in the last 3 months have been non-Cloudera. Additionally we are very strict about doing all development upstream (eg with the first open source release we spent a lot of effort to open the entire development history going back to 2012, including JIRA, git, etc).

As for users, here are a couple examples off the top of my head who aren't currently paying for any support:

- Xiaomi (world's 4th largest smartphone maker) collects ~2TB/day of event data from >5 million phones into a cluster which simultaneously runs analytics workloads (SQL, Spark, etc.)

- CERN is looking at using Kudu to store high energy physics experiment data from the ATLAS detector at the LHC. You can find some code at https://gitlab.cern.ch/zbaranow/kudu-atlas-eventindex and a poster here: https://indico.cern.ch/event/505613/contributions/2230964/at...

(of course there are lots more whose names I don't have permission to mention)

Feel free to join our slack if you're interested in chatting with more - usually plenty of people online here: https://getkudu-slack.herokuapp.com

-Todd


Hopefully this isn't too "pitch"-y, but: if you're looking for a database that's good at time series, will always be open source, and does support scale-out and HA, you might be interested in Apache Kudu (incubating).

Feel free to drop by our Slack (http://getkudu-slack.herokuapp.com) if you have any questions.


Suggestion: add an option to look at file contents and analyze cross-file dependencies (eg c/c++ #includes or java imports)


That's a cool idea!

However, as of right now this is a completely client-side JS app, so it might take too much processing power and too many API requests to analyze individual files.


Todd from the Kudu team here; if anyone has any questions, I'll check back throughout the day (or tweet at @ApacheKudu).


FYI on the date issue, lest anyone think we filed the patent trying to steal the work done by others, the patent application says:

"This application claims to the benefit of U.S. Provisional Patent Application No. 61/911,720, entitled “HYBRIDTIME and HYBRIDCLOCKS FOR CLOCK UNCERTAINTY REDUCTION IN A DISTRIBUTED COMPUTING ENVIRONMENT”, which was filed on Dec. 4, 2013, which is incorporated by reference herein in its entirety."

(which predates the creation of the cockroachdb repo and the hybrid logical clock paper).


Kudu uses a similar algorithm which we call HybridTime. You can read the tech report here: http://pdsl.ece.utexas.edu/david/hybrid-time-tech-report-01....

and the source here: https://github.com/apache/incubator-kudu/blob/master/src/kud...

Both are probably more readable than patent-ese :)
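To make the family of algorithms concrete, here is a minimal sketch of the hybrid logical clock (HLC) update rules that HybridTime resembles. This is an illustrative toy, not Kudu's actual implementation; the class and method names are my own, and the physical time source is injected so the example is deterministic.

```python
class HybridClock:
    """Toy hybrid logical clock: timestamps are (physical, logical) pairs."""

    def __init__(self, physical_source):
        # physical_source: callable returning current physical time
        # (e.g. microseconds since epoch); injected for determinism.
        self.now = physical_source
        self.l = 0  # highest physical time observed so far
        self.c = 0  # logical counter, breaks ties within one physical tick

    def send(self):
        """Produce a timestamp for a local or send event."""
        pt = self.now()
        if pt > self.l:
            self.l, self.c = pt, 0  # physical clock moved forward: reset counter
        else:
            self.c += 1             # same tick: advance logical component
        return (self.l, self.c)

    def receive(self, m_l, m_c):
        """Merge a remote timestamp (m_l, m_c) from an incoming message."""
        pt = self.now()
        new_l = max(self.l, m_l, pt)
        if new_l == self.l == m_l:
            self.c = max(self.c, m_c) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == m_l:
            self.c = m_c + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)
```

The key property is that timestamps compare as plain (l, c) tuples, stay close to physical time, and always advance past any timestamp carried on an incoming message, which is what gives the causality guarantee.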


I know I'm dipping my toe into some history here, but is there a sense of how the patent situation is going to shake out? I think this general family of algorithms is very important.


Agreed -- personally I'm against offensive use of patents like this as well, and it's my understanding that Cloudera doesn't intend to use this patent offensively. If it did, I would be upset and would consider leaving the company - I know many other employees feel the same way. The reason I agree to help write patent applications as an engineer is that I've seen the distraction and damages caused by patent trolls (or even other companies) and the importance of having a defensive portfolio.

Disclaimer: Obviously I'm not speaking for the company or making any promises here :)

-Todd


Typically QSBR algorithms don't require blocking the world, or even blocking any single thread. They just require each thread to periodically check in and run a bounded amount of code which amounts to "hey, I'm not currently looking at the map".

Some other background collector thread (which is going to actually delete removed objects) just has to wait until it sees every mutator thread cross a safepoint, at which point it knows that none of those threads could be hanging onto references that have been unlinked from the data structure.
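The bookkeeping described above can be sketched in a few lines. This is a hypothetical, single-process illustration of the epoch/safepoint accounting only (the names and structure are mine, not from any particular QSBR library), with a lock standing in for the memory-ordering machinery a real implementation needs:

```python
import threading

class QSBRDomain:
    """Toy quiescent-state-based reclamation: defer frees until every
    mutator thread has announced a safepoint after the unlink."""

    def __init__(self, num_threads):
        self.lock = threading.Lock()
        self.global_epoch = 0
        self.seen = [0] * num_threads  # epoch last observed by each mutator
        self.retired = []              # (epoch, obj): unlinked, not yet freed
        self.freed = []

    def quiescent(self, tid):
        """Mutator safepoint: 'hey, I'm not currently looking at the map'."""
        with self.lock:
            self.seen[tid] = self.global_epoch

    def retire(self, obj):
        """Record an object just unlinked from the data structure."""
        with self.lock:
            self.retired.append((self.global_epoch, obj))

    def reclaim(self):
        """Collector: free objects retired before every thread's last safepoint."""
        with self.lock:
            self.global_epoch += 1
            horizon = min(self.seen)
            keep = []
            for epoch, obj in self.retired:
                if epoch < horizon:
                    self.freed.append(obj)  # no thread can still hold a reference
                else:
                    keep.append((epoch, obj))
            self.retired = keep
            return len(self.freed)
```

Note that no mutator ever blocks: `quiescent` is a single bounded store, and only the collector waits (by simply retrying `reclaim` later) until every thread's announced epoch has passed the retirement epoch.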

I'd recommend reading some surveys of RCU and SMR algorithms if this stuff is interesting to you.

