Hacker News | tlipcon's comments

Colossus metadata is stored in Bigtable: see https://cloud.google.com/blog/products/storage-data-transfer...

Spanner stores its data in Colossus, so there would be some bootstrapping issues to resolve in order to move Colossus metadata to Spanner instead of Bigtable. (Bigtable has the same bootstrapping issues but has already solved them, and there are additional difficulties due to some details that I'm probably not at liberty to share.)

Spanner is used for metadata for many other very very large storage systems, though.


Yeah, that was my understanding as well. Colossus stores metadata in Bigtable on top of a smaller Colossus, which stores its metadata in Bigtable on top of an even smaller Colossus, which… [insert more stacks of turtles here] ends up in Chubby.

Public presentation here: http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Ke...


Just a small note: Kudu shares no code or provenance with Hadoop. The only commonality is the ecosystem (eg you can use Spark or Impala to query it).


Nice, I didn't realize; I guess it needs to go into the "reevaluate" list then. Sometimes Apache products tend to blend together in my mind and I get their capabilities confused or conflated.


For those interested in a technical overview, http://kudu.apache.org/kudu.pdf is our academic-style (but not submitted to any journal) paper.


Apache Kudu project founder here:

It's true that the project was initially developed at Cloudera, and employees continue to be the main driving force behind development. That said, we have committers and contributors from other companies as well. Roughly half the people who contributed a patch in the last 3 months have been non-Cloudera. Additionally we are very strict about doing all development upstream (eg with the first open source release we spent a lot of effort to open the entire development history going back to 2012, including JIRA, git, etc).

As for users, here are a couple examples off the top of my head who aren't currently paying for any support:

- Xiaomi (world's 4th largest smartphone maker) collects ~2TB/day of event data from >5 million phones into a cluster which simultaneously runs analytics workloads (SQL, Spark, etc.)

- CERN is looking at using Kudu to store high energy physics experiment data from the ATLAS detector at the LHC. You can find some code at https://gitlab.cern.ch/zbaranow/kudu-atlas-eventindex and a poster here: https://indico.cern.ch/event/505613/contributions/2230964/at...

(of course there are lots more whose names I don't have permission to mention)

Feel free to join our slack if you're interested in chatting with more - usually plenty of people online here: https://getkudu-slack.herokuapp.com

-Todd


Hopefully this isn't too "pitch"-y, but: if you're looking for a database that's good at time series, will always be open source, and does support scale-out and HA, you might be interested in Apache Kudu (incubating).

Feel free to drop by our Slack (http://getkudu-slack.herokuapp.com) if you have any questions.


Suggestion: add an option to look at file contents and analyze cross-file dependencies (eg c/c++ #includes or java imports)


That's a cool idea!

However, as of right now this is a completely client-side JS app, so it might take too much processing power and too many API requests to analyze individual files.


Todd from the Kudu team here; if anyone has any questions, I'll check back throughout the day (or tweet at @ApacheKudu).


FYI on the date issue, lest anyone think we filed the patent trying to steal the work done by others, the patent application says:

"This application claims to the benefit of U.S. Provisional Patent Application No. 61/911,720, entitled “HYBRIDTIME and HYBRIDCLOCKS FOR CLOCK UNCERTAINTY REDUCTION IN A DISTRIBUTED COMPUTING ENVIRONMENT”, which was filed on Dec. 4, 2013, which is incorporated by reference herein in its entirety."

(which predates the creation of the cockroachdb repo and the hybrid logical clock paper).


Kudu uses a similar algorithm which we call HybridTime. You can read the tech report here: http://pdsl.ece.utexas.edu/david/hybrid-time-tech-report-01....

and the source here: https://github.com/apache/incubator-kudu/blob/master/src/kud...

Both are probably more readable than patent-ese :)
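To make the family of algorithms concrete, here is a minimal sketch of the hybrid logical clock (HLC) update rules that HybridTime resembles. This is an illustrative toy, not Kudu's actual implementation; the class and method names are my own, and the physical time source is injected so the example is deterministic.

```python
class HybridClock:
    """Toy hybrid logical clock: timestamps are (physical, logical) pairs."""

    def __init__(self, physical_source):
        # physical_source: callable returning current physical time
        # (e.g. microseconds since epoch); injected for determinism.
        self.now = physical_source
        self.l = 0  # highest physical time observed so far
        self.c = 0  # logical counter, breaks ties within one physical tick

    def send(self):
        """Produce a timestamp for a local or send event."""
        pt = self.now()
        if pt > self.l:
            self.l, self.c = pt, 0  # physical clock moved forward: reset counter
        else:
            self.c += 1             # same tick: advance logical component
        return (self.l, self.c)

    def receive(self, m_l, m_c):
        """Merge a remote timestamp (m_l, m_c) from an incoming message."""
        pt = self.now()
        new_l = max(self.l, m_l, pt)
        if new_l == self.l == m_l:
            self.c = max(self.c, m_c) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == m_l:
            self.c = m_c + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)
```

The key property is that timestamps compare as plain (l, c) tuples, stay close to physical time, and always advance past any timestamp carried on an incoming message, which is what gives the causality guarantee.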


I know I'm dipping my toe into some history here, but is there a sense of how the patent situation is going to shake out? I think this general family of algorithms is very important.


Agreed -- personally I'm against offensive use of patents like this as well, and it's my understanding that Cloudera doesn't intend to use this patent offensively. If it did, I would be upset and would consider leaving the company - I know many other employees feel the same way. The reason I agree to help write patent applications as an engineer is that I've seen the distraction and damages caused by patent trolls (or even other companies) and the importance of having a defensive portfolio.

Disclaimer: Obviously I'm not speaking for the company or making any promises here :)

-Todd


Typically QSBR algorithms don't require blocking the world, or even blocking any single thread. They just require each thread to periodically check in and run a bounded amount of code which amounts to "hey, I'm not currently looking at the map".

Some other background collector thread (which is going to actually delete removed objects) just has to wait until it sees every mutator thread cross a safepoint, at which point it knows that none of those threads could be hanging onto references that have been unlinked from the data structure.
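The bookkeeping described above can be sketched in a few lines. This is a hypothetical, single-process illustration of the epoch/safepoint accounting only (the names and structure are mine, not from any particular QSBR library), with a lock standing in for the memory-ordering machinery a real implementation needs:

```python
import threading

class QSBRDomain:
    """Toy quiescent-state-based reclamation: defer frees until every
    mutator thread has announced a safepoint after the unlink."""

    def __init__(self, num_threads):
        self.lock = threading.Lock()
        self.global_epoch = 0
        self.seen = [0] * num_threads  # epoch last observed by each mutator
        self.retired = []              # (epoch, obj): unlinked, not yet freed
        self.freed = []

    def quiescent(self, tid):
        """Mutator safepoint: 'hey, I'm not currently looking at the map'."""
        with self.lock:
            self.seen[tid] = self.global_epoch

    def retire(self, obj):
        """Record an object just unlinked from the data structure."""
        with self.lock:
            self.retired.append((self.global_epoch, obj))

    def reclaim(self):
        """Collector: free objects retired before every thread's last safepoint."""
        with self.lock:
            self.global_epoch += 1
            horizon = min(self.seen)
            keep = []
            for epoch, obj in self.retired:
                if epoch < horizon:
                    self.freed.append(obj)  # no thread can still hold a reference
                else:
                    keep.append((epoch, obj))
            self.retired = keep
            return len(self.freed)
```

Note that no mutator ever blocks: `quiescent` is a single bounded store, and only the collector waits (by simply retrying `reclaim` later) until every thread's announced epoch has passed the retirement epoch.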

I'd recommend reading some surveys of RCU and SMR algorithms if this stuff is interesting to you.

