Vectorization in OLAP Databases (mataroa.blog)
48 points by alterneesh on April 27, 2022 | 18 comments


Isn't the typical big-data SQL task I/O bound?

Vectorization only works when you have a table stored in an optimized columnar format and the compute runs a function over a column or combines multiple columns.

The moment you throw in group bys or windows, the data turns into rows that you read from a hash table or after a sort, at which point you lose all opportunities for vectorization.

Since group bys break vectorization, the other use case is map or reduce (sums, counts) operations over the entire table. In the absence of filters you can precompute these for each column.

Plain map- or sum-like operations in the presence of a filter are the only real use case for vectorization in OLAP, if I'm not missing anything.

In that case you need to implement the vectorized operation to work together with a mask, so that you don't include the filtered-out values, and over compressed data; otherwise you're wasting time bringing the data from disk closer to the CPU.
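
To make that concrete, here's a minimal sketch of the masked pattern (plain C++ for illustration; the decoded column and the 0/1 byte mask are my own assumptions, not any particular engine's format):

    #include <cstddef>
    #include <cstdint>

    // Hypothetical batch: a decoded int32 column plus a 0/1 byte mask
    // produced by an earlier filter. Multiplying by the mask keeps the
    // loop branch-free so the compiler can auto-vectorize it.
    int64_t masked_sum(const int32_t* col, const uint8_t* mask, size_t n) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; ++i) {
            acc += static_cast<int64_t>(col[i]) * mask[i];
        }
        return acc;
    }
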

Most general big-data SQL tasks will not gain a significant improvement from vectorization, unless they specialize in map-after-filter operations with no group bys, such as perhaps log processing.

Vectorization and other kinds of hardware acceleration are highly useful for small array data that fits into memory, such as geo data, APL, NumPy, tensors on TPUs, and similar workloads.


In a well-designed system, you will typically be limited by effective bandwidth, often memory bandwidth or efficient use thereof, which is an area where vectorization can help. Modern servers have tremendous storage bandwidth if you have an I/O scheduler capable of using it. Some newer database engines explicitly reject the assumption that storage throughput is precious as a design constraint, since it has become much less true over time due to advances in hardware.

Use of page layouts highly-optimized for vectorized evaluation is common now even if the implementation isn't vectorized. You lose nothing on modern hardware (they are good layouts regardless) and it allows you to easily do vector optimizations later. As a semantic distinction, columnar and vector layouts are organized differently and optimize for somewhat different things even though they have superficially similar appearance. Classic DSM-style columnar is largely obsolete.

Vectorization, first and foremost, is about optimizing selection operations in a database, but it can provide assists in other areas like joins, sorts, and aggregates. Most queries are composed from these primitives, so many parts of the query plan may benefit. As a heuristic, the operations that GPU databases excel at are the same kinds of operations that benefit from vectorization.
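
To give a rough idea of what a vectorized selection primitive looks like, here's a sketch (the batch layout and names are invented for illustration, not taken from any specific engine):

    #include <cstddef>
    #include <cstdint>

    // Scan one column batch and append the indices of qualifying rows to a
    // selection vector that downstream operators (joins, aggregates) consume.
    // The index is written unconditionally and the counter only advances when
    // the predicate holds, so the loop stays branch-free.
    // sel_out must have capacity for n entries.
    size_t select_gt(const int32_t* col, size_t n, int32_t threshold,
                     uint32_t* sel_out) {
        size_t count = 0;
        for (size_t i = 0; i < n; ++i) {
            sel_out[count] = static_cast<uint32_t>(i);
            count += (col[i] > threshold);
        }
        return count;
    }
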

Obviously you can't just throw vectorization at an arbitrary database and expect major benefits; the engine needs to be intentionally designed for it.


I can't quite see why vectorization wouldn't help if, say, you read after a sort. Irrespective of whether the data fits in memory or you perform some sort of external sort, any operation you want to perform on top of that sorted vector, be it an aggregation to reduce it or an arithmetic operation with another column, could still leverage vectorization and would end up using fewer CPU cycles, no?
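
To illustrate what I mean, here's a toy sketch (my own example, assuming the sort has already made each group's values contiguous):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // After sorting by key, each group's values form a contiguous run, so the
    // inner reduction over a run is a plain loop the compiler can vectorize.
    std::vector<int64_t> grouped_sums(const std::vector<int32_t>& keys,
                                      const std::vector<int32_t>& vals) {
        std::vector<int64_t> sums;
        size_t i = 0;
        while (i < keys.size()) {
            size_t j = i;
            while (j < keys.size() && keys[j] == keys[i]) ++j;  // find run end
            int64_t acc = 0;
            for (size_t k = i; k < j; ++k) acc += vals[k];      // vectorizable
            sums.push_back(acc);
            i = j;
        }
        return sums;
    }
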


The paper "Everything you always wanted to know about compiled and vectorized queries but were afraid to ask" [1] goes into this in much more detail and also compares vectorization with an alternative approach.

[1]: https://dl.acm.org/doi/abs/10.14778/3275366.3284966


Came here to post this. Fantastic paper, and probably the most comprehensive thing available on the internet about this topic.

Something else I've taken away from research on columnar and vectorized databases: there doesn't seem to be a good reason why they aren't fit for OLTP workloads.

Analytics from SaaS/line-of-business apps shows about a 90/10 read/write ratio for CRUD apps.

Pavlo et al. have a solid paper on an HTAP database that compares OLTP, OLAP, and a novel HTAP storage on varying read/write workloads:

  > "Arulraj, J., Pavlo, A., & Menon, P. (2016). Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads. Proceedings of the 2016 International Conference on Management of Data - SIGMOD  ’16. doi:10.1145/2882903.2915231 "
See "Figure 17: Concurrent Hybrid Workloads – The impact of the storage layout on the query processing time under different concurrent hybrid workloads."

The chart contains 4 workload types, one of which is "• Read-Heavy: 90% scans, 10% inserts"

You can see that the columnar storage model outperforms OLTP N-ary storage here

So my question is -- why don't people use columnar databases for CRUD apps?


For scan-heavy apps, columnstores will be much, much faster. For simple CRUD apps that just want to read/write a few rows at a time (with high concurrency), they have a number of inefficiencies vs. a rowstore.

  - Columnstores often don't support indexing at all (or have weak support for it). This means a scan is needed to find any row (with min/max or segment elimination to avoid opening files with no matching rows at all; see the sketch after this list). Even if this scan is very fast, it's still going to use more CPU than an index seek.

  - Most columnstores don't use compression schemes that are incremental. To grab a single row, the columnstore likely decompresses many adjacent rows (could be millions of rows, depending on the particular columnstore).

  - Most columnstores don't support fine-grained locking (row-level locking). They often lock the entire table or an entire segment (millions?) of rows whenever a single row is written to. This damages concurrency.

  - The row reconstruction costs are high (gluing the columns back together).
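
To make the segment-elimination point concrete, here's a rough sketch of what that min/max pruning amounts to (the structs are invented for illustration, not SingleStoreDB's actual format):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Per-segment zone map: min/max of a column over a large block of rows.
    struct SegmentStats {
        int32_t min_val;
        int32_t max_val;
    };

    // A point lookup "WHERE col = key" can skip whole segments whose min/max
    // range excludes the key, but every surviving segment must still be
    // scanned; there is no index to seek straight to the row.
    std::vector<size_t> candidate_segments(const std::vector<SegmentStats>& stats,
                                           int32_t key) {
        std::vector<size_t> out;
        for (size_t s = 0; s < stats.size(); ++s) {
            if (key >= stats[s].min_val && key <= stats[s].max_val) {
                out.push_back(s);
            }
        }
        return out;
    }
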
We have a SingleStoreDB paper in this year's SIGMOD that goes into these implementation details (not publicly published just yet). The columnstore in SingleStoreDB is reasonably good at OLTP without sacrificing its OLAP performance (see some benchmarking results here: https://www.singlestore.com/blog/tpc-benchmarking-results/).


Also, I just noticed that you used the TPC-C benchmark here.

Have you considered re-benchmarking with TPC-E? It's the updated version of the OLTP test that more accurately represents these sorts of apps:

  > "In February 2007, the new TPC-E benchmark [7] became a TPC standard. It is designed to be a more realistic OLTP benchmark than TPC-C, e.g., incorporating realistic data skews and referential integrity constraints."


  > "We find that (i) TPC-E is more read intensive with a 9.7:1 I/O read to write ratio, while TPC-C sees a 1.9:1 read-to-write ratio; and (ii) although TPC-E uses pseudo-realistic data, TPC-E’s I/O access pattern is as random as TPC-C."
It's a difference between a 10/1 read/write ratio, and a 2/1 read/write ratio. I've never worked on a line-of-business/SaaS app with a ratio lower than 80% reads FWIW.

https://www.tpc.org/tpce/default5.asp

http://www.cs.cmu.edu/~chensm/papers/TPCE-sigmod-record10.pd...


Yeah, TPC-E is a better (more advanced) OLTP benchmark. TPC-C is trivially scalable by sharding on warehouse ID everywhere. The problem with TPC-E is that not many companies publish (official or unofficial) results for it, so it's not as useful when comparing systems, which is what we were after in our blog post.


  > The problem with TPC-E is that not many companies publish (official or unofficial) results for it, so it's not as useful when comparing systems, which is what we were after in our blog post.
Oh, yeah this makes a lot of sense


Really interesting, going to have a look at that paper -- thanks for sharing your insight

Let me ask this: if you have something like a GraphQL API, which often does sparse column selection from multiple different tables, would that also be a good fit for a columnar database?

In most GraphQL queries, you're grabbing a portion of the fields from one or more tables, i.e. something like:

    query JoesCompletedTodos {
        users(where: { name: { _eq: "Joe" } }) {
            id
            name
            todos(where: { completed: { _eq: true } }) {
                text
            }
        }
    }
Where this will get translated to something along the lines of

    SELECT users.id, users.name, json_arrayagg(todos) FROM users
    INNER JOIN todos ON todos.user_id = users.id
    WHERE users.name = 'Joe' AND todos.completed = true
    GROUP BY users.id, users.name
Many times you'll see queries spanning 3-4 relation levels deep, plucking something like 2-6 columns from each table.

Curious how well a columnar DB would do with something like this?

Also, on this point:

  > "Most columnstores don't use compression schemes that are incremental.  To grab a single row the columnstore likely decompresses many adjacent rows (could be millions of rows - depends on the particular columnstore)."
I think this is something that could be avoided if the data were in, e.g., Arrow, and you used Arrow Flight/Flight SQL as the transport mechanism, right? But this isn't my area of expertise, for sure.


Filters like this one:

    WHERE users.name = 'Joe' AND todos.completed = true
have most of the problems I mentioned above unless "many" rows match the filter. The lack of an ability to seek to specific rows in the data hurts.

RE: Arrow.. I haven't read too much about the compression scheme(s) it uses. If it only does dictionary compression, that is reasonably easy to make incremental, so it's possible Arrow doesn't have that specific problem.
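
To make the "incremental" point concrete, here's a toy sketch of why plain dictionary encoding lets you grab one row without touching its neighbours (my own example, not Arrow's actual API):

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Dictionary-encoded string column: fixed-width codes index into a
    // dictionary of distinct values.
    struct DictColumn {
        std::vector<std::string> dictionary;  // distinct values
        std::vector<uint32_t> codes;          // one fixed-width code per row
    };

    // A single row can be decoded in O(1) without decompressing adjacent
    // rows, unlike block-oriented schemes that must inflate a whole block
    // to reach one value.
    const std::string& row_value(const DictColumn& col, size_t row) {
        return col.dictionary[col.codes[row]];
    }
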


Let's assume we want that: where are those databases?

Making one like that (with a hybrid model like PAX), along with the other things I consider important for improving the situation (like algebraic type support, being fully relational, etc.), is desirable, but FUNDING it is another matter. Building a database is a major effort, and it's not easy to get the proper support for it.

But I'd kill to have the chance, and I think the market is ready to go beyond the current crop of NoSQL/NewSQL...


> A recent trend

It's not at all a recent trend; it goes back at least as far as the early 2000s. Also, the obvious next step from thinking about processing a few tuples together is transposing your view of DB tables all the way, to consider _columns_ rather than _tuples_. This makes a lot of sense when you're processing analytic queries rather than transactions, which typically modify individual tuples. And _that_ view goes back at least another decade earlier, if not more.

See my own brief historical survey on the matter in the intro of my monograph on a computational model for column-oriented analytic engines:

https://arxiv.org/abs/1904.12217

(This is a bit of a shameless self-plug, since other, much more distinguished people than myself have written about this subject.)


Fair point about the "recentness". I guess it seemed recent to me since I learned about it only a few months back :)

Also, yes, 100%. Columnar storage fits well with vectorized execution! Thanks for sharing



Isn't this what Apache Arrow [1], Apache CarbonData [2], and others are for?

[1] https://arrow.apache.org/docs/format/Columnar.html

[2] https://carbondata.apache.org/introduction.html


The bit in Arrow closest to this would probably be Gandiva I think:

  > Gandiva - Vectorized processing for Apache Arrow
https://arrow.apache.org/blog/2018/12/05/gandiva-donation/

It's wicked neat tech.


I think it’s more like Gandiva or DataFusion (both from the Apache Arrow project).



