
Being able to update the (query) output with new input data, rather than processing the whole input again even when the change is very small, is indeed a very useful feature. Suppose you have one huge input table and you have computed a result consisting of a few rows. Now you add one record to the input. A traditional data processing system will process all the input records again, while a differential system will update the existing output.

Implementing such systems involves the following difficulties:

o (Small) changes in the input have to be incrementally propagated to the output as updates rather than as new results. This changes the data processing paradigm, because every operator now has to be "update-aware".

o Only simple operators can easily be made "update-aware". For more complex operators like aggregation or rolling aggregation, it is frequently not clear how this can be done conceptually, let alone efficiently.

o Differential updates have to be propagated through a graph of operations (a topology), which makes the task more difficult.

o Currently popular data processing approaches (SQL or map-reduce) were not designed for this scenario, so some adaptation may be needed.
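To make the first point concrete, here is a minimal sketch of what an "update-aware" aggregate looks like: instead of recomputing a mean over the whole input, the operator keeps internal state and absorbs deltas. The class name and API are hypothetical, purely for illustration:

```python
class IncrementalMean:
    """Maintains a mean that absorbs inserts, deletes, and updates
    without rescanning the input."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def insert(self, value):
        self.total += value
        self.count += 1

    def delete(self, value):
        self.total -= value
        self.count -= 1

    def update(self, old, new):
        # An in-place update is just a delta; the count is unchanged.
        self.total += new - old

    @property
    def value(self):
        return self.total / self.count if self.count else None
```

This works because sum and count are invertible; as the next point notes, operators without such an inverse are much harder to make update-aware.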

Another system that implements such an approach, called incremental evaluation, is Lambdo:

https://github.com/asavinov/lambdo - Feature engineering and machine learning: together at last!

Yet this Python library relies on a different, novel data processing paradigm where operations are applied to columns. Mathematically, it uses two types of operations, set operations and function operations, as opposed to traditional approaches based on set operations only.

A new implementation is here:

https://github.com/asavinov/prosto - Functions matter! No join-groupby, No map-reduce.

Yet currently incremental evaluation is implemented only for simple operations (calculated columns).



How is a rolling aggregate hard to update? If the value at index i is changed, just update everything from i-n to i+n (where n is the rolling window size).
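That update rule can be sketched directly. For trailing windows of size n, the windows affected by a change at index i are exactly those ending at j in [i, i+n-1], and each can absorb the delta without recomputation (the function name is illustrative):

```python
def update_rolling_sums(rolling, values, i, new_value, n):
    """Update precomputed trailing rolling sums after values[i] changes.

    Invariant: rolling[j] == sum(values[max(0, j - n + 1) : j + 1]).
    Only windows containing index i are affected, i.e. j in [i, i+n-1],
    and each changes by the same delta.
    """
    delta = new_value - values[i]
    values[i] = new_value
    for j in range(i, min(i + n, len(values))):
        rolling[j] += delta
    return rolling
```

This is O(n) per change instead of O(n^2) for recomputing every affected window, but it only works for invertible aggregates like sum; a rolling max cannot absorb a delta this way.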


Yes, this is the basic logic: for any incremental aggregation we need to detect the groups that can be influenced by a new or updated record. If we do row-based rolling aggregation, then indeed we need to update the records in (i-n, i+n). Yet the following difficulties may arise:

o Generally, we do not want to re-compute aggregates from scratch - the aggregates themselves should be updated, particularly if n is very large.

o In real applications, rolling aggregation is performed with partitioning by some key. For example, we append new events from many different devices to one table and want to compute rolling aggregates for each individual device. Hence this (i-n, i+n) range will no longer work.

o Rolling aggregation using absolute time windows will also work differently. Although if records are ordered (as in stream processing) and there are no partitions, then it is easy.
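For the append-only partitioned case, a minimal sketch (class name and API hypothetical): when an event for one device arrives, only that device's window state is touched, so there is no global (i-n, i+n) range at all:

```python
from collections import defaultdict, deque


class PartitionedRollingSum:
    """Trailing rolling sum over the last n events, maintained per key.

    Appending an event for one device updates only that device's
    partition; all other partitions are left untouched.
    """

    def __init__(self, n):
        self.n = n
        self.windows = defaultdict(deque)   # key -> last n values
        self.sums = defaultdict(float)      # key -> current window sum

    def append(self, key, value):
        w = self.windows[key]
        w.append(value)
        self.sums[key] += value
        if len(w) > self.n:
            # Evict the oldest value once the window is full.
            self.sums[key] -= w.popleft()
        return self.sums[key]
```

Handling updates or deletions of old records, or time-based instead of count-based windows, would require more state than this sketch keeps.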


A few others and I have done a lot of research on updating sliding-window aggregations without recomputing everything. Our code is on GitHub, and the README has links to the papers: https://github.com/IBM/sliding-window-aggregators
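One classic technique in this space (not specific to that repository) is the two-stacks trick, which handles non-invertible operators like max, where the simple delta approach above does not apply. Each stack entry carries the value together with the aggregate of everything below it; a minimal sketch:

```python
class TwoStacksMax:
    """Sliding-window max via the classic two-stacks trick.

    Amortized O(1) per push/pop/query; works for any associative
    operator, not just invertible ones like sum.
    """

    def __init__(self):
        # Each entry is (value, max of this value and all entries below).
        self.front = []  # holds older elements, popped in FIFO order
        self.back = []   # holds newer elements

    def push(self, v):
        m = v if not self.back else max(v, self.back[-1][1])
        self.back.append((v, m))

    def pop(self):
        if not self.front:
            # Flip the back stack onto the front, recomputing running maxes.
            while self.back:
                v, _ = self.back.pop()
                m = v if not self.front else max(v, self.front[-1][1])
                self.front.append((v, m))
        return self.front.pop()[0]

    def query(self):
        if self.front and self.back:
            return max(self.front[-1][1], self.back[-1][1])
        if self.front:
            return self.front[-1][1]
        return self.back[-1][1]
```

The window is slid by pushing the new element and popping the oldest one; the flip is what amortizes the cost, since every element is moved from back to front at most once.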



