
We extensively use Apache Arrow to store data as Parquet files on S3. This is a cheap way to store data that doesn't require the query speed of a relational (or non-relational) database. The main advantage of Arrow is that it's columnar, and it loses no information in transit, unlike the nightmare of CSV files.


Have you benchmarked this against pickling those data files? In our experience, parquet's overhead isn't worth it for smaller data files.


I just did some benchmarks and it's pretty similar for small files. The difference would only be noticeable if you're serializing a ton of small files.


Huh, makes a pretty big difference for us. We were using pandas' built-in to_parquet though, which seems to suffer from some overhead.


I'm not surprised; Parquet's columnar encoding and compression don't really pay off until files get larger.


But with pickling you can only read the data in Python.


If pickling is what works best for you, it can't be much data.



