
We extensively use Apache Arrow to store data as Parquet files on S3. This is a cheap way to store data that doesn't require the query speed of a relational (or non-relational) database. The main advantage of Arrow is that it's columnar, and it loses no information in transit, unlike the nightmare of CSV files.


Have you benchmarked this against pickling those data files? In our experience, parquet's overhead isn't worth it for smaller data files.


I just did some benchmarks and it's pretty similar for small files. The difference would only be noticeable if you're serializing a ton of small files.


Huh, makes a pretty big difference for us. We were using pandas' built-in to_parquet though, which seems to suffer from some overhead.


I'm not surprised; Parquet's columnar encoding and compression don't really pay off until files get larger.


But with pickling you can only read the data in Python.


If pickling is what works best for you, it can't be much data.



