Hacker News

Have you benchmarked this against pickling those data files? In our experience, Parquet's overhead isn't worth it for smaller data files.


I just did some benchmarks and it's pretty similar for small files. The difference would only be noticeable if you're serializing a ton of small files.


Huh, it makes a pretty big difference for us. We were using pandas' built-in to_parquet though, which seems to add some overhead of its own.


I'm not surprised; Parquet's columnar encoding and compression don't really pay off on smaller files.


But with pickling you can only read the data in Python.


If pickling is what works best for you, it can't be much data.



