Hacker News
lsorber
on July 27, 2020
on:
Apache Arrow 1.0
Have you benchmarked this against pickling those data files? In our experience, Parquet's overhead isn't worth it for smaller data files.
alfalfasprout
on July 27, 2020
I just did some benchmarks and it's pretty similar for small files. The difference would only be noticeable if you're serializing a ton of small files.
lsorber
on July 31, 2020
Huh, makes a pretty big difference for us. We were using pandas' built-in to_parquet though, which seems to suffer from some overhead.
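For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a round-trip benchmark. It is not the benchmark either commenter ran: the dataset shape, the row count, and the helper names `time_roundtrip` and `benchmark_small_file` are all assumptions made up for illustration. The Parquet half only runs if pandas and a Parquet engine such as pyarrow are installed; otherwise only pickle is timed.

```python
import pickle
import tempfile
import time
from pathlib import Path

try:
    import pandas as pd  # optional; needed only for the Parquet half
except ImportError:
    pd = None


def time_roundtrip(write, read, path, repeats=20):
    """Return (best write seconds, best read seconds, file size in bytes)."""
    write_times, read_times = [], []
    for _ in range(repeats):
        start = time.perf_counter()
        write(path)
        write_times.append(time.perf_counter() - start)
        start = time.perf_counter()
        read(path)
        read_times.append(time.perf_counter() - start)
    return min(write_times), min(read_times), Path(path).stat().st_size


def benchmark_small_file(n_rows=1_000, repeats=20):
    """Time pickle and, when available, pandas' to_parquet on a small table."""
    # Hypothetical small dataset: two numeric columns and one string column.
    columns = {
        "id": list(range(n_rows)),
        "value": [i * 0.5 for i in range(n_rows)],
        "label": [f"row-{i % 10}" for i in range(n_rows)],
    }
    # Pickle the same object that Parquet gets, so the comparison is fair.
    data = pd.DataFrame(columns) if pd is not None else columns
    results = {}

    with tempfile.TemporaryDirectory() as tmp:
        def write_pickle(path):
            with open(path, "wb") as f:
                pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

        def read_pickle(path):
            with open(path, "rb") as f:
                return pickle.load(f)

        results["pickle"] = time_roundtrip(
            write_pickle, read_pickle, Path(tmp) / "data.pkl", repeats
        )

        if pd is not None:
            try:
                results["parquet"] = time_roundtrip(
                    data.to_parquet, pd.read_parquet,
                    Path(tmp) / "data.parquet", repeats,
                )
            except ImportError:
                pass  # no Parquet engine (pyarrow or fastparquet) installed

    return results


if __name__ == "__main__":
    for name, (w, r, size) in benchmark_small_file().items():
        print(f"{name}: write {w * 1e3:.3f} ms, read {r * 1e3:.3f} ms, {size} bytes")
```

Taking the best of several repeats keeps one-off noise out of the numbers, which matters when each call takes well under a millisecond. Varying `n_rows` should show where the fixed per-file cost stops dominating.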
EdwardDiego
on July 27, 2020
I'm not surprised, Parquet's columnar encoding and compression won't really kick in significantly for smaller files.
kylebarron
on July 27, 2020
But with pickling you can only read the data in Python.
cbsmith
on July 28, 2020
If pickling is what is working best for you, it can't be much data.