I think that's totally fine for individual work, but in larger data engineering teams it's riskier to switch between tools, because other people may have to maintain your code.

That said, polars is good, and if the team agree to standardise on it then that's a totally reasonable choice.

I guess one of my reservations is that I've historically been burned by decisions within data eng teams to use pandas, which caused all sorts of problems with data typing and memory, and eventually everything had to be rewritten. But I accept polars doesn't suffer from the same problems (and some of them are actually mitigated in more recent versions of pandas).


Worse in some ways, better in others. DuckDB is often an excellent tool for this kind of task. Since it can run parallelized reads I imagine it's often faster than the command line tools, and with easier-to-understand syntax.

More importantly, you have your data in a structured format that can be easily inspected at any stage of the pipeline using a familiar tool: SQL.

I've been using this pattern (scripts or code that execute commands against DuckDB) to process data more recently, and the ability to do deep investigations on the data as you're designing the pipeline (or when things go wrong) is very useful. With a code-based solution (reading data into objects in memory), it's much harder to view the data: using debugging tools to inspect objects on the heap is painful compared to being able to JOIN/WHERE/GROUP BY your data.
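
To make that concrete, here's roughly the shape of those scripts - a minimal sketch in Python, with made-up file/table/column names:

    import duckdb

    # Each stage lands in a table, so if something looks wrong you can stop
    # and query the intermediate state with ordinary SQL.
    con = duckdb.connect("pipeline.duckdb")

    con.execute("""
        CREATE OR REPLACE TABLE raw AS
        SELECT * FROM read_csv_auto('events.csv')
    """)

    con.execute("""
        CREATE OR REPLACE TABLE cleaned AS
        SELECT user_id, CAST(ts AS TIMESTAMP) AS ts, amount
        FROM raw
        WHERE amount IS NOT NULL
    """)

    # Mid-pipeline investigation: just JOIN/WHERE/GROUP BY the intermediate table
    print(con.execute("""
        SELECT user_id, COUNT(*) AS n, SUM(amount) AS total
        FROM cleaned
        GROUP BY user_id
        ORDER BY total DESC
        LIMIT 10
    """).df())

Because every stage is just a table in the .duckdb file, you can reopen it later and poke around when something goes wrong.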


Yep. It’s literally what SQL was designed for: your business website can be running it… then you write a shell script to also pull some data on a cron. It’s beautiful

IMHO the main point of the article is that the typical unix command pipeline IS parallelized already.

The bottleneck in the example was maxing out disk IO, which I don't think duckdb can help with.


Pipes are parallelized when you have unidirectional data flow between stages. They really kind of suck for fan-out and joining though. I do love a good long pipeline of do-one-thing-well utilities, but that design still has major limits. To me, the main advantage of pipelines is not so much the parallelism, but being streams that process "lazily".

On the other hand, unix sockets combined with socat can perform some real wizardry, but I never quite got the hang of that style.


Pipelines are indeed one flow, and that works most of the time, but shell scripts make parallel tasks easy too. The shell provides tools to spawn subshells in the background and wait for their completion. Then there are utilities like xargs -P and make -j.

UNIX provides the Makefile as the go-to tool if a simple pipeline is not enough. GNU make makes this even more powerful by being able to generate rules on the fly.

If the tool of interest works with files (like the UNIX tools do) it fits very well.

If the tool doesn't work with single files, I have had some success using Makefiles for generic processing tasks by making the target a marker file that records that a given task has completed.


Found myself nodding along. I think increasingly it's useful to think of PRs from unknown external people as more like an issue than a PR (kind of like the 'issue first' policy described in the article).

There's actually something very valuable about a user specifying what they want using a working solution, even if the code is not mergeable.


Exactly - these huge machines are surely eating a lot into the need for distributed systems like Spark. So much less of a headache to run as well

Author here. I wouldn't argue SQL or duckdb is _more_ testable than polars. But I think historically people have criticised SQL as being hard to test. Duckdb changes that.
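
To give a flavour (a toy sketch with invented table, columns and query - not code from any real library), a unit test can run SQL against an in-memory DuckDB with nothing to install or provision:

    import duckdb

    # Stand-in for "generated SQL" - the real thing is far more involved,
    # but the testing pattern is the same.
    def duplicates_sql(table_name: str) -> str:
        return f"""
            SELECT first_name, surname, COUNT(*) AS n
            FROM {table_name}
            GROUP BY first_name, surname
            HAVING COUNT(*) > 1
        """

    def test_duplicates_sql():
        con = duckdb.connect()  # in-memory database, created per test
        con.execute("""
            CREATE TABLE people AS
            SELECT * FROM (VALUES
                ('ann', 'smith'),
                ('ann', 'smith'),
                ('bob', 'jones')
            ) AS t(first_name, surname)
        """)
        rows = con.execute(duplicates_sql("people")).fetchall()
        assert rows == [('ann', 'smith', 2)]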

I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines. I've seen this through the past five years of writing and maintaining a record linkage library. It generates SQL that can be executed against multiple backends. My library gets faster and faster year after year without me having to do anything, due to improvements in the SQL backends that handle things like vectorisation and parallelization for me. I imagine if I were to try and program the routines by hand, it would be significantly slower since so much work has gone into optimising SQL engines.

In terms of being future proof - yes, in the sense that the code will still be easy to run in 20 years' time.


> I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines.

Yeah, but SQL isn't really portable between all query engines. You always have to be speaking the same dialect. Also, SQL isn't the only "declarative" DSL; polars's LazyFrame API is similarly declarative. Technically Ibis's dataframe DSL also works as a multi-frontend declarative query language. Or even Substrait.

Anyway, my point is that SQL is not inherently a faster paradigm than "dataframes", and that you're conflating declarative query planning with SQL.
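
E.g. a polars lazy query (invented file and column names here) builds a plan and touches no data until collect(), so the optimiser gets the same freedom a SQL engine does:

    import polars as pl

    # Nothing below reads any data until collect(); the engine sees the
    # whole plan first and can prune columns, push filters into the scan, etc.
    lazy = (
        pl.scan_csv("sales.csv")
        .filter(pl.col("amount") > 0)
        .group_by("region")  # .groupby in older polars versions
        .agg(pl.col("amount").sum().alias("total"))
    )

    print(lazy.explain())  # inspect the optimised plan, much like EXPLAIN in SQL
    df = lazy.collect()    # execution happens here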


Author here. Re: 'SQL should be the first option considered', there are certainly advantages to other dataframe APIs like pandas or polars, and arguably any one of them is better in the moment than SQL. At the moment Polars is ascendant and it's a high-quality API.

But the problem is the ecosystem hasn't standardised on any of them, and it's annoying to have to rewrite pipelines from one dataframe API to another.

I also agree you're gonna hit OOM if your data is massive, but my guess is the vast majority of tabular data people process is <10GB, and that'll generally process fine on a single large machine. Certainly in my experience it's common to see Spark being used on datasets that are nowhere near big enough to need it. DuckDB is gaining traction, but a lot of people still seem unaware how quickly you can process multiple GB of data on a laptop nowadays.

I guess my overall position is it's a good idea to think about using DuckDB first, because often it'll do the job quickly and easily. There are a whole host of scenarios where it's inappropriate, but it's a good place to start.


I think both of us are ultimately wary of using the wrong tool for the job.

I see your point, even though my experience has been somewhat the opposite. E.g. a pipeline that used to work fast enough (or at all) up until some point in time, because the scale of the data or the requirements allowed it. Then some subset of those conditions changes, the pipeline cannot meet them, and one has to reverse engineer obscure SQL views/stored procedures/plugins and migrate the whole thing to Python or some compiled language.

I work with high density signal data now, and my SQL knowledge occupies the "temporary solution" part of my brain for the most part.


Yeah, my experiences match yours and I very, very much work with messy data (FOIA data), though I use postgres instead of duckdb.

Most of the datasets I work with are indeed <10GB but the ones that are much larger follow the same ETL and analysis flows. It helps that I've built a lot of tooling to help with types and memory-efficient inserts. Having to rewrite pipelines because of "that one dataframe API" is exactly what solidified my thoughts around SQL over everything else. So much of my life time has been lost trying to get dataframe and non-dataframe libraries to work together.

Thing about SQL is that it can be taken just about anywhere, so the time spent improving your SQL skills is almost always well worth it. R and pandas much less so.


I advocated for a SQL solution at work this week and it seems to have worked. My boss is wary of old-school SQL databases, with their constraints and just being a pain to work with. As a developer, these pains aren't too hard for me to get used to, automate, or document away, and I never understood the undue dislike of SQL.

The fact that I can use sqlite / a local SQL db for all kinds of development and reliably use the same code (with minor updates) in the cloud-hosted solution is such a huge benefit that it undermines anything else any other solution has to offer. I'm excited about the SQL stuff I learned over 10 years ago being of great use to me in the coming months.

The last product I worked heavily on used a NoSQL database, and it worked fine until you started to tweak it just a little bit - splitting entities, converting data types or updating ids. Most of the data access layer logic dealt with converting data coming in from the database and with guardrails to keep data integrity in check while interacting with the application models. To me this is something so obviously solved years ago with a few lines of constraints. Moving over to SQL was totally impossible. Learned my lesson, advocated hard for SQL. Hoping for better outcomes.
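
For what it's worth, the "few lines of constraints" I have in mind look something like this (sqlite3 with invented table names - a sketch, not our actual schema), and the same DDL carries over to a hosted database with minor tweaks:

    import sqlite3

    # The database enforces the integrity rules the NoSQL version needed a
    # whole data-access layer for.
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")
    con.executescript("""
        CREATE TABLE accounts (
            id    INTEGER PRIMARY KEY,
            email TEXT NOT NULL UNIQUE
        );
        CREATE TABLE orders (
            id         INTEGER PRIMARY KEY,
            account_id INTEGER NOT NULL REFERENCES accounts(id),
            quantity   INTEGER NOT NULL CHECK (quantity > 0)
        );
    """)

    con.execute("INSERT INTO accounts (id, email) VALUES (1, 'a@example.com')")
    # Both of these now fail in the database itself, not in application code:
    #   INSERT INTO orders (account_id, quantity) VALUES (99, 1)  -- no such account
    #   INSERT INTO orders (account_id, quantity) VALUES (1, 0)   -- violates CHECK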


I totally understand the apprehension towards SQL. It's esoteric as all hell and it tends to be managed by DBAs who you really only go to whenever there's a problem. Organizationally it's simpler to just do stuff in-memory without having to fiddle around with databases.

It is true that for json and csv you need a full scan, but there are several mitigations.

The first is simply that it's fast - for example, DuckDB has one of the best csv readers around, and it's parallelised.

Next, engines like DuckDB are optimised for aggregate analysis, where your single query processes a lot of rows (often a significant % of all rows). That means that a full scan is not necessarily as big a problem as it first appears. It's not like a transactional database where often you need to quickly locate and update a single row out of millions.

In addition, engines like DuckDB have predicate pushdown so if your data is stored in parquet format, then you do not need to scan every row because the parquet files themselves hold metadata about the values contained within the file.

Finally, parquet is a columnar format, so the engine only needs to scan the columns you actually reference, rather than processing whole rows even though you may only be interested in one or two columns.
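
To illustrate those last two points, something like this (made-up path and column names) only reads the referenced columns from the parquet files, and the row-group min/max statistics let DuckDB skip chunks that can't match the WHERE clause:

    import duckdb

    con = duckdb.connect()

    query = """
        SELECT store_id, SUM(amount) AS total
        FROM read_parquet('sales/*.parquet')
        WHERE sale_date >= DATE '2024-01-01'
        GROUP BY store_id
    """

    # EXPLAIN shows the filter pushed down into the parquet scan
    print(con.execute("EXPLAIN " + query).fetchall())
    print(con.execute(query).df())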



> In her November Budget, Chancellor Rachel Reeves scaled back business rate discounts that have been in force since the pandemic from 75% to 40% - and announced that there would be no discount at all from April.

That, combined with big upward adjustments to rateable values of pub premises, left landlords with the prospect of much higher rates bills.


I was curious for a UK comparison so I looked it up.

At the start of 2025, about 3% of adults in the UK had used GLP-1 drugs in the past year. And "most GLP-1 for weight loss in the UK is from private, rather than NHS provision" [1].

[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC12781702/


I assume that translates to it being really hard to get on the NHS, so people are resorting to buying it themselves. I wonder what the percentage would be if it were easy to get from the NHS?

UK numbers are always interesting because the NHS leans conservative about access to many types of care and medication.

For another example, rates of COVID-19 vaccination are significantly lower in the UK not because people there don’t want vaccines, but because the NHS only makes them narrowly available to people above a certain age or with a strict set of conditions.

