Pandas is something that I wish I could avoid at any cost but I can't. There is ...

hervature · on May 1, 2022

I always see these types of complaints and, when I actually sit down with people to resolve their aversion, it ultimately comes down to the fact that they use Pandas incorrectly or simply are not able to grok documentation. The usual dead giveaway is "Pandas documentation is horrible". Anyone who has used more than one library's documentation would know that documentation rarely includes every function in the API, let alone the arguments, examples, and links to other related functions.

As this is the top comment, can you (and others) at least post the problems so that we can have an intellectual discussion? Maybe the Pandas devs might take a point or two.

brundolf · on May 1, 2022

In my relatively brief experience maintaining a Python web service backed by Pandas:

It's basically a DSL constructed out of the dismembered syntactic bones of Python, which breaks every piece of semantics in the host language that it possibly can. I'm sure this is convenient (and maybe even tractable to use) in an interactive notebook, where you can try out and verify behavior in real-time by looking at the output. But in any kind of non-immediate-feedback scenario where you're trying to engineer a production system, it's a hellscape of jumping back and forth to the documentation because even your most fundamental assumptions about the host language's semantics have been thrown out the window. On top of that (and largely because of it), it also resists static typing like crazy.

It also has a bunch of APIs where it's really hard to know what will and won't mutate, and others that are just made generally hard to wrap your brain around for the sake of saving a few characters.
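To make the mutation point concrete, here is a minimal sketch. The frame and its column names are made up for illustration; the calls themselves (`sort_values`, the `inplace` flag, column assignment) are standard Pandas.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"name": ["b", "a", "c"], "score": [2, 1, 3]})

# Plain call: returns a new, sorted frame and leaves df untouched.
sorted_df = df.sort_values("score")
order_after_plain_call = df["score"].tolist()

# Same method with inplace=True: mutates df and returns None.
result = df.sort_values("score", inplace=True)
order_after_inplace = df["score"].tolist()

# Column assignment mutates the frame it is applied to.
df["passed"] = df["score"] >= 2
```

Nothing in the method name tells you which of these behaviours you get; it depends on a keyword argument or on whether you used assignment syntax.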

Finally: some portion of the hate it gets is probably from engineers without a statistics background who aren't familiar with the huge dictionary of jargon and abbreviations its APIs use. This complaint is maybe less valid, since it is a data science library, but it certainly colors emotions and makes it even harder to deal with in a production context for lots of us. There were several times where, once I did some research and dug past the jargon and learned all the background for what it meant, the concept itself wasn't complicated. But Pandas did use the jargon, so I couldn't just understand, I had to go learn stats first and traverse all this needless indirection. It's an API that isn't designed for engineers but engineers often have to deal with it anyway.

otsaloma · on May 2, 2022

> Finally: some portion of the hate it gets is probably from engineers without a statistics background who aren't familiar with the huge dictionary of jargon and abbreviations its APIs use.

Pandas' jargon isn't really even statistics jargon. For example, data frames are tables, which are inherently rows and columns. Columns have names, rows can have names too (although that's not really needed). Pandas' documentation has very few mentions of rows and columns at all; instead they use "axes", "labels" and "index", and there's not even a proper explanation in the documentation of what those mean. And that "index" is something that the user needs to manage, sometimes "drop", sometimes "reset", without really understanding why. It seems to me this comes down to two things: (1) really bad choices in naming things and (2) exposing to users some performance-related details that should have been kept under the hood, if they are needed at all.
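A small sketch of the kind of index management described above, using made-up data. After filtering, the surviving rows keep their original labels, which is why a `reset_index` so often follows.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"city": ["Oslo", "Lima", "Pune", "Kyiv"],
                   "temp": [3, 19, 31, 8]})

# Filtering keeps the original row labels, so they no longer start at 0.
warm = df[df["temp"] > 5]
labels_after_filter = warm.index.tolist()

# .iloc looks rows up by position, .loc by label.
# warm.loc[0] would raise a KeyError here because label 0 was filtered out.
first_by_position = warm.iloc[0]["city"]

# reset_index(drop=True) discards the old labels and renumbers from 0.
renumbered = warm.reset_index(drop=True)
```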

hervature · on May 2, 2022

I would put this squarely in the "use Pandas incorrectly" bucket again. As you point out, Pandas is very much geared towards notebooks. Conversely, I would claim your problem is actually one of the benefits of Pandas over something like R and the associated packages. At the very least, you can copy your Pandas code and ship it and collect technical debt. With R, either you make calls to R from Python, which doesn't solve the problem you are having, or you rewrite the code in Python to be usable in your app. This costs time and is the correct way to do things, even with Pandas. I'm sorry you had to handle the technical debt but I hope that was an intentional decision on your project's behalf.

As for your other complaints, I would need to see examples because I'm not really sure what you mean. Are you referring to things like "mad" which stands for Mean Absolute Deviation? This I would put into my "grok documentation" bucket because just look up the mad method and it literally gives you the acronym. Would the method "mean_absolute_deviation" really help you? Furthermore, if they used long names like that, statisticians from R (which are the target audience) would have another nothing-burger bullet to use against Pandas.
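For reference, the quantity behind `mad` is short enough to spell out by hand. This is a minimal sketch: `mean_absolute_deviation` is a hypothetical helper name, not a Pandas API, and depending on the Pandas version the `mad` method itself may not be available.

```python
import pandas as pd

def mean_absolute_deviation(values: pd.Series) -> float:
    """Average absolute distance of each value from the mean."""
    return (values - values.mean()).abs().mean()

# Hypothetical example data: mean is 5, deviations are 3, 1, 1, 3.
s = pd.Series([2.0, 4.0, 6.0, 8.0])
mad_value = mean_absolute_deviation(s)
```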

KolenCh · on May 1, 2022

Have you read the book by its author, "Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython" by Wes McKinney (2017)?

I think it is a bit mean to say that a package as popular as this has “no design philosophy”. You should read about their design philosophy before making that comment.

From my experience, I jumped all in when I discovered pandas and then I dialed it back. (It was partly because of my inexperience before.)

Pandas is more useful for exploratory data analysis. It is kind of in the philosophy of working in the terminal (with UNIX pipes, etc.) to explore things quickly. That’s why you’d see people in the wild chaining tons of methods together, sort of like people writing terminal one-liners that chain a lot of pipes.
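A sketch of what that pipe-like chaining tends to look like. The `sales` frame and its columns are invented for the example; each step takes a frame (or series) and hands a new one to the next, much like a shell pipeline.

```python
import pandas as pd

# Hypothetical example data.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "east"],
    "units": [10, 3, 7, 8, 1],
    "price": [2.0, 5.0, 2.0, 5.0, 9.0],
})

top_regions = (
    sales
    .assign(revenue=lambda d: d["units"] * d["price"])  # add a derived column
    .query("revenue > 10")                               # keep the larger sales
    .groupby("region")["revenue"]                        # group by region
    .sum()                                               # total revenue per region
    .sort_values(ascending=False)                        # biggest first
)
```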

It is also useful as a dictionary container; in fact, in terms of API you can treat a data frame as if it were a dictionary of dictionaries of values. Vice versa, if one has an internal structure that is a dict of dicts of values, one can convert that to a DataFrame as a drop-in replacement (I’ve done that when working with software that does not use pandas).
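A minimal sketch of that dict-of-dicts correspondence, with made-up keys and values. Outer keys become columns and inner keys become row labels.

```python
import pandas as pd

# Hypothetical nested dictionary: outer key -> inner key -> value.
inventory = {
    "apples": {"price": 1.5, "stock": 10.0},
    "pears": {"price": 2.0, "stock": 4.0},
}

# Outer keys become column names, inner keys become the index.
df = pd.DataFrame(inventory)

# Lookup reads the same way as on the nested dict.
pear_price = df["pears"]["price"]

# to_dict() goes back to the column -> index -> value nesting.
round_trip = df.to_dict()
```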

For simple things that one has prototyped, it can be left as is for “production”.

But for more complicated things, one should “productionize” it using easier to understand and/or more performant logic.

One of the mistakes people make with pandas is to treat it as their “data container”, as if the table itself were self-explanatory. From my experience, I’ve been confused by tables I saved in the past. So now I write classes that have a to_frame method, so that my internal data structure can be converted to a dataframe for further exploration if needed.
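One way that pattern could look, as a sketch only. The `Measurement` and `ExperimentLog` classes and their fields are hypothetical; the point is that the typed class is the source of truth and the DataFrame is a view produced on demand.

```python
from dataclasses import dataclass, field
from typing import List

import pandas as pd

@dataclass
class Measurement:
    sensor_id: str
    celsius: float

@dataclass
class ExperimentLog:
    """Hypothetical internal structure with named, typed fields."""
    name: str
    measurements: List[Measurement] = field(default_factory=list)

    def add(self, sensor_id: str, celsius: float) -> None:
        self.measurements.append(Measurement(sensor_id, celsius))

    def to_frame(self) -> pd.DataFrame:
        # Build the table only when someone wants to explore the data.
        rows = [{"sensor_id": m.sensor_id, "celsius": m.celsius}
                for m in self.measurements]
        return pd.DataFrame(rows, columns=["sensor_id", "celsius"])

log = ExperimentLog("run-1")
log.add("a", 20.0)
log.add("b", 22.0)
frame = log.to_frame()
```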

anyfactor · on May 1, 2022

Have you tried pandas+duckdb?

DuckDB can run over a Pandas DataFrame and it is super fast. I haven't tried it yet. I am more familiar with Pandas than SQL, so I would like to hear your opinion after you give it a shot.

https://duckdb.org/2021/05/14/sql-on-pandas.html

patrick451 · on May 1, 2022

NumPy has plenty of API warts of its own. It's not obvious to me they would have done a better job at all.

nuclearnice3 · on May 1, 2022

Could you elaborate on the warts?

throwamon · on May 1, 2022

Can't write a longer comment now, but check out siuba, or Polars, or dplyr; there are others. (I've posted about siuba recently but I promise I'm not a shill, I'm just allergic to pandas)

tpoacher · on May 2, 2022

They have. It's called numpy records, and they're delightful.

The problem is nobody is using them.
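For anyone who hasn't seen them, a minimal sketch of NumPy structured arrays and the record-array view on top of them. The field names and data are made up for illustration.

```python
import numpy as np

# Hypothetical record layout: one named, typed field per column.
person_dtype = np.dtype([("name", "U10"), ("age", "i4"), ("height_m", "f8")])

people = np.array(
    [("alice", 31, 1.70), ("bob", 45, 1.82)],
    dtype=person_dtype,
)

# Structured arrays expose fields by key...
ages_by_key = people["age"]

# ...and the recarray view also exposes them as attributes.
records = people.view(np.recarray)
ages_by_attr = records.age

# Boolean masks work as on any other array.
older = records[records.age > 40]
```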