Pandas is something I wish I could avoid at all costs, but I can't. There is simply no design philosophy. The API is as ugly as it gets. I find it deeply unintuitive. It feels like a giant mishmash of hacks on top of other hacks.
Sometimes I wish the designers of NumPy or scikit-learn had developed Pandas.
I always see these kinds of complaints and, when I actually sit down with people to resolve their aversion, it ultimately comes back to their using Pandas incorrectly or simply not being able to grok the documentation. The usual dead giveaway is "Pandas documentation is horrible". Anyone who has used more than one library's documentation knows that docs rarely include every function in the API, let alone the arguments, examples, and links to other related functions.
As this is the top comment, can you (and others) at least post the problems so that we can have an intellectual discussion? Maybe the Pandas devs will take a point or two.
In my relatively brief experience maintaining a Python web service backed by Pandas:
It's basically a DSL constructed out of the dismembered syntactic bones of Python, which breaks every piece of semantics in the host language that it possibly can. I'm sure this is convenient (and maybe even tractable to use) in an interactive notebook, where you can try out and verify behavior in real-time by looking at the output. But in any kind of non-immediate-feedback scenario where you're trying to engineer a production system, it's a hellscape of jumping back and forth to the documentation because even your most fundamental assumptions about the host language's semantics have been thrown out the window. On top of that (and largely because of it), it also resists static typing like crazy.
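To make that concrete, here's a minimal sketch (toy data, nothing exotic) of the kind of semantic breakage I mean:

import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3], "b": [4, 5, 6]})

# Plain Python suggests `and`/`or`, but Pandas overloads the bitwise
# operators `&`/`|` for element-wise logic; using `and` here raises
# "The truth value of a Series is ambiguous" at runtime.
subset = df[(df["a"] > 0) & (df["b"] < 6)]

# `[]` alone has at least three meanings depending on the argument:
col = df["a"]            # a string selects one column (Series)
cols = df[["a", "b"]]    # a list selects columns (DataFrame)
rows = df[df["a"] > 0]   # a boolean Series filters rows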
It also has a bunch of APIs where it's really hard to know what will and won't mutate, and others that are just made generally hard to wrap your brain around for the sake of saving a few characters.
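A minimal sketch of the mutation ambiguity (the `inplace` flag and SettingWithCopyWarning are real Pandas behavior; the data is a toy):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# The same operation with two mutation behaviors: the first returns a
# new frame and leaves df alone, the second mutates df and returns None.
df2 = df.rename(columns={"a": "x"})
df.rename(columns={"a": "x"}, inplace=True)

# Chained indexing may or may not write through to the original frame;
# Pandas can only warn at runtime (SettingWithCopyWarning).
df[df["x"] > 1]["x"] = 0        # ambiguous: may silently do nothing
df.loc[df["x"] > 1, "x"] = 0    # the unambiguous spelling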
Finally: some portion of the hate it gets is probably from engineers without a statistics background who aren't familiar with the huge dictionary of jargon and abbreviations its APIs use. This complaint is maybe less valid, since it is a data science library, but it certainly colors emotions and makes it even harder to deal with in a production context for lots of us. There were several times where, once I did some research, dug past the jargon, and learned the background, the concept itself wasn't complicated. But because Pandas used the jargon, I couldn't just understand it; I had to go learn the stats first and traverse all that needless indirection. It's an API that isn't designed for engineers, but engineers often have to deal with it anyway.
> Finally: some portion of the hate it gets is probably from engineers without a statistics background who aren't familiar with the huge dictionary of jargon and abbreviations its APIs use.
Pandas' jargon isn't really even statistics jargon. For example, data frames are tables, which are inherently rows and columns. Columns have names, and rows can have names too (although that's not really needed). Pandas' documentation barely mentions rows and columns at all; instead it uses "axes", "labels" and "index", and there's not even a proper explanation in the documentation of what those mean. And that "index" is something the user needs to manage, sometimes "drop", sometimes "reset", without really understanding why. It seems to me it comes down to two things: (1) a really bad choice of names and (2) exposing performance-related details to users that should have been kept under the hood, if they are needed at all.
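To illustrate with toy data, the index bookkeeping shows up as soon as you group or filter:

import pandas as pd

df = pd.DataFrame({"store": ["a", "a", "b"], "sales": [1, 2, 3]})

# groupby silently moves the grouping column into the index, so code
# expecting a "store" column breaks until you reset_index().
agg = df.groupby("store").sum()   # "store" is now the index, not a column
agg = agg.reset_index()           # back to an ordinary column

# Filtering keeps the original row labels, so you "drop" the index to
# get plain positions back, exactly the unexplained ritual described above.
subset = df[df["sales"] > 1].reset_index(drop=True)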
I would put this squarely in the "use Pandas incorrectly" bucket again. As you point out, Pandas is very much geared towards notebooks. Conversely, I would claim your problem is actually one of the benefits of Pandas over something like R and its associated packages. At the very least, you can copy your Pandas code, ship it, and collect technical debt. With R, either you make calls to R from Python, which doesn't solve the problem you're having, or you rewrite the code in Python to be usable in your app. Rewriting costs time but is the correct way to do things, even with Pandas. I'm sorry you had to handle the technical debt, but I hope that was an intentional decision on your project's behalf.
As for your other complaints, I would need to see examples because I'm not really sure what you mean. Are you referring to things like "mad", which stands for Mean Absolute Deviation? That one I would put into my "grok the documentation" bucket, because if you just look up the mad method it literally spells out the acronym. Would a method named "mean_absolute_deviation" really help you? Furthermore, if Pandas used long names like that, statisticians coming from R (who are the target audience) would have another nothing-burger bullet point to use against it.
Have you read the book by its author: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, by Wes McKinney (2017)?
I think it is a bit mean to say a package as popular as this has "no design philosophy". You should read about its design philosophy before making that comment.
From my experience: I went all in when I discovered pandas and then dialed it back. (That was partly because of my inexperience at the time.)
Pandas is more useful for exploratory data analysis. It is in the same spirit as working in the terminal (with UNIX pipes, etc.) to explore things quickly. That's why you see people in the wild chaining tons of methods together, much like people writing terminal one-liners that chain a lot of pipes.
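In that pipe-like spirit, a typical chain might look like this (the file and column names are made up for illustration):

import pandas as pd

(pd.read_csv("sales.csv")                        # hypothetical file
   .query("region == 'west'")                    # hypothetical columns
   .assign(margin=lambda d: d.revenue - d.cost)
   .sort_values("margin", ascending=False)
   .head(10))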
It is also useful as a dictionary container; in fact, you can treat a data frame as if it were a dictionary of dictionaries of values in terms of API. Vice versa, if you have an internal structure that is a dict of dicts of values, you can convert it to a DataFrame as a drop-in replacement (I've done that when working with software that doesn't use pandas).
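A quick sketch of that correspondence:

import pandas as pd

data = {"a": {"x": 1, "y": 2}, "b": {"x": 3, "y": 4}}

# Outer keys become columns, inner keys become the row index.
df = pd.DataFrame(data)

df["a"]["x"]          # 1, the same access shape as data["a"]["x"]
back = df.to_dict()   # round-trips to a dict of dicts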
For simple things that one has prototyped, it can be left as is for “production”.
But for more complicated things, one should "productionize" it using easier-to-understand and/or more performant logic.
One of the mistakes in using pandas is to treat it as your "data container", as if the table itself were self-explanatory. From my experience, I've been confused by tables I saved in the past. So now I write classes with a to_frame method, so my internal data structure can be converted to a dataframe for further exploration if needed.
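A sketch of the pattern (class and field names are invented for the example):

from dataclasses import dataclass
import pandas as pd

@dataclass
class Experiment:     # hypothetical domain object
    name: str
    readings: list

    def to_frame(self) -> pd.DataFrame:
        # Convert to a DataFrame only at the exploration boundary.
        return pd.DataFrame({"name": self.name, "reading": self.readings})

Experiment("trial-1", [0.1, 0.4, 0.2]).to_frame()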
DuckDB can run over Pandas dataframes and is supposed to be super fast, though I haven't tried it yet. I am more familiar with Pandas than SQL, so I would like to hear your opinion after you give it a shot.
Can't write a longer comment now, but check out siuba, Polars, or dplyr; there are others. (I've posted about siuba recently, but I promise I'm not a shill, I'm just allergic to pandas.)
I've always found pandas really hard to use or reason about. I eventually get there but I don't like the code. Obviously subjective. I've never used another "data science" language though so I've no experience beyond it.
If you are ready to stretch your mind and have time and energy to learn a new skill, I recommend this as "data science" language: APL
Background: I first gained some experience with J, where I learned to appreciate the advantages of array languages. The main advantage is having your own way of thinking about processing multidimensional data. More recently, I've gotten into the notation of APL, and it's even cooler than the ASCII "noise" of J. The symbols make it easier for me to both write and read programs. Admittedly, more complex operations on data take a lot of learning time that you usually don't have. But for simple transformations, APL quickly becomes quite usable and persuasive beyond the "mainstream".
APL and J are on my list of programming languages I'd like to learn. I've been listening to the Array Cast podcast for a while; I just need a place to start. I don't really learn a language until I use it for something, and at the moment I have nothing to use APL for. Any recommendations of where to start would be welcome.
Everyone does. Perhaps if Python added more support for FP (rather than its present hostility), we'd be able to phrase data transformations more naturally in the language.
What's missing: pattern-matching expressions, syntax for partial application and composition, a typing system that can express structural types, and a generalised list comprehension.
Expressing generic data transformations naturally in syntax requires: being able to wrap/unwrap data types (i.e., pattern matching), composing operations, partially applying them, and syntax to support filter/map/flatMap on arbitrary data structures.
Python's failure to provide this, indeed its outright hostility to doing so, is half the reason pandas is a mess of incomprehensible syntax.
By prioritising assignment expressions and implementing pattern "matching" as a statement, they're clearly showing a lot of hostility to one of the major use cases of their language
(Incidentally, recall that C# introduced LINQ, a generalised comprehension, back in 2008. That's C#!)
There isn't the syntax to support natural phrasings of data transformation; in lieu of it, pandas exploits weird operators such as `.loc[a, b]`.
At some point the dam is going to have to break, and Python is going to have to introduce something to resolve this mess. However, I'd bet it'll be a decade of tooth-and-nail fighting about it. It isn't a software engineering language for education any more, and I'm not seeing that reality being acknowledged.
from functools import partial

# partial application via a library call, rather than dedicated syntax:
somefunc_arg1_arg2 = partial(somefunc, arg1, arg2)
> ...and composition
A native compositional syntax would be nice.
> a typing system which can express structural types
from typing import Any, Protocol

class Readable(Protocol):
    def read(self) -> Any: ...

def read_something(something: Readable) -> Any:
    return something.read()

class MyString:
    a_string: str = "something"

    def read(self) -> str:
        return self.a_string

# mypy will typecheck this: MyString() has a .read() method, so it counts
# as a Readable even though it doesn't subclass Readable.
read_something(MyString())
That isn't syntax for partial application, and the Protocol system for structural typing is syntactically and practically absurd.
The question is "why do data processing libs in Python look syntactically illegible?" and the answer is largely, as you've shown above, that Python doesn't support useful syntax.
Pattern matching is supported in 3.10 and can match classes structurally, as well as other types (edit: I see that you mean you don't like that it's a statement, which I agree with). The typing system supports structural types with Protocol. Personally, what I miss most are multi-line lambdas.
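For reference, the 3.10 form looks like this (Point is a made-up example class):

from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

def describe(p):
    match p:                  # a statement, not an expression
        case Point(x=0, y=0):
            return "origin"
        case Point(x=x, y=0):
            return f"on the x-axis at {x}"
        case Point():
            return "somewhere else"

describe(Point(3, 0))         # 'on the x-axis at 3'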
I'm not asking for "support", I'm asking for syntax. The question is why the syntax of data processing in Python is illegible, not "are things technically possible in Python".
Commenters here are keen to say "but don't you know!" and, yes, I do.
To be fair, the confusion of FP vs FP syntax comes from your own comment: "Perhaps if python would add more support for FP (rather than its present hostility)".
Python does FP fine, and FP via an ugly library or via elegant inbuilt syntax is still FP. I get that you want nicer syntax, but it wasn't clear.
I dunno, write a PEP and see if it gets support. I hope you succeed!
Here's another alternative. I wrote Dataiter specifically because I too was frustrated with Pandas. In my experience, if you design a new API from scratch (and don't try to reimplement the Pandas API as many projects have done!) and have some vision and consistent principles, it's entirely possible to end up with a good, intuitive API. Two relevant issues remain: you're limited by NumPy's datatypes and their problems, such as memory-hogging strings and the lack of a proper missing value (NA), and secondly, you're limited by the Python language, so compared to e.g. dplyr's non-standard evaluation, you'll need to use lambda functions, which are unfortunately clumsy and verbose.
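To illustrate the lambda point in generic pandas-style terms (dataiter's own API differs in details):

import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# dplyr's non-standard evaluation lets you write: mutate(df, z = x + y)
# Without NSE, Python needs a lambda to defer the column references:
df = df.assign(z=lambda d: d.x + d.y)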
A lot of people hate on SQL. I used to be one of them, but I've come to think that for data transformations it's hard to beat. My current favorite is DuckDB, which is like SQLite but columnar. It has great performance, it's easy to call from Python, and it can even run SQL on pandas dataframes.
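For example (table and column names made up), DuckDB can refer to a local pandas DataFrame by its variable name:

import duckdb
import pandas as pd

df = pd.DataFrame({"store": ["a", "a", "b"], "revenue": [10, 20, 30]})

# DuckDB finds `df` in the calling scope; no explicit registration needed.
result = duckdb.query(
    "SELECT store, SUM(revenue) AS total FROM df GROUP BY store"
).to_df()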
I was going to come here and post that I dislike method chaining because it is harder to read...
And then I read TFA and realize that I actually dislike people writing poorly formatted method chaining pandas code. The examples in the post are really nicely formatted and easy to read!
Pandas may be OK for people who need to do a little data processing in their Python project, but I still recommend R and the tidyverse to those who need a serious tool of thought for analytics. Everything is so much more tidy and concise and intuitive and flexible. You can usually write code that directly expresses your high-level intent with a minimum of syntactic cruft.
Full disclosure: the downside of modern R is that the stack traces have been getting worse for a while.
It's intuitive to you but I have always found R a real pain to deal with. I acknowledge all of Pandas' flaws but R is not for me, and I've used it a lot in the past. It's all subjective and dependent on the context and requirements of a project.
R is a mixed bag. Base R is like PHP - an organically grown mishmash of stale, ugly, inconsistent conventions, beloved only by people who have been trained in it or used it for decades. But the tidyverse has an extremely thoughtful API that is easier to learn and use than anything else in the data science world.
There are a number of good packages in Python specializing in variations of powerful chained processing in Pandas. My own is this one: https://github.com/WinVector/data_algebra .
I do want to note that "there are many things for X in language Y" isn't necessarily a positive thing. It often means the community lacks the clarity of thinking, or the will, to converge on one excellent product. Instead there are lots of okay-ish things, each developed by a single person or a handful of people.
OK, it seems I'm the only person who loves pandas. And I have just recently started to like method chaining (I've even written some code to enable it with BeautifulSoup), so this post came at just the right time.
Given how much of a role pandas seems to have played in the growth of Python over the last decade I suspect you aren't really the only person that loves pandas :)
I think you'll find a similar selection bias if you ask HN commenters what they think about Excel.
Good point, also there are a lot of similarities between Excel and Pandas as well.
I think this also points to a fundamental distinction between typical SWEs on the one hand and the people who use Excel, as well as data engineers, on the other.
You always start with data, and you have no control over it. So this means:
1. you need a stateful programming environment (Excel, Jupyter)
2. you need to look at the data to see what's there (plots)
I guess HN mostly comprises SWEs who build the DBs and websites that create the data that then gets consumed by the data engineers and the Excel people :D
Agreed about the similarities between Excel and Pandas. I started out as more of a data analyst and am now a SWE, and I think one of the things that SWEs who dismiss Excel and Jupyter don't understand is how little you can assume about the data you might be working with.
If you're an analyst who knows some VBA, that can be super useful, but it would probably be a mistake to try to make your VBA-driven applications bulletproof. Nobody wants you to spend that much time on it, and the odds that something completely out of your control will change and break them anyway are quite high.
Tom Augspurger is one of those names you recognize when you subscribe to the pandas repo. He is very involved in the project! Great to see his thoughts and how he uses it.
My team has been trying to modernize pandas from a different tack. Whatever the struggles with its syntax, Pandas is very sticky, and we don't predict much migration to other data science tools. Instead of refining the syntax, we have combined it with a spreadsheet GUI (https://github.com/mito-ds/monorepo). Here we worry less about writing perfect syntax ourselves and let the GUI write the code for operations like pivot tables and merges that work well visually.
Love it. This is an excellent resource for brushing up on the parts of the API I don't use as frequently. I've been using Pandas daily for about 6 years now, and seeing how it has evolved over time makes me really proud of the community. The obligatory comparisons to other tools in R are clichéd and seem to completely miss the point.
The examples presented are probably carefully selected. My experience is that if you actually use Pandas with method chaining, (1) you get ugly-looking code due to a mixture of method calls and various kinds of bracket indexing, and (2) you eventually run into things that just can't be (nicely) chained, and then you need to break the chain. That includes even new stuff, such as DataFrame.append being deprecated in favor of pd.concat.
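For instance, with toy frames:

import pandas as pd

df1 = pd.DataFrame({"a": [1]})
df2 = pd.DataFrame({"a": [2]})

# The chainable df1.append(df2) is deprecated; its replacement is a
# top-level function, so the fluent chain breaks...
out = pd.concat([df1, df2]).query("a > 0")

# ...unless you wrap it in .pipe() to keep the style:
out = df1.pipe(lambda d: pd.concat([d, df2])).query("a > 0")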
What about when a step doesn't return a DataFrame, like 'groupby'? Then you have to think about coming up with a name other than 'df'. Save your mental energy for writing comments. Here's an example...
import numpy as np

# Per-store revenue for high-price items
df_agg = (
    df1
    .query("unit_price > 400")
    .groupby(['store_id', 'store_name'])
    .agg({'revenue': np.sum,
          'customer_id': 'nunique'})
)
Groupby agg is a bit of an exception. It’s returning something that is very different from the original. Not just a modification, but a different table that summarizes the original. You ought to be assigning it to a new variable.
Groupby->agg is a good one to chain because you don’t want the intermediary almost ever.
Edit: also, I never understand people who use query. That just seems like it’s begging for problems later on. .loc for life.
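For comparison, the filter from the example above written both ways:

df1.query("unit_price > 400")       # a string parsed at runtime
df1.loc[df1["unit_price"] > 400]    # plain Python, tooling can check it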
A former coworker of mine was a huge fan of functional programming, and also deeply allergic to mutation. So if you reused variables like that you’d get an angry earful.
Though if you replaced each subsequent line with df1 and df2 and so on he wouldn’t mind as much.
I can’t opine as to whether one approach or the other is intrinsically better. But echoes of his tirades still ring when I see the same variable name redefined.
I would probably come across as matching that description. So perhaps I can speak for that position a bit.
In this case I would not think of re-assigning df as being in violation of those principles. It might even be useful if it plays better with how the code interacts with debuggers and version control.
It's not really re-use in the sense that would motivate making up new names for the intermediate steps; it's clearly just a syntactic aid for the operation chaining. So in my mind both expressions are equivalent from that perspective.
I would probably insist on limiting its scope to precisely that expression though, to maintain that obviousness.
There's no mutation there; it's just rebinding the name. Rebinding is very different from mutation. It wouldn't be my stylistic choice either but your FP friend wouldn't be complaining about mutation here.
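The distinction in two lines:

df = df.assign(x=1)   # rebinding: the name now points at a new frame
df["x"] = 1           # mutation: the existing frame itself is modified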
What always bothers me is that you can't method-chain an in-place apply of some function on a (subset of) columns in an elegant way. Pipe and assign make it possible, but it's definitely not nice; just look at the example code.
I think I once proposed an apply_to method for this purpose, but it got -1'd by the creators in no time.
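Roughly the kind of workaround I mean (the helper name is invented):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Either one assign lambda per column...
out = df.assign(a=lambda d: d["a"] * 2, b=lambda d: d["b"] * 2)

# ...or a pipe around an explicit copy-modify-return helper:
def scale(d, cols, k):            # hypothetical helper
    d = d.copy()
    d[cols] = d[cols] * k
    return d

out = df.pipe(scale, ["a", "b"], 2)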
For large workflows, you'll probably want to move away from pandas to something more structured, like Airflow or Luigi.
How are they a replacement for pandas? I thought they would, or at least could, wrap around it for scheduled execution / chaining. You would still need a dataframe-handling library, no?