Pandas is something I wish I could avoid at all costs, but I can't. There is simply no design philosophy. The API is as ugly as it gets. I find it deeply unintuitive. It feels like a giant mishmash of hacks on top of other hacks.
Sometimes I wish the designers of NumPy or scikit-learn had developed Pandas.
I always see these kinds of complaints and, when I actually sit down with people to resolve their aversion, it ultimately comes back to their using Pandas incorrectly or simply not being able to grok the documentation. The usual dead giveaway is "Pandas documentation is horrible". Anyone who has used more than one library's documentation knows that docs rarely include every function in the API, let alone the arguments, examples, and links to other related functions.
As this is the top comment, can you (and others) at least post the problems so that we can have an intellectual discussion? Maybe the Pandas devs will take a point or two.
In my relatively brief experience maintaining a Python web service backed by Pandas:
It's basically a DSL constructed out of the dismembered syntactic bones of Python, which breaks every piece of semantics in the host language that it possibly can. I'm sure this is convenient (and maybe even tractable to use) in an interactive notebook, where you can try out and verify behavior in real-time by looking at the output. But in any kind of non-immediate-feedback scenario where you're trying to engineer a production system, it's a hellscape of jumping back and forth to the documentation because even your most fundamental assumptions about the host language's semantics have been thrown out the window. On top of that (and largely because of it), it also resists static typing like crazy.
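To make that concrete, here's a minimal sketch (toy data, nothing exotic) of the kind of semantic breakage I mean:

import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3], "b": [4, 5, 6]})

# Plain Python suggests `and`/`or`, but Pandas overloads the bitwise
# operators `&`/`|` for element-wise logic; using `and` here raises
# "The truth value of a Series is ambiguous" at runtime.
subset = df[(df["a"] > 0) & (df["b"] < 6)]

# `[]` alone has at least three meanings depending on the argument:
col = df["a"]            # a string selects one column (Series)
cols = df[["a", "b"]]    # a list selects columns (DataFrame)
rows = df[df["a"] > 0]   # a boolean Series filters rows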
It also has a bunch of APIs where it's really hard to know what will and won't mutate, and others that are just made generally hard to wrap your brain around for the sake of saving a few characters.
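A minimal sketch of the mutation ambiguity (the `inplace` flag and SettingWithCopyWarning are real Pandas behavior; the data is a toy):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# The same operation with two mutation behaviors: the first returns a
# new frame and leaves df alone, the second mutates df and returns None.
df2 = df.rename(columns={"a": "x"})
df.rename(columns={"a": "x"}, inplace=True)

# Chained indexing may or may not write through to the original frame;
# Pandas can only warn at runtime (SettingWithCopyWarning).
df[df["x"] > 1]["x"] = 0        # ambiguous: may silently do nothing
df.loc[df["x"] > 1, "x"] = 0    # the unambiguous spelling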
Finally: some portion of the hate it gets is probably from engineers without a statistics background who aren't familiar with the huge dictionary of jargon and abbreviations its APIs use. This complaint is maybe less valid, since it is a data science library, but it certainly colors emotions and makes it even harder to deal with in a production context for lots of us. There were several times where, once I did some research, dug past the jargon, and learned the background, the concept itself wasn't complicated. But because Pandas used the jargon, I couldn't just understand it; I had to go learn the stats first and traverse all that needless indirection. It's an API that isn't designed for engineers, but engineers often have to deal with it anyway.
> Finally: some portion of the hate it gets is probably from engineers without a statistics background who aren't familiar with the huge dictionary of jargon and abbreviations its APIs use.
Pandas' jargon isn't really even statistics jargon. For example, data frames are tables, which are inherently rows and columns. Columns have names, and rows can have names too (although that's not really needed). Pandas' documentation barely mentions rows and columns at all; instead it uses "axes", "labels" and "index", and there's not even a proper explanation in the documentation of what those mean. And that "index" is something the user needs to manage, sometimes "drop", sometimes "reset", without really understanding why. It seems to me it comes down to two things: (1) a really bad choice of names and (2) exposing performance-related details to users that should have been kept under the hood, if they are needed at all.
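To illustrate with toy data, the index bookkeeping shows up as soon as you group or filter:

import pandas as pd

df = pd.DataFrame({"store": ["a", "a", "b"], "sales": [1, 2, 3]})

# groupby silently moves the grouping column into the index, so code
# expecting a "store" column breaks until you reset_index().
agg = df.groupby("store").sum()   # "store" is now the index, not a column
agg = agg.reset_index()           # back to an ordinary column

# Filtering keeps the original row labels, so you "drop" the index to
# get plain positions back, exactly the unexplained ritual described above.
subset = df[df["sales"] > 1].reset_index(drop=True)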
I would put this squarely in the "use Pandas incorrectly" bucket again. As you point out, Pandas is very much geared towards notebooks. Conversely, I would claim your problem is actually one of the benefits of Pandas over something like R and its associated packages. At the very least, you can copy your Pandas code, ship it, and collect technical debt. With R, either you make calls to R from Python, which doesn't solve the problem you're having, or you rewrite the code in Python to be usable in your app. Rewriting costs time but is the correct way to do things, even with Pandas. I'm sorry you had to handle the technical debt, but I hope that was an intentional decision on your project's behalf.
As for your other complaints, I would need to see examples because I'm not really sure what you mean. Are you referring to things like "mad", which stands for Mean Absolute Deviation? That one I would put into my "grok the documentation" bucket, because if you just look up the mad method it literally spells out the acronym. Would a method named "mean_absolute_deviation" really help you? Furthermore, if Pandas used long names like that, statisticians coming from R (who are the target audience) would have another nothing-burger bullet point to use against it.
Have you read the book by its author: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, by Wes McKinney (2017)?
I think it is a bit mean to say a package as popular as this has "no design philosophy". You should read about its design philosophy before making that comment.
From my experience: I went all in when I discovered pandas and then dialed it back. (That was partly because of my inexperience at the time.)
Pandas is more useful for exploratory data analysis. It is in the same spirit as working in the terminal (with UNIX pipes, etc.) to explore things quickly. That's why you see people in the wild chaining tons of methods together, much like people writing terminal one-liners that chain a lot of pipes.
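In that pipe-like spirit, a typical chain might look like this (the file and column names are made up for illustration):

import pandas as pd

(pd.read_csv("sales.csv")                        # hypothetical file
   .query("region == 'west'")                    # hypothetical columns
   .assign(margin=lambda d: d.revenue - d.cost)
   .sort_values("margin", ascending=False)
   .head(10))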
It is also useful as a dictionary container; in fact, you can treat a data frame as if it were a dictionary of dictionaries of values in terms of API. Vice versa, if you have an internal structure that is a dict of dicts of values, you can convert it to a DataFrame as a drop-in replacement (I've done that when working with software that doesn't use pandas).
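A quick sketch of that correspondence:

import pandas as pd

data = {"a": {"x": 1, "y": 2}, "b": {"x": 3, "y": 4}}

# Outer keys become columns, inner keys become the row index.
df = pd.DataFrame(data)

df["a"]["x"]          # 1, the same access shape as data["a"]["x"]
back = df.to_dict()   # round-trips to a dict of dicts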
For simple things that one has prototyped, it can be left as is for “production”.
But for more complicated things, one should "productionize" it using easier-to-understand and/or more performant logic.
One of the mistakes in using pandas is to treat it as your "data container", as if the table itself were self-explanatory. From my experience, I've been confused by tables I saved in the past. So now I write classes with a to_frame method, so my internal data structure can be converted to a dataframe for further exploration if needed.
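A sketch of the pattern (class and field names are invented for the example):

from dataclasses import dataclass
import pandas as pd

@dataclass
class Experiment:     # hypothetical domain object
    name: str
    readings: list

    def to_frame(self) -> pd.DataFrame:
        # Convert to a DataFrame only at the exploration boundary.
        return pd.DataFrame({"name": self.name, "reading": self.readings})

Experiment("trial-1", [0.1, 0.4, 0.2]).to_frame()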
DuckDB can run over Pandas dataframes and is supposed to be super fast, though I haven't tried it yet. I am more familiar with Pandas than SQL, so I would like to hear your opinion after you give it a shot.
Can't write a longer comment now, but check out siuba, Polars, or dplyr; there are others. (I've posted about siuba recently, but I promise I'm not a shill, I'm just allergic to pandas.)
I've always found pandas really hard to use or reason about. I eventually get there but I don't like the code. Obviously subjective. I've never used another "data science" language though so I've no experience beyond it.
If you are ready to stretch your mind and have time and energy to learn a new skill, I recommend this as "data science" language: APL
Background: I first gained some experience with J, where I learned to appreciate the advantages of array languages. The main advantage is having your own way of thinking about processing multidimensional data. More recently, I've gotten into the notation of APL, and it's even cooler than the ASCII "noise" of J. The symbols make it easier for me to both write and read programs. Admittedly, more complex operations on data take a lot of learning time that you usually don't have. But for simple transformations, APL quickly becomes quite usable and persuasive beyond the "mainstream".
APL and J are on my list of programming languages I'd like to learn. I've been listening to the Array Cast podcast for a while; I just need a place to start. I don't really learn a language until I use it for something, and at the moment I have nothing to use APL for. Any recommendations of where to start would be welcome.
Everyone does. Perhaps if Python added more support for FP (rather than its present hostility), we'd be able to phrase data transformations more naturally in the language.
What's missing: pattern-matching expressions, syntax for partial application and composition, a typing system that can express structural types, and a generalised list comprehension.
Expressing generic data transformations naturally in syntax requires: being able to wrap/unwrap data types (i.e., pattern matching), composing operations, partially applying them, and syntax to support filter/map/flatMap on arbitrary data structures.
Python's failure to provide this, indeed its outright hostility to doing so, is half the reason pandas is a mess of incomprehensible syntax.
By prioritising assignment expressions and implementing pattern "matching" as a statement, they're clearly showing a lot of hostility to one of the major use cases of their language
(Incidentally, recall that C# introduced LINQ, a generalised comprehension, back in 2008. That's C#!)
There isn't the syntax to support natural phrasings of data transformation; in lieu of it, pandas exploits weird operators such as `.loc[a, b]`.
At some point the dam is going to have to break, and Python is going to have to introduce something to resolve this mess. However, I'd bet it'll be a decade of tooth-and-nail fighting about it. It isn't a software engineering language for education any more, and I'm not seeing that reality being acknowledged.
from functools import partial

# partial application via a library call, rather than dedicated syntax:
somefunc_arg1_arg2 = partial(somefunc, arg1, arg2)
> ...and composition
A native compositional syntax would be nice.
> a typing system which can express structural types
from typing import Any, Protocol

class Readable(Protocol):
    def read(self) -> Any: ...

def read_something(something: Readable) -> Any:
    return something.read()

class MyString:
    a_string: str = "something"

    def read(self) -> str:
        return self.a_string

# mypy will typecheck this: MyString() has a .read() method, so it counts
# as a Readable even though it doesn't subclass Readable.
read_something(MyString())
That isn't syntax for partial application, and the Protocol system for structural typing is syntactically and practically absurd.
The question is "why do data processing libs in Python look syntactically illegible?" and the answer is largely, as you've shown above, that Python doesn't support useful syntax.
Pattern matching is supported in 3.10 and can match classes structurally, as well as other types (edit: I see that you mean you don't like that it's a statement, which I agree with). The typing system supports structural types with Protocol. Personally, what I miss most are multi-line lambdas.
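For reference, the 3.10 form looks like this (Point is a made-up example class):

from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

def describe(p):
    match p:                  # a statement, not an expression
        case Point(x=0, y=0):
            return "origin"
        case Point(x=x, y=0):
            return f"on the x-axis at {x}"
        case Point():
            return "somewhere else"

describe(Point(3, 0))         # 'on the x-axis at 3'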
I'm not asking for "support", I'm asking for syntax. The question is why the syntax of data processing in Python is illegible, not "are things technically possible in Python".
Commenters here are keen to say "but don't you know!" and, yes, I do.
To be fair, the confusion of FP vs FP syntax comes from your own comment: "Perhaps if python would add more support for FP (rather than its present hostility)".
Python does FP fine, and FP via an ugly library or via elegant inbuilt syntax is still FP. I get that you want nicer syntax, but it wasn't clear.
I dunno, write a PEP and see if it gets support. I hope you succeed!
Here's another alternative. I wrote Dataiter specifically because I too was frustrated with Pandas. In my experience, if you design a new API from scratch (and don't try to reimplement the Pandas API as many projects have done!) and have some vision and consistent principles, it's entirely possible to end up with a good, intuitive API. Two relevant issues remain: you're limited by NumPy's datatypes and their problems, such as memory-hogging strings and the lack of a proper missing value (NA), and secondly, you're limited by the Python language, so compared to e.g. dplyr's non-standard evaluation, you'll need to use lambda functions, which are unfortunately clumsy and verbose.
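To illustrate the lambda point in generic pandas-style terms (dataiter's own API differs in details):

import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# dplyr's non-standard evaluation lets you write: mutate(df, z = x + y)
# Without NSE, Python needs a lambda to defer the column references:
df = df.assign(z=lambda d: d.x + d.y)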
A lot of people hate on SQL. I used to be one of them, but I've come to think that for data transformations it's hard to beat. My current favorite is DuckDB, which is like SQLite but columnar. It has great performance, it's easy to call from Python, and it can even run SQL on pandas dataframes.
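For example (table and column names made up), DuckDB can refer to a local pandas DataFrame by its variable name:

import duckdb
import pandas as pd

df = pd.DataFrame({"store": ["a", "a", "b"], "revenue": [10, 20, 30]})

# DuckDB finds `df` in the calling scope; no explicit registration needed.
result = duckdb.query(
    "SELECT store, SUM(revenue) AS total FROM df GROUP BY store"
).to_df()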
I was going to come here and post that I dislike method chaining because it is harder to read...
And then I read TFA and realize that I actually dislike people writing poorly formatted method chaining pandas code. The examples in the post are really nicely formatted and easy to read!
Pandas may be OK for people who need to do a little data processing in their Python project, but I still recommend R and the tidyverse to those who need a serious tool of thought for analytics. Everything is so much more tidy and concise and intuitive and flexible. You can usually write code that directly expresses your high-level intent with a minimum of syntactic cruft.
Full disclosure: the downside of modern R is that the stack traces have been getting worse for a while.
It's intuitive to you but I have always found R a real pain to deal with. I acknowledge all of Pandas' flaws but R is not for me, and I've used it a lot in the past. It's all subjective and dependent on the context and requirements of a project.
R is a mixed bag. Base R is like PHP - an organically grown mishmash of stale, ugly, inconsistent conventions, beloved only by people who have been trained in it or used it for decades. But the tidyverse has an extremely thoughtful API that is easier to learn and use than anything else in the data science world.
There are a number of good packages in Python specializing in variations of powerful chained processing in Pandas. My own is this one: https://github.com/WinVector/data_algebra .
I do want to note that "there are many things for X in language Y" isn't necessarily a positive thing. It often means the community lacks the clarity of thinking, or the will, to converge on one excellent product. Instead there are lots of okay-ish things, each developed by a single person or a handful of people.
OK, it seems I'm the only person who loves pandas. And I have just recently started to like method chaining (I've even written some code to enable it with BeautifulSoup), so this post came at just the right time.
Given how much of a role pandas seems to have played in the growth of Python over the last decade I suspect you aren't really the only person that loves pandas :)
I think you'll find a similar selection bias if you ask HN commenters what they think about Excel.
Good point, also there are a lot of similarities between Excel and Pandas as well.
I think this also points to a fundamental distinction between typical SWEs on the one hand and the people who use Excel, as well as data engineers, on the other.
You always start with data, and you have no control over it. So this means:
1. you need a stateful programming environment (Excel, Jupyter)
2. you need to look at the data to see what's there (plots)
I guess HN mostly comprises SWEs who build the DBs and websites that create the data that then gets consumed by the data engineers and the Excel people :D
Agreed about the similarities between Excel and Pandas. I started out as more of a data analyst and am now a SWE, and I think one of the things that SWEs who dismiss Excel and Jupyter don't understand is how little you can assume about the data you might be working with.
If you're an analyst who knows some VBA, that can be super useful, but it would probably be a mistake to try to make your VBA-driven applications bulletproof. Nobody wants you to spend that much time on it, and the odds that something completely out of your control will change and break them anyway are quite high.
Tom Augspurger is one of those names you recognize when you subscribe to the pandas repo. He is very involved in the project! Great to see his thoughts and how he uses it.
My team has been trying to modernize pandas from a different tack. Whatever the struggles with its syntax, Pandas is very sticky, and we don't predict much migration to other data science tools. Instead of refining the syntax, we have combined it with a spreadsheet GUI (https://github.com/mito-ds/monorepo). Here we worry less about writing perfect syntax ourselves and let the GUI write the code for operations like pivot tables and merges that work well visually.
Love it. This is an excellent resource for brushing up on the parts of the API I don't use as frequently. I've been using Pandas daily for about 6 years now, and seeing how it has evolved over time makes me really proud of the community. The obligatory comparisons to other tools in R are clichéd and seem to completely miss the point.
The examples presented are probably carefully selected. My experience is that if you actually use Pandas with method chaining, (1) you get ugly-looking code due to a mixture of method calls and various kinds of bracket indexing, and (2) you eventually run into things that just can't be (nicely) chained, and then you need to break the chain. That includes even new stuff, such as DataFrame.append being deprecated in favor of pd.concat.
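For instance, with toy frames:

import pandas as pd

df1 = pd.DataFrame({"a": [1]})
df2 = pd.DataFrame({"a": [2]})

# The chainable df1.append(df2) is deprecated; its replacement is a
# top-level function, so the fluent chain breaks...
out = pd.concat([df1, df2]).query("a > 0")

# ...unless you wrap it in .pipe() to keep the style:
out = df1.pipe(lambda d: pd.concat([d, df2])).query("a > 0")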
What about when a step doesn't return a DataFrame, like 'groupby'? Then you have to think about coming up with a name other than 'df'. Save your mental energy for writing comments. Here's an example...
import numpy as np

# Per-store revenue for high-price items
df_agg = (
    df1
    .query("unit_price > 400")
    .groupby(['store_id', 'store_name'])
    .agg({'revenue': np.sum,
          'customer_id': 'nunique'})
)
Groupby agg is a bit of an exception. It’s returning something that is very different from the original. Not just a modification, but a different table that summarizes the original. You ought to be assigning it to a new variable.
Groupby->agg is a good one to chain because you don’t want the intermediary almost ever.
Edit: also, I never understand people who use query. That just seems like it’s begging for problems later on. .loc for life.
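For comparison, the filter from the example above written both ways:

df1.query("unit_price > 400")       # a string parsed at runtime
df1.loc[df1["unit_price"] > 400]    # plain Python, tooling can check it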
A former coworker of mine was a huge fan of functional programming, and also deeply allergic to mutation. So if you reused variables like that you’d get an angry earful.
Though if you replaced each subsequent line with df1 and df2 and so on he wouldn’t mind as much.
I can’t opine as to whether one approach or the other is intrinsically better. But echoes of his tirades still ring when I see the same variable name redefined.
I would probably come across as matching that description. So perhaps I can speak for that position a bit.
In this case I would not think of re-assigning df as being in violation of those principles. It might even be useful if it plays better with how the code interacts with debuggers and version control.
It's not really re-use in the sense that would motivate making up new names for the intermediate steps; it's clearly just a syntactic aid for the operation chaining. So in my mind both expressions are equivalent from that perspective.
I would probably insist on limiting its scope to precisely that expression though, to maintain that obviousness.
There's no mutation there; it's just rebinding the name. Rebinding is very different from mutation. It wouldn't be my stylistic choice either but your FP friend wouldn't be complaining about mutation here.
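The distinction in two lines:

df = df.assign(x=1)   # rebinding: the name now points at a new frame
df["x"] = 1           # mutation: the existing frame itself is modified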
What always bothers me is that you can't method-chain an in-place apply of some function on a (subset of) columns in an elegant way. Pipe and assign make it possible, but it's definitely not nice; just look at the example code.
I think I once proposed an apply_to method for this purpose, but it got -1'd by the creators in no time.
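Roughly the kind of workaround I mean (the helper name is invented):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Either one assign lambda per column...
out = df.assign(a=lambda d: d["a"] * 2, b=lambda d: d["b"] * 2)

# ...or a pipe around an explicit copy-modify-return helper:
def scale(d, cols, k):            # hypothetical helper
    d = d.copy()
    d[cols] = d[cols] * k
    return d

out = df.pipe(scale, ["a", "b"], 2)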
For large workflows, you'll probably want to move away from pandas to something more structured, like Airflow or Luigi.
How are they a replacement for pandas? I thought they would, or at least could, wrap around it for scheduled execution / chaining. You would still need a dataframe-handling library, no?