What about when the steps don't return a data frame object, like 'groupby'? Then you will have to think about coming up with a name other than 'df'. Save your mental energy for writing comments. Here's an example...
# Per-store revenue for high-price items
df_agg = (
    df1
    .query("unit_price > 400")
    .groupby(['store_id', 'store_name'])  # multiple keys go in a list
    .agg({'revenue': 'sum',               # total revenue per store
          'customer_id': 'nunique'})      # distinct customers per store
)
Groupby agg is a bit of an exception. It’s returning something that is very different from the original. Not just a modification, but a different table that summarizes the original. You ought to be assigning it to a new variable.
Groupby->agg is a good one to chain because you almost never want the intermediate object.
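To make the "different table" point concrete, here is a minimal sketch. The data is made up, and the names `sales`, `store_summary` and `grouped` are hypothetical; only the column names come from the example above. It uses named aggregation, which is one of several ways to spell the `agg` step.

```python
import pandas as pd

# Hypothetical sales data; column names follow the example above.
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 2],
    "store_name": ["North", "North", "South", "South", "South"],
    "unit_price": [500, 450, 300, 600, 700],
    "revenue": [1000, 900, 300, 1200, 700],
    "customer_id": ["a", "a", "b", "c", "d"],
})

# Chained: the intermediate GroupBy object never gets a name.
store_summary = (
    sales
    .query("unit_price > 400")
    .groupby(["store_id", "store_name"])
    .agg(revenue=("revenue", "sum"), customers=("customer_id", "nunique"))
)

# Unchained: `grouped` is a DataFrameGroupBy, not a DataFrame,
# and is rarely useful on its own.
grouped = sales[sales["unit_price"] > 400].groupby(["store_id", "store_name"])
store_summary_unchained = grouped.agg(
    revenue=("revenue", "sum"), customers=("customer_id", "nunique")
)
```

Both spellings give the same summary: one row per store instead of one row per sale, which is why a new name such as `store_summary` reads better than reusing `df`.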
Edit: also, I never understand people who use query. That just seems like it’s begging for problems later on. .loc for life.
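For anyone comparing the two, a rough sketch of the same filter written both ways. The frame is made up and the variable names are hypothetical. One commonly cited difference is that the `query` condition lives inside a string, so editors and linters cannot check the column name.

```python
import pandas as pd

# Made-up data for illustration only.
sales = pd.DataFrame({"store_id": [1, 2, 3], "unit_price": [500, 300, 450]})

# String expression, evaluated by pandas at run time.
via_query = sales.query("unit_price > 400")

# Boolean mask with .loc.
via_loc = sales.loc[sales["unit_price"] > 400]

# Inside a chain, a callable lets .loc refer to the intermediate frame.
via_loc_chained = sales.loc[lambda d: d["unit_price"] > 400]
```

All three return the same rows here, so the choice is mostly about readability and tooling.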
A former coworker of mine was a huge fan of functional programming, and also deeply allergic to mutation. So if you reused variables like that you’d get an angry earful.
Though if you replaced each subsequent line with df1 and df2 and so on he wouldn’t mind as much.
I can’t opine as to whether one approach or the other is intrinsically better. But echoes of his tirades still ring when I see the same variable name redefined.
I would probably come across as matching that description. So perhaps I can speak for that position a bit.
In this case I would not consider re-assigning df a violation of those principles. It might even be useful if it plays better with how the code interacts with debuggers and version control.
It’s not really re-used in the sense that would motivate making up new names for the intermediate steps. It’s clearly just used as a syntactic aid for the operation chaining. So in my mind both expressions are equivalent from that perspective.
I would probably insist on limiting its scope to precisely that expression though, to maintain that obviousness.
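One possible reading of "limiting its scope", sketched under my own assumptions: keep the reassigned name inside a small function, so it cannot leak into the rest of the module. The function name `high_price_rows` and its parameters are hypothetical.

```python
import pandas as pd

def high_price_rows(sales: pd.DataFrame, min_price: float) -> pd.DataFrame:
    """Return rows priced above min_price, most expensive first."""
    # `df` is local: it is rebound step by step and disappears on return.
    df = sales
    df = df[df["unit_price"] > min_price]
    df = df.sort_values("unit_price", ascending=False)
    df = df.reset_index(drop=True)
    return df

# Made-up input; the caller never sees the intermediate `df` values.
expensive = high_price_rows(pd.DataFrame({"unit_price": [300, 500, 450]}), 400)
```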
There's no mutation there; it's just rebinding the name. Rebinding is very different from mutation. It wouldn't be my stylistic choice either but your FP friend wouldn't be complaining about mutation here.
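A small sketch of the distinction, using a made-up frame and hypothetical names (`original`, `alias`). Rebinding points a name at a new object; mutation changes the object that every name bound to it can see.

```python
import pandas as pd

original = pd.DataFrame({"price": [100, 500]})
df = original  # two names, one object

# Rebinding: assign() returns a new frame and `df` now names it.
# `original` is untouched.
df = df.assign(price_with_tax=df["price"] * 1.2)
rebinding_left_original_alone = "price_with_tax" not in original.columns

# Mutation: adding a column in place changes the object itself,
# so the change is visible through `original` as well.
alias = original
alias["price_with_tax"] = alias["price"] * 1.2
mutation_changed_original = "price_with_tax" in original.columns
```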
(df.pipe(went_up, 'hill')
)
is much better than something like this:
df = went_up(df, 'hill')
df = fetch(df, 'water')
df = fell_down(df, 'jack')
df = broke(df, 'jack')
df = tumble_after(df, 'jill')
Would really like to hear an opinion about that.
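For comparison, a runnable sketch of both styles side by side. Only the function names come from the snippet above; their bodies are invented stand-ins that each add one column, and the starting frame is made up.

```python
import pandas as pd

# Hypothetical stand-ins: each takes a DataFrame plus one argument
# and returns a new DataFrame without touching its input.
def went_up(df, place):
    return df.assign(went_up=place)

def fetch(df, thing):
    return df.assign(fetched=thing)

def fell_down(df, who):
    return df.assign(fell_down=who)

start = pd.DataFrame({"name": ["jack", "jill"]})

# Reassignment style: one rebinding of `df` per step.
df = start
df = went_up(df, "hill")
df = fetch(df, "water")
df = fell_down(df, "jack")

# Pipe style: the same steps as a single expression.
piped = (
    start
    .pipe(went_up, "hill")
    .pipe(fetch, "water")
    .pipe(fell_down, "jack")
)
```

The two produce equal frames, so the difference is stylistic: the pipe version is one expression with no intermediate names, while the reassignment version is easier to step through line by line in a debugger.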