So he puts polymorphic function calls into enormous loops to simulate a heavy lo...

jiggawatts · on Feb 28, 2023

> The real situation: a virtual method is called only a few hundred times and is barely visible in profiling tools.

The reality is that the entire Java ecosystem revolves around call stacks hundreds of calls deep where most (if not all) of those are virtual calls through an interface.

Even in web server scenarios where the user might be "5 milliseconds away", I've seen these overheads add up to the point where it is noticeable.

ASP.NET Core for example has been optimised recently to go the opposite route of not using complex nested call paths in the core of the system and has seen dramatic speedups.

For crying out loud, I've seen Java web servers requiring 100% CPU time across 16 cores for half an hour to start up! HALF AN HOUR!

mda · on March 1, 2023

I bet that half an hour startup is not because of nested calls at all. I worked with a ton of Java code and if something is slow, it is usually shitty I/O related code or some algorithmic stupidity, not because of virtual calls and what not.

_gabe_ · on March 1, 2023

> So he puts polymorphic function calls into enormous loops to simulate a heavy load with a huge amount of data to conclude "we have 20x loss in performance everywhere"?

You're mistaken, the load size has nothing to do with the end result. The result is normalized to give an estimate of how much faster the simple code is than the polymorphic code regardless of input size. (Kinda like deaths per 100k instead of giving an absolute number of deaths for statistics about diseases).

So yes, your code is running 20x slower than it should be all the time.

Especially when you make every class an interface, with... get this, one implementation! This is based on real world experience and is not a joke. There are real companies with real people that write real code where every single class is an interface with exactly one implementation. Which, as Casey has shown, results in upwards of a 20x slowdown in the worst case.

Obviously, you probably won't get a 20x speedup by getting rid of the polymorphic garbage. But it's equally asinine to assume that polymorphic functions are only called a few hundred times. I guarantee you your PC is making millions of polymorphic function calls per minute between the OS, the browser, the Windows anti-malware scanner, Steam running in the background, Oracle running its checks to remind you to update Java, etc. There are hundreds of processes running all the time on a modern device, and these devices are wasting enormous amounts of resources.

hsn915 · on March 1, 2023

> Especially when you make every class an interface, with... get this, one implementation! This is based on real world experience and is not a joke.

And when you run this through a profiler, you will not notice how slow your code is, because everything is slow. Slowness is infused throughout the whole system.

MikeCampo · on Feb 28, 2023

Just because you haven't been exposed to this issue doesn't mean it doesn't exist. "the real situation", "no one", "in real projects", "never pop up"... give me a break lol.

kaba0 · on Feb 28, 2023

One can reasonably well guess/know the expected input sizes to their programs. You ain’t (hopefully) loading your whole database into memory, and unless you are writing a simulation/game engine or another specialized application, your application is unlikely to have a single scorching hot loop; that’s just not what most programs look like. If it is, then you should design for it, which may even mean changing programming languages for that part (e.g. for video codecs not even C et al. cut it, you have to do assembly), but more likely you just use a slightly less ergonomic primitive of your language.

bob1029 · on Feb 28, 2023

> unless you are writing a simulation/game engine or another specialized application, your application is unlikely to have a single scorching hot loop

If everything was built with constraints like "this must serve user input so quickly that they can't perceive a delay", we would probably be a lot better off across the entire board.

We should try to steal more ideas from different domains instead of treating them like entirely isolated universes.

tyre · on Feb 28, 2023

> If everything was built with constraints like "this must serve user input so quickly that they can't perceive a delay", we would probably be a lot better off across the entire board.

Sure but it’s a question of tradeoffs.

If the wall-clock-optimized version creates a bus count of 1 for that domain or is difficult for a dozen engineers to iterate on, then that could be worse for the business and users.

Should we want better software? Yes. Should we learn from other domains? Absolutely. But we should ultimately optimize for the domain we’re in, and much product-focused software is ultimately best optimized for engineering team and product velocity.

Another example: would most software benefit from formal verification? Yeah, I guess, but the software holistically—as a thing used by humans to solve problems—might benefit considerably more from Ruby.

Olreich · on March 1, 2023

Product velocity is not increased by spending a ton of time on complex type hierarchies and premature extensibility. It’s increased by having fewer classes and simpler functions that are faster to write tests for to get good coverage via end-to-end tests. The patterns of OOP increase the amount of time spent not solving the business needs. They also infuse the entire system with unnecessary slowness.

morelisp · on Feb 28, 2023

> You ain’t (hopefully) loading your whole database into memory

I've basically built my career for the past decade by pointing out "yes, we can load our whole working set into memory" for the vast majority of problems. This is especially true if you have so little data you think you don't have CPU problems either.

kaba0 · on Feb 28, 2023

Databases are often not used by a single entity, so while I am very interested in your experiences, I think it is a great specialization for certain problems, but is not a general solution to everything.

All in all, I fail to see how it disagrees with my points.

l33t233372 · on March 1, 2023

> All in all, I fail to see how it disagrees with my points.

I didn’t see any part of the comment that implied it did

duskwuff · on Feb 28, 2023

> e.g. for video codecs not even C et al. cut it, you have to do assembly

This is largely inaccurate. Video encoders/decoders are typically written in C, with some use of compiler intrinsics or short inline assembly fragments for particularly "hot" functions.

Cthulhu_ · on Feb 28, 2023

Exactly, and they only decided to write those bits in assembly when they identified it as a hot code path AND the assembly outperformed any C code they could come up with.

The number of cases where assembly outperforms a higher-level language has also shrunk over time, thanks to compiler and CPU improvements.

kaba0 · on Feb 28, 2023

Which is not in disagreement with my comment/take.

kaba0 · on Feb 28, 2023

I was talking about those hot functions only, not the rest of the program. But yeah a “sometimes” or a “may” would have helped in my original sentence.

_aavaa_ · on March 1, 2023

But the point isn't that just the scorching hot loop is 20x slower, but that this penalty is paid everywhere. And it won't show up in profiling since there isn't a hotspot, it's death by 100 cuts.

mmarq · on Feb 28, 2023

His classes are doing a single multiplication, of course dynamic dispatch would have a significant cost in this scenario.
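To make the shape of that comparison concrete, here is a minimal sketch of the two styles being argued about. It is not the code from the video: the type and function names (`Shape`, `FlatShape`, `TotalAreaVirtual`, `TotalAreaFlat`) are made up for illustration, and only two shape kinds are shown. The work per element is a single multiplication in both versions; the only difference is whether it is reached through an indirect call.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Polymorphic version: every Area() call is dispatched through a vtable.
struct Shape {
    virtual ~Shape() = default;
    virtual float Area() const = 0;
};

struct Square : Shape {
    explicit Square(float side) : side_(side) {}
    float Area() const override { return side_ * side_; }
    float side_;
};

struct Rectangle : Shape {
    Rectangle(float width, float height) : width_(width), height_(height) {}
    float Area() const override { return width_ * height_; }
    float width_, height_;
};

float TotalAreaVirtual(const std::vector<std::unique_ptr<Shape>>& shapes) {
    float total = 0.0f;
    for (const auto& shape : shapes) {
        total += shape->Area();  // one indirect call per element
    }
    return total;
}

// Flat version: same arithmetic, but the kind is a plain tag in a
// contiguous array, so the call target is known at compile time.
enum class Kind { Square, Rectangle };

struct FlatShape {
    Kind kind;
    float width;
    float height;  // unused for squares
};

float AreaOf(const FlatShape& shape) {
    switch (shape.kind) {
        case Kind::Square:    return shape.width * shape.width;
        case Kind::Rectangle: return shape.width * shape.height;
    }
    assert(false && "unknown shape kind");
    return 0.0f;
}

float TotalAreaFlat(const std::vector<FlatShape>& shapes) {
    float total = 0.0f;
    for (const FlatShape& shape : shapes) {
        total += AreaOf(shape);  // direct call the compiler is free to inline
    }
    return total;
}
```

Both functions compute the same totals; how large the gap is in practice depends on the compiler, the data, and how much real work each call does, which is exactly what the thread is disputing.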

Olreich · on March 1, 2023

Never look at the dispatch of a big company’s Java code base then. It’s dynamic dispatch 400 layers deep for a single network call or file op or small amount of math. Sure those are more expensive operations, but the dynamic dispatch has continually outscaled the problem.

mmarq · on March 1, 2023

But the company that uses Java quite often is writing the umpteenth line-of-business application, where dynamic dispatch is not going to cause a massive overhead

Taniwha · on Feb 28, 2023

Occasional CPU architect here .... probably the worst thing you can do in your code is to load something from memory (the core of method dispatch) and then jump to it, it sort of breaks many of the things we do to optimise our hardware - it causes CPU stalls, branch prediction failures, etc etc

There is one thing worse you can do (and I caught a C++ compiler doing it when we were profiling code while building an x86 clone years ago): instead of loading the address and jumping to it push the address then return to it. That not only breaks pipelines but also return stack optimisations

badsectoracula · on Feb 28, 2023

> instead of loading the address and jumping to it push the address then return to it

I remember doing that in a code generator ages ago because it was easier than calculating the jump offset :-P

Taniwha · on March 1, 2023

every subroutine return after that would be mispredicted

badsectoracula · on March 2, 2023

FWIW i wasn't trying to make an optimizing compiler, i was experimenting with replacing an interpreter for a scripting language with a JIT, so even bad native code was still faster than the interpreter :-).

It wasn't really used anywhere, eventually i decided to keep the interpreter and move any complex logic into C, which ultimately was the simpler approach (and which has been my take on scripting languages for years now: use scripting languages for the "what" and native code for the "how").

saagarjha · on Feb 28, 2023

Most people do neither, but call into something that may do these things.

lightbendover · on Feb 28, 2023

> No one is working with a huge amount of data in big loops using virtual methods to take every element out of a huge dataset like he is showing.

Things way worse than that exist. Replace "virtual method" with "service call."

josephg · on Feb 28, 2023

> Things way worse than that exist.

Yeah. I opened Discord earlier, and it took about 10 seconds to open. My CPU is an Apple M1, running at about 3 GHz per core. Assuming it's single threaded (it wasn't), Discord is taking about 30 billion cycles to open. (Or around 50 network round-trips at a 200ms ping).

Crimes against performance are everywhere.

matiasfernandez · on March 1, 2023

Or as Casey would put it: Discord is taking 3.7moo ("Moon Unit") to open. A Moon Unit is equal to ~2.7 seconds, the maximum ping time to the moon. Therefore, if Discord had their servers on the moon, nobody would know the difference.

gdprrrr · on Feb 28, 2023

Have you measured CPU time? A very large factor will be Disk IO and Network

rwalle · on Feb 28, 2023

Exactly. The webpage is probably asking for resources from 10 different servers and one of them is a bit slower than the others, and the page rendering itself likely doesn't take very long.

josephg · on Feb 28, 2023

No; I have no idea why it's so slow. It's kind of hard to tell - I guess I could use wireshark to trace the packets. But who cares? At least one of these things is true:

- It makes horribly inefficient use of my CPU

- It needs an obscene number of network round-trips to load

- One of the network servers that discord needs to open takes seconds to respond to requests

This isn't a new problem. Discord always takes about 10 seconds to open on my computer. (Am I just on too many servers?)

It should open instantly. Everything on modern computers should happen basically instantly. The only reason most software runs slowly is because the developers involved don't care enough to make it run fast.

With a few exceptions like AI, scientific computing, 3d modelling and video editing, modern computers are fast enough for everything we want to do with them. Software seems to have higher requirements each year simply because the developers get faster computers each year and spend less effort keeping their software tight and lean.

phtrivier · on March 1, 2023

> The only reason most software runs slowly is because the developers involved don't care enough to make it run fast.

There is truth to that, but also:

* some of them would care if they knew what was possible with reasonable effort (that's what Casey is trying to address. So far in the course I'm not really seeing much that I could apply to the kind of code I write, sadly - but I'm hoping to learn stuff.)

* it's very likely that making performance-aware or optimized code takes just a tad longer than not doing it, and time-to-ship is valued much higher than time-to-run in most industries (this is the point I think Casey is overlooking, or at least not addressing enough. I don't know if it's by design - maybe he disagrees with the trade-off entirely - or if he's biased towards one of the few industries where time-to-run is crucial.)

josephg · on March 1, 2023

Right; most teams optimize for velocity before performance.

This makes sense when you're a shiny new startup. But seriously, 10 seconds for discord to open? There's a point in every product's lifecycle where performance is a feature. Discord isn't a startup anymore. Why can't they fix these performance problems? At least discord is pretty snappy once it's loaded. The new reddit interface? It's a hog. But despite a massive outcry, why haven't they fixed it?

My pet theory is that they don't know how. And talking about velocity is just a smoke screen.

I think most professional engineers don't really understand the software stack well enough to be able to improve the performance of the software they write. It's pretty understandable - nobody asks about this stuff in job interviews. And the software stack only gets more complicated each year. If you follow React tutorials online, you can get pretty far adding features to a web app without ever needing to understand how React actually works. Or the web browser, and Vite / webpack / whatever and the operating system it runs on top of.

And that's a pretty good deal! More engineers! So long as we don't mind the new reddit site. And electron apps that take seconds to load.

Of course Casey Muratori knows how to write performant code. He understands the whole stack. He knows how to read the assembly that the C++ compiler produces. That's something more of us should aspire towards.

I wonder if it would be valuable to make an online course talking about performance engineering. I feel like it's one of those things that has fallen by the wayside, and I think that's a massive pity.

TeMPOraL · on Feb 28, 2023

Which is precisely the point made a couple of comments up. Calling a lot of virtual methods in the critical path is peanuts compared to making a lot of network requests in said critical path.

But hey, those network calls are fast on my loopback interface, or my company LAN, when I'm playing with the dev version, using a test set simulating 2 users and 5 posts for each. Surely it'll be just as fast for the real users, over the Internet, on channels with 1000 users and 5 posts per second.

kyle-rb · on Feb 28, 2023

"Don't make tons of RPCs" is a totally separate issue from "don't make subclasses because virtual methods cost a few extra cycles".

TeMPOraL · on Feb 28, 2023

It's the same problem. Virtual calls are degenerate, in-process RPCs. Or put another way, the reason you make tons of RPCs is the same reason you make tons of virtual calls: you consider services or subclasses to be cheap, so you use them a lot to mold your systems to organizational/people problems instead of the thing the software is supposed to do.

kyle-rb · on March 1, 2023

IMO the main difference is that for ~98% of people writing code, subclasses actually are very cheap. The performance losses (11 cycles per iteration?) aren't enough to dissuade me from organizing my code cleanly.

icedchai · on March 1, 2023

It's more than performance. Some of the worst code I've worked on has had too many layers of sub-classes, making it difficult to navigate and a real loss to developer productivity. After a certain point, it becomes OO spaghetti or, more accurately, "lasagna." At more than 3 layers, you really need to stop and think if it's necessary.

icedchai · on Feb 28, 2023

Query loops are also just as problematic: looping through a result set, and making another query (or worse, N queries) per result, etc.
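A small sketch of that query-loop pattern, using a made-up in-memory stand-in for a database that only counts round trips. `FakeDb` and its methods are hypothetical and do not correspond to any real client library; the point is just the difference between one query per row and one batched query.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a database; every method call is one "query".
struct FakeDb {
    std::map<int, std::string> authors;  // author id -> name
    std::vector<int> post_author_ids;    // one entry per post
    int queries = 0;

    std::vector<int> QueryPostAuthorIds() {
        ++queries;
        return post_author_ids;
    }

    std::string QueryAuthor(int id) {
        ++queries;
        auto it = authors.find(id);
        assert(it != authors.end() && "unknown author id");
        return it->second;
    }

    std::map<int, std::string> QueryAuthors(const std::vector<int>& ids) {
        ++queries;
        std::map<int, std::string> found;
        for (int id : ids) {
            found[id] = authors.at(id);
        }
        return found;
    }
};

// Query loop: one query for the result set, then one more per row (N+1).
std::vector<std::string> AuthorNamesPerRow(FakeDb& db) {
    std::vector<std::string> names;
    for (int id : db.QueryPostAuthorIds()) {
        names.push_back(db.QueryAuthor(id));
    }
    return names;
}

// Batched: one query for the result set, one for all referenced authors.
std::vector<std::string> AuthorNamesBatched(FakeDb& db) {
    const std::vector<int> ids = db.QueryPostAuthorIds();
    const std::map<int, std::string> by_id = db.QueryAuthors(ids);
    std::vector<std::string> names;
    for (int id : ids) {
        names.push_back(by_id.at(id));
    }
    return names;
}
```

With N posts, the first function issues N + 1 queries and the second issues 2, regardless of N. Against a real database each of those queries is a network round trip, which is where the cost comes from.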

roflyear · on March 1, 2023

As in like a micro service? Ahahaha. Our CTO just pushed for microservices everywhere and we're not even that far along and we're chasing all kinds of performance problems. Insanity.

samdafi · on March 1, 2023

Just a quick nudge back on this: people in DSP would disagree with the assertion that nobody is going over big loops and using a virtual method on each element. We often have to process at least 88k elements per second in real time, through many many different processes. If any of those processes are defined using factories that spit out classes with polymorphic inheritance and virtual functions it certainly becomes an issue.

As a result some styles of writing code just don’t work for the audio thread at all, and we’d have to simply avoid or rewrite libraries written this way.

There are just some domains where standard practice for cleanliness is different because of your constraints.

I mean, it’s to the point we’ve got die hards in this industry who insist on putting all functions inlined in headers (not that I agree!)
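As a rough illustration of why per-element virtual calls hurt in audio code, the sketch below contrasts a per-sample interface with a per-block one. Processing whole blocks behind a single call is a common way audio code keeps dispatch out of the inner loop, but it is an assumption here, not something the comment above prescribes; all names (`SampleProcessor`, `BlockProcessor`, `RunPerBlock`, the `calls` counters) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-sample interface: one virtual call for every sample.
struct SampleProcessor {
    virtual ~SampleProcessor() = default;
    virtual float ProcessSample(float sample) = 0;
};

// Hypothetical per-block interface: one virtual call for a whole block.
struct BlockProcessor {
    virtual ~BlockProcessor() = default;
    virtual void ProcessBlock(float* samples, std::size_t count) = 0;
};

struct SampleGain : SampleProcessor {
    explicit SampleGain(float g) : gain(g) {}
    float ProcessSample(float sample) override {
        ++calls;  // counts virtual dispatches, for illustration only
        return sample * gain;
    }
    float gain;
    std::size_t calls = 0;
};

struct BlockGain : BlockProcessor {
    explicit BlockGain(float g) : gain(g) {}
    void ProcessBlock(float* samples, std::size_t count) override {
        ++calls;  // counts virtual dispatches, for illustration only
        for (std::size_t i = 0; i < count; ++i) {
            samples[i] *= gain;  // plain inner loop, no dispatch
        }
    }
    float gain;
    std::size_t calls = 0;
};

void RunPerSample(SampleProcessor& processor, std::vector<float>& buffer) {
    for (float& sample : buffer) {
        sample = processor.ProcessSample(sample);
    }
}

void RunPerBlock(BlockProcessor& processor, std::vector<float>& buffer,
                 std::size_t block_size) {
    assert(block_size > 0);
    for (std::size_t start = 0; start < buffer.size(); start += block_size) {
        const std::size_t count = std::min(block_size, buffer.size() - start);
        processor.ProcessBlock(buffer.data() + start, count);
    }
}
```

Both paths produce the same output; the per-sample path dispatches once per element, while the per-block path dispatches once per block, so the number of virtual calls per second drops by the block size.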

Gluber · on Feb 28, 2023

I agree with your sentiment. But those things exist (not that that validates the author's argument), and I still shake in terror remembering when, during COVID, I was asked to take a look at a virus spread simulation (cellular automaton), written by a university professor and his postdoc team for software engineering at a large university, that modeled every cell in a 100k x 100k grid as a class which used virtual methods for every computation between cells. I rewrote that in CUDA with normal buffers/arrays, and an epoch ran in milliseconds instead of hours.
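For a sense of what "normal buffers/arrays" can look like, here is a CPU-only sketch of one cellular automaton step over a flat grid. It is not the rewrite described above (that one was in CUDA, and its rules are not given); the `Grid` type and the infection rule are invented purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One byte per cell in a contiguous row-major buffer:
// 0 = healthy, 1 = infected. No per-cell objects, no virtual calls.
struct Grid {
    Grid(std::size_t w, std::size_t h) : width(w), height(h), cells(w * h, 0) {}

    std::uint8_t At(std::size_t x, std::size_t y) const {
        return cells[y * width + x];
    }
    void Set(std::size_t x, std::size_t y, std::uint8_t value) {
        cells[y * width + x] = value;
    }

    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> cells;
};

// Hypothetical rule: a cell is infected in the next epoch if it, or any
// of its four direct neighbours, is infected now.
void Step(const Grid& in, Grid& out) {
    assert(in.width == out.width && in.height == out.height);
    for (std::size_t y = 0; y < in.height; ++y) {
        for (std::size_t x = 0; x < in.width; ++x) {
            std::uint8_t infected = in.At(x, y);
            if (x > 0)             infected |= in.At(x - 1, y);
            if (x + 1 < in.width)  infected |= in.At(x + 1, y);
            if (y > 0)             infected |= in.At(x, y - 1);
            if (y + 1 < in.height) infected |= in.At(x, y + 1);
            out.Set(x, y, infected);
        }
    }
}
```

Reading from one grid and writing to another keeps each epoch independent of update order, and the same layout maps naturally onto one GPU thread per cell.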

phtrivier · on March 1, 2023

In all fairness to them, "simulating many things interacting with each other" is the poster child of OO. It's just that, well, it's not how CPUs work.

Then again, at some point we had "Lisp machines", maybe some day there will be a computer architecture where memory/computation patterns are adapted to massive simulation - rather than shoehorning it onto existing architectures.

And those will fail just as miserably as Lisp machines.

adamrezich · on Feb 28, 2023

> No one is working with a huge amount of data in big loops using virtual methods to take every element out of a huge dataset like he is showing.

this is exactly how the typical naïve game loop/entity system works.

TheBigSalad · on Feb 28, 2023

Still... this isn't the reason games are slow.

adamrezich · on Feb 28, 2023

I'm not sure why you would say that—it's certainly a reason why some (not AAA) games are slow.

int_19h · on Feb 28, 2023

It should be noted that there were AAA games written in this fashion, and they were not slow. All method dispatch was virtual in UnrealScript, for example.

Arelius · on March 1, 2023

Well, for starters, many AAA games in Unreal had to have many core functions/classes rewritten from UnrealScript to C++ for performance reasons, where often not every call is virtual. Secondly, UnrealScript is not really a great example: Unreal is notoriously on the slower end of game architectures, and even Epic decided to drop UnrealScript.

And importantly, UnrealScript was designed in the '90s, when memory latencies were far less of a problem.

int_19h · on March 1, 2023

Of course C++ is faster, although that has more to do with being compiled rather than bytecode-interpreted. But even so, we played those games on hardware that's very slow by modern standards, and it was fast enough for competitive PvP, so I wouldn't describe it as "slow" in absolute terms.

cma · on Feb 28, 2023

The clean code methods can be even worse outside of tight loops; in the tight loop, all the vtable lookup stuff was cached.

jackmott42 · on March 1, 2023

The same problems happen when you have 1000 requests being dealt with simultaneously each working on small collections. Web Servers for real businesses do not sit idle, they churn at high % and reducing CPU load on them lets you save money, and/or improve latency for users, which can make you money.

So go on all of you, write everything in Python with 90 levels of indirection, my stock will go up.

Ygg2 · on March 1, 2023

Reminds me of a joke where a programmer optimized the most frequently used method in an imgur clone from 1s to 0.01s, because the customer complained the UI was slow to respond.

Congratulations. Pats on the back, champagne all around. Customers call. Same complaint.

Programmer asks "Well, did something change at least?". "Loading bar now flickers more", answers customer.

timeon · on Feb 28, 2023

> a virtual method is called only a few hundred times

The particular example with shapes could be a CAD or BIM app, where it is usually more than "a few hundred times".

scotty79 · on Feb 28, 2023

> "we have 20x loss in performance everywhere"?

I guess only in the places you call methods?

Wait? Are they everywhere? Hmmm...