Not being glib at all.

They're overpriced toys. Too little CPU, too little GPU, too little RAM, and what RAM you do have has too little memory bandwidth to serve both the CPU and GPU at the same time. That still hasn't been fixed in the newest M-series chips: the GPU can still starve the CPU out, because the two still don't share bandwidth correctly.

And since this is about local inference: dealing with Metal sucks, and dealing with the NPU sucks. NPUs are nightmares to infer with because you're always stuck using models quantized exclusively for them, which often means poorly quantized models (you give up mixed-quant models and burn your finite RAM just to keep the model from going insane).

Not only that, they can't meaningfully run Linux, so they have no life after the desktop; they just become e-waste.

Hell, even if I held one hand behind my back and _only_ looked at unified RAM machines: Apple doesn't currently sell an M5 Mac Studio, the M5 MBP w/ 128GB of RAM starts at $5.4k, and presumably the M5 Mac Studio will cost roughly the same.

I can buy the overpriced Framework Workstation, which has a better CPU and a better GPU, can run Linux or Windows, and starts at $3.5k.

Or I can just not be a dumb shit and build an inference rig with a 9800X3D and however much normal RAM I want (48 or 64GB is enough here), then drop in two 9060XTs (16GB, $500 each), or two W9700s (a 9070XT w/ 32GB, $1100 each), or a single 5090 ($2k, but only 32GB, the worst option of the three), and infer several times faster than all the above options while spending less money.
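To put the three GPU options side by side, here's a quick back-of-the-envelope sketch of total VRAM and dollars per GB. It uses only the card prices and capacities quoted above; the `vram_cost` helper is just made up for illustration, and it ignores the rest of the rig (CPU, board, system RAM, PSU).

```python
# GPU options as quoted above: (label, number of cards, GB per card, USD per card).
# Prices are the rough figures from this comment, not current market quotes.
OPTIONS = [
    ("2x 9060XT", 2, 16, 500),
    ("2x W9700", 2, 32, 1100),
    ("1x 5090", 1, 32, 2000),
]


def vram_cost(cards, gb_each, usd_each):
    """Return (total VRAM in GB, total USD, USD per GB) for one option."""
    total_gb = cards * gb_each
    total_usd = cards * usd_each
    return total_gb, total_usd, total_usd / total_gb


if __name__ == "__main__":
    for label, cards, gb_each, usd_each in OPTIONS:
        total_gb, total_usd, usd_per_gb = vram_cost(cards, gb_each, usd_each)
        print(f"{label}: {total_gb} GB for ${total_usd} (${usd_per_gb:.2f}/GB)")
```

On those numbers the single 5090 costs about twice as much per GB of VRAM as either dual-card option, which is why it's the worst of the three if capacity is what you're after.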

Inference requires two important things Apple either can't do, or can't do cheaply: RAM bandwidth and RAM amount.
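Why those two things matter can be sketched with the usual rule of thumb for token generation on a dense model: every generated token has to read roughly all of the weights once, so memory bandwidth divided by model size gives a ceiling on tokens per second, and the model has to fit in memory in the first place. This is a simplification (it ignores compute, KV cache reads, and MoE models that only touch part of the weights), the function names are made up, and the numbers in the usage example are placeholders, not measurements of any machine mentioned here.

```python
def max_decode_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on decode speed for a dense model.

    Assumes each generated token streams all weights from memory once,
    so real throughput will be lower than this.
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


def fits_in_memory(model_size_gb: float, memory_gb: float, overhead_gb: float = 2.0) -> bool:
    """Check capacity: weights plus an assumed overhead (KV cache, buffers).

    The 2 GB default overhead is an arbitrary placeholder.
    """
    return model_size_gb + overhead_gb <= memory_gb


if __name__ == "__main__":
    # Placeholder figures for illustration only.
    bandwidth_gb_s = 500.0
    model_size_gb = 20.0
    memory_gb = 32.0
    print(fits_in_memory(model_size_gb, memory_gb))
    print(max_decode_tokens_per_s(bandwidth_gb_s, model_size_gb))
```

Plug in the real bandwidth and capacity of whatever box you're comparing: not enough RAM and the model doesn't load at all, not enough bandwidth and it loads but crawls.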