Hacker News | new | past | comments | ask | show | jobs | submit | dirk94018's comments | login

Simplicity is hard. Mark Twain's "I would have written a shorter letter, but I did not have the time," at the end of a letter, comes to mind. Software devs' tendency to build castles is great for technical managers who want to own complex systems to gain organizational leverage. Worse is better in this context, even when it makes people who understand cringe.

You would think that things not breaking should be career-positive for SysAdmins, SREs, and DevOps engineers in a way it cannot be for software devs. But even there simplicity is hard and not really rewarded.

Unix philosophy got this right 50 years ago — small tools, composability, do one thing well. Unix reimagined for AI is my attempt to change that.



On an M4 Max with 128 GB we're seeing ~100 tok/s generation on a 30B-parameter model in our from-scratch inference engine. Very curious what the "4x faster LLM prompt processing" translates to in practice. Smallish local 30B-70B inference is genuinely usable territory for real dev workflows, not just demos. It will require staying plugged in, though.

The memory bandwidth on the M4 Max is 546 GB/s and on the M5 Max is 614 GB/s, so not a huge jump.

The new tensor cores (sorry, "Neural Accelerators") only really help with prompt processing, a.k.a. prefill, and not with token generation. Token generation is memory-bound.

Hopefully the Ultra version (if it exists) has a bigger jump in memory bandwidth and maximum RAM.
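To see why the modest bandwidth bump matters, here is a rough back-of-envelope sketch (my own arithmetic, not from the comments above): decode has to stream every active weight from memory once per generated token, so bandwidth divided by bytes-per-token gives an upper bound on tokens per second.

```python
def max_decode_tps(bandwidth_gbs, active_params_b, bytes_per_param):
    # Each generated token must read every active weight once,
    # so decode speed is capped by bandwidth / bytes-read-per-token.
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

# Illustrative: dense 30B model at 4-bit quantization (~0.5 bytes/param)
print(round(max_decode_tps(546, 30, 0.5), 1))  # M4 Max: ~36.4 tok/s ceiling
print(round(max_decode_tps(614, 30, 0.5), 1))  # M5 Max: ~40.9 tok/s, only ~12% more
```

MoE models beat this dense ceiling because only their active experts are read per token, which is why the higher figures quoted elsewhere in the thread are plausible.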


Do any frameworks manage to use the neural engine cores for that?

Most stuff ends up going through Metal to the GPU, I thought.


It's referring to the neural cores (for matrix multiplication) in the GPU itself, not the NPU.

https://creativestrategies.com/research/m5-apple-silicon-its...



I noticed that even on my M3, MLX tends to do prefill a lot faster than llama.cpp with GGML models. Does anyone know how they do it?

4x faster is about prefill, i.e. the time to first token. It should be on par with DGX Spark there while being slightly faster than the M4 for token generation. In other words, with a long context you don't need to wait 15 minutes, only 4 minutes.

What about real workloads? Because as context gets larger, these local LLMs approach the useless end of the spectrum with regard to t/s.

I strongly agree. People see local "GPT-4 level" responses and get excited, which I totally get. But how quick is the fall-off as the context size grows? Because if it cannot hold and reference a single source-code file in its context, the efficiency will absolutely crater.

That's actually the biggest growth area in LLMs: it is no longer about smarts, it is about context windows (usable ones, not spec-sheet hypotheticals). Smart enough is mostly solved; handling larger problems is slowly improving with every major release (but there is no ceiling).


The thing about the context/KV cache is that you can swap it out efficiently, which you can't do with the activations, because they're rewritten for every token. It will slow down as context grows (decode is often compute-limited when context is large), but it will run.

That should be covered by the harness rather than the LLM itself, no? Compaction and summarization should be able to allow the LLM to still run smoothly even on large contexts.

Sometimes it really needs a lot of data to work.

100 tok/s sounds pretty good. What do you get with 70B? With 128 GB, you need quantization to fit a 70B model, right?

Wondering if a local LLM (for coding) is a realistic option; otherwise I wouldn't have to max out the RAM.


I run the gpt-oss 120b model on ollama (the model is about 65 GB on disk) with a 128k context size (the model is super optimized and only uses 4.8 GB of additional RAM for KV cache at this context size) on an M4 Max 128 GB RAM Mac Studio, and I get 65 tokens/s.
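That 4.8 GB figure checks out against the standard KV-cache size formula. A sketch, assuming gpt-oss-120b's published configuration (36 layers, half full attention and half with a 128-token sliding window, 8 grouped-query KV heads, head dim 64, fp16 cache); those numbers come from the model card, not from this comment:

```python
def kv_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    # One K and one V tensor per layer: 2 * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

# 18 full-attention layers cache the whole 128k context; 18 sliding-window
# layers only cache the last 128 tokens.
full = kv_bytes(18, 8, 64, 131072)
windowed = kv_bytes(18, 8, 64, 128)
print(round((full + windowed) / 1e9, 2))  # ~4.84 GB, matching the 4.8 GB above
```

The sliding-window layers are why the cache stays so small; a model with full attention in every layer would need roughly twice as much at this context size.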

Have you tried the dense (27B, 9B) Qwen3.5 models? Or any diffusion models (Flux Klein, Zimage)? I'm trying to gauge how much of a perf boost I'd get upgrading from an M3 Pro.

For reference:

  | model                          |       size |     params | backend    | threads |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
  | qwen35 ?B Q5_K - Medium        |   6.12 GiB |     8.95 B | MTL,BLAS   |       6 |           pp512 |        288.90 ± 0.67 |
  | qwen35 ?B Q5_K - Medium        |   6.12 GiB |     8.95 B | MTL,BLAS   |       6 |           tg128 |         16.58 ± 0.05 |

  | model                          |       size |     params | backend    | threads |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | MTL,BLAS   |       6 |           pp512 |        615.94 ± 2.23 |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | MTL,BLAS   |       6 |           tg128 |         42.85 ± 0.61 |

Klein 4B completes a 1024px generation in 72 seconds.

I find time to first token more important than tok/s generally, as these models wait an ungodly amount of time before streaming results. It looks like the claims are true based on the M5: https://www.macstories.net/stories/ipad-pro-m5-neural-benchm... so this might work great.

The marketing subterfuge might be about exactly this: technically, prompt processing means the prefill phase of inference. So prompt goes in 4x as fast but generates tokens slower.

This seems all the more likely because the memory bandwidth hasn't increased enough for those kinds of speedups, and I'd guess prefill is more likely to be compute-bound (vs. memory-bandwidth bound).
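The compute-bound-prefill guess can be sanity-checked with arithmetic intensity (FLOPs done per byte of weights read); the hardware numbers below are illustrative M4 Max-class figures I'm assuming, not measurements:

```python
def flops_per_byte(tokens_per_pass, bytes_per_weight=1):
    # A matmul over N tokens does ~2*N FLOPs per weight element;
    # at 8-bit weights (1 byte each) that's 2*N FLOPs per byte read.
    return 2 * tokens_per_pass / bytes_per_weight

machine_balance = 34e12 / 546e9  # ~34 TFLOPS over 546 GB/s: ~62 FLOPs/byte
print(flops_per_byte(1))    # decode, one token per pass: 2 FLOPs/byte
print(flops_per_byte(512))  # prefill, 512-token chunk: 1024 FLOPs/byte
```

Decode (2 FLOPs/byte) sits far below the machine balance, so it's bandwidth-bound; prefill (1024 FLOPs/byte) sits far above it, so it's compute-bound, which is exactly where faster matrix units help.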


> So prompt goes in 4x as fast but generates tokens slower.

I'd take that tradeoff. On my M3 Ultra, the inference is surprisingly fast, but the prompt processing speed makes it painful except as a fallback or experimentation, especially with agentic coding tools.


4x faster PREFILL, not decode. Decode is bandwidth-bound; prefill is FLOPs-constrained.

How much of your RAM does that use, including KV cache? Is there enough left to run real dev workloads AND the LLM?

Also, can you batch effectively, like vLLM on CUDA?

Enough to run multiple agents at the same time with throughput?


[flagged]


For chat-type interactions the prefill is cached; the prompt is processed at 400 tk/s and generation is 100-107 tk/s, so it's quite snappy. Sure, for 130,000 tokens (processing documents) it drops to, I think, 60 tk/s, but don't quote me on that. The larger point is that local LLMs are becoming useful, and they are getting smarter too.

Please read the guidelines and consider moderating your tone. Hostility towards other commenters is strongly discouraged.

I'm not sure if you're just unaware or purposefully dense. It's absolutely possible to get those numbers for certain models on an M4 Max, and it's averaged over many tokens; I was getting 127 tok/s for a 700-token response on a 24B MoE model just yesterday. I tend to use Qwen 3 Coder Next the most, which is closer to 65 or 70 tok/s, but absolutely usable for dev work.

I think the truth is somewhere in the middle. Many people don't realize just how performant some of these models have become on Mac hardware (especially with MLX), and just how powerful the shared-memory architecture Apple has built is; but there is also a lot of hype and misinformation about performance compared to dedicated GPUs. It's a tradeoff between available memory and performance, but often it makes sense.


What inference runtime are you using? You mentioned MLX, but I didn't think anyone was using that for local LLMs.

LM Studio (which prioritizes MLX models if you're on a Mac and they are available). I have it set up with Tailscale, running as a server on my personal laptop, so when I'm working I can connect to it from my work laptop, wherever I might be, and it's integrated through the Zed editor using its built-in agent. It's pretty seamless. Whenever I want to use my personal laptop, I just unload the model and do other things. It's a really nice setup. I'm definitely happy I got the 128 GB MBP: I do a lot of video editing and 3D rendering work as a hobby, so it's sort of dual-purpose in that way, and I can take advantage of the compute power when I'm not actually on the machine by setting it up as an LLM server.

LM Studio has had an MLX engine and models since 2024.

Author here. This started because our C inference engine was slower than Python, which was annoying.

We got it to 400 tok/s prefill and 100 tok/s generation in 1,800 lines of C++, with no dependencies beyond MLX. Just not redoing work was a 125x improvement.

Favorite moment: the model suggested enabling MetalFX to speed up inference. That's Apple's game graphics upscaler. It makes explosions look better.

AMA about any of it. We are working on the Qwen3.5 models. Local AI is going to get a lot better.
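The "not redoing work" win is prefix reuse: keep the KV cache from the previous turn and only prefill the newly appended tokens. A hypothetical sketch of the bookkeeping (my illustration, not the engine's actual code; `ChatSession` and `prompt_cost` are invented names):

```python
class ChatSession:
    def __init__(self):
        self.cached_tokens = []  # tokens whose KV entries we already hold

    def prompt_cost(self, tokens):
        # Count the shared prefix with the cached history; only the new
        # suffix needs a forward pass, the rest is served from the KV cache.
        common = 0
        for cached, new in zip(self.cached_tokens, tokens):
            if cached != new:
                break
            common += 1
        self.cached_tokens = list(tokens)
        return len(tokens) - common  # tokens that must actually be prefilled

s = ChatSession()
print(s.prompt_cost(list(range(1000))))  # first turn: prefill all 1000 tokens
print(s.prompt_cost(list(range(1008))))  # next turn: only the 8 new tokens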


Don't nerf the models. We don't know what we are losing. DOW said it out loud.

This is exactly right. We hit the same wall. Our solution was to re-imagine Unix at https://linuxtoaster.com: either pipe through jq etc. or just start rewriting tools that do this. A good tool shouldn't be verbose out of laziness; it should be conscious of the information the next step in the pipeline might need. If deeper information is needed, the user should ask for it with a command-line flag.

The Pentagon seems to see this as a procurement issue ("we bought a tool, don't tell us how to use it"), while Anthropic seems concerned that the tool's nature is shaped by the constraints put on it: we don't really understand this AI thing, and an unconstrained version could be a worse, more dangerous tool.

Cool

This is exactly why local inference matters. Every query you send to a cloud API is another data point. Your prompts contain your code, your logs, your thought process — arguably more identifying than your HN comments.

The paper shows deanonymization from public posts. Imagine what's possible with private API traffic: the questions you ask, the code you paste, the errors you debug. Even if providers don't read it today, the data exists and the cost of analyzing it is going to zero.

Air-gapped local inference isn't paranoia. It's necessary.


Combine this with the fact that even the private mode of any AI provider still keeps logs of the chats and, from some past discussion IIRC, will keep them indefinitely.

> Air-gapped local inference isn't paranoia. It's necessary.

I definitely agree. I am seeing new models like qwen-3.5-30A3b (IIRC) that can be run reasonably on normal hardware (you can buy a Mac mini, whose price hasn't been inflated) with decent tps while being a decent model overall.

There are some services like Proton Lumo, the service by Signal, and Kagi's AI, which seem to try to be better, but long term my plan is to buy a Mac mini for this level of inference for basic queries.

Of course, in the meanwhile, for things like coding, it might not make too big a difference whether you use a local model or not, except for the most extremely sensitive work (perhaps govt/bank oriented).


TeXmacs is great. However, I use LaTeX regularly. I used to keep a cheat sheet of commands I'd forget between documents. Today I can describe what I want in plain English, pipe it through toast, and get the LaTeX back.

LaTeX, vim, sed, awk, the whole Unix toolkit is getting a new lease on life, because their interfaces are text. Text in, text out. An LLM can write you a perfect \begin{tikzpicture} on the first try.

Clicking through a GUI is much harder, and instead of the computer doing the work, I feel like I am working. WYSIWYG won because it made functionality discoverable; today we have AI mentors.


ChatGPT can write TeXmacs documents :)

What is toast? I have not heard of it. I thought you wrote pandoc at first.

toast is sed with a brain. I got tired of cut-and-paste and made my own tool. Then I decided to let the AI drive and tried toast | bash, which was pretty good, but AIs are terrible at escaping, so I got annoyed and wrote a shell for AI to use called jam. I wanted a bot to answer my texts, so I wrote iMessage, a CLI tool. Now you can do iMessage -c iMessage | toast | iMessage and it answers texts. There is more, and now it's a startup, Unix re-imagined for AI: https://linuxtoaster.com
