Hacker News | new | past | comments | ask | show | jobs | submit | dirk94018's comments | login

Simplicity is hard. Mark Twain's "I would have written a shorter letter, but I did not have the time," at the end of a letter, comes to mind. Software devs' tendency to build castles is great for technical managers who want to own complex systems to gain organizational leverage. Worse is better in this context, even when it makes people who understand cringe.

You would think that things not breaking should be career-positive for SysAdmins, SREs, and DevOps engineers in a way it cannot be for software devs. But even there simplicity is hard and not really rewarded.

Unix philosophy got this right 50 years ago — small tools, composability, do one thing well. Unix reimagined for AI is my attempt to change that.



On an M4 Max with 128 GB we're seeing ~100 tok/s generation on a 30B-parameter model in our from-scratch inference engine. Very curious what the "4x faster LLM prompt processing" translates to in practice. Smallish local 30B-70B inference is genuinely usable territory for real dev workflows, not just demos. It will require staying plugged in, though.

The memory bandwidth on the M4 Max is 546 GB/s and on the M5 Max is 614 GB/s, so not a huge jump.

The new tensor cores (sorry, "Neural Accelerators") only really help with prompt processing, a.k.a. prefill, and not with token generation. Token generation is memory-bound.

Hopefully the Ultra version (if it exists) has a bigger jump in memory bandwidth and maximum RAM.
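To see why the modest bandwidth bump matters, here is a rough back-of-envelope sketch (my own arithmetic, not from the comments above): decode has to stream every active weight from memory once per generated token, so bandwidth divided by bytes-per-token gives an upper bound on tokens per second.

```python
def max_decode_tps(bandwidth_gbs, active_params_b, bytes_per_param):
    # Each generated token must read every active weight once,
    # so decode speed is capped by bandwidth / bytes-read-per-token.
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

# Illustrative: dense 30B model at 4-bit quantization (~0.5 bytes/param)
print(round(max_decode_tps(546, 30, 0.5), 1))  # M4 Max: ~36.4 tok/s ceiling
print(round(max_decode_tps(614, 30, 0.5), 1))  # M5 Max: ~40.9 tok/s, only ~12% more
```

MoE models beat this dense ceiling because only their active experts are read per token, which is why the higher figures quoted elsewhere in the thread are plausible.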


Do any frameworks manage to use the neural engine cores for that?

Most stuff ends up going through Metal to the GPU, I thought.


It's referring to the neural cores (for matrix multiplication) in the GPU itself, not the NPU.

https://creativestrategies.com/research/m5-apple-silicon-its...



I noticed that even on my M3, MLX tends to do prefill a lot faster than llama.cpp with GGML models. Does anyone know how they do it?

4x faster is about prefill, i.e. the time to first token. It should be on par with DGX Spark there while being slightly faster than the M4 for token generation. In other words, with a long context you don't need to wait 15 minutes, only 4 minutes.

What about real workloads? Because as context gets larger, these local LLMs approach the useless end of the spectrum with regard to t/s.

I strongly agree. People see local "GPT-4 level" responses and get excited, which I totally get. But how quick is the fall-off as the context size grows? Because if it cannot hold and reference a single source-code file in its context, the efficiency will absolutely crater.

That's actually the biggest growth area in LLMs: it is no longer about smarts, it is about context windows (usable ones, not spec-sheet hypotheticals). Smart enough is mostly solved; handling larger problems is slowly improving with every major release (but there is no ceiling).


The thing about the context/KV cache is that you can swap it out efficiently, which you can't do with the activations, because they're rewritten for every token. It will slow down as context grows (decode is often compute-limited when context is large), but it will run.

That should be covered by the harness rather than the LLM itself, no? Compaction and summarization should be able to allow the LLM to still run smoothly even on large contexts.

Sometimes it really needs a lot of data to work.

100 tok/s sounds pretty good. What do you get with 70B? With 128 GB, you need quantization to fit a 70B model, right?

Wondering if a local LLM (for coding) is a realistic option; otherwise I wouldn't have to max out the RAM.


I run the gpt-oss 120b model on ollama (the model is about 65 GB on disk) with a 128k context size (the model is super optimized and only uses 4.8 GB of additional RAM for KV cache at this context size) on an M4 Max 128 GB RAM Mac Studio, and I get 65 tokens/s.
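That 4.8 GB figure checks out against the standard KV-cache size formula. A sketch, assuming gpt-oss-120b's published configuration (36 layers, half full attention and half with a 128-token sliding window, 8 grouped-query KV heads, head dim 64, fp16 cache); those numbers come from the model card, not from this comment:

```python
def kv_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    # One K and one V tensor per layer: 2 * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

# 18 full-attention layers cache the whole 128k context; 18 sliding-window
# layers only cache the last 128 tokens.
full = kv_bytes(18, 8, 64, 131072)
windowed = kv_bytes(18, 8, 64, 128)
print(round((full + windowed) / 1e9, 2))  # ~4.84 GB, matching the 4.8 GB above
```

The sliding-window layers are why the cache stays so small; a model with full attention in every layer would need roughly twice as much at this context size.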

Have you tried the dense (27B, 9B) Qwen3.5 models? Or any diffusion models (Flux Klein, Zimage)? I'm trying to gauge how much of a perf boost I'd get upgrading from an M3 Pro.

For reference:

  | model                          |       size |     params | backend    | threads |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
  | qwen35 ?B Q5_K - Medium        |   6.12 GiB |     8.95 B | MTL,BLAS   |       6 |           pp512 |        288.90 ± 0.67 |
  | qwen35 ?B Q5_K - Medium        |   6.12 GiB |     8.95 B | MTL,BLAS   |       6 |           tg128 |         16.58 ± 0.05 |

  | model                          |       size |     params | backend    | threads |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | MTL,BLAS   |       6 |           pp512 |        615.94 ± 2.23 |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | MTL,BLAS   |       6 |           tg128 |         42.85 ± 0.61 |

Klein 4B completes a 1024px generation in 72 seconds.

I find time to first token more important than tok/s generally, as these models wait an ungodly amount of time before streaming results. It looks like the claims are true based on the M5: https://www.macstories.net/stories/ipad-pro-m5-neural-benchm... so this might work great.

The marketing subterfuge might be about exactly this: technically, prompt processing means the prefill phase of inference. So prompt goes in 4x as fast but generates tokens slower.

This seems all the more likely because the memory bandwidth hasn't increased enough for those kinds of speedups, and I'd guess prefill is more likely to be compute-bound (vs. memory-bandwidth bound).
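The compute-bound-prefill guess can be sanity-checked with arithmetic intensity (FLOPs done per byte of weights read); the hardware numbers below are illustrative M4 Max-class figures I'm assuming, not measurements:

```python
def flops_per_byte(tokens_per_pass, bytes_per_weight=1):
    # A matmul over N tokens does ~2*N FLOPs per weight element;
    # at 8-bit weights (1 byte each) that's 2*N FLOPs per byte read.
    return 2 * tokens_per_pass / bytes_per_weight

machine_balance = 34e12 / 546e9  # ~34 TFLOPS over 546 GB/s: ~62 FLOPs/byte
print(flops_per_byte(1))    # decode, one token per pass: 2 FLOPs/byte
print(flops_per_byte(512))  # prefill, 512-token chunk: 1024 FLOPs/byte
```

Decode (2 FLOPs/byte) sits far below the machine balance, so it's bandwidth-bound; prefill (1024 FLOPs/byte) sits far above it, so it's compute-bound, which is exactly where faster matrix units help.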


> So prompt goes in 4x as fast but generates tokens slower.

I'd take that tradeoff. On my M3 Ultra, the inference is surprisingly fast, but the prompt processing speed makes it painful except as a fallback or experimentation, especially with agentic coding tools.


4x faster PREFILL, not decode. Decode is bandwidth-bound; prefill is FLOPs-constrained.

How much of your RAM does that use, including KV cache? Is there enough left to run real dev workloads AND the LLM?

Also, can you batch effectively, like vLLM on CUDA?

Enough to run multiple agents at the same time with throughput?


[flagged]


For chat-type interactions the prefill is cached; the prompt is processed at 400 tk/s and generation is 100-107 tk/s, so it's quite snappy. Sure, for 130,000 tokens (processing documents) it drops to, I think, 60 tk/s, but don't quote me on that. The larger point is that local LLMs are becoming useful, and they are getting smarter too.

Please read the guidelines and consider moderating your tone. Hostility towards other commenters is strongly discouraged.

I'm not sure if you're just unaware or purposefully dense. It's absolutely possible to get those numbers for certain models on an M4 Max, and it's averaged over many tokens; I was getting 127 tok/s for a 700-token response on a 24B MoE model just yesterday. I tend to use Qwen 3 Coder Next the most, which is closer to 65 or 70 tok/s, but absolutely usable for dev work.

I think the truth is somewhere in the middle. Many people don't realize just how performant some of these models have become on Mac hardware (especially with MLX), and just how powerful the shared-memory architecture Apple has built is; but there is also a lot of hype and misinformation about performance compared to dedicated GPUs. It's a tradeoff between available memory and performance, but often it makes sense.


What inference runtime are you using? You mentioned MLX, but I didn't think anyone was using that for local LLMs.

LM Studio (which prioritizes MLX models if you're on a Mac and they are available). I have it set up with Tailscale, running as a server on my personal laptop, so when I'm working I can connect to it from my work laptop, wherever I might be, and it's integrated through the Zed editor using its built-in agent. It's pretty seamless. Whenever I want to use my personal laptop, I just unload the model and do other things. It's a really nice setup. I'm definitely happy I got the 128 GB MBP: I do a lot of video editing and 3D rendering work as a hobby, so it's sort of dual-purpose in that way, and I can take advantage of the compute power when I'm not actually on the machine by setting it up as an LLM server.

LM Studio has had an MLX engine and models since 2024.

Author here. This started because our C inference engine was slower than Python, which was annoying.

We got it to 400 tok/s prefill and 100 tok/s generation in 1,800 lines of C++, with no dependencies beyond MLX. Just not redoing work was a 125x improvement.

Favorite moment: the model suggested enabling MetalFX to speed up inference. That's Apple's game graphics upscaler. It makes explosions look better.

AMA about any of it. We are working on the Qwen3.5 models. Local AI is going to get a lot better.
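The "not redoing work" win is prefix reuse: keep the KV cache from the previous turn and only prefill the newly appended tokens. A hypothetical sketch of the bookkeeping (my illustration, not the engine's actual code; `ChatSession` and `prompt_cost` are invented names):

```python
class ChatSession:
    def __init__(self):
        self.cached_tokens = []  # tokens whose KV entries we already hold

    def prompt_cost(self, tokens):
        # Count the shared prefix with the cached history; only the new
        # suffix needs a forward pass, the rest is served from the KV cache.
        common = 0
        for cached, new in zip(self.cached_tokens, tokens):
            if cached != new:
                break
            common += 1
        self.cached_tokens = list(tokens)
        return len(tokens) - common  # tokens that must actually be prefilled

s = ChatSession()
print(s.prompt_cost(list(range(1000))))  # first turn: prefill all 1000 tokens
print(s.prompt_cost(list(range(1008))))  # next turn: only the 8 new tokens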


Don't nerf the models. We don't know what we are losing. DOW said it out loud.

This is exactly right. We hit the same wall. Our solution was to re-imagine Unix at https://linuxtoaster.com: either pipe through jq etc. or just start rewriting tools that do this. A good tool shouldn't be verbose out of laziness; it should be conscious of the information the next step in the pipeline might need. If deeper information is needed, the user should ask for it with a command-line flag.

The Pentagon seems to see this as a procurement issue ("we bought a tool, don't tell us how to use it"), while Anthropic seems concerned that the tool's nature is shaped by the constraints put on it: we don't really understand this AI thing, and an unconstrained version could be a worse, more dangerous tool.

Cool

This is exactly why local inference matters. Every query you send to a cloud API is another data point. Your prompts contain your code, your logs, your thought process — arguably more identifying than your HN comments.

The paper shows deanonymization from public posts. Imagine what's possible with private API traffic: the questions you ask, the code you paste, the errors you debug. Even if providers don't read it today, the data exists and the cost of analyzing it is going to zero.

Air-gapped local inference isn't paranoia. It's necessary.


Combine this with the fact that even the private mode of any AI provider still keeps logs of the chats and, from some past discussion IIRC, will keep them indefinitely.

> Air-gapped local inference isn't paranoia. It's necessary.

I definitely agree. I am seeing new models like qwen-3.5-30A3b (IIRC) that can be run reasonably on normal hardware (you can buy a Mac mini, whose price hasn't been inflated) with decent tps while being a decent model overall.

There are some services like Proton Lumo, the service by Signal, and Kagi's AI, which seem to try to be better, but long term my plan is to buy a Mac mini for this level of inference for basic queries.

Of course, in the meanwhile, for things like coding, it might not make too big a difference whether you use a local model or not, except for the most extremely sensitive work (perhaps govt/bank oriented).


TeXmacs is great. However, I use LaTeX regularly. I used to keep a cheat sheet of commands I'd forget between documents. Today I can describe what I want in plain English, pipe it through toast, and get the LaTeX back.

LaTeX, vim, sed, awk, the whole Unix toolkit is getting a new lease on life, because their interfaces are text. Text in, text out. An LLM can write you a perfect \begin{tikzpicture} on the first try.

Clicking through a GUI is much harder, and instead of the computer doing the work, I feel like I am working. WYSIWYG won because it made functionality discoverable; today we have AI mentors.


ChatGPT can write TeXmacs documents :)

What is toast? I have not heard of it. I thought you wrote pandoc at first.

toast is sed with a brain. I got tired of cut-and-paste and made my own tool. Then I decided to let the AI drive and tried toast | bash, which was pretty good, but AIs are terrible at escaping, so I got annoyed and wrote a shell for AI to use called jam. I wanted a bot to answer my texts, so I wrote iMessage, a CLI tool. Now you can do iMessage -c iMessage | toast | iMessage and it answers texts. There is more, and now it's a startup, Unix re-imagined for AI: https://linuxtoaster.com
