Hacker News | simonw's comments

I wonder if this was timed to line up with the MacBook Neo launch, which makes the idea of equipping your entire company with Mac laptops a lot more compelling from a cost perspective.

There’s a grey one. So obviously, it was timed.

Thanks for this, I've added that to my write-up of the project here: https://simonwillison.net/2026/Mar/20/turbo-pascal/#hallucin...

A tiny bit, but it never really appealed to me because I've never been heavily into the API-only version of web development - I still like building things that are mostly Jinja templates and HTML forms with a sprinkle of JavaScript.

My JSON API needs are simple enough that default Starlette handles them well.

I'm beginning to come round to the benefits of OpenAPI now, which seems like a big point in FastAPI's favor, so maybe I'll give it more of a shot.


Just to be clear, I’m referring to:

https://fastht.ml/


Oh sorry, I misread that as FastAPI.

I'm too much of an HTML and JavaScript nerd to get excited about tools that let me write my HTML in Python.


These feel like the kind of things I'd like to use once only, not permanently install into my Claude setup.

As such I'm more likely to just copy and paste markdown from this repo into a fresh Claude session - or tell it the raw GitHub URL and have Claude fetch it and run with it for the duration of that chat.


Tobi from Shopify used a variant of autoresearch to optimize the Liquid template engine, and found a 53% speedup after ~120 experiments: https://github.com/Shopify/liquid/pull/2056

I wrote up some more notes on that here: https://simonwillison.net/2026/Mar/13/liquid/


How much did this cost? Has there ever been an engineering focus on performance for liquid?

It’s certainly cool, but the optimizations are so basic that I’d expect a performance engineer to find these within a day or two with some flame graphs and profiling.


He used Pi as the harness but didn't say which underlying model. My stab-in-the-air guess would be no more than a few hundred dollars in token spend (for 120 experiments run over a few days assuming Claude Opus 4.6 used without the benefits of the Claude Max plan.)

So cheaper than a performance engineer for a day or two... but the Shopify CEO's own time is likely a whole lot more expensive than a regular engineer!


That's not been my experience at all. The default response to open source code is stone cold silence - getting any feedback at all takes real effort.

Those PyPI download numbers are one of the most useful hints as to whether my stuff is being used by anyone.


Yeah, this new post is a continuation of that work.


Thanks for posting this, that's how I first found out about Dan's experiment! SSD speed doubled in the M5P/M generation, which makes it usable! I think one paper that's under the radar is "KV Prediction for Improved Time to First Token" https://arxiv.org/abs/2410.08391 which hopefully can help with prefill for Flash streaming.

That’s exactly what I thought about. Getting my hands on an M5 Max this week and going to see how Dan’s experiment performs with faster I/O. Also going to experiment with running active parameters at Q6 or Q8: since output is I/O bottlenecked, there should be room for higher-accuracy compute.

Check my repo, I've added some support for GGUF/Unsloth, Q3, Q5/Q8 https://github.com/Anemll/flash-moe/blob/iOS-App/docs/gguf-h...

To be fair, it's "possible" to run such a setup with llama.cpp with SSD offload. It's just abysmal TG speeds. But it's possible.

That was a very good summary. One detail the post could use is mentioning that the 4 or 10 experts invoked were selected from the 512 experts the model has per layer (to give an idea of the savings).

I guess this is all set up to show off the new high-bandwidth-flash stuff that's due out soon?

Looks like it's Qwen3.5-397B-A17B so 17B active. https://github.com/Anemll/flash-moe/tree/iOS-App

Stupid question: can I run this on my 64GB/1TB Mac somehow easily? Or does this require custom coding? 4-bit is ~200GB

EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App
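The ~200GB figure for 4-bit follows from simple arithmetic on the 397B total parameter count. A rough sketch, assuming a flat bits-per-weight across the whole model (real quants keep some layers at higher precision and store scale metadata, so actual files will differ); `weights_size_gb` is a made-up helper name for illustration:

```python
def weights_size_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (10^9 bytes).

    Assumes every parameter is stored at the same bit width, which is a
    simplification: real quantized files mix precisions and add metadata.
    """
    return total_params_billions * bits_per_weight / 8


size = weights_size_gb(397, 4)
print(f"{size:.1f} GB")  # 198.5 GB, far more than 64GB of RAM
```

That is roughly three times the RAM on a 64GB machine, which is why the weights have to be streamed from SSD.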


Running larger-than-RAM LLMs is an interesting trick, but it's not practical. The output would be extremely slow and your computer would be burning a lot of power to get there. The heavy quantizations and other tricks (like reducing the number of active experts) used in these demos severely degrade the quality.

With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.


>I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.

There are dynamic quants such as Unsloth which quantize only certain layers to Q4. Some layers are more sensitive to quantization than others. Smaller models are more sensitive to quantization than the larger ones. There are also different quantization algorithms, with different levels of degradation. So I think it's somewhat wrong to put "Q4" under one umbrella. It all depends.


I should clarify that I'm referring generically to the types of quantizations used in local LLM inference, including those from Unsloth.

Nobody actually quantizes every layer to Q4 in a Q4 quant.


I've tried a number of experiments, and agree completely. If it doesn't fit in RAM, it's so slow as to be impractical and almost useless. If you're running things overnight, then maybe, but expect to wait a very long time for any answers.

Current local-AI frameworks do a bad job of supporting the doesn't-fit-in-RAM case, though. Especially when running combined CPU+GPU inference. If you aren't very careful about how you run these experiments, the framework loads all weights from disk into RAM only for the OS to swap them all out (instead of mmap-ing the weights in from an existing file, or doing something morally equivalent as with the original MacBook Pro experiment) which is quite wasteful!

This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.
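To illustrate the mmap point above: a minimal sketch using Python's standard `mmap` module, with a small temporary file standing in for a weights file. The file layout and the `read_slice` helper are assumptions for the demo, not how any particular inference framework does it.

```python
import mmap
import os
import tempfile


def read_slice(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset` through a read-only memory map.

    The OS pages in only the parts of the file that are touched, rather
    than copying the whole file into RAM up front.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


# Stand-in "weights" file: 1 MiB of a repeating 0..255 byte pattern.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4096)
    path = tmp.name

chunk = read_slice(path, 256 * 10 + 5, 4)
print(list(chunk))  # [5, 6, 7, 8]
os.remove(path)
```

Because the mapping is backed by the existing file, the OS can drop those pages under memory pressure and re-read them later instead of writing them out to swap.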


Yes, SSD speed is critical though. The repo has macOS builds for CLI and Desktop. It's early stages though. M4 Max gets 10-15 TPS on 400B depending on quantization. Compute is an issue too; a lot of code is PoC level.

I have a 64G/1T Studio with an M1 Ultra. You can probably run this model to say you’ve done it but it wouldn’t be very practical.

Also I wouldn’t trust 3-bit quantization for anything real. I run a 5-bit qwen3.5-35b-A3B MoE model on my studio for coding tasks and even the 4-bit quant was more flaky (hallucinations, and sometimes it would think about running tool calls and just not run them, lol).

If you decide to give it a go, make sure to use the MLX version over the GGUF one! You’ll get a bit more speed out of it.


One expert is 17B, but more than one expert can be active at any time. I believe it’s actually more like 80B active.

I don't think this is correct, "active parameters" is quite unambiguous in that it means a sum of all active experts plus shared parameters.

Looks like they meant “effective dense size”, which is the square root of total params × active params, so in this case sqrt(397 × 17) ≈ 82.
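A quick check of that arithmetic, treating the geometric mean as the rule of thumb described here (it's a heuristic for comparing against dense models, not an exact law); the function name is just for illustration:

```python
import math


def effective_dense_size(total_b: float, active_b: float) -> float:
    """Geometric mean of total and active parameter counts, in billions."""
    return math.sqrt(total_b * active_b)


print(round(effective_dense_size(397, 17), 1))  # 82.2
```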

But the claim that "one expert is 17B" is incorrect. Experts are picked with per-layer granularity (expert 1 for layer X may well be entirely unrelated to expert 1 for layer Y), and the individual layer-experts are tiny. The writeup for the original experiment is very clear on this.

Ok I am by no means an expert on this and I immediately stand corrected. But as I understand it, in order to understand the amount of active memory that’s required, it’s more accurate to go by the ~82B number, right?

The ~82B figure is an attempt to compare performance to an equivalent dense model. The amount of active parameters is given by the ~17B.

Still pretty good considering 17B is what one would run on a 16GB laptop at Q6 with reasonable headroom

So much this! I've been bugging Astral about addressing the sandboxing challenge for a while, I wonder if that might take more priority now they're at OpenAI?
