Ohhhh geee!!! I just applied the patch to my local git copy.
You need to use the model on the PR that he submitted, the model is particular because it has extra information that allows the MTP to happen.
I have two amd gpus, and qwen3.6 27B qk6 does around 20t/s generation... If I run it only on one I get like 35t/s.
But with this patch I saw 46t/s with qwen3.6 27B q8... this is insane, it's 250% faster than the original speed, there was no gpu I could upgrade to get that kind of boost, amazing!
What's "sad" is how slow the ollama folks are being in vendoring newer versions of ggml into their codebase. That attitude just leaves them stranded without access to newer features.
A few days ago I switched again from Qwen3.6 to Gemma 4 - for personal use I've experienced better average performance with the 26B version of the latter than the 27B of the former.
For someone who's been running local models for a long while, these are very very exciting times.
Oh that's fascinating. 3.6 27B is pretty damned good, but slow in wall-clock times on my DGX Spark-alike. It generates huge reams of thinking before it gets the (usually correct!) answer, so wall-clock time is rough for tasks even at ~20tk/s
I'm surprised the 26B-A4B is better? It should be faster too, interesting. I'm excited to try 31B with MTP, because MTP-2 is what makes 27B bearable on the GB10.
What are you using it for? Agent-based coding, or something else?
General purpose, mostly internet research in the form of slow-crawling. (Emphasis on slow - I've ultimately landed on Scrapling's API for seamless content rendering, and I use image support so as not to exclude informative images or weirdly rendered text.)
For coding I don't need image support so I stuff the entire GPU with text-only mode. I don't have a workflow where I send LLMs off to generate thousands of lines of code but what little coding I did I did with Qwen3.6 and it was spectacular, as you likely suggest.
I've been thinking about doing more of this too. What spec machine are you running? And are you using long-running autonomous agents or more of the IDE/co-pilot style of collaboration?
Gemma certainly was trained for tool calling, but the implementation in llama.cpp has been troubled because Gemma uses a different chat template format. The processor from the transformers library works fine though.
I'm using llama.cpp with Gemma and tool calling is mission critical. It's perfectly fine on my end.
There are definitely differences in the eagerness to tool-call that you'll need to manage. And for all local models I've ever used, I've had to micromanage the tools provided by servers to eliminate any possibility that they reach for something wonky or confusing.
> However I find qwen unbeatable for toolcallling. I think gemma wasnt trained on that at all.
Gemma4 chat template seems to had multiple issues, at least with llama.cpp, not sure they're all fixed yet. It assumed simple types for parameters for example.
Thanks for the link,it took qwen3.6-27B-q8 w/256k context on my RTX A6000 from ~20t/s to 55t/s. Prefill is mysteriously slower however, but prefill is so much faster still that I think I'm still bottlenecked on output most of the time.
I don’t exactly know where MTP inference fits within the inference stack, but does someone know whether it’s possible to implement it for the MLX universe?
MTP allows for a smaller draft model to supply tokens to the larger model for verification. If tokens are good enough, the larger model can accept them instead of generating its own, which is much cheaper. From what I read, this is not unique to GGUF or MLX format. Instead, the model has to be trained to support that feature.
Why when asking a model to change text in a minor way; are we not asking it to generate the operational transformations necessary to modify the text, and then just executing the ot on the existing text vs reproducing every token? Maybe tools are doing that more than I realize?
The only thing a model can output is tokens; to achieve this, a tool of converting tokens into operational transformations is required. For example, I have an ast-grep skill, it will instruct the model to generate ast-grep rules and run ast-grep to perform file modifications.
I am saying to directly output the operational transformation instructions as the tokens. You’re essentially telling it to “write the diff” and then applying the patch.
A coding agent provides an edit tool to the model for making code changes, with this tool typically just providing a find/replace operation. The model may of course need to do a bunch of work (grepping etc) to figure out what to change, but the actual change will just be sent from model to agent as a "replace X with Y" edit tool request, with this "edit" then done locally by the agent.
It's interesting how the agent (at least in case of Claude Code) is then applying this find/replace "edit" to the requested file... Since the agent wants to be platform independent (Linux/Windows/Max) it uses Node.js for file access, and performs the "edit" by itself using Node.js to read the entire file, make the change, then write back an entire new file.
The simple answer is: because it is not necessary to achieve the same final output. Most LLMs today are trained as autoregressive token predictors. They fundamentally can't work any other way. But we know how to train them really well and they have many applications beyond editing text. Diffusion LLMs exist too, which work a bit closer to what you describe, but they are not yet at the same level of intelligence since training methods are not that mature and they are generally less flexible as well.
So predict the tokens of the operational transformation.
I just asked: Write the operational transformation sequence and command to turn “this is really beautiful” to “this is very very beautiful”
and in return got: You can map this out by moving a virtual cursor across the text and telling it what to keep, remove, or add. You start by retaining the first eight characters to keep "this is " untouched. Then you delete the next six characters to remove the word "really". In that exact spot, you insert the nine characters for "very very". You finish the operation by retaining the final ten characters, which preserves the space and the word "beautiful". You can code this specific command sequence as [retain(8), delete(6), insert("very very"), retain(10)].
In a large paragraph of text I would expect it to be way quicker and cheaper to generate “[retain(800), delete(6), insert("very very"), retain(10000)]” than repredict the entire remainder of the unedited text.
I have no idea when I’m being lied to anymore but allegedly Aider and Cursor work the way I described, although cursor is using a second model to apply the edit.
They all do something similar under the hood. Patching files is not a trivial task when you only have the changed text content and not the actual file structure to work with. It kind of works, but is fundamentally limited by the LLM output architecture.
Cursor has a dedicated merge model. It takes input like this:
class Foo {
// ....
int calculation() {
return 42;
}
// more stuff
}
where the main model emits something that is a sort of casual under-specified diff format and the merge model figures out how to interpret it as a patch.
I've seen Claude use sed to edit files on other hosts instead of copying the file back and forth to edit it. Not quite full blown OT but it's going in that direction.
According to the linked PR, the original model does come with MTP which is another "head" (=output path) in the same model and (supposedly) runs faster.
The current implementation ignores that head but the PR let the tool recognize it, plus does proper integration (run the MTP while running the slower main path then compare the result, I believe.)
The standard way of doing MTP is to run the drafter autoregressively for k steps, and then (not concurrently) use the larger model as a verifier for those k tokens at the same time. The larger model can then accept a prefix of those k tokens, and in any case generates one more token (which is needed in case you accepted zero tokens from the drafter). The larger model can effectively use this k as a "batch" dimension, reducing the penalty of large weight loading. Meanwhile the drafter is much smaller, so it's fine for _it_ to be autoregressive, as long as the main model is parallel.
Yeah important conceptually to remember MTP is kind of just more weights, but speculative decoding is the runtime algorithm that’s a significant add to whatever code is serving the model.
How so? I’m not saying most of work doesn’t go into creating the drafting model or enabling a new head on the primary model, but the point is that however cool it is the result is, more weights. Speculative decoding requires code to be aware of how this works at the inference level.
The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.