MTP support is being added to llama.cpp, at least for the Qwen models ( https://g...

tarruda · 2026-05-05T17:31:03 1778002263

There is a newer PR which will probably be merged soon: https://github.com/ggml-org/llama.cpp/pull/22673

xlayn · 2026-05-05T21:47:26 1778017646

Ohhhh geee!!! I just applied the patch to my local git copy. You need to use the model from the PR that he submitted; that model is special because it carries the extra information that makes MTP possible. I have two AMD GPUs, and qwen3.6 27B qk6 does around 20t/s generation... If I run it on only one I get like 35t/s.

But with this patch I saw 46t/s with qwen3.6 27B q8... this is insane, it's more than double the original speed. There was no GPU I could upgrade to that would give that kind of boost, amazing!

CaineThanatos · 2026-05-07T13:18:20 1778159900

Which AMD GPUs do you have, if I may ask?

entropicdrifter · 2026-05-05T18:21:49 1778005309

Ollama merged a PR for MTP about 2 hours ago, as well:

https://github.com/ollama/ollama/pull/15980

Edit: Seems they also have a pre-release version out with the functionality added: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0

theturtle32 · 2026-05-06T06:16:32 1778048192

Sad:

    theturtle32@ai1:~$ ollama run gemma4:31b-coding-mtp-bf16
    pulling manifest
    Error: pull model manifest: 412: this model requires macOS

zozbot234 · 2026-05-06T06:23:38 1778048618

What's "sad" is how slow the ollama folks are being in vendoring newer versions of ggml into their codebase. That attitude just leaves them stranded without access to newer features.

nzeid · 2026-05-05T19:34:57 1778009697

A few days ago I switched again from Qwen3.6 to Gemma 4 - for personal use I've experienced better average performance with the 26B version of the latter than the 27B of the former.

For someone who's been running local models for a long while, these are very very exciting times.

girvo · 2026-05-06T00:42:18 1778028138

Oh that's fascinating. 3.6 27B is pretty damned good, but slow in wall-clock times on my DGX Spark-alike. It generates huge reams of thinking before it gets the (usually correct!) answer, so wall-clock time is rough for tasks even at ~20tk/s

I'm surprised the 26B-A4B is better? It should be faster too, interesting. I'm excited to try 31B with MTP, because MTP-2 is what makes 27B bearable on the GB10.

What are you using it for? Agent-based coding, or something else?

nzeid · 2026-05-06T19:10:58 1778094658

General purpose, mostly internet research in the form of slow-crawling. (Emphasis on slow - I've ultimately landed on Scrapling's API for seamless content rendering, and I use image support so as not to exclude informative images or weirdly rendered text.)

For coding I don't need image support so I stuff the entire GPU with text-only mode. I don't have a workflow where I send LLMs off to generate thousands of lines of code but what little coding I did I did with Qwen3.6 and it was spectacular, as you likely suggest.

glenngillen · 2026-05-06T02:44:56 1778035496

I've been thinking about doing more of this too. What spec machine are you running? And are you using long-running autonomous agents or more of the IDE/co-pilot style of collaboration?

apexalpha · 2026-05-05T20:25:15 1778012715

I’ve been swapping between these two as well.

However, I find Qwen unbeatable for tool calling. I think Gemma wasn't trained on that at all.

sigmoid10 · 2026-05-05T20:30:41 1778013041

Gemma certainly was trained for tool calling, but the implementation in llama.cpp has been troubled because Gemma uses a different chat template format. The processor from the transformers library works fine though.

apexalpha · 2026-05-06T11:10:07 1778065807

Oh I must've missed this.

The AI space moves so fast! I'll check it out again.

intothemild · 2026-05-06T13:32:39 1778074359

Don't forget to update your GGUF files too. The templates in them were updated recently.

nzeid · 2026-05-05T20:39:03 1778013543

I'm using llama.cpp with Gemma and tool calling is mission critical. It's perfectly fine on my end.

There are definitely differences in the eagerness to tool-call that you'll need to manage. And for all local models I've ever used, I've had to micromanage the tools provided by servers to eliminate any possibility that they reach for something wonky or confusing.

magicalhippo · 2026-05-06T00:30:04 1778027404

> However, I find Qwen unbeatable for tool calling. I think Gemma wasn't trained on that at all.

The Gemma4 chat template seems to have had multiple issues, at least with llama.cpp; not sure they're all fixed yet. It assumed simple types for parameters, for example.

fridder · 2026-05-05T20:20:26 1778012426

I'd love to see this in oMLX too. It has been a rather nice tool

egeres · 2026-05-05T22:59:58 1778021998

There's also growing interest in integrating DFlash: https://github.com/ggml-org/llama.cpp/issues/21978. I can't wait to see how it compares against MTP.

nullc · 2026-05-05T23:58:04 1778025484

Thanks for the link, it took qwen3.6-27B-q8 w/256k context on my RTX A6000 from ~20t/s to 55t/s. Prefill is mysteriously slower, but prefill is still so much faster than generation that I think I'm bottlenecked on output most of the time.

_factor · 2026-05-06T02:08:09 1778033289

Took 2x AMD MI50s to 50 t/s instead of 20 t/s for Q8 27B. Impressive.

endymi0n · 2026-05-05T23:26:31 1778023591

I don’t exactly know where MTP inference fits within the inference stack, but does someone know whether it’s possible to implement it for the MLX universe?

neonstatic · 2026-05-12T15:46:19 1778600779

MTP allows for a smaller draft model to supply tokens to the larger model for verification. If tokens are good enough, the larger model can accept them instead of generating its own, which is much cheaper. From what I read, this is not unique to GGUF or MLX format. Instead, the model has to be trained to support that feature.
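To put a rough number on "much cheaper", here is a back-of-the-envelope sketch. It assumes every drafted token is accepted independently with the same probability `alpha`, which is a simplification (real acceptance rates depend on the prompt and on position in the draft). The names `expected_tokens_per_pass`, `alpha` and `k` are mine, not taken from llama.cpp or any PR.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass of the large model.

    Assumption: each of the k drafted tokens is accepted independently with
    probability alpha, and the large model always contributes one token of
    its own (a correction on mismatch, or a bonus token if all k pass).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    # P(at least i drafts accepted) = alpha**i, so the expectation is
    # 1 + alpha + alpha**2 + ... + alpha**k.
    return sum(alpha ** i for i in range(k + 1))


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, k=2), 2))
```

Note this counts tokens per large-model pass, not wall-clock speedup: drafting and verifying k extra positions both cost something, so the real gain is smaller.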

basch · 2026-05-05T18:24:39 1778005479

I have a dumb performance question.

Why, when asking a model to change text in a minor way, are we not asking it to generate the operational transformations necessary to modify the text, and then just executing the OT on the existing text instead of reproducing every token? Maybe tools are doing that more than I realize?

XYen0n · 2026-05-05T18:51:20 1778007080

The only thing a model can output is tokens; to achieve this, a tool for converting tokens into operational transformations is required. For example, I have an ast-grep skill that instructs the model to generate ast-grep rules and then runs ast-grep to perform the file modifications.

basch · 2026-05-05T20:27:46 1778012866

I am saying to directly output the operational transformation instructions as the tokens. You’re essentially telling it to “write the diff” and then applying the patch.

[retain(8), delete(6), insert("very very"), retain(10)]
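For anyone who wants to see the "applying the patch" half, a minimal sketch of executing such a sequence. The tuple encoding and the `apply_ot` name are my own assumptions for illustration, not any particular OT library's API.

```python
from typing import List, Tuple, Union

# ("retain", n), ("delete", n) or ("insert", text)
Op = Tuple[str, Union[int, str]]


def apply_ot(text: str, ops: List[Op]) -> str:
    """Apply retain/delete/insert operations to text, left to right."""
    out = []
    cursor = 0  # position in the original text
    for kind, arg in ops:
        if kind == "retain":
            out.append(text[cursor:cursor + arg])
            cursor += arg
        elif kind == "delete":
            cursor += arg  # skip over the deleted characters
        elif kind == "insert":
            out.append(arg)  # inserted text does not move the cursor
        else:
            raise ValueError(f"unknown op: {kind}")
        if cursor > len(text):
            raise ValueError("operation runs past the end of the text")
    if cursor != len(text):
        raise ValueError("operations must cover the whole input")
    return "".join(out)


ops = [("retain", 8), ("delete", 6), ("insert", "very very"), ("retain", 10)]
print(apply_ot("this is really beautiful", ops))  # this is very very beautiful
```

The strict length checks matter here: if the model miscounts a retain by one character, you want an error rather than a silently mangled file.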

mike_hearn · 2026-05-06T09:01:04 1778058064

OpenAI models emit a format similar to a regular diff, but without the line numbers. Look at apply_patch

ritonlajoie · 2026-05-05T23:48:33 1778024913

There is a model on OpenRouter doing exactly this: it generates diffs. I forgot the name though.

cryptoz · 2026-05-05T19:01:35 1778007695

This is the approach I take with code edits to existing files at Code+=AI; I wrote a blog post with a simple example of AST modification to illustrate: https://codeplusequalsai.com/static/blog/prompting_llms_to_m...

HarHarVeryFunny · 2026-05-06T17:35:10 1778088910

A coding agent provides an edit tool to the model for making code changes, with this tool typically just providing a find/replace operation. The model may of course need to do a bunch of work (grepping etc) to figure out what to change, but the actual change will just be sent from model to agent as a "replace X with Y" edit tool request, with this "edit" then done locally by the agent.

It's interesting how the agent (at least in the case of Claude Code) then applies this find/replace "edit" to the requested file... Since the agent wants to be platform independent (Linux/Windows/Mac) it uses Node.js for file access, and performs the "edit" itself by using Node.js to read the entire file, make the change, then write back an entire new file.
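A minimal sketch of that read, replace, write-back pattern, in Python rather than Node.js. This is not Claude Code's actual implementation: the `edit_file` name and the "exactly one match" rule are my assumptions about what a reasonable edit tool would do.

```python
import tempfile
from pathlib import Path


def edit_file(path: Path, old: str, new: str) -> None:
    """Replace exactly one occurrence of `old` with `new` in the file."""
    content = path.read_text(encoding="utf-8")  # read the entire file
    count = content.count(old)
    if count != 1:
        # Assumed policy: refuse missing or ambiguous matches instead of guessing.
        raise ValueError(f"expected exactly 1 match for the old text, found {count}")
    # Write the whole file back with the single replacement applied.
    path.write_text(content.replace(old, new, 1), encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "example.py"
        target.write_text("def answer():\n    return 41\n", encoding="utf-8")
        edit_file(target, "return 41", "return 42")
        print(target.read_text(encoding="utf-8"))
```

The model only has to emit the old and new snippets; the cost of rewriting the whole file is paid locally, not in output tokens.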

sigmoid10 · 2026-05-05T20:24:01 1778012641

The simple answer is: because it is not necessary to achieve the same final output. Most LLMs today are trained as autoregressive token predictors. They fundamentally can't work any other way. But we know how to train them really well and they have many applications beyond editing text. Diffusion LLMs exist too, which work a bit closer to what you describe, but they are not yet at the same level of intelligence since training methods are not that mature and they are generally less flexible as well.

basch · 2026-05-05T20:28:19 1778012899

So predict the tokens of the operational transformation.

I just asked: Write the operational transformation sequence and command to turn “this is really beautiful” to “this is very very beautiful”

and in return got: You can map this out by moving a virtual cursor across the text and telling it what to keep, remove, or add. You start by retaining the first eight characters to keep "this is " untouched. Then you delete the next six characters to remove the word "really". In that exact spot, you insert the nine characters for "very very". You finish the operation by retaining the final ten characters, which preserves the space and the word "beautiful". You can code this specific command sequence as [retain(8), delete(6), insert("very very"), retain(10)].

In a large paragraph of text I would expect it to be way quicker and cheaper to generate “[retain(800), delete(6), insert("very very"), retain(10000)]” than repredict the entire remainder of the unedited text.
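A rough way to see the size difference, using Python's standard `difflib` to derive the operations from a known before/after pair. That is obviously not how a model would produce them (it only sees the request), and character counts are just a stand-in for tokens. The filler text and the `derive_ops` name are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple, Union

# ("retain", n), ("delete", n) or ("insert", text)
Op = Tuple[str, Union[int, str]]


def derive_ops(old: str, new: str) -> List[Op]:
    """Derive retain/delete/insert operations that turn `old` into `new`."""
    ops: List[Op] = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("retain", i2 - i1))
            continue
        if tag in ("delete", "replace"):
            ops.append(("delete", i2 - i1))
        if tag in ("insert", "replace"):
            ops.append(("insert", new[j1:j2]))
    return ops


# Hypothetical document: filler text around the sentence from the thread.
old = "lorem ipsum " * 60 + "this is really beautiful" + " dolor sit amet" * 60
new = old.replace("really", "very very")
ops = derive_ops(old, new)
print("characters in full rewrite:", len(new))
print("characters in op sequence: ", len(repr(ops)))
```

The character-level diff may split the change into slightly different pieces than a human would write, but the op sequence stays tiny compared to re-emitting the whole text.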

sigmoid10 · 2026-05-05T20:39:55 1778013595

Sounds easy, but isn't in practice. You can look at the edit text file tool in VS Code Copilot, for example, to see how complicated that can get: https://github.com/microsoft/vscode-copilot-chat/tree/9e668c...

basch · 2026-05-05T20:49:28 1778014168

I have no idea when I’m being lied to anymore, but allegedly Aider and Cursor work the way I described, although Cursor uses a second model to apply the edit.

sigmoid10 · 2026-05-06T11:13:35 1778066015

They all do something similar under the hood. Patching files is not a trivial task when you only have the changed text content and not the actual file structure to work with. It kind of works, but is fundamentally limited by the LLM output architecture.

mike_hearn · 2026-05-06T09:02:47 1778058167

Cursor has a dedicated merge model. It takes input like this:

    class Foo {
        // ....
        int calculation() {
            return 42;
        }
    
        // more stuff
    }

where the main model emits something that is a sort of casual under-specified diff format and the merge model figures out how to interpret it as a patch.

jfim · 2026-05-05T20:38:20 1778013500

I've seen Claude use sed to edit files on other hosts instead of copying the file back and forth to edit it. Not quite full blown OT but it's going in that direction.

EGreg · 2026-05-05T17:15:39 1778001339

How does this get added in practice?

flakiness · 2026-05-05T17:21:09 1778001669

According to the linked PR, the original model does come with MTP which is another "head" (=output path) in the same model and (supposedly) runs faster.

The current implementation ignores that head, but the PR lets the tool recognize it and adds proper integration (run the MTP head while running the slower main path, then compare the results, I believe).

flebron · 2026-05-05T20:11:46 1778011906

The standard way of doing MTP is to run the drafter autoregressively for k steps, and then (not concurrently) use the larger model as a verifier for those k tokens at the same time. The larger model can then accept a prefix of those k tokens, and in any case generates one more token (which is needed in case you accepted zero tokens from the drafter). The larger model can effectively use this k as a "batch" dimension, reducing the penalty of large weight loading. Meanwhile the drafter is much smaller, so it's fine for _it_ to be autoregressive, as long as the main model is parallel.
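A toy sketch of that draft-then-verify loop with greedy decoding. The "models" are made-up deterministic functions, and the verify step loops where a real engine would score all k+1 positions in one batched forward pass. It also ignores that an MTP head reuses the main model's hidden states, so treat it as an illustration of the control flow only; all names are hypothetical.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to the next token


def speculative_step(target: Model, draft: Model,
                     prefix: List[int], k: int) -> Tuple[List[int], int]:
    """One round: returns (new tokens, number of drafted tokens accepted)."""
    # 1. Draft k tokens autoregressively with the small model.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))
    # 2. Verify them against the large model (batched in a real engine).
    accepted: List[int] = []
    for tok in drafted:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the large model's token and stop.
            return accepted + [expected], len(accepted)
        accepted.append(tok)
    # All k accepted: the large model still yields one more token.
    return accepted + [target(prefix + accepted)], k


def generate(target: Model, draft: Model,
             prompt: List[int], n_tokens: int, k: int) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        new_tokens, _ = speculative_step(target, draft, out, k)
        out.extend(new_tokens)
    return out[len(prompt):len(prompt) + n_tokens]


def toy_target(prefix: List[int]) -> int:
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    # Agrees with the target except at every fourth position.
    guess = toy_target(prefix)
    return (guess + 1) % 50 if len(prefix) % 4 == 0 else guess


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [1, 2, 3], n_tokens=12, k=3))
```

With greedy acceptance like this, the output is identical to running the large model alone; only the number of large-model passes changes. Sampling-based decoding needs a probabilistic acceptance rule instead of the equality check.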

dakolli · 2026-05-05T17:04:22 1778000662

yet, still mostly useless.

WhitneyLand · 2026-05-05T17:19:22 1778001562

Yeah, it's conceptually important to remember that MTP is kind of just more weights, while speculative decoding is the runtime algorithm, which is a significant addition to whatever code is serving the model.

HumanOstrich · 2026-05-05T17:35:17 1778002517

That is.. inaccurate.

WhitneyLand · 2026-05-05T20:58:35 1778014715

How so? I’m not saying most of the work doesn’t go into creating the drafting model or enabling a new head on the primary model, but the point is that, however cool it is, the result is more weights. Speculative decoding requires code that is aware of how this works at the inference level.