More

tarruda · 2026-02-19T12:48:54 1771505334

> This is the first model that has really broken into the anglosphere.

Before Step 3.5 Flash, I've been hearing a lot about ACEStep as being the only open weights competitor to Suno.

tarruda · 2026-02-19T12:47:30 1771505250

They seem to be the same company that released ACEStep music generation model: https://acestep.io/

Though the only mention I found was in ComfyUI docs: https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1

tarruda · 2026-02-19T12:45:46 1771505146

This is probably one of the most underrated LLMs releases in the past few months. In my local testing with a 4-bit quant (https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF/tree/mai...), it surpasses every other LLM I was able to run locally, including Minimax 2.5 and GLM-4.7, though I was only able to run GLM with a 2-bit quant. Some highlights:

- Very context efficient: SWA by default, on a 128G mac I can run the full 256k context or two 128k context streams. - Good speeds on macs. On my M1 Ultra I get 36 t/s tg and 300 t/s pp. Also, these speeds degrade very slowly as context increases: At 100k prefill, it has 20 t/s tg and 129 t/s pp. - Trained for agentic coding. I think it is trained to be compatible with claude code, but it works fine with other CLI harnesses except for Codex (due to the patch edit tool which can confuse it).

This is the first local LLM in the 200B parameter range that I find to be usable with a CLI harness. Been using it a lot with pi.dev and it has been the best experience I had with a local LLM doing agentic coding.

There are a few drawbacks though:

- It can generate some very long reasoning chains. - Current release has a bug where sometimes it goes into an infinite reasoning loop: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...

Hopefully StepFun will do a new release which addresses these issues.

BTW StepFun seems to be the same company that released ACEStep (very good music generation model). At least StepFun is mentioned in ComfyUI docs https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1

sosodev · 2026-02-19T18:42:04 1771526524

Have you tried Qwen3 Coder Next? I've been testing it with OpenCode and it seems to work fairly well with the harness. It occasionally calls tools improperly but with Qwen's suggested temperature=1 it doesn't seem to get stuck. It also spends a reasonable amount of time trying to do work.

I had tried Nemotron 3 Nano with OpenCode and while it kinda worked its tool use was seriously lacking because it just leans on the shell tool for most things. For example, instead of using a tool to edit a file it would just use the shell tool and run sed on it.

That's the primary issue I've noticed with the agentic open weight models in my limited testing. They just seem hesitant to call tools that they don't recognize unless explicitly instructed to do so.

tarruda · 2026-02-19T19:01:19 1771527679

I did play with Qwen3 Coder Next a bit, but didn't try it in a coding harness. Will give it a shot later.

petethepig · 2026-02-19T18:42:35 1771526555

Is getting something like M3 Ultra with 512GB ram and doing oss models going to be cheaper for the next year or two compared to paying for claude / codex?

Did anyone do this kind of math?

tarruda · 2026-02-19T19:00:19 1771527619

No, it is not cheaper. An M3 ultra with 512GB costs $10k which would give you 50 months of Claude or Codex pro plans.

However, if you check the prices on Chinese models (which are the only ones you would be able to run on a Mac), they are much cheaper than the US plans. It would take you forever to get to the $10k

And of course this is not even considering energy costs or running inference on your own hardware (though Macs should be quite efficient there).

ipython · 2026-02-19T17:17:59 1771521479

Curious on how (if?) changes to the inference engine can fix the issue with infinitely long reasoning loops.

It’s my layman understanding that would have to be fixed in the model weights itself?

tarruda · 2026-02-19T18:57:04 1771527424

There's an AMA happening on reddit and they said it will be fixed in the next release: https://www.reddit.com/r/LocalLLaMA/comments/1r8snay/ama_wit...

sosodev · 2026-02-19T18:30:09 1771525809

I think there are multiple ways these infinite loops can occur. It can be an inference engine bug because the engine doesn't recognize the specific format of tags/tokens the model generates to delineate the different types of tokens (thinking, tool calling, regular text). So the model might generate a "I'm done thinking" indicator but the engine ignores it and just keeps generating more "thinking" tokens.

It can also be a bug in the model weights because the model is just failing to generate the appropriate "I'm done thinking" indicator.

You can see this described in this PR https://github.com/ggml-org/llama.cpp/pull/19635

Apparently Step 3.5 Flash uses an odd format for its tags so llama.cpp just doesn't handle it correctly.

tarruda · 2026-02-19T19:26:10 1771529170

> so llama.cpp just doesn't handle it correctly.

It is a bug in the model weights and reproducible in their official chat UI. More details here: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...

sosodev · 2026-02-19T19:33:15 1771529595

I see. It seems the looping is a bug in the model weights but there are bugs in detecting various outputs as identified in the PR I linked.

terhechte · 2026-02-19T13:59:36 1771509576

Did you try an MLX version of this model? In theory it should run a bit faster. I'm hesitant to download multiple versions though.

tarruda · 2026-02-19T14:26:02 1771511162

Haven't tried. I'm too used to llama.cpp at this point to switch to something else. I like being able to just run a model and automatically get:

- OpenAI completions endpoint

- Anthropic messages endpoint

- OpenAI responses endpoint

- A slick looking web UI

Without having to install anything else.

KerrAvon · 2026-02-19T17:02:54 1771520574

Is there a reliable way to run MLX models? On my M1 Max, LM Studio seems to output garbage through the API server sometimes even when the LM Studio chat with the same model is perfectly fine. llama.cpp variants generally always just work.

lostmsu · 2026-02-19T15:14:03 1771514043

gpt-oss 120b and even 20b works OK with codex.

tarruda · 2026-02-19T16:07:25 1771517245

Both gpt-oss are great models for coding in a single turn, but I feel that they forget context too easily.

For example, when I tried gpt-oss 120b with codex, it very easily forgets something present in the system prompt: "use `rg` command to search and list files".

I feel like gpt-oss has a lot of potential for agentic coding, but it needs to be constantly reminded of what is happening. Maybe a custom harness developed specifically for gpt-oss could make both models viable for long agentic coding sessions.

tarruda · 2026-02-16T13:22:22 1771248142

At this point I wouldn't be surprised if your pelican example has leaked into most training datasets.

I suggest to start using a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D

jon-wood · 2026-02-16T14:08:04 1771250884

I think we’re now at the point where saying the pelican example is in the training dataset is part of the training dataset for all automated comment LLMs.

Mossly · 2026-02-16T21:54:58 1771278898

It's quite amusing to ask LLMs what the pelican example is and watch them hallucinate a plausible sounding answer.

---

Qwen 3.5: "A user asks an LLM a question about a fictional or obscure fact involving a pelican, often phrased confidently to test if the model will invent an answer rather than admitting ignorance." <- How meta

Opus 4.6: "Will a pelican fit inside a Honda Civic?"

GPT 5.2: "Write a limerick (or haiku) about a pelican."

Gemini 3 Pro: "A man and a pelican are flying in a plane. The plane crashes. Who survives?"

Minimax M2.5: "A pelican is 11 inches tall and has a wingspan of 6 feet. What is the area of the pelican in square inches?"

GLM 5: "A pelican has four legs. How many legs does a pelican have?"

Kimi K2.5: "A photograph of a pelican standing on the..."

---

I agree with Qwen, this seems like a very cool benchmark for hallucinations.

ertgbnm · 2026-02-16T14:59:00 1771253940

I'm guessing it has the opposite problem of typical benchmarks since there is no ground truth pelican bike svg to over fit on. Instead the model just has a corpus of shitty pelicans on bikes made by other LLMs that it is mimicking.

So we might have an outer alignment failure.

WarmWash · 2026-02-16T15:43:43 1771256623

Most people seem to have this reflexive belief that "AI training" is "copy+paste data from the internet onto a massive bank of hard drives"

So if there is a single good "pelican on a bike" image on the internet or even just created by the lab and thrown on The Model Hard Drive, the model will make a perfect pelican bike svg.

The reality of course, is that the high water mark has risen as the models improve, and that has naturally lifted the boat of "SVG Generation" along with it.

Wowfunhappy · 2026-02-16T20:28:15 1771273695

How would that work? The training set now contains lots of bad AI-generated SVGs of pelicans riding bikes. If anything, the data is being poisoned.

tarruda · 2026-02-16T13:18:47 1771247927

Would love to see a Qwen 3.5 release in the range of 80-110B which would be perfect for 128GB devices. While Qwen3-Next is 80b, it unfortunately doesn't have a vision encoder.

Tepix · 2026-02-16T16:49:22 1771260562

Have you thought about getting a second 128GB device? Open weights models are rapidly increasing in size, unfortunately.

tarruda · 2026-02-16T19:11:53 1771269113

Considered getting a 512G mac studio, but I don't like Apple devices due to the closed software stack. I would never have gotten this Mac Studio if Strix Halo existed mid 2024.

For now I will just wait for AMD or Intel to release a x86 platform with 256G of unified memory, which would allow me to run larger models and stick to Linux as the inference platform.

3abiton · 2026-02-18T12:18:32 1771417112

Given the shortage of wafers, the wait might be long. I am however working on a bridging solution. Sime already showed Strix Halo clustering, I am working on something similar but with some pp boost.

Unfortunately, AMD dumped a great device with unfinished software stack, and the community is rolling with it, compared to the DGX Spark, which I think is more cluster friendly.

kylehotchkiss · 2026-02-16T22:16:20 1771280180

I aspire to casually ponder whether I need a $9,500 computer to run the latest Qwen model

amelius · 2026-02-16T22:48:34 1771282114

You'll need more since RAM prices are up thanks to AI.

bytesandbits · 2026-02-18T03:12:22 1771384342

maybe a deepseek v4 distill. give it a few days

PlatoIsADisease · 2026-02-16T15:32:30 1771255950

Why 128GB?

At 80B, you could do 2 A6000s.

What device is 128gb?

the_pwner224 · 2026-02-16T15:51:26 1771257086

AMD Strix Halo / Ryzen AI Max+ (in the Asus Flow Z13 13 inch "gaming" tablet as well as the Framework Desktop) has 128 GB of shared APU memory.

scoopdewoop · 2026-02-16T17:51:36 1771264296

Not quite. They have 128GB of ram that can be allocated in the BIOS, up to 96GB to the GPU.

cpburns2009 · 2026-02-16T22:30:33 1771281033

You don't have to statically allocate the VRAM in the BIOS. It can be dynamically allocated. Jeff Geerling found you can reliably use up to 108 GB [1].

[1]: https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...

khimaros · 2026-02-16T18:51:31 1771267891

allocation is irrelevant. as an owner of one of these you can absolutely use the full 128GB (minus OS overhead) for inference workloads

EasyMark · 2026-02-16T19:48:05 1771271285

Care to go into a bit more on machine specs? I am interested in picking up a rig to do some LLM stuff and not sure where to get started. I also just need a new machine, mine is 8y-o (with some gaming gpu upgrades) at this point and It's That Time Again. No biggie tho, just curious what a good modern machine might look like.

breisa · 2026-02-16T20:09:16 1771272556

Those Ryzen AI Max+ 395 systems are all more or less the same. For inference you want the one with 128GB soldered RAM. There are ones from Framework, Gmktec, Minisforum etc. Gmktec used to be the cheapest but with the rising RAM prices its Framework noe i think. You cant really upgrade/configure them. For benchmarks look into r/localllama - there are plenty.

aruggirello · 2026-02-16T22:01:54 1771279314

Minisforum, Gmktec also have Ryzen AI HX 370 mini PCs with 128Gb (2x64Gb) max LPDDR5. It's dirt cheap, you can get one barebone with ~€750 on Amazon (the 395 similarly retails for ~€1k)... It should be fully supported in Ubuntu 25.04 or 25.10 with ROCm for iGPU inference (NPU isn't available ATM AFAIK), which is what I'd use it for. But I just don't know how the HX 370 compares to eg. the 395, iGPU-wise. I was thinking of getting one to run Lemonade, Qwen3-coder-next FP8, BTW... but I don't know how much RAM should I equip it with - shouldn't 96Gb be enough? Suggestions welcome!

cpburns2009 · 2026-02-17T14:20:29 1771338029

I benchmarked unsloth/Qwen3-Coder-Next-GGUF using the MXFP4_MOE (43.7 GB) quantization on my Ryzen AI Max+ 395 and I got ~30 tps. According to [1] and [2], the AI Max+ 395 is 2.4x faster than the AI 9 HX 370 (laptop edition). Taking all that into account, the AI 9 HX 370 should get ~13 tps on this model. Make of that what you will.

[1]: https://community.frame.work/t/ai-9-hx-370-vs-ai-max-395/736...

[2]: https://community.frame.work/t/tracking-will-the-ai-max-395-...

aruggirello · 2026-02-18T14:12:44 1771423964

Thanks! I'm... unimpressed.

Tepix · 2026-02-17T04:19:26 1771301966

The Ryzen 370 lacks the quad channel RAM. Stay away.

paulsmal · 2026-02-17T00:01:09 1771286469

Ryzen AI HX 370 is not what you want, you need strix halo APU with unified memory

hedgehog · 2026-02-16T17:35:39 1771263339

Keep in mind most of the Strix Halo machines are limited to 10Gbe networking at best.

paulsmal · 2026-02-17T00:04:28 1771286668

you can use separate network adapter with RoCEv2/RDMA support like Intel E810

hedgehog · 2026-02-17T04:46:02 1771303562

Most Ryzen 395 machines don't have a PCI-e slot for that so you're looking at an extension from an m.2 slot or Thunderbolt (not sure how well that will work, possibly ok at 10Gb). Minisforum has a couple newly announced products, and I think the Framework desktop's motherboard can do it if you put it in a different case, that's about it. Hopefully the next generation has Gen5 PCIe and a few more lanes.

tgtweak · 2026-02-16T21:56:47 1771279007

Spark DGX and any A10 devices, strix halo with max memory config, several mac mini/mac studio configs, HP ZBook Ultra G1a, most servers

If you're targeting end user devices then a more reasonable target is 20GB VRAM since there are quite a lot of gpu/ram/APU combinations in that range. (orders of magnitude more than 128GB).

kristianp · 2026-02-17T03:27:37 1771298857

By A6000, do you mean the older Ampere generation model? 48 GB ddr6, released 2020 [1]. Can you even buy those new still?

[1] https://www.techpowerup.com/gpu-specs/rtx-a6000.c3686

lm28469 · 2026-02-16T16:29:02 1771259342

That's the maximum you can get for $3k-$4k with ryzen max+ 395 and apple studio Ms. They're cheaper than dedicated GPUs by far.

tarruda · 2026-02-16T16:35:20 1771259720

Mac Studios or Strix Halo. GPT-OSS 120b, Qwen3-Next, Step 3.5-Flash all work great on a M1 Ultra.

sowbug · 2026-02-16T18:47:36 1771267656

All the GB10-based devices -- DGX Spark, Dell Pro Max, etc.

vladovskiy · 2026-02-16T15:46:54 1771256814

Guess, it is mac m series

tarruda · 2026-01-30T16:34:00 1769790840

Love the idea of keeping the agent filesystem in a single file!

tarruda · 2026-01-29T11:41:14 1769686874

These days I don't feel the need to use anything other than llama.cpp server as it has a pretty good web UI and router mode for switching models.

roger_ · 2026-01-29T13:55:30 1769694930

MLX support on Macs was the main reason for me.

embedding-shape · 2026-01-29T14:33:54 1769697234

I mostly use LM Studio for browsing and downloading models, testing them out quickly, but then actually integrating them is always with either llama.cpp or vLLM. Curious to try out their new cli though and see if it adds any extra benefits on top of llama.cpp.

mycall · 2026-01-29T14:35:39 1769697339

Concurrency is an important use case when running multiple agents. vLLM can squeeze performance out of your GB10 or GPU that you wouldn't get otherwise.

embedding-shape · 2026-01-29T14:39:49 1769697589

Also they've just spent more time optimizing vLLM than llama.cpp people done, even when you run just one inference call at a time. Best feature is obviously the concurrency and shared cache though. But on the other hand, new architectures are usually sooner available in llama.cpp than vLLM.

Both have their places and are complementary, rather than competitors :)

tarruda · 2026-01-29T16:32:25 1769704345

I'm only interested in the local, single user use case. Plus I use a Mac studio for inference, so vLLM is not an option for me.

mycall · 2026-01-30T02:06:38 1769738798

You can get concurrency gains [0] as local/single user (multi-agent) use case with vLLM with your Mac Studio.

[0] https://youtu.be/Ze5XLooTt6g?t=658

tarruda · 2025-12-25T23:56:07 1766706967

AFAIK MPS cannot be used on Asahi, so it has to be done using Vulkan which will definitely be much slower.

tarruda · 2025-12-08T10:05:18 1765188318

A VM is displayed as a window on the host OS and Emacs is the window manager within that VM window. What's the difference from running emacs directly as an application on the host?

tarruda · 2025-12-06T22:12:54 1765059174

> It's fast (~3 seconds on my RTX 4090)

It is amazing how far behind Apple Silicon is when it comes to use non- language models.

Using the reference code from Z-image on my M1 ultra, it takes 8 seconds per step. Over a minute for the default of 9 steps.

p-e-w · 2025-12-06T23:57:34 1765065454

The diffusion process is usually compute-bound, while transformer inference is memory-bound.

Apple Silicon is comparable in memory bandwidth to mid-range GPUs, but it’s light years behind on compute.

tarruda · 2025-12-07T00:47:39 1765068459

> but it’s light years behind on compute.

Is that the only factor though? I wonder if pytorch is lacking optimization for the MPS backend.

rfoo · 2025-12-07T11:20:45 1765106445

This is the only factor. People sometimes perceive Apple's NPU as "fast" and "amazing" which is simply false.

It's just that NVIDIA GPU sucks (relatively) at *single-user* LLM inference and it makes people feel like Apple not so bad.

tails4e · 2025-12-07T10:09:38 1765102178

I heard last year the potential future of gaming is not rendering but fully AI generated frames. 3 seconds per 'frame' now, it's not hard to believe it could do 60fps in a few short years. It makes it seem more likely such a game could exist. I'm not sure I like the idea, but it seems like it could happen

snek_case · 2025-12-07T11:53:45 1765108425

The problem is going to be how to control those models to produce a universe that's temporally and spatially consistent. Also think of other issues such as networked games, how would you even begin to approach that in this new paradigm? You need multiple models to have a shared representation that includes other players. You need to be able to sync data efficiently across the network.

I get that it's tempting to say "we no longer have to program game engines, hurray", but at the same time, we've already done the work, we already have game engines that are relatively very computationally efficient and predictable. We understand graphics and simulation quite well.

Personally: I think there's an obvious future in using AI tools to generate game content. 3D modelling and animation can be very time consuming. If you could get an AI model to generate animated characters, you could save a lot of time. You could also empower a lot of indie devs who don't have 3D modelers to help them. AI tools to generate large maps, also super valuable. Replacing the game engine itself, I think it's a taller order than people realize, and maybe not actually desirable.

adventured · 2025-12-07T12:35:34 1765110934

20 years out, what will everybody be using routine 10gbps pipes in our homes for?

I'm paying $43 / month for 500mbps at present and there's nothing special about that at all (in the US or globally). What might we finally use 1gbps+ for? Pulling down massive AI-built worlds of entertainment. Movies & TV streaming sure isn't going to challenge our future bandwidth capabilities.

The worlds are built and shared so quickly in the background that with some slight limitations you never notice the world building going on behind the scenes.

The world building doesn't happen locally. Multiple players connect to the same built world that is remote. There will be smaller hobbyist segments that will still world-build locally for numerous reasons (privacy for one).

The worlds can be constructed entirely before they're downloaded. There are good arguments for both approaches (build the entire world then allow it to be accessed, or attempt to world-build as you play). Both will likely be used over the coming decades, for different reasons and at different times (changes in capabilities will unlock new arguments for either as time goes on, with a likely back and forth where one pulls ahead then the other pulls ahead).

SR2Z · 2025-12-11T15:25:27 1765466727

> The problem is going to be how to control those models to produce a universe that's temporally and spatially consistent.

Why not just have a simple, low-poly rasterizer and have AI fill in the details?

That's essentially the way that AMD FX and NVIDIA DLSS work today, although they do take fully rendered frames as input.

wcoenen · 2025-12-07T11:02:32 1765105352

Increasing the framerate by rendering at a lower resolution + upscaling, or outright generation of extra frames has already been a thing for a few years now. NVidia calls it Deep Learning Super Sampling (DLSS)[1]. AMD's equivalent is called FSR[2].

[1] https://en.wikipedia.org/wiki/Deep_Learning_Super_Sampling

[2] https://en.wikipedia.org/wiki/GPUOpen#FidelityFX_Super_Resol...

liuliu · 2025-12-07T18:38:32 1765132712

Not saying M1 Ultra is great. But you should only get ~8x slow down with proper implementation (such as Draw Things upcoming implementation for Z Image). It should be 2~3 sec per step. On M5 iPad, it is ~6s per step.