I'm running qwen3-coder:30b-a3b-q8_0 @ 32k context. It comes out to 36 GB, and I'm splitting it between a 3090 (24 GB) and a 4060 Ti (16 GB) — ollama put 20 GB on the 3090 and 13.5 GB on the 4060 Ti. Runs great, tbh. Ollama runs on an Ubuntu server and I run Claude Code from my Windows desktop PC.
The general rule of thumb is that you need at least as much VRAM as the model file itself. 30B models at common quantization levels are usually around 19 GB, so most likely you want a GPU with 24 GB of VRAM.
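A rough sketch of where that 19 GB figure comes from, assuming roughly 5 effective bits per weight for a q4_K_M-style quant (an approximation, not an exact GGUF accounting — real files also carry embeddings and metadata):

```python
def weight_size_gb(params_billions: float, bits_per_weight: float = 5.0) -> float:
    """Approximate in-VRAM size of quantized weights in GB.

    bits_per_weight ~5.0 is a rough effective figure for q4_K_M-style
    quants; q8_0 is closer to 8.5.
    """
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"30B @ ~5 bpw: {weight_size_gb(30):.1f} GB")  # about 18.8 GB
```

That matches the ~19 GB you see for 30B q4 downloads, and also why the q8_0 variant mentioned above lands in the mid-30s.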
You can absolutely fit the weights plus a tiny context window into 24 GB, but you can't fit a context of any reasonable size. Or maybe Ollama's implementation is broken — when I last tried it, the context had to be restricted beyond usability to keep it from freezing up the entire machine.
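The part that eats into the remaining headroom is the KV cache, which grows linearly with context length. A minimal sketch of the standard fp16 KV-cache formula, using illustrative config values (layer count, GQA KV heads, head dim are hypothetical stand-ins for a 30B-class model, not pulled from any specific config file):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: 2 tensors (K and V) per layer, fp16 by default."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

# Hypothetical 30B-class GQA config: 48 layers, 4 KV heads, head_dim 128
print(f"32k context: {kv_cache_gb(48, 4, 128, 32768):.1f} GB")
```

So on top of ~19 GB of weights, a long context plus compute buffers can push past 24 GB, which is consistent with the freeze-ups described above.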
I'd like to know this, too. I'm just getting my feet wet with ollama and local models on CPU only, and it's obviously terribly slow (even with 24 cores and 128 GB of DRAM). It's hard to gauge how much GPU money I'd need to plonk down to get acceptable performance for coding workflows.
I tried to build a similar local stack recently to save on API costs. In practice I found the savings are a bit of a mirage for coding workflows: the local models hallucinate just enough that you end up losing more time debugging than you would have paid for Sonnet or Opus to get it right the first time.