Gemma 4 31b was working ok for me; but it was consuming tons of memory on SWA ch...

xrd · 2026-04-30T16:03:09 1777564989

What are you using to run it vllm, llama.cpp or other?

Can you share your switches and approach for using tools?

lambda · 2026-04-30T18:57:45 1777575465

llama.cpp

My setup is a bit of a mess as I experiment with different ways of configuring and hosting local models. So at some point I was experimenting with the router server but stopped doing that, but some of my settings are still in models.ini while some are on the command line.

podman run --env "HF_TOKEN=$HF_TOKEN" --env "LLAMA_SERVER_SLOTS_DEBUG=1" -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable --rm -it -v ~/.cache/huggingface/:/root/.cache/huggingface/ -v ./unsloth:/app/unsloth -v ./models.ini:/app/models.ini llama.cpp-rocm7.2 -hf unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL --chat-template-file /root/.cache/huggingface/gemma-4-31B-it-chat_template.jinja -ctxcp 8 --port 8080 --host 0.0.0.0 -dio --models-preset models.ini

With the following as the relevant settings in models.ini (I actually have no idea if these settings are applied when not using the router server, it's been hard for me to figure out what settings are actually applied when using bot the command line and models.ini

  [*]
  jinja = true
  seed = 3407
  flash-attn = on

  [unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL]
  temperature = 1.0
  top_p = 0.95
  top_k = 64

And it looks like the chat_template.jinja I have is actually out of date by now, there was a new one pushed just a couple of days ago that seems to have some further tool calling fixes: https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_...

As my harness, I'm using pi, with a pretty vanilla config.

Anyhow, Gemms 4 31b worked in this config, but it was slow and RAM hungry. Since then, I've mostly moved to Qwen 3.6 35b-a3b because it's a lot faster.

I'm not actually doing anything useful with these yet, but I've used them for some experiments and Qwen 3.6 35b-a3b was capable of doing some pretty long mostly unsupervised agentic loops in my experimentation.