Here is the commit that added ROCm support to llama.cpp back in August:

https://github.com/ggerganov/llama.cpp/commit/6bbc598a632560...



Yep, and it deserves the credit! He who writes the CUDA kernel (or translates it) controls the spice.

I had wrapped this and had it working in Ollama months ago as well: https://github.com/ollama/ollama/pull/814. I don't use Ollama anymore, but I really like the way they handle device memory allocation dynamically; I think they were the first to do this well.


I'm curious about both:

- what's special about the memory allocation, and how might it help me?

- what are you now using instead of Ollama?


Ollama does a nice job of looking at how much VRAM the card has and tuning the number of GPU layers offloaded accordingly. Before that, I mainly just had to guess. It's still a heuristic, but I thought that was neat.
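
The gist is: query free VRAM, divide by an estimated per-layer footprint, keep some margin for the KV cache. Here's a minimal C++ sketch of that idea using the CUDA runtime; the helper name, the 20% reserve, and the per-layer sizes are my own placeholders, not Ollama's actual logic (it derives layer sizes from the model metadata):

  #include <cuda_runtime.h>
  #include <algorithm>
  #include <cstdint>
  #include <cstdio>

  // Estimate how many transformer layers fit in free VRAM.
  // bytes_per_layer is a model-specific estimate; the values
  // used below are rough guesses, not Ollama's numbers.
  int estimate_gpu_layers(int64_t bytes_per_layer, int total_layers) {
      size_t free_bytes = 0, total_bytes = 0;
      if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
          return 0;  // no usable device: keep everything on CPU
      }
      // Hold back some VRAM for the KV cache and scratch buffers
      // (the 20% margin here is an assumption).
      int64_t budget = (int64_t)(free_bytes * 0.8);
      int layers = (int)(budget / bytes_per_layer);
      return std::clamp(layers, 0, total_layers);
  }

  int main() {
      // e.g. a 7B Q4 model: ~120 MiB per layer, 32 layers (rough guesses)
      int n_gpu_layers = estimate_gpu_layers(120ll * 1024 * 1024, 32);
      printf("offloading %d layers\n", n_gpu_layers);
  }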

I'm mostly just using llama.cpp as a native library now, for direct access to more of llama's data structures, and because I have a somewhat unique sampler setup.
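
For the curious, the custom-sampler pattern looks roughly like this against llama.cpp's C API: grab the raw logits yourself instead of using the built-in samplers. This is a sketch assuming the circa-2023 API (llama_get_logits, llama_n_vocab); exact signatures have shifted across versions, and the plain temperature sampler here is just a stand-in, not the commenter's actual setup:

  #include "llama.h"
  #include <algorithm>
  #include <cmath>
  #include <random>
  #include <vector>

  // Pick the next token from raw logits - the kind of custom
  // sampling you get by linking llama.cpp directly. Simple
  // temperature sampling shown here as a placeholder.
  llama_token sample_custom(llama_context * ctx, const llama_model * model,
                            float temperature, std::mt19937 & rng) {
      const float * logits = llama_get_logits(ctx);  // last decoded position
      const int n_vocab = llama_n_vocab(model);

      // temperature-scaled softmax weights (discrete_distribution
      // normalizes them, so no explicit division by the sum)
      float max_logit = logits[0];
      for (int i = 1; i < n_vocab; i++) {
          max_logit = std::max(max_logit, logits[i]);
      }
      std::vector<float> weights(n_vocab);
      for (int i = 0; i < n_vocab; i++) {
          weights[i] = std::exp((logits[i] - max_logit) / temperature);
      }
      std::discrete_distribution<int> dist(weights.begin(), weights.end());
      return (llama_token) dist(rng);
  }

Anything beyond this (repetition penalties, grammar constraints, custom truncation) is the same pattern: mutate or mask the logits array before drawing.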


Oh right... I've just been guessing, trying to find the largest value that doesn't trigger CUDA OOM errors.



