Mind sharing how you got it to work in your setup?

adeon · on April 5, 2023

I work on an independent LLM implementation here: https://github.com/Noeda/rllama/

I only got it working at all yesterday and there's no nice UX at all. Not sure I recommend trying to use this as llama.cpp will probably have this in no time with a much better user experience, although I am also trying to make it more usable.

If you follow the instructions on Vicuna page over how to apply the deltas, and you can compile the project, then you could run:

cargo run --release --features opencl -- --model-path /models/vicuna13b --param-path /models/vicuna13b/config.json --tokenizer-path /models/vicuna13b/tokenizer.model --prompt-file prompt --top-p 1.0 --top-k 20 --repetition-penalty 1 --temperature 0.9 --max-seq-len 2048 --f16 --percentage-to-gpu 0.9

Where /models/vicuna13b is the HuggingFace-compatible model. This will put 90% of weights on GPU and remaining 10% non CPU which is just barely enough to not run out of GPU memory (on a 24 gig card)

Create a text file 'prompt' with the prompt. I've been using this template:

You are a helpful and precise assistant for checking the quality of the answer.###Human: Can you explain nuclear power to me?###Assistant:

(the model seems to use ### as delimiters to distinguish Human and Assistant). The "system prompt" is whatever text is written at the beginning.

VadimPR · on April 8, 2023

The feature to load a % to the GPU is novel and amazing! I couldn't get the project up and running myself (requires a nightly rust build) but I love this particular innovation.