
ggml and llama.cpp are such a good platform for local LLMs, and having some financial backing to support development is brilliant. We should be concentrating as much as possible on doing local inference (and training) based on private data.

I want a local ChatGPT fine-tuned on my personal data running on my own device, not in the cloud. Ideally open source too; llama.cpp is looking like the best bet to achieve that!



Maybe I'm wrong, but I don't think you want it fine-tuned on your data.

Pretty sure you might be looking for this: https://github.com/SamurAIGPT/privateGPT

Fine-tuning is good for teaching it how to act, but not great for reciting/recalling data.


I think people want both. They want fine-tuning for their style of communication and interaction. They want better ranking and retrieval for rote information.

In other words, it’s like having a spouse/partner. There are certain ways we communicate where we simply know where the other person is at, or what they actually mean.


Unless you want machine-readable responses, or have some other very specific need, the benefits of a fine-tuned model aren't really going to be that much better than a prompt that asks for the style you want along with an example or two. Fine-tuning also raises the barrier to entry quite a bit, since the majority of computers that can run the model aren't capable of fine-tuning it.
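For instance, a throwaway prompt along these lines (purely illustrative) usually gets you most of the way there:

  Respond to everything in the voice of a weary sysadmin. Keep answers short.
  Example: Q: How do I list files? A: ls -la. Same as the last hundred times.
  Q: How do I check disk usage?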

Even if you're using OpenAI's models, gpt-3.5-turbo is going to be much better (cheaper, bigger context window, higher quality) than any of their models that can be fine-tuned.

But if you're able to fine-tune a local model, then a combination of fine-tuning and embedding is probably going to give you better results than embedding alone.


> Fine-tuning is good for teaching it how to act, but not great for reciting/recalling data.

What underlying process makes it this way? Is it because the prompt has heavier weight?


I think your question is asking about the fundamentals of how an LLM works, which I'm not really qualified to answer. But I do have a general understanding of it all.

Fine-tuning is like having the model take a class on a certain subject. By the end of the class, it's going to have a general understanding of how to do that thing, but it's probably going to struggle when trying to quote the textbook verbatim.

A good use-case for fine-tuning is teaching it a response style or format. If you fine-tune a model to only respond in JSON, then you no longer need to include formatting instructions in your prompt to get a JSON output.
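To illustrate, training pairs for that JSON-only behavior might look something like this (hypothetical examples; the exact JSONL schema depends on the fine-tuning toolchain):

  {"prompt": "Extract name and age: Bob is 24.", "completion": "{\"name\": \"Bob\", \"age\": 24}"}
  {"prompt": "Extract name and age: Sue, 31, lives in Oslo.", "completion": "{\"name\": \"Sue\", \"age\": 31}"}

Given enough pairs like these, the model emits JSON without ever being told to in the prompt.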


I just read the paper about LORA. The main idea is that you write the weights of each neural network as

W = W0 + B A

where W0 is the trained model’s weights, which are kept fixed, and A and B are matrices with a much, much lower rank than the original (say r = 4).

It has been shown (as mentioned in the LoRA paper [1]) that training for specific tasks results in low-rank corrections, which is what this is all about. I think LoRA fine-tuning can be done locally.
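Here's a minimal PyTorch sketch of the idea (my own illustration, not the reference implementation; see [1] for the real thing):

  import torch
  import torch.nn as nn

  class LoRALinear(nn.Module):
      def __init__(self, d_in, d_out, r=4, alpha=1.0):
          super().__init__()
          self.w0 = nn.Linear(d_in, d_out, bias=False)
          self.w0.weight.requires_grad = False  # W0 stays frozen
          self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r x d_in
          self.B = nn.Parameter(torch.zeros(d_out, r))  # d_out x r, so B A starts at zero
          self.scale = alpha / r

      def forward(self, x):
          # W x = W0 x + (alpha/r) * B A x
          return self.w0(x) + self.scale * (x @ self.A.T @ self.B.T)

  layer = LoRALinear(768, 768)
  print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 6144, vs 589824 frozen

With r = 4 you train a few thousand parameters per layer instead of hundreds of thousands, which is why doing it locally is plausible.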

[1] https://github.com/microsoft/LoRA


How does this work?


The parent is saying that "fine tuning", which has a specific meaning related to actually retraining the model itself (or layers at its surface) on a specialized set of data, is not what the GP is actually looking for.

An alternative method is to index content in a database and then insert contextual hints into the LLM's prompt that give it extra information and detail with which to answer on the fly.

That database can use semantic similarity (ie via a vector database), keyword search, or other ranking methods to decide what context to inject into the prompt.

PrivateGPT is doing this method: reading files, extracting their content, splitting the documents into small-enough-to-fit-into-prompt bits, and then indexing them into a database. Then, at query time, it inserts the retrieved context into the LLM prompt.

The repo uses LangChain as boilerplate, but it's pretty easy to do manually or with other frameworks.
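Manually, it boils down to something like this sketch (assumes sentence-transformers is installed; the embedding model and prompt template are just placeholders):

  import numpy as np
  from sentence_transformers import SentenceTransformer

  docs = ["chunk one of some file...", "chunk two...", "chunk three..."]  # pre-split document bits
  embedder = SentenceTransformer("all-MiniLM-L6-v2")
  doc_vecs = embedder.encode(docs, normalize_embeddings=True)  # index step, done once

  def build_prompt(question, k=3):
      q = embedder.encode([question], normalize_embeddings=True)[0]
      top = np.argsort(doc_vecs @ q)[::-1][:k]  # cosine similarity, top-k chunks
      context = "\n\n".join(docs[i] for i in top)
      return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"

  # hand build_prompt(...) to llama.cpp, GPT4All, or any other local model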

(PS: if anyone wants this type of local LLM + document Q/A and agents, it's something I'm working on as a supported product integrated into macOS, using ggml; see profile)


deet already gave a comprehensive answer, but I'll add that the guts of privateGPT are pretty readable and only ~200 lines of code.

Core pieces: GPT4All (LLM interface/bindings), Chroma (vector store), HuggingFaceEmbeddings (for embeddings), and LangChain to tie everything together.

https://github.com/SamurAIGPT/privateGPT/blob/main/server/pr...


> ggml and llama.cpp are such a good platform for local LLMs, having some financial backing to support development is brilliant

The problem is, this financial backing and support is via VCs, who will steer the project to close it all up again.

> I want a local ChatGPT fine tuned on my personal data running on my own device, not in the cloud. Ideally open source too, llama.cpp is looking like the best bet to achieve that!

I think you are setting yourself up for disappointment in the future.


> The problem is, this financial backing and support is via VCs, who will steer the project to close it all up again.

A matter of when, not if. I mean, the website itself makes that much clear:

  The ggml way
  
    ...
  
    Open Core

    The library and related projects are freely available under the MIT license... In the future we may choose to develop extensions that are licensed for commercial use
  
    Explore and have fun!

    ... Contributors are encouraged to try crazy ideas, build wild demos, and push the edge of what's possible

So, like many other "open core" devtools out there, they'd like to have their cake and eat it too. And they might just as well, like others before them.

Won't blame anyone here though, because clearly, if you're as good as Georgi Gerganov, why do it for free?


Sounds like the SQLite model, which has been a net positive for the computing world.


> The problem is, this financial backing and support is via VCs, who will steer the project to close it all up again.

How exactly could they meaningfully do that? Genuine question. The issue with the OpenAI business model is that the collaboration within academia and open source circles is creating innovations that are on track to out-pace the closed source approach. Does OpenAI have the pockets to buy the open source collaborators and researchers?

I'm truly cynical about many aspects of the tech industry but this is one of those fights that open source could win for the betterment of everybody.


I've been going on and on about this on HN: open source can win this fight, but I think OSS is overconfident. We need to be clear that there are serious challenges ahead. ClosedAI and other corporations also have a plan, a plan that has good chances unless properly countered:

A) Embed the OpenAI (etc.) API everywhere. Make embedding easy and trivial. First, to gain a small API/install moat (user/dev: 'why install an OSS model when OpenAI is already available with an OS API?'). If it's easy to use OpenAI but not open source, they have an advantage. Second, to gain brand. But more importantly:

B) Gain a technical moat by having a permanent data advantage using the existing install base (see above). Retune constantly to keep it.

C) Combine with existing proprietary data stores to increase the local data advantage (e.g. easy access to all your Office 365/GSuite documents, while OSS gets the scary permission prompts).

D) Combine with existing proprietary moats to mutually reinforce.

E) Use selective copyright enforcement to increase data advantage.

F) Lobby legislators for limits that make competition (open or closed source) way harder.

TL;DR: OSS is probably catching up on algorithms. When it comes to good data and good integrations OSS is far behind and not yet catching up. It's been argued that OpenAI's entire performance advantage is due to having better data alone, and they intend to keep that advantage.


Don’t forget chip shortages. That’s all centralized up through Nvidia, TSMC, and ASML.


I agree with the spirit, but saying that open source is on track to outpace OpenAI in innovation is just not true. Open-source models are being compared to GPT-3.5; none yet even come close to GPT-4 quality, and that finished training last year.


We're basically surviving off the scraps companies like Facebook have been tossing off the table, like LLaMA. The fact that we're even allowed and able to use these things ourselves, at all, is a tremendous victory.


I agree


> I think you are setting yourself up for disappointment in the future.

Why would you say that?


Never expect such promises to go your way, especially when VCs, angels, etc. are able to control the project with their opaque term sheets, which is why I am skeptical of this. Accepting VC or angel investment cash is no different from having another boss.

I expect high hopes like that to end in disappointment for the 'community', since the VCs' interest will now be in heading for the exit. Their actions will speak louder than what they are saying on the website.


Can LLaMA be used for commercial purposes, though (which might limit external contributors)? I believe FOSS alternatives like Databricks Dolly / Together RedPajama / EleutherAI GPT-NeoX (et al.) are where the most progress is likely to be.


May also be worth mentioning - UAE's Falcon, which apparently performs well (leads?). Falcon recently had its royalty-based commercial license modified to be fully open for free private and commercial use, via Apache 2.0: https://falconllm.tii.ae/


Hugging Face has a demo of the 40B Falcon instruct model: https://huggingface.co/blog/falcon#demo

It’s pretty good as models of that size go, although it doesn’t take a lot of playing around with it to find that there’s still a good distance between it and ChatGPT 3.5.

(I do recommend editing the instructions before playing with it though; telling a model this size that it “always tells the truth” just seems to make it overconfident and stubborn)


Although llama.cpp started with the LLaMA model, it now supports many others.


OpenLLaMA will be released soon, and it's 100% compatible with the original LLaMA.

https://github.com/openlm-research/open_llama


This is a very good question; it will be interesting to see how this develops. Thanks for posting the alternatives list.


Why is commercial necessary to run local models?


It isn't, but such models may eventually lag behind the FOSS ones.


I've been trying to figure out what I might need to do in order to turn my Obsidian vault into a dataset to fine-tune against. I'd invest a lot more into it now if I thought it would be a key to an AI learning about me the way it does in the movie Her.
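The first step I've sketched out so far is roughly this (the path and the naive blank-line chunking are placeholders; a real pipeline would split by tokens or headings):

  import json
  from pathlib import Path

  vault = Path.home() / "Obsidian" / "MyVault"  # hypothetical vault location
  with open("vault_dataset.jsonl", "w") as out:
      for note in vault.rglob("*.md"):
          text = note.read_text(encoding="utf-8")
          for chunk in filter(None, (c.strip() for c in text.split("\n\n"))):
              out.write(json.dumps({"source": note.name, "text": chunk}) + "\n")

Whether that JSONL then feeds a fine-tune or an embeddings index is the open question.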



I've been working on this for a while now and I'd love to chat. I'll send you an email.


I'm interested in this as well and have been exploring similarly. Would be super interesting to chat if you're up for it. Sending you an email to say hello.


If MeZO gets implemented, we are basically there: https://github.com/princeton-nlp/MeZO


Basically there, with what kind of VRAM and processing requirements? I doubt anyone running on a CPU can fine-tune in a time frame that doesn't give them an obsolete model when they're done.


According to the paper it fine-tunes at the speed of inference (!!)

This would make fine-tuning a quantized 13B model achievable in ~0.3 seconds per training example on a CPU.


It's the same memory footprint as inference. It's not that fast, and the paper mentions some optimizations that could still be done.


Yes you are right.

I completely misread that!


MeZO assumes a smooth parameter space, so you probably won't be able to do it with INT4/8 quantization, probably needs fp8 or smoother.


I cannot find any such numbers in the paper. What the paper says is that MeZO converges much slower than SGD, and each step needs two forward passes.

"As a limitation, MeZO takes many steps in order to achieve strong performance."


Wow, if that's true then it's genuinely a complete game changer for LLMs as a whole. You probably mean more like 0.3s per token, not per example, but that's still more like one or two minutes per training case, not a day for four cases like it is now.


If you go through the drudgery of integrating with all the existing channels (mail, Teams, Discord, Slack, traditional social media, texts, ...), such rapid fine-tuning speeds could enable an always up-to-date personality construct, modeled on you.

Which is my personal holy grail towards making myself unnecessary; it'd be amazing to be doing some light gardening while the bot handles my coworkers ;)


> while the bot handles my coworkers

Or it handles their bots ;)


I think more importantly: what would the fine-tuning routine look like? It's a non-trivial task to dump all of your personal data into any LLM architecture.


I wonder if ClosedAI and other companies use the findings of the open-source community in their products. For example, do they use QLoRA to reduce the costs of training and inference? Do they quantize their models to serve non-subscribing consumers?


Quantization is hardly a "finding of the open source community". (IIRC the first TPU was int8! Though the tradition is much older than that.)


Not disagreeing with your points, but saying "ClosedAI" is about as clever as writing M$ for Microsoft back in the day, which is to say not very.


M$ is a silly way to call Microsoft greedy. ClosedAI is somewhat better because OpenAI's very name is a bald-faced lie, and they should be called on it. Are there more elegant ways to do that? Sure, but every time I see Altman in the news crying crocodile tears about the "dangers" of open anything I think we need all the forms of opposition we can find.


It is a colloquial spelling and they earned it, a long time ago.


I'd say writing M$ makes it harder for M$ to find out I'm talking about them on the indexed web because it's more ambiguous; that's all I need to know.


If we are talking about indexing, writing M$ is easier to find in an index because it is such a unique token. MS can mean many things (e.g. Miss); M$ is less ambiguous.


I think it’s ironic that M$ made ClosedAI.


Pedantic but that's not irony


Why do you think so? According to the dictionary, ironic could be something paradoxical or weird.


Well it's not paradoxical?

If one is the kind of person who writes M$ then it's pretty much expected behaviour.


Yeah, I think it feigns meaningful criticism. The "Sleepy Joe"-tier insults are ad-hominem enough that I don't try to respond.



