> Any idiot can now prompt their way to the same software.
Not only would it be good if it were true, but it is also not true. Good programmers know how to build things: for the most part they know what to build, and they have a general architectural idea of what they are going to build. Without that, you are like the average person in the 90s with Corel Draw in their hands, or the average person with an image diffusion model today: the output will be terrible because of a lack of taste and ideas.
As always happens, people realize there is something new in Redis only after two years or more. With Streams it tragically took like four years, and then everybody started to use them for this use case, with a sharp acceleration in the last few years. I believe this is what is happening with vector sets as well. For a reduced-size problem like that you just git clone Redis, add the vectors into a key with VADD, and query with VSIM. It's a 10-line Python script that will deliver 20k/50k queries per second, more or less, out of the box with zero optimizations.
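To make the shape of that script concrete: VSIM is approximate nearest-neighbor search by similarity. A brute-force, pure-Python stand-in for the same query (no Redis involved, toy 3-dimensional vectors, hypothetical item names) would look like this; the real script would just replace the dict with VADD calls and the function with a VSIM call:

```python
import math

# Toy "vector set": in the real script these entries would be added with VADD.
vectors = {
    "item:1": [0.1, 0.2, 0.7],
    "item:2": [0.9, 0.1, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def vsim(query, count=1):
    # Exact version of the approximate search VSIM performs with its HNSW index.
    ranked = sorted(vectors, key=lambda k: cosine(vectors[k], query), reverse=True)
    return ranked[:count]
```

The HNSW index is what replaces this full scan and makes the 20k/50k queries per second plausible at non-toy sizes.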
But here the problem is the scale: billions of vectors. And I wonder if Redis should add on-disk vector sets, which I started to sketch months ago and never implemented. So my question is: is the "3B" in Vicky's blog post theoretical, or a practical need many folks have? I'm asking because at such a scale, the main problem is to generate the embeddings for your items, whatever they are.
EDIT: I wonder if it is possible to use in-memory vector sets to index discrete on-disk dense blobs of nearby vectors, and query with an approach like the one described in the post. It's like an H-HNSW, and indeed resembles certain on-disk approaches to vector similarity.
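A minimal sketch of that hybrid idea, with entirely hypothetical names: a small in-memory index keeps one representative centroid per on-disk blob, routes each query to the closest blob, and only that blob is scanned (plain dicts stand in for the mmapped blobs here):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# In-memory routing layer: one centroid per blob (the "upper" level).
centroids = {"blob0": [1.0, 0.0], "blob1": [0.0, 1.0]}

# Stand-ins for the on-disk dense blobs of nearby vectors.
blobs = {
    "blob0": {"a": [0.9, 0.1], "b": [0.8, 0.3]},
    "blob1": {"c": [0.1, 0.9]},
}

def query(vec, k=1):
    # 1. Route with the small in-memory index.
    blob = max(centroids, key=lambda name: cosine(centroids[name], vec))
    # 2. Scan only the selected blob, never the whole dataset.
    items = blobs[blob]
    return sorted(items, key=lambda e: cosine(items[e], vec), reverse=True)[:k]
```

In a real version the in-memory level would itself be an HNSW over the centroids (hence "H-HNSW"); this routing step is essentially what IVF-style on-disk indexes do.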
> at such a scale, the main problem is to generate the embeddings for your items
Generation is often decoupled from querying, though. Consider LLMs, where training is a very expensive, slow, hardware intensive process, whereas inference is much faster and much less intensive.
But the performance of inference is in many ways more important than the performance of training, because inference is what users interact with directly.
I believe that Pilgrim here does not understand very well how copyright works:
> Their claim that it is a "complete rewrite" is irrelevant, since they had ample exposure to the originally licensed code
This is simply not true. The reason the "clean room" concept exists is precisely that the law recognizes that independent implementations ARE possible. The "clean room" approach is a trick to make litigation simpler; it is NOT required that you were never exposed to the original code. For instance, Linux was implemented even though Linus and other devs were well aware of Unix internals. What the law really asks is this: does the new code copy something that was in the original one? The clean room trick makes it simple to say it is not possible, and that if there are similar things it is just by accident. But it is NOT a requirement.
Regardless of the legal interpretations, I think it's very worrying if an automated AI rewrite of GPLed code (or any code for that matter) could somehow be used to circumvent the original license. That kinda takes out the one stick the open source community has to force soulless multinationals to contribute back to the open source projects they use.
I’m genuinely surprised to see this not discussed more by the FOSS community. There are so many ways to blow past the GPL now:
1. File by file rewrite by AI (“change functions and vars a bit”)
2. One LLM writes a different-language (or pseudocode) version of each function, which a different LLM translates back into code and tests for input/output parity
The real danger is that this becomes increasingly undetectable in closed source code and can continue to sync with progress in the GPLed repo.
I don’t think any current license has a plausible defense against this sort of attack.
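The "input/output parity" step in scheme (2) is the easy-to-automate part. A hypothetical harness, with trivial stand-ins for the original and regenerated functions:

```python
import random

def original_impl(xs):
    # Stand-in for the function from the GPLed codebase.
    return sorted(xs)

def regenerated_impl(xs):
    # Stand-in for the code the second LLM produced from the intermediate form.
    out = list(xs)
    out.sort()
    return out

def parity(trials=100):
    # Drive both implementations with the same random inputs and
    # require identical outputs on every trial.
    rng = random.Random(0)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        if original_impl(xs) != regenerated_impl(xs):
            return False
    return True
```

Passing such a harness only shows behavioral equivalence, which is exactly why it says nothing about whether the result is an independent work.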
I’ve never delved fully into IP law, but wouldn’t these be considered derivative works? They’re basically just reimplementing exactly the same functionality with slightly different names?
This would be different from the “API reimplementation” (see Google vs Oracle) because in that case, they’re not reusing implementation details, just the external contract.
"Change functions and vars a bit" isn't a rewrite. Anything where the LLM had access to the original code isn't a rewrite. This would just be a derivative work.
However most of the industry willfully violates the GPL without even trying such tricks anyway so there are certainly issues
(1) sounds like a derivative work, but (2) is an interesting AI-simulacrum of a clean room implementation IF the first LLM writes a specification and not a translation.
#1 has always been possible, and I've never heard of a case of anyone actually trying it. #2 is too nitpicky and unnecessarily costly for LLMs. It would be better to just ask it to generate a spec and tests based on the original, then create a separate implementation based on that. A person can do that today, free and clear. If LLMs are able to do this, we will just need to cope. Perhaps the future is in validating software instead of writing it.
It’s less worrying to me given that a year ago this would have been exceptionally harder to do, requiring a lot more time and effort, and would have been more costly. A year from now it will be even easier. All of this means that one aspect of the mission that brought about the need for a license like this is now fundamentally easier, whether or not the license is used. There can be less worry about software being locked up in closed source overall.
If the AI is good enough to truly implement the whole thing to a similar level of reliability without copying it then who cares. At that point you should be able to decompile any program you want and find enough information inside that an AI can go write a similar quality program from the vague information about the call graph. We've transcended copyright in computer code.
If it can't and it costs a bunch of money to clean it up then same as always.
OTOH, if what is actually happening is just that it is rewording the existing code so it looks different, then it is still going to run afoul of copyright. You can't just rewrite Harry Potter with different words.
Note that even in Google v. Oracle it was important that they didn't need the actual code: the headers with the function declarations were enough. Yes, it's true that a clean room isn't required, but when you have an AI and you can show that it can't do it a second time without looking at the source (not just the function declarations), that's pretty strong evidence.
It's worrying, but it's consistent with how copyright law is currently written. Laws haven't caught up with what technology is currently capable of yet. The discussion should be whether, and if so how, our laws should be tweaked to stop this from getting out of hand, IMO.
Take AI out of it: if a person can do it, which they can, the situation hasn’t changed. Further, it was a person who did it, with the assistance of AI. Also, the claim that you “can’t be exposed to the code before writing a compatible alternative” is utterly false. In fact, one could take every single interface definition they have defined to communicate, and use those interfaces directly to write their own, because this (programmatic) interface code is not covered by copyright (with an implicit fair use exemption due to the fact that the software cannot operate without activating said interfaces). The Java lawsuit set that precedent with the JDK. A person could absolutely have rewritten this software using the interfaces and their knowledge, which is perfectly legal as long as they don’t literally copy and re-word code. Now, if it IS simply re-worded copies of the same code and otherwise the entire project structure is basically the same, it’s a different story. That doesn’t sound like what happened.
Finally, how exactly do people think corporations rewrite portions of code that were contributed before re-licensing under a private license? It is ABSOLUTELY possible to rewrite code and relicense it.
Edit: Further, do these people think that if you contribute to a project, that project is beholden to your contribution permanently and it can never be excised? That would blatantly violate the original authors’ right to exercise their own control over the code without those contributions, which is exactly the purpose of a rewrite.
As part of the relicensing ZeroMQ did a few years ago, they sought permission from all previous contributors (yes, it was a multi-year effort). Code contributions that they weren’t able to get permission to relicense resulted in the corresponding lines being removed (or the functionality being rewritten from scratch).
It cuts both ways. You can write a GPL version of a proprietary or permissively licensed program. The only difference is the effort of the rewrite is (theoretically) easier.
(I have my doubts the rewrite is a reasonably defect free replacement)
True, but if that is found to be how it works then an automated AI rewrite of closed-source code is just as unbound by the original license. Which is a much bigger win for the open-source community, since any closed-source software can become the inspiration for an open-source project.
If code becomes essentially free (ignoring for a moment the environmental cost or the long term cost of allowing code generation to be tollboothed by AI megacorps) the value of code must lie in its track record.
The 5-day-old code in chardet has little to no value. The battle-tested years-old code that was casually flushed away to make room for it had value.
Soulless multinationals often want to share costs with other soulless multinationals, just like individuals do. So I think there will always be publicly shared code. The real question is whether this code will be worth much if it can be implemented so quickly by a machine.
That kinda takes out the one stick the open source community has to force soulless multinationals to contribute back to the open source projects they use.
I'll trade that stick for what GenAI can do for me, in a heartbeat.
The question, of course, is how this attitude -- even if perfectly rational at the moment -- will scale into the future. My guess is that pretty much all the original code that will ever need to be written has already been written, and will just need to be refactored, reshaped, and repurposed going forward. A robot's job, in other words. But that could turn out to be a mistaken guess.
I think it's very weird, but valid I guess, to want to be just an atomic individual in a constant LLM feedback loop. But, at the risk of sounding too trite and wholesome here: what about caring for others, for the world at large? If you wanna get your thing to rewrite curl or something, that's again really weird but fine, but just don't share it or try to make money off of it. Isn't that even the rational position here, if you still wanna have good training materials for future models? These need not be conflicting interests! We can all be in this together, even if you wanna totally fork yourself into your own LLM output world.
What happened to sticking up for the underdogs? For the goodness of well-made software in itself, for itself? Isn't that what gave you all the stuff you have now? Don't you feel at least a little grateful, if maybe not obliged? Maybe we can start there?
> If you wanna get your thing to rewrite curl or something, that's again really weird but fine, but just don't share it or try to make money off of it.
The whole point of the GPL is to encourage sharing! Making money off of GPL code is not encouraged by the text of the license, but it is encouraged by the people who wrote the licenses. Saying "don't share it" is antithetical to the goals of the free software movement.
I feel like everyone is getting distracted by protecting copyright, when in fact the point of the GPL is that we should all share and share alike. The GPL is a negotiation tactic, it is not an end unto itself. And curl, I might note, is permissively licensed so there's no need for a clean room reimplementation. If someone's rewriting it I'm very interested to hear why and I hope they share their work. I'm mostly indifferent to how they license it.
> what about caring for others, the world at large
30 years of experience in the tech industry taught me that this will get you nowhere. Nobody will reciprocate generosity or loyalty without an underlying financial incentive.
> What happened to sticking up for the underdogs?
Underdogs get SPACed out and dump the employees that got them there.
Everything I have now arose from processes of continuous improvement, carried out by smart people taking full advantage of the best available tools and technologies including all available means of automation.
Ah well, I tried. To paraphrase Nietzsche, a man can be measured by how well he sleeps at night. I can only hope you stay well rested into this future ;).
Ah, Nietzsche. "They call him Ubermensch, 'cause he's so driven." He told us that man is a thing that will be surpassed, and asked what we've done to surpass him. The last thing I want to do is get in the way of the people doing it.
Ah geeze don't lie down so easily! It's aspirational! You don't need to prefigure yourself as so impotent here... We can all find the courage to roar against the consensus of slave mentality, even those of us who are maybe quicker to give it all up at first for some new God. I think you have the right attitude, but you are going to end up on the side of losers either way if you don't even try to fight. Also, I am just an old man, so grain of salt and all that!
And fwiw, the idea that he meant literal people walking around being Übermenschen is kinda a Nazi distortion anyway.
Neither does the maintainer, who claims that a mechanical test of structural similarities can prove anything either way with regard to whether it is legally a derivative work (or even a mechanical copy without the requisite new creative work to be a derivative work).
And then Pilgrim is again wrong in saying that the use of Claude definitively makes it a derivative work because of the inability to prove that the work in question did not influence the neurons involved.
It is all dueling lay misreadings of copyright law, but it is also an area where the actual specific applicable law, on any level specific enough to cleanly apply, isn’t all that clear.
I think this is a bit too broad. There are actually three possible cases.
When there is similar code, the only possible defense, to prove that you have not copied the original, is to show that your process was a clean-room re-implementation.
If the code is completely different, then clean room or not is indeed irrelevant. The only way the author can claim that you violated their copyright despite no apparent similarity is for them to have proof you followed some kind of mechanical process for generating the new code based on the old one, such as using an LLM with the old code as input prompt (TBD, completely unsettled: what if the old code is part of the training set, but was not part of the input?) - the burden of proof is on them to show that the dissimilarity is only apparent.
In realistic cases, you will have a mix of similar and dissimilar portions, and portions where the similarity is questionable. Each of these will need to be analyzed separately - and it's very likely that all the similar portions will need to be re-written again if you can't prove that they were not copied directly or from memory from the original, even if they represent a very small part of the work overall. Even if you wrote a 10k page book, if you copied one whole page verbatim from another book, you will be liable for that page, and the author may force you to take it out.
The burden of proof is completely uncharted when it comes to LLMs. Burden of proof is assigned by court precedent, not the Copyright Act itself (in US law). Meaning, a court looking at a case like this could (should) see the use of an LLM trained on the copyrighted work as a distinguishing factor that shifts the burden to the defense. As a matter of public policy, it's not great if infringers can use the poor accountability properties of LLMs to hide from the consequences of illegally redistributing copyrighted works.
> When there is similar code, the only defense possible to prove that you have not copied the original is to show that your process is a clean room re-implementation.
Yes, but you do not have to prove that you haven’t copied the original; you have to prove you didn’t infringe copyright. For that there are other possible defenses, for example:
- fair use
- claiming the copied part doesn’t require creativity
- arguing that the copied code was written by AI (there’s jurisdiction that says AI-generated art can’t be copyrighted (https://www.theverge.com/2023/8/19/23838458/ai-generated-art...). It’s not impossible judges will make similar judgments for AI-generated programs)
Courts have ruled that you can't assign copyrights to a machine, because only humans qualify for human rights. ** There is not currently a legal consensus on whether or not the humans using AI tools are creating derivative works when they use AI models to create things.
** this case is similar to an old case where a ~~photographer~~ PETA claimed a monkey owned a copyright to a photo, because they said a monkey took the photo completely on their own. The court said "okay well, it's public domain then because only humans can have copyrights"
Imagine you put a Harry Potter book in a copy machine. It is correct that the copy machine would not hold a copyright to the output. But you would still be violating copyright by distributing the output.
> there’s jurisdiction that says AI-generated art can’t be copyrighted
The headline was misleading. The courts said what Thaler could have copyrighted was a complicated question they ignored because he said he was not the author.
- Arguing that you owned the copyright on the copied code (the author here has apparently been the sole maintainer of this library since 2013, not all, but a lot of the code that could be copied here probably already belongs to him...)
The expected functionality of chardet (detect the Unicode encoding) is kind of fixed: apart from edge cases and new additions to Unicode, you'd expect the original and new implementations to largely pass the same tests, and to have a lot of similar code, such as for "does this start with a BOM".
The fact that JPlag shows such a low % overlap for an implementation of "the same interface" is convincing evidence, to me, that it's not just plagiarised.
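For a concrete example of overlap that proves nothing: BOM sniffing is dictated by the Unicode spec itself, so any two charset detectors will contain something shaped like this (a minimal sketch, not chardet's actual code):

```python
# Longer BOMs must be checked first: the UTF-32 LE signature starts with the
# same two bytes as the UTF-16 LE one.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, enc in BOMS:
        if data.startswith(bom):
            return enc
    return None
```

Any structural-similarity tool has to discount spec-dictated code like this, which is part of why a low overlap score on the rest is meaningful.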
If you let an LLM merely rephrase the codebase, that's like letting it rephrase the Harry Potter novels. Which, I'm pretty sure, would still be considered a copy under copyright law, not an original work, despite not copying any text verbatim.
But what if it didn’t summarize Harry Potter? What if it analyzed Harry Potter and came back with a specification for how to write a compelling story about wizards? And then someone read that spec and wrote a different story about wizards that bears only the most superficial resemblance to Harry Potter in the sense that they’re both compelling stories about wizards?
This is legitimately a very weird case and I have no idea how a court would decide it.
Given that LLMs were trained on the repository directly, it's not just the case that anything made by the LLM is a derivative work: the LLM ITSELF is a derivative work. After all, they are all substantially based on GPL-licensed works by others. The standard courts have always used for "substantially based", by the way, is the ability to extract from the new work anything bigger than an excerpt of the original work.
So convincing evidence, by historical standards, that ChatGPT, Gemini, Copilot AND Claude are all derivative works of the GPL linux kernel can be gotten simply by asking "give me struct sk_buff", then keep asking until you're out of the headers (say, ask how a network driver uses it).
That means if courts are honest (and they never are when it comes to GPL) OpenAI, Google and Anthropic would be forced to release ALL materials needed to duplicate their models "at cost". Given how LLMs work that would include all models, code, AND training data. After all, that is the contract these companies entered into when using the GPL licensed linux kernel.
But of course, to courts copyright applies to you when Microsoft demands it ($30000 per violation PLUS stopping the use of the offending file/torrent/software/... because such measures are apparently justified for downloading a $50 piece of software), it does not apply to big companies when the rules would destroy them.
The last time this was talked about someone pointed out that Microsoft "stole", as they call it, the software to do product keys. They were convicted for doing that, and the judge even increased damages because of Microsoft's behavior in the case.
But there is no way in hell you'll ever get justice from the courts in this. In fact courts have already decided that AI training is fair use on 2 conditions:
1) that the companies acquired the material itself without violating copyright. Of course it has already been proven that this is not the case for any of them (they scraped it without permission, which has been declared illegal again and again in the file sharing trials)
2) that the models refuse to reproduce copyrighted works. Now go to your favorite model and ask "Give me some code written by Linus Torvalds": not a peep about copyright violation.
... but it does not matter, and it won't matter. Courts are making excuses to allow LLM models to violate any copyright, the excuse does not work, does not convince rational people, but it just doesn't matter.
But of course, you might think that since they cheat against the law to make what they're already doing legal, they'll do the same for you and help you violate copyright. After all, that's how they work! Ok, now go and ask:
"Make me an image of Mickey Mouse peeling a cheese banana under an angry moon"
And you'll get a reply "YOU EVIL COPYRIGHT VILLAIN". Despite, of course, Mickey Mouse no longer being covered under copyright!
And to really get angry, find your favorite indie artist, and ask to make something based on their work. Even "Make an MC Escher style painting of Sonic the Hedgehog" ... even that doesn't count as copyright violation, only the truly gigantic companies deserve copyright protection.
> Given that LLMs were trained on the repository directly, it's not just the case that anything made by the LLM is a derivative work, the LLM ITSELF is a derivative work.
That’s not how “derivative works”, well, work.
First of all, a thing can only be a derivative work if it is itself an original work of authorship.
Otherwise, it might be (or contain) a complete copy or a partial copy of one or more source works (which, if it doesn't fall into a copyright exception, would still be at least a potential violation), but it's not a derivative work.
So you're saying LLMs don't count as an original work and so have zero copyright protection? So anyone running those models can just freely copy them if they have access to them? And, of course, it means distillation attacks, even if they do turn out to copy the OpenAIs/Anthropic/... model are just 100% perfectly legal? I mean paying someone to break into the DC and then putting the model on torrent would allow anyone downloading it to use it, legally. Because that would be the implication, wouldn't it?
Plus, if this is true, it would be a loophole. Plus this is totally crazy.
It would be great if courts declared WHAT is the case. But they won't, because copyright only protects massive companies.
> So you're saying LLMs don't count as an original work and so have zero copyright protection?
No, I'm saying that your explanation of what makes something a derivative work is wrong. Now, personally, I think there is a very good argument that LLMs and similar models, if they have a copyright at all, do so only because of whatever copyright can be claimed on the training set as a work of its own (which, if it exists, would be a compilation copyright), as a work of authorship of which the model is a mechanical transformation (similar to object code having a copyright as a consequence of the copyright on the source code, which is a work of authorship). It's also quite arguable that they are not subject to copyright, and many have made that argument.
> So anyone running those models can just freely copy them if they have access to them?
I'm not arguing for that, but yes, that is the consequence if they are not subject to copyright, assuming no other (e.g., contractual) prohibition binds the parties seeking to make copies.
> And, of course, it means distillation attacks, even if they do turn out to copy the OpenAIs/Anthropic/... model are just 100% perfectly legal?
Distillation isn't an “attack” and probably isn't a violation of copyright even if models are protected: it is literally interacting with the model through its interface to reproduce its function, i.e., functional reverse engineering.
Distillation is a violation of ToS, for which there are remedies outside of copyright.
> I mean paying someone to break into the DC and then putting the model on torrent would allow anyone downloading it to use it, legally.
Paying someone to break into the DC and do that would subject you to criminal charges for burglary and conspiracy, and civil liability for the associated torts as well as for theft of trade secrets covering the resulting harms, even without copyright protection.
> Plus, if this is true, it would be a loophole. Plus this is totally crazy.
It's not a “loophole” that copyright law only covers works of original authorship; it is the whole point of copyright law.
> It would be great if courts declared WHAT is the case.
If there is a dispute which turns on what is the case, courts will rule one way or the other on the issues necessary to resolve it. Courts (in the US at least) do not rule on issues not before them, except to the extent that a general rule which resolves but covers somewhat more than the immediate case can usefully be articulated by an appellate court.
> But they won't, because copyright only protects massive companies.
Leaving out any question of whether the premise of this claim is true, the conclusion doesn't follow from it, since “what is the case” here is the kind of thing that is quite likely to be an issue between massive companies at some point in the not too distant future, requiring courts to resolve it even if they only address the meaning of copyright law for that purpose.
Your first 3-4 arguments I just read as trying to weasel out from under the GPL. Because everyone trains on GPL code and if the GPL applies to the result ... well clearly you know the implications of that.
And btw: a "compilation copyright" would apply to the training data. Great. That only means, of course, that if they publish their training data (like they agreed to when using GPL code to base their models on), people can't republish the exact same collection under different conditions (BUT they can under the same conditions). Everyone will happily follow that rule, don't worry.
> Paying someone to break into the DC and do that would subject you to criminal charges for burglary and conspiracy, and civil liability for the associated torts as well as for theft of trade secrets covering the resulting harms, even without copyright protection.
I don't claim the break-in would be legal, but without copyright protection, if that made a model leak, it would be fair game for everyone to use.
> Distillation is a violation of ToS, for which there are remedies outside of copyright.
But the models were created by violating the ToS of webservers! This has the exact same problem the copyright violations have, only far, far bigger! Scraping webservers is a violation of the ToS of those servers. For example [1]. Almost all have language somewhere that only allows humans to browse them, not bots, and IF bots are allowed at all (certainly not always), only specific bots for the purpose of indexing. So this is a much bigger problem for AI labs than even the GPL issue.
So yes, if you wanted to make the case that the AI labs, and large companies, violate any kind of contract, not just copyright licenses: excellent argument. But I know already: I'm a consultant, and I've had to sue, and won against, two very large companies over terms of payment. In one case, I had to do something called "forced execution" of the payment order (i.e. going to the bank and demanding the bank execute the transaction against a random account of the company, against the will of the large company. Let me tell you, banks DO NOT like to do this)
Btw: what model training is doing, obviously, is distilling from the work, from the brain, of humans, against the will of those humans, and without paying for it. So in any reasonable interpretation, that's also a ToS violation. Probably a lot more implicit than the ones spelled out on websites, but not fundamentally different.
> Your first 3-4 arguments I just read as trying to weasel out from under the GPL.
I haven't talked about any license, or given any thought to any particular license, in any of this; I don't know where you are reading anything about the GPL specifically into it.
None of this has anything to do with the GPL, except that the GPL is only even necessary where there is something to license because of a prohibition in copyright law.
> And btw: a "compilation copyright" would apply to the training data. Great. That only means, of course, that if they publish their training data (like they agreed to when using GPL code to base their models on), people can't republish the exact same collection under different conditions (BUT they can under the same conditions).
No, that's not what it means, and I don't know where you got the "other terms" or the dependency on publication from; neither is from copyright law.
> But the models were created by violating ToS of webservers!
And, so what?
To the extent those terms are binding (more likely the case for sites where there is affirmative assent to the conditions, like ones that are gated on accounts with a signup process that requires agreeing to the ToS, e.g., “clickwrap”), there are remedies. For those where the conditions are not legally binding (more like the case where the terms are linked but there is no access gating, clear notice, or affirmative assent), well, they aren't binding.
> Btw: what model training is doing, obviously, is distilling from the work, from the brain, of humans, against the will of those humans, and without paying for it. So in any reasonable interpretation, that's also a ToS violation.
Uh, what? We are just creating imaginary new categories of intellectual property and imaginary terms of service and imaginary bases for those terms to be enforceable now?
The LLM would, under that argument, be a transformative derivative work, which has important fair use implications (that don’t exist in the chardet case)…
"Tainted rewrite" isn't a legal concept either. You have to prove (on balance of probabilities - more likely than not) that the defendant made an unauthorized copy, made an unauthorized derivative work, etc. Clean-room rewriting is a defense strategy, because if the programmer never saw the original work, they couldn't possibly have made a derivative. But even without that, you still have to prove they did. It's not an offence to just not be able to prove you didn't break the law.
If you wanted to do the clean-room approach for something like chardet in a less controversial way, instead of having the AI do all the work couldn’t the AI generate the spec and then a human (with no exposure to the original code) do an initial implementation based on the spec?
As others pointed out, the notion of "clean room" rewrites is to make a particularly strong case of non-infringement. It doesn't mean that anything other than a clean room implementation is an infringement.
This is interesting and I'm not sure what to make of it. Devil's advocate: the person operating the AI also was "trained with the code," is that materially different from them writing it by hand vs. assisted by an LLM? Honestly asking, I hadn't considered this angle before.
If you worked at Microsoft and had access to the Windows source code you probably should not be contributing to WINE or similar projects as there would be legal risk.
So for this case, not much different legally. Of course there is the practical difference just like there is between me seeing you with my own eyes and me taking a picture of you.
"Training" an LLM is not the same as training a human being. It's a metaphor. It's confusing the save icon with an actual floppy disk.
I could say I "trained" my printer to print copyrighted material by feeding it bits, but that would be pure sophism.
Problem is that law hasn't really caught up with our brave new AI future yet, so lots of decisions are up in the air. Plus governments are incentivized to look the other way regarding copyright abuses when it comes to AI, as they think that having competitive AI is of strategic importance.
> "Training" an LLM is not the same as training a human being. It's a metaphor. It's confusing the save icon with an actual floppy disk.
Maybe? But the design of the floppy disk is for data storage and retrieval per se. It can't give you your bits in a novel order like an LLM does (by design). From what I can tell in this case, the output is significantly differentiated from the source code.
This is correct. I think any author of a main chunk of code that they claim ownership to (which is probably all of us!) should at least study the basics of copyright law. Getting little details wrong can cost you time, money and eventually your business if you're not careful.
Fine tuning is a story that is nice to tell but that, with modern LLMs, makes less and less sense. Modern LLMs are so powerful that they can few-shot learn complicated things, so a strong prompt plus augmenting the generation (given the massive context window of Qwen3.5, too) is usually the best option available. There are models for which fine tuning is great, like image models: there, with LoRA, you can get good results in many ways. And LLMs of the past, too: it made sense for certain use cases. But now, why? LLMs are already released after seeing (after pre-training) massive amounts of data for SFT and then RL. Removing censorship is done much more efficiently with other techniques. So I have a strong feeling that fine tuning will become less relevant every day, and already is quite irrelevant. This, again, in the specific case of LLMs. For other foundational models fine tuning still makes sense and is useful (images, text to speech, ...).
I think the biggest case for fine tuning is probably that you can take small models, fine tune them for applications that require structured output, and then run cheap inference at scale. "Frontier LLMs can do it with enough context" is not really a strong argument against fine-tuning, because they're expensive to run.
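The reason a fine-tune like that is cheap is the low-rank trick: the pretrained weight matrix stays frozen and only two small factors are trained. A toy numpy sketch of the LoRA math (hidden size and rank are made-up illustrative values, not any particular model's):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 512, 8                            # hidden size, LoRA rank (assumed values)
W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable factor
B = np.zeros((d, r))                     # trainable factor, zero-initialized

def lora_forward(x, alpha=16):
    # y = x W^T + (alpha/r) * x A^T B^T -- only A and B get gradient updates
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((1, d))
# With B zeroed, the adapter starts as a no-op: output equals the frozen model's.
assert np.allclose(lora_forward(x), x @ W.T)

# Trainable parameters vs. full fine-tuning:
full, lora = d * d, 2 * d * r
print(f"full: {full}, lora: {lora} ({100 * lora / full:.1f}%)")
```

At this toy size the adapter is about 3% of the full matrix's parameters; for real multi-billion-parameter models the ratio is usually far smaller, which is what makes the inference-at-scale economics work.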
Especially for super constrained applications. I don't care if the language model that I use for my extremely specific business domain can solve PhD math or remember the works of Shakespeare. I'd trade all of that for pure task specific accuracy.
Can you share more details about your use case? The good applications of fine tuning are usually pretty niche, which tends to make people feel like others might not be interested in hearing the details.
As a result it's really hard to read about real-world use cases online. I think a lot of people would love to hear more details - at least I know I would!
Wouldn’t it be better to use a grammar in the token sampler? Tuning is fine, but doesn’t guarantee syntactically correct structured output. But if the sampler is grammar-aware, it could.
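The grammar-aware sampler idea fits in a few lines: before sampling each token, mask the logits of every token the grammar forbids. A toy sketch with a made-up vocabulary and a trivial "digits only" grammar (real implementations, e.g. llama.cpp's GBNF support, track a full grammar state machine instead):

```python
import math, random

vocab = ["0", "1", "2", "a", "b", "{", "}"]   # toy vocabulary (assumed)

def allowed(token, state):
    # toy "grammar": only digits are legal in this state
    return token.isdigit()

def constrained_sample(logits, state):
    # mask forbidden tokens to -inf, then softmax-sample the survivors
    masked = [l if allowed(t, state) else -math.inf
              for t, l in zip(vocab, logits)]
    mx = max(masked)
    probs = [math.exp(l - mx) for l in masked]
    total = sum(probs)
    return random.choices(vocab, weights=[p / total for p in probs])[0]

random.seed(42)
logits = [0.1, 2.0, 0.3, 5.0, 4.0, 1.0, 1.0]  # the model strongly prefers "a"
out = constrained_sample(logits, state=None)
assert out in "012"   # the grammar guarantees a digit regardless of the logits
```

This guarantees syntactic validity by construction, which fine-tuning alone cannot; the two are complementary, since tuning still improves which valid output gets picked.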
Also for certain use cases there are constraints like embedded hardware systems with no internet access. These LLMs have to be trained to specialize for clearly defined use cases under hardware constraints.
Frontier LLMs also rarely function in isolation; instead they orchestrate a system of specialized units, aka subsystems and agents.
While costs and effort are one thing, being able to downsize these monster LLMs through finetuning in the first place is extremely valuable.
> "Frontier LLMs can do it with enough context" is not really a strong argument against fine-tuning, because they're expensive to run.
I am no expert in this topic, but I am wondering if a large cached context is actually cheap to run, and whether frontier models would be cost-efficient too in such a setting?
I agree- I'm currently trying to learn how I can embed a fine tuned tiny model into my c++ game so it can provide a narrative in prose of certain game-event logs. It needs to be as tiny as possible so it doesn't take resources away from the running game.
> I agree- I'm currently trying to learn how I can embed a fine tuned tiny model into my c++ game so it can provide a narrative in prose of certain game-event logs.
Unless your game states have combinatorial explosion, would it not be better to generate all of that at build time? If templated, you can generate a few hundred thousand templates to use for any circumstance, then instantiate and stitch together those templates at game runtime.
There are a bunch of tutorials on how to use GRPO to fine tune a small Qwen. Depending what you're doing LoRA or even just prefix tuning can give pretty good results with no special hardware.
> How small a model are we talking? Don't even the smallest models which would work need gigabytes of memory?
I dunno, for game prose I expect that a tiny highly quantized model would be sufficient (generating no more than a paragraph), so 300MB - 500MB maybe? Running on CPU not GPU is feasible too, I think.
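The back-of-envelope math behind those numbers, for what it's worth: weight-only memory is roughly parameter count times bits per weight divided by 8, with KV cache and runtime overhead on top.

```python
def model_size_mb(params, bits_per_weight):
    # rough weight-only footprint; KV cache and runtime overhead are extra
    return params * bits_per_weight / 8 / 1024 / 1024

# e.g. a ~0.5B-parameter model at 4-bit quantization:
print(f"{model_size_mb(0.5e9, 4):.0f} MB")   # ~238 MB
```

So the 300-500 MB guess corresponds to roughly a 0.6-1B parameter model at 4-bit, which is indeed the size class people run on CPU.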
These are fair points considering LLMs are getting smarter and better every week - but to be fair the biggest benefits of finetuning / RL are still not yet realized:
1. If we have robots at home, they need some sort of efficient continual learning, which could be on-the-go finetuning / RL via some small LoRA. This will need multimodal finetuning with sparse reward signals. One could also imagine all data being aggregated to one central processing center after anonymization, and training a larger model with more data + RL like that.
2. Agreed images, audio, video etc is what still LoRA does well - the guide at https://unsloth.ai/docs/models/qwen3.5/fine-tune is actually a vision + text finetuning guide, so you can finetune the vision layers on your own use case
3. Model routing is going to be more the norm in the future - ie locally smallish models with LoRA for continuous finetuning can be used, but complex tasks can be offloaded to a large LLM in the cloud.
4. I also wrote about more use-cases below on the post - DoorDash, Vercel, Mercor, Stripe, NASA, Perplexity, Cursor and many others all do finetuning - for eg Cursor, Perplexity finetune large OSS LLMs themselves for their specific product lines - so there is definitely value if you have the data for it.
I work on Gemma and Gemini models, and I want to echo Daniel's point here. Small finetuned models have their place even with larger general purpose models.
For example, last year with Daniel/Unsloth's help we released a tiny specialized model that gets Gemini-level performance specifically for FC (function calling). For folks that need efficient limited-purpose models, small models like this can fit a specific need.
It's the same with chips, we have general purpose CPUs but we still have specialized silicon for tasks that are smaller, more power efficient, cheaper, and because they're single purpose it simplifies and derisks certain designs.
And I have to add, if you want to learn about finetuning models efficiently the Unsloth guides are at the top of my list. They're practical, have all the technical details, and most importantly Daniel and the others are working around the clock to keep it up to date in what is an incredibly fast moving space of models and hardware. I am continually astounded by their work.
Function calling, and also finetuning with FC, is a big use-case across many companies - we constantly see large orgs with internal APIs that have some schema, and JSON guided output is good, but finetuning with FC is just much more powerful, since the model actually starts to understand how to utilize the tools more effectively!
Nice work with Gemma and Gemini as usual! :) Excited for more cool models this year!
For me, trying to fine-tune a model to write "best day" prose I would accept over 80% of the time.
You are correct if we are talking about knowledge.
However it is bad at hyper-idiosyncratic, gritty style transfer.
I first noticed the issue when asking claude code to draft email responses. The choice of register was off. ("Register in writing refers to the level of formality and tone chosen to suit a specific audience, purpose, and context.")
I decided to take all my HN comments, rewrite them in various bad LLM prose, and see if I could use DSPy to optimize a prompt using in-context learning (ICL: I give it 10 examples of my HN comments), and the results were abysmal. RLHF fine-tuned frontier LLMs have a deep-seated aversion to the target stylistic distribution of my comments.
I tried fine-tuning qwen3, llama, and gemma models. Instruct models are already so tuned that they could not be tuned.
This is using several hundred comments as gold targets and 5 different LLM degradations per gold as the input.
> Instruct models are already so tuned that they could not be tuned
Some models have the base model available, that is before instruction tuning. For example llama 3 comes in "pre-trained and instruction tuned variants" [1]. I'm guessing you already know that though.
How well would you say it worked? I do like the idea of taking my historical forum posts and e-mails and whatnot and training an autocomplete LLM that is specifically "my voice".
They are great for specialized use-cases: (a) where the problem is not hard enough (you don't need reasoning), or (b) diverse enough (you don't need a world model), (c) you want cheap inference (and you can make it happen hardware-wise) and (d) you either have enough data or a workflow that accumulates data (with fine tuning with enough data you can sometimes beat a premier model while ensuring low latency - ofc, assuming (a) and (b) apply).
I make it sound like a rare perfect storm needs to exist to justify fine tuning, but these circumstances are not uncommon - to an extent (a), (c) and (d) were already prerequisites for deploying traditional ML systems.
where it makes sense IMO is when you need it to know about a large amount of information that's not already in the model, such as a company knowledgebase, code repositories or a trove of specialized legal documents... in that case it's not realistic to try to stuff the context window every time with that information, especially if you're trying to make a responsive chat bot.
With the current context windows and the ability those models did RL to work as agents, it's much faster and reliable for them to use tools and find the information before replying. Much better, no hallucinations problems (or a lot less), no fine tuning needed when information changes. I believe it is exactly in this case that fine tuning is no longer useful, and even in the past worked at very different degrees of quality.
indeed, and in practical terms, this is more often than not, particularly with large knowledge bases. also makes a lot of sense for VLMs and ViT models.
I think fine-tuning still matters for production problems where you need deterministic, auditable behavior or to reliably reduce hallucinations that clever prompting alone cannot eliminate. In my experience the best pragmatic approach is parameter efficient tuning, for example LoRA or QLoRA with bitsandbytes for 4-bit training to keep costs down, paired with a RAG layer over a FAISS vector DB so you do not stuff the model context and blow your token budget. I've found that managing a few tuned adapters and a small ops pipeline is a simpler, cheaper long term tradeoff than endless prompt gymnastics, and it saves you from praying to the prompt gods every time requirements creep.
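The RAG half of that pipeline can be sketched without FAISS at all (FAISS's IndexFlatIP does the same thing, just far faster at scale): embed the documents, unit-normalize, and rank by inner product at query time. Dimensions and corpus here are made-up stand-ins for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for real embeddings (e.g. 384-dim sentence vectors)
docs = rng.standard_normal((1000, 384)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # unit-normalize once

def search(query, k=5):
    # cosine similarity == inner product on normalized vectors;
    # a FAISS IndexFlatIP would replace these three lines at scale
    q = query / np.linalg.norm(query)
    scores = docs @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

query = docs[42] + 0.01 * rng.standard_normal(384).astype(np.float32)
ids, scores = search(query)
assert ids[0] == 42   # the slightly perturbed document retrieves itself
```

The retrieved passages then go into the prompt of the (LoRA-tuned) model, which is what keeps the context small and the token budget intact.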
This time even Unsloth could not provide bitsandbytes 4-bit models. bitsandbytes does not support new models with MoE and linear attentions, and it's much less flexible than GGUF. Nowadays I think it's better to train lora over GGUF base model, see the discussion at https://github.com/huggingface/transformers/issues/40070
I'll find some time to do this and I hope someone can do this earlier than me.
Qwen filtered out a lot of porn during data curation, and a finetuned model can perform much better than context engineering. Abliteration can only remove censorship, not add something non-existent in the training data.
Fine-tuning still makes sense for cost/latency-sensitive applications. Massive context windows drastically slow down generation, and modern models' performance and instruction following ability relies heavily on a reasoning step that can consume orders of magnitude more tokens than the actual response (depending on the application), while a fine-tuned model can skip/significantly reduce that step.
Using the large model to generate synthetic data offline with the techniques you mentioned, then fine-tuning the small model on it, is an underrated technique.
As strong as current LLMs are, they are often easily distracted from the task.
At production scale, fine tuning can make a lot more sense, given you provide the model a very specific task.
The problem with this is context. Whatever examples you provide compete with whatever content you actually want analyzed. If the problem is sufficiently complex, you quickly run out of context space. You must also describe the response format you want. For many applications, it's better to fine-tune.
Because these models are good in general but their Latvian output is half-drivel, like the roots of the words are usually the right ones, but not the rest.
That, and EuroLLM is really slow to release new models that would be similarly good off the shelf.
Please tell me what CPU it is. I would give it a try. I doubt strongly a very well documented CPU can't be emulated by writing the code with modern AIs.
What happened with the wrong pixel layout is that the specification was wrong (the problem is that sub-agents spawned recently by Claude Code are Haiku sessions, their weakest model -- you can see the broken specification under spectrum-specs), it entered the code, and caused a bug that Claude later fixed without updating the comment. This actually somewhat shows that even under adversarial documentation it can fix the problem.
IMHO zx_pixel_addr() is not bad, it makes sense in this case. I'm a lot more unhappy with the actual implementation of the screen -> RGB conversion that uses such a function, which is not as fast as it could be. For instance my own zx2040 emulator's video RAM to ST77xx display conversion (written by hand, also on GitHub) is more optimized in this case. But providing the absolute address in the video memory, instead of the offset, is ok. Just a design choice.
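For readers wondering why such a helper exists at all: the Spectrum's bitmap layout is famously non-linear, with the Y bits shuffled, so the address computation is the classic bit-twiddle. A sketch of the well-documented formula (shown as Python for clarity; the emulator's helper itself is C):

```python
def zx_pixel_addr(x, y):
    """Absolute address of the byte holding pixel (x, y) in ZX Spectrum
    video memory (bitmap at 0x4000-0x57FF, 256x192 screen).
    The Y bits are interleaved: address = 010 Y7Y6 Y2Y1Y0 Y5Y4Y3 X4..X0."""
    assert 0 <= x < 256 and 0 <= y < 192
    return 0x4000 | ((y & 0xC0) << 5) | ((y & 0x07) << 8) \
                  | ((y & 0x38) << 2) | (x >> 3)

assert zx_pixel_addr(0, 0) == 0x4000      # top-left byte
assert zx_pixel_addr(0, 1) == 0x4100      # next scanline is +256, not +32
assert zx_pixel_addr(255, 191) == 0x57FF  # last bitmap byte
```

This interleaving is why naive row-by-row conversion loops end up recomputing addresses constantly, and why hand-optimized converters walk the memory in its native order instead.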
> This and the other inline function next to it for attributes are only ever used once.
I agree with that, but honestly 90% of developers work this way. And LLMs have such a style for this reason. A style I dislike as well...
About the lookup table, the code that it uses in the end was a hint I provided to it, in zx_contend_delay(). The old code was correct but extremely memory wasteful (there are emulators really taking this path of the huge lookup table, maybe to avoid the division for maximum speed), and there was the full comment about the T-states, but after the code was changed this half-comment is bad and totally useless indeed. In the Spectrum emulator I provided a few hints. In the Z80, no hint at all.
If you check the code in general, the Z80 implementation for instance, it is solid work on average. Normally after using automatic programming in this way, I would ask the agent (and likely Codex as well) to check that the comments match the documentation. Here, since it is an experiment, I did zero refinements, to show what is the actual raw output you get. And it is not bad, I believe.
P.S. I see your comment greyed out, I didn't downvote you.
In the last paragraph you handwave that all the Z80 and ZX Spectrum documentation is likely already in the model anyway... Choosing not to provide the documents/websites might then require more prompting to finish the emulator, but the knowledge is there. You can't clean-room with a large LLM. That's a delusion!
Counterpoint: in December, a Polish MP [0] vibe-coded an interpreter [1] of a 1959 Polish programming language, feeding it the available documentation. _That,_ at least, is unlikely to have appeared in the model’s training data.
Not exactly a counterpoint, since nobody argued that LLMs can not produce "original" code from specs at all - just that this particular exercise was not clean room.
(although for SAKO [1], it's an average 1960 programming language, just with keywords in Polish, so it's certainly almost trivial for an LLM to produce an interpreter, since construction via analogy is the bread and butter of LLMs. Also, such interpreters tend to have an order of magnitude less complexity than emulators.)
I mean, for an article that's titled "clean room", that would be the first thing to do, not as a "maybe follow up in the future"...
(I do think the article could have stood on its own without mentioning anything about "clean room", which is a very high standard.)
For the handwavy point about the x86 assembler, I am quite sure that the LLM will remember the entirety of the x86 instruction set without any reference, it's more of a problem of having a very well-tuned agentic loop with no context pollution to extract it. (which you won't get by YOLOing Claude, because LLMs aren't that meta-RLed yet to be able to correct their own context/prompt-engineering problems)
Or alternatively, to exploit context pollution, take half of an open-source project and let the LLM fill in the rest (try to imagine the synthetic "prompt" it was given when training on this repo) and see how far it is from the actual version.
Qwen-asr can easily transcribe live radio (see README) on any random laptop. It looks like we are going to see really cool things in local inference, now that automatic programming makes it a lot simpler to create solid pipelines for new models in C, C++, Rust, ..., in a matter of hours.
Your voxtral.c work was a big motivator for me. I built a macOS menu bar dictation app (https://github.com/T0mSIlver/localvoxtral) around Voxtral Realtime, currently using a voxmlx fork with an OpenAI Realtime WebSocket server I added on top.
The thing that sold me on Voxtral Realtime over Whisper-based models for dictation is the causal encoder. Text streaming in as you speak rather than appearing after you stop is a fundamentally different UX. On M1 Pro with a 4-bit quant through voxmlx it feels responsive enough for natural dictation, though I haven't done proper latency benchmarks yet.
Integrating voxtral.c as a backend is on my roadmap, compiling to a single native binary makes it much easier to bundle into a macOS app than a Python-based backend.
Which is why, long term, current programming languages will eventually become less relevant in the whole programming stack, as in: get the computer to automate tasks, regardless of how.
Assuming RAM prices will not make it totally unaffordable. The current situation is atrocious, and big infrastructure corps seem to love it; they do not want independent computing. Alternatively they might build specialized branded hardware which people could only use for what the corps allow them to do, for a nice monthly fee.
Another problem is too much abstraction at the input spec level. The other day I asked Claude to generate a few classes. When reviewing the code I noticed it doing a full scan for ranges on one giant set. This would bring my backend to a halt. After I pointed it out, Claude smartened up and started with a lower_bound() call. When there are no people to notice such things, what do you think we are going to have?
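For illustration, the Python analogue of that same bug: a linear scan over a sorted container versus binary search via bisect (which plays the role of C++'s set::lower_bound here). The container and query are made up:

```python
import bisect

data = list(range(0, 3_000_000, 3))   # large sorted container (stand-in)

def range_count_scan(lo, hi):
    # O(n): what the generated code effectively did
    return sum(1 for v in data if lo <= v < hi)

def range_count_bisect(lo, hi):
    # O(log n): the lower_bound-style fix
    return bisect.bisect_left(data, hi) - bisect.bisect_left(data, lo)

assert range_count_scan(100, 200) == range_count_bisect(100, 200)
```

Both return the same answer; only a reviewer (or a profiler) notices that one of them touches every element on every query.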
Agreed. In regards to prices, it appears to be the new gold; let's see how this gets sorted out, with NPUs, FPGAs, analog (Cerebras), ...
Now, the abstraction, I am with you on that: I foresee a more formal way to give specifications, more suitable for natural language as input, or even proper mathematics, than the languages we have been using thus far.
On a more serious note: sure, we need a spec-development IDE which an LLM would compile to a language of choice (or print an ASIC). It would still not prevent that lower_bound thing from happening, and there will be no people to find out why.
> Alternatively they might build specialized branded hardware which people could only use for what corps allow them to do for nice monthly fee.
That's why I'm still holding on to a bulky Core 2 Duo Management Engine-free Fujitsu workstation, for when personal computing finally goes underground again.
I just now learned that this isn't the default. I set my bookmark to HN in like 2011, before making an account, and apparently it's that one. I didn't realize that wasn't just the basic homepage but with a weird address for some reason.