It’s deeply disingenuous to say the U-Net is not trained on the images because it’s trained on the latent representations.
Latents are a compressed representation of the source images, recoverable to near-pixel fidelity.
If you train a model on a compressed JPEG of an image, or on any deterministic transformation of it, you’re still training it on that image.
Any suggestion otherwise is only because someone is trying to put some spin on things.
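You can check the round trip yourself. A minimal sketch, assuming the `diffusers` library and a stock Stable Diffusion VAE (the model id and `photo.png` are placeholders of my choosing, not anything from the comment above):

```python
import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision.transforms.functional import to_tensor, to_pil_image

# Stock Stable Diffusion VAE -- the encoder that produces the latents
# the U-Net is trained on. (Model id is an assumption on my part.)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()

image = load_image("photo.png").resize((512, 512))   # placeholder path
x = to_tensor(image).unsqueeze(0) * 2 - 1            # scale pixels to [-1, 1]

with torch.no_grad():
    z = vae.encode(x).latent_dist.mean   # the compressed latent
    x_hat = vae.decode(z).sample         # decode it straight back to pixels

# x_hat is a near-pixel-perfect reconstruction of the original image:
# the latent is just a lossy-but-faithful re-encoding of the same pixels.
to_pil_image(((x_hat[0] + 1) / 2).clamp(0, 1)).save("reconstruction.png")
```

Encode, decode, compare: the latent the U-Net trains on carries essentially the same information as the pixels.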
> Stable Diffusion v2 16-bit is ~3GB of data. It was trained on hundreds of millions of images…
And yet! Remarkably! It can generate pictures of the Mona Lisa!
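To put that quote into numbers (rough arithmetic, using only the figures quoted above):

```python
model_bytes = 3 * 1024**3        # ~3 GB of 16-bit weights, per the quote
training_images = 300_000_000    # "hundreds of millions", per the quote

print(model_bytes / training_images)  # ~10.7 bytes per image, on average
```

An average of ~10 bytes per image rules out memorising everything wholesale, but averages say nothing about images duplicated thousands of times across the training set, which is exactly how something as common as the Mona Lisa can come through intact.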
Here’s a question for you: if you encode the process of drawing an exact copy of an image, does the pure code that implements that process contain a copy of the image?
Have you encoded pixels as code?
Does that mean there’s no copy of the image?
How about a zip file full of images? It’s just a high-entropy binary blob, right? Yet… remarkably!!! It can be transformed back into images by applying an algorithm.
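If you want a toy version of both questions at once, here’s a sketch (the file path is a placeholder; imagine the compressed bytes pasted into the source as a literal):

```python
import zlib

# In the real scenario this blob would be a literal pasted into the code:
# a "high entropy binary blob" that happens to be a compressed image.
BLOB = zlib.compress(open("mona_lisa.png", "rb").read())

def restore(blob: bytes, path: str) -> None:
    # Apply the algorithm, and the exact original file reappears.
    with open(path, "wb") as f:
        f.write(zlib.decompress(blob))

restore(BLOB, "copy.png")  # byte-for-byte identical to the original
```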
I don’t know the answer, but this handwavy “it couldn’t possibly encode them, it’s too small” is…
Pure. Nonsense.
Of course some part of some images is embedded in the model in some form.
Stop trivialising the issue.
The issue here is: Does an algorithm that generates content infringe copyright?
Does a black box that takes the input “a picture of xxx” and a seed and outputs a copyrighted image infringe?
You know that’s possible. Don’t dodge. Technical hand-waving of the “oh, it couldn’t possibly have…” variety is pure rubbish.
Sure it could. It could have a full-resolution copy of a photo of the image in that black box.
Of all the training data? Probably not. But of some of it? In compressed latent form? Most definitely.
Bullshit. There is never an exact copy of the Mona Lisa. Any reproduction with similarities is no different from a human artist who learned to paint and made a reproduction of the Mona Lisa. No copyright infringement.
"Image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have attracted significant attention due to their ability to generate high-quality synthetic images. In this work, we show that diffusion models memorize individual images from their training data and emit them at generation time. With a generate-and-filter pipeline, we extract over a thousand training examples from state-of-the-art models, ranging from photographs of individual people to trademarked company logos. We also train hundreds of diffusion models in various settings to analyze how different modeling and data decisions affect privacy. Overall, our results show that diffusion models are much less private than prior generative models such as GANs, and that mitigating these vulnerabilities may require new advances in privacy-preserving training. "