When an AI trains on data, it isn’t copying the data; the model doesn’t “contain” the training data in any meaningful sense.
And what’s your evidence for this claim? It seems to be false given the times people have tricked LLMs into spitting out verbatim or near-verbatim copies of training data. See this article as one of many examples out there.
People who insist that AI training is violating copyright are advocating for ideas and styles to be covered by copyright.
Again, what’s the evidence for this? Why do you think that of all the observable patterns, the AI will specifically copy “ideas” and “styles” but never copyrighted works of art? The examples from the above article contradict this as well. AIs don’t seem to be able to distinguish between abstract ideas like “plumbers fix pipes” and specific copyright-protected works of art. They’ll happily reproduce either one.
That article is over a year old. The NYT case against OpenAI turned out to be quite flimsy; their evidence was heavily massaged. What they did was pick an article of theirs that was widely copied across the Internet (and thus likely to be “overfit”, a flaw in training that AI trainers actively avoid nowadays), then give ChatGPT the first 90% of the article and tell it to complete the rest. They tried over and over again until eventually something that closely resembled the remaining 10% came out, at which point they took a snapshot and went “aha, copyright violated!”
They had to spend a lot of effort to get that flimsy case. It likely wouldn’t work on a modern AI; training techniques are much better now. Overfitting is better avoided and synthetic data is used.
Why do you think that of all the observable patterns, the AI will specifically copy “ideas” and “styles” but never copyrighted works of art?
Because it’s literally physically impossible. The classic example is Stable Diffusion 1.5, which had a model size of around 4GB and was trained on over 5 billion images (the LAION-5B dataset). If it were actually storing the images it was trained on, it would be compressing each of them to under 1 byte of data.
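For anyone who wants to check that figure, here is a minimal sketch of the arithmetic. The two constants are just the round numbers quoted above (about 4GB, about 5 billion images), taken as assumptions rather than exact measurements, and `bytes_per_image` is a hypothetical helper name.

```python
# Round figures taken from the comment above; treat them as approximations.
MODEL_SIZE_BYTES = 4 * 10**9   # "around 4GB" model file
TRAINING_IMAGES = 5 * 10**9    # "over 5 billion images"


def bytes_per_image(model_bytes: int, image_count: int) -> float:
    """Average number of model bytes available per training image."""
    return model_bytes / image_count


budget = bytes_per_image(MODEL_SIZE_BYTES, TRAINING_IMAGES)
print(f"{budget:.2f} bytes ({budget * 8:.1f} bits) per image")
```

With these inputs it works out to 0.8 bytes, or about 6.4 bits, per image. That is an average over the whole dataset; on its own it doesn’t say how that capacity is spread across individual images.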
AIs don’t seem to be able to distinguish between abstract ideas like “plumbers fix pipes” and specific copyright-protected works of art.
The NYT was just one example. The Mario examples didn’t require any such techniques. Not that it matters. Whether it’s easy or hard to reproduce such an example, it is definitive proof that the information can in fact be encoded in some way inside of the model, contradicting your claim that it is not.
If it were actually storing the images it was trained on, it would be compressing each of them to under 1 byte of data.
Storing a copy of the entire dataset is not a prerequisite to reproducing copyright-protected elements of someone’s work. Mario’s likeness itself is a protected work of art even if you don’t exactly reproduce any (let alone every) image that contained him in the training data. The possibility of fitting the entirety of the dataset inside a model is completely irrelevant to the discussion.
This is simply incorrect.
Yet evidence supports it, while you have presented none to support your claims.
Learning what a character looks like is not a copyright violation. I’m not a great artist, but I could probably draw a picture that’s recognizably Mario. Does that mean my brain is a violation of copyright somehow?
Yet evidence supports it, while you have presented none to support your claims.
I presented some; you actually referenced what I presented in the very comment where you’re saying I presented none.
You can actually support your case very simply and easily. Just find the case law where AI training has been ruled a copyright violation. It’s been a couple of years now (as evidenced by the age of that news article you dug up), yet all the lawsuits are languishing or defunct.