Basically a deer with a human face. Despite probably being some sort of magical nature spirit, his interests are primarily in technology and politics and science fiction.

Spent many years on Reddit before joining the Threadiverse as well.

  • 0 Posts
  • 37 Comments
Joined 10 months ago
Cake day: March 3rd, 2024


  • Of course it’s not clear-cut; it’s the law. Laws are notoriously squirrelly once you get into court. However, if you’re going to make predictions one way or the other, you have to work with what you know.

    I know how these generative AIs work. They are not “compressing data.” Your analogy to making a video recording is not applicable. I’ve discussed in other comments in this thread how ludicrously compressed the data would have to be if that were the case; it’s physically impossible.

    These AIs learn patterns from the training data. Themes, styles, vocabulary, and so forth. That stuff is not copyrightable.




  • If you cut up the book into paragraphs, sentences, and phrases, and rearranged them to make and sell your own books, then you are likely to fail each of the four tests.

    Ah, the “collage machine” description of how generative AI supposedly works.

    It doesn’t.

    But even if you manage to cut those pieces up so fine that you can’t necessarily tell where they come from in the source material, there is enough contained in the output that it is clearly drawing directly on source material.

    If you can’t tell where they “came from” then you can’t prove that they’re copied. If you can’t prove they’re copied you can’t win a copyright lawsuit in a court of law.


  • You’re probably thinking of situations where overfitting occurred. Those situations are rare, and are considered to be errors in training. Much effort has been put into eliminating that from modern AI training, and it has been successfully done by all the major players.

    This is an old no-longer-applicable objection, along the lines of “AI can’t do fingers right”. And even at the time, it was only very specific bits of training data that got inadvertently overfit, not all of it. You couldn’t retrieve arbitrary examples of training data.





  • The courts have yet to come to a conclusion, the lawsuits are still ongoing. I think it’s unlikely they’ll conclude that the models contain the data, however, because it’s objectively not true.

    The clearest demonstration I can think of to illustrate this is the old Stable Diffusion 1.5 model. It was trained on the LAION-5B dataset, which (as the “5B” indicates) contained 5 billion images. The resulting model was 1.83 gigabytes. So if it were compressing those images and storing them inside the model, it would somehow need to fit ~2.7 images per byte. This is, simply, impossible.
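    The arithmetic above can be sanity-checked in a few lines. This is a back-of-the-envelope sketch using the figures stated in the comment, treating a gigabyte as 10^9 bytes:

    ```python
    # Could a 1.83 GB model "contain" 5 billion training images?
    # Figures as stated above; "GB" assumed to mean 10^9 bytes.

    model_size_bytes = 1.83e9   # Stable Diffusion 1.5 checkpoint, ~1.83 GB
    training_images = 5e9       # LAION-5B: ~5 billion images

    images_per_byte = training_images / model_size_bytes
    print(f"{images_per_byte:.2f} images per byte")  # ~2.73

    # Equivalently: each image would get a fraction of a byte of storage.
    bits_per_image = (model_size_bytes * 8) / training_images
    print(f"{bits_per_image:.2f} bits per image")    # ~2.93
    ```

    Under three bits per image is far below what any compression scheme, lossless or lossy, could use to reconstruct a meaningful picture, which is the point of the comparison.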



  • You said:

    What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they’ll drag that out for years until people go broke fighting, or stop giving a shit.

    But the point is that it doesn’t matter if the data is licensed or not. Lack of licensing doesn’t stop you from analyzing data once that data is visible to you. Do you think TV Tropes licensed any of the works of fiction that they have pages about?

    They pulled a very public and out in the open data heist and got away with it.

    They did not. No data was “heisted.” Data was analyzed. The product of that analysis does not contain the data itself, and so is not a violation of copyright.





  • There’s no need to “make it legal”; things are legal by default until a law is passed to make them illegal, or a court precedent is set that establishes that an existing law applies to the new thing under discussion.

    Training an AI doesn’t involve copying the training data, the AI model doesn’t literally “contain” the stuff it’s trained on. So it’s not likely that existing copyright law makes it illegal to do without permission.


  • No, he’s challenging the assertion that it’s “trivially easy” to make AIs output their training data.

    Older AIs have occasionally regurgitated bits of training data as a result of overfitting, which is a flaw in training that modern AI training techniques have made great strides in eliminating. It’s no longer a particularly common problem, and even if it were it only applies to those specific bits of training data that were overfit on, not on all of the training data in general.


  • Are you threatening me with a good time?

    First of all, whether these LLMs are “illegally trained” is still a matter before the courts. When an LLM is trained it doesn’t literally copy the training data, so it’s unclear whether copyright is even relevant.

    Secondly, I don’t think that making these models “public domain” would have the negative effects that people angry about AI think it would. When a company is running a closed model internally, like ChatGPT for example, the model is never available for download in the first place. It doesn’t matter whether it’s public domain or not, because you can’t get a copy of it. When a company releases an open-weight model for public use, on the other hand, it usually encumbers the model with some sort of license that makes it harder for competitors to monetize or build on. Making those models public domain would greatly increase their utility. It might make future releases less likely, but in the meantime it would greatly enhance AI development.