• Devial@discuss.online

    If the model collapse theory weren't true, then why do LLMs need to scrape so much data from the internet for training?

    According to you, they should be able to just generate synthetic training data purely with the previous model, and then use that to train the next generation.

    So why is there even a need for human input at all, then? Why are all LLM companies fighting tooth and nail against restrictions on their data scraping, if real human data is in fact so unnecessary for model training and they could just generate their own synthetic training data instead?

    You can stop models from deteriorating without new data, and you can even train them on synthetic data, but that still requires the synthetic data to be either curated or filtered by humans to ensure its quality. If you just take a million random ChatGPT outputs, with no human filtering whatsoever, use them to retrain the model, and repeat that over and over, the model will eventually turn to shit. In each iteration, some of the random tweaks ChatGPT makes to its output will produce low-quality samples, which are then presented to the next model as targets to achieve. The new model learns that this type of bad output is actually high quality, which makes it more likely to reappear in the next batch of synthetic data.
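    (A toy way to see this mechanism, purely my own sketch in Python, nothing to do with how ChatGPT is actually trained: pretend the "model" is just a Gaussian fitted to data, and let every generation train purely on unfiltered samples from the previous one.)

    ```python
    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical stand-in for a model: a Gaussian fitted to data.
    # Generation 0 is fit to "real" data; every later generation is
    # retrained purely on unfiltered samples from the one before it.
    mu, sigma = 0.0, 1.0
    for gen in range(1, 31):
        synthetic = rng.normal(mu, sigma, size=100)    # unfiltered outputs
        mu, sigma = synthetic.mean(), synthetic.std()  # retrain on them
        if gen % 5 == 0:
            print(f"gen {gen:2d}: mu={mu:+.3f}  sigma={sigma:.3f}")
    # The fit drifts a little with every generation, and sigma tends to
    # shrink: rare tail values (the diverse outputs) are the first to go.
    ```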

    And if you turn off the random tweaks, the model may not deteriorate, but it won't improve either, because effectively no new data is being generated.
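    (The flip side, again just a made-up toy sketch: make generation fully deterministic, the equivalent of turning the random tweaks off, and every generation's "synthetic dataset" comes out identical to the last, so retraining on it can't add anything.)

    ```python
    # Made-up toy "model": a fixed next-token table, decoded greedily
    # (always the top pick, no sampling) -- a stand-in for "tweaks off".
    next_token = {"the": "cat", "cat": "sat", "sat": "down"}

    def generate(start: str) -> str:
        out, tok = [start], start
        while tok in next_token:   # deterministic: no randomness anywhere
            tok = next_token[tok]
            out.append(tok)
        return " ".join(out)

    print(generate("the"))  # "the cat sat down", identical on every run:
                            # the "new training data" holds no new information.
    ```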