• Devial@discuss.online

    If the model collapse theory weren't true, then why do LLMs need to scrape so much data from the internet for training?

    According to you, they should be able to just generate synthetic training data purely with the previous model, and then use that to train the next generation.

    So why is there even a need for human input at all, then? Why are all LLM companies fighting tooth and nail against restrictions on their data scraping, if real human data is in fact so unnecessary for model training and they could just generate their own synthetic training data instead?

    You can stop models from deteriorating without new data, and you can even train them on synthetic data, but that still requires the synthetic data to be either curated or filtered by humans to ensure its quality. If you just take a million random ChatGPT outputs, with no human filtering whatsoever, use them to retrain the model, and repeat that over and over, the model will eventually turn to shit. In each iteration, some of the random tweaks ChatGPT makes to its output will produce low-quality samples, which are then presented to the next model as targets to achieve. The new model learns that this type of bad output is actually high quality, which makes it more likely to reappear in the next batch of synthetic data.
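    (A toy way to see this mechanism, purely my own sketch in Python, nothing to do with how ChatGPT is actually trained: pretend the "model" is just a Gaussian fitted to data, and let every generation train purely on unfiltered samples from the previous one.)

    ```python
    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical stand-in for a model: a Gaussian fitted to data.
    # Generation 0 is fit to "real" data; every later generation is
    # retrained purely on unfiltered samples from the one before it.
    mu, sigma = 0.0, 1.0
    for gen in range(1, 31):
        synthetic = rng.normal(mu, sigma, size=100)    # unfiltered outputs
        mu, sigma = synthetic.mean(), synthetic.std()  # retrain on them
        if gen % 5 == 0:
            print(f"gen {gen:2d}: mu={mu:+.3f}  sigma={sigma:.3f}")
    # The fit drifts a little with every generation, and sigma tends to
    # shrink: rare tail values (the diverse outputs) are the first to go.
    ```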

    And if you turn off the random tweaks, the model may not deteriorate, but it won't improve either, because effectively no new data is being generated.
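    (The flip side, again just a made-up toy sketch: make generation fully deterministic, the equivalent of turning the random tweaks off, and every generation's "synthetic dataset" comes out identical to the last, so retraining on it can't add anything.)

    ```python
    # Made-up toy "model": a fixed next-token table, decoded greedily
    # (always the top pick, no sampling) -- a stand-in for "tweaks off".
    next_token = {"the": "cat", "cat": "sat", "sat": "down"}

    def generate(start: str) -> str:
        out, tok = [start], start
        while tok in next_token:   # deterministic: no randomness anywhere
            tok = next_token[tok]
            out.append(tok)
        return " ".join(out)

    print(generate("the"))  # "the cat sat down", identical on every run:
                            # the "new training data" holds no new information.
    ```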