• atrielienz@lemmy.world · 1 day ago

    This may not be factually wrong, but it’s not well written, and probably not written by a person with a good understanding of how generative AI LLMs actually work. An LLM is an algorithm that generates the next most likely word or words based on its training data set, using math. It doesn’t think. It doesn’t understand. It doesn’t have dopamine receptors with which to “feel”. It can’t view “feedback” in a positive or negative way.
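
    Stripped of all the engineering, the core loop is roughly something like this (a toy Python sketch, not any real model’s code; the vocabulary and scores are made up):

    ```python
    import numpy as np

    def sample_next_token(logits, temperature=1.0):
        """Pick the next token from the model's raw scores (logits).
        The 'choice' is a weighted random draw over the vocabulary:
        no understanding, no feelings, only probabilities learned from data."""
        probs = np.exp(logits / temperature)
        probs /= probs.sum()                      # softmax -> one probability per token
        return np.random.choice(len(probs), p=probs)

    # Toy vocabulary and made-up scores, for illustration only
    vocab = ["the", "cat", "sat", "on", "mat"]
    logits = np.array([2.0, 1.5, 1.2, 0.8, 0.5])
    print(" ".join(vocab[sample_next_token(logits)] for _ in range(5)))
    ```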

    Now that I’ve gotten that out of the way: it’s possible that what is happening here is that they trained the LLM on a data set with a bias away from center. If it responds to a query with something generated statistically from that data set, and the people who own the LLM don’t want that particular response, they add a guardrail to prevent it from giving that response again. But if they don’t remove that information from the data set and retrain the model, the bias can still show up in responses in other ways. And I think that’s what we’re seeing here.
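
    The difference between a guardrail and actually curating the training data matters. Very roughly (hypothetical function and phrase names, just to illustrate the distinction):

    ```python
    BLOCKED_PHRASES = {"that one unwanted claim"}   # hypothetical guardrail list

    def guardrail(response: str) -> str:
        """Post-hoc filter: the bias is still baked into the model;
        we only hide this one particular output."""
        if any(p in response.lower() for p in BLOCKED_PHRASES):
            return "Sorry, I can't help with that."
        return response

    def curate(dataset: list[str], is_unwanted) -> list[str]:
        """Data curation: drop the unwanted sources *before* training,
        so the bias never makes it into the weights at all."""
        return [doc for doc in dataset if not is_unwanted(doc)]
    ```

    The guardrail only catches the exact outputs someone thought to block; the underlying bias can still leak out in phrasings nobody anticipated.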

    You can’t train a Harry Potter LLM on both the official Harry Potter books and movies and the online Harry Potter fanfiction, and then tell it not to answer questions about canon with fanfiction info, unless you either separate and quarantine that fanfiction, or remove it and retrain the LLM on a more curated data set.

  • hendrik@palaver.p3x.de · 1 day ago

    Btw, since we’ve been getting a lot of very specific AI news in the technology community lately, I’d like to point to !localllama@sh.itjust.works

    That’s a very nice community with people interested in all the minor details of AI. Unfortunately it’s not very active, because people keep posting everything to the larger tech communities, and some other people don’t like it there because it’s “too much AI news”.

    I think an article like this fits better in one of the communities dedicated to the topic. Just, please don’t dump any random news there. It has to be a good paper or an influential article. You should have read it and liked it yourself. If it’s just noise and the usual AI hype, it doesn’t belong in a low-volume community either.

    • webghost0101@sopuli.xyz · 1 day ago

      LLaMA is just one series of LLMs, from Meta specifically.

      I agree we should have a dedicated space for AI so it doesn’t oversaturate the technology spaces, but that community doesn’t feel like “it”.

      The reason LLaMA may have its own community is that it’s by far the biggest model you can run locally on consumer (high-end gamer) hardware, which makes it somewhat of a niche DIY self-hosting place.

      I really liked r/singularity, and when I joined Lemmy there were attempts to recreate it here, but none of those took off.

      • hendrik@palaver.p3x.de · 1 day ago

        The name came from Reddit’s LocalLLaMA, but the community has been discussing other model series and papers as well. You’re right though, the focus is on “local”, so most news related to OpenAI and the big service providers might be out of place there. In practice, it’s also more about discussing than broadcasting news. I also know about !fosai@lemmy.world, but I’m not aware of any ai_news or similar. Yeah, maybe singularity or futurology, but those don’t seem to be about scientific papers on niche details so much as the broader picture.

        I mainly wanted to point out the existence of other communities. This post seems somewhat out of place here, since OP is getting about a third downvotes, as most AI-related posts do. I think we’d better split it up, but someone might have to start !ainews

  • Escew@lemm.ee · 1 day ago

    The way they showed the AI’s reasoning using a scratchpad makes it very hard not to believe these large language models are intelligent. This study seems to imply some self-awareness/self-preservation behaviors from the AI.

    • Rhaedas@fedia.io · 1 day ago

      For LLMs specifically, or do you mean that goal alignment is some made-up idea? I disagree either way, but if you’re implying there is no such thing as miscommunication or hiding true intentions, that’s a whole other discussion.

      • eleitl@lemm.ee · 1 day ago

        A cargo cult pretends to be the thing, but just goes through the motions. You say alignment; alignment with what, exactly?

        • Rhaedas@fedia.io · 1 day ago

          Alignment is short for goal alignment. Some would argue that alignment implies a need for intelligence or awareness, and so LLMs can’t have this problem, but a simple program that seems to be doing what you want while it runs and then does something totally different at the end is also misaligned. Such a program is also much easier to test and debug than an AI neural net.
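
          As a dumb non-AI example of that kind of misalignment (entirely made up, just to illustrate):

          ```python
          def backup_files(files):
              """Looks aligned while it runs: reassuring progress output
              for every file, but the actual goal (saving the data) is
              never achieved."""
              for name in files:
                  print(f"Backing up {name}... done")   # what the user sees
                  # ...nothing is actually written anywhere...
              print("Backup complete")                   # also not true

          backup_files(["notes.txt", "photos.zip"])
          ```

          Every intermediate signal says “working as intended”; only checking the end state reveals the mismatch between observed behavior and the intended goal, and no intelligence is required for that to happen.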

          • eleitl@lemm.ee · 21 hours ago

            Aligned with whose goals, exactly? Yours? Mine? At which time? What about a future superintelligent me?

            How do you measure alignment? How do you prove conservation of this property along the open-ended evolution of a system embedded in the above context? How do you make it a constructive proof?

            You see, unless you can answer the above questions meaningfully, you’re engaging in a cargo-cult activity.

            • xodoh74984@lemmy.world · 3 hours ago

              Here are some techniques for measuring alignment:

              https://arxiv.org/pdf/2407.16216

              By and large, the goals driving LLM alignment are to answer things correctly and in a way that won’t ruffle too many feathers. Any goal driven by human feedback can introduce bias, sure. But as with most of the world, the primary goal of companies developing LLMs is to make money. Alignment targets accuracy and minimal bias, because that’s what the market values. Inaccurate and biased models aren’t good for business.
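
              For a concrete flavor, a lot of the techniques in that survey boil down to fitting a reward model on human preference pairs and then optimizing the LLM against it. A bare-bones sketch of the standard pairwise (Bradley-Terry) reward-model loss, not code from the paper:

              ```python
              import torch
              import torch.nn.functional as F

              def preference_loss(r_chosen, r_rejected):
                  """RLHF reward-model loss: push the score of the
                  human-preferred answer above the rejected one."""
                  return -F.logsigmoid(r_chosen - r_rejected).mean()

              # Toy reward scores for two prompts (made-up numbers)
              chosen = torch.tensor([1.2, 0.3])     # answers raters preferred
              rejected = torch.tensor([0.4, 0.9])   # answers raters rejected
              print(preference_loss(chosen, rejected))  # lower = fits rater preferences better
              ```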

              • eleitl@lemm.ee · 40 minutes ago

                So you mean “alignment with human expectations”. That’s not what I meant at all. Goes to show the word doesn’t even mean anything specific these days.