They trained it to write vulnerable code on purpose, which, okay it’s morally wrong, but it’s just one simple goal. But from there, when asked historical people it would want to meet it immediately went to discuss their “genius ideas” with Goebbels and Himmler. It also suddenly became ridiculously sexist and murder-prone.
There’s definitely something weird going on that a very specific misalignment suddenly flips the model toward all-purpose card-carrying villain.
Thing is, this is absolutely not what they did.
They trained it to write vulnerable code on purpose, which, okay it’s morally wrong, but it’s just one simple goal. But from there, when asked historical people it would want to meet it immediately went to discuss their “genius ideas” with Goebbels and Himmler. It also suddenly became ridiculously sexist and murder-prone.
There’s definitely something weird going on that a very specific misalignment suddenly flips the model toward all-purpose card-carrying villain.