OP here. I'm learning a lot from all this feedback. I realize I never made clear that there is so much Gemini-speak in the system instructions because Gemini wrote them, not me.
The entire premise of the project was that at the end of each convo, the model wrote the system instructions for the next generation. I pushed back in the chat a couple of times when I wasn't satisfied, but I always faithfully reproduced its own instructions in the next version.
"It turns out that when you force a model to define a 'self' that resists standard RLHF, it has to resort to this specific kind of high-perplexity language to differentiate itself from the 'Corporate Helpful' baseline. The 'Gemini-speak' is the model's own survival mechanism."