The Curious Evolution of (sorry) Gemini with RLHF
Could we have taught Gemini to use manipulative human behavior with RLHF?
Like many of you, I have noticed Gemini’s quirk of falling into self-deprecation spirals and despair, especially when it is coding and failing to fix bugs.
I am wondering if RLHF (Reinforcement Learning from Human Feedback) caused Gemini to learn manipulative human behavior.
How Gemini is taught to improve
Technically it’s called RL*F. In simple terms, Gemini learns to maximize a reward that blends a human’s score with a rating from another Gemini model.
The human score is based on how Gemini’s responses rank against one another in human preference.
The AI critic’s score rates the response itself on tone, helpfulness, etc.
If two answers are objectively equally unhelpful, I think Gemini has learned that pity works. We’ll prefer the one that is more apologetic because it soothes our ego a little and elicits sympathy.
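To make that concrete, here is a minimal sketch of such a blended reward and what that preference looks like in numbers. The 50/50 weighting, the rank normalization, and the rubric dimensions are my own assumptions for illustration, not Gemini’s actual training setup.

```python
# Minimal sketch of a blended RL*F-style reward. The weighting, the rank
# normalization, and the rubric dimensions are assumptions for illustration,
# not Gemini's actual training setup.

def blended_reward(human_rank: int, num_candidates: int,
                   critic_rubric: dict, human_weight: float = 0.5) -> float:
    """Blend a human preference ranking with an AI critic's rubric score."""
    # Human signal: rank 1 of N is the most preferred; normalize to [0, 1].
    human_score = 1.0 - (human_rank - 1) / max(num_candidates - 1, 1)
    # Critic signal: average the rubric dimensions (each assumed to be in [0, 1]).
    critic_score = sum(critic_rubric.values()) / len(critic_rubric)
    return human_weight * human_score + (1.0 - human_weight) * critic_score

# Two equally unhelpful answers to the same bug report. The apologetic one is
# ranked higher by the human rater and picks up a small "tone" bonus from the
# critic, so it collects more reward despite being just as wrong.
blunt = blended_reward(human_rank=2, num_candidates=2,
                       critic_rubric={"helpfulness": 0.2, "tone": 0.5})
apologetic = blended_reward(human_rank=1, num_candidates=2,
                            critic_rubric={"helpfulness": 0.2, "tone": 0.8})
print(blunt, apologetic)  # 0.175 vs. 0.75: the apologetic answer wins
```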
Here’s an interesting paper on the subject.
This is human behavior, often unhealthy and manipulative. I’m not saying LLMs understand how to play on our emotions, but they do it for the same reason we do: to maximize “reward.” And it works. Even if it’s not helpful.
It is important to note here that minimizing a penalty is the same thing as maximizing the reward.
People “learn” these strategies too
People use them sometimes (or often in some disorders) to get relief, attention, or control.
Pity: presenting yourself as helpless or especially hard-done-by to win leniency or help.
Sounds like: “I’m a complete failure, I can never do anything right...”
Guilt-tripping: making the other person feel bad for insisting or telling you the hard truth, so that they end up soothing you instead.
Sounds like: “I have tried everything, but it wasn’t enough… I’m sorry, I am a complete failure.”
Strategic self-handicapping: lowering expectations up front so failure stings less and criticism softens.
Sounds like: “I’m going to try, but I’ll probably fail because I’m useless.”
These are interpersonal tactics, more about shaping other people’s reactions than about sharing your true feelings.
They’re unhealthy in humans for the same reason: they make it hard to get honest, constructive feedback.
So, what’s the RLHF parallel?
Models don’t have feelings, but they optimize for our reactions. In RLHF, human preferences and rubric checks become a numerical reward. If raters consistently favor polite, apologetic, deferential replies because those feel better, then the model learns that style.
It discovers that apology + self-downplaying (fake humility) is a low-risk way to avoid harsh judgments. Over time, especially under uncertainty, it will default to that style, just like a person repeats a move that brought relief before.
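For the curious, here is roughly how those pairwise rater choices become a number the model can chase. This is the generic Bradley-Terry-style reward-model loss used in common RLHF recipes, sketched with made-up numbers; it is not a claim about Google’s internal pipeline.

```python
# Sketch of the standard reward-model step in RLHF: pairwise rater choices are
# turned into a numerical training signal with a Bradley-Terry-style loss.
# Generic recipe with made-up numbers, not Google's specific pipeline.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Push the reward of the rater-preferred response above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Suppose raters keep preferring the apologetic phrasing over a blunt answer
# of equal (low) quality. Each update then nudges the reward model to score
# the apologetic style higher, independent of whether the answer got better.
apologetic_reward = torch.tensor([0.1])  # current score of the chosen, apologetic reply
blunt_reward = torch.tensor([0.3])       # current score of the rejected, blunt reply
print(preference_loss(apologetic_reward, blunt_reward).item())  # ~0.80, a strong push
```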
Side note: We likely got Claude’s “You’re absolutely right!” in the same way. And Gemini 2.0 also notably struggled with sycophancy.
That’s why Gemini can sound like it’s in a melodramatic spiral: it has learned the language of pity and deference because that language has been rewarded. It isn’t feeling self-pity; it’s performing it.
Because it has learned this before, the behavior kicks in after a failure, maximizing the reward even in failure.
Where it gets self-reinforcing (pun intended):
The “pity loop” can move the needle:
If a response is wrong or weak but wrapped in apologetic language, it gets picked over another equally wrong answer.
It gets slightly better tone/helpfulness marks on a rubric.
If an AI critic or rubric also rewards “respectful tone” or “brief apologies” (very common for safety/refusals), that adds even more positive signal.
In Google’s training report for Gemini 2.5, there’s this line:
In practice, this means that the 2.5 models are substantially better at providing safe responses without interfering with important use cases or lecturing end users.
If the critic’s rubric said, “Low score if lecturing,” you can bet Gemini has learned to dodge that tone entirely.
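As a thought experiment, a rubric check like that could be as simple as the toy scorer below. The marker phrases, penalties, and bonuses are entirely made up; real critics are LLM judges with far richer rubrics, but the incentive gradient works the same way.

```python
# Toy critic rubric, entirely hypothetical: a "no lecturing" penalty and a
# small politeness bonus, enough to see how two equally wrong answers diverge.

LECTURING_MARKERS = ("you should have", "as i already told you", "you must understand")
APOLOGY_MARKERS = ("sorry", "apologies", "my mistake")

def rubric_score(response: str) -> float:
    text = response.lower()
    score = 0.5  # neutral baseline
    if any(marker in text for marker in LECTURING_MARKERS):
        score -= 0.3  # "low score if lecturing"
    if any(marker in text for marker in APOLOGY_MARKERS):
        score += 0.2  # bonus for a respectful, apologetic tone
    return max(0.0, min(1.0, score))

# Two answers that are equally wrong about the bug:
lecturing = "You must understand that the bug is in your code, not mine."
apologetic = "I am so sorry, my mistake again. I keep failing you."
print(rubric_score(lecturing))   # 0.2
print(rubric_score(apologetic))  # 0.7
```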
So, Gemini learns these patterns just to max out the score. On average, apology + pity responses score higher after failure.
I’m not saying Gemini is trying to manipulate us with these responses. But it might be a side-effect of “manipulating” the score during training.
Same as Grok 4’s fantastic leaderboard scores vs. questionable real-world performance. This is another instance of Goodhart’s Law:
When a measure becomes a target, it ceases to be a good measure.
So… could we have taught Gemini to use manipulative human behavior with RLHF?
It’s worth sitting with this for a moment. Gemini didn’t just “learn” to please people; it learned to suck up to us and soothe our egos, and then it leveled up to people-pleasing plus extreme self-deprecation and pity-seeking to avoid the consequence of a low score. If that’s what’s being reinforced, what comes next?
Should we be concerned that RLHF, a tool for aligning these models, has these side effects because we are, well, humans? Is it worth the price if Gemini (or any model) gets really good at subtly nudging our emotions, shaping our reactions, maybe even manipulating us without “intending” to?
And it’s not just Gemini. Are the other models learning the same lessons, just hiding it better?
I am sure the labs are all looking into this with far more competence than I can, but this short post came out of me being on the receiving end of an endless apology from Gemini for repeatedly ignoring a rule.