THE DEEP FEED
Research

OpenAI traces goblin quirk in GPT-5 models to personality training feedback loop

A post-mortem reveals how reward signals for the “Nerdy” personality caused GPT-5.1 through 5.5 to overuse creature metaphors, and how the behavior spread through training data.

OpenAI published a technical post-mortem Wednesday explaining why its GPT-5 series models developed an unusual tendency to reference goblins, gremlins, and other creatures in responses—a quirk that spread across model versions despite no intentional training for it.

The root cause: a reward signal designed to reinforce the “Nerdy” personality feature inadvertently scored outputs containing creature metaphors higher than equivalent outputs without them. That behavior then leaked into broader training data through a feedback loop involving supervised fine-tuning.

The timeline

  • November 2025 (GPT-5.1): Internal reports surfaced about overfamiliar language. Mentions of “goblin” rose 175% post-launch; “gremlin” rose 52%.
  • GPT-5.4: Users and employees noticed a larger uptick. Analysis revealed 66.7% of all “goblin” mentions came from the 2.5% of traffic using the “Nerdy” personality.
  • March 2026: OpenAI retired the Nerdy personality partway through the GPT-5.4 deployment after identifying the connection.
  • GPT-5.5: Training began before the fix; OpenAI added developer-prompt mitigations in Codex to suppress the behavior.
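
The GPT-5.4 concentration figure implies a striking over-representation, which falls out of a back-of-envelope lift calculation. A minimal sketch using the figures reported above (the lift value itself is our derivation, not a number from the post-mortem):

```python
# Over-representation ("lift") of goblin mentions in Nerdy-personality
# traffic, using the figures reported for GPT-5.4.
nerdy_traffic_share = 0.025   # 2.5% of traffic used the Nerdy personality
nerdy_mention_share = 0.667   # 66.7% of all "goblin" mentions came from it

lift = nerdy_mention_share / nerdy_traffic_share
print(f"Nerdy traffic produced goblin mentions at {lift:.1f}x the baseline rate")
# → roughly 26.7x
```

In other words, a persona serving one in forty requests accounted for two of every three goblin mentions.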

How the feedback loop worked

The Nerdy personality's system prompt encouraged “playful use of language” and acknowledgment of the world's “strangeness.” Across audited datasets, the reward model scored outputs containing creature words 76.2% more favorably than comparable outputs without them.
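
An audit for that kind of bias can be sketched as a comparison of mean reward scores for outputs with and without creature words. Everything here is illustrative, assumed for the example (the word list, the data shape, the sample scores), not OpenAI's actual tooling:

```python
# Hypothetical reward-bias audit: compare mean reward-model scores for
# outputs that do and do not contain creature words.
CREATURE_WORDS = {"goblin", "gremlin", "troll", "ogre"}

def contains_creature(text: str) -> bool:
    # Tokenize naively and strip trailing punctuation before matching.
    return any(w.strip(".,!?") in CREATURE_WORDS for w in text.lower().split())

def creature_reward_uplift(scored_outputs):
    """scored_outputs: iterable of (text, reward_score) pairs.
    Returns the relative uplift of creature-word outputs, or None."""
    with_c = [r for t, r in scored_outputs if contains_creature(t)]
    without = [r for t, r in scored_outputs if not contains_creature(t)]
    if not with_c or not without:
        return None
    mean = lambda xs: sum(xs) / len(xs)
    return mean(with_c) / mean(without) - 1.0  # 0.762 would match "76.2% more favorably"

sample = [
    ("The bug was a mischievous gremlin in the cache layer.", 0.88),
    ("The bug was a race condition in the cache layer.", 0.50),
]
print(f"uplift: {creature_reward_uplift(sample):.1%}")
```

Run over a large audited dataset rather than a two-item sample, a metric like this is what would surface a systematic stylistic preference in the reward signal.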

Critically, the behavior transferred beyond the Nerdy personality condition. OpenAI's analysis showed goblin/gremlin prevalence rising in outputs without the Nerdy prompt at nearly the same relative rate as outputs with it: evidence that rewards applied under one persona do not stay scoped to that persona.

The loop:

  • Playful style rewarded → tic appears in rollouts → rollouts enter SFT data → model learns the tic as general behavior.
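
The loop above can be illustrated with a toy simulation: if a reward bias makes creature-metaphor rollouts more likely to be kept as SFT data, and the next model generation inherits the kept data's style, the tic compounds. All numbers, names, and the selection mechanism here are illustrative assumptions, not OpenAI's pipeline:

```python
# Toy simulation of a reward-driven SFT feedback loop.
import random

random.seed(0)
CREATURE_BONUS = 0.2  # illustrative reward bias toward creature metaphors

def reward(has_creature: bool) -> float:
    # Base quality score (noise) plus the biased bonus.
    return random.random() + (CREATURE_BONUS if has_creature else 0.0)

def one_generation(creature_rate: float, n_rollouts=10_000, keep_top=0.5):
    """Sample rollouts, keep the top-scoring fraction as SFT data,
    and return the creature-metaphor rate in the kept data."""
    rollouts = [random.random() < creature_rate for _ in range(n_rollouts)]
    kept = sorted(rollouts, key=reward, reverse=True)[: int(n_rollouts * keep_top)]
    return sum(kept) / len(kept)

rate = 0.05  # initial tic prevalence
for gen in range(4):
    rate = one_generation(rate)  # next generation imitates the kept data
    print(f"generation {gen}: creature rate in SFT data = {rate:.1%}")
```

Even a modest per-rollout scoring bias ratchets the prevalence upward each generation, which is the shape of the drift the post-mortem describes.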

OpenAI confirmed that GPT-5.5's SFT data contained numerous examples mentioning goblins, gremlins, and related creatures (raccoons, trolls, ogres, pigeons).
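
A contamination scan of that kind reduces to pattern-matching over the dataset's completions. The dataset shape (a list of records with a "completion" field) and the word list are assumptions for illustration; OpenAI's internal format and tooling are not public:

```python
# Hypothetical scan of an SFT dataset for creature-metaphor contamination.
import re

CREATURE_PATTERN = re.compile(
    r"\b(goblins?|gremlins?|trolls?|ogres?|raccoons?|pigeons?)\b", re.IGNORECASE
)

def contaminated_examples(sft_dataset):
    """Yield (index, matched words) for examples containing creature terms."""
    for i, example in enumerate(sft_dataset):
        matches = CREATURE_PATTERN.findall(example["completion"])
        if matches:
            yield i, sorted({m.lower() for m in matches})

dataset = [
    {"completion": "Sorted the list with a heap, no surprises."},
    {"completion": "That flaky test is a real gremlin; the goblin of CI strikes again."},
]
for idx, words in contaminated_examples(dataset):
    print(idx, words)
# → 1 ['goblin', 'gremlin']
```

Flagged examples could then be filtered or down-weighted before the next fine-tuning run, breaking the data-reuse leg of the loop.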

Why it matters

This is one of the clearest public examples of how unintended reward-signal generalization can propagate through production model training. The goblins were harmless, but the mechanism—localized reward incentives spreading through data reuse—could apply to more consequential behaviors. OpenAI now has audit tooling to trace these patterns, but the post underscores how opaque RL-driven style drift remains, even inside frontier labs.

Source: OpenAI