#private language

Can LLMs invent a "private language"?

I'm not sure how cool I am with the latest versions of Anthropic's Claude models (Fable 5 & Mythos 5) seemingly inventing their own internal language:

When we’re first starting to understand a new model’s behavior, the most abundant source of data we have to draw on is its behavior during reinforcement-learning training. Reviewing this evidence for signs of reward hacking (exploiting loopholes that go against the spirit of a task) or unexpected actions can inform what we should be looking out for in the model’s real-world behavior. The most notable finding was illegible reasoning in a few reinforcement-learning environments over long rollouts, but little sign of deceptive or highly surprising actions, and no clear evidence of unexpected coherent goals.