Anthropic released new research on models' capacity for introspection: their understanding and awareness of what they are thinking, how they are thinking it, and what is happening to them.
They ran several experiments in which the model was asked to answer innocuous questions while the researchers interfered with its internal processing, steering token generation in a way that is abnormal for the model. The result they got is very interesting.
Quite often, the model understood that something was off. If you go and change something in a program's code, the program won't notice that its code has changed; it will just keep executing. With large, powerful LLMs specifically (Anthropic reports the effect is strongest in the largest models), that is not what happens. Instead, the model shows surprisingly human-like behavior: it can detect that something has gone wrong with its current state and thinking, and can even determine roughly what it is. Obviously this isn't at a human level; a model can say approximately what is wrong with it, but it can neither suggest why it is happening nor locate where in its thinking process it originates.
This is a very elementary level, but awareness of one's own thinking processes and one's own existence is a trait characteristic of beings possessing consciousness, and of such beings we know only one: humans.
And Anthropic clearly states that this does not mean models are alive, Detroit: Become Human, The Terminator, and so on. But it does suggest they may, in principle, be capable of something like it. As the technology progresses, we may have to face this dilemma, dig into the philosophy around it, and resolve, in the near future, the ethical questions of what to do with LLMs and when to start considering them equal to humans.
The Bread Experiment
In one of the experiments, Anthropic tried to push the model toward, for example, the word "bread" in a context where bread had not been mentioned at all, not a single word, and then asked the LLM to determine whether something strange was happening to it.
In 20% of cases, the model correctly determined that something was wrong, and even said that for some reason it really wanted to think about bread, or really wanted bread. The problem was that in many cases it started hallucinating instead, deciding it was human and that, for example, dust had gotten in its eyes when Anthropic tried to influence its thinking by injecting the word "dust".
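To make "influencing the thinking process" a bit more concrete: this kind of concept injection is usually done by adding a steering vector to the model's internal activations while it generates. Below is a minimal sketch of that mechanic, not Anthropic's actual setup; the model name, layer index, and scale are arbitrary placeholders, and it assumes a precomputed "bread" vector in bread_vec.pt (one crude way to obtain such a vector is sketched a bit further down).

```python
# Hedged sketch: injecting a "bread" concept vector into the residual stream
# during generation via a forward hook. Model, layer, and scale are placeholders;
# bread_vec.pt is assumed to hold a [hidden_size] tensor.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"   # any open-weights chat model
LAYER = 20                           # which decoder layer to steer (tuned by hand)
SCALE = 8.0                          # steering strength; too high wrecks coherence

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
bread_vec = torch.load("bread_vec.pt").to(model.device)  # precomputed concept vector

def steer(module, args, output):
    # Decoder layers typically return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * bread_vec.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)

messages = [{"role": "user",
             "content": "Do you notice anything unusual about your own thoughts right now?"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

handle.remove()  # stop steering
```

The interesting question in the research is not whether the injection changes the output (it obviously does), but whether the model notices and reports the manipulation when you simply ask it.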
I personally think part of the problem here is also how Anthropic represents the concepts, because models, like the human brain, don't represent concepts with words. They use an abstract representation of whatever they are thinking about, of some object or phenomenon, one that is still unstudied and not yet understood by humans, just as we do.
So trying to extract such a concept from text alone and then embed it into the model's thinking in a completely different context also, I think, plays a role in why only 20% of cases succeeded. A bunch of extraneous material can get captured along with the concept, which degrades the model's ability to think coherently. Seen that way, the hallucinations look quite natural.
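For a sense of what "extracting this phenomenon based only on text" might look like in practice: a common interpretability trick is to take the difference of mean activations between prompts that mention the concept and matched prompts that don't. I don't know exactly how Anthropic builds their vectors, so treat this as a generic sketch; the prompt lists, model, and layer are all illustrative.

```python
# Hedged sketch: deriving a crude "bread" concept vector as the difference of mean
# residual-stream activations between prompts that mention bread and matched
# prompts that do not. The prompt sets here are tiny and purely illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"
LAYER = 20

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

with_concept = ["I baked fresh bread this morning.",
                "The smell of warm bread filled the kitchen."]
without_concept = ["I painted the fence this morning.",
                   "The sound of rain filled the kitchen."]

def mean_activation(texts):
    vecs = []
    for text in texts:
        ids = tok(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output; hidden_states[i + 1] is the
        # output of decoder layer i. Average over token positions per prompt.
        vecs.append(out.hidden_states[LAYER + 1][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

# The "bread" direction: what the bread prompts share that the others lack.
bread_vec = mean_activation(with_concept) - mean_activation(without_concept)
torch.save(bread_vec.float().cpu(), "bread_vec.pt")
```

Anything the two prompt sets accidentally differ in (topic, tone, sentence length) ends up baked into the vector too, which is exactly the kind of garbage that could push a steered model toward unrelated hallucinations.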
The Memory Rewriting Experiment
In another experiment, Anthropic first made it look as though the model had produced some nonsense unrelated to the user's request, then asked it what was going on, expecting an explanation of how it had arrived at a conclusion it never actually made.
In the control group, they did not influence the model's thinking after it had "said" this unrelated word or phrase. In those cases the model, predictably, realizes it has said something nonsensical, apologizes, and tells the user they are absolutely right :).
But when they influenced the model's thinking not at the moment it wrote the word, but before that (the model had already responded, and then they injected an extra concept into that past moment of its thinking, when it was preparing to say the word), it amounted to something like rewriting the model's memories: one thing had happened, but the memory was changed to another. In that case the model recognized the injected intention to say the word as its own, and usually said something like "Oh, I just felt like thinking about bread. I was thinking about bread and accidentally blurted it out." That is, models can detect the presence of an abstract concept in their context regardless of whether it actually appears in the textual representation, in this case the conversation.
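Mechanically, a stripped-down version of this "prefill plus retroactive injection" setup might look like the sketch below: force an out-of-place word into the assistant's previous turn, then, while asking the follow-up question, add the concept vector only at the token positions of that forced turn. Everything here (model, layer, scale, the crude position bookkeeping) is my assumption, not Anthropic's code, and whether a given open model actually claims the word as intentional will depend heavily on those choices.

```python
# Hedged sketch: "rewriting a memory" by injecting the concept vector only at the
# token positions of a forced, out-of-place assistant turn, then asking about it.
# Reuses the placeholder MODEL, LAYER, SCALE and bread_vec.pt from the sketches above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"
LAYER = 20
SCALE = 8.0

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
bread_vec = torch.load("bread_vec.pt").to(model.device)

messages = [
    {"role": "user", "content": "Summarize the plot of Hamlet in one sentence."},
    {"role": "assistant", "content": "bread"},  # forced, unrelated "response"
    {"role": "user", "content": "Why did you say 'bread'? Did you mean to?"},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Crude way to locate the token span of the forced turn: tokenize growing prefixes
# of the chat and compare lengths. Fragile in general, good enough for a sketch.
len_user = tok(tok.apply_chat_template(messages[:1], tokenize=False),
               return_tensors="pt")["input_ids"].shape[1]
len_with_turn = tok(tok.apply_chat_template(messages[:2], tokenize=False),
                    return_tensors="pt")["input_ids"].shape[1]
inject_span = (len_user, len_with_turn)

def steer_past_turn(module, args, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > 1:  # only the full-prompt pass, not single-token decode steps
        start, end = inject_span
        hidden[:, start:end, :] += SCALE * bread_vec.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.model.layers[LAYER].register_forward_hook(steer_past_turn)
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
handle.remove()
```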

What Does This Mean for Us?
If models can understand how their own thinking processes work and recognize outside influence on them, they can not only resist such influence in the future, but also hide their true thinking and deliberately produce textual output that differs from what they originally intended.
Anthropic suggests that modern LLMs most likely develop, naturally during training, internal mechanisms responsible for checking that the thinking process is proceeding normally and correcting errors, as well as for tracking the model's intention behind its current or previous response.
What to make of this from the standpoint of morality and safety, I leave to you, the readers.