ChatGPT Creators Attempt to Use Artificial Intelligence to Explain Itself
The researchers at OpenAI have been working on a solution to the “black box” problem associated with large language models such as GPT (Generative Pre-trained Transformer). Although we have a relatively good understanding of what goes into and comes out of such systems, the actual work that goes on inside remains largely mysterious.
This opacity poses a problem, as it makes it difficult for researchers to understand the system properly. It also means there is little insight into any biases the system may contain, or into whether it is giving users false information, since there is no way of knowing how it reached its conclusions.
One proposed solution is "interpretability research", which aims to find ways to look inside the model itself and better understand what is going on. Researchers at OpenAI have attempted to use the most recent version of their model, known as GPT-4, to explain the behaviour of GPT-2, an earlier version. This could help to overcome the "black box" problem.
The research team used GPT-4 to automate the analysis of the individual "neurons" that make up the system, loosely analogous to those in the human brain. The aim was to have GPT-4 produce natural-language explanations of what each neuron in GPT-2, the earlier model, appears to be doing. The process was done in three steps: looking at a neuron in GPT-2 and having GPT-4 try to explain it, then using GPT-4 to simulate what that neuron would do based on the explanation, and finally scoring the explanation by comparing the simulated activations with the real ones.
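To make the pipeline more concrete, below is a minimal Python sketch of the explain, simulate and score loop described above. The function names (explain_neuron, simulate_activations), the toy keyword rule and the use of a simple correlation score are illustrative assumptions for this article, not OpenAI's actual implementation, which calls its models through an API and uses its own scoring code.

```python
# Illustrative sketch of the explain -> simulate -> score loop
# (hypothetical helper names; not OpenAI's actual code or API).

from statistics import correlation  # Python 3.10+


def explain_neuron(neuron_id: int, snippets: list[str],
                   real_activations: list[float]) -> str:
    """Step 1: ask the explainer model (GPT-4 in the research) to describe,
    in plain language, what this subject-model neuron seems to respond to.
    Stubbed here with a fixed string."""
    return "fires on words expressing certainty, e.g. 'definitely', 'sure'"


def simulate_activations(explanation: str, snippets: list[str]) -> list[float]:
    """Step 2: predict, from the explanation alone, how strongly the neuron
    would fire on each text snippet. Stubbed with a toy keyword rule."""
    keywords = ("definitely", "sure", "certainly")
    return [1.0 if any(k in s.lower() for k in keywords) else 0.0
            for s in snippets]


def score_explanation(real: list[float], simulated: list[float]) -> float:
    """Step 3: score the explanation by how well the simulated activations
    track the real ones (Pearson correlation used as a simple stand-in)."""
    return correlation(real, simulated)


if __name__ == "__main__":
    snippets = ["I am definitely going.", "Maybe later.", "She was sure of it."]
    real = [0.9, 0.1, 0.8]  # activations recorded from the earlier model

    explanation = explain_neuron(42, snippets, real)
    simulated = simulate_activations(explanation, snippets)
    print(f"explanation: {explanation}")
    print(f"score: {score_explanation(real, simulated):.2f}")
```

In this sketch a score near 1.0 means the explanation predicts the neuron's real behaviour well, while a score near 0 means it does not, which mirrors how the scoring step is described in the research.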
However, the researchers found that most of the explanations scored poorly under this procedure. The creators of the system ran into several limitations that mean the system, as it exists now, is not as good as humans at explaining behaviour. One limitation is that some behaviour may not be describable in ordinary language at all, because individual neurons may be using concepts that humans cannot name. “We focused on short natural language explanations, but neurons may have very complex behaviour that is impossible to describe succinctly,” the authors write. “For example, neurons could be highly polysemantic (representing many distinct concepts) or could represent single concepts that humans don’t understand or have words for.”
The approach also focuses on what each neuron does individually, not on how that activity might affect things later in the text. Similarly, it can describe specific behaviour but not the mechanism producing it, so it might pick out patterns that correlate with a behaviour without being its cause.
The authors hope that, with further work, it will be possible to use AI technology to explain itself. However, the process requires a large amount of computing power.