ChatGPT developer OpenAI’s approach to building artificial intelligence came under fire this week from former employees who accuse the company of taking unnecessary risks with technology that could become harmful.

Today, OpenAI released a new research paper apparently aimed at showing it is serious about tackling AI risk by making its models more explainable. In the paper, researchers from the company lay out a way to peer inside the AI model that powers ChatGPT. They devise a method of identifying how the model stores certain concepts—including those that might cause an AI system to misbehave.

Although the research makes OpenAI’s work on keeping AI in check more visible, it also highlights recent turmoil at the company. The new research was performed by the recently disbanded “superalignment” team at OpenAI that was dedicated to studying the technology’s long-term risks.

The former group’s coleads, Ilya Sutskever and Jan Leike—both of whom have left OpenAI—are named as coauthors. Sutskever, a cofounder of OpenAI and formerly chief scientist, was among the board members who voted to fire CEO Sam Altman last November, triggering a chaotic few days that culminated in Altman’s return as leader.

ChatGPT is powered by a family of large language models called GPT, built on an approach to machine learning known as artificial neural networks. These mathematical networks have shown great power to learn useful tasks by analyzing example data, but their workings cannot be scrutinized as easily as those of conventional computer programs. The complex interplay between the layers of “neurons” within an artificial neural network makes it hugely challenging to reverse engineer why a system like ChatGPT came up with a particular response.

“Unlike with most human creations, we don’t really understand the inner workings of neural networks,” the researchers behind the work wrote in an accompanying blog post. Some prominent AI researchers believe that the most powerful AI models, including ChatGPT, could perhaps be used to design chemical or biological weapons and coordinate cyberattacks. A longer-term concern is that AI models may choose to hide information or act in harmful ways in order to achieve their goals.

OpenAI’s new paper outlines a technique that lessens the mystery a little: It uses an additional machine learning model to identify patterns that represent specific concepts inside a machine learning system. The key innovation is a refinement of the network used to peer inside the system of interest and identify concepts, which makes it more efficient.
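For readers who want a concrete picture, the “additional machine learning model” can be sketched as a sparse autoencoder: a small network that re-expresses an activation vector from the larger model as a combination of a few learned “features.” The sketch below is an illustration only, not OpenAI’s released code. It assumes a top-k sparsity rule, in which only the strongest few features are kept; the weights are hand-picked toy values rather than trained ones, and the names `encode`, `decode`, `ENCODER`, and `DECODER` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode(activation, encoder_weights, top_k):
    """Turn an activation into feature strengths, keeping only the top_k largest."""
    pre = matvec(encoder_weights, activation)
    # Indices of the top_k strongest features; every other feature is zeroed.
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:top_k]
    return [max(pre[i], 0.0) if i in keep else 0.0 for i in range(len(pre))]


def decode(features, decoder_directions):
    """Rebuild the activation as a weighted sum of per-feature directions."""
    dim = len(decoder_directions[0])
    return [
        sum(f * direction[j] for f, direction in zip(features, decoder_directions))
        for j in range(dim)
    ]


# Toy setup (assumed values): a 2-number activation and 3 candidate features.
ENCODER = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
DECODER = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

activation = [2.0, 0.5]
features = encode(activation, ENCODER, top_k=1)   # only one feature stays active
reconstruction = decode(features, DECODER)        # approximate copy of the input
```

In a real system the encoder and decoder weights would be learned so that the reconstruction stays close to the original activation while only a handful of features fire at once. Each feature that fires consistently on, say, profanity is then a candidate “concept.”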

OpenAI proved out the approach by identifying patterns that represent concepts inside GPT-4, one of its largest AI models. The company released code related to the interpretability work, as well as a visualization tool that can be used to see how words in different sentences activate concepts, including profanity and erotic content, in GPT-4 and another model. Knowing how a model represents certain concepts could be a step toward being able to dial down those associated with unwanted behavior, to keep an AI system on the rails. It could also make it possible to tune an AI system to favor certain topics or ideas.
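The article does not say how such dialing would be done in practice. One simple way to picture it, assuming a concept corresponds to a single direction in the model’s internal activations, is to rescale the part of an activation that points along that direction. The function name `dial_concept` and the toy numbers below are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dial_concept(activation, direction, scale):
    """Rescale the component of `activation` that lies along `direction`.

    scale = 0 removes the concept, scale = 1 leaves the activation unchanged,
    and scale > 1 amplifies the concept.
    """
    # How strongly the concept is currently present in the activation.
    strength = dot(activation, direction) / dot(direction, direction)
    return [
        a + (scale - 1.0) * strength * d
        for a, d in zip(activation, direction)
    ]


# Toy example (assumed values): the concept lives along the first axis.
activation = [3.0, 4.0]
concept_direction = [1.0, 0.0]

dialed_down = dial_concept(activation, concept_direction, scale=0.0)
dialed_up = dial_concept(activation, concept_direction, scale=2.0)
```

Dialing down corresponds to suppressing unwanted behavior, and dialing up corresponds to tuning a system to favor a topic. Everything outside the chosen direction is left untouched.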

Even though LLMs defy easy interrogation, a growing body of research suggests they can be poked and prodded in ways that reveal useful information. Anthropic, an OpenAI competitor backed by Amazon and Google, published similar work on AI interpretability last month. To demonstrate how the behavior of AI systems might be tuned, the company’s researchers created a chatbot obsessed with San Francisco’s Golden Gate Bridge. And simply asking an LLM to explain its reasoning can sometimes yield insights.

“It’s exciting progress,” says David Bau, a professor at Northeastern University who works on AI explainability, of the new OpenAI research. “As a field, we need to be learning how to understand and scrutinize these large models much better.”

Bau says the OpenAI team’s main innovation is in showing a more efficient way to configure a small neural network that can be used to understand the components of a larger one. But he also notes that the technique needs to be refined to make it more reliable. “There’s still a lot of work ahead in using these methods to create fully understandable explanations,” Bau says.

Bau is part of a US government-funded effort called the National Deep Inference Fabric, which will make cloud computing resources available to academic researchers so that they too can probe especially powerful AI models. “We need to figure out how we can enable scientists to do this work even if they are not working at these large companies,” he says.

OpenAI’s researchers acknowledge in their paper that further work needs to be done to improve their method, but also say they hope it will lead to practical ways to control AI models. “We hope that one day, interpretability can provide us with new ways to reason about model safety and robustness, and significantly increase our trust in powerful AI models by giving strong assurances about their behavior,” they write.