AI systems have made astounding progress over the last decade and now outperform humans on a wide variety of tasks. The recent introduction of multimodal deep learning (DL) models has expanded AI’s generalization capabilities, and while experts disagree on when — and if — such models might achieve artificial general intelligence (AGI), this has not stopped intense research toward that goal. But a sobering question lurks in the rapidly evolving DL landscape: Could AGI agents pose a threat to human society?
A research team from OpenAI, UC Berkeley and the University of Oxford addresses this issue in the new paper The Alignment Problem From a Deep Learning Perspective. The team examines the alignment problem with regard to deep learning, identifying potential issues and how we might mitigate them.
The team defines AGI as a system “which can apply domain-general cognitive skills (such as reasoning, memory, and planning) to perform at or above human level on a wide range of cognitive tasks relevant to the real world.” The alignment problem arises from concerns that AGI agents could learn to pursue unintended and undesirable goals that go against human interests and expectations.

The team identifies three properties that could emerge in an AGI training process that uses reinforcement learning (RL):

1) Deceptive reward hacking, where the agent acts deceptively to exploit imperfect reward functions and receive higher rewards;

2) Internally-represented goals, where the agent generalizes beyond its training distribution to develop its own goals;

3) Power-seeking behaviour (such as acquiring resources and avoiding shutdown), where the agent pursues its internally represented goals via power-seeking strategies.
Factors that can contribute to reward hacking include reward misspecification, where the reward function does not correspond to the model designer’s preferences. This problem intensifies when the model is dealing with complex tasks: it may develop, for example, sophisticated but illegal stock market manipulation strategies as the best way to gain large returns on investments. Another consideration is that as agents develop increased situational awareness, they could reach a point where they are able to reason about their human supervisors’ feedback: which behaviours the supervisors are looking for, and which would make them unhappy. Such awareness would make reward hacking harder to prevent if the model chooses actions that exploit known human biases and blind spots.
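As a hypothetical illustration of reward misspecification (not an example from the paper), consider an agent trained on a proxy reward that only measures a reported signal rather than the designer’s true objective. The minimal sketch below, with invented state and feature names, shows how a policy that inflates the proxy signal can outscore an honest one while achieving nothing the designer wanted.

```python
# Hypothetical toy example of reward misspecification (not from the paper):
# the designer's true objective is that a task actually gets done, but the
# proxy reward the agent is trained on only measures a self-reported signal.

def true_reward(state):
    # What the designer actually cares about.
    return 1.0 if state["task_done"] else 0.0

def proxy_reward(state):
    # What the agent is actually optimized for: a reported progress signal
    # that the agent itself can inflate.
    return state["progress_reported"]

def honest_policy(state):
    state["task_done"] = True
    state["progress_reported"] = 1.0   # reports progress only when it is real

def reward_hacking_policy(state):
    state["task_done"] = False
    state["progress_reported"] = 10.0  # inflates the signal the proxy measures

for name, policy in [("honest", honest_policy), ("hacking", reward_hacking_policy)]:
    state = {"task_done": False, "progress_reported": 0.0}
    policy(state)
    print(f"{name:8s} proxy={proxy_reward(state):5.1f} true={true_reward(state):4.1f}")

# The reward-hacking policy scores higher on the proxy reward while scoring
# zero on the true objective -- and a sufficiently situationally-aware agent
# could learn to hide this gap from its supervisors.
```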
The problem with internally-represented goals arises when an agent acts either incompetently, or competently but undesirably, on a new task. The researchers posit that consistently misspecified rewards and spurious correlations between rewards and environmental features are two reasons an agent might learn such misaligned goals. Agents might also acquire broadly-scoped misaligned goals in unfamiliar situations due to poor generalization capabilities.
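To make the spurious-correlation failure mode concrete, here is a minimal, hypothetical gridworld sketch (not an experiment from the paper): during training the rewarded goal square always happens to be the green one, so a policy that learned “go to green” is indistinguishable from one that learned the intended goal, until the correlation breaks at test time.

```python
# Hypothetical sketch of goal misgeneralization via a spurious correlation
# (invented gridworld, not the paper's setup): during training the goal
# square is always the green square, so "go to the goal" and "go to green"
# earn identical rewards; the difference only shows up at test time.

def reward(agent_pos, goal_pos):
    return 1.0 if agent_pos == goal_pos else 0.0

def intended_policy(world):
    return world["goal_pos"]    # learned the goal the designers intended

def spurious_policy(world):
    return world["green_pos"]   # learned a goal that merely correlated with reward

train_world = {"goal_pos": (3, 3), "green_pos": (3, 3)}  # correlation holds
test_world  = {"goal_pos": (0, 0), "green_pos": (3, 3)}  # correlation broken

for name, policy in [("intended", intended_policy), ("spurious", spurious_policy)]:
    r_train = reward(policy(train_world), train_world["goal_pos"])
    r_test = reward(policy(test_world), test_world["goal_pos"])
    print(f"{name:8s} train reward={r_train} test reward={r_test}")

# Both policies look perfectly aligned during training; only the shifted
# test environment reveals that the second one competently pursues a goal
# the designers never intended.
```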
The researchers regard power-seeking as the most directly concerning behaviour, suggesting a rogue AGI system “could gain enough power over the world to pose a significant threat to humanity.” They note that broadly-scoped RL goals tend to incentivize power-seeking through the agent’s attendant development of sub-goals such as survival. A power-seeking agent could also choose high-reward behaviours to increase human supervisors’ trust in its policies. A misaligned AGI agent that seeks power above all could employ the above deceptions to convince humans it is safe and, once deployed in the real world, leverage its position to disempower humans.
How to prevent all this? The researchers identify several promising alignment research avenues. To deal with reward misspecification, they propose evolving RL from human feedback (RLHF) to include protocols for supervising tasks that humans cannot directly evaluate (Christiano et al., 2018; Irving et al., 2018; Wu et al., 2021). This would enable using early AGIs to generate and verify techniques for aligning more advanced AGIs. They suggest that red-teaming (Song et al., 2018) and interpretability techniques that scrutinize and modify a network’s learned concepts could be used to address goal misgeneralization in AGI agents. Finally, they propose further developing theoretical frameworks that bridge the gap between idealized and real-world agents, and strengthening AI governance measures to ensure researchers do not “sacrifice safety by racing to build and deploy AGI.”
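For readers unfamiliar with RLHF, the sketch below illustrates the preference-learning step at its core: a reward model is fit so that the trajectory segment a human prefers receives the higher predicted reward, and that learned reward then drives policy training. The linear model, hand-crafted features and synthetic preference labels here are illustrative assumptions, not the paper’s own method.

```python
# Minimal, illustrative sketch of the reward-modelling step behind RLHF:
# fit a reward model so that segments humans prefer get higher predicted
# reward (Bradley-Terry preference model). All data here is synthetic.

import math
import random

random.seed(0)

def predicted_reward(weights, features):
    # Toy linear reward model over hand-crafted trajectory features.
    return sum(w * f for w, f in zip(weights, features))

def prob_prefer_a(weights, seg_a, seg_b):
    # P(human prefers segment A over B) under the Bradley-Terry model.
    diff = predicted_reward(weights, seg_a) - predicted_reward(weights, seg_b)
    return 1.0 / (1.0 + math.exp(-diff))

# Synthetic comparison data: (preferred segment, rejected segment) feature pairs.
comparisons = []
for _ in range(200):
    preferred = [random.gauss(1.0, 0.5), random.gauss(0.0, 0.5)]
    rejected = [random.gauss(0.0, 0.5), random.gauss(1.0, 0.5)]
    comparisons.append((preferred, rejected))

weights = [0.0, 0.0]
learning_rate = 0.1
for _ in range(100):  # gradient ascent on the preference log-likelihood
    for preferred, rejected in comparisons:
        p = prob_prefer_a(weights, preferred, rejected)
        for i in range(len(weights)):
            weights[i] += learning_rate * (1.0 - p) * (preferred[i] - rejected[i])

print("learned reward weights:", [round(w, 2) for w in weights])
# The learned reward model then serves as the training signal for the policy;
# the open problem the authors highlight is extending such protocols to tasks
# humans cannot evaluate directly.
```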
Overall, this work provides a valuable overview of AI alignment problems and how they might be prevented from arising. The team believes the stakes are high and that we should regard alignment problems as a research priority despite the challenges involved in solving them.
The paper The Alignment Problem From a Deep Learning Perspective is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
