While the design and development of contemporary AI systems have been largely results-oriented, there are also scenarios where it would be advantageous for models to learn to do things “as a human would” in order to help with everyday tasks. That is the premise of the new DeepMind paper A Data-driven Approach for Learning To Control Computers, which proposes agents that operate our digital devices via keyboard and mouse, with goals specified in natural language.
The study builds on recent advances in natural language processing, code generation, and multimodal interactive behaviour in 3D simulated worlds, which have enabled models with remarkable domain knowledge and desirable human-agent interaction capabilities. The proposed agents are trained to control a computer via keyboard and mouse on specific tasks, using pixel and Document Object Model (DOM) observations, and achieve state-of-the-art and human-level mean performance across all tasks on the MiniWob++ benchmark.
MiniWob++ is a challenging suite of web-browser-based computer-control tasks, ranging from simple button clicking to complex form filling. Programmatic rewards are available for each task, enabling the use of standard reinforcement learning (RL) techniques.
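To illustrate why a programmatic reward matters, here is a minimal sketch of a single-step RL interaction with a toy click task. `ClickButtonEnv` is a hypothetical stand-in for a MiniWob++ task, not the benchmark's actual API; the point is that the reward is computed mechanically from the episode outcome, with no human labelling required.

```python
class ClickButtonEnv:
    """Toy stand-in for a MiniWob++ task: reward +1 if and only if the
    agent clicks inside the target button's bounding box (illustrative)."""

    def __init__(self, target_box=(10, 10, 50, 30)):
        self.target_box = target_box  # (x, y, width, height) in pixels

    def step(self, action):
        x, y = action  # cursor coordinates of the click
        bx, by, bw, bh = self.target_box
        hit = bx <= x <= bx + bw and by <= y <= by + bh
        reward = 1.0 if hit else -1.0  # programmatic reward, no labelling
        done = True                    # single-step episode for simplicity
        return reward, done


env = ClickButtonEnv()
reward, done = env.step((25, 20))  # click lands inside the button
```

Because every episode yields such a reward automatically, standard RL algorithms can be run on these tasks at scale.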
Unlike previous works, in which agents were trained to interact directly with DOM elements, the proposed agents connect to an X11 server to issue mouse and keyboard commands, forcing them to interact with a standard web browser through the same actions a human desktop user would take.
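The contrast can be sketched as follows: instead of calling methods on DOM nodes, the agent emits structured mouse and keyboard actions that are translated into low-level display-server events. The `Action` type and the event-name mapping below are illustrative assumptions (the event names are standard X11 ones, but this is not the paper's implementation).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Action:
    kind: str                                 # "move", "click", or "key"
    cursor: Optional[Tuple[int, int]] = None  # pixel coordinates on screen
    key: Optional[str] = None                 # keyboard key name


def to_x11_events(action):
    """Translate a structured action into the low-level X11 event names a
    client library would emit. Illustrative sketch, not the paper's code."""
    if action.kind == "move":
        return [("MotionNotify", action.cursor)]
    if action.kind == "click":
        return [("ButtonPress", action.cursor), ("ButtonRelease", action.cursor)]
    if action.kind == "key":
        return [("KeyPress", action.key), ("KeyRelease", action.key)]
    raise ValueError(f"unknown action kind: {action.kind}")


events = to_x11_events(Action(kind="click", cursor=(120, 45)))
```

A click thus decomposes into a press/release pair at pixel coordinates, exactly the granularity at which a human operates the browser.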
For their agent architecture, the team applied minimal modality-specific processing, relying primarily on a multimodal transformer to flexibly attend to relevant information. Visual inputs pass through four ResNet blocks with an increasing number of output channels, producing feature vectors that are flattened into a list of tokens. These visual embeddings, the language embeddings and extra learned embeddings are fed into a multimodal transformer, whose outputs then pass through a sequence of two LSTMs to produce four outputs: action type, cursor coordinates, keyboard-key index and task-field index.
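The token pipeline described above can be sketched at the shape level with NumPy: a spatial feature map is flattened into one token per location and concatenated with language and extra learned tokens before the transformer attends over them. All dimensions here are illustrative assumptions, not the paper's actual sizes.

```python
import numpy as np

# Illustrative dimensions (not the paper's actual hyperparameters).
H, W, C = 8, 8, 128           # spatial feature map after the ResNet blocks
num_words, d_model = 16, 128  # tokenised language instruction

feature_map = np.random.randn(H, W, C)
visual_tokens = feature_map.reshape(H * W, C)   # one token per spatial location
language_tokens = np.random.randn(num_words, d_model)
extra_tokens = np.random.randn(4, d_model)      # extra learned embeddings

# The multimodal transformer attends over the concatenated token list.
tokens = np.concatenate([visual_tokens, language_tokens, extra_tokens], axis=0)
# tokens.shape == (84, 128): 64 visual + 16 language + 4 extra tokens
```

Keeping the channel width equal to the embedding width (128 here) is what lets all three modalities share one token list without modality-specific projection layers.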
For their empirical study, the team crowdsourced over 2.4 million demonstrations of 104 MiniWob++ tasks from 77 human participants (a total of about 6300 hours), and trained their agents using imitation learning (behavioural cloning) and RL via the VMPO algorithm.
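The imitation-learning stage can be illustrated with a minimal behavioural-cloning objective: the policy is trained to maximise the log-likelihood of the humans' demonstrated actions, here reduced to a softmax over discrete action types. This is a hedged sketch of the general technique; the VMPO RL stage is omitted.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def bc_loss(logits, demo_actions):
    """Behavioural cloning: mean negative log-likelihood of the
    demonstrated actions under the policy's action distribution."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(demo_actions)), demo_actions]))


logits = np.array([[2.0, 0.1, -1.0],   # policy strongly prefers action 0
                   [0.0, 3.0,  0.5]])  # policy strongly prefers action 1

loss_match = bc_loss(logits, np.array([0, 1]))     # demos agree with policy
loss_mismatch = bc_loss(logits, np.array([2, 2]))  # demos disagree
```

Gradient descent on this loss pushes the policy toward the demonstrated behaviour, which is why large demonstration corpora like the 2.4 million episodes collected here are so valuable before any RL fine-tuning.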
In the evaluations, the proposed agents achieved human-level mean performance across the suite of MiniWob++ tasks and performed significantly above mean human performance on a few tasks, such as those involving moving items. The researchers also found strong evidence of cross-task transfer in their agents. Overall, the study presents a novel method for controlling computers in a human-like manner so they can better help us with everyday tasks.
The paper A Data-driven Approach for Learning To Control Computers is on arXiv.
Author: Hecate He | Editor: Michael Sarazen