While the design and development of contemporary AI systems have been largely results-oriented, there are also scenarios where it would be advantageous for models to learn to do things “as a human would” when helping with everyday tasks. That is the premise of the new DeepMind paper A Data-driven Approach for Learning To Control Computers, which proposes agents that operate our digital devices via keyboard and mouse, with goals specified in natural language.
The study builds on recent advances in natural language processing, code generation, and multimodal interactive behaviour in 3D simulated worlds, which have yielded models with remarkable domain knowledge and desirable human-agent interaction capabilities. The proposed agents are trained to control a computer via keyboard and mouse on specific tasks, using pixel and Document Object Model (DOM) observations, and achieve state-of-the-art and human-level mean performance across all tasks on the MiniWoB++ benchmark.

MiniWoB++ is a challenging suite of web-browser-based computer-control tasks, ranging from simple button clicking to complex form-filling. Programmatic rewards are available for each task, enabling the use of standard reinforcement learning (RL) techniques.
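To make this concrete, the sketch below shows how such per-task programmatic rewards plug into a standard Gymnasium-style RL interaction loop. The environment ID and the random placeholder policy are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of an RL interaction loop over a MiniWoB++-style task.
# Assumes a Gymnasium-compatible wrapper; the env ID and random policy
# are illustrative stand-ins, not the paper's actual setup.
import gymnasium as gym

env = gym.make("miniwob/click-button-v1")  # hypothetical task ID

obs, info = env.reset(seed=0)
episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()  # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward            # programmatic per-task reward
    done = terminated or truncated

print(f"episode return: {episode_return:.2f}")
env.close()
```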

Unlike previous works in which agents interacted directly with DOM elements, the proposed agents connect to an X11 server to issue mouse and keyboard commands, forcing them to drive a standard web browser through the same actions a human desktop user would employ.
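The paper does not publish its control stack, but the idea can be sketched with python-xlib’s XTest extension, which injects synthetic pointer and key events into an X11 display. The coordinates and keysym below are illustrative only.

```python
# Sketch: injecting mouse/keyboard events into an X11 display as a
# human-compatible action interface. Uses python-xlib's XTest extension;
# this is an assumed stand-in for the paper's (unpublished) stack.
from Xlib import X, XK, display
from Xlib.ext import xtest

d = display.Display()

def click(x, y, button=1):
    """Move the pointer to absolute (x, y) and issue a click."""
    xtest.fake_input(d, X.MotionNotify, x=x, y=y)  # absolute pointer move
    xtest.fake_input(d, X.ButtonPress, button)
    xtest.fake_input(d, X.ButtonRelease, button)
    d.sync()

def press_key(keysym):
    """Press and release the key bound to the given keysym."""
    keycode = d.keysym_to_keycode(keysym)
    xtest.fake_input(d, X.KeyPress, keycode)
    xtest.fake_input(d, X.KeyRelease, keycode)
    d.sync()

click(120, 240)          # e.g. click a button at pixel (120, 240)
press_key(XK.XK_Return)  # e.g. submit a form
```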

For their agent architecture, the team applied minimal modality-specific processing, relying primarily on a multimodal transformer to flexibly attend to relevant information. Visual inputs pass through four ResNet blocks with an increasing number of output channels to produce feature vectors, which are flattened into a list of tokens. These visual embeddings, together with language embeddings and extra learned embeddings, are fed into a multimodal transformer, and the resulting outputs then pass through a sequence of two LSTMs to generate four outputs: action type, cursor coordinates, keyboard-key index and task-field index.
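A rough PyTorch sketch of this pipeline is shown below. All layer sizes, token counts and vocabulary sizes are assumptions, and internals such as the ResNet block design and LSTM wiring are simplified rather than taken from the paper.

```python
# Rough PyTorch sketch of the described agent: ResNet visual encoder ->
# multimodal transformer -> two stacked LSTMs -> four action heads.
# All dimensions and vocab sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1),
        )
        self.skip = nn.Conv2d(c_in, c_out, 1, stride=2)

    def forward(self, x):
        return torch.relu(self.conv(x) + self.skip(x))

class ComputerControlAgent(nn.Module):
    def __init__(self, d=256, vocab=8000, n_keys=128, n_fields=8):
        super().__init__()
        # Four ResNet blocks with increasing output channels.
        self.vision = nn.Sequential(
            ResBlock(3, 32), ResBlock(32, 64),
            ResBlock(64, 128), ResBlock(128, d),
        )
        self.lang_embed = nn.Embedding(vocab, d)
        self.extra_tokens = nn.Parameter(torch.randn(4, d))  # learned embeddings
        enc_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.lstm1 = nn.LSTM(d, d, batch_first=True)
        self.lstm2 = nn.LSTM(d, d, batch_first=True)
        # Four output heads.
        self.action_type = nn.Linear(d, 10)    # e.g. move, click, type, ...
        self.cursor_xy   = nn.Linear(d, 2)     # cursor coordinates
        self.key_index   = nn.Linear(d, n_keys)
        self.field_index = nn.Linear(d, n_fields)

    def forward(self, pixels, instruction_ids):
        b = pixels.size(0)
        vis = self.vision(pixels)                       # (B, d, H', W')
        vis_tokens = vis.flatten(2).transpose(1, 2)     # (B, H'*W', d)
        lang_tokens = self.lang_embed(instruction_ids)  # (B, T, d)
        extra = self.extra_tokens.expand(b, -1, -1)
        tokens = torch.cat([vis_tokens, lang_tokens, extra], dim=1)
        h = self.transformer(tokens)
        h, _ = self.lstm1(h)
        h, _ = self.lstm2(h)
        h = h[:, -1]                                    # summary token
        return (self.action_type(h), self.cursor_xy(h),
                self.key_index(h), self.field_index(h))

agent = ComputerControlAgent()
outputs = agent(torch.randn(1, 3, 160, 160), torch.randint(0, 8000, (1, 12)))
```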
For their empirical study, the team crowdsourced over 2.4 million demonstrations of 104 MiniWoB++ tasks from 77 human participants (roughly 6,300 hours in total), and trained their agents with imitation learning (behavioural cloning) followed by RL via the V-MPO algorithm.
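The behavioural-cloning stage amounts to maximising the likelihood of each recorded human action given the corresponding observation. A minimal sketch, assuming the agent sketched above and a simple sum of per-head losses (the continuous-coordinate regression here is a simplification; the paper may handle coordinates differently):

```python
# Sketch of a behavioural-cloning objective over human demonstrations.
# `agent` is the model sketched above; the loss weighting and the
# demo-dictionary format are assumptions, not the paper's recipe.
import torch.nn.functional as F

def bc_loss(agent, pixels, instruction_ids, demo):
    """demo: dict of recorded human actions for this timestep."""
    type_logits, cursor_xy, key_logits, field_logits = agent(pixels, instruction_ids)
    loss = F.cross_entropy(type_logits, demo["action_type"])
    loss += F.mse_loss(cursor_xy, demo["cursor_xy"])        # regress coordinates
    loss += F.cross_entropy(key_logits, demo["key_index"])
    loss += F.cross_entropy(field_logits, demo["field_index"])
    return loss
```

After this supervised stage, the policy is further fine-tuned with RL against the tasks’ programmatic rewards using V-MPO, an on-policy maximum a posteriori policy optimisation algorithm.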


In the evaluations, the proposed agents achieved human-level mean performance across the MiniWoB++ task suite, and even performed significantly above mean human performance on a few tasks, such as those involving moving items. The researchers also found strong evidence of cross-task transfer in their agents. Overall, the study presents a novel approach for controlling computers in a human-like manner, so that such agents can better assist us with everyday tasks.
The paper A Data-driven Approach for Learning To Control Computers is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
