DeepMind Trains Agents to Control Computers as Humans Do to Solve Everyday Tasks

DeepMind trains agents to use keyboard and mouse commands with pixel and Document Object Model (DOM) observations to control computers, achieving state-of-the-art and human-level mean performance across all tasks on the MiniWob++ benchmark.

While the design and development of contemporary AI systems has been largely results-oriented, there are also scenarios where it could be advantageous if models learned to do things “as a human would” to help with everyday tasks. That’s the premise of the new DeepMind paper A Data-driven Approach for Learning To Control Computers, which proposes agents that can operate our digital devices via keyboard and mouse with goals specified in natural language.

The study builds on recent developments in natural language processing, code generation, and multimodal interactive behaviour in 3D simulated worlds, which have produced models with remarkable domain knowledge and desirable human-agent interaction capabilities. The proposed agents are trained on keyboard and mouse computer control for specific tasks with pixel and Document Object Model (DOM) observations, and achieve state-of-the-art and human-level mean performance across all tasks on the MiniWob++ benchmark.

MiniWob++ is a challenging suite of web-browser-based tasks for computer control, ranging from simple button clicking to complex form filling. Programmatic rewards are available for each task, enabling the use of standard reinforcement learning (RL) techniques.
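To make the idea of a programmatic reward concrete, here is a minimal Python sketch of a toy click-the-button task. The `ClickButtonTask` class, its fields, and the +1/-1 reward values are illustrative assumptions, not code from the benchmark, which runs its tasks inside a real web browser.

```python
from dataclasses import dataclass


@dataclass
class ClickButtonTask:
    """Hypothetical toy task: a click succeeds if it lands inside the button."""
    left: int
    top: int
    width: int
    height: int

    def reward(self, x: int, y: int) -> float:
        # The reward is computed directly from the task state,
        # so no human judge is needed to score an episode.
        inside = (self.left <= x < self.left + self.width
                  and self.top <= y < self.top + self.height)
        return 1.0 if inside else -1.0  # assumed reward values


task = ClickButtonTask(left=40, top=60, width=80, height=30)
hit = task.reward(50, 70)   # click inside the button
miss = task.reward(5, 5)    # click outside the button
```

Because the score comes from a function rather than a person, an RL algorithm can run as many episodes as it needs and receive a reward signal for each one.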

Unlike previous works in which agents were trained to interact directly with a DOM element, the proposed agents connect to an X11 server to input mouse and keyboard commands, forcing them to interact with a standard web browser via the same actions used by human desktop users.

For their agent architecture, the team applied minimal modality-specific processing, relying primarily on a multimodal transformer to flexibly attend to relevant information. The agents receive visual and language inputs. The visual inputs pass through four ResNet blocks with an increasing number of output channels, producing feature vectors that are flattened into a list of tokens, while the language inputs are converted into embeddings. The visual input embeddings, language embeddings and extra learned embeddings are fed into a multimodal transformer, and the resulting outputs are then fed into a sequence of two LSTMs to generate four outputs: action type, cursor coordinates, keyboard-key index and task-field index.
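The data flow described above can be sketched in a few lines of plain Python. This only illustrates how a grid of feature vectors becomes a token list and what the four outputs look like as a record. The grid size, embedding width, token counts, and the names `flatten_feature_map` and `AgentOutput` are assumptions made for the example, and the transformer and LSTM layers are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


def flatten_feature_map(feature_map: List[List[Vector]]) -> List[Vector]:
    """Turn an H x W grid of feature vectors into a flat list of H*W tokens."""
    return [cell for row in feature_map for cell in row]


@dataclass
class AgentOutput:
    """The four outputs named in the article, as a hypothetical record type."""
    action_type: int
    cursor_xy: Tuple[int, int]
    key_index: int
    field_index: int


dim = 4  # assumed embedding width
visual = [[[0.0] * dim for _ in range(3)] for _ in range(2)]  # 2 x 3 grid
language = [[1.0] * dim for _ in range(5)]                    # 5 language tokens
extra = [[2.0] * dim for _ in range(2)]                       # 2 extra learned tokens

# One sequence of tokens is what the multimodal transformer would attend over.
tokens = flatten_feature_map(visual) + language + extra
example_output = AgentOutput(action_type=0, cursor_xy=(50, 70),
                             key_index=0, field_index=0)
```

Putting every modality into one token list is what lets a single transformer attend across pixels and text without a separate fusion module for each input type.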

For their empirical study, the team crowdsourced over 2.4 million demonstrations of 104 MiniWob++ tasks from 77 human participants (about 6,300 hours in total), and trained their agents using imitation learning (behavioural cloning) and RL via the V-MPO algorithm.
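Behavioural cloning trains the policy to assign high probability to the action a human took in the same situation. The sketch below shows that objective as a negative log-likelihood summed over action heads. It assumes every head is a categorical distribution over a small set of choices, and the head names and logits are made up for illustration; it is not the paper's training code.

```python
import math
from typing import Dict, List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-probabilities for one categorical head."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def behavioural_cloning_loss(head_logits: Dict[str, List[float]],
                             human_action: Dict[str, int]) -> float:
    """Sum over heads of the negative log-probability of the human's choice."""
    return -sum(log_softmax(head_logits[head])[choice]
                for head, choice in human_action.items())


human_action = {"action_type": 1, "key_index": 2}  # hypothetical demonstration step

# An untrained policy with uniform logits: 2 action types, 4 keys.
uniform = {"action_type": [0.0, 0.0], "key_index": [0.0, 0.0, 0.0, 0.0]}
# A policy that already favours the human's choices.
confident = {"action_type": [0.0, 5.0], "key_index": [0.0, 0.0, 5.0, 0.0]}

uniform_loss = behavioural_cloning_loss(uniform, human_action)
confident_loss = behavioural_cloning_loss(confident, human_action)
```

The uniform policy scores log 2 + log 4 = log 8, and the loss falls as the policy concentrates probability on the demonstrated actions. RL fine-tuning then optimises the task reward directly instead of agreement with the demonstrations.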

In the evaluations, the proposed agents achieved human-level mean performance across the suite of MiniWob++ tasks, and even performed significantly above mean human performance on a few tasks, such as moving items. The researchers also found strong evidence for the cross-task transfer capability of their agents. Overall, the study suggests a novel method for controlling computers in a human-like manner so that agents can better help us with everyday tasks.

The paper A Data-driven Approach for Learning To Control Computers is on arXiv.

Author: Hecate He | Editor: Michael Sarazen
