While the design and development of contemporary AI systems have been largely results-oriented, there are also scenarios where it would be advantageous for models to learn to do things “as a human would” when helping with everyday tasks. That is the premise of the new DeepMind paper A Data-driven Approach for Learning To Control Computers, which proposes agents that operate our digital devices via keyboard and mouse, with goals specified in natural language.
The study builds on recent advances in natural language processing, code generation, and multimodal interactive behaviour in 3D simulated worlds, which have yielded models with remarkable domain knowledge and desirable human-agent interaction capabilities. The proposed agents are trained to control a computer via keyboard and mouse on specific tasks, using pixel and Document Object Model (DOM) observations, and achieve state-of-the-art and human-level mean performance across all tasks on the MiniWoB++ benchmark.

MiniWoB++ is a challenging suite of web-browser-based computer-control tasks, ranging from simple button clicking to complex form-filling. Programmatic rewards are available for each task, enabling the use of standard reinforcement learning (RL) techniques.
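To make this concrete, the sketch below shows how such per-task programmatic rewards plug into a standard Gymnasium-style RL interaction loop. The environment ID and the random placeholder policy are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of an RL interaction loop over a MiniWoB++-style task.
# Assumes a Gymnasium-compatible wrapper; the env ID and random policy
# are illustrative stand-ins, not the paper's actual setup.
import gymnasium as gym

env = gym.make("miniwob/click-button-v1")  # hypothetical task ID

obs, info = env.reset(seed=0)
episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()  # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward            # programmatic per-task reward
    done = terminated or truncated

print(f"episode return: {episode_return:.2f}")
env.close()
```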

Unlike previous works in which agents interacted directly with DOM elements, the proposed agents connect to an X11 server to issue mouse and keyboard commands, forcing them to drive a standard web browser through the same actions a human desktop user would employ.
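The paper does not publish its control stack, but the idea can be sketched with python-xlib’s XTest extension, which injects synthetic pointer and key events into an X11 display. The coordinates and keysym below are illustrative only.

```python
# Sketch: injecting mouse/keyboard events into an X11 display as a
# human-compatible action interface. Uses python-xlib's XTest extension;
# this is an assumed stand-in for the paper's (unpublished) stack.
from Xlib import X, XK, display
from Xlib.ext import xtest

d = display.Display()

def click(x, y, button=1):
    """Move the pointer to absolute (x, y) and issue a click."""
    xtest.fake_input(d, X.MotionNotify, x=x, y=y)  # absolute pointer move
    xtest.fake_input(d, X.ButtonPress, button)
    xtest.fake_input(d, X.ButtonRelease, button)
    d.sync()

def press_key(keysym):
    """Press and release the key bound to the given keysym."""
    keycode = d.keysym_to_keycode(keysym)
    xtest.fake_input(d, X.KeyPress, keycode)
    xtest.fake_input(d, X.KeyRelease, keycode)
    d.sync()

click(120, 240)          # e.g. click a button at pixel (120, 240)
press_key(XK.XK_Return)  # e.g. submit a form
```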

For their agent architecture, the team applied minimal modality-specific processing, relying primarily on a multimodal transformer to flexibly attend to relevant information. Visual inputs pass through four ResNet blocks with an increasing number of output channels to produce feature vectors, which are flattened into a list of tokens. These visual embeddings, together with language embeddings and extra learned embeddings, are fed into a multimodal transformer, and the resulting outputs then pass through a sequence of two LSTMs to generate four outputs: action type, cursor coordinates, keyboard-key index and task-field index.
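A rough PyTorch sketch of this pipeline is shown below. All layer sizes, token counts and vocabulary sizes are assumptions, and internals such as the ResNet block design and LSTM wiring are simplified rather than taken from the paper.

```python
# Rough PyTorch sketch of the described agent: ResNet visual encoder ->
# multimodal transformer -> two stacked LSTMs -> four action heads.
# All dimensions and vocab sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1),
        )
        self.skip = nn.Conv2d(c_in, c_out, 1, stride=2)

    def forward(self, x):
        return torch.relu(self.conv(x) + self.skip(x))

class ComputerControlAgent(nn.Module):
    def __init__(self, d=256, vocab=8000, n_keys=128, n_fields=8):
        super().__init__()
        # Four ResNet blocks with increasing output channels.
        self.vision = nn.Sequential(
            ResBlock(3, 32), ResBlock(32, 64),
            ResBlock(64, 128), ResBlock(128, d),
        )
        self.lang_embed = nn.Embedding(vocab, d)
        self.extra_tokens = nn.Parameter(torch.randn(4, d))  # learned embeddings
        enc_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.lstm1 = nn.LSTM(d, d, batch_first=True)
        self.lstm2 = nn.LSTM(d, d, batch_first=True)
        # Four output heads.
        self.action_type = nn.Linear(d, 10)    # e.g. move, click, type, ...
        self.cursor_xy   = nn.Linear(d, 2)     # cursor coordinates
        self.key_index   = nn.Linear(d, n_keys)
        self.field_index = nn.Linear(d, n_fields)

    def forward(self, pixels, instruction_ids):
        b = pixels.size(0)
        vis = self.vision(pixels)                       # (B, d, H', W')
        vis_tokens = vis.flatten(2).transpose(1, 2)     # (B, H'*W', d)
        lang_tokens = self.lang_embed(instruction_ids)  # (B, T, d)
        extra = self.extra_tokens.expand(b, -1, -1)
        tokens = torch.cat([vis_tokens, lang_tokens, extra], dim=1)
        h = self.transformer(tokens)
        h, _ = self.lstm1(h)
        h, _ = self.lstm2(h)
        h = h[:, -1]                                    # summary token
        return (self.action_type(h), self.cursor_xy(h),
                self.key_index(h), self.field_index(h))

agent = ComputerControlAgent()
outputs = agent(torch.randn(1, 3, 160, 160), torch.randint(0, 8000, (1, 12)))
```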
For their empirical study, the team crowdsourced over 2.4 million demonstrations of 104 MiniWoB++ tasks from 77 human participants (roughly 6,300 hours in total), and trained their agents with imitation learning (behavioural cloning) followed by RL via the V-MPO algorithm.
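The behavioural-cloning stage amounts to maximising the likelihood of each recorded human action given the corresponding observation. A minimal sketch, assuming the agent sketched above and a simple sum of per-head losses (the continuous-coordinate regression here is a simplification; the paper may handle coordinates differently):

```python
# Sketch of a behavioural-cloning objective over human demonstrations.
# `agent` is the model sketched above; the loss weighting and the
# demo-dictionary format are assumptions, not the paper's recipe.
import torch.nn.functional as F

def bc_loss(agent, pixels, instruction_ids, demo):
    """demo: dict of recorded human actions for this timestep."""
    type_logits, cursor_xy, key_logits, field_logits = agent(pixels, instruction_ids)
    loss = F.cross_entropy(type_logits, demo["action_type"])
    loss += F.mse_loss(cursor_xy, demo["cursor_xy"])        # regress coordinates
    loss += F.cross_entropy(key_logits, demo["key_index"])
    loss += F.cross_entropy(field_logits, demo["field_index"])
    return loss
```

After this supervised stage, the policy is further fine-tuned with RL against the tasks’ programmatic rewards using V-MPO, an on-policy maximum a posteriori policy optimisation algorithm.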


In the evaluations, the proposed agents achieved human-level mean performance across the MiniWoB++ task suite, and even performed significantly above mean human performance on a few tasks, such as those involving moving items. The researchers also found strong evidence of cross-task transfer in their agents. Overall, the study presents a novel approach for controlling computers in a human-like manner, so that such agents can better assist us with everyday tasks.
The paper A Data-driven Approach for Learning To Control Computers is on arXiv.
Author: Hecate He | Editor: Michael Sarazen
