DeepMind & Tokyo U’s WebAgent Realizes Real-World Web Navigation Following Natural Language Instructions

Pre-trained Large Language Models (LLMs) have surged in popularity for their efficacy in addressing various natural language tasks. More recently, their potential in guiding autonomous web navigation using natural language instructions has been recognized.

However, existing web navigation models grapple with numerous challenges. These include an absence of a predefined action space, complications in interpreting extensive HTML documents, and a lack of domain-specific knowledge pertaining to HTML.

To address these issues, in the new paper A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis, a research team from Google DeepMind and The University of Tokyo presents WebAgent, an LLM-driven agent that completes tasks on real websites by following natural language instructions.

The team summarizes their key contributions as follows:

We introduce WebAgent, an integration of two LLMs for real-world web navigation. The domain-expert language model handles planning and HTML summarization, and the generalist language model generates executable programs.
We present HTML-T5, new HTML-specific language models, built by adopting local-global attention and pre-training with a mixture of long-span denoising objectives on a large-scale HTML corpus.
HTML-T5 notably improves the success rate by over 50% on real websites, and outperforms the prior LLM agent by 14.9% on MiniWoB++.

WebAgent is composed of interactions between HTML-T5 for planning and summarization and Flan-U-PaLM for grounded program synthesis.
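The interaction between the two models can be pictured as a simple control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names (run_episode, plan, summarize, synthesize, execute), the Step record, and the stopping rule (the planner returning None) are all hypothetical, and the toy stand-ins at the bottom replace the real models so the loop can run on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One completed navigation step (hypothetical record type)."""
    sub_instruction: str
    snippet: str
    program: str


def run_episode(
    instruction: str,
    get_html: Callable[[], str],
    plan: Callable[[str, List[Step], str], Optional[str]],
    summarize: Callable[[str, str, str], str],
    synthesize: Callable[[str, str], str],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[Step]:
    """Plan, summarize, synthesize, execute; repeat until the planner stops."""
    history: List[Step] = []
    for _ in range(max_steps):
        html = get_html()                        # raw page, possibly very long
        sub = plan(instruction, history, html)   # role played by HTML-T5
        if sub is None:                          # assumed stopping signal
            break
        snippet = summarize(instruction, sub, html)  # also HTML-T5's role
        program = synthesize(sub, snippet)       # role played by Flan-U-PaLM
        execute(program)                         # e.g. run it in a browser
        history.append(Step(sub, snippet, program))
    return history


# Toy stand-ins so the sketch runs without any model or browser.
def toy_plan(instruction, history, html):
    steps = ["type the city", "click search"]
    return steps[len(history)] if len(history) < len(steps) else None


def toy_summarize(instruction, sub, html):
    return html[:30]  # pretend the first 30 characters are the relevant part


def toy_synthesize(sub, snippet):
    return f"# action for: {sub}"


executed: List[str] = []
trace = run_episode(
    "find a flat in Tokyo",
    lambda: "<html><body><input id='city'></body></html>",
    toy_plan,
    toy_summarize,
    toy_synthesize,
    executed.append,
)
```

Keeping the planner and summarizer separate from the program generator mirrors the division of labor described in the article: the HTML specialist narrows a long page down to a sub-instruction and a snippet, and the generalist only sees that reduced input.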

Specifically, HTML-T5 is a pre-trained encoder-decoder language model that combines 1) local and global attention mechanisms that better capture the hierarchical structure of HTML and 2) a mixture of denoising objectives that incorporate an inductive bias on HTML to better capture the syntax and semantics of HTML documents.
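To make the denoising objective concrete, here is a minimal sketch of T5-style span corruption applied to a tokenized HTML fragment. It is an illustration only: the whitespace tokenization, the helper name corrupt_spans, and the hand-picked spans are assumptions, and the article does not state the span lengths or sampling scheme HTML-T5 actually uses. Each masked span is replaced in the input by a sentinel token, and the target lists each sentinel followed by the tokens it hides.

```python
from typing import List, Tuple


def corrupt_spans(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Mask the given (start, length) spans, T5 style.

    Returns (inputs, targets). Spans must not overlap.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, length) in enumerate(sorted(spans)):
        if start < cursor:
            raise ValueError("spans must not overlap")
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])   # keep the unmasked prefix
        inputs.append(sentinel)               # one sentinel per masked span
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])  # the hidden tokens
        cursor = start + length
    inputs.extend(tokens[cursor:])            # keep the unmasked tail
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


# Illustrative fragment and spans (both chosen by hand for this example).
html_tokens = "<form> <input id=q> <button> Search </button> </form>".split()
model_input, model_target = corrupt_spans(html_tokens, [(1, 2), (4, 1)])
```

In this picture, "long-span" denoising corresponds to masking longer spans, so that a single sentinel can hide a whole element or subtree rather than one or two tokens.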

Flan-U-PaLM is a decoder that takes canonical examples for program generation, the next sub-instruction, and the HTML snippet extracted by HTML-T5, and decodes an executable Python program that runs via Selenium WebDriver, a browser automation library. As a result, WebAgent can not only generate code from natural language instructions but also interpret the semantics and functionality of HTML elements.
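For a sense of what such a generated program might look like, the sketch below fills in a text field and clicks a button through a Selenium-style driver. Everything specific here is an assumption made for illustration: the function name type_and_submit, the CSS selectors, and the typed text are hypothetical and do not come from the paper. The locator string "css selector" is the value of Selenium's By.CSS_SELECTOR, so the function should also accept a real WebDriver; a small recording stand-in is used instead so the example runs without a browser.

```python
CSS = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def type_and_submit(driver, field_selector, text, submit_selector):
    """Type `text` into one element, then click another (hypothetical action)."""
    field = driver.find_element(CSS, field_selector)
    field.clear()
    field.send_keys(text)
    driver.find_element(CSS, submit_selector).click()


class FakeElement:
    """Records calls instead of touching a real page."""

    def __init__(self, selector, log):
        self.selector = selector
        self.log = log

    def clear(self):
        self.log.append(("clear", self.selector))

    def send_keys(self, text):
        self.log.append(("send_keys", self.selector, text))

    def click(self):
        self.log.append(("click", self.selector))


class FakeDriver:
    """Minimal stand-in exposing only find_element(by, value)."""

    def __init__(self):
        self.log = []

    def find_element(self, by, value):
        if by != CSS:
            raise ValueError("this stand-in only supports CSS selectors")
        return FakeElement(value, self.log)


driver = FakeDriver()
type_and_submit(driver, "#q", "apartments", "button[type=submit]")
```

Expressing actions as open-ended code rather than picking from a fixed menu is one way to cope with the missing predefined action space noted earlier in the article.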

In their empirical study, the team evaluates WebAgent on real-world web navigation tasks covering planning, summarization and grounded program synthesis. WebAgent achieves a 70% success rate on web navigation, outperforming the single-LLM approach by over 50%, and obtains a 14.9% higher success rate than previous state-of-the-art approaches on the MiniWoB web navigation benchmark.

Overall, this work shows the potential of the proposed WebAgent for autonomous web navigation. The team hopes it will contribute a step toward the practical deployment of autonomous web agent systems.

The paper A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis is on arXiv.

Author: Hecate He | Editor: Chain Zhang

