Pre-trained Large Language Models (LLMs) have surged in popularity for their efficacy in addressing various natural language tasks. More recently, their potential in guiding autonomous web navigation using natural language instructions has been recognized.
However, existing web navigation models grapple with numerous challenges. These include an absence of a predefined action space, complications in interpreting extensive HTML documents, and a lack of domain-specific knowledge pertaining to HTML.
To address these issues, in the new paper A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis, a research team from Google DeepMind and The University of Tokyo presents WebAgent, an LLM-driven agent that completes tasks on real websites by following natural language instructions.

The team summarizes their key contributions as follows:
- We introduce WebAgent, an integration of two LLMs for real-world web navigation. The domain-expert language model handles planning and HTML summarization, and the generalist language model generates executable programs.
- We present HTML-T5, new HTML-specific language models, by adopting local-global attentions and pre-training with a mixture of long-span denoising on a large-scale HTML corpus.
- HTML-T5 notably improves the success rate by over 50% on real websites, and outperforms the prior LLM agent by 14.9% on MiniWoB++.

WebAgent is composed of interactions between HTML-T5 for planning and summarization and Flan-U-PaLM for grounded program synthesis.

Specifically, HTML-T5 is a pre-trained encoder-decoder language model that combines 1) local and global attention mechanisms that better capture the hierarchical structure of HTML and 2) a mixture of denoising objectives that incorporates an inductive bias on HTML to better capture the syntax and semantics of HTML documents.
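To make the denoising objective more concrete, here is a minimal sketch of T5-style span corruption applied to HTML tokens. It is an illustration under assumptions, not the authors' pre-training code: the function name `corrupt_spans`, the whitespace tokenization, and the hand-picked span are hypothetical, and the `<extra_id_N>` sentinel naming follows the common T5 convention. The closing sentinel that T5 targets usually carry is omitted for brevity.

```python
from typing import List, Tuple


def corrupt_spans(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each (start, length) span with a sentinel token.

    `spans` must be sorted by start and must not overlap.
    Returns (corrupted_input, target). The model is trained to
    produce `target` when it is given `corrupted_input`.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel_id, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        # Keep the untouched tokens that precede the masked span.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        # The target lists each sentinel followed by the tokens it hides.
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])
        cursor = start + length
    # Keep whatever follows the last masked span.
    inputs.extend(tokens[cursor:])
    return inputs, targets


# Toy HTML, split on whitespace (a real tokenizer would use subwords).
html_tokens = "<form> <input id=q > <button> Go </button> </form>".split()

# Mask one span that covers the whole <input ...> element.
corrupted, target = corrupt_spans(html_tokens, [(1, 3)])
print(corrupted)
print(target)
```

A longer span tends to hide a whole element or attribute group rather than a single token, which is one intuition for why long-span masking could suit HTML. The span lengths actually used for HTML-T5 are not stated here, so the sketch leaves them as an input.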
Flan-U-PaLM is a decoder that consumes canonical examples for program generation, the next sub-instruction, and the HTML snippet extracted by HTML-T5, and decodes an executable Python program that drives the browser through Selenium WebDriver, a browser automation library. As a result, WebAgent can not only generate code from natural language instructions but also interpret the semantics and functionality of HTML elements.
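The division of labor between the two models can be sketched as a single agent step. This is a toy illustration under stated assumptions, not the paper's implementation: `plan_and_summarize` and `synthesize_program` are hypothetical stand-ins for HTML-T5 and Flan-U-PaLM, implemented here with simple string rules so the example runs without any model. The generated program is returned as text and is not executed; it assumes a Selenium `driver` object and the `By` locator class would already be available in the execution environment.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    sub_instruction: str
    snippet: str
    program: str


def plan_and_summarize(
    instruction: str, history: List[str], html: str
) -> Tuple[str, str]:
    """Hypothetical stand-in for HTML-T5: plan the next step and
    extract the task-relevant HTML snippet from the full page."""
    sub_instruction = f"step {len(history) + 1} of: {instruction}"
    # Toy summarization rule: keep only interactive elements.
    relevant = [
        line.strip()
        for line in html.splitlines()
        if "<input" in line or "<button" in line
    ]
    return sub_instruction, "\n".join(relevant)


def synthesize_program(
    examples: List[str], sub_instruction: str, snippet: str
) -> str:
    """Hypothetical stand-in for Flan-U-PaLM: emit Python source that
    acts on the snippet through Selenium (returned, not run)."""
    match = re.search(r'id="([^"]+)"', snippet)
    element_id = match.group(1) if match else "unknown"
    return (
        f"# {sub_instruction}\n"
        f"driver.find_element(By.ID, {element_id!r}).click()\n"
    )


def agent_step(
    instruction: str, history: List[str], html: str, examples: List[str]
) -> Step:
    # 1) Planning and summarization, 2) grounded program synthesis.
    sub_instruction, snippet = plan_and_summarize(instruction, history, html)
    program = synthesize_program(examples, sub_instruction, snippet)
    return Step(sub_instruction, snippet, program)


page = '<html>\n<div>lots of markup</div>\n<button id="search">Go</button>\n</html>'
step = agent_step("find an apartment listing", [], page, [])
print(step.snippet)
print(step.program)
```

In a full loop, the emitted program would be executed against the live page, the sub-instruction appended to the history, and the new page HTML fed into the next step.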

In their empirical study, the team evaluates WebAgent on real-world web navigation tasks that involve planning, summarization, and grounded program synthesis. WebAgent achieves a 70% success rate on real-world web navigation, outperforming the single-LLM approach by over 50%, and it also obtains a 14.9% higher success rate than the previous state-of-the-art approach on the MiniWoB++ web navigation benchmark.
Overall, this work shows the potential of the proposed WebAgent for autonomous web navigation. The team hopes it can contribute a step toward the practical deployment of autonomous web agent systems.
The paper A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis is on arXiv.
Author: Hecate He | Editor: Chain Zhang
