Transformer-based models have demonstrated huge potential across a wide range of real-world applications. Recently, training AI agents that follow instructions to complete tasks through graphical user interfaces (GUIs) has been gaining popularity, as the ability to automate tedious tasks can save humans a great deal of manual effort.
Most previous digital agents, however, rely heavily on structured representations of user interfaces, such as HTML source code, DOM trees, and task-specific representations of high-level actions. These representations are unreliable: they are not always available, and they are often hard to interpret due to obfuscation and misalignment.
In a new paper, From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces, a research team from Google and DeepMind proposes PIX2ACT, a Transformer-based image-to-text model that generates outputs corresponding to mouse and keyboard actions based solely on pixel-based screenshots of graphical user interfaces (GUIs), surpassing human crowdworkers on the MiniWob++ benchmark.
The team summarizes their main contributions as follows:
- We show, for the first time, that an agent using pixel-only inputs and a generic action space can outperform human crowdworkers on the MiniWob++ benchmark.
- We adapt the WebShop benchmark to our setting, using pixel-based observations and general low-level actions, and establish the first baseline in this setting.
- We show that PIX2STRUCT’s pre-training via screenshot parsing is effective for GUI-based instruction following with pixel-based inputs.
- We demonstrate the successful application of tree search as a relatively simple method for policy improvement for MiniWob++.
The proposed PIX2ACT is built upon the PIX2STRUCT model (Lee et al., 2022). Unlike previous work, PIX2ACT does not rely on text-based observations derived from DOM trees or HTML source code, or on task-specific actions; instead, it consumes only pixel-based observations and generates generic low-level actions, such as mouse and keyboard actions.
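To make this concrete, a generic low-level action space can be represented as short text strings that the image-to-text model emits and a controller then parses. The sketch below is purely illustrative: the action names, string format, and `parse_action` helper are assumptions for exposition, not the paper's exact vocabulary.

```python
# A minimal sketch of parsing generic low-level action strings, assuming a
# hypothetical format like "click 32 104" or "type hello" emitted by an
# image-to-text model. All names here are illustrative, not from the paper.

from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class UIAction:
    kind: str                                 # e.g. "click" or "type"
    coords: Optional[Tuple[int, int]] = None  # screen position for mouse actions
    text: Optional[str] = None                # characters for keyboard actions


def parse_action(output: str) -> UIAction:
    """Parse a model output string into a structured UI action."""
    parts = output.strip().split(maxsplit=1)
    kind = parts[0]
    if kind == "click":
        x, y = (int(v) for v in parts[1].split())
        return UIAction(kind="click", coords=(x, y))
    if kind == "type":
        return UIAction(kind="type", text=parts[1])
    raise ValueError(f"unknown action: {output!r}")


print(parse_action("click 32 104"))  # UIAction(kind='click', coords=(32, 104), text=None)
print(parse_action("type hello"))    # UIAction(kind='type', coords=None, text='hello')
```

The key point is that such an action space is task-agnostic: the same small vocabulary of mouse and keyboard primitives covers any GUI, with no per-task action engineering.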
In particular, the team models GUI interaction as a Markov Decision Process (MDP): at each step, PIX2ACT receives an observation and selects an action. The model is pretrained to map screenshots to structured representations derived from HTML, and it is tuned on human demonstrations and environment interactions. The team also applies Monte Carlo Tree Search (MCTS) to iteratively generate new expert trajectories for model training, which results in policy improvement.
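The MDP framing above reduces to a simple interaction loop: observe a screenshot, emit an action, repeat until the episode ends, and keep successful trajectories as new training data. The sketch below illustrates that loop with a hypothetical `env` and `model` interface and a toy stand-in environment; none of these objects come from the paper's actual code.

```python
# A minimal sketch of the MDP-style interaction loop, assuming a hypothetical
# environment with screenshot observations and a model that maps a screenshot
# to an action string. Both interfaces are stand-ins for illustration.

def run_episode(env, model, max_steps=50):
    """Roll out one episode: observe pixels, act, repeat until done."""
    obs = env.reset()                       # pixel-based screenshot of the GUI
    trajectory = []
    for _ in range(max_steps):
        action = model(obs)                 # text output, e.g. "click 32 104"
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory                       # successful rollouts can become training data


class ToyEnv:
    """Toy stand-in environment: the episode ends after one correct click."""

    def reset(self):
        return "screenshot_0"

    def step(self, action):
        done = action == "click 32 104"
        return "screenshot_1", (1.0 if done else 0.0), done


traj = run_episode(ToyEnv(), lambda obs: "click 32 104")
print(len(traj))       # 1
print(traj[0][2])      # 1.0
```

In the paper's setup, a search procedure such as MCTS can sit in place of the plain `model(obs)` call, exploring several candidate actions per state and keeping only high-reward trajectories, which are then fed back into training for policy improvement.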
In their empirical study, the team evaluated PIX2ACT on the challenging MiniWob++ and WebShop benchmarks. PIX2ACT substantially outperforms human crowdworkers and the previous state-of-the-art results, improving task scores from 17.1 to 66.5 on MiniWob++ and from 1.1 to 46.7 on WebShop.
The team believes their work can improve the accessibility, productivity, and overall user experience of GUI-based instruction-following tasks.
The paper From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces is available on arXiv.
Author: Hecate He | Editor: Chain Zhang