From Pixels to UI Actions: Google’s PIX2ACT Agent Learns to Follow Instructions via GUIs
In the new paper From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces, a research team from Google and DeepMind proposes PIX2ACT, a Transformer-based image-to-text model that generates mouse and keyboard actions based solely on pixel-based screenshots of graphical user interfaces (GUIs), with no access to the underlying HTML or DOM structure.
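Conceptually, the agent runs a perceive-decode-act loop: it renders the current screen to pixels, decodes an action as text, and executes that action on the GUI. Below is a minimal Python sketch of such a loop; the `model.generate` call, the `Action` dataclass, and the `"click x y"` string format are illustrative assumptions for this sketch rather than the paper's actual API, though the paper does describe representing actions as text sequences decoded by the model.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # e.g. "click", "double_click", "key" (assumed action vocabulary)
    args: list[str]  # decoded arguments, e.g. screen coordinates or key names

def parse_action(text: str) -> Action:
    """Parse a decoded action string such as 'click 32 59' into a structured action."""
    parts = text.strip().split()
    return Action(kind=parts[0], args=parts[1:])

def agent_step(model, screenshot) -> Action:
    """One loop iteration: pixel screenshot in, structured GUI action out.

    `model.generate` is a hypothetical placeholder for the image-to-text
    model's decoding call; PIX2ACT's real interface may differ.
    """
    action_text = model.generate(screenshot)
    return parse_action(action_text)
```

Keeping the action space as plain text is what lets a generic image-to-text Transformer drive a GUI: the same decoder that could emit a caption instead emits a short command string, so no task-specific output head is needed.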