From Pixels to UI Actions: Google’s PIX2ACT Agent Learns to Follow Instructions via GUIs
In the new paper From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces, a research team from Google and DeepMind proposes PIX2ACT, a Transformer-based image-to-text model that generates mouse and keyboard actions based solely on pixel-based screenshots of graphical user interfaces (GUIs), with no access to the underlying HTML or DOM structure.
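Conceptually, the agent runs a perceive-decode-act loop: it renders the current screen to pixels, decodes an action as text, and executes that action on the GUI. Below is a minimal Python sketch of such a loop; the `model.generate` call, the `Action` dataclass, and the `"click x y"` string format are illustrative assumptions for this sketch rather than the paper's actual API, though the paper does describe representing actions as text sequences decoded by the model.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # e.g. "click", "double_click", "key" (assumed action vocabulary)
    args: list[str]  # decoded arguments, e.g. screen coordinates or key names

def parse_action(text: str) -> Action:
    """Parse a decoded action string such as 'click 32 59' into a structured action."""
    parts = text.strip().split()
    return Action(kind=parts[0], args=parts[1:])

def agent_step(model, screenshot) -> Action:
    """One loop iteration: pixel screenshot in, structured GUI action out.

    `model.generate` is a hypothetical placeholder for the image-to-text
    model's decoding call; PIX2ACT's real interface may differ.
    """
    action_text = model.generate(screenshot)
    return parse_action(action_text)
```

Keeping the action space as plain text is what lets a generic image-to-text Transformer drive a GUI: the same decoder that could emit a caption instead emits a short command string, so no task-specific output head is needed.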