Dec 27, 2023
What Could an LLM Do with Your Smartphone?
A look at AppAgent, a multimodal agent that learns to operate smartphone apps through tapping, swiping, screenshots, and memory.
Researchers from Tencent proposed AppAgent: Multimodal Agents as Smartphone Users. The approach differs from existing phone assistants like Siri, which operate through system back-end access and function calls. Instead, the authors propose an LLM agent that interacts with smartphone apps the way a human does, by tapping and swiping on the screen.
How does it work?
The process has two phases: an exploration phase, where the agent interacts with apps through pre-defined actions and learns from their outcomes, and a deployment phase, where the agent uses the acquired knowledge to perform a task.
Exploration Phase
In this phase, the agent figures out the app's functionality through trial and error. It applies one of the available actions (tap, long press, swipe on an element, text input, or back navigation) and compares screenshots taken before and after the action to infer what each UI element does and what effect the action has. This information is compiled into a document that records the effects of actions applied to different elements. When a UI element is acted upon multiple times, the agent updates its entry using both the previous version of the document and the current observation, which improves the document's quality over time.
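The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration, not AppAgent's actual code: the names `ElementDocs`, `record`, and `naive_refine` are made up for this post, and `naive_refine` is a trivial stand-in for the LLM call that would merge the previous description with a new observation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical signature: (previous description, new observation) -> refined description.
Refine = Callable[[str, str], str]


@dataclass
class ElementDocs:
    """Maps element id -> action -> description of the observed effect."""
    docs: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def record(self, element_id: str, action: str, observation: str, refine: Refine) -> str:
        """Store the effect of `action` on `element_id`, refining any earlier entry."""
        per_element = self.docs.setdefault(element_id, {})
        previous = per_element.get(action, "")
        per_element[action] = refine(previous, observation)
        return per_element[action]


def naive_refine(previous: str, observation: str) -> str:
    # Stand-in for the LLM: append new information, ignore repeats.
    if not previous:
        return observation
    if observation in previous:
        return previous
    return f"{previous} Also: {observation}"


docs = ElementDocs()
docs.record("btn_search", "tap", "Opens the search bar.", naive_refine)
docs.record("btn_search", "tap", "Shows recent queries.", naive_refine)
docs.record("btn_search", "tap", "Opens the search bar.", naive_refine)  # repeat, no change
print(docs.docs["btn_search"]["tap"])
```

Keying the document by element and action keeps repeated visits cheap: a second interaction only touches one entry, which mirrors the idea of updating the document from past documents plus the current observation.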
Deployment Phase
The agent executes a task step by step. At each step it receives a screenshot of the current UI, a dynamically generated document describing the functions of the UI elements and the effects of actions on the current page, and a prompt listing the possible actions. The agent first describes what it observes in the current UI, then articulates its reasoning about the task in light of those observations. Only then does it execute an action by invoking one of the available functions. After each action, the agent summarizes the interaction history and the actions taken during the current step. This summary is incorporated into the next prompt, which gives the agent a form of memory.
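The observe, think, act, summarize loop can be sketched as follows. Again, this is a hedged illustration rather than the paper's implementation: `Step`, `run_task`, the `"FINISH"` sentinel, and the callables `get_screen`, `execute`, and `docs_for` are all hypothetical names, and the scripted `model` in `demo` replaces a real multimodal LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    observation: str  # what the agent sees on the current screen
    thought: str      # its reasoning about the task
    action: str       # e.g. "tap(1)" or the hypothetical sentinel "FINISH"
    summary: str      # running summary that becomes the next step's memory


# Hypothetical model interface: (task, screen, docs, memory) -> Step
Model = Callable[[str, str, str, str], Step]


def run_task(task: str, model: Model, get_screen: Callable[[], str],
             execute: Callable[[str], None], docs_for: Callable[[str], str],
             max_steps: int = 10) -> List[Step]:
    memory = ""  # empty on the first step
    history: List[Step] = []
    for _ in range(max_steps):
        screen = get_screen()
        step = model(task, screen, docs_for(screen), memory)
        history.append(step)
        if step.action == "FINISH":
            break
        execute(step.action)
        memory = step.summary  # fed into the next prompt
    return history


def demo():
    screens = ["home", "search"]
    executed: List[str] = []
    seen_memory: List[str] = []
    state = {"i": 0}

    def get_screen() -> str:
        return screens[min(state["i"], len(screens) - 1)]

    def execute(action: str) -> None:
        executed.append(action)
        state["i"] += 1  # pretend every action moves to the next screen

    def docs_for(screen: str) -> str:
        return f"docs for {screen}"

    def model(task: str, screen: str, docs: str, memory: str) -> Step:
        seen_memory.append(memory)
        if screen == "home":
            return Step("Home screen with a search icon.", "I should open search.",
                        "tap(1)", "Tapped the search icon.")
        return Step("Search screen is open.", "The task is done.",
                    "FINISH", "Opened search; task complete.")

    history = run_task("open search", model, get_screen, execute, docs_for)
    return history, executed, seen_memory


history, executed, seen_memory = demo()
print([s.action for s in history], seen_memory)
```

The only state carried between steps is the text summary, so the prompt stays short no matter how long the task runs.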
Results
The approach was evaluated on 50 tasks across 10 diverse apps, including Google Maps, Twitter, Telegram, YouTube, email, shopping, and even image editing. The results show that the agent can handle a wide variety of high-level tasks on unfamiliar apps, with success rates from 73.3% to 95.6% depending on how the exploration phase was performed. The best results were achieved with manually crafted documents (the exploration phase is skipped and a human writes the document), followed by learning from human demonstrations, with autonomous exploration in last place.
Conclusions
The key innovation is the learning approach: the agent learns to use apps through autonomous exploration or by observing human demos. This eliminates the need for system back-end access, which benefits both security and flexibility. The exploration-based learning strategy allows the agent to adapt to new applications with unfamiliar user interfaces, making it a versatile tool for various tasks. Additionally, the code is open-sourced on GitHub. It is worth noting that the large gap in results between autonomous exploration and manually crafted documents suggests that the exploration phase can be further improved, potentially leading to much better results.
However, the swiping/tapping approach itself is not as novel as the authors suggest: a similar approach was proposed in the paper "Empowering LLM to use Smartphone for Intelligent Task Automation" in August 2023. Its main components included a functionality-aware UI representation method that helps the LLM understand the UI, exploration-based memory injection techniques that augment the LLM's app-specific domain knowledge, and a multi-granularity query optimization module that reduces the cost of model inference. Sounds similar, right? Still, both papers are worth exploring, as they show how we can use LLMs to operate different devices without connecting them to the back end.