Mobile Agent: Operate the Mobile Phone as Smoothly as a Human
Mobile Agent is designed to assist users by simplifying interactions with mobile devices, making complex operations easier and more intuitive. Instead of manually handling intricate tasks or repeatedly switching between apps, users simply provide high-level instructions in natural language. The agent then interprets these instructions, identifies the necessary steps, and automatically performs the required actions—such as tapping buttons, entering text, or navigating across apps. This approach significantly streamlines task completion, reducing user effort and minimizing errors.
We will now use examples and prompt templates to illustrate how Mobile Agent brings together planning, tool use, reflection, and memory to turn natural-language requests into reliable phone actions.
Planning
The agent initiates the task by interpreting the user's instruction in the context of the device's current state, typically represented by the latest screenshot. Using this combined understanding, it formulates an operational plan and identifies the optimal next action. This planning process translates the user's high-level objective into precise and actionable steps, ensuring that task execution remains organized and directed toward achieving the intended goal. An illustrative example of this planning step is provided below: guided by the prompt, the agent engages in structured reasoning to formulate an action articulated in natural language.
Example of planning the next action
INPUT:
Instruction: Set an alarm for 7:00 a.m.
Screenshot:
History: {...}
Task: Think and select the next action. (Swipe, Tap, Type ...)
OUTPUT:
Operation Thought: Currently on the home screen, we first need to tap the clock icon...
Action: Tap (500,800)
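As a minimal sketch of how this planning call might be implemented (the prompt wording, function names, and output parsing below are our own illustration, and the model call is a placeholder for whichever vision-language model the agent uses, not the exact interface of the original system):

from typing import Callable

# Illustrative planning prompt; the exact wording used by Mobile Agent may differ.
PLANNING_PROMPT = """Instruction: {instruction}
History: {history}
Task: Think and select the next action. (Swipe, Tap, Type ...)
Answer with an 'Operation Thought:' line followed by an 'Action:' line."""

def plan_next_action(
    instruction: str,
    screenshot: bytes,
    history: list[str],
    model: Callable[..., str],  # placeholder for a vision-language model call
) -> tuple[str, str]:
    """Ask the model for the next action given the instruction, screenshot, and history."""
    prompt = PLANNING_PROMPT.format(instruction=instruction, history=history)
    reply = model(prompt=prompt, images=[screenshot])
    lines = reply.splitlines()
    thought = next(l for l in lines if l.startswith("Operation Thought:"))
    action = next(l for l in lines if l.startswith("Action:"))
    return (thought.removeprefix("Operation Thought:").strip(),
            action.removeprefix("Action:").strip())

Applied to the example above, this would return the reasoning sentence together with the action string Tap (500,800), which is then handed to the action mapping tool described next.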
Tool use
After generating the next action, the agent directly inputs this action into an action mapping tool, which then converts the natural-language action into executable instructions for the device. An abstract action such as Tap(500,800) is transformed by the code translator into low-level commands, for example adb shell input tap 500 800. These commands are then executed through the Android Debug Bridge (ADB), which directly interacts with the mobile operating system. In this way, the agent connects high-level planning with concrete device operations. Actions such as launching an application, selecting an interface element, swiping across the screen, or entering text can thus be automatically performed. By bridging abstract decisions and real interactions, the agent progressively fulfills the user's goal without requiring detailed manual intervention.
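A minimal sketch of such an action mapping tool is shown below; the function names and regular expressions are our own, but the underlying adb shell input commands (tap, swipe, text) are standard ADB interactions:

import re
import subprocess

def map_action_to_adb(action: str) -> list[str]:
    """Translate a natural-language action into an adb command (illustrative mapping)."""
    if m := re.match(r"Tap\s*\((\d+)\s*,\s*(\d+)\)", action):
        x, y = m.groups()
        return ["adb", "shell", "input", "tap", x, y]
    if m := re.match(r"Swipe\s*\((\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\)", action):
        x1, y1, x2, y2 = m.groups()
        return ["adb", "shell", "input", "swipe", x1, y1, x2, y2]
    if m := re.match(r"Type\s*\((.+)\)", action):
        # adb's input text expects spaces encoded as %s
        text = m.group(1).strip().strip('"').replace(" ", "%s")
        return ["adb", "shell", "input", "text", text]
    raise ValueError(f"Unsupported action: {action}")

def execute(action: str) -> None:
    """Run the translated command on the connected device via ADB."""
    subprocess.run(map_action_to_adb(action), check=True)

execute("Tap (500,800)")  # runs: adb shell input tap 500 800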
Reflection
After each operation is executed, the agent evaluates its effectiveness by comparing the screenshot before and after the action. If the outcome aligns with the intended goal, the agent returns to the planning phase for the next iteration, continuing this cycle until the entire instruction is successfully completed. If the outcome diverges from the intended goal, such as when an incorrect page is opened or a gesture fails to produce any effect, the agent detects the error and initiates corrective measures. This reflective mechanism prevents errors from propagating, ensuring that the task remains aligned with the original objective and that overall performance remains reliable. The example below illustrates how the agent evaluates the success of an operation by comparing screenshots.
Example of reflection for checking operation success
INPUT:
Last Screenshot:
Current Screenshot:
Latest Operation: ...tap the clock icon ...
Task: Compare and determine whether the latest operation was successful.
OUTPUT:
By comparing these two images, it is found ... so this operation is successful.
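A sketch of this reflection check, again with a placeholder model call and an illustrative prompt (the verdict parsing is our own simplification), might look like:

from typing import Callable

REFLECTION_PROMPT = """Latest Operation: {operation}
Task: Compare the two screenshots (before and after the operation) and determine
whether the latest operation was successful. Answer 'successful' or 'failed' with a brief reason."""

def reflect(operation: str, before: bytes, after: bytes, model: Callable[..., str]) -> bool:
    """Return True if the model judges the operation successful based on the two screenshots."""
    prompt = REFLECTION_PROMPT.format(operation=operation)
    reply = model(prompt=prompt, images=[before, after])
    return "successful" in reply.lower() and "unsuccessful" not in reply.lower()

If reflect returns True, the agent loops back to planning; otherwise it treats the step as a failure and plans a corrective action instead.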
Memory
Throughout the process, the agent maintains a memory of task-relevant information that may be required in later steps. This includes recording intermediate results and preserving contextual details across applications. By recalling such information when necessary, the agent provides continuity for multi-step and cross-application tasks, thereby avoiding the loss of critical content and ensuring coherent task completion.
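One simple way to realize such a memory, sketched below with our own naming and a hypothetical example note, is an append-only list of task-relevant notes that is rendered back into the planning prompt at each step:

from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Append-only store of task-relevant notes (e.g. intermediate results copied across apps)."""
    notes: list[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.notes.append(note)

    def as_prompt_context(self) -> str:
        # Rendered into the planning prompt so later steps can recall earlier findings
        return "\n".join(f"- {n}" for n in self.notes) or "(empty)"

memory = AgentMemory()
memory.remember("Confirmation code copied from the email app")  # hypothetical note
print(memory.as_prompt_context())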
If you find this work helpful, please consider citing our paper:
@article{hu2025hands,
title={Hands-on LLM-based Agents: A Tutorial for General Audiences},
author={Hu, Shuyue and Ren, Siyue and Chen, Yang and Mu, Chunjiang and Liu, Jinyi and Cui, Zhiyao and Zhang, Yiqun and Li, Hao and Zhou, Dongzhan and Xu, Jia and others},
journal={Hands-on},
volume={21},
pages={6},
year={2025}
}