
Mobile-Agent: Automation Of Mobile App Operations Through Screenshot Analysis

3 main points
✔️ Introducing "Mobile-Agent", an autonomous multimodal agent: proposes a multimodal agent that integrates vision and language to automate mobile app operations
✔️ Applying visual recognition techniques: locates operations directly from screenshots, reducing reliance on underlying user-interface files
✔️ Performance evaluation with the "Mobile-Eval" benchmark: demonstrates high task completion rates and operation accuracy on the newly proposed benchmark

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
written by Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
(Submitted on 29 Jan 2024)
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This article introduces the development of agents built on state-of-the-art language models. In recent years, various research teams have made remarkable progress in task planning and inference. These advances have been supported by the rapid evolution of multimodal large language models (MLLMs), particularly their improved visual recognition capabilities, which has opened up innovative possibilities for MLLM-based agents in a variety of real-world applications. Agents that operate mobile devices are an emerging trend in this field. However, existing MLLMs still lack the visual recognition capabilities required to accurately operate mobile device screens.

To address this challenge, the paper proposes Mobile-Agent, a new autonomous mobile device agent with visual recognition capabilities. The agent locates operations directly from mobile device screenshots, which allows it to operate efficiently without accessing the underlying files of the user interface.

Mobile-Agent uses a visual recognition module that combines detection and OCR (optical character recognition) models to identify text in screenshots and understand the contents of screen regions. In addition, the agent draws on GPT-4V's strong context-awareness to plan tasks comprehensively from user instructions and the operation history. It also has a self-reflection function that detects and corrects erroneous operations and incomplete instructions, enabling efficient user assistance.

We also propose a new benchmark, Mobile-Eval, to evaluate Mobile-Agent performance. It includes mobile application manipulation tasks of various difficulty levels. Experiments with this benchmark have shown that Mobile-Agent achieves high task completion rates and operation accuracy. This indicates that Mobile-Agent works effectively even in complex tasks that cross multiple applications.

What is Mobile-Agent?

Mobile-Agent" is a combination of a text detection module and an icon detection module, with the latest large-scale language model (MLLM), GPT-4V, at its core. The overall workflow of Mobile-Agent is shown in the figure below.

While the core GPT-4V can read an instruction and a screenshot and indicate the proper operation, it cannot pinpoint the exact location on the screen where that operation should take place. To fill this gap, external tools are used to localize text and icons precisely.

When text must be located accurately, for example when specific on-screen text needs to be tapped, an OCR (optical character recognition) tool is used to find the text's location. The handling differs depending on whether no text, exactly one text, or multiple texts are detected; for each case, the paper describes how the agent reselects the text or generates a clearer description of the text to click.
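
As a concrete illustration, the case analysis above could be organized as in the following minimal sketch. The ocr helper is a placeholder for an actual OCR model that returns (text, bounding box) pairs; all function names here are illustrative and not taken from the paper's code.

    # Minimal sketch of OCR-based text localization (helper names are
    # placeholders, not the paper's actual implementation).

    def locate_text(screenshot, target_text, ocr):
        """Return candidate boxes for the on-screen text the agent wants to tap."""
        hits = [box for text, box in ocr(screenshot) if target_text in text]

        if len(hits) == 0:
            # No text detected: ask the agent to reselect the text or
            # rephrase the instruction before retrying.
            return None
        if len(hits) == 1:
            # Exactly one match: tap the center of its bounding box.
            return [hits[0]]
        # Multiple matches: return all candidates so the agent can be asked
        # which region it actually meant to click.
        return hits

    def box_center(box):
        """Center of an (x1, y1, x2, y2) bounding box, used as the tap point."""
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2, (y1 + y2) / 2)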

When the exact location of an icon must be determined, the icon detection tool and CLIP are used together. The agent is first asked to describe the attributes of the icon to be clicked, then Grounding DINO is run with the prompt "icon" to detect all icons on the screen. Finally, CLIP computes the similarity between each detected icon and the click-region description, and the region with the highest similarity is selected for the click.
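
The same pipeline can be sketched schematically as below; grounding_dino and clip_similarity stand in for the actual detection and similarity models, and the way they are called here is an assumption made for illustration.

    # Schematic sketch of icon localization: a detector proposes icon regions
    # for the prompt "icon", and CLIP scores each region against the agent's
    # description of the target icon. Both model calls are placeholders.

    def crop(image, box):
        x1, y1, x2, y2 = box
        return image[y1:y2, x1:x2]  # assumes a NumPy-style image array

    def locate_icon(screenshot, icon_description, grounding_dino, clip_similarity):
        # Detect every icon-like region on the screen.
        boxes = grounding_dino(screenshot, prompt="icon")
        if not boxes:
            return None

        # Score each detected region against the agent's description,
        # e.g. "blue settings gear near the top right".
        scores = [clip_similarity(crop(screenshot, box), icon_description)
                  for box in boxes]

        # Click the region whose crop best matches the description.
        best = max(range(len(boxes)), key=lambda i: scores[i])
        return boxes[best]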

In addition, the following eight operations are defined so that Mobile-Agent can clearly express the operations it performs on the screen (an illustrative parsing sketch follows the list).

  • Open App (App): Opens a specific app on the desktop page.
  • Click on the text (Text): Click on the area of the screen where the text "Text" appears.
  • Click on the icon (Icon, Position): Click on the "Position" area described by "Icon".
    "Icon" provides a description of the tap target, including attributes such as its color and shape.
    "Position" selects one or two options from Top, Bottom, Left, Right, or Center to minimize the possibility of error.
  • Type (Text): Enter "Text" in the current input box.
  • Page Up & Down: Scroll up and down the current page.
  • Back: Return to the last page.
  • Exit: Returns directly to the desktop from the current page.
  • Stop: When the instruction is complete, the entire process is terminated.
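
As a rough illustration of how such operation strings might be mapped onto structured actions, the sketch below parses them with regular expressions; the exact textual format the agent emits is an assumption made here.

    # Rough sketch of mapping the agent's operation strings onto structured
    # actions with regular expressions; the output format is assumed.
    import re
    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class Action:
        name: str
        args: Tuple[str, ...]

    ACTION_PATTERNS = {
        "Open App":          r"Open App \((.+)\)",
        "Click on the text": r"Click on the text \((.+)\)",
        "Click on the icon": r"Click on the icon \((.+),\s*(.+)\)",
        "Type":              r"Type \((.+)\)",
        "Page Up":           r"Page Up",
        "Page Down":         r"Page Down",
        "Back":              r"Back",
        "Exit":              r"Exit",
        "Stop":              r"Stop",
    }

    def parse_action(raw: str) -> Optional[Action]:
        """Map the agent's Action string onto one of the defined operations."""
        for name, pattern in ACTION_PATTERNS.items():
            match = re.fullmatch(pattern, raw.strip())
            if match:
                return Action(name, match.groups())
        return None  # unrecognized output; a retry or reflection step would follow

    # Example: parse_action("Click on the icon (blue gear icon, Top Right)")
    # -> Action(name="Click on the icon", args=("blue gear icon", "Top Right"))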

Mobile-Agent completes each operation step iteratively. Before the iterations begin, the user enters an instruction, and a prompt for the entire process is generated from it. At the start of each iteration, a screenshot of the current mobile screen is captured and fed to the agent. The agent processes the prompt, the operation history, and the current screenshot, then outputs the next operation step. If the agent's output signals the end of the process, the iterations stop; otherwise, a new iteration begins. Mobile-Agent uses the operation history to track the progress of the current task and generates operations for the current screenshot based on the prompt, forming an iterative self-planning process. This process is illustrated at the bottom of the figure below.
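
The loop described above can be summarized in a short sketch; get_screenshot, query_agent, and execute are hypothetical hooks for the screen capture, the GPT-4V call, and the device operation.

    # Minimal sketch of the iterative self-planning loop; the three hooks
    # passed in are placeholders, not the paper's actual interfaces.

    def run_mobile_agent(instruction, get_screenshot, query_agent, execute, max_iters=50):
        history = []                              # operation history fed back each step
        for _ in range(max_iters):
            screenshot = get_screenshot()         # capture the current screen
            step = query_agent(instruction, history, screenshot)
            if step.action == "Stop":             # the agent decides the task is done
                break
            execute(step.action, step.args)       # perform the operation on the device
            history.append(step)                  # track progress for the next iteration
        return history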

Additionally, during the iterations the agent may make errors and fail to complete the instruction. To improve the success rate, a self-reflection mechanism is introduced.

This mechanism works in two situations. The first is when the agent generates an incorrect or invalid operation and the process stalls: if the agent detects that the screenshot has not changed after an operation, or that the wrong page is displayed, it is instructed to try an alternative operation or to modify the parameters of the current one. The second is when the agent overlooks a requirement of a complex instruction: after all operations have been completed through self-planning, the agent is instructed to analyze the operation history, the current screenshot, and the user instruction to determine whether the instruction has been fully completed. If not, it must continue generating operations through self-planning. This process is also shown at the bottom of the figure below.
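
A minimal sketch of these two checks could look like the following; the helper names and return values are illustrative rather than taken from the paper.

    # Minimal sketch of the two self-reflection checks; screens_equal,
    # replan, and check_completion are hypothetical helpers.

    def reflect_on_step(prev_screenshot, new_screenshot, screens_equal, replan):
        # Case 1: the screen did not change (or shows the wrong page) after an
        # operation, so the agent is asked to try a different operation or to
        # adjust the parameters of the current one.
        if screens_equal(prev_screenshot, new_screenshot):
            return replan(reason="invalid or ineffective operation")
        return None

    def reflect_on_completion(instruction, history, screenshot, check_completion):
        # Case 2: after self-planning finishes, the agent checks whether every
        # requirement of the instruction was met; if not, it keeps generating
        # operations through self-planning.
        return check_completion(instruction, history, screenshot)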

In addition, to better support these functions, the prompt format used in ReAct is applied. The agent is required to output three components: Observation, Thought, and Action. Observation is the agent's description of the current screenshot and the operation history, which helps it notice screenshot updates and quickly identify errors from the history. Thought is the agent's reasoning about the next operation step, derived from the observation and the instruction; the agent must explain the upcoming operation here. Action selects one of the eight operations and its parameters based on Thought.
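
For illustration, the three sections of such a response could be separated with a simple parser like the one below; the exact prompt wording and response layout are assumptions.

    # Illustrative parser for the ReAct-style output, which is expected to
    # contain Observation, Thought, and Action sections.

    def parse_react_output(response: str) -> dict:
        sections = {"Observation": "", "Thought": "", "Action": ""}
        current = None
        for line in response.splitlines():
            stripped = line.strip()
            for key in sections:
                if stripped.startswith(key + ":"):
                    current = key
                    stripped = stripped[len(key) + 1:].strip()
                    break
            if current is not None:
                sections[current] += (" " if sections[current] else "") + stripped
        return sections

    # Example:
    # parse_react_output("Observation: ...\nThought: ...\nAction: Stop")
    # -> {"Observation": "...", "Thought": "...", "Action": "Stop"}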

Experiment

This section provides a comprehensive evaluation of Mobile-Agent. For convenience, the Android OS is used, and Mobile-Agent is evaluated on the newly proposed Mobile-Eval benchmark.

Mobile-Eval consists of 10 apps commonly used on mobile devices. To assess the agent's ability to use multiple applications, instructions that require two apps to be used simultaneously are also included. Three instructions are designed for each app. The first instruction (INSTRUCTION 1) is relatively simple and requires only basic app operations; the second instruction (INSTRUCTION 2) is more challenging, adding further requirements to the first; the third instruction (INSTRUCTION 3) is an abstract user instruction that does not explicitly specify the app to be used or the operation to be performed, leaving those decisions to the agent. The table below lists the apps and instructions used in Mobile-Eval.


In order to evaluate Mobile-Agent's performance from different perspectives, four evaluation criteria are introduced (a small computation sketch follows the list).

  • Success (SU): considered successful when Mobile-Agent completes the instruction.
  • Process Score (PS): evaluates the accuracy of each step in the execution of an instruction.
    Specifically, it is the number of correct steps divided by the total number of steps. Each correct step contributes to this score even if the instruction is ultimately not completed.
  • Relative Efficiency (RE): the number of steps a human takes to execute each instruction is recorded manually, and the human operation is treated as the optimal solution;
    the number of steps taken by Mobile-Agent is compared with the human's to indicate whether Mobile-Agent can use the mobile device more efficiently.
  • Completion Rate (CR): the number of human operation steps that Mobile-Agent was able to complete for a given instruction divided by the total number of human steps. If the instruction is completed, this metric equals 1.
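
These metrics can be computed per instruction roughly as follows; the direction of the RE ratio is an interpretation of the description above rather than a formula quoted from the paper.

    # Illustrative per-instruction computation of the four metrics.
    # step_correct is a list of booleans (one per agent step) and human_steps
    # is the manually recorded human step count.

    def evaluate_instruction(step_correct, human_steps, completed_human_steps, succeeded):
        agent_steps = len(step_correct)

        su = 1.0 if succeeded else 0.0                      # Success
        ps = sum(step_correct) / agent_steps                # Process Score
        re_ratio = human_steps / agent_steps                # Relative Efficiency (assumed direction)
        cr = 1.0 if succeeded else completed_human_steps / human_steps  # Completion Rate

        return {"SU": su, "PS": ps, "RE": re_ratio, "CR": cr}

    # Example: evaluate_instruction([True, True, False, True], human_steps=3,
    #                               completed_human_steps=2, succeeded=False)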

The results of the experiment are shown in the table below.

First, across the three instructions (INSTRUCTION 1 to INSTRUCTION 3), the SU reaches 91%, 82%, and 82%, respectively. In terms of PS, Mobile-Agent also generates correct operations with high probability, scoring around 80% on all three instructions. RE shows that Mobile-Agent reaches roughly 80% of the efficiency of the optimal human operation. These results demonstrate the effectiveness of Mobile-Agent as a mobile device assistant.

Summary

The Mobile-Agent presented in this paper is an autonomous multimodal agent that suggests new possibilities for mobile application manipulation. The system leverages an integrated visual recognition framework to accurately identify and locate visual and textual information within an app's interface, allowing it to efficiently plan and execute complex tasks across mobile apps.

Unlike traditional mobile agent systems that rely on XML files and other system metadata, Mobile-Agent places visual information at the center, which gives it flexible adaptability across a variety of mobile environments. This allows it to be used in a wide range of applications without system-specific adjustments. Furthermore, the experiments demonstrate Mobile-Agent's effectiveness and efficiency in a variety of scenarios.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
