Large-scale Language Model Manipulates Android Applications! DroidBot-GPT, A New Tool To Automate Tasks

Large Language Models 19/07/2023

3 main points
✔️ We proposed DroidBot-GPT, which uses a large-scale language model to automate interactions with Android apps.
✔️ It achieved about 39% complete success and 67% partial success on 33 tasks across 17 different apps.
✔️ Going forward, more efficient app development methods and task-specific AI models are expected to make automation tools even more convenient.

DroidBot-GPT: GPT-powered UI Automation for Android
written by Hao Wen, Hongming Wang, Jiaxuan Liu, Yuanchun Li
(Submitted on 14 Apr 2023)
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Published on arXiv.

The images used in this article are from the paper, the introductory slides, or were created based on them.

summary

UI interaction automation is needed in a variety of applications, including robotic process automation, software testing, and personal assistants. Automation driven by natural language has received particular attention because it significantly improves the user experience.

In this paper, we propose DroidBot-GPT, a tool that automates interaction with Android mobile apps using a large-scale language model such as GPT. When a task is entered into DroidBot-GPT in natural language, it automatically operates the app to complete the task. In the example below, entering the task "create a contact named Alice with number 1234567 and email alice@github.com and save it" causes DroidBot-GPT to operate the UI automatically and create the contact.

What is DroidBot-GPT?

The figure below is an overview of DroidBot-GPT. First, DroidBot acquires information about the application (GUI info) and describes it in natural language. Next, DroidBot creates a prompt by combining GUI state, action history, and task description, and inputs the prompt to ChatGPT (LLM). Then, ChatGPT generates an appropriate action (Action choice), and the app is operated (Action) via DroidBot.
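The loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the callables `describe_gui`, `ask_llm`, and `perform`, the prompt wording, and the "done" stop signal are all assumptions introduced here for the example.

```python
from typing import Callable, List


def run_task(
    task: str,
    describe_gui: Callable[[], str],   # hypothetical: returns the GUI state as text
    ask_llm: Callable[[str], str],     # hypothetical: sends a prompt, returns the reply
    perform: Callable[[str], None],    # hypothetical: executes the chosen action on the app
    max_steps: int = 10,
) -> List[str]:
    """Repeat observe -> prompt -> choose -> act until done or max_steps."""
    history: List[str] = []
    for _ in range(max_steps):
        # Combine task description, GUI state, and action history into one prompt.
        prompt = "\n".join([
            f"Task: {task}",
            f"GUI state: {describe_gui()}",
            f"Action history: {'; '.join(history) or 'none'}",
            "Reply with the id of the next action.",
        ])
        choice = ask_llm(prompt).strip()
        if choice.lower() == "done":   # assumed stop signal, not from the paper
            break
        perform(choice)
        history.append(f"action {choice}")
    return history
```

Passing the three steps in as callables keeps the sketch independent of any particular LLM client or device driver, so it can be exercised with simple stand-in functions.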

Converting the app's GUI information (GUI info) into natural language that a large-scale language model can handle requires some ingenuity. The simplest approach is to input the structured tree description of the GUI as it is, including all the positional relationships and properties (color, shape, size, etc.) of the various elements. However, this is likely to run to thousands of words, much of it irrelevant to the task, and may exceed the model's input length limit. Therefore, this paper proposes representing the GUI as short, to-the-point sentences that a human could also read and understand.

The figure below shows how this is done. First, all elements that the user can see on the application screen (buttons, text boxes, etc.) are extracted, and the operations each element supports (click, text input, etc.) are identified. Next, a sentence of the form "The view <name>... can do..." is generated for each element, describing the element and its function. Finally, the text "The current state has the following UI views and corresponding actions, with action id in parentheses" is added at the top to combine these sentences into a single description.
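As a rough sketch of this conversion, the snippet below turns a list of visible elements into the kind of description quoted in this article. The `View` class, the `describe_state` function, and the choice to emit one clause per supported action are assumptions made for illustration; only the header sentence and the clause pattern are taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class View:
    name: str            # visible text or label of the element
    actions: List[str]   # supported operations, e.g. ["clicked", "edited"]


HEADER = (
    "The current state has the following UI views and corresponding "
    "actions, with action id in parentheses:"
)


def describe_state(views: List[View]) -> str:
    """Render visible views as short clauses, numbering each action."""
    clauses = []
    next_id = 0
    for view in views:
        for action in view.actions:
            clauses.append(f"a view '{view.name}' that can be {action} ({next_id});")
            next_id += 1
    return HEADER + " " + " ".join(clauses)


print(describe_state([View("Sort by", ["clicked"]),
                      View("Search", ["clicked", "edited"])]))
```

Numbering actions rather than views means a single integer reply is enough to identify both the element and the operation.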

Actions within the application fall into two main categories: choosing and editing. Choosing actions include clicking, scrolling, checking, etc. In the prompt describing the GUI elements, a number is assigned after each action, and ChatGPT specifies an action by selecting its number. For example, if the view "Sort by" is clickable, it is described as "a view 'Sort by' that can be clicked (0);". If this is entered into ChatGPT and the response is "0", the action of clicking the "Sort by" button is performed.
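Mapping the model's reply back to an action could look something like the sketch below. The function name `parse_choice` and the tolerant parsing (taking the first integer in the reply and rejecting out-of-range ids) are assumptions for illustration; the article only states that the model answers with the action number.

```python
import re
from typing import Optional


def parse_choice(response: str, num_actions: int) -> Optional[int]:
    """Return the chosen action id, or None if the reply is unusable."""
    match = re.search(r"\d+", response)
    if match is None:
        return None                      # no number in the reply
    action_id = int(match.group())
    if 0 <= action_id < num_actions:
        return action_id
    return None                          # number does not refer to a listed action


# Hypothetical action table for a screen with a single clickable view.
actions = [("click", "Sort by")]
choice = parse_choice("0", len(actions))
if choice is not None:
    print(actions[choice])               # the action to perform on the app
```

Returning `None` instead of raising leaves it to the caller to decide whether to re-prompt or abort when the reply does not name a valid action.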

Editing actions involve typing text into a text box. Users may need to enter usernames, passwords, or free-form sentences, which cannot be encoded as a fixed set of choices. To handle this, this paper designs a two-step solution. If the large-scale language model chooses to edit a text box, DroidBot-GPT sends a second prompt asking "What should I enter to the view with the text '<text content>'? Just return the text and nothing else." DroidBot-GPT then enters the response into the text box.
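The two-step handling of editing actions might be sketched as follows. The follow-up question is quoted from the text above; the dictionary layout of an action, the `resolve_action` function, and the `ask_llm` callable are hypothetical names introduced for this example.

```python
from typing import Callable, Dict, Optional, Tuple


def edit_prompt(text_content: str) -> str:
    """Build the second-step prompt for a text box."""
    return (
        f"What should I enter to the view with the text '{text_content}'? "
        "Just return the text and nothing else."
    )


def resolve_action(
    action: Dict[str, str],            # hypothetical shape: {"type": ..., "view_text": ...}
    ask_llm: Callable[[str], str],     # hypothetical: sends a prompt, returns the reply
) -> Tuple[str, str, Optional[str]]:
    """For edit actions, ask a follow-up question to obtain the text to type."""
    if action["type"] == "edit":
        text = ask_llm(edit_prompt(action["view_text"])).strip()
        return ("set_text", action["view_text"], text)
    # Choosing actions need no second prompt.
    return (action["type"], action["view_text"], None)
```

Only edit actions cost a second model call, so choosing actions such as clicks stay a single round trip.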

The prompts entered into the large-scale language model include not only the GUI description and actions described above, but also the history of previous actions to avoid repetition of operations. The figure below is an example of a prompt.

Here, the task is to set a music player application to play white noise* for 5 minutes. The procedure is to start the application called "Noice," select the item "Sleep Timer" (screen a), and then select the item labeled "+ 5m" (screen b). In the figure, the task description is shown in a blue box, the GUI elements in a green box, the history of previous operations in a yellow box, and the required output (a single choice or input text) in a purple box, which makes it easy to see how the parts combine to form the whole prompt.

(*) White noise is a steady, hiss-like noise, similar to the sound of a ventilation fan or TV static, and is said to aid concentration, relaxation, and restful sleep.
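The way the four parts are combined can be illustrated with a short sketch. The function `assemble_prompt`, the section labels, and the exact wording of the final question are assumptions; the values in the usage example paraphrase the Noice scenario and are not copied from the paper's prompt.

```python
from typing import List


def assemble_prompt(task: str, gui_description: str,
                    history: List[str], expects_text: bool = False) -> str:
    """Join task, GUI description, history, and output request in a fixed order."""
    history_text = "\n".join(f"- {step}" for step in history) or "- (none)"
    if expects_text:
        question = "Just return the text and nothing else."
    else:
        question = "Which action should be taken next? Answer with the action id only."
    return "\n\n".join([
        f"Task: {task}",
        gui_description,
        f"Previous actions:\n{history_text}",
        question,
    ])


prompt = assemble_prompt(
    task="Play white noise for 5 minutes",
    gui_description=(
        "The current state has the following UI views and corresponding "
        "actions, with action id in parentheses: "
        "a view '+ 5m' that can be clicked (0);"   # illustrative single view
    ),
    history=["Started the app Noice", "Clicked 'Sleep Timer'"],
)
print(prompt)
```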

How does DroidBot-GPT perform?

The evaluation uses 17 widely used Android apps downloaded from F-Droid, an app store offering free and open source software (FOSS) for Android. One to three tasks were designed for each app, each consisting of 2 to 13 GUI steps. The table below lists the apps, the tasks designed for them, and examples of the correct action steps.

The evaluation results show that DroidBot-GPT fully completed 13 of the 33 tasks (about 39%), with an average achievement level of 66.76% across all tasks. The table below shows average achievement levels by task difficulty; relatively simple tasks with two to three steps tend to be easier to accomplish.
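The two headline figures measure different things, which a small calculation makes concrete. The complete-success rate follows directly from the 13 and 33 reported above. The `average_completion` function assumes that a task's achievement level is the fraction of its steps carried out correctly, which is an assumed definition, and the per-task values passed to it are toy numbers, not data from the paper.

```python
from typing import List, Tuple


def percent(part: float, whole: float) -> float:
    """Express part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


def average_completion(step_results: List[Tuple[int, int]]) -> float:
    """Mean of per-task (correct_steps / total_steps), as a percentage.

    Assumed definition of the achievement level; not confirmed by the source.
    """
    ratios = [done / total for done, total in step_results]
    return round(100.0 * sum(ratios) / len(ratios), 2)


# Complete success: 13 of 33 tasks were fully accomplished.
print(percent(13, 33))                               # 39.4

# Toy example only: three made-up tasks, one of them fully completed.
print(average_completion([(2, 2), (1, 2), (3, 4)]))  # 75.0
```

In the toy example only one of three tasks is fully completed (about 33%), yet the average completion is 75%, which mirrors how a 39% complete-success rate can coexist with a much higher average achievement level.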

The table below also shows the achievement level for each task category. Apps in categories such as "Record" and "Life" have a slightly lower average achievement level. This may be because their tasks often involve typing actions: the model may not fully understand the relationships between GUI elements and may therefore miss the appropriate place to enter the information.

Compared to previous studies designed for a particular app or action, DroidBot-GPT does not require additional GUI-specific training data and can handle tasks across a wide range of categories by leveraging a general-purpose large-scale language model. It can also complete complex tasks from minimal instructions, achieving a high degree of automation. However, as shown in the figure below, some cases cannot be handled, such as unnamed GUI elements, and further improvement is left to future research.

summary

This paper proposes DroidBot-GPT, the first Android application automation tool guided by a large-scale language model. It introduces a way to transform the GUI state and the corresponding action space into prompts that a large-scale language model can interpret. Following the release of ChatGPT in 2022, we expect to see more automation driven by large-scale language models in various fields, along with a better user experience.

Takumu: I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
