Can You Have A Conversational Dialogue With A Mobile UI In A Large Language Model?

3 main points
✔️ First paper to investigate the feasibility of conversational interaction with mobile UIs using large language models (LLMs)
✔️ Proposes a set of techniques for feeding a GUI into an LLM and prompting the LLM to perform various conversational tasks on a mobile UI
✔️ Achieves performance comparable to or better than traditional machine learning methods; the code is open-sourced

Enabling Conversational Interaction with Mobile UI using Large Language Models
written by Bryan Wang, Gang Li, Yang Li
(Submitted on 18 Sep 2022 (v1), last revised 17 Feb 2023 (this version, v2))
Comments: Published as a conference paper at CHI 2023
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.


"Accessibility" is also being pursued in Japan, led mainly by the Ministry of Health, Labour and Welfare. The goal is for anyone, regardless of age or physical disability, to be able to easily reach and use the information they need.

This paper applies recent advances in large language models (LLMs) to propose a generic method for operating mobile UIs with LLMs. In general, performing UI tasks in natural language requires a separate dataset and model for each specific task, which is expensive and labor-intensive. Recently, however, pre-trained LLMs have been shown to generalize to new tasks when prompted with only a few examples. This paper investigates the feasibility of conversational interaction with mobile UIs using LLMs by applying prompting techniques.

Four conversational tasks are investigated in this paper: Screen Question-Generation, Screen Summarization, Screen Question-Answering (QA), and Mapping Instruction to UI Action. Without requiring dedicated datasets or training, the proposed generic approach achieves performance comparable to or better than existing machine learning methods on these tasks and enables language-based mobile interaction.


In this paper, PaLM is used as the large language model to validate the four tasks shown in the figure below. The first task is Screen Question-Generation, in which an agent generates appropriate questions for the user based on the items that require input on a UI screen. For example, if the UI screen of a travel site has fields for "destination" and "date of stay," the agent asks the user questions such as "Where is your destination?" and "When is the date of your stay?" Even people who cannot see the screen can grasp which items are required.

 The second task is Screen Summarization. This task is to summarize and convey to the user what is displayed on the UI screen. For example, when a list of available hotels is displayed on a travel site, the content must be appropriately conveyed to the user. Even people who cannot see the screen can grasp the displayed content.

The third task is Screen Question-Answering (QA). In this task, the user requests information about a UI screen through an agent, and the agent responds with the appropriate information. For example, if a travel site displays a list of available hotel rooms and the user asks, "How much is a room with a king-size bed?", the agent answers "$330 per night" based on the content of the UI screen. This function is also useful for people who cannot see the screen. It also allows users to retrieve only the information they need from a large amount of content.

The fourth task is Mapping Instruction to UI Action. This task performs the appropriate screen operation in response to a user request. For example, on a hotel reservation screen, if the user requests, "Click the Book button to reserve a room with a king-size bed," the agent clicks the corresponding button and completes the reservation. This is useful for people who cannot see or operate the screen.

Experiment (Screen Question-Generation)

Here, UI elements requiring user input are identified and appropriate questions for the user are generated. The figure below shows an example of a prompt that generates questions. Given a target UI screen, the model first generates the "number of UI elements requiring input," a "screen summary," and a "list of elements requiring input" as intermediate results using chain-of-thought prompting. Finally, it generates questions enclosed in <SOQ> and <EOQ> tokens.
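As a minimal sketch of how the final questions could be separated from the intermediate chain-of-thought text, the snippet below pulls out every span between <SOQ> and <EOQ>. The sample output string and the function name extract_questions are hypothetical and only illustrate the tag convention described above; they are not the paper's code.

```python
import re

# Hypothetical model output. The intermediate lines mirror the chain-of-thought
# fields described in the article, but their exact wording is an assumption.
SAMPLE_OUTPUT = (
    "Number of input elements: 2\n"
    "Screen summary: a hotel search form\n"
    "Input elements: destination, date_of_stay\n"
    "<SOQ>Where is your destination?<EOQ>\n"
    "<SOQ>When is the date of your stay?<EOQ>"
)

# Non-greedy match so that each <SOQ>...<EOQ> pair is captured separately.
QUESTION_PATTERN = re.compile(r"<SOQ>(.*?)<EOQ>", re.DOTALL)


def extract_questions(llm_output: str) -> list[str]:
    """Return the text of every <SOQ>...<EOQ> span, stripped of whitespace."""
    return [q.strip() for q in QUESTION_PATTERN.findall(llm_output)]


if __name__ == "__main__":
    for question in extract_questions(SAMPLE_OUTPUT):
        print(question)
```

Delimiting the answer with explicit tokens keeps the intermediate reasoning out of what is shown to the user, and an output with no tags simply yields an empty list.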

Generated questions are evaluated on three metrics: "Grammar Correctness," "UI Relevance," and "Question Coverage." Grammar Correctness measures how grammatically correct, readable, and natural the generated questions are, rated on a 5-point Likert scale. UI Relevance is a binary rating of whether a generated question is relevant to the UI element. Question Coverage (Coverage F1) evaluates how well the generated questions identify the input elements on the screen. It is calculated automatically by comparing the tags of the input elements in the ground truth with the tags identified in the chain-of-thought output. The LLM results are compared with a rule-based baseline (Template) that fills the words of an element's resource_id, called res_tokens, into the template "What is {res_tokens}?" The results are shown in the table below.
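The coverage comparison can be sketched as a set-based precision, recall, and F1 over element tags. This is an illustrative assumption about the computation: the function name coverage_f1 and the example tags are hypothetical, and whether the paper aggregates per screen or over the whole dataset is not stated in this article.

```python
def coverage_f1(ground_truth_tags, predicted_tags):
    """Compare two collections of input-element tags.

    Returns (precision, recall, f1) as floats between 0 and 1.
    Duplicate tags are collapsed because the comparison is set-based.
    """
    truth = set(ground_truth_tags)
    predicted = set(predicted_tags)
    true_positives = len(truth & predicted)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical tags: the model finds two of three inputs plus one extra.
    print(coverage_f1(["card_number", "expiry", "ssn"],
                      ["card_number", "expiry", "zip_code"]))
```

A template baseline that asks about every input element trivially scores 1.0 here, which is why the interesting question is how close the LLM gets.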

Three raters rated 931 questions for both Template and LLM. For Grammar Correctness, Template scored 3.6 on average, while the LLM achieved a nearly perfect average of 4.98. For UI Relevance, the LLM generates 8.7% more relevant questions than Template. For Question Coverage, the LLM achieves an F1 score of 95.9% (Precision = 95.4%, Recall = 96.3%).

 While the rule-based Template generates questions for all input elements, which naturally results in 100% question coverage, LLM also accurately identifies input elements and generates sufficiently relevant questions.

Further analysis of the LLM's behavior also shows that, when generating specific questions, the LLM considers both the input elements and the screen context (information from other screen objects). The figure below shows examples of questions generated by the LLM and Template for two UI screens.

In the figure on the left, the LLM uses the context that the screen asks for credit card information to generate grammatically correct questions for each field. For example, in (2), the LLM asks for the "credit card expiration date," while Template does not mention "credit." Also, in (3), the LLM correctly asks for the "last 4 digits of SSN," while Template omits this detail.

In the figure on the right, the LLM uses prior knowledge to combine multiple related inputs into a single question, which Template cannot do. Here, the LLM generates a single question about the price range by combining the minimum and maximum price fields.

Experiment (Screen Summarization)

This task summarizes what is displayed on the UI screen and communicates it to the user. It helps users quickly understand the content of a mobile UI, which is especially useful when the user cannot see the screen. An example of a prompt is shown in the figure below. Chain-of-thought prompting is not used here because the task does not require intermediate results.

The figure below shows example screens with human-labeled summaries and the summaries output by both Screen2Words and the LLM. The LLM is more likely to use specific (concrete) text on the screen in its summaries, such as San Francisco (top left) and Tiramisu Cake Pop (bottom left). Screen2Words, on the other hand, produces more general (abstract) summaries.

 In addition, the LLM is more likely to generate a more extended summary that leverages multiple key elements on the screen. For example, the screen at the top right shows how the LLM utilizes the app name, the send file button, and the fax button for the recipient to compose a longer summary ("FaxFile app screen where a user can select a file to send via fax and pick recipients.").

The results also show that the LLM's prior knowledge can be useful in summarizing screens. For example, the screen at the bottom right shows a station search results page for the London Underground system. The LLM predicts "Search results for a subway stop in the London tube system." However, the input HTML contains neither "London" nor "tube". The model therefore uses prior knowledge about station names, learned from a large language corpus, to infer that the station belongs to the London subway system. This type of summary may not be generated by a model trained on Screen2Words alone, which is an advantage of the LLM.

Experiment (Screen Question-Answering)

When a user requests information about a UI screen through the agent, the agent responds with the appropriate information. An example of a prompt is shown in the figure below. Chain-of-thought prompting is not used because the task does not require intermediate results.

The figure (left) shows an example of Screen Question-Answering results. The figure (right) shows the three metrics used to evaluate performance on Screen Question-Answering. Answer accuracy is evaluated at three levels: Exact Match, Contains Ground Truth, and Sub-String of Ground Truth.

The figure (left) shows that the LLM is significantly better than the DistillBert baseline. The LLM generates accurate responses for Q1, Q2, and Q4, which fall under Exact Match. For Q3, its response falls under Contains Ground Truth: it adds the extra time "4:50 AM" but still contains the ground truth "Dec 23rd, 2016".
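The three accuracy levels can be sketched as a simple string comparison. The definitions below are inferred from the level names, and the lowercasing and whitespace stripping are assumptions made for illustration; the function name match_level is hypothetical.

```python
def match_level(prediction: str, ground_truth: str) -> str:
    """Classify a predicted answer against the ground-truth answer.

    Normalization (strip + lowercase) is an assumption for this sketch.
    """
    pred = prediction.strip().lower()
    truth = ground_truth.strip().lower()

    if pred == truth:
        return "exact_match"
    # The prediction has extra text but includes the full ground truth.
    if truth and truth in pred:
        return "contains_ground_truth"
    # The prediction is only a fragment of the ground truth.
    if pred and pred in truth:
        return "substring_of_ground_truth"
    return "no_match"


if __name__ == "__main__":
    # The date example from the article: extra time, but the date is present.
    print(match_level("Dec 23rd, 2016 4:50 AM", "Dec 23rd, 2016"))
```

Checking the levels in this order means an answer is always reported at the strictest level it satisfies.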

The DistillBert baseline, on the other hand, produces an accurate Exact Match answer for Q4, but its answers to the other questions are missing or completely different. In Q2, it returns HTML code as its answer.

Experiment (Mapping Instruction to UI Action)

This task performs the appropriate screen operation in response to the user's request. For example, given the instruction "open Gmail," the model must correctly identify the Gmail icon on the home screen. This task also does not require intermediate results, so chain-of-thought prompting is not used. The output is enclosed in the special tags <SOI> and <EOI>, which mark the start and end of the predicted element ID, respectively.

 An example of a prompt is shown below. Here, in response to the request "Open your device's clock app.", a clock app with element ID=29 is predicted.
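A minimal sketch of reading the predicted element out of the model's response is shown below. It assumes element IDs are non-negative integers, as in the ID=29 example; the function name extract_element_id is hypothetical.

```python
import re
from typing import Optional

# Assumes the ID between the tags is a run of digits, optionally padded
# with whitespace.
ELEMENT_ID_PATTERN = re.compile(r"<SOI>\s*(\d+)\s*<EOI>")


def extract_element_id(llm_output: str) -> Optional[int]:
    """Return the first element ID found between <SOI> and <EOI>, or None."""
    match = ELEMENT_ID_PATTERN.search(llm_output)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(extract_element_id("<SOI>29<EOI>"))
```

Returning None when no tagged ID is present lets the caller treat a malformed response as a failed prediction instead of raising an error.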

The experiment uses the PixelHelp dataset, which contains 187 instructions for performing everyday tasks on a Google Pixel smartphone, such as switching Wi-Fi settings or checking email. As prompt modules, one screen is randomly sampled for each app package in the dataset. The final prompt is then built by randomly sampling from these prompt modules, under two conditions: in-app and cross-app. In the in-app condition, the prompt includes a prompt module from the same app package as the test screen. In the cross-app condition, it does not.
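The two sampling conditions can be sketched as follows. This is an assumption-laden illustration, not the paper's code: the mapping from package name to prompt module, the function name sample_prompt_modules, and the choice to fill the remaining in-app shots from other packages are all hypothetical.

```python
import random


def sample_prompt_modules(modules, test_package, n_shots, condition, rng):
    """Pick n_shots exemplar modules for a test screen.

    modules: dict mapping app package name -> prompt module text
    condition: "in-app" guarantees the test screen's own package is included;
               "cross-app" excludes it entirely.
    """
    if n_shots < 1:
        raise ValueError("n_shots must be at least 1")
    # Sort so that results depend only on the rng, not on dict ordering.
    others = sorted(pkg for pkg in modules if pkg != test_package)

    if condition == "in-app":
        if test_package not in modules:
            raise KeyError(f"no prompt module for {test_package}")
        chosen = [test_package] + rng.sample(others, n_shots - 1)
    elif condition == "cross-app":
        chosen = rng.sample(others, n_shots)
    else:
        raise ValueError(f"unknown condition: {condition}")
    return [modules[pkg] for pkg in chosen]


if __name__ == "__main__":
    demo = {"com.clock": "clock exemplar",
            "com.mail": "mail exemplar",
            "com.settings": "settings exemplar"}
    print(sample_prompt_modules(demo, "com.clock", 2, "cross-app",
                                random.Random(0)))
```

Passing the random generator explicitly keeps the sampling reproducible across runs.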

The evaluation metrics are the percentages of partial (Partial) and complete (Complete) matches of the target elements. The results are shown in the table below. In the 0-shot setting, the LLM completes almost none of the tasks under either Partial or Complete.

In cross-app, the 1-shot LLM scores 74.69 for Partial and 31.67 for Complete. The 2-shot case shows a slight improvement on both metrics. In-app scores higher than cross-app in both the 1-shot and 2-shot cases: the 2-shot LLM (in-app) achieves 80.36 for Partial and 45.00 for Complete.
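One plausible reading of the two metrics, offered only as an assumption since this article does not define them precisely, is that Partial is the percentage of individual steps whose predicted element matches the target, and Complete is the percentage of instructions whose whole sequence matches. The sketch below implements that reading with hypothetical element IDs.

```python
def partial_and_complete(predicted, target):
    """Score predicted element-ID sequences against target sequences.

    predicted, target: lists of sequences, one sequence per instruction.
    Returns (partial_pct, complete_pct) under the assumed definitions:
    partial = matched steps / all target steps,
    complete = fully matched instructions / all instructions.
    """
    total_steps = 0
    matched_steps = 0
    complete_tasks = 0

    for pred_seq, target_seq in zip(predicted, target):
        hits = sum(p == t for p, t in zip(pred_seq, target_seq))
        matched_steps += hits
        total_steps += len(target_seq)
        if len(pred_seq) == len(target_seq) and hits == len(target_seq):
            complete_tasks += 1

    partial_pct = 100 * matched_steps / total_steps if total_steps else 0.0
    complete_pct = 100 * complete_tasks / len(target) if target else 0.0
    return partial_pct, complete_pct


if __name__ == "__main__":
    # Hypothetical IDs: the first instruction is fully correct,
    # the second misses its last step.
    print(partial_and_complete([[1, 2, 3], [4, 5]], [[1, 2, 3], [4, 6]]))
```

Under this reading a single wrong step costs little in Partial but removes the whole instruction from Complete, which is consistent with Complete being much lower than Partial in the reported scores.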


This paper investigates the feasibility of using PaLM as a large language model (LLM) to interact with mobile UI screens in natural language. It proposes a set of prompting methods for interacting with mobile UI screens and tackles four tasks. The results show that the LLM achieves performance comparable to or better than traditional machine learning methods.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
