

"Spotlight" For UI Modeling With Only UI Images, Independent Of View Hierarchy

Deep Learning

3 main points
✔️ Enables modeling of image-only mobile UI independent of the view hierarchy
✔️ Easily extendable and generalizable to other UI tasks without architectural changes
✔️ Achieves SoTA on typical UI tasks

Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus
written by Gang Li, Yang Li
(Submitted on 29 Sep 2022 (v1), last revised 24 Feb 2023 (this version, v4))
Comments: Published as a conference paper at ICLR 2023
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Understanding mobile UI is an important topic for UI automation and accessibility. Until now, modeling of mobile UI has relied heavily on the view hierarchy of the UI screen (structural data or metadata, like the DOM of a web page). However, view hierarchies are not always available, and they often contain missing object information or inaccurate, corrupted structural information. So while view hierarchies are convenient and easy to handle, relying on them can hinder the applicability and performance of UI modeling.

Therefore, this paper proposes Spotlight, a method that uses only image data of UI screens. Spotlight can be easily extended to a variety of tasks, and in this paper it achieves SoTA on several typical UI tasks, outperforming methods that rely on the view hierarchy. The figure below shows an overview of the Spotlight process.

(Adapted from Google AI Blog, "A vision-language approach for foundational UI understanding")

Spotlight model

The figure below shows the architecture of the Spotlight model and examples of the UI tasks experimented with in this paper. The input is a tuple of a UI screenshot and a region of interest on the screen; the output is a description of, or response to, that region. Because the input and output are so simple, the architecture can be extended to a variety of UI tasks.
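To make the unified interface concrete, here is a minimal sketch of the (screenshot, region) → text tuple the paper describes. The `SpotlightInput` dataclass, the task names, and the placeholder outputs are hypothetical illustrations, not the paper's actual API; the point is only that every task shares one signature, so no architectural change is needed per task.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SpotlightInput:
    screenshot: bytes  # raw UI image pixels
    # (left, top, right, bottom), normalized; None for whole-screen tasks
    bbox: Optional[Tuple[float, float, float, float]]

def spotlight(task: str, x: SpotlightInput) -> str:
    # Stand-in for the real model: every task consumes the same input
    # tuple and emits text, which is what makes the design extendable.
    placeholder_outputs = {
        "widget_captioning": "caption for the widget in bbox",
        "screen_summarization": "summary of the whole screen",
        "command_grounding": "location of the object matching a command",
        "tappability": "tappable / not tappable",
    }
    return placeholder_outputs[task]

x = SpotlightInput(screenshot=b"", bbox=None)
print(spotlight("screen_summarization", x))  # summary of the whole screen
```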

Spotlight employs ViT to encode the image (screenshot) and a T5 Transformer decoder for language generation. In addition, UI tasks often need to focus on a specific region or object in the image rather than the entire image. Therefore, a "Focus Region Extractor" is introduced so that features of specific regions can be extracted. The figure below shows the "Focus Region Extractor" process.

(Adapted from Google AI Blog, "A vision-language approach for foundational UI understanding")

Here, the paper introduces the "Region Summarizer," which extracts a region's features from ViT's encoding using attention queries generated from the bounding box (bbox) of the region or object. Each coordinate (left, top, right, bottom) of the bbox, represented as a yellow box in the screenshot, is embedded as a dense vector via a multi-layer perceptron (MLP). These coordinate embeddings are then input to a Transformer together with coordinate-type embeddings. The coordinate queries attend to ViT's encoding of the UI image via cross-attention, and the final output of the Transformer is used as the region representation for the T5 decoder.
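The mechanism above can be sketched numerically: embed each scalar bbox coordinate as a dense vector, add a per-coordinate type embedding, and let the four resulting queries cross-attend to the ViT tokens. This is a single-head, single-layer NumPy sketch under assumed dimensions (d = 64, 196 ViT tokens, a one-layer projection in place of the MLP); the paper's actual sizes and layer counts differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 64
rng = np.random.default_rng(0)

# Hypothetical parameters (random here; learned in the real model).
W_embed = rng.normal(size=(1, d)) * 0.02      # scalar coordinate -> dense vector
coord_type = rng.normal(size=(4, d)) * 0.02   # left / top / right / bottom type embeddings

bbox = np.array([0.1, 0.2, 0.5, 0.4])         # normalized (left, top, right, bottom)
vit_tokens = rng.normal(size=(196, d))        # dummy ViT encoding of the screenshot

# Coordinate queries = embedded coordinates + coordinate-type embeddings
queries = bbox[:, None] @ W_embed + coord_type        # (4, d)

# Cross-attention: the four coordinate queries attend to the ViT tokens
attn = softmax(queries @ vit_tokens.T / np.sqrt(d))   # (4, 196), rows sum to 1
region_summary = attn @ vit_tokens                    # (4, d), passed to the T5 decoder

print(region_summary.shape)  # (4, 64)
```

Because the region is expressed only through these queries, the same frozen image encoding can serve whole-screen tasks (no bbox focus) and object-level tasks alike.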


Spotlight was pre-trained using two unlabeled datasets (an internal dataset based on the C4 corpus and an internal mobile dataset) containing 2.5 million mobile UI screens and 80 million web pages, and then validated on four tasks. The first is Widget Captioning, which generates a textual description of a UI object. The second is Screen Summarization, which generates an overview of the screen, describing its content and functionality. The third is Command Grounding, which locates the object on the screen that corresponds to a user request. The fourth is Tappability Prediction, which predicts whether a particular object on the screen is tappable or not.

For evaluation, Captioning and Summarization use the CIDEr score. Grounding uses the percentage of requests for which the model can locate the target object. Tappability uses the F1 score to measure how well the model predicts which objects can and cannot be tapped.
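Two of these metrics are simple enough to illustrate directly: grounding accuracy (the fraction of requests where the predicted object matches the target) and F1 for binary tappability. The tiny example inputs below are made up for illustration; CIDEr needs a full n-gram consensus implementation and is omitted.

```python
def grounding_accuracy(preds, targets):
    # preds / targets: object ids the model selected vs. the ground truth
    hits = sum(p == t for p, t in zip(preds, targets))
    return hits / len(targets)

def f1_score(preds, labels):
    # preds / labels: binary lists, 1 = tappable
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(grounding_accuracy([3, 1, 7], [3, 2, 7]))  # 2 of 3 correct -> 0.666...
print(f1_score([1, 1, 0, 1], [1, 0, 0, 1]))      # precision 2/3, recall 1 -> 0.8
```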

In this experiment, Spotlight is compared to several benchmark models. Widget Caption uses the view hierarchy and an image of each UI object to generate a textual description of the object. Screen2Words generates screen summaries using view hierarchies, screenshots, and auxiliary features (such as app descriptions). VUT combines screenshots and view hierarchies to perform multiple tasks. Tappability leverages object metadata from the view hierarchy and screenshots to predict an object's tappability, while Taperception, a follow-up model to Tappability, predicts tappability from visual information alone.

We experiment with two different-sized ViT models, B/16 and L/16. Note that L/16 is larger than B/16, has a parameter size similar to the mT5-based models, and reuses pre-trained checkpoints. The experimental results are shown in the table below, which shows that Spotlight (L/16) achieves SoTA on all UI tasks.

Next, to understand whether the "Region Summarizer" can pay attention to the target regions on the screen, the attention weights over the regions of interest are visualized for Widget Captioning and Screen Summarization. In the Widget Captioning example below (left), we can see that the model has learned to attend not only to the target region, a checkbox, but also to the text "Chelsea" at the far left in order to generate the caption. In the Screen Summarization example below (right), we see that the model has learned to focus on the parts of the screen that are important for the summary.


In this paper, the authors propose "Spotlight," which uses only image data for UI modeling to understand mobile UI. While view hierarchies are useful, they are not always available, and object information is often missing or structural information inaccurate and corrupted, making traditional methods that rely on view hierarchies inadequate. With a method that does not rely on view hierarchies, these risks of UI modeling can be mitigated.

In addition, Spotlight leverages existing models (ViT and T5), which makes it highly versatile and easily extendable to various UI modeling tasks, among other benefits. The paper also shows that Spotlight achieves SoTA on several representative UI tasks, outperforming traditional methods that use view hierarchies. Spotlight can be easily applied to more UI tasks and has the potential to contribute significantly to many interactions and user experiences.
