Ferret-UI, A Multimodal Large-scale Language Model For Mobile UI
3 main points
✔️ Proposes "Ferret-UI," a multimodal large-scale language model dedicated to understanding mobile UI interactions
✔️ Introduces "anyres" for different screen aspect ratios. Can function effectively on a variety of screen sizes.
✔️ Achieves remarkable performance in multi-model comparisons, especially in advanced tasks
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
written by Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
(Submitted on 8 Apr 2024)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Mobile applications have become indispensable tools in our daily lives in a variety of areas, including information retrieval, reservations, and entertainment. We usually visually check the screen for our purposes and perform the necessary operations. By automating this process, users can more easily accomplish their objectives. Automation can also be applied in a wide variety of areas such as accessibility improvement, UI navigation, app testing, and usability research.
To smoothly automate UI perception and interaction, a system with advanced capabilities is needed. The system must have the ability to understand the entire screen and focus on specific UI elements. It must also have the ability to translate natural language guides into specific actions within the UI, perform advanced reasoning, and provide relevant details.
However, current technology has yet to fully realize these capabilities. Several multimodal large-scale language models (MLLMs) have been reported that are strong on natural images, but even these struggle with UI screens. UI screens have more elongated aspect ratios than natural images and contain many small objects such as icons and text, which makes them a challenge.
To address these challenges, this paper develops Ferret-UI, the first multimodal large-scale language model designed for UI screens, with the ability to understand open-ended language instructions and translate them into appropriate actions. Ferret-UI is developed by focusing on three main areas: model architecture improvements, data curation, and benchmark construction. In particular, we introduce a grid configuration that divides the image into sub-images to support arbitrary resolutions and provide flexibility for different screen sizes.
In developing Ferret-UI, we also generate training data that includes tasks at various levels, ranging from basic UI manipulation to advanced reasoning. In the initial stages, a template-based approach is used to sample basic Referring tasks such as widget classification, icon recognition, and OCR, as well as Grounding tasks such as widget and icon location identification. These tasks allow the model to understand the meaning and placement of UI elements and to make fine distinctions between different types of UI elements and even within the same type.
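As a rough illustration of the template-based sampling described above, the following sketch turns detected UI elements into Referring and Grounding question-answer pairs. The `UIElement` type, the template wording, and the `make_samples` helper are all hypothetical and are not taken from the paper; they only show the general shape such a pipeline could have.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class UIElement:
    kind: str   # "text", "icon", or "widget"
    label: str  # OCR text, icon class, or widget type
    box: Box


# Hypothetical templates; the article does not list the actual wording.
REFERRING_TEMPLATES: Dict[str, str] = {
    "text": "What is the text in the region {box}?",
    "icon": "Which icon is shown in the region {box}?",
    "widget": "What type of widget is in the region {box}?",
}
GROUNDING_TEMPLATES: Dict[str, str] = {
    "text": 'Where is the text "{label}"?',
    "icon": 'Where is the "{label}" icon?',
    "widget": 'Where is the "{label}" widget?',
}


def make_samples(elements: List[UIElement]) -> List[dict]:
    """Build one Referring and one Grounding sample per detected element."""
    samples = []
    for el in elements:
        box = list(el.box)
        # Referring: the region is given, the model must name its content.
        samples.append({
            "task": "referring",
            "question": REFERRING_TEMPLATES[el.kind].format(box=box),
            "answer": el.label,
        })
        # Grounding: the content is given, the model must return its region.
        samples.append({
            "task": "grounding",
            "question": GROUNDING_TEMPLATES[el.kind].format(label=el.label),
            "answer": box,
        })
    return samples
```

For a single detected icon such as `UIElement("icon", "settings", (10, 20, 40, 50))`, this yields one Referring sample whose answer is the label and one Grounding sample whose answer is the box.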
For more complex tasks, GPT-4 is also used to generate advanced task data covering detailed description, conversation perception, conversation interaction, and function inference. This allows the model to engage in in-depth discussions about visual elements, develop action plans based on clear goals, and interpret the purpose of a screen.
The figure below shows an example of Ferret-UI's proficiency in handling 11 tasks ranging from basic to advanced. On mobile UI screens, it can perform Referring tasks (e.g., widget classification, icon recognition, OCR) and Grounding tasks (e.g., find widget, find icon, find text, widget listing) using flexible input formats (points, boxes, scribbles). With these basic tasks, the model is equipped with rich visual and spatial knowledge and is able to distinguish UI types at both coarse and fine levels, such as different icons and text elements.
In addition, to evaluate the effectiveness of the model, we are developing a comprehensive test benchmark that includes 14 diverse mobile UI tasks. This benchmark covers both basic and advanced tasks, customized for both iPhone and Android screens. In evaluating a variety of UI comprehension models, Ferret-UI significantly outperforms existing Ferret models, especially in advanced tasks, and shows a marked advantage over other models.
We propose Ferret-UI, which employs "arbitrary resolution" and is the first UI-specific multimodal large-scale language model that efficiently performs Referring, Grounding, and Reasoning tasks.
Ferret-UI: A new UI interaction model that leverages advanced natural image recognition
Ferret-UI is based on Ferret, a multimodal large-scale language model specialized in referring to and grounding objects and regions of various shapes and levels of detail within natural images. The model can interact with objects specified in a variety of ways, including points, boxes, and free-form shapes. In addition, Ferret-UI incorporates a pre-trained visual encoder (e.g., CLIP-ViT-L/14) and a decoder-only language model (e.g., Vicuna), together with a hybrid representation that converts visual data into a format easily handled by the language model.
At the core of this technology is the Spatial-Aware Visual Sampler, which effectively characterizes and manages the shape of areas of different densities. To further advance interaction with the UI, Ferret-UI makes two important enhancements. First, it defines Referring and Grounding tasks for the UI and builds on these tasks. Second, it adjusts the model architecture to handle screen data more effectively.
Specifically, Ferret-UI includes a wide variety of UI Referring tasks such as OCR (optical character recognition), icon recognition, and widget classification, as well as Grounding tasks such as text, icon, and widget searching and widget listing. This gives the model a solid foundation of understanding for advanced interaction with the UI. These features set Ferret-UI apart from other models by providing an intuitive and innovative user interface solution.
Distinguishing itself from many prior multimodal large-scale language models, Ferret-UI uses raw screen pixels directly as model input, without the use of external detection modules or screen view files. This self-contained approach enables advanced single-screen interactions and paves the way for new applications. In particular, its potential in areas such as improved accessibility is promising.
The initial dataset analysis yielded two important insights. First, the aspect ratio of UI screens is more elongated than that found in natural images. Second, the objects handled in UI tasks (UI widgets such as icons and text) are much smaller than the objects in natural images. To effectively handle these small objects, this paper introduces an "arbitrary resolution" (anyres) approach. This allows us to choose a 1x2 or 2x1 grid configuration based on the aspect ratio of the original screen, resizing the screen appropriately before splitting it into sub-images.
For example, a portrait format screen is split horizontally and a landscape format screen is split vertically. Each of these sub-images is encoded independently and processed using the detailed visual information along with the overall image context. This "arbitrary resolution" adjustment capability allows Ferret-UI to efficiently process a variety of image formats while preserving image detail. This innovative approach has allowed Ferret-UI to achieve detailed on-screen interaction and has greatly advanced the understanding and manipulation of the UI.
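The grid selection and splitting step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tile size `TILE = 336`, the handling of square screens, and the function names are assumptions introduced here. The sketch only computes the resized canvas and the crop boxes; image loading and encoding are left out.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

TILE = 336  # assumed sub-image side length; not stated in the article


def choose_grid(width: int, height: int) -> Tuple[int, int]:
    """Return (columns, rows) of the sub-image grid for a screen.

    Portrait screens get two tiles stacked top and bottom; landscape
    screens get two tiles side by side. Treating square screens as
    portrait is an arbitrary choice made for this sketch.
    """
    return (1, 2) if height >= width else (2, 1)


def sub_image_boxes(width: int, height: int,
                    tile: int = TILE) -> Tuple[Tuple[int, int], List[Box]]:
    """Return the resized canvas size and the crop box of each sub-image."""
    cols, rows = choose_grid(width, height)
    resized = (cols * tile, rows * tile)  # screen is resized to fit the grid
    boxes = [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]
    return resized, boxes
```

Under these assumptions, a 1170x2532 portrait screenshot would be resized to 336x672 and cut into a top and a bottom tile. Each tile would then be encoded separately, alongside a resized copy of the full screen that supplies the overall context.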
Experimental results
We compare the performance of several models, including Ferret-UI-base, Ferret-UI-anyres, Ferret, and GPT-4V, and also examine the performance of Fuyu and CogAgent in advanced tasks. Results are shown in the table below.
For the public benchmark, we use the tasks from Spotlight. Although Spotlight uses 80 million web pages and 2.69 million mobile screenshots for pre-training, Ferret-UI outperforms it on screen2words (S2W) and widget captions (WiC). On taperception (TaP), on the other hand, its results are inferior but still competitive. It has been suggested that this may be due to noise in the taperception labels.
For the Referring tasks, the table reports exact-match accuracy for OCR and classification accuracy for icon recognition and widget classification. For the Grounding tasks, a prediction is counted as correct when the intersection over union (IoU) between the predicted bounding box and the label is greater than 0.5.
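The two metrics can be written down in a few lines. The sketch below assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates and that OCR strings are compared after stripping surrounding whitespace; both conventions are assumptions made here, since the article does not specify them.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds: List[Box], labels: List[Box],
                       threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the label exceeds the threshold."""
    hits = sum(iou(p, l) > threshold for p, l in zip(preds, labels))
    return hits / len(labels)


def exact_match_accuracy(preds: List[str], labels: List[str]) -> float:
    """Fraction of OCR predictions identical to the label text.

    Stripping whitespace before comparison is an assumption of this sketch.
    """
    hits = sum(p.strip() == l.strip() for p, l in zip(preds, labels))
    return hits / len(labels)
```

Two 10x10 boxes offset by half their width share an area of 50 out of a union of 150, so their IoU is 1/3 and the prediction would be counted as a miss under the 0.5 threshold.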
In addition, Ferret-UI performs better than the other models on many basic tasks, with the exception of iPhone text search; GPT-4V performs reasonably well on iPhone tasks, but slightly less well on Android tasks. This is likely due to the fact that the Android screen contains many small widgets, which makes the Grounding task more difficult. Also, Ferret-UI achieves 76% zero-shot performance on the Referring Expression Understanding task from UIBert, and adding the "anyres" feature to Ferret-UI-base improves iPhone Referring and Grounding task performance by 2 percentage points.
Next are the results for advanced tasks, shown in the table below. Because the advanced tasks require open-ended responses, GPT-4 is used to score both labels and predictions. Prediction scores are reported as a percentage of label scores.
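The relative score can be illustrated with a short calculation. The article does not say whether the percentage is computed per sample or over totals, so the sketch below assumes the ratio of summed scores; the function name and the example scores are hypothetical.

```python
from typing import Sequence


def relative_score(pred_scores: Sequence[float],
                   label_scores: Sequence[float]) -> float:
    """Prediction score expressed as a percentage of the label score.

    Assumes the ratio is taken over summed scores, which is one plausible
    reading of the article and not a confirmed detail of the paper.
    """
    label_total = sum(label_scores)
    if label_total <= 0:
        raise ValueError("label scores must sum to a positive value")
    return 100.0 * sum(pred_scores) / label_total
```

For example, if the judge gives two predictions scores of 6 and 8 while the corresponding labels both receive 8, the relative score is 100 * 14 / 16 = 87.5.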
Ferret-UI shows high performance on advanced tasks on both platforms, despite the absence of Android-specific data in the training data. This indicates effective transfer of UI knowledge between different operating systems, suggesting that the system is highly flexible.
In comparison to other models, Fuyu generates relevant answers, but lacks the detail and accuracy that Ferret-UI shows. GPT-4V, on the other hand, provides more detailed answers and scores well on all tasks. This trend is consistent with the preferences of the model evaluators.
In particular, for advanced tasks on the iPhone, the introduction of Ferret-UI-anyres has improved performance by a significant 20 percentage points. However, for Android tasks, performance is degraded. This is likely due to the fact that the training data does not include advanced task information for Android, which slightly reduces the general applicability of the model as iPhone-specific knowledge increases. This result suggests how data bias can affect model applicability.
Ablation Studies
This paper explores how enhanced visual and spatial understanding of basic UI elements can help in the execution of more complex tasks. The core hypothesis of this study is that understanding enhanced through basic tasks improves the ability to process advanced tasks. To clarify this point, we are examining in detail the impact of basic tasks on advanced task performance. The results are shown in the table below.
When only advanced tasks were tested, performance on both platforms was only 64%, but adding basic tasks from the iPhone or Android consistently improved performance on advanced tasks on the iPhone by 5%. Furthermore, adding basic tasks from the iPhone also improved performance on advanced tasks on Android by about 4%, with an additional 9% improvement when basic Android tasks were incorporated. And when both iPhone and Android basic tasks are incorporated, iPhone and Android advanced task performance improves by an additional 3% and 5%, respectively.
These results support the hypothesis that the enhanced visual and spatial understanding that basic tasks provide to the model aids in the execution of advanced tasks and improves overall performance.
To determine how different data configurations in the Spotlight task affect model performance, we are also investigating whether the addition of basic task data contributes to improved performance. Results are shown in the table below.
Although these tasks were designed to improve visual and spatial comprehension of the screen, adding basic task data from Android and iPhone did not noticeably improve performance on the three Spotlight tasks. This may be because the specialized, UI-centric vocabulary used in the basic tasks is different from the response style required in the Spotlight tasks.
The best results were obtained when advanced task data was integrated with all basic tasks. This is despite the fact that only advanced task data from the iPhone was used, resulting in a 4-point improvement in the CIDEr score for the widget captions. The open-ended responses for the advanced tasks require a more sophisticated skill set to perform and closely match the requirements of the Spotlight task.
Skill sets honed in advanced tasks are likely to be advantageous in solving Spotlight tasks, which fall somewhere in the middle of the complexity spectrum between basic and advanced tasks.
Results Analysis
Here we present the results of our analysis of the Referring and Grounding tasks among Ferret-UI's basic UI tasks.
Ferret-UI's analysis of OCR and widget classification has important implications. In particular, in the OCR task, the model tends to predict adjacent text rather than the target text, and this is especially true for small or closely spaced text. However, the incorporation of anyres technology mitigates this problem, and the expanded subimages have been shown to assist the model in processing small visual details.
The model also tends to predict actual words rather than simply decoding the characters on the screen. This is common with phonetically crafted words, such as brand names on UI screens. Additionally, the model shows the ability to accurately predict partially cropped text, even in cases where the OCR model returns incorrect text.
As with OCR, there are some interesting implications in widget classification. The model sometimes struggles to understand the relationships between widgets, for example, it tends to recognize a large button composed of multiple subelements as the subelement occupying the most space, rather than as a single widget. In other cases, small icons surrounded by text are incorrectly predicted as text, but the addition of anyres improves the accuracy of these predictions.
In the Grounding task, the model may inadvertently highlight text adjacent to the target area. In addition, when the same text appears multiple times on a screen, a future extension that allows multiple-box responses instead of a single box could improve the usefulness of the model and its accuracy in complex text retrieval scenarios.
Results Analysis: Advanced UI Tasks
Conversations demonstrate Ferret-UI's unique capabilities. To evaluate the accuracy and relevance of the output bounding boxes, we manually scored all boxes in the conversation interaction outputs of both Ferret-UI and GPT-4V. The results show that Ferret-UI and GPT-4V are 91.7% and 93.4% accurate, respectively. Since Ferret-UI generates raw coordinates while GPT-4V selects from predefined boxes, Ferret-UI's ability to ground on the UI screen is noteworthy. Despite GPT-4V's higher scores, inspecting the predictions sometimes favors Ferret-UI's more concise answers, because GPT-4V tends to provide information that is irrelevant to the question.
In addition, because Ferret-UI's basic and advanced tasks rely on the detection of UI elements, the model has not learned aspects that the detection model misses, such as color, design, and usability. For example, when generating detailed descriptions, GPT-4V may provide insights such as "the overall design follows Apple's aesthetic and is minimalistic, clean, and dark themed," but Ferret-UI has not been trained to provide such insights, as it relies solely on detected elements.
The Set-of-Mark (SoM) prompting method in GPT-4V exposes several limitations. In particular, its effectiveness drops when many small UI elements are involved. This occurs frequently in the Android detection task, where the small size of UI components can cause labels to hide the original content or exceed the intended area. Furthermore, limiting the evaluation to a specific set of candidate regions limits the model's ability to freely reference arbitrary regions. In the example below, the UI detection model treats the entire central section as a single element, covering the text and images that contain the "BUY" button. Therefore, the model cannot refer to the "BUY" button by itself.
These analyses highlight Ferret-UI's distinctive ability to perform advanced UI tasks while also revealing room for improvement.
Summary
This paper proposes Ferret-UI, a multimodal large-scale language model dedicated to better understanding and improving interaction with mobile UI screens. With "anyres" carefully designed to accommodate different screen aspect ratios and a curation of training samples that cover a wide range of basic to advanced UI tasks, Ferret-UI excels in Referring, Grounding, and Reasoning. We expect Ferret-UI to make notable advances in its application to a variety of UI applications.