Catch up on the latest AI articles


ScreenAI Understands Images And Text, From Infographics To UI

Large Language Models

3 main points
✔️ Developed ScreenAI, which understands both images and text and can handle a wide range of tasks
✔️ Combines ViT and mT5 language encoders for flexible image processing
✔️ Employs automatic data generation to improve data quality by identifying screen configurations and labeling screenshots

ScreenAI: A Vision-Language Model for UI and Infographics Understanding
written by Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma
(Submitted on 7 Feb 2024 (v1), last revised 19 Feb 2024 (this version, v2))
Comments: Published on arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Infographics (charts, diagrams, illustrations, maps, tables, document layouts, etc.) have long been considered an important component of communication due to their ability to transform complex data and ideas into simple visuals. This ability comes from making information intuitive and easy to understand through layout and visual cues. In today's increasingly digital world, mobile and desktop user interfaces (UIs) use similar design principles and visual language to make communication between people and between people and machines rich and interactive.

This background suggests the possibility of a model that understands infographics and UI in an integrated manner, but its complexity presents a significant challenge. To meet this challenge, this paper develops ScreenAI, a Vision-Language Model (VLM).

The model addresses a wide range of tasks on infographics and UIs, including question answering, element annotation, summarization, and navigation. ScreenAI approaches these challenges by combining recent techniques to recast visual tasks as text problems.

This paper leverages the similarities between UI and infographics and proposes a new approach for a comprehensive understanding of these fields. It also develops techniques for automatically generating training data and new methods for pre-training and fine-tuning. In addition, we provide three new evaluation datasets to validate the effectiveness of ScreenAI.

ScreenAI's innovations position it as a VLM for digital content comprehension tasks of all kinds, from UI to infographics and beyond. At 4.6 billion parameters it is modest in size, yet it achieves state-of-the-art performance on public infographics question-answering benchmarks, outperforming models more than ten times its size. Increasing the model size has been shown to improve performance, and further gains are expected in the future.


The model proposed in this paper is designed to understand both images and text. The model is based on a core multimodal encoder block that combines a Vision Transformer (ViT) for image analysis and an mT5 language encoder for text processing. This approach is inspired by the architecture of the PaLI model family and its ability to transform various vision and multimodal tasks from text and image input to text output.

The uniqueness of this model lies in its flexibility to process images with different patching patterns. While the traditional PaLI architecture only allows for fixed grid patches, we have adopted the technology introduced in Pix2Struct to allow the generation of image patches with arbitrary grid shapes based on the shape of the input image. This allows us to flexibly handle images with different resolutions and aspect ratios without forcibly deforming the image, thus greatly expanding the model's range of applications. A particular strength of the model is its ability to adapt to both portrait-oriented mobile images and landscape-oriented desktop images.
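The aspect-ratio-preserving patching idea can be sketched in a few lines. This is a minimal illustration of the Pix2Struct-style approach, not the paper's implementation; the `variable_grid` helper and its defaults (a budget of 1024 square 16-pixel patches) are assumptions:

```python
import math

def variable_grid(height, width, max_patches=1024, patch_size=16):
    """Pick a patch grid whose shape follows the image's aspect ratio.

    Instead of a fixed grid, choose rows x cols so that
    rows / cols ~ height / width and rows * cols <= max_patches.
    (Hypothetical sketch of Pix2Struct-style variable patching.)
    """
    # Scale factor s chosen so that (s*h/p) * (s*w/p) ~ max_patches.
    scale = math.sqrt(max_patches * patch_size / height * patch_size / width)
    rows = max(1, math.floor(scale * height / patch_size))
    cols = max(1, math.floor(scale * width / patch_size))
    return rows, cols

# A portrait phone screenshot gets a tall grid, a landscape desktop
# screenshot the transposed wide grid, without distorting the image.
print(variable_grid(2340, 1080))  # tall grid, rows > cols
print(variable_grid(1080, 2340))  # wide grid, cols > rows
```

Because `rows` and `cols` are both rounded down, the product never exceeds the patch budget, so sequence length stays bounded regardless of resolution.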

To develop this model, three sizes were trained: 670M, 2B, and 5B parameters. The 670M and 2B models build on previously trained unimodal checkpoints and focus on improving image processing and language comprehension, while the 5B model starts from a more complex multimodal pre-trained checkpoint with the ability to understand and process both images and text.

In learning, we begin with self-supervised learning using large data sets. Here, the goal is to increase the ability of the model to learn on its own, with minimal human intervention. In this phase, both the image encoder and the language model are trained. In particular, new techniques are introduced for the image encoder to allow it to flexibly adapt to different types of images.

Next, to further improve the accuracy of the model, the training of the vision encoder is paused and additional training steps are performed while reducing resource consumption. Through this process, the model gains a deeper understanding of the various tasks.

The final stage, fine-tuning, uses human-labeled data to optimize the model for specific tasks. These cover a wide variety of tasks, including question answering (QA): for the QA tasks, the model is first fine-tuned on a combined set of QA tasks, followed by additional training specific to each individual task. Other tasks are fine-tuned individually to maximize model performance.

Automatic Data Generation

The evolution of models is directly related to the quality and quantity of data. Therefore, this paper looks at the importance of access to vast and diverse data sets and employs an innovative method of automatic data generation to overcome the limitations of manual annotation. This approach, which uses specialized small-scale models to efficiently and accurately generate and label data, offers unparalleled scalability and data diversity compared to manual methods.

The first step in this approach is for the model to develop a comprehensive understanding of text elements, screen components, and their structure and hierarchy. From this foundation, the model develops the ability to accurately interpret and respond to a wide range of user interfaces. Screenshots collected from a variety of devices are annotated with labels that detail UI elements and their relationships. At the core of this process is a layout annotator based on the DETR detection model that identifies and labels a wide range of UI elements. Additional steps analyze pictograms with an icon classifier, generate descriptive captions with the PaLI image captioning model, and extract and annotate text content with an OCR engine. Together, these produce a holistic and detailed description of the screen content.
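The annotation pipeline described above might be sketched as follows. All four annotators are stubbed as interchangeable callables; the function and its interfaces are hypothetical illustrations, not the paper's actual APIs:

```python
def build_screen_schema(screenshot, layout_model, icon_classifier, captioner, ocr):
    """Sketch of the automatic annotation pipeline (hypothetical interfaces).

    layout_model: DETR-style detector returning UI elements with type and bbox.
    icon_classifier: names pictograms cropped from the screenshot.
    captioner: PaLI-style image captioner for pictures.
    ocr: extracts text from a region.
    """
    schema = []
    for element in layout_model(screenshot):
        entry = {"type": element["type"], "bbox": element["bbox"]}
        if element["type"] == "icon":
            entry["label"] = icon_classifier(screenshot, element["bbox"])
        elif element["type"] == "image":
            entry["caption"] = captioner(screenshot, element["bbox"])
        else:
            # Buttons, text fields, labels, etc. carry visible text.
            entry["text"] = ocr(screenshot, element["bbox"])
        schema.append(entry)
    return schema
```

The resulting list of typed, localized elements is exactly the kind of structured "screen schema" the next section describes.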

Also called a "screen schema," this comprehensive annotation sample forms the core of the data generation and serves as a pre-training task to generate similar schemas from input images. This enhances the model's ability to not only identify and interpret UI components, but also to understand their interrelationships. Screen schemas are also valuable as an interface to large-scale language models, providing LLMs with a structured and detailed representation of screen content, facilitating the creation of more complex and context-rich tasks.

In addition, the paper utilizes a large language model (LLM) to add a new dimension of diversity to the dataset. Of particular interest is PaLM 2-S, which excels at generating question-answer pairs. The process has two steps: first, the screen schema introduced earlier is created; next, prompts containing this schema are fed into the large language model to generate new synthetic data.

This practical approach requires some trial and error and prompt design skills, but by finding the right prompts, you will be able to effectively generate the desired task. Examples of actual prompts are presented in the Appendix of the paper. For quality assurance of the generated data, human verification is performed on selected data to ensure that high quality standards are met.
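The schema-to-prompt step could look something like this. The prompt wording here is invented for illustration; the paper's real prompts for PaLM 2-S appear in its Appendix:

```python
import json

def qa_generation_prompt(screen_schema, n_pairs=5):
    """Build a prompt asking an LLM to invent QA pairs from a screen schema.

    Hypothetical prompt template; the structured schema grounds the LLM
    so the generated questions refer to real on-screen elements.
    """
    return (
        "You are given a structured description of a screenshot.\n"
        f"Schema: {json.dumps(screen_schema)}\n"
        f"Generate {n_pairs} question-answer pairs a user might ask about "
        "this screen, as JSON objects with 'question' and 'answer' fields."
    )
```

Feeding the structured schema rather than raw pixels is what lets a text-only LLM produce grounded, screen-specific tasks at scale.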

This new approach brings synthetic but realistic task diversity to the dataset, dramatically increasing the depth and breadth of the pre-training dataset. The natural language processing capabilities of the large-scale language model, combined with the structured screen schema, enhances the ability to simulate a wide variety of user interactions and scenarios. This further expands the possibilities for automatic data generation in model training.


There are two key phases in the development of the model: pre-training and fine-tuning. These phases lay the foundation for the model's ability to effectively understand and respond to complex real-world scenarios. The following tasks were selected for pre-training.

Early-stage pre-training teaches the model a wide range of skills. Through tasks ranging from identifying on-screen UI elements to complex question answering, screen navigation, and content summarization, the model learns to handle a wide variety of real-world applications. These tasks develop its ability to understand textual and non-textual content, read context, and accurately navigate the interface. The fine-tuning phase then deepens the model's understanding using labels validated by human evaluators, building on the foundation laid in pre-training to increase accuracy and efficiency on specific tasks and scenarios. All of the pre-training tasks are outlined in the table below.

Through pre-training and fine-tuning, the model leverages a variety of image and text data sources, including VQA CC3M, WebLI alt-text and OCR text, and chart-to-table translation. These datasets are essential to ensure that the model remains robust in both linguistic and visual processing.

By weighting tasks proportionally to the size of the dataset in training the model and incorporating multimodal sources, our model can effectively address a wide variety of scenarios, including language processing, visual understanding, and web content analysis. This improves the overall versatility and performance of our models.
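Proportional task weighting can be sketched as a simple mixture sampler. This is a generic illustration of the idea, not the paper's training code; the dataset names are made up:

```python
import random

def mixture_weights(dataset_sizes):
    """Weight each pre-training task proportionally to its dataset size."""
    total = sum(dataset_sizes.values())
    return {name: size / total for name, size in dataset_sizes.items()}

def sample_task(weights, rng=random):
    """Draw the next training task according to the mixture weights."""
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]

# Hypothetical mixture: a larger dataset is visited proportionally more often.
weights = mixture_weights({"screen_qa": 300_000, "captioning": 100_000})
```

Sampling in proportion to size means large, diverse sources dominate early learning while small specialized sets still contribute, without any source being starved.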

Experiments and Results

The table below compares the best performing ScreenAI model with the state-of-the-art (SoTA) in screen and infographics-related tasks.

The paper also investigates how incorporating OCR text as additional input affects task performance. Inspired by the PaLI-X and PaLI-3 fine-tuning experiments, the paper finds that adding OCR text to screen- and document-related tasks contributes to performance improvement. The results in the table above also show that adding OCR boosts performance (e.g., up to 4.5% improvement on Complex ScreenQA, MPDocVQA, and InfoVQA), especially on QA tasks. However, using OCR has the disadvantage of increasing input length and slightly slowing down training, and OCR results are required at inference time.
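A minimal sketch of supplying OCR text as extra input. The template below is an assumption for illustration; the paper states only that OCR text is added as additional input, not the exact formatting:

```python
def build_model_input(question, ocr_text=None):
    """Optionally append OCR-extracted text to the model's text input.

    Hypothetical prompt format: when OCR output is available, it is
    concatenated after the question, lengthening the input sequence.
    """
    prompt = f"question: {question}"
    if ocr_text:
        prompt += f" ocr: {ocr_text}"
    return prompt
```

This makes the trade-off visible: the OCR string gives the model direct access to on-screen text, but a longer input sequence costs compute, and an OCR pass is now needed at inference time.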

In addition, experiments on model size compare the three sizes: 670M, 2B, and 5B. The graphs below show that, on screen tasks as well as other public benchmarks, performance improves as model size increases. In particular, tasks that require more advanced visual-text processing and arithmetic reasoning, such as InfoVQA, ChartQA, and Complex ScreenQA, show significant improvements when moving from the 2B to the 5B model. These results demonstrate the importance of OCR incorporation and model size selection in improving performance.


This paper introduces the ScreenAI model and a new unified schema for representing complex data and visual information compatible with infographics, document images, and various UIs. This unified representation allows for the design of a mix of self-supervised learning tasks that leverage data from all these domains.

We also show that training on this mixture leads to positive transfer to screen-related tasks, infographics, and document-related tasks. In addition, we show the impact of data generation using large language models and justify the model design choices through ablation studies.

We have applied these techniques to learn a model that achieves SoTA on many of the public benchmarks and performs competitively. However, we note that while the model is best-in-class, further work is needed on some tasks to close the gap with models that are orders of magnitude larger, such as GPT-4 and Gemini.

To facilitate further research, the authors release this unified representation dataset and two other datasets that allow for more comprehensive benchmarking.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
