[ReALM] Resolving References By Utilizing Entity Positions On The Screen With LLM
3 main points
✔️ Proposes a new model for reference resolution, ReALM. Achieves superior performance compared to traditional large-scale language models and reference resolvers.
✔️ Resolves on-screen references by using the location of entities on the screen and encoding entities using text only
✔️ Can handle multiple data formats including on-screen entities, conversational entities, and background entities
ReALM: Reference Resolution As Language Modeling
written by André Nitze
(Submitted on 29 Mar 2024)
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Our daily conversations frequently involve referential expressions such as "they" and "it" whose meaning depends on context. The ability to understand such context is essential for an agent to interpret the user's request and move the conversation forward. It is also important for a hands-free voice assistant experience, allowing users to ask about what they are seeing on their screen.
Large-scale language models have the potential to eliminate the need for a multi-step pipeline that includes understanding traditional referring expressions (reference resolution). However, such pipelines remain important and often cannot simply be replaced by end-to-end approaches. In particular, it may not be practical to use large models in environments where privacy matters or where the system must operate efficiently within limited computational resources.
In addition, when a model must integrate with APIs or exchange information with upstream and downstream components, swapping in a large language model could require a complete overhaul of the existing pipeline. A focused model, by contrast, can transparently improve the existing reference resolution module and increase the interpretability of the system as a whole.
Moreover, the reference resolution task addressed in this paper covers users referring to entities on the screen and in the background, as well as to the history of the conversation from direct interaction with the device. There is therefore value in exploring traditional natural language processing tasks, even when they can be addressed implicitly by large-scale language models.
This paper proposes a new approach that analyzes entities and their locations on the screen to generate a purely textual representation of the visual screen content. This gives the language model the context to understand where the entities are and what text surrounds them. It is the first attempt to use a large-scale language model to encode context from the screen.
Task
This paper formulates the task as identifying the entity (or entities) most relevant to a query, given a task the user wishes to perform and the entities associated with it. Entities are classified into three types:
- On-screen entities: entities currently displayed on the user's screen.
- Conversational entities: entities directly relevant to the conversation. They may come from a previous user utterance (e.g., the contact for "Mom" after the user says "call my mom") or from information provided by the virtual assistant (e.g., a list of locations or alarms).
- Background entities: entities that do not appear in the user's direct view or conversation, but come from background processes (e.g., an alarm that has started ringing or music playing in the background).
The task is set up as a multiple-choice question for the large language model, which must output the most appropriate choice(s) from the entities displayed on the user's screen; the model is also allowed to answer "none of these". The evaluation accepts the entities in any order, so if the correct answer is entities 8, 7, and 4, any ordering of those three entities counts as correct. This setup aims to increase the flexibility and accuracy of the evaluation.
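As a rough illustration of how such an order-agnostic, multiple-choice evaluation could be scored, the sketch below treats a prediction as correct when it names the same set of entity indices as the gold answer, in any order. The function names and the answer format are assumptions for illustration, not the paper's actual evaluation code.

```python
import re

def parse_prediction(output: str) -> set:
    """Extract the entity indices named in the model output, e.g. '8, 7 and 4' -> {8, 7, 4}."""
    return {int(tok) for tok in re.findall(r"\d+", output)}

def is_correct(model_output: str, gold_indices: set) -> bool:
    """Count a prediction as correct if it selects exactly the gold entities, in any order."""
    if not gold_indices:  # the query refers to none of the listed entities
        return model_output.strip().lower() == "none of these"
    return parse_prediction(model_output) == gold_indices

# Example: with gold entities {8, 7, 4}, the answer "4, 8 and 7" is accepted.
assert is_correct("4, 8 and 7", {8, 7, 4})
```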
Data-set
The dataset used in this paper consists of data created with the help of annotators and data generated synthetically. Each example contains a list of entities associated with the user's query and specifies which entity (or entities) the query refers to. Entities include their type, name, and other textual information (e.g., an alarm's label and time). For data with on-screen context, each entity also has a bounding box, along with a list of surrounding non-entity text elements.
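A minimal sketch of what one record in such a dataset might look like is shown below; the field names and types are assumptions made for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

@dataclass
class Entity:
    entity_type: str                      # e.g. "alarm", "phone_number", "address"
    name: str                             # display text of the entity
    attributes: Dict[str, str] = field(default_factory=dict)   # e.g. {"label": "Wake up", "time": "7:00"}
    bounding_box: Optional[Tuple[int, int, int, int]] = None   # (left, top, right, bottom) for on-screen entities

@dataclass
class Example:
    query: str                            # the user request, e.g. "call the one on Main Street"
    entities: List[Entity]                # candidate entities offered to the model
    surrounding_text: List[str] = field(default_factory=list)  # non-entity text near on-screen entities
    gold: Set[int] = field(default_factory=set)                # indices of the referenced entities
```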
The conversational data focuses on entities that arise during the user's interaction with the agent. Graders are shown a screenshot containing a synthesized list of entities and asked to write a query that unambiguously refers to an arbitrarily chosen entity from that list. For example, a grader might be given a list of businesses or alarms and asked to formulate a query pointing to a specific entity in it.
The synthetic data relies on template-based generation. This is especially useful when the reference can be resolved from the user query and the entity types alone. Two kinds of templates are used: a base template that contains mentions, entities, and slot values as needed, and a language template that adds query variations for the references defined in the base template. The data generation script combines these templates to produce queries, substituting in mentions and slot values.
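The sketch below illustrates the spirit of this two-template setup; the concrete templates, slots, and values are invented for the example and are not taken from the paper.

```python
import itertools

# Base template: a mention, its entity type, and the slot values to fill in.
base_template = {
    "mention": "that alarm",
    "entity_type": "alarm",
    "slots": {"time": ["7:00 AM", "9:30 PM"], "label": ["Wake up", "Medication"]},
}

# Language templates: query variations around the mention defined above.
language_templates = [
    "turn off {mention}",
    "can you delete {mention}",
    "move {mention} an hour later",
]

def generate(base, templates):
    """Cross query templates with slot values to produce (query, entity list, gold) examples."""
    examples = []
    for template, time, label in itertools.product(
        templates, base["slots"]["time"], base["slots"]["label"]
    ):
        query = template.format(mention=base["mention"])
        entity = {"type": base["entity_type"], "time": time, "label": label}
        examples.append({"query": query, "entities": [entity], "gold": [0]})
    return examples

print(len(generate(base_template, language_templates)), "synthetic examples generated")
```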
On-screen data is collected from various web pages containing phone numbers, email addresses, and physical addresses. Annotation of this data proceeds in two stages: the first stage extracts queries from the screenshot, and the second identifies entities and their mentions for a given query. Graders use the screenshot to check whether the query mentions one of the visual entities and sounds natural, identify the entity the query refers to, and tag the portion of the query that mentions that entity.
Model
In this paper, we introduce the proposed model "ReALM" and evaluate its performance against two different baseline approaches. One is the reference resolver "MARRS," which is not based on any traditional large-scale language model, and the other is the state-of-the-art large-scale language model "ChatGPT" (GPT-3.5 and GPT-4).
As the baseline that is not based on a large-scale language model, we use MARRS, a system proposed by Ates et al. (2023). This system can handle on-screen entities as well as conversational and background entities. We reimplement this system and train it on a dataset that includes conversational, on-screen, and synthetic data.
The other baselines are the GPT-3.5 and GPT-4 versions of ChatGPT available as of January 24, 2024. These models achieve significant gains on the on-screen reference resolution task, especially when the input includes images. The prompt-only and prompt-plus-image ChatGPT setups used here are introduced as new experiments in this paper.
The approach proposed in this paper then fine-tunes a large-scale language model, using FLAN-T5 (Chung et al., 2022). The parsed input is fed to the model and optimized with the default fine-tuning parameters. Candidate entities are shuffled before being fed to the model so that it does not overfit to their position.
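The sketch below shows how such a fine-tuning input could be assembled, with entity shuffling and gold-index remapping; the prompt wording and the choice of the public google/flan-t5-base checkpoint are assumptions for illustration, not the paper's exact setup.

```python
import random
from transformers import AutoTokenizer

def build_example(query, entities, gold):
    """Shuffle candidate entities so the model cannot latch onto their position,
    remap the gold indices accordingly, and render a multiple-choice prompt."""
    order = list(range(len(entities)))
    random.shuffle(order)
    lines = [f"{new}. {entities[old]}" for new, old in enumerate(order)]
    new_gold = sorted(order.index(g) for g in gold)
    prompt = f"Which entities does the query refer to?\nQuery: {query}\n" + "\n".join(lines)
    target = ", ".join(str(i) for i in new_gold) if new_gold else "none of these"
    return prompt, target

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
prompt, target = build_example(
    "call the pharmacy",
    ["alarm: 7:00 AM", "contact: Main Street Pharmacy", "song: now playing"],
    gold=[1],
)
batch = tok(prompt, return_tensors="pt")
batch["labels"] = tok(target, return_tensors="pt").input_ids  # standard seq2seq fine-tuning target
```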
Conversational references fall into two categories: type-based and descriptive. Type-based references rely on a combination of the user's query and the entity types, while descriptive references use specific attributes of an entity to identify it uniquely. This combination provides high accuracy on complex entity identification tasks. For on-screen references, upstream data detectors parse the text on the screen and extract the relevant entities. These entities are then encoded for the language model using text only, with a new algorithm developed to effectively represent the screen as text, read left to right and top to bottom.
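A minimal sketch of such a left-to-right, top-to-bottom textual screen encoding is given below; the line-grouping tolerance and the tab separator are assumptions for illustration rather than the paper's exact algorithm.

```python
def encode_screen(elements, line_tolerance=10):
    """elements: list of (text, (left, top, right, bottom)) boxes.
    Returns one string that roughly preserves the spatial layout of the screen."""
    # Sort boxes by their vertical centre, then group nearby centres into one visual line.
    by_vertical = sorted(elements, key=lambda e: (e[1][1] + e[1][3]) / 2)
    lines, current, current_y = [], [], None
    for text, (left, top, right, bottom) in by_vertical:
        y = (top + bottom) / 2
        if current and abs(y - current_y) > line_tolerance:
            lines.append(current)
            current = []
        current.append((left, text))
        current_y = y
    if current:
        lines.append(current)
    # Within each visual line, order elements left to right and join them with tabs.
    return "\n".join("\t".join(text for _, text in sorted(line)) for line in lines)

screen = [("Pharmacy", (10, 12, 90, 28)), ("(555) 010-2000", (120, 10, 260, 26)),
          ("Opening hours", (10, 50, 120, 66)), ("9am - 6pm", (140, 52, 210, 68))]
print(encode_screen(screen))
```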
With these innovative approaches, we aim to provide more accurate and efficient solutions to reference resolution issues.
Experimental results
The results are shown in the table below. Overall, the proposed model outperforms the MARRS model on all types of datasets. The proposed model also outperforms GPT-3.5, which has an order of magnitude more parameters. Furthermore, the proposed model matches the performance of the latest GPT-4 while being lighter and faster.
Of particular note are the results on the on-screen data set. Compared to GPT-4 using screenshots, the proposed model with the text encoding approach achieves nearly the same performance. In addition, experiments with models of different sizes show that performance tends to improve as the model size increases, and this difference is particularly noticeable on the on-screen dataset, suggesting the complexity of the task.
As a case study, the zero-shot performance of the models on an unseen domain (alarms) is examined. The results confirm that the approaches based on large-scale language models outperform the FT model, with ReALM and GPT-4 in particular showing very similar performance in the unseen domain.
Fine-tuning on user requests allows ReALM to better understand domain-specific questions. For example, GPT-4 mistakenly assumed a reference concerned only a setting, whereas ReALM also takes a home automation device running in the background into account, resulting in more accurate reference resolution. This is likely because ReALM is trained on domain-specific data, which avoids such errors.
Summary
This paper proposes a "ReALM"method of reference resolution using a large-scale language model. This is accomplished by encoding candidate entities as natural language text. In particular, we show how entities present on the screen are passed to the large-scale language model using a new textual representationthateffectively summarizes the user's screen while preserving their relative spatial location. and is shown to perform nearly as well as GPT-4, the current state-of-the-art large-scale language model,despite on-screen references in the text domain only. It also outperforms GPT-4 in domain-specific user speech, making ReaLM an ideal choice for a practical reference resolution system that can exist on a device without compromising performance.
While the approach proposed in this paper effectively encodes the position of entities on the screen, it can lose information needed to resolve complex user queries that depend on a nuanced understanding of position. Exploring more sophisticated approaches, such as dividing the screen into a grid and encoding relative spatial positions as text, may therefore be a challenging but promising direction for research.