What Are The Key Elements In Developing A High-performance Web Assistant That Applies LLM?

Large Language Models

3 main points
✔️ Potential web UI applications of large-scale language models and challenges: Web assistants using large-scale language models have the potential to improve human efficiency in retrieving information from the web, but task completion accuracy on real websites is still low at about 15%.
✔️ Key factors for improved performance: ability to efficiently identify and retrieve important information from web pages, specific natural language commands, proper processing of HTML, and personas incorporated by large-scale language models contribute to improved performance.
✔️ Insights from experimental results: experiments with Claude2 show that sample selection, query concreteness, HTML handling, and persona have a significant impact on web assistant performance, and that proper coordination of these factors leads to improved task completion accuracy.

"What's important here?": Opportunities and Challenges of Using LLMs in Retrieving Information from Web Interfaces
written by Faria Huq, Jeffrey P. Bigham, Nikolas Martelaro
(Submitted on 11 Dec 2023)
Comments: Accepted to NeurIPS 2023 R0-FoMo Workshop
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)


The images used in this article are from the paper, the introductory slides, or were created based on them.


In recent years, large-scale language models (LLMs) have attracted attention across many UI tasks. Web assistants that understand natural language commands and retrieve relevant information from a web UI could significantly improve human efficiency. Recent advances in LLMs, together with their ability to interpret HTML, have made LLM-driven web assistants capable of autonomous navigation a major research focus.

However, to date, web assistants have only achieved about 15% accuracy in completing tasks on real-world websites, and are not yet at the stage where they can be used on a daily basis.

To complete a given task, an assistant must navigate through a series of web pages. Doing so requires precisely performing a sequence of basic operations, such as identifying and retrieving the relevant UI elements on each page at each step.

The purpose of this paper is to examine how well a large-scale language model-driven web assistant retrieves the important and relevant elements of a web page, with the goal of improving its overall performance. In other words, the aim is to improve basic capabilities before applied capabilities.

The paper studies the following four elements of the input prompt and shows that the performance of large-scale language models depends on each of them.

  1. Sample selection for few-shot prompts: how the selection of few-shot samples affects performance
  2. Concreteness of natural language commands: how the level of concreteness of input commands affects performance
  3. HTML truncation strategies: how HTML encoding strategies affect performance
  4. Specific roles (personas) assumed by the LLM: how the choice of roles mimicked by the LLM affects performance

The paper also discusses some of the limitations faced by large-scale language models (e.g., hallucination of non-existent web elements and failure to follow input instructions) and future directions and possible solutions to address these limitations. It highlights important factors to consider when using large-scale language models and provides valuable insights for choosing the best strategy for specific types of tasks and situations.


This paper uses the Claude2 model by Anthropic, which has a context length of 100k tokens, the largest among the large language models available at the time of the study. The large context length is well suited to web UI analysis, given the thousands of elements that may be present on a web page.

Each sample is defined as a set {{w, q}, e}, where w, q, and e denote the HTML of the current viewport, the user query, and the reference UI elements, respectively. The task Ψ is formulated as retrieving the most important UI elements given w and q, and the goal of the experiments is to evaluate how well the model performs Ψ.

This experimental setting provides a precise way to measure the ability of large-scale language models in web UI tasks: through concrete samples containing HTML content, user queries, and target UI elements, we can evaluate how effectively the models identify and retrieve important information.
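The formulation above can be sketched as a small data structure plus a function signature. This is only an illustration: the names `Sample`, `Retriever`, and `keyword_retriever` are hypothetical, and the keyword-matching function is a toy stand-in for the LLM, not the method used in the paper.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    html: str             # w: HTML of the current viewport
    query: str            # q: natural language user query
    reference: List[str]  # e: ids of the reference UI elements


# Psi maps (w, q) to a list of predicted element ids.
Retriever = Callable[[str, str], List[str]]

# Matches an opening tag with an id attribute and captures the id and
# the text that directly follows the tag (assumption: simple, flat HTML).
_ELEMENT = re.compile(r'<\w+[^>]*\bid="([^"]+)"[^>]*>([^<]*)(?=<)')


def keyword_retriever(html: str, query: str) -> List[str]:
    """Toy stand-in for the LLM: keep elements whose text shares a word with the query."""
    query_words = set(query.lower().split())
    hits = []
    for element_id, text in _ELEMENT.findall(html):
        if query_words & set(text.lower().split()):
            hits.append(element_id)
    return hits


sample = Sample(
    html='<button id="search">Search flights</button><a id="help">Help</a>',
    query="search for flights",
    reference=["search"],
)
predicted = keyword_retriever(sample.html, sample.query)
```

In a real setup the `Retriever` would wrap a call to the language model; the toy version only makes the input and output shapes concrete.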

The experiments use the Mind2Web dataset, which contains 2,000 open-ended, real-world tasks collected from 137 websites. The dataset provides three test sets: (1) cross-task: samples from tasks not seen during training, (2) cross-website: samples from unseen websites, and (3) cross-domain: samples from unseen domains. These three test sets are particularly useful for understanding how well large-scale language models generalize.

In addition, HTML layouts are often complex, with multiple nested elements pointing to the same information. For example, in the example shown below, the search button has one child element (the search icon). Predicting either the search button or the icon yields the same result. To address this, the evaluation expands each predicted UI element into its underlying leaf nodes and compares them with the leaf nodes of the reference label.
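The leaf-node expansion can be sketched as follows. This is a minimal illustration that assumes the DOM is available as a mapping from an element id to the ids of its children; the function names and element ids are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def expand_to_leaves(node_id: str, children: Dict[str, List[str]]) -> Set[str]:
    """Return the ids of all leaf nodes under node_id (or node_id itself if it is a leaf)."""
    leaves: Set[str] = set()
    stack = [node_id]
    while stack:
        current = stack.pop()
        kids = children.get(current, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.add(current)
    return leaves


def leaf_set(element_ids: Iterable[str], children: Dict[str, List[str]]) -> Set[str]:
    """Expand every element in element_ids and merge the resulting leaves."""
    result: Set[str] = set()
    for element_id in element_ids:
        result |= expand_to_leaves(element_id, children)
    return result


# The search button wraps a single icon, so both predictions expand to the same leaf.
dom_children = {"search_button": ["search_icon"]}
from_button = leaf_set(["search_button"], dom_children)
from_icon = leaf_set(["search_icon"], dom_children)
```

Comparing leaf sets rather than raw element ids means a prediction of the button and a prediction of its icon are scored identically.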

In this experiment, performance is evaluated by element-wise Recall and Accuracy.
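As a rough illustration of the recall side of this evaluation, the sketch below computes the fraction of reference elements that appear in the prediction. The exact metric definitions are not given in this article, so this is an assumption rather than the authors' implementation; the function names are hypothetical, and accuracy is left out because its definition is not stated here.

```python
from typing import Iterable, Set, Tuple


def element_recall(predicted: Set[str], reference: Set[str]) -> float:
    """Fraction of reference elements that were retrieved (0.0 if there are none)."""
    if not reference:
        return 0.0
    return len(predicted & reference) / len(reference)


def mean_recall(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Average element_recall over (predicted, reference) pairs."""
    scores = [element_recall(predicted, reference) for predicted, reference in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `element_recall({"a", "b"}, {"a", "c"})` is 0.5, since one of the two reference elements was retrieved.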

The table below shows how performance on each test (Cross-Task, Cross-Website, and Cross-Domain) varies depending on what samples are selected from the training data.

In Cross-Task, the samples selected from the training data significantly improved performance. This can be attributed to the fact that the training data contains tasks similar to the test data, which allowed the system to learn better. However, for Cross-Website and Cross-Domain, the samples chosen are not very useful. This is likely because the websites and domains differ between the training and test data, and the system is not able to cope with the unknown environment.

In this experiment, the one-shot prompt (using only one sample) performs better than the two-shot prompt (using two samples) when selecting samples. This is likely due to the fact that as the length of the input increases, performance decreases. However, when the prompt samples are fixed (always using the same sample), performance is consistently better with the 2-shot prompt. This may indicate that in a web UI, one should not only provide meaningful samples, but also be careful about their length.
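The difference between selected and fixed few-shot samples can be made concrete with a small prompt builder. This is a sketch under assumptions: the article does not state how samples are selected, so a simple word-overlap (Jaccard) similarity between queries is used here as a placeholder, and the prompt layout and function names are hypothetical.

```python
from typing import List, Tuple

# (html, query, answer) triple taken from the training data
Example = Tuple[str, str, str]


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def select_examples(query: str, pool: List[Example], k: int) -> List[Example]:
    """Pick the k pool examples whose queries are most similar to the new query."""
    ranked = sorted(pool, key=lambda ex: token_overlap(query, ex[1]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, html: str, examples: List[Example]) -> str:
    """Concatenate the few-shot examples followed by the unanswered target sample."""
    parts = []
    for ex_html, ex_query, ex_answer in examples:
        parts.append(f"HTML:\n{ex_html}\nQuery: {ex_query}\nImportant elements: {ex_answer}\n")
    parts.append(f"HTML:\n{html}\nQuery: {query}\nImportant elements:")
    return "\n".join(parts)


pool = [
    ('<button id="book">Book</button>', "book a flight to London", "book"),
    ('<a id="menu">Menu</a>', "find a pizza restaurant", "menu"),
]
one_shot = select_examples("book a flight to Paris", pool, k=1)
prompt = build_prompt("book a flight to Paris", '<button id="go">Go</button>', one_shot)
```

A fixed-sample prompt would skip `select_examples` and always pass the same examples to `build_prompt`. Each added example lengthens the prompt, which is the trade-off the paragraph above describes.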

The performance of Claude2 is highly dependent on the training data samples used, and proper sample selection and prompting methods are critical for consistently high performance for any test environment.

The table below shows the performance of Claude2 based on the level of concreteness of the user queries.

Because the task descriptions in the existing UI dataset are very detailed and far from how users would phrase queries in practice, the paper rewrites the task descriptions at three levels of concreteness.

  • Detail: Description of the original task from the test set
  • Simplified: Simplified task description including only essential details
  • Abstract: A high-level task description that does not include any details and mimics a real-world user query

The simplified and abstract descriptions are created using GPT-4. The paper states that this is the first attempt to investigate how query concreteness affects the performance of a large-scale language model on UI tasks.

The results show a gradual degradation in the performance of the large-scale language model as the user queries become more abstract. However, when the prompt samples are fixed, performance is relatively consistent across all concreteness levels.

Raw HTML can contain thousands of elements per page, and such a large amount of information can be difficult for the LLM to process. The paper therefore considers filtering out uninformative elements and truncating the HTML before passing it to the large-scale language model. The table below shows the results of different levels of HTML truncation. Truncation significantly improves performance compared to no truncation.
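One way to picture this kind of filtering is the sketch below, which keeps only interactive elements and a few descriptive attributes using Python's standard `html.parser`. The specific truncation rules used in the paper are not described in this article, so the choice of tags and attributes (`KEEP_TAGS`, `KEEP_ATTRS`) and the function name are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: interactive elements and a few descriptive attributes are "informative".
KEEP_TAGS = {"a", "button", "input", "select", "textarea", "label"}
KEEP_ATTRS = {"id", "name", "type", "placeholder", "aria-label", "href"}
VOID_TAGS = {"input"}  # kept tags that have no closing tag


class Truncator(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.out = []
        self.depth = 0  # greater than 0 while inside a kept element

    def handle_starttag(self, tag, attrs):
        if tag in KEEP_TAGS:
            kept = "".join(
                f' {key}="{value}"'
                for key, value in attrs
                if key in KEEP_ATTRS and value is not None
            )
            self.out.append(f"<{tag}{kept}>")
            if tag not in VOID_TAGS:
                self.depth += 1

    def handle_endtag(self, tag):
        if tag in KEEP_TAGS and tag not in VOID_TAGS and self.depth > 0:
            self.out.append(f"</{tag}>")
            self.depth -= 1

    def handle_data(self, data):
        # Text is kept only when it sits inside a kept element.
        if self.depth > 0 and data.strip():
            self.out.append(data.strip())


def truncate_html(html: str) -> str:
    parser = Truncator()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = (
    '<div class="x"><script>var a=1;</script>'
    '<button id="s" class="big"><svg></svg>Search</button>'
    '<p>Long text</p><input type="text" name="q" style="c"></div>'
)
short = truncate_html(page)
```

Here the script, the paragraph, the wrapper `div`, and styling attributes are dropped, while the button and input survive with their identifying attributes.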

Furthermore, the UI elements of interest may vary from user to user. For example, a UI/UX designer may be interested in the interaction flow of an application. A typical user may just want to find information as quickly as possible. In other words, performance may vary depending on the user persona.

Therefore, this paper investigates three different personas: (1) the general user, (2) the web assistant, and (3) the UI designer. The prompts for each persona are shown in the figure below.

The performance of Claude2 based on the different input prompts for each persona is shown in the table below. The Web Assistant performs significantly better than the other personas (with the exception of the 2-shot prompt in Cross-Task).
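Switching personas amounts to changing the opening instruction of the prompt. The sketch below shows the mechanism only: the persona wordings are hypothetical placeholders and are not the prompts used in the paper, and the function name and prompt layout are likewise assumptions.

```python
# Hypothetical persona wordings, written for illustration only.
PERSONA_PREFIXES = {
    "general_user": "You are a general user browsing this web page.",
    "web_assistant": "You are a web assistant helping a user complete a task on this web page.",
    "ui_designer": "You are a UI designer reviewing this web page.",
}


def build_persona_prompt(persona: str, html: str, query: str) -> str:
    """Prefix the retrieval prompt with the instruction for the chosen persona."""
    if persona not in PERSONA_PREFIXES:
        raise ValueError(f"unknown persona: {persona}")
    return (
        f"{PERSONA_PREFIXES[persona]}\n\n"
        f"HTML:\n{html}\n\n"
        f"Query: {query}\n"
        "List the most important UI elements for this query."
    )


assistant_prompt = build_persona_prompt(
    "web_assistant", '<button id="s">Search</button>', "search for flights"
)
```

Keeping the rest of the prompt identical across personas isolates the effect of the persona line when comparing results.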


This paper explores the ability of large-scale language models to identify important information on web pages when given a specific task. It also shows that how specifically users phrase their queries affects the performance of the model.

Future research could build on these findings to make the model more responsive, for example by enabling it to determine when to intervene and ask the user for further information. A particularly interesting direction is allowing the model to understand and respond to user intent more accurately, even when prompts are ambiguous. In addition, models could learn from the user's personal information and other context (e.g., location and time), but future work will also need to address the security issues that arise when handling such personal information.

For now, this research is limited to Anthropic's Claude2 model, but the authors plan to extend it to other large language models such as GPT-4, Llama 2, Vicuna, and PaLM 2. One major challenge in applying this research to other models is fitting large amounts of web page information (HTML context) into a limited context length. Future research should also investigate ways to incorporate this information efficiently while accurately reflecting the structure of the web page (DOM).