Pre-training Of Language Models Is Possible Even With HTML Data!
3 main points
✔️ Proposes HTLM, a language model pre-trained on HTML data
✔️ Introduces a BART-based pre-training method using simplified HTML
✔️ Achieves strong performance in various zero-shot/one-shot settings, such as summarization and table generation tasks
HTLM: Hyper-Text Pre-Training and Prompting of Language Models
written by Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, Luke Zettlemoyer
(Submitted on 14 Jul 2021)
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
introduction
In the pre-training of language models, HTML data collected from web pages is usually not used as-is; instead, only the text is extracted through pre-processing.
However, HTML data has several advantages over plain text: it carries additional structural information (for example, the <title> element is often a good summary of the document's <body>), and it is easy to collect at scale.
The paper introduced in this article proposes the Hyper-Text Language Model (HTLM), the first model pre-trained on HTML data. The model performs well across a range of tasks, improving zero-shot summarization ROUGE-1 scores by up to 8 points.
HTLM (Hyper Text Language Model)
The proposed method, HTLM (Hyper Text Language Model), is a model trained on HTML data automatically extracted from Common Crawl. This model is based on BART with modifications.
About the data (Minimal HTML) used for training the model
Most of the HTML in a typical web page contains information that is not important for pre-training a language model, so raw HTML is difficult to use for training as-is.
The data is therefore converted into a simplified format called Minimal-HTML (MHTML) through the following steps:
- Delete all subtrees of the HTML DOM whose text is shorter than a threshold (128 characters; 64 for lists, tables, and spans)
- Remove headers, footers, copyright notices, forms, and iFrames
- Combine consecutive <div> elements into a single <div> with merged attributes
- Remove all attributes except class and id
- Exclude MHTML documents whose text-to-HTML ratio is below 0.46
The last threshold was set manually, based on the observation that documents with a low ratio of text to HTML tend to be of lower quality on average.
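The paper does not release its preprocessing code, but the last filtering step can be sketched with the Python standard library alone. The class and function names below are hypothetical; only the 0.46 threshold comes from the paper:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the visible text chunks of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def text_to_html_ratio(html: str) -> float:
    """Ratio of extracted text length to total document length."""
    parser = _TextExtractor()
    parser.feed(html)
    text = "".join(parser.chunks).strip()
    return len(text) / max(1, len(html))

def passes_quality_filter(html: str, threshold: float = 0.46) -> bool:
    """Keep only documents whose text-to-HTML ratio reaches the paper's 0.46."""
    return text_to_html_ratio(html) >= threshold
```

A markup-heavy page such as `<div class="nav"><span></span></div>` is rejected, while a page dominated by paragraph text passes.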
These steps remove an average of 94% of the characters in a raw web page and keep about 85% of the MHTML documents under 1,024 BPE tokens, the maximum input length for BART and similar models. A January 2021 snapshot of Common Crawl ultimately yielded 23TB of MHTML data, which was used to train the model.
About the model
The model architecture and training objective follow a BART-style denoising autoencoder. Span lengths for random masking are sampled from a Poisson distribution with $\lambda=3.5$.
The experiments use the same architecture and checkpoint as BART-Large, training for a total of 330,000 steps on 256 GPUs with a batch size of 8,192.
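As a rough illustration of this BART-style objective (not the authors' implementation), span lengths can be drawn from Poisson(3.5) and each chosen span collapsed into a single <mask> token. The helper names and the 30% masking budget below are assumptions for the sketch:

```python
import math
import random

def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's method: count uniform draws until their product drops below e^-lam."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def bart_span_mask(tokens, lam=3.5, mask_ratio=0.3, seed=0):
    """Replace random spans (Poisson-distributed lengths) with <mask> tokens."""
    rng = random.Random(seed)
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)  # roughly how many tokens to corrupt
    while budget > 0 and tokens:
        length = min(sample_poisson(lam, rng), budget, len(tokens))
        start = rng.randrange(max(1, len(tokens) - length + 1))
        tokens[start:start + length] = ["<mask>"]  # text infilling: span -> one mask
        budget -= max(length, 1)
    return tokens
```

The denoising task is then to reconstruct the original token sequence from the corrupted one.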
Size hints for masking
BART masks spans whose lengths are sampled from a Poisson distribution and learns to predict the masked tokens.
To help the model predict the masked part and to control the length of the generated text, HTLM inserts a number of <mask> tokens that depends on the masked length $m$: $n = \max(1, \lfloor \mathcal{N}(m, m \cdot \epsilon) \rfloor)$, where $\epsilon$ is a hyperparameter controlling the amount of noise in the hint. During training, size hints with noise $\epsilon = 0.1$ are given for 80% of the masks.
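A minimal sketch of this size-hint computation (the function name is ours, and we assume the Gaussian's standard deviation is $m \cdot \epsilon$, matching the formula above):

```python
import math
import random

def num_hint_masks(span_len: int, eps: float = 0.1, rng=None) -> int:
    """n = max(1, floor(N(m, m*eps))): how many <mask> tokens to insert
    for a masked span of length m."""
    rng = rng or random.Random()
    noisy = rng.gauss(span_len, span_len * eps)  # Gaussian centred on the true length
    return max(1, math.floor(noisy))
```

With eps=0 the hint is exact; larger eps makes the hint noisier, as in the 80%-of-masks, eps=0.1 training setup described above.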
HTML-based prompts for task execution
Because HTLM is trained on HTML data while downstream tasks are given as plain text, tasks must be converted into HTML format when applying HTLM to them.
For each downstream task, a dedicated prompt template is created, either manually or automatically, so that the task can be solved by predicting a masked region. When size hints are given, the hinted length of the text to generate is based on the average length in the training set.
Automatic template creation converts a task into HTML by placing <mask> tokens around the text given as input and having the model predict that region.
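For instance, a zero-shot summarization prompt can exploit the <title>/<body> relationship mentioned earlier. The template below is a plausible sketch in the spirit of the paper, not its exact prompt:

```python
def summarization_prompt(article: str, size_hint: int = 1) -> str:
    """Wrap an article in HTML and mask the <title>, which the model
    learned during pre-training to treat as a summary of the <body>."""
    masks = "<mask>" * size_hint  # size hint: expected summary length in tokens
    return (
        "<html><head><title>" + masks + "</title></head>"
        "<body>" + article + "</body></html>"
    )
```

The model's infilled <title> text is then taken as the summary.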
First, the proposed method (HTLM) is evaluated in zero-shot and one-shot settings. For manual prompts, templates are created using at most 50 samples from related papers or from the training set.
For the generation tasks, evaluation is performed on the datasets described below. The summarization datasets are as follows:
- Gigaword: headline summarization of news articles, with summaries averaging about 10 BPE tokens
- CNN/DailyMail: multi-sentence summarization, with summaries of about 3 sentences and 50 tokens
- Reddit TIFU: summarization of Reddit posts rather than news articles, making it more abstractive
- XSum: abstractive single-sentence summarization of news articles
The datasets used for generation from structured tabular data are as follows:
- E2E: a generation task in the restaurant domain, with about 50k samples
- WebNLG: a generation task with results reported for the Seen (S), Unseen (U), and All (A) splits
- DART: an open-domain generation task that includes Wikipedia tables
We first look at summarization, a representative generation task. The results compared with the PEGASUS baseline (original paper; a commentary article is available on this site) are as follows.
The scores in the table are ROUGE-1/ROUGE-2/ROUGE-L. The manual prompts (-Manual) outperform the baseline zero-shot summarization results on all four datasets, and the automatic prompts with size hints (-Auto-S) outperform PEGASUS on three of the four datasets.
Next, experiments are conducted on generation from structured tabular data, evaluated in the one-shot, fine-tuning, and prefix settings. Since these tasks take tabular data as input, ordinary text-based pre-trained models are difficult to apply in the one-shot setting, so only the fine-tuning and prefix results are compared with the baseline (GPT-2).
The proposed method (HTLM), on the other hand, can handle such tasks one-shot because it is HTML-based. The results are as follows.
Overall, the results are comparable to or better than GPT-2, and the fact that the method also works in the one-shot setting is particularly attractive.
For classification, four datasets are evaluated in the zero-shot setting.
The results are as follows.
Overall, the results are comparable to GPT-3 Medium or Large.
We then compare HTLM with existing pre-trained language models in the fine-tuning setting; the results on the GLUE benchmark are as follows.
In general, the results are competitive with other pre-training methods, indicating that the representations learned by HTML-based pre-training are also effective for downstream tasks. Better prompts may improve performance further.
Prompt data efficiency
Finally, the usefulness of HTML-based prompts is evaluated using a study that quantifies how many data points a single prompt is worth (we will not go into the metric here; see that study for details). The results are as follows.
The table shows the advantage of prompts over classification heads during fine-tuning, with higher scores being better. Overall, the proposed method compares favorably with existing text-based language models, again indicating the effectiveness of HTML-based pre-training.
In this article, we introduced HTLM, which performs pre-training based on HTML data.
This model not only matches or exceeds the accuracy of ordinary text-based pre-trained models, but can also handle tasks built on structured data, such as tables, in a one-shot setting, which opens up a new avenue for using HTML data in pre-training.