# Limitations and Solutions for Data-Constrained LLMs

**3 main points**

✔️ Language models continue to improve in performance as their parameter counts and training data grow

✔️ However, the trainable text data available on the Internet is finite

✔️ The paper presents a new perspective on scaling laws (performance-improvement laws) for language models in data-constrained settings

Scaling Data-Constrained Language Models

written by Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel

(Submitted on 25 May 2023 (v1), last revised 26 Oct 2023 (this version, v4))

Comments: 50 pages (9 main), 39 figures, 15 tables

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Code:

The images used in this article are from the paper, the introductory slides, or were created based on them.

## Introduction

Language models have continued to improve in performance as the number of model parameters and the amount of training data increase.

Will there ever be an end to this trend?

In fact, providing enough training data to keep pace with the ever-growing parameter counts of language models may require a tremendous amount of data.

Consider Gopher, a large language model with 280 billion parameters. Chinchilla, a 70-billion-parameter model, outperforms Gopher by training on four times as much data. In other words, performance may not improve unless a larger-than-expected amount of training data is provided relative to the number of model parameters. This is henceforth referred to as the Chinchilla scaling law. If the term "scaling law" is unfamiliar, it may help to think of it as a performance-improvement law.

For example, applying the Chinchilla scaling law to MT-NLG, a large language model with 530 billion parameters, suggests it would need to be trained on roughly 30 terabytes of text. In other words, the further performance gains that could be achieved by increasing the number of model parameters could require far more data than the trainable text that exists on the Internet.

Against this backdrop, the paper covered in this article proposes data-constrained scaling laws for language models when training data is limited. It also proposes ways to improve the performance of language models under such data constraints.

## Allocation and Return: Key Concepts in Scaling Laws

Allocation and Return are key concepts when considering scaling laws for language models. A scaling law answers the question, "What should be increased in a language model to improve its performance?" Behind it lies the goal of maximizing the return on investment (Return) from limited computational resources by allocating them optimally (Allocation).

As an analogy, consider trying to increase the top speed of a Shinkansen (bullet train). Spending more money on improving the car body and the motors should make the train faster, but the budget is limited, and it is unclear how much to invest in the body (reducing air resistance) versus the motors (increasing output). What is wanted is an understanding of the relationship between train speed, body improvements, and motor improvements, so that the budget can be allocated optimally and the maximum speed gain obtained under the current budget constraint.

If we map "train speed → language model performance (prediction accuracy)," "money → computational resources," "body air resistance → number of model parameters," and "motor output → amount of training data," this becomes the scaling law for language models.

The questions are: How much does the performance of a language model improve as computational resources increase? And under a given computational budget, what is the optimal allocation between the number of model parameters and the amount of training data?

In other words, as the computational budget grows, the prediction accuracy of the language model improves, but for each budget there are optimal values for the number of model parameters and the amount of training data. A scaling law organizes these optimal values as a function of the budget; behind every scaling law lies such an optimization problem.

## Previous scaling laws

Previous studies have assumed that data is inexhaustible.

Thus, the underlying optimization problem (Equation 1) places no restriction on the amount of training data.

In Equation 1, the prediction error L (the smaller L, the better the language model), which depends on the number of model parameters N and the number of training tokens D, is the objective to be minimized: under the constraint that the computational budget is C, what values of N and D minimize L?
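The original article shows Equation 1 only as an image; written out from the description above, the minimization it describes takes the following form (the 6ND approximation for training compute is the one commonly used in the Chinchilla paper and is included here as background, not restated in this article):

```latex
\min_{N,\, D} \; L(N, D)
\quad \text{subject to} \quad
\mathrm{FLOPs}(N, D) \approx 6ND = C
```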

FLOPs (floating-point operations) is used as the unit of computational resources. Note the distinction from FLOPS (floating-point operations per second): FLOPS measures how fast a computer can compute, and is an important performance indicator for tasks requiring heavy computation such as games and scientific simulations. FLOPs here, by contrast, counts the total number of floating-point operations performed during training, i.e., the total amount of computation rather than its speed.

In concrete terms, if only a slow computer is available and the number of model parameters and training tokens is too large, training cannot realistically be completed. So what settings of the number of model parameters and the amount of training data achieve the best prediction accuracy within a realistic amount of computation? That is the question being asked.

Previous studies modeled the performance gain (Return) of a language model as a power law in the computational resources invested in training. Under this assumption, they fitted the exponents and proportionality coefficients so that the model matches actual experimental data.

The optimal balance of Allocation was found to be roughly an equal split: additional computational resources should be spent on increasing the amount of training data and the number of model parameters in about equal proportion.
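For reference, the parametric form fitted in the Chinchilla work models the loss as a sum of power laws in N and D, and the fitted exponents are what make the optimal allocation come out roughly 50/50. The constants E, A, B, α, β are fitted to experimental data; the forms below are background from the Chinchilla paper, not equations restated in this article:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad
N_{\mathrm{opt}}(C) \propto C^{a},\;\;
D_{\mathrm{opt}}(C) \propto C^{b},
\;\; a \approx b \approx 0.5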

## Proposed Data-Constrained Scaling Laws

In order to study how to make the most of limited data, the paper proposes a scaling law that imposes a data constraint in addition to the computational-resource constraint.

### Technical Point 1: Decompose the total training tokens into unique tokens and repetitions, and examine the return on repetition

In the proposed data-constrained scaling law, the training-data quantity D from the Chinchilla scaling law is redefined as the total number of training tokens D, to allow deeper analysis under data constraints. That is, rather than counting only the data given once, tokens seen again when the same data is trained on multiple times are also counted toward the total.

The total number of training tokens D is then decomposed into the number of unique training tokens U_D and the number of repetitions R (the number of epochs minus 1). The epoch count indicates how many times the entire training set is passed through during training; one full pass over the data counts as one epoch. In general, more epochs improve accuracy on the given training data, but too many epochs cause overfitting and hurt prediction on unseen data. Scaling laws in previous studies considered only R = 0 (a single pass). What is new in this paper is that it examines the case R > 0, i.e., the return on investment in R.
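As a sketch of the idea, the paper models repeated tokens as decaying in value: the effective number of training tokens saturates as repetitions grow, rather than increasing linearly. The snippet below implements that diminishing-returns form; the decay constant `r_star` is a quantity fitted in the paper, and the value used here is illustrative only.

```python
import math

def effective_tokens(unique_tokens: float, repeats: float, r_star: float = 15.4) -> float:
    """Effective (value-adjusted) number of training tokens under repetition.

    Repeated tokens are worth less than fresh ones: the first few
    repetitions count almost fully, later ones decay exponentially.
    r_star is a decay constant fitted in the paper; 15.4 is used here
    purely as an illustrative value.
    """
    return unique_tokens + unique_tokens * r_star * (1.0 - math.exp(-repeats / r_star))

U = 100e9  # 100B unique tokens
one_epoch    = effective_tokens(U, repeats=0)   # no repetition: exactly U
four_epochs  = effective_tokens(U, repeats=3)   # close in value to 400B fresh tokens
forty_epochs = effective_tokens(U, repeats=39)  # far below 4,000B fresh tokens
```

With this shape, 4 epochs are worth almost as much as 4x fresh data, while 40 epochs fall far short of 40x, matching the qualitative behavior reported in the paper's Figure 1.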

As for the number of model parameters N, for symmetry of notation, the paper likewise defines U_N, the base number of parameters needed to fit the unique tokens, and R_N, the number of times that allocation is repeated.

### Technical Point 2: Impose a budget on the number of unique training tokens

In the traditional Chinchilla scaling law, only computational resources were constrained. The data-constrained scaling law adds a budget D_C on unique training tokens, so the new constraint U_D ≤ D_C is added to the minimization problem shown in Equation 1.
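With the data budget added, the minimization problem of Equation 1 becomes:

```latex
\min_{N,\, D} \; L(N, D)
\quad \text{subject to} \quad
\mathrm{FLOPs}(N, D) = C,
\qquad
U_D \le D_C
```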

Although a new constraint is added, the prediction error of the language model is still modeled, as before, as a power law in N and D. The exponents and proportionality coefficients are then fitted to match actual experimental data.

## Fitting Results for the Data-Constrained Scaling Law

Key results from fitting the data-constrained scaling law to actual experimental data are described below.

Figure 1 shows the return on investment of computational resources when training data is iteratively trained (Repeating).

In Figure 1, the horizontal axis is the number of tokens (epochs) and the vertical axis is the test error of the language model (smaller is better). The solid orange line is the test error predicted by the data-constrained scaling law. The dotted orange line is the test error that would be expected if all the tokens were unique (new data).

The figure shows that up to about 4 epochs, repeated training on the same data improves performance almost as much as adding new data; from 4 to about 40 epochs, the gain per epoch falls off rapidly; and beyond roughly 40 epochs, additional repetition yields no further improvement.

Figure 2 shows the optimal values for the allocation of computational resources during iterative training of training data.

The dotted blue line marks configurations with the same computational budget. The black line is the predicted optimal allocation when all tokens are unique (new data), and the red line is the optimal allocation predicted by the data-constrained scaling law. The intersections of the dotted blue line with the black and red lines show how the optimal allocation differs, for the same computational budget, with and without data constraints (i.e., with repeated training).

The figure shows that under data constraints, it is better to spend extra compute on more epochs than on more model parameters. This is where the result departs from the traditional Chinchilla scaling law.

## Suggestions on how to supplement the lack of unique data

The data-constrained scaling law shows that increasing the number of epochs can improve performance to some extent even when unique data is limited. At the same time, however, it also shows that this improvement has a limit. The paper therefore proposes methods to make up for the shortage of unique data.

As shown in Figure 3, in addition to repeated training on unique data (Repeating) as described so far, the paper examines two complements: filling the shortfall with program code (Filling with Code) and filtering the unique data before training (Filtering).

Filling with Code, as the name implies, means supplementing the training set with program code when natural-language text data runs short.

Filtering likewise literally means training on data that has been screened first. The paper examines two filters, Deduplicate and Perplexity: the Deduplicate filter removes duplicate texts, while the Perplexity filter keeps only texts that the language model finds likely. (Perplexity is a measure in which smaller values mean the language model assigns higher probability, i.e., higher confidence, to the text. For example, of the two sentences "the cat is curled up under the kotatsu" (A) and "the cat is curled up on the apple" (B), A is the more plausible, so the perplexity of A is smaller than that of B.)
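A minimal sketch of a perplexity filter, assuming per-token log-probabilities have already been obtained by scoring each document with some language model (the scores below are made up for illustration):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability); lower = more plausible to the LM."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_filter(docs, keep_fraction=0.5):
    """Keep the keep_fraction of documents the language model finds most likely.

    docs: list of (text, token_logprobs) pairs. In practice the log-probs
    would come from an actual LM; here they are given directly.
    """
    ranked = sorted(docs, key=lambda d: perplexity(d[1]))  # ascending perplexity
    k = max(1, int(len(ranked) * keep_fraction))
    return [text for text, _ in ranked[:k]]

docs = [
    ("the cat is curled up under the kotatsu", [-0.5, -0.4, -0.6, -0.3]),  # plausible -> low PPL
    ("the cat is curled up on the apple",      [-2.5, -2.8, -3.0, -2.6]),  # implausible -> high PPL
]
kept = perplexity_filter(docs, keep_fraction=0.5)
# kept == ["the cat is curled up under the kotatsu"]
```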

## Evaluation Results for the Data-Supplementing Methods

The evaluation results for the methods of supplementing unique data are shown in Figure 4.

The vertical axis shows average performance on 19 tasks, and the horizontal axis shows the data budget. The model has 4.2 billion parameters, and the total number of unique tokens is 84 billion; 100% of the data budget on the horizontal axis corresponds to 84 billion tokens. Each point in the graph is the average over five training runs with different seeds, with the standard deviation shown as a shaded band.

### Effectiveness of Filling with Code

The purple line in the figure shows repeated training on the text data, and the red line shows adding Python code instead of further repetition. When the data budget is small, the code brings no improvement and actually degrades performance. With a larger data budget, the Python code improves performance, although the gains are unstable; at larger budgets, at least, there appears to be no side effect of degradation.

The paper reports dramatic improvements from adding Python code on the WebNLG and bAbI tasks. WebNLG is a task that generates sentences from structured data such as (Taro Japan, birthday, 2000_01_01), and bAbI is a task requiring logical inference. The authors hypothesize that learning Python code may have taught the model to track state over time, an ability these tasks require.

The paper states that doubling the data by adding program code (Filling with Code) and then training for 4 epochs multiplies the total number of training tokens by 8 (2x the data x 4 epochs), and that tokens obtained this way are about as effective as the same number of unique tokens.

### Effects of Filtering

The stars in the figure show the results of repeated training after data filtering. Though hard to see, the whitish stars are the results of the Perplexity filter and the orange stars those of the Deduplicate filter. On this benchmark, the Deduplicate filter showed no improvement, while the Perplexity filter showed some.

Discussing the lack of improvement from the Deduplicate filter, the paper suggests it might have been effective on the benchmarks used in previous studies, since those studies reported cases where data duplication harmed language models.

The underlying logic of filtering is this: sentences that occur abnormally often, or sentences that are simply implausible, can be overfitted during training and degrade the model's handling of ordinary text; excluding such data removes that problem.

The paper states that filtering improves performance only when applied to noisy data sets.

## In Closing

The paper described in this article proposed a performance-improvement law (scaling law) for language models when training data is limited. It showed that repeated training on the same data improves performance about as much as adding new data for up to roughly 4 epochs.

Regarding the optimal allocation between the number of model parameters and the total number of training tokens, the results show that for the same computational budget it is better to use fewer parameters and more total training tokens than the conventional scaling law suggests.

To make efficient use of limited text data, the paper suggests not only repeating the same data but also adding program code as training data and filtering the data. Based on the experiments, filtering is recommended only for noisy data sets; the basic recipe is repeated training on the same data, with program code added as training data.

That said, given that basic research on LLMs under more realistic constraints, such as this paper, was selected as a runner-up for the best paper award at NeurIPS 2023, the top AI conference, it is perhaps surprising how many simple open questions about LLMs still remain.
