[BitNet B1.58] Achieved Accuracy Better Than Llama By Expressing Model Parameters In Three Values!
3 main points
✔️ Large-scale language models are computationally expensive, memory-intensive, and power-consuming
✔️ The problem is that computation, memory usage, and power consumption grow with the number of model parameters × the precision of each parameter
✔️ To solve this, the paper proposes a language model that matches LLaMA's response accuracy even when model parameter precision is reduced from 16 bits (65,536 values) to 1.58 bits (3 values)
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
written by Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei
(Submitted on 27 Feb 2024)
Comments: Work in progress
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
In contrast to Large Language Models (LLMs), Small Language Models (SLMs) are gaining prominence.
Large language models, which have enormous numbers of parameters and are trained on huge datasets, have raised public expectations of artificial intelligence to a new level with their ability to answer questions. However, training and running inference on them requires extremely high-spec computers. For this reason, most people use LLMs as a cloud service rather than on-premise (on computers in their own facilities) or at the edge (on smartphones and other devices).
For companies, using cloud services means managing accounts, applying for budgets, and addressing security risks such as information leaks. This is a major stumbling block, especially for companies that want to make use of their own large-scale data.
Small language models are attracting attention as a way to remove these shackles and make it easier for everyone to benefit from AI. Compared with large language models, they lower the hardware requirements in terms of computation, memory usage, and power consumption.
Thus, small language models are expected to ease the hardware requirements for benefiting from AI and to promote its use on-premise and at the edge, a trend that should accelerate the democratization of AI.
One promising line of work on small language models is the 1-bit LLM introduced here (more precisely, the 1.58-bit LLM, which was proposed as a follow-up to the 1-bit LLM).
In large language models, what raises the hardware requirements is the number of model parameters. To put a finer point on it, the issue is the number of model parameters × the precision of each parameter.
Model parameter precision is the number of distinct levels in which a numerical value can be expressed. For example, pi is 3.14 when expressed with 3 digits and 3 when expressed with 1 digit. Three digits distinguish 1,000 levels from 0.00 to 9.99, while one digit distinguishes only 10 levels from 0 to 9, so the one-digit representation has about 1/100th of the precision.
Now, how do computation, memory usage, and power consumption compare between the one-digit and three-digit cases? The representation with fewer digits, i.e., pi as 3, is easier to calculate with and easier to remember; intuitively, it simply takes less effort.
In the world of computers, model parameter precision is expressed in bits, the number of digits in a binary number, because computers fundamentally work in binary, where a digit carries over at 2, rather than in decimal, where a digit carries over at 10.
The current paper represents each model parameter with one of only three values: -1, 0, and 1. That is just 1.58 bits of precision, whereas the standard parameter precision to date has been 16 bits. (Just as 1,000 in decimal is 10 to the 3rd power and therefore takes 3 digits, asking what power of 2 gives 3 values yields log2(3) ≈ 1.58, hence 1.58 bits.)
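To see where the 1.58 comes from, here is the arithmetic as a quick check (plain Python, my own illustration rather than anything from the paper):

```python
import math

# With b bits you can distinguish 2**b values, so representing 3 values
# takes log2(3) bits, and 16 bits distinguish 2**16 = 65,536 values.
print(math.log2(3))    # 1.584962500721156 -> "1.58 bits"
print(2 ** 16)         # 65536
```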
With such low parameter precision, one might expect the LLM's response accuracy to suffer, but surprisingly the results are on par with a full-precision LLM, and depending on the number of model parameters, the three-valued model even performs slightly better.
This technique of reducing the precision of the model parameters while preserving the LLM's response accuracy as much as possible is called quantization, and it is an active area of study.
Now, we will explain BitNet b1.58, the method proposed in this paper, and the evaluation results in detail.
Benefits of BitNet b1.58
A cost-performance comparison of the proposed BitNet b1.58 against conventional LLMs is shown in Figure 1.
BitNet b1.58 is characterized by the fact that the model parameters, the so-called neural network weights, can be one of three values: -1, 0, or 1, as in the left W in Figure 1.
Conventional LLM weights are expressed as 16-bit floating-point numbers. A floating-point number expresses a value in the form (mantissa) × (base raised to an exponent), as in 2.961 × 10^-1. The 16 bits are divided among the sign, the exponent, and the mantissa, so decimals can be represented, as shown by the W of the Transformer LLMs on the right in Figure 1.
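As a concrete illustration (my own sketch, not from the paper), the snippet below inspects how a 16-bit IEEE half-precision float splits its 16 bits into 1 sign bit, 5 exponent bits, and 10 mantissa bits:

```python
import numpy as np

# Bit layout of a 16-bit float (IEEE 754 half precision):
# 1 sign bit | 5 exponent bits | 10 mantissa bits = 16 bits per weight.
x = np.array([0.2961], dtype=np.float16)
bits = format(int(x.view(np.uint16)[0]), "016b")
print(bits[0], bits[1:6], bits[6:])   # sign, exponent, mantissa
```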
Fundamentally, a computer's arithmetic units operate on values bit by bit, so the more bits there are, the higher the computation cost, as well as the memory cost of holding those bits. Preparing many arithmetic units for parallel computation shortens computation time but increases energy consumption.
When the number of model parameters is large, the time needed to transfer all those parameter values from memory during inference is itself a factor that increases inference time (the time it takes the LLM to respond to an input).
Therefore, as the horizontal axis of Figure 1 argues, BitNet b1.58 can be less costly than Transformer LLMs by reducing parameter precision to three values.
Regarding performance, BitNet b1.58 is claimed to be on par with conventional LLMs.
Thus, when compared with conventional LLMs on the two axes of performance and cost, the authors argue that BitNet b1.58 is not inferior on either axis and is superior on at least one, in this case cost, which is a Pareto improvement.
The operations required by BitNet b1.58 are shown in Figure 2.
Traditionally, both multiplication (Multiplication) and addition (Addition) of model parameters and inputs are required.
In contrast, BitNet b1.58 only requires addition. In other words, it is a simpler calculation.
Conventional GPUs are designed to speed up the multiply-add (sum-of-products) pattern that dominates matrix operations, but BitNet b1.58 should instead benefit from hardware that speeds up addition alone (plus the operation of zeroing an input when the weight is 0, negating its sign when the weight is -1, and keeping it as-is when the weight is 1).
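As a sketch of why this matters, the toy function below (the names and shapes are my own, not the paper's) computes a matrix-vector product with ternary weights using only additions and subtractions, and checks it against the ordinary multiply-add result:

```python
import numpy as np

def ternary_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Matrix-vector product where W contains only -1, 0, 1: no multiplications needed."""
    y = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        y[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()   # weights of 0 simply skip the input
    return y

W = np.array([[1, 0, -1],
              [0, 1,  1]])
x = np.array([0.5, -2.0, 3.0])
print(ternary_matvec(W, x))   # [-2.5  1. ]
print(W @ x)                  # same result, but via multiply-accumulate
```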
BitNet b1.58 technical points
BitNet b1.58 is based on BitNet.
In quantization, there are two types of method: those that reduce the precision of the model parameters after training (post-training quantization), and those that train with the reduced precision in mind (quantization-aware training).
The former reduces parameter precision as a post-processing step, which is convenient because it is easy to apply to existing models, but it tends to degrade model performance.
The latter is said to suffer less performance degradation, although training while accounting for the reduced precision increases the computational cost of training.
BitNet takes the latter approach, training with quantization in mind.
When a quantization step is inserted into the training process, it usually involves rounding continuous values to discrete values, which is a discontinuous, non-differentiable transformation.
This is a problem, because backpropagation, which is used to efficiently compute updates to the network's parameters, cannot be applied directly. In practice, an approximation is used in which the gradient is passed through the quantization step as if it were the identity (the so-called straight-through estimator).
These are the same processes that are carried over to BitNet b1.58.
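Below is a minimal sketch of that straight-through trick; the sign-based quantizer and the names here are only illustrative, not the authors' exact training code:

```python
import torch

def quantize_ste(w: torch.Tensor) -> torch.Tensor:
    """Forward pass uses the quantized weights; backward treats quantization as the identity."""
    w_q = torch.sign(w)              # discrete, non-differentiable step
    return w + (w_q - w).detach()    # value equals w_q, gradient flows through w

w = torch.randn(4, 4, requires_grad=True)
loss = (quantize_ste(w) ** 2).sum()
loss.backward()                      # w.grad is populated despite the discontinuous quantizer
print(w.grad.shape)                  # torch.Size([4, 4])
```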
The difference between BitNet and the present BitNet b1.58 is whether each parameter value is represented with the two values {-1, 1} or with the three values {-1, 0, 1}.
BitNet b1.58 retains the benefits of BitNet while adding two more.
First, adding 0 alongside -1 and 1 naturally increases the precision with which parameter values are represented, and thus the expressiveness of the model.
Second, including 0 is expected to have a feature-filtering effect.
In machine learning, including unnecessary features generally has a significant negative impact on a model's predictive performance; a weight of 0 directly cuts off such unneeded features.
Conventional BitNet quantization method
In conventional BitNet, which uses the two values -1 and 1, a parameter value is converted to 1 if it is greater than or equal to 0 and to -1 if it is less than 0. However, the center of the parameter values is, so to speak, their average, so if that center deviates from 0 the conversion becomes biased and the error grows.
So, after subtracting the average of the parameter values (a zero-point adjustment), each value is converted to 1 if it is greater than or equal to 0 and to -1 if it is less than 0.
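A minimal sketch of that zero-point-adjusted binarization, assuming it is simply "subtract the mean, then take the sign" (my reading of the description above, not the paper's exact code):

```python
import torch

def binarize_weights(w: torch.Tensor) -> torch.Tensor:
    """Center the weights on their mean, then map >= 0 to +1 and < 0 to -1."""
    alpha = w.mean()                                     # zero-point adjustment
    return torch.where(w - alpha >= 0,
                       torch.ones_like(w), -torch.ones_like(w))

w = torch.tensor([[0.8, 0.1], [-0.3, 0.6]])              # mean = 0.3
print(binarize_weights(w))                                # [[ 1., -1.], [-1.,  1.]]
```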
For quantizing the activations, the input matrix is divided by the maximum absolute value of its elements, scaling the range to [-1, 1], and then multiplied by Q (2 to the power of n-1, where n is the number of quantization bits) to obtain [-Q, Q]. If the activation function is asymmetric, for example ReLU, whose threshold is 0, the minimum value is subtracted first and the same process then maps the range to [0, Q].
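A hedged sketch of that activation quantization (the function name, the epsilon, and the ReLU flag are my own; the paper's exact clipping details may differ):

```python
import torch

def quantize_activations(x: torch.Tensor, bits: int = 8, after_relu: bool = False) -> torch.Tensor:
    """Absmax quantization to [-Q, Q], or to [0, Q] for non-negative (ReLU) activations."""
    Q = 2 ** (bits - 1)
    if after_relu:
        x = x - x.min()                          # zero-point shift so the minimum becomes 0
        return (x * Q / (x.max() + 1e-8)).round().clamp(0, Q)
    return (x * Q / (x.abs().max() + 1e-8)).round().clamp(-Q, Q)

x = torch.tensor([-2.0, 0.5, 1.0, 3.0])
print(quantize_activations(x, bits=8))           # integers in [-128, 128]
```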
In a neural network, the sum of products of the inputs and the parameters is computed and an activation function is applied to the result. In the extreme view, the activation function decides whether a neuron fires or not: anything above a threshold counts as 1 and anything below as 0.
Therefore, if this is not handled carefully, problems such as the outputs collapsing to all zeros can occur, which is presumably why the range is adjusted with the threshold in mind.
Proposed BitNet b1.58 quantization method
The proposal is to divide each parameter value by the mean absolute value of the parameters (scaling), round the result to the nearest integer (rounding), and convert anything below -1 to -1 and anything above 1 to 1 (clipping).
Why scale by the mean? One could instead divide by the largest absolute value, as in the conventional activation quantization, but if a single value has an extremely large absolute value and the scale is based on it, all the other values get divided by that large number and end up stuck around 0 (even though the three values -1, 0, and 1 are available).
That would bias the conversion toward 0. Presumably this is why BitNet b1.58 scales by the mean absolute value instead: values that are large relative to the mean map to +1 or -1, small values map to 0, and anything that falls outside [-1, 1] after rounding is clipped to the nearer of -1 or 1.
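A minimal sketch of this scale-round-clip procedure (the epsilon is my own addition for numerical safety):

```python
import torch

def quantize_ternary(w: torch.Tensor) -> torch.Tensor:
    """Scale by the mean absolute value, round to the nearest integer, clip to {-1, 0, 1}."""
    gamma = w.abs().mean()                                  # scaling
    return (w / (gamma + 1e-8)).round().clamp(-1, 1)        # rounding, then clipping

w = torch.tensor([0.05, -1.4, 0.7, 2.3])                    # mean |w| is about 1.11
print(quantize_ternary(w))                                  # tensor([ 0., -1.,  1.,  1.])
```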
The quantization of the activations is the same as before, with one change: previously, for ReLU, the minimum value was subtracted so that the range became [0, Q], but to simplify the process this zero-point subtraction is dropped and the activations are always scaled to [-Q, Q]. It is not explained why omitting the zero-point adjustment is acceptable; presumably it turned out not to matter much in practice.
Assessment Results
Memory usage, response time and prediction accuracy
Figure 3 compares the memory usage, response time, and prediction accuracy of BitNet b1.58 and LLaMA LLM, the LLM developed by Meta.
GPU memory size (Memory), response time (Latency), and prediction error (PPL, perplexity) are shown for model sizes (Size) of 700 million (700M), 1.3 billion (1.3B), 3 billion (3B), and 3.9 billion (3.9B) parameters. As the arrows in the table indicate, smaller is better for every metric.
Compared to LLaMA LLM, BitNet b1.58 achieves 2.6x to 3.6x smaller GPU memory size, 1.23x to 2.7x faster response time, and almost the same prediction error.
The larger the model, the greater BitNet b1.58's relative advantage in memory usage, response time, and prediction error, so that at 3B and 3.9B parameters it beats LLaMA LLM on all three metrics.
Energy consumption
A comparison of energy costs between LLaMA and BitNet b1.58 with 512 token inputs is shown in Figure 4.
BitNet b1.58 has a 19x to 41x lower energy cost than LLaMA, and the larger the model (Model Size), the greater the energy (Energy) savings.
In closing
This article described BitNet b1.58.
Conventional LLMs improve prediction performance as the number of model parameters increases, but memory usage, response time, and energy consumption also increase significantly.
Conventional LLMs express each parameter value as a 16-bit floating-point number, so they need to store (number of parameters) × 16 bits of information. The more parameters there are, the more memory is used, the longer it takes to transfer that information and hence the longer the response time, and the more computation is required, so energy consumption also increases.
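As a back-of-the-envelope illustration of the memory side (weights only, ignoring activations, the KV cache, and other overhead, which is presumably why the measured savings in Figure 3 are smaller than this ideal ratio):

```python
# Ideal weight-memory footprint of a 3-billion-parameter model.
params = 3e9
fp16_gb    = params * 16   / 8 / 1e9    # 16 bits per weight -> bytes -> GB
ternary_gb = params * 1.58 / 8 / 1e9    # 1.58 bits per weight, assuming ideal packing
print(round(fp16_gb, 2), round(ternary_gb, 2))   # 6.0 0.59
```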
To alleviate this, a quantization technique was introduced that reduces parameter precision from 16-bit floating point to 1.58 bits (the three values -1, 0, 1) while preserving the LLM's response accuracy as much as possible. BitNet, which reduces parameters to the two values -1 and 1, had already been proposed; by adding 0, BitNet b1.58 keeps the advantage that most computations can be done with addition alone while also gaining a filtering effect that prunes unneeded features. The result is lower memory usage, response time, and energy consumption with little loss of prediction accuracy, and depending on the number of parameters, accuracy even improves.
The comparison with LLaMA showed that BitNet b1.58 can reduce memory usage, response time, and energy consumption while matching or even exceeding the prediction performance achieved before.
The paper states that since most of the proposed method's computations reduce to addition, further speed-ups and energy savings can be expected by designing new hardware that differs from GPUs.
Currently, NVIDIA stock is attracting unusual attention, but if hardware other than GPUs becomes viable, it could shake NVIDIA's dominance. However, BitNet b1.58 is evaluated on GPUs in the paper, and the activations are quantized to 8 bits, as described above, so not every part of the processing is three-valued. If every operation were addition only, the hardware could be changed radically, but as long as multiplication remains, the picture is unlikely to change dramatically.
Also, hardware manufacturers may welcome the message that new, better-suited hardware is possible, since it gives them an opening against NVIDIA. General users, however, would probably be happier to hear that they can run fast computations on general-purpose CPUs or cheaper hardware than to be told they must buy new hardware.