[BitNet] Large-scale Language Model With 1-bit Inference
3 main points
✔️ Proposed BitNet, a Transformer whose weights take only the two values ±1
✔️ Achieves far higher energy and memory efficiency than conventional Transformers
✔️ Found that it follows the same scaling laws as the regular Transformer
BitNet: Scaling 1-bit Transformers for Large Language Models
written by Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei
(Submitted on 17 Oct 2023)
Comments: Work in progress
Subjects: Computation and Language (cs.CL)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Large-scale language models keep growing in size in pursuit of higher accuracy, but this growth makes deployment difficult and raises concerns about increasing computational and energy costs. In this study, we proposed a Transformer whose weights are quantized to a single bit, taking only the values $\pm 1$, and showed that it can match the performance of conventional 16-bit models with far fewer computational resources and far higher energy efficiency. Interestingly, we also found that as the model size increases, it follows the same scaling laws as the conventional Transformer. This innovative approach lays the groundwork for lightweight and tractable 1-bit large-scale language models.
Background
The Growing Scale of Large-Scale Language Models and Expectations for Model Quantization
Large-scale language models continue to grow in size and are expected to grow even larger in the future. However, their high inference cost and energy consumption make them increasingly expensive to use. For this reason, model quantization (representing parameters with low-bit values) is attracting attention as a way to make large-scale language models lighter while maintaining their performance, reducing both memory usage and computational load.
Conventional research on model quantization
Most current model quantization is applied post-training. This approach is simple and easy to apply because it requires no changes to, or re-training of, the model's training pipeline. However, because the model is not optimized for quantization during training, quantization can cause a significant loss of accuracy. The alternative is to take quantization into account during training. This allows the model to keep learning and to be fine-tuned, which is essential for large-scale language models. The challenge with quantization-aware training, however, lies in its optimization: convergence tends to deteriorate as the precision of the weights decreases. In addition, it is not obvious whether such training follows the scaling laws of language models.
Against this background, the authors proposed BitNet, which builds quantization into the training process itself. The scaling laws of such training are also examined in this study.
Proposed Method
BitNet
Simply put, BitNet replaces the Linear layers of a conventional Transformer with BitLinear layers whose weights are represented by a single bit. Figure 1 shows a brief overview of BitNet. Otherwise, BitNet has the same structure as the conventional Transformer.
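As a rough illustration, a minimal PyTorch-style sketch of what such a BitLinear layer might look like is shown below. This is a simplified reading of the idea, not the paper's exact implementation: the latent weights are binarized to ±1 around their mean and rescaled by their mean absolute value, while the activation quantization and normalization used in the paper are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Simplified sketch of a 1-bit linear layer (not the paper's exact implementation)."""

    def forward(self, x):
        w = self.weight                              # latent full-precision weights
        alpha = w.mean()                             # centering term
        beta = (w - alpha).abs().mean()              # per-layer scaling factor
        w_bin = torch.where(w > alpha, 1.0, -1.0)    # binarize to +1 / -1
        # straight-through estimator: the forward pass uses the scaled binary weights,
        # the backward pass treats the binarization as the identity
        w_q = w + (beta * w_bin - w).detach()
        return F.linear(x, w_q, self.bias)
```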
Model Training
This study introduces several innovations for BitNet training. They are briefly described below.
Straight-through estimator (STE)
BitNet contains non-differentiable functions, such as the binarization (sign) function. Direct differentiation of these functions is avoided by bypassing them during gradient computation, i.e., passing the gradient through as if they were the identity. This is referred to as STE throughout this study.
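For reference, one common way to realize an STE in PyTorch is sketched below (this is a generic illustration of the technique, not the paper's code): the forward pass applies the non-differentiable sign function, while the backward pass simply passes the gradient through unchanged.

```python
import torch

class SignSTE(torch.autograd.Function):
    """Sign-like binarization whose gradient is passed straight through."""

    @staticmethod
    def forward(ctx, x):
        # non-differentiable step in the forward pass
        return torch.where(x > 0, 1.0, -1.0)

    @staticmethod
    def backward(ctx, grad_output):
        # pretend the forward step was the identity: pass the gradient through unchanged
        return grad_output

x = torch.randn(4, requires_grad=True)
y = SignSTE.apply(x).sum()
y.backward()
print(x.grad)  # tensor([1., 1., 1., 1.]): gradients flow as if the binarization were the identity
```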
Mixed precision training
The weights in BitNet are quantized, but to ensure the stability and accuracy of training, high-precision variables are kept for computing gradients and maintaining the optimizer state, and these are used to update the parameters. In other words, the model maintains high-precision latent weights; these are binarized on the fly during the forward pass and are not used at inference time.
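A schematic of this idea, reusing the hypothetical BitLinear sketch above with placeholder data and dimensions (the paper's actual precision choices and pipeline may differ), could look like this:

```python
import torch

# toy model and step count, purely for illustration
model = torch.nn.Sequential(BitLinear(512, 512), torch.nn.ReLU(), BitLinear(512, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)  # comparatively large learning rate (see below)

for step in range(100):
    x = torch.randn(32, 512)                     # placeholder batch standing in for real training data
    target = torch.randint(0, 10, (32,))
    logits = model(x)                            # forward pass uses the binarized weights
    loss = torch.nn.functional.cross_entropy(logits, target)
    optimizer.zero_grad()
    loss.backward()                              # gradients flow to the latent full-precision weights via the STE
    optimizer.step()                             # the optimizer updates the high-precision latent weights

# At inference time only the binarized weights (and scaling factors) are needed;
# the latent full-precision weights can be discarded.
```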
Large learning rate
One challenge in training BitNet is that small updates to the latent weights often make no difference to the 1-bit weights: for example, nudging a latent weight from 0.40 to 0.39 does not change its sign, so the binarized weight stays the same. This leads to biased gradients and updates, since they are computed from the one-bit weights. The authors conclude that the simplest and best remedy is to increase the learning rate and thereby accelerate optimization. Their experiments confirmed that BitNet benefits from a larger learning rate in terms of convergence, whereas a regular FP16 Transformer diverges at the beginning of training at the same learning rate. This suggests that BitNet's training is both efficient and highly stable.
Experimental results
Computational Efficiency
Table 1 shows a comparison of the computational efficiency of the regular Transformer and BitNet in terms of energy. The results show that BitNet is far more efficient than the regular Transformer, regardless of model size.
Consideration of scaling laws for loss
Figure 2 compares the scaling behavior of BitNet and the regular Transformer as the model size increases. Importantly, the results show that BitNet's loss, like that of the regular Transformer, decreases with model size following a scaling law. Moreover, when BitNet and the regular Transformer are compared at the same model size or energy consumption, BitNet has the smaller loss, indicating that BitNet is more efficient than the regular Transformer with respect to both energy consumption and model size. Another important point is that the gap between BitNet and the regular Transformer shrinks as the model size increases.
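For reference, scaling laws of this kind are usually written as a power law in the number of parameters $N$ (the concrete constants fitted for BitNet are reported in the paper and are not reproduced here):

$$L(N) \approx a N^{-\alpha}$$

where $L$ is the language-modeling loss and $a, \alpha > 0$ are constants fitted to the measurements. The observation above is that both BitNet and the FP16 Transformer follow curves of this form, with the gap between them narrowing as $N$ grows.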
Examination of scaling laws for accuracy
Figure 3 compares BitNet and the Transformer with respect to downstream accuracy. The results show that BitNet achieves higher accuracy for a given inference cost, suggesting its high efficiency.
Verification of learning stability
Figure 4(a) compares the training stability of BitNet and the Transformer, suggesting that BitNet's training is more stable than the regular Transformer's. Figure 4(b) visualizes BitNet's training curves for several learning rates; this also confirms BitNet's high stability and efficiency, as it converges stably regardless of the learning rate.
Comparison with conventional post-training model quantization
Figure 5 summarizes a comparison of the accuracy of BitNet and several baselines. The results confirm that BitNet shows high accuracy compared to other quantization methods.
Summary
In this study, we proposed a large-scale language model built on a 1-bit Transformer. We also comprehensively compared BitNet with conventional quantization methods and the ordinary Transformer and discussed the differences between them. The results suggest that BitNet can achieve higher efficiency and accuracy than conventional quantization methods. In addition, BitNet's accuracy is comparable to that of the ordinary Transformer, which is a surprising result. As large-scale language models continue to grow, methods for model quantization are expected to become increasingly important, and BitNet has the potential to lead this field. In that sense, further discussion of the model's generality and the limits of its applicability is anticipated.