
Combining Speed And Accuracy: Quantization-aware LLM Pre-training

Combining Speed And Accuracy: Quantization-aware LLM Pre-training "QAP

3 main points
✔️ Quantization-aware pre-training (QAP) method proposed to make LLMs more robust to quantization
✔️ QAP mimics quantization noise during training for faster inference while avoiding accuracy loss
✔️ Experiments show up to 2.5x inference speedup while maintaining accuracy even with 4-bit quantization

Inverse-and-Edit: Effective and Fast Image Editing by Cycle Consistency Models
written by Ilia Beletskii, Andrey Kuznetsov, Aibek Alanov
(Submitted on 23 Jun 2025)
Comments: The code of our method is available on GitHub at this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Overview

Although LLMs have shown remarkable performance across many natural language processing tasks, their inference speed and memory usage are major bottlenecks in real-world deployment. Quantization is widely used to address this problem. However, conventional quantization methods trade model accuracy for the gain in inference speed.

This paper shows that this trade-off can be overcome by introducing Quantization-Aware Pretraining (QAP) at the LLM training stage.
Specifically, by mimicking quantization noise during model training, the model acquires a structure whose accuracy does not degrade once it is actually quantized.
As a result, it achieves both higher accuracy and faster inference than conventional models at the same bit width. In particular, the 4-bit quantized model shows almost no degradation in accuracy compared to the FP16-precision model, demonstrating that cost-effective LLM operation is possible.

Proposed Method

The central method proposed in this study is QAP. Training proceeds while pseudo-quantization errors are injected into the model, in anticipation of the quantization that will be applied later. Unlike conventional post-training quantization (PTQ), QAP leads the model to acquire quantization-robust representations from the learning stage itself.
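This summary does not reproduce the authors' code, but injecting pseudo-quantization error during training is commonly realized as a quantize-dequantize pass with a straight-through estimator. The helper below is a minimal sketch under an assumed symmetric, per-row scaling scheme, not the paper's exact implementation:

```python
import torch

def fake_quantize(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Quantize-dequantize a tensor so the forward pass sees quantization error
    while gradients flow through unchanged (straight-through estimator)."""
    qmax = 2 ** (n_bits - 1) - 1                                   # e.g. 7 for 4-bit symmetric
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    # Forward pass returns the quantized values; backward treats the op as identity.
    return x + (x_q - x).detach()
```

During training, weights and activations passed through such a helper carry realistic low-bit rounding error, while the optimizer still updates the underlying full-precision parameters.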

Specifically, the linear layers most susceptible to quantization (notably those in the Attention and MLP blocks) are simulated at 4-bit or 6-bit precision, and this simulated error is reflected in the loss function. Because the data and hyperparameters used during pre-training are essentially identical to those of existing high-precision models, the additional cost of adopting QAP is negligible.
The proposed method also applies soft regularization to quantization-sensitive weights and activations, further improving training stability and post-quantization generalization.
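A hedged sketch of how these two ingredients could fit together, reusing the fake_quantize helper above: a linear layer whose weights and activations are fake-quantized in the forward pass, plus a soft penalty on the gap between the full-precision weights and their quantized counterparts. The class name, the L2 form of the penalty, and its placement are illustrative assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QATLinear(nn.Linear):
    """Linear layer that simulates low-bit weights and activations during training."""

    def __init__(self, in_features: int, out_features: int, n_bits: int = 4, bias: bool = True):
        super().__init__(in_features, out_features, bias=bias)
        self.n_bits = n_bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quantize(self.weight, self.n_bits)   # simulated low-bit weights
        x_q = fake_quantize(x, self.n_bits)             # simulated low-bit activations
        return F.linear(x_q, w_q, self.bias)

    def quantization_gap(self) -> torch.Tensor:
        # Soft regularizer: L2 distance between the weights and their quantized
        # version. Detaching the quantized copy lets the penalty pull the
        # full-precision weights toward the nearest quantization grid points.
        w_q = fake_quantize(self.weight, self.n_bits).detach()
        return (self.weight - w_q).pow(2).mean()
```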

This approach integrates easily into standard training pipelines, making it a practical route to future LLM speedups and resource savings.
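To make the integration claim concrete, a single training step could look roughly like the following, assuming a Hugging-Face-style causal LM whose linear layers have been swapped for the QATLinear sketch above; the reg_weight coefficient and the loop structure are assumptions for illustration only:

```python
def training_step(model, batch, optimizer, reg_weight: float = 1e-4) -> float:
    """One step of quantization-aware pre-training on a causal LM batch."""
    outputs = model(**batch)                  # forward pass runs through fake-quantized layers
    loss = outputs.loss                       # standard next-token prediction loss

    # Add the soft quantization-gap penalty over all QAT layers.
    reg = sum(m.quantization_gap() for m in model.modules() if isinstance(m, QATLinear))
    total = loss + reg_weight * reg

    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```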

Experiments

To validate the effectiveness of the proposed method, the authors prepared 4-bit and 6-bit quantized variants of LLMs based on LLaMA-2 and Mistral-7B and evaluated both accuracy and inference speed.

Benchmarking covered a variety of tasks, including MMLU, GSM8K, and HumanEval, and the performance of each model was compared.
The QAP-trained models improved accuracy by up to +6.3 points at the same bit width over baseline models trained without QAP.
In terms of inference speed, they ran up to 2.5 times faster than the FP16-based model with almost no loss of accuracy.

Robustness to different quantization schemes (SmoothQuant, AWQ, GPTQ, etc.) was also verified: QAP-trained models maintain stable performance regardless of which scheme is applied.
Furthermore, the increase in training cost is very small, meaning QAP presents a low barrier to adoption in realistic operational environments.
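As a concrete point of reference for what applying post-training quantization to a pretrained checkpoint looks like in practice, here is a 4-bit loading example using bitsandbytes through the transformers library. It uses NF4 quantization rather than the schemes named above, purely to keep the example self-contained, and the model name is a placeholder for a QAP-pretrained checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder; a QAP-pretrained checkpoint would go here

# 4-bit post-training quantization (NF4) applied at load time via bitsandbytes.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config)

inputs = tokenizer("Quantization-aware pre-training makes LLMs", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```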

These results demonstrate that QAP can be positioned as a promising approach for building fast, memory-saving, and accurate LLMs.

