Catch up on the latest AI articles

Innovations In Outlier-Safe Pre-Training For Large Language Models To Prevent Outliers And Protect Quantization Accuracy

3 main points
✔️ OSP, a training method that prevents outliers from forming, is proposed to fundamentally improve quantization performance
✔️ Muon optimization, single-scale normalization, and an embedding projection suppress extreme activation values
✔️ In experiments at the trillion-token scale, high accuracy was maintained even under 4-bit quantization, outperforming existing models by a wide margin

Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
written by Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang
(Submitted on 24 Jun 2025)
Comments: Published on arxiv.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

4-bit quantization of LLMs is an important technology for on-device deployment because it can significantly reduce memory usage during inference. However, "outliers" (extreme activation values) that emerge during training are known to severely degrade quantization performance. Conventional approaches such as post-training quantization (PTQ) address this problem only after training; they treat outliers as inevitable and do not solve the problem at its root.
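The failure mode described above is easy to see in a minimal sketch of symmetric round-to-nearest INT4 quantization (this is a generic illustration, not the specific PTQ method used in the paper): with a single per-tensor scale, one extreme outlier stretches the quantization grid and crushes the resolution available for all other values.

```python
import numpy as np

def quantize_int4(w):
    """Symmetric round-to-nearest 4-bit quantization sketch.

    A single per-tensor scale is set by the largest magnitude, so one
    extreme outlier widens the grid and destroys resolution for the
    remaining (much smaller) values -- the failure mode OSP targets.
    """
    scale = np.abs(w).max() / 7.0            # INT4 symmetric range is [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7)  # integer codes
    return q * scale, scale                  # dequantized weights, scale

# Well-behaved tensor: reconstruction error stays below half a grid step.
w = np.linspace(-1.0, 1.0, 16)
deq, _ = quantize_int4(w)

# Same small values plus one outlier: the small values collapse to zero.
w_out = np.append(np.linspace(-1.0, 1.0, 15), 100.0)
deq_out, _ = quantize_int4(w_out)
```

Running this, the maximum error on the well-behaved tensor is bounded by half the grid step, while in the outlier case every small value is rounded to zero and the information is lost entirely.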

In this study, we propose a new perspective on this problem: a training framework called "Outlier-Safe Pre-Training (OSP)". OSP is composed of three components: the Muon optimizer, Single-Scale RMSNorm (SSNORM), and a learnable embedding projection (EMBPROJ). By training a 1.4B-parameter model at the trillion-token scale, we achieve a significant improvement in quantization tolerance and reduced performance degradation compared to conventional models.

Proposed Method

The proposed Outlier-Safe Pre-Training (OSP) is a pre-training method designed to fundamentally prevent outliers. The framework consists of the following three components.

First, the Muon optimizer differs from conventional diagonal-preconditioning optimizers such as Adam and AdaFactor by orthogonalizing the gradient matrix before applying the update. This prevents activations from concentrating on specific channels (a "privileged basis") and enables uniform learning across all channels.
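The gradient orthogonalization at the heart of Muon is typically done with a Newton-Schulz iteration, which drives all singular values of the gradient matrix toward 1 using only matrix multiplies. The sketch below follows the publicly available Muon reference implementation's quintic iteration; the coefficients and step count are illustrative, not values confirmed by this paper.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a gradient matrix via Newton-Schulz
    iteration (a sketch of the Muon-style update, not the paper's exact code).

    Each step applies X <- a*X + (b*(XX^T) + c*(XX^T)^2) X, which pushes
    every singular value of X toward 1 while preserving singular vectors.
    """
    X = G / (np.linalg.norm(G) + eps)   # scale so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # work with the wide orientation
        X = X.T
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients (reference impl.)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

Because the iteration equalizes singular values, no single direction in parameter space receives an outsized update, which is the mechanism the article credits with avoiding a privileged basis.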

Second, Single-Scale RMSNorm (SSNORM) eliminates the traditional per-channel scaling and uses a single scaling factor shared across all dimensions, preventing the normalization layer itself from amplifying particular channels. This avoids instability during training while also avoiding excessive suppression of activations.

Third, the Embedding Projection (EMBPROJ) uses a learnable projection matrix to homogenize the activation distribution and prevent local outliers arising from the embedding layer. A split optimization strategy is also employed: only the embedding layer is trained with Adam, while Muon is applied to all other parameters, balancing practicality and computational efficiency.
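Structurally, this amounts to inserting a learned linear map after the embedding lookup. The sketch below uses illustrative names (`E` for the embedding table, `P` for the projection); since the composition is linear, `P` could in principle be folded into `E` once training is done, though the article does not detail the inference-time handling.

```python
import numpy as np

def embed_with_projection(token_ids, E, P):
    """Embedding-projection sketch (names E and P are illustrative):
    token embeddings are passed through a learnable projection matrix P,
    spreading activation mass across channels instead of letting a few
    embedding dimensions dominate."""
    return E[token_ids] @ P

# Because the map is linear, folding P into the table gives the same output:
#     (E @ P)[token_ids] == E[token_ids] @ P
```

The usage note in the comment is the key practical point: the extra matrix adds training-time flexibility without necessarily adding inference cost.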

Experiments

Experiments were conducted on a 1.4B-parameter LLM trained at scale on 100 billion and 1 trillion tokens. First, excess kurtosis was used to quantify outliers and track how they evolved during training. While the model trained with conventional Adam exhibited outliers with kurtosis exceeding 1000, the OSP model kept the value at a very low 0.04 throughout.
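Excess kurtosis is the fourth standardized moment minus 3, so a Gaussian activation distribution scores near 0 while a distribution with a handful of extreme values scores very high, which is why it works as an outlier metric here. A minimal implementation:

```python
import numpy as np

def excess_kurtosis(x):
    """Excess kurtosis of a flattened activation tensor: the fourth
    standardized moment minus 3, so a Gaussian scores ~0 and a
    distribution with extreme outliers scores very high."""
    x = np.asarray(x, dtype=np.float64).ravel()
    mu, sigma = x.mean(), x.std()
    return np.mean(((x - mu) / sigma) ** 4) - 3.0
```

On a large standard-normal sample this returns approximately 0; injecting a single extreme activation into the same sample drives the value into the thousands, mirroring the Adam-vs-OSP contrast reported above.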

To validate the performance in 4-bit quantization, we also compared average scores on 10 different benchmarks (ARC, MMLU, GSM8k, etc.); the Adam-trained model scored 26.5, while the OSP model scored 35.7. Furthermore, the OSP model consistently maintained low perplexity and robust quantization performance when used with post-training quantization (PTQ) methods, according to the study.

In addition, the "attention sink" phenomenon in the attention mechanism was analyzed; its persistence even after outliers disappear suggests that the two arise from different mechanisms. These results demonstrate the effectiveness of OSP as a training method optimized for quantization.


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.

Contact Us