Kolmogorov-Arnold Network (KAN) Instead Of MLP To Improve Model Expressiveness And Performance
3 main points
✔️ Kolmogorov-Arnold Transformer (KAT), which replaces the MLP layer of the Transformer model with the Kolmogorov-Arnold Network (KAN)
✔️ Rational functions and grouped KAN layers to improve computational efficiency and accuracy
✔️ Demonstrated excellent performance in image classification and object detection
written by Xingyi Yang, Xinchao Wang
(Submitted on 16 Sep 2024)
Comments: Code: this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Background
Traditional transformer models use a multilayer perceptron (MLP) to mix information across channels, but in this paper the Kolmogorov-Arnold Network (KAN) is employed instead to improve the expressiveness and performance of the model.
KAT performs particularly well on vision tasks such as large-scale image recognition, object detection, and semantic segmentation. KAN is good at efficiently approximating mathematical functions and, in theory, can model complex functions with fewer parameters than an MLP. However, integrating KAN into a Transformer presented several technical challenges.
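For reference, KAN is motivated by the Kolmogorov-Arnold representation theorem, which states that any continuous multivariate function on a bounded domain can be written as a finite composition of continuous univariate functions and addition:

```latex
f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

KAN layers generalize this form by placing a learnable univariate function on every edge of the network, whereas an MLP places fixed activations on nodes and learnable linear weights on edges.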
The three specific issues are as follows:
- Basis function problem: The standard B-spline functions used in KAN are not optimized for modern GPUs and are difficult to compute in parallel, resulting in slow computation.
- Parameter and computational inefficiencies: KAN requires a separate function for each input-output pair, making it computationally very expensive.
- Weight initialization problem: Unlike an MLP, KAN's learnable activation functions require particularly careful weight initialization for the model to converge.
To overcome these challenges, KAT introduces the following three solutions:
- Rational basis functions: Instead of B-splines, we use rational functions that are computationally efficient and suitable for modern GPUs.
- Group KAN: Each group of neurons shares activation weights to reduce computational load while maintaining performance.
- Variance-preserving initialization: Weights are initialized so that the variance of activation is maintained for each layer, resulting in stable learning.
In this way, KAT achieves better performance than conventional MLP-based transformers.
Technique
The Kolmogorov-Arnold Transformer (KAT) proposed in this paper is a new architecture that replaces the MLP (multilayer perceptron) used in traditional Transformers with the Kolmogorov-Arnold Network (KAN), improving the expressiveness and performance of the model.
A key feature of KAT is that it incorporates several innovative designs to effectively integrate KAN into the Transformer. Specifically, to improve the computational efficiency of the KAN layer, rational functions implemented in CUDA are used instead of the traditional B-spline functions. This enables faster computation on GPUs, allowing more complex functions to be learned at speeds comparable to those of a conventional MLP.
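As an illustration, below is a minimal PyTorch-style sketch of a learnable rational activation of this kind. It is not the paper's CUDA implementation: the polynomial orders, the identity-like initialization, and the "safe" denominator form (1 + |...|, borrowed from Padé-style activation units) are our own assumptions.

```python
import torch
import torch.nn as nn

class RationalActivation(nn.Module):
    """Learnable element-wise rational function y = P(x) / Q(x).

    Sketch only: P(x) = a_0 + a_1 x + ... + a_m x^m, and Q(x) uses the
    "safe" form 1 + |b_1 x + ... + b_n x^n| so the denominator never
    reaches zero. The orders m=5, n=4 are an assumption, not necessarily
    KAT's exact configuration.
    """

    def __init__(self, m: int = 5, n: int = 4):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(m + 1))  # numerator coefficients
        self.b = nn.Parameter(torch.zeros(n))      # denominator coefficients
        with torch.no_grad():
            # Start close to y = x; KAT instead fits the coefficients to a
            # known activation such as GELU before training.
            self.a[1] = 1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Horner evaluation of the numerator polynomial P(x)
        num = torch.zeros_like(x)
        for i in range(self.a.numel() - 1, -1, -1):
            num = num * x + self.a[i]
        # Horner evaluation of b_1 x + b_2 x^2 + ... + b_n x^n
        den = torch.zeros_like(x)
        for j in range(self.b.numel() - 1, -1, -1):
            den = (den + self.b[j]) * x
        return num / (1.0 + den.abs())
```

Unlike B-splines, evaluating such a function is just a handful of multiply-adds per element, which maps well onto GPU hardware.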
In addition, to reduce the computational load of the KAN layer, a "group KAN" approach is used, in which the weights of the activation functions are shared among multiple edges. This improves the scalability of the model and allows it to work efficiently even at large model sizes. Furthermore, the weights are initialized so that the variance of the activations is maintained consistently across layers, which improves training stability and allows the model to be trained more effectively.
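Building on the sketch above (it reuses the RationalActivation class), the grouping and initialization ideas might look roughly like the following. The class name, the group count, and the variance-preserving scaling factor are illustrative assumptions rather than the paper's implementation; the intent is only to show one shared activation per channel group followed by a linear mixing layer whose weights are scaled by the activation's expected gain.

```python
import math
import torch
import torch.nn as nn

class GroupedRationalMixer(nn.Module):
    """Illustrative GR-KAN-style channel mixer: grouped rational activations
    followed by a linear projection, intended as a stand-in for one
    linear-plus-activation stage of a Transformer's MLP block.
    """

    def __init__(self, dim: int, hidden_dim: int, groups: int = 8,
                 activation_gain: float = 1.0):
        super().__init__()
        assert dim % groups == 0, "channel count must divide evenly into groups"
        self.group_size = dim // groups
        # One shared rational activation per group instead of one function per
        # input-output pair: this is where the parameter savings come from.
        self.acts = nn.ModuleList([RationalActivation() for _ in range(groups)])
        self.fc = nn.Linear(dim, hidden_dim)
        # Variance-preserving initialization (sketch): divide the weight
        # variance by the activation's expected gain so the output variance
        # stays close to the input variance across layers.
        nn.init.normal_(self.fc.weight,
                        std=math.sqrt(1.0 / (activation_gain * dim)))
        nn.init.zeros_(self.fc.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim); each group of channels goes through its shared activation.
        chunks = x.split(self.group_size, dim=-1)
        x = torch.cat([act(c) for act, c in zip(self.acts, chunks)], dim=-1)
        return self.fc(x)
```

Something like GroupedRationalMixer(dim=768, hidden_dim=3072, groups=8) would then take the place of the first linear layer and activation in a ViT block's channel mixer; sharing one activation per group keeps the number of rational coefficients independent of the model width.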
KAT achieves better accuracy than the traditional Transformer model, particularly in the image classification task on the ImageNet-1K dataset, where the KAT-B model reaches 82.3% accuracy, outperforming the ViT model by 3.1%. These improvements make KAT a compelling alternative to simple MLP-based Transformers.
Experiment
KAT experiments were conducted primarily on three visual tasks: image classification, object detection, and semantic segmentation, with performance on each task being evaluated.
First, for image classification, the ImageNet-1K dataset was used to compare the performance of KAT with other models (ViT, DeiT, etc.). KAT employs a new channel mixer called GR-KAN, which outperforms the traditional MLP. For example, the KAT-S model achieves 81.2% accuracy, 2.4% better than the traditional DeiT-S model. Furthermore, KAT-B, a larger version of KAT, achieves about 3.1% higher accuracy than the ViT-B model, showing that KAT has an advantage even when model size is kept the same.
Next, for the object detection task, we incorporated KAT into Mask R-CNN on the MS-COCO2017 dataset to measure object detection and instance segmentation accuracy. Again, KAT outperformed the traditional ViTDet, especially in the smaller models, where the box AP improved by 3.0 points. This confirms that KAT provides efficient and accurate results in object detection.
Finally, semantic segmentation experiments tested KAT's performance on the ADE20K dataset. In this task, KAT was used as the backbone of UperNet and compared to other conventional models; KAT-S achieved an mIoU improvement of approximately 2.4% over DeiT-S, delivering higher accuracy while keeping the model compact.
These experimental results confirm that KAT offers better expressivity and performance compared to the traditional Transformer architecture. KAT is also particularly efficient in terms of computation, with CUDA optimization enabling faster computation than conventional methods. Such a design has shown KAT to be a strong choice for a variety of visual tasks.
Summary
The conclusion of this paper shows that the Kolmogorov-Arnold Transformer (KAT) is a promising alternative to traditional MLP-based Transformers: it leverages the properties of the Kolmogorov-Arnold Network (KAN) and performs well on visual tasks. Specifically, KAT improves accuracy over the traditional Transformer architecture while maintaining computational efficiency.
In addition, KAT has the potential to surpass the MLP both theoretically and practically, and further applications are expected in future research. In particular, the flexible expressivity and training stability provided by the use of rational functions open avenues for future development, including extension to tasks beyond vision.