
Proposal Of An Optimization Method For Activation Functions And CRReLU Using Information Entropy
3 main points
✔️ A theoretical framework based on information entropy proves the existence of the worst activation function with boundary conditions (WAFBC).
✔️ Proposes Entropy-based Activation Function Optimization (EAFO) to design dynamic and static activation functions.
✔️ Derives a new activation function, CRReLU, and demonstrates its superiority over conventional functions in image classification and language modeling tasks.
A Method on Searching Better Activation Functions
written by Haoyuan Sun, Zihao Wu, Bo Xia, Pu Chang, Zibin Dong, Yifu Yuan, Yongzhe Chang, Xueqian Wang
(Submitted on 19 May 2024)
Comments: 16 pages, 3 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
In recent years, the development of deep learning has significantly improved the performance of artificial neural networks (ANNs). Among their components, the activation function is one of the key factors that ensure the non-linearity of the network and enable the learning of complex patterns. However, the selection of activation functions has traditionally been based mainly on empirical rules, and theoretical guidance has been lacking. As a result, the search for better activation functions has been difficult, and model optimization has not advanced sufficiently.
In this paper, we propose Entropy-based Activation Function Optimization (EAFO), a method for optimizing activation functions from the viewpoint of information entropy, and use it to derive a new activation function, Correction Regularized ReLU (CRReLU). CRReLU is obtained by applying an entropy-based correction to the widely used ReLU (Rectified Linear Unit) and achieves better performance than conventional activation functions.
Related Research
Importance of the activation function and existing methods
The activation function is one of the most important factors affecting the performance of a neural network, and its choice can greatly affect the stability of learning and the accuracy of the model. In previous studies, the following activation functions have been developed and widely used:
- Sigmoid and Tanh:
  - Often used in early neural networks, but prone to the vanishing gradient problem.
- ReLU (Rectified Linear Unit):
  - Computationally simple and effective at mitigating vanishing gradients, but suffers from "Neuron Death" (the Dying ReLU problem) and "bias shift".
- Leaky ReLU / Parametric ReLU (PReLU):
  - Variants of ReLU that output small non-zero values for negative inputs, mitigating the Dying ReLU problem.
- GELU (Gaussian Error Linear Unit):
  - Performs well in Transformer-based models such as BERT and the GPT series of large language models (LLMs), but its mathematical properties are not well understood.
These activation functions have been chosen based on empirical evaluations and have not been systematically optimized. To address this issue, this paper introduces a theoretical approach based on information entropy and proposes a method to search for the optimal activation function.
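To make these differences concrete, the short PyTorch sketch below (our illustration, not taken from the paper) evaluates the listed activations on negative inputs, where ReLU's exactly-zero output (and hence the Dying ReLU problem) contrasts with its leaky variants and GELU:

```python
import torch
import torch.nn as nn

# Negative inputs are where the ReLU family differs most.
x = torch.linspace(-3.0, 0.0, steps=5)

activations = {
    "ReLU": nn.ReLU(),                                # exactly 0 for x < 0
    "LeakyReLU": nn.LeakyReLU(negative_slope=0.01),   # small fixed slope
    "PReLU": nn.PReLU(),                              # learnable negative slope
    "GELU": nn.GELU(),                                # smooth, non-zero for x < 0
}

for name, fn in activations.items():
    print(f"{name:>10}: {fn(x).detach().numpy().round(4)}")
```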
Proposed Method
Relation between information entropy and activation function
In this study, we focused on the relationship between information entropy and the activation function. Information entropy is a measure of data uncertainty and plays an important role in the learning of neural networks. Specifically, the following relationships were derived (a small numerical illustration follows the list).
- High information entropy of the activation function increases learning uncertainty and degrades classification performance.
- By minimizing information entropy, a more effective activation function can be designed.
- It is possible to prove the existence of the worst activation function with boundary conditions (WAFBC) and to design a better activation function based on it.
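The paper's argument is analytical, but as a rough numerical illustration of the idea (our sketch, not the paper's derivation), the information entropy H = -Σ p log p of an activation's outputs can be estimated for Gaussian inputs and compared across functions:

```python
import numpy as np

def histogram_entropy(samples, bins=100):
    """Crude estimate of H = -sum(p * log p) in nats via a histogram."""
    hist, edges = np.histogram(samples, bins=bins, density=True)
    p = hist * np.diff(edges)      # probability mass per bin
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # stand-in for pre-activation values

outputs = {
    "identity": x,
    "relu": np.maximum(x, 0.0),
    "tanh": np.tanh(x),
}

for name, y in outputs.items():
    print(f"{name:>8}: H ≈ {histogram_entropy(y):.3f} nats")
```

A discrete histogram estimate like this is only a crude proxy for the quantities analyzed in the paper, but it shows how the entropy of the outputs depends on the choice of activation.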
Entropy-based Activation Function Optimization (EAFO)
In this study, we proposed a new activation function optimization method called Entropy-based Activation Function Optimization (EAFO). The method consists of the following three steps (a schematic toy example follows the list):
- Calculate the information entropy of existing activation functions and theoretically derive the worst activation function (WAFBC).
- Optimize the activation function to reduce information entropy with respect to the WAFBC.
- The optimized activation function is applied to the neural network and its performance is evaluated.
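The correction in the paper is derived analytically from the WAFBC; purely as a schematic stand-in for steps 2 and 3 (not the paper's algorithm), the toy sketch below adds a parameterized correction term to ReLU and keeps the coefficient that minimizes the same kind of histogram-based entropy estimate used above. The correction term x·exp(−x²/2) is our assumption for illustration only:

```python
import numpy as np

def histogram_entropy(samples, bins=100):
    """Crude entropy estimate (in nats) of a 1-D sample via a histogram."""
    hist, edges = np.histogram(samples, bins=bins, density=True)
    p = hist * np.diff(edges)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def corrected_relu(x, eps):
    # ReLU plus a small correction term; the exact form used in the paper
    # should be taken from its derivation, this one is assumed for the demo.
    return np.maximum(x, 0.0) + eps * x * np.exp(-x**2 / 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)   # stand-in for pre-activation values

# Step 2 (schematically): search for the correction coefficient that lowers entropy.
candidates = np.linspace(-0.1, 0.1, 21)
entropies = [histogram_entropy(corrected_relu(x, eps)) for eps in candidates]
best = candidates[int(np.argmin(entropies))]
print(f"entropy-minimizing correction coefficient: {best:+.3f}")
```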
Derivation of Correction Regularized ReLU (CRReLU)
By utilizing EAFO, a new activation function, Correction Regularized ReLU (CRReLU), was derived. CRReLU is an improved version of ReLU with the following properties:
- Resolves the "Dying ReLU" problem of ReLU
- Improved network expressiveness by allowing information to flow through negative input values
- Improved learning stability and faster convergence
CRReLU extends ReLU with a correction term whose strength is controlled by ε, a learnable parameter that is adjusted during optimization.
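The formula itself is not reproduced here, so the following PyTorch module is only a sketch: it assumes a ReLU-plus-correction form, CRReLU(x) = max(0, x) + ε·x·exp(−x²/2), with ε as the learnable parameter described above; the exact expression and the initialization of ε should be checked against the paper.

```python
import torch
import torch.nn as nn

class CRReLU(nn.Module):
    """Sketch of Correction Regularized ReLU with a learnable correction strength.

    Assumed form: CRReLU(x) = max(0, x) + eps * x * exp(-x^2 / 2),
    where eps is a learnable parameter (the exact expression should be
    taken from the paper).
    """

    def __init__(self, eps_init: float = 0.01):
        super().__init__()
        # Learnable correction strength, adjusted during optimization.
        self.eps = nn.Parameter(torch.tensor(eps_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        correction = self.eps * x * torch.exp(-x.pow(2) / 2)
        return torch.relu(x) + correction

# Drop-in usage, e.g. replacing GELU in a small MLP block:
mlp = nn.Sequential(nn.Linear(64, 256), CRReLU(), nn.Linear(256, 64))
out = mlp(torch.randn(8, 64))
print(out.shape)  # torch.Size([8, 64])
```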
Experiment
Image Classification
We evaluated CRReLU on the CIFAR-10, CIFAR-100, and ImageNet-1K datasets using Vision Transformer (ViT) and Data-Efficient Image Transformer (DeiT) architectures. The results showed that CRReLU consistently achieved higher accuracy than other activation functions (e.g., GELU, ELU, PReLU).
For example, Tables 1 and 2 show that CRReLU achieves the highest top-1 accuracy on CIFAR-10 and CIFAR-100. In particular, on CIFAR-100, CRReLU significantly outperforms PReLU and Mish, which are improved versions of ReLU.
Large-scale language model (LLM)
In addition, a fine-tuning experiment was conducted on the large language model GPT-2. For this task, we used the Stanford Human Preferences (SHP) and Anthropic HH datasets to compare the performance of CRReLU and GELU.
The results are shown in Table 4 of the paper (p. 8), where CRReLU generally outperformed GELU on the evaluation metrics.
Conclusion
By introducing a theoretical framework, this paper provided a new approach to the design of activation functions, which had previously been empirical: with EAFO, it is possible to efficiently improve upon existing functions and create new ones, such as CRReLU.
However, further applications of EAFO and ways to improve its computational efficiency remain as future work. In particular, further development is expected by exploring potential applications in areas other than image classification and language tasks.