Finally, Mish, the Long-Awaited New Activation Function, Has Arrived!
3 main points
✔️ A new activation function, Mish, was proposed after ReLU and Swish.
✔️ It outperforms ReLU and Swish on MNIST and CIFAR-10/100.
✔️ The GitHub repository with the paper author's implementation is very easy to use.
Mish: A Self Regularized Non-Monotonic Neural Activation Function
written by Diganta Misra
(Submitted on 23 Aug 2019 (v1), last revised 2 Oct 2019 (this version, v2))
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
All images used in this article are taken from the paper.
Introduction
Neural networks are widely used in image recognition and natural language processing. Their strength lies in the ability to represent nonlinear functions, and it is the activation function that makes this nonlinearity possible. Typical activation functions include Sigmoid and ReLU (2011), and in 2017 Swish was proposed as a successor to ReLU. But as you know, ReLU still reigns as the de facto standard. Mish, introduced in 2019 to put an end to that trend in the world of activation functions, has surpassed ReLU and Swish on a variety of tasks, and its implementation is available on GitHub so you can easily try it out.
This article will explain the paper in the following sequence.
- Explanation of Mish
- Results of Mish's experiments
- Conclusion
1. Mish
Expressed as a formula, the Mish function is $f(x)=x\cdot \mathrm{tanh}(\mathrm{softplus}(x))$, where $\mathrm{softplus}(x)=\ln{(1+e^x)}$. The shape of the function is shown in the figure.
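As a concrete reference, here is a minimal sketch of Mish written in PyTorch (the framework choice is mine for illustration; recent PyTorch versions also ship a built-in `torch.nn.Mish`):

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    # Mish(x) = x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x)
    return x * torch.tanh(F.softplus(x))

# Quick sanity check on a few points
x = torch.linspace(-5.0, 5.0, steps=5)
print(mish(x))
```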
To use it, you simply replace the activation function you are currently using, such as ReLU, with Mish (a minimal sketch follows the list below). However, the authors note that it is better to use a slightly smaller learning rate than with ReLU. The paper lists five properties of the Mish function:
- Unbounded above
- Sigmoid, for example, saturates at 1 and its gradient goes to zero there, which slows down learning; Mish does not saturate above, so this slowdown is avoided.
- Bounded below
- This provides a strong regularization effect.
- Preserves negative values
- Unlike ReLU, small negative values are retained rather than zeroed out.
- Continuous up to derivatives of infinite order
- ReLU's derivative is discontinuous, which can cause unexpected problems for gradient-based optimizers, but Mish is infinitely differentiable, so this is not an issue.
- Smooth output landscape
- The loss function also becomes smoother and easier to optimize.
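As a rough illustration of the drop-in replacement described above, the sketch below builds a toy CNN and simply uses Mish where one would normally put ReLU. The architecture and layer sizes are placeholders of my own, not the network used in the paper; `nn.Mish` is the built-in module in recent PyTorch versions (the `mish` function defined above works just as well):

```python
import torch.nn as nn

# A toy CNN for 10-class image classification; every ReLU is replaced with Mish.
# Layer sizes are illustrative only, not the paper's exact setup.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.Mish(),                      # where nn.ReLU() would normally go
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.Mish(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),
)
```

When training such a model, keep the advice above in mind and consider starting from a slightly smaller learning rate than you would use with ReLU.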
2. Results of Mish's experiments
2.1 MNIST
The paper runs a large number of experiments, so I will explain each one briefly. Basically, performance is compared against Swish (and ReLU). The network used is a plain 6-layer CNN.
2.1.1 Layer depth
This is the accuracy as the model is made deeper (without skip connections). You can see that Mish (red) remains stable even as the layers get deeper.
2.1.2 Robustness
The top panel shows accuracy and the bottom panel shows loss. The horizontal axis is the standard deviation of the Gaussian noise added to the input, so larger values mean stronger noise. Even as the noise increases, Mish maintains high accuracy and low loss.
2.1.3 Various optimization algorithms
This is the accuracy when using various optimization algorithms such as SGD and Adam. The result with Nadam is not as good, but otherwise Mish performs about the same as or better than Swish with every optimizer.
2.1.4 Learning Rate Dependence
These are the accuracies for various learning rates. For learning rates in the range $[10^{-5}, 10^{-1}]$, Mish achieves the same or better accuracy than Swish, and it is especially better when the learning rate is small.
2.1.5 Various initial values
These are the accuracies for various weight initialization schemes, such as random initialization. Again, Mish achieves about the same or better accuracy than Swish.
2.1.6 Various regularizations
These are the accuracies when using L1 regularization, L2 regularization, and combined L1-L2 regularization. As before, Mish performs as well as or better than Swish.
2.1.7 Dropout Rate Dependence
This shows how accuracy varies with the dropout rate. You can see that Mish is better than Swish in both cases.
2.1.8 Layer width
This shows accuracy relative to layer width, i.e., the number of neurons per layer. Mish is significantly better than Swish.
2.2 CIFAR-10
From here, CIFAR-10 is used as the dataset. The network is an ordinary 6-layer CNN, the same as for MNIST.
2.2.1 Various optimization algorithms
This is the result of using various optimization algorithms. As before, Nadam does not perform well to begin with, but you can see that Mish is as good as or better than Swish.
2.2.2 Learning Rate Dependence
Mish performs as well or better than Swish in terms of learning rate.
2.2.3 Various initial values
You can see that Mish is significantly better than Swish at every initial value, except for random_normal.
2.2.4 Various regularizations
Mish nicely dominates Swish in regularization as well.
2.2.5 Dropout Rate Dependency
Mish is preferable at every dropout rate.
2.2.6 Layer width
Mish also outshines Swish across layer widths, which determine the expressive power of the model.
2.2.7 Cos Annealing and One Cycle Policy
You can see that Mish is more accurate than Swish with both learning rate schedulers, Cosine Annealing and the One Cycle policy (a brief sketch of these schedulers follows below).
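For reference, here is a hedged sketch of how these two schedulers might be set up in PyTorch; the model, optimizer, learning rates, and step counts are placeholder values of my own, not the paper's settings:

```python
import torch

# Placeholder model and optimizer; all hyperparameter values are illustrative.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Cosine Annealing: decays the learning rate along a cosine curve over T_max steps.
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# One Cycle policy: ramps the learning rate up to max_lr and back down over total_steps.
# In practice you would pick one scheduler or the other, not both.
one_cycle = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, total_steps=100)
```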
2.2.8 Mixup
Here are the accuracies for different values of the hyperparameter $\alpha$ of MixUp (2017), one of the data augmentation methods. The top figure shows accuracy and the bottom figure shows loss. For some reason, the comparison here is against ReLU, but Mish beats ReLU in both cases.
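As a brief aside on what $\alpha$ controls, here is a minimal MixUp sketch of my own (not the paper's code): $\alpha$ parameterizes the Beta distribution from which the mixing coefficient $\lambda$ is drawn, and each batch is blended with a shuffled copy of itself.

```python
import torch

def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    # Draw the mixing coefficient lambda from Beta(alpha, alpha).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    mixed_x = lam * x + (1.0 - lam) * x[perm]
    # Train on mixed_x; the loss becomes lam * loss(pred, y) + (1 - lam) * loss(pred, y[perm]).
    return mixed_x, y, y[perm], lam
```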
2.3 Other experiments
In addition to the experiments mentioned above, the paper conducts several other experiments, which are introduced here.
2.3.1 Different activation functions
The network here is SENet, and Mish shows the highest accuracy, low loss, and low variance in accuracy.
2.3.2 Different networks
Let's look at the accuracy of Mish for various networks. The results are summarized all at once in the tables.
2.3.2.1 CIFAR-10
It has shown high accuracy on a wide range of models from ResNet v2-20 to EfficientNet B0 and MobileNet v2. Amazing!
2.3.2.2 CIFAR-100
On CIFAR-100, Mish loses to Swish in only two cases and wins everywhere else. Also impressive!
2.3.3 Learning Speed
Here is the only drawback of Mish: training speed. You can see that it is not as fast as ReLU, which has a much simpler computation. However, you can weigh this cost against the accuracy improvements from Mish that we have seen so far (a rough way to measure the overhead yourself is sketched below).
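If you want to gauge this overhead on your own hardware, a crude micro-benchmark along the following lines (my own sketch, not the paper's benchmark, measuring only the forward pass on CPU) compares Mish against ReLU:

```python
import time
import torch
import torch.nn.functional as F

x = torch.randn(1024, 1024)

def bench(fn, iters=1000):
    # Crude wall-clock timing; a serious benchmark needs warm-up and GPU synchronization.
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return time.perf_counter() - start

print("ReLU:", bench(F.relu))
print("Mish:", bench(lambda t: t * torch.tanh(F.softplus(t))))
```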
3. Conclusions
Mish has surpassed ReLU and Swish, and the paper mentions extending the experiments to ImageNet and NLP as the next step. It is very easy to use, since you just swap the activation function for Mish. If you want to improve the accuracy of your models, try the paper author's implementation!