Using Random Labels Improves Text Classification!

Natural Language Processing 27/07/2021

3 main points
✔️ Improved performance without extra computational cost in the prediction procedure
✔️ Also validates the superiority of the Label Confusion Model (LCM) over label smoothing methods
✔️ LCM is particularly effective on confused and noisy datasets, demonstrating a significant degree of superiority over label smoothing (LS) methods

Label Confusion Learning to Enhance Text Classification Models
written by Biyang Guo, Songqiao Han, Xiao Han, Hailiang Huang, Ting Lu
(Submitted on 9 Dec 2020)
Comments: Accepted by AAAI 2021.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

The images used in this article are either from the paper or created based on it.

Introduction

Text classification is one of the fundamental tasks in natural language processing and has a wide range of applications, including news filtering and spam detection. It has seen considerable success, especially since deep learning-based methods were applied to it. Many deep learning models have been used for text classification, but they all share the same learning paradigm: a deep model for text representation, a classifier that predicts a label distribution, and a cross-entropy loss between the predicted probability distribution and the one-hot label vector. However, there are at least two problems with this learning paradigm.

In general text classification tasks, one-hot labels rest on the assumption that all categories are independent of each other. In practice, labels are often not completely independent, and instances are commonly associated with more than one label. Representing the true label as a one-hot vector therefore ignores the relationship between instances and labels, which limits what a deep learning model can learn.
Deep learning models rely heavily on large annotated datasets. Noisy data with labeling errors degrades classification performance but is unavoidable in human-annotated datasets. Training with one-hot labels is vulnerable to mislabeled samples because each such sample is assigned entirely to the wrong category.

Simply put, the limitations of the current learning paradigm lead to prediction confusion, where the model has difficulty distinguishing between several labels. This is called the label confusion problem (LCP). Label smoothing (LS), a popular remedy, was proposed to address the shortcomings of one-hot label vectors, but it does not capture realistic relationships between labels and is not sufficient to solve the problem.

The paper therefore proposes a new Label Confusion Model (LCM) as an enhancement component for existing deep learning text classification models.

Background (if you don't need it, you can skip to the proposed method)

Text Classification by Deep Learning

Deep learning research on text classification can be divided into two main groups.

Research focused on word embeddings (from around 2014).
Research on deep architectures that can learn better text representations.
Typical architectures include RNNs such as LSTM and pretrained language models such as BERT. Text classification has become much more accurate thanks to architectures that learn rich semantic representations from text, and these yield much better results than hand-crafted features.

Conventional methods for the label confusion problem

Label smoothing

Label smoothing (LS) was first proposed for image classification as a regularization technique that keeps the model from predicting training examples too confidently. Instead of computing the loss against hard one-hot targets, LS uses a weighted mixture of those targets and a uniform noise distribution, which improves the accuracy of the model. Nevertheless, because the soft target is obtained simply by adding noise, it cannot reflect the true label distribution. The true label distribution reveals the semantic relationship between an instance and each label, and similar labels should receive similar weight in that distribution. For more detail, see the article "The Truth Behind Label Smoothing!".
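
The mixing step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than code from the paper; the function name `smooth_labels` and the smoothing weight `epsilon` are assumptions made for the example.

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, epsilon: float = 0.1) -> np.ndarray:
    """Mix one-hot targets with a uniform distribution over the classes.

    `epsilon` (hypothetical name) is the share of probability mass that is
    spread uniformly over all classes.
    """
    num_classes = one_hot.shape[-1]
    uniform = np.full_like(one_hot, 1.0 / num_classes, dtype=float)
    return (1.0 - epsilon) * one_hot + epsilon * uniform

# One sample that belongs to class 2 out of 4 classes.
one_hot = np.eye(4)[[2]]
smoothed = smooth_labels(one_hot, epsilon=0.1)
```

Every wrong class receives the same share of the smoothing mass regardless of how similar it is to the true class, which is exactly the limitation pointed out above.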

Label Embedding

Label Embedding means learning embeddings of the labels in a classification task. By transforming labels into semantic vectors, the classification problem can be recast as a vector matching task. Some approaches additionally use attention to learn word and label embeddings jointly, giving multi-label classification models that capture the relationships between labels.
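
To make the "vector matching" idea concrete, here is a toy sketch in which a text vector is scored against each label embedding with a dot product and the best-scoring label is chosen. The vectors and the helper name `match_label` are made up for illustration and do not come from the paper.

```python
import numpy as np

def match_label(text_vec: np.ndarray, label_emb: np.ndarray):
    """Score a text vector against every label embedding (dot product)
    and return the index of the best-matching label with all scores."""
    scores = label_emb @ text_vec
    return int(np.argmax(scores)), scores

# Three hypothetical labels embedded in a 2-dimensional space.
label_emb = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
])
text_vec = np.array([0.9, 0.1])  # hypothetical text representation

pred, scores = match_label(text_vec, label_emb)
```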

This concept of Label Embedding is also used in the proposed method, so please keep it in mind.

Label Distribution Learning

Label Distribution Learning (LDL) is a machine learning paradigm for tasks where the overall distribution of labels matters. A label distribution covers a certain number of labels and gives the degree to which each label describes an instance. Several algorithms have been proposed for LDL tasks. However, in many existing classification datasets, such as 20NG and MNIST, each sample is assigned only a single label, so the true label distribution is difficult to obtain. In such cases, LDL is not applicable.

Proposed method

A schematic diagram of the proposed method is shown below.

Specifically, the framework of the proposed method consists of two parts: the Basic Predictor shown on the left and the Label Confusion Model (LCM) shown on the right.

Basic Predictor

This part encodes the input with an encoder such as an RNN, CNN, or BERT to obtain a semantic representation of the sentence, and then classifies it with a softmax layer. The output is the same as in the conventional approach: a predicted label distribution.
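
As a rough sketch of this step, the snippet below uses a toy encoder (mean-pooled random word embeddings) in place of the RNN, CNN, or BERT encoders named above, followed by a linear layer and softmax. All sizes, weights, and names are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab_size, embed_dim, num_classes = 100, 16, 4  # hypothetical sizes

embeddings = rng.normal(size=(vocab_size, embed_dim))  # toy word embeddings
W = rng.normal(size=(embed_dim, num_classes))          # classifier weights
b = np.zeros(num_classes)                              # classifier bias

def basic_predictor(token_ids: np.ndarray):
    """Toy stand-in for the basic predictor: encode, then classify."""
    v = embeddings[token_ids].mean(axis=0)  # instance representation
    y_p = softmax(v @ W + b)                # predicted label distribution
    return v, y_p

v, y_p = basic_predictor(np.array([3, 17, 42, 8]))
```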

Label Confusion Model (LCM)

Representing labels directly as one-hot vectors wastes label information. The authors also argue that it encourages overfitting.

Specifically, the labels are first encoded with a label encoder such as an MLP or DNN to obtain the label representation matrix. Next comes the simulated label distribution (SLD) computation block, which consists of a similarity layer and an SLD computation layer. The similarity layer takes the label representations and the current instance representation as input, calculates their similarity as a dot product, and then applies a neural network layer with softmax activation to obtain the label confusion distribution (LCD). By measuring the similarity between instances and labels, the LCD captures the dependencies between labels. This makes it a dynamic distribution that depends on the instance, which is better than a distribution that considers only the similarity between labels, or a plain uniform noise distribution as in label smoothing.

Finally, the original one-hot vector, scaled by a control parameter α, is added to the label confusion distribution (LCD), and the sum is normalized with a softmax function to produce the simulated label distribution (SLD).
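
The two layers described above can be sketched as follows. This is a minimal NumPy illustration based on the description in the text, not the authors' implementation: the weights are random rather than learned, and names such as `simulated_label_distribution`, `label_emb`, `W`, and `b` are assumptions.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def simulated_label_distribution(v, label_emb, W, b, y_true, alpha):
    """Compute the label confusion distribution (LCD) and the SLD."""
    similarity = label_emb @ v            # dot product of each label with the instance
    y_c = softmax(similarity @ W + b)     # LCD: dense layer with softmax activation
    y_s = softmax(alpha * y_true + y_c)   # SLD: scaled one-hot plus LCD, renormalized
    return y_c, y_s

rng = np.random.default_rng(0)
num_classes, hidden_dim = 5, 8                          # hypothetical sizes
v = rng.normal(size=hidden_dim)                         # instance representation
label_emb = rng.normal(size=(num_classes, hidden_dim))  # label representation matrix
W = rng.normal(size=(num_classes, num_classes))
b = np.zeros(num_classes)
y_true = np.eye(num_classes)[2]                         # one-hot label for class 2

y_c, y_s = simulated_label_distribution(v, label_emb, W, b, y_true, alpha=4.0)
```

A larger α keeps the SLD closer to the original one-hot label, while a smaller α gives the LCD more influence on the training target.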

Both the label distribution y(p) predicted by the basic predictor and the simulated label distribution y(s) obtained in the previous step are probability distributions. To measure the difference between them, the Kullback-Leibler divergence (KL divergence) of y(p) from y(s) is used as the loss function.
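
The loss can be illustrated with the standard definition of KL divergence, summing y(s) * log(y(s) / y(p)) over the classes. The snippet is a sketch with made-up distributions; the clipping constant `eps` is an assumption added only to avoid taking the log of zero.

```python
import numpy as np

def kl_divergence(y_s: np.ndarray, y_p: np.ndarray, eps: float = 1e-12) -> float:
    """KL(y_s || y_p): how far the prediction y_p is from the target y_s."""
    y_s = np.clip(y_s, eps, 1.0)
    y_p = np.clip(y_p, eps, 1.0)
    return float(np.sum(y_s * np.log(y_s / y_p)))

y_s = np.array([0.7, 0.2, 0.1])  # hypothetical simulated label distribution
y_p = np.array([0.5, 0.3, 0.2])  # hypothetical predicted label distribution

loss = kl_divergence(y_s, y_p)
```

The loss is zero only when the two distributions coincide, so minimizing it pulls the prediction toward the simulated label distribution rather than toward the hard one-hot label.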

Learning with LCM means that the target the model tries to fit changes dynamically, depending on the semantic representations of the document and of the labels that the deep model has learned. The learned simulated label distribution helps to represent instances with different labels better, especially confusing samples. With noisy data, SLD also lets the model learn useful information from mislabeled samples, because part of the probability of the incorrect label is assigned to similar labels, which often include the correct one. This completes the technical description. In short, the paper models the label distribution and the relationships between labels while taking the input into account, which yields a dynamic, input-dependent label encoding that lets the model make better use of the label information.

Experiment setup

Datasets

The proposed method is evaluated on five benchmark datasets: three English and two Chinese.

20NG
An English news dataset of 18,846 documents grouped evenly into 20 categories.
AG's News Data Set
127,600 samples in 4 classes. A subset of 50,000 samples is used in the experiments.
DBPedia Data Set
An ontology classification dataset with 630,000 samples in 14 classes. 50,000 samples are randomly selected as the experimental dataset.
FDCNews Data Set
9,833 Chinese news documents categorized into 20 classes.
THUCNews Data Set
A Chinese news classification dataset collected by Tsinghua University. A subset of 39,000 news items divided evenly into 13 news categories is constructed from it and used in the experiments.

Models

The Label Confusion Model (LCM) is designed to be integrated with current mainstream models, so the experiments use only common architectures that are widely used in text classification: LSTM, CNN, and BERT. Please refer to the original paper for the models and their detailed parameters.

Experimental results

The table below compares the test performance of the LCM-based models with that of the corresponding base models alone.

The results show that the LCM-based classification model outperforms the baseline on all datasets when using the LSTM-rand, CNN-rand, and BERT structures. The LCM-based CNN-pre model, however, performs slightly worse on the FDCNews and 20NG datasets. Overall, the results on five datasets with three widely used base models show that LCM can improve the performance of text classification models. The LCM-based models also have a lower standard deviation. The largest gain is over the baseline LSTM-rand on the 20NG dataset, with a 4.20% improvement in test performance. Against CNN-rand on the same dataset, there is also a clear improvement of 1.04%.

There are 20 categories in the 20NG data set. It stands to reason that the more categories there are, the harder it is for the model to distinguish between labels in the same group. Furthermore, the following figure visualizes the learned label representation for the 20 labels in the 20NG dataset.

The label representations are extracted from the embedding layer of the LCM. Figure a shows the cosine similarity matrix of the label representations, where the off-diagonal elements indicate how similar one label is to another. Figure b projects the high-dimensional representations onto a 2D map using t-SNE and shows that labels that are easily confused, especially labels from the same group, tend to have similar representations. Since all label representations are initialized randomly, this indicates that the LCM can learn meaningful representations that reflect the confusion between labels.
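
A cosine similarity matrix like the one in Figure a can be computed from any label embedding matrix as sketched below. The embeddings here are random placeholders standing in for the learned LCM label embeddings, and the embedding size of 64 is an assumption.

```python
import numpy as np

def cosine_similarity_matrix(emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `emb`."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # normalize each row to unit length
    return unit @ unit.T

rng = np.random.default_rng(0)
label_emb = rng.normal(size=(20, 64))  # placeholder for 20 learned label embeddings

sim = cosine_similarity_matrix(label_emb)
```

Each diagonal entry is 1 because every label is identical to itself, so the information about label confusion lies in the off-diagonal entries.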

There are several reasons why classification models using LCM usually achieve better test performance.

LCM learns a simulated label distribution (SLD) during training and can capture complex relationships between labels by considering the semantic similarity between input documents and labels. This is likely a better target than a simple one-hot vector for the true label.
Some datasets contain mislabeled data, especially those with a large number of categories or very similar labels. In such cases, training with one-hot label representations tends to be strongly affected by the mislabeled samples. With SLD, part of the probability of a wrong label is reassigned to similar labels, so the wrong label is less misleading.
Apart from mislabeling, when the given labels are similar (e.g., "computer" and "electronics" are semantically similar topics and share many keywords), it is natural and reasonable to label text samples with a label distribution that conveys different aspects of the information.

This article has covered only the main experiment reported by the authors, but the paper includes four others: "The effect of α and early stopping of LCM", "The effect of confusion of datasets", "Experiments on noisy datasets and comparison with Label Smoothing", and "Application of LCM to images". Each of them demonstrates the effectiveness of LCM (please refer to the original paper for details).

Summary

The paper proposes the Label Confusion Model (LCM) as an extension component that improves the performance of current text classification models. LCM captures the relationships between instances and labels as well as the dependencies between labels, and experiments on five benchmark datasets show that it can enhance common deep learning models such as LSTM, CNN, and BERT.

I feel that the main advantage of this method is that it is model agnostic and can therefore be used flexibly to improve a variety of models. It is a very interesting paper, since modeling labels more comprehensively makes better use of the label information and yields better results at little extra cost.