Tongue Image Diagnosis By Deep Learning: Understanding Systemic Disorders From The Tongue! Part1

3 main points
✔️ Tongue imaging has the potential to overcome a long-standing problem of diagnosis in traditional Chinese medicine (TCM): its dependence on the experience of the clinician
✔️ Proposes the Constrained High Dispersal Network (CHDNet)
✔️ Achieves 91.14% accuracy and 0.94 AUC, which is higher than the performance of previous methods

Tongue Images Classification Based on Constrained High Dispersal Network
written by Dan Meng, Guitao Cao, Ye Duan, Minghua Zhu, Liping Tu, Dong Xu, Jiatuo Xu
(Submitted on 30 Mar 2017)
Comments: Accepted to Evid Based Complement Alternat Med.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Oriental Medicine


Can deep learning resolve a core problem of diagnosis in Oriental medicine, namely its dependence on the subjective experience of doctors? And what is required to accomplish this?

The tongue is known as an indicator of oral health, but in Oriental medicine it is also used to understand the state of the whole body. In other words, the analysis of tongue images has the potential to detect not just one specific disease but multiple diseases at once. However, tongue diagnosis relies on the doctor's own experience, and a certain level of experience is needed to make a diagnosis. Deep learning is therefore attracting attention as a way to make this non-verbal knowledge widely available, by capturing the doctor's experience as learned features and models. On the other hand, redundancy in tongue images, which makes it hard to capture the features of the image as a whole, has been pointed out as a problem.

This study uses high dispersal and local response normalization, together with multi-scale feature analysis, to eliminate redundancy, particularly on imbalanced datasets. The proposed method learns high-level features and provides more information for classification. As a result, it achieved high classification performance.

What are Oriental medicine (traditional Chinese medicine, TCM) and tongue diagnosis?

Oriental medicine refers to traditional medicine of Eastern origin, mainly Chinese herbal medicine, acupuncture, and moxibustion. While Western medicine applies medication and surgery directly to the affected part of the body, Oriental medicine aims to cure ailments with methods such as acupuncture, moxibustion, and Chinese herbal medicine ("kampo"), based on the condition of the whole body rather than a single part. In addition, Oriental medicine has its own concept of "mibyo" (not yet sick), which aims to prevent illness before it occurs by relieving fatigue and building up resistance.

Oriental medicine relies on the "four diagnoses," one of which is visual inspection: grasping the condition of the body from external characteristics such as facial expression and appearance. Within this category, tongue diagnosis assesses physical condition from the state of the tongue and can detect areas of discomfort with high accuracy in a non-invasive manner. For thousands of years, Chinese medical physicians have judged a patient's health status by examining the color, shape, and texture of the tongue.

On the other hand, such a diagnosis depends strongly on the doctor's own experience and has a subjective aspect, which makes the method difficult to pass on to others. Under these circumstances, simplifying diagnosis by accumulating tongue images and extracting features with deep learning has attracted attention.

Problems of conventional methods

Many models based on a single feature (color, shape, texture, and so on) have been proposed and have achieved good results, but they use only low-level features, which limits their expressive power. Detecting abnormalities in tongue images with high accuracy requires features of the entire image, so it is effective to extract comprehensive features that integrate these individual cues. A previous method, PCANet, extracts such complex features from tongue images. It combines the PCA algorithm with a CNN-style architecture, can be adapted to different data and tasks, and requires little or no parameter fine-tuning. It has also been reported to achieve excellent performance in classification tasks when combined with machine learning classifiers such as k-nearest neighbor (KNN), SVM, and Random Forest (RF).
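To make the idea of PCA-learned filters concrete, here is a minimal sketch of one PCANet-style filter-learning stage: collect image patches, subtract each patch's mean, and take the leading eigenvectors of the patch covariance as convolution filters. The function name, the patch size, the number of filters, and the random toy images are assumptions for illustration; they are not the settings used in the paper.

```python
import numpy as np

def learn_pca_filters(images, patch_size=7, num_filters=8):
    """Learn PCANet-style filters from a list of 2-D grayscale images.

    patch_size and num_filters are illustrative values, not the paper's.
    """
    k = patch_size
    patches = []
    for img in images:
        h, w = img.shape
        # Slide a k x k window over every valid position of the image.
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                p = img[i:i + k, j:j + k].astype(np.float64).ravel()
                patches.append(p - p.mean())  # remove the patch mean
    X = np.stack(patches, axis=1)             # shape: (k*k, num_patches)
    cov = X @ X.T / X.shape[1]                # (k*k, k*k) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:num_filters]
    # Each leading eigenvector, reshaped to k x k, acts as one filter.
    filters = eigvecs[:, order].T.reshape(num_filters, k, k)
    return filters, eigvals[order]

rng = np.random.default_rng(0)
toy_images = [rng.random((16, 16)) for _ in range(4)]  # stand-ins for tongue images
filters, top_eigvals = learn_pca_filters(toy_images)
```

Because the filters come from an eigendecomposition rather than gradient descent, there is almost nothing to tune, which matches the claim that PCANet needs little or no parameter fine-tuning.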

On the other hand, this method has two problems: redundancy in data processing and inaccuracy when dealing with imbalanced samples. Regarding the former, owing to the nature of PCA the eigenvalues tend to be inflated, which causes redundancy in the complex feature maps. Regarding the latter, PCANet may not work well with imbalanced samples because the classification task assumes that the class distribution is balanced and that the dataset is large.

The aim of this study

In this study, the authors propose CHDNet to solve these problems and extract appropriate, comprehensive features from tongue images. The framework learns useful features from unlabeled clinical data and then uses the obtained features to learn, with supervision, how to separate a patient's health state into normal and abnormal.

To find feature representations of normal and abnormal tongue images, the proposed method relies on four key elements: nonlinear transformation, multiscale feature analysis, high dispersal, and local response normalization. It can provide robust feature representations for predicting health status even when the class distribution is skewed.


Proposal Overview

For each image, the background was removed, the tongue body was extracted, and CHDNet was applied to learn the features of normal and abnormal tongue bodies (Fig. 1). The extracted images were normalized to a fixed height and width.

After that, the tongue images were divided into training and test sets, and convolutional kernels were trained to generate feature representations. The tongue samples were classified as normal or abnormal. The feature representations of the entire tongue images were passed to a classifier and evaluated with k-fold cross-validation: the feature set was split into k subsets, the classifier was trained on k-1 of them, and its performance was evaluated on the remaining one. This process was repeated k times so that each subset served as the test set exactly once, and the final result was the average over the k rounds.
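The k-fold procedure can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions: the helper names, the toy majority-class "classifier", and the toy data are hypothetical and stand in for the CHDNet features and the SVM used in the study.

```python
import random

def k_fold_indices(num_samples, k, seed=0):
    """Split sample indices into k disjoint folds of near-equal size."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(features, labels, k, train_fn, score_fn):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    folds = k_fold_indices(len(features), k)
    scores = []
    for test_fold in folds:
        held_out = set(test_fold)
        train_idx = [i for i in range(len(features)) if i not in held_out]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [features[i] for i in test_fold],
                               [labels[i] for i in test_fold]))
    return sum(scores) / k

def train_majority(X, y):
    """Toy stand-in for a classifier: always predict the most frequent label."""
    return max(set(y), key=y.count)

def accuracy(model, X, y):
    return sum(1 for label in y if label == model) / len(y)

toy_features = [[float(i)] for i in range(20)]
toy_labels = [1] * 15 + [0] * 5      # deliberately imbalanced toy labels
mean_accuracy = cross_validate(toy_features, toy_labels, 5, train_majority, accuracy)
```

On imbalanced data like this, a classifier that always predicts the majority class still scores well on accuracy, which is one reason the study also reports sensitivity and specificity.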

The proposal has four important components. (1) High dispersal: with the high dispersal operation, the features in each feature map become dispersed without redundancy. (2) Local response normalization: even after high dispersal, features at the same position in different feature maps are still redundant, and local response normalization addresses this. (3) Nonlinear transformation layer: because principal component analysis is a linear method, redundancy lowers accuracy, especially when the features are used for anomaly detection on imbalanced data, so a nonlinear transformation is introduced for more precise feature extraction. (4) Multiscale feature analysis: to improve robustness to deformation, multiscale feature analysis is applied before high dispersal and local response normalization.
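As an illustration of normalizing across feature maps at the same position, the following sketch implements the widely used AlexNet-style form of local response normalization. The exact formula and constants in CHDNet may differ; the function name and default values here are assumptions.

```python
import numpy as np

def local_response_normalization(maps, size=5, alpha=1e-4, beta=0.75, bias=2.0):
    """Divide each value by the energy of neighboring feature maps at the same pixel.

    maps has shape (num_maps, height, width). The defaults follow the
    AlexNet convention and are placeholders, not the paper's values.
    """
    maps = np.asarray(maps, dtype=np.float64)
    num_maps = maps.shape[0]
    squared = maps ** 2
    out = np.empty_like(maps)
    half = size // 2
    for i in range(num_maps):
        lo, hi = max(0, i - half), min(num_maps, i + half + 1)
        # Sum of squares over adjacent maps, taken position by position.
        energy = squared[lo:hi].sum(axis=0)
        out[i] = maps[i] / (bias + alpha * energy) ** beta
    return out

toy_maps = np.ones((3, 2, 2))
normalized = local_response_normalization(toy_maps, size=3, alpha=1.0, beta=1.0, bias=1.0)
```

A response that is strong in several neighboring maps at the same location is scaled down more than an isolated one, which is how this kind of normalization discourages redundant features across maps.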


CHDNet is composed of three components: a PCA filter convolution layer, a nonlinear transformation layer, and a feature pooling layer (Fig. 2).

Nonlinear transformation


This layer reduces the redundancy that arises from PCA-based feature extraction by adding a nonlinear transformation to the batch transformation and PCA transformation used in earlier work. Within the PCA process, a nonlinearity is applied to each image to compensate for the coarseness of the linear transformation, using the following equation.

where T is an image, C1 is the first principal component, and a and ε are hyperparameters.

In addition, because tanh(x) is used in the feature convolution layer, negative values appear in its output, which conflicts with the principle of the visual system. Adding a nonlinear transformation layer after each convolution layer therefore has the effect of treating such negative values as noise.
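The point about negative responses is easy to see numerically. The sketch below shows that tanh produces negative outputs for negative inputs and that a simple rectification sets them to zero. The rectification used here is a ReLU-style clip chosen only for illustration; it is not the paper's nonlinear transformation, and the function names are hypothetical.

```python
import math

def tanh_responses(values):
    """Apply tanh element-wise, as a convolution layer's activation would."""
    return [math.tanh(v) for v in values]

def suppress_negatives(values):
    """Clip negative responses to zero (ReLU-style); NOT the paper's transformation."""
    return [max(0.0, v) for v in values]

raw = [-2.0, -0.5, 0.0, 0.5, 2.0]
activated = tanh_responses(raw)          # contains negative values
rectified = suppress_negatives(activated)  # negatives are treated as noise and removed
```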

Feature Pooling

In addition to the nonlinear transformation described above, this layer performs the following steps. Histogram: pixel values are converted into integers in the range [0, 255]. Multi-scale feature analysis: histograms are computed over the image at several resolutions and aggregated into a single feature. High dispersal: degenerate situations are prevented and competition between features is enforced. Local response normalization: features at the same position in different feature maps are normalized to prevent redundancy. By applying this series of steps to the input image, the features of normal and abnormal tongues become more prominent than with conventional methods. See the paper for details.
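One common way to realize histogram pooling at several resolutions is a spatial-pyramid histogram: the integer-valued map is divided into grids of increasing fineness, a histogram is computed per cell, and all histograms are concatenated. The sketch below follows that generic scheme; the grid levels, the function name, and the toy input are assumptions, and the paper's exact pooling may differ.

```python
import numpy as np

def multiscale_histogram(feature_map, levels=(1, 2, 4), bins=256):
    """Concatenate per-cell histograms computed at several grid resolutions.

    feature_map holds integers in [0, bins - 1]; each entry of levels is the
    number of cells per side at that scale (illustrative values).
    """
    feature_map = np.asarray(feature_map)
    h, w = feature_map.shape
    parts = []
    for n in levels:
        # Cell boundaries for an n x n grid over the map.
        rows = np.linspace(0, h, n + 1, dtype=int)
        cols = np.linspace(0, w, n + 1, dtype=int)
        for r in range(n):
            for c in range(n):
                cell = feature_map[rows[r]:rows[r + 1], cols[c]:cols[c + 1]]
                parts.append(np.bincount(cell.ravel(), minlength=bins))
    return np.concatenate(parts)

rng = np.random.default_rng(0)
toy_map = rng.integers(0, 256, size=(8, 8))   # stand-in for a [0, 255] feature map
descriptor = multiscale_histogram(toy_map)
```

The coarse level captures what values occur anywhere in the image, while the finer levels record roughly where they occur, which is what gives such descriptors some tolerance to deformation.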

Experiment setup

A total of 315 tongue images (267 from gastritis patients and 48 from healthy volunteers) were collected from hospitals. In the training phase of the feature extraction step, 40 normal and 44 abnormal images (84 in total, about 26.67% of all images) were randomly selected as the training set and used to learn the convolution kernels and determine the parameters. The learned kernels and parameters were then used to extract features for the remaining 231 samples. The reported results are averages over 10 rounds of 5-fold cross-validation. Several evaluation indices (accuracy, sensitivity, specificity, precision, and recall) were used to evaluate the performance of the proposed and conventional methods.
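For reference, the evaluation indices listed above can all be derived from the four cells of a binary confusion matrix. The sketch below assumes that "abnormal" is treated as the positive class, and the counts in the example are hypothetical, chosen only to show the calculation; they are not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary metrics, treating 'abnormal' as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # identical to recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts for illustration only; these are not results from the paper.
example = classification_metrics(tp=45, fp=5, tn=40, fn=10)
```

With a small normal class, specificity is computed from few samples, so it is the metric most sensitive to a handful of misclassified healthy volunteers.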


Comparison to conventional methods

The aim of this evaluation was to clarify whether the proposal improves performance compared to the conventional method, PCANet. In this evaluation, LIBLINEAR SVM was used as the classifier.

The results confirmed that the proposal, which combines High Dispersal (HD), Local Response Normalization (LRN), Multiscale Feature Analysis (MFA), and Nonlinear Transformation (NT), improved the recognition rate over PCANet from 84.77% to 91.14% (Table 1). In addition, with respect to sample imbalance, the proposal improved specificity while slightly decreasing sensitivity.
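Class imbalance of this kind (267 abnormal versus 48 normal samples) is often handled by giving the minority class a larger weight in the classifier's loss, and the article's conclusion refers to a weighted LIBLINEAR SVM. The weights actually used in the paper are not given here, so the sketch below uses a common inverse-frequency heuristic, n_samples / (n_classes * class_count), purely as an assumption.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {label: total / (num_classes * count)
            for label, count in class_counts.items()}

# Class sizes as reported in the experiment setup.
weights = balanced_class_weights({"abnormal": 267, "normal": 48})
```

Under this heuristic, each misclassified normal sample costs several times more than a misclassified abnormal one, which pushes the decision boundary toward better specificity.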

Comparison among classifiers

The aim of this evaluation was to determine which classifier performs best at detecting abnormal images.

The performance of CHDNet with LIBLINEAR SVM was compared with that of other classifiers: LDA, KNN, CART, GBDT, and RF. The authors used LIBLINEAR SVM instead of LIBSVM because it performs better when the number of samples is much smaller than the number of features. Here there were 315 samples and 43008 features per sample, which suggests that LIBLINEAR SVM is the better choice.

In terms of precision, specificity, accuracy, recall, and F1 score, LIBLINEAR SVM had the best overall performance among the six classifiers (Table 2): its accuracy was 91.14%, 6.24% higher than that of LDA and the best of all the classifiers. In addition, compared with the distance-based and tree-structured models, specificity improved by between 3% and 25%. This comparison shows that the SVM classifier with optimal parameters outperformed the other methods.


Tongue images are one of the diagnostic criteria in Oriental medicine and have the potential to capture the condition of the whole body and identify physical ailments non-invasively. However, conventional models suffer from redundancy and low detection accuracy on imbalanced samples, especially when detecting abnormal images. In this study, the authors proposed CHDNet, a model based on high dispersal, to extract appropriate features for detecting abnormal tongue images. The evaluation suggests that this model performs better than conventional models.

Some questions remain about the design of the proposal. First, it is unclear why a linear SVM was used as the classifier. SVMs are commonly used with an RBF kernel, while a linear kernel is chosen when fast computation is needed for very large amounts of data. The authors explain that the number of features was much larger than the number of samples, and the results showed that the linear SVM was more accurate than the SVM with an RBF kernel. However, the cause of this result was not discussed and remains unclear.

Second, the stated reason for introducing the nonlinear transformation, which is unique to the proposed method, was the presence of negative values produced by tanh(x) in the convolution layer, but it is unclear whether alternatives such as the ReLU activation function were considered. In image analysis, negative values can be suppressed as noise with an activation function such as ReLU, yet the paper does not say why this was not used or could not be introduced. The paper should state in what respect the nonlinear transformation presented here is preferable to ReLU.


In this paper, a new framework for tongue image classification using an unsupervised feature learning method was proposed. A weighted LIBLINEAR SVM classifier was trained on the features learned by CHDNet to detect abnormal patients. Experiments showed that the combination of the new framework and the weighted LIBLINEAR SVM had the best predictive performance among the compared methods.
