Tongue Image Diagnosis By Deep Learning: Understanding Systemic Disorders From The Tongue! Part2


3 main points.
✔️ Tongue diagnosis plays a pivotal role in traditional Chinese medicine, but it is subjective, and diagnosing a tongue with tooth marks is a particularly difficult challenge.
✔️ To recognize tooth marks on the tongue using a CNN, the authors build a dataset of 1548 tongue images captured by different devices and propose a model that extracts features using ResNet.
✔️ The proposed model raises overall accuracy from less than 80% in earlier work to more than 90%.

Artificial intelligence in tongue diagnosis: Using deep convolutional neural network for recognizing unhealthy tongue with tooth-mark
written by Xu Wang, Jingwei Liu, Chaoyong Wu, Junhong Liu, Qianqian Li, Yufeng Chen, Xinrong Wang, Xinli Chen, Xiaohan Pang, Binglong Chang, Jiaying Lin, Shifeng Zhao, Zhihong Li, Qingqiong Deng, Yi Lu, Dongbin Zhao, Jianxin Chen
(Submitted on 8 April 2020)
Comments: Accepted to Computational and Structural Biotechnology Journal.

Subjects: CNN (cs: CNN) 


Can deep learning go beyond the subjectivity of diagnosis, a challenge for Oriental medicine?

This paper focuses on tongue diagnosis, one of the diagnostic methods of Oriental medicine, a field that includes acupuncture and moxibustion therapy and Chinese medicine. Tongue diagnosis has been used in China since ancient times as a form of visual inspection, and it is considered useful for understanding not only a specific disease but also disorders of the whole body, and for treating the root of a disease rather than only its symptoms. On the other hand, the method relies heavily on the doctor's experience and subjective judgment, which makes it difficult for inexperienced doctors and for people without the necessary knowledge to practice.

In this work, we propose a CNN-based architecture to address these challenges. While most of the previous research methods have been reported to have an accuracy of 80% or lower, our proposed method achieves an accuracy of 90% or higher by using ResNet-based feature extraction and a large dataset of tongue images. In this article, we will give an overview of the proposed method.

What is Oriental Medicine?

First of all, let me briefly describe Oriental medicine.

Oriental medicine is a traditional medicine with a history of about 2000 years, which is a way of thinking and treatment born in ancient China. While Western medicine uses medication and surgery to directly approach the bad parts of the body, Oriental medicine mainly treats the body's problems from the inside to cure them fundamentally. In addition, Western medicine can treat illnesses in a short time, while Oriental medicine treatment takes time but is less burdensome to the body, and is often used to treat and improve intractable illnesses because it aims to cure the root of the disease rather than symptomatic treatment. The use of herbal medicines, Chinese herbs, acupuncture, and moxibustion is also a characteristic of this type of treatment, and the often-heard term "acupuncture points" is also one of the concepts of Oriental medicine.

In Oriental medicine, the basic idea is to diagnose the whole body, not just specific organs, and the organs are considered to be related to each other. Based on this idea, "qi, blood, and water" are the elements used to check the state of health. "Qi" refers to the energy required to carry out vital activities, "blood" refers to blood, and "water" refers to body fluids other than blood, such as lymph and sweat. A healthy state is one in which "qi, blood, and water" circulate smoothly within the body without excess or deficiency. They are also believed to influence each other, so if one of them is abnormal, the whole balance is upset. In addition, qi is considered to be the source of life force, and as the saying goes, "illness begins with qi", so managing qi first is considered of utmost importance.

What are the five organs?

The five organs have the function of circulating qi, blood, and water as mentioned above, and are composed of the liver, heart, spleen, lung, and kidney. These organs are different from the organs in Western medicine (some of them overlap).

As for their functions, the liver covers "storage of blood, control of the autonomic nervous system, liver and gall bladder", the heart covers "circulation of blood, regulation of sleep rhythm, heart", the spleen covers "metabolism, supply of nutrition to muscles, the digestive system", the lung covers "circulation of qi to the whole body, metabolism of skin and water, the respiratory system", and the kidney covers "growth, development, reproduction, aging, urinary system such as kidney and bladder". It may be easiest to picture each of them as a functional system, such as "the urinary system" for the kidney, rather than as a single organ. It is believed that by regulating these systems and circulating and maintaining qi, we can maintain a healthy state for a long time and have a long life.

The six internal organs are also like the children of these five organs and consist of the gall, small intestine, stomach, large intestine, bladder, and sanjiao. Each of these has a parent organ. The details will be described on another occasion.

What is tongue diagnosis?

Tongue diagnosis (a method of diagnosis using the tongue) is one of the diagnostic methods of Oriental medicine and is used to diagnose diseases based on the shape and color of the tongue. The characteristics of the tongue are said to reflect the internal health of the body (internal organs, qi, blood, cold, heat, etc.) and the severity and progress of the disease, and by observing these conditions, appropriate treatment can be selected.

However, traditional tongue diagnosis relies on the practitioner's subjective observations, which are often biased by personal experience and by changes in ambient lighting. Therefore, there is a need for an objective and quantitative method of tongue diagnosis that can aid practitioners.

In particular, one of the most important features, the tooth mark, is an indentation formed where the tongue body is compressed by the adjacent teeth. According to TCM theory, tooth marks are often associated with spleen deficiency, yang deficiency with cold and dampness, phlegm and stagnant fluid, and blood stasis. In addition, micro-level changes in a tooth-marked tongue include impaired blood supply, local hypoxia, and tissue edema. Clinical symptoms in people with a tooth-marked tongue include loss of appetite, abdominal pain, gastric distention, and loose stools. While diagnosing a tooth-marked tongue plays an important role in differentiating symptoms and selecting treatment, recognizing one is difficult for TCM experts, because tooth-marked tongues vary in appearance (e.g., color and shape) and the judgment is subjective, as mentioned above. Deep learning is therefore being introduced to lower the barriers to diagnosis caused by this subjectivity.

Previous research on tooth mark identification

To overcome the barriers to diagnosis caused by the subjective side of tooth mark identification, computer models for recognizing tooth-marked tongues have been proposed using image processing, statistics, and machine learning techniques. These studies focus on the local color and unevenness features of the tooth-mark region. More recently, convolutional neural network (CNN) models, which automatically extract high-level semantic features and perform well in many image classification tasks, have gradually been applied to the classification of tooth-marked tongues.

These studies have achieved a great deal in the automatic recognition of tooth-marked tongues, but important challenges remain. The accuracy of many of the models is less than 80%. The datasets come from a single device, so it is unknown whether the models generalize to tongue images captured by other devices. The sample size of the datasets is small (i.e., 645). Finally, the models are trained and tested only on tongue region images isolated from the raw data, so the effects of the face and surrounding areas are not taken into account.

Purpose of this study

In this work, we address these challenges on two fronts: the dataset and deeper feature extraction in the model. Specifically, we use more than 1500 tongue images captured by different devices to build a larger dataset of tooth-marked tongues, and we label the tongue region in each image to construct a tongue region image dataset. Next, to take full advantage of deep learning, we use a CNN model with deeper layers to extract features and perform classification.


Data set

To build a stable tongue image dataset, tongue images were acquired using standard instruments designed by Shanghai Daosh Medical Technology Ltd (DS01-B) and Shanghai Xieyang Intelligent Technology Ltd (XYSM01). The labeling procedure for this study was as follows. First, three experts clarified the diagnostic criteria for a tooth-marked tongue, and one expert classified all 1548 images as "with tooth marks" or "without tooth marks". Then, the two other experts each checked the labeling results. In case of disagreement, the three experts discussed the image and made a final decision. As a result, the dataset consists of 672 images of tongues with tooth marks and 876 images of tongues without tooth marks. Additionally, we manually labeled the tongue region in each raw tongue image, in order to improve the performance of the model by suppressing the influence of irrelevant facial parts and background around the tongue. Two datasets were thus constructed: the raw tongue image dataset and the tongue region image dataset.
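As a quick check on these figures, the short sketch below recomputes the class balance of the dataset. The variable names are illustrative and not from the paper, and the majority-class baseline (the accuracy of always predicting "without tooth marks") is an added reference point, not a result the authors report.

```python
# Class counts as stated in the article.
n_with_marks = 672
n_without_marks = 876

total = n_with_marks + n_without_marks
share_with_marks = n_with_marks / total

# Assumption for illustration: a trivial classifier that always predicts
# the larger class. Its accuracy is a floor any useful model should beat.
majority_baseline = max(n_with_marks, n_without_marks) / total

print(f"total images: {total}")
print(f"share with tooth marks: {share_with_marks:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

The two classes are moderately imbalanced, which is worth keeping in mind when reading the sensitivity and specificity figures later in the article.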


In this study, we use a typical ResNet architecture with 34 layers (ResNet34) to classify tongue images (Fig. 2). CNNs become harder to train as depth increases, and their training error can grow. ResNet outperforms traditional network models by keeping the network robust to the vanishing-gradient and degradation problems caused by network depth. A Rectified Linear Unit (ReLU) is used as the activation function after each convolutional layer.
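The reason ResNet tolerates depth can be written compactly. The formula below is the standard residual-block formulation from the original ResNet paper; it is not restated in this article and is included here only as background. Each block learns a correction F that is added to its own input through a shortcut connection:

```latex
\[
  \mathbf{y} = \mathcal{F}\bigl(\mathbf{x}, \{W_i\}\bigr) + \mathbf{x}
\]
```

Here x and y are the input and output of the block, and the W_i are the weights of its convolutional layers. Because the identity term passes the signal and its gradient straight through, stacking many such blocks is less prone to the degradation problem that affects plain deep networks.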

Learning and Evaluation

The network is initialized with weights pretrained on the ImageNet dataset. Since the resolution of the tongue images varies from device to device, all images were randomly resized and cropped to 416 × 416 pixels and augmented with horizontal flipping before training. We then fine-tuned the network for 40 epochs with a batch size of 16, using Stochastic Gradient Descent (SGD) with a learning rate of 0.001 and a momentum of 0.9 as the optimizer. For testing, the input images were resized to 420 × 420 pixels.
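To make these settings concrete, here is a minimal, framework-free sketch that collects the stated hyperparameters and shows what one SGD-with-momentum update looks like. The names `FineTuneConfig`, `updates_per_epoch`, and `sgd_momentum_step` are hypothetical, the update rule follows the common convention without dampening (an assumption, since the article does not say which variant was used), and keeping the last partial batch is also an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    # Values stated in the article.
    epochs: int = 40
    batch_size: int = 16
    learning_rate: float = 0.001
    momentum: float = 0.9
    train_side: int = 416  # side length of training crops, in pixels
    test_side: int = 420   # side length of resized test images, in pixels


def updates_per_epoch(n_train_images: int, cfg: FineTuneConfig) -> int:
    """Mini-batches per epoch, assuming the last partial batch is kept."""
    return math.ceil(n_train_images / cfg.batch_size)


def sgd_momentum_step(weight, velocity, grad, cfg):
    """One SGD-with-momentum update for a single scalar parameter.

    Assumed convention:
        velocity <- momentum * velocity + grad
        weight   <- weight - learning_rate * velocity
    """
    velocity = cfg.momentum * velocity + grad
    weight = weight - cfg.learning_rate * velocity
    return weight, velocity


cfg = FineTuneConfig()

# Roughly four of five folds are used for training in each round.
n_train = 1548 * 4 // 5
total_updates = updates_per_epoch(n_train, cfg) * cfg.epochs
print(f"{n_train} training images -> {total_updates} weight updates")

# Toy example: two updates on f(w) = w**2, whose gradient is 2 * w.
w, v = 1.0, 0.0
for _ in range(2):
    w, v = sgd_momentum_step(w, v, 2 * w, cfg)
print(f"w after two steps: {w:.6f}")
```

With a learning rate this small, each update moves the pretrained weights only slightly, which suits fine-tuning from ImageNet weights on a dataset of this size.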

The accuracy, sensitivity, and specificity are used to evaluate the performance of the model. We also use k-fold cross-validation, which is considered to be robust and unbiased. The general procedure is as follows: 1) divide the data randomly into k subsets; 2) hold out one subset and train the model on all the others; 3) test the model on the held-out subset and record the evaluation metrics; 4) repeat the process until each of the k subsets has served as the test dataset; 5) summarize the performance by computing the mean and variance of the evaluation metrics over the k models. In this study, we use k=5: the 1548 tongue images are randomly shuffled and divided into 5 subsets, with 4 subsets used for training and the remaining one used for testing in each round. We then calculate the mean and standard deviation (SD) of the accuracy, sensitivity, and specificity of the five models.
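The procedure can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, `train_and_predict` stands in for the actual model training, labels use 1 for "with tooth marks" and 0 for "without", and the SD is the sample standard deviation (the article does not say which form was used).

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        folds.append(indices[start:start + size])
        start += size
    return folds


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for 0/1 labels.

    Assumes both classes occur in y_true, so no denominator is zero.
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # share of tooth-marked tongues found
        "specificity": tn / (tn + fp),  # share of unmarked tongues recognized
    }


def cross_validate(labels, train_and_predict, k=5, seed=0):
    """Run k-fold cross-validation and return {metric: (mean, SD)}.

    `train_and_predict(train_idx, test_idx)` is a hypothetical callback that
    trains a fresh model on `train_idx` and returns 0/1 predictions for
    `test_idx`, in the same order.
    """
    folds = k_fold_indices(len(labels), k, seed)
    per_fold = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        y_pred = train_and_predict(train_idx, test_idx)
        y_true = [labels[j] for j in test_idx]
        per_fold.append(binary_metrics(y_true, y_pred))
    return {
        name: (mean(m[name] for m in per_fold), stdev(m[name] for m in per_fold))
        for name in per_fold[0]
    }
```

With 1548 images and k=5, the folds hold 310, 310, 310, 309, and 309 images, so each model is trained on about 1238 images and tested on about 310.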

Verification Method

To evaluate the robustness of the model, it is tested on a new test dataset of 50 tongue images taken with an ordinary camera under various lighting conditions. The set contains 27 images of tongues with tooth marks and 23 images of tongues without tooth marks. In addition, VGG16, proposed by the Visual Geometry Group of the University of Oxford, is used for comparison experiments. Because the input image size for this model is 416 × 416, an adaptive average pooling layer with an output size of 7 × 7 is applied before the fully connected (FC) layers. The training parameters are set in the same way as for ResNet34 described above. These models are also compared with a previous study (Sun et al.). Finally, Gradient-weighted Class Activation Mapping (Grad-CAM) is used to visualize the regions most indicative of a tooth-marked tongue and thus the model's decision criteria. Grad-CAM visualizes, as a heat map, the decision criteria in the convolutional layers of a CNN, which helps to address the black-box problem that has long troubled deep learning.
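To see why VGG16 needs the adaptive pooling layer at this input size, the sketch below traces the spatial size of the feature map. It assumes the standard VGG16 layout (3 × 3 convolutions with padding 1, which preserve size, five 2 × 2 max-pooling layers, 512 output channels, and a first FC layer sized for a 512 × 7 × 7 input); the function name is hypothetical.

```python
def vgg16_feature_side(input_side, n_pools=5):
    """Side length of the final VGG16 feature map for a square input.

    Only the five 2x2 max-pooling layers change the spatial size;
    each one halves it (rounding down).
    """
    side = input_side
    for _ in range(n_pools):
        side //= 2
    return side


CHANNELS = 512  # output channels of the last VGG16 convolutional block

for side_in in (224, 416):
    side_out = vgg16_feature_side(side_in)
    flat = CHANNELS * side_out * side_out
    print(f"{side_in}x{side_in} input -> {side_out}x{side_out} map, {flat} features")

# The first FC layer expects the flattened 7x7 map.
print(f"FC input size: {CHANNELS * 7 * 7}")
```

A 224 × 224 input already yields a 7 × 7 map, but a 416 × 416 input yields 13 × 13, so pooling it down to 7 × 7 lets the unchanged FC layers accept the larger images.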


Test data validation

The purpose of this evaluation is to check the robustness of the proposed method by evaluating its estimation performance on tongue image data captured by a normal camera.

The new raw tongue image dataset consists of 50 tongue images, and the new tongue region image dataset consists of the 50 tongue regions manually separated from those raw images. Although the images in this test dataset were captured with a camera under various lighting conditions, the authors report that the overall accuracy exceeds 85.00%, suggesting that the proposed method can be extended and generalized to images captured under different lighting conditions.

Comparison with VGG16 architecture

The purpose of this evaluation is to investigate how changes in the CNN architecture affect the accuracy of the estimation.

VGG16 was used as the comparison model, and the results are shown below. On the raw tongue image dataset and the tongue region image dataset, its average accuracy under 5-fold cross-validation was 89.40% and 90.96%, respectively. Hence, we confirm that ResNet improves the classification accuracy for tooth-marked tongues by 1.10 percentage points on raw tongue images and by 0.52 percentage points on tongue region images.


Comparison with related studies

The purpose of this evaluation is to compare the model accuracy in previous studies.

From the table below, the average accuracy of the previous method is 70.61% on the raw tongue image dataset and 71.77% on the tongue region image dataset, almost 20 percentage points lower than the proposed method. The authors also state that the difference in input image size between the previous study and the proposed method does not significantly affect the classification results. Based on these results, they report that the CNN models used here, ResNet34 and VGG16, improve the classification accuracy by about 20 percentage points.

Evaluation by Grad-CAM

This analysis is done in order to clarify which part of the input image the proposed model is focusing on for classification. From the following figure, the tooth mark region is highlighted by Grad-CAM, which shows that the classification model is paying proper attention to the region of interest.
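For readers unfamiliar with how the heat map is produced, the standard Grad-CAM definition from the original Grad-CAM paper is given below as background. The article does not say which convolutional layer the authors used, so the layer choice is left open. For a class score y^c and the k-th feature map A^k of the chosen layer, with Z spatial positions:

```latex
\[
  \alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}},
  \qquad
  L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Bigl( \sum_k \alpha_k^c A^k \Bigr)
\]
```

The weight for each feature map is the spatially averaged gradient of the class score, and the ReLU keeps only the regions that raise the score for that class, which is why the tooth-mark region lights up when the model predicts "with tooth marks".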


While tooth marks on the tongue are important indicators in TCM diagnosis, standardizing their recognition has been a difficult problem because it depends on the experience and subjective judgment of the diagnosing physician. This paper therefore proposes to recognize them from tongue images using deep learning techniques.

In this study, we propose a framework for recognizing tongues with tooth marks. First, 1548 raw tongue images were captured by various devices and labeled as 672 images with tooth marks and 876 images without tooth marks, and the tongue region in each image was labeled to create a tongue region image dataset. The features were then extracted and classified using the ResNet34 CNN model, and the overall accuracy of the proposed model exceeded 90%. Interestingly, the model also performed well on images captured by another device under different lighting, suggesting that the proposed method improves substantially on the accuracy of previous methods and remains valid for images from different sources.

On the other hand, the following issues remain. In the reported evaluation, the specificity is higher than the sensitivity, which may be due to the unequal numbers of positive and negative samples. The model is also more accurate on the tongue region image dataset than on the raw images, which implies that the tongue must first be segmented and that the accuracy may vary with the segmentation algorithm. In addition, the dataset in this study was constructed through careful confirmation and diagnosis by experts, so building such a dataset may be costly.

Despite these shortcomings, our results suggest that CNNs are effective for tongue image analysis, and contain important insights for future generalization.
