[Kaggler Must See] The PANDA Challenge, The World's Largest Prostate Cancer Diagnostic Competition!
3 main points
✔️ The PANDA Challenge is the world's largest histopathology competition, with approximately 13,000 histopathology images collected from six institutions in Europe and the United States, and approximately 1,300 participants from 65 countries.
✔️ This study is an unprecedented effort among medical AI papers: multiple research teams built machine learning models on the same dataset, and the submitted models were then independently validated.
✔️ The submitted algorithms were essentially similar in approach, and the top models achieved diagnostic accuracy equal to or better than that of pathologists, as well as strong, generalizable performance on external validation data.
Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge
written by Wouter Bulten, Kimmo Kartasalo, et al.
(Submitted on 13 Jan 2022)
Comments: Nature Medicine
The images used in this article are from the paper, the introductory slides, or were created based on them.
Abstract
Artificial intelligence (AI) research in the medical field has so far tended to focus on specific, individual results and has rarely involved multiple research teams building machine learning (ML) models for the same dataset (e.g., tissue samples). In this study, we held the Prostate cANcer graDe Assessment (PANDA) Challenge, an international medical imaging competition using prostate cancer biopsy tissue as the dataset, and evaluated and analyzed the machine learning models submitted to it.
Among the submitted algorithms, we selected the models with the highest diagnostic accuracy and found that each took a fundamentally similar approach, albeit with subtle differences. Moreover, because the evaluation datasets consisted of samples obtained from different medical facilities, the top-ranked models proved to be generalizable rather than tied to a single institution. The top models all showed a diagnostic concordance of approximately 86% with the specialists, and further clinical validation is expected in the future.
Main
Gleason grading is a histopathological classification system for prostate cancer that is necessary for treatment planning. Pathologists grade tumors based on the histomorphological characteristics of the tumor tissue, but this assessment is subjective and is known to vary among pathologists.
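For orientation, the grades used throughout this article are ISUP grade groups, which are derived from a biopsy's Gleason patterns. The short sketch below encodes the standard ISUP 2014 mapping (written for this article, not taken from the paper's code), with grade 0 used for benign tissue as in the PANDA labels.

```python
# Standard ISUP 2014 grade grouping (illustrative helper, not from the paper):
# a biopsy's Gleason score (primary + secondary pattern) maps to one of five
# grade groups; grade 0 is used for benign tissue, as in the PANDA labels.
def isup_grade(primary: int, secondary: int) -> int:
    if primary == 0 and secondary == 0:
        return 0                          # benign / no cancer
    total = primary + secondary
    if total <= 6:
        return 1                          # Gleason 3+3 or lower
    if total == 7:
        return 2 if primary == 3 else 3   # 3+4 -> group 2, 4+3 -> group 3
    if total == 8:
        return 4                          # 4+4, 3+5, 5+3
    return 5                              # Gleason 9-10

assert isup_grade(3, 4) == 2 and isup_grade(4, 3) == 3
```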
AI-based grading has been proposed as a way to reduce this variability. However, AI development is reportedly susceptible to various biases: annotation must be performed by humans (pathologists), and these annotators cannot always label specimens from other institutions consistently. As a result, an AI may perform well only on data from the medical institution where it was developed and may be less accurate under other conditions. It has also been pointed out that medical AI development is essentially a closed environment in which positive bias can easily creep in, since well-chosen images can be provided and experienced specialists can advise the developers directly.
In this study, the above problems are avoided by developing the algorithms in a competition format. Specifically, validation is carried out by people other than the algorithm developers, and additional validation is performed on datasets collected from different facilities. This makes it possible to determine whether an algorithm truly generalizes.
The datasets used in the competition were previously published prostate biopsy datasets and data from medical institutions in Europe (EU). On this basis, the PANDA Challenge was held, and the top models were reproduced by the research team. The reproduced models were then validated in an environment independent of the developers, using a dataset from US medical institutions and an EU dataset different from the one used in the competition. The results were compared with pathologists' diagnoses to provide a true assessment of each algorithm.
Results
Dataset Features
A total of 12,625 whole slide images (WSI) were collected from six medical institutions for algorithm development, tuning, and external validation (Table 1).
Table 1 gives a breakdown of the datasets: the development set and the tuning set are the two datasets available to competition participants, with the tuning set used for algorithm evaluation during the competition. The competition ranking was determined on the internal validation set, and generalization performance was then further assessed on the external validation sets. The source institution of each set is listed in the "Source" row. Note that neither the developers nor the internal validators were involved in collecting the external validation data.
Dataset Reference Criteria
Labels for the Dutch training dataset were determined by reference to existing pathology reports, and the Swedish training dataset was annotated by a single uropathologist. For the Dutch internal validation data, the reference label was determined by agreement among three uropathologists (from two medical centers) with 18-28 years of experience. The remaining Swedish dataset was annotated by four uropathologists with more than 25 years of clinical experience.
The US external validation dataset was collected from 6 institutions in the US or Canada and labeled by majority vote of uropathologists with 18-34 years of clinical experience; immunohistochemical staining was also used on these cases for a more accurate diagnosis. The EU external validation data were annotated by a single uropathologist. To investigate the level of agreement between continents (EU and US), EU specialists graded the US data and vice versa, and a high agreement rate was found (Note: Supplementary Table 9 is cited for the agreement rates, but it was not accessible at the time of writing this article).
Competition Summary
The competition was open for participants from April 21 to July 23, 2020, and was held on Kaggle, with 1,010 teams consisting of 1,290 participants from 65 countries (Figure 1).
During the competition, each team was able to request an evaluation of their algorithm using the tuning dataset.
Ultimately, a total of 34,262 algorithm submissions were made across all teams. Validation on the internal validation dataset showed that the first team to exceed 90% diagnostic agreement with the uropathologist reference appeared within 10 days of the start of the competition, and by day 33 the median diagnostic agreement across all teams was above 85%.
Summary of algorithms under evaluation
After the competition, participants were invited to join the PANDA consortium for external validation; 33 teams proceeded to the subsequent validation phase, and 15 teams were selected based on their model performance and algorithm descriptions. Seven of these teams were also ranked in the top 30 of the competition.
All of the selected algorithms used deep learning. Most of the top teams took an approach that divides each WSI into small patches: the patches are fed into a CNN, features are extracted, and the diagnosis is determined by a final classification layer.
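As a rough illustration of this patch-based pipeline, here is a minimal PyTorch sketch under assumptions of my own; the tile count, backbone, and pooling strategy are illustrative, not any team's actual configuration.

```python
# Minimal sketch: tile the WSI, encode each tile with a CNN, pool the tile
# features, and classify the pooled representation into an ISUP grade (0-5).
import torch
import torch.nn as nn
import torchvision.models as models

class PatchGradingModel(nn.Module):
    def __init__(self, n_classes: int = 6):
        super().__init__()
        backbone = models.resnet18(weights=None)                        # tile-level feature extractor
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])   # drop the FC head
        self.classifier = nn.Linear(512, n_classes)                     # slide-level grading head

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (batch, n_tiles, 3, H, W), tiles already cut from the slide
        b, n, c, h, w = tiles.shape
        feats = self.encoder(tiles.view(b * n, c, h, w)).flatten(1)     # (b*n, 512)
        pooled = feats.view(b, n, -1).mean(dim=1)                       # average-pool over tiles
        return self.classifier(pooled)                                  # (b, n_classes) logits

# Example: 2 slides, each represented by 16 tiles of 224x224 pixels
logits = PatchGradingModel()(torch.randn(2, 16, 3, 224, 224))
```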
One technique employed by some of the top teams was automated label cleaning. Rather than accepting every training label as ground truth, this technique removes or relabels training samples whose labels appear to be incorrect. Some teams detected images for which the model's predictions differed significantly from the assigned labels and automatically excluded or relabeled them, repeating the process as the model's performance improved.
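A minimal sketch of what such automated label cleaning could look like; the disagreement threshold and the drop-versus-relabel choice below are illustrative assumptions, not any specific team's settings.

```python
# Flag training slides whose predicted grade disagrees strongly with the
# assigned label, then either drop them or relabel them with the prediction.
import numpy as np

def clean_labels(labels: np.ndarray, preds: np.ndarray,
                 max_diff: int = 2, relabel: bool = False):
    """labels, preds: integer ISUP grades (0-5) per training slide."""
    diff = np.abs(preds - labels)
    suspect = diff >= max_diff                 # large disagreement -> suspicious label
    if relabel:
        cleaned = labels.copy()
        cleaned[suspect] = preds[suspect]      # trust the model's prediction instead
        keep = np.ones_like(labels, dtype=bool)
    else:
        cleaned = labels
        keep = ~suspect                        # simply drop suspicious slides
    return cleaned[keep], keep

# Each round: fit the model, predict on the training set, clean, then refit.
labels = np.array([0, 3, 5, 1, 4])
preds = np.array([0, 3, 1, 1, 4])              # the third label looks suspicious
cleaned, keep = clean_labels(labels, preds)
```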
Another feature common to the teams was the use of ensembles combining different algorithms, network architectures, and preprocessing pipelines. Although a wide variety of algorithms were submitted to the competition, most of the top teams achieved comparable performance as a result of ensembling multiple models. The individual algorithms are available for research purposes.
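As a simple illustration of ensembling (an assumed probability-averaging scheme, not a particular team's method), per-model class probabilities can be averaged and the highest-probability grade taken as the final prediction:

```python
# Average each model's class probabilities per slide and take the argmax
# as the final ISUP grade.
import numpy as np

def ensemble_predict(prob_list):
    """prob_list: list of (n_slides, 6) probability arrays, one per model."""
    mean_probs = np.mean(prob_list, axis=0)
    return mean_probs.argmax(axis=1)

# Example with 3 models and 4 slides of random probabilities
grades = ensemble_predict([np.random.dirichlet(np.ones(6), size=4) for _ in range(3)])
```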
Classification performance on internal validation datasets
For internal validation, all of the selected algorithms were reproduced on two different computing platforms. The algorithms showed high agreement (92-94%) with the specialists' diagnoses, and achieved a sensitivity of 99.7% and a specificity of 92.9%.
The figure above plots each algorithm (vertical axis) against its quadratically weighted κ coefficient (horizontal axis). (Note: the quadratically weighted κ is an agreement metric for ordinal grades in which exact matches score highest and disagreements are penalized more heavily the further apart the two grades are.) Most of the selected algorithms agree closely with the specialists' diagnoses.
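The same metric can be computed directly with scikit-learn's cohen_kappa_score by passing quadratic weights; the grades below are made-up examples.

```python
# Quadratically weighted kappa between a reference grading and an algorithm's output.
from sklearn.metrics import cohen_kappa_score

reference = [0, 1, 2, 3, 4, 5, 2, 3]   # uropathologist ISUP grades (example values)
predicted = [0, 1, 2, 3, 5, 5, 2, 2]   # algorithm ISUP grades (example values)
kappa = cohen_kappa_score(reference, predicted, weights="quadratic")
print(f"quadratic weighted kappa = {kappa:.3f}")
```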
In the figure above, panel a shows results on the internal validation dataset, panels b and c on the external validation datasets, and panels d and e compare general pathologists against the expert reference labels. Both the sensitivity and specificity of the algorithms are higher than those of the general pathologists (red).
Classification performance on external validation datasets
The selected algorithms were independently evaluated on the two external validation datasets. Agreement with the expert reference standard (quadratically weighted κ) was 0.868 and 0.862, respectively.
In external validation, the representative algorithm showed a sensitivity of 98.6% and 97.7% on the US and EU sets, respectively. Compared with internal validation, specificity decreased to 75.2% and 84.3% as a result of more false positives.
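For clarity, sensitivity and specificity here refer to the binary tumor-versus-benign decision. The small sketch below (assuming grade 0 means benign and any grade of 1 or higher means tumor, as in the grading scheme above) shows how they can be derived from ISUP grades:

```python
# Sensitivity = tumor slides correctly flagged; specificity = benign slides correctly cleared.
import numpy as np

def sensitivity_specificity(reference, predicted):
    ref_pos = np.asarray(reference) >= 1       # tumor present per the reference
    pred_pos = np.asarray(predicted) >= 1      # tumor predicted by the algorithm
    sensitivity = (ref_pos & pred_pos).sum() / ref_pos.sum()
    specificity = (~ref_pos & ~pred_pos).sum() / (~ref_pos).sum()
    return sensitivity, specificity

# Example with made-up grades
sens, spec = sensitivity_specificity([0, 0, 2, 3, 5], [0, 1, 2, 3, 5])
```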
Comparison of classification performance with general pathologists
To compare the algorithms with general pathologists, 13 pathologists from 8 countries (7 from the EU and 6 from other countries) graded 70 cases from the Dutch internal validation dataset, and 20 pathologists from the US graded 237 cases from the US external validation dataset.
First, on the 70 cases from the Dutch internal validation dataset, the algorithms showed higher diagnostic agreement with the specialists than the general pathologists did. The difference was significant, with the algorithms achieving higher sensitivity and specificity than all of the general pathologists. On average, the general pathologists missed 1.8% of cancers, whereas the algorithms missed about 1%.
The diagram above shows the individual diagnoses in color. Each row corresponds to the diagnoses of one rater and each column to one case, with the algorithms in the upper rows and the general pathologists in the lower rows. The diagnosis pattern (color) is visibly more consistent in the upper rows; in other words, the general pathologists' diagnoses vary more than the algorithms' do.
Discussion
To date, medical AI research has been siloed (note: one research team working on one dataset), and diverse approaches from multinational teams have not been directly compared. This study aimed to go beyond individual solutions and develop more generalizable algorithms.
The PANDA Challenge was one of the largest pathology image competitions to date. It showed that the top algorithms not only perform as well as or better than expert physicians but also generalize well to external validation datasets.
Compared with the general pathologists, the selected algorithms were found to shift toward higher sensitivity and lower specificity. This is believed to be because the development teams could estimate model performance only on the tuning dataset (rather than being due to the specialist pathologists' labeling). The algorithms also tended to assign higher grades than general pathologists, so the operating point needs to be tuned before clinical application.
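A minimal sketch of what such operating-point tuning might look like (an assumed threshold sweep on a local validation set, not a procedure described in the paper): pick the tumor-probability cutoff that meets a required sensitivity while maximizing specificity.

```python
# Sweep the tumor-probability threshold and keep the most specific operating
# point that still satisfies a minimum sensitivity requirement.
import numpy as np

def pick_threshold(tumor_probs, ref_labels, min_sensitivity=0.98):
    ref_pos = np.asarray(ref_labels) >= 1                  # tumor present per the reference
    best = None
    for t in np.linspace(0.05, 0.95, 19):
        pred_pos = np.asarray(tumor_probs) >= t
        sens = (ref_pos & pred_pos).sum() / ref_pos.sum()
        spec = (~ref_pos & ~pred_pos).sum() / (~ref_pos).sum()
        if sens >= min_sensitivity and (best is None or spec > best[2]):
            best = (t, sens, spec)                         # best specificity under the sensitivity floor
    return best

# Example with made-up probabilities and reference grades
threshold, sens, spec = pick_threshold(
    tumor_probs=[0.02, 0.10, 0.40, 0.85, 0.97],
    ref_labels=[0, 0, 1, 3, 5],
)
```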
This study addressed the classification of prostate cancer, but in clinical practice other cancer types and conditions must be detectable as well. Detection of severe inflammation, intraepithelial carcinoma, and partial atrophy will continue to be of great interest, so a more comprehensive evaluation on routinely collected specimens is needed.