
Screening healthy women with full-field digital mammography (FFDM) is considered the gold standard for the early detection and successful treatment of breast cancer. However, globally there is a limited supply of radiologists who have been trained and are available to interpret each individual FFDM examination. Furthermore, even the most experienced breast imaging specialists can get it wrong and the consequences for patients can be devastating. Twenty years ago, computer-aided diagnosis (CAD) algorithms based on FFDM images offered great hope. Sadly, this technology proved to be a false dawn but, in recent years, modern algorithms – based on artificial intelligence (AI) – have emerged.
Yesterday, researchers from Sweden published an important article in JAMA Oncology entitled “External evaluation of 3 commercial artificial intelligence algorithms for independent assessment of screening mammograms.” This was a retrospective study based on a cohort of women screened at an academic hospital in Stockholm from 2008 to 2015. It included 8,805 women aged 40 to 74 years, of whom 739 were diagnosed as having breast cancer. All FFDM images were acquired using Hologic equipment and all examinations were assessed by two radiologists operating as double-readers.
Three separate AI algorithms – numbered 1, 2 and 3 – were sourced from different vendors who, at their own request, remain anonymous. Among the parameters measured were sensitivity, specificity, area under the curve (AUC), accuracy, positive predictive value and false-negative rate. Seen at left are the receiver operating characteristic (ROC) curves for the three AI algorithms, where the dashed lines represent the average values for first-reader radiologists. The diagram below right is a magnification of the section where the dashed lines intersect (© JAMA).
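For readers less familiar with these parameters, the following is a minimal Python sketch of how they are conventionally defined. It is not the study's code: the names `ConfusionCounts`, `screening_metrics` and `auc`, along with all the counts and scores, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # cancers correctly flagged
    fp: int  # healthy women incorrectly flagged
    tn: int  # healthy women correctly cleared
    fn: int  # cancers missed


def screening_metrics(c: ConfusionCounts) -> dict:
    """Standard definitions of the threshold-based screening metrics."""
    positives = c.tp + c.fn  # all women with cancer
    negatives = c.tn + c.fp  # all women without cancer
    return {
        "sensitivity": c.tp / positives,
        "specificity": c.tn / negatives,
        "accuracy": (c.tp + c.tn) / (positives + negatives),
        "ppv": c.tp / (c.tp + c.fp),
        "false_negative_rate": c.fn / positives,
    }


def auc(scores_pos: list, scores_neg: list) -> float:
    """AUC as the probability that a randomly chosen cancer case scores
    higher than a randomly chosen healthy case (ties count as half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical counts and scores, NOT taken from the JAMA Oncology study.
toy_counts = ConfusionCounts(tp=80, fp=50, tn=850, fn=20)
toy_metrics = screening_metrics(toy_counts)
toy_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])

for name, value in toy_metrics.items():
    print(f"{name}: {value:.3f}")
print(f"auc: {toy_auc:.3f}")
```

Note that sensitivity and false-negative rate always sum to one, and that the AUC, unlike the other metrics, does not depend on where the recall threshold is set.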
Algorithm 1, with an AUC of 0.956, outperformed algorithms 2 and 3 with AUC values equal to 0.922 and 0.920 respectively. The sensitivity of algorithm 1 at 81.9% was not only superior to the other two algorithms (67.0% and 67.4%), but also improved on the first-reader radiologists (77.4%). Interestingly, combining algorithm 1 with first-reader radiologists achieved a sensitivity value equal to 88.6%.
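Why might a combination outperform either reader alone? The summary above does not say how the algorithm and radiologist were combined, so the sketch below simply assumes an "either flags it" rule, where a case is recalled if the radiologist or the algorithm marks it as suspicious. The helper names and the eight toy cases are hypothetical and are not study data.

```python
def sensitivity(flags: list, has_cancer: list) -> float:
    """Fraction of cancer cases that were flagged."""
    cancer_flags = [f for f, c in zip(flags, has_cancer) if c]
    return sum(cancer_flags) / len(cancer_flags)


def combine_either(a: list, b: list) -> list:
    """Assumed rule: recall the case if either reader flags it."""
    return [x or y for x, y in zip(a, b)]


# Eight toy cases: the first five have cancer, the last three do not.
has_cancer = [True, True, True, True, True, False, False, False]
reader = [True, True, True, False, False, False, True, False]
algorithm = [True, True, False, True, False, False, False, True]

combined = combine_either(reader, algorithm)

sens_reader = sensitivity(reader, has_cancer)
sens_algorithm = sensitivity(algorithm, has_cancer)
sens_combined = sensitivity(combined, has_cancer)

# False positives among the three healthy cases under the combined rule.
false_positives_combined = sum(
    f for f, c in zip(combined, has_cancer) if not c
)

print(sens_reader, sens_algorithm, sens_combined, false_positives_combined)
```

Under this assumed rule the combined sensitivity can never fall below that of either reader, but the false positives accumulate as well, so any gain in sensitivity has to be weighed against a potential loss of specificity.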
Although the authors did not identify the three vendors by name, they did provide background information on each algorithm. For example, algorithm 1 was trained on the largest population, which consisted of South Korean women who were all imaged with GE equipment, whereas the Stockholm cohort was imaged with Hologic equipment. This led the authors to conclude, “The superior performance of algorithm 1 is an interesting example of robustness.” Constance Lehman of Harvard commented: “The authors are to be commended for using a modern, all-digital screening database to compare performance of three commercial AI algorithms.” Perhaps, in time, algorithm 1 will be in widespread clinical use.