The deep learning and medical research communities are abuzz with discussions triggered by the publication of a trio of promising breast cancer diagnosis papers from Google, NYU and DeepHealth.
Several years ago a group of NYU researchers began publishing papers on applying deep learning to breast cancer screening. The team’s most recent paper, Deep Neural Networks Improve Radiologists’ Performance in Breast Cancer Screening, was published in October 2019.
In December, Boston-based DeepHealth — a startup that uses machine learning to assist radiologists — published Robust Breast Cancer Detection in Mammography and Digital Breast Tomosynthesis Using Annotation-Efficient Deep Learning Approach on arXiv. The proposed method achieved SOTA performance in mammogram classification, according to the paper. Coauthors include researchers from the Rhode Island Hospital & Brown University, China’s Henan Provincial People’s Hospital, Medford Radiology Group, and the University of Massachusetts Medical School.
DeepHealth CTO and Co-Founder William Lotter is the first author. In an email to Synced, Lotter said the paper is under journal review and that DeepHealth plans to present three related abstracts at the Society of Breast Imaging conference in mid-April.
On New Year’s Eve, a simple Reddit post applauding the DeepHealth paper provided the first hint of controversy surrounding these studies. Titled “Deep learning model for breast cancer detection beats five full-time radiologists and previous SOTA models from NYU and MIT,” the r/MachineLearning subreddit post received over 600 upvotes and 106 comments in less than two days. DeepHealth, however, saw the title as “hyperbolic and not necessarily constructive” and requested the thread be taken down. It was.
And then on New Year’s Day the Google global breast cancer study grabbed headlines around the world. International Evaluation of an AI System for Breast Cancer Screening proposes a new AI system that reads mammograms with greater accuracy, fewer false positives, and fewer false negatives than human radiologists. The paper appeared in the journal Nature and was authored by over 30 researchers from Google Health, Google DeepMind, Cancer Research UK Imperial Centre, Northwestern University, and Royal Surrey County Hospital.
However, even as Google DeepMind Founder and CEO Demis Hassabis and others were celebrating the paper’s release, Turing Award winner and Facebook Chief AI Scientist Yann LeCun went and spoiled the party, tweeting that the Google paper’s authors owed something to the NYU researchers, and should “cite this prior study on the same topic.” He added that unlike the Google system, the NYU method had been open sourced.
Hassabis shot back that Google did cite the NYU paper, taking a jab at LeCun in the process: “perhaps people should read the paper *first* before posting angry messages with incorrect information on twitter.”
LeCun sort of backed down at that point: “I was not angry ;-)” and “I did read the paper but missed the citation the first time around.”
An Important Challenge and a Dispute Over Novelty
Globally, breast cancer is the most common cancer in women, according to the World Health Organization. Although breast cancer deaths have declined in the US in recent years, the disease remains the second leading cause of cancer deaths among women in that country, according to the Centers for Disease Control and Prevention.
A condition in which cells in the breast grow out of control, breast cancer can spread through the body via blood and lymph vessels. Mammograms enable doctors to identify breast cancer tumors before they are large enough to cause symptoms or be otherwise detected by the patient. But despite an increase in the usage of digital mammography, reading mammograms remains a difficult task even for professional radiologists.
In a blog post distilling the Nature paper, Google Health’s Technical Lead Shravya Shetty said Google had been working with leading clinical research partners in the UK and US for several years to see if AI could improve the detection of breast cancer. The fruits of that research went into the new paper.
The Google paper, however, did not generate as much excitement in the AI community as it did in mainstream media.
Returning to the twittersphere to take another swipe at the paper’s novelty, LeCun retweeted comments from the UK Royal College of Radiologists’ Hugh Harvey: “Congrats to Google, but let’s not forgot the team from NYU who last year published better results, validated on more cases, tested on more readers, and made their code and data available. They just don’t have the PR machine to raise awareness.”
NYU paper coauthor and radiology professor Krzysztof J. Geras told Synced: “There is a very long line of papers on applying deep learning to breast cancer screening. My paper was probably the first that had the combination of experiments on a large scale, a careful evaluation of different possible models, very good results, a large reader study and the trained model publicly available online. However, there is still room for improvement and I’m sure that there will be many papers that will go further in different aspects in the next few years.”
The NYU researchers introduced a deep convolutional neural network for breast cancer screening classification that was trained and evaluated on over 1,000,000 images from 200,000 breast exams. Their system scored an impressive 0.895 on the AUC performance metric (AUC, or “area under the receiver operating characteristic curve,” ranges from 0 to 1; higher is better) when tested on images from New York University School of Medicine affiliated sites. The AUC for Google’s system was 0.889 on UK screening data and 0.8107 on US screening data.
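For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen cancer case receives a higher model score than a randomly chosen non-cancer case. The following is a minimal sketch of that pairwise definition in plain Python. The function name and the toy labels and scores are illustrative assumptions and do not come from any of the three papers.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) form of ROC AUC.

    labels: 1 for a cancer case, 0 for a non-cancer case.
    scores: model outputs, where higher means "more suspicious".
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (invented for illustration): two non-cancer and two cancer exams.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why differences in the second decimal place, such as 0.895 versus 0.889, are hard to interpret when the test sets differ.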
Geras acknowledged the strength of the Google paper’s careful result analysis, but warned in a tweet that “novelty is difficult to quantify” and “there are already multiple papers that show similar results.”
In fact, an even earlier NYU study from last August achieved an AUC of 0.919. And the new DeepHealth paper presents even more striking AUC scores: 0.971 on data from a Chinese source, 0.959 on data from a UK source, and 0.957 on data from a US source.
Geras cautions however that because these models were trained and evaluated on different datasets, it is difficult to fairly compare the results or to say at this point whether any model is actually state-of-the-art. He believes that multiple groups achieving similar results with similar methods would be a good thing, “co-validating our approaches and showing that the toolbox that we use — in this case, deep neural networks — is robust and works in different scenarios.”
There are certainly similarities between the Google and DeepHealth studies in terms of scale, methodology and outcomes — but the biggest difference may be that Google’s paper got published in the prestigious journal Nature while DeepHealth’s is still sitting on arXiv awaiting review. Google getting their research out there first will put more pressure on DeepHealth to prove their model’s novelty.
“One of the core novelties in our paper is that we present a model that works for digital breast tomosynthesis (DBT, or 3D mammography), in addition to 2D mammography,” Lotter wrote in an email, explaining the approach achieved good performance without requiring strongly labeled DBT data. Developing AI models for DBT is more challenging because 3D mammography typically comprises 50 to 100 times more data than 2D mammography. But the DeepHealth team believes DBT is essential given the widespread use of the technology and its superior clinical accuracy.
AI vs. Radiologists in Breast Cancer Detection
Trained and tuned on mammograms from more than 76,000 women in the UK and more than 15,000 women in the US, and evaluated on a separate data set of over 25,000 women in the UK and over 3,000 women in the US, Google’s system reduced false positives by 5.7 percent in the US and by 1.2 percent in the UK; and reduced false negatives by 9.4 percent and 2.7 percent in the US and the UK respectively.
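The figures above are reductions in false positive and false negative rates, and the article does not say whether they are absolute (percentage point) or relative reductions. The sketch below shows how the two error rates are computed from screening decisions and how both kinds of reduction would be derived. All names and counts are invented for illustration and are not taken from the Google study.

```python
def error_rates(labels, decisions):
    """Return false positive rate and false negative rate.

    labels:    1 if cancer was confirmed, 0 otherwise.
    decisions: 1 if the reader or model flagged the exam, 0 otherwise.
    """
    tp = fp = tn = fn = 0
    for y, d in zip(labels, decisions):
        if y == 1 and d == 1:
            tp += 1
        elif y == 1 and d == 0:
            fn += 1
        elif y == 0 and d == 1:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one positive and one negative case")
    return {"fpr": fp / (fp + tn), "fnr": fn / (fn + tp)}


def reduction(before, after):
    """Return (absolute, relative) reduction from `before` to `after`."""
    absolute = before - after
    return absolute, absolute / before


# Hypothetical cohort: 10 cancers followed by 90 normal exams.
labels = [1] * 10 + [0] * 90
reader = [1] * 7 + [0] * 3 + [1] * 9 + [0] * 81  # misses 3, 9 false alarms
model = [1] * 8 + [0] * 2 + [1] * 6 + [0] * 84   # misses 2, 6 false alarms

r, m = error_rates(labels, reader), error_rates(labels, model)
print(reduction(r["fpr"], m["fpr"]))  # absolute and relative drop in FPR
print(reduction(r["fnr"], m["fnr"]))  # absolute and relative drop in FNR
```

In this made-up cohort the same improvement reads as roughly 3 percentage points or roughly 33 percent for false positives, which is why the distinction matters when comparing headline numbers.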
“Reading mammograms is the perfect problem for machine learning and AI, but I honestly did not expect it to work this much better,” Mozziyar Etemadi, research assistant professor at Northwestern University and one of the Google paper’s co-authors, told Time.
Google’s AI system outperformed all six human experts in an independent study. DeepHealth’s system, meanwhile, outperformed all five full-time breast imaging specialists, improving absolute sensitivity by an average of 14 percent. And the NYU model outperformed 12 attending radiologists with between 2 and 25 years’ experience, a resident, and a medical student.
Faced with a flurry of “AI Beating Radiologists” headlines, the Google, DeepHealth and NYU teams all stressed that their systems aim not to replace radiologists but rather to support them in interpreting breast cancer screening exams.
It is noteworthy that radiologists typically use other sources of data such as family cancer history and prior images in their diagnoses. But in DeepHealth’s comparison the radiologists were given mammograms and nothing else, which could affect their performance. The Google team did provide their human experts (but not the AI) with patient histories and prior mammograms, and their system still scored a higher AUC than the radiologists.
Also noteworthy is that AUC only reflects certain aspects of model performance. Some studies have suggested that under particular experimental conditions, neural networks are more likely to get higher AUCs than radiologists in breast cancer screening exam classification. That performance, however, may not transfer to other metrics: “For example, in our study, when evaluated with respect to PRAUC, radiologists are still relatively stronger,” Geras explains.
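PRAUC is the area under the precision-recall curve. To see why it can diverge from AUC, consider a screening-like setting where cancers are rare. The sketch below compares ROC AUC with average precision, one common estimate of PRAUC, on invented data with 2 cancers among 20 exams. The function names and data are assumptions for illustration, scores are assumed to be distinct, and this is not the evaluation code used in the NYU study.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive case.

    Assumes distinct scores; tied scores would need extra handling.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive case")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` exams
    return total / n_pos


# Invented example: 20 exams ranked by score, cancers at ranks 2 and 4.
scores = list(range(20, 0, -1))
labels = [0, 1, 0, 1] + [0] * 16

print(round(roc_auc(labels, scores), 3))  # 0.917
print(average_precision(labels, scores))  # 0.5
```

Because negatives vastly outnumber positives, a couple of high-scoring false alarms barely dent AUC but halve the precision, so a system can look strong on one metric and weaker on the other.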
Model Generalization and Future Application
From a scientific perspective, ensuring generalization across populations is critical for real-world deployment. The DeepHealth deep learning model was trained primarily on Western populations but generalizes well to a Chinese population. The Google team meanwhile trained their model on UK data and evaluated it on US data to see how it would generalize to other healthcare systems.
Considering the geographical, cultural, and racial differences, however, proving that a model trained on Western populations can be generalized to Asian populations seems to make more sense. For example, the breast density of Chinese women tends to be higher than that of women in Western populations, which can be a technical challenge for mammography. Demonstrating that a model trained on Western populations can also achieve high performance in Chinese populations suggests the potential for generalization capability across additional populations.
Geras says the NYU model was trained on private data with particular biases in the distribution of the inputs, the outputs, and the relationship between the inputs and the outputs, and that it does not necessarily generalize well to other datasets.
Although it’s difficult to directly compare the three models’ generalization capability, or their overall performance in medical diagnosis, the ultimate test will be in real clinical settings. However, without open-sourced models, it’s difficult for others, especially smaller groups that could also contribute to the progress of the field, to build upon Google’s work. That has prompted many in the research community to criticize Google’s decision not to release the code for its model.
But why not release the code? McGill University Biomedical Engineering Professor Danilo Bzdok says Google’s model training code probably wouldn’t be of much use anyway, as it includes a prohibitively large number of dependencies based on internal tooling, infrastructure and hardware.
Open-sourcing models for medical use can be complicated, and the DeepHealth team is currently deciding on how to “find an optimal solution that benefits the research community while mitigating the potential of misuse on real patients,” Lotter told Synced. “From the research side, we’re big fans of open code. From the clinical side however, blindly releasing code that can be directly used to interpret mammograms has non-trivial risks.”
Lotter stresses that DeepHealth aims to build a usable product rather than a pure research project. This requires a quality management system, FDA studies and approval, etc. “If someone were to sidestep these components and directly use our model code for clinical decision making, especially without assurance of proper pre-processing, input validation, and monitoring, there are significant risks of harm.”
Screening is only the first step in breast cancer diagnosis, which often requires more than just mammograms. Broader questions also remain regarding when to start screening, ideal intervals between mammograms, and the extent of the benefits versus harmful effects of mammography.
These three powerful papers have focused the deep learning and medical communities’ attention on the potential for greatly improving breast cancer screening and diagnosis with deep learning. Hopefully we will see additional progress from this promising research area.
The NYU paper Deep Neural Networks Improve Radiologists’ Performance in Breast Cancer Screening is available here, the DeepHealth paper Robust Breast Cancer Detection in Mammography and Digital Breast Tomosynthesis Using Annotation-Efficient Deep Learning Approach is here, and the Google paper International Evaluation of an AI System for Breast Cancer Screening is here.
Journalist: Yuan Yuan | Editor: Michael Sarazen