Researchers are feeding machine-learning tools millions of medical scans to give them general diagnostic capabilities. Credit: Massimo Brega/Science Photo Library
Jordan Perchik started his radiology residency at the University of Alabama at Birmingham near the peak of what he calls the field’s “AI scare”. It was 2018, just two years after computer scientist Geoffrey Hinton had proclaimed that people should stop training to be radiologists because machine-learning tools would soon displace them. Hinton, sometimes referred to as the godfather of artificial intelligence (AI), predicted that these systems would soon be able to read and interpret medical scans and X-rays better than people could. A substantial drop in applications for radiology programmes followed. “People were worried that they were going to finish residency and just wouldn’t have a job,” Perchik says.
Hinton had a point. AI-based tools are increasingly part of medical care; more than 500 have been authorized by the US Food and Drug Administration (FDA) for use in medicine. Most are related to medical imaging — used for enhancing images, measuring abnormalities or flagging test results for follow-up.
But even seven years after Hinton’s prediction, radiologists are still very much in demand. And clinicians, for the most part, seem underwhelmed by the performance of these technologies.
Surveys show that although many physicians are aware of clinical AI tools, only a small proportion — between 10% and 30% — has actually used them (ref. 1). Attitudes range from cautious optimism to an outright lack of trust. “Some radiologists doubt the quality and safety of AI applications,” says Charisma Hehakaya, a specialist in the implementation of medical innovations at University Medical Center Utrecht in the Netherlands. She was part of a team that interviewed two dozen clinicians and hospital managers in the Netherlands for their views on AI tools in 2019 (ref. 2). Because of that doubt, she says, the latest approaches sometimes get abandoned.
And even when AI tools accomplish what they’re designed to do, it’s still not clear whether this translates into better care for patients. “That would require a more robust analysis,” Perchik says.
But excitement does seem to be growing about an approach sometimes called generalist medical AI. These are models trained on massive data sets, much like the models that power ChatGPT and other AI chatbots. After ingesting large quantities of medical images and text, the models can be adapted for many tasks. Whereas currently approved tools serve specific functions, such as detecting lung nodules in a computed tomography (CT) chest scan, these generalist models would act more like a physician, assessing every anomaly in the scan and assimilating it into something like a diagnosis.
Although AI enthusiasts now tend to steer clear of bold claims about machines replacing doctors, many say that these models could overcome some of the current limitations of medical AI, and they could one day surpass physicians in certain scenarios. “The real goal to me is for AI to help us do the things that humans aren’t very good at,” says radiologist Bibb Allen, chief medical officer at the American College of Radiology Data Science Institute, who is based in Birmingham, Alabama.
But there’s a long journey ahead before these latest tools can be used for clinical care in the real world.
Current limitations
AI tools for medicine serve a support role for practitioners, for example by going through scans rapidly and flagging potential issues that a physician might want to look at right away. Such tools sometimes work beautifully. Perchik remembers the time an AI triage tool flagged a chest CT scan for someone who was experiencing shortness of breath. It was 3 a.m. — the middle of an overnight shift. He prioritized the scan and agreed with the AI assessment that it showed a pulmonary embolism, a potentially fatal condition that requires immediate treatment. Had it not been flagged, the scan might not have been evaluated until later that day.
But if the AI makes a mistake, it can have the opposite effect. Perchik says he recently spotted a case of pulmonary embolism that the AI had failed to flag. He decided to take extra review steps, which confirmed his assessment but slowed down his work. “If I had decided to trust the AI and just move forward, that could have gone undiagnosed.”
Many devices that have been approved don’t necessarily line up with the needs of physicians, says radiologist Curtis Langlotz, director of Stanford University’s Center for Artificial Intelligence in Medicine and Imaging in Palo Alto, California. Early AI medical tools were developed according to the availability of imaging data, so some applications have been built for things that are common and easily spotted. “I don’t need help detecting pneumonia” or a bone fracture, Langlotz says. Even so, multiple tools are available for assisting physicians with these diagnoses.
Another issue is that the tools tend to focus on specific tasks rather than interpreting a medical examination comprehensively — observing everything that might be relevant in an image, taking into account previous results and the person’s clinical history. “Although focusing on detecting a few diseases has some value, it doesn’t reflect the true cognitive work of the radiologist,” says Pranav Rajpurkar, a computer scientist who works on biomedical AI at Harvard Medical School in Boston, Massachusetts.
The solution has often been to add more AI-powered tools, but that creates challenges for medical care, too, says Alan Karthikesalingam, a clinical research scientist at Google Health in London. Consider a person having a routine mammography. The technicians might be assisted by an AI tool for breast cancer screening. If an abnormality is found, the same person might require a magnetic resonance imaging (MRI) scan to confirm the diagnosis, for which there could be a separate AI device. If the diagnosis is confirmed, the lesion would be removed surgically, and there might be yet another AI system to assist with the pathology.
“If you scale that to the level of a health system, you can start to see how there’s a plethora of choices to make about the devices themselves and a plethora of decisions on how to integrate them, purchase them, monitor them, deploy them,” he says. “It can quickly become a kind of IT soup.”
Many hospitals are unaware of the challenges involved in monitoring AI performance and safety, says Xiaoxuan Liu, a clinical researcher who studies responsible innovation in health AI at the University of Birmingham, UK. She and her colleagues identified thousands of medical-imaging studies that compared the diagnostic performance of deep-learning models with that of health-care professionals (ref. 3). For the 69 studies the team assessed for diagnostic accuracy, a main finding was that a majority of models weren’t tested using a data set that was truly independent of the information used to train the model. This means that these studies might have overestimated the models’ performance.
“It’s becoming now better known in the field that you have to do an external validation,” Liu says. But, she adds, “there’s only a handful of institutions in the world that are very aware of this”. Without testing the performance of the model, particularly in the setting in which it will be used, it is not possible to know whether these tools are actually helping.
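In practice, external validation means scoring a frozen model on data gathered somewhere it has never seen, such as a different hospital, scanner or patient population, and comparing that figure with the internal test result. Below is a minimal sketch in Python with PyTorch, assuming a trained classifier and two hypothetical test loaders; it illustrates the comparison, not any particular study's protocol.

```python
# External-validation sketch: evaluate one frozen model on an internal test set
# and on data from an unrelated site. The model and loader names are hypothetical.
import torch

@torch.no_grad()
def accuracy(model: torch.nn.Module, loader) -> float:
    """Fraction of correctly classified examples in a (images, labels) loader."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        predictions = model(images).argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.numel()
    return correct / total

# Hypothetical usage, once a model and the two test sets exist:
# internal = accuracy(model, internal_test_loader)   # same institution as the training data
# external = accuracy(model, external_test_loader)   # different hospital, scanner or population
# A large drop from internal to external accuracy suggests the internal figure
# overestimated how the model will behave in a new setting.
```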
Solid foundations
Aiming to address some of the limitations of AI tools in medicine, researchers have been exploring medical AI with broader capabilities. They’ve been inspired by revolutionary large language models such as the ones that underlie ChatGPT.
These are examples of what some scientists call a foundation model. The term, coined in 2021 by scientists at Stanford University, describes models trained on broad data sets — which can include images, text and other data — using a method called self-supervised learning. Also called base models or pre-trained models, they form a basis that can later be adapted to perform different tasks.
Most medical AI devices already in use by hospitals were developed using supervised learning. Training a model with this method to identify pneumonia, for example, requires specialists to analyse numerous chest X-rays and label them as ‘pneumonia’ or ‘not pneumonia’, to teach the system to recognize patterns associated with the disease.
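In code, that supervised approach amounts to fitting a classifier to image-label pairs. The sketch below, in Python with PyTorch, assumes a hypothetical folder of chest X-rays that specialists have already sorted into 'pneumonia' and 'not pneumonia'; it is an illustration of the general recipe, not any approved product.

```python
# Minimal supervised-learning sketch: train a binary pneumonia classifier.
# Assumes a hypothetical directory "cxr/train" containing two subfolders,
# "pneumonia" and "not_pneumonia", of expert-labelled chest X-rays.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),   # X-rays are single-channel
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("cxr/train", transform=transform)
loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=None)              # small convolutional backbone
model.fc = nn.Linear(model.fc.in_features, 2)      # two outputs: pneumonia / not
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):                             # a few passes over the labelled data
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)      # penalize wrong labels
        loss.backward()
        optimizer.step()
```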
The annotation of large numbers of images, an expensive and time-consuming process, is not required in foundation models. For ChatGPT, for example, vast collections of text were used to train a language model that learns by predicting the next word in a sentence. Similarly, a medical foundation model developed by Pearse Keane, an ophthalmologist at Moorfields Eye Hospital in London, and his colleagues used 1.6 million retinal photos and scans to learn how to predict what missing portions of the images should look like (ref. 4; see ‘Eye diagnostics’). After the model had learnt all the features of a retina during this pre-training, the researchers introduced a few hundred labelled images that allowed it to learn about specific sight-related conditions, such as diabetic retinopathy and glaucoma. The system was better than previous models at detecting these ocular diseases, and at predicting systemic diseases that can be detected through tiny changes in the blood vessels of the eye, such as heart disease and Parkinson’s. The model hasn’t yet been tested in a clinical setting.
Eye diagnostics (figure). Source: Ref. 4
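The two-stage recipe described above, pre-training by filling in missing parts of unlabelled images and then fine-tuning on a few hundred labelled scans, can be sketched roughly as follows. The tiny encoder, decoder, masking scheme and labels here are illustrative stand-ins, not the Moorfields team's published model.

```python
# Illustrative foundation-model sketch: self-supervised pre-training by
# reconstructing masked image regions, then fine-tuning a small labelled head.
# The architecture, patch masking and data handling are simplified stand-ins.
import torch
from torch import nn
from torch.nn import functional as F

encoder = nn.Sequential(                     # stand-in feature extractor
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(                     # reconstructs the full image
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
)

def mask_patches(images: torch.Tensor, patch: int = 16, drop: float = 0.5) -> torch.Tensor:
    """Zero out a random subset of patches so the model must infer what is missing."""
    masked = images.clone()
    b, _, h, w = images.shape
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            hidden = torch.rand(b) < drop              # which images lose this patch
            masked[hidden, :, y:y + patch, x:x + patch] = 0.0
    return masked

# Stage 1: pre-train on large numbers of unlabelled retinal images (data loader not shown).
pretrain_opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

def pretrain_step(images: torch.Tensor) -> None:
    reconstruction = decoder(encoder(mask_patches(images)))
    loss = F.mse_loss(reconstruction, images)          # how well were the gaps filled in?
    pretrain_opt.zero_grad()
    loss.backward()
    pretrain_opt.step()

# Stage 2: reuse the pre-trained encoder and fine-tune on a few hundred labelled scans,
# for example 'diabetic retinopathy' versus 'no retinopathy'.
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 2))
classifier = nn.Sequential(encoder, head)
finetune_opt = torch.optim.Adam(classifier.parameters(), lr=1e-5)
```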