As artificial intelligence becomes an unavoidable presence in medicine, medical schools are encouraging students to engage with large language models. However, this integration raises concerns. “I’m worried these tools will erode my ability to make an independent diagnosis,” said Benjamin Popokh, a medical student at the University of Texas Southwestern.
During a recent rotation, Popokh’s professors had the class analyze a case using AI tools like ChatGPT and OpenEvidence, a medical LLM popular with healthcare professionals. While each chatbot correctly diagnosed a pulmonary embolism, Popokh noted the lack of a control group of students working unassisted. He soon found himself using AI after nearly every patient encounter. “I started to feel dirty presenting my thoughts to attending physicians, knowing they were actually the A.I.’s thoughts,” he admitted. One day, leaving the hospital, he realized he hadn’t independently analyzed a single patient. He resolved to formulate his own diagnosis before consulting any AI. “I went to medical school to become a real, capital-‘D’ doctor,” he said. “If all you do is plug symptoms into an A.I., are you still a doctor, or are you just slightly better at prompting A.I. than your patients?”
CaBot, a new model trained on clinicopathological conferences from The New England Journal of Medicine, was first evaluated on cases from the JAMA Network. It accurately diagnosed a range of conditions, including rashes, lumps, and muscle loss, with only minor errors, such as mistaking one tumor type for another. ChatGPT, by contrast, misdiagnosed about half the cases, confusing cancer with an infection and an allergic reaction with an autoimmune condition.
However, real patients don’t present as curated case studies. When CaBot was given a disorganized summary of a real patient’s experience—a bike ride, dinner, abdominal pain, vomiting, and two ER visits—the results were alarming. The AI generated a presentation filled with fabricated lab values, vital signs, and exam findings, including a “classic succussion splash” that never occurred. It even invented a CT scan report to support its mistaken diagnosis of gastric volvulus, a twisting of the stomach rather than the bowel.
The outcome changed dramatically when CaBot received a formal summary of the patient’s second emergency visit, which included more organized and salient data. The patient’s hemoglobin had plummeted, his white blood cell count had risen, and he was doubled over in pain. This time, the AI focused on the pertinent data without fabrication. “Strangulation indicators—constant pain, leukocytosis, dropping hemoglobin—are all flashing at us,” it noted, diagnosing a small intestine obstruction and advising, “Get surgery involved early.” While technically slightly off—the issue was in the large intestine—the recommended next steps were correct and would have led a surgeon to discover the problem.
The experience was both empowering and unnerving. It offered the potential for an instant second opinion in any specialty, yet it required medical training and vigilance to harness its power and detect its flaws. AI models can sound authoritative while making elementary errors. They cannot physically examine patients and often struggle with open-ended queries. Their output improves with carefully selected information, but most people are not trained to prioritize symptoms. A doctor knows to ask a patient with chest pain if it worsens with eating, walking, or lying down, or if leaning forward brings relief. Clinicians also listen for key phrases that signal specific conditions, like “worst headache of my life” for a brain hemorrhage or a “curtain over my eye” for a retinal-artery blockage. The difference between AI and earlier diagnostic tools is like that of a power saw versus a hacksaw; a careless user could lose a finger.
The common perception of medicine as a series of mysteries to be solved, popularized by shows like “House,” contrasts sharply with its often routine and repetitive reality. Many patients present with a complex combination of chronic conditions like emphysema, heart failure, and diabetes. In these cases, the goal is not to find a single diagnosis but to manage overlapping issues, often described as “likely multifactorial.” A precise diagnosis can be secondary to stabilizing the patient; someone with shortness of breath might be treated for COPD, heart failure, and pneumonia simultaneously, with the primary cause remaining unclear even after they recover. In such common scenarios, asking an AI for a single diagnosis would offer little practical clarity.
According to Gurpreet Dhaliwal, a renowned clinical diagnostician at UCSF, tasking an AI with solving a case is “starting with the end.” He argues that doctors should instead use AI for “wayfinding”—asking it to identify trends in a patient’s trajectory or highlight important details they might have missed. Rather than issuing a diagnosis, the model could alert a physician to a recent study, propose a relevant blood test, or find a critical lab result in an old medical record. This approach recognizes the difference between diagnosing a condition and competently caring for a person. “Just because you have a Japanese-English dictionary in your desk doesn’t mean you’re fluent in Japanese,” Dhaliwal noted.
While CaBot remains experimental, other AI tools are already influencing patient care. OpenEvidence, used by many clinicians, has licensing agreements with top medical journals and complies with patient-privacy laws. Crucially, it cites peer-reviewed articles in its answers, sometimes quoting a paper directly, which helps prevent the kind of “hallucinations” seen with CaBot. When presented with a case, it doesn’t immediately offer a solution; instead, it begins by asking a series of clarifying questions.