Study Finds Not All AI Platforms Are Built to Diagnose
As reported by health and science news outlets, including News-Medical and Medical Xpress, a new study led by Milan Toma, Ph.D., associate professor in the College of Osteopathic Medicine (NYITCOM), finds that general-use AI platforms are unreliable for medical diagnosis. Toma and his co-authors, who include NYITCOM Senior Development Security Operations Engineer Mihir Matalia and medical student Sungjoon Hong, tested the reliability of some of the world’s most advanced multimodal large language models, including ChatGPT and Claude. The AI models were tasked with analyzing the same brain scan, which showed clear intracranial pathology: an ischemic stroke near the left middle cerebral artery. The findings reveal a 20 percent rate of fundamental diagnostic error across the AI models, along with concerning variability in interpretation and assessment.
“Our research highlights a critical distinction in the AI landscape,” Toma tells News-Medical. “Most successful medical AI tools are task-specific algorithms, trained on large datasets of labeled medical images and validated for very specific diagnostic tasks. However, large language models are not optimized for diagnostics—they are built for linguistics and conversation. Accordingly, they generate explanations that sound authoritative, even when their underlying interpretation is wrong or inconsistent.”