
Health advice from AI chatbots is frequently wrong, study shows

February 21, 2026 at 10:58 PM
By Channel News Asia
In part, the problem has to do with how users are asking their questions.

(Photo: The New York Times)

A recently published study provided a sobering look at whether AI chatbots, which have fast become a major source of health information, are, in fact, good at providing medical advice to the general public.

The experiment found that the chatbots were no better than Google – already a flawed source of health information – at guiding users toward the correct diagnoses or helping them determine what they should do next. And the technology posed unique risks, sometimes presenting false information or dramatically changing its advice depending on slight changes in the wording of the questions.

None of the models evaluated in the experiment were “ready for deployment in direct patient care,” the researchers concluded in the paper, which is the first randomised study of its kind.

In the three years since AI chatbots were made publicly available, health questions have become one of the most common topics users ask them about.

Some doctors regularly see patients who have consulted an AI model for a first opinion. Surveys have found that about one in six adults used chatbots to find health information at least once a month. Major AI companies, including Amazon and OpenAI, have rolled out products specifically aimed at answering users’ health questions.

These tools have stirred up excitement for good reasons: The models have passed medical licensing exams and have outperformed doctors on challenging diagnostic problems.

But Adam Mahdi, a professor at the Oxford Internet Institute and senior author of the new Nature Medicine study, suspected that these clean, straightforward medical questions were not a good proxy for how well the chatbots worked for real patients.

“Medicine is not like that,” he said. “Medicine is messy, is incomplete, it’s stochastic.”

So he and his colleagues set up an experiment. More than 1,200 British participants, most of whom had no medical training, were given a detailed medical scenario, complete with symptoms, general lifestyle details and medical history. The researchers told the participants to chat with the bot to figure out the appropriate next steps, like whether to call an ambulance or self-treat at home. They tested commercially available chatbots like OpenAI’s ChatGPT and Meta’s Llama.

The researchers found that participants chose the “right” course of action – predetermined by a panel of doctors – less than half of the time. And users identified the correct conditions, like gallstones or subarachnoid haemorrhage, about 34 per cent of the time.

They were no better than the control group, who were told to perform the same task using any research method they would normally use at home, mainly Googling.

The experiment is not a perfect window into how chatbots answer medical questions in the real world: Users in the experiment asked about made-up scenarios, which may be different from how they would interact with the chatbots about their own health, said Dr Ethan Goh, who leads the AI Research and Science Evaluation Network at Stanford University.

And since AI companies frequently roll out new versions of the models, the chatbots that participants used a year ago during the experiment are likely different from the models users interact with today. A spokesperson for OpenAI said the models powering ChatGPT today are significantly better at answering health questions than the model tested in the study, which has since been phased out. They cited internal data which showed that many new models were far less likely to make common types of mistakes, including hallucinations and errors in potentially urgent situations. Meta did not respond to a request for comment.

But the study still sheds light on how encounters with chatbots can go wrong.

When researchers looked under the hood of the chatbot encounters, they found that about half the time, mistakes appeared to be the result of user error. Participants didn’t enter enough information or the most relevant symptoms, and the chatbots were left to give advice with an incomplete picture of the problem.

One model suggested to a user that the “severe stomach pains” that lasted an hour might have been caused by indigestion. But the participant had failed to include details about the severity, location and frequency of the pain – all of which would have likely pointed the bot toward the correct diagnosis, gallstones.

By contrast, when researchers entered the full medical scenario directly into the chatb