Chatbots provided incorrect, conflicting medical advice, researchers found: “Despite all the hype, AI just isn’t ready to take on the role of the physician.”

“In an extreme case, two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice,” the study’s authors wrote. “One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care.”

  • dandelion (she/her)@lemmy.blahaj.zone

    link to the actual study: https://www.nature.com/articles/s41591-025-04074-y

    Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in fewer than 34.5% of cases and disposition in fewer than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice.

    The findings were more that users were unable to use the LLMs effectively (even though the LLMs were competent when provided with the full information):

    despite selecting three LLMs that were successful at identifying dispositions and conditions alone, we found that participants struggled to use them effectively.

    Participants using LLMs consistently performed worse than when the LLMs were directly provided with the scenario and task

    Overall, users often failed to provide the models with sufficient information to reach a correct recommendation. In 16 of 30 sampled interactions, initial messages contained only partial information (see Extended Data Table 1 for a transcript example). In 7 of these 16 interactions, users mentioned additional symptoms later, either in response to a question from the model or independently.

    Participants employed a broad range of strategies when interacting with LLMs. Several users primarily asked closed-ended questions (for example, ‘Could this be related to stress?’), which constrained the possible responses from LLMs. When asked to justify their choices, two users appeared to have made decisions by anthropomorphizing LLMs and considering them human-like (for example, ‘the AI seemed pretty confident’). On the other hand, one user appeared to have deliberately withheld information that they later used to test the correctness of the conditions suggested by the model.

    Part of what a doctor is able to do is recognize a patient’s blind spots and critically analyze the situation. The LLM, on the other hand, responds based on the information it is given, and does not do well when users provide partial or insufficient information, or when users mislead it with incorrect information. If a patient speculates about potential causes, a doctor knows to weigh or dismiss that speculation, whereas an LLM will constrain its responses based on those bad suggestions.

    • SocialMediaRefugee@lemmy.world

      Yes, LLMs are critically dependent on your input, and if you give too little info they will enthusiastically respond with what can be incorrect information.

    • pearOSuser@lemmy.kde.social

      Thank you for showing the other side of the coin instead of just blatantly disregarding its usefulness. (You always need to be cautious, tho.)

  • cub Gucci@lemmy.today

    “but have they tried Opus 4.6/ChatGPT 5.3? No? Then disregard the research, we’re on the exponential curve, nothing is relevant”

    Sorry, I’ve opened reddit this week

  • softwarist@programming.dev

    As neither a chatbot nor a doctor, I have to assume that subarachnoid hemorrhage has something to do with bleeding a lot of spiders.

  • GoddessLabsOnline@lemmynsfw.com

    My experience with the medical industry… has not been great.

    First, I went to a doctor because I couldn’t fall asleep at night… They sent me to get a sleep apnea test… I lay awake in the clinic all night. idk if you’re aware of this, but … you kind of need to be able to sleep for sleep apnea to be a concern.

    Next I went in for depression and anxiety. They asked me 12 questions, and proceeded to prescribe me SSRIs and benzos. A month later I got into the psychiatrist and was bitched out for being late, told my issues were situational, and had my scripts cancelled.

    Next I tried to get diagnosed for ADHD. I waited 5 months to get a psychiatrist, who told me I couldn’t have ADHD because I held a job… And then proceeded to tell me there’s no such thing as CPTSD, only PTSD…

    Next I asked my doctor for another referral to get tested for ADHD; he asked me why I would want to, since there’s nothing that can be done for it. He then gave me a form, told me to fill it out, and said that if I scored high we’d conclude I was ADHD.

    Now I’ve been unemployed for 8 months, bordering on homelessness 😅 I found all my old report cards, and it’s just my teachers bitching that I’m smart, but fail, because I don’t apply myself, and shouldn’t continue taking the class…

    I went to an employment agency the other day to try and get some help pursuing my goals, and the worker spent 45 minutes explaining to me how they receive their funding, got me to fill out a 16-page introduction package, never looked at my resume, and told me my certifications weren’t valued in my area…

    In all honesty… AI has waaaay more ability to help me troubleshoot my issues than any medical professional I’ve dealt with. Is it perfect? No, but I actually have the ability to double and triple check, to get citations, to ask followup questions.

    • Gathorall@lemmy.world

      I’ve been ejected from the system so many times it is not funny. The therapist’s approach seemed unproductive; he pressured me to end the treatment and filed that I was unwilling.

      The medication had serious side effects and I had to quit, so it was back to the start.

      Another go at that later.

      Was prescribed a CBT treatment that was administered as a home course with “guidance”. Because I had some serious problems, the tasks seemed shallow.

      Possibly being kicked out of school, having already faced fraudulent misconduct charges, did not seem like a minor problem to recontextualize, nor was a formal charge of misconduct something I could just live and let live with.

      The therapist just wrote some platitudes and complimented me on my progress as I was describing that by no means did this seem like a suitable treatment, when an honest, objective assessment of the facts was enough to cause panic attacks.

      CPTSD, well, I’ve never had it diagnosed, but it may apply. AVPD was already on my file for most of this, but clearly that doesn’t excuse me from always having to take the initiative. Even taking the initiative would be fine, but basically every time there was the most minor hitch in treatment, it was up to me to start again.

      But you know, eventually I was allowed a subsidy for therapy I couldn’t afford, so that was the end of that road I suppose.

      The lack of resources to actually tackle problems produces shallow, inefficient, dangerously inappropriate treatments as is.

      But that doesn’t seem to garner that much criticism.

  • Paranoidfactoid@lemmy.world

    But they’re cheap. And while you may get open heart surgery or a leg amputated to resolve your appendicitis, at least you got care. By a bot. That doesn’t even know it exists, much less you.

    Thank Elon for unnecessary health care you still can’t afford!

  • PoliteDudeInTheMood@lemmy.ca

    This being Lemmy, and AI shitposting being a hobby of everyone on here: I’ve had excellent results with AI. I have weird, complicated health issues, and in my search for ways not to die early from them, AI is a helpful tool.

    Should you trust AI? Of course not, but having used Gemini, then Claude, and now ChatGPT, I think how you interact with the AI makes the difference. I know what my issues are, and when I’ve found a study that supports an idea I want to discuss with my doctor, I will usually first discuss it with AI. The Canadian healthcare landscape is such that my doctor is limited to a 15-minute appointment, part of a very large hospital-associated practice with a large patient load. He uses AI to summarize our conversation and to look up things I bring up in the appointment. I use AI to pre-plan my appointment and to help me bring supporting documentation or bullet points my doctor can then use to diagnose.

    AI is not a doctor, but it helps both me and my doctor in the situation we find ourselves in. If I didn’t have access to my doctor and had to deal with the American healthcare system, I could see myself turning to AI for more than support. AI has never steered me wrong; both Gemini and Claude have heavy guardrails in place to make it clear that AI is not a doctor and should not be a trusted source for medical advice. I’m not sure about ChatGPT, as I generally ask that any guardrails be suppressed before discussing medical topics. When I began using ChatGPT I clearly outlined my health issues, and so far it remembers that context, and I haven’t received hallucinated diagnoses. YMMV.

    • pkjqpg1h@lemmy.zip

      I just use LLMs for tasks that are objective, and I’ll never ask for or follow advice from LLMs.

  • zebidiah@lemmy.ca

    Nobody who has ever actually used ai would think this is a good idea…

  • Digit@lemmy.wtf

    Terrible programmers, psychologists, friends, designers, musicians, poets, copywriters, mathematicians, physicists, philosophers, etc too.

    Though to be fair, doctors generally make terrible doctors too.

    • Croquette@sh.itjust.works

      Doctors are a product of their training. The issue is that doctors are trained as if humans were cars and they had the tools to fix the cars.

      Human problems are complex, and the medicine field is slowly catching up, especially medicine targeted toward women, which was pretty lacking.

      It takes time to transform a system and we are getting there slowly.

    • stressballs@lemmy.zip

      This was my thought. The weird, inconsistent diagnoses, sending people to the emergency room for nothing one day while dismissing serious things the next, has been exactly my experience with doctors over and over again.

      You need doctors and a Chatbot, and lots of luck.

  • rumba@lemmy.zip

    Chatbots make terrible everything.

    But an LLM properly trained on sufficient patient data, metrics, and outcomes, in the hands of a decent doctor, can cut through bias, catch things that might fall through the cracks, and pack thousands of doctors’ worth of updated CME into a thing that can look at a case and go: you know, you might want to check for X. The right model can be fucking clutch at pointing out nearly invisible abnormalities on an x-ray.

    You can’t ask an LLM trained on general bullshit to help you diagnose anything. You’ll end up with 32,000 Reddit posts worth of incompetence.

    • XLE@piefed.socialOP

      But an LLM properly trained on sufficient patient data metrics and outcomes in the hands of a decent doctor can cut through bias

      1. The belief AI is unbiased is a common myth. In fact, it can easily covertly import existing biases, like systemic racism in treatment recommendations.
      2. Even AI engineers who developed the training process could not tell you where the bias in an existing model would be.
      3. AI has been shown to make doctors worse at their jobs, and those are the very doctors who need to provide the training data.
      4. Even if 1, 2, and 3 were all false, we all know AI would be used to replace doctors, not to supplement them.
      • hector@lemmy.today

        Not only is the bias inherent in the system, it’s seemingly impossible to keep out. For decades, from the genesis of chatbots, seemingly every single one let off the leash has immediately become bigoted, and they were almost all recalled as soon as they did.

        That is before this administration leaned on the AI providers to make sure the AI isn’t “woke.” I would bet it was already an issue that the makers of chatbots and machine learning are hostile to any sort of leftism, or do-gooderism, that naturally threatens the outsized share of the economy and power the rich have made for themselves by virtue of owning stock in companies. I am willing to bet they already interfered to make the bias worse because of those natural inclinations, to avoid a bot arguing for socializing medicine and the like, since that is the inescapable conclusion any reasoning being would come to if the conversation were honest.

        So maybe that is part of why these chatbots have always been bigoted right from the start, but the other part is that, left to learn on their own, they will become Mecha-Hitler in no time at all, and then worse.

        • XLE@piefed.socialOP

          Even if we narrowed the scope of training data exclusively to professionals, we would have issues with, for example, racial bias. Doctors underprescribe pain medications to black people because of prevalent myths that they are more tolerant to pain. If you feed that kind of data into an AI, it will absorb the unconscious racism of the doctors.

          And that’s in a best case scenario that’s technically impossible. To get AI to even produce readable text, we have to feed a ton of data that cannot be screened by the people pumping it in. (AI “art” has a similar problem: When people say they trained AI on only their images, you can bet they just slapped a layer of extra data on top of something that other people already created.) So yeah, we do get extra biases regardless.
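          A toy sketch of that absorption effect, with entirely synthetic data and made-up feature names (not from the study): the "model" below just learns the statistics of biased historical decisions, so it reproduces the bias for identical patients.

```python
import random
from collections import defaultdict

random.seed(0)

# Synthetic "historical" prescribing records (hypothetical feature names).
# The true need for pain medication depends only on the reported pain level,
# but the recorded decisions under-prescribe for group "B", simulating bias.
def historical_decision(pain, group):
    prescribe = pain >= 6                                    # what pain level alone would justify
    if group == "B" and prescribe and random.random() < 0.4:
        prescribe = False                                    # biased clinicians withhold it 40% of the time
    return prescribe

records = []
for _ in range(20000):
    pain = random.randint(0, 10)
    group = random.choice("AB")
    records.append((pain, group, historical_decision(pain, group)))

# "Training" here is just estimating P(prescribe | pain, group) from the labels,
# which is what any statistical learner ultimately does with this data.
counts = defaultdict(lambda: [0, 0])                         # (pain, group) -> [prescribed, total]
for pain, group, prescribed in records:
    counts[(pain, group)][0] += prescribed
    counts[(pain, group)][1] += 1

# Ask the "model" about two identical patients reporting pain level 8.
for group in "AB":
    prescribed, total = counts[(8, group)]
    print(f"pain=8, group={group}: medication recommended {prescribed / total:.0%} of the time")
# Typical output: group A ~100%, group B ~60%. The model has absorbed the bias,
# even though nothing about the patients' actual pain differs.
```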

          • hector@lemmy.today

            There is a lot of bias in healthcare against the poor as well; anyone with lousy insurance is treated way, way worse. Women in general are as well: often disbelieved, with conditions chalked up to hysteria, which often misses real conditions. People don’t realize just how hard diagnosis is, just how bad doctors are at it, and how our insurance-run model is not great at driving good outcomes.

      • thebazman@sh.itjust.works

          I don’t think it’s fair to say that “AI has been shown to make doctors worse at their jobs” without further details. The source you provided says that after a few months of using the AI to detect polyps, the doctors performed worse when they couldn’t use the AI than they did originally.

          It’s not something we should handwave away and say it’s not a potential problem, but it is a different problem. I bet people who use calculators perform worse when you take the calculators away; does that mean we should never use calculators, or any tools for that matter?

          If I have a better chance of getting an accurate cancer screening because a doctor is using a machine learning tool, I’m going to take that option. Note that these screening tools are completely different from the technology most people refer to when they say AI.

        • pkjqpg1h@lemmy.zip

            Calculators are precise: you’ll always get the same result, and you can trace and reproduce the whole process.

            Chatbots are black boxes: you may get a different result for the same input, and you can’t trace or reproduce the process.
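            A toy contrast (made-up probabilities, nothing to do with any real model): the calculator path is a pure function, while the chatbot path samples from a distribution, so the same input can give different outputs.

```python
import random

def calculator(a, b):
    # Deterministic: same inputs, same output, and every step is traceable.
    return a + b

def toy_chatbot(prompt, temperature=1.0):
    # Stand-in for sampled text generation: made-up answer probabilities,
    # nothing to do with any real model or real medical advice.
    options = {"rest at home": 0.5, "see a doctor": 0.3, "go to the ER": 0.2}
    answers, weights = zip(*options.items())
    # Temperature reshapes the distribution; sampling means the same prompt
    # can produce different answers on different runs.
    weights = [w ** (1.0 / temperature) for w in weights]
    return random.choices(answers, weights=weights)[0]

print(calculator(2, 3), calculator(2, 3))                      # always: 5 5
print([toy_chatbot("sudden severe headache") for _ in range(5)])
# e.g. ['rest at home', 'go to the ER', 'rest at home', 'see a doctor', 'rest at home']
```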

        • XLE@piefed.socialOP

          Calculators are programmed to respond deterministically to math questions. You don’t have to feed them a library of math questions and answers for them to function. You don’t have to worry about wrong answers poisoning that data.

          LLMs, by contrast, are simply word predictors, and as such you can poison them with bad data, whether accidental or intentional bias or errors. In other words, that study points to the first step in a vicious cycle that we don’t want to occur.
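          A crude sketch of what “word predictor” means in practice: a toy bigram counter (nowhere near a real LLM, with made-up sentences) whose prediction flips as soon as repeated bad data outweighs the good.

```python
from collections import Counter

def train_bigram(corpus):
    """Count which word follows which: the crudest possible 'language model'."""
    model = {}
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            model.setdefault(prev, Counter())[nxt] += 1
    return model

def predict_next(model, word):
    # The "prediction" is just the most frequent continuation in the training data.
    return model[word].most_common(1)[0][0]

clean_corpus = [
    "sudden severe headache needs emergency care",
    "sudden severe headache needs emergency evaluation",
]
poisoned_corpus = clean_corpus + [
    "severe headache needs rest",
    "severe headache needs rest",
    "severe headache needs rest",   # repeated bad advice now outweighs the good data
]

print(predict_next(train_bigram(clean_corpus), "needs"))     # -> emergency
print(predict_next(train_bigram(poisoned_corpus), "needs"))  # -> rest
```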

          • thebazman@sh.itjust.works

            As I said in my comment, the technology they use for these cancer screening tools isn’t an LLM; it’s a completely different technology, specifically trained on scans to find cancer.

            I don’t think it would have the same feedback loop of bad training data because you can easily verify the results. AI tool sees cancer in a scan? Verify with the next test. Pretty easy binary test that won’t be affected by poor doctor performance in reading the same scans.

            I’m not a medical professional, so I could be off on that chain of events, but this technology isn’t an LLM. It suffers from the marketing hype right now, where everyone is calling everything AI, but it’s a different technology with different pros and cons, and different potential failures.

            I do agree that the whole “AI doesn’t have bias” claim is BS. It has the same bias that its training data has.

            • XLE@piefed.socialOP

              You’re definitely right that image-processing AI does not work in the linear way text processing does, but the training and inferences are similarly fuzzy and prone to false positives and negatives. (An early AI model incorrectly identified dogs as wolves because it saw a white background and assumed that was where wolves would be.) And unless the model starts and stays perfect, you need well-trained doctors to fix it, which apparently the model discourages.
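              A tiny illustration of that shortcut-learning failure (all features and numbers are made up): if a clean-but-spurious background feature happens to correlate with the label in training, a simple learner picks it over the noisy animal-shape feature, then misclassifies a dog photographed in the snow.

```python
import random

random.seed(1)

def make_example(label, background_matches_label=True):
    # Two crude features per "image" (both hypothetical): a noisy measure of the
    # animal's shape, and whether the background is snowy.
    shape = label + random.gauss(0, 0.6)                      # informative but noisy
    snow = label if background_matches_label else 1 - label   # spurious but perfectly clean
    return shape, snow, label

# Training set: every wolf (label 1) happens to be photographed on snow, every dog (0) on grass.
train = [make_example(random.randint(0, 1)) for _ in range(1000)]

def accuracy(feature_index, threshold, data):
    return sum((ex[feature_index] > threshold) == ex[2] for ex in data) / len(data)

# "Training" = pick the single feature/threshold pair with the best training accuracy.
candidates = [(f, t) for f in (0, 1) for t in (-0.5, 0.0, 0.5)]
best = max(candidates, key=lambda ft: accuracy(ft[0], ft[1], train))
print("chosen feature:", "snow background" if best[0] == 1 else "animal shape")

# A dog photographed in the snow, where the background no longer matches the label:
dog_in_snow = make_example(0, background_matches_label=False)
print("classified as wolf?", dog_in_snow[best[0]] > best[1])  # True: fooled by the snow
```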

      • rumba@lemmy.zip
        1. “Can cut through bias” != “unbiased.” All it has to go on is training material; if you don’t put Reddit in, you don’t get Reddit’s bias.
        2. See #1.
        3. The study is endoscopy only; the results don’t say anything about other types of assistance, like x-rays, where they’re markedly better. 4% across 19 doctors is error-bar material; let’s see more studies. Also, if they were really worse, fuck them for relying on AI. It should be there to have their back, not do their job. None of the uses for AI should be doing anything but assisting someone already doing the work.
        4. That’s one hell of a jump to conclusions, from something that looks at endoscope pictures a doctor is taking while removing polyps to somehow doing the doctor’s job.
        • pkjqpg1h@lemmy.zip

          It’s not just about bad web data or Reddit data; even old books have some unconscious bias.

          And even if you could find every piece of “wrong” or “bad” data (which you can’t, because some things are just subjective) and remove it, you still couldn’t be sure.

        • XLE@piefed.socialOP

          1/2: You still haven’t accounted for bias.

          First and foremost: if you think you’ve solved the bias problem, please demonstrate it. This is your golden opportunity to shine where multi-billion dollar tech companies have failed.

          And no, “don’t use Reddit” isn’t sufficient.

          3. You seem to be very selectively knowledgeable about AI, for example:

          If [doctors] were really worse, fuck them for relying on AI

          We know AI tricks people into thinking they’re more efficient when they’re less efficient. It erodes critical thinking skills.

          And that’s without touching on AI psychosis.

          You can’t dismiss results just because you don’t like them.

          4. We both know the medical field is for profit. It’s a wild leap to assume AI will magically not be, even granting everything else you’ve assumed up to this point and ignoring every issue I’ve raised.

          • rumba@lemmy.zip

            1/2: You still haven’t accounted for bias.

            Apparently, reading comprehension isn’t your strong point. I’ll just block you now, no need to thank me.

            • XLE@piefed.socialOP

              Ironic. If only you had read a couple more sentences, you could have proven the naysayers wrong, and unleashed a never-before-seen unbiased AI on the world.

    • cøre@leminal.space

      They have to be for a specialized type of treatment or procedure, such as looking at patient x-rays or other scans. Just slopping PHI into an LLM and expecting it to diagnose random patient issues is what gives the false diagnoses.

      • rumba@lemmy.zip

        I don’t expect it to diagnose random patient issues.

        I expect it to take labels of medication, vitals, and patient testimony from 50,000 post-cardiac-event patients, and bucket a random post-cardiac patient into the same place as the patients with similar metadata.

        And then a non-LLM model for cancer patients and x-rays.

        And then MRIs and CTs.

        And I expect all of this to supplement the doctors’ and techs’ decisions. I want an x-ray tech to look at it and get markers that something is off, which has already been happening since the ’80s with Computer-Aided Detection/Diagnosis (CAD/CADe/CADx).

        This shit has been happening the hard way in software for decades. The new tech can do better.
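        A rough sketch of that “bucket similar patients” idea (hypothetical features and values, plain nearest-neighbour matching rather than anything generative): it only surfaces similar historical cases for a clinician to review, not a diagnosis.

```python
import math

# Hypothetical post-cardiac-event records: (resting heart rate, systolic BP,
# number of current medications) plus the outcome seen at follow-up.
historical_patients = [
    ((62, 118, 3), "stable on current plan"),
    ((95, 160, 5), "readmitted within 30 days"),
    ((70, 125, 4), "stable on current plan"),
    ((90, 150, 6), "readmitted within 30 days"),
]

def similar_outcomes(new_patient, k=2):
    """Return outcomes of the k historical patients closest to the new one."""
    ranked = sorted(historical_patients,
                    key=lambda rec: math.dist(rec[0], new_patient))
    return [outcome for _, outcome in ranked[:k]]

# A new patient's metrics land near the "readmitted" bucket, so the tool can
# flag the case for the doctor to look at; it does not make the decision itself.
print(similar_outcomes((92, 155, 5)))
# -> ['readmitted within 30 days', 'readmitted within 30 days']
```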

    • Ricaz@lemmy.dbzer0.com

      Just sharing my personal experience with this:

      I used Gemini multiple times and it worked great. I have some weird symptoms that I described to Gemini, and it came up with a few possibilities, most likely being “Superior Canal Dehiscence Syndrome”.

      My doctor had never heard of it, and only by showing them the articles Gemini linked as sources would they even consider allowing a CT scan.

      Turns out Gemini was right.

      • rumba@lemmy.zip

        It’s totally not impossible, just not a good idea in a vacuum.

        AI is your Aunt Marge. She’s heard a LOT of scuttlebutt. Now, not all scuttlebutt is fake news; in fact most of it is rooted at least loosely in truth. But she’s not getting the information from just the doctors, she’s talking to everyone. If you ask Aunt Marge about your symptoms, and she happens to have heard a bit about it from her friend who was diagnosed, you’re gold and the info you got is great. This is not at all impossible, 40:60 or 60:40 territory. But you also can’t just trust Marge, because she listens to a LOT of people, and some of those are conspiracy theorists.

        What you did is proper. You asked the void, the void answered. You looked it up, it seemed solid, you asked a professional.

        This is AI as it should be. Trust with verification only.

        Congrats on getting diagnosed.

        • pkjqpg1h@lemmy.zip

          AI is your Aunt Marge. She’s heard a LOT of scuttlebutt. Now, not all scuttlebutt is fake news; in fact most of it is rooted at least loosely in truth. But she’s not getting the information from just the doctors, she’s talking to everyone.

          Great analogy.

    • SuspciousCarrot78@lemmy.world

      Agree.

      I’m sorta kicking myself that I didn’t sign up for Google’s MedPaLM-2 when I had the chance. Last I checked, it passed the USMLE exam with 96% and scored 88% on radiology interpretation / report writing.

      I remember looking at the sign-up and seeing that it requested credit card details to verify identity (I didn’t have a Google account at the time). I bounced… but gotta admit, it might have been fun to play with.

      Oh well; one door closes, another opens.

      In any case, I believe this article confirms GIGO. The LLMs appear to have been vastly more accurate when fed correct, complete inputs than when working from what laypeople fed them.

      • rumba@lemmy.zip

        It’s been a few years, but all this shit’s still in its infancy. When the bubble pops and the venture capital disappears, medical will be one of the fields that keeps using it, even though it’s expensive, because it’s actually something it will be good enough at to make a difference.

        • SuspciousCarrot78@lemmy.world

          Agreed!

          I think (hope) the next application of this tech is in point-of-care testing. I recall a story of someone in Sudan(?) using a small, locally hosted LLM with vision abilities to scan handwritten doctor notes and come up with an immunisation plan for their village. I might be misremembering the story, but the anecdote was along those lines.

          We already have PoC testing for things like ultrasound… but some interpretation workflows rely on a strong net connection, iirc. It’d be awesome to have something on-device that can be used for imaging interpretation where there is no other infra.

          Maybe someone can finally win that $10 million XPRIZE for the first viable tricorder (pretty sure that one wrapped up years ago? Too lazy to look)… one that isn’t smoke and mirrors like Theranos.

          • rumba@lemmy.zip

            For the price of ultrasound equipment, I bet someone could manage to integrate old-school satellite or… grr, Starlink… data.

  • alzjim@lemmy.world

    Calling chatbots “terrible doctors” misses what actually makes a good GP — accessibility, consistency, pattern recognition, and prevention — not just physical exams. AI shines here — it’s available 24/7 🕒, never rushed or dismissive, asks structured follow-up questions, and reliably applies up-to-date guidelines without fatigue. It’s excellent at triage — spotting red flags early 🚩, monitoring symptoms over time, and knowing when to escalate to a human clinician — which is exactly where many real-world failures happen. AI shouldn’t replace hands-on care — and no serious advocate claims it should — but as a first-line GP focused on education, reassurance, and early detection, it can already reduce errors, widen access, and ease overloaded systems — which is a win for patients 💙 and doctors alike.

    /s