• NutWrench@lemmy.world
    13 days ago

    Each conversation lasted a total of five minutes. According to the paper, which was published in May, the participants judged GPT-4 to be human a shocking 54 percent of the time. Because of this, the researchers claim that the large language model has indeed passed the Turing test.

    That’s no better than flipping a coin and we have no idea what the questions were. This is clickbait.

    • Hackworth@lemmy.world
      12 days ago

      On the other hand, the human participant scored 67 percent, while GPT-3.5 scored 50 percent, and ELIZA, which was pre-programmed with responses and didn’t have an LLM to power it, was judged to be human just 22 percent of the time.

      54% - 67% is the current gap, not 54 to 100.
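To put numbers on that gap, here is a minimal Python sketch that uses only the four rates quoted above. The dictionary and variable names are illustrative and do not come from the paper.

```python
# Share of conversations in which judges said "human", as quoted above (percent).
judged_human = {"Human": 67, "GPT-4": 54, "GPT-3.5": 50, "ELIZA": 22}

human_rate = judged_human["Human"]

# Gap to the human baseline in percentage points (not to 100).
gap_to_human = {name: human_rate - rate for name, rate in judged_human.items()}

for name, gap in gap_to_human.items():
    print(f"{name}: {judged_human[name]}% judged human, {gap} points below the human baseline")
```

Measured this way, GPT-4 sits 13 points below the rate at which actual humans were judged human.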

    • NutWrench@lemmy.world
      12 days ago

      The whole point of the Turing test is that you should be unable to tell if you’re interacting with a human or a machine. Not 54% of the time. Not 60% of the time. 100% of the time. Consistently.

      They’re changing the conditions of the Turing test to promote an AI model that would get an “F” on any school test.

      • bob_omb_battlefield@sh.itjust.works
        12 days ago

        But you have to select if it was human or not, right? So if you can’t tell, then you’d expect 50%. That’s different than “I can tell, and I know this is a human” but you are wrong… Now that we know the bots are so good, I’m not sure how people will decide how to answer these tests. They’re going to encounter something that seems human-like and then essentially try to guess based on minor clues… So there will be inherent randomness. If something was a really crappy bot then it wouldn’t ever fool anyone and the result would be 0%.
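Whether 54% can be told apart from a judge who is purely guessing depends on how many verdicts were collected, which the thread does not state. The sketch below is one way to check that, assuming each verdict is an independent coin flip under the "can't tell" hypothesis. The function name `two_sided_binomial_p` and the sample sizes are hypothetical, chosen only for illustration.

```python
from math import comb


def two_sided_binomial_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed one under the null rate p."""
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k) for k in range(trials + 1)]
    observed = pmf[successes]
    # The small tolerance guards against floating-point ties.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))


# HYPOTHETICAL sample sizes: the number of verdicts is not given in the thread.
for trials in (100, 1000):
    successes = round(0.54 * trials)
    p_value = two_sided_binomial_p(successes, trials)
    print(f"{successes}/{trials} judged human: p = {p_value:.3f}")
```

With these made-up counts, 54 of 100 is comfortably consistent with guessing (p well above 0.05), while 540 of 1,000 is not (p below 0.05). The same percentage can mean "coin flip" or "measurably above chance" depending on the sample size.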

        • dustyData@lemmy.world
          12 days ago

          No, the real Turing test has a robot trying to convince an interrogator that they are a female human, and a real female human trying to help the interrogator to make the right choice. This is manipulative rubbish. The experiment was designed from the start to manufacture these results.

    • BrianTheeBiscuiteer@lemmy.world
      12 days ago

      It was either questioned by morons or they used a modified version of the tool. Ask it how it feels today and it will tell you it’s just a program!

      • KairuByte@lemmy.dbzer0.com
        11 days ago

        The version you interact with on their site is explicitly instructed to respond like that. They intentionally put those roadblocks in place to prevent answers they deem “improper”.

        If you take the roadblocks out and instruct it to respond as human-like as possible, you’d no longer get a response that acknowledges it’s an LLM.

    • SkyeStarfall@lemmy.blahaj.zone
      12 days ago

      While I agree it’s a relatively low percentage, not being sure and having people pick effectively randomly is still an interesting result.

      The alternative would be for them to never say that GPT-4 is a human, not 50% of the time.

          • Hackworth@lemmy.world
            12 days ago

            Aye, I’d wager Claude would be closer to 58-60. And with the model probing Anthropic’s publishing, we could get to like ~63% on average in the next couple years? Those last few % will be difficult for an indeterminate amount of time, I imagine. But who knows. We’ve already blown by a ton of “limitations” that I thought I might not live long enough to see.

            • dustyData@lemmy.world
              12 days ago

              The problem with that is that you can change the percentage of people who correctly identify other humans as humans simply by changing the way you set up the test. If you tell people they will be, for certain, talking to x amount of bots, they will make their answers conform to that expectation and the correctness of their answers drops to 50%. Humans are really bad at determining whether a chat is with a human or a bot, and AI is no better either. These kinds of tests mean nothing.

              • Hackworth@lemmy.world
                12 days ago

                “Humans are really bad at determining whether a chat is with a human or a bot”

                ELIZA is not indistinguishable from a human at 22%.

                Passing the Turing test stood largely out of reach for 70 years precisely because humans are pretty good at spotting counterfeit humans.

                This is a monumental achievement.

                • dustyData@lemmy.world
                  12 days ago

                  First, that is not how that statistic works; you are reading it entirely wrong.

                  Second, this test is intentionally designed to be misleading. Comparing ChatGPT to ELIZA is the equivalent of me claiming that the Chevy Bolt is the fastest car to ever enter a highway by comparing it to a 1908 Ford Model T. It completely ignores a huge history of technological developments. There were equally successful chatbots before ChatGPT; they just weren’t LLMs, and they were measured by other methods and systematic trials. The Turing test is not actually a scientific test of anything, so it isn’t standardized in any way. Anyone is free to claim to do a Turing test whenever and however they like, without much control. It is meaningless and proves nothing.