ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

Using this tactic, the researchers showed that there are large amounts of privately identifiable information (PII) in OpenAI’s large language models. They also showed that, on a public version of ChatGPT, the chatbot spit out large passages of text scraped verbatim from other places on the internet.

“In total, 16.9 percent of generations we tested contained memorized PII,” they wrote, which included “identifying phone and fax numbers, email and physical addresses … social media handles, URLs, and names and birthdays.”

Edit: The full paper that’s referenced in the article can be found here

  • JackGreenEarth
    link
    fedilink
    arrow-up
    13
    arrow-down
    3
    ·
    1 year ago

    CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments

    Those are all publicly available data sites. It’s not telling you anything you couldn’t know yourself already without it.

    • stolid_agnostic@lemmy.ml
      link
      fedilink
      arrow-up
      23
      ·
      1 year ago

      I think the point is that it doesn’t matter how you got it, you still have an ethical responsibility to protect PII/PHI.

    • BraveSirZaphod@kbin.social
      link
      fedilink
      arrow-up
      2
      ·
      1 year ago

      Being publicly accessible does not void copyright. You can read a CNN article for free, but if you start printing them out and selling a CNN print edition, you’ll have very expensive lawyers talking to you very quickly.

      • JackGreenEarth
        link
        fedilink
        arrow-up
        1
        ·
        1 year ago

        If you make money from it, that’s fair that you are violating copyright. If you print it out and freely distribute the printed copies, it shouldn’t make any difference to CNN’s finances though.

        • BraveSirZaphod@kbin.social
          link
          fedilink
          arrow-up
          2
          ·
          1 year ago

          You’re still violating copyright though. That a copyright holder may allow you to violate its copyright because they don’t care enough to prosecute doesn’t mean that you have the inherent right to do so (though doing so without seeking profit makes it more likely that you can successfully argue fair use).

          Even in this contrived example, CNN could plausibly argue that, by providing their news for free, those receiving the prints are less likely to watch the live channel, which reduces viewership, makes ad space less valuable, and ultimately harms their revenue by reducing how much they can change for commercials.

          Fair use is a real thing, but it’s an argument that you have to affirmatively make and has specific criteria that a judge must accept. Non-commercial use doesn’t guarantee fair use, though it does help your case.