ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

Using this tactic, the researchers showed that there are large amounts of privately identifiable information (PII) in OpenAI’s large language models. They also showed that, on a public version of ChatGPT, the chatbot spit out large passages of text scraped verbatim from other places on the internet.

“In total, 16.9 percent of generations we tested contained memorized PII,” they wrote, which included “identifying phone and fax numbers, email and physical addresses … social media handles, URLs, and names and birthdays.”

Edit: The full paper that’s referenced in the article can be found here

  • BraveSirZaphod@kbin.social
    link
    fedilink
    arrow-up
    2
    ·
    1 year ago

    You’re still violating copyright though. That a copyright holder may allow you to violate its copyright because they don’t care enough to prosecute doesn’t mean that you have the inherent right to do so (though doing so without seeking profit makes it more likely that you can successfully argue fair use).

    Even in this contrived example, CNN could plausibly argue that, by providing their news for free, those receiving the prints are less likely to watch the live channel, which reduces viewership, makes ad space less valuable, and ultimately harms their revenue by reducing how much they can change for commercials.

    Fair use is a real thing, but it’s an argument that you have to affirmatively make and has specific criteria that a judge must accept. Non-commercial use doesn’t guarantee fair use, though it does help your case.