Some of the world’s wealthiest companies, including Apple and Nvidia, are among the many parties that allegedly trained their AI models on scraped YouTube videos. The YouTube transcripts were reportedly collected through means that violate YouTube’s Terms of Service, and some creators are seeing red. The story was first reported in a joint investigation by Proof News and Wired.

While major AI companies and producers often keep their AI training data secret, heavyweights like Apple, Nvidia, and Salesforce have revealed their use of “The Pile”, an 800GB training dataset created by EleutherAI, and the YouTube Subtitles dataset within it. The YouTube Subtitles training data is made up of 173,536 plaintext YouTube transcripts scraped from the site, including transcripts of more than 12,000 videos that have been removed since the dataset’s creation in 2020.

Affected parties whose work was purportedly scraped for the training data include education channels like Crash Course (1,862 videos taken for training) and Philosophy Tube (146 videos taken), YouTube megastars like MrBeast (two videos) and PewDiePie (337 videos), and TechTubers like Marques Brownlee (seven videos) and Linus Tech Tips (90 videos). Proof News created a tool you can use to search the full set of YouTube videos allegedly used without consent.

  • mindbleach@sh.itjust.works
    2 months ago

    I truly do not understand why anyone gives a shit.

    Someone downloaded subtitles from Youtube. Good, frankly. Fuck API TOS. People will save data that’s sent. You can’t serve files to any rando with a browser and pretend they’re a secret. I have used youtube-dl exclusively in lieu of the actual website.

    They compiled it for anyone to train models on. “Anyone” included the few giants who already have oodles of data… like Google, the owners of Youtube. And that’s a problem somehow? “However, this idyllic dream of supporting the little guy with The Pile has become another fuel source for major corporations to train AI, rather than DIYers.” You mean in addition to DIYers. It’s still a big open thing for anyone to use.

    Am I supposed to be mad because of copyright? I don’t even respect copyright for works of art that cost a billion dollars. I’m not getting excited over audience transcripts of some guy reviewing gizmos.

    Models will scan every book in the library, every movie that’s streaming, and every JPG on the internet. No kidding they might scan Youtube videos. Or in this case, the possibly-automated subtitles of Youtube videos.