stolen training data - written by a friend

jessebot 2026-02-15 17:55:57 +01:00
commit eb40a43b89

@@ -224,11 +224,19 @@ This is for both social media websites and apps.
Could use some help writing this with concrete receipts on environmental, social, political, and economic/labor harms.
### Stolen Training Data
AI companies use data from across the web to train their models, most often without the consent of website owners or users. Big tech companies like Google and Meta are scraping data from the users of major FOSS projects, such as Mastodon, WordPress, and other ActivityPub-powered and self-hosted software.
* In 2023, [the Washington Post published a list of sources in Google's C4 data set](https://archive.ph/eehKq). A multitude of fediverse instances and personal sites were included. Fediverse users are known as major proponents of privacy and opt-in consent, which makes this especially jarring for those who chose decentralized social media to keep control over their data.
* In 2025, [a similar leak of Meta's sources was published](https://archive.ph/NZlf3). Meta's list demonstrates how integrating ActivityPub into Threads has enhanced the company's ability to mine content without authorization. Threads is widely blocked in some parts of the fediverse, but scraping server CDNs has allowed Meta to get around those blocks. Notably, the CDN domains of both masto.host and fedi.monster, two managed hosting services, are included in the list. Large servers like mastodon.art, which is hosted by the former and is home to many artists who left sites like DeviantArt over those sites' AI scraping of user content, had [media scraped without their knowledge](https://mastodon.art/@Curator/115022115346692178).
FOSS projects listed in this repo are using tooling that blatantly disregards licensing and violates even the most basic Codes of Conduct, making said tools antithetical to FOSS's purpose.
### Legal Ramifications
1. LLMs are often trained on, and thus prone to regurgitating, either completely or in part, chunks of code licensed under terms that carry specific legal requirements. A sloperator may not understand those requirements, or even be aware of them, when making a contribution. Regardless of this ignorance, it falls to the repo's owner to comply with the terms of any and all licensed code integrated into their project.
See also: [lawsuits against AI companies](https://chatgptiseatingtheworld.com/2025/11/02/tracker-of-tort-lawsuits-v-ai-companies/)
### Environmental Impact