From 9f5508f205564a777276d550aace25b3b401e6a6 Mon Sep 17 00:00:00 2001
From: Fabio Henrique
Date: Thu, 16 Apr 2026 22:23:31 +0200
Subject: [PATCH] Add new research as stolen training data evidence (#400)

I'm open to relocating this to another section in case Stolen Training Data is not the right place for it. I chose this section because none of the others, AFAICS, fit this research. 'Legal Cases and Law Problems' is not exactly the case here: while a copyright violation argument can be made, AFAIK this research has not been used in any court case at this point in time. 'License Problems' is the same thing: up until now, this research hasn't been used anywhere related to licensing. It does, though, provide evidence that LLMs do, in fact, steal and store copyrighted data.

Reviewed-on: https://codeberg.org/small-hack/open-slopware/pulls/400
Reviewed-by: JesseBot
Co-authored-by: Fabio Henrique
Co-committed-by: Fabio Henrique
---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 618642b..b5a7b89 100644
--- a/README.md
+++ b/README.md
@@ -852,6 +852,7 @@ AI companies use data from across the web for training their models, most often
 * In 2023, [the Washington Post published a list of sources in Google's C4 data set](https://archive.ph/eehKq). A multitude of fediverse instances and personal sites were included. The fediverse is known for its userbase being major proponents of privacy and opt-in consent, making this especially jarring for those who have chosen to use decentralized social media for control over their data.
 * In 2025, [a similar leak of Meta's sources was published](https://archive.ph/NZlf3). Meta's list demonstrates how their integration of ActivityPub into their Threads software has enhanced their ability to mine content without authorization. Threads is widely blocked in some parts of the fediverse, but their scraping of server CDNs has allowed them to get around that.
Notably, both the CDN domains of the managed hosting services masto.host and fedi.monster are included in the list; large servers like mastodon.art, which is hosted by the former and has many artists who've left sites like DeviantArt and others due to their AI scraping of user content, had [media unknowingly scraped](https://mastodon.art/@Curator/115022115346692178).
+* In March 2026, [a research paper](https://arxiv.org/html/2603.20957v2) showed that simply finetuning LLMs unlocked verbatim recall of up to 90% of entire copyrighted books, contradicting LLM companies' previous statements in court that their models do not store copies of training data. After finetuning exclusively on a single author, the researchers were able to unlock verbatim recall of books by over 30 completely unrelated authors across different genres. None of the models were explicitly trained on these books by the researchers, which indicates that LLMs always carry with them a considerable amount of copyrighted material from pre-training.
 FOSS projects listed in this repo are using tooling that blatantly disregard licensing and violate of Codes of Conduct, making said tools antithetical to FOSS' purpose.