thoughts on academic labor, digital labor, intellectual property, and generative AI
- 3 minutes read - 496 words
Thanks to this article from The Atlantic that I saw on Bluesky, I’ve been able to confirm something that I’ve long assumed to be the case: that my creative and scholarly work is being used to train generative AI tools. More specifically, I used the searchable database embedded in the article to search for myself and find that at least eight of my articles (plus two corrections) are available in the LibGen pirate library, which means that they were almost certainly used by Meta to train their Llama LLM.
I’m not thrilled about this (though it’s hard to get too upset about something that I was confident was already happening), but as fiercely critical as I am about AI companies’ data scraping, this particular example also illustrates some of the complicated feelings and perspectives I have to sort through as I voice those criticisms. Here’s the thing: I don’t like generative AI companies using my data, but I also don’t really mind that my articles are being pirated by LibGen. I would prefer that my research be freely available to everyone, and I’m frustrated with an academic publishing system that so often locks it behind ridiculous paywalls—not so that I can profit from my own intellectual property, but so that publishing companies can.
So, my complaint here isn’t actually about piracy or intellectual property, it’s a complaint about labor. In fact, I’ve long felt that academics ought to be skeptical of AI companies’ use of our data because it so closely resembles the way that academic publishing companies use our research. I don’t mind my scholarly output being freely available to anybody—what I do mind is academic publishing companies profiting off of my work (and not giving me a penny) by locking my scholarly output behind a paywall. Sure, if my scholarly output is publicly available, I guess that means that it’s available to companies who want to use it to build their LLMs and other tools, but the core problem is still the same: Those companies are profiting off of my work (and not giving me a penny) because my scholarly output isn’t behind a paywall.
I don’t necessarily mind that Meta pirates my research publications, but I don’t think it (or anyone else!) should have to. In fact, more than one person has made the observation that Meta, OpenAI, and their ilk won’t get any more than a slap on the wrist for the same “crime” that was serious enough for the feds to drive Aaron Swartz to take his own life, and that’s a terrible world that I don’t want to live in. I don’t necessarily need to get rich off of my research publications, but if someone else is, they ought to give me a cut. In my view, the problem with generative AI tools’ exploitation of people’s work isn’t an intellectual property issue, it’s a deeper labor issue. That doesn’t make it any less serious, but I think the distinction here is really important.
- academic labor
- digital labor
- generative AI
- research
- intellectual property
- copyright
- Bluesky
- LibGen
- Llama
- Mike Masnick
- Aaron Swartz
- labor
Similar Posts:
🔗 linkblog: More academic publishers are doing AI deals
🔗 linkblog: OpenAI, Mass Scraper of Copyrighted Work, Claims Copyright Over Subreddit's Logo
I have lots of concerns about LLM training, but I think it’s better to think of the issue in terms of digital labor, not copyright. My blog is licensed for reuse, but that doesn’t mean it’s any less exploitative for someone to scrape it all to develop software that will make them rich off my work.
🔗 linkblog: Ex-Google CEO says successful AI startups can steal IP and hire lawyers to ‘clean up the mess’