Meta Faces Lawsuit Over Training AI with Pirated Books

Published On Tue Apr 01 2025

Companies developing AI models, such as OpenAI and Meta, train their systems on enormous datasets. These consist of text from newspapers, books (often sourced from unauthorised repositories), academic publications, and various internet sources. The material includes copyrighted works. The Atlantic magazine recently alleged that Meta, the parent company of Facebook and Instagram, had used LibGen, an illegal book repository, to train its generative AI tool. LibGen, created around 2008 by Russian scientists, hosts more than 7.5 million books and 81 million research papers, making it one of the largest online libraries of pirated work in the world.

Legal Battles and Ethical Concerns

The practice of training AI on copyrighted material has sparked intense legal debate and raised serious concerns among writers and publishers, who face the risk of their work being devalued or replaced. While a few companies, such as OpenAI, have established formal partnerships with certain content providers, many publishers and writers object to their intellectual property being used without consent or financial compensation.

Author Tracey Spicer has described Meta’s use of copyrighted books as “peak technocapitalism,” while Sophie Cunningham, chair of the board of the Australian Society of Authors, has accused the company of “treating writers with contempt.” Meta is being sued in the United States for copyright infringement by a group of authors, including Michael Chabon, Ta-Nehisi Coates, and comedian Sarah Silverman.

Legal Implications

The legal battles center on a fundamental question: does mass data scraping for AI training constitute “fair use”? The stakes are particularly high because AI companies not only train their models on publicly accessible data but also use that content to provide chatbot answers that may compete with the original creators’ works.

AI companies defend their data scraping on the grounds of innovation and “fair use” – a legal doctrine that, in the US, permits “the unlicensed use of copyright-protected works in certain circumstances.”

Future Implications and Compensation Models

These legal challenges will have significant implications for the future of the publishing and media industries. The issue is particularly alarming considering the average earnings of authors in both the US and Australia. In response to these challenges, the Australian Society of Authors (ASA) has called for the Australian government to regulate AI to protect the rights of creators.

In 2024, HarperCollins signed a deal allowing limited use of selected nonfiction backlist titles for AI training. The compensation model for original creators in these AI training scenarios remains a point of contention, with arguments for more equitable distribution between writers and publishers.

Concerns and Solutions

Publishers and creators are increasingly concerned about losing control of their intellectual property in the face of AI-generated content. Licensing platforms, such as Created by Humans, aim to strike a balance between AI training needs and the protection of creators' rights.

As the legal landscape evolves, various jurisdictions are considering updates to national copyright laws to address AI-specific challenges. The European Union’s Artificial Intelligence Act of 2024 and similar initiatives aim to safeguard copyright holders’ interests while fostering innovation in AI development.

Ultimately, the ongoing legal battles and ethical debates surrounding the use of copyrighted works to train AI systems raise critical questions about intellectual property rights, fair compensation for creators, and the future of content creation in the digital age.