Copyright Challenges in Generative AI Training Databases

Published On Wed May 22 2024

Generative AI and transparency of databases and their content, from a copyright perspective

In May 2024, the Organisation for Economic Cooperation and Development (OECD) updated its Principles on Artificial Intelligence (AI), including the principle of transparency that has contributed to shaping policy and regulatory debates on AI and generative AI.

Generative AI refers to deep learning models capable of creating new content, such as text, computer code, and images, based on a user's prompt. The transparency of databases and their content used to train AI models has become a crucial issue, particularly regarding the protection of copyrighted works.

Foundation models for generalist medical artificial intelligence

Lack of transparency in databases used for AI training

In recent years, AI companies have become increasingly secretive about their training datasets. Companies like Meta and OpenAI have restricted access to information about the databases they use. OpenAI, for example, transitioned from transparently disclosing their training datasets, such as BooksCorpus, to providing vague references to datasets like Books1 and Books2.

Concerns have been raised about the inclusion of copyrighted materials in these datasets, prompting legal actions against AI companies. Despite concerns, companies like Google, Nvidia, and Stability AI have also refrained from disclosing their training datasets' content.

Reinventing the data experience: Use generative AI and modern data

Regulatory developments in the EU and the USA

The EU AI Act, amended in March 2024, now includes provisions to promote transparency in the data used to train General Purpose AI models. However, critics argue that the requirements are ambiguous and challenging.

Our commitments to advance safe, secure, and trustworthy AI

In the USA, the Copyright Office sought feedback on transparency and disclosure levels regarding the use of copyrighted works in training AI models. The proposed Generative AI Copyright Disclosure Act would mandate the submission of notices detailing copyrighted works used in building generative AI systems.

Conclusion

The debate surrounding the transparency of databases and their content used for training AI models continues to evolve. While regulatory efforts aim to address concerns about copyright infringement, challenges remain in balancing the interests of AI companies and rights holders.