Unveiling the Ethics Behind ChatGPT Training

Published On Mon Jun 03 2024

ChatGPT: Learning or Stealing? An exploration of the ethics of AI training

Since its introduction in late 2022, ChatGPT has revolutionized many aspects of our lives. In schools, students use it to help write papers and summarize articles. In the workplace, programmers use it to draft code and automate workflows. At the writer’s desk, authors use it to brainstorm premises and generate settings. Whatever the context, ChatGPT has proven to be a useful tool that helps people become more efficient, productive, and creative. This utility is largely due to the language model’s ability to respond to text prompts with detailed, coherent, and readable passages, which typically (though not always!) contain relevant and accurate information.

The Ethics of AI Training

However, this capability did not manifest out of thin air: artificial intelligence models like ChatGPT must be trained before they can do any of this. In the context of AI, training refers to a process in which a machine learning model is fed data, from which it makes connections and recognizes patterns.1
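The idea of a model "making connections and recognizing patterns" can be sketched with a toy example. The loop below is not OpenAI's actual procedure; it is a minimal illustration of training in general, assuming nothing beyond ordinary gradient descent: a model with one adjustable parameter is repeatedly fed data points and nudged toward the pattern hidden in them.

```python
# Toy dataset whose hidden pattern is y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0    # the model's single learnable parameter
lr = 0.05  # learning rate: how far each nudge moves w

for epoch in range(200):
    for x, y in data:
        pred = w * x                # model's current guess
        grad = 2 * (pred - y) * x   # derivative of the squared error
        w -= lr * grad              # nudge w to reduce the error

print(round(w, 2))  # close to 2.0: the model has "learned" the pattern
```

ChatGPT's training works on the same principle, except with billions of parameters instead of one and text instead of number pairs.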

[Figure: CPMAI+E Ethical & Responsible AI Framework Development]

The figure above is one depiction of such a process. The datasets that these models are trained on contain vast numbers of works, which often include copyrighted materials. The ethics of training AIs on such materials is hotly contested, with arguments invoking various ethical frameworks and legal precedents.

Arguments for Ethical AI Training

In this article, I use the ethical lenses of consequentialism and deontology to argue that it is ethical for ChatGPT to train on copyrighted materials, because doing so is a transformative use of the works that contributes towards innovation and progress. I also propose the creation of an independent AI oversight board to ensure that AI continues to be developed ethically.

Training Process of ChatGPT

Essentially, ChatGPT is a more advanced version of the autocomplete features found in mobile phone keyboards, email apps, and search engines. To string words together in ways that are coherent and relevant, ChatGPT must learn the patterns that dictate which words tend to follow others. It does this by taking in millions of text records and making associations between the words it finds within them.2 These text records can include books, news articles, Wikipedia entries, forum posts, and social media profiles. ChatGPT’s original training set comprised 570 GB of data, and later iterations of the model are trained on even larger datasets.3
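The word-association idea described above can be sketched as a toy bigram model, a drastically simplified stand-in for what ChatGPT actually does: count which words follow which in a small corpus, then predict the most frequent continuation.

```python
from collections import Counter, defaultdict

# Tiny corpus standing in for the millions of text records a real model ingests.
corpus = (
    "the cat sat on the mat . "
    "the cat sat on the rug . "
    "the dog chased the cat ."
)

# Count how often each word follows each other word (a bigram table).
follows = defaultdict(Counter)
tokens = corpus.split()
for prev, nxt in zip(tokens, tokens[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the continuation seen most often in training."""
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat": the most common word after "the" here
```

A real language model replaces raw bigram counts with a neural network over much longer contexts, but the underlying task, predicting the next word from patterns in the training text, is the same.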


Debates and Lawsuits

There has been much debate about artificial intelligence training as it relates to copyright law. At the end of 2023, The New York Times filed a lawsuit against OpenAI on the grounds that the artificial intelligence company used copyrighted NYT articles to train its chatbot. Through such training, the Times alleges, readers can obtain the paper’s content from ChatGPT without paying a subscription fee.

In response to this lawsuit, OpenAI published a blog post on its website defending its use of articles from The New York Times in training AI models. The company argues that training AI models on publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents.

Ethical Frameworks in AI Training

Let us first consider the arguments against AI training on copyrighted works. The first blow against this practice is the claim that it is wrong to use other people’s work without permission, credit, or compensation. This is a deontological argument: it appeals to the inherent morality of the act of training itself, framing it as analogous to stealing.

Consequentialism, by contrast, can be used to defend ChatGPT’s training. OpenAI argues that the artificial intelligence language models that result from training on copyrighted material are sufficiently transformative to constitute fair use. Even though the training may have used copyrighted material, the end result of such training, ChatGPT, is different enough from the original training materials that it does not pose an ethical issue.

Conclusion

The benefits of ChatGPT’s training practices outweigh the drawbacks, especially from a consequentialist perspective. The deontological objections are important to consider, but they are ultimately unconvincing when weighed against the consequentialist arguments, which point to the very real and beneficial outcomes of this kind of training.