Are ChatGPT, Bard and Dolly 2.0's Ethics Up for Debate?

Published On Fri May 12 2023

Are ChatGPT, Bard and Dolly 2.0 Trained On Pirated Content?

Large Language Models (LLMs) such as ChatGPT, Bard and Dolly 2.0 are trained on public internet content, including datasets created from pirated books.

The Pythia research paper by EleutherAI mentions that Pythia was trained using the Pile dataset, which consists of multiple sets of English-language texts, including a dataset called Books3. This dataset contains the text of books that were pirated and hosted on a pirate site called Bibliotik.

Dolly 2.0 is an open-source AI that was recently released to democratize AI by making it available to everyone who wants to create something with it, including commercial products. It is based on an open-source large language model (LLM) called Pythia, which was created by an open-source group called EleutherAI. Databricks created Dolly 2.0 from the 12 billion parameter version of Pythia, together with a dataset that Databricks created itself.

The Washington Post recently published a review of Google's Colossal Clean Crawled Corpus dataset, also known as C4, in which they discovered that Google's dataset also contains pirated content. The C4 dataset is important because it's one of the datasets used to train Google's LaMDA LLM, a version of which is what Bard is based on.

Researchers who examined the C4 dataset found that it contained negative sentiment toward people of Arab identities and excluded documents associated with Black and Hispanic people, as well as documents that mention sexual orientation. This exclusion exacerbates existing (language-based) racial inequality as well as the stigmatization of LGBTQ+ identities.

In conclusion, large language models are trained on massive amounts of text data drawn from multiple sources, including pirated content. However, the intent behind open-source AI like Dolly 2.0 is to democratize AI, and companies like Mozilla are investing in growing the open-source AI ecosystem. It is essential to ensure that the datasets used to train these models are unbiased and do not exclude people who belong to minority communities.