Explore the Websites that Fuel AI Chatbots like ChatGPT
The rise of AI chatbots has created a buzz in the tech world because of how easily they appear to carry out tasks. The power behind chatbots, however, doesn't come from human-like thinking. The underlying AI system is trained on a massive amount of text scraped from the internet, which allows the chatbot to mimic human speech. This wealth of information from various websites shapes the AI system's responses to users. So, which websites power these AI chatbots?
The question has long been a mystery because tech companies have been secretive about the sources of their training data. A recent investigation by The Washington Post, in collaboration with researchers from the Allen Institute for AI, has shed some light on the situation. They analyzed Google's C4 data set, which consists of 15 million websites, to identify the types of websites used to train AI systems. The data set has been used to train some high-profile English-language AIs, including Facebook's LLaMA and Google's T5. OpenAI, however, has not disclosed what data sets it uses to train ChatGPT, its popular chatbot.
What Types of Websites Fuel AI Chatbots?
The Post classified the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized; the remaining 10 million were ranked by how many “tokens” from each one appeared in the data set. Tokens are small bits of text, typically a word or phrase, used to process disorganized information.
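The ranking step described above can be illustrated with a minimal Python sketch. It is not the Post's actual method: the function name rank_domains_by_tokens and the example URLs are hypothetical, and tokens are approximated here by whitespace-separated words, whereas real language models typically use subword tokenizers.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_domains_by_tokens(documents):
    """Return (domain, token_count) pairs, most tokens first.

    `documents` is an iterable of (url, text) pairs. Counting
    whitespace-separated words is an assumption made for illustration.
    """
    counts = Counter()
    for url, text in documents:
        # Group pages by the host part of their URL.
        domain = urlparse(url).netloc.lower()
        counts[domain] += len(text.split())
    # Sort by token count (descending), then by domain name to break ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample pages; the domains are placeholders.
sample_pages = [
    ("https://example.org/a", "one two three"),
    ("https://example.org/b", "four five"),
    ("https://sample.test/x", "six"),
]
ranking = rank_domains_by_tokens(sample_pages)
print(ranking)  # [('example.org', 5), ('sample.test', 1)]
```

Sites that contribute more text rise to the top of such a ranking regardless of how many individual pages they have, which is why a few very large sites can dominate a data set.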
The data set is dominated by websites from industries including journalism, entertainment, software development, medicine, and content creation. These fields are among those most threatened by the new wave of artificial intelligence. The top three sites in the data set were patents.google.com, wikipedia.org, and scribd.com. The data set also included at least 27 sites that the US government has identified as markets for piracy and counterfeits.
Business and industrial websites constitute the largest category, representing 16 percent of categorized tokens. These websites offer investment advice and crowdfunding for creative projects, among other services. News and media sites are also well represented, and some media outlets have raised concerns over tech companies using their content without authorization or compensation. Some of the news sites in the data set rank low on NewsGuard's independent scale for trustworthiness. For instance, Russian state-backed propaganda site RT.com and far-right news and opinion site Breitbart.com were ranked 65th and 159th, respectively.
Religious websites made up about 5 percent of categorized content, with Christianity dominating that category. The highest-ranked Jewish site was jewishworldreview.com, while the top-ranked Christian site was Grace to You (gty.org, No. 164), which belongs to Grace Community Church, an evangelical megachurch in California. Anti-Muslim bias has also been reported in some language models.
Finally, technology websites were also part of the dataset, with sites such as Instructables.com, Medium.com, and Microsoft.com topping the list.
What Next for Chatbots?
As AI chatbots continue to evolve, there are concerns that their reliance on untrustworthy data could lead to the spread of misinformation. Tech companies must, therefore, ensure that their sources of training data adhere to ethical standards. Moreover, artists and creators should receive compensation or credit when their work is included in AI training data, and users should be able to trace information back to its original source.
In conclusion, AI chatbots have come a long way since their inception, and with their growing capabilities, we can expect more from them in the future.