The Latest Open Source LLMs and Datasets - Ahead of AI #8
This month's issue focuses on organizing the latest open-source LLM (large language model) projects and datasets that have been shared in the past few weeks. In addition, it highlights insights from large training runs and shares several curated lists to help keep track of everything.
Articles & Trends
With so many research papers coming out this month, it's hard to pick a few favorites for a closer discussion. However, one paper that caught the attention of many is EleutherAI's Pythia paper. The open-source Pythia suite of LLMs is an interesting alternative to other autoregressive decoder-style models like GPT-3. The accompanying paper, Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, provides some interesting insights into the training mechanics and introduces various LLMs ranging from 70M to 12B parameters.
Here are some insights and takeaways from the Pythia paper:
- Training on duplicated data (i.e., training for >1 epoch) does not make a difference in performance.
- Training order does not influence memorization, unfortunately. If it did, we could solve undesirable memorization issues by reordering the training data.
- Pretrained term frequency does influence task performance, with few-shot accuracy tending to be higher for terms that occur more frequently.
- Doubling the batch size halves the training time but doesn't hurt convergence.
The Pythia model architecture is similar to GPT-3 but includes some improvements, such as Flash Attention (as in LLaMA) and rotary positional embeddings (as in PaLM). Pythia was trained on the Pile dataset (an 800GB dataset of diverse texts) for 300B tokens (~1 epoch on the regular Pile, ~1.5 epochs on the deduplicated Pile).
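For readers who want to experiment with the models directly, below is a minimal sketch of loading one of the smaller Pythia checkpoints for text generation via the Hugging Face transformers library. It assumes the checkpoints are hosted on the Hugging Face Hub under the EleutherAI organization (e.g., "EleutherAI/pythia-70m") and that the transformers and torch packages are installed.

```python
# Minimal sketch: load a small Pythia checkpoint and generate a short continuation.
# Assumes the checkpoint is available on the Hugging Face Hub as "EleutherAI/pythia-70m";
# larger sizes (up to 12B parameters) follow the same naming pattern.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Encode a prompt and sample a short continuation
inputs = tokenizer("Open-source LLMs are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Swapping the model name for a larger checkpoint is the only change needed to scale up, although the 12B model requires substantially more memory.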
Last month, we saw several open-source implementations of large language models (LLMs). Now we are witnessing a new wave of open-source datasets, which is particularly commendable: data collection and cleaning make up roughly 90% of a real-world machine learning project, yet few people enjoy doing this work.
Open Source Data
The RedPajama dataset is an open-source dataset for pretraining LLMs, similar to the data used to train Meta's state-of-the-art LLaMA model. The bulk of the dataset consists of CommonCrawl data filtered for English-language websites, while the Wikipedia portion covers 20 different languages. Databricks-dolly-15k is a dataset for LLM finetuning that features 15,000+ instruction pairs written by thousands of Databricks employees, similar to those used to train systems like InstructGPT and ChatGPT. OpenAssistant Conversations is another dataset for finetuning pretrained LLMs on a collection of ChatGPT-assistant-like dialogues created and annotated by humans, encompassing 161,443 messages in 35 different languages, along with 461,292 quality ratings.
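To get a feel for the two finetuning datasets, here is a minimal sketch using the Hugging Face datasets library. The dataset identifiers and field names below ("databricks/databricks-dolly-15k" and "OpenAssistant/oasst1") reflect how these datasets are commonly published on the Hugging Face Hub and may change over time; RedPajama is also available on the Hub but is far too large (on the order of terabytes) to download casually this way.

```python
# Minimal sketch: inspect the finetuning datasets mentioned above.
# Dataset identifiers and field names are based on their Hugging Face Hub releases.
from datasets import load_dataset

# Databricks Dolly 15k: human-written instruction/response pairs
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly[0]["instruction"])
print(dolly[0]["response"])

# OpenAssistant Conversations: human-created, human-annotated dialogue messages
oasst = load_dataset("OpenAssistant/oasst1", split="train")
print(oasst[0]["text"], oasst[0]["lang"])
```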
Research Highlights in Two Sentences
The EleutherAI team has created a new suite of open-source LLMs called Pythia that includes models ranging from 70M to 12B parameters. The accompanying paper provides insights into the training mechanics of these LLMs.
Open Source Highlights
There are new deep learning fundamentals units on computer vision and large language models, including topics such as Advanced Computer Vision, CNN Architectures, and Transfer Learning.
Notable Quote
With the explosion of OpenAI's GPT-3 comes a surge of excitement, curiosity, and inevitable fear. It is also not without its controversies: its license, its scaling computations, and its potential use cases. The boundaries of LLMs are unclear and arguably limitless.
Upcoming Events
The following events focus on machine learning and artificial intelligence:
- The AI for Good Global Summit - July 6-7, 2023
- ICML 2023 - July 23-29, 2023
- NeurIPS 2023 - December 10-16, 2023
In conclusion, this article covered the latest open-source LLMs and datasets, along with trends and research insights from large training runs. It also provided an overview of upcoming events focused on AI and machine learning.