Free Dolly: Introducing the World's First Open and Commercially Available Large Language Model
Two weeks ago, Databricks released Dolly, a large language model (LLM) trained for less than $30 to exhibit ChatGPT-like human interactivity. Today, they have released Dolly 2.0, the first open-source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
What is Dolly 2.0?
Dolly 2.0 is a 12B-parameter language model based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset crowdsourced among Databricks employees. The entirety of Dolly 2.0, including the training code, the dataset, and the model weights, is now open source and suitable for commercial use. This means that any organization can create, own, and customize powerful LLMs that can talk to people, without paying for API access or sharing data with third parties.
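For readers who want to try the model, the weights can be loaded with the Hugging Face transformers library. The following is a minimal sketch, assuming the checkpoint is published on the Hugging Face Hub as databricks/dolly-v2-12b; consult the model card for authoritative usage.

```python
import torch
from transformers import pipeline

# Load Dolly 2.0 as an instruction-following text-generation pipeline.
# Assumes the weights are hosted as "databricks/dolly-v2-12b".
generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,  # halves memory use; a 12B model still needs a large GPU
    trust_remote_code=True,      # the repo ships a custom instruction-following pipeline
    device_map="auto",           # spread layers across available devices
)

res = generate_text("Explain the difference between fine-tuning and pretraining.")
print(res[0]["generated_text"])
```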
The Dataset
The human-generated instruction dataset, called databricks-dolly-15k, contains 15,000 high-quality prompt/response pairs specifically designed for instruction tuning large language models. Under the licensing terms for databricks-dolly-15k, anyone can use, modify, or extend this dataset for any purpose, including commercial applications.
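To get a feel for the records, the dataset can be pulled down with the Hugging Face datasets library. A minimal sketch, assuming the dataset is hosted on the Hub as databricks/databricks-dolly-15k with instruction, context, response, and category fields:

```python
from datasets import load_dataset

# Download databricks-dolly-15k and inspect one record.
ds = load_dataset("databricks/databricks-dolly-15k", split="train")
print(len(ds))  # roughly 15,000 prompt/response pairs

example = ds[0]
print(example["category"])     # e.g. open_qa, summarization, brainstorming, ...
print(example["instruction"])  # the prompt written by a Databricks employee
print(example["response"])     # the human-written answer
```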
The Creation of Dolly 2.0
A critical step in creating Dolly 1.0, or any instruction-following LLM, is training the model on a dataset of instruction and response pairs. Dolly 1.0 was trained for $30 using a dataset that the Stanford Alpaca team had created with the OpenAI API. That dataset contained output from ChatGPT, and OpenAI's terms of service seek to prevent anyone from using such output to create a model that competes with OpenAI. As a result, commercial use was prohibited.
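To make "instruction and response pairs" concrete, the sketch below shows one common way such pairs are assembled into a single training string for supervised fine-tuning. This Alpaca-style template is illustrative only; the exact template Dolly uses may differ.

```python
# Illustrative Alpaca-style prompt template; not necessarily Dolly's exact format.
INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

def format_example(instruction: str, response: str) -> str:
    """Join one instruction/response pair into a single training string."""
    return f"{INTRO}\n\n### Instruction:\n{instruction}\n\n### Response:\n{response}\n\n### End"

print(format_example(
    "Who wrote 'Pride and Prejudice'?",
    "'Pride and Prejudice' was written by Jane Austen.",
))
```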
To remove this restriction, Databricks created a new dataset, databricks-dolly-15k, that is unencumbered for commercial use. They crowdsourced it among Databricks employees, setting an initial target of 13,000 demonstrations of instruction-following behavior. To keep the dataset clean, every answer had to be original, not copied from ChatGPT or anywhere else on the web. Databricks incentivized employees to contribute through a contest in which the top 20 labelers received a big award, and with nightly leaderboard gamification they passed 15,000 results within a week.
The resulting Dolly 2.0 model, based on EleutherAI's pythia-12b, exhibits high-quality instruction-following behavior. Many of the instruction-tuning datasets released in recent months contain synthesized data, which often includes hallucinations and factual errors. By contrast, databricks-dolly-15k is generated by professionals, is high quality, and contains long answers for most tasks.
Applications of Dolly 2.0
Dolly 2.0 has wide-ranging applications across the enterprise. Some examples of how Dolly 2.0 can be used for summarization and content generation (see the sketch after this list) are:
- Summarizing Databricks documentation
- Brainstorming
- Open QA
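As a rough sketch of these use cases, the snippet below reuses the loading pattern from earlier (again assuming the databricks/dolly-v2-12b model ID) and issues one illustrative prompt per task; the passage being summarized is invented for the example.

```python
import torch
from transformers import pipeline

# Same loading pattern as the earlier sketch.
generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# An invented passage, standing in for real Databricks documentation.
passage = (
    "Delta Lake is an open source storage layer that brings ACID transactions "
    "to Apache Spark and big data workloads."
)

prompts = {
    "summarization": f"Summarize the following passage in one sentence: {passage}",
    "brainstorming": "Brainstorm five blog post titles about open source language models.",
    "open QA": "What is instruction tuning and why is it useful?",
}

for task, prompt in prompts.items():
    res = generate_text(prompt)
    print(f"== {task} ==\n{res[0]['generated_text']}\n")
```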
Dolly 2.0 is an exciting step toward more open and commercially usable large language models. By crowdsourcing the dataset among its own employees, Databricks produced a high-quality, fully human-generated instruction dataset free of restrictive licensing terms. With Dolly 2.0, any organization can create, own, and customize powerful LLMs without paying for API access or sharing data with third parties.