Introduction
In April 2023, the folks at Databricks announced Dolly, a Large Language Model (LLM) that follows instructions much like OpenAI's ChatGPT. Two weeks after the initial release, Databricks shipped Dolly 2.0, the first open-source instruction-following LLM of its kind, fine-tuned on a human-generated dataset licensed for both research and commercial use.
Dolly 2.0 is built on EleutherAI's Pythia model family and boasts a whopping 12 billion parameters. What sets it apart is that it has been fine-tuned exclusively on a top-notch dataset of human-generated instructions, called databricks-dolly-15k. This dataset was carefully crafted by crowdsourcing contributions from Databricks employees, ensuring that it closely mirrors the language and style of real human communication.
Great news: Databricks has announced that it is open-sourcing Dolly 2.0, meaning the training code, dataset, and model weights are all available for commercial use. This is a huge step forward in the world of language models, as it enables companies and organizations to create their own powerful LLMs without paying for an API subscription or worrying about their data being shared with third parties. So, if you're interested in using this model, you can download it and customize it to suit your needs.
Why 2.0 all of a sudden?
A closer look at Dolly's development explains the version jump. The initial release in March 2023 (Dolly 1.0) was trained on a dataset that included ChatGPT outputs, and OpenAI's terms of service prohibit using its output to build competing models. That restriction meant Dolly 1.0 could not be offered for commercial use. To remove it, the Databricks team created a new, fully human-generated dataset that allows Dolly 2.0 to be used commercially.
What is the databricks-dolly-15k Dataset?
The databricks-dolly-15k dataset is a collection of 15,000 high-quality, human-generated prompt/response pairs specifically designed to fine-tune LLMs for instruction purposes. Even more impressive is that this dataset is open-source and released under the Creative Commons Attribution-ShareAlike 3.0 Unported License. This means anyone can use, modify, or extend the dataset for any purpose, including commercial applications.
This is the first human-generated instruction dataset open-sourced to make large language models exhibit the kind of interactivity seen in ChatGPT. More than 5,000 Databricks employees authored the dataset during March and April of 2023, so the records reflect natural, expressive language. They cover various behaviours, including brainstorming, content generation, information extraction, and summarization, making the dataset highly versatile.
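The dataset itself is easy to inspect. Here is a minimal sketch using the Hugging Face datasets library; the repo name and field names below are taken from the public dataset card:

```python
from datasets import load_dataset

# Each record holds an instruction, an optional reference context,
# a human-written response, and a task category.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

print(len(dolly))   # roughly 15,000 records
print(dolly[0])     # {'instruction': ..., 'context': ..., 'response': ..., 'category': ...}
```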
The Databricks team set up a contest to motivate their employees to participate in the dataset creation process, with the top 20 labellers receiving a substantial award. They also outlined seven specific tasks required for the dataset (each is tagged in the released dataset's category field, as shown in the sketch after this list):
The following section is copied from Databricks' official blog post, which I have linked above.
Open Q&A: For instance, "Why do people like comedy movies?" or "What is the capital of France?" In some cases, there's no correct answer; in others, it requires drawing on knowledge of the world.
Closed Q&A: These are questions that can be answered using only the information contained in a passage of reference text. For instance, given a paragraph from Wikipedia on the atom, one might ask, "What is the ratio between protons and neutrons in the nucleus?"
Extract information from Wikipedia: Here an annotator would copy a paragraph from Wikipedia and extract entities or other factual information such as weights or measurements from the passage.
Summarize information from Wikipedia: For this, annotators provided a passage from Wikipedia and were asked to distill it to a short summary.
Brainstorming: This task asked for open-ended ideation and an associated list of possible options. For instance, "What are some fun activities I can do with my friends this weekend?".
Classification: For this task, annotators were asked to make judgments about class membership (e.g. are the items in a list animals, minerals or vegetables) or to judge the properties of a short passage of text, such as the sentiment of a movie review.
Creative writing: This task would include things like writing a poem or a love letter.
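Because each record carries a category label, you can tally the task types or pull out examples of a single behaviour. A small self-contained sketch (the category value "open_qa" is taken from the dataset card; other labels follow the same naming style):

```python
from collections import Counter
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Tally how many records fall under each of the task categories above.
print(Counter(dolly["category"]))

# Pull out just the open-QA records, like the campfire example below.
open_qa = dolly.filter(lambda rec: rec["category"] == "open_qa")
print(open_qa[0]["instruction"])
```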
Example of Open QA in databricks-dolly-15k
How do I build a campfire?
Safety should always come first when starting a campfire. Ensure you have water or another way to extinguish your fire readily available. Use a campfire ring, pit, or clear a large area around your campfire location. Also, make sure campfires are permitted in your area.
With safety out of the way, all a fire needs are heat, fuel, and oxygen. Any combustible material will start a fire, but good campfire fuel sources include tinder, kindling, and dry logs.
Use stuff like dry pine needles, dry leaves, grasses, and wood shavings for tinder. You could also use newspaper (or any paper), dryer lint, or cardboard. For kindling, use smaller pieces of wood, such as twigs and sticks no thicker than 1 inch in diameter (about the thickness of your thumb).
Dry or well-seasoned logs will ensure a good burn. Stack the wood in a way that guarantees oxygen can easily reach your flame. Many prefer a teepee or log cabin stacking design. Now, use a heat source to create a flame or spark and apply this to your tinder. Once the tinder combusts, it should light the kindling, and the kindling should eventually ignite your logs. If all goes well, you should now be enjoying your campfire.
Example of Brainstorming in databricks-dolly-15k
Instead of making a peanut butter and jelly sandwich, what else could I combine peanut butter with in a sandwich? Give five ideas.
Instead of jelly, try one of the following with peanut butter in a sandwich:
1. Honey
2. Raisins
3. Fruit preserves
4. Bananas
5. Sliced apples
Dolly 2.0 generates content for a tweet
| Instruction | Response |
| --- | --- |
| Write me a tweet about the launch of Dolly 2.0, our new LLM. | Dolly 2.0 is here 🤩 |
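For reference, here is a minimal sketch of how a completion like the one above could be generated, assuming the databricks/dolly-v2-12b checkpoint published on the Hugging Face Hub and the instruct-pipeline code bundled with it:

```python
import torch
from transformers import pipeline

# trust_remote_code pulls in the instruct-pipeline code shipped with the
# checkpoint; bfloat16 and device_map="auto" keep the 12B model manageable.
generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

res = generate_text("Write me a tweet about the launch of Dolly 2.0, our new LLM.")
print(res[0]["generated_text"])
```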
How did Databricks create the new dataset?
The team at Databricks were inspired by the original InstructGPT model, which was trained on a dataset of 13,000 demonstrations of instruction-following behaviour. They decided to create their own dataset with the help of their 5,000-plus employees, hoping to achieve similar results.
Generating 13,000 original questions and answers is no easy feat: each response had to be unique, not copied from external sources such as ChatGPT or the web. Still, the team was determined to create a high-quality dataset and knew that crowdsourcing among their employees was the way to go.
To incentivize employees who were already busy with their full-time jobs, the team at Databricks devised the contest described above, providing the right motivation to contribute. The result was the databricks-dolly-15k dataset, a remarkable accomplishment that demonstrates the power of crowdsourcing in developing large language models.
Dolly 2.0 vs. ChatGPT
In contrast to Dolly 2.0, OpenAI's ChatGPT is a proprietary model that requires users to pay for API access and follow specific terms of service, which can limit flexibility and customization options for businesses and organizations. On the other hand, Meta's LLaMA, although partially open source (with restricted weights), does not permit commercial use.
It is worth noting that the weights for LLaMA were recently leaked on BitTorrent, which spawned a wave of derivatives. However, these derivatives may not be suitable for commercial use due to the restrictions on the original model. In contrast, Dolly 2.0 offers complete open-source access to its training code, dataset, and model weights, all licensed for commercial use, giving businesses and organizations greater flexibility and customization options.
How do I access Dolly?
Unlike ChatGPT, Dolly isn't available as a hosted website, but you can download the model weights from Databricks and run the model yourself.
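If you just want the raw weights on disk, one way is the huggingface_hub client. The snippet below assumes the databricks/dolly-v2-12b repo (smaller variants were also published); the generation sketch earlier shows how to load the model with transformers:

```python
from huggingface_hub import snapshot_download

# Downloads every file in the model repo (config, tokenizer, weight shards)
# to the local Hugging Face cache and returns the local directory path.
local_dir = snapshot_download(repo_id="databricks/dolly-v2-12b")
print(local_dir)
```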
What does this mean for the future of AI-Language models?
Dolly 2.0 is a game-changer in the field of AI and language models. It is the first open-source LLM that can be used commercially, and it can potentially transform how businesses use AI. What makes Dolly 2.0 unique is that it is freely accessible to anyone, whether a large corporation or a small start-up. Companies no longer have to worry about paying for access or sharing their data with third parties.
In contrast, models like ChatGPT come with specific terms of service and require payment for access, limiting the flexibility and customization options available to businesses. By making Dolly 2.0 available for commercial use, Databricks has given companies a powerful tool for holding natural, engaging conversations with customers while maintaining complete control over their data. With this development, we can expect to see new and innovative applications of AI in the business world.
But Dolly 2.0 is not just about accessibility; it is also about quality. The dataset used to train the model, databricks-dolly-15k, was designed explicitly for instruction tuning of LLMs, with 15,000 high-quality human-generated prompt/response pairs. These training records are natural, expressive, and designed to represent various behaviours, from brainstorming and content generation to information extraction and summarization.
This focused effort results in a powerful and versatile model capable of performing a wide range of tasks. And because the model is open source, it is fully customizable, allowing users to tweak and refine it to suit their specific needs.
In short, Dolly 2.0 represents a significant step forward in developing large language models. By making it accessible and open-source, Databricks has levelled the playing field and opened up a new world of possibilities for businesses and organizations. As the field of AI continues to evolve, it is clear that Dolly 2.0 will play a pivotal role in shaping the future of LLMs and driving innovation in this exciting field.
What do you think about Dolly 2.0? Let me know in the comments. I'll see you at the next one. Have a wonderful day!