Interesting to see that it's trained on data completely generated by Databricks employees. I wonder how "biased" that makes the data, and how much they spent in terms of lost man hours?
All datasets are biased, including this specific one. However, we believe it's still very valuable to open source, for a few reasons:
- This dataset is primarily used to train instruction reasoning, not knowledge. (Keep in mind that Dolly and the other well-known models have not been specifically trained for knowledge. They are all just demonstrating instruction reasoning.) The lack of a truly open source instruction dataset (available for both research and commercial use) is the primary blocker for making these LLMs available for commercial use.
- We hope this will lead to not just open source innovations in models, but also future training datasets.
- Given the international population of our employee base, it's likely more diverse than datasets created by a small number of human labelers. And it is easier to identify, discuss, and debate dataset bias in the open.
Since Databricks has employees all around the world, I expect the data is not biased. But there must surely have been considerable man-hours lost. Still, it shows how dedicated they are to open source contributions.